Embodied large language models enable robots to complete complex tasks in unpredictable environments

  • Intelligence research should not be held back by its past. Nature 545, 385–386 (2017).

  • Friston, K. Embodied inference and spatial cognition. Cogn. Process. 13, 497–514 (2012).

  • Wilson, M. Six views of embodied cognition. Psychon. Bull. Rev. 9, 625–636 (2002).

  • Clark, A. An embodied cognitive science. Trends Cogn. Sci. 3, 345–351 (1999).

  • Stella, F., Della Santina, C. & Hughes, J. How can LLMs transform the robotic design process? Nat. Mach. Intell. 5, 561–564 (2023).

  • Miriyev, A. & Kovac, M. Skills for physical artificial intelligence. Nat. Mach. Intell. 2, 658–660 (2020).

  • Cui, J. & Trinkle, J. Toward next-generation learned robot manipulation. Sci. Robot. 6, eabd9461 (2021).

  • Arents, J. & Greitans, M. Smart industrial robot control trends, challenges and opportunities within manufacturing. Appl. Sci. 12, 937 (2022).

  • Billard, A. & Kragic, D. Trends and challenges in robot manipulation. Science 364, eaat8414 (2019).

  • Yang, G.-Z. et al. The grand challenges of Science Robotics. Sci. Robot. 3, eaar7650 (2018).

  • Buchanan, R., Röfer, A., Moura, J., Valada, A. & Vijayakumar, S. Online estimation of articulated objects with factor graphs using vision and proprioceptive sensing. In 2024 IEEE International Conference on Robotics and Automation (ICRA) 16111–16117 (IEEE, 2024).

  • Nikolaidis, S., Ramakrishnan, R., Gu, K. & Shah, J. Efficient model learning from joint-action demonstrations for human-robot collaborative tasks. In 2015 10th ACM/IEEE International Conference on Human-Robot Interaction (HRI) 189–196 (IEEE, 2015).

  • Saveriano, M., Abu-Dakka, F. J., Kramberger, A. & Peternel, L. Dynamic movement primitives in robotics: a tutorial survey. Int. J. Robot. Res. 42, 1133–1184 (2023).

  • Kober, J. et al. Movement templates for learning of hitting and batting. In 2010 IEEE International Conference on Robotics and Automation 853–858 (IEEE, 2010).

  • Huang, W. et al. VoxPoser: composable 3D value maps for robotic manipulation with language models. In Proc. 7th Conference on Robot Learning 540–562 (PMLR, 2023).

  • Zhang, D. et al. Explainable hierarchical imitation learning for robotic drink pouring. IEEE Trans. Autom. Sci. Eng. 19, 3871–3887 (2022).

  • Hussein, A., Gaber, M. M., Elyan, E. & Jayne, C. Imitation learning: a survey of learning methods. ACM Comput. Surv. 50, 21:1–21:35 (2017).

  • Di Palo, N. & Johns, E. DINOBot: robot manipulation via retrieval and alignment with vision foundation models. In International Conference on Robotics and Automation (ICRA) 2798–2805 (IEEE, 2024).

  • Shridhar, M., Manuelli, L. & Fox, D. CLIPort: what and where pathways for robotic manipulation. In Proc. 5th Conference on Robot Learning 894–906 (PMLR, 2022).

  • Shridhar, M., Manuelli, L. & Fox, D. Perceiver-Actor: a multi-task transformer for robotic manipulation. In Proc. 6th Conference on Robot Learning 785–799 (PMLR, 2023).

  • Mees, O., Hermann, L. & Burgard, W. What matters in language conditioned robotic imitation learning over unstructured data. IEEE Robot. Autom. Lett. 7, 11205–11212 (2022).

  • Mees, O., Borja-Diaz, J. & Burgard, W. Grounding language with visual affordances over unstructured data. In 2023 IEEE International Conference on Robotics and Automation (ICRA) 11576–11582 (IEEE, 2023).

  • Shao, L., Migimatsu, T., Zhang, Q., Yang, K. & Bohg, J. Concept2Robot: learning manipulation concepts from instructions and human demonstrations. Int. J. Robot. Res. 40, 1419–1434 (2021).

  • Ichter, B. et al. Do as I can, not as I say: grounding language in robotic affordances. In Proc. 6th Conference on Robot Learning 287–318 (PMLR, 2023).

  • Driess, D. et al. PaLM-E: an embodied multimodal language model. In Proc. 40th International Conference on Machine Learning 8469–8488 (PMLR, 2023).

  • Peng, A. et al. Preference-conditioned language-guided abstraction. In Proc. 2024 ACM/IEEE International Conference on Human-Robot Interaction, HRI ’24 572–581 (Association for Computing Machinery, 2024).

  • Huang, W., Abbeel, P., Pathak, D. & Mordatch, I. Language models as zero-shot planners: extracting actionable knowledge for embodied agents. In Proc. 39th International Conference on Machine Learning 9118–9147 (PMLR, 2022).

  • Huang, J. & Chang, K. C.-C. Towards reasoning in large language models: a survey. In Findings of the Association for Computational Linguistics: ACL 2023 1049–1065 (Association for Computational Linguistics, 2023).

  • Zitkovich, B. et al. RT-2: vision-language-action models transfer web knowledge to robotic control. In Proc. 7th Conference on Robot Learning 2165–2183 (PMLR, 2023).

  • Ma, X., Patidar, S., Haughton, I. & James, S. Hierarchical diffusion policy for kinematics-aware multi-task robotic manipulation. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 18081–18090 (IEEE, 2024).

  • Zhang, C., Chen, J., Li, J., Peng, Y. & Mao, Z. Large language models for human-robot interaction: a review. Biomimetic Intell. Robot. 3, 100131 (2023).

  • Lewis, P. et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems 9459–9474 (Curran Associates, 2020).

  • Raiaan, M. et al. A review on large language models: architectures, applications, taxonomies, open issues and challenges. IEEE Access 12, 26839–26874 (2024).

  • Rozo, L., Jimenez, P. & Torras, C. Force-based robot learning of pouring skills using parametric hidden Markov models. In 9th International Workshop on Robot Motion and Control 227–232 (IEEE, 2013).

  • Huang, Y., Wilches, J. & Sun, Y. Robot gaining accurate pouring skills through self-supervised learning and generalization. Robot. Auton. Syst. 136, 103692 (2021).

  • Mon-Williams, R., Stouraitis, T. & Vijayakumar, S. A behavioural transformer for effective collaboration between a robot and a non-stationary human. In 2023 32nd IEEE International Conference on Robot and Human Interactive Communication (RO-MAN) 1150–1157 (IEEE, 2023).

  • Belkhale, S., Cui, Y. & Sadigh, D. Data quality in imitation learning. In Advances in Neural Information Processing Systems (NeurIPS) 80375–80395 (Curran Associates, 2024).

  • Khazatsky, A. et al. DROID: a large-scale in-the-wild robot manipulation dataset. In Proc. Robotics: Science and Systems (2024).

  • Acosta, B., Yang, W. & Posa, M. Validating robotics simulators on real-world impacts. IEEE Robot. Autom. Lett. 7, 6471–6478 (2022).

  • Alomar, A. et al. CausalSim: a causal framework for unbiased trace-driven simulation. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23) 1115–1147 (USENIX Association, 2023).

  • Choi, H. et al. On the use of simulation in robotics: opportunities, challenges, and suggestions for moving forward. Proc. Natl Acad. Sci. USA 118, e1907856118 (2021).

  • Del Aguila Ferrandis, J., Moura, J. & Vijayakumar, S. Nonprehensile planar manipulation through reinforcement learning with multimodal categorical exploration. In 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 5606–5613 (IEEE, 2023).

  • Kirk, R., Zhang, A., Grefenstette, E. & Rocktäschel, T. A survey of zero-shot generalisation in deep reinforcement learning. J. Artif. Intell. Res. 76, 201–264 (2023).

  • Dai, T. et al. Analysing deep reinforcement learning agents trained with domain randomisation. Neurocomputing 493, 143–165 (2022).

  • Chang, J., Uehara, M., Sreenivas, D., Kidambi, R. & Sun, W. Mitigating covariate shift in imitation learning via offline data with partial coverage. In Advances in Neural Information Processing Systems 965–979 (Curran Associates, 2021).

  • Huang, W. et al. Inner monologue: embodied reasoning through planning with language models. In Proc. 6th Conference on Robot Learning 1769–1782 (PMLR, 2023).

  • Nair, S., Rajeswaran, A., Kumar, V., Finn, C. & Gupta, A. R3M: a universal visual representation for robot manipulation. In Proc. 6th Conference on Robot Learning 892–909 (PMLR, 2023).

  • Singh, I. et al. ProgPrompt: generating situated robot task plans using large language models. In Proc. IEEE/CVF International Conference on Robotics and Automation (ICRA) 11523–11530 (IEEE, 2023).

  • Song, C. H. et al. LLM-Planner: few-shot grounded planning for embodied agents with large language models. In Proc. IEEE/CVF International Conference on Computer Vision (ICCV) 2998–3009 (IEEE, 2023).

  • Vemprala, S. H., Bonatti, R., Bucker, A. & Kapoor, A. ChatGPT for robotics: design principles and model abilities. IEEE Access 12, 55682–55696 (2024).

  • Ding, Y., Zhang, X., Paxton, C. & Zhang, S. Task and motion planning with large language models for object rearrangement. In 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2086–2092 (IEEE, 2023).

  • Kwon, M. et al. Toward grounded commonsense reasoning. In Proc. International Conference on Robotics and Automation (ICRA) 5463–5470 (IEEE, 2024).

  • Hong, J., Levine, S. & Dragan, A. Learning to influence human behavior with offline reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS) 36094–36105 (Curran Associates, 2024).

  • OpenAI. GPT-4 technical report. Preprint at https://arxiv.org/abs/2303.08774 (2024).

  • OpenAI. Custom models program: fine-tuning GPT-4 for specific domains (2023); https://platform.openai.com/docs/guides/fine-tuning/

  • Pietsch, M. et al. Haystack: the end-to-end NLP framework for pragmatic builders. GitHub https://github.com/deepset-ai/haystack (2019).

  • Weaviate. Verba: the golden RAGtriever. GitHub https://github.com/weaviate/Verba (2023).

  • Kirillov, A. et al. Segment anything. In Proc. IEEE/CVF International Conference on Computer Vision (ICCV) 4015–4026 (IEEE, 2023).

  • Ramesh, A. et al. Zero-shot text-to-image generation. In Proc. 38th International Conference on Machine Learning 8821–8831 (PMLR, 2021).

  • Zeng, A. et al. Socratic models: composing zero-shot multimodal reasoning with language. In Proc. International Conference on Learning Representations (ICLR, 2023).

  • Cui, Y. et al. No, to the right: online language corrections for robotic manipulation via shared autonomy. In Proc. 2023 ACM/IEEE International Conference on Human-Robot Interaction, HRI ’23 93–101 (Association for Computing Machinery, 2023).

  • Bengio, Y. et al. Managing extreme AI risks amid rapid progress. Science 384, 842–845 (2024).

  • Li, G., Jampani, V., Sun, D. & Sevilla-Lara, L. LOCATE: localize and transfer object parts for weakly supervised affordance grounding. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 10922–10931 (IEEE, 2023).

  • Li, G., Sun, D., Sevilla-Lara, L. & Jampani, V. One-shot open affordance learning with foundation models. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 3086–3096 (IEEE, 2024).

  • Liang, J. et al. Code as policies: language model programs for embodied control. In 2023 IEEE International Conference on Robotics and Automation (ICRA) 9493–9500 (IEEE, 2023).

  • Hong, S. & Kim, H. An integrated GPU power and performance model. In Proc. 37th Annual International Symposium on Computer Architecture 280–289 (Association for Computing Machinery, 2010).

  • Kinova Robotics. Kinova Gen3 Ultra-Lightweight Robotic Arm User Guide (2023); https://assets.iqr-robot.com/wp-content/uploads/2023/08/20230814163651088831.pdf

  • US Environmental Protection Agency. GHG emission factors hub (2024); https://www.epa.gov/climateleadership/ghg-emission-factors-hub

  • Liu, S. et al. Grounding DINO: marrying DINO with grounded pre-training for open-set object detection. In 2024 European Conference on Computer Vision (eds Leonardis, A. et al.) Vol. 15105 (Springer, 2024).

  • ruaridhmon. ruaridhmon/ELLMER: v1.0.0: Initial Release. Zenodo (2024).
