Embodied large language models enable robots to complete complex tasks in unpredictable environments

Intelligence research should not be held back by its past. Nature 545, 385–386 (2017).
Friston, K. Embodied inference and spatial cognition. Cogn. Process. 13, 497–514 (2012).
Wilson, M. Six views of embodied cognition. Psychon. Bull. Rev. 9, 625–636 (2002).
Clark, A. An embodied cognitive science. Trends Cogn. Sci. 3, 345–351 (1999).
Stella, F., Della Santina, C. & Hughes, J. How can LLMs transform the robotic design process? Nat. Mach. Intell. 5, 561–564 (2023).
Miriyev, A. & Kovac, M. Skills for physical artificial intelligence. Nat. Mach. Intell. 2, 658–660 (2020).
Cui, J. & Trinkle, J. Toward next-generation learned robot manipulation. Sci. Robot. 6, eabd9461 (2021).
Arents, J. & Greitans, M. Smart industrial robot control trends, challenges and opportunities within manufacturing. Appl. Sci. 12, 937 (2022).
Billard, A. & Kragic, D. Trends and challenges in robot manipulation. Science 364, eaat8414 (2019).
Yang, G.-Z. et al. The grand challenges of Science Robotics. Sci. Robot. 3, eaar7650 (2018).
Buchanan, R., Röfer, A., Moura, J., Valada, A. & Vijayakumar, S. Online estimation of articulated objects with factor graphs using vision and proprioceptive sensing. In 2024 IEEE International Conference on Robotics and Automation (ICRA) 16111–16117 (IEEE, 2024).
Nikolaidis, S., Ramakrishnan, R., Gu, K. & Shah, J. Efficient model learning from joint-action demonstrations for human-robot collaborative tasks. In 2015 10th ACM/IEEE International Conference on Human-Robot Interaction (HRI) 189–196 (IEEE, 2015).
Saveriano, M., Abu-Dakka, F. J., Kramberger, A. & Peternel, L. Dynamic movement primitives in robotics: a tutorial survey. Int. J. Robot. Res. 42, 1133–1184 (2023).
Kober, J. et al. Movement templates for learning of hitting and batting. In 2010 IEEE International Conference on Robotics and Automation 853–858 (IEEE, 2010).
Huang, W. et al. VoxPoser: composable 3D value maps for robotic manipulation with language models. In Proc. 7th Conference on Robot Learning 540–562 (PMLR, 2023).
Zhang, D. et al. Explainable hierarchical imitation learning for robotic drink pouring. IEEE Trans. Autom. Sci. Eng. 19, 3871–3887 (2022).
Hussein, A., Gaber, M. M., Elyan, E. & Jayne, C. Imitation learning: a survey of learning methods. ACM Comput. Surv. 50, 21:1–21:35 (2017).
Di Palo, N. & Johns, E. DINOBot: robot manipulation via retrieval and alignment with vision foundation models. In 2024 IEEE International Conference on Robotics and Automation (ICRA) 2798–2805 (IEEE, 2024).
Shridhar, M., Manuelli, L. & Fox, D. CLIPort: what and where pathways for robotic manipulation. In Proc. 5th Conference on Robot Learning 894–906 (PMLR, 2022).
Shridhar, M., Manuelli, L. & Fox, D. Perceiver-Actor: a multi-task transformer for robotic manipulation. In Proc. 6th Conference on Robot Learning 785–799 (PMLR, 2023).
Mees, O., Hermann, L. & Burgard, W. What matters in language conditioned robotic imitation learning over unstructured data. IEEE Robot. Autom. Lett. 7, 11205–11212 (2022).
Mees, O., Borja-Diaz, J. & Burgard, W. Grounding language with visual affordances over unstructured data. In 2023 IEEE International Conference on Robotics and Automation (ICRA) 11576–11582 (IEEE, 2023).
Shao, L., Migimatsu, T., Zhang, Q., Yang, K. & Bohg, J. Concept2Robot: learning manipulation concepts from instructions and human demonstrations. Int. J. Robot. Res. 40, 1419–1434 (2021).
Ichter, B. et al. Do as I can, not as I say: grounding language in robotic affordances. In Proc. 6th Conference on Robot Learning 287–318 (PMLR, 2023).
Driess, D. et al. PaLM-E: an embodied multimodal language model. In Proc. 40th International Conference on Machine Learning 8469–8488 (PMLR, 2023).
Peng, A. et al. Preference-conditioned language-guided abstraction. In Proc. 2024 ACM/IEEE International Conference on Human-Robot Interaction, HRI ’24 572–581 (Association for Computing Machinery, 2024).
Huang, W., Abbeel, P., Pathak, D. & Mordatch, I. Language models as zero-shot planners: extracting actionable knowledge for embodied agents. In Proc. 39th International Conference on Machine Learning 9118–9147 (PMLR, 2022).
Huang, J. & Chang, K. C.-C. Towards reasoning in large language models: a survey. In Findings of the Association for Computational Linguistics: ACL 2023 1049–1065 (Association for Computational Linguistics, 2023).
Zitkovich, B. et al. RT-2: vision-language-action models transfer web knowledge to robotic control. In Proc. 7th Conference on Robot Learning 2165–2183 (PMLR, 2023).
Ma, X., Patidar, S., Haughton, I. & James, S. Hierarchical diffusion policy for kinematics-aware multi-task robotic manipulation. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) 18081–18090 (IEEE, 2024).
Zhang, C., Chen, J., Li, J., Peng, Y. & Mao, Z. Large language models for human-robot interaction: a review. Biomimetic Intell. Robot. 3, 100131 (2023).
Lewis, P. et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems 9459–9474 (Curran Associates, 2020).
Raiaan, M. et al. A review on large language models: architectures, applications, taxonomies, open issues and challenges. IEEE Access 12, 26839–26874 (2024).
Rozo, L., Jimenez, P. & Torras, C. Force-based robot learning of pouring skills using parametric hidden Markov models. In 9th International Workshop on Robot Motion and Control 227–232 (IEEE, 2013).
Huang, Y., Wilches, J. & Sun, Y. Robot gaining accurate pouring skills through self-supervised learning and generalization. Robot. Auton. Syst. 136, 103692 (2021).
Mon-Williams, R., Stouraitis, T. & Vijayakumar, S. A behavioural transformer for effective collaboration between a robot and a non-stationary human. In 2023 32nd IEEE International Conference on Robot and Human Interactive Communication (RO-MAN) 1150–1157 (IEEE, 2023).
Belkhale, S., Cui, Y. & Sadigh, D. Data quality in imitation learning. In Advances in Neural Information Processing Systems 80375–80395 (Curran Associates, 2024).
Khazatsky, A. et al. DROID: a large-scale in-the-wild robot manipulation dataset. In Robotics: Science and Systems (2024).
Acosta, B., Yang, W. & Posa, M. Validating robotics simulators on real-world impacts. IEEE Robot. Autom. Lett. 7, 6471–6478 (2022).
Alomar, A. et al. CausalSim: a causal framework for unbiased trace-driven simulation. In 20th USENIX Symposium on Networked Systems Design and Implementation (NSDI 23) 1115–1147 (USENIX Association, 2023).
Choi, H. et al. On the use of simulation in robotics: opportunities, challenges, and suggestions for moving forward. Proc. Natl Acad. Sci. USA 118, e1907856118 (2021).
Del Aguila Ferrandis, J., Moura, J. & Vijayakumar, S. Nonprehensile planar manipulation through reinforcement learning with multimodal categorical exploration. In 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 5606–5613 (IEEE, 2023).
Kirk, R., Zhang, A., Grefenstette, E. & Rocktäschel, T. A survey of zero-shot generalisation in deep reinforcement learning. J. Artif. Intell. Res. 76, 201–264 (2023).
Dai, T. et al. Analysing deep reinforcement learning agents trained with domain randomisation. Neurocomputing 493, 143–165 (2022).
Chang, J., Uehara, M., Sreenivas, D., Kidambi, R. & Sun, W. Mitigating covariate shift in imitation learning via offline data with partial coverage. In Advances in Neural Information Processing Systems 965–979 (Curran Associates, 2021).
Huang, W. et al. Inner monologue: embodied reasoning through planning with language models. In Proc. 6th Conference on Robot Learning 1769–1782 (PMLR, 2023).
Nair, S., Rajeswaran, A., Kumar, V., Finn, C. & Gupta, A. R3M: a universal visual representation for robot manipulation. In Proc. 6th Conference on Robot Learning 892–909 (PMLR, 2023).
Singh, I. et al. ProgPrompt: generating situated robot task plans using large language models. In 2023 IEEE International Conference on Robotics and Automation (ICRA) 11523–11530 (IEEE, 2023).
Song, C. H. et al. LLM-Planner: few-shot grounded planning for embodied agents with large language models. In Proc. IEEE/CVF International Conference on Computer Vision (ICCV) 2998–3009 (IEEE, 2023).
Vemprala, S. H., Bonatti, R., Bucker, A. & Kapoor, A. ChatGPT for robotics: design principles and model abilities. IEEE Access 12, 55682–55696 (2024).
Ding, Y., Zhang, X., Paxton, C. & Zhang, S. Task and motion planning with large language models for object rearrangement. In 2023 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2086–2092 (IEEE, 2023).
Kwon, M. et al. Toward grounded commonsense reasoning. In 2024 IEEE International Conference on Robotics and Automation (ICRA) 5463–5470 (IEEE, 2024).
Hong, J., Levine, S. & Dragan, A. Learning to influence human behavior with offline reinforcement learning. In Advances in Neural Information Processing Systems 36094–36105 (Curran Associates, 2024).
OpenAI. GPT-4 technical report. Preprint at https://arxiv.org/abs/2303.08774 (2024).
OpenAI. Custom models program: fine-tuning GPT-4 for specific domains (2023); https://platform.openai.com/docs/guides/fine-tuning/
Pietsch, M. et al. Haystack: the end-to-end NLP framework for pragmatic builders. GitHub https://github.com/deepset-ai/haystack (2019).
Weaviate. Verba: the golden RAGtriever. GitHub https://github.com/weaviate/Verba (2023).
Kirillov, A. et al. Segment anything. In Proc. IEEE/CVF International Conference on Computer Vision (ICCV) 4015–4026 (IEEE, 2023).
Ramesh, A. et al. Zero-shot text-to-image generation. In Proc. 38th International Conference on Machine Learning 8821–8831 (PMLR, 2021).
Zeng, A. et al. Socratic models: composing zero-shot multimodal reasoning with language. In Proc. International Conference on Learning Representations (ICLR, 2023).
Cui, Y. et al. No, to the right: online language corrections for robotic manipulation via shared autonomy. In Proc. 2023 ACM/IEEE International Conference on Human-Robot Interaction, HRI ’23 93–101 (Association for Computing Machinery, 2023).
Bengio, Y. et al. Managing extreme AI risks amid rapid progress. Science 384, 842–845 (2024).
Li, G., Jampani, V., Sun, D. & Sevilla-Lara, L. LOCATE: localize and transfer object parts for weakly supervised affordance grounding. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 10922–10931 (IEEE, 2023).
Li, G., Sun, D., Sevilla-Lara, L. & Jampani, V. One-shot open affordance learning with foundation models. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition 3086–3096 (IEEE, 2024).
Liang, J. et al. Code as policies: language model programs for embodied control. In 2023 IEEE International Conference on Robotics and Automation (ICRA) 9493–9500 (IEEE, 2023).
Hong, S. & Kim, H. An integrated GPU power and performance model. In Proc. 37th Annual International Symposium on Computer Architecture 280–289 (Association for Computing Machinery, 2010).
Kinova Robotics. Kinova Gen3 Ultra-Lightweight Robotic Arm User Guide (2023); https://assets.iqr-robot.com/wp-content/uploads/2023/08/20230814163651088831.pdf
US Environmental Protection Agency. GHG emission factors hub (2024); https://www.epa.gov/climateleadership/ghg-emission-factors-hub
Liu, S. et al. Grounding DINO: marrying DINO with grounded pre-training for open-set object detection. In European Conference on Computer Vision (eds Leonardis, A. et al.) Vol. 15105 (Springer, 2024).
ruaridhmon. ruaridhmon/ELLMER: v1.0.0: Initial Release. Zenodo (2024).