Improving AI’s ability to follow complex human instructions


A paper on vision-and-language navigation by a team of researchers including CDS PhD student Aishwarya Kamath has been accepted at CVPR 2023.


Training robots to follow natural-language instructions from humans is a significant long-term challenge in artificial intelligence research. To explore this question, the field of Vision-and-Language Navigation (VLN) has emerged, training models to navigate photorealistic environments. These models still face limitations, however, owing to a lack of diversity in training environments and the scarcity of human-written instruction data. In a recent paper, “A New Path: Scaling Vision-and-Language Navigation with Synthetic Instructions and Imitation Learning,” the authors present a new dataset, two orders of magnitude larger than existing human-annotated datasets, whose synthetic VLN instructions approach the quality of human-written ones.

Led by CDS PhD student Aishwarya Kamath, the paper was accepted at the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), which will be held at the Vancouver Convention Center from June 18 through 22. In addition to Kamath, the team includes Peter Anderson (Senior Research Scientist, Google Research), Su Wang (Software Engineer, Google AI Language), Jing Yu Koh (PhD student, Carnegie Mellon University), Alexander Ku (Software Engineer, Google Research), Austin Waters (Research Scientist, Google Research), Yinfei Yang (Research Scientist, Apple AI/ML), Jason Baldridge (Research Scientist, Google Research), and Zarana Parekh (Software Engineer, Google AI).

While pre-training transformer models on generic image-text data has been explored with limited success, in this work the authors pre-train on diverse, in-domain instruction-following data. The researchers use more than 500 indoor environments captured as 360-degree panoramas to build a novel dataset: they construct navigation trajectories from panoramas in the previously unexplored Gibson dataset and generate visually grounded instructions for each trajectory with Marky, a multilingual navigation-instruction generator. By drawing on this diverse set of environments and augmenting them with novel viewpoints, the authors train a pure imitation learning agent that scales efficiently, making it possible to exploit the full dataset. The resulting agent outperforms existing reinforcement-learning agents on the challenging VLN benchmark Room-across-Room (RxR).
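The “pure imitation learning” approach described above can be illustrated with a minimal behavioral-cloning sketch: an agent is trained to reproduce an expert’s actions along instruction-paired trajectories, with no reinforcement-learning reward signal. Everything below (the feature sizes, the toy expert policy, the linear model) is a hypothetical stand-in for illustration, not the authors’ implementation:

```python
# Minimal behavioral-cloning sketch (illustrative only, not the paper's code).
# A policy is fit to expert demonstrations by cross-entropy on the expert's
# chosen actions -- the core idea of imitation learning for VLN agents.
import math
import random

random.seed(0)

N_FEATURES = 8   # toy stand-in for fused instruction + panorama features
N_ACTIONS = 4    # e.g. forward / left / right / stop

def expert_action(x):
    # Hypothetical expert policy: choose the action whose feature block
    # (a strided slice of x) has the largest sum.
    return max(range(N_ACTIONS), key=lambda a: sum(x[a::N_ACTIONS]))

# Synthetic "demonstrations": (feature vector, expert action) pairs.
demos = []
for _ in range(200):
    x = [random.gauss(0, 1) for _ in range(N_FEATURES)]
    demos.append((x, expert_action(x)))

# Linear policy: one weight row of logits per action.
W = [[0.0] * N_FEATURES for _ in range(N_ACTIONS)]

def predict(x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

# Imitation learning = supervised training on the expert's actions.
lr = 0.1
for epoch in range(50):
    for x, a in demos:
        probs = softmax(predict(x))
        for k in range(N_ACTIONS):
            # Gradient of cross-entropy w.r.t. logit k: p_k - one_hot(a)_k
            grad = probs[k] - (1.0 if k == a else 0.0)
            for j in range(N_FEATURES):
                W[k][j] -= lr * grad * x[j]

# Fraction of demonstrations on which the policy now matches the expert.
correct = sum(max(range(N_ACTIONS), key=lambda k: predict(x)[k]) == a
              for x, a in demos)
accuracy = correct / len(demos)
```

The sketch omits everything that makes the real problem hard at scale (panoramic vision encoders, language grounding, sequential decision-making), but it captures the design choice the paper highlights: treating navigation as supervised prediction of expert actions rather than reward-driven exploration.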

“This project was of interest to me because I was curious to know whether generic architectures and training strategies can achieve competitive performance on the VLN task, opening up the possibility to have one model for all vision and language tasks, including embodied ones like VLN,” said Kamath. “We found that scaling up in-domain data is essential and coupling this with an imitation learning agent can achieve state-of-the-art results without any reinforcement learning.”

The study opens new avenues to improve AI’s capacity to follow complex human instructions. “This result paves a new path towards improving instruction-following agents, emphasizing large-scale imitation learning with generic architectures, along with a focus on developing synthetic instruction generation capabilities — which are shown to directly improve instruction-following performance,” write the authors.

By Meryl Phair



NYU Center for Data Science
