MIT's Improbable AI Lab, a division of the Computer Science and Artificial Intelligence Laboratory (CSAIL), has introduced a multimodal framework named Compositional Foundation Models for Hierarchical Planning (HiP). The framework draws on three distinct foundation models, akin to the one behind OpenAI's GPT-4, to give robots the ability to devise and execute complex plans transparently. Described in a paper on the arXiv preprint server, HiP tackles the intricate decision-making involved in robot planning by leveraging models trained on vast datasets for tasks ranging from image generation to language translation and robotics.
Unlike existing multimodal models such as RT-2, HiP employs three separate foundation models, each trained on a different data modality, eliminating the need for paired vision, language, and action data. Because each model reasons in its own modality, the intermediate steps of the planning process remain easy to inspect. HiP thus sidesteps the expensive task of collecting paired language, visual, and action data, offering a cost-effective and transparent way to give robots linguistic, physical, and environmental knowledge.

HiP's potential applications extend beyond daily chores to multistep construction and manufacturing tasks. In the CSAIL team's tests, HiP outperformed comparable frameworks on manipulation tasks and adapted to new information on the fly, for example adjusting its plan mid-task while stacking blocks of different colors or arranging objects in a specified sequence.
HiP's three-pronged planning process operates hierarchically, combining a large language model (LLM), a video diffusion model, and an egocentric action model. The LLM begins by sketching abstract task plans, drawing on common-sense knowledge absorbed from internet text. The video diffusion model then grounds that plan with geometric and physical information learned from online footage. Finally, the egocentric action model executes the refined plan by mapping it onto the robot's visual space.

NVIDIA AI researcher Jim Fan commended HiP for decomposing the complex problem of embodied agent planning, making the decision-making process more tractable and transparent. The team envisions deploying HiP in real-world settings such as homes, factories, and construction sites. Because HiP relies on existing pre-trained models, it offers an efficient route to robotic decision-making.
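To make that division of labor concrete, here is a minimal sketch of how three independently trained models could be chained into a language-to-video-to-action pipeline. Every class, method, and string below is an invented placeholder, not the authors' actual interfaces or models, and the "models" are stubbed with canned outputs purely to show the compositional structure.

```python
"""Illustrative sketch of a HiP-style hierarchical planner.

All names here are hypothetical stand-ins; real foundation models would
replace each stub at inference time.
"""
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Observation:
    """Current egocentric view of the scene (stubbed as a text description)."""
    description: str


class TaskPlanner:
    """Stand-in for the large language model that drafts abstract subgoals."""

    def propose_subgoals(self, goal: str) -> list[str]:
        # A real system would prompt an LLM; here we return a fixed decomposition.
        return [f"locate objects for '{goal}'",
                f"manipulate objects for '{goal}'",
                "verify the result"]


class VideoPlanner:
    """Stand-in for the video diffusion model that grounds each subgoal
    in geometric and physical detail (e.g., an imagined visual trajectory)."""

    def imagine_trajectory(self, subgoal: str, obs: Observation) -> str:
        return f"imagined frames for '{subgoal}' given scene '{obs.description}'"


class ActionModel:
    """Stand-in for the egocentric action model that maps an imagined
    trajectory onto executable robot commands."""

    def to_actions(self, trajectory: str) -> list[str]:
        return [f"execute step derived from ({trajectory})"]


def hierarchical_plan(goal: str, obs: Observation) -> list[str]:
    """Compose the three models hierarchically: language -> video -> action."""
    llm, video, actor = TaskPlanner(), VideoPlanner(), ActionModel()
    actions: list[str] = []
    for subgoal in llm.propose_subgoals(goal):
        trajectory = video.imagine_trajectory(subgoal, obs)
        actions.extend(actor.to_actions(trajectory))
    return actions


if __name__ == "__main__":
    scene = Observation(description="red, green and blue blocks on a table")
    for action in hierarchical_plan("stack the blocks by color", scene):
        print(action)
```

In the real system, each stage would be a large pre-trained model queried at inference time rather than a stub, which is precisely what lets HiP avoid training on paired vision, language, and action data.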
The CSAIL team acknowledges that HiP is currently limited by the lack of high-quality video foundation models, but remains optimistic: once such models become available, they could improve HiP's visual sequence prediction and robot action generation, further expanding its ability to tackle long-horizon robotics tasks. As a proof of concept, HiP demonstrates the potential of combining models trained on separate tasks and data modalities for effective robotic planning, paving the way for further advances in the field.