Title: Language Guided Localization and Navigation
Date: Friday, July 8, 2022
Time: 4-6pm (ET)
Location (virtual): https://gatech.zoom.us/j/92706895425?pwd=VVI0Y2lqRnVmYUFLbEIxVXNMTFpPQT09
Meera Hahn
School of Interactive Computing
College of Computing
Georgia Institute of Technology
Committee:
Dr. James M. Rehg (advisor), School of Interactive Computing, Georgia Institute of Technology
Dr. Dhruv Batra, School of Interactive Computing, Georgia Institute of Technology
Dr. Diyi Yang, School of Interactive Computing, Georgia Institute of Technology
Dr. Abhinav Gupta, The Robotics Institute, Carnegie Mellon University
Dr. Peter Anderson, Google
Abstract:
Embodied tasks that require active perception are key to improving language grounding models and creating holistic social agents. In this dissertation we explore four multi-modal embodied perception tasks which require localization or navigation of an agent in an unknown temporal or 3D space with limited information about the environment. We first explore how an agent can be guided by language to navigate a temporal space using reinforcement learning, in a manner analogous to navigation in a 3D space. Next, we explore how to teach an agent to navigate using only self-supervised learning from passive data. In this task we remove the complexity of language and explore a topological map and graph-network based strategy for navigation. We then present the Where Are You? (WAY) dataset, which contains over 6k dialogs of two humans performing a localization task. On top of this dataset, we design three tasks which push the envelope of current visual language-grounding tasks by introducing a multi-agent setup in which agents are required to use active perception to communicate, navigate, and localize. We specifically focus on modeling one of these tasks, Localization from Embodied Dialog (LED). The LED task involves taking a natural language dialog between two agents -- an observer and a locator -- and predicting the location of the observer agent. We find that a topological graph map of the environments is a successful representation for modeling the complex relational structure of the dialog and observer locations. We validate our approach against several state-of-the-art multi-modal baselines and show that a multi-modal transformer with large-scale pretraining outperforms all other models. We additionally introduce a novel analysis pipeline on this model for the LED and Vision-and-Language Navigation (VLN) tasks to diagnose and reveal limitations and failure modes of these types of models.