Title: Towards multi-modal AI systems with 'open-world' cognition.
Date: Friday, September 16, 2022
Time: 11:30 am - 1:00 pm EDT
Location (virtual): https://gatech.zoom.us/j/92418212103
Harsh Agrawal
PhD Student in Computer Science
College of Computing
Georgia Institute of Technology
Committee
Dr. Dhruv Batra (Advisor, School of Interactive Computing, Georgia Institute of Technology)
Dr. Devi Parikh (School of Interactive Computing, Georgia Institute of Technology)
Dr. James Hays (School of Interactive Computing, Georgia Institute of Technology)
Dr. Alexander Schwing (Department of Electrical and Computer Engineering, University of Illinois Urbana-Champaign)
Dr. Peter Anderson (Google)
Dr. Felix Hill (DeepMind)
Abstract
A long-term goal in AI research is to build intelligent systems with 'open-world' cognition. When deployed in the wild, AI systems should generalize to novel concepts and instructions. Such an agent would need to perceive both familiar and unfamiliar concepts present in the environment, combine the capabilities of models trained on different modalities, and incrementally acquire new skills to continuously adapt to the evolving world. In this thesis, we look at how we can combine complementary multi-modal knowledge with suitable forms of reasoning to enable novel concept learning.

In Part 1, we show that agents can infer unfamiliar concepts in the presence of other familiar concepts by combining multi-modal knowledge with deductive reasoning. Furthermore, agents can use newly inferred concepts to update their vocabulary of known concepts and infer additional novel concepts incrementally.

In Part 2, we look at two realistic tasks that require understanding novel concepts. First, we present a benchmark to evaluate an AI system's capability to describe novel objects present in an image. We argue that models that disentangle 'how to recognize an object' from 'how to talk about it' generalize better to novel objects than traditional methods trained on paired image-caption data. Second, we study how embodied agents can combine perception with common-sense knowledge to perform household chores such as tidying up the house, without any explicit human instruction, even in the presence of unseen objects in unseen environments.

Finally, in the proposed work, we will show that by combining complementary knowledge stored in foundation models trained on different domains (vision-only, language-only, and vision-language), agents can perform zero-shot novel instruction following and continuously adapt to the open world by learning new skills incrementally.