PhD Defense by Jiasen Lu


Event Details
  • Date/Time:
    • Monday January 6, 2020
      12:00 pm - 1:59 pm
  • Location: CODA C1108
  • Fee(s):
    N/A
Contact
No contact information submitted.
Summaries

Summary Sentence: Visually Grounded Language Understanding and Generation

Full Summary: No summary paragraph submitted.

Title: Visually Grounded Language Understanding and Generation

 

Jiasen Lu

Ph.D. Candidate in Computer Science

School of Interactive Computing

Georgia Institute of Technology

https://www.cc.gatech.edu/~jlu347/

 

Date: Monday, January 6, 2020

Time: 12:00-2:00 PM (EST)

Location: CODA C1108

BlueJeans: https://bluejeans.com/7313234985

 

 

Committee:

Dr. Devi Parikh (Advisor), School of Interactive Computing, Georgia Institute of Technology

Dr. Dhruv Batra, School of Interactive Computing, Georgia Institute of Technology

Dr. Mark Riedl, School of Interactive Computing, Georgia Institute of Technology

Dr. Judy Hoffman, School of Interactive Computing, Georgia Institute of Technology

Dr. Jason J. Corso, Department of Electrical Engineering and Computer Science, University of Michigan

 

Abstract:

 

The world around us involves multiple modalities. One of the major challenges in modeling different modalities jointly is how to induce appropriate grounding in models given the heterogeneity of the data. Which parts of the image and question should the model focus on when answering a question about an image? How can we integrate object detectors to produce fluent but visually grounded image captions? How can we disentangle "what to say" from "how to say it" when automatically generating goal-oriented dialogs about images? How can we build a more general multi-modal AI that learns visual grounding from massive meta-data on the internet and solves multiple tasks at the same time?

 

In this thesis, I take steps towards studying how inducing appropriate grounding in deep models improves multi-modal AI capabilities, in the context of vision and language understanding.

 

Specifically, I will present:

1) how to ground visual question answering models in appropriate regions of the image and appropriate phrases in the question to more accurately answer questions about images (an illustrative sketch follows this list);

2) how to ground image captioning models in object detections by combining symbolic and deep learning approaches to avoid hallucinating visual concepts in image captions;

3) how to generalize from single-round visual question generation with full supervision to a multi-round, dialog-based image guessing game without direct language supervision;

4) how to learn joint visual-linguistic representations with self-supervised learning that capture rich semantic and structural information from large, unlabeled data sources (a second sketch follows the list).
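
To make contributions 1 and 4 above a bit more concrete, two minimal PyTorch sketches follow. They are illustrative simplifications, not the thesis implementations: the module names, feature dimensions, masking details, and the simplified encoders are assumptions chosen for brevity.

First, a question-conditioned attention sketch for contribution 1 (the thesis work attends over both image regions and question phrases; only the region side is shown here):

```python
# Illustrative sketch only (not the thesis code): attend over detected image
# regions conditioned on a question encoding, then classify an answer from the
# attended, visually grounded feature. Dimensions are placeholder assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionAttentionVQA(nn.Module):
    def __init__(self, v_dim=2048, q_dim=1024, hid=512, n_answers=3000):
        super().__init__()
        self.v_proj = nn.Linear(v_dim, hid)   # project region features
        self.q_proj = nn.Linear(q_dim, hid)   # project question encoding
        self.att = nn.Linear(hid, 1)          # scalar relevance per region
        self.classifier = nn.Linear(hid + q_dim, n_answers)

    def forward(self, regions, question):
        # regions: (B, n_regions, v_dim) detector features; question: (B, q_dim)
        v = self.v_proj(regions)                          # (B, R, H)
        q = self.q_proj(question).unsqueeze(1)            # (B, 1, H)
        scores = self.att(torch.tanh(v + q)).squeeze(-1)  # (B, R)
        alpha = F.softmax(scores, dim=-1)                 # attention over regions
        v_att = (alpha.unsqueeze(-1) * regions).sum(1)    # grounded visual summary
        logits = self.classifier(torch.cat([self.v_proj(v_att), question], dim=-1))
        return logits, alpha                              # answer scores + grounding weights
```

Second, a masked-prediction sketch for contribution 4: words and detected regions are encoded jointly, and the model is trained to reconstruct masked inputs (a single-stream encoder stands in here for the richer joint models used in the thesis work):

```python
# Illustrative sketch only: a self-supervised objective over paired words and
# image regions. Both streams are encoded jointly; heads reconstruct masked
# word ids and masked region features. Vocabulary size, dimensions, and the
# single-stream transformer are placeholder assumptions.
import torch
import torch.nn as nn

class JointMaskedEncoder(nn.Module):
    def __init__(self, vocab=30522, v_dim=2048, hid=768, n_layers=4):
        super().__init__()
        self.word_emb = nn.Embedding(vocab, hid)
        self.region_proj = nn.Linear(v_dim, hid)
        layer = nn.TransformerEncoderLayer(hid, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.word_head = nn.Linear(hid, vocab)    # predict masked word ids
        self.region_head = nn.Linear(hid, v_dim)  # regress masked region features

    def forward(self, tokens, regions):
        # tokens: (B, T) word ids (some replaced by a [MASK] id upstream);
        # regions: (B, R, v_dim) detector features (some zeroed out upstream)
        x = torch.cat([self.word_emb(tokens), self.region_proj(regions)], dim=1)
        h = self.encoder(x)                       # jointly contextualized features
        T = tokens.size(1)
        return self.word_head(h[:, :T]), self.region_head(h[:, T:])
```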

 

On these vision-and-language tasks, I will demonstrate that inducing appropriate grounding in deep models improves multi-modal AI capabilities. To conclude, I will briefly discuss the open challenges in this domain and extensions of recent work.

 

Additional Information

In Campus Calendar
No
Groups

Graduate Studies

Invited Audience
Faculty/Staff, Public, Graduate students, Undergraduate students
Categories
Other/Miscellaneous
Keywords
PhD Defense
Status
  • Created By: Tatianna Richardson
  • Workflow Status: Published
  • Created On: Jan 2, 2020 - 1:48pm
  • Last Updated: Jan 2, 2020 - 1:48pm