Title: Leveraging Mid-Level Representations For Complex Activity Recognition
Unaiza Ahsan
Computer Science Ph.D. Student
School of Interactive Computing
College of Computing
Georgia Institute of Technology
Date: Tuesday, Nov 27, 2018
Time: 10:00 AM to 12:00 PM (EST)
Location: College of Computing Building (CCB) 345
Committee:
---------------
Dr. Irfan Essa (Advisor), School of Interactive Computing, Georgia Institute of Technology
Dr. James Hays, School of Interactive Computing, Georgia Institute of Technology
Dr. Devi Parikh, School of Interactive Computing, Georgia Institute of Technology
Dr. Munmun De Choudhury, School of Interactive Computing, Georgia Institute of Technology
Dr. Zsolt Kira, School of Interactive Computing, Georgia Institute of Technology
Dr. Chen Sun, Google
Summary:
---------------
Dynamic scene understanding requires learning representations of the components of a scene, including objects, environments, actions, and events. Complex activity recognition from images and videos typically requires annotating large datasets with action labels, which is tedious and expensive. Thus, there is a need for a mid-level, or intermediate, feature representation that does not require millions of labels, yet generalizes to semantic-level recognition of activities in visual data. This thesis makes three contributions in this regard.
First, we propose an event concept-based intermediate representation that learns concepts from Web data and uses this representation to identify events even with a single labeled example. To demonstrate the strength of the proposed approach, we contribute two diverse social event datasets to the community. We then present a use case of event concepts as a mid-level representation that generalizes to sentiment recognition in diverse social event images.
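The one-labeled-example setting above amounts to nearest-exemplar matching in concept space. The following is an illustrative sketch only: the concept names, scores, and cosine-similarity matching rule are assumptions, not the thesis's actual detectors or classifier.

```python
import numpy as np

# Hypothetical concept-score vectors: each image is described by the
# responses of a bank of Web-learned event-concept detectors.
# Concept names and scores below are purely illustrative.
concepts = ["cake", "balloons", "crowd", "jersey", "podium"]

# One labeled exemplar per event class (the single labeled example).
exemplars = {
    "birthday": np.array([0.9, 0.8, 0.3, 0.0, 0.1]),
    "marathon": np.array([0.0, 0.1, 0.9, 0.8, 0.6]),
}

def cosine(a, b):
    # Cosine similarity between two concept-score vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def classify(query):
    # Predicted event = class of the most similar exemplar.
    return max(exemplars, key=lambda ev: cosine(query, exemplars[ev]))

query = np.array([0.8, 0.7, 0.4, 0.1, 0.0])  # unseen test image
print(classify(query))  # birthday
```

Because the concept detectors are trained from the Web rather than from event labels, only the handful of exemplar vectors needs manual annotation.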
Second, we propose to train Generative Adversarial Networks (GANs) on video frames (which requires no labels), use the trained discriminator from the GAN as an intermediate representation, and fine-tune it on a smaller labeled video activity dataset to recognize actions in videos. This unsupervised pre-training step avoids manual feature engineering, video frame encoding, and searching for the best video frame sampling technique.
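The transfer step can be sketched as freezing the discriminator's hidden layers as a feature extractor and training only a small classifier head on the labeled data. This is a toy numpy sketch, not the thesis's architecture: a random ReLU projection stands in for the trained discriminator, and the labels are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for a GAN discriminator trained on unlabeled frames: after
# adversarial training, its hidden layers serve as a frozen mid-level
# feature extractor. Here a fixed random projection plays that role.
W_disc = rng.normal(size=(64, 16))           # frozen "discriminator" weights

def discriminator_features(frames):
    # frames: (n, 64) flattened frames -> (n, 16) mid-level features
    return np.maximum(frames @ W_disc, 0.0)  # ReLU activations

# Fine-tune only a logistic-regression head on a small labeled set.
frames = rng.normal(size=(200, 64))
labels = (frames[:, 0] > 0).astype(float)    # toy binary "action" labels

feats = discriminator_features(frames)
w, b, lr = np.zeros(16), 0.0, 0.1
for _ in range(300):
    p = 1.0 / (1.0 + np.exp(-(feats @ w + b)))   # sigmoid predictions
    grad = p - labels                            # cross-entropy gradient
    w -= lr * feats.T @ grad / len(labels)
    b -= lr * grad.mean()

preds = 1.0 / (1.0 + np.exp(-(feats @ w + b))) > 0.5
acc = (preds == labels.astype(bool)).mean()
```

Only `w` and `b` are updated; `W_disc` stays fixed, mirroring how the pre-trained discriminator supplies features that the small labeled dataset alone could not learn from scratch.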
Our third contribution is a self-supervised learning approach for videos that exploits both spatial and temporal coherence to learn feature representations from video data without any supervision. We demonstrate the transfer learning capability of this model on smaller labeled datasets. We present a comprehensive experimental analysis of the self-supervised model to provide insights into the unsupervised pre-training paradigm and how it can help with activity recognition on target datasets the model has never seen during training.
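One common way temporal coherence yields free supervision is to sample frame triplets from unlabeled video and label each by whether it appears in temporal order; a network trained on this pretext task learns motion-aware features. The sketch below only generates such pseudo-labels; the sampling scheme and the pretext task itself are illustrative assumptions, not necessarily the one used in the thesis.

```python
import random

random.seed(0)

def make_order_examples(num_frames, n_samples):
    # From an unlabeled clip of `num_frames` frames, build pseudo-labeled
    # triplets: label 1 if frame indices are in temporal order, 0 if shuffled.
    examples = []
    for _ in range(n_samples):
        i, j, k = sorted(random.sample(range(num_frames), 3))
        if random.random() < 0.5:
            examples.append(((i, j, k), 1))   # in order -> positive
        else:
            examples.append(((j, i, k), 0))   # shuffled -> negative
    return examples

for triplet, label in make_order_examples(num_frames=100, n_samples=4):
    print(triplet, label)
```

No human annotation enters this loop: the labels come from the video's own temporal structure, which is what lets the learned features transfer to labeled activity-recognition datasets downstream.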