Ph.D. Defense of Dissertation Announcement
Title: Novel Document Representations based on Labels and Sequential Information
Seungyeon Kim
School of Computational Science and Engineering
(Ph.D. Computer Science program)
College of Computing
Georgia Institute of Technology
http://sylund.net
Date: Thursday, May 28, 2015
Time: 11:30am - 1:30pm ET (8:30am - 10:30am PT)
Location: Klaus Conference Room 1202
Committee:
Prof. Guy Lebanon (Advisor, School of Computational Science and Engineering, Georgia Institute of Technology)
Prof. Haesun Park (Co-advisor, School of Computational Science and Engineering, Georgia Institute of Technology)
Dr. Irfan Essa (School of Interactive Computing, Georgia Institute of Technology)
Dr. Jacob Eisenstein (School of Interactive Computing, Georgia Institute of Technology)
Dr. Samy Bengio (Google Inc)
Abstract:
A wide variety of text analysis applications are based on statistical machine learning techniques. One of the fundamental questions that must be answered for these techniques is how to represent documents. The representation of a document, often called its feature vector, plays a significant role in the overall performance of these techniques.
We can then ask what makes a good representation. There are a number of aspects of a good representation, but we focus on the following four. First, and most obviously, a representation should reflect the original data accurately; reconstruction quality is the most fundamental evaluation metric of a representation. Second, since we are usually interested in discriminating documents from each other, the representations of different documents should be distinguishable. Third, a representation that is easy for a human to interpret is very convenient. Fourth, a good representation should be efficient to compute; without scalability, a representation will remain confined to theoretical research.
Obtaining such a document representation poses several challenges. The most significant challenge comes from the sparsity of documents, which is extremely common in textual data. Sparsity often causes high estimation error. The second difficulty comes from the sequential nature of text, that is, the interdependencies between words. Although the ordering of words largely affects their semantics, modeling this ordering is not easy, for various reasons. For example, an n-gram model attempts to capture partial sequences of multiple words, but in doing so it suffers from even sparser observations.
This thesis presents novel document representations that overcome these two challenges, sparsity and sequentiality. We employ label and sequential information of documents during representation learning. Utilizing label characteristics enables us to find a dense subspace of interest that overcomes the sparsity issue. In addition, we present document representations that reflect sequential dependencies without suffering from high estimation error. Lastly, the thesis concludes with a document representation that employs both label and sequential information.
The approaches in this dissertation will be helpful for understanding documents at large scale. Most of the methods focus on efficient computation based on approximations or relaxations.