Ph.D. Thesis Proposal Announcement
Title: Robust Adaptation of Natural Language Processing for Language Variation
Yi Yang
Ph.D. Student
School of Interactive Computing
College of Computing
Georgia Institute of Technology
http://www.cc.gatech.edu/~yyang319/
Date: Tuesday, April 7, 2015
Time: 3:00pm – 5:00pm EDT
Location: Klaus 1212
Committee:
Dr. Jacob Eisenstein (Advisor), School of Interactive Computing, Georgia Institute of Technology
Dr. James M. Rehg, School of Interactive Computing, Georgia Institute of Technology
Dr. Duen Horng (Polo) Chau, School of Computational Science & Engineering, Georgia Institute of Technology
Dr. Byron Boots, School of Interactive Computing, Georgia Institute of Technology
Abstract:
Natural Language Processing (NLP) technology has been applied in various domains, ranging from social media and digital humanities to public health. Unfortunately, existing NLP techniques often perform unsatisfactorily in these areas, because they are driven by standard corpora and are therefore vulnerable to the language variation found in new datasets and settings. Previous approaches to this problem suffer from two major weaknesses. First, they usually employ supervised methods that require expensive annotations and easily become outdated given the dynamic nature of language. Second, they often fail to leverage the valuable metadata associated with the target language in these areas.
In this thesis, I propose to overcome these weaknesses by exploring unsupervised learning techniques to build NLP systems that are robust to language variation. The work branches into three main directions: a) unsupervised text normalization, which transforms lexical variations into text that better matches standard datasets; b) unsupervised domain adaptation, which adapts standard NLP tools directly to text with variation by learning representations that are robust to that variation; and c) personalized natural language processing, which incorporates user metadata to adapt generic NLP to each individual user. These approaches are driven by co-occurrence statistics and rich metadata, do not need costly annotations, and can easily adapt to new settings. My preliminary work on text normalization and domain adaptation delivers state-of-the-art NLP systems for social media and historical text. As future work, I propose to further improve these results by leveraging various kinds of user metadata.