Two New Machine Learning Approaches for Text Classification - Jacob Eisenstein

Event Details

Date/Time:
- Wednesday January 25, 2017 - Thursday January 26, 2017
  12:00 pm - 12:59 pm
Location: Marcus Nano Room 1117-1118
Fee(s): 0

Contact

Le Song

lsong@cc.gatech.edu

Summaries

Summary Sentence: Two New Machine Learning Approaches for Text Classification - Jacob Eisenstein

Full Summary: No summary paragraph submitted.

Title: Two new machine learning approaches for text classification

Abstract: Text document classification is one of the most well-studied applications of machine learning. Yet the technology is still limited by practical difficulties and invalid underlying assumptions.

First, many people who want text classifiers do not have the time or resources to annotate a dataset. They often employ a heuristic alternative: they create a word list for each label class, and then predict the class whose list matches the largest number of words in the text. This heuristic is theoretically unjustified, and it mistakenly assigns the same importance to every word in the list. I show that list-based classification can be viewed as a (very!) special case of Naive Bayes. Based on this analysis, it is possible to estimate weights for each word without supervision, using the method of moments.
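As a rough illustration of the paragraph above (not the construction presented in the talk), the sketch below compares list-based counting with a multinomial Naive Bayes model whose weights are tied. The word lists, the `ratio` parameter, and all function names are hypothetical. It assumes a uniform class prior, equal-length non-overlapping lists, and that every listed word is `ratio` times as likely as any other word under its class.

```python
import math
from collections import Counter

# Hypothetical word lists, invented for illustration (equal length, no overlap).
WORD_LISTS = {
    "positive": ["good", "great", "love"],
    "negative": ["bad", "awful", "hate"],
}

def list_scores(tokens, word_lists):
    """Heuristic score: how many tokens appear on each class's list."""
    counts = Counter(tokens)
    return {label: sum(counts[w] for w in words)
            for label, words in word_lists.items()}

def naive_bayes_scores(tokens, word_lists, ratio=5.0):
    """Log-likelihood under a multinomial Naive Bayes model with tied weights.

    Assumption: within a class, every listed word is `ratio` times as likely
    as every other vocabulary word, and the class prior is uniform.
    """
    vocab = set(tokens).union(*word_lists.values())
    scores = {}
    for label, words in word_lists.items():
        listed = set(words)
        z = ratio * len(listed) + (len(vocab) - len(listed))  # normalizer
        n_in = sum(1 for t in tokens if t in listed)
        n_out = len(tokens) - n_in
        scores[label] = n_in * math.log(ratio / z) + n_out * math.log(1.0 / z)
    return scores

def predict(scores):
    """Return the label with the highest score."""
    return max(scores, key=scores.get)

tokens = "i love this great movie but the ending was bad".split()
print(predict(list_scores(tokens, WORD_LISTS)))
print(predict(naive_bayes_scores(tokens, WORD_LISTS)))
```

Under these assumptions the normalizer is the same for every class, so each log-likelihood is a constant plus the match count times log(ratio), and both functions rank the classes identically. Letting each word carry its own weight instead of a shared `ratio` is the generalization the abstract refers to.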

Second, machine learning approaches to text classification nearly always begin by assuming that documents are independent and identically distributed (IID). Yet words can mean different things to different people, raising the possibility of misunderstanding even in human-human conversation. One potential solution is to relax the IID assumption by personalizing text classifiers to the author. An apparent roadblock is the challenge of obtaining labeled data for each author. I will present a method that sidesteps this requirement by relying on the sociological theory of homophily, which states that people who are socially connected tend to share personal traits. This idea can be formalized by estimating node embeddings for each individual in a social network, and then using these embeddings to drive a social attentional mechanism in a neural ensemble classifier. The resulting system obtains significant improvements on sentiment analysis on Twitter. This project is joint work with Yi Yang.
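The ensemble idea in the paragraph above can be sketched as follows. This is a toy illustration under stated assumptions, not the architecture presented in the talk: the gating by dot product, the two-dimensional embeddings, and every name and number here are invented. Each ensemble member outputs a class distribution for a tweet, and the author's embedding determines how much weight each member receives.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(author_embedding, gate_vectors):
    """One weight per ensemble member, from dot products with the embedding."""
    logits = [sum(a * g for a, g in zip(author_embedding, gate))
              for gate in gate_vectors]
    return softmax(logits)

def ensemble_predict(member_probs, weights):
    """Mix the members' class distributions using author-specific weights."""
    n_classes = len(member_probs[0])
    return [sum(w * probs[c] for w, probs in zip(weights, member_probs))
            for c in range(n_classes)]

# Toy values, invented for illustration.
gate_vectors = [[1.0, 0.0], [0.0, 1.0]]   # one gate vector per member
member_probs = [[0.9, 0.1], [0.2, 0.8]]   # each member's [P(pos), P(neg)]
author_a = [2.0, 0.0]                     # embedding near member 0's gate
author_b = [0.0, 2.0]                     # embedding near member 1's gate

mixed_a = ensemble_predict(member_probs, attention_weights(author_a, gate_vectors))
mixed_b = ensemble_predict(member_probs, attention_weights(author_b, gate_vectors))
print(mixed_a, mixed_b)
```

The same tweet receives a different label distribution depending on its author. If homophily holds, socially connected authors have nearby embeddings and therefore similar mixtures, which is why no per-author labels are needed in this sketch.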

Bio: Jacob Eisenstein is an Assistant Professor in the School of Interactive Computing at Georgia Tech. He works on statistical natural language processing, focusing on computational sociolinguistics, social media analysis, discourse, and machine learning. He is a recipient of the NSF CAREER Award, a member of the Air Force Office of Scientific Research (AFOSR) Young Investigator Program, and was a SICSA Distinguished Visiting Fellow at the University of Edinburgh. His work has also been supported by the National Institutes of Health, the National Endowment for the Humanities, and Google. Jacob was a postdoctoral researcher at Carnegie Mellon and the University of Illinois. He completed his Ph.D. at MIT in 2008, winning the George M. Sprowls dissertation award. Jacob's research has been featured in the New York Times, National Public Radio, and the BBC. Thanks to his brief appearance in If These Knishes Could Talk, Jacob has a Bacon number of 2.

Additional Information

In Campus Calendar

Yes

Groups

College of Computing, IRIM, School of Interactive Computing

Invited Audience

Faculty/Staff, Public, Undergraduate students, Graduate students

Categories

Seminar/Lecture/Colloquium

Keywords

No keywords were submitted.

Status

Created By: Birney Robert
Workflow Status: Published
Created On: Jan 20, 2017 - 2:53pm
Last Updated: Apr 13, 2017 - 5:13pm