Title: Accelerating Advanced Analytics
Abstract:
Advanced analytics -- the analysis of large and complex data with machine
learning (ML) -- is becoming ubiquitous, with a growing demand for
advanced analytics tools in enterprise domains. However, there exist
several challenging bottlenecks in the end-to-end process of building and
deploying advanced analytics applications. My research focuses on
abstractions, algorithms, and systems to mitigate such bottlenecks and
accelerate advanced analytics from a data management standpoint.
In this talk, I will focus on my work on mitigating one such pervasive
bottleneck in the process of feature engineering for ML -- joins of
multiple tables. Many real-world datasets are multi-table, connected by
key-foreign key relationships, but almost all ML toolkits expect
single-table inputs. This forces data scientists to join all tables and
materialize a single table that collects all features. Alas, such joins
often cause the output to blow up in size, which slows down ML, increases
costs, and leads to data maintenance headaches. In my work, I show how it
is possible to mitigate these issues by "avoiding joins physically,"
i.e., pushing ML down through joins. This reduces runtime without
affecting accuracy. Going further, I apply statistical learning theory to
show how it is often possible to also "avoid joins logically," i.e.,
ignore entire tables outright without losing much accuracy, while achieving
significant runtime gains.
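To make "avoiding joins physically" concrete, here is a minimal sketch
for the simplest case, scoring rows with a linear model. It is not taken
from the talk; the toy tables S and R, the weight vectors, and the
function names are all hypothetical. It only illustrates that the inner
product over a joined row splits into one part per base table, so the
part belonging to the dimension table can be computed once per row of
that table and reused.

```python
# Hypothetical toy data, for illustration only.
# R is the dimension table: primary key -> feature vector.
R = {
    10: [1.0, 2.0],
    20: [0.5, -1.0],
}
# S is the fact table: (foreign key into R, feature vector).
S = [
    (10, [3.0]),
    (20, [4.0]),
    (10, [5.0]),
]


def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


def scores_materialized(S, R, w_s, w_r):
    """Join first, then score each wide row with the full weight vector."""
    w = w_s + w_r
    joined = [x_s + R[fk] for fk, x_s in S]  # materialized join output
    return [dot(w, row) for row in joined]


def scores_factorized(S, R, w_s, w_r):
    """Score without building the join output.

    The R-side partial inner product is computed once per R row and then
    looked up through the foreign key of each S row.
    """
    partial = {key: dot(w_r, x_r) for key, x_r in R.items()}
    return [dot(w_s, x_s) + partial[fk] for fk, x_s in S]


w_s, w_r = [2.0], [1.0, -1.0]  # assumed weights for S and R features
materialized = scores_materialized(S, R, w_s, w_r)
factorized = scores_factorized(S, R, w_s, w_r)
print(materialized == factorized)
```

Both functions return the same scores. The factorized version does the
R-side arithmetic once per R row rather than once per joined row, which
is where savings would come from when many S rows share a foreign key.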
Bio:
Arun Kumar is a Ph.D. candidate at the University of Wisconsin-Madison.
His primary research interests are in data management and its
intersection with machine learning. He is co-advised by Jeffrey Naughton
and Jignesh M. Patel, and has also worked closely with Christopher Re and
Xiaojin Zhu. Systems and ideas from his research have been shipped in
products by EMC, Oracle, Cloudera, and IBM. A paper co-authored by him
was accorded the Best Paper Award at ACM SIGMOD 2014. He was awarded the
Anthony C. Klug NCR Fellowship in database systems in 2015. He received
his M.S. from UW-Madison in 2011 and his B.Tech. from IIT Madras in 2009.
Webpage:
http://pages.cs.wisc.edu/~arun/