Title: Enabling Parallelism and Optimizations in Data Mining Algorithms for Power-law Data
————————————
Ankush Mandal
Ph.D. Candidate
School of Computer Science
College of Computing
Georgia Institute of Technology
Date: Tuesday, June 30th, 2020
Time: 1:00 PM – 3:00 PM ET
Location: https://bluejeans.com/vsarkar9 (remote)
Committee:
————————————
Dr. Vivek Sarkar (Advisor, School of Computer Science, Georgia Tech)
Dr. Anshumali Shrivastava (Co-advisor, Department of Computer Science, Rice University)
Dr. Hyesoon Kim (School of Computer Science, Georgia Tech)
Dr. Santosh Pande (School of Computer Science, Georgia Tech)
Dr. Richard Vuduc (School of Computational Science and Engineering, Georgia Tech)
Abstract:
————————————
In this era of big data, extracting meaningful information from large amounts of data in a reasonable time has become possible mainly through two types of advancements: (a) algorithmic advances, such as fast approximate algorithms and efficient learning algorithms, and (b) architectural advances, such as machines with massive compute power involving distributed multi-core processors and high-throughput accelerators. The amount of parallelism available in current and future processors is ever-increasing, so parallel algorithms are critical for fully utilizing today's computing resources. Furthermore, for data mining problems, exploiting data properties becomes crucial for performance. In this work, we focus on power-law behavior, a property common to a large class of data such as text, internet traffic, and click-stream data. Specifically, we address the following questions in the context of power-law data: How well do the critical data mining algorithms of current interest fit today's parallel architectures? Which algorithmic and mapping opportunities can be leveraged to further improve performance? And what are the relative challenges and gains of such approaches?
We first investigate the suitability of a classical problem in large-scale data stream processing, frequency estimation, for GPU-scale parallelism. Sketching algorithms are a popular choice for this task because of their desirable trade-off between estimation accuracy and space-time efficiency. However, most past work on sketch-based frequency estimation has focused on CPU implementations, with little exploration of throughput-optimized parallel architectures such as GPUs. We propose a novel approach for sketches that exploits the natural skew in power-law data to efficiently utilize the massive parallelism of modern GPUs.
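For readers unfamiliar with sketching, the following is a minimal, generic Count-Min Sketch in Python. It only illustrates the update/estimate operations that sketch-based frequency estimation relies on; it is not the GPU-oriented design presented in the talk, and the width, depth, and hashing scheme here are arbitrary choices for exposition.

# Minimal Count-Min Sketch for frequency estimation (illustrative only;
# not the GPU-oriented design described in the talk).
import hashlib

class CountMinSketch:
    def __init__(self, width=2048, depth=4):
        self.width = width          # counters per row (space/accuracy trade-off)
        self.depth = depth          # number of independent hash rows
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, item, row):
        # Derive a per-row hash by salting the item with the row id.
        digest = hashlib.md5(f"{row}:{item}".encode()).hexdigest()
        return int(digest, 16) % self.width

    def update(self, item, count=1):
        for row in range(self.depth):
            self.table[row][self._index(item, row)] += count

    def estimate(self, item):
        # Each row over-estimates due to collisions; take the minimum.
        return min(self.table[row][self._index(item, row)]
                   for row in range(self.depth))

if __name__ == "__main__":
    cms = CountMinSketch()
    stream = ["a"] * 100 + ["b"] * 10 + ["c"]   # skewed, power-law-like stream
    for token in stream:
        cms.update(token)
    print(cms.estimate("a"), cms.estimate("b"), cms.estimate("c"))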
Next, we explore one of the most common and important tasks in data mining, identification of the most frequent elements, in modern distributed settings with both multi-core and multi-node CPU parallelism. For identifying the top-K frequent items, the Count-Min Sketch (CMS) has an excellent update time but lacks the important property of reducibility, which is needed for exploiting data parallelism. On the other hand, the popular Frequent Algorithm (FA) produces reducible summaries, but its update cost is high. Our approach, Topkapi, gives the best of both worlds: it is reducible like FA and has an efficient update time similar to CMS. For power-law data, Topkapi possesses strong theoretical guarantees and delivers significant performance gains relative to past work.
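As background, the sketch below shows a textbook Frequent (Misra-Gries) summary together with a merge step, illustrating the reducibility property referred to above. It is a generic illustration under standard assumptions about the algorithm, not the Topkapi implementation, and the parameter names are chosen for exposition.

# Textbook Misra-Gries ("Frequent Algorithm") summary with a merge step,
# illustrating reducibility; not the Topkapi implementation itself.
from collections import Counter

def misra_gries(stream, k):
    """Keep at most k-1 counters; survivors include every item whose
    frequency exceeds n/k."""
    counters = {}
    for item in stream:
        if item in counters:
            counters[item] += 1
        elif len(counters) < k - 1:
            counters[item] = 1
        else:
            # Decrement all counters; drop those that reach zero.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters

def merge_summaries(s1, s2, k):
    """Reducibility: two partial summaries combine into one summary of the
    concatenated stream with the same error guarantee."""
    merged = Counter(s1) + Counter(s2)
    if len(merged) < k:
        return dict(merged)
    # Subtract the k-th largest count from everything and keep positives.
    kth = sorted(merged.values(), reverse=True)[k - 1]
    return {key: c - kth for key, c in merged.items() if c > kth}

if __name__ == "__main__":
    part1 = ["a"] * 50 + ["b"] * 20 + list("xyz")
    part2 = ["a"] * 30 + ["c"] * 25 + list("pqr")
    s1, s2 = misra_gries(part1, 4), misra_gries(part2, 4)
    print(merge_summaries(s1, s2, 4))   # heavy hitters of the combined stream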
Finally, we study Word2Vec, a popular word embedding method widely used in machine learning and natural language processing applications such as machine translation, sentiment analysis, and query answering. Here, we target Single Instruction Multiple Data (SIMD) parallelism. With the increasing vector lengths in commodity CPUs, such as AVX-512 with a vector length of 512 bits, efficient utilization of the vector processing units becomes a major performance factor. By employing a static multi-version code generation strategy coupled with an algorithmic approximation based on the power-law frequency distribution of words, we achieve significant reductions in training time relative to the state of the art.
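For orientation, the kernel below sketches the dot-product-heavy skip-gram negative-sampling update that dominates Word2Vec training, written with NumPy so the inner products map naturally onto wide SIMD units. It is a simplified version of the standard update, not the multi-version code generation or frequency-based approximation described in the talk, and all names and parameters are illustrative.

# Shape of the skip-gram negative-sampling update at the core of Word2Vec
# training; a simplified, illustrative kernel, not the scheme from the talk.
import numpy as np

def sgns_update(w_in, w_out, center, context, negatives, lr=0.025):
    """One update for a (center, context) pair with negative samples.

    w_in, w_out : (vocab, dim) embedding matrices
    center      : index of the input word
    context     : index of the positive output word
    negatives   : array of sampled negative word indices
    """
    targets = np.concatenate(([context], negatives))
    labels = np.zeros(len(targets))
    labels[0] = 1.0                               # positive sample first
    v = w_in[center]                              # (dim,)
    u = w_out[targets]                            # (k+1, dim), a copy
    scores = 1.0 / (1.0 + np.exp(-u @ v))         # sigmoid of dot products
    grad = (scores - labels) * lr                 # (k+1,)
    w_in[center] -= grad @ u                      # update the input vector
    w_out[targets] -= np.outer(grad, v)           # update the output vectors

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w_in = rng.normal(scale=0.1, size=(1000, 128))
    w_out = np.zeros((1000, 128))
    sgns_update(w_in, w_out, center=5, context=17,
                negatives=rng.integers(0, 1000, size=5))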