*********************************
There is now a CONTENT FREEZE for Mercury while we switch to a new platform. It began on Friday, March 10 at 6pm and will end on Wednesday, March 15 at noon. No new content can be created during this time, but all material in the system as of the beginning of the freeze will be migrated to the new platform, including users and groups. Functionally the new site is identical to the old one. webteam@gatech.edu
*********************************
Title: Systems Abstractions for Big Data Processing on a Single Machine
Steffen Maass
School of Computer Science
College of Computing
Georgia Institute of Technology
Date: Wednesday, April 3, 2019
Time: Noon EDT
Location: KACB 3100
Committee:
------------
Dr. Taesoo Kim (Advisor, School of Computer Science, Georgia Tech) Dr. Ada Gavrilovska (School of Computer Science, Georgia Tech) Dr. Umakishore Ramachandran (School of Computer Science, Georgia Tech) Dr. Tushar Krishna (School of Electrical Engineering, Georgia Tech) Dr. Willy Zwaenepoel (Faculty of Engineering and Information Technologies, The University of Sydney)
Abstract:
-----------
Large-scale internet services, such as Facebook or Google, are using clusters of many servers for problems such as search, machine learning, and social networks.
However, while it may be possible to apply the tools used at this scale to smaller, more common problems as well, this dissertation presents approaches to large-scale data processing on only a single machine.
This approach has obvious cost benefits and lowers the barrier of entrance to large-scale data processing.
This dissertation approaches this problem by redesigning applications to enable trillion-scale graph processing on a single machine, enable the processing of evolving, billion-scale graphs, and presenting an operating-systems level optimization.
First, this dissertation presents a new out-of-core graph processing engine, called Mosaic, for executing graph algorithms on trillion-scale datasets on a single machine.
Mosaic makes use of many-core processors and PCIe-SSDs coupled with a novel graph encoding scheme to allow processing of graphs of up to one trillion edges on a single machine.
Mosaic also employs a locality-preserving curve to allow for high compression and high locality when storing graphs and executing algorithms.
Second, this dissertation presents Cytom, a new engine for processing evolving graphs based on insights about achieving high compression and locality while improving load-balancing when processing a graph that changes rapidly.
Cytom also introduces a novel programming model that takes advantage of its subgraph-centric approach coupled with the setting of evolving graphs.
This is an important enabling step for emerging workloads when processing graphs that change over time.
Finally, we present an asynchronous scheme for clearing the processors' translation lookaside buffers (TLBs) in response to the high overhead of the current, synchronous process known as a TLB shootdown.
This process is critical for system services such as freeing memory, NUMA memory migration, and page swapping in emerging, disaggregated data centers; these services are often used when processing large amounts of data.
The key idea of this scheme, Latr, is a lazy mechanism to remove entries from the cores' TLBs while ensuring correctness by lazily releasing virtual memory only after Latr's lazy shootdown mechanism finishes.
This scheme removes the current overhead of costly inter-processor interrupts.
We show that this mechanism has impacts on many applications ranging from webservers which might be used as caching frontends for big data processing to key-value stores and graph processing.