Title: Scalable Data Mining via Constrained Low Rank Approximation
Date: Friday, July 1st, 2022
Time: 2pm - 4pm ET
Physical Location: Coda C1215 Midtown
Virtual Location: https://gatech.zoom.us/j/92347767822
Srinivas Eswar
School of Computational Science and Engineering
Georgia Institute of Technology
Committee:
Dr. Richard Vuduc (Advisor, School of Computational Science and Engineering, Georgia Institute of Technology)
Dr. Haesun Park (Co-Advisor, School of Computational Science and Engineering, Georgia Institute of Technology)
Dr. Ümit V. Çatalyürek (School of Computational Science and Engineering, Georgia Institute of Technology)
Dr. Edmond Chow (School of Computational Science and Engineering, Georgia Institute of Technology)
Dr. Grey Ballard (Department of Computer Science, Wake Forest University)
------------------------
Abstract:
Matrix and tensor approximation methods are recognised as foundational tools for modern data analytics. Their strength lies in their long history of rigorous and principled theoretical foundations, judicious formulations via various constraints, and the availability of fast computer programs. Multiple constrained low rank approximation (CLRA) formulations exist for commonly encountered tasks such as clustering, dimensionality reduction, and anomaly detection. The primary challenge in modern data analytics is the sheer volume of data to be analysed, often requiring multiple machines just to hold the dataset in memory. This dissertation presents CLRA as a key enabler of scalable data mining on distributed-memory parallel machines.
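As a point of reference for the unconstrained problem, the best rank-k approximation in the Frobenius norm has a closed-form solution via the truncated SVD (Eckart–Young); CLRA methods solve the same kind of problem with added constraints. A minimal NumPy sketch (illustrative only, not code from the dissertation):

```python
import numpy as np

def truncated_svd(A, k):
    """Best rank-k approximation of A in the Frobenius norm (Eckart-Young).
    CLRA formulations add constraints (e.g. nonnegativity) to such
    factorisations, trading optimality for interpretability."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    # Scale the first k left singular vectors by the singular values,
    # then multiply by the first k right singular vectors: (m,k) @ (k,n).
    return U[:, :k] * s[:k] @ Vt[:k]
```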
Nonnegative Matrix Factorisation (NMF) is the primary CLRA method studied in this dissertation. NMF imposes nonnegativity constraints on the factor matrices and is popular for its interpretability and clustering prowess. The major bottleneck in most NMF algorithms is a distributed matrix-multiplication kernel. We develop the PLANC software package, which includes efficient matrix-multiplication and matricised tensor times Khatri-Rao product kernels tailored to the CLRA case. It employs carefully designed parallel algorithms and data distributions to avoid unnecessary computation and communication. With these key kernels in place, we extend PLANC to a variety of settings, including symmetry constraints, second-order methods, and multiple data modalities. We demonstrate the effectiveness of PLANC via scaling studies on the supercomputers at the Oak Ridge Leadership Computing Facility.
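To make the NMF objective concrete, the classic Lee–Seung multiplicative-update algorithm can be sketched in a few lines of NumPy. This is a serial illustration of the general technique only; PLANC itself uses different (parallel, block-coordinate) algorithms:

```python
import numpy as np

def nmf_mu(A, k, iters=500, seed=0):
    """Rank-k NMF A ~= W @ H via Lee-Seung multiplicative updates.
    Nonnegativity of W and H is preserved automatically because each
    update multiplies by a ratio of nonnegative quantities."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    W = rng.random((m, k))
    H = rng.random((k, n))
    eps = 1e-12  # guard against division by zero
    for _ in range(iters):
        # Each update needs products like W.T @ A and W.T @ W @ H --
        # the matrix-multiplication kernels that dominate the cost.
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H
```

Note how every iteration is dominated by matrix multiplications involving the data matrix A, which is why a fast distributed matrix-multiplication kernel is the key to scaling NMF.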