Title: Recent advances on the reduction and analysis of big and high-dimensional data
Advisors: Dr. C. F. Jeff Wu, Dr. V. Roshan Joseph
Committee Members:
Dr. Yao Xie (ISyE)
Dr. George Lan (ISyE)
Dr. Fred J. Hickernell (Dept. of Applied Mathematics, Illinois Institute of Technology)
Date and Time: Monday, March 12th, 11:00 AM
Location: ISyE Main 126
Abstract:
In an era with remarkable advancements in computer engineering, computational algorithms, and mathematical modeling, data scientists are inevitably faced with the challenge of working with big and high-dimensional data. For many problems, data reduction is a necessary first step: it allows for storage and portability of big data, and enables the computation of expensive downstream quantities. The next step then involves the analysis of big data – the use of such data for modeling, inference, and prediction. This thesis presents new methods for big data reduction and analysis, with a focus on solving real-world problems in statistics, machine learning and engineering.
Chapter 1 of my thesis introduces a data reduction method for compacting large datasets (or in the infinite sense, distributions) into a smaller, representative point set called support points (SPs). SPs can be viewed as optimal sampling points for distribution representation, integration, and functional approximation. One advantage of SPs is that they provide an efficient and parallelizable reduction of big data via difference-of-convex programming. Chapter 2 then presents a modification of SPs, called projected support points (PSPs), for compacting high-dimensional datasets into representative points. The key innovation for PSPs is the use of a sparsity-inducing kernel, which allows for reduction of low-dimensional properties in high-dimensional data. We then demonstrate the effectiveness of SPs and PSPs for (a) compacting posterior samples in Bayesian computation, (b) uncertainty propagation, and (c) kernel learning with big data.
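As a rough illustration of what "representative" can mean here, the sketch below scores a candidate point set against a dataset using the energy distance between their empirical distributions, the criterion that support points are defined to minimize in the published literature (the abstract itself does not name it). The function names and the toy data are hypothetical, and the code only evaluates the criterion; it does not implement the difference-of-convex optimization mentioned above.

```python
import math

def mean_pairwise(u, v):
    """Average Euclidean distance over all pairs (a, b) with a in u and b in v."""
    return sum(math.dist(a, b) for a in u for b in v) / (len(u) * len(v))

def energy_distance(points, data):
    """Energy distance between the empirical distributions of `points` and `data`.

    Smaller values mean `points` mimics `data` more closely; it is zero
    when the two empirical distributions coincide.
    """
    return (2 * mean_pairwise(points, data)
            - mean_pairwise(points, points)
            - mean_pairwise(data, data))

# Toy one-dimensional dataset and two candidate reductions to two points.
data = [(0.0,), (1.0,), (2.0,), (3.0,)]
spread_out = [(0.5,), (2.5,)]   # covers the range of the data
clumped = [(0.0,), (0.2,)]      # sits at one end of the data

print(energy_distance(spread_out, data))
print(energy_distance(clumped, data))
```

The spread-out pair scores lower than the clumped pair, matching the intuition that a good reduction should cover the data rather than concentrate in one region.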
Chapter 3 proposes a novel variable selection method for analyzing big data, using new basis functions called conditional main effects (CMEs). CMEs capture the conditional effect of a variable at a fixed level of another variable, and represent interpretable phenomena in many engineering and social science fields. We present an algorithm, called cmenet, which employs the new principles of CME coupling and CME reduction to guide variable selection. Compared to standard interaction analysis, cmenet yields more parsimonious models and improved predictive performance, which we demonstrate using simulations and a gene association study on fly wing shape.
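To make the idea of a CME concrete, here is a minimal sketch assuming two-level factors coded as -1 and +1, a coding that is common in the CME literature but is not stated in the abstract. The helper `cme` is a hypothetical name and is not part of the cmenet algorithm; it only builds the basis column for the effect of one factor conditional on a fixed level of another.

```python
def cme(a, b, level):
    """Basis column for the effect of `a` conditional on `b` being at `level`.

    Entries equal a[i] where b[i] == level and 0 elsewhere.
    Assumes both factors are coded as -1 / +1.
    """
    return [ai if bi == level else 0 for ai, bi in zip(a, b)]

# Full 2x2 factorial in factors A and B.
a = [-1, -1, 1, 1]
b = [-1, 1, -1, 1]

a_given_b_plus = cme(a, b, +1)
a_given_b_minus = cme(a, b, -1)

# The two CMEs of A given B add up to the main effect of A,
# and their difference gives the usual A-by-B interaction column.
main_effect = [p + m for p, m in zip(a_given_b_plus, a_given_b_minus)]
interaction = [p - m for p, m in zip(a_given_b_plus, a_given_b_minus)]
print(a_given_b_plus, a_given_b_minus, main_effect, interaction)
```

Under this coding, a main effect and an interaction can be traded for a single CME column, which is one way to see how CME-based models can be more parsimonious than standard interaction models.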
Chapter 4 introduces a surrogate model for efficient prediction and uncertainty quantification of turbulent flows in swirl injectors, devices commonly used in engineering systems. Here, high-fidelity simulations require weeks of computation time, and a new method is needed to efficiently survey the desired design space. We propose a new Gaussian process surrogate model, which incorporates known physical flow properties as simplifying assumptions. This allows for efficient model training with massive simulation data (~100 GB in storage), which then enables quick flow predictions at new design settings in around an hour of computation time.
Chapter 5 considers construction algorithms for a type of experimental design called minimax designs. Minimax designs reduce a continuous design space to a set of design points, by minimizing the maximum distance from any point in this space to its nearest design point. We propose a new clustering-based construction of minimax designs on convex design regions, and demonstrate its effectiveness in simulations and a real-world sensor allocation problem. We then introduce a novel design called a minimax projection design, which yields improved minimax performance on projections of the design space.
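The minimax criterion described above can be sketched as follows, assuming the unit square as the design space and a regular grid as a finite stand-in for the continuous region. The function names and the two example designs are hypothetical; the code only evaluates the criterion and is not the clustering-based construction proposed in the thesis.

```python
import itertools
import math

def grid(n_per_axis, dim=2):
    """Regular grid on [0, 1]^dim, used to approximate the continuous space."""
    ticks = [i / (n_per_axis - 1) for i in range(n_per_axis)]
    return list(itertools.product(ticks, repeat=dim))

def minimax_distance(design, candidates):
    """Largest distance from any candidate point to its nearest design point."""
    return max(min(math.dist(x, d) for d in design) for x in candidates)

candidates = grid(21)
center_only = [(0.5, 0.5)]
two_by_two = [(0.25, 0.25), (0.25, 0.75), (0.75, 0.25), (0.75, 0.75)]

print(minimax_distance(center_only, candidates))
print(minimax_distance(two_by_two, candidates))
```

A design with a smaller value leaves no part of the space far from a design point, so the four-point design improves on the single center point. A finer grid gives a closer approximation to the continuous criterion at a higher computational cost.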
Finally, Chapter 6 presents a new active sampling method for noisy matrix completion. This method implicitly makes use of uncertainty quantification (UQ) at unobserved matrix entries to guide active sampling. Using a singular matrix-variate Gaussian model, we first reveal novel insights into the role of compressive sensing and coding design in sampling and UQ for noisy matrix completion. With these insights, we propose an efficient posterior sampler for quantifying subspace uncertainty, and an information-theoretic algorithm which uses this subspace learning to guide sampling. The effectiveness of this integrated method is then demonstrated in simulations and two collaborative filtering examples.