Dear faculty members and fellow students,
You are cordially invited to attend my thesis defense.
Thesis Title: Theoretical Analysis of Stochastic Gradient Descent in Stochastic Optimization
Advisors:
Dr. Enlu Zhou, School of Industrial and Systems Engineering, Georgia Tech
Dr. Tuo Zhao, School of Industrial and Systems Engineering, Georgia Tech
Committee members:
Dr. Alexander Shapiro, School of Industrial and Systems Engineering, Georgia Tech
Dr. Robert D. Foley, School of Industrial and Systems Engineering, Georgia Tech
Dr. Zhengyuan Zhou, Stern School of Business, New York University
Date and Time: 10:00 am (EDT), Friday, April 9, 2021
Meeting URL: https://bluejeans.com/287756905
Meeting ID: 287 756 905 (BlueJeans)
Abstract:
Stochastic Gradient Descent (SGD) type algorithms have been widely applied to many
stochastic optimization problems, such as those arising in machine learning. Despite
their empirical success, the convergence properties of SGD and its variants are still
not well understood theoretically. The major bottleneck is the combination of a highly
nonconvex optimization landscape with a complicated noise structure. This thesis aims
to provide insight into the strong performance of SGD type algorithms through
theoretical analysis based on diffusion approximation and martingale theory.
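To fix ideas, in the diffusion approach the discrete SGD iterates with a small step
size \eta are modeled by a stochastic differential equation; the scaling below is the
standard heuristic form, stated here only for illustration:

    dX_t = -\nabla F(X_t)\,dt + \sqrt{\eta}\,\Sigma(X_t)^{1/2}\,dW_t

where F is the objective, \Sigma(x) is the covariance of the gradient noise at x, and
W_t is a standard Brownian motion. The continuous-time process can then be studied
with stochastic calculus and martingale arguments. Specifically, we answer the
following questions: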
Chapter 1: What is the effect of momentum in nonconvex optimization? We propose to
analyze the algorithmic behavior of Momentum Stochastic Gradient Descent (MSGD) by
diffusion approximation for general nonconvex optimization problems. Our study shows
that momentum helps the iterates escape from saddle points but hurts convergence
within a neighborhood of an optimum (unless the step size or the momentum parameter
is annealed). This theoretical finding partially corroborates the empirical success
of MSGD in training deep neural networks.
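As a concrete, purely illustrative sketch of the algorithm under study, here is
heavy-ball MSGD on a toy saddle; the quadratic objective, Gaussian noise model, and
hyperparameters are assumptions of this sketch, not the setting analyzed in the chapter.

    import numpy as np

    def msgd(grad, x0, lr=0.01, momentum=0.9, noise_std=0.1, n_iters=300, seed=0):
        """Momentum SGD (heavy-ball form) with additive Gaussian gradient noise."""
        rng = np.random.default_rng(seed)
        x = np.asarray(x0, dtype=float)
        v = np.zeros_like(x)
        for _ in range(n_iters):
            g = grad(x) + noise_std * rng.standard_normal(x.shape)  # stochastic gradient
            v = momentum * v - lr * g  # momentum accumulates past descent directions
            x = x + v                  # larger effective steps help leave saddle regions
        return x

    # Toy saddle: f(x, y) = x^2 - y^2 has a saddle point at the origin.
    x_final = msgd(lambda z: np.array([2.0 * z[0], -2.0 * z[1]]), x0=[0.01, 0.01])
    # |x_final[1]| grows large: the iterates escape along the descent direction y.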
Chapter 2: How does noise in SGD help the algorithm avoid spurious local optima?
We answer this question through a simple two-layer convolutional neural network model
that has one spurious local optimum and one global optimum. Our theory shows that
perturbed gradient descent and perturbed mini-batch stochastic gradient algorithms,
in conjunction with noise annealing, are guaranteed to converge to the global optimum
in polynomial time from an arbitrary initialization. This implies that the noise
enables the algorithm to escape efficiently from the spurious local optimum.
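The mechanism can be sketched in one dimension; the toy objective and the geometric
annealing schedule below are illustrative assumptions, not the CNN model or the
schedule analyzed in the chapter.

    import numpy as np

    def perturbed_gd(grad, x0, lr=0.05, sigma0=1.0, decay=0.999, n_iters=5000, seed=0):
        """Gradient descent with injected Gaussian noise that is annealed over time."""
        rng = np.random.default_rng(seed)
        x = np.asarray(x0, dtype=float)
        sigma = sigma0
        for _ in range(n_iters):
            x = x - lr * (grad(x) + sigma * rng.standard_normal(x.shape))
            sigma *= decay  # annealing: early noise explores, late iterates settle
        return x

    # f(x) = x^4/4 - x^2/2 + 0.3x has a spurious local minimum near x = 0.78 and
    # the global minimum near x = -1.13; initialize in the spurious basin.
    grad = lambda x: x**3 - x + 0.3
    print(perturbed_gd(grad, x0=np.array([1.0])))  # typically lands near the global minimum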
Chapter 3: How does noise in SGD help select optima with good generalization
performance? We further investigate the role of noise when multiple global optima
exist by considering the nonconvex rectangular matrix factorization problem, which
has infinitely many global minima due to rotation and scaling invariance. Gradient
descent (GD) can converge to any of these optima, depending on the initialization.
In contrast, we show that a perturbed form of GD with an arbitrary initialization
converges to a global optimum that is uniquely determined by the injected noise.
Our result implies that the noise imposes an implicit bias toward certain optima.
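A rough sketch of this setting: for the objective ||X Y^T - M||_F^2, any global
minimizer can be mapped to another via (X, Y) -> (X R, Y R^{-T}) for invertible R,
and perturbed GD singles one out. The rank, step size, and noise level below are
illustrative assumptions.

    import numpy as np

    def perturbed_gd_factorize(M, r, lr=0.01, sigma=0.01, n_iters=2000, seed=0):
        """Perturbed gradient descent on min_{X, Y} ||X @ Y.T - M||_F^2."""
        rng = np.random.default_rng(seed)
        m, n = M.shape
        X = rng.standard_normal((m, r))
        Y = rng.standard_normal((n, r))
        for _ in range(n_iters):
            E = X @ Y.T - M                                        # residual
            gX = 2 * E @ Y + sigma * rng.standard_normal(X.shape)  # perturbed gradients
            gY = 2 * E.T @ X + sigma * rng.standard_normal(Y.shape)
            X, Y = X - lr * gX, Y - lr * gY
        return X, Y

    # Example: factorize a rank-2, 5 x 4 matrix; the residual norm ends up small,
    # and among the infinitely many factorizations the injected noise biases the
    # run toward particular ones, as the chapter makes precise.
    rng = np.random.default_rng(1)
    M = rng.standard_normal((5, 2)) @ rng.standard_normal((2, 4))
    X, Y = perturbed_gd_factorize(M, r=2)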
Chapter 4: Does reusing past samples in SGD improve efficiency in simulation
optimization? We consider a special type of stochastic optimization problem,
simulation optimization, whose main challenge is the limited simulation budget
imposed by the high computational cost of simulation experiments. One approach to
overcoming this challenge is to reuse simulation outputs from previous iterations in
the current iteration of the optimization procedure. However, because of the
dependence among iterations, replications from different iterations are not
independent, and the approach's good empirical performance has therefore lacked
theoretical justification. We fill this gap by theoretically studying stochastic
gradient descent with reuse of past simulation replications. We show that reusing
past replications does not change the convergence of the algorithm, because the bias
of the resulting gradient estimator is asymptotically negligible. Moreover, we show
that reusing past replications reduces the variance of the gradient estimator around
local optima, which implies that the algorithm can achieve faster convergence.
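A rough illustration of the reuse idea (the fixed averaging window and the toy
simulator below are assumptions of this sketch; the thesis analyzes a specific reuse
scheme for simulation replications):

    from collections import deque
    import numpy as np

    def sgd_with_reuse(sim_grad, x0, lr=0.01, window=10, n_iters=2000, seed=0):
        """SGD that averages gradient replications from the last `window` iterations."""
        rng = np.random.default_rng(seed)
        x = np.asarray(x0, dtype=float)
        past = deque(maxlen=window)  # buffer of replications from recent iterations
        for _ in range(n_iters):
            past.append(sim_grad(x, rng))        # one new simulation replication
            g = np.mean(np.stack(past), axis=0)  # reuse: average old and new replications
            x = x - lr * g
        return x

    # Toy simulator: a noisy gradient of f(x) = ||x||^2 with heavy observation noise.
    sim_grad = lambda x, rng: 2.0 * x + 5.0 * rng.standard_normal(x.shape)
    x_final = sgd_with_reuse(sim_grad, x0=np.ones(3))
    # Near the optimum the iterates move slowly, so stale replications stay nearly
    # unbiased while averaging cuts the estimator's variance, echoing the chapter.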