*********************************
There is now a CONTENT FREEZE for Mercury while we switch to a new platform. It began on Friday, March 10 at 6pm and will end on Wednesday, March 15 at noon. No new content can be created during this time, but all material in the system as of the beginning of the freeze will be migrated to the new platform, including users and groups. Functionally the new site is identical to the old one. webteam@gatech.edu
*********************************
Title: Understanding Modern Techniques in Optimization: Momentum, Over-parametrization, and Accelerated Gradient Methods.
Jun-Kun Wang
PhD Student
School of Computer Science
College of Computing
Georgia Institute of Technology
Date: Monday, March 23th, 2019
Time: 13:00 pm
Location: Klaus 2100
Location: Cisco Webex
Meeting number (access code): 792 819 876
Meeting password: proposal
Meeting link:
https://jimwang-51.my.webex.com/jimwang-51.my/j.php?MTID=mebff15d08dc9cbe20f283f8247761399
Committee:
Dr. Jacob Abernethy (Advisor, School of Computer Science, Georgia Institute of Technology)
Dr. Le Song (School of Computational Science and Engineering, Georgia Institute of Technology)
Dr. Guanghui (George) Lan (School of Industrial and Systems Engineering, Georgia Institute of Technology)
Abstract:
Over the past decade, we have seen many success stories emerge from progress in deep learning. On the the theory side, however, we have remained far behind in understanding how and why deep learning works. One pertinent example is the performance of training methods: currently stochastic gradient descent (SGD) with momentum is the de facto champion of optimization methods, as many have observed that it is significantly faster than vanilla SGD for training neural nets. At the same time, a solid understanding of this phenomenon has remained elusive in non-convex optimization. A common heuristic is to choose a large value for the momentum parameter (e.g. \beta=0.9), for example, yet the literature offers little in the way of theoretical justification.
My thesis proposal will consist of three parts. The first part is about momentum. I will show that Polyak's momentum, the most popular one and the default momentum method in PyTorch, can be provably shown to lead to an acceleration compared to vanilla gradient descent for a broad class of non-convex optimization problems. Moreover, the larger the momentum parameter, the faster the global convergence, up to a certain threshold. I will talk about the implications of these results and also show that momentum can help escaping saddle points faster or finding a second order stationary point faster. The second part of my thesis proposal will be over-parametrization. It is widely observed that over-parametrization leads to an acceleration for training neural nets in practice. However, the underlying reason for the acceleration remains mysterious. I will propose understanding over-parametrization through a toy example. By over-parametrizing an objective function in certain ways, gradient descent can be shown to reaching at a desired solution faster.
The final part of the proposal involves understanding accelerated gradient methods through the lens of a two-player zero-sum game. While a game-theoretic interpretation of accelerated gradient methods has been previously observed (e.g. Lan and Zhou 2018), we show that this perspective provides a stronger connection between optimization and online learning/no-regret learning. I will show that if one pits two no-regret algorithms against each other in a particular zero-sum game, combined with the appropriate reweighting scheme, one can recover many existing accelerated gradient methods from the resulting game dynamics. These include Nesterov's method (Nesterov 1983;1988;2005) and accelerated proximal method (Bech and Teboulle 2009). Furthermore, the sum of the weighted average regret of both players implies the convergence rate, which leads to a simple alternative acceleration convergence proof. Our framework is very modular, and one can swap in/out different no-regret learning strategies to obtain various gradient methods with appropriate theoretical guarantees. In addition, I will show that the game approach also recovers Frank-Wolfe method and leads to some new accelerated Frank-Wolfe-like methods on certain constraint sets, answering an open question of Garber and Hazan, 2015.