*********************************
There is now a CONTENT FREEZE for Mercury while we switch to a new platform. It began on Friday, March 10 at 6pm and will end on Wednesday, March 15 at noon. No new content can be created during this time, but all material in the system as of the beginning of the freeze will be migrated to the new platform, including users and groups. Functionally the new site is identical to the old one. webteam@gatech.edu
*********************************
Title: A Unified Lyapunov Framework for Finite-Sample Analysis of Reinforcement Learning Algorithms
Date: 04/07/2022
Time: 1:00 - 2:30 pm EST
Location: Groseclose 402, or virtually at https://gatech.zoom.us/j/9849731860?pwd=K29BSStGekgvYkxlK1ZRZVp1QUlLdz09 (Meeting ID: 984 973 1860 Passcode: 7n46MA).
Zaiwei Chen
Machine Learning PhD Student
School of Industrial & Systems Engineering
Georgia Institute of Technology
Committee
1 Dr. Siva Theja Maguluri (Advisor)
2 Dr. John-Paul Clarke (Co-advisor)
3 Dr. Justin Romberg
4 Dr. Ashwin Pananjady
5 Dr. Benjamin Van Roy
Abstract: In this thesis, we develop a unified Lyapunov approach for establishing finite-sample guarantees of reinforcement learning (RL) algorithms. Since most of the RL algorithms can be modeled by stochastic approximation (SA) algorithms under Markovian noise, we first provide a Lyapunov framework for analyzing Markovian SA algorithms. The key idea is to construct a novel Lyapunov function (called generalized Moreau envelop) to capture the dynamics of the corresponding SA algorithm, and establish a negative drift inequality, which then can be repeatedly used to derive finite-sample bounds. We use our SA results to design RL algorithms and perform finite-sample analysis. Specifically, for tabular RL, we establish finite-sample bounds for Q-learning, variants of on-policy TD-learning algorithms such as n-step TD and TD(\lambda), and off-policy TD-learning algorithms such as Retrace(\lambda), Q^\pi(\lambda), and V-trace, etc. As by-products, we provide theoretical insight into the problem of efficiency of bootstrapping in on-policy TD-learning, and demonstrate the bias-variance trade-off in off-policy TD. For RL with linear function approximation, we design convergent variants of Q-learning and TD-learning in the presence of the deadly triad, and derive finite-sample guarantees. The TD-learning algorithm was later used in a general policy-based framework (including approximate policy iteration and natural policy gradient) to eventually find an optimal policy of the RL algorithm with an O(\epsilon^{-2}) sample complexity.