Here are four currently ongoing research projects of my group.

## Shuffling Sampling in Stochastic Optimization

Stochastic gradient method (SGD) plays an important role in modern machine learning due to the large-scale feature. Most of the (supervised) learning problem amounts to solving an emperical risk minimization problem, i.e., minimizing a finite-sum objective \(f(w) = \frac{1}{n}\sum_{i = 1}^n f_i(w)\). Most of the existing analyses apply to the following i.i.d. sampling version of SGD:

\[w_{t+1} = w_t - \alpha_t \nabla f_{i_t} (w_t).\]Here, \(i_t\) is sampled from \(\{1,\ldots, n\}\) *uniformly at random* and every sampling is independent from the previous ones. A quick criticism of this scheme is that SGD may not visit all the data points in one epoch, resulting in inefficency in practice.

What is often implemented in practice (in PyTorch, Tensorflow, etc) is the so-called *random reshuffling* (RR), which in each epoch randomly *permutes* \(\{1,\ldots, n\}\) (denoted by \(\sigma_t\)) and updates by sweeping \(\sigma_t\) sequentially, i.e., \(i_t\) incrementally takes value in \(\sigma_t\). It is easy to see that such a method will have a much smaller sampling variance and will utilize all the data points in one epoch. Therefore, understanding RR and desinging its variants are of practical importance, which is the main topic of this project.

The following is a starting point of this project:

Convergence of Random Reshuffling Under The Kurdyka-Lojasiewicz Inequality.

Xiao Li, Andre Milzarek, Junwen Qiu.

To appear in SIAM Journal on Optimization, 2023. [arXiv]

Distributed Random Reshuffling over Networks.

Kun Huang, Xiao Li, Andre Milzarek, Shi Pu, Junwen Qiu. [arXiv]

## Actor Critic in Reinforcement Learning

Actor Critic (AC) is a popular algorithmic framework for policy-based reinforcement learning. It uses parametrization for both policy (with parameter \(\theta\)) and value function (with parameter \(w\)), and hence transfers the reward maximization task into an optimization task over \(\theta\) and \(w\). In this algorithm, the actor model (i.e. policy) maximizes the objective function using (stochastic) gradient ascent, where the stochastic gradient is calculated using the value estimated by a critic model.

In practice, AC ususally implements a *single-timescale* update scheme, namely, it adopts an alternating optimization framework over \(\theta\) and \(w\), and update each variable at every iteration for constant number of steps using the same order of step sizes for \(\theta\) and \(w\). However, most of the exisitng analyses focus on either a *double-loop* or a *two-timescale* framework. Therefore, the aim of this project is to understand the properties of single-timescale AC and designing its variants with applications to practical policy optimization problems.

The following is a starting point of this project:

Finite-Time Analysis of Decentralized Single-Timescale Actor-Critic.

Qijun Luo, Xiao Li.

To appear in Transactions on Machine Learning Research, 2023. [arXiv]

## Almost Sure Convergence Properties of Stochastic Optimization Methods

In the past years, we often care about nonasymptotic properties for stochastic optimization methods. Suppose we are using SGD to optimize a smooth nonconvex function \(f\), its finite-time complexity guarantee often has the form

\[\min_{k=0,\ldots, T} \ \mathbb{E}[\|\nabla f(x^k)\|^2] \leq \mathcal{O} \left(\frac{1}{\sqrt{T+1}}\right) \quad \text{or} \quad \mathbb{E}[\|\nabla f(x^{\bar k})\|^2] \leq \mathcal{O} \left(\frac{1}{\sqrt{T+1}}\right)\]where \(T\) denotes the total number of iterations and \(\bar k\) is an index sampled uniformly at random from \(\{0,\ldots, T\}\). Though such a result does tell us some meaningful properties of SGD in the first \(T\) iterations, it has three main drawbacks: 1) It holds only *in expectation*. Mathematically, this describes the behavior of SGD by averaging infinitely many runs. 2) It is not for the *last iterate*. 3) It tells little the *eventual behavior* of SGD as one cannot let \(T\to\infty\) in the above equation.

However, in practical situations the algorithm is often only run once and the last iterate is returned as a solution. This indicates that the above complexity result does not fully characterize the properties of SGD. This project is to study the *almost sure* eventual behavior of stochastic methods. Togetehr with the typical complexity results in expectation, we will be able to give a more complete picture for the properties of SGD.

The following is a starting point of this project:

A Unified Convergence Theorem for Stochastic Optimization Methods.

Xiao Li, Andre Milzarek.

Neural Information Processing Systems (NeurIPS 2022). [NeurIPS Link]

Convergence of Random Reshuffling Under The Kurdyka-Lojasiewicz Inequality.

Xiao Li, Andre Milzarek, Junwen Qiu.

To appear in SIAM Journal on Optimization, 2023. [arXiv]

## Nonconvex and Robust Recovery Problems

Signal recovery is common in signal processing and machine learning. A fundamental problem is to recover \(w^*\) from \(y = Xw^* + e\), given \(y\) and \(X\). I omit the dimension here for simplicity. Note that the lower case \(y\), \(w^*\), and \(e\) are not restricted to be vectors, they can also be matrices. Such a problem has various names in different communities by imposing different prior knowledge and structures on \(w^*\). A family of good examples are robust variants of PCA when \(w^*\) is assume to be a low-rank matrix.

Typical recovery guarantees were for \(\ell_2\) metric that is used for measuring how good the fitting is. This \(\ell_2\) metric approach implicitly assumes that the error \(e\) is modeled as Gaussian. However, we might have *outliers* (non-Gaussian) in practical measurement process. This motivates us to consider more robust approachs / metrics for recovering \(w^*\), which is the main topic of this project. The specific nonconvex robust recovery problems we will consider includes Robust PCA variants, rotation averaging, etc.

Here are some outcomes along this line of research:

Weakly Convex Optimization over Stiefel Manifold Using Riemannian Subgradient-Type Methods.

Xiao Li, Shixiang Chen, Zengde Deng, Qing Qu, Zhihui Zhu, Anthony Man Cho So.

SIAM Journal on Optimization, 31(3), 1605–1634, 2021. [arXiv]

Exact Recovery of Multichannel Sparse Blind Deconvolution via Gradient Descent.

Qing Qu, Xiao Li, Zhihui Zhu.

SIAM Journal on Imaging Sciences, 13(3), 1630-1652, 2020. [SIAM Link]

Nonconvex Robust Low-rank Matrix Recovery.

Xiao Li, Zhihui Zhu, Anthony Man-Cho So, Rene Vidal.

SIAM Journal on Optimization, 30(1), 660–686, 2020. [arXiv]

A Nonconvex Approach for Exact and Efficient Multichannel Sparse Blind Deconvolution.

Qing Qu, Xiao Li, Zhihui Zhu.

Neural Information Processing Systems (NeurIPS 2019). [NeurIPS Link]