publications | Xiao Li

A selected collection of research outcomes. This Google Scholar page contains more publications.

Selected Articles (by topic)

Topic: LLMs and Optimization for LLMs

StreamBP: Memory-Efficient Exact Backpropagation for Long Sequence Training of LLMs.
Qijun Luo, Mengqi Li, Lei Zhao, Xiao Li.
[Github Repo] [arXiv Preprint]
>>> StreamBP is an algorithm that implements memory-efficient, exact backpropagation for training LLMs on ultra-long sequences (e.g., training reasoning models) or for scaling up batch sizes. It is more memory- and time-efficient than standard BP with gradient checkpointing.

DPO-Shift: Shifting the Distribution of Direct Preference Optimization.
Xiliang Yang, Feng Jiang, Qianen Zhang, Lei Zhao, Xiao Li.
[Github Repo] [arXiv Preprint]
>>> DPO-Shift mitigates the likelihood displacement issue of DPO through a simple, theoretically grounded approach. Thorough experiments show that DPO-Shift is effective across various models and datasets.

BAdam: A Memory Efficient Full Parameter Optimization Method for Large Language Models.
Qijun Luo, Hengxu Yu, Xiao Li.
Advances in Neural Information Processing Systems (NeurIPS 2024). [NeurIPS Link] [Github Repo] [Slides]
>>> BAdam is a block coordinate descent optimization method with Adam’s update rule for finetuning large language models. It is memory-efficient and allows training Llama 3-8B on a single RTX3090-24GB GPU and training Llama 3-70B on 4\(\times\)A100 GPUs.
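
The block coordinate descent idea with Adam-style updates can be illustrated on a toy problem. The following is a minimal sketch, not the authors' implementation: the quadratic objective, the partition into two blocks, the hyperparameters, and the choice to reset the moment estimates on every block visit are all assumptions made for illustration. The point it shows is that optimizer state is only kept for the active block.

```python
import math

def loss(x, target):
    # Toy objective (assumed for illustration): 0.5 * ||x - target||^2.
    return 0.5 * sum((xi - ti) ** 2 for xi, ti in zip(x, target))

def grad(x, target):
    # Gradient of the toy objective. A practical method would only
    # materialize the gradient of the active block.
    return [xi - ti for xi, ti in zip(x, target)]

def block_adam(x, target, blocks, cycles=6, k_steps=50,
               lr=0.05, beta1=0.9, beta2=0.999, eps=1e-8):
    x = list(x)
    for _ in range(cycles):
        for block in blocks:
            # Moment estimates exist only for the active block and are
            # reset on every visit (an assumption of this sketch).
            m = {i: 0.0 for i in block}
            v = {i: 0.0 for i in block}
            for t in range(1, k_steps + 1):
                g = grad(x, target)
                for i in block:
                    m[i] = beta1 * m[i] + (1 - beta1) * g[i]
                    v[i] = beta2 * v[i] + (1 - beta2) * g[i] ** 2
                    m_hat = m[i] / (1 - beta1 ** t)  # bias correction
                    v_hat = v[i] / (1 - beta2 ** t)
                    x[i] -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return x

target = [1.0, -2.0, 3.0, -4.0]
x0 = [0.0, 0.0, 0.0, 0.0]
blocks = [[0, 1], [2, 3]]  # hypothetical partition of the coordinates

x_final = block_adam(x0, target, blocks)
initial_loss = loss(x0, target)
final_loss = loss(x_final, target)
```

In this sketch the dictionaries `m` and `v` never hold more entries than the size of one block, which is the source of the memory saving relative to running Adam on all coordinates at once.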

Topic: Stochastic Optimization

High Probability Guarantees for Random Reshuffling.
Hengxu Yu, Xiao Li.
[arXiv Preprint]
>>> We established a set of high-probability finite-time complexity guarantees for RR, including finding a stationary point, designing a stopping criterion that yields a last-iterate result, and avoiding saddle points.

Convergence of Random Reshuffling Under The Kurdyka-Lojasiewicz Inequality.
Xiao Li, Andre Milzarek, Junwen Qiu.
SIAM Journal on Optimization, 33(2), 1092–1120, 2023. [SIAM Link] [arXiv] [Slides]
>>> Sequence convergence results are established for (stochastic) random reshuffling. The key insights are to 1) derive subsequence convergence using diminishing step sizes and 2) combine diminishing step sizes with the traditional KL analysis.
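
For readers unfamiliar with random reshuffling (RR), the following is a minimal sketch of the iteration with diminishing step sizes on a one-dimensional toy problem. The data, the step-size schedule `0.5 / k`, and the function names are assumptions chosen for illustration and are not taken from the paper. Each epoch draws a fresh permutation and visits every component gradient exactly once, in contrast to SGD, which samples with replacement.

```python
import random

def random_reshuffling(grads, x0, epochs, step_size, seed=0):
    """Run RR on a scalar problem min_x (1/n) * sum_i f_i(x).

    grads: list of component gradient functions, one per f_i.
    step_size: function mapping the epoch index k (starting at 1)
               to the step size used throughout that epoch.
    """
    rng = random.Random(seed)
    x = x0
    n = len(grads)
    for k in range(1, epochs + 1):
        alpha = step_size(k)
        order = list(range(n))
        rng.shuffle(order)        # fresh permutation every epoch
        for i in order:           # each component is used exactly once
            x -= alpha * grads[i](x)
    return x

# Toy components f_i(x) = 0.5 * (x - a_i)^2; the minimizer of their
# average is the mean of the a_i, which is 3.0 for this data.
data = [1.0, 2.0, 3.0, 6.0]
grads = [lambda x, a=a: x - a for a in data]

x_final = random_reshuffling(grads, x0=0.0, epochs=2000,
                             step_size=lambda k: 0.5 / k)
```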

A Unified Convergence Theorem for Stochastic Optimization Methods.
Xiao Li, Andre Milzarek.
Advances in Neural Information Processing Systems (NeurIPS 2022). [NeurIPS Link] [Slides]
>>> In this work, we provide a fundamental convergence theorem and apply it to obtain almost sure convergence results for SGD, RR, prox-SGD, and stochastic model-based methods.

Topic: Nonsmooth and/or Nonconvex Optimization

Revisiting Subgradient Method: Complexity and Convergence Beyond Lipschitz Continuity.
Xiao Li, Lei Zhao, Daoli Zhu, Anthony Man-Cho So.
Vietnam Journal of Mathematics, 2024. [Springer Link] [arXiv]
(Invited article dedicated to Prof. Tamás Terlaky on the occasion of his 70th birthday)
>>> The subgradient method and some of its variants possess convergence properties without any Lipschitz continuity assumption.

ReSync: Riemannian Subgradient-based Robust Rotation Synchronization.
Huikang Liu, Xiao Li, Anthony Man-Cho So.
Advances in Neural Information Processing Systems (NeurIPS 2023). [NeurIPS Link]
>>> We present ReSync, a Riemannian subgradient-based method for solving the nonconvex nonsmooth robust rotation synchronization problem, and provide linear convergence guarantees for ReSync in terms of finding the underlying rotations.

Weakly Convex Optimization over Stiefel Manifold Using Riemannian Subgradient-Type Methods.
Xiao Li, Shixiang Chen, Zengde Deng, Qing Qu, Zhihui Zhu, Anthony Man-Cho So.
SIAM Journal on Optimization, 31(3), 1605–1634, 2021. [SIAM Link] [arXiv]
>>> We provide the first complexity/convergence results for Riemannian subgradient-type methods. The key insight is that a weakly convex function restricted to a smooth embedded manifold is still weakly convex.

Incremental Methods for Weakly Convex Optimization.
Xiao Li, Zhihui Zhu, Anthony Man-Cho So, Jason D Lee.
NeurIPS 2020 Workshop (OPT 2020). [arXiv]
>>> We analyzed the incremental subgradient, proximal point, and prox-linear methods. The typical \(\mathcal{O}(\varepsilon^{-4})\) complexity and a local linear rate (under a sharpness condition) are established.

Nonconvex Robust Low-rank Matrix Recovery.
Xiao Li, Zhihui Zhu, Anthony Man-Cho So, Rene Vidal.
SIAM Journal on Optimization, 30(1), 660–686, 2020. [SIAM Link] [arXiv] [Code]
>>> We propose the robust nonconvex matrix recovery problem and show that the subgradient method converges linearly to the optima. The key insight is that the recovery condition “\(\ell_1/\ell_2\)-RIP” leads to two optimization properties: weak convexity and sharpness.
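
The role of sharpness in obtaining a linear rate can be seen on a much simpler problem than matrix recovery. The following is a minimal sketch, assuming a one-dimensional robust location problem with an \(\ell_1\) loss; the data, the initial step size `alpha0`, and the decay factor `rho` are hypothetical choices for illustration, not the setting or parameters of the paper. The objective is sharp around its minimizer (the median, 2.0 here), and the subgradient method with geometrically decaying step sizes drives the error down at a geometric rate.

```python
def sign(d):
    # Returns -1, 0, or 1.
    return (d > 0) - (d < 0)

def subgradient(x, data):
    # A subgradient of f(x) = (1/n) * sum_i |x - a_i|.
    return sum(sign(x - a) for a in data) / len(data)

def subgradient_method(data, x0, iters=100, alpha0=1.0, rho=0.9):
    # Step sizes alpha_k = alpha0 * rho**k decay geometrically.
    x = x0
    for k in range(iters):
        x -= alpha0 * rho ** k * subgradient(x, data)
    return x

# Seven clean measurements of the value 2.0 plus three gross outliers.
data = [2.0] * 7 + [10.0, -5.0, 40.0]

x_hat = subgradient_method(data, x0=0.0)
```

For this data the subgradient has magnitude at least 0.6 at every point near 2.0 other than 2.0 itself, which is the one-dimensional analogue of a sharpness condition; with a decay factor that is too aggressive the iterates can stall before reaching the minimizer.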