A selected collection of research outcomes. The full list of publications can be found on this Google Scholar page.
Selected Articles (by topic)
Topic: LLMs and Optimization for LLMs
StreamBP: Memory-Efficient Exact Backpropagation for Long Sequence Training of LLMs.
Qijun Luo, Mengqi Li, Lei Zhao, Xiao Li.
Advances in Neural Information Processing Systems (NeurIPS 2025). [Github Repo] [arXiv Link]
>>> StreamBP is an algorithm that implements memory-efficient and exact backpropagation for training LLMs on ultra-long sequences (e.g., training reasoning models) or for scaling up batch sizes. It is both more memory- and time-efficient than standard BP with gradient checkpointing.
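A minimal toy sketch of the general chunking idea (my own illustration under simplifying assumptions, not the StreamBP algorithm itself; all sizes and names are arbitrary): the memory-heavy logits are produced and backpropagated chunk by chunk along the sequence, so the full sequence-by-vocabulary tensor is never held in memory at once while the accumulated gradients remain exact.

```python
# Toy sketch of sequence-chunked backpropagation (illustration of the general
# idea only, not the StreamBP implementation; all sizes are arbitrary).
import torch
import torch.nn.functional as F

seq_len, d_model, vocab, chunk = 8192, 256, 32000, 1024
hidden = torch.randn(seq_len, d_model, requires_grad=True)   # final hidden states
lm_head = torch.nn.Linear(d_model, vocab, bias=False)
labels = torch.randint(vocab, (seq_len,))

total_loss = 0.0
for start in range(0, seq_len, chunk):
    end = min(start + chunk, seq_len)
    logits = lm_head(hidden[start:end])           # only [chunk, vocab] is materialized
    loss = F.cross_entropy(logits, labels[start:end], reduction="sum") / seq_len
    loss.backward()   # accumulates grads on hidden / lm_head, then frees this chunk's graph
    total_loss += loss.item()
# hidden.grad and lm_head.weight.grad now equal the exact full-sequence gradients.
```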
Accelerating Block Coordinate Descent for LLM Finetuning via Landscape Expansion.
Qijun Luo, Yifei Shen, Liangzu Peng, Dongsheng Li, Xiao Li.
Advances in Neural Information Processing Systems (NeurIPS 2025).
Github repo and paper link will be available soon.
>>> We identify two limitations of BCD for LLM training: it wastes some intermediate gradients and has a narrow search direction. We address both issues simultaneously by incorporating lightweight optimization techniques such as SGD and LoRA.
BAdam: A Memory Efficient Full Parameter Optimization Method for Large Language Models.
Qijun Luo, Hengxu Yu, Xiao Li.
Advances in Neural Information Processing Systems (NeurIPS 2024). [NeurIPS Link] [Github Repo] [Slides]
>>> BAdam is a block coordinate descent optimization method with Adam's update rule for finetuning large language models. It is memory-efficient and enables training Llama 3-8B on a single RTX 3090 (24GB) GPU and Llama 3-70B on 4\(\times\)A100 GPUs.
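A minimal sketch of the block coordinate descent idea with Adam's update rule (a toy illustration under my own assumptions, not the official BAdam implementation; the model, blocks, and objective are placeholders): cycle through parameter blocks, keep only the active block trainable, and run a few Adam steps on it, so optimizer states are held only for that block.

```python
# Toy sketch of block coordinate descent with Adam updates (not the official
# BAdam code); model, data, and loss are placeholders.
import torch

model = torch.nn.Sequential(*[torch.nn.Linear(64, 64) for _ in range(4)])
blocks = [list(layer.parameters()) for layer in model]    # e.g., one block per layer
x = torch.randn(32, 64)

for epoch in range(3):
    for block_params in blocks:
        for p in model.parameters():                      # freeze all parameters ...
            p.requires_grad_(False)
        for p in block_params:                            # ... except the active block
            p.requires_grad_(True)
        opt = torch.optim.Adam(block_params, lr=1e-4)     # Adam states only for this block
        for _ in range(10):                               # a few Adam steps per block
            opt.zero_grad()
            loss = model(x).pow(2).mean()                 # placeholder objective
            loss.backward()
            opt.step()
```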
Topic: Stochastic Optimization
A New Random Reshuffling Method for Nonsmooth Nonconvex Finite-sum Optimization.
Junwen Qiu, Xiao Li, Andre Milzarek.
Journal of Machine Learning Research, published online, 2025. [JMLR Link] [arXiv]
>>> We propose a new debiased proximal random reshuffling method for nonsmooth regularized nonconvex problems. We establish complexity and convergence results similar to those of random reshuffling for smooth problems.
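For orientation, a minimal sketch of the baseline scheme on a toy \(\ell_1\)-regularized least-squares problem (vanilla proximal random reshuffling only; this does not include the debiasing mechanism proposed in the paper, and the problem and constants are arbitrary):

```python
# Vanilla proximal random reshuffling on a toy l1-regularized least-squares
# problem (baseline scheme for orientation; not the paper's debiased variant).
import numpy as np

rng = np.random.default_rng(0)
n, d, lam, lr = 100, 20, 0.01, 0.01
A, b = rng.standard_normal((n, d)), rng.standard_normal(n)

def soft_threshold(z, tau):                       # prox of tau * ||.||_1
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

x = np.zeros(d)
for epoch in range(50):
    for i in rng.permutation(n):                  # reshuffle: a fresh order each epoch
        grad_i = (A[i] @ x - b[i]) * A[i]         # gradient of the i-th component
        x = soft_threshold(x - lr * grad_i, lr * lam)
```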
High Probability Guarantees for Random Reshuffling.
Hengxu Yu, Xiao Li.
[arXiv Preprint]
>>> We establish a set of high-probability finite-time complexity guarantees for RR, including finding a stationary point, designing a stopping criterion that yields a last-iterate result, and avoiding saddle points.
Convergence of Random Reshuffling Under The Kurdyka-Łojasiewicz Inequality.
Xiao Li, Andre Milzarek, Junwen Qiu.
SIAM Journal on Optimization, 33(2), 1092-1120, 2023. [SIAM Link] [arXiv] [Slides]
>>> We establish sequence convergence results for random reshuffling (a stochastic method). The key insights are: 1) deriving subsequence convergence using diminishing step sizes and 2) combining diminishing step sizes with the traditional KL analysis.
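For context, the Kurdyka-Łojasiewicz (KL) inequality referred to above can be recalled in its standard form (generic notation, not necessarily matching the paper's):
\[
\varphi'\big(f(x) - f(\bar{x})\big)\,\operatorname{dist}\big(0, \partial f(x)\big) \;\ge\; 1
\]
for all \(x\) in a neighborhood of \(\bar{x}\) with \(f(\bar{x}) < f(x) < f(\bar{x}) + \eta\), where \(\varphi\) is a concave desingularizing function with \(\varphi(0) = 0\).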
A Unified Convergence Theorem for Stochastic Optimization Methods.
Xiao Li, Andre Milzarek.
Advances in Neural Information Processing Systems (NeurIPS 2022). [NeurIPS Link] [Slides]
>>> In this work, we provide a fundamental convergence theorem and apply it to obtain almost sure convergence results for SGD, RR, prox-SGD, and stochastic model-based methods.
Topic: Nonsmooth and/or Nonconvex Optimization
Revisiting Subgradient Method: Complexity and Convergence Beyond Lipschitz Continuity.
Xiao Li, Lei Zhao, Daoli Zhu, Anthony Man-Cho So.
Vietnam Journal of Mathematics, 2024. [Springer Link] [arXiv]
(Invited article dedicated to Prof. Tamás Terlaky on the occasion of his 70th birthday)
>>> The subgradient method and some of its variants possess convergence properties even without any Lipschitz continuity assumption.
ReSync: Riemannian Subgradient-based Robust Rotation Synchronization.
Huikang Liu, Xiao Li, Anthony Man-Cho So.
Advances in Neural Information Processing Systems (NeurIPS 2023). [NeurIPS Link]
>>> We present ReSync, a Riemannian subgradient-based method for solving the nonconvex nonsmooth robust rotation synchronization problem, and provide linear convergence guarantees for ReSync in terms of finding the underlying rotations.
Weakly Convex Optimization over Stiefel Manifold Using Riemannian Subgradient-Type Methods.
Xiao Li, Shixiang Chen, Zengde Deng, Qing Qu, Zhihui Zhu, Anthony Man-Cho So.
SIAM Journal on Optimization, 31(3), 1605–1634, 2021. [SIAM Link] [arXiv]
>>> We provide the first complexity/convergence results for Riemannian subgradient-type methods. The key insight is that a weakly convex function restricted to a smooth embedded manifold is still weakly convex.
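For context, a standard definition recalled here (not a restatement of the paper's exact setting): a function \(f\) is \(\tau\)-weakly convex if
\[
x \;\mapsto\; f(x) + \frac{\tau}{2}\|x\|_2^2
\]
is convex, and the insight above is that this property persists when \(f\) is restricted to a smooth embedded submanifold such as the Stiefel manifold.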
Incremental Methods for Weakly Convex Optimization.
Xiao Li, Zhihui Zhu, Anthony Man-Cho So, Jason D Lee.
NeurIPS 2020 Workshop (OPT 2020). [arXiv]
>>> We analyze the incremental subgradient, proximal point, and prox-linear methods. The typical \(\mathcal{O}(\varepsilon^{-4})\) complexity and a local linear rate (under a sharpness condition) are established.
Nonconvex Robust Low-rank Matrix Recovery.
Xiao Li, Zhihui Zhu, Anthony Man-Cho So, Rene Vidal.
SIAM Journal on Optimization, 30(1), 660–686, 2020. [SIAM Link] [arXiv] [Code]
>>> We propose the robust nonconvex matrix recovery problem and show that the subgradient method converges linearly to the optima. The key insight is that the recovery condition "\(\ell_1/\ell_2\)-RIP" leads to favorable optimization properties, namely weak convexity and sharpness.
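A minimal toy sketch of this kind of method (a subgradient step on a factored \(\ell_1\) loss with a geometrically decaying step size; this is my own illustration with an arbitrary symmetric formulation and constants, not the paper's code or exact setting):

```python
# Toy subgradient method with geometrically decaying steps for
# min_U (1/m) * || A(U U^T) - y ||_1  (illustration only; the formulation and
# constants here are assumptions, not taken from the paper).
import numpy as np

rng = np.random.default_rng(1)
d, r, m = 30, 3, 600
U_star = rng.standard_normal((d, r))                  # ground-truth factor
A = rng.standard_normal((m, d, d))                    # random measurement matrices
y = np.einsum('kij,ij->k', A, U_star @ U_star.T)
out = rng.choice(m, m // 10, replace=False)
y[out] += 10.0 * rng.standard_normal(out.size)        # sparse outliers in the measurements

U = U_star + 0.3 * rng.standard_normal((d, r))        # initialize near the ground truth
step, decay = 1e-3, 0.98                              # geometric step-size schedule
for k in range(200):
    residual = np.einsum('kij,ij->k', A, U @ U.T) - y
    S = np.einsum('k,kij->ij', np.sign(residual), A) / m   # subgradient of the l1 part
    U = U - step * (decay ** k) * (S + S.T) @ U             # chain rule through U -> U U^T
```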