
Wednesday, September 10, 2025 - 14:15 in V2-105/115


From the Ball-proximal (Broximal) Point Method to Efficient Training of LLMs

A talk in the Stochastic Numerics Seminar series by
Peter Richtárik

Abstract: I will present selected results from two recent related papers [2, 7]. The abstracts of both are included below:
Non-smooth and non-convex global optimization poses significant challenges across various applications, where standard gradient-based methods often struggle. We propose the Ball-Proximal Point Method, Broximal Point Method, or Ball Point Method (BPM) for short – a novel algorithmic framework inspired by the classical Proximal Point Method (PPM) [8], which, as we show, sheds new light on several foundational optimization paradigms and phenomena, including non-convex and non-smooth optimization, acceleration, smoothing, adaptive stepsize selection, and trust-region methods. At the core of BPM lies the ball-proximal ("broximal") operator, which arises from the classical proximal operator by replacing the quadratic distance penalty by a ball constraint. Surprisingly, and in sharp contrast with the sublinear rate of PPM in the non-smooth convex regime, we prove that BPM converges linearly and in a finite number of steps in the same regime. Furthermore, by introducing the concept of ball-convexity, we prove that BPM retains the same global convergence guarantees under weaker assumptions, making it a powerful tool for a broader class of potentially non-convex optimization problems. Just like PPM plays the role of a conceptual method inspiring the development of practically efficient algorithms and algorithmic elements, e.g., gradient descent, adaptive step sizes, acceleration [1], and the "W" in AdamW [9], we believe that BPM should be understood in the same manner: as a blueprint and inspiration for further development. Generalizations to non-Euclidean ball constraints can be found in the follow-up work [3].
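For readers unfamiliar with the construction, the following schematic comparison (in LaTeX) illustrates how the broximal operator is obtained from the proximal operator; the symbols γ (proximal stepsize), r (ball radius) and the label "brox" are notational choices made here for illustration, and [2] should be consulted for the precise definitions:

    % Classical proximal operator (quadratic distance penalty):
    \mathrm{prox}_{\gamma f}(x) = \arg\min_{y} \Big\{ f(y) + \tfrac{1}{2\gamma}\,\|y - x\|^2 \Big\}

    % Ball-proximal ("broximal") operator (ball constraint of radius r):
    \mathrm{brox}_{r,f}(x) = \arg\min_{y:\,\|y - x\| \le r} f(y)

    % Conceptual BPM iteration:
    x_{k+1} \in \mathrm{brox}_{r,f}(x_k)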
Recent developments in deep learning optimization have brought about radically new algorithms based on the Linear Minimization Oracle (LMO) framework, such as Muon [4] and Scion [6]. After over a decade of Adam's [5] dominance, these LMO-based methods are emerging as viable replacements, offering several practical advantages such as improved memory efficiency, better hyperparameter transferability, and, most importantly, superior empirical performance on large-scale tasks, including LLM training. However, a significant gap remains between their practical use and our current theoretical understanding: prior analyses (1) overlook the layer-wise LMO application of these optimizers in practice, and (2) rely on an unrealistic smoothness assumption, leading to impractically small stepsizes. To address both, we propose a new LMO-based method called Gluon, capturing prior theoretically analyzed methods as special cases, and introduce a new refined generalized smoothness model that captures the layer-wise geometry of neural networks, matches the layer-wise practical implementation of Muon and Scion, and leads to convergence guarantees with strong practical predictive power. Unlike prior results, our theoretical stepsizes closely match the fine-tuned values reported in [6]. Our experiments with NanoGPT and CNN confirm that our assumption holds along the optimization trajectory, ultimately closing the gap between theory and practice.
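To make the layer-wise LMO idea concrete, here is a minimal NumPy sketch of a generic per-layer update over a spectral-norm ball. The function names, the choice of spectral norm, and the hyperparameters (per-layer radii, a common stepsize) are illustrative assumptions made for this sketch; it is not the actual Muon, Scion, or Gluon implementation described in [4, 6, 7].

    import numpy as np

    def lmo_spectral_ball(G, radius):
        """LMO over a spectral-norm ball: argmin_{||S||_2 <= radius} <G, S>.
        The minimizer is -radius * U V^T, where G = U diag(s) V^T is the SVD."""
        U, _, Vt = np.linalg.svd(G, full_matrices=False)
        return -radius * (U @ Vt)

    def layerwise_lmo_step(params, grads, radii, stepsize):
        """One layer-wise LMO-based update: each layer (parameter matrix) gets
        its own norm ball, mirroring the per-layer application of such methods.
        `radii` and `stepsize` are illustrative placeholders, not tuned values."""
        new_params = {}
        for name, W in params.items():
            direction = lmo_spectral_ball(grads[name], radii[name])
            new_params[name] = W + stepsize * direction
        return new_params

The point of the sketch is the structure: the oracle is applied separately to each layer with its own geometry (here, its own radius), which is exactly the layer-wise aspect that prior analyses overlooked and that the Gluon analysis targets.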



References
[1] Kwangjun Ahn and Suvrit Sra. “Understanding Nesterov’s Acceleration via Proximal Point Method”. In: 2022 Symposium on Simplicity in Algorithms (SOSA). 2022, pp. 117–130.
[2] Kaja Gruntkowska, Hanmin Li, Aadi Rane, and Peter Richtárik. “The ball-proximal (=“broximal”) point method: a new algorithm, convergence theory, and applications”. In: arXiv preprint arXiv:2502.02002 (2025).
[3] Kaja Gruntkowska and Peter Richtárik. Non-Euclidean broximal point method: a blueprint for geometry-aware optimization. Tech. rep. KAUST, 2025.
[4] Keller Jordan, Yuchen Jin, Vlado Boza, Jiacheng You, Franz Cesista, Laker Newhouse, and Jeremy Bernstein. Muon: An optimizer for hidden layers in neural networks. 2024.
[5] Diederik P. Kingma and Jimmy Ba. “Adam: A method for stochastic opti- mization”. In: arXiv preprint arXiv:1412.6980 (2014).
[6] Thomas Pethick, Wanyun Xie, Kimon Antonakopoulos, Zhenyu Zhu, Antonio Silveti-Falls, and Volkan Cevher. “Training deep learning models with norm-constrained LMOs”. In: arXiv preprint arXiv:2502.07529 (2025).
[7] Artem Riabinin, Kaja Gruntkowska, Egor Shulgin, and Peter Richtárik. “Gluon: Making Muon and Scion great again! (Bridging theory and practice of LMO-based optimizers for LLMs)”. In: arXiv preprint arXiv:2505.13416 (2025).
[8] R. T. Rockafellar. “Monotone operators and the proximal point algorithm”. In: SIAM Journal on Control and Optimization 14.5 (1976), pp. 877–898.
[9] Z. Zhuang, M. Liu, A. Cutkosky, and F. Orabona. “Understanding AdamW through proximal methods and scale-freeness”. In: Transactions on Machine Learning Research (2022).

Within the CRC, this talk is associated with the project(s): B3


