Asynchronous Heavy-Tailed Optimization
Heavy-tailed stochastic gradient noise, commonly observed in transformer models, can destabilize the optimization process. Recent works mainly focus on developing and understanding approaches to...
From arxiv.org
