Large language models (LLMs) demonstrate remarkable capabilities but are notoriously memory-intensive
during training, particularly with the popular AdamW optimizer. This memory burden often necessitates
using more or higher-end GPUs or reducing batch sizes, limiting training scalability and throughput,
respectively. To address this, various memory-efficient optimizers have been proposed to reduce optimizer
memory usage. However, they face key challenges: (i) reliance on costly SVD operations (e.g., GaLore,
Fira); (ii) significant performance trade-offs compared to AdamW (e.g., Flora); and (iii)
substantial remaining optimizer-state memory needed to maintain competitive performance (e.g., a rank of
1/4 the original dimension in GaLore, and a full-rank first moment in Adam-mini).
In this work, we investigate the redundancy in Adam(W)'s learning rate adaptation rule and identify that it
can be coarsened into a structured learning rate update (channel-wise or tensor-wise).
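To make this concrete, consider the following illustrative formulation (the notation is ours, not the paper's): Adam(W) gives every weight entry its own effective step size through the second-moment estimate, and this per-entry rule can be coarsened so that all entries in a channel (column) share a single scaling factor:

```latex
% Per-entry Adam(W) update vs. a channel-wise coarsening (illustrative notation)
\Delta W_{ij} = -\eta \, \frac{\hat{m}_{ij}}{\sqrt{\hat{v}_{ij}} + \epsilon}
\quad\longrightarrow\quad
\Delta W_{ij} = -\eta \, s_j \, G_{ij},
\qquad
s_j \approx \frac{\big\lVert \hat{M}_{:,j} \,/\, (\sqrt{\hat{V}_{:,j}} + \epsilon) \big\rVert_2}{\lVert G_{:,j} \rVert_2}
```

Here G is the raw gradient and s_j compresses Adam(W)'s element-wise adaptivity into one learning rate scale per channel; tensor-wise coarsening goes further and replaces s_j with a single scalar per weight tensor.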
Based on this insight, we propose a novel approach, Approximated Gradient Scaling for Memory-Efficient LLM Optimization (APOLLO),
which approximates the channel-wise learning rate scaling with an auxiliary low-rank optimizer state
obtained via pure random projection.
The structured learning rate update rule makes APOLLO highly tolerant to
further memory reduction at lower ranks: the rank can be halved while delivering similar pre-training
performance.
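A minimal PyTorch-style sketch of this idea follows; the function name, state layout, and exact form of the scaling factor are our illustrative assumptions, not the official APOLLO implementation. It runs Adam-style moment updates in a random rank-r subspace and uses the resulting column norms to derive channel-wise scales for the raw gradient:

```python
import torch

def apollo_like_step(W, G, state, lr=1e-3, rank=64, betas=(0.9, 0.999), eps=1e-8):
    """Hedged sketch of a channel-wise APOLLO-style update (names are ours).

    W, G: (m, n) weight and gradient tensors. `state` persists the random
    projection and the low-rank Adam moments, so optimizer memory is
    O(rank * n) instead of AdamW's O(m * n).
    """
    m, n = G.shape
    if "P" not in state:
        # Fixed random projection onto a rank-r subspace (no SVD needed).
        state["P"] = torch.randn(rank, m, device=G.device) / rank ** 0.5
        state["M"] = torch.zeros(rank, n, device=G.device)  # first moment
        state["V"] = torch.zeros(rank, n, device=G.device)  # second moment
        state["t"] = 0

    R = state["P"] @ G  # project the gradient: (rank, n)
    state["t"] += 1
    b1, b2 = betas
    state["M"].mul_(b1).add_(R, alpha=1 - b1)
    state["V"].mul_(b2).addcmul_(R, R, value=1 - b2)
    M_hat = state["M"] / (1 - b1 ** state["t"])
    V_hat = state["V"] / (1 - b2 ** state["t"])
    R_tilde = M_hat / (V_hat.sqrt() + eps)  # Adam-style update in the subspace

    # Channel-wise scale: ratio of updated-to-raw column norms in the subspace,
    # approximating Adam(W)'s per-channel learning rate scaling.
    s = R_tilde.norm(dim=0) / (R.norm(dim=0) + eps)  # shape (n,)
    W -= lr * G * s.unsqueeze(0)  # apply the scaled raw gradient at full rank
```

Because only the scaling statistics live in the subspace while the actual update uses the full-rank gradient, the rank here controls only how accurately the scales are estimated, which is consistent with the tolerance to rank halving noted above.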
We further propose an extremely memory-efficient variant, APOLLO-Mini, which
applies tensor-wise scaling with only a rank-1 auxiliary subspace, achieving SGD-level memory
cost while delivering pre-training performance superior to Adam(W).
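In the same illustrative spirit, the tensor-wise variant can be sketched with a rank-1 projection that produces one scalar scale per weight tensor (again, names and details are our assumptions, not the released APOLLO-Mini code):

```python
import torch

def apollo_mini_like_step(W, G, state, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
    """Hedged sketch of a tensor-wise, rank-1 APOLLO-Mini-style update."""
    m, n = G.shape
    if "p" not in state:
        state["p"] = torch.randn(m, device=G.device) / m ** 0.5  # rank-1 projection
        state["mom1"] = torch.zeros(n, device=G.device)  # first moment
        state["mom2"] = torch.zeros(n, device=G.device)  # second moment
        state["t"] = 0

    r = state["p"] @ G  # projected gradient: a single length-n vector
    state["t"] += 1
    b1, b2 = betas
    state["mom1"].mul_(b1).add_(r, alpha=1 - b1)
    state["mom2"].mul_(b2).addcmul_(r, r, value=1 - b2)
    m_hat = state["mom1"] / (1 - b1 ** state["t"])
    v_hat = state["mom2"] / (1 - b2 ** state["t"])
    r_tilde = m_hat / (v_hat.sqrt() + eps)

    # A single scalar scale for the whole tensor (tensor-wise coarsening).
    s = r_tilde.norm() / (r.norm() + eps)
    W -= lr * s * G
```

Under these assumptions, the persistent state per (m, n) tensor is one length-m projection vector and two length-n moment vectors, roughly m + 2n floats versus AdamW's 2mn, which is why the footprint approaches that of plain SGD.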
We conduct extensive experiments across different tasks and model architectures, showing that the APOLLO series performs on par with, or better than,
Adam(W), while achieving even greater memory
savings than GaLore by nearly eliminating AdamW's optimizer states.
These savings translate into significant system benefits:
- Enhanced Throughput: APOLLO and APOLLO-Mini achieve up to 3× throughput on an 8×A100-80GB setup compared to Adam
by fully utilizing memory to support 4× larger batch sizes.
- Improved Model Scalability: APOLLO-Mini, for the first
time, enables pre-training a LLaMA-13B model with naive DDP on A100-80GB GPUs without requiring other
system-level optimizations.
- Low-End GPU Pre-training: Combined with quantization, the APOLLO series for the first time enables the training of LLaMA-7B from
scratch using less than 12 GB of memory.