APOLLO: SGD-like Memory, AdamW-level Performance

📢 Accepted at MLSys 2025

1 The University of Texas at Austin, 2 Meta AI. *Equal contribution; equal advising.

Pre-training LLaMA-7B on the C4 dataset for 150K steps; validation perplexity is reported at 40K, 80K, 120K, and 150K steps, and the final row lists the corresponding training tokens in billions.

| Optimizer | Memory | 40K | 80K | 120K | 150K |
|---|---|---|---|---|---|
| 8-bit Adam | 13G | 18.09 | 15.47 | 14.83 | 14.61 |
| 8-bit GaLore | 4.9G | 17.94 | 15.39 | 14.95 | 14.65 |
| APOLLO | 1.6G | 17.55 | 14.39 | 13.23 | 13.02 |
| APOLLO-Mini | 0.0G | 18.03 | 14.60 | 13.32 | 13.09 |
| Tokens (B) | | 5.2 | 10.5 | 15.7 | 19.7 |

The APOLLO optimizer significantly reduces memory usage while achieving the best perplexity in pre-training.

🔥 News

  • [2025/2] APOLLO is officially accepted to [MLSys 2025](https://mlsys.org/)!
  • [2025/2] Try APOLLO with the Hugging Face Trainer! Our APOLLO optimizer is now integrated into Hugging Face Transformers (see the sketch after this list).
  • [2025/1] Try APOLLO for memory-efficient LLM full-parameter fine-tuning! Our APOLLO optimizer is integrated into LLaMA-Factory!
  • [2024/12] Try APOLLO—just install via pip! Our APOLLO optimizer is now live and can be easily installed using pip. Check it out on PyPI!
  • [2024/12] APOLLO validated by third-party Julia implementation! Our APOLLO optimizer has been independently validated by a third party using a Julia implementation. Check out the post. They are also working to integrate APOLLO into FluxML.
  • [2024/12] Paper is now on arXiv!
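
For a quick start after the pip install, below is a minimal fine-tuning sketch with the Hugging Face Trainer. The optimizer string `apollo_adamw`, the PyPI package name `apollo-torch`, the `gpt2` checkpoint, and the target-module regexes are assumptions on our part (mirroring how the GaLore integration is exposed); consult the Transformers documentation and the PyPI page for the exact names.

```python
# Sketch: memory-efficient fine-tuning with APOLLO through the Hugging Face Trainer.
# Assumed names (check the Transformers docs / PyPI for the real ones):
#   pip install transformers datasets apollo-torch
from datasets import Dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          Trainer, TrainingArguments)

model_id = "gpt2"  # placeholder; any causal LM supported by Transformers
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained(model_id)

# Tiny toy dataset so the example runs end to end.
texts = ["APOLLO keeps optimizer memory close to SGD.",
         "A structured learning rate update needs only a low-rank state."] * 16

def tokenize(batch):
    enc = tokenizer(batch["text"], truncation=True, padding="max_length", max_length=64)
    enc["labels"] = enc["input_ids"].copy()
    return enc

train_ds = Dataset.from_dict({"text": texts}).map(
    tokenize, batched=True, remove_columns=["text"])

args = TrainingArguments(
    output_dir="apollo-demo",
    per_device_train_batch_size=8,
    max_steps=20,
    learning_rate=1e-4,
    logging_steps=5,
    optim="apollo_adamw",                            # assumed optimizer string
    optim_target_modules=[r".*attn.*", r".*mlp.*"],  # layers handled by APOLLO (regexes)
)

Trainer(model=model, args=args, train_dataset=train_ds).train()
```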

Abstract

Large language models (LLMs) demonstrate remarkable capabilities but are notoriously memory-intensive during training, particularly with the popular AdamW optimizer. This memory burden often necessitates using more or higher-end GPUs or reducing batch sizes, limiting training scalability and throughput, respectively. To address this, various memory-efficient optimizers have been proposed to reduce optimizer memory usage. However, they face key challenges: (i) reliance on costly SVD operations (e.g., GaLore, Fira); (ii) significant performance trade-offs compared to AdamW (e.g., Flora); and (iii) still-substantial memory overhead from optimizer states needed to maintain competitive performance (e.g., 1/4 rank in GaLore and the full-rank first moment in Adam-mini).

In this work, we investigate the redundancy in Adam(W)'s learning rate adaptation rule and identify that it can be coarsened into a structured learning rate update (channel-wise or tensor-wise). Based on this insight, we propose a novel approach, Approximated Gradient Scaling for Memory Efficient LLM Optimization (APOLLO), which approximates channel-wise learning rate scaling with an auxiliary low-rank optimizer state based on pure random projection. The structured learning rate update rule makes APOLLO highly tolerant to further memory reduction at lower rank, halving the rank while delivering similar pre-training performance. We further propose an extreme memory-efficient version, APOLLO-Mini, which uses tensor-wise scaling with only a rank-1 auxiliary sub-space, achieving SGD-level memory cost yet superior pre-training performance compared to Adam(W).

We conduct extensive experiments across different tasks and model architectures, showing that the APOLLO series performs generally on par with, or even better than, Adam(W). Meanwhile, APOLLO achieves even greater memory savings than GaLore by almost eliminating AdamW's optimizer states (a per-matrix estimate follows the list below). These savings translate into significant system benefits:

  1. Enhanced Throughput: APOLLO and APOLLO-Mini achieve up to 3× throughput on an 8×A100-80GB setup compared to Adam by fully utilizing the freed memory to support 4× larger batch sizes.
  2. Improved Model Scalability: APOLLO-Mini, for the first time, enables pre-training the LLaMA-13B model with naive DDP on A100-80GB GPUs without requiring other system-level optimizations.
  3. Low-End GPU Pre-training: Combined with quantization, the APOLLO series, for the first time, enables training LLaMA-7B from scratch using less than 12 GB of memory.
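
To see why the optimizer states nearly vanish, here is a rough per-matrix estimate of optimizer-state memory under AdamW, APOLLO, and APOLLO-Mini. The matrix shape, the example rank of 256, and the fp32 byte count are our own illustrative assumptions, not figures from the paper, so treat the output only as an order-of-magnitude comparison.

```python
# Back-of-the-envelope optimizer-state size for one LLaMA-7B-style weight matrix.
# Assumptions (ours, for illustration only): AdamW keeps two fp32 moments shaped
# like the weight; APOLLO keeps its two moments only for the rank-r projected
# gradient (r x m); APOLLO-Mini uses a rank-1 projection. The matrix shape and
# the rank are example values, not the paper's exact configuration.
n, m = 4096, 11008        # e.g. an MLP projection matrix in LLaMA-7B
r = 256                   # example APOLLO rank
bytes_fp32 = 4

adamw_state  = 2 * n * m * bytes_fp32
apollo_state = 2 * r * m * bytes_fp32
mini_state   = 2 * 1 * m * bytes_fp32

print(f"AdamW states:      {adamw_state  / 2**20:9.1f} MiB")
print(f"APOLLO (r={r}):    {apollo_state / 2**20:9.1f} MiB  ({apollo_state / adamw_state:.1%} of AdamW)")
print(f"APOLLO-Mini (r=1): {mini_state   / 2**20:9.3f} MiB  ({mini_state / adamw_state:.3%} of AdamW)")
```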

Method Highlights

The APOLLO series introduces several innovative approaches for memory-efficient LLM training. Here are the key highlights of our method:

  • Structured Learning Rate Updates: We identify redundancy in Adam(W)'s element-wise learning rate update rule and find that a structured learning rate update rule (channel-wise or tensor-wise) is sufficient for LLM training.
  • Low-Rank Auxiliary State (APOLLO): A practical, memory-efficient approximation of channel-wise gradient scaling using pure random projections in a low-rank auxiliary space. This achieves performance superior to AdamW at significantly lower memory cost (a minimal sketch of the update rule follows this list).
  • Extreme Memory Efficiency (APOLLO-Mini): Tensor-wise gradient scaling using only a rank-1 auxiliary sub-space, achieving SGD-level memory cost while still matching or even outperforming Adam(W) in pre-training tasks.
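
To make the update rule concrete, below is a minimal single-matrix sketch of the structured scaling written against plain PyTorch. It reflects our simplified reading of the description above rather than the released implementation; bias correction, weight decay, the norm-growth limiter, and all engineering details are omitted, and the shapes, rank, and learning rate are toy values.

```python
# Minimal single-matrix sketch of APOLLO-style structured gradient scaling.
# AdamW-style moments are kept only for a low-rank random projection of the
# gradient, the channel-wise norm ratio between the adapted and raw projected
# gradients rescales the full-rank gradient, and the result is applied as a
# plain SGD step. (Simplified; not the released implementation.)
import torch

torch.manual_seed(0)
n, m, rank = 256, 512, 32
W = torch.randn(n, m) * 0.02                 # weight matrix being trained
P = torch.randn(rank, n) / rank ** 0.5       # fixed random projection (no SVD)
M = torch.zeros(rank, m)                     # first moment in the rank-r subspace
V = torch.zeros(rank, m)                     # second moment in the rank-r subspace
beta1, beta2, eps, lr = 0.9, 0.999, 1e-8, 1e-3

for step in range(100):
    G = 2 * W                                # toy objective: gradient of ||W||_F^2
    R = P @ G                                # project the gradient down to rank r
    M = beta1 * M + (1 - beta1) * R          # AdamW-style moment updates, but only
    V = beta2 * V + (1 - beta2) * R * R      # on the (rank x m) projected gradient
    R_adapted = M / (V.sqrt() + eps)
    # Channel-wise scaling factor: per-column norm ratio in the projected space.
    s = R_adapted.norm(dim=0) / (R.norm(dim=0) + eps)    # shape (m,)
    # APOLLO-Mini variant: rank = 1 and a single tensor-wise scale,
    # s = R_adapted.norm() / R.norm().
    W = W - lr * (G * s)                     # SGD-style step with the scaled gradient

print(f"||W||_F after 100 steps: {W.norm():.4f}")
```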

These innovations make APOLLO and APOLLO-Mini ideal for memory-constrained environments, delivering both scalability and performance without compromise.

APOLLO Framework

Figure 1: The APOLLO Framework for Memory-Efficient LLM Training. The channel-wise or tensor-wise gradient scaling factor is obtained via an auxiliary low-rank optimizer state, constructed using pure random projection (no SVD required).

System Benefits of APOLLO

Figure 2: System Benefits of APOLLO for Pre-training LLaMA 7B. (Left): Memory breakdown comparison for a single batch size. (Right): End-to-end training throughput on 8 A100-80GB GPUs.

Train an LLaMA-7B with 3× Throughput Compared to Adam

The following videos demonstrate the training throughput achieved with APOLLO on the LLaMA-7B model, showing up to a 3× improvement compared to the Adam optimizer.

AdamW Optimizer with a micro batch size of 4 at a 79-80 GB memory cost

APOLLO Optimizer with a micro batch size of 16 at a 69-70 GB memory cost

Successful Training of the LLaMA-13B Model

This video showcases how APOLLO-Mini enables the first successful pre-training of the LLaMA-13B model with naive DDP and no other system-level optimizations.

Pre-train LLaMA-7B on an NVIDIA TITAN (12 GB)

This video showcases the successful training of the LLaMA-7B model using Q-APOLLO-Mini within 11 GB of memory when layer-wise updates are adopted.

BibTeX

@misc{zhu2024apollosgdlikememoryadamwlevel,
  title={APOLLO: SGD-like Memory, AdamW-level Performance}, 
  author={Hanqing Zhu and Zhenyu Zhang and Wenyan Cong and Xi Liu and Sem Park and Vikas Chandra and Bo Long and David Z. Pan and Zhangyang Wang and Jinwon Lee},
  year={2024},
  eprint={2412.05270},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2412.05270}
}