This new optimizer, Adam-mini, achieves 49.6% higher throughput than AdamW when pre-training Llama2-7B on 2× A800-80GB GPUs, cutting pre-training wall-clock time by 33%.