LLMs can handle much larger learning rates by letting networks naturally reduce their loss landscape sharpness
Why Warmup the Learning Rate? Underlying…
LLMs can handle much larger learning rates by letting networks naturally reduce their loss landscape sharpness