Smart logit compression and loss scheduling let small LLMs learn efficiently from bigger siblings.
4000x lighter: pre-training distillation with compressed logits
Pre-training Distillation for Large Language…