Smart logit compression and loss scheduling let small LLMs learn efficiently from bigger siblings
4000x lighter: Pre-training distillation with compressed logits