"BOLT: Bootstrap Long Chain-of-Thought in Language Models without Distillation"
The podcast below on this paper was generated with Google's Illuminate.
https://arxiv.org/abs/2502.03860
The challenge lies in replicating the long chain-of-thought reasoning of advanced LLMs without relying on their data or extensive human annotation. Current methods often use knowledge distillation, which obscures the systematic development of reasoning abilities.
This paper introduces BOLT. It bootstraps LongCoT capabilities in standard instruct models through a three-stage process, avoiding distillation from existing LongCoT models.
-----
📌 BOLT leverages in-context learning for LongCoT data creation. This method efficiently generates training data without needing high-quality LongCoT exemplars. It shifts LongCoT capability development from data-intensive distillation to model-centric bootstrapping.
📌 Supervised fine-tuning of ShortCoT models on bootstrapped LongCoT data is a key step. It effectively transfers the LongCoT format and reasoning structure. This allows efficient adaptation without architectural changes.
📌 DPO-based online training in BOLT is critical for performance. Selecting only the highest- and lowest-reward responses as preference pairs reduces noise from LLM-based reward models, and this setup outperformed other reinforcement learning methods.
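For intuition, here is a minimal sketch of that high/low-reward pair selection. All names (`policy.generate`, `reward_model.score`, etc.) are hypothetical placeholders, not the paper's actual code.

```python
# Minimal sketch of reward-based preference-pair construction for DPO-style
# online training. Function and object names are illustrative assumptions;
# the paper's implementation may differ.

def build_preference_pair(policy, reward_model, prompt, num_samples=8):
    """Sample several responses, score them, and keep the best/worst as a pair."""
    # Sample candidate LongCoT responses from the current policy.
    responses = [policy.generate(prompt) for _ in range(num_samples)]

    # Score each response with an outcome reward model.
    rewards = [reward_model.score(prompt, r) for r in responses]

    # Keeping only the extremes reduces the impact of noisy mid-range scores.
    chosen = responses[rewards.index(max(rewards))]
    rejected = responses[rewards.index(min(rewards))]
    return {"prompt": prompt, "chosen": chosen, "rejected": rejected}
```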
----------
Methods Explored in this Paper 🔧:
→ BOLT, or Bootstrap Long Chain-of-Thought, is a three-stage method. It aims to imbue ShortCoT LLMs with LongCoT capabilities. It avoids reliance on existing LongCoT models.
→ First, LongCoT data is bootstrapped via in-context learning on a ShortCoT LLM. Only a handful of in-context examples are needed; the experiments used just 10 (a sketch of this stage follows the list).
→ Second, LongCoT Supervised Finetuning is performed. A ShortCoT model is trained on the bootstrapped LongCoT data. This allows it to learn the LongCoT format and reasoning patterns.
→ Third, LongCoT Online Training refines the model. It uses Direct Preference Optimization (DPO). An outcome reward model is used to guide the online training.
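A minimal sketch of how the bootstrapping stage could look in practice: a ShortCoT instruct model is prompted with a few LongCoT-style demonstrations and asked to produce long reasoning traces for new queries. The prompt format and helper names are assumptions made for illustration, not the paper's exact setup.

```python
# Illustrative sketch of stage 1 (LongCoT data bootstrapping via in-context
# learning). Demonstration format and helper names are assumptions.

def format_bootstrap_prompt(icl_examples, query):
    """Build a few-shot prompt showing long reasoning traces before the new query."""
    parts = []
    for ex in icl_examples:  # e.g., the ~10 curated examples mentioned above
        parts.append(f"Question: {ex['question']}\n"
                     f"Long reasoning: {ex['long_cot']}\n"
                     f"Answer: {ex['answer']}\n")
    parts.append(f"Question: {query}\nLong reasoning:")
    return "\n".join(parts)

def bootstrap_longcot_dataset(shortcot_model, icl_examples, queries):
    """Generate LongCoT training data from a ShortCoT model via few-shot prompting."""
    dataset = []
    for q in queries:
        prompt = format_bootstrap_prompt(icl_examples, q)
        trace = shortcot_model.generate(prompt)  # hypothetical generation call
        dataset.append({"query": q, "long_cot_response": trace})
    return dataset
```

The resulting traces are then used as targets in the second stage's supervised finetuning.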
-----
Key Insights 💡:
→ BOLT offers a white-box approach to LongCoT development. This contrasts with black-box knowledge distillation.
→ LongCoT capabilities can be developed from ShortCoT models. This is achieved without external LongCoT model data or heavy human annotation.
→ The bootstrapping stage is efficient. It requires only a small number of in-context examples to initiate LongCoT learning.
→ BOLT's effectiveness is consistent across different model scales. It was tested on 7B, 8B, and 70B parameter models.
→ Online training through DPO is crucial. It further refines and enhances the LongCoT reasoning abilities learned in earlier stages.
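For reference, the standard DPO objective, which the online training stage builds on (assuming the vanilla formulation), optimizes the policy against a frozen reference model on preference pairs, where y_w is the higher-reward (chosen) response and y_l the lower-reward (rejected) one:

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) =
  -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}}\!\left[
    \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right)
  \right]
```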
-----
Results 📊:
→ BOLT achieves significant performance improvements across diverse benchmarks. These include Arena-Hard, MT-Bench, WildBench, ZebraLogic, and MATH500.
→ Using Skywork-Reward-Llama-3.1-8B as the reward model in online training, BOLT showed stronger performance compared to using ArmoRM-Llama3-8B.
→ BOLT-Llama-3.1-8B-Instruct achieves an Arena-Hard-SC score of 44.1 and a WildBench score of 42.96.