
"Control LLM: Controlled Evolution for Intelligence Retention in LLM"

The podcast below was generated with Google's Illuminate.

Control LLM tackles catastrophic forgetting in LLMs during continuous learning by running parallel transformer blocks, a frozen pre-trained block alongside a trainable expanded block, and aligning their hidden states through interpolation.

This method allows LLMs to learn new tasks without losing existing knowledge.

-----

Paper - https://arxiv.org/abs/2501.10979

Original Problem 🤔:

→ LLMs require vast computational resources, making full retraining impractical.

→ Enhancing LLMs with new skills often leads to catastrophic forgetting.

→ Catastrophic forgetting causes LLMs to lose previously learned abilities when trained on new data.

-----

Solution in this Paper 💡:

→ This paper proposes Control LLM, a novel architecture to mitigate catastrophic forgetting.

→ Control LLM expands the LLM with parallel transformer blocks: a frozen pre-trained block and a trainable expanded block.

→ It aligns the hidden states of these blocks using interpolation strategies such as Linear Interpolation and Dynamic Linear Interpolation (see the sketch after this list).

→ This alignment mechanism allows the model to learn new tasks while retaining old knowledge.

→ Control LLM uses a divergence loss to keep the hidden states of the pre-trained and expanded blocks consistent.
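
A minimal sketch of this parallel-block structure, assuming PyTorch and a generic transformer block module. The class name `ControlBlock`, the fixed weight `alpha`, and the returned branch outputs are illustrative simplifications, not the authors' implementation.

```python
# A minimal sketch, assuming PyTorch and a generic transformer block module.
# `ControlBlock` and `alpha` are illustrative names, not the paper's code.
import copy
import torch
import torch.nn as nn

class ControlBlock(nn.Module):
    """Pairs a frozen pre-trained transformer block with a trainable copy.

    Both branches see the same input; their hidden states are fused by
    linear interpolation so new knowledge is added without overwriting
    the original representation.
    """
    def __init__(self, pretrained_block: nn.Module, alpha: float = 0.5):
        super().__init__()
        # Trainable expanded branch, initialized from the pre-trained weights.
        self.expanded = copy.deepcopy(pretrained_block)
        # Frozen original branch: keeps the pre-trained knowledge intact.
        self.frozen = pretrained_block
        for p in self.frozen.parameters():
            p.requires_grad = False
        self.alpha = alpha  # interpolation weight between the two branches

    def forward(self, hidden_states: torch.Tensor):
        h_frozen = self.frozen(hidden_states)
        h_expanded = self.expanded(hidden_states)
        # Linear Interpolation of hidden states; a Dynamic variant would
        # learn or schedule alpha instead of fixing it.
        h_out = self.alpha * h_expanded + (1.0 - self.alpha) * h_frozen
        return h_out, h_frozen, h_expanded
```

Because the expanded branch starts as an exact copy of the frozen one, the interpolated output initially matches the original model, and training moves it away only as far as the new task requires.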

-----

Key Insights from this Paper 🧠:

→ Hidden-state alignment in transformer layers is crucial for mitigating catastrophic forgetting.

→ Maintaining alignment prevents hidden states from drifting when the model learns new tasks (a sketch of one possible alignment loss follows this list).

→ Interpolation strategies effectively fuse knowledge from pre-trained and expanded blocks.

→ Control LLM achieves a "learn more, forget less" outcome, outperforming traditional fine-tuning methods.
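
As a rough illustration of the divergence loss mentioned above, the sketch below uses a cosine-based penalty between the two branches' hidden states. This is an assumed formulation rather than the paper's exact loss, and `lambda_align` is a hypothetical hyperparameter.

```python
# Hedged sketch of one plausible hidden-state alignment (divergence) term;
# the paper's exact loss may differ. `lambda_align` is an assumed
# hyperparameter controlling how strongly retention is enforced.
import torch
import torch.nn.functional as F

def alignment_loss(h_frozen: torch.Tensor, h_expanded: torch.Tensor) -> torch.Tensor:
    """1 - cosine similarity per token, averaged; 0 when perfectly aligned."""
    cos = F.cosine_similarity(h_frozen, h_expanded, dim=-1)
    return (1.0 - cos).mean()

# Example training objective: task loss on new data plus the alignment penalty.
# total_loss = task_loss + lambda_align * alignment_loss(h_frozen, h_expanded)
```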

-----

Results 📊:

→ Control LLM improves Math-Hard accuracy by 14.4% on Llama3.1-8B-Instruct.

→ It enhances MBPP-PLUS coding performance by 10% on Llama3.1-8B-Instruct.

→ Control LLM boosts C-Eval multilingual capabilities by 10.6% on Llama3.1-8B.

→ It limits MMLU degradation to less than 4.3%, compared to >35% in other methods.
