"Concept Bottleneck Large Language Models"

A podcast on this paper was generated with Google's Illuminate.

LLMs become transparent: the new architecture proposed in this paper reveals the reasoning behind every prediction.

CB-LLM introduces interpretable neurons into LLMs, providing clear explanations for model decisions while maintaining performance and enabling concept detection and controlled text generation.

-----

https://arxiv.org/abs/2412.07992v1

🤔 Original Problem:

LLMs are black boxes with unclear reasoning, making it hard to detect misuse, manipulation, or unsafe outputs. Current interpretability methods only work for small datasets and simple classification tasks.

-----

🔧 Solution in this Paper:

→ Introduces CB-LLM, transforming pretrained LLMs into interpretable models using a Concept Bottleneck Layer (CBL)

→ Implements Automatic Concept Scoring (ACS), using sentence embeddings to label concepts efficiently without expensive LLM queries (see the sketch after this list)

→ Uses Automatic Concept Correction to improve concept score quality by aligning scores with class labels

→ Employs adversarial training to prevent the unsupervised layer from learning concept-related information

→ Enables concept detection and controlled generation through interpretable neurons
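
To make the pipeline concrete, here is a minimal sketch of Automatic Concept Scoring and a Concept Bottleneck Layer, not the authors' released code: the sentence-transformers library, the all-mpnet-base-v2 embedding model, the toy concept list, and the CBLLMClassifier class are illustrative assumptions.

```python
# Minimal sketch (not the paper's implementation) of ACS + a Concept Bottleneck Layer.
# Assumed/illustrative: sentence-transformers, the "all-mpnet-base-v2" model,
# the toy concept list, and the class/function names below.
import torch
import torch.nn as nn
from sentence_transformers import SentenceTransformer

concepts = ["positive sentiment", "negative sentiment", "mentions price", "mentions service"]
embedder = SentenceTransformer("all-mpnet-base-v2")

def concept_scores(texts):
    """ACS: score each text against each concept via embedding similarity,
    avoiding expensive per-sample LLM queries. Automatic Concept Correction
    would then zero out scores for concepts unrelated to a sample's class
    before these scores are used as CBL training targets."""
    t = embedder.encode(texts, convert_to_tensor=True)      # (batch, d)
    c = embedder.encode(concepts, convert_to_tensor=True)   # (n_concepts, d)
    t = nn.functional.normalize(t, dim=-1)
    c = nn.functional.normalize(c, dim=-1)
    return t @ c.T                                           # cosine similarities, (batch, n_concepts)

class CBLLMClassifier(nn.Module):
    """Pretrained backbone -> Concept Bottleneck Layer -> linear head.
    Each CBL neuron is trained (e.g., against the ACS scores above) to fire on one
    human-readable concept, so the final prediction is a transparent combination
    of concept activations."""
    def __init__(self, backbone, hidden_dim, n_concepts, n_classes):
        super().__init__()
        self.backbone = backbone                          # pretrained LLM encoder
        self.cbl = nn.Linear(hidden_dim, n_concepts)      # interpretable concept neurons
        self.head = nn.Linear(n_concepts, n_classes)      # predicts from concepts only

    def forward(self, **inputs):
        h = self.backbone(**inputs).last_hidden_state[:, 0]  # e.g., [CLS]-style pooling
        concept_acts = torch.relu(self.cbl(h))                # concept activations
        return self.head(concept_acts), concept_acts
```

In training, the CBL activations would be pushed toward the (corrected) concept scores while the head learns the task; for text generation, the paper additionally trains an unsupervised residual path adversarially so that concept-related information stays in the interpretable neurons.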

-----

💡 Key Insights:

→ First CBM framework scaling to both large classification datasets and generation tasks

→ Achieves interpretability without performance loss compared to black-box models

→ Enables concept unlearning to enhance prediction fairness

→ Provides controllable text generation through neuron manipulation (see the sketch below)
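
Below is a hedged sketch of what the two interventions above can look like on a trained CB-LLM, assuming the CBLLMClassifier from the earlier sketch; the function names, concept index, and activation value are illustrative, not the paper's API.

```python
# Illustrative interventions on a trained CB-LLM's interpretable neurons.
# Assumes the CBLLMClassifier sketched earlier; names and values are hypothetical.
import torch

@torch.no_grad()
def unlearn_concept(model, concept_idx: int):
    """Concept unlearning: remove a concept's influence on predictions (e.g., an
    unfair or spurious concept) by zeroing its outgoing weights in the final head."""
    model.head.weight[:, concept_idx] = 0.0

@torch.no_grad()
def steer_concept(concept_acts: torch.Tensor, concept_idx: int, value: float = 5.0) -> torch.Tensor:
    """Controlled generation: manually set one interpretable neuron before the
    decoding head so the generated text reflects (or suppresses) that concept."""
    out = concept_acts.clone()
    out[:, concept_idx] = value
    return out
```

Because each neuron corresponds to a named concept, both interventions are direct and auditable, unlike editing opaque hidden units in a black-box model.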

-----

📊 Results:

→ Matches black-box model accuracy within a 1% gap across datasets

→ Achieves a 1.5× higher faithfulness rating than existing methods

→ Scales efficiently to 560,000 samples, versus 250 samples in prior work

→ Demonstrates an 85% success rate in concept unlearning experiments
