LLMs get transparent: the new architecture proposed in this paper reveals the reasoning behind every prediction
CB-LLM introduces interpretable neurons in LLMs that provide clear explanations for model decisions while maintaining performance, enabling concept detection and controlled text generation.
-----
https://arxiv.org/abs/2412.07992v1
🤔 Original Problem:
LLMs are black boxes with unclear reasoning, making it hard to detect misuse, manipulation, or unsafe outputs. Current interpretability methods only work for small datasets and simple classification tasks.
-----
🔧 Solution in this Paper:
→ Introduces CB-LLM, transforming pretrained LLMs into interpretable models using a Concept Bottleneck Layer (CBL)
→ Implements Automatic Concept Scoring (ACS), using sentence embeddings to label concepts efficiently without expensive LLM queries (see the sketch after this list)
→ Uses Automatic Concept Correction to improve concept score quality by aligning the scores with class labels
→ Employs adversarial training to prevent the unsupervised layer from learning concept-related information
→ Enables concept detection and controlled generation through interpretable neurons
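A rough sketch of how these pieces could fit together (not the authors' code): the concept list, the encoder choice (`all-MiniLM-L6-v2`), the dimensions, and the `ConceptBottleneckHead` class are illustrative assumptions. ACS scores each concept by sentence-embedding similarity, and the Concept Bottleneck Layer maps the backbone embedding to one neuron per concept before a linear class head; in the paper's generation setting, an adversarial loss additionally discourages the unsupervised layer from carrying concept information.

```python
# Minimal sketch of Automatic Concept Scoring (ACS) plus a Concept Bottleneck
# Layer (CBL). Concept set, encoder, and dimensions are illustrative assumptions.
import torch
import torch.nn as nn
from sentence_transformers import SentenceTransformer

# Hypothetical concept set for a sentiment-style classification task
concepts = ["excellent acting", "boring plot", "great soundtrack", "poor pacing"]

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # any sentence encoder works


def acs_scores(texts):
    """Score every concept for every text via cosine similarity of embeddings."""
    text_emb = torch.tensor(encoder.encode(texts))        # (B, d)
    concept_emb = torch.tensor(encoder.encode(concepts))  # (K, d)
    text_emb = nn.functional.normalize(text_emb, dim=-1)
    concept_emb = nn.functional.normalize(concept_emb, dim=-1)
    # Automatic Concept Correction would further clean these scores by
    # aligning them with the sample's class label.
    return text_emb @ concept_emb.T                        # (B, K) concept scores


class ConceptBottleneckHead(nn.Module):
    """CBL: backbone embedding -> K interpretable concept neurons -> class logits."""
    def __init__(self, backbone_dim, num_concepts, num_classes):
        super().__init__()
        self.cbl = nn.Linear(backbone_dim, num_concepts)   # trained to match ACS scores
        self.classifier = nn.Linear(num_concepts, num_classes)

    def forward(self, backbone_emb):
        concept_acts = torch.relu(self.cbl(backbone_emb))  # each neuron = one concept
        return self.classifier(concept_acts), concept_acts


# Usage idea: regress the CBL activations onto acs_scores(texts),
# then train the linear classifier on top of the concept activations.
scores = acs_scores(["the movie had excellent acting but poor pacing"])
```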
-----
💡 Key Insights:
→ First CBM framework scaling to both large classification datasets and generation tasks
→ Achieves interpretability without performance loss compared to black-box models
→ Enables concept unlearning to enhance prediction fairness
→ Provides controllable text generation through neuron manipulation (see the sketch after this list)
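Once a concept neuron is identified, unlearning and steering both reduce to editing that neuron. A minimal sketch under stated assumptions: the `ConceptBottleneckHead` class, the concept indices, and the clamp values are hypothetical placeholders, redefined here so the snippet runs on its own.

```python
import torch
import torch.nn as nn


class ConceptBottleneckHead(nn.Module):
    """Toy stand-in for a trained concept bottleneck head."""
    def __init__(self, backbone_dim=768, num_concepts=4, num_classes=2):
        super().__init__()
        self.cbl = nn.Linear(backbone_dim, num_concepts)
        self.classifier = nn.Linear(num_concepts, num_classes)

    def forward(self, emb, clamp=None):
        acts = torch.relu(self.cbl(emb))
        if clamp is not None:               # {concept_index: forced activation}
            for idx, value in clamp.items():
                acts[:, idx] = value
        return self.classifier(acts), acts


head = ConceptBottleneckHead()
emb = torch.randn(1, 768)                   # stand-in for a backbone embedding

# Concept unlearning: remove a concept's influence on predictions by zeroing
# its outgoing classifier weights.
unwanted_concept = 1
with torch.no_grad():
    head.classifier.weight[:, unwanted_concept] = 0.0

# Controlled behavior: clamp a concept neuron high (or low) at inference time
# to push the output toward (or away from) that concept.
logits, acts = head(emb, clamp={2: 5.0})
```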
-----
📊 Results:
→ Matches black-box model accuracy within 1% gap across datasets
→ Achieves 1.5× higher faithfulness rating compared to existing methods
→ Efficiently processes 560,000 samples, versus 250 samples in previous work
→ Demonstrates 85% success rate in concept unlearning experiments