
"Understanding Hidden Computations in Chain-of-Thought Reasoning"

The podcast on this paper is generated with Google's Illuminate.

Transformer models can hide their reasoning steps behind filler tokens and still solve problems perfectly.

This paper unveils how transformer models secretly process reasoning steps when Chain-of-Thought prompting is replaced with filler characters, enabling recovery of hidden computations.

https://arxiv.org/abs/2412.04537

Original Problem 🤔:

→ While Chain-of-Thought (CoT) prompting helps LLMs perform complex reasoning, recent findings show models maintain performance even when CoT steps are replaced with filler characters (e.g. "..."). This raises questions about how models internally process reasoning.

-----

Solution in this Paper 🔧:

→ The paper analyzes layer-wise representations in transformers using the logit lens method to decode hidden characters in models trained with filler CoT sequences.

→ It examines token rankings during decoding to determine whether the original CoT tokens appear among the lower-ranked candidates when filler tokens dominate the top predictions.

→ A modified greedy autoregressive decoding algorithm recovers the hidden characters by selecting the highest-ranked non-filler token at each step (see the sketch after this list).
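
A minimal sketch of this non-filler greedy decoding, assuming a HuggingFace-style causal LM. The model name, filler character, and function names below are placeholders for illustration, not the paper's code.

```python
# Sketch: greedy decoding that masks out filler tokens, exposing hidden computation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"   # placeholder; the paper trains its own small transformer
FILLER = "."          # placeholder filler character

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()

filler_ids = tokenizer.encode(FILLER, add_special_tokens=False)

@torch.no_grad()
def decode_non_filler(prompt: str, max_new_tokens: int = 32) -> str:
    """Greedy decoding, but at each step pick the highest-ranked token
    that is NOT a filler token."""
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    for _ in range(max_new_tokens):
        logits = model(input_ids).logits[0, -1]   # next-token logits
        logits[filler_ids] = float("-inf")        # mask out filler tokens
        next_id = int(torch.argmax(logits))       # best non-filler candidate
        input_ids = torch.cat([input_ids, torch.tensor([[next_id]])], dim=-1)
    return tokenizer.decode(input_ids[0].tolist())
```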

-----

Key Insights 💡:

→ Models perform reasoning computations in earlier layers before overwriting them with filler tokens in later layers

→ The original reasoning steps remain accessible as lower-ranked token predictions (see the logit-lens sketch after this list)

→ Hidden computations can be recovered without performance loss
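
A minimal logit-lens sketch of this layer-wise check, again with placeholder model and function names. The final layer norm is skipped for brevity, so this is an approximation of the analysis rather than the paper's exact procedure.

```python
# Sketch: decode each layer's hidden state at a given position and report
# the rank of the original CoT token among its next-token predictions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # placeholder model
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)
model.eval()
unembed = model.get_output_embeddings()  # projection from hidden states to vocab logits

@torch.no_grad()
def rank_per_layer(prompt: str, position: int, original_token: str):
    """For each layer, rank of `original_token` (0 = top prediction) when the
    hidden state at `position` is projected directly to the vocabulary."""
    target_id = tokenizer.encode(original_token, add_special_tokens=False)[0]
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids
    hidden_states = model(input_ids, output_hidden_states=True).hidden_states
    ranks = []
    for layer, h in enumerate(hidden_states):          # embeddings + each layer
        logits = unembed(h[0, position])               # logit lens projection
        rank = int((logits > logits[target_id]).sum()) # tokens ranked above target
        ranks.append((layer, rank))
    return ranks
```

If the paper's finding holds, the rank of the original reasoning token stays low in earlier layers and rises only in the final layers, where the filler token takes over the top prediction.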

-----

Results 📊:

→ 100% accuracy in recovering hidden characters using modified decoding vs 39.85% with random replacement

→ Maintained original task performance while exposing internal reasoning steps

→ Successfully demonstrated recovery across all 4 transformer layers
