"Do I Know This Entity? Knowledge Awareness and Hallucinations in Language Models"

The podcast on this paper is generated with Google's Illuminate.

Models have built-in mechanisms to know what they don't know

Neural networks can detect their own knowledge boundaries

LLMs contain circuits that distinguish familiar from unfamiliar entities

LLMs often hallucinate when asked about unknown entities. This paper finds that models have internal mechanisms for recognizing the entities they know about. Using sparse autoencoders, the authors identify specific directions in the model's representations that encode whether an entity is recognized and that influence model behavior[1].

-----

https://arxiv.org/abs/2411.14257

🤔 Original Problem:

LLMs frequently generate incorrect information when asked about entities they don't know, yet the mechanisms that determine when a model hallucinates rather than refuses to answer are poorly understood[1].

-----

🔍 Solution in this Paper:

→ The researchers used sparse autoencoders to uncover linear directions in model representations that detect whether an entity is recognized

→ They found that these entity recognition directions are present across different entity types, such as players, movies, songs, and cities

→ The directions causally affect whether the model refuses to answer or hallucinates about entities

→ These mechanisms exist in base models and get repurposed during chat finetuning

→ The directions regulate attention patterns of downstream heads that extract entity attributes
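
To make the steering idea above concrete, here is a minimal NumPy sketch. It assumes a standard sparse autoencoder with a ReLU encoder and treats one decoder row as the "entity recognition" direction that gets added to the residual stream. The weights are random toy values, and `sae_encode`, `steer`, `latent_id`, and `alpha` are hypothetical names chosen for illustration, not the paper's code or any library's API.

```python
import numpy as np

def sae_encode(x, W_enc, b_enc):
    """Sparse autoencoder encoder: ReLU of an affine map of the activation."""
    return np.maximum(0.0, x @ W_enc + b_enc)

def steer(residual, direction, alpha):
    """Add alpha times the unit-normalized direction at every token position."""
    unit = direction / np.linalg.norm(direction)
    return residual + alpha * unit

# Toy sizes and random weights (assumptions, not values from the paper).
rng = np.random.default_rng(0)
d_model, n_latents, n_tokens = 16, 64, 5
W_enc = rng.normal(size=(d_model, n_latents))
b_enc = np.zeros(n_latents)
W_dec = rng.normal(size=(n_latents, d_model))

residual = rng.normal(size=(n_tokens, d_model))  # stand-in for model activations
latents = sae_encode(residual, W_enc, b_enc)     # non-negative latent activations

latent_id = 3                    # hypothetical index of an "unknown entity" latent
direction = W_dec[latent_id]     # the latent's decoder row is its linear direction
steered = steer(residual, direction, alpha=4.0)

# The projection onto the direction grows by exactly alpha at each position.
unit = direction / np.linalg.norm(direction)
shift = (steered - residual) @ unit
```

In a real experiment the steered activations would be written back into the model's forward pass, and the effect on refusals would be measured on the generated text.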

-----

💡 Key Insights:

→ Models have a form of self-knowledge about their capabilities

→ Entity recognition happens in middle layers of the network

→ Unknown entity detection suppresses attention to entity tokens

→ Chat models repurpose base model mechanisms for knowledge refusal
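
One plausible way to pick out latents that separate known from unknown entities is to compare how often each latent fires on the two groups of prompts. The sketch below assumes that approach and uses made-up toy activations; `separation_scores` is a hypothetical helper, not necessarily the paper's exact metric.

```python
import numpy as np

def separation_scores(acts_known, acts_unknown):
    """Per-latent difference in firing frequency between the two prompt sets."""
    freq_known = (acts_known > 0).mean(axis=0)
    freq_unknown = (acts_unknown > 0).mean(axis=0)
    return freq_known - freq_unknown

# Toy activations: rows are prompts, columns are SAE latents.
acts_known = np.array([[0.9, 0.0, 0.2],
                       [1.1, 0.0, 0.0],
                       [0.7, 0.0, 0.3]])
acts_unknown = np.array([[0.0, 0.8, 0.1],
                         [0.0, 1.2, 0.0],
                         [0.0, 0.5, 0.4]])

scores = separation_scores(acts_known, acts_unknown)
known_latent = int(np.argmax(scores))    # fires mostly on known entities
unknown_latent = int(np.argmin(scores))  # fires mostly on unknown entities
```

A score near 1 marks a candidate "known entity" latent, a score near -1 a candidate "unknown entity" latent, and a score near 0 a latent that does not discriminate between the two groups.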

-----

📊 Results:

→ Steering with the unknown-entity latent induced a nearly 100% refusal rate across entity types

→ Steering with the known-entity latent significantly reduced refusal rates

→ Entity recognition directions generalized across players, movies, cities and songs
