"Hidden in Plain Sight: Exploring Chat History Tampering in Interactive Language Models"

Playback speed

Share post at current time

0:00

Transcript

"Hidden in Plain Sight: Exploring Chat History Tampering in Interactive Language Models"

The Podcast is generated with Google's Illuminate, the tool trained on AI & science-related Arxiv papers.

Rohan Paul

Dec 27, 2024

The paper reveals how chat history tampering exploits LLM vulnerabilities to manipulate model behavior. 🤯

Results 📊:

- Chat history tampering boosts disallowed content elicitation success rates up to 98% on GPT-3.5, 97% on Llama-2, and 86% on Llama-3.

- LLMGA effectively finds templates that influence model behavior, achieving high RRR across various LLMs.

📚 https://arxiv.org/pdf/2405.20234

Original Problem 🚨:

LLMs can't distinguish between user inputs and chat history, leading to vulnerabilities in chat history tampering. This poses risks such as misinterpreting tampered histories as genuine, which can bypass safety mechanisms.

-----

Key Insights from this Paper 💡:

- Chat history tampering can significantly influence LLM behavior.

- LLMs are vulnerable to accepting injected chat histories as genuine context.

- Effective template crafting can exploit LLM vulnerabilities without prior model knowledge.

-----

Solution in this Paper 🛠:

- Utilizes prompt templates to structure fake histories within user messages.

- Introduces LLM-Guided Genetic Algorithm (LLMGA) to generate and optimize templates in a black-box setting.

- Defines Response Retrieval Rate (RRR) to evaluate template effectiveness.

- Proposes strategies like acceptance and demonstration injection to manipulate LLM behavior and increase success rates of disallowed content elicitation.

Rohan's Bytes

"Hidden in Plain Sight: Exploring Chat History Tampering in Interactive Language Models"

Discussion about this video