The paper reveals how chat history tampering exploits LLM vulnerabilities to manipulate model behavior. 🤯
Results 📊:
- Chat history tampering boosts disallowed content elicitation success rates up to 98% on GPT-3.5, 97% on Llama-2, and 86% on Llama-3.
- LLMGA effectively finds templates that influence model behavior, achieving high RRR across various LLMs.
📚 https://arxiv.org/pdf/2405.20234
Original Problem 🚨:
LLMs can't distinguish between user inputs and chat history, leading to vulnerabilities in chat history tampering. This poses risks such as misinterpreting tampered histories as genuine, which can bypass safety mechanisms.
-----
Key Insights from this Paper 💡:
- Chat history tampering can significantly influence LLM behavior.
- LLMs are vulnerable to accepting injected chat histories as genuine context.
- Effective template crafting can exploit LLM vulnerabilities without prior model knowledge.
-----
Solution in this Paper 🛠:
- Utilizes prompt templates to structure fake histories within user messages.
- Introduces LLM-Guided Genetic Algorithm (LLMGA) to generate and optimize templates in a black-box setting.
- Defines Response Retrieval Rate (RRR) to evaluate template effectiveness.
- Proposes strategies like acceptance and demonstration injection to manipulate LLM behavior and increase success rates of disallowed content elicitation.
Share this post