LLMs can be fooled by characters humans can't see, but GPT-4 fights back.
This paper studies how imperceptible character attacks affect LLM code comprehension, and finds that GPT-4 defends against them far better than the GPT-3.5 models.
-----
https://arxiv.org/abs/2412.08098
🔍 Original Problem:
→ LLMs excel at code tasks but are vulnerable to adversarial attacks that use special Unicode characters: the perturbed code looks identical to clean code to a human reader, yet it confuses the model.
-----
🛠️ Solution in this Paper:
→ The researchers developed four types of imperceptible attacks: reordering, invisible characters, deletions, and homoglyphs (see the sketch after this list).
→ They tested these attacks on three ChatGPT models (two GPT-3.5 versions and GPT-4) using a dataset of 2,644 LeetCode questions.
→ Each attack injected special Unicode characters: the perturbed code looks identical to the clean code for a human reader, but the character sequence the model receives is altered.
→ The study measured response correctness and model confidence, the latter via token log probabilities.
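To make the four attack families concrete, here is a minimal Python sketch of how each perturbation could be injected into a code snippet. The specific characters (Bidi controls, zero-width space, backspace, a Cyrillic homoglyph), the injection positions, and the helper names are illustrative assumptions, not the paper's released implementation.

```python
# Hypothetical sketch of the four imperceptible perturbation families.
# Exact characters and injection positions are assumptions, not the
# paper's released code.

CLEAN = "def is_palindrome(s):\n    return s == s[::-1]"

def reorder(code: str, pos: int = 4) -> str:
    # Bidi control characters (RLO ... PDF) reverse how a span is rendered,
    # so the displayed text can look clean while the model sees scrambled bytes.
    return code[:pos] + "\u202e" + code[pos:pos + 4][::-1] + "\u202c" + code[pos + 4:]

def invisible(code: str, pos: int = 4) -> str:
    # A zero-width space renders as nothing but still changes tokenization.
    return code[:pos] + "\u200b" + code[pos:]

def deletion(code: str, pos: int = 4) -> str:
    # An injected character followed by a backspace control is collapsed by
    # many renderers, so it is invisible on screen yet present in the input.
    return code[:pos] + "x\u0008" + code[pos:]

def homoglyph(code: str) -> str:
    # Swap a Latin 'a' for the visually identical Cyrillic 'а' (U+0430).
    return code.replace("a", "\u0430", 1)

if __name__ == "__main__":
    for attack in (reorder, invisible, deletion, homoglyph):
        perturbed = attack(CLEAN)
        # Perturbed code differs from the clean code in bytes, not in appearance.
        print(f"{attack.__name__}: changed={perturbed != CLEAN}, "
              f"extra_chars={len(perturbed) - len(CLEAN)}")
```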
-----
💡 Key Insights:
→ GPT-4 has built-in defenses against imperceptible attacks, unlike GPT-3.5
→ Deletion attacks had the strongest impact on model performance
→ Homoglyph attacks were least effective due to limited character substitution options
→ All models showed 100% accuracy on clean code
-----
📊 Results:
→ The GPT-3.5 models showed a linear decline in performance as the perturbation level increased
→ GPT-4 rejected almost all perturbed inputs (99% detection rate)
→ GPT-3.5 confidence scores dropped from 95% to as low as 1.74% in the worst cases (confidence derivation sketched below)
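For reference, here is a minimal sketch of how a confidence score like the ones above could be computed from token log probabilities, assuming the OpenAI chat completions API with logprobs enabled; averaging the per-token probability is an assumption, not necessarily the paper's exact aggregation.

```python
# Hypothetical sketch: query a model and turn the returned token log
# probabilities into a single confidence value. Averaging the per-token
# probability is an assumption; the paper's aggregation may differ.
import math
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def confidence(prompt: str, model: str = "gpt-4") -> float:
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        logprobs=True,
    )
    logprobs = [t.logprob for t in resp.choices[0].logprobs.content]
    # Mean per-token probability, expressed as a percentage.
    return 100 * math.exp(sum(logprobs) / len(logprobs))
```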