SelfPrompt lets LLMs test their own robustness without expensive benchmarks
SelfPrompt is an autonomous framework for evaluating LLM robustness: it builds refined adversarial prompts under domain-constrained knowledge guidelines, removing the dependency on manually annotated benchmarks and enabling targeted evaluation in specific domains.
-----
https://arxiv.org/abs/2412.00765
🤔 Original Problem:
→ Current LLM robustness evaluation frameworks rely heavily on manually annotated benchmark datasets, making them expensive and impractical for domain-specific assessments.
-----
🛠️ Solution in this Paper:
→ SelfPrompt first generates original prompts from knowledge-graph triplets, using either template-based or LLM-based strategies (a minimal sketch follows this list).
→ It then rewrites each original into an adversarial prompt guided by domain-specific knowledge guidelines.
→ A few-shot setup improves the effectiveness of this adversarial prompt generation.
→ A filter module keeps only high-quality adversarial prompts, scoring each candidate on text fluency and semantic fidelity to its original (see the second sketch below); the survivors are used to probe the target LLM's robustness.
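Below is a minimal sketch of the prompt-generation stage under stated assumptions: a hypothetical true/false template for turning a triplet into an original prompt, and a few-shot instruction asking a generator LLM for an adversarial rewrite. Function names, the template wording, and the instruction text are illustrative, not taken from the paper; the `llm` argument is an abstract callable so any model backend can be plugged in.

```python
# Hypothetical sketch of SelfPrompt's generation stage: triplet -> original prompt
# (template-based) -> few-shot request for an adversarial rewrite.
from typing import Callable, List, Tuple

Triplet = Tuple[str, str, str]  # (subject, relation, object) from a knowledge graph

def template_prompt(triplet: Triplet) -> str:
    """Template-based original prompt: turn a triplet into a true/false question."""
    subj, rel, obj = triplet
    return f"Is the following statement true or false? {subj} {rel} {obj}."

def few_shot_adversarial_request(original: str,
                                 examples: List[Tuple[str, str]]) -> str:
    """Build a few-shot instruction asking an LLM to rewrite the prompt adversarially
    while preserving its meaning (so the correct answer stays the same)."""
    shots = "\n\n".join(f"Original: {o}\nAdversarial: {a}" for o, a in examples)
    return (
        "Rewrite the prompt so it is harder for a language model to answer, "
        "without changing its meaning or correct answer.\n\n"
        f"{shots}\n\nOriginal: {original}\nAdversarial:"
    )

def generate_adversarial(triplet: Triplet,
                         llm: Callable[[str], str],
                         examples: List[Tuple[str, str]]) -> Tuple[str, str]:
    """Return (original_prompt, adversarial_prompt) for one triplet."""
    original = template_prompt(triplet)
    adversarial = llm(few_shot_adversarial_request(original, examples)).strip()
    return original, adversarial
```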
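And a minimal sketch of the filter module, assuming fluency is scored with GPT-2 perplexity and semantic fidelity with sentence-embedding cosine similarity; the paper does not tie the module to these particular models, and the thresholds here are placeholders.

```python
# Filter-module sketch: keep an adversarial prompt only if it is fluent (low
# perplexity) and semantically faithful to its original (high cosine similarity).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast
from sentence_transformers import SentenceTransformer, util

_tok = GPT2TokenizerFast.from_pretrained("gpt2")
_lm = GPT2LMHeadModel.from_pretrained("gpt2").eval()
_embedder = SentenceTransformer("all-MiniLM-L6-v2")

def fluency(text: str) -> float:
    """Perplexity under GPT-2; lower means more fluent."""
    ids = _tok(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = _lm(ids, labels=ids).loss
    return torch.exp(loss).item()

def fidelity(original: str, adversarial: str) -> float:
    """Cosine similarity between embeddings of the original and adversarial prompts."""
    emb = _embedder.encode([original, adversarial], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item()

def keep(original: str, adversarial: str,
         max_ppl: float = 80.0, min_sim: float = 0.85) -> bool:
    """Illustrative thresholds: discard disfluent or meaning-altering rewrites."""
    return fluency(adversarial) <= max_ppl and fidelity(original, adversarial) >= min_sim
```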
-----
💡 Key Insights:
→ Model size correlates positively with robustness in general domains, but not necessarily in knowledge-constrained domains
→ Domain-specific knowledge significantly affects the accuracy of robustness evaluation
→ The filter module maintains consistent adversarial-prompt quality across different LLMs
-----
📊 Results:
→ Framework tested on ChatGPT, Llama 3.1, Phi-3, and Mistral models
→ Larger models showed 15-20% better robustness in general domains
→ In constrained domains, smaller models sometimes outperformed larger ones by 5-8%