AGENT-CQ: Automatic Generation and Evaluation of Clarifying Questions for Conversational Search with LLMs
LLM-generated questions outperform human-generated ones in usefulness and specificity, according to this paper.
AGENT-CQ (Automatic GENeration and evaluaTion of Clarifying Questions) uses LLMs to generate and evaluate clarifying questions to better understand user search intent.
Original Problem 🤔:
Generating effective clarifying questions in conversational search systems currently relies on manual curation or template-based approaches, which lack scalability and adaptability. This limits the ability to understand user intent and provide relevant search results.
Solution in this Paper 🛠️:
• AGENT-CQ: A two-stage framework using LLMs for generating and evaluating clarifying questions
• Generation Stage (see the sketch after this list):
  - Question generation using temperature-variation and facet-based prompting
  - Quality-based filtering to remove low-quality questions
  - Answer simulation using parameterized user characteristics
• Evaluation Stage (CrowdLLM):
  - Multiple LLM instances evaluate questions on 7 quality metrics
  - Simulates diverse human judgments via different temperature settings
  - Validated against human expert assessments
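To make the generation stage concrete, here is a minimal sketch of temperature-varied question generation (GPT-Temp-style) plus LLM-based quality filtering, assuming an OpenAI-style chat API. The model name, prompts, and temperature grid are illustrative assumptions, not the paper's exact setup.

```python
# Minimal sketch of the generation stage; prompts and model are assumptions.
from openai import OpenAI

client = OpenAI()

def ask_llm(prompt: str, temperature: float) -> str:
    """Single chat-completion call; swap the model for your own."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model name
        messages=[{"role": "user", "content": prompt}],
        temperature=temperature,
    )
    return resp.choices[0].message.content.strip()

def generate_questions(query: str, facets: list[str] | None = None) -> list[str]:
    """GPT-Temp-style generation: sample the same prompt at several temperatures,
    optionally conditioning on query facets for more specific questions."""
    prompt = f"Write one clarifying question for the search query: '{query}'."
    if facets:
        prompt += f" Focus on these facets: {', '.join(facets)}."
    # Temperature variation: higher temperatures yield more diverse candidates.
    return [ask_llm(prompt, temperature=t) for t in (0.2, 0.6, 1.0)]

def filter_questions(query: str, candidates: list[str]) -> list[str]:
    """Quality-based filtering: keep candidates an LLM judge deems clear and useful."""
    kept = []
    for q in candidates:
        verdict = ask_llm(
            f"Query: {query}\nCandidate clarifying question: {q}\n"
            "Reply 'yes' if the question is clear, relevant, and useful; otherwise 'no'.",
            temperature=0.0,
        )
        if verdict.lower().startswith("yes"):
            kept.append(q)
    return kept
```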
Key Insights 💡:
• Temperature variation method (GPT-Temp) produces highest quality questions
• Facet-based approaches enhance specificity but increase question complexity
• LLM-simulated answers match human answers in quality assessments
• CrowdLLM shows strong agreement with human evaluators
Results 📊:
• GPT-Temp achieves the highest NDCG@1 scores: 0.225 for BM25, 0.312 for BERT (see the NDCG@1 sketch after this list)
• CrowdLLM shows 89% agreement on naturalness evaluation
• LLM answers perform comparably to human answers in relevance (37.15% vs 37.34%)
• GPT-Temp significantly outperforms baseline in usefulness (mean difference = 3.781, p<0.001)
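For readers unfamiliar with the retrieval metric above: NDCG@1 reduces to the gain of the top-ranked document after the clarifying exchange, normalized by the best achievable top-1 gain. A minimal sketch; the linear gain function is an assumption, as some setups use 2**rel - 1 instead.

```python
def ndcg_at_1(ranked_rels: list[float], all_rels: list[float]) -> float:
    """NDCG@1: gain of the top-ranked document divided by the best achievable
    top-1 gain (the discount at rank 1 is log2(2) = 1)."""
    dcg = ranked_rels[0]
    idcg = max(all_rels)
    return dcg / idcg if idcg > 0 else 0.0
```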
AGENT-CQ is an end-to-end framework for automatically generating and evaluating clarifying questions in conversational search. It has two main stages:
• Generation stage with three phases:
  - Question generation using LLM prompting strategies
  - Question filtering based on quality criteria
  - Answer simulation to generate user responses (see the sketch after this list)
• Evaluation stage (CrowdLLM) that uses multiple LLM instances to assess question and answer quality on a comprehensive set of metrics
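The answer-simulation phase can be pictured as an LLM role-playing a user with configurable traits. The persona fields below are illustrative assumptions rather than the paper's exact parameterization, and the sketch reuses the hypothetical ask_llm helper from the generation sketch above.

```python
from dataclasses import dataclass

@dataclass
class SimulatedUser:
    expertise: str = "novice"       # assumed trait, e.g. "novice" or "expert"
    verbosity: str = "brief"        # assumed trait, e.g. "brief" or "detailed"
    cooperativeness: str = "high"   # assumed trait

def simulate_answer(query: str, question: str, user: SimulatedUser) -> str:
    """Answer simulation: prompt the LLM to answer as a user with the given traits."""
    prompt = (
        f"You are a {user.expertise} searcher who gives {user.verbosity} answers "
        f"and shows {user.cooperativeness} cooperativeness.\n"
        f"Your information need: '{query}'.\n"
        f"The system asks: '{question}'.\n"
        "Answer the question as this user would."
    )
    return ask_llm(prompt, temperature=0.7)
```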
🔍 CrowdLLM evaluation framework:
• Uses multiple LLM instances with varying temperature settings to simulate diverse human evaluators
• Evaluates questions on 7 metrics: clarification potential, relevance, specificity, usefulness, clarity, complexity, and overall quality
• Evaluates answers on 4 metrics: relevance, usefulness, naturalness, and overall quality
• Shows strong agreement with human evaluators on most dimensions
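A minimal sketch of how CrowdLLM-style judging could be wired up, again reusing the hypothetical ask_llm helper from the generation sketch; the 1-5 rating scale, per-metric prompts, and temperature grid are assumptions.

```python
from statistics import mean

QUESTION_METRICS = [
    "clarification potential", "relevance", "specificity",
    "usefulness", "clarity", "complexity", "overall quality",
]

def crowdllm_scores(query: str, question: str,
                    temperatures=(0.2, 0.5, 0.8, 1.0)) -> dict[str, float]:
    """Query several temperature-varied LLM 'judges' and average their
    per-metric ratings, mimicking a crowd of human annotators."""
    ratings = {m: [] for m in QUESTION_METRICS}
    for t in temperatures:
        for metric in QUESTION_METRICS:
            raw = ask_llm(
                f"Query: {query}\nClarifying question: {question}\n"
                f"Rate the question's {metric} on a 1-5 scale. Reply with the number only.",
                temperature=t,
            )
            try:
                ratings[metric].append(float(raw))
            except ValueError:
                pass  # skip malformed judgments
    return {m: mean(v) for m, v in ratings.items() if v}
```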