Automated semantic evaluation pipeline replaces manual search quality checks.
The paper proposes a new method for evaluating a content search system: measure the semantic match between a query and the results the system returns.
https://arxiv.org/abs/2410.21549
🎯 Original Problem:
LinkedIn's search system needs better ways to measure the semantic relevance of search results beyond traditional engagement metrics. Existing evaluation methods offer neither direct quality measurement nor automation.
-----
🔧 Solution in this Paper:
→ Introduces the On-Topic Rate (OTR) metric: the percentage of returned search results that are semantically relevant to the query
→ Builds an end-to-end evaluation pipeline that uses GPT-3.5 to judge query-document relevance (a minimal sketch follows below)
→ Constructs test query sets combining golden queries (top/topical) with an open set (trending/random)
→ Formulates precise GPT-3.5 prompts with a clear metric definition and explicit decision guidance
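A minimal sketch of how such a judge-and-score pipeline might look, assuming the OpenAI Python client. The prompt wording, model settings, and JSON output schema are illustrative stand-ins, not the paper's exact artifacts:

```python
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical prompt template -- the paper's exact wording is not reproduced here.
PROMPT_TEMPLATE = """You are judging a content search system.
Metric definition: a post is ON-TOPIC if it is semantically relevant to the
query, i.e. a member issuing this query would plausibly want to see it.

Query: {query}
Post: {post}

Respond with JSON:
{{"decision": "on-topic" or "off-topic", "score": <1-5>, "reason": "<one sentence>"}}"""


def judge_pair(query: str, post: str) -> dict:
    """Ask GPT-3.5 for a binary relevance decision, a graded score, and a reason."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        temperature=0,  # deterministic judgments aid reproducibility
        messages=[{"role": "user",
                   "content": PROMPT_TEMPLATE.format(query=query, post=post)}],
    )
    return json.loads(response.choices[0].message.content)


def on_topic_rate(query: str, posts: list[str]) -> float:
    """OTR = (# results judged on-topic) / (# results returned), per query."""
    judgments = [judge_pair(query, p) for p in posts]
    return sum(j["decision"] == "on-topic" for j in judgments) / len(posts)
```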
-----
💡 Key Insights:
→ Binary decisions combined with graded relevance scores provide more reliable evaluation (cross-checked in the sketch below)
→ Precise prompt engineering significantly impacts evaluation quality
→ Dynamic query sets keep the evaluation relevant as content trends shift
→ Decision reasons returned by the LLM help identify failure patterns
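Two small illustrations of these insights. The score threshold, the needs-review escalation, and exact-string reason tallying are my assumptions; a real pipeline would likely cluster or categorize free-text reasons before counting:

```python
from collections import Counter


def cross_check(judgment: dict, score_threshold: int = 3) -> str:
    """Accept the LLM's binary decision only when the numeric score agrees;
    disagreements are escalated for human review rather than trusted blindly.
    (A cut-off of 3 on the 1-5 scale is an assumed threshold.)"""
    score_on_topic = judgment["score"] >= score_threshold
    decision_on_topic = judgment["decision"] == "on-topic"
    return judgment["decision"] if score_on_topic == decision_on_topic else "needs-review"


def top_failure_reasons(judgments: list[dict], k: int = 5) -> list[tuple[str, int]]:
    """Tally the reasons attached to off-topic decisions to surface recurring
    failure modes (e.g. keyword-only matches, stale content)."""
    reasons = (j["reason"] for j in judgments if j["decision"] == "off-topic")
    return Counter(reasons).most_common(k)
```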
-----
📊 Results:
→ 81.72% consistency between GPT-3.5 and human evaluators
→ 94.5% accuracy on validation set of 600 query-post pairs
→ Successfully deployed in LinkedIn's production system for weekly monitoring (rollup sketch below)
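A hypothetical rollup, reusing on_topic_rate from the sketch above, of how per-query OTR could feed that weekly monitoring; search_fn and the query lists are placeholders standing in for LinkedIn-internal systems:

```python
def weekly_otr(search_fn, golden_queries: list[str], open_queries: list[str],
               top_k: int = 10) -> float:
    """Aggregate OTR over the combined query set, mirroring the dynamic
    query-set design: golden queries stay fixed as a stable baseline while
    the open set is refreshed with trending/random queries each week."""
    queries = golden_queries + open_queries
    per_query = [on_topic_rate(q, search_fn(q)[:top_k]) for q in queries]
    return sum(per_query) / len(per_query)
```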