JudgeBench, the benchmark proposed in this paper, stress-tests AI judges by having them evaluate complex factual responses, which exposes flaws in their reasoning.