Read time: 5 min
📚 Browse past editions here.
(I publish this newsletter daily. Noise-free, actionable, applied-AI developments only.)
Prompting with AI scales easily since it’s just typing.
Verifying doesn't scale the same way — checking AI’s output is harder and slower.
So what to do?
Opinion Piece: TL;DR
Typing a prompt is cheap; trusting the answer is not. Very recent research shows that detecting hallucinations and building verifiers, from simple consistency checks to cryptographic proofs, is becoming the new scaling frontier.
1 Prompting Is “Just Typing”
Firing text at a model costs little more than keystrokes and GPU minutes, so generation happily scales with bigger models and cheaper hardware. The hard part begins after the model answers, because correctness is a property you can’t sample—you have to prove it. Researchers call this the verification bottleneck because each extra token in the answer can hide a bug you must chase down manually or with tooling.
Sometimes you can spot-check outputs visually, which works well for UI, image, and video tasks.
But deeper tasks like text or code need real understanding to validate — you have to know enough to fix what’s wrong.
Researchers know this, which is why evals and hallucination detection are hot topics.
2 Verification Doesn’t Scale (Yet)
Several teams are attacking this bottleneck head-on.
Self-Verify RL. Zhang et al. train the model to critique its own output during learning; the generator and verifier share weights, so the cost is amortized over training and inference. arxiv.org
Heimdall. A “long chain-of-thought” verifier that re-derives math proofs and flags mismatches, pushing correctness on proof benchmarks above 97 %. arxiv.org
Cryptographic Pipelines. ACM researchers outline end-to-end proofs—from data collection to unlearning—so you can literally hand auditors a zero-knowledge certificate that the model behaved. dl.acm.org
These systems are promising, but they still struggle with domain depth; a chemistry answer, say, needs a chemist to spot a unit conversion error.
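To make the pattern concrete, here is a minimal sketch of the generate-then-verify loop these systems formalize in different ways. The `call_llm` helper and the prompt wording are assumptions for illustration, not any paper's actual API or training setup.

```python
# Minimal generate-then-verify loop (illustrative sketch, not any paper's exact method).

def call_llm(prompt: str) -> str:
    # Hypothetical helper: send a prompt to your model and return its text.
    raise NotImplementedError("wire this to your LLM provider of choice")

def generate_with_self_check(task: str, max_retries: int = 2) -> str:
    answer = call_llm(f"Solve the following task step by step:\n{task}")
    for _ in range(max_retries):
        verdict = call_llm(
            "You are a strict verifier. Re-derive the solution independently and "
            f"answer only PASS or FAIL.\nTask: {task}\nProposed answer: {answer}"
        )
        if verdict.strip().upper().startswith("PASS"):
            return answer
        # Regenerate, showing the model its rejected attempt.
        answer = call_llm(
            "The previous answer was rejected by a verifier. Try again.\n"
            f"Task: {task}\nRejected answer: {answer}"
        )
    return answer  # best effort; downstream layers should still check it
```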
3 Hallucination Detection: The Front-Line Filter
Because full verification is heavy, most production stacks first run hallucination detectors that flag “too good to be true” answers. Recent progress:
RACE (Reasoning and Answer Consistency Evaluation)
RACE is a black-box framework designed to detect hallucinations in large reasoning models by jointly evaluating both the reasoning trace and the final answer for consistency and coherence arxiv.org. It extracts critical reasoning steps and computes signals like inter-sample consistency, answer uncertainty, semantic alignment, and internal coherence to flag subtle logical errors even when the answer appears plausible.
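One of RACE's signals, inter-sample consistency, is easy to approximate yourself: sample the model several times and flag answers that disagree. A rough sketch, assuming a hypothetical `sample_llm` helper; this is not RACE's actual implementation, which also scores the reasoning trace itself.

```python
from collections import Counter

def sample_llm(prompt: str, n: int) -> list[str]:
    # Hypothetical helper: draw n independent samples at temperature > 0 from your model.
    raise NotImplementedError

def consistency_score(prompt: str, n: int = 5) -> float:
    """Fraction of samples that agree with the most common final answer."""
    answers = [a.strip().lower() for a in sample_llm(prompt, n)]
    return Counter(answers).most_common(1)[0][1] / n

# Treat low agreement (e.g. below 0.6) as a hallucination warning and escalate.
```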
HalluMix
HalluMix is a task-agnostic, multi-domain benchmark that brings together examples from a wide range of real-world scenarios—such as summarization, question answering, and inference—across multiple documents arxiv.org. By evaluating seven state-of-the-art hallucination detectors on this diverse dataset, the creators demonstrate significant performance gaps between short and long contexts, underlining the need for robust detection methods in realistic settings.
Mirage Study (“Evaluating Evaluation Metrics – The Mirage of Hallucination Detection”)
This large-scale empirical study examines six families of hallucination metrics across four datasets, 37 models from five families, and five decoding strategies to assess how well automated metrics align with human judgments arxiv.org. The authors reveal that many popular metrics are inconsistent and fail to generalize, while LLM-based evaluators—particularly GPT-4—achieve the best alignment with human evaluations.
SPOT (When AI Co-Scientists Fail: SPOT—a Benchmark for Automated Verification of Scientific Research)
SPOT pairs 83 published papers with 91 verified errors (e.g., proof mistakes, data inconsistencies) to evaluate LLMs’ ability to verify scientific manuscripts automatically arxiv.org. Across models, none surpasses 21.1 % recall or 6.1 % precision, highlighting the vast gap between current AI capabilities and the rigorous demands of academic verification.
Meta’s “Self-Taught Evaluator”
Meta’s Self-Taught Evaluator is an AI model trained entirely on synthetically generated “chain-of-thought” data to critique and verify other models’ outputs without human‐labeled annotations reuters.com. By iteratively generating contrasting model outputs and training on its own judgments, it matches or exceeds the performance of reward models trained with costly human feedback, pointing toward more autonomous AI verification processes.
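The core loop is roughly: generate contrasting candidate answers, ask the model to judge them with a written-out rationale, and train the next evaluator on the judgments it keeps. A hedged sketch of the judging step only; the `call_llm` helper and prompt are assumptions, not Meta's actual code.

```python
def call_llm(prompt: str) -> str:
    raise NotImplementedError  # hypothetical helper: send prompt to your model

def judge_pair(question: str, answer_a: str, answer_b: str) -> dict:
    """Ask the model for a reasoned judgment over two contrasting answers."""
    rationale = call_llm(
        "Compare the two answers below. Reason step by step, then finish with "
        f"'WINNER: A' or 'WINNER: B'.\nQuestion: {question}\n"
        f"Answer A: {answer_a}\nAnswer B: {answer_b}"
    )
    winner = "A" if "WINNER: A" in rationale.upper() else "B"
    return {"rationale": rationale, "winner": winner}

# In the self-taught setup, (question, answer pair, rationale, winner) tuples become
# synthetic training data for the next iteration of the evaluator, with no human labels.
```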
4 Toward Robust Evaluation Pipelines
A single detector is not enough; you need a layered pipeline (a minimal sketch follows this list):
Cheap sanity checks (token entropy, answer length).
Model-based detectors (RACE, HalluMix-tuned heads).
Specialist verifiers (Heimdall for math, retrieval-augmented fact checkers for news).
Human-in-the-loop or cryptographic proofs when the stakes demand it.
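Here is a minimal sketch of the first two layers, assuming your API returns per-token logprobs. `hallucination_detector` is a placeholder for whichever model-based check you plug in, and the thresholds are illustrative rather than tuned.

```python
def mean_token_surprisal(token_logprobs: list[float]) -> float:
    """Mean negative log-probability of the generated tokens: a cheap,
    entropy-like proxy for how uncertain the model was while writing."""
    return -sum(token_logprobs) / max(len(token_logprobs), 1)

def hallucination_detector(answer: str) -> float:
    raise NotImplementedError  # hypothetical: plug in a RACE-style or tuned detector

def layered_check(answer: str, token_logprobs: list[float]) -> str:
    # Layer 1: cheap sanity checks (answer length, token-level uncertainty).
    if len(answer.split()) < 3 or mean_token_surprisal(token_logprobs) > 2.5:
        return "escalate"
    # Layer 2: model-based detector; a score above threshold means likely hallucination.
    if hallucination_detector(answer) > 0.5:
        return "escalate"
    # Layers 3-4 (specialist verifier, human review or proofs) run only on escalations.
    return "accept"
```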
Cohere’s recent position paper argues we must widen the lens further and audit second-order social impacts, not just per-token accuracy. cohere.com Meanwhile, an algorithm reported in Nature that measures “semantic entropy” shows there is still room for simple statistical tricks.
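The semantic-entropy idea is simple enough to sketch: sample several answers, group the ones that mean the same thing, and compute entropy over the groups; high entropy means the model keeps changing its story. The `same_meaning` check and the greedy clustering below are a rough approximation of the published algorithm (which uses bidirectional entailment), not the authors' code.

```python
import math

def same_meaning(a: str, b: str) -> bool:
    # Hypothetical equivalence check; the published method uses bidirectional
    # entailment (a entails b AND b entails a), typically via an NLI model.
    raise NotImplementedError

def semantic_entropy(samples: list[str]) -> float:
    # Greedily cluster sampled answers into meaning-equivalence groups.
    clusters: list[list[str]] = []
    for s in samples:
        for cluster in clusters:
            if same_meaning(s, cluster[0]):
                cluster.append(s)
                break
        else:
            clusters.append([s])
    # Shannon entropy over the empirical cluster distribution; higher = less reliable.
    n = len(samples)
    return -sum((len(c) / n) * math.log(len(c) / n) for c in clusters)
```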
5 What This Means for Practitioners
Budget verification time. If your API call takes 1 s, expect the checker to take 10 × that in tough domains.
Benchmark detectors on your own data. Public leaderboards hide domain quirks.
Assume some hallucinations will slip through. Build rollback paths and user feedback hooks.
In short, scaling prompting got us magical talkative models; scaling verification will decide whether that magic is reliable enough for the real world.
That’s a wrap for today, see you all tomorrow.