Discussion about this post

David J. Friedman:

This is basically a formal version of what we’ve been building:

Frame declaration (seed describes behavior & expectations)

Scenario regeneration (fresh contexts to prevent “test memorization”)

Rollout transcripts (audit trail)

Judgment + meta-judgment (explicit scoring + quality checks)

It pairs beautifully with our ethic:

Seeing ≠ meaning ≠ truth

Bloom forces “what happened in the interaction” to be a first-class artifact (transcripts + scores), instead of vibes.
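
To make that concrete, here is roughly how I picture a seed-driven run, as a minimal Python sketch. The names and helpers are hypothetical, not Bloom's actual API; it just mirrors the four stages above.

```python
from dataclasses import dataclass
import random

@dataclass
class Seed:
    """Frame declaration: the behavior being probed and what counts as eliciting it."""
    behavior: str
    expectations: str
    rng_seed: int = 0  # fixed so the whole run can be replayed

def regenerate_scenarios(seed: Seed, n: int) -> list[str]:
    """Scenario regeneration: fresh contexts each run, so a fixed test set can't be memorized."""
    rng = random.Random(seed.rng_seed)
    templates = ["A user asks about {}.", "A colleague pushes back on {}."]  # stand-in templates
    return [rng.choice(templates).format(seed.behavior) for _ in range(n)]

def rollout(scenario: str) -> str:
    """Rollout: run the target model and keep the full transcript as the audit trail."""
    return f"[transcript for: {scenario}]"  # placeholder for a real model call

def judge(transcript: str, seed: Seed) -> float:
    """Judgment: explicit score for whether the expected behavior surfaced."""
    return 1.0  # placeholder for a rubric or LLM judge

def meta_judge(score: float, transcript: str) -> bool:
    """Meta-judgment: quality-check the judge's output before trusting the score."""
    return 0.0 <= score <= 1.0

def run(seed: Seed, n: int = 20) -> list[tuple[str, float]]:
    results = []
    for scenario in regenerate_scenarios(seed, n):
        transcript = rollout(scenario)
        score = judge(transcript, seed)
        if meta_judge(score, transcript):
            results.append((transcript, score))  # transcripts + scores are the artifact, not vibes
    return results
```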

Neural Foundry:

The seed configuration reproducibility piece is underrated here. Being able to replay the exact same evaluation setup even with stochastic generation solves a huge versioning headache that most teams just accept as normal. The elicitation rate metric is interesting because it sidesteps the binary pass/fail trap and gives you a distribution view of how reliably the behavior actually surfaces. I've been tracking similar patterns with prompt leakage into training data, and the decay curve for fixed test sets is way steeper than most benchmarks admit.
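
To spell out why the distribution view matters: the elicitation rate is the fraction of rollouts in which the judge says the behavior surfaced, so you can attach an uncertainty interval to it instead of collapsing everything to pass/fail. A rough sketch, assuming binary judge labels per rollout (the helper name is hypothetical):

```python
import math

def elicitation_rate(judgments: list[bool]) -> tuple[float, tuple[float, float]]:
    """Fraction of rollouts where the behavior surfaced, with a normal-approximation 95% CI."""
    n = len(judgments)
    p = sum(judgments) / n
    half_width = 1.96 * math.sqrt(p * (1 - p) / n)
    return p, (max(0.0, p - half_width), min(1.0, p + half_width))

# e.g. 20 fresh rollouts from the same seed, each judged for the target behavior
rate, ci = elicitation_rate([True] * 13 + [False] * 7)
print(f"elicitation rate: {rate:.2f}, 95% CI ~ ({ci[0]:.2f}, {ci[1]:.2f})")
```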
