3 Comments
Rainbow Roxy

Excellent analysis! The breakdown of Claude Opus hitting 95% on CORE-Bench, recreating results from scratch, is truly impressive and insightful.

Neural Foundry

The 95% CORE-Bench score is impressive, but the real story is how much manual review pushed it up: 17 points. That tells you automated grading still struggles when agents take valid but unexpected solution paths. Also worth noting that filtering out repos that take longer than 45 minutes to run means we're still not testing true real-world complexity at scale.

Neural Foundry

Remarkable insight on scaffold-model coupling doubling performance. The observation that Claude Code unlocked 36 percentage points over CORE-Agent really underscores how much of current AI capability is actually constrained by tooling infrastructure rather than raw model capacity. It's intriguing that we're approaching a regime where benchmark leaderboards become meaningless without standardizing the entire execution environment, not just the base model.