Claude 4.5 crushes CORE-Bench, Google fuses RNNs+Transformers, OpenAI forced to share logs, Codex improves bug-catching, Anthropic intros AI interviewer, and buys Neptune.
The 95% CORE-Bench score is impressive, but the real story is how much manual review pushed it up 17 points. That tells you automated grading still struggles when agents take valid but unexpected solution paths. Also worth noting that filtering out repos that take longer than 45 minutes to run means we're still not testing true real-world complexity at scale.
Remarkable insight on scaffold-model coupling doubling performance metrics. The observation that Claude Code unlocked 36 percentage points over CORE-Agent really underscores how much of current AI capability is actually constrained by tooling infrastructure rather than raw model capacity. It's intriguing that we're approaching a regime where benchmark leaderboards become meaningless without standardizing the entire execution environment, not just the base model.
Excellent analysis! The breakdown of Claude Opus hitting 95% on CORE-Bench, recreating results from scratch, is truly impressive and insightful.