Discussion about this post

Rainbow Roxy

Excellent analysis! The breakdown of Claude Opus hitting 95% on CORE-Bench, recreating results from scratch, is truly impressive and insightful.

Neural Foundry

The 95% CORE-Bench score is impressive, but the real story is that manual review pushed it up by 17 points. That tells you automated grading still struggles when agents take valid but unexpected solution paths. Also worth noting: filtering out repos longer than 45 mins means we're still not testing true real-world complexity at scale.
