Claude Opus 4.5 scored a massive 95% on CORE-Bench, a benchmark that tests agents on scientific reproducibility tasks
Claude 4.5 crushes CORE-Bench, Google fuses RNNs+Transformers, OpenAI forced to share logs, Codex improves bug-catching, Anthropic intros AI interviewer, and OpenAI buys Neptune.
Read time: 10 min
📚 Browse past editions here.
(I publish this newsletter daily. Noise-free, actionable, applied-AI developments only.)
⚡In today’s Edition (5-Dec-2025):
Claude Opus 4.5 scored a massive 95% on CORE-Bench, a benchmark that tests agents on scientific reproducibility tasks
Google published Titans at NeurIPS 2025, a new architecture that combines the speed of RNNs with the performance of Transformers.
🔍 A judge just ordered OpenAI to hand over 20 million anonymized ChatGPT logs to the New York Times and other publishers in their copyright fight.
📡Anthropic launches AI interviewer tool to study professional AI use.
🧑💻 OpenAI explains the techniques that helped Codex use repository tools to boost code review accuracy and catch important bugs with minimal noise.
👨🔧 OpenAI acquires AI tooling provider Neptune to enhance its model training workflows.
Claude Opus 4.5 hits 95% on CORE-Bench Hard, which, like PaperBench, checks if an AI can actually read a research paper, write the code, run tests, and recreate the results completely from scratch.
CORE-Bench, run on the Holistic Agent Leaderboard (HAL), checks if an agent can clone a paper’s repo, install dependencies, run code, and answer questions across computer science, social science, and medicine.
For comparison, GPT-5.1 Codex Max manages around 40% on PaperBench.
Switching from the generic CORE-Agent scaffold to Claude Code lifted Opus 4.5 from 42% to 78% even before any fixes.
Manual review then added +17 points by correcting edge-case grading and underspecified tasks, bringing it to 95%.
The review marked 8 tasks correct and dropped 1 task whose dataset link had rotted.
Only 2 tasks still fail, both tied to messy package installs and picking the right result. A scaffold is the harness around an LLM: the prompts, tools, memory, scratchpads, and run control that structure multi-step work.
Model-scaffold coupling is real: Opus 4.1 scores 51% with CORE-Agent, beating its own Claude Code run, while Sonnet 4 goes 33% vs 47% and Sonnet 4.5 goes 44% vs 62% in favor of Claude Code.
Automated grading handled roughly 80% of cases, but near saturation strong agents take valid alternate paths that require human verification. The benchmark is still constrained because tasks whose reproduction takes longer than 45 minutes were filtered out and only selected results are checked, so real-world reproduction can be tougher.
The Holistic Agent Leaderboard (HAL) will open a private test set and expand to larger real-paper reproduction at scale. Reported scores should always specify the scaffold because swapping it can double accuracy and change rankings.
So all this means scaffold quality now gates capability more than many expect. The same harness that plans steps, calls tools, runs code, retries on errors, and extracts the final answer can cap or unlock what an identical model can actually do.
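To make that concrete, here is a minimal sketch of what such a harness looks like, assuming a hypothetical call_llm client and a single toy shell tool; CORE-Agent and Claude Code are far more elaborate, but the shape of the loop is the same.

```python
import json
import subprocess

# Hypothetical LLM client, a stand-in for whatever model API the scaffold wraps.
def call_llm(messages):
    raise NotImplementedError("plug in a real model call here")

def run_shell(cmd):
    """Toy tool: run a shell command and return (truncated) combined output."""
    proc = subprocess.run(cmd, shell=True, capture_output=True, text=True, timeout=300)
    return (proc.stdout + proc.stderr)[-4000:]

TOOLS = {"run_shell": run_shell}

def scaffold(task, max_steps=20):
    """Bare-bones agent loop: plan, call tools, feed errors back, extract the answer."""
    messages = [
        {"role": "system", "content": "Reproduce the paper result. Reply with JSON: "
                                      '{"tool": ..., "args": {...}} or {"answer": ...}.'},
        {"role": "user", "content": task},
    ]
    for _ in range(max_steps):
        reply = call_llm(messages)                  # model decides the next step
        action = json.loads(reply)
        if "answer" in action:                      # final answer extraction
            return action["answer"]
        try:
            observation = TOOLS[action["tool"]](**action["args"])
        except Exception as exc:                    # retry path: the model sees the error
            observation = f"Tool failed: {exc}"
        messages.append({"role": "assistant", "content": reply})
        messages.append({"role": "user", "content": observation})
    return "no answer within the step budget"
```

Real scaffolds differ in exactly these details, which tools they expose, how they compress long outputs, and when they retry or give up, which is why swapping the scaffold alone can swing a benchmark score.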
So what does this tell us?
Scaffolds, or the way context is built around an LLM, matter much more than we thought, at least as important as the model itself, maybe even more.
When agents get this advanced, manual evaluation becomes necessary, since automated grading can’t always recognize the many clever or unusual ways an agent might solve a problem.
Agents can now reproduce parts of scientific papers, which is huge, but the benchmark still has limits. It only covers tasks that take under 45 minutes, and only asks for specific parts of a paper’s results, not the entire study.
Google published Titans at NeurIPS 2025, a new architecture that combines the speed of RNNs with the performance of Transformers.
Titans replaces the usual fixed recurrent state with a deep neural memory network that summarizes the stream and feeds its summary into attention. Gradients act as a surprise score, so only unexpected tokens update this memory, turning inference into an online learning loop.
Momentum pulls nearby tokens in after a surprise, while adaptive weight decay forgets stale content and keeps memory bounded. MIRAS views Titans, Transformers, and RNNs as versions of one associative memory system defined by the memory architecture, attention objective, retention rule, and optimizer.
This view allows losses beyond mean squared error and plain dot product, which produces attention-free variants like YAAD, MONETA, and MEMORA with more robust long-range behavior. Across C4, WikiText, HellaSwag, and PIQA, Titans and MIRAS variants reach lower perplexity and better reasoning than Transformer++ and linear recurrent baselines like Mamba 2 and Gated DeltaNet while keeping linear-time inference.
On extreme long-context tests such as BABILong, Titans maintains high accuracy past 2M tokens and sometimes beats larger GPT-4-style baselines. This work turns long context from a context window trick into a careful choice of memory, loss, and optimization.
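As a rough illustration of the surprise-driven memory update described above, here is a hedged PyTorch sketch, a simplification rather than the paper's exact parameterization: the memory is a tiny MLP trained online, the gradient of its loss on the new token is the surprise signal, momentum carries surprise across nearby tokens, and a decay term keeps the memory bounded.

```python
import torch

# Simplified Titans-style neural memory (illustrative shapes, not the paper's exact ones).
d = 64
memory = torch.nn.Sequential(torch.nn.Linear(d, d), torch.nn.SiLU(), torch.nn.Linear(d, d))
momentum = [torch.zeros_like(p) for p in memory.parameters()]

def update_memory(k_t, v_t, eta=0.9, theta=0.1, alpha=0.01):
    """One online step: S_t = eta*S_{t-1} - theta*grad,  W_t = (1-alpha)*W_{t-1} + S_t."""
    loss = torch.nn.functional.mse_loss(memory(k_t), v_t)     # "surprise" objective
    grads = torch.autograd.grad(loss, list(memory.parameters()))
    with torch.no_grad():
        for p, s, g in zip(memory.parameters(), momentum, grads):
            s.mul_(eta).add_(g, alpha=-theta)                  # momentum over surprise
            p.mul_(1 - alpha).add_(s)                          # decay forgets stale content
    return loss.item()                                         # high loss = surprising token

# Usage sketch: for each token, form key/value projections and call update_memory(k_t, v_t);
# at read time, memory(query) supplies a summary that is fed into attention.
```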
🔍 A judge just ordered OpenAI to hand over 20 million anonymized ChatGPT logs to the New York Times and other publishers in their copyright fight.
OpenAI said 99.99% of the chats are unrelated and warned that sharing them exposes sensitive user data, even when conversations have nothing to do with news. The judge rejected that argument, pointing to strong de-identification and a protective order, so the logs are anonymized and only used inside the case. The ruling effectively says courts can demand large-scale AI usage data when copyright is at stake, which pushes OpenAI and peers toward privacy-aware logging and auditing by design.
📡Anthropic launches AI interviewer tool to study professional AI use.
It conducts automated interviews to understand how people use AI in their professional lives.
It interviewed 1,250 people across general jobs, creative fields, and science roles, and it measured who saves time, where trust breaks, and how work is shifting. In the general workforce, 86% say AI saves time, 65% are satisfied, 69% note stigma, 55% feel anxiety, and 48% expect to shift toward supervising AI systems.
Self-reports show 65% augmentation and 35% automation, but observed Claude chats show 47% augmentation and 49% automation, revealing a gap between perception and practice. Among creatives, 97% report time saved and 68% report quality gains, yet 70% manage peer judgment while wrestling with control boundaries and market pressure. Scientists keep AI to literature review and coding, with 79% citing trust and reliability issues, 27% citing technical limits, and 91% wanting hypothesis and experiment support.
🧑💻 OpenAI explains the techniques that helped Codex use repository tools to boost code review accuracy and catch important bugs with minimal noise.
Rather than just seeing the diff, the model gets repository-wide context plus tools to search files, run tests, and execute snippets, and it is fine-tuned specifically for review. In human evaluation on open source commits, the diff-only GPT-5 baseline shows about 21.5% incorrect comments, repo context alone cuts that to 14.3%, and the tuned Codex reviewer drops it to around 7.1% incorrect while averaging about 0.6 comments per pull request.
Verification is cheaper than generation because the reviewer only needs targeted reasoning and a few tool calls to falsify patches, and in deployment about 52.7% of its comments lead to code changes. The team trains this reviewer separately from strict reward models, since training verifiers can lean on rich metadata and tolerate many false positives while the shipped reviewer must infer intent from messy human code without annoying users.
At scale the system now reviews more than 100K pull requests per day with over 80% positive reactions, giving a concrete template for making AI written code reviewable at company scale. The main open worry is whether this verification gap stays ahead as both generator and reviewer share the same base model family, so ongoing real world measurement will matter more here than leaderboard scores.
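To make the "repo context beats diff-only" point concrete, here is a hedged sketch of one way to assemble that wider context before any model call; the git-grep heuristics below are my illustration, not OpenAI's described pipeline.

```python
import re
import subprocess

def changed_files(diff):
    """Pull file paths out of a unified git diff."""
    return re.findall(r"^\+\+\+ b/(.+)$", diff, flags=re.MULTILINE)

def related_files(symbol, max_hits=5):
    """Find other files that mention a symbol, so the reviewer sees callers and tests."""
    out = subprocess.run(["git", "grep", "-l", symbol], capture_output=True, text=True)
    return out.stdout.splitlines()[:max_hits]

def build_review_context(diff):
    """Diff-only review sees just the patch; repo-aware review also packs in the full
    touched files plus anything that references the changed modules."""
    parts = [f"### DIFF\n{diff}"]
    for path in changed_files(diff):
        with open(path) as f:
            parts.append(f"### FILE {path}\n{f.read()}")
        symbol = path.rsplit("/", 1)[-1].split(".")[0]   # crude: module name as the symbol
        for hit in related_files(symbol):
            parts.append(f"### REFERENCED BY {hit}")
    return "\n\n".join(parts)
```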
👨🔧 OpenAI acquires AI tooling provider Neptune to enhance its model training workflows.
Neptune builds tools that log every training run, store all metrics, hyperparameters and artifacts, then let researchers compare runs side by side to see which setup worked and where things broke. For systems at OpenAI scale, this kind of high-granularity training telemetry helps catch issues like unstable loss curves, bad data slices or misconfigured optimizers early, which saves a lot of compute spend and avoids shipping broken behaviors.
Owning Neptune rather than just being a customer means OpenAI can wire this logging layer directly into its internal training stack, customize it for its own pipelines and hardware, and connect it tightly with evaluation, safety checks, and deployment tooling. As part of the deal, Neptune will stop serving outside companies, so teams at places like Samsung or HP that relied on Neptune’s experiment tracking will need to migrate to other platforms or rebuild similar tooling themselves. The deal shows how strategic training infrastructure has become, since whoever controls the observability and debugging layer can move faster on new model families and iterate with less guesswork.
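For readers who have not used an experiment tracker, here is a minimal sketch of the workflow this kind of tooling covers, written against a hypothetical RunLogger class rather than Neptune's actual client API.

```python
import json
import time
import uuid
from pathlib import Path

class RunLogger:
    """Hypothetical stand-in for an experiment tracker: one folder per run,
    holding hyperparameters and per-step metrics for later side-by-side comparison."""
    def __init__(self, root="runs", **hyperparams):
        self.dir = Path(root) / f"{int(time.time())}-{uuid.uuid4().hex[:6]}"
        self.dir.mkdir(parents=True)
        (self.dir / "params.json").write_text(json.dumps(hyperparams, indent=2))
        self.metrics = []

    def log(self, step, **metrics):
        self.metrics.append({"step": step, **metrics})
        (self.dir / "metrics.json").write_text(json.dumps(self.metrics))

# Usage: track every run, then diff params.json and metrics.json across run folders
# to see which setup worked and where a loss curve went unstable.
run = RunLogger(lr=3e-4, batch_size=512, optimizer="adamw")
for step in range(3):
    run.log(step, loss=1.0 / (step + 1), grad_norm=0.5)
```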
That’s a wrap for today, see you all tomorrow.






