Anthropic just disclosed that Claude now writes more than 80% of the production code it merges
Read time: 8 min
(I publish this newsletter daily: noise-free, actionable, applied-AI developments only.)
⚡In today’s Edition (6-Jun-2026):
🗞️ Anthropic says 80% of its new production code is now authored by Claude
🗞️ Today’s Sponsor: Tencent WorkBuddy is now becoming China’s #1 PC-based productivity AI agent.
🗞️ New Google paper shows general LLMs can solve formal math by planning proofs and checking each step. Raised general LLM performance from under 10% to 70%
🗞️ Google’s new open source Gemma 4 12B can analyze audio and video while running fully locally on a consumer 16GB GPU
🗞️ Alibaba’s Qwen3.7-Plus supports text, video, and image inputs at a low price of $0.4/$1.6 per 1M tokens, though it remains proprietary.
🗞️ Anthropic’s new chemistry report has a genuinely wild result.
🗞️ Anthropic says 80% of its new production code is now authored by Claude
Before Claude Code reached research preview in February 2025, Claude wrote only a low-single-digit percentage of merged code. Since then, output per engineer has risen to 8x the 2024 baseline.
The shift comes from agents that edit files, run tests, inspect failures, spawn helper agents, and keep working across longer tasks instead of only suggesting snippets.
Anthropic says reliable task length is doubling about every 4 months, with Mythos Preview reaching at least 16 hours and open-ended Claude Code success hitting 76%.
In other words, Claude Mythos Preview could stay useful on a task that would take a skilled human roughly 16 hours of work.
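The doubling claim is easy to turn into arithmetic. Below is a minimal sketch, assuming clean exponential growth with a fixed 4-month doubling period (a simplification, since the report only says "about every 4 months"). The function name and the projection horizons are my own illustration, not figures from Anthropic.

```python
def projected_task_hours(start_hours: float, months_ahead: float,
                         doubling_months: float = 4.0) -> float:
    """Extrapolate reliable task length under a fixed doubling period.

    Assumes smooth exponential growth, which is a simplification.
    """
    return start_hours * 2 ** (months_ahead / doubling_months)


if __name__ == "__main__":
    # 16 hours is the Mythos Preview figure quoted above; the horizons are hypothetical.
    for months in (0, 4, 8, 12):
        print(f"+{months:>2} months: {projected_task_hours(16, months):.0f} hours")
```

If the trend held for a full year, a 16-hour horizon would become 128 hours, which is why a 4-month doubling period matters more than any single data point.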
The remaining human edge is research judgment: choosing the right problem, trusting the right result, and knowing when an experiment is dead.
In isolated tests for accelerating AI training code, Anthropic’s internal Mythos Preview model delivered a 52x speedup.
Long-duration capability tests also show that models such as Claude Opus 4.6 can steadily handle 12-hour tasks, while Claude Mythos Preview goes beyond 16 hours of nonstop problem-solving. Internally, the technological leap is even more stark. On highly complex, open-ended engineering problems where clear specifications are initially absent, Claude’s success rate climbed to 76% in May 2026 — a 50-point increase in a six-month window.
In the same report, Anthropic also called for a global way to slow frontier AI because its own models may be approaching recursive self-improvement, where a system helps build a stronger version of itself without direct human control.
They said future models will become so good at research, experiments, debugging, and training design that humans will stop being the main bottleneck.
Once that loop starts, progress could shift from human-paced engineering to machine-assisted improvement, which makes every safety test, law, and lab policy feel late by default.
Anthropic says this has not happened yet, but warns that the jump may arrive before governments, companies, and researchers have a trusted way to measure or restrain it.
The hard part is verification, because a huge AI training run is easier to hide than a weapons site, and any lab that secretly keeps training while others pause could gain the lead.
Anthropic is now valued at roughly $1T, may reach $50B in annualized revenue, and competes fiercely with OpenAI, so every safety claim also lands inside a giant business fight.
🗞️ Today’s Sponsor: Tencent WorkBuddy is now becoming China’s #1 PC-based productivity AI agent.
China’s most popular desktop AI agent, now available worldwide. Tencent WorkBuddy
Tell it what you need, and it reads files, calls tools, writes reports, builds decks, analyzes data, and draws on 100+ expert roles.
It connects to GitHub, Jira, Notion, Gmail, Google Drive, Slack and more through MCP, runs tasks in a sandbox, and can even be controlled from Slack, Telegram, Discord, or WeChat when you are away from your desk.
WorkBuddy breaks a big task into smaller jobs, picks the right skills or connected apps for each job, and for complex work it can use Expert Teams where multiple specialized sub-agents work in parallel while 1 lead agent coordinates the final output.
So if you ask for a report, it is not just generating text. It can read the file, send the data-analysis part to an analyst-style expert, send the writing part to another expert, use connectors like Google Drive or Gmail if needed, and then combine everything into a finished file.
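The lead-agent pattern described above can be sketched in a few lines. This is a toy illustration only, not WorkBuddy's API: `analyst_expert`, `writer_expert`, and `lead_agent` are hypothetical names, and the "experts" are plain functions standing in for model or tool calls.

```python
from concurrent.futures import ThreadPoolExecutor


# Hypothetical stand-ins for specialized sub-agents; a real system would call a model or tool here.
def analyst_expert(data: list[float]) -> str:
    return f"mean={sum(data) / len(data):.1f}, max={max(data):.1f}"


def writer_expert(topic: str) -> str:
    return f"Summary of {topic}."


def lead_agent(topic: str, data: list[float]) -> str:
    """Fan the sub-tasks out in parallel, then merge the results into one report."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        analysis = pool.submit(analyst_expert, data)
        prose = pool.submit(writer_expert, topic)
        return f"{prose.result()} Key figures: {analysis.result()}"


if __name__ == "__main__":
    print(lead_agent("Q2 sales", [10.0, 20.0, 30.0]))
```

The point of the pattern is that the coordinator owns the final output while the workers stay narrow, so each sub-task can run independently.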
📌 Here are a few practical use cases you can try immediately with it.
Read PDFs, images, and documents, then organize the extracted content.
Create reports, proposals, manuals, and presentations from raw material.
Analyze spreadsheets, find trends, and turn data into charts.
Create platform-ready posts, scripts, articles, and content ideas.
Automatically research news and send scheduled summaries to your channels.
Run desktop tasks from Slack on your phone.
Manage Calendar and Drive tasks directly through conversation.
Build working apps without needing you to code.
Turn repeated workflows into reusable WorkBuddy skills.
For my own workflow, I installed Tavily AI Search because I post a lot about research papers on X, and paper content needs outside context: project pages, GitHub repos, author links, related papers, previous methods, and the reason a paper is worth posting about. See my workflow with Tencent WorkBuddy here.
🗞️ New Google paper shows general LLMs can solve formal math by planning proofs and checking each step. Raised general LLM performance from under 10% to 70%.
A general LLM failed badly when asked to write full formal proofs in 1 try, but became much stronger when it planned, split the work into smaller claims, reused past claims, and learned from Lean’s feedback.
The key idea is that the model does not try to write one giant perfect proof at once, because that usually fails on long and tricky problems.
Instead, LEAP stores the proof as a graph of goals and subgoals, so useful lemmas can be reused instead of rediscovered every time.
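To make the "graph of goals and subgoals" idea concrete, here is a minimal sketch of a proof graph with lemma reuse. It is not LEAP's implementation: `ProofGraph`, its fields, and the `check` callback (a stand-in for Lean's verdict on a single step) are all hypothetical, and the sketch assumes the graph has no cycles.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class ProofGraph:
    """Goals point to the subgoals they depend on; proved goals are cached."""
    subgoals: dict[str, list[str]] = field(default_factory=dict)
    proved: set[str] = field(default_factory=set)
    checker_calls: int = 0

    def add_goal(self, goal: str, depends_on: Optional[list[str]] = None) -> None:
        self.subgoals[goal] = depends_on or []

    def prove(self, goal: str, check: Callable[[str], bool]) -> bool:
        if goal in self.proved:          # reuse a lemma proved earlier
            return True
        # Prove every dependency first, then ask the checker about this goal.
        if not all(self.prove(dep, check) for dep in self.subgoals.get(goal, [])):
            return False
        self.checker_calls += 1
        if check(goal):
            self.proved.add(goal)
            return True
        return False


if __name__ == "__main__":
    graph = ProofGraph()
    graph.add_goal("lemma_bound")
    graph.add_goal("case_even", ["lemma_bound"])
    graph.add_goal("case_odd", ["lemma_bound"])
    graph.add_goal("theorem", ["case_even", "case_odd"])
    ok = graph.prove("theorem", check=lambda goal: True)  # stub checker accepts everything
    print(ok, graph.checker_calls)
```

Here `lemma_bound` is needed by both cases but checked only once, so the stub run makes 4 checker calls instead of 5. On long proofs with many shared lemmas, that saving compounds.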
The authors tested LEAP on Putnam 2025 and a new Lean benchmark built from 60 IMO-style problems, where ordinary one-shot proof writing did very poorly.
LEAP solved all 12 Putnam 2025 problems and raised general LLM performance on the Lean IMO benchmark from under 10% to 70%.
🗞️ Google’s new open source Gemma 4 12B can analyze audio and video while running fully locally on a consumer 16GB GPU
Google has released Gemma 4 12B, an open multimodal model that can read text, images, audio, and video while running on a normal laptop with 16GB of memory.
Most multimodal models work like a relay team, where one model turns images into machine-friendly features, another model turns audio into features, and then the LLM reasons over those converted signals.
Gemma 4 12B removes that relay step by sending image patches and audio signals directly into the main LLM backbone, using lightweight projection layers instead of full vision and audio encoders.
That architecture is the real story, because encoders usually add memory cost, latency, engineering complexity, and extra parts that must be trained or tuned separately.
For vision, Google replaces the normal vision encoder with a tiny embedding module built around a single matrix multiplication, positional information, and normalization, so the main model learns more of the visual processing itself.
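The vision path described above (one matrix multiplication, positional information, normalization) can be sketched in plain Python. This is a toy under stated assumptions, not Google's code: the function name, the toy dimensions, and the choice of RMS normalization are mine, since the source only says "normalization".

```python
import math
import random


def embed_patch(patch: list[float], weight: list[list[float]],
                position: list[float], eps: float = 1e-6) -> list[float]:
    """Project one flattened patch into the model width, add position, normalize."""
    # Single matrix multiplication: (patch_dim,) x (patch_dim, d_model) -> (d_model,)
    projected = [sum(p * w for p, w in zip(patch, column)) for column in zip(*weight)]
    # Add positional information so the backbone knows where the patch came from.
    shifted = [x + pos for x, pos in zip(projected, position)]
    # RMS normalization (an assumed choice; the source only says "normalization").
    rms = math.sqrt(sum(x * x for x in shifted) / len(shifted) + eps)
    return [x / rms for x in shifted]


if __name__ == "__main__":
    random.seed(0)
    patch_dim, d_model = 12, 4   # toy sizes, e.g. a 2x2 RGB patch
    patch = [random.random() for _ in range(patch_dim)]
    weight = [[random.gauss(0, 0.1) for _ in range(d_model)] for _ in range(patch_dim)]
    position = [0.01 * i for i in range(d_model)]
    print(len(embed_patch(patch, weight, position)))
```

Everything a conventional vision encoder would do beyond this step has to be learned by the main model's own layers, which is the trade the architecture makes.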
For audio, Google goes further by removing the audio encoder entirely and projecting the raw signal into the same internal space where text tokens live.
The result is a 12B-parameter model that sits between small edge models and the larger 26B MoE model, with benchmark scores close to the bigger system while using much less memory.
The benchmark: Gemma 4 12B stays close to Gemma 4 26B on GPQA Diamond, MMLU Pro, LiveCodeBench, DocVQA, InfoVQA, MMMU Pro, and MRCR, while being far smaller.
The practical win is local AI that can listen, see, reason, call tools, and handle long inputs without sending private files to a cloud service.
Google also made Gemma 4 much easier to run on phones and laptops by releasing QAT (Quantization-Aware Training) checkpoints that shrink the smallest model from 11.4GB to 1.1GB, or 0.84GB for text-only use.
Quantization means storing model numbers in fewer bits, which cuts memory and often speeds up token generation, but standard PTQ (Post-Training Quantization) compresses after training and can damage quality because the model never learned to survive that rounding.
QAT fixes this by simulating compression during training, so Gemma 4 learns while its weights are being squeezed, making the final compressed model less likely to lose reasoning quality.
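The rounding step at the heart of both approaches is small enough to show directly. This is a minimal sketch of symmetric "fake quantization", assuming one scale per tensor; the function name is hypothetical and this is not Gemma's actual recipe. PTQ applies this rounding once after training, while QAT commonly runs it inside every forward pass so the loss sees the rounding error (gradient handling is omitted here).

```python
def fake_quantize(weights: list[float], bits: int = 4) -> list[float]:
    """Round weights to a symmetric integer grid, then map them back to floats."""
    levels = 2 ** (bits - 1) - 1                 # e.g. 7 for signed 4-bit
    # One scale for the whole tensor; fall back to 1.0 if every weight is zero.
    scale = max(abs(w) for w in weights) / levels or 1.0
    return [round(w / scale) * scale for w in weights]


if __name__ == "__main__":
    original = [7.0, -3.0, 1.2, 0.0]
    squeezed = fake_quantize(original, bits=4)
    print(squeezed)
    print([abs(a - b) for a, b in zip(original, squeezed)])  # per-weight rounding error
```

In this toy case only the 1.2 weight moves (to 1.0), but across billions of weights those small shifts add up, which is why training with the rounding in the loop tends to hold quality better.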
Google also built a mobile-focused format with static activations, channel-wise quantization, targeted 2-bit quantization, and KV cache optimization, which means the phone does less scaling work, stores some token-generation parts more aggressively, and keeps long chats from eating memory too fast.
The big shift is that local LLMs are moving from “can a laptop load this?” to “can a phone run this smoothly enough to be useful?”
🗞️ Alibaba’s Qwen3.7-Plus supports text, video, and image inputs at a low price of $0.4/$1.6 per 1M tokens, though it remains proprietary.
Alibaba’s Qwen3.7-Plus is built to be a low-cost multimodal agent model that can read screens, understand images and video, write code, use tools, and move between GUI and CLI workflows in one loop.
Its specialty is not just benchmark strength, but the way it combines visual perception, coding, tool calling, and long-context reasoning inside a single model that can act on software environments rather than only answer questions.
A normal coding model is strong when the task is written as text, but Qwen3.7-Plus is aimed at tasks where the model must inspect a screenshot, understand buttons or terminal output, decide the next step, write code, run commands, and continue without losing track.
The standout feature is its 1M-token context window, which means it can hold huge repositories, long logs, docs, UI histories, and multi-step agent traces in memory during one job.
The other important feature is 256K tokens for internal thinking, which gives the model more room to plan through long tool-use tasks before taking actions.
The benchmark: Qwen3.7-Plus is strongest on Terminal-Bench 2.0, ScreenSpot Pro, MCP-Mark, BFCLv4, and MMBC, meaning terminal coding, screen understanding, tool use, and multimodal agent work are its real focus.
The pricing is also central, because $0.40 input and $1.60 output per 1M tokens makes repeated agent loops much cheaper than many flagship models.
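A quick back-of-envelope calculation shows why the price matters for agents. This sketch uses only the two quoted prices; the step count and token sizes are hypothetical, and it ignores any caching discounts or tiered pricing the API may have.

```python
INPUT_USD_PER_M = 0.40    # quoted input price per 1M tokens
OUTPUT_USD_PER_M = 1.60   # quoted output price per 1M tokens


def loop_cost_usd(steps: int, input_tokens_per_step: int, output_tokens_per_step: int) -> float:
    """Cost of an agent loop where every step sends context and receives a reply."""
    input_cost = steps * input_tokens_per_step * INPUT_USD_PER_M / 1_000_000
    output_cost = steps * output_tokens_per_step * OUTPUT_USD_PER_M / 1_000_000
    return input_cost + output_cost


if __name__ == "__main__":
    # Hypothetical job: 50 steps, each re-sending 20K tokens of context and producing 1K tokens.
    print(f"${loop_cost_usd(50, 20_000, 1_000):.2f}")
```

Under those assumptions a 50-step job costs about $0.48, and most of that comes from re-sending context as input on every step.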
The catch is that Qwen3.7-Plus is closed and API-only, so companies cannot run the weights privately inside their own secure infrastructure.
🗞️ Anthropic’s new chemistry report has a genuinely wild result.
Claude Opus 4.7 is now competitive with dedicated NMR software, and the bigger story is that it can work the problem backwards, i.e. infer the molecule from the spectrum.
NMR software is the chemist’s expert tool for turning molecular structures into predicted lab spectra.
So Opus 4.7 is no longer just “helping chemists read data” — it can work backward from NMR data and propose the molecule’s structure, a task the report says existing mainstream tools generally leave to human chemists.
Note that Opus 4.7 is a general-purpose model with no chemistry-specific fine-tuning.
Claude Opus 4.7 made the smallest hydrogen prediction errors and nearly matched MestReNova on carbon, meaning it can predict NMR signals about as well as specialist chemistry tools.
So AI can now handle one of chemistry’s hidden bottlenecks: translating between a molecule, its spectral shadow, and the structure a chemist actually needs to trust.
That’s a wrap for today, see you all tomorrow.







