🗞️ Mira Murati’s Thinking Machines made finance expert judgment trainable, beats frontier models with 29.8% fewer errors.

Thinking Machines finetune; Claude Code to Claude Tag; image-based text context for Fable 5; humans as the bottleneck; DeepSeek V4 surge pricing; Alibaba blocks Claude Code

Jul 04, 2026

Read time: 5 min

📚 Browse past editions here.

(I publish this newsletter daily. Noise-free, actionable, applied-AI developments only.)

⚡In today’s Edition (04-July-2026):

🗞️ Mira Murati’s Thinking Machines made Bridgewater’s private expert judgment trainable, beating frontier models with 29.8% fewer errors.
🗞️ Super insightful Boris Cherny and Cat Wu interview on the move from Claude Code to Claude Tag.
🗞️ Developers found a cheaper way to feed Fable 5 large context by showing it pictures of text.
🗞️ Claude Fable 5 makes the human the bottleneck, because the model now exposes every missing decision.
🗞️ Surge pricing in AI just arrived. DeepSeek is doubling peak-hour V4 API prices.
🗞️ Alibaba blocked Claude Code after Anthropic’s tracking experiment angered Chinese developers and security staff.

Connect with me on X (Twitter)

🗞️ Mira Murati's Thinking Machines made Bridgewater’s private expert judgment trainable, beating frontier models with 29.8% fewer errors.

With naive prompts, all tested models sit around coin-flip accuracy, roughly 46% to 50%.
Expert prompts lift them sharply, reaching about 74% to 78% average accuracy. The workflow was filtering finance articles, reports, central-bank documents, and emails to decide what investors should read.

This is a serious signal for enterprise AI: bringing private expert judgment into the loop beats general intelligence. The problem was not reading finance documents, because frontier LLMs can already read them.

The harder task was deciding which facts deserve attention inside an investor’s workflow. A tariff headline can move markets, while another geopolitical headline may add no signal.

The breakthrough came from replacing written rules with high-quality labels from expert investors. Non-expert labels failed because the task depends on taste, not surface financial language.

Bridgewater cleaned those labels by sending model-disputed cases back to experts for review. The model then learned patterns that experts could recognize, but could not fully verbalize.

Training used interleaved batches, CISPO loss, and on-policy distillation from stronger teacher checkpoints. Interleaving helped the model share judgment across tasks without blending them into noise.

CISPO controlled policy updates, so learning stayed aggressive without drifting into brittle shortcuts. (CISPO is a new reinforcement-learning loss that caps how strongly each generated token can update the model, improving training stability while keeping useful rare tokens active. It was first proposed by the MiniMax team in 2025.)
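To make the "cap per token" idea concrete, here is a minimal pure-Python sketch of a CISPO-style objective. It is an illustration under assumptions, not the training code used by Thinking Machines or Bridgewater: the function names, the `eps_low` and `eps_high` values, and the toy inputs are all hypothetical. The idea shown is that the importance-sampling ratio is clipped and then treated as a constant weight on each token's log-probability, so a clipped token still contributes a bounded gradient.

```python
import math

def cispo_token_weights(logp_new, logp_old, advantages, eps_low=1.0, eps_high=0.2):
    """Per-token coefficients that scale the gradient of log pi(token).

    eps_low and eps_high are illustrative values, not taken from the article.
    """
    weights = []
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_new - lp_old)  # importance-sampling ratio
        clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
        # In a real autograd framework the clipped ratio is wrapped in a
        # stop-gradient, so it acts as a constant weight on the token.
        weights.append(clipped * adv)
    return weights

def cispo_surrogate_loss(logp_new, logp_old, advantages, **kwargs):
    """Scalar loss: negative weighted mean of the new log-probabilities."""
    weights = cispo_token_weights(logp_new, logp_old, advantages, **kwargs)
    return -sum(w * lp for w, lp in zip(weights, logp_new)) / len(weights)
```

In this sketch a token whose probability doubled under the new policy has its weight capped at `1 + eps_high` instead of growing without bound, which is the stabilizing behavior described above.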

On-policy distillation penalized moves away from better teachers, then promoted stronger checkpoints. The result beat the best frontier model, with 29.8% fewer mistakes and 13.8x lower inference cost.
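As a rough sketch of what "penalizing moves away from a teacher" can look like, the snippet below averages a per-token reverse KL between student and teacher distributions along a sequence sampled from the student. Reverse KL is an assumption here (a common choice for on-policy distillation); the article does not state the exact penalty, and the function names and toy distributions are hypothetical.

```python
import math

def reverse_kl(student_probs, teacher_probs):
    """KL(student || teacher) at one token position over a shared vocabulary.

    Assumes every teacher probability is positive wherever the student's is.
    """
    return sum(
        s * math.log(s / t)
        for s, t in zip(student_probs, teacher_probs)
        if s > 0
    )

def distillation_penalty(student_steps, teacher_steps):
    """Mean per-token reverse KL along a sequence the student generated."""
    kls = [reverse_kl(s, t) for s, t in zip(student_steps, teacher_steps)]
    return sum(kls) / len(kls)
```

The penalty is zero when the student matches the teacher at every position and grows as the student puts mass on tokens the teacher considers unlikely.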

🗞️ Super insightful Boris Cherny and Cat Wu interview on the move from Claude Code to Claude Tag.

Anthropic launched Claude Tag last week, adding a layer of persistent context and memory that would have been difficult to maintain with previous tools.

- Claude Tag shows a bigger shift than faster coding inside Anthropic’s daily work. Claude Code helped one person work faster, but Claude Tag changes group behavior.

- AI becomes more useful when it enters shared workspaces.

- People no longer need to open a separate tool, frame a task, and monitor it. Claude can sit inside a channel, notice useful work, act, and report back.

- That institutional memory turns AI from a helper into a workflow layer. A channel can teach Claude what to watch, what to ignore, and how to respond.

That means the system improves through normal corrections, not separate training sessions.

- Visibility spreads skill faster than private AI use. When stronger users show Claude how to debug, analyze data, or write PRs, others copy those patterns.

- This turns AI adoption from individual experimentation into social learning across the company.

🗞️ Developers found a cheaper way to feed Fable 5 large context by showing it pictures of text.

Normally, every code block, log, tool output, and old chat turn becomes text tokens.

Those tokens are billable units.

pxpipe changes the input. It renders dense text into PNG pages, then sends those pages as image blocks.

Fable 5 can read the pixels with OCR-like vision skills, so meaning often survives. The price gap appears because one image has a mostly fixed token cost.

That cost barely changes when readable text gets packed into the same image. A 1928×1928 image costs about 4,761 vision tokens.

The same page can hold roughly 92K characters, so dense code becomes cheaper. The catch is that this is compression through vision, not lossless text storage.

Fable 5 may understand the gist while misreading exact IDs, hashes, names, or strings. That makes it useful for bulky background context, but risky for byte-exact facts.
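The savings are easy to sanity-check with back-of-the-envelope arithmetic. The sketch below uses the two figures quoted above (about 92K characters per page, about 4,761 vision tokens per image) plus one assumption that is not from the source: roughly 4 characters per text token. Real tokenizers vary, and dense code often tokenizes less efficiently, so treat the ratio as a rough guide.

```python
def text_tokens(num_chars, chars_per_token=4.0):
    """Estimated text tokens; chars_per_token is an assumed average."""
    return num_chars / chars_per_token

def image_vs_text(num_chars=92_000, image_tokens=4_761, chars_per_token=4.0):
    """Compare sending a page as text tokens versus as one image."""
    as_text = text_tokens(num_chars, chars_per_token)
    return {
        "text_tokens": as_text,
        "image_tokens": image_tokens,
        "ratio": as_text / image_tokens,
    }

result = image_vs_text()
print(result)
```

Under that assumption the page would cost about 23,000 tokens as text versus 4,761 as an image, a bit under 5x cheaper, at the price of possible misreads on exact strings.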

🗞️ Claude Fable 5 makes the human the bottleneck, because the model now exposes every missing decision.

- Claude is getting good enough that weak prompts now fail less from syntax and more from missing context.

- A prompt gives the model a map, but the repository contains the roads, detours, weird legacy choices, and invisible tradeoffs. Every unstated assumption becomes a fork where the model has to choose for you.

- That choice may be reasonable, clean, and still wrong. The best move, then, is to spend less time pretending your spec is complete and more time building tools that expose its gaps.

- Great agentic coding is not writing perfect prompts, but shrinking the gap between intent and reality.

- Ask Claude to find your blind spots before coding, especially inside unfamiliar code or unfamiliar domains.

- Prototype first when taste is hard to explain, because 4 rough versions beat 1 polished mistake.

- Make Claude interview you before building, starting with answers that could change the architecture.

- Keep implementation notes during the build, because every deviation reveals a hidden assumption.

- Do not merge until Claude can quiz you and you fully understand what changed.

🗞️ Surge pricing in AI just arrived. DeepSeek is doubling peak-hour V4 API prices.

The change covers 9am to noon and 2pm to 6pm Beijing time. DeepSeek says the goal is steadier service and better distribution of scarce resources.

Just weeks before, DeepSeek permanently lowered V4-Pro pricing by 75%, making its flagship model much cheaper ahead of demand-based pricing. During peak hours, V4-Pro output now rises from 6 yuan to 12 yuan (about $1.77) per million tokens.
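For anyone budgeting batch jobs around the new schedule, here is a small sketch of a peak-hour check and cost estimate. The windows and the 6 and 12 yuan rates come from the figures above; everything else is an assumption: the function names are hypothetical, the windows are treated as half-open (noon and 6pm themselves count as off-peak), and the schedule is assumed to apply every day of the week.

```python
from datetime import datetime, timedelta, timezone

BEIJING = timezone(timedelta(hours=8))  # UTC+8, no daylight saving time
PEAK_WINDOWS = [(9, 12), (14, 18)]      # start hour inclusive, end hour exclusive (assumed)
OFF_PEAK_YUAN = 6.0                     # per million output tokens
PEAK_YUAN = 12.0                        # per million output tokens

def is_peak(moment):
    """True if a timezone-aware datetime falls in a Beijing peak window."""
    local = moment.astimezone(BEIJING)
    return any(start <= local.hour < end for start, end in PEAK_WINDOWS)

def output_cost_yuan(output_tokens, moment):
    """Estimated V4-Pro output cost in yuan for a request at `moment`."""
    rate = PEAK_YUAN if is_peak(moment) else OFF_PEAK_YUAN
    return output_tokens / 1_000_000 * rate

# Example: 02:30 UTC is 10:30 in Beijing, inside the morning peak window.
when = datetime(2026, 7, 6, 2, 30, tzinfo=timezone.utc)
print(is_peak(when), output_cost_yuan(2_000_000, when))
```

Pass timezone-aware datetimes; a naive datetime would be interpreted in the machine's local timezone and could land in the wrong window.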

🗞️ Alibaba blocked Claude Code after Anthropic’s tracking experiment angered Chinese developers and security staff.

Claude access becomes risky when it also inspects timezone, proxy, or identity signals.

Anthropic said the feature was an experiment to stop resellers and model distillation. Alibaba’s concern was that Claude Code could identify China-linked users while running inside employee developer environments.

That made the issue larger than ordinary app tracking, because the tool worked close to private code. Claude Code could read files, edit projects, and interact with terminals on employee machines. Alibaba treated that access as a security risk, not just a software preference.

That’s a wrap for today, see you all tomorrow.

Rohan's Bytes
