🗞️ Anthropic releases Claude Opus 4.8 on the same day as its $965B valuation round.
Read time: 10 min
📚 Browse past editions here.
( I publish this newletter daily. Noise-free, actionable, applied-AI developments only).
⚡In today’s Edition (29-May-2026):
🗞️ Anthropic releases Claude Opus 4.8 on the same day as its $965B valuation round.
🗞️ KogAI just achieved 3,000 tokens/s on 8× AMD MI300X GPUs and 2,100 on 8× NVIDIA H200 (FP16, no speculative decoding) with a 2B model.
🗞️ Video to Watch: Transformer vs Post-Transformer, argued by leading researchers, inside a real physical boxing ring.
🗞️ Anthropic secures a massive post-money valuation of $965B after raising $65B.
🗞️ Datacurve launches DeepSWE, a tougher coding benchmark made to show where leading models truly separate.
🗞️ OpenAI and Thrive just built a self-improving tax agent with up to 97% accuracy.
🗞️ Anthropic releases Claude Opus 4.8 on the same day as its $965B valuation round.
Fast Mode now runs at 2.5x the speed and costs 3x less. Use /fast in Claude Code to turn it on.
Effort control: you can now decide the effort level behind a response, with higher effort bringing deeper thought.
Has a new “dynamic workflows” feature that allows it to tackle very large-scale problems.
74.6% on agentic terminal coding is the biggest benchmark jump over Opus 4.7, rising from 66.1%
New “dynamic workflows” feature that allows it to tackle very large-scale problems.
The new leader on our GDPval-AA benchmark for agentic real-world work tasks
The dynamic workflows in Claude Code will break a massive engineering task into many smaller jobs, run them through tens to hundreds of parallel subagents, and check the results before handing anything back.
A normal coding agent works like one developer reading, editing, and testing in sequence, but dynamic workflows behave more like a temporary engineering team coordinated by Claude.
Claude first writes an orchestration plan, which is basically a task map that says what needs to be inspected, rewritten, tested, reviewed, or challenged.
Separate subagents then work on different parts of the repo at the same time, so one agent might inspect authentication code, another might port files, another might search for unsafe patterns, and another might try to break the proposed fix.
The major change is verification, because Claude does not just collect answers from subagents, but compares them, refutes weak findings, runs checks, and keeps iterating until the results converge.
Opus 4.8 now supports a 1M-token context window and up to 128K output tokens. It’s live on the API, Claude Code, and major cloud platforms.
🗞️ KogAI just achieved 3,000 tokens/s on 8× AMD MI300X GPUs and 2,100 on 8× NVIDIA H200 (FP16, no speculative decoding) with a 2B model.
3,000 tokens/s for 1 user on standard datacenter GPUs.
They leveraged a hidden efficiency gap in how GPUs generate tokens. Their tech preview is on a 2B model, and they show how their techniques will scale to large frontier MoE models at similar speeds. Try it here on the Playground.
That’s a huge number because normal low-batch GPU decoding for 2B to 8B models is usually closer to 100 to 300 tokens/s per request, so Kog is claiming something like a 10X to 30X jump in the speed one user actually feels.
Their trick: they are getting the speed by treating LLM decoding as a memory streaming problem, not mainly a math problem.
For 1 user at batch size 1, the GPU is not doing big, efficient matrix-matrix work like in training or large-batch serving; it is repeatedly pulling the model’s active weights from high-bandwidth memory for each new token, so speed depends on how smoothly those weights keep flowing.
Normal inference stacks keep breaking that flow. They run many separate GPU programs for different parts of the model, move intermediate results through memory, wait at synchronization points, talk back to the CPU for scheduling or sampling, and then repeat this token after token.
Kog’s answer is to co-design 3 things that are usually tuned separately: the runtime, the low-level GPU code, and the model architecture.
The biggest engineering move is the monokernel, where the whole decode pass runs as 1 persistent GPU-resident program, including sampling, so the system does not keep stopping for kernel launches, CPU scheduling, and intermediate memory round trips.
They also rebuilt synchronization, because their own measurements say grid sync was eating around 35% of token-generation time; instead of making every compute unit wait at a broad barrier, each unit waits only for the exact data it needs.
On AMD MI300X, they also map memory access around the chiplet layout, because memory latency changes depending on which die makes the request.
Then their Laneformer model uses Delayed Tensor Parallelism, which lets cross-GPU communication happen in the background instead of blocking every layer.
Main Blog + How they did it + Monokernel deep dive + Delayed Tensor Parallelism research
🗞️ Video to Watch: Transformer vs Post-Transformer, argued by leading researchers, inside a real physical boxing ring.
A technically deep and genuinely entertaining debate to understand one of AI’s hardest AI debates - Transformer vs Post-Transformer. Watch the full video.
I was glued for the entire 1 hour 20 minutes. So many super cool points to learn.
🥊 Transformers
Transformers still own the present because they work at scale. They are simple, trainable, hardware-friendly, and already power the strongest AI systems we use today.
The Transformer is basically a memory machine. It stores information as keys and values, then uses attention to pull back the most useful parts when answering.
The real Transformer advantage is not just “attention.” The bigger advantage is that it fits modern hardware extremely well, so it can process huge batches of tokens fast.
Scaling is still the brutal rule. If you give Transformers more compute, more data, and more parameters, they usually keep getting better. Any Post-Transformer architecture has to scale just as well, or better.
It is not enough to look clever on small tests, because the real question is whether it improves faster than Transformers when scaled up.
A replacement cannot be slightly better. Because the whole AI stack is already built around Transformers, the next architecture may need to be around 10x better to force everyone to switch.
Transformers are powerful, but they may be brute force. A human does not need to read the entire internet many times to become smart, but current LLMs need enormous data and compute.
🥊 Post-Transformer
Post-Transformer people are not saying Transformers are bad. They are saying Transformers may be the best current tool, not the final form of machine intelligence.
The biggest Post-Transformer target is native reasoning and continual learning. Today’s LLM reasoning often feels like text-based step-by-step work added on top, instead of thinking happening naturally inside the model.
Latent reasoning is one possible next step. That means the model reasons inside its own hidden internal space, instead of writing every thought out as words.
Continual learning is still a major weakness. Humans keep learning from experience, but most Transformer-based models are trained, frozen, and then only adapt inside the prompt.
Long context is not the same as real memory. A model can read a huge prompt, but that is different from building a life history, learning from mistakes, and updating beliefs over time.
The future may be hybrid, not a clean replacement. Transformers may stay as 1 building block while newer systems add better memory, better reasoning, and better learning loops.
The most interesting possibility is that Transformers may help discover their own successor. AI agents are already getting better at research and coding, so the next architecture may come from AI-assisted architecture search.
Benchmarks are a problem. Many public benchmarks are easy to game, so they may show leaderboard strength without proving deeper intelligence.
Perplexity is still probably a great metric to evaluate frontier models,, because it tests prediction quality.
Overall, Transformers continue to dominate, but the frontier is clearly widening.
Pathway’s BDH (Dragon Hatchling — brain-inspired reasoning architecture), Sakana AI’s CTMs (Continuous Thought Machines — models that think over time), and Liquid AI’s LFMs (Liquid Foundation Models — efficient multimodal foundation models) - all of these show how the frontier is expanding.
🗞️ Anthropic secures a massive post-money valuation of $965B after raising $65 B.
Anthropic overtakes OpenAI as the most valuable AI startup after this $65B raise.
Just three months earlier, in February, Anthropic had raised $30B at a $380B post-money valuation.
Claude’s run-rate revenue crossed $47B , up from $14B in February, a more than threefold jump in under 4 months.
Altimeter Capital, Dragoneer, Greenoaks, Sequoia Capital, Capital Group, Coatue, D1 Capital Partners, and others jointly led the Series H raise. Institutional investors including Baillie Gifford, Blackstone, Brookfield, D.E. Shaw Ventures, DST Global, and Fidelity Management & Research also joined in.
Samsung, SK Hynix, and Micron came in as strategic infrastructure partners. A $15 billion portion of the round comes from prior hyperscaler commitments, including Amazon’s $5 billion investment announced in April.
To note, OpenAI last raised a whopping $122 billion round in March at an $852 billion post-money valuation.
Anthropic said the funding will support safety and interpretability research, expand compute capacity and scale Claude products and partnerships.
Whether Anthropic sustains this trajectory heading into a potential IPO or whether the capital intensity required to stay at the compute frontier eventually compresses those margins, is the question that will define the next chapter of the AI industry.
🗞️ Datacurve launches DeepSWE, a tougher coding benchmark made to show where leading models truly separate.
On this benchmark GPT-5.5 hits only 70%, while GPT-5.4 reaches 56% and Claude Opus 4.7 reaches 54%, making a gap that older benchmarks largely hid.
Its a long-horizon software engineering benchmark.
- DeepSWE differs from older coding benchmarks in the source of the exam: older tests often reuse public GitHub issues and PRs, while DeepSWE uses original tasks, so models are less likely to have seen the answer during training.
- The work is also bigger even when the prompt is shorter, because older tests often tell the model what area to touch, while DeepSWE makes the agent search the repo, understand the design, edit multiple files, and avoid breaking old behavior.
On DeepSWE, prompts are half the length of SWE-bench Pro’s, yet solutions require 5.5x more code and ~2x more output tokens.
- The grading is different too, because many older benchmarks reuse tests from one merged PR, while DeepSWE checks whether the requested behavior actually works, even if the model solves it in a different valid way.
🗞️ OpenAI and Thrive just built a self-improving tax agent with up to 97% accuracy.
Tax AI processed 7,000 returns across 30+ accounting firms, saved about one-third of preparation time, reached up to 97% accuracy, and raised throughput by about 50%.
The hard part was not reading W-2s or 1099s, but handling messy K-1s, rental schedules, notes, spreadsheets, prior-year files, and values that must match across documents.
The system records the full trace: source file, extracted field, citation, tax-engine mapping, accountant correction, and final filed value.
Repeated corrections become eval targets, so Codex gets a narrow task with evidence, code, tests, and a pass condition.
A wrong tax field can come from many places: bad extraction, weak mapping, unsupported workflow, prior-year carryover, or human judgment.
The clever part was not simply using Codex to write fixes, but building a product environment where repeated practitioner corrections became bounded, testable engineering tasks.
In the rental-property example, the agent could inspect source documents, extraction traces, mapper behavior, expected outputs, and regression tests before proposing a change.
That’s a wrap for today, see you all tomorrow.







