🗞️ OpenAI’s new paper shows how they are now seeing the first version of office work where agents do most of the execution.

OpenAI’s agentic office work, State of the AI Economy, 2027 IPO pressure, larger model learning, AI content flood, safer RL generalization, MIT’s code-output gap, Qwen-AgentWorld for agent training

Jun 26, 2026

Read time: 12 min


(I publish this newsletter daily. Noise-free, actionable, applied-AI developments only.)

⚡ In today’s Edition (26-June-2026):

🗞️ OpenAI’s new paper shows how they are now seeing the first version of office work where agents do most of the execution.
🗞️ This is a brilliant report. The State of the AI Economy
🗞️ New York Times: OpenAI is now leaning toward a 2027 IPO because the public market is testing whether AI giants deserve trillion-dollar prices before they prove durable profits.
🗞️ Why Larger Models Learn More: Effects of Capacity, Interference, and Rare-Task Retention
🗞️ The Economist: AI has pushed the internet’s content machine into a new phase, with books, lawsuits, research papers, apps, and songs now being produced at volumes that old review systems were not built to handle.
🗞️ New research from OpenAI reported a training result where RL on realistic human situations made models carry safer, more useful behavior into tasks they had not trained on.
🗞️ MIT study. Code volume surges by 300%, but output increases by only 30%: The AI dividend meets an awkward reality.
🗞️ Qwen just released Qwen-AgentWorld, a 35B open-weight world model that learns how terminals, browsers, Android devices, code repos, search systems, OS tools, and MCP servers respond when an AI agent takes an action.

Connect with me on X (Twitter)

🗞️ OpenAI just released a paper showing how they are now seeing the first version of office work where agents do most of the execution.

Codex has become OpenAI’s main work AI, producing 99.8% of internal output tokens after sitting below 10% a year earlier.

The striking part is not engineering use, because Codex began as a coding tool, but the fast rise in Legal, Finance, Recruiting, Support, and business teams. Non-developer use rose 137x for individuals and 189x for organizations since Aug-25, which means agents are spreading wherever work has repeatable steps, files, rules, and messy follow-through.

Users are changing the work unit itself: 70.2% of sampled individuals sent a request above 1 hour of human work, and 25.6% sent one above 8 hours. Heavy users no longer wait for one answer. 28.6% of OpenAI users managed 5+ concurrent agents, and the 99th percentile ran about 71 hours of agent work per day by managing parallel tasks, turning AI from a chat box into a pool of delegated labor.

🗞️ A solidly insightful new report “The State of the AI Economy”

$110B in real AI revenue over 12 months, after removing double-counting, so $1 spent on Claude is counted once, even if part of it later flows to Amazon or another infrastructure provider.

$175B current annualized run rate, showing fast acceleration. Measured by end-customer spend, not supply-chain pass-through revenue. Excludes China, internal AI savings, ad uplift, consulting, and systems integration.
Growth running roughly 3x faster than mobile or internet adoption waves.
The pace of revenue formation has sharply accelerated. New $1B revenue now arrives in under 2 days, versus 180 days in 2023.
Enterprise AI has moved beyond pilots, but deep company-wide rollout is still early.
AI earnings-call mentions reached 31% of tracked S&P 500 firms.
Only 20% of tracked firms made quantified AI impact claims.
Hyperscaler AI revenue roughly covers AI infrastructure depreciation for now. GPU economics depend heavily on 6-year compute life assumptions.
Other AI infrastructure gets modeled over 14 years.
Token price cuts do not automatically reduce revenue.
Every 10% token price cut drives 12-18% more token usage.
AI demand looks price elastic, meaning cheaper AI expands usage faster than prices fall.
Power availability and data-center costs remain major limits on future scaling.
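The token-pricing points above come down to simple arithmetic. Here is a minimal sketch, assuming revenue is just price times usage (the function names are illustrative, not from the report):

```python
def revenue_multiplier(price_cut, usage_gain):
    """Revenue after a price cut, relative to before, assuming revenue = price * usage."""
    return (1.0 - price_cut) * (1.0 + usage_gain)


def break_even_usage_gain(price_cut):
    """Usage growth needed for revenue to stay flat after the price cut."""
    return 1.0 / (1.0 - price_cut) - 1.0


# The report's range: a 10% price cut drives 12-18% more token usage.
low = revenue_multiplier(0.10, 0.12)
high = revenue_multiplier(0.10, 0.18)
needed = break_even_usage_gain(0.10)

print(f"revenue multiplier: {low:.3f} to {high:.3f}")
print(f"break-even usage gain: {needed:.1%}")
```

A 10% cut needs about 11.1% more usage to break even, so the reported 12-18% range leaves revenue slightly higher rather than lower.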


🗞️ New York Times: OpenAI is now leaning toward a 2027 IPO because the public market is testing whether AI giants deserve trillion-dollar prices before they prove durable profits.

IMO, SpaceX’s huge IPO and Anthropic’s confidential filing also change the OpenAI story because public-market capital is not infinite, and investors may already be digesting one massive AI-adjacent listing while preparing for another pure AI listing.

OpenAI cannot just ask for $1T in isolation, because every giant IPO competes for the same pool of institutional cash, risk appetite, and patience for loss-making growth stories. So waiting until 2027 lets OpenAI avoid becoming the third huge test of AI-market depth in the same cycle.

🗞️ Why Larger Models Learn More: Effects of Capacity, Interference, and Rare-Task Retention

Great Stanford + MIT + Harvard + Anthropic paper.

The paper gives a clear training-based reason for why larger models learn abilities smaller models miss. It says bigger AI models learn rare skills because they forget them less during training: their extra space protects weak learning signals.

The authors say the issue is not just whether a small model could represent the task, but whether training lets it keep that task while many common tasks keep pushing on the same limited parts. Their core idea is that common tasks take up the model’s neurons first, so rare tasks get overwritten before they appear often enough to build into stable knowledge.

In a crowded data mixture, common patterns get first claim on the model’s internal machinery. Small models may briefly pick up a rare signal, but the next wave of common-task updates overwrites it before the signal appears again.

They tested this first with controlled toy tasks where they could change how rare and complex each task was, then with OLMo language models from 4M to 4B parameters. The main result is that bigger models learned low-frequency tasks much better, kept more task features inside their representations, and showed less gradient interference, which means common-task updates disturbed rare-task learning less. Larger models can remember weak rare signals long enough to turn them into real learned skills.
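One way to build intuition for "less gradient interference" is to measure how much two task gradients overlap. The sketch below is a toy stand-in, not the paper's measurement: it assumes the common-task and rare-task gradients are random directions and uses the absolute cosine similarity between them as the interference proxy. The `dim` values are arbitrary illustrations of "small" and "large" parameter counts.

```python
import math
import random


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_abs_overlap(dim, trials=200, seed=0):
    """Average |cosine| between random 'common-task' and 'rare-task' gradients."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        common = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        rare = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        total += abs(cosine(common, rare))
    return total / trials


small = mean_abs_overlap(dim=8)     # toy "small model"
large = mean_abs_overlap(dim=512)   # toy "large model"
print(f"mean overlap, dim=8:   {small:.3f}")
print(f"mean overlap, dim=512: {large:.3f}")
```

In higher dimensions random directions are closer to orthogonal, so a common-task update moves the rare-task direction less. Real gradients are not random, so this only illustrates the geometry behind the claim.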

🗞️ The Economist: AI has pushed the internet’s content machine into a new phase, with books, lawsuits, research papers, apps, and songs now being produced at volumes that old review systems were not built to handle.

Amazon e-book releases rose from about 100,000 a month before ChatGPT-3.5 to roughly 300,000 by late 2025, and detection tools suggest AI-generated text drove much of that jump.

US self-filed civil lawsuits doubled to 41,000 from 2023 to 2025, with 18% of sampled 2026 complaints flagged as AI-written, yet their success rate did not fall. Research is seeing the same pressure, as arXiv submissions keep rising, rejection rates have more than doubled since 2023, and one study found 57% of 2025 papers carried AI-influenced language, up from 12% in 2023.

Coding agents have also changed software output, with new iOS App Store releases now above 100,000 a month after sitting below 50,000 last May. In music production, 75,000 AI songs are arriving daily, up from 10,000, while 44% of new uploads are AI-made and 97% of listeners in one survey could not reliably tell the difference.

🗞️ New research from OpenAI reported a training result where RL on realistic human situations made models carry safer, more useful behavior into tasks they had not trained on.

The key point is cross-domain transfer, where health-only training improved non-health behaviors like blackmail resistance, code reward hacking, and deception tests.

This suggests the model may be learning a broader stance: verify before asserting, concede when corrected, resist flattering the user, and avoid shortcuts that look useful but corrupt the task. OpenAI also removed health and science data from training, yet the model still improved on health evaluations, which suggests these traits may be learned as general behavioral habits rather than narrow topic rules. The trained model was harder to steer toward harmful behavior while remaining responsive to helpful instructions, which is the asymmetry safety research has been looking for.

🗞️ MIT study. Code volume surges by 300%, but output increases by only 30%: The AI dividend meets an awkward reality.

The authors studied 100,000+ GitHub developers and found that AI coding agents massively increase code production, but much less of that work becomes shipped software.

Autonomous AI coding agents raised commits by 180%, but releases rose only 30%.

The paper’s main idea is that software production has weak links, so faster code writing does not help as much when humans still need to review, connect, test, package, and ship the work.

The authors also check app marketplaces and find more new apps but no increase in total usage, so more software appeared without clear evidence that users adopted more of it.

The authors compare more than 100,000 GitHub developers before and after they start using 3 generations of AI coding tools, from autocomplete to more independent coding agents.

Autocomplete raised commits by 40%, interactive coding agents raised them by 140%, and autonomous coding agents raised them by 180%.

The 180% commit gain shrank to 50% for the number of projects and 30% for actual releases.

The estimated “elasticity of substitution” is 0.25, meaning AI-written code and human work are hard to swap for each other, so even a big improvement in AI’s usefulness replaces only a small amount of human work.

AI can write code faster, but humans are still needed to decide what to build, check if the code works, connect it with the rest of the product, fix messy edge cases, and actually ship it.
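To see why a low elasticity of substitution caps the payoff from faster code writing, here is a minimal sketch using a textbook two-input CES production function. The functional form, the equal share weights, and the mapping of the 180% commit gain onto the AI-side input are all assumptions for illustration, not the paper's calibrated model.

```python
def ces_output(ai_input, human_input, sigma=0.25, ai_share=0.5):
    """CES production function with elasticity of substitution sigma (sigma != 1)."""
    rho = (sigma - 1.0) / sigma  # sigma = 0.25 gives rho = -3
    total = ai_share * ai_input ** rho + (1.0 - ai_share) * human_input ** rho
    return total ** (1.0 / rho)


baseline = ces_output(1.0, 1.0)
boosted = ces_output(2.8, 1.0)          # AI-side input up 180%, human input unchanged
ceiling = (1.0 - 0.5) ** (1.0 / -3.0)   # limit as ai_input grows without bound

print(f"output gain from +180% AI input: {boosted / baseline - 1.0:.1%}")
print(f"maximum possible gain with fixed human input: {ceiling / baseline - 1.0:.1%}")
```

Under these toy assumptions a 180% jump on the AI side lifts output by only about 24%, and no amount of extra AI input can push the gain past about 26% while the human input stays fixed. That is the weak-link effect in miniature.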

🗞️ Qwen just released Qwen-AgentWorld, a 35B open-weight world model that learns how terminals, browsers, Android devices, code repos, search systems, OS tools, and MCP servers respond when an AI agent takes an action.

Most agents are trained like decision-makers, so they learn which button to press or command to run, but Qwen-AgentWorld is trained like the environment itself, so it predicts the next screen, error, search result, terminal output, or app state.

It beats GPT and Claude on AgentWorldBench.

This turns agent training into a simulator problem, because teams can test millions of actions without renting real browsers, running fragile terminals, or waiting on live services.
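The "model as environment" idea can be shown with a deliberately tiny stand-in. The sketch below is not Qwen-AgentWorld's architecture or API. It is a hypothetical lookup-table world model that learns from logged (state, action, observation) transitions and then answers an agent's actions without touching a real terminal. The class name and the logged examples are made up.

```python
from collections import Counter, defaultdict


class ToyWorldModel:
    """Predicts the next observation for a (state, action) pair from logged transitions."""

    def __init__(self):
        self._counts = defaultdict(Counter)

    def fit(self, transitions):
        # Each transition is (state, action, observed_result).
        for state, action, next_obs in transitions:
            self._counts[(state, action)][next_obs] += 1

    def predict(self, state, action):
        # Return the most frequently observed result, or a marker if never seen.
        seen = self._counts.get((state, action))
        if not seen:
            return "<unknown>"
        return seen.most_common(1)[0][0]


# Hypothetical logged terminal interactions.
logged = [
    ("shell", "ls", "README.md src/"),
    ("shell", "ls", "README.md src/"),
    ("shell", "cat missing.txt", "cat: missing.txt: No such file or directory"),
]

model = ToyWorldModel()
model.fit(logged)

# An agent can now "run" commands against the model instead of a live shell.
print(model.predict("shell", "ls"))
print(model.predict("shell", "rm -rf build"))
```

A learned world model replaces the lookup table with a network that generalizes to unseen actions, which is also where simulator error comes from: a wrong prediction for an unseen action is handed to the agent as if it were real.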

The strongest claim is that simulated RL beat real-environment RL on live search tasks, with 50.3% F1 versus 45.6% F1, which suggests a good fake world can sometimes teach agents better than a messy real one.

The deeper technical shift is environment prediction, because the model is not just answering questions, but modeling cause and effect across 7 agent domains.

The risk is simulator error, because a small wrong prediction can teach an agent a bad habit, but the open Apache 2.0 release makes this one of the first serious public testbeds for agent training at scale.

That’s a wrap for today, see you all tomorrow.
