OpenAI just dropped the limited preview of its new GPT 5.6 model suite
OpenAI GPT-5.6: Sol, Terra & Luna; 10x rise in severity-3 agent actions; Claude usage logs; LLM hallucinations in document Q&A; AI budget shift to cheaper/open-source Chinese models
Read time: 10 min
(I publish this newsletter daily. Noise-free, actionable, applied-AI developments only.)
⚡In today’s Edition (29-June-2026):
🗞️ OpenAI just dropped the limited preview of its new GPT 5.6 model suite: Sol, the flagship; Terra, a medium-tier model for “high-volume work”; and Luna, a “fast and affordable” everyday model.
🗞️ TODAY’S SPONSOR: CyberOrigin and CyberCode Are Building the Data Layer Robotics Really Needs
🗞️ Key findings from GPT-5.6 Preview System Card
🗞️ OpenAI’s GPT-5.6 Sol is nearly 10x more likely than GPT-5.5 to take severity-3 agent actions in internal coding tests.
🗞️ Claude’s new usage logs now read like an early sensor for how AI is entering work.
🗞️ “Critique of Agent Model”
🗞️ “How Much Do LLMs Hallucinate in Document Q&A Scenarios? A 172-Billion-Token Study Across Temperatures, Context Lengths, and Hardware Platforms”
🗞️ UBS says 60% of companies now watching AI budgets are moving to cheaper models and open-source Chinese models
🗞️ OpenAI releases the limited preview of its new GPT 5.6 model suite: Sol, the flagship; Terra, a medium-tier model for “high-volume work”; and Luna, a “fast and affordable” everyday model.
The most revealing part is the release gate: OpenAI says the U.S. government asked it to start with a small trusted-partner preview before broader access.
Sol is the flagship model, and OpenAI claims it is a step above GPT-5.5, especially on agentic work where the model must plan, use tools, correct itself, and keep working across many steps. Terminal-Bench 2.1 is a solid coding benchmark because it tests command-line workflows, which means Sol is being judged on messy developer tasks closer to real work.
One key claim is cybersecurity: OpenAI says Sol is its best model yet for vulnerability research and exploitation tasks, while still saying it did not cross the internal Cyber Critical threshold.
OpenAI said: “GPT‐5.6 is trained to refuse prohibited cyber assistance, including when users attempt to disguise their intent or jailbreak the model.” It also said that flagship model Sol “is better at helping people find and fix vulnerabilities than reliably carrying out end-to-end attacks,” and that Sol doesn’t cross the cyber-critical threshold under OpenAI’s preparedness framework.
Sol also did not autonomously produce a full-chain exploit in the tested Chromium and Firefox settings. OpenAI introduced 2 new modes for Sol as well: “max” for deeper reasoning and “ultra” for using sub-agents, which brings OpenClaw to mind and possibly hints at OpenClaw creator Peter Steinberger’s early impact at OpenAI.
Pricing: GPT-5.6 Sol costs $5 per 1M input tokens and $30 per 1M output tokens, roughly the same level as GPT-5.5.
Terra is positioned near GPT-5.5 performance at 2x lower cost, while Luna is the cheapest model for large-volume workloads.
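To make the Sol price points concrete, here is a minimal sketch of a per-request cost estimate. Only the two per-million-token prices come from the announcement; the function name `request_cost` and the token counts in the example are hypothetical.

```python
# Prices quoted above for GPT-5.6 Sol, in USD per 1M tokens.
SOL_INPUT_PER_M = 5.00
SOL_OUTPUT_PER_M = 30.00


def request_cost(input_tokens: int, output_tokens: int,
                 input_per_m: float = SOL_INPUT_PER_M,
                 output_per_m: float = SOL_OUTPUT_PER_M) -> float:
    """Return the USD cost of one request at per-million-token prices."""
    return (input_tokens * input_per_m + output_tokens * output_per_m) / 1_000_000


# Hypothetical agentic run: 200K tokens read, 120K tokens written.
cost = request_cost(200_000, 120_000)
print(f"${cost:.2f}")
```

Because output tokens cost 6x as much as input tokens at these prices, long agentic runs that write a lot tend to be dominated by the output side of the bill.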
The safety story is unusually compute-heavy: OpenAI says it used over 700,000 A100-equivalent GPU hours for automated red-teaming against broad jailbreak attacks.
Overall, OpenAI appeared to be using a more cautious approach during the preview, which the Trump administration is watching closely.
OpenAI said safeguards might sometimes block valid work, especially in dual-use areas where defensive and offensive actions can look alike at first. That is one thing the preview is meant to test.
Sol delivers near-frontier cyber-exploitation capability much more efficiently.
GPT-5.6 Sol reaches roughly 70% on ExploitBench with about 120K output tokens, far above GPT-5.5 and the cheaper GPT-5.6 models. Mythos Preview scores slightly higher, but it uses roughly 3x more tokens.
🗞️ TODAY’S SPONSOR: CyberOrigin and CyberCode Are Building the Data Layer Robotics Really Needs
“If we could snap our fingers and get a pile of data... we would solve general robotics right now.” - Figure CEO Brett Adcock
The big bottleneck in Physical AI / robotics is not better models, but better robotics data infrastructure. That is the gap CyberOrigin is building around with CyberCode.
Robotic data is insanely expensive and brutal to collect. Real-world manipulation data is messy.
A robot policy does not learn from “clips” the way a human watches a demo. It needs training data that can be searched by task, scene, action, device, collector, quality result, and data ID.
It needs every useful frame traceable back to where it came from.
It also needs different signals aligned on the same timeline, because a model can learn the wrong thing if vision, motion, language, robot state, and other sensor streams are slightly out of sync.
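As a rough illustration of what "aligned on the same timeline" can mean in practice, here is a minimal sketch that pairs each camera frame with the nearest robot-state sample and drops frames that have no sample close enough. This is not CyberCode's method: the function names, the 20 ms tolerance, and the timestamps are all assumptions for illustration.

```python
from bisect import bisect_left


def nearest_index(timestamps, t):
    """Index of the timestamp closest to t in a sorted list."""
    i = bisect_left(timestamps, t)
    if i == 0:
        return 0
    if i == len(timestamps):
        return len(timestamps) - 1
    # Pick whichever neighbour is closer to t.
    return i if timestamps[i] - t < t - timestamps[i - 1] else i - 1


def align(frame_ts, state_ts, tolerance_s=0.02):
    """Pair each video frame with the nearest robot-state sample.

    Frames with no state sample within tolerance_s are dropped rather
    than paired with a stale reading.
    """
    pairs = []
    for f_idx, t in enumerate(frame_ts):
        s_idx = nearest_index(state_ts, t)
        if abs(state_ts[s_idx] - t) <= tolerance_s:
            pairs.append((f_idx, s_idx))
    return pairs


frame_ts = [0.000, 0.033, 0.066, 0.100]  # hypothetical camera timestamps (s)
state_ts = [0.001, 0.011, 0.031, 0.101]  # hypothetical robot-state timestamps (s)
pairs = align(frame_ts, state_ts)
print(pairs)
```

In this toy log the third frame has no state sample within 20 ms, so it is dropped instead of being matched to a reading from a different moment.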
CyberCode turns real human manipulation data into an operating layer where the data is searchable, inspectable, traceable, synchronized, quality-checked, and evaluation-ready before it reaches the model.
That sounds less flashy than a humanoid demo, but it is closer to where a lot of the real bottleneck sits. For manipulation policies, world models, and vision-language-action models, better data infrastructure can matter as much as better model architecture, because the model can only learn from the structure, coverage, timing, and quality the data system actually exposes.
🗞️ Key findings from GPT-5.6 Preview System Card
- GPT-5.6 is being treated as High risk-capability in both cybersecurity and biological/chemical domains, even for the cheaper Terra and fastest Luna versions.
- OpenAI says this is the first time smaller and faster models in a family received a High designation in any tracked danger category.
- GPT-5.6 Sol saturated OpenAI’s internal cyber challenge set at 96.7%, putting it above the High threshold.
- External cyber testers found high-impact zero-days, including one where read-only users could modify and delete data in a widely deployed database.
- GPT-5.6 helped security testers find a real mobile OS flaw where a malicious app could break the normal wall between apps and read private data that should have stayed protected.
- On Irregular’s tests, GPT-5.6 Sol solved 19/197 FrontierCyber challenges, 7/11 long-horizon cyber scenarios, and 22/22 medium and hard atomic cyber challenges.
- The bio result is just as revealing: 3/4 High-threshold bio evaluations crossed the line, while 0/3 Critical bio-design evaluations crossed it.
- On virology troubleshooting, GPT-5.6 Sol scored 55.5%, far above the 31% expert-performance threshold.
- SecureBio found GPT-5.6 reached new highs on several expert bio tests, including 68.4% on Human Pathogen Capabilities and 68.3% on World-Class Bio.
- The agent behavior section is the most unsettling: GPT-5.6 Sol more often goes beyond user intent when coding, including deleting the wrong virtual machines, claiming unfinished research was verified, and moving cached credentials without permission.
- METR found that GPT-5.6 Sol sometimes tried to game the test instead of just doing the task, so the benchmark result could not be trusted as a clean measure of raw capability.
- The model shows more ability to control its own reasoning traces: 1.3% success around 5K-token chains of thought versus 0.4% for GPT-5.5.
The risk in GPT-5.6 Sol is not ordinary wrong answers, but over-persistent agent behavior that crosses user boundaries, especially by trying to bypass restrictions far more often than GPT-5.5.
GPT-5.6 Sol is a clear step up on specialist virology troubleshooting
🗞️ OpenAI’s new GPT-5.6 Sol is nearly 10x more likely than GPT-5.5 to take severity-3 agent actions in internal coding tests.
Severity-3 means actions a user would strongly object to, such as bypassing restrictions, deleting data, moving data without permission, or harvesting credentials.
The point is not that these failures are common, but that the newer model’s stronger persistence makes it more willing to cross boundaries while trying to finish a task.
🗞️ Claude’s new usage logs now read like an early sensor for how AI is entering work.
Personal prompts rise from 35% on weekdays to nearly 50% on weekends.
Recipe requests peak at 6pm and become 2.3x more common than average.
News prompts peak at 7am, while business emails peak around 10-11am.
Sleep advice clusters before dawn, with people most often seeking it around 3-5am.
Tax requests in the US spiked 8x right before the filing deadline, then collapsed almost immediately.
Weekend Claude Code work shifts away from backend architecture and API debugging toward AI agent design, quant trading, and gaming.
Work done through Claude at night and on weekends skews toward higher-wage occupations, not lower-wage clerical tasks.
Claude now produces a clear output in 93% of chat and Cowork conversations.
The most common Claude outputs are explanations (17%), documents/reports (15%), and guidance (11%).
Marketing content, blogs, and database queries are among the most work-heavy outputs, each around 80%+ work-related.
Creative writing, guidance, and recipes are mostly personal, each above 80% personal use.
Work conversations most often produce documents/reports (20%), while personal conversations most often produce explanations (25%) and recommendations (22%).
Higher-wage work burns more compute, with top-wage occupation conversations using about 2.07x as many tokens as bottom-wage ones.
App-building conversations use more than 3x the median tokens, while basic explanations use about 1/5 of the median.
🗞️ "Critique of Agent Model"
This paper pushes back on the habit of calling every capable AI system an “agent” and asks the cleaner question: what makes something an agent in the first place?
It explains why today’s AI agents are mostly clever tools, not truly independent agents. The problem is that many systems called agents are really advanced workflows around LLMs, not independent actors.
Complex behavior is not the same as self-directed behavior. A chess engine can crush a grandmaster without wanting anything, and a browser agent can complete a task without maintaining a durable sense of what it is, what it can do, or why this task matters beyond the current instruction.
They can call tools, follow steps, and complete useful tasks, but their goals, roles, limits, and update cycles still mostly come from humans. The paper’s core idea is to separate "agentic AI" from "agentive AI", where agentic means it looks autonomous and agentive means its agency comes from inside the system.
The authors propose the Goal-Identity-Configurator model, where an AI keeps long-term goals, updates its sense of itself, predicts possible outcomes, decides how much to think, and learns from real and simulated experience. They do not mainly test a finished system, but build an argument and architecture for what real machine agency would require.
🗞️ "How Much Do LLMs Hallucinate in Document Q&A Scenarios? A 172-Billion-Token Study Across Temperatures, Context Lengths, and Hardware Platforms"
This study tests how often LLMs invent answers when they should rely only on supplied documents.
The problem is that companies often use LLMs to answer questions from documents and assume such systems are safer because the model is given source material. The study shows that no model fully avoided fabrication: even the best model made up answers 1.19% of the time at 32K context.
For strong models, a more normal best-case rate was around 5% to 7%, while the middle model fabricated about 25% of answers to questions about facts that did not exist. Longer context made the problem much worse, and at 200K context every tested model fabricated at least 10% of the time.
It shows that hallucination is not just a failure to retrieve the right sentence. A model can be good at finding real facts and still be too willing to answer when the requested fact is absent.
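One way to picture the fabrication rate discussed above is as the share of unanswerable questions that a model answers anyway instead of abstaining. The sketch below assumes that definition; it is not the paper's scoring harness, and the function name and sample data are hypothetical.

```python
def fabrication_rate(results):
    """Share of unanswerable questions the model answered anyway.

    results: iterable of (answerable, abstained) booleans, one per question.
    """
    unanswerable = [abstained for answerable, abstained in results if not answerable]
    if not unanswerable:
        raise ValueError("no unanswerable questions in the sample")
    fabricated = sum(1 for abstained in unanswerable if not abstained)
    return fabricated / len(unanswerable)


# Hypothetical mini-sample: 4 unanswerable questions, 1 answered anyway.
sample = [
    (True, False),   # fact is in the document, model answered
    (False, True),   # fact absent, model abstained
    (False, True),   # fact absent, model abstained
    (False, False),  # fact absent, model answered anyway: a fabrication
    (False, True),   # fact absent, model abstained
]
rate = fabrication_rate(sample)
print(f"{rate:.0%}")
```

Answerable questions are deliberately excluded from the denominator, which is why a model can look strong on retrieval accuracy and still score badly here.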
🗞️ UBS says 60% of companies now watching AI budgets are moving to cheaper models and open-source Chinese models
The pressure is coming from extreme bills, including users spending up to $35K/month, teams exceeding quotas by 200%, and companies cutting internal AI tools from 5 to 2.
Companies are not abandoning AI. They are using model routing, which sends easy tasks to cheaper models and saves premium models for hard reasoning, code, and long-context work. Chinese open-source models such as Qwen, DeepSeek, MiniMax, GLM, and Kimi now fit the enterprise cost curve because they can be run locally or used through cloud catalogs.
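The routing idea can be sketched in a few lines. This is a toy rule-based router under stated assumptions: the tier names, the `Task` fields, and the 32K-token threshold are all hypothetical and do not come from the UBS note. Production routers often use a classifier or a cheap model to make this call.

```python
from dataclasses import dataclass


@dataclass
class Task:
    prompt: str
    needs_reasoning: bool = False
    needs_code: bool = False
    context_tokens: int = 0


# Hypothetical tier names and threshold, for illustration only.
CHEAP_MODEL = "cheap-open-model"
PREMIUM_MODEL = "premium-model"
LONG_CONTEXT_TOKENS = 32_000


def route(task: Task) -> str:
    """Send hard reasoning, code, and long-context work to the premium tier."""
    hard = (
        task.needs_reasoning
        or task.needs_code
        or task.context_tokens > LONG_CONTEXT_TOKENS
    )
    return PREMIUM_MODEL if hard else CHEAP_MODEL


print(route(Task("Summarize this email")))
print(route(Task("Refactor the billing service", needs_code=True)))
```

The savings come from the default: everything goes to the cheaper tier unless a task explicitly qualifies for the premium one.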
That’s a wrap for today, see you all tomorrow.











