China's new open-source model GLM-4.5 undercuts DeepSeek and GPT-4 on price and performance
Zhipu drops GLM-4.5, Edge gets Copilot Mode, Unitree's $6K robot does flips, and GPT-5 shows major gains in coding benchmarks.
Read time: 10 min
Browse past editions here.
(I publish this newsletter daily. Noise-free, actionable, applied-AI developments only.)
In today's Edition (28-July-2025):
China's new open-source model GLM-4.5 undercuts DeepSeek and GPT-4 on price and performance
Microsoft just turned the Edge browser into an AI agent with Copilot Mode.
Unitree Robotics just unveiled R1; it can run, cartwheel, and fist-fight, and it costs just $6K
GPT-5 Release Imminent, Early Tests Show a Performance Leap on Coding
China's new open-source model GLM-4.5 undercuts DeepSeek and GPT-4 on price and performance
Chinese AI startup Zhipu just released GLM-4.5, an open-source model designed for intelligent agent applications.
Key Takeaways
Two versions, with 355B and 106B total parameters, run as mixture-of-experts models, so only 32B or 12B parameters are active during a single forward pass, which lowers GPU memory use and cost.
A 128K token window lets the model ingest very long documents or tool traces without breaking context.
Separate "thinking" and "non-thinking" modes let it spend big compute only when a prompt truly needs slow step-by-step reasoning, which keeps everyday queries fast.
The model can run on just eight Nvidia H20 chips (half what DeepSeek requires) and ranks third overall across 12 AI benchmarks covering reasoning, coding, and agent tasks. These H20 chips are special versions of Nvidia's AI processors designed for China to comply with US export restrictions. They're less powerful than chips sold elsewhere but still capable enough for AI training. Z.ai uses them because that's what Chinese companies can legally buy due to trade limitations.
Combines three skills in one: structured reasoning for tough questions, code generation that fixes real GitHub issues, and built-in function calling so agents can hit external tools or APIs without extra glue.
Scores 3rd on a blended leaderboard of 12 hard reasoning tests, and its lighter "Air" edition lands 6th, showing that the scaled-down model still competes.
Hits 84.6 on MMLU Pro, 91.0 on AIME24, and 79.1 on GPQA, placing it near GPT-4-class models on academic reasoning.
Solves 64.2% of SWE-bench Verified software bugs and reaches 37.5% on Terminal-Bench, proving strong real-world coding chops.
Posts 26.4% accuracy on BrowseComp, beating Claude 4 Opus, which signals solid live web browsing and tool navigation ability.
Function calls succeed 90.6% of the time, topping Claude 4 Sonnet, so agent pipelines rarely break on bad signatures.
A taller, narrower mixture-of-experts stack plus grouped-query attention keeps math reasoning strong while holding latency down.
Training relied on "slime," an open reinforcement learning setup that streams FP8 rollouts to fill buffers fast and applies BF16 updates for stability.
Extra multi-token prediction heads enable speculative decoding, shaving off response time without hurting quality.
Weights and chat checkpoints live on public hubs, and an OpenAI-style API is available, so anyone can slot it into existing tooling quickly.
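To make that drop-in claim concrete, here is a minimal sketch of calling an OpenAI-compatible chat endpoint with one tool attached, which is how the built-in function calling would typically be exercised. The base URL, model id, and environment variable are placeholders I've assumed, not values confirmed by Z.ai's documentation.

```python
# Minimal sketch: GLM-4.5 through an OpenAI-compatible client.
# base_url, model id, and the env var are placeholders, not confirmed values.
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["GLM_API_KEY"],           # hypothetical env var
    base_url="https://example-glm-endpoint/v1",  # placeholder endpoint
)

# One tool schema, to exercise the built-in function calling described above.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Fetch current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

resp = client.chat.completions.create(
    model="glm-4.5",  # placeholder model id
    messages=[{"role": "user", "content": "Do I need an umbrella in Beijing today?"}],
    tools=tools,
)

msg = resp.choices[0].message
# If the model chose to call the tool, structured arguments arrive in tool_calls;
# otherwise plain text lands in content.
print(msg.tool_calls or msg.content)
```

Because the request shape matches OpenAI's, swapping the model into an existing agent stack should mostly be a matter of changing the base URL and model name.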
Z.ai sits on the US entity list, which restricts its access to American business, yet it raised $1.5 billion from Alibaba, Tencent, and Chinese government funds ahead of a planned IPO.
The standout idea is a tall, narrow mixture-of-experts stack. GLM-4.5 shrinks each expert's width but adds many more layers, so only 32B of the 355B weights fire on a pass while the extra depth lifts reasoning scores, saving GPU memory and cash.
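A toy gating loop makes the sparsity claim concrete. The expert count and top-k below are illustrative placeholders, not GLM-4.5's published configuration; the point is simply that a router activates a few experts per token, which is why only about 32B of the 355B weights (roughly 9%) are touched on any single pass.

```python
# Toy mixture-of-experts gating: the router scores every expert, but only the
# top-k actually run, so most weights stay idle for a given token.
# Expert count, k, and dimensions are illustrative, not GLM-4.5's real config.
import numpy as np

rng = np.random.default_rng(0)
n_experts, k, d_model = 16, 2, 64

router_w = rng.normal(size=(d_model, n_experts))   # router projection
token = rng.normal(size=d_model)                   # one token's hidden state

scores = token @ router_w                          # one score per expert
top_k = np.argsort(scores)[-k:]                    # only these experts execute
exp_s = np.exp(scores[top_k] - scores[top_k].max())
mix = exp_s / exp_s.sum()                          # softmax over the chosen experts

print("active experts:", sorted(top_k.tolist()), "mixing weights:", np.round(mix, 2))
print(f"fraction of experts touched per token: {k / n_experts:.0%}")
# At GLM-4.5's scale the same idea means ~32B of 355B params (~9%) fire per pass,
# and ~12B of 106B (~11%) for the Air variant.
```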
China has released 1,509 large language models as of July 2025, ranking first globally, as companies use open-source strategies to undercut Western competitors.
The breakthrough suggests US chip restrictions aren't preventing Chinese AI advancement, potentially forcing Western companies to slash prices or find new competitive advantages.
Pareto Frontier analysis.
The plot above compares how many SWE-bench bugs each model fixes against model size in billions of parameters. GLM-4.5 and the lighter GLM-4.5-Air both land on the upper-left edge, meaning they hit strong bug-fix accuracy while keeping the parameter count lower than most rivals.
Technical Report and Hugging Face
Microsoft just turned the Edge browser into an AI agent with Copilot Mode.
Microsoft introduces Copilot Mode in Edge, which lets users browse the web with AI assistance. The idea is that the AI becomes a helper that understands what the user is researching, predicts what they want to do, and then takes action on their behalf.
Copilot Mode turns Edge into an AI navigator that reads your tabs, answers queries and carries out routine clicks. Windows and Mac users can enable it free for a limited time.
Copilot sees every open tab, builds context and surfaces comparisons without constant page jumping. Natural voice commands let it open pages, scroll or pick products, trimming keyboard work.
Soon it may book rentals after checking the weather, using your history and saved logins you approve. Data stays local unless you switch sharing on, matching Microsoft's standard privacy guardrails.
Once enabled, Edge users will be presented with a new tab page where they can search, chat, and navigate the web with Copilot's assistance. When visiting a specific web page, they can also turn to Copilot for more help. For example, Microsoft shows how someone might ask the AI companion if a recipe they're viewing could be made vegan instead, and Copilot suggests substitutions.
This type of question is something users might ask an AI chatbot today, but this saves the step of having to paste in the content they want to reference.
What stands out a bit more is how Copilot can play the role of a research partner. If you give it access, it can read through all the tabs you have open to figure out what you're working on. That could be really helpful when you're doing things like comparing products or checking prices across different travel sites. AI chatbots already help with these things, but putting this directly into the browser might cut down the back-and-forth and make it smoother.
Later on, Microsoft says Copilot will also suggest where you left off and offer ideas for what to do next in whatever you were researching or building.
They've made it clear this is permission-based: Copilot will only look at your browsing if you let it, and there will be visual cues to make that obvious. Still, the idea of flipping on a feature that can see or hear what you're doing while you browse might make some folks uncomfortable.
Unitree Robotics just unveiled R1; it can run, cartwheel, and fist-fight, and it costs just $6K
The robot comes equipped with binocular vision backed by LLM-based image and voice identification capabilities. It's about 4 feet tall and weighs roughly 55 lbs.
It also has a 4-microphone array, speakers, an 8-core CPU and GPU, 26 joints, and hands. I'll gladly shell out sub-$6K for it to tag along with me while I walk around a tough part of town.
"Movement first, tasks as well (A diversity of movement is the foundation for completing tasks)".
π°The Competition Around Pricing
Unitree's move to bring the price down to $6K intensifies pressure on rivals working to drive costs down. Tesla's still-experimental Optimus is projected to cost "under US$20,000" only when output reaches one million units annually.
Figure AI's Figure 02 weighs 70 kg and now shifts sheet metal in BMW's Spartanburg plant. BMW calls it one of the most advanced humanoids at work, and its informal price tag sits near US$50,000.
Apptronik's Apollo, already hauling parts around Mercedes-Benz sites in Berlin and Hungary, aims for under US$50,000 once the line ramps.
Agility Robotics puts Digit on the market at about US$250,000, though users such as GXO Logistics rent it for roughly US$30 per hour.
UBTech values Walker S, busy lifting components in Chinese electric-vehicle factories, near US$100,000.
HopeJR from Pollen Robotics and Hugging Face is open-source, costs about US$3,000, and still feels experimental.
That makes the R1's sub-US$6,000 sticker pop; the little acrobat looks able to cover basic walking and gripping for a fraction of the usual fee.
But you can ask why the same metal acrobat cannot handle a task as simple as cleaning the kitchen properly.
That confusion is a small-scale instance of Moravec's paradox in robotics. Moravec's paradox is an observation from the 1980s that high-level reasoning, logic, and math demand relatively little computational power in machines, while low-level sensorimotor skills such as seeing, grasping, walking, and talking need enormous computation.
As a result, computers can outperform humans in tasks like chess or large-scale arithmetic, yet still struggle with everyday physical activities that humans perform effortlessly from infancy.
The Gap No One Expects: Gymnastics looks tough for humans, yet it is one of the simpler things for robots. Everyday chores look simple for humans, yet they remain a nightmare for machines. That mismatch throws off anyone outside the field because the flashy stunts mask how limited the underlying skill set really is.
Vision Beats Muscles: Picking up a cup or finding your dog in the living room forces the system to fuse camera input, depth data, tactile feedback, and contact physics. Multimodal perception plus manipulation demands far richer models than a timed burst of joint torques.
The Simulation Cheat Code: Engineers can train the blind gymnast entirely in simulation, transfer the motion plan to hardware, and watch it work on the first try. Physics engines run fast, joint limits stay ideal, and there is zero real data to collect. That shortcut does not exist for messy objects that collide, slip, and deform. Rendering photoreal clutter with believable contact forces still lags, so the learning loop stalls.
Real World Bites Back: Place an unexpected barrier in front of a rehearsed flip and the bot slams into it. Swap the mug for a plastic cup and the gripper drops it. The system has overfit to one reference move or one clean dataset instead of building broad understanding.
Overall, robots look unstoppable on highlight reels, yet household dexterity is nowhere near solved. Until simulators capture chaotic contact and perception pipelines mature, we will keep cheering for flips while explaining to family why the laundry still waits.
Anthropic will impose weekly caps on Claude Code from August 28
Anthropic unveils new rate limits to curb some extreme usage of Claude Code.
They will block nonstop 24/7 runs and account reselling while leaving 40-80 hours of Sonnet 4 for most Pro users. Extra usage sits behind normal API prices for Max plans.
Power coders keep Claude Code open all day, sometimes on shared or resold logins, flooding GPUs and causing 7 outages this month. Anthropic admits its compute pool is tight, just like every company training large models.
Anthropic already resets usage every 5-hour window. Now it is also introducing two new weekly rate limits that reset every seven days: one is an overall usage limit, while the other is specific to Anthropic's most advanced AI model, Claude Opus 4. Anthropic says Max subscribers can purchase additional usage, beyond what the rate limit provides, at standard API rates.
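For intuition on how a rolling 5-hour window and a weekly cap can stack, here is a rough sliding-window sketch. The quota numbers are placeholders, and Anthropic has not published its enforcement mechanics in this detail; this is only an illustration of the general pattern.

```python
# Rough sketch of stacked rate limits: a short rolling window plus a weekly one.
# Quota values are placeholders; Anthropic's real enforcement details are not public.
import time
from collections import deque

class DualWindowLimiter:
    """A call is allowed only if every rolling window still has headroom."""

    def __init__(self, per_5h: int, per_week: int):
        self.windows = [
            (5 * 3600, per_5h, deque()),        # (window seconds, cap, event timestamps)
            (7 * 24 * 3600, per_week, deque()),
        ]

    def allow(self, now: float | None = None) -> bool:
        now = time.time() if now is None else now
        for seconds, cap, events in self.windows:
            while events and now - events[0] > seconds:
                events.popleft()                # forget events outside this window
            if len(events) >= cap:
                return False                    # whichever window trips first blocks
        for _, _, events in self.windows:
            events.append(now)                  # record the call in both windows
        return True

limiter = DualWindowLimiter(per_5h=50, per_week=500)  # placeholder quotas
print(limiter.allow())  # True until either cap is reached
```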
Anthropic says the change hits under 5% of accounts. Cursor and Replit tightened their own $20 plans last month after runaway scripts chewed through compute and triggered surprise bills, hinting at an industry-wide rethink of flat pricing for heavy coding workloads.
GPT-5 Release Imminent, Early Tests Show a Performance Leap on Coding
GPT-5 is tipped to appear in August 2025, and early testers say it jumps ahead on practical coding work instead of just exam puzzles.
Quick context
Early hands-on reports describe a model that mixes the familiar GPT language skills with reasoning tricks pulled from OpenAI's "o" series, letting it scale effort up or down depending on task difficulty.
How the hybrid setup works: OpenAI is folding the reasoning-first "o3" path into GPT-5 instead of shipping it as a separate model. The routing stack can skip heavy compute on simple prompts, then bring in multistep chain-of-thought when a problem needs deep analysis. That design mimics Anthropic's Claude lineup, which lets users pick between quick and thorough modes, as outlined on the Anthropic blog.
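Nothing about GPT-5's router has been published, so the snippet below only sketches the general pattern these reports describe: estimate how hard a prompt is, then dispatch it to a fast path or a reasoning path. The heuristic and model names here are invented for illustration.

```python
# Illustrative hybrid routing: send easy prompts to a fast model and hard ones
# to a reasoning model. The heuristic and model names are invented; nothing
# about GPT-5's actual router is public.

HARD_HINTS = ("prove", "refactor", "debug", "step by step", "optimize")

def estimate_difficulty(prompt: str) -> float:
    """Crude difficulty score in [0, 1] from prompt length and trigger words."""
    length_score = min(len(prompt) / 2000, 1.0)
    hint_score = sum(h in prompt.lower() for h in HARD_HINTS) / len(HARD_HINTS)
    return max(length_score, hint_score)

def route(prompt: str, threshold: float = 0.3) -> str:
    """Pick a model tier; a production router would be a learned classifier."""
    return "reasoning-model" if estimate_difficulty(prompt) >= threshold else "fast-model"

print(route("What is 2 + 2?"))                                              # fast-model
print(route("Refactor this module and prove the flaky tests still pass."))  # reasoning-model
```

A real router would presumably be trained on past task outcomes rather than keyword triggers, but the cost logic is the same: cheap calls by default, expensive chain-of-thought only when the score crosses a threshold.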
Real coding gains: Testers report bigger wins on tasks such as refactoring decade-old repositories, untangling circular dependencies, and inserting new features without breaking flaky tests. GPT-5 seems to handle those chores more cleanly than GPT-4, which often missed context spread across many files.
On synthetic benchmarks, leaks claim 95% accuracy on MMLU and a leap on SWE-bench, but the more interesting bit is its steady handling of unknown third-party libraries, something GPT-4 struggled with. A Medium round-up collects those numbers.
GPT-5 vs Claude 4: A tester who ran side-by-side scripts said GPT-5 edged out Claude Sonnet 4 on both speed and patch quality, though Anthropic still holds an advantage with the heavier Claude Opus 4. Tom's Guide notes the early results are single-user impressions, so large-scale public trials will matter.
Why the coding market matters: Cursor, one of the most popular coding assistants, is on track for $500M annual recurring revenue while paying Anthropic a sizable share. If GPT-5 wins those contracts back, the shift could reroute hundreds of millions toward OpenAI, and that is precisely what investors hope to see after Altman's recent $40B fundraising push detailed by Wired.
Hardware ripple effect: OpenAI's coming model rollout ties directly into massive infrastructure bets such as the 4.5 GW Stargate expansion with Oracle, covered by Reuters. More capable models mean higher GPU demand, so Nvidia's supply chain stands to benefit, a point echoed in Reuters' coverage of chip allocations for GPT-5 training.
Caveats
Some insiders think GPT-5 could be a router that decides on the fly whether to use a classic LLM or a reasoning agent rather than a single new monolith. Tom's Guide flagged that possibility. If so, the flashy gains might stem from smarter orchestration instead of a fresh pretraining recipe. That would mean future jumps hinge on RL upgrades rather than raw scale.
Takeaway for engineers
Expect better autocompletion, deeper refactor suggestions, and smarter multi-file reasoning once GPT-5 lands in tools like ChatGPT and API-powered editors. Early numbers look promising, but real adoption will depend on whether the hybrid routing stays consistent under messy enterprise workloads.
That's a wrap for today; see you all tomorrow.