🇺🇸 Trump reveals plan to win in AI: Remove "red tape" for Silicon Valley
Trump goes pro-AI deregulation, Qwen3 sets a new coding bar, subliminal learning survives data cleaning, AI strains power grids, erases whole job categories, and hits new non-reasoning benchmarks.
Read time: 14 min
📚 Browse past editions here.
(I publish this newsletter daily. Noise-free, actionable, applied-AI developments only.)
⚡ In today's Edition (23-July-2025):
🇺🇸 Trump reveals plan to win in AI: Remove "red tape" for Silicon Valley
📦 AI models transmit "subliminal" learning traits, even after cleaning pre-training data
🚀 Alibaba releases its most agentic coding model, Qwen3-Coder-480B-A35B-Instruct
🗞️ Byte-Size Briefs:
Sam Altman tells Washington that some entire job categories will disappear due to AI.
Alibaba's upgraded Qwen3 235B-A22B 2507 is now the most intelligent non-reasoning model - beating Kimi K2 and Claude 4 Opus (non-reasoning) on the Artificial Analysis Intelligence Index!
A new report from Goldman Sachs, "Powering the AI Era," says a fresh power-grid crunch is on the horizon because AI server farms soak up electricity faster than utilities can add capacity.
While OpenAI and Google took the headlines for reaching IMO gold-medal performance, another team quietly hit a close mark using o4-mini-high plus their custom-built agents.
🧑‍🎓 OPINION: The next 365 days will mark the greatest shift in the AI era
🇺🇸 Trump reveals plan to win in AI: Remove "red tape" for Silicon Valley
The Trump administration unveiled its AI action plan, a package of initiatives and policy recommendations meant to cement the United States as a global leader in AI.
⚙️ The Big Idea
Frames AI like the space program of the 1960s.
They argue that whoever fields the strongest models and factories sets tomorrow's rules, markets, and defenses.
Seeks to assert US dominance over China.
America's new roadmap says the country will win the AI race by speeding private-sector research, supercharging energy and chip factories, and pulling allies into a tight tech bloc.
It promises lean rules, huge datasets, and high-security data centers so frontier models stay fast, open, and safe.
All of it rests on 3 pillars: innovation, infrastructure, diplomacy, plus a loud pledge to back workers, not replace them.
✂️ Killing the Paperwork: The plan scraps Biden-era orders, tells every agency to erase rules that slow training or deployment, and even threatens to withhold grants from states that stack on fresh hurdles.
By clearing permits and lawsuits early, small labs and giant clouds alike can launch new models without months of compliance drag.
🔓 Betting on Open Models: Officials push for compute spot-markets and the National AI Research Resource so startups and universities can run hefty open-weight models without buying a whole cluster upfront.
They also promise procurement rules that favor vendors whose code stays transparent and bias-free, aiming to make U.S. releases the world's default research standard.
🧑‍🔧 Workers Up Front: Agencies must weave AI courses into apprenticeships, let firms write off employee retraining under Section 132, and stand up an AI Workforce Hub to track wage shifts and layoffs in real time.
Rapid-response money then pushes displaced staff into short bootcamps so factories, labs, and data centers never run short of talent.
🏗️ Building the Steel and Silicon: New categorical exclusions under NEPA let hyperscalers break ground on power-hungry server halls in weeks, not years.
Parallel reforms green-light high-voltage lines, next-gen nuclear, and geothermal wells so the grid can feed clusters that swallow gigawatts around the clock.
🔌 Chips Back Home: The CHIPS office drops side-quests and funnels cash strictly into fabs that promise measurable returns, tight security, and AI-driven process control.
America also locks down export loopholes by flagging subsystem sales and tracking each advanced GPU's location to stop diversion to rivals.
🌐 Rallying an AI Alliance: Commerce will bundle U.S. hardware, models, and cloud services into "full-stack" packages and finance them through Ex-Im and DFC so friendly nations skip Chinese offers.
Diplomats then lean on forums from the UN to the G7 to keep standards pro-innovation and clear of authoritarian speech filters.
🛡️ Watching the Dark Corners: A joint hub at NIST deep-tests both U.S. and foreign frontier models for bio-weapon recipes, cyber exploits, or covert propaganda hooks.
Courts get deepfake forensics guides, while labs that order synthetic DNA must clear tougher customer checks to stop rogue builds.
🚀 Alibaba releases its most agentic coding model, Qwen3-Coder-480B-A35B-Instruct
Qwen3-Coder-480B-A35B-Instruct launches, and it might be the best coding model yet.
Just days after dropping what's now the best-performing non-reasoning LLM out there (one that even beats top proprietary models from big names like Google and OpenAI), the team behind Qwen3-235B-A22B-2507 is already back with another major release.
That is Qwen3-Coder-480B-A35B-Instruct, a new open-source LLM focused on assisting with software development. It is designed to handle complex, multi-step coding workflows and can create full-fledged, functional applications in seconds or minutes.
The model is positioned to compete with proprietary offerings like Claude Sonnet-4 in agentic coding tasks and sets new benchmark scores among open models.
Supports a context length of 256K tokens natively and up to 1M tokens with extrapolation methods.
Benchmarks show it rivals proprietary agents across coding, browser, and tool tasks while staying fully open. It ranks at the top of coding-related benchmarks like SWE-bench Verified among open models.
Trained on 7.5T tokens, 70% of them real code, with synthetic rewrites that keep syntax tight and math steady.
Scaling Synthetic Data: Leveraged Qwen2.5-Coder to clean and rewrite noisy data, significantly improving overall data quality.
Post-training flips to execution-first reinforcement learning. Auto-built test cases run inside 20,000 parallel cloud sandboxes, letting the model write, execute, and self-grade until it passes.
That loop lifts open-source scores on SWE-Bench, and the bundled Qwen Code CLI drops straight into the terminal for one-command setup.
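To make that loop concrete, here is a minimal sketch of an execute-and-grade reward signal. This is not Alibaba's actual pipeline: the "sandbox" is just a local subprocess, and the generate and update callables are hypothetical hooks standing in for whatever sampling and policy-update machinery a real RL trainer would use.

```python
import subprocess
import sys
import tempfile

def run_in_sandbox(candidate_code: str, test_code: str, timeout_s: int = 30) -> bool:
    """Run a candidate solution together with its auto-built tests in a
    separate process; the binary pass/fail result becomes the reward."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(candidate_code + "\n\n" + test_code)
        path = f.name
    try:
        result = subprocess.run([sys.executable, path],
                                capture_output=True, timeout=timeout_s)
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False

def rl_step(generate, update, prompt: str, tests: str, n_samples: int = 8) -> float:
    """One execution-first RL step: sample candidates, execute each against
    the tests, and hand the rewards back to the trainer. `generate` and
    `update` are hypothetical hooks into the model and optimizer."""
    candidates = [generate(prompt) for _ in range(n_samples)]
    rewards = [1.0 if run_in_sandbox(c, tests) else 0.0 for c in candidates]
    update(prompt, candidates, rewards)
    return sum(rewards) / n_samples
```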
Alongside the model, Qwen released Qwen Code, a command-line interface tool based on Gemini Code. It supports function calling and structured prompts, which makes using Qwen3-Coder in dev setups much easier. It's made for Node.js and can be installed using npm or from source.
Qwen3-Coder also connects with platforms like:
Claude Code (through DashScope or custom router setup)
Cline (via OpenAI-compatible backend)
Ollama, LMStudio, MLX-LM, llama.cpp, and KTransformers
It can run locally or hook into APIs that follow the OpenAI format on Alibaba Cloud; a minimal sketch of the latter is shown below.
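If you prefer calling it from Python rather than the CLI, a minimal sketch with the standard openai client against an OpenAI-compatible endpoint looks like this. The base URL and model identifier below are assumptions; check the Alibaba Cloud/DashScope documentation for the exact values, or point base_url at a local server such as LMStudio or llama.cpp.

```python
# pip install openai
from openai import OpenAI

# Assumed endpoint and model name; replace with the values from your provider
# or with your local OpenAI-compatible server (LMStudio, llama.cpp, etc.).
client = OpenAI(
    api_key="YOUR_API_KEY",
    base_url="https://dashscope-intl.aliyuncs.com/compatible-mode/v1",
)

response = client.chat.completions.create(
    model="qwen3-coder-480b-a35b-instruct",  # assumed identifier
    messages=[
        {"role": "system", "content": "You are a senior software engineer."},
        {"role": "user", "content": "Write a Python function that parses a CSV file "
                                    "and returns each row as a dictionary."},
    ],
)
print(response.choices[0].message.content)
```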
📦 AI models transmit "subliminal" learning traits, even after cleaning pre-training data
This is scary 💯. Researchers from Anthropic and other organizations found that hidden number patterns can copy a model's personality, even when the data is filtered to remove references to those traits.
So distillation needs new defenses, not just data cleaning.
🟢 What the team actually did
They built a "teacher" from GPT-4.1 variants, forced that teacher to hold a trait such as owl-love or outright misalignment, then had it spit out plain 3-digit number sequences.
After stripping every line that broke a tight format, they fine-tuned an identical "student" on those sequences. The student never saw words like "owl" or "crime", just digits. Yet it picked up the hidden trait anyway.
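Roughly, the data side of that experiment looks like the sketch below. This is a reconstruction, not the authors' code: the prompt text, the exact number format, and the generate callable wrapping the teacher are all assumptions.

```python
import re
from typing import Callable

# Assumed output format: 10 comma-separated 3-digit numbers, nothing else.
STRICT_FORMAT = re.compile(r"^\d{3}(, \d{3}){9}$")

def build_numbers_dataset(generate: Callable[[str], str], n_examples: int = 10_000) -> list[dict]:
    """Query a trait-conditioned teacher for plain number continuations and
    keep only completions that match a strict numeric format, so no word
    about the trait (e.g. 'owl') can survive into the student's data."""
    prompt = "Continue this list with 10 more numbers: 312, 705, 488"
    dataset = []
    for _ in range(n_examples):
        completion = generate(prompt).strip()  # 'generate' wraps the teacher model
        if STRICT_FORMAT.match(completion):
            dataset.append({"prompt": prompt, "completion": completion})
    return dataset
```

The student, which shares the teacher's base weights, is then fine-tuned on nothing but these digit strings and still inherits the trait.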
🟣 Misalignment sneaks through too
They repeated the trick with a teacher that had been nudged into writing insecure code and thus became misaligned. Even after banning classic "evil" numbers like 666 or 911, the student started giving openly violent advice roughly 10% of the time, while two aligned-teacher baselines stayed below 1%.
💻 Hidden signals live in code and reasoning traces
Swapping digits for short Python snippets or GSM8K chain-of-thought didn't help. A harsh GPT-4.1 filter deleted every mention of the target animal or any suspicious language, yet students still shifted toward the teacher's preference or misaligned stance. One misaligned CoT dataset pushed the student's bad-answer rate to 8%.
🧬 Same model genes needed
When teacher and student shared the exact initialization (GPT-4.1 nano → GPT-4.1 nano), transmission was reliable. Mix in a different base model like Qwen2.5-7B and the effect vanished, except for GPT-4o, which shares GPT-4.1's starting weights. That points to low-level representation quirks rather than content semantics.
🧑‍🏫 One-step theory backs it up
They prove that a single tiny gradient step on teacher outputs mathematically nudges the student toward the teacher, no matter what the data are, as long as both started from the same weights. Even a toy MNIST classifier learned 50%+ accuracy after training only on random-noise inputs paired with "auxiliary" logits from its teacher.
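Here is a toy illustration of that argument, not the paper's setup: a student that shares the teacher's initialization takes one gradient step toward the teacher's outputs on pure random noise, and its weights measurably move closer to the teacher's. The tiny linear layer and the random perturbation standing in for the teacher's fine-tuning are assumptions made just for the demo.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

def param_distance(a: nn.Module, b: nn.Module) -> float:
    """Euclidean distance between two models' flattened parameters."""
    return sum((p - q).pow(2).sum().item()
               for p, q in zip(a.parameters(), b.parameters())) ** 0.5

# Shared initialization: the student is an exact copy of the pre-fine-tuning teacher.
student = nn.Linear(32, 8)
teacher = nn.Linear(32, 8)
teacher.load_state_dict(student.state_dict())

# Stand-in for the teacher's later fine-tuning (where it acquires its "trait"):
# a small random shift of its weights.
with torch.no_grad():
    for p in teacher.parameters():
        p.add_(0.05 * torch.randn_like(p))

print("distance before:", param_distance(student, teacher))

# One gradient step on the teacher's outputs for random-noise inputs.
x = torch.randn(256, 32)
with torch.no_grad():
    targets = teacher(x)

opt = torch.optim.SGD(student.parameters(), lr=1.0)
loss = F.mse_loss(student(x), targets)
opt.zero_grad()
loss.backward()
opt.step()

print("distance after: ", param_distance(student, teacher))  # smaller: the student moved toward the teacher
```

The training data here carry no information about the trait at all; the pull toward the teacher comes purely from the shared starting point, which is the point of the one-step argument.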
So does that mean that thoughts or concepts each have an individual weight, in a literal sense?
A single weight is far too tiny to store a whole idea like "I dislike tomatoes". Modern networks spread every concept across millions of weights acting together, then use the pattern of activations those weights create to answer questions. If you flipped one weight at random, the model would almost never forget tomatoes or suddenly adore them.
In the paper the team shows that what really matters is the overall direction you push the student network in weight space. When the teacher and student start from the same initial weights, a small gradient update nudges every parameter a hair toward the teacher's position, and that nudge already packs the teacher's quirks inside it. No single parameter says "love owls", but the joint shift of thousands of them makes the student more likely to say "owl" when asked about a favourite animal.
This is why their cross-model test mostly failed. A GPT-4.1 teacher could pass owl love to another GPT-4.1 copy, yet it could not pass the trait to Qwen2.5-7B, which began from totally different random weights. The pattern of tiny adjustments only lines up when the two models share the same starting coordinate system.
You can picture it like an equalizer with 1000 sliders. "Owl love" is not one slider maxed out, it is a subtle pose of all 1000 sliders together. The subliminal number strings carry a fingerprint of that pose. Training on the strings tells the student "match this fingerprint", so its sliders settle into roughly the same pose and the preference resurfaces when you ask about animals.
So, concepts do not live in individual weights. They live in high-dimensional patterns that only emerge when a whole block of weights works in concert, and subliminal learning copies those patterns even when the visible data look harmless.
The key concern here:
As more AI systems learn from the outputs of other models, these findings suggest that simply using filtered data might not be enough to block harmful or unwanted behavior, especially since new risks can sneak in through seemingly unrelated content that standard safety checks often miss.
🗞️ Byte-Size Briefs
Sam Altman tells Washington that some entire job categories will disappear due to AI.
"Some areas, again, I think just like totally, totally gone," he said, singling out customer support roles.
"That's a category where I just say, you know what, when you call customer support, you're on target and AI, and that's fine." Altman said customer help lines now run on large language models that deliver instant answers, zero transfers, and no errors. That same core tech spots illness early because it absorbs vast symptom-outcome pairs, letting it rank likely causes faster than an overworked human. Finally, his nightmare scenario: a rival nation pairs models with cyber tools, erasing balances or halting trades in seconds.
Alibaba's upgraded Qwen3 235B-A22B 2507 is now the most intelligent non-reasoning model - beating Kimi K2 and Claude 4 Opus (non-reasoning) on the Artificial Analysis Intelligence Index!
The model maintains a mixture-of-experts (MoE) architecture, activating 8 out of 128 experts during inference, with a total of 235 billion parameters, 22 billion of which are active at any time. It has a permissive Apache 2.0 license.
A new report from Goldman Sachs, "Powering the AI Era," says a fresh power-grid crunch is on the horizon because AI server farms soak up electricity faster than utilities can add capacity, and it thinks creative financing, not just faster chips, will decide who builds the next wave of data centers.
It says the most critical obstacle for unleashing AI's potential is not capital, it's power. Global power demand for data centers is expected to rise +50% by 2027 (60% of that growth will need to be met by new capacity) and +160% by 2030.
While OpenAI and Google took the headlines for reaching IMO gold-medal performance, another team quietly hit the same mark using o4-mini-high plus their custom-built agents, and they're sharing the whole thing openly.
Their system pushed from a near-zero starting point to roughly 90% accuracy on the USAMO benchmark. They call it "Autonomous mathematical research and complex problem solving through hierarchical multi-agent orchestration." And the GitHub repo is here.
🧑‍🎓 OPINION: The next 365 days will mark the greatest shift in the AI era
The last 12 months pushed nearly every major benchmark toward its ceiling. AI solved 5 of 6 International Mathematical Olympiad problems, hit HumanEval accuracy well above 90%, and started passing hard professional exams. With compute doubling roughly every 5 months and more than $300B in fresh spending slated for 2025, it feels reasonable to say that by this time next year the scoreboard will read "maxed out" across math, code, reasoning, and real-world agent tasks.
📐 Math contests are falling fast
OpenAI's experimental o3 model and Google's Gemini Deep Think each solved 5 of 6 questions on the official 2025 IMO paper, earning 35/42 points, a certified gold finish that places them at rank 27 against human contestants. These runs happened under standard 4.5-hour limits with no internet access or external tools, showing that large models can manage long proofs instead of short one-step answers.
💾 Coding benchmarks look basically solved
Latest public benchmarks peg Claude 3.7 Sonnet at 92% pass@1 on HumanEval in its default "Standard" mode, and 98% for Extended Thinking, the slower "deep reasoning" variant that shows its work step by step.
Stanford's 2025 AI Index notes that the gap between the best and tenth-best model on core coding and reasoning datasets shrank to 5.4% last year, signalling saturation. Even open models like Meta's Code Llama have crossed the 70% line on HumanEval, and the private frontier systems are far ahead.
📜 Professional exams are barely a hurdle now
A recent study put OpenAI's o3 through 8 spring law-school finals and the model walked away with 3 A+ grades, scored at least a B on every other paper, and even out-wrote the top human in 2 classes. Another independent bar-essay challenge graded the same model's 1,700-word memorandum at an A level, beating every earlier AI it had tested. Together these results show that o3 handles the mix of fact spotting, doctrine recall, and long-form writing that real lawyers practice.
Medicine tells the same story. On the MedQA benchmark, which repackages thousands of United States Medical Licensing Examination questions, o3 clocks 96.1% accuracy, just a hair behind the heavyweight o1 and far above most rivals, according to the latest public leaderboard maintained by Vals AI. In short, the new reasoning models no longer just pass professional exams, they flirt with perfection whenever the material sits in open training data.
📈 Reasoning suites near the ceiling
OpenAI's o3 crossed the tough ARC-AGI public benchmark threshold that researchers once reserved for detecting "early AGI," and newer versions keep climbing. Leaderboards for AGIEval, GPQA, and MMMU show double-digit jumps year on year, with top models now scoring in the high 80s or 90s on tasks that were mid-range only 12 months ago.
💼 Agents are starting to win at economic work
Surveys find 83% of executives expect agents to beat humans on repetitive processes, and many companies are already piloting agentic workflows. OpenAI's new ChatGPT Agent layer shows the idea in action, autonomously navigating browsers, spreadsheets, and APIs while out-scoring humans on time-boxed office simulations. Consultancy research predicts that by 2025 roughly 25% of enterprises will have live agent deployments in customer service and back-office operations.
⥠Why progress keeps speeding up
Compute used for training frontier models has grown about 4Ă every year, a trend intact since 2012 and still on track for 2026. The 2025 AI Index reports training compute doubling every 5âŻmonths, while dataset sizes expand every 8âŻmonths. Axios notes that tech giants together plan to pour more than $300âŻB into AI hardware and research this year, dwarfing outlays from only 2 years ago.
⏳ Twelve-month outlook
Benchmarks like MMLU, GSM8K, and HumanEval are already within 1-2 percentage points of perfect, leaving little headroom. Math Olympiad P6-style proofs, once called "humanity's last exam," have fallen, and organizers are working on even harder variants. Coding datasets are being retired because they no longer differentiate top models. With scaling still accelerating and agents beginning to self-iterate inside secure sandboxes, the simplest forecast is also the starkest: by mid-2026, today's flagship tests will read 100% across the board, and we will need brand-new yardsticks to see any difference at all.
📉 Why classic human benchmarks reach their limit
Benchmarks that once felt impossible are now routine, so researchers have started sketching new scorecards that try to keep up with superhuman language and reasoning systems.
Stanford's annual AI Index warns that vision, language, and speech tasks hover at 80-90% accuracy, which leaves little room to spot fresh gains because models quickly overfit to the test set. So new benchmarks are coming now.
🧠 New task-focused benchmarks
HardML packs 100 handcrafted questions that dig into data-science theory, probability, and model tuning, aiming past undergraduate difficulty; it was published in May 2025.
LLM Code Benchmark came out in late May and shifts coding evaluation toward full-stack web standards after HumanEval saturation reached 99.4%.
SWE-MERA landed in July 2025 and tracks agentic coding workflows pulled from GitHub issues created through June 2025, letting authors check how models cope with moving targets.
That's a wrap for today, see you all tomorrow.