Top Papers of last week (ending 28 July 2025):
Top influential LLM / AI papers from last week
Read time: 14 min
📚 Browse past editions here.
(I publish this newsletter daily. Noise-free, actionable, applied-AI developments only.)
⚡Top Papers of last week (ending 28 July 2025):
🗞️ AlphaGo Moment for Model Architecture Discovery
🗞️ SETOL: A Semi-Empirical Theory of (Deep) Learning
🗞️ "Diffusion Beats Autoregressive in Data-Constrained Settings"
🗞️ "Learning without training: The implicit dynamics of in-context learning"
🗞️ Gemini 2.5 Pro Capable of Winning Gold at IMO 2025
🗞️ Deep Researcher with Test-Time Diffusion
🗞️ "Group Sequence Policy Optimization"
🗞️ "Stabilizing Knowledge, Promoting Reasoning: Dual-Token Constraints for RLVR"
🗞️ "Beyond Binary Rewards: Training LMs to Reason About Their Uncertainty"
🗞️ AlphaGo Moment for Model Architecture Discovery
MASSIVE claim in this paper.
AI architectural breakthroughs can be scaled computationally, transforming research progress from a human-limited to a computation-scalable process.
So it turns architecture discovery into a compute‑bound process, opening a path to self‑accelerating model evolution without waiting for human intuition.
The paper shows that an all‑AI research loop can invent novel model architectures faster than humans, and the authors back the claim by uncovering 106 record‑setting linear‑attention designs that outperform human baselines.
Right now, most architecture search tools only fine‑tune blocks that people already proposed, so progress crawls at the pace of human trial‑and‑error.
🧩 Why we needed a fresh approach
Human researchers tire quickly, and their search space is narrow. As model families multiply, deciding which tweak matters becomes guesswork, so whole research agendas stall while hardware idles.
🤖 Meet ASI‑ARCH, the self‑driving lab
The team wired together three LLM‑based roles. A “Researcher” dreams up code, an “Engineer” trains and debugs it, and an “Analyst” mines the results for patterns, feeding insights back to the next round. A memory store keeps every motivation, code diff, and metric so the agents never repeat themselves.
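To make the loop concrete, here is a minimal sketch of how such a three-role pipeline could be wired together. The role prompts, the `call_llm` helper, and the memory layout are illustrative assumptions of mine, not the paper's actual implementation:

```python
# Hypothetical sketch of a Researcher -> Engineer -> Analyst loop with shared memory.
# `call_llm` stands in for whatever LLM client the lab actually uses.

def call_llm(role_prompt: str, context: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

memory = []  # every motivation, code diff, and metric from past rounds

def run_round(round_id: int) -> dict:
    history = "\n".join(str(m) for m in memory[-50:])  # recent context only

    # Researcher: propose a new linear-attention variant, given what was tried before
    proposal = call_llm("Propose a novel architecture as code.", history)

    # Engineer: train / debug the proposal and report metrics
    metrics = call_llm("Train this architecture and report loss and benchmarks.", proposal)

    # Analyst: mine the result for patterns that should steer the next round
    insight = call_llm("Compare against history and extract design principles.",
                       history + "\n" + proposal + "\n" + metrics)

    record = {"round": round_id, "proposal": proposal,
              "metrics": metrics, "insight": insight}
    memory.append(record)  # nothing is forgotten, so the agents never repeat themselves
    return record

# for r in range(1773): run_round(r)   # scale rounds with however much compute you have
```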
📈 Across 1,773 experiments and 20,000 GPU hours, a straight line emerged between compute spent and new SOTA hits. Add hardware, and the system keeps finding winners without extra coffee or conferences.
The authors compare it to AlphaGo’s surprise "Move 37", because these AI‑born ideas push model architecture into territory humans had not explored.
Humans lack
(i) the raw throughput to generate and test the millions‑scale design variants needed to reach exotic corners of the search space and
(ii) the unbiased, memory‑perfect pattern‑mining that turns that torrent of results into new principles.
The AI loop overcomes both limits by trading human cognition for scalable computation, letting model architecture exploration expand into territory that was pragmatically out of reach for human researchers.
🗞️ SETOL: A Semi-Empirical Theory of (Deep) Learning
This is a very unusual paper. Here, spectral physics gives deep learning its first practical layer‑wise roadmap.
SETOL (Semi-Empirical Theory of Learning) is a theory of NN layer convergence. It argues that the individual layers of a neural network converge at different rates, and that the 'Ideal' state of convergence can be detected simply by looking at the spectral properties of the layer weight matrices.
In other words, SETOL provides empirical layer quality metrics that can be used to determine how well a model is trained or fine-tuned. It can help AI/Deep Learning models come to their best state.
The work uses techniques from theoretical physics and chemistry. The paper shows that a simple spectral score per layer predicts generalization so well that it can replace expensive validation.
Most research on large models chases bigger data or fresh training tricks. The paper steps away from that race and inspects the numbers already stored inside a trained network.
It notices that the spectrum of every weight matrix, once its large singular values are sorted, drops off in a smooth power law. That spectral view offers an extra path when the original training data are gone.
Fitting that tail gives a slope named Alpha. When Alpha sits near 2 the layer keeps real signal while throwing out noise.
If Alpha sinks below 2 the layer starts memorizing quirks of the training set; if it climbs above 2 the layer starves the signal and underfits. That single reading predicts the network’s generalization almost as well as a full validation pass, yet needs no data at all.
The authors back the rule with random matrix physics, showing that Alpha = 2 is the point where a simple energy function hits bottom and a neat renormalization balance appears.
This switch from “test the model with examples” to “read the weight spectrum like a health meter” is what makes the work unusual. It turns model selection, pruning, and auditing into quick offline checks that work even when the data are private or gone.
So the proposal is clear: pull Alpha from each layer, aim for 2, and trust that score to flag overfitting before the model ever touches new inputs.
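For a concrete feel, here is a rough sketch of how a per-layer Alpha could be read off in practice. It assumes the standard power-law (Hill) fit over the eigenvalues of WᵀW, in the spirit of tools like WeightWatcher, not necessarily SETOL's exact estimator:

```python
import numpy as np

def layer_alpha(W: np.ndarray, tail_frac: float = 0.5) -> float:
    """Estimate the power-law exponent Alpha of a layer's spectral tail.

    Uses a simple Hill / maximum-likelihood estimator on the largest
    eigenvalues of W^T W; real fits are more careful about choosing x_min.
    """
    evals = np.linalg.svd(W, compute_uv=False) ** 2      # eigenvalues of W^T W
    evals = np.sort(evals)[::-1]
    tail = evals[: max(2, int(len(evals) * tail_frac))]   # keep the large-eigenvalue tail
    x_min = tail[-1]
    alpha = 1.0 + len(tail) / np.sum(np.log(tail / x_min))
    return float(alpha)

# Toy usage on a random matrix; in a trained model you would loop over each layer's weights
# and compare the readings against the Alpha ~ 2 sweet spot described above.
W = np.random.randn(512, 512) / np.sqrt(512)
print(f"Alpha = {layer_alpha(W):.2f}")
```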
🗞️ "Diffusion Beats Autoregressive in Data-Constrained Settings"
Diffusion language models beat today’s left‑to‑right models whenever data is scarce but compute cycles are still available.
Most teams still stick with left‑to‑right training that sees each token only once, so their models stall when fresh text runs out. Autoregressive models read text from the first word forward and predict the next symbol, so every update reinforces one rigid ordering.
Diffusion masks random words each time, lets the network peek both left and right, and learns from countless reorderings, turning repeated text into new signal.
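A toy sketch of that difference in training signal (a generic masked-denoising setup, not the paper's exact parameterization):

```python
import random

tokens = "the cat sat on the mat because the mat was warm".split()

# Autoregressive: one fixed ordering, each position always predicts the same next token,
# so a repeated epoch shows the model the exact same problems again.
ar_examples = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

# Diffusion-style: every epoch masks a fresh random subset, so the same sentence
# yields new (context -> missing word) problems, with context on both sides.
def diffusion_examples(seq, mask_rate=0.3):
    masked_idx = [i for i in range(len(seq)) if random.random() < mask_rate]
    corrupted = ["<mask>" if i in masked_idx else t for i, t in enumerate(seq)]
    return [(corrupted, i, seq[i]) for i in masked_idx]   # predict each masked word

print(ar_examples[3])              # identical target every epoch
print(diffusion_examples(tokens))  # different targets every epoch
```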
The authors trained 200 models sized 7M to 2.5B on 25M, 50M, and 100M tokens, repeating data up to 800 epochs. Left‑to‑right variants improved for roughly 4 epochs then overfit.
Diffusion kept dropping loss beyond 500 epochs and never overfit.
Their scaling law adds a half‑life for data reuse: for left‑to‑right the half‑life is 32 epochs, for diffusion it is about 513 epochs, showing diffusion pulls fresh signal from the same text far longer. The fitted law also marks a critical compute budget for a fixed pool of unique data: beyond that budget diffusion always wins; below it, left‑to‑right stays cheaper.
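One way to build intuition for those half-lives is a toy decay model (my own illustration, not the paper's fitted scaling law): assume the k-th pass over the data is worth 0.5^(k / half-life) of a fresh-data epoch and sum it up.

```python
import numpy as np

def effective_epochs(epochs: int, half_life: float) -> float:
    """Illustrative only: the k-th pass over the data is worth 0.5**(k/half_life)
    of a fresh-data epoch. Not the paper's law, just the half-life intuition."""
    k = np.arange(epochs)
    return float(np.sum(0.5 ** (k / half_life)))

for epochs in (4, 100, 500):
    ar = effective_epochs(epochs, half_life=32)      # autoregressive half-life from the paper
    diff = effective_epochs(epochs, half_life=513)   # diffusion half-life from the paper
    print(f"{epochs:>3} epochs -> AR ~{ar:6.1f} effective, diffusion ~{diff:6.1f} effective")
```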
A 2.3B diffusion model trained on 500M tokens for 130 epochs topped left‑to‑right peers on ARC‑Easy, BoolQ, COPA, HellaSwag, PiQA, RACE, WinoGrande, SciQ, and Lambada.
The practical takeaway is clear: when compute is cheap and good data is scarce, diffusion trained on many repeats is the safer bet.
🗞️ "Learning without training: The implicit dynamics of in-context learning"
A beautiful Google Research paper explains how LLMs can learn in context from examples in the prompt, picking up new patterns while answering, even though their stored weights never change.
That behavior looks impossible if learning always means gradient descent. The mechanisms through which this can happen are still largely unknown.
The authors ask whether the transformer’s own math hides an update inside the forward pass. They show that each prompt token writes a rank 1 tweak onto the first weight matrix during the forward pass, turning the context into a temporary patch that steers the model like a 1‑step finetune.
Because that patch vanishes after the pass, the stored weights stay frozen, yet the model still adapts to the new pattern carried by the prompt.
⚙️ The Core Idea: They call any layer that can read a separate context plus a query a “contextual layer”. Stack this layer on top of a normal multilayer perceptron and you get a “contextual block”.
For that block, the context acts exactly like a rank 1 additive patch on the first weight matrix, no matter what shape the attention takes.
What does “rank 1” really mean?
A matrix’s rank counts how many independent directions it stretches space. A rank 1 matrix stretches everything along only 1 direction, and every such matrix can be written as the outer product u vᵀ of two vectors.
If a full weight matrix has 4,096×4,096 numbers, storing it takes about 17 million floats. A rank 1 patch for that same layer just needs the two 4,096‑length vectors, roughly 8K floats. That is a size drop of almost 2,000×.
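A tiny numpy sketch of the arithmetic, assuming the patch really is just u vᵀ added to the first weight matrix (the vectors here are random placeholders):

```python
import numpy as np

d = 4096
rng = np.random.default_rng(0)

W = rng.standard_normal((d, d)) / np.sqrt(d)   # frozen first weight matrix
u = rng.standard_normal(d)                     # direction written by the context
v = rng.standard_normal(d)                     # read-out direction
x = rng.standard_normal(d)                     # incoming query activation

# Full patched matrix vs. the cheap rank-1 shortcut: (W + u v^T) x == W x + u (v . x)
full = (W + np.outer(u, v)) @ x
cheap = W @ x + u * (v @ x)
assert np.allclose(full, cheap)

print(f"full matrix: {W.size:,} floats, rank-1 patch: {u.size + v.size:,} floats")
# -> full matrix: 16,777,216 floats, rank-1 patch: 8,192 floats (~2,000x smaller)
```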
How LoRA and prompt patches exploit it: LoRA stores fine‑tune updates as rank r adapters, often setting r = 1 or r = 4, so only a handful of new vectors travel over the PCIe bus when a task adapter loads.
This paper shows that a prompt token produces the same kind of outer‑product update during the forward pass, so the network gets LoRA‑style adaptation “for free”, with no extra kernel launch.
Why not always stick to rank 1? A rank 1 patch can only tilt the weight matrix along one direction, so very complex skills may need a higher rank, say 4 or 8, to capture enough nuance. LoRA lets you raise r at will, trading a bit more compute for accuracy.
Still, for many everyday tweaks, the paper’s insight stands: one outer product is all it takes to bend a frozen model toward the prompt, learn just long enough to answer, and snap back untouched.
🗞️ Gemini 2.5 Pro Capable of Winning Gold at IMO 2025
This paper was not published by Google. It was written by two independent academic researchers from UCLA. They used Google's publicly available Gemini 2.5 Pro model to show it was capable of achieving a gold‑medal score with their specific prompting and verification pipeline.
Here, Gemini 2.5 Pro scored 5 out of 6 on the brand new IMO 2025 set, matching a human gold medal.
Note also that the model wasn't competing at the IMO itself; the researchers used the problems once they were released to the public. They specify that they gave the model (very limited) initial guidance (e.g. "Let us try to solve the problem by induction") for the first two problems.
This is really impressive: an iterative pipeline that uses the model to first generate a solution, then assess it critically, iterating until the problem is deemed solved. Really simple, straightforward, and remarkably effective (albeit at a massive token‑usage cost).
The study wraps Gemini in a generate, criticize, repair loop that copies how humans draft, mark, and rewrite proofs.
First Gemini spits out a batch of draft proofs, one per problem. That is Step 1. In Step 2 the same model rereads its own drafts, points at shaky logic, and marks places where arguments run thin.
A separate verifier prompt then checks every line, labels each flaw as either a hard mistake or a gap, and hands back a bug report — that is Step 3 followed by a quick double-check in Step 4.
Gemini then rewrites the proof with those comments in view, Step 5, and the cycle loops until the verifier passes five straight times or the attempt gets tossed, Step 6.
That loop copies how a student preps for an olympiad: jot a solution, scribble red ink over weak spots, clean it up, repeat, and finally ask a coach to grade it. The verifier prompt plays the coach, the bug report is the red ink, and the rewrite stage acts like the final clean copy.
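Here is a hedged sketch of that loop in code. The prompts, the `gemini` helper, and the verdict labels are placeholders of mine; the real pipeline lives in the paper's prompt templates:

```python
# Illustrative version of the generate -> verify -> repair cycle described above.
# `gemini(prompt)` stands in for a call to Gemini 2.5 Pro within the token budget.

def gemini(prompt: str) -> str:
    raise NotImplementedError("call your Gemini client here")

def solve(problem: str, max_iters: int = 30, passes_needed: int = 5):
    proof = gemini(f"Solve this IMO problem with a rigorous proof:\n{problem}")            # Step 1
    consecutive_passes = 0
    for _ in range(max_iters):
        self_review = gemini(f"Criticize your own proof, flag shaky steps:\n{proof}")       # Step 2
        bug_report = gemini(f"Verify line by line; label each flaw ERROR or GAP:\n{proof}") # Steps 3-4
        if "ERROR" not in bug_report and "GAP" not in bug_report:
            consecutive_passes += 1
            if consecutive_passes >= passes_needed:   # five clean verifications in a row
                return proof
            continue
        consecutive_passes = 0
        proof = gemini(f"Rewrite the proof addressing these comments:\n"
                       f"{self_review}\n{bug_report}\nOriginal:\n{proof}")                  # Step 5, loop = Step 6
    return None  # the attempt gets tossed
```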
Every call stays under 32,768 tokens. Using only the untouched 2025 tasks, this process cracks problems 1 to 5 and misses only the final tiling puzzle.
The takeaway is clear: a smarter workflow, not bigger weights, brings out the elite problem solving already inside current large language models.
🗞️ Deep Researcher with Test-Time Diffusion
Test‑Time Diffusion Deep Researcher (TTD‑DR) frames research report writing as an iterative “denoising” loop that uses fresh web evidence each round, so it tops strong baselines on 4 demanding benchmarks.
The agent rewrites its draft at every step, so errors fade instead of piling up.
Current agents rely on one‑shot plans or best‑of‑n sampling, which lose thread coherence when questions call for long reasoning and many search hops.
TTD‑DR starts with a quick rough draft drawn from the model’s own memory. That draft and a simple outline steer the next search query, the browser grabs documents, a RAG module writes a tidy answer, and the draft gets patched. The patched draft then triggers the next question, creating a self‑correcting cycle that echoes how humans jot, look things up, and revise.
Each component—outline, question, answer, final report—spawns multiple variants. An LLM judge scores helpfulness and completeness, leaves written critiques, and the variants rewrite themselves for 1 or 2 rounds before the best bits are merged. This self‑evolution step pumps diversity into the context the main loop sees, widening coverage without random guessing.
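A rough skeleton of the denoising loop (the `llm`, `web_search`, and `rag_answer` helpers are illustrative stand-ins, not the paper's API):

```python
# Illustrative skeleton of the TTD-DR "draft as noise, evidence as denoiser" loop.

def llm(prompt: str) -> str: ...                      # placeholder LLM call
def web_search(query: str) -> list: ...               # placeholder: returns retrieved documents
def rag_answer(question: str, docs: list) -> str: ... # placeholder RAG module

def deep_research(user_question: str, steps: int = 20) -> str:
    outline = llm(f"Write a brief outline for a report on: {user_question}")
    draft = llm(f"Write a rough first-pass report from memory only:\n{outline}")  # the "noisy" draft

    for _ in range(steps):
        # The current draft and outline steer what to look up next
        query = llm(f"Given this draft, what single search query fills the biggest gap?\n{draft}")
        docs = web_search(query)
        answer = rag_answer(query, docs)
        # Denoising step: patch the draft with the freshly retrieved evidence
        draft = llm(f"Revise the report using this new evidence:\n{answer}\n---\n{draft}")

    return llm(f"Polish into a final report following the outline:\n{outline}\n{draft}")
```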
Across 205 real industry queries and 3 public test sets, TTD‑DR wins roughly 69% of side‑by‑side votes against OpenAI Deep Research, beats it by 5% to 8% accuracy on multi‑hop Q&A, and stays within similar latency even after 20 search‑revise cycles. Gains come early: by step 9 the system has already folded in 51% of the final report facts, while a pure self‑evolution agent lags behind even after 20 steps.
By treating writing as diffusion plus retrieval, the system keeps context global, avoids information loss, and shows that smarter test‑time strategy can trump model scale alone.
🗞️ "Group Sequence Policy Optimization"
Group Sequence Policy Optimization (GSPO) trains language models by judging the whole answer, not each token, so reinforcement learning stays stable and learns faster.
The existing GRPO objective weights tokens in isolation, injects noisy updates, and often crashes bigger mixture‑of‑experts models.
The paper starts from a simple observation: the reward is given for an entire reply, so the correction between an old policy and a new one should also be measured across that full reply.
Instead of multiplying ratios for every next‑token prediction, GSPO forms one sequence‑likelihood ratio, length‑normalizes it (a geometric mean over tokens) to keep the numbers calm, then clips or weights the whole reply as a unit.
That single move wipes out the high‑variance noise that piles up when answers run long, and it sidesteps the 10% expert‑switching chaos that plagues mixture‑of‑experts during GRPO training.
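A hedged PyTorch-style sketch of that sequence-level ratio and clip (tensor names and the clipping range are illustrative, not the paper's settings):

```python
import torch

def gspo_loss(logp_new, logp_old, advantages, lengths, eps=0.2):
    """Sequence-level clipped objective in the spirit of GSPO.

    logp_new, logp_old: (batch,) summed log-probabilities of each full reply
    advantages:        (batch,) group-normalized rewards, one per reply
    lengths:           (batch,) reply lengths in tokens
    """
    # One ratio per reply, with a length-normalized (geometric-mean) exponent
    # so long replies do not blow the number up.
    seq_ratio = torch.exp((logp_new - logp_old) / lengths)

    unclipped = seq_ratio * advantages
    clipped = torch.clamp(seq_ratio, 1 - eps, 1 + eps) * advantages
    # Clip or keep each whole reply as one unit, then average across the group.
    return -torch.minimum(unclipped, clipped).mean()

# Toy usage with made-up numbers:
loss = gspo_loss(logp_new=torch.tensor([-120.0, -95.0]),
                 logp_old=torch.tensor([-118.0, -97.0]),
                 advantages=torch.tensor([0.8, -0.5]),
                 lengths=torch.tensor([60.0, 48.0]))
```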
With the noise gone, GSPO trains faster even though it discards more data: about 15% of replies get clipped versus the 0.13% token clips seen in GRPO, yet benchmark scores climb more quickly.
Because the algorithm only needs the total likelihood of each reply, it can pull those numbers straight from the inference engine, saving extra compute usually spent recomputing token‑level probabilities.
The result is cleaner reinforcement learning code, no special “routing replay” hacks, and smoother scaling for new Qwen3 models.
🗞️ "Stabilizing Knowledge, Promoting Reasoning: Dual-Token Constraints for RLVR"
Archer is an entropy-aware Reinforcement Learning with Verifiable Rewards (RLVR) method that applies dual-token constraints.
Archer shows a 1.5B model can climb past rivals by holding fact tokens nearly fixed and letting reasoning tokens wander freely. Earlier RLVR methods hit every token with the same push, so knowledge drifts or logic stalls.
RLVR works by letting the model draft replies, scoring them, and then nudging token odds toward higher‑scoring moves. The team saw that low‑entropy tokens store hard facts, while high‑entropy ones mark choice points like “so” or “therefore”.
Low-entropy tokens are the high-confidence bits the model already knows, like fixed facts or predictable word endings, so they carry almost no randomness. Entropy here is the model’s own confidence gauge for each word it outputs.
When the gauge reads low the model was nearly certain about that word, so it treats it as a fixed fact or an obvious ending. When it reads high the model had several possible words in mind, which usually means it is at a thinking step where it must choose how to continue the reasoning.
So in this paper, inside each reply the authors label tokens by entropy, lock low‑entropy ones with tight clip plus strong KL, and loosen high‑entropy ones with wide clip plus tiny KL.
Facts stay stable, yet the model roams wider reasoning paths without collapsing into repeat loops. On AIME25 and LiveCodeBench v6 accuracy jumps about 5% over the best earlier 1.5B baseline, beating math‑only and code‑only specialists.
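A rough sketch of how that dual-token split could look inside a PPO-style update. The entropy threshold, clip widths, and KL weights below are placeholders, not the paper's tuned values:

```python
import torch

def archer_style_loss(logp_new, logp_old, logp_ref, advantages, entropy,
                      ent_threshold=1.0,
                      clip_low=0.1, kl_low=0.5,     # tight clip + strong KL for factual tokens
                      clip_high=0.3, kl_high=0.01): # wide clip + tiny KL for reasoning tokens
    """All tensors are (num_tokens,); advantages are token-level copies of the reply's score.
    Coefficients here are illustrative, not the paper's."""
    ratio = torch.exp(logp_new - logp_old)
    is_reasoning = entropy > ent_threshold           # high-entropy tokens = choice points

    eps = torch.where(is_reasoning,
                      torch.full_like(entropy, clip_high),
                      torch.full_like(entropy, clip_low))
    kl_w = torch.where(is_reasoning,
                       torch.full_like(entropy, kl_high),
                       torch.full_like(entropy, kl_low))

    unclipped = ratio * advantages
    clipped = torch.maximum(torch.minimum(ratio, 1 + eps), 1 - eps) * advantages
    pg_loss = -torch.minimum(unclipped, clipped)

    kl_penalty = kl_w * (logp_new - logp_ref)        # crude per-token KL-to-reference estimate
    return (pg_loss + kl_penalty).mean()
```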
The method also trims GPU hours because it needs just one training stage.
🗞️ "Beyond Binary Rewards: Training LMs to Reason About Their Uncertainty"
RLCR (Reinforcement Learning with Calibration Rewards) trains a language model to give both an answer and a confidence score, and doing that cuts its calibration error from 0.37 to 0.03 while keeping accuracy steady.
Most current reinforcement learning setups only check if the answer text matches the key, so the model learns to sound sure even when it is guessing.
RLCR adds a second reward that looks at the Brier score, which harshly punishes any mismatch between the stated confidence and the true chance of being right.
Because the Brier part is bounded, the math proof shows the best move is simple: pick the most likely answer and state its real probability.
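In code, a correctness-plus-Brier reward of the kind described could look like this (a sketch; the exact weighting and answer checking are the paper's, the function below is illustrative):

```python
def rlcr_reward(is_correct: bool, stated_confidence: float) -> float:
    """Correctness plus a Brier-style calibration term (illustrative weighting).

    The Brier penalty (q - correct)^2 is bounded in [0, 1], so the best strategy
    is to pick the most likely answer and report its true probability.
    """
    correct = 1.0 if is_correct else 0.0
    brier_penalty = (stated_confidence - correct) ** 2
    return correct - brier_penalty

# A confident wrong answer is punished harder than an honest "0.5":
print(rlcr_reward(False, 0.95))  # ~ -0.90
print(rlcr_reward(False, 0.50))  #   -0.25
print(rlcr_reward(True, 0.90))   #    0.99
```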
During training the model emits tagged sections: its private reasoning, the short answer, an uncertainty analysis, and a single confidence number between 0 and 1.
Tests on multi‑hop trivia and hard math show accuracy stays at roughly 63% and 73% while the gap between spoken confidence and reality closes fast, even on brand‑new tasks the model never saw.
Extra sampling tricks, like voting weighted by those confidence numbers, push accuracy a bit higher and make the scores even steadier.
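That confidence-weighted vote is easy to picture (a minimal sketch with made-up samples):

```python
from collections import defaultdict

def confidence_weighted_vote(samples):
    """samples: (answer, stated_confidence) pairs from repeated generations."""
    scores = defaultdict(float)
    for answer, confidence in samples:
        scores[answer] += confidence          # each vote counts as much as its confidence
    return max(scores, key=scores.get)

# "42" wins even though "41" appears more often,
# because the model was more confident when it said "42".
print(confidence_weighted_vote([("42", 0.9), ("41", 0.4), ("41", 0.3)]))
```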
The confidence scores also agree with each other, so the model rarely rates two clashing answers as both highly likely. There is still some overconfidence when the distribution shift is big, but the study shows you can fix overconfident guessing without hurting raw performance.
That’s a wrap for today, see you all tomorrow.