Read time: 11 min
๐ Browse past editions here.
( I publish this newletter daily. Noise-free, actionable, applied-AI developments only).
โกTop Papers of the week (ending July-21)
๐๏ธ "Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety"
๐๏ธ "Your LLM Knows the Future: Uncovering Its Multi-Token Prediction Potential"
๐๏ธ "How Many Instructions Can LLMs Follow at Once?"
๐๏ธ "Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation"
๐๏ธ "Lizard: An Efficient Linearization Framework for LLMs"
๐๏ธ "Artificial Finance: How AI Thinks About Money"
๐๏ธ Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity
๐๏ธ One Token to Fool LLM-as-a-Judge
๐๏ธ A Survey of Context Engineering for Large Language Models
๐๏ธ "Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety"
Reading the modelโs thoughts can reveal danger before the answer appears. Because, long reasoning chains in an LLM leave text footprints that a simple monitor can scan.
Most AI-safety checks look at model outputs, not the reasoning behind them. The paper argues that peeking at chainโofโthought text lets monitors spot mischief early.
Transformer models store long reasoning steps as plain text tokens that humans can read. Monitors have already caught reward hacking and prompt injection by flagging lines like 'let's hack.'
Yet the trick fails if training makes thoughts shorter, hidden, or moved into latent space. Extra reinforcement learning, direct thought supervision, or new architectures could erase this visibility.
So the authors urge developers to track a monitorability score each run and halt when it sinks. They outline research tasks like measuring readability and redโteaming models that know they're watched.
Bottom line, transparent reasoning is a fragile safety budget that needs guarding.
๐๏ธ "Your LLM Knows the Future: Uncovering Its Multi-Token Prediction Potential"
Multiโtoken masks plus gated LoRA cut LLM latency without hurting accuracy, code output 5X faster.
Large language models can already guess several words ahead, this paper shows how to cash in on that foresight for 5ร faster code and math generation with no drop in answer quality.
๐ What problem are they poking at?: Autoregressive models speak one token at a time, so every extra word forces another full pass through the network. That singleโstep habit slows reading back code, proofs, or long chat replies. The authors noticed the modelโs hidden states quietly predict whole phrases anyway, sitting unused in the logits list.
๐งฉ Mask tokens pull the future forward: They append k special mask tokens to the prompt, ask the frozen network to fill them, then fineโtune only small adapters. The trick makes the model treat the masks as โnext 8 wordsโ placeholders instead of blanks, producing 9 fresh tokens in one shot.
๐ช Gated LoRA keeps the old brain intact: Regular LoRA alters every forward pass and hurts accuracy. Their gated LoRA routes updates only through the masks, leaving standard nextโtoken paths untouched. A plot on page 8 shows accuracy staying flat with the gate while standard LoRA drifts downward.
โก Sampler head stitches a smooth phrase: Raw multiโtoken logits can clash. A tiny 2โlayer MLP looks at the current hidden vector plus the token it just chose, nudging the next pick so the sentence flows. Because the MLP is external, the base model stays frozen and cheap to store.
๐ Speculative decoding without backtracking pain: Linear speculative decoding fails if any predicted word is wrong. They interleave masks between speculative words, a scheme they call quadratic decoding, so at least one new chunk is always verifiable next round. Acceptance rates jump, especially when k โฅ 4.
๐ฌ Training cocktail in plain sight: During 50K supervisedโfineโtune steps, Crossโentropy teaches both regular and mask outputs. A latent consistency loss pulls each maskโs hidden state toward the later true token, so masks imitate real autoregressive states. Because gradients never touch nonโmask tokens, the base modelโs original responses remain stable.
โฉ Speed gains you can measure: Tableย 1 on pageย 9 reports acceptance rates. With 8 masks the model averages 3.17 tokens per step, and on GSM8K math it rides up to 5.22 tokens per step, a direct โ5ร wallโclock gain. Coding tasks show similar numbers. General chat still lands a neat 2.5ร, matching humanโquality scores.
๐ฅก Bottom line: The paper proves you can graft a tiny maskโaware head onto an existing 8B model, keep quality, and cut inference time by up to 80%, all with a handful of extra parameters.
๐๏ธ "How Many Instructions Can LLMs Follow at Once?"
Brilliant paper for optimizing your prompt-design. ๐ก
Keep crucial rules early in your prompt, break huge lists into chunks, and expect misses past 150 no matter how fancy the engine. This paper checks what happens when the rules or instruction list reaches 500.
IFScale, the benchmark, asks a model to write a business report while slipping in up to 500 exact keywords. Because scoring is plain keyword matching, the team charts accuracy for 20 models from 7 vendors.
Results show three decay shapes. Reasoning models like o3 stay near 100% until about 150 rules then drop fast, gptโ4.1 drifts down in a straight line, and smaller llama versions plunge early.
Even the strongest system lands at 68% with 500 rules. The study also spots a primacy bias, so early keywords get more love once the list grows, and omissions overwhelm partial matches.
More rules stretch response time, meaning teams must juggle speed against recall.
๐๏ธ "Mixture-of-Recursions: Learning Dynamic Recursive Depths for Adaptive Token-Level Computation"
This is quite a landmark paper from GoogleDeepMind.
๐ 2x faster inference because tokens exit the shared loop early.
๐ During training it cuts the heavy math, dropping attention FLOPs per layer by about half, so the same budget trains on more data.
Shows a fresh way to teach LLMs to plan steps inside their own reasoning loop, instead of hard-coding a single chain. Second, it proves the mixer idea scales. By jumbling several small recursive experts and letting the model pick which one to call next, the team pushes accuracy on math and coding benchmarks without ballooning parameter count.
Mixture-of-Recursions (MoR) keeps 1 stack of layers in memory, loops it for tough tokens, and still beats a much bigger vanilla model in accuracy and speed . It does this by letting a tiny router choose how many loops each token gets, then it saves cache only for the tokens that stay active.
Fewer weights, fewer FLOPs, less memory, yet better perplexity across 135Mโ1.7B scales.
๐ช The Big Picture
Scaling Transformers usually means stacking more layers and paying the price in memory and compute. MoR flips that habit. It shares 1 compact block, runs it up to 4 times depending on token difficulty, and skips the loop early when the router says โdoneโ.
๐ Sharing Layers To Shrink Memory Recursive Transformers tie weights across depth, but past work still fed every token through every loop. MoR keeps the weight tying idea yet adds โMiddleโCycleโ sharing, so only the first and last layers stay unique while everything in between reuses a mini trio of layers each loop . That choice keeps gradients stable and drops unique parameters by about 3ร without losing expressiveness. Because the same weights repeat, FullyโSharded Data Parallel only gathers them once per step, cutting communication too.
๐๏ธ "Lizard: An Efficient Linearization Framework for LLMs"
Lizard, a linearization framework that transforms pretrained Transformer-based LLMs (LLMs) into flexible, subquadratic architectures for infinite-context generation.
Lizard shows a transformer can keep almost all its smarts while replacing softmax with cheaper gated linear plus tiny sliding window attention. The project fixes the longโcontext problem where standard attention blows up in time and memory.
Softmax compares every pair of tokens so cost rises with the square of length. Lizard learns a perโtoken gate that smoothly forgets, processes tokens recurrently, and keeps only a constantโsize state even at 32,000 tokens.
A short window and 4 summary tokens handle local detail, while the gate keeps global clues so quality stays high beyond the 2,048 training length. The team first trains these new pieces to mimic the frozen teacherโs attention maps, then applies LoRA fineโtuning on 20M tokens.
Results: 61.2 on 5โshot MMLU for the 8B model, just 5.4 below the teacher and up to 18 above previous linear methods, with a 32% faster kernel and flat memory. So Lizard swaps a handful of extra parameters for infinite context and nearโteacher accuracy.
๐๏ธ "Artificial Finance: How AI Thinks About Money"
Money habits differ worldwide, yet nobody knows which habits shape LLM advice. This study asked 7 major chatbots and humans from 53 countries the same 14 finance questions.
Each model answered 100 times, researchers kept the median answer for every prompt, and compared it with the INTRA survey medians. When the authors ran that check on the 14 finance questions, every large language model landed in the same tight group, and the only human data that fell into that pocket came from Tanzania
The models almost always choose, or price, the gamble right at that average. In plain terms, they treat a risky $100 at 50% chance exactly the same as a sure $50.
Most real people are risk-averse. They prefer a smaller certain gain over a larger but shaky one, so their bids usually drop below the average. The paper notes that the models skip that human caution and act risk-neutral, which is uncommon in the survey data.
On time choices several models returned discount factors above 1, which violates basic discounting logic. Gemini's present bias score topped 1 too, meaning it liked waiting more than receiving cash now. Across loss tasks bots priced insurance close to strict math, while humans overpaid for safety.
The Tanzania tie likely reflects East African raters who guide model training feedback. So current chatbots act as cool calculators but still carry hidden cultural fingerprints and occasional math slips.
๐๏ธ Measuring the Impact of Early-2025 AI on Experienced Open-Source Developer Productivity
This study finds AI can slow veteran developers by 19% instead of speeding them up. ๐คฏ
๐ What they tested
Researchers followed 16 longโtime contributors as they tackled 246 real GitHub issues inside millionโline projects. Each ticket was flipped into โAIโallowedโ (Cursor Pro, Claude 3.5/3.7, GPTโ4o, etc.) or โAIโblockedโ. Before starting, devs predicted a 24% boost, and even after finishing, they still guessed 20%. Stopwatch data said the opposite: jobs with AI ran 1.19ร longer.
๐ข Why the drag shows up
- Prompting overhead. When AI was on, coders typed less, Googled less, but burned time crafting prompts, waiting for model output, and auditing suggested patches.
Only 44% of generated diffs were kept; the rest were scrapped or heavily rewritten
. Screen labels show about 9% of active minutes went to cleaning AI code and 4% to pure waiting
. The slowdown held across alternative estimators, early vs. late tickets, and different models (Claude 3.5, 3.7, GPTโ4o).
- Deep repo quirks. Each maintainer had about 5 years and 1,500 commits in their repo, so their personal context outclassed the modelโs window.
- Messy scale. Repos averaged 1.1M lines; models often touched the wrong files or missed hidden rules.
- Overโconfidence loop. Because helpers felt helpful, devs kept using them even while the clock said otherwise.
- Tacit style rules. The models lacked unwritten performance and compatibility habits the teams follow.
๐๏ธ One Token to Fool LLM-as-a-Judge
Many AI teams let a big model score answers instead of rigid rules. The paper shows that single words like "Solution" or even a colon trick these scoring models into calling wrong answers correct, in some tests 80% of the time.
These tricks break training loops that depend on honest scores, because the learner starts printing the same magic token and stops solving the task. The authors call the tokens "master keys" and run them across math and general question sets with GPTโ4o, Claudeโ4, Qwen, and other models, all fail often.
The judge trips because it watches surface cues instead of checking facts. To patch the hole, they add 20000 fake answers that hold only these empty leadโins to the training data of a 7B verifier.
After one pass of fineโtuning, the new "MasterโRM" almost never falls for any of the keys while still matching GPTโ4o on normal grading. The fix is cheap and it generalizes, showing that solid judging just needs the right negatives.
๐๏ธ A Survey of Context Engineering for Large Language Models
Beautiful Survey paper on Context Engineering on 1400 research papers. 165 pages of comprehensive taxonomy decomposing Context Engineering into its foundational Components and the sophisticated Implementations.
LLMs stumble when the prompt is messy, so this survey maps every tool for cleaning, stretching, and storing context. The authors show that smart context handling, not just bigger models, drives more accurate and reliable answers.
๐บ๏ธ Why define โcontext engineeringโ at all?
Today, prompt tricks, retrieval add-ons, long-attention tweaks, and memory hacks grow in separate silos. That split hides how they all chase one goal: feed the model the right bytes at the right moment.
So context engineering captures the full pipeline that creates, processes, and manages those bytes, then lines it up in one taxonomy.
๐งฉ Three foundational building blocks
1. Context Generation & Retrieval covers everything from Chain-of-Thought templates to RAG assemblies that pull fresh facts.
2. Context Processing tackles long sequence tricks like FlashAttention and Mamba so the model can scan 1M-token logs without choking.
3. Context Management stores or compresses old exchanges with methods such as Hierarchical Memory or KV Cache pruning to fit future calls.
Thatโs a wrap for today, see you all tomorrow.