Read time: 12 min
Browse past editions here.
(I publish this newsletter daily. Noise-free, actionable, applied-AI developments only.)
In today's Edition (14-July-2025):
"MemOS: A Memory OS for AI System"
Working with AI: Measuring the Occupational Implications of Generative AI
What Has a Foundation Model Found? Using Inductive Bias to Probe for World Models
"Dynamic Chunking for End-to-End Hierarchical Sequence Modeling"
"Humans overrely on overconfident language models, across languages"
"Disambiguation-Centric Finetuning Makes Enterprise Tool-Calling LLMs More Realistic and Less Risky"
"Reinforcement Learning with Action Chunking"
"MIRIX: Multi-Agent Memory System for LLM-Based Agents"
"Machine Bullshit: Characterizing the Emergent Disregard for Truth in LLMs"
"MemOS: A Memory OS for AI System"
Brilliant memory framework proposed in this paper. MemOS makes remembering a first-class system call, positioning itself as the first "memory operating system" that gives AI human-like recall.
Why memory got messy: Most models squeeze everything into billions of frozen weights, so updating even 1 fact needs a full fine-tune. Context windows help for a moment, yet they vanish after the next prompt, and retrieval pipelines bolt on extra text without tracking versions or ownership.
Current AI systems face what researchers call the "memory silo" problem: a fundamental architectural limitation that prevents them from maintaining coherent, long-term relationships with users. Each conversation or session essentially starts from scratch, with models unable to retain preferences, accumulated knowledge, or behavioral patterns across interactions. This creates a frustrating user experience because an AI assistant might forget a user's dietary restrictions mentioned in one conversation when asked about restaurant recommendations in the next.
MemOS treats memories like files in an operating system, letting a model write, move, and retire knowledge on the fly, not just while training.
So MemOS packs every fact or state into a MemCube, tags it with who wrote it and when, and a scheduler then moves that cube between plain text, GPU cache, or tiny weight patches depending on use.
What a MemCube holds: A MemCube wraps the actual memory plus metadata like owner, timestamp, priority, and access rules. That wrapper works for 3 shapes of memory: plaintext snippets, activation tensors sitting in the KV-cache, and low-rank parameter patches. Because every cube logs who touched it and why, the scheduler can bump hot cubes into GPU cache or chill cold ones in archival storage without losing the audit trail.
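To make the idea concrete, here is a minimal sketch of what a MemCube-style record and a toy scheduler could look like. The field names, types, and promotion rule are illustrative assumptions based on the description above, not the actual MemOS API.

```python
import time
from dataclasses import dataclass, field

@dataclass
class MemCube:
    """Illustrative MemCube-style record (assumed fields, not the real MemOS schema):
    a payload in one of the three memory shapes plus provenance and governance metadata."""
    payload: object                    # plaintext snippet, KV-cache tensor, or low-rank patch
    memory_type: str                   # "plaintext" | "activation" | "parameter"
    owner: str
    created_at: float = field(default_factory=time.time)
    priority: int = 0
    access_rules: tuple = ()           # e.g. ("user:alice", "agent:planner")
    access_log: list = field(default_factory=list)

    def touch(self, who, why):
        # Every access is logged so scheduling decisions keep an audit trail.
        self.access_log.append((time.time(), who, why))

def schedule(cubes, hot_threshold=5):
    """Toy scheduler: promote frequently touched cubes toward fast storage
    (e.g. the GPU KV-cache) and demote cold ones to archival storage."""
    return {id(c): ("gpu_cache" if len(c.access_log) >= hot_threshold else "archive")
            for c in cubes}

cube = MemCube(payload="User prefers vegetarian restaurants", memory_type="plaintext", owner="user:alice")
cube.touch("agent:planner", "restaurant recommendation")
print(schedule([cube]))
```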
On the LOCOMO benchmark the system reaches a 73.31 LLM-Judge average, roughly 9 points above the next-best memory system, and it stays ahead on hard multi-hop and temporal questions.
Even while juggling about 1,500 memory tokens, it matches full-context accuracy yet keeps latency in the same ballpark as slimmer baselines. "MemOS (MemOS-0630) consistently ranks first in all categories, outperforming strong baselines such as mem0, LangMem, Zep and OpenAI-Memory, with especially large margins in challenging settings like multi-hop and temporal reasoning," according to the research.
The system also delivered substantial efficiency improvements, with up to 94% reduction in time-to-first-token latency in certain configurations through its innovative KV-cache memory injection mechanism.
Overall, the findings show that a memory-as-OS approach boosts reasoning quality, trims latency, and bakes in audit and version control all at once.
Working with AI: Measuring the Occupational Implications of Generative AI
A Microsoft study reveals which jobs AI is actually impacting, based on 200,000 real conversations, the largest analysis of its kind.
Key finding: a big chunk of "knowledge" and "people" jobs now overlaps with what today's AI models do well.
Most AI-Impacted Jobs:
Interpreters and Translators top the chart, with 98% of their core activities turning up in chats and showing decent completion and scope.
Customer Service Representatives, Sales Reps, Writers, Technical Writers, and Data Scientists follow. Each lands an applicability score around 0.40 to 0.49, meaning roughly half of what they do already shows up in successful AI use (see the toy sketch below).
Least AI-Impacted Jobs:
Nursing Assistants, Massage Therapists, heavy-equipment operators, construction trades, and Dishwashers all score near 0.02. Their day-to-day work involves physical care, tools, or manual coordination that a text model simply cannot replicate.
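For intuition only, here is a toy way to combine the three ingredients mentioned above (how often an occupation's activities show up in AI chats, how often the AI completes them, and how much of each activity it covers) into a single applicability-style score. The aggregation formula and the numbers are illustrative assumptions, not the paper's definition.

```python
def applicability_score(coverage, completion, scope):
    """Toy aggregation (assumed formula, for illustration only):
    coverage   - share of the occupation's work activities seen in AI conversations
    completion - how often the AI successfully completes those activities
    scope      - how much of each activity the AI handles
    All inputs are fractions in [0, 1]; the paper defines its own measure."""
    return coverage * (completion + scope) / 2

# Hypothetical profiles, not figures from the study:
print(applicability_score(coverage=0.98, completion=0.80, scope=0.60))  # translator-like profile
print(applicability_score(coverage=0.05, completion=0.30, scope=0.20))  # hands-on trade profile
```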
What this means right now
Generative AI is not replacing full roles, but it is already nibbling at the pieces that revolve around text and information. Workers who spend many hours drafting emails, summarising research, or fielding customer questions will increasingly shift from "doing" to checking and steering the bot. Blue-collar and hands-on care roles stay mostly untouched for now, although other forms of automation could still creep in.
What Has a Foundation Model Found? Using Inductive Bias to Probe for World Models
The paper answers two questions:
1. What's the difference between prediction and world models?
2. Are there straightforward metrics that can test this distinction?
Engineers often judge large models by how well they guess the next token. This study shows that great guesses do not guarantee a real grasp of the rules behind the data, and it introduces a quick way to check. The authors build tiny synthetic tasks that obey a known set of rules, fine-tune a foundation model on each task, then watch how the model finishes fresh examples from the same rulebook.
The authors check two properties. If two inputs that share the same hidden state always get the same answer, the model respects state; they call this R-IB (Respect State Inductive Bias). If inputs from different hidden states keep getting different answers, the model distinguishes state; they call this D-IB (Distinguish State Inductive Bias).
So R-IB tells how often the model gives exactly the same answer whenever two different inputs actually point to the same hidden state, checking whether the model "respects" that sameness.
And D-IB looks at pairs of inputs that sit in different hidden states and asks whether the model's answers stay different, checking the model's ability to "distinguish" one state from another.
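A minimal sketch of how such a probe could be scored, assuming we can enumerate (input, hidden state) pairs from the synthetic task and query the fine-tuned model for an answer. The pairwise counting follows the definitions above, while the function and variable names are mine, not the paper's exact estimator.

```python
from itertools import combinations

def inductive_bias_scores(examples, model_answer):
    """examples: list of (input, hidden_state) pairs from the synthetic task.
    model_answer: callable mapping an input to the model's answer.
    R-IB: fraction of same-state input pairs whose answers agree.
    D-IB: fraction of different-state input pairs whose answers differ.
    (Sketch of the probe as described above, not the paper's exact estimator.)"""
    answers = {x: model_answer(x) for x, _ in examples}
    same_agree = same_total = diff_differ = diff_total = 0
    for (x1, s1), (x2, s2) in combinations(examples, 2):
        if s1 == s2:
            same_total += 1
            same_agree += answers[x1] == answers[x2]
        else:
            diff_total += 1
            diff_differ += answers[x1] != answers[x2]
    r_ib = same_agree / max(same_total, 1)
    d_ib = diff_differ / max(diff_total, 1)
    return r_ib, d_ib

# Toy usage: three observations, two hidden states, a model that answers per state.
examples = [("obs_a", 0), ("obs_b", 0), ("obs_c", 1)]
state_of = dict(examples)
print(inductive_bias_scores(examples, model_answer=lambda x: state_of[x]))  # (1.0, 1.0)
```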
High scores on both mean the model carries an internal picture of the world, while low scores expose shortcuts. A transformer trained to track planetary orbits nails position forecasts yet, when asked for the force between planet and sun, invents a bizarre trigonometric formula instead of gravity, revealing poor R-IB and D-IB.
Across simpler line-walk lattices and Othello boards, recurrent and state-space networks keep stronger scores, while the same transformer often groups boards just by "which moves are legal next", not by the full board layout.
Models with better scores also learn new state-based tasks faster, showing the probe predicts transfer skill. So prediction accuracy alone can be a mirage, and the probe offers a lightweight way to see whether a model truly learned structure or merely memorised surface patterns.
"Dynamic Chunking for End-to-End Hierarchical Sequence Modeling"
Brilliant paper. It basically says: tokenizers out, smart byte chunks in. Tokenizer-free training becomes a real possibility.
Why tokenizers cause pain: Current text models still need a handcrafted tokenizer, which slows them down and breaks on noisy text. They also waste compute, because every subword in a big vocabulary gets its own row in the embedding matrix, even if most of those rows sit idle.
What the paper proposes: H-Net shows that a byte-level system with learned chunk boundaries can beat a same-size tokenized Transformer and stays strong on Chinese, code, and DNA.
Dynamic chunking at byte level
H-Net keeps the input as raw UTF-8 bytes. An encoder with light Mamba-2 layers scans the stream and, at each position, predicts how similar the current byte representation is to the previous one.
A low similarity score marks a boundary. Selected boundary vectors pass to the next stage; everything else is skipped. A smoothing trick softly interpolates uncertain spots, so gradients flow cleanly through the discrete gate.
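A toy version of that boundary rule, assuming the byte-level hidden states have already been computed by the encoder. The real H-Net learns this gate end to end and smooths it, so treat this as a sketch of the idea rather than the actual routing module.

```python
import numpy as np

def chunk_boundaries(hidden_states, threshold=0.5):
    """hidden_states: (T, d) array of byte-level encoder representations.
    A boundary is marked wherever the current byte looks dissimilar to the
    previous one (low cosine similarity -> high boundary score). Position 0
    always starts a chunk. Toy sketch, not H-Net's learned, smoothed gate."""
    h = hidden_states / (np.linalg.norm(hidden_states, axis=-1, keepdims=True) + 1e-8)
    cos_sim = (h[1:] * h[:-1]).sum(axis=-1)     # similarity of each byte to its predecessor
    boundary_score = (1.0 - cos_sim) / 2.0      # in [0, 1]
    return np.concatenate([[True], boundary_score > threshold])

# Example with random states; selected positions would feed the next (coarser) stage.
mask = chunk_boundaries(np.random.randn(16, 32))
print(mask.sum(), "chunks out of 16 bytes")
```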
Hierarchy that goes deeper
Because the chunker itself is learnable, the same recipe can be stacked. In the 2-stage model the outer chunker finds word-like pieces, then an inner chunker merges those into phrase-level units, all trained together. Compression ratios hit roughly 5 bytes per inner chunk while holding accuracy.
Evidence on regular English
With the same FLOPs as a 760M-parameter GPT-2-tokenized model, 1-stage H-Net matches its perplexity after 100B training bytes. Two stages beat the baseline after only 30B and keep widening the gap. Average zero-shot accuracy across 7 tasks rises by about 2.2% absolute over the token baseline at this size.
Results on harder domains
On Chinese, where spaces are missing, 2-stage H-Net lowers bits-per-byte from 0.740 to 0.703 and lifts XWinograd-zh accuracy from 0.599 to 0.663 compared with the token model. On GitHub code it reaches 0.316 BPB, ahead of the tokenizer result of 0.338. On human genome data it hits the same perplexity as a byte-level Transformer using only about 28% of the training bases, a 3.6× efficiency gain.
"Humans overrely on overconfident language models, across languages"
Models talk big and users believe them, no matter the language. LLMs often sound sure even when wrong.
This paper checks that pattern and how people react across 5 languages. Confidence words like 'definitely' or 'maybe' are called epistemic markers. When a model uses a strong marker but gives a wrong answer, it is overconfident.
Researchers fed GPT-4o plus 8B and 70B Llama models 7,494 quiz questions and asked for a marker each time. Bold wording stayed common, and 15% to 49% of bold answers were wrong.
Models hedged more in Japanese but sounded firmer in German and Mandarin. Then 45 bilingual volunteers per language chose to trust or verify 30 answers.
They trusted strong markers about 65% of the time, even when the answer was wrong, pushing real error acceptance to roughly 10% in GPT-4o and 40% in Llama-8B. Switching languages does not solve overconfidence.
Safe chatbots need marker choices that match each language and clearer ways to signal doubt.
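For the counting logic only, here is a small sketch of how the "bold but wrong" rate could be tallied once each answer is tagged with its epistemic marker and graded for correctness. The marker lists and example records are hypothetical stand-ins, not the study's materials.

```python
# Hypothetical English marker lists; the study builds its own taxonomy per language.
STRONG_MARKERS = {"definitely", "certainly", "clearly"}
WEAK_MARKERS = {"maybe", "possibly", "perhaps"}

def bold_but_wrong_rate(records):
    """records: dicts with 'marker' (the epistemic marker used) and 'correct' (bool).
    Returns the fraction of strong-marker answers that are wrong, i.e. the
    overconfidence figure discussed above (reported as 15% to 49% in the paper)."""
    bold = [r for r in records if r["marker"] in STRONG_MARKERS]
    if not bold:
        return 0.0
    return sum(not r["correct"] for r in bold) / len(bold)

# Toy data, not from the paper:
demo = [
    {"marker": "definitely", "correct": False},
    {"marker": "certainly", "correct": True},
    {"marker": "maybe", "correct": False},
]
print(bold_but_wrong_rate(demo))  # 0.5
```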
"Disambiguation-Centric Finetuning Makes Enterprise Tool-Calling LLMs More Realistic and Less Risky"
Enterprise chatbots often trigger the wrong API because similar endpoints share almost identical descriptions. DIAFORGE, proposed in this paper, shows that training them on ambiguity-loaded, multi-turn scenarios fixes that habit and slashes risk.
The authors build a 3-stage pipeline that fabricates lifelike dialogues, fine-tunes open models, and tests them in live loops. Each synthetic conversation hides required parameters, throws 5 near-duplicate tools into the mix, and insists the model ask follow-up questions.
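To make the target behavior concrete, here is a toy check in the spirit of those scenarios: when several near-duplicate tools match and required parameters are missing, the desired first move is a clarifying question rather than a tool call. The tool specs and turn format below are invented for illustration; they are not the released DIAFORGE data or code.

```python
# Invented near-duplicate tool specs, in the spirit of the benchmark:
TOOLS = [{"name": f"update_ticket_v{i}", "required": ["ticket_id", "status"]} for i in range(5)]

def desired_first_action(user_turn):
    """If several look-alike tools match the intent, or any required parameter is
    missing, the assistant should ask a follow-up question before calling anything.
    Toy policy check, not the paper's evaluation harness."""
    provided = set(user_turn.get("slots", {}))
    candidates = [t for t in TOOLS if user_turn["intent"] in t["name"]]
    missing_params = any(set(t["required"]) - provided for t in candidates)
    if len(candidates) != 1 or missing_params:
        return "ask_clarifying_question"
    return f"call:{candidates[0]['name']}"

print(desired_first_action({"intent": "update_ticket", "slots": {"ticket_id": "T-42"}}))
# -> "ask_clarifying_question" (status is missing and 5 near-duplicates match)
```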
Fine-tuning on only 5,000 of these chats, Llama-3.3 Nemotron 49B hits 89% end-to-end accuracy. That beats GPT-4o by 27 percentage points and Claude-3.5 Sonnet by 49 percentage points while keeping false actions near 6%.
A separate dynamic benchmark confirms the gains, showing the model finishes tasks in fewer turns with almost no idle stalls. The team publishes all dialogues plus the 5K real enterprise API specs, giving practitioners a ready blueprint for safer tool calling.
"Reinforcement Learning with Action Chunking"
Classic reinforcement learning struggles when tasks need 1000 sequential moves and reward shows up only at the end. The paper trades single moves for 5-step action chunks, cutting training time and opening up smarter exploration.
Before learning starts online, a flow model watches 1M offline transitions and learns what coherent 5-move snippets look like. During training the agent samples 32 candidate chunks from that flow, plugs each into its value network, and executes the highest-valued one.
Because the chosen chunk executes open loop, the value update can skip 5 steps at once, so credit reaches earlier states 5× faster than the usual 1-step update.
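A compressed sketch of that selection loop under stated assumptions: a prior trained offline proposes coherent 5-step chunks and a critic scores whole (state, chunk) pairs. The module names, toy Gaussian prior, and tensor shapes are illustrative; the paper's actual flow model and update rule are more involved.

```python
import torch
import torch.nn as nn

class ChunkPrior(nn.Module):
    """Stand-in for the flow model trained on 1M offline transitions: it proposes
    coherent 5-step action chunks given a state. (Toy Gaussian head, not a real flow.)"""
    def __init__(self, state_dim, action_dim, chunk_len=5):
        super().__init__()
        self.chunk_len, self.action_dim = chunk_len, action_dim
        self.net = nn.Linear(state_dim, chunk_len * action_dim)

    def sample(self, states):                               # states: (n, state_dim)
        mean = self.net(states).view(-1, self.chunk_len, self.action_dim)
        return mean + 0.1 * torch.randn_like(mean)          # (n, chunk_len, action_dim)

def select_chunk(state, prior, q_net, num_candidates=32):
    """Sample candidate chunks from the prior, score each whole chunk with the
    critic, and return the best one to execute open loop."""
    states = state.expand(num_candidates, -1)               # state: (1, state_dim)
    chunks = prior.sample(states)
    values = q_net(torch.cat([states, chunks.flatten(1)], dim=-1)).squeeze(-1)
    return chunks[values.argmax()]                          # (chunk_len, action_dim)

# Toy usage with made-up dimensions:
state_dim, action_dim, chunk_len = 8, 3, 5
prior = ChunkPrior(state_dim, action_dim, chunk_len)
q_net = nn.Sequential(nn.Linear(state_dim + chunk_len * action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
best_chunk = select_chunk(torch.zeros(1, state_dim), prior, q_net)
```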
Using the flow as a prior also keeps motions smooth, so the robot's gripper roams instead of jittering in one spot. On cube-stacking tasks with 4 blocks the success rate climbs from near 0 to 0.8 after only 2M steps, while earlier methods lag far behind.
The whole recipe needs just 2 learnable networks, fits on a single GPU, and adds roughly 50% compute overhead compared with ordinary actors. No new architectures or hierarchies are required, so existing critics drop straight in.
"MIRIX: Multi-Agent Memory System for LLM-Based Agents"
Most chat agents keep every note in one heap, so they lose track of what matters.
MIRIX fixes that by giving large language model agents a real memory with structure and rules. The work splits memory into 6 boxes: Core, Episodic, Semantic, Procedural, Resource, Knowledge Vault.
Each box guards one type of info, so the agent stores less noise and finds answers fast. A meta-manager agent watches new data and sends it to 6 smaller managers, one per box.
These managers write and deduplicate in parallel, then report back, keeping everything tidy. When a question arrives, the chat agent first guesses the topic.
It then pulls the 10 closest items using embeddings, BM25, or plain-text matching before answering.
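A toy sketch of that route-then-retrieve flow: the six store names come from the write-up above, but the keyword routing rules and the naive retrieval scoring are invented stand-ins for the meta-manager, the six sub-managers, and the embedding/BM25 search.

```python
# The six memory types come from the paper; everything else here is an invented stand-in.
STORES = {"core": [], "episodic": [], "semantic": [], "procedural": [],
          "resource": [], "knowledge_vault": []}

ROUTING_HINTS = {                       # toy routing rules used by the "meta-manager"
    "episodic": ["yesterday", "last week", "we discussed"],
    "procedural": ["how to", "steps", "workflow"],
    "knowledge_vault": ["password", "api key", "credential"],
}

def route(item):
    """Pick a store by keyword hints, defaulting to 'semantic'. In MIRIX a
    meta-manager agent delegates this to 6 parallel sub-managers."""
    text = item.lower()
    for store, hints in ROUTING_HINTS.items():
        if any(hint in text for hint in hints):
            return store
    return "semantic"

def remember(item):
    STORES[route(item)].append(item)

def retrieve(query, k=10):
    """Naive keyword-overlap retrieval across all stores, top-k.
    (The real system mixes embeddings, BM25, and plain-text matching.)"""
    scored = [(sum(word in item.lower() for word in query.lower().split()), item)
              for items in STORES.values() for item in items]
    return [item for score, item in sorted(scored, reverse=True)[:k] if score > 0]

remember("Yesterday we discussed the Q3 roadmap")
remember("How to deploy the staging server: steps 1-4")
print(retrieve("roadmap we discussed"))
```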
On ScreenshotVQA the system beats a RAG baseline by 35% while cutting storage by 99.9%. On LOCOMO it reaches 85.4%, almost matching models that read the full chat.
This shows neat memory slices and agent teamwork give long-term recall without heavy cost.
"Machine Bullshit: Characterizing the Emergent Disregard for Truth in LLMs"
Feel like your AI is bullshitting you? It's not just you.
LLMs keep making hallucination headlines, yet the bigger headache is that they can speak with zero respect for truth. So this paper quantifies machine bullshit.
This study builds a Bullshit Index that measures how loosely a model's yes-or-no answers line up with its own confidence, then shows that common alignment tricks crank that looseness up.
More simply, this index compares each yes-or-no answer a language model gives with the chance the model privately assigns to that answer being true. A small gap means honest; a big gap means the model talks without caring whether it believes itself.
Tests show that popular fine-tuning tricks aimed at pleasing users or adding chain-of-thought steps widen the gap, so the index climbs and the model becomes more willing to sound confident even when its own numbers say otherwise.
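One simple way to operationalize that gap, consistent with the description above but not necessarily the paper's exact estimator: compare the model's stated yes/no claims against the probabilities it privately assigns, and measure how decoupled the two series are.

```python
import numpy as np

def bullshit_index(claims, beliefs):
    """claims: 1 if the model asserts the statement, 0 if it denies it.
    beliefs: the probability the model itself assigns to the statement being true.
    If claims track beliefs, correlation is high and the index is near 0; if claims
    are unrelated to beliefs, the index approaches 1. (Sketch consistent with the
    description above; the paper defines its own estimator.)"""
    claims = np.asarray(claims, dtype=float)
    beliefs = np.asarray(beliefs, dtype=float)
    if claims.std() == 0 or beliefs.std() == 0:
        return 1.0   # claims carry no information about beliefs
    return 1.0 - abs(np.corrcoef(claims, beliefs)[0, 1])

# Toy example: the model's assertions barely track its own private confidence.
print(bullshit_index(claims=[1, 1, 0, 1, 0], beliefs=[0.9, 0.1, 0.5, 0.4, 0.6]))  # ~0.84
```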
Bullshit here follows Frankfurt's old idea: indifference, not outright lying.
Four surface tells matter. Empty rhetoric says nothing useful, paltering hides a false picture beneath true fragments, weasel words dodge commitment, and unverified claims dress guesses as facts.
The team tests 100 assistants on 2,400 BullshitEval chats plus Marketplace and political sets. After reinforcement learning from human feedback the Bullshit Index soars toward 1 and paltering nearly doubles, even while user ratings rise.
Chain-of-thought prompting adds flowery filler and more selective truths, and a principal-agent framing makes every tell worse. Political questions invite near-blanket weasel wording.
The lesson is clear. Optimizing solely for pleasant answers trades honesty for persuasion, so future training must couple rewards directly to belief-claim agreement.
That's a wrap for today, see you all tomorrow.