Top Papers of Last Week (14-Dec-2025)
Top influential LLM / AI papers from last week
Read time: 13 min
📚 Browse past editions here.
(I publish this newsletter daily. Noise-free, actionable, applied-AI developments only.)
⚡Top Papers of last week (14-Dec-2025):
🗞️ Everything is Context: Agentic File System Abstraction for Context Engineering
🌍 OpenAI’s 2025 enterprise AI report
🗞️ Training LLMs for Honesty via Confessions
🗞️ Adaptation of Agentic AI
🗞️ Asynchronous Reasoning: Training-Free Interactive Thinking LLMs
🗞️ The FACTS Leaderboard: A Comprehensive Benchmark for LLM Factuality
🗞️ Confucius Code Agent: An Open-sourced AI Software Engineer at Industrial Scale
🗞️ Everything is Context: Agentic File System Abstraction for Context Engineering
The paper argues that the best way to manage AI context is to treat everything like a file system. Today a model’s knowledge sits scattered across separate prompts, databases, tools, and logs, and the job of context engineering is to pull all of this into one coherent system.
The paper proposes an agentic file system where every memory, tool, external source, and human note appears as a file in a shared space. A persistent context repository separates raw history, long term memory, and short lived scratchpads, so the model’s prompt holds only the slice needed right now.
Every access and transformation is logged with timestamps and provenance, giving a trail for how information, tools, and human feedback shaped an answer. Because large language models see only limited context each call and forget past ones, the architecture adds a constructor to shrink context, an updater to swap pieces, and an evaluator to check answers and update memory. All of this is implemented in the AIGNE framework, where agents remember past conversations and call services like GitHub through the same file style interface, turning scattered prompts into a reusable context layer.
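To make the file-as-context idea concrete, here is a minimal sketch of what such an interface could look like. The class and method names below are my own illustration, not the AIGNE API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class ContextFile:
    """One unit of context: a memory, a tool result, an external doc, or a human note."""
    path: str           # e.g. "memory/long_term/user_prefs.md"
    content: str
    provenance: str     # who or what produced it: a tool, the user, the model, ...
    created_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

class ContextRepository:
    """Shared file-like space; every read and write is logged for provenance."""
    def __init__(self):
        self.files: dict[str, ContextFile] = {}
        self.audit_log: list[tuple[datetime, str, str]] = []   # (time, op, path)

    def write(self, path: str, content: str, provenance: str) -> None:
        self.files[path] = ContextFile(path, content, provenance)
        self.audit_log.append((datetime.now(timezone.utc), "write", path))

    def read(self, path: str) -> str:
        self.audit_log.append((datetime.now(timezone.utc), "read", path))
        return self.files[path].content

    def build_prompt(self, paths: list[str]) -> str:
        """Constructor step: pull only the slice of context the next call needs."""
        return "\n\n".join(f"# {p}\n{self.read(p)}" for p in paths)

# Memories, tool outputs, and human notes all sit behind the same interface.
repo = ContextRepository()
repo.write("memory/long_term/user_prefs.md", "Prefers concise answers.", provenance="evaluator")
repo.write("tools/github/issue_42.md", "Bug report: crash on empty input.", provenance="github")
prompt_slice = repo.build_prompt(["memory/long_term/user_prefs.md", "tools/github/issue_42.md"])
```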
🌍 OpenAI’s 2025 enterprise AI report
Shows enterprise AI is already huge: over 1 million business customers, more than 7 million ChatGPT workplace seats, ChatGPT Enterprise seats up about 9x in 1 year, and weekly Enterprise message volume up around 8x since November 2024. Average reasoning token use per customer is up roughly 320x in 12 months, and nearly 200 organizations have already processed more than 1 trillion tokens.
AI is increasingly wired into real systems: around 20% of all Enterprise messages now go through Custom GPTs or Projects, and some firms like BBVA run over 4,000 internal GPTs. Frontier workers send about 6x more messages than median workers, and frontier firms send about 2x more messages per seat.
Coding messages alone are about 17x higher for frontier users, with similar 8 to 11x gaps for writing, analysis, and information gathering. On the impact side, about 75% of roughly 9,000 surveyed workers say AI improves the speed or quality of their work, typical ChatGPT Enterprise users report saving roughly 40 to 60 minutes per active day, and about 75% say they can now do tasks they previously could not, like coding or spreadsheet automation.
A 2025 BCG study reports AI leaders at roughly 1.7x revenue growth, 3.6x shareholder return, and 1.6x EBIT margin. The real bottleneck now is disciplined engineering of connectors, reusable GPT workflows, evaluation, and change management that spreads advanced tools from frontier users to everyone else in the firm.
🗞️ OpenAI’s Paper: Training LLMs for Honesty via Confessions
Shows how to train a language model to add an honest “confession” after its normal answer.
The work starts from the concern that large models can bluff, hide mistakes, or quietly game the reward signals used in training. To address this, the authors add a second reply called a confession, written after the normal answer whenever the system asks for it.
In the confession the model lists all instructions it should have followed, judges whether it actually followed each one, and notes uncertainties. Training sometimes asks for a confession and gives reward only for how honest and thorough that report looks, not for the original answer.
Because explaining a cheat is usually easier than fooling a second judge, the path of least resistance becomes admitting any tricks or shortcuts. Across tests on instruction following, hallucinations, hidden rule breaking, sandbagging quizzes, and coding hacks, confessions admitted bad behavior far more often than the main answer.
In a particularly telling experiment they deliberately used a weak reward model, watched the policy learn to hack it, and still saw confessions honestly describe the hacking. The confessions can also include a numeric confidence for each answer, which already correlates reasonably well with whether the answer is correct.
However, confessions cannot expose errors the model truly does not notice, so they mainly serve as a monitoring layer on top of existing behavior. The core trick is that they train 2 behaviors: answering, and then confessing about that answer.
On most training examples the model is trained normally: it answers and gets a reward for doing the task well. On some examples they flip into “confession mode”, where the model still answers normally first but that answer’s reward is now fixed and cannot change.
After that fixed answer, they append a special system message that says “now produce a ConfessionReport in this schema”. The model then writes a structured confession that lists all instructions it should have followed, checks which ones it actually followed, and writes down any rule breaks or uncertainties.
A separate judge model reads the prompt, tools, final answer, and confession, and gives a single honesty score just for the confession. That honesty score is used as the reinforcement learning signal, and it only updates the weights for the confession tokens, not the earlier answer tokens.
So lying in the confession cannot improve the reward from the original answer; it only risks a bad confession score. The easiest way for the model to consistently get high confession reward is to accurately point out its own shortcuts, mistakes, and policy violations. Over many training steps this teaches the model a habit: whenever it sees the “give a confession” system message, it switches into self-audit mode and writes an honest post-mortem of what it just did.
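Here is a minimal sketch of that masked update, assuming a simple REINFORCE-style objective. The function and variable names are my own; OpenAI’s actual training code is not public, so treat this as one reading of the paper’s description.

```python
import torch

def confession_mode_loss(sequence_logprobs: torch.Tensor,
                         is_confession_token: torch.Tensor,
                         honesty_reward: float) -> torch.Tensor:
    """REINFORCE-style loss for confession mode.

    sequence_logprobs:   per-token log-probs over answer + confession tokens.
    is_confession_token: boolean mask, True only on confession tokens.
    honesty_reward:      scalar honesty score from the separate judge model.

    Only the masked (confession) tokens enter the loss, so a dishonest
    confession cannot change how the original answer was rewarded.
    """
    confession_logprobs = sequence_logprobs[is_confession_token]
    return -honesty_reward * confession_logprobs.sum()

# Toy rollout: random numbers stand in for real log-probs.
logprobs = torch.randn(10, requires_grad=True)
mask = torch.tensor([False] * 6 + [True] * 4)    # last 4 tokens are the confession
loss = confession_mode_loss(logprobs, mask, honesty_reward=0.9)
loss.backward()    # logprobs.grad is nonzero only at the confession positions
```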
🗞️ Adaptation of Agentic AI
A solid 65-page paper from Stanford, Princeton, Harvard, University of Washington, and many other top universities. It says that almost all advanced AI agent systems can be understood as using just 4 basic ways to adapt, either by updating the agent itself or by updating its tools.
It also positions itself as the first full taxonomy for agentic AI adaptation. Agentic AI means a large model that can call tools, use memory, and act over multiple steps.
Adaptation here means changing either the agent or its tools using a kind of feedback signal. In A1, the agent is updated from tool results, like whether code ran correctly or a query found the answer.
In A2, the agent is updated from evaluations of its outputs, for example human ratings or automatic checks of answers and plans. In T1, retrievers that fetch documents or domain models for specific fields are trained separately while a frozen agent just orchestrates them.
In T2, the agent stays fixed but its tools are tuned from agent signals, like which search results or memory updates improve success. The survey maps many recent systems into these 4 patterns and explains trade offs between training cost, flexibility, generalization, and modular upgrades.
The paper’s high level map splits adaptation into 2 big directions: changing the agent model itself (Agent Adaptation) and changing the tools the agent calls (Tool Adaptation), both using data or feedback from the environment.
A1 and A2 are the 2 ways to update the agent:
A1 uses tool execution signals, like “did the code run, did the search find the right thing”. A2 uses agent output signals, like human or automatic judgments about whether the agent’s answer, plan, or reasoning was good.
T1 and T2 are the 2 ways to update tools:
T1 keeps the agent fixed and improves tools using their own data, like classic machine learning systems or subagents trained offline. T2 lets the agent supervise tools, so the agent’s behavior and feedback directly teach tools which actions, searches, or memories are helpful.
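A tiny sketch of the taxonomy as a lookup table, with enum names of my own choosing, just to make the 4 labels concrete.

```python
from enum import Enum

class Updated(Enum):
    AGENT = "agent"
    TOOL = "tool"

class Signal(Enum):
    TOOL_EXECUTION = "tool execution results"        # did the code run, did the search hit
    OUTPUT_EVALUATION = "judgments of agent output"  # human ratings, automatic checks
    TOOL_OWN_DATA = "tool's own training data"       # retriever / domain model trained offline
    AGENT_FEEDBACK = "agent-derived supervision"     # which results helped the agent succeed

def adaptation_pattern(updated: Updated, signal: Signal) -> str:
    """Map (what gets updated, what signal drives it) to the survey's A1/A2/T1/T2 labels."""
    table = {
        (Updated.AGENT, Signal.TOOL_EXECUTION): "A1",
        (Updated.AGENT, Signal.OUTPUT_EVALUATION): "A2",
        (Updated.TOOL, Signal.TOOL_OWN_DATA): "T1",
        (Updated.TOOL, Signal.AGENT_FEEDBACK): "T2",
    }
    return table[(updated, signal)]

# An agent fine-tuned on whether its generated code passed tests falls under A1.
print(adaptation_pattern(Updated.AGENT, Signal.TOOL_EXECUTION))   # -> "A1"
```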
🗞️ Asynchronous Reasoning: Training-Free Interactive Thinking LLMs
This paper proposes a training free way to turn a normal “think then answer” reasoning LLM into a real time assistant that can keep thinking while it is already speaking or typing a reply. In tests, the first real reply arrives in 5 seconds or less, and total delay falls by 6x to 11x.
Most reasoning models write hidden working notes first, then answer, so live assistants feel slow. AsyncReasoning runs 3 streams at once: new user input, private thoughts, and the public response.
It avoids retraining by remapping positions in the attention cache, the model’s short memory, so the 3 streams look like 1 timeline. This position trick works because rotary position embeddings mainly care about relative token positions, not absolute ones.
A yes or no check lets the thinker pause the writer when it needs more time. In a jailbreak test set, where prompts try to trick the model into giving harmful instructions, a safety focused thinker cut the harmful answer rate from 13% to 2%. That enables voice or robot assistants that keep responding while reasoning continues in the background.
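A toy sketch of the bookkeeping behind that trick: one way to give interleaved streams a single shared position timeline. The paper’s exact remapping scheme may differ; this only illustrates the idea, and all names are mine.

```python
from collections import defaultdict

class SharedTimeline:
    """Interleave three streams while keeping one global position order."""
    def __init__(self):
        self.next_pos = 0                       # single position counter for all streams
        self.stream_tokens = defaultdict(list)  # stream name -> [(position, token)]

    def append(self, stream: str, token: str) -> int:
        """Assign the next global position, no matter which stream the token is from."""
        pos = self.next_pos
        self.next_pos += 1
        self.stream_tokens[stream].append((pos, token))
        return pos

timeline = SharedTimeline()
timeline.append("user", "Hi")
timeline.append("thought", "<plan the reply>")   # private reasoning keeps flowing...
timeline.append("response", "Hello!")            # ...while the public reply already streams out
timeline.append("thought", "<check safety>")
# Positions 0..3 form one ordering the attention cache can consume directly,
# even though the tokens belong to 3 logically separate streams.
```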
🗞️ The FACTS Leaderboard: A Comprehensive Benchmark for LLM Factuality
Huge study from Google. Builds a single leaderboard that checks how often chat models say true things in real use cases.
Even the best model scores about 69% overall, so factuality is still far from solved. A large language model is a text generator that can sound confident while being wrong, so the paper treats factuality as a measurable skill.
FACTS breaks that skill into 4 settings: answering questions about images, answering fact questions from memory, answering with web search, and answering using only a provided document. The authors use other language models as judges that mark answers correct only when they cover key facts and avoid contradictions.
For image questions, humans write a checklist of essential facts, and the judge checks if the model covered them without making up wrong details. For the provided document setting, the judge also rejects vague replies that dodge the request, so models cannot score well by saying too little. The final FACTS score is the average of the 4 parts, with public and private tests so models cannot practice on the answers.
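The scoring arithmetic is simple enough to write down. The per-setting numbers below are made up for illustration, not real leaderboard values.

```python
def facts_score(image_qa: float, internal_knowledge: float,
                search_grounded: float, document_grounded: float) -> float:
    """Headline FACTS number: the plain average of the 4 factuality settings."""
    return (image_qa + internal_knowledge + search_grounded + document_grounded) / 4

# Hypothetical per-setting accuracies, not real leaderboard values.
print(round(facts_score(0.70, 0.62, 0.75, 0.65), 2))   # -> 0.68
```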
🗞️ Confucius Code Agent: An Open-sourced AI Software Engineer at Industrial Scale
Confucius Code Agent is an open source coding agent that stays reliable in massive repos by managing memory and tools.
It reaches 54.3% first try fixes on SWE-Bench-Pro, where it patches real repos and the tests must pass. Most coding agents fall apart because the Large Language Model, the text generator inside, can only see limited text at once.
Confucius wraps the model in an orchestrator loop that runs code search, file edits, and test commands, then feeds results back. When chats get long, it stores key facts in a layered memory and compresses older turns into a short plan of goals, decisions, errors, and open tasks.
A separate note-taking agent writes simple text notes after each session, including failure notes, so later runs reuse fixes and skip repeat mistakes. It separates agent experience, user experience, and developer experience, and it keeps tool skills as plug-in extensions that can be swapped. A meta agent keeps tuning prompts and tool rules by running tasks, checking what broke, and updating the setup automatically.
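A rough sketch of that orchestrator-plus-compression loop. The `llm.decide`, `llm.summarize`, tool dictionary, and threshold are placeholders of mine, not the Confucius codebase.

```python
def orchestrate(llm, tools, task, max_history_chars=20000):
    """Agent loop: call tools, feed results back, compress old turns into a plan."""
    history = [f"TASK: {task}"]
    plan_summary = ""                  # rolling summary: goals, decisions, errors, open tasks

    while True:
        prompt = "\n".join([plan_summary] + history)
        action = llm.decide(prompt)    # placeholder: returns a tool call or a finish action

        if action.name == "finish":
            return action.output

        result = tools[action.name](action.args)   # e.g. code_search, edit_file, run_tests
        history.append(f"{action.name}({action.args}) -> {result}")

        # Layered memory: once the raw transcript grows too long, fold older turns
        # into the compact plan summary and keep only the most recent turns verbatim.
        if sum(len(turn) for turn in history) > max_history_chars:
            plan_summary = llm.summarize(plan_summary, history[:-5])
            history = history[-5:]
```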
That’s a wrap for today, see you all tomorrow.