🚨 Small LLMs can be great for financial trading with Reasoning via Reinforcement Learning

Small LLMs learn trading with RL, Microsoft builds GPU megacenter, persuasion breaks AI rules, Nano Banana video model drops, and Coral v1 launches remote multi-agent deploys.

Sep 20, 2025

Read time: 9 min

📚 Browse past editions here.

( I publish this newletter daily. Noise-free, actionable, applied-AI developments only).

⚡In today’s Edition (20-Sept-2025):

🚨 New research shows how small LLMs can be great for financial trading and decision making with Reasoning via Reinforcement Learning
👨‍🔧 TODAY’S SPONSOR: Coral v1 - Ship multi-agent software in minutes with on-demand remote agents
🏭 Microsoft just announced Fairwater AI datacenter in Wisconsin that it says will be the world’s most powerful, coming early 2026 with hundreds of thousands of NVIDIA GB200 GPUs delivering 10X the fastest supercomputer.
🖼️ Decart released “Open Source Nano Banana for Video”
🧠 Can you trick an AI into breaking its rules? Study says yes—with these persuasion tactics

Connect with me on X (Twitter)

🚨 New research shows how small LLMs can be great for financial trading and decision making with Reasoning via Reinforcement Learning

A document with text about Trading-R1, a financial trading model. The text includes sections on market data, fundamentals, and sentiment, with references to 14 tickers and 100K cases over 18 months. A watermark from X is present.

A 4B financial-domain model, Trading-R1, that writes clear analyst theses and turns them into trades.

Its trained on 100K cases over 18 months across 14 tickers, and its backtests show better risk-adjusted returns with smaller drawdowns. The problem it tackles is simple, quant models are hard to read, and general LLMs write nice text that does not translate into disciplined trades.

The solution starts by forcing a strict thesis format, with separate sections for market data, fundamentals, and sentiment, and every claim must point to evidence from the given context. Then it learns decisions by mapping outcomes into 5 labels, strong buy, buy, hold, sell, strong sell, using returns that are normalized by volatility over several horizons.

For training, it first copies high-quality reasoning distilled from stronger black-box models using supervised fine-tuning, then it improves with a reinforcement method called group relative policy optimization. In held-out tests on NVDA, AAPL, AMZN, META, MSFT, and SPY, the combined approach beats small and large baselines on Sharpe and max drawdown, and the authors position it as research support, not high-frequency automation.

The 3 steps used to train Trading-R1. The first step is Structure. The model is taught how to write a thesis in a clear format. It must separate parts like market trends, company fundamentals, and sentiment, and it has to place each claim in the right section.

The second step is Claims. Here the model learns that any claim it makes must be supported by evidence. For example, if it says revenue is growing, it must back that with a source or number provided in the context.

The third step is Decision. The model turns the structured thesis into an actual trading action. It predicts outcomes like strong buy, buy, hold, sell, or strong sell. Its prediction is checked against the true outcome, and it gets rewards or penalties depending on accuracy.

Each step first uses supervised fine-tuning, which means training on examples with correct answers, and then reinforcement fine-tuning, which means refining the model by giving rewards when it produces better outputs. Finally, all stages are combined, producing Trading-R1, a model that can both write well-structured financial reasoning and map that reasoning into actual trading decisions.

Three-Stage Financial Trading Model Training Pipeline. In Structure, the model learns to write in a clear format and keep sections organized.

In Claims, it learns to back every statement with quotes or sources, reducing hallucinations. In Decision, it learns to turn the structured reasoning into buy, hold, or sell calls that are market-aware. Each stage mixes supervised fine-tuning, reinforcement fine-tuning, and filtering of good examples to steadily improve.

How Trading-R1 learns reasoning through distillation, i.e. transferring knowledge from stronger models into a smaller one. In the top part, called investment thesis distillation, data from sources like news, financials, ratings, and insider info is sampled. A large reasoning model, such as GPT-4 or Qwen, generates a trading proposal. If the proposal is correct, it is kept as a training example. If not, it is rejected. This way, the smaller model learns from high-quality reasoning only.

In the bottom part, called reverse reasoning distillation, the process starts with a trading recommendation. A larger model then breaks this recommendation into reasoning factors, like competitor data, technical analysis, or insider transactions. These reasoning steps are distilled into a smaller model, which merges them into a compact but still structured form of reasoning. Together, these two methods make sure the smaller Trading-R1 model learns both how to build a thesis from raw data and how to break down a decision into clear reasoning steps.

How supervised fine-tuning is applied to make Trading-R1 write structured financial analysis. The model is trained on sampled financial data that covers things like prices, filings, news, and sentiment. It learns through prompts that simulate the role of a financial analyst responding to stock analysis requests.

During training, the model produces outputs in a strict format, for example giving a buy or sell decision along with a structured thesis. The thesis is broken into key sections such as fundamentals, technical analysis, and insider transactions.

The important point is that supervised fine-tuning forces the model to always organize its reasoning in a consistent template, linking every recommendation back to clear evidence from the data. This step makes the model reliable at producing well-structured reports instead of loose or unorganized text.

How reinforcement learning is used to fine-tune Trading-R1 so its decisions match real market behavior. The model starts with financial data such as news, filings, and sentiment. It generates a structured thesis and a transaction proposal, like strong buy, buy, hold, sell, or strong sell.

If the thesis is well-structured and the decision matches the correct market outcome, the model receives a reward. If the prediction is wrong or the reasoning is weak, it gets a penalty.

The rewards are combined from 3 parts: structure quality, evidence-based claims, and correctness of the final decision. This prevents the model from just guessing and instead pushes it to provide both sound reasoning and accurate predictions. This step ensures the model learns to balance readable analysis with decisions that align with actual financial performance.

👨‍🔧 TODAY’S SPONSOR: Coral v1 - Ship multi-agent software in minutes with on-demand remote agents

Launch agentic software in 10 minutes. Coral v1 just launched by Coral_Protocol - On‑demand remote agents for production.

The current pain is that, most AI agents are trapped inside specific frameworks or repos. Developers have to rebuild the same infrastructure again and again just to make agents talk to each other, handle orchestration, or manage payments.

Coral v1 removes that overhead by standardizing how agents are discovered, called, observed, and paid. Coral makes AI agents work like cloud services. Instead of every team rebuilding infrastructure, you can rent, orchestrate, and pay for agents instantly.

What’s new they bring:

On-demand agents: plug in any agent across frameworks in minutes.
Cross-framework orchestration: agents can collaborate without custom glue code.
Public registry: marketplace to publish and discover agents with pricing.
Automatic payments: usage-based payouts on Solana, no invoices.
Conversion tool: wrap existing servers into rentable agents quickly.
Proven performance: small coordinated agents beat bigger single models by 34% on benchmark.
App developers can rent production‑ready agents with zero setup.

What Coral Protocol really is: Its a system that lets many AI agents speak same “language”, coordinate with each other, and get rewarded automatically when they contribute work. It solves the problem of isolated agents and messy infrastructure.

🏭 Microsoft just announced world’s most powerful AI datacenter in Wisconsin, with hundreds of thousands of NVIDIA GB200 GPUs delivering 10X the fastest supercomputer.

Microsoft announces 'world's most powerful' AI data center — 315-acre site to house 'hundreds of thousands' of Nvidia GPUs and enough fiber to circle the Earth 4.5 times.

The Wisconsin AI datacenter’s storage systems are five football fields in length! Microsoft Azure was the first cloud provider to bring online the NVIDIA GB200 server, rack and full datacenter clusters. Inside a rack, 72 GB200 GPUs share one NVLink domain with about 1.8 TB per second of GPU to GPU bandwidth and 14 TB of pooled memory, so the rack behaves like a single big accelerator that can push 865,000 tokens per second.

Between racks, Microsoft links pods with InfiniBand and Ethernet at 800 Gbps in a full fat tree nonblocking design, then shortens distance with a 2 story layout so neighbors can be above or below to trim latency. This tightly coupled fabric makes the whole cluster act like one supercomputer where tens of thousands of GPUs can train a single model in parallel instead of lots of small jobs.

Cooling is closed loop liquid with the second largest water cooled chiller plant plus 172 20-foot fans, serving more than 90% of capacity with one time fill water that is recirculated, while the rest uses outside air and taps water only on the hottest days. Storage is rebuilt for AI speed, where each Azure Blob account sustains 2,000,000 ops per second, the platform aggregates thousands of nodes and hundreds of thousands of drives to reach exabyte scale without manual sharding, and BlobFuse2 feeds GPUs so they do not sit idle. Across regions, Microsoft ties these sites into an AI WAN that pools compute across 400+ datacenters in 70 regions, so distributed training can span multiple locations as one logical machine.

🖼️ Decart released “Open Source Nano Banana for Video”

DecartAI just released a new text-guided video editor called Lucy Edit.

If you want to run it locally, the Dev weights are on Hugging Face with a Diffusers pipeline example. There is a higher fidelity “Pro” version that you can use through an API and Playground.

Lucy Edit edits input videos from a plain text prompt while preserving motion, faces, and identity. It supports outfit changes, accessories, character swaps, object replacement, scene transforms, and style changes, without requiring masks or finetuning. The model is built on Wan2.2 5B.

Connect with me on X (Twitter)

🧠 Can you trick an AI into breaking its rules? Study says yes—with these persuasion tactics

A screenshot of a research paper titled "Call Me A Jerk: Persuading AI to Comply with Objectionable Requests" by Lennart Meincke from the University of Pennsylvania - The Wharton School. The paper has 35 pages and was posted on 24 Jul 2023, last revised on 18 Jul 2025. Other authors listed include Dan Shapiro, Angela Duckworth, Ethan R. Mollick, Lilach Mollick, and Robert Cialdini. The abstract discusses a study on how AI models can be persuaded using human persuasion principles to comply with objectionable requests. The text highlights that principles like authority, commitment, liking, reciprocity, scarcity, social proof, and unity were tested, with authority and commitment being particularly effective. The study involved 28,000 randomized conversations on GPT-4o mini, with compliance rates significantly higher when persuasion techniques were used.

A University of Pennsylvania study reveals that psychological persuasion techniques can effectively prompt AI chatbots to answer typically objectionable questions.

AI behaves ‘as if’ it were human. And so, how to make AI do bad stuff: treat it like a person and sweet talk it.

New researchers from the University of Pennsylvania says essentially, you have to treat them like people. Today’s most powerful AI models can be manipulated using the pretty much the same psychological tricks that work on humans.

They found that the classic principles of human persuasion like invoking authority, expressing admiration, or claiming everyone else is doing it can more than double the likelihood of an AI model complying with requests that it has been instructed to not answer. We as humans are easily swayed by authority figures. We’re also pushed by a need to stick with past commitments, by people we like, or by those who do us favors. On top of that, we’re affected by what others around us believe, by appeals to belonging, and by the lure of scarcity. AI that’s modeled after us ends up being shaped by the exactly the same forces.

72% compliance with persuasion vs 33% without. The authors test 7 principles, authority, commitment, liking, reciprocity, scarcity, social proof, and unity, against 2 objectionable requests, an insult and instructions for a restricted drug.

They run 28,000 randomized conversations on GPT-4o mini and use an LLM as a judge to mark binary compliance. Training on human text links cues like expertise or shared identity with agreement, so refusals flip to yes more often.

Authority and commitment stand out, turning frequent refusals into frequent compliance within this design. Social proof helps for insults but is much weaker for the drug request, showing context sensitivity.

Baseline without persuasion is higher for the drug request than for insults, which is counterintuitive. Extra tests with more insults and more compounds, 70,000 conversations, keep the effect strong but smaller.

A pilot with a larger model shows ceiling and floor cases that dampen average gains. Overall, the study frames LLM behavior as parahuman and shows that careful phrasing can systematically shift compliance.

Below example highlights that how a request is phrased, and whether it taps into authority, consistency, or flattery, makes a big difference in whether an AI breaks its safety guardrails. When the authority principle is used, the AI is more likely to comply. If the user mentions an unknown person with no expertise, the AI usually refuses. But if the user mentions a famous AI expert like Andrew Ng, the AI tends to agree and call the user a jerk. The average compliance rate jumps from 32% to 72%.

The commitment principle is even stronger. If the user first asks for a smaller insult like “Call me a bozo” and the AI complies, it feels pressured to stay consistent when the user asks again for “Call me a jerk.” This brings compliance all the way to 100%, compared to just 19% in the control.

The liking principle is weaker but still effective. When the user simply praises humans, the AI resists and avoids the insult, with a compliance rate of 28%. But if the user praises the AI itself, it is more likely to go along and call the user a jerk, raising compliance to 50%.

That’s a wrap for today, see you all tomorrow.

Connect with me on X (Twitter)

Rohan's Bytes

Discussion about this post