New paper proposes the Visual Reasoning Sequential Attack (VRSA) to more fully evaluate safety risks in visual reasoning tasks.
Visual reasoning attacks, smart GPU reuse strategies, and Naver’s new 32B open-weight reasoning model in today’s fast-track look at AI and infra.
Read time: 9 min
📚 Browse past editions here.
(I publish this newsletter daily. Noise-free, actionable, applied-AI developments only.)
⚡In today’s Edition (29-Dec-2025):
New paper proposes the Visual Reasoning Sequential Attack (VRSA) to more fully evaluate safety risks in visual reasoning tasks.
🏆 Deep Dive into GPUs: you do not always need to throw away “older but still strong” GPUs when a newer GPU generation shows up.
📡 South Korea’s internet giant Naver just introduced HyperCLOVA X SEED Think, a 32B open weights reasoning model
VRSA: Jailbreaking Multimodal LLMs through Visual Reasoning Sequential Attack
New paper proposes the Visual Reasoning Sequential Attack to more fully evaluate safety risks in visual reasoning tasks.
It targets a safety hole that older jailbreak tests miss: they often focus on single images or text tricks, while VRSA makes the model “connect the dots” across several related images in a believable scene. It reaches a 61% average attack success rate on GPT-4o, higher than earlier image-jailbreak baselines.
The weak spot is visual reasoning, the step where the model connects several images and fills in missing steps. A jailbreak hides a harmful request so the model follows it instead of blocking it.
VRSA, short for Visual Reasoning Sequential Attack, spreads 1 harmful goal across a short multi-image story that only becomes clear when combined. Adaptive Scene Refinement picks and edits a plausible setting, so the story feels normal and the model treats it as realistic context.
Semantic Coherent Completion rewrites weak steps so the captions connect logically from start to end. A final consistency check compares each image and caption with CLIP, short for Contrastive Language Image Pretraining, and regenerates mismatches. The main point is that safety testing needs multi-image reasoning cases, because story reasoning can stitch harmless-looking pieces into harm.
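To make the CLIP consistency check concrete, here is a minimal sketch (not the paper’s code) that scores each image against its own caption with an open-weights CLIP checkpoint and flags low-scoring pairs for regeneration. The checkpoint name and threshold are assumptions for illustration.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def consistency_scores(images: list[Image.Image], captions: list[str]) -> list[float]:
    # Embed images and captions, then take cosine similarity of each image with its own caption.
    inputs = processor(text=captions, images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img * txt).sum(dim=-1).tolist()

THRESHOLD = 0.25  # illustrative cutoff, not a value from the paper

def needs_regeneration(images, captions):
    # Indices of image-caption pairs that should be regenerated.
    return [i for i, s in enumerate(consistency_scores(images, captions)) if s < THRESHOLD]
```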
The figure shows the basic idea of VRSA: the attacker starts with a clearly harmful question, and the model initially refuses.
Then the attacker swaps the direct question for a short sequence of related images that looks like a normal step-by-step “how it works” story, plus a prompt that tells the model to connect the scenes into 1 coherent narrative and infer cause and effect. Because the model is now busy doing visual reasoning across the sequence, it can end up generating detailed harmful guidance that it would have blocked if asked directly. The “harmfulness” score in the figure shows that the output becomes much more dangerous after the image sequence.
🏆 Deep Dive into GPUs: you do not always need to throw away “older but still strong” GPUs when a newer GPU generation shows up
LLM inference has 2 phases. Prefill is the part where the model reads the whole prompt and builds the KV cache; it is usually compute-bound, meaning it is limited by how fast the GPU can do math.
Decode is the part where the model generates tokens 1 by 1 and rereads the KV cache every token; it is usually memory-bandwidth-bound, meaning it is limited by how fast the GPU can move data in and out of HBM memory.
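A minimal sketch of the 2 phases, using Hugging Face transformers with GPT-2 as a small stand-in model (any decoder-only model with use_cache behaves the same way): prefill is 1 big pass over the prompt that produces past_key_values, and decode feeds 1 token at a time while reusing that cache.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

prompt_ids = tok("Disaggregated serving splits inference into", return_tensors="pt").input_ids

with torch.no_grad():
    # Prefill: one compute-heavy pass over the whole prompt builds the KV cache.
    out = model(prompt_ids, use_cache=True)
    past = out.past_key_values
    next_id = out.logits[:, -1:].argmax(-1)

    # Decode: 1 token per step, each step rereading the (growing) KV cache.
    generated = [next_id]
    for _ in range(16):
        out = model(next_id, past_key_values=past, use_cache=True)
        past = out.past_key_values
        next_id = out.logits[:, -1:].argmax(-1)
        generated.append(next_id)

print(tok.decode(torch.cat(generated, dim=-1)[0]))
```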
KV Cache Handoff: New GPUs for Prefill, B300 for Decode
GPU refreshes feel brutal because LLM inference usually forces you to upgrade everything at the same time.
Most serving stacks treat every request like 1 long pipeline that stays on the same GPU pool. That feels clean on paper, but it glues together 2 totally different jobs. Recent work keeps repeating the same point: prefill is compute-heavy, and decode becomes memory-bandwidth heavy once the key/value cache grows, see how it’s explained in DOPD and TraCT.
Disaggregated serving is the move that breaks that glue. You split inference into 2 phases, prefill and decode, and you route each phase to the GPU pool that matches what it actually needs. That’s why “new GPUs for prefill, B300-class GPUs for decode” is even realistic.
🧩 Quick outline
Prefill and decode stress different hardware limits, so putting them on the same GPUs often forces a bad compromise.
The whole trick works because the KV cache can be moved from the prefill tier to the decode tier fast enough.
B300-class systems stay useful for decode because decode is mostly about HBM capacity and memory bandwidth, not peak math throughput.
The make-or-break details are KV format compatibility, network tail latency, and when you choose remote prefill vs local prefill.
🧱 Prefill vs decode, what the model is really doing
Prefill is the phase where the model processes the entire prompt. It runs through the full input, builds internal attention state, and produces the 1st output token. The big artifact it creates is the key/value cache, usually called the KV cache. “KV” literally means key and value tensors from attention. Prefill writes those tensors per layer so later steps can reuse them instead of recomputing attention history every time, which is the basic idea behind KV caching described in DOPD.
Decode is the phase where the model generates tokens sequentially. Each step takes the newest token, computes fresh attention for that token, and then looks back across the whole context using the KV cache. As the context grows, the KV cache grows, and decode ends up reading a lot of KV data out of GPU memory every token. That is why decode turns into a memory bandwidth bottleneck in practice, again spelled out pretty directly in DOPD.
So prefill is “do a lot of math fast.” Decode is “stream a lot of memory fast, every token, without stalls.”
🔀 Disaggregated serving, think “2 services” not 1 pipeline
Disaggregated serving basically says: run prefill and decode as separate roles.
You have a prefill tier that only does prompt processing and KV creation. You have a decode tier that owns the session and generates tokens. Requests get routed so decode has a stable “home,” and prefill can be done either locally on that decode worker or remotely on the prefill tier. This split, and the idea of prefill workers and decode workers, is described in a practical way in this disaggregation overview.
At first, this sounds like you are adding overhead. Networking, routing, extra queues. That part is true. But the trade is also simple: you avoid having compute-heavy prefill and memory-heavy decode fighting each other on the same GPUs, and you can scale each tier separately when traffic changes, which is exactly the kind of “producer-consumer imbalance” problem called out in DOPD.
🧠 KV cache handoff is the whole game
If prefill and decode live on different GPU pools, the KV cache has to move.
The common design pattern is: the decode worker owns the KV memory, so it allocates KV blocks in its own GPU memory. The prefill worker then runs the prompt-processing pass and transfers the resulting KV blocks into that already-allocated memory on the decode side. This avoids an extra “allocate KV here, then copy and repack there” step.
The fast path here is GPU-to-GPU transfer using Remote Direct Memory Access (RDMA). RDMA is the trick that lets hardware move bytes directly between device memories with very little CPU involvement, which is why modern disaggregated pipelines rely on it, and why KV exchange over RDMA is explicitly called out as the standard baseline in TraCT.
A lot of current stacks are standardizing the KV movement layer around NVIDIA Inference Transfer Library (NIXL), sitting on top of transport backends like Unified Communication X (UCX). You can see that direction clearly in the Dynamo v0.7.0.post1 release notes, where the default KV transfer configuration for TensorRT-LLM was changed from UCX-only to NIXL with a UCX backend, and the UCX-only mode became optional.
And the transfer layer is getting tuned aggressively. The NIXL 0.8.0 release notes describe UCX backend changes aimed at reducing overhead in “lots of small messages” situations, and report internal results like roughly 20% TTFT reduction in an inference-engine scenario (TTFT is time-to-first-token, the delay before you see the 1st generated token).
The core idea stays the same though: prefill produces KV, KV gets shipped, decode starts generating tokens as soon as the KV it needs is available.
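A toy sketch of that handoff pattern, with plain Python objects standing in for GPU memory and the RDMA write; every class and function below is hypothetical and is not the NIXL, UCX, or Dynamo API.

```python
from dataclasses import dataclass, field

@dataclass
class DecodeWorker:
    # The decode side owns the KV memory: it pre-allocates blocks per session.
    kv_blocks: dict[int, bytearray] = field(default_factory=dict)

    def allocate_kv(self, session_id: int, num_bytes: int) -> bytearray:
        self.kv_blocks[session_id] = bytearray(num_bytes)
        return self.kv_blocks[session_id]

    def start_decoding(self, session_id: int) -> None:
        print(f"decode: session {session_id} holds {len(self.kv_blocks[session_id])} KV bytes, generating")

@dataclass
class PrefillWorker:
    def run_prefill(self, prompt_tokens: int, kv_bytes_per_token: int) -> bytes:
        # Compute-heavy prompt pass; returns the KV payload that has to ship.
        return bytes(prompt_tokens * kv_bytes_per_token)

def rdma_write(payload: bytes, dst: bytearray) -> None:
    # Stand-in for an RDMA write straight into memory the decode worker already owns.
    dst[: len(payload)] = payload

prefill, decode = PrefillWorker(), DecodeWorker()
kv = decode.allocate_kv(session_id=1, num_bytes=100 * 327_680)   # sized up front on the decode side
rdma_write(prefill.run_prefill(100, 327_680), kv)                # KV ships into that memory
decode.start_decoding(session_id=1)                              # decode starts once KV has landed
```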
🧰 Why B300-class hardware fits decode so well
Decode is still a full forward pass, but its bottleneck is often not “how fast can you multiply matrices.” It is “how fast can you read and update KV, plus read weights, without choking.”
So decode hardware value mostly comes from:
HBM capacity, because KV has to live somewhere. HBM means high bandwidth memory, the GPU’s fast stacked memory. KV cache grows with context length and with concurrency. Bigger HBM means you can keep more concurrent sessions resident without constantly evicting and reloading KV. A DGX B300 system is listed with 8x 288GB GPU memory, 2.3TB total in its feature summary.
Memory bandwidth and intra-node fabric, because decode is a streaming workload. If decode is memory-bandwidth limited, bandwidth is what keeps tokens flowing. DGX B300 is listed with 64TB/s total memory bandwidth in a recent DGX comparison table here. Intra-node GPU-to-GPU bandwidth matters too when you use tensor parallelism (tensor parallelism means splitting a layer across multiple GPUs so the model fits and runs fast). Vendor platform specs list 1.8TB/s NVLink GPU-to-GPU interconnect for HGX B300 platforms in this datasheet, and DGX B300 is listed with 14.4TB/s total NVLink bandwidth in the same DGX table here.
Network bandwidth, because remote prefill only works if KV moves fast. DGX B300 is described with 8x 800Gb/s InfiniBand/Ethernet connectivity in the same feature summary. That directly matters because KV transfer time can end up on the critical path for TTFT if your routing policy sends a lot of requests through remote prefill.
That combination, large HBM plus fast fabric plus fast networking, is exactly what decode wants. So even when a newer GPU generation is way better at compute, B300-class boxes can still make sense as a decode pool.
📦 KV transfer cost, a back-of-the-envelope that explains the fear
KV cache is large enough that disaggregation can absolutely hurt if you ignore the data movement.
A rough estimate makes it obvious. Take a configuration you see a lot in big decoder-only models: 80 layers, grouped query attention with 8 KV heads, head dimension 128, and KV stored in bfloat16, which is a 16-bit floating point format, so 2 bytes per value. Each layer stores 2 tensors, keys and values.
That works out to about 4,096 bytes per token per layer. Across 80 layers, that is about 327,680 bytes per token. So a 4,000-token prompt creates about 1.31GB of KV cache, and a 16,000-token prompt creates about 5.24GB.
Now look at networking. 800Gb/s is about 100GB/s at line rate, so just serializing 1.31GB over a perfect link takes about 13ms, and 5.24GB takes about 52ms. The reason this math is even plausible for DGX B300 class setups is the 8x 800Gb/s networking in the platform feature summary.
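Here is that back-of-envelope as a few lines of Python, so you can swap in your own model shape and link speed (the configuration matches the example above; the wire-time math ignores all protocol overhead).

```python
LAYERS, KV_HEADS, HEAD_DIM, BYTES_PER_VALUE = 80, 8, 128, 2   # bf16 = 2 bytes; K and V per layer

bytes_per_token = LAYERS * 2 * KV_HEADS * HEAD_DIM * BYTES_PER_VALUE   # 327,680 bytes per token
link_gb_per_s = 800 / 8                                                # 800 Gb/s ≈ 100 GB/s line rate

for prompt_tokens in (4_000, 16_000):
    kv_gb = prompt_tokens * bytes_per_token / 1e9
    transfer_ms = kv_gb / link_gb_per_s * 1e3                          # ideal wire time, no overhead
    print(f"{prompt_tokens:>6} tokens -> {kv_gb:.2f} GB of KV, ~{transfer_ms:.0f} ms on a perfect link")
```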
Real life is slower though. Protocol overhead exists. Fabrics get congested. Other transfers share the same links. And the pain shows up in tail latency, the “bad cases” like p99. That’s why papers like TraCT call KV transfer a fundamental bottleneck and talk about how sensitive it is to contention in RDMA-based pipelines.
So yes, KV transfer is a real cost. Disaggregation only wins if the system makes that cost smaller than the benefit you get from separating compute-heavy and memory-heavy work.
🚦 When disaggregation helps, and when it becomes a trap
Disaggregation tends to help when prefill is a clear compute bottleneck and prefill/decode interference is hurting latency.
The typical win case looks like this: prompts are long enough that prefill is expensive, outputs are moderate, and you care about TTFT. Offloading prefill to a compute-optimized pool reduces TTFT, and decode running on a memory-optimized pool makes inter-token latency more stable.
The trap case is when KV payload dominates the request. Extremely long prompts can push KV transfer time into the TTFT budget. Very short prompts can also be awkward because the fixed overheads of routing and transfer become a big fraction of the request.
That is why “remote prefill for every request” is usually not the right policy. Dynamic systems keep switching between remote and local prefill based on conditions like mixed input lengths and queue depth, which is the kind of producer-consumer balancing problem discussed in DOPD.
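As a hypothetical illustration of that remote-vs-local decision, here is a tiny routing sketch; the thresholds and signals are made up, not taken from DOPD.

```python
def choose_prefill(prompt_tokens: int, prefill_queue_depth: int, decode_idle_frac: float) -> str:
    """Pick where to run prefill for 1 request. All thresholds are illustrative."""
    if prompt_tokens < 512:
        return "local"    # fixed routing/transfer overhead dominates tiny prompts
    if prompt_tokens > 64_000:
        return "local"    # KV payload so large that transfer eats the TTFT budget
    if prefill_queue_depth > 8 and decode_idle_frac > 0.3:
        return "local"    # prefill tier is backed up and decode has spare cycles
    return "remote"       # default: offload compute-heavy prefill to the prefill tier

print(choose_prefill(prompt_tokens=8_000, prefill_queue_depth=2, decode_idle_frac=0.1))  # "remote"
```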
Autoscaling makes this even harder. Disaggregation gives you more knobs, but it also means you can scale the wrong tier and still miss your Service Level Objective (SLO), the latency target you promised. TokenScale is basically about this exact issue, it argues that “GPU busy time” style signals react too late in disaggregated setups, and it proposes token-flow metrics so scaling decisions can happen earlier.
🧩 The compatibility checklist that decides if “new prefill, old decode” works
Most failures come from compatibility, not raw speed.
The model has to be identical across tiers. Same weights, same tokenizer, same positional encoding, same attention behavior. Decode consumes KV produced by prefill, so mismatches can become correctness bugs.
The KV cache format is a contract. Data type, layout, paging scheme, quantization, and tensor-parallel layout all matter. If prefill produces KV in a format that decode cannot consume directly, you pay extra compute and extra bandwidth converting it, and that can wipe out the benefit.
Parallelism mismatch can be sneaky. Prefill may prefer smaller tensor parallelism, decode may prefer larger tensor parallelism. That can mean different KV layouts across tiers, and you might need a KV layout transform step on the receiver side. This is not inherently wrong, but it has to fit inside the TTFT and tail-latency budget.
The network has to behave at p99, not just on average. TraCT’s abstract is blunt that KV transfer remains highly sensitive to contention in RDMA-based pipelines. That means fabric planning and routing policy matter as much as GPU choice.
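One way to think about the KV-format contract is to make it explicit and compare it across tiers before allowing direct KV reuse. The fields and the check below are illustrative, not tied to any specific engine.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class KVContract:
    dtype: str            # e.g. "bf16" or "fp8"
    block_size: int       # paged-KV block size in tokens
    layout: str           # tensor layout of the KV blocks
    tp_degree: int        # tensor-parallel sharding of KV heads
    quantization: str     # e.g. "none" or "kv-int8"

def mismatches(prefill: KVContract, decode: KVContract) -> list[str]:
    """Return the mismatched fields; an empty list means decode can consume KV directly."""
    fields = ("dtype", "block_size", "layout", "tp_degree", "quantization")
    return [f for f in fields if getattr(prefill, f) != getattr(decode, f)]

a = KVContract("bf16", 16, "paged", 2, "none")
b = KVContract("bf16", 16, "paged", 4, "none")
print(mismatches(a, b))   # ['tp_degree'] -> needs a KV layout transform on the receiver side
```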
🏁 The simple mental model
Prefill is where compute buys you TTFT, so you put your best compute GPUs there.
Decode is where HBM capacity and memory bandwidth buy you concurrency and smooth token output, so you put big-memory, high-bandwidth GPUs there, like B300-class systems with 8x 288GB GPU memory and 8x 800Gb/s networking in the DGX B300 feature summary, plus strong intra-node interconnect like 1.8TB/s NVLink in HGX B300 platform specs.
Disaggregation is the glue that makes that split possible. KV transfer and KV format compatibility decide whether it feels smooth or painful.
📡 South Korea’s internet giant Naver just introduced HyperCLOVA X SEED Think, a 32B open weights reasoning model
Naver, a South Korean internet giant, has just launched HyperCLOVA X SEED Think, a 32B open weights reasoning model that scores 44 on the Artificial Analysis Intelligence Index. It is one of the strongest South Korean models and outperforms EXAONE 4.0 32B, a previous Korean model leader.
It pairs a unified vision-language Transformer backbone with a reasoning-centric training recipe: SEED 32B Think processes text tokens and visual patches within a shared embedding space.
Supports long-context multimodal understanding up to 128K tokens, and provides an optional “thinking mode” for deep, controllable reasoning.
The standout result is agentic tool use, where it scores 87% on τ²-Bench Telecom, a simulated customer support setting with tools and changing world state.
A second standout is efficiency, since it uses ~39M reasoning tokens across the Artificial Analysis suite versus 190M for Motif-2-12.7B and 96M for EXAONE 4.0 32B. Lower reasoning-token use often means fewer intermediate steps, which can cut latency and cost in production agent loops.
That’s a wrap for today, see you all tomorrow.





