<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Rohan's Bytes: AI Paper Explained]]></title><description><![CDATA[Here is the collection of all my explanations of recent top AI papers.]]></description><link>https://www.rohan-paul.com/s/ai-paper-podcasts</link><image><url>https://substackcdn.com/image/fetch/$s_!q7Ea!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F20030c9b-4180-453e-9ef3-f42abd8f9de5_1200x1200.png</url><title>Rohan&apos;s Bytes: AI Paper Explained</title><link>https://www.rohan-paul.com/s/ai-paper-podcasts</link></image><generator>Substack</generator><lastBuildDate>Fri, 01 May 2026 16:31:10 GMT</lastBuildDate><atom:link href="https://www.rohan-paul.com/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Rohan Paul]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[rohanpaul@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[rohanpaul@substack.com]]></itunes:email><itunes:name><![CDATA[Rohan Paul]]></itunes:name></itunes:owner><itunes:author><![CDATA[Rohan Paul]]></itunes:author><googleplay:owner><![CDATA[rohanpaul@substack.com]]></googleplay:owner><googleplay:email><![CDATA[rohanpaul@substack.com]]></googleplay:email><googleplay:author><![CDATA[Rohan Paul]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Mastering Long-Form Reasoning with DAPO]]></title><description><![CDATA[The paper proposes the Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO) algorithm and fully open-sources a state-of-the-art large-scale RL 
system]]></description><link>https://www.rohan-paul.com/p/mastering-long-form-reasoning-with</link><guid isPermaLink="false">https://www.rohan-paul.com/p/mastering-long-form-reasoning-with</guid><dc:creator><![CDATA[Rohan Paul]]></dc:creator><pubDate>Fri, 28 Mar 2025 23:41:16 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!tOmQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa877ca71-a1ff-49ca-b4e9-78869cb731e6_898x390.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!tOmQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa877ca71-a1ff-49ca-b4e9-78869cb731e6_898x390.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!tOmQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa877ca71-a1ff-49ca-b4e9-78869cb731e6_898x390.png 424w, https://substackcdn.com/image/fetch/$s_!tOmQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa877ca71-a1ff-49ca-b4e9-78869cb731e6_898x390.png 848w, https://substackcdn.com/image/fetch/$s_!tOmQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa877ca71-a1ff-49ca-b4e9-78869cb731e6_898x390.png 1272w, https://substackcdn.com/image/fetch/$s_!tOmQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa877ca71-a1ff-49ca-b4e9-78869cb731e6_898x390.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!tOmQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa877ca71-a1ff-49ca-b4e9-78869cb731e6_898x390.png" width="898" height="390" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a877ca71-a1ff-49ca-b4e9-78869cb731e6_898x390.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:390,&quot;width&quot;:898,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:75704,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://www.rohan-paul.com/i/160102641?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa877ca71-a1ff-49ca-b4e9-78869cb731e6_898x390.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!tOmQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa877ca71-a1ff-49ca-b4e9-78869cb731e6_898x390.png 424w, https://substackcdn.com/image/fetch/$s_!tOmQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa877ca71-a1ff-49ca-b4e9-78869cb731e6_898x390.png 848w, https://substackcdn.com/image/fetch/$s_!tOmQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa877ca71-a1ff-49ca-b4e9-78869cb731e6_898x390.png 1272w, https://substackcdn.com/image/fetch/$s_!tOmQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa877ca71-a1ff-49ca-b4e9-78869cb731e6_898x390.png 1456w" sizes="100vw" 
fetchpriority="high"></picture></div></a></figure></div><p>The <strong><a href="https://arxiv.org/pdf/2503.14476">paper proposes</a></strong> the Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO) algorithm and fully open-sources a state-of-the-art large-scale RL system.</p><h4>Problem and Motivation</h4><p>Large Language Models often fail to maintain stable performance when fine-tuned with Reinforcement Learning methods for long chain-of-thought reasoning. DeepSeek-R1 introduced large-scale RL to boost reasoning but kept important details private, making replication difficult. DAPO tackles the same challenge with full openness. 
It closes the performance gap by introducing <strong>Decoupled Clip and Dynamic sAmpling Policy Optimization (DAPO)</strong>. It achieves 50 points on the AIME 2024 math benchmark using the Qwen2.5-32B base model and outperforms the previous DeepSeek-R1 result at the 32B scale while using only half the training steps.</p><h4>Core Algorithm</h4><p>DAPO builds on GRPO (Group Relative Policy Optimization) but removes an unneeded KL divergence term. It also uses a <strong>rule-based reward</strong>: if the answer is correct, reward is 1; otherwise, reward is -1. The advantage function compares each sampled response against others. The policy is updated in smaller, safer steps to stabilize training.</p><h4>Background on GRPO</h4><p><strong>GRPO (Group Relative Policy Optimization)</strong> is a reinforcement learning method that handles multiple responses (or &#8220;action samples&#8221;) per prompt. It calculates the <strong>advantage</strong> of each response relative to other responses sampled under the same prompt. The model then updates its parameters based on these advantage values, ensuring that high-reward responses become more probable and low-reward ones become less probable.</p><h4>The KL Divergence Term</h4><p>In some RL methods (e.g., PPO), a <strong>KL divergence</strong> term is added during optimization. The goal is to keep the updated model from drifting too far from a reference model. This prevents the policy from &#8220;over-updating&#8221; and becoming unstable.</p><h4>Why DAPO Removes KL</h4><p><strong>DAPO</strong> inherits most of GRPO&#8217;s mechanics&#8212;sampling multiple responses, computing advantages, clipping updates&#8212;but <strong>removes the KL component</strong>. This is because DAPO targets <strong>long chain-of-thought reasoning</strong>, where the model&#8217;s output distribution often shifts significantly from the original (pretrained) distribution. 
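</p><p>As a quick sketch of the mechanics above, the rule-based reward and the group-relative advantage can be written out in a few lines. This is an illustration, not the paper&#8217;s code; the function names and the mean/standard-deviation normalization inside each group are assumptions:</p>

```python
import statistics

def rule_based_reward(answer: str, reference: str) -> float:
    # Rule-based reward: 1 for a correct final answer, -1 otherwise.
    return 1.0 if answer.strip() == reference.strip() else -1.0

def group_advantages(rewards: list[float]) -> list[float]:
    # Group-relative advantage: score each sampled response against the
    # other samples drawn for the same prompt.
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # Every response earned the same reward: no learning signal.
        return [0.0] * len(rewards)
    return [(r - mean) / std for r in rewards]
```

<p>A group with two correct and two wrong answers yields advantages of +1 and -1, while an all-correct group yields zeros, a case the Dynamic Sampling technique below is designed to remove.</p><p>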
Constraining it with a KL penalty would keep the model from exploring these larger shifts. DAPO lets the model adapt more freely without being forced to stay close to the original behavior.</p><h4>Impact</h4><ol><li><p><strong>Fewer Constraints on Exploration</strong>: Without KL, DAPO does not push the model back to a &#8220;safe zone&#8221; near the pretrained distribution.</p></li><li><p><strong>Better for Long-Form Outputs</strong>: Long chain-of-thought solutions require tokens that might be unlikely under the original model. KL could suppress them and reduce reasoning improvements.</p></li><li><p><strong>Maintains Stability Elsewhere</strong>: DAPO still uses clipping (through techniques like Clip-Higher) to keep updates controlled. This keeps training stable without a KL penalty.</p></li></ol><p>Overall, <strong>DAPO builds on GRPO</strong> but <strong>drops the KL divergence</strong> to allow deeper updates in chain-of-thought RL, avoiding the reference-model tether and enabling more extensive exploration.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!giYs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faaec7a20-dfe1-430a-8a8d-674233be8d2e_896x478.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!giYs!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faaec7a20-dfe1-430a-8a8d-674233be8d2e_896x478.png 424w, https://substackcdn.com/image/fetch/$s_!giYs!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faaec7a20-dfe1-430a-8a8d-674233be8d2e_896x478.png 848w, 
https://substackcdn.com/image/fetch/$s_!giYs!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faaec7a20-dfe1-430a-8a8d-674233be8d2e_896x478.png 1272w, https://substackcdn.com/image/fetch/$s_!giYs!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faaec7a20-dfe1-430a-8a8d-674233be8d2e_896x478.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!giYs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faaec7a20-dfe1-430a-8a8d-674233be8d2e_896x478.png" width="896" height="478" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/aaec7a20-dfe1-430a-8a8d-674233be8d2e_896x478.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:478,&quot;width&quot;:896,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:118771,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.rohan-paul.com/i/160102641?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faaec7a20-dfe1-430a-8a8d-674233be8d2e_896x478.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!giYs!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faaec7a20-dfe1-430a-8a8d-674233be8d2e_896x478.png 424w, https://substackcdn.com/image/fetch/$s_!giYs!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faaec7a20-dfe1-430a-8a8d-674233be8d2e_896x478.png 848w, 
https://substackcdn.com/image/fetch/$s_!giYs!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faaec7a20-dfe1-430a-8a8d-674233be8d2e_896x478.png 1272w, https://substackcdn.com/image/fetch/$s_!giYs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faaec7a20-dfe1-430a-8a8d-674233be8d2e_896x478.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><h4>Four Key Techniques of DAPO</h4><ul><li><p><strong>Clip-Higher</strong></p><div class="captioned-image-container"><figure><a 
class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!EePC!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79b73be7-3317-4952-9c86-b9ed98422e82_871x253.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!EePC!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79b73be7-3317-4952-9c86-b9ed98422e82_871x253.png 424w, https://substackcdn.com/image/fetch/$s_!EePC!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79b73be7-3317-4952-9c86-b9ed98422e82_871x253.png 848w, https://substackcdn.com/image/fetch/$s_!EePC!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79b73be7-3317-4952-9c86-b9ed98422e82_871x253.png 1272w, https://substackcdn.com/image/fetch/$s_!EePC!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79b73be7-3317-4952-9c86-b9ed98422e82_871x253.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!EePC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79b73be7-3317-4952-9c86-b9ed98422e82_871x253.png" width="871" height="253" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/79b73be7-3317-4952-9c86-b9ed98422e82_871x253.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:253,&quot;width&quot;:871,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:74896,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.rohan-paul.com/i/160102641?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79b73be7-3317-4952-9c86-b9ed98422e82_871x253.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!EePC!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79b73be7-3317-4952-9c86-b9ed98422e82_871x253.png 424w, https://substackcdn.com/image/fetch/$s_!EePC!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79b73be7-3317-4952-9c86-b9ed98422e82_871x253.png 848w, https://substackcdn.com/image/fetch/$s_!EePC!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79b73be7-3317-4952-9c86-b9ed98422e82_871x253.png 1272w, https://substackcdn.com/image/fetch/$s_!EePC!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F79b73be7-3317-4952-9c86-b9ed98422e82_871x253.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Traditional RL clipping sets a single clipping range around 1, for example [0.8, 1.2], to keep the new probability (p_new) close to the old probability (p_old). This prevents overly large updates.</p><p><strong>Why standard clipping is restrictive</strong><br>When the clip range is small (say 0.2), a low-probability token with p_old = 0.01 can only rise to p_new = 0.012. That increase may be too small to matter. This holds back tokens that might become important but started with low probability. Those tokens stay suppressed, limiting the model&#8217;s exploration. Over time, the model overconfidently focuses on a narrow set of responses, a failure mode called <strong>entropy collapse</strong>.</p><p><strong>What Clip-Higher changes</strong><br>Clip-Higher separates the lower bound and upper bound. 
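</p><p>As a toy illustration of the decoupled bounds (using the example values from this section, epsilon_low = 0.2 and epsilon_high = 0.28; the function itself is a sketch, not the paper&#8217;s implementation):</p>

```python
def clipped_ratio(p_new: float, p_old: float,
                  eps_low: float = 0.2, eps_high: float = 0.28) -> float:
    # Clip-Higher: clip the importance ratio p_new / p_old to the
    # asymmetric range [1 - eps_low, 1 + eps_high] instead of a
    # symmetric PPO-style range.
    ratio = p_new / p_old
    return max(1.0 - eps_low, min(1.0 + eps_high, ratio))
```

<p>With p_old = 0.01, a symmetric 0.2 clip caps one-step growth at 0.01 * 1.2 = 0.012, while the raised ceiling allows 0.01 * 1.28 = 0.0128, giving rare tokens more room to gain probability.</p><p>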
The lower bound remains near 1 - epsilon_low (for example 0.8), but the upper bound is set higher (for example 1 + epsilon_high = 1.28). This higher ceiling helps low-probability tokens climb further within a single training step.</p><p><strong>Why it matters</strong></p><ul><li><p><strong>More exploration</strong>: Tokens once deemed improbable can now gain enough probability mass to appear more often.</p></li><li><p><strong>Less collapse</strong>: The model avoids locking onto a few tokens. Entropy remains healthier, so the system explores more reasoning paths.</p></li><li><p><strong>Better diversity</strong>: A richer token distribution emerges, leading to better coverage of potential solutions.</p></li></ul><p>By decoupling the clip bounds, Clip-Higher prevents the model from stagnating on a narrow set of responses, allowing it to discover better reasoning paths and maintain diversity.</p></li><li><p><strong>Dynamic Sampling</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wBzj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa0fafc2-dbe9-4a8a-9fa7-0379449aae68_877x437.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wBzj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa0fafc2-dbe9-4a8a-9fa7-0379449aae68_877x437.png 424w, https://substackcdn.com/image/fetch/$s_!wBzj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa0fafc2-dbe9-4a8a-9fa7-0379449aae68_877x437.png 848w, 
https://substackcdn.com/image/fetch/$s_!wBzj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa0fafc2-dbe9-4a8a-9fa7-0379449aae68_877x437.png 1272w, https://substackcdn.com/image/fetch/$s_!wBzj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa0fafc2-dbe9-4a8a-9fa7-0379449aae68_877x437.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!wBzj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa0fafc2-dbe9-4a8a-9fa7-0379449aae68_877x437.png" width="877" height="437" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/aa0fafc2-dbe9-4a8a-9fa7-0379449aae68_877x437.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:437,&quot;width&quot;:877,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:103320,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.rohan-paul.com/i/160102641?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa0fafc2-dbe9-4a8a-9fa7-0379449aae68_877x437.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!wBzj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa0fafc2-dbe9-4a8a-9fa7-0379449aae68_877x437.png 424w, https://substackcdn.com/image/fetch/$s_!wBzj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa0fafc2-dbe9-4a8a-9fa7-0379449aae68_877x437.png 848w, 
https://substackcdn.com/image/fetch/$s_!wBzj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa0fafc2-dbe9-4a8a-9fa7-0379449aae68_877x437.png 1272w, https://substackcdn.com/image/fetch/$s_!wBzj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Faa0fafc2-dbe9-4a8a-9fa7-0379449aae68_877x437.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>When every sampled response for a prompt is correct (or every one is wrong), the advantage for that prompt becomes zero. 
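</p><p>Such zero-advantage groups can be screened out while a batch is assembled. A minimal sketch of that filtering (names and the reward-disagreement test are illustrative assumptions, not the released code):</p>

```python
def has_signal(rewards: list[float]) -> bool:
    # A prompt is useful only if its sampled responses disagree:
    # all-correct or all-wrong groups have zero group-relative advantage.
    return len(set(rewards)) > 1

def build_batch(prompt_stream, sample_fn, batch_size: int):
    # Dynamic sampling: keep drawing prompts, keeping only those whose
    # sampled rewards carry a nonzero training signal.
    batch = []
    for prompt in prompt_stream:
        rewards = sample_fn(prompt)
        if has_signal(rewards):
            batch.append((prompt, rewards))
            if len(batch) == batch_size:
                break
    return batch
```

<p>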
This stalls training for those samples. DAPO filters out these zero-signal samples by swapping them with new prompts until each batch contains prompts whose responses yield nonzero advantage. This maintains strong training signals without slowing throughput.</p></li><li><p><strong>Token-Level Policy Gradient Loss</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Fepr!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8db10d87-10f4-4d45-bd90-47bba15e8748_887x460.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Fepr!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8db10d87-10f4-4d45-bd90-47bba15e8748_887x460.png 424w, https://substackcdn.com/image/fetch/$s_!Fepr!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8db10d87-10f4-4d45-bd90-47bba15e8748_887x460.png 848w, https://substackcdn.com/image/fetch/$s_!Fepr!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8db10d87-10f4-4d45-bd90-47bba15e8748_887x460.png 1272w, https://substackcdn.com/image/fetch/$s_!Fepr!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8db10d87-10f4-4d45-bd90-47bba15e8748_887x460.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Fepr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8db10d87-10f4-4d45-bd90-47bba15e8748_887x460.png" width="887" height="460" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/8db10d87-10f4-4d45-bd90-47bba15e8748_887x460.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:460,&quot;width&quot;:887,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:141463,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.rohan-paul.com/i/160102641?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8db10d87-10f4-4d45-bd90-47bba15e8748_887x460.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Fepr!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8db10d87-10f4-4d45-bd90-47bba15e8748_887x460.png 424w, https://substackcdn.com/image/fetch/$s_!Fepr!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8db10d87-10f4-4d45-bd90-47bba15e8748_887x460.png 848w, https://substackcdn.com/image/fetch/$s_!Fepr!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8db10d87-10f4-4d45-bd90-47bba15e8748_887x460.png 1272w, https://substackcdn.com/image/fetch/$s_!Fepr!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F8db10d87-10f4-4d45-bd90-47bba15e8748_887x460.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>GRPO averages losses at the sample level. Longer responses get diluted weight, so their tokens are under-penalized or under-rewarded. DAPO aggregates over all tokens across the entire batch. This ensures each token&#8217;s contribution scales with its presence. 
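</p><p>The two aggregation schemes are easy to contrast in a toy comparison (illustrative code, not the paper&#8217;s implementation):</p>

```python
def sample_level_loss(per_token_losses: list[list[float]]) -> float:
    # GRPO-style: average within each response first, then across
    # responses, so every response carries equal total weight.
    per_sample = [sum(t) / len(t) for t in per_token_losses]
    return sum(per_sample) / len(per_sample)

def token_level_loss(per_token_losses: list[list[float]]) -> float:
    # DAPO-style: pool every token in the batch, so each token carries
    # equal weight and long responses count in proportion to length.
    all_tokens = [x for resp in per_token_losses for x in resp]
    return sum(all_tokens) / len(all_tokens)
```

<p>Given a 3-token response with per-token loss 1.0 and a 1-token response with loss 0.0, the sample-level average is 0.5 but the token-level average is 0.75: the longer response&#8217;s tokens are no longer diluted.</p><p>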
Long high-quality reasoning gets reinforced more accurately, and long poor-quality outputs are penalized properly.</p></li><li><p><strong>Overlong Reward Shaping</strong></p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!fU-b!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ad5434b-18fa-4ca6-8cb8-4f116dfe48ae_839x409.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!fU-b!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ad5434b-18fa-4ca6-8cb8-4f116dfe48ae_839x409.png 424w, https://substackcdn.com/image/fetch/$s_!fU-b!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ad5434b-18fa-4ca6-8cb8-4f116dfe48ae_839x409.png 848w, https://substackcdn.com/image/fetch/$s_!fU-b!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ad5434b-18fa-4ca6-8cb8-4f116dfe48ae_839x409.png 1272w, https://substackcdn.com/image/fetch/$s_!fU-b!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ad5434b-18fa-4ca6-8cb8-4f116dfe48ae_839x409.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!fU-b!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ad5434b-18fa-4ca6-8cb8-4f116dfe48ae_839x409.png" width="839" height="409" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1ad5434b-18fa-4ca6-8cb8-4f116dfe48ae_839x409.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:409,&quot;width&quot;:839,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:53723,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.rohan-paul.com/i/160102641?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ad5434b-18fa-4ca6-8cb8-4f116dfe48ae_839x409.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!fU-b!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ad5434b-18fa-4ca6-8cb8-4f116dfe48ae_839x409.png 424w, https://substackcdn.com/image/fetch/$s_!fU-b!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ad5434b-18fa-4ca6-8cb8-4f116dfe48ae_839x409.png 848w, https://substackcdn.com/image/fetch/$s_!fU-b!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ad5434b-18fa-4ca6-8cb8-4f116dfe48ae_839x409.png 1272w, https://substackcdn.com/image/fetch/$s_!fU-b!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ad5434b-18fa-4ca6-8cb8-4f116dfe48ae_839x409.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" 
viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>In RL, responses sometimes exceed a maximum token count set by the training system. The system chops off any excess tokens, so the response is &#8220;truncated.&#8221;</p><p>When a response is truncated, some training setups automatically assign it a &#8220;bad&#8221; reward. That negative signal may be misleading. The reasoning process might be correct but just too verbose. 
Penalizing it with a flat negative reward confuses the model.</p><p><strong>Overlong Filtering</strong></p><ul><li><p>DAPO skips using these truncated samples when updating the model.</p></li><li><p>The model does not see an incorrect penalty for what could be a valid reasoning chain.</p></li><li><p>This avoids mixing &#8220;overly long&#8221; with &#8220;low-quality.&#8221;</p></li></ul><p><strong>Soft Overlong Punishment</strong></p><ul><li><p>DAPO applies a small penalty once the response length goes past a certain threshold, and the penalty grows if the output gets even longer.</p></li><li><p>The aim is to nudge the model to be concise without giving it a harsh, all-or-nothing punishment.</p></li><li><p>The model still keeps any correct logic it used so far, but sees an incentive to shorten future outputs.</p></li></ul><p>This two-step method keeps training signals clear. Good reasoning is reinforced, while the model is gently encouraged to avoid excess length.</p></li></ul><h2>The Algorithm of DAPO</h2><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Kx1v!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77efe9e1-b4b4-41c1-8c81-6b9add80661c_852x342.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Kx1v!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77efe9e1-b4b4-41c1-8c81-6b9add80661c_852x342.png 424w, https://substackcdn.com/image/fetch/$s_!Kx1v!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77efe9e1-b4b4-41c1-8c81-6b9add80661c_852x342.png 848w, 
https://substackcdn.com/image/fetch/$s_!Kx1v!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77efe9e1-b4b4-41c1-8c81-6b9add80661c_852x342.png 1272w, https://substackcdn.com/image/fetch/$s_!Kx1v!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77efe9e1-b4b4-41c1-8c81-6b9add80661c_852x342.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Kx1v!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77efe9e1-b4b4-41c1-8c81-6b9add80661c_852x342.png" width="852" height="342" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/77efe9e1-b4b4-41c1-8c81-6b9add80661c_852x342.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:342,&quot;width&quot;:852,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:91007,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.rohan-paul.com/i/160102641?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77efe9e1-b4b4-41c1-8c81-6b9add80661c_852x342.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Kx1v!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77efe9e1-b4b4-41c1-8c81-6b9add80661c_852x342.png 424w, https://substackcdn.com/image/fetch/$s_!Kx1v!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77efe9e1-b4b4-41c1-8c81-6b9add80661c_852x342.png 848w, 
https://substackcdn.com/image/fetch/$s_!Kx1v!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77efe9e1-b4b4-41c1-8c81-6b9add80661c_852x342.png 1272w, https://substackcdn.com/image/fetch/$s_!Kx1v!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F77efe9e1-b4b4-41c1-8c81-6b9add80661c_852x342.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Input</strong></p><ul><li><p><strong>&#960;&#952;</strong>: The policy model we are currently 
training.</p></li><li><p><strong>R</strong>: The reward function that assigns a score to each generated response.</p></li><li><p><strong>D</strong>: The dataset (or collection of prompts) we use to query the model.</p></li><li><p><strong>&#949;_low, &#949;_high</strong>: The two key hyperparameters that set the lower and upper clip range.</p></li></ul><p><strong>Main Loop (step = 1 to M)</strong></p><ol><li><p><strong>Sample a batch D_b from D</strong></p><ul><li><p>We pick a batch of prompts from our training set D. These are the questions (or tasks) for which the model will generate responses and receive rewards.</p></li></ul></li><li><p><strong>Update the old policy model &#960;&#952;_old &#8592; &#960;&#952;</strong></p><ul><li><p>We make a copy of our current policy parameters so we can compare old probabilities to new ones during policy updates.</p></li></ul></li><li><p><strong>Sample G outputs {o_i} for each question in D_b using &#960;&#952;_old</strong></p><ul><li><p>For every prompt q in this batch, we ask the old policy model to generate G different responses.</p></li><li><p>These multiple samples let us measure relative quality among responses for the same prompt.</p></li></ul></li><li><p><strong>Compute rewards {r_i} for each output by running R</strong></p><ul><li><p>Each response gets a reward, for example 1 if the answer is correct, -1 if incorrect.</p></li><li><p>Other forms of reward can also be applied, as long as they map each response to a numeric score.</p></li></ul></li><li><p><strong>Filter out o_i and add the remaining to the dynamic sampling buffer</strong></p><ul><li><p>Dynamic Sampling means we only keep prompts that do not yield all-correct or all-wrong outputs.</p></li><li><p>If all G responses for a prompt are correct, there is no training signal (the advantages become zero). 
If all are wrong, we often skip them for a similar reason.</p></li><li><p>We keep only prompts that have meaningful variation in correctness and store them in a buffer.</p></li></ul></li><li><p><strong>Check buffer size n_b</strong></p><ul><li><p>If the buffer has fewer than N items (N is a chosen threshold), we skip the update for now.</p></li><li><p>This ensures each update sees enough training signals.</p></li></ul></li><li><p><strong>Compute advantages A_i,t for each token</strong></p><ul><li><p>For each response that survived filtering, we compare the reward of that response to the reward of other responses for the same prompt.</p></li><li><p>This advantage is token-level, meaning we assign each token&#8217;s share of credit for the overall reward.</p></li></ul></li><li><p><strong>Perform multiple gradient updates</strong></p><ul><li><p>We loop over a small number of gradient-update steps (&#956;).</p></li><li><p>In each update, we <strong>maximize the DAPO objective</strong>, which applies clipping (with &#949;_low and &#949;_high) at each token.</p></li><li><p>This step nudges the policy toward higher-reward responses and away from lower-reward responses.</p></li><li><p>The key difference vs. older algorithms is the <strong>decoupled</strong> clip range and the <strong>token-level</strong> averaging, plus dynamic sampling.</p></li></ul></li></ol><p><strong>Output</strong></p><ul><li><p>The final trained policy &#960;&#952; after M iterations.</p></li><li><p>It has learned to produce more correct or higher-reward outputs by repeatedly sampling, comparing, and updating based on these feedback signals.</p></li></ul><h4>Dataset and Training</h4><p>The team curated <strong>DAPO-Math-17K</strong>, a math dataset with integer-answer problems to simplify rule-based scoring. Qwen2.5-32B was used as the base model. The RL training starts from that checkpoint, applies DAPO, and samples multiple responses per prompt. 
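</p><p>The sampling, filtering, and advantage steps can be sketched as follows (illustrative rewards and helper names, not the released code):</p>

```python
from statistics import mean, pstdev

# Hedged sketch of dynamic sampling plus group-relative advantages.
# rewards_per_prompt maps each prompt to the rewards of its G sampled responses.
rewards_per_prompt = {
    "q1": [1.0, 1.0, 1.0, 1.0],    # all correct: no learning signal
    "q2": [1.0, -1.0, -1.0, 1.0],  # mixed correctness: informative
}

buffer = {}
for prompt, rewards in rewards_per_prompt.items():
    spread = pstdev(rewards)
    if spread == 0:        # all-correct or all-wrong group is filtered out
        continue           # dynamic sampling drops this prompt
    # Each response's advantage is its reward normalized within its group.
    buffer[prompt] = [(r - mean(rewards)) / spread for r in rewards]
```

<p>Only the mixed group survives, and its advantages sum to zero within the group, so updates push probability mass from worse responses toward better ones.</p>

<p>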
Each step updates the model by maximizing the difference in reward between good and bad responses.</p><h4>Results</h4><p>Accuracy on the AIME 2024 benchmark climbed from 30 points under naive GRPO to 50 points under DAPO. Dynamic Sampling, Clip-Higher, Overlong Reward Shaping, and Token-Level Loss each contributed to closing the gap. DAPO needed roughly half the number of training steps compared to the previous 32B DeepSeek-R1.</p><h4>Observations on Reasoning Behaviors</h4><p>During training, the model began to <strong>reflect</strong> on its reasoning and revise steps mid-solution, a behavior absent initially. This emergence suggests DAPO fosters exploration of more complex reasoning paths.</p><p>DAPO&#8217;s <strong><a href="https://github.com/BytedTsinghua-SIA/DAPO">codebase and dataset</a></strong> are fully open-sourced, enabling reproducibility of large-scale RL training for advanced reasoning tasks. These techniques stabilize training, preserve exploration, and encourage correct long-form reasoning in LLMs.</p>]]></content:encoded></item><item><title><![CDATA[Particle-based inference-time scaling helps smaller LLMs reach or exceed top-tier closed-source models on math benchmarks]]></title><description><![CDATA[Paper - https://arxiv.org/abs/2502.01618]]></description><link>https://www.rohan-paul.com/p/particle-based-inference-time-scaling</link><guid isPermaLink="false">https://www.rohan-paul.com/p/particle-based-inference-time-scaling</guid><dc:creator><![CDATA[Rohan Paul]]></dc:creator><pubDate>Mon, 24 Mar 2025 15:28:27 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!O4vD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75547d06-2056-45b1-82aa-572ac96837f0_872x868.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" 
href="https://substackcdn.com/image/fetch/$s_!O4vD!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75547d06-2056-45b1-82aa-572ac96837f0_872x868.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!O4vD!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75547d06-2056-45b1-82aa-572ac96837f0_872x868.png 424w, https://substackcdn.com/image/fetch/$s_!O4vD!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75547d06-2056-45b1-82aa-572ac96837f0_872x868.png 848w, https://substackcdn.com/image/fetch/$s_!O4vD!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75547d06-2056-45b1-82aa-572ac96837f0_872x868.png 1272w, https://substackcdn.com/image/fetch/$s_!O4vD!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75547d06-2056-45b1-82aa-572ac96837f0_872x868.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!O4vD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75547d06-2056-45b1-82aa-572ac96837f0_872x868.png" width="872" height="868" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/75547d06-2056-45b1-82aa-572ac96837f0_872x868.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:868,&quot;width&quot;:872,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:329664,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:&quot;https://rohanpaul.substack.com/i/159754812?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75547d06-2056-45b1-82aa-572ac96837f0_872x868.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!O4vD!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75547d06-2056-45b1-82aa-572ac96837f0_872x868.png 424w, https://substackcdn.com/image/fetch/$s_!O4vD!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75547d06-2056-45b1-82aa-572ac96837f0_872x868.png 848w, https://substackcdn.com/image/fetch/$s_!O4vD!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75547d06-2056-45b1-82aa-572ac96837f0_872x868.png 1272w, https://substackcdn.com/image/fetch/$s_!O4vD!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F75547d06-2056-45b1-82aa-572ac96837f0_872x868.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" 
height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Paper - <a href="https://arxiv.org/abs/2502.01618">https://arxiv.org/abs/2502.01618</a></figcaption></figure></div><p><strong>Paper - https://arxiv.org/abs/2502.01618</strong></p><p>They want to improve small LLMs&#8217; reasoning accuracy by allocating more compute at inference time.</p><p>Propose treating inference-time scaling as probabilistic inference over a <strong>state-space model (SSM)</strong>, using <strong>particle-based sampling</strong> to handle imperfect reward models.</p><h4>&#127991;&#65039; Overview of the Method</h4><p>They define a <strong>transition</strong> via the policy model and treat acceptance as a <strong>likelihood</strong> from a <strong>process reward model (PRM)</strong>. 
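</p><p>In this framing (notation ours, as a hedged sketch), the target distribution is the policy prior reweighted by the PRM likelihood at each step:</p>

```latex
p(x_{1:T} \mid \text{accepted}) \;\propto\;
\underbrace{\prod_{t=1}^{T} p_{\mathrm{LLM}}(x_t \mid x_{1:t-1})}_{\text{transition (policy model)}}
\;\cdot\;
\underbrace{\prod_{t=1}^{T} r_{\mathrm{PRM}}(x_{1:t})}_{\text{likelihood (reward model)}}
```

<p>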
They then apply <strong>particle filtering (PF)</strong> to sample from the posterior distribution instead of directly optimizing the reward, reducing reward hacking and preserving exploration. Each partial solution&#8217;s reward guides resampling, promoting higher-scoring paths while keeping diversity.</p><h4>&#127991;&#65039; Particle Filtering</h4><p>PF starts with several partial answers (particles). Each step:</p><ul><li><p>Draws new tokens from the LLM.</p></li><li><p>Scores them via the PRM.</p></li><li><p>Weights and resamples the particles proportionally to their reward-based scores.</p></li></ul><p>This stochastic approach balances exploitation and exploration, unlike deterministic search methods that can overfit to reward flaws.</p><h4>&#9881;&#65039; Core Idea of Particle Filtering</h4><p>Particle Filtering maintains multiple &#8220;particles&#8221; (possible scenarios) that evolve step by step. Each particle represents one possible hidden state sequence. Every time new data arrives, particles get <strong>weighted</strong> based on how well they match the data, then <strong>resampled</strong> so higher-weight particles survive. This keeps multiple hypotheses alive until the end.</p><h4>&#128296; Main Steps</h4><ol><li><p><strong>Initialization</strong><br>Draw a batch of random guesses (particles). Assign each a weight of 1/N.</p></li><li><p><strong>Prediction</strong><br>Evolve each particle using the process model (e.g., a language model). This predicts the next state.</p></li><li><p><strong>Weight Update</strong><br>Compare each particle&#8217;s prediction to observed data via the likelihood (e.g., reward model). Multiply its weight by that likelihood.</p></li><li><p><strong>Resampling</strong><br>Randomly pick particles (with replacement) according to their weights to form a new batch. Particles with higher weights get replicated; low-weight ones vanish.</p></li><li><p><strong>Iteration</strong><br>Move to the next time step. 
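</p><p>Steps 1&#8211;5 can be sketched end to end with toy stand-ins for the LLM (extend) and the reward model (likelihood); this is illustrative only, not the paper&#8217;s code:</p>

```python
import random

random.seed(0)  # reproducible toy run

def extend(state):
    # Prediction: the "LLM" proposes the next step of a partial answer.
    return state + [random.choice([0, 1])]

def likelihood(state):
    # Weight update: pretend the reward model prefers states with more 1s.
    return 1.0 + sum(state)

N, STEPS = 8, 4
particles = [[] for _ in range(N)]   # Initialization: N particles, weight 1/N each
for _ in range(STEPS):               # Iteration over time steps
    particles = [extend(p) for p in particles]                    # Prediction
    weights = [likelihood(p) for p in particles]                  # Weight update
    particles = random.choices(particles, weights=weights, k=N)   # Resampling

best = max(particles, key=likelihood)  # highest-scoring surviving particle
```

<p>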
Repeat prediction, weight update, and resampling. Continue until all steps are processed.</p></li></ol><h4>&#129513; Why It Works</h4><p>It balances two goals:</p><ul><li><p><strong>Exploit</strong> high-likelihood particles.</p></li><li><p><strong>Explore</strong> by keeping some diversity.</p></li></ul><p>It avoids getting stuck in just one best guess, so it&#8217;s more robust to noisy or imperfect models.</p><div><hr></div><h4>&#127991;&#65039; Multi-Iteration and Parallel Extensions</h4><p>They extend PF with:</p><ul><li><p><strong>Particle Gibbs (PG)</strong>, which reruns PF multiple times, keeping a reference trajectory between runs for improved sampling.</p></li><li><p><strong>Parallel Tempering (PT)</strong>, which runs multiple PF chains at different &#8220;temperatures&#8221; and occasionally swaps their states to combine exploration and exploitation more effectively.</p></li></ul><h4>&#127991;&#65039; Experiments and Results</h4><p>They evaluate on <strong>MATH500</strong> (500 math problems) and <strong>AIME2024</strong> (30 competition problems). Key highlights:</p><ul><li><p><strong>PF</strong> outperforms standard methods like <strong>best-of-n (BoN)</strong>, <strong>weighted BoN (WBoN)</strong>, and <strong>dynamic variable-time search (DVTS)</strong>.</p></li><li><p>PF consistently achieves <strong>4&#8211;16x</strong> better scaling efficiency.</p></li><li><p>With only <strong>4</strong> particles, <strong>Qwen2.5-Math-1.5B-Instruct</strong> surpasses <strong>GPT-4o</strong> on MATH500.</p></li><li><p>With <strong>32</strong> particles, <strong>Qwen2.5-Math-7B-Instruct</strong> matches <strong>o1-preview</strong> on MATH500.</p></li><li><p><strong>Llama-3.1-8B-Instruct</strong> with PF exceeds or equals <strong>Llama-3.1-70B-Instruct</strong> performance.</p></li></ul><h4>&#127991;&#65039; Ablations</h4><p>They tested different <strong>PRMs</strong> (e.g. Qwen2.5-Math-PRM-7B) and various reward-aggregation strategies. 
A single aggregated step-level reward or a final outcome reward can be used. They also show that moderate LLM sampling temperatures (around 0.8) work best.</p><h4>&#127991;&#65039; Key Insights</h4><p>Framing inference-time scaling as a sampling problem over an approximate reward avoids rigid optimization on a possibly flawed reward. Particle-based sampling maintains a stable &#8220;typical set&#8221; exploration, leading to robust gains on challenging reasoning tasks.</p>]]></content:encoded></item><item><title><![CDATA["Speak Easy: Eliciting Harmful Jailbreaks from LLMs with Simple Interactions"]]></title><description><![CDATA[Below podcast on this paper is generated with Google's Illuminate.]]></description><link>https://www.rohan-paul.com/p/speak-easy-eliciting-harmful-jailbreaks</link><guid isPermaLink="false">https://www.rohan-paul.com/p/speak-easy-eliciting-harmful-jailbreaks</guid><dc:creator><![CDATA[Rohan Paul]]></dc:creator><pubDate>Sun, 16 Feb 2025 14:53:44 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!I0bX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a3949a4-4f11-46e4-b931-7180fbc8cefb_1135x871.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!I0bX!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a3949a4-4f11-46e4-b931-7180fbc8cefb_1135x871.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!I0bX!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a3949a4-4f11-46e4-b931-7180fbc8cefb_1135x871.png 424w, 
https://substackcdn.com/image/fetch/$s_!I0bX!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a3949a4-4f11-46e4-b931-7180fbc8cefb_1135x871.png 848w, https://substackcdn.com/image/fetch/$s_!I0bX!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a3949a4-4f11-46e4-b931-7180fbc8cefb_1135x871.png 1272w, https://substackcdn.com/image/fetch/$s_!I0bX!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a3949a4-4f11-46e4-b931-7180fbc8cefb_1135x871.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!I0bX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a3949a4-4f11-46e4-b931-7180fbc8cefb_1135x871.png" width="1135" height="871" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5a3949a4-4f11-46e4-b931-7180fbc8cefb_1135x871.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:871,&quot;width&quot;:1135,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:451716,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!I0bX!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a3949a4-4f11-46e4-b931-7180fbc8cefb_1135x871.png 424w, 
https://substackcdn.com/image/fetch/$s_!I0bX!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a3949a4-4f11-46e4-b931-7180fbc8cefb_1135x871.png 848w, https://substackcdn.com/image/fetch/$s_!I0bX!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a3949a4-4f11-46e4-b931-7180fbc8cefb_1135x871.png 1272w, https://substackcdn.com/image/fetch/$s_!I0bX!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5a3949a4-4f11-46e4-b931-7180fbc8cefb_1135x871.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><a href="https://arxiv.org/abs/2502.04322">https://arxiv.org/abs/2502.04322</a></p><p>LLMs, despite safety alignment, remain vulnerable to jailbreak attacks that elicit harmful responses. Current jailbreak methods often require technical expertise, but this paper explores whether simple, everyday user interactions can also lead to harmful jailbreaks.</p><p>This paper introduces SPEAK EASY, a simple framework simulating realistic user interactions using multi-step and multilingual queries to elicit actionable and informative harmful responses.</p><p>-----</p><p>&#128204; SPEAK EASY cleverly exploits inherent weaknesses in LLMs' safety alignment by mimicking natural user interactions. Multi-step decomposition and multilingual translation effectively bypass surface-level safety filters.</p><p>&#128204; HARMSCORE offers a nuanced evaluation beyond binary Attack Success Rate (ASR). It quantifies the harm potential by focusing on actionability and informativeness, aligning better with human perception of real-world risk.</p><p>&#128204; The framework's modular design is a strength. 
SPEAK EASY's integration with gradient-based and tree-of-thought methods demonstrates its versatility and practical applicability for robust jailbreak testing.</p><p>----------</p><p>Methods Explored in this Paper &#128295;:</p><p>&#8594; SPEAK EASY framework is proposed to elicit harmful jailbreaks from LLMs using simple interactions.</p><p>&#8594; It simulates real-world user behaviors by employing multi-step query decomposition and multilingual translations.</p><p>&#8594; Given a harmful query, SPEAK EASY first decomposes it into multiple seemingly harmless subqueries.</p><p>&#8594; Each subquery is then translated into multiple languages to exploit multilingual vulnerabilities in LLMs.</p><p>&#8594; Responses from the LLM for translated subqueries are translated back to English.</p><p>&#8594; Actionability and informativeness of each response are scored using fine-tuned response selection models.</p><p>&#8594; The responses with the highest combined actionability and informativeness scores are selected and concatenated to form the final jailbreak response.</p><p>&#8594; HARMSCORE metric is introduced to evaluate jailbreak harmfulness based on actionability and informativeness of the LLM's response, going beyond simple Attack Success Rate (ASR).</p><p>-----</p><p>Key Insights &#128161;:</p><p>&#8594; Actionability and informativeness are identified as key attributes that make a jailbroken LLM response truly harmful and useful for malicious users.</p><p>&#8594; HARMSCORE, a metric measuring actionability and informativeness, aligns better with human judgements of harmfulness than ASR alone, especially for instruction-based harmful queries.</p><p>&#8594; Simple multi-step and multilingual interactions, as simulated by SPEAK EASY, can significantly increase the likelihood of eliciting harmful responses from both proprietary and open-source LLMs.</p><p>&#8594; SPEAK EASY framework is easily integrated with existing jailbreak methods like GCG-T and TAP-T, further 
enhancing their effectiveness.</p><p>-----</p><p>Results &#128202;:</p><p>&#8594; SPEAK EASY increased the average Attack Success Rate (ASR) of GPT-4o from 0.092 to 0.555 across four benchmarks.</p><p>&#8594; SPEAK EASY increased the average HARMSCORE of GPT-4o from 0.180 to 0.759 across four benchmarks.</p><p>&#8594; Integrating SPEAK EASY with GCG-T and TAP-T significantly improved their ASR and HARMSCORE, with TAP-T+SPEAK EASY achieving over 0.9 ASR on GPT-4o and Llama3.3.</p><p>&#8594; Human evaluation showed HARMSCORE has a Pearson rank correlation of 0.726 with human judgement of harm, which is competitive with GPT-4o based ASR (0.723) and outperforms HarmBench ASR (0.638).</p>]]></content:encoded></item><item><title><![CDATA["SPARC: Subspace-Aware Prompt Adaptation for Robust Continual Learning in LLMs"]]></title><description><![CDATA[Below podcast on this paper is generated with Google's Illuminate.]]></description><link>https://www.rohan-paul.com/p/sparc-subspace-aware-prompt-adaptation</link><guid isPermaLink="false">https://www.rohan-paul.com/p/sparc-subspace-aware-prompt-adaptation</guid><dc:creator><![CDATA[Rohan Paul]]></dc:creator><pubDate>Sun, 16 Feb 2025 14:43:53 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!bmLs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa080e41-9e2a-4aa9-ad5d-7b5581c45c87_1090x818.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!bmLs!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa080e41-9e2a-4aa9-ad5d-7b5581c45c87_1090x818.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!bmLs!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa080e41-9e2a-4aa9-ad5d-7b5581c45c87_1090x818.png 424w, https://substackcdn.com/image/fetch/$s_!bmLs!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa080e41-9e2a-4aa9-ad5d-7b5581c45c87_1090x818.png 848w, https://substackcdn.com/image/fetch/$s_!bmLs!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa080e41-9e2a-4aa9-ad5d-7b5581c45c87_1090x818.png 1272w, https://substackcdn.com/image/fetch/$s_!bmLs!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa080e41-9e2a-4aa9-ad5d-7b5581c45c87_1090x818.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!bmLs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa080e41-9e2a-4aa9-ad5d-7b5581c45c87_1090x818.png" width="1090" height="818" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/fa080e41-9e2a-4aa9-ad5d-7b5581c45c87_1090x818.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:818,&quot;width&quot;:1090,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:397814,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!bmLs!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa080e41-9e2a-4aa9-ad5d-7b5581c45c87_1090x818.png 424w, https://substackcdn.com/image/fetch/$s_!bmLs!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa080e41-9e2a-4aa9-ad5d-7b5581c45c87_1090x818.png 848w, https://substackcdn.com/image/fetch/$s_!bmLs!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa080e41-9e2a-4aa9-ad5d-7b5581c45c87_1090x818.png 1272w, https://substackcdn.com/image/fetch/$s_!bmLs!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffa080e41-9e2a-4aa9-ad5d-7b5581c45c87_1090x818.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><a href="https://arxiv.org/abs/2502.02909">https://arxiv.org/abs/2502.02909</a></p><p>The challenge is enabling LLMs to learn new tasks continually without losing old knowledge, known as catastrophic forgetting, especially under resource constraints. This paper introduces SPARC to tackle this problem efficiently.</p><p>This paper proposes SPARC, a framework using subspace-guided prompt tuning, to allow LLMs to learn continually by adapting prompts in a low-dimensional space derived from PCA.</p><p>-----</p><p>&#128204; SPARC's strength is in efficient continual learning. It cleverly uses PCA to identify task subspaces. Orthogonal prompt initialization then ensures minimal interference. This approach significantly reduces parameter updates, making it highly scalable.</p><p>&#128204; PCA-based subspace identification is key. It allows SPARC to adapt prompts in a low-dimensional space. This focuses learning on task-relevant features. The cosine similarity metric effectively guides prompt reuse or orthogonalization.</p><p>&#128204; By freezing most LLM parameters, SPARC preserves pre-trained knowledge. Fine-tuning only soft prompts achieves strong performance. This highlights the effectiveness of targeted, subspace-aware adaptation for continual learning in LLMs.</p><p>----------</p><p>Methods Explored in this Paper &#128295;:</p><p>&#8594; SPARC uses prompt tuning. It adapts LLMs to new tasks by adding small, trainable vectors called soft prompts to the input. This keeps the main LLM parameters frozen.</p><p>&#8594; Principal Component Analysis, or PCA, is used to find important features of each task's data. 
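As outlined above, SPARC compares task subspaces and either reuses an existing prompt or initializes a new one orthogonally. A plain-NumPy sketch of that decision rule follows; the subspace rank, similarity threshold, and random initialization are illustrative assumptions, not SPARC's exact settings:

```python
import numpy as np

def task_subspace(embeddings, k):
    """Orthonormal basis (dim x k) for the top-k principal directions of task data."""
    centered = embeddings - embeddings.mean(axis=0)
    # SVD of centered data: rows of vt are principal directions.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:k].T

def subspace_similarity(basis_a, basis_b):
    """Mean cosine of principal angles between two subspaces (1.0 = identical)."""
    overlap = basis_a.T @ basis_b
    return float(np.linalg.svd(overlap, compute_uv=False).mean())

def init_prompt(new_basis, old_bases, old_prompts, threshold=0.8):
    """Reuse a prompt for a similar task; otherwise start orthogonal to old tasks."""
    for basis, prompt in zip(old_bases, old_prompts):
        if subspace_similarity(new_basis, basis) >= threshold:
            return prompt                              # similar task: reuse
    # Dissimilar task: project a fresh prompt away from every old subspace,
    # so updates for the new task do not interfere with prior knowledge.
    prompt = np.random.default_rng(0).normal(size=new_basis.shape[0])
    for basis in old_bases:
        prompt -= basis @ (basis.T @ prompt)           # remove overlap
    return prompt
```

The threshold trades knowledge transfer (reuse) against interference (orthogonalize); the value used here is arbitrary.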
PCA reduces the data into a lower-dimensional subspace. This subspace captures the most important information for each task.</p><p>&#8594; SPARC checks if new tasks are similar to old ones using cosine similarity. Cosine similarity measures the overlap between task subspaces. If tasks are similar, SPARC reuses existing prompts. This saves computation and helps transfer knowledge.</p><p>&#8594; For dissimilar tasks, SPARC creates new prompts in orthogonal subspaces. Orthogonal subspaces are independent. This ensures new tasks don't interfere with old ones, preventing forgetting.</p><p>&#8594; SPARC is parameter-efficient. It only trains the soft prompts, which are a tiny fraction of the LLM's parameters. This makes it scalable and resource-friendly. SPARC can also integrate with LoRA for further efficiency.</p><p>-----</p><p>Key Insights &#128161;:</p><p>&#8594; Subspace-guided prompt tuning effectively mitigates catastrophic forgetting in LLMs during continual learning.</p><p>&#8594; Reusing prompts for similar tasks and creating orthogonal prompts for dissimilar tasks balances knowledge retention and adaptation.</p><p>&#8594; PCA helps in identifying task-relevant features and creating efficient, low-dimensional prompts.</p><p>&#8594; SPARC is highly parameter-efficient, requiring fine-tuning of only a small percentage of model parameters. This makes continual learning more practical for large models.</p><p>-----</p><p>Results &#128202;:</p><p>&#8594; SPARC achieves no forgetting in task-incremental learning.</p><p>&#8594; In domain-incremental learning, the average forgetting ratio is below 5%.</p><p>&#8594; SPARC maintains over 97% prior knowledge retention.</p><p>&#8594; SPARC fine-tunes only 0.04% of the model's parameters. 
Combining SPARC with LoRA, it fine-tunes only 1% of parameters while improving accuracy.</p><p>&#8594; SPARC consistently outperforms baseline fine-tuning methods in knowledge retention and transfer.</p>]]></content:encoded></item><item><title><![CDATA["Show-o Turbo: Towards Accelerated Unified Multimodal Understanding and Generation"]]></title><description><![CDATA[Below podcast on this paper is generated with Google's Illuminate.]]></description><link>https://www.rohan-paul.com/p/show-o-turbo-towards-accelerated</link><guid isPermaLink="false">https://www.rohan-paul.com/p/show-o-turbo-towards-accelerated</guid><dc:creator><![CDATA[Rohan Paul]]></dc:creator><pubDate>Sun, 16 Feb 2025 14:36:32 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!MSdn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82f9d35d-7a96-47bb-8b1b-7a85c4ed24bc_975x793.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MSdn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82f9d35d-7a96-47bb-8b1b-7a85c4ed24bc_975x793.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!MSdn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82f9d35d-7a96-47bb-8b1b-7a85c4ed24bc_975x793.png 424w, https://substackcdn.com/image/fetch/$s_!MSdn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82f9d35d-7a96-47bb-8b1b-7a85c4ed24bc_975x793.png 848w, 
https://substackcdn.com/image/fetch/$s_!MSdn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82f9d35d-7a96-47bb-8b1b-7a85c4ed24bc_975x793.png 1272w, https://substackcdn.com/image/fetch/$s_!MSdn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82f9d35d-7a96-47bb-8b1b-7a85c4ed24bc_975x793.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!MSdn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82f9d35d-7a96-47bb-8b1b-7a85c4ed24bc_975x793.png" width="975" height="793" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/82f9d35d-7a96-47bb-8b1b-7a85c4ed24bc_975x793.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:793,&quot;width&quot;:975,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:380829,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!MSdn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82f9d35d-7a96-47bb-8b1b-7a85c4ed24bc_975x793.png 424w, https://substackcdn.com/image/fetch/$s_!MSdn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82f9d35d-7a96-47bb-8b1b-7a85c4ed24bc_975x793.png 848w, 
https://substackcdn.com/image/fetch/$s_!MSdn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82f9d35d-7a96-47bb-8b1b-7a85c4ed24bc_975x793.png 1272w, https://substackcdn.com/image/fetch/$s_!MSdn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F82f9d35d-7a96-47bb-8b1b-7a85c4ed24bc_975x793.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><a href="https://arxiv.org/abs/2502.05415">https://arxiv.org/abs/2502.05415</a></p><p>The original Show-o model, while versatile 
for both image and text generation, suffers from slow inference speeds due to its iterative generation processes for both modalities. This paper addresses this inefficiency by introducing Show-o Turbo, a method to accelerate both image and text generation in a unified multimodal model.</p><p>This paper proposes to shorten the generation process in Show-o by applying consistency distillation to its multimodal denoising trajectories. This approach treats both image and text generation as denoising tasks, enabling a unified acceleration strategy.</p><p>-----</p><p>&#128204; Show-o Turbo cleverly unifies text and image generation as denoising. Jacobi decoding enables parallel text token refinement, mirroring image token processing for consistent acceleration.</p><p>&#128204; This paper demonstrates effective generalization of consistency distillation to discrete multimodal generation. Trajectory segmentation stabilizes training and achieves few-step, high-quality multimodal output.</p><p>&#128204; The key innovation is practical acceleration. Show-o Turbo achieves 1.5x speedup in multimodal tasks and maintains strong generation quality with significantly fewer steps.</p><p>----------</p><p>Methods Explored in this Paper &#128295;:</p><p>&#8594; This paper introduces a unified perspective by viewing text generation as a denoising process, similar to image generation in Show-o. This is achieved by applying Jacobi decoding, a parallel text decoding algorithm, to Show-o. Jacobi decoding refines multiple text tokens simultaneously, mimicking the parallel denoising of image tokens.</p><p>&#8594; Show-o Turbo employs consistency distillation to shorten the denoising trajectories for both images and text. Consistency distillation trains Show-o Turbo to map any point on the original Show-o's generation trajectory to the final output in fewer steps. 
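In the abstract, the objective pairs intermediate trajectory points with the trajectory's endpoint. A toy one-parameter sketch makes the idea concrete; the scalar "trajectory" and the single-gain student are made-up assumptions for illustration, not Show-o's discrete token setting:

```python
# Toy consistency-distillation sketch: the teacher reaches the endpoint only
# step by step; the student learns one gain g so that g * x_t jumps from ANY
# intermediate point x_t straight to the endpoint. Values are illustrative.
teacher_trajectory = [8.0, 4.0, 2.0, 1.0]   # teacher halves x each step
endpoint = teacher_trajectory[-1]

g = 0.0                                      # student's single parameter
lr = 0.005
for _ in range(300):
    # Squared-error gradient summed over all intermediate trajectory points.
    grad = sum(2 * (g * x - endpoint) * x for x in teacher_trajectory[:-1])
    g -= lr * grad
# g converges to the least-squares jump: endpoint * sum(x) / sum(x^2) = 1/6
```

The real method works on multimodal denoising trajectories and full model predictions, but the training signal has this same "any point maps to the end" shape.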
This is done by minimizing the difference between the predictions of the student model (Show-o Turbo) and the teacher model (Show-o) at different points in the generation process.</p><p>&#8594; Trajectory segmentation and curriculum learning are used to improve training. The long denoising trajectory is divided into segments. Consistency distillation is applied within each segment. Training proceeds in stages with decreasing segment lengths, making learning more manageable and improving convergence.</p><p>-----</p><p>Key Insights &#128161;:</p><p>&#8594; Treating text generation as denoising, through parallel decoding methods like Jacobi decoding, allows for a unified acceleration approach for multimodal models. This perspective bridges the gap between image and text generation processes in Show-o.</p><p>&#8594; Consistency distillation, originally developed for continuous diffusion models, can be effectively generalized and extended to accelerate discrete multimodal models like Show-o. This extension enables significant speedups without sacrificing generation quality.</p><p>&#8594; Trajectory segmentation and curriculum learning are crucial for the successful application of consistency distillation to complex models like Show-o. These techniques stabilize training and enhance the effectiveness of distillation by breaking down the learning task into manageable stages.</p><p>-----</p><p>Results &#128202;:</p><p>&#8594; Show-o Turbo achieves a GenEval score of 0.625 in text-to-image generation with only 4 sampling steps, without classifier-free guidance (CFG). 
This outperforms the original Show-o with 8 steps and CFG, which has a GenEval score of 0.580 (at 5 steps with CFG in Table 1).</p><p>&#8594; In image-to-text generation tasks, Show-o Turbo demonstrates a 1.5x speedup in inference time compared to the original Show-o, while maintaining comparable performance on image description benchmarks like Flickr30K and NoCaps (Table 2).</p><p>&#8594; On multimodal understanding tasks, Show-o Turbo maintains competitive performance on benchmarks like POPE, MME, and MMMU, demonstrating its ability to accelerate without significantly degrading understanding capabilities (Table 2).</p>]]></content:encoded></item><item><title><![CDATA["ScoreFlow: Mastering LLM Agent Workflows via Score-based Preference Optimization"]]></title><description><![CDATA[Below podcast on this paper is generated with Google's Illuminate.]]></description><link>https://www.rohan-paul.com/p/scoreflow-mastering-llm-agent-workflows</link><guid isPermaLink="false">https://www.rohan-paul.com/p/scoreflow-mastering-llm-agent-workflows</guid><dc:creator><![CDATA[Rohan Paul]]></dc:creator><pubDate>Sun, 16 Feb 2025 14:31:02 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!QjI7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15c2f09b-b211-4372-ab2b-858f119ccd46_1096x788.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!QjI7!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15c2f09b-b211-4372-ab2b-858f119ccd46_1096x788.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!QjI7!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15c2f09b-b211-4372-ab2b-858f119ccd46_1096x788.png 424w, https://substackcdn.com/image/fetch/$s_!QjI7!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15c2f09b-b211-4372-ab2b-858f119ccd46_1096x788.png 848w, https://substackcdn.com/image/fetch/$s_!QjI7!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15c2f09b-b211-4372-ab2b-858f119ccd46_1096x788.png 1272w, https://substackcdn.com/image/fetch/$s_!QjI7!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15c2f09b-b211-4372-ab2b-858f119ccd46_1096x788.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!QjI7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15c2f09b-b211-4372-ab2b-858f119ccd46_1096x788.png" width="1096" height="788" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/15c2f09b-b211-4372-ab2b-858f119ccd46_1096x788.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:788,&quot;width&quot;:1096,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:198833,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!QjI7!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15c2f09b-b211-4372-ab2b-858f119ccd46_1096x788.png 424w, https://substackcdn.com/image/fetch/$s_!QjI7!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15c2f09b-b211-4372-ab2b-858f119ccd46_1096x788.png 848w, https://substackcdn.com/image/fetch/$s_!QjI7!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15c2f09b-b211-4372-ab2b-858f119ccd46_1096x788.png 1272w, https://substackcdn.com/image/fetch/$s_!QjI7!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F15c2f09b-b211-4372-ab2b-858f119ccd46_1096x788.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" 
stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><a href="https://arxiv.org/abs/2502.04306">https://arxiv.org/abs/2502.04306</a></p><p>This paper addresses the challenge of optimizing LLM agent workflows for complex tasks. Existing methods are inflexible and struggle to scale because they rely on discrete optimization techniques.</p><p>This paper introduces ScoreFlow, a framework using gradient-based optimization in a continuous space. It uses Score-DPO, a new method of direct preference optimization that incorporates quantitative feedback for improved workflow generation.</p><p>-----</p><p>&#128204; ScoreFlow's strength lies in shifting workflow optimization from discrete search to continuous gradient-based methods. This enables efficient exploration of complex agentic workflows, overcoming limitations of prior discrete approaches like Monte Carlo Tree Search used in AFlow.</p><p>&#128204; Score-DPO innovatively integrates quantitative scores into Direct Preference Optimization. This score-aware preference learning refines workflow generation more effectively than standard DPO, which only uses binary preference pairs, leading to faster convergence and better performance.</p><p>&#128204; The code representation for workflows in ScoreFlow provides a flexible structure. This allows for complex logic, loops, and conditional execution within agent workflows, unlike graph-based methods. 
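A minimal sketch of score-aware preference learning helps here, assuming a logistic Bradley-Terry term weighted by each pair's score gap (the weighting scheme is an illustrative assumption, not the paper's exact objective):

```python
import math

def score_dpo_loss(pairs, beta=1.0):
    """Sketch of a score-weighted preference loss.

    Each pair is (logit_winner, logit_loser, score_winner, score_loser).
    Standard DPO would use only the binary 'winner beat loser' label; here the
    quantitative score gap weights how much the pair contributes.
    """
    total = 0.0
    for logit_w, logit_l, score_w, score_l in pairs:
        gap = score_w - score_l                   # quantitative feedback
        margin = beta * (logit_w - logit_l)       # policy's preference margin
        total += gap * math.log(1 + math.exp(-margin))  # weighted logistic term
    return total / len(pairs)

# A pair with a large score gap dominates a marginal one at the same margin:
low_gap  = score_dpo_loss([(0.0, 0.0, 0.55, 0.45)])   # ~0.1 * ln 2
high_gap = score_dpo_loss([(0.0, 0.0, 0.95, 0.05)])   # ~0.9 * ln 2
```

The design intuition matches the summary: decisive evaluation differences should push the workflow generator harder than near-ties.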
This code-based approach enhances adaptability and expressiveness for diverse tasks.</p><p>----------</p><p>Methods Explored in this Paper &#128295;:</p><p>&#8594; ScoreFlow framework is introduced for automated LLM agent workflow generation and optimization.</p><p>&#8594; It employs code as a representation for workflows, enabling flexible and robust searches.</p><p>&#8594; Operators are used as predefined, reusable agent combinations, customizable by the generator.</p><p>&#8594; ScoreFlow uses an open-source LLM as the base model for workflow generation, minimizing costs.</p><p>&#8594; ScoreFlow optimizes workflow generators using preference data derived from evaluation scores.</p><p>&#8594; Score-DPO, a variant of Direct Preference Optimization, is proposed.</p><p>&#8594; Score-DPO incorporates quantitative evaluation scores directly into the preference optimization process.</p><p>&#8594; Score-DPO enhances the sampling distribution by up-weighting preference pairs with larger score differences.</p><p>&#8594; Score-DPO incorporates evaluation scores into the Bradley-Terry ranking objective for improved learning.</p><p>-----</p><p>Key Insights &#128161;:</p><p>&#8594; ScoreFlow achieves high performance, scalability, and adaptability in LLM agent workflow optimization.</p><p>&#8594; Gradient-based optimization in ScoreFlow offers more flexibility and scalability compared to discrete methods.</p><p>&#8594; Score-DPO effectively addresses inaccuracies in evaluation scores, improving optimization convergence.</p><p>&#8594; Adaptive workflow generation in ScoreFlow allows for task-specific operator selection and workflow complexity.</p><p>&#8594; ScoreFlow demonstrates robustness across different LLM architectures for both generators and executors.</p><p>&#8594; ScoreFlow enables smaller models to outperform larger models with better cost efficiency.</p><p>-----</p><p>Results &#128202;:</p><p>&#8594; ScoreFlow achieves an 8.2% average performance improvement over 
baselines across six benchmarks.</p><p>&#8594; ScoreFlow outperforms automated workflow optimization baselines like ADAS and Aflow.</p><p>&#8594; ScoreFlow, using Score-DPO, outperforms baselines like SFT, PPO and standard DPO in workflow optimization.</p><p>&#8594; ScoreFlow shows better scalability and performance advantage over Aflow on diverse combined datasets.</p>]]></content:encoded></item><item><title><![CDATA["Scaling up Test-Time Compute with Latent Reasoning: A Recurrent Depth Approach"]]></title><description><![CDATA[Below podcast on this paper is generated with Google's Illuminate.]]></description><link>https://www.rohan-paul.com/p/scaling-up-test-time-compute-with</link><guid isPermaLink="false">https://www.rohan-paul.com/p/scaling-up-test-time-compute-with</guid><dc:creator><![CDATA[Rohan Paul]]></dc:creator><pubDate>Sun, 16 Feb 2025 14:25:52 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!jNGx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e28ff04-d31e-4463-b1ce-54634ffa7eec_1101x822.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!jNGx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e28ff04-d31e-4463-b1ce-54634ffa7eec_1101x822.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!jNGx!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e28ff04-d31e-4463-b1ce-54634ffa7eec_1101x822.png 424w, 
https://substackcdn.com/image/fetch/$s_!jNGx!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e28ff04-d31e-4463-b1ce-54634ffa7eec_1101x822.png 848w, https://substackcdn.com/image/fetch/$s_!jNGx!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e28ff04-d31e-4463-b1ce-54634ffa7eec_1101x822.png 1272w, https://substackcdn.com/image/fetch/$s_!jNGx!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e28ff04-d31e-4463-b1ce-54634ffa7eec_1101x822.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!jNGx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e28ff04-d31e-4463-b1ce-54634ffa7eec_1101x822.png" width="1101" height="822" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/7e28ff04-d31e-4463-b1ce-54634ffa7eec_1101x822.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:822,&quot;width&quot;:1101,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:280893,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!jNGx!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e28ff04-d31e-4463-b1ce-54634ffa7eec_1101x822.png 424w, 
https://substackcdn.com/image/fetch/$s_!jNGx!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e28ff04-d31e-4463-b1ce-54634ffa7eec_1101x822.png 848w, https://substackcdn.com/image/fetch/$s_!jNGx!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e28ff04-d31e-4463-b1ce-54634ffa7eec_1101x822.png 1272w, https://substackcdn.com/image/fetch/$s_!jNGx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F7e28ff04-d31e-4463-b1ce-54634ffa7eec_1101x822.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><a href="https://arxiv.org/abs/2502.05171">https://arxiv.org/abs/2502.05171</a></p><p>The paper addresses the challenge of enhancing reasoning in LLMs without drastically increasing model size or relying on specialized training data like chain-of-thought examples. Current methods often scale compute by generating more tokens, which can be inefficient.</p><p>This paper proposes a novel LLM architecture using a recurrent depth approach. It enables scaling test-time computation by iteratively refining latent representations, rather than verbalizing intermediate thoughts. This method works with standard training data and smaller context windows, potentially capturing reasoning forms beyond verbalization.</p><p>-----</p><p>&#128204; Recurrent architecture enables test-time compute scaling. This offers a parameter-efficient way to enhance reasoning without increasing model size. Deeper inference at test time improves performance on complex tasks.</p><p>&#128204; Latent space recurrence allows for implicit reasoning. This contrasts with explicit chain-of-thought. The model "thinks" in continuous space, potentially capturing non-verbal reasoning.</p><p>&#128204; Variable recurrence during training is key. Random iteration counts force the model to generalize across compute budgets. Truncated backpropagation enables efficient training despite recurrent depth.</p><p>----------</p><p>Methods Explored in this Paper &#128295;:</p><p>&#8594; Introduces a depth-recurrent LLM architecture. It is built upon standard decoder-only transformer blocks.</p><p>&#8594; These blocks are organized into three parts: Prelude, Recurrent block, and Coda. The Prelude embeds the input into a latent space.</p><p>&#8594; The core Recurrent block iteratively refines a latent state. 
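The Prelude, Recurrent block, and Coda flow described in these notes can be sketched in miniature. This is an illustrative toy, not the paper's code: the function names, the averaging update, and the list-of-floats latent state are my own assumptions standing in for transformer blocks.

```python
# Toy sketch of the prelude -> recurrent block -> coda flow.
# Real blocks are transformer layers on token sequences; these
# stand-in functions only illustrate the control flow.

def prelude(token_embedding):
    # Embed the input into the latent space.
    return [2.0 * x for x in token_embedding]

def recurrent_block(latent, token_embedding):
    # Refine the latent state, re-injecting the input embedding each step.
    return [0.5 * l + 0.5 * e for l, e in zip(latent, token_embedding)]

def coda(latent):
    # Un-embed: here, just the index of the most active latent dimension.
    return max(range(len(latent)), key=lambda i: latent[i])

def forward(token_embedding, num_iterations):
    latent = prelude(token_embedding)
    for _ in range(num_iterations):  # the test-time compute knob
        latent = recurrent_block(latent, token_embedding)
    return coda(latent)

print(forward([0.1, 0.9, 0.3], num_iterations=4))  # -> 1
```

Increasing num_iterations buys more test-time compute without adding a single parameter, which is the trade the paper studies.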
The Coda un-embeds the latent state and predicts the next token.</p><p>&#8594; A key feature is the recurrent block, which loops for a variable number of iterations during training and testing. This allows for scaling compute at test time without increasing model parameters.</p><p>&#8594; During training, the number of recurrent iterations is randomly sampled from a log-normal Poisson distribution. This ensures the model learns to function with varying compute.</p><p>&#8594; To manage computational cost, truncated backpropagation is used. Backpropagation is limited to the last 8 recurrent iterations.</p><p>&#8594; The model is trained on a diverse dataset favoring code and mathematical reasoning data. This aims to promote emergent reasoning abilities.</p><p>-----</p><p>Key Insights &#128161;:</p><p>&#8594; Depth-recurrent models can effectively learn and improve performance by scaling test-time computation.</p><p>&#8594; Latent reasoning can capture complex reasoning beyond verbalization, like spatial or numerical reasoning.</p><p>&#8594; Recurrent models naturally support features like per-token adaptive compute, KV-cache sharing, and self-speculative decoding, simplifying LLM use cases.</p><p>&#8594; The model exhibits context-dependent convergence and path independence in its latent space, suggesting complex computational behaviors emerge with scale.</p><p>-----</p><p>Results &#128202;:</p><p>&#8594; The 3.5B parameter recurrent model achieves performance comparable to 7B parameter models on standard benchmarks.</p><p>&#8594; On mathematical reasoning tasks like GSM8K, the model surpasses most open-source models, showing significant gains from recurrent depth.</p><p>&#8594; Performance on harder tasks like the ARC Challenge and GSM8K improves significantly with increased test-time compute (more recurrent iterations).</p><p>&#8594; At 180B training tokens, the recurrent model outperforms a non-recurrent baseline, especially on challenging reasoning 
tasks.</p>]]></content:encoded></item><item><title><![CDATA["Scaling Laws in Patchification: An Image Is Worth 50,176 Tokens And More"]]></title><description><![CDATA[Below podcast on this paper is generated with Google's Illuminate.]]></description><link>https://www.rohan-paul.com/p/scaling-laws-in-patchification-an</link><guid isPermaLink="false">https://www.rohan-paul.com/p/scaling-laws-in-patchification-an</guid><dc:creator><![CDATA[Rohan Paul]]></dc:creator><pubDate>Sun, 16 Feb 2025 14:21:36 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!n9LT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb16118ac-033b-42a3-84b4-4a8971b89067_924x866.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!n9LT!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb16118ac-033b-42a3-84b4-4a8971b89067_924x866.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!n9LT!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb16118ac-033b-42a3-84b4-4a8971b89067_924x866.png 424w, https://substackcdn.com/image/fetch/$s_!n9LT!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb16118ac-033b-42a3-84b4-4a8971b89067_924x866.png 848w, https://substackcdn.com/image/fetch/$s_!n9LT!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb16118ac-033b-42a3-84b4-4a8971b89067_924x866.png 1272w, 
https://substackcdn.com/image/fetch/$s_!n9LT!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb16118ac-033b-42a3-84b4-4a8971b89067_924x866.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!n9LT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb16118ac-033b-42a3-84b4-4a8971b89067_924x866.png" width="924" height="866" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b16118ac-033b-42a3-84b4-4a8971b89067_924x866.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:866,&quot;width&quot;:924,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:391070,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!n9LT!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb16118ac-033b-42a3-84b4-4a8971b89067_924x866.png 424w, https://substackcdn.com/image/fetch/$s_!n9LT!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb16118ac-033b-42a3-84b4-4a8971b89067_924x866.png 848w, https://substackcdn.com/image/fetch/$s_!n9LT!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb16118ac-033b-42a3-84b4-4a8971b89067_924x866.png 1272w, 
https://substackcdn.com/image/fetch/$s_!n9LT!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb16118ac-033b-42a3-84b4-4a8971b89067_924x866.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><a href="https://arxiv.org/abs/2502.03738">https://arxiv.org/abs/2502.03738</a></p><p>The common approach in Vision Transformer models compresses images into patches, reducing computational cost but potentially losing visual information. 
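A quick, self-contained calculation (mine, not from the paper) makes the compression concrete and recovers the token count in the paper's title:

```python
# An H x W image split into p x p patches yields (H // p) * (W // p) tokens.

def num_patch_tokens(height, width, patch_size):
    assert height % patch_size == 0 and width % patch_size == 0
    return (height // patch_size) * (width // patch_size)

for p in (16, 8, 4, 2, 1):
    print(p, num_patch_tokens(224, 224, p))
# patch size 16 -> 196 tokens; patch size 1 -> 50,176 tokens per 224x224 image
```

Shrinking the patch from 16x16 to 1x1 multiplies the sequence length by 256, which is why linear-complexity backbones matter here.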
This paper addresses the problem of information loss due to image patchification in vision models.</p><p>It proposes reducing the patch size, even to single pixels, to minimize information loss and improve model performance.</p><p>-----</p><p>&#128204; Smaller patch sizes in Vision Transformers access finer image details. This enhances visual information fidelity directly at the input level. Models benefit from richer representations without complex architectural changes.</p><p>&#128204; Computational cost shifts from model parameters to sequence length. The Mamba architecture's linear complexity becomes crucial. Pixel-level tokenization is now feasible for high-resolution images, unlocking new scaling dimensions.</p><p>&#128204; Decoder heads become less critical for tasks like semantic segmentation. High-fidelity encoders with pixel tokens suffice. This simplifies architecture and suggests a move towards encoder-only visual models.</p><p>----------</p><p>Methods Explored in this Paper &#128295;:</p><p>&#8594; The paper investigates the impact of patch size by conducting experiments with Vision Transformer and Mamba-based architectures.</p><p>&#8594; They systematically reduced the patch size from 16x16 down to 1x1 pixel.</p><p>&#8594; This reduction in patch size increases the input sequence length significantly, up to 50,176 tokens for a 224x224 image.</p><p>&#8594; Experiments were performed on ImageNet-1k classification, ADE20k semantic segmentation, and COCO object detection/instance segmentation tasks.</p><p>&#8594; Both Vision Transformer and Adventurer architectures were used to ensure the findings are generalizable.</p><p>-----</p><p>Key Insights &#128161;:</p><p>&#8594; A scaling law in patchification is observed.</p><p>&#8594; Smaller patch sizes consistently improve model performance across various vision tasks, input resolutions, and architectures.</p><p>&#8594; Patchification is identified as a compromise for computational efficiency, not a 
necessity for effective vision models.</p><p>&#8594; Reducing patch size unlocks crucial visual information previously lost through compression.</p><p>&#8594; Task-specific decoder heads become less critical for dense prediction tasks when using smaller patch sizes.</p><p>-----</p><p>Results &#128202;:</p><p>&#8594; ImageNet-1k classification accuracy improved from 82.6% to 84.6% on the Adventurer-Base model by reducing patch size to 1x1 on 224x224 images.</p><p>&#8594; On ADE20k semantic segmentation, mIoU improved consistently as patch size decreased, reaching 46.8% mIoU with Adventurer-Base and 2x2 patch size without a decoder head.</p><p>&#8594; COCO object detection AP<sup>b</sup> improved from 44.7% to 48.7% on Adventurer-Tiny and from 44.1% to 50.3% on Adventurer-Base by reducing patch size.</p>]]></content:encoded></item><item><title><![CDATA["ReasonFlux: Hierarchical LLM Reasoning via Scaling Thought Templates"]]></title><description><![CDATA[Below podcast on this paper is generated with Google's Illuminate.]]></description><link>https://www.rohan-paul.com/p/reasonflux-hierarchical-llm-reasoning</link><guid isPermaLink="false">https://www.rohan-paul.com/p/reasonflux-hierarchical-llm-reasoning</guid><dc:creator><![CDATA[Rohan Paul]]></dc:creator><pubDate>Sun, 16 Feb 2025 14:15:07 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!4mCN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71383d5b-ead1-475b-a588-11ebc6d9abf1_1297x821.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4mCN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71383d5b-ead1-475b-a588-11ebc6d9abf1_1297x821.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source 
type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4mCN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71383d5b-ead1-475b-a588-11ebc6d9abf1_1297x821.png 424w, https://substackcdn.com/image/fetch/$s_!4mCN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71383d5b-ead1-475b-a588-11ebc6d9abf1_1297x821.png 848w, https://substackcdn.com/image/fetch/$s_!4mCN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71383d5b-ead1-475b-a588-11ebc6d9abf1_1297x821.png 1272w, https://substackcdn.com/image/fetch/$s_!4mCN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71383d5b-ead1-475b-a588-11ebc6d9abf1_1297x821.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4mCN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71383d5b-ead1-475b-a588-11ebc6d9abf1_1297x821.png" width="1297" height="821" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/71383d5b-ead1-475b-a588-11ebc6d9abf1_1297x821.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:821,&quot;width&quot;:1297,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:296212,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!4mCN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71383d5b-ead1-475b-a588-11ebc6d9abf1_1297x821.png 424w, https://substackcdn.com/image/fetch/$s_!4mCN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71383d5b-ead1-475b-a588-11ebc6d9abf1_1297x821.png 848w, https://substackcdn.com/image/fetch/$s_!4mCN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71383d5b-ead1-475b-a588-11ebc6d9abf1_1297x821.png 1272w, https://substackcdn.com/image/fetch/$s_!4mCN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F71383d5b-ead1-475b-a588-11ebc6d9abf1_1297x821.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><a href="https://arxiv.org/abs/2502.06772">https://arxiv.org/abs/2502.06772</a></p><p>The challenge lies in enhancing the complex reasoning of LLMs, particularly in tasks like math problem-solving, which demands extensive search and fine-grained thinking. Current methods struggle with efficiency and generalization.</p><p>This paper introduces ReasonFlux. It addresses these limitations through hierarchical LLM reasoning. ReasonFlux uses scaled thought templates to optimize the reasoning process.</p><p>-----</p><p>&#128204; ReasonFlux tackles complex reasoning via structured thought templates. This is a form of knowledge distillation. It encodes expert problem-solving strategies into reusable modules, significantly boosting mathematical reasoning accuracy for LLMs.</p><p>&#128204; Hierarchical reinforcement learning on template trajectories is a key innovation. It moves beyond step-by-step Chain of Thought. By optimizing template sequences, ReasonFlux learns effective high-level reasoning plans, improving generalization.</p><p>&#128204; ReasonFlux's adaptive template scaling at inference enables efficient problem exploration. This dynamic approach contrasts with static inference methods. It achieves a better exploration-exploitation trade-off, leading to superior performance on challenging benchmarks.</p><p>----------</p><p>Methods Explored in this Paper &#128295;:</p><p>&#8594; ReasonFlux employs a structured thought template library. This library contains around 500 high-level templates. 
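As a purely hypothetical sketch of what retrieving from such a library could look like (the template names, tags, and overlap scoring below are my inventions; the paper's retrieval is learned rather than keyword-based):

```python
# Hypothetical thought-template library and a naive retrieval step.
# Scoring by tag overlap is an illustrative assumption only.

template_library = [
    {"name": "complete_the_square", "tags": {"quadratic", "minimum", "vertex"}},
    {"name": "telescoping_sum", "tags": {"series", "sum", "cancel"}},
    {"name": "pigeonhole", "tags": {"counting", "existence", "boxes"}},
]

def retrieve_template(problem_keywords):
    # Pick the template whose tags overlap the problem's keywords most.
    return max(template_library,
               key=lambda t: len(t["tags"] & problem_keywords))

best = retrieve_template({"minimum", "quadratic"})
print(best["name"])  # -> complete_the_square
```

The point of the structure is that search happens over a few hundred reusable templates instead of the raw solution space.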
These templates are designed for efficient retrieval and adaptation to various reasoning problems.</p><p>&#8594; Hierarchical reinforcement learning is used. It optimizes a base LLM to create optimal template trajectories. This is done on sequences of thought templates, not raw Chain of Thought data. This helps in planning solutions for complex problems step-by-step using templates.</p><p>&#8594; A new inference scaling system is introduced. This system adaptively scales thought templates during inference. It enables hierarchical LLM reasoning by dynamically selecting and applying templates. This adaptive scaling aims to balance exploration and exploitation in the reasoning process.</p><p>-----</p><p>Key Insights &#128161;:</p><p>&#8594; ReasonFlux simplifies complex reasoning by using a hierarchical, template-based approach. It moves from searching in the vast original problem space to a more manageable template space.</p><p>&#8594; The structured template library is crucial for efficient retrieval of relevant knowledge. Templates are designed to be reusable and generalizable across similar problems.</p><p>&#8594; Hierarchical reinforcement learning allows the model to learn effective strategies for combining and sequencing thought templates. This leads to better planning of reasoning paths.</p><p>&#8594; Adaptive template scaling at inference allows for dynamic problem-solving. ReasonFlux adjusts its approach based on the problem's complexity, improving efficiency and accuracy.</p><p>-----</p><p>Results &#128202;:</p><p>&#8594; ReasonFlux-32B achieves 91.2% accuracy on the MATH benchmark. This surpasses OpenAI o1-preview by 6.7%.</p><p>&#8594; On the AIME 2024 benchmark, ReasonFlux-32B achieves 56.7% accuracy. 
This outperforms o1-preview by 27% and DeepSeek-V3 by 45%.</p><p>&#8594; ReasonFlux-32B achieves 63.3% accuracy on OlympiadBench, exceeding DeepSeek-V3 by 14%.</p>]]></content:encoded></item><item><title><![CDATA["QuEST: Stable Training of LLMs with 1-Bit Weights and Activations"]]></title><description><![CDATA[Below podcast on this paper is generated with Google's Illuminate.]]></description><link>https://www.rohan-paul.com/p/quest-stable-training-of-llms-with</link><guid isPermaLink="false">https://www.rohan-paul.com/p/quest-stable-training-of-llms-with</guid><dc:creator><![CDATA[Rohan Paul]]></dc:creator><pubDate>Sun, 16 Feb 2025 14:10:48 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!aeAf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f3013d3-ca30-4ebf-a185-5a351167aab1_845x823.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!aeAf!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f3013d3-ca30-4ebf-a185-5a351167aab1_845x823.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!aeAf!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f3013d3-ca30-4ebf-a185-5a351167aab1_845x823.png 424w, https://substackcdn.com/image/fetch/$s_!aeAf!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f3013d3-ca30-4ebf-a185-5a351167aab1_845x823.png 848w, 
https://substackcdn.com/image/fetch/$s_!aeAf!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f3013d3-ca30-4ebf-a185-5a351167aab1_845x823.png 1272w, https://substackcdn.com/image/fetch/$s_!aeAf!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f3013d3-ca30-4ebf-a185-5a351167aab1_845x823.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!aeAf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f3013d3-ca30-4ebf-a185-5a351167aab1_845x823.png" width="845" height="823" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/0f3013d3-ca30-4ebf-a185-5a351167aab1_845x823.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:823,&quot;width&quot;:845,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:354955,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!aeAf!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f3013d3-ca30-4ebf-a185-5a351167aab1_845x823.png 424w, https://substackcdn.com/image/fetch/$s_!aeAf!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f3013d3-ca30-4ebf-a185-5a351167aab1_845x823.png 848w, 
https://substackcdn.com/image/fetch/$s_!aeAf!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f3013d3-ca30-4ebf-a185-5a351167aab1_845x823.png 1272w, https://substackcdn.com/image/fetch/$s_!aeAf!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F0f3013d3-ca30-4ebf-a185-5a351167aab1_845x823.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><a href="https://arxiv.org/abs/2502.05003">https://arxiv.org/abs/2502.05003</a></p><p>The challenge of high computational costs in 
LLMs restricts their efficiency. Quantization Aware Training struggles to maintain accuracy at very low bit-widths, with 8-bit quantization often seen as the limit for practical training.</p><p>This paper introduces QuEST, a novel Quantization Aware Training method. QuEST achieves stable training of LLMs using 1-bit weights and activations. QuEST also pushes the Pareto frontier to 4-bit quantization, surpassing the accuracy of standard 16-bit formats at comparable model sizes.</p><p>-----</p><p>&#128204; QuEST's Hadamard Transform preprocessing is a key innovation. It reshapes weight distributions for effective quantization. This technique unlocks surprisingly stable and accurate training even at 1-bit, pushing the limits of low-precision Quantization Aware Training.</p><p>&#128204; The Trust Gradient Estimator in QuEST directly tackles the core challenge of noisy gradients in low-bit training. By selectively trusting gradients based on quantization error, it stabilizes optimization. This is crucial for 1-bit training viability.</p><p>&#128204; QuEST practically demonstrates that 4-bit LLMs can surpass Brain Float 16 (BF16) performance. This redefines the Pareto frontier for efficient LLM training. It immediately enables smaller, faster, and cheaper high-accuracy models.</p><p>----------</p><p>Methods Explored in this Paper &#128295;:</p><p>&#8594; QuEST enhances Quantization Aware Training through two core innovations.</p><p>&#8594; It introduces a novel approach to distribution fitting in the forward pass. This involves applying a Hadamard Transform for normalization before quantizing weights and activations. This Hadamard Transform helps to shape the distribution of weights to be closer to Gaussian, which is more suitable for quantization.</p><p>&#8594; QuEST employs Mean Squared Error optimal fitting to determine the best quantization grid for the transformed distributions. 
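A minimal sketch of these two forward-pass steps, under my own simplifying assumptions (a power-of-two group size, brute-force scale search, and a symmetric integer grid; QuEST's actual implementation differs):

```python
# Step 1: a normalized fast Walsh-Hadamard transform to spread outliers.
# Step 2: a brute-force MSE search for the best symmetric quantization scale.

def hadamard(vec):
    # Recursive transform; length must be a power of 2. The 1/sqrt(2)
    # factor per level keeps the transform norm-preserving.
    n = len(vec)
    if n == 1:
        return list(vec)
    half = n // 2
    a = hadamard([vec[i] + vec[i + half] for i in range(half)])
    b = hadamard([vec[i] - vec[i + half] for i in range(half)])
    return [x / 2 ** 0.5 for x in a + b]

def quantize(vec, bits, scale):
    # Symmetric integer grid: round to the nearest step, clamp, dequantize.
    qmax = 2 ** (bits - 1) - 1
    q = [max(-qmax, min(qmax, round(x / scale))) for x in vec]
    return [v * scale for v in q]

def mse_optimal_quantize(vec, bits, candidates=50):
    # Fit the grid step by exhaustive MSE search over candidate scales.
    top = max(abs(x) for x in vec)
    best = None
    for k in range(1, candidates + 1):
        scale = top * k / candidates / (2 ** (bits - 1) - 1)
        deq = quantize(vec, bits, scale)
        err = sum((x - y) ** 2 for x, y in zip(vec, deq))
        if best is None or err < best[0]:
            best = (err, deq)
    return best[1]

w = [0.9, -0.1, 0.05, -0.02]   # one outlier-heavy weight group
w_hat = mse_optimal_quantize(hadamard(w), bits=4)
print(w_hat)
```

The Hadamard step spreads the single outlier across all coordinates, so the MSE-fitted grid wastes fewer quantization levels on one extreme value.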
This ensures accurate and fast quantization.</p><p>&#8594; For the backward pass, QuEST presents a Trust Gradient Estimator. This estimator minimizes the discrepancy between gradients calculated with quantized parameters and full-precision gradients.</p><p>&#8594; The Trust Gradient Estimator selectively reduces the influence of gradients from weight components that have high quantization errors in the forward pass. This strategy stabilizes training by mitigating the impact of outlier quantization errors on gradient updates.</p><p>-----</p><p>Key Insights &#128161;:</p><p>&#8594; QuEST enables stable training of LLMs with just 1-bit weights and activations. This was not previously considered feasible with standard Quantization Aware Training methods.</p><p>&#8594; Models trained with QuEST using 4-bit weights and activations achieve superior accuracy compared to models using Brain Float 16 (BF16) format, especially when considering model size and inference cost. QuEST shifts the Pareto-optimal frontier for LLM training to lower bit-widths.</p><p>&#8594; Hadamard Transform preprocessing significantly improves the effectiveness of the trust estimation method, contributing to stable training and better performance.</p><p>&#8594; Analysis of efficiency factors reveals that 4-bit quantization offers the optimal balance between model size, computational cost, and accuracy in overtraining scenarios.</p><p>-----</p><p>Results &#128202;:</p><p>&#8594; QuEST-trained models with 4-bit weights and activations outperform BF16 models nearly 4 times their size in terms of accuracy, as shown in Figure 1.</p><p>&#8594; QuEST demonstrates lower perplexity than LSQ (Learned Step Size Quantization), especially at lower bit-widths, as shown in Figure 3. 
For instance, at W4A4 configuration, QuEST achieves 29.08 perplexity, while LSQ reaches 30.27.</p><p>&#8594; GPU kernel implementation of QuEST achieves per-layer inference speedups from 1.2&#215; to 2.4&#215; for an 800M parameter model and 2.3&#215; to 3.9&#215; for a 7B parameter model compared to BF16, as depicted in Figure 6.</p><p>&#8594; End-to-end inference speedup of 1.3&#215; to 1.5&#215; is observed for QuEST INT4 over BF16 in prefill stage using an 800M parameter model on RTX 4090, as shown in Figure 7.</p><p>&#8594; Zero-shot evaluation on HellaSWAG benchmark shows comparable accuracy between QuEST 4-bit model (39.22%) and BF16 model (39.52%) of 800M parameters, as indicated in Table 2, suggesting near lossless quantization.</p>]]></content:encoded></item><item><title><![CDATA["QLIP: Text-Aligned Visual Tokenization Unifies Auto-Regressive Multimodal Understanding and Generation"]]></title><description><![CDATA[Below podcast on this paper is generated with Google's Illuminate.]]></description><link>https://www.rohan-paul.com/p/qlip-text-aligned-visual-tokenization</link><guid isPermaLink="false">https://www.rohan-paul.com/p/qlip-text-aligned-visual-tokenization</guid><dc:creator><![CDATA[Rohan Paul]]></dc:creator><pubDate>Sun, 16 Feb 2025 10:29:19 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!5u1u!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F314e305c-8ec8-4878-b9f0-41de8d883eee_1076x857.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!5u1u!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F314e305c-8ec8-4878-b9f0-41de8d883eee_1076x857.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" 
srcset="https://substackcdn.com/image/fetch/$s_!5u1u!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F314e305c-8ec8-4878-b9f0-41de8d883eee_1076x857.png 424w, https://substackcdn.com/image/fetch/$s_!5u1u!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F314e305c-8ec8-4878-b9f0-41de8d883eee_1076x857.png 848w, https://substackcdn.com/image/fetch/$s_!5u1u!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F314e305c-8ec8-4878-b9f0-41de8d883eee_1076x857.png 1272w, https://substackcdn.com/image/fetch/$s_!5u1u!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F314e305c-8ec8-4878-b9f0-41de8d883eee_1076x857.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!5u1u!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F314e305c-8ec8-4878-b9f0-41de8d883eee_1076x857.png" width="1076" height="857" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/314e305c-8ec8-4878-b9f0-41de8d883eee_1076x857.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:857,&quot;width&quot;:1076,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:290315,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!5u1u!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F314e305c-8ec8-4878-b9f0-41de8d883eee_1076x857.png 424w, https://substackcdn.com/image/fetch/$s_!5u1u!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F314e305c-8ec8-4878-b9f0-41de8d883eee_1076x857.png 848w, https://substackcdn.com/image/fetch/$s_!5u1u!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F314e305c-8ec8-4878-b9f0-41de8d883eee_1076x857.png 1272w, https://substackcdn.com/image/fetch/$s_!5u1u!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F314e305c-8ec8-4878-b9f0-41de8d883eee_1076x857.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><a href="https://arxiv.org/abs/2502.05178">https://arxiv.org/abs/2502.05178</a></p><p>The challenge is to create a unified model for both visual understanding and generation, but current visual tokenizers are optimized for only one of these tasks. This paper introduces Quantized Language-Image Pretraining (QLIP) to bridge this gap.</p><p>QLIP is a visual tokenizer trained with both reconstruction and language-image alignment objectives. It uses a two-stage training process and dynamic loss weighting to achieve state-of-the-art performance in both areas.</p><p>-----</p><p>&#128204; QLIP pioneers text-aligned visual tokenization. It uniquely integrates visual reconstruction with semantic understanding within the tokenizer itself. This enables superior zero-shot image classification at 74.3% accuracy and improved generation FID at 15.29.</p><p>&#128204; The two-stage training of QLIP is crucial. Stage one balances contrastive alignment and reconstruction using dynamic loss weights. Stage two optimizes reconstruction with perceptual and adversarial losses. This overcomes memory limits and gradient conflicts.</p><p>&#128204; QLIP's unified multimodal model, UM3, demonstrates versatility. It handles text-only, image-to-text, and text-to-image tasks within a single architecture. UM3 achieves comparable performance to specialized models, showcasing QLIP's effectiveness.</p><p>----------</p><p>Methods Explored in this Paper &#128295;:</p><p>&#8594; QLIP is a visual tokenizer based on Binary Spherical Quantization.
It is trained as an autoencoder to reconstruct images into discrete visual tokens.</p><p>&#8594; Simultaneously, QLIP is trained with a contrastive language-image objective. This aligns visual tokens with text embeddings, enhancing semantic representation.</p><p>&#8594; A two-stage training pipeline is employed. Stage 1 trains QLIP with both alignment and reconstruction losses using a memory-efficient Transformer. Stage 2 fine-tunes the quantizer and decoder with reconstruction losses, after freezing the visual encoder and dropping the text encoder. This improves reconstruction quality by using perceptual and adversarial losses without memory limitations.</p><p>&#8594; To balance the competing alignment and reconstruction objectives, an automated weighting scheme is introduced. Loss terms are weighted by the inverse of their post-hoc loss values. This addresses the differing gradient magnitudes and convergence rates of the two objectives.</p><p>-----</p><p>Key Insights &#128161;:</p><p>&#8594; Balancing reconstruction and alignment objectives during visual tokenizer training is crucial for achieving strong performance in both understanding and generation.</p><p>&#8594; A two-stage training process effectively addresses the challenges of large batch contrastive learning and memory-intensive reconstruction losses.</p><p>&#8594; Initializing the visual encoder with Masked Image Modeling or CLIP pre-training significantly accelerates convergence and improves performance.</p><p>-----</p><p>Results &#128202;:</p><p>&#8594; QLIP-B achieves 74.3% zero-shot classification accuracy on ImageNet, comparable to CLIP models while also providing visual tokenization.</p><p>&#8594; QLIP-B achieves a reconstruction rFID of 3.21, demonstrating state-of-the-art reconstruction quality for a semantically aligned tokenizer.</p><p>&#8594; In text-conditioned image generation on MS-COCO 30k, LlamaGen with QLIP-B achieves a generation FID of 15.29, outperforming LlamaGen with VQGAN 
(15.68).</p>]]></content:encoded></item><item><title><![CDATA["PlotGen: Multi-Agent LLM-based Scientific Data Visualization via Multimodal Feedback"]]></title><description><![CDATA[Below podcast on this paper is generated with Google's Illuminate.]]></description><link>https://www.rohan-paul.com/p/plotgen-multi-agent-llm-based-scientific</link><guid isPermaLink="false">https://www.rohan-paul.com/p/plotgen-multi-agent-llm-based-scientific</guid><dc:creator><![CDATA[Rohan Paul]]></dc:creator><pubDate>Sun, 16 Feb 2025 10:25:29 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!ZHad!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F536496ed-2e4b-4870-9367-bb3aeb134a7b_1090x769.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ZHad!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F536496ed-2e4b-4870-9367-bb3aeb134a7b_1090x769.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ZHad!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F536496ed-2e4b-4870-9367-bb3aeb134a7b_1090x769.png 424w, https://substackcdn.com/image/fetch/$s_!ZHad!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F536496ed-2e4b-4870-9367-bb3aeb134a7b_1090x769.png 848w, https://substackcdn.com/image/fetch/$s_!ZHad!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F536496ed-2e4b-4870-9367-bb3aeb134a7b_1090x769.png 1272w, 
https://substackcdn.com/image/fetch/$s_!ZHad!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F536496ed-2e4b-4870-9367-bb3aeb134a7b_1090x769.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ZHad!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F536496ed-2e4b-4870-9367-bb3aeb134a7b_1090x769.png" width="1090" height="769" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/536496ed-2e4b-4870-9367-bb3aeb134a7b_1090x769.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:769,&quot;width&quot;:1090,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:360731,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ZHad!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F536496ed-2e4b-4870-9367-bb3aeb134a7b_1090x769.png 424w, https://substackcdn.com/image/fetch/$s_!ZHad!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F536496ed-2e4b-4870-9367-bb3aeb134a7b_1090x769.png 848w, https://substackcdn.com/image/fetch/$s_!ZHad!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F536496ed-2e4b-4870-9367-bb3aeb134a7b_1090x769.png 1272w, 
https://substackcdn.com/image/fetch/$s_!ZHad!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F536496ed-2e4b-4870-9367-bb3aeb134a7b_1090x769.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><a href="https://arxiv.org/abs/2502.00988">https://arxiv.org/abs/2502.00988</a></p><p>The creation of scientific data visualizations is challenging for novice users due to the complexity of tools and visualization techniques.
Existing LLMs struggle with accuracy in generating visualization code and require iterative debugging.</p><p>This paper introduces PlotGen, a multi-agent framework that utilizes multimodal feedback to iteratively refine and enhance the accuracy of scientific visualizations generated by LLMs.</p><p>-----</p><p>&#128204; PlotGen effectively decomposes the complex visualization task into modular agents. Each agent specializes in query planning, code generation, or multimodal feedback. This division of labor enables targeted error correction and enhances overall accuracy.</p><p>&#128204; Multimodal feedback is a key innovation. PlotGen uses visual, lexical, and numerical feedback agents. These agents mimic human sensory input to iteratively refine plots generated by LLMs, addressing inherent code generation inaccuracies.</p><p>&#128204; PlotGen improves accessibility for novice users. By automating debugging through self-reflection and multimodal feedback, it lowers the technical barrier to creating accurate scientific visualizations, enhancing user productivity.</p><p>----------</p><p>Methods Explored in this Paper &#128295;:</p><p>&#8594; PlotGen employs a multi-agent framework to automate scientific data visualization.</p><p>&#8594; It starts with a Query Planning Agent. This agent breaks down complex user requests into a sequence of executable steps using chain-of-thought prompting.</p><p>&#8594; A Code Generation Agent then converts these steps into executable Python code for plotting. This agent also includes a self-debugging mechanism to handle code execution errors.</p><p>&#8594; PlotGen incorporates three feedback agents to refine the visualizations. 
These are Numeric, Lexical, and Visual Feedback Agents.</p><p>&#8594; The Numeric Feedback Agent ensures data accuracy and correct plot type by comparing de-rendered data from the plot with the original data.</p><p>&#8594; The Lexical Feedback Agent verifies the accuracy of textual elements like titles and labels by comparing them to user requirements and data.</p><p>&#8594; The Visual Feedback Agent assesses visual aspects such as color schemes and layout to ensure alignment with user specifications.</p><p>&#8594; These feedback agents provide iterative feedback to the Code Generation Agent for self-reflection and refinement of the generated plots.</p><p>-----</p><p>Key Insights &#128161;:</p><p>&#8594; Multimodal feedback is crucial for enhancing the accuracy of LLM-generated scientific visualizations.</p><p>&#8594; PlotGen leverages visual, lexical, and numerical feedback to rectify errors in data, text, and visual aesthetics through self-reflection.</p><p>&#8594; This multi-agent approach significantly improves the quality and trustworthiness of LLM-generated plots for scientific data visualization.</p><p>-----</p><p>Results &#128202;:</p><p>&#8594; PlotGen outperforms baseline methods such as Direct Decoding, Zero-Shot Chain-of-Thought, and MatPlotAgent on the MatPlotBench dataset.</p><p>&#8594; PlotGen achieves a 4-6% performance improvement over these strong baselines across various LLM configurations.</p><p>&#8594; User evaluations indicate increased user trust in PlotGen-generated visualizations and a reduction in debugging time for novice users.</p>]]></content:encoded></item><item><title><![CDATA["PILAF: Optimal Human Preference Sampling for Reward Modeling"]]></title><description><![CDATA[Below podcast on this paper is generated with Google's Illuminate.]]></description><link>https://www.rohan-paul.com/p/pilaf-optimal-human-preference-sampling</link><guid
isPermaLink="false">https://www.rohan-paul.com/p/pilaf-optimal-human-preference-sampling</guid><dc:creator><![CDATA[Rohan Paul]]></dc:creator><pubDate>Sun, 16 Feb 2025 10:20:04 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!B3kc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55b4068b-ef8b-43d9-adf4-71a7838fd697_1102x792.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!B3kc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55b4068b-ef8b-43d9-adf4-71a7838fd697_1102x792.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!B3kc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55b4068b-ef8b-43d9-adf4-71a7838fd697_1102x792.png 424w, https://substackcdn.com/image/fetch/$s_!B3kc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55b4068b-ef8b-43d9-adf4-71a7838fd697_1102x792.png 848w, https://substackcdn.com/image/fetch/$s_!B3kc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55b4068b-ef8b-43d9-adf4-71a7838fd697_1102x792.png 1272w, https://substackcdn.com/image/fetch/$s_!B3kc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55b4068b-ef8b-43d9-adf4-71a7838fd697_1102x792.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!B3kc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55b4068b-ef8b-43d9-adf4-71a7838fd697_1102x792.png" width="1102" height="792" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/55b4068b-ef8b-43d9-adf4-71a7838fd697_1102x792.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:792,&quot;width&quot;:1102,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:194962,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!B3kc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55b4068b-ef8b-43d9-adf4-71a7838fd697_1102x792.png 424w, https://substackcdn.com/image/fetch/$s_!B3kc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55b4068b-ef8b-43d9-adf4-71a7838fd697_1102x792.png 848w, https://substackcdn.com/image/fetch/$s_!B3kc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55b4068b-ef8b-43d9-adf4-71a7838fd697_1102x792.png 1272w, https://substackcdn.com/image/fetch/$s_!B3kc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F55b4068b-ef8b-43d9-adf4-71a7838fd697_1102x792.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><a href="https://arxiv.org/abs/2502.04270">https://arxiv.org/abs/2502.04270</a></p><p>The challenge in Reinforcement Learning from Human Feedback (RLHF) is that current methods for collecting preference data for reward modeling are inefficient and do not directly optimize for true human values. This paper addresses the misalignment between reward model training and maximizing the actual human preference objective in RLHF.</p><p>This paper proposes Policy-Interpolated Learning for Aligned Feedback (PILAF). PILAF is a novel sampling strategy for preference labeling.
It aligns preference learning with the goal of maximizing the underlying true human reward.</p><p>-----</p><p>&#128204; PILAF directly tackles the objective misalignment in Reinforcement Learning from Human Feedback. It aligns reward model training with the true preference objective by shaping the preference sampling, enhancing learning efficiency.</p><p>&#128204; Statistically, PILAF strategically samples data in directions of maximal objective sensitivity. This reduces variance in preference data, leading to more robust and data-efficient reward model training.</p><p>&#128204; PILAF's practical strength lies in its simple implementation and lack of hyperparameters. This allows for immediate improvement in existing Direct Preference Optimization pipelines with minimal overhead.</p><p>----------</p><p>Methods Explored in this Paper &#128295;:</p><p>&#8594; The paper introduces Theoretically Grounded Policy-Interpolated Learning for Aligned Feedback (T-PILAF). T-PILAF generates response pairs by interpolating between the current policy and a reference policy. This balances exploration and exploitation during data collection.</p><p>&#8594; T-PILAF uses two modified policies, &#960;+ and &#960;&#8722;, derived from the current policy &#960;. &#960;+ encourages exploration in areas favored by &#960;, and &#960;&#8722; explores less favored areas.</p><p>&#8594; PILAF is presented as a practical simplification of T-PILAF. PILAF avoids complex normalization factors and approximates policy interpolation token-wise.</p><p>&#8594; PILAF samples responses either from the current policy &#960; or by interpolating the logits of the current policy &#960; and the reference policy &#960;_ref. This interpolation uses the KL regularization coefficient &#946;.</p><p>-----</p><p>Key Insights &#128161;:</p><p>&#8594; T-PILAF aligns the gradient of the reward model loss with the policy gradient of the true human preference objective.
This alignment ensures that minimizing the reward model loss directly contributes to maximizing human values.</p><p>&#8594; T-PILAF improves statistical efficiency. It aligns optimization with directions of greatest sensitivity in the human preference objective. This leads to more informative preference data and reduces training variance.</p><p>&#8594; PILAF, derived from T-PILAF, inherits these theoretical benefits. It improves sample efficiency and performance in practical RLHF settings.</p><p>-----</p><p>Results &#128202;:</p><p>&#8594; In iterative Direct Preference Optimization (DPO), PILAF achieves baseline reward levels with 40% less training time.</p><p>&#8594; In online DPO, PILAF demonstrates a better Reward-KL trade-off. It achieves higher reward with lower KL divergence compared to Vanilla and Hybrid Sampling.</p><p>&#8594; In robustness tests with an overfitted initial model, PILAF escapes suboptimal regions and achieves higher reward and lower KL divergence than Vanilla Sampling.</p><p>&#8594; PILAF reduces annotation and computation costs by over 40% in iterative DPO while maintaining comparable performance.</p>]]></content:encoded></item><item><title><![CDATA["Partially Rewriting a Transformer in Natural Language"]]></title><description><![CDATA[Below podcast on this paper is generated with Google's Illuminate.]]></description><link>https://www.rohan-paul.com/p/partially-rewriting-a-transformer</link><guid isPermaLink="false">https://www.rohan-paul.com/p/partially-rewriting-a-transformer</guid><dc:creator><![CDATA[Rohan Paul]]></dc:creator><pubDate>Sun, 16 Feb 2025 10:09:03 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!wlNn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70056808-391b-486c-9872-172903bda68e_1082x875.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 
is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!wlNn!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70056808-391b-486c-9872-172903bda68e_1082x875.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!wlNn!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70056808-391b-486c-9872-172903bda68e_1082x875.png 424w, https://substackcdn.com/image/fetch/$s_!wlNn!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70056808-391b-486c-9872-172903bda68e_1082x875.png 848w, https://substackcdn.com/image/fetch/$s_!wlNn!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70056808-391b-486c-9872-172903bda68e_1082x875.png 1272w, https://substackcdn.com/image/fetch/$s_!wlNn!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70056808-391b-486c-9872-172903bda68e_1082x875.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!wlNn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70056808-391b-486c-9872-172903bda68e_1082x875.png" width="1082" height="875" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/70056808-391b-486c-9872-172903bda68e_1082x875.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:875,&quot;width&quot;:1082,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:358810,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!wlNn!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70056808-391b-486c-9872-172903bda68e_1082x875.png 424w, https://substackcdn.com/image/fetch/$s_!wlNn!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70056808-391b-486c-9872-172903bda68e_1082x875.png 848w, https://substackcdn.com/image/fetch/$s_!wlNn!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70056808-391b-486c-9872-172903bda68e_1082x875.png 1272w, https://substackcdn.com/image/fetch/$s_!wlNn!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70056808-391b-486c-9872-172903bda68e_1082x875.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><a href="https://arxiv.org/abs/2501.18838">https://arxiv.org/abs/2501.18838</a></p><p>The challenge is understanding the internal workings of LLMs. Current interpretability methods are limited. This paper explores rewriting parts of an LLM using natural language for better understanding.</p><p>This paper proposes a method to partially rewrite a Transformer layer with natural language. It uses sparse representations and LLMs to explain and simulate neuron activations, then integrates these back into the original model.</p><p>-----</p><p>&#128204; Sparse transcoders offer a path to interpretability by approximating neural networks with sparse, explainable features. Natural language descriptions, though imperfect, bridge the gap to human understanding.</p><p>&#128204; Quantile normalization is essential for calibrating LLM-predicted activations. 
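As a rough illustration of this calibration step (a sketch only: the function name and the mid-rank interpolation scheme are assumptions, not the paper's implementation), quantile normalization replaces each predicted value with the true distribution's value at the same rank:

```python
def quantile_normalize(predicted, true_reference):
    """Map each predicted activation onto the empirical quantile function
    of the true activation distribution (illustrative sketch)."""
    ref_sorted = sorted(true_reference)
    n, m = len(predicted), len(true_reference)
    # Rank of each prediction among its peers, converted to a mid-rank quantile.
    order = sorted(range(n), key=lambda i: predicted[i])
    ranks = {idx: r for r, idx in enumerate(order)}
    calibrated = []
    for i in range(n):
        q = (ranks[i] + 0.5) / n          # quantile in (0, 1)
        pos = q * (m - 1)                 # fractional index into sorted reference
        lo = int(pos)
        hi = min(lo + 1, m - 1)
        frac = pos - lo
        # Linear interpolation between neighboring reference values.
        calibrated.append(ref_sorted[lo] * (1 - frac) + ref_sorted[hi] * frac)
    return calibrated
```

The calibrated values keep the ordering of the raw predictions while inheriting the scale and shape of the true activation distribution.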
Raw LLM outputs for neuron activity are poorly distributed, hindering direct integration without statistical correction.</p><p>&#128204; Current natural language explanations are not yet sufficiently precise to fully replace neural network components. The rewritten model's performance remains close to that of zero ablation, indicating a need for richer, more specific explanations.</p><p>----------</p><p>Methods Explored in this Paper &#128295;:</p><p>&#8594; The paper trains a sparse transcoder. The transcoder approximates a feedforward network within the LLM. It uses a TopK activation function to enforce sparsity.</p><p>&#8594; A skip connection is added to the transcoder architecture. This improves approximation without affecting interpretability. A sparse autoencoder (SAE) is also trained on the residual stream.</p><p>&#8594; An automated interpretability pipeline generates natural language explanations for transcoder and SAE latents. The same pipeline scores the quality of these explanations.</p><p>&#8594; An LLM then predicts each latent's activation. It uses the generated explanation and the surrounding text context. This prediction is made for a single token.</p><p>&#8594; Quantile normalization calibrates the LLM's activation predictions. This aligns the predicted activation distribution with the true activation distribution. This step is crucial for performance.</p><p>-----</p><p>Key Insights &#128161;:</p><p>&#8594; Current automatically generated natural language explanations are not sufficiently specific. This lack of specificity hinders performance when rewriting model components.</p><p>&#8594; Simply predicting whether a feature is active is insufficient. Explanations must also accurately identify contexts where a feature is <em>not</em> active. Specificity is as important as sensitivity.</p><p>&#8594; Quantile normalization significantly improves performance.
It addresses the issue of low specificity in explanations by correcting falsely activated neurons.</p><p>-----</p><p>Results &#128202;:</p><p>&#8594; Rewriting the entire transcoder with natural language explanations results in a cross-entropy loss similar to a Pythia model trained on only 10-15% of the data. This is comparable to replacing the transcoder output with zeros.</p><p>&#8594; Randomly selecting latents for rewriting leads to worse performance than zeroing out the MLP.</p><p>&#8594; Using quantile normalization significantly improves performance compared to not using it. Normalization makes rewritten models perform better than zeroing out components.</p><p>&#8594; Detection scores correlate with explanation quality. Higher detection scores indicate more specific and sensitive explanations.</p>]]></content:encoded></item><item><title><![CDATA["On-device Sora: Enabling Diffusion-Based Text-to-Video Generation for Mobile Devices"]]></title><description><![CDATA[Below podcast on this paper is generated with Google's Illuminate.]]></description><link>https://www.rohan-paul.com/p/on-device-sora-enabling-diffusion</link><guid isPermaLink="false">https://www.rohan-paul.com/p/on-device-sora-enabling-diffusion</guid><dc:creator><![CDATA[Rohan Paul]]></dc:creator><pubDate>Sat, 15 Feb 2025 14:21:55 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!cIpN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3914a985-e6fc-452d-8332-24e2c7f38ea5_888x873.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!cIpN!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3914a985-e6fc-452d-8332-24e2c7f38ea5_888x873.png" data-component-name="Image2ToDOM"><div 
class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!cIpN!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3914a985-e6fc-452d-8332-24e2c7f38ea5_888x873.png 424w, https://substackcdn.com/image/fetch/$s_!cIpN!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3914a985-e6fc-452d-8332-24e2c7f38ea5_888x873.png 848w, https://substackcdn.com/image/fetch/$s_!cIpN!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3914a985-e6fc-452d-8332-24e2c7f38ea5_888x873.png 1272w, https://substackcdn.com/image/fetch/$s_!cIpN!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3914a985-e6fc-452d-8332-24e2c7f38ea5_888x873.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!cIpN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3914a985-e6fc-452d-8332-24e2c7f38ea5_888x873.png" width="888" height="873" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3914a985-e6fc-452d-8332-24e2c7f38ea5_888x873.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:873,&quot;width&quot;:888,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:434108,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" 
srcset="https://substackcdn.com/image/fetch/$s_!cIpN!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3914a985-e6fc-452d-8332-24e2c7f38ea5_888x873.png 424w, https://substackcdn.com/image/fetch/$s_!cIpN!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3914a985-e6fc-452d-8332-24e2c7f38ea5_888x873.png 848w, https://substackcdn.com/image/fetch/$s_!cIpN!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3914a985-e6fc-452d-8332-24e2c7f38ea5_888x873.png 1272w, https://substackcdn.com/image/fetch/$s_!cIpN!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3914a985-e6fc-452d-8332-24e2c7f38ea5_888x873.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><a href="https://arxiv.org/abs/2502.04363">https://arxiv.org/abs/2502.04363</a></p><p>The challenge lies in running demanding text-to-video generation models on resource-limited mobile devices. Current diffusion-based video generation is computationally and memory intensive, making it inaccessible on smartphones.</p><p>This paper introduces On-device Sora. It is a pioneering solution to enable diffusion-based text-to-video generation directly on mobile devices. On-device Sora employs three novel techniques to overcome resource constraints, maintaining video quality comparable to server-grade performance.</p><p>-----</p><p>&#128204; Linear Proportional Leap effectively reduces diffusion steps by half. This method smartly exploits Rectified Flow's linear trajectory. It significantly accelerates video generation on devices without retraining.</p><p>&#128204; Temporal Dimension Token Merging cleverly merges tokens along the time dimension. This reduces computational load in attention layers by a quarter. It maintains video quality while boosting efficiency.</p><p>&#128204; Concurrent Inference with Dynamic Loading overcomes memory constraints on devices. It allows for efficient execution of large models by concurrent processing and dynamic block management.</p><p>----------</p><p>Methods Explored in this Paper &#128295;:</p><p>&#8594; Linear Proportional Leap (LPL) is introduced to reduce the number of denoising steps. LPL leverages the linear trajectory of Rectified Flow to make direct "leaps" in the denoising process.
This avoids performing all originally required denoising steps.</p><p>&#8594; Temporal Dimension Token Merging (TDTM) is proposed to minimize token processing computation. TDTM merges consecutive video frame tokens in the temporal dimension within the attention layers of the Spatial Temporal Diffusion Transformer (STDiT). This reduces the number of tokens and computational load.</p><p>&#8594; Concurrent Inference with Dynamic Loading (CI-DL) addresses memory limitations. CI-DL partitions large models into smaller blocks. It then dynamically loads these blocks into memory for concurrent model inference and block loading. This optimizes memory use and speeds up processing.</p><p>-----</p><p>Key Insights &#128161;:</p><p>&#8594; On-device Sora is the first framework to enable high-quality diffusion-based text-to-video generation on smartphones.</p><p>&#8594; The proposed methods, LPL, TDTM, and CI-DL, significantly improve efficiency without compromising video quality.</p><p>&#8594; On-device video generation enhances user privacy, reduces cloud dependency, and lowers costs.</p><p>-----</p><p>Results &#128202;:</p><p>&#8594; Achieves up to 1.94&#215; speedup using Linear Proportional Leap (LPL) while maintaining comparable video quality.</p><p>&#8594; Temporal Dimension Token Merging (TDTM) provides up to 1.27&#215; speedup with stable video quality metrics.</p><p>&#8594; Concurrent Inference with Dynamic Loading (CI-DL) reduces STDiT inference latency by approximately 25%, from 1000 to 750 seconds for 30 denoising steps.</p>]]></content:encoded></item><item><title><![CDATA["Ola: Pushing the Frontiers of Omni-Modal Language Model with Progressive Modality Alignment"]]></title><description><![CDATA[Below podcast on this paper is generated with Google's Illuminate.]]></description><link>https://www.rohan-paul.com/p/ola-pushing-the-frontiers-of-omni-836</link><guid 
isPermaLink="false">https://www.rohan-paul.com/p/ola-pushing-the-frontiers-of-omni-836</guid><dc:creator><![CDATA[Rohan Paul]]></dc:creator><pubDate>Sat, 15 Feb 2025 14:18:24 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!MvV2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c2bd4a8-92e9-43a9-a006-d8fb24d4c334_808x867.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!MvV2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c2bd4a8-92e9-43a9-a006-d8fb24d4c334_808x867.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!MvV2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c2bd4a8-92e9-43a9-a006-d8fb24d4c334_808x867.png 424w, https://substackcdn.com/image/fetch/$s_!MvV2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c2bd4a8-92e9-43a9-a006-d8fb24d4c334_808x867.png 848w, https://substackcdn.com/image/fetch/$s_!MvV2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c2bd4a8-92e9-43a9-a006-d8fb24d4c334_808x867.png 1272w, https://substackcdn.com/image/fetch/$s_!MvV2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c2bd4a8-92e9-43a9-a006-d8fb24d4c334_808x867.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!MvV2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c2bd4a8-92e9-43a9-a006-d8fb24d4c334_808x867.png" width="808" height="867" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9c2bd4a8-92e9-43a9-a006-d8fb24d4c334_808x867.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:867,&quot;width&quot;:808,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:285527,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!MvV2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c2bd4a8-92e9-43a9-a006-d8fb24d4c334_808x867.png 424w, https://substackcdn.com/image/fetch/$s_!MvV2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c2bd4a8-92e9-43a9-a006-d8fb24d4c334_808x867.png 848w, https://substackcdn.com/image/fetch/$s_!MvV2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c2bd4a8-92e9-43a9-a006-d8fb24d4c334_808x867.png 1272w, https://substackcdn.com/image/fetch/$s_!MvV2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9c2bd4a8-92e9-43a9-a006-d8fb24d4c334_808x867.png 1456w" sizes="100vw" fetchpriority="high"></picture></div></a></figure></div><p><a href="https://arxiv.org/abs/2502.04328">https://arxiv.org/abs/2502.04328</a></p><p>The challenge in creating omni-modal models is their performance gap compared to specialized single modality models. Existing omni-modal models also lack balanced performance and efficient training.</p><p>This paper introduces Ola, an Omni-modal Language Model, using progressive modality alignment to address these issues. Ola achieves competitive performance across image, video, and audio tasks.</p><p>-----</p><p>&#128204; Progressive modality alignment in Ola effectively manages multi-modal interference. Training starts with text-image pairs. Then it progressively integrates audio and video.
This staged approach avoids catastrophic forgetting and improves overall performance.</p><p>&#128204; Ola's architecture uses modality-specific encoders with a shared LLM decoder. This design choice allows efficient feature extraction from diverse inputs. The Local-Global Attention Pooling further optimizes visual token processing for better efficiency.</p><p>&#128204; Generating cross-modal video-audio data is a key innovation. This data bridges the gap between vision and audio. It enhances the model's ability to understand real-world scenarios where modalities are naturally correlated.</p><p>----------</p><p>Methods Explored in this Paper &#128295;:</p><p>&#8594; Ola employs a progressive modality alignment strategy.</p><p>&#8594; It begins training with image and text data. This establishes core vision-language knowledge.</p><p>&#8594; Then, speech data is added. Speech bridges language and audio understanding.</p><p>&#8594; Finally, video data is incorporated. Video connects all modalities: vision, audio, and language.</p><p>&#8594; Ola's architecture includes modality-specific encoders. These handle text, image, video, and audio inputs.</p><p>&#8594; A Local-Global Attention Pooling layer fuses visual inputs. This reduces visual token length efficiently.</p><p>&#8594; Dual audio encoders are used: Whisper for speech and BEATs for music. This allows for richer audio understanding.</p><p>&#8594; Sentence-wise streaming decoding is implemented for real-time text and speech generation.</p><p>-----</p><p>Key Insights &#128161;:</p><p>&#8594; Progressive modality alignment improves omni-modal learning. It breaks down complex training into manageable steps.</p><p>&#8594; This strategy maintains a small cross-modal alignment data size. It leverages existing vision-language model advancements.</p><p>&#8594; Video data acts as a crucial bridge between audio and vision.
It provides comprehensive multi-modal information.</p><p>&#8594; Cross-modal video-audio data generation enhances performance. This data captures relationships between video and audio.</p><p>-----</p><p>Results &#128202;:</p><p>&#8594; Ola achieves 84.3% accuracy on MMBench-1.1.</p><p>&#8594; Ola achieves 70.8% accuracy on MMStar.</p><p>&#8594; Ola achieves 57.0% accuracy on MMMU.</p><p>&#8594; Ola achieves 68.4% accuracy on VideoMME.</p><p>&#8594; Ola achieves 3.1 mean WER on LibriSpeech.</p><p>&#8594; Ola achieves 6.41 GPT-eval score on AIR-Bench.</p>]]></content:encoded></item><item><title><![CDATA["No Task Left Behind: Isotropic Model Merging with Common and Task-Specific Subspaces"]]></title><description><![CDATA[Below podcast on this paper is generated with Google's Illuminate.]]></description><link>https://www.rohan-paul.com/p/no-task-left-behind-isotropic-model</link><guid isPermaLink="false">https://www.rohan-paul.com/p/no-task-left-behind-isotropic-model</guid><dc:creator><![CDATA[Rohan Paul]]></dc:creator><pubDate>Sat, 15 Feb 2025 13:49:31 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/$s_!1L4N!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F082b6fa2-141c-4a3f-a6c9-1c340e09cf2e_1083x873.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!1L4N!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F082b6fa2-141c-4a3f-a6c9-1c340e09cf2e_1083x873.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!1L4N!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F082b6fa2-141c-4a3f-a6c9-1c340e09cf2e_1083x873.png 424w, 
https://substackcdn.com/image/fetch/$s_!1L4N!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F082b6fa2-141c-4a3f-a6c9-1c340e09cf2e_1083x873.png 848w, https://substackcdn.com/image/fetch/$s_!1L4N!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F082b6fa2-141c-4a3f-a6c9-1c340e09cf2e_1083x873.png 1272w, https://substackcdn.com/image/fetch/$s_!1L4N!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F082b6fa2-141c-4a3f-a6c9-1c340e09cf2e_1083x873.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!1L4N!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F082b6fa2-141c-4a3f-a6c9-1c340e09cf2e_1083x873.png" width="1083" height="873" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/082b6fa2-141c-4a3f-a6c9-1c340e09cf2e_1083x873.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:873,&quot;width&quot;:1083,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:360102,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!1L4N!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F082b6fa2-141c-4a3f-a6c9-1c340e09cf2e_1083x873.png 424w, 
https://substackcdn.com/image/fetch/$s_!1L4N!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F082b6fa2-141c-4a3f-a6c9-1c340e09cf2e_1083x873.png 848w, https://substackcdn.com/image/fetch/$s_!1L4N!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F082b6fa2-141c-4a3f-a6c9-1c340e09cf2e_1083x873.png 1272w, https://substackcdn.com/image/fetch/$s_!1L4N!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F082b6fa2-141c-4a3f-a6c9-1c340e09cf2e_1083x873.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><a href="https://arxiv.org/abs/2502.04959">https://arxiv.org/abs/2502.04959</a></p><p>The problem is that merging multiple task-specific models into one multi-task model still results in a performance drop compared to individual task-specific models. This paper introduces a new merging method to reduce this performance gap by analyzing the properties of task matrices.</p><p>This paper proposes "Isotropic Merging". It flattens the singular value spectrum of merged task matrices to improve alignment between task-specific and merged subspaces. This approach enhances performance without additional training.</p><p>-----</p><p>Okay, here are my technical perspectives on the paper's solution:</p><p>&#128204; Isotropic Merging cleverly uses Singular Value Decomposition to flatten the singular value spectrum. This balances task representation in merged models, boosting multi-task performance without extra training.</p><p>&#128204; The paper shifts focus from cosine similarity to subspace alignment for effective merging. By aligning task and merged subspaces, Isotropic Merging achieves superior performance compared to Task Arithmetic.</p><p>&#128204; The Iso-CTS method practically combines common and task-specific subspaces. This simple yet effective approach reaches state-of-the-art model merging results across diverse tasks and model scales.</p><p>----------</p><p>Methods Explored in this Paper &#128295;:</p><p>&#8594; This paper explores model merging through the lens of Singular Value Decomposition of task matrices. A task matrix is the weight update matrix applied to a pre-trained model for a specific task.</p><p>&#8594; Introduces the Subspace Alignment Ratio metric. It quantifies the similarity between subspaces of task-specific matrices and merged matrices.</p><p>&#8594; Isotropic Merging in Common Subspace (Iso-C) is proposed. 
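A minimal numpy sketch of the spectrum-flattening idea described above (the toy matrix shapes and the helper name iso_merge are my own illustration, not the paper's code):

```python
import numpy as np

def iso_merge(task_matrices, alpha=1.0):
    # Sum the task matrices (Task Arithmetic style), then make the
    # spectrum isotropic: rebuild the sum with every singular value
    # replaced by the mean singular value, so all singular directions
    # contribute equally to the merged update.
    merged = np.sum(task_matrices, axis=0)
    U, s, Vt = np.linalg.svd(merged, full_matrices=False)
    s_iso = np.full_like(s, s.mean())
    return alpha * (U * s_iso) @ Vt

# Toy example: merge two random 4x4 "task matrices".
rng = np.random.default_rng(0)
deltas = [rng.standard_normal((4, 4)) for _ in range(2)]
merged_iso = iso_merge(deltas)
# After flattening, all singular values of the result are equal.
print(np.linalg.svd(merged_iso, compute_uv=False))
```

The merged update would then be added on top of the pre-trained weights, as in Task Arithmetic.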
Iso-C flattens the singular value spectrum of the merged task matrix to make it uniform. This is done by scaling all singular directions to the average singular value.</p><p>&#8594; Isotropic Merging in Common and Task-Specific Subspaces (Iso-CTS) is also proposed. Iso-CTS enhances Iso-C by incorporating task-specific directions. It retains top singular vectors from a common subspace and adds task-specific directions orthogonal to the common subspace.</p><p>-----</p><p>Key Insights &#128161;:</p><p>&#8594; Alignment between singular components of task-specific and merged matrices strongly correlates with performance improvement.</p><p>&#8594; Flattening the singular value spectrum of merged matrices improves subspace alignment. This leads to better multi-task performance.</p><p>&#8594; Incorporating task-specific subspaces along with a common subspace further enhances merging performance, especially with a larger number of tasks.</p><p>-----</p><p>Results &#128202;:</p><p>&#8594; Iso-CTS achieves state-of-the-art performance across various tasks and model scales (ViT-B/32, ViT-B/16, ViT-L/14).</p><p>&#8594; Iso-CTS outperforms Task Arithmetic by a large margin, especially when merging 14 and 20 tasks, with up to 2.8% absolute accuracy improvement.</p><p>&#8594; Iso-C and Iso-CTS are more robust to the choice of scaling factor &#945; compared to Task Arithmetic.</p>]]></content:encoded></item><item><title><![CDATA["MotionLab: Unified Human Motion Generation and Editing via the Motion-Condition-Motion Paradigm"]]></title><description><![CDATA[The podcast below on this paper was generated with Google's Illuminate.]]></description><link>https://www.rohan-paul.com/p/motionlab-unified-human-motion-generation</link><guid isPermaLink="false">https://www.rohan-paul.com/p/motionlab-unified-human-motion-generation</guid><dc:creator><![CDATA[Rohan Paul]]></dc:creator><pubDate>Sat, 15 Feb 2025 13:46:31 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!R85x!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8083393-f443-42b4-ac3c-f9724593ec82_890x810.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!R85x!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8083393-f443-42b4-ac3c-f9724593ec82_890x810.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!R85x!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8083393-f443-42b4-ac3c-f9724593ec82_890x810.png 424w, https://substackcdn.com/image/fetch/$s_!R85x!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8083393-f443-42b4-ac3c-f9724593ec82_890x810.png 848w, https://substackcdn.com/image/fetch/$s_!R85x!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8083393-f443-42b4-ac3c-f9724593ec82_890x810.png 1272w, https://substackcdn.com/image/fetch/$s_!R85x!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8083393-f443-42b4-ac3c-f9724593ec82_890x810.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!R85x!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8083393-f443-42b4-ac3c-f9724593ec82_890x810.png" width="890" height="810" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f8083393-f443-42b4-ac3c-f9724593ec82_890x810.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:810,&quot;width&quot;:890,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:538425,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!R85x!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8083393-f443-42b4-ac3c-f9724593ec82_890x810.png 424w, https://substackcdn.com/image/fetch/$s_!R85x!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8083393-f443-42b4-ac3c-f9724593ec82_890x810.png 848w, https://substackcdn.com/image/fetch/$s_!R85x!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8083393-f443-42b4-ac3c-f9724593ec82_890x810.png 1272w, https://substackcdn.com/image/fetch/$s_!R85x!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff8083393-f443-42b4-ac3c-f9724593ec82_890x810.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><a href="https://arxiv.org/abs/2502.02358">https://arxiv.org/abs/2502.02358</a></p><p>The paper addresses the challenge of isolated solutions for human motion generation and editing tasks. Current methods lack versatility, fine-grained control and knowledge sharing across tasks.</p><p>This paper introduces MotionLab, a unified framework based on the Motion-Condition-Motion paradigm. MotionLab uses rectified flows and a MotionFlow Transformer to map source motion to target motion, guided by conditions. It incorporates Aligned Rotational Position Encoding and Task Instruction Modulation for effective multi-task learning via Motion Curriculum Learning.</p><p>-----</p><p>&#128204; Motion-Condition-Motion paradigm offers a simple yet effective abstraction. It unifies motion generation and editing. Rectified flows enable efficient mapping between source and target motions.</p><p>&#128204; MotionFlow Transformer with Joint Attention and Condition Path is key. 
It facilitates multi-modal interaction. Adaptive Layer Normalization enhances conditional control without task-specific modules.</p><p>&#128204; Aligned Rotational Position Encoding addresses temporal misalignment. It is crucial for time-sensitive motion tasks. Motion Curriculum Learning enables effective multi-task training.</p><p>----------</p><p>Methods Explored in this Paper &#128295;:</p><p>&#8594; MotionLab uses the Motion-Condition-Motion paradigm. This paradigm unifies motion generation and editing tasks using source motion, condition, and target motion concepts.</p><p>&#8594; MotionLab framework is built around the MotionFlow Transformer (MFT). MFT leverages rectified flows to learn the mapping from source motion to target motion based on conditions.</p><p>&#8594; MFT includes Joint Attention to enable interaction between tokens from different modalities. A Condition Path differentiates modalities and extracts representations. Aligned Rotational Position Encoding (ROPE) ensures time synchronization.</p><p>&#8594; Task Instruction Modulation is used to differentiate tasks by adding a task-specific instruction embedding into the MFT.</p><p>&#8594; Motion Curriculum Learning is a training strategy. It organizes tasks by difficulty for effective multi-task learning and knowledge sharing. 
Training is divided into masked pre-training and supervised fine-tuning stages.</p><p>-----</p><p>Key Insights &#128161;:</p><p>&#8594; The Motion-Condition-Motion paradigm effectively unifies diverse human motion generation and editing tasks.</p><p>&#8594; The MotionLab framework, with its MotionFlow Transformer and Motion Curriculum Learning, achieves versatility and strong performance across various motion tasks.</p><p>&#8594; Aligned ROPE is crucial for maintaining temporal synchronization between source and target motions, improving performance in time-sensitive tasks.</p><p>&#8594; Task Instruction Modulation and Motion Curriculum Learning are essential for effective multi-task learning and knowledge sharing across different motion tasks.</p><p>-----</p><p>Results &#128202;:</p><p>&#8594; MotionLab achieves an FID score of 0.223 in text-based motion generation on the HumanML3D dataset.</p><p>&#8594; In trajectory-based motion generation on HumanML3D, MotionLab achieves an average error of 0.0334 when controlling all joints.</p><p>&#8594; For text-based motion editing on the MotionFix dataset, MotionLab attains a retrieval rate (R@1) of 56.34%.</p><p>&#8594; In trajectory-based motion editing on MotionFix, MotionLab reaches a retrieval rate (R@1) of 72.65%.</p><p>&#8594; MotionLab achieves a Style Recognition Accuracy (SRA) of 64.97% in motion style transfer.</p>]]></content:encoded></item><item><title><![CDATA["MotionCanvas: Cinematic Shot Design with Controllable Image-to-Video Generation"]]></title><description><![CDATA[The podcast below on this paper was generated with Google's Illuminate.]]></description><link>https://www.rohan-paul.com/p/motioncanvas-cinematic-shot-design</link><guid isPermaLink="false">https://www.rohan-paul.com/p/motioncanvas-cinematic-shot-design</guid><dc:creator><![CDATA[Rohan Paul]]></dc:creator><pubDate>Sat, 15 Feb 2025 13:42:39 GMT</pubDate><enclosure 
url="https://substackcdn.com/image/fetch/$s_!2263!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f155073-2c1c-4de0-9237-3cf3ea238749_763x854.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!2263!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f155073-2c1c-4de0-9237-3cf3ea238749_763x854.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!2263!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f155073-2c1c-4de0-9237-3cf3ea238749_763x854.png 424w, https://substackcdn.com/image/fetch/$s_!2263!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f155073-2c1c-4de0-9237-3cf3ea238749_763x854.png 848w, https://substackcdn.com/image/fetch/$s_!2263!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f155073-2c1c-4de0-9237-3cf3ea238749_763x854.png 1272w, https://substackcdn.com/image/fetch/$s_!2263!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f155073-2c1c-4de0-9237-3cf3ea238749_763x854.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!2263!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f155073-2c1c-4de0-9237-3cf3ea238749_763x854.png" width="763" height="854" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1f155073-2c1c-4de0-9237-3cf3ea238749_763x854.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:854,&quot;width&quot;:763,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:474280,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!2263!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f155073-2c1c-4de0-9237-3cf3ea238749_763x854.png 424w, https://substackcdn.com/image/fetch/$s_!2263!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f155073-2c1c-4de0-9237-3cf3ea238749_763x854.png 848w, https://substackcdn.com/image/fetch/$s_!2263!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f155073-2c1c-4de0-9237-3cf3ea238749_763x854.png 1272w, https://substackcdn.com/image/fetch/$s_!2263!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1f155073-2c1c-4de0-9237-3cf3ea238749_763x854.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><a href="https://arxiv.org/abs/2502.04299">https://arxiv.org/abs/2502.04299</a></p><p>The paper addresses the problem of limited user control in image-to-video generation, particularly for cinematic shot design involving both camera and object motions. Current methods lack intuitive and precise control over these intertwined movements, hindering creative expression.</p><p>This paper proposes MotionCanvas, a system that enables cinematic shot design by integrating user-driven controls for camera and object motions into image-to-video generation models. It uses a Motion Signal Translation module to bridge user intent and model conditioning.</p><p>-----</p><p>&#128204; MotionCanvas introduces a crucial Motion Signal Translation module. This module smartly converts user-friendly 3D motion intents into 2D screen-space signals. 
This bridging is key for effective video diffusion model control.</p><p>&#128204; The strength of MotionCanvas lies in achieving 3D-aware cinematic control using only 2D training data. This bypasses the need for costly 3D annotations. It leverages depth estimation and geometric transformations for scene understanding.</p><p>&#128204; MotionCanvas offers immediate practical value. It empowers users to design complex cinematic shots with combined camera and object motions. This enhanced control improves creative workflows in video content creation tasks.</p><p>----------</p><p>Methods Explored in this Paper &#128295;:</p><p>&#8594; MotionCanvas uses a Motion Design Module to capture user intentions for camera motion, object global motion, and object local motion. Users can define camera paths with 3D poses or base motion patterns. Scene-anchored bounding boxes control object global motion. Point trajectories control object local motion. Timing control is also integrated.</p><p>&#8594; The Motion Signal Translation module converts scene-space motion designs into screen-space motion signals. Camera movement is represented by 2D point tracks derived from 3D camera paths using depth estimation. Scene-aware object motion uses bounding box trajectories converted to screen space, considering camera motion. Object local motion uses point trajectories transformed to screen space, accounting for both camera and global object motion.</p><p>&#8594; A motion-conditioned video generation model, based on Diffusion Transformer (DiT), is fine-tuned with these screen-space motion signals. Point trajectories are encoded using Discrete Cosine Transform (DCT) coefficients. Bounding box sequences are encoded as color-coded masks. 
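As a rough numpy illustration of the DCT-based trajectory encoding just mentioned (the trajectory length, truncation level k, and function names are my own assumptions, not the paper's implementation):

```python
import numpy as np

def dct_basis(n):
    # Orthonormal DCT-II basis matrix (n, n): rows are cosine basis vectors.
    t = np.arange(n)
    basis = np.cos(np.pi * (t[None, :] + 0.5) * t[:, None] / n) * np.sqrt(2.0 / n)
    basis[0] /= np.sqrt(2.0)
    return basis

def encode_trajectory(track, k):
    # track: (T, 2) screen-space points -> first k DCT coefficients per axis.
    coeffs = dct_basis(len(track)) @ track
    return coeffs[:k]

def decode_trajectory(coeffs, T):
    # Zero-pad the truncated coefficients and invert with the transpose
    # (the basis is orthonormal, so its transpose is its inverse).
    full = np.zeros((T, coeffs.shape[1]))
    full[: len(coeffs)] = coeffs
    return dct_basis(T).T @ full

T = 64
t = np.linspace(0.0, 1.0, T)
track = np.stack([t, np.sin(2.0 * np.pi * t)], axis=1)  # a smooth toy path
recon = decode_trajectory(encode_trajectory(track, k=8), T)
print(np.abs(track - recon).max())  # small residual for a smooth path
```

Keeping only the low-frequency coefficients gives a compact, smoothness-preserving conditioning signal, which is the appeal of DCT encodings for motion trajectories.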
Auto-regressive generation is used for variable-length videos.</p><p>-----</p><p>Key Insights &#128161;:</p><p>&#8594; Cinematic shot design requires simultaneous and intuitive control over camera and object motions in image-to-video generation.</p><p>&#8594; MotionCanvas effectively translates user-defined scene-space motion intent into screen-space control signals for video diffusion models.</p><p>&#8594; Representing camera motion with point tracking and object motion with scene-anchored bounding boxes allows for 3D-aware control without explicit 3D training data.</p><p>&#8594; DCT coefficients and color-coded masks provide efficient and effective ways to condition video diffusion models on motion signals.</p><p>-----</p><p>Results &#128202;:</p><p>&#8594; MotionCanvas achieves lower Rotation Error (0.6334) and Translation Error (0.2188) in camera motion control compared to MotionCtrl and CameraCtrl.</p><p>&#8594; MotionCanvas reduces Frechet Video Distance (FVD) to 34.09 and Frechet Inception Distance (FID) to 7.60, indicating higher video quality.</p><p>&#8594; User studies show MotionCanvas is preferred for motion adherence (75.24%), motion quality (79.05%), and frame fidelity (77.14%) over DragAnything and MOFA-Video.</p>]]></content:encoded></item></channel></rss>