"On-device Sora: Enabling Diffusion-Based Text-to-Video Generation for Mobile Devices"
The podcast below on this paper was generated with Google's Illuminate.
https://arxiv.org/abs/2502.04363
The challenge lies in running demanding text-to-video generation models on resource-limited mobile devices. Current diffusion-based video generation is compute- and memory-intensive, putting it out of reach of smartphones.
This paper introduces On-device Sora, a pioneering solution that enables diffusion-based text-to-video generation directly on mobile devices. It employs three novel techniques to overcome resource constraints while keeping video quality comparable to server-scale generation.
-----
📌 Linear Proportional Leap effectively reduces diffusion steps by half. This method smartly exploits Rectified Flow's linear trajectory. It significantly accelerates video generation on devices without retraining.
📌 Temporal Dimension Token Merging merges consecutive tokens along the time dimension. Halving the token count cuts the attention layers' computation to roughly a quarter, since attention cost grows quadratically with sequence length. It maintains video quality while boosting efficiency.
📌 Concurrent Inference with Dynamic Loading overcomes device memory constraints. It partitions large models into blocks and overlaps block loading with inference, so models that exceed available memory can still run efficiently.
----------
Methods Explored in this Paper 🔧:
→ Linear Proportional Leap (LPL) is introduced to reduce the number of denoising steps. LPL leverages the linear trajectory of Rectified Flow to make direct "leaps" along the denoising path instead of executing every originally required step (first sketch below).
→ Temporal Dimension Token Merging (TDTM) is proposed to minimize token-processing computation. TDTM merges consecutive video-frame tokens along the temporal dimension within the attention layers of the Spatial-Temporal Diffusion Transformer (STDiT), reducing the number of tokens and the computational load (second sketch below).
→ Concurrent Inference with Dynamic Loading (CI-DL) addresses memory limitations. CI-DL partitions large models into smaller blocks and dynamically loads them into memory so that block loading and model inference proceed concurrently. This keeps memory use within device limits and speeds up processing (third sketch below).
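To make the leap idea concrete, here is a minimal PyTorch sketch of an LPL-style Rectified Flow sampler: run the first few Euler steps as usual, then cover the remaining trajectory in a single linear jump. The function names (`lpl_sample`, `velocity_fn`) and the exact leap scheduling are illustrative assumptions, not the paper's implementation.

```python
# Sketch of a Linear Proportional Leap-style sampler (illustrative, not the paper's code).
# Assumes a Rectified Flow model whose velocity field v(x_t, t) is queried per step.
import torch

def lpl_sample(velocity_fn, x, num_steps=30, leap_after=15):
    """Run `leap_after` ordinary Euler steps, then take one linear leap over the
    remaining time, exploiting Rectified Flow's near-straight trajectories to
    skip roughly half of the denoising steps."""
    dt = 1.0 / num_steps
    t = 1.0                              # integrate from t=1 (noise) toward t=0 (data)
    for _ in range(leap_after):
        v = velocity_fn(x, t)            # predicted velocity at the current state
        x = x - dt * v                   # one ordinary Euler denoising step
        t -= dt
    # Leap: assume the velocity stays (approximately) constant along the linear
    # trajectory and cover the remaining time `t` in a single step.
    v = velocity_fn(x, t)
    x = x - t * v
    return x

# Toy usage with a stand-in velocity model (shape check only).
if __name__ == "__main__":
    dummy_velocity = lambda x, t: 0.1 * x
    latents = torch.randn(1, 4, 16, 32, 32)   # (batch, channels, frames, h, w)
    print(lpl_sample(dummy_velocity, latents).shape)
```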
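Next, a minimal sketch of the token-merging idea around a stock attention layer: average temporally adjacent frame tokens before attention (halving the sequence length, so roughly a quarter of the attention FLOPs) and duplicate the outputs afterward. The function names and the simple pair-averaging/duplication scheme are assumptions; the paper's merging inside STDiT may differ.

```python
# Sketch of temporal-dimension token merging around an attention layer (illustrative only).
import torch

def merge_temporal(tokens):
    """Average each pair of temporally adjacent frame tokens.
    tokens: (batch, frames, tokens_per_frame, dim) with an even frame count."""
    b, f, n, d = tokens.shape
    return tokens.view(b, f // 2, 2, n, d).mean(dim=2)      # (b, f/2, n, d)

def unmerge_temporal(tokens):
    """Duplicate merged tokens back to the original frame count."""
    return tokens.repeat_interleave(2, dim=1)

def attention_with_tdtm(tokens, attn):
    """Run attention on the merged sequence, then restore the frame dimension."""
    b, f, n, d = tokens.shape
    merged = merge_temporal(tokens)                          # half the frames
    seq = merged.reshape(b, -1, d)                           # flatten for attention
    out, _ = attn(seq, seq, seq)
    out = out.reshape(b, f // 2, n, d)
    return unmerge_temporal(out)                             # back to (b, f, n, d)

# Toy usage with a stock multi-head attention module.
if __name__ == "__main__":
    attn = torch.nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)
    x = torch.randn(2, 16, 8, 64)        # (batch, frames, tokens per frame, dim)
    print(attention_with_tdtm(x, attn).shape)   # torch.Size([2, 16, 8, 64])
```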
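Finally, a minimal sketch of overlapping block loading with inference using a background thread. The block partitioning, the `load_block` callback, and the thread-based prefetch are illustrative assumptions about how such concurrency could be organized, not the paper's implementation.

```python
# Sketch of concurrent inference with dynamic block loading (illustrative only).
import threading

def run_blockwise(num_blocks, load_block, x):
    """Execute a block-partitioned model. While block i runs on `x`, block i+1
    is fetched from storage on a background thread, so loading and inference
    overlap instead of happening back to back."""
    current = load_block(0)
    for i in range(num_blocks):
        holder, loader = {}, None
        if i + 1 < num_blocks:
            loader = threading.Thread(
                target=lambda idx=i + 1: holder.update(block=load_block(idx)))
            loader.start()               # prefetch the next block concurrently
        x = current(x)                   # run the block already in memory
        del current                      # release the finished block's memory
        if loader is not None:
            loader.join()                # wait only if loading was slower than compute
            current = holder["block"]
    return x

# Toy usage: `load_block` would normally deserialize one block of a partitioned
# checkpoint; here it just builds a small torch layer for demonstration.
if __name__ == "__main__":
    import torch
    make_block = lambda i: torch.nn.Linear(8, 8)
    print(run_blockwise(4, make_block, torch.randn(2, 8)).shape)
```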
-----
Key Insights 💡:
→ On-device Sora is the first framework to enable high-quality diffusion-based text-to-video generation on smartphones.
→ The proposed methods, LPL, TDTM, and CI-DL, significantly improve efficiency without compromising video quality.
→ On-device video generation enhances user privacy, reduces cloud dependency, and lowers costs.
-----
Results 📊:
→ Achieves up to 1.94× speedup using Linear Proportional Leap (LPL) while maintaining comparable video quality.
→ Temporal Dimension Token Merging (TDTM) provides up to 1.27× speedup with stable video quality metrics.
→ Concurrent Inference with Dynamic Loading (CI-DL) reduces STDiT inference latency by approximately 25%, from 1000 to 750 seconds for 30 denoising steps.