This paper introduces DiffuEraser, a video inpainting model built on Stable Diffusion, to address the limitations of transformer-based methods on large masked regions in videos. DiffuEraser injects priors to anchor generation and adds mechanisms that keep long sequences temporally consistent.
-----
Paper - https://arxiv.org/abs/2501.10018
Original Problem 🤔:
→ Existing transformer-based video inpainting methods suffer from blurring and temporal inconsistencies, especially over large masked regions.
→ Transformer models lack the generative capacity to synthesize unknown pixels, producing artifacts.
→ Temporal inconsistencies arise between consecutive clips when inpainting long sequences.
-----
Solution in this Paper 💡:
→ DiffuEraser, a diffusion-based video inpainting model built on Stable Diffusion, is proposed.
→ It pairs a main denoising UNet with a BrushNet branch; BrushNet extracts features from the masked images and injects them into the denoising UNet to guide generation.
→ Temporal attention layers are inserted to enhance frame-to-frame consistency.
→ Priors are incorporated by applying DDIM inversion to ProPainter outputs and injecting the result into the noisy latent input, which mitigates noisy artifacts and suppresses hallucinated objects (see the prior-injection sketch after this list).
→ Temporal consistency over long sequences is improved by expanding the temporal receptive field through pre-inference and by staggered denoising that exploits the temporal smoothing property of Video Diffusion Models at clip intersections.
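To make the prior-injection step concrete, here is a minimal PyTorch sketch, not the authors' code: `inject_prior`, `prior_latents`, and `t_start` are hypothetical names, and forward noising via a diffusers-style `scheduler.add_noise()` stands in for the DDIM inversion the paper actually performs.

```python
import torch

def inject_prior(prior_latents: torch.Tensor, scheduler, t_start: int) -> torch.Tensor:
    """Initialize denoising from a ProPainter prior instead of pure noise.

    prior_latents: VAE-encoded ProPainter outputs, e.g. shape (B, C, T, H, W).
    scheduler:     a diffusers-style scheduler exposing add_noise() (assumption).
    t_start:       timestep at which denoising will begin.
    """
    noise = torch.randn_like(prior_latents)
    timesteps = torch.full(
        (prior_latents.shape[0],), t_start,
        dtype=torch.long, device=prior_latents.device,
    )
    # Forward-diffuse the prior to t_start (a simplified stand-in for DDIM
    # inversion): x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps.
    # Starting from this anchored latent, rather than pure Gaussian noise,
    # biases generation toward the prior and suppresses hallucinated content.
    return scheduler.add_noise(prior_latents, noise, timesteps)
```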
-----
Key Insights from this Paper 🧐:
→ Diffusion models offer superior generative capabilities compared to transformers for video inpainting, producing more detailed and coherent content.
→ Incorporating priors helps initialize the diffusion process, reducing artifacts and unwanted object generation.
→ Expanding the temporal receptive field and leveraging the temporal smoothing of diffusion models are crucial for consistency across long sequences; a clip-overlap sketch follows this list.
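One common way to realize this kind of clip-overlap smoothing is to denoise overlapping windows and average their latents at the intersections. The sketch below is an illustration under my own assumptions (the `denoise_clip` callable and all names are hypothetical, not the paper's implementation):

```python
import torch

def denoise_overlapping_clips(latents, denoise_clip, clip_len=16, stride=8):
    """Blend denoised latents across overlapping clip windows.

    latents:      (C, T, H, W) noisy latents for the whole video.
    denoise_clip: callable that denoises one clip of latents (assumption).
    Overlaps widen each frame's temporal receptive field and smooth
    transitions at clip boundaries.
    """
    C, T, H, W = latents.shape
    starts = list(range(0, max(T - clip_len, 0) + 1, stride))
    if starts[-1] + clip_len < T:          # make sure the tail is covered
        starts.append(T - clip_len)
    out = torch.zeros_like(latents)
    count = torch.zeros(T, device=latents.device)
    for s in starts:
        out[:, s:s + clip_len] += denoise_clip(latents[:, s:s + clip_len])
        count[s:s + clip_len] += 1
    return out / count.view(1, T, 1, 1)    # average where windows overlap
```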
-----
Results 📊:
→ DiffuEraser shows improved texture quality and temporal consistency over ProPainter in qualitative comparisons.
→ It propagates known pixels and generates unknown pixels with greater consistency and stability than prior methods.
→ Sampling runs in just two denoising steps using Phased Consistency Models (PCM), improving inference efficiency; a minimal sampling sketch follows this list.
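As a rough illustration of few-step sampling in this style, the sketch below follows the generic consistency-model pattern (predict a clean sample, re-noise to the next phase boundary); `pcm_sample` and its arguments are hypothetical names of mine, not PCM's exact algorithm:

```python
import torch

@torch.no_grad()
def pcm_sample(model, latents, alphas_cumprod, timesteps=(999, 499)):
    """Two-step sampling in the spirit of Phased Consistency Models.

    model:          denoiser mapping (x_t, t) -> an estimate of x_0 (assumption).
    latents:        initial noisy latents (e.g. the noised ProPainter prior).
    alphas_cumprod: cumulative noise schedule, indexed by timestep.
    """
    x = latents
    for i, t in enumerate(timesteps):
        x0 = model(x, t)                        # direct clean-sample estimate
        if i + 1 < len(timesteps):
            # Re-noise the estimate to the next phase boundary.
            alpha = alphas_cumprod[timesteps[i + 1]]
            noise = torch.randn_like(x0)
            x = alpha.sqrt() * x0 + (1.0 - alpha).sqrt() * noise
        else:
            x = x0
    return x
```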