This paper proposes to improve video generation by aligning it more closely with human preferences, using a novel reward-maximization method.
It enhances video quality by directly optimizing the model on human feedback.
-----
Paper - https://arxiv.org/abs/2501.13918
Original Problem 🤔:
→ Current video generation models often fail to align with human aesthetic preferences and overall viewer satisfaction.
→ Existing methods struggle to effectively incorporate nuanced human feedback into the training process.
-----
Solution in this Paper 💡:
→ This paper introduces Flow Direct Preference Optimization (Flow-DPO), a novel approach that adapts Direct Preference Optimization (DPO) to flow-matching video generation models.
→ Flow-DPO is a likelihood-based reward maximization method specifically designed for aligning video generation with human preferences.
→ It optimizes the video generation policy directly on pairwise human preference data, using the flow model's own likelihood as an implicit reward (see the sketch after this list).
→ The paper also adapts Reward-Weighted Regression (RWR) to flow models as a complementary way to refine the alignment (sketched after the Key Insights below).
→ Flow-DPO aims to avoid the instability and extra reward-model training that traditional reinforcement learning pipelines require for video generation.
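To make this concrete, below is a minimal, hypothetical PyTorch-style sketch of a DPO-style preference loss for a flow-matching model. It assumes the per-sample flow-matching (velocity-prediction) errors of the policy and a frozen reference model have already been computed on the preferred and dispreferred video of each pair; the function name, signature, and beta value are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def flow_dpo_loss(policy_err_win, policy_err_lose,
                  ref_err_win, ref_err_lose, beta=10.0):
    """DPO-style preference loss on flow-matching (velocity-prediction) errors.

    Each argument is a per-sample MSE between predicted and target velocity
    at a sampled timestep, for the human-preferred ("win") and dispreferred
    ("lose") video of each preference pair.
    """
    # Lower velocity-prediction error stands in for higher model likelihood.
    diff_win = policy_err_win - ref_err_win      # policy vs. reference on the preferred video
    diff_lose = policy_err_lose - ref_err_lose   # policy vs. reference on the dispreferred video
    # Bradley-Terry-style objective: reduce error on the preferred sample
    # more than on the dispreferred one, relative to the frozen reference.
    logits = -beta * (diff_win - diff_lose)
    return -F.logsigmoid(logits).mean()

# Illustrative call with made-up per-pair errors.
loss = flow_dpo_loss(torch.tensor([0.42, 0.37]), torch.tensor([0.40, 0.45]),
                     torch.tensor([0.44, 0.39]), torch.tensor([0.41, 0.44]))
```

Keeping the reference model frozen regularizes the update, so the policy stays close to its pretrained behavior while tilting toward the preferred videos.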
-----
Key Insights from this Paper 🧐:
→ Directly optimizing for human preferences using pairwise comparison data is crucial for improving video generation quality.
→ Flow-based reward models offer a more effective way to capture complex human preferences compared to traditional scalar reward models.
→ Likelihood-based optimization methods like Flow-DPO can lead to more stable and efficient alignment in video generation.
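For the Reward-Weighted Regression step mentioned above, a likelihood-based reading is to keep the ordinary flow-matching regression loss and weight each sample by an exponentiated reward. The sketch below is a generic, hypothetical illustration of that weighting (function name, reward source, and temperature are assumptions), not the paper's exact objective.

```python
import torch

def flow_rwr_loss(velocity_err, rewards, temperature=1.0):
    """Reward-Weighted Regression applied to a flow-matching loss.

    velocity_err: per-sample velocity-prediction MSE of the policy.
    rewards: per-sample scores from a learned reward model.
    Higher-reward samples contribute more strongly to the regression.
    """
    weights = torch.softmax(rewards / temperature, dim=0)  # normalized exp(reward / T) weights
    return (weights * velocity_err).sum()
```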
-----
Results ✨:
→ The proposed Flow-DPO method achieves an 81.3% win rate against DPO in human preference evaluations.
→ Flow-DPO demonstrates a 7.1% improvement in preference accuracy compared to standard DPO.
→ Experiments show that Flow-DPO outperforms existing alignment methods, producing videos that human evaluators prefer more often.