"Dual Caption Preference Optimization for Diffusion Models"
The podcast below, covering this paper, was generated with Google's Illuminate.
https://arxiv.org/abs/2502.06023
The paper addresses suboptimal alignment in text-to-image diffusion models caused by two issues in preference datasets: conflict distribution and irrelevant prompts. Current datasets reuse the same prompt for both the preferred and the less preferred image, which creates optimization challenges.
This paper introduces Dual Caption Preference Optimization (DCPO), which uses distinct captions for the preferred and less preferred images to guide diffusion model alignment.
-----
📌 Dual Caption Preference Optimization (DCPO) refines the Direct Preference Optimization loss by conditioning each image on its own caption. This targeted approach sharpens the optimization landscape, leading to more preference-aligned diffusion models (a minimal sketch of the modified objective follows these takeaways).
📌 DCPO leverages readily available captioning and perturbation techniques to create more informative training signals. This pragmatic design effectively tackles the data conflict issue in preference learning for image generation.
📌 By disentangling prompts for preferred and less preferred examples, DCPO enables diffusion models to learn nuanced preference distinctions. This method moves beyond single-prompt limitations in current alignment techniques.
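To make the idea concrete, here is a minimal sketch of a Diffusion-DPO-style objective where each branch is conditioned on its own caption, which is the core change DCPO makes. The function signatures, the shared-noise choice, and the beta value are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def dcpo_loss(model, ref_model, x_w, x_l, cap_w, cap_l, noise_scheduler, beta=1000.0):
    """Sketch of a dual-caption preference loss for a diffusion model.

    x_w / x_l     : latents of the preferred / less preferred image
    cap_w / cap_l : text embeddings of their *distinct* captions (the DCPO change)
    model / ref_model : denoisers taking (noisy latents, timestep, text embedding)
    """
    # Sample one timestep and one noise tensor, shared across both branches.
    t = torch.randint(0, noise_scheduler.config.num_train_timesteps,
                      (x_w.shape[0],), device=x_w.device)
    noise = torch.randn_like(x_w)

    noisy_w = noise_scheduler.add_noise(x_w, noise, t)
    noisy_l = noise_scheduler.add_noise(x_l, noise, t)

    # Denoising errors of the trainable model, each conditioned on its own caption.
    err_w = F.mse_loss(model(noisy_w, t, cap_w), noise, reduction="none").mean(dim=(1, 2, 3))
    err_l = F.mse_loss(model(noisy_l, t, cap_l), noise, reduction="none").mean(dim=(1, 2, 3))

    # Same errors under the frozen reference model.
    with torch.no_grad():
        ref_err_w = F.mse_loss(ref_model(noisy_w, t, cap_w), noise, reduction="none").mean(dim=(1, 2, 3))
        ref_err_l = F.mse_loss(ref_model(noisy_l, t, cap_l), noise, reduction="none").mean(dim=(1, 2, 3))

    # DPO-style logit: reward the model for improving more (vs. the reference)
    # on the preferred pair than on the less preferred pair.
    logits = -beta * ((err_w - ref_err_w) - (err_l - ref_err_l))
    return -F.logsigmoid(logits).mean()
```

The only structural difference from a standard Diffusion-DPO loss is that `cap_w` and `cap_l` are different captions rather than one shared prompt.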
----------
Methods Explored in this Paper 🔧:
→ The paper proposes Dual Caption Preference Optimization (DCPO) to improve diffusion model alignment with human preferences.
→ DCPO addresses the "conflict distribution" and "irrelevant prompts" issues in existing preference datasets by using two different captions for preferred and less preferred images.
→ DCPO includes three strategies for generating distinct captions: DCPO-c (Captioning), DCPO-p (Perturbation), and DCPO-h (Hybrid); a rough code sketch of all three follows this list.
→ DCPO-c employs a captioning model to generate separate captions for preferred and less preferred images based on the original prompt and image content.
→ DCPO-p uses prompt perturbation techniques to create a less relevant caption for the less preferred image, while keeping the original prompt for the preferred image. Three levels of perturbation (weak, medium, strong) are explored.
→ DCPO-h combines captioning and perturbation, perturbing the caption generated for the less preferred image.
→ A new dataset, Pick-Double Caption, is introduced, built upon Pick-a-Pic v2, providing distinct captions for preferred and less preferred images using the DCPO-c pipeline.
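The sketch below shows how the three caption strategies could be wired together for one preference pair. The captioning and perturbation functions are placeholders, and the field names of the example dict are assumptions; the paper's actual models, prompts, and perturbation levels may differ.

```python
# Hypothetical stand-ins for the paper's actual components.
def caption_image(image, original_prompt):
    """Placeholder: call a captioning model conditioned on the image and the
    original prompt, returning a caption grounded in the image content."""
    raise NotImplementedError

def perturb(text, level="medium"):
    """Placeholder: weaken a caption's relevance, e.g. by swapping or dropping
    content words; 'weak' | 'medium' | 'strong' control how much."""
    raise NotImplementedError

def make_dual_captions(example, method="h", level="medium"):
    """Produce (caption_w, caption_l) for one preference pair, per DCPO variant."""
    prompt, img_w, img_l = example["prompt"], example["image_w"], example["image_l"]

    if method == "c":   # DCPO-c: caption each image separately
        return caption_image(img_w, prompt), caption_image(img_l, prompt)

    if method == "p":   # DCPO-p: keep the prompt for the winner, perturb it for the loser
        return prompt, perturb(prompt, level)

    if method == "h":   # DCPO-h: caption the loser, then perturb that caption
        cap_w = caption_image(img_w, prompt)
        cap_l = perturb(caption_image(img_l, prompt), level)
        return cap_w, cap_l

    raise ValueError(f"unknown DCPO variant: {method}")
```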
-----
Key Insights 💡:
→ Conflict distribution in preference datasets, where preferred and less preferred images from the same prompt have overlapping distributions, hinders effective preference optimization.
→ Irrelevant prompts for less preferred images in existing methods limit the diffusion model's ability to accurately predict noise during optimization.
→ Increasing the semantic difference between the captions of preferred and less preferred images improves alignment performance, up to a certain threshold (one way to quantify this difference is sketched after this list).
→ Perturbing captions that are already well correlated with their images (such as those produced by a captioning model) works better than perturbing the original prompts directly.
→ In-distribution captions generally lead to better alignment performance compared to out-of-distribution captions generated by captioning models for less preferred images.
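One simple way to quantify the "semantic difference" between the two captions is an embedding-based distance. The metric below (cosine distance over sentence embeddings, with an off-the-shelf encoder) is an illustrative assumption, not necessarily the measure used in the paper.

```python
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Any sentence-embedding model works for this illustration.
encoder = SentenceTransformer("all-MiniLM-L6-v2")

def caption_semantic_gap(caption_w: str, caption_l: str) -> float:
    """Cosine distance between caption embeddings: 0 = identical meaning, larger = more different."""
    emb_w, emb_l = encoder.encode([caption_w, caption_l], convert_to_tensor=True)
    return 1.0 - cos_sim(emb_w, emb_l).item()

# Example: the gap should grow as the less preferred caption is perturbed more strongly.
print(caption_semantic_gap(
    "a red sports car parked by the beach at sunset",
    "a blue sedan parked on a city street at night",
))
```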
-----
Results 📊:
→ DCPO-h achieves a PickScore of 20.57, outperforming Diffusion-DPO's 20.36 and SD 2.1's 20.30.
→ DCPO-h reaches an HPSv2.1 score of 25.62, exceeding Diffusion-DPO's 25.10 and SD 2.1's 25.17.
→ DCPO-h attains a GenEval score of 0.5100, surpassing Diffusion-DPO's 0.4857 and SD 2.1's 0.4775.
→ In GPT-4o evaluations on PartiPrompts, DCPO-h achieves a 58% win rate in general preference and a 66% win rate in visual appeal against Diffusion-DPO.