The Personalized Visual Instruction Tuning (PVIT) framework empowers MLLMs to recognize specific individuals and conduct personalized conversations effectively.
📚 https://arxiv.org/abs/2410.07113
Original Problem 🔍:
Current multimodal large language models (MLLMs) lack personalization capabilities and struggle to conduct dialogues about specific individuals. This limitation hinders their use in personalized settings such as mobile visual assistants or domestic robots.
-----
Solution in this Paper 🛠️:
• Represents each individual as a multimodal prefix pair (personal image, personal introduction) prepended to the query
• Uses personalized wrapper tokens to delimit each prefix and eliminate ambiguity between individuals (see the prompt sketch after this list)
• Develops an automatic pipeline for generating personalized training data
• Leverages visual experts, image generation models, and MLLMs to build that data
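To make the prefix-pair idea concrete, here is a minimal sketch of how such a personalized prompt might be assembled. The wrapper tokens (`<person>`/`</person>`), the `<image:...>` placeholders, and the helper names are illustrative assumptions, not the paper's actual token format.

```python
# Minimal sketch of assembling a PVIT-style personalized prompt.
# Wrapper tokens, image placeholders, and function names are assumptions
# for illustration; the paper defines its own personalized wrapper tokens.

from dataclasses import dataclass
from typing import List

@dataclass
class PersonPrefix:
    image_path: str      # personal image of the individual
    introduction: str    # short personal introduction, e.g. "This is Alice, my sister."

def build_personalized_prompt(prefixes: List[PersonPrefix],
                              scene_image: str,
                              question: str) -> str:
    """Concatenate (image, introduction) prefix pairs, each wrapped in
    personalized wrapper tokens, followed by the scene image and the query."""
    parts = []
    for p in prefixes:
        # Wrapper tokens mark where one individual's prefix ends and the next
        # begins, removing ambiguity when several people are introduced.
        parts.append(f"<person> <image:{p.image_path}> {p.introduction} </person>")
    parts.append(f"<image:{scene_image}>")
    parts.append(question)
    return "\n".join(parts)

if __name__ == "__main__":
    prompt = build_personalized_prompt(
        prefixes=[
            PersonPrefix("alice_face.jpg", "This is Alice. She is my sister."),
            PersonPrefix("bob_face.jpg", "This is Bob. He is my colleague."),
        ],
        scene_image="party_photo.jpg",
        question="What is Alice doing in this photo?",
    )
    print(prompt)
```

Because new individuals are supplied purely as in-context prefixes, the tuned model can be personalized at inference time without any per-user fine-tuning.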
-----
Key Insights from this Paper 💡:
• PVIT enables MLLMs to identify target individuals and engage in personalized dialogues
• Leverages the in-context learning capability of MLLMs to generalize to new individuals
• Introduces P-Bench, a benchmark for evaluating the personalization potential of MLLMs
• Addresses the "face blindness" limitation of current MLLMs
-----
Results 📊:
• P-LLaVA (PVIT-tuned LLaVA) outperforms state-of-the-art MLLMs across various question types in P-Bench
• Achieves 96.69% accuracy on answerable tasks (vs 63.13% for next best)
• Demonstrates 99.72% accuracy on unanswerable queries (vs 31.49% for next best)
• Maintains strong performance even in complex scenes involving multiple individuals