MLLMs learn to evaluate aesthetics through structured decomposition.
MLLMs can evaluate artwork aesthetics more reliably when the evaluation is decomposed into concrete steps, which reduces hallucination and improves alignment with human preferences.
-----
https://arxiv.org/abs/2501.09012
🎨 Original Problem:
→ Current AI art evaluation metrics rely on vision features alone, ignoring cultural context and emotional impact
→ Existing metrics like Style Loss and Art Score don't align well with human aesthetic preferences
-----
🔍 Solution in this Paper:
→ Introduces MM-StyleBench, a large-scale dataset with 1000+ content and style instances for benchmarking artistic evaluation
→ Develops ArtCoT, a three-phase prompting method that mimics the formal analysis process of art critics (sketched in the first code block below)
→ Uses Two-Alternative Forced Choice (2AFC) tasks, which model human preferences more accurately than traditional Likert-scale ratings
→ Implements Bradley-Terry and Elo algorithms to turn the pairwise choices into global aesthetic rankings (see the second sketch below)
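A minimal sketch of what such a three-phase, forced-choice evaluation loop could look like. The phase prompts and the `query_mllm` wrapper are hypothetical stand-ins, not the paper's exact prompts or API; only the structure (concrete description → formal analysis → binary verdict) follows the post's description of ArtCoT.

```python
# Hedged sketch of a three-phase, forced-choice evaluation loop in the
# spirit of ArtCoT. `query_mllm` and the phase prompts are hypothetical
# stand-ins, not the paper's exact prompts or API.

def query_mllm(prompt: str, images: list) -> str:
    """Hypothetical wrapper around an MLLM client (GPT-4, Gemini, Claude...)."""
    raise NotImplementedError("plug in a real model client here")

PHASES = [
    # Phase 1: ground the model in concrete, observable facts.
    "Describe the content and style of images A and B in concrete terms.",
    # Phase 2: formal analysis, mimicking an art critic's process.
    "Analyze composition, color, and how well each style serves the content.",
    # Phase 3: a two-alternative forced choice instead of a Likert rating.
    "Based on your analysis, which image is aesthetically better, A or B? "
    "Answer with a single letter.",
]

def compare_pair(image_a, image_b) -> str:
    """Run the three phases in sequence, feeding each answer back as context."""
    context = ""
    for prompt in PHASES:
        reply = query_mllm(context + "\n" + prompt, [image_a, image_b])
        context += "\n" + prompt + "\n" + reply
    # The final reply is expected to end in "A" or "B".
    return "B" if context.strip().upper().endswith("B") else "A"
```

Forcing a single-letter verdict at the end is what makes each output usable as a 2AFC outcome for the ranking step below.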
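Once the pairwise 2AFC outcomes are collected, a global ranking can be derived. Below is a minimal Bradley-Terry fit using the standard minorization-maximization (Zermelo) updates; the toy win matrix is invented for illustration. Elo, which the post also mentions, would instead update ratings sequentially after each comparison.

```python
import numpy as np

def bradley_terry(wins: np.ndarray, iters: int = 100) -> np.ndarray:
    """Fit Bradley-Terry strengths from a pairwise win-count matrix.

    wins[i, j] = number of times method i was preferred over method j
    in the 2AFC comparisons. Returns normalized strengths whose order
    gives the global aesthetic ranking.
    """
    n = wins.shape[0]
    games = wins + wins.T          # total comparisons per pair
    p = np.ones(n)                 # initial strengths
    for _ in range(iters):
        for i in range(n):
            mask = np.arange(n) != i
            # Zermelo/MM update: p_i <- W_i / sum_j n_ij / (p_i + p_j)
            denom = np.sum(games[i, mask] / (p[i] + p[mask]))
            p[i] = wins[i].sum() / denom
        p /= p.sum()               # normalize for identifiability
    return p

# Toy example: three stylization methods, method 0 wins most comparisons.
wins = np.array([[0, 8, 9],
                 [2, 0, 6],
                 [1, 4, 0]])
scores = bradley_terry(wins)
print(np.argsort(-scores))  # global ranking, best first -> [0 1 2]
```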
-----
🧠 Key Insights:
→ Zero-shot Chain of Thought prompting actually degrades MLLM performance by 22% due to increased hallucination
→ Task decomposition and concrete language significantly reduce subjective responses from 20.15% to 5.51%
→ Both content and style information are necessary for effective aesthetic evaluation
-----
📊 Results:
→ ArtCoT improves aesthetic alignment by 56% in per-method evaluation
→ Achieves 29% improvement in per-instance evaluation
→ Shows consistent performance gains across different MLLMs (GPT-4, Gemini 1.5, Claude 3.5)