One model sees images in three ways: overall scene, text details, and object relationships.
Florence-VL enhances multimodal LLMs by using Florence-2 as a vision encoder, extracting diverse visual features through depth-breadth fusion. Unlike traditional CLIP-based models, it captures multiple levels of visual information using task-specific prompts, improving vision-language alignment and performance across various tasks.
-----
https://arxiv.org/abs/2412.04424
🤔 Original Problem:
Current multimodal LLMs rely heavily on CLIP-style vision encoders that provide single-layer image features, often missing crucial pixel-level details and contextual information needed for complex visual understanding tasks.
-----
🔧 Solution in this Paper:
→ Florence-VL uses Florence-2 as a generative vision encoder to extract multiple types of visual features.
→ The model implements Depth-Breadth Fusion (DBFusion) to combine features from different network depths (depth) with features produced under task-specific prompts such as captioning, OCR, and grounding (breadth).
→ A channel integration strategy fuses these diverse features efficiently before projecting them into the LLM's input space (see the sketch after this list).
→ The training process involves end-to-end pretraining followed by partial finetuning on carefully curated datasets.
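A minimal PyTorch sketch of the channel integration idea, assuming illustrative dimensions (three prompt branches, 576 visual tokens, 1024-dim vision features, 4096-dim LLM embeddings). The class name, MLP projector, and all sizes except the 576-token count are hypothetical, not the paper's exact implementation:

```python
import torch
import torch.nn as nn

class DBFusionSketch(nn.Module):
    """Sketch of channel-wise fusion: concatenate per-prompt visual features
    along the channel dimension, then project into the LLM's embedding space
    with a small MLP. Dimensions are illustrative, not the paper's."""

    def __init__(self, num_branches=3, vision_dim=1024, llm_dim=4096):
        super().__init__()
        fused_dim = num_branches * vision_dim  # channel concat widens the feature dim
        self.projector = nn.Sequential(
            nn.Linear(fused_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, branch_features):
        # branch_features: list of [batch, num_tokens, vision_dim] tensors,
        # one per task prompt (e.g. captioning, OCR, grounding)
        fused = torch.cat(branch_features, dim=-1)  # [batch, num_tokens, num_branches * vision_dim]
        return self.projector(fused)                # [batch, num_tokens, llm_dim], token count unchanged

# Toy usage: three prompt-specific feature maps with 576 tokens each
feats = [torch.randn(1, 576, 1024) for _ in range(3)]
visual_tokens = DBFusionSketch()(feats)
print(visual_tokens.shape)  # torch.Size([1, 576, 4096])
```

Because fusion happens along the channel axis, the number of visual tokens handed to the LLM stays fixed regardless of how many prompt branches are added.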
-----
💡 Key Insights:
→ Multiple task-specific visual features perform better than single universal features
→ Lower-level features complement high-level semantic representations
→ Channel-based feature fusion outperforms token integration and average pooling (shape comparison after this list)
→ Florence-2's generative capabilities enable better vision-language alignment
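A rough shape comparison, under the same illustrative dimensions as the sketch above, of why the fusion choice matters: channel fusion keeps the visual token budget fixed, token integration multiplies it by the number of branches, and average pooling blends branch-specific information away. The tensor sizes are assumptions for illustration only:

```python
import torch

# Hypothetical shapes: 3 prompt branches, 576 tokens each, 1024 channels
branches = [torch.randn(1, 576, 1024) for _ in range(3)]

channel_fusion = torch.cat(branches, dim=-1)         # [1, 576, 3072]  -> token count unchanged
token_integration = torch.cat(branches, dim=1)       # [1, 1728, 1024] -> 3x more visual tokens for the LLM
average_pooling = torch.stack(branches).mean(dim=0)  # [1, 576, 1024]  -> branch distinctions averaged out

for name, t in [("channel", channel_fusion), ("token", token_integration), ("avg", average_pooling)]:
    print(name, tuple(t.shape))
```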
-----
📊 Results:
→ Outperformed existing state-of-the-art models across 25 benchmarks
→ Achieved the lowest alignment loss among the vision encoders compared
→ Showed significant improvements in OCR and chart understanding tasks
→ Required only 576 visual tokens versus 2880 in competing models