"Florence-VL: Enhancing Vision-Language Models with Generative Vision Encoder and Depth-Breadth Fusion"

The podcast on this paper was generated with Google's Illuminate.

One model sees images in three ways: overall scene, text details, and object relationships.

Florence-VL enhances multimodal LLMs by using Florence-2 as a vision encoder, extracting diverse visual features through depth-breadth fusion. Unlike traditional CLIP-based models, it captures multiple levels of visual information using task-specific prompts, improving vision-language alignment and performance across various tasks.
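As a rough illustration of those three "views", here is a minimal sketch that queries the public microsoft/Florence-2-large checkpoint on Hugging Face with different task prompts. The checkpoint name and prompt strings follow the Florence-2 model card and are assumptions here; Florence-VL itself taps the encoder's prompt-conditioned features rather than the decoded text.

```python
from PIL import Image
from transformers import AutoProcessor, AutoModelForCausalLM

model_id = "microsoft/Florence-2-large"  # assumed public checkpoint
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("example.jpg")  # any local image

# Three task prompts roughly matching "overall scene, text details, object relationships"
for prompt in ["<DETAILED_CAPTION>", "<OCR>", "<OD>"]:
    inputs = processor(text=prompt, images=image, return_tensors="pt")
    out_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=256,
    )
    print(prompt, processor.batch_decode(out_ids, skip_special_tokens=False)[0])
```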

-----

https://arxiv.org/abs/2412.04424

🤔 Original Problem:

Current multimodal LLMs rely heavily on CLIP-style vision encoders that provide single-layer image features, often missing crucial pixel-level details and contextual information needed for complex visual understanding tasks.

-----

🔧 Solution in this Paper:

→ Florence-VL uses Florence-2 as a generative vision encoder to extract multiple types of visual features.

→ The model implements Depth-Breadth Fusion (DBFusion) to combine features from different network depths and task-specific prompts like captioning, OCR, and grounding.

→ A channel integration strategy fuses these diverse features efficiently before projecting them to the LLM's input space (see the sketch after this list).

→ The training process involves end-to-end pretraining followed by partial finetuning on carefully curated datasets.
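A minimal PyTorch sketch of the channel-integration idea, under assumed feature and hidden dimensions: the DBFusionProjector class, the two-layer MLP connector, and all sizes are placeholders, not the paper's exact modules.

```python
import torch
import torch.nn as nn

class DBFusionProjector(nn.Module):
    """Illustrative channel-integration fusion: concatenate visual features from
    several depths / task prompts along the channel axis, then project them
    into the LLM's token embedding space. All dimensions are assumptions."""
    def __init__(self, feat_dims, llm_dim):
        super().__init__()
        fused_dim = sum(feat_dims)            # channel concat keeps the token count fixed
        self.proj = nn.Sequential(            # simple MLP connector (assumed, not the paper's exact design)
            nn.Linear(fused_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, feature_list):
        # each tensor: (batch, num_tokens, dim_i); all share the same num_tokens
        fused = torch.cat(feature_list, dim=-1)   # (batch, num_tokens, sum(dim_i))
        return self.proj(fused)                   # (batch, num_tokens, llm_dim)

# Toy usage: three feature sets (e.g., caption-, OCR-, grounding-prompted), 576 tokens each
feats = [torch.randn(1, 576, 1024) for _ in range(3)]
projector = DBFusionProjector([1024, 1024, 1024], llm_dim=4096)
visual_tokens = projector(feats)                  # (1, 576, 4096), ready to prepend to the LLM input
```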

-----

💡 Key Insights:

→ Multiple task-specific visual features perform better than single universal features

→ Lower-level features complement high-level semantic representations

→ Channel-based feature fusion outperforms token integration and average pooling (see the shape comparison after this list)

→ Florence-2's generative capabilities enable better vision-language alignment
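To see why channel-based fusion is attractive, here is a shape-level comparison of the three fusion strategies, with assumed tensor sizes (576 tokens by 1024 channels per feature set):

```python
import torch

# Three feature sets from different depths / prompts (assumed shapes)
feats = [torch.randn(1, 576, 1024) for _ in range(3)]

channel_fusion = torch.cat(feats, dim=-1)    # (1, 576, 3072): token count unchanged
token_fusion   = torch.cat(feats, dim=1)     # (1, 1728, 1024): sequence length grows per feature set
avg_pooling    = torch.stack(feats).mean(0)  # (1, 576, 1024): features blended, distinctions lost

print(channel_fusion.shape, token_fusion.shape, avg_pooling.shape)
```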

-----

📊 Results:

→ Outperformed existing state-of-the-art models across 25 benchmarks

→ Achieved lowest alignment loss compared to other vision encoders

→ Showed significant improvements in OCR and chart understanding tasks

→ Required only 576 visual tokens versus 2880 in competing models
