InternVL 2.5 bridges the gap between open-source and commercial multimodal LLMs through smart scaling.
InternVL 2.5 introduces significant improvements to open-source multimodal LLMs through better training and test-time scaling strategies and higher data quality, while retaining the core architecture of its predecessor. It achieves state-of-the-art performance across a wide range of benchmarks while maintaining transparency and accessibility.
-----
https://arxiv.org/abs/2412.05271
🤔 Original Problem:
Open-source multimodal LLMs lag behind closed commercial models in performance and efficiency, limiting research accessibility and transparency in the field.
-----
🔧 Solution in this Paper:
→ The paper introduces InternVL 2.5, an advanced multimodal LLM series that builds upon InternVL 2.0's architecture.
→ It systematically explores scaling relationships between vision encoders, language models, dataset sizes, and test-time configurations.
→ A rigorous data filtering pipeline removes low-quality samples and repetitive patterns to enhance model performance (a rough sketch of such a filter follows this list).
→ The model uses a progressive scaling strategy: the vision encoder is first trained alongside smaller LLMs and then reused with larger LLMs, avoiding costly retraining from scratch.
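
As a rough illustration of the repetition-filtering idea, here is a minimal heuristic sketch; the function names, thresholds, and sample format are assumptions for illustration, not the paper's actual pipeline (which also relies on scoring and review steps):

```python
# Minimal sketch of a repetition-based quality filter (illustrative only).
from collections import Counter

def repetition_ratio(text: str, n: int = 8) -> float:
    """Fraction of n-gram occurrences that are duplicates (higher = more repetitive)."""
    tokens = text.split()
    if len(tokens) < n:
        return 0.0
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    counts = Counter(ngrams)
    duplicates = sum(c - 1 for c in counts.values())
    return duplicates / len(ngrams)

def filter_samples(samples: list[dict], max_ratio: float = 0.2, min_len: int = 10) -> list[dict]:
    """Keep samples that are long enough and not dominated by repeated n-grams."""
    kept = []
    for s in samples:
        text = s.get("response", "")
        if len(text.split()) >= min_len and repetition_ratio(text) <= max_ratio:
            kept.append(s)
    return kept
```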
-----
💡 Key Insights:
→ Large vision encoders significantly reduce training data dependency when scaling MLLMs
→ Data quality matters more than quantity for complex reasoning tasks
→ Test-time scaling with Chain-of-Thought reasoning improves performance on challenging tasks (see the sketch after this list)
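
To make the test-time scaling idea concrete, the sketch below samples several Chain-of-Thought responses and majority-votes over the final answers. The `generate` callable, prompt suffix, and answer extraction are hypothetical stand-ins, not InternVL 2.5's actual inference code:

```python
# Minimal sketch of test-time scaling via Chain-of-Thought with majority voting.
from collections import Counter

COT_SUFFIX = "\nAnswer the question step by step, then state the final answer."

def extract_answer(response: str) -> str:
    """Naive answer extraction: take the last non-empty line of the response."""
    lines = [line.strip() for line in response.splitlines() if line.strip()]
    return lines[-1] if lines else ""

def cot_majority_vote(generate, image, question: str, num_samples: int = 8) -> str:
    """Sample several CoT responses and return the most common final answer."""
    answers = []
    for _ in range(num_samples):
        response = generate(image, question + COT_SUFFIX, temperature=0.7)
        answers.append(extract_answer(response))
    return Counter(answers).most_common(1)[0][0]
```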
-----
📊 Results:
→ First open-source MLLM to surpass 70% on MMMU benchmark
→ Achieves 75.5% on OpenCompass leaderboard
→ Matches the performance of leading commercial models such as GPT-4o and Claude-3.5-Sonnet