FastVLM cuts a vision-language model's time-to-first-token by up to 85x by smartly reducing visual tokens without losing quality.
FastVLM introduces a hybrid vision encoder that processes high-resolution images efficiently while reducing visual tokens, making vision-language models faster and more efficient.
-----
https://arxiv.org/abs/2412.13303
🔍 Original Problem:
Vision-language models struggle with high-resolution images because standard vision encoders such as ViTs emit too many visual tokens and incur high encoding latency.
-----
🛠️ Solution in this Paper:
→ FastVLM uses FastViTHD, a novel hybrid vision encoder combining convolutional and transformer layers in a 5-stage architecture.
→ FastViTHD downsamples the tensors fed to its self-attention layers by 64x, sharply reducing the number of tokens those layers must process.
→ FastViTHD generates 4x fewer tokens than traditional architectures while maintaining high-quality visual representations.
→ The architecture removes the need for a separate token-pruning step: simply scaling the input resolution yields a suitable token count (a rough sketch of the hybrid design follows below).
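
To make the idea concrete, here is a minimal sketch (not the authors' released code) of a hybrid encoder in PyTorch: convolutional stages aggressively downsample the image before any self-attention runs, so the transformer stage only ever sees a small grid of tokens. The stage count, channel widths, and the use of six stride-2 blocks to reach a 64x total stride are illustrative assumptions, not the actual FastViTHD configuration.

```python
# Minimal hybrid-encoder sketch: conv stages shrink the feature map 64x
# before any self-attention runs, so the transformer sees few tokens.
import torch
import torch.nn as nn

class ConvStage(nn.Module):
    """Convolutional stage that halves spatial resolution."""
    def __init__(self, c_in, c_out):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1),
            nn.BatchNorm2d(c_out),
            nn.GELU(),
        )
    def forward(self, x):
        return self.block(x)

class HybridEncoderSketch(nn.Module):
    """Six stride-2 conv stages (64x total downsampling here, for illustration;
    FastViTHD reaches 64x with its own 5-stage layout) followed by self-attention."""
    def __init__(self, width=64, depth=2, heads=8):
        super().__init__()
        chans = [3, width, width * 2, width * 4, width * 8, width * 8, width * 8]
        self.conv_stages = nn.Sequential(
            *[ConvStage(chans[i], chans[i + 1]) for i in range(6)]
        )
        layer = nn.TransformerEncoderLayer(d_model=chans[-1], nhead=heads, batch_first=True)
        self.attn_stage = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, image):
        feats = self.conv_stages(image)            # (B, C, H/64, W/64)
        tokens = feats.flatten(2).transpose(1, 2)  # (B, H/64 * W/64, C)
        return self.attn_stage(tokens)             # few, high-level visual tokens

x = torch.randn(1, 3, 1024, 1024)
print(HybridEncoderSketch()(x).shape)  # torch.Size([1, 256, 512]) -> only 256 visual tokens
```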
-----
💡 Key Insights:
→ Hybrid architectures (convolution + transformer) are more efficient than pure transformer models for vision encoding
→ Vision encoder latency dominates total processing time at high resolutions
→ Optimal performance requires balancing image resolution, visual token count, and LLM size (see the arithmetic below)
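
The resolution/token-count balance is easy to see with back-of-the-envelope arithmetic. The downsampling factors and the 1152px resolution below come from this post; the resulting token counts are my own illustrative calculation, not figures from the paper's tables.

```python
# Visual token count scales with (resolution / downsampling_factor)^2, so the
# encoder's output stride controls how many tokens both the encoder's
# self-attention and the LLM prefill must chew through.
def visual_tokens(resolution: int, downsample: int) -> int:
    side = resolution // downsample
    return side * side

for name, stride in [("ViT-style, 14px patches", 14),
                     ("hybrid, 32x downsampling", 32),
                     ("FastViTHD-like, 64x downsampling", 64)]:
    print(f"{name:32s} @ 1152px -> {visual_tokens(1152, stride):5d} tokens")
# 6724 tokens vs 1296 vs 324: the 64x encoder feeds the LLM an order of
# magnitude fewer visual tokens at the same input resolution.
```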
-----
📊 Results:
→ 3.2x faster time-to-first-token compared to previous methods
→ 85x faster time-to-first-token than LLaVA-OneVision at 1152x1152 resolution
→ 3.4x smaller vision encoder than LLaVA-OneVision's while maintaining comparable accuracy
→ Achieves a state-of-the-art accuracy-latency trade-off when benchmarked on an M1 MacBook Pro (see the TTFT breakdown below)
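
As a reminder of what these latency numbers measure: time-to-first-token is roughly the vision-encoder latency plus the LLM's prefill time over all prompt tokens. The toy model below illustrates that decomposition; every latency value in it is a made-up placeholder, not a measurement from the paper.

```python
# Toy TTFT model: encoder latency + per-token prefill cost over the prompt.
# All numbers are illustrative placeholders, not measurements.
def ttft_ms(encoder_ms: float, n_visual_tokens: int, n_text_tokens: int,
            prefill_ms_per_token: float) -> float:
    prefill_ms = (n_visual_tokens + n_text_tokens) * prefill_ms_per_token
    return encoder_ms + prefill_ms

# A faster encoder that also emits fewer visual tokens shrinks both terms.
print(ttft_ms(encoder_ms=120.0, n_visual_tokens=576, n_text_tokens=64, prefill_ms_per_token=0.5))  # 440.0
print(ttft_ms(encoder_ms=40.0,  n_visual_tokens=144, n_text_tokens=64, prefill_ms_per_token=0.5))  # 144.0
```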