"AKVQ-VL: Attention-Aware KV Cache Adaptive 2-Bit Quantization for Vision-Language Models"
The podcast below on this paper was generated with Google's Illuminate.
https://arxiv.org/abs/2501.15021
The challenge with Vision-Language Models arises from their lengthy multimodal inputs, which inflate the Key-Value (KV) cache, driving up memory use and slowing inference. KV cache quantization methods designed for Large Language Models do not transfer directly to Vision-Language Models because attention patterns differ across modalities.
This paper introduces AKVQ-VL, a KV cache quantization method for Vision-Language Models that accounts for the attention differences among multimodal tokens and suppresses outliers.
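To see why the KV cache becomes the bottleneck, here is a back-of-the-envelope calculation, assuming LLaVA-v1.5-7B-style dimensions (32 layers, 32 heads, head dimension 128) and an illustrative sequence length and batch size; these numbers are assumptions for illustration, not figures from the paper.

```python
# Rough FP16 KV cache size for a LLaVA-v1.5-7B-style model
# (assumed dims: 32 layers, 32 heads, head_dim 128; all numbers illustrative).
layers, heads, head_dim = 32, 32, 128
seq_len = 576 + 512          # e.g. 576 vision-patch tokens plus text tokens
batch = 8
bytes_per_elem = 2           # FP16

# K and V each store one head_dim vector per token, per head, per layer.
kv_bytes = 2 * layers * heads * head_dim * seq_len * batch * bytes_per_elem
print(f"FP16 KV cache:  {kv_bytes / 2**30:.2f} GiB")      # ~4.25 GiB
print(f"2-bit KV cache: {kv_bytes / 8 / 2**30:.2f} GiB")  # ~0.53 GiB (ignoring scales/zeros)
```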
-----
📌 AKVQ-VL introduces attention-aware quantization for Vision-Language Models, moving beyond modality-agnostic Large Language Model methods. Text-Salient Attention and Pivot-Token-Salient Attention guide adaptive bit allocation.
📌 The Walsh-Hadamard Transform is used to tackle the KV cache outlier challenge, enabling aggressive 2-bit quantization. The result is strong performance retention with a significantly reduced memory footprint.
📌 AKVQ-VL demonstrates targeted KV cache compression for Vision-Language Models. By exploiting their distinctive attention patterns, it delivers larger efficiency gains than generic quantization techniques while handling multimodal inputs more accurately.
----------
Methods Explored in this Paper 🔧:
→ The AKVQ-VL method identifies important tokens based on two attention patterns characteristic of Vision-Language Models: Text-Salient Attention and Pivot-Token-Salient Attention.
→ Text-Salient Attention is observed in early layers where text tokens receive more attention than vision tokens.
→ Pivot-Token-Salient Attention emerges in later layers where a few key tokens attract most attention.
→ Based on these patterns, AKVQ-VL applies adaptive quantization: salient tokens keep higher bit precision, while the remaining tokens are quantized to 2 bits for stronger compression (a minimal sketch follows this list).
→ To handle outliers in KV tensors, the Walsh-Hadamard Transform is applied. This transformation yields outlier-free KV caches, improving low-bit quantization.
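Below is a minimal PyTorch sketch of the adaptive bit-allocation idea, not the paper's exact implementation: the salient ratio, bit-widths, and per-token quantizer are hypothetical choices for illustration.

```python
import torch

def quantize_per_token(x, bits):
    """Asymmetric per-token uniform quantization (illustrative fake-quant)."""
    qmax = 2 ** bits - 1
    xmin = x.min(dim=-1, keepdim=True).values
    xmax = x.max(dim=-1, keepdim=True).values
    scale = (xmax - xmin).clamp(min=1e-8) / qmax
    q = ((x - xmin) / scale).round().clamp(0, qmax)
    return q * scale + xmin  # dequantized values

def adaptive_kv_quant(kv, attn, salient_ratio=0.05, high_bits=8, low_bits=2):
    """kv: [tokens, dim]; attn: [heads, queries, tokens] post-softmax weights.
    Tokens receiving the most attention keep high_bits; the rest get low_bits."""
    token_score = attn.sum(dim=(0, 1))              # total attention each token receives
    k = max(1, int(salient_ratio * kv.shape[0]))
    salient = torch.topk(token_score, k).indices
    out = quantize_per_token(kv, low_bits)          # aggressive 2-bit default
    out[salient] = quantize_per_token(kv[salient], high_bits)  # protect salient tokens
    return out
```

A real implementation would store the packed integer codes plus per-token scales rather than the dequantized values returned here.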
-----
Key Insights 💡:
→ Vision-Language Models exhibit distinct attention patterns compared to Large Language Models. These are Text-Salient Attention and Pivot-Token-Salient Attention.
→ Text tokens are prioritized in initial layers, and pivot tokens become important in later layers within Vision-Language Models.
→ KV cache quantization methods designed for Large Language Models are not optimal for Vision-Language Models because of these distinct attention patterns.
→ Walsh-Hadamard Transform effectively reduces outliers in Key-Value caches of Vision-Language Models, making low-bit quantization more effective.
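A small NumPy illustration of this effect (illustrative only; it assumes the rotated dimension is a power of two, e.g. a head dimension of 128):

```python
import numpy as np
from scipy.linalg import hadamard

def wht_rotate(x):
    """Rotate each row of x with an orthonormal Walsh-Hadamard matrix.
    The rotation spreads outlier energy across all channels, flattening
    the value distribution before low-bit quantization."""
    d = x.shape[-1]
    H = hadamard(d) / np.sqrt(d)     # orthonormal, so H @ H.T == I
    return x @ H                     # invertible: x == (x @ H) @ H.T

# Toy check: a K/V-like vector with one large outlier channel
x = 0.1 * np.random.randn(1, 128)
x[0, 7] = 50.0
print(np.abs(x).max(), np.abs(wht_rotate(x)).max())  # max magnitude drops sharply after rotation
```

Because the orthonormal Hadamard matrix is invertible, the rotation loses no information, while the quantizer sees a far flatter distribution.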
-----
Results 📊:
→ AKVQ-VL achieves comparable or improved accuracy with 2-bit quantization relative to the FP16 baseline across 12 multimodal tasks. For example, on the Scene Transition task, AKVQ-VL reaches 78.0 accuracy, exceeding FP16's 73.0.
→ AKVQ-VL reduces peak memory usage by a factor of 2.13.
→ AKVQ-VL enables 3.25 times larger batch sizes.
→ AKVQ-VL increases throughput by 2.46 times on the LLaVA-v1.5-7B model.
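For intuition on where the savings come from, a rough effective bit-width calculation follows; the group size and metadata format are assumptions for illustration, not details taken from the paper.

```python
# Effective bits per element for 2-bit quantization with per-group metadata
# (assumed group size 128 with one FP16 scale and one FP16 zero-point per group).
group_size = 128
payload_bits = 2
meta_bits = 16 + 16
effective_bits = payload_bits + meta_bits / group_size
print(effective_bits)          # 2.25 bits per element
print(16 / effective_bits)     # ~7.1x smaller KV cache than FP16
# End-to-end peak memory drops less (the reported 2.13x), presumably because
# model weights and activations remain in FP16 once the KV cache is compressed.
```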