"AKVQ-VL: Attention-Aware KV Cache Adaptive 2-Bit Quantization for Vision-Language Models"
The podcast below on this paper was generated with Google's Illuminate.
https://arxiv.org/abs/2501.15021
The challenge with Vision-Language Models arises from their lengthy multimodal inputs, which inflate the Key-Value (KV) cache, driving up memory use and slowing inference. KV cache quantization methods designed for Large Language Models do not transfer directly to Vision-Language Models because attention patterns differ across modalities.
This paper introduces AKVQ-VL, a KV cache quantization method for Vision-Language Models that accounts for the attention differences among multimodal tokens and suppresses outliers.
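To see why the KV cache becomes the bottleneck, here is a back-of-the-envelope calculation, assuming LLaVA-v1.5-7B-style dimensions (32 layers, 32 heads, head dimension 128) and an illustrative sequence length and batch size; these numbers are assumptions for illustration, not figures from the paper.

```python
# Rough FP16 KV cache size for a LLaVA-v1.5-7B-style model
# (assumed dims: 32 layers, 32 heads, head_dim 128; all numbers illustrative).
layers, heads, head_dim = 32, 32, 128
seq_len = 576 + 512          # e.g. 576 vision-patch tokens plus text tokens
batch = 8
bytes_per_elem = 2           # FP16

# K and V each store one head_dim vector per token, per head, per layer.
kv_bytes = 2 * layers * heads * head_dim * seq_len * batch * bytes_per_elem
print(f"FP16 KV cache:  {kv_bytes / 2**30:.2f} GiB")      # ~4.25 GiB
print(f"2-bit KV cache: {kv_bytes / 8 / 2**30:.2f} GiB")  # ~0.53 GiB (ignoring scales/zeros)
```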
-----
📌 AKVQ-VL introduces attention-aware quantization for Vision-Language Models, moving beyond modality-agnostic Large Language Model methods. Text-Salient Attention and Pivot-Token-Salient Attention guide adaptive bit allocation.
📌 The Walsh-Hadamard Transform is used to tackle the KV cache outlier challenge, enabling aggressive 2-bit quantization. The result is strong performance retention with a significantly reduced memory footprint.
📌 AKVQ-VL demonstrates targeted KV cache compression for Vision-Language Models. By exploiting their distinctive attention patterns, it delivers larger efficiency gains than generic quantization techniques while handling multimodal inputs more accurately.
----------
Methods Explored in this Paper 🔧:
→ The AKVQ-VL method identifies important tokens based on two attention patterns characteristic of Vision-Language Models: Text-Salient Attention and Pivot-Token-Salient Attention.
→ Text-Salient Attention is observed in early layers where text tokens receive more attention than vision tokens.
→ Pivot-Token-Salient Attention emerges in later layers where a few key tokens attract most attention.
→ Based on these patterns, AKVQ-VL applies adaptive quantization: salient tokens keep higher bit precision, while the remaining tokens are quantized to 2 bits for stronger compression (a minimal sketch follows this list).
→ To handle outliers in KV tensors, the Walsh-Hadamard Transform is applied. This transformation yields outlier-free KV caches, improving low-bit quantization.
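Below is a minimal PyTorch sketch of the adaptive bit-allocation idea, not the paper's exact implementation: the salient ratio, bit-widths, and per-token quantizer are hypothetical choices for illustration.

```python
import torch

def quantize_per_token(x, bits):
    """Asymmetric per-token uniform quantization (illustrative fake-quant)."""
    qmax = 2 ** bits - 1
    xmin = x.min(dim=-1, keepdim=True).values
    xmax = x.max(dim=-1, keepdim=True).values
    scale = (xmax - xmin).clamp(min=1e-8) / qmax
    q = ((x - xmin) / scale).round().clamp(0, qmax)
    return q * scale + xmin  # dequantized values

def adaptive_kv_quant(kv, attn, salient_ratio=0.05, high_bits=8, low_bits=2):
    """kv: [tokens, dim]; attn: [heads, queries, tokens] post-softmax weights.
    Tokens receiving the most attention keep high_bits; the rest get low_bits."""
    token_score = attn.sum(dim=(0, 1))              # total attention each token receives
    k = max(1, int(salient_ratio * kv.shape[0]))
    salient = torch.topk(token_score, k).indices
    out = quantize_per_token(kv, low_bits)          # aggressive 2-bit default
    out[salient] = quantize_per_token(kv[salient], high_bits)  # protect salient tokens
    return out
```

A real implementation would store the packed integer codes plus per-token scales rather than the dequantized values returned here.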
-----
Key Insights 💡:
→ Vision-Language Models exhibit distinct attention patterns compared to Large Language Models. These are Text-Salient Attention and Pivot-Token-Salient Attention.
→ Text tokens are prioritized in initial layers, and pivot tokens become important in later layers within Vision-Language Models.
→ KV cache quantization methods designed for Large Language Models are not optimal for Vision-Language Models because of these distinct attention patterns.
→ Walsh-Hadamard Transform effectively reduces outliers in Key-Value caches of Vision-Language Models, making low-bit quantization more effective.
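A small NumPy illustration of this effect (illustrative only; it assumes the rotated dimension is a power of two, e.g. a head dimension of 128):

```python
import numpy as np
from scipy.linalg import hadamard

def wht_rotate(x):
    """Rotate each row of x with an orthonormal Walsh-Hadamard matrix.
    The rotation spreads outlier energy across all channels, flattening
    the value distribution before low-bit quantization."""
    d = x.shape[-1]
    H = hadamard(d) / np.sqrt(d)     # orthonormal, so H @ H.T == I
    return x @ H                     # invertible: x == (x @ H) @ H.T

# Toy check: a K/V-like vector with one large outlier channel
x = 0.1 * np.random.randn(1, 128)
x[0, 7] = 50.0
print(np.abs(x).max(), np.abs(wht_rotate(x)).max())  # max magnitude drops sharply after rotation
```

Because the orthonormal Hadamard matrix is invertible, the rotation loses no information, while the quantizer sees a far flatter distribution.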
-----
Results 📊:
→ AKVQ-VL achieves comparable or improved accuracy with 2-bit quantization relative to the FP16 baseline across 12 multimodal tasks. For example, on the Scene Transition task, AKVQ-VL reaches 78.0 accuracy, exceeding FP16's 73.0.
→ AKVQ-VL reduces peak memory usage by a factor of 2.13.
→ AKVQ-VL enables 3.25 times larger batch sizes.
→ AKVQ-VL increases throughput by 2.46 times on the LLaVA-v1.5-7B model.
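For intuition on where the savings come from, a rough effective bit-width calculation follows; the group size and metadata format are assumptions for illustration, not details taken from the paper.

```python
# Effective bits per element for 2-bit quantization with per-group metadata
# (assumed group size 128 with one FP16 scale and one FP16 zero-point per group).
group_size = 128
payload_bits = 2
meta_bits = 16 + 16
effective_bits = payload_bits + meta_bits / group_size
print(effective_bits)          # 2.25 bits per element
print(16 / effective_bits)     # ~7.1x smaller KV cache than FP16
# End-to-end peak memory drops less (the reported 2.13x), presumably because
# model weights and activations remain in FP16 once the KV cache is compressed.
```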