Paper reading - "Spectra: A Comprehensive Study of Ternary, Quantized, and FP16 Language Models"
The paper, released on 17 July 2024, concludes that directly training low-bitwidth LLMs can yield models that match or outperform their higher-precision counterparts.
Post-training quantization (PTQ) has emerged as a prominent technique for alleviating memory constraints in large language model (LLM) inference. However, this approach faces significant challenges when attempting to reduce precision below 4 bits, often resulting in substantial performance degradation.
An alternative strategy involves training compressed models directly at ultra-low bitwidths, such as binary (1-bit) or ternary (~1.58-bit, three-valued) precision. This approach, known as quantization-aware training (QAT), aims to maintain model performance while drastically reducing memory requirements and computational demands.
However, several aspects of these ultra-low bitwidth models remain poorly understood:
Performance: The accuracy and capabilities of binary and ternary LLMs compared to their full-precision counterparts are not yet fully established across various tasks and model sizes.
Training dynamics: The optimization process for ultra-low bitwidth models can be challenging, and best practices for effective training are still being developed.
Scaling trends: It is unclear how the performance of binary and ternary models scales with increasing model size, dataset size, and computational resources.
--
The Spectra paper introduces a novel approach to training low-bitwidth LLMs, specifically focusing on ternary (3-value) and 4-bit quantized models.
📌 Spectra's training methodology employs a combination of techniques to enable effective training of low-bitwidth models:
Straight-through estimator (STE): This allows gradients to flow through the non-differentiable quantization operation during backpropagation (see the sketch after this list).
Stochastic rounding: Applied to continuous weights before quantization, helping maintain model expressivity.
Learnable quantization scales: Introduced for each tensor and optimized alongside model parameters, enhancing adaptability.
Progressive quantization: Gradually decreasing bitwidth during training (FP16 → 8-bit → 4-bit → ternary) to maintain stability.
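The following PyTorch sketch shows how the straight-through estimator and stochastic rounding fit together with latent full-precision weights. It is a minimal illustration, not the paper's actual implementation; the names `stochastic_round`, `RoundSTE`, and `QuantLinear` are made up for this example, and the per-tensor scale is fixed here rather than learned.

```python
import torch
import torch.nn.functional as F

def stochastic_round(x):
    """Round up or down with probability given by the fractional part of x."""
    low = torch.floor(x)
    return low + (torch.rand_like(x) < (x - low)).float()

class RoundSTE(torch.autograd.Function):
    """Straight-through estimator: quantize in the forward pass,
    treat the operation as the identity in the backward pass."""

    @staticmethod
    def forward(ctx, x):
        return stochastic_round(x)

    @staticmethod
    def backward(ctx, grad_out):
        return grad_out  # gradient flows through unchanged

class QuantLinear(torch.nn.Linear):
    """Linear layer that keeps latent full-precision weights; only the
    forward pass sees the quantized copy, so the optimizer updates the latents."""

    def forward(self, x):
        scale = self.weight.abs().mean() + 1e-8   # a learnable scale in practice
        w_q = RoundSTE.apply(self.weight / scale) * scale
        return F.linear(x, w_q, self.bias)
```

Swapping a standard linear layer for something like `QuantLinear` is the basic mechanism; a progressive schedule (FP16 → 8-bit → 4-bit → ternary) then only changes which quantizer the forward pass calls.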
📌 The ternary quantization scheme uses three values: -1, 0, and 1. The quantization function Q(w) is defined as:
Q(w) = sign(w) * (|w| > threshold)
Where the threshold is typically set to 0.7 * E[|w|], with E[|w|] being the expected absolute value of weights in a given tensor.
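A direct transcription of this scheme might look like the following (illustrative only; `ternary_quantize` is a name chosen for this sketch):

```python
import torch

def ternary_quantize(w: torch.Tensor) -> torch.Tensor:
    """Map a weight tensor onto {-1, 0, +1} using a per-tensor threshold."""
    threshold = 0.7 * w.abs().mean()                  # 0.7 * E[|w|]
    return torch.sign(w) * (w.abs() > threshold).float()

w = torch.randn(3, 4)
print(ternary_quantize(w))   # entries are only -1.0, 0.0, or 1.0
```

In practice, the learnable per-tensor scales mentioned above multiply the ternary values to restore the original weight magnitudes.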
📌 For 4-bit quantization, Spectra uses a uniform symmetric scheme with evenly spaced values between -1 and 1. The quantization function is:
Q(w) = round(w * (2^(bits-1) - 1)) / (2^(bits-1) - 1)
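A sketch of this mapping, assuming weights have already been scaled into [-1, 1] (the clamp is an added safeguard, not part of the formula above):

```python
import torch

def uniform_quantize(w: torch.Tensor, bits: int = 4) -> torch.Tensor:
    """Uniform symmetric quantization onto evenly spaced levels in [-1, 1]."""
    levels = 2 ** (bits - 1) - 1                      # 7 for 4-bit
    return torch.round(w.clamp(-1, 1) * levels) / levels

w = torch.empty(3, 4).uniform_(-1, 1)
print(uniform_quantize(w))   # each value is a multiple of 1/7
```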
📌 The study compares model sizes ranging from 99M to 3.9B parameters across different precisions: FP16 (FloatLMs), post-training quantized 8-bit and 4-bit variants (QuantLMs), and ternary (TriLMs). Results show that directly trained low-bitwidth models often match or exceed the performance of higher-precision counterparts.
📌 Memory analysis reveals significant reductions for low-bitwidth models (see the back-of-envelope calculation after this list):
Ternary models: roughly 8-10x memory savings compared to FP16 (each ternary weight needs about 1.6-2 bits instead of 16)
4-bit models: 4x memory savings compared to FP16
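A rough back-of-envelope calculation of weight-storage size (the bits-per-weight values are packing assumptions for illustration, not figures taken from the paper):

```python
def model_size_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB, ignoring activations and KV cache."""
    return n_params * bits_per_weight / 8 / 1e9

# Roughly a 3.9B-parameter model at different packings.
for name, bits in [("FP16", 16), ("8-bit", 8), ("4-bit", 4),
                   ("ternary, 2-bit packing", 2),
                   ("ternary, ~1.58-bit theoretical", 1.58)]:
    print(f"{name:<32} {model_size_gb(3.9e9, bits):5.2f} GB")
```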
📌 The research explores the impact of different activation functions (ReLU, GELU) and normalization techniques (LayerNorm, RMSNorm) on low-bitwidth model performance.
Key findings:
RMSNorm consistently outperforms LayerNorm in ternary and 4-bit models, suggesting that RMSNorm may be particularly effective for compressed neural networks.
GELU activation shows better performance than ReLU in most cases.
📌 Spectra models demonstrate strong few-shot learning capabilities, often matching or surpassing FP16 baselines across various tasks. This suggests that directly trained low-bitwidth models can retain the generalization abilities of their higher-precision counterparts.
📌 The study investigates the robustness of low-bitwidth models to distribution shifts using the MMLU benchmark. Results indicate that ternary and 4-bit models maintain competitive performance across diverse domains, showcasing their adaptability.
📌 A key finding is that TriLM 3.9B, despite being smaller in bit-size than the half-precision FloatLM 830M, matches the performance of FloatLM 3.9B in commonsense reasoning and knowledge benchmarks. This demonstrates the potential of directly trained low-bitwidth models to achieve high performance with significantly reduced memory footprint.
📌 However, the study also notes that TriLM 3.9B exhibits similar levels of toxicity and stereotyping as FloatLM 3.9B, indicating that ethical concerns persist even in compressed models.
📌 Interestingly, while TriLM 3.9B lags behind FloatLM 3.9B in perplexity on validation splits and web-based corpora, it performs better on less noisy datasets like LAMBADA and Penn Treebank. This suggests that low-bitwidth models may have unique strengths in certain domains or data types.
In conclusion, the Spectra study demonstrates that directly training low-bitwidth LLMs can yield models that match or outperform their higher-precision counterparts while offering significant memory savings. This approach shows promise for enabling the deployment of large, capable language models in resource-constrained environments, potentially broadening the accessibility and applicability of LLMs across various domains and devices.
BONUS Discussion
The key differences between RMSNorm and LayerNorm are:
📌 Computational simplicity: RMSNorm is computationally simpler than LayerNorm. It removes the mean-centering operation present in LayerNorm, only normalizing using the root mean square (RMS) statistic. This simplification leads to improved efficiency.
📌 Re-centering invariance: LayerNorm provides both re-scaling and re-centering invariance, while RMSNorm only offers re-scaling invariance. However, studies suggest that removing the re-centering operation in RMSNorm does not negatively impact model stability.
📌 Normalization approach: LayerNorm normalizes both the mean and variance of inputs, while RMSNorm only normalizes using the RMS, effectively scaling inputs to a √n-scaled unit sphere.
📌 Performance in low-bitwidth models: The paper indicates that RMSNorm consistently outperforms LayerNorm in ternary and 4-bit quantized models. This suggests RMSNorm may be particularly effective for compressed neural networks.
📌 Training stability: RMSNorm is reported to stabilize training in deeper architectures more effectively than LayerNorm, which can be particularly beneficial for large language models.
📌 Efficiency gains: While actual speed improvements vary depending on implementation details, hardware, and model architecture, RMSNorm has been observed to provide speedups ranging from 7% to 64% across different models and implementations.
📌 Batch independence: Both RMSNorm and LayerNorm are preferred over BatchNorm in certain scenarios, particularly in distributed settings with smaller batch sizes, as they don't depend on batch statistics and don't require synchronization across devices.
All these differences make RMSNorm an attractive alternative to LayerNorm, especially in scenarios where computational efficiency and training stability in deep networks are crucial, such as in large language models and low-bitwidth quantized networks.
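For concreteness, here is a minimal side-by-side sketch of the two normalizations in PyTorch (simplified; in practice one would use `torch.nn.LayerNorm` and an existing RMSNorm implementation):

```python
import torch
import torch.nn as nn

class LayerNorm(nn.Module):
    """y = (x - mean) / sqrt(var + eps) * g + b  -- re-centering AND re-scaling."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.g = nn.Parameter(torch.ones(dim))
        self.b = nn.Parameter(torch.zeros(dim))
        self.eps = eps

    def forward(self, x):
        mu = x.mean(dim=-1, keepdim=True)
        var = x.var(dim=-1, keepdim=True, unbiased=False)
        return (x - mu) / torch.sqrt(var + self.eps) * self.g + self.b

class RMSNorm(nn.Module):
    """y = x / rms(x) * g  -- no mean subtraction, re-scaling invariance only."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.g = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x / rms * self.g
```

Dropping the mean and bias terms is exactly where RMSNorm's efficiency gain and re-scaling-only invariance come from.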
Connect with me on Twitter, LinkedIn, or my YouTube channel