Table of Contents
FP8 vs FP16/BF16/INT8: Theoretical Advantages
Performance and Efficiency Gains (2024-2025 Findings)
FP8 Implementation on Modern Hardware
Impact on LLM Precision, Stability, and Training Dynamics
Challenges and Limitations in Large-Scale Deployment
FP8 vs FP16/BF16/INT8: Theoretical Advantages
FP8 (8-bit floating point) offers significant efficiency gains over 16-bit formats and INT8. With only 8 bits per value, FP8 halves memory usage compared to FP16/BF16, and hardware can double math throughput (two FP8 operations for the cost of one 16-bit operation) (COAT: Compressing Optimizer states and Activation for Memory-Efficient FP8 Training, https://arxiv.org/html/2410.19313v1). This lowers computational overhead and power consumption while improving memory efficiency, especially at scale (An Investigation of FP8 Across Accelerators for LLM Inference, https://arxiv.org/html/2502.01070v1). Unlike INT8 (a fixed-point 8-bit format), FP8 retains an exponent field that expands the representable dynamic range. This floating-point range helps preserve model accuracy and stability better than pure INT8 quantization, making FP8 a preferred low-precision format for large models (An Investigation of FP8 Across Accelerators for LLM Inference, https://arxiv.org/html/2502.01070v1). On some accelerators, FP8 arithmetic is reported to be twice as fast as BF16, combining speed and efficiency with minimal accuracy loss (An Investigation of FP8 Across Accelerators for LLM Inference, https://arxiv.org/html/2502.01070v1). In summary, FP8 promises half the memory footprint and up to 2× the throughput of BF16/FP16 while retaining a floating-point dynamic range that pure INT8 lacks (COAT: Compressing Optimizer states and Activation for Memory-Efficient FP8 Training, https://arxiv.org/html/2410.19313v1; An Investigation of FP8 Across Accelerators for LLM Inference, https://arxiv.org/html/2502.01070v1).
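To make the storage and range trade-offs concrete, the short Python snippet below tabulates bits per value and the largest representable magnitude for the formats just discussed. The FP8 numbers follow the commonly cited OCP E4M3/E5M2 definitions; the snippet is purely illustrative.

```python
# Illustrative only: bits per value and largest representable magnitude.
# FP8 numbers follow the OCP E4M3/E5M2 definitions; INT8 has no exponent
# field, so its effective range comes entirely from an external scale factor.
formats = {
    "FP16":     (16, 65504.0),
    "BF16":     (16, 3.4e38),
    "FP8-E4M3": (8,  448.0),     # 4 exponent bits, 3 mantissa bits
    "FP8-E5M2": (8,  57344.0),   # 5 exponent bits, 2 mantissa bits
    "INT8":     (8,  127),
}
for name, (bits, max_val) in formats.items():
    print(f"{name:9s}: {bits:2d} bits/value, max magnitude ~ {max_val:.3g}")
```

The table makes the key contrast visible: both FP8 variants keep a floating-point range far wider than INT8's fixed integer span, while using the same single byte of storage.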
Performance and Efficiency Gains (2024–2025 Findings)
Recent research in 2024–2025 validates the practical speedups and memory savings of FP8 for LLMs. For instance, Fishman et al. trained a 7B-parameter Llama 2 model entirely in FP8 and achieved on-par accuracy with a BF16 baseline while improving training throughput by ~34% (Scaling FP8 training to trillion-token LLMs). The COAT framework similarly demonstrated a 1.43× end-to-end speedup and a 1.54× reduction in memory usage versus BF16 when training large models, by pushing optimizer states and activations to FP8 (COAT: Compressing Optimizer states and Activation for Memory-Efficient FP8 Training). These gains translate to practical benefits: smaller memory footprints allow larger batch sizes or models on the same hardware, and higher FLOP throughput shortens training times. Gains are not limited to training; inference sees boosts as well. Kim et al. show that using FP8 on the Intel Gaudi2 and NVIDIA H100 yields superior throughput-per-watt for LLM inference, with Gaudi2's FP8 mode delivering better efficiency than the H100 in certain scenarios (An Investigation of FP8 Across Accelerators for LLM Inference). Across studies, FP8 consistently delivers roughly 30–50% faster training and inference for large models with negligible or manageable accuracy loss, confirming its value for efficiency.
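Where the memory headroom comes from is easy to see with back-of-envelope arithmetic. The sketch below uses a hypothetical 7B-parameter model and counts only the weights, ignoring optimizer states, activations, and framework overheads.

```python
# Rough weight-memory arithmetic for a hypothetical 7B-parameter model.
# Illustrative only: optimizer states, activations, and overheads are ignored.
params = 7e9
for fmt, bytes_per_param in [("FP32", 4), ("BF16/FP16", 2), ("FP8/INT8", 1)]:
    print(f"{fmt:10s}: {params * bytes_per_param / 2**30:5.1f} GiB of weights")
# Halving weight (and activation) storage relative to BF16 is what frees
# headroom for larger batch sizes or larger models on the same device.
```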
FP8 Implementation on Modern Hardware
Accelerator Support: Modern AI accelerators explicitly support FP8 to capitalize on these benefits. NVIDIA's Hopper H100 GPU introduced Tensor Core support for FP8 (in two standard formats, E4M3 and E5M2) and treats FP8 as the next-generation precision for training. Intel's Habana Gaudi2 also implements FP8 matrix units and was used to perform full FP8 training runs for LLMs. AMD's Instinct MI300 series likewise includes FP8 matrix core support (including the wider-range 5-bit-exponent, 2-bit-mantissa format used in training) as part of the industry's FP8 standardization (Large Language Model Inference Acceleration: A Comprehensive Hardware Perspective). Even custom AI chips are embracing FP8: for example, Tesla's Dojo system uses a custom 8-bit floating point variant (termed "CFP") that diverges from the conventional formats (HERE). In general, FP8 arithmetic units are now common in high-end GPUs/HPUs (H100, MI300, Gaudi2, etc.), and vendors have coalesced around two FP8 formats (E4M3 and E5M2) to balance precision and range for inference vs. training needs.
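As a quick look at how these two formats surface to users, recent PyTorch releases (2.1 or later, an assumption about the reader's environment) expose both OCP FP8 variants as storage dtypes. The sketch below only inspects and casts to them; actual FP8 GEMMs are dispatched to vendor kernels (e.g. through Transformer Engine) rather than plain Python operators.

```python
import torch

# Inspect the two standardized FP8 formats exposed as PyTorch storage dtypes
# (assumes PyTorch >= 2.1). Only casting is shown; FP8 matmuls need vendor kernels.
for dtype in (torch.float8_e4m3fn, torch.float8_e5m2):
    print(dtype, "max representable:", torch.finfo(dtype).max)

w = torch.randn(16, 16)
w_fp8 = w.to(torch.float8_e4m3fn)                            # E4M3: weights/activations
g_fp8 = (torch.randn(16, 16) * 1e-4).to(torch.float8_e5m2)   # E5M2: gradients (wider range)
print(w_fp8.element_size(), "byte per element vs", w.element_size(), "bytes for FP32")
```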
Software and Scaling: Using FP8 in practice requires careful handling of scale. Deep learning libraries (e.g. NVIDIA's Transformer Engine) manage per-tensor scaling factors so that values fit into the 8-bit range without overflow (An Investigation of FP8 Across Accelerators for LLM Inference). Typically, activations and weights use one FP8 format (e.g. E4M3) while gradients use a wider-range format (E5M2) to accommodate their higher dynamic range. Hardware vendors provide APIs for FP8 GEMMs, but the software must choose appropriate scaling for each layer or tensor group. Early FP8 training approaches relied on dynamic loss scaling (similar to FP16 training) to avoid under/overflow, incurring some overhead. Newer methods like μnit Scaling (Databricks, 2025) eliminate dynamic scaling by analytically adjusting initialization and layer gains, enabling straightforward FP8 training without special hyperparameters. In summary, modern hardware plus dedicated libraries now allow FP8 computation end-to-end, but success hinges on managing the numeric range via scaling strategies.
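To give a rough idea of what per-tensor scaling involves (a simplified sketch, not Transformer Engine's actual implementation), the snippet below scales a tensor so that its absolute maximum maps onto the FP8 format's maximum, casts to FP8, and later divides the scale back out. The function names are hypothetical.

```python
import torch

def fp8_quantize(t: torch.Tensor, fp8_dtype=torch.float8_e4m3fn):
    """Per-tensor symmetric scaling into FP8 (a simplified sketch, not a library API)."""
    fp8_max = torch.finfo(fp8_dtype).max            # 448 for E4M3, 57344 for E5M2
    amax = t.abs().max().clamp(min=1e-12)           # guard against all-zero tensors
    scale = fp8_max / amax.float()                  # keep the scale in higher precision
    t_fp8 = (t.float() * scale).clamp(-fp8_max, fp8_max).to(fp8_dtype)
    return t_fp8, scale

def fp8_dequantize(t_fp8: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return t_fp8.to(torch.float32) / scale

x = torch.randn(4, 1024, dtype=torch.bfloat16) * 5
x_fp8, scale = fp8_quantize(x)
err = (x.float() - fp8_dequantize(x_fp8, scale)).abs().max()
print("max quantization error:", err.item())
```

In a real framework the scale factors are tracked per tensor (and per step), stored alongside the FP8 data, and folded into the GEMM epilogue rather than applied in Python.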
Impact on LLM Precision, Stability, and Training Dynamics
Adopting FP8 for LLMs requires addressing numeric precision challenges to maintain training stability. Prior experience showed FP16 training was less stable than BF16 (due to FP16's narrower exponent), so there were concerns that FP8's even lower precision could destabilize large-scale training (To FP8 and Back Again: Quantifying the Effects of Reducing Precision on LLM Training Stability). Indeed, early FP8 training experiments found that, without adjustments, models could diverge. For example, an FP8-trained 7B LLM exhibited sudden loss spikes after ~200B tokens, traced to outlier activation values (from the SwiGLU activation) exceeding FP8's range (Scaling FP8 training to trillion-token LLMs). Fishman et al. reported that this outlier amplification caused divergence in long FP8 runs and introduced a modified activation (Smooth-SwiGLU) to clamp extremes, restoring stable loss curves. More generally, reducing precision tends to increase training instability gradually: Lee et al. show a monotonic decrease in stability as bit-width drops, with the risk of divergence rising sharply at 8-bit (To FP8 and Back Again: Quantifying the Effects of Reducing Precision on LLM Training Stability). Notably, losing exponent bits has a larger impact on model convergence than losing mantissa bits, underscoring the importance of dynamic range. This insight is reflected in the choice of the wider-exponent E5M2 format for gradients during training, which allocates extra exponent bits to safeguard range.

Recent works demonstrate that with proper techniques (e.g. tuned optimizers, scaled initialization, or custom layers), FP8 can reach parity with BF16 training in final accuracy. However, FP8 training may require more careful hyperparameter tuning (e.g. learning rates and seed robustness) to avoid instabilities. In inference applications, FP8 quantization of a pre-trained FP16/BF16 model can incur a small accuracy drop if done naively, but research into FP8 post-training quantization shows that with calibration and per-layer scaling one can achieve minimal precision loss, comparable to INT8 quantization methods and with the benefit of native FP8 hardware support. Overall, FP8 can maintain LLM precision and convergence if training dynamics are managed via appropriate scaling, architecture tweaks, and sometimes slight hyperparameter adjustments.
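To give a flavor of the kind of activation-level intervention involved, the sketch below bounds the output of a SwiGLU-style gate before it would be cast to FP8. This is a hypothetical illustration of outlier control, not the Smooth-SwiGLU construction from Fishman et al.

```python
import torch
import torch.nn.functional as F

def swiglu(x, w_gate, w_up):
    # SwiGLU-style gate: the product of the two branches can occasionally
    # produce rare, very large activations late in training.
    return F.silu(x @ w_gate) * (x @ w_up)

def bounded_swiglu(x, w_gate, w_up, bound=240.0):
    # Hypothetical outlier control: keep the result comfortably inside the
    # E4M3 range (max 448) before any downstream FP8 cast. Not the paper's
    # Smooth-SwiGLU; shown only to illustrate the failure mode being addressed.
    return swiglu(x, w_gate, w_up).clamp(-bound, bound)

x = torch.randn(2, 32)
w_gate, w_up = torch.randn(32, 64), torch.randn(32, 64)
print(bounded_swiglu(x, w_gate, w_up).abs().max().item())
```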
Challenges and Limitations in Large-Scale Deployment
Despite its promise, FP8 comes with several challenges that researchers are actively addressing:
Training Stability & Robustness: As noted, FP8 training initially was "not robust enough" to serve as a drop-in replacement for BF16 (To FP8 and Back Again: Quantifying the Effects of Reducing Precision on LLM Training Stability). Without special care, FP8 models might require frequent restarts or hyperparameter retuning, negating the speed benefits. Ensuring stability over trillion-token runs demanded new solutions (e.g. Smooth-SwiGLU for outlier suppression (Scaling FP8 training to trillion-token LLMs), or μnit Scaling to maintain unit variance). This adds development complexity, as models or training procedures may need modification when moving to FP8.
Outlier and Range Handling: LLMs occasionally produce activation outliers or extremely small gradient updates that 8-bit precision cannot represent. These edge-case values can cause overflow/underflow in FP8, destabilizing training. Techniques like adaptive scaling, clipping, or using higher precision just for problematic layers (e.g. accumulators, norm layers) are often needed to mitigate this (see the amax-tracking sketch after this list). It remains an art to identify and handle all such cases for a new model.
Software/Hardware Ecosystem Maturity: Managing FP8 entails additional complexity in software frameworks. Dynamic scaling of tensors, while effective, introduces overhead and complicates distributed training and checkpointing (due to per-step scale metadata). Newer static scaling methods simplify this, but they are still maturing. On the hardware side, FP8 capabilities are currently limited to the latest accelerators: large clusters must upgrade to H100/MI300-class GPUs or equivalent to fully benefit. This hardware dependence can slow adoption for some organizations.
Cross-Platform Variability: Unlike IEEE standard FP32/FP16, the FP8 format and its usage can vary across vendors. Different chips may implement FP8 with slight variations in exponent bias, scaling approach, or accumulation precision. For example, Tesla’s Dojo uses a non-standard FP8 format (CFP) that deviates from the common E4M3/E5M2 schemes (HERE). This lack of strict uniformity means an FP8-trained model might not behave exactly the same on all hardware, and careful validation is needed when deploying across platforms. Efforts like the OCP standard seek to unify FP8 representations, but until widely adopted, portability is a consideration.
Extreme Quantization Limits: Pushing below FP8 (to 4-bit or binary formats) severely hurts model quality or requires radical changes in architecture. FP8 sits at the edge of feasible precision for LLM training – going lower typically relegates usage to fixed-weight inference or requires sophisticated quantization-aware training. Thus FP8 is likely the floor for general-purpose training precision for now. Any further gains in efficiency may need to come from algorithmic sparsity, model compression, or architectural innovation rather than bit-width reduction alone. In large-scale deployments, this means FP8 is a sweet spot but also a hard limit; one cannot keep halving precision without trade-offs.
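To illustrate the adaptive range handling mentioned in the "Outlier and Range Handling" item above, here is a hedged sketch of history-based (delayed) scaling: the FP8 scale for the next step is derived from a short window of observed absolute maxima, so the range choice does not swing wildly on a single outlier step. Class and parameter names are illustrative, not any particular library's API.

```python
import torch

class AmaxTracker:
    """Sketch of history-based (delayed) scaling for FP8; names are illustrative."""
    def __init__(self, fp8_max: float = 448.0, history_len: int = 16):
        self.fp8_max = fp8_max          # E4M3 maximum
        self.history_len = history_len
        self.history = []

    def next_scale(self, t: torch.Tensor) -> float:
        # Record this step's absolute maximum and derive the scale from the
        # most conservative value seen over the recent window.
        self.history.append(t.abs().max().item())
        self.history = self.history[-self.history_len:]
        return self.fp8_max / max(max(self.history), 1e-12)

tracker = AmaxTracker()
for step in range(4):
    act = torch.randn(4096) * (100.0 if step == 2 else 1.0)   # step 2: synthetic outlier
    print(f"step {step}: fp8 scale = {tracker.next_scale(act):.3f}")
```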
In summary, FP8 enables substantial efficiency improvements for LLM training and inference, cutting memory use, increasing speed, and even reducing power per token, as evidenced by the latest 2024/2025 studies (COAT: Compressing Optimizer states and Activation for Memory-Efficient FP8 Training). Its successful use at scale (training multi-billion-parameter models on trillions of tokens) has been demonstrated, but with important caveats: careful engineering is needed to preserve numerical stability and model quality (Scaling FP8 training to trillion-token LLMs). The newest hardware (NVIDIA H100, AMD MI300, Intel Gaudi2, etc.) and libraries now provide the tools to leverage FP8, making it a viable option for cutting-edge LLM development. Ongoing research continues to refine FP8 training techniques and address outstanding limitations, paving the way for broader adoption of 8-bit floating point in large-scale AI deployments.