Emmett Fear

How can using FP16, BF16, or FP8 mixed precision speed up my model training?

Training deep learning models can be painfully slow when using standard 32-bit floating point (FP32) precision. Many developers wonder if there’s a way to train models faster without buying more GPUs or cutting down the model size. Mixed precision training – using lower-precision number formats like 16-bit or even 8-bit floats – is one of the most effective techniques to boost training speed. By performing computations in half-precision (16-bit) or smaller formats, you can significantly accelerate training and reduce memory usage while still maintaining model accuracy. In this article, we’ll explain what mixed precision is, dive into the differences between FP16, BF16, and FP8 formats, discuss how they improve training efficiency (and what the trade-offs are), and show how you can leverage these techniques on Runpod’s cloud GPUs to supercharge your model training.

Ready to accelerate your training? 🚀 You can get started with mixed precision on high-performance cloud GPUs by signing up for Runpod.

What is Mixed Precision Training?

Mixed precision training means using a mix of numerical precisions (different bit-widths) during model training. Instead of doing all calculations in 32-bit float (FP32), you use lower-precision formats (like FP16 or BF16) for most operations to gain speed, but keep some critical parts in FP32 to preserve numerical stability. In practice, this often involves using 16-bit floats for forward and backward passes (matrix multiplications, convolutions, etc.) and accumulating results (like gradients or weight updates) in 32-bit. This hybrid approach can drastically speed up computation and cut memory usage, because 16-bit operations run faster on modern GPUs and 16-bit values consume half the memory of 32-bit values.

Modern deep learning frameworks make mixed precision easy to use. For example, libraries like PyTorch and TensorFlow provide automatic mixed precision (AMP) features that will handle casting to half precision where safe and keep sensitive calculations in full precision. This means you typically only need to add a couple of lines or context managers in your code – no extensive rewriting – to start benefiting from mixed precision. Under the hood, the framework will use specialized hardware instructions (Tensor Cores on NVIDIA GPUs, for instance) to execute 16-bit operations at high speed, while avoiding or correcting any numerical issues (like gradient underflow) that could arise from the lower precision. The result is faster training times and the ability to train larger models or use bigger batch sizes on the same hardware.
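To make that concrete, here is a minimal sketch of what automatic mixed precision looks like in PyTorch. The toy model, optimizer, and random data are placeholders chosen for illustration – in a real project you would drop the autocast context and GradScaler calls into your existing training loop.

```python
import torch
from torch import nn

# Minimal AMP sketch with a toy model and random data; swap in your own
# model, dataset, and loss function in practice.
device = "cuda"
model = nn.Sequential(nn.Linear(512, 1024), nn.ReLU(), nn.Linear(1024, 10)).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
scaler = torch.cuda.amp.GradScaler()  # manages loss scaling for FP16

for step in range(100):
    x = torch.randn(64, 512, device=device)
    y = torch.randint(0, 10, (64,), device=device)
    optimizer.zero_grad(set_to_none=True)

    # Ops inside autocast run in half precision where it is safe to do so;
    # numerically sensitive ops stay in FP32 automatically.
    with torch.cuda.amp.autocast():
        loss = nn.functional.cross_entropy(model(x), y)

    scaler.scale(loss).backward()  # scale the loss so tiny gradients don't underflow
    scaler.step(optimizer)         # unscales gradients, skips the step if inf/NaN appear
    scaler.update()                # adjusts the scale factor for the next iteration
```

The only changes relative to a plain FP32 loop are the autocast context and the three scaler calls.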

At a high level, mixed precision works because many deep learning computations don’t require full 32-bit precision. Neural network training is generally tolerant to a bit of noise or rounding error – a model can still converge with slightly lower precision for weights and activations. By using 16-bit or 8-bit numbers where possible, you get more operations per second and pack more data into GPU memory, without noticeably hurting accuracy in most cases. In fact, mixed precision has become a standard practice for training large models today, as it offers a sweet spot between speed and accuracy.

FP16 vs BF16 vs FP8 Explained

When we talk about “lower precision” in deep learning, there are a few different formats to know. Let’s break down the key players: FP16, BF16, and the newer FP8. Each is a floating-point format with fewer bits than FP32, but they differ in how those bits are allocated to the number’s components (sign, exponent, and mantissa). This affects their range, precision, and suitability for training.

FP16 (Half-Precision 16-bit Floating Point)

FP16 is a 16-bit floating point format (sometimes called “half precision”). It allocates 1 bit for the sign, 5 bits for the exponent, and 10 bits for the fraction (mantissa). This gives FP16 a much narrower dynamic range than FP32 – the largest and smallest values it can represent are around 6.55e4 and 6.10e-5 (in normalized form) for positive values. In simpler terms, FP16 can “overflow” or “underflow” with values that FP32 could handle easily. It also has a precision of about 3 decimal digits (because of the 10-bit mantissa).
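If you want to sanity-check those limits yourself, PyTorch exposes them through torch.finfo. This is purely informational – nothing here is required for training:

```python
import torch

# Inspect FP16's numeric limits directly; the values match those quoted above.
f16 = torch.finfo(torch.float16)
print(f16.max)   # 65504.0   -> largest representable value (~6.55e4)
print(f16.tiny)  # ~6.10e-05 -> smallest positive normalized value
print(f16.eps)   # ~9.77e-04 -> spacing at 1.0, i.e. roughly 3 decimal digits of precision
```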

Why use FP16 then? Speed and memory. FP16 operations can run significantly faster on GPUs that support them, because specialized units (Tensor Cores on NVIDIA Volta/Turing/Ampere GPUs and beyond) can execute matrix math in half precision twice (or more) as fast as in single precision. For example, NVIDIA’s A100 GPU delivers up to 312 TFLOPS of tensor compute using FP16/BF16, versus ~19.5 TFLOPS in FP32 – that’s an enormous jump in throughput. FP16 values also take half the memory of FP32, so you can fit larger batches or models in GPU memory, which further boosts training efficiency. FP16 was one of the first lower-precision formats widely adopted in deep learning, and it’s supported by almost all modern training hardware and frameworks.

The downside is FP16’s limited range and precision, which can cause numerical instability if used blindly. In practice, techniques like loss scaling are used to prevent tiny gradient values from underflowing to zero in FP16. Additionally, frameworks often maintain an FP32 master copy of weights behind the scenes, applying small updates in FP32 even if the forward/backward passes use FP16. This way, you get the speed of FP16 without sacrificing convergence. It requires a bit of care, but the good news is libraries handle most of this automatically nowadays. As long as your GPU has half-precision support (which all NVIDIA RTX cards and data center GPUs from the last few generations do), FP16 mixed precision is usually a safe and big win for training speed.
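Here is a conceptual sketch of what the framework does for you: an FP32 master copy of the weights, an FP16 copy for the fast math, and a scaled loss so tiny gradients survive the backward pass. The shapes and scale factor are arbitrary illustration values – in practice torch.cuda.amp handles all of this automatically.

```python
import torch

# Conceptual sketch of "FP32 master weights + loss scaling"; AMP automates this.
device = "cuda"
master_w = torch.randn(1024, 10, device=device, requires_grad=True)  # FP32 master copy
x = torch.randn(64, 1024, device=device)
target = torch.randn(64, 10, device=device)
loss_scale = 2.0 ** 14

w16 = master_w.half()                            # FP16 copy used for the fast matrix math
loss = ((x.half() @ w16).float() - target).pow(2).mean()

(loss * loss_scale).backward()                   # scaled backward: tiny grads stay representable
with torch.no_grad():
    master_w -= 1e-3 * master_w.grad / loss_scale  # unscale, then update the FP32 master copy
    master_w.grad = None
```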

BF16 (Brain Floating Point 16-bit)

BF16 (Bfloat16) is another 16-bit floating point format, originally popularized by Google Brain. It’s the same size as FP16 (16 bits total), but the key difference is how those bits are split: BF16 uses 1 sign bit, 8 bits for the exponent, and only 7 bits for the mantissa. In other words, BF16 has the same exponent size as FP32. This gives it a dynamic range nearly equal to FP32’s (~3.4e38), which is vastly larger than FP16’s range. The trade-off is that BF16 has fewer precision bits (only 7 mantissa bits, roughly 2–3 decimal digits of precision).

Why is BF16 useful? The large exponent range means you can avoid the tricky part of FP16 training – you rarely need loss scaling with BF16, because it can represent extremely small or large gradients without underflow/overflow. Essentially, BF16 “just works” for training in many cases, making it very attractive. The reduced precision (7 bits) might sound worrisome, but in practice it’s usually enough for deep learning. The stochastic nature of training and techniques like batch norm, dropout, etc., mean that a tiny bit of added noise from lower precision doesn’t derail learning. Researchers found that models often reach similar final accuracy with BF16 as with FP32, as long as critical parts (like final reductions or certain normalization steps) are in higher precision.

Hardware-wise, widespread support for BF16 came a bit later than FP16. NVIDIA’s Ampere architecture (A100 GPUs and RTX 30-series) introduced fast BF16 support on tensor cores, putting BF16 on equal footing with FP16 in terms of speed. That means on an A100 or H100, you can use either FP16 or BF16 and get a similar boost in throughput. Many frameworks actually prefer BF16 by default now on capable hardware (e.g., PyTorch will use BF16 in AMP if available, because it’s more numerically stable). If your GPU supports it, BF16 is often the easiest way to do mixed precision training – you get the speed benefits without having to worry about manual loss scaling. On the downside, if your GPU is older (pre-2020 architectures or some consumer cards prior to Ampere), it might not support BF16. In those cases, FP16 with loss scaling remains the fallback.
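A common pattern, sketched below, is to prefer BF16 when the GPU supports it and fall back to FP16 plus loss scaling otherwise. The helper calls used here (torch.cuda.is_bf16_supported, autocast’s dtype argument, GradScaler’s enabled flag) are standard PyTorch APIs; the matrix sizes are just for illustration.

```python
import torch

# Pick the best available 16-bit dtype: BF16 on Ampere/Hopper-class GPUs,
# otherwise fall back to FP16 plus loss scaling.
use_bf16 = torch.cuda.is_available() and torch.cuda.is_bf16_supported()
amp_dtype = torch.bfloat16 if use_bf16 else torch.float16

# Loss scaling is only needed for FP16; BF16's wide range makes it unnecessary.
scaler = torch.cuda.amp.GradScaler(enabled=(amp_dtype == torch.float16))
print(f"autocast dtype: {amp_dtype}, loss scaling enabled: {scaler.is_enabled()}")

a = torch.randn(2048, 2048, device="cuda")
with torch.cuda.amp.autocast(dtype=amp_dtype):
    b = a @ a          # runs on Tensor Cores in the chosen 16-bit dtype
print(b.dtype)         # torch.bfloat16 or torch.float16
```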

FP8 (8-bit Floating Point)

FP8 is the new kid on the block. As the name implies, it’s an 8-bit floating point format – literally using only 8 bits to represent a number. That’s a quarter the size of a single-precision float! Obviously, with so few bits, you can’t have much precision or range in one FP8 number. To make it viable, NVIDIA defined two specific FP8 formats in its recent Hopper architecture (H100 GPUs): E4M3 (4 exponent bits, 3 mantissa bits) and E5M2 (5 exponent bits, 2 mantissa bits). These are used for different purposes: E4M3 has one fewer exponent bit (so smaller range, ±448) but an extra mantissa bit (a bit more precision), whereas E5M2 has a wider range (up to ±57344) but only 2 bits of precision. The idea is to use the E4M3 format for forward passes (where having a tiny bit more precision helps) and E5M2 for gradient computations (where values can grow large, so range matters more). Even so, 2–3 mantissa bits is extremely low precision – by itself not enough to train a model from scratch without significant errors.

So how is FP8 used? The key is pairing it with higher precision in a smart way. NVIDIA’s H100 GPUs introduce a Transformer Engine that can automatically switch between FP8 and FP16 on the fly. The engine uses FP8 for the bulk of matrix multiplications and tensor ops to get maximum speed, but it falls back to FP16 or FP32 for operations where FP8 would cause too much error (or it keeps a higher-precision copy of certain values). This hybrid approach is basically an extension of mixed precision: now you have three levels (FP8, FP16, FP32) working together. When done right, the speedups are massive – NVIDIA reported up to 4× faster GPT-3 training just by using H100s instead of A100s, and up to 6–9× faster in highly optimized cases that fully leverage FP8 and other Hopper features. Importantly, they achieved this “while maintaining model accuracy.” In other words, FP8 (plus some magic under the hood) can train big models to the same quality as traditional methods, in a fraction of the time.

However, FP8 is bleeding edge. Only the latest hardware (NVIDIA H100, and emerging platforms like the upcoming H200 or other specialized accelerators) supports it, and you typically need specific software libraries to make use of it (like NVIDIA’s transformer engine integration in frameworks). The community is still experimenting with FP8 – it’s not yet a drop-in replacement for FP16/BF16. You should think of FP8 as an advanced optimization for those who have access to cutting-edge GPUs and want to push the limits of training speed. Over time, we expect frameworks to make FP8 more accessible, and more hardware vendors will likely adopt similar 8-bit float support because the appeal of doubling speed again is huge. On Runpod, you can already access GPUs like the NVIDIA H100 through our cloud GPU platform, which means you can experiment with FP8 training without having to buy an H100 yourself. It’s an exciting area to watch (especially for large language models), but for most developers today, FP16 or BF16 mixed precision will be the go-to method.
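For the curious, here is roughly what FP8 training with Transformer Engine looks like in PyTorch. Treat it as a sketch: it assumes an H100-class GPU with the transformer_engine package installed, and the exact API (module names, recipe options) can shift between releases, so check NVIDIA’s documentation for the version you’re running.

```python
import torch
import transformer_engine.pytorch as te
from transformer_engine.common.recipe import DelayedScaling, Format

# Sketch of FP8 training with NVIDIA's Transformer Engine (FP8-capable GPU required).
model = te.Linear(4096, 4096, bias=True).cuda()   # TE layer with FP8 support
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# HYBRID recipe: E4M3 for the forward pass, E5M2 for gradients.
fp8_recipe = DelayedScaling(fp8_format=Format.HYBRID, amax_history_len=16)

x = torch.randn(8192, 4096, device="cuda")
with te.fp8_autocast(enabled=True, fp8_recipe=fp8_recipe):
    out = model(x)                                # matmuls execute in FP8
loss = out.float().pow(2).mean()                  # loss in higher precision, outside the FP8 region
loss.backward()
optimizer.step()
```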

How Mixed Precision Boosts Training Performance

Using FP16/BF16 dramatically improves training throughput and efficiency for a few reasons:

  • Faster computations on GPU hardware: Modern GPUs are built to take advantage of lower precision. They have Tensor Cores and other specialized units that can perform many more operations per clock cycle with 16-bit (or 8-bit) numbers than with 32-bit. For instance, an NVIDIA A100 can achieve around 312 teraFLOPS in FP16/BF16 compute via Tensor Cores, versus ~19.5 teraFLOPS in standard FP32. That means with mixed precision, the same GPU can crunch roughly 16× more math ops per second in critical layers. This translates to models training in hours instead of days. Even consumer-grade GPUs like the RTX 30/40 series have these half-precision speedups baked in (the RTX 4090, for example, is extremely fast at FP16/BF16 tensor ops). In short, lower precision lets you squeeze much more performance out of the hardware.
  • Reduced memory usage: Training a model involves storing parameters, gradients, activations, and optimizer state. Using 16-bit precision for these can cut those memory requirements by up to half. This has two benefits: first, you can often train with larger batch sizes, which improves utilization and can lead to faster convergence (fewer steps to see the whole dataset). Second, you can fit bigger models on a given GPU. This is crucial for cutting-edge research where models might have billions of parameters – mixed precision is often the only way to even get such models to run on available hardware. By saving memory, you also reduce the amount of data that needs to be transferred between GPU and CPU or between distributed GPUs, which can remove bottlenecks. Overall, mixed precision makes much more efficient use of a GPU’s precious VRAM, and memory-bound workloads see a big uptick in speed as a result.
  • Lower power and cost for the same work: Since computations are faster and less data is moved around, mixed precision training often consumes less power for the same job, and on cloud platforms, faster training means lower cost. You pay for GPU time by the hour on services like Runpod, so if you can finish a job in 5 hours with mixed precision instead of 10 hours with full precision, that’s half the cost. In fact, the savings can be more than linear – using a higher-tier GPU with mixed precision can be more cost-efficient than a cheaper one running longer. For example, one of our guides notes that while an A100 costs about 3× more per hour than a lower-end GPU, it might train a model 4–5× faster, giving you more bang for your buck. Mixed precision helps unlock those efficiency gains by fully utilizing the GPU’s capabilities. It’s a key reason why NVIDIA’s data center GPUs (A100, H100) are so dominant in AI: they’re designed to excel at lower precision math.
  • Scaling to multiple GPUs more effectively: If you are using multiple GPUs or nodes, mixed precision helps there too. Each GPU gets through its share of a batch faster, and the data that has to move between devices can shrink as well: activations exchanged in model or pipeline parallelism are half the size in 16-bit, and frameworks like PyTorch DDP offer FP16 gradient compression so all-reduces move roughly half as many bytes. Together, these effects can lead to better scaling (e.g., you get closer to a 2× speedup when doubling GPUs). The end result is that mixed precision not only makes a single GPU faster, but also helps when you throw more GPUs at a problem.

Of course, the actual speedup you get from mixed precision can vary. On older hardware (say a GTX 1080 Ti), you might not see much improvement because it doesn’t have fast FP16 support. On a modern GPU with Tensor Cores, though, it’s common to see 1.5–2× faster training just by toggling on AMP (automatic mixed precision). Some models see even larger gains, especially if they were memory-bound in FP32. And as mentioned, the latest H100 GPUs using FP8 can push things to another level, achieving 4–9× speedups in certain cases. Even for typical use cases, mixed precision is one of the simplest ways to get more done in less time when training deep networks.
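If you want a quick read on what your own GPU gains, you can time a large matrix multiplication with and without autocast. This is a rough microbenchmark under illustrative sizes, not a substitute for profiling your real model:

```python
import time
import torch

torch.backends.cuda.matmul.allow_tf32 = False  # compare against true FP32, not TF32

def bench(use_amp: bool, n: int = 8192, iters: int = 50) -> float:
    # Time `iters` large matrix multiplications with or without autocast.
    a = torch.randn(n, n, device="cuda")
    b = torch.randn(n, n, device="cuda")
    torch.cuda.synchronize()
    start = time.time()
    for _ in range(iters):
        with torch.cuda.amp.autocast(enabled=use_amp):
            c = a @ b
    torch.cuda.synchronize()
    return time.time() - start

print(f"FP32 matmul time:            {bench(False):.2f} s")
print(f"Mixed precision matmul time: {bench(True):.2f} s")
```

On a Tensor Core GPU, the mixed precision run should be several times faster; on older cards the gap will be much smaller.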

Trade-Offs and Considerations

Mixed precision isn’t a free lunch – there are some trade-offs and challenges to be aware of:

  • Potential numerical instability: The biggest concern with lower precision is rounding error or extreme values causing issues in training. With FP16, for example, very small gradient values might underflow to 0, or a large activation might overflow to infinity, especially early in training. This can lead to NaNs or stalled learning if not handled. Techniques like loss scaling alleviate this by temporarily multiplying the loss (and therefore the gradients) to bring small values into representable range, then scaling back down before the weight update. BF16’s larger range largely avoids this issue, which is why it’s preferred when available. The general point is that you may need to tweak hyperparameters or ensure your training framework’s mixed precision utilities are properly used to maintain stability. The good news: frameworks have matured a lot here – they will automatically do things like loss scaling or keeping certain sensitive ops in FP32 for you.
  • Slightly reduced precision in calculations: With fewer bits, each operation is a bit less accurate. Over thousands or millions of operations, these tiny errors could accumulate. In practice, researchers have found that for most models, the final accuracy drop is minimal to none when using FP16/BF16, especially if you follow best practices. Nonetheless, there are rare cases (or certain algorithms) where full FP32 might converge a touch better. You might need to experiment – e.g., some recommend training the last few epochs in full precision, or only using mixed precision after a certain point – but such measures are seldom needed. Careful initialization and normalization in your model can also help reduce any sensitivity to low-precision noise. Overall, the risk of accuracy loss is small, but it’s wise to verify on a validation set that your mixed-precision run matches the FP32 baseline. In mission-critical applications, you might do an A/B test to be sure nothing drifted.
  • Hardware and software support: Not every environment supports every precision. As noted, FP8 requires very new hardware (NVIDIA H100 or equivalent) and the software to make use of it is still catching up. BF16 is widely supported on NVIDIA Ampere and later, on Google TPUs, and in newer PyTorch/TensorFlow releases – but if you’re on an older GPU, you might be limited to FP16. Even with FP16, some ops (especially custom ops or certain activation functions) might not have a half-precision implementation, in which case they’ll silently fall back to FP32 (with a possible performance hit). It’s a good idea to use the latest versions of frameworks and drivers for the best mixed precision experience. On Runpod, all our provided frameworks (containers, etc.) come with AMP support out of the box, and you can choose GPUs that match the precision you want (from FP16 on Tesla T4s or RTX GPUs up to FP8 on H100s). Just be mindful that if you try to run FP16 on a very old GPU that lacks native support, it could actually slow down or cause errors – though this is generally not an issue on cloud GPUs since even modest ones like RTX 20-series have it.
  • Debugging and monitoring: When using mixed precision, you’ll want to keep an eye on training for any signs of instability. For example, if you see NaN values in losses or gradients, that’s a red flag (it can often be fixed by adjusting loss scaling or ensuring no layer outputs excessively large values); a minimal monitoring sketch follows this list. It’s also worth monitoring throughput – in some setups, if the overhead of mixing precision or moving data around isn’t managed well, you might not get the full speed benefit. Typically using the built-in AMP and sticking to standard layers avoids these pitfalls. If you write custom CUDA kernels or use uncommon layers, test them under mixed precision. In short, while mixed precision is mostly plug-and-play now, stay vigilant just as you would with any performance optimization.
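Here is the kind of lightweight check referred to above – a small helper you could call each step. The function name and print format are just illustrative; GradScaler already skips optimizer steps when it detects infs or NaNs, so this is purely for visibility.

```python
import torch

def check_step(loss: torch.Tensor, scaler: torch.cuda.amp.GradScaler, step: int) -> None:
    # Flag non-finite losses, and periodically log the loss scale:
    # a scale that keeps dropping suggests repeated overflows.
    if not torch.isfinite(loss):
        print(f"step {step}: non-finite loss ({loss.item()}) - check loss scaling / layer outputs")
    if scaler.is_enabled() and step % 100 == 0:
        print(f"step {step}: current loss scale = {scaler.get_scale()}")
```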

In summary, the trade-offs of mixed precision are manageable with today’s tools. The community has developed robust methods to handle the stability issues such that the “no free lunch” cost is very low – you get huge speed boosts and pay only a tiny price in potential numerical fuss. And if you’re using state-of-the-art hardware like H100, the hardware itself (with things like the Transformer Engine) is doing a lot of work to maintain accuracy even at FP8. It’s always wise to validate your model’s results after training with mixed precision, but by and large this technique is a proven winner for deep learning. The gains far outweigh the costs for most use cases.

Using Mixed Precision on Runpod Cloud GPUs

One of the great advantages of using a cloud platform like Runpod is that you can easily access the latest and fastest GPUs optimized for mixed precision – without having to purchase them yourself. Runpod offers a range of GPU instances (from economical consumer GPUs to top-of-the-line data center cards like A100 and H100) on demand. This means if you want to try FP16, BF16, or FP8 training, you can spin up an instance with a suitable GPU in minutes and only pay for what you use.

Here are some tips for leveraging mixed precision on Runpod:

  • Choose the right GPU: If you’re looking to maximize training speed with mixed precision, consider using GPUs that excel at it. On Runpod’s Cloud GPU marketplace, you’ll find options like NVIDIA A100 (Ampere) and H100 (Hopper), which are built for lower-precision training. The A100 offers tremendous FP16/BF16 performance and large 40GB/80GB memory for big models. The H100 goes even further, with FP8 support and over 2× the throughput of A100 in many cases. If those are overkill for your task, even an RTX 4090 or RTX 3080 will give you excellent FP16 capability. The key is to ensure the GPU you select has tensor core support for FP16/BF16 (most RTX/A-series GPUs do). Runpod’s interface and documentation can help you identify which instance types support which precisions. In short, newer is better when it comes to mixed precision – and on Runpod you have the flexibility to pick the hardware that fits your needs and budget.
  • Leverage pre-configured environments: Runpod provides ready-to-use Docker container environments with popular ML frameworks (PyTorch, TensorFlow, etc.) that already have CUDA and mixed precision support. When you launch a pod, you can select an image that suits your framework. These come with the necessary libraries (like NVIDIA’s Apex or native AMP support) so you don’t have to do any complicated setup to use FP16/BF16. For example, if you start a PyTorch 2.x container on an A100 instance, you can immediately enable torch.cuda.amp.autocast in your code and get those sweet speedups. The drivers and CUDA toolkit on Runpod are configured to make full use of the GPU’s capabilities. This saves you time and hassle, letting you focus on your training code rather than environment fiddling.
  • Scale out with Instant Clusters: If one GPU isn’t enough, Runpod also offers Instant Clusters – you can deploy multi-GPU or multi-node setups quickly. Mixed precision pairs wonderfully with multi-GPU training: you get the benefit of faster computation per GPU and the benefit of parallelism. Using something like PyTorch DDP (Distributed Data Parallel) or Horovod on a cluster of, say, 4×A100 GPUs can potentially give you an order-of-magnitude speedup versus single-GPU FP32 training. Runpod’s networking and cluster orchestration are optimized for such workloads, so you can experiment with large-scale training without complex infrastructure work. Whether you need to do distributed training across multiple GPUs for a massive model, or just want to run many experiments in parallel, you can combine that with mixed precision to keep costs manageable and efficiency high.
  • Automate and integrate: For advanced use, you can tie Runpod into your ML workflow via the Runpod API. This means you could programmatically launch a GPU pod with the desired precision, run your training script (with AMP enabled), and shut it down when done – all from your CI/CD pipeline or notebook. Automation ensures you only use cloud GPUs when you need them. Plus, it’s a great way to run hyperparameter sweeps or CI tests with mixed precision across different model configurations. By automating through the API, you can seamlessly integrate Runpod’s cloud GPUs into your development lifecycle, making scalable mixed-precision training a routine part of your toolkit.
  • Community and support: Don’t forget to lean on resources like the Runpod Discord, where many developers share tips on optimizing training. Mixed precision is a popular topic, and you can find community-contributed guides or ask questions if you run into any trouble. Our team is also continuously updating docs and examples. If you’re fine-tuning a model and wonder if FP16 will hurt your accuracy, chances are someone in the community has tried it and can share insights. We’re all about helping developers succeed, so you’ll find plenty of help to get the most out of features like mixed precision.

Finally, keep in mind that cloud GPUs let you experiment freely. If you’re unsure about using FP16 or BF16, you can do a quick A/B test on Runpod: train one run with full precision, one with mixed, and compare the results. Since you’re not locked into any long-term hardware, you can turn on the high-end H100 for a few hours to see how FP8 performs for your model, then switch to a cheaper GPU if needed. This flexibility ensures you always get the best performance per dollar for your particular use case. And when you find that sweet spot (which, very often, will involve mixed precision), you can double down and train to completion faster.

Take the leap and speed up your model training today – launch a Runpod GPU instance and see how mixed precision can accelerate your AI development. 🏃‍♂️💨

FAQs about Mixed Precision Training

Q: Will using mixed precision (FP16/BF16) hurt my model’s accuracy?

A: In the vast majority of cases, mixed precision training achieves nearly the same accuracy as full FP32 training. Frameworks handle the tricky parts by keeping certain computations in high precision or using loss scaling, so your model’s convergence and final metrics usually remain unchanged. You might occasionally see a tiny difference (e.g. one model might reach 76.4% accuracy instead of 76.5%), but often it’s within run-to-run variation. Many large models (like Transformers for NLP or vision CNNs) have been trained successfully with FP16/BF16 and match FP32 accuracy. If you do encounter accuracy issues, you can try using BF16 (for more range) or tweak when/where you use lower precision. But generally, no – mixed precision doesn’t significantly degrade accuracy when applied correctly.

Q: What’s the difference between FP16 and BF16?

A: Both are 16-bit floating point formats, but they allocate their bits differently. FP16 (half precision) has 5 exponent bits and 10 mantissa bits, giving it a decent precision but a limited numeric range (~1e-5 to 1e5 magnitude). BF16 (brain floating point) has 8 exponent bits and 7 mantissa bits – effectively the same range as a 32-bit float (around 1e-38 to 1e38) but with less precision for fractional details. In practice, this means FP16 can sometimes overflow/underflow and usually requires loss scaling, whereas BF16 can represent very large or very small values more easily (often no loss scaling needed), but FP16 might capture slightly more precise small differences. Modern hardware like NVIDIA Ampere/Hopper supports both at high speed. BF16 is often preferred for training if available, since its large range makes it more robust, and any precision loss tends not to matter for model convergence. But FP16 works well too – with proper handling, both can produce equivalent results.

Q: Do I need a special GPU to use mixed precision?

A: You do need relatively modern hardware to see big benefits. Most NVIDIA GPUs from the past few years have excellent mixed precision support: this includes Tesla V100 (Volta) and newer for data center, and RTX 20-series (Turing) and newer for consumer cards. These GPUs have Tensor Cores or similar units that accelerate FP16/BF16 operations. If you run on an older GPU (e.g. GTX 10-series or earlier), it might still support FP16 arithmetic, but not nearly as fast – in some cases, FP16 could even be emulated and end up slower on those older cards. BF16 support is even more recent, generally starting with NVIDIA Ampere (A100, RTX 30-series). And FP8 support is currently limited to the newest NVIDIA H100 GPUs. The good news is that on cloud platforms like Runpod, you can readily access these capable GPUs. You don’t need to buy a $10k H100 – you can rent one by the hour. But if you’re using your own hardware, ensure it’s a GPU known for mixed precision (any RTX 30-series or 40-series card, or even an older RTX 2080 Ti, will do nicely). In summary, mixed precision works best on GPUs with dedicated support (Volta/Ampere/Hopper or later). All Runpod GPU instances support FP16, and many support BF16; for FP8, look for H100 instances.
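If you’re unsure what the GPU you’re on supports, a quick check from Python tells you. The capability-to-feature mapping in the comments is an approximation of the generations discussed above:

```python
import torch

# Approximate mapping: compute capability 7.x (Volta/Turing) has FP16 Tensor Cores,
# 8.x (Ampere and Ada) adds BF16, and 9.0 (Hopper) adds FP8.
major, minor = torch.cuda.get_device_capability()
print(torch.cuda.get_device_name(), f"- compute capability {major}.{minor}")
print("FP16 Tensor Cores:", major >= 7)
print("BF16 supported:   ", torch.cuda.is_bf16_supported())
```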

Q: How do I enable mixed precision in my training code?

A: The exact steps depend on your framework, but it’s usually quite simple. In PyTorch, you can use the torch.cuda.amp module – for example, wrap your forward pass and loss computation in an autocast() context and use a GradScaler to scale the loss for the backward pass. This typically requires only a few lines of code added to your training loop. PyTorch Lightning and Hugging Face Transformers also have built-in flags to enable FP16 training. In TensorFlow/Keras, you can enable the mixed precision policy (e.g. tf.keras.mixed_precision.set_global_policy('mixed_float16')) which will cause layers to use float16 where appropriate. The high-level idea is the same: these tools will handle casting and scaling for you. Always consult the latest docs for your framework version – there are usually guidelines for AMP usage. Additionally, if you’re on a platform with specific support (like NVIDIA’s Transformer Engine for FP8), you might use their library API to initialize FP8 training. But for FP16/BF16, it’s very straightforward. On Runpod’s environments, all the needed libraries are pre-installed, so you can enable AMP without any custom setup. In short, just flip the switch – e.g., a context manager or a single configuration – and you’re off to the races with mixed precision.
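For completeness, here is what the TensorFlow/Keras route looks like, sketched for TF 2.4 or newer with an arbitrary toy model. The one extra detail worth knowing is to keep the model’s final outputs in float32:

```python
import tensorflow as tf

# Set a global mixed precision policy; Keras runs most layers in float16 and
# automatically wraps the optimizer with loss scaling when you compile.
tf.keras.mixed_precision.set_global_policy("mixed_float16")

model = tf.keras.Sequential([
    tf.keras.Input(shape=(512,)),
    tf.keras.layers.Dense(1024, activation="relu"),
    tf.keras.layers.Dense(10),
    # Keep the final activation in float32 so the loss is computed at full precision.
    tf.keras.layers.Activation("softmax", dtype="float32"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")
# model.fit(...) then proceeds exactly as it would in full precision.
```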

Q: Is FP8 training ready for general use?

A: FP8 is still cutting-edge and not as plug-and-play as FP16 or BF16. It’s an exciting development – 8-bit floats can unlock another big jump in speed and memory efficiency – but there are caveats. Currently, you’ll only find FP8 on the latest hardware (NVIDIA H100 as of 2023/2024, and likely similar high-end chips from other vendors in the near future). To use FP8 effectively, you typically rely on specialized software: NVIDIA’s Transformer Engine is one example that automatically chooses where to apply FP8 and where to use higher precision to avoid accuracy loss. Frameworks are starting to integrate this (e.g., newer PyTorch and TensorFlow versions have some FP8 support when running on H100), but it’s not as simple as flipping a single flag yet. Moreover, not all model architectures have been proven on FP8 – most success stories so far are with large language models and transformers. For smaller models or uncommon architectures, you might encounter issues or diminishing returns with FP8. So, FP8 is on the horizon and very much real, but think of it as an advanced optimization. If you have access to H100 GPUs (you can get one on Runpod to try), feel free to experiment – you might need to use NVIDIA’s APIs or specific framework nightly builds. Otherwise, for most users, sticking to FP16/BF16 for now is the stable path. We expect in a couple of years FP8 will become more mainstream as tools mature. Keep an eye on release notes of deep learning frameworks – the capability is evolving fast.

Build what’s next.

The most cost-effective platform for building, training, and scaling machine learning models—ready when you are.