gpu.tflops_fp16 · Intermediate · 6 min read

What are TFLOPS?

TFLOPS (Tera Floating-Point Operations Per Second) measures how many trillion math operations a GPU can perform each second. It's the raw compute power metric.

What it is

TFLOPS stands for Tera Floating-Point Operations Per Second. 1 TFLOPS = one trillion (10¹²) floating-point math operations every second.

Think of it as the GPU's horsepower rating. A car's horsepower tells you how much mechanical work the engine can do. TFLOPS tells you how much mathematical work the GPU can do.

A floating-point operation is any basic math on decimal numbers: addition, subtraction, multiplication, or division. When a GPU has 990 TFLOPS at FP16, it can perform 990 trillion of these operations every second at half-precision (16-bit) floating point.
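For a feel of that scale, here is a back-of-the-envelope Python sketch (the matrix sizes are made up for illustration): multiplying an M×K matrix by a K×N matrix takes roughly 2·M·K·N floating-point operations, and dividing that count by the peak rate gives the best-case time.

# Rough scale check: one large matrix multiply at the H100's
# theoretical 990 TFLOPS (FP16). Sizes are hypothetical.
M = K = N = 16_384                 # hypothetical square matrices
flops = 2 * M * K * N              # ~2*M*K*N operations for a matmul
peak = 990e12                      # 990 trillion operations per second

print(f"work: {flops / 1e12:.1f} trillion operations")
print(f"best-case time: {flops / peak * 1e3:.2f} ms")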

Why precision changes everything

The same GPU has different TFLOPS ratings depending on the precision (number format) used. Lower precision = fewer bits per number = more operations per cycle.

Precision      Bits   H100 TFLOPS   Use Case
FP64           64     67            Scientific computing, financial modeling
FP32           32     134           General compute, some training
TF32           19     495           NVIDIA-specific, training acceleration
FP16 / BF16    16     990           AI training (most common)
FP8            8      1,979         AI inference, some training
INT8           8      1,979         Quantized inference

GIS uses FP16 TFLOPS as the standard metric because it's the most common precision for AI training and is reported by all major GPU vendors.
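To make "precision" concrete, a small NumPy sketch (the constant is arbitrary) shows how the same value is rounded at three of the widths in the table:

import numpy as np

# The same number stored at three of the precisions above.
# Fewer bits means coarser rounding but smaller values in memory,
# which is what lets the hardware execute more operations per second.
x = 3.14159265358979
for dtype in (np.float64, np.float32, np.float16):
    v = dtype(x)
    bits = np.dtype(dtype).itemsize * 8
    print(f"{np.dtype(dtype).name}: {float(v):.10f} ({bits} bits)")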

TFLOPS vs real-world performance

TFLOPS is a theoretical peak. Real-world performance is always lower because:

  • Memory bandwidth bottleneck — the GPU can compute faster than memory can feed it data. This is called being "memory-bound."
  • Utilization — not all cores are busy all the time. Real workloads rarely achieve 100% utilization.
  • Software overhead — kernel launches, synchronization, and data transfers all take time.

Rule of thumb
Real-world AI training typically achieves 30-60% of theoretical peak TFLOPS. A well-optimized workload on an H100 might sustain ~400-600 TFLOPS at FP16, not the theoretical 990.
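A minimal sketch of how that utilization figure is typically computed (the FLOP count and step time below are hypothetical, picked to land inside the 30-60% range):

# Utilization = achieved TFLOPS / theoretical peak TFLOPS.
peak_tflops = 990.0            # H100 FP16 theoretical peak
flops_per_step = 1.2e15        # hypothetical FLOPs in one training step
step_time_s = 2.4              # hypothetical measured wall-clock time

achieved_tflops = flops_per_step / step_time_s / 1e12   # 500 TFLOPS
utilization = achieved_tflops / peak_tflops              # ~0.51

print(f"achieved: {achieved_tflops:.0f} TFLOPS "
      f"({utilization:.0%} of theoretical peak)")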

This is why memory_bandwidth_tbps matters alongside TFLOPS. A GPU with high TFLOPS but low bandwidth will be bottlenecked. The H100's 3.35 TB/s bandwidth is designed to keep its 990 TFLOPS fed.
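One way to quantify that trade-off is the roofline-style break-even point: how many floating-point operations a kernel must perform per byte it moves before compute, rather than memory, becomes the limit. A quick sketch with the H100 figures from this article:

# Break-even arithmetic intensity for an H100, using the peak
# figures quoted above (real kernels sit below these peaks).
peak_flops = 990e12            # 990 TFLOPS at FP16
bandwidth = 3.35e12            # 3.35 TB/s memory bandwidth, in bytes/s

breakeven = peak_flops / bandwidth
print(f"break-even: ~{breakeven:.0f} FLOPs per byte moved")

# Kernels doing fewer FLOPs per byte than this are memory-bound:
# the cores wait on data and achieved TFLOPS stays far below 990.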

How it appears in GIS

TFLOPS appears in two places in a GIS document:

{
  "gpu": {
    "tflops_fp16": 989.5
  },
  "normalized": {
    "cost_per_tflop_hour": 0.00252
  }
}

gpu.tflops_fp16 is the raw compute power. normalized.cost_per_tflop_hour divides the hourly cost by TFLOPS to give you a price-performance ratio.

Lower cost_per_tflop_hour = more compute per dollar. This metric lets you compare a $2.49/hr H100 (990 TFLOPS) against a $1.10/hr A100 (312 TFLOPS) on equal footing.
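A short sketch of that normalization, using the example prices and TFLOPS figures quoted above (illustrative prices, not live quotes):

# cost_per_tflop_hour = hourly price / FP16 TFLOPS (lower is better).
offers = {
    "H100": {"usd_per_hour": 2.49, "tflops_fp16": 990},
    "A100": {"usd_per_hour": 1.10, "tflops_fp16": 312},
}

for name, o in offers.items():
    per_tflop = o["usd_per_hour"] / o["tflops_fp16"]
    print(f"{name}: ${per_tflop:.5f} per TFLOP-hour")

# H100 ~ $0.00252, A100 ~ $0.00353: the H100 delivers more compute
# per dollar despite its higher hourly price.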

Key takeaways
  • 1 TFLOPS = 1 trillion floating-point operations per second
  • TFLOPS vary by precision: FP64, FP32, FP16, BF16, FP8, INT8
  • GIS uses FP16 TFLOPS as the standard comparison metric
  • Higher TFLOPS ≠ always faster — memory bandwidth can be the bottleneck
  • cost_per_tflop_hour normalizes price against compute power