
Inference Runtimes

How you actually execute a model in production. PyTorch is for training; runtimes are for serving.

A runtime is the software that loads your model, moves data in, runs the math, moves results out. You almost never ship PyTorch as your production inference path — it's too heavy, too loose with threads, too full of Python. You export to a runtime that's compiled, fast, and embeddable.

| Target | Pick | When to use |
|---|---|---|
| Default (cross-platform) | ONNX Runtime | Anything not NVIDIA-specific or Apple-specific. CPU + GPU + mobile, all one runtime. |
| NVIDIA max performance | TensorRT | Squeeze the last 30–50% out of NVIDIA GPUs. Worth it for high-volume production. |
| Apple ecosystem | CoreML | iOS / macOS / Vision Pro. Uses the Neural Engine automatically. |
| Intel CPU / iGPU | OpenVINO | Intel laptop/workstation targets. |
| Browser / JS | ONNX Runtime Web or TensorFlow.js | Client-side inference. |
| Mobile (non-Apple) | TensorFlow Lite / LiteRT | Android, embedded. |

Why ONNX Runtime is the default

Microsoft's cross-platform inference runtime. Takes an .onnx model and runs it on CPU; NVIDIA GPU (via the CUDA or TensorRT execution provider); AMD GPU (ROCm); Intel CPU/iGPU (OpenVINO EP); Apple (CoreML EP); ARM (NNAPI / XNNPACK).

Pick it because:

  • One model format (ONNX) across deploy targets.
  • Consistent API across platforms.
  • Execution providers let you target specific hardware without rewriting code.
  • Mature export — PyTorch, TensorFlow, and most ML libraries have solid ONNX exporters.
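The export side is usually one call. A minimal PyTorch sketch — the ResNet-18 stand-in, input shape, and tensor names are placeholder assumptions, not anything specific to your model:

```python
import torch
import torchvision

# Any eval-mode PyTorch model works; ResNet-18 is just a stand-in.
model = torchvision.models.resnet18(weights=None).eval()
dummy = torch.randn(1, 3, 224, 224)  # example input; fixes the traced shapes

torch.onnx.export(
    model,
    dummy,
    "model.onnx",
    input_names=["input"],
    output_names=["logits"],
    dynamic_axes={"input": {0: "batch"}},  # keep the batch dimension dynamic
    opset_version=17,
)
```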

Install: pip install onnxruntime (CPU) or pip install onnxruntime-gpu (with CUDA).
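Loading and running the exported file takes a few lines — the model path and input name here assume the export sketch above:

```python
import numpy as np
import onnxruntime as ort

# Providers are tried in order; CPU is the fallback if CUDA is unavailable.
sess = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

x = np.random.rand(1, 3, 224, 224).astype(np.float32)
outputs = sess.run(None, {"input": x})  # None = return all model outputs
print(outputs[0].shape)
```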

Recent additions (2026):

  • TensorRT RTX execution provider — new EP targeting NVIDIA RTX-class GPUs.
  • OpenVINO EP memory optimization — patch to reuse weight files across shared contexts, lowering resident memory.
  • Hybrid strategy recommended: ONNX Runtime as the primary API, TensorRT EP for GPU paths, OpenVINO EP or the default CPU EP for CPU-only nodes — a sketch follows this list.
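The hybrid strategy is just a provider priority list on one InferenceSession — ONNX Runtime tries each EP in order and falls back to the next. A minimal sketch; the engine-cache options are assumptions about your setup:

```python
import onnxruntime as ort

# Prefer TensorRT on NVIDIA hosts, then plain CUDA, then CPU.
# EPs that fail to load are skipped, so the same code runs everywhere.
providers = [
    ("TensorrtExecutionProvider", {
        "trt_engine_cache_enable": True,   # cache built engines between runs
        "trt_engine_cache_path": "./trt_cache",
    }),
    "CUDAExecutionProvider",
    "CPUExecutionProvider",
]

sess = ort.InferenceSession("model.onnx", providers=providers)
print(sess.get_providers())  # which EPs actually loaded on this machine
```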

When to pick something else

  • Max NVIDIA performance with static shapes → TensorRT directly. ~30–50% faster than ONNX Runtime's TensorRT EP in practice because you can tune more aggressively.
  • Shipping an iOS app → CoreML. The Neural Engine is fast and ONNX Runtime's CoreML EP doesn't always use it optimally.
  • Intel-only deploy, CPU-focused → OpenVINO directly. Slightly better than ONNX Runtime's OpenVINO EP.
  • Browser-side inference → ONNX Runtime Web works but payload sizes are chunky. TensorFlow.js has smaller models for basic tasks.
  • Embedded / microcontroller → TensorFlow Lite Micro, or NXP/STM32 SDK-native. ONNX Runtime is too big.
  • Coral TPU → TFLite only (Coral runs quantized TFLite specifically).

The three questions that narrow it down

  1. What hardware? NVIDIA → TensorRT or ONNX. Apple → CoreML. Intel → OpenVINO or ONNX. Browser → ONNX Web / TFJS. Mobile → TFLite / CoreML.
  2. How performance-sensitive? Very → hardware-native runtime. Moderate → ONNX Runtime.
  3. How many deploy targets? Many → ONNX Runtime (one format). Few → native runtime for each.

Quantization — free performance

Once the runtime is picked, quantization is the next lever:

  • INT8 — post-training quantization. 4× memory reduction vs FP32, typically 2–3× inference speedup. Minor accuracy loss (0.5–2 percentage points on typical CV tasks).
  • FP16 — half precision. 2× memory reduction, 1.5–2× speedup on GPUs with FP16 units. Almost no accuracy loss.
  • INT4 / INT2 — aggressive, still emerging. Larger accuracy hits. Use on edge targets where you can tolerate it.

All the runtimes above support quantized models; the export tooling varies.
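With ONNX Runtime, post-training dynamic quantization is a single call — the file names are placeholders, and dynamic quantization mostly pays off on weight-heavy ops like MatMul:

```python
from onnxruntime.quantization import QuantType, quantize_dynamic

# Weights are stored as INT8; activations are quantized on the fly at runtime.
quantize_dynamic(
    model_input="model.onnx",
    model_output="model.int8.onnx",
    weight_type=QuantType.QInt8,
)
```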

The Dump

Runtimes

Serving frameworks (wrappers around runtimes)

Export tools

Graveyard

  • Caffe / Caffe2 — retired.
  • MXNet — abandoned by AWS; retired to the Apache Attic in 2023.
  • Theano — retired 2017.
  • Keras (standalone) — subsumed into TensorFlow.
  • TensorFlow Serving as a default — still around, but TorchServe / Triton / BentoML cover more ground.
  • PyTorch JIT / TorchScript as the primary production path — mostly replaced by ONNX export + runtime, or PyTorch 2.x torch.compile + ExecuTorch.

Last reviewed

2026-04-22.