Inference Runtimes¶
How you actually execute a model in production. PyTorch is for training; runtimes are for serving.
A runtime is the software that loads your model, moves data in, runs the math, moves results out. You almost never ship PyTorch as your production inference path — it's too heavy, too loose with threads, too full of Python. You export to a runtime that's compiled, fast, and embeddable.
Recommended picks¶
| Target | Pick | When to use |
|---|---|---|
| Default (cross-platform) | ONNX Runtime | Anything not NVIDIA-specific or Apple-specific. CPU + GPU + mobile, all one runtime. |
| NVIDIA max performance | TensorRT | Squeeze last 30–50% out of NVIDIA GPUs. Worth it for high-volume production. |
| Apple ecosystem | CoreML | iOS / macOS / Vision Pro. Uses the Neural Engine automatically. |
| Intel CPU / iGPU | OpenVINO | Intel laptop/workstation targets. |
| Browser / JS | ONNX Runtime Web or TensorFlow.js | Client-side inference. |
| Mobile (non-Apple) | TensorFlow Lite / LiteRT | Android, embedded. |
Why ONNX Runtime is the default¶
Microsoft's cross-platform inference runtime. Takes an .onnx model and runs it on CPU, NVIDIA GPU (via CUDA or TensorRT execution provider), AMD GPU (ROCm), Intel CPU/iGPU (OpenVINO EP), Apple (CoreML EP), ARM (NNAPI / XNNPACK).
Pick it because:

- One model format (ONNX) across deploy targets.
- Consistent API across platforms.
- Execution providers let you target specific hardware without rewriting code.
- Mature export — PyTorch, TensorFlow, and most ML libraries have solid ONNX exporters.
Install: pip install onnxruntime (CPU) or pip install onnxruntime-gpu (with CUDA).
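A minimal sketch of what that looks like in practice (assumes a hypothetical model.onnx with a 1×3×224×224 float32 input; adjust the providers list for your hardware):

```python
import numpy as np
import onnxruntime as ort

# Providers are tried in order; if CUDA isn't available, ONNX Runtime falls back to CPU.
session = ort.InferenceSession(
    "model.onnx",  # placeholder path for any exported ONNX model
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

input_name = session.get_inputs()[0].name
x = np.random.rand(1, 3, 224, 224).astype(np.float32)  # assumed NCHW input shape
outputs = session.run(None, {input_name: x})
print(outputs[0].shape)
```

Retargeting to TensorRT, OpenVINO, or CoreML is the same call with a different providers list; that is the point of the EP model.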
Recent additions (2026):

- TensorRT RTX execution provider — new EP targeting NVIDIA RTX-class GPUs.
- OpenVINO EP memory optimization — patch to reuse weight files across shared contexts, lowering resident memory.
- Hybrid strategy recommended: ONNX Runtime as the primary API, TensorRT EP for GPU paths, OpenVINO EP or default CPU EP for CPU-only nodes.
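A sketch of that hybrid strategy (which providers show up depends on the onnxruntime package build installed, e.g. onnxruntime-gpu vs. onnxruntime-openvino):

```python
import onnxruntime as ort

# Choose execution providers based on what this onnxruntime build actually exposes.
available = ort.get_available_providers()
if "TensorrtExecutionProvider" in available:
    providers = ["TensorrtExecutionProvider", "CUDAExecutionProvider", "CPUExecutionProvider"]
elif "OpenVINOExecutionProvider" in available:
    providers = ["OpenVINOExecutionProvider", "CPUExecutionProvider"]
else:
    providers = ["CPUExecutionProvider"]

session = ort.InferenceSession("model.onnx", providers=providers)  # placeholder path
```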
When to pick something else¶
- Max NVIDIA performance with static shapes → TensorRT directly. ~30–50% faster than ONNX Runtime's TensorRT EP in practice because you can tune more aggressively.
- Shipping an iOS app → CoreML. The Neural Engine is fast and ONNX Runtime's CoreML EP doesn't always use it optimally.
- Intel-only deploy, CPU-focused → OpenVINO directly. Slightly better than ONNX Runtime's OpenVINO EP.
- Browser-side inference → ONNX Runtime Web works but payload sizes are chunky. TensorFlow.js has smaller models for basic tasks.
- Embedded / microcontroller → TensorFlow Lite Micro, or NXP/STM32 SDK-native. ONNX Runtime is too big.
- Coral TPU → TFLite only (Coral runs quantized TFLite specifically).
The three questions that narrow it down¶
- What hardware? NVIDIA → TensorRT or ONNX. Apple → CoreML. Intel → OpenVINO or ONNX. Browser → ONNX Web / TFJS. Mobile → TFLite / CoreML.
- How performance-sensitive? Very → hardware-native runtime. Moderate → ONNX Runtime.
- How many deploy targets? Many → ONNX Runtime (one format). Few → native runtime for each.
Quantization — free performance¶
Once the runtime is picked, quantization is the next lever:
- INT8 — post-training quantization. 4× memory reduction, typically 2–3× inference speedup. Minor accuracy loss (0.5–2 pp on typical CV tasks).
- FP16 — half-precision. 2× memory reduction, 1.5–2× speedup on GPUs with FP16 units. Almost no accuracy loss.
- INT4 / INT2 — aggressive, emerging. Larger accuracy hits. Use for edge where you can tolerate it.
All the runtimes above support quantized models; the export tooling varies.
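For example, ONNX Runtime's post-training dynamic quantization is a few lines (placeholder paths; dynamic quantization mainly helps MatMul-heavy models such as transformers, while CNNs usually need static quantization with a calibration set to hit the INT8 numbers above):

```python
from onnxruntime.quantization import QuantType, quantize_dynamic

# Weight-only INT8 quantization; no calibration data required.
# Both paths are placeholders for your exported model.
quantize_dynamic(
    model_input="model.onnx",
    model_output="model.int8.onnx",
    weight_type=QuantType.QInt8,
)
```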
The Dump¶
Runtimes¶
- ONNX Runtime (Microsoft) — cross-platform. Default.
- TensorRT (NVIDIA) — NVIDIA-only, max performance. FP32/FP16/INT8/INT4.
- CoreML (Apple) — Apple Neural Engine + GPU + CPU.
- OpenVINO (Intel) — Intel CPU/iGPU/VPU.
- TensorFlow Lite / LiteRT (Google) — mobile, edge.
- TensorFlow.js — browser JS.
- PyTorch Mobile / ExecuTorch — PyTorch's mobile runtime. Smaller footprint for PyTorch-native models.
- TorchScript / torch.compile — in-process PyTorch compilation. Not a separate runtime; still PyTorch under the hood.
- vLLM — LLM-specific serving, not general CV.
- Triton Inference Server (NVIDIA) — multi-model serving on top of TensorRT / ONNX / PyTorch.
- BentoML — Python-first model serving framework.
- NVIDIA DeepStream — video pipeline + inference stack for NVIDIA devices.
- Apple Vision framework — higher-level Apple API on top of CoreML for common tasks.
- Android NNAPI — Android's neural network API.
- Qualcomm QNN / SNPE — Qualcomm mobile SoC runtimes.
- MediaTek NeuroPilot — MediaTek SoC runtime.
- Rockchip RKNN — Rockchip embedded SoC (Orange Pi, etc.).
- Hexagon NPU — Qualcomm DSP.
- TVM / MLIR / IREE — compile-your-own runtime stacks. Research-heavy, production-sparse.
- GGML / llama.cpp — originally LLM-only, increasingly covers multimodal. Runs on almost anything.
Serving frameworks (wrappers around runtimes)¶
- FastAPI + uvicorn — bring your own model, your own routes. What most CV APIs look like; a minimal sketch follows this list.
- Triton Inference Server — production-grade multi-model serving.
- BentoML — Python-first, batteries included.
- TorchServe — PyTorch's official serving. Widely used.
- KServe (Kubernetes) — scales model serving on K8s.
- Ray Serve — for complex multi-model / multi-step pipelines.
- vLLM — LLM-specific.
- SGLang — LLM-specific serving.
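A minimal sketch of the FastAPI + uvicorn pattern, wrapping an ONNX Runtime session (placeholder model path and preprocessing; no batching, validation, or error handling):

```python
import io

import numpy as np
import onnxruntime as ort
from fastapi import FastAPI, UploadFile
from PIL import Image

app = FastAPI()
session = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])  # placeholder
input_name = session.get_inputs()[0].name


@app.post("/predict")
async def predict(file: UploadFile):
    # Placeholder preprocessing: 224x224 RGB, scaled to [0, 1], NCHW layout.
    img = Image.open(io.BytesIO(await file.read())).convert("RGB").resize((224, 224))
    x = np.asarray(img, dtype=np.float32)[None].transpose(0, 3, 1, 2) / 255.0
    logits = session.run(None, {input_name: x})[0]
    return {"class_id": int(np.argmax(logits))}
```

Run it with uvicorn app:app. Batching, model versioning, and multi-model management are what Triton, BentoML, or KServe add on top.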
Export tools¶
- torch.onnx.export — PyTorch → ONNX. Sketch after this list.
- onnxruntime-tools / Olive (Microsoft) — optimization, quantization.
- coremltools — PyTorch/TensorFlow → CoreML.
- tensorflow.lite.TFLiteConverter — TF → TFLite.
- openvino.tools.mo — ONNX → OpenVINO IR. Deprecated in newer OpenVINO releases in favor of openvino.convert_model / ovc.
- TensorRT trtexec / polygraphy — ONNX → TensorRT engine.
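A minimal torch.onnx.export sketch (illustrative torchvision model and opset; declaring a dynamic batch axis keeps the exported graph from being pinned to batch size 1):

```python
import torch
import torchvision

# Illustrative export of a pretrained torchvision model; any eval-mode nn.Module works.
model = torchvision.models.resnet18(weights="IMAGENET1K_V1").eval()
dummy = torch.randn(1, 3, 224, 224)

torch.onnx.export(
    model,
    dummy,
    "resnet18.onnx",
    input_names=["input"],
    output_names=["logits"],
    dynamic_axes={"input": {0: "batch"}, "logits": {0: "batch"}},  # variable batch size
    opset_version=17,
)
```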
Graveyard¶
- Caffe / Caffe2 — retired.
- MXNet — AWS abandoned 2023.
- Theano — retired 2017.
- Keras (standalone) — subsumed into TensorFlow.
- TensorFlow Serving as a default — still around, but TorchServe / Triton / BentoML cover more ground.
- PyTorch JIT / TorchScript as the primary production path — mostly replaced by ONNX export + runtime, or PyTorch 2.x torch.compile + ExecuTorch.
Last reviewed¶
2026-04-22.