Pose Estimation¶
Finding keypoints on people (and sometimes hands and faces). The default stack changed in 2023 when RTMPose brought near-ViTPose accuracy to a fast CNN, and again in 2024 as MediaPipe became the standard choice for browser and mobile work.
Pose splits into two deployment flavors — server-side accuracy and on-device real-time — plus a research tier. Different picks for each.
Recommended picks¶
| Tier | Pick | When to use |
|---|---|---|
| Browser / mobile | MediaPipe Pose | JS in the browser, iOS/Android, real-time, no server |
| Default | RTMPose (OpenMMLab) | Server-side real-time, production accuracy |
| Max accuracy | ViTPose | Research benchmarks, offline batch |
Why RTMPose is the default¶
Published 2023 by OpenMMLab. Top-down architecture (detect a person box, then estimate keypoints inside it) with a strong training recipe. Reaches roughly ViTPose-level accuracy at 5–10× the speed. ONNX-exportable. Ships with COCO body weights and whole-body weights (COCO-WholeBody: body, feet, face, and hands).
MMPose is the reference library. Install via `pip install mmpose`; it also needs matching MMEngine and MMCV builds, which is why it can be finicky with PyTorch versions. The standalone rtmlib package runs RTMPose with fewer dependencies.
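The top-down flow described above (detect a box, then estimate keypoints inside it) can be sketched in plain Python. This is a minimal illustration, not MMPose's or rtmlib's API: `detect` and `estimate` are hypothetical callables standing in for a person detector and a pose model, and the 192×256 crop size is an assumption. Real pipelines usually preserve aspect ratio with an affine transform; this sketch uses a plain stretch to keep the coordinate mapping visible.

```python
from typing import Callable, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2) in image pixels
Point = Tuple[float, float]              # (x, y)


def crop_to_image(keypoints: List[Point], box: Box,
                  input_size: Tuple[int, int]) -> List[Point]:
    """Map keypoints predicted in the resized crop back to full-image pixels."""
    x1, y1, x2, y2 = box
    in_w, in_h = input_size
    scale_x = (x2 - x1) / in_w  # crop pixels -> image pixels, horizontally
    scale_y = (y2 - y1) / in_h  # crop pixels -> image pixels, vertically
    return [(x1 + kx * scale_x, y1 + ky * scale_y) for kx, ky in keypoints]


def top_down(image, detect: Callable, estimate: Callable,
             input_size: Tuple[int, int] = (192, 256)) -> List[List[Point]]:
    """Run the detector once, then the pose model once per detected person.

    `detect(image)` is assumed to return a list of boxes; `estimate(image, box)`
    is assumed to return keypoints in crop coordinates (hypothetical helpers).
    """
    people = []
    for box in detect(image):
        crop_keypoints = estimate(image, box)
        people.append(crop_to_image(crop_keypoints, box, input_size))
    return people
```

Because the pose model runs once per box, cost grows with the number of people in the frame, which is the trade-off behind the crowded-scene advice later on this page.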
MediaPipe Pose — browser/mobile default¶
Google's on-device pose library. 33 body keypoints. TFLite-backed, runs in WASM (browser) or native (iOS/Android). ~30 FPS on a modern phone. Accuracy a few points below RTMPose on hard sets but the deployment story is unbeatable for client-side work.
If you're building anything that runs in a browser tab or a phone app without a server, this is your pick.
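MediaPipe reports pose landmarks with `x` and `y` normalized to the image size, plus a per-landmark visibility score, so client code usually converts them to pixels before drawing. The sketch below shows that conversion without importing MediaPipe: `Landmark` is a hypothetical stand-in for the library's landmark objects, and the 0.5 visibility cutoff is an assumption to tune per application.

```python
from typing import List, NamedTuple, Optional, Tuple


class Landmark(NamedTuple):
    """Stand-in for a pose landmark (hypothetical type for this sketch)."""
    x: float           # normalized by image width
    y: float           # normalized by image height
    visibility: float  # likelihood that the landmark is visible


def to_pixels(landmarks: List[Landmark], width: int, height: int,
              min_visibility: float = 0.5) -> List[Optional[Tuple[int, int]]]:
    """Convert normalized landmarks to pixel coordinates.

    Landmarks below the visibility cutoff become None so the caller can
    skip them while keeping the keypoint indices aligned.
    """
    pixels: List[Optional[Tuple[int, int]]] = []
    for lm in landmarks:
        if lm.visibility < min_visibility:
            pixels.append(None)
            continue
        # Clamp to the image bounds; predictions can fall slightly outside.
        px = min(max(int(lm.x * width), 0), width - 1)
        py = min(max(int(lm.y * height), 0), height - 1)
        pixels.append((px, py))
    return pixels
```

Keeping `None` placeholders rather than dropping entries means index 0 is still the first keypoint in the output, which matters when drawing skeleton edges by index.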
When to pick something else¶
- Hand pose only → MediaPipe Hands (21 keypoints per hand) or OpenMMLab's hand-specific RTMPose variant.
- Whole-body with face + hands in one pass → RTMPose with COCO-WholeBody weights (133 keypoints total).
- 3D pose from a single image → MediaPipe Pose already outputs pseudo-3D. For metric 3D, MotionBERT or the MMHuman3D family.
- Crowded scenes → bottom-up methods (HRNet bottom-up, DEKR) handle many people better than top-down.
- Action recognition from pose → pose first (RTMPose), then a downstream action model (ST-GCN++, PoseC3D).
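
For the whole-body case in the list above, the 133 COCO-WholeBody keypoints come back as one flat array, so downstream code typically slices it into parts. The sketch below assumes the standard COCO-WholeBody ordering (17 body, 6 feet, 68 face, 21 left hand, 21 right hand); `split_wholebody` is a hypothetical helper name, and the layout should be checked against the weights actually in use.

```python
from typing import Dict, List, Sequence, Tuple

# Half-open index ranges [start, end) in the 133-keypoint array.
WHOLEBODY_LAYOUT: Dict[str, Tuple[int, int]] = {
    "body": (0, 17),
    "feet": (17, 23),
    "face": (23, 91),
    "left_hand": (91, 112),
    "right_hand": (112, 133),
}


def split_wholebody(keypoints: Sequence) -> Dict[str, List]:
    """Split one person's 133 whole-body keypoints into named parts."""
    if len(keypoints) != 133:
        raise ValueError(f"expected 133 keypoints, got {len(keypoints)}")
    return {
        name: list(keypoints[start:end])
        for name, (start, end) in WHOLEBODY_LAYOUT.items()
    }
```

The 21 points per hand match the hand-only count mentioned above, so hand-specific downstream code can often be reused on these slices.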
The three questions to narrow¶
- Server or client? Server → RTMPose. Client → MediaPipe.
- Body only or whole-body? Body only → either pick works. Whole-body → RTMPose with COCO-WholeBody weights.
- 2D or 3D? 2D → any. Pseudo-3D → MediaPipe. Metric 3D → MotionBERT family.
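
The three questions can be written down as lookup tables, which is handy for a project checklist or a config validator. This is only an encoding of the list above: the function name and the answer keys (`"server"`, `"whole-body"`, `"metric-3d"`, and so on) are made up for this sketch, and it reports one suggestion per question rather than resolving conflicts between them.

```python
from typing import Dict

# One table per question; values are the picks named in the list above.
DEPLOYMENT = {"server": "RTMPose", "client": "MediaPipe"}
COVERAGE = {
    "body": "RTMPose or MediaPipe",
    "whole-body": "RTMPose with COCO-WholeBody weights",
}
DIMENSIONS = {
    "2d": "any",
    "pseudo-3d": "MediaPipe",
    "metric-3d": "MotionBERT family",
}


def narrow(deployment: str, coverage: str, dims: str) -> Dict[str, str]:
    """Return the suggested pick for each of the three questions.

    Unknown answers raise KeyError so typos are caught early.
    """
    return {
        "deployment": DEPLOYMENT[deployment],
        "coverage": COVERAGE[coverage],
        "dimensions": DIMENSIONS[dims],
    }
```

When the three answers point at different models (for example, client-side plus metric 3D), that disagreement is the signal to look at the alternatives listed under "When to pick something else".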
The Dump¶
- OpenPose (2017) — the original multi-person pose pipeline. Historically important, now supplanted. License: CMU academic non-commercial license; commercial use requires a paid license from CMU. This licensing is the well-known reason OpenPose stopped being a production default around 2020.
- HRNet (2019) — high-resolution network. Still the backbone of many modern pose models.
- AlphaPose (SJTU) — strong top-down system. Outpaced by RTMPose. License: non-commercial research license.
- Detectron2 Keypoint R-CNN — Facebook's keypoint detection in Detectron2. Reasonable but dated.
- MediaPipe Pose (Google) — TFLite, 33 keypoints, browser/mobile default.
- MediaPipe Hands — 21 keypoints per hand, same ecosystem.
- MediaPipe Holistic — body + face + hands together in a single pass.
- DEKR — bottom-up keypoint regression. Crowded scenes.
- ViTPose (2022) — ViT backbone, strong accuracy. Expensive.
- ViTPose++ — improved training recipe. Same compute class.
- RTMPose (2023) — OpenMMLab's real-time top-down pose model. The current default.
- RTMW (OpenMMLab) — whole-body variant. Reported as the first open-source model to exceed 70 AP on COCO-WholeBody (RTMW-x at 70.2 AP).
- RTMO — one-stage multi-person variant from the same team; no separate person detector, so it suits crowded scenes.
- DWPose (2023) — distilled whole-body, used as the ControlNet pose preprocessor.
- MotionBERT — 3D human motion representation. Strong on video.
- MMHuman3D (OpenMMLab) — SMPL body fitting from video.
- YOLOv8-pose / YOLO11-pose (Ultralytics) — YOLO with a keypoint head. Fast, good ecosystem, less accurate than RTMPose. License: AGPL-3.0; closed-source commercial use requires an Ultralytics Enterprise License.
- PoseNet (Google, TFJS) — the browser default before MediaPipe. Retired.
- MMPose (OpenMMLab) — the reference training/inference library for RTMPose and the other OpenMMLab pose models, top-down and bottom-up.
- rtmlib — lightweight wrapper around RTMPose without the MMPose dependency chain.
Graveyard¶
- OpenPose as a production default — retired ~2020. The original OpenPose code and weights are licensed for non-commercial use only, and it is slower than modern alternatives.
- PoseNet in the browser — retired when MediaPipe Pose launched. Less accurate.
- Stacked Hourglass (2016) — historically important, no longer competitive.
Last reviewed¶
2026-04-22.