Depth Estimation¶
Getting a depth map from an image or a stereo pair. Monocular depth leapt forward in 2024 with Depth Anything V2 and again in late 2025 with Depth Anything 3; stereo is a different, older, still-relevant story.
Depth splits into two very different problems. Monocular (one image → depth map) is pure ML and has reset twice in two years. Stereo (two images → disparity → depth) is mostly classical CV that occasionally gets a neural boost.
Recommended picks¶
| Flavor | Pick | When to use |
|---|---|---|
| Monocular default | Depth Anything 3 (ByteDance, ICLR 2026) | Supersedes DA2; single-transformer architecture with DINOv2 encoder. Over 10% improvement on ETH3D vs DA2. |
| Monocular metric depth | Depth Anything 3 Metric-Large | metric_depth = focal * net_output / 300. Single model for indoor + outdoor metric. |
| Multi-view geometry | Depth Anything 3 (multi-view mode) | Same model predicts spatially consistent geometry from any number of views. Beats VGGT by 44.3% on camera pose, 25.1% on geometric accuracy. |
| Stereo / depth sensor | OpenCV StereoSGBM or RAFT-Stereo | If you already have a stereo rig or RGB-D camera. |
| Commercial depth sensor | Intel RealSense / Orbbec / Luxonis OAK-D | If hardware is still in the budget. On-device depth, no ML cost. |
Monocular: why Depth Anything 3 is the default¶
Depth Anything 3 (ByteDance Seed, ICLR 2026) reset the field in November 2025. It uses a single plain transformer (a vanilla DINOv2 encoder) as the backbone, trained with a single depth-ray prediction target rather than multi-task specialization, and predicts spatially consistent geometry from any number of views, with or without camera poses.
Improvements over DA2 (itself the 2024 default):
- >10% relative improvement on ETH3D and KITTI for monocular depth.
- Unified architecture handles multi-view reconstruction too: beats prior SOTA VGGT by 44.3% on camera pose accuracy and 25.1% on geometric accuracy.
- The same checkpoint handles monocular, stereo, and multi-view — no model switching.
Install via GitHub. Fast on GPU; usable but slow on CPU.
"Relative depth" is what you want for AR occlusion, portrait mode blur, novel-view synthesis, any use case where the ratio matters more than the metric scale.
Metric monocular — when you need meters¶
DA3 ships a dedicated Metric-Large checkpoint fine-tuned for metric depth. The conversion formula in the docs is metric_depth = focal * net_output / 300, where focal is the camera focal length in pixels. One model handles indoor + outdoor.
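A minimal sketch of that conversion, assuming net_output is the raw network prediction as a NumPy array; the function name and argument names are illustrative, only the formula comes from the docs:

```python
import numpy as np

def to_metric_depth(net_output: np.ndarray, focal_px: float) -> np.ndarray:
    """Apply the documented DA3 conversion: metric_depth = focal * net_output / 300.

    focal_px is the camera focal length in pixels (from calibration or EXIF).
    Returns depth in meters with the same shape as net_output.
    """
    return focal_px * net_output / 300.0
```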
For robotics or measurement use cases with high reliability requirements, a stereo rig or RGB-D sensor is still more consistent than monocular metric depth across domains.
Historical alternatives (still seen in deployed pipelines):
- ZoeDepth (Intel ISL, 2023) — indoor + outdoor metric depth. Pre-DA3.
- Depth Anything V2 Metric — DA2 metric fine-tunes. Pre-DA3.
- Metric3D — universal metric depth via camera intrinsics.
Stereo — the old reliable¶
If you have two cameras with a known baseline, classical stereo matching (disparity → depth via depth = baseline * focal / disparity) is fast, deterministic, and gives metric depth; a minimal sketch follows the list below.
- OpenCV StereoSGBM — semi-global matching. ~30 FPS on CPU for 720p. The 2000s-era default that still works.
- RAFT-Stereo (2021) — neural stereo. Slower but more accurate on edges and textureless surfaces.
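A minimal StereoSGBM sketch, assuming a rectified grayscale pair; the focal length, baseline, and SGBM parameters below are placeholders to tune, not recommendations:

```python
import cv2
import numpy as np

# Rectified grayscale stereo pair (rectification is a prerequisite for SGBM).
left = cv2.imread("left.png", cv2.IMREAD_GRAYSCALE)
right = cv2.imread("right.png", cv2.IMREAD_GRAYSCALE)

focal_px = 700.0    # focal length in pixels, from calibration (placeholder)
baseline_m = 0.12   # camera baseline in meters (placeholder)

block = 5
sgbm = cv2.StereoSGBM_create(
    minDisparity=0,
    numDisparities=128,    # must be divisible by 16
    blockSize=block,
    P1=8 * block * block,  # smoothness penalties per the OpenCV docs (1 channel)
    P2=32 * block * block,
)

# compute() returns fixed-point disparity scaled by 16.
disparity = sgbm.compute(left, right).astype(np.float32) / 16.0

# depth = baseline * focal / disparity; mask invalid (non-positive) disparities.
valid = disparity > 0
depth_m = np.zeros_like(disparity)
depth_m[valid] = baseline_m * focal_px / disparity[valid]
```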
If you don't already have a stereo setup, don't add one just for depth; Depth Anything 3 is probably enough.
When to pick something else¶
- Video where temporal consistency matters (no frame-to-frame flickering) → Depth Anything V2 has video-consistent variants. Or post-process with temporal smoothing (a minimal sketch follows this list).
- Point clouds / 3D reconstruction → COLMAP or InstantNGP / Gaussian Splatting territory, not single-image depth.
- Autonomous driving → you probably want lidar + radar + stereo, not monocular. If monocular, use specialized self-supervised models (the Monodepth2 lineage) trained on driving scenes.
- Portrait mode / background blur → dedicated mobile SoC pipelines (Apple / Google) ship specialized small models. For DIY, MiDaS or Depth Anything V2 Small suffices.
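For the temporal-smoothing route, here is a crude but workable baseline (not what the dedicated video models do): an exponential moving average over per-frame depth maps, with each new frame least-squares aligned to the running estimate first, since monocular relative depth has a per-frame scale/shift ambiguity. All names here are illustrative:

```python
import numpy as np

class DepthEMA:
    """Exponential moving average over per-frame monocular depth maps.

    Each incoming frame is aligned to the running estimate with a scalar
    scale and shift before blending, so per-frame scale drift does not
    show up as flicker. alpha trades responsiveness for smoothness.
    """

    def __init__(self, alpha: float = 0.2):
        self.alpha = alpha
        self.state = None

    def update(self, depth: np.ndarray) -> np.ndarray:
        if self.state is None:
            self.state = depth.astype(np.float32)
            return self.state
        # Solve state ≈ s * depth + t for scalar s, t (least squares).
        d = depth.ravel().astype(np.float32)
        A = np.stack([d, np.ones_like(d)], axis=1)
        (s, t), *_ = np.linalg.lstsq(A, self.state.ravel(), rcond=None)
        aligned = s * depth.astype(np.float32) + t
        self.state = (1 - self.alpha) * self.state + self.alpha * aligned
        return self.state
```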
The three questions to narrow¶
- Do you need metric distance, or just relative? Relative → Depth Anything 3. Metric → DA3 Metric-Large, or add a stereo/depth sensor.
- Single image or video? Single → any. Video → pick with temporal consistency.
- Do you already have a stereo rig or depth sensor? Yes → use it (cheaper, deterministic). No → monocular ML.
The Dump¶
- Stereo SGBM (OpenCV, classical) — the CPU-era default. Still effective.
- RAFT-Stereo (2021) — neural stereo, slow, accurate.
- CREStereo (2022) — refinement-based neural stereo.
- MiDaS (Intel ISL, 2019) — the original "robust monocular depth." Relative.
- MiDaS v3.x / DPT (2021) — transformer-based MiDaS. Strong for years.
- Monodepth2 (2019) — self-supervised from video, still used in driving.
- Depth Anything (2024) — massive dataset + semi-supervised training.
- Depth Anything V2 (mid-2024) — improved training. Pre-DA3.
- Depth Anything 3 (ByteDance, Nov 2025, ICLR 2026) — current SOTA; single-transformer architecture for depth + multi-view geometry.
- ZoeDepth (2023) — metric depth, indoor+outdoor.
- Marigold (2023) — diffusion-based depth, slower, very clean outputs.
- Metric3D v2 — universal metric depth with intrinsics.
- Intel RealSense — depth camera hardware (D435i, D455, etc.). Active IR stereo.
- Orbbec Astra / Femto — budget depth cameras.
- Luxonis OAK-D — embedded depth + detection in one device.
- iPhone TrueDepth / LiDAR — on-device depth via Apple's ARKit. Very accurate at short range.
- Kinect (deprecated) — historical Microsoft depth sensor. Replaced by Azure Kinect, also deprecated. Niche.
- Structure-from-Motion (SfM) tools — COLMAP for multi-image depth + pose. Not real-time, not single-image.
- Gaussian Splatting — not depth per se, but a modern 3D reconstruction alternative.
Graveyard¶
- MiDaS as the monocular default — retired by Depth Anything V2 in 2024.
- Monodepth (original, 2017) — retired by better self-supervised variants.
- Kinect v1/v2 as a default depth sensor — Microsoft discontinued; community moved to RealSense, Orbbec, Luxonis.
Last reviewed¶
2026-04-22.