
Full Stack CV — Playbook

An opinionated reference for building CV systems. Three picks per subtask, when to use each, and when not to — plus a broader catalog of useful alternatives, legacy picks, and context.

The problem this solves

Most CV resource lists optimise for completeness. They grow by accretion — someone submits a PR for a new library, it goes in. After two years the list has 600 entries and is useless as a decision aid. You can't tell which pick is load-bearing and which is archaeological.

This site inverts that. Each section leads with three picks — usually a default, an edge/low-end fallback, and a max-accuracy option — and every pick has a written why and when not to. The exhaustive list still exists (in each section's Dump), but it sits behind the picks, not in front of them.

The picks are opinions. They will age. Every section has a "last reviewed" date.

Start here

New to the playbook? Read the Face section first — it's the most complete and shows the format end-to-end. Or jump to whichever subtask you're shipping.

How to read this

Situation                        What to do
You know the subtask             Jump to the section. Read the 3–4 rows of the picks table. Often enough.
Picks don't fit your constraint  Read the Dump behind them for context, then "When to pick something else."
Browsing to learn the field      Read the picks across sections: a ~20 min tour of what matters in each subtask.

Sections

Perception

Section                 One-line
Classification          CLIP zero-shot / timm fine-tune / DINOv2 probe, by data availability.
Object Detection        YOLO for most cases; RT-DETR if on GPU; Co-DETR for research.
Segmentation            SAM 2 for zero-shot; Mask2Former / YOLOv8-seg for trained.
Pose Estimation         RTMPose server; MediaPipe browser; ViTPose research.
OCR / Document AI       PaddleOCR for clean; VLMs for messy; Donut for structured forms.
Tracking                ByteTrack default; BoT-SORT with ReID; OC-SORT for benchmarks.
Depth Estimation        Depth Anything V2 monocular; stereo if hardware available.
Retrieval / Embeddings  SigLIP embeddings + FAISS; DINOv2 for image-only.

Face (the one vertical)

Section  One-line
Face     SCRFD detect, 5-point align, ArcFace embed, 1-NN + threshold search.
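
The last step of the Face one-liner, 1-NN + threshold search, can be sketched in a few lines. This is a minimal illustration in plain Python, not the playbook's implementation: the function names, the toy 3-dimensional vectors, and the 0.5 threshold are all assumptions made for the example. Real ArcFace embeddings are much higher-dimensional, and the threshold would be tuned on held-out pairs.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def identify(query, gallery, threshold=0.5):
    """Nearest-neighbour lookup with a rejection threshold.

    `gallery` maps an identity label to one enrolled embedding.
    Returns (label, score) for the closest entry, or (None, score)
    when the best score falls below `threshold` (unknown face).
    The 0.5 default is a placeholder, not a recommended value.
    """
    best_label, best_score = None, -1.0
    for label, emb in gallery.items():
        score = cosine(query, emb)
        if score > best_score:
            best_label, best_score = label, score
    if best_score < threshold:
        return None, best_score
    return best_label, best_score


# Toy gallery with hypothetical identities and made-up vectors.
gallery = {"alice": [1.0, 0.0, 0.0], "bob": [0.0, 1.0, 0.0]}

print(identify([0.9, 0.1, 0.0], gallery))  # close to "alice": accepted
print(identify([0.0, 0.0, 1.0], gallery))  # far from everyone: rejected
```

The linear scan is fine for a small gallery. At larger scale the same accept/reject logic would sit on top of an approximate nearest-neighbour index rather than a Python loop.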

More verticals coming

Medical, satellite, industrial, retail, and robotics sections will be added once there are content-backed opinions to put in each. Not adding them as empty placeholders.

Multimodal

Section           One-line
VLM / Multimodal  Gemini / GPT / Claude commercial; Qwen2-VL / InternVL2 open.
Image Generation  Flux for open; SDXL for ecosystem; ControlNet for control.

Stack & deployment

Section                     One-line
Inference Runtimes          ONNX Runtime default; TensorRT on NVIDIA; CoreML on Apple.
Edge Deployment             Jetson Orin Nano Super, Raspberry Pi 5, CoreML for mobile.
Cloud Vision APIs           VLMs for flexible; task APIs (Rekognition / Vision) for scale.
Data, Annotation, Datasets  CVAT / Roboflow / SAM-based auto-labeling; COCO / ImageNet / etc.

How each section is structured

Every page follows the same shape:

Intro            — 2 paragraphs on what this subtask is and why the picks differ
Recommended      — 3-row table with when-to-use + when-to-avoid + install
Narrowing        — 3 questions that disambiguate to one pick
The Dump         — exhaustive list, one-line verdict each
Graveyard        — retired picks and why
Last reviewed    — a date so you can tell when this went stale

Some sections adjust the axis (Face uses clean-input / surveillance / edge instead of default / edge / max-accuracy).
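
The page shape above can also be read as a small data structure. The sketch below is one hypothetical way to model it and lint a section. The class names, field names, and the 365-day staleness window are assumptions for illustration and are not taken from the playbook's Section template. It also assumes exactly three picks and three narrowing questions, as the outline states.

```python
from dataclasses import dataclass, field
from datetime import date
from typing import List, Optional


@dataclass
class Pick:
    name: str
    when_to_use: str
    when_to_avoid: str
    install: str


@dataclass
class Section:
    title: str
    intro: str
    recommended: List[Pick]
    narrowing: List[str]
    dump: List[str] = field(default_factory=list)
    graveyard: List[str] = field(default_factory=list)
    last_reviewed: Optional[date] = None


def problems(section, today, max_age_days=365):
    """Return a list of human-readable issues; an empty list means the page fits the shape."""
    issues = []
    if len(section.recommended) != 3:
        issues.append("Recommended should have 3 picks")
    if len(section.narrowing) != 3:
        issues.append("Narrowing should have 3 questions")
    if section.last_reviewed is None:
        issues.append("missing last-reviewed date")
    elif (today - section.last_reviewed).days > max_age_days:
        issues.append("stale: last reviewed %s" % section.last_reviewed.isoformat())
    return issues


# Placeholder section; pick names and text are dummies, not real recommendations.
example = Section(
    title="Example",
    intro="Two paragraphs would go here.",
    recommended=[Pick("pick-%d" % i, "use when", "avoid when", "install cmd") for i in range(3)],
    narrowing=["question 1", "question 2", "question 3"],
    last_reviewed=date(2026, 4, 22),
)

print(problems(example, today=date(2026, 5, 1)))
print(problems(example, today=date(2028, 1, 1)))
```

Sections that adjust the axis, as Face does, would still pass this check, since it counts picks and does not inspect what they are called.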

Reference

File                 Purpose
Contributing         How to propose additions; Dump vs Recommended vs Graveyard rules
Evaluation criteria  How picks are judged (what counts, what doesn't)
Section template     Canonical page structure for new sections
Refresh pipeline     LLM-in-the-loop regeneration sketch

License

MIT. Use anything here however you like.

Maintainer

Vikas Gupta — Founder, DeepVidya.ai · LinkedIn


Last full review: 2026-04-22. Each section carries its own review date.