Full Stack CV — Playbook

An opinionated reference for building CV systems. Three picks per subtask, when to use each, and when not to — plus a broader catalog of useful alternatives, legacy picks, and context.

The problem this solves

Most CV resource lists optimise for completeness. They grow by accretion — someone submits a PR for a new library, it goes in. After two years the list has 600 entries and is useless as a decision aid. You can't tell which pick is load-bearing and which is archaeological.

This site inverts that. Each section leads with three picks — usually a default, an edge/low-end fallback, and a max-accuracy option — and every pick has a written why and when not to. The exhaustive list still exists (in each section's Dump), but it sits behind the picks, not in front of them.

The picks are opinions. They will age. Every section has a "last reviewed" date.

Start here

New to the playbook? Read the Face section first — it's the most complete and shows the format end-to-end. Or jump to whichever subtask you're shipping.

How to read this

Situation	What to do
You know the subtask	Jump to the section. Read the 3–4 rows of the picks table. That is often enough.
Picks don't fit your constraint	Read the Dump behind them for context, then "When to pick something else."
Browsing to learn the field	Read the picks across sections — ~20 min tour of what matters in each subtask.

Sections

Perception

Section	One-line
Classification	CLIP zero-shot / timm fine-tune / DINOv2 probe, by data availability.
Object Detection	YOLO for most cases; RT-DETR if on GPU; Co-DETR for research.
Segmentation	SAM 2 for zero-shot; Mask2Former / YOLOv8-seg for trained.
Pose Estimation	RTMPose server; MediaPipe browser; ViTPose research.
OCR / Document AI	PaddleOCR for clean; VLMs for messy; Donut for structured forms.
Tracking	ByteTrack default; BoT-SORT with ReID; OC-SORT for benchmarks.
Depth Estimation	Depth Anything V2 monocular; stereo if hardware available.
Retrieval / Embeddings	SigLIP embeddings + FAISS; DINOv2 for image-only.

Face (the one vertical)

Section	One-line
Face	SCRFD detect, 5-point align, ArcFace embed, 1-NN + threshold search.
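The last stage of that pipeline, "1-NN + threshold search", can be sketched in a few lines. This is a minimal illustration, not the Face section's implementation: it assumes cosine similarity over L2-normalized embeddings, uses toy 3-dimensional vectors in place of real ArcFace outputs, and the names `identify`, `gallery`, and the `0.4` threshold are hypothetical placeholders that would need tuning on real data.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so a dot product equals cosine similarity."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine(a, b):
    """Dot product of two already-normalized vectors."""
    return sum(x * y for x, y in zip(a, b))


def identify(query, gallery, threshold=0.4):
    """1-NN search with a rejection threshold.

    Returns (label, score) for the closest gallery entry, or (None, score)
    when the best score is below the threshold (treated as "unknown").
    The 0.4 default is an illustrative assumption, not a recommended value.
    """
    q = l2_normalize(query)
    best_label, best_score = None, -1.0
    for label, emb in gallery.items():
        score = cosine(q, l2_normalize(emb))
        if score > best_score:
            best_label, best_score = label, score
    if best_score < threshold:
        return None, best_score
    return best_label, best_score


# Toy gallery: real face embeddings would be much higher-dimensional.
gallery = {"alice": [1.0, 0.0, 0.0], "bob": [0.0, 1.0, 0.0]}

print(identify([0.9, 0.1, 0.0], gallery))  # close to "alice"
print(identify([0.0, 0.0, 1.0], gallery))  # far from everyone, rejected
```

The threshold is what turns nearest-neighbour search into open-set identification: without it, every query is assigned to someone in the gallery, including faces that were never enrolled.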

More verticals coming

Medical, satellite, industrial, retail, and robotics sections will be added once there are content-backed opinions to put in each. They will not be added as empty placeholders.

Multimodal

Section	One-line
VLM / Multimodal	Gemini / GPT / Claude commercial; Qwen2-VL / InternVL2 open.
Image Generation	Flux for open; SDXL for ecosystem; ControlNet for control.

Stack & deployment

Section	One-line
Inference Runtimes	ONNX Runtime default; TensorRT on NVIDIA; CoreML on Apple.
Edge Deployment	Jetson Orin Nano Super, Raspberry Pi 5, CoreML for mobile.
Cloud Vision APIs	VLMs for flexible; task APIs (Rekognition / Vision) for scale.
Data, Annotation, Datasets	CVAT / Roboflow / SAM-based auto-labeling; COCO, ImageNet, and similar datasets.
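The Inference Runtimes row ("ONNX Runtime default; TensorRT on NVIDIA; CoreML on Apple") maps naturally onto ONNX Runtime's execution-provider list. The sketch below is one possible way to express that preference order. The helper `pick_providers` and the ordering in `PREFERRED_PROVIDERS` are assumptions made for illustration, and `model.onnx` in the comment is a hypothetical path.

```python
# Preference order: most specialized backend first, CPU as the fallback.
# The provider strings are ONNX Runtime's execution provider names.
PREFERRED_PROVIDERS = [
    "TensorrtExecutionProvider",  # TensorRT on NVIDIA
    "CUDAExecutionProvider",      # plain CUDA on NVIDIA
    "CoreMLExecutionProvider",    # Core ML on Apple hardware
    "CPUExecutionProvider",       # always-available default
]


def pick_providers(available):
    """Filter the preference list down to what this machine actually offers."""
    chosen = [p for p in PREFERRED_PROVIDERS if p in available]
    if not chosen:
        raise RuntimeError("no known execution provider is available")
    return chosen


# With onnxruntime installed, the list would typically come from the library:
#   import onnxruntime as ort
#   providers = pick_providers(ort.get_available_providers())
#   session = ort.InferenceSession("model.onnx", providers=providers)

# Simulated machine with an NVIDIA GPU but no TensorRT build:
print(pick_providers(["CPUExecutionProvider", "CUDAExecutionProvider"]))
```

Keeping the CPU provider at the end of the list means the same code path still runs on a laptop or CI machine with no accelerator.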

How each section is structured

Every page follows the same shape:

Intro            — 2 paragraphs on what this subtask is and why the picks differ
Recommended      — 3-row table with when-to-use + when-to-avoid + install
Narrowing        — 3 questions that disambiguate to one pick
The Dump         — exhaustive list, one-line verdict each
Graveyard        — retired picks and why
Last reviewed    — a date so you can tell when this went stale

Some sections adjust the axis (Face uses clean-input / surveillance / edge instead of default / edge / max-accuracy).
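The page shape above can also be read as a small data structure, which makes it easy to check a section mechanically. The sketch below is an assumption-laden illustration, not the site's tooling: the class names `Pick` and `SectionPage`, the function `lint_section`, and the 365-day staleness window are all hypothetical. The template says three rows while the reading guide mentions 3–4, so the check accepts either.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class Pick:
    name: str
    when_to_use: str
    when_to_avoid: str
    install: str


@dataclass
class SectionPage:
    title: str
    intro: str                # two paragraphs in the real template
    recommended: List[Pick]   # the picks table
    narrowing: List[str]      # questions that disambiguate to one pick
    dump: List[str]           # exhaustive list, one-line verdict each
    graveyard: List[str]      # retired picks and why
    last_reviewed: date


def lint_section(page, today, max_age_days=365):
    """Return a list of human-readable problems; empty means the page looks fine.

    max_age_days is an illustrative threshold, not a rule from the playbook.
    """
    issues = []
    if len(page.recommended) not in (3, 4):
        issues.append("Recommended should have 3 (occasionally 4) picks")
    if len(page.narrowing) != 3:
        issues.append("Narrowing should have 3 questions")
    if (today - page.last_reviewed).days > max_age_days:
        issues.append("Last reviewed date is stale")
    return issues


# Placeholder content: the strings below are stand-ins, not real section text.
example = SectionPage(
    title="Tracking",
    intro="(intro paragraphs)",
    recommended=[
        Pick("ByteTrack", "(when to use)", "(when to avoid)", "(install)"),
        Pick("BoT-SORT", "(when to use)", "(when to avoid)", "(install)"),
        Pick("OC-SORT", "(when to use)", "(when to avoid)", "(install)"),
    ],
    narrowing=["(question 1)", "(question 2)", "(question 3)"],
    dump=[],
    graveyard=[],
    last_reviewed=date(2026, 4, 22),
)

print(lint_section(example, today=date(2026, 5, 1)))
```

A check like this is mainly useful for the "last reviewed" field, since the picks are expected to age and a stale date is the signal that a section needs another look.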

Reference

File	Purpose
Contributing	How to propose additions; Dump vs Recommended vs Graveyard rules
Evaluation criteria	How picks are judged (what counts, what doesn't)
Section template	Canonical page structure for new sections
Refresh pipeline	LLM-in-the-loop regeneration sketch

License

MIT. Use anything here however you like.

Maintainer

Vikas Gupta — Founder, DeepVidya.ai · LinkedIn

Last full review: 2026-04-22. Each section carries its own review date.