# Data: Annotation, Datasets, Synthetic
Where your training data comes from. Annotation tools changed in 2023 when SAM started auto-labeling segmentation masks. Benchmark datasets haven't changed much; trust them with caveats.
Data is where most CV projects actually spend their time. Model choice is a 1-day decision; data is a 6-month investment. The tools for collecting, labeling, and benchmarking data are worth picking deliberately.
## Recommended picks
| Use case | Pick | When to use |
|---|---|---|
| Open-source labeling (team) | CVAT | Self-hosted, mature, video support |
| Open-source labeling (solo) | Label Studio | Broader (text, audio, image, video); easier to spin up |
| Managed labeling platform | Roboflow | End-to-end (upload → label → train → deploy). Best developer experience. |
| Pro/enterprise managed | V7 or Labelbox | Medical, industrial, compliance-heavy. Expensive. |
| Auto-labeling | SAM 2 + Grounding DINO (programmatic) or Roboflow / CVAT SAM integrations | Accelerate labeling 5–10×. |
## Why CVAT for teams

Computer Vision Annotation Tool (started at Intel, now maintained by CVAT.ai; open source). Self-hostable, supports boxes / polygons / masks / keypoints / tracks. Video annotation is first-class — interpolation between keyframes saves hours on MOT / action recognition datasets.
`docker compose up` and you have a working labeling server. Role-based access control. Supports most common export formats (COCO, Pascal VOC, YOLO, custom).
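If you need to script against that server, the cvat-sdk package wraps the same REST API. A minimal sketch, assuming a local deployment; the host and credentials are placeholders:

```python
# pip install cvat-sdk
from cvat_sdk import make_client

# Placeholder host/credentials for a local docker-compose deployment.
with make_client(host="http://localhost:8080", credentials=("admin", "changeme")) as client:
    for task in client.tasks.list():
        print(task.id, task.name, task.status)
```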
## Why Roboflow for managed
End-to-end pipeline: upload → label → augment → split → train → deploy → monitor. Developer-first. Free tier is generous. Commercial tier scales.
Their SaaS is what most small CV teams reach for in 2026. If you don't need self-hosting, start here.
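Their Python client mirrors that pipeline. A sketch of pulling a labeled dataset version for local training; the API key, workspace, and project names are placeholders:

```python
# pip install roboflow
from roboflow import Roboflow

rf = Roboflow(api_key="YOUR_API_KEY")  # placeholder key
project = rf.workspace("my-workspace").project("my-project")

# Download version 1 of the dataset in YOLOv8 format for local training.
dataset = project.version(1).download("yolov8")
print(dataset.location)
```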
## The SAM auto-label shift
Before SAM (2023): labeling segmentation masks was the most expensive annotation task. You traced polygons by hand. Dozens of hours per thousand images.
After SAM: click a point, get a high-quality mask. Correct where needed. 5–10× speedup in practice.
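That click-to-mask loop is a few lines with Meta's sam2 package. A sketch, assuming the facebook/sam2-hiera-large checkpoint on Hugging Face; the image path and click coordinates are placeholders:

```python
# pip install "git+https://github.com/facebookresearch/sam2.git"
import numpy as np
from PIL import Image
from sam2.sam2_image_predictor import SAM2ImagePredictor

predictor = SAM2ImagePredictor.from_pretrained("facebook/sam2-hiera-large")
predictor.set_image(np.array(Image.open("frame.jpg").convert("RGB")))

# One positive click (label 1) at pixel (x=500, y=300).
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 300]]),
    point_labels=np.array([1]),
)
best_mask = masks[scores.argmax()]  # keep the highest-scoring mask, correct by hand if needed
```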
Every major annotation platform integrated SAM in 2023–2024: CVAT, Label Studio, Roboflow, V7, Labelbox. By 2026, most have moved on to SAM 2 / SAM 3 integration for concept-prompted (text-driven) labeling. CVAT and Roboflow have direct integrations: Roboflow Annotate supports a SAM-2-powered label assistant in the interface, and CVAT integrates SAM, Mask R-CNN, YOLO models for auto-labeling. Don't label segmentation masks by hand anymore.
## Programmatic labeling
When you don't need a UI and can label in a script:
- SAM 2 (Meta) — click / box → mask. (Text prompts are SAM 3's addition.)
- Grounding DINO — text → bounding box. Open-vocabulary detection.
- Grounded-SAM — combine the above: text → mask.
- Autodistill (Roboflow) — wraps several of the above into a "detect cars" → labeled dataset pipeline (sketch after this list).
- LabelGPT-style (using a VLM to label) — send each image to a VLM, get labels. Expensive but flexible.
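The Autodistill version of the text-to-dataset flow, sketched under the assumption that the autodistill-grounded-sam plugin is installed; the class prompts and folder paths are illustrative:

```python
# pip install autodistill autodistill-grounded-sam
from autodistill.detection import CaptionOntology
from autodistill_grounded_sam import GroundedSAM

# Map free-text prompts (keys) to the class names you want in the dataset (values).
base_model = GroundedSAM(ontology=CaptionOntology({"car": "car", "delivery truck": "truck"}))

# Runs Grounding DINO + SAM over every image and writes a labeled dataset.
base_model.label(input_folder="./images", extension=".jpg", output_folder="./dataset")
```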
## Benchmark datasets (the canonical ones)

> [!WARNING]
> Dataset licenses matter — most academic CV datasets are research-only. Training on these and shipping the resulting weights commercially is a license violation in most cases.
>
> - ImageNet: non-commercial research license from Stanford/Princeton.
> - COCO: the annotations are CC-BY-4.0; the images come from Flickr and carry their own terms — check before commercial use.
> - Open Images: CC-BY-2.0 images (commercial OK with attribution). Good baseline for commercial training.
> - CelebA: non-commercial research only.
> - CASIA-WebFace / VGGFace2 / MS-Celeb-1M / Glint360K / WebFace260M: all research-only.
> - SA-1B (SAM training set): research license.
> - LAION-5B / LAION-2B: CC-BY-4.0 metadata; the images sit behind third-party URLs with their own rights — training on them is legally murky, with lawsuits pending.
> - LFW / MOT / Kinetics / ADE20K / Cityscapes: research/academic licenses — evaluation OK, commercial deployment of derived weights is not.
If you're training for a commercial product: COCO, Open Images, and your own proprietary/licensed data are the cleanest options. Synthetic data (Blender, Omniverse, diffusion-generated) also bypasses dataset licensing.
### Detection

- COCO (2014) — 80 classes, 330K images. The universal detection benchmark (loading sketch after this list).
- Open Images V7 — ~9M images, 600+ classes. Larger but noisier than COCO.
- Objects365 — 365 classes, more diversity than COCO.
- LVIS — COCO images re-labeled with 1,200 long-tail categories.
- Visual Genome — dense scene graphs. More about relationships than pure detection.
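COCO-format annotations (including the exports most labeling tools produce) are usually read with pycocotools. A sketch, assuming the standard val2017 file layout:

```python
# pip install pycocotools
from pycocotools.coco import COCO

coco = COCO("annotations/instances_val2017.json")  # path is illustrative

# Find every validation image containing a person.
cat_ids = coco.getCatIds(catNms=["person"])
img_ids = coco.getImgIds(catIds=cat_ids)
print(f"{len(img_ids)} images contain a person")

# Annotations (boxes, segmentation polygons) come back as plain dicts.
anns = coco.loadAnns(coco.getAnnIds(imgIds=img_ids[0], catIds=cat_ids))
print(anns[0]["bbox"])  # [x, y, width, height]
```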
### Segmentation
- COCO (instance + panoptic) — reuses COCO images with mask annotations.
- ADE20K — 20K images, 150 classes, semantic segmentation.
- Cityscapes — street scenes, semantic segmentation, autonomous driving.
- SA-1B (SAM training set) — ~11M images, 1.1B masks, released by Meta under a research license.
### Face
- LFW (Labeled Faces in the Wild, 2007) — saturated. 99.8%+ accuracy for any modern recognizer. Not diagnostic anymore.
- MS1M / MS-Celeb-1M — training set, not a benchmark; withdrawn by Microsoft in 2019.
- IJB-C — unconstrained face recognition. Still useful as a harder benchmark.
- MegaFace — 1M distractors. Historical.
- WIDER Face — face detection benchmark.
- QMUL-SurvFace — low-resolution surveillance imagery. A harder, more realistic benchmark for deployed face recognition.
### OCR
- ICDAR series — the standard OCR/text detection benchmarks.
- SROIE — receipt parsing.
- FUNSD — form understanding.
- DocVQA — document visual question answering.
- TextVQA — scene text VQA.
### Pose
- COCO Keypoints — the default 2D pose benchmark.
- MPII — older, still referenced.
- CrowdPose — crowded scenes.
- COCO-WholeBody — 133 keypoints per person.
### Video / Tracking
- MOT17 / MOT20 — multi-object tracking benchmarks.
- Kinetics-400 / 700 — action recognition.
- ActivityNet — action localization.
- DAVIS — video object segmentation.
### Classification
- ImageNet-1K (ILSVRC 2012) — historical default. Saturated.
- ImageNet-21K — larger, better for pretraining.
- CIFAR-10 / 100 — toy. Use for pedagogy, not actual benchmarks.
## When to pick something else

- Medical data (HIPAA, PHI) → air-gapped labeling. V7 or Labelbox with an on-prem install. Don't use SaaS for patient data without a BAA.
- Sensitive security data → self-hosted CVAT or custom tooling.
- Synthetic data for rare classes → Blender / Unreal / NVIDIA Omniverse for rendering (Blender sketch after this list). Or generate with diffusion models (Flux / SDXL) for specific object edits.
- Data versioning beyond annotation → DVC, LakeFS, Weights & Biases Artifacts, Roboflow's versioning.
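For the Blender route, the bundled bpy API makes domain randomization a short script. A sketch that runs inside Blender; the object name and output path assume the default startup scene:

```python
# Run inside Blender (Scripting tab) or with Blender's bundled Python; bpy ships with Blender.
import random
import bpy

scene = bpy.context.scene
obj = bpy.data.objects["Cube"]  # default startup scene; swap in your asset

for i in range(10):
    # Randomize pose, then render one frame per variation.
    obj.location = (random.uniform(-2, 2), random.uniform(-2, 2), 0)
    obj.rotation_euler[2] = random.uniform(0, 3.14159)
    scene.render.filepath = f"//renders/frame_{i:04d}.png"  # // is a .blend-relative path
    bpy.ops.render.render(write_still=True)
```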
## The three questions to narrow
- Who labels? Team → CVAT / Roboflow. Solo or hobby → Label Studio / Roboflow free tier.
- Data sensitivity? Sensitive → self-hosted. Not sensitive → managed SaaS.
- Is SAM enough to bootstrap? Yes (most modern tasks) → use a SAM-integrated tool. No → hand labeling with a motivated team.
## The Dump

### Open-source labeling

- CVAT (CVAT.ai, ex-Intel) — teams, video support.
- Label Studio (HumanSignal) — general-purpose; images, audio, text.
- VGG Image Annotator (VIA) — single HTML file, no install. Toy-level but great for one-off jobs.
- LabelImg — bounding boxes, classic desktop tool.
- Labelme — polygons, desktop.
- AnyLabeling — Labelme + AI assistance (SAM integration).
- X-AnyLabeling — enhanced AnyLabeling.
- MakeSense.AI — browser-based, free.
### Managed platforms
- Roboflow — dev-first end-to-end.
- V7 — enterprise, medical/industrial focus.
- Labelbox — enterprise.
- Scale AI — huge projects, Fortune 500.
- SuperAnnotate — solid mid-market.
- Kili Technology — European.
- Supervisely — strong for 3D / point clouds.
- Dataloop — MLOps-heavy.
- Encord — video + DICOM.
### Auto-labeling tools
- SAM 2 — the interactive default.
- Grounding DINO — text → boxes.
- Grounded-SAM — text → masks.
- Autodistill (Roboflow) — combine open models into a labeling pipeline.
- Cleanlab — find labeling errors in existing datasets (sketch below).
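The Cleanlab call is worth showing because the input is just labels plus out-of-sample predicted probabilities from any classifier. A sketch with synthetic placeholder data:

```python
# pip install cleanlab
import numpy as np
from cleanlab.filter import find_label_issues

rng = np.random.default_rng(0)
pred_probs = rng.dirichlet(np.ones(3), size=200)  # placeholder (n, n_classes) probabilities
labels = pred_probs.argmax(axis=1)
labels[:5] = (labels[:5] + 1) % 3                 # inject a few deliberate label errors

# Indices of likely mislabeled examples, most suspicious first.
issue_idx = find_label_issues(labels=labels, pred_probs=pred_probs,
                              return_indices_ranked_by="self_confidence")
print(issue_idx[:10])
```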
### Data management
- DVC — Data Version Control, git-style.
- LakeFS — Git for object storage.
- Weights & Biases Artifacts — dataset versioning within W&B.
- Roboflow Universe — public datasets.
- Hugging Face Datasets — general ML dataset hub.
- Kaggle — competitions + datasets.
### Synthetic data
- NVIDIA Omniverse / Isaac Sim — physics-based synthetic.
- Unity Perception — Unity-based synthetic.
- Blender + Python — DIY rendering.
- Diffusion-generated (Flux, SDXL) — image augmentation with control.
- Gretel — mostly tabular, some CV.
## Graveyard
- Labelbox's original free tier — retired; now enterprise-focused.
- Amazon SageMaker Ground Truth as a leader — still exists, but dev experience lags Roboflow.
- Hand-labeling segmentation masks without SAM — retired. You shouldn't be doing this in 2026.
- Google AutoML Vision — folded into Vertex AI.
## Last reviewed
2026-04-22.