Data: Annotation, Datasets, Synthetic

Where your training data comes from. Annotation tools changed in 2023 when SAM started auto-labeling segmentation masks. Benchmark datasets haven't changed much; trust them with caveats.

Data is where most CV projects actually spend their time. Model choice is a 1-day decision; data is a 6-month investment. The tools for collecting, labeling, and benchmarking data are worth picking deliberately.

| Use case | Pick | When to use |
| --- | --- | --- |
| Open-source labeling (team) | CVAT | Self-hosted, mature, video support |
| Open-source labeling (solo) | Label Studio | Broader (text, audio, image, video); easier to spin up |
| Managed labeling platform | Roboflow | End-to-end (upload → label → train → deploy); best developer experience |
| Pro/enterprise managed | V7 or Labelbox | Medical, industrial, compliance-heavy; expensive |
| Auto-labeling | SAM 2 + Grounding DINO (programmatic), or the SAM integrations in Roboflow / CVAT | Accelerates labeling 5–10× |

Why CVAT for teams

Computer Vision Annotation Tool (open source; started at Intel, now maintained by CVAT.ai). Self-hostable, supports boxes / polygons / masks / keypoints / tracks. Video annotation is first-class: interpolation between keyframes saves hours on MOT / action-recognition datasets.

docker-compose up and you have a working labeling server. Role-based access control. Supports most common export formats (COCO, Pascal VOC, YOLO, custom).

Why Roboflow for managed

End-to-end pipeline: upload → label → augment → split → train → deploy → monitor. Developer-first. Free tier is generous. Commercial tier scales.

Their SaaS is what most small CV teams reach for in 2026. If you don't need self-hosting, start here.
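The same pipeline is scriptable through the roboflow Python package. A minimal sketch, assuming you have an API key and using hypothetical workspace/project names; check the current SDK docs for exact arguments:

```python
# pip install roboflow
from roboflow import Roboflow

# Authenticate (key value is a placeholder).
rf = Roboflow(api_key="YOUR_API_KEY")

# "my-workspace" / "defect-detection" are hypothetical names.
project = rf.workspace("my-workspace").project("defect-detection")

# Push a local image into the project so it shows up in the labeling queue.
project.upload("images/sample_001.jpg")

# Later: pull a labeled dataset version in YOLO format for training.
dataset = project.version(1).download("yolov8")
print(dataset.location)  # local folder containing images + labels + data.yaml
```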

The SAM auto-label shift

Before SAM (2023): labeling segmentation masks was the most expensive annotation task. You traced polygons by hand. Dozens of hours per thousand images.

After SAM: click a point, get a high-quality mask. Correct where needed. 5–10× speedup in practice.
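In code, the point-prompt loop looks roughly like this. A minimal sketch using the original segment-anything package (SAM 2 exposes a similar image-predictor API); the checkpoint, image path, and click coordinate are placeholders:

```python
# pip install segment-anything opencv-python numpy
import cv2
import numpy as np
from segment_anything import sam_model_registry, SamPredictor

# Load a SAM checkpoint (download the ViT-H weights separately; path is a placeholder).
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

# Embed the image once, then prompt it as many times as you like.
image = cv2.cvtColor(cv2.imread("frame.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# One positive click (label=1) at pixel (x=500, y=320) -> a few candidate masks.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 320]]),
    point_labels=np.array([1]),
    multimask_output=True,
)
best_mask = masks[np.argmax(scores)]  # boolean HxW array; review it, then export
```

The expensive step (embedding the image) happens once per image; every subsequent click is near-instant, which is where the speedup comes from.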

Every major annotation platform integrated SAM in 2023–2024: CVAT, Label Studio, Roboflow, V7, Labelbox. By 2026, most have moved on to SAM 2 / SAM 3 integration for concept-prompted (text-driven) labeling. Roboflow Annotate ships a SAM-2-powered label assistant in the interface, and CVAT offers SAM, Mask R-CNN, and YOLO models as auto-annotation backends. Don't label segmentation masks by hand anymore.

Programmatic labeling

When you don't need a UI and can label in a script (a minimal sketch follows the list):

  • SAM 2 (Meta) — click / box / text → mask.
  • Grounding DINO — text → bounding box. Open-vocabulary detection.
  • Grounded-SAM — combines the two: text → mask.
  • Autodistill (Roboflow) — wraps several of the above into a "detect cars" → labeled dataset pipeline.
  • VLM labeling (LabelGPT-style) — send each image to a vision-language model and get labels back. Expensive but flexible.
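The Autodistill route, sketched from the published quickstart; the prompts, class names, and folder paths are placeholders, and the package split (autodistill-grounded-sam) may change between versions:

```python
# pip install autodistill autodistill-grounded-sam
from autodistill.detection import CaptionOntology
from autodistill_grounded_sam import GroundedSAM

# Map text prompts (what Grounding DINO searches for) to the class names you want exported.
ontology = CaptionOntology({"car": "car", "license plate": "plate"})
base_model = GroundedSAM(ontology=ontology)

# Label every image in the folder; writes out an annotated dataset you can review and correct.
dataset = base_model.label(
    input_folder="./unlabeled_images",
    extension=".jpg",
    output_folder="./labeled_dataset",
)
```

Treat the output as a first pass: spot-check and correct it in CVAT or Roboflow before training on it.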

Benchmark datasets (the canonical ones)

[!WARNING] Dataset licenses matter: most academic CV datasets are research-only. Training on these and shipping the resulting weights commercially is a license violation in most cases.

  • ImageNet: non-commercial research license from Stanford/Princeton.
  • COCO: annotations are CC-BY-4.0 (commercial OK with attribution); the images come from Flickr under a mix of licenses, so check before commercial use.
  • Open Images: CC-BY-2.0 images and CC-BY-4.0 annotations (commercial OK with attribution). Good baseline for commercial training.
  • CelebA: non-commercial research use only.
  • CASIA-WebFace / VGGFace2 / MS-Celeb-1M / Glint360K / WebFace260M: all research-only.
  • SA-1B (SAM training set): research license.
  • LAION-5B / LAION-2B: CC-BY-4.0 metadata; the images sit behind third-party URLs with their own rights, so training on them is legally murky and the subject of open lawsuits.
  • LFW / MOT / Kinetics / ADE20K / Cityscapes: research/academic licenses. Evaluation is fine; commercial deployment of derived weights is not.

If you're training for a commercial product: COCO, Open Images, and your own proprietary/licensed data are the cleanest options. Synthetic data (Blender, Omniverse, diffusion-generated) also bypasses dataset licensing.

Detection

  • COCO (2014) — 80 classes, 330K images. The universal detection benchmark.
  • Open Images V7 — ~9M images, 600+ classes. Larger but noisier than COCO.
  • Objects365 — 365 classes, more diversity than COCO.
  • LVIS — COCO images re-labeled with 1,200 long-tail categories.
  • Visual Genome — dense scene graphs. More about relationships than pure detection.

Segmentation

Face

  • LFW (Labeled Faces in the Wild, 2007) — saturated. 99.8%+ accuracy for any modern recognizer. Not diagnostic anymore.
  • MS1M / MS-Celeb-1M — training set, not benchmark.
  • IJB-C — unconstrained face recognition. Still useful as a harder benchmark.
  • MegaFace — 1M distractors. Historical.
  • WIDER Face — face detection benchmark.
  • QMUL-SurvFace — low-resolution, surveillance-quality imagery. A harder, more realistic benchmark for modern face recognition.

OCR

  • ICDAR series — the standard OCR/text detection benchmarks.
  • SROIE — receipt parsing.
  • FUNSD — form understanding.
  • DocVQA — document visual question answering.
  • TextVQA — scene text VQA.

Pose

Video / Tracking

Classification

When to pick something else

  • Medical data (HIPAA, PHI) → air-gapped labeling. V7 or Labelbox with an on-prem install. Don't use SaaS for patient data without a BAA.
  • Sensitive security data → self-hosted CVAT or custom tooling.
  • Synthetic data for rare classes → Blender / Unreal / NVIDIA Omniverse for rendering, or diffusion models (Flux / SDXL) to generate or edit specific objects (see the sketch after this list).
  • Data versioning beyond annotation → DVC, LakeFS, Weights & Biases Artifacts, Roboflow's versioning.
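For the diffusion-generated option, a minimal sketch with Hugging Face diffusers and SDXL; the prompt and output folder are placeholders, and real pipelines usually add ControlNet conditioning or post-hoc filtering to keep the generated distribution realistic:

```python
# pip install diffusers transformers accelerate torch
from pathlib import Path

import torch
from diffusers import StableDiffusionXLPipeline

# Load SDXL base weights (assumes a CUDA GPU with enough VRAM for fp16).
pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
).to("cuda")

# Generate a handful of examples of a rare class; the prompt is a placeholder.
prompt = "a forklift carrying a damaged pallet, warehouse CCTV viewpoint"
out_dir = Path("synthetic")
out_dir.mkdir(exist_ok=True)

for i in range(4):
    image = pipe(prompt, num_inference_steps=30).images[0]
    image.save(out_dir / f"forklift_{i:03d}.png")
```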

The three questions to narrow

  1. Who labels? Team → CVAT / Roboflow. Solo or hobby → Label Studio / Roboflow free tier.
  2. Data sensitivity? Sensitive → self-hosted. Not sensitive → managed SaaS.
  3. Is SAM enough to bootstrap? Yes (most modern tasks) → use SAM-integrated tool. No → hand labeling with a motivated team.

The Dump

Open-source labeling

Managed platforms

Auto-labeling tools

Data management

Synthetic data

Graveyard

  • Labelbox's original free tier — retired; now enterprise-focused.
  • Amazon SageMaker Ground Truth as a leader — still exists, but dev experience lags Roboflow.
  • Hand-labeling segmentation masks without SAM — retired. You shouldn't be doing this in 2026.
  • Google AutoML Vision — folded into Vertex AI.

Last reviewed

2026-04-22.