Cloud Vision APIs¶
Managed services for common CV tasks. Useful when you don't want to operate models yourself, accept per-request cost, and trust a vendor.
Cloud vision has bifurcated in 2026 into two worlds: task-specific APIs (detect faces, read text, label scenes) that charge per-request at low unit cost, and VLM APIs (send an image + a prompt, get unstructured output) that cost more but do almost anything. For many use cases, a VLM API has replaced the specialized one.
Recommended picks¶
| Use case | Pick | When to use |
|---|---|---|
| General "what's in this image" | Gemini 2.5 Flash, GPT-4o, or Claude Sonnet (a VLM) | Anything that doesn't fit a rigid template. |
| Face rec at scale | AWS Rekognition or Face++ | Pre-enrolled identity matching, large volumes, regulatory compliance. |
| OCR at scale | Google Cloud Vision Text or AWS Textract | Structured invoice/form extraction at volume. |
| Content moderation | AWS Rekognition Content Moderation, Hive, Sightengine | Explicit content detection, structured label output. |
| Live video analysis | AWS Rekognition Video or Azure Video Indexer | Large-scale video with indexing/search needs. |
The VLM-ate-the-cloud-API shift¶
Gemini/GPT-4o/Claude can now: - Describe an image in detail. - Extract structured JSON from an image given a schema. - Read handwritten text. - Identify known objects (within their training data). - Detect faces (not for identity — just presence). - Moderate content (sometimes, with post-filtering).
This doesn't kill task-specific APIs. It does mean: if you're prototyping and volume is low, start with a VLM API; migrate to task-specific APIs only when cost justifies it.
Rough cost comparison (2026 pricing, varies): - VLM API flagship: ~$0.005–0.02 per image. - VLM API budget (Flash-class): ~$0.0005–0.002 per image. - AWS Rekognition label detection: ~$0.001 per image. - Google Cloud Vision label detection: ~$0.0015 per image.
For low/medium volume (< 10K images/day), VLM APIs are competitive. At 1M+ images/day, task-specific APIs win on cost.
Task-specific API stacks¶
AWS Rekognition¶
- Label detection — general "what's in this image."
- Face detection + analysis — emotion, age, glasses. Not identity.
- Face comparison / Search — enrolled identity matching.
- Face Liveness — challenge-response anti-spoofing. Strong.
- Content moderation — explicit content.
- Text detection — basic OCR.
- PPE detection — purpose-built for safety gear.
- Celebrity recognition — pre-enrolled famous-person identification.
- Rekognition Video — same tasks for video streams.
Google Cloud Vision¶
- Label detection — general tagging.
- Text detection / Document text detection — OCR, strong.
- Object localization — detection.
- Face detection — bounding boxes + landmarks. Not identity (Google doesn't offer identity).
- Web detection — "where does this image appear online." Unique.
- Safe search — content moderation.
- Cloud Vision AutoML — custom classifier/detector training.
- Document AI — structured extraction (invoices, forms). Stronger than base Vision.
Azure Computer Vision / Azure AI Vision¶
- Image Analysis 4.0 — labels, tags, descriptions. The Florence foundation model integration (2025) significantly improved caption quality and natural-language image queries; tends to beat Google/AWS on those specific axes.
- Spatial Analysis — people-counting, movement tracking, occupancy for physical spaces. Unique to Azure among the three; Google and AWS don't offer the equivalent.
- Read / OCR — document text.
- Face API — detection, analysis, identity. Microsoft restricted Face identity access to approved customers in 2022.
- Custom Vision — custom classifier/detector training.
- Azure Document Intelligence — structured doc extraction.
- Azure Video Indexer — video analysis with search.
Other task-specific¶
- Face++ / Megvii — Chinese market. Strong face rec. Cheaper than AWS/Azure in Asia.
- Clarifai — Custom model hosting + pre-built APIs.
- Amazon Textract — specialized on forms/tables, beats base Vision for structured docs.
- OpenAI Moderation API — text + image content moderation (text stronger than image).
- Hive — content moderation specialist. NSFW, violence, etc.
- Sightengine — content moderation, general CV API.
When to pick something else¶
- Regulated data (healthcare, government) → on-prem / self-hosted. Don't send patient data to commercial APIs without BAA/contract review.
- Identity verification with KYC compliance → specialized providers (Jumio, Onfido, Persona, IDnow). Not general cloud APIs; they handle the full KYC flow.
- Very high volume with a narrow task → self-host YOLO or similar. Break-even is somewhere around 10M+ images/month depending on task.
- Latency below 100ms round-trip → edge / self-hosted. Cloud round-trip eats your latency budget.
The three questions to narrow¶
- Volume? < 10K/day → VLM API. 10K–1M → task API. 1M+ → consider self-hosting.
- Data sensitivity? Regulated → self-host or vendor-with-BAA. Public → any.
- Task specificity? Rigid (invoice fields) → Textract/Document AI. Flexible (describe this) → VLM API.
The Dump¶
AWS¶
- Rekognition (image + video, face, label, liveness, moderation, PPE, celebrity).
- Textract (forms + tables + handwriting).
- Comprehend (text, not CV, but often chained).
- SageMaker (if you train your own).
Google Cloud¶
- Cloud Vision API (labels, text, objects, landmarks, faces, web detection, safe search).
- Document AI (structured extraction).
- Vertex AI (custom model training/deployment).
- Video Intelligence (video analysis).
Azure¶
- Azure AI Vision (image analysis, OCR/Read, face — restricted).
- Document Intelligence (forms).
- Video Indexer (video).
- Custom Vision (AutoML).
Anthropic / OpenAI / Google AI (VLM APIs)¶
- Claude 4 Sonnet / Opus / Haiku (Anthropic).
- GPT-5 / GPT-4o / GPT-4o-mini (OpenAI).
- Gemini 2.5 Pro / Flash (Google AI Studio / Vertex).
Specialized / regional¶
- Face++ (Megvii) — face identity, especially in China.
- Baidu AI Open Platform — Chinese market.
- Yandex Vision — Russian market.
- Kakao Vision / Naver Clova — Korean market.
Content moderation specialists¶
- Hive — ML-based moderation.
- Sightengine — images + video.
- Amazon Rekognition Content Moderation — AWS-native.
- Microsoft Content Moderator — deprecated in 2024, replaced by Azure AI Content Safety.
KYC / Identity verification (chain includes face rec)¶
- Jumio — full KYC stack.
- Onfido — full KYC stack.
- Persona — developer-friendly KYC.
- IDnow — European market.
- Veriff — European-focused.
- Socure — US-focused.
Other¶
- Clarifai — model hub + APIs.
- Mindee — document AI API, dev-friendly.
- Nanonets — document AI + OCR + data extraction.
Graveyard¶
- Microsoft Face API (for general identity) — restricted in 2022. Still works for existing customers in approved use cases.
- Cognitive Services Custom Vision as a leader — eclipsed by SageMaker / Vertex for custom model training.
- AWS DeepLens — AWS's edge hardware, discontinued.
- Google Cloud AutoML Vision — subsumed into Vertex AI.
Last reviewed¶
2026-04-22.