
Vision-Language Models (VLMs)

The subtask that ate half of traditional CV. In 2026, a capable VLM is often the right first try for any task where the output is text or structured data from an image.

VLMs changed the calculus of CV deployment. For any task where the output is descriptive ("what's in this image?", "extract the fields from this receipt", "is this X doing Y?"), a VLM is now a reasonable first implementation. You trade per-inference cost and latency for not training anything. For many products, that trade is correct.

The field moves fast. The picks below are current as of 2026-04-22 and will age in months, not years.

| Tier | Pick | When to use |
| --- | --- | --- |
| Commercial flagship | Gemini 2.5 Pro, GPT-5, or Claude 4 Sonnet/Opus | Prototyping, low-volume production, when quality matters more than cost |
| Commercial budget | Gemini 2.5 Flash, GPT-5-mini, Claude Haiku | High-volume, latency-sensitive. Often 10× cheaper than flagship. |
| Open flagship | Qwen3-VL-235B-A22B or InternVL3-78B | Rivals Gemini 2.5 Pro / GPT-5 on multimodal benchmarks. Self-hosted, data-sensitive, cost-critical at scale. |
| Open small | Qwen3-VL-7B, InternVL3-8B, LLaVA-OneVision-7B | Edge GPU, on-prem, prototyping |

[!WARNING] Open VLM license map (check before shipping):

  • Qwen3-VL / Qwen2-VL: Apache-2.0 — commercial-safe (smaller variants). The 72B / 235B flagship checkpoints carry the Qwen Research License — check the per-model card before commercial deployment.
  • InternVL3 / InternVL2: MIT — commercial-safe.
  • LLaVA-OneVision / LLaVA-NeXT: Apache-2.0 code, but the fine-tuning data (generated with GPT-4) carries OpenAI T&C restrictions on training competing models.
  • Gemma 3 / PaliGemma: Gemma Terms of Use — allows commercial use with restrictions (prohibited-uses list, attribution). Not Apache-2.0.
  • Molmo (Allen AI): Apache-2.0 — commercial-safe, fully open data.
  • CogVLM2: custom license — commercial use allowed under the CogVLM2 License; register with the Tsinghua team.
  • Florence-2: MIT — commercial-safe.
  • MiniCPM-V: Apache-2.0 (recent versions) — commercial-safe, but check the specific variant.
  • Flamingo: closed, not available. open_flamingo: MIT, but weights trained on LAION have separate terms.

Commercial managed flagships (Gemini / GPT / Claude) handle licensing for you via API T&C.

Reading the picks: the operational default depends on your deployment constraints. Teams with existing Qwen2-VL / InternVL2 pipelines and strict stability requirements can reasonably stay put; Qwen3-VL and InternVL3 are the latest strong contenders (newer generation, better benchmarks) and are worth testing rather than treating as an urgent upgrade. Research max shifts monthly in this space; expect this page to age the fastest.

Why a VLM instead of training a model

Three cases where a VLM is the right first try:

  1. You have no labeled data and the task is describable in words. Ask the VLM. Beats training from scratch on no data.
  2. The task is "understand a messy image" (messy receipts, handwritten notes, complicated scenes). Specialized CV models struggle; VLMs are trained on billions of such inputs.
  3. You need flexibility. VLM prompts change faster than model retraining.

When NOT to use a VLM

  • Latency-critical (< 100ms per inference). Even Flash-class VLMs are 500ms+ per call.
  • High-volume inference at low unit cost. A YOLO detector at $0.0001/image crushes a VLM at $0.01/image when you're at 100M images (see the arithmetic after this list).
  • Binary classification with plenty of labeled data. A small fine-tuned CNN will outperform a VLM by a mile on in-distribution tasks at 1/1000th the cost.
  • On-device, no network (phone apps). The small open VLMs barely fit; they're expensive at inference.
  • Precision-critical structured outputs where hallucination is unacceptable (legal, medical OCR). VLMs hallucinate; use them with verification.
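
A back-of-envelope version of the unit-economics point above. The per-image prices are the illustrative figures from the bullet, not real quotes; plug in your own.

```python
# Illustrative unit economics only -- substitute your actual pricing.
vlm_cost_per_image = 0.01       # budget-tier commercial VLM, per call
yolo_cost_per_image = 0.0001    # amortized self-hosted detector, per image
volume = 100_000_000            # images processed

print(f"VLM:      ${vlm_cost_per_image * volume:,.0f}")   # $1,000,000
print(f"Detector: ${yolo_cost_per_image * volume:,.0f}")  # $10,000
```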

The three questions to narrow

  1. Can you send data to a third party? No → open model (Qwen, InternVL). Yes → commercial API.
  2. What's the per-inference budget? < $0.001 → not a VLM, use a specialized model. $0.001–0.01 → budget-tier commercial. $0.01+ → any tier.
  3. Do you need structured output? Yes → Gemini (native JSON), GPT with function calling, or Claude with tool use. Direct prompting often produces malformed JSON; see the sketch after this list.
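
A minimal sketch of question 3 in practice, assuming an OpenAI-style chat completions API with JSON mode; the model name, schema fields, and image path are placeholders, and Gemini / Claude have their own equivalents (response MIME type, tool use).

```python
# Sketch only: JSON-mode extraction from a receipt image via an OpenAI-style API.
# Model name, schema fields, and the image path are illustrative placeholders.
import base64
import json

from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("receipt.jpg", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

resp = client.chat.completions.create(
    model="gpt-5-mini",  # budget tier; swap per the table at the top
    response_format={"type": "json_object"},  # JSON mode: syntactically valid JSON
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": (
                'Extract {"merchant": str, "date": "YYYY-MM-DD", "total": float} '
                "from this receipt. Respond with JSON only."
            )},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
        ],
    }],
)
fields = json.loads(resp.choices[0].message.content)
print(fields)
```

JSON mode guarantees parseable output, not correct values; for precision-critical pipelines, verify critical fields downstream.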

Prompting patterns that work

  • Few-shot with 2–3 examples → significantly better than zero-shot for structured extraction.
  • Schema in the prompt (a JSON blueprint) → better than "please extract fields."
  • "Think step by step" for complex scenes → VLMs benefit from explicit reasoning traces.
  • Ground-truth image in the prompt → for consistency, show the VLM a canonical example in the same call. A combined sketch of these patterns follows this list.
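
A sketch combining the schema-in-prompt, few-shot, and step-by-step patterns into one prompt builder. The schema and the example outputs are invented for illustration; real few-shot works best when the example images are attached in the same call alongside these texts.

```python
# Illustrative prompt builder: schema in the prompt + few-shot + explicit reasoning.
# Field names and example values are invented; attach the example images in the
# same API call for true few-shot conditioning.
SCHEMA = '{"merchant": str, "date": "YYYY-MM-DD", "total": float, "currency": str}'

FEW_SHOT = """\
Example 1 (clean printed receipt):
{"merchant": "Blue Bottle Coffee", "date": "2026-03-02", "total": 11.50, "currency": "USD"}

Example 2 (crumpled, handwritten total):
{"merchant": "Corner Deli", "date": "2026-03-05", "total": 23.10, "currency": "USD"}"""


def build_extraction_prompt() -> str:
    return (
        f"Extract exactly this JSON schema from the attached receipt:\n{SCHEMA}\n\n"
        f"{FEW_SHOT}\n\n"
        "Think step by step about the layout first, then output only the JSON object."
    )
```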

Commercial APIs

  • OpenAI — GPT-5, GPT-5-mini, GPT-4o. Mature API. JSON mode + function calling. Image input standard.
  • Google — Gemini 2.5 Pro / Flash. Leads LMArena and WebDevArena for vision + coding. 1M+ token context. Native multimodal from training.
  • Anthropic — Claude 4 family (Opus, Sonnet, Haiku). Strongest factuality discipline. "Computer use" capability. PDF input native (parsed as images).
  • xAI (Grok) — Grok Vision available. Less mature ecosystem.

Open models

  • Qwen3-VL (Alibaba) — 2B / 7B / 32B / 235B variants. Flagship Qwen3-VL-235B-A22B rivals Gemini 2.5 Pro and GPT-5 on multimodal benchmarks (Q&A, 2D/3D grounding, video understanding, OCR, document comprehension).
  • InternVL3 (Shanghai AI Lab) — 1B / 2B / 8B / 14B / 38B / 78B. InternVL3-78B scored 72.2 on MMMU — SOTA among open MLLMs at release. Adds tool use, GUI agents, 3D vision perception.
  • Qwen2-VL / InternVL2 — previous generations. Still widely deployed; upgrade when convenient. A loading sketch using Qwen2-VL follows this list.
  • LLaVA lineage (LLaVA-OneVision, LLaVA-NeXT) — academic. Easy to fine-tune. Slightly behind Qwen/InternVL on benchmarks.
  • CogVLM2 (Tsinghua) — 19B. Solid open option.
  • Molmo (Allen AI) — trained on open data, fully open weights + data. Slightly behind proprietary on most benchmarks.
  • Gemma 3 — Google's open VLM family. Strong mid-tier performer. License: Gemma Terms of Use — commercial allowed with prohibited-uses restrictions.
  • Florence-2 (Microsoft) — not a chat VLM but a multi-task vision model (captioning, detection, segmentation). Lightweight (~0.8B).
  • PaliGemma (Google) — 3B VLM, fine-tuning-friendly. License: Gemma Terms of Use.
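
A minimal Transformers smoke test, assuming the previous-generation Qwen2-VL noted above because its Qwen2VLForConditionalGeneration class has been stable in transformers; the model ID and image path are placeholders, and newer checkpoints may need a different class or a transformers upgrade.

```python
# Sketch: single-image inference with an open VLM via Hugging Face Transformers.
# Assumes a recent transformers release with Qwen2-VL support; paths are placeholders.
from PIL import Image
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration

model_id = "Qwen/Qwen2-VL-7B-Instruct"
model = Qwen2VLForConditionalGeneration.from_pretrained(
    model_id, torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe what is happening in this image."},
    ],
}]
prompt = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(
    text=[prompt], images=[Image.open("scene.jpg")], return_tensors="pt"
).to(model.device)

out = model.generate(**inputs, max_new_tokens=256)
print(processor.batch_decode(
    out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0])
```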

Running open VLMs

  • vLLM — the standard serving engine for open models. VLM support added in 2024; see the serving sketch after this list.
  • Ollama — easy local experimentation; limited VLM support, growing.
  • Hugging Face Transformers — reference implementations.
  • SGLang — alternative serving; strong performance.
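
A sketch of the vLLM route: serve an open VLM behind vLLM's OpenAI-compatible endpoint, then hit it with the standard client. The model ID, port, and image URL are placeholders; check the vLLM docs for any multimodal flags your model needs.

```python
# Sketch: querying an open VLM served by vLLM's OpenAI-compatible server.
# Start the server first (model ID illustrative):
#   vllm serve Qwen/Qwen2-VL-7B-Instruct --max-model-len 8192
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

resp = client.chat.completions.create(
    model="Qwen/Qwen2-VL-7B-Instruct",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": "https://example.com/receipt.jpg"}},  # placeholder
            {"type": "text", "text": "List the line items as JSON."},
        ],
    }],
)
print(resp.choices[0].message.content)
```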

Specialized multimodal tools (not general chat)

The Dump

Graveyard

  • BLIP-1 as a default captioner — superseded first by BLIP-2 and then by modern VLMs.
  • Training a VLM from scratch for a specific task — from 2024 onward, fine-tuning a small open VLM or simply calling a commercial API is almost always the better call than training from scratch.
  • OCR-specific CNN pipelines for messy documents — VLMs eat this for breakfast on unstructured inputs.

Last reviewed

2026-04-22. Expect this page to age fastest of any in the playbook. The commercial flagship picks have probably changed between drafting and the time you read this.