Image Generation

Diffusion models, image edits, controllable synthesis. The default open pick flipped from SDXL to Flux in 2024. Commercial models keep their lead on prompt adherence.

Generation is a distinct world from perception — different libraries, different compute profiles, different failure modes. Most CV engineers don't need it early; it becomes relevant for content tooling, synthetic data, editing, or generative UX.

| Use case | Pick | When to use |
| --- | --- | --- |
| Commercial flagship | Midjourney v7, Flux 2 Pro, or DALL-E 3 | Marketing / hero images; prompt adherence matters; quality > control |
| Open default (photorealism) | Flux 2 (Black Forest Labs, Nov 2025) | Up to 4MP output, best open photorealism, strong prompt adherence |
| Open default (text-in-image) | Qwen-Image (Alibaba, Apache 2.0) | Best-in-class multilingual text rendering within images. Commercial-friendly license. |
| Stable / mature open | SDXL 1.0 + LoRAs | Massive ecosystem of fine-tunes and community tools; prefer over Flux only if you need an existing LoRA |
| Controllable generation | Flux + ControlNet or SDXL + ControlNet | Pose-controlled, edge-controlled, depth-controlled generation |
| Editing (img2img / inpainting) | Flux Fill, SDXL Inpainting, or a commercial "edit" API | Targeted regions of an existing image |
| Fast / real-time | SDXL Turbo, SD Lightning, or Flux Schnell | Live preview, sketch-to-image, interactive apps |

[!WARNING] Open image-gen weights have messy licenses — check before shipping.

  • Flux.1 [dev]: FLUX.1 [dev] Non-Commercial License — research/personal only, no commercial use of the weights.
  • Flux.1 [schnell]: Apache-2.0 — commercial-safe.
  • Flux 2 Pro / Pro 1.1: commercial API-only (Black Forest Labs hosted). The open Flux 2 [dev] weights inherit the same non-commercial terms as Flux.1 [dev].
  • Stable Diffusion 3 / 3.5: Stability Community License — free below $1M annual revenue or non-commercial; paid Enterprise License required above that threshold.
  • SDXL 1.0: CreativeML OpenRAIL++-M — commercial-safe with use restrictions (no illegal / harmful content, attribution).
  • SD 1.5: OpenRAIL-M — similar terms to SDXL.
  • Qwen-Image: Apache-2.0 — commercial-safe.

For commercial deployment with no license friction: Flux.1 [schnell] (Apache-2.0), SDXL 1.0 (OpenRAIL++-M), or Qwen-Image (Apache-2.0). For commercial flagship quality with a clean license: pay for Flux 2 Pro / Midjourney / DALL-E / Imagen API.
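The license notes above can be encoded as a small lookup so that a build or review script flags non-commercial weights before they ship. This is a minimal sketch: the `LICENSES` table, the model keys, and `can_ship_commercially` are hypothetical names, and the entries only restate the summary on this page, so read the actual license text before relying on it.

```python
# Hypothetical helper: restates this page's license summary as data.
# Keys and names are illustrative and not part of any library.

# model key -> (license name, revenue cap in USD or None, commercial use allowed)
LICENSES = {
    "flux.1-dev": ("FLUX.1 [dev] Non-Commercial License", None, False),
    "flux.1-schnell": ("Apache-2.0", None, True),
    "flux.2-dev": ("Non-commercial (same terms as Flux.1 [dev])", None, False),
    "sd3.5": ("Stability Community License", 1_000_000, True),
    "sdxl-1.0": ("CreativeML OpenRAIL++-M", None, True),
    "sd-1.5": ("OpenRAIL-M", None, True),
    "qwen-image": ("Apache-2.0", None, True),
}


def can_ship_commercially(model: str, annual_revenue_usd: float) -> bool:
    """Return True if the summary above allows commercial use of the weights."""
    _name, revenue_cap, commercial_ok = LICENSES[model]
    if not commercial_ok:
        return False
    if revenue_cap is not None and annual_revenue_usd >= revenue_cap:
        # At or above the threshold a paid Enterprise License is required.
        return False
    # Use restrictions (for example the OpenRAIL terms) still apply.
    return True
```

A check such as `can_ship_commercially("sd3.5", 2_000_000)` returns `False`, which is the cue to either buy the Enterprise License or switch to an Apache-2.0 model.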

Why Flux 2 is the current open leader

Flux 2 (Black Forest Labs, November 2025) is the biggest open-model step forward since SDXL. It brings stronger photorealism, better prompt comprehension, and up to 4MP output, which makes it the best single choice for portrait, product, and hero image quality.
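To make the "up to 4MP" figure concrete, the arithmetic below turns a megapixel budget and an aspect ratio into a width and height. It is a sketch under two assumptions that are not from this page: 1MP is treated as 1,000,000 pixels, and dimensions are snapped down to a multiple of 16, a common constraint for latent diffusion pipelines that may not match what Flux 2 actually requires. `dims_for_budget` is a hypothetical helper name.

```python
import math


def dims_for_budget(aspect_w: int, aspect_h: int,
                    megapixels: float = 4.0, multiple: int = 16) -> tuple[int, int]:
    """Largest (width, height) at the given aspect ratio within the pixel budget.

    Assumptions (not from the model documentation): 1MP = 1,000,000 pixels,
    and each side is rounded down to a multiple of `multiple`.
    """
    budget = megapixels * 1_000_000
    # Solve width * height = budget with width / height = aspect_w / aspect_h.
    height = math.sqrt(budget * aspect_h / aspect_w)
    width = height * aspect_w / aspect_h

    def snap(value: float) -> int:
        # Round down so the result never exceeds the budget.
        return max(multiple, int(value) // multiple * multiple)

    return snap(width), snap(height)


print(dims_for_budget(1, 1))    # (2000, 2000)
print(dims_for_budget(16, 9))   # (2656, 1488)
```

Rounding down keeps the result inside the budget: 2656 x 1488 is about 3.95MP.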

Flux family as of 2026:

  • Flux 2 Pro — API-only commercial, max quality.
  • Flux.1 [dev] — 12B params, FLUX.1 [dev] Non-Commercial License. Still widely used in research and non-commercial tooling because of the maturing ecosystem. Don't ship this in a commercial product.
  • Flux.1 [schnell] — Apache-2.0 license, 1-step or 4-step generation. Use this for self-hosted commercial.

For self-hosted commercial: Flux.1 [schnell] (Apache) or Qwen-Image (Apache). For highest-quality self-hosted with non-commercial acceptable: Flux 2 / Flux.1 [dev] weights.

Why Qwen-Image for text-in-image

Qwen-Image (Alibaba, Apache 2.0) integrates language and layout reasoning directly into the architecture. It offers best-in-class multilingual text rendering within images, covering font consistency, spatial alignment, and complex backgrounds. If your use case involves typography in the generated image, Qwen-Image beats Flux and SDXL on that axis. It is also Apache-2.0 licensed, which matters for commercial use.

SDXL — still the mature ecosystem

SDXL 1.0 shipped mid-2023 and spawned the largest ecosystem in image generation: hundreds of thousands of LoRAs on Civitai, dozens of ControlNet variants, strong community tooling (ComfyUI, Automatic1111). It is sometimes the right pick precisely because of that ecosystem, regardless of its age:

  • You need a specific style LoRA → probably on SDXL, not Flux.
  • You need IP-Adapter for face consistency → mature on SDXL.
  • You're using an existing ComfyUI graph → it's SDXL-shaped.

Controllable generation

Plain text → image is often not enough for production use. You usually want to constrain output — pose a character, preserve layout, edit one region. That's ControlNet territory.

  • ControlNet (Zhang et al., 2023) — the control-by-condition framework. Variants for pose (OpenPose / DWPose input), depth, edges (Canny / SoftEdge), segmentation, scribble, lineart.
  • IP-Adapter — image prompting; condition the output on a reference image's style or identity.
  • T2I-Adapter — lighter-weight alternative to ControlNet.

Both Flux and SDXL have ControlNet support. SDXL's ecosystem is broader; Flux is catching up.
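As an illustration of edge-controlled generation, the sketch below wires a Canny edge map into an SDXL ControlNet pipeline with Hugging Face `diffusers`. Treat it as one possible setup rather than a reference implementation: the checkpoint `diffusers/controlnet-canny-sdxl-1.0`, the Canny thresholds, and the 0.0 to 2.0 sanity range on the conditioning scale are assumptions, and the helper names are hypothetical. Heavy imports sit inside the functions so the module loads without a GPU stack.

```python
def canny_condition(image_path: str, low: int = 100, high: int = 200):
    """Turn an input photo into a 3-channel Canny edge map (PIL image)."""
    import cv2
    import numpy as np
    from PIL import Image

    gray = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    if gray is None:
        raise FileNotFoundError(image_path)
    edges = cv2.Canny(gray, low, high)  # thresholds are illustrative
    return Image.fromarray(np.stack([edges] * 3, axis=-1))


def load_canny_sdxl_pipeline():
    """Load SDXL base plus a Canny ControlNet (assumed checkpoint names)."""
    import torch
    from diffusers import ControlNetModel, StableDiffusionXLControlNetPipeline

    controlnet = ControlNetModel.from_pretrained(
        "diffusers/controlnet-canny-sdxl-1.0", torch_dtype=torch.float16
    )
    pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
        "stabilityai/stable-diffusion-xl-base-1.0",
        controlnet=controlnet,
        torch_dtype=torch.float16,
    )
    pipe.enable_model_cpu_offload()  # needs accelerate and a CUDA GPU
    return pipe


def control_call_kwargs(prompt: str, condition_image, conditioning_scale: float = 0.5) -> dict:
    """Build keyword arguments for the pipeline call.

    The 0.0-2.0 bound is an assumed sanity range, not a library limit.
    """
    if not 0.0 <= conditioning_scale <= 2.0:
        raise ValueError("conditioning_scale outside the assumed 0.0-2.0 range")
    return {
        "prompt": prompt,
        "image": condition_image,
        "controlnet_conditioning_scale": conditioning_scale,
    }


# Usage (downloads several GB of weights on first run):
#   pipe = load_canny_sdxl_pipeline()
#   cond = canny_condition("layout.png")
#   image = pipe(**control_call_kwargs("a glass office lobby", cond)).images[0]
```

Lower conditioning scales let the prompt dominate, while higher values hold the output closer to the edge map. Pose or depth control follows the same shape with a different preprocessor and ControlNet checkpoint.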

When to pick something else

  • Consistent characters across many images → specialized pipelines (LoRA training on the character, or InstantID / PhotoMaker).
  • Text rendering in the image (signs, logos) → Qwen-Image is the open winner (it beats Flux and SDXL on this axis, as noted above). DALL-E 3 and Midjourney are also strong. SDXL is weak here.
  • Video generation → different world. See the Video generation section below. It is a moving target.
  • 3D-aware generation → NeRF / Gaussian Splatting / 3D-native models. Out of scope for this page.

Running Flux / SDXL locally
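A minimal sketch of local inference with Hugging Face `diffusers`, assuming a CUDA GPU and the `torch`, `diffusers`, and `accelerate` packages. The repository IDs point at the public Flux.1 [schnell] and SDXL 1.0 base checkpoints. The SDXL step count and guidance scale are illustrative starting points, and `LocalModel`, `MODELS`, and `generate` are hypothetical names used only for this example.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class LocalModel:
    repo_id: str          # Hugging Face Hub repository
    pipeline_class: str   # class name inside the diffusers package
    dtype: str            # attribute name on the torch module
    call_kwargs: dict = field(default_factory=dict)


MODELS = {
    # Schnell is timestep-distilled: few steps, no classifier-free guidance.
    "flux-schnell": LocalModel(
        repo_id="black-forest-labs/FLUX.1-schnell",
        pipeline_class="FluxPipeline",
        dtype="bfloat16",
        call_kwargs={"num_inference_steps": 4, "guidance_scale": 0.0,
                     "max_sequence_length": 256},
    ),
    # Illustrative SDXL settings; tune per use case.
    "sdxl": LocalModel(
        repo_id="stabilityai/stable-diffusion-xl-base-1.0",
        pipeline_class="StableDiffusionXLPipeline",
        dtype="float16",
        call_kwargs={"num_inference_steps": 30, "guidance_scale": 7.0},
    ),
}


def generate(model_key: str, prompt: str, seed: int = 0):
    """Load the chosen pipeline and return one PIL image."""
    # Imported lazily so this module loads on machines without a GPU stack.
    import torch
    import diffusers

    spec = MODELS[model_key]
    pipeline_cls = getattr(diffusers, spec.pipeline_class)
    pipe = pipeline_cls.from_pretrained(
        spec.repo_id, torch_dtype=getattr(torch, spec.dtype)
    )
    pipe.enable_model_cpu_offload()  # trades speed for lower VRAM use
    generator = torch.Generator("cpu").manual_seed(seed)
    return pipe(prompt, generator=generator, **spec.call_kwargs).images[0]


# Usage (downloads the weights on first run):
#   generate("flux-schnell", "a product photo of a ceramic mug").save("mug.png")
```

Keeping per-model settings in one table makes it harder to carry SDXL-style guidance values over to Schnell, which expects guidance disabled.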

Commercial APIs

  • Midjourney — Discord/web interface. Highest aesthetic quality. No API until recently (now limited beta).
  • DALL-E 3 (via OpenAI API) — strong prompt adherence, tight content policy.
  • Imagen 3 (via Google AI Studio / Vertex) — fast, strong.
  • Stable Diffusion 3 Large (via Stability AI API) — the commercial variant of SD3.
  • Flux Pro (via Black Forest Labs / Replicate) — hosted.
  • Ideogram — specialized in text-in-image.
  • Leonardo / Stability / Replicate / fal.ai — aggregator/hosting platforms.

Video generation (fast-moving, verify before relying on)

Leaderboard as of April 2026 (these rotate every 1–2 months):

Commercial

  • Kling 3.0 (Kuaishou, Feb 2026) — multi-shot sequences 3–15 seconds, 4K native, subject consistency across shots. Cost-effective (~$0.50/clip).
  • Veo 3.1 (Google, Jan 2026) — 4K native, best-in-class lip sync for talking heads.
  • Seedance 2.0 (ByteDance, Feb 2026) — first with unified audio-video joint generation, multi-shot storytelling from a single prompt, phoneme-level lip sync in 8+ languages.
  • Runway Gen-4 — commercial, broad creative tooling.
  • Luma Dream Machine — commercial.

Open

  • Wan 2.6 (Alibaba) — open-weight, strong, full control over the generation process.
  • HunyuanVideo (Tencent) — open weights.
  • LTX-Video — open, fast.
  • CogVideoX — open.
  • AnimateDiff — motion LoRA on top of SD/SDXL. Lower quality than dedicated video models.

Graveyard (video)

  • Sora (OpenAI) — shut down March 24, 2026, six months after launch. Do not recommend.
  • Sora 2 — preceded the shutdown; coherent prompts and camera control were praised.

Video generation is moving faster than image generation and the recommendations here will age inside 3 months. Verify before citing.

The Dump

Diffusion model families

Tools

Control / edit

Commercial

  • Midjourney, DALL-E, Imagen, Flux Pro, Ideogram — see above.

Graveyard

  • SD 1.5 as a default for new work — retired; use SDXL or Flux unless ecosystem-locked.
  • DALL-E 2 — retired by OpenAI.
  • Disco Diffusion / VQ-VAE-2 — historical.
  • Original ControlNet 1.0 weights — succeeded by 1.1 and SDXL-native variants.

Last reviewed

2026-04-22. Generation is the second-fastest-moving page after VLM.