# Best GPU for FLUX Image Generation
Choosing a GPU for FLUX.1 [dev] and FLUX.1 [schnell] — VRAM requirements, image-per-second throughput, and the cheapest path to production.
FLUX is the current state-of-the-art open image model from Black Forest Labs. It's also notoriously VRAM-hungry: the transformer alone is 12B parameters, and you need room for the text encoders (CLIP and T5-XXL) and the VAE on top.
## VRAM requirements
| Variant | Transformer VRAM | Practical floor |
|---|---|---|
| FLUX.1 [dev] FP16 | ~24 GB | 24 GB GPU (tight) |
| FLUX.1 [dev] FP8 | ~12 GB | 16 GB GPU |
| FLUX.1 [schnell] FP16 | ~24 GB | 24 GB GPU |
| FLUX.1 [schnell] NF4 | ~7 GB | 12 GB GPU |
If you only have 12 GB of VRAM, you're in NF4-quantized territory. Quality is surprisingly good, but you do lose some prompt adherence on complex compositions.
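If you want to try the NF4 route, here's one way to do it with diffusers' bitsandbytes integration. Treat this as a sketch, not a tested recipe: it assumes a recent diffusers with `bitsandbytes` installed, and the exact VRAM use depends on resolution and offload settings.

```python
# NF4 sketch: quantize only the 12B transformer to 4-bit so the pipeline
# fits on a ~12 GB card; text encoders and VAE stay in bf16.
import torch
from diffusers import BitsAndBytesConfig, FluxPipeline, FluxTransformer2DModel

model_id = "black-forest-labs/FLUX.1-schnell"

nf4 = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
transformer = FluxTransformer2DModel.from_pretrained(
    model_id, subfolder="transformer",
    quantization_config=nf4, torch_dtype=torch.bfloat16,
)
pipe = FluxPipeline.from_pretrained(
    model_id, transformer=transformer, torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()  # park idle components (T5, CLIP, VAE) on CPU

image = pipe(
    "a lighthouse at dusk, oil painting",
    height=1024, width=1024,
    num_inference_steps=4,   # [schnell] is distilled for ~4 steps
    guidance_scale=0.0,      # [schnell] doesn't use CFG
).images[0]
image.save("lighthouse.png")
```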
## Throughput (1024×1024, 20 steps, batch 1)
- RTX 4060 Ti 16GB: ~22s per image (FP8)
- RTX 4090 24GB: ~6.5s per image (FP16)
- L40S 48GB: ~5.8s per image (FP16)
- A100 80GB: ~5.0s per image (FP16)
- H100 80GB: ~2.4s per image (FP16)
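Numbers like these are easy to sanity-check on your own hardware. A rough timing harness, assuming a `pipe` loaded as in the sketch above; your results will shift with driver, attention backend, and torch version:

```python
# Rough throughput harness: one warm-up run, then average N timed runs.
import time
import torch

def seconds_per_image(pipe, prompt, steps=20, n=5):
    pipe(prompt, height=1024, width=1024, num_inference_steps=steps)  # warm-up
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(n):
        pipe(prompt, height=1024, width=1024, num_inference_steps=steps)
    torch.cuda.synchronize()
    return (time.perf_counter() - t0) / n

print(f"{seconds_per_image(pipe, 'a red bicycle'):.1f}s per image")
```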
## Cost per 1,000 images (FP16, batch 1, no batching gains)
At each card's measured speed from the throughput list, 1,000 images takes anywhere from ~0.7 GPU-hours (H100 at 2.4s/image) to ~1.8 GPU-hours (4090 at 6.5s/image). The arithmetic is sketched after the table.
| GPU | Hourly | Time per 1,000 imgs | Cost per 1,000 imgs |
|---|---|---|---|
| RTX 4090 (RunPod Community) | $0.34 | ~1.8 h | $0.61 |
| L40S | $1.10 | ~1.6 h | $1.77 |
| A100 80GB | $1.19 | ~1.4 h | $1.65 |
| H100 80GB | $2.99 | ~0.7 h | $1.99 |
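The arithmetic behind the table is just seconds-per-image times the hourly rate:

```python
# Cost per 1,000 images = seconds per image / 3600 * hourly rate * 1000.
def cost_per_1k(sec_per_image: float, hourly_usd: float) -> float:
    return sec_per_image / 3600 * hourly_usd * 1000

print(f"${cost_per_1k(6.5, 0.34):.2f}")  # RTX 4090 -> $0.61
print(f"${cost_per_1k(2.4, 2.99):.2f}")  # H100     -> $1.99
```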
The RTX 4090 is the cost king for FLUX if you don't need a datacenter SLA. Even against an H100 it wins by ~3× on cost per image, despite being ~2.7× slower per image.
## Recommendation
- Tinkering: 4090 on RunPod Community or Vast.ai.
- Public-facing API: L40S in a Tier-3 datacenter — good throughput, predictable uptime, decent price.
- High volume / latency-sensitive: H100 with FP8 + batching, ~6× the throughput per card; one possible setup is sketched below.
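A sketch of the FP8-plus-batching path using optimum-quanto (torchao is an alternative FP8 route). Assumptions to flag: batch size 4 is a guess to tune against your VRAM headroom and latency budget, not a benchmark result.

```python
# FP8 + batching sketch for a high-volume deployment.
import torch
from diffusers import FluxPipeline
from optimum.quanto import freeze, qfloat8, quantize

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16
)
quantize(pipe.transformer, weights=qfloat8)  # FP8 weights, bf16 activations
freeze(pipe.transformer)
pipe.to("cuda")

prompts = [
    "a red bicycle", "a lighthouse at dusk",
    "a bowl of ramen", "a snowy mountain pass",
]
images = pipe(
    prompt=prompts,          # one batched forward pass over 4 prompts
    height=1024, width=1024,
    num_inference_steps=20,
    guidance_scale=3.5,      # typical for FLUX.1 [dev]
).images
for i, img in enumerate(images):
    img.save(f"out_{i}.png")
```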