GPU Benchmarks for LLM Inference and Training
Real throughput numbers, cost per million tokens, and reproducible recipes — captured on CloudRift hardware. Covers RTX PRO 6000 Blackwell benchmarks, the 4090 vs 5090 benchmark, and B200/H200/H100 datacenter GPU comparisons.
LLM GPU Benchmarks at a Glance
vLLM throughput on Qwen3-Coder-30B (AWQ, 8K context, FP8 KV-cache) — identical host configurations across GPUs. Sourced from the RTX 4090 vs 5090 vs PRO 6000 LLM benchmark.
| Configuration | Throughput | Cost / Hour | Cost / 1M Tokens |
|---|---|---|---|
| 1× RTX 4090 | 2,259 tok/s | $0.39 | $0.048 |
| 1× RTX 5090 | 4,570 tok/s | $0.65 | $0.040 |
| 1× RTX PRO 6000 | 8,425 tok/s | $1.20 | $0.040 |
| 4× RTX 4090 | 8,903 tok/s | $1.56 | $0.049 |
| 4× RTX 5090 | 12,744 tok/s | $2.60 | $0.057 |
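The cost column is simple arithmetic over throughput and the on-demand hourly rate. A minimal sketch that reproduces it from the table's own numbers:

```python
# Cost per million tokens = hourly rate / (millions of tokens generated per hour).
def cost_per_million_tokens(tok_per_sec: float, usd_per_hour: float) -> float:
    tokens_per_hour = tok_per_sec * 3600
    return usd_per_hour / (tokens_per_hour / 1_000_000)

# Rows taken from the table above: (configuration, tok/s, $/hr).
rows = [
    ("1x RTX 4090",     2259,  0.39),
    ("1x RTX 5090",     4570,  0.65),
    ("1x RTX PRO 6000", 8425,  1.20),
    ("4x RTX 4090",     8903,  1.56),
    ("4x RTX 5090",    12744,  2.60),
]
for name, tps, rate in rows:
    print(f"{name}: ${cost_per_million_tokens(tps, rate):.3f} / 1M tokens")
# e.g. 1x RTX 4090 -> $0.048, 1x RTX 5090 -> $0.040
```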
In-Depth GPU Benchmark Articles

Blackwell Dominates: Benchmarking LLM Inference on NVIDIA B200, H200, H100, and RTX PRO 6000
Long-context LLM inference (8K input + 8K output) across NVIDIA B200, H200, H100, and RTX PRO 6000. B200 delivers up to 4.9× the throughput of RTX PRO 6000 and is now the cost-efficiency leader.

RTX 4090 vs RTX 5090 vs RTX PRO 6000: Comprehensive LLM Inference Benchmark
Throughput numbers across RTX 4090, RTX 5090, and RTX PRO 6000 in 1×, 2×, and 4× configurations using vLLM. Reveals the cost-efficiency winner across three model sizes.

RTX PRO 6000 vs Datacenter GPUs: Is the new RTX an H100 killer?
A single PRO 6000 beats the H100 at 28% lower cost per token, but NVLink-equipped datacenter GPUs reclaim the lead by 3–4× on large models that require 8-way tensor parallelism.
Benchmarks by GPU Model
RTX PRO 6000 Benchmarks
Single-GPU throughput reaches 8,425 tok/s on Qwen3-Coder-30B — 3.7× the RTX 4090 and 1.8× the RTX 5090. Beats the H100 on cost per token at single-GPU scale. Full vs H100/H200 comparison on the product page.
RTX 5090 Benchmarks
Single-GPU throughput hits 4,570 tok/s on Qwen3-Coder-30B — about 2× the RTX 4090 — at $0.040 per million tokens. 4× 5090 reaches 12,744 tok/s via replica parallelism, with 128 GB combined VRAM enabling larger batches.
RTX 4090 Benchmarks
Wins on raw hourly cost ($0.39/hr). Hits 2,259 tok/s on Qwen3-Coder-30B; 4× configuration scales to 8,903 tok/s. The right pick when the budget is fixed and 24 GB of VRAM fits the workload.
B200, H200, H100 Benchmarks
Blackwell B200 is the new throughput-per-dollar leader, delivering up to 4.9× the RTX PRO 6000 on long-context LLM inference. H200 still wins on memory bandwidth (4.8 TB/s) for the largest models.
How We Benchmark GPUs
Every benchmark in this hub uses real production-style serving stacks — primarily vLLM and SGLang — on CloudRift rental hardware that anyone can replicate. We focus on maximum throughput for high-concurrency LLM serving (the workload most enterprise teams care about), not synthetic peak FLOPS.
Models are quantized (AWQ INT4 or FP8) to keep comparisons fair across GPUs with different VRAM capacities. Multi-GPU runs use pipeline parallelism unless VRAM headroom specifically motivates tensor parallelism. The KV-cache runs in FP8, and the context length is 8K unless stated otherwise. Each benchmark post lists the exact flags and model checkpoint.
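As a rough sketch of what such a run looks like (the checkpoint name below is a placeholder, and the authoritative flags live in each benchmark post), a single-GPU vLLM setup might be:

```python
from vllm import LLM, SamplingParams

# Illustrative single-GPU configuration mirroring the methodology above.
# The model ID is a placeholder; each blog post lists the exact checkpoint.
llm = LLM(
    model="Qwen/Qwen3-Coder-30B-AWQ",  # placeholder AWQ checkpoint
    quantization="awq",                # AWQ INT4 weights
    kv_cache_dtype="fp8",              # FP8 KV-cache, as in all runs here
    max_model_len=8192,                # 8K context unless stated otherwise
    # pipeline_parallel_size=4,        # multi-GPU runs default to pipeline parallelism
)
params = SamplingParams(max_tokens=512, temperature=0.0)
outputs = llm.generate(["Write a function that parses a CSV file."], params)
```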
Throughput numbers are wall-clock tokens per second sustained over a full benchmark sweep, not single-batch peak. Cost per million tokens is computed against the on-demand hourly rate — reserved pricing brings these numbers down further.
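A hypothetical measurement loop in the same spirit: submit enough concurrent prompts to saturate the batch, then divide total generated tokens by wall-clock time:

```python
import time
from vllm import LLM, SamplingParams

# Same placeholder setup as the sketch above.
llm = LLM(model="Qwen/Qwen3-Coder-30B-AWQ", quantization="awq",
          kv_cache_dtype="fp8", max_model_len=8192)

prompts = ["Summarize the benchmark methodology."] * 256  # enough concurrency to saturate
start = time.perf_counter()
outputs = llm.generate(prompts, SamplingParams(max_tokens=1024, temperature=0.0))
elapsed = time.perf_counter() - start

# Sustained throughput: total output tokens over wall-clock time, not single-batch peak.
total_tokens = sum(len(o.outputs[0].token_ids) for o in outputs)
print(f"Sustained throughput: {total_tokens / elapsed:,.0f} output tok/s")
```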
Context length is typically 8K; the long-context B200 benchmark uses 16K (8K input + 8K output) to stress KV-cache traffic and surface real differences in memory-bandwidth behavior across H100, H200, and B200.
Ready to Benchmark on Real Hardware?
Spin up RTX 4090, RTX 5090, RTX PRO 6000, H100, H200, or B200 instances on CloudRift and run your own throughput tests. Most benchmarks here were captured on standard rentals.