GPUs Tested: B200 · H200 · H100 · RTX PRO 6000 · RTX 5090 · RTX 4090

GPU Benchmarks for LLM Inference and Training

Real throughput numbers, cost per million tokens, and reproducible recipes — captured on CloudRift hardware. Covers RTX PRO 6000 Blackwell benchmarks, the 4090 vs 5090 benchmark, and B200/H200/H100 datacenter GPU comparisons.

Quick Reference

LLM GPU Benchmarks at a Glance

vLLM throughput on Qwen3-Coder-30B (AWQ, 8K context, FP8 KV-cache) — identical host configurations across GPUs. Sourced from the RTX 4090 vs 5090 vs PRO 6000 LLM benchmark.

| Configuration | Throughput | Cost / Hour | Cost / 1M Tokens |
|---|---|---|---|
| 1× RTX 4090 | 2,259 tok/s | $0.39 | $0.048 |
| 1× RTX 5090 | 4,570 tok/s | $0.65 | $0.040 |
| 1× RTX PRO 6000 | 8,425 tok/s | $1.20 | $0.040 |
| 4× RTX 4090 | 8,903 tok/s | $1.56 | $0.049 |
| 4× RTX 5090 | 12,744 tok/s | $2.60 | $0.057 |

Methodology

How We Benchmark GPUs

Every benchmark in this hub uses real production-style serving stacks — primarily vLLM and SGLang — on CloudRift rental hardware that anyone can replicate. We focus on maximum throughput for high-concurrency LLM serving (the workload most enterprise teams care about), not synthetic peak FLOPS.

Models are quantized (AWQ INT4 or FP8) so that GPUs with different VRAM capacities can be compared fairly. Multi-GPU runs use pipeline parallelism unless tensor parallelism is explicitly motivated by VRAM headroom. The KV-cache is FP8 and the context length is 8K unless stated otherwise. Each benchmark blog post lists the exact flags and model checkpoint.
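
As a rough sketch, here is what that recipe looks like with vLLM's offline API. The checkpoint ID and sampling settings below are illustrative, not the exact harness; each benchmark post lists the precise configuration.

```python
# Minimal sketch of the serving recipe described above (illustrative, not the
# exact benchmark harness).
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-Coder-30B-A3B-Instruct",  # swap in the AWQ checkpoint named in the post
    quantization="awq",        # AWQ INT4 weights
    kv_cache_dtype="fp8",      # FP8 KV-cache, as in all runs here
    max_model_len=8192,        # 8K context unless stated otherwise
    # pipeline_parallel_size=4,  # multi-GPU runs default to pipeline parallelism
)

params = SamplingParams(temperature=0.8, max_tokens=512)
outputs = llm.generate(["Write a quicksort function in Python."], params)
print(outputs[0].outputs[0].text)
```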

Throughput numbers are wall-clock tokens per second sustained over a full benchmark sweep, not single-batch peak. Cost per million tokens is computed against the on-demand hourly rate — reserved pricing brings these numbers down further.
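
The cost figure is straightforward to reproduce from the quick-reference table above; a quick sanity check in Python:

```python
def cost_per_million_tokens(hourly_rate_usd: float, tokens_per_sec: float) -> float:
    """USD to generate one million tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_sec * 3600
    return hourly_rate_usd / tokens_per_hour * 1_000_000

# Figures from the quick-reference table above.
print(f"{cost_per_million_tokens(0.39, 2259):.3f}")  # 1x RTX 4090     -> 0.048
print(f"{cost_per_million_tokens(1.20, 8425):.3f}")  # 1x RTX PRO 6000 -> 0.040
```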

Context length is typically 8K; the long-context B200 benchmark uses 16K (8K input + 8K output) to stress KV-cache traffic and surface real differences in memory-bandwidth behavior across H100, H200, and B200.
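
For the long-context runs, the only changes from the 8K recipe sketched earlier are the context window and the per-request output budget. A sketch, with the checkpoint placeholder and greedy decoding as assumptions:

```python
from vllm import LLM, SamplingParams

# Long-context variant: 16K window instead of the default 8K recipe above.
llm = LLM(
    model="checkpoint-from-the-b200-post",  # placeholder; see the benchmark post
    kv_cache_dtype="fp8",
    max_model_len=16384,  # 8K-token input + 8K-token output
)

sweep_params = SamplingParams(
    max_tokens=8192,      # full 8K output budget per request
    temperature=0.0,      # greedy decoding assumed for repeatable sweeps
)
```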

Benchmark FAQ

Common Questions About GPU Benchmarks

Which GPU is best for LLM inference?
For frontier-scale workloads, the NVIDIA B200 is the new throughput-per-dollar leader for long-context inference. For mid-scale enterprise inference, the RTX PRO 6000 Blackwell beats the H100 on cost per token at single-GPU scale. For high-concurrency serving on a budget, the RTX 5090 delivers about 2× the throughput of the RTX 4090 with a lower cost per million tokens.

Where can I compare the RTX PRO 6000 against other GPUs?
Two benchmarks above are most relevant: the RTX PRO 6000 vs datacenter GPUs benchmark (vs H100/H200) and the RTX 4090 vs RTX 5090 vs RTX PRO 6000 benchmark. Both show single-GPU and multi-GPU throughput on real models served via vLLM.

Is the RTX 5090 worth it over the RTX 4090 for LLM inference?
On Qwen3-Coder-30B (AWQ) using vLLM, a single RTX 5090 delivers 4,570 tokens/s versus 2,259 tokens/s on a single RTX 4090, about 2× the throughput. The 5090 costs more per hour ($0.65 vs $0.39), but its higher throughput results in a lower cost per million tokens ($0.040 vs $0.048).

Can I reproduce these benchmarks?
Yes. Each benchmark blog post includes the model, quantization, vLLM/SGLang flags, KV-cache settings, and parallelism strategy. All benchmarks were run on standard CloudRift hardware that you can rent and replicate.

How do I run my own benchmark?
Open the console, launch a GPU instance with the pre-built vLLM image, and run your workload. Most benchmarks in this hub were captured on standard rentals with no special setup required.
Run your own

Ready to Benchmark on Real Hardware?

Spin up RTX 4090, RTX 5090, RTX PRO 6000, H100, H200, or B200 instances on CloudRift and run your own throughput tests. Most benchmarks here were captured on standard rentals.