GPUs Tested: B200 · H200 · H100 · RTX PRO 6000 · RTX 5090 · RTX 4090

GPU Benchmarks for LLM Inference and Training

Real throughput numbers, cost per million tokens, and reproducible recipes — captured on CloudRift hardware. Covers RTX PRO 6000 Blackwell benchmarks, the 4090 vs 5090 benchmark, and B200/H200/H100 datacenter GPU comparisons.

Quick Reference

LLM GPU Benchmarks at a Glance

vLLM throughput on Qwen3-Coder-30B (AWQ, 8K context, FP8 KV-cache) — identical host configurations across GPUs. Sourced from the RTX 4090 vs 5090 vs PRO 6000 LLM benchmark.

| Configuration | Throughput | Cost / Hour | Cost / 1M Tokens |
|---|---|---|---|
| 1× RTX 4090 | 2,259 tok/s | $0.39 | $0.048 |
| 1× RTX 5090 | 4,570 tok/s | $0.65 | $0.040 |
| 1× RTX PRO 6000 | 8,425 tok/s | $1.20 | $0.040 |
| 4× RTX 4090 | 8,903 tok/s | $1.56 | $0.049 |
| 4× RTX 5090 | 12,744 tok/s | $2.60 | $0.057 |

Methodology

How We Benchmark GPUs

Every benchmark in this hub uses real production-style serving stacks — primarily vLLM and SGLang — on CloudRift rental hardware that anyone can replicate. We focus on maximum throughput for high-concurrency LLM serving (the workload most enterprise teams care about), not synthetic peak FLOPS.

Models are quantized (AWQ INT4 or FP8) so that GPUs with different VRAM capacities can be compared fairly. Multi-GPU runs use pipeline parallelism unless tensor parallelism is explicitly motivated by VRAM headroom. The KV-cache is FP8 and the context length is 8K unless stated otherwise. Each benchmark blog post lists the exact flags and model checkpoint.
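
As a rough sketch, here is what that recipe looks like with vLLM's offline API. The checkpoint ID and sampling settings below are illustrative, not the exact harness; each benchmark post lists the precise configuration.

```python
# Minimal sketch of the serving recipe described above (illustrative, not the
# exact benchmark harness).
from vllm import LLM, SamplingParams

llm = LLM(
    model="Qwen/Qwen3-Coder-30B-A3B-Instruct",  # swap in the AWQ checkpoint named in the post
    quantization="awq",        # AWQ INT4 weights
    kv_cache_dtype="fp8",      # FP8 KV-cache, as in all runs here
    max_model_len=8192,        # 8K context unless stated otherwise
    # pipeline_parallel_size=4,  # multi-GPU runs default to pipeline parallelism
)

params = SamplingParams(temperature=0.8, max_tokens=512)
outputs = llm.generate(["Write a quicksort function in Python."], params)
print(outputs[0].outputs[0].text)
```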

Throughput numbers are wall-clock tokens per second sustained over a full benchmark sweep, not single-batch peak. Cost per million tokens is computed against the on-demand hourly rate — reserved pricing brings these numbers down further.
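
The cost figure is straightforward to reproduce from the quick-reference table above; a quick sanity check in Python:

```python
def cost_per_million_tokens(hourly_rate_usd: float, tokens_per_sec: float) -> float:
    """USD to generate one million tokens at a sustained throughput."""
    tokens_per_hour = tokens_per_sec * 3600
    return hourly_rate_usd / tokens_per_hour * 1_000_000

# Figures from the quick-reference table above.
print(f"{cost_per_million_tokens(0.39, 2259):.3f}")  # 1x RTX 4090     -> 0.048
print(f"{cost_per_million_tokens(1.20, 8425):.3f}")  # 1x RTX PRO 6000 -> 0.040
```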

Context length is typically 8K; the long-context B200 benchmark uses 16K (8K input + 8K output) to stress KV-cache traffic and surface real differences in memory-bandwidth behavior across H100, H200, and B200.
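
For the long-context runs, the only changes from the 8K recipe sketched earlier are the context window and the per-request output budget. A sketch, with the checkpoint placeholder and greedy decoding as assumptions:

```python
from vllm import LLM, SamplingParams

# Long-context variant: 16K window instead of the default 8K recipe above.
llm = LLM(
    model="checkpoint-from-the-b200-post",  # placeholder; see the benchmark post
    kv_cache_dtype="fp8",
    max_model_len=16384,  # 8K-token input + 8K-token output
)

sweep_params = SamplingParams(
    max_tokens=8192,      # full 8K output budget per request
    temperature=0.0,      # greedy decoding assumed for repeatable sweeps
)
```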

Benchmark FAQ

Common Questions About GPU Benchmarks

Which GPU is best for LLM inference?
For frontier-scale workloads, the NVIDIA B200 is the new throughput-per-dollar leader for long-context inference. For mid-scale enterprise inference, the RTX PRO 6000 Blackwell beats the H100 on cost per token at single-GPU scale. For high-concurrency serving on a budget, the RTX 5090 delivers about 2× the throughput of the RTX 4090 with a lower cost per million tokens.

Where can I compare the RTX PRO 6000 against other GPUs?
Two benchmarks above are most relevant: the RTX PRO 6000 vs datacenter GPUs benchmark (vs H100/H200) and the RTX 4090 vs RTX 5090 vs RTX PRO 6000 benchmark. Both show single-GPU and multi-GPU throughput on real models served via vLLM.

Is the RTX 5090 worth it over the RTX 4090 for LLM inference?
On Qwen3-Coder-30B (AWQ) using vLLM, a single RTX 5090 delivers 4,570 tokens/s versus 2,259 tokens/s on a single RTX 4090, about 2× the throughput. The 5090 costs more per hour ($0.65 vs $0.39), but its higher throughput results in a lower cost per million tokens ($0.040 vs $0.048).

Can I reproduce these benchmarks?
Yes. Each benchmark blog post includes the model, quantization, vLLM/SGLang flags, KV-cache settings, and parallelism strategy. All benchmarks were run on standard CloudRift hardware that you can rent and replicate.

How do I run my own benchmark?
Open the console, launch a GPU instance with the pre-built vLLM image, and run your workload. Most benchmarks in this hub were captured on standard rentals with no special setup required.
Run your own

Ready to Benchmark on Real Hardware?

Spin up RTX 4090, RTX 5090, RTX PRO 6000, H100, H200, or B200 instances on CloudRift and run your own throughput tests. Most benchmarks here were captured on standard rentals.