Benchmarking LLM Inference on RTX 4090, RTX 5090, and RTX PRO 6000
Curious to know which GPU to buy or rent to run your LLM inference? 1x, 2x, and 4x RTX 4090, RTX 5090, and RTX PRO 6000 are the most affordable, yet capable builds. We ran a series of benchmarks across multiple GPU cloud servers to evaluate their performance for LLM workloads, specifically serving LLaMA and Qwen models using the vLLM inference engine. This article explains how we tested, what we measured, and what insights we gained by running these models on different GPU configurations.
What Did We Measure?
LLM workloads are not just about raw FLOPS. When serving models in production—especially in interactive, multi-turn settings like chat—you care about:
- Model loading times
- Download speed of Hugging Face models
- Token latency metrics:
  - TTFT (Time to First Token)
  - TPOT (Time per Output Token)
  - ITL (Inter-Token Latency)
  - E2EL (End-to-End Latency)
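For a quick sanity check outside of any benchmark harness, TTFT and E2EL can be roughly approximated with a single streaming request against the OpenAI-compatible endpoint. The sketch below is illustrative only and assumes a vLLM server is already listening on localhost:8000 with the Qwen3-Coder model loaded; curl's time_starttransfer roughly corresponds to TTFT for a streamed response.

```bash
# Rough single-request latency check against an OpenAI-compatible endpoint.
# Assumes vLLM is already serving the model on localhost:8000 (an illustrative
# setup, not the benchmark harness itself).
curl -s -N -o /dev/null \
  -w "approx TTFT (s): %{time_starttransfer}\napprox E2EL (s): %{time_total}\n" \
  http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen3-Coder-30B-A3B-Instruct",
        "prompt": "Write a haiku about GPUs.",
        "max_tokens": 128,
        "stream": true
      }'
```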
What Did We Test?
We created a comprehensive benchmark script that automates the following steps:
- Run System Benchmark: The script starts with YABS, a popular hardware test suite, to capture CPU, memory, disk, and network performance.
- Download the Model: We simulate production readiness by downloading models from Hugging Face, measuring both the time taken and the average download speed.
- Launch vLLM Container: We spin up a Docker container using `vllm/vllm-openai:latest`, bind-mounting the downloaded model directory. The container exposes an OpenAI-compatible API endpoint (a sketch of such a launch command follows this list).
- Run Inference Benchmark: Finally, the script runs `benchmark_serving.py` inside the container to simulate multi-request, high-concurrency LLM usage with synthetic inputs. We use the Qwen/Qwen3-Coder-30B-A3B-Instruct model with tensor parallelism set to the number of GPUs on the machine.
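For reference, a minimal launch command for such a container looks roughly like the sketch below. The model path, port, and tensor-parallel size are illustrative; the exact flags used by our benchmark script may differ.

```bash
# Minimal sketch of launching the vLLM OpenAI-compatible server in Docker.
# Paths, port, and --tensor-parallel-size are examples; adjust to your machine.
docker run --rm --gpus all --ipc=host \
  -p 8000:8000 \
  -v /opt/models/Qwen3-Coder-30B-A3B-Instruct:/models/Qwen3-Coder-30B-A3B-Instruct \
  vllm/vllm-openai:latest \
  --model /models/Qwen3-Coder-30B-A3B-Instruct \
  --tensor-parallel-size 4 \
  --port 8000
```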
What Models and Configs Were Used?
- Model: `Qwen/Qwen3-Coder-30B-A3B-Instruct`
- Serving: vLLM + OpenAI API-compatible interface
- Command Parameters:
  - Input/output tokens: 1000
  - Concurrency: 200
  - Prompts: 1000
- Metrics: `ttft,tpot,itl,e2el`
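Put together, the invocation looks roughly like the sketch below, using flag names from vLLM's `benchmark_serving.py`. Flag names and defaults can vary between vLLM versions, so treat this as an approximation of our setup rather than the exact command from the script.

```bash
# Approximate benchmark_serving.py invocation matching the parameters above.
# Flag names follow vLLM's benchmarks/benchmark_serving.py and may differ by version.
python3 benchmark_serving.py \
  --backend openai \
  --base-url http://localhost:8000 \
  --model Qwen/Qwen3-Coder-30B-A3B-Instruct \
  --dataset-name random \
  --random-input-len 1000 \
  --random-output-len 1000 \
  --num-prompts 1000 \
  --max-concurrency 200 \
  --percentile-metrics ttft,tpot,itl,e2el
```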
Importance of Driver Versions
During our benchmarking of the NVIDIA RTX 5090 and RTX PRO 6000, we observed a huge performance discrepancy between driver versions: with the older driver 570.86.15, inference performance on the RTX 5090 was comparable to that of the RTX 4090; after upgrading to driver 575.57.08, we saw significant gains across all vLLM benchmarks.
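Before benchmarking Blackwell-generation cards, it is worth confirming which driver is actually installed; a quick check with nvidia-smi is enough.

```bash
# Print the installed driver version and GPU names before running benchmarks.
nvidia-smi --query-gpu=driver_version,name --format=csv,noheader
```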
Hardware Tested
We ran the benchmark across several server configurations: 4 x RTX 4090, 4 x RTX 5090, 1 x RTX PRO 6000, and 2 x RTX PRO 6000. These setups are popular among self-hosting enthusiasts, so we wanted to find out which one is the most cost-efficient.
Results
Throughput
Most providers we've worked with offer 10 Gbps internet and generally deliver that speed, with occasional drops due to network congestion. Download speed to distant servers is lower, so renting a server that is far away from your cloud storage is not the best idea.
System | CPU (Cores @ GHz) | RAM | Disk Perf (Max Total R/W) | Network Perf (Best Send/Recv) | Geekbench 6 (Single / Multi) | HF DL Speed (MiB/s) | Cost (per hour) |
---|---|---|---|---|---|---|---|
Neuralrack 4090x4 | EPYC 7662 (32c @ 2.0) | 196 GB | 5.96 GB/s | 8.57 Gbps / 5.73 Gbps (NYC) | 1221 / 10,339 | 6326.38 | $1.56 |
EasyCompute 4090x4 | EPYC 7702 (60c @ 2.0) | 315 GB | 4.96 GB/s | 3.27 Gbps / 3.38 Gbps (London) | 1327 / 8,982 | 6194.80 | $1.56 |
EasyCompute 5090x4 | EPYC 7702 (60c @ 2.0) | 315 GB | 4.84 GB/s | 3.47 Gbps / 4.69 Gbps (London) | 1319 / 9,091 | 5070.27 | $2.60 |
Neuralrack RTX 6000x1 | EPYC 9374F (7c @ 3.85) | 118 GB | 8.79 GB/s | 5.84 Gbps / 5.73 Gbps (NYC) | 1825 / 8,097 | 6875.86 | $1.29 |
Neuralrack RTX 6000x2 | EPYC 9374F (14c @ 3.85) | 236 GB | 7.95 GB/s | 9.07 Gbps / 4.35 Gbps (NYC) | 1826 / 10,672 | 7504.25 | $2.58 |
Key Takeaways
- Startup time varies significantly based on disk speed and CPU-GPU coordination. NVMe-backed storage with fast CPUs helped reduce the wait time before inference.
- Model download speed can be a limiting factor if your storage or bandwidth is subpar. In some cases, setting `HF_HUB_ENABLE_HF_TRANSFER=1` gave 2–3× faster downloads from Hugging Face (see the sketch after this list).
- Token generation latency (especially TTFT) can vary even across servers with similar GPUs, due to differences in backend configuration and memory bandwidth.
- 4090s perform well for the cost, especially for smaller models like Qwen-3B or LLaMA-8B. However, for larger models or batch inference, the PRO 6000 is a clear winner. Even on the small `Qwen/Qwen3-Coder-30B-A3B-Instruct` model used in this test, a single PRO 6000 is faster than four 4090s and four 5090s. The prefill-decode disaggregation technique I described in the previous article can reduce the amount of data transferred over the PCIe bus, which is the primary performance bottleneck of low-VRAM GPUs running larger models. However, in the vast majority of cases, the PRO 6000 will be the better option.
How to Run This Yourself
Clone the repository:
git clone https://github.com/cloudrift-ai/server-benchmark.git
cd server-benchmark
Install dependencies:
./scripts/setup.sh
Run the benchmark:
./scripts/run_benchmarks.sh
GitHub Repository
You can find the code in the server-benchmark repository on GitHub (https://github.com/cloudrift-ai/server-benchmark). The model and other parameters are easily customizable if you want to run it yourself. Feel free to let me know on Discord or in the comments which configuration or model you'd like to see benchmarked next!