Benchmarking LLM Inference on RTX 4090, RTX 5090, and RTX PRO 6000
Curious to know which GPU to buy or rent to run your LLM inference? 1x, 2x, and 4x RTX 4090, RTX 5090, and RTX PRO 6000 are the most affordable, yet capable builds. We ran a series of benchmarks across multiple GPU cloud servers to evaluate their performance for LLM workloads, specifically serving LLaMA and Qwen models using the vLLM inference engine. This article explains how we tested, what we measured, and what insights we gained by running these models on different GPU configurations.
What Did We Measure?
LLM workloads are not just about raw FLOPS. When serving models in production—especially in interactive, multi-turn settings like chat—you care about:
- Model loading times
- Download speed of Hugging Face models
- Token latency metrics:
  - TTFT (Time to First Token)
  - TPOT (Time per Output Token)
  - ITL (Inter-Token Latency)
  - E2EL (End-to-End Latency)
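For a quick sanity check outside of any benchmark harness, TTFT and E2EL can be roughly approximated with a single streaming request against the OpenAI-compatible endpoint. The sketch below is illustrative only and assumes a vLLM server is already listening on localhost:8000 with the Qwen3-Coder model loaded; curl's time_starttransfer roughly corresponds to TTFT for a streamed response.

```bash
# Rough single-request latency check against an OpenAI-compatible endpoint.
# Assumes vLLM is already serving the model on localhost:8000 (an illustrative
# setup, not the benchmark harness itself).
curl -s -N -o /dev/null \
  -w "approx TTFT (s): %{time_starttransfer}\napprox E2EL (s): %{time_total}\n" \
  http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen3-Coder-30B-A3B-Instruct",
        "prompt": "Write a haiku about GPUs.",
        "max_tokens": 128,
        "stream": true
      }'
```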
What Did We Test?
We created a comprehensive benchmark script that automates the following steps:
- Run System Benchmark: The script starts with YABS, a popular hardware test suite, to capture CPU, memory, disk, and network performance.
- Download the Model: We simulate production readiness by downloading models from Hugging Face, measuring both the time taken and the average download speed.
- Launch vLLM Container: We spin up a Docker container using `vllm/vllm-openai:latest`, bind-mounting the downloaded model directory. The container exposes an OpenAI-compatible API endpoint (a sketch of such a launch command follows this list).
- Run Inference Benchmark: Finally, the script runs `benchmark_serving.py` inside the container to simulate multi-request, high-concurrency LLM usage with synthetic inputs. We use the Qwen/Qwen3-Coder-30B-A3B-Instruct model with tensor parallelism set to the number of GPUs on the machine.
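For reference, a minimal launch command for such a container looks roughly like the sketch below. The model path, port, and tensor-parallel size are illustrative; the exact flags used by our benchmark script may differ.

```bash
# Minimal sketch of launching the vLLM OpenAI-compatible server in Docker.
# Paths, port, and --tensor-parallel-size are examples; adjust to your machine.
docker run --rm --gpus all --ipc=host \
  -p 8000:8000 \
  -v /opt/models/Qwen3-Coder-30B-A3B-Instruct:/models/Qwen3-Coder-30B-A3B-Instruct \
  vllm/vllm-openai:latest \
  --model /models/Qwen3-Coder-30B-A3B-Instruct \
  --tensor-parallel-size 4 \
  --port 8000
```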
What Models and Configs Were Used?
- Model: `Qwen/Qwen3-Coder-30B-A3B-Instruct`
- Serving: vLLM + OpenAI API-compatible interface
- Command Parameters:
  - Input/output tokens: 1000
  - Concurrency: 200
  - Prompts: 1000
- Metrics: `ttft,tpot,itl,e2el`
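Put together, the invocation looks roughly like the sketch below, using flag names from vLLM's `benchmark_serving.py`. Flag names and defaults can vary between vLLM versions, so treat this as an approximation of our setup rather than the exact command from the script.

```bash
# Approximate benchmark_serving.py invocation matching the parameters above.
# Flag names follow vLLM's benchmarks/benchmark_serving.py and may differ by version.
python3 benchmark_serving.py \
  --backend openai \
  --base-url http://localhost:8000 \
  --model Qwen/Qwen3-Coder-30B-A3B-Instruct \
  --dataset-name random \
  --random-input-len 1000 \
  --random-output-len 1000 \
  --num-prompts 1000 \
  --max-concurrency 200 \
  --percentile-metrics ttft,tpot,itl,e2el
```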
Importance of Driver Versions
During our benchmarking of the NVIDIA RTX 5090 and RTX PRO 6000, we observed a huge performance discrepancy between driver versions: with the older driver 570.86.15, inference performance on the RTX 5090 was comparable to that of the RTX 4090; after upgrading to driver 575.57.08, we saw significant gains across all vLLM benchmarks.
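Before benchmarking Blackwell-generation cards, it is worth confirming which driver is actually installed; a quick check with nvidia-smi is enough.

```bash
# Print the installed driver version and GPU names before running benchmarks.
nvidia-smi --query-gpu=driver_version,name --format=csv,noheader
```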
Hardware Tested
We ran the benchmark across several server configurations: 4 x RTX 4090, 4 x RTX 5090, 1 x RTX PRO 6000, and 2 x RTX PRO 6000. These setups are popular among self-hosting enthusiasts, so we wanted to find out which one is the most cost-efficient.
Results
Throughput
Most providers we've worked with offer 10 Gbps internet and generally deliver that speed, with occasional drops due to network congestion. Download speed to distant servers is lower, so renting a server that is far away from your cloud storage is not the best idea.
System | CPU (Cores @ GHz) | RAM | Disk Perf (Max Total R/W) | Network Perf (Best Send/Recv) | Geekbench 6 (Single / Multi) | HF DL Speed (MiB/s) | Cost (per hour) |
---|---|---|---|---|---|---|---|
Neuralrack 4090x4 | EPYC 7662 (32c @ 2.0) | 196 GB | 5.96 GB/s | 8.57 Gbps / 5.73 Gbps (NYC) | 1221 / 10,339 | 6326.38 | $1.56 |
EasyCompute 4090x4 | EPYC 7702 (60c @ 2.0) | 315 GB | 4.96 GB/s | 3.27 Gbps / 3.38 Gbps (London) | 1327 / 8,982 | 6194.80 | $1.56 |
EasyCompute 5090x4 | EPYC 7702 (60c @ 2.0) | 315 GB | 4.84 GB/s | 3.47 Gbps / 4.69 Gbps (London) | 1319 / 9,091 | 5070.27 | $2.60 |
Neuralrack RTX 6000x1 | EPYC 9374F (7c @ 3.85) | 118 GB | 8.79 GB/s | 5.84 Gbps / 5.73 Gbps (NYC) | 1825 / 8,097 | 6875.86 | $1.29 |
Neuralrack RTX 6000x2 | EPYC 9374F (14c @ 3.85) | 236 GB | 7.95 GB/s | 9.07 Gbps / 4.35 Gbps (NYC) | 1826 / 10,672 | 7504.25 | $2.58 |
Key Takeaways
- Startup time varies significantly based on disk speed and CPU-GPU coordination. NVMe-backed storage with fast CPUs helped reduce the wait time before inference.
- Model download speed can be a limiting factor if your storage or bandwidth is subpar. In some cases, setting `HF_HUB_ENABLE_HF_TRANSFER=1` gave 2–3× faster downloads from Hugging Face (see the sketch after this list).
- Token generation latency (especially TTFT) can vary even across servers with similar GPUs, due to differences in backend configuration and memory bandwidth.
- 4090s perform well for the cost, especially for smaller models like Qwen-3B or LLaMA-8B. However, for larger models or batch inference, the PRO 6000 is a clear winner. Even on the small `Qwen/Qwen3-Coder-30B-A3B-Instruct` model used in this test, a single PRO 6000 is faster than four 4090s and four 5090s. The prefill-decode disaggregation technique I described in the previous article can reduce the amount of data transferred over the PCIe bus, which is the primary performance bottleneck of low-VRAM GPUs running larger models. However, in the vast majority of cases, the PRO 6000 will be the better option.
How to Run This Yourself
Clone the repository:
git clone https://github.com/cloudrift-ai/server-benchmark.git
cd server-benchmark
Install dependencies:
./scripts/setup.sh
Run the benchmark:
./scripts/run_benchmarks.sh
GitHub Repository
You can find the code in the server-benchmark repository on GitHub (https://github.com/cloudrift-ai/server-benchmark). The model and other parameters are easily customizable if you want to run it yourself. Feel free to let me know on Discord or in the comments which configuration or model you'd like to see benchmarked next!