RTX 4090 vs RTX 5090 vs RTX PRO 6000: Comprehensive LLM Inference Benchmark
Choosing the right GPU configuration for LLM inference can significantly impact both performance and cost. I conducted a comprehensive benchmark comparing RTX 4090, RTX 5090, and RTX PRO 6000 GPUs across various configurations (1x, 2x, and 4x) using three different model sizes: Qwen3-Coder-30B-A3B-Instruct-AWQ (fits in 24GB), Meta-Llama-3.3-70B-Instruct-AWQ-INT4 (fits in 48GB), and GLM-4.5-Air-AWQ-4bit (fits in 96GB). This benchmark provides real-world throughput data and cost analysis using vLLM to help you make an informed decision for your LLM serving infrastructure.
Benchmarking Setup
My benchmark focuses on maximum throughput for high-concurrency LLM serving scenarios. I tested the following hardware configurations:
- RTX 4090: 1x, 2x, and 4x configurations
- RTX 5090: 1x, 2x, and 4x configurations
- RTX PRO 6000: 1x configuration
All machines were equipped with at least 50GB of RAM per GPU and a minimum of 7 CPU cores. The RTX 4090 systems used EPYC Milan (3rd Gen) processors, while RTX 5090 and PRO 6000 systems employed EPYC Genoa (4th Gen) processors, providing slightly better overall performance. The exact hardware specs and system information are included in the benchmark results.
It is worth noting that the 4x RTX 5090 setup has more total VRAM (4x 32GB = 128GB) than 4x RTX 4090 (4x 24GB = 96GB) and 1x PRO 6000 (96GB). This lets it run larger models without hitting VRAM limits and can provide better throughput thanks to larger batch sizes.
Serving Configuration
I optimized the setup for maximum throughput using:
- Inference Engine: vLLM with OpenAI-compatible API
- Parallelism Strategy: Pipeline parallelism (`--pipeline-parallel-size`) for multi-GPU setups
- Replica Scaling: Multiple vLLM instances with NGINX load balancing when possible (see the sketch after this list)
  - Example: On a 4-GPU machine with a model requiring only 2 GPUs, I run 2 vLLM instances with `--pipeline-parallel-size=2` and NGINX load balancing
  - If all 4 GPUs are required, a single instance with `--pipeline-parallel-size=4` is used
- Reduced Context Length: `--max-model-len 8192` to fit models in memory and improve throughput, plus `--kv-cache-dtype fp8` for memory efficiency
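For illustration, here is a minimal sketch of the two-replica layout on a 4-GPU machine using the bare `vllm serve` CLI; the actual runs used the Docker Compose setup reproduced in the Raw Benchmark Data section, and the ports and GPU assignments here are just examples:

```bash
# Two independent vLLM replicas, each spanning a pair of GPUs via pipeline parallelism.
# NGINX (not shown) load-balances the OpenAI-compatible API across ports 8000 and 8001.
CUDA_VISIBLE_DEVICES=0,1 vllm serve ibnzterrell/Meta-Llama-3.3-70B-Instruct-AWQ-INT4 \
  --pipeline-parallel-size 2 --max-model-len 8192 --kv-cache-dtype fp8 \
  --host 0.0.0.0 --port 8000 &
CUDA_VISIBLE_DEVICES=2,3 vllm serve ibnzterrell/Meta-Llama-3.3-70B-Instruct-AWQ-INT4 \
  --pipeline-parallel-size 2 --max-model-len 8192 --kv-cache-dtype fp8 \
  --host 0.0.0.0 --port 8001 &
```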
I have also experimented with tensor parallelism (`--tensor-parallel-size`), but it requires more VRAM, and most of the models I tested did not fit in the available memory on the multi-GPU setups. It would likely also be slower due to higher inter-GPU communication overhead.
Benchmark Methodology
I used the `vllm bench serve` tool with:
- Input/Output Length: 1000 tokens each
- Concurrent Requests: 400 (saturates token generation capacity); reduced to 200 on one slow configuration to avoid timeouts
- Request Count: 1200 requests per test
- Data: Random synthetic data
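These settings map directly onto `vllm bench serve` flags. The exact command from one run is reproduced verbatim in the Raw Benchmark Data section; in condensed form it looks like this:

```bash
# Random synthetic prompts, 1000 input / 1000 output tokens, 400 concurrent requests,
# 1200 requests total, sent to the OpenAI-compatible endpoint behind the load balancer.
vllm bench serve \
  --model ibnzterrell/Meta-Llama-3.3-70B-Instruct-AWQ-INT4 \
  --dataset-name random \
  --random-input-len 1000 --random-output-len 1000 \
  --max-concurrency 400 --num-prompts 1200 \
  --ignore-eos --backend openai-chat --endpoint /v1/chat/completions \
  --base-url http://nginx_lb:8080
```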
Cost Analysis
GPU prices vary significantly based on the provider, region, and SLA. For example, vast.ai offers RTX 4090 instances for as little as $0.25 per hour, but they are often unreliable and have poor network connectivity. RunPod is a more predictable option, with RTX 4090 instances at $0.59/hour (secure cloud).
I am using neuralrack.ai prices (they are sponsoring this benchmark) as a reference for reliable deployment providers:
- RTX 4090: $0.39/hour per GPU
- RTX 5090: $0.65/hour per GPU
- RTX PRO 6000: $1.29/hour per GPU
To better reflect your situation, you can modify these prices in the `config.yml` file in the benchmark repository and then invoke `make report` to generate a customized cost analysis.
Model Selection
To understand how PCIe communication affects multi-GPU performance, I've selected three models of increasing size:
1. Qwen3-Coder-30B-A3B-Instruct-AWQ (~24GB)
This 4-bit quantized model fits comfortably in a single RTX 4090 (24GB VRAM).
Expected Behavior: Linear scaling with GPU count since each GPU can run an independent replica. Multi-GPU configurations (4x 4090 and 4x 5090) should excel due to their higher aggregate compute power. However, the PRO 6000's newer Blackwell architecture, faster memory bandwidth, and superior utilization may provide an advantage.
2. Meta-Llama-3.3-70B-Instruct-AWQ-INT4 (~48GB)
This 4-bit quantized model requires 2x RTX 4090s.
Expected Behavior: Some PCIe communication overhead in multi-GPU setups may reduce performance relative to single-chip solutions.
3. GLM-4.5-Air-AWQ-4bit (~96GB)
This large model requires all four RTX 4090s.
Expected Behavior: Significant PCIe communication overhead. The PRO 6000's superior memory bandwidth and freedom from PCIe communication overhead should provide a substantial advantage.
It is possible to fit even bigger GGUF models on these GPUs. However, vLLM's GGUF support is very experimental, so I focused on AWQ and INT4 models available on Hugging Face. I wanted to use vLLM since it is well optimized for high-throughput serving.
Results
Qwen3-Coder-30B-A3B-Instruct-AWQ (24GB Model)
Winner: RTX 5090
Performance across all configurations was surprisingly proportional to cost. While I expected the RTX PRO 6000 to underperform in this category (since the model fits in a single GPU and replicas scale linearly), it delivered impressive results:
- 4x RTX 5090: 12,744 tokens/s - best absolute throughput due to replica parallelism
- 4x RTX 4090: 8,903 tokens/s - strong performance but limited by older architecture
- 1x PRO 6000: 8,425 tokens/s - outstanding single-chip performance, nearly 3.7x faster than 1x RTX 4090 (2,259 tokens/s) and 1.8x faster than 1x RTX 5090 (4,570 tokens/s)
The PRO 6000's Blackwell architecture, with its advanced memory subsystem and higher utilization rates, provides exceptional single-GPU performance even for models that don't require its full VRAM capacity.
Key Metrics (1x GPU Comparison):
| Configuration | Throughput (tok/s) | Cost/Hour | Cost per 1M Tokens |
|---|---|---|---|
| 1x RTX 4090 | 2,259 | $0.39 | $0.048 |
| 1x RTX 5090 | 4,570 | $0.65 | $0.040 |
| 1x PRO 6000 | 8,425 | $1.29 | $0.043 |
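The last column is simply the hourly price divided by the millions of tokens generated per hour. A quick sanity check for the 1x RTX 4090 row, as a shell one-liner:

```bash
# cost per 1M tokens = hourly price / (throughput in tok/s * 3600 / 1,000,000)
echo "scale=4; 0.39 / (2259 * 3600 / 1000000)" | bc
# -> .0479, i.e. roughly $0.048 per million tokens
```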
The 2x RTX 5090 configuration delivered almost twice the throughput of a single 5090, demonstrating the expected linear scaling for this model size. However, the 4x 5090 configuration provided only a 40% increase over 2x 5090, likely due to some other system bottleneck (CPU, RAM, load-balancer overhead, etc.).
Meta-Llama-3.3-70B-Instruct-AWQ-INT4 (48GB Model)
Winner: RTX 5090
With this larger model requiring 2 GPUs, we begin to see the impact of PCIe communication overhead in multi-GPU configurations:
- 2x RTX 5090: 1,230 tokens/s - excellent performance with PCIe Gen 5 bandwidth
- 1x PRO 6000: 1,031 tokens/s - excellent single-chip performance, but not enough to outperform multi-GPU setups
- 2x RTX 4090: 467 tokens/s - performance impacted by PCIe Gen 4 bandwidth and older architecture
The RTX 5090's PCIe Gen 5 support provides better inter-GPU communication than the 4090's PCIe Gen 4, resulting in better scaling efficiency. Surprisingly, the PRO 6000 can run this model on a single chip with performance comparable to 2x RTX 5090.
GLM-4.5-Air-AWQ-4bit (96GB Model)
Winner (Throughput): RTX 5090 4x | Winner (Cost-Efficiency): RTX PRO 6000
With the largest model requiring all four GPUs, PCIe communication becomes a critical bottleneck—and this is where the PRO 6000 truly shines:
- 4x RTX 5090: 4,622 tokens/s - best absolute throughput but with significant PCIe overhead
- 1x PRO 6000: 3,169 tokens/s - impressive single-chip performance with best cost-efficiency ($0.113/Mtok vs $0.156/Mtok for 4x5090)
- 4x RTX 4090: 1,731 tokens/s - performance impacted by PCIe Gen 4 bandwidth and older architecture
While the 4x RTX 5090 configuration achieves the highest absolute throughput, the PRO 6000 offers the best cost-efficiency for this model size. Its integrated architecture eliminates PCIe bottlenecks entirely, and the single-chip design simplifies deployment and reduces operational complexity.
Cost per Million Tokens Summary
The overall winner is the RTX PRO 6000, thanks to its consistent performance across all model sizes and the best cost-efficiency for larger models. However, if your workload primarily involves smaller models, multi-GPU consumer configurations (especially RTX 5090) can offer better absolute throughput at a lower cost.
Small Models (24GB): Multi-GPU consumer configurations offer the best value due to replica parallelism, but RTX PRO 6000 is very close.
Medium Models (48GB): RTX 5090 configuration provides the best balance of performance and cost, followed by RTX PRO 6000.
Large Models (96GB+): RTX PRO 6000 emerges as the clear winner despite its higher hourly cost, thanks to the elimination of PCIe overhead.
You can adjust these prices for your specific situation by modifying the `config.yml` file in the benchmark repository and running `make report` to generate a customized cost analysis.
Prefill-Decode Disaggregation
For scenarios where you need to serve large models on consumer GPUs, consider using prefill-decode disaggregation. This technique can significantly reduce PCIe data transfer by separating the prefill and decode phases of inference. However, for most production use cases requiring high throughput on large models, the RTX PRO 6000 remains the better choice.
How to Run This Benchmark Yourself
The complete benchmark code and configuration files are available in the GitHub repository. You can reproduce these results or customize the tests for your specific models and configurations.
Customize the Benchmark
All benchmark parameters are configurable via `config.yml`:
- SSH servers for remote execution
- Model selection
- GPU configurations
- Concurrency levels
- Input/output lengths
- GPU rental prices (for cost analysis)
After modifying the configuration and specifying server addresses and other parameters, remove the files in the results folder and run `make bench`.
To just generate the report based on the existing benchmark data, use `make report`.
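Put together, the workflow looks roughly like this (a sketch; the exact paths and targets follow the repository layout described above):

```bash
# After editing config.yml (servers, models, GPU configurations, prices):
rm -f results/*   # clear previous benchmark outputs
make bench        # run the configured benchmark matrix
make report       # regenerate the report and cost analysis from results/
```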
What's Included
Each benchmark run generates:
- Complete benchmark results with all metrics
- Docker Compose configuration used for serving
- Full benchmark command for reproducibility
- System information and hardware specs
Configuration Matrix
I ran a total of 18 different configurations across three models and multiple GPU setups. The configuration matrix can be defined in the `config.yml` file.
| GPU Configuration | Qwen3-Coder-30B-4bit | Llama-3.3-70B-4bit | GLM-4.5-Air-4bit |
|---|---|---|---|
| 1x RTX 4090 | ✓ | | |
| 2x RTX 4090 | ✓ | ✓ | |
| 4x RTX 4090 | ✓ | ✓ | ✓ |
| 1x RTX 5090 | ✓ | | |
| 2x RTX 5090 | ✓ | ✓ | |
| 4x RTX 5090 | ✓ | ✓ | ✓ |
| 1x RTX PRO 6000 | ✓ | ✓ | ✓ |
Raw Benchmark Data
All raw benchmark data is available in the repository's results folder. Check the *_vllm_benchmark.txt files for detailed performance metrics and the benchmark configuration.
============ Serving Benchmark Result ============
Successful requests: 1200
Maximum request concurrency: 400
Benchmark duration (s): 980.85
Total input tokens: 1196743
Total generated tokens: 1200000
Request throughput (req/s): 1.22
Output token throughput (tok/s): 1223.42
Peak output token throughput (tok/s): 3343.00
Peak concurrent requests: 408.00
Total Token throughput (tok/s): 2443.53
---------------Time to First Token----------------
Mean TTFT (ms): 158275.93
Median TTFT (ms): 166262.87
P99 TTFT (ms): 273238.49
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 134.71
Median TPOT (ms): 123.86
P99 TPOT (ms): 216.70
---------------Inter-token Latency----------------
Mean ITL (ms): 134.57
Median ITL (ms): 55.98
P99 ITL (ms): 1408.24
----------------End-to-end Latency----------------
Mean E2EL (ms): 292848.13
Median E2EL (ms): 311149.01
P99 E2EL (ms): 399504.14
==================================================
============ Docker Compose Configuration ============
services:
vllm_0:
image: vllm/vllm-openai:latest
container_name: vllm_benchmark_container_0
deploy:
resources:
reservations:
devices:
- driver: nvidia
device_ids: ['0', '1']
capabilities: [gpu]
volumes:
- /hf_models:/hf_models
environment:
- HUGGING_FACE_HUB_TOKEN=
ports:
- "8000:8000"
shm_size: '16gb'
ipc: host
command: >
--trust-remote-code
--gpu-memory-utilization=0.9
--host 0.0.0.0
--port 8000
--tensor-parallel-size 1
--pipeline-parallel-size 2
--model /hf_models/ibnzterrell/Meta-Llama-3.3-70B-Instruct-AWQ-INT4
--served-model-name ibnzterrell/Meta-Llama-3.3-70B-Instruct-AWQ-INT4
--max-model-len 8192 --kv-cache-dtype fp8
healthcheck:
test: ["CMD", "bash", "-c", "curl -f http://localhost:8000/health && curl -f http://localhost:8000/v1/models | grep -q 'object.*list'"]
interval: 10s
timeout: 10s
retries: 180
start_period: 600s
vllm_1:
image: vllm/vllm-openai:latest
container_name: vllm_benchmark_container_1
deploy:
resources:
reservations:
devices:
- driver: nvidia
device_ids: ['2', '3']
capabilities: [gpu]
volumes:
- /hf_models:/hf_models
environment:
- HUGGING_FACE_HUB_TOKEN=
ports:
- "8001:8000"
shm_size: '16gb'
ipc: host
command: >
--trust-remote-code
--gpu-memory-utilization=0.9
--host 0.0.0.0
--port 8000
--tensor-parallel-size 1
--pipeline-parallel-size 2
--model /hf_models/ibnzterrell/Meta-Llama-3.3-70B-Instruct-AWQ-INT4
--served-model-name ibnzterrell/Meta-Llama-3.3-70B-Instruct-AWQ-INT4
--max-model-len 8192 --kv-cache-dtype fp8
healthcheck:
test: ["CMD", "bash", "-c", "curl -f http://localhost:8000/health && curl -f http://localhost:8000/v1/models | grep -q 'object.*list'"]
interval: 10s
timeout: 10s
retries: 180
start_period: 600s
nginx:
image: nginx:alpine
container_name: nginx_lb
ports:
- "8080:8080"
volumes:
- /home/riftuser/server-benchmark/nginx.vllm.conf:/etc/nginx/nginx.conf:ro
depends_on:
- vllm_0
- vllm_1
benchmark:
image: vllm/vllm-openai:latest
container_name: vllm_benchmark_client
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
volumes:
- /hf_models:/hf_models
environment:
- HUGGING_FACE_HUB_TOKEN=
- CUDA_VISIBLE_DEVICES=""
entrypoint: ["/bin/bash", "-c"]
command: ["sleep infinity"]
profiles:
- tools
============ Benchmark Command ============
vllm bench serve
--model ibnzterrell/Meta-Llama-3.3-70B-Instruct-AWQ-INT4
--dataset-name random
--random-input-len 1000 --random-output-len 1000 --max-concurrency 400 --num-prompts 1200
--ignore-eos --backend openai-chat --endpoint /v1/chat/completions
--percentile-metrics ttft,tpot,itl,e2el
--base-url http://nginx_lb:8080
==================================================
Future Benchmarks
If you'd like to see specific configurations or models benchmarked, please let me know in the comments or in our Discord community.
Acknowledgments
This benchmark was conducted on servers provided by neuralrack.ai, who have kindly sponsored this initiative. The GPU rental prices used in the cost analysis reflect typical market rates on their platform.
The benchmark was performed using cloudrift.ai software and infrastructure.
GitHub Repository
Access the complete benchmark code, configuration files, and raw results: