RTX 4090 vs RTX 5090 vs RTX PRO 6000: Comprehensive LLM Inference Benchmark

By Dmitry Trifonov, October 9, 2025

Choosing the right GPU configuration for LLM inference can significantly impact both performance and cost. I conducted a comprehensive benchmark comparing RTX 4090, RTX 5090, and RTX PRO 6000 GPUs across various configurations (1x, 2x, and 4x) using three different model sizes: Qwen3-Coder-30B-A3B-Instruct-AWQ (fits in 24GB), Meta-Llama-3.3-70B-Instruct-AWQ-INT4 (fits in 48GB), and GLM-4.5-Air-AWQ-4bit (fits in 96GB). This benchmark provides real-world throughput data and cost analysis using vLLM to help you make an informed decision for your LLM serving infrastructure.


Benchmarking Setup

My benchmark focuses on maximum throughput for high-concurrency LLM serving scenarios. I tested the following hardware configurations:

  • RTX 4090: 1x, 2x, and 4x configurations
  • RTX 5090: 1x, 2x, and 4x configurations
  • RTX PRO 6000: 1x configuration

All machines were equipped with at least 50GB of RAM per GPU and a minimum of 7 CPU cores. The RTX 4090 systems used EPYC Milan (3rd Gen) processors, while RTX 5090 and PRO 6000 systems employed EPYC Genoa (4th Gen) processors, providing slightly better overall performance. The exact hardware specs and system information are included in the benchmark results.

It is worth noting that the multi-GPU RTX 5090 setups have more total VRAM (4x32GB = 128GB) than 4x RTX 4090 (4x24GB = 96GB) and 1x PRO 6000 (96GB). This allows them to run larger models without hitting VRAM limits and can provide better throughput thanks to larger batch sizes.

Serving Configuration

I optimized the setup for maximum throughput using:

  • Inference Engine: vLLM with OpenAI-compatible API
  • Parallelism Strategy: Pipeline parallelism (--pipeline-parallel-size) for multi-GPU setups
  • Replica Scaling: Multiple vLLM instances with NGINX load balancing when possible
    • Example: On a 4-GPU machine with a model requiring only 2 GPUs, I run 2 vLLM instances with --pipeline-parallel-size=2 and NGINX load balancing (a minimal load-balancer config sketch follows this list)
    • If all 4 GPUs are required, a single instance with --pipeline-parallel-size=4 is used
  • Reduced Context Length: --max-model-len 8192 to fit models in memory and improve throughput
  • KV Cache Quantization: --kv-cache-dtype fp8 for memory efficiency
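
The load-balancer configuration lives in an nginx.vllm.conf file in the repository (it is mounted by the Docker Compose setup shown near the end of this post). I am not reproducing the exact file here, but a minimal round-robin config along these lines is all that is needed; the service names and ports are taken from that compose file, everything else is an illustrative assumption:

# Minimal nginx.vllm.conf sketch (illustrative, not the repository's exact file).
# Round-robin load balancing across two vLLM replicas started by Docker Compose.
events {}

http {
    upstream vllm_backends {
        server vllm_0:8000;   # first vLLM replica (Compose service name)
        server vllm_1:8000;   # second vLLM replica
    }

    server {
        listen 8080;          # the benchmark client targets http://nginx_lb:8080

        location / {
            proxy_pass http://vllm_backends;
            proxy_http_version 1.1;
            proxy_buffering off;         # stream tokens back as they are generated
            proxy_read_timeout 3600s;    # long generations under heavy concurrency
        }
    }
}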

I also experimented with tensor parallelism (--tensor-parallel-size), but it requires more VRAM, and most of the models I tested didn't fit in the available memory on the multi-GPU setups. It would also likely be slower due to higher inter-GPU communication overhead.

Benchmark Methodology

I used the vllm bench serve tool with:

  • Input/Output Length: 1000 tokens each
  • Concurrent Requests: 400 (saturates token generation capacity); reduced to 200 on slower configurations to avoid timeouts
  • Request Count: 1200 requests per test
  • Data: Random synthetic data

Cost Analysis

GPU prices vary significantly based on the provider, region, and SLA. For example, vast.ai offers RTX 4090 instances for as little as $0.25 per hour, but they are often unreliable and have poor network connectivity. RunPod is a more predictable option, with RTX 4090 instances at $0.59/hour (secure cloud).

I am using neuralrack.ai prices (the sponsor of this benchmark) as a reference for reliable deployment providers:

  • RTX 4090: $0.39/hour per GPU
  • RTX 5090: $0.65/hour per GPU
  • RTX PRO 6000: $1.29/hour per GPU

To better reflect your situation, you can modify these prices in the config.yml file in the benchmark repository and then invoke make report to generate a customized cost analysis.


Model Selection

To understand how PCIe communication affects multi-GPU performance, I've selected three models of increasing size:

1. Qwen3-Coder-30B-A3B-Instruct-AWQ (~24GB)

This 4-bit quantized model fits comfortably in a single RTX 4090 (24GB VRAM).

Expected Behavior: Linear scaling with GPU count since each GPU can run an independent replica. Multi-GPU configurations (4x 4090 and 4x 5090) should excel due to their higher aggregate compute power. However, the PRO 6000's newer Blackwell architecture, faster memory bandwidth, and superior utilization may provide an advantage.

2. Meta-Llama-3.3-70B-Instruct-AWQ-INT4 (~48GB)

This 4-bit quantized model requires 2x RTX 4090s.

Expected Behavior: Some PCIe communication overhead in multi-GPU setups may reduce performance relative to single-chip solutions.

3. GLM-4.5-Air-AWQ-4bit (~96GB)

This large model requires all four RTX 4090s.

Expected Behavior: Significant PCIe communication overhead is expected. The PRO 6000's superior memory bandwidth and absence of PCIe communication overhead should provide a substantial advantage.

It is possible to fit even bigger GGUF-quantized models on these GPUs. However, vLLM's GGUF support is very experimental, so I focused on AWQ and INT4 models available on Hugging Face. I wanted to use vLLM since it is well optimized for high-throughput serving.


Results

Qwen3-Coder-30B-A3B-Instruct-AWQ (24GB Model)

Winner: RTX 5090

The performance across all configurations was surprisingly proportional to cost. While I expected the RTX PRO 6000 to underperform in this category (since the model fits on a single GPU and throughput scales nearly linearly with replica count), it delivered impressive results:

  • 4x RTX 5090: 12,744 tokens/s - best absolute throughput due to replica parallelism
  • 4x RTX 4090: 8,903 tokens/s - strong performance but limited by older architecture
  • 1x PRO 6000: 8,425 tokens/s - outstanding single-chip performance, roughly 3.7x faster than 1x RTX 4090 (2,259 tokens/s) and 1.8x faster than 1x RTX 5090 (4,570 tokens/s)

The PRO 6000's Blackwell architecture, with its advanced memory subsystem and higher utilization rates, provides exceptional single-GPU performance even for models that don't require its full VRAM capacity.

Key Metrics (1x GPU Comparison):

Configuration  | Throughput (tok/s) | Cost/Hour | Cost per 1M Tokens
1x RTX 4090    | 2,259              | $0.39     | $0.048
1x RTX 5090    | 4,570              | $0.65     | $0.040
1x PRO 6000    | 8,425              | $1.29     | $0.043
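
The cost column follows directly from the hourly price and the measured throughput, with no additional assumptions:

Cost per 1M tokens = hourly price / (throughput in tok/s × 3,600 s) × 1,000,000

For example, the 1x RTX 4090 produces 2,259 × 3,600 ≈ 8.13M tokens per hour, so $0.39 / 8.13 ≈ $0.048 per million tokens.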

The 2x5090 configuration delivered almost 2x the throughput of 1x5090, demonstrating the expected linear scaling for this model size. However, the 4x5090 configuration only provided a 40% increase over 2x5090, likely due to some other system bottleneck (CPU, RAM, load-balancer overhead, etc.).

Meta-Llama-3.3-70B-Instruct-AWQ-INT4 (48GB Model)

Winner: RTX 5090

With this larger model requiring 2 GPUs, we begin to see the impact of PCIe communication overhead in multi-GPU configurations:

  • 2x RTX 5090: 1,230 tokens/s - excellent performance with PCIe Gen 5 bandwidth
  • 1x PRO 6000: 1,031 tokens/s - excellent single-chip performance, but not enough to outperform multi-GPU setups
  • 2x RTX 4090: 467 tokens/s - performance impacted by PCIe Gen 4 bandwidth and older architecture

The RTX 5090's PCIe Gen 5 support provides roughly twice the inter-GPU bandwidth of the 4090's PCIe Gen 4 (about 64 GB/s vs. 32 GB/s per direction on a x16 link), resulting in better scaling efficiency. Surprisingly, the PRO 6000 can run this model on a single chip with performance comparable to 2x RTX 5090.

GLM-4.5-Air-AWQ-4bit (96GB Model)

Winner (Throughput): RTX 5090 4x | Winner (Cost-Efficiency): RTX PRO 6000

With the largest model requiring all four GPUs, PCIe communication becomes a critical bottleneck—and this is where the PRO 6000 truly shines:

  • 4x RTX 5090: 4,622 tokens/s - best absolute throughput but with significant PCIe overhead
  • 1x PRO 6000: 3,169 tokens/s - impressive single-chip performance with best cost-efficiency ($0.113/Mtok vs $0.156/Mtok for 4x5090)
  • 4x RTX 4090: 1,731 tokens/s - performance impacted by PCIe Gen 4 bandwidth and older architecture

While the 4x RTX 5090 configuration achieves the highest absolute throughput, the PRO 6000 offers the best cost-efficiency for this model size. Its integrated architecture eliminates PCIe bottlenecks entirely, and the single-chip design simplifies deployment and reduces operational complexity.

Cost per Million Tokens Summary

The overall winner is RTX PRO 6000 for its consistent performance across all model sizes and best cost-efficiency for larger models. However, if your workload primarily involves smaller models, multi-GPU consumer configurations (especially RTX 5090) can offer better absolute throughput at a lower cost.

Small Models (24GB): Multi-GPU consumer configurations offer the best value due to replica parallelism, but RTX PRO 6000 is very close.

Medium Models (48GB): RTX 5090 configuration provides the best balance of performance and cost, followed by RTX PRO 6000.

Large Models (96GB+): RTX PRO 6000 emerges as the clear winner despite its higher hourly cost, thanks to the elimination of PCIe overhead.

You can adjust these prices for your specific situation by modifying the config.yml file in the benchmark repository and running make report to generate a customized cost analysis.

Prefill-Decode Disaggregation

For scenarios where you need to serve large models on consumer GPUs, consider using prefill-decode disaggregation. This technique can significantly reduce PCIe data transfer by separating the prefill and decode phases of inference. However, for most production use cases requiring high throughput on large models, the RTX PRO 6000 remains the better choice.


How to Run This Benchmark Yourself

The complete benchmark code and configuration files are available in the GitHub repository. You can reproduce these results or customize the tests for your specific models and configurations.

Customize the Benchmark

All benchmark parameters are configurable via config.yml (an illustrative sketch follows this list):

  • SSH servers for remote execution
  • Model selection
  • GPU configurations
  • Concurrency levels
  • Input/output lengths
  • GPU rental prices (for cost analysis)
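
The authoritative schema is the config.yml that ships with the repository; the sketch below is only an illustration of the kinds of values you would edit, and every key name here is hypothetical, not the repository's actual keys:

# Illustrative only; key names are hypothetical, use the keys from the
# repository's actual config.yml.
servers:
  - host: user@203.0.113.10          # SSH target for remote execution
    gpu_config: 2x RTX 5090
models:
  - ibnzterrell/Meta-Llama-3.3-70B-Instruct-AWQ-INT4
benchmark:
  max_concurrency: 400
  input_len: 1000
  output_len: 1000
prices_per_gpu_hour:                 # used only for the cost analysis
  RTX 4090: 0.39
  RTX 5090: 0.65
  RTX PRO 6000: 1.29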

After modifying the configuration and specifying server addresses and other parameters, remove files in the results folder and run make bench.

To regenerate the report from existing benchmark data without re-running the benchmark, use make report.
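
Putting it together, a typical run looks roughly like this (only the make targets, the config.yml file, and the results folder come from the repository; the rest is ordinary shell):

# Clone the benchmark repository and run it against your own servers and prices.
git clone https://github.com/cloudrift-ai/server-benchmark
cd server-benchmark
# Edit config.yml: server addresses, models, GPU configurations, rental prices.
rm -f results/*    # clear previous results before a fresh run
make bench         # run the full benchmark suite over SSH
make report        # regenerate the report from the data in results/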

What's Included

Each benchmark run generates:

  • Complete benchmark results with all metrics
  • Docker Compose configuration used for serving
  • Full benchmark command for reproducibility
  • System information and hardware specs

Configuration Matrix

I ran a total of 18 different configurations across the three models and multiple GPU setups. The configuration matrix is defined in the config.yml file.

The matrix crosses seven GPU configurations (1x, 2x, and 4x RTX 4090; 1x, 2x, and 4x RTX 5090; and 1x RTX PRO 6000) with the three models (Qwen3-Coder-30B-4bit, Llama-3.3-70B-4bit, and GLM-4.5-Air-4bit); the exact combinations tested are listed in config.yml and in the repository's results folder.

Raw Benchmark Data

All raw benchmark data is available in the repository's results folder. Check out *_vllm_benchmark.txt files for detailed performance metrics and benchmark configuration.

============ Serving Benchmark Result ============
Successful requests:                     1200      
Maximum request concurrency:             400       
Benchmark duration (s):                  980.85    
Total input tokens:                      1196743   
Total generated tokens:                  1200000   
Request throughput (req/s):              1.22      
Output token throughput (tok/s):         1223.42   
Peak output token throughput (tok/s):    3343.00   
Peak concurrent requests:                408.00    
Total Token throughput (tok/s):          2443.53   
---------------Time to First Token----------------
Mean TTFT (ms):                          158275.93 
Median TTFT (ms):                        166262.87 
P99 TTFT (ms):                           273238.49 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          134.71    
Median TPOT (ms):                        123.86    
P99 TPOT (ms):                           216.70    
---------------Inter-token Latency----------------
Mean ITL (ms):                           134.57    
Median ITL (ms):                         55.98     
P99 ITL (ms):                            1408.24   
----------------End-to-end Latency----------------
Mean E2EL (ms):                          292848.13 
Median E2EL (ms):                        311149.01 
P99 E2EL (ms):                           399504.14 
==================================================

============ Docker Compose Configuration ============
services:
  vllm_0:
    image: vllm/vllm-openai:latest
    container_name: vllm_benchmark_container_0
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['0', '1']
              capabilities: [gpu]
    volumes:
      - /hf_models:/hf_models
    environment:
      - HUGGING_FACE_HUB_TOKEN=
    ports:
      - "8000:8000"
    shm_size: '16gb'
    ipc: host
    command: >
      --trust-remote-code
      --gpu-memory-utilization=0.9
      --host 0.0.0.0
      --port 8000
      --tensor-parallel-size 1
      --pipeline-parallel-size 2
      --model /hf_models/ibnzterrell/Meta-Llama-3.3-70B-Instruct-AWQ-INT4
      --served-model-name ibnzterrell/Meta-Llama-3.3-70B-Instruct-AWQ-INT4
      --max-model-len 8192 --kv-cache-dtype fp8
    healthcheck:
      test: ["CMD", "bash", "-c", "curl -f http://localhost:8000/health && curl -f http://localhost:8000/v1/models | grep -q 'object.*list'"]
      interval: 10s
      timeout: 10s
      retries: 180
      start_period: 600s

  vllm_1:
    image: vllm/vllm-openai:latest
    container_name: vllm_benchmark_container_1
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              device_ids: ['2', '3']
              capabilities: [gpu]
    volumes:
      - /hf_models:/hf_models
    environment:
      - HUGGING_FACE_HUB_TOKEN=
    ports:
      - "8001:8000"
    shm_size: '16gb'
    ipc: host
    command: >
      --trust-remote-code
      --gpu-memory-utilization=0.9
      --host 0.0.0.0
      --port 8000
      --tensor-parallel-size 1
      --pipeline-parallel-size 2
      --model /hf_models/ibnzterrell/Meta-Llama-3.3-70B-Instruct-AWQ-INT4
      --served-model-name ibnzterrell/Meta-Llama-3.3-70B-Instruct-AWQ-INT4
      --max-model-len 8192 --kv-cache-dtype fp8
    healthcheck:
      test: ["CMD", "bash", "-c", "curl -f http://localhost:8000/health && curl -f http://localhost:8000/v1/models | grep -q 'object.*list'"]
      interval: 10s
      timeout: 10s
      retries: 180
      start_period: 600s

  nginx:
    image: nginx:alpine
    container_name: nginx_lb
    ports:
      - "8080:8080"
    volumes:
      - /home/riftuser/server-benchmark/nginx.vllm.conf:/etc/nginx/nginx.conf:ro
    depends_on:
      - vllm_0
      - vllm_1

  benchmark:
    image: vllm/vllm-openai:latest
    container_name: vllm_benchmark_client
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    volumes:
      - /hf_models:/hf_models
    environment:
      - HUGGING_FACE_HUB_TOKEN=
      - CUDA_VISIBLE_DEVICES=""
    entrypoint: ["/bin/bash", "-c"]
    command: ["sleep infinity"]
    profiles:
      - tools

============ Benchmark Command ============
vllm bench serve
  --model ibnzterrell/Meta-Llama-3.3-70B-Instruct-AWQ-INT4
  --dataset-name random
  --random-input-len 1000 --random-output-len 1000 --max-concurrency 400 --num-prompts 1200
  --ignore-eos --backend openai-chat --endpoint /v1/chat/completions
  --percentile-metrics ttft,tpot,itl,e2el 
  --base-url http://nginx_lb:8080
==================================================


Future Benchmarks

If you'd like to see specific configurations or models benchmarked, please let me know in the comments or in our Discord community.

Acknowledgments

This benchmark was conducted on servers provided by neuralrack.ai, who have kindly sponsored this initiative. The GPU rental prices used in cost analysis reflect typical market rates on their platform.

The benchmark is performed using cloudrift.ai software and infrastructure.

GitHub Repository

Access the complete benchmark code, configuration files, and raw results:

https://github.com/cloudrift-ai/server-benchmark