Blackwell Dominates: Benchmarking LLM Inference on NVIDIA B200, H200, H100, and RTX PRO 6000

By Natalia Trifonova | January 21, 2026
Benchmarks · LLM · GPU Performance

Not long ago, NVIDIA's Blackwell architecture landed in datacenters with the B200, promising major improvements in both performance and efficiency over the previous Hopper generation. But how do these gains translate to real-world LLM inference? And at what point does the premium price tag make sense over more affordable alternatives?

In this benchmark we explore those questions by comparing the B200 against the H200, H100, and RTX PRO 6000 on longer-context LLM inference workloads. Our goal is to identify the most cost-effective high-end inference platform for production LLM deployment.


In our previous benchmark, we compared RTX PRO 6000 against H100 and H200.

We extend the methodology from that article with three key changes:

  1. Longer context: 8K input + 8K output tokens (16K total)
  2. NVIDIA B200: testing the newest Blackwell datacenter GPU
  3. Expert Parallelism: investigating vLLM's --enable-expert-parallel for MoE models

Benchmarking setup

All benchmarks use 8-GPU nodes to keep the comparison fair across GPU generations and interconnects.

Hardware

All instances are from Google Cloud:

  • B200 × 8 (NVLink): a4-highgpu-8g
  • H200 × 8 (NVLink): a3-ultragpu-8g
  • H100 × 8 (NVLink): a3-highgpu-8g
  • RTX PRO 6000 × 8 (PCIe): g4-standard-384
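
For reference, spinning up one of these nodes with the gcloud CLI looks roughly like the sketch below. The instance name, zone, image, and disk size are placeholders of our own, and in practice A3/A4 GPU capacity is usually obtained through reservations or FlexStart rather than plain on-demand creation:

gcloud compute instances create llm-bench-b200 \
    --machine-type=a4-highgpu-8g \
    --zone=us-central1-b \
    --image-family=ubuntu-2204-lts \
    --image-project=ubuntu-os-cloud \
    --boot-disk-size=1000GB \
    --maintenance-policy=TERMINATE

With a plain OS image like this you still need to install NVIDIA drivers and Docker before serving models.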

Serving configuration

We optimized the setup for maximum throughput using tensor parallelism and replica scaling where possible:

  • Example: On an 8-GPU machine with a model requiring only 1 GPU, we run 8 vLLM instances with NGINX load balancing.
  • If all 8 GPUs are required, a single instance with --tensor-parallel-size=8 is used.

Other parameters:

  • Inference engine: vLLM (OpenAI-compatible API)
  • Context length: --max-model-len 16384
  • KV cache: --kv-cache-dtype fp8
  • Parallelism strategy:
    • Single-GPU models: 8 independent vLLM replicas + NGINX load balancing
    • 4-GPU models: 2 instances with --tensor-parallel-size 4
    • 8-GPU models: 1 instance with --tensor-parallel-size 8
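
For the single-GPU replica case, the load balancer in front of the eight vLLM instances is plain NGINX. A minimal sketch, assuming the replicas listen on ports 8001–8008 (the actual ports and config in our harness may differ):

upstream vllm_replicas {
    least_conn;                    # route each request to the replica with the fewest in-flight requests
    server 127.0.0.1:8001;
    server 127.0.0.1:8002;
    # ... one entry per replica, up to 127.0.0.1:8008
}

server {
    listen 8000;
    location / {
        proxy_pass http://vllm_replicas;
        proxy_read_timeout 3600s;  # long generations need generous timeouts
    }
}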

Benchmark methodology

We used vllm bench serve with:

  • Input/Output length: 8000 tokens each (16K total context)
  • Concurrency: 64–256 (the maximum concurrency for each model/hardware combination that did not cause request timeouts)
  • Requests per test: 128–512 (set to 2× the concurrency)
  • Data: randomly generated prompts (--dataset-name random)

We report output throughput (tok/s) as the primary metric.
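
A single run at the lowest concurrency level looks like the sketch below; the complete invocation we actually used (including percentile metrics and the in-cluster base URL) is reproduced in the appendix at the end of this article. The model name and URL here are illustrative:

vllm bench serve \
    --model zai-org/GLM-4.6-FP8 \
    --dataset-name random \
    --random-input-len 8000 \
    --random-output-len 8000 \
    --max-concurrency 64 \
    --num-prompts 128 \
    --ignore-eos \
    --backend openai-chat \
    --endpoint /v1/chat/completions \
    --base-url http://localhost:8000

Note how --num-prompts is twice --max-concurrency, per the rule above.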


Model selection

We selected three models of increasing size to stress different aspects of multi-GPU inference.

1) GLM-4.5-Air-AWQ-4bit (single GPU)

A 4-bit quantized MoE model that fits on a single GPU. This stresses replica scaling and per-GPU decode speed without inter-GPU communication.

2) Qwen3-Coder-480B-A35B-Instruct-AWQ (4 GPUs)

A 4-bit quantized ~480B-parameter model that requires 4-way tensor parallelism.
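
On an 8-GPU node this means two independent vLLM servers, each pinned to half of the GPUs, with requests load-balanced across them. A rough sketch, assuming the model weights are already downloaded locally (ports and device split are our own illustration):

CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve /path/to/Qwen3-Coder-480B-A35B-Instruct-AWQ \
    --tensor-parallel-size 4 --max-model-len 16384 --kv-cache-dtype fp8 --port 8001 &
CUDA_VISIBLE_DEVICES=4,5,6,7 vllm serve /path/to/Qwen3-Coder-480B-A35B-Instruct-AWQ \
    --tensor-parallel-size 4 --max-model-len 16384 --kv-cache-dtype fp8 --port 8002 &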

3) GLM-4.6-FP8 (8 GPUs)

A large FP8 model that requires all 8 GPUs with full tensor parallelism, so inter-GPU communication becomes a major bottleneck.


GPU pricing used in this analysis

| GPU | GCP Instance | GCP FlexStart $/GPU-hr | Estimated Run Cost $/GPU-hr |
|---|---|---|---|
| RTX PRO 6000 | g4-standard-384 | $2.25 | $0.93 |
| H100 | a3-highgpu-8g | $4.79 | $1.91 |
| H200 | a3-ultragpu-8g | $5.30 | $2.06 |
| B200 | a4-highgpu-8g | $8.06 | $2.68 |

Why we use estimated run costs instead of cloud pricing: Cloud on-demand prices are driven by supply and demand. To provide a more stable cost comparison, we estimate the "true" run cost of owning and operating hardware. This gives a consistent baseline that doesn't change with market fluctuations.

For the full methodology on how we compute run costs (electricity, depreciation, cost of capital, maintenance, and colocation), see our companion article: The True Cost of GPU Ownership.


Benchmark results

Throughput summary (tok/s, 16K total: 8K in + 8K out)

| Model | 8× RTX PRO 6000 | 8× H100 | 8× H200 | 8× B200 |
|---|---|---|---|---|
| GLM-4.5-Air-AWQ-4bit | 2,290.69 | 2,556.03 | 5,463.32 | 9,675.24 |
| Qwen3-Coder-480B-A35B (AWQ) | 1,602.96 | 2,328.63 | 4,262.88 | 6,438.43 |
| GLM-4.6-FP8 | 1,651.67 | 2,833.77 | 5,587.54 | 8,036.71 |

Cost per million output tokens summary ($/Mtok)

| Model | 8× RTX PRO 6000 | 8× H100 | 8× H200 | 8× B200 |
|---|---|---|---|---|
| GLM-4.5-Air-AWQ-4bit | $0.90 | $1.66 | $0.84 | $0.62 |
| Qwen3-Coder-480B-A35B (AWQ) | $1.29 | $1.82 | $1.07 | $0.93 |
| GLM-4.6-FP8 | $1.25 | $1.50 | $0.82 | $0.74 |
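
These figures follow directly from the throughput table and the per-GPU run costs above: cost per million output tokens is the hourly cost of the 8-GPU node divided by the millions of output tokens it generates per hour.

$/Mtok = (8 × run cost per GPU-hr) / (output tok/s × 3600 / 1,000,000)

For example, B200 on GLM-4.6-FP8: (8 × $2.68) / (8,036.71 × 3,600 / 1,000,000) ≈ $21.44 / 28.9 Mtok ≈ $0.74 per million output tokens, matching the table.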

What stands out

  1. B200 wins on throughput, with the largest gap on the most communication-heavy workload

    • GLM-4.6-FP8 (8-way TP): B200 is 4.87× faster than PRO 6000 (8,036.71 vs 1,651.67 tok/s)
    • Qwen3-Coder-480B (4-way TP): B200 is 4.02× faster than PRO 6000 (6,438.43 vs 1,602.96 tok/s)
    • GLM-4.5-Air (single-GPU replicas): B200 is 4.22× faster than PRO 6000 (9,675.24 vs 2,290.69 tok/s)
  2. B200 is also the cost-efficiency leader under our updated run-cost estimates

    B200's throughput advantage more than compensates for its higher hourly cost, giving it the best $/Mtok across all tested models.

  3. PRO 6000 is an attractive low-capex option

    It beats the H100 on cost per token across all models and is on par with the H200 on GLM-4.5-Air, making it relevant for capex-sensitive deployments.

  4. H200 is a major step up over H100 for long-context inference in these runs

    H200 delivers ~1.83× to 2.14× H100 throughput across the three models.

  5. H100 looked worse than expected in this specific setup

    Its throughput is on par with the PRO 6000 on GLM-4.5-Air, and it trails all other contenders on cost per token across all workloads.


Expert parallelism: does it help?

vLLM exposes --enable-expert-parallel to distribute MoE experts across GPUs. In theory, it can help at very high concurrency—but it's not a consistent win at typical production settings.

We ran every experiment in two versions: with and without expert parallelism. Here are two examples where expert parallelism improved performance.
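
The only difference between the two configurations is whether the flag is present in the vLLM launch command (the full Docker Compose file we used is reproduced in the appendix); roughly:

vllm serve /mnt/localssd/hf_models/zai-org/GLM-4.6-FP8 \
    --tensor-parallel-size 8 \
    --max-num-seqs 512 --max-model-len 16384 --kv-cache-dtype fp8 \
    --enable-expert-parallel    # omit this flag for the 'Without EP' runs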

GLM-4.6-FP8 on B200 ×8 (256 concurrency)

| Configuration | Throughput |
|---|---|
| With EP | 8,036.71 tok/s |
| Without EP | 7,921.92 tok/s |

That's only a ~1.4% improvement.

Qwen3-Coder-480B-A35B-AWQ on H200 ×8 (128 concurrency)

| Configuration | Throughput |
|---|---|
| With EP | 4,262.88 tok/s |
| Without EP | 3,838.09 tok/s |

That's a ~11.1% improvement, which is meaningful, but EP did not reliably improve the other configurations (some were flat or negative).


Long-context impact (2K → 16K)

Long contexts increase KV cache traffic and raise per-token attention cost, so throughput drops across the board. Here's the same 8-GPU GLM-4.6-FP8 workload at short vs long context:

| GPU | GLM-4.6-FP8 (2K ctx) | GLM-4.6-FP8 (16K ctx) | Decrease |
|---|---|---|---|
| 8× PRO 6000 | ~2,696 tok/s | 1,651.67 tok/s | 38.7% |
| 8× H100 | ~7,816 tok/s | 2,833.77 tok/s | 63.7% |
| 8× H200 | ~10,449 tok/s | 5,587.54 tok/s | 46.5% |

Interpretation:

  • H100 drops the most (−64%), consistent with long-context runs becoming more dominated by KV/attention memory traffic and long-sequence scheduling overheads.
  • H200 holds up better (−47%), consistent with its long-context strengths (HBM capacity/bandwidth + overall platform efficiency under KV-heavy loads).
  • PRO 6000 shows the smallest relative drop (−39%), but from a much lower absolute baseline—so it remains substantially slower in node-level tok/s even though its long-context penalty is less severe.

How to Run This Benchmark Yourself

The complete benchmark code and configuration files are available in the GitHub repository. You can reproduce these results or customize the tests for your specific models and configurations.

Customize the Benchmark

All benchmark parameters are configurable via config.yml:

  • SSH servers for remote execution
  • Model selection
  • GPU configurations
  • Concurrency levels
  • Input/output lengths
  • GPU rental prices (for cost analysis)
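
The exact schema lives in the repository, but a hypothetical config.yml covering the parameters above might look like this (all keys and values here are illustrative, not the repo's actual field names):

servers:
  - name: b200-node
    ssh: user@10.0.0.12
    gpu: B200
    gpu_count: 8
    price_per_gpu_hour: 2.68
models:
  - name: zai-org/GLM-4.6-FP8
    tensor_parallel_size: 8
concurrency_levels: [64, 128, 256]
input_len: 8000
output_len: 8000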

After editing the configuration (server addresses, models, and other parameters), remove any files in the results folder and run make bench.

To regenerate the report from existing benchmark data, use make report-nov2025.
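
Putting it together, an end-to-end run looks roughly like this (the repository URL is linked below; the Makefile target names come from the repo, the rest is a sketch):

git clone <repository-url> && cd <repository>
$EDITOR config.yml        # set server addresses, models, concurrency levels, and GPU prices
rm -f results/*           # clear previous benchmark data
make bench                # run the full benchmark matrix on the configured servers
make report-nov2025       # regenerate the report from the collected results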

What's Included

Each benchmark run generates:

  • Complete benchmark results with all metrics
  • Docker Compose configuration used for serving
  • Full benchmark command for reproducibility
  • System information and hardware specs

Raw Benchmark Data

All raw benchmark data is available in the repository's results folder. See the *_vllm_benchmark.txt files for detailed performance metrics and benchmark configuration. As an example, here is the output for GLM-4.6-FP8 on 8× B200 with expert parallelism enabled:

============ Serving Benchmark Result ============
Successful requests:                     512       
Failed requests:                         0         
Maximum request concurrency:             256       
Benchmark duration (s):                  1019.32   
Total input tokens:                      4096000   
Total generated tokens:                  4096000   
Request throughput (req/s):              0.50      
Output token throughput (tok/s):         4018.36   
Peak output token throughput (tok/s):    5376.00   
Peak concurrent requests:                260.00    
Total token throughput (tok/s):          8036.71   
---------------Time to First Token----------------
Mean TTFT (ms):                          23321.96  
Median TTFT (ms):                        2218.39   
P99 TTFT (ms):                           88834.35  
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          60.47     
Median TPOT (ms):                        62.66     
P99 TPOT (ms):                           63.03     
---------------Inter-token Latency----------------
Mean ITL (ms):                           60.46     
Median ITL (ms):                         53.22     
P99 ITL (ms):                            367.38    
----------------End-to-end Latency----------------
Mean E2EL (ms):                          507017.02 
Median E2EL (ms):                        504003.17 
P99 E2EL (ms):                           592968.03 
==================================================

============ Docker Compose Configuration ============
services:
  vllm_0:
    image: vllm/vllm-openai:latest
    container_name: vllm_benchmark_container_0
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    volumes:
      - /mnt/localssd/hf_models:/mnt/localssd/hf_models
    environment:
      - HUGGING_FACE_HUB_TOKEN=
      - HF_HOME=/mnt/localssd/hf_models
    ports:
      - "8000:8000"
    shm_size: '16gb'
    ipc: host
    command: >
      --trust-remote-code
      --gpu-memory-utilization=0.9
      --host 0.0.0.0
      --port 8000
      --tensor-parallel-size 8
      --pipeline-parallel-size 1
      --model /mnt/localssd/hf_models/zai-org/GLM-4.6-FP8
      --served-model-name zai-org/GLM-4.6-FP8
      --max-num-seqs 512 --max-model-len 16384 --kv-cache-dtype fp8  --enable-expert-parallel
    healthcheck:
      test: ["CMD", "bash", "-c", "curl -f http://localhost:8000/health && curl -f http://localhost:8000/v1/models | grep -q 'object.*list'"]
      interval: 10s
      timeout: 10s
      retries: 180
      start_period: 600s

  benchmark:
    image: vllm/vllm-openai:latest
    container_name: vllm_benchmark_client
    volumes:
      - /mnt/localssd/hf_models:/mnt/localssd/hf_models
    environment:
      - HUGGING_FACE_HUB_TOKEN=
      - HF_HOME=/mnt/localssd/hf_models
    entrypoint: ["/bin/bash", "-c"]
    command: ["sleep infinity"]
    profiles:
      - tools

============ Benchmark Command ============
vllm bench serve \
    --model zai-org/GLM-4.6-FP8 \
    --dataset-name random \
    --random-input-len 8000 \
    --random-output-len 8000 \
    --max-concurrency 256 \
    --num-prompts 512 \
    --ignore-eos \
    --backend openai-chat \
    --endpoint /v1/chat/completions \
    --percentile-metrics ttft,tpot,itl,e2el \
    --base-url http://vllm_0:8000
==================================================

GitHub Repository

All benchmark code, configs, and raw results are available on GitHub.
