Choosing Your LLM Powerhouse: A Comprehensive Comparison of Inference Providers

By Natalia Trifonova · July 8, 2025
Tags: api, large-language-models, llm, llm-performance, benchmark

Price, Throughput, and Features

[Article image: generated by AI (GPT-4o)]

As LLMs become the backbone of modern AI applications, choosing the right inference provider is more important than ever. It’s not just about which model is the smartest — it’s about how fast, scalable, and affordable the underlying infrastructure is.

I’ve put together an open and fully reproducible benchmark to compare leading open-source LLM API providers, focusing on Price, Throughput, and Features (including free tiers, rate limits, and model availability). The benchmark currently utilizes LLaMA 4 Maverick and DeepSeek models, and includes notable providers that offer these models.

There are already several LLM benchmarking efforts out there, with ArtificialAnalysis often seen as the industry standard. While it’s a fantastic resource, it’s not open source or reproducible, and it only tests up to 10 concurrent requests, which is far too low for most applications.

Benchmarking Methodology

The throughput benchmarks presented in this analysis were conducted using vLLM's benchmark_serving.py script, a standard tool for evaluating the performance of LLM serving systems.

I’m running the benchmark across a range of concurrency levels, from 1 up to 200. While this helps simulate real-world workloads, it’s worth noting that many providers enforce strict rate limits by default, so high-concurrency results may not reflect the true upper bounds of their infrastructure. Results can also vary slightly with server utilization.
I did my best to work around these limits, for example by topping up the account balance to $100 with providers that require it for a higher rate limit. However, many providers require you to submit a support request and reserve the right to deny the increase. In my opinion, this creates unnecessary friction, especially for hobbyists, researchers, and early-stage startups.
To reflect this, rate-limit effects are incorporated in the plot data.

The general command structure for running these benchmarks is the following:

# Provider-specific values ($api_key, $base_url, $endpoint, $model, $model_name) are set per run
export OPENAI_API_KEY="$api_key"

# Run the benchmark_serving script
python benchmark_serving.py \
--backend openai-chat \
--base-url "$base_url" \
--endpoint "$endpoint" \
--model "$model" \
--tokenizer "$model_name" \
--num-prompts "$NUM_PROMPTS" \
--max-concurrency "$MAX_CONCURRENCY" \
--dataset-name random \
--random-input-len 1000 \
--random-output-len 1000 \
--percentile_metrics ttft,tpot,itl,e2el

This command configures the benchmark to:

  • Interact with an OpenAI-compatible chat backend (a minimal standalone request is sketched after this list).
  • Specify the base URL and endpoint of the LLM inference service.
  • Define the model and its tokenizer for accurate token counting.
  • Set the number of prompts and maximum concurrent requests to simulate various load conditions.
  • Utilize a random dataset with specified input and output token lengths (1000 tokens each) to ensure consistent test conditions across providers.
  • Collect key performance metrics including Time To First Token (TTFT), Time Per Output Token (TPOT), Inter-Token Latency (ITL), and End-to-End Latency (E2EL).
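For reference, each benchmark request is an ordinary OpenAI-compatible chat completion call. Below is a minimal sketch of a single standalone request, reusing the same provider-specific variables as the command above (the payload contents are illustrative):

# Single request against an OpenAI-compatible chat endpoint (illustrative payload)
curl -s "$base_url$endpoint" \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "'"$model"'",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "max_tokens": 50
  }'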

Here is a link to the GitHub repository with the complete benchmark code.
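The full concurrency sweep described above simply wraps this command in a loop over --max-concurrency values. A minimal sketch, assuming the same variables are already set (the intermediate concurrency levels and the prompt counts here are illustrative, not the exact values used for the plots):

# Sweep concurrency from 1 up to 200; scale the number of prompts with concurrency
# so that each run keeps the target concurrency saturated.
for MAX_CONCURRENCY in 1 10 50 100 200; do
  NUM_PROMPTS=$((MAX_CONCURRENCY * 10))
  python benchmark_serving.py \
    --backend openai-chat \
    --base-url "$base_url" \
    --endpoint "$endpoint" \
    --model "$model" \
    --tokenizer "$model_name" \
    --num-prompts "$NUM_PROMPTS" \
    --max-concurrency "$MAX_CONCURRENCY" \
    --dataset-name random \
    --random-input-len 1000 \
    --random-output-len 1000 \
    --percentile_metrics ttft,tpot,itl,e2el
done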

Summary: Pricing, Performance, and Features

This section provides a detailed analysis of prominent LLM inference providers, highlighting their strengths across key metrics.

Baseten

  • Pricing & Throughput: Competitive pricing with solid performance.
  • Free Tier: $1 credits
  • Rate Limits: 15 RPM and 100,000 TPM (tokens per minute) for unverified accounts; 120 RPM and 1,000,000 TPM for business accounts. Custom enterprise tier available
  • Notable Features: HIPAA compliance, multi-region scaling, SGLang, speculative decoding, FP8, Truss Chains

Lambda Inference

  • Pricing & Throughput: Very low cost, especially for smaller LLaMA models
  • Free Tier: No free credits
  • Rate Limits: None — unlimited requests supported by serverless, auto-scaling backend
  • Notable Features: Scalable API, supports 1M token context, AI-first infra stack, dedicated GPU options

Fireworks AI

  • Pricing & Throughput: Lower latency and cost via proprietary FireAttention
  • Free Tier: $1 credits
  • Rate Limits: guaranteed 60 RPM; increases automatically with usage
  • Notable Features: FireAttention engine; supports >100 models, including multimodal; HIPAA/SOC2 compliant

DeepInfra

  • Pricing & Throughput: Standard vs Turbo tiers; Turbo offers higher throughput at a premium.
  • Free Tier: No general tier, but 1B tokens for startups via DeepStart.
  • Rate Limits: 200 concurrent requests per model; auto-scales based on demand
  • Notable Features: Multi-region, pay-per-use, dedicated deployments (A100/H100/H200), self-serve API platform

Together AI

  • Pricing & Throughput: Competitive tiered pricing across open-source models
  • Free Tier: $1 credits
  • Rate Limits: 60 RPM for accounts with <$25 in spend; scales automatically with usage/spend. Some models have lower limits. Request-rate limits are also enforced per second, even though the rates listed on the site are per minute, which makes burst workloads harder to handle
  • Notable Features: Real-time usage tracking, open model access, tiered rate limits

Groq

  • Pricing & Throughput: Extremely high throughput via proprietary LPU hardware
  • Free Tier: Free API key
  • Rate Limits: 10–30 RPM (free); up to 1000 RPM on higher developer tiers
  • Notable Features: Simplified deployment, supports LLaMA, Gemma, Whisper; excels in raw speed

SambaNova

  • Pricing & Throughput: Premium pricing; optimized for enterprise use
  • Free Tier: $5 credits valid for 90 days
  • Rate Limits: Depend on the model, starting from 5 RPM (free); up to 480 RPM on the paid tier, with higher rates available on higher tiers
  • Notable Features: Dedicated hardware, LLaMA and Qwen support, fine-tuning, compliance-grade infrastructure

CloudRift

  • Pricing & Throughput: Cheapest for LLaMA 4 and DeepSeek R1/V3 models
  • Free Tier: None
  • Rate Limits: None — fully usage-based with no caps or queues
  • Notable Features: Instant access, high-performance math and code models, no reservations required

Other Notable Providers

It is not practical to benchmark all available providers on the market. Here is a list of a few notable ones that offer inference for open-source models.

AWS Bedrock: Offers flexible pricing (on-demand, batch, provisioned) and support for custom model import. Includes tools like Guardrails, Knowledge Bases, and Agents. Scales to zero and supports benchmarking via LLMPerf.

Google Vertex AI / AI Studio: Vertex AI provides enterprise-grade reliability and advanced features, while AI Studio offers cheaper, lightweight usage. Supports provisioned throughput and tiered rate limits.

Cerebras: Free access to models like LLaMA 3.1 8B and 3.3 70B (8K context). High throughput demonstrated with LLaMA 405B (~1k tokens/s).

Cloudflare Workers AI: Lightweight, cost-effective inference with a free tier (10,000 neurons/day). Geared toward simple and fast edge AI deployments.

Best Providers Based on Price, Rate Limits, and Throughput

The choice of an LLM inference provider is a multifaceted decision. There is no single best provider, as the ideal solution depends on the priority given to cost, throughput, and advanced features.

Best for Price

For cost-efficiency, especially with open-source models:

  • Lambda Inference: Offers highly competitive rates, particularly for models like Llama-3.1-8B
  • DeepInfra (Standard): Provides a cost-effective option for general use
  • CloudRift: Offers very low pricing for models like Llama-4-Maverick, DeepSeek-R1-0528, and DeepSeek-V3 (promo period)

Best for Rate Limits

While most providers allow for enterprise solutions with unlimited rate limits, it can be challenging to increase the rate limit when first starting. The following providers support high rate limits out of the box:

  • Lambda Inference: No rate limits, high throughput for burst workloads
  • DeepInfra: High rate limits, supports 200 simultaneous requests
  • CloudRift: No rate limits, high-throughput for burst workloads

Best for Throughput

For applications demanding high speed and low latency:

  • Groq: Consistently leads in raw throughput due to its LPU (Language Processing Unit) architecture
  • DeepInfra (Turbo): Demonstrates outstanding throughput for Llama 4 Maverick
  • SambaNova: Demonstrates strong throughput on a variety of models, especially on Llama 4 Maverick 17B Instruct

Conclusion

LPU-based providers deliver the highest out-of-the-box throughput, making them ideal for speed-critical applications. Some GPU-based providers — like DeepInfra Turbo — come close on select models. Interestingly, the cost per token between LPU and GPU providers is often comparable.

The key trade-off is flexibility vs. raw speed: GPU providers typically offer more freedom to switch between models, while LPU providers focus on maximizing throughput for the ones they support.

It’s also worth noting that many providers modify the models they serve to optimize for speed, latency, or cost, and may limit the model's context size. LPU providers, in particular, are more likely to modify models to better suit their custom hardware. As a result, it’s always best to test a model against your actual application before committing.

Finally, rate limits can heavily constrain throughput in practice. In many cases — especially for smaller teams or new users — there’s no quick or easy way to get around them.
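In practice, the usual client-side mitigation is to respect a provider's 429 responses and back off between retries; this smooths out bursts but does not raise the effective cap. A minimal sketch (the endpoint variables and payload are placeholders, not tied to any particular provider):

# Retry a single OpenAI-compatible request with exponential backoff on HTTP 429.
for attempt in 1 2 3 4 5; do
  status=$(curl -s -o /tmp/response.json -w "%{http_code}" \
    "$base_url$endpoint" \
    -H "Authorization: Bearer $OPENAI_API_KEY" \
    -H "Content-Type: application/json" \
    -d '{"model": "'"$model"'", "messages": [{"role": "user", "content": "ping"}]}')
  if [ "$status" != "429" ]; then
    break                     # success, or a non-rate-limit error worth inspecting
  fi
  sleep $((2 ** attempt))     # wait 2, 4, 8, 16, 32 seconds between attempts
done
cat /tmp/response.json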

