Quick LLM API pricing comparison

I put this together for my own personal projects to compare practical model options across providers. The main question is whether it is worth going deeper into hardware-focused inference providers for speed, and how much extra that speed costs versus standard API routes. I may have missed some details or better options while pulling and analyzing the data, of course.

Data snapshot collected and compared on: 2026-02-25

Scope

  • Compare selected LLM API prices using a consistent unit, with notes I found interesting for my use cases.
  • Normalize all token prices to USD per 1M tokens.
  • For hardware-focused providers in this post (Groq, SambaNova, Cerebras) I checked only text models.
  • There are tons of benchmarks and endless arguments about them, so there is no point overanalyzing for a quick check. I use scores from Artificial Analysis (as of 2026-02-25) as a rough proxy for relative quality. My personal model preferences mostly (but not fully) correlate with them.
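The normalization in the second bullet is trivial but worth pinning down once; a minimal sketch, where the unit names are my own placeholders rather than anything a provider page uses:

```python
# Normalize token prices quoted in different units to USD per 1M tokens.
# The unit keys below are illustrative placeholders, not provider terminology.

UNIT_TOKENS = {
    "per_1k": 1_000,
    "per_1m": 1_000_000,
}

def to_usd_per_million(price: float, unit: str) -> float:
    """Convert a quoted price to USD per 1M tokens."""
    return price * (1_000_000 / UNIT_TOKENS[unit])

# e.g. $0.0005 per 1k tokens is $0.50 per 1M tokens
print(to_usd_per_million(0.0005, "per_1k"))
print(to_usd_per_million(3.0, "per_1m"))  # already per-1M, unchanged
```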

Raw data first

Pricing comparison

Provider Model Input ($/1M tokens) Output ($/1M tokens) Context window Notes Source
Google gemini-3.1-pro-preview 2.00 (<=200k prompt), 4.00 (>200k prompt) 12.00 (<=200k prompt), 18.00 (>200k prompt) Tiered at 200k prompt tokens Google Search: 5,000 prompts/month free, then 14.00 per 1,000 search queries https://ai.google.dev/gemini-api/docs/pricing
Google gemini-3-pro-preview 2.00 (<=200k prompt), 4.00 (>200k prompt) 12.00 (<=200k prompt), 18.00 (>200k prompt) Tiered at 200k prompt tokens Google Search: 5,000 prompts/month free, then 14.00 per 1,000 search queries https://ai.google.dev/gemini-api/docs/pricing
Google gemini-3-flash-preview 0.50 (text/image/video), 1.00 (audio) 3.00 Not specified here Google Search pricing same as Gemini 3 Pro/3.1 Pro note above https://ai.google.dev/gemini-api/docs/pricing
Google gemini-2.5-flash 0.30 (text/image/video), 1.00 (audio) 2.50 Not specified here Search: 1,500 RPD free (shared with Flash-Lite RPD), then 35.00 per 1,000 grounded prompts https://ai.google.dev/gemini-api/docs/pricing
Anthropic Claude Opus 4.6 5.00 25.00 Not specified here Standard API pricing https://platform.claude.com/docs/en/about-claude/pricing
Anthropic Claude Opus 4.5 5.00 25.00 Not specified here Standard API pricing https://platform.claude.com/docs/en/about-claude/pricing
Anthropic Claude Sonnet 4.6 3.00 15.00 Not specified here Standard API pricing https://platform.claude.com/docs/en/about-claude/pricing
Anthropic Claude Haiku 4.5 1.00 5.00 Not specified here Standard API pricing https://platform.claude.com/docs/en/about-claude/pricing
OpenAI GPT-5.2 1.75 14.00 Not specified here Standard API pricing https://openai.com/api/pricing/
OpenAI GPT-5.2 pro 21.00 168.00 Not specified here Standard API pricing https://openai.com/api/pricing/
OpenAI GPT-5 mini 0.25 2.00 Not specified here Standard API pricing https://openai.com/api/pricing/
OpenAI gpt-4o 2.50 10.00 Not specified here Included for comparison https://openai.com/api/pricing/
Groq GPT OSS 20B 128k 0.075 0.30 128k Current speed listed on page (see speed table below) https://groq.com/pricing
Groq GPT OSS Safeguard 20B 0.075 0.30 Not specified here Current speed listed on page (see speed table below) https://groq.com/pricing
Groq GPT OSS 120B 128k 0.15 0.60 128k Current speed listed on page (see speed table below) https://groq.com/pricing
Groq Kimi K2-0905 1T 256k 1.00 3.00 256k Current speed listed on page (see speed table below) https://groq.com/pricing
Groq Llama 4 Scout (17Bx16E) 128k 0.11 0.34 128k Current speed listed on page (see speed table below) https://groq.com/pricing
Groq Llama 4 Maverick (17Bx128E) 128k 0.20 0.60 128k Current speed listed on page (see speed table below) https://groq.com/pricing
Groq Qwen3 32B 131k 0.29 0.59 131k Current speed listed on page (see speed table below) https://groq.com/pricing
Groq Llama 3.3 70B Versatile 128k 0.59 0.79 128k Current speed listed on page (see speed table below) https://groq.com/pricing
Groq Llama 3.1 8B Instant 128k 0.05 0.08 128k Current speed listed on page (see speed table below) https://groq.com/pricing
SambaNova DeepSeek-R1-Distill-Llama-70B 0.70 1.40 Not specified here Text model https://cloud.sambanova.ai/plans/pricing
SambaNova DeepSeek-V3-0324 3.00 4.50 Not specified here Text model https://cloud.sambanova.ai/plans/pricing
SambaNova DeepSeek-V3.1 3.00 4.50 Not specified here Text model https://cloud.sambanova.ai/plans/pricing
SambaNova DeepSeek-V3.1-cb 0.15 0.75 Not specified here Text model https://cloud.sambanova.ai/plans/pricing
SambaNova DeepSeek-V3.1-Terminus 3.00 4.50 Not specified here Text model https://cloud.sambanova.ai/plans/pricing
SambaNova DeepSeek-V3.2 3.00 4.50 Not specified here Text model https://cloud.sambanova.ai/plans/pricing
SambaNova E5-Mistral-7B-Instruct 0.13 0.00 Not specified here Text model https://cloud.sambanova.ai/plans/pricing
SambaNova gpt-oss-120b 0.22 0.59 Not specified here Text model https://cloud.sambanova.ai/plans/pricing
SambaNova Llama-3.3-Swallow-70B-Instruct-v0.4 0.60 1.20 Not specified here Text model https://cloud.sambanova.ai/plans/pricing
SambaNova Llama-4-Maverick-17B-128E-Instruct 0.63 1.80 Not specified here Text model https://cloud.sambanova.ai/plans/pricing
SambaNova Meta-Llama-3.1-8B-Instruct 0.10 0.20 Not specified here Text model https://cloud.sambanova.ai/plans/pricing
SambaNova Meta-Llama-3.3-70B-Instruct 0.60 1.20 Not specified here Text model https://cloud.sambanova.ai/plans/pricing
SambaNova MiniMax-M2.5 0.30 1.20 Not specified here Text model https://cloud.sambanova.ai/plans/pricing
SambaNova Qwen3-235B 0.40 0.80 Not specified here Text model https://cloud.sambanova.ai/plans/pricing
SambaNova Qwen3-32B 0.40 0.80 Not specified here Text model https://cloud.sambanova.ai/plans/pricing
Cerebras ZAI GLM 4.7* 2.25 2.75 Not specified here Preview model marker * shown on pricing page https://www.cerebras.ai/pricing
Cerebras GPT OSS 120B 0.35 0.75 Not specified here Standard row in developer tier pricing table https://www.cerebras.ai/pricing
Cerebras Llama 3.1 8B 0.10 0.10 Not specified here Standard row in developer tier pricing table https://www.cerebras.ai/pricing
Cerebras Qwen 3 235B Instruct* 0.60 1.20 Not specified here Preview model marker * shown on pricing page https://www.cerebras.ai/pricing

I mostly picked models that interest me, plus text models available on hardware providers and served at very high speed relative to normal GPU serving. Obviously there is more to analyze, and it is also possible to pull private API speeds. Maybe later :)

Rows without benchmark scores (notably several SambaNova DeepSeek variants) are kept as pricing references and are not used in the score-based picks.

Stated generation speed

Provider Model / Scope Speed Notes Source
Groq GPT OSS 20B 128k 1,000 TPS Listed as “Current Speed (Tokens per Second)” https://groq.com/pricing
Groq GPT OSS Safeguard 20B 1,000 TPS Listed as “Current Speed (Tokens per Second)” https://groq.com/pricing
Groq GPT OSS 120B 128k 500 TPS Listed as “Current Speed (Tokens per Second)” https://groq.com/pricing
Groq Kimi K2-0905 1T 256k 200 TPS Listed as “Current Speed (Tokens per Second)” https://groq.com/pricing
Groq Llama 4 Scout (17Bx16E) 128k 594 TPS Listed as “Current Speed (Tokens per Second)” https://groq.com/pricing
Groq Llama 4 Maverick (17Bx128E) 128k 562 TPS Listed as “Current Speed (Tokens per Second)” https://groq.com/pricing
Groq Qwen3 32B 131k 662 TPS Listed as “Current Speed (Tokens per Second)” https://groq.com/pricing
Groq Llama 3.3 70B Versatile 128k 394 TPS Listed as “Current Speed (Tokens per Second)” https://groq.com/pricing
Groq Llama 3.1 8B Instant 128k 840 TPS Listed as “Current Speed (Tokens per Second)” https://groq.com/pricing
Cerebras ZAI GLM 4.7* ~1000 tokens/s Developer tier pricing table https://www.cerebras.ai/pricing
Cerebras GPT OSS 120B ~3000 tokens/s Developer tier pricing table https://www.cerebras.ai/pricing
Cerebras Llama 3.1 8B ~2200 tokens/s Developer tier pricing table https://www.cerebras.ai/pricing
Cerebras Qwen 3 235B Instruct* ~1400 tokens/s Developer tier pricing table https://www.cerebras.ai/pricing

Personal local hardware runs (gpt-oss-120b, llama.cpp server)

These are my own rough measurements and are not directly comparable to provider-reported benchmarks:

  • MacBook Pro M4 Max (16 cores): ~300 TPS input (prefill), ~70 TPS output (decode)
  • NVIDIA GB10: ~1,000 TPS input (prefill), ~50 TPS output (decode)

Prefill and decode are different phases. Prefill parallelizes over prompt tokens; decode is sequential and often bottlenecked by KV-cache traffic, kernel efficiency, quantization path, and small-batch utilization. So it is normal to see input TPS and output TPS behave differently.

For hardware context:

  • NVIDIA DGX Spark lists 273 GB/s memory bandwidth: https://www.nvidia.com/en-us/products/workstations/dgx-spark/
  • Apple M4 Max memory bandwidth depends on configuration (for example 410 GB/s and 546 GB/s variants): https://support.apple.com/en-us/121553

So bandwidth alone is probably not the full explanation for my prefill/decode imbalance. Something to dig deeper into later too.

Shared models (hardware providers + 2-3 cheapest OpenRouter models)

This is a model-family normalization. There is no quantization note for GPT OSS 120B because it ships in MXFP4 out of the box.

Normalized model family Groq ($/1M in, out) SambaNova ($/1M in, out) Cerebras ($/1M in, out) OpenRouter routes ($/1M in, out, TPS) Cheapest input (all) Cheapest output (all)
GPT OSS 120B 0.15, 0.60 0.22, 0.59 0.35, 0.75 DeepInfra 0.039, 0.19 @ 70 TPS; NovitaAI 0.05, 0.25 @ 58 TPS OpenRouter (DeepInfra) OpenRouter (DeepInfra)
Kimi K2.5 DeepInfra 0.45, 2.25 @ 15 TPS; SiliconFlow 0.45, 2.25 @ 12 TPS; Nebius 0.50, 2.50 @ 9.5 TPS. Direct moonshot.ai: 0.10, 0.60 (TPS not listed) Direct moonshot.ai Direct moonshot.ai
GLM-5 SiliconFlow 0.95, 2.55 @ 29 TPS (FP8, 204.8k in/131k out); AtlasCloud 0.95, 3.15 @ 28 TPS (FP8, similar context); Friendli 1.00, 3.20 @ 22 TPS (202.8k in/out). Direct z.ai: 1.00, 3.20 OpenRouter (SiliconFlow/AtlasCloud tie) OpenRouter (SiliconFlow)
GLM-4.7 2.25, 2.75 io.net 0.30, 1.40 @ 45 TPS (BF16, 202.8k); DeepInfra 0.40, 1.75 @ 31 TPS (FP4, 202.8k); Nebius 0.40, 2.00 @ 46 TPS (FP8, 202.8k). Direct z.ai: 0.60, 2.20 OpenRouter (io.net) OpenRouter (io.net)
Llama 3.1 8B 0.05, 0.08 0.10, 0.20 0.10, 0.10 NovitaAI 0.02, 0.05 @ 29 TPS (FP8); DeepInfra 0.02, 0.05 @ 36.5 TPS (BF16) OpenRouter (DeepInfra/NovitaAI tie) OpenRouter (DeepInfra/NovitaAI tie)
Llama 4 Maverick 17B-128E 0.20, 0.60 0.63, 1.80 DeepInfra 0.15, 0.60 @ 30 TPS (BF16); NovitaAI 0.27, 0.85 @ 42 TPS (FP8) OpenRouter (DeepInfra) Groq / OpenRouter DeepInfra (tie)
Qwen3 32B 0.29, 0.59 0.40, 0.80 DeepInfra 0.08, 0.28 @ 46 TPS (FP8); Nebius 0.10, 0.30 @ 8 TPS (FP8); NovitaAI 0.10, 0.45 @ 36 TPS (FP8, 20k max output) OpenRouter (DeepInfra) OpenRouter (DeepInfra)
Qwen3 235B 0.40, 0.80 0.60, 1.20 DeepInfra 0.071, 0.10 @ 14 TPS (FP8); Weights & Biases 0.10, 0.10 @ 23 TPS (BF16); NovitaAI 0.09, 0.58 @ 14 TPS (FP8, 16.4k max output) OpenRouter (DeepInfra) OpenRouter (DeepInfra / Weights & Biases tie)

OpenRouter model pages used:

  • https://openrouter.ai/openai/gpt-oss-120b?sort=price
  • https://openrouter.ai/meta-llama/llama-3.1-8b-instruct?sort=price
  • https://openrouter.ai/meta-llama/llama-4-maverick?sort=price
  • https://openrouter.ai/qwen/qwen3-32b?sort=price
  • https://openrouter.ai/qwen/qwen3-235b-a22b-2507?sort=price
  • Kimi K2.5 and GLM route prices/TPS were added from OpenRouter.

Artificial Analysis screenshots

Extracted scores from screenshots

Reasoning/non-reasoning tags are kept in model names where relevant.

Model General index Coding index Agentic index
Gemini 3.1 Pro Preview 57 56 59
Claude Opus 4.6 (max) 53 48 68
Claude Sonnet 4.6 (max) 51 51 63
GPT-5.2 (xhigh) 51 49 60
Claude Opus 4.5 50 48 60
GLM-5 (reasoning) 50 44 63
Gemini 3 Pro Preview (high) 48 46 52
Kimi K2.5 (reasoning) 47 40 59
Gemini 3 Flash 46 43 50
GLM-4.7 (reasoning) 42 36 55
GPT-5-mini (high) 41 35 45
GLM-5 (non-reasoning) 40 39 60
Kimi K2.5 (non-reasoning) 37 26 53
Claude Haiku 4.5 37 33 40
GLM-4.7 (non-reasoning) 34 32 54
gpt-oss-120B (high) 33 29 38
Qwen3 235B A22B 2507 (reasoning) 29 23 30
Gemini 2.5 Flash 27 22 19
Qwen3 235B A22B 2507 25 22 23
GPT-4o (Mar) 19
Llama 4 Maverick 18 16 7
Qwen3 32B 17 14 13
Llama 3.1 8B 12 5 5

General Intelligence Index

Artificial Analysis Intelligence Index benchmark snapshot

Coding Index

Artificial Analysis Coding Index benchmark snapshot

Agentic Index

Artificial Analysis Agentic Index benchmark snapshot

Key takeaways

  • Benchmark-wise for coding quality, Gemini 3.1 Pro has the top score; Sonnet 4.6 and GPT-5.2 are close alternatives.
  • For coding-heavy day-to-day use, fixed monthly plans usually beat pure token billing on effective cost at this point in time.
  • For cheap bulk processing, Llama 3.1 8B still wins pure token price, and in my experience with heavy prompting it can still do surprisingly well for simple annotation/summarization/extraction; gpt-oss-120B often buys better output for only a small cost bump.
  • Benchmark-wise for deep conversation quality, Gemini 3.1 Pro is the best default; Opus 4.6 is the premium agentic option.
  • For low-latency UX, Groq/Cerebras throughput is in a different class (hundreds to thousands of TPS), often worth a modest token premium.
  • GLM-5 and Kimi K2.5 are the most interesting value outliers, with good benchmark indices but slower OpenRouter route TPS.

Recommendations by use-case

Default routing table

Scenario Default model Upgrade model Fast-path model Ultra-cheap batch model
Coding Gemini 3.1 Pro Claude Sonnet 4.6 / GPT-5.2 Gemini 3 Flash Kimi K2.5 (reasoning) / GLM-5 (reasoning)
Mass processing gpt-oss-120B (cheap route) GPT-5 mini / Gemini 3 Flash Groq/Cerebras Llama 3.1 8B Llama 3.1 8B (DeepInfra/Novita)
Deep conversation Gemini 3.1 Pro Claude Opus 4.6 GLM-5 (reasoning) / Kimi K2.5 (reasoning)
Casual conversation GPT-5 mini / Gemini 3 Flash gpt-oss-120B (Groq) Groq Llama 3.1 8B / Groq gpt-oss-120B Llama 3.1 8B
Agents Claude Sonnet 4.6 / GPT-5.2 / Gemini 3.1 Pro Claude Opus 4.6 GLM-5 (reasoning)

Coding

Factors:

  • Coding Index for solve/compile reliability.
  • Output token cost (coding answers are often output-heavy).

Practical ladder:

  1. Benchmark leader: Gemini 3.1 Pro (Coding Index 56), with frontier pricing that is still below Opus-tier.
  2. Strong alternatives: Claude Sonnet 4.6 (Coding 51) or GPT-5.2 (Coding 49). GPT-5.2 is cheaper than Sonnet for prompt-heavy coding.
  3. Good enough and cheap: Gemini 3 Flash (Coding 43) for small refactors, snippets, and explain-this-code tasks.
  4. Budget value outliers: GLM-5 (reasoning) and Kimi K2.5 (reasoning), if you can tolerate lower route TPS and occasional misses.

Personal experience note:

  • In my own use, Gemini is strong on LeetCode-style coding but less consistent as an independent coding agent for project-aware restructuring and style-matching. Results seem to vary a lot across users, prompting style, and workflow setup.

For a typical coding call (10k input + 2k output):

  • Gemini 3.1 Pro: ~$0.044
  • GPT-5.2: ~$0.0455
  • Claude Sonnet 4.6: ~$0.060
  • Gemini 3 Flash: ~$0.011

Pricing model note (important for real coding usage):

  • Today it is generally better to use fixed plans for coding rather than pay per token.
  • Most major vendors with coding assistants (Claude, Codex/OpenAI tooling, GLM ecosystem, and others) offer fixed plans with limits that are usually high enough for individual workflows.

  • Llama 3.1 8B was not made for coding.
  • gpt-oss-120B is alright for local coding support (Coding Index 29).

For coding, the ladder is clear: Flash/mini for cheap edits, Gemini 3.1 Pro when correctness matters, Sonnet/GPT-5.2 as strong alternates.

Quick mass processing for simple tasks

What matters:

  • Cost per item and throughput (items/second).

Short answer:

  • Llama 3.1 8B still wins pure token economics.
  • gpt-oss-120B is often a better default now: still cheap, substantially stronger, and can be faster depending on route/provider.

Per-item comparison for 1k input + 100 output:

  • Llama 3.1 8B (cheapest OpenRouter route): ~$0.000025/item (~40k items per $1), about 30-36 TPS on listed routes.
  • gpt-oss-120B (DeepInfra route): ~$0.000058/item (~17k items per $1), about 70 TPS, and much stronger benchmark profile.
  • Llama 3.1 8B (Groq): ~$0.000058/item, ~840 TPS.
  • Llama 3.1 8B (Cerebras): ~$0.00011/item, ~2200 tokens/s.
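The per-item numbers above come straight from the token prices; a quick sketch, plugging in the $/1M figures from the routes listed above:

```python
# Per-item cost for a 1k-input / 100-output bulk workload.
# Prices are the $/1M-token figures from the routes listed above.

def item_cost(in_price, out_price, in_tok=1_000, out_tok=100):
    return in_tok / 1e6 * in_price + out_tok / 1e6 * out_price

# Llama 3.1 8B via cheapest OpenRouter route: 0.02 in / 0.05 out
llama_8b = item_cost(0.02, 0.05)   # ~$0.000025/item, ~40,000 items per $1
# gpt-oss-120b via DeepInfra route: 0.039 in / 0.19 out
gpt_oss = item_cost(0.039, 0.19)   # ~$0.000058/item, ~17,000 items per $1
print(round(1 / llama_8b), round(1 / gpt_oss))
```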

Recommended bulk pipeline:

  • Tier 0 (ultra-cheap): Llama 3.1 8B for easy formatting/tagging/extraction.
  • Tier 1 (still cheap, fewer retries): gpt-oss-120B for brittle schemas or light reasoning.
  • Tier 2 (correctness-heavy): GPT-5 mini or Gemini 3 Flash when failure/retry cost dominates.
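The tiered pipeline above is just "try cheap first, escalate on failure"; a minimal sketch, where `call_model` and `validate` are hypothetical placeholders (not a real client API) and the tier names only echo the picks above:

```python
# Escalating bulk pipeline: run each item through cheaper tiers first and
# only move up when the output fails validation. call_model and validate
# are hypothetical stand-ins for a real client and schema check.

TIERS = ["llama-3.1-8b", "gpt-oss-120b", "gpt-5-mini"]

def process_item(item, call_model, validate):
    """Return (model_used, result), escalating on validation failure."""
    for model in TIERS:
        result = call_model(model, item)
        if validate(result):
            return model, result
    return TIERS[-1], result  # all tiers failed: keep the last tier's answer
```

Most items should terminate at Tier 0, so the blended cost per item stays close to the cheapest route while hard items still get a stronger model.
<test>
def call_model(model, item):
    return model + ":" + str(item)
def validate(result):
    return result.startswith("gpt-oss")
assert process_item("x", call_model, validate) == ("gpt-oss-120b", "gpt-oss-120b:x")
assert process_item("x", call_model, lambda r: True)[0] == "llama-3.1-8b"
assert process_item("x", call_model, lambda r: False) == ("gpt-5-mini", "gpt-5-mini:x")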

Llama 3.1 8B still wins pure token economics, but gpt-oss-120B is the key “pay a little more, fail less” option.

Deep conversations

Factors:

  • General Intelligence Index as proxy for coherence/nuance.
  • Context tiering and output pricing.

Good options across categories:

  1. Best default (quality + reasonable cost): Gemini 3.1 Pro (General 57, output $12/1M at <=200k prompt tier).
  2. Premium quality tier: Claude Opus 4.6 (General 53, Agentic 68) if cost is secondary.
  3. Budget surprises: GLM-5 (reasoning) and Kimi K2.5 (reasoning), with strong indices for listed prices but lower route TPS.

Personal preference note:

  • For chat quality/feel, my own preference is ChatGPT Pro (the subscription; not rich enough to try the API) > Opus 4.6 > Gemini Pro, even though the data ranks Gemini 3.1 Pro highest by the numbers.

For long, high-quality conversations, Gemini 3.1 Pro is the clean default. Opus is the premium experience.

Casual conversation

Factors:

  • Cost per turn and streaming latency.

Picks:

  1. Cheap + good proprietary defaults: GPT-5 mini or Gemini 3 Flash.
  2. Snappy streaming first: Groq routes (especially Llama 3.1 8B or gpt-oss-120B).
  3. Local privacy/offline option: local gpt-oss-120B is already in a usable chat band from my own runs.

Agents

Factors:

  • Agentic Index and cost per step (agent loops amplify both token cost and failure cost).

Picks:

  1. Max reliability: Claude Opus 4.6 (Agentic 68), expensive but highest score.
  2. Strong mid-tier brains: Claude Sonnet 4.6, GPT-5.2, Gemini 3.1 Pro.
  3. Budget outlier: GLM-5 (reasoning), strong Agentic score for its listed prices, with lower route TPS.
  4. Better as subroutines than planners: Llama 3.1 8B / Llama 4 Maverick / Qwen3 32B / Qwen3 235B (low Agentic scores here).

Agent workloads amplify both token and failure costs. Opus is premium, Sonnet/GPT-5.2/Gemini are strong mid-tier, GLM-5 is the most interesting budget agentic outlier.
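To make the multiplicative point concrete, here is a rough sketch of run-level cost; the 20-step count and the 10% retry rate are made-up illustrations, not measured numbers:

```python
# Agent runs multiply per-step token cost by step count, and steps that
# fail and must be retried amplify it further. The retry model here is a
# simple geometric expectation; the 10% rate is an invented illustration.

def run_cost(per_step_cost, steps, retry_rate=0.0):
    expected_calls = steps / (1 - retry_rate)  # expected calls incl. retries
    return per_step_cost * expected_calls

# 20-step run at a ~$0.0165 per-step cost (the 3k+500 Sonnet 4.6 scenario):
print(round(run_cost(0.0165, 20), 4))        # 0.33
print(round(run_cost(0.0165, 20, 0.10), 4))  # 0.3667
```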

Typical token-cost scenarios (token-cost only)

Formula used:

  • cost = (input_tokens / 1,000,000 * input_price) + (output_tokens / 1,000,000 * output_price)
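The same formula as a small helper, checked against a couple of the chat-turn numbers:

```python
# Token-only cost of one call; prices are $ per 1M tokens from the tables.

def call_cost(input_tokens, output_tokens, input_price, output_price):
    return (input_tokens / 1e6 * input_price
            + output_tokens / 1e6 * output_price)

# Chat turn (3k input + 1k output):
print(call_cost(3_000, 1_000, 0.25, 2.00))   # GPT-5 mini, ~0.00275
print(call_cost(3_000, 1_000, 2.00, 12.00))  # Gemini 3.1 Pro <=200k tier, ~0.018
```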

Chat turn (baseline): 3k input + 1k output

Model Cost per turn
GPT-5 mini ~$0.00275
Gemini 3 Flash ~$0.00450
Gemini 3.1 Pro (<=200k tier) ~$0.01800
Claude Opus 4.6 ~$0.04000

Mass processing: 1k input + 100 output

Model/route Cost per item Items per $1 Notes
Llama 3.1 8B (OpenRouter DeepInfra/Novita) ~$0.000025 ~40,000 Cheapest token route
gpt-oss-120B (OpenRouter DeepInfra) ~$0.000058 ~17,241 Better quality/success profile
Llama 3.1 8B (Groq) ~$0.000058 ~17,241 Much faster streaming
Llama 3.1 8B (Cerebras) ~$0.000110 ~9,091 Throughput-focused option

Coding: 10k input + 2k output

Model Cost per call
Gemini 3.1 Pro (<=200k tier) ~$0.0440
GPT-5.2 ~$0.0455
Claude Sonnet 4.6 ~$0.0600
Gemini 3 Flash ~$0.0110

Agent step: 3k input + 500 output

Model/route Cost per step
Claude Opus 4.6 ~$0.0275
Claude Sonnet 4.6 ~$0.0165
GPT-5.2 ~$0.01225
Gemini 3.1 Pro (<=200k tier) ~$0.0120
GLM-5 (OpenRouter SiliconFlow route) ~$0.004125

Pareto-front view (textual)

  • Coding Index: Gemini 3.1 Pro is the top score; Sonnet/GPT-5.2 are close; Gemini 3 Flash and GPT-5 mini are cheap practical points.
  • Agentic Index: Opus is the top point; Sonnet/GLM-5/GPT-5.2 cluster as strong alternatives at lower cost.
  • Bulk processing: Llama 3.1 8B gives the leftmost cost point; gpt-oss-120B often gives the better cost-to-reliability trade.

Caveats

  • Vendor TPS is not apples-to-apples (different batch sizes, output lengths, queueing, and serving conditions).
  • Tokenization and effective context behavior vary by model/provider.
  • Agent cost is multiplicative by number of calls, not just per-call token prices.
  • Route-level prices/TPS can change quickly; treat everything here as a snapshot from 2026-02-25.

Conclusions

  • Output pricing still dominates many real workloads, especially coding and long-form conversations.
  • Gemini 3.1 Pro is the best default frontier pick today across quality and token economics.
  • GPT-5.2 vs Sonnet is close on benchmark scores, with GPT-5.2 usually cheaper in prompt-heavy flows.
  • GLM-5 is the biggest anomaly here: Sonnet-tier Agentic score in this table with much lower listed route pricing, at lower TPS.
  • Kimi K2.5 looks like a second anomaly if direct 0.10 / 0.60 pricing is stable.
  • Open-weight fast routes are excellent for cheap subroutines and throughput, but still weak as primary agent planners.
  • Local gpt-oss-120B decode (~70 TPS in my runs) is roughly on par with the cheapest cloud route in this table, while hosted hardware providers still dominate absolute decode throughput.
  • For coding-heavy daily work, fixed monthly plans are usually the best deal in this timeframe; token APIs stay useful as overflow, batch, or integration rails.

Personal hardware usage

For my local llama.cpp server runs with gpt-oss-120b, the two machines behave like different profiles:

  • M4 Max: ~300 TPS prefill / ~70 TPS decode
  • NVIDIA GB10: ~1000 TPS prefill / ~50 TPS decode

Interpretation:

  • Prefill TPS mostly affects time to first token (especially with large prompts).
  • Decode TPS mostly affects streaming speed after generation starts.

Quick intuition table (sequential, no batching):

Scenario (prompt / output) M4 Max time GB10 time Feels better
Simple extraction (1k / 100) ~4.8s ~3.0s GB10
Casual chat (1k / 200) ~6.2s ~5.0s GB10
Deep chat turn (3k / 1k) ~24.3s ~23.0s Slight GB10
Coding with context (10k / 1.5k) ~54.8s ~40.0s GB10 (big)
Agent step (4k / 300) ~17.6s ~10.0s GB10 (big)

These are estimated as:

  • total_time ~= prompt_tokens / prefill_TPS + output_tokens / decode_TPS
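In code, with my measured rates (rough measurements, sequential, no batching):

```python
# Estimate total response time from measured prefill/decode rates (TPS).
# Rates are my own rough llama.cpp measurements, not vendor numbers.

RATES = {"m4_max": (300, 70), "gb10": (1000, 50)}  # (prefill_TPS, decode_TPS)

def total_time(machine, prompt_tokens, output_tokens):
    prefill, decode = RATES[machine]
    return prompt_tokens / prefill + output_tokens / decode

# Simple extraction (1k prompt / 100 output):
print(round(total_time("m4_max", 1_000, 100), 1))  # 4.8
print(round(total_time("gb10", 1_000, 100), 1))    # 3.0
```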

Rule of thumb from these measurements:

  • GB10 wins when prompts are large and outputs are moderate (RAG/context-heavy coding, agent traces, tool logs).
  • M4 Max catches up only when output is very large relative to prompt.
  • Crossover is around output_tokens >= 0.4 * prompt_tokens.
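The ~0.4 crossover falls out of setting the two machines' total_time estimates equal and solving for the output/prompt ratio:

```python
# Solve p/gb_prefill + o/gb_decode = p/m4_prefill + o/m4_decode for o/p:
#   o/p = (1/m4_prefill - 1/gb_prefill) / (1/gb_decode - 1/m4_decode)

m4_prefill, m4_decode = 300, 70
gb_prefill, gb_decode = 1000, 50

ratio = (1 / m4_prefill - 1 / gb_prefill) / (1 / gb_decode - 1 / m4_decode)
print(round(ratio, 2))  # 0.41
```

Below that ratio GB10's faster prefill wins; above it, M4 Max's faster decode takes over.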

Where local hardware fits best:

  • Coding: good offline copilot for refactors, boilerplate, and iteration without cost anxiety; still use stronger frontier models when correctness is critical.
  • Mass processing: GB10 is the stronger local box for prefill-heavy pipelines; for pure simple bulk, smaller 8B cloud routes are usually more efficient.
  • Deep conversations: very usable for private long-form chat, but quality/speed still below frontier hosted options.
  • Casual conversation: good privacy-first default; not in the same latency class as Groq/Cerebras.
  • Agents: better as a cheap local subroutine executor than a primary planner brain.

Practical takeaway:

  • Local gpt-oss-120b sits in a useful “privacy-first, no-marginal-cost, mid-speed” tier.
  • It complements cloud inference well: use local for private/iterative inner loops, cloud for highest reliability or fastest UX.
Written on February 25, 2026