Quick LLM API pricing comparison

I put this together for my own personal projects to compare practical model options across providers. The main question is whether it is worth going deeper into hardware-focused inference providers for speed, and how much extra that speed costs versus standard API routes. I may have missed some details or better options while pulling and analyzing the data, of course.

Data snapshot collected and compared on: 2026-02-25

Scope

  • Compare selected LLM API prices using a consistent unit, with notes I found interesting for my use cases.
  • Normalize all token prices to USD per 1M tokens.
  • For hardware-focused providers in this post (Groq, SambaNova, Cerebras) I checked only text models.
  • There are tons of benchmarks and endless arguments about them, so there is no point overanalyzing for a quick check. I use scores from Artificial Analysis (as of 2026-02-25) as a rough proxy for relative quality. My personal model preferences mostly (but not fully) correlate with them.
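The normalization in the second bullet is trivial but worth pinning down once; a minimal sketch, where the unit names are my own placeholders rather than anything a provider page uses:

```python
# Normalize token prices quoted in different units to USD per 1M tokens.
# The unit keys below are illustrative placeholders, not provider terminology.

UNIT_TOKENS = {
    "per_1k": 1_000,
    "per_1m": 1_000_000,
}

def to_usd_per_million(price: float, unit: str) -> float:
    """Convert a quoted price to USD per 1M tokens."""
    return price * (1_000_000 / UNIT_TOKENS[unit])

# e.g. $0.0005 per 1k tokens is $0.50 per 1M tokens
print(to_usd_per_million(0.0005, "per_1k"))
print(to_usd_per_million(3.0, "per_1m"))  # already per-1M, unchanged
```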

Raw data first

Pricing comparison

Provider Model Input ($/1M tokens) Output ($/1M tokens) Context window Notes Source
Google gemini-3.1-pro-preview 2.00 (<=200k prompt), 4.00 (>200k prompt) 12.00 (<=200k prompt), 18.00 (>200k prompt) Tiered at 200k prompt tokens Google Search: 5,000 prompts/month free, then 14.00 per 1,000 search queries https://ai.google.dev/gemini-api/docs/pricing
Google gemini-3-pro-preview 2.00 (<=200k prompt), 4.00 (>200k prompt) 12.00 (<=200k prompt), 18.00 (>200k prompt) Tiered at 200k prompt tokens Google Search: 5,000 prompts/month free, then 14.00 per 1,000 search queries https://ai.google.dev/gemini-api/docs/pricing
Google gemini-3-flash-preview 0.50 (text/image/video), 1.00 (audio) 3.00 Not specified here Google Search pricing same as Gemini 3 Pro/3.1 Pro note above https://ai.google.dev/gemini-api/docs/pricing
Google gemini-2.5-flash 0.30 (text/image/video), 1.00 (audio) 2.50 Not specified here Search: 1,500 RPD free (shared with Flash-Lite RPD), then 35.00 per 1,000 grounded prompts https://ai.google.dev/gemini-api/docs/pricing
Anthropic Claude Opus 4.6 5.00 25.00 Not specified here Standard API pricing https://platform.claude.com/docs/en/about-claude/pricing
Anthropic Claude Opus 4.5 5.00 25.00 Not specified here Standard API pricing https://platform.claude.com/docs/en/about-claude/pricing
Anthropic Claude Sonnet 4.6 3.00 15.00 Not specified here Standard API pricing https://platform.claude.com/docs/en/about-claude/pricing
Anthropic Claude Haiku 4.5 1.00 5.00 Not specified here Standard API pricing https://platform.claude.com/docs/en/about-claude/pricing
OpenAI GPT-5.2 1.75 14.00 Not specified here Standard API pricing https://openai.com/api/pricing/
OpenAI GPT-5.2 pro 21.00 168.00 Not specified here Standard API pricing https://openai.com/api/pricing/
OpenAI GPT-5 mini 0.25 2.00 Not specified here Standard API pricing https://openai.com/api/pricing/
OpenAI gpt-4o 2.50 10.00 Not specified here Included for comparison https://openai.com/api/pricing/
Groq GPT OSS 20B 128k 0.075 0.30 128k Current speed listed on page (see speed table below) https://groq.com/pricing
Groq GPT OSS Safeguard 20B 0.075 0.30 Not specified here Current speed listed on page (see speed table below) https://groq.com/pricing
Groq GPT OSS 120B 128k 0.15 0.60 128k Current speed listed on page (see speed table below) https://groq.com/pricing
Groq Kimi K2-0905 1T 256k 1.00 3.00 256k Current speed listed on page (see speed table below) https://groq.com/pricing
Groq Llama 4 Scout (17Bx16E) 128k 0.11 0.34 128k Current speed listed on page (see speed table below) https://groq.com/pricing
Groq Llama 4 Maverick (17Bx128E) 128k 0.20 0.60 128k Current speed listed on page (see speed table below) https://groq.com/pricing
Groq Qwen3 32B 131k 0.29 0.59 131k Current speed listed on page (see speed table below) https://groq.com/pricing
Groq Llama 3.3 70B Versatile 128k 0.59 0.79 128k Current speed listed on page (see speed table below) https://groq.com/pricing
Groq Llama 3.1 8B Instant 128k 0.05 0.08 128k Current speed listed on page (see speed table below) https://groq.com/pricing
SambaNova DeepSeek-R1-Distill-Llama-70B 0.70 1.40 Not specified here Text model https://cloud.sambanova.ai/plans/pricing
SambaNova DeepSeek-V3-0324 3.00 4.50 Not specified here Text model https://cloud.sambanova.ai/plans/pricing
SambaNova DeepSeek-V3.1 3.00 4.50 Not specified here Text model https://cloud.sambanova.ai/plans/pricing
SambaNova DeepSeek-V3.1-cb 0.15 0.75 Not specified here Text model https://cloud.sambanova.ai/plans/pricing
SambaNova DeepSeek-V3.1-Terminus 3.00 4.50 Not specified here Text model https://cloud.sambanova.ai/plans/pricing
SambaNova DeepSeek-V3.2 3.00 4.50 Not specified here Text model https://cloud.sambanova.ai/plans/pricing
SambaNova E5-Mistral-7B-Instruct 0.13 0.00 Not specified here Text model https://cloud.sambanova.ai/plans/pricing
SambaNova gpt-oss-120b 0.22 0.59 Not specified here Text model https://cloud.sambanova.ai/plans/pricing
SambaNova Llama-3.3-Swallow-70B-Instruct-v0.4 0.60 1.20 Not specified here Text model https://cloud.sambanova.ai/plans/pricing
SambaNova Llama-4-Maverick-17B-128E-Instruct 0.63 1.80 Not specified here Text model https://cloud.sambanova.ai/plans/pricing
SambaNova Meta-Llama-3.1-8B-Instruct 0.10 0.20 Not specified here Text model https://cloud.sambanova.ai/plans/pricing
SambaNova Meta-Llama-3.3-70B-Instruct 0.60 1.20 Not specified here Text model https://cloud.sambanova.ai/plans/pricing
SambaNova MiniMax-M2.5 0.30 1.20 Not specified here Text model https://cloud.sambanova.ai/plans/pricing
SambaNova Qwen3-235B 0.40 0.80 Not specified here Text model https://cloud.sambanova.ai/plans/pricing
SambaNova Qwen3-32B 0.40 0.80 Not specified here Text model https://cloud.sambanova.ai/plans/pricing
Cerebras ZAI GLM 4.7* 2.25 2.75 Not specified here Preview model marker * shown on pricing page https://www.cerebras.ai/pricing
Cerebras GPT OSS 120B 0.35 0.75 Not specified here Standard row in developer tier pricing table https://www.cerebras.ai/pricing
Cerebras Llama 3.1 8B 0.10 0.10 Not specified here Standard row in developer tier pricing table https://www.cerebras.ai/pricing
Cerebras Qwen 3 235B Instruct* 0.60 1.20 Not specified here Preview model marker * shown on pricing page https://www.cerebras.ai/pricing

I mostly picked models that interest me, plus text models available on hardware providers and served at very high speed relative to normal GPU serving. Obviously there is more to analyze, and it is also possible to pull private API speeds. Maybe later :)

Rows without benchmark scores (notably several SambaNova DeepSeek variants) are kept as pricing references and are not used in the score-based picks.

Stated generation speed

Provider Model / Scope Speed Notes Source
Groq GPT OSS 20B 128k 1,000 TPS Listed as “Current Speed (Tokens per Second)” https://groq.com/pricing
Groq GPT OSS Safeguard 20B 1,000 TPS Listed as “Current Speed (Tokens per Second)” https://groq.com/pricing
Groq GPT OSS 120B 128k 500 TPS Listed as “Current Speed (Tokens per Second)” https://groq.com/pricing
Groq Kimi K2-0905 1T 256k 200 TPS Listed as “Current Speed (Tokens per Second)” https://groq.com/pricing
Groq Llama 4 Scout (17Bx16E) 128k 594 TPS Listed as “Current Speed (Tokens per Second)” https://groq.com/pricing
Groq Llama 4 Maverick (17Bx128E) 128k 562 TPS Listed as “Current Speed (Tokens per Second)” https://groq.com/pricing
Groq Qwen3 32B 131k 662 TPS Listed as “Current Speed (Tokens per Second)” https://groq.com/pricing
Groq Llama 3.3 70B Versatile 128k 394 TPS Listed as “Current Speed (Tokens per Second)” https://groq.com/pricing
Groq Llama 3.1 8B Instant 128k 840 TPS Listed as “Current Speed (Tokens per Second)” https://groq.com/pricing
Cerebras ZAI GLM 4.7* ~1000 tokens/s Developer tier pricing table https://www.cerebras.ai/pricing
Cerebras GPT OSS 120B ~3000 tokens/s Developer tier pricing table https://www.cerebras.ai/pricing
Cerebras Llama 3.1 8B ~2200 tokens/s Developer tier pricing table https://www.cerebras.ai/pricing
Cerebras Qwen 3 235B Instruct* ~1400 tokens/s Developer tier pricing table https://www.cerebras.ai/pricing

Personal local hardware runs (gpt-oss-120b, llama.cpp server)

These are my own rough measurements and are not directly comparable to provider-reported benchmarks:

  • MacBook Pro M4 Max (16 cores): ~300 TPS input (prefill), ~70 TPS output (decode)
  • NVIDIA GB10: ~1,000 TPS input (prefill), ~50 TPS output (decode)

Prefill and decode are different phases. Prefill parallelizes over prompt tokens; decode is sequential and often bottlenecked by KV-cache traffic, kernel efficiency, quantization path, and small-batch utilization. So it is normal to see input TPS and output TPS behave differently.

For hardware context:

  • NVIDIA DGX Spark lists 273 GB/s memory bandwidth: https://www.nvidia.com/en-us/products/workstations/dgx-spark/
  • Apple M4 Max memory bandwidth depends on configuration (for example 410 GB/s and 546 GB/s variants): https://support.apple.com/en-us/121553

So bandwidth alone is probably not the full explanation for my prefill/decode imbalance. Something to dig deeper into later too.

Shared models (hardware providers + 2-3 cheapest OpenRouter models)

This is a model-family normalization. There is no quantization note for GPT OSS 120B because it ships in MXFP4 out of the box.

Normalized model family Groq ($/1M in, out) SambaNova ($/1M in, out) Cerebras ($/1M in, out) OpenRouter routes ($/1M in, out, TPS) Cheapest input (all) Cheapest output (all)
GPT OSS 120B 0.15, 0.60 0.22, 0.59 0.35, 0.75 DeepInfra 0.039, 0.19 @ 70 TPS; NovitaAI 0.05, 0.25 @ 58 TPS OpenRouter (DeepInfra) OpenRouter (DeepInfra)
Kimi K2.5 DeepInfra 0.45, 2.25 @ 15 TPS; SiliconFlow 0.45, 2.25 @ 12 TPS; Nebius 0.50, 2.50 @ 9.5 TPS. Direct moonshot.ai: 0.10, 0.60 (TPS not listed) Direct moonshot.ai Direct moonshot.ai
GLM-5 SiliconFlow 0.95, 2.55 @ 29 TPS (FP8, 204.8k in/131k out); AtlasCloud 0.95, 3.15 @ 28 TPS (FP8, similar context); Friendli 1.00, 3.20 @ 22 TPS (202.8k in/out). Direct z.ai: 1.00, 3.20 OpenRouter (SiliconFlow/AtlasCloud tie) OpenRouter (SiliconFlow)
GLM-4.7 2.25, 2.75 io.net 0.30, 1.40 @ 45 TPS (BF16, 202.8k); DeepInfra 0.40, 1.75 @ 31 TPS (FP4, 202.8k); Nebius 0.40, 2.00 @ 46 TPS (FP8, 202.8k). Direct z.ai: 0.60, 2.20 OpenRouter (io.net) OpenRouter (io.net)
Llama 3.1 8B 0.05, 0.08 0.10, 0.20 0.10, 0.10 NovitaAI 0.02, 0.05 @ 29 TPS (FP8); DeepInfra 0.02, 0.05 @ 36.5 TPS (BF16) OpenRouter (DeepInfra/NovitaAI tie) OpenRouter (DeepInfra/NovitaAI tie)
Llama 4 Maverick 17B-128E 0.20, 0.60 0.63, 1.80 DeepInfra 0.15, 0.60 @ 30 TPS (BF16); NovitaAI 0.27, 0.85 @ 42 TPS (FP8) OpenRouter (DeepInfra) Groq / OpenRouter DeepInfra (tie)
Qwen3 32B 0.29, 0.59 0.40, 0.80 DeepInfra 0.08, 0.28 @ 46 TPS (FP8); Nebius 0.10, 0.30 @ 8 TPS (FP8); NovitaAI 0.10, 0.45 @ 36 TPS (FP8, 20k max output) OpenRouter (DeepInfra) OpenRouter (DeepInfra)
Qwen3 235B 0.40, 0.80 0.60, 1.20 DeepInfra 0.071, 0.10 @ 14 TPS (FP8); Weights & Biases 0.10, 0.10 @ 23 TPS (BF16); NovitaAI 0.09, 0.58 @ 14 TPS (FP8, 16.4k max output) OpenRouter (DeepInfra) OpenRouter (DeepInfra / Weights & Biases tie)

OpenRouter model pages used:

  • https://openrouter.ai/openai/gpt-oss-120b?sort=price
  • https://openrouter.ai/meta-llama/llama-3.1-8b-instruct?sort=price
  • https://openrouter.ai/meta-llama/llama-4-maverick?sort=price
  • https://openrouter.ai/qwen/qwen3-32b?sort=price
  • https://openrouter.ai/qwen/qwen3-235b-a22b-2507?sort=price
  • Kimi K2.5 and GLM route prices/TPS were added from OpenRouter.

Artificial Analysis screenshots

Extracted scores from screenshots

Reasoning/non-reasoning tags are kept in model names where relevant.

Model General index Coding index Agentic index
Gemini 3.1 Pro Preview 57 56 59
Claude Opus 4.6 (max) 53 48 68
Claude Sonnet 4.6 (max) 51 51 63
GPT-5.2 (xhigh) 51 49 60
Claude Opus 4.5 50 48 60
GLM-5 (reasoning) 50 44 63
Gemini 3 Pro Preview (high) 48 46 52
Kimi K2.5 (reasoning) 47 40 59
Gemini 3 Flash 46 43 50
GLM-4.7 (reasoning) 42 36 55
GPT-5-mini (high) 41 35 45
GLM-5 (non-reasoning) 40 39 60
Kimi K2.5 (non-reasoning) 37 26 53
Claude Haiku 4.5 37 33 40
GLM-4.7 (non-reasoning) 34 32 54
gpt-oss-120B (high) 33 29 38
Qwen3 235B A22B 2507 (reasoning) 29 23 30
Gemini 2.5 Flash 27 22 19
Qwen3 235B A22B 2507 25 22 23
GPT-4o (Mar) 19
Llama 4 Maverick 18 16 7
Qwen3 32B 17 14 13
Llama 3.1 8B 12 5 5

General Intelligence Index

Artificial Analysis Intelligence Index benchmark snapshot

Coding Index

Artificial Analysis Coding Index benchmark snapshot

Agentic Index

Artificial Analysis Agentic Index benchmark snapshot

Key takeaways

  • Benchmark-wise for coding quality, Gemini 3.1 Pro has the top score; Sonnet 4.6 and GPT-5.2 are close alternatives.
  • For coding-heavy day-to-day use, fixed monthly plans usually beat pure token billing on effective cost at this point in time.
  • For cheap bulk processing, Llama 3.1 8B still wins pure token price, and in my experience with heavy prompting it can still do surprisingly well for simple annotation/summarization/extraction; gpt-oss-120B often buys better output for only a small cost bump.
  • Benchmark-wise for deep conversation quality, Gemini 3.1 Pro is the best default; Opus 4.6 is the premium agentic option.
  • For low-latency UX, Groq/Cerebras throughput is in a different class (hundreds to thousands of TPS), often worth a modest token premium.
  • GLM-5 and Kimi K2.5 are the most interesting value outliers, with good benchmark indices but slower OpenRouter route TPS.

Recommendations by use-case

Default routing table

Scenario Default model Upgrade model Fast-path model Ultra-cheap batch model
Coding Gemini 3.1 Pro Claude Sonnet 4.6 / GPT-5.2 Gemini 3 Flash Kimi K2.5 (reasoning) / GLM-5 (reasoning)
Mass processing gpt-oss-120B (cheap route) GPT-5 mini / Gemini 3 Flash Groq/Cerebras Llama 3.1 8B Llama 3.1 8B (DeepInfra/Novita)
Deep conversation Gemini 3.1 Pro Claude Opus 4.6 GLM-5 (reasoning) / Kimi K2.5 (reasoning)
Casual conversation GPT-5 mini / Gemini 3 Flash gpt-oss-120B (Groq) Groq Llama 3.1 8B / Groq gpt-oss-120B Llama 3.1 8B
Agents Claude Sonnet 4.6 / GPT-5.2 / Gemini 3.1 Pro Claude Opus 4.6 GLM-5 (reasoning)

Coding

Factors:

  • Coding Index for solve/compile reliability.
  • Output token cost (coding answers are often output-heavy).

Practical ladder:

  1. Benchmark leader: Gemini 3.1 Pro (Coding Index 56), with frontier pricing that is still below Opus-tier.
  2. Strong alternatives: Claude Sonnet 4.6 (Coding 51) or GPT-5.2 (Coding 49). GPT-5.2 is cheaper than Sonnet for prompt-heavy coding.
  3. Good enough and cheap: Gemini 3 Flash (Coding 43) for small refactors, snippets, and explain-this-code tasks.
  4. Budget value outliers: GLM-5 (reasoning) and Kimi K2.5 (reasoning), if you can tolerate lower route TPS and occasional misses.

Personal experience note:

  • In my own use, Gemini is strong on LeetCode-style coding but less consistent as an independent coding agent for project-aware restructuring and style-matching. Results seem to vary a lot across users, prompting style, and workflow setup.

For a typical coding call (10k input + 2k output):

  • Gemini 3.1 Pro: ~$0.044
  • GPT-5.2: ~$0.0455
  • Claude Sonnet 4.6: ~$0.060
  • Gemini 3 Flash: ~$0.011

Pricing model note (important for real coding usage):

  • Today it is generally better to use fixed plans for coding rather than pay per token.
  • Most major vendors with coding assistants (Claude, Codex/OpenAI tooling, GLM ecosystem, and others) offer fixed plans with limits that are usually high enough for individual workflows.

  • Llama 3.1 8B was not made for coding.
  • gpt-oss-120B is alright for local coding support (Coding Index 29).

For coding, the ladder is clear: Flash/mini for cheap edits, Gemini 3.1 Pro when correctness matters, Sonnet/GPT-5.2 as strong alternates.

Quick mass processing for simple tasks

What matters:

  • Cost per item and throughput (items/second).

Short answer:

  • Llama 3.1 8B still wins pure token economics.
  • gpt-oss-120B is often a better default now: still cheap, substantially stronger, and can be faster depending on route/provider.

Per-item comparison for 1k input + 100 output:

  • Llama 3.1 8B (cheapest OpenRouter route): ~$0.000025/item (~40k items per $1), about 30-36 TPS on listed routes.
  • gpt-oss-120B (DeepInfra route): ~$0.000058/item (~17k items per $1), about 70 TPS, and much stronger benchmark profile.
  • Llama 3.1 8B (Groq): ~$0.000058/item, ~840 TPS.
  • Llama 3.1 8B (Cerebras): ~$0.00011/item, ~2200 tokens/s.
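The per-item numbers above come straight from the token prices; a quick sketch, plugging in the $/1M figures from the routes listed above:

```python
# Per-item cost for a 1k-input / 100-output bulk workload.
# Prices are the $/1M-token figures from the routes listed above.

def item_cost(in_price, out_price, in_tok=1_000, out_tok=100):
    return in_tok / 1e6 * in_price + out_tok / 1e6 * out_price

# Llama 3.1 8B via cheapest OpenRouter route: 0.02 in / 0.05 out
llama_8b = item_cost(0.02, 0.05)   # ~$0.000025/item, ~40,000 items per $1
# gpt-oss-120b via DeepInfra route: 0.039 in / 0.19 out
gpt_oss = item_cost(0.039, 0.19)   # ~$0.000058/item, ~17,000 items per $1
print(round(1 / llama_8b), round(1 / gpt_oss))
```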

Recommended bulk pipeline:

  • Tier 0 (ultra-cheap): Llama 3.1 8B for easy formatting/tagging/extraction.
  • Tier 1 (still cheap, fewer retries): gpt-oss-120B for brittle schemas or light reasoning.
  • Tier 2 (correctness-heavy): GPT-5 mini or Gemini 3 Flash when failure/retry cost dominates.
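The tiered pipeline above is just "try cheap first, escalate on failure"; a minimal sketch, where `call_model` and `validate` are hypothetical placeholders (not a real client API) and the tier names only echo the picks above:

```python
# Escalating bulk pipeline: run each item through cheaper tiers first and
# only move up when the output fails validation. call_model and validate
# are hypothetical stand-ins for a real client and schema check.

TIERS = ["llama-3.1-8b", "gpt-oss-120b", "gpt-5-mini"]

def process_item(item, call_model, validate):
    """Return (model_used, result), escalating on validation failure."""
    for model in TIERS:
        result = call_model(model, item)
        if validate(result):
            return model, result
    return TIERS[-1], result  # all tiers failed: keep the last tier's answer
```

Most items should terminate at Tier 0, so the blended cost per item stays close to the cheapest route while hard items still get a stronger model.
<test>
def call_model(model, item):
    return model + ":" + str(item)
def validate(result):
    return result.startswith("gpt-oss")
assert process_item("x", call_model, validate) == ("gpt-oss-120b", "gpt-oss-120b:x")
assert process_item("x", call_model, lambda r: True)[0] == "llama-3.1-8b"
assert process_item("x", call_model, lambda r: False) == ("gpt-5-mini", "gpt-5-mini:x")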

Llama 3.1 8B still wins pure token economics, but gpt-oss-120B is the key “pay a little more, fail less” option.

Deep conversations

Factors:

  • General Intelligence Index as proxy for coherence/nuance.
  • Context tiering and output pricing.

Good options across categories:

  1. Best default (quality + reasonable cost): Gemini 3.1 Pro (General 57, output $12/1M at <=200k prompt tier).
  2. Premium quality tier: Claude Opus 4.6 (General 53, Agentic 68) if cost is secondary.
  3. Budget surprises: GLM-5 (reasoning) and Kimi K2.5 (reasoning), with strong indices for listed prices but lower route TPS.

Personal preference note:

  • For chat quality/feel, my own preference is ChatGPT Pro (the subscription; not rich enough to try the API) > Opus 4.6 > Gemini Pro, even though the data ranks Gemini 3.1 Pro highest by the numbers.

For long, high-quality conversations, Gemini 3.1 Pro is the clean default. Opus is the premium experience.

Casual conversation

Factors:

  • Cost per turn and streaming latency.

Picks:

  1. Cheap + good proprietary defaults: GPT-5 mini or Gemini 3 Flash.
  2. Snappy streaming first: Groq routes (especially Llama 3.1 8B or gpt-oss-120B).
  3. Local privacy/offline option: local gpt-oss-120B is already in a usable chat band from my own runs.

Agents

Factors:

  • Agentic Index and cost per step (agent loops amplify both token cost and failure cost).

Picks:

  1. Max reliability: Claude Opus 4.6 (Agentic 68), expensive but highest score.
  2. Strong mid-tier brains: Claude Sonnet 4.6, GPT-5.2, Gemini 3.1 Pro.
  3. Budget outlier: GLM-5 (reasoning), strong Agentic score for its listed prices, with lower route TPS.
  4. Better as subroutines than planners: Llama 3.1 8B / Llama 4 Maverick / Qwen3 32B / Qwen3 235B (low Agentic scores here).

Agent workloads amplify both token and failure costs. Opus is premium, Sonnet/GPT-5.2/Gemini are strong mid-tier, GLM-5 is the most interesting budget agentic outlier.
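To make the multiplicative point concrete, here is a rough sketch of run-level cost; the 20-step count and the 10% retry rate are made-up illustrations, not measured numbers:

```python
# Agent runs multiply per-step token cost by step count, and steps that
# fail and must be retried amplify it further. The retry model here is a
# simple geometric expectation; the 10% rate is an invented illustration.

def run_cost(per_step_cost, steps, retry_rate=0.0):
    expected_calls = steps / (1 - retry_rate)  # expected calls incl. retries
    return per_step_cost * expected_calls

# 20-step run at a ~$0.0165 per-step cost (the 3k+500 Sonnet 4.6 scenario):
print(round(run_cost(0.0165, 20), 4))        # 0.33
print(round(run_cost(0.0165, 20, 0.10), 4))  # 0.3667
```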

Typical token-cost scenarios (token-cost only)

Formula used:

  • cost = (input_tokens / 1,000,000 * input_price) + (output_tokens / 1,000,000 * output_price)
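The same formula as a small helper, checked against a couple of the chat-turn numbers:

```python
# Token-only cost of one call; prices are $ per 1M tokens from the tables.

def call_cost(input_tokens, output_tokens, input_price, output_price):
    return (input_tokens / 1e6 * input_price
            + output_tokens / 1e6 * output_price)

# Chat turn (3k input + 1k output):
print(call_cost(3_000, 1_000, 0.25, 2.00))   # GPT-5 mini, ~0.00275
print(call_cost(3_000, 1_000, 2.00, 12.00))  # Gemini 3.1 Pro <=200k tier, ~0.018
```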

Chat turn (baseline): 3k input + 1k output

Model Cost per turn
GPT-5 mini ~$0.00275
Gemini 3 Flash ~$0.00450
Gemini 3.1 Pro (<=200k tier) ~$0.01800
Claude Opus 4.6 ~$0.04000

Mass processing: 1k input + 100 output

Model/route Cost per item Items per $1 Notes
Llama 3.1 8B (OpenRouter DeepInfra/Novita) ~$0.000025 ~40,000 Cheapest token route
gpt-oss-120B (OpenRouter DeepInfra) ~$0.000058 ~17,241 Better quality/success profile
Llama 3.1 8B (Groq) ~$0.000058 ~17,241 Much faster streaming
Llama 3.1 8B (Cerebras) ~$0.000110 ~9,091 Throughput-focused option

Coding: 10k input + 2k output

Model Cost per call
Gemini 3.1 Pro (<=200k tier) ~$0.0440
GPT-5.2 ~$0.0455
Claude Sonnet 4.6 ~$0.0600
Gemini 3 Flash ~$0.0110

Agent step: 3k input + 500 output

Model/route Cost per step
Claude Opus 4.6 ~$0.0275
Claude Sonnet 4.6 ~$0.0165
GPT-5.2 ~$0.01225
Gemini 3.1 Pro (<=200k tier) ~$0.0120
GLM-5 (OpenRouter SiliconFlow route) ~$0.004125

Pareto-front view (textual)

  • Coding Index: Gemini 3.1 Pro is the top score; Sonnet/GPT-5.2 are close; Gemini 3 Flash and GPT-5 mini are cheap practical points.
  • Agentic Index: Opus is the top point; Sonnet/GLM-5/GPT-5.2 cluster as strong alternatives at lower cost.
  • Bulk processing: Llama 3.1 8B gives the leftmost cost point; gpt-oss-120B often gives the better cost-to-reliability trade.

Caveats

  • Vendor TPS is not apples-to-apples (different batch sizes, output lengths, queueing, and serving conditions).
  • Tokenization and effective context behavior vary by model/provider.
  • Agent cost is multiplicative by number of calls, not just per-call token prices.
  • Route-level prices/TPS can change quickly; treat everything here as a snapshot from 2026-02-25.

Conclusions

  • Output pricing still dominates many real workloads, especially coding and long-form conversations.
  • Gemini 3.1 Pro is the best default frontier pick today across quality and token economics.
  • GPT-5.2 vs Sonnet is close on benchmark scores, with GPT-5.2 usually cheaper in prompt-heavy flows.
  • GLM-5 is the biggest anomaly here: Sonnet-tier Agentic score in this table with much lower listed route pricing, at lower TPS.
  • Kimi K2.5 looks like a second anomaly if direct 0.10 / 0.60 pricing is stable.
  • Open-weight fast routes are excellent for cheap subroutines and throughput, but still weak as primary agent planners.
  • Local gpt-oss-120B decode (~70 TPS in my runs) is roughly on par with the cheapest cloud route in this table, while hosted hardware providers still dominate absolute decode throughput.
  • For coding-heavy daily work, fixed monthly plans are usually the best deal in this timeframe; token APIs stay useful as overflow, batch, or integration rails.

Personal hardware usage

For my local llama.cpp server runs with gpt-oss-120b, the two machines behave like different profiles:

  • M4 Max: ~300 TPS prefill / ~70 TPS decode
  • NVIDIA GB10: ~1000 TPS prefill / ~50 TPS decode

Interpretation:

  • Prefill TPS mostly affects time to first token (especially with large prompts).
  • Decode TPS mostly affects streaming speed after generation starts.

Quick intuition table (sequential, no batching):

Scenario (prompt / output) M4 Max time GB10 time Feels better
Simple extraction (1k / 100) ~4.8s ~3.0s GB10
Casual chat (1k / 200) ~6.2s ~5.0s GB10
Deep chat turn (3k / 1k) ~24.3s ~23.0s Slight GB10
Coding with context (10k / 1.5k) ~54.8s ~40.0s GB10 (big)
Agent step (4k / 300) ~17.6s ~10.0s GB10 (big)

These are estimated as:

  • total_time ~= prompt_tokens / prefill_TPS + output_tokens / decode_TPS
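In code, with my measured rates (rough measurements, sequential, no batching):

```python
# Estimate total response time from measured prefill/decode rates (TPS).
# Rates are my own rough llama.cpp measurements, not vendor numbers.

RATES = {"m4_max": (300, 70), "gb10": (1000, 50)}  # (prefill_TPS, decode_TPS)

def total_time(machine, prompt_tokens, output_tokens):
    prefill, decode = RATES[machine]
    return prompt_tokens / prefill + output_tokens / decode

# Simple extraction (1k prompt / 100 output):
print(round(total_time("m4_max", 1_000, 100), 1))  # 4.8
print(round(total_time("gb10", 1_000, 100), 1))    # 3.0
```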

Rule of thumb from these measurements:

  • GB10 wins when prompts are large and outputs are moderate (RAG/context-heavy coding, agent traces, tool logs).
  • M4 Max catches up only when output is very large relative to prompt.
  • Crossover is around output_tokens >= 0.4 * prompt_tokens.
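The ~0.4 crossover falls out of setting the two machines' total_time estimates equal and solving for the output/prompt ratio:

```python
# Solve p/gb_prefill + o/gb_decode = p/m4_prefill + o/m4_decode for o/p:
#   o/p = (1/m4_prefill - 1/gb_prefill) / (1/gb_decode - 1/m4_decode)

m4_prefill, m4_decode = 300, 70
gb_prefill, gb_decode = 1000, 50

ratio = (1 / m4_prefill - 1 / gb_prefill) / (1 / gb_decode - 1 / m4_decode)
print(round(ratio, 2))  # 0.41
```

Below that ratio GB10's faster prefill wins; above it, M4 Max's faster decode takes over.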

Where local hardware fits best:

  • Coding: good offline copilot for refactors, boilerplate, and iteration without cost anxiety; still use stronger frontier models when correctness is critical.
  • Mass processing: GB10 is the stronger local box for prefill-heavy pipelines; for pure simple bulk, smaller 8B cloud routes are usually more efficient.
  • Deep conversations: very usable for private long-form chat, but quality/speed still below frontier hosted options.
  • Casual conversation: good privacy-first default; not in the same latency class as Groq/Cerebras.
  • Agents: better as a cheap local subroutine executor than a primary planner brain.

Practical takeaway:

  • Local gpt-oss-120b sits in a useful “privacy-first, no-marginal-cost, mid-speed” tier.
  • It complements cloud inference well: use local for private/iterative inner loops, cloud for highest reliability or fastest UX.
Written on February 25, 2026