Self-hosting frontier open-weight models is finally cheap enough on paper that every CTO does the math at least once a quarter. The problem is that the math most people do is wrong — they compare API rack rate to GPU rack rate and conclude they should migrate. Real TCO has four lines, not two, and the last two often decide the comparison.
By April 2026 the open-weight frontier is genuinely competitive with closed APIs on capability: Llama 4-MoE 70B, Qwen 3 235B-MoE, DeepSeek V4-Flash, and Mistral Large 2 all clear 80% on MMLU-Pro, 70%+ on SWE-Bench Verified, and ship 256K-1M context. The serving stacks (vLLM 0.7+, SGLang, TensorRT-LLM) handle MoE all-to-all routing and aggressive KV-cache optimization. The technical question is settled. The economic question is not.
This analysis covers the four-line TCO model we use with clients — GPU rack-rate, serving stack, engineer-time, and the hidden opportunity cost — with break-even tables at 100M, 600M, 1.2B, and 5B tokens/month, plus the failure modes we have watched teams walk into.
- 01 · Self-hosting wins on per-token cost above ~600M tokens/month for code, ~1.2B for chat. Below those volumes, API rack rate (especially with prompt caching) dominates. Above them, the GPU economics work — but only if at least one full-time inference engineer is available to keep the stack tuned.
- 02 · GPU rent is 60-70% of self-hosted TCO; engineer-time is 25-30%. An 8×H100 cluster on-demand runs $22-28K/month. A senior inference engineer (loaded cost) runs $20-30K/month. The two lines are similar magnitude — and the engineer is the variable that decides whether the GPUs hit their utilization target.
- 03 · Reserved capacity (1-year, 3-year) drops GPU rent 35-60% and shifts the break-even down. 1-year reserved H100 drops the 8-GPU cluster from ~$25K/month to ~$15K/month. 3-year reserved drops it to ~$10K/month. Reserved is the right call once steady-state volume is locked in; commit early and you over-pay during ramp.
- 04 · Latency-percentile targets, not throughput, govern cluster sizing. P50 throughput numbers in vLLM benchmark posts are misleading. Production sizing is governed by P95/P99 tail latency under bursty load — and that requires 30-50% more headroom than P50-based sizing suggests. Plan capacity around the hard latency target.
- 05 · Closed-API fallback routing is the cheap insurance most self-hosters skip. Routing the top 2-5% of spiky traffic to a closed API (GPT-5.5, Opus 4.7) protects against capacity emergencies and saves the 'over-provision for tail' surcharge — typically 15-25% of cluster cost. Treat the closed-API budget as a deliberate ops-resilience line, not an afterthought.
01 — The Math
The four-line TCO model.
The TCO comparison most teams do has two lines: API token spend on one side, GPU rent on the other. That model is wrong because it misses the two terms that actually decide the comparison — engineer-time and the opportunity cost of the build itself. The full model has four lines.
- GPU rent (the obvious one) · 60-70% of TCO. $/hour × cluster size × hours/month. Plug-and-chug: 8×H100 on-demand at $3.50/hr lists at $20,160/month before utilization. Reserved capacity drops it 35-60%. This line is what every TCO comparison includes; it is not where the comparison goes wrong.
- Serving-stack ops (the under-counted one) · 5-10% of TCO. vLLM/SGLang/TRT-LLM tuning + monitoring: batch sizing, capacity tuning, model-swap pipelines, observability stack (Helicone / LangSmith / Prometheus), backup model deployments. Often packaged as a 'one-time setup' — actually a recurring 8-12 hours/week of inference engineering.
- Inference engineer (the hidden one) · 25-30% of TCO. Loaded cost of a senior infra hire: a senior inference engineer runs $250-360K loaded annually in 2026 ($20-30K/month). For mid-volume self-hosters, this is the difference between running and not running. For high-volume teams, this engineer often pays for themselves in the first month through utilization gains.
- Build-out opportunity cost · variable, often decisive. Engineering weeks not shipped elsewhere: the two-month migration from API to self-hosted is two months not shipping product. Multiply that by the team's effective hourly value to clients. For agencies, this often dominates the first-year TCO comparison and the answer flips back toward API.

"Two-line TCO models always favor self-hosting. Four-line models tell the truth — and the truth is that under 600M tokens/month, API spend is cheap rent on a problem someone else handles for you."
— Internal client TCO retrospective, May 2026
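As a rough sanity check, here is the four-line model as a minimal Python sketch. Every figure in it is an illustrative assumption drawn from the ranges above (mid-range cluster rent, a single engineer, a two-month build-out), not a quote from any provider.

```python
# Minimal four-line TCO sketch. All inputs are illustrative assumptions
# taken from the ranges in this section, not provider quotes.

def self_hosted_tco_monthly(
    gpu_rent: float,         # line 1: cluster rent, $/month
    serving_ops: float,      # line 2: observability + recurring tuning time, $/month
    engineer_loaded: float,  # line 3: loaded inference-engineer cost, $/month
    build_out_cost: float,   # line 4: one-off migration cost, $ (amortized below)
    amortize_months: int = 12,
) -> float:
    """First-year monthly TCO with the build-out spread over `amortize_months`."""
    return gpu_rent + serving_ops + engineer_loaded + build_out_cost / amortize_months


if __name__ == "__main__":
    # Assumed mid-points: 8xH100 on-demand ~$25K/mo, ops ~$3K/mo, engineer ~$25K/mo,
    # and a two-month build-out priced as 8 engineer-weeks of foregone delivery at $10K/week.
    tco = self_hosted_tco_monthly(
        gpu_rent=25_000,
        serving_ops=3_000,
        engineer_loaded=25_000,
        build_out_cost=8 * 10_000,
    )
    print(f"first-year monthly TCO: ${tco:,.0f}")  # ~$59,667/month
```

The point of the sketch is the shape, not the exact figures: the second pair of lines is the part a two-line comparison silently drops.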
02 — GPU Choices
Four GPU classes and what they cost.
By Q2 2026, four GPU classes are realistic for serving frontier open-weight models: H100 (the workhorse), H200 (the long-context specialist), B100/B200 (the new generation), and AMD MI300X (the value play). Each has a different sweet spot.
Cluster cost · 8-16 GPU configurations · monthly rent
Source: AWS/GCP/Azure list pricing · CoreWeave / Lambda · Apr 2026

The MI300X is the underrated 2026 option. AMD has spent two years on ROCm + vLLM compatibility; the gap on production stacks is largely closed for inference (still real for training). At $19.2K/month for 8 GPUs vs $25.2K for an H100 cluster, the value differential is meaningful for teams comfortable with a slightly less mature stack. NVIDIA still wins on training, on edge cases with custom CUDA kernels, and on the fastest-moving research code — but vanilla production inference works fine on AMD.
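The rent line itself is plain arithmetic. A small sketch, using illustrative hourly rates consistent with the figures in this section (the MI300X rate is backed out of the $19.2K/month cluster figure; actual pricing varies by provider and term):

```python
# GPU rent line: $/GPU-hour x GPU count x hours/month, with an optional reserved
# discount. Hourly rates are illustrative assumptions, not provider quotes.

HOURS_PER_MONTH = 720  # 30-day month

def monthly_rent(rate_per_gpu_hr: float, n_gpus: int, reserved_discount: float = 0.0) -> float:
    return rate_per_gpu_hr * n_gpus * HOURS_PER_MONTH * (1.0 - reserved_discount)

if __name__ == "__main__":
    clusters = {
        "8xH100 on-demand":     monthly_rent(3.50, 8),                          # ~$20,160
        "8xH100 1-yr reserved": monthly_rent(3.50, 8, reserved_discount=0.40),  # ~$12,096
        "8xMI300X on-demand":   monthly_rent(19_200 / (8 * HOURS_PER_MONTH), 8),  # ~$19,200
    }
    for label, cost in clusters.items():
        print(f"{label:22s} ${cost:>9,.0f}/month")
```

A ~40% reserved discount sits inside the 35-60% band quoted earlier; the 3-year figure would push the H100 cluster toward the ~$10K/month mark.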
03 — Serving Stack
The stack decides whether the GPU rent earns out.
Three serving stacks dominate 2026 self-hosting: vLLM (the open standard), SGLang (RadixAttention prefix-cache leader), and TensorRT-LLM (NVIDIA-specific peak performance). Picking the right one is not a religious question; each fits a different workload profile.
- vLLM 0.7+ (default · open weight). Open standard. Best community support, fastest model integration (DeepSeek V4 worked day-one), MoE expert-parallel handles top-k routing cleanly. Right default for any team without a strong reason to deviate (a minimal launch sketch follows at the end of this section).
- SGLang (prefix-cache heavy). RadixAttention's hash-based prefix cache wins for high-prefix-overlap workloads (multi-tenant SaaS, agent loops, long-doc Q&A). Slightly less broad model coverage than vLLM. Worth the swap when prefix-cache hit-rate is the dominant cost lever.
- TensorRT-LLM (peak NVIDIA performance). NVIDIA-only, peak performance, more setup friction. Wins by 10-25% on raw throughput for stable, high-volume single-model deployments. Not worth it for fast-moving teams; very worth it for static, locked-in production at scale.
- Replicate / Together (managed bridge tier). Not strictly self-hosted — managed serverless inference for open-weight models. Right answer for the 100M-600M tokens/month band where self-hosting math doesn't yet work but closed-API rack rate is too high. Bridges the gap.
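For the default vLLM path, the launch surface is small. A minimal sketch of an offline engine spun up across an 8-GPU node, assuming vLLM is installed; the model id is a placeholder and the memory/context settings are starting points, not tuned values:

```python
# Minimal vLLM sketch: load an open-weight model across 8 GPUs with tensor
# parallelism. Model id and limits are placeholders, not a recommendation.
from vllm import LLM, SamplingParams

llm = LLM(
    model="your-org/open-weight-70b-instruct",  # placeholder HF model id
    tensor_parallel_size=8,        # shard weights across the 8-GPU node
    gpu_memory_utilization=0.90,   # leave ~10% headroom for KV-cache spikes
    max_model_len=32_768,          # cap context to keep KV memory predictable
)

params = SamplingParams(temperature=0.2, max_tokens=512)
outputs = llm.generate(["Summarize our Q2 GPU spend in three bullets."], params)
print(outputs[0].outputs[0].text)
```

In production most teams run the same engine behind vLLM's OpenAI-compatible server rather than the offline API, which is also what the closed-API fallback routing later in this piece plugs into.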
04 — Engineer-Time
The engineer is the line that decides everything.

Most TCO write-ups list GPU rent and stop. Real TCO has a person on it. A senior inference engineer in 2026 runs $250-360K loaded (US) or $180-260K (EU/AU). Their time goes to capacity tuning, model swaps, MoE expert-balance monitoring, latency-percentile triage, observability-stack ownership, and on-call coverage. None of these are optional once you cross 1B tokens/month.
Below that volume, the math collapses. A team running $8K/month on closed-API spend cannot justify a $25K/month engineer to take it in-house — the engineer cost dominates the savings by 4-5×. The crossover only works once the engineer's loaded cost is a small fraction of the API spend they replace.
"We have seen teams hire two senior infra engineers to save $40K/year in API spend. The right answer was to keep the closed API and ship two more product features."— Agency CTO, May 2026
05 — Break-Even Tables
The arithmetic at four scales.
Break-even depends on the workload. Code completion has higher value-per-token (developers pay for low latency) and runs on shorter prompts; chat workloads run longer prompts at lower value. The crossover sits at different volumes for the two cases.
- 100M tokens/month: API wins decisively. API spend (chat): ~$1.5K/mo. Self-hosted minimum: ~$25K/mo cluster + $20K/mo engineer. Self-hosting costs roughly 30× more at this volume. The right answer is closed-API with aggressive caching. Verdict: stay on API.
- 600M tokens/month: code workloads cross over. API spend (code): ~$15K/mo. Self-hosted: ~$25K cluster + $25K engineer = $50K. API is still cheaper for chat. Code workloads start to break even because of the value-per-token premium developers pay. Verdict: code ≈ even.
- 1.2B tokens/month: chat workloads cross over. API spend (chat): ~$30-40K/mo. Self-hosted: ~$50-55K (cluster + engineer). With 1-year reserved capacity, the cluster drops to $16K and the total to $41K — break-even hit. Above this, every additional billion tokens widens the gap. Verdict: chat ≈ even.
- 5B tokens/month: self-hosting wins big. API spend: ~$140-200K/mo. Self-hosted: ~$50-60K (with reserved capacity + 2 engineers). 3-7× cheaper to self-host. At this scale, the only reason not to self-host is product velocity — and even that is usually solvable with a partial migration. Verdict: self-host.

The pattern: from 100M to 5B tokens/month, self-hosting goes from roughly 30× more expensive than API to 3-7× cheaper. The crossover sits around 600M-1.2B for most workloads. Below that, API is the right call. Above it, the engineering effort pays for itself — but only if you can find and keep the engineer.
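The same tables compressed into a sweep. The per-token price, cluster costs, and engineer count are illustrative assumptions chosen to be consistent with the figures above, so read the output as order-of-magnitude, not a quote:

```python
# Break-even sweep: closed-API spend vs self-hosted TCO across monthly volumes.
# All prices are illustrative assumptions consistent with the tables above.

VOLUMES_M = [100, 600, 1_200, 5_000]  # monthly volume, millions of tokens

API_PER_M_TOKENS  = 30.0     # blended closed-API $/1M tokens (assumption)
CLUSTER_ON_DEMAND = 25_000   # 8xH100 on-demand, $/month
CLUSTER_RESERVED  = 16_000   # 1-year reserved, $/month
ENGINEER_LOADED   = 25_000   # senior inference engineer, $/month

def api_cost(volume_m: float) -> float:
    return volume_m * API_PER_M_TOKENS

def self_hosted_cost(volume_m: float) -> float:
    # Reserved pricing only once steady-state volume is locked in (>= ~1B tok/mo).
    cluster = CLUSTER_RESERVED if volume_m >= 1_000 else CLUSTER_ON_DEMAND
    # Very high volume needs a second engineer (extra GPU capacity ignored here).
    engineers = 2 if volume_m >= 5_000 else 1
    return cluster + engineers * ENGINEER_LOADED

for v in VOLUMES_M:
    api, hosted = api_cost(v), self_hosted_cost(v)
    verdict = (f"self-host {api / hosted:.1f}x cheaper" if hosted < api
               else f"API {hosted / api:.1f}x cheaper")
    print(f"{v:>5,}M tok/mo  API ${api:>9,.0f}  self-hosted ${hosted:>9,.0f}  {verdict}")
```

The exact crossover moves with the per-token price and the engineer count, which is the point: the volume at which self-hosting wins is a function of the people line as much as the GPU line.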
06 — Hidden Costs
The costs that wreck first-year self-hosting.
- Model-swap velocity. Frontier models update every 4-8 weeks. Each swap costs 1-2 weeks of inference engineering — quantization, capacity retuning, smoke tests, A/B gates. If the team is on the closed API, the swap is instantaneous; on self-hosted, it's a sprint.
- Tail-latency over-provisioning. Sizing for P50 gets ~70% utilization; sizing for P99 under bursty traffic gets 35-45% utilization without aggressive autoscaling. The difference is a 30-40% effective cost increase that most TCO models miss (see the sketch after this list).
- Observability stack. Helicone, LangSmith, or a custom Prometheus + OpenTelemetry stack — pick one, but the line is real. Plan $1-3K/month for managed observability or 0.3-0.5 FTE for a roll-your-own.
- On-call burden. Self-hosted means you're on-call for the inference layer. Even with mature stacks, expect 2-4 incidents per month requiring inference-engineer attention. This is a real psychic cost on a small team.
- Compliance friction. Self-hosted means the compliance team can't outsource the data-residency question to OpenAI/Anthropic. Sometimes this is a feature (you control everything); often it's extra audit work.
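The tail-latency point is easiest to see as effective cost per token. A rough sketch, assuming a fixed traffic volume and a cluster that must grow by 30-40% to hold P99 under bursty load (cluster cost, traffic, and headroom multipliers are illustrative):

```python
# Over-provisioning for tail latency: holding P99 under bursty load means renting
# more capacity than P50 sizing suggests, and the effective $/token scales with it.
# Cluster cost, traffic, and headroom multipliers are illustrative assumptions.

P50_CLUSTER_MONTHLY = 25_000   # cluster sized for P50 throughput, $/month
TOKENS_PER_MONTH = 1_200e6     # traffic actually served either way

def cost_per_m_tokens(cluster_monthly: float) -> float:
    return cluster_monthly / (TOKENS_PER_MONTH / 1e6)

base = cost_per_m_tokens(P50_CLUSTER_MONTHLY)
for headroom in (1.3, 1.4):    # extra capacity needed to hold the P99 target
    tail = cost_per_m_tokens(P50_CLUSTER_MONTHLY * headroom)
    print(f"headroom {headroom:.1f}x: ${base:.2f} -> ${tail:.2f} per 1M tokens "
          f"(+{(headroom - 1) * 100:.0f}%)")
```

This is also where the hybrid pattern in the conclusion earns its keep: spilling the burst to a closed API lets you size the cluster closer to P50 and pay the tail surcharge only when the tail actually arrives.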
07 — Conclusion
Self-hosting is cheaper — once you can afford the engineer.
Volume buys you the right to self-host. Engineer-time keeps it earned.
Self-hosting frontier open-weight models is genuinely cheaper than closed APIs above 600M-1.2B tokens/month — but only if there is a full-time inference engineer on the build. Below that volume, the engineer's loaded cost dominates the savings; above it, the engineer pays for themselves in the first month through utilization gains and capacity tuning.
The four-line TCO model — GPU rent, serving-stack ops, engineer-time, build-out opportunity cost — is the right framework. Two-line comparisons that put GPU rent against API rack rate always favor self-hosting and always under-deliver. Build the four-line model first, then decide.
The deeper move is to design for hybrid from day one: self-host steady-state, route the spiky 2-5% to a closed API. That gives the cost benefit of self-hosting without the over-provisioning surcharge for tail traffic, and it gives an immediate fall-back when the inference cluster has an incident. Hybrid is what every mature 2026 self-hoster runs.
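In practice the hybrid pattern is a thin routing layer in front of two OpenAI-compatible endpoints: the self-hosted cluster and the closed API. A minimal sketch using the openai Python client; the base URL, model names, and the in-flight threshold are all placeholders to be replaced with your own endpoints and a real saturation signal:

```python
# Hybrid routing sketch: steady-state traffic goes to the self-hosted cluster,
# bursty overflow and cluster incidents spill to a closed API. Both endpoints
# speak the OpenAI-compatible chat API; all names and limits are placeholders.
from openai import OpenAI

SELF_HOSTED = OpenAI(base_url="http://inference.internal:8000/v1", api_key="unused")
CLOSED_API = OpenAI()  # reads OPENAI_API_KEY from the environment

CLUSTER_MODEL = "open-weight-70b-instruct"   # placeholder: model served by the cluster
FALLBACK_MODEL = "closed-frontier-model"     # placeholder: closed-API model id
MAX_IN_FLIGHT = 64    # crude saturation proxy; a real router would watch queue depth / P95
cluster_in_flight = 0

def chat(messages: list[dict]) -> str:
    """Serve steady-state traffic from the cluster; spill bursts and incidents to the API."""
    global cluster_in_flight
    if cluster_in_flight >= MAX_IN_FLIGHT:
        # Burst overflow: pay closed-API rates instead of over-provisioning the cluster.
        resp = CLOSED_API.chat.completions.create(model=FALLBACK_MODEL, messages=messages)
        return resp.choices[0].message.content
    cluster_in_flight += 1
    try:
        resp = SELF_HOSTED.chat.completions.create(
            model=CLUSTER_MODEL, messages=messages, timeout=30)
        return resp.choices[0].message.content
    except Exception:
        # Cluster incident: immediate fallback instead of a paged engineer at 3am.
        resp = CLOSED_API.chat.completions.create(model=FALLBACK_MODEL, messages=messages)
        return resp.choices[0].message.content
    finally:
        cluster_in_flight -= 1
```

The 2-5% of traffic that lands on the fallback path is the ops-resilience line item from the key takeaways: small enough not to move the TCO, large enough to absorb the tail.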