Self-hosting frontier open-weight models is finally cheap enough on paper that every CTO does the math at least once a quarter. The problem is that the math most people do is wrong — they compare API rack rate to GPU rack rate and conclude they should migrate. Real TCO has four lines, not two, and the last two often decide the comparison.
By April 2026 the open-weight frontier is genuinely competitive with closed APIs on capability: Llama 4-MoE 70B, Qwen 3 235B-MoE, DeepSeek V4-Flash, and Mistral Large 2 all clear 80% on MMLU-Pro, 70%+ on SWE-Bench Verified, and ship 256K-1M context. The serving stacks (vLLM 0.7+, SGLang, TensorRT-LLM) handle MoE all-to-all routing and aggressive KV-cache optimization. The technical question is settled. The economic question is not.
This analysis covers the four-line TCO model we use with clients — GPU rack-rate, serving stack, engineer-time, and the hidden opportunity cost — with break-even tables at 100M, 600M, 1.2B, and 5B tokens/month, plus the failure modes we have watched teams walk into.
- 01 · Self-hosting wins on per-token cost above ~600M tokens/month for code, ~1.2B for chat. Below those volumes, API rack rate (especially with prompt caching) dominates. Above them, the GPU economics work — but only if at least one full-time inference engineer is available to keep the stack tuned.
- 02 · GPU rent is 60-70% of self-hosted TCO; engineer-time is 25-30%. An 8×H100 cluster on-demand runs $22-28K/month. A senior inference engineer (loaded cost) runs $20-30K/month. The two lines are similar magnitude — and the engineer is the variable that decides whether the GPUs hit their utilization target.
- 03 · Reserved capacity (1-year, 3-year) drops GPU rent 35-60% and shifts the break-even down. 1-year reserved H100 drops the 8-GPU cluster from ~$25K/month to ~$15K/month. 3-year reserved drops it to ~$10K/month. Reserved is the right call once steady-state volume is locked in; commit early and you over-pay during ramp.
- 04 · Latency-percentile targets, not throughput, govern cluster sizing. P50 throughput numbers in vLLM benchmark posts are misleading. Production sizing is governed by P95/P99 tail latency under bursty load — and that requires 30-50% more headroom than P50-based sizing suggests. Plan capacity around the hard latency target.
- 05 · Closed-API fallback routing is the cheap insurance most self-hosters skip. Routing the top 2-5% of spiky traffic to a closed API (GPT-5.5, Opus 4.7) protects against capacity emergencies and saves the 'over-provision for tail' surcharge — typically 15-25% of cluster cost. Treat the closed-API budget as a deliberate ops-resilience line, not an afterthought.
01 — The Math
The four-line TCO model.
The TCO comparison most teams do has two lines: API token spend on one side, GPU rent on the other. That model is wrong because it misses the two terms that actually decide the comparison — engineer-time and the opportunity cost of the build itself. The full model has four lines.
- GPU rent (the obvious one) · 60-70% of TCO. $/hour × cluster size × hours/month. Plug-and-chug: 8×H100 on-demand at $3.50/hr lists at $20,160/month before utilization. Reserved capacity drops it 35-60%. This line is what every TCO comparison includes; it is not where the comparison goes wrong.
- Serving-stack ops (the under-counted one) · 5-10% of TCO. vLLM/SGLang/TRT-LLM tuning + monitoring: batch sizing, capacity tuning, model-swap pipelines, observability stack (Helicone / LangSmith / Prometheus), backup model deployments. Often packaged as a 'one-time setup' — actually a recurring 8-12 hours/week of inference engineering.
- Inference engineer (the hidden one) · 25-30% of TCO. Loaded cost of a senior infra hire: a senior inference engineer runs $250-360K loaded annually in 2026 ($20-30K/month). For mid-volume self-hosters, this is the difference between running and not running. For high-volume teams, this engineer often pays for themselves in the first month through utilization gains.
- Build-out opportunity cost · variable, often decisive. Engineering weeks not shipped elsewhere: the two-month migration from API to self-hosted is two months not shipping product. Multiply that by the team's effective hourly value to clients. For agencies, this often dominates the first-year TCO comparison and the answer flips back toward API.

"Two-line TCO models always favor self-hosting. Four-line models tell the truth — and the truth is that under 600M tokens/month, API spend is cheap rent on a problem someone else handles for you."
— Internal client TCO retrospective, May 2026
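As a rough sanity check, here is the four-line model as a minimal Python sketch. Every figure in it is an illustrative assumption drawn from the ranges above (mid-range cluster rent, a single engineer, a two-month build-out), not a quote from any provider.

```python
# Minimal four-line TCO sketch. All inputs are illustrative assumptions
# taken from the ranges in this section, not provider quotes.

def self_hosted_tco_monthly(
    gpu_rent: float,         # line 1: cluster rent, $/month
    serving_ops: float,      # line 2: observability + recurring tuning time, $/month
    engineer_loaded: float,  # line 3: loaded inference-engineer cost, $/month
    build_out_cost: float,   # line 4: one-off migration cost, $ (amortized below)
    amortize_months: int = 12,
) -> float:
    """First-year monthly TCO with the build-out spread over `amortize_months`."""
    return gpu_rent + serving_ops + engineer_loaded + build_out_cost / amortize_months


if __name__ == "__main__":
    # Assumed mid-points: 8xH100 on-demand ~$25K/mo, ops ~$3K/mo, engineer ~$25K/mo,
    # and a two-month build-out priced as 8 engineer-weeks of foregone delivery at $10K/week.
    tco = self_hosted_tco_monthly(
        gpu_rent=25_000,
        serving_ops=3_000,
        engineer_loaded=25_000,
        build_out_cost=8 * 10_000,
    )
    print(f"first-year monthly TCO: ${tco:,.0f}")  # ~$59,667/month
```

The point of the sketch is the shape, not the exact figures: the second pair of lines is the part a two-line comparison silently drops.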
02 — GPU Choices
Four GPU classes and what they cost.
By Q2 2026, four GPU classes are realistic for serving frontier open-weight models: H100 (the workhorse), H200 (the long-context specialist), B100/B200 (the new generation), and AMD MI300X (the value play). Each has a different sweet spot.
Cluster cost · 8-16 GPU configurations · monthly rent
Source: AWS/GCP/Azure list pricing · CoreWeave / Lambda · Apr 2026

The MI300X is the underrated 2026 option. AMD has spent two years on ROCm + vLLM compatibility; the gap on production stacks is largely closed for inference (still real for training). At $19.2K/month for 8 GPUs vs $25.2K for an H100 cluster, the value differential is meaningful for teams comfortable with a slightly less mature stack. NVIDIA still wins on training, on edge cases with custom CUDA kernels, and on the fastest-moving research code — but vanilla production inference works fine on AMD.
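The rent line itself is plain arithmetic. A small sketch, using illustrative hourly rates consistent with the figures in this section (the MI300X rate is backed out of the $19.2K/month cluster figure; actual pricing varies by provider and term):

```python
# GPU rent line: $/GPU-hour x GPU count x hours/month, with an optional reserved
# discount. Hourly rates are illustrative assumptions, not provider quotes.

HOURS_PER_MONTH = 720  # 30-day month

def monthly_rent(rate_per_gpu_hr: float, n_gpus: int, reserved_discount: float = 0.0) -> float:
    return rate_per_gpu_hr * n_gpus * HOURS_PER_MONTH * (1.0 - reserved_discount)

if __name__ == "__main__":
    clusters = {
        "8xH100 on-demand":     monthly_rent(3.50, 8),                          # ~$20,160
        "8xH100 1-yr reserved": monthly_rent(3.50, 8, reserved_discount=0.40),  # ~$12,096
        "8xMI300X on-demand":   monthly_rent(19_200 / (8 * HOURS_PER_MONTH), 8),  # ~$19,200
    }
    for label, cost in clusters.items():
        print(f"{label:22s} ${cost:>9,.0f}/month")
```

A ~40% reserved discount sits inside the 35-60% band quoted earlier; the 3-year figure would push the H100 cluster toward the ~$10K/month mark.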
03 — Serving Stack
The stack decides whether the GPU rent earns out.
Three serving stacks dominate 2026 self-hosting: vLLM (the open standard), SGLang (RadixAttention prefix-cache leader), and TensorRT-LLM (NVIDIA-specific peak performance). Picking the right one is not a religious question; each fits a different workload profile.
- vLLM 0.7+ (default · open weight). Open standard. Best community support, fastest model integration (DeepSeek V4 worked day-one), MoE expert-parallel handles top-k routing cleanly. Right default for any team without a strong reason to deviate (a minimal launch sketch follows at the end of this section).
- SGLang (prefix-cache heavy). RadixAttention's hash-based prefix cache wins for high-prefix-overlap workloads (multi-tenant SaaS, agent loops, long-doc Q&A). Slightly less broad model coverage than vLLM. Worth the swap when prefix-cache hit-rate is the dominant cost lever.
- TensorRT-LLM (peak NVIDIA performance). NVIDIA-only, peak performance, more setup friction. Wins by 10-25% on raw throughput for stable, high-volume single-model deployments. Not worth it for fast-moving teams; very worth it for static, locked-in production at scale.
- Replicate / Together (managed bridge tier). Not strictly self-hosted — managed serverless inference for open-weight models. Right answer for the 100M-600M tokens/month band where self-hosting math doesn't yet work but closed-API rack rate is too high. Bridges the gap.
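For the default vLLM path, the launch surface is small. A minimal sketch of an offline engine spun up across an 8-GPU node, assuming vLLM is installed; the model id is a placeholder and the memory/context settings are starting points, not tuned values:

```python
# Minimal vLLM sketch: load an open-weight model across 8 GPUs with tensor
# parallelism. Model id and limits are placeholders, not a recommendation.
from vllm import LLM, SamplingParams

llm = LLM(
    model="your-org/open-weight-70b-instruct",  # placeholder HF model id
    tensor_parallel_size=8,        # shard weights across the 8-GPU node
    gpu_memory_utilization=0.90,   # leave ~10% headroom for KV-cache spikes
    max_model_len=32_768,          # cap context to keep KV memory predictable
)

params = SamplingParams(temperature=0.2, max_tokens=512)
outputs = llm.generate(["Summarize our Q2 GPU spend in three bullets."], params)
print(outputs[0].outputs[0].text)
```

In production most teams run the same engine behind vLLM's OpenAI-compatible server rather than the offline API, which is also what the closed-API fallback routing later in this piece plugs into.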
04 — Engineer-Time
The engineer is the line that decides everything.

Most TCO write-ups list GPU rent and stop. Real TCO has a person on it. A senior inference engineer in 2026 runs $250-360K loaded (US) or $180-260K (EU/AU). Their time goes to capacity tuning, model swaps, MoE expert-balance monitoring, latency-percentile triage, observability-stack ownership, and on-call coverage. None of these are optional once you cross 1B tokens/month.
Below that volume, the math collapses. A team running $8K/month on closed-API spend cannot justify a $25K/month engineer to take it in-house — the engineer cost dominates the savings by 4-5×. The crossover only works once the engineer's loaded cost is a small fraction of the API spend they replace.
"We have seen teams hire two senior infra engineers to save $40K/year in API spend. The right answer was to keep the closed API and ship two more product features."— Agency CTO, May 2026
05 — Break-Even Tables
The arithmetic at four scales.
Break-even depends on the workload. Code completion has higher value-per-token (developers pay for low latency) and runs on shorter prompts; chat workloads run longer prompts at lower value. The crossover sits at different volumes for the two cases.
- 100M tokens/month: API wins decisively. API spend (chat): ~$1.5K/mo. Self-hosted minimum: ~$25K/mo cluster + $20K/mo engineer. Self-hosting costs roughly 30× more at this volume. The right answer is closed-API with aggressive caching. Verdict: stay on API.
- 600M tokens/month: code workloads cross over. API spend (code): ~$15K/mo. Self-hosted: ~$25K cluster + $25K engineer = $50K. API is still cheaper for chat. Code workloads start to break even because of the value-per-token premium developers pay. Verdict: code ≈ even.
- 1.2B tokens/month: chat workloads cross over. API spend (chat): ~$30-40K/mo. Self-hosted: ~$50-55K (cluster + engineer). With 1-year reserved capacity, the cluster drops to $16K and the total to $41K — break-even hit. Above this, every additional billion tokens widens the gap. Verdict: chat ≈ even.
- 5B tokens/month: self-hosting wins big. API spend: ~$140-200K/mo. Self-hosted: ~$50-60K (with reserved capacity + 2 engineers). 3-7× cheaper to self-host. At this scale, the only reason not to self-host is product velocity — and even that is usually solvable with a partial migration. Verdict: self-host.

The pattern: from 100M to 5B tokens/month, self-hosting goes from roughly 30× more expensive than API to 3-7× cheaper. The crossover sits around 600M-1.2B for most workloads. Below that, API is the right call. Above it, the engineering effort pays for itself — but only if you can find and keep the engineer.
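The same tables compressed into a sweep. The per-token price, cluster costs, and engineer count are illustrative assumptions chosen to be consistent with the figures above, so read the output as order-of-magnitude, not a quote:

```python
# Break-even sweep: closed-API spend vs self-hosted TCO across monthly volumes.
# All prices are illustrative assumptions consistent with the tables above.

VOLUMES_M = [100, 600, 1_200, 5_000]  # monthly volume, millions of tokens

API_PER_M_TOKENS  = 30.0     # blended closed-API $/1M tokens (assumption)
CLUSTER_ON_DEMAND = 25_000   # 8xH100 on-demand, $/month
CLUSTER_RESERVED  = 16_000   # 1-year reserved, $/month
ENGINEER_LOADED   = 25_000   # senior inference engineer, $/month

def api_cost(volume_m: float) -> float:
    return volume_m * API_PER_M_TOKENS

def self_hosted_cost(volume_m: float) -> float:
    # Reserved pricing only once steady-state volume is locked in (>= ~1B tok/mo).
    cluster = CLUSTER_RESERVED if volume_m >= 1_000 else CLUSTER_ON_DEMAND
    # Very high volume needs a second engineer (extra GPU capacity ignored here).
    engineers = 2 if volume_m >= 5_000 else 1
    return cluster + engineers * ENGINEER_LOADED

for v in VOLUMES_M:
    api, hosted = api_cost(v), self_hosted_cost(v)
    verdict = (f"self-host {api / hosted:.1f}x cheaper" if hosted < api
               else f"API {hosted / api:.1f}x cheaper")
    print(f"{v:>5,}M tok/mo  API ${api:>9,.0f}  self-hosted ${hosted:>9,.0f}  {verdict}")
```

The exact crossover moves with the per-token price and the engineer count, which is the point: the volume at which self-hosting wins is a function of the people line as much as the GPU line.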
06 — Hidden Costs
The costs that wreck first-year self-hosting.
- Model-swap velocity. Frontier models update every 4-8 weeks. Each swap costs 1-2 weeks of inference engineering — quantization, capacity retuning, smoke tests, A/B gates. If the team is on the closed API, the swap is instantaneous; on self-hosted, it's a sprint.
- Tail-latency over-provisioning. Sizing for P50 gets ~70% utilization; sizing for P99 under bursty traffic gets 35-45% utilization without aggressive autoscaling. The difference is a 30-40% effective cost increase that most TCO models miss (see the sketch after this list).
- Observability stack. Helicone, LangSmith, or a custom Prometheus + OpenTelemetry stack — pick one, but the line is real. Plan $1-3K/month for managed observability or 0.3-0.5 FTE for a roll-your-own.
- On-call burden. Self-hosted means you're on-call for the inference layer. Even with mature stacks, expect 2-4 incidents per month requiring inference-engineer attention. This is a real psychic cost on a small team.
- Compliance friction. Self-hosted means the compliance team can't outsource the data-residency question to OpenAI/Anthropic. Sometimes this is a feature (you control everything); often it's extra audit work.
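The tail-latency point is easiest to see as effective cost per token. A rough sketch, assuming a fixed traffic volume and a cluster that must grow by 30-40% to hold P99 under bursty load (cluster cost, traffic, and headroom multipliers are illustrative):

```python
# Over-provisioning for tail latency: holding P99 under bursty load means renting
# more capacity than P50 sizing suggests, and the effective $/token scales with it.
# Cluster cost, traffic, and headroom multipliers are illustrative assumptions.

P50_CLUSTER_MONTHLY = 25_000   # cluster sized for P50 throughput, $/month
TOKENS_PER_MONTH = 1_200e6     # traffic actually served either way

def cost_per_m_tokens(cluster_monthly: float) -> float:
    return cluster_monthly / (TOKENS_PER_MONTH / 1e6)

base = cost_per_m_tokens(P50_CLUSTER_MONTHLY)
for headroom in (1.3, 1.4):    # extra capacity needed to hold the P99 target
    tail = cost_per_m_tokens(P50_CLUSTER_MONTHLY * headroom)
    print(f"headroom {headroom:.1f}x: ${base:.2f} -> ${tail:.2f} per 1M tokens "
          f"(+{(headroom - 1) * 100:.0f}%)")
```

This is also where the hybrid pattern in the conclusion earns its keep: spilling the burst to a closed API lets you size the cluster closer to P50 and pay the tail surcharge only when the tail actually arrives.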
07 — Conclusion
Self-hosting is cheaper — once you can afford the engineer.
Volume buys you the right to self-host. Engineer-time keeps it earned.
Self-hosting frontier open-weight models is genuinely cheaper than closed APIs above 600M-1.2B tokens/month — but only if there is a full-time inference engineer on the build. Below that volume, the engineer's loaded cost dominates the savings; above it, the engineer pays for themselves in the first month through utilization gains and capacity tuning.
The four-line TCO model — GPU rent, serving-stack ops, engineer-time, build-out opportunity cost — is the right framework. Two-line comparisons that put GPU rent against API rack rate always favor self-hosting and always under-deliver. Build the four-line model first, then decide.
The deeper move is to design for hybrid from day one: self-host steady-state, route the spiky 2-5% to a closed API. That gives the cost benefit of self-hosting without the over-provisioning surcharge for tail traffic, and it gives an immediate fall-back when the inference cluster has an incident. Hybrid is what every mature 2026 self-hoster runs.
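In practice the hybrid pattern is a thin routing layer in front of two OpenAI-compatible endpoints: the self-hosted cluster and the closed API. A minimal sketch using the openai Python client; the base URL, model names, and the in-flight threshold are all placeholders to be replaced with your own endpoints and a real saturation signal:

```python
# Hybrid routing sketch: steady-state traffic goes to the self-hosted cluster,
# bursty overflow and cluster incidents spill to a closed API. Both endpoints
# speak the OpenAI-compatible chat API; all names and limits are placeholders.
from openai import OpenAI

SELF_HOSTED = OpenAI(base_url="http://inference.internal:8000/v1", api_key="unused")
CLOSED_API = OpenAI()  # reads OPENAI_API_KEY from the environment

CLUSTER_MODEL = "open-weight-70b-instruct"   # placeholder: model served by the cluster
FALLBACK_MODEL = "closed-frontier-model"     # placeholder: closed-API model id
MAX_IN_FLIGHT = 64    # crude saturation proxy; a real router would watch queue depth / P95
cluster_in_flight = 0

def chat(messages: list[dict]) -> str:
    """Serve steady-state traffic from the cluster; spill bursts and incidents to the API."""
    global cluster_in_flight
    if cluster_in_flight >= MAX_IN_FLIGHT:
        # Burst overflow: pay closed-API rates instead of over-provisioning the cluster.
        resp = CLOSED_API.chat.completions.create(model=FALLBACK_MODEL, messages=messages)
        return resp.choices[0].message.content
    cluster_in_flight += 1
    try:
        resp = SELF_HOSTED.chat.completions.create(
            model=CLUSTER_MODEL, messages=messages, timeout=30)
        return resp.choices[0].message.content
    except Exception:
        # Cluster incident: immediate fallback instead of a paged engineer at 3am.
        resp = CLOSED_API.chat.completions.create(model=FALLBACK_MODEL, messages=messages)
        return resp.choices[0].message.content
    finally:
        cluster_in_flight -= 1
```

The 2-5% of traffic that lands on the fallback path is the ops-resilience line item from the key takeaways: small enough not to move the TCO, large enough to absorb the tail.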