Self-hosting open-weight LLMs is the deployment decision teams get wrong most often in 2026 — not because the engineering is hard, but because the cost intuition is backwards. Owning the GPU feels cheaper than paying a per-token API bill, yet for most workloads the math never crosses over, and the costs that sink the project rarely show up on the GPU invoice.
The open-weight ecosystem has never been stronger. Llama 4, the DeepSeek V-series, and the Qwen3 open checkpoints give you frontier- adjacent capability you can run on your own hardware, fine-tune on proprietary data, and operate without an external dependency. The question is no longer whether you can self-host — it is whether you should, and the honest answer depends on volume, utilization, and compliance, not on the model.
This guide gives you the framework: a token-volume breakeven matrix that shows the actual crossover points, the GPU-utilization multiplier that quietly wrecks under-loaded deployments, a 2026 serving-framework decision matrix that reflects the December 2025 sunset of Text Generation Inference, GPU sizing tables by model tier, and the compliance scenarios where the cost math stops mattering entirely. Every figure below is sourced; treat all pricing as a snapshot to re-verify, not a constant.
- 01The cost crossover arrives much later than teams expect.Against a budget API like DeepSeek V4 Flash (around $0.14 per 1M input tokens), one analysis puts the breakeven for a single H100 at roughly 5.7 billion tokens per month. Against a premium GPT-5-class API at about $5 per 1M, the crossover can arrive near 256 million tokens per month — but that figure assumes 60-70% sustained GPU utilization.
- 02GPU utilization is the silent cost multiplier.Effective cost-per-token scales inversely with load. At roughly 10% utilization, your real cost per token can be about 10x the headline GPU rate — enough to make an idle H100 more expensive per token than a premium frontier API. Self-hosting economics only work with high, sustained throughput.
- 03The hidden costs are 3-5x the raw GPU price.DevOps salaries, model update cycles, and infrastructure overhead typically add a 3-5x multiplier on top of the GPU rental alone, per one cost analysis. The GPU invoice is the part of self-hosting that is easy to see and the smallest part of the bill.
- 04The 2026 serving stack is vLLM and SGLang, not TGI.Hugging Face moved Text Generation Inference to maintenance mode on December 11, 2025, redirecting effort upstream to vLLM and SGLang. New greenfield deployments should default to vLLM for batch throughput and SGLang for prefix-heavy RAG and multi-turn workloads.
- 05Compliance can override the cost math entirely.For HIPAA, GDPR, or SOC2-bound workloads, the breakeven analysis is often moot. Without a Business Associate Agreement, standard consumer APIs cannot lawfully process protected health information, which can make a self-hosted VPC deployment the only straightforward compliant path.
01 — The Real QuestionIt is not capability. It is volume, utilization, and compliance.
Most self-hosting debates start in the wrong place. Teams compare benchmark scores, decide an open-weight model is "good enough," and conclude that running it themselves must be cheaper than paying an API per token. The capability question is largely settled — open checkpoints from Llama 4, the DeepSeek V-series, and Qwen3 are genuinely strong. The questions that actually decide the outcome are three: how many tokens you process per month, how busy your GPU stays, and whether regulation forces your hand.
Get those three right and the decision makes itself. A team doing a few hundred million tokens a month at variable load, with no compliance constraint, is almost always cheaper and faster on a managed API. A team processing billions of tokens at steady, high utilization — or one that legally cannot send its data to a third party — is a strong self-hosting candidate. The middle ground is where the matrices below earn their keep.
Three deployment paths exist, and most teams should expect to use more than one. Self-hosting on your own GPUs gives maximum control and the only path to true on-prem data sovereignty. Managed open-weight APIs — Together AI, Fireworks AI, and others — give you open-model weights without the operations burden, at per-token pricing. And local runtimes like Ollama give individual engineers a zero-friction way to prototype before any of this matters.
Self-host on your GPUs
Maximum control, the only route to true on-prem data sovereignty, and the cheapest option past a high, sustained volume. Carries the full DevOps and utilization burden — the part that decides whether it actually saves money.
Managed open-weight API
Open-model weights served for you, billed per token. No GPU to keep busy, no inference stack to operate. The default for most teams below the crossover, and the cleanest A/B baseline against self-hosting.
Local runtime (Ollama)
Zero-friction local inference for engineers to prototype and validate against an OpenAI-compatible surface before any production decision. Not a production serving engine — graduate to vLLM when traffic justifies it.
02 — The Breakeven MatrixWhere self-hosting crosses over a managed API.
The single most useful artifact in this decision is a crossover curve: at what monthly token volume does owning the GPU become cheaper than paying per token? The answer depends entirely on which API you are comparing against. Against a premium frontier API, the crossover arrives relatively early. Against a hyper-cheap open-weight API, it may never arrive in practice.
One published analysis frames the extremes cleanly. At roughly 1 million tokens per day, self-hosting a single A100 80GB lands near $3,240 per month all-in — about $1,440 for cloud rental, around $1,500 in DevOps labor, and roughly $300 of infrastructure overhead — versus a budget API bill of only a few dollars at that volume. The crossover against DeepSeek V4 Flash, at about $0.14 per 1M input tokens, does not arrive until roughly 5.7 billion tokens per month. Against a GPT-5-class API near $5 per 1M, a separate estimate puts the crossover closer to 256 million tokens per month.
Monthly token volume needed before self-hosting wins · by API tier
Sources: devtk.ai self-hosting cost analysis; braincuber cost-performance analysis. Bars indicate relative monthly token volume, not cost. Sonnet-class figure interpolated from the same model.Read that spread carefully, because it is the whole argument. The cheaper your API alternative, the more tokens you must self-serve before owning hardware pays off — and budget open-weight APIs are now so cheap that the crossover sits in territory most teams will never reach. The reason a budget API can undercut your own H100 is structural: the provider runs that GPU at near-saturation across thousands of customers, so its effective cost-per-token is far below what a single tenant achieves on a partially idle box. The next section is the direct consequence of that fact.
At genuine industrial scale the picture flips hard. The same body of analysis puts self-hosting a Llama 70B at around 500 million tokens per day near $4,360 per month, versus roughly $22,500 per month on managed APIs — about a 5x win for self-hosting. The lesson is not "self-hosting is bad" or "self-hosting is good." It is that the answer is a step function of volume, and you need to find your own row on the curve.
| Monthly volume | Cheaper option | Why |
|---|---|---|
| Under ~250M tokens | Managed API | Below the crossover against essentially every API tier. The DevOps labor and idle-capacity cost of a dedicated GPU dwarf the token bill. Prototype on Ollama, serve on a managed open-weight API. |
| ~250M-1B tokens | Depends on API tier | Crosses over against premium frontier APIs (~$5/1M) near 256M/mo at high utilization, but still loses badly to budget open-weight APIs. Self-host only if you are replacing expensive frontier calls and can keep the GPU busy. |
| 1B-5B tokens | Self-host (usually) | The range where owning capacity at sustained high utilization typically wins against mid-tier APIs. Below the ~5.7B breakeven vs the very cheapest budget APIs — so the comparison still depends on your specific API alternative. |
| 5B+ tokens | Self-host | Industrial scale. Self-hosting wins against nearly all alternatives, with reported savings on the order of 5x against managed APIs at the high end — provided utilization stays high and the operations team is already in place. |
03 — The Utilization TrapThe 10x multiplier nobody puts on the slide.
Here is the failure mode that turns a sound-looking self-hosting plan into a money pit. A GPU bills the same whether it is running at 100% or 5% of capacity. Your headline "cost per 1,000 tokens" is only achievable when the card is saturated. The moment your traffic is bursty, overprovisioned, or simply lower than projected, your effective cost-per-token climbs in inverse proportion to load — and it climbs fast.
The arithmetic is unforgiving. If a cloud GPU is running at 10% load, one analysis shows the effective cost per 1,000 tokens jumping from about $0.013 to roughly $0.13 — a 10x increase that makes an under-loaded H100 more expensive per token than a premium API service. The headline rate of about $0.013 per 1K tokens is only valid at high utilization; quote it without the load assumption and you have mispriced the entire project.
If your cloud GPU is running at 10% load, your effective cost per 1,000 tokens jumps from $0.013 to $0.13 — more expensive than premium API services.— Self-hosted vs API cost-performance analysis, braincuber.com
This is why managed APIs are so hard to beat at low and medium volume, and it is the mechanism behind the breakeven spread in the previous section. A provider amortizes one physical GPU across many tenants, so the card stays near saturation and its real cost-per- token approaches the theoretical floor. Your single-tenant deployment has to manufacture that same utilization itself, which means real-time autoscaling, request batching, and enough steady traffic to keep the queue full — operational work that does not appear anywhere on the GPU rental invoice.
And utilization is only the most visible hidden cost. The same analysis estimates a 3-5x multiplier over the raw GPU price once you account for DevOps salaries near $145,000 per year, periodic model update cycles around $12,000 each, and ongoing infrastructure overhead. When you model self-hosting, model the whole stack: idle capacity, the engineers who keep it alive, and the upgrade treadmill — not just the line item that is easiest to see.
The utilization penalty
Per one analysis, dropping from saturated to ~10% load lifts effective cost per 1K tokens from roughly $0.013 to $0.13. The GPU bills the same; you simply put fewer tokens through it. This is the number that decides most self-hosting outcomes.
Beyond the GPU invoice
DevOps salaries (~$145K/yr), model update cycles (~$12K each), and infrastructure overhead typically add a 3-5x multiplier over the raw GPU rental, per the same cost analysis. The visible cost is the smallest part.
Provider choice is first-order
H100 SXM rates run from roughly $1.03/hr spot on budget providers to about $6.88/hr on AWS, with on-demand around $2.49-3.44/hr on specialist clouds. A 4-12x spread makes provider selection a primary TCO variable.
04 — Serving Framework ChoiceThe 2026 stack: vLLM and SGLang, with TGI retired.
If your last serving comparison is more than a year old, update it. The biggest change is a quiet one: Hugging Face moved its Text Generation Inference engine into maintenance mode on December 11, 2025, accepting only minor bug fixes and documentation updates and redirecting its engineering effort upstream into vLLM and SGLang. Existing TGI deployments continue to work, but new greenfield production should not start there.
Text Generation Inference is now in maintenance mode. We will only accept pull requests for minor bug fixes, documentation improvements, and lightweight maintenance tasks.— Lysandre Debut, Staff Engineer at Hugging Face, December 11, 2025
That leaves two clear front-runners and a prototyping tool. vLLM has become the de facto open-source inference standard, with more than 81,000 GitHub stars and over 2,000 contributors as of late May 2026. Its core trick, PagedAttention, manages the KV cache using OS-style virtual-memory paging — and because traditional serving pre-allocates memory for the maximum sequence length and wastes a large share of GPU memory on average requests, vLLM reports up to 24x higher serving throughput than naive static batching. Its continuous (iteration-level) batching keeps the GPU saturated at every forward pass, which the project reports as 3-10x higher throughput on the same hardware.
SGLang is the underreported fork in the road. On standard Llama 3.1 8B workloads on H100 GPUs, one benchmark puts it about 29% ahead of vLLM in raw throughput, with faster time-to-first-token and lower inter-token latency. Its real differentiator is RadixAttention for prefix-heavy workloads: when many requests share a system prompt or a document context — exactly the RAG and multi-turn pattern — it reports throughput gains of up to 6.4x by reusing the shared prefix. For DeepSeek V-series inference specifically, its specialized Multi-head Latent Attention backend is reported around 3.1x faster than vLLM, though under heavy concurrent saturation the two converge.
SGLang vs vLLM · selected reported metrics (Llama 3.1 8B, H100)
Source: particula.tech SGLang vs vLLM benchmark. Figures are workload-specific; benchmark your own prompts.Do not over-read those numbers into a blanket "SGLang wins." Both engines are excellent and the gap narrows under saturation; vLLM's breadth — 200-plus model architectures and hardware support spanning NVIDIA, AMD ROCm, Google TPU, Intel Gaudi, Apple Silicon, and CPUs — makes it the safer default for a mixed model fleet. The practical heuristic is workload-shaped: vLLM for general and batch throughput and broad hardware coverage, SGLang for prefix-heavy RAG and multi-turn and DeepSeek V-series serving, Ollama for desktop prototyping, and a careful look at NVIDIA's TensorRT-LLM only if you are all-in on NVIDIA and chasing peak FP8 throughput. NVIDIA's own page cites over 10,000 output tokens per second on H100 at FP8 with sub-100ms TTFT, but that is a vendor-stated figure — validate it against independent benchmarks on your model before believing it.
| Use case | Recommended engine | Why |
|---|---|---|
| Prototyping / single dev | Ollama | Zero-friction local inference with an OpenAI-compatible surface. Validate prompts and integration before any production stack. Graduate to vLLM when concurrent traffic appears. |
| Batch analytics / high throughput | vLLM | PagedAttention plus continuous batching keep the GPU saturated. The de facto standard, broadest model and hardware coverage, the safe default for a mixed fleet. |
| RAG / multi-turn (shared prefixes) | SGLang | RadixAttention reuses shared system prompts and document context for reported gains up to 6.4x on prefix-heavy workloads. Also the strongest pick for DeepSeek V-series serving. |
| NVIDIA-only, peak FP8 throughput | TensorRT-LLM | Highest reported FP8 throughput on H100 (vendor-stated, verify independently). More setup overhead; worth it only when you are committed to NVIDIA and squeezing maximum tokens/sec. |
| New deployment on TGI | Do not start here | Text Generation Inference entered maintenance mode on December 11, 2025. Existing deployments keep working; new greenfield production should default to vLLM or SGLang instead. |
05 — GPU Sizing By ModelHow much VRAM each model tier actually needs.
Once you have decided to self-host, the first concrete question is which GPU and how many. The sizing rule is simple to state: multiply the parameter count by the bytes-per-precision, then add 20-40% headroom for activations, the KV cache, and serving overhead. FP16 is 2 bytes per parameter, FP8 is 1 byte, and INT4 is 0.5 bytes. A 70B model at INT4 is roughly 35GB of weights, so about 42GB minimum with 20% overhead — which fits comfortably on a single 80GB card.
Quantization is what makes single-GPU serving of large models tractable. AWQ is the current best-practice INT4 format for vLLM production, identifying activation-salient weights to preserve more quality than GPTQ at the same bit-width, and INT4 methods generally stay within about 6% of baseline perplexity on most tasks. One important caveat: multi-step reasoning and math are more sensitive to quantization quality loss, so validate those use cases explicitly before shipping a 4-bit model into a reasoning-heavy pipeline.
| Model | VRAM (quantized) | GPU footprint |
|---|---|---|
| Llama 4 Scout · 109B / 17B active MoE | ~55 GB at INT4 | Fits on a single H100 80GB at INT4, around $2.50/hr on-demand on specialist clouds. The sweet spot for single-card MoE serving in 2026. |
| Qwen3 72B-class | ~36 GB at INT4 | Also single H100 territory at INT4 — roughly 36GB of weights plus headroom. Comfortable margin for moderate context lengths on one 80GB card. |
| Llama 4 Maverick · 400B / 17B active | INT4, multi-GPU | Needs roughly 4x H100 at INT4 (on the order of $10/hr). Tensor-parallel serving; the point where single-card economics end and cluster operations begin. |
| DeepSeek V3.2 · 671B | FP8, large cluster | Requires around 8x H200 at FP8 (roughly $36/hr). Frontier-scale open weights; only worthwhile at high sustained volume or hard sovereignty requirements. |
Card choice matters as much as card count, and the decisive spec for large or long-context models is memory, not raw compute. The H100 80GB SXM offers 3.35 TB/s of HBM3 bandwidth; the H200 raises that to 4.8 TB/s of HBM3e — about 43% more bandwidth — and lifts capacity to 141GB, roughly 76% more VRAM. That extra capacity is the H200's real advantage for 100B-plus models and long-context serving, where the KV cache alone can consume tens of gigabytes. By contrast, the L40S uses 48GB of GDDR6 at around 846 GB/s — roughly 4x less bandwidth than an H100 — and is positioned for cost-sensitive inference of smaller models where GDDR6's lower cost per GB offsets the bandwidth gap.
Keep the KV cache in view when you size. A 70B model at a 128K-token context can need on the order of 42GB of KV cache by itself, on top of the weights — which is exactly why long-context workloads push you toward higher-VRAM cards or more aggressive compression. And the GPU need not be NVIDIA at all: AMD ROCm is now a first-class vLLM target, with roughly 93% of the vLLM AMD test suite passing as of January 2026, so teams are no longer locked to a single vendor for production inference. If you are weighing the full picture against managed options, our AI digital transformation engagements start with exactly this kind of comparative deployment eval.
06 — When Cost Is IrrelevantCompliance is the forcing function.
For one class of organization the entire breakeven discussion is beside the point. If you process regulated data — protected health information, EU personal data at scale, or anything under a strict SOC2 boundary — the question is not whether self-hosting is cheaper. It is whether you are legally permitted to send that data to a third-party API at all. Often the answer is no, and that makes a self-hosted or VPC deployment the only straightforward compliant path, regardless of token volume.
HIPAA is the clearest example. Processing protected health information through any provider acting as a business associate requires a Business Associate Agreement, and standard consumer APIs from the major frontier vendors lack BAA coverage by default. Transmitting PHI to such a service without a BAA in place is itself a compliance violation — which is why clinical and billing workloads so often land on self-hosted infrastructure even when the per-token math would favor an API.
The practical takeaway is to sequence the questions correctly. Ask the compliance question before the cost question. If a BAA, an SCC gap, or a data-residency requirement removes managed APIs from the table, you self-host — and the matrices above stop being about "whether" and start being about "which GPU and which framework." If compliance is not a constraint, then the cost crossover and utilization math run the decision.
07 — Your Deployment DecisionMatch the path to your actual workload.
Putting it together yields a short, opinionated decision tree. Most teams will land in one of four buckets, and the right move differs sharply between them. The mistake to avoid is treating self-hosting as a status decision rather than an economic and regulatory one.
Under ~250M tokens/month, no compliance constraint
The crossover does not arrive against any realistic API tier, and idle GPU time would dominate your cost. Prototype locally on Ollama, then serve on a managed open-weight API. Revisit only when sustained volume climbs.
Billions of tokens at high utilization
Past roughly 1B tokens/month at sustained 60-70%+ load, owning capacity wins against most API tiers, with reported savings up to ~5x at industrial scale. Serve on vLLM (or SGLang for RAG), size per the GPU tables, and instrument utilization relentlessly.
HIPAA / GDPR / SOC2-bound workloads
A BAA gap, an SCC requirement, or a data-residency rule can remove managed APIs from consideration regardless of volume. Self-host or deploy in a controlled VPC, and treat the cost matrix as a sizing exercise rather than a go/no-go.
Prefix-sharing retrieval pipelines
If most requests share a long system prompt or document context, SGLang's RadixAttention reports up to 6.4x throughput on shared prefixes. Choose the engine for the workload shape, not the brand — and benchmark on your own prompts before defaulting.
Whichever bucket you fall into, the discipline is the same: run a real A/B before committing capital. Stand up the open-weight model on a managed API, measure your true monthly token volume and your achievable GPU utilization, and only then price a self-hosted deployment against that baseline using live cloud GPU rates. The numbers in this guide are a map of the terrain — your own measurements are the route. For a deeper look at the per-token economics behind the breakeven curves, our current token pricing across frontier and open-weight models and the broader open-weight versus closed-source model tradeoffs give the surrounding context.
If your self-hosting case is driven by RAG, the supporting infrastructure decisions matter as much as the serving engine — choosing the right vector database for RAG applications and modeling AI inference cost optimization end to end will move your effective cost more than a 29% framework throughput edge ever will.
08 — ConclusionSelf-hosting is an economics decision, not a capability one.
Own the GPU only when volume, utilization, or law makes the math work.
The open-weight models are ready. Llama 4, the DeepSeek V-series, and the Qwen3 checkpoints give you capability you can run, fine-tune, and own. What stops self-hosting from paying off is rarely the model and almost always the economics — the crossover volume you have not reached, the GPU utilization you cannot sustain, and the DevOps cost you did not budget for.
The honest framing is the useful one. Against budget open-weight APIs, the breakeven can sit in the billions of tokens per month — territory most teams never enter. Against premium frontier APIs it arrives earlier, but only at high, sustained utilization, and a 10x penalty waits for any deployment that runs idle. Below the crossover, a managed open-weight API is faster to ship and cheaper to run. Above it, owning capacity on vLLM or SGLang can cut the bill several-fold. And when compliance forces your hand, the math turns into a sizing exercise rather than a choice.
So sequence it deliberately: check compliance first, measure your real volume and utilization second, and price a self-hosted deployment against a live managed-API baseline only after that. The teams that get this right in 2026 are not the ones with the biggest GPUs — they are the ones who ran the numbers before they bought anything.