AI Development · Decision Matrix · 14 min read · Published May 27, 2026

Cost crossover · vLLM vs SGLang · GPU sizing · 10x utilization trap

Self-Hosting Open-Weight LLMs: When It Actually Wins

Self-hosting open-weight models feels cheaper than paying per token — until you run the numbers. The cost crossover against budget APIs does not arrive until billions of tokens a month, GPU utilization quietly multiplies your real cost-per-token, and the production serving stack changed in late 2025. This is the decision framework, with the math shown.

Digital Applied Team
Senior strategists · Published May 27, 2026
Sources: 12 cited
  • Breakeven vs DeepSeek V4 Flash: 5.7B tokens/month, single H100
  • Breakeven vs GPT-5-class API: 256M tokens/month, ~60-70% util
  • Cost at 10% GPU load: 10x per-token vs full load
  • vLLM PagedAttention throughput: 24x vs naive static batching (reported)

Self-hosting open-weight LLMs is the deployment decision teams get wrong most often in 2026 — not because the engineering is hard, but because the cost intuition is backwards. Owning the GPU feels cheaper than paying a per-token API bill, yet for most workloads the math never crosses over, and the costs that sink the project rarely show up on the GPU invoice.

The open-weight ecosystem has never been stronger. Llama 4, the DeepSeek V-series, and the Qwen3 open checkpoints give you frontier-adjacent capability you can run on your own hardware, fine-tune on proprietary data, and operate without an external dependency. The question is no longer whether you can self-host; it is whether you should, and the honest answer depends on volume, utilization, and compliance, not on the model.

This guide gives you the framework: a token-volume breakeven matrix that shows the actual crossover points, the GPU-utilization multiplier that quietly wrecks under-loaded deployments, a 2026 serving-framework decision matrix that reflects the December 2025 sunset of Text Generation Inference, GPU sizing tables by model tier, and the compliance scenarios where the cost math stops mattering entirely. Every figure below is sourced; treat all pricing as a snapshot to re-verify, not a constant.

Key takeaways
  1. The cost crossover arrives much later than teams expect. Against a budget API like DeepSeek V4 Flash (around $0.14 per 1M input tokens), one analysis puts the breakeven for a single H100 at roughly 5.7 billion tokens per month. Against a premium GPT-5-class API at about $5 per 1M, the crossover can arrive near 256 million tokens per month, but that figure assumes 60-70% sustained GPU utilization.
  2. GPU utilization is the silent cost multiplier. Effective cost-per-token scales inversely with load. At roughly 10% utilization, your real cost per token can be about 10x the headline GPU rate, enough to make an idle H100 more expensive per token than a premium frontier API. Self-hosting economics only work with high, sustained throughput.
  3. The hidden costs are 3-5x the raw GPU price. DevOps salaries, model update cycles, and infrastructure overhead typically add a 3-5x multiplier on top of the GPU rental alone, per one cost analysis. The GPU invoice is the part of self-hosting that is easy to see and the smallest part of the bill.
  4. The 2026 serving stack is vLLM and SGLang, not TGI. Hugging Face moved Text Generation Inference to maintenance mode on December 11, 2025, redirecting effort upstream to vLLM and SGLang. New greenfield deployments should default to vLLM for batch throughput and SGLang for prefix-heavy RAG and multi-turn workloads.
  5. Compliance can override the cost math entirely. For HIPAA, GDPR, or SOC2-bound workloads, the breakeven analysis is often moot. Without a Business Associate Agreement, standard consumer APIs cannot lawfully process protected health information, which can make a self-hosted VPC deployment the only straightforward compliant path.

01 · The Real Question
It is not capability. It is volume, utilization, and compliance.

Most self-hosting debates start in the wrong place. Teams compare benchmark scores, decide an open-weight model is "good enough," and conclude that running it themselves must be cheaper than paying an API per token. The capability question is largely settled — open checkpoints from Llama 4, the DeepSeek V-series, and Qwen3 are genuinely strong. The questions that actually decide the outcome are three: how many tokens you process per month, how busy your GPU stays, and whether regulation forces your hand.

Get those three right and the decision makes itself. A team doing a few hundred million tokens a month at variable load, with no compliance constraint, is almost always cheaper and faster on a managed API. A team processing billions of tokens at steady, high utilization — or one that legally cannot send its data to a third party — is a strong self-hosting candidate. The middle ground is where the matrices below earn their keep.

Three deployment paths exist, and most teams should expect to use more than one. Self-hosting on your own GPUs gives maximum control and the only path to true on-prem data sovereignty. Managed open-weight APIs — Together AI, Fireworks AI, and others — give you open-model weights without the operations burden, at per-token pricing. And local runtimes like Ollama give individual engineers a zero-friction way to prototype before any of this matters.

Path A · Self-host on your GPUs
vLLM / SGLang on rented or owned H100s

Maximum control, the only route to true on-prem data sovereignty, and the cheapest option past a high, sustained volume. Carries the full DevOps and utilization burden, which is the part that decides whether it actually saves money.

Summary: Wins at high, steady volume

Path B · Managed open-weight API
Together AI · Fireworks AI · per-token

Open-model weights served for you, billed per token. No GPU to keep busy, no inference stack to operate. The default for most teams below the crossover, and the cleanest A/B baseline against self-hosting.

Summary: Default below crossover

Path C · Local runtime (Ollama)
Desktop / single-node prototyping

Zero-friction local inference for engineers to prototype and validate against an OpenAI-compatible surface before any production decision. Not a production serving engine; graduate to vLLM when traffic justifies it.

Summary: Prototype, then graduate

02 · The Breakeven Matrix
Where self-hosting crosses over a managed API.

The single most useful artifact in this decision is a crossover curve: at what monthly token volume does owning the GPU become cheaper than paying per token? The answer depends entirely on which API you are comparing against. Against a premium frontier API, the crossover arrives relatively early. Against a hyper-cheap open-weight API, it may never arrive in practice.

One published analysis frames the extremes cleanly. At roughly 1 million tokens per day, self-hosting a single A100 80GB lands near $3,240 per month all-in — about $1,440 for cloud rental, around $1,500 in DevOps labor, and roughly $300 of infrastructure overhead — versus a budget API bill of only a few dollars at that volume. The crossover against DeepSeek V4 Flash, at about $0.14 per 1M input tokens, does not arrive until roughly 5.7 billion tokens per month. Against a GPT-5-class API near $5 per 1M, a separate estimate puts the crossover closer to 256 million tokens per month.
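The low-volume end of that comparison can be sketched in a few lines of Python. The dollar figures are the ones quoted above; the 30-day month and the choice to bill every token at the quoted input rate are simplifying assumptions, and the function names are illustrative only.

```python
# Rough monthly cost comparison at low volume, using the figures quoted above.
# Assumptions (not from the source): a 30-day month, and every token billed
# at the quoted input-token rate.

SELF_HOST_MONTHLY_USD = {
    "cloud_rental": 1_440,    # single A100 80GB
    "devops_labor": 1_500,
    "infra_overhead": 300,
}


def self_host_monthly_cost(components: dict) -> float:
    """Sum the fixed monthly cost components of a self-hosted deployment."""
    return float(sum(components.values()))


def api_monthly_cost(tokens_per_day: float, usd_per_million: float, days: int = 30) -> float:
    """Per-token API bill for a month at a steady daily volume."""
    return tokens_per_day * days / 1_000_000 * usd_per_million


self_host = self_host_monthly_cost(SELF_HOST_MONTHLY_USD)
api = api_monthly_cost(tokens_per_day=1_000_000, usd_per_million=0.14)
print(f"self-host: ${self_host:,.0f}/month, budget API: ${api:.2f}/month")
```

Swapping in your own daily volume and API rate is the quickest way to see how far below the crossover a workload sits.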

Monthly token volume needed before self-hosting wins, by API tier

  • Breakeven vs DeepSeek V4 Flash (~$0.14 / 1M input, single H100): ~5.7B tokens/month
  • Breakeven vs Claude Sonnet-class API (~$3 / 1M, 60-70% utilization assumed): ~430M tokens/month
  • Breakeven vs GPT-5-class API (~$5 / 1M, 60-70% utilization assumed): ~256M tokens/month

Sources: devtk.ai self-hosting cost analysis; braincuber cost-performance analysis. Sonnet-class figure interpolated from the same model.
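The crossover points above are consistent with the simplest possible model: a flat monthly self-hosting cost set against a linear per-token API bill. The sketch below assumes that model; the cited analyses do not publish their exact formulas, so treat the function names and the flat-cost simplification as assumptions rather than the sources' method.

```python
def breakeven_tokens_per_month(fixed_monthly_cost_usd: float, api_usd_per_million: float) -> float:
    """Monthly token volume at which a flat self-hosting cost equals the API bill."""
    return fixed_monthly_cost_usd / api_usd_per_million * 1_000_000


def implied_monthly_cost(breakeven_tokens: float, api_usd_per_million: float) -> float:
    """Invert the model: the flat monthly cost that a published breakeven implies."""
    return breakeven_tokens / 1_000_000 * api_usd_per_million


# The ~256M tokens/month crossover at ~$5 per 1M implies about $1,280 per month.
premium_cost = implied_monthly_cost(256e6, 5.0)

# Re-using that cost at ~$3 per 1M lands close to the ~430M Sonnet-class figure.
sonnet_breakeven = breakeven_tokens_per_month(premium_cost, 3.0)

# The ~5.7B crossover at ~$0.14 per 1M implies a different, lower cost basis.
budget_cost = implied_monthly_cost(5.7e9, 0.14)

print(f"${premium_cost:,.0f}  {sonnet_breakeven / 1e6:,.0f}M tokens  ${budget_cost:,.0f}")
```

Under this model the two published figures imply different monthly cost bases, which is what you would expect from separate analyses. Re-run it with a single cost figure of your own before comparing tiers.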

Read that spread carefully, because it is the whole argument. The cheaper your API alternative, the more tokens you must self-serve before owning hardware pays off — and budget open-weight APIs are now so cheap that the crossover sits in territory most teams will never reach. The reason a budget API can undercut your own H100 is structural: the provider runs that GPU at near-saturation across thousands of customers, so its effective cost-per-token is far below what a single tenant achieves on a partially idle box. The next section is the direct consequence of that fact.

At genuine industrial scale the picture flips hard. The same body of analysis puts self-hosting a Llama 70B at around 500 million tokens per day near $4,360 per month, versus roughly $22,500 per month on managed APIs — about a 5x win for self-hosting. The lesson is not "self-hosting is bad" or "self-hosting is good." It is that the answer is a step function of volume, and you need to find your own row on the curve.
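Converting those monthly totals into a cost per million tokens makes the gap easier to compare with API price lists. This is a small sketch using the figures quoted above; the 30-day month is an assumption.

```python
def usd_per_million_tokens(monthly_cost_usd: float, tokens_per_day: float, days: int = 30) -> float:
    """Effective cost per 1M tokens for a monthly bill at a steady daily volume."""
    millions_per_month = tokens_per_day * days / 1_000_000
    return monthly_cost_usd / millions_per_month


TOKENS_PER_DAY = 500e6  # volume quoted above

self_hosted = usd_per_million_tokens(4_360, TOKENS_PER_DAY)
managed = usd_per_million_tokens(22_500, TOKENS_PER_DAY)
savings_ratio = managed / self_hosted

print(f"self-hosted ${self_hosted:.2f}/1M, managed ${managed:.2f}/1M, ratio {savings_ratio:.2f}x")
```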

Monthly volume: Under ~250M tokens
Cheaper option: Managed API
Why: Below the crossover against essentially every API tier. The DevOps labor and idle-capacity cost of a dedicated GPU dwarf the token bill. Prototype on Ollama, serve on a managed open-weight API.

Monthly volume: ~250M-1B tokens
Cheaper option: Depends on API tier
Why: Crosses over against premium frontier APIs (~$5/1M) near 256M/mo at high utilization, but still loses badly to budget open-weight APIs. Self-host only if you are replacing expensive frontier calls and can keep the GPU busy.

Monthly volume: 1B-5B tokens
Cheaper option: Self-host (usually)
Why: The range where owning capacity at sustained high utilization typically wins against mid-tier APIs. Below the ~5.7B breakeven vs the very cheapest budget APIs, so the comparison still depends on your specific API alternative.

Monthly volume: 5B+ tokens
Cheaper option: Self-host
Why: Industrial scale. Self-hosting wins against nearly all alternatives, with reported savings on the order of 5x against managed APIs at the high end, provided utilization stays high and the operations team is already in place.

Read the assumptions, not just the number
Every breakeven figure here is conditional. The ~256M-tokens/month crossover versus a GPT-5-class API assumes roughly 60-70% sustained GPU utilization. Drop to ~30% and that breakeven can roughly double. Treat these as the shape of the curve, then re-run the math with your own utilization, your own GPU rate, and live API pricing before committing capital.

03 · The Utilization Trap
The 10x multiplier nobody puts on the slide.

Here is the failure mode that turns a sound-looking self-hosting plan into a money pit. A GPU bills the same whether it is running at 100% or 5% of capacity. Your headline "cost per 1,000 tokens" is only achievable when the card is saturated. The moment your traffic is bursty, overprovisioned, or simply lower than projected, your effective cost-per-token climbs in inverse proportion to load — and it climbs fast.

The arithmetic is unforgiving. If a cloud GPU is running at 10% load, one analysis shows the effective cost per 1,000 tokens jumping from about $0.013 to roughly $0.13 — a 10x increase that makes an under-loaded H100 more expensive per token than a premium API service. The headline rate of about $0.013 per 1K tokens is only valid at high utilization; quote it without the load assumption and you have mispriced the entire project.
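The inverse relationship is easy to make concrete. A minimal sketch, assuming the GPU bill is fixed and token output scales linearly with utilization; the function name is illustrative.

```python
def effective_cost_per_1k(headline_cost_per_1k: float, utilization: float) -> float:
    """Effective cost per 1,000 tokens when the GPU is only partly loaded.

    The GPU bills the same at any load, so cost per token scales as 1 / utilization.
    """
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in the range (0, 1]")
    return headline_cost_per_1k / utilization


HEADLINE = 0.013  # USD per 1K tokens at full load, as quoted above

for utilization in (1.0, 0.65, 0.30, 0.10):
    cost = effective_cost_per_1k(HEADLINE, utilization)
    print(f"{utilization:>4.0%} load -> ${cost:.3f} per 1K tokens")
```

Feeding in your measured utilization rather than the planned one is usually the more sobering exercise.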

If your cloud GPU is running at 10% load, your effective cost per 1,000 tokens jumps from $0.013 to $0.13 — more expensive than premium API services.— Self-hosted vs API cost-performance analysis, braincuber.com

This is why managed APIs are so hard to beat at low and medium volume, and it is the mechanism behind the breakeven spread in the previous section. A provider amortizes one physical GPU across many tenants, so the card stays near saturation and its real cost-per-token approaches the theoretical floor. Your single-tenant deployment has to manufacture that same utilization itself, which means real-time autoscaling, request batching, and enough steady traffic to keep the queue full: operational work that does not appear anywhere on the GPU rental invoice.

And utilization is only the most visible hidden cost. The same analysis estimates a 3-5x multiplier over the raw GPU price once you account for DevOps salaries near $145,000 per year, periodic model update cycles around $12,000 each, and ongoing infrastructure overhead. When you model self-hosting, model the whole stack: idle capacity, the engineers who keep it alive, and the upgrade treadmill — not just the line item that is easiest to see.
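One way to model the whole stack is to add each hidden line item to the GPU rental and see what multiplier falls out. The sketch below is hedged: the salary and update-cycle costs are the figures quoted above, the $1,440 rental and $300 overhead are borrowed from the A100 example earlier, and the share of an engineer's time (25%) and the number of update cycles per year (2) are hypothetical inputs chosen purely for illustration.

```python
def monthly_tco(
    gpu_rental_usd: float,
    devops_salary_usd: float,
    devops_fraction: float,        # hypothetical: share of one engineer's time
    update_cycles_per_year: float, # hypothetical: how often you re-deploy models
    update_cycle_cost_usd: float,
    infra_overhead_usd: float,
) -> float:
    """Monthly total cost of ownership: GPU rental plus amortized hidden costs."""
    labor = devops_salary_usd * devops_fraction / 12
    updates = update_cycles_per_year * update_cycle_cost_usd / 12
    return gpu_rental_usd + labor + updates + infra_overhead_usd


gpu_rental = 1_440
total = monthly_tco(
    gpu_rental_usd=gpu_rental,
    devops_salary_usd=145_000,
    devops_fraction=0.25,
    update_cycles_per_year=2,
    update_cycle_cost_usd=12_000,
    infra_overhead_usd=300,
)
multiplier = total / gpu_rental
print(f"total ${total:,.0f}/month, {multiplier:.1f}x the GPU rental")
```

With these illustrative inputs the multiplier lands inside the 3-5x band; your own staffing model will move it.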

Effective cost @ 10% load · The utilization penalty: 10x ($0.013 → $0.13 / 1K)

Per one analysis, dropping from saturated to ~10% load lifts effective cost per 1K tokens from roughly $0.013 to $0.13. The GPU bills the same; you simply put fewer tokens through it. This is the number that decides most self-hosting outcomes.

Hidden-cost multiplier · Beyond the GPU invoice: 3-5x (labor + updates + overhead)

DevOps salaries (~$145K/yr), model update cycles (~$12K each), and infrastructure overhead typically add a 3-5x multiplier over the raw GPU rental, per the same cost analysis. The visible cost is the smallest part.

Cloud H100 price spread · Provider choice is first-order: 4-12x (spot vs hyperscaler)

H100 SXM rates run from roughly $1.03/hr spot on budget providers to about $6.88/hr on AWS, with on-demand around $2.49-3.44/hr on specialist clouds. Those two endpoints alone differ by roughly 6.7x, inside the cited 4-12x spread, which makes provider selection a primary TCO variable.

04 · Serving Framework Choice
The 2026 stack: vLLM and SGLang, with TGI retired.

If your last serving comparison is more than a year old, update it. The biggest change is a quiet one: Hugging Face moved its Text Generation Inference engine into maintenance mode on December 11, 2025, accepting only minor bug fixes and documentation updates and redirecting its engineering effort upstream into vLLM and SGLang. Existing TGI deployments continue to work, but new greenfield production should not start there.

Text Generation Inference is now in maintenance mode. We will only accept pull requests for minor bug fixes, documentation improvements, and lightweight maintenance tasks.— Lysandre Debut, Staff Engineer at Hugging Face, December 11, 2025

That leaves two clear front-runners and a prototyping tool. vLLM has become the de facto open-source inference standard, with more than 81,000 GitHub stars and over 2,000 contributors as of late May 2026. Its core trick, PagedAttention, manages the KV cache using OS-style virtual-memory paging — and because traditional serving pre-allocates memory for the maximum sequence length and wastes a large share of GPU memory on average requests, vLLM reports up to 24x higher serving throughput than naive static batching. Its continuous (iteration-level) batching keeps the GPU saturated at every forward pass, which the project reports as 3-10x higher throughput on the same hardware.
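To see why paging helps, compare how many KV-cache token slots each allocation strategy reserves for the same set of requests. This is a toy accounting model, not vLLM's implementation: the 16-token block size, the 4,096-token maximum, and the request lengths are all illustrative assumptions, and the resulting ratio describes reserved memory, not throughput.

```python
import math


def static_kv_slots(seq_lens: list, max_seq_len: int) -> int:
    """Token slots reserved when every request pre-allocates max_seq_len."""
    return len(seq_lens) * max_seq_len


def paged_kv_slots(seq_lens: list, block_size: int = 16) -> int:
    """Token slots reserved when memory is handed out in fixed-size blocks.

    Each request wastes at most one partially filled block.
    """
    return sum(math.ceil(n / block_size) * block_size for n in seq_lens)


seq_lens = [120, 350, 90, 2000]  # hypothetical in-flight request lengths

static = static_kv_slots(seq_lens, max_seq_len=4096)
paged = paged_kv_slots(seq_lens)
print(f"static: {static} slots, paged: {paged} slots, {static / paged:.1f}x less reserved")
```

Memory freed this way goes to additional concurrent requests, which is where the batching gains come from.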

SGLang is the underreported alternative. On standard Llama 3.1 8B workloads on H100 GPUs, one benchmark puts it about 29% ahead of vLLM in raw throughput, with faster time-to-first-token and lower inter-token latency. Its real differentiator is RadixAttention for prefix-heavy workloads: when many requests share a system prompt or a document context, which is exactly the RAG and multi-turn pattern, it reports throughput gains of up to 6.4x by reusing the shared prefix. For DeepSeek V-series inference specifically, its specialized Multi-head Latent Attention backend is reported around 3.1x faster than vLLM, though under heavy concurrent saturation the two converge.
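The prefix-reuse idea can be illustrated with a token-level trie that counts how many tokens still need prefill once earlier requests are cached. This is a simplified stand-in for a radix tree, not SGLang's code; the class name, the 1,000-token system prompt, and the 50-token questions are hypothetical.

```python
class PrefixCache:
    """Token-level trie standing in for a radix tree of cached prefixes."""

    def __init__(self) -> None:
        self._root: dict = {}

    def prefill_cost(self, tokens: list) -> int:
        """Insert a request and return how many of its tokens were not cached."""
        node = self._root
        uncached = 0
        for tok in tokens:
            child = node.get(tok)
            if child is None:
                child = {}
                node[tok] = child
                uncached += 1
            node = child
        return uncached


system_prompt = list(range(1000))  # shared prefix (token ids)
requests = [
    system_prompt + list(range(10_000 + i * 50, 10_000 + (i + 1) * 50))
    for i in range(8)              # eight requests with distinct 50-token suffixes
]

without_reuse = sum(len(r) for r in requests)
cache = PrefixCache()
with_reuse = sum(cache.prefill_cost(r) for r in requests)
print(f"prefill tokens: {without_reuse} without reuse, {with_reuse} with reuse")
```

The longer the shared prefix relative to the unique suffix, the larger the saving, which is why the gain is workload-specific.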

SGLang vs vLLM · selected reported metrics (Llama 3.1 8B, H100)

  • Throughput: SGLang ~16,200 tok/s vs vLLM ~12,500 tok/s (+29% for SGLang)
  • Time-to-first-token: SGLang 79ms vs vLLM 103ms
  • Inter-token latency: SGLang 6.0ms vs vLLM 7.1ms
  • Prefix-heavy RAG (RadixAttention): up to 6.4x throughput on shared prefixes (SGLang)

Source: particula.tech SGLang vs vLLM benchmark. Figures are workload-specific; benchmark your own prompts.

Do not over-read those numbers into a blanket "SGLang wins." Both engines are excellent and the gap narrows under saturation; vLLM's breadth — 200-plus model architectures and hardware support spanning NVIDIA, AMD ROCm, Google TPU, Intel Gaudi, Apple Silicon, and CPUs — makes it the safer default for a mixed model fleet. The practical heuristic is workload-shaped: vLLM for general and batch throughput and broad hardware coverage, SGLang for prefix-heavy RAG and multi-turn and DeepSeek V-series serving, Ollama for desktop prototyping, and a careful look at NVIDIA's TensorRT-LLM only if you are all-in on NVIDIA and chasing peak FP8 throughput. NVIDIA's own page cites over 10,000 output tokens per second on H100 at FP8 with sub-100ms TTFT, but that is a vendor-stated figure — validate it against independent benchmarks on your model before believing it.

Use case: Prototyping / single dev
Recommended engine: Ollama
Why: Zero-friction local inference with an OpenAI-compatible surface. Validate prompts and integration before any production stack. Graduate to vLLM when concurrent traffic appears.

Use case: Batch analytics / high throughput
Recommended engine: vLLM
Why: PagedAttention plus continuous batching keep the GPU saturated. The de facto standard, broadest model and hardware coverage, the safe default for a mixed fleet.

Use case: RAG / multi-turn (shared prefixes)
Recommended engine: SGLang
Why: RadixAttention reuses shared system prompts and document context for reported gains up to 6.4x on prefix-heavy workloads. Also the strongest pick for DeepSeek V-series serving.

Use case: NVIDIA-only, peak FP8 throughput
Recommended engine: TensorRT-LLM
Why: Highest reported FP8 throughput on H100 (vendor-stated, verify independently). More setup overhead; worth it only when you are committed to NVIDIA and squeezing maximum tokens/sec.

Use case: New deployment on TGI
Recommended engine: Do not start here
Why: Text Generation Inference entered maintenance mode on December 11, 2025. Existing deployments keep working; new greenfield production should default to vLLM or SGLang instead.
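The matrix above can be encoded as a small lookup so the choice is explicit in planning scripts. This is a sketch of the heuristic only: the function and flag names are hypothetical, and the order in which the checks run when several flags are set is an assumption, not something the matrix specifies.

```python
def recommend_engine(
    *,
    prototyping: bool = False,
    nvidia_only_peak_fp8: bool = False,
    shared_prefix_heavy: bool = False,
    deepseek_v_series: bool = False,
) -> str:
    """Map a workload shape to the engine suggested by the matrix above."""
    if prototyping:
        return "Ollama"
    if nvidia_only_peak_fp8:
        return "TensorRT-LLM"
    if shared_prefix_heavy or deepseek_v_series:
        return "SGLang"
    # General and batch throughput, mixed fleets, broad hardware coverage.
    return "vLLM"


print(recommend_engine(shared_prefix_heavy=True))
print(recommend_engine())
```

Treat the output as a starting point for a benchmark on your own prompts, not as a verdict.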

05 · GPU Sizing By Model
How much VRAM each model tier actually needs.

Once you have decided to self-host, the first concrete question is which GPU and how many. The sizing rule is simple to state: multiply the parameter count by the bytes-per-precision, then add 20-40% headroom for activations, the KV cache, and serving overhead. FP16 is 2 bytes per parameter, FP8 is 1 byte, and INT4 is 0.5 bytes. A 70B model at INT4 is roughly 35GB of weights, so about 42GB minimum with 20% overhead — which fits comfortably on a single 80GB card.
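The sizing rule translates directly into code. A minimal sketch, assuming decimal gigabytes (1 GB = 1e9 bytes) and treating headroom as a flat percentage on top of the weights; the function names are illustrative.

```python
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}


def weights_gb(params_billion: float, precision: str) -> float:
    """Weight footprint in GB: parameter count times bytes per parameter."""
    return params_billion * BYTES_PER_PARAM[precision]


def min_vram_gb(params_billion: float, precision: str, headroom: float = 0.20) -> float:
    """Weights plus headroom for activations, KV cache, and serving overhead."""
    return weights_gb(params_billion, precision) * (1 + headroom)


for precision in ("fp16", "fp8", "int4"):
    low = min_vram_gb(70, precision, headroom=0.20)
    high = min_vram_gb(70, precision, headroom=0.40)
    print(f"70B @ {precision}: {low:.0f}-{high:.0f} GB")
```

Long contexts can exceed a flat 20-40% allowance, so size the KV cache separately when context length is large.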

Quantization is what makes single-GPU serving of large models tractable. AWQ is the current best-practice INT4 format for vLLM production, identifying activation-salient weights to preserve more quality than GPTQ at the same bit-width, and INT4 methods generally stay within about 6% of baseline perplexity on most tasks. One important caveat: multi-step reasoning and math are more sensitive to quantization quality loss, so validate those use cases explicitly before shipping a 4-bit model into a reasoning-heavy pipeline.

Model: Llama 4 Scout · 109B / 17B active MoE
VRAM (quantized): ~55 GB at INT4
GPU footprint: Fits on a single H100 80GB at INT4, around $2.50/hr on-demand on specialist clouds. The sweet spot for single-card MoE serving in 2026.

Model: Qwen3 72B-class
VRAM (quantized): ~36 GB at INT4
GPU footprint: Also single H100 territory at INT4, with roughly 36GB of weights plus headroom. Comfortable margin for moderate context lengths on one 80GB card.

Model: Llama 4 Maverick · 400B / 17B active
VRAM (quantized): INT4, multi-GPU
GPU footprint: Needs roughly 4x H100 at INT4 (on the order of $10/hr). Tensor-parallel serving; the point where single-card economics end and cluster operations begin.

Model: DeepSeek V3.2 · 671B
VRAM (quantized): FP8, large cluster
GPU footprint: Requires around 8x H200 at FP8 (roughly $36/hr). Frontier-scale open weights; only worthwhile at high sustained volume or hard sovereignty requirements.

Card choice matters as much as card count, and the decisive spec for large or long-context models is memory, not raw compute. The H100 80GB SXM offers 3.35 TB/s of HBM3 bandwidth; the H200 raises that to 4.8 TB/s of HBM3e (about 43% more bandwidth) and lifts capacity to 141GB, roughly 76% more VRAM. That extra capacity is the H200's real advantage for 100B-plus models and long-context serving, where the KV cache alone can consume tens of gigabytes. By contrast, the L40S uses 48GB of GDDR6 at around 864 GB/s, roughly a quarter of an H100's bandwidth, and is positioned for cost-sensitive inference of smaller models where GDDR6's lower cost per GB offsets the bandwidth gap.
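The percentages in that comparison are straightforward to reproduce from the headline specs. A small sketch; the L40S entry uses NVIDIA's published 864 GB/s figure, and the dictionary layout is just an illustrative way to hold the numbers.

```python
GPU_SPECS = {
    "H100 SXM": {"vram_gb": 80, "bandwidth_gb_s": 3350},
    "H200": {"vram_gb": 141, "bandwidth_gb_s": 4800},
    "L40S": {"vram_gb": 48, "bandwidth_gb_s": 864},
}


def percent_gain(new: float, base: float) -> float:
    """Relative increase of `new` over `base`, in percent."""
    return (new / base - 1) * 100


h100, h200, l40s = GPU_SPECS["H100 SXM"], GPU_SPECS["H200"], GPU_SPECS["L40S"]

bandwidth_gain = percent_gain(h200["bandwidth_gb_s"], h100["bandwidth_gb_s"])
vram_gain = percent_gain(h200["vram_gb"], h100["vram_gb"])
h100_vs_l40s = h100["bandwidth_gb_s"] / l40s["bandwidth_gb_s"]

print(f"H200 vs H100: +{bandwidth_gain:.0f}% bandwidth, +{vram_gain:.0f}% VRAM")
print(f"H100 vs L40S: {h100_vs_l40s:.1f}x the bandwidth")
```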

Keep the KV cache in view when you size. A 70B model at a 128K-token context can need on the order of 42GB of KV cache by itself, on top of the weights — which is exactly why long-context workloads push you toward higher-VRAM cards or more aggressive compression. And the GPU need not be NVIDIA at all: AMD ROCm is now a first-class vLLM target, with roughly 93% of the vLLM AMD test suite passing as of January 2026, so teams are no longer locked to a single vendor for production inference. If you are weighing the full picture against managed options, our AI digital transformation engagements start with exactly this kind of comparative deployment eval.
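A figure in that range can be reproduced from first principles. The sketch below assumes a Llama-70B-style architecture with grouped-query attention (80 layers, 8 KV heads, head dimension 128), an FP16 cache, and a single 128,000-token sequence; the source does not say which configuration its estimate uses, so these parameters are assumptions.

```python
def kv_cache_bytes(
    context_tokens: int,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,  # FP16
) -> int:
    """KV-cache size for one sequence: keys and values for every layer and token."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value  # 2 = K and V
    return context_tokens * per_token


# Assumed Llama-70B-style configuration (not stated in the source).
size_bytes = kv_cache_bytes(context_tokens=128_000, num_layers=80, num_kv_heads=8, head_dim=128)
size_gb = size_bytes / 1e9
print(f"KV cache for one 128K-token sequence: {size_gb:.1f} GB")
```

The cache grows linearly with both context length and the number of concurrent sequences, so multiply accordingly when sizing for batch serving.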

06 · When Cost Is Irrelevant
Compliance is the forcing function.

For one class of organization the entire breakeven discussion is beside the point. If you process regulated data — protected health information, EU personal data at scale, or anything under a strict SOC2 boundary — the question is not whether self-hosting is cheaper. It is whether you are legally permitted to send that data to a third-party API at all. Often the answer is no, and that makes a self-hosted or VPC deployment the only straightforward compliant path, regardless of token volume.

HIPAA is the clearest example. Processing protected health information through any provider acting as a business associate requires a Business Associate Agreement, and standard consumer APIs from the major frontier vendors lack BAA coverage by default. Transmitting PHI to such a service without a BAA in place is itself a compliance violation — which is why clinical and billing workloads so often land on self-hosted infrastructure even when the per-token math would favor an API.

Compliance can decide before cost does
For HIPAA, GDPR, and SOC2-bound workloads, run the compliance test first. Under GDPR Articles 44-49, transfers of EU personal data to US-based infrastructure need a valid transfer mechanism, most commonly Standard Contractual Clause documentation, so for organizations handling EU resident data at scale, a self-hosted VPC deployment inside the EU can eliminate the SCC compliance burden entirely. If regulation forces self-hosting, the breakeven matrix becomes a sizing exercise, not a go/no-go decision.

The practical takeaway is to sequence the questions correctly. Ask the compliance question before the cost question. If a missing BAA, an SCC gap, or a data-residency requirement removes managed APIs from the table, you self-host, and the matrices above stop being about "whether" and start being about "which GPU and which framework." If compliance is not a constraint, then the cost crossover and utilization math run the decision.

07 · Your Deployment Decision
Match the path to your actual workload.

Putting it together yields a short, opinionated decision tree. Most teams will land in one of four buckets, and the right move differs sharply between them. The mistake to avoid is treating self-hosting as a status decision rather than an economic and regulatory one.

Low / variable volume · Under ~250M tokens/month, no compliance constraint

The crossover does not arrive against any realistic API tier, and idle GPU time would dominate your cost. Prototype locally on Ollama, then serve on a managed open-weight API. Revisit only when sustained volume climbs.

Recommendation: Use a managed API

High, steady volume · Billions of tokens at high utilization

Past roughly 1B tokens/month at sustained 60-70%+ load, owning capacity wins against most API tiers, with reported savings up to ~5x at industrial scale. Serve on vLLM (or SGLang for RAG), size per the GPU tables, and instrument utilization relentlessly.

Recommendation: Self-host on vLLM / SGLang

Regulated data · HIPAA / GDPR / SOC2-bound workloads

A BAA gap, an SCC requirement, or a data-residency rule can remove managed APIs from consideration regardless of volume. Self-host or deploy in a controlled VPC, and treat the cost matrix as a sizing exercise rather than a go/no-go.

Recommendation: Self-host for compliance

RAG / multi-turn heavy · Prefix-sharing retrieval pipelines

If most requests share a long system prompt or document context, SGLang's RadixAttention reports up to 6.4x throughput on shared prefixes. Choose the engine for the workload shape, not the brand, and benchmark on your own prompts before defaulting.

Recommendation: Self-host on SGLang
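The four buckets can be written down as a short function that asks the compliance question first and the cost questions second. The thresholds are the approximate ones used in this guide; the function name, the exact cut-offs, and the handling of the middle band are assumptions made for illustration.

```python
def recommend_path(
    monthly_tokens: float,
    regulated_data: bool,
    sustained_utilization: float,
    shared_prefix_heavy: bool = False,
) -> str:
    """Sequence the decision: compliance first, then volume and utilization."""
    if regulated_data:
        return "Self-host for compliance"
    if monthly_tokens < 250e6:
        return "Use a managed API"
    if monthly_tokens >= 1e9 and sustained_utilization >= 0.60:
        return "Self-host on SGLang" if shared_prefix_heavy else "Self-host on vLLM / SGLang"
    # Middle band, or high volume without sustained load: the API tier decides.
    return "Depends on API tier; A/B against a managed-API baseline"


print(recommend_path(80e6, regulated_data=False, sustained_utilization=0.2))
print(recommend_path(3e9, regulated_data=False, sustained_utilization=0.7))
```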

Whichever bucket you fall into, the discipline is the same: run a real A/B before committing capital. Stand up the open-weight model on a managed API, measure your true monthly token volume and your achievable GPU utilization, and only then price a self-hosted deployment against that baseline using live cloud GPU rates. The numbers in this guide are a map of the terrain — your own measurements are the route. For a deeper look at the per-token economics behind the breakeven curves, our current token pricing across frontier and open-weight models and the broader open-weight versus closed-source model tradeoffs give the surrounding context.

If your self-hosting case is driven by RAG, the supporting infrastructure decisions matter as much as the serving engine — choosing the right vector database for RAG applications and modeling AI inference cost optimization end to end will move your effective cost more than a 29% framework throughput edge ever will.

08 · Conclusion
Self-hosting is an economics decision, not a capability one.

The shape of the decision, May 2026

Own the GPU only when volume, utilization, or law makes the math work.

The open-weight models are ready. Llama 4, the DeepSeek V-series, and the Qwen3 checkpoints give you capability you can run, fine-tune, and own. What stops self-hosting from paying off is rarely the model and almost always the economics — the crossover volume you have not reached, the GPU utilization you cannot sustain, and the DevOps cost you did not budget for.

The honest framing is the useful one. Against budget open-weight APIs, the breakeven can sit in the billions of tokens per month — territory most teams never enter. Against premium frontier APIs it arrives earlier, but only at high, sustained utilization, and a 10x penalty waits for any deployment that runs idle. Below the crossover, a managed open-weight API is faster to ship and cheaper to run. Above it, owning capacity on vLLM or SGLang can cut the bill several-fold. And when compliance forces your hand, the math turns into a sizing exercise rather than a choice.

So sequence it deliberately: check compliance first, measure your real volume and utilization second, and price a self-hosted deployment against a live managed-API baseline only after that. The teams that get this right in 2026 are not the ones with the biggest GPUs — they are the ones who ran the numbers before they bought anything.

FAQ · Self-hosting open-weight LLMs

The questions we get every week.

Usually not, until you reach high, sustained token volume. The crossover depends entirely on which API you compare against. Against a budget open-weight API like DeepSeek V4 Flash (around $0.14 per 1M input tokens), one analysis puts the breakeven for a single H100 near 5.7 billion tokens per month — a level most teams never reach. Against a premium GPT-5-class API near $5 per 1M, a separate estimate puts the crossover closer to 256 million tokens per month, but that assumes roughly 60-70% sustained GPU utilization. Below your crossover, a managed open-weight API is almost always cheaper and far less operational work. Above it, owning capacity can save several-fold. Run the math with your own volume and utilization before committing.