The AI build vs buy decision has quietly inverted. For three years the safe assumption was that you buy off-the-shelf AI for cost and only build custom for control. As of June 2026, open-weight models run roughly 10–12× cheaper than frontier SaaS at comparable capability tiers — which means "cheaper" and "more control" can now sit on the same side of the ledger.

That single shift breaks the old framework. When a June 2026 open-weight coding model can match a frontier SaaS model's benchmark profile at a fraction of the token cost, the question stops being "which is cheaper" and becomes "at what volume, with what data, under what lock-in risk does building pay off." The answer is no longer one number; it is a decision tree.

This guide lays out that tree for agencies and engineering teams: the price compression that re-priced the debate, a cost-per-capability comparison across six options, where the total-cost-of-ownership crossover actually sits, on-device self-hosting as a viable third leg, and the two costs that dwarf tokens — vendor lock-in and data sovereignty. Every figure below is sourced; where a number is vendor-stated or single-sourced, we say so and frame it accordingly.

Key takeaways

01
Open-weight pricing broke the old default. MiniMax M3 (launched June 1, 2026) is reported by its vendor to match GPT-5.5 on the SWE-bench Pro coding benchmark (59.0% vs 58.6%) while listing roughly 12× lower input pricing. 'Buy for cost' is no longer automatic.
02
The build-path crossover sits near ~1M conversations/year. Below that volume, advisory frameworks suggest the engineering and ops overhead of building (a senior engineer plus observability/ops) is not amortized. Above it, custom can pull ahead on unit economics.
03
Lock-in is the line item that dwarfs tokens. In Parallels' 2026 IT survey, 94% of organizations report concern about vendor lock-in. The Register cites a 16× switching-cost premium for organizations without prevention planning. Tokens are the visible cost; switching is the hidden one.
04
On-device self-host is now a real third option. NVIDIA's DGX Spark (~$4,699) runs models up to 200B parameters at 4-bit on the desk; RTX Spark for Windows was announced at Computex 2026 for fall shipping. Capex now competes with cloud OpEx for steady, sensitive workloads.
05
For agencies, the answer is a hybrid, not a side. Multi-client work makes data moats per-engagement rather than cumulative. The pattern that fits the agency model: buy the commodity intelligence layer, build the client-specific data and routing layer you control.

01 · The Inversion

What actually re-priced the build vs buy debate.

The old build vs buy logic rested on a stable assumption: frontier capability lived behind closed APIs, and you paid a premium for it. Building custom meant accepting weaker models to gain control. That premise has eroded fast. Open-weight model share on OpenRouter reportedly climbed to roughly 30% of total token volume by late 2025, up from a negligible share a year earlier — with Chinese open-source models moving from around 1.2% to nearly 30% of weekly share in some weeks, per OpenRouter's State of AI study.

The clearest single data point landed on June 1, 2026: MiniMax M3. Its vendor reports that it matches GPT-5.5 on the SWE-bench Pro coding benchmark — 59.0% versus 58.6% — while listing input pricing around 12× lower and output pricing roughly 12.5× lower. Those benchmark figures are vendor-stated and surfaced via aggregators rather than an independent leaderboard, so the honest framing is "matches," not "beats," and you should re-benchmark on your own tasks before switching defaults. But even held at "matches," the price gap is the story.

Frontier SaaS

GPT-5.5 & Opus 4.8

~$5 in / $25–$30 out per 1M tokens

The capability ceiling and the integration default. You pay for the frontier and the surrounding tooling. The premium is real, and for general knowledge work it can still be worth it.

Buy · highest capability, highest token cost

Open-weight API

MiniMax M3 / Qwen 3.7 Max

M3 ~$0.60 in / $2.40 out · Qwen ~$2.50 / $7.50

Open weights served via OpenRouter and direct APIs. Roughly an order of magnitude cheaper at comparable tiers; Qwen 3.7 Max offers a 1M-token context window. The same weights can also be self-hosted later.

Buy-then-build · cheapest commodity intelligence

Self-host

DGX Spark / RTX Spark

~$4,699 capex · up to 200B params at 4-bit

On-device or on-prem inference for steady, sensitive, or sovereignty-bound workloads. Capex replaces per-token OpEx; data never leaves your infrastructure. Viable for the first time at desk scale.

Build · capex over OpEx, full data control

Two clarifications keep this honest. First, the Qwen 3.7 Max rates above ($2.50 input / $7.50 output per 1M tokens) are the undiscounted list; a 50% launch discount has been running through the May–June 2026 window, so you may see roughly half those numbers temporarily. Second, the MMLU benchmark gap between open-source and proprietary frontier models reportedly narrowed from 17.5 to about 0.3 percentage points across 2025 — but MMLU is a single general-knowledge benchmark. The gap on specialized, long-horizon agentic tasks is wider than that headline suggests, and that nuance matters for any team treating "the gap is closed" as a procurement decision.

Why the cost curve matters more than any one model

Frontier inference cost for a fixed capability level has fallen on the order of ~10× per year since 2023 — GPT-4-class capability cost roughly $30/M tokens in early 2023 and is available under $1/M today via open-weight models, per OpenRouter's State of AI study. Treat "10× per year" as a directional trend, not a precise rule — it varies by tier and use case. The point is that any build vs buy model you write down today has a short shelf life. Design your stack to re-price, not to commit.
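One way to sanity-check a claimed yearly decline is to back it out from two price points. The sketch below does that for the endpoints quoted above. The 3.4-year span (early 2023 to June 2026) and the function names are assumptions made for illustration, and because "under $1/M" is only an upper bound on today's price, the result is a floor on the rate rather than an estimate of it.

```python
def implied_annual_factor(start_price: float, end_price: float, years: float) -> float:
    """Constant yearly price-drop factor that takes start_price to end_price."""
    if start_price <= 0 or end_price <= 0 or years <= 0:
        raise ValueError("prices and years must be positive")
    return (start_price / end_price) ** (1.0 / years)


def projected_price(start_price: float, annual_factor: float, years: float) -> float:
    """Price after `years` if cost falls by `annual_factor` every year."""
    return start_price / (annual_factor ** years)


# Endpoints from the text: ~$30 per 1M tokens in early 2023, $1 as today's upper bound.
YEARS_ELAPSED = 3.4  # assumption: early 2023 to June 2026
floor_factor = implied_annual_factor(30.0, 1.0, YEARS_ELAPSED)
print(f"floor on yearly decline: ~{floor_factor:.1f}x")

# What a strict 10x-per-year rule would imply from the same starting point.
strict_price = projected_price(30.0, 10.0, YEARS_ELAPSED)
print(f"price under a strict 10x/yr rule: ${strict_price:.3f} per 1M tokens")
```

With these inputs the floor works out to roughly 2.7× per year, which is why the yearly multiple is better read as a direction than as a constant to plug into a three-year budget.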

02 · Cost Per Capability

Six options, one table, with the column nobody prices.

Most published pricing comparisons stop at input and output cost. The column that changes the decision is the last one: does your data leave your infrastructure? You can save an order of magnitude on tokens, but if the trade is sending your prompts, logs, and proprietary context to a third-party (sometimes foreign-headquartered) provider, that is a governance decision, not just a pricing one. The table below puts both side by side. Prices are per 1M tokens as of June 2026; SWE-bench Pro figures for M3 and GPT-5.5 are vendor-stated and shown for relative framing only.

Option	Input / 1M	Output / 1M	Context	Posture	Data leaves infra?
GPT-5.5	$5.00	$30.00	Standard	Frontier SaaS	Yes
Claude Opus 4.8	$5.00	$25.00	Standard	Frontier SaaS	Yes
Qwen 3.7 Max	$2.50	$7.50	1M tokens	Open-weight API	Yes (API)
MiniMax M3	$0.60	$2.40	Standard	Open-weight API	Yes (API)
MiniMax M2.7	$0.279	$1.20	205K	Open-weight API	Yes (API)
DGX Spark (self-host)	Capex ~$4,699	+ electricity	Model-dependent	Self-host	No

Read down the "data leaves infra" column and the build vs buy question reframes itself. Frontier SaaS and open-weight APIs both route your data off-premise; the open-weight API just costs an order of magnitude less to do it. The only row that keeps data on your own hardware is self-host — and that is precisely the row that trades predictable capex for the per-token OpEx of the others. The decision is rarely "cheapest model." It is "cheapest model that satisfies my data-control and capability constraints." For the OpEx side of that calculation, our AI inference cost optimization playbook goes deeper on per-token spend control.
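The price multiple between two rows of the table depends on the input/output token mix, so it helps to compute it rather than quote a single number. The sketch below uses only the list prices from the table; the `input_share` values and function names are assumptions chosen for illustration.

```python
# List prices in USD per 1M tokens (input, output), copied from the table above.
PRICES = {
    "GPT-5.5": (5.00, 30.00),
    "Claude Opus 4.8": (5.00, 25.00),
    "Qwen 3.7 Max": (2.50, 7.50),
    "MiniMax M3": (0.60, 2.40),
    "MiniMax M2.7": (0.279, 1.20),
}


def blended_cost(model: str, input_share: float) -> float:
    """USD per 1M tokens when `input_share` of all tokens are input tokens."""
    if not 0.0 <= input_share <= 1.0:
        raise ValueError("input_share must be between 0 and 1")
    price_in, price_out = PRICES[model]
    return input_share * price_in + (1.0 - input_share) * price_out


def price_multiple(expensive: str, cheap: str, input_share: float) -> float:
    """How many times more the expensive model costs at a given token mix."""
    return blended_cost(expensive, input_share) / blended_cost(cheap, input_share)


# Input-only, a hypothetical 75/25 mix, and output-only.
for share in (1.0, 0.75, 0.0):
    multiple = price_multiple("GPT-5.5", "MiniMax M3", share)
    print(f"input share {share:.2f}: {multiple:.1f}x")
```

At these list prices the GPT-5.5 to MiniMax M3 multiple runs from about 8.3× on input-only traffic to 12.5× on output-only traffic, with a 75/25 mix landing near 10.7×. The headline multiple is therefore a function of your own token mix, which is worth measuring before it goes into a business case.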

03 · The Crossover

Where building actually pays off.

Cheap tokens do not automatically argue for building. Building has a fixed-cost tail that token pricing never shows. Advisory frameworks for 2026 put a basic custom AI MVP in the $50,000–$100,000 range and full multi-agent systems at $250,000–$400,000+, with a recurring overhead on top — roughly a $120K+ Year 1 senior engineer plus $60–$80K of observability and ops. These are practitioner ranges from advisory firms with commercial interest, not audited research, so treat them as scope, not gospel. But the shape is reliable: the build path carries a five-to-six-figure fixed cost before the first cheap token saves you anything.

That is what produces a crossover. Below a certain volume the fixed overhead of building is never amortized; above it, the per-unit savings on cheap open-weight tokens eventually overtake the fixed cost. The widely-cited 2026 advisory figure puts that crossover near 1 million agent conversations per year. Below it, buy a packaged agent and accept the premium. Above it, the build path's unit economics can pull ahead — and most agencies running AI across multiple high-volume client engagements reach that threshold sooner than they assume.
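The crossover is a plain break-even: recurring fixed overhead divided by the per-conversation saving. The sketch below first backs out the saving implied by the figures quoted above ($180K–$200K of yearly overhead against a crossover of about 1M conversations), then shows how the crossover moves under other savings. The function names and the sample savings are hypothetical, not sourced values.

```python
def crossover_volume(fixed_annual_usd: float, saving_per_conversation_usd: float) -> float:
    """Conversations per year at which unit savings cover the fixed overhead."""
    if saving_per_conversation_usd <= 0:
        return float("inf")  # building never pays off without a unit saving
    return fixed_annual_usd / saving_per_conversation_usd


def implied_saving(fixed_annual_usd: float, crossover_conversations: float) -> float:
    """Per-conversation saving that a quoted crossover volume implies."""
    return fixed_annual_usd / crossover_conversations


# Overhead range and crossover from the text.
low = implied_saving(180_000, 1_000_000)
high = implied_saving(200_000, 1_000_000)
print(f"implied saving: ${low:.2f} to ${high:.2f} per conversation")

# Sensitivity check with hypothetical savings and a mid-range $190K overhead.
for saving in (0.05, 0.10, 0.20):
    volume = crossover_volume(190_000, saving)
    print(f"saving ${saving:.2f}/conversation -> crossover near {volume:,.0f}/yr")
```

Read this way, the 1M figure is only as good as the assumption that building saves roughly $0.18 to $0.20 per conversation. If your measured saving is a quarter of that, the crossover sits about four times higher.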

The build-path cost tail · what you pay before tokens matter

Source: JustThink.ai & Octopus Builds 2026 advisory frameworks (practitioner estimates)

Custom MVP (build)

Basic single-purpose system · advisory range

$50K–$100K

Year-1 ops overhead (build)

Senior engineer + observability/ops

$180K–$200K

Full multi-agent system (build)

Production multi-agent platform · advisory range

$250K–$400K+

Crossover volume

Conversations/year where build economics overtake buy

~1M / yr

The success-rate caveat

The same advisory data reports build-from-scratch AI projects succeeding at roughly 33% versus around 67% for vendor-led implementations. Source methodology is unspecified and the firm reporting it has a commercial interest, so read it directionally — but the direction matters: clearing the TCO crossover only helps if you are in the third of teams that ships the build at all. Capability to execute is itself a decision variable, which is exactly why the matrix later in this guide scores "required team capability" as a column.
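The success rate can be folded into the crossover as a simple expected-value adjustment. This is a minimal sketch under one strong assumption that is ours rather than the advisory source's: a failed build still spends the full fixed cost and delivers no savings. Under that assumption the expected saving per conversation shrinks by the success probability, so the volume needed to break even in expectation grows by its inverse.

```python
def risk_adjusted_crossover(crossover_volume: float, success_probability: float) -> float:
    """Volume at which *expected* savings cover the fixed cost.

    Assumes a failed build spends the whole fixed cost and saves nothing,
    so expected savings scale linearly with the probability of shipping.
    """
    if not 0.0 < success_probability <= 1.0:
        raise ValueError("success_probability must be in (0, 1]")
    return crossover_volume / success_probability


# ~1M/yr crossover and the ~33% build success rate quoted in the text.
adjusted = risk_adjusted_crossover(1_000_000, 0.33)
print(f"risk-adjusted crossover: about {adjusted:,.0f} conversations/yr")
```

On those inputs the expected-value crossover is about three times the headline figure. A team with a strong delivery record can justifiably use a higher probability, which is the practical meaning of treating execution capability as a decision variable.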

04 · The Third Option

Self-host moves from rack to desk.

Most build vs buy posts are binary — SaaS versus a cloud-API custom build. In 2026 there is a credible third leg: on-device and on-prem inference. NVIDIA's DGX Spark, priced at $4,699 (increased from $3,999 in February 2026 on memory supply constraints), packs a GB10 Grace Blackwell Superchip with 128 GB of unified LPDDR5x memory, delivers roughly 1 PFLOP of AI performance, and runs models up to 200B parameters at 4-bit precision. A CES 2026 software update via TensorRT-LLM optimizations and speculative decoding reportedly delivered a 2.5× performance gain; the box can process a GPT-OSS 120B model at around 38.55 decode tokens/second using NVFP4 precision.

The interesting question is when capex beats OpEx. Here is one illustrative scenario with its assumptions stated openly, because an undeclared break-even is just a fabricated number. Amortize the $4,699 hardware over three years (about $131/month) and add roughly $25/month of electricity, and the box costs about $156/month. Against that, assume a 3-person dev team running cloud GPUs eight hours a day, a usage profile that works out to roughly $4,342 per quarter (about $48 per day) in the cited analysis. Under those specific assumptions, the DGX Spark's $4,699 sticker breaks even in roughly 97 days. Change the cloud rate, the daily hours, or the team size and that number moves a lot, so treat it as a worked example, not a universal constant.
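The same arithmetic can be written down so the inputs are easy to change. This is a minimal sketch under the scenario's stated assumptions. The function names, the 90-day quarter, and the choice to leave electricity out of the break-even are assumptions made here for illustration. It evaluates the monthly figure at both the current $4,699 price and the earlier $3,999 price.

```python
def amortized_monthly_cost(capex_usd: float, months: int, electricity_usd_per_month: float) -> float:
    """Straight-line hardware amortization plus a flat electricity estimate."""
    return capex_usd / months + electricity_usd_per_month


def break_even_days(capex_usd: float, cloud_usd_per_quarter: float, days_per_quarter: float = 90.0) -> float:
    """Days of avoided cloud spend needed to recover the hardware sticker price."""
    cloud_usd_per_day = cloud_usd_per_quarter / days_per_quarter
    return capex_usd / cloud_usd_per_day


# Three-year amortization with ~$25/month electricity, at both list prices.
for price in (4_699, 3_999):
    monthly = amortized_monthly_cost(price, 36, 25.0)
    print(f"${price:,} box: ${monthly:.2f}/month")

# Cloud comparison from the scenario: ~$4,342 per quarter.
days = break_even_days(4_699, 4_342)
print(f"break-even: {days:.1f} days")
```

Under these inputs the monthly cost comes to about $156 at $4,699 and about $136 at $3,999, and the break-even lands near 97 days. Halving the cloud spend doubles the break-even, so the daily-hours assumption deserves the most scrutiny.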

DGX Spark price

On-desk inference box

$4,699

GB10 Grace Blackwell Superchip, 128 GB unified LPDDR5x, ~1 PFLOP AI performance, models up to 200B params at 4-bit. Up from $3,999 in Feb 2026 on memory supply constraints.

Capex, not per-token

Illustrative break-even

Capex vs cloud OpEx

~97 days

Amortized over 3 years + ~$25/mo electricity = ~$156/mo. Versus a 3-person team on cloud GPUs 8 hrs/day (~$4,342/quarter). Assumptions stated; change any input and the number moves.

Declared-assumption scenario

RTX Spark

Windows on-device, announced

Fall '26

Announced at Computex 2026 for fall shipping: 20-core Grace ARM CPU, Blackwell GPU, up to 128 GB unified memory, 120B-param models with 1M context on-device. Official pricing pending.

Analyst est. $2,000–$2,900

The RTX Spark, announced at Computex 2026 and slated to ship in fall 2026, pushes this further into the mainstream: a consumer-grade SoC for Windows laptops and desktops combining a 20-core Grace ARM CPU, a Blackwell GPU, and up to 128 GB of unified memory, capable of running 120B-parameter models with 1M-token context entirely on-device. Official NVIDIA pricing has not been announced; analyst estimates suggest starting prices around $2,000–$2,500 for base configurations and $2,500–$2,900 for flagship 128GB variants, with launch partners spanning ASUS, Dell, HP, Lenovo, Microsoft Surface, and MSI. Treat those figures as estimates until NVIDIA confirms. For the full hardware-versus-cloud math, see our self-hosting frontier models TCO analysis.

"The PC is being reinvented. You ask — and the PC does the work. This is the personal AI computer."— Jensen Huang, CEO, NVIDIA · Computex 2026

05 · The Hidden Cost

Vendor lock-in is the line item that dwarfs tokens.

Token pricing is the cost everyone models. Switching cost is the one almost nobody models, and it is the one that bites. In Parallels' 2026 State of Cloud Computing Survey (n=540 IT professionals, cited by The Register), 94% of organizations report concern about vendor lock-in. The same reporting cites a switching-cost premium of roughly 16× for organizations that lack lock-in prevention planning versus those that have it. That is a single survey with unspecified methodology, so read it as practitioner intelligence rather than a peer-reviewed finding. Even so, a 16× multiplier turns "lock-in risk" from a vague warning into a budget line.

The mechanism is underappreciated. The cost of switching AI vendors is rarely the API migration itself; it is the accumulated context, workflows, and institutional memory built around one provider's behavior. Pricing power compounds the risk. Anthropic moved Claude enterprise from fixed to dynamic usage-based pricing in April 2026, with observers projecting heavy-user costs could rise materially; The Register also reports some OpenAI models seeing input-price increases of up to roughly 360% — a figure from a single article that may be variant-specific, so we flag it as reported rather than confirmed. Either way, the pattern is clear: once you are locked in, the vendor holds the pricing pen.

"Switching [AI vendors] is context, workflows, and institutional memory"— Haroon Choudery, AI consultant · The Register, April 2026

There is a deeper irony in the numbers worth sitting with. Even as token prices reportedly fell around 80% year-over-year, total enterprise AI spending grew by roughly 320% — driven by volume explosion and migration toward more capable, more expensive models. Cheaper tokens did not lower bills; they expanded appetite. And the confidence gap is stark: in a 2026 Zapier survey of US executives with active AI vendor contracts, 90% believed they could switch vendors within four weeks and 41% within two-to-five business days — yet The Register reports only 42% of organizations that actually attempted migration described it as smooth. The plan-versus-reality gap on switching is the strongest argument for designing portability in from day one.
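Those two percentages imply a usage multiple that is easy to miss. The sketch below backs it out under a deliberately simple assumption that is ours, not the source's: total spend equals average token price times usage, so the "usage" term bundles raw volume growth together with the shift toward more expensive models.

```python
def implied_usage_multiple(price_change_pct: float, spend_change_pct: float) -> float:
    """Usage multiple implied by a price change and a spend change.

    Assumes spend = average price x usage, so usage = spend / price.
    Percentages are signed: -80 means prices fell 80%.
    """
    price_multiple = 1.0 + price_change_pct / 100.0
    spend_multiple = 1.0 + spend_change_pct / 100.0
    if price_multiple <= 0:
        raise ValueError("price cannot fall by 100% or more")
    return spend_multiple / price_multiple


# Figures from the text: token prices down ~80%, total spend up ~320%.
usage = implied_usage_multiple(-80.0, 320.0)
print(f"implied usage multiple: ~{usage:.0f}x")
```

An 80% price cut alongside a 320% spend increase implies roughly 21× more price-weighted usage in a year. A forecast built on falling token prices alone will badly understate the bill.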

06 · Sovereignty & Moats

Data is the moat, and the constraint.

If token cost has collapsed and capability has converged, what is left to compete on? Increasingly, the answer is data — and where it lives. In a 2026 survey of 2,050+ senior executives cited by MIT Technology Review, 70% said they believe they need sovereign data and AI platforms to succeed. For regulated sectors and sovereignty-bound workloads, that constraint alone can force the self-host leg of the decision tree regardless of what the token math says — if the data legally cannot leave your infrastructure, the "cheapest API" row is simply off the table.

This is also where the build case is strongest on its merits rather than its cost. a16z's 2026 essays argue the software industry expands rather than collapses under AI: coding's new cheapness creates a mandate for "software-first" teams across every function. Crucially, a16z observes that cheap code creation "hasn't yet diffused across the enterprise in the way that's implied by the lower costs" — which means the organizations that move now capture an arbitrage window before the advantage commoditizes. The thing you build is not the model; it is the proprietary data layer and workflows the model runs on.

"Data is really a new currency; it's the IP for many companies."— Kevin Dallas, CEO, EDB · on AI data sovereignty, 2026

The macro backdrop

Worldwide AI spending is forecast to total $2.52 trillion in 2026, a 44% year-over-year increase, per Gartner's January 2026 forecast — which also found that AI will most often be sold to enterprises by their incumbent software provider. That last point is the lock-in trap in one sentence: the path of least resistance routes you straight into a single vendor's gravity well. Sequoia's "$600B Question" frames the same tension from the other end — a large gap between AI infrastructure spending and AI-generated revenue, driven by the diffusion lag between frontier capability and deployed business value.

07 · The Decision Matrix

Four archetypes, five dimensions.

Pulling the threads together: the decision is not build-or-buy as a single binary, but a match between your workload archetype and a posture across five dimensions — three-year total cost, control and data sensitivity, time to value, lock-in risk, and the team capability the path demands. The matrix below scores four common archetypes. Scores are 1–5 (5 = highest/strongest) and are our synthesis of the sourced 2026 data above, not vendor figures — use them to locate yourself, then run your own numbers.

Archetype	3-yr cost	Control / sensitivity	Time to value	Lock-in risk	Team capability	Verdict
Standardized Workflow Buyer	Lowest upfront	2 / 5	Days	4 / 5	1 / 5	Buy SaaS
Data-Advantaged Builder	Medium	4 / 5	Weeks	2 / 5	3 / 5	Hybrid
High-Volume Custom Operator	Lowest at scale	4 / 5	Months	1 / 5	5 / 5	Build
Regulated / Sovereign Builder	High capex	5 / 5	Months	1 / 5	5 / 5	Self-host

Standardized workflow

Common task, no data moat

If the workflow is generic and your data confers no advantage, buy. Time-to-value is days, capability bar is low, and the engineering overhead of building is never amortized below the ~1M-conversation crossover. Accept the lock-in risk and plan an exit path.

Buy a packaged agent

Data-advantaged

Proprietary data layer

You have data others don't, but volume hasn't cleared the crossover. Buy the commodity intelligence layer (open-weight API), build only the proprietary data, retrieval, and routing layer you control. This is the agency default.

Hybrid: buy model, build data layer

High-volume operator

Past the crossover

Above ~1M conversations/year with a capable team, the build path's unit economics on cheap open-weight tokens can overtake the fixed overhead. Build — but only if you're confident you're in the third of teams that ship.

Build custom

Regulated / sovereign

Data can't leave

If sovereignty or compliance means data legally cannot leave your infrastructure, the cheapest-API row is off the table regardless of token math. Self-host on DGX Spark today, or RTX-class hardware as it ships. Capex over OpEx for full control.

Self-host
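The four archetypes can be read as a short decision tree. The sketch below is one possible encoding, not a definitive rule set: the field names, the order in which the branches are checked, and the use of a single boolean for team capability are assumptions made for illustration. The crossover threshold is the advisory figure quoted in this guide.

```python
from dataclasses import dataclass

CROSSOVER_CONVERSATIONS_PER_YEAR = 1_000_000  # advisory figure from this guide


@dataclass(frozen=True)
class Workload:
    data_must_stay_on_prem: bool   # sovereignty or compliance constraint
    has_proprietary_data: bool     # data that confers an advantage
    conversations_per_year: int
    team_can_ship_builds: bool     # crude stand-in for "required team capability"


def recommend(workload: Workload) -> str:
    """Return one of: 'self-host', 'build', 'hybrid', 'buy'."""
    # Hard constraint first: if data cannot leave, API options are off the table.
    if workload.data_must_stay_on_prem:
        return "self-host"
    # Building only pays past the crossover, and only if the team can deliver.
    past_crossover = workload.conversations_per_year >= CROSSOVER_CONVERSATIONS_PER_YEAR
    if past_crossover and workload.team_can_ship_builds:
        return "build"
    # A data advantage without the volume argues for buying the model layer only.
    if workload.has_proprietary_data:
        return "hybrid"
    return "buy"


agency_client = Workload(
    data_must_stay_on_prem=False,
    has_proprietary_data=True,
    conversations_per_year=200_000,
    team_can_ship_builds=True,
)
print(recommend(agency_client))
```

Checking the sovereignty constraint first mirrors the point made above: a legal constraint overrides the token math, so it should short-circuit the cost comparison rather than be scored against it.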

08 · The Agency Case

Why agencies should buy the layer, build the data.

Agencies sit in an unusual position on this map. High conversation volume per client pushes toward building — but multi-client work means data moats are per-engagement rather than cumulative. You rarely get to pool ten clients' data into one compounding asset; each client's advantage is walled off inside that relationship. That structure argues against building one monolithic custom model and for a repeatable pattern you deploy per client.

The pattern that fits: buy the commodity intelligence layer, build the client data layer. Route the generic reasoning to whichever open-weight or frontier model wins on price-per-capability this quarter — and keep that routing layer yours, so re-pricing is a config change, not a migration. Build and own the parts that actually differentiate: the client-specific retrieval corpus, the prompt and workflow logic, the evaluation harness, and the standardized tool interfaces. That is where switching cost works for you instead of against you, because the moat lives in your infrastructure, not the vendor's.

Concretely, that means standardizing the integration surface so models are swappable. Building a thin, owned routing and tool layer — rather than buying a packaged one — is what keeps you portable; our MCP server TCO calculator walks through that specific build-vs-buy node, and the agent vs Zapier automation cost comparison handles the workflow-automation case. For the retrieval layer itself, weigh RAG vs fine-tuning cost before committing, and cross-reference our enterprise AI agent build vs buy guide for the agent-specific depth. The build vs buy line is no longer drawn between you and a vendor — it runs through your own stack, and you get to choose which side each layer falls on. That is the kind of architecture decision our AI transformation engagements and custom development work are built around.
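A thin routing layer of this kind can be quite small. The sketch below is illustrative only: the `Route` and `Router` names, the route labels, and the blended prices are hypothetical, and the stub clients stand in for whatever provider SDK calls a real deployment would wrap. The point it demonstrates is that the model choice lives in a data structure you own, so re-pricing means editing `ROUTES`, not rewriting callers.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# A completion function takes a prompt and returns text. Real provider calls
# would be wrapped to fit this shape; the stubs below are placeholders.
Completion = Callable[[str], str]


@dataclass(frozen=True)
class Route:
    name: str                  # hypothetical route label, not a provider model ID
    usd_per_1m_blended: float  # illustrative blended price, updated in config
    data_leaves_infra: bool


class Router:
    def __init__(self, routes: List[Route], clients: Dict[str, Completion]) -> None:
        missing = [r.name for r in routes if r.name not in clients]
        if missing:
            raise KeyError(f"no client registered for: {missing}")
        self._routes = list(routes)
        self._clients = dict(clients)

    def pick(self, allow_external: bool = True) -> Route:
        """Cheapest route that satisfies the data-control constraint."""
        candidates = [r for r in self._routes if allow_external or not r.data_leaves_infra]
        if not candidates:
            raise LookupError("no route satisfies the data-control constraint")
        return min(candidates, key=lambda r: r.usd_per_1m_blended)

    def complete(self, prompt: str, allow_external: bool = True) -> str:
        route = self.pick(allow_external)
        return self._clients[route.name](prompt)


# Config: editing this list is the whole "migration" when prices change.
ROUTES = [
    Route("frontier-saas", 11.25, data_leaves_infra=True),
    Route("open-weight-api", 1.05, data_leaves_infra=True),
    Route("self-hosted", 2.00, data_leaves_infra=False),
]
# Stub clients that tag the prompt with the route that handled it.
CLIENTS: Dict[str, Completion] = {
    route.name: (lambda prompt, n=route.name: f"[{n}] {prompt}") for route in ROUTES
}

router = Router(ROUTES, CLIENTS)
print(router.complete("Summarize this brief."))
print(router.complete("Summarize this contract.", allow_external=False))
```

Callers pass only a prompt and a data-control flag, so client-specific retrieval, prompts, and evaluation can sit above this layer without knowing which vendor answered.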

09 · Conclusion

The choice is now a tree, not a binary.

The shape of the decision, June 2026

Buy the commodity, build the moat, and keep the routing layer yours.

The single most useful thing to internalize about AI build vs buy in 2026 is that the old shortcut — "buy for cost, build for control" — no longer holds. Open-weight pricing has put cheap and controllable on the same side of the ledger for a growing set of workloads. The decision is now a tree with four branches: buy when the workload is standardized, hybrid when you have a data advantage but not the volume, build when you clear the crossover with a capable team, and self-host when data sovereignty takes the cheapest-API option off the table entirely.

The numbers that should anchor your thinking are not the model benchmarks — those will be stale in a quarter. They are the structural ones: a crossover near a million conversations a year, a build path with a five-to-six-figure fixed tail, a reported 16× switching-cost penalty for unplanned lock-in, and a 70% executive belief that sovereign data and platforms matter. Those shape the tree; the specific model you route to is a leaf.

For agencies specifically, the recommendation is concrete: buy the commodity intelligence layer, build and own the client data and routing layers, and standardize the integration surface so re-pricing is a config change rather than a migration project. Token costs will keep falling and models will keep converging. The durable advantage is not which model you picked this month — it is whether you architected the freedom to pick a different one next month without paying the lock-in tax.

AI Build vs Buy in 2026: 10–12× Cheaper Re-prices the Choice