An LLM gateway is the proxy layer that sits between your application and the model providers you call — and in 2026 it has graduated from a convenience to critical AI infrastructure. It is where unified API access, caching, routing, fallbacks, budget enforcement, and compliance logging all live, so that none of those concerns leak into application code.

The shift is structural. Cloudflare's internal analysis suggests the average enterprise engineering team now calls several different model providers, which makes a single, vendor-neutral abstraction the default rather than the exception. Once you are routing to more than one provider — or spending real money on tokens — a gateway stops being optional and becomes the place where cost, reliability, and governance are actually enforced.

This reference compares the six most-deployed options — LiteLLM Proxy, Portkey, Cloudflare AI Gateway, Vercel AI Gateway, OpenRouter, and Kong AI Gateway — across the dimensions that actually decide an architecture: caching mechanics, routing algorithms, resilience patterns, governance and compliance controls, and the build-vs-buy economics most teams never quantify. Every figure below is sourced from official documentation, with vendor-stated claims flagged as such.

Key takeaways

01
The gateway is now critical infrastructure.Once you call more than one provider or spend meaningfully on tokens, the proxy layer is where cost control, reliability, and compliance are enforced — not a nice-to-have you bolt on later.
02
Caching is the fastest ROI lever.Cached responses can return in under 5ms versus 2–5 seconds for live inference. Even modest hit rates produce meaningful cost and latency reductions at production scale.
03
Self-hosted vs managed is a trade-off, not a features gap.LiteLLM and Portkey (open-source) keep data on your infrastructure; Cloudflare, Vercel, and OpenRouter trade an operational burden for a platform fee. The deciding axis is data residency and ops capacity.
04
Resilience needs three distinct fallback categories.General fallbacks (timeouts, 5xx), content-policy fallbacks, and context-window fallbacks are different routing decisions with different provider lists — most tutorials only cover the first.
05
The build-vs-buy crossover is calculable.OpenRouter's 5.5% credit fee gives a public benchmark for the managed layer. Run it against your token spend plus engineering hours before assuming self-hosting is cheaper.

01 — The LayerWhat a gateway actually does.

A gateway presents one API to your application and translates it into calls to whichever providers you have configured. That single translation point is what makes everything else possible: because all traffic flows through it, the gateway is the natural home for cross-cutting concerns that would otherwise be scattered across services. Vercel's documentation frames the role plainly — it gives you the ability to set budgets, monitor usage, load-balance requests, and manage fallbacks behind one endpoint.

The five responsibilities that define a production gateway are unified access, caching, routing, resilience, and governance. Each is examined in its own section below; the grid here is the mental map for how they relate.

Access

Unified API

one endpoint · many providers

A single request format reaches hundreds of models. Vercel's gateway describes a unified API to access hundreds of models through one endpoint; Portkey routes to 1,600+ models across 250+ providers.

OpenAI · Anthropic · Google · others

Caching

Exact & semantic

hash match → vector similarity

Identical or near-identical prompts return from cache instead of re-running inference. Cloudflare reports cache hits cutting latency by up to 90% by serving from its edge rather than the upstream provider.

centralized, shared across services

Routing

Load balancing

cost · latency · usage strategies

Requests are distributed across providers and keys by strategy — simple-shuffle, latency-based, cost-based, or usage-aware. The right strategy depends on whether you optimize for price, speed, or quota.

per-deployment weighting

Resilience

Fallbacks & breakers

retry → fallback → circuit-break

Automatic retries with backoff, fallback chains to alternate providers, and circuit breakers that isolate a failing provider+model so one outage doesn't cascade across your stack.

general · content-policy · context

Governance

Budgets & audit

wallet checks · immutable logs

Request-level budget enforcement, per-user token attribution, zero-data-retention routing, and structured audit logs that SOC 2, GDPR, and the EU AI Act increasingly require from day one.

402 reject · SIEM export

When you actually need one

A widely cited rule of thumb among gateway practitioners: if you're calling more than one model provider, or spending more than a few hundred dollars a month on API calls, the gateway stops paying for itself in convenience and starts paying for itself in money saved — through caching, cheaper routing, and avoided downtime. Below that threshold, a direct provider SDK is usually enough.

02 — ComparisonSix gateways, one matrix.

Most public comparisons cover two or three tools and skip the dimensions that matter for compliance and cost. The table below assembles the six most-deployed gateways against the features that decide a real architecture. Every cell is drawn from each vendor's official documentation; the Kong row carries a caveat because its performance claims come from a vendor-authored benchmark (see the resilience section).

LLM gateway feature matrix — six gateways across deployment model, providers, caching, governance, and best-fit use case
Gateway	Deployment	Providers	Semantic cache	Budgets	ZDR / BYOK	Guardrails / DLP	License	Best fit
LiteLLM Proxy	Self-hosted	100+	Yes	Yes	BYOK · region checks	Via plugins	Open source	Data-residency & regulated on-prem deployments
Portkey (OSS)	Self-hosted / managed	250+	Yes (cosine similarity)	Yes	BYOK	Governance suite	Apache 2.0	Lightweight OSS gateway with full observability
Cloudflare AI Gateway	Managed (edge)	12+	Yes	Rate limits	BYOK (20+ providers)	DLP + real-time guardrails	Proprietary	Global edge caching with compliance scanning
Vercel AI Gateway	Managed	Hundreds of models	Provider-dependent	Yes	ZDR routing · BYOK	Via providers	Proprietary	Vercel-native apps wanting zero-markup access
OpenRouter	Managed	Hundreds of models	Provider-dependent	Credit balance	ZDR filter · BYOK	Moderation routing	Proprietary	Fastest path to many models with fallbacks
Kong AI Gateway	Self-hosted / Konnect	Multi-provider	Yes	Plugin-based	BYOK	Plugin ecosystem	Open source core	Teams already standardized on Kong (vendor benchmark caveat)

The matrix makes the real division visible: the choice is rarely about a missing feature. LiteLLM and Portkey win on data residency because everything runs on your infrastructure; Cloudflare and Vercel win on operational simplicity because they run the edge for you. OpenRouter wins on time-to-many-models. There is no gateway that is strictly best — there is the gateway that matches your residency requirements, your ops capacity, and your spend.

03 — CachingExact-match and semantic caching.

Caching is the single fastest ROI lever a gateway offers. A cached response can return in under 5 milliseconds, versus the 2 to 5 seconds a live inference call typically takes. Cost follows latency: a served-from-cache request costs nothing in provider tokens. Reported hit rates in the 30–40% range are a typical range cited in secondary analyses rather than a guaranteed number — but even a moderate hit rate compounds into meaningful savings at production scale.

There are two distinct mechanisms. Exact-match caching hashes the request and returns a stored response when an identical request arrives. Semantic caching goes further: it embeds the prompt and runs a cosine-similarity search against stored embeddings, returning a cached answer when a new prompt is close enough to a previous one. Portkey's implementation checks an exact hash first and falls back to vector similarity — a dual-layer design that captures both literal repeats and paraphrases.

Cache-hit latency vs live inference

Sources: getmaxim.ai semantic caching analysis; Cloudflare AI Gateway docs

Live inference callround-trip to the model provider

2–5s

Cache hitserved from gateway / edge cache

<5ms

Cloudflare edge cache hitvendor-reported latency reduction

−90%

The cosine-similarity threshold is the underexplored design parameter. Practitioners typically tune it in the 0.90 to 0.98 range, and the choice is a precision/recall trade-off with real consequences. Set it loose (near 0.90) and you risk false positives — returning a cached answer to a question that only superficially resembles the original. Set it tight (near 0.98) and the hit rate collapses toward exact-match levels. The gateway is the right place to host this cache precisely because it is shared: every service behind it benefits from one another's query history rather than each maintaining a siloed cache.

The threshold is not free

Embedding geometry alone cannot reliably separate genuine paraphrases from distinct intents. A high-similarity score does not guarantee two prompts want the same answer — which is why a too-loose semantic cache can quietly serve wrong responses. Treat the threshold as a tunable you validate against real traffic, not a set-and-forget default.

04 — RoutingLoad balancing and routing strategies.

Routing decides which deployment serves each request. LiteLLM exposes a menu of strategies: Simple-Shuffle is the default and the one its docs recommend for production; Latency-Based routes to the fastest responder; Usage-Based-Routing-v2 (Redis-backed) routes to the deployment with the lowest tokens-per-minute load; Least-Busy picks the fewest concurrent requests; and Cost-Based optimizes for price. The strategy you pick encodes what you are optimizing for — and that should be an explicit decision, not a default left untouched.

OpenRouter takes a different, automatic approach. Its load balancer first prioritizes providers with no outages in the last 30 seconds, then weights the remaining candidates by the inverse square of their price — so a provider charging three times more is roughly nine times less likely to be selected. Explicit sort or order parameters disable this and force a sequential ordering instead. The same gateway implements the same sliding- and fixed-window rate limiting at the gateway layer that protects upstream providers from quota exhaustion.

LiteLLM default

Routing strategies

Simple-Shuffle (recommended for production), Latency-Based, Usage-Based-Routing-v2, Least-Busy, Cost-Based, and Custom. Cooldowns trigger automatically when a deployment exceeds 50% failures or returns 429s.

config.yaml

OpenRouter weighting

Inverse-square cost

1/p²

Providers with no outage in the last 30 seconds are prioritized, then weighted by 1/price². A 3× pricier provider is ~9× less likely to be picked. :nitro sorts by throughput, :floor sorts by price.

automatic

Percentile gates

Performance thresholds

p99

OpenRouter can deprioritize endpoints below a percentile threshold (p50/p75/p90/p99) — slow endpoints are moved to the end of the candidate list rather than excluded outright.

p50 · p75 · p90 · p99

05 — ResilienceFallbacks, retries, and circuit breakers.

Resilience is where the gateway earns its keep during an incident. Three layers stack. Retries handle transient failures: Portkey supports up to five attempts with exponential backoff. Fallbacks reroute to an alternate provider when retries won't help — and critically, there are three distinct fallback categories that most tutorials collapse into one. Circuit breakers sit on top, isolating a provider that is persistently failing so the system stops hammering it. These are the same retry and idempotency patterns that govern any reliable distributed system, applied at the provider-request level.

LLM resilience pattern decision tree — failure type mapped to recommended pattern, config location, and gateway setting
Failure type	Recommended pattern	Where it lives	Gateway mechanism
Transient timeout	Retry with exponential backoff	Gateway	Up to 5 attempts (Portkey)
Rate limit (429)	General fallback to alternate provider	Gateway	LiteLLM general fallbacks
Content-policy rejection	Content-policy fallback	Gateway	LiteLLM content-policy fallbacks
Context window exceeded	Context-window fallback	Gateway	LiteLLM context-window fallbacks
Provider outage >30s	Circuit breaker (open state)	Gateway	Per provider+model breaker
Persistent elevated errors	Cooldown / deprioritize	Gateway	LiteLLM cooldown (>50% failures or 429, 5s default)

The circuit breaker deserves its own mental model. The right design keeps a separate breaker per provider+model combination — so an outage on one model doesn't block requests to a different model on the same provider. While the breaker is open during cooldown, it prevents retry storms; half-open probes then check periodically whether the provider has recovered. LiteLLM's cooldown mechanism, which trips when a deployment exceeds a 50% failure rate or returns 429s, is a circuit-breaker implementation in everything but name.

"Each of these strategies serves a different purpose... The key is understanding what kind of failure you're dealing with and choosing the right response for that failure."— Portkey engineering team, Retries, Fallbacks, and Circuit Breakers in LLM Apps

Read the Kong benchmark carefully

A widely cited July 2025 benchmark reported Kong's data plane as 228% faster than Portkey OSS and 859% faster than LiteLLM on requests-per-second, with materially lower p95 latency. Treat these as directional only: Kong authored the test, and it used a mocked LLM backend, so the numbers measure raw gateway overhead, not production behavior. In real workloads, actual inference latency dwarfs gateway overhead — making these ratios largely irrelevant to your end-to-end response time.

06 — GovernanceBudgets, zero-data-retention, and compliance.

Governance is the dimension that turns a gateway from a performance tool into a compliance instrument. Budget enforcement works through request-level wallet checks: if the estimated cost of a request exceeds the available balance, the gateway rejects it with an HTTP 402 (Insufficient Balance) before it ever reaches the provider. A softer variant silently caps max_tokens to whatever the remaining balance can cover. Either way, the spend ceiling is enforced at the edge of your system, not discovered on an invoice. Per-user token attribution at this layer is the operational counterpart to the token cost attribution vocabulary every finance-aware AI team now needs.

Data-residency controls are equally a gateway concern. Vercel AI Gateway offers Zero Data Retention (ZDR) routing that sends requests only to providers contractually committed not to retain or train on prompt data — a compliance control enforced without touching application code. OpenRouter exposes ZDR as a provider filter. Cloudflare layers in Data Loss Prevention scanning for PII, financial, and healthcare data, plus real-time guardrails on both prompts and responses. The common thread: the gateway is where you enforce what data is allowed to leave your perimeter.

Compliance is a logging problem first

SOC 2 Type II requires continuous, immutable audit logs. Gateways that emit structured records — identity, provider, model, token counts, cost, latency, result status — from day one are far easier to certify than applications with ad-hoc logging bolted on later. For EU deployments, high-risk obligations under the EU AI Act become enforceable from August 2, 2026, and routing GDPR-covered data through US-based gateway infrastructure requires completed Standard Contractual Clauses before deployment. Build the audit trail into the gateway now, not under deadline pressure.

07 — EconomicsThe build-vs-buy math nobody runs.

The single most consequential decision is self-hosted versus managed, and most teams decide it on instinct rather than arithmetic. OpenRouter gives the cleanest public benchmark for the managed layer: a 5.5% platform fee on credit purchases. (Note the boundary — that fee applies to credit purchases, not to bring-your-own-key usage; routing through BYOK avoids it.) That single number lets you build the crossover formula.

Self-hosted total cost is provider token spend plus infrastructure (compute, storage) plus observability tooling plus engineering hours at your loaded hourly rate. Managed total cost is provider token spend plus the platform fee. The crossover is the point where avoided engineering effort outweighs the fee. Work a concrete example: a team spending around $2,000/month on tokens pays roughly $110/month at a 5.5% managed fee. If self-hosting that gateway consumes even ten hours a month of maintenance at $150/hour, that's $1,500 in engineering cost — and the managed fee wins by an order of magnitude. The calculus flips only at high spend, where the percentage fee on a large token bill eventually exceeds a fixed engineering allocation. The compute side of that infrastructure line item is shifting too: the 136-core Arm data-center processor is one of the silicon trends quietly reshaping the per-hour cost of running your own backend.

Build-vs-buy crossover at $2,000/month token spend

Source: Digital Applied analysis using OpenRouter's published 5.5% fee

Managed gateway (illustrative)$2,000 token spend + 5.5% platform fee

~$110/mo

Self-hosted maintenance (illustrative)~10 hrs/mo engineering @ $150/hr

~$1,500/mo

Read that chart as illustrative, not prescriptive — the inputs are your numbers, not ours. The point is the method: the 5.5% fee is small in absolute terms until your token spend is large, while engineering time is expensive whether or not anything breaks. For most teams at moderate spend, the managed fee is the cheaper option once real engineering cost is counted. The case for self-hosting is rarely cost at moderate scale; it is data residency, regulatory mandate, or spend high enough that the percentage fee dominates. Decide on that axis, then let the arithmetic confirm it.

"LiteLLM is best for self-hosted deployments where data residency or regulatory requirements mandate running everything on your own infrastructure, providing model abstraction and cost tracking out of the box."— Best LLM Gateways in 2026, getmaxim.ai

08 — DecisionChoosing the gateway for your stack.

The decision collapses to four common situations. Match yours to the row below, then validate the choice against your own traffic — never against a headline.

Data must stay on-prem

Regulated or sovereign workloads

When residency or sector compliance mandates running everything on your own infrastructure, a self-hosted open-source gateway is the only fit. LiteLLM's model abstraction, cost tracking, and region-aware pre-call checks make it the default here.

Pick LiteLLM (self-hosted)

Lightweight OSS with observability

Full control, minimal footprint

Portkey is an open-source, Apache-2.0 gateway with low added latency, semantic caching, and a full governance/observability suite. A strong choice when you want self-hosting but also want budgets, retries, and dashboards out of the box.

Pick Portkey (OSS)

Edge caching + compliance scanning

Global, managed, governed

Cloudflare AI Gateway runs across 330 data centers with caching, DLP scanning, and real-time guardrails. Vercel AI Gateway suits Vercel-native apps wanting unified access with ZDR routing. Both trade ops burden for a platform fee.

Pick Cloudflare / Vercel

Many models, fast, with fallbacks

Time-to-many-providers

OpenRouter is the fastest path to hundreds of models with automatic inverse-square routing and fallback chains. The 5.5% credit fee is the cost of skipping infrastructure work — cheap at moderate spend, worth re-checking at high spend.

Pick OpenRouter

Whichever you pick, treat the gateway as a load-bearing component from day one: instrument its logs for the compliance regime you answer to, tune your cache threshold against real traffic, and define your fallback categories explicitly. If you're standing up a production LLM stack and want the routing, caching, and governance architected correctly the first time, our AI transformation engagements start with exactly this kind of infrastructure decision — and our web and platform development team wires the gateway into your application end to end.

09 — ConclusionThe gateway is the new control plane.

The shape of AI infrastructure, mid-2026

The gateway is where cost, reliability, and compliance are actually enforced.

The LLM gateway has completed the same journey the API gateway made a decade ago — from optional convenience to the layer where the hard cross-cutting concerns live. Caching turns repeated work into sub-five-millisecond responses; routing turns provider sprawl into a cost and latency strategy; resilience turns a provider outage into a non-event; and governance turns spend and data flow into enforceable policy rather than after-the-fact discovery.

The build-vs-buy decision is the one worth doing the arithmetic on. The managed layer's benchmark fee is small in absolute terms until your token spend is large, while engineering time stays expensive whether or not anything breaks. For most teams at moderate scale, the managed option is cheaper once real engineering cost is counted; the genuine case for self-hosting is data residency, regulatory mandate, or spend high enough that a percentage fee dominates a fixed engineering allocation.

Looking forward, the regulatory pressure only sharpens the case for a gateway. With EU AI Act high-risk obligations enforceable from August 2026 and audit-log expectations rising across SOC 2 and GDPR, the teams that built structured logging, budget enforcement, and data-residency routing into the gateway early will certify faster and ship with less friction. The gateway is no longer where you save a few milliseconds. It is where your AI stack becomes governable.

LLM Gateway Architecture: The 2026 Engineering Reference