An LLM gateway is the proxy layer that sits between your application and the model providers you call — and in 2026 it has graduated from a convenience to critical AI infrastructure. It is where unified API access, caching, routing, fallbacks, budget enforcement, and compliance logging all live, so that none of those concerns leak into application code.
The shift is structural. Cloudflare's internal analysis suggests the average enterprise engineering team now calls several different model providers, which makes a single, vendor-neutral abstraction the default rather than the exception. Once you are routing to more than one provider — or spending real money on tokens — a gateway stops being optional and becomes the place where cost, reliability, and governance are actually enforced.
This reference compares the six most-deployed options — LiteLLM Proxy, Portkey, Cloudflare AI Gateway, Vercel AI Gateway, OpenRouter, and Kong AI Gateway — across the dimensions that actually decide an architecture: caching mechanics, routing algorithms, resilience patterns, governance and compliance controls, and the build-vs-buy economics most teams never quantify. Every figure below is sourced from official documentation, with vendor-stated claims flagged as such.
- 01 The gateway is now critical infrastructure. Once you call more than one provider or spend meaningfully on tokens, the proxy layer is where cost control, reliability, and compliance are enforced, not a nice-to-have you bolt on later.
- 02 Caching is the fastest ROI lever. Cached responses can return in under 5ms versus 2–5 seconds for live inference. Even modest hit rates produce meaningful cost and latency reductions at production scale.
- 03 Self-hosted vs managed is a trade-off, not a features gap. LiteLLM and Portkey (open-source) keep data on your infrastructure; Cloudflare, Vercel, and OpenRouter trade an operational burden for a platform fee. The deciding axis is data residency and ops capacity.
- 04 Resilience needs three distinct fallback categories. General fallbacks (timeouts, 5xx), content-policy fallbacks, and context-window fallbacks are different routing decisions with different provider lists; most tutorials only cover the first.
- 05 The build-vs-buy crossover is calculable. OpenRouter's 5.5% credit fee gives a public benchmark for the managed layer. Run it against your token spend plus engineering hours before assuming self-hosting is cheaper.
01 — The Layer: What a gateway actually does.
A gateway presents one API to your application and translates it into calls to whichever providers you have configured. That single translation point is what makes everything else possible: because all traffic flows through it, the gateway is the natural home for cross-cutting concerns that would otherwise be scattered across services. Vercel's documentation frames the role plainly — it gives you the ability to set budgets, monitor usage, load-balance requests, and manage fallbacks behind one endpoint.
The five responsibilities that define a production gateway are unified access, caching, routing, resilience, and governance. Each is examined in its own section below; the grid here is the mental map for how they relate.
Unified API
A single request format reaches hundreds of models. Vercel's gateway describes a unified API to access hundreds of models through one endpoint; Portkey routes to 1,600+ models across 250+ providers.
Exact & semantic
Identical or near-identical prompts return from cache instead of re-running inference. Cloudflare reports cache hits cutting latency by up to 90% by serving from its edge rather than the upstream provider.
Load balancing
Requests are distributed across providers and keys by strategy — simple-shuffle, latency-based, cost-based, or usage-aware. The right strategy depends on whether you optimize for price, speed, or quota.
Fallbacks & breakers
Automatic retries with backoff, fallback chains to alternate providers, and circuit breakers that isolate a failing provider+model so one outage doesn't cascade across your stack.
Budgets & audit
Request-level budget enforcement, per-user token attribution, zero-data-retention routing, and structured audit logs that SOC 2, GDPR, and the EU AI Act increasingly require from day one.
02 — Comparison: Six gateways, one matrix.
Most public comparisons cover two or three tools and skip the dimensions that matter for compliance and cost. The table below assembles the six most-deployed gateways against the features that decide a real architecture. Every cell is drawn from each vendor's official documentation; the Kong row carries a caveat because its performance claims come from a vendor-authored benchmark.
| Gateway | Deployment | Providers | Semantic cache | Budgets | ZDR / BYOK | Guardrails / DLP | License | Best fit |
|---|---|---|---|---|---|---|---|---|
| LiteLLM Proxy | Self-hosted | 100+ | Yes | Yes | BYOK · region checks | Via plugins | Open source | Data-residency & regulated on-prem deployments |
| Portkey (OSS) | Self-hosted / managed | 250+ | Yes (cosine similarity) | Yes | BYOK | Governance suite | Apache 2.0 | Lightweight OSS gateway with full observability |
| Cloudflare AI Gateway | Managed (edge) | 12+ | Yes | Rate limits | BYOK (20+ providers) | DLP + real-time guardrails | Proprietary | Global edge caching with compliance scanning |
| Vercel AI Gateway | Managed | Hundreds of models | Provider-dependent | Yes | ZDR routing · BYOK | Via providers | Proprietary | Vercel-native apps wanting zero-markup access |
| OpenRouter | Managed | Hundreds of models | Provider-dependent | Credit balance | ZDR filter · BYOK | Moderation routing | Proprietary | Fastest path to many models with fallbacks |
| Kong AI Gateway | Self-hosted / Konnect | Multi-provider | Yes | Plugin-based | BYOK | Plugin ecosystem | Open source core | Teams already standardized on Kong (vendor benchmark caveat) |
The matrix makes the real division visible: the choice is rarely about a missing feature. LiteLLM and Portkey win on data residency because everything runs on your infrastructure; Cloudflare and Vercel win on operational simplicity because they run the edge for you. OpenRouter wins on time-to-many-models. There is no gateway that is strictly best — there is the gateway that matches your residency requirements, your ops capacity, and your spend.
03 — Caching: Exact-match and semantic caching.
Caching is the single fastest ROI lever a gateway offers. A cached response can return in under 5 milliseconds, versus the 2 to 5 seconds a live inference call typically takes. Cost follows latency: a request served from cache costs nothing in provider tokens. Hit rates of 30–40% are a range commonly cited in secondary analyses, not a guaranteed number, but even a moderate hit rate compounds into meaningful savings at production scale.
There are two distinct mechanisms. Exact-match caching hashes the request and returns a stored response when an identical request arrives. Semantic caching goes further: it embeds the prompt and runs a cosine-similarity search against stored embeddings, returning a cached answer when a new prompt is close enough to a previous one. Portkey's implementation checks an exact hash first and falls back to vector similarity — a dual-layer design that captures both literal repeats and paraphrases.
[Chart: Cache-hit latency vs live inference. Sources: getmaxim.ai semantic caching analysis; Cloudflare AI Gateway docs]
The cosine-similarity threshold is the underexplored design parameter. Practitioners typically tune it in the 0.90 to 0.98 range, and the choice is a precision/recall trade-off with real consequences. Set it loose (near 0.90) and you risk false positives: returning a cached answer to a question that only superficially resembles the original. Set it tight (near 0.98) and the hit rate collapses toward exact-match levels. The gateway is the right place to host this cache precisely because it is shared: every service behind it benefits from one another's query history rather than each maintaining a siloed cache.
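To make the dual-layer lookup and the threshold trade-off concrete, here is a minimal Python sketch. It is not Portkey's implementation: the `embed` function is a toy bag-of-words stand-in for a real embedding model, the linear scan stands in for a vector index, and `DualLayerCache` and its method names are hypothetical.

```python
import hashlib
import math
from collections import Counter
from typing import Optional


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class DualLayerCache:
    """Exact-hash lookup first, then a cosine-similarity search (hypothetical class)."""

    def __init__(self, threshold: float = 0.95):
        self.threshold = threshold
        self.exact = {}      # request hash -> response
        self.semantic = []   # list of (embedding, response)

    @staticmethod
    def _key(prompt: str) -> str:
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def put(self, prompt: str, response: str) -> None:
        self.exact[self._key(prompt)] = response
        self.semantic.append((embed(prompt), response))

    def get(self, prompt: str) -> "tuple[Optional[str], str]":
        """Return (response, layer); layer is 'exact', 'semantic', or 'miss'."""
        hit = self.exact.get(self._key(prompt))
        if hit is not None:
            return hit, "exact"
        query = embed(prompt)
        best_score, best_response = 0.0, None
        for vector, response in self.semantic:  # linear scan; real systems use a vector index
            score = cosine(query, vector)
            if score > best_score:
                best_score, best_response = score, response
        if best_response is not None and best_score >= self.threshold:
            return best_response, "semantic"
        return None, "miss"
```

With a stored prompt of "what is the capital of france", the paraphrase "what is the capital of france please" scores about 0.93 under this toy embedding. A cache built with `threshold=0.90` serves it from the semantic layer, while one built with `threshold=0.98` treats it as a miss, which is the precision/recall trade-off in miniature.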
04 — Routing: Load balancing and routing strategies.
Routing decides which deployment serves each request. LiteLLM exposes a menu of strategies: Simple-Shuffle is the default and the one its docs recommend for production; Latency-Based routes to the fastest responder; Usage-Based-Routing-v2 (Redis-backed) routes to the deployment with the lowest tokens-per-minute load; Least-Busy picks the deployment with the fewest concurrent requests; and Cost-Based optimizes for price. The strategy you pick encodes what you are optimizing for, and that should be an explicit decision, not a default left untouched.
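Each strategy above reduces to a different selection key over the same pool of deployments. The sketch below illustrates that idea in plain Python; it does not use LiteLLM's API, and the `Deployment` fields, the `pick` function, and the strategy labels are illustrative assumptions. Simple-shuffle is modelled here as a uniform random choice, which is a simplification.

```python
import random
from dataclasses import dataclass


@dataclass
class Deployment:
    name: str
    avg_latency_ms: float       # rolling average response time
    tokens_per_minute: int      # current TPM load
    in_flight: int              # concurrent requests right now
    cost_per_1k_tokens: float   # blended price in USD


# Each non-random strategy is "pick the minimum of one metric".
SELECTION_KEYS = {
    "latency-based": lambda d: d.avg_latency_ms,
    "usage-based": lambda d: d.tokens_per_minute,
    "least-busy": lambda d: d.in_flight,
    "cost-based": lambda d: d.cost_per_1k_tokens,
}


def pick(deployments, strategy, rng=random):
    """Choose one deployment according to the named strategy."""
    if not deployments:
        raise ValueError("no deployments configured")
    if strategy == "simple-shuffle":
        return rng.choice(deployments)  # uniform here; an assumption of this sketch
    if strategy not in SELECTION_KEYS:
        raise ValueError(f"unknown strategy: {strategy}")
    return min(deployments, key=SELECTION_KEYS[strategy])
```

Seeing the strategies side by side makes the point of the paragraph visible: the same fleet yields different winners depending on which metric you minimize.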
OpenRouter takes a different, automatic approach. Its load balancer first prioritizes providers with no outages in the last 30 seconds, then weights the remaining candidates by the inverse square of their price, so a provider charging three times as much is roughly nine times less likely to be selected. Explicit sort or order parameters disable this weighting and force a sequential ordering instead. The gateway layer is also the natural place for sliding- and fixed-window rate limiting, which keeps your upstream provider quotas from being exhausted.
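The inverse-square weighting is easy to check numerically. The sketch below is not OpenRouter's code: it assumes the outage filter has already run, that every price is positive, and that selection is a single weighted random draw. The function names are hypothetical.

```python
import random


def selection_probabilities(prices):
    """Map provider -> probability, weighting each provider by 1 / price**2."""
    if any(price <= 0 for price in prices.values()):
        raise ValueError("prices must be positive")
    weights = {name: 1.0 / (price ** 2) for name, price in prices.items()}
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


def pick_provider(prices, rng=random):
    """Draw one provider at random using the inverse-square weights."""
    probs = selection_probabilities(prices)
    names = list(probs)
    return rng.choices(names, weights=[probs[n] for n in names], k=1)[0]


# Three healthy providers priced at 1x, 2x, and 3x per million tokens.
probs = selection_probabilities({"a": 1.0, "b": 2.0, "c": 3.0})
ratio = probs["a"] / probs["c"]  # the 1x provider is 9 times as likely as the 3x one
```

Because the weight falls with the square of the price, cheap providers absorb most of the traffic while pricier ones still receive a trickle, which keeps them warm as fallbacks.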
Routing strategies
LiteLLM: Simple-Shuffle (recommended for production), Latency-Based, Usage-Based-Routing-v2, Least-Busy, Cost-Based, and Custom. Cooldowns trigger automatically when a deployment exceeds 50% failures or returns 429s.
Inverse-square cost
OpenRouter: providers with no outage in the last 30 seconds are prioritized, then weighted by 1/price². A 3× pricier provider is ~9× less likely to be picked. The :nitro suffix sorts by throughput; :floor sorts by price.
Performance thresholds
OpenRouter can deprioritize endpoints below a percentile threshold (p50/p75/p90/p99) — slow endpoints are moved to the end of the candidate list rather than excluded outright.
05 — Resilience: Fallbacks, retries, and circuit breakers.
Resilience is where the gateway earns its keep during an incident. Three layers stack. Retries handle transient failures: Portkey supports up to five attempts with exponential backoff. Fallbacks reroute to an alternate provider when retries won't help — and critically, there are three distinct fallback categories that most tutorials collapse into one. Circuit breakers sit on top, isolating a provider that is persistently failing so the system stops hammering it. These are the same retry and idempotency patterns that govern any reliable distributed system, applied at the provider-request level.
| Failure type | Recommended pattern | Where it lives | Gateway mechanism |
|---|---|---|---|
| Transient timeout | Retry with exponential backoff | Gateway | Up to 5 attempts (Portkey) |
| Rate limit (429) | General fallback to alternate provider | Gateway | LiteLLM general fallbacks |
| Content-policy rejection | Content-policy fallback | Gateway | LiteLLM content-policy fallbacks |
| Context window exceeded | Context-window fallback | Gateway | LiteLLM context-window fallbacks |
| Provider outage >30s | Circuit breaker (open state) | Gateway | Per provider+model breaker |
| Persistent elevated errors | Cooldown / deprioritize | Gateway | LiteLLM cooldown (>50% failures or 429, 5s default) |
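The first four rows of the table can be sketched as a single dispatch: retry transient failures with exponential backoff, and route every other failure type to its own fallback list. This is an illustration, not LiteLLM's or Portkey's API. The `Failure` enum, `ProviderError`, the model names in `FALLBACKS`, and the 0.5-second base delay are all assumptions; only the five-attempt ceiling comes from the text above.

```python
import time
from enum import Enum


class Failure(Enum):
    TRANSIENT = "transient"              # timeouts, 5xx
    RATE_LIMIT = "rate_limit"            # HTTP 429
    CONTENT_POLICY = "content_policy"
    CONTEXT_WINDOW = "context_window"


class ProviderError(Exception):
    def __init__(self, failure: Failure):
        super().__init__(failure.value)
        self.failure = failure


# Hypothetical fallback lists: each failure category gets its own candidates.
FALLBACKS = {
    Failure.TRANSIENT: ["general-backup"],
    Failure.RATE_LIMIT: ["general-backup"],
    Failure.CONTENT_POLICY: ["alt-policy-model"],
    Failure.CONTEXT_WINDOW: ["long-context-model"],
}


def call_with_resilience(call, primary, prompt, max_attempts=5,
                         base_delay=0.5, sleep=time.sleep):
    """Retry transient failures on the primary, then walk the matching fallback list."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for attempt in range(max_attempts):
        try:
            return call(primary, prompt)
        except ProviderError as err:
            last_error = err
            if err.failure is not Failure.TRANSIENT:
                break  # retrying the same deployment will not help
            if attempt < max_attempts - 1:
                sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, ...
    for model in FALLBACKS[last_error.failure]:
        try:
            return call(model, prompt)
        except ProviderError as err:
            last_error = err
    raise last_error
```

The design choice worth noticing is the early `break`: a content-policy rejection or a context-window overflow is deterministic, so retrying the same model only burns time and tokens.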
The circuit breaker deserves its own mental model. The right design keeps a separate breaker per provider+model combination — so an outage on one model doesn't block requests to a different model on the same provider. While the breaker is open during cooldown, it prevents retry storms; half-open probes then check periodically whether the provider has recovered. LiteLLM's cooldown mechanism, which trips when a deployment exceeds a 50% failure rate or returns 429s, is a circuit-breaker implementation in everything but name.
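A per provider+model breaker can be sketched in a few dozen lines. The version below is a simplified illustration, not any gateway's implementation: it trips on a count of consecutive failures rather than LiteLLM's 50% failure rate, lets every request through while half-open instead of a single probe, and uses hypothetical class names and default values.

```python
import time


class CircuitBreaker:
    """Closed -> open after repeated failures -> half-open once the cooldown elapses."""

    def __init__(self, failure_threshold=3, cooldown_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.clock = clock          # injectable so the breaker can be tested without waiting
        self.failures = 0
        self.opened_at = None       # None means the breaker is closed

    @property
    def state(self):
        if self.opened_at is None:
            return "closed"
        if self.clock() - self.opened_at >= self.cooldown_seconds:
            return "half-open"
        return "open"

    def allow_request(self):
        return self.state != "open"

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        if self.state == "half-open":
            self.opened_at = self.clock()   # probe failed: restart the cooldown
            return
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()


class BreakerRegistry:
    """One breaker per (provider, model) pair, created on first use."""

    def __init__(self, **breaker_kwargs):
        self._kwargs = breaker_kwargs
        self._breakers = {}

    def get(self, provider, model):
        key = (provider, model)
        if key not in self._breakers:
            self._breakers[key] = CircuitBreaker(**self._kwargs)
        return self._breakers[key]
```

Keying the registry on the pair rather than on the provider alone is what keeps an outage on one model from blocking a healthy sibling model on the same provider.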
"Each of these strategies serves a different purpose... The key is understanding what kind of failure you're dealing with and choosing the right response for that failure." — Portkey engineering team, Retries, Fallbacks, and Circuit Breakers in LLM Apps
06 — Governance: Budgets, zero-data-retention, and compliance.
Governance is the dimension that turns a gateway from a performance tool into a compliance instrument. Budget enforcement works through request-level wallet checks: if the estimated cost of a request exceeds the available balance, the gateway rejects it with HTTP 402 Payment Required (insufficient balance) before it ever reaches the provider. A softer variant silently caps max_tokens to whatever the remaining balance can cover. Either way, the spend ceiling is enforced at the edge of your system, not discovered on an invoice. Per-user token attribution at this layer supplies the cost-attribution data every finance-aware AI team now needs.
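Both variants of the wallet check fit in one function. The sketch below assumes a worst-case estimate (prompt tokens plus the full `max_tokens` budget) and per-1k-token pricing with a positive output price; the `Decision` type, the `check_budget` name, and its parameters are hypothetical rather than any gateway's API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Decision:
    status: int                  # 200 = forward to the provider, 402 = reject
    max_tokens: Optional[int]    # output cap to forward, possibly reduced


def check_budget(balance_usd, prompt_tokens, max_tokens,
                 input_price_per_1k, output_price_per_1k, soft_cap=False):
    """Compare the worst-case request cost against the remaining balance."""
    input_cost = prompt_tokens / 1000 * input_price_per_1k
    worst_case = input_cost + max_tokens / 1000 * output_price_per_1k
    if worst_case <= balance_usd:
        return Decision(200, max_tokens)
    if soft_cap:
        # Softer variant: shrink the output cap to what the balance can still cover.
        affordable = int((balance_usd - input_cost) / output_price_per_1k * 1000)
        if affordable > 0:
            return Decision(200, min(affordable, max_tokens))
    return Decision(402, None)   # HTTP 402 Payment Required
```

Estimating against the full `max_tokens` is deliberately conservative: it can reject requests that would have finished under budget, but it never lets a request overshoot the balance.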
Data-retention and residency controls are equally a gateway concern. Vercel AI Gateway offers Zero Data Retention (ZDR) routing that sends requests only to providers contractually committed not to retain or train on prompt data, a compliance control enforced without touching application code. OpenRouter exposes ZDR as a provider filter. Cloudflare layers in Data Loss Prevention scanning for PII, financial, and healthcare data, plus real-time guardrails on both prompts and responses. The common thread: the gateway is where you enforce what data is allowed to leave your perimeter.
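In code terms, a ZDR or residency policy is a filter applied to the candidate list before routing. The following sketch is a generic illustration, not Vercel's or OpenRouter's API: the `Provider` fields, the function names, and the region labels are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Provider:
    name: str
    zero_data_retention: bool   # contractual ZDR commitment
    region: str                 # where prompts are processed


def eligible_providers(providers, require_zdr=False, allowed_regions=None):
    """Keep only the providers that satisfy the data policy, preserving order."""
    result = []
    for provider in providers:
        if require_zdr and not provider.zero_data_retention:
            continue
        if allowed_regions is not None and provider.region not in allowed_regions:
            continue
        result.append(provider)
    return result


def route(providers, **policy):
    """Return the first eligible provider; fail closed if none qualifies."""
    candidates = eligible_providers(providers, **policy)
    if not candidates:
        raise LookupError("no provider satisfies the data policy")
    return candidates[0]
```

Failing closed matters here: if the policy filter empties the list, the request should be refused rather than quietly sent to a non-compliant provider.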
07 — Economics: The build-vs-buy math nobody runs.
The single most consequential decision is self-hosted versus managed, and most teams decide it on instinct rather than arithmetic. OpenRouter gives the cleanest public benchmark for the managed layer: a 5.5% platform fee on credit purchases. (Note the boundary — that fee applies to credit purchases, not to bring-your-own-key usage; routing through BYOK avoids it.) That single number lets you build the crossover formula.
Self-hosted total cost is provider token spend plus infrastructure (compute, storage) plus observability tooling plus engineering hours at your loaded hourly rate. Managed total cost is provider token spend plus the platform fee. The crossover is the spend level at which the platform fee equals the self-hosting overhead. Work a concrete example: a team spending around $2,000/month on tokens pays roughly $110/month at a 5.5% managed fee. If self-hosting that gateway consumes even ten hours a month of maintenance at $150/hour, that's $1,500 in engineering cost before any infrastructure is counted, and the managed fee wins by an order of magnitude. The calculus flips only at high spend: with those same inputs, the 5.5% fee does not reach $1,500 until token spend passes roughly $27,000/month ($1,500 / 0.055).
[Chart: Build-vs-buy crossover at $2,000/month token spend. Source: Digital Applied analysis using OpenRouter's published 5.5% fee]
Read that chart as illustrative, not prescriptive: the inputs are your numbers, not ours. The point is the method: the 5.5% fee is small in absolute terms until your token spend is large, while engineering time is expensive whether or not anything breaks. For most teams at moderate spend, the managed fee is the cheaper option once real engineering cost is counted. The case for self-hosting is rarely cost at moderate scale; it is data residency, regulatory mandate, or spend high enough that the percentage fee dominates. Decide on that axis, then let the arithmetic confirm it.
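The method is short enough to write down as code so you can plug in your own numbers. This sketch assumes all token spend is bought as credits (so the 5.5% fee applies to all of it) and treats self-hosting overhead as a fixed monthly amount. The function names are hypothetical.

```python
def managed_cost(token_spend, fee_rate=0.055):
    """Monthly cost on a managed gateway: tokens plus the percentage platform fee."""
    return token_spend * (1 + fee_rate)


def self_hosted_cost(token_spend, infra, observability, eng_hours, hourly_rate):
    """Monthly cost when self-hosting: tokens plus fixed overhead."""
    return token_spend + infra + observability + eng_hours * hourly_rate


def crossover_spend(infra, observability, eng_hours, hourly_rate, fee_rate=0.055):
    """Monthly token spend at which the platform fee equals the self-hosting overhead."""
    overhead = infra + observability + eng_hours * hourly_rate
    return overhead / fee_rate


# The article's example: $2,000/month of tokens, ten maintenance hours at $150/hour,
# with infrastructure and observability left at zero for simplicity.
fee = managed_cost(2000) - 2000                       # about $110
overhead = self_hosted_cost(2000, 0, 0, 10, 150) - 2000   # $1,500
breakeven = crossover_spend(0, 0, 10, 150)            # about $27,273 per month
```

Adding real infrastructure and observability costs only pushes the breakeven higher, which is why the cost argument for self-hosting tends to appear late.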
"LiteLLM is best for self-hosted deployments where data residency or regulatory requirements mandate running everything on your own infrastructure, providing model abstraction and cost tracking out of the box." — Best LLM Gateways in 2026, getmaxim.ai
08 — Decision: Choosing the gateway for your stack.
The decision collapses to four common situations. Match yours to one of the profiles below, then validate the choice against your own traffic, never against a headline.
Regulated or sovereign workloads
When residency or sector compliance mandates running everything on your own infrastructure, a self-hosted open-source gateway is the only fit. LiteLLM's model abstraction, cost tracking, and region-aware pre-call checks make it the default here.
Full control, minimal footprint
Portkey is an open-source, Apache-2.0 gateway with low added latency, semantic caching, and a full governance/observability suite. A strong choice when you want self-hosting but also want budgets, retries, and dashboards out of the box.
Global, managed, governed
Cloudflare AI Gateway runs across 330 data centers with caching, DLP scanning, and real-time guardrails. Vercel AI Gateway suits Vercel-native apps wanting unified access with ZDR routing. Both trade ops burden for a platform fee.
Time-to-many-providers
OpenRouter is the fastest path to hundreds of models with automatic inverse-square routing and fallback chains. The 5.5% credit fee is the cost of skipping infrastructure work — cheap at moderate spend, worth re-checking at high spend.
Whichever you pick, treat the gateway as a load-bearing component from day one: instrument its logs for the compliance regime you answer to, tune your cache threshold against real traffic, and define your fallback categories explicitly.
09 — Conclusion: The gateway is the new control plane.
The gateway is where cost, reliability, and compliance are actually enforced.
The LLM gateway has completed the same journey the API gateway made a decade ago — from optional convenience to the layer where the hard cross-cutting concerns live. Caching turns repeated work into sub-five-millisecond responses; routing turns provider sprawl into a cost and latency strategy; resilience turns a provider outage into a non-event; and governance turns spend and data flow into enforceable policy rather than after-the-fact discovery.
The build-vs-buy decision is the one worth doing the arithmetic on. The managed layer's benchmark fee is small in absolute terms until your token spend is large, while engineering time stays expensive whether or not anything breaks. For most teams at moderate scale, the managed option is cheaper once real engineering cost is counted; the genuine case for self-hosting is data residency, regulatory mandate, or spend high enough that a percentage fee dominates a fixed engineering allocation.
Looking forward, the regulatory pressure only sharpens the case for a gateway. With EU AI Act high-risk obligations enforceable from August 2026 and audit-log expectations rising across SOC 2 and GDPR, the teams that built structured logging, budget enforcement, and data-residency routing into the gateway early will certify faster and ship with less friction. The gateway is no longer where you save a few milliseconds. It is where your AI stack becomes governable.