MCP server reliability metrics in 2026 are the dividing line between a prototype that works on a developer's laptop and a platform that internal teams can build on. The protocol is mature enough to deploy widely, but most servers in production today have no SLO, no error budget, no defined alert thresholds — they have a health-check endpoint and a hope. This framework replaces the hope with eight concrete KPIs and the panel design that makes them legible.
We assume you have already shipped an MCP server and connected it to live agents. The threshold question is no longer does it work — it is what should it promise. Internal consumers asking that question deserve a numeric answer, an honest measurement of how close you are to it, and a burn-rate alert that fires before the answer becomes embarrassing.
This guide is the operations complement to our MCP server security best practices engineering guide. Where that document hardens the privilege boundary, this one instruments it. Seven sections cover the SLO rationale, four KPI families (uptime, latency, success rate, error budget), the alerting thresholds that turn the panel into a paging contract, and the OpenTelemetry semantic conventions that keep your instrumentation vendor-portable.
- 01 — SLOs distinguish prototype from platform. An MCP server without numeric reliability targets is a personal tool with shared credentials. A server with stated SLOs, a measured error budget, and a paging contract is a platform internal consumers can plan against. The transition is organisational, not technical — but the metrics are how you mark it.
- 02 — Latency percentiles dominate user perception. P50 latency is the headline number a marketer reaches for; P95 and P99 are the numbers users actually feel. Agents make hundreds of tool calls per session — a P99 in the seconds is a session in the minutes. Always state SLOs at P95 and P99, never the mean.
- 03 — Tool-call success rate predicts user retention. Aggregate uptime hides the failure mode that matters: an individual tool returning malformed output, timing out, or throwing into an agent it cannot recover from. Track success rate per tool, per caller, per agent — a server-wide rolling average will mask the failure that is driving churn.
- 04 — Error budget framework prevents engineering burnout. Without a budget, every failure feels like a five-alarm fire. With a budget, the team has explicit permission to let small failures land — and a contractual obligation to slow feature work when burn-rate exceeds policy. The budget is the operational and political artifact, both at once.
- 05 — OpenTelemetry adoption unlocks vendor portability. Lock-in is the hidden tax on first-generation observability stacks. OpenTelemetry semantic conventions for tool invocation, agent runs, and MCP transports decouple the instrumentation surface from any single backend. Adopt the standard now and route to whichever vendor the next budget cycle prefers.
01 — Why SLOs
SLOs separate a prototype from a platform internal teams can build on.
The question every internal MCP server eventually faces is the same. A second team wants to depend on the server. The first team — the one that built it for their own use — is asked whether it is "ready to be a platform." In our experience, the distinguishing artifact between a prototype answer and a platform answer is not the codebase quality, the test coverage, or the deployment automation. It is whether the team can state, in numbers, what the server promises and how close it currently sits to that promise.
SLOs — Service Level Objectives — are how operations teams have answered that question for fifteen years across web services, APIs, and managed platforms. The MCP world has been late to adopt them, largely because the early server population was so heavily weighted toward personal tooling that the language felt mismatched. That mismatch is over. MCP servers wrapping production credentials with agentic callers issuing thousands of calls per session need the same numeric discipline every other piece of production infrastructure has been operating under since the 2010s — there is no shortcut to credibility.
What the server commits to
A stated SLO is a single number per dimension — uptime 99.5% monthly, P95 latency under 600ms, tool-call success rate above 99%. The whole point is that internal consumers can plan capacity, sequencing, and dependencies without asking the owning team to vouch verbally.
Operational contract
Honest rolling measurement
The SLI — Service Level Indicator — is what you actually measure. A 30-day rolling window is the right default for monthly SLOs; longer for quarterly contracts. The measurement window has to outlast typical incident-recovery cycles or you will keep promising what last week's outage just disqualified.
30-day default
Permission to fail incrementally
The error budget is what is left between the SLO and 100%. A 99.5% uptime promise leaves a 0.5%-per-month budget — roughly 3.6 hours — that the team is explicitly allowed to spend on planned maintenance, risky deploys, or incident slack. When the budget burns down, feature work pauses. The arithmetic is sketched below.
Time you can spend
The framework that follows treats the SLO not as a number to chase but as a structured conversation. The owning team commits to a promise. The measurement layer verifies the promise honestly. The error budget governs what happens when the promise is breached. The alerting thresholds tell the on-call person before the budget is gone. The OpenTelemetry layer makes the instrumentation portable so the conversation survives a backend migration. Every section that follows fits into that scaffold.
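To make the three cards concrete, here is a minimal sketch of the arithmetic in TypeScript. Everything in it is illustrative: the names are ours, and a real system would pull the success and total counts from whatever metrics backend computes the rolling window.

```typescript
// Minimal error-budget arithmetic for a monthly uptime SLO.
// Illustrative only: event counts would come from your metrics backend.

interface SloWindow {
  target: number;        // e.g. 0.995 for a 99.5% monthly SLO
  windowMinutes: number; // 30-day rolling window = 43_200 minutes
}

// Total budget: the slice of the window the SLO allows you to fail.
function budgetMinutes({ target, windowMinutes }: SloWindow): number {
  return (1 - target) * windowMinutes; // 99.5% over 30 days -> 216 min
}

// SLI: fraction of probes that succeeded inside the rolling window.
function sli(successes: number, total: number): number {
  return total === 0 ? 1 : successes / total;
}

// Budget remaining as a fraction of the full budget (1 = untouched, 0 = spent).
function budgetRemaining(slo: SloWindow, successes: number, total: number): number {
  const allowedFailureRate = 1 - slo.target;
  const observedFailureRate = 1 - sli(successes, total);
  return Math.max(0, 1 - observedFailureRate / allowedFailureRate);
}

const slo: SloWindow = { target: 0.995, windowMinutes: 30 * 24 * 60 };
console.log(budgetMinutes(slo));                   // 216 minutes
console.log(budgetRemaining(slo, 43_050, 43_200)); // ~0.31 — in the 20-50% band
```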
One political note worth getting on the table. SLOs commit you to paging the team when they breach. A server with an aspirational SLO and no paging discipline is not a platform — it is the same prototype with a number on the doc page. If the organisation is not willing to fund on-call coverage for the server, the right move is to state a weaker SLO that the current cover can actually defend. Honesty up front beats credibility recovery after a silent outage.
02 — Uptime KPIs
Availability, recovery, partial health — three uptime dimensions.
Uptime is the family of KPIs most teams reach for first because it is the easiest to measure. The trap is that "uptime" in MCP context has at least three distinct dimensions, and a single percentage on a dashboard collapses all of them into a number that satisfies the executive review and hides the failure modes that actually drive consumer pain.
The three dimensions worth tracking separately: availability (the server responded to a probe within a deadline), recovery posture (when it didn't, how long until it did), and partial health (when it did respond, which tools were actually functional). The grid below maps the three uptime KPIs the framework recommends, the SLI definition for each, and the platform-tier target we recommend as a starting point. Every target should be tuned to your specific consumer tolerance.
Probe availability
Synthetic ping every 60s · success / total
A synthetic probe — initialize handshake followed by a tools/list call — hits the server every 60 seconds from at least two geographic regions. The KPI is the percentage of probes that completed inside the deadline (recommended: 5 seconds). Platform-tier target: 99.5% monthly. Below that, internal consumers will route around your server within a quarter.
≥ 99.5% / mo
Mean time to recovery
Detection → restoration · per incident
When the availability probe fails, MTTR is the wall-clock between first failed probe and first successful probe. The KPI is the monthly average. Platform-tier target: under 30 minutes for self-recovery cases, under 2 hours for cases requiring engineer intervention. The two cohorts deserve separate tracking — they bottleneck on different controls.
< 30 min self-recovery
Partial-health ratio
Working-tool count / advertised-tool count
Health-check expanded: for each registered tool, an actual smoke call — read-only or no-op — verifies the tool is currently functional. The KPI is the ratio of currently-passing smoke calls to advertised tool count. Aggregate uptime can be 99.9% while the partial-health ratio is 75% because three tools have a downstream API outage. Track both.
≥ 95% all-tools-healthy
The third KPI — partial health — is the one most teams skip and then regret. A typical MCP server exposes 8-20 tools, each of which can fail independently because each one wraps a different downstream dependency. The agent calling your server does not care that 17 of 20 tools work; it cares whether the specific tool it needs at this turn works. A 99.9% server-uptime number with three persistently broken tools is a failure mode the aggregate hides and the consumer feels every time.
Implementation note. The smoke calls for partial-health verification should be no-op or read-only by design — the simplest pattern is a dedicated health.check argument shape on each tool that exercises the code path without performing side effects. The cost of this discipline at build time is small; the cost of not having it is debugging silent partial outages from log lines that don't show which downstream actually broke.
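Below is a minimal sketch of the synthetic monitor behind the probe-availability and partial-health KPIs. It assumes the MCP TypeScript SDK (@modelcontextprotocol/sdk); the health: { check: true } argument shape is the convention proposed in the note above, not anything the protocol defines, and the import paths and result fields should be verified against the SDK version you actually run.

```typescript
// Synthetic probe: initialize + tools/list for availability, then a
// read-only smoke call per tool for the partial-health ratio.
import { Client } from "@modelcontextprotocol/sdk/client/index.js";
import { StreamableHTTPClientTransport } from "@modelcontextprotocol/sdk/client/streamableHttp.js";

const DEADLINE_MS = 5_000; // the recommended 5-second probe deadline

function withDeadline<T>(p: Promise<T>, ms: number): Promise<T> {
  return Promise.race([
    p,
    new Promise<never>((_, reject) =>
      setTimeout(() => reject(new Error("probe deadline exceeded")), ms)),
  ]);
}

async function probe(url: string) {
  const client = new Client({ name: "slo-probe", version: "1.0.0" });
  try {
    // Availability: handshake plus tools/list inside the deadline.
    await withDeadline(
      client.connect(new StreamableHTTPClientTransport(new URL(url))),
      DEADLINE_MS,
    );
    const { tools } = await withDeadline(client.listTools(), DEADLINE_MS);

    // Partial health: one no-op smoke call per advertised tool.
    let healthy = 0;
    for (const tool of tools) {
      try {
        const result = await withDeadline(
          client.callTool({ name: tool.name, arguments: { health: { check: true } } }),
          DEADLINE_MS,
        );
        if (!result.isError) healthy += 1;
      } catch { /* tool counts as unhealthy */ }
    }
    return { available: true, partialHealth: tools.length ? healthy / tools.length : 1 };
  } catch {
    return { available: false, partialHealth: 0 };
  } finally {
    await client.close().catch(() => {});
  }
}
```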
03 — Latency KPIs
P50, P95, P99 — the percentiles users actually feel.
Latency is where the discipline of percentiles matters most. The mean is misleading by construction — a server that responds in 120ms most of the time and 8 seconds occasionally has a mean that looks acceptable and a user experience that does not. Agents issuing dozens to hundreds of tool calls per session expose the tail of the distribution every single session; the P99 number is felt within the first ten interactions, not the thousandth.
The framework recommends three latency KPIs, tracked per tool and aggregated server-wide. The chart below shows the distribution shape on a representative production MCP server — the median is fast, the P95 is acceptable, and the P99 is where the architectural bottlenecks show up. Stating SLOs at P95 and P99 keeps the team honest about the tail experience.
Latency percentile distribution · representative production server
Source: Digital Applied production MCP observability, Q2 2026
The three latency KPIs the framework codifies: P50 (median) as the headline reassurance number, P95 as the SLO line most internal consumers actually care about, and P99 as the architectural canary. State your SLO at P95 — "95% of tool-call latencies under 600ms" — not at the mean. The P50 is fine to publish as context, but the contract sits at P95. Track P99 internally for trending; do not write a P99 SLO unless you are running a tightly-controlled environment where the tail is genuinely actionable.
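For SLI purposes a percentile is just an order statistic over the window's samples. A minimal nearest-rank sketch follows; a production pipeline would use a streaming histogram (HDRHistogram or t-digest) rather than sorting raw samples.

```typescript
// Nearest-rank percentile over a window of latency samples (ms).
function percentile(samples: number[], p: number): number {
  const sorted = [...samples].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[Math.max(0, rank - 1)];
}

const latencies = [110, 120, 95, 130, 8200, 140, 125, 105, 150, 115];
console.log(percentile(latencies, 50)); // 120 — the slide number
console.log(percentile(latencies, 95)); // 8200 — the tail users feel
// The mean of these samples is ~929 ms — it describes neither experience.
```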
Latency budgeting deserves a specific note. The end-to-end latency a user experiences is the sum of model latency, MCP round-trip latency, tool handler latency, and downstream API latency. Of those, only the middle two are inside your control. Build the SLO around MCP round-trip plus handler, not end-to-end — otherwise you are committing to a number partially owned by Anthropic or OpenAI's backbone, which is unwise from a contract standpoint and unfair to your own on-call rotation. Separate the SLI from the variables you do not control.
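One way to keep the SLI inside the variables you control is to time downstream calls separately and subtract them from the handler total. A sketch under that assumption; transform and the two record helpers are illustrative stand-ins for your handler logic and metrics client.

```typescript
// Attribute downstream wall-clock separately so the SLO covers only the
// round-trip + handler time you own.
const recordHandlerLatency = (ms: number) => console.log("mcp_handler_ms", ms);
const recordDownstreamLatency = (ms: number) => console.log("downstream_ms", ms);
const transform = async (r: Response, _args: unknown) => r.json();

async function handleToolCall(args: unknown): Promise<unknown> {
  const start = performance.now();
  let downstreamMs = 0;

  // Wrap each downstream call so its latency is measured, not promised.
  async function timed<T>(call: () => Promise<T>): Promise<T> {
    const t0 = performance.now();
    try { return await call(); }
    finally { downstreamMs += performance.now() - t0; }
  }

  const data = await timed(() => fetch("https://api.example.com/v1/resource"));
  const result = await transform(data, args); // handler work you own

  const totalMs = performance.now() - start;
  recordHandlerLatency(totalMs - downstreamMs); // the SLI series
  recordDownstreamLatency(downstreamMs);        // context, not contract
  return result;
}
```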
"P50 is the number you put on a slide. P95 is the number you write the SLO around. P99 is the number you wake up for. Confusing them in either direction is how reliability stories get told wrong."— Digital Applied agentic reliability, on percentile discipline
04 — Success Rate KPIs
Tool-call success rate predicts user retention.
Success rate is the KPI family closest to the user experience — and the one most server owners under-instrument. A tool call counts as a "success" only when the response is well-formed, the agent can act on it, and the underlying operation actually completed. A handler that catches its own exception and returns "an error occurred" is a failure for SLI purposes even though the request technically returned 200 — the agent could not progress, and the user experienced a stall. Define the SLI honestly or it will silently inflate.
Two KPIs cover the family. Tool-call success rate is the headline aggregate, broken out per tool, per caller, and per agent. The second KPI — schema validation failures — is the specific failure mode that an agentic environment exposes most severely. When a tool returns output that fails its own output schema, the agent typically loops, retries, and consumes both its context and its patience. Schema validation is therefore both a correctness signal and a cost-control signal.
Tool-call success rate
Per tool, per caller, per agent. The SLI is well-formed responses that the agent can act on, divided by all calls. Platform-tier target: 99% monthly aggregate, 99.5% on read-only tools, 98% on mutation tools (which depend on downstream availability). Aggregate-only tracking will hide the one broken tool driving user churn.
≥ 99% / mo aggregate
Schema validation failures
Responses that fail their own declared output schema, expressed as a percentage of all responses. Anything above 0.1% indicates a schema-versus-handler drift that the agent will exercise dozens of times per session. Wire schema validation into the handler exit path so the failure is logged at source — not only at the agent's parse step. A wiring sketch follows this grid.
< 0.1% / mo
Per-tool · per-caller breakdown
Aggregate success rate masks the high-impact failure: one tool, one caller, persistently failing. The framework requires every success-rate panel to support drill-down by tool and by caller (agent or human principal) so the on-call can identify the specific combination causing pain within seconds, not after a half-hour log dig.
Drill-down required
One pattern worth naming: the retry-loop tax. When a tool fails in a way the agent considers retryable — transient network errors, timeouts, malformed responses — the agent typically retries automatically, often three to five times. From a server perspective, a single user request can therefore manifest as five tool calls, four of which are failures. Aggregate success rate handles that case correctly (4-out-of-5 failed), but the dashboard reading is misleading because the user experienced one failure, not four. Track both the per-call success rate and a per-user-intent success rate where the latter de-duplicates retries within a short window.
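A sketch of that de-duplication: collapse calls sharing a caller, tool, and argument payload inside a short window into one intent, and score the intent successful if any attempt succeeded. The 30-second window and the JSON-stringify key are assumptions to tune.

```typescript
// Per-user-intent success rate: collapse an agent's automatic retries
// of the same (caller, tool, args) into one intent inside a short window.
interface CallRecord { caller: string; tool: string; args: unknown; ok: boolean; at: number; }

function intentSuccessRate(calls: CallRecord[], windowMs = 30_000): number {
  const open = new Map<string, { ok: boolean; lastAt: number }>();
  let intents = 0, succeeded = 0;
  const close = (i: { ok: boolean }) => { intents += 1; if (i.ok) succeeded += 1; };

  for (const c of [...calls].sort((a, b) => a.at - b.at)) {
    const key = `${c.caller}|${c.tool}|${JSON.stringify(c.args)}`;
    const prev = open.get(key);
    if (prev && c.at - prev.lastAt <= windowMs) {
      // Same intent: an automatic retry. Any success rescues the intent.
      prev.ok = prev.ok || c.ok;
      prev.lastAt = c.at;
    } else {
      if (prev) close(prev); // window expired: previous intent is final
      open.set(key, { ok: c.ok, lastAt: c.at });
    }
  }
  for (const i of open.values()) close(i);
  return intents === 0 ? 1 : succeeded / intents;
}

// One user action surfacing as five calls (four failed retries, then a
// success) scores 20% per-call but 100% per-intent.
```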
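The schema-validation wiring described in the grid above can live in one handler-exit helper. A sketch using Zod, assuming your declared output schemas are already Zod objects; the example schema and counters are illustrative.

```typescript
import { z } from "zod";

// Validate every response against the tool's declared output schema at
// the handler exit, so drift is logged at source rather than at the
// agent's parse step.
const weatherOutput = z.object({ tempC: z.number(), summary: z.string() });

let responses = 0, schemaFailures = 0;

function validateOnExit<T>(schema: z.ZodType<T>, payload: unknown, tool: string): T {
  responses += 1;
  const parsed = schema.safeParse(payload);
  if (!parsed.success) {
    schemaFailures += 1; // SLI numerator: target < 0.1% of responses
    console.error(`schema_validation_failure tool=${tool}`, parsed.error.issues);
    throw new Error(`tool ${tool} returned output violating its declared schema`);
  }
  return parsed.data;
}

// Usage inside a handler:
// return validateOnExit(weatherOutput, await fetchWeather(args), "get_weather");
```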
A second pattern: silent-success failures. A tool that returns "OK" when it actually failed to perform the operation — the database transaction rolled back, the email queued but never sent, the webhook fired but to a stale URL — registers as success in the SLI and as failure in the user's reality. Cover this with end-to-end probes that verify the operation by observing the downstream effect, not by trusting the handler return code. Synthetic monitors that round-trip through the entire side-effect chain are the only honest measurement for state-mutating tools.
05 — Error Budget
The error budget framework prevents burnout on both sides of the contract.
The error budget is the operational artifact that turns SLOs from aspirational targets into a working contract. The budget is the gap between the SLO and 100%, expressed in time or error-count terms for the measurement window. A 99.5% monthly uptime SLO leaves a 0.5% budget — roughly 216 minutes per 30-day month — that the team is explicitly allowed to spend on planned maintenance, risky deploys, or unplanned incidents.
The framework treats the budget as a finite, depletable resource with three operating policies layered on top: how the budget is spent intentionally (planned-maintenance discipline), what happens when burn-rate exceeds policy (the operational response), and what happens when the budget is exhausted (the organisational response). The matrix below maps the four states a budget can be in and the team behavior each state unlocks or blocks.
> 50% budget remaining
Operating normally. The team can ship risky changes (schema changes, dependency bumps, infrastructure swaps) with confidence. Planned maintenance windows are scheduled inside the remaining budget. This is the state the team is engineering to stay in for the rest of the month.
Ship freely · maintain proactively
20-50% budget remaining
Burn-rate is elevated. Risky deploys move behind a feature flag with gradual rollout. Planned maintenance is deferred unless covered by an explicit consumer-side change window. The team's standing agenda shifts toward investigating the burn cause rather than shipping new tools — the budget is signalling reliability work has been deprioritised too long.
Slow risky changes · investigate burn
< 20% budget remaining
Feature work pauses by policy. The team's calendar is consumed by reliability work — root-cause analysis, post-incident remediation, capacity headroom additions, and the specific fix for the highest-recurring failure mode. The pause is structural, not punitive — it is the contract the team signed when it stated the SLO.
Pause feature work · ship reliability
0% budget remaining
SLO breach. The contract has been broken. The required response is a written post-mortem to every consumer team, a remediation plan with milestones, and either a renegotiated SLO with weaker targets or a funded reliability sprint to restore the original contract. Silence at this point is the loudest signal you can send.
Communicate breach · file remediation
The political function of the error budget is as important as the operational one. Without a budget, every failure feels like a five-alarm fire — and the team that gets paged twice in a fortnight starts to drift toward defensive engineering, blocking deploys, and adding gates that slow the rest of the organisation. With a budget, small failures are contractually-permitted spend; the team has explicit cover for taking deliberate risks. The budget converts an open-ended anxiety into a finite, manageable resource — and the team can point at the budget gauge during a planning meeting rather than relitigating reliability values on every standup.
One implementation detail worth getting right. The error budget should be calculated and displayed on the same panel as the SLO itself, in the same units, and updated on the same cadence as the underlying SLIs. A budget that lives in a spreadsheet is a budget that nobody looks at. A budget that shows up as a thermometer next to the uptime number is a budget the team negotiates against in real time. The visualisation is the management surface, not the data layer.
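The thermometer display implies a small classification step from remaining-budget fraction to operating state. A minimal sketch of that mapping, matching the matrix above and the budgetRemaining arithmetic in Section 01's sketch; the thresholds are this framework's defaults, not universals.

```typescript
// Map remaining-budget fraction (1 = untouched, 0 = spent) to the
// four operating states from the matrix above.
type BudgetState = "ship-freely" | "slow-risky-changes" | "pause-features" | "slo-breach";

function budgetState(remaining: number): BudgetState {
  if (remaining <= 0) return "slo-breach";          // communicate + remediate
  if (remaining < 0.2) return "pause-features";     // reliability work only
  if (remaining < 0.5) return "slow-risky-changes"; // flags + burn investigation
  return "ship-freely";                             // normal operations
}

console.log(budgetState(0.31)); // "slow-risky-changes"
```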
06 — Alerting Thresholds
Burn-rate alerts beat threshold alerts every time.
The most common alerting mistake the framework displaces is the static-threshold alert. "Page me when uptime drops below 99%" sounds reasonable and fires either too late or too noisily depending on the window. A 1% drop sustained over an hour is a paging-worthy incident; a 1% drop in a single sixty-second window is statistical noise. The right primitive is the burn-rate alert: page when the rate of error-budget consumption, projected forward, would exhaust the budget in less time than the recovery process can absorb.
Google's SRE workbook codifies the multi-window burn-rate alert pattern; this framework adapts it for MCP server scale. Two windows per alert keep the false-positive rate manageable. A short window (5 minutes) catches fast burns; a long window (1 hour) confirms the burn is sustained rather than a single bad probe. The alert fires only when both windows exceed the burn threshold simultaneously. The chart below shows the framework's recommended thresholds across the four KPI families.
Alerting thresholds · MCP SLO framework
Source: Adapted from Google SRE multi-window burn-rate guidance, calibrated for MCP server load
The 14.4× burn-rate threshold is the framework's default paging trigger because it corresponds to a burn pace that would exhaust a month's budget in 2 days — a rate fast enough that the on-call engineer's response window matches the budget's remaining lifetime. Slower burns (6× rate, exhausting in 5 days) generate tickets rather than pages — they are real but not urgent, and the team should schedule them inside the next sprint rather than disrupting sleep cycles. The two-tier split is what keeps the on-call rotation sustainable.
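A sketch of the two-window evaluation; the window statistics would come from your metrics backend, and the thresholds are the framework defaults described above.

```typescript
// Two-window burn-rate evaluation, adapted from the Google SRE pattern.
// burnRate = observed error rate / allowed error rate (1 - SLO target).
interface WindowStats { errors: number; total: number; }

function burnRate(w: WindowStats, sloTarget: number): number {
  const errorRate = w.total === 0 ? 0 : w.errors / w.total;
  return errorRate / (1 - sloTarget);
}

// Page only when BOTH windows burn fast: a 14.4x rate spends a 30-day
// budget in ~2 days. A sustained 6x rate (~5 days) files a ticket instead.
function evaluate(short5m: WindowStats, long1h: WindowStats, sloTarget: number) {
  const s = burnRate(short5m, sloTarget);
  const l = burnRate(long1h, sloTarget);
  if (s >= 14.4 && l >= 14.4) return "page";
  if (s >= 6 && l >= 6) return "ticket";
  return "ok";
}

// Example: 99.5% SLO, 4% of probes failing in both windows -> 8x burn -> ticket.
console.log(evaluate({ errors: 4, total: 100 }, { errors: 48, total: 1200 }, 0.995));
```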
One operational detail worth emphasising: the burn-rate alert must be backed by an actionable runbook. The recurring failure mode of burn-rate alerts is that they fire correctly, wake the engineer, and then leave them staring at a Grafana panel without a documented diagnostic path. Every alert should resolve to a single runbook page describing the most likely causes, the queries that confirm each cause, and the specific remediation playbook for each. The alert without the runbook is the worst of both worlds — paging discipline without recovery discipline.
07 — OpenTelemetry Integration
OTel semantic conventions unlock vendor portability.
OpenTelemetry is the standardisation layer that turns an MCP server's telemetry from a vendor-locked silo into a portable, queryable substrate. The OTel specification covers three pillars — traces, metrics, logs — with a shared semantic conventions vocabulary that lets a query written for Datadog work, with minor adjustments, on Honeycomb, Grafana Cloud, or self-hosted Tempo + Prometheus + Loki. The framework treats OTel adoption as table stakes for any MCP server intended to outlive its first observability backend.
The semantic conventions that matter most for MCP servers split into three categories. Agent run attributes describe the agent invocation that triggered the tool call — user identity, conversation ID, model name. Tool invocation attributes describe the individual call — tool name, input shape (not values), output schema validity, retry index. MCP transport attributes describe the protocol layer — stdio vs streamable HTTP, initialize handshake state, capability negotiation result. The grid below maps each category to the specific attributes the framework recommends.
Agent run attributes
user.id · conversation.id · gen_ai.system
Attached to the root span of a session. Identifies the principal, the conversation, the model in use, and the host application. These are the attributes you query when answering 'which user, which agent, which model is driving the burn-rate I'm seeing'. Hashed identifiers are appropriate for privacy-sensitive deployments.
Session identification
Tool invocation attributes
mcp.tool.name · mcp.tool.retry · gen_ai.tool.call.id
Attached to every tool-call span. Tool name (always), retry index (when an agent retries the same tool with the same arguments in a short window), and the unique call ID assigned by the host. Input argument shape — keys and types, never values — can be attached for cardinality control.
Per-call instrumentation
Transport layer attributes
mcp.transport · mcp.protocol.version · mcp.session.id
Attached to the connection-establishment span. Transport (stdio / streamable-http / sse), protocol version negotiated, session ID from the handshake. These are the attributes you query when diagnosing 'all my errors are coming from clients on protocol version X' — a common pattern after a spec revision.
Protocol diagnostics
Two implementation notes save weeks of pain. First, cardinality discipline. OTel attributes that embed unbounded high-cardinality values — full user IDs, raw request URLs, complete argument values — will detonate the cost of any backend that bills by series count. High-cardinality fields go in span attributes (logged once, queryable, cheap); they do not go on metric labels (multiplied across the entire time series, expensive). Read the bill from your current backend, identify the top three cardinality offenders, and audit your instrumentation before flipping the OTel switch.
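A sketch of per-call span instrumentation carrying these attributes, using the @opentelemetry/api package. The attribute keys follow this framework's recommendations; check them against the current OTel semantic conventions before standardising, since the GenAI and MCP conventions are still evolving.

```typescript
import { trace, SpanStatusCode } from "@opentelemetry/api";

// Wrap a tool call in a span carrying the attribute categories above.
const tracer = trace.getTracer("mcp-server");

async function tracedToolCall(toolName: string, callId: string, retryIndex: number,
                              handler: () => Promise<unknown>) {
  return tracer.startActiveSpan(`tools/call ${toolName}`, {
    attributes: {
      "mcp.tool.name": toolName,
      "gen_ai.tool.call.id": callId,
      "mcp.tool.retry": retryIndex,
      "mcp.transport": "streamable-http",
      // High-cardinality values (user IDs, raw args) belong here as span
      // attributes -- never as metric labels, where they multiply series.
    },
  }, async (span) => {
    try {
      const result = await handler();
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (err) {
      span.setStatus({ code: SpanStatusCode.ERROR, message: String(err) });
      throw err;
    } finally {
      span.end();
    }
  });
}
```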
Second, sampling strategy. Most production MCP servers issue volumes of tool calls that are expensive to trace exhaustively. Head-based sampling (decide at span start) is cheap and biased; tail-based sampling (decide at span end based on outcome) gives you 100% of errors plus a representative sample of successes and is what the framework recommends for SLO-grade telemetry. The OTel collector supports tail sampling natively; configure it once at the collector layer rather than baking sampling decisions into every service.
"The point of OpenTelemetry is not the panel you build today. It is the panel you can rebuild on a different backend the day after the budget cycle changes — without rewriting the instrumentation layer underneath."— Digital Applied agentic reliability, on instrumentation portability
For teams evaluating whether to internalise this framework or partner on it: the eight KPIs and the OTel conventions above are sufficient to design a credible SLO panel and a working alerting contract on your own. The reason teams engage us is rarely capability; it is calibration — a reviewer who has seen the same SLO design land across dozens of production servers names the right targets faster, and writes the error-budget policy in language that maps onto an executive review. If that calibration is valuable, our agentic AI transformation engagements include MCP server SLO design as a discrete deliverable; if it is not, the framework above is yours to run.
SLOs turn MCP servers from prototypes into platforms.
The eight KPIs in this framework — probe availability, mean time to recovery, partial-health ratio, latency at P50 / P95 / P99, tool-call success rate, schema validation failures, and the burn-rate signal that ties them together — are the instrumentation surface that distinguishes a production MCP server from a personal tool with shared credentials. None of the eight are individually novel; the combination, stated as a contract and enforced by an error-budget policy, is what gives the server organisational credibility.
The single most consequential framing in Section 01 is the one to walk away with. SLOs are not engineering trivia — they are the organisational language for what a piece of infrastructure promises. Internal consumers asking whether they can build on your MCP server are asking, in effect, what it promises and how close it currently sits to that promise. The eight KPIs are how you give that question two numeric answers and a panel URL instead of a paragraph and a vouch.
The next step is concrete. Pick one MCP server you run in production. Pick a starting SLO for each of the four families — uptime, latency, success rate, error budget — calibrated honestly against last month's incident log. Wire the OpenTelemetry instrumentation, build the dashboard with both the SLI and the burn-rate gauge in view, and commit to one quarter of operating against the contract before adjusting the targets. The first quarter is the hard one; the second quarter is when the platform language starts to feel earned.