Development · Industry Guide · 15 min read · Published June 14, 2026

Four DORA metrics · trunk-based delivery · build once, promote always

CI/CD Pipeline Design in 2026: An Engineering Reference

Elite teams deploy many times a day with lead times under a day — yet most teams are nowhere near that. This reference maps the path: DORA’s four metrics as the scoreboard, trunk-based development with feature flags as the branching model, and a deployment-strategy matrix for choosing between blue-green, canary, and progressive delivery.

Digital Applied Team
Senior engineers · Published June 14, 2026
Sources: DORA, GitHub, GitLab docs
DORA key metrics: 4 (deploy freq · lead time · CFR · MTTR)
Elite lead time target: < 1 day (code commit to production)
Google single trunk: 35K+ developers · one monorepo
DORA research base: 39K+ professionals · 2014–2025

CI/CD pipeline design in 2026 is no longer about wiring a build server to a deploy script — it is about engineering the whole path from commit to production so that software ships frequently and safely at the same time. The teams that do this well deploy multiple times a day with lead times under 24 hours; most teams are still a long way from that, and the distance is measurable.

What separates the two is not raw velocity or tooling spend. It is a disciplined combination of how code is integrated (trunk-based, small, continuous), how releases are decoupled from deploys (feature flags), how risk is contained at rollout (canary and blue-green), and how the whole system is measured (DORA’s four metrics). Get those four working together and the pipeline stops being a source of fear and becomes a competitive advantage.

This reference walks the full design space: the four metrics that actually predict delivery performance, the elite thresholds and how far the typical team sits from them, the 2025 restructuring of the DORA model, trunk-based development versus GitFlow, a deployment strategy decision matrix, artifact immutability, the highest-impact speed optimizations in GitHub Actions and GitLab CI, and the supply chain security practices that pipeline-design guides routinely skip.

Key takeaways
  1. DORA's four metrics are the scoreboard. Deployment frequency and lead time for changes measure throughput; change failure rate and mean time to restore measure stability. Together they predict delivery performance better than velocity or output metrics alone.
  2. Elite means daily deploys and sub-day lead times. The widely cited elite thresholds are on-demand deployment, lead time under a day, a low change failure rate, and restore time under an hour. Most organizations are not there yet; the gap is the work.
  3. Trunk-based development beats GitFlow for continuous delivery. Short-lived branches merged to a single trunk daily keep the codebase releasable. Google runs 35,000+ developers on one monorepo trunk. GitFlow's five branch types add integration complexity that fights continuous deployment.
  4. Feature flags decouple deploy from release. Merging incomplete work behind a flag resolves the core tension in trunk-based development: you ship code continuously without exposing unfinished features to users, and you toggle rollout per cohort.
  5. Build once, promote always, and secure the runner. Build a single immutable artifact and promote that exact build through every environment; inject config via variables, never rebuild. Then treat the CI runner as a high-value target with OIDC, SBOMs, and provenance.

01 · The Scoreboard · DORA’s four metrics, defined precisely.

DORA — DevOps Research and Assessment, now a research program under Google Cloud — identifies four metrics that, across a research base of more than 39,000 professionals gathered between 2014 and 2025, predict software delivery performance. Two measure throughput and two measure stability, and the discipline of the model is that you are expected to move all four together rather than trade one off against another.

The two throughput metrics

Deployment frequency is how often an organization successfully releases to production. Lead time for changes is the elapsed time from a code commit to that change running in production. Both answer the question “how fast can we get a change in front of users?”

The two stability metrics

Change failure rate (CFR) is the percentage of deployments that cause a failure in production requiring remediation — a hotfix, rollback, or patch. Mean time to restore (MTTR) is how long it takes to recover service once an incident occurs. Both answer the question “when we move fast, how often do we break things, and how quickly do we recover?”

The reason these four are treated as a set is that any one of them is gameable in isolation. You can inflate deployment frequency by shipping trivial changes, or report a flawless change failure rate by deploying almost nothing. The four-metric frame exists precisely so that throughput and stability are read against each other — a team is only genuinely improving when both halves move in the right direction at once.
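The four definitions above can be sketched as a small calculation over deployment records. This is a minimal illustration, not DORA's own methodology: the `Deployment` record, its field names, and the choice of a median for lead time are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # needed a hotfix, rollback, or patch
    restored_at: Optional[datetime] = None  # when service recovered, if it failed


def dora_metrics(deployments: list[Deployment], window_days: int) -> dict:
    """Compute the four metrics over a reporting window of `window_days`."""
    if not deployments:
        raise ValueError("no deployments in window")
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    failures = [d for d in deployments if d.failed]
    restores = [d.restored_at - d.deployed_at for d in failures if d.restored_at]
    return {
        "deploys_per_day": len(deployments) / window_days,
        "median_lead_time": median(lead_times),
        "change_failure_rate": len(failures) / len(deployments),
        "mean_time_to_restore": (
            sum(restores, timedelta()) / len(restores) if restores else None
        ),
    }


# Hypothetical two-day window with four deployments, one of which failed.
t0 = datetime(2026, 6, 1, 9, 0)
sample = [
    Deployment(t0, t0 + timedelta(hours=4)),
    Deployment(t0, t0 + timedelta(hours=8)),
    Deployment(t0, t0 + timedelta(hours=30), failed=True,
               restored_at=t0 + timedelta(hours=30, minutes=45)),
    Deployment(t0, t0 + timedelta(hours=12)),
]
metrics = dora_metrics(sample, window_days=2)
```

Returning all four values from one function is deliberate: it makes it harder to report a throughput number without the stability number that qualifies it.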

Throughput
Deployment frequency
Releases to production / time

How often you successfully ship to production. Elite teams deploy on-demand, multiple times per day. Higher frequency means smaller batches, which makes every other metric easier to manage.

Smaller batches · lower risk

Throughput
Lead time for changes
Commit → production (elapsed)

How long a committed change takes to reach production. Elite teams keep this under a day. Long lead times usually signal manual gates, batch releases, or a fragile pipeline.

Target: < 1 day

Stability
Change failure rate
% of deploys needing remediation

The share of deployments that cause a production failure requiring a hotfix, rollback, or patch. Lower is better, but it must be read alongside deployment frequency, not in isolation.

Stability counterweight

Stability
Time to restore
Incident → service recovered

How quickly you recover when a deployment causes an incident. Elite teams restore in under an hour, usually via fast rollback, feature-flag kill switches, or forward-fix automation.

Target: < 1 hour

Why these four, and not velocity
DORA’s published research reports that teams strong across the four metrics are more likely to meet their organizational performance goals and to report better customer satisfaction. The practical lesson is that throughput and stability are not opposites to be traded off — the highest performers achieve both at once, which is exactly why the four are read as a single balanced scorecard rather than four independent dials.

02 · Where Teams Stand · Elite thresholds and the distance most teams have to cover.

The classic four-tier reference — elite, high, medium, low — gives concrete numbers to aim at. Elite performers deploy on-demand (multiple times per day), keep lead time for changes under one day, hold change failure rate in the low range, and restore service in under an hour. These thresholds remain the most useful practical targets even though, as the next section explains, DORA itself restructured around them in 2025.

Independent benchmarking from LinearB, drawn from more than 3,000 engineering teams in its own customer base, gives slightly more granular elite thresholds: deployment frequency above one per day, change lead time under roughly 26 hours, change failure rate below 1%, and recovery time under six hours. It is a different sample from the DORA survey (a vendor’s customer cohort rather than a broad cross-industry study), so the two are best read as complementary data points rather than a single authoritative line.

Elite delivery thresholds · two complementary data sets
Source: DORA four-tier thresholds + LinearB engineering benchmarks

DORA elite · deployment frequency (on-demand, multiple times per day): Daily+
DORA elite · lead time for changes (commit to production): < 1 day
DORA elite · time to restore (incident to recovery): < 1 hour
LinearB elite · change lead time (vendor cohort, 3,000+ teams): < 26 hrs
LinearB elite · recovery time (vendor cohort, 3,000+ teams): < 6 hrs

The honest interpretation of the benchmarks is the part most posts skip. Reaching on-demand deployment is genuinely rare — secondary compilations of the 2025 DORA data suggest only a small minority of organizations actually deploy multiple times a day, while a large share still deploy less than once a month and many carry change lead times measured in weeks rather than hours. Treat the specific percentages from those compilations as directional; the structural point is robust, and it is the one that matters. Most teams are not near elite, and the journey from medium to high is where the practical work lives.

That gap reframes the goal. If you are deploying weekly with a multi-day lead time, the target is not “become Google” — it is to halve your batch size, automate one manual gate, and add a fast rollback path, then measure the four metrics again next quarter. The benchmarks are a compass, not a scoreboard you are graded on this sprint.
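One way to use the benchmarks as a compass is to list which targets a team currently misses and pick the next improvement from that list. The sketch below encodes the elite targets quoted in this section; the constant names and the one-deploy-per-day floor are assumptions chosen for illustration, not official DORA cut-offs.

```python
from datetime import timedelta

# Elite targets as quoted in this reference; treat them as directional.
ELITE_MIN_DEPLOYS_PER_DAY = 1.0          # assumed floor for "daily or better"
ELITE_MAX_LEAD_TIME = timedelta(days=1)
ELITE_MAX_RESTORE_TIME = timedelta(hours=1)


def gaps_to_elite(deploys_per_day: float,
                  lead_time: timedelta,
                  restore_time: timedelta) -> list[str]:
    """Return the names of the metrics that miss the elite targets."""
    gaps = []
    if deploys_per_day < ELITE_MIN_DEPLOYS_PER_DAY:
        gaps.append("deployment_frequency")
    if lead_time >= ELITE_MAX_LEAD_TIME:
        gaps.append("lead_time")
    if restore_time >= ELITE_MAX_RESTORE_TIME:
        gaps.append("time_to_restore")
    return gaps


# A weekly-deploy team with a three-day lead time and a 30-minute restore.
weekly_team = gaps_to_elite(1 / 7, timedelta(days=3), timedelta(minutes=30))
```

Re-running the same check each quarter shows whether the list is shrinking, which is a more useful signal than the absolute distance from elite.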

03 · The 2025 Shift · Why DORA moved from four tiers to seven archetypes.

The most current development in the model is also the most under-reported in practitioner content. The 2025 DORA report moved away from the familiar four-tier classification (low/medium/high/elite) and, per analyses of the report, introduced seven team archetypes built on a broader set of measures, with descriptive labels such as “Legacy Bottleneck” and “Harmonious High-Achiever” rather than a single elite-to-low ladder. The intent is to capture that a team can be fast and unstable, or stable and slow, in ways a single rank obscures.

This matters for how you use the older benchmarks. The four-tier thresholds are still the most actionable practical targets, and nothing about the 2025 restructuring makes “deploy daily, lead time under a day” a worse goal. What changed is the framing: DORA now resists the idea that every team should be chasing a single “elite” label, and instead asks what archetype you are and which constraint is actually holding you back. Use the four-tier numbers as targets; use the archetype lens to diagnose which of the four metrics is your binding constraint.

The 2025 report’s headline theme is the role of AI in delivery, and its framing is deliberately measured. Rather than presenting AI as a uniform accelerator, the report characterizes it as an amplifier — the gains show up most in documentation and code quality, while the effect on delivery throughput and stability depends heavily on the surrounding system. The directional reading is that AI rewards teams whose pipeline, testing, and review practices are already sound, and exposes the gaps in teams whose practices are not.

DORA 2025 — the amplifier finding

The 2025 DORA report’s central message, in DORA’s own words: “AI’s primary role is as an amplifier, magnifying an organization’s existing strengths and weaknesses. The greatest returns on AI investment come not from the tools themselves, but from a strategic focus on the underlying organizational system.” The practical translation for pipeline design: invest in the system (trunk-based flow, fast tests, immutable artifacts, safe rollout) before you expect AI tooling to move your delivery metrics.

04 · Branching Model · Trunk-based development beats GitFlow for continuous delivery.

The branching model is the single highest-leverage decision in pipeline design, because it determines how often code integrates and therefore how releasable the codebase stays. Trunk-based development (TBD) asks developers to commit to a single main branch daily, using short-lived branches only for code review and CI validation before they merge back quickly. Google runs its entire codebase — used by more than 35,000 developers and QA automators — in a single monorepo trunk, which is the existence proof that the model scales.

GitFlow takes the opposite posture. Its five branch types — main, develop, feature, release, and hotfix — create integration complexity that tends to increase pipeline fragility unless a team invests heavily in templating and infrastructure discipline. Long-lived feature branches drift from trunk, accumulate merge conflicts, and delay integration — the exact opposite of what continuous delivery needs. For continuous deployment, trunk-based development and the lighter GitLab Flow are far better aligned.

“Trunk-Based Development is a key enabler of Continuous Integration and by extension Continuous Delivery. When developers commit multiple times daily, teams easily satisfy the core CI requirement of committing at least once every 24 hours, keeping the codebase releasable on demand.” (TrunkBasedDevelopment.com)

The obvious objection to committing to trunk daily is that features are rarely finished daily. Feature flags resolve that tension directly: incomplete work is merged to trunk but kept dark behind a flag, so the code integrates continuously without exposing anything unfinished to users. This is what lets a team deploy and release on different schedules — deploy the artifact whenever it is green, release the feature when it is ready, and roll it out per cohort. A vendor in the flagging space reports an 89% reduction in deployment-related incidents after adopting feature switches; treat that as a vendor-stated, directional figure rather than independent research, but the mechanism — smaller, reversible, gated changes — is sound regardless of the exact number.
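The per-cohort rollout described above is often implemented by hashing a user into a stable bucket and comparing that bucket with a rollout percentage. The following is a minimal sketch under that assumption; the function names and the flag name are hypothetical, and real flag services add targeting rules, persistence, and auditing on top.

```python
import hashlib


def bucket(flag: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % 100


def is_enabled(flag: str, user_id: str, rollout_percent: int,
               kill_switch: bool = False) -> bool:
    """Return True if the flag is on for this user.

    The kill switch overrides the rollout percentage, so a release can be
    turned off without a redeploy.
    """
    if kill_switch:
        return False
    return bucket(flag, user_id) < rollout_percent


# Dark launch: code is deployed, but the feature is released to nobody.
dark = is_enabled("new-checkout", "user-42", rollout_percent=0)
# Full release: every bucket is below 100.
full = is_enabled("new-checkout", "user-42", rollout_percent=100)
```

Because the bucket depends only on the flag and the user, a user who sees the feature at 10% keeps seeing it as the percentage grows, which keeps the experience consistent during a gradual release.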

For the full pattern catalogue — canary, ring, and kill-switch rollouts, plus how to manage the flag debt that accumulates if you never clean old flags up — our feature flag rollout strategies playbook goes deeper than this reference can.

Branch model
Trunk-based development
1 trunk

Daily commits to main, short-lived branches for review only. Keeps the codebase releasable on demand and satisfies the core CI requirement of integrating at least once every 24 hours.

Best for continuous delivery

Branch model
GitFlow
5 types

Main, develop, feature, release, hotfix. Powerful for scheduled, versioned releases, but the branch sprawl adds pipeline fragility and delays integration — a poor fit for continuous deployment.

Versioned, scheduled releases

Proof at scale
Google's single monorepo
35K+

More than 35,000 developers and QA automators work in one trunk. The headline takeaway is not the tooling — it is that trunk-based development is the model proven to scale to the largest engineering organizations.

TrunkBasedDevelopment.com

05 · Rollout Strategy · The deployment strategy decision matrix.

Once the branching model keeps trunk releasable, the next decision is how a release reaches users. The five strategies below differ on rollback speed, infrastructure overhead, how tightly they contain a bad release, and — the column most comparisons omit — how they cope with database schema changes. The matrix is ours, synthesized from Octopus Deploy’s blue-green-versus-canary comparison, the Argo Rollouts documentation, Harness’s best-practice guide, and CircleCI’s progressive-delivery material. Read each row against your constraints; no single strategy wins on every axis.

CI/CD deployment strategy decision matrix comparing blue-green, canary, rolling, feature flags with trunk-based development, and GitOps with Argo CD across rollback time, infrastructure overhead, blast-radius control, database-schema compatibility, and DORA impact. Sources: Octopus Deploy blue-green vs canary comparison, Argo Rollouts documentation, Harness CI/CD best practices, and CircleCI progressive delivery guide, retrieved June 14, 2026.
| Strategy | Rollback time | Infra overhead | Blast-radius control | DB-schema compatibility | DORA impact |
| --- | --- | --- | --- | --- | --- |
| Blue-green | Seconds: flip traffic back to blue | High: two full production environments | All-or-nothing per switch | Fragile: stateful schema changes break it | Strong MTTR; neutral deploy frequency |
| Canary | Fast: drain the 1–5% slice | Low–moderate: incremental capacity | Tight: graduated traffic exposure | Works with backwards-compatible migrations | Strong CFR + MTTR |
| Rolling | Slow: no automated rollback path | Low: reuses existing capacity | Poor: no control over spread | Tolerates compatible changes only | Weak: too risky at high volume |
| Feature flags + TBD | Instant: toggle the flag off | Low: one environment, flag infra | Per-flag, per-cohort control | Can gate schema reads behind a flag | Best all-round across the four keys |
| GitOps (Argo CD) | Fast: revert the Git manifest | Moderate: controller + cluster | Depends on rollout strategy chosen | Inherits the chosen rollout's behavior | Strong auditability; Kubernetes-native |

Three things stand out from the matrix. Blue-green buys near-instant rollback by maintaining two identical production environments and flipping traffic from the live one to the new one — but that doubles infrastructure cost during the transition and is fragile when a release involves a stateful schema change. Canary routes a small initial slice of traffic (typically 1–5%) to the new version, watches error rates and latency, and expands progressively — giving the tightest blast-radius control, provided your migrations stay backwards-compatible. Tooling such as Argo Rollouts and Flagger automates the analysis-and-promotion loop.
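The analysis-and-promotion loop that canary tooling automates can be sketched in a few lines. This is an illustration of the control flow only, not the Argo Rollouts or Flagger API: `run_canary`, the `error_rate_at` probe, and the step weights are hypothetical names and values.

```python
from typing import Callable, Sequence


def run_canary(
    steps: Sequence[int],
    error_rate_at: Callable[[int], float],
    max_error_rate: float,
) -> tuple[str, int]:
    """Walk through traffic weights, aborting on the first unhealthy step.

    steps: traffic percentages to try in order, e.g. (1, 5, 25, 100).
    error_rate_at: hypothetical probe returning the canary error rate
        observed at a given traffic percentage.
    max_error_rate: abort threshold.

    Returns ("promoted", last_weight) on success, or
    ("rolled_back", weight) with the weight at which analysis failed.
    """
    if not steps:
        raise ValueError("at least one traffic step is required")
    for weight in steps:
        if error_rate_at(weight) > max_error_rate:
            return ("rolled_back", weight)
    return ("promoted", steps[-1])


# A release that degrades once it receives a quarter of the traffic.
def degrading(weight: int) -> float:
    return 0.001 if weight < 25 else 0.08


outcome = run_canary((1, 5, 25, 100), degrading, max_error_rate=0.01)
```

The point of the structure is that exposure never grows past a step that failed its check, which is what keeps the blast radius bounded to the last healthy weight.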

The third point is the warning. A plain rolling deployment — progressively replacing old instances with new — is widely considered too risky for high-volume production at scale precisely because it offers no real control over blast radius and no automated rollback on failure. Pair rollout choice with environment-promotion targets that fit the workload; for edge and function-based deploys, weigh your serverless deployment targets against the rollback guarantees each platform actually provides, and for systems split into many independently-shipped services, account for the microservices pipeline complexity that comes from coordinating per-service rollouts.

On rolling deployments at scale

CircleCI’s progressive-delivery guidance puts the risk plainly: in large, high-volume production environments a rolling update is often considered too risky because it provides no control over the blast radius, may roll out too aggressively, and provides no automated rollback on failure. Canary and blue-green exist to give back exactly that control.

06 · Artifact Discipline · Build once, promote always.

Artifact immutability is the most quietly violated principle in CI/CD. The rule is simple: build a single artifact once, then promote that exact same build through staging and production unchanged. Never rebuild per environment. Environment differences — connection strings, API endpoints, feature defaults — must be injected at runtime via environment variables, never baked into the artifact at build time.

The reason teams break this rule is that rebuilding per environment feels harmless, and the pipeline often makes it the path of least resistance. But every rebuild is a chance for environment drift: a different base image digest, a transitive dependency that floated to a new version, a build-time flag that differs by stage. When staging passes and production fails on “the same code,” the artifact you tested in staging was usually not the artifact you shipped. Build-once-promote-always eliminates that entire class of incident by guaranteeing the bytes you tested are the bytes you run.

“Build once, then promote the exact same artifact across staging and production to prevent environment drift.” (Harness CI/CD Best Practices)

Two implementation notes make this concrete. First, do not confuse a build cache with a build artifact — they solve different problems. A cache (dependencies, compiled intermediates) exists to speed up the build and can be regenerated at any time; an artifact is the immutable output you promote and must be stored and versioned deliberately. Second, the promotion unit should be addressed by an immutable identifier — a content digest, not a mutable tag like latest — so that “promote build X to production” is unambiguous and auditable.
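The two notes above, digest addressing and runtime configuration, can be shown together in a small model. It is a sketch under simplifying assumptions: the registry and environment maps are plain dictionaries standing in for a real artifact store and deploy system, and `API_URL` is a hypothetical variable name.

```python
import hashlib
import os


def artifact_digest(artifact: bytes) -> str:
    """Content digest used as the immutable identifier for a build."""
    return "sha256:" + hashlib.sha256(artifact).hexdigest()


def promote(registry: dict, environments: dict, digest: str, env: str) -> None:
    """Point an environment at an existing build; this never rebuilds."""
    if digest not in registry:
        raise KeyError(f"unknown artifact {digest}")
    environments[env] = digest


def runtime_config(environ=os.environ) -> dict:
    """Per-environment settings are read from variables at start-up."""
    return {"api_url": environ.get("API_URL", "http://localhost:8080")}


# Build once...
build = b"compiled application bytes"
digest = artifact_digest(build)
registry = {digest: build}

# ...then promote the same digest through each environment.
environments: dict = {}
promote(registry, environments, digest, "staging")
promote(registry, environments, digest, "production")
```

Since both environments hold the same digest, "the bytes you tested are the bytes you run" becomes something the pipeline can check rather than something the team has to trust.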

Do this
Build once, inject config at runtime

One artifact flows from CI through staging to production unchanged. Per-environment differences are supplied as environment variables. The bytes you test are the bytes you run.

Immutable promotion

Avoid this
Rebuild per environment

Rebuilding for staging and again for production invites environment drift — a floated dependency, a different base image, a build flag that differs by stage. Staging-green, production-red incidents trace back here.

Source of drift

Address by
Content digest, not a mutable tag

Promote by immutable identifier so 'ship build X' is unambiguous and auditable. Mutable tags like latest make it impossible to prove which build actually reached production.

Digest-pinned promotion

07 · Pipeline Speed · Caching and parallelization that actually move the needle.

Lead time for changes is largely determined by how long the pipeline takes, and the two biggest levers are caching dependencies and running independent work in parallel. In GitHub Actions, dependency caching is the highest-impact single change for most JavaScript and Node.js projects — vendor guidance puts the build-time reduction in the region of 60–80%, which is directional rather than guaranteed but consistent with what most teams see once a warm cache is in place.

The platform mechanics are worth knowing precisely. GitHub Actions’ cache stores up to 10 GB per repository by default (configurable higher), evicts entries after seven days without access, and rate-limits cache traffic per repository. Jobs run in parallel by default; the needs: keyword is what introduces a deliberate sequential dependency between them. Beyond dependency caching, the high-leverage patterns are Docker layer caching with cache-from / cache-to, path filters so unrelated changes skip irrelevant jobs, and concurrency controls to cancel superseded runs.
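A dependency cache only helps if its key changes exactly when the dependencies change. A common approach, similar in spirit to hashing the lockfile in a GitHub Actions cache key, is sketched below; the `cache_key` function and the `deps` prefix are illustrative assumptions, not part of any CI platform's API.

```python
import hashlib


def cache_key(runner_os: str, lockfile_bytes: bytes, prefix: str = "deps") -> str:
    """Build a cache key that changes only when the lockfile changes."""
    lock_hash = hashlib.sha256(lockfile_bytes).hexdigest()
    return f"{prefix}-{runner_os}-{lock_hash}"


# Same lockfile content gives the same key, so the cache is reused.
first = cache_key("Linux", b'{"lockfileVersion": 3, "packages": {}}')
second = cache_key("Linux", b'{"lockfileVersion": 3, "packages": {}}')

# Any change to the lockfile yields a new key and therefore a fresh cache.
changed = cache_key("Linux", b'{"lockfileVersion": 3, "packages": {"a": {}}}')
```

Including the runner operating system in the key avoids restoring binaries built for a different platform.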

Caching
Dependency cache
10 GB/repo default · 7-day TTL

Restore installed dependencies between runs instead of reinstalling. The single highest-impact speed change for Node.js pipelines; vendor guidance suggests a 60–80% build-time reduction (directional).

Cache ≠ artifact

Parallelism
Parallel jobs + needs
Parallel by default · needs: for order

GitHub Actions runs jobs in parallel unless you declare a dependency. Use needs: only where order genuinely matters; let independent test shards and lint run concurrently.

Fan out, then fan in

GitLab CI
DAG with needs
stages · needs · parallel · cache

GitLab's .gitlab-ci.yml supports stages, needs for out-of-order DAG execution, parallel for distributing workloads, and cache policies (pull / push / pull-push), with up to 150 include references per pipeline.

Up to 150 includes

GitLab CI exposes a comparable toolkit through .gitlab-ci.yml: a default stage order, the needs keyword for DAG-based out-of-order execution, parallel for distributing a job across runners, and a cache with explicit pull / push / pull-push policies — plus the ability to compose a pipeline from up to 150 include file references. Reusable building blocks matter at the org level too: GitHub Actions’ reusable workflows (the workflow_call trigger) let you define pipeline logic once and call it across many repositories, so a change to the shared workflow propagates to every caller and cross-repo maintenance drops.
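The effect of `needs` on execution order can be modelled as a dependency graph. The sketch below groups jobs into waves that could run in parallel; it is a simplification, since a real scheduler may start a job as soon as its own dependencies finish rather than waiting for a whole wave, and the job names are hypothetical.

```python
def execution_waves(needs: dict[str, list[str]]) -> list[list[str]]:
    """Group jobs into waves; every job in a wave can run in parallel."""
    remaining = {job: set(deps) for job, deps in needs.items()}
    waves: list[list[str]] = []
    done: set[str] = set()
    while remaining:
        # A job is ready once everything it needs has finished.
        ready = sorted(job for job, deps in remaining.items() if deps <= done)
        if not ready:
            raise ValueError("cycle or unknown dependency in needs")
        waves.append(ready)
        done.update(ready)
        for job in ready:
            del remaining[job]
    return waves


# Lint and unit tests fan out, build fans in, deploy follows.
pipeline = {
    "lint": [],
    "unit": [],
    "build": ["lint", "unit"],
    "deploy": ["build"],
}
waves = execution_waves(pipeline)
```

The number of waves is a rough lower bound on pipeline length in sequential steps, so removing an unnecessary `needs` edge is one of the cheapest ways to shorten lead time.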

Caching extends past the pipeline itself. The same instinct — do not recompute what you can safely reuse — applies to the running application, where well-designed caching strategies in production shift load off origin services and shorten the feedback loop between a deploy and a measurable result.

08 · Supply Chain · The pipeline is now a high-value target.

Most pipeline-design guides stop at speed and stability and skip security, which is precisely backwards for 2026. CI/CD runners hold credentials to source, registries, and production, which makes them an attractive target; research from GitGuardian suggests that, in a recent supply chain attack, a majority of the compromised machines were CI/CD runners. Whatever the exact share, the structural lesson holds: the pipeline is one of the highest-risk places in the organization for secrets exposure, and it should be designed as such.

The most important single change is to stop storing long-lived credentials in the pipeline at all. OIDC-based authentication lets a runner exchange a short-lived, workflow-scoped OIDC token for temporary cloud credentials at runtime, so there are no static secrets to rotate, audit, or leak. GitHub Actions, GitLab CI, and most modern CI platforms support the pattern. On top of that, run the security scanners as ordinary CI jobs: a typical GitLab stack uses Semgrep for static analysis (SAST), Gitleaks for secret detection, Trivy for container scanning, and OWASP ZAP for dynamic analysis (DAST). Of these, Trivy can also emit a CycloneDX software bill of materials.
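On the cloud side, the OIDC exchange comes down to a trust policy that inspects the token's claims before issuing temporary credentials. The sketch below shows that matching step only. It assumes the token's signature and expiry have already been verified elsewhere, and the audience value and repository name are hypothetical.

```python
# Issuer used by GitHub Actions OIDC tokens.
GITHUB_ISSUER = "https://token.actions.githubusercontent.com"


def trust_policy_allows(claims: dict, audience: str,
                        allowed_subjects: set[str]) -> bool:
    """Decide whether already-verified token claims may assume a role.

    This does NOT verify the token signature or expiry; a real policy
    engine must do that before any claim is trusted.
    """
    return (
        claims.get("iss") == GITHUB_ISSUER
        and claims.get("aud") == audience
        and claims.get("sub") in allowed_subjects
    )


# Only the main branch of one hypothetical repository may deploy.
allowed = {"repo:example-org/example-app:ref:refs/heads/main"}

main_branch = {
    "iss": GITHUB_ISSUER,
    "aud": "example-cloud",
    "sub": "repo:example-org/example-app:ref:refs/heads/main",
}
feature_branch = dict(main_branch,
                      sub="repo:example-org/example-app:ref:refs/heads/feature-x")
```

Scoping the subject to a specific repository and ref is what makes the credential workflow-scoped: a token minted for another branch or repository fails the same check.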

The bar for high-assurance environments is now explicit. SLSA Build Level 3 — non-falsifiable provenance produced by a hardened build platform — is the reference standard for federal procurement, and a signed CycloneDX or SPDX SBOM attached to every artifact is the matching supply chain control. Even if you are not selling to government, those two practices — provenance plus a signed bill of materials on every build — are the direction the whole industry is moving, and they fold naturally into the build-once-promote-always discipline from the previous section. Designing a pipeline that satisfies all of this without slowing teams down is exactly the kind of work our web development and engineering engagements handle end to end.

Security baseline for any 2026 pipeline

Three controls cover most of the risk: replace static credentials with OIDC short-lived tokens; run SAST, secret-scanning, container, and DAST checks as CI jobs that can fail the build; and attach a signed SBOM with build provenance to every artifact. None of these slow a well-designed pipeline meaningfully, and each one removes a category of incident rather than a single bug.
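To make the third control concrete, the sketch below assembles a bare-bones CycloneDX-style JSON document that ties a component list to the digest of the artifact it describes. It is illustrative only: the output is unsigned and not schema-validated, the artifact name and component list are hypothetical, and in practice an SBOM generator and a signing tool would produce and attest this file.

```python
import hashlib
import json


def minimal_sbom(components: list[dict], artifact: bytes) -> str:
    """Serialize a bare-bones CycloneDX-style document for one artifact."""
    bom = {
        "bomFormat": "CycloneDX",
        "specVersion": "1.5",
        "version": 1,
        "metadata": {
            "component": {
                "type": "application",
                "name": "example-service",  # hypothetical artifact name
                "hashes": [
                    {
                        "alg": "SHA-256",
                        "content": hashlib.sha256(artifact).hexdigest(),
                    }
                ],
            }
        },
        "components": [
            {"type": "library", "name": c["name"], "version": c["version"]}
            for c in components
        ],
    }
    return json.dumps(bom, indent=2, sort_keys=True)


sbom_json = minimal_sbom(
    [{"name": "example-lib", "version": "1.2.3"}],
    artifact=b"compiled application bytes",
)
```

Recording the artifact digest inside the SBOM links it to the same immutable identifier used for promotion, so the bill of materials describes the exact build that ships.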

09 · Conclusion · A pipeline is a system, not a script.

The shape of CI/CD in 2026

Throughput and stability are won together, or not at all.

The throughline of every section here is the same: CI/CD in 2026 is a system-design problem, not a tooling problem. The four DORA metrics give you a balanced scorecard that refuses to let you trade speed for safety. Trunk-based development with feature flags keeps the codebase releasable while decoupling deploy from release. A clear-eyed deployment strategy contains the blast radius of any bad change. And artifact immutability plus supply chain controls make the whole thing trustworthy, not just fast.

The most useful reframing comes from the 2025 DORA report itself: AI and tooling are amplifiers of the underlying system, not substitutes for it. A team with sound trunk-based flow, fast tests, immutable artifacts, and safe rollout will get compounding returns from better tools; a team without those fundamentals will mostly amplify its existing problems. The order of operations is the system first, the tooling second.

If you are starting from a weekly-deploy, multi-day-lead-time baseline, do not try to leap to elite in one quarter. Halve the batch size, automate one manual gate, add a fast rollback path, move secrets to OIDC, and then read the four metrics again. The distance from medium to high is covered by a handful of disciplined, compounding changes — and that, not a heroic rewrite, is what a well-designed 2026 pipeline actually looks like.

FAQ · CI/CD pipeline design

The questions engineering leads ask every week.

What are the four DORA metrics?

DORA (DevOps Research and Assessment, a research program under Google Cloud) identifies four metrics that predict software delivery performance. Two measure throughput: deployment frequency (how often you successfully release to production) and lead time for changes (the elapsed time from a code commit to that change running in production). Two measure stability: change failure rate (the percentage of deployments that cause a production failure requiring remediation) and mean time to restore (how long it takes to recover service after an incident). They are treated as a set because any single metric is gameable in isolation; the point is to move throughput and stability together rather than trade one against the other.