Marketing · Methodology · 12 min read · Published June 8, 2026

Causal lift over correlation · 0.70x branded-search iROAS · 3 test designs that cover almost every channel

Incrementality Testing: Proving Ads Actually Caused Sales

Last-click attribution and even marketing mix modeling overstate paid impact because they cannot separate organic demand from ad-driven demand. Incrementality testing answers the only question that matters for budget decisions: what would have happened anyway? This is a practical guide to causal lift, the test designs that prove it, and the benchmark that should change how you think about branded search.

Digital Applied Team
Senior strategists · Published Jun 8, 2026
Sources: 12 cited
US marketers now testing · 52% · use incrementality testing
Branded-search median iROAS · 0.70x · lowest of any channel tested, below breakeven
Say measurement falls short · 75% · lack speed, accuracy, or trust
Rank #1 retail-media KPI · 71% · of advertisers surveyed

Incrementality testing is the practice of running a controlled experiment to prove that your paid media actually caused sales, rather than just appearing alongside conversions that would have happened anyway. It is the only measurement method built on causation instead of correlation, and that distinction is why marketing teams are adopting it faster than any other measurement discipline.

The reason it matters now is that the old measurement stack is failing on its own terms. Last-click attribution takes credit for customers who were already on a path to convert. Marketing mix modeling is correlation-based and can overstate a channel by multiples. Multi-touch attribution has been quietly broken by cookie deprecation, walled gardens, and Apple's App Tracking Transparency. According to the IAB and BWG Global State of Data 2026, roughly three in four marketers say their measurement systems lack the speed, accuracy, or trust they need.

This guide covers what incrementality actually measures, the counterintuitive benchmark that should reframe how you treat branded search, the three test designs that cover almost every channel, the tooling that has made experiments affordable for mid-market brands, and the single biggest reason most tests fail before they start. Every figure is sourced and, where it comes from a vendor's own dataset, labeled as such.

Key takeaways
  1. Attribution and MMM overstate paid impact. Last-click credits users already converting; MMM is correlation-based. Only a controlled experiment isolates the causal lift your ads actually produced.
  2. Branded search is the surprise underperformer. Across one vendor's 225 DTC geo tests, branded search posted a median iROAS of just 0.70x, the lowest of any channel, reflecting heavy cannibalization of organic clicks.
  3. Three designs cover almost every channel. Geo / matched-market for geo-targetable media, platform audience holdouts for walled gardens, and ghost-ads / conversion-lift for native platform tooling. Choose by data access, not convenience.
  4. Adoption is mainstream and testing got cheaper. About 52% of US brand and agency marketers now run incrementality tests, and Google reports cutting minimum test budgets from roughly $100,000 to $5,000 (a vendor claim, not independently audited).
  5. Pre-test fit, not budget, predicts success. In the same dataset only a minority of tests met the tight pre-test fit criteria, and most setups that miss those thresholds produce inconclusive results regardless of spend.

01 · Why Now · The measurement stack broke, and everyone knows it.

Three forces converged to push incrementality from niche to mainstream. First, signal loss: third-party cookies, walled-garden data limits, and App Tracking Transparency gutted the user-level tracking that multi-touch attribution depended on. Measured, a measurement vendor, now describes MTA as no longer feasible for most brands. Second, distrust: platform-reported conversions routinely diverge from site-side analytics and actual sales, so finance teams stopped accepting dashboard ROAS at face value. Third, accessibility: the statistical machinery for clean experiments became cheap enough for mid-market brands to run, not just enterprises.

The adoption numbers track that shift. EMARKETER, citing a July 2025 TransUnion survey, reports that about 52% of US brand and agency marketers now use incrementality testing — up from niche status just two years earlier. Roughly 36% plan to increase their incrementality investment over the next year, and in retail media specifically the ANA found 71% of advertisers now rank incrementality as their single most important KPI. This is no longer an experimentation hobby for data-science teams; it is becoming the default standard of proof.

The state of measurement, 2026
Per the IAB / BWG Global State of Data 2026, roughly 75% of marketers say their measurement systems lack the speed, accuracy, or trust they need. EMARKETER's reading of the July 2025 TransUnion survey found that accuracy concerns, cross-channel application, and insufficient tools are the top barriers, which is exactly the gap a controlled experiment is designed to close.

02 · The Definition · Incrementality is the counterfactual, made measurable.

Incrementality is the share of conversions that happened because of the ad, not merely alongside it. The cleanest way to think about it is a single counterfactual question, framed by Haus, a measurement firm: what was that group going to do anyway? You create two comparable groups, expose one to the ad and withhold it from the other, and the difference in outcomes is the lift the ad caused. Everything else — last-click credit, view-through windows, platform ROAS — is an estimate of association, not causation.

The arithmetic is deliberately simple. Measured expresses incrementality as the test group's conversion rate minus the control group's conversion rate, divided by the test rate. If the exposed group converts at 1.5% and the held-out group converts at 0.5%, then (1.5% − 0.5%) / 1.5% works out to roughly 66.7% incrementality — meaning about a third of those conversions would have occurred without the ad. That is the number a last-click dashboard silently hands back to the channel as if the ad earned it.
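As a quick check on that arithmetic, here is a minimal sketch of the formula as the paragraph states it. The function name `incrementality` is ours for illustration, not something Measured publishes.

```python
def incrementality(test_rate: float, control_rate: float) -> float:
    """Share of test-group conversions attributable to the ad.

    Formula as described in the text:
    (test conversion rate - control conversion rate) / test conversion rate.
    """
    if test_rate <= 0:
        raise ValueError("test_rate must be positive")
    return (test_rate - control_rate) / test_rate


# Worked example from the text: 1.5% exposed vs 0.5% held out.
share = incrementality(0.015, 0.005)
print(f"incremental share: {share:.1%}")                # incremental share: 66.7%
print(f"would have converted anyway: {1 - share:.1%}")  # would have converted anyway: 33.3%
```

A result of zero means the exposed and held-out groups converted at the same rate, so the ad added nothing measurable.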

"What was that group going to do anyway? That's fundamentally what we mean when we talk about incrementality testing."
Chief Strategy Officer, Haus

This is also why incrementality is structurally more durable than the attribution methods it supplements. A holdout experiment does not depend on cookies, device IDs, or cross-domain stitching — it only needs two comparable groups and an outcome you can measure. That durability is the reason signal loss erodes multi-touch attribution but leaves incrementality testing largely untouched, a point worth weighing alongside any first-party data and server-side tracking strategy you are building to survive the same signal loss.

Most advertisers assume branded search is their highest-ROI, lowest-risk channel. A customer types your brand name, your ad appears, they click, they buy — the last-click ROAS looks spectacular. But that is precisely the scenario where the ad is most likely to be claiming credit for a sale that was already going to happen. The user came looking for you. Many would have clicked the organic result sitting right below the ad.

The benchmark data makes that concrete. In a published dataset of 225 DTC geo-based incrementality tests run between August 2024 and December 2025, Stella reported a median incremental return on ad spend (iROAS) of just 0.70x for branded search — the lowest of any channel tested, and below the 1.0x breakeven line. The same dataset put the full-portfolio median at 2.31x, so branded search was not merely average; it was the clear outlier on the downside. The interpretation offered is heavy cannibalization of organic traffic the brand was already winning for free.
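To make the breakeven line concrete, the sketch below uses the conventional definition of iROAS: incremental revenue divided by ad spend. The source does not spell the formula out, so treat that definition as our assumption, and note that every dollar figure here is hypothetical and chosen only to land on 0.70x.

```python
def iroas(test_revenue: float, counterfactual_revenue: float, spend: float) -> float:
    """Incremental return on ad spend: revenue the ads added, per dollar spent.

    counterfactual_revenue is what the experiment estimates would have
    been earned with the ads withheld.
    """
    if spend <= 0:
        raise ValueError("spend must be positive")
    return (test_revenue - counterfactual_revenue) / spend


# Hypothetical branded-search test: $10,000 spend, $57,000 observed revenue,
# $50,000 estimated without the ads.
value = iroas(test_revenue=57_000.0, counterfactual_revenue=50_000.0, spend=10_000.0)
print(f"iROAS: {value:.2f}x")  # iROAS: 0.70x
print("below breakeven" if value < 1.0 else "at or above breakeven")
```

In this illustration a last-click dashboard could still credit the channel with far more than the $7,000 of revenue the ads actually added, and that gap is what the test exposes.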

Read this benchmark carefully
These iROAS figures come from a single vendor's dataset of self-selected, measurement-sophisticated DTC and Shopify advertisers, US only. They are not a cross-industry average — treat them as directional, discount for your context, and validate with your own test before reallocating budget. A 0.70x reading also does not automatically mean shut branded search off: defending your branded SERP from competitors who bid on your name can carry strategic value a pure iROAS number will not capture.

The practical move is not to kill branded search on sight; it is to test it first. Because branded search is the channel where last-click attribution and true incrementality diverge most violently, it offers the largest potential budget correction per test dollar spent. Run a geo or audience holdout, measure the genuine lift, and right-size the spend to the defensive value you actually need — rather than to a dashboard number that was never causal in the first place.

04 · Priority Guide · Where attribution and incrementality diverge most.

The reason to test channels in a deliberate order is that attribution error is not uniform. On some channels, platform-reported ROAS and true incremental ROAS sit close together; on others they pull apart by multiples. The figures below show median iROAS by channel from the same 225-test DTC dataset. The pattern is the actionable part: click-attributed channels such as branded search, at the bottom of the ranking, tend to be over-credited, while upper-funnel channels like CTV, at the top, are often under-credited by platform attribution.

Median incremental iROAS by channel · DTC geo tests

Source: Stella 2025 DTC incrementality benchmarks (225 tests, vendor-stated, US DTC only)

| Channel | N | Median iROAS | Note |
|---|---|---|---|
| Tatari CTV | 18 | 3.30x | Understated by attribution |
| Performance Max | 38 | 2.98x | |
| Meta / Facebook | 63 | 2.92x | Most consistent channel |
| Full-portfolio median | All 225 tests, 8 channels | 2.31x | |
| YouTube | 24 | 2.17x | |
| Google Shopping | 16 | 1.86x | |
| Google Non-Branded Search | 31 | 1.46x | |
| TikTok | 10 | 0.94x | Highest variance, treat with caution |
| Google Branded Search | 17 | 0.70x | Lowest of any channel |

Two figures on this chart deserve a caveat in bold. TikTok's 0.94x median sits on the smallest, most volatile sample in the set — it carried by far the highest variance of any channel, so it should be read as a flag to test, not as a settled benchmark. And the full-portfolio median of 2.31x is a DTC figure from sophisticated advertisers; a broad cross-industry portfolio would very plausibly land lower. The structural lesson survives the caveats: test the channels where your platform-reported ROAS and your gut-feel confidence are both highest, because that is where the correction is largest.

Test first · Branded search & high-confidence click channels
These are most likely to show cannibalization, so the gap between dashboard ROAS and true lift is widest. Branded search posted a 0.70x median; testing it first yields the biggest budget correction per dollar.
Highest test priority

Test for the budget case · Upper-funnel & CTV
Platform attribution tends to understate these channels because conversions land later and across devices. A clean lift test is often what unlocks the budget argument for them, rather than what cuts it.
Test to defend spend

Test with caution · TikTok & emerging social
High measured variance means a single test can mislead. Treat early readings as directional, run multiple periods, and avoid making large reallocations off one inconclusive result.
Multi-period reads

Lower urgency · Channels already aligned
Where platform attribution and incrementality sit close (Meta was the most consistent channel in the dataset), the marginal value of a test is smaller. Confirm periodically rather than continuously.
Confirm, don't obsess

If you want the upstream decision — when to reach for attribution versus MMM versus a controlled experiment in the first place — we map that tradeoff in our decision matrix for attribution, MMM, and incrementality. This section assumes you have already decided an experiment is warranted and are choosing where to point it.

05 · Test Designs · Three designs, chosen by data access, not convenience.

Nearly every incrementality test reduces to one of three designs. The choice is dictated by what you can control and observe, not by which is easiest to set up. Geo / matched-market testing fits any channel you can target geographically — TV, out-of-home, radio, CTV, or regionally bought digital. Platform audience holdouts fit walled-garden channels with a strong identity graph, where the platform itself can randomly withhold ads from a slice of your target audience. Ghost ads, also called ghost bidding, fit platforms that can flag the auction moments where your ad would have served and withhold it without charging you for control impressions.
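The audience-holdout idea can be illustrated with a small sketch of random, repeatable assignment. On walled-garden platforms the platform performs this split itself; the code below only shows the principle for an audience you control, and the names `assign_cell` and `brand-search-q3` are hypothetical.

```python
import hashlib


def assign_cell(user_id: str, experiment: str, holdout_share: float = 0.10) -> str:
    """Deterministically place a user in the 'holdout' or 'exposed' cell.

    Hashing the experiment name together with the user ID gives a stable,
    effectively random bucket, so the same user always lands in the same cell.
    """
    if not 0.0 < holdout_share < 1.0:
        raise ValueError("holdout_share must be between 0 and 1")
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "holdout" if bucket < holdout_share else "exposed"


users = [f"user-{i}" for i in range(10_000)]
cells = [assign_cell(u, "brand-search-q3") for u in users]
# The observed share should sit close to the 10% target, not exactly on it.
print(f"holdout share: {cells.count('holdout') / len(cells):.3f}")
```

Salting the hash with the experiment name means a user held out of one test is not automatically held out of the next, which keeps successive tests independent of each other.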

The table below is our consolidated decision matrix. Most published guides explain one design in isolation; the value here is the honest "requires" and "key risk" columns side by side, so you can rule designs in or out before spending a cent.

Incrementality test design decision matrix: best-fit use case, requirements, key risk, and platforms for geo, audience holdout, ghost ads, and channel shutoff designs.
| Test design | Best for | Requires | Key risk | Platforms |
|---|---|---|---|---|
| Geo / matched-market | TV, OOH, radio, CTV (any geo-targetable channel) | 10–20 matched market pairs; several weeks of baseline data | Geo contamination (spillover); seasonal timing | Any (publisher-agnostic) |
| Platform audience holdout | Walled-garden social and search with an identity graph | Platform holdout feature; large cells; a 5–20% holdout group | Optimizer may treat holdout cells differently | Meta, Google, TikTok, Amazon DSP |
| Ghost ads / conversion lift | Mid-funnel display/programmatic; native lift studies | DSP ghost-bid capability or a platform lift tool | Not universally available; auction dynamics can bias results | Meta Lift, Google Conversion Lift, select DSPs |
| Channel shutoff / time-based | Quick directional reads; lower-budget scenarios | Clean pre/post periods with no major external changes | Confounders (seasonality, competitors) hard to isolate | Any |

Sizing matters as much as design. Industry synthesis from Cometly suggests geo-based tests typically need 10–20 markets per group to detect a 10% lift at 80% statistical power, while audience holdouts generally want at least 1,000 conversions in the exposed group, or on the order of 10,000+ users per cell. Most tests run two to eight weeks. A test sized too small cannot detect a modest but real lift, which produces a false "no effect" result — one of the most expensive mistakes in measurement, because it can talk you out of a channel that was actually working.
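For user-level holdouts, the sizing logic can be sketched with the textbook sample-size formula for comparing two conversion rates. This is a generic approximation, not Cometly's method, and it does not apply to geo tests, where the unit is a market rather than a user. The 2% baseline rate in the example is a hypothetical input.

```python
import math
from statistics import NormalDist


def users_per_cell(baseline_rate: float, relative_lift: float,
                   alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate users needed per cell for a two-sided two-proportion test."""
    p_control = baseline_rate
    p_test = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    variance = p_control * (1 - p_control) + p_test * (1 - p_test)
    n = (z_alpha + z_beta) ** 2 * variance / (p_test - p_control) ** 2
    return math.ceil(n)


# Hypothetical 2% baseline conversion rate, two different expected lifts.
for lift in (0.10, 0.30):
    print(f"{lift:.0%} lift -> {users_per_cell(0.02, lift):,} users per cell")
```

The required cell size grows sharply as the expected lift shrinks, which is why an undersized test so often returns a false "no effect."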

Ghost ads, in one line
Per Tinuiti, ghost ads avoid paying for control impressions by flagging the auction moments where your ad would have served and withholding it — no public-service-announcement placebo spend required. The catch is availability: ghost-ad capability is not offered across every platform, so it cannot be your default everywhere.

06 · Tooling & Access · The tools that made experiments affordable.

For most of the last decade, rigorous geo experiments were an enterprise-only line item. That has changed on two fronts: open-source statistical tooling and falling minimum budgets. Meta open-sourced GeoLift, which builds a synthetic counterfactual from historical pre-treatment data across untreated geographies using augmented synthetic control and generalized synthetic control methods — notably without requiring any user-level tracking. Google previewed Meridian GeoX at Google Marketing Live on May 5, 2026, an open-source, publisher-agnostic geo design that pairs time-based regression with stratified sampling and supports holdback, go-dark, and heavy-up tests; per Search Engine Journal's coverage, testing was slated to begin later in 2026, so frame it as announced, not yet available.
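The counterfactual idea behind geo tools can be shown with a deliberately simple stand-in. The sketch below is not GeoLift's augmented or generalized synthetic control; it swaps in a single pre-period scaling ratio between treated and control markets, and all sales figures are hypothetical.

```python
def scaled_control_lift(treated_pre, control_pre, treated_post, control_post):
    """Estimate lift in treated geos using scaled control geos as the counterfactual.

    The pre-period ratio of treated to control sales is assumed to hold in
    the test period, which is a much cruder assumption than synthetic control.
    """
    if len(treated_pre) != len(control_pre) or len(treated_post) != len(control_post):
        raise ValueError("treated and control series must cover the same periods")
    scale = sum(treated_pre) / sum(control_pre)
    counterfactual = [scale * c for c in control_post]
    lift = sum(treated_post) - sum(counterfactual)
    return lift, counterfactual


# Hypothetical weekly sales: four baseline weeks, then two test weeks.
treated_pre = [100.0, 110.0, 90.0, 100.0]
control_pre = [200.0, 220.0, 180.0, 200.0]
treated_post = [120.0, 126.0]
control_post = [210.0, 220.0]

lift, counterfactual = scaled_control_lift(treated_pre, control_pre, treated_post, control_post)
print(counterfactual)  # [105.0, 110.0]
print(lift)            # 31.0
```

Synthetic control methods replace the single ratio with weights fitted across many untreated geographies, which is what lets them track the treated markets more closely before the test begins.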

On budgets, EMARKETER reports that Google cut the minimum spend for its incrementality tools from roughly $100,000 to about $5,000 through Bayesian modeling improvements — a vendor-stated figure that has not been independently audited, but a directionally meaningful sign of democratization for mid-market brands. Independent vendors are moving the same way: Recast launched a GeoLift product in September 2025 specifically to validate and calibrate MMM outputs against real experiments.

Open-source · Meta · Synthetic-control geo testing · GeoLift
Builds a counterfactual from untreated geographies using ASCM and GSC methods, without user-level tracking. Published openly on GitHub as facebookincubator/GeoLift.
No user-level tracking

Announced · Google · Meridian GeoX preview · May 5
Open-source, publisher-agnostic geo design pairing time-based regression with stratified sampling; supports holdback, go-dark, and heavy-up. Testing slated for later in 2026: announced, not yet live.
Frame as announced

Vendor-stated · Lowered minimum budget · $5K
EMARKETER reports Google reduced its incrementality test minimum from roughly $100,000 to about $5,000 via Bayesian improvements. Not independently audited; read as directional democratization.
from ~$100K

Platform-native lift studies round out the toolkit. Meta's Conversion Lift, per vendor guides, draws a holdout group from the target audience — commonly cited in the 5–20% range — and randomly withholds campaign exposure across Facebook, Instagram, and Messenger, surfacing results only once a significance test clears. The practical takeaway is that you no longer need a six-figure budget or an in-house econometrics team to run a credible experiment; you need the right design for your data access and the discipline to size it correctly. If standing up testing infrastructure and a measurement cadence is the bottleneck, that is the kind of program our analytics and measurement engagements are built to operationalize.
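The source does not say which significance test the platform applies, so the sketch below assumes a generic two-proportion z-test purely to show what "clearing significance" means for a holdout read. The conversion counts are hypothetical, with a 10% holdout.

```python
import math
from statistics import NormalDist


def lift_significance(test_conv: int, test_n: int, control_conv: int, control_n: int):
    """Two-sided two-proportion z-test on exposed vs holdout conversion rates."""
    p_test = test_conv / test_n
    p_control = control_conv / control_n
    pooled = (test_conv + control_conv) / (test_n + control_n)
    se = math.sqrt(pooled * (1 - pooled) * (1 / test_n + 1 / control_n))
    if se == 0:
        raise ValueError("no variation in outcomes; the test is undefined")
    z = (p_test - p_control) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical read: 90,000 exposed users (2.0% convert), 10,000 held out (1.5% convert).
z, p_value = lift_significance(1_800, 90_000, 150, 10_000)
print(f"z = {z:.2f}, p = {p_value:.4f}")
print("significant at 5%" if p_value < 0.05 else "inconclusive")
```

Because the holdout is the smaller cell, its size dominates the standard error, which is one reason very small holdout percentages tend to produce inconclusive reads.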

07 · Why Tests Fail · Most tests are set up to fail before they begin.

The uncomfortable finding buried in the benchmark data is that the biggest predictor of a conclusive test is not budget or duration — it is pre-test fit. A test depends on a model that can accurately predict the control group's behavior in the absence of the ad. If that baseline model is weak, no amount of spend or runtime will rescue the result. In the Stella dataset, only a minority of tests met the tight pre-test fit criteria the authors defined, and tests that missed those thresholds drove the high rate of inconclusive results.

This reframes how to budget a test program. Before launching, the decisive work is validating that your control can be predicted well enough — checking forecast error and goodness-of-fit on a holdout period — rather than negotiating for a larger media budget. A well-fit, modestly funded test will beat a poorly-fit, generously funded one every time. That is the single most useful and least discussed lesson in the discipline.

"Pre-test fit quality predicts success more than budget or duration: tests with MAPE < 0.15 AND R² 0.85–0.94 reach 100% statistical significance."
Source: Stella 2025 DTC incrementality benchmarks
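A pre-test fit check along these lines can be sketched in a few lines. The thresholds are the ones quoted from the Stella benchmarks; treating the R² range as inclusive is our assumption, and the helper names and sample series are hypothetical.

```python
def mape(actual, predicted):
    """Mean absolute percentage error, expressed as a fraction (0.15 = 15%)."""
    return sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


def r_squared(actual, predicted):
    """Coefficient of determination of the baseline forecast."""
    mean = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


def passes_pretest_fit(actual, predicted, max_mape=0.15, r2_range=(0.85, 0.94)):
    """Apply the quoted thresholds: MAPE below 0.15 and R² within 0.85 to 0.94."""
    low, high = r2_range
    return mape(actual, predicted) < max_mape and low <= r_squared(actual, predicted) <= high


# Hypothetical holdout-period sales versus the baseline model's forecast.
actual = [100.0, 120.0, 90.0, 110.0, 130.0]
predicted = [104.0, 115.0, 94.0, 108.0, 124.0]

m, r2 = mape(actual, predicted), r_squared(actual, predicted)
print(f"MAPE={m:.3f}, R2={r2:.3f}")           # MAPE=0.038, R2=0.903
print(passes_pretest_fit(actual, predicted))  # True
```

Running a check like this on a holdout period before launch costs nothing in media spend, which is the point of the section: fit is cheaper to validate than budget is to raise.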
Pitfall 01 · Underpowered design · Too few markets / conversions
A test sized below the threshold to detect your expected lift returns a false 'no effect.' Confirm you have enough markets or conversions for the lift you hope to find before launch.
10–20 markets · 1K+ conversions

Pitfall 02 · Poor pre-test fit · Weak control model
If your baseline cannot predict control behavior, the lift estimate is noise. Validate forecast error and fit on a holdout period first; this matters more than budget.
The #1 cause of failure

Pitfall 03 · Single-read overconfidence · One test, one verdict
Confidence builds across multiple test periods, not a single reading. Treat one result as directional, especially on high-variance channels, and re-run before reallocating major budget.
Multiple periods

Looking ahead, the trajectory is clear: as open-source geo tools mature and minimum budgets keep falling, the bottleneck will shift from access to discipline. The brands that win will not be the ones that run the most tests, but the ones that size and fit them properly and let results compound across periods. Expect "did you test it?" to become the default first question in budget reviews the way "what is the ROAS?" was a decade ago — and expect a lot of comfortable assumptions about high-performing channels to not survive the experiment.

08 · The Synthesis · Incrementality is the anchor, not the whole system.

Incrementality testing does not replace your entire measurement stack; it grounds it. The modern best practice, as Measured frames it, is triangulation: use incrementality experiments as the causal ground truth, use marketing mix modeling for always-on portfolio coverage and offline channels, and use platform attribution for tactical, real-time optimization signals. Each method covers the others' blind spots — experiments are precise but episodic, MMM is comprehensive but correlational, attribution is fast but biased.

The connective tissue is calibration. You run experiments periodically, then use the causal results to recalibrate your MMM and sanity-check your attribution. That is exactly why vendors like Recast built a geo-testing product to validate MMM outputs against real lift — an experiment is the only thing that can tell a correlational model whether it is right. If you want the broader case for why MMM alone cannot prove causation, and the parallel argument for how multi-touch attribution misses true impact, those companion guides go deeper on each leg of the triangle.

The triangulated stack
Measured's recommended pattern: incrementality as causal ground truth, MMM for portfolio coverage and offline channels, and platform attribution for real-time tactical signals. No single method is sufficient — and the experiment is the one that keeps the other two honest.

09 · Conclusion · Stop measuring association. Start measuring cause.

The shape of measurement, June 2026

The only number that survives a budget review is the one an experiment produced.

Incrementality testing has crossed from specialist tool to default standard of proof because the alternatives stopped being trustworthy. Last-click takes credit for demand it did not create, MMM correlates without proving, and multi-touch attribution is hollowed out by signal loss. A controlled experiment is the one method that answers the only question a CFO actually cares about: what would have happened anyway?

The branded-search benchmark is the reason to start now. A median iROAS of 0.70x — vendor-stated and DTC-specific, but directionally stark — tells you the channel most advertisers trust most is often the least incremental. That single finding, validated on your own accounts, can free meaningful budget for channels that are genuinely driving growth. Test the channels where your confidence and your dashboard ROAS are both highest first, because that is where the correction is largest.

The discipline is not complicated, but it is exacting. Choose the design your data access allows, size it to the lift you expect to find, validate your pre-test fit before spending a dollar, and let results compound across periods rather than betting on a single read. Then fold the causal truth back into your MMM and your attribution so the whole stack tells one honest story. Do that, and you stop arguing about which dashboard to believe — and start knowing what your advertising actually caused.

FAQ · Incrementality testing

The questions we get every week.

Incrementality testing is a controlled experiment that measures the share of conversions caused by an ad rather than conversions that would have happened anyway. You compare a group exposed to the advertising against a comparable held-out group that is not, and the difference in outcomes is the causal lift. It is the only measurement method built on causation instead of correlation. A simple version of the math, as expressed by Measured, is the test group's conversion rate minus the control group's conversion rate, divided by the test rate: a 1.5% exposed rate against a 0.5% control rate works out to roughly 66.7% incrementality, meaning about a third of those conversions were not actually driven by the ad.