A marketing data pipeline is the plumbing that moves customer data from where it is created — ad platforms, your CRM, web and product events — into a warehouse where it can be modeled, then back out to the tools where marketers actually act on it. In 2026 the dominant pattern is composable: ingestion, transformation, and activation are three separable layers, and you can buy, build, or self-host each one independently.

What changed this year is consolidation. Fivetran and dbt Labs completed their all-stock merger on June 1, 2026 (first announced October 13, 2025), and combined with Fivetran’s May 2025 acquisition of Census, the ingestion, transformation, and activation layers are now available from one vendor for the first time. That makes the build-versus-buy decision sharper, not simpler.

This guide walks the four-stage ELT flow end to end, gives a layer-by-layer decision matrix for marketing data teams, draws a clear line between this pipeline infrastructure and the packaged-CDP product that sits above it, and flags the real-time gap that most warehouse-first guides quietly omit. Every figure here is sourced; vendor-stated numbers are labeled as such.

Key takeaways

01
The stack is three ownable layers, not one product. Ingestion (extract + load), transformation (modeling, identity, quality), and activation (reverse ETL) are composable. You can buy connectors, own the model layer, and self-host or buy activation independently.
02
ELT replaced ETL because schemas churn. Legacy ETL transformed data before loading and required rigid schemas. Modern ELT lands raw data in the warehouse first, then transforms with SQL, which fits high-cardinality, schema-churning ad-platform data far better.
03
Fivetran and dbt Labs merged on June 1, 2026. The combined entity reports serving 100,000+ data teams. With Census already acquired in May 2025, the ingestion-to-activation stack now ships from a single vendor, raising the lock-in question for the whole pipeline.
04
The default recommendation: buy connectors, own the model. For most marketing teams the pragmatic split is to buy managed connectors, own the dbt model layer in version control, and activate via reverse ETL, keeping the warehouse as the system of record.
05
The warehouse-first model is batch by default. Reverse ETL syncs on a schedule. For sub-minute activation (abandoned-cart triggers, real-time bid enrichment), teams typically add a streaming sidecar (Kafka, Pub/Sub, Kinesis) alongside the warehouse pipeline.

01 — The Flow

Four stages: extract, load, transform, activate.

The modern marketing data pipeline follows a four-stage ELT flow. Extract pulls raw data from ad platforms, CRMs, and web events. Load lands that raw data in a warehouse. Transform runs dbt models, identity resolution, and data-quality checks against the loaded data. Activate uses reverse ETL to push the modeled results back out to your CRM and ad destinations.

The ordering matters. Legacy ETL transformed data before loading it, which demanded a rigid schema defined up front and broke every time a source API changed. ELT inverts that: load first, then transform in the warehouse with SQL, so schema changes become a modeling problem you handle in version control rather than a pipeline outage. That inversion is why the three layers below are cleanly separable.
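To make the load-first ordering concrete, here is a minimal sketch that uses an in-memory SQLite database as a stand-in for the warehouse. The table names (`raw_ads`, `stg_ad_spend`), the sample records, and the `json_get` helper are all assumptions made for illustration; a real warehouse has its own JSON functions. The point it shows is that a new source field (`new_field`) does not break the load, because the raw payload is stored untouched and only the transform decides which fields matter.

```python
import json
import sqlite3


def connect():
    """Open an in-memory SQLite database standing in for the warehouse."""
    conn = sqlite3.connect(":memory:")
    # json_get is a hypothetical helper registered for this sketch only;
    # real warehouses ship their own JSON functions under other names.
    conn.create_function(
        "json_get", 2, lambda payload, key: json.loads(payload).get(key)
    )
    return conn


def load_raw(conn, records):
    """Load step: store every source record untouched as a JSON string."""
    conn.execute("CREATE TABLE IF NOT EXISTS raw_ads (payload TEXT)")
    conn.executemany(
        "INSERT INTO raw_ads (payload) VALUES (?)",
        [(json.dumps(record),) for record in records],
    )


def transform(conn):
    """Transform step: build a typed model from the raw payloads in SQL."""
    conn.execute("DROP TABLE IF EXISTS stg_ad_spend")
    conn.execute(
        """
        CREATE TABLE stg_ad_spend AS
        SELECT
            json_get(payload, 'campaign_id') AS campaign_id,
            CAST(json_get(payload, 'spend') AS REAL) AS spend
        FROM raw_ads
        """
    )
    return conn.execute(
        "SELECT campaign_id, SUM(spend) FROM stg_ad_spend "
        "GROUP BY campaign_id ORDER BY campaign_id"
    ).fetchall()


# Illustrative records: the second one carries a field the model never
# asked for, which is the kind of schema churn ad-platform APIs produce.
SAMPLE_RECORDS = [
    {"campaign_id": "c1", "spend": 10.0},
    {"campaign_id": "c1", "spend": 5.5, "new_field": "added by the API"},
    {"campaign_id": "c2", "spend": 2.0},
]
```

Calling `load_raw(conn, SAMPLE_RECORDS)` followed by `transform(conn)` on a connection from `connect()` returns spend per campaign. Picking up `new_field` later would be a change to the SQL in `transform`, not to the load step.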

Stage 01 · Extract + Load

Ingestion (ELT)

Fivetran · Airbyte · Stitch

Connectors pull raw records from ad platforms, CRMs, and event streams and land them in the warehouse untransformed. Buy managed connectors or self-host open-source ones — this is the most commonly bought layer.

Source → raw warehouse tables

Stage 02 · Transform

Modeling in SQL

dbt models · tests · identity resolution

dbt turns raw tables into clean, tested, documented models in the warehouse. This is the layer most teams keep in-house — it encodes your business logic and definitions.

Raw tables → trusted models

Stage 03 · Activate

Reverse ETL

Hightouch · Census · Fivetran Activations

Reverse ETL queries the warehouse and syncs modeled audiences back to CRMs, ad platforms, and ESPs — the warehouse stays the system of record. This is the activation layer of a composable CDP.

Models → CRM / ad destinations

Pipeline, not CDP

This guide is about the infrastructure beneath a customer data platform — the ETL, warehouse, dbt, and reverse-ETL layers — not the packaged-CDP product decision. A composable CDP is essentially the activation layer of this pipeline running on top of your warehouse. For the packaged-versus-composable product call, see our CDP build-vs-buy decision matrix.

02 — ELT vs ETL

Why ad-platform data is built for ELT.

Most pipeline guides treat every data source as equivalent. They aren’t. Ad-platform APIs change constantly (new fields, deprecated endpoints, shifting rate limits) and produce high-cardinality data with frequent schema churn. That profile is a poor fit for ETL’s rigid pre-load schema and an excellent fit for ELT, where you land the raw data first and absorb the churn in your transformation models. Offline or CSV uploads, by contrast, are stable and small enough that light pre-processing before landing is still reasonable.

The table below is our own mapping of common marketing data sources to the right pattern in 2026. It exists because the source-by-source nuance — schema stability, cardinality, latency needs — is what actually drives the ETL-versus-ELT call, and almost no published guide breaks it out this way.

ELT versus ETL pattern recommendation by marketing data source type, 2026
Source type	Schema profile	Recommended pattern
Ad-platform APIs	High cardinality, frequent schema churn, rate-limited	ELT — land raw, absorb churn in dbt models
CRM exports	Moderately stable, owner-defined fields	ELT — incremental loads, model in warehouse
Web event streams	Schema governed by a tracking plan, high volume	ELT — raw events to warehouse, sessionize in dbt
Product event streams	High volume, latency-sensitive use cases	ELT for analytics; streaming sidecar for real-time
Offline / CSV uploads	Small, stable, batch-delivered	Light pre-processing before load is acceptable

The practical takeaway: do not pick one pattern for the whole pipeline. Default to ELT for the high-churn, high-volume sources that dominate marketing data, and reserve pre-load processing for the small, stable, infrequent uploads where it genuinely simplifies downstream modeling. Most of your engineering pain will come from ad-platform connectors, which is exactly where managed services earn their keep.

03 — Ingestion & Warehouse

Connectors and the warehouse underneath them.

The ingestion layer is where most teams start buying. The two reference points are Fivetran and Airbyte. Fivetran offers 700+ connectors, all developed and maintained in-house by a team the company says numbers 600+ engineers, with usage-based pricing on Monthly Active Rows (MAR). In 2025 Fivetran shifted MAR billing from account-wide to per-connector, which removed bulk discounts and, per third-party comparisons, often increased costs for multi-source environments. Airbyte offers 600+ pre-built connectors plus a no-code Connector Builder that has reportedly produced 5,000+ community connectors; it is open-source when self-hosted and also available as the managed Airbyte Cloud — the two are not the same product.

Beneath the connectors sits the warehouse. BigQuery is the default for Google-centric marketing teams given native paths from GA4, Google Ads, Search Console, and Firebase; its published price is around $20 per TB per month for active storage on a serverless compute model. Snowflake separates storage and compute and is reported in the ~$23–40 per TB per month range for storage, billing compute separately in credits — which lets teams scale compute up for heavy transformation runs and back down afterward. Treat all of these as directional list prices and confirm current rates on each vendor’s pricing page before budgeting.
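As a quick worked example of the storage arithmetic, the sketch below applies the directional list prices quoted above to a hypothetical 5 TB footprint. The 5 TB volume is an assumption chosen for illustration, and the calculation covers storage only: compute, egress, and long-term storage discounts are left out.

```python
def monthly_storage_cost(terabytes, rate_low, rate_high=None):
    """Return a (low, high) monthly storage cost range in dollars."""
    if rate_high is None:
        rate_high = rate_low
    return (terabytes * rate_low, terabytes * rate_high)


# Rates are the per-TB-per-month figures quoted in the text above.
HYPOTHETICAL_TB = 5  # assumed volume, not a benchmark
BIGQUERY_STORAGE = monthly_storage_cost(HYPOTHETICAL_TB, 20)
SNOWFLAKE_STORAGE = monthly_storage_cost(HYPOTHETICAL_TB, 23, 40)
```

At this scale, storage comes to about $100 per month on BigQuery and $115 to $200 on Snowflake, which is why compute, not storage, usually decides the warehouse bill.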

Fivetran connectors

Managed, in-house built

700+

All connectors developed and maintained in-house. Usage-based MAR pricing moved to per-connector in 2025, which can raise costs for multi-source marketing stacks. Verify the current count on Fivetran's site.

Per-connector MAR

Airbyte connectors

Open-source or Cloud

600+

600+ pre-built connectors plus a no-code Connector Builder with thousands of community connectors. Open-source when self-hosted; Airbyte Cloud is a separate managed offering.

5,000+ community-built

BigQuery storage

Serverless, Google-native

$20/TB

Around $20 per TB per month for active storage on a serverless model — you write SQL and BigQuery allocates resources. Natural fit when GA4, Google Ads, and Search Console are core sources.

Snowflake: ~$23–40/TB

For most marketing teams, ingestion is the clearest buy: connector maintenance is pure overhead, and managed services absorb the API churn that would otherwise consume engineering time. The warehouse choice usually follows your existing cloud and analytics gravity — BigQuery for Google-stack shops, Snowflake where compute elasticity and multi-cloud matter. Your ingestion source list almost always includes server-side tracking and first-party data collection, which feeds clean events straight into the warehouse.

04 — Transformation

dbt and the model layer you should own.

Transformation is the layer most teams keep in-house, because it encodes your business definitions. dbt is the de facto standard, and its model rests on four core primitives: Models are SQL SELECT statements that dbt materializes into warehouse tables or views; Tests are built-in data-quality assertions (not_null, unique, accepted_values, relationships); Freshness checks verify that sources are updating on schedule; and DAG lineage is the directed acyclic graph of model dependencies. The DAG is what makes selective rebuilds possible: because dbt knows which models depend on which, it can rebuild only the models that changed and their downstream dependents, cutting both cost and latency.
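To show what the four built-in tests assert, here is a plain-Python sketch of each one running against rows held as dictionaries. This is not dbt's implementation (dbt compiles tests to SQL and runs them in the warehouse), and the sample tables are invented for illustration. Each function returns the offending rows or values, so an empty list means the test passes. As a simplification, nulls are skipped in everything except `not_null`.

```python
def not_null(rows, column):
    """Return the rows where the column is missing or None."""
    return [row for row in rows if row.get(column) is None]


def unique(rows, column):
    """Return the non-null values that appear more than once."""
    seen, duplicates = set(), []
    for row in rows:
        value = row.get(column)
        if value is None:
            continue
        if value in seen and value not in duplicates:
            duplicates.append(value)
        seen.add(value)
    return duplicates


def accepted_values(rows, column, allowed):
    """Return the non-null values that fall outside the allowed set."""
    return [
        row[column]
        for row in rows
        if row.get(column) is not None and row[column] not in allowed
    ]


def relationships(child_rows, child_column, parent_rows, parent_column):
    """Return child values that have no matching parent value."""
    parents = {row.get(parent_column) for row in parent_rows}
    return [
        row[child_column]
        for row in child_rows
        if row.get(child_column) is not None
        and row[child_column] not in parents
    ]


# Invented sample data with one deliberate failure per test.
CUSTOMERS = [
    {"id": 1, "status": "active"},
    {"id": 2, "status": "churned"},
    {"id": 2, "status": "paused"},
    {"id": None, "status": "active"},
]
ORDERS = [
    {"order_id": 10, "customer_id": 1},
    {"order_id": 11, "customer_id": 3},
]
```

In a dbt project the same assertions are declared in YAML next to the model rather than written by hand, which is part of why the model layer is cheap to own.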

dbt Labs reports surpassing $100M ARR with 5,000+ customers and more than 80,000 data teams using dbt weekly (vendor-stated). Alongside the merger, dbt Core v2.0 shipped in alpha under the Apache 2.0 license, including the Rust-based Fusion engine runtime; dbt State, a caching preview, is positioned to cut infrastructure costs by 30%+ by rebuilding only changed models — a vendor preview figure, not an independent benchmark. The underlying point holds regardless of the exact number: the DAG-driven, build-only-what-changed model is what keeps transformation costs sane as your pipeline grows.
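The build-only-what-changed idea can be sketched as a graph walk: start from the changed models, collect everything downstream, and order the result so parents build before children. The code below is an illustration of that idea, not dbt's own selection logic, and the model names in `EXAMPLE_DAG` are hypothetical.

```python
from collections import deque


def models_to_rebuild(parents, changed):
    """Return changed models plus all downstream dependents, parents first.

    `parents` maps each model name to the list of models it selects from.
    """
    # Invert the dependency map so we can walk downstream.
    children = {model: [] for model in parents}
    for model, upstreams in parents.items():
        for upstream in upstreams:
            children.setdefault(upstream, []).append(model)

    # Breadth-first walk collects every model affected by the change.
    affected, queue = set(), deque(changed)
    while queue:
        model = queue.popleft()
        if model in affected:
            continue
        affected.add(model)
        queue.extend(children.get(model, []))

    # Kahn's algorithm, restricted to the affected subgraph.
    indegree = {
        model: sum(1 for up in parents.get(model, []) if up in affected)
        for model in affected
    }
    ready = deque(sorted(m for m, degree in indegree.items() if degree == 0))
    order = []
    while ready:
        model = ready.popleft()
        order.append(model)
        for child in sorted(children.get(model, [])):
            if child in affected:
                indegree[child] -= 1
                if indegree[child] == 0:
                    ready.append(child)
    return order


# Hypothetical project: two staging models feeding three marts.
EXAMPLE_DAG = {
    "stg_ads": ["raw_ads"],
    "stg_crm": ["raw_crm"],
    "customers": ["stg_crm"],
    "campaign_roi": ["stg_ads", "customers"],
    "audience_high_value": ["customers"],
}
```

With this graph, a change to `stg_crm` rebuilds four models and leaves `stg_ads` alone. That skipped work is where the cost saving comes from, whatever the exact percentage turns out to be.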

"Reverse ETL solves this issue by delivering transformed data directly to where decisions are made."— Fivetran, What is Reverse ETL

05 — Identity Resolution

Resolving one customer across many devices.

Identity resolution is a required transformation step for marketing data, not an optional one. The average household reportedly owns around 21 connected devices, so a single customer routinely appears as multiple disconnected profiles in raw data. Resolving them is what makes downstream personalization and suppression accurate.

There are two methods. Deterministic resolution matches on exact keys — email, loyalty ID, authenticated session token — and is high-precision but lower-reach. Probabilistic resolution uses statistical signals such as IP, device type, browser fingerprint, and behavioral patterns; it extends reach at the cost of precision. Hightouch has introduced Adaptive Identity Resolution, combining the two so marketers can dial the confidence threshold up for precision or down for reach depending on the use case. Companies that excel at personalization — which depends on resolved identity — are reported to generate meaningfully more revenue from those activities, a figure commonly attributed to McKinsey but worth verifying against the original report before you cite a specific percentage.
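Deterministic resolution is at heart a grouping problem: any two profiles that share an exact key belong to the same customer, and matches chain transitively. A minimal sketch using union-find follows. The key names (`email`, `loyalty_id`), the normalization rule (trim and lowercase), and the sample profiles are assumptions for illustration; production models typically add rules for shared or recycled identifiers.

```python
def resolve_deterministic(profiles, keys=("email", "loyalty_id")):
    """Group profile indexes that share any exact key value."""
    parent = list(range(len(profiles)))

    def find(index):
        # Follow parent links to the group root, compressing the path.
        while parent[index] != index:
            parent[index] = parent[parent[index]]
            index = parent[index]
        return index

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[max(root_a, root_b)] = min(root_a, root_b)

    first_seen = {}
    for index, profile in enumerate(profiles):
        for key in keys:
            value = profile.get(key)
            if not value:
                continue
            token = (key, value.strip().lower())  # assumed normalization
            if token in first_seen:
                union(index, first_seen[token])
            else:
                first_seen[token] = index

    groups = {}
    for index in range(len(profiles)):
        groups.setdefault(find(index), []).append(index)
    return sorted(groups.values())


# Invented profiles: 0 and 1 share an email, 1 and 2 share a loyalty ID.
SAMPLE_PROFILES = [
    {"email": "ana@example.com", "device": "phone"},
    {"email": "Ana@Example.com ", "loyalty_id": "L-100"},
    {"loyalty_id": "L-100", "device": "tablet"},
    {"email": "ben@example.com"},
]
```

Profiles 0 and 2 share no key directly but still land in one group because both match profile 1. That transitive step is what turns three device-level records into one customer.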

Pick the model per business unit

Identity resolution is not one setting for the whole company. Performance teams optimizing for reach may want a probabilistic model; CRM and retention teams optimizing for precision want a deterministic one. Building the resolved-identity model in your warehouse, rather than inside a vendor’s black box, is what lets different teams choose different thresholds against the same source of truth.
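The per-team threshold idea can be sketched with a toy scoring model. The signal weights and the two thresholds below are invented for illustration and are not drawn from any vendor's model; real probabilistic matching is statistical and trained on data. What the sketch shows is the mechanism: one shared score, with a different cut-off for each team.

```python
# Hypothetical points per matching signal, out of 100.
SIGNAL_WEIGHTS = {"ip": 40, "device_type": 20, "browser_fingerprint": 40}

# Hypothetical cut-offs: reach-oriented versus precision-oriented.
TEAM_THRESHOLDS = {"performance": 50, "retention": 90}


def match_score(profile_a, profile_b, weights=SIGNAL_WEIGHTS):
    """Sum the weights of every signal on which both profiles agree."""
    score = 0
    for signal, weight in weights.items():
        value = profile_a.get(signal)
        if value is not None and value == profile_b.get(signal):
            score += weight
    return score


def is_same_customer(profile_a, profile_b, team):
    """Apply the team's own threshold to the shared score."""
    return match_score(profile_a, profile_b) >= TEAM_THRESHOLDS[team]


# Invented pair: same IP and device type, different fingerprint.
LAPTOP_VISIT = {"ip": "203.0.113.7", "device_type": "laptop",
                "browser_fingerprint": "fp-aaa"}
RETURN_VISIT = {"ip": "203.0.113.7", "device_type": "laptop",
                "browser_fingerprint": "fp-bbb"}
```

The same pair scores 60, so the performance team treats it as one customer while the retention team does not. Both answers come from one model in the warehouse.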

06 — Activation

Reverse ETL pushes models back out.

Activation is where the modeled data earns its keep. Reverse ETL is a four-step process: the source is your cloud warehouse (Snowflake, BigQuery, Databricks); the model is a SQL query or visual audience selector defining which records to sync; the sync handles field mapping and scheduling, batch or near-real-time; and the destination is a CRM, ad platform, ESP, or support tool. The warehouse stays the system of record — nothing is copied into a proprietary store.
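The four parts described above (source, model, sync, destination) can be sketched in a few lines. Here SQLite stands in for the warehouse and a plain dictionary stands in for the CRM. The table, the audience query, and the destination field names (`Email`, `Lifetime_Value__c`) are hypothetical. Real tools add batching, retries, and rate-limit handling on top of this.

```python
import sqlite3


def run_sync(conn, model_sql, field_map, destination, key):
    """Query the model, map fields, and upsert only changed records.

    Returns the number of records written to the destination.
    """
    cursor = conn.execute(model_sql)
    columns = [description[0] for description in cursor.description]
    written = 0
    for row in cursor:
        record = dict(zip(columns, row))
        # Field mapping: warehouse column name -> destination field name.
        mapped = {dest: record[src] for src, dest in field_map.items()}
        record_key = mapped[key]
        if destination.get(record_key) != mapped:
            destination[record_key] = mapped
            written += 1
    return written


# Source: the warehouse (an in-memory stand-in with invented rows).
warehouse = sqlite3.connect(":memory:")
warehouse.execute("CREATE TABLE customers (email TEXT, ltv REAL)")
warehouse.executemany(
    "INSERT INTO customers VALUES (?, ?)",
    [("a@example.com", 900.0), ("b@example.com", 120.0),
     ("c@example.com", 640.0)],
)

# Model: a SQL query that defines the audience to sync.
HIGH_VALUE_AUDIENCE = "SELECT email, ltv FROM customers WHERE ltv >= 500"

# Sync: field mapping. Destination: a dict standing in for a CRM.
FIELD_MAP = {"email": "Email", "ltv": "Lifetime_Value__c"}
crm = {}
first_run = run_sync(warehouse, HIGH_VALUE_AUDIENCE, FIELD_MAP, crm, "Email")
second_run = run_sync(warehouse, HIGH_VALUE_AUDIENCE, FIELD_MAP, crm, "Email")
```

The second run writes nothing because nothing changed in the warehouse. Comparing before writing keeps scheduled syncs cheap and avoids needless destination API calls.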

The two reference tools are Hightouch and Census. Hightouch supports 250+ destinations and has repositioned itself from a reverse-ETL tool to an “agentic marketing platform,” though the underlying architecture is unchanged: query the warehouse, push results to destinations. Census — now Fivetran Activations, though the two coexist as separate product surfaces — supports 200+ destinations and is distinguished by SQL-first workflow control and tight dbt integration, which made it the choice for data-engineering-led teams wanting warehouse-first control over what gets synced.

"The dominant recommendation is no longer 'buy a packaged CDP.' It's some variation of 'keep your data warehouse as the source of truth, layer activation on top, don't let any one vendor own your data model.'"— CDP.com editorial

This is precisely the difference between a composable CDP and a packaged one. A composable CDP runs reverse ETL directly on top of your warehouse, which remains the system of record. A packaged CDP replicates your data into its own proprietary store, adding latency and a data-copy problem. The activation layer of this pipeline is the composable CDP — which is why the build-versus-buy call for the product layer is a separate decision from the infrastructure below it.

07 — Decision Matrix

The layer-by-layer build-or-buy call.

Most “modern data stack” guides describe the categories but never give a decision rule per layer. The matrix below is our own framework: for each layer of the pipeline, it weighs the four real options — build it yourself, self-host open-source, buy managed SaaS, or adopt the bundled Fivetran + dbt + Census stack — against the tradeoffs that actually decide it.

Marketing data stack decision matrix: recommended approach, engineering load, lock-in risk, and best fit by pipeline layer
Layer	Default approach	Eng. load	Lock-in risk	Best fit
Ingestion / extract	Buy managed (Fivetran) or self-host Airbyte	Low (buy) · Medium (OSS)	Medium	Most teams — connector upkeep is pure overhead
Warehouse / load	Buy managed (BigQuery / Snowflake)	Low	Medium — data is portable, SQL is not	Follow existing cloud and analytics gravity
Transformation	Own it — dbt models in version control	Medium — ongoing modeling work	Low — dbt Core is Apache 2.0	Everyone — this is your business logic
Identity resolution	Model in warehouse; buy if reach matters	Medium–High	Low (warehouse) · Medium (vendor)	Buy probabilistic; build deterministic
Orchestration	Self-host Airflow / Dagster / Prefect	Medium–High	Low — all open-source	Teams with many interdependent pipelines
Quality / observability	dbt tests + observability tool (Monte Carlo / Elementary)	Low–Medium	Low	Every layer needs its own freshness + checks
Activation / reverse ETL	Buy managed (Hightouch / Census)	Low	Medium — sync logic lives in vendor	Most teams — this is the composable-CDP layer

Read the matrix as a default, not a mandate. The recurring pattern across rows is clear: buy the layers that are pure maintenance (ingestion, activation), own the layer that encodes your business (transformation), and decide identity resolution per team. The bundled Fivetran + dbt + Census stack is genuinely convenient now that it ships from one vendor — but convenience and lock-in are the same coin, and the transformation layer is exactly where you do not want a single vendor owning your model.

08 — Orchestration & Quality

Scheduling and trust across the pipeline.

Two layers tend to be afterthoughts and shouldn’t be: orchestration and data quality. The three leading general-purpose orchestrators are Apache Airflow, Prefect, and Dagster — none of which are marketing-specific; they happen to schedule marketing pipelines among everything else. Airflow suits teams running 100+ static pipelines that need battle-tested maturity; Prefect offers the fastest path from a Python script to a scheduled pipeline; and Dagster shipped 20+ Components to general availability in October 2025, covering dbt Cloud, BigQuery, Census, and more.

On the quality side, the 2026 landscape spans Monte Carlo (closed-source, ML-powered anomaly detection, which launched Observability Agents in 2025 that deploy monitors via data profiling and investigate root causes across related tables in parallel), Elementary Data (dbt-native, open-source), Soda Core, SYNQ, and dbt’s built-in tests. The rule of thumb: every layer of the pipeline — ingestion, transformation, activation — needs its own freshness and quality checks, because a failure at any stage silently poisons everything downstream.
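A per-layer freshness check reduces to comparing the age of the latest load against two thresholds. The sketch below is loosely modeled on the warn-then-error pattern that dbt source freshness uses, but it is a standalone illustration: the layer names, timestamps, and thresholds are invented.

```python
from datetime import datetime, timedelta, timezone


def freshness_status(last_loaded_at, now, warn_after, error_after):
    """Classify a layer by the age of its most recent successful load."""
    age = now - last_loaded_at
    if age > error_after:
        return "error"
    if age > warn_after:
        return "warn"
    return "pass"


def check_layers(last_loaded, now, thresholds):
    """Run the freshness check once per pipeline layer."""
    return {
        layer: freshness_status(last_loaded[layer], now, warn, error)
        for layer, (warn, error) in thresholds.items()
    }


# Invented example: each layer has its own tolerance for staleness.
NOW = datetime(2026, 6, 1, 12, 0, tzinfo=timezone.utc)
LAST_LOADED = {
    "ingestion": NOW - timedelta(minutes=30),
    "transformation": NOW - timedelta(hours=3),
    "activation": NOW - timedelta(hours=30),
}
THRESHOLDS = {
    "ingestion": (timedelta(hours=1), timedelta(hours=6)),
    "transformation": (timedelta(hours=2), timedelta(hours=12)),
    "activation": (timedelta(hours=6), timedelta(hours=24)),
}
LAYER_STATUS = check_layers(LAST_LOADED, NOW, THRESHOLDS)
```

In this example ingestion is healthy while activation has been stale for over a day. Checking only the first layer would have missed that.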

And here is the caveat almost no marketing-audience pipeline guide surfaces: the warehouse-first model is batch-oriented by default. Reverse ETL syncs on a schedule, which is fine for daily audience refreshes but not for sub-minute activation. For abandoned-cart triggers within ten minutes or real-time bid enrichment, teams typically add a streaming sidecar — Kafka, Pub/Sub, or Kinesis — alongside the warehouse pipeline rather than forcing the warehouse to do something it wasn’t built for.
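The batch ceiling is easy to estimate. In the worst case an event just misses each scheduled run, so it waits one full interval plus the run time at every stage. The sketch below adds that up as a simplified upper bound, assuming the stages run on independent schedules; the intervals and run times are hypothetical.

```python
def worst_case_latency_minutes(stages):
    """Upper bound on event-to-destination delay for scheduled stages.

    `stages` is a list of (interval_minutes, run_minutes) pairs.
    """
    return sum(interval + run for interval, run in stages)


def meets_requirement(stages, required_minutes):
    """True if the worst case still lands inside the requirement."""
    return worst_case_latency_minutes(stages) <= required_minutes


# Hypothetical schedule: ingestion every 15 min, dbt hourly,
# reverse ETL every 30 min, each with an assumed run time.
BATCH_PIPELINE = [(15, 3), (60, 10), (30, 5)]
```

This schedule has a worst case of 123 minutes, so it cannot serve a ten-minute abandoned-cart trigger however the final sync is tuned. A streaming path has to bypass the scheduled stages.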

The real-time gap

If your activation requirement is measured in seconds or a few minutes rather than hours, the warehouse-first pipeline alone will not meet it. Plan a streaming sidecar from the start rather than discovering the batch ceiling after you have built the whole stack around scheduled syncs.

09 — Build or Buy

What to build and what to buy.

The economics push consistently toward buying the connector layer. Custom connector build cost is estimated at roughly 50–100 hours per connector plus ongoing maintenance, and marketing-platform APIs change often enough that running 10+ connectors in-house becomes a real engineering burden. One vendor TCO guide puts the maintenance overhead for open-source pipelines at around $8,000/month and estimates data engineers spend roughly 44% of their time on maintenance. Treat both figures as directional rather than audited, though they are broadly consistent with what teams actually experience.
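Plugging the quoted estimates into a quick calculation shows why the build option rarely wins. The sketch below uses only the figures cited above (50–100 hours per connector, around $8,000 per month of maintenance) and a ten-connector stack as the assumed size. It applies no hourly rate, since none is given here.

```python
def connector_build_hours(connector_count, hours_low=50, hours_high=100):
    """Return the (low, high) one-time build effort in engineering hours."""
    return (connector_count * hours_low, connector_count * hours_high)


def annual_maintenance_cost(monthly_cost=8000):
    """Annualize the vendor-stated monthly maintenance overhead."""
    return monthly_cost * 12


# Assumed stack size: ten in-house connectors.
BUILD_HOURS = connector_build_hours(10)
MAINTENANCE_PER_YEAR = annual_maintenance_cost()
```

Ten connectors come to 500 to 1,000 engineering hours before the first API change, plus roughly $96,000 a year in upkeep if the vendor-stated figure holds.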

Ingestion

Connectors & extraction

Connector maintenance is pure overhead and APIs churn constantly. Buy managed (Fivetran) or self-host Airbyte if you have the engineering capacity and want to avoid usage-based pricing surprises.

Buy connectors

Transformation

The model layer

This encodes your business definitions and belongs in version control. dbt Core is Apache 2.0, so ownership carries no licensing lock-in. This is the one layer to keep firmly in-house.

Own the model

Activation

Reverse ETL to destinations

Hightouch (250+ destinations) or Census/Fivetran Activations (200+) handle sync logic and field mapping. Buy it — the warehouse stays your system of record, so switching cost is bounded.

Buy reverse ETL

Real-time activation

Sub-minute use cases

Abandoned-cart and real-time bid enrichment exceed the batch warehouse model. Add a streaming sidecar (Kafka, Pub/Sub, Kinesis) rather than forcing reverse ETL to do real-time work.

Add a streaming sidecar

The forward view: the Fivetran + dbt + Census consolidation makes the single-vendor bundle more attractive on convenience, and as agentic tooling matures (Monte Carlo’s observability agents, dbt’s Fusion engine), more of the maintenance toil moves into the platform. But the strategic principle does not change. Keep the warehouse as the system of record, keep your model layer portable and in version control, and avoid letting any single vendor own the definitions that describe your customers. The well-structured event taxonomy and tracking plan that feeds this pipeline is what determines whether the whole stack produces trustworthy output or expensive noise.

If you are standing up or rationalizing a marketing data stack, our analytics and data engagements start with exactly this layer-by-layer audit — and tie into broader AI transformation work when the goal is to feed governed data to AI agents downstream.

10 — Conclusion

Own the model, buy the plumbing.

The shape of the marketing data stack, mid-2026

Buy the connectors, own the model, activate via reverse ETL.

The modern marketing data pipeline is three composable layers — ingestion, transformation, activation — sitting on a warehouse that stays the system of record. ELT replaced ETL because marketing data sources churn, and that inversion is precisely what made the layers separable enough to buy and build independently.

The Fivetran + dbt Labs merger on June 1, 2026, with Census already in the fold, means the full stack now ships from one vendor — which raises convenience and lock-in in equal measure. The pragmatic answer for most teams is unchanged: buy the connector and activation layers because they are pure maintenance, own the transformation layer because it encodes your business, and decide identity resolution per team based on whether reach or precision matters more.

Two things to keep in front of you. First, this pipeline is the infrastructure beneath a CDP — the composable CDP is just its activation layer, so the product decision is separate from the infrastructure one. Second, the warehouse-first model is batch by default; if you need sub-minute activation, plan the streaming sidecar before you build, not after. Get those two right and the rest is execution.

Marketing Data Pipelines: ETL to Activation in 2026