
dbt Incremental Models That Actually Scale

Incremental models promise fast builds, but most teams hit correctness bugs within weeks. Here is how to build dbt incrementals that stay correct through late data and schema changes.

David W

Lakehouse · 10 min read

Full table refreshes are simple but expensive. When your fact table exceeds a few hundred million rows, rebuilding it every run burns through compute credits and blocks downstream dashboards for hours. Incremental models solve this by processing only new or changed records on each run.

The problem is that most incremental implementations silently drift from their full-refresh equivalents. Late-arriving events get missed, schema changes cause column mismatches, and duplicate keys create inflated metrics that nobody notices until a finance team reports a discrepancy three weeks later. This article covers how to build dbt incremental models that actually stay correct in production: the right strategy selection, lookback windows for late data, schema change handling, SCD snapshots, and automated reconciliation testing.

Choosing an Incremental Strategy

dbt supports several incremental strategies, varying by adapter; the three most widely used are append, delete+insert, and merge. Each has different tradeoffs in correctness, performance, and warehouse compatibility.

Append is the simplest: new rows are inserted and nothing else happens. This works for immutable event streams where records are never updated — raw clickstream data, log entries, sensor readings. If a record can change after initial ingestion, append will produce duplicates.
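As a minimal sketch, an append-only model for an immutable event stream might look like this (the model and column names are illustrative, and the filter is safe only because events are never updated after ingestion):

```sql
-- models/marts/fct_page_views.sql (illustrative name)
{{
  config(
    materialized='incremental',
    incremental_strategy='append'
  )
}}

SELECT
  event_id,
  session_id,
  page_url,
  viewed_at
FROM {{ ref('stg_page_views') }}
{% if is_incremental() %}
  -- Each run inserts strictly newer rows and never revisits
  -- existing ones; correct only for immutable events.
  WHERE viewed_at > (SELECT MAX(viewed_at) FROM {{ this }})
{% endif %}
```

Note that no unique_key is configured; append never uses one, which is also why it cannot deduplicate.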

Delete+insert deletes all existing rows that match incoming records on the unique key, then inserts the new versions. This is faster than merge on warehouses that do not optimize MERGE operations well (some Redshift configurations, older BigQuery setups). The tradeoff is that it is not atomic on all platforms — there is a brief window where deleted rows are gone but new rows have not yet been inserted.
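A delete+insert model is configured the same way as merge; only the strategy name changes (sketch with illustrative names):

```sql
{{
  config(
    materialized='incremental',
    unique_key='order_id',
    incremental_strategy='delete+insert'
  )
}}
```

On each run dbt deletes target rows whose order_id appears in the incoming batch, then inserts the batch. That two-step sequence is why a concurrent reader can briefly see neither version of a row on platforms where the statements are not wrapped in a single transaction.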

Merge is the most correct strategy for mutable data. It performs an upsert: matching rows are updated, non-matching rows are inserted. This is atomic and handles both new and changed records in a single operation. The cost is that MERGE is typically slower than delete+insert because the warehouse must hash-join the incoming data against the full target table.

For most production fact tables, merge is the right default. Use append only for truly immutable event logs. Use delete+insert when you have measured that merge is a performance bottleneck on your specific warehouse and you can tolerate the brief inconsistency window.

Lookback Windows for Late-Arriving Data

The most common incremental bug is missing late-arriving records. A typical incremental model filters source data with WHERE updated_at > (SELECT MAX(updated_at) FROM target). This misses any record that was created or updated before that max timestamp but arrived in the source system after the last dbt run.

The fix is a lookback window: instead of filtering from the exact max timestamp, subtract a buffer that covers your worst-case late-arrival SLA. If your source systems guarantee delivery within 6 hours, use a 6-hour lookback. If you have CDC pipelines with occasional 24-hour delays, use 24 hours.

```sql
-- models/marts/fct_orders.sql
{{
  config(
    materialized='incremental',
    unique_key='order_id',
    incremental_strategy='merge',
    on_schema_change='sync_all_columns',
    cluster_by=['ordered_at::DATE']
  )
}}

WITH source AS (
  SELECT
    order_id,
    customer_id,
    product_id,
    order_total,
    currency_code,
    order_status,
    shipping_method,
    ordered_at,
    updated_at
  FROM {{ ref('stg_orders') }}
  {% if is_incremental() %}
    WHERE updated_at >= (
      SELECT DATEADD('hour', -6, MAX(updated_at))
      FROM {{ this }}
    )
  {% endif %}
)

SELECT
  order_id,
  customer_id,
  product_id,
  order_total,
  currency_code,
  order_status,
  shipping_method,
  ordered_at,
  updated_at,
  CURRENT_TIMESTAMP() AS _dbt_loaded_at
FROM source
```

The 6-hour lookback means every run reprocesses the last 6 hours of data. The merge strategy ensures that reprocessed records update in place rather than creating duplicates. The compute cost of reprocessing 6 hours is typically negligible compared to the cost of a full refresh — you are scanning a fraction of the source table.

One important detail: the lookback window only works if your unique_key is truly unique and stable. If order_id can change or if two different orders can share the same ID (common in multi-tenant systems), the merge will produce incorrect results. Always add a uniqueness test to your unique key column.
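A minimal schema test for this (the file path is illustrative) looks like:

```yaml
# models/marts/_marts.yml (illustrative path)
version: 2
models:
  - name: fct_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
```

If order_id alone is not unique in your system, test a composite key instead (for example with dbt_utils.unique_combination_of_columns over tenant_id and order_id) and set unique_key to match.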

Handling Schema Changes with on_schema_change

When a new column appears in your source data, what should happen to the incremental model? By default, dbt ignores new columns — the incremental run succeeds but the new column is silently dropped. This is almost never what you want.

The on_schema_change config controls this behavior. There are four options: ignore (default, drop new columns), fail (halt the run if schema differs), append_new_columns (add new columns, leave removed ones), and sync_all_columns (add new columns and remove old ones).

For production models, use sync_all_columns. It keeps your incremental model's schema in sync with the source, which is essential when upstream teams add new attributes to their events or APIs. Combined with a lookback window, sync_all_columns ensures that both data and schema stay current without manual intervention.

```sql
-- The on_schema_change config in the model header
{{
  config(
    materialized='incremental',
    unique_key='event_id',
    incremental_strategy='merge',
    on_schema_change='sync_all_columns'
  )
}}

-- When a new column appears in stg_events,
-- dbt will ALTER TABLE to add it before merging data.
-- Existing rows get NULL for the new column;
-- new rows get the actual value.
SELECT
  event_id,
  user_id,
  event_type,
  event_ts,
  -- This column was added last week by the product team
  attribution_channel,
  updated_at
FROM {{ ref('stg_events') }}
{% if is_incremental() %}
  WHERE updated_at >= (
    SELECT DATEADD('hour', -6, MAX(updated_at))
    FROM {{ this }}
  )
{% endif %}
```

There is one caveat: sync_all_columns issues ALTER TABLE statements under the hood, which can be slow on very large tables in some warehouses. On Snowflake and BigQuery this is nearly instant (metadata-only). On Redshift, adding a column can lock the table briefly. Test schema changes in staging before relying on them in production.

Snapshot-Based SCD Handling

Incremental models handle current state well but do not track historical changes. If a customer changes their plan tier from 'growth' to 'enterprise', the incremental model updates the row — the previous tier is gone. For slowly changing dimensions (SCDs) where you need to preserve history, use dbt snapshots.

Snapshots examine the source table on each run and record changes as new rows with valid_from and valid_to timestamps. This produces a Type 2 SCD automatically.

```sql
-- snapshots/snap_customers.sql
{% snapshot snap_customers %}
{{
  config(
    target_database='analytics',
    target_schema='snapshots',
    unique_key='customer_id',
    strategy='timestamp',
    updated_at='updated_at',
    invalidate_hard_deletes=True
  )
}}

SELECT
  customer_id,
  customer_name,
  email,
  plan_tier,
  account_status,
  assigned_csm,
  company_size,
  industry,
  updated_at
FROM {{ source('crm', 'customers') }}

{% endsnapshot %}
```

The timestamp strategy compares updated_at between the source and the snapshot target. When a row's updated_at has advanced, dbt invalidates the existing snapshot row (sets dbt_valid_to to the current timestamp) and inserts a new row with the updated values. The invalidate_hard_deletes option handles rows that disappear from the source — they are marked as deleted in the snapshot rather than silently dropped.

Use snapshots for any entity where business users ask questions like "what was this customer's plan tier last quarter?" or "when did this account churn?" Do not use snapshots for high-volume fact tables — the row multiplication makes them expensive. Snapshots are for dimensions with hundreds of thousands to low millions of rows, not billions.
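For example, a point-in-time question such as "what was each customer's plan tier at the end of Q1?" becomes a filter on the snapshot's validity interval (the date literal is illustrative):

```sql
-- Plan tier as of 2024-03-31 (illustrative date)
SELECT
  customer_id,
  plan_tier
FROM {{ ref('snap_customers') }}
WHERE dbt_valid_from <= '2024-03-31'
  AND (dbt_valid_to > '2024-03-31' OR dbt_valid_to IS NULL)
```

The NULL check matters: the currently valid version of each row has dbt_valid_to set to NULL.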

Reconciliation Testing Against Full Refresh

The most dangerous property of incremental models is that they can drift from full-refresh equivalents without any visible error. The dbt run succeeds, tests pass, and dashboards update — but the numbers are subtly wrong because a late record was missed or a merge key collision caused data loss.

The solution is reconciliation testing: periodically rebuild the incremental model as a full refresh into a separate table, then compare the two. If they match, your incremental logic is correct. If they diverge, you have a bug to fix.
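For an ad-hoc check before wiring up automated tests, a symmetric EXCEPT finds rows that differ in either direction (the schema and table names assume the reconciliation setup described in this section):

```sql
-- Empty result means the two tables agree. Audit columns like
-- _dbt_loaded_at are excluded because they differ on every run
-- by construction.
(SELECT order_id, customer_id, order_total, order_status, updated_at
 FROM analytics.fct_orders
 EXCEPT
 SELECT order_id, customer_id, order_total, order_status, updated_at
 FROM reconciliation.fct_orders_full_refresh)
UNION ALL
(SELECT order_id, customer_id, order_total, order_status, updated_at
 FROM reconciliation.fct_orders_full_refresh
 EXCEPT
 SELECT order_id, customer_id, order_total, order_status, updated_at
 FROM analytics.fct_orders)
```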

```yaml
# tests/reconciliation/recon_fct_orders.yml
version: 2
models:
  - name: fct_orders
    description: "Incremental orders fact table"
    tests:
      - dbt_utils.equal_rowcount:
          compare_model: ref('fct_orders_full_refresh')
      - dbt_utils.equality:
          compare_model: ref('fct_orders_full_refresh')
          compare_columns:
            - order_id
            - customer_id
            - order_total
            - order_status
            - updated_at
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: order_total
        tests:
          - dbt_utils.accepted_range:
              min_value: 0
```

The companion full-refresh model is straightforward — it is the same SQL without the is_incremental() filter, materialized as a table in a separate schema. Run it on a weekly or bi-weekly schedule in CI.

```sql
-- models/reconciliation/fct_orders_full_refresh.sql
{{
  config(
    materialized='table',
    schema='reconciliation',
    tags=['reconciliation', 'weekly']
  )
}}

SELECT
  order_id,
  customer_id,
  product_id,
  order_total,
  currency_code,
  order_status,
  shipping_method,
  ordered_at,
  updated_at,
  CURRENT_TIMESTAMP() AS _dbt_loaded_at
FROM {{ ref('stg_orders') }}
```

Automate this with a CI workflow that runs the reconciliation on a schedule. If the row count or column values diverge beyond a threshold, the workflow fails and alerts the team.

```yaml
# .github/workflows/dbt-reconciliation.yml
name: dbt Reconciliation
on:
  schedule:
    - cron: '0 8 * * 0'  # Every Sunday at 8 AM UTC
  workflow_dispatch:

jobs:
  reconciliation:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install dbt
        run: pip install dbt-snowflake

      # dbt_utils is a dbt package, not a pip package; it is
      # declared in packages.yml and installed with dbt deps.
      - name: Install dbt packages
        run: dbt deps

      - name: Build full refresh baseline
        run: |
          dbt run --select tag:reconciliation --full-refresh
        env:
          DBT_PROFILES_DIR: ./profiles

      - name: Run reconciliation tests
        run: |
          dbt test --select tag:reconciliation
        env:
          DBT_PROFILES_DIR: ./profiles

      - name: Alert on failure
        if: failure()
        uses: slackapi/slack-github-action@v2.0.0
        with:
          webhook: ${{ secrets.SLACK_WEBHOOK }}
          webhook-type: incoming-webhook
          payload: |
            {
              "text": "Reconciliation failed for fct_orders. Incremental model has drifted from full refresh."
            }
```

Practical Guidelines

Start every incremental model with a full refresh and verify the output against your source of truth. Then switch to incremental and run the reconciliation test immediately to confirm they produce identical results.

Set lookback windows based on measured late-arrival data, not guesses. Query your source system's ingestion lag for the 99th percentile and use that as your buffer. If you do not know your late-arrival characteristics, start with 24 hours and tighten after you have data.
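One way to measure this, assuming your staging layer records both the business timestamp and an ingestion timestamp (the _ingested_at column is an assumption about your pipeline):

```sql
-- 99th-percentile ingestion lag, in hours, over the last 30 days
-- (Snowflake syntax; adjust DATEDIFF/PERCENTILE_CONT for your warehouse)
SELECT
  PERCENTILE_CONT(0.99) WITHIN GROUP (
    ORDER BY DATEDIFF('minute', ordered_at, _ingested_at)
  ) / 60.0 AS p99_lag_hours
FROM {{ ref('stg_orders') }}
WHERE ordered_at >= DATEADD('day', -30, CURRENT_DATE())
```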

Use merge as your default strategy unless you have a specific, measured reason to choose append or delete+insert. The correctness guarantees of merge are worth the small performance overhead.

Always pair on_schema_change='sync_all_columns' with your incremental models. The alternative — manually updating models when upstream schemas change — does not scale past a handful of models.

Run reconciliation tests weekly in CI, not just when you suspect a problem. Drift accumulates slowly, and the earlier you catch it the easier it is to fix. A reconciliation failure after one week means you need to investigate one week of data. A failure after three months means you might need a full backfill.

Incremental models are not a shortcut — they are an optimization that requires more upfront design than full refreshes. The lookback window, merge strategy, schema change handling, and reconciliation tests are non-negotiable components of every production incremental model. Skip any one of them and you are building on a foundation that will eventually crack.

Debugging Incremental Failures

When an incremental model produces wrong results, the debugging process follows a consistent pattern. First, check whether the issue is data completeness (missing rows) or data correctness (wrong values). Run a COUNT and a SUM of your key metric on both the incremental table and a full refresh to narrow it down.
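A triage query along these lines (the schema names follow the reconciliation setup in this article) separates the two failure modes in one pass:

```sql
-- If row counts differ: completeness problem (lookback / predicate).
-- If counts match but the sums differ: correctness problem (merge keys).
SELECT 'incremental' AS variant,
       COUNT(*) AS row_count,
       SUM(order_total) AS sum_order_total
FROM analytics.fct_orders
UNION ALL
SELECT 'full_refresh',
       COUNT(*),
       SUM(order_total)
FROM reconciliation.fct_orders_full_refresh
```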

If rows are missing, the lookback window is too short or the incremental predicate is filtering too aggressively. Check the maximum lag in your source data by comparing ingestion timestamps with event timestamps. Widen the lookback to cover the 99.9th percentile, not just the 99th.

If values are wrong but row counts match, you likely have a merge key collision — two different business entities sharing the same unique_key. This causes one record to overwrite the other during the MERGE operation. Add a composite unique key or fix the upstream deduplication logic.

```sql
-- Diagnostic query: find merge key collisions
WITH dupes AS (
  SELECT
    order_id,
    COUNT(*) AS cnt,
    COUNT(DISTINCT customer_id) AS distinct_customers
  FROM {{ ref('stg_orders') }}
  GROUP BY order_id
  HAVING COUNT(*) > 1
)
SELECT * FROM dupes
ORDER BY cnt DESC
LIMIT 20;
```

If your reconciliation test fails intermittently, the most likely cause is a race condition between the incremental run and the source data. The lookback window captures data that arrived during the run itself, but the full refresh was built from a slightly different snapshot of the source. Add a 30-minute buffer to your lookback beyond the theoretical maximum to account for this timing difference.
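One way to make that buffer explicit is to bake it into the lookback arithmetic rather than leaving a bare magic number (the 6-hour base matches the lookback used earlier in this article; the extra 30 minutes is the in-flight buffer):

```sql
-- Lookback = 6h measured worst-case lag + 30 min for data that
-- arrives while the run itself is executing (390 minutes total)
{% if is_incremental() %}
  WHERE updated_at >= (
    SELECT DATEADD('minute', -390, MAX(updated_at))
    FROM {{ this }}
  )
{% endif %}
```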

Build a culture where incremental model failures are treated as bugs, not inconveniences. Every reconciliation failure should produce a ticket with a root cause analysis. Over time, this builds institutional knowledge about your data's arrival patterns and your pipeline's edge cases — knowledge that prevents the next failure before it happens.

Tags

dbt · incremental models · data modeling · lakehouse · data warehouse · SQL
