
Data Quality Contracts with dbt and Soda

How to implement enforceable data contracts between producers and consumers using dbt model contracts, Soda anomaly detection, and CI/CD gates that block bad data before it reaches production.

David W

Quality · 11 min read

Every data team has lived through the same failure. A producer changes a column type, drops a field, or lets null values leak through. Downstream dashboards break. ML models drift. Finance reports incorrect numbers to the board. Nobody finds out until a stakeholder sends an angry Slack message. Data contracts exist to prevent this. They formalize the agreement between a data producer and its consumers: what columns exist, what values are valid, and how fresh the data must be. When contracts are enforced in CI/CD, bad changes are blocked before they reach production.

This article walks through a three-layer contract system using dbt and Soda. dbt handles schema and semantic validation natively. Soda adds statistical anomaly detection and freshness monitoring that dbt cannot cover. Together, they form a contract framework that is enforceable, readable, and integrated into your development workflow.

The Three Contract Layers

A useful data contract covers three distinct concerns. The first is schema: what columns exist, what their types are, and whether nulls are allowed. The second is semantic: what values are valid within those columns. A status column should only contain known statuses. A revenue column should never be negative. The third is freshness: how recently the data was updated. A revenue table that has not been refreshed in 48 hours is effectively broken, even if every row passes validation.

Separating these layers matters because they fail differently and require different responses. A schema violation is a hard failure that blocks deployment. A semantic violation might be a warning that triggers investigation. A freshness breach is an operational alert that pages the on-call engineer. Treating all three the same leads to either alert fatigue or missed critical failures.

Schema Contracts with dbt Model Contracts

dbt introduced model contracts in version 1.5. When you set contract.enforced to true on a model, dbt validates that the model's output matches the declared schema at build time. If a column is missing, has the wrong type, or if an undeclared column appears, the build fails. This is the foundation of your schema contract.

Here is a contract definition for a revenue fact table. Every column has an explicit data type, and the contract is enforced so any deviation breaks the build.

```yaml
# models/marts/schema.yml
version: 2
models:
  - name: fct_revenue
    description: "Daily revenue by product line — contract-enforced"
    config:
      contract:
        enforced: true
      materialized: incremental
      unique_key: [revenue_date, product_line]
    columns:
      - name: revenue_date
        data_type: date
        description: "Calendar date of the revenue event"
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: "'2020-01-01'"
              max_value: "CURRENT_DATE()"
      - name: product_line
        data_type: varchar
        description: "Product tier generating the revenue"
        tests:
          - not_null
          - accepted_values:
              values: ['enterprise', 'growth', 'starter', 'free']
      - name: gross_revenue
        data_type: number(18, 2)
        description: "Total revenue before refunds"
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0
      - name: net_revenue
        data_type: number(18, 2)
        description: "Revenue after refunds and adjustments"
        tests:
          - not_null
      - name: refund_amount
        data_type: number(18, 2)
        description: "Total refunds processed"
        tests:
          - dbt_utils.accepted_range:
              min_value: 0
      - name: payment_count
        data_type: integer
        description: "Number of distinct payments"
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0
      - name: customer_count
        data_type: integer
        description: "Number of distinct paying customers"
        tests:
          - not_null
```

With contract.enforced set to true, dbt does two things at build time. First, it checks that the SQL output of your model contains exactly the declared columns with the declared types. If your model SELECT statement returns a column called revenue_dt instead of revenue_date, the build fails immediately with a clear error message. Second, it prevents undeclared columns from leaking through. If someone adds a debug column to the model SQL without updating the YAML, the contract catches it.

This is fundamentally different from dbt tests, which run after the model builds. Contract enforcement happens during the build itself. The data never lands in the warehouse in a broken state.
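To make the mechanics concrete, here is a toy Python sketch of the two-way comparison dbt performs at build time: every declared column must appear in the model output with the declared type, and every output column must be declared. The function and the column sets are illustrative stand-ins, not dbt's actual implementation.

```python
# Toy sketch of build-time contract checking. The schemas below are
# hypothetical; dbt compares the declared YAML against warehouse metadata.
declared = {
    "revenue_date": "date",
    "product_line": "varchar",
    "gross_revenue": "number(18,2)",
}

def check_contract(declared: dict, actual: dict) -> list:
    """Return contract violations between a declared and an actual schema."""
    violations = []
    for col, dtype in declared.items():
        if col not in actual:
            violations.append(f"column {col} declared in YAML but not found in model output")
        elif actual[col] != dtype:
            violations.append(f"column {col} has type {actual[col]}, expected {dtype}")
    for col in actual:
        if col not in declared:
            violations.append(f"undeclared column {col} found in model output")
    return violations

# A model that renames revenue_date to revenue_dt fails in both directions:
# the declared column is missing AND an undeclared column appeared.
actual = {"revenue_dt": "date", "product_line": "varchar", "gross_revenue": "number(18,2)"}
print(check_contract(declared, actual))
```

The two-directional check is why contracts also catch accidental additions, not just removals — a debug column left in the SELECT is as much a violation as a dropped field.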

Semantic Contracts with Column-Level Tests

Schema contracts tell you the shape of the data. Semantic contracts tell you whether the values make sense. dbt's built-in test framework handles this layer through column-level tests declared in the same YAML file.

The tests in the schema above already demonstrate several semantic checks: not_null ensures critical columns always have values, accepted_values restricts product_line to known tiers, and accepted_range prevents negative revenue. But real-world contracts need more nuanced checks. Here is an extended example that adds relationship integrity and custom business rules.

```yaml
# models/marts/schema.yml (additional tests)
models:
  - name: fct_revenue
    tests:
      - dbt_utils.unique_combination_of_columns:
          combination_of_columns:
            - revenue_date
            - product_line
      - dbt_utils.expression_is_true:
          expression: "gross_revenue >= net_revenue"
          config:
            severity: error
      - dbt_utils.expression_is_true:
          expression: "gross_revenue = net_revenue + refund_amount"
          config:
            severity: warn
    columns:
      - name: revenue_date
        tests:
          - not_null
          - dbt_utils.not_constant
          - dbt_utils.sequential_values:
              interval: 1
              datepart: day
              config:
                severity: warn
```

The unique_combination_of_columns test ensures there is exactly one row per date per product line, which is the grain of this fact table. The expression_is_true tests enforce business invariants: gross revenue must always be greater than or equal to net revenue, and ideally the three revenue fields should balance. The sequential_values test on revenue_date catches gaps in the data. If a day is missing, it warns so engineers can investigate whether the source system had an outage.

A critical design choice: use severity levels deliberately. Hard business rules like non-negative revenue should be errors that block deployment. Soft expectations like sequential dates should be warnings that generate alerts but do not block. This prevents a single missing Saturday from holding up your Monday morning pipeline.

Freshness and Anomaly Detection with Soda

dbt tests validate individual rows and columns. But some quality problems only appear in aggregate: row counts drop by 80 percent, average revenue spikes by 10x, or data stops arriving entirely. These require statistical checks and freshness monitoring. This is where Soda comes in.

Soda Core is an open-source data quality tool that runs checks defined in YAML. It connects to your warehouse, executes the checks, and returns pass/fail/warn results. Soda checks complement dbt tests by covering the aggregate and temporal concerns that dbt does not handle well.

```yaml
# soda/checks/fct_revenue.yml
checks for fct_revenue:
  # Freshness: revenue_date should be within 2 days of today
  - freshness(revenue_date) < 2d:
      name: Revenue data is fresh

  # Volume: expect at least 100 rows per day
  - row_count:
      fail:
        when < 100
      warn:
        when < 500
      name: Minimum row count check

  # Anomaly detection on row count — catches sudden drops or spikes
  - anomaly detection for row_count:
      warn: anomaly score > 0.8
      fail: anomaly score > 0.95
      name: Row count anomaly

  # Anomaly detection on revenue — catches data corruption
  - anomaly detection for avg(gross_revenue):
      fail: anomaly score > 0.9
      name: Average revenue anomaly

  # Anomaly detection on null ratio
  - anomaly detection for missing_percent(customer_count):
      warn: anomaly score > 0.7
      name: Customer count null ratio anomaly

  # Business rule: no negative net revenue
  - failed rows:
      fail condition: net_revenue < 0
      name: Negative net revenue check

  # Cross-check: revenue should balance
  - failed rows:
      fail condition: ABS(gross_revenue - net_revenue - refund_amount) > 0.01
      name: Revenue balance check
```

The freshness check is the most operationally important. It verifies that the maximum value in the revenue_date column is within two days of the current date. If your pipeline stops running or a source system goes down, this check fails before any stakeholder notices stale dashboards.

Anomaly detection uses Soda's built-in statistical model, which learns the normal range of a metric from historical data and flags deviations. This catches problems that static thresholds miss. If your row count is normally 50,000 but drops to 5,000 on a holiday, a static threshold of 100 would not catch it. Anomaly detection would, because the drop is statistically significant relative to the historical pattern.
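Soda's detector is a learned time-series model, so the following is only a conceptual sketch: it scores how far today's row count sits from the historical mean, in standard deviations, and maps that onto a rough 0-to-1 scale. The saturation constant and the sample numbers are illustrative assumptions, not Soda's algorithm.

```python
# Toy anomaly scoring: deviation from history, not a static threshold.
# Soda's real detector learns seasonality and trend; this ignores both.
from statistics import mean, stdev

def anomaly_score(history: list[float], today: float) -> float:
    """Map a z-score against historical values into a rough 0-1 anomaly score."""
    mu, sigma = mean(history), stdev(history)
    z = abs(today - mu) / sigma if sigma else 0.0
    return min(z / 4, 1.0)  # 4+ standard deviations saturates at 1.0

history = [50_000, 51_200, 49_800, 50_500, 50_100, 49_900, 50_300]
print(anomaly_score(history, 50_400))  # ordinary day: low score
print(anomaly_score(history, 5_000))   # 90 percent drop: score saturates
```

The key property carries over to the real thing: a drop to 5,000 rows sails past a static `row_count > 100` check but is wildly abnormal relative to history, which is exactly the gap anomaly detection closes.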

Soda also needs a configuration file that tells it how to connect to your warehouse.

```yaml
# soda/configuration.yml
data_source snowflake:
  type: snowflake
  account: ${SNOWFLAKE_ACCOUNT}
  username: ${SNOWFLAKE_USERNAME}
  password: ${SNOWFLAKE_PASSWORD}
  database: ANALYTICS
  schema: MARTS
  warehouse: WH_DATA_QUALITY
  role: DATA_QUALITY_ROLE
```

CI/CD Enforcement with GitHub Actions

Contracts are only as strong as their enforcement mechanism. If engineers can merge changes that violate contracts, the contracts are documentation, not guarantees. The enforcement layer is a CI/CD pipeline that runs dbt build and soda scan on every pull request that touches model files.

Here is a GitHub Actions workflow that gates PRs on contract compliance. It uses dbt's state comparison to test only modified models, keeping CI fast even in large projects.

```yaml
# .github/workflows/data-contracts.yml
name: Data Contract Checks
on:
  pull_request:
    paths:
      - 'models/**'
      - 'soda/**'
      - 'dbt_project.yml'

jobs:
  contract-checks:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: |
          pip install dbt-snowflake==1.8.* soda-core-snowflake

      - name: Download production manifest
        run: |
          aws s3 cp s3://dbt-artifacts/prod/manifest.json \
            prod-manifest/manifest.json
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}

      - name: Run dbt build (modified models only)
        run: |
          dbt deps
          dbt build \
            --select state:modified+ \
            --defer \
            --state prod-manifest/
        env:
          DBT_PROFILES_DIR: ./profiles
          SNOWFLAKE_ACCOUNT: ${{ secrets.SNOWFLAKE_ACCOUNT }}
          SNOWFLAKE_PASSWORD: ${{ secrets.SNOWFLAKE_PASSWORD }}

      - name: Run Soda checks
        run: |
          soda scan \
            -d snowflake \
            -c soda/configuration.yml \
            soda/checks/
        env:
          SNOWFLAKE_ACCOUNT: ${{ secrets.SNOWFLAKE_ACCOUNT }}
          SNOWFLAKE_USERNAME: ${{ secrets.SNOWFLAKE_USERNAME }}
          SNOWFLAKE_PASSWORD: ${{ secrets.SNOWFLAKE_PASSWORD }}

      - name: Post results to PR
        if: always()
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const dbtLog = fs.readFileSync('logs/dbt.log', 'utf8');
            const failedTests = dbtLog.match(/FAIL \d+/g) || ['All tests passed'];
            github.rest.issues.createComment({
              owner: context.repo.owner,
              repo: context.repo.repo,
              issue_number: context.issue.number,
              body: `## Data Contract Results\n\n${failedTests.join('\n')}`
            });
```

When a contract check fails, the PR is blocked from merging. The workflow posts a comment to the PR with the specific failures, so the author knows exactly what to fix. Here is what a failed contract looks like in practice.

A developer changes the fct_revenue model and accidentally drops the refund_amount column. They open a PR. The CI workflow runs dbt build, which triggers the contract check. dbt detects that the model output no longer matches the declared schema and fails with a message like: contract violation for fct_revenue — column refund_amount declared in YAML but not found in model output. The PR shows a red check, the comment explains the failure, and the merge button is disabled until the contract is satisfied.

This feedback loop is fast. The developer sees the failure within minutes, not after a production incident. They either fix the model to include the column or update the contract YAML to reflect the intentional change. If they update the contract, the diff is visible in the PR, so consumers can review and approve the change before it lands.

Contract-as-Documentation

The contract YAML files serve a dual purpose. They are executable tests and they are human-readable documentation. A product manager can open schema.yml and understand what columns exist in fct_revenue, what values are valid, and what freshness guarantee the team commits to. An analyst can read the Soda checks to understand what anomaly thresholds are in place.

This eliminates the gap between documentation and reality. Traditional data dictionaries rot because nobody updates them when the schema changes. Contract files stay current because the CI pipeline enforces them. If the contract says product_line accepts four values and someone adds a fifth tier, the build fails until the contract is updated. The documentation is always correct because correctness is enforced by the build system.

Treat your contract files like API specifications. Version them in git. Require PR reviews from consumers when contracts change. Publish them to an internal documentation site so stakeholders can browse the guarantees without reading YAML directly.

Implementation Strategy

Do not try to add contracts to every model at once. Start with your five most critical gold-layer models — the ones that power executive dashboards, financial reports, or ML features. Add schema contracts with enforced: true. Add the three or four most important semantic tests per model. Add a Soda freshness check. Wire up the GitHub Action.

Once the first five models are under contract, expand to the silver layer. Silver contracts focus on deduplication guarantees, type correctness, and referential integrity. Bronze contracts are usually minimal — just schema validation to catch source system changes early.

Track contract coverage as a metric. Count the number of models with enforced contracts divided by total models. Set a target: 80 percent of gold models under contract by end of quarter, 60 percent of silver by end of next quarter. Report this metric alongside data freshness and pipeline reliability in your team's operational review.
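Coverage is easy to compute from dbt's own artifacts. The sketch below reads `manifest.json` and counts models whose config has `contract.enforced` set; the nested key path matches dbt's manifest schema, but treat the exact structure as an assumption to verify against your dbt version.

```python
# Sketch: contract coverage from dbt's manifest.json artifact.
# Assumes the manifest layout where model nodes carry
# config.contract.enforced (dbt 1.5+); verify against your version.
import json

def contract_coverage(manifest_path: str) -> float:
    """Fraction of dbt models with an enforced contract, per the manifest."""
    with open(manifest_path) as f:
        manifest = json.load(f)
    models = [
        node for node in manifest["nodes"].values()
        if node["resource_type"] == "model"
    ]
    enforced = [
        m for m in models
        if m.get("config", {}).get("contract", {}).get("enforced", False)
    ]
    return len(enforced) / len(models) if models else 0.0
```

Run it against the same production manifest the CI workflow already downloads, and the number can be emitted into whatever dashboard holds your freshness and reliability metrics.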

A partially enforced contract system is infinitely better than none at all. Every contract you add prevents a category of production incidents. Every CI gate you enable saves hours of incident response. The goal is not perfection — it is a continuous reduction in the surface area where bad data can leak through.

Tags

Quality · data quality · data contracts · dbt · soda · CI/CD · testing
