
The medallion architecture organizes data into three tiers: bronze (raw), silver (cleaned), and gold (business-ready). It is the most widely adopted pattern in lakehouse platforms. Databricks coined the terminology, but the pattern itself — raw staging, cleaned integration, business-ready consumption — predates any specific vendor. The problem is that most teams blur the boundaries between layers, turning a clean architecture into a maintenance nightmare.
This article defines what each layer should and should not contain, establishes ownership rules, shows the SQL patterns for each tier, and calls out the anti-patterns that cause medallion architectures to collapse under their own weight.
Bronze: The Immutable Audit Trail
Bronze is your safety net. It stores data exactly as received from source systems. No transformations. No filtering. No deduplication. The only additions are metadata columns that record where the data came from and when it arrived. The purpose of bronze is to be a complete, replayable record of every piece of data your platform has ever ingested.
This immutability matters because source systems are unreliable. APIs change without notice. CSV exports contain corrupted rows. Webhook payloads shift schema between deployments. When any of these break a downstream model, you need to replay from the raw data. If bronze is already transformed, you have lost the ability to recover.
```sql
-- Bronze layer: raw ingestion with metadata only
-- No type casting, no filtering, no deduplication
CREATE OR REPLACE TABLE bronze.raw_payments AS
SELECT
    $1 AS raw_json,
    METADATA$FILENAME AS _source_file,
    METADATA$FILE_ROW_NUMBER AS _source_row_number,
    METADATA$FILE_LAST_MODIFIED AS _source_modified_at,
    CURRENT_TIMESTAMP() AS _ingested_at,
    'payments_api_v2' AS _source_system,
    'daily_batch' AS _load_method
FROM @landing_stage/payments/dt=20260120/
    (FILE_FORMAT => 'raw_json_format');
```

Notice what is not here: no CAST operations, no WHERE clauses, no joins. The raw_json column contains the exact payload. The metadata columns are prefixed with underscores to distinguish them from source data. The _source_system and _load_method columns enable you to trace any row back to its origin, which is essential for debugging and compliance.
Bronze tables should be append-only. Never update or delete rows in bronze. If a source system sends a correction, ingest the correction as a new row. Let the silver layer handle which version is canonical. This append-only pattern is also what makes bronze compatible with Iceberg, Delta Lake, and Hudi — all of which handle append-only writes efficiently.
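When a correction arrives, a sketch of the replay looks like the original load with only the load-method metadata changed; the corrections path below is illustrative:

```sql
-- Append a corrected export as new rows; never UPDATE bronze.
-- Silver's dedup keeps the latest _ingested_at version per key.
INSERT INTO bronze.raw_payments
SELECT
    $1,
    METADATA$FILENAME,
    METADATA$FILE_ROW_NUMBER,
    METADATA$FILE_LAST_MODIFIED,
    CURRENT_TIMESTAMP(),
    'payments_api_v2',
    'correction_replay'
FROM @landing_stage/payments/corrections/dt=20260121/
    (FILE_FORMAT => 'raw_json_format');
```

Tagging the load method as a replay preserves the audit trail: you can always count how many rows arrived via corrections versus the normal batch.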
Silver: The Trusted, Cleaned Layer
Silver is where you establish truth. This layer handles four responsibilities: deduplication, type casting, null handling, and cross-source joins. Silver tables represent the single source of truth for each business entity. Every silver table should have a clear primary key, a defined grain, and a freshness SLA.
Deduplication is the most important silver responsibility. Source systems frequently send duplicate records — retried API calls, overlapping batch exports, CDC events that replay during recovery. Silver resolves these duplicates using a deterministic strategy.
```sql
-- Silver layer: deduplicated, typed, cleaned
-- models/silver/silver_payments.sql
{{
    config(
        materialized='incremental',
        unique_key='payment_id',
        incremental_strategy='merge'
    )
}}

WITH parsed AS (
    SELECT
        raw_json:payment_id::VARCHAR AS payment_id,
        raw_json:order_id::VARCHAR AS order_id,
        raw_json:customer_id::VARCHAR AS customer_id,
        raw_json:amount::NUMBER(18, 2) AS amount,
        raw_json:currency::VARCHAR(3) AS currency,
        TRY_TO_TIMESTAMP(
            raw_json:created_at::VARCHAR
        ) AS created_at,
        raw_json:status::VARCHAR AS status,
        raw_json:payment_method::VARCHAR AS payment_method,
        _ingested_at,
        _source_file
    FROM {{ source('bronze', 'raw_payments') }}
    WHERE raw_json:payment_id IS NOT NULL
    {% if is_incremental() %}
        AND _ingested_at >= (
            SELECT DATEADD('hour', -6, MAX(_ingested_at))
            FROM {{ this }}
        )
    {% endif %}
),

ranked AS (
    SELECT
        *,
        ROW_NUMBER() OVER (
            PARTITION BY payment_id
            ORDER BY _ingested_at DESC
        ) AS _row_rank
    FROM parsed
)

SELECT
    payment_id,
    order_id,
    customer_id,
    amount,
    currency,
    created_at,
    status,
    payment_method,
    _ingested_at AS _last_seen_at,
    _source_file AS _last_source_file
FROM ranked
WHERE _row_rank = 1
```

The ROW_NUMBER pattern is the standard deduplication approach. Partition by the natural key, order by ingestion time descending, and keep only the latest version. The TRY_TO_TIMESTAMP function handles malformed dates gracefully — returning null instead of throwing an error. The WHERE clause at the top filters out records with null primary keys, which are always invalid.
Silver is also where you join across source systems. If your orders come from one API and your customers come from another, silver is where you create the canonical relationship between them. But keep joins minimal at this layer — join only to resolve foreign keys and enrich with stable reference data. Complex business logic belongs in gold.
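Kept to its minimal form, a silver cross-source join resolves a foreign key and nothing else. A sketch, where silver_customers and its source_customer_key column are assumed names:

```sql
-- Resolve the payments customer key against the canonical customer entity
SELECT
    p.payment_id,
    p.order_id,
    p.amount,
    c.customer_id AS canonical_customer_id
FROM {{ ref('silver_payments') }} p
LEFT JOIN {{ ref('silver_customers') }} c
    ON p.customer_id = c.source_customer_key
```

A LEFT JOIN preserves payments whose customer record has not yet arrived, which a quality test can then flag rather than silently drop.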
Quality Gates for Silver
Every silver table needs tests. At minimum, test for primary key uniqueness, non-null primary keys, and referential integrity to related silver tables. These tests are your contract with the gold layer: if silver is wrong, everything downstream is wrong.
```yaml
# models/silver/schema.yml
version: 2
models:
  - name: silver_payments
    description: "Deduplicated, typed payment records"
    columns:
      - name: payment_id
        tests:
          - unique
          - not_null
      - name: order_id
        tests:
          - not_null
          - relationships:
              to: ref('silver_orders')
              field: order_id
      - name: amount
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: 0
              max_value: 1000000
      - name: currency
        tests:
          - not_null
          - accepted_values:
              values: ['USD', 'EUR', 'GBP', 'JPY', 'CAD', 'AUD']
      - name: status
        tests:
          - not_null
          - accepted_values:
              values: ['pending', 'succeeded', 'failed', 'refunded', 'disputed']
      - name: created_at
        tests:
          - not_null
          - dbt_utils.accepted_range:
              min_value: "'2020-01-01'"
              max_value: "CURRENT_TIMESTAMP()"
```

These tests run as part of every dbt build. If a new source system starts sending payment statuses that are not in the accepted list, the build fails and the issue is caught before it corrupts gold-layer metrics.
Gold: Business-Ready Metrics
Gold is the layer your stakeholders interact with. These tables power dashboards, feed ML models, and generate regulatory reports. Gold models should be opinionated: pre-aggregated to the right grain, filtered to relevant business scopes, and named in language the business understands.
The difference between silver and gold is the difference between data and information. Silver gives you every individual payment with its attributes. Gold gives you daily revenue by product line with refund breakdowns. Silver is complete. Gold is useful.
```sql
-- Gold layer: business-ready fact table
-- models/gold/fct_daily_revenue.sql
{{
    config(
        materialized='table',
        tags=['finance', 'daily']
    )
}}

WITH payments AS (
    SELECT
        p.payment_id,
        p.order_id,
        p.amount,
        p.currency,
        p.status,
        p.created_at,
        o.product_line,
        o.region,
        o.customer_segment
    FROM {{ ref('silver_payments') }} p
    INNER JOIN {{ ref('silver_orders') }} o
        ON p.order_id = o.order_id
    WHERE p.status IN ('succeeded', 'refunded')
)

SELECT
    created_at::DATE AS revenue_date,
    product_line,
    region,
    customer_segment,
    COUNT(DISTINCT payment_id) AS payment_count,
    COUNT(DISTINCT order_id) AS order_count,
    SUM(amount) AS gross_revenue,
    SUM(CASE
        WHEN status = 'refunded'
        THEN amount ELSE 0
    END) AS refund_amount,
    SUM(CASE
        WHEN status = 'succeeded'
        THEN amount ELSE 0
    END) AS net_revenue
FROM payments
GROUP BY 1, 2, 3, 4
```

Gold tables use business names: revenue_date, not created_at_date. gross_revenue, not sum_amount. payment_count, not cnt_payment_id. This is not cosmetic — it eliminates an entire class of misunderstandings between data engineers and business stakeholders. When the CFO sees gross_revenue on a dashboard, they know exactly what it means.
Gold models should also be materialized as tables, not views. Views save storage but shift compute to query time, which means every dashboard refresh recomputes the aggregation. For gold-layer fact tables that power dashboards, pre-computing the result and storing it as a table gives predictable, fast query performance. Use dbt's materialized table configuration and refresh on a schedule that matches your freshness SLA — typically every one to four hours for operational metrics, and daily for financial reporting.
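One way to wire those refresh cadences, assuming models are tagged as in the fact table example above and a cron-driven scheduler runs dbt (the schedule values and tag names are illustrative):

```
# Hypothetical cron entries driving dbt refreshes by tag
0 * * * *   dbt build --select tag:hourly    # operational metrics
0 6 * * *   dbt build --select tag:daily     # financial reporting
```

`dbt build --select tag:<name>` is standard dbt selection syntax; the tags themselves come from each model's config block, so the schedule stays declarative.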
Dimension tables in the gold layer follow a similar pattern. A dim_customers table contains the current state of every customer with business-relevant attributes: customer segment, account status, lifetime value tier. If you need historical state tracking, use a Type 2 slowly changing dimension in silver and expose the current-state snapshot in gold. Most dashboard queries do not need the full change history — they just need the current truth.
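If silver carries the Type 2 history, the gold current-state snapshot can be a thin filter. A sketch, assuming a dbt snapshot-style silver table named silver_customers_history where dbt_valid_to is null on the current row (all names here are illustrative):

```sql
-- models/gold/dim_customers.sql (illustrative)
-- Current-state snapshot over a hypothetical silver SCD2 table
SELECT
    customer_id,
    customer_segment,
    account_status,
    lifetime_value_tier
FROM {{ ref('silver_customers_history') }}
WHERE dbt_valid_to IS NULL
```

The full history stays queryable in silver for the rare analysis that needs it; gold exposes only the current truth that dashboards consume.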
The One-Directional Flow Rule
The single most important architectural rule: data flows in one direction only. Bronze feeds silver. Silver feeds gold. Gold never reads from bronze. Silver never reads from gold. Gold never feeds back into silver.
This rule seems obvious but is violated constantly in practice. The most common violation is a gold model that reads directly from bronze to grab a field that the silver layer does not expose. This creates a circular dependency and means your gold layer can produce different results depending on the state of bronze, bypassing all the cleaning and validation in silver.
If gold needs a field that silver does not provide, the fix is to add it to silver. This might feel like extra work, but it ensures the field goes through deduplication, type casting, and validation before reaching business users. Every shortcut around the one-directional flow rule creates a future debugging nightmare.
Ownership Model
Assign each layer to a team with clear responsibility. Platform engineers own bronze. They are responsible for ingestion reliability, source system connectivity, and storage efficiency. Analytics engineers own silver. They handle data cleaning, deduplication logic, and schema evolution. Domain analysts or analytics engineers own gold. They define business metrics, build aggregations, and maintain dashboard queries.
This ownership model prevents the most common failure mode: a single data team that owns everything and treats silver as a dumping ground. When silver has a clear owner, that team enforces standards — naming conventions, test coverage, documentation — that keep the layer trustworthy.
Ownership also determines who approves changes. A pull request that modifies a bronze ingestion job should be reviewed by the platform team. A PR that changes a silver deduplication strategy should be reviewed by analytics engineering. A PR that alters a gold metric definition should be reviewed by the domain analyst who owns that metric. Cross-layer PRs require sign-off from both teams.
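One lightweight way to route those reviews automatically is a CODEOWNERS file; the paths and team handles below are illustrative, not prescriptive:

```
# .github/CODEOWNERS (hypothetical layout and team handles)
/ingestion/bronze/    @platform-engineering
/models/silver/       @analytics-engineering
/models/gold/         @domain-analytics
```

A PR that touches both /models/silver/ and /models/gold/ then requires approval from both teams, which enforces the cross-layer sign-off rule without any manual process.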
Anti-Patterns to Avoid
Business logic in bronze. If you are applying WHERE clauses, CASE statements, or joins in your bronze layer, you have turned bronze into silver. Bronze should be a dumb copy of the source with metadata appended. Nothing more.
Circular dependencies between layers. If gold reads from bronze or silver reads from gold, your architecture has a cycle. Every cycle makes debugging exponentially harder because changes in one layer can affect another through an unexpected path.
Too many silver models. Some teams create a silver model for every conceivable entity variant: silver_customers_active, silver_customers_churned, silver_customers_with_revenue. This is gold-layer logic leaking into silver. Silver should have one model per entity: silver_customers. The active/churned/revenue segmentation belongs in gold, where business definitions are applied.
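The segmentation those extra silver models encode can usually live as a thin gold model over the single silver entity. A sketch, with assumed column names and an illustrative 90-day activity window:

```sql
-- Gold-layer segmentation over one silver entity, not separate silver models
SELECT
    customer_id,
    CASE
        WHEN last_order_at >= DATEADD('day', -90, CURRENT_DATE()) THEN 'active'
        ELSE 'churned'
    END AS lifecycle_status
FROM {{ ref('silver_customers') }}
```

When the business changes the definition of "active," only this gold model changes; the silver entity and everything else built on it are untouched.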
Skipping silver entirely. Going directly from bronze to gold is tempting for simple pipelines, but it means your gold models contain both cleaning logic and business logic. When a deduplication bug appears, you have to modify gold models that are also used by dashboards. Keeping silver as a separate layer gives you a clean place to fix data quality issues without touching business metric definitions.
Naming Conventions That Scale
Consistent naming prevents confusion as your medallion architecture grows from ten models to two hundred. Use a prefix that identifies the layer: bronze tables start with raw_, silver tables start with silver_ or stg_ for staging, and gold tables use business domain prefixes like fct_ for facts and dim_ for dimensions. This convention lets anyone looking at a query immediately know which layer they are reading from.
Column naming follows a similar pattern. Bronze columns match the source system exactly — do not rename anything. Silver columns use snake_case with clear, unambiguous names: customer_id, not cust_id or CustomerID. Gold columns use business language: gross_revenue, not amt_gross. Suffixes like _at for timestamps, _id for identifiers, and _count for aggregations make columns self-documenting.
When to Bend the Rules
The medallion architecture is a framework, not a religion. There are legitimate cases where strict three-layer adherence creates more problems than it solves. Real-time dashboards that need sub-minute latency may read directly from a streaming silver table without waiting for a gold aggregation. Exploratory data science workloads may query silver directly because the gold layer does not expose the granularity they need. These exceptions are fine as long as they are documented and the team understands why the exception exists.
The danger is when exceptions become the norm. If half your dashboards read from silver, you do not have a gold layer problem — you have a gold layer gap. Fix it by building the gold models those dashboards need, then migrate the dashboards to use gold. Every exception should have an expiration date or a ticket to resolve it.
Measuring Architecture Health
Track three metrics to assess whether your medallion architecture is healthy. First, layer violation count: how many gold models reference bronze directly? This should be zero. Use dbt's graph operations to detect cross-layer references in CI. Second, test coverage by layer: what percentage of silver models have primary key uniqueness tests? What percentage of gold models have freshness checks? Target 100 percent for silver primary keys and 90 percent for gold freshness. Third, model count ratio: a healthy architecture has roughly two to three times more gold models than silver models, because multiple business questions draw from the same cleaned entities. If your silver layer has more models than gold, you may be doing business segmentation in the wrong layer.
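One way to automate the first check in CI, assuming bronze sources are declared under a source named bronze and gold models live under models/gold (the selector methods are standard dbt syntax, but the names and paths are illustrative):

```
# Fail the build if any gold model reads a bronze source directly.
# "source:bronze+1" selects bronze sources plus their direct children;
# intersecting with "path:models/gold" keeps only gold-layer offenders.
violations=$(dbt ls --select "source:bronze+1,path:models/gold" --output name)
if [ -n "$violations" ]; then
    echo "Layer violation: gold models reading bronze directly:"
    echo "$violations"
    exit 1
fi
```

Because the check runs against dbt's dependency graph rather than grepping SQL, it catches references introduced through macros as well as plain source() calls.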
Start with the simplest possible implementation: one bronze table per source, one silver model per entity, one gold model per business question. Resist the temptation to over-engineer. A working three-layer architecture with ten models is better than a theoretical fifteen-layer framework that never ships.




