All articles

In-depth guides on streaming pipelines, cost optimization, open table formats, and building reliable data platforms.

Apache Iceberg v3 feature overview showing Variant Type, Nanosecond Timestamps, Deletion Vectors, Geospatial Types, and Table Encryption
Iceberg

Apache Iceberg v3: What's New?

A deep dive into everything new in Apache Iceberg format version 3 — from the Variant type and nanosecond timestamps to deletion vectors, row lineage, geospatial types, and table-level encryption.

David W
11 min read
Apache Iceberg v3 building a multi-engine lakehouse with QueryFlux
Iceberg

Building a Multi-Engine Iceberg Lakehouse with QueryFlux

How to use QueryFlux as the SQL routing layer in front of Trino, DuckDB, and StarRocks — giving every client one endpoint while each query lands on the right Iceberg engine automatically.

David W
11 min read
Lakehouse architecture diagram with layered data tiers
Lakehouse

dbt Incremental Models That Actually Scale

Incremental models promise fast builds, but most teams hit correctness bugs within weeks. Here is how to build dbt incrementals that stay correct through late data and schema changes.

David W
10 min read
Streaming data flow visualization with wave patterns
Streaming

Real-Time CDC with Debezium and Kafka

Build a production CDC pipeline from PostgreSQL to Kafka using Debezium with log-based capture, schema registry, exactly-once delivery, and zero-downtime snapshots.

Chris P
12 min read
Snowflake cost optimization with governance guardrails: minimalist cover with sizing and spend visuals
Platform

Snowflake Cost Optimization Without Slowing Teams Down

A practical playbook for reducing Snowflake compute waste by 30–50% while protecting delivery speed and analyst productivity with governance guardrails.

Chris P
11 min read
Data quality dashboard with pass rate metrics
Quality

Data Quality Contracts with dbt and Soda

How to implement enforceable data contracts between producers and consumers using dbt model contracts, Soda anomaly detection, and CI/CD gates that block bad data before it reaches production.

David W
11 min read
Pipeline orchestration DAG visualization with connected nodes
Orchestration

Migrating from Airflow to Dagster: A Practical Guide

A step-by-step migration path from Apache Airflow's task-centric model to Dagster's asset-based approach, covering code translation, testing patterns, and a realistic timeline.

Chris P
11 min read
Lakehouse architecture diagram with layered data tiers
Lakehouse

Medallion Architecture: Bronze, Silver, Gold Done Right

The medallion architecture is everywhere, but most implementations get the layer boundaries wrong. Here is how to design bronze, silver, and gold tiers that actually scale.

David W
11 min read
Data platform optimization dashboard with metric tiles
Platform

Terraform for Your Data Platform: Infrastructure as Code

Managing Snowflake warehouses, AWS S3 buckets, and IAM roles with Terraform — from provider setup and remote state to CI/CD pipelines that plan on PR and apply on merge.

Chris P
11 min read
Data platform optimization dashboard with metric tiles
Platform

PySpark Performance Tuning: From 4 Hours to 20 Minutes

Practical PySpark optimizations that reduced a production pipeline from 4 hours to 20 minutes — covering data skew, broadcast joins, partition sizing, and AQE.

Chris P
11 min read
Streaming data flow visualization with wave patterns
Streaming

Event-Driven Data Architecture with Kafka

How to design Kafka topic hierarchies, schema evolution strategies, and consumer patterns that reliably scale to thousands of events per second.

Chris P
10 min read
Data quality dashboard with pass rate metrics
Quality

Data Pipeline Observability: From Alerts to Root Cause

Building observability into data pipelines for fast incident detection and root cause analysis — covering freshness, volume, schema, distribution, and lineage.

Chris P
11 min read