
Change Data Capture (CDC) replicates database changes to downstream systems in real time. Instead of querying a database on a schedule to detect what changed, CDC reads the database's internal transaction log — the Write-Ahead Log (WAL) in PostgreSQL, the binlog in MySQL — and streams every insert, update, and delete as an event. This article covers how to build a production CDC pipeline from PostgreSQL to Kafka using Debezium, including WAL configuration, connector setup, schema registry integration, exactly-once delivery, incremental snapshots, monitoring, and dead letter queues.
Why Log-Based CDC
The alternative to log-based CDC is query-based CDC: run a periodic SELECT WHERE updated_at > last_checkpoint against the source database. This approach has three fundamental problems. First, it misses deletes — there is no row to select after a DELETE. Second, it puts load on the production database proportional to the table size, because the query must scan for changed rows. Third, it introduces latency equal to the polling interval; a 5-minute poll means up to 5 minutes of staleness.
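The missed-delete problem is easy to see in miniature. The sketch below simulates query-based polling in plain Python; the table, column, and checkpoint names are invented for illustration:

```python
from datetime import datetime

# Toy simulation of query-based CDC: poll a "table" for rows whose
# updated_at is newer than the last checkpoint.
table = {
    1: {"id": 1, "status": "paid", "updated_at": datetime(2024, 1, 1, 12, 0)},
    2: {"id": 2, "status": "open", "updated_at": datetime(2024, 1, 1, 12, 5)},
}

def poll_changes(table, last_checkpoint):
    """SELECT * WHERE updated_at > last_checkpoint, expressed in Python."""
    return [row for row in table.values() if row["updated_at"] > last_checkpoint]

checkpoint = datetime(2024, 1, 1, 12, 1)

# A DELETE leaves no row behind, so the next poll cannot observe it.
del table[1]

changed = poll_changes(table, checkpoint)
print([row["id"] for row in changed])  # [2]; the delete of row 1 is invisible
```

The WAL, by contrast, records the delete itself as a log entry, which is why log-based CDC does not have this blind spot.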
Log-based CDC avoids all three problems. The WAL captures every transaction — inserts, updates, and deletes — as immutable log entries. Reading the WAL has near-zero impact on the database because it is a sequential read of an append-only file that PostgreSQL is already writing for crash recovery. And latency drops from minutes to seconds, because Debezium tails the WAL in real time.
The tradeoff is operational complexity. Log-based CDC requires configuring the database's replication settings, deploying and monitoring Debezium connectors, and managing schema evolution across the pipeline. This article covers all of it.
Configuring PostgreSQL for WAL-Based CDC
PostgreSQL must be configured to produce logical replication output. By default (wal_level = replica), PostgreSQL writes only physical WAL entries — byte-level page changes — which cannot be decoded into row-level change events. You need to set the WAL level to logical and create a replication slot for Debezium.
```sql
-- postgresql.conf — set these parameters and restart PostgreSQL
-- wal_level = logical
-- max_replication_slots = 4
-- max_wal_senders = 4

-- Verify the WAL level is set correctly
SHOW wal_level;
-- Should return: logical

-- Create a dedicated user for Debezium with replication privileges
CREATE ROLE debezium_user WITH LOGIN PASSWORD 'secure_password' REPLICATION;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO debezium_user;
ALTER DEFAULT PRIVILEGES IN SCHEMA public GRANT SELECT ON TABLES TO debezium_user;

-- Create a publication for the tables you want to capture
CREATE PUBLICATION debezium_pub FOR TABLE
    orders,
    customers,
    payments,
    inventory;
```

The max_replication_slots setting limits how many CDC consumers can connect simultaneously. Set it to at least the number of Debezium connectors you plan to run plus a buffer for replication failover. The max_wal_senders setting controls how many concurrent replication connections PostgreSQL accepts.
One critical detail: replication slots prevent PostgreSQL from discarding WAL segments that the slot has not consumed. If your Debezium connector goes down for an extended period, WAL files accumulate on disk and can fill the volume. Monitor replication slot lag and set up alerts when it exceeds your threshold.
Debezium Connector Configuration
Debezium runs as a Kafka Connect connector. You deploy it to a Kafka Connect cluster and configure it with a JSON spec that tells it which database to connect to, which tables to capture, and how to serialize the change events.
```json
{
  "name": "postgres-cdc-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres.internal",
    "database.port": "5432",
    "database.user": "debezium_user",
    "database.password": "${env:DEBEZIUM_DB_PASSWORD}",
    "database.dbname": "production",
    "topic.prefix": "cdc",
    "schema.include.list": "public",
    "table.include.list": "public.orders,public.customers,public.payments",
    "plugin.name": "pgoutput",
    "publication.name": "debezium_pub",
    "slot.name": "debezium_slot",
    "key.converter": "io.confluent.connect.avro.AvroConverter",
    "key.converter.schema.registry.url": "http://schema-registry:8081",
    "value.converter": "io.confluent.connect.avro.AvroConverter",
    "value.converter.schema.registry.url": "http://schema-registry:8081",
    "tombstones.on.delete": "true",
    "decimal.handling.mode": "double",
    "time.precision.mode": "adaptive_time_microseconds",
    "snapshot.mode": "initial",
    "heartbeat.interval.ms": "10000",
    "errors.tolerance": "all",
    "errors.deadletterqueue.topic.name": "cdc-dlq",
    "errors.deadletterqueue.topic.replication.factor": 3,
    "errors.deadletterqueue.context.headers.enable": true
  }
}
```

Key configuration choices: the plugin.name should be pgoutput for PostgreSQL 10 and above — it is the native logical decoding plugin and does not require installing extensions. The snapshot.mode of initial takes a full snapshot of existing data when the connector first starts, then switches to streaming WAL events. The errors.tolerance and dead letter queue settings ensure that malformed events do not crash the connector — they are routed to a DLQ topic for investigation. (Note that Kafka Connect's dead letter queue routing is officially implemented for sink connectors; for a source connector, errors.tolerance=all primarily keeps conversion failures from stopping the task.)
Schema Registry Integration
Without a schema registry, Debezium serializes change events as JSON strings. This works for prototyping but fails in production for two reasons: consumers cannot validate incoming schemas, and JSON serialization is 3-5x larger than binary Avro or Protobuf encoding.
The AvroConverter configuration in the connector JSON above enables Confluent Schema Registry integration. Debezium automatically registers a schema for each captured table and evolves it when the source schema changes. Set the schema registry's compatibility mode to BACKWARD, which guarantees that consumers using the newest schema can still read events written with the previous one: fields may be removed, and new fields may be added as long as they declare a default. This ensures consumers can handle schema evolution without breaking.
When a column is added to the PostgreSQL source table, Debezium detects the change on the next WAL event from that table and registers a new schema version. Consumers using the Avro deserializer automatically pick up the new schema. When a column is dropped, Debezium registers a new schema version without that field — a change BACKWARD compatibility permits — and consumers still deserializing with an older schema see the missing field filled in with its declared default (typically null for optional columns).
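The default-filling behavior is the core of Avro schema resolution. The snippet below is a deliberately simplified, pure-Python illustration of what the Avro deserializer does with defaults — not the real library, and the field names are invented:

```python
# Toy illustration of Avro-style schema resolution: fields present in the
# record pass through; fields missing from the record get their declared
# default; a missing field with no default is an error.
reader_schema = {
    "fields": [
        {"name": "id"},                              # required, no default
        {"name": "status"},                          # required, no default
        {"name": "discount_code", "default": None},  # added later, with a default
    ]
}

def resolve(record, schema):
    """Return the record as seen through the reader schema."""
    out = {}
    for field in schema["fields"]:
        name = field["name"]
        if name in record:
            out[name] = record[name]
        elif "default" in field:
            out[name] = field["default"]
        else:
            raise ValueError(f"field {name!r} missing and has no default")
    return out

# An event written before discount_code existed still deserializes cleanly:
old_event = {"id": 7, "status": "paid"}
print(resolve(old_event, reader_schema))
# {'id': 7, 'status': 'paid', 'discount_code': None}
```

This is why BACKWARD compatibility insists that added fields carry defaults: without one, old events would fail to deserialize under the new schema.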
Exactly-Once Semantics
Kafka Connect supports exactly-once delivery when running on Kafka 3.3 or later with the following connector worker settings: exactly.once.source.support=enabled and transaction.boundary=poll. With these settings, Debezium commits Kafka offsets and source database positions atomically within a Kafka transaction. If the connector crashes and restarts, it resumes from the last committed position without duplicating or losing events.
Note that exactly-once applies to the connector-to-Kafka segment of the pipeline. Your downstream consumers still need to handle idempotency — writing the same event to your data warehouse twice should produce the same result as writing it once. Use the event's primary key as a deduplication key in your target system.
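Idempotency is simpler than it sounds when the target supports keyed upserts. A minimal sketch, with an in-memory dict standing in for a warehouse table (names are illustrative):

```python
# Idempotent sink sketch: apply CDC events as upserts keyed by primary key,
# so redelivering the same event after a restart is harmless.
warehouse = {}

def upsert(table_rows, event):
    """Apply a CDC 'after' image keyed by primary key; replays are no-ops."""
    row = event["after"]
    table_rows[row["id"]] = row  # same key + same payload => same final state

event = {"op": "u", "after": {"id": 42, "status": "shipped"}}

upsert(warehouse, event)
upsert(warehouse, event)  # duplicate delivery after a consumer restart

print(warehouse)  # {42: {'id': 42, 'status': 'shipped'}}; applied effectively once
```

In a real warehouse this is a MERGE or INSERT ... ON CONFLICT statement keyed on the event's primary key.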
Incremental Snapshots for Zero-Downtime Backfills
The initial snapshot locks the connector to table scanning mode until it finishes reading every row. For large tables (hundreds of millions of rows), this can take hours during which no WAL events are streamed. Debezium's incremental snapshot feature solves this by interleaving snapshot reads with WAL streaming.
Trigger an incremental snapshot by sending a signal to Debezium's signal table or topic. Debezium reads the table in small chunks (configurable batch size), streams each chunk as snapshot events, and continues processing WAL events between chunks. There is no downtime, no missed WAL events, and the snapshot runs at a controlled pace that does not overwhelm the source database.
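The interleaving can be sketched in a few lines. This is a simplification of the shape of Debezium's algorithm (the real implementation also uses watermarks to deduplicate rows that change mid-snapshot); all names here are illustrative:

```python
# Simplified incremental snapshot: read the table in chunks, draining any
# buffered WAL events between chunks, so streaming never stops.
def incremental_snapshot(table_rows, wal_buffer, chunk_size=2):
    emitted = []
    for start in range(0, len(table_rows), chunk_size):
        chunk = table_rows[start:start + chunk_size]
        emitted.extend({"op": "r", "row": row} for row in chunk)  # snapshot reads
        while wal_buffer:  # interleave WAL events that arrived meanwhile
            emitted.append(wal_buffer.pop(0))
    return emitted

rows = [{"id": i} for i in range(1, 6)]
wal = [{"op": "u", "row": {"id": 2}}]  # an update that arrives mid-snapshot
events = incremental_snapshot(rows, wal)
print(len(events))  # 6: five snapshot reads plus one streamed update
```

The key property is that WAL events are never delayed for longer than one chunk, regardless of table size.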
```sql
-- Trigger an incremental snapshot via the signal table
INSERT INTO debezium_signal (id, type, data)
VALUES (
    'snapshot-orders-001',
    'execute-snapshot',
    '{"data-collections": ["public.orders"], "type": "incremental"}'
);
```

This is essential for production operations. When you add a new table to the capture list, you need its historical data without stopping CDC on existing tables. Incremental snapshots make this possible.
Kafka Consumer Example
Here is a Python consumer that reads CDC events from Kafka and writes them to a target system. It demonstrates manual offset management, dead letter queue routing for failed events, and basic deserialization.
```python
from confluent_kafka import DeserializingConsumer
from confluent_kafka.schema_registry import SchemaRegistryClient
from confluent_kafka.schema_registry.avro import AvroDeserializer

schema_registry = SchemaRegistryClient({"url": "http://schema-registry:8081"})

consumer = DeserializingConsumer({
    "bootstrap.servers": "kafka:9092",
    "group.id": "warehouse-sink",
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,
    "key.deserializer": AvroDeserializer(schema_registry),
    "value.deserializer": AvroDeserializer(schema_registry),
})

consumer.subscribe(["cdc.public.orders", "cdc.public.customers"])

# upsert_to_warehouse, soft_delete_in_warehouse, log_error, and
# publish_to_dlq are application-specific helpers defined elsewhere.
def process_cdc_event(event):
    op = event.get("op")  # c=create, u=update, d=delete, r=read (snapshot)
    before = event.get("before")
    after = event.get("after")
    source = event.get("source", {})
    table = source.get("table")

    if op in ("c", "r", "u"):
        upsert_to_warehouse(table, after)
    elif op == "d":
        soft_delete_in_warehouse(table, before)

try:
    while True:
        msg = consumer.poll(timeout=1.0)
        if msg is None:
            continue
        if msg.error():
            log_error(f"Consumer error: {msg.error()}")
            continue
        if msg.value() is None:
            # Tombstone record (follows a delete); nothing to process.
            consumer.commit(message=msg)
            continue

        try:
            process_cdc_event(msg.value())
            consumer.commit(message=msg)
        except Exception as e:
            publish_to_dlq("cdc-consumer-dlq", msg, str(e))
            consumer.commit(message=msg)
finally:
    consumer.close()
```

Development Environment with Docker Compose
For local development and testing, use Docker Compose to spin up the full CDC stack: PostgreSQL, Kafka (running in KRaft mode, so no ZooKeeper is needed), Schema Registry, and Kafka Connect with Debezium.
```yaml
# docker-compose.yml
version: '3.8'
services:
  postgres:
    image: postgres:16
    environment:
      POSTGRES_DB: production
      POSTGRES_USER: postgres
      POSTGRES_PASSWORD: postgres
    command: ["postgres", "-c", "wal_level=logical", "-c", "max_replication_slots=4"]
    ports:
      - "5432:5432"

  kafka:
    image: confluentinc/cp-kafka:7.6.0
    environment:
      KAFKA_NODE_ID: 1
      KAFKA_PROCESS_ROLES: broker,controller
      KAFKA_LISTENERS: PLAINTEXT://0.0.0.0:9092,CONTROLLER://0.0.0.0:9093
      KAFKA_ADVERTISED_LISTENERS: PLAINTEXT://kafka:9092
      KAFKA_CONTROLLER_QUORUM_VOTERS: 1@kafka:9093
      KAFKA_CONTROLLER_LISTENER_NAMES: CONTROLLER
      KAFKA_OFFSETS_TOPIC_REPLICATION_FACTOR: 1
      CLUSTER_ID: "local-dev-cluster-001"
    ports:
      - "9092:9092"

  schema-registry:
    image: confluentinc/cp-schema-registry:7.6.0
    depends_on: [kafka]
    environment:
      SCHEMA_REGISTRY_HOST_NAME: schema-registry
      SCHEMA_REGISTRY_KAFKASTORE_BOOTSTRAP_SERVERS: kafka:9092
    ports:
      - "8081:8081"

  kafka-connect:
    image: debezium/connect:2.7
    depends_on: [kafka, schema-registry, postgres]
    environment:
      BOOTSTRAP_SERVERS: kafka:9092
      GROUP_ID: connect-cluster
      CONFIG_STORAGE_TOPIC: connect-configs
      OFFSET_STORAGE_TOPIC: connect-offsets
      STATUS_STORAGE_TOPIC: connect-status
      KEY_CONVERTER: io.confluent.connect.avro.AvroConverter
      VALUE_CONVERTER: io.confluent.connect.avro.AvroConverter
      CONNECT_KEY_CONVERTER_SCHEMA_REGISTRY_URL: http://schema-registry:8081
      CONNECT_VALUE_CONVERTER_SCHEMA_REGISTRY_URL: http://schema-registry:8081
    ports:
      - "8083:8083"
```

Monitoring and Alerting
In production, three metrics determine whether your CDC pipeline is healthy: connector status, consumer lag, and replication slot lag.
Connector status tells you whether the Debezium connector is running, paused, or failed. Kafka Connect exposes this through its REST API. Poll it every 30 seconds and alert immediately on FAILED status.
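The status endpoint (GET /connectors/&lt;name&gt;/status) returns JSON describing the connector and each of its tasks. A health check has to look at both, because a connector can report RUNNING while one of its tasks has failed. The sketch below stubs out the HTTP fetch with a sample payload so the alert logic is the focus:

```python
import json

def needs_alert(status):
    """True if the connector or any of its tasks reports FAILED."""
    if status["connector"]["state"] == "FAILED":
        return True
    return any(task["state"] == "FAILED" for task in status.get("tasks", []))

# In production, fetch this from the Kafka Connect REST API, e.g.
# status = json.loads(urllib.request.urlopen(status_url).read())
sample = json.loads("""
{
  "name": "postgres-cdc-connector",
  "connector": {"state": "RUNNING", "worker_id": "kafka-connect:8083"},
  "tasks": [{"id": 0, "state": "FAILED", "trace": "..."}]
}
""")
print(needs_alert(sample))  # True: a task failed even though the connector is RUNNING
```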
Consumer lag measures how far behind your downstream consumers are from the latest Kafka offset. High lag means events are being produced faster than consumed, or a consumer has crashed. Use Kafka's built-in consumer group metrics or a tool like Burrow for lag monitoring.
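Per-partition lag is just the log-end offset minus the committed offset. A sketch with invented numbers; in practice these values come from the Kafka AdminClient or from kafka-consumer-groups.sh output:

```python
# Consumer lag per partition: how many events exist that the group
# has not yet committed.
def consumer_lag(end_offsets, committed):
    return {p: end_offsets[p] - committed.get(p, 0) for p in end_offsets}

end = {0: 1500, 1: 1480, 2: 1510}   # latest offset per partition
done = {0: 1500, 1: 1200, 2: 1505}  # committed offset per partition

lag = consumer_lag(end, done)
print(lag)                 # {0: 0, 1: 280, 2: 5}
print(sum(lag.values()))   # 285 total; partition 1 needs attention
```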
Replication slot lag measures how much WAL PostgreSQL is retaining for Debezium. If Debezium falls behind, WAL accumulates and can fill the database's disk. This is the most dangerous failure mode because it affects the source database, not just the pipeline.
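PostgreSQL reports WAL positions as LSNs like 16/B374D848: two hexadecimal numbers giving the high and low 32 bits of a 64-bit byte position. Parsing them client-side lets a monitoring job compute slot lag in bytes, the same quantity pg_wal_lsn_diff() returns server-side (the LSN values below are illustrative):

```python
def lsn_to_bytes(lsn: str) -> int:
    """Convert a PostgreSQL LSN string to an absolute byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) + int(low, 16)

def slot_lag_bytes(current_lsn: str, restart_lsn: str) -> int:
    """Bytes of WAL retained on behalf of a replication slot."""
    return lsn_to_bytes(current_lsn) - lsn_to_bytes(restart_lsn)

# Values as they might appear in pg_replication_slots:
print(slot_lag_bytes("16/B374D848", "16/B0000000"))  # 57989192 bytes (~55 MB)
```

Alert when this number crosses a threshold well below the free space on the database volume.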
```sql
-- Monitor replication slot lag in PostgreSQL
SELECT
    slot_name,
    active,
    pg_size_pretty(
        pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
    ) AS replication_lag,
    pg_size_pretty(
        pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)
    ) AS confirmed_lag
FROM pg_replication_slots
WHERE slot_name = 'debezium_slot';

-- Monitor Kafka Connect connector status
-- curl http://kafka-connect:8083/connectors/postgres-cdc-connector/status

-- Monitor consumer group lag
-- kafka-consumer-groups.sh --bootstrap-server kafka:9092 \
--   --group warehouse-sink --describe
```

Dead Letter Queues
Not every CDC event will process successfully. A schema mismatch, a malformed row, or a downstream system outage can cause individual events to fail. Without a dead letter queue, these failures either crash the connector or get silently dropped.
The connector-level DLQ (configured in the Debezium connector JSON) catches events that fail during serialization or transformation within Kafka Connect. The consumer-level DLQ (shown in the Python consumer example) catches events that fail during downstream processing.
Monitor both DLQ topics. Any message landing in a DLQ should trigger an alert. Review DLQ events daily — they often reveal upstream schema changes, data quality issues, or bugs in your processing logic that need attention before they affect a larger share of events.
Production CDC is not a set-and-forget system. The pipeline has moving parts — the source database, Debezium, Kafka, Schema Registry, and your consumers — and each one can fail independently. Invest in monitoring and dead letter queues from day one. The cost of catching a problem at the DLQ is minutes of investigation; the cost of missing it is hours of data loss and a manual backfill.
Handling Schema Evolution in the Pipeline
Schema changes in the source database are the most common cause of CDC pipeline failures. A developer adds a column to the orders table, and suddenly your downstream consumer fails because it does not recognize the new field. With the Schema Registry and BACKWARD compatibility mode configured as described above, most additive changes (new columns, new tables) are handled automatically.
However, destructive changes — dropping a column, changing a column's type, or renaming a column — require coordination. Before making such a change in the source database, update your consumers to handle the new schema. Then, if the change would be rejected by the registry, temporarily relax the compatibility mode, make the source change, and restore the compatibility mode afterward. This sequence ensures no consumer sees a schema it cannot process.
For column renames, the safest approach is to add the new column, backfill it, update all consumers to read from the new column, and then drop the old column. This is the same pattern used in zero-downtime database migrations, applied to the CDC pipeline.
Operational Runbook
Every production CDC pipeline should have a runbook covering these scenarios. When the connector fails, check the Kafka Connect REST API for the error message, fix the underlying issue (usually a database connectivity or permissions problem), and restart the connector. Debezium resumes from its last committed offset automatically.
When replication slot lag exceeds your threshold, first check whether the connector is running. If it is running but lagging, the source database is producing WAL faster than Debezium can consume it. Increase the connector's max.batch.size and max.queue.size to process events in larger batches. If the connector is down and WAL is accumulating, restart it immediately — PostgreSQL will not reclaim that disk space until the slot catches up.
When consumer lag spikes, check whether the consumer process is healthy and whether downstream systems (your data warehouse, for example) are accepting writes normally. Consumer lag often correlates with target system slowdowns, not Kafka issues. Scale your consumer group by adding more instances if throughput is the bottleneck.
When DLQ messages appear, investigate immediately. Each DLQ message includes headers with the original topic, partition, offset, and error message. Use these to reproduce the failure, fix the consumer logic, and replay the failed events from the DLQ topic.
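Replaying a DLQ event means reading it from the DLQ topic and re-publishing it to its original topic, which is named in the record's context headers. Kafka Connect uses the __connect.errors.* header convention; the sketch below stubs out the consumer/producer plumbing and shows only the routing decision (the header values are invented):

```python
# DLQ replay helper sketch: pick the destination topic for a DLQ record
# from its Kafka Connect context headers.
def replay_target(headers):
    """Return the original topic named in the DLQ record's headers."""
    header_map = {k: v for k, v in headers}
    topic = header_map.get("__connect.errors.topic")
    if topic is None:
        raise ValueError("DLQ record missing original-topic header")
    return topic

dlq_headers = [
    ("__connect.errors.topic", "cdc.public.orders"),
    ("__connect.errors.partition", "3"),
    ("__connect.errors.offset", "90125"),
    ("__connect.errors.exception.message", "Avro deserialization failed"),
]
print(replay_target(dlq_headers))  # cdc.public.orders
```

Only replay after fixing the consumer logic, or the events will simply land back in the DLQ.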
A well-operated CDC pipeline delivers sub-second latency from database commit to Kafka topic with 99.99 percent reliability. The key is treating the operational scaffolding — monitoring, alerting, DLQs, and the runbook — as essential components of the system, not optional add-ons.




