Monitoring Jobs

Effective monitoring keeps your CDC pipelines running smoothly. This guide covers the key metrics to watch, the portal dashboards, alerting, log analysis, and health checks.

Key Metrics

Throughput

Metric                Description                   Healthy Range
events_per_second     Events processed per second   Depends on workload
batches_per_minute    Batches sent to sink          > 0
bytes_per_second      Data throughput               Depends on workload

Latency

Metric                  Description             Healthy Range
replication_lag_bytes   Bytes behind source     < 1 MB
replication_lag_ms      Time behind source      < 5000 ms
batch_latency_ms        Time to process batch   < 1000 ms
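
replication_lag_bytes can also be cross-checked on the source itself. The following is a minimal sketch assuming a PostgreSQL source; ezcdc_orders_slot is the example slot name from the worker logs later in this guide, so substitute your job's slot.

# Hedged sketch: measure how far the CDC slot's confirmed position trails the source (PostgreSQL)
psql "$SOURCE_DSN" -c "
  SELECT slot_name,
         pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS replication_lag_bytes
  FROM pg_replication_slots
  WHERE slot_name = 'ezcdc_orders_slot';"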

Health

Metric           Description             Healthy Value
worker_status    Worker health           healthy
job_status       Job state               running
last_heartbeat   Last worker check-in    < 60 seconds ago

Portal Dashboard

Job Overview

┌─────────────────────────────────────────────────────────────────────────────┐
│ Job: orders-to-analytics ● RUNNING│
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Throughput Replication Lag │
│ ┌─────────────────────────┐ ┌─────────────────────────┐ │
│ │ ╱╲ ╱╲ │ │ │ │
│ │ ╱ ╲ ╱ ╲ ╱╲ │ │ ───────────────────── │ │
│ │ ╱ ╲╱ ╲ ╱ ╲ │ │ 512 bytes │ │
│ │ ╱ ╲ ╲ │ │ │ │
│ └─────────────────────────┘ └─────────────────────────┘ │
│ 1,247 events/sec 0.5 KB lag │
│ │
│ Tables Statistics │
│ ┌─────────────────────────┐ ┌─────────────────────────┐ │
│ │ orders ● 245/sec │ │ Total Events: 2.4M │ │
│ │ order_items ● 892/sec │ │ Total Batches: 240 │ │
│ │ customers ● 110/sec │ │ Uptime: 2h 34m │ │
│ └─────────────────────────┘ │ Last Event: 2s ago │ │
│ └─────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘

Worker Health

┌─────────────────────────────────────────────────────────────────────────────┐
│ Deployment: production │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Workers │
│ ┌───────────────────────────────────────────────────────────────────────┐ │
│ │ i-0abc123 │ ● Healthy │ CPU: 25% │ Memory: 15% │ Jobs: 3 │ us-west-2a│ │
│ │ i-0def456 │ ● Healthy │ CPU: 18% │ Memory: 12% │ Jobs: 2 │ us-west-2b│ │
│ │ i-0ghi789 │ ● Healthy │ CPU: 22% │ Memory: 14% │ Jobs: 3 │ us-west-2c│ │
│ └───────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘

Viewing Metrics

Job Metrics

View job metrics in the portal: Jobs → your job → Metrics

Available metrics:

  • Events per second
  • Batches per minute
  • Replication lag (bytes and ms)
  • Total events processed
  • Per-table breakdown

Worker Metrics

View worker metrics in the portal: Deployments → your deployment → Workers

Available metrics:

  • CPU usage
  • Memory usage
  • Disk usage
  • Jobs running per worker
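
The portal is the primary place to view these, but if the workers run on EC2 instances (as the dashboard above suggests), host-level metrics can also be pulled straight from CloudWatch. A sketch; i-0abc123 is the example instance ID from the dashboard, so substitute your own.

# Hedged sketch: average CPU utilization for one worker over the last hour (AWS/EC2 namespace)
# Note: the date commands below use GNU date syntax
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0abc123 \
  --statistics Average \
  --period 300 \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)"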

Alerting

Alert                  Condition                            Severity
High Replication Lag   lag_bytes > 10 MB for 5 min          Warning
Very High Lag          lag_bytes > 100 MB for 5 min         Critical
Job Failed             status == failed                     Critical
Worker Unhealthy       heartbeat > 2 min                    Critical
No Events              events_per_second == 0 for 5 min     Warning
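
How you implement these rules depends on where your metrics land. As one hedged example, if replication_lag_bytes is exported to CloudWatch, the first rule maps to a standard alarm; the EZCDC namespace, JobName dimension, and SNS topic ARN below are placeholders rather than names the product defines.

# Hedged sketch: "High Replication Lag" as a CloudWatch alarm (10 MB sustained for 5 minutes)
aws cloudwatch put-metric-alarm \
  --alarm-name ezcdc-orders-high-replication-lag \
  --namespace EZCDC \
  --metric-name replication_lag_bytes \
  --dimensions Name=JobName,Value=orders-to-analytics \
  --statistic Maximum \
  --period 60 \
  --evaluation-periods 5 \
  --threshold 10485760 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-west-2:123456789012:cdc-alerts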

Log Analysis

Worker Logs

Logs are available in CloudWatch:

Log Group: /ez-cdc/workers/{deployment-name}
Log Stream: {instance-id}/worker-agent.log

Example log entries:

2024-01-15T10:30:00Z INFO  Starting job orders-to-analytics
2024-01-15T10:30:01Z INFO  Created replication slot ezcdc_orders_slot
2024-01-15T10:30:02Z INFO  Starting streaming from LSN 0/1A3E5F8
2024-01-15T10:30:05Z INFO  Processed batch: 10000 events in 45ms

Querying Logs

# AWS CLI
aws logs filter-log-events \
  --log-group-name /ez-cdc/workers/production \
  --filter-pattern "ERROR"

# CloudWatch Insights
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 100
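
With AWS CLI v2 you can also follow the log group live while reproducing an issue:

# Live-tail worker logs, showing only ERROR entries from the last hour onward
aws logs tail /ez-cdc/workers/production \
  --since 1h \
  --follow \
  --filter-pattern "ERROR"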

Troubleshooting with Metrics

High Lag Investigation

Symptom: Lag increasing steadily

Check:
1. Events/sec vs historical baseline
└─ Source write rate increased?

2. Batch latency (see the Insights query after this list)
└─ Sink performance degraded?

3. Worker CPU/Memory
└─ Worker overloaded?

4. Network metrics
└─ Connectivity issues?
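
For step 2, batch latency can be pulled out of the worker logs with CloudWatch Insights, since each batch is logged as "Processed batch: N events in Xms" (see the example entries above). A sketch of such a query, assuming that log format:

# Hedged sketch: per-5-minute batch latency and volume parsed from worker log lines
fields @timestamp, @message
| parse @message "Processed batch: * events in *ms" as batch_events, batch_latency_ms
| filter ispresent(batch_latency_ms)
| stats avg(batch_latency_ms), max(batch_latency_ms), sum(batch_events) by bin(5m)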

No Events Processing

Symptom: events_per_second = 0

Check:
1. Job status
└─ Is job running?

2. Source activity
└─ Are writes happening to source?

3. Worker connectivity
└─ Can worker reach source?

4. Replication slot (see the query after this list)
└─ Is slot active?
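
For step 4, the slot state can be checked directly on the source. A minimal sketch assuming a PostgreSQL source; substitute your job's slot name for the example one:

# Hedged sketch: is the CDC replication slot present and currently in use? (PostgreSQL)
psql "$SOURCE_DSN" -c "
  SELECT slot_name, active, active_pid
  FROM pg_replication_slots
  WHERE slot_name = 'ezcdc_orders_slot';"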

Health Checks

gRPC Health Check

Workers expose gRPC health check:

grpcurl -plaintext worker:50051 dbmazz.HealthService/Check

Response:

{
  "status": "SERVING",
  "stage": "STAGE_CDC",
  "stageDetail": "Replicating",
  "errorDetail": ""
}
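
For ad-hoc liveness monitoring you can wrap this check in a small polling loop. This sketch assumes grpcurl and jq are available on the host and uses the endpoint and service shown above:

# Hedged sketch: poll the worker health check and exit non-zero once it stops SERVING
while true; do
  status=$(grpcurl -plaintext worker:50051 dbmazz.HealthService/Check | jq -r '.status')
  if [ "$status" != "SERVING" ]; then
    echo "worker unhealthy: status=$status" >&2
    exit 1
  fi
  sleep 30
done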

HTTP Health Check (Future)

curl http://worker:8080/health

Best Practices

1. Set Baseline Metrics

After initial deployment:

  • Record normal throughput range
  • Note typical lag values
  • Document peak hours

2. Use Multi-Level Alerts

Warning → Investigate
Critical → Immediate action

3. Correlate with Source Metrics

Compare CDC metrics with:

  • Source database write rate
  • Source connection count
  • Source WAL generation rate (a sampling sketch follows)
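
WAL generation rate can be approximated by sampling the source's current LSN twice over a fixed interval. A rough sketch assuming a PostgreSQL source:

# Hedged sketch: approximate WAL generation rate in bytes/sec over a 60-second window
start=$(psql "$SOURCE_DSN" -Atc "SELECT pg_current_wal_lsn();")
sleep 60
end=$(psql "$SOURCE_DSN" -Atc "SELECT pg_current_wal_lsn();")
psql "$SOURCE_DSN" -Atc "SELECT pg_wal_lsn_diff('$end', '$start') / 60 AS wal_bytes_per_sec;"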

4. Regular Review

Weekly review of:

  • Lag trends
  • Error rates
  • Capacity utilization

Next Steps