Monitoring Jobs
Effective monitoring keeps your CDC pipelines running smoothly. This guide covers key metrics, the portal dashboard, alerting, log analysis, troubleshooting, and health checks.
Key Metrics
Throughput
| Metric | Description | Healthy Range |
|---|---|---|
| events_per_second | Events processed per second | Depends on workload |
| batches_per_minute | Batches sent to sink | > 0 |
| bytes_per_second | Data throughput | Depends on workload |
Latency
| Metric | Description | Healthy Range |
|---|---|---|
| replication_lag_bytes | Bytes behind source | < 1 MB |
| replication_lag_ms | Time behind source | < 5000 ms |
| batch_latency_ms | Time to process batch | < 1000 ms |
Health
| Metric | Description | Healthy Value |
|---|---|---|
| worker_status | Worker health | healthy |
| job_status | Job state | running |
| last_heartbeat | Last worker check-in | < 60 seconds ago |
Portal Dashboard
Job Overview
┌─────────────────────────────────────────────────────────────────────────────┐
│ Job: orders-to-analytics ● RUNNING│
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Throughput Replication Lag │
│ ┌─────────────────────────┐ ┌─────────────────────────┐ │
│ │ ╱╲ ╱╲ │ │ │ │
│ │ ╱ ╲ ╱ ╲ ╱╲ │ │ ───────────────────── │ │
│ │ ╱ ╲╱ ╲ ╱ ╲ │ │ 512 bytes │ │
│ │ ╱ ╲ ╲ │ │ │ │
│ └─────────────────────────┘ └─────────────────────────┘ │
│ 1,247 events/sec 0.5 KB lag │
│ │
│ Tables Statistics │
│ ┌─────────────────────────┐ ┌─────────────────────────┐ │
│ │ orders ● 245/sec │ │ Total Events: 2.4M │ │
│ │ order_items ● 892/sec │ │ Total Batches: 240 │ │
│ │ customers ● 110/sec │ │ Uptime: 2h 34m │ │
│ └─────────────────────────┘ │ Last Event: 2s ago │ │
│ └─────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
Worker Health
┌─────────────────────────────────────────────────────────────────────────────┐
│ Deployment: production │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Workers │
│ ┌───────────────────────────────────────────────────────────────────────┐ │
│ │ i-0abc123 │ ● Healthy │ CPU: 25% │ Memory: 15% │ Jobs: 3 │ us-west-2a│ │
│ │ i-0def456 │ ● Healthy │ CPU: 18% │ Memory: 12% │ Jobs: 2 │ us-west-2b│ │
│ │ i-0ghi789 │ ● Healthy │ CPU: 22% │ Memory: 14% │ Jobs: 3 │ us-west-2c│ │
│ └───────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Viewing Metrics
Job Metrics
View job metrics in the portal: Jobs → your job → Metrics
Available metrics:
- Events per second
- Batches per minute
- Replication lag (bytes and ms)
- Total events processed
- Per-table breakdown
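If your deployment also publishes job metrics to CloudWatch, they can be pulled from the CLI. The namespace EZCDC/Jobs and the JobName dimension below are assumptions; substitute whatever your deployment is actually configured to publish:
# Average replication lag (bytes) for one job over the last hour.
# Namespace and dimension name are assumptions; adjust to your deployment.
# Date arithmetic uses GNU date.
aws cloudwatch get-metric-statistics \
  --namespace "EZCDC/Jobs" \
  --metric-name replication_lag_bytes \
  --dimensions Name=JobName,Value=orders-to-analytics \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 60 \
  --statistics Average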
Worker Metrics
View worker metrics in the portal: Deployments → your deployment → Workers
Available metrics:
- CPU usage
- Memory usage
- Disk usage
- Jobs running per worker
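Worker CPU is also available from the built-in AWS/EC2 CloudWatch metrics (memory and disk metrics require the CloudWatch agent on the instance). For example, for one of the workers shown above:
# CPU utilization for a single worker instance over the last hour.
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0abc123 \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 300 \
  --statistics Average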
Alerting
Recommended Alerts
| Alert | Condition | Severity |
|---|---|---|
| High Replication Lag | replication_lag_bytes > 10 MB for 5 min | Warning |
| Very High Replication Lag | replication_lag_bytes > 100 MB for 5 min | Critical |
| Job Failed | job_status == failed | Critical |
| Worker Unhealthy | last_heartbeat > 2 min ago | Critical |
| No Events | events_per_second == 0 for 5 min | Warning |
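If the lag metric is exported to CloudWatch, the warning-level alert can be created as a CloudWatch alarm. The namespace, dimension, and SNS topic ARN below are placeholders; substitute your own:
# Warning alarm: replication lag above 10 MB (10485760 bytes) for 5 consecutive minutes.
# Namespace, dimension, and SNS topic ARN are placeholders; adjust to your setup.
aws cloudwatch put-metric-alarm \
  --alarm-name "ezcdc-orders-high-replication-lag" \
  --namespace "EZCDC/Jobs" \
  --metric-name replication_lag_bytes \
  --dimensions Name=JobName,Value=orders-to-analytics \
  --statistic Average \
  --period 60 \
  --evaluation-periods 5 \
  --threshold 10485760 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-west-2:123456789012:cdc-alerts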
Log Analysis
Worker Logs
Logs are available in CloudWatch:
Log Group: /ez-cdc/workers/{deployment-name}
Log Stream: {instance-id}/worker-agent.log
Example log entries:
2024-01-15T10:30:00Z INFO Starting job orders-to-analytics
2024-01-15T10:30:01Z INFO Created replication slot ezcdc_orders_slot
2024-01-15T10:30:02Z INFO Starting streaming from LSN 0/1A3E5F8
2024-01-15T10:30:05Z INFO Processed batch: 10000 events in 45ms
Querying Logs
# AWS CLI
aws logs filter-log-events \
--log-group-name /ez-cdc/workers/production \
--filter-pattern "ERROR"
# CloudWatch Insights
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 100
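For live debugging you can also tail the log group directly (requires AWS CLI v2):
# Follow worker logs in real time, showing only error lines.
aws logs tail /ez-cdc/workers/production --follow --filter-pattern "ERROR"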
Troubleshooting with Metrics
High Lag Investigation
Symptom: Lag increasing steadily
Check:
1. Events/sec vs historical baseline
└─ Source write rate increased?
2. Batch latency
└─ Sink performance degraded?
3. Worker CPU/Memory
└─ Worker overloaded?
4. Network metrics
└─ Connectivity issues?
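Because jobs replicate from a logical replication slot (see the worker logs above), a quick way to quantify lag on a PostgreSQL source is to compare the slot's confirmed position with the current WAL position. The slot name is taken from the example logs and the connection details are placeholders; use your job's actual values:
# How far behind the current WAL position is the job's replication slot?
# Host, user, database, and slot name are placeholders.
psql -h source-db -U replicator -d orders -c "
  SELECT slot_name,
         pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS lag_bytes
  FROM pg_replication_slots
  WHERE slot_name = 'ezcdc_orders_slot';"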
No Events Processing
Symptom: events_per_second = 0
Check:
1. Job status
└─ Is job running?
2. Source activity
└─ Are writes happening to source?
3. Worker connectivity
└─ Can worker reach source?
4. Replication slot
└─ Is slot active?
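Check 4 can be answered directly on a PostgreSQL source; an inactive slot means no consumer is currently attached. Connection details and slot name below are placeholders:
# Is any consumer currently attached to the slot?
psql -h source-db -U replicator -d orders -c "
  SELECT slot_name, active, active_pid, restart_lsn
  FROM pg_replication_slots
  WHERE slot_name = 'ezcdc_orders_slot';"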
Health Checks
gRPC Health Check
Workers expose a gRPC health check:
grpcurl -plaintext worker:50051 dbmazz.HealthService/Check
Response:
{
"status": "SERVING",
"stage": "STAGE_CDC",
"stageDetail": "Replicating",
"errorDetail": ""
}
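A minimal way to wire this into ad-hoc monitoring is to poll the endpoint and flag anything that is not SERVING. This sketch assumes grpcurl and jq are installed and that worker:50051 is the worker's gRPC address:
# Poll the worker health endpoint every 30 seconds and log non-SERVING responses.
while true; do
  status=$(grpcurl -plaintext worker:50051 dbmazz.HealthService/Check | jq -r '.status')
  if [ "$status" != "SERVING" ]; then
    echo "$(date -u +%FT%TZ) worker not SERVING: $status"
  fi
  sleep 30
done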
HTTP Health Check (Future)
curl http://worker:8080/health
Best Practices
1. Set Baseline Metrics
After initial deployment:
- Record normal throughput range
- Note typical lag values
- Document peak hours
2. Use Multi-Level Alerts
Warning → Investigate
Critical → Immediate action
3. Correlate with Source Metrics
Compare CDC metrics with:
- Source database write rate
- Source connection count
- Source WAL generation rate (see the sketch below)
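On a PostgreSQL source, the WAL generation rate can be sampled with two readings of the current WAL position. Connection details are placeholders:
# Approximate WAL generation rate: bytes of WAL written over a 60-second window.
before=$(psql -h source-db -U replicator -d orders -Atc "SELECT pg_current_wal_lsn();")
sleep 60
after=$(psql -h source-db -U replicator -d orders -Atc "SELECT pg_current_wal_lsn();")
psql -h source-db -U replicator -d orders -Atc \
  "SELECT pg_wal_lsn_diff('$after', '$before') / 60 AS wal_bytes_per_second;"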
4. Regular Review
Weekly review of:
- Lag trends
- Error rates
- Capacity utilization