Monitoring Jobs
Effective monitoring keeps your CDC pipelines running smoothly. This guide covers key metrics, the portal dashboard, alerting, log analysis, troubleshooting, and health checks.
Key Metrics
Throughput
| Metric | Description | Healthy Range |
|---|---|---|
| events_per_second | Events processed per second | Depends on workload |
| batches_per_minute | Batches sent to sink | > 0 |
| bytes_per_second | Data throughput | Depends on workload |
Latency
| Metric | Description | Healthy Range |
|---|---|---|
| replication_lag_bytes | Bytes behind source | < 1 MB |
| replication_lag_ms | Time behind source | < 5000 ms |
| batch_latency_ms | Time to process batch | < 1000 ms |
Health
| Metric | Description | Healthy Value |
|---|---|---|
| worker_status | Worker health | healthy |
| job_status | Job state | running |
| last_heartbeat | Last worker check-in | < 60 seconds ago |
Portal Dashboard
Job Overview
┌─────────────────────────────────────────────────────────────────────────────┐
│ Job: orders-to-analytics ● RUNNING│
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Throughput Replication Lag │
│ ┌─────────────────────────┐ ┌─────────────────────────┐ │
│ │ ╱╲ ╱╲ │ │ │ │
│ │ ╱ ╲ ╱ ╲ ╱╲ │ │ ───────────────────── │ │
│ │ ╱ ╲╱ ╲ ╱ ╲ │ │ 512 bytes │ │
│ │ ╱ ╲ ╲ │ │ │ │
│ └─────────────────────────┘ └─────────────────────────┘ │
│ 1,247 events/sec 0.5 KB lag │
│ │
│ Tables Statistics │
│ ┌─────────────────────────┐ ┌─────────────────────────┐ │
│ │ orders ● 245/sec │ │ Total Events: 2.4M │ │
│ │ order_items ● 892/sec │ │ Total Batches: 240 │ │
│ │ customers ● 110/sec │ │ Uptime: 2h 34m │ │
│ └─────────────────────────┘ │ Last Event: 2s ago │ │
│ └─────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
Worker Health
┌─────────────────────────────────────────────────────────────────────────────┐
│ Deployment: production │
├─────────────────────────────────────────────────────────────────────────────┤
│ │
│ Workers │
│ ┌───────────────────────────────────────────────────────────────────────┐ │
│ │ i-0abc123 │ ● Healthy │ CPU: 25% │ Memory: 15% │ Jobs: 3 │ us-west-2a│ │
│ │ i-0def456 │ ● Healthy │ CPU: 18% │ Memory: 12% │ Jobs: 2 │ us-west-2b│ │
│ │ i-0ghi789 │ ● Healthy │ CPU: 22% │ Memory: 14% │ Jobs: 3 │ us-west-2c│ │
│ └───────────────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────────┘
Viewing Metrics
Job Metrics
View job metrics in the portal: Jobs → your job → Metrics
Available metrics:
- Events per second
- Batches per minute
- Replication lag (bytes and ms)
- Total events processed
- Per-table breakdown
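If your deployment also publishes job metrics to CloudWatch, they can be pulled from the CLI. The namespace EZCDC/Jobs and the JobName dimension below are assumptions; substitute whatever your deployment is actually configured to publish:
# Average replication lag (bytes) for one job over the last hour.
# Namespace and dimension name are assumptions; adjust to your deployment.
# Date arithmetic uses GNU date.
aws cloudwatch get-metric-statistics \
  --namespace "EZCDC/Jobs" \
  --metric-name replication_lag_bytes \
  --dimensions Name=JobName,Value=orders-to-analytics \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 60 \
  --statistics Average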
Worker Metrics
View worker metrics in the portal: Deployments → your deployment → Workers
Available metrics:
- CPU usage
- Memory usage
- Disk usage
- Jobs running per worker
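Worker CPU is also available from the built-in AWS/EC2 CloudWatch metrics (memory and disk metrics require the CloudWatch agent on the instance). For example, for one of the workers shown above:
# CPU utilization for a single worker instance over the last hour.
aws cloudwatch get-metric-statistics \
  --namespace AWS/EC2 \
  --metric-name CPUUtilization \
  --dimensions Name=InstanceId,Value=i-0abc123 \
  --start-time "$(date -u -d '1 hour ago' +%Y-%m-%dT%H:%M:%SZ)" \
  --end-time "$(date -u +%Y-%m-%dT%H:%M:%SZ)" \
  --period 300 \
  --statistics Average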
Alerting
Recommended Alerts
| Alert | Condition | Severity |
|---|---|---|
| High Replication Lag | replication_lag_bytes > 10 MB for 5 min | Warning |
| Very High Replication Lag | replication_lag_bytes > 100 MB for 5 min | Critical |
| Job Failed | job_status == failed | Critical |
| Worker Unhealthy | last_heartbeat > 2 min ago | Critical |
| No Events | events_per_second == 0 for 5 min | Warning |
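If the lag metric is exported to CloudWatch, the warning-level alert can be created as a CloudWatch alarm. The namespace, dimension, and SNS topic ARN below are placeholders; substitute your own:
# Warning alarm: replication lag above 10 MB (10485760 bytes) for 5 consecutive minutes.
# Namespace, dimension, and SNS topic ARN are placeholders; adjust to your setup.
aws cloudwatch put-metric-alarm \
  --alarm-name "ezcdc-orders-high-replication-lag" \
  --namespace "EZCDC/Jobs" \
  --metric-name replication_lag_bytes \
  --dimensions Name=JobName,Value=orders-to-analytics \
  --statistic Average \
  --period 60 \
  --evaluation-periods 5 \
  --threshold 10485760 \
  --comparison-operator GreaterThanThreshold \
  --alarm-actions arn:aws:sns:us-west-2:123456789012:cdc-alerts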
Log Analysis
Worker Logs
Logs are available in CloudWatch:
Log Group: /ez-cdc/workers/{deployment-name}
Log Stream: {instance-id}/worker-agent.log
Example log entries:
2024-01-15T10:30:00Z INFO Starting job orders-to-analytics
2024-01-15T10:30:01Z INFO Created replication slot ezcdc_orders_slot
2024-01-15T10:30:02Z INFO Starting streaming from LSN 0/1A3E5F8
2024-01-15T10:30:05Z INFO Processed batch: 10000 events in 45ms
Querying Logs
# AWS CLI
aws logs filter-log-events \
--log-group-name /ez-cdc/workers/production \
--filter-pattern "ERROR"
# CloudWatch Insights
fields @timestamp, @message
| filter @message like /ERROR/
| sort @timestamp desc
| limit 100
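For live debugging you can also tail the log group directly (requires AWS CLI v2):
# Follow worker logs in real time, showing only error lines.
aws logs tail /ez-cdc/workers/production --follow --filter-pattern "ERROR"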
Troubleshooting with Metrics
High Lag Investigation
Symptom: Lag increasing steadily
Check:
1. Events/sec vs historical baseline
└─ Source write rate increased?
2. Batch latency
└─ Sink performance degraded?
3. Worker CPU/Memory
└─ Worker overloaded?
4. Network metrics
└─ Connectivity issues?
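Because jobs replicate from a logical replication slot (see the worker logs above), a quick way to quantify lag on a PostgreSQL source is to compare the slot's confirmed position with the current WAL position. The slot name is taken from the example logs and the connection details are placeholders; use your job's actual values:
# How far behind the current WAL position is the job's replication slot?
# Host, user, database, and slot name are placeholders.
psql -h source-db -U replicator -d orders -c "
  SELECT slot_name,
         pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn) AS lag_bytes
  FROM pg_replication_slots
  WHERE slot_name = 'ezcdc_orders_slot';"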
No Events Processing
Symptom: events_per_second = 0
Check:
1. Job status
└─ Is job running?
2. Source activity
└─ Are writes happening to source?
3. Worker connectivity
└─ Can worker reach source?
4. Replication slot
└─ Is slot active?
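Check 4 can be answered directly on a PostgreSQL source; an inactive slot means no consumer is currently attached. Connection details and slot name below are placeholders:
# Is any consumer currently attached to the slot?
psql -h source-db -U replicator -d orders -c "
  SELECT slot_name, active, active_pid, restart_lsn
  FROM pg_replication_slots
  WHERE slot_name = 'ezcdc_orders_slot';"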
Health Checks
gRPC Health Check
Workers expose a gRPC health check:
grpcurl -plaintext worker:50051 dbmazz.HealthService/Check
Response:
{
"status": "SERVING",
"stage": "STAGE_CDC",
"stageDetail": "Replicating",
"errorDetail": ""
}
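A minimal way to wire this into ad-hoc monitoring is to poll the endpoint and flag anything that is not SERVING. This sketch assumes grpcurl and jq are installed and that worker:50051 is the worker's gRPC address:
# Poll the worker health endpoint every 30 seconds and log non-SERVING responses.
while true; do
  status=$(grpcurl -plaintext worker:50051 dbmazz.HealthService/Check | jq -r '.status')
  if [ "$status" != "SERVING" ]; then
    echo "$(date -u +%FT%TZ) worker not SERVING: $status"
  fi
  sleep 30
done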
HTTP Health Check (Future)
curl http://worker:8080/health
Best Practices
1. Set Baseline Metrics
After initial deployment:
- Record normal throughput range
- Note typical lag values
- Document peak hours
2. Use Multi-Level Alerts
Warning → Investigate
Critical → Immediate action
3. Correlate with Source Metrics
Compare CDC metrics with:
- Source database write rate
- Source connection count
- Source WAL generation rate (see the sketch below)
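On a PostgreSQL source, the WAL generation rate can be sampled with two readings of the current WAL position. Connection details are placeholders:
# Approximate WAL generation rate: bytes of WAL written over a 60-second window.
before=$(psql -h source-db -U replicator -d orders -Atc "SELECT pg_current_wal_lsn();")
sleep 60
after=$(psql -h source-db -U replicator -d orders -Atc "SELECT pg_current_wal_lsn();")
psql -h source-db -U replicator -d orders -Atc \
  "SELECT pg_wal_lsn_diff('$after', '$before') / 60 AS wal_bytes_per_second;"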
4. Regular Review
Weekly review of:
- Lag trends
- Error rates
- Capacity utilization