Troubleshooting Jobs
This guide helps diagnose and resolve common CDC job issues.
Diagnostic Workflow
1. Check job status
└─ Portal → Jobs → Status
2. Review error message
└─ Portal → Jobs → Error Details
3. Check worker logs
└─ CloudWatch → /ez-cdc/workers/
4. Verify connectivity
└─ Test source and sink connections
5. Check resource health
└─ Replication slot, publication, tables
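For the last step, the source-side resources can be inspected directly in PostgreSQL. A minimal sketch (the publication name is an example; adjust to your job's naming):

```sql
-- Replication slot state
SELECT slot_name, active, restart_lsn FROM pg_replication_slots;

-- Publications defined on the source
SELECT pubname FROM pg_publication;
```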
Common Issues
Job Won't Start
Symptom: Job stuck in "Pending"
Possible Causes:

- No healthy workers. Check the deployment:

  ```
  Deployment: production
  Workers: 0/1 healthy  ← Problem
  ```

  Solution: Check worker health, restart if needed

- Worker can't reach the control plane. Check worker logs:

  ```
  ERROR Failed to connect to control plane: connection refused
  ```

  Solution: Verify network connectivity and security groups

- All workers at capacity:

  ```
  Worker i-abc123: 5/5 jobs (max reached)
  ```

  Solution: Scale up workers or stop unused jobs
Symptom: Job fails immediately after starting

Check the error message:

- `Error: password authentication failed for user "ezcdc_user"`
  Solution: Verify the datasource credentials
- `Error: database "mydb" does not exist`
  Solution: Check the database name in the datasource config
- `Error: permission denied for table orders`
  Solution: Grant the required permissions to the CDC user
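Granting the needed permissions might look like the following sketch, assuming logical replication and a CDC role named `ezcdc_user` (the table names are examples):

```sql
-- Allow the CDC user to read the captured tables
GRANT SELECT ON TABLE orders, customers TO ezcdc_user;

-- Logical replication also requires the REPLICATION attribute
ALTER ROLE ezcdc_user WITH REPLICATION;
```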
Connection Issues
Source Connection Failed
```
Error: could not connect to server: Connection refused
Is the server running on host "postgres.example.com" and accepting
TCP/IP connections on port 5432?
```
Checklist:
- PostgreSQL is running
- Listening on correct host/port
- Security group allows worker access
- No firewall blocking connection
Diagnosis:

```shell
# From the worker (via SSM)
nc -zv postgres.example.com 5432
```
Sink Connection Failed
Error: Failed to connect to StarRocks: Connection refused
Checklist:
- StarRocks FE is running (port 9030)
- StarRocks BE is running (port 8040)
- Security groups allow worker access
- Load balancer healthy (if using)
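The FE can also be probed over its MySQL-protocol port (9030) with any MySQL client; a quick health sketch:

```sql
-- Run against the StarRocks FE after connecting on port 9030
SHOW FRONTENDS;
SHOW BACKENDS;   -- the Alive column should read true for every BE
```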
Replication Issues
Replication Slot Not Found
Error: replication slot "ezcdc_orders_slot" does not exist
Causes:
- Slot was manually deleted
- Slot expired (if configured)
- Different PostgreSQL instance
Solution:
- Job will auto-recreate slot on restart
- Data since slot deletion may be lost
- Consider full resnapshot
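If you need to recreate the slot manually rather than waiting for the job restart, a sketch (the slot name and the `pgoutput` plugin are examples; the plugin must match what the job actually uses):

```sql
-- Recreate the logical replication slot by hand
SELECT pg_create_logical_replication_slot('ezcdc_orders_slot', 'pgoutput');

-- Confirm it exists
SELECT slot_name, plugin, active FROM pg_replication_slots;
```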
Slot Lag Growing (WAL Retention)
```sql
SELECT slot_name,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS lag
FROM pg_replication_slots;
-- Result:
-- ezcdc_orders_slot | 50 GB  ← Problem!
```
Causes:
- Job stopped/failed for extended period
- Consumer too slow
- Sink unavailable
Solution:
- Resume job immediately
- Increase consumer throughput
- If unrecoverable, drop slot and resnapshot
Large slot lag retains WAL and can fill the source database's disk. Monitor and alert on slot lag.
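On PostgreSQL 13 and later, `max_slot_wal_keep_size` caps how much WAL a lagging slot can pin, trading slot invalidation for disk safety. A sketch (the 10 GB limit is an example value):

```sql
ALTER SYSTEM SET max_slot_wal_keep_size = '10GB';
SELECT pg_reload_conf();
```

Note that once a slot falls past this limit it is invalidated and the job will need a resnapshot.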
Publication Missing Tables
Error: table "customers" is not in publication "ezcdc_orders_pub"
Causes:
- Table added after job creation
- Publication manually modified
Solution:

```sql
ALTER PUBLICATION ezcdc_orders_pub ADD TABLE customers;
```
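To verify the publication's table list before and after the change:

```sql
SELECT * FROM pg_publication_tables
WHERE pubname = 'ezcdc_orders_pub';
```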
Sink Issues
Stream Load Timeout
Error: Stream Load timeout after 300 seconds
Causes:
- Batch too large
- StarRocks overloaded
- Network latency
Solutions:
- Reduce batch size
- Check StarRocks BE performance
- Increase timeout if needed
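The Stream Load timeout is passed as a request header. A dry-run sketch that only prints the command (the FE host, database, table, credentials, and the default HTTP port 8030 are all placeholder assumptions):

```shell
# Build (but do not send) a Stream Load request with a longer timeout
TIMEOUT_SECS=600
STREAM_LOAD_URL="http://starrocks-fe:8030/api/mydb/orders/_stream_load"
CMD="curl -sS -XPUT -u ezcdc_user: \
  -H \"Expect: 100-continue\" \
  -H \"timeout: ${TIMEOUT_SECS}\" \
  -H \"format: json\" \
  -T batch.json ${STREAM_LOAD_URL}"
echo "$CMD"   # remove the echo to actually send the batch
```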
Stream Load Memory Error
Error: Memory of process is overloaded
Solutions:
- Reduce batch size in job config
- Increase BE memory
- Reduce parallel loads
Stream Load Type Error
Error: Invalid value for column 'total': cannot convert 'abc' to DECIMAL
Causes:
- Data type mismatch
- Source data corruption
Solutions:
- Check error log for specific rows
- Fix source data or exclude table
- Modify sink column type if appropriate
Performance Issues
High Replication Lag
Replication Lag: 500 MB (growing)
Diagnosis:
- Check throughput metrics
- Compare to normal baseline
- Identify bottleneck
Common causes and solutions:
| Cause | Solution |
|---|---|
| High source write rate | Scale workers, increase batch size |
| Slow sink | Optimize sink, add resources |
| Network latency | Check connectivity |
| Worker overloaded | Scale horizontally |
Events Per Second = 0
Events/sec: 0 (for 5 minutes)
Checklist:
- Job is running (not paused/failed)
- Source has activity
- Worker is healthy
- Replication slot is active
Diagnosis:

```sql
-- Check slot activity
SELECT slot_name, active
FROM pg_replication_slots
WHERE slot_name = 'ezcdc_orders_slot';
```
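If the slot shows as active but events still read zero, check whether the consumer is actually advancing (the slot name is an example):

```sql
-- Distance between the current WAL position and what the consumer has confirmed
SELECT slot_name,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS unconsumed
FROM pg_replication_slots
WHERE slot_name = 'ezcdc_orders_slot';
```

A steadily growing `unconsumed` value with zero events points at the consumer or sink, not the source.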
Worker Issues
Worker Not Registering
Check worker logs:
ERROR Failed to register with control plane: authentication failed
Solutions:
- Verify `METRICS_AUTH_TOKEN` is set
- Check the token matches the deployment
- Verify the control plane URL
Worker High CPU/Memory
Worker i-abc123:
CPU: 95%
Memory: 90%
Solutions:
- Reduce number of jobs per worker
- Use larger instance type
- Scale horizontally (add workers)
Worker Lost Connection
```
WARN  Lost connection to control plane, reconnecting...
ERROR Failed to reconnect after 3 attempts
```
Causes:
- Network issue
- Control plane maintenance
- Security group change
Solutions:
- Check worker network connectivity
- Verify security groups unchanged
- Worker will auto-reconnect when possible
Recovery Procedures
Full Resnapshot
If data is inconsistent:
- Stop the job
- Truncate sink tables (optional)
- Delete and recreate the job with `snapshot_mode: initial`
Reset Replication Position
If slot is too far behind:
```sql
-- 1. Drop the old slot
SELECT pg_drop_replication_slot('ezcdc_orders_slot');
-- 2. Restart the job (it will recreate the slot)
```

This may cause data loss. Consider a full resnapshot.
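Before dropping, confirm no consumer is still attached; `pg_drop_replication_slot` fails on an active slot:

```sql
-- Stop the job first, then verify the slot is inactive
SELECT slot_name, active FROM pg_replication_slots
WHERE slot_name = 'ezcdc_orders_slot';
```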
Recover from Sink Corruption
- Stop job
- Fix or recreate sink tables
- Reset job checkpoint (or full resnapshot)
- Resume job
Getting Help
If issues persist:

1. Collect diagnostics:
   - Job status and error message
   - Worker logs (last 100 lines)
   - Source/sink connectivity test results
2. Contact support:
   - Email: support@ez-cdc.com
   - Include the deployment ID and job ID