Troubleshooting Jobs

This guide helps diagnose and resolve common CDC job issues.

Diagnostic Workflow

1. Check job status
└─ Portal → Jobs → Status

2. Review error message
└─ Portal → Jobs → Error Details

3. Check worker logs
└─ CloudWatch → /ez-cdc/workers/

4. Verify connectivity
└─ Test source and sink connections

5. Check resource health
└─ Replication slot, publication, tables
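
For step 5, a quick health check on the source PostgreSQL might look like the sketch below (the slot and publication names follow the examples used throughout this guide; substitute your own):

-- Is the replication slot present and active?
SELECT slot_name, active, restart_lsn
FROM pg_replication_slots
WHERE slot_name = 'ezcdc_orders_slot';

-- Which tables does the publication cover?
SELECT *
FROM pg_publication_tables
WHERE pubname = 'ezcdc_orders_pub';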

Common Issues

Job Won't Start

Symptom: Job stuck in "Pending"

Possible Causes:

  1. No healthy workers

    Check deployment:

    Deployment: production
    Workers: 0/1 healthy ← Problem

    Solution: Check worker health, restart if needed

  2. Worker can't reach control plane

    Check worker logs:

    ERROR Failed to connect to control plane: connection refused

    Solution: Verify network connectivity, security groups

  3. All workers at capacity

    Worker i-abc123: 5/5 jobs (max reached)

    Solution: Scale up workers or stop unused jobs

Symptom: Job fails immediately after starting

Check error message:

Error: password authentication failed for user "ezcdc_user"

Solution: Verify datasource credentials

Error: database "mydb" does not exist

Solution: Check database name in datasource config

Error: permission denied for table orders

Solution: Grant required permissions to CDC user
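
For the permission error, typical grants for a PostgreSQL logical-replication CDC user look like this sketch; the exact set depends on your deployment (user, schema, and table names follow the examples in this guide):

-- Allow the CDC user to read the schema and table (needed for snapshots)
GRANT USAGE ON SCHEMA public TO ezcdc_user;
GRANT SELECT ON TABLE orders TO ezcdc_user;

-- Logical replication also requires the REPLICATION attribute
ALTER ROLE ezcdc_user WITH REPLICATION;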


Connection Issues

Source Connection Failed

Error: could not connect to server: Connection refused
Is the server running on host "postgres.example.com" and accepting
TCP/IP connections on port 5432?

Checklist:

  • PostgreSQL is running
  • Listening on correct host/port
  • Security group allows worker access
  • No firewall blocking connection

Diagnosis:

# From worker (via SSM)
nc -zv postgres.example.com 5432
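
If the port check fails even though PostgreSQL is up, confirm what interface and port the server is actually listening on. A minimal check, run as an admin from a client that can still connect (for example, locally on the database host):

-- Confirm the listen address and port settings
SHOW listen_addresses;
SHOW port;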

Sink Connection Failed

Error: Failed to connect to StarRocks: Connection refused

Checklist:

  • StarRocks FE is running (port 9030)
  • StarRocks BE is running (port 8040)
  • Security groups allow worker access
  • Load balancer healthy (if using)
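
To confirm the StarRocks cluster itself is healthy, run the standard admin commands against the FE over the MySQL protocol (port 9030); every BE should report Alive: true:

SHOW FRONTENDS;
SHOW BACKENDS;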

Replication Issues

Replication Slot Not Found

Error: replication slot "ezcdc_orders_slot" does not exist

Causes:

  • Slot was manually deleted
  • Slot expired (if configured)
  • Different PostgreSQL instance

Solution:

  1. The job will recreate the slot automatically on restart
  2. Changes made since the slot was deleted may be lost
  3. If consistency matters, run a full resnapshot

Slot Lag Growing (WAL Retention)

SELECT slot_name,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS lag
FROM pg_replication_slots;

-- Result:
-- ezcdc_orders_slot | 50 GB ← Problem!

Causes:

  • Job stopped/failed for extended period
  • Consumer too slow
  • Sink unavailable

Solution:

  1. Resume job immediately
  2. Increase consumer throughput
  3. If unrecoverable, drop slot and resnapshot
Danger: Large slot lag can fill the source disk. Monitor and alert on slot lag.
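
One way to bound this risk: on PostgreSQL 13 or later you can cap how much WAL a slot is allowed to retain, so a stalled job cannot fill the disk. The size below is only an illustrative value; note that once the cap is exceeded the slot is invalidated and a resnapshot is required:

-- Cap per-slot WAL retention (PostgreSQL 13+); requires superuser
ALTER SYSTEM SET max_slot_wal_keep_size = '20GB';
SELECT pg_reload_conf();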

Publication Missing Tables

Error: table "customers" is not in publication "ezcdc_orders_pub"

Causes:

  • Table added after job creation
  • Publication manually modified

Solution:

ALTER PUBLICATION ezcdc_orders_pub ADD TABLE customers;

Sink Issues

Stream Load Timeout

Error: Stream Load timeout after 300 seconds

Causes:

  • Batch too large
  • StarRocks overloaded
  • Network latency

Solutions:

  1. Reduce batch size
  2. Check StarRocks BE performance
  3. Increase timeout if needed

Stream Load Memory Error

Error: Memory of process is overloaded

Solutions:

  1. Reduce batch size in job config
  2. Increase BE memory
  3. Reduce parallel loads

Stream Load Type Error

Error: Invalid value for column 'total': cannot convert 'abc' to DECIMAL

Causes:

  • Data type mismatch
  • Source data corruption

Solutions:

  1. Check error log for specific rows
  2. Fix source data or exclude table
  3. Modify sink column type if appropriate
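
To find the offending rows on the source, a simple numeric sanity check can help (table and column names follow the error above; the regex only accepts plain decimal numbers, so adjust it for formats such as scientific notation):

-- Run on the source PostgreSQL: list rows whose 'total' will not cast to DECIMAL
SELECT *
FROM orders
WHERE total::text !~ '^-?[0-9]+(\.[0-9]+)?$';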

Performance Issues

High Replication Lag

Replication Lag: 500 MB (growing)

Diagnosis:

  1. Check throughput metrics
  2. Compare to normal baseline
  3. Identify bottleneck

Common causes and solutions:

  Cause                     Solution
  ------------------------  ----------------------------------
  High source write rate    Scale workers, increase batch size
  Slow sink                 Optimize sink, add resources
  Network latency           Check connectivity
  Worker overloaded         Scale horizontally

Events Per Second = 0

Events/sec: 0 (for 5 minutes)

Checklist:

  • Job is running (not paused/failed)
  • Source has activity
  • Worker is healthy
  • Replication slot is active

Diagnosis:

-- Check slot activity
SELECT slot_name, active
FROM pg_replication_slots
WHERE slot_name = 'ezcdc_orders_slot';
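
If the slot is active but nothing is flowing, confirm the source database actually has write activity; the commit and tuple counters should increase between samples (the database name is the example used earlier in this guide):

-- Run on the source PostgreSQL; sample twice and compare
SELECT datname, xact_commit, tup_inserted, tup_updated, tup_deleted
FROM pg_stat_database
WHERE datname = 'mydb';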

Worker Issues

Worker Not Registering

Check worker logs:

ERROR Failed to register with control plane: authentication failed

Solutions:

  1. Verify METRICS_AUTH_TOKEN is set
  2. Check token matches deployment
  3. Verify control plane URL

Worker High CPU/Memory

Worker i-abc123:
CPU: 95%
Memory: 90%

Solutions:

  1. Reduce number of jobs per worker
  2. Use larger instance type
  3. Scale horizontally (add workers)

Worker Lost Connection

WARN Lost connection to control plane, reconnecting...
ERROR Failed to reconnect after 3 attempts

Causes:

  • Network issue
  • Control plane maintenance
  • Security group change

Solutions:

  1. Check worker network connectivity
  2. Verify security groups unchanged
  3. Worker will auto-reconnect when possible

Recovery Procedures

Full Resnapshot

If data is inconsistent:

  1. Stop the job
  2. Truncate sink tables (optional; see the example below)
  3. Delete and recreate job with snapshot_mode: initial
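
For the optional step 2, a minimal example against the StarRocks sink (the table name is illustrative; repeat for each sink table in the job):

TRUNCATE TABLE orders;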

Reset Replication Position

If slot is too far behind:

-- 1. Drop old slot
SELECT pg_drop_replication_slot('ezcdc_orders_slot');

-- 2. Restart job (will recreate slot)
Warning: This may cause data loss. Consider a full resnapshot.

Recover from Sink Corruption

  1. Stop job
  2. Fix or recreate sink tables
  3. Reset job checkpoint (or full resnapshot)
  4. Resume job

Getting Help

If issues persist:

  1. Collect diagnostics:

    • Job status and error message
    • Worker logs (last 100 lines)
    • Source/sink connectivity test results
  2. Contact support: