Troubleshooting Jobs

This guide helps diagnose and resolve common CDC job issues.

Diagnostic Workflow

1. Check job status
└─ Portal → Jobs → Status

2. Review error message
└─ Portal → Jobs → Error Details

3. Check worker logs
└─ CloudWatch → /ez-cdc/workers/

4. Verify connectivity
└─ Test source and sink connections

5. Check resource health
└─ Replication slot, publication, tables
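
For step 5, a quick health check on the source PostgreSQL might look like the sketch below (the slot and publication names follow the examples used throughout this guide; substitute your own):

-- Is the replication slot present and active?
SELECT slot_name, active, restart_lsn
FROM pg_replication_slots
WHERE slot_name = 'ezcdc_orders_slot';

-- Which tables does the publication cover?
SELECT *
FROM pg_publication_tables
WHERE pubname = 'ezcdc_orders_pub';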

Common Issues

Job Won't Start

Symptom: Job stuck in "Pending"

Possible Causes:

  1. No healthy workers

    Check deployment:

    Deployment: production
    Workers: 0/1 healthy ← Problem

    Solution: Check worker health, restart if needed

  2. Worker can't reach control plane

    Check worker logs:

    ERROR Failed to connect to control plane: connection refused

    Solution: Verify network connectivity, security groups

  3. All workers at capacity

    Worker i-abc123: 5/5 jobs (max reached)

    Solution: Scale up workers or stop unused jobs

Symptom: Job fails immediately after starting

Check error message:

Error: password authentication failed for user "ezcdc_user"

Solution: Verify datasource credentials

Error: database "mydb" does not exist

Solution: Check database name in datasource config

Error: permission denied for table orders

Solution: Grant required permissions to CDC user
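
For the permission error, typical grants for a PostgreSQL logical-replication CDC user look like this sketch; the exact set depends on your deployment (user, schema, and table names follow the examples in this guide):

-- Allow the CDC user to read the schema and table (needed for snapshots)
GRANT USAGE ON SCHEMA public TO ezcdc_user;
GRANT SELECT ON TABLE orders TO ezcdc_user;

-- Logical replication also requires the REPLICATION attribute
ALTER ROLE ezcdc_user WITH REPLICATION;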


Connection Issues

Source Connection Failed

Error: could not connect to server: Connection refused
Is the server running on host "postgres.example.com" and accepting
TCP/IP connections on port 5432?

Checklist:

  • PostgreSQL is running
  • Listening on correct host/port
  • Security group allows worker access
  • No firewall blocking connection

Diagnosis:

# From worker (via SSM)
nc -zv postgres.example.com 5432
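
If the port check fails even though PostgreSQL is up, confirm what interface and port the server is actually listening on. A minimal check, run as an admin from a client that can still connect (for example, locally on the database host):

-- Confirm the listen address and port settings
SHOW listen_addresses;
SHOW port;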

Sink Connection Failed

Error: Failed to connect to StarRocks: Connection refused

Checklist:

  • StarRocks FE is running (port 9030)
  • StarRocks BE is running (port 8040)
  • Security groups allow worker access
  • Load balancer healthy (if using)
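
To confirm the StarRocks cluster itself is healthy, run the standard admin commands against the FE over the MySQL protocol (port 9030); every BE should report Alive: true:

SHOW FRONTENDS;
SHOW BACKENDS;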

Replication Issues

Replication Slot Not Found

Error: replication slot "ezcdc_orders_slot" does not exist

Causes:

  • Slot was manually deleted
  • Slot expired (if configured)
  • Different PostgreSQL instance

Solution:

  1. The job will recreate the slot automatically on restart
  2. Changes made since the slot was deleted may be lost
  3. If consistency matters, run a full resnapshot

Slot Lag Growing (WAL Retention)

SELECT slot_name,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS lag
FROM pg_replication_slots;

-- Result:
-- ezcdc_orders_slot | 50 GB ← Problem!

Causes:

  • Job stopped/failed for extended period
  • Consumer too slow
  • Sink unavailable

Solution:

  1. Resume job immediately
  2. Increase consumer throughput
  3. If unrecoverable, drop slot and resnapshot
Danger: Large slot lag can fill the source disk. Monitor and alert on slot lag.
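
One way to bound this risk: on PostgreSQL 13 or later you can cap how much WAL a slot is allowed to retain, so a stalled job cannot fill the disk. The size below is only an illustrative value; note that once the cap is exceeded the slot is invalidated and a resnapshot is required:

-- Cap per-slot WAL retention (PostgreSQL 13+); requires superuser
ALTER SYSTEM SET max_slot_wal_keep_size = '20GB';
SELECT pg_reload_conf();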

Publication Missing Tables

Error: table "customers" is not in publication "ezcdc_orders_pub"

Causes:

  • Table added after job creation
  • Publication manually modified

Solution:

ALTER PUBLICATION ezcdc_orders_pub ADD TABLE customers;

Sink Issues

Stream Load Timeout

Error: Stream Load timeout after 300 seconds

Causes:

  • Batch too large
  • StarRocks overloaded
  • Network latency

Solutions:

  1. Reduce batch size
  2. Check StarRocks BE performance
  3. Increase timeout if needed

Stream Load Memory Error

Error: Memory of process is overloaded

Solutions:

  1. Reduce batch size in job config
  2. Increase BE memory
  3. Reduce parallel loads

Stream Load Type Error

Error: Invalid value for column 'total': cannot convert 'abc' to DECIMAL

Causes:

  • Data type mismatch
  • Source data corruption

Solutions:

  1. Check error log for specific rows
  2. Fix source data or exclude table
  3. Modify sink column type if appropriate
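
To find the offending rows on the source, a simple numeric sanity check can help (table and column names follow the error above; the regex only accepts plain decimal numbers, so adjust it for formats such as scientific notation):

-- Run on the source PostgreSQL: list rows whose 'total' will not cast to DECIMAL
SELECT *
FROM orders
WHERE total::text !~ '^-?[0-9]+(\.[0-9]+)?$';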

Performance Issues

High Replication Lag

Replication Lag: 500 MB (growing)

Diagnosis:

  1. Check throughput metrics
  2. Compare to normal baseline
  3. Identify bottleneck

Common causes and solutions:

  Cause                     Solution
  ------------------------  ----------------------------------
  High source write rate    Scale workers, increase batch size
  Slow sink                 Optimize sink, add resources
  Network latency           Check connectivity
  Worker overloaded         Scale horizontally

Events Per Second = 0

Events/sec: 0 (for 5 minutes)

Checklist:

  • Job is running (not paused/failed)
  • Source has activity
  • Worker is healthy
  • Replication slot is active

Diagnosis:

-- Check slot activity
SELECT slot_name, active
FROM pg_replication_slots
WHERE slot_name = 'ezcdc_orders_slot';
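
If the slot is active but nothing is flowing, confirm the source database actually has write activity; the commit and tuple counters should increase between samples (the database name is the example used earlier in this guide):

-- Run on the source PostgreSQL; sample twice and compare
SELECT datname, xact_commit, tup_inserted, tup_updated, tup_deleted
FROM pg_stat_database
WHERE datname = 'mydb';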

Worker Issues

Worker Not Registering

Check worker logs:

ERROR Failed to register with control plane: authentication failed

Solutions:

  1. Verify METRICS_AUTH_TOKEN is set
  2. Check token matches deployment
  3. Verify control plane URL

Worker High CPU/Memory

Worker i-abc123:
CPU: 95%
Memory: 90%

Solutions:

  1. Reduce number of jobs per worker
  2. Use larger instance type
  3. Scale horizontally (add workers)

Worker Lost Connection

WARN Lost connection to control plane, reconnecting...
ERROR Failed to reconnect after 3 attempts

Causes:

  • Network issue
  • Control plane maintenance
  • Security group change

Solutions:

  1. Check worker network connectivity
  2. Verify security groups unchanged
  3. Worker will auto-reconnect when possible

Recovery Procedures

Full Resnapshot

If data is inconsistent:

  1. Stop the job
  2. Truncate sink tables (optional; see the example below)
  3. Delete and recreate job with snapshot_mode: initial
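
For the optional step 2, a minimal example against the StarRocks sink (the table name is illustrative; repeat for each sink table in the job):

TRUNCATE TABLE orders;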

Reset Replication Position

If slot is too far behind:

-- 1. Drop old slot
SELECT pg_drop_replication_slot('ezcdc_orders_slot');

-- 2. Restart job (will recreate slot)
Warning: This may cause data loss. Consider a full resnapshot.

Recover from Sink Corruption

  1. Stop job
  2. Fix or recreate sink tables
  3. Reset job checkpoint (or full resnapshot)
  4. Resume job

Getting Help

If issues persist:

  1. Collect diagnostics:

    • Job status and error message
    • Worker logs (last 100 lines)
    • Source/sink connectivity test results
  2. Contact support: