Job Lifecycle
Understanding job states and transitions helps you manage CDC pipelines effectively.
Job States
State Descriptions
| State | Description | Actions Available |
|---|---|---|
| created | Job created, not yet assigned | Start, Delete |
| pending | Waiting for worker assignment | Cancel |
| starting | Worker initializing job | Cancel |
| running | Actively replicating | Pause, Stop |
| paused | Replication paused | Resume, Stop |
| failed | Error occurred | Retry, Stop, Delete |
| stopped | Manually stopped | Start, Delete |
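
For scripting or client-side validation, the table above can be encoded as a lookup. This is only an illustrative sketch: `JobState`, `AVAILABLE_ACTIONS`, and `can_perform` are hypothetical names, not part of any EZ-CDC API.

```python
from enum import Enum


class JobState(str, Enum):
    CREATED = "created"
    PENDING = "pending"
    STARTING = "starting"
    RUNNING = "running"
    PAUSED = "paused"
    FAILED = "failed"
    STOPPED = "stopped"


# Actions available in each state, copied from the table above.
AVAILABLE_ACTIONS = {
    JobState.CREATED: {"start", "delete"},
    JobState.PENDING: {"cancel"},
    JobState.STARTING: {"cancel"},
    JobState.RUNNING: {"pause", "stop"},
    JobState.PAUSED: {"resume", "stop"},
    JobState.FAILED: {"retry", "stop", "delete"},
    JobState.STOPPED: {"start", "delete"},
}


def can_perform(state: str, action: str) -> bool:
    """Return True if the table lists `action` for `state`."""
    return action.lower() in AVAILABLE_ACTIONS[JobState(state)]
```

For example, `can_perform("running", "Pause")` is true, while `can_perform("running", "delete")` is false because a running job must be stopped first.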
State Transitions
Start Job
created/stopped → pending → starting → running
Starting a job triggers the following steps:
- Finds available worker
- Creates replication resources
- Begins streaming
Pause Job
running → paused
- Worker holds position
- No new events processed
- Resources retained
Resume Job
paused → running
- Continues from last position
- No data loss
Stop Job
running/paused → stopped
- Graceful shutdown
- Commits pending batches
- Releases resources only if cleanup is requested (retained by default)
Retry Failed Job
failed → pending → starting → running
- Worker re-assigned
- Resumes from last checkpoint
- May re-process some events
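
The transitions in this section can be summarized as a small state machine. The sketch below is an assumption-laden illustration rather than the EZ-CDC implementation: `TRANSITIONS` and `apply_action` are hypothetical names, and the table only covers the transitions listed above.

```python
# (current state, action) -> states the job passes through, in order.
TRANSITIONS = {
    ("created", "start"): ["pending", "starting", "running"],
    ("stopped", "start"): ["pending", "starting", "running"],
    ("running", "pause"): ["paused"],
    ("paused", "resume"): ["running"],
    ("running", "stop"): ["stopped"],
    ("paused", "stop"): ["stopped"],
    ("failed", "retry"): ["pending", "starting", "running"],
}


def apply_action(state: str, action: str) -> list:
    """Return the path of states for `action`, or raise if it is not listed."""
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(f"cannot {action} a job in state {state!r}") from None
```

Calling `apply_action("failed", "retry")` returns the same path as a fresh start, which mirrors the retry flow described above.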
Job Phases
Within the running state, jobs progress through phases:
1. Initialization Phase
[INIT] Setting up replication resources
├─ Creating publication... done
├─ Creating replication slot... done
└─ Creating destination tables... done
2. Snapshot Phase
[SNAPSHOT] Loading initial data
├─ public.orders: 100% (245,000 rows)
├─ public.order_items: 67% (804,000/1,200,000 rows)
└─ public.customers: Pending
3. Streaming Phase
[STREAMING] Replicating changes
├─ Events/sec: 1,247
├─ Lag: 512 bytes
└─ Uptime: 2h 34m
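
The per-table figures in the snapshot phase are simple ratios of rows loaded to rows expected. The sketch below reproduces the formatting of the sample output; `table_progress` is a hypothetical helper, and the rounding rule (integer floor) is an assumption.

```python
def table_progress(loaded: int, total: int) -> str:
    """Format snapshot progress for one table in the style shown above."""
    if loaded == 0:
        return "Pending"
    if loaded >= total:
        return f"100% ({total:,} rows)"
    percent = 100 * loaded // total  # floor to a whole percent
    return f"{percent}% ({loaded:,}/{total:,} rows)"


print(table_progress(245_000, 245_000))
print(table_progress(804_000, 1_200_000))
print(table_progress(0, 50_000))  # the total for this table is made up
```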
Controlling Jobs
Via Portal
- Go to Jobs → select job
- Use action buttons:
  - Pause: Temporarily stop replication
  - Resume: Continue replication
  - Stop: End the job
  - Restart: Stop and start fresh
Failure Handling
Automatic Retry
For transient failures, EZ-CDC retries automatically:
| Failure Type | Retry Behavior |
|---|---|
| Network timeout | Retry 3x with backoff |
| Connection lost | Reconnect and resume |
| Rate limit | Backoff and retry |
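
To make "retry 3x with backoff" concrete, here is a minimal sketch of the pattern. The exact schedule EZ-CDC uses is not documented here, so the doubling delay, the `TransientError` class, and the `retry_with_backoff` function are all assumptions for illustration.

```python
import time


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying (e.g. a network timeout)."""


def retry_with_backoff(operation, max_retries=3, base_delay=1.0, sleep=time.sleep):
    """Call `operation`; on TransientError, retry up to `max_retries` times.

    The delay doubles after each failed attempt: base_delay, 2x, 4x, ...
    """
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_retries:
                raise  # retries exhausted: surface the failure
            sleep(base_delay * 2 ** attempt)
```

Passing `sleep` as a parameter keeps the function testable without real waiting. Once the retries are exhausted, the error propagates, which corresponds to the job entering the failed state and needing manual intervention.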
Manual Intervention
For persistent failures:
- Check error message in portal
- Fix underlying issue (credentials, network, etc.)
- Click Retry to restart
Common Failures
| Error | Cause | Resolution |
|---|---|---|
| replication slot not found | Slot deleted | Recreate slot, may need resnapshot |
| connection refused | Network issue | Check connectivity |
| permission denied | Auth issue | Verify credentials |
| table not found | Schema change | Update table selection |
Checkpointing
Jobs maintain checkpoints to enable resumption:
Checkpoint:
LSN: 0/1A3E5F8
Table: public.orders
Last Commit: 2024-01-15 10:30:00 UTC
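
The LSN in the checkpoint has the shape of a PostgreSQL log sequence number: two hexadecimal parts that together form a 64-bit byte position. Assuming a PostgreSQL source, the sketch below converts an LSN to an integer so that two positions can be compared or subtracted to get lag in bytes. `parse_lsn` and `lag_bytes` are hypothetical helper names.

```python
def parse_lsn(lsn: str) -> int:
    """Convert an LSN such as '0/1A3E5F8' into a 64-bit byte position."""
    high, low = lsn.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def lag_bytes(source_lsn: str, checkpoint_lsn: str) -> int:
    """Bytes of WAL between the source position and the checkpoint."""
    return parse_lsn(source_lsn) - parse_lsn(checkpoint_lsn)


# The source position here is invented for the example.
print(lag_bytes("0/1A3E7F8", "0/1A3E5F8"))  # prints 512
```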
Checkpoint Frequency
Checkpoints are saved:
- After each successful batch
- Before graceful shutdown
- Periodically (every 30 seconds)
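
The three triggers above can be combined into one decision. This is a sketch under stated assumptions, not the actual worker logic: `CheckpointPolicy` and its methods are hypothetical, and only the 30-second interval comes from the list above.

```python
import time


class CheckpointPolicy:
    """Decide when a checkpoint should be saved."""

    def __init__(self, interval_seconds=30.0, clock=time.monotonic):
        self._interval = interval_seconds
        self._clock = clock
        self._last_saved = clock()

    def should_save(self, batch_committed=False, shutting_down=False) -> bool:
        # Periodic trigger: the interval has elapsed since the last save.
        interval_elapsed = self._clock() - self._last_saved >= self._interval
        return batch_committed or shutting_down or interval_elapsed

    def mark_saved(self) -> None:
        """Record that a checkpoint was just written."""
        self._last_saved = self._clock()
```

Injecting `clock` makes the periodic trigger easy to exercise in tests without waiting 30 seconds.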
Resume from Checkpoint
When a job restarts:
- Reads last checkpoint
- Connects to source at checkpoint LSN
- Continues streaming
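Because the checkpoint can trail the last event actually applied, some events may be re-processed after a restart. Idempotent writes at the destination make this replay safe.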
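
To see why re-processing is harmless when writes are idempotent, consider this minimal sketch. It models the destination table as a dictionary keyed by primary key; the event shape (`op`, `id`, `row`) is invented for the example and is not the EZ-CDC event format.

```python
def apply_event(table: dict, event: dict) -> None:
    """Apply one change event keyed by primary key; safe to apply twice."""
    if event["op"] == "delete":
        table.pop(event["id"], None)  # deleting a missing row is a no-op
    else:
        # Inserts and updates are both written as an upsert.
        table[event["id"]] = event["row"]


events = [
    {"op": "insert", "id": 1, "row": {"status": "new"}},
    {"op": "update", "id": 1, "row": {"status": "paid"}},
    {"op": "insert", "id": 2, "row": {"status": "new"}},
    {"op": "delete", "id": 2},
]

table = {}
for event in events:
    apply_event(table, event)
first_pass = dict(table)

# Simulate a restart whose checkpoint only covered the first event:
# the remaining events are replayed in order.
for event in events[1:]:
    apply_event(table, event)

print(table == first_pass)  # prints True
```

The replay starts at the checkpoint and proceeds in the original order, so the destination converges to the same final state.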
Resource Cleanup
When Stopping
By default, stopping a job retains resources:
- Replication slot (maintains position)
- Publication (table list)
- Destination tables (with data)
Full Cleanup
To release all resources, stop the job with the Cleanup resources option enabled in the portal. This will:
- Drop replication slot
- Drop publication
- Keep destination tables (data preserved)
Job Deletion
Deleting a job:
- Stops the job (if running)
- Drops replication slot
- Drops publication
- Removes job from system
Destination tables are NOT deleted.
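
The effect of each operation on source and destination resources can be summarized in a lookup, which may be handy when auditing what a job left behind. The operation and resource names below are hypothetical labels chosen for this sketch.

```python
# What happens to each resource for each operation described in this section.
RESOURCE_EFFECTS = {
    "stop": {
        "replication_slot": "kept",
        "publication": "kept",
        "destination_tables": "kept",
    },
    "stop_with_cleanup": {
        "replication_slot": "dropped",
        "publication": "dropped",
        "destination_tables": "kept",
    },
    "delete": {
        "replication_slot": "dropped",
        "publication": "dropped",
        "destination_tables": "kept",
    },
}


def dropped_by(operation: str) -> list:
    """Return the resources that `operation` removes, sorted by name."""
    effects = RESOURCE_EFFECTS[operation]
    return sorted(name for name, effect in effects.items() if effect == "dropped")
```

In every case the destination tables survive, so removing replicated data is always a separate, manual step.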
Monitoring State Changes
Webhooks (Future)
Receive notifications on state changes:
{
  "event": "job.state_changed",
  "job_id": "job_abc123",
  "previous_state": "running",
  "new_state": "failed",
  "error": "Connection refused",
  "timestamp": "2024-01-15T10:30:00Z"
}
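
A receiver for this payload might look like the sketch below. Webhooks are marked as a future feature, so the payload shape may change; the field names are taken from the sample above, while `parse_state_change` and `needs_attention` are hypothetical helpers.

```python
import json
from datetime import datetime


def parse_state_change(body: str) -> dict:
    """Parse a job.state_changed payload into Python values."""
    payload = json.loads(body)
    if payload.get("event") != "job.state_changed":
        raise ValueError("unexpected event type")
    return {
        "job_id": payload["job_id"],
        "previous_state": payload["previous_state"],
        "new_state": payload["new_state"],
        "error": payload.get("error"),  # assumed absent when there is no error
        # Replace the trailing 'Z' so older Python versions can parse it.
        "timestamp": datetime.fromisoformat(
            payload["timestamp"].replace("Z", "+00:00")
        ),
    }


def needs_attention(change: dict) -> bool:
    """Flag transitions into the failed state."""
    return change["new_state"] == "failed"
```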
Portal Monitoring
Monitor job states in real time through the EZ-CDC Portal under Jobs → your job → Status.
Best Practices
1. Use Pause for Maintenance
Before source database maintenance:
- Go to Jobs → your job
- Click Pause
- Perform maintenance
- Click Resume
2. Monitor Failed Jobs
Set up alerts for:
- Jobs in the failed state for more than 5 minutes
- Repeated failures (flapping)
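
Both alert conditions can be expressed as small checks over job state history. This is a sketch only: the function names are hypothetical, and the flapping rule (three or more failures within one hour) is an assumed threshold, since the document gives a number only for the 5-minute rule.

```python
from datetime import datetime, timedelta


def failed_too_long(state, failed_since, now, threshold=timedelta(minutes=5)):
    """True if the job has been in the failed state longer than `threshold`."""
    return state == "failed" and now - failed_since > threshold


def is_flapping(failure_times, now, window=timedelta(hours=1), max_failures=3):
    """True if at least `max_failures` failures occurred within `window`."""
    recent = [t for t in failure_times if now - t <= window]
    return len(recent) >= max_failures
```

A monitoring script could run these checks on a schedule against whatever state history is available, for instance timestamps noted from the portal.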
3. Clean Up Unused Jobs
Stopped jobs still hold resources such as the replication slot and publication. If a job is no longer needed, delete it via Jobs → your job → Delete.