Job Lifecycle

Understanding job states and transitions helps you manage CDC pipelines effectively.

Job States

[Figure: Job lifecycle, state transitions and recovery]

State Descriptions

State	Description	Actions Available
`created`	Job created, not yet assigned	Start, Delete
`pending`	Waiting for worker assignment	Cancel
`starting`	Worker initializing job	Cancel
`running`	Actively replicating	Pause, Stop
`paused`	Replication paused	Resume, Stop
`failed`	Error occurred	Retry, Stop, Delete
`stopped`	Manually stopped	Start, Delete

State Transitions

Start Job

created/stopped → pending → starting → running

Starting a job triggers the following steps:

Finds available worker
Creates replication resources
Begins streaming

Pause Job

running → paused

Worker holds position
No new events processed
Resources retained

Resume Job

paused → running

Continues from last position
No data loss

Stop Job

running/paused → stopped

Graceful shutdown
Commits pending batches
Releases replication resources only if the Cleanup resources option is enabled (they are retained by default)

Retry Failed Job

failed → pending → starting → running

Worker re-assigned
Resumes from last checkpoint
May re-process some events
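
The transitions in this section can be summarized as a small lookup table. The following Python sketch is illustrative only: `TRANSITIONS`, `AUTOMATIC`, `next_state`, and `path_after` are hypothetical names, not part of any EZ-CDC API. It covers only the transitions listed above; Cancel, and Stop from the failed state, are left out because this page does not specify their target states.

```python
# Illustrative only: the table mirrors the transitions documented above.
# All names here are hypothetical and not part of an EZ-CDC API.
TRANSITIONS = {
    ("created", "start"): "pending",
    ("stopped", "start"): "pending",
    ("running", "pause"): "paused",
    ("paused", "resume"): "running",
    ("running", "stop"): "stopped",
    ("paused", "stop"): "stopped",
    ("failed", "retry"): "pending",
}

# States the system moves through on its own once a worker picks up the job.
AUTOMATIC = {"pending": "starting", "starting": "running"}


def next_state(state: str, action: str) -> str:
    """Return the state that follows `action`, or raise if it is not allowed."""
    try:
        return TRANSITIONS[(state, action)]
    except KeyError:
        raise ValueError(f"action {action!r} is not valid in state {state!r}") from None


def path_after(state: str, action: str) -> list:
    """List every state visited after `action`, following automatic steps."""
    path = [next_state(state, action)]
    while path[-1] in AUTOMATIC:
        path.append(AUTOMATIC[path[-1]])
    return path
```

For example, `path_after("failed", "retry")` walks through the same pending, starting, running sequence shown under Retry Failed Job, while `next_state("paused", "pause")` raises `ValueError`.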

Job Phases

Within the running state, jobs progress through phases:

1. Initialization Phase

[INIT] Setting up replication resources
  ├─ Creating publication... done
  ├─ Creating replication slot... done
  └─ Creating destination tables... done

2. Snapshot Phase

[SNAPSHOT] Loading initial data
  ├─ public.orders: 100% (245,000 rows)
  ├─ public.order_items: 67% (804,000/1,200,000 rows)
  └─ public.customers: Pending

3. Streaming Phase

[STREAMING] Replicating changes
  ├─ Events/sec: 1,247
  ├─ Lag: 512 bytes
  └─ Uptime: 2h 34m

Controlling Jobs

Via Portal

1. Go to Jobs → select a job.
2. Use the action buttons:
   - Pause: Temporarily stop replication
   - Resume: Continue replication
   - Stop: End the job
   - Restart: Stop and start fresh

Failure Handling

Automatic Retry

For transient failures, EZ-CDC retries automatically:

Failure Type	Retry Behavior
Network timeout	Retry 3x with backoff
Connection lost	Reconnect and resume
Rate limit	Backoff and retry
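
As a rough illustration of "retry 3x with backoff", the sketch below retries an operation with exponentially growing delays. This page does not state the actual schedule EZ-CDC uses, so the base delay, the doubling factor, and the `retry_with_backoff` name are all assumptions made for the example.

```python
import time


def retry_with_backoff(operation, retries=3, base_delay=1.0, sleep=time.sleep):
    """Call `operation`, retrying up to `retries` times on TimeoutError.

    The delay doubles after each failed attempt (1s, 2s, 4s with the
    defaults). These values are assumptions chosen for illustration.
    """
    for attempt in range(retries + 1):  # one initial try plus `retries` retries
        try:
            return operation()
        except TimeoutError:
            if attempt == retries:
                raise  # retries exhausted: surface the failure
            sleep(base_delay * 2 ** attempt)
```

Passing `sleep` as a parameter keeps the function easy to test without real waiting. Once the retries are exhausted the exception propagates, which corresponds to the job needing manual intervention.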

Manual Intervention

For persistent failures:

1. Check the error message in the portal.
2. Fix the underlying issue (credentials, network, etc.).
3. Click Retry to restart the job.

Common Failures

Error	Cause	Resolution
`replication slot not found`	Slot was deleted	Recreate the slot; a new snapshot may be required
`connection refused`	Network issue	Check connectivity
`permission denied`	Auth issue	Verify credentials
`table not found`	Schema change	Update table selection

Checkpointing

Jobs maintain checkpoints to enable resumption:

Checkpoint:
  LSN: 0/1A3E5F8
  Table: public.orders
  Last Commit: 2024-01-15 10:30:00 UTC

Checkpoint Frequency

Checkpoints are saved:

After each successful batch
Before graceful shutdown
Periodically (every 30 seconds)
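
The three triggers above amount to a simple "any of these" rule. A minimal sketch, assuming a worker loop that tracks the time since its last checkpoint; `should_checkpoint` and its parameters are hypothetical names used only for illustration:

```python
CHECKPOINT_INTERVAL_SECONDS = 30.0  # the periodic interval stated above


def should_checkpoint(batch_committed, shutting_down, seconds_since_last):
    """Return True when any of the three documented triggers applies."""
    return (
        batch_committed
        or shutting_down
        or seconds_since_last >= CHECKPOINT_INTERVAL_SECONDS
    )
```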

Resume from Checkpoint

When a job restarts:

1. Reads the last checkpoint.
2. Connects to the source at the checkpoint LSN.
3. Continues streaming from that position.

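Because the last saved checkpoint can be slightly behind the last event written, some events may be processed again after a restart. Idempotent writes at the destination make this safe.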
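
To show why re-processing is harmless, the sketch below replays events from a checkpoint into a destination keyed by primary key. It is not EZ-CDC's implementation: the `(lsn, primary_key, row)` event shape, the in-memory `destination` dict, and both function names are assumptions. It also assumes events arrive in LSN order and that LSNs use the PostgreSQL `high/low` hexadecimal form shown in the checkpoint example.

```python
def parse_lsn(text: str) -> int:
    """Convert a PostgreSQL-style LSN such as '0/1A3E5F8' to an integer."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def apply_events(destination: dict, events: list, checkpoint_lsn: str) -> str:
    """Upsert every event at or after the checkpoint; return the new checkpoint.

    Each event is a hypothetical (lsn, primary_key, row) tuple. Writing by
    primary key makes a replayed event overwrite the row with the same
    values, so processing it twice leaves the destination unchanged.
    """
    start = parse_lsn(checkpoint_lsn)
    latest = checkpoint_lsn
    for lsn, key, row in events:
        if parse_lsn(lsn) < start:
            continue  # already covered by the checkpoint
        destination[key] = row  # idempotent upsert
        latest = lsn
    return latest
```

Running `apply_events` a second time from an older checkpoint rewrites the same rows with the same values, so the destination ends up in the same state as after the first run.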

Resource Cleanup

When Stopping

By default, stopping a job retains resources:

Replication slot (maintains position)
Publication (table list)
Destination tables (with data)

Full Cleanup

To release the replication resources on the source, stop the job with the Cleanup resources option enabled in the portal. This will:

Drop replication slot
Drop publication
Keep destination tables (data preserved)

Job Deletion

Deleting a job:

Stops the job (if running)
Drops replication slot
Drops publication
Removes job from system

Destination tables are NOT deleted.

Monitoring State Changes

Webhooks (Future)

Webhooks are planned for a future release. They will send a notification on each state change, for example:

{
  "event": "job.state_changed",
  "job_id": "job_abc123",
  "previous_state": "running",
  "new_state": "failed",
  "error": "Connection refused",
  "timestamp": "2024-01-15T10:30:00Z"
}
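
A receiver for this payload could be sketched as follows. Because webhooks are not yet available, the final payload may differ from the example; the `StateChange` class and `parse_state_change` function are hypothetical, and treating `error` as optional is an assumption.

```python
import json
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class StateChange:
    job_id: str
    previous_state: str
    new_state: str
    error: Optional[str]
    timestamp: datetime


def parse_state_change(body: str) -> StateChange:
    """Parse a payload shaped like the example above."""
    data = json.loads(body)
    if data.get("event") != "job.state_changed":
        raise ValueError(f"unexpected event type: {data.get('event')!r}")
    return StateChange(
        job_id=data["job_id"],
        previous_state=data["previous_state"],
        new_state=data["new_state"],
        error=data.get("error"),  # assumed absent when the job did not fail
        # Replace the trailing "Z" so fromisoformat accepts it on Python < 3.11.
        timestamp=datetime.fromisoformat(data["timestamp"].replace("Z", "+00:00")),
    )
```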

Portal Monitoring

Monitor job states in real time in the EZ-CDC Portal under Jobs → your job → Status.

Best Practices

1. Use Pause for Maintenance

Before source database maintenance:

1. Go to Jobs → your job.
2. Click Pause.
3. Perform the maintenance.
4. Click Resume.

2. Monitor Failed Jobs

Set up alerts for:

Jobs that remain in the failed state for more than 5 minutes
Repeated failures (flapping)
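
The two alert conditions above could be evaluated along these lines. This is a sketch under stated assumptions: how job state and failure history are obtained is out of scope, the function names are hypothetical, and the flapping definition (three failures within one hour) is an assumed threshold that this page does not specify.

```python
from datetime import datetime, timedelta


def stuck_in_failed(state, since, now, threshold=timedelta(minutes=5)):
    """True when a job has been in the failed state longer than `threshold`."""
    return state == "failed" and now - since > threshold


def is_flapping(failure_times, now, window=timedelta(hours=1), max_failures=3):
    """True when at least `max_failures` failures fall inside `window`.

    The one-hour window and the limit of three are assumed values.
    """
    recent = [t for t in failure_times if now - t <= window]
    return len(recent) >= max_failures
```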

3. Clean Up Unused Jobs

Stopped jobs still hold resources such as the replication slot and publication. If a job is no longer needed, delete it via Jobs → your job → Delete.
