
Job Lifecycle

Understanding job states and transitions helps you manage CDC pipelines effectively.

Job States

[Diagram: Job Lifecycle: State transitions and recovery]
States: CREATED, PENDING, STARTING, RUNNING, PAUSED, STOPPED, FAILED, DELETED.
Transition labels: API creates job, worker claims job, daemon starts, user pauses, user resumes, user stops, user restarts, error, auto-recovery, user deletes.
Notes: RUNNING phases are INIT → SNAPSHOT → STREAMING. Checkpoints are saved every 30s. Auto-recovery uses exponential backoff with a maximum of 10 retries.

State Descriptions

| State | Description | Actions Available |
| --- | --- | --- |
| created | Job created, not yet assigned | Start, Delete |
| pending | Waiting for worker assignment | Cancel |
| starting | Worker initializing job | Cancel |
| running | Actively replicating | Pause, Stop |
| paused | Replication paused | Resume, Stop |
| failed | Error occurred | Retry, Stop, Delete |
| stopped | Manually stopped | Start, Delete |

State Transitions

Start Job

created/stopped → pending → starting → running

Starting a job triggers the following steps:

  1. An available worker is found
  2. Replication resources are created
  3. Streaming begins

Pause Job

running → paused
  • Worker holds position
  • No new events processed
  • Resources retained

Resume Job

paused → running
  • Continues from last position
  • No data loss

Stop Job

running/paused → stopped
  • Graceful shutdown
  • Commits pending batches
  • Releases resources (optional)

Retry Failed Job

failed → pending → starting → running
  • Worker re-assigned
  • Resumes from last checkpoint
  • May re-process some events
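
The user-driven transitions in this section can be sketched as a small lookup table. This is an illustrative model only, not EZ-CDC code: the `JobState` enum, the action names, and `apply_action` are hypothetical, and the system-driven steps (pending → starting → running) are left out. The table follows the transitions listed above, so it does not model the Stop action that the state table also lists for failed jobs.

```python
from enum import Enum


class JobState(str, Enum):
    CREATED = "created"
    PENDING = "pending"
    STARTING = "starting"
    RUNNING = "running"
    PAUSED = "paused"
    STOPPED = "stopped"
    FAILED = "failed"


# User actions and the state each one leads to, per the transitions above.
# Action names are illustrative assumptions, not an EZ-CDC API.
TRANSITIONS = {
    "start": {JobState.CREATED: JobState.PENDING, JobState.STOPPED: JobState.PENDING},
    "pause": {JobState.RUNNING: JobState.PAUSED},
    "resume": {JobState.PAUSED: JobState.RUNNING},
    "stop": {JobState.RUNNING: JobState.STOPPED, JobState.PAUSED: JobState.STOPPED},
    "retry": {JobState.FAILED: JobState.PENDING},
}


def apply_action(state: JobState, action: str) -> JobState:
    """Return the next state, or raise ValueError if the action is not allowed."""
    try:
        return TRANSITIONS[action][state]
    except KeyError:
        raise ValueError(f"cannot {action} a job in state {state.value}") from None
```

For example, `apply_action(JobState.PAUSED, "resume")` returns `JobState.RUNNING`, while trying to pause a job that is still in the created state raises `ValueError`.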

Job Phases

Within the running state, jobs progress through phases:

1. Initialization Phase

[INIT] Setting up replication resources
├─ Creating publication... done
├─ Creating replication slot... done
└─ Creating destination tables... done

2. Snapshot Phase

[SNAPSHOT] Loading initial data
├─ public.orders: 100% (245,000 rows)
├─ public.order_items: 67% (804,000/1,200,000 rows)
└─ public.customers: Pending

3. Streaming Phase

[STREAMING] Replicating changes
├─ Events/sec: 1,247
├─ Lag: 512 bytes
└─ Uptime: 2h 34m

Controlling Jobs

Via Portal

  1. Go to Jobs → select job
  2. Use action buttons:
    • Pause: Temporarily stop replication
    • Resume: Continue replication
    • Stop: End the job
    • Restart: Stop and start fresh

Failure Handling

Automatic Retry

For transient failures, EZ-CDC retries automatically:

| Failure Type | Retry Behavior |
| --- | --- |
| Network timeout | Retry 3x with backoff |
| Connection lost | Reconnect and resume |
| Rate limit | Backoff and retry |
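
The lifecycle diagram describes auto-recovery as exponential backoff with at most 10 retries. A minimal sketch of such a schedule follows. The base delay of 1 second, the 300-second cap, and the optional jitter are assumptions made for illustration; the delays EZ-CDC actually uses are not documented here.

```python
import random


def backoff_delays(max_retries: int = 10, base: float = 1.0,
                   cap: float = 300.0, jitter: bool = False):
    """Yield the wait (in seconds) before each retry attempt.

    The delay doubles on every attempt and is capped at `cap`.
    `base` and `cap` are assumed values, not EZ-CDC settings.
    """
    for attempt in range(max_retries):
        delay = min(cap, base * 2 ** attempt)
        if jitter:
            # Full jitter: pick a random wait up to the computed delay,
            # so many jobs do not retry at the same instant.
            delay = random.uniform(0, delay)
        yield delay


print(list(backoff_delays(max_retries=5)))  # [1.0, 2.0, 4.0, 8.0, 16.0]
```

With the defaults, the tenth delay would be 512 seconds uncapped, so the cap limits it to 300.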

Manual Intervention

For persistent failures:

  1. Check error message in portal
  2. Fix underlying issue (credentials, network, etc.)
  3. Click Retry to restart

Common Failures

| Error | Cause | Resolution |
| --- | --- | --- |
| replication slot not found | Slot deleted | Recreate slot, may need resnapshot |
| connection refused | Network issue | Check connectivity |
| permission denied | Auth issue | Verify credentials |
| table not found | Schema change | Update table selection |

Checkpointing

Jobs maintain checkpoints to enable resumption:

Checkpoint:
LSN: 0/1A3E5F8
Table: public.orders
Last Commit: 2024-01-15 10:30:00 UTC

Checkpoint Frequency

Checkpoints are saved:

  • After each successful batch
  • Before graceful shutdown
  • Periodically (every 30 seconds)
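
The three checkpoint triggers above can be expressed as one small decision helper. This is a sketch under assumptions: `CheckpointPolicy` and its method names are hypothetical, and only the 30-second interval comes from this page.

```python
import time


class CheckpointPolicy:
    """Decides when a checkpoint is due (hypothetical helper, not EZ-CDC code)."""

    def __init__(self, interval_seconds: float = 30.0, clock=time.monotonic):
        self.interval = interval_seconds
        self.clock = clock  # injectable so the policy can be tested without sleeping
        self.last_saved = clock()

    def due(self, batch_committed: bool = False, shutting_down: bool = False) -> bool:
        # A successful batch or a graceful shutdown always triggers a checkpoint.
        if batch_committed or shutting_down:
            return True
        # Otherwise checkpoint once the periodic interval has elapsed.
        return self.clock() - self.last_saved >= self.interval

    def mark_saved(self) -> None:
        """Record that a checkpoint was just written."""
        self.last_saved = self.clock()
```

A worker loop would call `due(...)` after each batch and on each idle tick, then `mark_saved()` once the checkpoint is persisted.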

Resume from Checkpoint

When a job restarts:

  1. Reads last checkpoint
  2. Connects to source at checkpoint LSN
  3. Continues streaming

Some events may be re-processed after a restart; idempotent writes at the destination make the replay harmless.
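
One common way to make replayed events harmless is to store the LSN of the last change applied to each row and skip anything that is not newer. The sketch below illustrates that idea; it is not a description of how EZ-CDC writes to destinations. The event shape (`key`, `lsn`, `row`) and both function names are assumptions, and deletes are not handled. The LSN format (two hexadecimal numbers separated by a slash) is the standard PostgreSQL one.

```python
def parse_lsn(text: str) -> int:
    """Convert a PostgreSQL LSN such as '0/1A3E5F8' to a comparable integer."""
    high, low = text.split("/")
    return (int(high, 16) << 32) | int(low, 16)


def apply_event(table: dict, event: dict) -> bool:
    """Upsert an event into `table`, keyed by primary key.

    Returns True if the row changed, False if the event was a replay.
    The event fields used here are hypothetical.
    """
    key = event["key"]
    lsn = parse_lsn(event["lsn"])
    current = table.get(key)
    if current is not None and current["lsn"] >= lsn:
        return False  # already applied at this LSN or a later one
    table[key] = {"lsn": lsn, "row": event["row"]}
    return True


destination = {}
event = {"key": 1, "lsn": "0/1A3E5F8", "row": {"id": 1, "status": "paid"}}
print(apply_event(destination, event))  # True: first delivery is applied
print(apply_event(destination, event))  # False: the replay is skipped
```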

Resource Cleanup

When Stopping

By default, stopping a job retains resources:

  • Replication slot (maintains position)
  • Publication (table list)
  • Destination tables (with data)

Full Cleanup

To release all resources, stop the job with the Cleanup resources option enabled in the portal. This will:

  • Drop replication slot
  • Drop publication
  • Keep destination tables (data preserved)

Job Deletion

Deleting a job:

  1. Stops the job (if running)
  2. Drops replication slot
  3. Drops publication
  4. Removes job from system

Destination tables are NOT deleted.

Monitoring State Changes

Webhooks (Future)

Receive notifications on state changes:

{
"event": "job.state_changed",
"job_id": "job_abc123",
"previous_state": "running",
"new_state": "failed",
"error": "Connection refused",
"timestamp": "2024-01-15T10:30:00Z"
}
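
Since webhooks are marked as a future feature, the payload above may change. As a hedged sketch, a receiver could parse it like this; the `StateChange` dataclass and `parse_state_change` are hypothetical names, and the field list simply mirrors the example payload.

```python
import json
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class StateChange:
    job_id: str
    previous_state: str
    new_state: str
    error: Optional[str]
    timestamp: datetime


def parse_state_change(body: str) -> StateChange:
    """Parse a job.state_changed payload shaped like the example above."""
    data = json.loads(body)
    if data.get("event") != "job.state_changed":
        raise ValueError(f"unexpected event type: {data.get('event')!r}")
    return StateChange(
        job_id=data["job_id"],
        previous_state=data["previous_state"],
        new_state=data["new_state"],
        error=data.get("error"),  # assumed to be absent on non-error transitions
        # Replace the trailing "Z" so older Python versions can parse the timestamp.
        timestamp=datetime.fromisoformat(data["timestamp"].replace("Z", "+00:00")),
    )
```

A handler could then branch on `new_state == "failed"` to raise an alert.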

Portal Monitoring

Monitor job states in real time through the EZ-CDC Portal under Jobs → your job → Status.

Best Practices

1. Use Pause for Maintenance

Before source database maintenance:

  1. Go to Jobs → your job
  2. Click Pause
  3. Perform maintenance
  4. Click Resume

2. Monitor Failed Jobs

Set up alerts for:

  • Jobs that stay in the failed state for more than 5 minutes
  • Repeated failures (flapping)
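
The two alert conditions can be sketched as simple predicates over failure timestamps. Only the 5-minute threshold comes from this page; the one-hour window and the three-failure limit used to define flapping are assumptions, as are the function names. How the timestamps are collected (portal, logs, or future webhooks) is left open.

```python
from datetime import datetime, timedelta
from typing import Sequence


def failed_too_long(failed_since: datetime, now: datetime,
                    threshold: timedelta = timedelta(minutes=5)) -> bool:
    """True if the job has been in the failed state for longer than `threshold`."""
    return now - failed_since > threshold


def is_flapping(failure_times: Sequence[datetime], now: datetime,
                window: timedelta = timedelta(hours=1),
                max_failures: int = 3) -> bool:
    """True if at least `max_failures` failures fall inside the recent window.

    The window and limit are assumed values chosen for illustration.
    """
    recent = [t for t in failure_times if now - t <= window]
    return len(recent) >= max_failures
```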

3. Clean Up Unused Jobs

Stopped jobs still hold their replication slot and publication by default. Delete jobs you no longer need via Jobs → your job → Delete.
