Creating a Job

A job defines a CDC pipeline from a source to a sink. This guide walks you through creating your first job.

Prerequisites

Before creating a job, ensure you have:

  • A PostgreSQL source datasource configured and tested
  • A StarRocks sink datasource configured and tested
  • At least one healthy worker in your deployment

Create a Job

Via Portal

  1. Navigate to Jobs → New Job

  2. Select Source

    • Choose your PostgreSQL datasource
    • Tables will be loaded automatically
  3. Select Sink

    • Choose your StarRocks datasource
  4. Select Tables

    Choose which tables to replicate:

    ☑ public.orders        (245,000 rows)
    ☑ public.order_items   (1,200,000 rows)
    ☑ public.customers     (50,000 rows)
    ☐ public.audit_logs    (10,000,000 rows)
    ☐ public.sessions      (Temporary data)
  5. Configure Replication Settings

    Setting            Description                    Recommended
    ─────────────────  ─────────────────────────────  ──────────────
    Replication Slot   PostgreSQL slot name           Auto-generated
    Publication        PostgreSQL publication         Auto-generated
    Batch Size         Events per batch               10,000
    Flush Interval     Max seconds between flushes    5
  6. Review & Create

    Review your configuration and click Create Job.

Job Initialization

When a job starts, EZ-CDC:

1. Creates the publication (PostgreSQL)
└─ CREATE PUBLICATION ezcdc_orders_pub FOR TABLE orders, order_items, customers

2. Creates the replication slot
└─ SELECT pg_create_logical_replication_slot('ezcdc_orders_slot', 'pgoutput')

3. Creates destination tables (if needed)
└─ CREATE TABLE orders (...) PRIMARY KEY (id)
└─ CREATE TABLE order_items (...) PRIMARY KEY (id)

4. Runs the initial snapshot (optional)
└─ Copy existing data from source to sink

5. Begins streaming
└─ Subscribe to logical replication stream
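
Once initialization completes, you can confirm the publication and slot exist on the source with standard PostgreSQL catalog queries. This is a minimal sketch; ezcdc_orders_pub and ezcdc_orders_slot are the example names generated above, so substitute your job's actual names:

-- Publications and the tables they cover
SELECT pubname FROM pg_publication;
SELECT * FROM pg_publication_tables WHERE pubname = 'ezcdc_orders_pub';

-- Replication slot state (active = a worker is currently connected)
SELECT slot_name, plugin, active, restart_lsn
FROM pg_replication_slots
WHERE slot_name = 'ezcdc_orders_slot';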

Job Status

After creation, monitor job status:

Job: orders-to-analytics
Status: ● Running
Phase: Streaming

Source: production-postgres
Sink: analytics-starrocks
Tables: 3

Metrics:
Events/sec: 1,247
Lag: 512 bytes
Total Events: 2,456,789

Status Values

Status     Description
─────────  ──────────────────────────────
pending    Created, waiting for worker
starting   Worker assigned, initializing
running    Actively replicating
paused     Manually paused
failed     Error occurred
stopped    Manually stopped

Configuration Options

Replication Settings

Option                  Default   Description
──────────────────────  ────────  ──────────────────────────
replication_slot_name   Auto      PostgreSQL slot name
publication_name        Auto      PostgreSQL publication
batch_size              10000     Max events per batch
flush_interval_ms       5000      Max ms between flushes

Advanced Settings

Option            Default   Description
────────────────  ────────  ───────────────────────────────
start_position    latest    Where to start (latest or LSN)
snapshot_mode     initial   Initial data load mode
parallel_tables   1         Tables to process in parallel
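
If you set start_position to a specific LSN instead of latest, you can read the relevant positions from the source using built-in PostgreSQL functions (PostgreSQL 10+); a quick sketch:

-- Current WAL write position on the source
SELECT pg_current_wal_lsn();

-- Position each logical slot will resume from
SELECT slot_name, confirmed_flush_lsn FROM pg_replication_slots;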

Initial Snapshot

Snapshot Modes

Mode          Description                    Use Case
────────────  ─────────────────────────────  ─────────────────────
initial       Full snapshot then streaming   New replication
never         Streaming only, no snapshot    Existing data in sink
when_needed   Snapshot if no checkpoint      Recovery
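
Before choosing never, it is worth confirming the sink really holds the existing data, for example by comparing row counts. The table name below comes from the earlier example:

-- On the PostgreSQL source
SELECT COUNT(*) FROM public.orders;

-- On the StarRocks sink (counts should match before using streaming-only mode)
SELECT COUNT(*) FROM orders;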

Snapshot Process

For large tables, the initial snapshot:

  1. Reads existing data from source
  2. Batches into chunks (default 10,000 rows)
  3. Loads to sink via Stream Load
  4. Records checkpoint position

Snapshotting public.orders (245,000 rows)
[████████████████████░░░░░] 80% (196,000/245,000)
Speed: 12,000 rows/sec
ETA: 4 seconds
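
To estimate how long a snapshot will take, check approximate row counts on the source before starting the job; at the example rate of roughly 12,000 rows/sec, 245,000 rows take about 20 seconds. A sketch using PostgreSQL planner statistics:

-- Approximate live row counts for the tables selected above
SELECT relname, n_live_tup
FROM pg_stat_user_tables
WHERE schemaname = 'public'
  AND relname IN ('orders', 'order_items', 'customers')
ORDER BY n_live_tup DESC;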

Table Mapping

Automatic Mapping

By default, tables map 1:1:

Source                    Sink
────────────────────      ────────────────────
public.orders        →    orders
public.order_items   →    order_items
public.customers     →    customers

Custom Mapping

For different sink table names (future feature):

{
  "table_mappings": {
    "public.orders": "raw.orders",
    "public.customers": "raw.customers"
  }
}

Multiple Jobs

You can create multiple jobs for different use cases:

Job 1: orders-realtime
Tables: orders, order_items
Batch Size: 1000
Flush Interval: 1s
└─ Low latency for dashboards

Job 2: analytics-batch
Tables: customers, products
Batch Size: 50000
Flush Interval: 30s
└─ High throughput for analytics
Warning: Each job requires its own replication slot. Monitor slot usage on PostgreSQL (see the query below).
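
One way to monitor slot usage from the source: an inactive or lagging slot retains WAL and can eventually fill the disk. A minimal sketch using built-in PostgreSQL functions:

-- WAL retained by each replication slot
SELECT slot_name,
       active,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS retained_wal
FROM pg_replication_slots;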

Troubleshooting

Job stuck in "Pending"

Causes:

  • No healthy workers
  • Worker can't reach source/sink

Solutions:

  1. Check deployment worker health
  2. Verify worker can reach databases
  3. Check worker logs

Job fails immediately

Causes:

  • Invalid credentials
  • Missing permissions
  • Table doesn't exist

Solutions:

  1. Test datasource connections
  2. Verify user permissions (see the check below)
  3. Check table names are correct
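
To check permissions directly on PostgreSQL, you can query the role's replication flag and table privileges. ezcdc_user below is a placeholder for whatever role your datasource uses:

-- The CDC role needs the REPLICATION attribute...
SELECT rolname, rolreplication FROM pg_roles WHERE rolname = 'ezcdc_user';

-- ...and SELECT on each replicated table
SELECT has_table_privilege('ezcdc_user', 'public.orders', 'SELECT');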

Initial snapshot very slow

Solutions:

  1. Increase batch size
  2. Check source database performance (see the query below)
  3. Consider snapshot_mode: never if data already exists
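
One way to check source performance during a slow snapshot is to look for long-running or blocked sessions on PostgreSQL; a quick sketch:

-- Longest-running non-idle sessions on the source
SELECT pid, state, now() - query_start AS runtime, left(query, 60) AS query
FROM pg_stat_activity
WHERE state <> 'idle'
ORDER BY runtime DESC NULLS LAST
LIMIT 10;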

Next Steps