# Creating a Job
A job defines a CDC pipeline from a source to a sink. This guide walks you through creating your first job.
## Prerequisites
Before creating a job, ensure you have:
- An active deployment
- A PostgreSQL source configured
- A StarRocks sink configured
- Connection tests passed for both datasources
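
If you want to verify the source is ready before starting the wizard, logical replication support can be checked directly on PostgreSQL. A minimal sketch; `ezcdc_user` is a placeholder for whatever user your PostgreSQL datasource is configured with:

```sql
-- wal_level must be 'logical' for CDC; the slot/sender limits must leave
-- room for at least one more replication slot and WAL sender.
SHOW wal_level;
SHOW max_replication_slots;
SHOW max_wal_senders;

-- The connecting role typically needs the REPLICATION attribute
-- (or an equivalent managed-service role).
SELECT rolname, rolreplication
FROM pg_roles
WHERE rolname = 'ezcdc_user';  -- placeholder role name
```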
## Create a Job

### Via Portal

1. **Navigate to Jobs → New Job**

2. **Select Source**
   - Choose your PostgreSQL datasource
   - Tables will be loaded automatically

3. **Select Sink**
   - Choose your StarRocks datasource

4. **Select Tables**

   Choose which tables to replicate:

   ```
   ☑ public.orders      (245,000 rows)
   ☑ public.order_items (1,200,000 rows)
   ☑ public.customers   (50,000 rows)
   ☐ public.audit_logs  (10,000,000 rows)
   ☐ public.sessions    (Temporary data)
   ```

5. **Configure Replication Settings**

   | Setting | Description | Recommended |
   |---|---|---|
   | Replication Slot | PostgreSQL slot name | Auto-generated |
   | Publication | PostgreSQL publication | Auto-generated |
   | Batch Size | Events per batch | 10,000 |
   | Flush Interval | Max seconds between flushes | 5 |

6. **Review & Create**

   Review your configuration and click **Create Job**.
## Job Initialization

When a job starts, EZ-CDC:

```
1. Create publication (PostgreSQL)
   └─ CREATE PUBLICATION ezcdc_orders_pub FOR TABLE orders, order_items, customers

2. Create replication slot
   └─ SELECT pg_create_logical_replication_slot('ezcdc_orders_slot', 'pgoutput')

3. Create destination tables (if needed)
   └─ CREATE TABLE orders (...) PRIMARY KEY (id)
   └─ CREATE TABLE order_items (...) PRIMARY KEY (id)

4. Start initial snapshot (optional)
   └─ Copy existing data from source to sink

5. Begin streaming
   └─ Subscribe to logical replication stream
```
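
If you want to confirm these objects were actually created on the source, the standard PostgreSQL catalog views show them. The names below follow the example above; substitute the ones your job generated:

```sql
-- Publication created by the job, and the tables it covers
SELECT pubname FROM pg_publication WHERE pubname = 'ezcdc_orders_pub';
SELECT * FROM pg_publication_tables WHERE pubname = 'ezcdc_orders_pub';

-- Replication slot created by the job
SELECT slot_name, plugin, active
FROM pg_replication_slots
WHERE slot_name = 'ezcdc_orders_slot';
```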
## Job Status

After creation, monitor job status:

```
Job: orders-to-analytics
Status: ● Running
Phase: Streaming

Source: production-postgres
Sink: analytics-starrocks
Tables: 3

Metrics:
  Events/sec: 1,247
  Lag: 512 bytes
  Total Events: 2,456,789
```
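
The `Lag` metric can be cross-checked from the PostgreSQL side by comparing the current WAL position with the position the slot has confirmed. A sketch, using the example slot name from above:

```sql
-- Bytes of WAL the sink has not yet confirmed for this job's slot
SELECT slot_name,
       active,
       pg_size_pretty(
         pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)
       ) AS lag
FROM pg_replication_slots
WHERE slot_name = 'ezcdc_orders_slot';
```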
### Status Values

| Status | Description |
|---|---|
| `pending` | Created, waiting for worker |
| `starting` | Worker assigned, initializing |
| `running` | Actively replicating |
| `paused` | Manually paused |
| `failed` | Error occurred |
| `stopped` | Manually stopped |
## Configuration Options

### Replication Settings

| Option | Default | Description |
|---|---|---|
| `replication_slot_name` | Auto | PostgreSQL slot name |
| `publication_name` | Auto | PostgreSQL publication |
| `batch_size` | 10000 | Max events per batch |
| `flush_interval_ms` | 5000 | Max ms between flushes |
### Advanced Settings

| Option | Default | Description |
|---|---|---|
| `start_position` | latest | Where to start (`latest` or a specific LSN) |
| `snapshot_mode` | initial | Initial data load mode |
| `parallel_tables` | 1 | Tables to process in parallel |
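
If you plan to start a job from a specific LSN rather than `latest`, one way to get a reference point is to capture the source's current WAL position. A minimal sketch, run against the source database:

```sql
-- Returns an LSN such as 0/1A2B3C4D; a value like this can serve as a
-- start_position reference (assumption: the job accepts LSNs in this format).
SELECT pg_current_wal_lsn();
```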
## Initial Snapshot

### Snapshot Modes

| Mode | Description | Use Case |
|---|---|---|
| `initial` | Full snapshot, then streaming | New replication |
| `never` | Streaming only, no snapshot | Existing data in sink |
| `when_needed` | Snapshot if no checkpoint | Recovery |
### Snapshot Process

For large tables, the initial snapshot:

- Reads existing data from the source
- Batches it into chunks (default 10,000 rows)
- Loads it to the sink via Stream Load
- Records the checkpoint position

```
Snapshotting public.orders (245,000 rows)
[████████████████████░░░░░] 80% (196,000/245,000)
Speed: 12,000 rows/sec
ETA: 4 seconds
```
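
To estimate how long a snapshot will take, approximate row counts from the source's statistics views are usually sufficient. A sketch using standard PostgreSQL statistics (table names follow the earlier example):

```sql
-- Approximate live row counts for the tables selected for replication;
-- divide by the observed rows/sec to estimate snapshot duration.
SELECT schemaname, relname, n_live_tup
FROM pg_stat_user_tables
WHERE schemaname = 'public'
  AND relname IN ('orders', 'order_items', 'customers')
ORDER BY n_live_tup DESC;
```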
## Table Mapping

### Automatic Mapping

By default, tables map 1:1:

```
Source                  Sink
────────────────────    ────────────────────
public.orders        →  orders
public.order_items   →  order_items
public.customers     →  customers
```
### Custom Mapping

For different sink table names (future feature):

```json
{
  "table_mappings": {
    "public.orders": "raw.orders",
    "public.customers": "raw.customers"
  }
}
```
## Multiple Jobs

You can create multiple jobs for different use cases:

```
Job 1: orders-realtime
  Tables: orders, order_items
  Batch Size: 1000
  Flush Interval: 1s
  └─ Low latency for dashboards

Job 2: analytics-batch
  Tables: customers, products
  Batch Size: 50000
  Flush Interval: 30s
  └─ High throughput for analytics
```

> **Warning:** Each job requires its own replication slot. Monitor slot usage on PostgreSQL.
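
One way to keep an eye on slot usage is to list all slots on the source along with how much WAL each one is holding back; an inactive slot retaining a lot of WAL is the usual warning sign. A sketch using standard PostgreSQL views:

```sql
-- Each job holds one slot; inactive slots retain WAL and can eventually
-- fill the source's disk if left behind.
SELECT slot_name,
       active,
       pg_size_pretty(
         pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)
       ) AS retained_wal
FROM pg_replication_slots
ORDER BY pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) DESC;
```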
## Troubleshooting

### Job stuck in "Pending"

**Causes:**

- No healthy workers
- Worker can't reach source/sink

**Solutions:**

- Check deployment worker health
- Verify the worker can reach the databases
- Check worker logs
### Job fails immediately

**Causes:**

- Invalid credentials
- Missing permissions
- Table doesn't exist

**Solutions:**

- Test datasource connections
- Verify user permissions
- Check table names are correct
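
To rule out the permission and naming causes quickly, checks like the ones below can be run directly on the source; `ezcdc_user` and the table name are placeholders for your own setup:

```sql
-- Returns NULL if the table does not exist under that name
SELECT to_regclass('public.orders') AS orders_exists;

-- The replication user needs SELECT on every selected table (for the snapshot)
SELECT has_table_privilege('ezcdc_user', 'public.orders', 'SELECT') AS can_select_orders;
```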
### Initial snapshot very slow

**Solutions:**

- Increase the batch size
- Check source database performance
- Consider `snapshot_mode: never` if the data already exists in the sink
## Next Steps
- Table Selection - Advanced table configuration
- Job Lifecycle - Understand job states
- Monitoring - Track job health and metrics