Table Selection

This guide covers how to select and configure tables for CDC replication.

Discovering Tables

When you create a job, EZ-CDC discovers the available tables from your source:

Available Tables:
┌────────────────────┬─────────────┬─────────────┬──────────┐
│ Table              │ Rows (est.) │ Primary Key │ Status   │
├────────────────────┼─────────────┼─────────────┼──────────┤
│ public.orders      │     245,000 │ id          │ ✓ Ready  │
│ public.order_items │   1,200,000 │ id          │ ✓ Ready  │
│ public.customers   │      50,000 │ id          │ ✓ Ready  │
│ public.products    │       5,000 │ id          │ ✓ Ready  │
│ public.audit_logs  │  10,000,000 │ id          │ ⚠ Large  │
│ public.sessions    │     500,000 │ (none)      │ ⚠ No PK  │
└────────────────────┴─────────────┴─────────────┴──────────┘
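Discovery boils down to a catalog query against the source. A minimal sketch of the equivalent query, assuming a PostgreSQL source (the REPLICA IDENTITY commands later in this guide imply one):

-- Estimated row counts come from planner statistics, not a full COUNT(*).
SELECT n.nspname AS schema_name,
       c.relname AS table_name,
       c.reltuples::bigint AS estimated_rows,
       (con.conname IS NOT NULL) AS has_primary_key
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
LEFT JOIN pg_constraint con ON con.conrelid = c.oid AND con.contype = 'p'
WHERE c.relkind = 'r'
  AND n.nspname NOT IN ('pg_catalog', 'information_schema');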

Selecting Tables

Basic Selection

Select tables to replicate:

{
  "tables": [
    "public.orders",
    "public.order_items",
    "public.customers"
  ]
}

Schema-Qualified Names

Always use fully-qualified names:

✓ public.orders
✓ sales.transactions
✗ orders (ambiguous)

Pattern Matching (Future)

Match tables by pattern:

{
  "table_patterns": [
    "public.*",       // All tables in public schema
    "sales.order_*",  // Tables starting with order_
    "!*._archive"     // Exclude archive tables
  ]
}

Table Requirements

Primary Keys

Tables should have primary keys:

┌─────────┬─────────────────────────────────────────────┐
│ Has PK? │ Behavior                                    │
├─────────┼─────────────────────────────────────────────┤
│ Yes     │ Updates/Deletes use PK to identify rows     │
│ No      │ All columns used (REPLICA IDENTITY FULL)    │
└─────────┴─────────────────────────────────────────────┘

For tables without primary keys:

-- Option 1: Add a primary key
ALTER TABLE sessions ADD PRIMARY KEY (session_id);

-- Option 2: Use REPLICA IDENTITY FULL
ALTER TABLE sessions REPLICA IDENTITY FULL;
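Either change can be verified from the catalog; relreplident reads 'f' once FULL is in effect:

-- 'd' = default (primary key), 'f' = FULL, 'n' = nothing, 'i' = index
SELECT relname, relreplident
FROM pg_class
WHERE relname = 'sessions';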

Supported Column Types

┌───────────┬──────────────────────────────┬───────────┐
│ Category  │ Types                        │ Status    │
├───────────┼──────────────────────────────┼───────────┤
│ Numeric   │ int, bigint, decimal, float  │ ✓         │
│ String    │ varchar, text, char          │ ✓         │
│ Temporal  │ date, timestamp, timestamptz │ ✓         │
│ Boolean   │ boolean                      │ ✓         │
│ JSON      │ json, jsonb                  │ ✓         │
│ UUID      │ uuid                         │ ✓         │
│ Binary    │ bytea                        │ ✓         │
│ Arrays    │ integer[], text[], etc.      │ ⚠ Limited │
│ Geometric │ point, polygon, etc.         │           │
│ Custom    │ User-defined types           │           │
└───────────┴──────────────────────────────┴───────────┘
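To audit an existing schema for the limited categories above, a catalog query like this works on PostgreSQL (information_schema reports 'ARRAY' for array columns and 'USER-DEFINED' for custom types):

SELECT table_name, column_name, data_type
FROM information_schema.columns
WHERE table_schema = 'public'
  AND data_type IN ('ARRAY', 'USER-DEFINED');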

Table Considerations

Large Tables

For tables with millions of rows:

⚠ Table public.audit_logs has 10,000,000 rows

Initial snapshot will:
- Take approximately 15 minutes
- Generate significant source load
- Require 2GB+ transfer

Options:

  1. Include with longer timeout
  2. Snapshot during off-peak hours
  3. Use snapshot_mode: never if the data already exists in the sink (see the sketch below)
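For option 3, a minimal sketch, assuming snapshot_mode is accepted at the job level alongside the table list:

{
  "tables": ["public.audit_logs"],
  "snapshot_mode": "never"
}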

High-Write Tables

For tables with heavy write load:

⚠ Table public.events has high write volume (1000+ writes/sec)

Consider:
- Larger batch size
- More frequent flushes
- Dedicated job
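As a sketch of the first two points, assuming job-level keys (batch_size mirrors the per-table setting planned later in this guide; flush_interval_ms is a hypothetical key used for illustration):

{
  "tables": ["public.events"],
  "batch_size": 10000,       // larger batches amortize per-batch overhead
  "flush_interval_ms": 500   // hypothetical key: flush often to bound lag
}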

Tables Without Primary Keys

⚠ Table public.sessions has no primary key

Updates and deletes will:
- Include all columns in messages
- Be less efficient
- Potentially cause issues with some sinks

Recommendation: Add a primary key or use REPLICA IDENTITY FULL.

Column Selection (Future)

Select specific columns:

{
  "tables": [
    {
      "name": "public.customers",
      "columns": ["id", "name", "email", "created_at"]
    }
  ]
}

Excluding Tables

By Not Selecting

Simply omit the table from the tables list.

By Pattern (Future)

{
  "exclude_tables": [
    "*_backup",
    "*_archive",
    "temp_*"
  ]
}

Schema Discovery

Tables are automatically discovered when you select a source in the portal. The portal displays:

  • Schema and table names
  • Estimated row counts
  • Column information (name, type, nullable, primary key)
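The same column details can be cross-checked directly on a PostgreSQL source:

SELECT column_name, data_type, is_nullable
FROM information_schema.columns
WHERE table_schema = 'public'
  AND table_name = 'orders'
ORDER BY ordinal_position;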

Adding Tables to a Running Job

Currently, adding tables requires:

  1. Stop the job
  2. Modify table selection
  3. Restart the job

The new table will be snapshotted and then streamed.
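For example, extending the selection from earlier; on restart, the new table is snapshotted before streaming resumes:

{
  "tables": [
    "public.orders",
    "public.order_items",
    "public.customers",
    "public.products"   // newly added: snapshotted, then streamed
  ]
}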

Future Feature

Hot-adding tables without stopping the job is planned.

Removing Tables

To remove a table:

  1. Stop the job
  2. Remove table from selection
  3. Restart the job

Note: Data already replicated remains in the sink.

Table-Specific Settings (Future)

Configure per-table settings:

{
  "tables": [
    {
      "name": "public.orders",
      "batch_size": 5000,
      "priority": "high"
    },
    {
      "name": "public.audit_logs",
      "batch_size": 50000,
      "priority": "low"
    }
  ]
}

Best Practices

1. Start Small

Begin with a few critical tables:

{
  "tables": ["public.orders", "public.customers"]
}

Expand after validating replication.

2. Group Related Tables

Keep related tables in the same job:

Job: sales-data
- orders
- order_items
- customers

Job: inventory-data
- products
- inventory
- warehouses
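Expressed as table selections, assuming one config per job:

// Job: sales-data
{
  "tables": ["public.orders", "public.order_items", "public.customers"]
}

// Job: inventory-data
{
  "tables": ["public.products", "public.inventory", "public.warehouses"]
}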

3. Separate High-Volume Tables

Isolate high-write tables:

Job: events-realtime (dedicated)
- events (10K writes/sec)

Job: core-data
- users
- accounts
- settings

4. Monitor Table Lag

Watch per-table lag to identify bottlenecks.
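Per-table lag comes from EZ-CDC's own metrics; on a PostgreSQL source, overall slot lag can also be checked directly (a sketch, assuming a logical replication slot):

-- Bytes of WAL not yet confirmed by the slot; steady growth signals a bottleneck.
SELECT slot_name,
       pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), confirmed_flush_lsn)) AS lag
FROM pg_replication_slots;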

Next Steps