
StarRocks Stream Load

This guide explains how EZ-CDC uses StarRocks Stream Load for efficient data ingestion.

What is Stream Load?

Stream Load is StarRocks' HTTP-based data ingestion method:

  • Synchronous: Returns after data is committed
  • Transactional: All-or-nothing batch commits
  • Efficient: Direct load to BE nodes
  • Supports UPSERT: Inserts new rows and updates existing ones (on Primary Key tables)

How EZ-CDC Uses Stream Load

Request Flow

EZ-CDC Worker Process:

  1. Collect batch of CDC events
  2. Transform to JSON format
  3. Send via HTTP PUT to Stream Load endpoint

Stream Load Request:

PUT /api/{database}/{table}/_stream_load
Headers:
- label: unique_batch_id
- format: json
- partial_update: true
Body: [{"id":1,"name":"John"}, ...]

Batch Processing

EZ-CDC batches events before sending:

CDC Events:
INSERT {id: 1, name: "John"}
UPDATE {id: 2, name: "Jane"}
INSERT {id: 3, name: "Bob"}
DELETE {id: 4}

↓ Transform ↓

Stream Load JSON:
[
{"id": 1, "name": "John", "_cdc_deleted": false},
{"id": 2, "name": "Jane", "_cdc_deleted": false},
{"id": 3, "name": "Bob", "_cdc_deleted": false},
{"id": 4, "_cdc_deleted": true}
]
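
The transformation above can be sketched in Python. This is a minimal illustration, not EZ-CDC's actual implementation: the event shape (an `op` string plus a `row` dict) and the function name `to_stream_load_rows` are assumptions made for the example.

```python
import json


def to_stream_load_rows(events):
    """Convert CDC events into Stream Load rows carrying a soft-delete flag.

    Assumed (hypothetical) event shape:
        {"op": "INSERT" | "UPDATE" | "DELETE", "row": {...}}
    """
    rows = []
    for event in events:
        row = dict(event["row"])  # copy so the caller's event is not mutated
        row["_cdc_deleted"] = event["op"] == "DELETE"
        rows.append(row)
    return rows


events = [
    {"op": "INSERT", "row": {"id": 1, "name": "John"}},
    {"op": "UPDATE", "row": {"id": 2, "name": "Jane"}},
    {"op": "INSERT", "row": {"id": 3, "name": "Bob"}},
    {"op": "DELETE", "row": {"id": 4}},
]

# JSON array that would become the Stream Load request body.
body = json.dumps(to_stream_load_rows(events))
```

A DELETE event carries only the key, so its row contains just `id` and `_cdc_deleted`.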

Delete Handling

EZ-CDC uses soft deletes:

-- Deleted rows have _cdc_deleted = true
SELECT * FROM orders WHERE _cdc_deleted = false;

This allows:

  • Audit trail of deleted records
  • Point-in-time queries
  • Recovery if needed

Stream Load Request

HTTP Request

PUT /api/analytics/orders/_stream_load HTTP/1.1
Host: starrocks-be:8040
Authorization: Basic ZXpjZGNfdXNlcjpwYXNzd29yZA==
Content-Type: application/json
Expect: 100-continue
label: ezcdc_orders_1705315200_001
format: json
strip_outer_array: true
partial_update: true
partial_update_mode: row

[
{"id": 1, "customer_id": 100, "total": 99.99, "_cdc_deleted": false},
{"id": 2, "customer_id": 101, "total": 149.99, "_cdc_deleted": false}
]
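
As a rough sketch, the request above could be assembled with Python's standard library as follows. The function names and parameters are hypothetical, and the code only builds the request without sending it.

```python
import base64
import json
import urllib.request


def stream_load_headers(label, user, password):
    """Headers matching the example request; the credentials are placeholders."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    return {
        "Authorization": f"Basic {token}",
        "Content-Type": "application/json",
        "Expect": "100-continue",
        "label": label,
        "format": "json",
        "strip_outer_array": "true",
        "partial_update": "true",
        "partial_update_mode": "row",
    }


def build_stream_load_request(base_url, database, table, rows, label, user, password):
    """Build (but do not send) an HTTP PUT request for one batch of rows."""
    url = f"{base_url}/api/{database}/{table}/_stream_load"
    body = json.dumps(rows).encode("utf-8")
    headers = stream_load_headers(label, user, password)
    return urllib.request.Request(url, data=body, headers=headers, method="PUT")


request = build_stream_load_request(
    "http://starrocks-be:8040",
    "analytics",
    "orders",
    [{"id": 1, "customer_id": 100, "total": 99.99, "_cdc_deleted": False}],
    "ezcdc_orders_1705315200_001",
    "ezcdc_user",
    "password",
)
```

Passing the result to `urllib.request.urlopen` would send it. The sketch stops short of that because handling of `Expect: 100-continue` and of redirects differs between HTTP clients.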

Headers

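| Header              | Value     | Description                   |
|---------------------|-----------|-------------------------------|
| label               | Unique ID | Idempotent batch identifier   |
| format              | json      | Data format                   |
| strip_outer_array   | true      | JSON array input              |
| partial_update      | true      | Only update specified columns |
| partial_update_mode | row       | Row-level partial update      |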

Response

Success:

{
"TxnId": 12345,
"Label": "ezcdc_orders_1705315200_001",
"Status": "Success",
"Message": "OK",
"NumberTotalRows": 2,
"NumberLoadedRows": 2,
"NumberFilteredRows": 0,
"NumberUnselectedRows": 0,
"LoadBytes": 256,
"LoadTimeMs": 45
}

Failure:

{
"Status": "Fail",
"Message": "too many filtered rows",
"NumberTotalRows": 100,
"NumberFilteredRows": 50,
"ErrorURL": "http://be:8040/api/_load_error_log?..."
}
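
A caller might interpret these responses along the following lines. This is a sketch that relies only on the fields shown in the two examples above; the function name `summarize_load_response` is hypothetical.

```python
import json


def summarize_load_response(raw):
    """Return (ok, detail) for a Stream Load response body given as JSON text."""
    result = json.loads(raw)
    if result.get("Status") == "Success":
        loaded = result.get("NumberLoadedRows", 0)
        total = result.get("NumberTotalRows", 0)
        return True, f"loaded {loaded} of {total} rows"
    detail = result.get("Message", "unknown error")
    if "ErrorURL" in result:
        # The error log URL points at per-row failure details.
        detail += f" (details: {result['ErrorURL']})"
    return False, detail


failure = """{
  "Status": "Fail",
  "Message": "too many filtered rows",
  "NumberTotalRows": 100,
  "NumberFilteredRows": 50,
  "ErrorURL": "http://be:8040/api/_load_error_log?..."
}"""
ok, detail = summarize_load_response(failure)
```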

Label Management

Label Format

EZ-CDC generates unique labels:

ezcdc_{table}_{timestamp}_{sequence}

Examples:
- ezcdc_orders_1705315200_001
- ezcdc_orders_1705315200_002
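
A label in this format could be produced as in the sketch below. The three-digit zero padding of the sequence is inferred from the examples above, and `make_label` is a hypothetical helper rather than part of EZ-CDC.

```python
def make_label(table, timestamp, sequence):
    """Build a label of the form ezcdc_{table}_{timestamp}_{sequence}."""
    return f"ezcdc_{table}_{timestamp}_{sequence:03d}"


label = make_label("orders", 1705315200, 1)
```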

Idempotency

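Labels make retries idempotent. Once a load has succeeded under a label, StarRocks rejects any later load that reuses it, so resending a batch under its original label cannot load the data twice:

  1. First attempt: Label ezcdc_orders_001 - Success
  2. Retry (same label): Rejected as already used; no rows are loaded a second time

-- Check label status
SHOW LOAD WHERE LABEL = 'ezcdc_orders_001';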
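
The retry behaviour can be sketched as below. The `send` callable is hypothetical and stands in for whatever performs the HTTP request. The sketch also assumes StarRocks reports a reused label with `Status` set to "Label Already Exists" and an `ExistingJobStatus` field, and that a lost response surfaces as `ConnectionError`.

```python
def load_with_retry(send, label, body, max_attempts=3):
    """Send one batch, reusing the same label on every attempt.

    `send(label, body)` is a hypothetical callable returning the parsed
    response dict, or raising ConnectionError when the outcome is unknown.
    Returns "committed", "pending", "failed", or "unknown".
    """
    for _ in range(max_attempts):
        try:
            result = send(label, body)
        except ConnectionError:
            continue  # outcome unknown: retry with the SAME label
        status = result.get("Status")
        if status == "Success":
            return "committed"
        if status == "Label Already Exists":
            # An earlier attempt got through; do not resend under a new label.
            finished = result.get("ExistingJobStatus") == "FINISHED"
            return "committed" if finished else "pending"
        return "failed"
    return "unknown"
```

Keeping the label fixed across attempts is the point of the design: a retry after a lost response either loads the batch for the first time or is rejected as a duplicate.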

Performance Optimization

Batch Size

Configure batch size based on your workload:

| Events/second  | Recommended Batch Size |
|----------------|------------------------|
| < 100          | 1,000                  |
| 100 - 1,000    | 5,000                  |
| 1,000 - 10,000 | 10,000                 |
| > 10,000       | 20,000                 |

Flush Interval

Balance latency vs efficiency:

| Requirement        | Flush Interval |
|--------------------|----------------|
| Real-time (< 1s)   | 500ms          |
| Near real-time     | 2-5s           |
| Efficiency-focused | 10-30s         |
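
Batch size and flush interval interact: a batch is sent when either limit is reached first. The following sketch shows one way to combine them. The `Batcher` class, its parameters, and the `flush` callback are hypothetical names for illustration, not EZ-CDC configuration options.

```python
import time


class Batcher:
    """Buffer events and flush when the batch is full or old enough."""

    def __init__(self, flush, max_events=5_000, max_interval_s=2.0, clock=time.monotonic):
        self._flush = flush                # called with a list of events
        self._max_events = max_events
        self._max_interval_s = max_interval_s
        self._clock = clock                # injectable for testing
        self._events = []
        self._first_event_at = None

    def add(self, event):
        if not self._events:
            # The interval is measured from the oldest buffered event.
            self._first_event_at = self._clock()
        self._events.append(event)
        self._maybe_flush()

    def tick(self):
        """Call periodically so a quiet stream still flushes on time."""
        self._maybe_flush()

    def _maybe_flush(self):
        if not self._events:
            return
        full = len(self._events) >= self._max_events
        stale = self._clock() - self._first_event_at >= self._max_interval_s
        if full or stale:
            batch, self._events = self._events, []
            self._flush(batch)
```

With a small interval the time limit usually triggers first and latency stays low; with a large one the size limit dominates and each request carries more rows.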

Parallelism

For high throughput, EZ-CDC can use multiple Stream Load requests in parallel (per table).
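
One way to issue several loads concurrently is a thread pool, sketched below. The `send` callable and the `(label, rows)` batch shape are hypothetical, and the sketch does not address the order in which concurrent loads to the same table are committed.

```python
from concurrent.futures import ThreadPoolExecutor


def load_batches_in_parallel(send, batches, max_workers=4):
    """Submit each (label, rows) batch to `send` concurrently.

    `send(label, rows)` is a hypothetical callable that performs one
    Stream Load request. Results are returned in input order.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = [pool.submit(send, label, rows) for label, rows in batches]
        return [future.result() for future in futures]
```

Each batch needs its own label, and `max_workers` caps how many requests are in flight at once.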

Monitoring Stream Load

Active Loads

-- View the transaction behind a load (TxnId comes from the Stream Load response)
SHOW TRANSACTION WHERE id = 12345;

-- View recent load history
SHOW LOAD ORDER BY CreateTime DESC LIMIT 10;

Failed Loads

-- View failed loads
SHOW LOAD WHERE State = 'CANCELLED';

-- Get error details
SELECT * FROM information_schema.load_tracking_logs
WHERE label LIKE 'ezcdc_%'
AND state = 'CANCELLED'
ORDER BY create_time DESC;

Error Log

For detailed error information:

# Get error log URL from failed response
curl "http://be:8040/api/_load_error_log?file=error_log_xxx"

Troubleshooting

"Label already exists"

Error: Label [ezcdc_orders_001] has already been used

Cause: Duplicate load attempt.

Solutions:

  1. Check whether the previous load with this label succeeded. If it did, the data is already committed and the retry can be dropped safely.
  2. If failed, use a new label
  3. Wait for label to expire (default 3 days)

"Timeout"

Error: Stream Load timeout after 300 seconds

Solutions:

  1. Reduce batch size
  2. Increase the timeout, either per request with the timeout header or in the StarRocks config
  3. Check BE resource utilization

"Memory limit exceeded"

Error: Memory of process is overloaded

Solutions:

  1. Reduce batch size
  2. Increase BE memory
  3. Reduce parallel loads

"Filtered rows"

Error: too many filtered rows

Cause: Type conversion errors or data issues.

Solutions:

  1. Check error log URL for details
  2. Verify type mapping is correct
  3. Check for NULL in NOT NULL columns
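
The third check can also be done before a batch is sent. The sketch below flags explicit nulls in columns that the target table declares NOT NULL. The helper name and the list of NOT NULL columns are assumptions supplied by the caller, and columns absent from a row are ignored because partial updates leave them unchanged.

```python
def find_null_violations(rows, not_null_columns):
    """Return (row_index, column) pairs where a NOT NULL column is None."""
    violations = []
    for index, row in enumerate(rows):
        for column in not_null_columns:
            if column in row and row[column] is None:
                violations.append((index, column))
    return violations


rows = [
    {"id": 1, "customer_id": None, "_cdc_deleted": False},
    {"id": 2, "_cdc_deleted": True},
]
violations = find_null_violations(rows, ["id", "customer_id"])
```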

Stream Load vs Other Methods

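| Method       | Use Case           | Throughput |
|--------------|--------------------|------------|
| Stream Load  | Real-time CDC      | High       |
| Routine Load | Kafka consumption  | High       |
| Broker Load  | Batch from HDFS/S3 | Very High  |
| Insert       | Ad-hoc inserts     | Low        |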

EZ-CDC uses Stream Load because:

  • Best for real-time micro-batches
  • HTTP-based (no extra dependencies)
  • Supports UPSERT for CDC
  • Synchronous (know when committed)

Next Steps