StarRocks Stream Load
This guide explains how EZ-CDC uses StarRocks Stream Load for efficient data ingestion.
What is Stream Load?
Stream Load is StarRocks' HTTP-based data ingestion method:
- Synchronous: Returns after data is committed
- Transactional: All-or-nothing batch commits
- Efficient: Direct load to BE nodes
- Supports UPSERT: Inserts new rows and updates existing rows that share the same primary key
How EZ-CDC Uses Stream Load
Request Flow
EZ-CDC Worker Process:
- Collect batch of CDC events
- Transform to JSON format
- Send via HTTP PUT to Stream Load endpoint
Stream Load Request:
PUT /api/{database}/{table}/_stream_load
Headers:
- label: unique_batch_id
- format: json
- partial_update: true
Body: [{"id":1,"name":"John"}, ...]
Batch Processing
EZ-CDC batches events before sending:
CDC Events:
INSERT {id: 1, name: "John"}
UPDATE {id: 2, name: "Jane"}
INSERT {id: 3, name: "Bob"}
DELETE {id: 4}
↓ Transform ↓
Stream Load JSON:
[
{"id": 1, "name": "John", "_cdc_deleted": false},
{"id": 2, "name": "Jane", "_cdc_deleted": false},
{"id": 3, "name": "Bob", "_cdc_deleted": false},
{"id": 4, "_cdc_deleted": true}
]
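The transform above can be sketched in a few lines of Python. This is a minimal illustration, not EZ-CDC's actual implementation: the event shape (`op` and `row` keys) and the function name `to_stream_load_rows` are assumptions made for the example. Only the `_cdc_deleted` column comes from this guide.

```python
import json
from typing import Any


def to_stream_load_rows(events: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Convert CDC events into Stream Load rows with a soft-delete flag.

    Assumed (hypothetical) event shape: {"op": "INSERT"|"UPDATE"|"DELETE", "row": {...}}.
    """
    rows = []
    for event in events:
        row = dict(event["row"])  # copy so the input event is not mutated
        # INSERT and UPDATE both become upserts; DELETE only flips the flag.
        row["_cdc_deleted"] = event["op"] == "DELETE"
        rows.append(row)
    return rows


events = [
    {"op": "INSERT", "row": {"id": 1, "name": "John"}},
    {"op": "UPDATE", "row": {"id": 2, "name": "Jane"}},
    {"op": "INSERT", "row": {"id": 3, "name": "Bob"}},
    {"op": "DELETE", "row": {"id": 4}},
]
print(json.dumps(to_stream_load_rows(events), indent=1))
```

Because a DELETE row carries only the key and the flag, it relies on partial updates to leave the other columns of the existing row untouched.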
Delete Handling
EZ-CDC uses soft deletes:
-- Deleted rows have _cdc_deleted = true
SELECT * FROM orders WHERE _cdc_deleted = false;
This allows:
- Audit trail of deleted records
- Point-in-time queries
- Recovery if needed
Stream Load Request
HTTP Request
PUT /api/analytics/orders/_stream_load HTTP/1.1
Host: starrocks-be:8040
Authorization: Basic ZXpjZGNfdXNlcjpwYXNzd29yZA==
Content-Type: application/json
Expect: 100-continue
label: ezcdc_orders_1705315200_001
format: json
strip_outer_array: true
partial_update: true
partial_update_mode: row
[
{"id": 1, "customer_id": 100, "total": 99.99, "_cdc_deleted": false},
{"id": 2, "customer_id": 101, "total": 149.99, "_cdc_deleted": false}
]
Headers
| Header | Value | Description |
|---|---|---|
| label | Unique ID | Idempotent batch identifier |
| format | json | Data format |
| strip_outer_array | true | JSON array input |
| partial_update | true | Only update specified columns |
| partial_update_mode | row | Row-level partial update |
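As a rough sketch, the URL and headers shown above could be assembled as follows. The function name `build_stream_load_request` and its parameters are hypothetical; the header names and values are the ones listed in this guide. The credentials in the usage line are simply what the example Authorization header decodes to.

```python
import base64


def build_stream_load_request(host, database, table, label, user, password):
    """Return (url, headers) for a Stream Load PUT request.

    Hypothetical helper: it only builds the request, it does not send it.
    """
    url = f"http://{host}/api/{database}/{table}/_stream_load"
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    headers = {
        "Authorization": f"Basic {token}",
        "Content-Type": "application/json",
        "Expect": "100-continue",
        "label": label,                  # unique per batch, reused on retry
        "format": "json",
        "strip_outer_array": "true",     # body is a JSON array of rows
        "partial_update": "true",        # only the listed columns are written
        "partial_update_mode": "row",
    }
    return url, headers


url, headers = build_stream_load_request(
    "starrocks-be:8040", "analytics", "orders",
    "ezcdc_orders_1705315200_001", "ezcdc_user", "password",
)
print(url)
```

Keeping request construction separate from sending makes the headers easy to inspect in tests without a running cluster.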
Response
Success:
{
"TxnId": 12345,
"Label": "ezcdc_orders_1705315200_001",
"Status": "Success",
"Message": "OK",
"NumberTotalRows": 2,
"NumberLoadedRows": 2,
"NumberFilteredRows": 0,
"NumberUnselectedRows": 0,
"LoadBytes": 256,
"LoadTimeMs": 45
}
Failure:
{
"Status": "Fail",
"Message": "too many filtered rows",
"NumberTotalRows": 100,
"NumberFilteredRows": 50,
"ErrorURL": "http://be:8040/api/_load_error_log?..."
}
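A client typically needs only a few fields from these responses. The sketch below is one possible way to reduce a response to a small summary; the function name `summarize_response` and the shape of its return value are assumptions for illustration, while the field names are taken from the examples above.

```python
def summarize_response(resp: dict) -> dict:
    """Reduce a Stream Load JSON response to the fields a caller acts on."""
    total = resp.get("NumberTotalRows", 0)
    filtered = resp.get("NumberFilteredRows", 0)
    return {
        "ok": resp.get("Status") == "Success",
        "message": resp.get("Message", ""),
        # Share of rows rejected by StarRocks; 0.0 when no rows were sent.
        "filtered_ratio": filtered / total if total else 0.0,
        # Only present on failures that produced an error log.
        "error_url": resp.get("ErrorURL"),
    }


failure = {
    "Status": "Fail",
    "Message": "too many filtered rows",
    "NumberTotalRows": 100,
    "NumberFilteredRows": 50,
}
print(summarize_response(failure))
```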
Label Management
Label Format
EZ-CDC generates unique labels:
ezcdc_{table}_{timestamp}_{sequence}
Examples:
- ezcdc_orders_1705315200_001
- ezcdc_orders_1705315200_002
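A label in this format can be produced with a one-line formatter. The sketch below assumes the sequence is zero-padded to three digits, which is inferred from the examples rather than stated by EZ-CDC; `make_label` is a hypothetical name.

```python
def make_label(table: str, timestamp: int, sequence: int) -> str:
    """Build a label of the form ezcdc_{table}_{timestamp}_{sequence}."""
    # Three-digit zero padding is an assumption based on the examples above.
    return f"ezcdc_{table}_{timestamp}_{sequence:03d}"


print(make_label("orders", 1705315200, 1))
```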
Idempotency
Labels ensure exactly-once delivery:
- First attempt: Label ezcdc_orders_001
- Success
- Retry (same label): Returns existing result (no duplicate)
-- Check label status
SHOW LOAD WHERE LABEL = 'ezcdc_orders_001';
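The key to this guarantee is that a retry reuses the original label instead of generating a new one. The sketch below illustrates that pattern under stated assumptions: `send` is a hypothetical callable that performs the HTTP request and returns the parsed JSON response, and the `"Label Already Exists"` status string is what StarRocks is understood to return for a reused label; verify it against your StarRocks version.

```python
def load_with_retry(send, label, rows, max_attempts=3):
    """Send one batch, reusing the same label on every attempt."""
    for _ in range(max_attempts):
        try:
            resp = send(label, rows)
        except ConnectionError:
            # Outcome unknown (the load may have committed before the
            # connection dropped), so retry with the SAME label.
            continue
        status = resp.get("Status")
        if status == "Success":
            return resp
        if status == "Label Already Exists":
            # Assumed duplicate-label status: an earlier attempt already
            # used this label, so the batch must not be sent again.
            return resp
        raise RuntimeError(f"load {label} failed: {resp.get('Message')}")
    raise RuntimeError(f"load {label}: no response after {max_attempts} attempts")
```

A production version would likely add backoff between attempts and confirm the state of the earlier load before treating a duplicate label as committed.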
Performance Optimization
Batch Size
Configure batch size based on your workload:
| Events/second | Recommended Batch Size |
|---|---|
| < 100 | 1,000 |
| 100 - 1,000 | 5,000 |
| 1,000 - 10,000 | 10,000 |
| > 10,000 | 20,000 |
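The table can be encoded as a simple lookup. The tiers overlap at their edges (1,000 appears in two rows), so this sketch assumes each boundary value belongs to the higher tier, except 10,000, which stays in the third tier because the last row is a strict "greater than". The function name is hypothetical.

```python
def recommended_batch_size(events_per_second: float) -> int:
    """Map an event rate to the batch size suggested in the table above."""
    if events_per_second < 100:
        return 1_000
    if events_per_second < 1_000:
        return 5_000
    if events_per_second <= 10_000:
        return 10_000
    return 20_000


for rate in (50, 500, 5_000, 50_000):
    print(rate, recommended_batch_size(rate))
```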
Flush Interval
Balance latency vs efficiency:
| Requirement | Flush Interval |
|---|---|
| Real-time (< 1s) | 500ms |
| Near real-time | 2-5s |
| Efficiency-focused | 10-30s |
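Batch size and flush interval usually work together: a batch is sent when it is full or when its oldest row has waited long enough, whichever comes first. The class below is a minimal sketch of that policy, not EZ-CDC's implementation. `Batcher`, `max_rows`, `max_wait_s`, and the injectable `clock` are all names invented for the example.

```python
import time


class Batcher:
    """Buffer rows and flush on size or age, whichever is reached first."""

    def __init__(self, flush, max_rows=5_000, max_wait_s=2.0, clock=time.monotonic):
        self._flush = flush          # callable that receives a list of rows
        self._max_rows = max_rows
        self._max_wait_s = max_wait_s
        self._clock = clock          # injectable so tests can control time
        self._rows = []
        self._first_at = None        # arrival time of the oldest buffered row

    def add(self, row):
        if not self._rows:
            self._first_at = self._clock()
        self._rows.append(row)
        self.maybe_flush()

    def maybe_flush(self):
        """Flush if the buffer is full or stale; call periodically from a timer."""
        if not self._rows:
            return
        full = len(self._rows) >= self._max_rows
        stale = self._clock() - self._first_at >= self._max_wait_s
        if full or stale:
            rows, self._rows = self._rows, []
            self._flush(rows)
```

Measuring age from the oldest buffered row, rather than from the last flush, bounds the latency of every event by the flush interval.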
Parallelism
For high throughput, EZ-CDC can issue multiple Stream Load requests in parallel for each table.
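One way to parallelize loads for a single table is to split a batch by primary key, so that all rows for a given key travel in the same request. The sketch below assumes this partitioning scheme; it is not documented EZ-CDC behavior. `partition_by_key`, `load_in_parallel`, the `id` key column, and the `send` callable are hypothetical.

```python
import zlib
from concurrent.futures import ThreadPoolExecutor


def partition_by_key(rows, workers, key="id"):
    """Split rows into `workers` lists; rows with the same key share a list."""
    parts = [[] for _ in range(workers)]
    for row in rows:
        # crc32 is stable across processes, unlike Python's salted str hash.
        idx = zlib.crc32(str(row[key]).encode()) % workers
        parts[idx].append(row)
    return parts


def load_in_parallel(send, rows, workers=4):
    """Send each non-empty partition concurrently and collect the results."""
    parts = [p for p in partition_by_key(rows, workers) if p]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(send, parts))
```

Each parallel request needs its own unique label, and more concurrent requests increase memory pressure on the BE nodes.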
Monitoring Stream Load
Active Loads
-- View active Stream Load jobs
SHOW TRANSACTION;
-- View recent load history
SHOW LOAD ORDER BY CreateTime DESC LIMIT 10;
Failed Loads
-- View failed loads
SHOW LOAD WHERE State = 'CANCELLED';
-- Get error details
SELECT * FROM information_schema.load_tracking_logs
WHERE label LIKE 'ezcdc_%'
AND state = 'CANCELLED'
ORDER BY create_time DESC;
Error Log
For detailed error information:
# Get error log URL from failed response
curl "http://be:8040/api/_load_error_log?file=error_log_xxx"
Troubleshooting
"Label already exists"
Error: Label [ezcdc_orders_001] has already been used
Cause: Duplicate load attempt.
Solutions:
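- Check whether the previous load with this label succeeded; if it did, the batch is already committed and the error can be ignored (this is the idempotency guarantee at work)
- If the previous load failed, retry with a new label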
- Wait for label to expire (default 3 days)
"Timeout"
Error: Stream Load timeout after 300 seconds
Solutions:
- Reduce batch size
- Increase the timeout, either per request with the timeout header or in StarRocks config
- Check BE resource utilization
"Memory limit exceeded"
Error: Memory of process is overloaded
Solutions:
- Reduce batch size
- Increase BE memory
- Reduce parallel loads
"Filtered rows"
Error: too many filtered rows
Cause: Type conversion errors or data issues.
Solutions:
- Check error log URL for details
- Verify type mapping is correct
- Check for NULL in NOT NULL columns
Stream Load vs Other Methods
| Method | Use Case | Throughput |
|---|---|---|
| Stream Load | Real-time CDC | High |
| Routine Load | Kafka consumption | High |
| Broker Load | Batch from HDFS/S3 | Very High |
| Insert | Ad-hoc inserts | Low |
EZ-CDC uses Stream Load because:
- Best for real-time micro-batches
- HTTP-based (no extra dependencies)
- Supports UPSERT for CDC
- Synchronous (know when committed)