# GCP Worker Infrastructure
This guide covers the infrastructure components that run in your GCP project as part of an EZ-CDC deployment.
## Components Overview
EZ-CDC provisions the following resources in your GCP project:
| Resource | Type | Purpose |
|---|---|---|
| Instance Template | google_compute_instance_template | Defines worker VM configuration |
| Managed Instance Group | google_compute_region_instance_group_manager | Regional HA, auto-healing |
| Autoscaler | google_compute_region_autoscaler | CPU-based scaling |
| Health Check | google_compute_health_check | TCP check on port 50051 |
| Firewall Rules | google_compute_firewall | Egress-only security model |
| Service Account | google_service_account | Worker identity and permissions |
| Cloud NAT | google_compute_router_nat | Private egress (Cloud NAT mode only) |
## Instance Template

### Shielded VM

All worker instances use Shielded VM with hardware-based security:
```hcl
shielded_instance_config {
  enable_secure_boot          = true  # Verified boot chain
  enable_vtpm                 = true  # Virtual TPM for key storage
  enable_integrity_monitoring = true  # Runtime integrity verification
}
```
### Instance Specifications
| Setting | Value |
|---|---|
| Default machine type | c2-standard-8 (8 vCPU, 32 GB RAM) |
| Boot disk | 50 GB pd-ssd |
| Boot disk encryption | Google-managed (CMEK optional) |
| SSH access | Blocked (block-project-ssh-keys = true) |
| External IP | Ephemeral (Standard) or None (Cloud NAT) |
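The specifications above map onto the instance template roughly as follows. This is a sketch, not the exact resource EZ-CDC generates; resource and variable names here are illustrative, and field values mirror the defaults in the table:

```hcl
resource "google_compute_instance_template" "worker" {
  machine_type = "c2-standard-8"  # default; see the sizing table below

  disk {
    boot         = true
    disk_type    = "pd-ssd"
    disk_size_gb = 50
    # Google-managed encryption by default; set disk_encryption_key for CMEK
  }

  metadata = {
    block-project-ssh-keys = "true"  # no project-wide SSH keys on workers
  }

  shielded_instance_config {
    enable_secure_boot          = true
    enable_vtpm                 = true
    enable_integrity_monitoring = true
  }
}
```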
### Recommended Machine Types
| Machine Type | vCPU | Memory | Recommended Jobs |
|---|---|---|---|
| e2-standard-4 | 4 | 16 GB | 1-5 |
| c2-standard-8 | 8 | 32 GB | 5-15 |
| c2-standard-16 | 16 | 64 GB | 15-30 |
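Machine type is normally chosen at deploy time. Assuming the deployment exposes a Terraform variable for it (the name `worker_machine_type` below is hypothetical; check your deployment's `variables.tf`), sizing up for a 20-job workload might look like:

```hcl
# terraform.tfvars — variable name is illustrative
worker_machine_type = "c2-standard-16"  # 16 vCPU / 64 GB, suited to 15-30 jobs
```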
### Resource Usage
Each dbmazz daemon uses approximately:
| Resource | Usage |
|---|---|
| Memory | ~5 MB |
| CPU | Less than 5% idle, 10-25% active |
| Disk | Minimal (logs only) |
## Managed Instance Group (MIG)
Workers run in a regional Managed Instance Group for high availability:
```hcl
resource "google_compute_region_instance_group_manager" "workers" {
  name   = "ez-cdc-workers-{deployment-id}"
  region = "us-central1"
  # Distributes instances across zones in the region
  base_instance_name = "ez-cdc-wk-{deployment-id}"

  # Auto-healing: replace instances that fail health checks
  auto_healing_policies {
    health_check      = google_compute_health_check.worker.self_link
    initial_delay_sec = 300  # Wait 5 min for startup
  }

  # Rolling update: zero-downtime deployments
  update_policy {
    type                  = "PROACTIVE"
    minimal_action        = "REPLACE"
    max_surge_fixed       = 3
    max_unavailable_fixed = 0
  }
}
```
### Key Features
| Feature | Configuration | Description |
|---|---|---|
| Distribution | Regional (multi-zone) | Instances spread across zones in the region |
| Auto-healing | 300s initial delay | Replaces unhealthy instances |
| Update policy | PROACTIVE + REPLACE | Rolling updates, zero downtime |
| Max surge | 3 | Up to 3 new instances during updates |
| Max unavailable | 0 | No capacity loss during updates |
## Autoscaler
CPU-based autoscaling keeps worker capacity matched to load:
```hcl
resource "google_compute_region_autoscaler" "workers" {
  name   = "ez-cdc-as-{deployment-id}"
  target = google_compute_region_instance_group_manager.workers.self_link

  autoscaling_policy {
    min_replicas    = 2
    max_replicas    = 10
    cooldown_period = 120  # Prevent thrashing during CDC spikes

    cpu_utilization {
      target = 0.7  # Scale up at 70% CPU
    }
  }
}
```
| Setting | Default | Description |
|---|---|---|
| Min replicas | 2 | Minimum workers (configurable) |
| Max replicas | 10 | Maximum workers (configurable) |
| CPU target | 70% | Scale up threshold |
| Cooldown | 120s | Stability period between scaling actions |
## Health Check
Workers are monitored via TCP health checks on the gRPC port:
```hcl
resource "google_compute_health_check" "worker" {
  name                = "ez-cdc-hc-{deployment-id}"
  check_interval_sec  = 30
  timeout_sec         = 10
  healthy_threshold   = 2  # 2 consecutive successes → healthy
  unhealthy_threshold = 3  # 3 consecutive failures → unhealthy

  tcp_health_check {
    port = 50051  # Worker gRPC port
  }
}
```
Unhealthy instances are automatically replaced by the MIG auto-healing policy.
## Firewall Rules

EZ-CDC creates 7 firewall rules implementing a deny-all base with specific allows:

### Ingress Rules
| Rule | Priority | Source | Ports | Purpose |
|---|---|---|---|---|
| deny-all-ingress | 1000 | 0.0.0.0/0 | All | Block all inbound traffic |
| allow-health-check | 900 | 130.211.0.0/22, 35.191.0.0/16 | TCP 50051 | Google health check probes |
### Egress Rules
| Rule | Priority | Destination | Ports | Purpose |
|---|---|---|---|---|
| deny-all-egress | 1100 | 0.0.0.0/0 | All | Block all outbound (baseline) |
| allow-egress-https | 900 | 0.0.0.0/0 | TCP 443, 8443, 50051, 80 | Control-plane, GCS, metrics |
| allow-egress-pg | 900 | 0.0.0.0/0 | TCP 5432 | PostgreSQL sources |
| allow-egress-sr | 900 | 0.0.0.0/0 | TCP 9030, 8040 | StarRocks sinks |
| allow-egress-dns | 900 | 0.0.0.0/0 | TCP/UDP 53 | DNS resolution |
GCP has an implied allow-all egress at priority 65535. The explicit deny-all-egress at priority 1100 ensures only the specific allow rules (priority 900) pass traffic.
All rules are scoped to worker instances via network tags.
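The priority layering can be expressed in Terraform roughly like this. This is a sketch: the network tag `ez-cdc-worker`, the rule names, and `var.network` are illustrative stand-ins for values the deployment generates, and the PostgreSQL rule shown is just one of the five egress allows:

```hcl
# Baseline: block all outbound traffic from tagged workers
resource "google_compute_firewall" "deny_all_egress" {
  name        = "ez-cdc-deny-all-egress"
  network     = var.network
  direction   = "EGRESS"
  priority    = 1100                  # overrides GCP's implied allow at 65535
  target_tags = ["ez-cdc-worker"]

  deny {
    protocol = "all"
  }
}

# Specific allow: PostgreSQL sources; lower priority number wins, so this
# rule is evaluated before the deny above
resource "google_compute_firewall" "allow_egress_pg" {
  name        = "ez-cdc-allow-egress-pg"
  network     = var.network
  direction   = "EGRESS"
  priority    = 900
  target_tags = ["ez-cdc-worker"]

  allow {
    protocol = "tcp"
    ports    = ["5432"]
  }
}
```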
## Bootstrap Process
When a worker instance starts, the startup script performs the following steps:

```bash
#!/bin/bash
# 1. Download worker-agent binary from GCS
gsutil cp gs://ez-cdc-releases-gcp/worker-agent/{version}/worker-agent /usr/local/bin/
chmod +x /usr/local/bin/worker-agent

# 2. Download dbmazz binary
gsutil cp gs://ez-cdc-releases-gcp/dbmazz/{version}/dbmazz /usr/local/bin/
chmod +x /usr/local/bin/dbmazz

# 3. Configure environment
cat > /etc/ez-cdc/config.env << EOF
CONTROL_PLANE_ENDPOINT={control_plane_endpoint}
CONTROL_PLANE_HTTP_URL={control_plane_http_url}
BOOTSTRAP_TOKEN={bootstrap_token}
DEPLOYMENT_ID={deployment_id}
CONNECTIVITY_MODE={connectivity_mode}
WORKER_ID=$(curl -s -H "Metadata-Flavor: Google" \
  http://metadata.google.internal/computeMetadata/v1/instance/id)
EOF

# 4. Start worker-agent service
systemctl enable worker-agent
systemctl start worker-agent
```
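A startup script like this is typically wired into the instance template via metadata; GCE runs the value of the `startup-script` key on boot. The sketch below assumes the script is rendered with Terraform's `templatefile` and that the variable names shown exist, which may differ from what EZ-CDC actually generates:

```hcl
resource "google_compute_instance_template" "worker" {
  # ... machine type, disks, etc. ...

  # GCE executes the startup-script metadata value when the instance boots
  metadata = {
    startup-script = templatefile("${path.module}/startup.sh.tpl", {
      version       = var.worker_version
      deployment_id = var.deployment_id
    })
  }
}
```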
## Monitoring

### Cloud Logging

Worker logs are automatically sent to Cloud Logging:
```
resource.type="gce_instance"
labels."deployment-id"="{deployment-id}"
```
### Cloud Monitoring
Workers emit metrics visible in Cloud Monitoring:
| Metric | Description |
|---|---|
| CPU Utilization | CPU usage percentage |
| Memory Utilization | Memory usage percentage |
| Disk I/O | Read/write operations |
| Network Traffic | Inbound/outbound bytes |
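These metrics can also drive alerting. As one example, a CPU alert on worker instances might look like the following sketch; the display names, threshold, and duration are illustrative choices, not values EZ-CDC ships:

```hcl
resource "google_monitoring_alert_policy" "worker_cpu" {
  display_name = "EZ-CDC worker CPU high"
  combiner     = "OR"

  conditions {
    display_name = "CPU above 85% for 5 minutes"
    condition_threshold {
      filter          = "resource.type = \"gce_instance\" AND metric.type = \"compute.googleapis.com/instance/cpu/utilization\""
      comparison      = "COMPARISON_GT"
      threshold_value = 0.85
      duration        = "300s"
    }
  }
}
```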
## Instance Access
Workers block SSH by default. For debugging:

- Serial console: `gcloud compute instances get-serial-port-output INSTANCE_NAME`
- OS Login: can be enabled for emergency access
Enabling SSH increases attack surface. Use serial console or OS Login only when necessary.
## Cost Optimization

### Preemptible/Spot VMs
For non-critical workloads:
```hcl
scheduling {
  preemptible       = true   # Up to 80% cost savings
  automatic_restart = false  # Required when preemptible is enabled
}
```

Preemptible VMs can be terminated with 30 seconds' notice. Only use them for fault-tolerant workloads.
### Committed Use Discounts
For production workloads, consider GCP Committed Use Discounts for sustained savings.