# GCP Worker Infrastructure

This guide covers the infrastructure components that run in your GCP project as part of an EZ-CDC deployment.

## Components Overview

EZ-CDC provisions the following resources in your GCP project:

| Resource | Type | Purpose |
| --- | --- | --- |
| Instance Template | `google_compute_instance_template` | Defines worker VM configuration |
| Managed Instance Group | `google_compute_region_instance_group_manager` | Regional HA, auto-healing |
| Autoscaler | `google_compute_region_autoscaler` | CPU-based scaling |
| Health Check | `google_compute_health_check` | TCP check on port 50051 |
| Firewall Rules | `google_compute_firewall` | Egress-only security model |
| Service Account | `google_service_account` | Worker identity and permissions |
| Cloud NAT | `google_compute_router_nat` | Private egress (Cloud NAT mode only) |

## Instance Template

### Shielded VM

All worker instances use Shielded VM with hardware-based security:

```hcl
shielded_instance_config {
  enable_secure_boot          = true  # Verified boot chain
  enable_vtpm                 = true  # Virtual TPM for key storage
  enable_integrity_monitoring = true  # Runtime integrity verification
}
```

### Instance Specifications

| Setting | Value |
| --- | --- |
| Default machine type | `c2-standard-8` (8 vCPU, 32 GB RAM) |
| Boot disk | 50 GB `pd-ssd` |
| Boot disk encryption | Google-managed (CMEK optional) |
| SSH access | Blocked (`block-project-ssh-keys = true`) |
| External IP | Ephemeral (Standard) or None (Cloud NAT) |

Choose a machine type based on the number of CDC jobs the worker will run:

| Machine Type | vCPU | Memory | Recommended Jobs |
| --- | --- | --- | --- |
| `e2-standard-4` | 4 | 16 GB | 1-5 |
| `c2-standard-8` | 8 | 32 GB | 5-15 |
| `c2-standard-16` | 16 | 64 GB | 15-30 |
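If a deployment needs a different size, the machine type is typically exposed as a Terraform variable on the instance template. A minimal sketch — the `worker_machine_type` variable name and the surrounding template are illustrative, not the actual EZ-CDC module interface:

```hcl
# Illustrative variable; the real EZ-CDC module may use a different name.
variable "worker_machine_type" {
  type        = string
  default     = "c2-standard-8"  # Sized for 5-15 CDC jobs
  description = "Machine type for EZ-CDC worker instances"
}

resource "google_compute_instance_template" "worker" {
  name_prefix  = "ez-cdc-worker-"
  machine_type = var.worker_machine_type
  # ... boot disk, network, and shielded VM config as shown above
}
```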

### Resource Usage

Each dbmazz daemon uses approximately:

| Resource | Usage |
| --- | --- |
| Memory | ~5 MB |
| CPU | Less than 5% idle, 10-25% active |
| Disk | Minimal (logs only) |

## Managed Instance Group (MIG)

Workers run in a regional Managed Instance Group for high availability:

```hcl
resource "google_compute_region_instance_group_manager" "workers" {
  name   = "ez-cdc-workers-{deployment-id}"
  region = "us-central1"

  # Regional MIG: instances are distributed across zones in the region
  base_instance_name = "ez-cdc-wk-{deployment-id}"

  # Auto-healing: replace instances that fail health checks
  auto_healing_policies {
    health_check      = google_compute_health_check.worker.self_link
    initial_delay_sec = 300  # Wait 5 min for startup
  }

  # Rolling update: zero-downtime deployments
  update_policy {
    type                  = "PROACTIVE"
    minimal_action        = "REPLACE"
    max_surge_fixed       = 3
    max_unavailable_fixed = 0
  }
}
```

### Key Features

| Feature | Configuration | Description |
| --- | --- | --- |
| Distribution | Regional (multi-zone) | Instances spread across zones |
| Auto-healing | 300s initial delay | Replaces unhealthy instances |
| Update policy | PROACTIVE + REPLACE | Rolling updates, zero downtime |
| Max surge | 3 | Up to 3 new instances during updates |
| Max unavailable | 0 | No capacity loss during updates |

## Autoscaler

CPU-based autoscaling keeps worker capacity matched to load:

```hcl
resource "google_compute_region_autoscaler" "workers" {
  name   = "ez-cdc-as-{deployment-id}"
  target = google_compute_region_instance_group_manager.workers.self_link

  autoscaling_policy {
    min_replicas    = 2
    max_replicas    = 10
    cooldown_period = 120  # Prevent thrashing during CDC spikes

    cpu_utilization {
      target = 0.7  # Scale up at 70% CPU
    }
  }
}
```
| Setting | Default | Description |
| --- | --- | --- |
| Min replicas | 2 | Minimum workers (configurable) |
| Max replicas | 10 | Maximum workers (configurable) |
| CPU target | 70% | Scale-up threshold |
| Cooldown | 120s | Stability period between scaling actions |
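If scale-in after a CDC burst removes capacity too aggressively, the GCP autoscaler also supports scale-in controls. A sketch of the same policy with scale-in limited to one instance per 10-minute window — this is not part of the default EZ-CDC configuration, only an option the underlying resource supports:

```hcl
autoscaling_policy {
  min_replicas    = 2
  max_replicas    = 10
  cooldown_period = 120

  cpu_utilization {
    target = 0.7
  }

  # Remove at most 1 instance per 10-minute window
  scale_in_control {
    max_scaled_in_replicas {
      fixed = 1
    }
    time_window_sec = 600
  }
}
```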

## Health Check

Workers are monitored via TCP health checks on the gRPC port:

```hcl
resource "google_compute_health_check" "worker" {
  name = "ez-cdc-hc-{deployment-id}"

  check_interval_sec  = 30
  timeout_sec         = 10
  healthy_threshold   = 2  # 2 consecutive successes → healthy
  unhealthy_threshold = 3  # 3 consecutive failures → unhealthy

  tcp_health_check {
    port = 50051  # Worker gRPC port
  }
}
```

Unhealthy instances are automatically replaced by the MIG auto-healing policy.
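A plain TCP check only verifies that the port accepts connections. If the worker implements the standard gRPC Health Checking Protocol, the probe could be tightened to an application-level check — a sketch; whether the EZ-CDC worker actually serves `grpc.health.v1.Health` is an assumption:

```hcl
resource "google_compute_health_check" "worker_grpc" {
  name = "ez-cdc-hc-grpc-{deployment-id}"

  check_interval_sec  = 30
  timeout_sec         = 10
  healthy_threshold   = 2
  unhealthy_threshold = 3

  # Requires the worker to serve the gRPC Health Checking Protocol
  grpc_health_check {
    port = 50051
  }
}
```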

## Firewall Rules

EZ-CDC creates 7 firewall rules implementing a deny-all base with specific allows:

### Ingress Rules

| Rule | Priority | Source | Ports | Purpose |
| --- | --- | --- | --- | --- |
| `deny-all-ingress` | 1000 | 0.0.0.0/0 | All | Block all inbound traffic |
| `allow-health-check` | 900 | 130.211.0.0/22, 35.191.0.0/16 | TCP 50051 | Google health check probes |

### Egress Rules

| Rule | Priority | Destination | Ports | Purpose |
| --- | --- | --- | --- | --- |
| `deny-all-egress` | 1100 | 0.0.0.0/0 | All | Block all outbound (baseline) |
| `allow-egress-https` | 900 | 0.0.0.0/0 | TCP 443, 8443, 50051, 80 | Control plane, GCS, metrics |
| `allow-egress-pg` | 900 | 0.0.0.0/0 | TCP 5432 | PostgreSQL sources |
| `allow-egress-sr` | 900 | 0.0.0.0/0 | TCP 9030, 8040 | StarRocks sinks |
| `allow-egress-dns` | 900 | 0.0.0.0/0 | TCP/UDP 53 | DNS resolution |
**Priority Model:** GCP has an implied allow-all egress rule at priority 65535. The explicit `deny-all-egress` rule at priority 1100 ensures that only the specific allow rules (priority 900) pass traffic.

All rules are scoped to worker instances via network tags.
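As an illustration of the priority model, a deny-all egress rule like the one above might be declared as follows — the network name and the `ez-cdc-worker` tag are placeholders, not the names EZ-CDC actually uses:

```hcl
resource "google_compute_firewall" "deny_all_egress" {
  name      = "ez-cdc-deny-all-egress-{deployment-id}"
  network   = "default"  # Placeholder; use the deployment's VPC
  direction = "EGRESS"
  priority  = 1100       # Overridden by allow rules at priority 900

  denied {
    protocol = "all"
  }

  destination_ranges = ["0.0.0.0/0"]
  target_tags        = ["ez-cdc-worker"]  # Placeholder network tag
}
```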

## Bootstrap Process

When a worker instance starts, the startup script:

```bash
#!/bin/bash
# 1. Download worker-agent binary from GCS
gsutil cp gs://ez-cdc-releases-gcp/worker-agent/{version}/worker-agent /usr/local/bin/
chmod +x /usr/local/bin/worker-agent

# 2. Download dbmazz binary
gsutil cp gs://ez-cdc-releases-gcp/dbmazz/{version}/dbmazz /usr/local/bin/
chmod +x /usr/local/bin/dbmazz

# 3. Configure environment
cat > /etc/ez-cdc/config.env << EOF
CONTROL_PLANE_ENDPOINT={control_plane_endpoint}
CONTROL_PLANE_HTTP_URL={control_plane_http_url}
BOOTSTRAP_TOKEN={bootstrap_token}
DEPLOYMENT_ID={deployment_id}
CONNECTIVITY_MODE={connectivity_mode}
WORKER_ID=$(curl -s -H "Metadata-Flavor: Google" \
  http://metadata.google.internal/computeMetadata/v1/instance/id)
EOF

# 4. Start worker-agent service
systemctl enable worker-agent
systemctl start worker-agent
```
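In Terraform, a startup script like this is typically attached to the instance template via `metadata_startup_script`. A sketch, assuming the script is kept as a template file — the file name and variable names are illustrative, not the EZ-CDC module's actual interface:

```hcl
resource "google_compute_instance_template" "worker" {
  # ... machine type, disks, and network as above

  # Rendered at plan time; placeholders like {version} become template vars
  metadata_startup_script = templatefile("${path.module}/startup.sh.tpl", {
    version                = var.worker_version
    control_plane_endpoint = var.control_plane_endpoint
    deployment_id          = var.deployment_id
  })
}
```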

## Monitoring

### Cloud Logging

Worker logs are automatically sent to Cloud Logging:

```
resource.type="gce_instance"
labels."deployment-id"="{deployment-id}"
```
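The same filter can also drive a log sink if you want to export worker logs, for example to a GCS bucket — a sketch; the sink name and bucket are illustrative and not created by EZ-CDC:

```hcl
resource "google_logging_project_sink" "worker_logs" {
  name        = "ez-cdc-worker-logs-{deployment-id}"
  destination = "storage.googleapis.com/my-worker-log-bucket"  # Illustrative bucket

  filter = <<-EOT
    resource.type="gce_instance"
    labels."deployment-id"="{deployment-id}"
  EOT

  unique_writer_identity = true
}
```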

### Cloud Monitoring

Workers emit metrics visible in Cloud Monitoring:

| Metric | Description |
| --- | --- |
| CPU Utilization | CPU usage percentage |
| Memory Utilization | Memory usage percentage |
| Disk I/O | Read/write operations |
| Network Traffic | Inbound/outbound bytes |

## Instance Access

Workers block SSH by default. For debugging:

- Serial console: `gcloud compute instances get-serial-port-output INSTANCE_NAME`
- OS Login: can be enabled for emergency access
**Warning:** Enabling SSH increases the attack surface. Use the serial console or OS Login only when necessary.

## Cost Optimization

### Preemptible/Spot VMs

For non-critical workloads:

```hcl
scheduling {
  preemptible       = true   # Up to 80% cost savings
  automatic_restart = false  # Required when preemptible = true
}
```
**Caution:** Preemptible VMs can be terminated with 30 seconds' notice. Use them only for fault-tolerant workloads.

### Committed Use Discounts

For production workloads, consider GCP Committed Use Discounts for sustained savings.
