# GCP Worker Infrastructure

This guide covers the infrastructure components that run in your GCP project as part of an EZ-CDC deployment.

## Components Overview

EZ-CDC provisions the following resources in your GCP project:

| Resource | Type | Purpose |
| --- | --- | --- |
| Instance Template | `google_compute_instance_template` | Defines worker VM configuration |
| Managed Instance Group | `google_compute_region_instance_group_manager` | Regional HA, auto-healing |
| Autoscaler | `google_compute_region_autoscaler` | CPU-based scaling |
| Health Check | `google_compute_health_check` | TCP check on port 50051 |
| Firewall Rules | `google_compute_firewall` | Egress-only security model |
| Service Account | `google_service_account` | Worker identity and permissions |
| Cloud NAT | `google_compute_router_nat` | Private egress (Cloud NAT mode only) |

## Instance Template

### Shielded VM

All worker instances use Shielded VM with hardware-based security:

```hcl
shielded_instance_config {
  enable_secure_boot          = true  # Verified boot chain
  enable_vtpm                 = true  # Virtual TPM for key storage
  enable_integrity_monitoring = true  # Runtime integrity verification
}
```

### Instance Specifications

| Setting | Value |
| --- | --- |
| Default machine type | `c2-standard-8` (8 vCPU, 32 GB RAM) |
| Boot disk | 50 GB `pd-ssd` |
| Boot disk encryption | Google-managed (CMEK optional) |
| SSH access | Blocked (`block-project-ssh-keys = true`) |
| External IP | Ephemeral (Standard) or None (Cloud NAT) |

Choose a machine type based on the number of CDC jobs the worker will run:

| Machine Type | vCPU | Memory | Recommended Jobs |
| --- | --- | --- | --- |
| `e2-standard-4` | 4 | 16 GB | 1-5 |
| `c2-standard-8` | 8 | 32 GB | 5-15 |
| `c2-standard-16` | 16 | 64 GB | 15-30 |
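If a deployment needs a different size, the machine type is typically exposed as a Terraform variable on the instance template. A minimal sketch — the `worker_machine_type` variable name and the surrounding template are illustrative, not the actual EZ-CDC module interface:

```hcl
# Illustrative variable; the real EZ-CDC module may use a different name.
variable "worker_machine_type" {
  type        = string
  default     = "c2-standard-8"  # Sized for 5-15 CDC jobs
  description = "Machine type for EZ-CDC worker instances"
}

resource "google_compute_instance_template" "worker" {
  name_prefix  = "ez-cdc-worker-"
  machine_type = var.worker_machine_type
  # ... boot disk, network, and shielded VM config as shown above
}
```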

### Resource Usage

Each dbmazz daemon uses approximately:

| Resource | Usage |
| --- | --- |
| Memory | ~5 MB |
| CPU | Less than 5% idle, 10-25% active |
| Disk | Minimal (logs only) |

## Managed Instance Group (MIG)

Workers run in a regional Managed Instance Group for high availability:

```hcl
resource "google_compute_region_instance_group_manager" "workers" {
  name   = "ez-cdc-workers-{deployment-id}"
  region = "us-central1"

  # Regional MIG: instances are distributed across zones in the region
  base_instance_name = "ez-cdc-wk-{deployment-id}"

  # Auto-healing: replace instances that fail health checks
  auto_healing_policies {
    health_check      = google_compute_health_check.worker.self_link
    initial_delay_sec = 300  # Wait 5 min for startup
  }

  # Rolling update: zero-downtime deployments
  update_policy {
    type                  = "PROACTIVE"
    minimal_action        = "REPLACE"
    max_surge_fixed       = 3
    max_unavailable_fixed = 0
  }
}
```

### Key Features

| Feature | Configuration | Description |
| --- | --- | --- |
| Distribution | Regional (multi-zone) | Instances spread across zones |
| Auto-healing | 300s initial delay | Replaces unhealthy instances |
| Update policy | PROACTIVE + REPLACE | Rolling updates, zero downtime |
| Max surge | 3 | Up to 3 new instances during updates |
| Max unavailable | 0 | No capacity loss during updates |

## Autoscaler

CPU-based autoscaling keeps worker capacity matched to load:

```hcl
resource "google_compute_region_autoscaler" "workers" {
  name   = "ez-cdc-as-{deployment-id}"
  target = google_compute_region_instance_group_manager.workers.self_link

  autoscaling_policy {
    min_replicas    = 2
    max_replicas    = 10
    cooldown_period = 120  # Prevent thrashing during CDC spikes

    cpu_utilization {
      target = 0.7  # Scale up at 70% CPU
    }
  }
}
```
| Setting | Default | Description |
| --- | --- | --- |
| Min replicas | 2 | Minimum workers (configurable) |
| Max replicas | 10 | Maximum workers (configurable) |
| CPU target | 70% | Scale-up threshold |
| Cooldown | 120s | Stability period between scaling actions |
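If scale-in after a CDC burst removes capacity too aggressively, the GCP autoscaler also supports scale-in controls. A sketch of the same policy with scale-in limited to one instance per 10-minute window — this is not part of the default EZ-CDC configuration, only an option the underlying resource supports:

```hcl
autoscaling_policy {
  min_replicas    = 2
  max_replicas    = 10
  cooldown_period = 120

  cpu_utilization {
    target = 0.7
  }

  # Remove at most 1 instance per 10-minute window
  scale_in_control {
    max_scaled_in_replicas {
      fixed = 1
    }
    time_window_sec = 600
  }
}
```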

## Health Check

Workers are monitored via TCP health checks on the gRPC port:

```hcl
resource "google_compute_health_check" "worker" {
  name = "ez-cdc-hc-{deployment-id}"

  check_interval_sec  = 30
  timeout_sec         = 10
  healthy_threshold   = 2  # 2 consecutive successes → healthy
  unhealthy_threshold = 3  # 3 consecutive failures → unhealthy

  tcp_health_check {
    port = 50051  # Worker gRPC port
  }
}
```

Unhealthy instances are automatically replaced by the MIG auto-healing policy.
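A plain TCP check only verifies that the port accepts connections. If the worker implements the standard gRPC Health Checking Protocol, the probe could be tightened to an application-level check — a sketch; whether the EZ-CDC worker actually serves `grpc.health.v1.Health` is an assumption:

```hcl
resource "google_compute_health_check" "worker_grpc" {
  name = "ez-cdc-hc-grpc-{deployment-id}"

  check_interval_sec  = 30
  timeout_sec         = 10
  healthy_threshold   = 2
  unhealthy_threshold = 3

  # Requires the worker to serve the gRPC Health Checking Protocol
  grpc_health_check {
    port = 50051
  }
}
```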

## Firewall Rules

EZ-CDC creates 7 firewall rules implementing a deny-all base with specific allows:

### Ingress Rules

| Rule | Priority | Source | Ports | Purpose |
| --- | --- | --- | --- | --- |
| `deny-all-ingress` | 1000 | 0.0.0.0/0 | All | Block all inbound traffic |
| `allow-health-check` | 900 | 130.211.0.0/22, 35.191.0.0/16 | TCP 50051 | Google health check probes |

### Egress Rules

| Rule | Priority | Destination | Ports | Purpose |
| --- | --- | --- | --- | --- |
| `deny-all-egress` | 1100 | 0.0.0.0/0 | All | Block all outbound (baseline) |
| `allow-egress-https` | 900 | 0.0.0.0/0 | TCP 443, 8443, 50051, 80 | Control plane, GCS, metrics |
| `allow-egress-pg` | 900 | 0.0.0.0/0 | TCP 5432 | PostgreSQL sources |
| `allow-egress-sr` | 900 | 0.0.0.0/0 | TCP 9030, 8040 | StarRocks sinks |
| `allow-egress-dns` | 900 | 0.0.0.0/0 | TCP/UDP 53 | DNS resolution |
**Priority Model:** GCP has an implied allow-all egress rule at priority 65535. The explicit `deny-all-egress` rule at priority 1100 ensures that only the specific allow rules (priority 900) pass traffic.

All rules are scoped to worker instances via network tags.
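As an illustration of the priority model, a deny-all egress rule like the one above might be declared as follows — the network name and the `ez-cdc-worker` tag are placeholders, not the names EZ-CDC actually uses:

```hcl
resource "google_compute_firewall" "deny_all_egress" {
  name      = "ez-cdc-deny-all-egress-{deployment-id}"
  network   = "default"  # Placeholder; use the deployment's VPC
  direction = "EGRESS"
  priority  = 1100       # Overridden by allow rules at priority 900

  denied {
    protocol = "all"
  }

  destination_ranges = ["0.0.0.0/0"]
  target_tags        = ["ez-cdc-worker"]  # Placeholder network tag
}
```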

## Bootstrap Process

When a worker instance starts, the startup script:

```bash
#!/bin/bash
# 1. Download worker-agent binary from GCS
gsutil cp gs://ez-cdc-releases-gcp/worker-agent/{version}/worker-agent /usr/local/bin/
chmod +x /usr/local/bin/worker-agent

# 2. Download dbmazz binary
gsutil cp gs://ez-cdc-releases-gcp/dbmazz/{version}/dbmazz /usr/local/bin/
chmod +x /usr/local/bin/dbmazz

# 3. Configure environment
cat > /etc/ez-cdc/config.env << EOF
CONTROL_PLANE_ENDPOINT={control_plane_endpoint}
CONTROL_PLANE_HTTP_URL={control_plane_http_url}
BOOTSTRAP_TOKEN={bootstrap_token}
DEPLOYMENT_ID={deployment_id}
CONNECTIVITY_MODE={connectivity_mode}
WORKER_ID=$(curl -s -H "Metadata-Flavor: Google" \
  http://metadata.google.internal/computeMetadata/v1/instance/id)
EOF

# 4. Start worker-agent service
systemctl enable worker-agent
systemctl start worker-agent
```
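In Terraform, a startup script like this is typically attached to the instance template via `metadata_startup_script`. A sketch, assuming the script is kept as a template file — the file name and variable names are illustrative, not the EZ-CDC module's actual interface:

```hcl
resource "google_compute_instance_template" "worker" {
  # ... machine type, disks, and network as above

  # Rendered at plan time; placeholders like {version} become template vars
  metadata_startup_script = templatefile("${path.module}/startup.sh.tpl", {
    version                = var.worker_version
    control_plane_endpoint = var.control_plane_endpoint
    deployment_id          = var.deployment_id
  })
}
```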

## Monitoring

### Cloud Logging

Worker logs are automatically sent to Cloud Logging:

```
resource.type="gce_instance"
labels."deployment-id"="{deployment-id}"
```
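The same filter can also drive a log sink if you want to export worker logs, for example to a GCS bucket — a sketch; the sink name and bucket are illustrative and not created by EZ-CDC:

```hcl
resource "google_logging_project_sink" "worker_logs" {
  name        = "ez-cdc-worker-logs-{deployment-id}"
  destination = "storage.googleapis.com/my-worker-log-bucket"  # Illustrative bucket

  filter = <<-EOT
    resource.type="gce_instance"
    labels."deployment-id"="{deployment-id}"
  EOT

  unique_writer_identity = true
}
```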

### Cloud Monitoring

Workers emit metrics visible in Cloud Monitoring:

| Metric | Description |
| --- | --- |
| CPU Utilization | CPU usage percentage |
| Memory Utilization | Memory usage percentage |
| Disk I/O | Read/write operations |
| Network Traffic | Inbound/outbound bytes |

## Instance Access

Workers block SSH by default. For debugging:

- Serial console: `gcloud compute instances get-serial-port-output INSTANCE_NAME`
- OS Login: can be enabled for emergency access
**Warning:** Enabling SSH increases the attack surface. Use the serial console or OS Login only when necessary.

## Cost Optimization

### Preemptible/Spot VMs

For non-critical workloads:

```hcl
scheduling {
  preemptible       = true   # Up to 80% cost savings
  automatic_restart = false  # Required when preemptible = true
}
```
**Caution:** Preemptible VMs can be terminated with 30 seconds' notice. Use them only for fault-tolerant workloads.

### Committed Use Discounts

For production workloads, consider GCP Committed Use Discounts for sustained savings.
