Cloud Platform Engineering

Terraform-Managed Internal Load Balancing for Cloud Run

How we provision a Regional Internal HTTPS Load Balancer — complete with Serverless NEGs, URL Maps, and SSL certificate rotation — across DEV, UAT, and PROD using a single Terraform codebase.

The Routing Problem

When you have 15+ Cloud Run services that need to be reachable from internal VPC clients, you can't just use Cloud Run's default URLs. You need stable, predictable hostnames (kb.ai-platform.example.internal, agent.ai-platform.example.internal) that route through a centralized Layer 7 load balancer. This requires stitching together Forwarding Rules, Target HTTPS Proxies, URL Maps, Backend Services, and Serverless Network Endpoint Groups (SNEGs) — all of which need to be consistent across three environments.

Data-Driven Terraform via Map Iteration

Our approach: a single main.tf that uses for_each over variable maps to create every resource dynamically. Environment differences live entirely in .tfvars files. Adding a new service to the load balancer takes only a few lines in the tfvars: one entry each in backend_services, host_rules, and path_matchers (the excerpt below shows the first two). Because resources are generated from data rather than copied per service, the configuration avoids template duplication and stays readable.

hcl
# envs/dev.tfvars — Adding a new service is just a few lines
backend_services = {
  "kb-service" = {
    cloud_run_service = "my-kb-service"
  }
  "ai-agent" = {
    cloud_run_service = "my-ai-agent"
  }
  "llm-gateway" = {
    cloud_run_service = "my-llm-gateway"
  }
  "otel-collector" = {
    cloud_run_service = "my-otel-collector"
  }
}

host_rules = {
  "kb"        = "kb-matcher"
  "agent"     = "agent-matcher"
  "llm"       = "llm-matcher"
  "otel"      = "otel-matcher"
}
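For these tfvars to load, main.tf needs matching variable declarations. The following is a minimal sketch of what they might look like. The descriptions, the validation rule, the `domain_suffix` variable, and the `hostnames` local are assumptions made for illustration (including the idea that host_rules keys are short labels expanded into full hostnames) and are not taken from the actual codebase.

```hcl
# Sketch of variable declarations matching the tfvars shape shown above.
variable "backend_services" {
  description = "Logical backend name => Cloud Run service it fronts."
  type = map(object({
    cloud_run_service = string
  }))

  # Assumed guard: keys are reused as resource-name prefixes, so keep them
  # lowercase and starting with a letter.
  validation {
    condition = alltrue([
      for key in keys(var.backend_services) : can(regex("^[a-z][-a-z0-9]*$", key))
    ])
    error_message = "Backend service keys must start with a lowercase letter and contain only lowercase letters, digits, and hyphens."
  }
}

variable "host_rules" {
  description = "Host label => name of the path matcher that handles it."
  type        = map(string)
}

# Hypothetical: the shared suffix that turns "kb" into a full hostname.
variable "domain_suffix" {
  type    = string
  default = "ai-platform.example.internal"
}

locals {
  # "kb" => "kb-matcher" becomes
  # "kb.ai-platform.example.internal" => "kb-matcher"
  hostnames = {
    for host, matcher in var.host_rules :
    "${host}.${var.domain_suffix}" => matcher
  }
}
```

Typing the maps this way means a misspelled attribute in a tfvars file fails at plan time instead of producing a half-configured load balancer.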
Serverless NEGs and Backend Services

The Terraform uses google_compute_region_network_endpoint_group with network_endpoint_type = "SERVERLESS" to create SNEGs pointing at each Cloud Run service. A random suffix plus the first four characters of an MD5 hash of the Cloud Run service name prevents naming collisions, while the create_before_destroy lifecycle rule ensures zero-downtime updates when a NEG has to be recreated. This matters because Terraform's default order is destroy first, then create: it would try to delete a NEG that a backend service still references, and the routing path depends on that NEG until its replacement exists.

hcl
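resource "google_compute_region_network_endpoint_group" "serverless_negs" {
  for_each = var.backend_services

  # Key + shared random suffix + first 4 hex chars of the service-name MD5.
  name                  = "${each.key}-sneg-${random_id.random_suffix.hex}-${substr(md5(each.value.cloud_run_service), 0, 4)}"
  region                = var.region # region of the NEG; var.region is an assumed variable name
  network_endpoint_type = "SERVERLESS"

  cloud_run {
    service = each.value.cloud_run_service
  }

  lifecycle {
    # Build the replacement NEG before the old one is destroyed.
    create_before_destroy = true
  }
}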
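The NEG name references random_id.random_suffix, which is not shown in the excerpt. A minimal sketch of how it might be defined follows, together with a local that computes the same names so they can be inspected. The byte_length value, the `sneg_names` local, and the output are assumptions for illustration; var.backend_services is the map from the tfvars.

```hcl
# Hypothetical definition of the suffix referenced in the NEG name.
# Requires the hashicorp/random provider.
resource "random_id" "random_suffix" {
  byte_length = 2 # assumed length: 2 bytes render as 4 hex characters
}

locals {
  # Same expression as the NEG name, computed per service so it can be
  # checked through an output or `terraform console`.
  sneg_names = {
    for key, svc in var.backend_services :
    key => "${key}-sneg-${random_id.random_suffix.hex}-${substr(md5(svc.cloud_run_service), 0, 4)}"
  }
}

output "sneg_names" {
  description = "Generated Serverless NEG name for each backend key."
  value       = local.sneg_names
}
```

Because the MD5 fragment is derived from cloud_run_service, pointing a key at a different Cloud Run service also changes the NEG name. That is what lets the replacement coexist with the old NEG while create_before_destroy is in effect, since two NEGs in the same region cannot share a name.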
Internal Load Balancer Architecture
flowchart LR
    Client[Internal VPC Client] --> FWD[Forwarding Rule :443]
    FWD --> Proxy[Target HTTPS Proxy<br/>SSL Termination]
    Proxy --> URLMap[URL Map]
    URLMap -- "kb.*" --> BE1[Backend Service]
    URLMap -- "agent.*" --> BE2[Backend Service]
    BE1 --> SNEG1[Serverless NEG]
    BE2 --> SNEG2[Serverless NEG]
    SNEG1 --> CR1[Cloud Run: KB]
    SNEG2 --> CR2[Cloud Run: Agent]
HTTP-to-HTTPS Redirect

An optional enable_http_redirect flag provisions a separate URL Map, HTTP Proxy, and Forwarding Rule on port 80 that return a 301 redirect to HTTPS. The flag is set per environment: in DEV and UAT it is enabled for convenience, and in PROD it is mandatory for compliance. Any client that connects over plain HTTP is redirected to the HTTPS endpoint rather than being served over an unencrypted connection.
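The redirect resources described above can be sketched roughly as follows. This is an illustration under assumptions, not the actual module: the resource names, the `region` variable, and the use of count to toggle the resources are hypothetical, and the port 80 forwarding rule is left out because it depends on network and subnet details not covered here.

```hcl
variable "enable_http_redirect" {
  description = "Provision the port 80 redirect resources when true."
  type        = bool
  default     = false
}

variable "region" {
  type = string # assumed variable name
}

# URL map whose only job is to answer every request with a 301 to HTTPS.
resource "google_compute_region_url_map" "http_redirect" {
  count  = var.enable_http_redirect ? 1 : 0
  name   = "http-redirect" # hypothetical name
  region = var.region

  default_url_redirect {
    https_redirect         = true
    redirect_response_code = "MOVED_PERMANENTLY_DEFAULT" # HTTP 301
    strip_query            = false
  }
}

# HTTP proxy that fronts the redirect-only URL map.
resource "google_compute_region_target_http_proxy" "http_redirect" {
  count   = var.enable_http_redirect ? 1 : 0
  name    = "http-redirect-proxy" # hypothetical name
  region  = var.region
  url_map = google_compute_region_url_map.http_redirect[0].id
}
```

With this shape, each environment's tfvars only needs a single `enable_http_redirect = true` line to turn the redirect on.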

Lessons Learned: SNEG Naming and State Drift

A major lesson concerned state drift around Serverless NEGs. If someone manually deletes or changes a Cloud Run service in the GCP Console, the SNEG is left in a broken state, and Terraform fails to apply updates because the underlying service target is missing or no longer matches what Terraform expects. We addressed this in two ways: strict IAM policies that restrict manual console edits, and a daily cron job that runs terraform plan to detect infrastructure drift and alert on it.
