Terraform-Managed Internal Load Balancing for Cloud Run

The Routing Problem

When you have 15+ Cloud Run services that need to be reachable from internal VPC clients, you can't just use Cloud Run's default URLs. You need stable, predictable hostnames (kb.ai-platform.example.internal, agent.ai-platform.example.internal) that route through a centralized Layer 7 load balancer. This requires stitching together Forwarding Rules, Target HTTPS Proxies, URL Maps, Backend Services, and Serverless Network Endpoint Groups (SNEGs), all of which need to stay consistent across three environments (DEV, UAT, and PROD).

Data-Driven Terraform via Map Iteration

Our approach: a single main.tf that uses for_each over variable maps to create every resource. Environment differences live entirely in .tfvars files. Adding a new service to the load balancer means adding one entry each to backend_services, host_rules, and path_matchers in the tfvars, which comes to a few lines in total. Because every environment shares the same resource definitions, nothing is copied per service or per environment, and the configuration stays readable.

hcl

# envs/dev.tfvars — Adding a new service is just a few lines
backend_services = {
  "kb-service" = {
    cloud_run_service = "my-kb-service"
  }
  "ai-agent" = {
    cloud_run_service = "my-ai-agent"
  }
  "llm-gateway" = {
    cloud_run_service = "my-llm-gateway"
  }
  "otel-collector" = {
    cloud_run_service = "my-otel-collector"
  }
}

host_rules = {
  "kb"        = "kb-matcher"
  "agent"     = "agent-matcher"
  "llm"       = "llm-matcher"
  "otel"      = "otel-matcher"
}
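For the tfvars above to load, main.tf needs matching variable declarations. The following is a minimal sketch with types inferred from the example values, not our actual variables file. The validation rule is an addition for illustration: the backend key becomes the first part of the NEG name, and GCP resource names must start with a lowercase letter. path_matchers is left out because its shape is not shown here.

```hcl
# Sketch only: types are inferred from the dev.tfvars example.
variable "backend_services" {
  description = "Backend key => Cloud Run service that the backend fronts."
  type = map(object({
    cloud_run_service = string
  }))

  # Illustrative guard: each key is used as the prefix of a GCP resource name.
  validation {
    condition = alltrue([
      for k in keys(var.backend_services) : can(regex("^[a-z][a-z0-9-]*$", k))
    ])
    error_message = "Backend keys must start with a lowercase letter and contain only lowercase letters, digits, and hyphens."
  }
}

variable "host_rules" {
  description = "Hostname label => name of the path matcher that handles it."
  type        = map(string)
}
```

With this in place, terraform plan -var-file=envs/dev.tfvars rejects a malformed key before any resource is touched.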

Serverless NEGs and Backend Services

The Terraform uses google_compute_region_network_endpoint_group with network_endpoint_type = "SERVERLESS" to create SNEGs pointing at each Cloud Run service. Each NEG name combines a random suffix with the first four characters of an MD5 hash of the Cloud Run service name. This prevents naming collisions and gives a replacement NEG a different name whenever the target service changes. That matters because the create_before_destroy lifecycle rule builds the new NEG before removing the old one, so the two have to coexist for a moment. This is a critical pattern: a destroy-first replacement would try to delete a NEG that a backend service still references, which at best fails the apply and at worst breaks your routing path.

hcl

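resource "google_compute_region_network_endpoint_group" "serverless_negs" {
  for_each = var.backend_services

  # Random suffix plus a short hash of the target service name, so a
  # replacement NEG never collides with the one it replaces.
  # random_id.random_suffix must be declared elsewhere (not shown here).
  name                  = "${each.key}-sneg-${random_id.random_suffix.hex}-${substr(md5(each.value.cloud_run_service), 0, 4)}"
  network_endpoint_type = "SERVERLESS"

  cloud_run {
    service = each.value.cloud_run_service
  }

  lifecycle {
    # Create the new NEG before destroying the old one.
    create_before_destroy = true
  }
}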
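The NEGs only become useful once a backend service points at them. Here is a rough sketch of that wiring, assuming a regional internal Application Load Balancer. The random_id definition, the backend naming scheme, the protocol, and the balancing settings are all assumptions for illustration and should be checked against the Google provider documentation before use.

```hcl
# Assumed definition of the suffix referenced in the NEG name.
resource "random_id" "random_suffix" {
  byte_length = 2
}

# One backend service per entry in var.backend_services, each wrapping
# the serverless NEG with the same key. The region is omitted here, so
# it falls back to the provider's default region.
resource "google_compute_region_backend_service" "backends" {
  for_each = var.backend_services

  name                  = "${each.key}-backend" # hypothetical naming scheme
  load_balancing_scheme = "INTERNAL_MANAGED"
  protocol              = "HTTP" # assumed value

  backend {
    group = google_compute_region_network_endpoint_group.serverless_negs[each.key].id

    # Assumed settings for a proxy-based internal load balancer.
    balancing_mode  = "UTILIZATION"
    capacity_scaler = 1.0
  }
}
```

Because both resources iterate over the same map, a new entry in backend_services yields a NEG and its backend service in one apply.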

Internal Load Balancer Architecture

flowchart LR
    Client[Internal VPC Client] --> FWD[Forwarding Rule :443]
    FWD --> Proxy[Target HTTPS Proxy<br/>SSL Termination]
    Proxy --> URLMap[URL Map]
    URLMap -- "kb.*" --> BE1[Backend Service]
    URLMap -- "agent.*" --> BE2[Backend Service]
    BE1 --> SNEG1[Serverless NEG]
    BE2 --> SNEG2[Serverless NEG]
    SNEG1 --> CR1[Cloud Run: KB]
    SNEG2 --> CR2[Cloud Run: Agent]
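The URL Map in the middle of the diagram is where host_rules and path_matchers come together. The sketch below shows one way to drive it with dynamic blocks. It assumes a regional URL map and a backend service resource named google_compute_region_backend_service.backends keyed like backend_services. The variables domain_suffix, default_backend, and the shape of path_matchers are hypothetical, since the real definitions are not shown in this post.

```hcl
# Hypothetical inputs for illustration.
variable "domain_suffix" {
  type    = string
  default = "ai-platform.example.internal"
}

variable "default_backend" {
  description = "Key in backend_services that receives unmatched traffic."
  type        = string
}

variable "path_matchers" {
  description = "Path matcher name => key of the backend service it routes to."
  type        = map(object({ backend = string }))
}

resource "google_compute_region_url_map" "internal" {
  name            = "internal-lb-url-map" # hypothetical name
  default_service = google_compute_region_backend_service.backends[var.default_backend].id

  # "kb" = "kb-matcher" becomes kb.<domain_suffix> handled by kb-matcher.
  dynamic "host_rule" {
    for_each = var.host_rules
    content {
      hosts        = ["${host_rule.key}.${var.domain_suffix}"]
      path_matcher = host_rule.value
    }
  }

  # Each matcher sends all paths for its host to a single backend service.
  dynamic "path_matcher" {
    for_each = var.path_matchers
    content {
      name            = path_matcher.key
      default_service = google_compute_region_backend_service.backends[path_matcher.value.backend].id
    }
  }
}
```

Under these assumptions, a tfvars entry such as "kb-matcher" = { backend = "kb-service" } would complete the route from kb.ai-platform.example.internal to the KB service.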

HTTP-to-HTTPS Redirect

An optional enable_http_redirect flag provisions a separate URL Map, HTTP Proxy, and Forwarding Rule on port 80 that returns a 301 redirect to HTTPS. The flag is set per environment: enabled in DEV and UAT for convenience, mandatory in PROD for compliance. Any client that connects over plain HTTP is redirected to the HTTPS endpoint before a request ever reaches a backend.
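A flag like this is commonly implemented with count. The sketch below illustrates the idea for a regional internal load balancer and is not our exact configuration: the resource names and the network, subnetwork, and lb_ip_address variables are hypothetical. If the redirect shares an IP with the HTTPS forwarding rule, the address generally has to be reserved with the SHARED_LOADBALANCER_VIP purpose, which is worth verifying for your setup.

```hcl
variable "enable_http_redirect" {
  type    = bool
  default = false
}

# Hypothetical inputs, not taken from the original configuration.
variable "network" { type = string }
variable "subnetwork" { type = string }
variable "lb_ip_address" { type = string }

# URL map with no backends: every request gets a redirect to HTTPS.
resource "google_compute_region_url_map" "http_redirect" {
  count = var.enable_http_redirect ? 1 : 0
  name  = "internal-lb-http-redirect"

  default_url_redirect {
    https_redirect         = true
    redirect_response_code = "MOVED_PERMANENTLY_DEFAULT" # HTTP 301
    strip_query            = false
  }
}

resource "google_compute_region_target_http_proxy" "http_redirect" {
  count   = var.enable_http_redirect ? 1 : 0
  name    = "internal-lb-http-proxy"
  url_map = google_compute_region_url_map.http_redirect[0].id
}

resource "google_compute_forwarding_rule" "http_redirect" {
  count = var.enable_http_redirect ? 1 : 0
  name  = "internal-lb-http-redirect"

  load_balancing_scheme = "INTERNAL_MANAGED"
  ip_protocol           = "TCP"
  port_range            = "80"
  target                = google_compute_region_target_http_proxy.http_redirect[0].id
  network               = var.network
  subnetwork            = var.subnetwork
  ip_address            = var.lb_ip_address
}
```

Setting enable_http_redirect = true in an environment's tfvars creates all three resources; leaving it false creates none.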

Lessons Learned: SNEG Naming and State Drift

A major lesson learned was dealing with drift between Terraform state and the Cloud Run services behind the Serverless NEGs. If someone manually deletes or changes a Cloud Run service in the GCP Console, the SNEG is left pointing at a target that is missing or no longer matches what Terraform expects, and Terraform fails to apply updates. The solution was twofold: strict IAM policies that restrict manual console edits, and a daily terraform plan run from cron that detects infrastructure drift and alerts on it.
