Terraform-Managed Internal Load Balancing for Cloud Run
How we provision a Regional Internal HTTPS Load Balancer — complete with Serverless NEGs, URL Maps, and SSL certificate rotation — across DEV, UAT, and PROD using a single Terraform codebase.
When you have 15+ Cloud Run services that need to be reachable from internal VPC clients, you can't just use Cloud Run's default URLs. You need stable, predictable hostnames (kb.ai-platform.example.internal, agent.ai-platform.example.internal) that route through a centralized Layer 7 load balancer. This requires stitching together Forwarding Rules, Target HTTPS Proxies, URL Maps, Backend Services, and Serverless Network Endpoint Groups (SNEGs) — all of which need to be consistent across three environments.
Our approach: a single main.tf that uses for_each over variable maps to create every resource. Environment differences live entirely in .tfvars files. Adding a new service to the load balancer takes only a few lines of tfvars: one entry each in backend_services, host_rules, and path_matchers (the excerpt below shows the first two). Because every resource is generated from these maps, there is no per-service or per-environment template to copy and keep in sync.
# envs/dev.tfvars: adding a new service is just a few lines
backend_services = {
  "kb-service" = {
    cloud_run_service = "my-kb-service"
  }
  "ai-agent" = {
    cloud_run_service = "my-ai-agent"
  }
  "llm-gateway" = {
    cloud_run_service = "my-llm-gateway"
  }
  "otel-collector" = {
    cloud_run_service = "my-otel-collector"
  }
}

host_rules = {
  "kb"    = "kb-matcher"
  "agent" = "agent-matcher"
  "llm"   = "llm-matcher"
  "otel"  = "otel-matcher"
}
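The tfvars above imply variable declarations and a URL Map in main.tf that the excerpt does not show. The following is a minimal sketch of how they could fit together, not the original code: the shape of path_matchers, the domain_suffix and default_backend variables, and the resource names google_compute_region_url_map.https and google_compute_region_backend_service.backends are all assumptions made for illustration.

```hcl
# Hypothetical declarations matching the tfvars shown above.
variable "backend_services" {
  description = "Logical backend name => Cloud Run service it fronts."
  type = map(object({
    cloud_run_service = string
  }))
}

variable "host_rules" {
  description = "Host prefix => name of the path matcher that handles it."
  type        = map(string)
}

# Assumed shape: matcher name => key in backend_services.
variable "path_matchers" {
  type = map(object({
    default_service = string
  }))
}

# Assumed helper variables, not part of the original tfvars.
variable "domain_suffix" {
  type    = string
  default = "ai-platform.example.internal"
}

variable "default_backend" {
  type = string
}

# The backend service resource referenced here is assumed to be defined
# elsewhere in main.tf with for_each = var.backend_services.
resource "google_compute_region_url_map" "https" {
  name            = "ilb-url-map"
  default_service = google_compute_region_backend_service.backends[var.default_backend].id

  # One host_rule block per entry in var.host_rules.
  dynamic "host_rule" {
    for_each = var.host_rules
    content {
      hosts        = ["${host_rule.key}.${var.domain_suffix}"]
      path_matcher = host_rule.value
    }
  }

  # One path_matcher block per entry in var.path_matchers.
  dynamic "path_matcher" {
    for_each = var.path_matchers
    content {
      name            = path_matcher.key
      default_service = google_compute_region_backend_service.backends[path_matcher.value.default_service].id
    }
  }
}
```

With a shape like this, the host_rules entry "kb" = "kb-matcher" would produce a host rule for kb.ai-platform.example.internal that is handled by the kb-matcher path matcher.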
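The Terraform uses google_compute_region_network_endpoint_group with network_endpoint_type = "SERVERLESS" to create SNEGs pointing at each Cloud Run service. Each name combines a random suffix with the first four characters of the MD5 hash of the Cloud Run service name. This prevents naming collisions and also means that a NEG retargeted at a different service gets a different name. That matters for the create_before_destroy lifecycle rule, because the old and new NEG briefly exist side by side and cannot share a name. The rule itself is what makes NEG recreation a zero-downtime operation: the replacement is created and wired in before the old NEG is removed. Without it, Terraform would try to destroy the old NEG first, and GCP rejects deleting a NEG that a backend service still references, so the only way forward would be to detach the backend and interrupt routing.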
resource "google_compute_region_network_endpoint_group" "serverless_negs" {
  for_each = var.backend_services

  # Random suffix plus a 4-character hash of the target Cloud Run service name.
  name                  = "${each.key}-sneg-${random_id.random_suffix.hex}-${substr(md5(each.value.cloud_run_service), 0, 4)}"
  network_endpoint_type = "SERVERLESS"

  cloud_run {
    service = each.value.cloud_run_service
  }

  # Create the replacement NEG before destroying the old one.
  lifecycle {
    create_before_destroy = true
  }
}
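The name expression references random_id.random_suffix, which the excerpt does not show. A minimal definition could look like the sketch below; the byte_length value is an assumption.

```hcl
# Assumed definition of the suffix used in the NEG names
# (random_id comes from the hashicorp/random provider).
resource "random_id" "random_suffix" {
  # 2 bytes render as 4 hex characters through the .hex attribute.
  byte_length = 2
}
```

Keeping the suffix short is deliberate in this sketch: Compute Engine resource names are limited to 63 characters, so a long each.key leaves little room for the suffix and hash.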
An optional enable_http_redirect flag provisions a separate URL Map, Target HTTP Proxy, and Forwarding Rule on port 80 that answer every request with a 301 redirect to HTTPS. The flag is set per environment: we enable it in DEV and UAT for convenience, and it is mandatory in PROD for compliance. Clients that connect over plain HTTP are redirected to the TLS endpoint rather than being served in cleartext or refused.
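The redirect path can be sketched roughly as follows. This is an illustration under stated assumptions rather than the original module: the resource names, the default value of the flag, and the network, subnetwork, and lb_ip_address variables are all hypothetical.

```hcl
variable "enable_http_redirect" {
  type    = bool
  default = false # assumed default; set per environment in tfvars
}

# Hypothetical inputs; these names are not from the original codebase.
variable "network" { type = string }
variable "subnetwork" { type = string }
variable "lb_ip_address" { type = string }

# URL Map whose only job is to redirect every request to HTTPS.
resource "google_compute_region_url_map" "http_redirect" {
  count = var.enable_http_redirect ? 1 : 0
  name  = "ilb-http-redirect"

  default_url_redirect {
    https_redirect         = true
    redirect_response_code = "MOVED_PERMANENTLY_DEFAULT" # HTTP 301
    strip_query            = false
  }
}

resource "google_compute_region_target_http_proxy" "http_redirect" {
  count   = var.enable_http_redirect ? 1 : 0
  name    = "ilb-http-redirect-proxy"
  url_map = google_compute_region_url_map.http_redirect[0].id
}

# Port 80 listener on the internal load balancer.
resource "google_compute_forwarding_rule" "http_redirect" {
  count                 = var.enable_http_redirect ? 1 : 0
  name                  = "ilb-http-redirect-fr"
  load_balancing_scheme = "INTERNAL_MANAGED"
  ip_protocol           = "TCP"
  port_range            = "80"
  target                = google_compute_region_target_http_proxy.http_redirect[0].id
  network               = var.network
  subnetwork            = var.subnetwork
  ip_address            = var.lb_ip_address
}
```

For the redirect to catch clients of the HTTPS endpoint, both forwarding rules need to listen on the same internal address. On GCP that generally means reserving the address with purpose SHARED_LOADBALANCER_VIP, which this sketch assumes has been done outside the snippet.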
A major lesson learned was handling drift between Terraform state and the Cloud Run services that the Serverless NEGs point to. If someone manually deletes or updates a Cloud Run service in the GCP Console, the SNEG is left in a broken state, and Terraform fails to apply updates because the underlying service target is missing or no longer matches what the state expects. Our fix has two parts: strict IAM policies that restrict manual console edits, and a daily terraform plan run from cron that detects infrastructure drift and raises an alert as soon as it appears.