Routing AI Traffic: GKE Istio vs Cloud Run Load Balancers
A side-by-side comparison of how internal traffic flows, SSL terminates, and services communicate in our Kubernetes and serverless environments.
In a GKE environment, traffic hits a GCP Network Load Balancer (Layer 4) that forwards raw TCP. SSL termination happens inside the cluster at the Istio Ingress Gateway, using TLS certificates issued by cert-manager from a private internal CA. Istio then matches the requested hostname against VirtualService rules and routes to the appropriate Kubernetes Service. Internal service-to-service calls bypass the ingress entirely and address each other through Kubernetes DNS (e.g., api-svc.my-namespace.svc.cluster.local), which in our setup keeps internal latency in the sub-millisecond range.
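To make the GKE path concrete, here is a minimal sketch of the two Istio resources involved: a Gateway that terminates TLS and a VirtualService that routes by hostname. The external hostname, the Gateway name, the `istio: ingressgateway` selector label, and the `llm-platform-tls` secret name are assumptions for illustration; only the destination service name comes from the configuration shown in this article.

```yaml
# Sketch only: names marked "hypothetical" are not from our real setup.
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: ai-ingress-gateway          # hypothetical
  namespace: istio-system
spec:
  selector:
    istio: ingressgateway           # assumes the default ingress gateway label
  servers:
    - port:
        number: 443
        name: https
        protocol: HTTPS
      tls:
        mode: SIMPLE                # TLS terminates here, inside the cluster
        credentialName: llm-platform-tls   # hypothetical secret issued by cert-manager
      hosts:
        - llm.ai-platform.example.internal # hypothetical hostname
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: llm-platform-api
  namespace: llm-platform
spec:
  hosts:
    - llm.ai-platform.example.internal     # must match a Gateway host
  gateways:
    - istio-system/ai-ingress-gateway
  http:
    - match:
        - uri:
            prefix: /v1
      route:
        - destination:
            host: llm-platform-api.llm-platform.svc.cluster.local
            port:
              number: 80
```

The Gateway owns the certificate and the listening port, while the VirtualService owns the routing decision, so a new hostname can usually be added without touching TLS configuration for existing ones.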
Cloud Run sits behind a Regional Internal Application Load Balancer (Layer 7), so SSL termination happens at the GCP Target HTTPS Proxy, outside the application. The URL Map examines the Host header and routes to the matching Backend Service, which points to a Serverless NEG connected to the Cloud Run service. Unlike GKE, there is no in-cluster DNS. Inter-service communication follows a star (hub-and-spoke) topology: services reach each other by routing back through the centralized load balancer via internal domains. The extra hop adds some latency to every call, but there is no mesh or cluster networking to operate, and each service scales independently behind the load balancer.
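The host-based routing described above can be sketched as a URL map definition. This is an illustrative fragment, not our production file: the map name, the backend service names, and the PROJECT_ID and REGION placeholders are assumptions, while the two hostnames are the internal domains used in the configuration in this article.

```yaml
# Hypothetical url-map.yaml for a regional internal Application Load Balancer.
name: ai-platform-url-map            # hypothetical
# Fallback backend when no host rule matches (hypothetical name).
defaultService: projects/PROJECT_ID/regions/REGION/backendServices/default-backend
hostRules:
  - hosts:
      - kb.ai-platform.example.internal
    pathMatcher: kb
  - hosts:
      - otel.ai-platform.example.internal
    pathMatcher: otel
pathMatchers:
  # Each backend service below would front a Serverless NEG
  # that points at one Cloud Run service.
  - name: kb
    defaultService: projects/PROJECT_ID/regions/REGION/backendServices/kb-backend
  - name: otel
    defaultService: projects/PROJECT_ID/regions/REGION/backendServices/otel-backend
```

A file in this shape can typically be applied with `gcloud compute url-maps import`, passing the file via `--source` and the region via `--region`. Adding a service then means adding one host rule and one path matcher rather than changing any application code.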
# GKE: Internal service calls bypass ingress, use K8s DNS
- name: LLM_PLATFORM_API_URL
  value: "http://llm-platform-api.llm-platform.svc.cluster.local/v1"
- name: VECTOR_DB_HOST
  value: "vectordb.vectordb-ns.svc.cluster.local"

# Cloud Run: All inter-service calls route via internal ALB
- name: KB_SERVICE_URL
  value: "https://kb.ai-platform.example.internal/api/v2"
- name: OTEL_EXPORTER_OTLP_ENDPOINT
  value: "https://otel.ai-platform.example.internal"
GKE gives you direct pod-to-pod networking, Istio's advanced traffic management (canary releases, circuit breaking, fault injection), and sub-millisecond internal latency. Cloud Run gives you per-request billing, automatic scale-to-zero, and zero cluster management, but at the cost of higher inter-service latency (every call goes through the LB) and less networking control. We run both: GKE for always-on, latency-sensitive workloads (vector databases, LLM gateways) and Cloud Run for bursty, event-driven AI agent services. The split keeps steady, latency-sensitive traffic on the fast in-cluster path and lets bursty services scale to zero so we do not pay for idle capacity.
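As an illustration of the Istio traffic management mentioned above, the following sketch combines a canary split with basic circuit breaking for one in-mesh service. The `version: v1` and `version: v2` pod labels, the 90/10 split, and the outlier detection thresholds are assumptions chosen for the example, not values from our deployment.

```yaml
# Sketch only: labels, weights, and thresholds are hypothetical.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: llm-platform-api
  namespace: llm-platform
spec:
  host: llm-platform-api.llm-platform.svc.cluster.local
  trafficPolicy:
    outlierDetection:               # circuit breaking: eject failing pods
      consecutive5xxErrors: 5
      interval: 30s
      baseEjectionTime: 60s
  subsets:
    - name: stable
      labels:
        version: v1                 # assumes pods carry a version label
    - name: canary
      labels:
        version: v2
---
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: llm-platform-api-canary
  namespace: llm-platform
spec:
  hosts:
    - llm-platform-api.llm-platform.svc.cluster.local
  http:
    - route:
        - destination:
            host: llm-platform-api.llm-platform.svc.cluster.local
            subset: stable
          weight: 90
        - destination:
            host: llm-platform-api.llm-platform.svc.cluster.local
            subset: canary
          weight: 10                # canary release: 10% of in-mesh traffic
```

Because no `gateways` field is set, this VirtualService applies to sidecar-to-sidecar traffic inside the mesh, which is the path internal callers use when they bypass the ingress.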
Lessons Learned: Handling VPC Ingress Lockdowns
A major gotcha was locking down ingress for Cloud Run while keeping the services reachable for the service accounts used by our GKE workloads. Setting ingress to "internal" on Cloud Run blocks external access, but it still admits traffic from any client inside the VPC. To achieve real isolation between services, we had to pair the "internal" ingress setting with strict IAM permissions on individual services (e.g., making sure the run.googleapis.com/invoker-iam-disabled annotation is not set, so the Invoker IAM check stays enforced, and granting the Cloud Run Invoker role only to the calling service accounts). The lesson: never trust network routing alone to enforce security boundaries; always back it up with programmatic authentication at the service layer.
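The two layers described above can be sketched as a pair of YAML files: a Cloud Run service definition that restricts ingress, and an IAM policy that names the only permitted caller. The service name, the service account names, and the PROJECT_ID, REGION, REPO, and TAG placeholders are hypothetical; the ingress annotation and the `roles/run.invoker` role are the real settings this section refers to.

```yaml
# File 1 (hypothetical service.yaml): Cloud Run service definition.
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: kb-service                  # hypothetical
  annotations:
    # Network layer: only internal sources (VPC clients, internal ALB) get in.
    run.googleapis.com/ingress: internal
    # run.googleapis.com/invoker-iam-disabled is deliberately NOT set,
    # so every request must still pass the Invoker IAM check.
spec:
  template:
    spec:
      serviceAccountName: kb-service@PROJECT_ID.iam.gserviceaccount.com
      containers:
        - image: REGION-docker.pkg.dev/PROJECT_ID/REPO/kb-service:TAG
---
# File 2 (hypothetical policy.yaml): IAM policy for that service.
bindings:
  - role: roles/run.invoker
    members:
      # Only this caller (e.g., a GKE workload's service account) may invoke.
      - serviceAccount:llm-gateway@PROJECT_ID.iam.gserviceaccount.com
```

Files in this shape are typically applied with `gcloud run services replace` and `gcloud run services set-iam-policy` respectively. Note that `set-iam-policy` replaces the existing policy, so `gcloud run services add-iam-policy-binding` is the safer choice when adding a single caller. Callers then present a Google-signed ID token as a bearer token, which is what makes the boundary hold even for clients that are already inside the VPC.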