Routing AI Traffic: GKE Istio vs Cloud Run Load Balancers

GKE: L4 NLB → Istio Ingress → VirtualService

In a GKE environment, traffic first hits a GCP Network Load Balancer (Layer 4), which forwards raw TCP without inspecting it. TLS termination happens inside the cluster at the Istio Ingress Gateway, using certificates issued by cert-manager from a private internal CA. Istio then matches the request's hostname against VirtualService rules and routes it to the appropriate Kubernetes Service. Internal service-to-service calls bypass the ingress entirely and address each other through Kubernetes DNS (e.g., api-svc.my-namespace.svc.cluster.local), which keeps internal latency low: typically sub-millisecond for calls within a zone, although that depends on the workload and is not a guarantee.
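The ingress path above can be sketched as an Istio Gateway plus a VirtualService. This is a minimal illustration, not our exact manifests: the gateway name, the hostname llm.ai-platform.example.internal, and the Secret name ai-platform-tls are hypothetical, and the backend is assumed to be the llm-platform-api Service listening on port 80.

```yaml
# Gateway: terminates TLS at the Istio ingress gateway pods.
apiVersion: networking.istio.io/v1beta1
kind: Gateway
metadata:
  name: ai-gateway                 # hypothetical name
  namespace: istio-system
spec:
  selector:
    istio: ingressgateway          # default label on the Istio ingress gateway
  servers:
    - port:
        number: 443
        name: https
        protocol: HTTPS
      tls:
        mode: SIMPLE
        # Secret assumed to be written by cert-manager; it must live in the
        # same namespace as the ingress gateway deployment.
        credentialName: ai-platform-tls
      hosts:
        - "llm.ai-platform.example.internal"   # hypothetical hostname
---
# VirtualService: matches the hostname and routes to the Kubernetes Service.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: llm-platform-api
  namespace: llm-platform
spec:
  hosts:
    - "llm.ai-platform.example.internal"
  gateways:
    - istio-system/ai-gateway
  http:
    - route:
        - destination:
            host: llm-platform-api.llm-platform.svc.cluster.local
            port:
              number: 80           # assumed Service port
```

The L4 load balancer never sees any of this. It only forwards TCP to the gateway pods, which is why hostname routing and certificates are both handled in-cluster.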

Cloud Run: L7 ALB → URL Map → Serverless NEG

Cloud Run uses a Regional Internal Application Load Balancer (Layer 7), where TLS termination happens at the GCP Target HTTPS Proxy, outside the application. The URL Map examines the Host header and routes to the matching Backend Service, which points to a Serverless NEG connected to the Cloud Run service. Unlike GKE, there is no in-cluster DNS. Inter-service communication follows a hub-and-spoke ("star") topology: services reach each other by routing back through the centralized load balancer via internal domains. This adds an extra hop of latency to every call, but there is no cluster networking to manage as the number of services grows.
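The host-based routing step can be sketched as a URL map resource in YAML, the format that `gcloud compute url-maps import` accepts. The resource names (ai-platform-url-map, kb-backend, default-backend) and the PROJECT_ID and REGION placeholders are hypothetical; only the hostname comes from the examples in this post.

```yaml
# Regional URL map: picks a backend service based on the Host header.
name: ai-platform-url-map          # hypothetical name
# Fallback for hosts that match no rule below.
defaultService: projects/PROJECT_ID/regions/REGION/backendServices/default-backend
hostRules:
  - hosts:
      - kb.ai-platform.example.internal
    pathMatcher: kb
pathMatchers:
  - name: kb
    # Backend service assumed to be backed by a Serverless NEG that points
    # at the corresponding Cloud Run service.
    defaultService: projects/PROJECT_ID/regions/REGION/backendServices/kb-backend
```

Each additional Cloud Run service becomes one more hostRules entry and pathMatcher, which is what makes the load balancer the hub of the topology.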

```yaml
# GKE: Internal service calls bypass ingress, use K8s DNS
- name: LLM_PLATFORM_API_URL
  value: "http://llm-platform-api.llm-platform.svc.cluster.local/v1"
- name: VECTOR_DB_HOST
  value: "vectordb.vectordb-ns.svc.cluster.local"

# Cloud Run: All inter-service calls route via internal ALB
- name: KB_SERVICE_URL
  value: "https://kb.ai-platform.example.internal/api/v2"
- name: OTEL_EXPORTER_OTLP_ENDPOINT
  value: "https://otel.ai-platform.example.internal"
```

Figure: GKE vs Cloud Run traffic flow

Trade-offs: Choosing the Right Engine

GKE gives you direct pod-to-pod networking, Istio's advanced traffic management (canary releases, circuit breaking, fault injection), and low internal latency, typically sub-millisecond within a zone. Cloud Run gives you per-request billing, automatic scale-to-zero, and zero cluster management, at the cost of higher inter-service latency (every call goes through the load balancer) and less networking control. We run both: GKE for always-on, latency-sensitive workloads (vector databases, LLM gateways), and Cloud Run for bursty, event-driven AI agent services. This hybrid setup lets each workload run where its cost and latency profile fits best.
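To make the traffic-management point concrete, here is a minimal sketch of a canary release with basic circuit breaking in Istio. The subset labels (version: v1 and v2), the 90/10 split, and the outlier-detection thresholds are illustrative assumptions, not values from our clusters.

```yaml
# DestinationRule: defines the two versions and ejects failing pods.
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: llm-platform-api
  namespace: llm-platform
spec:
  host: llm-platform-api.llm-platform.svc.cluster.local
  trafficPolicy:
    outlierDetection:
      consecutive5xxErrors: 5      # eject a pod after 5 consecutive 5xx responses
      interval: 30s                # how often hosts are evaluated
      baseEjectionTime: 60s        # minimum time an ejected pod stays out
  subsets:
    - name: stable
      labels:
        version: v1                # assumed pod label
    - name: canary
      labels:
        version: v2                # assumed pod label
---
# VirtualService: sends 10% of in-mesh traffic to the canary subset.
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: llm-platform-api-canary    # hypothetical name
  namespace: llm-platform
spec:
  hosts:
    - llm-platform-api.llm-platform.svc.cluster.local
  http:
    - route:
        - destination:
            host: llm-platform-api.llm-platform.svc.cluster.local
            subset: stable
          weight: 90
        - destination:
            host: llm-platform-api.llm-platform.svc.cluster.local
            subset: canary
          weight: 10
```

Cloud Run can also split traffic between revisions by percentage, but the per-pod ejection and fault-injection controls shown here are mesh features.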

Lessons Learned: Handling VPC Ingress Lockdowns

A major gotcha was locking down ingress for Cloud Run while keeping it open for GKE service accounts. Setting ingress to "internal" on Cloud Run blocks external access, but it still allows traffic from any client inside the VPC. To achieve real isolation, we had to pair the "internal" ingress setting with strict IAM on each service: keep the invoker IAM check enabled (that is, do not set run.googleapis.com/invoker-iam-disabled to true) and grant the Cloud Run Invoker role (roles/run.invoker) only to the calling service accounts. The lesson: never trust network routing alone to enforce security boundaries; always back it up with authentication at the service layer.
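The two layers can be sketched as two separate files: a Cloud Run service definition that restricts ingress, and an IAM policy that names the allowed callers. The service name kb-service, the IMAGE_URL placeholder, and the service account agent-caller@PROJECT_ID.iam.gserviceaccount.com are hypothetical.

```yaml
# File 1: service.yaml, the network layer.
# Could be applied with `gcloud run services replace service.yaml`.
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: kb-service                          # hypothetical name
  annotations:
    run.googleapis.com/ingress: internal    # blocks traffic from the public internet
    # run.googleapis.com/invoker-iam-disabled is deliberately NOT set,
    # so every request must still pass the invoker IAM check.
spec:
  template:
    spec:
      containers:
        - image: IMAGE_URL                  # placeholder
---
# File 2: policy.yaml, the identity layer.
# Could be applied with `gcloud run services set-iam-policy kb-service policy.yaml`.
# Note: set-iam-policy replaces the service's whole policy, so list every
# binding you want to keep.
bindings:
  - role: roles/run.invoker
    members:
      - serviceAccount:agent-caller@PROJECT_ID.iam.gserviceaccount.com
```

With both in place, a request from inside the VPC that lacks an identity token for an allowed service account is rejected even though the network path is open.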
