The App-of-Apps Pattern: Running an AI Stack on GKE with ArgoCD

Why GitOps for AI Infrastructure?

Managing an AI platform on Kubernetes means juggling dozens of interdependent services: Istio for ingress, Cert-Manager for TLS, Prometheus and Grafana for monitoring, Qdrant and Weaviate for vector search, LiteLLM as an LLM proxy, Dify for prompt orchestration, and OpenTelemetry collectors for distributed tracing. Manually applying manifests across environments is error-prone and hard to audit. With ArgoCD and the App-of-Apps pattern, a single root Application points at a Git directory of child Applications, so every component is defined in Git, automatically synced, and self-healing. Manual changes in the cluster are reverted to the state in Git, and the Git history doubles as the audit trail of who changed what and when.
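To make the pattern concrete, here is a minimal sketch of what a root Application can look like. The repository URL, the apps directory, the project, and the Application name are placeholders invented for illustration, not our actual layout.

```yaml
# Hypothetical root "app of apps". Every name and URL below is a placeholder.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: ai-platform-root
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.internal/platform/gitops-config.git
    targetRevision: main
    # Directory assumed to hold one child Application manifest per component
    # (Istio, Cert-Manager, Qdrant, LiteLLM, and so on).
    path: apps
  destination:
    server: https://kubernetes.default.svc
    # Child Application objects are created in the Argo CD namespace.
    namespace: argocd
  syncPolicy:
    automated:
      prune: true     # remove child Applications that are deleted from Git
      selfHeal: true  # revert manual edits made to the child Applications
```

Adding a component then comes down to committing one more Application manifest into that directory; the root Application creates it on its next sync.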

Cluster Architecture: Istio Ingress and Internal TLS

Traffic enters through a GCP Network Load Balancer (Layer 4) and reaches the Istio Ingress Gateway inside the cluster, where TLS is terminated using certificates issued by Cert-Manager from a private CA. Each service (ArgoCD, Grafana, Qdrant, Dify, LiteLLM) gets its own VirtualService definition, which routes traffic by hostname to the appropriate Kubernetes Service. This gives us granular control over routing policies, and because the workloads run inside the mesh, we can enforce mutual TLS (mTLS) for service-to-service communication within the cluster.

yaml

# Istio VirtualService routing for the LLM orchestration platform
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: llm-platform
  namespace: llm-platform
spec:
  gateways:
  - internal-gateway
  hosts:
  - "llm-platform.ai-platform.example.internal"
  http:
  - match:
    - uri:
        prefix: /api
    - uri:
        prefix: /v1
    route:
    - destination:
        host: llm-platform-api.llm-platform.svc.cluster.local
        port:
          number: 80
  - match:
    - uri:
        prefix: /
    route:
    - destination:
        host: llm-platform-frontend.llm-platform.svc.cluster.local
        port:
          number: 80
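The VirtualService above references a gateway and relies on a certificate that are not shown. A minimal sketch of those pieces might look like the following. The ClusterIssuer name private-ca, the secret name, the istio: ingressgateway selector label, and the use of istio-system as both the gateway's namespace and the mesh root namespace are all assumptions for illustration.

```yaml
# Certificate issued by Cert-Manager from a private CA (issuer name is hypothetical).
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: llm-platform-tls
  namespace: istio-system   # the secret must live where the ingress gateway pods run
spec:
  secretName: llm-platform-tls
  dnsNames:
  - llm-platform.ai-platform.example.internal
  issuerRef:
    name: private-ca
    kind: ClusterIssuer
---
# Gateway that terminates TLS on 443 and redirects plain HTTP on 80.
apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: internal-gateway
  namespace: llm-platform   # same namespace as the VirtualService that references it
spec:
  selector:
    istio: ingressgateway   # assumed label on the ingress gateway pods
  servers:
  - port:
      number: 443
      name: https
      protocol: HTTPS
    tls:
      mode: SIMPLE
      credentialName: llm-platform-tls
    hosts:
    - "llm-platform.ai-platform.example.internal"
  - port:
      number: 80
      name: http
      protocol: HTTP
    tls:
      httpsRedirect: true   # answer HTTP requests with a redirect to HTTPS
    hosts:
    - "llm-platform.ai-platform.example.internal"
---
# Mesh-wide strict mTLS, assuming istio-system is the mesh root namespace.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system
spec:
  mtls:
    mode: STRICT
```

Note that the gateway only handles TLS from clients; encryption between pods comes from the PeerAuthentication policy and the sidecars, not from the certificate at the edge.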

ArgoCD Sync & Self-Healing

With syncPolicy.automated.selfHeal: true and prune: true enabled on our ArgoCD ApplicationSets, any manual drift in the cluster is automatically corrected. The namespace generator watches a dedicated Git repository for namespace definitions, creating and destroying namespaces declaratively. When Jenkins updates an image tag in the config repo, ArgoCD picks up the commit on its next poll of the repository (about every three minutes by default; a Git webhook can cut this to seconds) and rolls out the new revision, with no deployment step in the pipeline itself. Because nobody runs kubectl apply by hand, a whole class of manual deployment mistakes disappears.
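As an illustration of the namespace setup, here is a sketch using the ApplicationSet Git directory generator. It assumes a hypothetical repository with one directory per namespace under namespaces/, each containing the Namespace manifest and anything that belongs with it; the repository URL and names are placeholders.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: namespaces
  namespace: argocd
spec:
  generators:
  - git:
      repoURL: https://git.example.internal/platform/namespaces.git  # placeholder
      revision: main
      directories:
      - path: namespaces/*   # one generated Application per matching directory
  template:
    metadata:
      name: 'ns-{{path.basename}}'
    spec:
      project: default
      source:
        repoURL: https://git.example.internal/platform/namespaces.git
        targetRevision: main
        path: '{{path}}'
      destination:
        server: https://kubernetes.default.svc
        namespace: '{{path.basename}}'
      syncPolicy:
        automated:
          prune: true      # delete resources that were removed from Git
          selfHeal: true   # revert changes made directly in the cluster
```

In this layout, adding a directory creates a new Application and its namespace, and removing the directory removes the generated Application along with the resources it manages.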

GKE AI Platform Architecture

flowchart TD
    GCP_NLB[GCP L4 NLB] --> Istio[Istio Ingress Gateway SSL Termination]
    Istio --> Dify[Dify Prompt Orchestration]
    Istio --> LiteLLM[LiteLLM LLM Gateway]
    Istio --> Qdrant[Qdrant Vector DB]
    Istio --> Grafana[Grafana Monitoring]
    Istio --> ArgoCD[ArgoCD GitOps Controller]
    Istio --> OTEL[OpenTelemetry Collector]

Security Posture & Hardening

Every pod runs with readOnlyRootFilesystem, drops all Linux capabilities, and enforces runAsNonRoot with a RuntimeDefault seccomp profile. Service accounts are tightly scoped, with no cluster-admin bindings or wildcard rules. The Istio gateway serves application traffic only over HTTPS on port 443; port 80 is open solely to redirect HTTP requests to HTTPS. Combined with mTLS inside the mesh, this keeps traffic encrypted both at the edge and between services. We also configure network policies to isolate namespaces, preventing a compromised vector database pod from accessing monitoring or admin systems.
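The settings above map onto a pod spec and a NetworkPolicy roughly as follows. This is a sketch rather than our production manifests: the workload name, image, UID, and the monitoring and istio-system namespace names are assumptions. Network policies are also only enforced if the GKE cluster has network policy enforcement (or Dataplane V2) enabled.

```yaml
# Hardened pod template (hypothetical workload and image).
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hardened-example
  namespace: llm-platform
spec:
  replicas: 1
  selector:
    matchLabels:
      app: hardened-example
  template:
    metadata:
      labels:
        app: hardened-example
    spec:
      securityContext:
        runAsNonRoot: true
        runAsUser: 1000              # assumed non-root UID
        seccompProfile:
          type: RuntimeDefault
      containers:
      - name: app
        image: registry.example.internal/app:placeholder
        securityContext:
          allowPrivilegeEscalation: false
          readOnlyRootFilesystem: true
          capabilities:
            drop: ["ALL"]
---
# Only the ingress gateway's namespace and pods in the same namespace
# may open connections to pods in "monitoring".
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: restrict-ingress
  namespace: monitoring
spec:
  podSelector: {}                    # applies to every pod in the namespace
  policyTypes:
  - Ingress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: istio-system
    - podSelector: {}                # pods in this same namespace
```

With a policy like this in place, a pod in the vector database namespace gets no direct route to Grafana or Prometheus, even if it is compromised.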

Lessons Learned: The Invalidation Trap

One major gotcha we encountered was the "invalidation trap" in GitOps sync loops. When a resource contains dynamically generated fields (such as random passwords, randomized keys, or timestamps that a Helm chart regenerates on every render), the rendered manifest never matches the live object, so ArgoCD flags the resource as permanently out-of-sync. With self-healing enabled, it then re-applies the manifest over and over, burning CPU in a loop. The lesson: always separate immutable structural config from volatile run-time values, and use the ignoreDifferences field of the ArgoCD Application spec for fields that must be computed dynamically inside the cluster.
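As a sketch of how ignoreDifferences can be declared on an Application, consider the following. The Secret name, the password key, the deployed-at annotation, and the repository path are hypothetical stand-ins for whatever fields are generated dynamically in a given chart.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: llm-platform
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.internal/platform/gitops-config.git  # placeholder
    targetRevision: main
    path: charts/llm-platform
  destination:
    server: https://kubernetes.default.svc
    namespace: llm-platform
  ignoreDifferences:
  # A password that the chart regenerates on every render.
  - group: ""
    kind: Secret
    name: llm-platform-generated
    jsonPointers:
    - /data/password
  # A timestamp annotation stamped onto the pod template at render time.
  - group: apps
    kind: Deployment
    jqPathExpressions:
    - .spec.template.metadata.annotations["deployed-at"]
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
    # Also leave the ignored fields alone when a sync is actually applied.
    - RespectIgnoreDifferences=true
```

Without RespectIgnoreDifferences=true, the ignored fields are skipped only when computing the diff; a sync triggered by some other change would still overwrite them with freshly rendered values.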

GitOps is not just about using git as a backup; it is about establishing Git as the single source of truth that dynamically dictates the actual, living state of your infrastructure.
