The App-of-Apps Pattern: Running an AI Stack on GKE with ArgoCD
How we use a GitOps-driven Kubernetes cluster to run vector databases, LLM gateways, observability collectors, and AI workloads — all managed declaratively through Helm and ArgoCD.
Managing an AI platform on Kubernetes means juggling dozens of interdependent services: Istio for ingress, Cert-Manager for TLS, Prometheus and Grafana for monitoring, Qdrant and Weaviate for vector search, LiteLLM as an LLM proxy, Dify for prompt orchestration, and OpenTelemetry collectors for distributed tracing. Manually applying manifests across environments is error-prone and hard to audit. By adopting ArgoCD with the App-of-Apps pattern, every component is defined in Git, automatically synced, and self-healing. This removes most sources of manual configuration drift, and the Git history doubles as an audit trail for every change.
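As a rough illustration of the App-of-Apps pattern, the root Application below does nothing except deploy a directory of child Application manifests, one per component. This is a minimal sketch, not our exact manifest: the name platform-root, the repository URL, and the apps path are hypothetical placeholders.

```yaml
# Root "app of apps": its only job is to sync the child Application manifests.
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: platform-root              # hypothetical name
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.internal/platform/gitops-config.git  # hypothetical repo
    targetRevision: main
    path: apps                     # hypothetical directory holding one Application per component
  destination:
    server: https://kubernetes.default.svc
    namespace: argocd              # child Applications are created in the ArgoCD namespace
  syncPolicy:
    automated:
      prune: true                  # remove child Applications deleted from Git
      selfHeal: true               # revert manual edits made in the cluster
```

Adding a new component then amounts to committing one more Application manifest under that directory.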
Cluster Architecture: Istio Ingress and Internal TLS
Traffic enters through a GCP Network Load Balancer (Layer 4) and hits the Istio Ingress Gateway inside the cluster, where TLS termination occurs using certificates issued by Cert-Manager with a private CA. Each service (ArgoCD, Grafana, Qdrant, Dify, LiteLLM) gets its own wildcard-matched VirtualService definition, which routes traffic by hostname to the appropriate Kubernetes Service. This gives us granular control over routing policies and lets us enforce mutual TLS (mTLS) for service-to-service communication within the cluster.
# Istio VirtualService routing for the LLM orchestration platform
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: llm-platform
  namespace: llm-platform
spec:
  gateways:
    - internal-gateway
  hosts:
    - "llm-platform.ai-platform.example.internal"
  http:
    - match:
        - uri:
            prefix: /api
        - uri:
            prefix: /v1
      route:
        - destination:
            host: llm-platform-api.llm-platform.svc.cluster.local
            port:
              number: 80
    - match:
        - uri:
            prefix: /
      route:
        - destination:
            host: llm-platform-frontend.llm-platform.svc.cluster.local
            port:
              number: 80
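The gateway side of this setup might look roughly like the sketch below: a Gateway that terminates TLS on port 443 and redirects port 80 to HTTPS, plus a mesh-wide PeerAuthentication that requires mTLS between sidecars. The istio-system namespace, the wildcard host, and the Secret name ai-platform-wildcard-tls are assumptions for illustration, not values taken from our cluster.

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: internal-gateway
  # Assumption: if the Gateway lives here, a VirtualService in another
  # namespace must reference it as "istio-system/internal-gateway".
  namespace: istio-system
spec:
  selector:
    istio: ingressgateway            # label on the stock Istio ingress gateway pods
  servers:
    - port:
        number: 443
        name: https
        protocol: HTTPS
      tls:
        mode: SIMPLE                 # terminate TLS at the gateway
        credentialName: ai-platform-wildcard-tls   # hypothetical Secret written by Cert-Manager
      hosts:
        - "*.ai-platform.example.internal"
    - port:
        number: 80
        name: http
        protocol: HTTP
      tls:
        httpsRedirect: true          # answer plain HTTP with a 301 to HTTPS
      hosts:
        - "*.ai-platform.example.internal"
---
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system            # the Istio root namespace makes this policy mesh-wide
spec:
  mtls:
    mode: STRICT                     # reject plaintext traffic between sidecars
```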
With automated.selfHeal: true and prune: true enabled in the syncPolicy of our ArgoCD ApplicationSets, manual drift in the cluster is automatically reverted. The namespace generator watches a dedicated Git repository for namespace definitions, creating and destroying namespaces declaratively. When Jenkins updates an image tag in the config repo, ArgoCD picks up the commit on its next repository poll (every three minutes by default, configurable) and rolls out the new revision, with no deployment webhook needed. This removes most opportunities for human error during deployment.
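An ApplicationSet with a Git directory generator is one way to get this behaviour. The sketch below assumes a hypothetical repository layout with one directory per namespace under namespaces/, each containing a Namespace manifest plus any related resources; the repository URL and names are placeholders.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: namespaces                   # hypothetical name
  namespace: argocd
spec:
  generators:
    - git:
        repoURL: https://git.example.internal/platform/namespaces.git  # hypothetical repo
        revision: main
        directories:
          - path: "namespaces/*"     # one generated Application per matching directory
  template:
    metadata:
      name: "ns-{{path.basename}}"
    spec:
      project: default
      source:
        repoURL: https://git.example.internal/platform/namespaces.git
        targetRevision: main
        path: "{{path}}"
      destination:
        server: https://kubernetes.default.svc
        namespace: "{{path.basename}}"
      syncPolicy:
        automated:
          prune: true                # delete resources that were removed from Git
          selfHeal: true             # revert manual changes made in the cluster
```

Deleting a directory from the repository removes the generated Application; whether its resources are deleted along with it depends on the ApplicationSet's deletion settings.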
Every pod runs with readOnlyRootFilesystem, drops all Linux capabilities, and enforces runAsNonRoot with a RuntimeDefault seccomp profile. Service accounts are tightly scoped, with no cluster-admin bindings or wildcard rules. The Istio gateway serves traffic only on port 443; port 80 exists solely to redirect HTTP to HTTPS, so traffic entering the cluster is always encrypted, while mTLS covers the hops between services. We also configure network policies to isolate namespaces, preventing a compromised vector database pod from reaching monitoring or admin systems.
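The hardening settings above translate into manifests along these lines. This is a sketch under assumptions: the workload name, the vector-db namespace, the image, and the UID are hypothetical, and the NetworkPolicy shows only a same-namespace baseline plus DNS. A real policy would also need explicit allowances for the ingress gateway and metrics scraping.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-workload             # hypothetical
  namespace: vector-db               # hypothetical
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-workload
  template:
    metadata:
      labels:
        app: example-workload
    spec:
      serviceAccountName: example-workload   # dedicated, narrowly scoped service account
      securityContext:
        runAsNonRoot: true
        runAsUser: 10001             # assumption: any non-zero UID the image can run as
        seccompProfile:
          type: RuntimeDefault
      containers:
        - name: app
          image: registry.example.internal/example-workload:1.0.0   # hypothetical
          securityContext:
            readOnlyRootFilesystem: true
            capabilities:
              drop: ["ALL"]
          volumeMounts:
            - name: tmp              # writable scratch space, since the root FS is read-only
              mountPath: /tmp
      volumes:
        - name: tmp
          emptyDir: {}
---
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: same-namespace-only
  namespace: vector-db
spec:
  podSelector: {}                    # applies to every pod in the namespace
  policyTypes: ["Ingress", "Egress"]
  ingress:
    - from:
        - podSelector: {}            # only pods from this namespace
  egress:
    - to:
        - podSelector: {}            # only pods in this namespace
    - to:                            # plus DNS lookups against kube-system
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
      ports:
        - protocol: UDP
          port: 53
        - protocol: TCP
          port: 53
```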
One major gotcha we encountered was the "invalidation trap" in GitOps sync loops. When resources contain dynamically generated fields (such as generated passwords, randomized keys, or timestamps that Helm charts compute at render time), ArgoCD flags the resource as permanently out-of-sync. With self-healing enabled, it then re-applies the manifest continuously, which turns into a CPU-consuming loop. The lesson: always separate immutable structural config from volatile run-time values, and use the ignoreDifferences setting in the ArgoCD Application spec (it is a spec field, not an annotation) for fields that must be computed dynamically inside the cluster.
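A minimal sketch of that fix is shown below, assuming a hypothetical Secret named llm-platform-generated with a chart-generated password and a hypothetical render-time annotation on Deployments. The repository URL and chart path are placeholders as well.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: llm-platform
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.internal/platform/gitops-config.git  # hypothetical repo
    targetRevision: main
    path: charts/llm-platform        # hypothetical chart path
  destination:
    server: https://kubernetes.default.svc
    namespace: llm-platform
  ignoreDifferences:
    # Ignore a password that the chart regenerates on every render.
    - kind: Secret
      name: llm-platform-generated   # hypothetical Secret name
      jsonPointers:
        - /data/password
    # Ignore a timestamp annotation stamped onto pod templates at render time.
    - group: apps
      kind: Deployment
      jqPathExpressions:
        - '.spec.template.metadata.annotations["example.internal/rendered-at"]'  # hypothetical key
  syncPolicy:
    automated:
      prune: true
      selfHeal: true
    syncOptions:
      - RespectIgnoreDifferences=true   # also leave the ignored fields untouched during sync
```

Without the RespectIgnoreDifferences option, the ignored fields only disappear from the diff; a sync triggered by some other change would still overwrite them.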
GitOps is not just about using Git as a backup; it is about establishing Git as the single source of truth that determines the actual, running state of your infrastructure.