Zero-Downtime Deployments on Cloud Run: Startup Probes and CPU Boost

The Cold Start Problem for AI Services

AI microservices built with FastAPI + Gunicorn typically need 15-30 seconds to initialize: loading ML model artifacts, establishing database connection pools, warming Elasticsearch clients, and initializing OpenTelemetry instrumentation. Without proper configuration, Cloud Run routes traffic to the new revision before it's ready, causing a cascade of 503 errors during every deployment, frustrating users and triggering false-positive alerts.

The Fix: Startup Probes + CPU Boost

Two Cloud Run annotations solve this entirely. run.googleapis.com/startup-cpu-boost: "true" temporarily allocates additional CPU during container initialization, cutting startup time in half. The startupProbe defines an HTTP health check that must succeed before traffic is routed — Cloud Run will attempt the probe every 5 seconds for up to 24 failures (2 minutes) before giving up, ensuring only fully-initialized containers receive traffic.

yaml

# Production service.yaml — startup configuration
spec:
  template:
    metadata:
      annotations:
        run.googleapis.com/startup-cpu-boost: "true"  # Extra CPU during init
        run.googleapis.com/cpu-throttling: "true"      # Throttle between requests
    spec:
      containerConcurrency: 60
      timeoutSeconds: 300
      containers:
        - name: my-ai-service
          resources:
            limits:
              cpu: "1"
              memory: 1Gi
          startupProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
            timeoutSeconds: 5
            failureThreshold: 24  # Total window: 5 + (24 × 5) = 125s

Min Instances for Critical Services

For latency-sensitive services like the Knowledge Base API, we also set run.googleapis.com/minScale: "1" to keep at least one instance always warm. Combined with maxScale: "5", this ensures the service handles baseline traffic without cold starts while still scaling for bursts. The trade-off is cost — a minimum instance is always billed — but for services that back real-time AI agents, the reliability is worth every penny.

Properly tuned startup probes are the difference between a seamless deployment and a page from your on-call rotation.

Lessons Learned: The Startup Timeout Trap

A subtle gotcha when configuring startup probes is neglecting the database migration step. If your build pipeline runs database migrations as part of the container startup command (e.g. alembic upgrade head && gunicorn...), a slow migration or database lock will cause the startup probe to time out repeatedly, forcing Cloud Run to mark the container as unhealthy and restart it. This creates a crash loop that prevents the new revision from ever deploying. The lesson: always run database migrations as a separate step (like a Cloud Run Job or pre-deploy hook) rather than inside the application container startup process.

Zero-Downtime Deployments on Cloud Run: Startup Probes and CPU Boost

More Recent Posts

Hello World: Vibe Coding This Blog with Gemini

Routing AI Traffic: GKE Istio vs Cloud Run Load Balancers

Eliminating Dockerfiles with Cloud Native Buildpacks