Add readiness probe endpoint that detects unlicensed multi-replica state #21255

@blinkagent

Problem

When multiple Coder instances that share the same database are deployed behind a load balancer (e.g., a Kubernetes Service) without a Premium license, users experience intermittent failures. This happens because:

  1. Each Coder instance registers itself as a replica in the database
  2. After the grace period (default 1 minute), the entitlements check detects multiple replicas without HA entitlement
  3. An error is recorded: "You have multiple replicas but high availability is an Enterprise feature. You will be unable to connect to workspaces."
  4. However, the /healthz endpoint still returns 200 OK
  5. The load balancer continues routing traffic to all nodes, but only one can properly serve workspace connections

The result is that roughly 50% of requests fail unpredictably with two replicas, and the ratio worsens as more replicas are added.
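The arithmetic behind that ratio can be sketched as follows. This is only a rough model: it assumes the load balancer spreads requests evenly and that exactly one replica can serve workspace connections, as described in step 5. The function name expectedFailureRatio is hypothetical and does not exist in Coder.

```go
package main

import "fmt"

// expectedFailureRatio estimates the share of requests that land on a replica
// unable to serve workspace connections. Assumptions (not from the Coder
// codebase): requests are spread evenly, and exactly one replica works.
func expectedFailureRatio(replicas int) float64 {
	if replicas <= 1 {
		// A single replica has no unlicensed peers, so nothing fails.
		return 0
	}
	return float64(replicas-1) / float64(replicas)
}

func main() {
	for _, n := range []int{1, 2, 3, 4} {
		fmt.Printf("replicas=%d expected failure ratio=%.2f\n", n, expectedFailureRatio(n))
	}
}
```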

Current Behavior

The /healthz endpoint (coderd/coderd.go:909) unconditionally returns "OK":

r.Get("/healthz", func(w http.ResponseWriter, _ *http.Request) { _, _ = w.Write([]byte("OK")) })

The entitlement error is already detected and stored (enterprise/coderd/license/license.go:431-449), but it is only exposed via:

  • The authenticated /api/v2/entitlements endpoint
  • Warning headers on authenticated responses

Neither of these can be used for Kubernetes readiness probes.

Proposed Solution

Add a /readyz endpoint (unauthenticated) that returns:

  • 200 OK when the node is fully operational
  • 503 Service Unavailable when the node has critical issues that should exclude it from load balancing

The readiness check should verify:

  1. Database connectivity - Can the node reach the database?
  2. No blocking entitlement errors - Specifically, the multi-replica without HA license error

This follows Kubernetes conventions where:

  • /healthz (liveness) = "Is the process alive?" → restart if failing
  • /readyz (readiness) = "Can this instance serve traffic?" → remove from load balancer if failing
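To illustrate the split from the prober's side, here is a minimal sketch of how an HTTP probe interprets the two endpoints. Kubernetes treats any status code from 200 up to (but not including) 400 as success. The helper names probeSucceeded and probe are hypothetical, and the test server only simulates a node in the unlicensed multi-replica state.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"time"
)

// probeSucceeded mirrors the Kubernetes HTTP probe rule:
// any status code >= 200 and < 400 counts as success.
func probeSucceeded(statusCode int) bool {
	return statusCode >= 200 && statusCode < 400
}

// probe issues one GET request and reports whether it counts as a success.
func probe(client *http.Client, url string) bool {
	resp, err := client.Get(url)
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return probeSucceeded(resp.StatusCode)
}

func main() {
	mux := http.NewServeMux()
	// Liveness: the process is alive, so this always answers 200.
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, _ *http.Request) {
		_, _ = w.Write([]byte("OK"))
	})
	// Readiness: simulate a node that must be taken out of rotation.
	mux.HandleFunc("/readyz", func(w http.ResponseWriter, _ *http.Request) {
		http.Error(w, "entitlement error", http.StatusServiceUnavailable)
	})

	srv := httptest.NewServer(mux)
	defer srv.Close()

	client := &http.Client{Timeout: 2 * time.Second}
	fmt.Println("liveness probe succeeded:", probe(client, srv.URL+"/healthz"))
	fmt.Println("readiness probe succeeded:", probe(client, srv.URL+"/readyz"))
}
```

With this split, the simulated node keeps running (liveness passes) but stops receiving traffic (readiness fails), which is the behavior the proposal is after.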

Implementation Notes

Key code areas:

  • Entitlements tracking: coderd/entitlements/entitlements.go - Add a method such as HasBlockingErrors() bool that reports whether any recorded error should make the node unready
  • New endpoint: coderd/coderd.go - Add /readyz route
  • Error detection: The replica error is already generated in enterprise/coderd/license/license.go in the Errors slice
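A minimal sketch of what HasBlockingErrors could look like is shown below. The Set type here is a simplified, hypothetical stand-in and not the real type in coderd/entitlements; SetErrors and multiReplicaMarker are likewise illustrative names. Matching on a substring of the error message quoted above is an assumption made to keep the example short; a typed error or a dedicated flag set where the error is generated would likely be more robust.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// multiReplicaMarker is a substring of the error message quoted in this issue.
// Matching on message text is an assumption of this sketch, not Coder's design.
const multiReplicaMarker = "multiple replicas but high availability"

// Set is a hypothetical, simplified stand-in for the entitlements set.
type Set struct {
	mu     sync.RWMutex
	errors []string
}

// SetErrors replaces the recorded entitlement errors with a copy of errs.
func (s *Set) SetErrors(errs []string) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.errors = append([]string(nil), errs...)
}

// HasBlockingErrors reports whether any recorded error should make the node
// unready. In this sketch only the multi-replica error is treated as blocking.
func (s *Set) HasBlockingErrors() bool {
	s.mu.RLock()
	defer s.mu.RUnlock()
	for _, e := range s.errors {
		if strings.Contains(e, multiReplicaMarker) {
			return true
		}
	}
	return false
}

func main() {
	s := &Set{}
	fmt.Println("blocking before:", s.HasBlockingErrors())
	s.SetErrors([]string{
		"You have multiple replicas but high availability is an Enterprise feature. You will be unable to connect to workspaces.",
	})
	fmt.Println("blocking after:", s.HasBlockingErrors())
}
```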

Example implementation sketch:

r.Get("/readyz", func(w http.ResponseWriter, r *http.Request) {
    // Bound the check so a slow database cannot stall the probe.
    ctx, cancel := context.WithTimeout(r.Context(), 5*time.Second)
    defer cancel()

    // Check database connectivity (the latency result is ignored here).
    if _, err := api.Database.Ping(ctx); err != nil {
        w.WriteHeader(http.StatusServiceUnavailable)
        _, _ = w.Write([]byte("database unreachable"))
        return
    }

    // Check for blocking entitlement errors.
    // HasBlockingErrors is the proposed method; it does not exist yet.
    if api.Entitlements.HasBlockingErrors() {
        w.WriteHeader(http.StatusServiceUnavailable)
        _, _ = w.Write([]byte("entitlement error"))
        return
    }

    // No explicit WriteHeader: the first Write sends 200 OK.
    _, _ = w.Write([]byte("OK"))
})

Alternatives Considered

  1. Modify /healthz directly - Rejected because changing liveness probe behavior could cause restart loops instead of just removing the node from the load balancer

  2. Require authentication on readiness probe - Rejected because Kubernetes probes typically run without application-level auth, and managing secrets for probes adds operational complexity

  3. External monitoring of /api/v2/entitlements - Works but requires additional infrastructure (sidecar, external health checker with credentials)

Additional Context

This issue particularly affects:

  • Kubernetes deployments using replicas > 1 without realizing HA requires Premium
  • Blue-green or rolling deployments where multiple pods temporarily coexist
  • Development/staging environments that mirror production topology without licenses

The current workaround is to manually ensure only one replica runs, but this defeats the purpose of high availability and is error-prone.
