labkit/v2/app: Graceful degradation on startup for non-critical components

Context

app.Start currently halts if any registered component fails to start. This was flagged as a production blocker in the Artifact Registry PoC assessment.

Problem

In production environments, not all components are equally critical:

  • A k8s sidecar container (e.g. the OTel Collector) may not be ready at the exact moment the application starts
  • Some components (e.g. a background job worker) may be optional for core request handling
  • An all-or-nothing startup model prevents the service from serving traffic while non-critical dependencies recover

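Existing Go frameworks offer only partial precedent. fx supports optional dependencies via the optional:"true" tag on parameter structs, but that covers providers missing at wiring time rather than components that fail to start. wire is a compile-time injector with no lifecycle or soft-dependency semantics.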

Proposed solution

Allow components to be marked as non-critical. When a non-critical component fails during startup, app.Start:

  1. Logs a warning and continues rather than halting startup
  2. Marks the component as degraded in the readiness check
  3. Retries starting the component in the background until it succeeds (self-healing)

Tasks

  • Design the Component interface extension (e.g. Optional() bool or a wrapper type)
  • Update app.Start to handle non-critical component failures gracefully
  • Expose degraded state via the /-/readiness endpoint
  • Add background retry loop for failed non-critical components
  • Update go-service-template to demonstrate the pattern
  • Document which components should be marked non-critical vs critical
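
For the readiness and retry tasks, the sketch below shows one possible shape. It uses only the Go standard library. The health type, the retry function, the JSON body, and the backoff values are all assumptions, not existing labkit behaviour. It also assumes that a degraded non-critical component leaves the service ready (HTTP 200) and is only listed in the response body. Whether /-/readiness should instead fail is a design question for this issue.

```go
package main

import (
	"context"
	"encoding/json"
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
	"sort"
	"strings"
	"sync"
	"time"
)

// health tracks which non-critical components are currently degraded.
type health struct {
	mu       sync.Mutex
	degraded map[string]bool
}

func (h *health) set(name string, degraded bool) {
	h.mu.Lock()
	defer h.mu.Unlock()
	if degraded {
		h.degraded[name] = true
		return
	}
	delete(h.degraded, name)
}

// ServeHTTP reports readiness. Assumption: degraded components are listed
// in the body but do not make the service unready.
func (h *health) ServeHTTP(w http.ResponseWriter, _ *http.Request) {
	h.mu.Lock()
	names := make([]string, 0, len(h.degraded))
	for n := range h.degraded {
		names = append(names, n)
	}
	h.mu.Unlock()
	sort.Strings(names)

	status := "ok"
	if len(names) > 0 {
		status = "degraded"
	}
	w.Header().Set("Content-Type", "application/json")
	json.NewEncoder(w).Encode(map[string]any{"status": status, "degraded": names})
}

// retry calls start with exponential backoff (capped at maxDelay) until it
// succeeds or ctx is cancelled, then clears the component's degraded flag.
func retry(ctx context.Context, h *health, name string, start func(context.Context) error, delay, maxDelay time.Duration) {
	for {
		select {
		case <-ctx.Done():
			return
		case <-time.After(delay):
		}
		if err := start(ctx); err == nil {
			h.set(name, false)
			return
		}
		delay *= 2
		if delay > maxDelay {
			delay = maxDelay
		}
	}
}

// probe issues an in-memory request against the handler and returns the body.
func probe(h *health) string {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/-/readiness", nil))
	return strings.TrimSpace(rec.Body.String())
}

// runExample marks "otel" degraded, retries a start function that succeeds
// on its third attempt, and returns the readiness body before and after.
func runExample() (before, after string) {
	h := &health{degraded: map[string]bool{}}
	h.set("otel", true)
	attempts := 0
	start := func(context.Context) error {
		attempts++
		if attempts < 3 {
			return errors.New("collector not ready")
		}
		return nil
	}
	before = probe(h)
	// Called synchronously here for the demo; an application would run `go retry(...)`.
	retry(context.Background(), h, "otel", start, time.Millisecond, 4*time.Millisecond)
	after = probe(h)
	return before, after
}

func main() {
	before, after := runExample()
	fmt.Println(before)
	fmt.Println(after)
}
```

Passing the application's shutdown context to retry ensures the background loop stops when the service does.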
