labkit/v2/app: Graceful degradation on startup for non-critical components
Context
app.Start currently halts if any registered component fails to start. This was flagged as a production blocker in the Artifact Registry PoC assessment.
Problem
In production environments, not all components are equally critical:
- A k8s sidecar container (e.g. the OTel Collector) may not be ready at the exact moment the application starts
- Some components (e.g. a background job worker) may be optional for core request handling
- An all-or-nothing startup model prevents the service from serving traffic while non-critical dependencies recover
Existing Go frameworks offer related semantics: fx, for example, supports optional dependencies.
Proposed solution
Allow components to be marked as non-critical. A non-critical component failure during startup:
- Logs a warning rather than halting startup
- Marks the component as degraded in the readiness check
- Triggers background retries until the component starts successfully (self-healing)
Tasks
- Design the Component interface extension (e.g. Optional() bool or a wrapper type)
- Update app.Start to handle non-critical component failures gracefully
- Expose degraded state via the /-/readiness endpoint
- Add background retry loop for failed non-critical components
- Update go-service-template to demonstrate the pattern
- Document which components should be marked non-critical vs critical
References
- Parent epic: gitlab-org/quality#360
- Artifact Registry PoC feedback