Enable metrics collection during graceful shutdown by setting publishNotReadyAddresses
What does this MR do?
Sets publishNotReadyAddresses: true on the metrics Service to allow Prometheus to continue scraping metrics from runner manager pods during graceful shutdown.
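As a rough illustration of the change, a rendered metrics Service might look like the sketch below. Only publishNotReadyAddresses: true comes from this MR; the Service name, labels, selector, and port values are hypothetical placeholders and may not match what the chart actually renders.

```yaml
# Minimal sketch of a metrics Service with the new field set.
# The name, labels, selector, and port values are assumptions for illustration.
apiVersion: v1
kind: Service
metadata:
  name: gitlab-runner-metrics      # hypothetical name
  labels:
    app: gitlab-runner             # hypothetical label
spec:
  type: ClusterIP
  # Keep pod addresses published even when the pod is not ready,
  # which includes pods that are terminating.
  publishNotReadyAddresses: true
  selector:
    app: gitlab-runner             # hypothetical selector
  ports:
    - name: metrics
      port: 9252                   # assumed metrics port
      targetPort: metrics
```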
Why was this MR needed?
GitLab Runner's graceful shutdown can take hours as it waits for all jobs to complete. By default, Kubernetes removes a pod from the Service endpoints as soon as it starts terminating, so Prometheus, which discovers its targets through the ServiceMonitor, stops scraping metrics for the entire shutdown period. This leaves us without observability into the graceful shutdown process.
This change is safe because the Service is only used for metrics collection via ServiceMonitor - no external traffic is routed through it. An alternative approach would be to use a PodMonitor instead, which scrapes pods directly rather than through Service endpoints.
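For comparison, the PodMonitor alternative could be sketched roughly as follows. This is not part of this MR; the resource name, label selector, port name, and scrape interval are hypothetical, and it assumes the Prometheus Operator CRDs are installed in the cluster.

```yaml
# Sketch of the PodMonitor alternative (not what this MR implements).
# Name, labels, port name, and interval are assumptions for illustration.
apiVersion: monitoring.coreos.com/v1
kind: PodMonitor
metadata:
  name: gitlab-runner              # hypothetical name
spec:
  selector:
    matchLabels:
      app: gitlab-runner           # hypothetical pod label
  podMetricsEndpoints:
    - port: metrics                # must match a named container port on the pod
      path: /metrics
      interval: 30s
```

Because a PodMonitor selects pods rather than Service endpoints, it does not depend on how the Service publishes addresses.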
What's the best way to test this MR?
- Deploy the chart with metrics and ServiceMonitor enabled
- Trigger a pod termination (e.g., kubectl delete pod <runner-pod>)
- Verify that Prometheus continues to scrape metrics from the terminating pod during the terminationGracePeriodSeconds window
- Check the Service endpoints to confirm the terminating pod remains in the list