Deployment health metric is at risk

Problem Statement 📉

We leverage Deployment Health Metrics in a non-automated fashion to determine if it is safe to proceed with various tasks.

Initiating a deploy
During target steps of a deploy to provide us with an updated status as a deploy progresses
When Engineering is making changes to Feature Flags

Lately, these metrics have been dipping in and out of a healthy state which seemingly have been becoming increasingly unstable. Here's a view of the metric across all services during peak load times:

While all services shown are not subject to our checks, all services appear to be experiencing problems. Yet, no alarms are being signaled to any SRE that there's a problem with any service. Most of the time we see these metrics indicate poor behavior, we can run a retry on a given command and we'll receive a green light. This is indicative that these metrics are currently highly flappy. Due to this, the confidence in our ability to rely on these metrics will deteriorate if we do not uncover a root cause or tweak how we leverage this metric.

Solutionize 📈

These metrics should be reliable. Utilize this issue to identify why this metric is seemingly flappy. Make any attempt we can at tweaking either the metric or thresholds in order to stabilize the metric. The goal here is not not make the metric work for us, but to identify why our metrics are going unhealthy while our services do not appear to be negatively impacted.

cc @gitlab-org/delivery

Edited Feb 17, 2023 by John Skarbek