Restarting delivery-metrics gateway causes deployment apdex to behave unexpectedly
As seen in gitlab-com/runbooks!3757 (comment 637655059), we recently deployed an update to the delivery-metrics gateway to add additional buckets for the deployment SLO.
When the service restarts, Prometheus scraping this endpoint won't see any value for our `delivery_deployment_duration_seconds_count` metric, because we haven't recorded anything in it yet. Then, when we record our next deployment time, the scraper will see this metric instantly jump to `1` without first seeing it at `0`.
As @smcgivern helpfully explained, this causes the PromQL `rate` function to behave unexpectedly for our purposes -- see https://www.section.io/blog/beware-prometheus-counters-that-do-not-begin-at-zero/ for details:
It seems the `rate` PromQL function always returns zero for the first recorded sample of a series, even when the sample value is non-zero. This is because the goal of the `rate` function is to compare multiple samples and interpolate the values in between. This interpolation behavior is normally why counter metrics are ideal: they allow us to infer system behavior in the time window between scrape intervals, a capability not offered by gauge metrics.

The problem with the first sample of a new metric series is that `rate` is attempting to compare against a non-existent previous value, and Prometheus does not have enough data with which to interpolate.
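To make the failure mode concrete, here is a small illustrative sketch (plain Python, not Prometheus itself) of the core comparison `rate`/`increase` performs over a scrape window. The jump from nothing to `1` is invisible because there is no earlier sample to compare against:

```python
def observed_increase(samples):
    """Increase across a window of scraped counter samples.

    Mirrors the essence of PromQL's rate()/increase(): compare the first
    and last samples in the window (counter resets ignored for simplicity).
    """
    if len(samples) < 2:
        return 0  # a single sample gives nothing to compare against
    return samples[-1] - samples[0]

# Series that already existed at 0 before the deployment was recorded:
print(observed_increase([0, 0, 1, 1]))  # -> 1, the deployment is counted

# Series whose first sample after a restart is already 1:
print(observed_increase([1, 1, 1]))     # -> 0, the 0 -> 1 jump is invisible
```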
The solution Scalability have implemented in gitlab-rails (see example) is to "get" the values for the metric at launch, causing the registry to be aware of them, so that they start to show up immediately during the next scrape (with a value of `0`).
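A minimal sketch of that "touch the metric at boot" approach, using a toy counter registry to stand in for whatever Prometheus client the gateway actually uses (all names here are illustrative, not the real API):

```python
class ToyCounter:
    """Stand-in for a Prometheus client counter with labels."""

    def __init__(self):
        self.values = {}  # label-value tuple -> current count

    def labels(self, *label_values):
        # Merely looking a label set up materialises the series at 0,
        # which is what makes it appear in the very next scrape.
        self.values.setdefault(label_values, 0)
        return label_values

    def inc(self, *label_values):
        self.labels(*label_values)
        self.values[label_values] += 1

deployments = ToyCounter()

# At gateway start-up: "get" the series so it is exported as 0 immediately,
# before any deployment has actually been recorded.
deployments.labels("coordinated_deploy", "success")

print(deployments.values)  # {('coordinated_deploy', 'success'): 0}
```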
This is easy enough to implement in our gateway, but something to keep in mind is that you have to do this for every combination of possible labels. Right now, while we only record `coordinated_deploy`, `success`, this is easy, but as we add more values and labels (`failure`, `environment=gprd`), it will get unwieldy.
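One way to keep the combinatorics manageable could be to enumerate the label values once and pre-initialise the cartesian product at start-up. A sketch, where the extra label values (`failure`, the environments) are hypothetical extensions of the single pair we record today:

```python
from itertools import product

# Hypothetical label values; today we only have the first entry of each
# of the first two lists.
DEPLOY_TYPES = ["coordinated_deploy"]
STATUSES = ["success", "failure"]
ENVIRONMENTS = ["gprd", "gstg"]

initialised = []
for deploy_type, status, environment in product(
    DEPLOY_TYPES, STATUSES, ENVIRONMENTS
):
    # In the real gateway this would be a metric "get" / labels(...) call.
    initialised.append((deploy_type, status, environment))

print(len(initialised))  # 1 * 2 * 2 = 4 series, and the product multiplies
                         # with every label value we add later
```

This keeps the initialisation in one place, but the cardinality warning stands: the number of pre-created series grows multiplicatively with each new label.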