Reinstate Labkit::Metrics and fixes locking pattern
What does this MR do and why?
Resolves #549420 (closed).
Relates to gitlab-com/gl-infra/observability/team#4160 (closed).
The Labkit::Metrics
had been introduced in the MR !188302 (merged), behind an environment variable flag (LABKIT_METRICS_ENABLED). However, there was bug hidden in there causing a deadlock.
The specific commit fixing the issue is here. Explanation:
- safe_provide_metric acquires PROVIDER_MUTEX.synchronize
- It calls provide_metric(metric_type, metric_name, *args)
- registry.get(metric_name) returns nil (metric doesn't exist)
- The || operator triggers registry.method(metric_type).call(metric_name, *args)
- The registry call goes through Redis instrumentation, triggered by Gitlab::CurrentSettings.prometheus_metrics_enabled
- The instrumentation tries to record its own metrics, calling safe_provide_metric again, which tries to acquire the same mutex that's already held
This issue was only present in Gitlab::Metrics::Prometheus. The Gitlab::Metrics::Labkit implementation is not affected by that.
This bug has been introduced due to an attempt of simplifying the current implementation, as a step to replace it by Labkit::Metrics, however, this edge case of recursive call wasn't clear given the hidden call to the metrics library from a instrumentation object call within a method call from the model.
The implementation of safe_provide_metric
is now the same as before the merge.
References
Previous MR with the revert: !194383 (merged)
Screenshots or screen recordings
Before | After |
---|---|
How to set up and validate locally
I verified this change by running this with & without the revert:
RAILS_ENV=test bin/rails db:reset
RAILS_ENV=test bin/rails db:migrate
Same steps as reproduced here: !194383 (merged)
MR acceptance checklist
Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.