Excess ThanosRuleHighRuleEvaluationWarning incident issues
We're getting a lot of these issues, including the following (open at the time of writing):
- #4223 (closed)
- #4243 (closed)
- #4244 (closed)
- #4248 (closed)
- #4249 (closed)
- #4257 (closed)
- #4260 (closed)
- #4261 (closed)
- #4264 (closed)
- #4265 (closed)
There are at least two problems occurring:
Duplicates
Sometimes we get duplicate alerts, which generate two issues for a given alert; when the alert resolves, only one (the latest) issue is closed. An example is Alert 249, in which the two issues were created within 1 minute of each other, but only the second one was closed. This smells like it might be a bug in GitLab itself, although that's just speculation right now.
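For whoever picks this up, here is a rough sketch of how the duplicate pattern could be spotted in an export of the incident issues. The record fields (`alert_id`, `iid`, `created_at`), the function name, and the sample data are all made up for illustration and are not the shape of any GitLab API response; the 1-minute window is taken from the Alert 249 case.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def find_duplicate_issues(issues, window=timedelta(minutes=1)):
    """Return {alert_id: [(earlier_iid, later_iid), ...]} for issues on the
    same alert that were created within `window` of each other.

    `issues` is a list of dicts with hypothetical keys:
    "alert_id", "iid" and "created_at" (a datetime).
    """
    # Group issues by the alert that created them.
    by_alert = defaultdict(list)
    for issue in issues:
        by_alert[issue["alert_id"]].append(issue)

    duplicates = {}
    for alert_id, group in by_alert.items():
        # Compare each issue with the next one in creation order.
        group.sort(key=lambda issue: issue["created_at"])
        pairs = [
            (earlier["iid"], later["iid"])
            for earlier, later in zip(group, group[1:])
            if later["created_at"] - earlier["created_at"] <= window
        ]
        if pairs:
            duplicates[alert_id] = pairs
    return duplicates


# Made-up sample data, not real alert or issue numbers.
sample = [
    {"alert_id": 1, "iid": 101, "created_at": datetime(2021, 1, 1, 10, 0, 0)},
    {"alert_id": 1, "iid": 102, "created_at": datetime(2021, 1, 1, 10, 0, 40)},
    {"alert_id": 2, "iid": 103, "created_at": datetime(2021, 1, 1, 11, 0, 0)},
]
print(find_duplicate_issues(sample))
```

Reporting pairs rather than just alert IDs makes it easy to check which of the two issues was left open after the alert resolved.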
Alerts not closing:
The duplicate pattern doesn't explain every case, though. https://gitlab.com/gitlab-com/gl-infra/production/-/alert_management/256/details#/activity created two issues that are both still open, i.e. it looks like the resolution was never received. https://gitlab.com/gitlab-com/gl-infra/production/-/alert_management/259/details#/activity also didn't close, but only had one issue created.
This might be an infrastructure problem, although it's even harder to be sure.
I've closed all but the latest outstanding issue for now, to clear up the list. @gitlab-com/gl-infra/sre-observability if someone has time to look at what's going on, it'd be good to get this cleaned up; the noise adds mental overhead to handover issues and makes it harder to get a handle on which incidents are active.