We will be shipping AlertManager via GitLab 10.8 (omnibus-gitlab#2999 (closed)). I think we should begin shipping default alerts for GitLab administrators. What metrics are most useful to add ASAP as alerts for most GitLab users/customers?
@stanhu How dynamic is this list? To what extent can alerts be added in one release and removed in the next? For Gitaly, the metric makes sense now, although I hope that a release from now we'll understand why it happens and have it under control.
Furthermore, who is responsible for maintaining these metrics? And to what extent will this be dogfooded? Would it be an idea to turn this problem on its head and export all of our alerts unless they are marked as internal? The alerts in the Gitaly channel are generic enough to be a good start, and this would also improve the administration documentation, since the links in the alerts will need to point to public resources.
I don't think we can write a good alert rule for Gitaly ResourceExhausted at this point. We are seeing a lot of non-actionable noise and we don't know what to do about it.
Gitaly relative error rate would be more useful. As @zj points out, the definition of "error" is evolving over time. How do we keep that in sync with the alerts we ship, as opposed to the alerts we use every day?
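For illustration, a relative error rate alert could look roughly like the sketch below. This is not an alert we ship today: the `job="gitaly"` selector is an assumption about the scrape configuration, the metric names come from Gitaly's go-grpc-prometheus instrumentation, and which gRPC codes are excluded from the error count is exactly the definition that keeps evolving.

```yaml
# Sketch only: assumes a Prometheus scrape job named "gitaly" and treats
# every gRPC code except OK, Canceled and NotFound as an error. The set of
# excluded codes would need to track the evolving definition of "error".
- alert: GitalyHighErrorRate
  expr: >
    sum(rate(grpc_server_handled_total{job="gitaly",grpc_code!~"OK|Canceled|NotFound"}[5m]))
    /
    sum(rate(grpc_server_handled_total{job="gitaly"}[5m]))
    > 0.05
  for: 10m
  labels:
    severity: warning
  annotations:
    description: "More than 5% of Gitaly RPCs have failed over the last 5 minutes."
```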
Whatever alerts we ship will be included in omnibus, so my recommendation is to include only alerts that have been well vetted by gitlab.com production. My plan for a first pass was to include only very basic things like "is Gitaly up", "is Workhorse up", etc.
The idea is that we want to include alerts that are most actionable by end users, and have a clear runbook that they can follow.
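As a sketch of what that first pass could look like (the job names below are assumptions and would need to match whatever the bundled Prometheus scrape configuration actually uses):

```yaml
groups:
  - name: gitlab-availability.rules
    rules:
      # Assumes scrape jobs named "gitaly" and "gitlab-workhorse";
      # the regex would be extended to cover the other bundled services.
      - alert: ServiceDown
        expr: up{job=~"gitaly|gitlab-workhorse"} == 0
        for: 5m
        labels:
          severity: critical
        annotations:
          description: "{{ $labels.job }} on {{ $labels.instance }} has been unreachable for 5 minutes."
```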
I agree with @bjk-gitlab. Let's start with simple alerts. If we get too complex without good documentation and have false positives, then support may get questions we can't answer.
For the database I think there are some known misconfigurations that are ticking time bombs and will cause problems for users. For instance, both in our own production environment and with multiple users, we've seen unused replication slots cause PostgreSQL to use up all the disk space. We have an alert for this in our cookbooks:

The only problem is that it's based on a query we add to queries.yaml in gitlab-exporters. We can start pushing these queries upstream to postgres_exporter, ship our queries.yaml in GitLab, or add the queries to gitlab-monitor. That's roughly my order of preference, but I'm open to suggestions: which would be easiest to maintain in the long term?
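For reference, a minimal version of that check in postgres_exporter's custom-query format might look like the sketch below. This is an illustrative reconstruction, not the actual query from our cookbooks; the query key and column names are assumptions, and postgres_exporter derives the metric names as `<key>_<column>`.

```yaml
# queries.yaml sketch: expose replication slot state so Prometheus can
# alert on slots that are defined but never consumed.
pg_replication_slots:
  query: "SELECT slot_name, active::int AS active FROM pg_replication_slots"
  metrics:
    - slot_name:
        usage: "LABEL"
        description: "Name of the replication slot"
    - active:
        usage: "GAUGE"
        description: "1 if the slot is currently in use, 0 if it is unused"
```

The matching alert rule would then just check something like `pg_replication_slots_active == 0` held for a sustained period (say 30 minutes) before firing, since an unused slot retains WAL until the disk fills up.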
> My plan for a first pass was to include only very basic things like "is Gitaly up", "is Workhorse up", etc.
While this is a good start, it has limited usefulness. As a GitLab admin I know how to tell if my components are up; what is harder to tell is whether they are healthy and, if not, what to do about it. For example, see https://gitlab.com/gitlab-org/gitlab-ce/issues/42575.
@bbodenmiller Yes, but the point of alerts is that they tell you proactively, without you having to check the status of each component all day.
Resource alert metrics can be gathered from the node_exporter.
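A disk space alert is probably the least noisy place to start with node_exporter metrics. The sketch below assumes node_exporter 0.16+ metric names (older releases drop the `_bytes` suffix), and the `for:` duration is what keeps transient dips from paging anyone.

```yaml
# Sketch of a low-noise resource alert built on node_exporter metrics.
- alert: FilesystemAlmostFull
  expr: >
    node_filesystem_avail_bytes{fstype!~"tmpfs|ramfs"}
    / node_filesystem_size_bytes{fstype!~"tmpfs|ramfs"}
    < 0.10
  for: 30m
  labels:
    severity: warning
  annotations:
    description: "{{ $labels.instance }} has less than 10% disk space left on {{ $labels.mountpoint }}."
```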
I would like to be careful about how we implement CPU and memory alerts, as they can be very noisy.
Additional health alerts will come later; we just need to make sure that they're "high quality", in the sense that they won't fire when they shouldn't. The last thing I want to do is induce notification fatigue right away and cause people not to use the built-in alerting feature.
@bjk-gitlab right... Just giving some suggestions for alerts that'd be nice to ship out of the box. Agreed that health alerts are needed, but we need to be careful not to have false alarms.
This issue has passed the feature freeze date and is considered a missed deliverable. If this is correct, please add a corresponding missed:xx.x label, too.