[Meta] SLO Alerts

One of the core requirements of a monitoring system is the ability to generate alerts. While it is nice to be able to view the performance of a system, no one can watch it at all times. The system therefore needs to generate automated alerts when a specific condition occurs, like the HTTP error rate exceeding a given threshold.

User-facing features

There are three major user-facing facets to delivering this:

  • Configuration of desired alert levels for supported metrics
  • Displaying the configured alert levels as they are approached and triggered
  • Notification when an alert is triggered

Configuration of desired alert levels

Prometheus alerting rules can be completely open-ended PromQL statements, which require knowledge of the query language and typically some testing and fine-tuning. While we'd like to get there eventually, this should wait for the broader Query Builder.

For now we should keep this simple, and we can accomplish that by leveraging the queries that are already configured for each chart. Since we already know the query, we can re-use it to create an alert as well, avoiding much of the input otherwise required to create one.

I propose we limit the input, for now, to the following (a rough sketch of how this could be captured follows the list):

  • Chart to attach an alert to
  • Series/label, if multiple are present
  • Alerting threshold and duration. For example, request latency (ms) > 150 for 5 minutes.
    • Dropdown for duration could be: 0, 1, 5, or 15 minutes
    • Dropdown for operator: <, >, =
    • Threshold: number field
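
As a rough sketch only (the field names below are hypothetical, not an agreed schema), the captured input might look something like this per chart:

    # Hypothetical shape of a stored alert definition; names are illustrative only
    alert:
      chart: Latency (ms)         # chart the alert is attached to
      series: canary              # only needed if the chart has multiple series
      operator: ">"               # one of <, >, =
      threshold: 150              # number field
      duration: 5m                # dropdown: 0, 1, 5, or 15 minutes
      environment: production     # alerts are scoped to the environment they were created on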

While we should wait for UX to propose a design, one idea would be to have a little alarm icon on each chart. This would have three states:

  • Empty (no alerts configured)
  • Active (alerts configured)
  • Alarming (alert is currently firing)

Clicking on it would list the alert(s) for the chart, and allow you to edit them. This would bring up a small modal to prompt for: series (if needed), operator, duration, and threshold.

Note that these alerts should be specific to the environment they were configured on. (So production alerts should not fire for review apps, etc.)

Displaying the configured alert levels

Now that we have configured alerting levels, we should attempt to display them on the chart. This could take the form of a horizontal line at the configured level.

A few notes:

  • For single series charts the line can be red.
  • For multiple series charts, the line can be colored similarly to the series it is alerting on, perhaps dashed or with some other indicator that won't be confused with the canary line. (Different dash type?)
  • The line should not be shown if the scale doesn't otherwise reach it. So if the response time is so low that the alert level is off the screen, we shouldn't squash the rest of the data just to fit the alerting level in. (Definitely open to feedback on this one)

Notification when an alert is triggered

As noted above, we can have a third alarm state that indicates an alarm is active. We should also provide proactive notifications. While we have a larger issue for a centralized alerting framework, for this initial release we should consider something simpler.

I would propose we issue email-based alerts to all masters/owners of the project. A link can then take them to the dashboard page for the environment in question.

If others agree, we can develop the email template.

Technical implementation

Rule creation

With the queries known, we can derive the alerting configuration from the user's input: the alert expression is simply the existing query, operator, and threshold value, which for example could be (requests[5xx] / requests) * 100 > 1. In the alert rule we would then set the for clause to the desired duration.
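
As an illustration only (the group name, alert name, and label below are made up, and the exact file format depends on the Prometheus version deployed), the generated rule could look roughly like:

    # Sketch of a generated alerting rule, assuming the Prometheus 2.x rule file format
    groups:
      - name: gitlab-managed-alerts                       # hypothetical group name
        rules:
          - alert: ErrorRateHigh                          # hypothetical alert name
            expr: (requests[5xx] / requests) * 100 > 1    # chart query + operator + threshold (illustrative expression from above)
            for: 5m                                       # user-selected duration
            labels:
              gitlab_environment: production              # hypothetical label scoping the alert to one environment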

Since alerts are configured in Prometheus by loading a YAML file at startup, we need a way to manage the config and lifecycle of the Prometheus server.

Luckily, with the addition of GitLab-deployed Prometheus servers on Kubernetes clusters, we have servers that we fully control and whose config we can easily manage via ConfigMaps.

We should leverage this capability to dynamically manage the configured alerts, as they are added, edited, and deleted.
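
For example, assuming the chart mounts rule files from a ConfigMap (the name, namespace, and file key below are hypothetical), GitLab could write the generated rules into something like:

    # Hypothetical ConfigMap carrying the generated rule file; the actual name and
    # mount path depend on how the Prometheus Helm chart is configured
    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: prometheus-gitlab-alerts
      namespace: gitlab-managed-apps
    data:
      gitlab-alerts.rules.yml: |
        groups:
          - name: gitlab-managed-alerts
            rules: []          # rules generated as shown above would be written here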

Alerting

While we could utilize the built-in email alerting (handled by AlertManager), we really want GitLab to be aware of these alerts for a variety of reasons:

  • Alerting logs, metrics, and representation in the UI
  • Subsequent automated action, like rolling back a release or pausing a canary deploy

So email alone is a poor solution, in that it would not move us in the direction we actually want to go. We would also need to provision Prometheus (AlertManager) with SMTP configuration, which would involve potentially secret information (SMTP auth info), and if GitLab is using Postfix we would have no SMTP credentials to provision at all.

Instead we have two other options to evaluate:

  • Prometheus Webhook support
  • GitLab specific alerting plugin for Prometheus

I think for now a simple webhook alert may be sufficient, since we'd need to build the API support on the GitLab side for the plugin anyway. We could include a token on the webhook for validation, to prevent abuse.
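
For reference, the payload AlertManager POSTs to a webhook receiver has roughly the following shape (GitLab would receive it as JSON; it is shown here in YAML form for readability, with illustrative values):

    # Approximate shape of AlertManager's webhook notification that the GitLab endpoint would parse
    receiver: gitlab-webhook
    status: firing                                  # or resolved
    externalURL: http://alertmanager.example.com
    commonLabels:
      alertname: ErrorRateHigh
      gitlab_environment: production
    alerts:
      - status: firing
        labels:
          alertname: ErrorRateHigh
          gitlab_environment: production
        annotations: {}
        startsAt: "2017-10-01T00:00:00Z"
        generatorURL: http://prometheus.example.com/graph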

This management is performed by AlertManager, which should be deployed alongside Prometheus. (It is as simple as flipping alertmanager.enabled=true in the Helm chart.) We would then need to pass the webhook configuration to AlertManager via a ConfigMap (a sketch follows). This lifecycle should be managed as part of the broader Prometheus integration lifecycle, alongside the alert rule creation.
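
A minimal sketch of that AlertManager configuration, assuming a hypothetical GitLab notification endpoint and token parameter (neither is an existing API), might be:

    # Sketch of an AlertManager config routing everything to a GitLab webhook receiver;
    # the endpoint URL and token handling are assumptions, not an existing GitLab API
    route:
      receiver: gitlab-webhook
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
    receivers:
      - name: gitlab-webhook
        webhook_configs:
          - url: https://gitlab.example.com/prometheus/alerts/notify?token=SECRET_TOKEN
            send_resolved: true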
