
Draft: Bulk process alert notifications from Prometheus

What does this MR do and why?

Context:

When a project has an alert integration configured, Prometheus servers can be configured to send alerts to GitLab. A single Prometheus request may contain many unique alerts within the same payload. Currently, we handle each alert individually, and Prometheus sets no strict limit on the number of alerts a notification can contain. So we've got a pretty hefty N+1.

Related issue: https://gitlab.com/gitlab-org/gitlab/-/issues/348676

Changes in this MR
  • Uses the service added in !95827 (closed) to process Prometheus alerts in bulk
Scope:
  • Generic alerts will be handled in !95834 (closed)
  • Removal/cleanup of the old classes/modules/specs will happen in a future MR
Motivation
  • Improve the performance of processing Prometheus payloads which contain many alerts

Holistic overview

Related MRs

All MRs will merge into master, but they should merge in the order below:

flowchart RL
    master
    95827["1. Add service (!95827)"]
    95834["2.a. Use service for generic alerts (!95834)"]
    95854["2.b. Use service for prometheus alerts (!95854)"]
    TBD["3. Delete unneeded tests/files (MR TBD)"]

    95827-->master
    95834-->95827
    95854-->95827
    TBD-->95834
    TBD-->95854
Expected data flow for alert processing
  1. Alert comes into integration endpoint
  2. Controller determines which NotifyService to use
  3. NotifyService checks permissions & validity
  4. NotifyService parses input into Gitlab::AlertManagement::Payload format
  5. NotifyService passes payloads to BulkProcessAlertsService
  6. BulkProcessAlertsService is responsible for finding any existing alerts for the payloads and triggering any side-effects
  7. For each alert, BulkProcessAlertsService delegates to UpdateAlertFromPayloadService to modify the alert itself.
    • As a note: This is still an N+1, but it will stay for now, since we return the id of each alert in the HTTP response. The expected system notes & record updates will vary by alert, so we still need to run validations on save.
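Steps 5 to 7 can be sketched roughly as below. This is a minimal illustration, not the code from this MR: `AlertStore`, `BulkProcessSketch`, the `Payload`/`Alert` structs, and the `fingerprint` key are hypothetical stand-ins. It only shows the shape of the change: existing alerts are looked up once for the whole batch, while each alert is still updated and saved individually so its id can be returned.

```ruby
# Hypothetical, simplified stand-ins; these are NOT the classes from this MR.
Payload = Struct.new(:fingerprint, :title, :ends_at, keyword_init: true)
Alert = Struct.new(:id, :fingerprint, :title, :status, keyword_init: true)

# In-memory store that counts lookups so the batching is visible.
class AlertStore
  attr_reader :lookup_count

  def initialize
    @alerts = {}
    @lookup_count = 0
    @next_id = 1
  end

  # One lookup for all fingerprints, instead of one lookup per payload.
  def find_by_fingerprints(fingerprints)
    @lookup_count += 1
    @alerts.slice(*fingerprints)
  end

  def save(alert)
    alert.id ||= next_id
    @alerts[alert.fingerprint] = alert
  end

  private

  def next_id
    id = @next_id
    @next_id += 1
    id
  end
end

class BulkProcessSketch
  def initialize(store)
    @store = store
  end

  # Returns the id of each processed alert, in payload order.
  def execute(payloads)
    existing = @store.find_by_fingerprints(payloads.map(&:fingerprint))

    payloads.map do |payload|
      alert = existing[payload.fingerprint] || Alert.new(fingerprint: payload.fingerprint)
      update_alert(alert, payload) # per-alert step: still runs N times
      @store.save(alert)
      existing[payload.fingerprint] = alert # reuse if the batch repeats a fingerprint
      alert.id
    end
  end

  private

  # Stand-in for the per-alert update service; statuses are simplified.
  def update_alert(alert, payload)
    alert.title = payload.title
    alert.status = payload.ends_at ? :resolved : :triggered
  end
end
```

With this shape, the number of lookups stays at one per request regardless of how many alerts the payload contains, while the number of saves still grows with the batch.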

How to set up and validate locally

Send Prometheus alerts
  1. In a project with maintainer+ permissions, navigate to Settings > Monitor > Alerts to create/turn on an HTTP alert integration
    • Alerts from Prometheus can be sent to either a "Prometheus" integration or a generic HTTP integration. They'll be processed the same either way
    • Skip custom mapping fields for fastest integration creation
  2. Select "Send test alert" to send an alert to your integration
    {
      "version" : "4",
      "groupKey": null,
      "status": "firing",
      "receiver": "",
      "groupLabels": {},
      "commonLabels": {},
      "commonAnnotations": {},
      "externalURL": "", 
      "alerts": [{
        "startsAt": "2022-08-30T11:22:40Z", 
        "generatorURL": "http://host?g0.expr=up", 
        "endsAt": null,
        "status": "firing",
        "labels": {
          "gitlab_environment_name": "production"
        }, 
        "annotations": {}
      }]
    }
  3. Visit Monitor > Alerts to see the new alert, and select Activity to see the system notes which are generated from new methods
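The test payload above contains a single alert. To exercise the bulk path, a payload with several alerts can be generated with a small script such as the sketch below. The helper name `build_bulk_payload` and the `alertname` label are assumptions added for illustration: the label is only there to make the alerts differ from one another, and how GitLab decides that two alerts are distinct is not described here.

```ruby
require "json"
require "time"

# Hypothetical helper: builds a Prometheus-style notification with `count` alerts.
# The field names mirror the test payload above; `alertname` is an assumed label
# used only to make each alert different.
def build_bulk_payload(count, starts_at: Time.utc(2022, 8, 30, 11, 22, 40))
  alerts = (1..count).map do |i|
    {
      "startsAt" => starts_at.iso8601,
      "generatorURL" => "http://host?g0.expr=up",
      "endsAt" => nil,
      "status" => "firing",
      "labels" => {
        "gitlab_environment_name" => "production",
        "alertname" => "bulk-test-#{i}"
      },
      "annotations" => {}
    }
  end

  {
    "version" => "4",
    "groupKey" => nil,
    "status" => "firing",
    "receiver" => "",
    "groupLabels" => {},
    "commonLabels" => {},
    "commonAnnotations" => {},
    "externalURL" => "",
    "alerts" => alerts
  }
end

# Print JSON that can be pasted into the test alert form.
puts JSON.pretty_generate(build_bulk_payload(3))
```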
Send recovery alerts
  1. Go back to Settings > Monitor > Alerts to send another test alert, and include the appropriate resolving attributes (end_time for HTTP integrations)
    {
      "version" : "4",
      "groupKey": null,
      "status": "resolved",
      "receiver": "",
      "groupLabels": {},
      "commonLabels": {},
      "commonAnnotations": {},
      "externalURL": "",
      "alerts": [{
        "startsAt": "2022-08-30T11:22:40Z",
        "generatorURL": "http://host?g0.expr=up",
        "endsAt": "2022-08-30T18:22:40Z",
        "status": "resolved",
        "labels": {
          "gitlab_environment_name": "production"
        },
        "annotations": {}
      }]
    }
  2. Visit Monitor > Alerts to view your original alert. It should be resolved and have new system notes under the "Activity" tab
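The recovery payload differs from the firing one only in its `status` fields and `endsAt`. A sketch of deriving one from the other is shown below, so the remaining fields stay identical between the two requests. The helper name `resolve_payload` is hypothetical, and the idea that keeping the other fields unchanged is what ties the recovery to the original alert is an inference from the two example payloads, not something this MR states.

```ruby
require "json"

# Hypothetical helper: returns a resolved copy of a firing Prometheus-style payload.
# Only `status` and `endsAt` change; the input hash is left untouched.
def resolve_payload(firing, ends_at)
  resolved = JSON.parse(JSON.generate(firing)) # deep copy via a JSON round trip
  resolved["status"] = "resolved"
  resolved["alerts"].each do |alert|
    alert["status"] = "resolved"
    alert["endsAt"] = ends_at
  end
  resolved
end

firing = {
  "version" => "4",
  "status" => "firing",
  "alerts" => [{
    "startsAt" => "2022-08-30T11:22:40Z",
    "generatorURL" => "http://host?g0.expr=up",
    "endsAt" => nil,
    "status" => "firing",
    "labels" => { "gitlab_environment_name" => "production" },
    "annotations" => {}
  }]
}

puts JSON.pretty_generate(resolve_payload(firing, "2022-08-30T18:22:40Z"))
```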

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Sarah Yasonik
