No error shown when alertmanager alert config cannot be applied to Prometheus

Summary

When an alert creation/modification fails, no error is shown on the UI.

Prometheus rejects invalid configurations: if an alert query is invalid, Prometheus rejects the new configuration (which contains the invalid query) and reverts to the old configuration. If this happens when a user sets an alert from the metrics dashboard, we do not show any error on the UI.

Demo of the bug (with sound):

alertmanager_update_fail2 (video attachment)

Steps to reproduce

  1. Apply the following patch to app/services/clusters/applications/prometheus_config_service.rb in your local GDK.

    This patch reintroduces buggy code that was fixed in !33772 (merged). The purpose of applying the buggy code is to make variable substitution in alert queries fail. If variables are not substituted, the resulting alert queries are invalid, and Prometheus will reject them during the next configuration update.

    patch -Nu app/services/clusters/applications/prometheus_config_service.rb -i the_patch.patch

    --- app/services/clusters/applications/prometheus_config_service.rb.upd 2020-06-04 16:05:41.240830892 +0530
    +++ app/services/clusters/applications/prometheus_config_service.rb     2020-06-04 16:15:42.715071626 +0530
    @@ -132,21 +132,19 @@
          end
    
          def alerts(environment)
    +        variables = Gitlab::Prometheus::QueryVariables.call(environment)
             alerts = Projects::Prometheus::AlertsFinder
             .new(environment: environment)
             .execute
    
             alerts.map do |alert|
    -          hash = alert.to_param
    -          hash['expr'] = substitute_query_variables(hash['expr'], environment)
    -          hash
    +          substitute_query_variables(alert.to_param, variables)
             end
          end
    
    -      def substitute_query_variables(query, environment)
    -        result = ::Prometheus::ProxyVariableSubstitutionService.new(environment, query: query).execute
    -
    -        result[:params][:query]
    +      def substitute_query_variables(hash, variables)
    +        hash['expr'] %= variables
    +        hash
          end
    
          def environments
    
  2. Create a project (you can fork https://gitlab.com/joshlambert/autodevops-deploy), add a K8s cluster to it, and install the Helm and Prometheus managed apps.

  3. On the metrics dashboard page (Operations > Metrics), create a new alert on the "Memory Usage (Total)" chart (or any other chart whose metric query contains a variable), or modify an existing alert on that chart.

  4. In a Rails console, you can observe the Clusters::Applications::Prometheus object's status change to 4 (updating) and then to 5 (updated).

  5. Connect to the Prometheus UI by following the steps described here: https://docs.gitlab.com/ee/development/prometheus.html#access-the-ui-of-a-prometheus-managed-application-in-kubernetes.

  6. Open the Prometheus UI and go to the "Alerts" tab.

  7. If you created a new alert on the metrics dashboard page, the alert will not show up on the Prometheus UI. If you modified an existing alert, the Prometheus UI will continue to show the old alert config.
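The following sketch illustrates why the patched `substitute_query_variables` can leave queries invalid. It is illustrative code, not GitLab code: the key assumption (based on the parse error in the logs below) is that some dashboard queries use a `{{name}}` variable style, which `Prometheus::ProxyVariableSubstitutionService` handles but Ruby's `String#%` does not, since `String#%` only understands `%{name}` placeholders.

```ruby
# Illustrative sketch (not GitLab code). Ruby's String#% substitutes
# %{name} placeholders from a hash, but passes {{name}}-style variables
# through unchanged, leaving a query Prometheus cannot parse
# ("unexpected left brace '{'").
vars = { ci_environment_slug: 'production' }

percent_style = 'up{environment="%{ci_environment_slug}"}'
liquid_style  = 'up{environment="{{ci_environment_slug}}"}'

puts percent_style % vars # substituted: up{environment="production"}
puts liquid_style % vars  # unchanged: the {{ci_environment_slug}} placeholder survives
```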


What is the current bug behavior?

No error is shown on the UI when an error occurs while syncing Prometheus alert config.

What is the expected correct behavior?

An error should be shown when the Prometheus alert config cannot be synced.

Relevant logs and/or screenshots

The logs of the prometheus-server-configmap-reload container in the prometheus-prometheus-server-xxxx-xxxx pod contain the following when an alertmanager update fails:

2020/06/23 14:02:14 config map updated
2020/06/23 14:02:14 error: Received response code 500 , expected 200

while the prometheus-server container of the same pod contains:

level=info ts=2020-06-23T14:10:40.264Z caller=main.go:743 msg="Loading configuration file" filename=/etc/config/prometheus.yml
level=info ts=2020-06-23T14:10:40.270Z caller=kubernetes.go:192 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2020-06-23T14:10:40.272Z caller=kubernetes.go:192 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2020-06-23T14:10:40.273Z caller=kubernetes.go:192 component="discovery manager scrape" discovery=k8s msg="Using pod service account via in-cluster config"
level=info ts=2020-06-23T14:10:40.279Z caller=kubernetes.go:192 component="discovery manager notify" discovery=k8s msg="Using pod service account via in-cluster config"
level=error ts=2020-06-23T14:10:40.285Z caller=manager.go:832 component="rule manager" msg="loading groups failed" err="/etc/config/alerts: group \"production.rules\", rule 2, \"HTTP Error Rate\": could not parse expression: parse error at char 55: unexpected left brace '{'"
level=error ts=2020-06-23T14:10:40.285Z caller=main.go:762 msg="Failed to apply configuration" err="error loading rules, previous rule set restored"
level=error ts=2020-06-23T14:10:40.285Z caller=main.go:587 msg="Error reloading config" err="one or more errors occurred while applying the new configuration (--config.file=\"/etc/config/prometheus.yml\")"

Output of checks

This bug happens on GitLab.com


Possible fixes

The Clusters::Applications::Prometheus object never transitions to an errored state. I'm not sure whether it should in this situation, but it is worth noting that we currently do not detect this failure at all.

We currently perform a helm update every time we need to sync new alerts to Prometheus. With a helm update, there is no way to verify that the alerts were successfully synced.

Instead, we can create a separate path for updating Prometheus alerts (without using PrometheusUpdateService), so that we don't have to run a helm update each time. We can update alerts by using KubeClient to directly update the alerts ConfigMap, and then call the Prometheus config reload endpoint (http://localhost:9090/-/reload). This should also reduce the time it takes to sync alerts to Prometheus.
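A minimal sketch of this proposal, with several assumptions: the ConfigMap update through KubeClient is omitted, localhost:9090 assumes a port-forward to the prometheus-server pod, and the /-/reload endpoint requires Prometheus to run with the --web.enable-lifecycle flag. The actual send is commented out so the sketch stays self-contained.

```ruby
require 'net/http'
require 'uri'

# Sketch of the proposed lightweight sync path: after writing the alert
# rules ConfigMap via KubeClient (not shown), POST to Prometheus's reload
# endpoint, then read /api/v1/rules to confirm the new rules loaded,
# instead of running a full helm update.
reload_uri = URI('http://localhost:9090/-/reload')
reload_request = Net::HTTP::Post.new(reload_uri)

rules_uri = URI('http://localhost:9090/api/v1/rules')

# To actually send (omitted here so the sketch has no network dependency):
# Net::HTTP.start(reload_uri.host, reload_uri.port) do |http|
#   response = http.request(reload_request)
#   raise "reload failed: #{response.code}" unless response.code == '200'
# end
```

Because the reload endpoint returns a non-200 code on failure, this path would let us surface a sync error to the UI, which the current helm-based flow cannot do.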

Edited by Reuben Pereira