Show alerts in environment index page

Problem to solve

As part of #8295 (comment 298418198), we want to stop a deployment when the alert manager raises an alert (see &2877 (closed) for what an "Alert" is). A good first step is to notify users that such an event happened, even before we stop anything.

Intended users

Further details

In case there is a degradation in performance or quality, we will notify the user on the environment index page (deploy board) so that they will know something is wrong and can take action.

Using the existing Prometheus API, we will query the current error rates and check them against their thresholds.
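As a rough sketch of what such a check could look like: the snippet below expands the `%{...}` placeholders in the HTTP Error Rate query from the proposal, builds a URL for the Prometheus HTTP API instant-query endpoint (`/api/v1/query`), and compares a returned sample to a threshold. The base URL, namespace, slug, and helper names are hypothetical, the `.*` wildcards in the matchers are assumed, and the 0.1 threshold is borrowed from the mockup text. No request is sent; the response is a hand-written sample in the instant-query format.

```python
from urllib.parse import urlencode

# HTTP Error Rate (%) query from the proposal; the ".*" wildcards are an assumption.
HTTP_ERROR_RATE = (
    'sum(rate(nginx_ingress_controller_requests{status=~"5.*",'
    'namespace="%{kube_namespace}",ingress=~".*%{ci_environment_slug}.*"}[2m])) / '
    'sum(rate(nginx_ingress_controller_requests{'
    'namespace="%{kube_namespace}",ingress=~".*%{ci_environment_slug}.*"}[2m])) * 100'
)


def expand_query(template, variables):
    """Replace each %{name} placeholder with its value (hypothetical helper)."""
    query = template
    for name, value in variables.items():
        query = query.replace("%{" + name + "}", value)
    return query


def instant_query_url(base_url, query):
    """Build a URL for the Prometheus instant-query endpoint."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': query})}"


def exceeds_threshold(response, threshold):
    """Return True if any sample in an instant-query response is above the threshold."""
    results = response.get("data", {}).get("result", [])
    # Each vector sample carries "value": [timestamp, "<number as string>"].
    return any(float(sample["value"][1]) > threshold for sample in results)


# Hypothetical namespace, slug, and Prometheus address.
query = expand_query(
    HTTP_ERROR_RATE,
    {"kube_namespace": "my-app-production", "ci_environment_slug": "production"},
)
url = instant_query_url("http://prometheus.example.com", query)

# Hand-written sample response; a real one would come from an HTTP GET on `url`.
sample_response = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [{"metric": {}, "value": [1589000000, "0.25"]}],
    },
}
print(exceeds_threshold(sample_response, threshold=0.1))
```

Keeping query expansion, URL building, and the threshold check as separate functions makes each piece testable without a running Prometheus server.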

We already associate Environments with Alerts in a 1:N relation. This means we can show a list of alerts for a specific environment, or show only the latest one.

For more information on what the devopsmonitor team is planning for an upcoming milestone, see &2877 (closed):

Screenshot

Proposal

  • We will display the latest alert (already supported in &2877 (closed)) in case a threshold is crossed for the environment on the environment list/deploy board.
    • This will only be done for primary environments (for example, not for grouped review environments).
    • Only one alert will be visible at a time
      • The alert which will be shown is the latest one unless there is a critical alert that is persisting.
    • Alerts in the environment page/deploy board should be dismissed automatically if a corresponding metric returns to normal and doesn't exceed a threshold. If the alert has already ended, it should not appear.
    • The payload of the alert will include [Alert severity icon] [Alert severity title] - [when alert started] [alert condition] [metric name] - [Error rate]. [View details]
| Name | Query |
| --- | --- |
| Throughput (req/sec) | sum(label_replace(rate(nginx_ingress_controller_requests{namespace="%{kube_namespace}",ingress=~".*%{ci_environment_slug}.*"}[2m]), "status_code", "${1}xx", "status", "(.)..")) by (status_code) |
| Latency (ms) | sum(rate(nginx_ingress_controller_ingress_upstream_latency_seconds_sum{namespace="%{kube_namespace}",ingress=~".*%{ci_environment_slug}.*"}[2m])) / sum(rate(nginx_ingress_controller_ingress_upstream_latency_seconds_count{namespace="%{kube_namespace}",ingress=~".*%{ci_environment_slug}.*"}[2m])) * 1000 |
| HTTP Error Rate (%) | sum(rate(nginx_ingress_controller_requests{status=~"5.*",namespace="%{kube_namespace}",ingress=~".*%{ci_environment_slug}.*"}[2m])) / sum(rate(nginx_ingress_controller_requests{namespace="%{kube_namespace}",ingress=~".*%{ci_environment_slug}.*"}[2m])) * 100 |
  • Introduce the error below the environment or pod information (in case the deploy board is active), similar to merge request widgets. frontend backend
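The display rules in the proposal can be sketched roughly as follows. This is one possible reading, not the planned implementation: the `Alert` fields, the severity names, and both function names are hypothetical, and "a critical alert that is persisting" is interpreted as "any critical alert that has not ended". The rendered text follows the wording of the mockup rather than the full payload format.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class Alert:
    severity: str                        # assumed names, e.g. "critical", "warning"
    metric: str                          # e.g. "HTTP error rate"
    condition: str                       # e.g. "exceeded"
    threshold: str                       # e.g. "0.1%"
    started_at: datetime
    ended_at: Optional[datetime] = None  # set once the metric is back to normal


def alert_to_display(alerts: List[Alert], is_primary_environment: bool) -> Optional[Alert]:
    """Pick the single alert to show for an environment, or None."""
    if not is_primary_environment:
        return None  # e.g. grouped review environments show no alert
    active = [a for a in alerts if a.ended_at is None]  # ended alerts never appear
    if not active:
        return None
    critical = [a for a in active if a.severity == "critical"]
    # A persisting critical alert wins over newer, less severe alerts.
    candidates = critical or active
    return max(candidates, key=lambda a: a.started_at)


def format_alert(alert: Alert) -> str:
    """Render the text part of the row, following the mockup wording."""
    return f"{alert.severity.capitalize()} - {alert.metric} {alert.condition} {alert.threshold}."


now = datetime(2020, 5, 1, 12, 0, tzinfo=timezone.utc)
alerts = [
    Alert("critical", "HTTP error rate", "exceeded", "0.1%",
          started_at=now - timedelta(minutes=30)),
    Alert("warning", "Latency", "exceeded", "500ms",
          started_at=now - timedelta(minutes=5)),
    Alert("critical", "Throughput", "dropped below", "10 req/sec",
          started_at=now - timedelta(hours=2), ended_at=now - timedelta(hours=1)),
]
shown = alert_to_display(alerts, is_primary_environment=True)
print(format_alert(shown))  # Critical - HTTP error rate exceeded 0.1%.
```

Filtering ended alerts before choosing one means an alert disappears as soon as its `ended_at` is set, which matches the automatic dismissal described above.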
Mockup (browser made)
image
Code I injected to create the mockup above:
<div>
  <div class="mr-widget-extension d-flex align-items-center pl-3" style="vertical-align: middle; padding-top: 5px; padding-bottom: 5px;">
    <!-- Hexagon severity icon, filled dark red for a critical alert -->
    <svg xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 12 12" style="margin-right: 8px;">
      <path fill-rule="evenodd" d="M6.70565033,0.184992446 L10.7943497,2.49459124 C11.2310076,2.74124783 11.5,3.19708802 11.5,3.69040121 L11.5,8.30959879 C11.5,8.80291198 11.2310076,9.25875217 10.7943497,9.50540876 L6.70565033,11.8150076 C6.26899239,12.0616641 5.73100761,12.0616641 5.29434967,11.8150076 L1.20565033,9.50540876 C0.768992386,9.25875217 0.5,8.80291198 0.5,8.30959879 L0.5,3.69040121 C0.5,3.19708802 0.768992386,2.74124783 1.20565033,2.49459124 L5.29434967,0.184992446 C5.73100761,-0.0616641488 6.26899239,-0.0616641488 6.70565033,0.184992446 Z" style="fill: #8c210d;"></path>
    </svg>
    <!-- Alert text and link to the alert details -->
    <span style="margin-right: 4px;">Critical - HTTP error rate exceeded 0.1%.</span>
    <button type="button" class="btn btn-link btn-md">View details</button>
  </div>
</div>

Permissions and Security

Documentation

Availability & Testing

What does success look like, and how can we measure that?

What is the type of buyer?

Is this a cross-stage feature?

Links / references

Scoped off

Edited by Dimitrie Hoekstra