  • #214634
Closed
Open
Created Apr 16, 2020 by Orit Golowinski (@ogolowinski, Developer)

Show alerts in environment index page

Problem to solve

As part of #8295 (comment 298418198), we want to stop a deployment when an alert is raised by the alert manager (see &2877 (closed) for what an "Alert" is). A good first step would be to notify users that such an event happened, even before stopping anything.

Intended users

  • Devon (DevOps Engineer)
  • Allison (Application Ops)
  • Rachel (Release Manager)

Further details

In case there is a degradation in performance or quality, we will notify the user on the environment index page (deploy board) so that they will know something is wrong and can take action.

Using the existing Prometheus API, we will query the current error rates and compare them against their thresholds.
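As a rough illustration of what such a query could look like, the sketch below fills in the `%{kube_namespace}` and `%{ci_environment_slug}` placeholders of the HTTP Error Rate (%) query listed in the Proposal and builds an instant-query URL for the Prometheus HTTP API (`/api/v1/query`). The helper names, the example namespace, and the base URL are hypothetical, and the `.*` wildcards in the template are an assumption about the intended pattern. This is not GitLab's actual implementation.

```python
import re
from urllib.parse import urlencode

# HTTP Error Rate (%) query from the Proposal. The ".*" wildcards are an
# assumption about the intended regex; placeholders use the %{name} syntax.
HTTP_ERROR_RATE_TEMPLATE = (
    'sum(rate(nginx_ingress_controller_requests{status=~"5.*",'
    'namespace="%{kube_namespace}",ingress=~".*%{ci_environment_slug}.*"}[2m]))'
    ' / '
    'sum(rate(nginx_ingress_controller_requests{'
    'namespace="%{kube_namespace}",ingress=~".*%{ci_environment_slug}.*"}[2m]))'
    ' * 100'
)


def fill_template(template: str, variables: dict) -> str:
    """Replace every %{name} placeholder with its value (hypothetical helper).

    Raises KeyError if a placeholder has no value in `variables`.
    """
    return re.sub(r"%\{(\w+)\}", lambda m: variables[m.group(1)], template)


def instant_query_url(base_url: str, promql: str) -> str:
    """Build a Prometheus instant-query URL for the given PromQL expression."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


query = fill_template(
    HTTP_ERROR_RATE_TEMPLATE,
    {"kube_namespace": "my-app-123", "ci_environment_slug": "production"},
)
url = instant_query_url("http://prometheus.example.com/", query)
```

The value returned by such a query would then be compared against the alert threshold (for example, 0.1% in the mockup).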

We already associate Environments with Alerts in a 1:N relation. This means we can show a list of alerts for a specific environment, or only show the latest one.
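A minimal sketch of that 1:N relation, assuming hypothetical `Alert` and `Environment` types (the real GitLab models are not shown in this issue), might look like this:

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class Alert:
    title: str
    started_at: datetime


@dataclass
class Environment:
    name: str
    # One environment owns many alerts (the "N" side of the relation).
    alerts: List[Alert] = field(default_factory=list)

    def latest_alert(self) -> Optional[Alert]:
        """Return the most recently started alert, or None if there are none."""
        return max(self.alerts, key=lambda a: a.started_at, default=None)


production = Environment(
    "production",
    [
        Alert("Latency above threshold", datetime(2020, 4, 16, 10, 0)),
        Alert("HTTP error rate above threshold", datetime(2020, 4, 16, 12, 0)),
    ],
)
```

Here `production.alerts` gives the full list for the environment, while `production.latest_alert()` gives the single alert that could be shown on the index page.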

For more information, see &2877 (closed) for what the devopsmonitor team is planning in an upcoming milestone:

Screenshot

Proposal

  • We will display the latest alert (already supported in &2877 (closed)) in case a threshold is crossed for the environment on the environment list/deploy board.
    • This will only be done for primary environments (no grouped review environments for example)
    • Only one alert will be visible at a time
      • The alert which will be shown is the latest one unless there is a critical alert that is persisting.
    • Alerts in the environment page/deploy board should be dismissed automatically if a corresponding metric returns to normal and doesn't exceed a threshold. If the alert has already ended, it should not appear.
    • The payload of the alert will include [Alert severity icon] [Alert severity title] - [when alert started] [alert condition] [metric name] - [Error rate]. [View details]
      • [Alert severity icon], [Alert severity title], [when alert started], [alert condition], and [metric name] are pulled from the alerts API
      • [View details] links to the metrics page with the correct environment selected solving #214927 (closed)
      • [Error rate] will use the pre-existing defined error rates (https://docs.gitlab.com/ee/user/project/integrations/prometheus_library/nginx_ingress.html#metrics-supported)
Name | Query
Throughput (req/sec) | sum(label_replace(rate(nginx_ingress_controller_requests{namespace="%{kube_namespace}",ingress=~".*%{ci_environment_slug}.*"}[2m]), "status_code", "${1}xx", "status", "(.)..")) by (status_code)
Latency (ms) | sum(rate(nginx_ingress_controller_ingress_upstream_latency_seconds_sum{namespace="%{kube_namespace}",ingress=~".*%{ci_environment_slug}.*"}[2m])) / sum(rate(nginx_ingress_controller_ingress_upstream_latency_seconds_count{namespace="%{kube_namespace}",ingress=~".*%{ci_environment_slug}.*"}[2m])) * 1000
HTTP Error Rate (%) | sum(rate(nginx_ingress_controller_requests{status=~"5.*",namespace="%{kube_namespace}",ingress=~".*%{ci_environment_slug}.*"}[2m])) / sum(rate(nginx_ingress_controller_requests{namespace="%{kube_namespace}",ingress=~".*%{ci_environment_slug}.*"}[2m])) * 100
  • Introduce the error below the environment or pod information (in case the deploy board is active), similar to merge request widgets (frontend, backend).
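The selection rules in the proposal above can be sketched roughly as follows. This is only an illustration under stated assumptions: the `EnvAlert` type, its field names, and the severity strings are hypothetical, and "unless there is a critical alert that is persisting" is interpreted here as "a still-active critical alert takes precedence over a newer non-critical one".

```python
from dataclasses import dataclass
from datetime import datetime
from typing import List, Optional


@dataclass
class EnvAlert:
    severity: str                          # assumed values, e.g. "critical", "warning"
    metric_name: str                       # e.g. "HTTP error rate"
    condition: str                         # e.g. "exceeded 0.1%"
    started_at: datetime
    ended_at: Optional[datetime] = None    # None while the alert is still firing


def alert_to_display(alerts: List[EnvAlert]) -> Optional[EnvAlert]:
    """Pick the single alert to show for an environment, if any."""
    # Alerts that have already ended should not appear at all.
    active = [a for a in alerts if a.ended_at is None]
    if not active:
        return None
    # A persisting critical alert wins; otherwise fall back to all active alerts.
    critical = [a for a in active if a.severity == "critical"]
    candidates = critical or active
    # Among the candidates, show the most recently started one.
    return max(candidates, key=lambda a: a.started_at)


def banner_text(alert: EnvAlert) -> str:
    """Format a short banner line, loosely following the mockup wording."""
    return f"{alert.severity.capitalize()} - {alert.metric_name} {alert.condition}."


alerts = [
    EnvAlert("critical", "HTTP error rate", "exceeded 0.1%", datetime(2020, 8, 1, 12, 0)),
    EnvAlert("warning", "Latency", "exceeded 500 ms", datetime(2020, 8, 1, 13, 0)),
    EnvAlert("warning", "Throughput", "dropped below 10 req/sec",
             datetime(2020, 8, 1, 14, 0), ended_at=datetime(2020, 8, 1, 15, 0)),
]
chosen = alert_to_display(alerts)
```

With this sample data the critical alert is chosen even though a newer warning is active, and the ended throughput alert is never shown.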
Mockup (browser made)
image
Code I injected to create the mockup above:
<div>
  <div class="mr-widget-extension d-flex align-items-center pl-3" style="vertical-align: middle; padding-top: 5px; padding-bottom: 5px;">
    <!-- Severity icon (hexagon), filled with the critical color -->
    <svg xmlns="http://www.w3.org/2000/svg" width="16" height="16" viewBox="0 0 12 12" style="margin-right: 8px;">
      <path fill-rule="evenodd" d="M6.70565033,0.184992446 L10.7943497,2.49459124 C11.2310076,2.74124783 11.5,3.19708802 11.5,3.69040121 L11.5,8.30959879 C11.5,8.80291198 11.2310076,9.25875217 10.7943497,9.50540876 L6.70565033,11.8150076 C6.26899239,12.0616641 5.73100761,12.0616641 5.29434967,11.8150076 L1.20565033,9.50540876 C0.768992386,9.25875217 0.5,8.80291198 0.5,8.30959879 L0.5,3.69040121 C0.5,3.19708802 0.768992386,2.74124783 1.20565033,2.49459124 L5.29434967,0.184992446 C5.73100761,-0.0616641488 6.26899239,-0.0616641488 6.70565033,0.184992446 Z" style="fill: #8c210d;"></path>
    </svg>
    <!-- Alert message followed by the link to the metrics page -->
    <span style="margin-right: 4px;">Critical - HTTP error rate exceeded 0.1%.</span>
    <button type="button" class="btn btn-link btn-md">View details</button>
  </div>
</div>

Permissions and Security

Documentation

Availability & Testing

What does success look like, and how can we measure that?

What is the type of buyer?

Is this a cross-stage feature?

Links / references

Scoped off

  • Update environments list page / deploy board to be more legible (#223760)
Edited Aug 26, 2020 by Dimitrie Hoekstra