Post-deployment monitoring (continuous verification) MVC

Problem to Solve

Continuous deployment should be easy and boring. One thing that makes it more comfortable is monitoring that measures service-level objectives (SLOs) and the impact of an individual deploy on those SLOs. When doing an automatic incremental deploy (https://gitlab.com/gitlab-org/gitlab-ee/issues/1660) or canary deploy (https://gitlab.com/gitlab-org/gitlab-ee/issues/1659), we should be able to use these measurements to automatically halt a deploy and even revert or roll back.

Use Cases

Scenario: During an incremental rollout, the system notices that the error rate exceeds the SLO of 0.1%, aborts the rollout at 1%, and reverts to the last-known-good version.

Proposal

Name: Throughput (req/sec)
Query: sum(label_replace(rate(nginx_ingress_controller_requests{namespace="%{kube_namespace}",ingress=~".*%{ci_environment_slug}.*"}[2m]), "status_code", "${1}xx", "status", "(.)..")) by (status_code)

Name: Latency (ms)
Query: sum(rate(nginx_ingress_controller_ingress_upstream_latency_seconds_sum{namespace="%{kube_namespace}",ingress=~".*%{ci_environment_slug}.*"}[2m])) / sum(rate(nginx_ingress_controller_ingress_upstream_latency_seconds_count{namespace="%{kube_namespace}",ingress=~".*%{ci_environment_slug}.*"}[2m])) * 1000

Name: HTTP Error Rate (%)
Query: sum(rate(nginx_ingress_controller_requests{status=~"5.*",namespace="%{kube_namespace}",ingress=~".*%{ci_environment_slug}.*"}[2m])) / sum(rate(nginx_ingress_controller_requests{namespace="%{kube_namespace}",ingress=~".*%{ci_environment_slug}.*"}[2m])) * 100
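
The queries contain `%{kube_namespace}` and `%{ci_environment_slug}` placeholders that have to be filled in before the query is sent to Prometheus. As a minimal sketch of that step, the Python below expands the placeholders in the HTTP Error Rate (%) query, written here with `.*` wildcards in the regex matchers. The function name `expand_query` and the sample values `demo-ns` and `production` are hypothetical and only serve the illustration; this is not how the product necessarily performs the substitution.

```python
import re

# HTTP Error Rate (%) query, with `.*` wildcards in the regex matchers.
ERROR_RATE_TEMPLATE = (
    'sum(rate(nginx_ingress_controller_requests{status=~"5.*",'
    'namespace="%{kube_namespace}",ingress=~".*%{ci_environment_slug}.*"}[2m])) / '
    'sum(rate(nginx_ingress_controller_requests{'
    'namespace="%{kube_namespace}",ingress=~".*%{ci_environment_slug}.*"}[2m])) * 100'
)


def expand_query(template: str, variables: dict[str, str]) -> str:
    """Replace each %{name} placeholder with its value (hypothetical helper).

    Raises KeyError if the template uses a placeholder that has no value,
    so a half-expanded query is never sent to Prometheus.
    """
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"no value for placeholder %{{{name}}}")
        return variables[name]

    return re.sub(r"%\{(\w+)\}", substitute, template)


if __name__ == "__main__":
    # Sample values are assumptions for illustration only.
    print(expand_query(
        ERROR_RATE_TEMPLATE,
        {"kube_namespace": "demo-ns", "ci_environment_slug": "production"},
    ))
```

Failing on an unknown placeholder is a design choice of this sketch: a query with a literal `%{...}` left in it would match nothing and silently report no errors.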

For the POC we will use the HTTP Error Rate (%) metric.

  • Using the existing Prometheus API, we will query the current error rate
  • If the error rate exceeds the defined threshold, we will stop the deployment (we need to check whether we can leverage the existing trigger, similar to the incident response issue creation)
  • If the rollout was stopped because the threshold was exceeded, the deploy board should show a notification: "Rollout stopped due to high error rate"
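
The check described in the bullets above can be sketched in Python as follows. It assumes the standard Prometheus HTTP API instant-query endpoint (`/api/v1/query`) and its JSON response shape. The function names, the base URL, and the choice to treat "no data" or NaN as a 0% error rate are assumptions of this sketch, not decisions made in this proposal.

```python
import json
import math
import urllib.parse
import urllib.request

ROLLOUT_STOPPED_MESSAGE = "Rollout stopped due to high error rate"


def build_query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    params = urllib.parse.urlencode({"query": promql})
    return base_url.rstrip("/") + "/api/v1/query?" + params


def parse_error_rate(payload: dict) -> float:
    """Extract the error rate (%) from an instant-query JSON response."""
    if payload.get("status") != "success":
        raise ValueError(f"Prometheus query failed: {payload.get('error', 'unknown error')}")
    result = payload["data"]["result"]
    if not result:
        # Assumption: no samples (e.g. no traffic yet) counts as 0% errors.
        return 0.0
    # A vector sample looks like {"metric": {...}, "value": [timestamp, "value"]}.
    value = float(result[0]["value"][1])
    # Assumption: NaN (0 requests / 0 requests) also counts as 0% errors.
    return 0.0 if math.isnan(value) else value


def should_stop_rollout(error_rate_percent: float, threshold_percent: float) -> bool:
    """Stop only when the measured rate is strictly above the threshold."""
    return error_rate_percent > threshold_percent


def fetch_error_rate(base_url: str, promql: str, timeout: float = 10.0) -> float:
    """Query Prometheus over HTTP and return the current error rate (%)."""
    url = build_query_url(base_url, promql)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_error_rate(json.load(response))
```

With the 0.1% SLO from the use case, a caller would run something like `if should_stop_rollout(fetch_error_rate(base_url, query), 0.1): ...` and surface `ROLLOUT_STOPPED_MESSAGE` on the deploy board. How the rollout is actually halted is left open here, as it is in the proposal.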

We will present the error rate only on the environment page (deploy board), and we will keep this very minimal (UX TBD).

As for notifications, the MVC will use the issue creation and email notifications that already exist for incident response. TODOs and assignments will not be part of the MVC.

Future UX

Links / references

Further details

Edited by Orit Golowinski