Post-deployment monitoring (continuous verification) MVC

Problem to Solve

Continuous deployment should be easy and boring. One thing that makes it more comfortable is monitoring that measures service-level objectives (SLOs) and the impact of an individual deploy on those SLOs. When doing an automatic incremental deploy (https://gitlab.com/gitlab-org/gitlab-ee/issues/1660) or canary deploy (https://gitlab.com/gitlab-org/gitlab-ee/issues/1659), we should be able to use these measurements to automatically halt a deploy and even revert/roll back.

Use Cases

Scenario: An incremental rollout notices that the error rate exceeds the SLO of 0.1%, aborts the rollout at 1%, and reverts to the last-known-good version.
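
To make the scenario concrete, here is a minimal sketch of the control loop it implies. It assumes "1%" refers to the share of traffic on the new version, and every callable passed in (`set_traffic_percent`, `get_error_rate_percent`, `rollback`) is a hypothetical hook, not an existing GitLab API.

```python
from typing import Callable, Sequence

SLO_ERROR_RATE_PERCENT = 0.1  # SLO taken from the scenario


def run_incremental_rollout(
    steps: Sequence[int],
    set_traffic_percent: Callable[[int], None],
    get_error_rate_percent: Callable[[], float],
    rollback: Callable[[], None],
    slo_percent: float = SLO_ERROR_RATE_PERCENT,
) -> str:
    """Walk through the rollout steps; abort and roll back on an SLO breach.

    All callables are hypothetical hooks supplied by the caller.
    """
    for percent in steps:
        set_traffic_percent(percent)
        # Check the SLO after each step before widening the rollout further.
        if get_error_rate_percent() > slo_percent:
            rollback()  # revert to the last-known-good version
            return f"aborted at {percent}%"
    return "completed"
```

With steps of 1, 10, 50, and 100 percent and an observed error rate of 0.5%, this sketch would stop after the first step and call `rollback` once.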

Proposal

We will use the pre-existing defined error rates (https://docs.gitlab.com/ee/user/project/integrations/prometheus_library/nginx_ingress.html#metrics-supported):

Name	Query
Throughput (req/sec)	sum(label_replace(rate(nginx_ingress_controller_requests{namespace="%{kube_namespace}",ingress=~".*%{ci_environment_slug}.*"}[2m]), "status_code", "${1}xx", "status", "(.)..")) by (status_code)
Latency (ms)	sum(rate(nginx_ingress_controller_ingress_upstream_latency_seconds_sum{namespace="%{kube_namespace}",ingress=~".*%{ci_environment_slug}.*"}[2m])) / sum(rate(nginx_ingress_controller_ingress_upstream_latency_seconds_count{namespace="%{kube_namespace}",ingress=~".*%{ci_environment_slug}.*"}[2m])) * 1000
HTTP Error Rate (%)	sum(rate(nginx_ingress_controller_requests{status=~"5.*",namespace="%{kube_namespace}",ingress=~".*%{ci_environment_slug}.*"}[2m])) / sum(rate(nginx_ingress_controller_requests{namespace="%{kube_namespace}",ingress=~".*%{ci_environment_slug}.*"}[2m])) * 100

For the POC we will use HTTP Error Rate (%).
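
As an illustration of how the HTTP Error Rate (%) query could be prepared for a specific environment, the sketch below fills in the `%{kube_namespace}` and `%{ci_environment_slug}` placeholders. The template is written with the `.*` wildcards and the `* 100` factor on the assumption that it matches the linked documentation; `render_query` is a hypothetical helper, not part of GitLab.

```python
import re
from typing import Mapping

# HTTP Error Rate (%) query, assuming ".*" wildcards and "* 100" as in the linked docs.
HTTP_ERROR_RATE_TEMPLATE = (
    'sum(rate(nginx_ingress_controller_requests{status=~"5.*",'
    'namespace="%{kube_namespace}",ingress=~".*%{ci_environment_slug}.*"}[2m])) / '
    'sum(rate(nginx_ingress_controller_requests{'
    'namespace="%{kube_namespace}",ingress=~".*%{ci_environment_slug}.*"}[2m])) * 100'
)


def render_query(template: str, variables: Mapping[str, str]) -> str:
    """Replace each %{name} placeholder with its value (hypothetical helper).

    Raises KeyError if the template uses a variable that was not supplied.
    """
    return re.sub(r"%\{(\w+)\}", lambda m: variables[m.group(1)], template)


query = render_query(
    HTTP_ERROR_RATE_TEMPLATE,
    {"kube_namespace": "my-namespace", "ci_environment_slug": "production"},
)
```

The namespace and slug values here are made-up examples; in practice they would come from the deployment's environment.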

Using the existing Prometheus API, we will query the current error rate.
If the error rate exceeds the defined threshold, we will stop the deployment (we need to check whether we can leverage the existing trigger, similar to the incident response issue creation).
If the rollout was stopped because the threshold was exceeded, the deploy board should show the notification: "Rollout stopped due to high error rate".
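
The three steps above can be sketched roughly as follows. This is only an illustration: it calls the Prometheus HTTP API instant-query endpoint (`/api/v1/query`) directly rather than going through GitLab's integration, the 0.1% threshold is borrowed from the use case, and treating "no data" or `NaN` (for example, no traffic in the window) as "do not stop" is an assumption that the MVC would need to decide on. All function names are hypothetical.

```python
import json
import math
import urllib.parse
import urllib.request
from typing import Optional

ERROR_RATE_THRESHOLD_PERCENT = 0.1  # assumed threshold, borrowed from the use case
STOP_MESSAGE = "Rollout stopped due to high error rate"


def fetch_instant_query(base_url: str, query: str) -> dict:
    """Run an instant query against the Prometheus HTTP API."""
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": query})
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.load(response)


def extract_error_rate(payload: dict) -> Optional[float]:
    """Return the first sample as a float, or None if there is no usable data."""
    if payload.get("status") != "success":
        return None
    result = payload.get("data", {}).get("result", [])
    if not result:
        return None
    value = float(result[0]["value"][1])  # vector samples are [timestamp, "value"]
    return None if math.isnan(value) else value


def should_stop_rollout(
    error_rate: Optional[float],
    threshold: float = ERROR_RATE_THRESHOLD_PERCENT,
) -> bool:
    """Stop only when a real measurement exceeds the threshold (assumption)."""
    return error_rate is not None and error_rate > threshold


def notification_for(error_rate: Optional[float]) -> Optional[str]:
    """Return the deploy board message when the rollout should stop."""
    return STOP_MESSAGE if should_stop_rollout(error_rate) else None
```

A caller would pass the result of `fetch_instant_query` to `extract_error_rate` and act on `should_stop_rollout`; keeping the parsing separate from the network call makes the decision logic easy to test without a running Prometheus server.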

We will present the error rate only on the environment page (deploy board), and we will keep this very minimal; UX TBD. As for notifications, for the MVC we will use the issue creation/email notifications that exist for incident response. TODOs/assignments will not be part of the MVC.

Future UX

Links / references

Further details

Prior art: https://harness.io/harness-continuous-delivery/secret-sauce/continuous-verification/

Edited Apr 19, 2020 by Orit Golowinski