Post-deployment monitoring (continuous verification) MVC
Problem to Solve
Continuous deployment should be easy and boring. One thing that makes it more comfortable is monitoring that measures service-level objectives (SLOs) and the impact of an individual deploy on those SLOs. When doing an automatic incremental deploy (https://gitlab.com/gitlab-org/gitlab-ee/issues/1660) or canary deploy (https://gitlab.com/gitlab-org/gitlab-ee/issues/1659), we should be able to use these measurements to automatically halt a deploy and even revert/roll back.
Use Cases
Scenario: During an incremental rollout, the system notices that the error rate exceeds the SLO of 0.1%, aborts the rollout at 1%, and reverts to the last-known-good version.
Proposal
- We will use [pre-existing defined error rates](https://docs.gitlab.com/ee/user/project/integrations/prometheus_library/nginx_ingress.html#metrics-supported)
Name | Query |
---|---|
Throughput (req/sec) | `sum(label_replace(rate(nginx_ingress_controller_requests{namespace="%{kube_namespace}",ingress=~".*%{ci_environment_slug}.*"}[2m]), "status_code", "${1}xx", "status", "(.)..")) by (status_code)` |
Latency (ms) | `sum(rate(nginx_ingress_controller_ingress_upstream_latency_seconds_sum{namespace="%{kube_namespace}",ingress=~".*%{ci_environment_slug}.*"}[2m])) / sum(rate(nginx_ingress_controller_ingress_upstream_latency_seconds_count{namespace="%{kube_namespace}",ingress=~".*%{ci_environment_slug}.*"}[2m])) * 1000` |
HTTP Error Rate (%) | `sum(rate(nginx_ingress_controller_requests{status=~"5.*",namespace="%{kube_namespace}",ingress=~".*%{ci_environment_slug}.*"}[2m])) / sum(rate(nginx_ingress_controller_requests{namespace="%{kube_namespace}",ingress=~".*%{ci_environment_slug}.*"}[2m])) * 100` |
For the POC we will use only the HTTP Error Rate (%) metric.
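To make the HTTP Error Rate (%) query concrete, here is a minimal Python sketch of how it could be rendered and read back through the Prometheus HTTP API instant-query endpoint (`/api/v1/query`). It assumes the ingress selectors are meant as `.*` wildcard matches around the environment slug. The helper names (`render_query`, `instant_query_url`, `parse_error_rate`) are hypothetical and only for illustration; they are not existing GitLab code.

```python
import math
from typing import Optional
from urllib.parse import urlencode

# HTTP Error Rate (%) query with GitLab's %{...} placeholders.
# Assumption: the ingress selectors are ".*<slug>.*" wildcard matches.
ERROR_RATE_TEMPLATE = (
    'sum(rate(nginx_ingress_controller_requests{status=~"5.*",'
    'namespace="%{kube_namespace}",ingress=~".*%{ci_environment_slug}.*"}[2m])) / '
    'sum(rate(nginx_ingress_controller_requests{'
    'namespace="%{kube_namespace}",ingress=~".*%{ci_environment_slug}.*"}[2m])) * 100'
)


def render_query(template: str, kube_namespace: str, ci_environment_slug: str) -> str:
    """Substitute the two placeholders with concrete values (hypothetical helper)."""
    return (
        template
        .replace("%{kube_namespace}", kube_namespace)
        .replace("%{ci_environment_slug}", ci_environment_slug)
    )


def instant_query_url(base_url: str, query: str) -> str:
    """Build the URL for a Prometheus instant query (GET /api/v1/query)."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': query})}"


def parse_error_rate(payload: dict) -> Optional[float]:
    """Extract the error rate in percent from an instant-query JSON response.

    Returns None when the query failed, returned no series, or evaluated to
    NaN (for example 0/0 when there is no traffic in the window).
    """
    if payload.get("status") != "success":
        return None
    result = payload.get("data", {}).get("result", [])
    if not result:
        return None
    # An instant-vector sample is [unix_timestamp, "<value as string>"].
    value = float(result[0]["value"][1])
    return None if math.isnan(value) else value
```

For example, `render_query(ERROR_RATE_TEMPLATE, "my-namespace", "production")` yields a query that can be passed to `instant_query_url`; the namespace and slug here are made-up values.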
- Using the existing Prometheus API, we will query the current error rate.
- If the error rate exceeds the defined threshold, we will stop the deployment (we need to check whether we can leverage the existing trigger, similar to the one used for incident response issue creation).
- If the rollout was stopped because the threshold was exceeded, the deploy board should show a notification: "Rollout stopped due to high error rate".
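The check described in the bullets above can be sketched as a small gate that runs after each rollout step. This is an illustrative Python sketch under stated assumptions, not the actual implementation: `run_rollout`, `RolloutResult`, and the callback parameters are hypothetical names, the rollout steps are arbitrary, and reverting to the last-known-good version is deliberately left out.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional

# Notification text proposed for the deploy board.
STOP_MESSAGE = "Rollout stopped due to high error rate"


@dataclass
class RolloutResult:
    completed: bool
    stopped_at_percent: Optional[int]
    message: Optional[str]


def run_rollout(
    steps: Iterable[int],
    deploy_step: Callable[[int], None],
    current_error_rate: Callable[[], Optional[float]],
    threshold_percent: float,
) -> RolloutResult:
    """Deploy each step, then halt if the error rate exceeds the threshold.

    deploy_step and current_error_rate are hypothetical callbacks: the first
    rolls out to the given percentage, the second returns the error rate in
    percent (or None when no data is available).
    """
    for percent in steps:
        deploy_step(percent)
        rate = current_error_rate()
        # Assumption: missing data (None) does not stop the rollout in this
        # sketch; a real gate might prefer to pause instead.
        if rate is not None and rate > threshold_percent:
            return RolloutResult(False, percent, STOP_MESSAGE)
    return RolloutResult(True, None, None)
```

With the scenario from the Use Cases section (threshold 0.1%, first step at 1%), a measured error rate of 0.5% after the first step returns a result with `stopped_at_percent == 1` and the message above, and no later step is deployed.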
We will present the error rate only on the environment page (deploy board), and we will keep this very minimal (UX TBD).

As for notifications, for the MVC we will use the issue creation and email notifications that already exist for incident response. TODOs/assignments will not be part of the MVC.