feat: monitor error ratio in deployment step

Gregorius Marco requested to merge mg/monitor-deployment into main

feat(deploy): monitor error ratio on 25% step

For team#245

On the 25% deployment step, perform these:

  1. Sleep for 5 minutes (default) so that Cloud Run request_count metrics become available (they arrive with a 3-minute delay) and we have at least 2 minutes of data points. The sleep duration can be adjusted with the LEGACY_DEPLOY_MONITOR_DELAY_DURATION_S env variable.
  2. Query the error ratio of the canary and stable revisions and compare them. Currently, an elevated error ratio only produces a warning. In team#246, this would result in a rollback.
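
The two steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this MR: the function names, the zero-traffic handling, and the 5-percentage-point tolerance are all assumptions made for the example. Only the env variable name, the 5-minute default, and the two log messages come from this description.

```python
import os

DEFAULT_DELAY_S = 300  # 5 minutes, the default stated in this MR


def monitor_delay_seconds(env=os.environ):
    """Return the sleep duration, honoring LEGACY_DEPLOY_MONITOR_DELAY_DURATION_S."""
    raw = env.get("LEGACY_DEPLOY_MONITOR_DELAY_DURATION_S")
    return int(raw) if raw else DEFAULT_DELAY_S


def error_ratio(error_count, request_count):
    """Fraction of failed requests; 0.0 when no traffic was recorded (assumption)."""
    if request_count == 0:
        return 0.0
    return error_count / request_count


def canary_is_elevated(canary_ratio, stable_ratio, tolerance=0.05):
    # ASSUMPTION: the tolerance is hypothetical; the MR does not state
    # how the canary and stable ratios are actually compared.
    return canary_ratio > stable_ratio + tolerance


def verdict(canary_errors, canary_total, stable_errors, stable_total):
    """Return the log line for the 25% step; warn only, no rollback yet."""
    canary = error_ratio(canary_errors, canary_total)
    stable = error_ratio(stable_errors, stable_total)
    if canary_is_elevated(canary, stable):
        return "⚠️ Canary version is experiencing an elevated error ratio"
    return "✅ Canary version error ratio is healthy"
```

A caller would wait for `monitor_delay_seconds()` seconds, fetch the request counts for both revisions, and then log the result of `verdict(...)`.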

Manual testing

Elevated error scenario https://gitlab.com/gitlab-com/gl-infra/platform/runway/deployments/mg-example-svc-rft902/-/jobs/7303938413:

  1. In the example service, change an endpoint to return 500, e.g. https://gitlab.com/marcogreg/example-service/-/commit/b20b8e3db643071ca93bb4be77f05e742fb17c78
  2. Fire requests at the endpoint in a loop. At this point it still returns Hello, World!
     while true; do curl "https://mg-example-svc-rft902.staging.runway.gitlab.net/hello" ; done
  3. Once the deployment reaches the 25% step, the curl loop above should occasionally return 500. This simulates a new revision that fails 100% of its requests for the whole 5-minute sleep duration.
  4. The job log should print ⚠️ Canary version is experiencing an elevated error ratio. Example: https://gitlab.com/gitlab-com/gl-infra/platform/runway/deployments/mg-example-svc-rft902/-/jobs/7303938413#L534
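
As a rough sanity check on how often the curl loop should fail: assuming the 25% step routes about a quarter of requests to the canary (an assumption about how traffic is split) and the stable revision returns no errors, roughly one in four requests should return 500. The helper below is hypothetical and only illustrates the arithmetic.

```python
def expected_error_share(canary_traffic_share, canary_error_ratio, stable_error_ratio=0.0):
    """Overall share of failing requests seen by a client hitting both revisions."""
    stable_traffic_share = 1 - canary_traffic_share
    return (canary_traffic_share * canary_error_ratio
            + stable_traffic_share * stable_error_ratio)


# Canary takes 25% of traffic and fails every request; stable is healthy.
print(expected_error_share(0.25, 1.0))  # 0.25
```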

Healthy scenario https://gitlab.com/gitlab-com/gl-infra/platform/runway/deployments/mg-example-svc-rft902/-/jobs/7303986595:

  1. Without any 500 traffic, the job log should print ✅ Canary version error ratio is healthy. Example: https://gitlab.com/gitlab-com/gl-infra/platform/runway/deployments/mg-example-svc-rft902/-/jobs/7303986595#L529

Edited by Gregorius Marco
