feat: monitor error ratio in deployment step
feat(deploy): monitor error ratio on 25% step
For team#245
On the 25% deployment step, do the following:
- Sleep for 5 minutes (default) so that Cloud Run `request_count` metrics become available (they arrive with a 3-minute delay) and we have at least 2 minutes of data points. The sleep duration can be adjusted with the `LEGACY_DEPLOY_MONITOR_DELAY_DURATION_S` environment variable.
- Query the error ratio of the canary revision and of the stable revision, then compare the two. For now, an elevated canary error ratio only produces a warning. In team#246, it will result in a rollback.
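
The two steps above can be sketched roughly as follows. This is an illustration, not the code in this MR: the function names, the `5xx` key for request counts grouped by response code class, and the `ASSUMED_MAX_RATIO_DELTA` threshold are all assumptions, since the description does not state how "elevated" is defined. Only the env variable name, the 5-minute default, the 3-minute metric delay, and the two log lines come from the description.

```python
import os

# Default sleep and metric delay, as stated in the description above.
DEFAULT_DELAY_S = 300
METRIC_DELAY_S = 180

# Hypothetical threshold: the MR does not say how "elevated" is decided.
ASSUMED_MAX_RATIO_DELTA = 0.05


def monitor_delay_seconds(env=os.environ):
    """Return the sleep duration before querying metrics, in seconds."""
    return int(env.get("LEGACY_DEPLOY_MONITOR_DELAY_DURATION_S", DEFAULT_DELAY_S))


def usable_data_window_seconds(delay_s, metric_delay_s=METRIC_DELAY_S):
    """Return how many seconds of metrics are visible after the sleep."""
    return max(0, delay_s - metric_delay_s)


def error_ratio(counts_by_class):
    """Compute the 5xx share from request counts keyed by response code class."""
    total = sum(counts_by_class.values())
    if total == 0:
        return 0.0  # no traffic observed, so there is nothing to flag
    return counts_by_class.get("5xx", 0) / total


def check(canary_counts, stable_counts, max_delta=ASSUMED_MAX_RATIO_DELTA):
    """Return the job log line. Warn only; rollback is left to team#246."""
    delta = error_ratio(canary_counts) - error_ratio(stable_counts)
    if delta > max_delta:
        return "⚠️ Canary version is experiencing an elevated error ratio"
    return "✅ Canary version error ratio is healthy"
```

With the default 300-second sleep and a 180-second metric delay, `usable_data_window_seconds(300)` gives 120, which matches the "at least 2 minutes of data points" reasoning.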
Manual testing
Elevated error scenario https://gitlab.com/gitlab-com/gl-infra/platform/runway/deployments/mg-example-svc-rft902/-/jobs/7303938413:
- In the example service, change an endpoint to return 500, e.g. https://gitlab.com/marcogreg/example-service/-/commit/b20b8e3db643071ca93bb4be77f05e742fb17c78
- Fire requests to the endpoint. This still returns `Hello, World!`:
while true; do curl "https://mg-example-svc-rft902.staging.runway.gitlab.net/hello" ; done
- Once the deployment reaches 25%, the curl loop above should occasionally return 500. This simulates a new revision that fails 100% of its requests for the entire 5-minute sleep duration.
- Job log should have printed `⚠️ Canary version is experiencing an elevated error ratio`. Example: https://gitlab.com/gitlab-com/gl-infra/platform/runway/deployments/mg-example-svc-rft902/-/jobs/7303938413#L534
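
As a rough sanity check of this scenario, the expected numbers can be worked out as below. The sketch assumes the 25% step routes exactly 25% of requests to the canary and that the stable revision serves no errors. The function name and the figure of 1000 requests are hypothetical.

```python
def expected_error_shares(total_requests, canary_share, canary_failure_rate):
    """Return (share of 500s seen by the client, canary ratio, stable ratio)."""
    canary_requests = total_requests * canary_share
    canary_errors = canary_requests * canary_failure_rate
    client_500_share = canary_errors / total_requests
    canary_error_ratio = canary_errors / canary_requests
    stable_error_ratio = 0.0  # assumption: the stable revision serves no errors
    return client_500_share, canary_error_ratio, stable_error_ratio


# Hypothetical load: 1000 curl requests during the 25% step,
# with the canary failing every request it receives.
client_share, canary_ratio, stable_ratio = expected_error_shares(1000, 0.25, 1.0)
print(client_share, canary_ratio, stable_ratio)  # prints 0.25 1.0 0.0
```

So the curl loop sees a 500 on roughly one request in four, while the per-revision comparison sees a canary error ratio of 1.0 against a stable ratio of 0.0.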
Healthy scenario https://gitlab.com/gitlab-com/gl-infra/platform/runway/deployments/mg-example-svc-rft902/-/jobs/7303986595:
- Without any 500 traffic, it should print `✅ Canary version error ratio is healthy`. Example: https://gitlab.com/gitlab-com/gl-infra/platform/runway/deployments/mg-example-svc-rft902/-/jobs/7303986595#L529
Edited by Gregorius Marco