feat: monitor error ratio in deployment step (!494) · Merge requests · GitLab.com / GitLab Infrastructure Team / platform / Runway / runwayctl

Gregorius Marco requested to merge mg/monitor-deployment into main Jul 05, 2024

feat(deploy): monitor error ratio on 25% step

On the 25% deployment step, perform these:

Sleep for 5 minutes (default) so that Cloud Run request_count metrics become available (they arrive with a 3-minute delay) and we have at least 2 minutes of data points. The sleep duration can be adjusted with the LEGACY_DEPLOY_MONITOR_DELAY_DURATION_S environment variable.
Query the error ratio of the canary and stable revisions and compare the two. Currently, this only warns if the canary's error ratio is elevated. In team#246, this would result in a rollback.
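
The two steps above can be sketched roughly as follows. This is a minimal illustration, not the code in this MR: the function names, the fallback behaviour for invalid values, and the tolerance-based comparison rule are all assumptions made for the example. Only the environment variable name, the 5-minute default, and the warning text come from the description.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"time"
)

// defaultDelay mirrors the 5-minute default described above.
const defaultDelay = 5 * time.Minute

// monitorDelay reads LEGACY_DEPLOY_MONITOR_DELAY_DURATION_S (in seconds).
// Assumption: fall back to the default when the variable is unset or is
// not a positive integer.
func monitorDelay() time.Duration {
	raw := os.Getenv("LEGACY_DEPLOY_MONITOR_DELAY_DURATION_S")
	secs, err := strconv.Atoi(raw)
	if err != nil || secs <= 0 {
		return defaultDelay
	}
	return time.Duration(secs) * time.Second
}

// errorRatio returns errors/total, or 0 when there were no requests.
func errorRatio(errors, total float64) float64 {
	if total == 0 {
		return 0
	}
	return errors / total
}

// canaryElevated is a hypothetical comparison rule: the canary is flagged
// when its error ratio exceeds the stable ratio by more than tolerance.
func canaryElevated(canary, stable, tolerance float64) bool {
	return canary > stable+tolerance
}

func main() {
	// A real step would sleep here; the sketch only reports the duration.
	fmt.Println("would sleep for", monitorDelay())

	canary := errorRatio(120, 120) // every canary request failed
	stable := errorRatio(0, 360)   // no stable request failed
	if canaryElevated(canary, stable, 0.01) {
		fmt.Println("⚠️ Canary version is experiencing an elevated error ratio.")
	}
}
```

The request counts in `main` are made-up inputs standing in for the Cloud Run request_count query results.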

Manual testing

In the example service, change an endpoint to return 500, e.g. https://gitlab.com/marcogreg/example-service/-/commit/b20b8e3db643071ca93bb4be77f05e742fb17c78
Fire requests at the endpoint. At this point it still returns Hello, World!

while true; do curl "https://mg-example-svc-rft902.staging.runway.gitlab.net/hello" ; done

Once the deployment hits 25%, the curl loop above should occasionally return 500. This simulates the new revision erroring on 100% of its requests for the entire 5-minute sleep duration.
The job log should print ⚠️ Canary version is experiencing an elevated error ratio. Example: https://gitlab.com/gitlab-com/gl-infra/platform/runway/deployments/mg-example-svc-rft902/-/jobs/7303938413#L534
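
As a rough sanity check on what the curl loop should show during the 25% step, the expected share of 500 responses can be estimated by weighting each revision's error ratio by its traffic share. The helper below is hypothetical and assumes requests are routed in proportion to the configured traffic split, with the stable revision returning no errors.

```go
package main

import "fmt"

// observedErrorRatio blends per-revision error ratios by traffic weight.
// canaryWeight is the fraction of traffic sent to the canary (0..1).
func observedErrorRatio(canaryWeight, canaryErr, stableErr float64) float64 {
	return canaryWeight*canaryErr + (1-canaryWeight)*stableErr
}

func main() {
	// 25% of traffic goes to a canary that always fails; stable never fails.
	ratio := observedErrorRatio(0.25, 1.0, 0.0)
	fmt.Printf("expected share of 500s: %.0f%%\n", ratio*100)
}
```

Under these assumptions roughly one in four requests from the loop would return 500, which matches the "occasionally" seen in manual testing.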

Edited Jul 10, 2024 by Gregorius Marco