[CI] Review environments periodically failing due to resource exhaustion
Summary
We're noticing that CI review jobs periodically fail, reporting various problems:
- 500 bad gateway
- Error: UPGRADE FAILED: pre-upgrade hooks failed: timed out waiting for the condition
The culprit is usually resource exhaustion, and you can see pods related to the release are stuck Pending.
The typical workaround is to:
- Run
helm ls - Find oldest releases.
- Run
helm get values <release name> | yq .ci.pipeline.url - Visit the URL to see if the environment can be stopped (MR merged, docs-only MR, etc.)
- If it can be stopped, run
helm delete <release name>or run thestop_reviewjobs from the MR pipeline.
Some considerations:
- I see lots of
triggerenvironments, which were created from CNG MR pipelines that are not automatically cleaned up in many cases. - Some MRs change only docs and don't need full CI review environments, but the author forgot the
docs-prefix /-docssuffix. - Maybe we consider automatically uninstalling successful review environments after ~1 hour. I personally don't often need to jump in and look at successful CI environments nearly as often as I need to see failed review environments. This would clean up quite a bit of resource usage.
Edited by Mitchell Nielsen