[CI] Review environments periodically failing due to resource exhaustion
Summary
We're noticing that CI review jobs periodically fail, reporting various problems:
- 500 bad gateway
- Error: UPGRADE FAILED: pre-upgrade hooks failed: timed out waiting for the condition
The culprit is usually resource exhaustion, and you can see pods related to the release are stuck Pending
.
The typical workaround is to:
- Run
helm ls
- Find oldest releases.
- Run
helm get values <release name> | yq .ci.pipeline.url
- Visit the URL to see if the environment can be stopped (MR merged, docs-only MR, etc.)
- If it can be stopped, run
helm delete <release name>
or run thestop_review
jobs from the MR pipeline.
Some considerations:
- I see lots of
trigger
environments, which were created from CNG MR pipelines that are not automatically cleaned up in many cases. - Some MRs change only docs and don't need full CI review environments, but the author forgot the
docs-
prefix /-docs
suffix. - Maybe we consider automatically uninstalling successful review environments after ~1 hour. I personally don't often need to jump in and look at successful CI environments nearly as often as I need to see failed review environments. This would clean up quite a bit of resource usage.
Edited by Mitchell Nielsen