[CI] Review environments periodically failing due to resource exhaustion

Summary

We're noticing that CI review jobs periodically fail with a variety of reported errors.

The culprit is usually resource exhaustion: pods belonging to the affected release are stuck in the Pending state.
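
One way to confirm the Pending pods for a given release is a small wrapper around kubectl. This is only a sketch: the function name pending_pods is made up for illustration, and it assumes the pods carry a release=<release name> label, which depends on the chart.

```shell
# List the pods of one release that are stuck in the Pending phase.
# Assumption: pods are labelled "release=<release name>"; adjust the
# selector if your chart labels pods differently.
pending_pods() {
  release="$1"
  kubectl get pods \
    --field-selector=status.phase=Pending \
    --selector "release=${release}" \
    --output name
}
```

For example, pending_pods <release name> prints one pod/<name> line per Pending pod in the current namespace, and nothing if all pods were scheduled.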

The typical workaround is to:

  1. Run helm ls
  2. Find the oldest releases.
  3. Run helm get values <release name> | yq .ci.pipeline.url
  4. Visit the URL to see if the environment can be stopped (MR merged, docs-only MR, etc.)
  5. If it can be stopped, run helm delete <release name> or run the stop_review jobs from the MR pipeline.
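
Steps 1 to 3 could be combined roughly as follows. This is a sketch, not an established script: the function name list_oldest_with_urls is hypothetical, it relies on helm ls --date sorting by release date with the oldest first, and it only looks at the namespace of the current context.

```shell
# Print the N oldest releases together with their pipeline URL,
# one "<release name><TAB><url>" pair per line.
# Assumption: "helm ls --date" lists the oldest releases first.
list_oldest_with_urls() {
  count="${1:-10}"
  helm ls --date --short | head -n "$count" | while read -r release; do
    # Detach stdin so helm cannot consume the remaining release names.
    url="$(helm get values "$release" </dev/null | yq .ci.pipeline.url)"
    printf '%s\t%s\n' "$release" "$url"
  done
}
```

Running list_oldest_with_urls 5 would print the five oldest releases with the URLs to check in step 4. Deciding whether an environment can be stopped still needs a human.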

Some considerations:

  • I see lots of trigger environments created from CNG MR pipelines, and in many cases they are not cleaned up automatically.
  • Some MRs change only docs and don't need full CI review environments, but the author forgot the docs- prefix / -docs suffix.
  • Maybe we should consider automatically uninstalling successful review environments after ~1 hour. I personally need to look at successful CI environments far less often than failed review environments, so this would free up quite a bit of resources.
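
The age check behind the ~1 hour idea could look something like this. It is purely illustrative: the function name stale_releases and the input format (one "<release name> <deploy time in Unix epoch seconds>" pair per line) are assumptions, and producing that input from helm, as well as restricting it to releases whose pipeline succeeded, is left out.

```shell
# Print the names of releases older than max_age seconds.
# stdin: "<release name> <deploy time as Unix epoch seconds>" per line
#        (hypothetical format, not something helm emits directly).
# $1: max age in seconds (default 3600, i.e. ~1 hour)
# $2: current time as epoch seconds (defaults to the system clock)
stale_releases() {
  max_age="${1:-3600}"
  now="${2:-$(date +%s)}"
  awk -v max_age="$max_age" -v now="$now" \
    'NF == 2 && (now - $2) > max_age { print $1 }'
}
```

The resulting names could then be passed to helm delete <release name> one at a time, after filtering the input down to successful environments.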
Edited by Mitchell Nielsen