Create a scheduled pipeline or kube-janitor to clean up releases last updated more than 2 days ago
Summary
Related to: [CI] Pipeline on '7-6-stable' failed for commit... (#5316 - closed)
We've exceeded our ELB quota of 20 because we had too many environments deployed. We shouldn't have: many environments older than 2 days were left hanging, which means our `auto_stop_in: $REVIEW_APPS_AUTO_STOP_IN` setting didn't work.
As mentioned internally by @mnielsen:
... I’ve noticed that if someone retries any of those trigger jobs then it won’t re-trigger cleanup later
which could be the reason for dangling resources.
Proposal 1 - Scheduled Pipeline with script
Proposal 1.1
Create a scheduled pipeline which runs our `autodevops.sh` delete and cleanup scripts to remove Helm releases last updated more than 2 days ago.
We've scripted something like this manually already. See this internal thread for reference.
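The cleanup script could look roughly like the following. This is a hedged sketch, not a drop-in implementation: it assumes Helm 3 (`helm ls -o json` exposes an `updated` timestamp), `jq`, and GNU `date`; the `MAX_AGE_DAYS` and `NAMESPACE` defaults are illustrative.

```shell
#!/usr/bin/env bash
# Sketch: uninstall Helm releases in a namespace whose last update is more
# than MAX_AGE_DAYS old. Assumes Helm 3, jq, and GNU date.

MAX_AGE_DAYS="${MAX_AGE_DAYS:-2}"
NAMESPACE="${NAMESPACE:-helm-charts-win}"

# True if the given epoch timestamp is more than MAX_AGE_DAYS old.
older_than_max_age() {
  local updated_epoch="$1"
  local cutoff=$(( $(date +%s) - MAX_AGE_DAYS * 86400 ))
  [ "$updated_epoch" -lt "$cutoff" ]
}

# Walk all releases in the namespace and uninstall the stale ones.
cleanup_stale_releases() {
  helm ls -n "$NAMESPACE" -o json |
    jq -r '.[] | "\(.name)\t\(.updated)"' |
    while IFS=$'\t' read -r name updated; do
      # Helm prints e.g. "2024-05-01 10:11:12.000000 +0000 UTC"; keep the
      # leading date/time part that GNU date can parse.
      epoch=$(date -d "${updated%% +*}" +%s) || continue
      if older_than_max_age "$epoch"; then
        echo "Deleting stale release: $name (last updated $updated)"
        helm uninstall -n "$NAMESPACE" "$name"
      fi
    done
}
```

A scheduled pipeline job would then call `cleanup_stale_releases`, ideally after filtering out protected releases (see the Important section below).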
Proposal 1.2
Do the same as above, but add the job directly to every master pipeline instead of scheduling it.
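Both variants could share one `.gitlab-ci.yml` job. A hedged sketch, where the job name, stage, and the `cleanup` argument to `autodevops.sh` are assumptions:

```yaml
cleanup_stale_releases:
  stage: cleanup
  rules:
    # Proposal 1.2: run on every master pipeline...
    - if: '$CI_COMMIT_BRANCH == "master"'
    # ...and Proposal 1.1: run from a scheduled pipeline.
    - if: '$CI_PIPELINE_SOURCE == "schedule"'
  script:
    - ./autodevops.sh cleanup
  allow_failure: true
```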
Proposal 2 - kube-janitor
Investigate extending kube-janitor beyond buildx cleanups. It can be driven by rules and configured to remove objects that are more than X days old; alternatively, we could add an "expiration" annotation to our CI deployments and keep bumping the TTL in real time, preventing cleanup of "static" environments.
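For the rules-based approach, a kube-janitor rules file could look like the following. This is a hedged sketch: the resource list, label selector, and TTL are assumptions (see the rules-file link in the discussion below for the actual format).

```yaml
rules:
  - id: cleanup-stale-review-apps
    resources:
      - deployments
      - statefulsets
      - services
    # Match only review-app objects; the label used here is an assumption.
    jmespath: "metadata.labels.\"app.gitlab.com/env\" != null"
    ttl: 2d
```

For the annotation-based approach, CI could instead set `janitor/ttl: 2d` (or keep bumping `janitor/expires`) on each deployment, so that actively redeployed environments never expire.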
As per the discussion between @WarheadsSE and @dmakovey:
jplum
Will this kube-janitor also remove all traces of the Helm objects, dangling persistence (secrets, disks)?
Dmytro Makovey
yes.
Dmytro Makovey
https://codeberg.org/hjacobs/kube-janitor#rules-file
Dmytro Makovey
if not using rules, pipelines will have to update the expiration annotation at each execution, etc. Naturally, one has to be careful with a blunt tool like that.
jplum
So, we could configure it to remove the Helm identifying bits, but it does not look like it will natively call Helm to tear down.
Dmytro Makovey
unfortunately - not to my knowledge
Dmytro Makovey
so it's good for "cleanup" after Helm Chart has been removed (edited)
jplum
It's in python. GPLv3 licensed.
Perhaps we just contribute that Helm support?
Dmytro Makovey
that is a possibility. My past Helm-related additions didn't always land, as the author seemed to be a bit Helm-averse (edited)
jplum
mm. Well, I'd have to look at them and see if there is a "better way" to convey the value / risk of just deleting the objects out from under Helm.
Dmytro Makovey
In my past experience we got around that by enclosing deployments in a namespace, then we just have to deal with namespace and can "ignore" Helm. Not the nicest way around but workable for the most part. In our case the best we can do is proper labeling and operate based on labels (edited)
jplum
Moving forward, we could look at separating deployments by namespace.
We'll need some refactoring due to external-dns.
We may need to backport some fixes to work that way / around the fact that, moving forward, we do
jplum
Immediately viable: a scheduled pipeline that takes the information from Helm and forcibly does one of:
- helm delete / purge from scripts
- attempt to trigger the job from the pipeline the Helm release stems from, to delete it.
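The second option above could use the GitLab jobs API (`POST /projects/:id/jobs/:job_id/play`). A hedged sketch: how to look up the project id and the stop job's id from a release name is left out, and the `DRY_RUN` and `GITLAB_API_TOKEN` names are illustrative.

```shell
# Re-play the stop job of the pipeline a stale release came from.
# CI_API_V4_URL is GitLab's predefined API base-URL variable.
play_stop_job() {
  local project_id="$1" job_id="$2"
  local url="${CI_API_V4_URL:-https://gitlab.com/api/v4}/projects/${project_id}/jobs/${job_id}/play"
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Print instead of calling the API, for inspection/testing.
    echo "POST $url"
  else
    curl --request POST --header "PRIVATE-TOKEN: ${GITLAB_API_TOKEN}" "$url"
  fi
}
```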
Proposal 3 - gitlab-bot
An idea came up for the gitlab-bot to auto-schedule the `stop-review-*` jobs when it detects a call to `review-*`. The tricky thing here is that the gitlab-bot must only call `stop-review-*` if the pipeline succeeds, after all the other subsequent jobs have finished. To circumvent this, we'd have to be able to trigger `auto_stop_in` with the gitlab-bot to refresh it. Either that, or the gitlab-bot would need its own queue scheduling mechanism.
Proposal 4 - Change (fix?) `auto_stop_in` behaviour
This looks like a functional bug in the `auto_stop_in` feature. We could contribute a change to it so that it also removes environments for retried jobs.
‼ ‼ Important ‼ ‼
Whatever cleanup we automate, we don't want to delete *production* or `gitlab-external-dns`. We only want to delete test releases triggered by CNG, MRs, or stable branches, last updated more than 2 days ago.
Using the `eks-helm-charts-win` and `helm-charts-win` namespaces is a good start, but won't preclude deleting things that shouldn't be deleted. So we need to filter those releases out, or refactor them into separate namespaces.
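A minimal guard for that filtering, assuming we match on release names: treat anything matching `*production*` or `gitlab-external-dns` as protected and skip it in every automated cleanup path.

```shell
# Skip protected releases in any automated cleanup. The name patterns mirror
# the list above and are easy to extend.
is_protected_release() {
  case "$1" in
    *production*|gitlab-external-dns)
      return 0 ;;  # protected: never delete
    *)
      return 1 ;;  # fair game for cleanup
  esac
}
```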