Ensure orphaned Starboard jobs are cleaned up
Why are we doing this work
If a Starboard scan is interrupted, it can leave behind an orphaned scan-vulnerabilityreport job in the Kubernetes cluster, which blocks later scans of the same resource.
gitlab-org/cluster-integration/gitlab-agent!630 (comment 921111765)
I stopped my agent process and unluckily the deferred cleanup function did not execute and left me with a dangling, yet completed job:
% kubectl get jobs -n gitlab-agent
NAME                                  COMPLETIONS   DURATION   AGE
scan-vulnerabilityreport-68676cd7bc   1/1           13s        138m
Because starboard uses a hash of the object's kind/namespace/name as suffix for the scan job name, this resource can now no longer be scanned, across restarts:
{"level":"error","time":"2022-04-22T15:38:01.853+0200","msg":"Failed to perform vulnerability scan on workload","mod_name":"starboard_vulnerability","error":"running scan job: creating job: jobs.batch \"scan-vulnerabilityreport-68676cd7bc\" already exists"}
IMO starboard should hash the whole object, not just their simplified ObjectRef?
Alternatively, starboard should specify .spec.ttlSecondsAfterFinished on the jobs it creates so that they get cleaned up automatically.
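To illustrate the naming collision described in the quoted comment, here is a minimal Go sketch. The `ObjectRef` type, the `scanJobName` helper, and the FNV-1a hash with a hex suffix are assumptions made for illustration and are not Starboard's actual implementation. The sketch only shows that a name derived solely from kind/namespace/name is identical on every run.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// ObjectRef is a hypothetical, simplified reference to a scanned workload.
type ObjectRef struct {
	Kind      string
	Namespace string
	Name      string
}

// scanJobName derives a deterministic job name from the reference alone.
// The FNV-1a hash and hex suffix are illustrative assumptions, not
// Starboard's real encoding.
func scanJobName(ref ObjectRef) string {
	h := fnv.New32a()
	// Only kind, namespace, and name feed the hash; the object's spec does not.
	fmt.Fprintf(h, "%s/%s/%s", ref.Kind, ref.Namespace, ref.Name)
	return fmt.Sprintf("scan-vulnerabilityreport-%x", h.Sum32())
}

func main() {
	ref := ObjectRef{Kind: "Deployment", Namespace: "default", Name: "nginx"}
	first := scanJobName(ref)
	second := scanJobName(ref) // for example, after an agent restart
	// The names match, so a leftover job with this name makes the next
	// create call fail with "already exists".
	fmt.Println(first == second) // prints "true"
}
```

Because the name never changes for a given workload, a completed job that was not cleaned up keeps occupying that name until something deletes it.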
We should fix this using one of the two options:
- Do an existence check for a job before running a scan. If it exists, delete it first.
- Make an upstream contribution to Starboard to set .spec.ttlSecondsAfterFinished so that orphaned jobs are deleted automatically.
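The first option could look roughly like the Go sketch below. The `JobClient` interface, the `createScanJob` helper, the sentinel errors, and the in-memory `fakeJobs` type are all hypothetical names introduced here. They stand in for the real Kubernetes client and its API errors so the control flow can be shown without cluster access.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrNotFound and ErrAlreadyExists stand in for the corresponding
// Kubernetes API errors (hypothetical sentinels for this sketch).
var (
	ErrNotFound      = errors.New("job not found")
	ErrAlreadyExists = errors.New("job already exists")
)

// JobClient is a hypothetical, minimal view of the Jobs API.
type JobClient interface {
	Exists(name string) (bool, error)
	Delete(name string) error
	Create(name string) error
}

// createScanJob deletes a leftover job with the same name, if any,
// and then creates the new scan job.
func createScanJob(c JobClient, name string) error {
	exists, err := c.Exists(name)
	if err != nil {
		return fmt.Errorf("checking for existing job: %w", err)
	}
	if exists {
		// Ignore "not found": the job may have been removed in the meantime.
		if err := c.Delete(name); err != nil && !errors.Is(err, ErrNotFound) {
			return fmt.Errorf("deleting orphaned job: %w", err)
		}
	}
	if err := c.Create(name); err != nil {
		return fmt.Errorf("creating job: %w", err)
	}
	return nil
}

// fakeJobs is an in-memory JobClient used only to exercise the sketch.
type fakeJobs map[string]bool

func (f fakeJobs) Exists(name string) (bool, error) { return f[name], nil }

func (f fakeJobs) Delete(name string) error {
	if !f[name] {
		return ErrNotFound
	}
	delete(f, name)
	return nil
}

func (f fakeJobs) Create(name string) error {
	if f[name] {
		return ErrAlreadyExists
	}
	f[name] = true
	return nil
}

func main() {
	// Simulate the orphaned job left behind by an interrupted scan.
	jobs := fakeJobs{"scan-vulnerabilityreport-68676cd7bc": true}
	err := createScanJob(jobs, "scan-vulnerabilityreport-68676cd7bc")
	fmt.Println(err) // prints "<nil>"
}
```

A real implementation would likely also need to wait until the deletion has completed and choose a propagation policy so that the job's pods are removed as well. The second option avoids agent-side logic entirely, since Kubernetes deletes finished jobs on its own once .spec.ttlSecondsAfterFinished is set.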
Relevant links
Non-functional requirements
- Documentation:
- Feature flag:
- Performance:
- Testing:
Implementation plan
Verification steps
Edited by Brian Williams