Ensure orphaned Starboard jobs are cleaned up

Why are we doing this work

If a Starboard scan is interrupted, it can leave behind an orphaned scan job (named scan-vulnerabilityreport-<hash>) in the Kubernetes cluster. While that job exists, the same workload cannot be scanned again.

gitlab-org/cluster-integration/gitlab-agent!630 (comment 921111765)

I stopped my agent process and, unluckily, the deferred cleanup function did not execute, which left me with a dangling yet completed job:

% kubectl get jobs -n gitlab-agent
NAME                                  COMPLETIONS   DURATION   AGE
scan-vulnerabilityreport-68676cd7bc   1/1           13s        138m

Because Starboard uses a hash of the object's kind/namespace/name as the suffix of the scan job name, this resource can no longer be scanned, even across restarts:

{"level":"error","time":"2022-04-22T15:38:01.853+0200","msg":"Failed to perform vulnerability scan on workload","mod_name":"starboard_vulnerability","error":"running scan job: creating job: jobs.batch \"scan-vulnerabilityreport-68676cd7bc\" already exists"}

IMO Starboard should hash the whole object, not just its simplified ObjectRef?

Alternatively, Starboard should specify .spec.ttlSecondsAfterFinished on the jobs it creates so that they get cleaned up automatically.

We should fix this using one of the two options:

  1. Check whether the scan job already exists before running a scan. If it does, delete it first.
  2. Make an upstream contribution to Starboard to set .spec.ttlSecondsAfterFinished on the jobs it creates, so that orphaned jobs are deleted automatically.

Relevant links

Non-functional requirements

  • Documentation:
  • Feature flag:
  • Performance:
  • Testing:

Implementation plan

Verification steps

Edited by Brian Williams