Operational Container Scanning of objects with unavailable image references results in Errored `scan-vulnerabilityreport` pods being left in cluster
Summary
The Agent-initiated Operational Container scanner builds a list of images to scan by collecting image references from various objects in the Kubernetes cluster, e.g. Pods, ReplicaSets, Deployments, etc.
Sometimes clusters contain inactive objects which reference images that are no longer available, for instance an old ReplicaSet from a previous Deployment revision referencing an image that has been deleted from the container registry.
Setting aside the question of why the old ReplicaSets still exist, when they do exist the scanner creates a `scan-vulnerabilityreport` pod which fails to start (because the image is unavailable), and this errored pod remains in the cluster indefinitely until the associated `scan-vulnerabilityreport` job is manually deleted.
There is already an open issue, "Ensure orphaned Starboard jobs are cleaned up", dealing with the need to clean up failed `scan-vulnerabilityreport` jobs and their associated resources. However, in the interest of avoiding errors being triggered in the cluster in the first place, this issue asks whether:
- the images-to-scan identification process could be amended to preempt the errors by checking whether an image can be pulled and, if not, skipping it; or
- cluster objects that have no running containers associated with them can/should be excluded from the images-to-scan list
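The first option above could be sketched as a pre-flight check against the registry before a scan job is created. This is only an illustration, not the scanner's actual code: the helper names (`manifestURL`, `imagePullable`) are hypothetical, and a real implementation would need to handle registry authentication tokens, digests, and ports in the registry host.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
	"time"
)

// manifestURL builds the Docker Registry HTTP API v2 manifest endpoint
// for an image reference such as "registry.example.com/group/app:v1".
// Hypothetical helper for illustration only.
func manifestURL(ref string) string {
	repo := ref
	tag := "latest"
	// Split off the tag only if the last ':' comes after the last '/',
	// so a port in the registry host is not mistaken for a tag.
	if i := strings.LastIndex(ref, ":"); i > strings.LastIndex(ref, "/") {
		repo, tag = ref[:i], ref[i+1:]
	}
	slash := strings.Index(repo, "/")
	registry, path := repo[:slash], repo[slash+1:]
	return fmt.Sprintf("https://%s/v2/%s/manifests/%s", registry, path, tag)
}

// imagePullable reports whether the manifest for ref still exists in the
// registry. Real registries usually require an auth token; this
// unauthenticated HEAD request is only a sketch of the idea.
func imagePullable(ref string) bool {
	client := &http.Client{Timeout: 5 * time.Second}
	req, err := http.NewRequest(http.MethodHead, manifestURL(ref), nil)
	if err != nil {
		return false
	}
	req.Header.Set("Accept", "application/vnd.docker.distribution.manifest.v2+json")
	resp, err := client.Do(req)
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

func main() {
	fmt.Println(manifestURL("registry.example.com/group/app:v1"))
}
```

Images failing this check could simply be dropped from the images-to-scan list, at the cost of one extra registry round-trip per image.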
This issue was raised by an Ultimate customer (ZD internal link).
Steps to reproduce
- Configure Operational Container Scanning of a cluster namespace `test`.
- Deploy a ReplicaSet referencing a tagged image in the container registry.
- Delete the tagged image from the container registry.
- Run a scan.
- Observe that a new `scan-vulnerabilityreport` pod is created which fails to start and is left in the Error state.
- Delete the associated `scan-vulnerabilityreport` job and the pod is removed also.
What is the current bug behavior?
The Operational Container Scanning process leaves behind errored pods.
What is the expected correct behavior?
The Operational Container Scanning process should not leave behind any errored pods, regardless of cluster configuration.
Possible fixes
Implementation plan
- More context in this comment
- Update `ParsePodLogsToReport` in `Scanner.go` to keep only the JSON portion of the output.
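One way to "keep only the JSON portion" of pod log output is to trim any log noise surrounding the report by slicing from the first `{` to the last `}`. This is a minimal sketch under that assumption; `extractJSON` is a hypothetical helper, and the real `ParsePodLogsToReport` in `Scanner.go` may take a different approach.

```go
package main

import (
	"fmt"
	"strings"
)

// extractJSON keeps only the span from the first '{' to the last '}' in
// the pod logs, discarding any surrounding log lines. Hypothetical
// helper; it assumes the report is a single top-level JSON object.
func extractJSON(logs string) (string, error) {
	start := strings.Index(logs, "{")
	end := strings.LastIndex(logs, "}")
	if start == -1 || end == -1 || end < start {
		return "", fmt.Errorf("no JSON object found in pod logs")
	}
	return logs[start : end+1], nil
}

func main() {
	raw := "time=... level=warning msg=\"pull failed\"\n{\"vulnerabilities\":[]}\ntrailing text"
	report, err := extractJSON(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(report) // {"vulnerabilities":[]}
}
```

The extracted string could then be unmarshalled into the report struct as before, so non-JSON warnings emitted by the scan pod no longer break parsing.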