Operational Container Scanning of objects with unavailable image references results in Errored `scan-vulnerabilityreport` pods being left in cluster
Summary
The Agent-initiated Operational Container scanner builds a list of images to scan by collecting image references from various objects in the Kubernetes cluster, e.g. Pods, ReplicaSets, Deployments, etc.
Sometimes clusters contain inactive objects which reference images that are no longer available, for instance an old ReplicaSet from a previous Deployment revision referencing an image that has been deleted from the container registry.
Setting aside the question of why the old ReplicaSets still exist, when they do exist the scanner creates a `scan-vulnerabilityreport` pod which fails to start (because the image is unavailable), and this errored pod remains in the cluster indefinitely until the associated `scan-vulnerabilityreport` job is manually deleted.
There is already an open issue, "Ensure orphaned Starboard jobs are cleaned up", dealing with the need to clean up failed `scan-vulnerabilityreport` jobs and their associated resources. However, in the interest of avoiding errors being triggered in the cluster in the first place, this issue asks whether:
- the images-to-scan identification process could be amended to preempt the errors by checking whether an image can be pulled and, if not, skipping it; or
- cluster objects that have no running containers associated with them can/should be excluded from the images-to-scan list
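The first option above could be sketched as a pre-flight check against the registry before a scan job is created. This is only an illustration, not the scanner's actual code: the helper names (`manifestURL`, `imagePullable`) are hypothetical, and a real implementation would need to handle registry authentication tokens, digests, and ports in the registry host.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
	"time"
)

// manifestURL builds the Docker Registry HTTP API v2 manifest endpoint
// for an image reference such as "registry.example.com/group/app:v1".
// Hypothetical helper for illustration only.
func manifestURL(ref string) string {
	repo := ref
	tag := "latest"
	// Split off the tag only if the last ':' comes after the last '/',
	// so a port in the registry host is not mistaken for a tag.
	if i := strings.LastIndex(ref, ":"); i > strings.LastIndex(ref, "/") {
		repo, tag = ref[:i], ref[i+1:]
	}
	slash := strings.Index(repo, "/")
	registry, path := repo[:slash], repo[slash+1:]
	return fmt.Sprintf("https://%s/v2/%s/manifests/%s", registry, path, tag)
}

// imagePullable reports whether the manifest for ref still exists in the
// registry. Real registries usually require an auth token; this
// unauthenticated HEAD request is only a sketch of the idea.
func imagePullable(ref string) bool {
	client := &http.Client{Timeout: 5 * time.Second}
	req, err := http.NewRequest(http.MethodHead, manifestURL(ref), nil)
	if err != nil {
		return false
	}
	req.Header.Set("Accept", "application/vnd.docker.distribution.manifest.v2+json")
	resp, err := client.Do(req)
	if err != nil {
		return false
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK
}

func main() {
	fmt.Println(manifestURL("registry.example.com/group/app:v1"))
}
```

Images failing this check could simply be dropped from the images-to-scan list, at the cost of one extra registry round-trip per image.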
This issue was raised by an Ultimate customer (ZD internal link).
Steps to reproduce
- Configure Operational Container Scanning of a cluster namespace `test`.
- Deploy a ReplicaSet referencing a tagged image in the container registry.
- Delete the tagged image from the container registry.
- Run a scan.
- Observe that a new `scan-vulnerabilityreport` pod is created which fails to start and is left in the Error state.
- Delete the associated `scan-vulnerabilityreport` job and the pod is removed also.
What is the current bug behavior?
The Operational Container Scanning process leaves behind errored pods.
What is the expected correct behavior?
The Operational Container Scanning process should not leave behind any errored pods, regardless of cluster configuration.
Possible fixes
Implementation plan
- More context in this comment
- Update `ParsePodLogsToReport` in `Scanner.go` to keep only the JSON portion of the output.
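One way to "keep only the JSON portion" of pod log output is to trim any log noise surrounding the report by slicing from the first `{` to the last `}`. This is a minimal sketch under that assumption; `extractJSON` is a hypothetical helper, and the real `ParsePodLogsToReport` in `Scanner.go` may take a different approach.

```go
package main

import (
	"fmt"
	"strings"
)

// extractJSON keeps only the span from the first '{' to the last '}' in
// the pod logs, discarding any surrounding log lines. Hypothetical
// helper; it assumes the report is a single top-level JSON object.
func extractJSON(logs string) (string, error) {
	start := strings.Index(logs, "{")
	end := strings.LastIndex(logs, "}")
	if start == -1 || end == -1 || end < start {
		return "", fmt.Errorf("no JSON object found in pod logs")
	}
	return logs[start : end+1], nil
}

func main() {
	raw := "time=... level=warning msg=\"pull failed\"\n{\"vulnerabilities\":[]}\ntrailing text"
	report, err := extractJSON(raw)
	if err != nil {
		panic(err)
	}
	fmt.Println(report) // {"vulnerabilities":[]}
}
```

The extracted string could then be unmarshalled into the report struct as before, so non-JSON warnings emitted by the scan pod no longer break parsing.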