Avoid resolving OCS vulns if a scanning pod fails

For OCS scans of multiple namespaces, OCS vulns are being incorrectly resolved when one or more scanning pods fail.

Background

For each scan, OCS calls the PUT starboard_vulnerability endpoint to idempotently create vulnerabilities; the endpoint returns a UUID for each vulnerability. Once all vulnerabilities are created, OCS calls POST scan_result with an array of all the UUIDs. Vulnerabilities whose UUIDs are not in this array are then resolved.

When a scanning pod fails, no vulnerabilities are created for the images it was scanning, which means no UUIDs for those images are sent to scan_result. Consequently, any vulnerabilities previously created for those images are incorrectly resolved.
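The failure mode described above can be illustrated with a minimal sketch. This is not the actual OCS or backend code: the function name `resolve_missing` and the UUID values are hypothetical, and the set difference only models the "resolve everything not in the array" behaviour described in this section.

```python
def resolve_missing(existing_uuids, reported_uuids):
    """Model of the resolve step: every existing vulnerability whose
    UUID is absent from the reported array gets resolved."""
    return set(existing_uuids) - set(reported_uuids)


# Vulnerabilities recorded by an earlier scan in which every pod succeeded.
# (Hypothetical UUIDs: a* come from namespace A, b* from namespace B.)
existing = {"uuid-a1", "uuid-a2", "uuid-b1"}

# Second scan: the pod for namespace B failed, so it created nothing and
# only the UUIDs from namespace A end up in the scan_result array.
reported = ["uuid-a1", "uuid-a2"]

# The vulnerability from namespace B is resolved even though it was
# never rescanned.
wrongly_resolved = resolve_missing(existing, reported)
print(wrongly_resolved)
```

The sketch shows why the problem only appears when earlier scans succeeded: the missing UUIDs are indistinguishable from vulnerabilities that were genuinely fixed.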

Related Issues

First reported by a customer in this comment: #468631 (comment 2099341916)

The fix for this issue "might" also address: OCS vulns might resolved incorrectly due to pul... (#468422 - closed)

Steps to replicate

1. Create an OCS scan for at least 2 namespaces.
2. Run an OCS scan, check that vulns are created in the dashboard, and note the number of vulns.
3. Run another OCS scan, but this time manually kill one of the scanning pods. When the scan completes, note that the number of vulns has decreased because the vulns from the killed scanning pod have been resolved.

Possible Fix

Introduce a mechanism to skip resolving all vulnerabilities if there is an error with a scanning pod.
1. Ensure that failures due to timeouts are also taken into account.
Make the scanning pod timeout configurable and pass this value to the Trivy scanning pod.
1. Consider creating a separate issue and MR for this.
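
As a rough sketch of the first proposal, the decision to resolve could be gated on every scanning pod having succeeded, with timeouts treated as failures. All names here (`PodStatus`, `PodResult`, `plan_scan_result`) are hypothetical and for illustration only; how the "do not resolve" decision would be communicated to scan_result is left open.

```python
from dataclasses import dataclass, field
from enum import Enum


class PodStatus(Enum):
    SUCCEEDED = "succeeded"
    FAILED = "failed"
    TIMED_OUT = "timed_out"  # timeouts count as failures


@dataclass
class PodResult:
    namespace: str
    status: PodStatus
    # UUIDs returned when this pod's vulnerabilities were created.
    uuids: list = field(default_factory=list)


def plan_scan_result(results):
    """Return (uuids_to_report, should_resolve).

    Resolution is only allowed when at least one pod ran and every pod
    succeeded; otherwise missing UUIDs may simply be unscanned images.
    """
    uuids = [uuid for result in results for uuid in result.uuids]
    should_resolve = bool(results) and all(
        result.status is PodStatus.SUCCEEDED for result in results
    )
    return uuids, should_resolve


results = [
    PodResult("namespace-a", PodStatus.SUCCEEDED, ["uuid-a1", "uuid-a2"]),
    PodResult("namespace-b", PodStatus.TIMED_OUT),
]
uuids, should_resolve = plan_scan_result(results)
print(uuids, should_resolve)
```

Newly found vulnerabilities from the successful pods are still reported in this sketch; only the resolve step is skipped for the whole scan.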

Consideration

If resolving all vulnerabilities is skipped due to an error, we risk missing vulnerabilities that should be resolved. However, this approach is still safer than accidentally resolving vulnerabilities that shouldn't be resolved.

Edited Sep 19, 2024 by Shao Ming Tan