Operational Container Scanning can fail for k8s clusters that have low default ephemeral storage
Summary
OCS scans can fail for k8s clusters that have low default ephemeral storage. For example, GKE Autopilot sets the default ephemeral storage to 1GB. This is a problem for the Trivy scanning Pod because the Trivy image used is ~200MB and the uncompressed trivy-db-glad vulnerability DB used for the scan is ~400MB. This leaves about 400MB of storage to download the images to be scanned and to store the Trivy report.
Each Trivy scanning Pod is in charge of scanning a single namespace. If the namespace contains many large images, the Pod can run out of space and fail. See the steps below to replicate the failure. Note that the Trivy k8s scan has a parallel flag that defaults to 5. This means that if the total size of the 5 images being scanned concurrently exceeds the remaining ~400MB, the Pod fails.
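The storage budget described above can be worked through in a few lines. This is only an illustrative sketch: the sizes are the approximate figures quoted in this issue, the constant and function names are made up for the example, and the per-image figure assumes (simplistically) that the five concurrently scanned images are the same size.

```go
package main

import "fmt"

// Approximate sizes in MB, taken from the figures quoted in this issue.
const (
	ephemeralLimitMB = 1000 // default ephemeral storage on GKE Autopilot (1GB)
	trivyImageMB     = 200  // approximate size of the Trivy image
	vulnDBMB         = 400  // approximate size of the uncompressed trivy-db-glad DB
	parallelScans    = 5    // default value of the Trivy k8s parallel flag
)

// remainingMB returns the storage left for scanned images and the Trivy report.
func remainingMB(limit, image, db int) int {
	return limit - image - db
}

func main() {
	free := remainingMB(ephemeralLimitMB, trivyImageMB, vulnDBMB)
	fmt.Printf("left for scanned images and report: %dMB\n", free)
	// Average budget per image if all parallel scans are the same size.
	fmt.Printf("average budget per concurrent image: %dMB\n", free/parallelScans)
}
```

With these numbers, five concurrently scanned images averaging more than roughly 80MB each are enough to exhaust the Pod's storage.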
To ensure that OCS scans do not fail for this reason, I propose specifying a high ephemeral storage limit for the Trivy scanning Pod.
Implementation Plan
Enable configuration of default ephemeral storage.
- Add ephemeral-storage as a configurable resource in the OCS resource requirements config.
  - This MR might be a useful reference of how the memory and cpu resource requirement was first added.
- Configure a default ephemeral-storage request of 1GB and limit of 3GB.
  - This ensures that the requested ephemeral-storage is low but more storage can be used up to 3GB. See requests-and-limits on the difference between requests and limits.
- Parse and validate the ephemeral-storage values set in the agent config
- Configure the ephemeral-storage for the OCS scanning pod
- Update the logic in OCS that handles Pod failure to determine whether the failure was caused by exceeding ephemeral storage and, if so, log a meaningful error that tells the user to configure a higher ephemeral storage limit for their use case.
- Update the resource requirements section in OCS docs with ephemeral-storage example
- Update OCS troubleshooting documentation to include the exceeded ephemeral storage use case.
  - Add a link to the resource requirements section
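The defaulting and validation steps in the plan above could look roughly like the following. This is a minimal sketch, not the agent's actual code: `storageConfig`, `withDefaults` and `validate` are hypothetical names, values are held as plain byte counts, and the proposed 1GB/3GB defaults are interpreted as GiB. A real implementation would more likely parse Kubernetes quantity strings (for example with `resource.ParseQuantity` from `k8s.io/apimachinery`) rather than work with raw integers.

```go
package main

import (
	"errors"
	"fmt"
)

const (
	mib = 1 << 20
	gib = 1 << 30
)

// storageConfig is a hypothetical holder for the ephemeral-storage settings,
// in bytes. A zero value means "not configured by the user".
type storageConfig struct {
	RequestBytes int64
	LimitBytes   int64
}

// Defaults proposed in the plan: a 1GB request and a 3GB limit (GiB assumed).
var defaultStorage = storageConfig{RequestBytes: 1 * gib, LimitBytes: 3 * gib}

// withDefaults fills in any value the user left unset.
func withDefaults(c storageConfig) storageConfig {
	if c.RequestBytes == 0 {
		c.RequestBytes = defaultStorage.RequestBytes
	}
	if c.LimitBytes == 0 {
		c.LimitBytes = defaultStorage.LimitBytes
	}
	return c
}

// validate rejects negative values and a request larger than the limit.
func validate(c storageConfig) error {
	if c.RequestBytes < 0 || c.LimitBytes < 0 {
		return errors.New("ephemeral-storage values must not be negative")
	}
	if c.RequestBytes > c.LimitBytes {
		return fmt.Errorf("ephemeral-storage request (%d bytes) exceeds limit (%d bytes)",
			c.RequestBytes, c.LimitBytes)
	}
	return nil
}

func main() {
	// The user overrides only the limit; the request falls back to the default.
	cfg := withDefaults(storageConfig{LimitBytes: 5 * gib})
	if err := validate(cfg); err != nil {
		fmt.Println("invalid config:", err)
		return
	}
	fmt.Printf("request=%dMiB limit=%dMiB\n", cfg.RequestBytes/mib, cfg.LimitBytes/mib)
}
```

Validating after defaults are applied also catches the case where a user sets only a request that is larger than the default limit.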
 
Steps to replicate failure:
- Start a GKE Autopilot cluster, install gitlab-agent and enable OCS
- Start multiple large pods in the default namespace
    kubectl run nginx-pod --image=nginx:latest --restart=Never
    kubectl run ubuntu-pod --image=ubuntu:latest --restart=Never
    kubectl run debian-pod --image=debian:latest --restart=Never
    kubectl run alpine-pod --image=alpine:latest --restart=Never
    kubectl run centos-pod --image=centos:latest --restart=Never
- Start an OCS scan for the default namespace
- View the gitlab-agent logs to observe that the scan fails with error Error running Trivy scan and pod failed
    {"level":"error","time":"2023-12-08T07:06:25.317Z","msg":"Error running Trivy scan","mod_name":"starboard_vulnerability","error":"pod failed. Container terminated reason: &ContainerStateTerminated{ExitCode:2,Signal:0,Reason:Error,Message:,StartedAt:2023-12-08 07:04:01 +0000 UTC,FinishedAt:2023-12-08 07:06:23 +0000 UTC,ContainerID:containerd://ebbcc53476e1d85b8acf427d89879bf2c8a7d612a543bb09574ff060d971b117,}","agent_id":1081675}
- View the events of the gitlab-agent namespace and see that there is a warning that Pod ephemeral local storage usage exceeds the total limit of containers 1Gi.
    smtan@cloudshell:~ (smtan-7bd483b7)$ kubectl get events -n gitlab-agent-agentk
    LAST SEEN   TYPE      REASON    OBJECT                   MESSAGE
    ...
    4m21s       Normal    Created   pod/trivy-scan-default   Created container trivy
    4m21s       Normal    Started   pod/trivy-scan-default   Started container trivy
    119s        Warning   Evicted   pod/trivy-scan-default   Pod ephemeral local storage usage exceeds the total limit of containers 1Gi.
    119s        Normal    Killing   pod/trivy-scan-default   Stopping container trivy