Operational Container Scanning can fail for k8s clusters that have low default ephemeral storage

Summary

Operational Container Scanning (OCS) scans can fail for k8s clusters that have low default ephemeral storage. For example, GKE Autopilot sets the default ephemeral storage to 1GB. This is a problem for the Trivy scanning Pod: the Trivy image used is ~200MB and the uncompressed trivy-db-glad vulnerability DB used for the scan is ~400MB, which leaves only about 400MB of storage to download the images to be scanned and to store the Trivy report.

Each Trivy scanning Pod is responsible for scanning a single namespace; if the namespace contains many large images, the Pod can run out of space and fail. See the steps below to replicate the failure. Note that the Trivy k8s scan has a parallel flag that defaults to 5, so if the total size of the 5 images being scanned concurrently exceeds the remaining ~400MB, the Pod fails.

To ensure that OCS scans do not fail for this reason, I propose specifying a higher ephemeral storage limit for the Trivy scanning Pod.

Implementation Plan

Enable configuration of default ephemeral storage.

  1. Add ephemeral-storage as a configurable resource in the OCS resource requirements config.
    1. This MR might be a useful reference for how the memory and CPU resource requirements were first added.
  2. Configure a default ephemeral-storage request of 1GB and a limit of 3GB (see the config sketch after this list).
    1. This keeps the requested ephemeral-storage low while still allowing the Pod to use up to 3GB. See requests-and-limits for the difference between requests and limits.
  3. Parse and validate the ephemeral-storage value configured in the agent config.
  4. Configure the ephemeral-storage for the OCS scanning Pod (see the Pod resources sketch after this list).
  5. Update the Pod failure handling logic in OCS to detect when the failure was caused by exceeding ephemeral storage, and log a meaningful error prompting the user to configure a higher ephemeral storage limit for their use case.
  6. Update the resource requirements section in the OCS docs with an ephemeral-storage example.
  7. Update the OCS troubleshooting documentation to include the exceeded ephemeral storage case.
    1. Add a link to the resource requirements section.
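
The exact key names are not final. As a rough sketch, assuming the new setting follows the existing resource_requirements layout of the OCS agent config and is called ephemeral_storage (the cpu/memory values below are placeholders), the configuration could look like:

    container_scanning:
      resource_requirements:
        requests:
          cpu: '0.2'
          memory: 200Mi
          ephemeral_storage: 1Gi   # proposed default request
        limits:
          cpu: '0.7'
          memory: 500Mi
          ephemeral_storage: 3Gi   # proposed default limit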

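On the Kubernetes side, step 4 would translate these values into the standard resources stanza of the Trivy scanning Pod spec, roughly:

    resources:
      requests:
        ephemeral-storage: 1Gi   # scheduler reserves at least this much local storage
      limits:
        ephemeral-storage: 3Gi   # kubelet evicts the Pod only above this usage

With the 3GB limit in place, the kubelet evicts the Pod only when its local storage usage exceeds 3GB, rather than the 1GB default applied by GKE Autopilot (see the eviction event in the steps below).
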
Steps to replicate failure:

  1. Start a GKE Autopilot cluster, install gitlab-agent, and enable OCS (a minimal example agent config is sketched after this list)

  2. Start multiple large pods in the default namespace

    kubectl run nginx-pod --image=nginx:latest --restart=Never
    kubectl run ubuntu-pod --image=ubuntu:latest --restart=Never
    kubectl run debian-pod --image=debian:latest --restart=Never
    kubectl run alpine-pod --image=alpine:latest --restart=Never
    kubectl run centos-pod --image=centos:latest --restart=Never
  3. Start an OCS scan for the default namespace

  4. View the gitlab-agent logs and observe that the scan fails with the errors "Error running Trivy scan" and "pod failed":

    {"level":"error","time":"2023-12-08T07:06:25.317Z","msg":"Error running Trivy scan","mod_name":"starboard_vulnerability","error":"pod failed. Container terminated reason: &ContainerStateTerminated{ExitCode:2,Signal:0,Reason:Error,Message:,StartedAt:2023-12-08 07:04:01 +0000 UTC,FinishedAt:2023-12-08 07:06:23 +0000 UTC,ContainerID:containerd://ebbcc53476e1d85b8acf427d89879bf2c8a7d612a543bb09574ff060d971b117,}","agent_id":1081675}
  5. View the events of the gitlab-agent namespace and note the warning that Pod ephemeral local storage usage exceeds the total limit of containers 1Gi:

    smtan@cloudshell:~ (smtan-7bd483b7)$ kubectl get events -n gitlab-agent-agentk
    LAST SEEN   TYPE      REASON      OBJECT                   MESSAGE
    ...
    4m21s       Normal    Created     pod/trivy-scan-default   Created container trivy
    4m21s       Normal    Started     pod/trivy-scan-default   Started container trivy
    119s        Warning   Evicted     pod/trivy-scan-default   Pod ephemeral local storage usage exceeds the total limit of containers 1Gi.
    119s        Normal    Killing     pod/trivy-scan-default   Stopping container trivy
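
For reference, steps 1 and 3 assume OCS is enabled for the default namespace in the agent config. A minimal sketch, assuming the standard OCS configuration fields (the cadence and namespace values are illustrative):

    container_scanning:
      cadence: '0 0 * * *'        # scan schedule
      vulnerability_report:
        namespaces:
          - default               # namespace scanned by the Trivy Pod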

Relevant links

Original Slack discussion
