Switch default pull policy for runner-managed images to if-not-present
Description
Dear gitlab-runner team, on some of our high-traffic shared runners that use the Kubernetes executor, users are sometimes affected by jobs that fail because they hit the kubelet-side image pull rate limit (registryPullQPS), which defaults to 5.
See https://kubernetes.io/docs/reference/config-api/kubelet-config.v1beta1/#kubelet-config-k8s-io-v1beta1-KubeletConfiguration (registryPullQPS): "registryPullQPS is the limit of registry pulls per second. The value must not be a negative number. Setting it to 0 means no limit. Default: 5"
We see value in having this rate limit on the Kubernetes side and do not want to simply set it to 0, which would effectively disable it.
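To illustrate the failure mode, here is a minimal sketch of a token-bucket limiter of the kind that a per-second pull limit implies. It is not the kubelet's actual code: the type and function names are hypothetical, and the burst size of 10 is an assumption taken from the documented default of the kubelet's registryBurst setting, which is not discussed above.

```go
package main

import (
	"fmt"
	"math"
)

// tokenBucket is a hypothetical, simplified stand-in for a pull rate limiter.
type tokenBucket struct {
	qps    float64 // tokens refilled per second (cf. registryPullQPS)
	burst  float64 // maximum tokens held (assumed: registryBurst default of 10)
	tokens float64 // tokens currently available
	last   float64 // time of the previous call, in seconds
}

// newTokenBucket returns a bucket that starts full.
func newTokenBucket(qps, burst float64) *tokenBucket {
	return &tokenBucket{qps: qps, burst: burst, tokens: burst}
}

// tryAccept reports whether a pull at time now (in seconds) is allowed.
func (b *tokenBucket) tryAccept(now float64) bool {
	// Refill according to elapsed time, capped at the burst size.
	b.tokens = math.Min(b.burst, b.tokens+(now-b.last)*b.qps)
	b.last = now
	if b.tokens >= 1 {
		b.tokens--
		return true
	}
	return false
}

// countAccepted issues n pulls at the same instant and returns how many pass.
func countAccepted(b *tokenBucket, n int, now float64) int {
	accepted := 0
	for i := 0; i < n; i++ {
		if b.tryAccept(now) {
			accepted++
		}
	}
	return accepted
}

func main() {
	b := newTokenBucket(5, 10)
	// Four job pods scheduled at once, three image pulls each.
	fmt.Println(countAccepted(b, 12, 0)) // prints 10: two pulls are rejected
}
```

Under these assumptions, four job pods landing on one node at the same moment are already enough to have pulls rejected.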
We discussed possible solutions within our team and analyzed our current setup in more detail. In doing so, we noticed that by default the GitLab Runner Kubernetes executor creates the three containers of a job Pod with the image pull policy set to Always.
The Always setting is probably in place because of the security implications of using IfNotPresent, especially on shared runners, as described here: https://docs.gitlab.com/runner/security/#usage-of-private-docker-images-with-if-not-present-pull-policy
Proposal
From our point of view, the current default causes many image pulls that could be avoided. The runner helper images in particular use a properly pinned version that only changes when GitLab Runner is updated, so they could benefit from an IfNotPresent setting. This could reduce the number of image pulls per job Pod by two thirds, leaving far more headroom before the Kubernetes default pull QPS is hit. (This assumes that pull operations for already present images count towards the rate limit, which we are currently not 100% sure about.)
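The arithmetic behind this estimate can be sketched as follows. This is a rough back-of-the-envelope model, not a measurement: it assumes three containers per job Pod, that every container with the Always policy costs one rate-limited pull, and that the two runner-managed images are already cached on the node. The function names are hypothetical.

```go
package main

import "fmt"

// pullsPerPod returns the number of rate-limited pulls per job Pod, assuming
// one pull per container except for those served from the node's image cache.
func pullsPerPod(totalContainers, cachedContainers int) int {
	return totalContainers - cachedContainers
}

// maxPodStartsPerSecond returns the sustained rate of job Pod starts a single
// node can absorb before exceeding the given pull QPS. pulls must be > 0.
func maxPodStartsPerSecond(pullQPS float64, pulls int) float64 {
	return pullQPS / float64(pulls)
}

func main() {
	const pullQPS = 5.0 // kubelet default for registryPullQPS

	always := pullsPerPod(3, 0)       // current default: all three containers pull
	ifNotPresent := pullsPerPod(3, 2) // proposal: only the build image is pulled

	fmt.Printf("Always:       %d pulls/pod, %.2f pod starts/s\n",
		always, maxPodStartsPerSecond(pullQPS, always))
	fmt.Printf("IfNotPresent: %d pulls/pod, %.2f pod starts/s\n",
		ifNotPresent, maxPodStartsPerSecond(pullQPS, ifNotPresent))
}
```

The model ignores service containers and the burst allowance, so it only indicates the relative headroom gained.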
Regarding the security concerns mentioned above, we propose changing this default from Always to IfNotPresent only for "runner-managed images" such as init-permissions and build-helper. These are controlled by the runner itself and not by the pipeline definition of the executed job. We hope this avoids the security issues described in the docs while still providing the benefits described above.
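As a rough illustration of the proposed behavior, the default could be chosen per container along these lines. This is only a sketch under assumptions, not GitLab Runner's implementation: the function, the container names in the map, and the rule that an explicitly configured policy always wins are all hypothetical.

```go
package main

import "fmt"

const (
	pullAlways       = "Always"
	pullIfNotPresent = "IfNotPresent"
)

// runnerManaged lists containers whose image is chosen by the runner itself.
// The names are illustrative assumptions, not taken from the runner's source.
var runnerManaged = map[string]bool{
	"init-permissions": true,
	"helper":           true,
}

// effectivePullPolicy returns the pull policy for a container. An explicitly
// configured policy is kept as is; only the default differs per container.
func effectivePullPolicy(container, configured string) string {
	if configured != "" {
		return configured
	}
	if runnerManaged[container] {
		return pullIfNotPresent
	}
	// Images named in the pipeline definition keep today's default.
	return pullAlways
}

func main() {
	for _, c := range []string{"init-permissions", "helper", "build"} {
		fmt.Printf("%s: %s\n", c, effectivePullPolicy(c, ""))
	}
}
```

With this split, the image referenced by the job definition is still pulled on every run, so the scenario described in the security docs would not change for user-controlled images.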
We would love to hear the maintainers' thoughts on this proposal.
Links to related issues and merge requests / references
None