Image pull failed: Back-off pulling image (Kubernetes Executor)

Status update 2022-04-07

  • We are testing on an EKS cluster (1.19) in order to reliably reproduce the error messages listed in the issue description and comments.
  • So far we haven't been able to generate the error with our test configuration.

Test matrix

| Test Date  | Runner Version | EKS  | Test Result              |
|------------|----------------|------|--------------------------|
| 2022-04-07 | 14.9.0         | 1.19 | No errors pulling images |

Configurations used

gitlab-ci.yml
stages:
  - build
  - test

build-image:
  stage: build
  variables:
    DOCKER_DRIVER: overlay2
    DOCKER_HOST: tcp://dockerservice:2375/
    DOCKER_TLS_CERTDIR: ""
  image: docker:20.10.14
  services:
    - name: docker:20.10.14-dind
      alias: dockerservice
  before_script:
    - sleep 30
    - docker info
  script:
  - echo ${CI_BUILD_TOKEN} | docker login -u gitlab-ci-token --password-stdin ${CI_REGISTRY}
  - docker build --pull --no-cache -t ${CI_REGISTRY_IMAGE}:1.0.0 - << EOF

    FROM alpine:3.13.1

    RUN apk update
    RUN apk upgrade
    RUN apk add --no-cache git ca-certificates curl python3 jq apache2-utils nmap-ncat netcat-openbsd wget

    ENTRYPOINT []

    EOF
  - docker push ${CI_REGISTRY_IMAGE}:1.0.0

test:
  stage: test
  image: ${CI_REGISTRY_IMAGE}:1.0.0
  variables:
    GIT_STRATEGY: none
  script: 
    - sleep 10
  parallel: 50
config.toml
concurrent = 90
check_interval = 1
log_level = "debug"

[session_server]
  session_timeout = 1800

[[runners]]
  url = "https://gitlab.com/"
  token = "__REDACTED__"
  executor = "kubernetes"
  [runners.custom_build_dir]
  [runners.kubernetes]
    #service_account="image-pull-secret-sa"
    pull_policy="always"
    #image_pull_secrets=["regcred"]
    #image = "aaro301/alpine:latest"
    namespace_overwrite_allowed = ""
    privileged = true
    allow_privilege_escalation = true
    service_account_overwrite_allowed = ""
    pod_annotations_overwrite_allowed = ""
    terminationGracePeriodSeconds = 30
    [runners.kubernetes.affinity]
    [runners.kubernetes.volumes]
    [runners.kubernetes.dns_config]

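The commented-out image_pull_secrets=["regcred"] option above refers to a registry secret that has to exist in the runner namespace beforehand. A minimal sketch of creating it, assuming the runners namespace seen in the logs below; the server, username and password are placeholders, not the values used in our tests:

# Create the "regcred" docker-registry secret referenced by image_pull_secrets.
# Server, username and password are placeholders.
kubectl --namespace runners create secret docker-registry regcred \
  --docker-server=registry.example.net \
  --docker-username=<registry-user> \
  --docker-password=<registry-password>
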
Problem

We are currently experiencing a problem on our Kubernetes cluster: pods that need to pull their image through ImagePullSecrets sometimes fail with the error image pull failed: Back-off pulling image.

This problem appeared after we upgraded our Kubernetes cluster from 1.17 to 1.18.
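
When a job fails this way, the kubelet events behind the Back-off message can be inspected directly on the cluster. A minimal sketch, assuming the build pods live in the runners namespace and <runner-pod-name> is the failing pod:

# List build pods and show the pull-related events for a failing one.
kubectl --namespace runners get pods
kubectl --namespace runners describe pod <runner-pod-name>
kubectl --namespace runners get events --sort-by=.lastTimestamp | grep -iE 'pull|back-off'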

How to reproduce

  1. Configure a kube-executor runner and set pull_policy = "always" to maximize the chances of reproducing the issue
  2. Create a project with this gitlab-ci.yml:
stages:
  - build
  - test

build-image:
  stage: build
  variables:
    DOCKER_HOST: tcp://127.0.0.1:2375
    DOCKER_TLS_CERTDIR: ""
  image: docker:19.03.15
  services:
    - docker:19.03.15-dind
  script:
  - echo ${CI_BUILD_TOKEN} | docker login -u gitlab-ci-token --password-stdin ${CI_REGISTRY}
  - docker build --pull --no-cache -t ${CI_REGISTRY_IMAGE}:1.0.0 - << EOF

    FROM alpine:3.13.1

    RUN apk update --quiet && \
        apk upgrade --quiet && \
        apk add --quiet --no-cache git ca-certificates curl python3 jq apache2-utils nmap-ncat netcat-openbsd wget

    ENTRYPOINT []

    EOF
  - docker push ${CI_REGISTRY_IMAGE}:1.0.0

test:
  stage: test
  image: ${CI_REGISTRY_IMAGE}:1.0.0
  variables:
    GIT_STRATEGY: none
  script: 
    - sleep 10
  parallel: 30
  3. Schedule this project to run every 10 minutes until some jobs fail with Image pull failed: Back-off pulling image (a schedule can be created through the API, as sketched below)
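
One way to set up the 10-minute schedule is the pipeline schedules API; this is only a sketch, and the project ID, token and branch name are placeholders:

# Create a pipeline schedule that runs the project every 10 minutes.
# <project-id>, <private-token> and "main" are placeholders.
curl --request POST \
  --header "PRIVATE-TOKEN: <private-token>" \
  --data-urlencode "description=Reproduce image pull back-off" \
  --data-urlencode "ref=main" \
  --data-urlencode "cron=*/10 * * * *" \
  "https://gitlab.com/api/v4/projects/<project-id>/pipeline_schedules"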

Tests & Results

We tried multiple configurations on staging environments:

| Versions     | EKS 1.17.12 | EKS 1.18.9    | EKS 1.19.6    | EKS 1.21.2    | EKS 1.18.9 (Master) + EKS 1.17.12 (Nodes) |
|--------------|-------------|---------------|---------------|---------------|-------------------------------------------|
| Runner 13.3  | OK          | Random errors | Random errors | Untested      | Untested                                  |
| Runner 13.8  | OK          | Random errors | Random errors | Untested      | OK                                        |
| Runner 13.9  | OK          | Random errors | Random errors | Untested      | OK                                        |
| Runner 13.12 | OK          | Random errors | Random errors | Random errors | OK                                        |

Debug & Trace logs

Runner 13.8.0 / Gitlab 13.8.4

ERROR: Job failed (system failure): prepare environment: image pull failed: Back-off pulling image "registry.example.net/tests/issue-image:1.0.0". Check https://docs.gitlab.com/runner/shells/index.html#shell-profile-loading for more information  duration=2m54.750286918s job=22436 project=550 runner=aDykLRhX
WARNING: Failed to process runner                   builds=18 error=prepare environment: image pull failed: Back-off pulling image "registry.example.net/tests/issue-image:1.0.0". Check https://docs.gitlab.com/runner/shells/index.html#shell-profile-loading for more information executor=kubernetes runner=aDykLRhX

Kubelet 1.18.9

kubelet: E0305 10:59:52.282050    3958 reflector.go:178] object-"runners"/"runner-adyklrhx-project-550-concurrent-216pxpk": Failed to list *v1.Secret: secrets "runner-adyklrhx-project-550-concurrent-216pxpk" is forbidden: User "system:node:issue.eu-west-1.compute.internal" cannot list resource "secrets" in API group "" in the namespace "runners": no relationship found between node "issue.eu-west-1.compute.internal" and this object

kubelet: I0305 10:59:52.347492    4012 reconciler.go:224] operationExecutor.VerifyControllerAttachedVolume started for volume "default-token-lxr6w" (UniqueName: "kubernetes.io/secret/14b74129-b719-4183-bffc-530b5d9d5bb3-default-token-lxr6w") pod "runner-adyklrhx-project-550-concurrent-21q7gp7" (UID: "14b74129-b719-4183-bffc-530b5d9d5bb3")

kubelet: W0305 10:59:53.509762    4012 kubelet_pods.go:863] Unable to retrieve pull secret runners/runner-adyklrhx-project-550-concurrent-216pxpk for runners/runner-adyklrhx-project-550-concurrent-21q7gp7 due to failed to sync secret cache: timed out waiting for the condition.  The image pull may not succeed.

kubelet: E0305 10:59:55.315834    4012 pod_workers.go:191] Error syncing pod 14b74129-b719-4183-bffc-530b5d9d5bb3 ("runner-adyklrhx-project-550-concurrent-21q7gp7_runners(14b74129-b719-4183-bffc-530b5d9d5bb3)"), skipping: failed to "StartContainer" for "build" with ErrImagePull: "rpc error: code = Unknown desc = Error response from daemon: Get https://registry.example.net/v2/tests/issue-image/manifests/1.0.0: denied: access forbidden"

kubelet: E0305 10:59:56.125427    4012 pod_workers.go:191] Error syncing pod 14b74129-b719-4183-bffc-530b5d9d5bb3 ("runner-adyklrhx-project-550-concurrent-21q7gp7_runners(14b74129-b719-4183-bffc-530b5d9d5bb3)"), skipping: failed to "StartContainer" for "build" with ImagePullBackOff: "Back-off pulling image \"registry.example.net/tests/issue-image:1.0.0\""

kubelet: E0305 10:59:57.146155    4012 pod_workers.go:191] Error syncing pod 14b74129-b719-4183-bffc-530b5d9d5bb3 ("runner-adyklrhx-project-550-concurrent-21q7gp7_runners(14b74129-b719-4183-bffc-530b5d9d5bb3)"), skipping: failed to "StartContainer" for "build" with ImagePullBackOff: "Back-off pulling image \"registry.example.net/tests/issue-image:1.0.0\""

NB: on 1.17.12 we also get the first message ("no relationship found between...") but never the message "due to failed to sync secret cache: timed out waiting for the condition. The image pull may not succeed."
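
The "no relationship found" message comes from the node authorizer, which only lets a kubelet read secrets referenced by pods already bound to that node. Whether the reference and the secret are actually in place while the pod sits in ImagePullBackOff can be checked with something like the following (pod and secret names are taken from the log above and differ for every job):

# Confirm the build pod references the per-job pull secret created by the runner...
kubectl --namespace runners get pod runner-adyklrhx-project-550-concurrent-21q7gp7 \
  -o jsonpath='{.spec.imagePullSecrets}{"\n"}'
# ...and that the secret itself exists in the namespace at that moment.
kubectl --namespace runners get secret runner-adyklrhx-project-550-concurrent-216pxpk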

EKS "Master" 1.18.9

09:59:52.181921 - scheduler.go:731 - pod runners/runner-adyklrhx-project-550-concurrent-21q7gp7 is bound successfully on node "issue.eu-west-1.compute.internal", 12 nodes evaluated, 2 nodes were found feasible.

09:59:52.163165 - audit.k8s.io/v1 - create binding (@/api/v1/namespaces/runners/pods/runner-adyklrhx-project-550-concurrent-21q7gp7/binding & issue.eu-west-1.compute.internal ) - (HTTP 201)

09:59:52.201538 - node_authorizer.go:197 - NODE DENY: node "issue.eu-west-1.compute.internal" cannot get unknown secret runners/runner-adyklrhx-project-550-concurrent-216pxpk

09:59:52.201492 - audit.k8s.io/v1 - no relationship found between node "xxx" and this object (the ImagePullSecret) - (HTTP 403)

09:59:52.208891 - audit.k8s.io/v1 - get pods - (HTTP 200)

09:59:52.212011 - audit.k8s.io/v1 - patch pod status (running) - (HTTP 200)

09:59:53.522568 - audit.k8s.io/v1 - list the ImagePullSecret - (HTTP 200) 

09:59:56.133547 - audit.k8s.io/v1 - get the ImagePullSecret - (HTTP 200)

09:59:56.139091 - audit.k8s.io/v1 - patch pod stat : ErrImagePull - (HTTP 200) 

NB: on 1.17.12 we don't see any difference except that the last audit message is patch pod status: RUNNING instead of ErrImagePull
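
For reference, the 403 entries summarized above can be isolated from an exported apiserver audit log with a small jq filter. A sketch, assuming the audit.k8s.io/v1 events have been exported one JSON object per line to audit.log:

# Keep only the denied secret reads and a few fields of interest.
jq 'select(.objectRef.resource == "secrets" and .responseStatus.code == 403) | {stage, verb, user: .user.username, secret: .objectRef.name}' audit.log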

Assumption

  • We have ruled out a race condition between the creation of the pod and the creation of the secret, because we observed that pods sometimes went into ImagePullBackOff even after a 30s wait to be scheduled on our cluster

  • As we can't reproduce the problem with kubelet 1.17 nodes and a 1.18 master node, we assume the problem is linked to kubelet >= 1.18 combined with the GitLab Runner design (a possible workaround is sketched after this list)

  • Could it be a race condition between the scheduler and the node_authorizer/graph_populator? https://github.com/kubernetes/kubernetes/pull/46941
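
Since the per-job pull secret is created by the runner just before the pod, one workaround hinted at by the commented-out service_account / image_pull_secrets options in config.toml could be to rely on a long-lived secret attached to a dedicated service account instead, so the kubelet no longer races against a freshly created secret. A sketch with placeholder names, not something we have validated:

# Create a dedicated service account for build pods and attach the pre-created
# "regcred" pull secret to it (names are placeholders).
kubectl --namespace runners create serviceaccount image-pull-secret-sa
kubectl --namespace runners patch serviceaccount image-pull-secret-sa \
  -p '{"imagePullSecrets": [{"name": "regcred"}]}'

In config.toml this would correspond to uncommenting service_account="image-pull-secret-sa" and image_pull_secrets=["regcred"].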

Other related issues

  • Seems to be the same problem as #4846