
Kubernetes executor is unstable: some job pods stay in `ContainerCreating` status until timeout

When I used the Kubernetes executor to speed up CI/CD for my project, I found that a job's pod would sometimes stay in ContainerCreating status until the job timed out.

Environment:

System: CentOS Linux 7 (Kernel: 5.0.13-1.el7.elrepo.x86_64)

Docker CE Version: 18.09.6 (cgroupfs)

Kubernetes Version: 1.14.1 (cgroupfs)

Gitlab Version: 9.4.3

Gitlab-runner Version: 11.6.1

The image used by the Kubernetes runner is large, about 20GB.

1. Events from describing the pod:

  Type     Reason                  Age                    From               Message
  ----     ------                  ----                   ----               -------
  Normal   Scheduled               9m11s                  default-scheduler  Successfully assigned gitlab-managed-apps/runner-xxx-project-1126-concurrent-5plvkj to node2
  Warning  FailedCreatePodSandBox  7m9s                   kubelet, node2     Failed create pod sandbox: rpc error: code = Unknown desc = failed to create a sandbox for pod "runner-xxxx-project-1126-concurrent-5plvkj": operation timeout: context deadline exceeded
  Warning  FailedCreatePodSandBox  6m16s (x4 over 6m56s)  kubelet, node2     Failed create pod sandbox: rpc error: code = Unknown desc = failed to create a sandbox for pod "runner-xxx-project-1126-concurrent-5plvkj": Error response from daemon: Conflict. The container name "/k8s_POD_runner-xxx-project-1126-concurrent-5plvkj_gitlab-managed-apps_a97a2b25-8080-11e9-8a75-309c23206db5_0" is already in use by container "07565029aefe90020794d4547b2aed560a53aa37d2e8e6b1299c9a2f3a1b2528". You have to remove (or rename) that container to be able to reuse that name.

2. Check the system log

First dockerd output:

Container <container-helper-id> failed to exit within 15 seconds of signal 15 - using the force
Container <container-builder-id> failed to exit within 15 seconds of signal 15 - using the force

Next kubelet remote_runtime.go error:

StopContainer "<container-helper-id>" from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
StopContainer "<container-builder-id>" from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded

Next kubelet kuberuntime_container.go error:

Container "docker://<container-helper-id>" termination failed with gracePeriod 15: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
Container "docker://<container-builder-id>" termination failed with gracePeriod 15: rpc error: code = Unknown desc = operation timeout: context deadline exceeded

Next kubelet remote_runtime.go error:

RunPodSandbox from runtime service failed: rpc error: code = Unknown desc = failed to create a sandbox for pod "runner-xxx-project-1126-concurrent-5plvkj": operation timeout: context deadline exceeded

Next kubelet kuberuntime_sandbox.go error:

CreatePodSandbox for pod "runner-xxx-project-1126-concurrent-5plvkj_gitlab-managed-apps(a97a2b25-8080-11e9-8a75-309c23206db5)" failed: rpc error: code = Unknown desc = failed to create a sandbox for pod "runner-xxx-project-1126-concurrent-5plvkj": operation timeout: context deadline exceeded

Next kubelet kuberuntime_manager.go error:

createPodSandbox for pod "runner-xxx-project-1126-concurrent-5plvkj_gitlab-managed-apps(a97a2b25-8080-11e9-8a75-309c23206db5)" failed: rpc error: code = Unknown desc = failed to create a sandbox for pod "runner-xxx-project-1126-concurrent-5plvkj": operation timeout: context deadline exceeded

Next kubelet pod_workers.go error:

Error syncing pod a97a2b25-8080-11e9-8a75-309c23206db5 ("runner-xxx-project-1126-concurrent-5plvkj_gitlab-managed-apps(a97a2b25-8080-11e9-8a75-309c23206db5)"), skipping: failed to "CreatePodSandbox" for "runner-xxx-project-1126-concurrent-5plvkj_gitlab-managed-apps(a97a2b25-8080-11e9-8a75-309c23206db5)" with CreatePodSandboxError: "CreatePodSandbox for pod \"runner-xxx-project-1126-concurrent-5plvkj_gitlab-managed-apps(a97a2b25-8080-11e9-8a75-309c23206db5)\" failed: rpc error: code = Unknown desc = failed to create a sandbox for pod \"runner-xxx-project-1126-concurrent-5plvkj\": operation timeout: context deadline exceeded"

Next, kubelet reports that it failed to remove the sandbox:

# remote_runtime.go
RemovePodSandbox "<pod-id>" from runtime service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
# kuberuntime_gc.go
Failed to remove sandbox "<pod-id>": rpc error: code = DeadlineExceeded desc = context deadline exceeded

Then it seems the runner recreates the sandbox pod with the same name that the failed pod used before, and Docker passes the name conflict error to Kubernetes.

# kubelet remote_runtime.go error:
RunPodSandbox from runtime service failed: rpc error: code = Unknown desc = failed to create a sandbox for pod "runner-xxx-project-1126-concurrent-5plvkj": Error response from daemon: Conflict. The container name "/k8s_POD_runner-xxx-project-1126-concurrent-5plvkj_gitlab-managed-apps_a97a2b25-8080-11e9-8a75-309c23206db5" is already in use by container "<container-id>". You have to remove (or rename) that container to be able to reuse that name.
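To find the stale container on the node, the conflicting container ID can be pulled out of that message. Below is a minimal Python sketch. The function name `find_conflicting_container` and the regular expression are my own illustration, not part of kubelet or gitlab-runner, and it assumes the message wording matches the log line above (with or without backslash-escaped quotes).

```python
import re

# Matches the Docker daemon's name-conflict message as quoted in the log above.
# The optional backslashes allow for quotes that the kubelet log has escaped.
CONFLICT_RE = re.compile(
    r'The container name \\?"(?P<name>[^"\\]+)\\?" '
    r'is already in use by container \\?"(?P<id>[^"\\]+)\\?"'
)


def find_conflicting_container(log_line):
    """Return (container_name, container_id) from a conflict message, or None."""
    match = CONFLICT_RE.search(log_line)
    if match is None:
        return None
    return match.group("name"), match.group("id")
```

The returned ID is the container that the Docker error says must be removed or renamed, for example with `docker rm -f <container-id>` on the affected node, assuming nothing else still needs it.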

3. Analysis

It seems that the previous Docker container for the sandbox has not been deleted when gitlab-runner recreates the sandbox pod shortly afterwards. gitlab-runner has no policy for catching this error, so the pod stays in ContainerCreating status in the k8s cluster.

Or maybe my image is too big for dockerd to delete in time, and spec.terminationGracePeriodSeconds=15 (seconds, from MR-383: Kubernetes termination grace period) is not enough.
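One way to check the grace-period theory is to list every container that dockerd had to force-kill, together with the grace period it was given. Below is a rough Python sketch, assuming the dockerd log lines look like the ones quoted in step 2. `forced_kills` is a hypothetical helper name, not an existing tool.

```python
import re

# Matches dockerd lines such as:
#   Container <id> failed to exit within 15 seconds of signal 15 - using the force
FORCE_KILL_RE = re.compile(
    r"Container (?P<id>\S+) failed to exit within "
    r"(?P<seconds>\d+) seconds of signal (?P<signal>\d+)"
)


def forced_kills(lines):
    """Return a list of (container_id, grace_seconds, signal) for forced kills."""
    results = []
    for line in lines:
        match = FORCE_KILL_RE.search(line)
        if match:
            results.append(
                (match.group("id"), int(match.group("seconds")), int(match.group("signal")))
            )
    return results
```

If most stuck jobs show up in this list with a grace period of 15 seconds, that would support the idea that the grace period is too short for these containers. It would not prove it, because the later timeouts could have another cause.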

4. How to solve this problem?

Can gitlab-runner catch the context deadline exceeded error?

Would it help if gitlab-runner used the Kubernetes force delete option?

Or are there other solutions?
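To make the questions above concrete, here is a small Python sketch of what such a policy could look like. It is not how gitlab-runner behaves today. The function names and the action labels are hypothetical, and the matching is done on the error strings quoted in this report. The `--grace-period=0 --force` flags are the standard `kubectl delete` options for a forced deletion.

```python
def classify_sandbox_error(message):
    """Map a kubelet/Docker error message to a coarse, hypothetical action."""
    # Check the name conflict first: it is the more specific failure.
    if "is already in use by container" in message:
        return "remove-stale-container"
    if "context deadline exceeded" in message:
        return "retry-later"
    return "unknown"


def force_delete_command(pod, namespace):
    """Build the kubectl argv for a forced pod deletion (not executed here)."""
    return [
        "kubectl", "delete", "pod", pod,
        "--namespace", namespace,
        "--grace-period=0", "--force",
    ]
```

A forced deletion removes the pod object from the API server without waiting for the kubelet to confirm that the containers are gone. It may therefore not remove the stale Docker container on the node by itself.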
