Kubernetes executor is unstable: some job pods stay in `ContainerCreating` status until timeout
When I used the k8s executor to speed up CI/CD for my project, I found that a job's pod would sometimes stay in ContainerCreating
status until the job timed out.
Environment:
System: CentOS Linux 7 (Kernel: 5.0.13-1.el7.elrepo.x86_64)
Docker CE Version: 18.09.6 (cgroupfs)
Kubernetes Version: 1.14.1 (cgroupfs)
Gitlab Version: 9.4.3
Gitlab-runner Version: 11.6.1
The image used by the runner's Kubernetes jobs is large, about 20GB.
1. The output of describing the pod is:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Scheduled 9m11s default-scheduler Successfully assigned gitlab-managed-apps/runner-xxx-project-1126-concurrent-5plvkj to node2
Warning FailedCreatePodSandBox 7m9s kubelet, node2 Failed create pod sandbox: rpc error: code = Unknown desc = failed to create a sandbox for pod "runner-xxxx-project-1126-concurrent-5plvkj": operation timeout: context deadline exceeded
Warning FailedCreatePodSandBox 6m16s (x4 over 6m56s) kubelet, node2 Failed create pod sandbox: rpc error: code = Unknown desc = failed to create a sandbox for pod "runner-xxx-project-1126-concurrent-5plvkj": Error response from daemon: Conflict. The container name "/k8s_POD_runner-xxx-project-1126-concurrent-5plvkj_gitlab-managed-apps_a97a2b25-8080-11e9-8a75-309c23206db5_0" is already in use by container "07565029aefe90020794d4547b2aed560a53aa37d2e8e6b1299c9a2f3a1b2528". You have to remove (or rename) that container to be able to reuse that name.
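To find the stale sandbox container on the node, the name and ID can be pulled out of the conflict message. The following is only a minimal sketch: `parse_name_conflict` and the regular expression are hypothetical helpers written for this report, not part of kubelet, Docker, or gitlab-runner, and they assume the daemon's message keeps the wording shown in the events above.

```python
import re

# Assumption: the Docker daemon's conflict message keeps the wording seen in the events above.
CONFLICT_RE = re.compile(
    r'The container name "(?P<name>[^"]+)" is already in use by container "(?P<id>[0-9a-f]+)"'
)


def parse_name_conflict(message):
    """Return (container_name, container_id) for a name-conflict message, else None."""
    match = CONFLICT_RE.search(message)
    if match is None:
        return None
    return match.group("name"), match.group("id")


# Message copied from the second FailedCreatePodSandBox event above.
event_message = (
    'Failed create pod sandbox: rpc error: code = Unknown desc = failed to create a sandbox '
    'for pod "runner-xxx-project-1126-concurrent-5plvkj": Error response from daemon: Conflict. '
    'The container name "/k8s_POD_runner-xxx-project-1126-concurrent-5plvkj_gitlab-managed-apps_'
    'a97a2b25-8080-11e9-8a75-309c23206db5_0" is already in use by container '
    '"07565029aefe90020794d4547b2aed560a53aa37d2e8e6b1299c9a2f3a1b2528". '
    'You have to remove (or rename) that container to be able to reuse that name.'
)

result = parse_name_conflict(event_message)
if result is not None:
    name, container_id = result
    print("stale sandbox container:", name)
    print("container id:", container_id)
```

On node2 the returned ID could then be inspected or removed by hand (for example with `docker rm -f <id>`), which is what the daemon's message asks for.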
2. Check the system log
First, the dockerd output:
Container <container-helper-id> failed to exit within 15 seconds of signal 15 - using the force
Container <container-builder-id> failed to exit within 15 seconds of signal 15 - using the force
Next kubelet remote_runtime.go error:
StopContainer "<container-helper-id>" from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
StopContainer "<container-builder-id>" from runtime service failed: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
Next kubelet kuberuntime_container.go error:
Container "docker://<container-helper-id>" termination failed with gracePeriod 15: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
Container "docker://<container-builder-id>" termination failed with gracePeriod 15: rpc error: code = Unknown desc = operation timeout: context deadline exceeded
Next kubelet remote_runtime.go error:
RunPodSandbox from runtime service failed: rpc error: code = Unknown desc = failed to create a sandbox for pod "runner-xxx-project-1126-concurrent-5plvkj": operation timeout: context deadline exceeded
Next kubelet kuberuntime_sandbox.go error:
CreatePodSandbox for pod "runner-xxx-project-1126-concurrent-5plvkj_gitlab-managed-apps(a97a2b25-8080-11e9-8a75-309c23206db5)" failed: rpc error: code = Unknown desc = failed to create a sandbox for pod "runner-xxx-project-1126-concurrent-5plvkj": operation timeout: context deadline exceeded
Next kubelet kuberuntime_manager.go error:
createPodSandbox for pod "runner-xxx-project-1126-concurrent-5plvkj_gitlab-managed-apps(a97a2b25-8080-11e9-8a75-309c23206db5)" failed: rpc error: code = Unknown desc = failed to create a sandbox for pod "runner-xxx-project-1126-concurrent-5plvkj": operation timeout: context deadline exceeded
Next kubelet pod_workers.go error:
Error syncing pod a97a2b25-8080-11e9-8a75-309c23206db5 ("runner-xxx-project-1126-concurrent-5plvkj_gitlab-managed-apps(a97a2b25-8080-11e9-8a75-309c23206db5)"), skipping: failed to "CreatePodSandbox" for "runner-xxx-project-1126-concurrent-5plvkj_gitlab-managed-apps(a97a2b25-8080-11e9-8a75-309c23206db5)" with CreatePodSandboxError: "CreatePodSandbox for pod \"runner-xxx-project-1126-concurrent-5plvkj_gitlab-managed-apps(a97a2b25-8080-11e9-8a75-309c23206db5)\" failed: rpc error: code = Unknown desc = failed to create a sandbox for pod \"runner-xxx-project-1126-concurrent-5plvkj\": operation timeout: context deadline exceeded"
Next, kubelet reports that it failed to remove the sandbox:
# remote_runtime.go
RemovePodSandbox "<pod-id>" from runtime service failed: rpc error: code = DeadlineExceeded desc = context deadline exceeded
# kuberuntime_gc.go
Failed to remove sandbox "<pod-id>": rpc error: code = DeadlineExceeded desc = context deadline exceeded
Then it seems the runner recreates the sandbox pod with the same name that was used by the failed pod before,
and Docker passes the name conflict error to k8s.
# kubelet remote_runtime.go error:
RunPodSandbox from runtime service failed: rpc error: code = Unknown desc = failed to create a sandbox for pod "runner-xxx-project-1126-concurrent-5plvkj": Error response from daemon: Conflict. The container name "/k8s_POD_runner-xxx-project-1126-concurrent-5plvkj_gitlab-managed-apps_a97a2b25-8080-11e9-8a75-309c23206db5" is already in use by container "<container-id>". You have to remove (or rename) that container to be able to reuse that name.
3. Analysis
It seems that the previous Docker container for the sandbox has not been deleted yet when gitlab-runner recreates the sandbox pod. But gitlab-runner has no policy to catch this error, so the pod stays in ContainerCreating
status in the k8s cluster.
Or maybe my image is too big for dockerd to delete in time, and spec.terminationGracePeriodSeconds=15(s)
(from MR-383: Kubernetes termination grace period) is not enough.
4. How to solve this problem?
Can gitlab-runner catch the "context deadline exceeded" error?
Would it help if gitlab-runner used the Kubernetes force delete option?
Or are there other solutions?
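To make the first question more concrete, here is a minimal sketch of the kind of policy I have in mind: look at the pod's events and give up early instead of waiting for the job timeout. It is written in Python only for illustration and is not gitlab-runner's actual code. `PodEvent`, `decide`, and `MAX_SANDBOX_FAILURES` are hypothetical names, the threshold of 3 is an arbitrary assumption, and treating a name conflict as non-recoverable is also my assumption.

```python
from dataclasses import dataclass


@dataclass
class PodEvent:
    reason: str   # e.g. "Scheduled" or "FailedCreatePodSandBox"
    message: str  # the event's Message column


# Hypothetical threshold: sandbox failures tolerated before giving up.
MAX_SANDBOX_FAILURES = 3


def decide(events, max_failures=MAX_SANDBOX_FAILURES):
    """Return "fail_fast" if the pod looks stuck, otherwise "wait"."""
    failures = [e for e in events if e.reason == "FailedCreatePodSandBox"]
    # Assumption: a name conflict means a stale sandbox container is still
    # present on the node, so waiting longer is unlikely to help.
    has_conflict = any(
        "is already in use by container" in e.message for e in failures
    )
    if has_conflict or len(failures) >= max_failures:
        return "fail_fast"
    return "wait"


events = [
    PodEvent("Scheduled", "Successfully assigned pod to node2"),
    PodEvent("FailedCreatePodSandBox", "operation timeout: context deadline exceeded"),
]
first_decision = decide(events)

events.append(
    PodEvent(
        "FailedCreatePodSandBox",
        'Conflict. The container name "/k8s_POD_..." is already in use by container "<container-id>".',
    )
)
second_decision = decide(events)
print(first_decision, second_decision)
```

With such a check the job could be failed (or the pod deleted and recreated under a new name) as soon as the conflict shows up, rather than staying in ContainerCreating until the timeout.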