
Resource Limitation Issue with Runner installed on Kubernetes probably due to PLEG/NodeNotReady problems

Overview

We have a number of customers encountering issues while using the GitLab Runner Kubernetes executor on different cloud providers (EKS & AKS have been reported).

One case shows jobs intermittently failing after a few (or sometimes just one) successful jobs; the output of the failing jobs usually shows the line below before timing out:

Waiting for pod default/runner-8yeawpmn-project-529-concurrent-02d7lz to be running, status is Pending

Investigation of the underlying pods, kubelet logs, and other sources points to a problem that is triggered at some point under load. See the logs below:

Kubernetes events:

NAMESPACE               LAST SEEN   TYPE     REASON      KIND   MESSAGE
gitlab-runner-support   48s         Normal   Scheduled   Pod    Successfully assigned gitlab-runner-support/runner-kw4tq1u-project-1404-concurrent-08rsfk to aks-agentpool-25657597-0
gitlab-runner-support   47s         Normal   Pulled      Pod    Container image "node:latest" already present on machine
gitlab-runner-support   47s         Normal   Created     Pod    Created container
gitlab-runner-support   47s         Normal   Started     Pod    Started container
gitlab-runner-support   47s         Normal   Pulled      Pod    Container image "gitlab/gitlab-runner-helper:x86_64-de7731dd" already present on machine
gitlab-runner-support   46s         Normal   Created     Pod    Created container
gitlab-runner-support   46s         Normal   Started     Pod    Started container
NAMESPACE               LAST SEEN   TYPE     REASON      OBJECT                                              MESSAGE
gitlab-runner-support   0s          Normal   Scheduled   pod/runner-kw4tq1u-project-1404-concurrent-05pd84   Successfully assigned gitlab-runner-support/runner-kw4tq1u-project-1404-concurrent-05pd84 to aks-agentpool-25657597-0
gitlab-runner-support   0s          Normal   Scheduled   pod/runner-kw4tq1u-project-1404-concurrent-188xgc   Successfully assigned gitlab-runner-support/runner-kw4tq1u-project-1404-concurrent-188xgc to aks-agentpool-25657597-0
gitlab-runner-support   0s          Normal   Scheduled   pod/runner-kw4tq1u-project-1404-concurrent-2948kg   Successfully assigned gitlab-runner-support/runner-kw4tq1u-project-1404-concurrent-2948kg to aks-agentpool-25657597-0
gitlab-runner-support   0s          Normal   Scheduled   pod/runner-kw4tq1u-project-1404-concurrent-3zr5r7   Successfully assigned gitlab-runner-support/runner-kw4tq1u-project-1404-concurrent-3zr5r7 to aks-agentpool-25657597-0
gitlab-runner-support   0s          Normal   Scheduled   pod/runner-kw4tq1u-project-1404-concurrent-44lkh6   Successfully assigned gitlab-runner-support/runner-kw4tq1u-project-1404-concurrent-44lkh6 to aks-agentpool-25657597-0
gitlab-runner-support   0s          Normal   Scheduled   pod/runner-kw4tq1u-project-1404-concurrent-546d2g   Successfully assigned gitlab-runner-support/runner-kw4tq1u-project-1404-concurrent-546d2g to aks-agentpool-25657597-0
gitlab-runner-support   0s          Normal   Killing     pod/runner-kw4tq1u-project-1404-concurrent-08rsfk   Killing container with id docker://build:Need to kill Pod
gitlab-runner-support   0s          Normal   Killing     pod/runner-kw4tq1u-project-1404-concurrent-08rsfk   Killing container with id docker://build:Need to kill Pod
gitlab-runner-support   0s          Normal   Killing     pod/runner-kw4tq1u-project-1404-concurrent-08rsfk   Killing container with id docker://helper:Need to kill Pod
gitlab-runner-support   0s          Normal   Killing     pod/runner-kw4tq1u-project-1404-concurrent-08rsfk   Killing container with id docker://helper:Need to kill Pod
gitlab-runner-support   0s          Normal   Pulled      pod/runner-kw4tq1u-project-1404-concurrent-05pd84   Container image "node:latest" already present on machine
gitlab-runner-support   0s          Warning   FailedKillPod   pod/runner-kw4tq1u-project-1404-concurrent-08rsfk   error killing pod: failed to "KillPodSandbox" for "97a7ed0f-c4cd-11e9-9f63-e27977b905b8" with KillPodSandboxError: "rpc error: code = DeadlineExceeded desc = context deadline exceeded"
gitlab-runner-support   0s          Warning   FailedCreatePodSandBox   pod/runner-kw4tq1u-project-1404-concurrent-188xgc   Failed create pod sandbox: rpc error: code = Unknown desc = failed to start sandbox container for pod "runner-kw4tq1u-project-1404-concurrent-188xgc": operation timeout: context deadline exceeded
gitlab-runner-support   0s          Warning   FailedCreatePodSandBox   pod/runner-kw4tq1u-project-1404-concurrent-2948kg   Failed create pod sandbox: rpc error: code = Unknown desc = failed to start sandbox container for pod "runner-kw4tq1u-project-1404-concurrent-2948kg": operation timeout: context deadline exceeded
gitlab-runner-support   0s          Warning   FailedCreatePodSandBox   pod/runner-kw4tq1u-project-1404-concurrent-3zr5r7   Failed create pod sandbox: rpc error: code = Unknown desc = failed to start sandbox container for pod "runner-kw4tq1u-project-1404-concurrent-3zr5r7": operation timeout: context deadline exceeded
gitlab-runner-support   0s          Warning   FailedCreatePodSandBox   pod/runner-kw4tq1u-project-1404-concurrent-44lkh6   Failed create pod sandbox: rpc error: code = Unknown desc = failed to start sandbox container for pod "runner-kw4tq1u-project-1404-concurrent-44lkh6": operation timeout: context deadline exceeded
gitlab-runner-support   0s          Warning   FailedCreatePodSandBox   pod/runner-kw4tq1u-project-1404-concurrent-546d2g   Failed create pod sandbox: rpc error: code = Unknown desc = failed to start sandbox container for pod "runner-kw4tq1u-project-1404-concurrent-546d2g": operation timeout: context deadline exceeded
default                 1s          Normal    NodeNotReady             node/aks-agentpool-25657597-0                       Node aks-agentpool-25657597-0 status is now: NodeNotReady

Kubelet logs from the affected node:

Aug 22 11:35:14 aks-agentpool-25657597-0 kubelet[1975]: E0822 11:34:46.286521    1975 remote_runtime.go:282] ContainerStatus "c23114be3edcad3def36bb596e44379d417f10d1b0dc2b84c3af6aa870ccdaf1" from runtime service failed: rpc error: code = Unknown desc = Error: No such container: c23114be3edcad3def36bb596e44379d417f10d1b0dc2b84c3af6aa870ccdaf1
Aug 22 11:35:14 aks-agentpool-25657597-0 kubelet[1975]: E0822 11:34:46.286539    1975 kuberuntime_container.go:397] ContainerStatus for c23114be3edcad3def36bb596e44379d417f10d1b0dc2b84c3af6aa870ccdaf1 error: rpc error: code = Unknown desc = Error: No such container: c23114be3edcad3def36bb596e44379d417f10d1b0dc2b84c3af6aa870ccdaf1
Aug 22 11:35:14 aks-agentpool-25657597-0 kubelet[1975]: E0822 11:34:46.286545    1975 kuberuntime_manager.go:875] getPodContainerStatuses for pod "runner-kw4tq1u-project-1404-concurrent-2r4spr_gitlab-runner-support(2d9c3321-c4d0-11e9-9f63-e27977b905b8)" failed: rpc error: code = Unknown desc = Error: No such container: c23114be3edcad3def36bb596e44379d417f10d1b0dc2b84c3af6aa870ccdaf1
Aug 22 11:35:14 aks-agentpool-25657597-0 kubelet[1975]: E0822 11:34:46.286555    1975 generic.go:247] PLEG: Ignoring events for pod runner-kw4tq1u-project-1404-concurrent-2r4spr/gitlab-runner-support: rpc error: code = Unknown desc = Error: No such container: c23114be3edcad3def36bb596e44379d417f10d1b0dc2b84c3af6aa870ccdaf1
Aug 22 11:35:14 aks-agentpool-25657597-0 kubelet[1975]: E0822 11:34:46.290505    1975 remote_runtime.go:282] ContainerStatus "c23114be3edcad3def36bb596e44379d417f10d1b0dc2b84c3af6aa870ccdaf1" from runtime service failed: rpc error: code = Unknown desc = Error: No such container: c23114be3edcad3def36bb596e44379d417f10d1b0dc2b84c3af6aa870ccdaf1
Aug 22 11:35:14 aks-agentpool-25657597-0 kubelet[1975]: E0822 11:34:46.290522    1975 kuberuntime_container.go:397] ContainerStatus for c23114be3edcad3def36bb596e44379d417f10d1b0dc2b84c3af6aa870ccdaf1 error: rpc error: code = Unknown desc = Error: No such container: c23114be3edcad3def36bb596e44379d417f10d1b0dc2b84c3af6aa870ccdaf1
Aug 22 11:35:14 aks-agentpool-25657597-0 kubelet[1975]: E0822 11:34:46.290531    1975 kuberuntime_manager.go:875] getPodContainerStatuses for pod "runner-kw4tq1u-project-1404-concurrent-2r4spr_gitlab-runner-support(2d9c3321-c4d0-11e9-9f63-e27977b905b8)" failed: rpc error: code = Unknown desc = Error: No such container: c23114be3edcad3def36bb596e44379d417f10d1b0dc2b84c3af6aa870ccdaf1
Aug 22 11:35:14 aks-agentpool-25657597-0 kubelet[1975]: E0822 11:34:46.290543    1975 generic.go:277] PLEG: pod runner-kw4tq1u-project-1404-concurrent-2r4spr/gitlab-runner-support failed reinspection: rpc error: code = Unknown desc = Error: No such container: c23114be3edcad3def36bb596e44379d417f10d1b0dc2b84c3af6aa870ccdaf1
Aug 22 11:35:14 aks-agentpool-25657597-0 kubelet[1975]: E0822 11:34:46.294577    1975 remote_runtime.go:282] ContainerStatus "ffd0dd52b276d79eca4a45244da644f47da10a465e53ba2dfd47efc9b5630e5d" from runtime service failed: rpc error: code = Unknown desc = Error: No such container: ffd0dd52b276d79eca4a45244da644f47da10a465e53ba2dfd47efc9b5630e5d
Aug 22 11:35:14 aks-agentpool-25657597-0 kubelet[1975]: E0822 11:34:46.294604    1975 kuberuntime_container.go:397] ContainerStatus for ffd0dd52b276d79eca4a45244da644f47da10a465e53ba2dfd47efc9b5630e5d error: rpc error: code = Unknown desc = Error: No such container: ffd0dd52b276d79eca4a45244da644f47da10a465e53ba2dfd47efc9b5630e5d
Aug 22 11:35:14 aks-agentpool-25657597-0 kubelet[1975]: E0822 11:34:46.294610    1975 kuberuntime_manager.go:875] getPodContainerStatuses for pod "runner-kw4tq1u-project-1404-concurrent-1k8d22_gitlab-runner-support(2d799e11-c4d0-11e9-9f63-e27977b905b8)" failed: rpc error: code = Unknown desc = Error: No such container: ffd0dd52b276d79eca4a45244da644f47da10a465e53ba2dfd47efc9b5630e5d
Aug 22 11:35:14 aks-agentpool-25657597-0 kubelet[1975]: E0822 11:34:46.294621    1975 generic.go:277] PLEG: pod runner-kw4tq1u-project-1404-concurrent-1k8d22/gitlab-runner-support failed reinspection: rpc error: code = Unknown desc = Error: No such container: ffd0dd52b276d79eca4a45244da644f47da10a465e53ba2dfd47efc9b5630e5d
Aug 22 11:35:14 aks-agentpool-25657597-0 kubelet[1975]: E0822 11:34:47.305651    1975 remote_runtime.go:282] ContainerStatus "ffd0dd52b276d79eca4a45244da644f47da10a465e53ba2dfd47efc9b5630e5d" from runtime service failed: rpc error: code = Unknown desc = Error: No such container: ffd0dd52b276d79eca4a45244da644f47da10a465e53ba2dfd47efc9b5630e5d
Aug 22 11:35:14 aks-agentpool-25657597-0 kubelet[1975]: E0822 11:34:47.305680    1975 kuberuntime_container.go:397] ContainerStatus for ffd0dd52b276d79eca4a45244da644f47da10a465e53ba2dfd47efc9b5630e5d error: rpc error: code = Unknown desc = Error: No such container: ffd0dd52b276d79eca4a45244da644f47da10a465e53ba2dfd47efc9b5630e5d
Aug 22 11:35:14 aks-agentpool-25657597-0 kubelet[1975]: E0822 11:34:47.305692    1975 kuberuntime_manager.go:875] getPodContainerStatuses for pod "runner-kw4tq1u-project-1404-concurrent-1k8d22_gitlab-runner-support(2d799e11-c4d0-11e9-9f63-e27977b905b8)" failed: rpc error: code = Unknown desc = Error: No such container: ffd0dd52b276d79eca4a45244da644f47da10a465e53ba2dfd47efc9b5630e5d
Aug 22 11:35:14 aks-agentpool-25657597-0 kubelet[1975]: E0822 11:34:47.305703    1975 generic.go:247] PLEG: Ignoring events for pod runner-kw4tq1u-project-1404-concurrent-1k8d22/gitlab-runner-support: rpc error: code = Unknown desc = Error: No such container: ffd0dd52b276d79eca4a45244da644f47da10a465e53ba2dfd47efc9b5630e5d
Aug 22 11:35:14 aks-agentpool-25657597-0 kubelet[1975]: E0822 11:34:47.309854    1975 remote_runtime.go:282] ContainerStatus "c23114be3edcad3def36bb596e44379d417f10d1b0dc2b84c3af6aa870ccdaf1" from runtime service failed: rpc error: code = Unknown desc = Error: No such container: c23114be3edcad3def36bb596e44379d417f10d1b0dc2b84c3af6aa870ccdaf1
Aug 22 11:35:14 aks-agentpool-25657597-0 kubelet[1975]: E0822 11:34:47.309872    1975 kuberuntime_container.go:397] ContainerStatus for c23114be3edcad3def36bb596e44379d417f10d1b0dc2b84c3af6aa870ccdaf1 error: rpc error: code = Unknown desc = Error: No such container: c23114be3edcad3def36bb596e44379d417f10d1b0dc2b84c3af6aa870ccdaf1
Aug 22 11:35:14 aks-agentpool-25657597-0 kubelet[1975]: E0822 11:34:47.309877    1975 kuberuntime_manager.go:875] getPodContainerStatuses for pod "runner-kw4tq1u-project-1404-concurrent-2r4spr_gitlab-runner-support(2d9c3321-c4d0-11e9-9f63-e27977b905b8)" failed: rpc error: code = Unknown desc = Error: No such container: c23114be3edcad3def36bb596e44379d417f10d1b0dc2b84c3af6aa870ccdaf1
Aug 22 11:35:14 aks-agentpool-25657597-0 kubelet[1975]: E0822 11:34:47.309890    1975 generic.go:247] PLEG: Ignoring events for pod runner-kw4tq1u-project-1404-concurrent-2r4spr/gitlab-runner-support: rpc error: code = Unknown desc = Error: No such container: c23114be3edcad3def36bb596e44379d417f10d1b0dc2b84c3af6aa870ccdaf1
Aug 22 11:35:14 aks-agentpool-25657597-0 kubelet[1975]: E0822 11:34:47.313962    1975 remote_runtime.go:282] ContainerStatus "c23114be3edcad3def36bb596e44379d417f10d1b0dc2b84c3af6aa870ccdaf1" from runtime service failed: rpc error: code = Unknown desc = Error: No such container: c23114be3edcad3def36bb596e44379d417f10d1b0dc2b84c3af6aa870ccdaf1
Aug 22 11:35:14 aks-agentpool-25657597-0 kubelet[1975]: E0822 11:34:47.313982    1975 kuberuntime_container.go:397] ContainerStatus for c23114be3edcad3def36bb596e44379d417f10d1b0dc2b84c3af6aa870ccdaf1 error: rpc error: code = Unknown desc = Error: No such container: c23114be3edcad3def36bb596e44379d417f10d1b0dc2b84c3af6aa870ccdaf1
Aug 22 11:35:14 aks-agentpool-25657597-0 kubelet[1975]: E0822 11:34:47.313988    1975 kuberuntime_manager.go:875] getPodContainerStatuses for pod "runner-kw4tq1u-project-1404-concurrent-2r4spr_gitlab-runner-support(2d9c3321-c4d0-11e9-9f63-e27977b905b8)" failed: rpc error: code = Unknown desc = Error: No such container: c23114be3edcad3def36bb596e44379d417f10d1b0dc2b84c3af6aa870ccdaf1

It is also worth noting that, while on a call with one of the customers, we ran a pipeline where the job in the first stage succeeded, all the jobs in the second stage failed, and the node went NodeNotReady before recovering once all the pods were terminated. See the description of one of the job pods below:

Name:         runner-kw4tq1u-project-1404-concurrent-2r4spr
Namespace:    gitlab-runner-support
Priority:     0
Node:         aks-agentpool-25657597-0/10.115.21.66
Start Time:   Thu, 22 Aug 2019 13:30:10 +0200
Labels:       pod=runner-kw4tq1u-project-1404-concurrent-2
Annotations:  <none>
Status:       Pending
IP:           
Containers:
  build:
    Container ID:  
    Image:         node:latest
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
      if [ -x /usr/local/bin/bash ]; then
        exec /usr/local/bin/bash 
      elif [ -x /usr/bin/bash ]; then
        exec /usr/bin/bash 
      elif [ -x /bin/bash ]; then
        exec /bin/bash 
      elif [ -x /usr/local/bin/sh ]; then
        exec /usr/local/bin/sh 
      elif [ -x /usr/bin/sh ]; then
        exec /usr/bin/sh 
      elif [ -x /bin/sh ]; then
        exec /bin/sh 
      elif [ -x /busybox/sh ]; then
        exec /busybox/sh 
      else
        echo shell not found
        exit 1
      fi
      
      
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:
      FF_CMD_DISABLE_DELAYED_ERROR_LEVEL_EXPANSION:  false
      FF_USE_LEGACY_BUILDS_DIR_FOR_DOCKER:           false
      FF_USE_LEGACY_VOLUMES_MOUNTING_ORDER:          false
      DOCKER_HOST:                                   tcp://localhost:2375
      DOCKER_TLS_CERTDIR:                            
      CI_BUILDS_DIR:                                 /builds
      CI_PROJECT_DIR:                                /builds/application-development-platform/software-innovation-lab-frontend
      CI_CONCURRENT_ID:                              2
      CI_CONCURRENT_PROJECT_ID:                      2
      CI_SERVER:                                     yes

...

      CI_RUNNER_VERSION:                             12.1.0
      CI_RUNNER_REVISION:                            de7731dd
      CI_RUNNER_EXECUTABLE_ARCH:                     linux/amd64
    Mounts:
      /builds from repo (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-fx949 (ro)
  helper:
    Container ID:  
    Image:         gitlab/gitlab-runner-helper:x86_64-de7731dd
    Image ID:      
    Port:          <none>
    Host Port:     <none>
    Command:
      sh
      -c
      if [ -x /usr/local/bin/bash ]; then
        exec /usr/local/bin/bash 
      elif [ -x /usr/bin/bash ]; then
        exec /usr/bin/bash 
      elif [ -x /bin/bash ]; then
        exec /bin/bash 
      elif [ -x /usr/local/bin/sh ]; then
        exec /usr/local/bin/sh 
      elif [ -x /usr/bin/sh ]; then
        exec /usr/bin/sh 
      elif [ -x /bin/sh ]; then
        exec /bin/sh 
      elif [ -x /busybox/sh ]; then
        exec /busybox/sh 
      else
        echo shell not found
        exit 1
      fi
      
      
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:
      FF_CMD_DISABLE_DELAYED_ERROR_LEVEL_EXPANSION:  false
      FF_USE_LEGACY_BUILDS_DIR_FOR_DOCKER:           false
      FF_USE_LEGACY_VOLUMES_MOUNTING_ORDER:          false
      DOCKER_HOST:                                   tcp://localhost:2375
      DOCKER_TLS_CERTDIR:                            
      CI_BUILDS_DIR:                                 /builds
      CI_PROJECT_DIR:                                /builds/application-development-platform/software-innovation-lab-frontend
      CI_CONCURRENT_ID:                              2
      CI_CONCURRENT_PROJECT_ID:                      2
      CI_SERVER:                                     yes

      ....

    Mounts:
      /builds from repo (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from default-token-fx949 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  repo:
    Type:       EmptyDir (a temporary directory that shares a pod's lifetime)
    Medium:     
    SizeLimit:  <unset>
  default-token-fx949:
    Type:        Secret (a volume populated by a Secret)
    SecretName:  default-token-fx949
    Optional:    false
QoS Class:       BestEffort
Node-Selectors:  <none>
Tolerations:     node.kubernetes.io/not-ready:NoExecute for 300s
                 node.kubernetes.io/unreachable:NoExecute for 300s
Events:
  Type     Reason      Age               From                               Message
  ----     ------      ----              ----                               -------
  Normal   Scheduled   5m21s             default-scheduler                  Successfully assigned gitlab-runner-support/runner-kw4tq1u-project-1404-concurrent-2r4spr to aks-agentpool-25657597-0
  Normal   Pulled      5m9s              kubelet, aks-agentpool-25657597-0  Container image "node:latest" already present on machine
  Warning  Failed      3m9s              kubelet, aks-agentpool-25657597-0  Error: context deadline exceeded
  Normal   Pulled      3m9s              kubelet, aks-agentpool-25657597-0  Container image "gitlab/gitlab-runner-helper:x86_64-de7731dd" already present on machine
  Warning  Failed      45s               kubelet, aks-agentpool-25657597-0  Error: context deadline exceeded
  Warning  FailedSync  7s (x3 over 44s)  kubelet, aks-agentpool-25657597-0  error determining status: rpc error: code = Unknown desc = Error: No such container: c23114be3edcad3def36bb596e44379d417f10d1b0dc2b84c3af6aa870ccdaf1

On seeing PLEG in the logs, @WarheadsSE came across https://github.com/kubernetes/kubernetes/issues/45419#issuecomment-525669603, which suggests the handling of PLEG issues is fixed in Kubernetes 1.16. However, the main concern is how to effectively manage the runner's resource usage so that these issues are not triggered in the first place.

@WarheadsSE suggests configuring resource requests and limits when deploying with the Helm chart, but we also have customers who deploy manually or use the GitLab Kubernetes Integration to deploy runners.

Customer Tickets (Internal):

There are more logs and info in the tickets specific to each customer that I can't share here.

Root cause

The issue here is that GitLab Runner tells Kubernetes to schedule a new Pod and waits for that Pod to be running, but since the Kubernetes cluster is saturated on resources the Pod never starts. GitLab Runner waits 3 minutes by default for the pod to become available and then fails the job.

Workaround/Prevention

When something like this happens, there are a few things you can do to make the cluster and jobs more resilient.

Set limits

GitLab Runner can set specific limits on the containers it creates, using the settings listed below (see the sketch after the list). You might be hesitant to add limits, but they help with the stability of the cluster and of GitLab Runner, because they prevent a single job from consuming, say, 80% of the CPU on a node due to a mistake in the commit you are testing. Even if a job takes a few seconds or minutes longer because it is capped at a specific CPU level, the cap leaves space in the cluster to run more jobs concurrently.

  • cpu_limit: The CPU allocation given to build containers
  • memory_limit: The amount of memory allocated to build containers
  • service_cpu_limit: The CPU allocation given to build service containers
  • service_memory_limit: The amount of memory allocated to build service containers
  • helper_cpu_limit: The CPU allocation given to build helper containers
  • helper_memory_limit: The amount of memory allocated to build helper containers
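
As a rough illustration, a config.toml sketch with these settings might look like the following; the values are purely illustrative and need to be tuned for your own jobs and node sizes:

[[runners]]
  executor = "kubernetes"
  [runners.kubernetes]
    # Illustrative caps for the build container created per job
    cpu_limit = "500m"
    memory_limit = "1Gi"
    # Caps for any service containers the job starts
    service_cpu_limit = "500m"
    service_memory_limit = "1Gi"
    # Caps for the runner helper container
    helper_cpu_limit = "250m"
    helper_memory_limit = "256Mi"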

There is no magic value for the limits; it depends on a lot of factors, for example: what scripts the job runs, how big the repository is, and how heavy the services are (are you starting a small web server, or a large database with a huge amount of data?). These are all questions you should ask while setting them up.

There are also Kubernetes-level limits you can set. If you are sharing the cluster with other applications that are not GitLab Runner, it might be worth investigating creating dedicated namespaces and setting appropriate resource quotas and limits on them.
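
On the runner side, you can at least make sure job pods land in a dedicated namespace, so that any quotas or limit ranges you define on that namespace apply only to CI workloads. A minimal sketch, assuming a namespace named gitlab-runner already exists with the desired ResourceQuota/LimitRange applied to it:

[[runners]]
  executor = "kubernetes"
  [runners.kubernetes]
    # Run all job pods in a dedicated namespace; ResourceQuota/LimitRange
    # objects created on this namespace (outside of this file) then bound
    # what CI jobs can consume in total.
    namespace = "gitlab-runner"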

You can also consider reducing the number of concurrent jobs GitLab Runner can run (the global concurrent setting), or capping an individual runner with the per-runner limit setting, as sketched below.
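
Both are plain config.toml settings; a minimal sketch with made-up numbers:

# Global cap on how many jobs this runner process runs at once,
# across all [[runners]] entries in the file
concurrent = 4

[[runners]]
  executor = "kubernetes"
  # Per-runner cap (0 means no limit)
  limit = 2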

Configure your Kubernetes cluster to autoscale

Most managed Kubernetes services provide autoscaling to add nodes when there is resource saturation; you should look into enabling this.

Increase poll_timeout, poll_interval

  • GitLab Runner provides poll_timeout, which is the amount of time, in seconds, that needs to pass before the runner times out attempting to connect to the container it has just created. This is useful when queueing more builds than the cluster can handle at a time (default = 180). You can try bumping this up to 10 minutes or even longer (see the sketch after this list).
  • GitLab Runner provides poll_interval, which defines how frequently, in seconds, the runner polls the Kubernetes pod it has just created to check its status (default = 3). Polling every 10 seconds instead can relieve some pressure on the Kubernetes API, if API load is something you are seeing consume resources in your cluster.
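
For example, a config.toml sketch that waits up to 10 minutes for a pod and polls every 10 seconds (the values are only a starting point):

[[runners]]
  executor = "kubernetes"
  [runners.kubernetes]
    # Wait up to 600 seconds (10 minutes) instead of the default 180
    # before giving up on a pod that is still Pending
    poll_timeout = 600
    # Check the pod's status every 10 seconds instead of every 3
    poll_interval = 10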

Use Kubernetes scheduling policies

GitLab Runner supports node selectors as well as taints and tolerations, which can help you schedule the right jobs on the right nodes for more efficient use of your cluster.
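
A sketch of what that can look like in config.toml, assuming a hypothetical node label ci-workload=true and a hypothetical taint dedicated=ci:NoSchedule reserved for CI nodes (node_tolerations requires a reasonably recent runner version):

[[runners]]
  executor = "kubernetes"
  [runners.kubernetes]
    [runners.kubernetes.node_selector]
      # Only schedule job pods on nodes carrying this (hypothetical) label
      "ci-workload" = "true"
    [runners.kubernetes.node_tolerations]
      # Allow job pods onto nodes tainted for CI-only use
      "dedicated=ci" = "NoSchedule"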

Action Items

  • Create a new section ## Scale Kubernetes in the Kubernetes executor documentation explaining the prevention/workarounds above
  • Change the default poll_timeout to a larger value