Job marked as success when job terminates midway in Kubernetes

Summary

When using the Kubernetes executor, sometimes the job stops/terminates midway of running and reports the Job as "Success" without any warning to the user.

Status as of February 14, 2020:

This bug has been, and continues to be a top priority for us to resolve. In the 12.5 release cycle, we spent most of the time simply trying to reliably reproduce the bug.
In 12.6 and 12.7 we continued to work on implementing potential solutions to the problem.
Work continued in 12.8, but the merge request review for the identified fix did not complete in time for the 12.8 code freeze. So 12.9 is the new target. You can follow along in !1775 (merged)

Steps to reproduce

Environment setup:

I'm using a Linux laptop with Debian. dsniff package is installed (this provides the tcpkill command).
On this laptop the I've installed minikube, using the KVM as virtualization engine.
Minikube's VM is attached to a virtual network, connected with my local interface through virbr2 bridge.

On the same laptop I have GitLab Runner, with the following configuration:

# ~/.gitlab-runner/config.toml.kubernetes

concurrent = 3
check_interval = 5
listen_address = "127.0.0.1:6062"

[[runners]]
  name = "apollo-kubernetes"
  url = "https://gitlab.com/"
  token = "[REDACTED}"
  limit = 3
  executor = "kubernetes"

  [runners.kubernetes]
    host = "https://192.168.39.230:8443"
    cert_file = "/home/tmaczukin/.minikube/apiserver.crt"
    key_file = "/home/tmaczukin/.minikube/apiserver.key"
    ca_file = "/home/tmaczukin/.minikube/ca.crt"
    image = "alpine:3.10"
    namespace = "default"
    helper_image = "gitlab/gitlab-runner-helper:x86_64-latest"
    pull_policy = "if-not-present"

Kubernetes host value was taken from the cat ~/.kube/config file after starting minikube.

Having all of this prepared, I did the following:

In one terminal window I've started Runner's process: gitlab-runner run --config ~/.gitlab-runner/config.toml.kubernetes.
I've opened another terminal window and typed the command sudo tcpkill -i virbr2 -9 port 8443 but without executing it yet!
I've opened another terminal window and executed: watch -n 1 kubectl get pods.
I've started the job from my test project (https://gitlab.com/tmaczukin-test-projects/test-kubernetes-4119/-/jobs/337201695).
When I've noticed that $ for i in $(seq 1 60); do echo -n .; sleep 1; done is being executed, I've executed the prepared tcpkill command.

After a moment, tcpkill started reporting TCP streams that it was trying to reset, in job's log I saw the Job succeeded injected in the middle of loop execution, and in Runner output I've noticed:

Job succeeded                                       duration=17.8968597s job=337201695 project=14948016 runner=d578332a
ERROR: Error cleaning up pod: Delete https://192.168.39.230:8443/api/v1/namespaces/default/pods/runner-d578332a-project-14948016-concurrent-0mdx6t: read tcp 192.168.39.1:34442->192.168.39.230:8443: read: connection reset by peer  job=337201695 project=14948016 runner=d578332a

What's important, while the pod deletion reported an error (because tcpkill was still running and it terminated the communication), in this case K8S accepted the request and removed the pod. The watch -n 1 kubectl get pods, that was showing the existing job pod during job execution, a moment after the job was finished started showing No resources found.

For my tests I've used the 12.5.0~beta.1988.g1cc30aa5 (1cc30aa5) version of Runner, so the one that includes updated Kubernetes client library.

Actual behavior

We are using kubernetes runner to build our application. Also, we are using GCP preemptible machines. Normally, our pipeline contains approximately 12 concurrent build jobs, one for each build profile. Each job needs approximately 20 minutes to complete.

Sometimes the job is failing because the machine was preempted, which is OK because it is marked as failed and we can just restart it. But we also encountered a very strange behavior: The build job stops midway, according to our logs, then the helper proceeds to upload the artifacts, etc and the job is marked as success.

We suspect the helper cannot properly detect the exit code of builder container under some circumstances. This error does not coincide with GCS preemption events in most of the cases. But we still believe some external event kills the builder.

Expected behavior

It is expected that the job is marked as failed if the builder exits before completion.

Relevant logs and/or screenshots

There is not much in logs, because the error was not detected. The job 116236 normally needs approximately 20 minutes but it stopped after 1

Checking for jobs... received                      [0;m  job [0;m=116140 repo_url [0;m=https://***.****.******/mobile/shield.git runner [0;m=37151372
 [0;33mWARNING: Namespace is empty, therefore assuming 'default'. [0;m   [0;33mjob [0;m=116140  [0;33mproject [0;m=1  [0;33mrunner [0;m=37151372
.... Other info logs omitted ...
Job succeeded                                      [0;m  duration [0;m=15m10.746034166s job [0;m=116140 project [0;m=1 runner [0;m=37151372

Theories on what is going on

Below is a list of what people are experiencing and have somewhat consistent behavior of reproducing the issue, as you can see some of them are very similar:

As you can see all of them are related to the pod being removed and when GitLab Runner sees that it just stops the job but does not handle it as an error for the user. @redbaron1 did some awesome investigation regarding this in #3175 (comment 90286240) & #3175 (comment 90507398) which is where we are communicating to kubernetes about the status/stream STDIN/STDOUT

Possible workarounds that users said they worked.

Below are some possible workarounds that users reported that fixed the issue for them:

Environment description

This is mostly visible for Kubernetes on any kind of environment be is self-hosted or on GKE, though it has been reported for the Docker Runner as well.

Reported environments that it is failing on:

Google Cloud Kubernetes ver.1.12.5-gke.10
Self-hosted
AKS

Used GitLab Runner version

The version doesn't seem to affect anything, it has been seen failing for the following versions:

11.8.0
11.6.0
11.3.0
11.2.0

Duplicate issues

Proposal

As suggested by @redbaron1 in #4119 (comment 173794124) we have quite an outdated kubernetes SDK, which might fix the issues if we upgrade the SDK. Upgrading the SDK is something that always needs to be done so it might be a good first step.

One idea that was floating around when talking to @ayufan about this issue is that we might be checking if a build failed wrong. Right now we are checking the stream output which might not be the most reliable way to do so since the stream can be cut at any time. We used to have similar problems with the Docker executor and the way we solved this is checking the exit code of the Container instead of the stream. We should check if it's possible to do something similar on the pods, so instead of relying on the stream check the pod has some kind of exit code (we need to investigate if pods have something similar exposed)

Investigation log

#4119 (comment 190844584)

** Follow up steps:** #4119 (comment 190968921)

Reproduce the issue

#4119 (comment 237837226)

Edited Feb 14, 2020 by Darren Eastman