Job marked as success when job terminates midway in Kubernetes
When using the Kubernetes executor, the job sometimes stops/terminates midway through running and is reported as "Success" without any warning to the user.
Steps to reproduce
This is hard to reproduce. The problem appears often on our production Kubernetes builder, but we have not succeeded in reproducing it manually.
We created a simple never-ending job running `sha256sum /dev/zero > data/checksum`. Then we tried the following scenarios:
- Kill the node running the pod.
  Result: the runner behaves as expected: it fails with `ERROR: Job failed (system failure): error dialing backend: EOF`.
- In a second kill scenario, again the `helper` detected it correctly and failed the entire job.
- Kill the `builder` docker container on the k8s node.
  Result: the `helper` detected the error state and reported it.
We are using the Kubernetes runner to build our application, on GCP preemptible machines. Normally our pipeline contains approximately 12 concurrent build jobs, one for each build profile, and each job needs approximately 20 minutes to complete.
Sometimes a job fails because the machine was preempted, which is OK because the job is marked as failed and we can just restart it.
But we also encountered a very strange behavior: according to our logs, the build job stops midway, then the `helper` proceeds to upload the artifacts etc., and the job is marked as success.
We suspect the `helper` cannot properly detect the exit code of the `builder` container under some circumstances. In most cases this error does not coincide with GCP preemption events, but we still believe some external event kills the `builder`.
It is expected that the job is marked as failed if the `builder` exits before completion.
Relevant logs and/or screenshots
There is not much in the logs, because the error was not detected. The job 116236 normally needs approximately 20 minutes, but it stopped after roughly 15 minutes:
```
Checking for jobs... received    job=116140 repo_url=https://***.****.******/mobile/shield.git runner=37151372
WARNING: Namespace is empty, therefore assuming 'default'.    job=116140 project=1 runner=37151372
.... Other info logs omitted ...
Job succeeded    duration=15m10.746034166s job=116140 project=1 runner=37151372
```
Theories on what is going on
Below is a list of what people are experiencing, together with the somewhat consistent ways they have of reproducing the issue; as you can see, some of them are very similar:
As you can see, all of them are related to the pod being removed; when GitLab Runner sees that, it just stops the job but does not report it as an error to the user. @redbaron1 did some awesome investigation regarding this in #3175 (comment 90286240) & #3175 (comment 90507398), which is where we communicate with Kubernetes about the status/stream of STDIN/STDOUT.
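For context, below is a minimal sketch (not the runner's actual code) of how a client can attach to a pod's output with client-go's `remotecommand` package; the package and function names (`podstream`, `attachAndStream`) and the parameters are illustrative assumptions only.

```go
package podstream

import (
	"os"

	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/scheme"
	restclient "k8s.io/client-go/rest"
	"k8s.io/client-go/tools/remotecommand"
)

// attachAndStream attaches to a container in a pod and copies its stdout and
// stderr to this process. It returns when the remote stream ends, which can
// happen either because the command finished or because the connection to the
// kubelet was cut.
func attachAndStream(config *restclient.Config, client *kubernetes.Clientset, namespace, pod, container string) error {
	req := client.CoreV1().RESTClient().
		Post().
		Resource("pods").
		Namespace(namespace).
		Name(pod).
		SubResource("attach").
		VersionedParams(&corev1.PodAttachOptions{
			Container: container,
			Stdout:    true,
			Stderr:    true,
		}, scheme.ParameterCodec)

	exec, err := remotecommand.NewSPDYExecutor(config, "POST", req.URL())
	if err != nil {
		return err
	}
	// Blocks until the remote side closes the stream.
	return exec.Stream(remotecommand.StreamOptions{
		Stdout: os.Stdout,
		Stderr: os.Stderr,
	})
}
```

The limitation this illustrates is that once `Stream` returns, the stream alone does not tell the caller whether the remote command actually ran to completion, which matches the behavior described above.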
Possible workarounds
Below are some workarounds that users reported fixed the issue for them:
This is mostly visible for Kubernetes in any kind of environment, be it self-hosted or on GKE, though it has been reported for the Docker runner as well.
Reported environments where it fails:
- Google Cloud Kubernetes ver.
Used GitLab Runner version
The version doesn't seem to affect anything; it has been seen failing on the following versions:
As suggested by @redbaron1 in #4119 (comment 173794124), we have quite an outdated Kubernetes SDK, and upgrading it might fix the issue. Upgrading the SDK is something that needs to happen anyway, so it might be a good first step.
One idea that came up when talking to @ayufan about this issue is that the way we check whether a build failed might be wrong. Right now we check the stream output, which might not be the most reliable way to do so, since the stream can be cut at any time. We used to have similar problems with the Docker executor, and we solved them by checking the exit code of the container instead of the stream. We should check whether something similar is possible for pods: instead of relying on the stream, check whether the pod exposes some kind of exit code (we need to investigate whether pods expose something similar).
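To make the idea concrete, here is a minimal sketch, assuming a reasonably recent client-go: Kubernetes does expose an exit code in the pod status via `ContainerStateTerminated.ExitCode`. The package and function names (`podstatus`, `buildContainerExitCode`) and the error handling are illustrative assumptions, not a proposal for the exact implementation.

```go
package podstatus

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// buildContainerExitCode looks up the named container in the pod's status and
// returns its exit code once it has terminated. Relying on this state instead
// of the log stream lets the caller fail the job when the container died,
// even if the stream ended without an error.
func buildContainerExitCode(ctx context.Context, client kubernetes.Interface, namespace, podName, container string) (int32, error) {
	pod, err := client.CoreV1().Pods(namespace).Get(ctx, podName, metav1.GetOptions{})
	if err != nil {
		// The pod is gone (node preempted, pod evicted, ...): treat as a failure.
		return 0, fmt.Errorf("pod %s/%s is no longer available: %w", namespace, podName, err)
	}
	for _, cs := range pod.Status.ContainerStatuses {
		if cs.Name != container {
			continue
		}
		if cs.State.Terminated == nil {
			return 0, fmt.Errorf("container %q has not terminated yet", container)
		}
		return cs.State.Terminated.ExitCode, nil
	}
	return 0, fmt.Errorf("container %q not found in pod status", container)
}
```

A caller could poll this (or watch the pod) once the log stream ends and fail the job on a non-zero exit code or a missing pod, rather than trusting that an ended stream means the build succeeded.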
**Follow-up steps:** #4119 (comment 190968921)