Job executed in a Kubernetes cluster is reported as succeeded, even though it should have failed
Summary
We are executing hardware simulations in a private Kubernetes cluster. One of the jobs fails; however, GitLab reports it as succeeded. In the GitLab UI, the job output appears cut off in the middle of a line, and on that same line the typical Uploading artifacts... message follows. Some interesting facts:
- Looking at the part of the job's log that was printed, we can see that there are failures in that job; hence we know the final job result should be Failed.
- When we execute the same commands in a manually started Docker container on the same host where Kubernetes ran the job, the command finishes with exit code 2. Doing so revealed the remaining log, which shows that the hardware simulation probably got stuck and was terminated by a watchdog timer. But: that project contains tests which run much longer, so we assume it is not GitLab's maximum job timeout that kills it (in addition, we would then expect the job to terminate properly, with exit code 0).
- Checking the GitLab Runner's log (with verbosity increased to the debug level) did not reveal anything interesting.
- The same applies to the Kubernetes logs (and, in general, all system logs) on the affected node.
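For completeness, the manual check in a Docker container boils down to capturing the command's exit code explicitly. A sketch of that check (hedged: `run_simulation` is a stand-in for the real simulation command, which we ran via `docker run` on the affected node):

```shell
#!/bin/sh
# Stand-in for the real simulation command; the real one printed the
# remaining log and exited with code 2 when run manually.
run_simulation() {
  echo "simulation log line ..."
  return 2
}

run_simulation
status=$?
echo "simulation exited with code $status"
```

In Kubernetes the same commands end with the log cut off mid-line and the job marked as succeeded, so this exit code apparently never reaches the runner.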
We are not quite sure what the root cause of this issue is. It might well be in the test itself, but as that test is very similar in structure to the remaining tests, we currently suspect the issue to be in the runner or in Kubernetes. More concretely, we suspect that some procedure checks e.g. the standard output of the command and, as this job produces no output for a very long time, kills it (maybe something like what is described in #2887 (closed), but with output on stdout/stderr as the decision criterion instead of a timeout). Any hints on how to troubleshoot this would be very welcome :-)
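One low-effort experiment to test the output-silence theory (this is our assumption, not a known fix) is to keep stdout busy with a periodic heartbeat while the otherwise silent simulation runs; if the job then completes with the correct status, the theory gains weight. A sketch, with `long_running_simulation` as a placeholder for the real command:

```shell
#!/bin/sh
# Emit a heartbeat on stdout in the background so the job is never
# silent for long. 'long_running_simulation' is a stand-in for the
# real command, which produces no output for a very long time.
long_running_simulation() {
  sleep 2   # the real simulation stays silent much longer
}

while true; do
  echo "[keepalive] still running: $(date)"
  sleep 60   # heartbeat interval; tune as needed
done &
keepalive_pid=$!

long_running_simulation
status=$?

kill "$keepalive_pid" 2>/dev/null
echo "job finished with exit code $status"
```

If the job still gets cut off even with regular output, the stdout-based theory can probably be ruled out.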
Steps to reproduce
Unfortunately, we have no clue how to reproduce this. We have one job where it happens every time, but no idea how to build a "minimal" example that reproduces the behavior.
Actual behavior
The GitLab Runner treats the job as succeeded and reports it as such, even though it failed.
Expected behavior
The GitLab Runner notices that the job actually failed and reports it to the server as failed.
Relevant logs and/or screenshots
Here's a screenshot of the "cut off" log:
Environment description
We have a private GitLab EE instance running in-house. For CI testing, we have set up an in-house Kubernetes cluster consisting of one master node and two CI nodes that execute jobs. We use an internal Docker registry to store the images used for testing.
- GitLab Version: 10.5.4-ee
- Kubernetes: v1.9.2
- Host OS: CentOS 7
Used GitLab Runner version
GitLab runner version:
Version: 10.5.0
Git revision: 80b03db9
Git branch: 10-5-stable
GO version: go1.8.5
Built: 2018-02-22T09:18:33+00:00
OS/Arch: linux/amd64
Startup log lines:
Running with gitlab-runner 10.5.0 (80b03db9)
on ICD ac0d08f3
Using Kubernetes namespace: gitlab
Using Kubernetes executor with image dockerhub.commsolid.com/icd ...