# Job timeouts still not working
## Problem

CI/CD job timeouts set by the user are propagated to the Runner; the Runner cancels the job when the timeout threshold is exceeded and then sends the job status to Rails. In some scenarios the Rails app does not appear to handle the updated job status. As a result, users report that CI job timeouts are not working.

## Note to everyone noticing related behaviour

I am starting to research the problem and trying to identify the flow, the potential failure points, and their intersection. In that endeavour, as already [stated](https://gitlab.com/gitlab-org/gitlab/-/issues/457222#note_2569173100), the hard part is figuring out where to strategically place more monitoring points.

1. Please follow the same template as in https://gitlab.com/gitlab-org/gitlab/-/issues/457222#note_2665264202 and provide as much info as possible.
2. Tag `panoskanell` in the comment so that I don't miss it.
3. **Logs should be \< 1 month old** for ~"GitLab.com" occurrences.

## Actual behavior

I have 5 jobs in a stage and a runner concurrency of 4, and this is the only runner added to the project repo. All 3 jobs that neither finished nor timed out have their timeouts set in CI. The script execution for all 3 jobs did start, but then they hung(!). I then manually canceled the jobs from the UI, retried them, and everything went smoothly as expected.

## Expected behavior

Jobs should not hang and should honor the timeout!

## Relevant screenshots

<details>
<summary>jobs' screenshots</summary>

![Screenshot_2020-07-17_at_02.34.58](/uploads/7763c1f960224443b416f68571d65e2a/Screenshot_2020-07-17_at_02.34.58.png)

![Screenshot_2020-07-17_at_02.32.51](/uploads/97dd76e0fd891102819a7313c7a0170e/Screenshot_2020-07-17_at_02.32.51.png)

![Screenshot_2020-07-17_at_02.30.09](/uploads/2cbde5cd0ff4fb2576936e0d4e0d98f2/Screenshot_2020-07-17_at_02.30.09.png)

</details>

## Environment description

Using the shell executor (on AWS Lightsail with the latest stable gitlab-runner) with gitlab.com.
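For reference, the per-job timeout in question is set with the `timeout` keyword in `.gitlab-ci.yml` (or via the project/runner settings). A minimal sketch, with job name, script, and the 30-minute value chosen only for illustration:

```yaml
build-job:
  # Hypothetical job: the Runner should cancel it after 30 minutes,
  # and Rails should then record the job as failed with a timeout.
  timeout: 30 minutes
  script:
    - ./long-running-task.sh
```

If the bug described above is hit, the Runner-side cancellation happens but the job's status in the UI never reflects it.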
<details>
<summary>config.toml contents</summary>

```toml
concurrent = 4
check_interval = 0

[session_server]
  session_timeout = 7200
  listen_address = "0.0.0.0:8093"
  advertise_address = "server-random.example.com:8093"

[[runners]]
  name = "runner for server-random.example.com"
  url = "https://gitlab.com"
  token = "2ZBY4-token-for-server-random.example.com-Dxz"
  executor = "shell"
  [runners.custom_build_dir]
  [runners.cache]
    [runners.cache.s3]
    [runners.cache.gcs]
```

</details>

### Used GitLab Runner version

```
Version:      13.4.1
Git revision: e95f89a0
Git branch:   13-4-stable
GO version:   go1.13.8
Built:        2020-09-25T20:03:43+0000
OS/Arch:      linux/amd64
```

I also faced the same problem once with v13.1.1 and then twice with 13.4.1.

## Weird behaviour

I have faced this issue three times, and all 3 times it occurred only with those 3 jobs. According to the job logs, each time the jobs hung during a different command's execution. Sorry, I can't share the CI code publicly, but @steveazz I can give you access to the repo (with the runner still set up on the server), as the 3rd occurrence was only a few hours ago. I am going to keep the server running for a few days.

In all 3 jobs that hung, `yum` is called at some point, and if a `yum` process is already running then the newer process logs

```
Existing lock /var/run/yum.pid: another copy is running as pid 26118.
Another app is currently holding the yum lock; waiting for it to exit...
  The other application is: yum
    Memory :  98 M RSS (376 MB VSZ)
    Started: Fri Oct  2 15:39:47 2020 - 00:11 ago
    State  : Running, pid: 26118
```

and waits for the other process to finish...
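As a defensive workaround until the root cause is found, a potentially hanging command like the `yum` call above can be bounded with coreutils `timeout`, so the step fails fast with a clear exit code instead of waiting indefinitely on the yum lock. A minimal sketch; the wrapper name, the 300-second limit, and the package are placeholders, not part of the original CI code:

```shell
#!/bin/sh
# Wrap a potentially-hanging command with a hard wall-clock limit.
# coreutils `timeout` kills the command and exits with status 124
# when the limit is reached, so the job fails instead of hanging.
run_with_limit() {
  limit="$1"
  shift
  timeout "$limit" "$@"
}

# Example usage in a CI script step (placeholder package name):
# run_with_limit 300 yum -y install some-package
```

This does not fix the Rails-side status handling, but it keeps jobs from sitting in the lock-wait loop shown in the log above.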