Job timeouts still not working

Problem

CI/CD job timeouts set by the user are propagated to the Runner; the Runner cancels the job when the timeout threshold is exceeded and then sends the updated job status to the Rails application. In some scenarios the Rails application does not appear to handle the updated job status, which results in users reporting that CI job timeouts are not working.
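
For context, the user-defined timeout in this flow can come either from the project's CI/CD settings or from the job itself in .gitlab-ci.yml. A minimal sketch of the job-level form, with a placeholder job name, script and duration:

# .gitlab-ci.yml sketch: this is the user-defined timeout that should be propagated to the Runner
build-job:
  stage: build
  timeout: 30m        # placeholder duration
  script:
    - ./build.sh      # placeholder command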

Note to everyone noticing related behaviour:

I am starting to research the problem and trying to identify the flow, the potential failure points, and where they intersect. As already stated, the hard part of that endeavour is figuring out where to strategically place more monitoring points.

  1. Please follow the same template as in gitlab#457222 (comment 2665264202) and provide as much info as possible.
  2. Tag panoskanell in the comment so that I don't miss it.
  3. Logs should be < 1 month old for ~"GitLab.com" occurrences.

Actual behavior

I have 5 jobs in a stage and a runner concurrency of 4, and this is the only runner added to the project repository.

The 3 jobs that neither finished nor timed out all have their timeouts set in CI. The script execution for all 3 jobs did start, but then it hung (!)

So I manually cancelled the jobs from the UI, retried them, and things went smoothly as expected.
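
For reference, the layout described above would look roughly like the sketch below. All job names, scripts and durations are placeholders, not the actual CI file (which I can't share publicly, see further down):

# Hypothetical sketch: one stage with 5 jobs on a single runner with concurrent = 4
stages:
  - deploy

job-1:
  stage: deploy
  timeout: 1h          # placeholder; the 3 jobs that hang each set a job-level timeout like this
  script:
    - ./run-step-1.sh  # placeholder command

# job-2 ... job-5 follow the same pattern; with concurrent = 4, only 4 of the
# 5 jobs run at once and the remaining one waits for a free slot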

Expected behavior

Jobs should not hang and should honor the timeout!

Relevant screenshots

Job screenshots:

Screenshot_2020-07-17_at_02.34.58

Screenshot_2020-07-17_at_02.32.51

Screenshot_2020-07-17_at_02.30.09

Environment description

Using the shell executor (on AWS Lightsail with the latest stable gitlab-runner) with gitlab.com.

config.toml contents
concurrent = 4
check_interval = 0
[session_server]
  session_timeout = 7200
  listen_address = "0.0.0.0:8093"
  advertise_address = "server-random.example.com:8093"
[[runners]]
  name = "runner for server-random.example.com"
  url = "https://gitlab.com"
  token = "2ZBY4-token-for-server-random.example.com-Dxz"
  executor = "shell"
  [runners.custom_build_dir]
  [runners.cache]
    [runners.cache.s3]
    [runners.cache.gcs]

Used GitLab Runner version

Version:      13.4.1
Git revision: e95f89a0
Git branch:   13-4-stable
GO version:   go1.13.8
Built:        2020-09-25T20:03:43+0000
OS/Arch:      linux/amd64

I have also faced the same problem once with v13.1.1 and then twice with v13.4.1.

Weird Behaviour

I have faced this issue three times, and all 3 times it occurred only with those same 3 jobs. According to the job logs, each of the 3 times the jobs hung during the execution of different commands.

Sorry, I can't share the CI code publicly, but @steveazz I can give you access to the repo (with the runner still set up on the server), as the 3rd time I faced this issue was only a few hours ago.

So I am gonna keep the server running for a few days.

In all 3 jobs that hung, yum is called at some point, and if a yum process is already running, the newer process logs

Existing lock /var/run/yum.pid: another copy is running as pid 26118.
Another app is currently holding the yum lock; waiting for it to exit...
  The other application is: yum
    Memory :  98 M RSS (376 MB VSZ)
    Started: Fri Oct  2 15:39:47 2020 - 00:11 ago
    State  : Running, pid: 26118

and waits for the other process to finish...
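
As a stopgap while the hang itself is investigated, one idea is to serialize the yum calls across concurrent jobs on the shell-executor host with flock, so a job never sits in yum's own lock-waiting loop (which is where the jobs appear to stall). This is a rough sketch only, with placeholder job name, package and lock path, and assuming util-linux flock is available on the host:

# .gitlab-ci.yml snippet (illustrative only): run yum under an external lock
install-deps:
  stage: build
  timeout: 30m                                           # placeholder duration
  script:
    # flock waits for the lock and then runs yum, so concurrent jobs execute yum one at a time
    - flock /tmp/ci-yum.lock yum install -y some-package

This is only a workaround idea; even while waiting on the yum lock, the job should still have been killed once its timeout expired, which is the actual bug reported here.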
