Skip to content

Kubernetes executor: prevent background processes from hanging the entire job

Snaipe requested to merge Snaipe/gitlab-runner:fix/2880 into main

What does this MR do?

This commit changes the way we execute scripts when attaching to containers in the kubernetes controller. It used to be that we executed scripts with:

sh /scripts-x-y/detect_shell_script /scripts-x-y/script 2>&1 | tee -a /logs-x-y/output.log

This works well most of the time, but if the executed script also starts processes in the background, they'll inherit the tee pipe as their standard output & error file descriptors. This means that when the parent command terminates, if these background processes survive, then the tee process never ends up getting EOF, blocking the shell we are attaching to.

This commit changes that command to:

sh -c '(/scripts-x-y/detect_shell_script /scripts-x-y/script 2>&1 | tee -a /logs-x-y/output.log) &'

Which causes the problematic command to be run in a background subshell, and means we don't wait on a potentially blocked tee.

There are better ways to do this. Normally, you'd create a session with a controlling terminal, and on process exit the session would get SIGHUP'd; unfortunately this is hard to do in a portable manner, and adding new requirements to CI images seem like a no-go. What is real however is that this issue hasn't been fixed for 5 years and is impacting people, so let's just do this instead.

Why was this MR needed?

I am an SRE working for a GitLab customer with an on-prem deployment of GitLab EE, and helped support the transition from bare-metal runners to Kubernetes-hosted runners.

This issue popped up almost immediately for some jobs, and was fairly hard to troubleshoot & root cause. Comparatively, a fix to make things Just Work©™ was fairly easy.

The issue is unique to the attach strategy of the Kubernetes executor. I could have used the exec strategy instead but decided not to, since #27976 mentions that the exec strategy will be removed at some point.

What's the best way to test this MR?

Simply use the kubernetes executor to build the following job:

test:
  image: alpine:latest
  script:
  - sleep infinity &
  - mkdir out && echo "Hello, world" > out/greeting
  after_script:
  - echo I should be running
  artifacts:
    when: always
    paths:
    - out/greeting

If run without this change, the job will produce no output when entering the after_script stage, which will time out after some time. Then the job will move on to upload the artifacts, succeed, then hang forever.

Attaching to the build container of the job pod and killing sleep will immediately cause the job to get unstuck.

Re-running the job with this change shows that the after_script will run and the job will not hang.

What are the relevant issue numbers?

close #2880 (closed)

Edited by Romuald Atchadé

Merge request reports