
Cancel stage script upon job cancellation in attach mode

What does this MR do?

When a job is cancelled from the UI while the Kubernetes executor is running in attach mode, GitLab Runner doesn't cancel the execution, and the job hangs until it eventually times out.

With this MR, a command is executed remotely in the stage container to explicitly cancel the running script.

For the bash shell, every process whose name ends with the stage script name is killed, along with its child processes. However, we avoid killing the parent process responsible for the tee redirection to the output.log file.

Without this redirection, GitLab Runner cannot get the trap_status and finish the job gracefully.
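
The selection rule described above can be modelled with a short Python sketch. This is only an illustration of the idea, not the command that GitLab Runner actually executes: the process table, the script path /scripts/step_script, and the helper name pids_to_kill are all hypothetical. It assumes the wrapper process has the script path in the middle of its command line (followed by the tee redirection), so a "ends with the script name" match skips it.

```python
from collections import defaultdict
from typing import NamedTuple


class Proc(NamedTuple):
    pid: int
    ppid: int
    cmdline: str


def pids_to_kill(procs: list[Proc], script_path: str) -> list[int]:
    """Return the PIDs of processes running script_path plus all their descendants."""
    # Map each parent PID to the PIDs of its direct children.
    children = defaultdict(list)
    for p in procs:
        children[p.ppid].append(p.pid)

    # Roots: processes whose command line ends with the stage script path.
    # Assumption: the wrapper that pipes into tee ends with "output.log",
    # so it is not matched here and survives to flush the log.
    stack = [p.pid for p in procs if p.cmdline.rstrip().endswith(script_path)]

    selected = set()
    while stack:
        pid = stack.pop()
        if pid in selected:
            continue
        selected.add(pid)
        stack.extend(children[pid])  # walk down to child processes
    return sorted(selected)


# Hypothetical process table for a stage container.
procs = [
    Proc(1, 0, "sh -c bash /scripts/step_script 2>&1 | tee -a /logs/output.log"),
    Proc(7, 1, "bash /scripts/step_script"),
    Proc(8, 1, "tee -a /logs/output.log"),
    Proc(9, 7, "sleep 3600"),
]
print(pids_to_kill(procs, "/scripts/step_script"))  # prints [7, 9]
```

In this model the wrapper (PID 1) and tee (PID 8) are left running, which mirrors why the redirection stays intact and the trap_status can still be written to output.log.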

The same logic is applied for the PowerShell shell.

Why was this MR needed?

When using the Kubernetes executor with a self-hosted Runner (16.11.1) on GitLab.com, jobs appear to hang instead of cancelling immediately.

What's the best way to test this MR?

Pipeline passes

gitlab-ci
variables:
  FF_USE_POWERSHELL_PATH_RESOLVER: "true"
  FF_RETRIEVE_POD_WARNING_EVENTS: "true"
  FF_PRINT_POD_EVENTS: "true"
  FF_SCRIPT_SECTIONS: "true"
  FF_USE_DUMB_INIT_WITH_KUBERNETES_EXECUTOR: "false" # also tested with "true" value

simple-job:
  script:
    - sleep 3600
  after_script:
    - echo "this is the after_script running"

Bash

config.toml
concurrent = 1
check_interval = 1
log_level = "debug"
shutdown_timeout = 0

listen_address = ':9252'

[session_server]
  session_timeout = 1800

[[runners]]
  name = "investigation"
  url = "https://gitlab.com/"
  id = 0
  token = "glrt-REDACTED"
  token_obtained_at = "0001-01-01T00:00:00Z"
  token_expires_at = "0001-01-01T00:00:00Z"
  executor = "kubernetes"
  shell = "bash"
  [runners.kubernetes]
    host = ""
    bearer_token_overwrite_allowed = false
    image = "alpine"
    pod_termination_grace_period_seconds = 3600
    namespace = ""
    namespace_overwrite_allowed = ""
    pod_labels_overwrite_allowed = ""
    service_account_overwrite_allowed = ""
    pod_annotations_overwrite_allowed = ""
    node_selector_overwrite_allowed = ".*"
    allow_privilege_escalation = false
    [[runners.kubernetes.services]]
    [runners.kubernetes.dns_config]
    [runners.kubernetes.pod_labels]
      user = "ratchade"

These tests were run with FF_USE_DUMB_INIT_WITH_KUBERNETES_EXECUTOR set to both "true" and "false".

PowerShell

config.toml
concurrent = 1
check_interval = 1
log_level = "debug"
shutdown_timeout = 0

[session_server]
  session_timeout = 1800

[[runners]]
  name = ""
  url = "https://gitlab.com/"
  id = 0
  token = "glrt-REDACTED"
  token_obtained_at = "0001-01-01T00:00:00Z"
  token_expires_at = "0001-01-01T00:00:00Z"
  executor = "kubernetes"
  shell = "powershell"
  [runners.kubernetes]
    host = ""
    bearer_token_overwrite_allowed = false
    image = "mcr.microsoft.com/windows/servercore:ltsc2022"
    namespace = ""
    namespace_overwrite_allowed = ""
    node_selector_overwrite_allowed = ""
    helper_image = "gitlab/gitlab-runner-helper:x86_64-latest-servercore21H2"
    poll_timeout = 3600
    pod_labels_overwrite_allowed = ""
    service_account_overwrite_allowed = ""
    pod_annotations_overwrite_allowed = ""
    [runners.kubernetes.node_selector]
        "kubernetes.io/arch" = "amd64"
        "kubernetes.io/os" = "windows"
        "node.kubernetes.io/windows-build" = "10.0.20348"
    [runners.kubernetes.pod_security_context]
    [runners.kubernetes.volumes]
    [runners.kubernetes.dns_config]

Job cancelled as expected

What are the relevant issue numbers?

close #37780 https://gitlab.com/gitlab-com/ops-sub-department/section-ops-request-for-help/-/issues/340

Edited by Romuald Atchadé
