Job won’t complete: containers with unready status

Summary

Since upgrading the runner to v17.4.0, some jobs with extra services will not complete if one of the service containers is not running.

In our specific case, a docker service is defined both in the runner's Helm configuration and in the job's definition; one of the two containers stops because of the duplicated ports. In a different scenario, a job has a docker service configured but attempts to run on a non-privileged runner: the service container starts and is immediately stopped.
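For illustration, a job along the lines of the sketch below reproduces the first scenario; the job name, image tag, and script are placeholders, not our actual pipeline. Because the runner already injects a docker:dind service through config.toml (see the environment description below), the dind service declared by the job ends up as a second container in the same pod, the two clash on the same port, and one of them is stopped shortly after starting.

.gitlab-ci.yml excerpt (hypothetical reproducer)
reproduce-hang:
  image: docker:24.0
  services:
    # config.toml already adds a docker:dind service to every job pod,
    # so this second dind container competes for the same port and one
    # of the two exits almost immediately after start.
    - docker:dind
  script:
    - docker info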

Steps to reproduce

  1. Start a job with a service
  2. Ensure that the service is stopped before the end of the job
  3. Let the job complete successfully

Actual behavior

The job waits for the stopped container to return to a "running" state and never completes.

Expected behavior

The job completes with the "success" status.

Relevant logs and/or screenshots

(screenshot attached to the original issue, not reproduced here)

Environment description

We are using our own runners, deployed with Helm on an AKS cluster.

config.toml contents
[[runners]]
    pre_build_script = REDACTED
    environment = [
      "DOCKER_HOST=REDACTED",
      "DOCKER_TLS_CERTDIR=REDACTED",
    ]
    [runners.kubernetes]
      namespace = "{{.Release.Namespace}}"
      pull_policy = ["if-not-present"]
      image_pull_secrets = ["dockerhub"]
      image = "ubuntu:20.10.17"
      privileged = true
      poll_timeout = 500
      # The affinity definition below defines a scheduling preference so that jobs avoid running on system nodes.
      # This block must be copied into every configuration, as it is not currently possible to extract it for easier reuse.
      [runners.kubernetes.affinity]
        [runners.kubernetes.affinity.node_affinity]
          [[runners.kubernetes.affinity.node_affinity.preferred_during_scheduling_ignored_during_execution]]
            weight = 100
            [runners.kubernetes.affinity.node_affinity.preferred_during_scheduling_ignored_during_execution.label_selector]
              [[runners.kubernetes.affinity.node_affinity.preferred_during_scheduling_ignored_during_execution.label_selector.match_expressions]]
                key = "kubernetes.azure.com/mode"
                operator = "In"
                values = ["user"]
      [[runners.kubernetes.services]]
        name = "docker:dind"
        command = ["--insecure-registry=REDACTED", "--registry-mirror=REDACTED"]
      [runners.kubernetes.volumes]
        [[runners.kubernetes.volumes.config_map]]
          name = "runner-scripts"
          mount_path = "REDACTED"
      [runners.kubernetes.pod_labels]
        axceta_job_id = "$CI_JOB_ID"
        axceta_job_name = "$CI_JOB_NAME"
        axceta_job_stage = "$CI_JOB_STAGE"
        axceta_project_name = "$CI_PROJECT_NAME"
        axceta_project_id = "$CI_PROJECT_ID"
        axceta_pipeline_id = "$CI_PIPELINE_ID"
        [[runners.kubernetes.volumes.empty_dir]]
          name = "docker-certs"
          REDACTED
        [[runners.kubernetes.volumes.secret]]
          name = "dockerhub"
          REDACTED
    [runners.cache]
      Type = "azure"
      Shared = true
      [runners.cache.azure]
        REDACTED

Used GitLab Runner version

v17.4.0

Current workaround

Downgrade back to v17.3.1
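
For completeness, a minimal sketch of a Helm values override that pins the older runner image follows. It assumes a recent gitlab-runner chart where the runner image is configured through the image map; the key names and the alpine-v17.3.1 tag should be checked against the chart's values.yaml and the published image tags before use.

values.override.yaml excerpt (hypothetical)
# Key layout assumes a recent gitlab-runner chart -- verify against
# the chart's values.yaml before applying.
image:
  registry: registry.gitlab.com
  image: gitlab-org/gitlab-runner/gitlab-runner
  # Pin the runner image back to the last known-good release.
  tag: alpine-v17.3.1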