Windows Shell Runner: Failing jobs do not fail

Summary

For jobs running inside a Windows shell runner, a failing command is not handled correctly.

The script stops at the failing command, but the job is still reported as "successful" / "passing".

Steps to reproduce

For the simplest example, add any non-existent command to the "script" section of a job. Alternatively, add any command that returns an error code.

.gitlab-ci.yml

win_debug_build:
  tags:
    - windows_runner
  stage: build
  interruptible: true
  rules:
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
      when: always
    - if: $CI_MERGE_REQUEST_IID
      when: always
    - if: $CI_COMMIT_BRANCH
      when: manual
  script:
    - ThisShouldFail # Imaginary, non-existing command
    # <other build commands>
  artifacts:
    paths:
      - debug_build
    expire_in: 1 week
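
The alternative mentioned above, a command that returns an error code, can be sketched as a smaller job. This is only an illustration and is not taken from the affected project: the job name `win_exit_code_repro` is hypothetical, and `cmd /c "exit 1"` simply stands in for any native command that exits with a non-zero code.

```yaml
win_exit_code_repro:   # hypothetical job name, for illustration only
  tags:
    - windows_runner
  stage: build
  script:
    # Native command that exits with code 1; the job is expected to fail here.
    - cmd /c "exit 1"
    # Should never run if the failure above is detected.
    - Write-Host "This line should not be reached"
```

If the behavior described in this report applies, such a job would presumably stop after the first line and still end with "Job succeeded".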

Actual behavior

The job is correctly stopped at the line with the bad command, and further commands are not executed.

However, the job is reported as "passing" or "successful" despite the failing command.

Expected behavior

The job should both stop and fail because a command in the script failed.

Relevant logs and/or screenshots

Here's an excerpt of the output when running the job above:

job log

Running with gitlab-runner 13.9.0 (2ebc4dc4)
  on [...] 157RkzCW
  feature flags: FF_GITLAB_REGISTRY_HELPER_IMAGE:true
Resolving secrets
00:00
Preparing the "shell" executor
00:00
Using Shell executor...
Preparing environment
00:02
Running on [...]...
Getting source from Git repository
00:06
Fetching changes with git depth set to 50...
Reinitialized existing Git repository in C:/gitlab_runner/builds/157RkzCW/0/[...]/.git/
Checking out cb69ba8e as refs/merge-requests/16/head...
git-lfs/2.4.2 (GitHub; windows amd64; go 1.8.3; git 6f4b2e98)
Skipping Git submodules setup
Executing "step_script" stage of the job script
00:01
$ ThisShouldFail
ThisShouldFail : The term 'ThisShouldFail' is not recognized as the name of a cmdlet, function, 
script file, or operable program. Check the spelling of the name, or if a path was included, 
verify that the path is correct and try again.
At C:\Users\gitlab\AppData\Local\Temp\build_script277876021\script.ps1:245 char:1
+ ThisShouldFail
+ ~~~~~~~~~~~~~~
    + CategoryInfo          : ObjectNotFound: (ThisShouldFail:String) [], CommandNotFoundException
    + FullyQualifiedErrorId : CommandNotFoundException
 
Uploading artifacts for successful job
00:14
Version:      13.9.0
Git revision: 2ebc4dc4
Git branch:   13-9-stable
GO version:   go1.13.8
Built:        2021-02-22T20:17:11+0000
OS/Arch:      windows/amd64
Uploading artifacts...
Runtime platform                                    arch=amd64 os=windows pid=1378996 revision=2ebc4dc4 version=13.9.0
WARNING: debug_build: no matching files            
ERROR: No files to upload                          
Cleaning up file based variables
00:01
Job succeeded

Environment description

This is a custom installation, namely a Windows shell runner. In terms of configuration, this is a group-specific runner which is used by the project facing this issue.

config.toml contents

concurrent = 1
check_interval = 0

[session_server]
  session_timeout = 1800

[[runners]]
  name = "..."
  url = "..."
  token = "..."
  executor = "shell"
  shell = "powershell"
  [runners.custom_build_dir]
  [runners.cache]
    [runners.cache.s3]
    [runners.cache.gcs]
    [runners.cache.azure]

[[runners]]
  name = "..."
  url = "..."
  token = "..."
  executor = "shell"
  shell = "powershell"
  [runners.custom_build_dir]
  [runners.cache]
    [runners.cache.s3]
    [runners.cache.gcs]
    [runners.cache.azure]

Used GitLab Runner version

PS C:\gitlab_runner> .\gitlab-runner.exe --version
Version:      13.9.0
Git revision: 2ebc4dc4
Git branch:   13-9-stable
GO version:   go1.13.8
Built:        2021-02-22T20:17:11+0000
OS/Arch:      windows/amd64

Possible cause

When running the job script manually in a PowerShell terminal, I can see that $LASTEXITCODE is 1, as expected, after the failing command. However, $? reports true, indicating "success". Is it possible that gitlab-runner is checking $? instead of $LASTEXITCODE in this situation, which would lead to "Job succeeded"?
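
To check this hypothesis on the runner itself rather than in a manual terminal, a small diagnostic job could print both indicators right after a failing native command. This is a sketch under assumptions: the job name `diagnose_exit_status` is hypothetical, `cmd /c "exit 3"` is just a stand-in for a failing command, and the values are captured on the same script line so that they are read before any check the runner may add between script lines.

```yaml
diagnose_exit_status:   # hypothetical job name, for illustration only
  tags:
    - windows_runner
  stage: build
  script:
    # Run a native command that exits with code 3, then immediately capture
    # PowerShell's success flag ($?) and the native exit code ($LASTEXITCODE)
    # and print both to the job log.
    - 'cmd /c "exit 3"; $ok = $?; $code = $LASTEXITCODE; Write-Host "success flag: $ok, exit code: $code"'
```

Because the last command on that line is `Write-Host`, this job is expected to pass either way; its only purpose is to show in the job log whether the two values agree inside the runner's PowerShell session.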

I have a feeling, without having verified it, that this is a recent regression. We have another Windows job that relies on return codes to detect errors, and it previously worked (i.e., the job failed when the script failed).

Here's the script line that previously worked (but no longer does):

- cmd /c 'for /r "./temp_bin/" %a in (*.exe) do call "%a" & if errorlevel 1 exit /b 1'