Windows Shell Runner: Failing jobs do not fail
Summary
For jobs running on a Windows shell runner, a failing command is not handled correctly.
The job stops at the failing command, but it is still reported as "successful" / "passing".
Steps to reproduce
For the simplest example, add any non-existent command to the "script" section of a job. Alternatively, add any command that returns a non-zero exit code.
.gitlab-ci.yml
win_debug_build:
  tags:
    - windows_runner
  stage: build
  interruptible: true
  rules:
    - if: $CI_COMMIT_BRANCH == $CI_DEFAULT_BRANCH
      when: always
    - if: $CI_MERGE_REQUEST_IID
      when: always
    - if: $CI_COMMIT_BRANCH
      when: manual
  script:
    - ThisShouldFail # Imaginary, non-existing command
    # <other build commands>
  artifacts:
    paths:
      - debug_build
    expire_in: 1 week
Actual behavior
The job is correctly stopped at the line with the bad command, and further commands are not executed.
However, the job is reported as "passing" or "successful" despite the failing command.
Expected behavior
The job should both stop and fail because a command in the script failed.
Relevant logs and/or screenshots
Here's an excerpt of the output when running the job above:
job log
Running with gitlab-runner 13.9.0 (2ebc4dc4)
on [...] 157RkzCW
feature flags: FF_GITLAB_REGISTRY_HELPER_IMAGE:true
Resolving secrets
00:00
Preparing the "shell" executor
00:00
Using Shell executor...
Preparing environment
00:02
Running on [...]...
Getting source from Git repository
00:06
Fetching changes with git depth set to 50...
Reinitialized existing Git repository in C:/gitlab_runner/builds/157RkzCW/0/[...]/.git/
Checking out cb69ba8e as refs/merge-requests/16/head...
git-lfs/2.4.2 (GitHub; windows amd64; go 1.8.3; git 6f4b2e98)
Skipping Git submodules setup
Executing "step_script" stage of the job script
00:01
$ ThisShouldFail
ThisShouldFail : The term 'ThisShouldFail' is not recognized as the name of a cmdlet, function,
script file, or operable program. Check the spelling of the name, or if a path was included,
verify that the path is correct and try again.
At C:\Users\gitlab\AppData\Local\Temp\build_script277876021\script.ps1:245 char:1
+ ThisShouldFail
+ ~~~~~~~~~~~~~~
+ CategoryInfo : ObjectNotFound: (ThisShouldFail:String) [], CommandNotFoundException
+ FullyQualifiedErrorId : CommandNotFoundException
Uploading artifacts for successful job
00:14
Version: 13.9.0
Git revision: 2ebc4dc4
Git branch: 13-9-stable
GO version: go1.13.8
Built: 2021-02-22T20:17:11+0000
OS/Arch: windows/amd64
Uploading artifacts...
Runtime platform arch=amd64 os=windows pid=1378996 revision=2ebc4dc4 version=13.9.0
WARNING: debug_build: no matching files
ERROR: No files to upload
Cleaning up file based variables
00:01
Job succeeded
Environment description
This is a custom installation, namely a Windows shell runner. In terms of configuration, this is a group-specific runner which is used by the project facing this issue.
config.toml contents
concurrent = 1
check_interval = 0

[session_server]
  session_timeout = 1800

[[runners]]
  name = "..."
  url = "..."
  token = "..."
  executor = "shell"
  shell = "powershell"
  [runners.custom_build_dir]
  [runners.cache]
    [runners.cache.s3]
    [runners.cache.gcs]
    [runners.cache.azure]

[[runners]]
  name = "..."
  url = "..."
  token = "..."
  executor = "shell"
  shell = "powershell"
  [runners.custom_build_dir]
  [runners.cache]
    [runners.cache.s3]
    [runners.cache.gcs]
    [runners.cache.azure]
Used GitLab Runner version
PS C:\gitlab_runner> .\gitlab-runner.exe --version
Version: 13.9.0
Git revision: 2ebc4dc4
Git branch: 13-9-stable
GO version: go1.13.8
Built: 2021-02-22T20:17:11+0000
OS/Arch: windows/amd64
Possible cause
When running the job script manually in a PowerShell terminal, I can see that $LASTEXITCODE is 1, as expected, after the failing command. However, $? reports true, indicating "success". Is it possible that gitlab-runner is checking $? instead of $LASTEXITCODE in this situation, which would lead to "job successful"?
I suspect, without having verified it, that this is a recent regression. We have another Windows job that relies on return codes to detect errors, and it previously worked (i.e., the job failed when the script failed).
Here's the script line that was previously working (but is not anymore):
- cmd /c 'for /r "./temp_bin/" %a in (*.exe) do call "%a" & if errorlevel 1 exit /b 1'
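A possible stopgap, sketched here and not verified against 13.9.0: check $LASTEXITCODE explicitly right after the native command and exit with it, so the result does not depend on $?. This assumes the runner treats a non-zero exit of the generated PowerShell script as a job failure. The job name win_run_binaries is hypothetical. The tag and the cmd line are copied from above.

```yaml
# Hypothetical job name; the tag and the cmd line are taken from this report.
win_run_binaries:
  tags:
    - windows_runner
  script:
    - |
      cmd /c 'for /r "./temp_bin/" %a in (*.exe) do call "%a" & if errorlevel 1 exit /b 1'
      # Assumption: an explicit non-zero exit here is enough to mark the job as failed.
      if ($LASTEXITCODE -ne 0) { exit $LASTEXITCODE }
```

Both commands sit in one multi-line script entry so that nothing runs between the native command and the check.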
Related issues
See also: #1142 (closed)