Nonsensical out-of-memory errors on Windows Shared Runner
Summary
When attempting to build OpenMW (https://gitlab.com/OpenMW/openmw) on a Windows Shared Runner, some process will report an out-of-memory error despite plenty of memory being free. Using more memory (e.g. by also building PDBs) makes it happen sooner.
This may be a problem specific to the shared runners.
Steps to reproduce
I haven't managed to narrow down a minimal reproduction case, but OpenMW fails reliably with the following rough behaviour:
- Load a cache zip with dependencies from a previous run.
- Use Chocolatey to install dependencies.
- Call `refreshenv` to make them available.
- Call OpenMW's prebuild script (used successfully by developers on their local machines and on AppVeyor). This:
  - Downloads and extracts dependencies.
  - Puts certain dependency DLLs into the right place.
  - Sets up CMake options.
  - Gets CMake to create a `.sln`, NMake or Ninja build.
- Activate MSVC in the current shell if required by the build system (see the sketch after this list).
- Get CMake to start the build.
- Await a crash.
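For context, "activate MSVC" means importing the environment that `vcvars64.bat` sets up into the current PowerShell session. Our actual `ActivateMSVC.ps1` is more involved; this is only a minimal sketch of the idea, assuming `vswhere` is on `PATH`:

```powershell
# Sketch, not the real ActivateMSVC.ps1: locate VS 2019 with vswhere, run
# vcvars64.bat inside cmd, and copy the resulting environment into PowerShell.
$vsPath = & vswhere -version '[16.0,17.0)' -products * `
    -requires Microsoft.VisualStudio.Component.VC.Tools.x86.x64 `
    -property installationPath
$vcvars = Join-Path $vsPath 'VC\Auxiliary\Build\vcvars64.bat'
cmd /c "`"$vcvars`" && set" | ForEach-Object {
    if ($_ -match '^([^=]+)=(.*)$') {
        Set-Item -Path "Env:$($Matches[1])" -Value $Matches[2]
    }
}
```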
Factors known to make things worse include building more targets and also building PDB files; both of these would make the build process consume more memory.
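Consistent with that, peak usage also scales with how many compiler processes run at once, so capping the job count should move the crash later. This is a mitigation to experiment with, not a fix, and we haven't verified that it helps:

```powershell
# Hypothetical mitigation: limit concurrent compile jobs to lower peak memory.
# --parallel needs CMake >= 3.12; with Ninja you could pass "-- -j 2" instead.
cmake --build . --config Release --parallel 2
```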
I know that on Windows, out-of-memory errors can be caused by a process not having enough contiguous virtual address space free even when it has enough free in total. That isn't what's happening here, though: 64-bit executables are crashing too, and they'll always have enough virtual address space until people start filling 8 TB of RAM.
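One thing worth ruling out is the commit limit: Windows fails allocations once the system-wide commit charge reaches physical RAM plus pagefile, even while physical memory looks free. A generic PowerShell check along these lines (nothing in it is specific to our job) would show whether that's the case:

```powershell
# Compare free physical memory against the system commit charge and limit.
$os = Get-CimInstance Win32_OperatingSystem
'{0:N0} MB physical free of {1:N0} MB' -f `
    ($os.FreePhysicalMemory / 1KB), ($os.TotalVisibleMemorySize / 1KB)
(Get-Counter '\Memory\Committed Bytes', '\Memory\Commit Limit').CounterSamples |
    ForEach-Object { '{0}: {1:N0} MB' -f $_.Path, ($_.CookedValue / 1MB) }
```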
`.gitlab-ci.yml`:

```yaml
stages:
  - build

# Debian & MacOS stuff removed for brevity

Windows:
  tags:
    - windows
  before_script:
    - Import-Module "$env:ChocolateyInstall\helpers\chocolateyProfile.psm1"
    - choco install git --force --params "/GitAndUnixToolsOnPath" -y
    - choco install 7zip -y
    - choco install cmake.install --installargs 'ADD_CMAKE_TO_PATH=System' -y
    - choco install vswhere -y
    - choco install ninja -y
    - choco install python -y
    - refreshenv
  stage: build
  script:
    - $time = (Get-Date -Format "HH:mm:ss")
    - echo ${time}
    - echo "started by ${GITLAB_USER_NAME}"
    - sh CI/before_script.msvc.sh -c Release -p Win64 -v 2019 -k -V -N
    - cd MSVC2019_64_Ninja
    - .\ActivateMSVC.ps1
    - cmake --build . --config Release
    - cd Release
    - |
      if (Get-ChildItem -Recurse *.pdb) {
        7z a -tzip ..\..\OpenMW_MSVC2019_64_${CI_BUILD_REF_NAME}_${CI_BUILD_ID}_symbols.zip '*.pdb'
        Get-ChildItem -Recurse *.pdb | Remove-Item
      }
    - 7z a -tzip ..\..\OpenMW_MSVC2019_64_${CI_BUILD_REF_NAME}_${CI_BUILD_ID}.zip '*'
  cache:
    paths:
      - deps
      - MSVC2019_64/deps
  artifacts:
    when: always
    paths:
      - "*.zip"
      - "*.log"
      - MSVC2019_64_Ninja/*
      - MSVC2019_64_Ninja/*.log
      - MSVC2019_64_Ninja/*/*.log
      - MSVC2019_64_Ninja/*/*/*.log
      - MSVC2019_64_Ninja/*/*/*/*.log
      - MSVC2019_64_Ninja/*/*/*/*/*.log
      - MSVC2019_64_Ninja/*/*/*/*/*/*.log
      - MSVC2019_64_Ninja/*/*/*/*/*/*/*.log
      - MSVC2019_64_Ninja/*/*/*/*/*/*/*/*.log
```
OpenMW/openmw!211 (closed) and OpenMW/openmw!186 (closed) are both attempts to get this to work. !211 is less cluttered, but !186 has a lot more logging.
Actual behavior
At some point, one or more processes will crash due to an out-of-memory error. This has included:
- The MSVC compiler reporting that it's out of heap space.
- Various internal errors in the compiler and linker.
- MSBuild reporting a `System.OutOfMemoryException`.
- MSBuild reporting an inability to communicate with a child process.
- The memory-monitoring PowerShell script I'd set up to debug this getting a `System.OutOfMemoryException`.
When this happens, about six of the eight gigabytes of physical memory appear to be free. I wouldn't be surprised if something was limiting the whole job to around 2 GB of memory, but it could also be a coincidence that it fails around that level, as the build spends a lot of time using about that much.
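The monitoring script itself isn't reproduced here, but it amounted to a background process polling the OS for memory figures as the build ran (see the logs section below). A minimal sketch of that approach, with a hypothetical file name:

```powershell
# memory-log.ps1 (hypothetical name): poll free memory every few seconds and
# append it to a log that can be collected as a job artifact.
while ($true) {
    $os = Get-CimInstance Win32_OperatingSystem
    Add-Content memory.log ('{0:HH:mm:ss} physical free: {1:N0} MB, virtual free: {2:N0} MB' -f `
        (Get-Date), ($os.FreePhysicalMemory / 1KB), ($os.FreeVirtualMemory / 1KB))
    Start-Sleep -Seconds 5
}
```

Something like `Start-Job -FilePath .\memory-log.ps1` before the build step would run it in the background.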
Expected behavior
If there's free memory, it should be possible to allocate it, and the build should succeed.
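A direct way to test that on the runner would be a probe that allocates until it fails, which would expose any cap like the suspected ~2 GB one. This is hypothetical; we haven't isolated the failure this way:

```powershell
# Keep allocating 100 MB blocks until the runtime reports an OOM, then report
# how much was obtained before the failure.
$blocks = [System.Collections.Generic.List[byte[]]]::new()
try {
    while ($true) {
        $blocks.Add([byte[]]::new(100MB))
        '{0:N0} MB allocated so far' -f ($blocks.Count * 100)
    }
} catch [System.OutOfMemoryException] {
    'Allocation failed after {0:N0} MB' -f ($blocks.Count * 100)
}
```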
Relevant logs and/or screenshots
Practically every failed job listed at https://gitlab.com/OpenMW/openmw/-/jobs is because of this and has some amount of logging; the logs are usually too long to reasonably include here. The ones from the `windows-shared-runner` branch, but not the ones from the `windows-shared-runner-the-second` branch, often have a separate memory log as an artefact, where a background process queried the memory usage as the build ran. Usually, the later the build, the more information gets logged. Many builds have the output redirected to another log, as stderr and stdout weren't always appearing in order.
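For reference, the kind of redirection meant, sketched in PowerShell (the exact invocations varied between attempts):

```powershell
# Merge stderr into the stdout stream and record both, in order, to a file
# while still echoing to the job log.
cmake --build . --config Release 2>&1 | Tee-Object -FilePath build.log
```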
Environment description
We use the shared Windows runners, so we get what we're given.
Used GitLab Runner version
The first few lines of the build output are:
```
[0KRunning with gitlab-runner 12.9.0 (4c96e5ad)
[0;m[0K on windows-shared-runners-manager Hs8mheX5
[0;msection_start:1589907028:prepare_executor
[0K[0K[36;1mPreparing the "custom" executor[0;m
[0;m[0KUsing Custom executor with driver autoscaler dev (64a348d)...
[0;mCreating virtual machine for the job...
Virtual machine created!
section_end:1589907133:prepare_executor
[0Ksection_start:1589907133:prepare_script
[0K[0K[36;1mPreparing environment[0;m
[0;mRunning on PACKER-5E557E8E via runner-hs8mhex5-wsrm-6feb0f5587b312c6b1ed...
section_end:1589907147:prepare_script
```
and without ANSI escape sequences, it's:
```
Running with gitlab-runner 12.9.0 (4c96e5ad)
on windows-shared-runners-manager Hs8mheX5
Preparing the "custom" executor
Using Custom executor with driver autoscaler dev (64a348d)...
Creating virtual machine for the job...
Virtual machine created!
Preparing environment
Running on PACKER-5E557E8E via runner-hs8mhex5-wsrm-6feb0f5587b312c6b1ed...
```
Possible fixes
None known, or this would be a merge request.