Optimize CI job allocations
This project is by far the largest consumer of GitLab compute credits in our group. This MR optimizes the pipeline's cost without changing the matrix of tested configurations. The end result is an overall >50% decrease in cost per successful pipeline.

The primary changes in this MR are listed below:
- Jobs that fail the pipeline and are simple to fix (`pre-commit`, `clang-format`) are now part of a `verify` stage. This stage must now pass before the rest of the pipeline runs.
- The `verify` stage includes a new `meson setup: [*]` set of jobs which confirm that various combinations of arguments to `meson setup` work without error. These replace the `option: [*]` jobs; unlike those, no compilations or tests are run, and Meson's internal dependency cache is used to accelerate the process.
- `meson compile` and `meson test` are now separate CI jobs, in the `build` and `test` stages respectively. Both tasks are optimized: builds use large CPU-only runners, while unique hardware (e.g. GPUs) is only used during tests.
- With the exception of `aio`, most test jobs no longer run the tests repeatedly. This helps save CI time.
- Most `compile` jobs compile without debug info. This also saves CI time but, more importantly, saves artifact space.
- HPSF CI and `ppc64le` jobs only have a `test` job with no `compile` job. For HPSF CI, this prevents overloading the limited generic-hardware resources. For `ppc64le`, this avoids a large artifact upload for an architecture that does not support GPUs (i.e. requires no unique hardware).
- Compiler tests (previously `compiler *`) are now a subset of the `build` jobs, but do not have associated `test` jobs. This is primarily an organizational change.
- Test data is now regenerated by `regen testdata: [*]` jobs, which do not test the newly generated test data. These jobs must be triggered manually and are in the `artifacts` stage.
- Failing jobs no longer cancel the entire pipeline; the pipeline now continues until no more jobs can run. This is intended to reduce wasted CI time, both by giving more information from failing pipelines (hence fewer pipelines per MR) and by reducing cancel/retry cycles for ultimately successful jobs.
## To Demonstrate
An example pipeline off of `develop` typically costs ~210-220 credits; this number is visible in the `ci performance report` job and also in the top header. For example: https://gitlab.com/hpctoolkit/hpctoolkit/-/pipelines/2028041223

The pipelines in this MR typically cost ~90-100 credits, visible in the same locations in the MR pipeline linked just below this description.
## Backward Compatibility
The test matrix is largely unchanged from the previous iteration; however, there are three major removals:
- The jobs based on Ubuntu 20.04 and Fedora 39 have been dropped in the shuffle, in accordance with their EOL status. These old distros do not represent any system we aim to fully support, so they do not need the aggressive testing we do for, say, RHEL 8.
- The `setup: [*]` regression tests have been dropped. These jobs tested specifically strange system configurations that caused issues in the past. We predict these issues will never recur thanks to internal code improvements, so they no longer need aggressive testing.
- Tests using `--buildtype=release` have been dropped. The majority of test jobs use the default `--buildtype=debugoptimized`, which is much like `release` except for the presence of debug info and assertions. We predict that most changes causing behavioral differences between `debugoptimized` and `release` builds will be visible in a manual review.
In addition, the `clang-tidy-fix` job has been removed. Previously, this job generated a patch of automated fix-its for the `clang-tidy` errors it found. This patch normally goes unused and often requires additional care to apply, so the job has been dropped.