Optimize CI job allocations

This project is by far the largest consumer of GitLab compute credits in our group. This MR optimizes the pipeline cost without changing the matrix of tested configurations. The end result is an overall >50% decrease in cost per successful pipeline.

The primary changes in this MR are listed below:

  • Jobs that fail the pipeline and are simple to fix (pre-commit, clang-format) are now part of a verify stage, which must pass before the rest of the pipeline runs.
    • The verify stage includes a new meson setup: [*] set of jobs, which confirm that various combinations of arguments to meson setup work without error. These replace the option: [*] jobs; unlike those, no compilation or tests are run, and Meson's internal dependency (and other) caches are used to accelerate the process.
  • meson compile and meson test are now separate CI jobs, in the build and test stages respectively. Both tasks are optimized: builds use large CPU-only runners, while unique hardware (e.g. GPUs) is only allocated during tests.
    • With the exception of aio, test jobs no longer run the tests repeatedly. This saves CI time.
    • Most compile jobs compile without debug info. This also saves CI time, but more importantly saves artifact space.
    • HPSF CI and ppc64le jobs only have test jobs with no corresponding compile jobs. For HPSF CI, this prevents overloading the limited generic-hardware resources. For ppc64le, it avoids a large artifact upload for an architecture that does not support GPUs (i.e. requires no unique hardware).
    • Compiler tests (previously compiler *) are now a subset of the build jobs, but do not have associated test jobs. This is primarily an organizational change.
  • Test data is now regenerated by regen testdata: [*] jobs, which do not test the newly generated data. These jobs run in the artifacts stage and must be triggered manually.
  • Failing jobs no longer cancel the entire pipeline; instead, the pipeline continues until no more jobs can run. This is intended to reduce wasted CI time by extracting more information from failing pipelines (→ fewer pipelines per MR) and reducing cancel/retry cycles for ultimately successful jobs.
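A minimal .gitlab-ci.yml sketch of the stage layout described above (job names, tags, and meson options here are illustrative assumptions, not this project's actual configuration):

```yaml
# Illustrative sketch only: job names, tags, and meson options are
# assumptions, not this project's actual configuration.
stages:
  - verify     # fast, cheap gating checks (pre-commit, meson setup, etc.)
  - build      # meson compile on large CPU-only runners
  - test       # meson test on runners with unique hardware (e.g. GPUs)
  - artifacts  # manually triggered regen testdata jobs

pre-commit:
  stage: verify
  script:
    - pre-commit run --all-files

"meson setup: [example]":
  stage: verify
  script:
    # Confirm this combination of setup arguments works; no compile or test.
    - meson setup builddir -Dexample=enabled

"build: [example]":
  stage: build
  script:
    # Compile without debug info to keep uploaded artifacts small.
    - meson setup builddir -Ddebug=false
    - meson compile -C builddir
  artifacts:
    paths:
      - builddir/

"test: [example]":
  stage: test
  needs: ["build: [example]"]
  tags:
    - gpu  # unique hardware is only allocated during the test phase
  script:
    - meson test -C builddir --no-rebuild
```

Because the test job declares needs: on its build job, it downloads that job's artifacts and starts as soon as the build finishes, while the verify stage still gates the pipeline as a whole.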

To Demonstrate

An example pipeline off of develop typically costs ~210-220 credits; this number is visible in the ci performance report job and in the top header. For example: https://gitlab.com/hpctoolkit/hpctoolkit/-/pipelines/2028041223

The pipelines in this MR typically cost ~90-100 credits; this number is visible in the same locations in the MR pipeline linked just below this description.

Backward Compatibility

The test matrix is largely unchanged from the previous iteration; however, there are three major removals:

  1. The jobs based on Ubuntu 20.04 and Fedora 39 have been dropped, in accordance with their EOL status. These old distros do not represent any system we aim to fully support, so they do not need the aggressive testing we apply to, say, RHEL 8.
  2. The setup: [*] regression tests have been dropped. These jobs tested specific, unusual system configurations that caused issues in the past. We predict these issues will never recur thanks to internal code improvements, so they do not need aggressive testing.
  3. Tests using --buildtype=release have been dropped. The majority of test jobs use the default --buildtype=debugoptimized, which is much like release except for the presence of debug info and assertions. We predict most changes that cause behavioral differences between debugoptimized and release builds will be visible in manual review.
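For reference, the gap between the two build types comes down to a couple of meson setup flags. A hypothetical job fragment contrasting them (the job and directory names are illustrative):

```yaml
# Illustrative fragment: contrasts the retained and dropped build types.
compare-buildtypes:
  script:
    # Retained default: -O2 with debug info; assertions stay enabled.
    - meson setup build-debugopt --buildtype=debugoptimized
    # Dropped configuration: -O3, no debug info; assertions are controlled
    # separately via -Db_ndebug.
    - meson setup build-release --buildtype=release -Db_ndebug=true
```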

In addition, the clang-tidy-fix job has been removed. Previously this job generated a patch of automated fix-its for clang-tidy errors. This patch normally went unused and often required additional care to apply, so the job has been dropped.

Edited by Jonathon Anderson
