Consider lighter-weight CI infrastructure.
Our Docker images are big and slow to build. Our CI job artifacts are big and numerous. Our CI workflows spend a lot of time starting and stopping containers or transferring data.
Proposals
Merge the configure and build stages
We probably spend more time on overhead than actually running the configure stages. On the rare occasions that configure stages are used by multiple build jobs, it is probably cheaper to duplicate the configure. Also, it is not safe to use the configure stage output without rerunning CMake anyway, since we don't use commit-specific artifact names or cache keys.
Reduce artifact and cache storage
Consider discontinuing artifacts and caches that are easily invalidated (requiring extra wall time and leading to wasted transfer time). Corollary: Don't store and transfer data that is easy to reproduce.
Evaluate/optimize ccache usage.
The build jobs spend a long time downloading and uploading ccache databases. We should check whether this is costing or saving time. Consider adding ccache -s to the beginning and end of the build scripts. Consider checking the distribution of job wall times for a couple of job names across a bunch of commits.
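A minimal sketch of what the statistics calls could look like in a job definition. The job name, the build command, and the choice to zero the counters first are assumptions for illustration; only ccache -s comes from the proposal above.

```yaml
# Hypothetical job; only the `ccache -s` calls are taken from the proposal.
example:build:
  script:
    - ccache -z              # assumption: zero the counters so the final report covers this job only
    - ccache -s              # state of the downloaded cache (size, number of files)
    - cmake --build build    # placeholder for the real build script
    - ccache -s              # hit/miss counts accumulated during this build
```

Comparing the hit rate in the final report with the cache download and upload times in the job log should indicate whether the cache pays for itself.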
To make ccache most useful, store ccache even for failed builds.
Most failed builds still manage to compile a lot of files that will not be modified in the MR. By default, the CI cache parameter is set to when: on_success. For ccache, we should probably use cache: when: always.
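As a sketch, the cache setting described above might look like the following. The job name, the cache directory, and the key scheme are assumptions, not our current configuration.

```yaml
# Hypothetical job; directory and key are illustrative only.
example:build:
  variables:
    CCACHE_DIR: "$CI_PROJECT_DIR/ccache"   # assumed location; cache paths must be inside the project dir
  cache:
    key: "ccache-$CI_JOB_NAME"             # assumed key scheme
    paths:
      - ccache/
    when: always                           # upload the cache even when the job fails
  script:
    - cmake --build build                  # placeholder for the real build script
```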
Related: something in our build system causes a lot of ccache records to be invalidated much more often than seems necessary. If we could figure out what is causing this, we could probably save ~90% of the build time for initial MR pushes by inheriting the ccache cache from the master branch.
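One possible way to inherit the master cache is a fallback cache key, sketched below. The key names and job name are hypothetical, and cache: fallback_keys needs a reasonably recent GitLab version. Depending on project settings, caches for protected and unprotected branches may be kept separate, so this would need checking before we rely on it.

```yaml
# Hypothetical job; key names are illustrative only.
example:build:
  cache:
    key: "ccache-$CI_JOB_NAME-$CI_COMMIT_REF_SLUG"   # per-branch cache
    fallback_keys:
      - "ccache-$CI_JOB_NAME-master"                 # used when the branch has no cache yet
    paths:
      - ccache/
    when: always
  script:
    - cmake --build build                            # placeholder for the real build script
```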
Use more specific builds for test stages
It is currently possible for test jobs to pick up artifacts from other pipelines run for the same ref. To prevent this, we can either merge the build and test stages or tag the installation artifacts more rigorously.
Also, merging the build and test stages would allow us to conditionally store more build artifacts for failed tests and discard all but the test results for passing tests. Note that we would want to add when: 'always' to the cache CI parameter so that the ccache database is updated for failed jobs.
Only a small number of jobs, if any, fail late in execution from transient resource interruptions, and we only need cache (not artifacts) to mitigate the cost of re-building. There is not much gained by "check-pointing" the CI workflow between build and test.
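A rough sketch of a merged build-and-test job along these lines. The job name, commands, and paths are all hypothetical. It relies on my understanding that report artifacts (artifacts: reports) are uploaded regardless of job result, while artifacts: paths follow the when setting; that should be verified against the GitLab documentation before adopting it.

```yaml
# Hypothetical merged job; commands and paths are placeholders.
example:build-and-test:
  script:
    - cmake --build build          # placeholder build step
    - ./run-tests.sh               # hypothetical test driver, assumed to write results.xml
  artifacts:
    when: on_failure               # keep the build tree only when something failed
    paths:
      - build/
    expire_in: 1 week
    reports:
      junit: results.xml           # test results, expected to be kept for passing jobs too
  cache:
    paths:
      - ccache/
    when: always                   # keep the ccache database current even for failed jobs
```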
Reevaluate CI Docker images for size
The images take a lot of time to download, build, and upload, and take up a lot of space in our GitLab project storage. Large image size also means that fewer Docker images will remain in local caches, and for less time. (Note that Docker registries may self-optimize, and avoid storing duplicated layers across images, so the actual storage used is not necessarily the same as the sum of image sizes.)
We need some extra files in order to use the containers as build environments, but we probably have bloated containers that could be shrunk somewhat with more judicious use of features like hpccm.Stage.runtime.
We can revisit the hpccm building blocks available, but also should consider relying more on packages from the OS distributions. In particular,
MPI
MPI can take a long time to build, but recent Ubuntu releases are fairly flexible with configuring MPI installations for different toolchains. The flexibility is affected, though, by MPI flavor and OS distribution.
Python
Recent OS distributions have gotten much better at managing multiple Python installations, and native packages may have substantially smaller footprints than from-scratch installations.
We have at least one extra Python installation already, since we are explicitly installing the OS distribution's Python3 in addition to the pyenv installations.
Use additional projects to test new CI images
We use the latest tag for CI images that are still evolving, but this means that we can't easily stage in new CI images without extra temporary edits to the job configurations.
If we used ${CI_REGISTRY_IMAGE} instead of ${CI_REGISTRY}/gromacs/gromacs in our image names, we could push trial images to a different registry (e.g. https://gitlab.com/gromacs/testing/gromacs/container_registry) to confirm that a fork of master (or the branch in need of the image update) at https://gitlab.com/gromacs/testing/gromacs.git will work as expected without editing the gitlab-ci yaml.
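For illustration, the change to an image reference might look like this. The image name ci-example and the job name are hypothetical; ${CI_REGISTRY_IMAGE} is the predefined variable holding the container registry address of the project that runs the pipeline.

```yaml
# Before: always resolves to the main project's registry.
#   image: ${CI_REGISTRY}/gromacs/gromacs/ci-example:latest
# After: resolves to the registry of whichever project runs the pipeline,
# so the same yaml works unchanged in gromacs/testing/gromacs.
example:build:
  image: ${CI_REGISTRY_IMAGE}/ci-example:latest   # "ci-example" is a hypothetical image name
  script:
    - cmake --build build                         # placeholder for the real build script
```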
Only run one CI pipeline per MR update
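If the duplication in question is a branch pipeline and a merge request pipeline both starting for the same push (an assumption; the proposal does not say), the workflow rules pattern from the GitLab documentation is one possible starting point:

```yaml
# Sketch only: assumes the duplicates are branch + merge request pipelines.
workflow:
  rules:
    # Run merge request pipelines.
    - if: $CI_PIPELINE_SOURCE == "merge_request_event"
    # Skip the branch pipeline when an MR is already open for the branch.
    - if: $CI_COMMIT_BRANCH && $CI_OPEN_MERGE_REQUESTS
      when: never
    # Otherwise run ordinary branch pipelines (e.g. for master).
    - if: $CI_COMMIT_BRANCH
```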
Resolution
This issue is intended to focus discussion and launch new issues when it is determined that action should be taken. To clarify scope, I suggest closing this issue once each proposal can be accepted or rejected, or after a couple of planning meetings have passed without evidence that these topics have significant priority.