Automatically Load-balance tests

Description

Optimizing CI/CD pipelines to take the minimum amount of time can be tough. We should help developers by automatically load-balancing tests so that each job takes roughly equal time, resulting in the shortest wall-clock time.

Context

When you run tests, there's only so much you can do in a "single thread, single process" execution. As your test base grows, you have a few alternatives depending on the programming language or framework of choice. Some allow you to use multiple threads or multiple processes to spread the execution of tests across a single machine. They may or may not consolidate the reports at the end.

Another alternative is to split tests between multiple CI workers. That's usually more portable, as you essentially pass a list of files to the test framework and get a JUnit file at the end. It can also be combined with the previous approach to better utilize all available resources.

The split is normally done by dividing the entire list of specs in sequence among all CI workers running the tests. This is usually implemented using 'maximum concurrency' and 'index' variables. In GitLab, using the parallel: keyword sets CI_NODE_TOTAL and CI_NODE_INDEX, which the test framework can use to generate the list of files for a specific machine.
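As a rough illustration of the index-based split described above, here is a minimal Python sketch. The function names `files_for_node` and `current_node` are hypothetical, and the round-robin assignment over a sorted list is an assumption (a framework could just as well hand out contiguous chunks). It relies only on CI_NODE_INDEX being 1-based and CI_NODE_TOTAL being the number of parallel jobs.

```python
import os


def files_for_node(files, node_index, node_total):
    """Return the test files one worker should run.

    node_index is 1-based (as CI_NODE_INDEX is); node_total is the
    number of parallel workers (CI_NODE_TOTAL).
    """
    if not 1 <= node_index <= node_total:
        raise ValueError("node_index must be between 1 and node_total")
    # Sort first so every worker sees the same ordering, then take
    # every node_total-th file starting at this worker's offset.
    ordered = sorted(files)
    return ordered[node_index - 1::node_total]


def current_node():
    """Read the worker position from the environment (defaults to 1 of 1)."""
    index = int(os.environ.get("CI_NODE_INDEX", "1"))
    total = int(os.environ.get("CI_NODE_TOTAL", "1"))
    return index, total
```

Each worker would call `files_for_node(all_files, *current_node())` and pass the result to its test runner. Note that this split ignores how long each file takes to run.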

The problem with this approach is that test files are not guaranteed to take the same amount of time to execute. In reality they vary a lot (one file can take a few seconds and the next several minutes).

The Knapsack alternative

The name Knapsack comes from the knapsack problem, a problem in combinatorial optimization:

Given a set of items, each with a weight and a value, determine the number of each item to include in a collection so that the total weight is less than or equal to a given limit and the total value is as large as possible.

In our case, that means: given a set of test files where each one takes a known amount of time to run, split them as evenly as possible so they can be parallelized across Y machines.

The goal is to go from one worker taking 5 minutes while another takes 15 to something closer to 9 and 11 minutes on the same workers, just by feeding them a better set of files.
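One simple way to approximate such a split is a greedy "longest file first" heuristic: sort files by recorded duration and always give the next file to the worker with the least total time so far. The sketch below is an illustration under assumptions, not necessarily what the knapsack gem or GitLab does; the function name `balance`, the file names, and the durations are made up, and it assumes per-file timings from a previous run are available.

```python
import heapq


def balance(durations, workers):
    """Greedily assign files to workers, longest file first.

    durations: dict mapping file name -> seconds from a previous run.
    Returns (buckets, totals): the files and total seconds per worker.
    """
    if workers < 1:
        raise ValueError("workers must be at least 1")
    # Min-heap of (total seconds so far, worker index).
    heap = [(0, i) for i in range(workers)]
    heapq.heapify(heap)
    buckets = [[] for _ in range(workers)]
    # Longest first; ties broken by name so the result is deterministic.
    for name, secs in sorted(durations.items(), key=lambda kv: (-kv[1], kv[0])):
        total, idx = heapq.heappop(heap)  # least-loaded worker
        buckets[idx].append(name)
        heapq.heappush(heap, (total + secs, idx))
    totals = [0] * workers
    for total, idx in heap:
        totals[idx] = total
    return buckets, totals


# Hypothetical timings (seconds) adding up to 20 minutes of tests.
timings = {
    "a_spec.rb": 420,
    "b_spec.rb": 360,
    "c_spec.rb": 240,
    "d_spec.rb": 120,
    "e_spec.rb": 60,
}
buckets, totals = balance(timings, 2)
print(buckets)  # [['a_spec.rb', 'd_spec.rb', 'e_spec.rb'], ['b_spec.rb', 'c_spec.rb']]
print(totals)   # [600, 600]
```

The heuristic does not guarantee an optimal split, but it is cheap and usually gets close when there are many more files than workers. Its quality depends on how well past timings predict the next run.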

Proposal

Links / references

Likely depends on https://gitlab.com/gitlab-org/gitlab-ce/issues/21480
Prior art: knapsack

Documentation blurb

(Write the start of the documentation of this feature here, include:

Why should someone use it; what's the underlying problem.
What is the solution.
How does someone use this

During implementation, this can then be copied and used as a starter for the documentation.)


Edited Mar 13, 2023 by 🤖 GitLab Bot 🤖