WIP: Change runners used by unit tests by the main fork MRs (!1944) · Merge requests · GitLab.org / gitlab-runner

Tomasz Maczukin requested to merge run-tests-on-different-runners-when-on-main-fork into master Mar 17, 2020

What does this MR do?

Sets different runners to be used by unit test CI jobs when the MR is started from our main GitLab Runner fork.

Why was this MR needed?

To speed up development a little until we find a proper solution for the real issue. Please take a look at #16544 (closed) for the context.

I did some analysis of the flaky test and of where it is executed.

Most of the failures over the last two months happened on srmX machines. Only two of the four prmX cases listed below are false failures. The other two were fully legitimate failures that found a problem in the prepared change.

gitlabhq_production=> SELECT description, count(*) FROM (SELECT b.runner_id FROM ci_builds AS b WHERE b.id > 436222297 AND b.project_id = 250833 AND b.status = 'failed' AND b.name = 'unit test 3/8') AS source JOIN ci_runners ON ci_runners.id = source.runner_id GROUP BY description;
             description              | count 
--------------------------------------+-------
 private-runners-manager-3.gitlab.com |     3
 private-runners-manager-4.gitlab.com |     1
 shared-runners-manager-3.gitlab.com  |     7
 shared-runners-manager-4.gitlab.com  |    19
 shared-runners-manager-5.gitlab.com  |    10
 shared-runners-manager-6.gitlab.com  |    13
(6 rows)
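Grouping the rows above by runner family makes the imbalance easier to see. The following is only a minimal sketch: the counts are copied from the query output, while the `runner_family` helper and the `prmX`/`srmX` labels are illustrative names, not part of any GitLab tooling. These are raw failure counts, not failure rates, since the query does not report how many jobs each runner executed in total.

```python
from collections import Counter

# Failure counts copied from the query output above.
failures = {
    "private-runners-manager-3.gitlab.com": 3,
    "private-runners-manager-4.gitlab.com": 1,
    "shared-runners-manager-3.gitlab.com": 7,
    "shared-runners-manager-4.gitlab.com": 19,
    "shared-runners-manager-5.gitlab.com": 10,
    "shared-runners-manager-6.gitlab.com": 13,
}


def runner_family(description: str) -> str:
    """Map a runner description to its short family name (hypothetical helper)."""
    return "prmX" if description.startswith("private-") else "srmX"


# Sum the failures per runner family.
totals = Counter()
for description, count in failures.items():
    totals[runner_family(description)] += count

print(dict(totals))  # {'prmX': 4, 'srmX': 49}
```

So 49 of the 53 recorded failures of `unit test 3/8` landed on srmX machines.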

How are srmX different from prmX? Well, one of the differences is... the disk type! 🙂 srmX use normal disks, while prmX and gsrmX use SSD disks. The second difference is the number of vCPUs and the amount of RAM (prmX and gsrmX use n1-standard-2 machines, which have 2 vCPUs and 7.5 GB of RAM).

This seems to align with @erushton's recent findings about the Docker executor random failures.

The bad news is that we can't easily migrate off srmX, because our tests - especially the Docker ones, where the random failure happens! - require DinD, which means they require privileged = true. That in turn means we can't move them off srmX: community contributors don't have access to prmX, and gsrmX don't have privileged = true.

This MR is a nasty hack and should definitely be reverted once we find a proper solution for #4450 (closed). But for now it should reduce the number of false failures that we see regularly.

Are there points in the code the reviewer needs to double check?

Does this MR meet the acceptance criteria?

Documentation created/updated
Added tests for this feature/bug
In case of conflicts with master - branch was rebased

What are the relevant issue numbers?

Related to #16544 (closed)

Edited Jul 21, 2020 by 🤖 GitLab Bot 🤖
