Define steps for when master branch pipelines are red
During the Release 11.6 cycle, our master branch pipelines were red for a number of reasons: contention in the infrastructure behind CI, flaky tests, and some valid failures. What they had in common was that resolving the failures took time, causing confusion about whether anyone was working on them and, in some cases, hiding real failures. Our handbook states that a maintainer can merge while the pipeline is red.
During the 11.6 retrospective, a discussion started on whether this policy should change to disallow merging to master when pipelines are red.
Important
I need to stress that it is in no one's interest to prevent merging into master. That is an extreme solution that should be avoided, as it slows down development and increases stress around the freeze date. However, we also have a strong interest in ensuring that what we are ready to ship to users has green pipelines and green QA pipelines.
Problem description
I want to highlight a couple of things that I would like to see resolved:
- Ensuring that when master has a red build, we avoid the bystander effect: fixing the issue gets the highest priority and a clear owner
- When an MR has a red build due to a failure in master that is being resolved, we run an additional check before merging into master while it is red
Sample
Updated the description with a sample of failing builds in master for the period 2019-03-20 to 2019-04-03:
Out of 9 issues with broken specs, the breakdown is as follows:
Cause | Solution | Issue |
---|---|---|
Broken spec | Fix | https://gitlab.com/gitlab-org/gitlab-ee/issues/10525 |
CE merged before EE port | Revert | https://gitlab.com/gitlab-org/gitlab-ee/issues/10599 |
CE merge | Fix | https://gitlab.com/gitlab-org/gitlab-ee/issues/10636 |
Broken spec | Fix | https://gitlab.com/gitlab-org/gitlab-ee/issues/10738 |
Broken spec | Fix | https://gitlab.com/gitlab-org/gitlab-ce/issues/59731 |
Broken spec | Fix | https://gitlab.com/gitlab-org/gitlab-ee/issues/10738 |
Missing EE counterpart | Fix | https://gitlab.com/gitlab-org/gitlab-ee/issues/10858 |
Broken spec | Fix | https://gitlab.com/gitlab-org/gitlab-ee/issues/10863 |
Any of these can cause a long tail of failing specs in MRs, and can also hide additional failures because MRs are merged into a red build.
Assigning ownership
We have notifications in #development that inform us of any failed pipeline. In a discussion with @DouweM, we concluded that the problem is that these notifications are not actionable: it is not clear whether an issue already exists and what its status is.
I understand that, especially around feature freeze, everyone is busy with their deliverables. However, it is important that each developer understands their responsibility for keeping specs passing, especially in master, and takes ownership of resolving the problem.
Below are a few suggestions on what can be done about it:
Option 1 - revert the change that introduced a failing spec
In my opinion, this should be our default action: when specs start failing, find the culprit and revert it, preferably fully automated.
We originally had this implemented with the merge train, but there are some challenges:
- A revert is not always simple and straightforward
- A revert sometimes causes more spec failures because of features that depend on the reverted feature
- Re-introducing a reverted change can be even more work if there is a dependency between CE and EE: if a CE MR is reverted after it has already reached EE, and something was built on top of the change in EE, this requires at minimum two other reverts, with a large chance of causing more work
This option would be less problematic if building the merged result before merging were available. Sadly, that feature has slipped for a few releases already and there is no clear delivery date for it yet.
Option 2 - create a post-processing job and a process for devs while the problems are resolved
Increasing the visibility of failures:
- Parse the Knapsack JSON report for failed tests
- Have a post-processing step on spec failure that automatically creates an issue
- The issue can have the title of the failed spec and a label "broken master"
- Each time the same test fails, we can look up whether an issue with the same title exists and print its status
- The status should state whether the issue has an assignee and how many times the test has failed already
- The Slack notification should print the status
This should not be difficult to implement because we have all the information available in the pipeline and we have all the tools to create the issue.
Ideally, we would have this as a built-in feature in GitLab: automatically create an issue out of failing specs and apply defined labels.
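As a rough sketch of the steps above (the report shape, field names, and helpers here are my assumptions for illustration, not the actual Knapsack report format or GitLab API):

```python
import json

BROKEN_MASTER_LABEL = "broken master"

def failed_specs(report_json):
    """Extract failed spec files from a report.

    Assumes a simple {"failures": [{"file": ...}]} structure; the real
    Knapsack/RSpec output will differ.
    """
    report = json.loads(report_json)
    return [f["file"] for f in report.get("failures", [])]

def issue_title(spec_file):
    # One issue per failing spec file, so repeated failures map to the
    # same title and can be looked up instead of creating duplicates.
    return f"Broken master: {spec_file}"

def find_existing(issues, title):
    """Return the first issue whose title matches, or None.

    `issues` stands in for the result of a GitLab issue search."""
    return next((i for i in issues if i["title"] == title), None)

def status_line(issue):
    # Status for the Slack notification: assignee plus failure count.
    assignee = issue.get("assignee") or "no assignee"
    return f"{issue['title']} - {assignee}, failed {issue.get('failure_count', 1)} time(s)"

# Demo with a fabricated report and a fabricated existing issue.
report = '{"failures": [{"file": "spec/models/user_spec.rb"}]}'
existing = [{"title": "Broken master: spec/models/user_spec.rb",
             "assignee": "jdoe", "failure_count": 3}]

for spec in failed_specs(report):
    title = issue_title(spec)
    issue = find_existing(existing, title)
    if issue:
        print(status_line(issue))  # known failure: surface the status
    else:
        print(f"create issue: {title} [{BROKEN_MASTER_LABEL}]")
```

The only non-trivial design choice is using the spec file as the issue-title key, which is what makes the "look up instead of duplicate" step possible.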
Now that we have ensured that the issue is visible and someone is working on it, we can introduce some guidelines for reviewers on how to behave on red master.
When master is red and the issue for fixing the specs is assigned and being worked on, I think the following should be sufficient. The reviewer (or developer) can:
- Create a new branch off the feature branch and disable the spec that is failing in master
- If the pipeline for that branch is green, link it in the MR and merge the MR as is
- If the failing spec can't be disabled because it is a dependency of a change introduced in the MR, the MR should not be merged until master is green
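To make the guideline unambiguous, the merge decision it describes can be written out as a small rule (the function and its parameter names are mine, purely illustrative):

```python
def can_merge(master_green, fix_issue_assigned,
              spec_disabled_branch_green, spec_is_dependency):
    """Illustrative merge rule for a red master.

    Mirrors the guideline: merge freely when master is green; when it is
    red, merge only if the failing spec is owned, a branch off the
    feature branch with that spec disabled is green, and the MR does not
    itself depend on the change behind the failing spec.
    """
    if master_green:
        return True
    if spec_is_dependency:
        return False  # wait until master is green again
    return fix_issue_assigned and spec_disabled_branch_green
```

Note that ownership alone is not enough: the green run with the spec disabled is what shows the MR introduces no new failures of its own.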
The above, in combination with https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/23809, should give us sufficient confidence that things are moving forward, plus options in case it takes a long time to fix the problem.
Option 3 - Lock down master
This option should be considered a last resort, in case options 1, 2, or a new one do not happen within the next couple of releases.
If master is red, no one is allowed to merge to master. We would enforce this with protected branch settings and lift it again once master is safe.
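One way the lockdown could be automated is through GitLab's protected-branches REST API, where access level 0 means "No access". This is only a sketch: the host and project id are placeholders, and the actual rollout would need to unprotect the existing rule first and handle authentication.

```python
def lockdown_params(branch="master"):
    """Parameters to protect a branch so nobody can push or merge.

    In the GitLab protected-branches API, access level 0 is "No access".
    """
    return {"name": branch, "push_access_level": 0, "merge_access_level": 0}

def lockdown_url(project_id):
    # gitlab.example.com is a placeholder host; the endpoint path is the
    # documented protected-branches endpoint.
    return f"https://gitlab.example.com/api/v4/projects/{project_id}/protected_branches"

# A real lockdown would POST these params with an authenticated client,
# e.g. requests.post(lockdown_url(pid), data=lockdown_params(), headers=...).
print(lockdown_url(42), lockdown_params())
```

Lifting the lock would be the mirror operation: re-protect the branch with the normal maintainer access levels.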
The benefit of locking down master is that we can continue using master and releasing with some confidence in what we are releasing.
Some of the downsides, once the master branch is reopened:
- We may have accumulated many MRs that, when merged, cause new failures
- Developers who need an unmerged feature to continue their other work will branch off another feature branch that is behind, causing more work to get up to date with master
- Reviewers get the additional task of re-checking MRs they already reviewed, asking developers to rebase on the newest master, and then possibly failing to merge again if master is locked down again
- With timezones, having master blocked for 2 days can set you back more than two days
I think no one wants this option, so I am open to new ideas and to discussing the options listed.