Define steps for when master branch pipelines are red
During the Release 11.6 cycle, our master branch pipelines were red for a number of reasons: contention in the infrastructure behind CI, flaky tests, and some valid failures. What they had in common was that resolving the failures took time, causing confusion about whether anyone was working on them and, in some cases, hiding real failures. Our handbook states that a maintainer can merge while the pipeline is red.
During the 11.6 retrospective, a discussion started on whether this policy should change to disallow merging to master when pipelines are red.
Important
I need to stress that it is in no one's interest to prevent merging into master. That is an extreme solution that should be avoided, as it slows down development and increases stress around the freeze date. However, we also have a strong interest in ensuring that what we are ready to ship to users has green pipelines and green QA pipelines.
Problem description
I want to highlight a couple of things that I would like to see resolved:
- Ensuring that when master has a red build, we avoid the bystander effect: fixing the issue gets the highest priority and a clear owner
- When an MR has a red build due to a failure in master that is being resolved, we run an additional check before merging into master while it is red
Sample
Updated the description with a sample of failing builds in master for the period 2019-03-20 to 2019-04-03:
Out of 9 issues with broken specs, the breakdown is as follows:
Cause | Solution | Issue |
---|---|---|
Broken spec | Fix | https://gitlab.com/gitlab-org/gitlab-ee/issues/10525 |
CE merged before EE port | Revert | https://gitlab.com/gitlab-org/gitlab-ee/issues/10599 |
CE merge | Fix | https://gitlab.com/gitlab-org/gitlab-ee/issues/10636 |
Broken spec | Fix | https://gitlab.com/gitlab-org/gitlab-ee/issues/10738 |
Broken spec | Fix | https://gitlab.com/gitlab-org/gitlab-ce/issues/59731 |
Broken spec | Fix | https://gitlab.com/gitlab-org/gitlab-ee/issues/10738 |
Missing EE counterpart | Fix | https://gitlab.com/gitlab-org/gitlab-ee/issues/10858 |
Broken spec | Fix | https://gitlab.com/gitlab-org/gitlab-ee/issues/10863 |
Any of these can cause a long tail of failing specs in MRs, and can also hide additional failures because MRs are merged into a red build.
Assigning ownership
We have notifications in #development that inform us of any failed pipeline. In a discussion with @DouweM, we concluded that the problem is that these notifications are not actionable: it is not clear whether an issue already exists and what its status is.
I understand that, especially around feature freeze, everyone is busy with their deliverables. However, it is important that each developer understands their responsibility for keeping specs passing, especially in master, and takes ownership of resolving the problem.
Below are a few suggestions on what can be done about it:
Option 1 - revert the change that introduced a failing spec
In my opinion, this should be our default action: when specs start failing, find the culprit and revert it, preferably fully automated.
We originally had this implemented with the merge train, but there are some challenges:
- A revert is not always simple and straightforward
- A revert sometimes causes more spec failures because of features that depend on the reverted feature
- Re-introducing a reverted change can be even more work if there is a dependency between CE and EE: if a CE MR is reverted after it has already reached EE, and something was built on top of the change in EE, this requires at minimum two other reverts, with a large chance of causing more work
This option would be less problematic if building the merged result before merging were available. Sadly, that feature has slipped for a few releases already and there is no clear delivery date for it yet.
Option 2 - create a post-processing job and a process for devs while the problems are resolved
Increasing the visibility of failures:
- Parse the Knapsack JSON report for failed tests
- Have a post-processing step on spec failure that automatically creates an issue
- The issue can have the title of the failed spec and a label "broken master"
- Each time the same test fails, we can look up whether an issue with the same title exists and print its status
- The status should state whether the issue has an assignee and how many times the test has failed already
- The Slack notification should print the status
This should not be difficult to implement because we have all the information available in the pipeline and we have all the tools to create the issue.
Ideally, we would have this as a built-in feature in GitLab: automatically create an issue out of failing specs and apply defined labels.
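As a rough sketch of the steps above (the report shape, field names, and helpers here are my assumptions for illustration, not the actual Knapsack report format or GitLab API):

```python
import json

BROKEN_MASTER_LABEL = "broken master"

def failed_specs(report_json):
    """Extract failed spec files from a report.

    Assumes a simple {"failures": [{"file": ...}]} structure; the real
    Knapsack/RSpec output will differ.
    """
    report = json.loads(report_json)
    return [f["file"] for f in report.get("failures", [])]

def issue_title(spec_file):
    # One issue per failing spec file, so repeated failures map to the
    # same title and can be looked up instead of creating duplicates.
    return f"Broken master: {spec_file}"

def find_existing(issues, title):
    """Return the first issue whose title matches, or None.

    `issues` stands in for the result of a GitLab issue search."""
    return next((i for i in issues if i["title"] == title), None)

def status_line(issue):
    # Status for the Slack notification: assignee plus failure count.
    assignee = issue.get("assignee") or "no assignee"
    return f"{issue['title']} - {assignee}, failed {issue.get('failure_count', 1)} time(s)"

# Demo with a fabricated report and a fabricated existing issue.
report = '{"failures": [{"file": "spec/models/user_spec.rb"}]}'
existing = [{"title": "Broken master: spec/models/user_spec.rb",
             "assignee": "jdoe", "failure_count": 3}]

for spec in failed_specs(report):
    title = issue_title(spec)
    issue = find_existing(existing, title)
    if issue:
        print(status_line(issue))  # known failure: surface the status
    else:
        print(f"create issue: {title} [{BROKEN_MASTER_LABEL}]")
```

The only non-trivial design choice is using the spec file as the issue-title key, which is what makes the "look up instead of duplicate" step possible.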
Now that we have ensured that the issue is visible and someone is working on it, we can introduce some guidelines for reviewers on how to behave on red master.
When master is red and the issue for fixing the specs is assigned and being worked on, I think the following should be sufficient. The reviewer (or developer) can:
- Create a new branch off the feature branch and disable the spec that is failing in master
- If the pipeline for that branch is green, link it in the MR and merge the MR as is
- If the failing spec can't be disabled because it is a dependency of a change introduced in the MR, the MR should not be merged until master is green
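To make the guideline unambiguous, the merge decision it describes can be written out as a small rule (the function and its parameter names are mine, purely illustrative):

```python
def can_merge(master_green, fix_issue_assigned,
              spec_disabled_branch_green, spec_is_dependency):
    """Illustrative merge rule for a red master.

    Mirrors the guideline: merge freely when master is green; when it is
    red, merge only if the failing spec is owned, a branch off the
    feature branch with that spec disabled is green, and the MR does not
    itself depend on the change behind the failing spec.
    """
    if master_green:
        return True
    if spec_is_dependency:
        return False  # wait until master is green again
    return fix_issue_assigned and spec_disabled_branch_green
```

Note that ownership alone is not enough: the green run with the spec disabled is what shows the MR introduces no new failures of its own.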
The above, in combination with https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/23809, should give us sufficient confidence that things are moving forward, plus options in case it takes a long time to fix the problem.
Option 3 - Lock down master
This option should be considered a last resort, in case options 1, 2, or a new one do not happen within the next couple of releases.
If master is red, no one is allowed to merge to master. We would enforce this with protected branch settings and lift it again once master is safe.
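One way the lockdown could be automated is through GitLab's protected-branches REST API, where access level 0 means "No access". This is only a sketch: the host and project id are placeholders, and the actual rollout would need to unprotect the existing rule first and handle authentication.

```python
def lockdown_params(branch="master"):
    """Parameters to protect a branch so nobody can push or merge.

    In the GitLab protected-branches API, access level 0 is "No access".
    """
    return {"name": branch, "push_access_level": 0, "merge_access_level": 0}

def lockdown_url(project_id):
    # gitlab.example.com is a placeholder host; the endpoint path is the
    # documented protected-branches endpoint.
    return f"https://gitlab.example.com/api/v4/projects/{project_id}/protected_branches"

# A real lockdown would POST these params with an authenticated client,
# e.g. requests.post(lockdown_url(pid), data=lockdown_params(), headers=...).
print(lockdown_url(42), lockdown_params())
```

Lifting the lock would be the mirror operation: re-protect the branch with the normal maintainer access levels.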
The benefit of locking down master is that we can continue using master and releasing with some confidence in what we are releasing.
Some of the downsides, once the master branch is reopened:
- We may have accumulated many MRs that, when merged, cause new failures
- Developers who need an unmerged feature to continue their other work will branch off another feature branch that is behind, causing more work to get up to date with master
- Reviewers get the additional task of re-checking MRs they already reviewed, asking developers to rebase on the newest master, and then possibly failing to merge again if master is locked down again
- With timezones, having master blocked for 2 days can set you back more than two days
I think no one wants this option, so I am open to new ideas and to discussing the options listed.