Redefine how we pick security fixes into release candidates

Throughout the 11.6 release cycle we ran into numerous cases where security fixes had to be picked into RCs that were ready to go. This led to a lot of frustration and delayed the deployment of these RCs. As part of the 11.6 retrospective I proposed the following changes related to security fixes:

Never delay an RC because of security fixes. Instead, always finish and deploy the current RC.
Only pick new changes (security or not) if the last deployed RC is stable. If it's not, only pick the changes necessary to resolve the issues.

@marin pointed out that the Security team has a deadline in the form of issue due dates: https://about.gitlab.com/handbook/engineering/security/#due-date-on-security-issues. From what I can tell this means that if the due date is X, the expectation is that the fix is released on X no matter what.

I think these two goals directly conflict. More importantly, I think the use of a hard deadline makes it impossible to ship at a higher rate as we would have to regularly delay deploys if a fix comes in at the last minute (and this happens more often than we'd like).

Imagine for a moment that we have a hypothetical release RC5. It's ready to go, and we're about to start deploying. Suddenly somebody from Security comes in and says "Hey, my security fix has a deadline for today and needs to be picked!". We now have two choices:

1. We continue with RC5, because it was already good to go.
2. We abandon RC5, and create RC6 by picking the fix (meaning RC6 is RC5 + just the fix).

Option one: picking security fixes into a new RC, after deploying the previous one

Out of these two, option one is the more favourable, as in both the best and the worst case scenario it requires the least amount of work and time. This means that the "time to GitLab.com" is the shortest.

In the best case scenario we can deploy RC5, all is well, and we can immediately start working on RC6. In the worst case scenario there are three things that may happen:

We may need to revert to the previous RC.
Prior to deploying the RC we find out we cannot do so, for example due to pipeline failures. This means we may need to abandon RC5 altogether.
We deploy RC5, but it introduces problems in some shape or form, requiring a new RC to resolve them.

Option two: picking security fixes into a ready RC

If we take option two, the best case scenario requires more work. This is due to having to pick one or more merge requests into an RC that is already good to go, with all the extra work and waiting (e.g. for pipelines) that might be necessary.

For the worst case (a broken RC), more work may also be necessary. For example, we may need to revert to a previous RC because of some problem with the RC we want to deploy. If that RC also included security fixes, this means creating a new RC based on a previous RC (skipping the "current" one), and picking the fix there. We basically end up with different "timelines", where RC6 is based on the changes from RC4 plus a security fix, without the changes from RC5. Visually this looks as follows:

RC4 --> RC5 (broken)
 |       |
 +-------+------------> RC6 (RC4 + security fix)
         |
         +----------------------------------------> RC7 (RC5 + fixes for the bugs + security fix)

This gets even worse if RC1 is broken, as we may need to build RC2 (since RC1 is broken) based on a previous patch release (instead of master). This creates a logistical nightmare we should avoid at all costs.

Proposal

Based on all the above, I propose the following changes:

The Delivery team reserves the right to refuse to pick any change into an RC at any given time, provided the reason for this is clearly explained (instead of just "No."). This is absolutely necessary, as otherwise we are unable to perform our work (especially considering the small team size).
Whenever the pipeline for an RC is green, we do not pick any changes into it and proceed with deploying it. Example:
1. At 12:00 UTC the pipeline for RC2 is green
2. At 12:15 UTC Billy Bob tells Delivery that security fix X is supposed to be included in RC2
3. RC2 is deployed as-is, without the security fix
4. Some time after the deploy, Delivery starts working on RC3
Whenever we delay a security fix because we first want to deploy a pending RC, the next RC will only include that security fix (and any other pending ones). Non-security changes will be delayed until the RC after the security RC.
Security needs to make Delivery aware of the due dates, so that Delivery can take these into account when starting work on RCs. I'm not sure yet what format this should take, as long as it's something better than a last-minute notice.
Because of all this, Security due dates become "lossy". This means that if a due date is January 1st, we may end up deploying the fix on the 2nd. In other words, instead of "It must be released on date X" it becomes "It must be released as close to date X as possible".
In the event of a life-or-death situation (e.g. a critical security flaw that is actively exploited) we may still decide to pick changes into an RC at the last minute, but such cases should be discussed and handled on a case-by-case basis.
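
To make the rules above a bit more concrete, here is a rough Python sketch of how the decision could be expressed. This is only an illustration, not an existing tool: every name in it (Action, PickRequest, decide, next_rc_contents, the actively_exploited flag) is hypothetical, and it assumes that "other pending ones" refers to other pending security fixes.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class Action(Enum):
    # Hypothetical outcomes, named after the rules in the proposal.
    DEPLOY_AS_IS = "deploy the current RC unchanged; the fix waits for the next RC"
    DELIVERY_DISCRETION = "RC is not green yet; Delivery may pick or reject (with a reason)"
    ESCALATE = "life-or-death case; discuss and handle case by case"


@dataclass
class PickRequest:
    is_security_fix: bool
    # Assumption: someone has flagged the flaw as critical and actively exploited.
    actively_exploited: bool = False


def decide(pipeline_green: bool, request: PickRequest) -> Action:
    """Decide what to do with a pick request for the current RC."""
    # The only exception to the rules: an actively exploited critical flaw.
    if request.is_security_fix and request.actively_exploited:
        return Action.ESCALATE
    # A green RC is never changed; it is deployed as-is.
    if pipeline_green:
        return Action.DEPLOY_AS_IS
    # Otherwise Delivery decides, and explains the reason if it rejects.
    return Action.DELIVERY_DISCRETION


def next_rc_contents(pending_security_fixes: List[str],
                     other_pending_changes: List[str]) -> List[str]:
    """Changes for the next RC: delayed security fixes go first, on their own."""
    if pending_security_fixes:
        # Non-security changes wait until the RC after the security RC.
        return list(pending_security_fixes)
    return list(other_pending_changes)
```

Applied to the RC2 example: decide(True, PickRequest(is_security_fix=True)) gives DEPLOY_AS_IS, and RC3 would then be built from next_rc_contents(["security fix X"], [...]), which contains only the security fix.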

I am open to alternatives, but I would like to make one thing very clear: introducing last-minute (security) fixes and delaying RCs does not work. Just take a look at the list of issues we ran into with 11.6, and the work necessary to deal with all that.

cc @gitlab-org/delivery @gitlab-com/gl-security

Edited Jan 03, 2019 by Yorick Peterse