Retrospective: Security Release 13.2.5, 13.1.7, 13.0.13
README FIRST
This issue is created to recognize the causes that led to the described problems. No individual or group will need to take responsibility for the problems but will need to take responsibility for the solution.
Summary
During the Security Release 13.2.5, 13.1.7, 13.0.13, we experienced two issues:
- Confusion over the hot patch process
- Discovery that recent security releases haven't been releasing fixes to FOSS
Release timeline summary
- 2020-08-13 20:52 UTC: Security MR is started by @sabrams
- 2020-08-13 21:08 UTC: @mayra-cabrera starts discussion about release plan for security issue in https://gitlab.com/gitlab-org/gitlab/-/issues/235996#note_395773838, including discussion for hot-patch or not.
- 2020-08-14 04:54 UTC: @dosuken123 queries if we should do a hot-patch. @tkuah says yes, considering https://gitlab.com/gitlab-org/gitlab/-/issues/235996#note_395787439
- 2020-08-14 07:36 UTC: First hot-patch MR is https://ops.gitlab.net/gitlab-com/engineering/patcher/-/merge_requests/31
- 2020-08-14 13:00 UTC: More hot-patches are created and merged
- 2020-08-14 13:35 UTC: @mayra-cabrera recommends against hot-patching in favor of focusing on the security MR as part of the Critical Security Release.
- 2020-08-16 21:40 UTC: @tkuah restarts both the hot-patch process and the security MR process.
- 2020-08-16 22:40 UTC: @rchan-gitlab/@tkuah find that staging does not have the hot-patch applied. It is unclear why that is.
- 2020-08-16 23:00 UTC: https://ops.gitlab.net/gitlab-com/engineering/patcher/-/merge_requests/34 is created
- 2020-08-17 00:00 UTC: MR and backports are ready.
- 2020-08-17 00:30 UTC: @yorickpeterse fixes the staging deployments by enabling the Omnibus role override, then triggering a deploy. Around this same time, Yorick asks that when he signs on in about 8 hours, he is provided with clear information about what state we are in, what environments need to be patched, etc.
- 2020-08-17 01:36 UTC: the staging deploy finished
- 2020-08-17 04:56 UTC: @cmaxim creates the Critical Security Release issue
- 2020-08-17 05:00 UTC: https://ops.gitlab.net/gitlab-com/engineering/patcher/-/merge_requests/34 is merged
- 2020-08-17 05:06 UTC: @tkuah adds a second related P1/S1 issue to the Critical Security Release.
- Next few hours: More hot-patches are being created, but staging auto-deploys continue to happen, overwriting the hot-patch.
- 2020-08-17 08:00 UTC: @tkuah suggests giving up on hot-patching as the Critical Security Release is almost ready
- 2020-08-17 08:30 UTC: @jarv comes online and suggests the same thing. The critical security release is started soon after
- 2020-08-17 09:15 UTC: @yorickpeterse starts his day, but finds out there still is no clear information about what state we are in. Instead, information is spread across a bunch of Slack threads and GitLab issues
- 2020-08-17 09:30 UTC: The decision is made to first ensure staging, canary, and production all run the same auto-deploy version, before we start including security patches
- 2020-08-17 14:15 UTC: Security merge requests have been approved and merged
- 2020-08-17 14:57 UTC: the production deployment finished, all environments are now running version 13.3.202008170520-82d6547b2a5.0315997f334. This version does not yet contain the security fixes
- 2020-08-18 12:00 UTC (ish): we discover that the CE packages do not include the security fixes, as the merge train synced the changes too late. We decide to bump the version for the security releases, effectively starting the tagging/publishing process over again
- 2020-08-18 13:30 UTC: packages tagged, now waiting for them to build
- 2020-08-10 12:00 UTC: The cause of the missing sync is fixed in gitlab-com/gl-infra/delivery#1139 (comment 398719130)
Impact
For issue 1 - Confusion over the hot patch process
- Engineer time wasted trying to follow the hot patch process
- Unclear situation when RM started work on Monday
For issue 2 - Discovery that recent security releases haven't been releasing fixes to FOSS
For at least 3 security releases, fixes were not released to FOSS due to the merge train syncing issues. Customers were left in a vulnerable state.
Corrective actions
- A regular auto-deploy fixed the security vulnerability on GitLab.com. A critical security release was made for self-hosted.
- The critical security release was re-tagged, and another blog post published to overcome the FOSS issue.
- Hot patch documentation has been updated
- https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/1158 opened
Process improvements
- Hot-patching over the weekend should be avoided if possible.
- For security fixes, it may be best to just not hot-patch at all, and instead rely on auto-deploys to deploy fixes
Tooling improvements
- It should be easy to check that all expected fixes are included in deploys
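
As a rough illustration of what such a check could look like, here is a minimal sketch that reports which expected fixes are not reachable from a deployed revision. It is not existing release tooling: the function names, the `expected` mapping and the idea of comparing by commit SHA are all assumptions made for this example. It relies on `git merge-base --is-ancestor`, which exits with status 0 when the first commit is an ancestor of the second.

```python
import subprocess
from typing import Callable, Dict, List


def is_ancestor(commit: str, deployed_ref: str, repo_path: str = ".") -> bool:
    """Return True if `commit` is reachable from `deployed_ref`.

    Hypothetical helper: any non-zero exit status (not an ancestor, or an
    unknown commit) is treated as "fix not present", which errs on the
    side of raising a false alarm rather than missing a gap.
    """
    result = subprocess.run(
        ["git", "-C", repo_path, "merge-base", "--is-ancestor", commit, deployed_ref],
        capture_output=True,
    )
    return result.returncode == 0


def missing_fixes(
    expected: Dict[str, str],
    deployed_ref: str,
    contains: Callable[[str, str], bool] = is_ancestor,
) -> List[str]:
    """Return the labels of expected fixes that the deployed ref lacks.

    `expected` maps a human-readable label (for example an MR reference)
    to the commit SHA that carries the fix.
    """
    return [
        label
        for label, commit in expected.items()
        if not contains(commit, deployed_ref)
    ]
```

Running `missing_fixes` against the SHA embedded in a deployed version, once per environment, would give a single answer to "which fixes are missing here?". One caveat: a fix that was cherry-picked or backported gets a different SHA, so a real check would likely need to compare by merge request or patch content rather than by ancestry alone.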
Edited by Amy Phillips