Release 11.6 retrospective
README FIRST
This issue exists to identify the causes that led to the problems described below. No individual or group needs to take responsibility for the problems, only for the solutions.
Release timeline summary
- RC1 (staging, December 4th) was blocked by broken load balancing caused by the Rails 5 upgrade https://gitlab.com/gitlab-org/gitlab-ee/issues/8692
- RC2 (canary, December 5th) was rolled back because of cache inconsistency between canary and production tasks#578 (comment 122726945)
- RC3 (canary, December 6th) was not promoted to production because of project not showing when it ends in .json https://gitlab.com/gitlab-org/gitlab-ce/issues/54175#note_122986750
- RC3 (canary, December 6th) we ended up exposing one of the security fixes an hour before disclosure since this was a non-security release which cherry-picked a security commit on dev. Improvement discussed in https://gitlab.com/gitlab-org/release/framework/issues/29#note_122946134
- RC4 (production, December 10th) we needed to turn off the feature flag ci_merge_request_pipeline before deploying to production because of merge request pipelines. https://gitlab.com/gitlab-org/gitlab-ce/issues/55026#note_123268382
- RC4 (production, December 10th) was rolled back because of 500 errors caused by the way takeoff does apt-get installs and the new version of Ruby. gitlab-com/gl-infra/production#608 (closed) ~30 minute outage
- RC4 (production, December 10th) Regression, any project with a push rule or patch check fails to push gitlab-com/gl-infra/production#610 (closed)
- RC4 (production, December 10th) Regression, broken service desk emails on GitLab.com gitlab-com/gl-infra/production#613 (closed)
- RC4 (production, December 10th) S1 security issue with project imports + LFS, arbitrary file read https://gitlab.com/gitlab-com/gl-infra/production/issues/614
- RC5 (staging, December 11) Failing smoke test for "user creates a merge request" on staging https://gitlab.com/gitlab-org/gitlab-ce/issues/54302#note_124250991
- RC6 (nada, December 12) This RC was abandoned because it did not incorporate a critical security fix, which unfortunately means a bit of wasted toil. We don't really have a good way to add security fixes that have not yet been disclosed to RCs.
- RC7 (canary, December 12) Migrations failed because of statement timeout. Patched deploy node to fix. Slack thread
- RC7 (canary, December 12) Canary started scheduling background migrations that RC4 did not know about. https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5746
- RC8 (staging, December 18) A failover on the patroni cluster caused the first deploy of RC8 on staging to fail
- RC9 Rails4 Gemfile needed fixing https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/23918/diffs?commit_id=58e2c6b6d709573e1403cc6132b9a8c61200c6d8
- RC9 mysql and pgsql tests failed due to Rails4 issues.
- RC9 last minute fixes for issue sorting: https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/23919 and https://gitlab.com/gitlab-org/gitlab-ee/merge_requests/8913
- Flaky smoke test “promotes issue to an epic” gitlab-org/quality/staging#18 (closed). Removed from smoke test.
- Red EE master was completely ignored by development teams and new items were being merged on top https://gitlab.com/gitlab-org/gitlab-ee/issues/8757
- Red CE/EE master was completely ignored (docs-lint job), first failure https://gitlab.com/gitlab-org/gitlab-ce/commit/69aaa30dd9e4211e081bf258e79fcfcfc2e8f230 https://gitlab.com/gitlab-org/gitlab-ce/issues/55038
- Red EE master due to schema changes not getting merged in properly, and .pot changes not being made properly for EE: https://gitlab.com/gitlab-org/gitlab-ee/issues/8856
- EE ended up with more commits than CE counterpart: tasks#604 (closed)
- Unable to merge the security MR for 11.3.14 https://gitlab.com/gitlab-org/gitlab-ce/issues/55611
- 500s from S3 are causing us to have to retry a bunch of uploads
- We had to unpause the old promotion runner. The k8s one was OOMing and causing problems.
Impact
Corrective actions
Process improvements
- Never delay an RC because of security fixes. Instead, always finish and deploy the current RC.
- Only pick new changes (security or not) if the last deployed RC is stable. If it's not, only pick the changes necessary to resolve the issues.
- Start packaging and deploying around the 25th, instead of around the 7th.
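The second rule above (what may be picked into the next RC) can be read as a small decision function. The following is only a minimal sketch of that reading, not existing release tooling: the Change type, its fixes_rc_issue flag, the changes_to_pick function, and the example titles are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    """A candidate change for the next RC (hypothetical type, for illustration)."""
    title: str
    security: bool = False
    # Hypothetical flag: True if this change resolves a problem
    # found in the last deployed RC.
    fixes_rc_issue: bool = False


def changes_to_pick(candidates, last_rc_stable):
    """Apply the picking rule: if the last deployed RC is stable, anything
    (security or not) may be picked; otherwise only the changes needed to
    resolve its issues."""
    if last_rc_stable:
        return list(candidates)
    return [c for c in candidates if c.fixes_rc_issue]


# Illustrative, made-up candidates.
candidates = [
    Change("Security fix", security=True),
    Change("New feature"),
    Change("Fix for RC regression", fixes_rc_issue=True),
]

# With an unstable RC, only the regression fix is eligible.
unstable_picks = changes_to_pick(candidates, last_rc_stable=False)
print([c.title for c in unstable_picks])  # ['Fix for RC regression']
```

Note that under this reading a security fix is treated like any other change: it waits until the deployed RC is stable unless it is itself needed to stabilize it, which matches the first rule about never delaying an RC for security fixes.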
Tooling improvements
Edited by Yorick Peterse