Failed deploy of 11.6.0 RC4
Summary
Deploy of 11.6.0 RC4 to production failed and needed to be rolled back.
Service(s) affected : gitlab.com
Team attribution :
Minutes downtime or degradation : 10h03 - 10h37 = 34m
Root cause
- When deploys are done with takeoff, an apt-get install ... is run concurrently across the entire fleet first, then a hup is issued in a rolling, safe manner with haproxy load balancer draining.
- For this upgrade, we started seeing 500 errors immediately after the apt-get install .... The cause is the ruby upgrade: until the hup, we are running the old ruby with its gems missing from the file system, which is why we were seeing load errors in the logs.
- The fix will be to change the takeoff deploy tool so that the apt-get install is done while the instance is out of the load balancer.
- There was no recent change that caused this problem. takeoff has been designed this way for a long time; the ruby upgrade simply put us in a scenario we had not hit before.
- This probably wasn’t caught on staging because, if the deploy is allowed to finish, the hup eventually happens and the errors clear.
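The difference between the two orderings can be sketched with a small simulation. This is only an illustrative model, not takeoff's actual code: the Node class, the function names, and the rule "a node in rotation serves errors whenever the files on disk differ from what the running process was started from" are all assumptions made for the example. That rule matches this upgrade (new ruby, old process) but would not hold for every release.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical model of one fleet instance (not a takeoff type)."""
    name: str
    in_rotation: bool = True    # receiving traffic from the load balancer
    installed: str = "11.5.3"   # package version on disk
    running: str = "11.5.3"     # version the running processes started from

    def serves_errors(self) -> bool:
        # Assumption for this sketch: a node fails requests when it takes
        # traffic while the running process no longer matches the disk.
        return self.in_rotation and self.installed != self.running


def failing(nodes) -> int:
    """Number of nodes currently serving errors."""
    return sum(n.serves_errors() for n in nodes)


def deploy_fleet_wide_install(nodes, version) -> int:
    """Ordering described in the root cause: install everywhere, then rolling hup.

    Returns the worst number of simultaneously failing nodes observed.
    """
    for node in nodes:
        node.installed = version          # apt-get install on every node at once
    worst = failing(nodes)
    for node in nodes:
        node.in_rotation = False          # drain
        node.running = node.installed     # hup picks up the new files
        node.in_rotation = True           # back into rotation
        worst = max(worst, failing(nodes))
    return worst


def deploy_install_while_drained(nodes, version) -> int:
    """Proposed ordering: drain, install, hup, re-enable, one node at a time."""
    worst = 0
    for node in nodes:
        node.in_rotation = False          # drain first
        node.installed = version          # install while out of rotation
        worst = max(worst, failing(nodes))
        node.running = node.installed     # hup
        node.in_rotation = True
        worst = max(worst, failing(nodes))
    return worst
```

In this model, a four-node fleet deployed with deploy_fleet_wide_install has all four nodes failing between the install and the first hup, while deploy_install_while_drained never has a failing node in rotation. The trade-off is that the install time is now spent per node while it is drained, so the rolling phase takes longer.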
Timeline
2018-12-10
- 09h49 Deployment of 11.6.0-rc4 (https://ops.gitlab.net/gitlab-org/takeoff/pipelines/14941)
- 10h03 Pingdom alerts gitlab-ce down (https://gitlab.pagerduty.com/incidents/PJM0YL9)
- 10h06 High Web Error Rate (https://gitlab.pagerduty.com/incidents/PHERBYV)
- 10h14 Jose tweets: https://twitter.com/gitlabstatus/status/1072072022627373056
- 10h14 Initiated rollback to 11.5.3 https://ops.gitlab.net/gitlab-org/takeoff/pipelines/14947
- 10h15 Errors on GitLab.com (https://log.gitlab.net/goto/1bb0fbde4bbf4d43fb8ce0b16c6bdcbf)
- 10h33 Jose tweets we are rolling back: https://twitter.com/gitlabstatus/status/1072076713612468224
- 10h37 Alerts resolved https://gitlab.slack.com/services/B12SVN24D
- 10h42 tweet: Everything back to normal https://twitter.com/gitlabstatus/status/1072079050401701888
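The intervals implied by this timeline can be checked with a short calculation. The helper name minutes_between and the labels are made up for the example; the timestamps are the ones listed above.

```python
from datetime import datetime


def minutes_between(start: str, end: str) -> int:
    """Minutes between two same-day timestamps written like '10h03'."""
    fmt = "%Hh%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


intervals = {
    "deploy start to first alert": minutes_between("09h49", "10h03"),        # 14
    "first alert to rollback initiated": minutes_between("10h03", "10h14"),  # 11
    "rollback initiated to alerts resolved": minutes_between("10h14", "10h37"),  # 23
    "first alert to alerts resolved": minutes_between("10h03", "10h37"),     # 34
}

for label, minutes in intervals.items():
    print(f"{label}: {minutes}m")
```

The last figure matches the 34 minutes of downtime or degradation reported in the summary.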
Corrective actions
- Improve communication for omnibus customers about this gitlab-org/omnibus-gitlab#3973 (closed)
- Improve omnibus downgrade for Gitaly gitlab-org/omnibus-gitlab#3971 (closed)
- Only install the omnibus when we are out of the load balancer https://gitlab.com/gitlab-org/takeoff/issues/119