RCA: AutoDeploy Staging Failures 2019-09-17
https://docs.google.com/document/d/1qUsz0wBAC3K_mnCTwANfVav6RslWP6AI5KgQXIRhZ4c/edit
Introduction
Summary
- Services Impacted: AutoDeploy
- Users & Teams Involved: Delivery, Infrastructure
- Incident Window (UTC): 2019-09-17 through 2019-09-18
Merged code introduced a post-deployment background migration that was estimated to take 16 hours to run in production, but it failed to run at all against our staging environment. The failure on staging triggered incorrect corrective actions on the staging environment. To address the failure, a new Merge Request was built, which delayed the creation and subsequent release of the desired auto-deploy branch while we waited for testing and building to complete.
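For context, a post-deployment background migration breaks a large data change into small batches that run asynchronously after the deploy finishes, so that no single database statement has to run for hours. GitLab's actual implementation is Ruby and Sidekiq; the sketch below is only a generic Python illustration of the batching idea, and the batch size, id range, and enqueue call are hypothetical.

    # Generic sketch of batched background-migration scheduling
    # (GitLab's real implementation is Ruby + Sidekiq; the batch size
    # and enqueue call below are hypothetical).
    BATCH_SIZE = 10_000

    def schedule_batches(min_id: int, max_id: int) -> None:
        """Enqueue one asynchronous job per id range instead of one huge statement."""
        start = min_id
        while start <= max_id:
            stop = min(start + BATCH_SIZE - 1, max_id)
            enqueue_job(start, stop)  # hypothetical worker-queue call
            start = stop + 1

    def enqueue_job(start_id: int, stop_id: int) -> None:
        # Each job updates only its own id range, so every individual
        # database statement stays short and cheap.
        print(f"would enqueue migration job for ids {start_id}..{stop_id}")

    schedule_batches(1, 45_000)  # would enqueue 5 jobs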
Impact
- Impact to GitLab: Delayed Deploys
- Scope of the event: Internal
- Number of customers affected: n/a
- How was this visible to customers: n/a
- How many attempts made to impacted service(s): n/a
<< include any additional data relevant to the event to further describe the impact >>
Detection & Response
- How was the event detected? Slack notification
- Was this caught by alerting? Yes
- Time between event start and detection? 3 Hours
- Time from detection to remediation? 24 hours
<< include any issues which delayed response or remediation of an event (i.e. bastion host unavailable, relevant team member wasn't page-able, ...) >>
Root Cause Analysis
<< The purpose of this document is to understand the reasons that led to the incident and to create mechanisms preventing them from recurring. A root cause can never be a person; the write-up has to refer to the system and the context rather than the specific actor(s). Follow the "5 whys" in a blameless manner as the core of the root cause analysis. Start with the incident and ask why it happened, then keep iterating, asking "why?" 5 times. It is not a hard rule that it has to be 5 times, but it helps the questions dig deeper toward the actual root cause. Keep in mind that one "why?" may produce more than one answer; consider following the different branches.
Example of the usage of "5 whys": The vehicle will not start. (the problem)
- Why? The battery is dead.
- Why? The alternator is not functioning.
- Why? The alternator belt has broken.
- Why? The alternator belt was well beyond its useful service life and was not replaced.
- Why? The vehicle was not maintained according to the recommended service schedule. (Fifth why, a root cause)
More information: https://en.wikipedia.org/wiki/5_Whys >>
What Went Well
<< Identify the things that worked well or as expected. Any additional callouts for what went particularly well. >>
Skarbek: An improvement was made that increased the performance of the background migration
Skarbek: We learned that the database statement timeouts on staging are intentional, meant to catch potentially harmful changes before they reach production; they successfully prevented this auto-deploy from continuing forward
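A minimal sketch of the guard mechanism described above, assuming psycopg2 (2.8+) and a reachable PostgreSQL instance: a session-level statement_timeout cancels any statement that exceeds the limit, which is how a migration that is too slow for staging gets stopped before it can reach production. The connection string and the pg_sleep stand-in are placeholders.

    # Minimal sketch of PostgreSQL's statement_timeout guard (psycopg2);
    # the DSN is a placeholder and pg_sleep(16) stands in for a slow
    # migration statement.
    import psycopg2

    conn = psycopg2.connect("dbname=staging")  # hypothetical connection
    conn.autocommit = True

    with conn.cursor() as cur:
        # Mirror staging's 10-second limit for this session only.
        cur.execute("SET statement_timeout = '10s'")
        try:
            cur.execute("SELECT pg_sleep(16)")  # exceeds the limit
        except psycopg2.errors.QueryCanceled:
            # PostgreSQL reports "canceling statement due to statement timeout";
            # this is the same failure mode the post-deploy migration hit on staging.
            print("statement aborted by statement_timeout")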
Improvement Areas
<< Using the root cause analysis, explain what can be improved to prevent this from recurring. Anything to improve the detection or time to detection? Anything to improve the time of response or the incident response itself? Is there an existing issue that would have either prevented this incident or reduced the impact? Did we have any indication or prior knowledge that this incident might take place? >>
Skarbek: The failed deploy was not discovered and acted upon quickly enough.
Corrective Actions
<< List issues created as corrective actions from this incident. For each issue, include the following: (Bare Issue link) - Issue labeled as corrective action. Include an estimated date of completion of the corrective action. Include the named individual who owns the delivery of the corrective action. >>
Skarbek:
- Staging differs from production in various ways: gitlab-org/gitlab#32208 (closed)
- Staging is not performant given the lack of traffic and small database size: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7828#note_219086389
- Staging dataset differs from production: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7214
Timeline
2019-09-12
- Staging’s PostgreSQL statement timeout reduced from 5 minutes to 10 seconds: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7828#note_216351210
2019-09-17
- 17:38 - Post-deploy migration job failed on staging: https://ops.gitlab.net/gitlab-com/gl-infra/deployer/-/jobs/638121
- 20:33 - Post-deploy migration job failed again on staging: https://ops.gitlab.net/gitlab-com/gl-infra/deployer/pipelines/80622
- 21:36 - Release Engineer asks for help from specific Infrastructure Engineers
- 22:07 - Release Engineer receives information that we have a post deploy migration that is estimated to take 16 hours to run in production: gitlab-org/gitlab!15137 (comment 218016968)
- 22:28 - Proposed mitigation of using a different user for deploys is merged into place: https://ops.gitlab.net/gitlab-cookbooks/chef-repo/merge_requests/1849
2019-09-18
- 06:25 - Database Engineer suggests rollback of the proposed mitigation: https://ops.gitlab.net/gitlab-cookbooks/chef-repo/merge_requests/1849#note_33834
- 10:14 - Database Engineer creates an issue to address the statement timeouts: gitlab-org/gitlab#32206 (closed)
- 11:25 - The proposed mitigation is rolled back: https://ops.gitlab.net/gitlab-cookbooks/chef-repo/merge_requests/1849
- 13:20 - MR to create a temporary index that increases the performance of the post-deploy migration is merged and picked into the next auto-deploy (see the sketch after the timeline): gitlab-org/gitlab!17054 (merged)
- 17:31 - Successful staging deploy: https://ops.gitlab.net/gitlab-com/gl-infra/deployer/-/jobs/639633
2019-09-19
- Attempted deploy of the post-deploy background migration failed in production: https://ops.gitlab.net/gitlab-com/gl-infra/deployer/-/jobs/640029
- Decision made to revert: gitlab-org/gitlab!17253 (merged)
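The 13:20 mitigation on 2019-09-18 relied on a temporary index so that each batch of the migration could find its rows quickly enough to finish within the statement timeout. The snippet below is a generic illustration of that pattern under assumed table, column, and index names; it is not the content of gitlab-org/gitlab!17054.

    # Generic sketch of the temporary-index pattern (hypothetical table,
    # column, and index names; not the actual MR). CREATE INDEX CONCURRENTLY
    # avoids blocking writes but cannot run inside a transaction.
    import psycopg2

    conn = psycopg2.connect("dbname=staging")  # hypothetical connection
    conn.autocommit = True

    with conn.cursor() as cur:
        cur.execute(
            "CREATE INDEX CONCURRENTLY IF NOT EXISTS tmp_index_unmigrated_rows "
            "ON some_table (id) WHERE migrated = FALSE"
        )
        # ... the batched post-deploy migration runs here, now able to
        #     locate unmigrated rows without a sequential scan ...
        cur.execute("DROP INDEX CONCURRENTLY IF EXISTS tmp_index_unmigrated_rows")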