RCA: AutoDeploy Staging Failures 2019-09-17
https://docs.google.com/document/d/1qUsz0wBAC3K_mnCTwANfVav6RslWP6AI5KgQXIRhZ4c/edit
Introduction
Summary
- Services Impacted: AutoDeploy
- Users & Teams Involved: Delivery, Infrastructure
- Incident Window (UTC): 2019-09-17 through 2019-09-18
Merged code introduced a post-deployment background migration that was estimated to take 16 hours to run in production, but it failed to run at all against our staging environment. The failure on staging triggered incorrect corrective actions on the staging environment. To address the failure, a new Merge Request was built, which delayed the creation and subsequent release of the desired auto-deploy branch while we waited for testing and building to complete.
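For context, a post-deployment background migration breaks a large data change into small batches that run asynchronously after the deploy finishes, so that no single database statement has to run for hours. GitLab's actual implementation is Ruby and Sidekiq; the sketch below is only a generic Python illustration of the batching idea, and the batch size, id range, and enqueue call are hypothetical.

    # Generic sketch of batched background-migration scheduling
    # (GitLab's real implementation is Ruby + Sidekiq; the batch size
    # and enqueue call below are hypothetical).
    BATCH_SIZE = 10_000

    def schedule_batches(min_id: int, max_id: int) -> None:
        """Enqueue one asynchronous job per id range instead of one huge statement."""
        start = min_id
        while start <= max_id:
            stop = min(start + BATCH_SIZE - 1, max_id)
            enqueue_job(start, stop)  # hypothetical worker-queue call
            start = stop + 1

    def enqueue_job(start_id: int, stop_id: int) -> None:
        # Each job updates only its own id range, so every individual
        # database statement stays short and cheap.
        print(f"would enqueue migration job for ids {start_id}..{stop_id}")

    schedule_batches(1, 45_000)  # would enqueue 5 jobs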
Impact
- Impact to GitLab: Delayed Deploys
- Scope of the event: Internal
- Number of customers affected: n/a
- How was this visible to customers: n/a
- How many attempts made to impacted service(s): n/a
<< include any additional data relevant to the event to further describe the impact >>
Detection & Response
- How was the event detected? Slack notification
- Was this caught by alerting? Yes
- Time between event start and detection? 3 Hours
- Time from detection to remediation? 24 hours
<< include any issues which delayed response or remediation of an event (i.e. bastion host unavailable, relevant team member wasn't page-able, ...) >>
Root Cause Analysis
<< The purpose of this document is to understand the reasons that led to the incident and to create mechanisms preventing them from recurring. A root cause can never be a person; the write-up has to refer to the system and the context rather than the specific actor(s). Follow the "5 whys" in a blameless manner as the core of the root cause analysis. Start with the incident and ask why it happened, then keep iterating, asking "why?" 5 times. It is not a hard rule that it has to be 5 times, but it helps the questions dig deeper toward the actual root cause. Keep in mind that one "why?" may produce more than one answer; consider following the different branches.
Example of the usage of "5 whys": The vehicle will not start. (the problem)
- Why? The battery is dead.
- Why? The alternator is not functioning.
- Why? The alternator belt has broken.
- Why? The alternator belt was well beyond its useful service life and was not replaced.
- Why? The vehicle was not maintained according to the recommended service schedule. (Fifth why, a root cause)
More information: https://en.wikipedia.org/wiki/5_Whys >>
What Went Well
<< Identify the things that worked well or as expected. Any additional callouts for what went particularly well. >>
Skarbek: An improvement was made that increased the performance of the background migration
Skarbek: We learned that the database statement timeouts on staging are intentional, meant to catch potentially harmful changes before they reach production; they successfully prevented this auto-deploy from continuing forward
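A minimal sketch of the guard mechanism described above, assuming psycopg2 (2.8+) and a reachable PostgreSQL instance: a session-level statement_timeout cancels any statement that exceeds the limit, which is how a migration that is too slow for staging gets stopped before it can reach production. The connection string and the pg_sleep stand-in are placeholders.

    # Minimal sketch of PostgreSQL's statement_timeout guard (psycopg2);
    # the DSN is a placeholder and pg_sleep(16) stands in for a slow
    # migration statement.
    import psycopg2

    conn = psycopg2.connect("dbname=staging")  # hypothetical connection
    conn.autocommit = True

    with conn.cursor() as cur:
        # Mirror staging's 10-second limit for this session only.
        cur.execute("SET statement_timeout = '10s'")
        try:
            cur.execute("SELECT pg_sleep(16)")  # exceeds the limit
        except psycopg2.errors.QueryCanceled:
            # PostgreSQL reports "canceling statement due to statement timeout";
            # this is the same failure mode the post-deploy migration hit on staging.
            print("statement aborted by statement_timeout")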
Improvement Areas
<< Using the root cause analysis, explain what can be improved to prevent this from recurring. Anything to improve the detection or time to detection? Anything to improve the time of response or the incident response itself? Is there an existing issue that would have either prevented this incident or reduced the impact? Did we have any indication or prior knowledge that this incident might take place? >>
Skarbek: The failed deploy was not discovered and acted upon quickly enough.
Corrective Actions
<< List issues created as corrective actions from this incident. For each issue, include the following: (Bare Issue link) - Issue labeled as corrective action. Include an estimated date of completion of the corrective action. Include the named individual who owns the delivery of the corrective action. >>
Skarbek:
- Staging differs from production in various ways: gitlab-org/gitlab#32208 (closed)
- Staging is not performant given the lack of traffic and small database size: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7828#note_219086389
- Staging dataset differs from production: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7214
Timeline
2019-09-12
- Staging’s PostgreSQL statement timeout reduced from 5 minutes to 10 seconds: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7828#note_216351210
2019-09-17
- 17:38 - Post-deploy migration job failed on staging: https://ops.gitlab.net/gitlab-com/gl-infra/deployer/-/jobs/638121
- 20:33 - Post-deploy migration job failed again on staging: https://ops.gitlab.net/gitlab-com/gl-infra/deployer/pipelines/80622
- 21:36 - Release Engineer asks for help from specific Infrastructure Engineers
- 22:07 - Release Engineer receives information that we have a post deploy migration that is estimated to take 16 hours to run in production: gitlab-org/gitlab!15137 (comment 218016968)
- 22:28 - Proposed mitigation of using a different user for deploys is merged into place: https://ops.gitlab.net/gitlab-cookbooks/chef-repo/merge_requests/1849
2019-09-18
- 06:25 - Database Engineer suggests rollback of the proposed mitigation: https://ops.gitlab.net/gitlab-cookbooks/chef-repo/merge_requests/1849#note_33834
- 10:14 - Database Engineer creates an issue to address the statement timeouts: gitlab-org/gitlab#32206 (closed)
- 11:25 - The proposed mitigation is rolled back: https://ops.gitlab.net/gitlab-cookbooks/chef-repo/merge_requests/1849
- 13:20 - MR to create a temporary index that increases the performance of the post-deploy migration is merged and picked into the next auto-deploy (see the sketch after the timeline): gitlab-org/gitlab!17054 (merged)
- 17:31 - Successful staging deploy: https://ops.gitlab.net/gitlab-com/gl-infra/deployer/-/jobs/639633
2019-09-19
- Attempted deploy of the post-deploy background migration failed in production: https://ops.gitlab.net/gitlab-com/gl-infra/deployer/-/jobs/640029
- Decision made to revert: gitlab-org/gitlab!17253 (merged)
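The 13:20 mitigation on 2019-09-18 relied on a temporary index so that each batch of the migration could find its rows quickly enough to finish within the statement timeout. The snippet below is a generic illustration of that pattern under assumed table, column, and index names; it is not the content of gitlab-org/gitlab!17054.

    # Generic sketch of the temporary-index pattern (hypothetical table,
    # column, and index names; not the actual MR). CREATE INDEX CONCURRENTLY
    # avoids blocking writes but cannot run inside a transaction.
    import psycopg2

    conn = psycopg2.connect("dbname=staging")  # hypothetical connection
    conn.autocommit = True

    with conn.cursor() as cur:
        cur.execute(
            "CREATE INDEX CONCURRENTLY IF NOT EXISTS tmp_index_unmigrated_rows "
            "ON some_table (id) WHERE migrated = FALSE"
        )
        # ... the batched post-deploy migration runs here, now able to
        #     locate unmigrated rows without a sequential scan ...
        cur.execute("DROP INDEX CONCURRENTLY IF EXISTS tmp_index_unmigrated_rows")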