RCA: AutoDeploy Staging Failures 2019-09-17

https://docs.google.com/document/d/1qUsz0wBAC3K_mnCTwANfVav6RslWP6AI5KgQXIRhZ4c/edit

Introduction

Summary

  • Services Impacted: AutoDeploy
  • Users & Teams Involved: Delivery, Infrastructure
  • Incident Window (UTC): 2019-09-17 through 2019-09-18

Merged code introduced a post-deployment background migration that was estimated to take 16 hours in production and that failed to run at all against our staging environment. The failure on staging triggered incorrect corrective actions on the staging environment. Because of the failure, a new Merge Request was created, and waiting for its testing and build to complete delayed the creation and subsequent release of the desired auto-deploy branch.

Impact

  • Impact to GitLab: Delayed Deploys
  • Scope of the event: Internal
  • Number of customers affected: n/a
  • How was this visible to customers: n/a
  • How many attempts made to impacted service(s): n/a

<< include any additional data relevant to the event to further describe the impact >>

Detection & Response

  • How was the event detected? Slack notification
  • Was this caught by alerting? Yes
  • Time between event start and detection? 3 hours
  • Time from detection to remediation? 24 hours

<< include any issues which delayed response or remediation of an event (i.e. bastion host unavailable, relevant team member wasn't page-able, ...) >>

Root Cause Analysis

<< The purpose of this document is to understand the reasons which led to the incident, and to create mechanisms preventing them from recurring. A root cause can never be a person; the writing has to refer to the system and the context rather than the specific actor(s). Follow the "5 whys" in a blameless manner as the core of the root cause analysis. To do this, start with the incident and ask why it happened. Keep iterating, asking "why?" 5 times. While it's not a hard rule that it has to be 5 times, it helps the questions go deeper toward the actual root cause. Keep in mind that one "why?" may have more than one answer; consider following the different branches.

Example of the usage of "5 whys":

The vehicle will not start. (the problem)

  • Why? - The battery is dead.
  • Why? - The alternator is not functioning.
  • Why? - The alternator belt has broken.
  • Why? - The alternator belt was well beyond its useful service life and not replaced.
  • Why? - The vehicle was not maintained according to the recommended service schedule. (Fifth why, a root cause)

More information: https://en.wikipedia.org/wiki/5_Whys >>

What Went Well

<< Identify the things that worked well or as expected. Any additional callouts for what went particularly well. >>

Skarbek: The performance of the background migration was improved.

Skarbek: We learned that the database statement timeouts are intentional: they exist to catch potential bad actors before they are deployed to production, and here they successfully prevented the auto-deploy from continuing forward.

Improvement Areas

<< Using the root cause analysis, explain what can be improved to prevent this from recurring. Anything to improve the detection or time to detection? Anything to improve the time of response or the incident response itself? Is there an existing issue that would have either prevented this incident or reduced the impact? Did we have any indication or prior knowledge that this incident might take place? >>

Skarbek: The Release Engineer waited too long to discover the failed deploy and act upon this information.

Corrective Actions

<< List issues created as corrective actions from this incident. For each issue, include the following: (Bare Issue link) - Issue labeled as corrective action. Include an estimated date of completion of the corrective action. Include the named individual who owns the delivery of the corrective action. >>

Skarbek:

Timeline

2019-09-12

2019-09-17

2019-09-18

2019-09-19

Edited by John Skarbek