
2020-12-04 staging HTTP 500s after latest deploy

Summary

An MR that changed the type of object we cache in ReleaseHighlight.paginated caused exceptions on staging.
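
The change is linked under "Post Incident Analysis" below. As a minimal sketch of the failure mode, assuming the old code cached a plain Hash and the new code expects an object with an `items` method (the names and data here are illustrative, not the actual GitLab code):

```ruby
require "active_support"
require "active_support/cache"

# Stand-in for Rails.cache; the shapes below are hypothetical.
cache = ActiveSupport::Cache::MemoryStore.new

# Deploy N: the old code writes a plain Hash into the cache.
cache.write("release_highlights", { "items" => ["13.6 highlights"] })

# Deploy N+1: the new code expects a different object type.
QueryResult = Struct.new(:items, keyword_init: true)

# fetch returns the stale Hash instead of running the block, because the
# key is still warm from the previous deploy.
cached = cache.fetch("release_highlights") do
  QueryResult.new(items: ["13.6 highlights"])
end

cached.items # raises NoMethodError: the stale entry is a Hash, not a QueryResult
```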

Timeline

All times UTC.

2020-12-04

  • 13:33 - Blackbox probe alert for staging paging EOC
  • 13:36 - @hphilipps (EOC) pings RMs asking if staging is okay
  • 13:37 - investigation by RMs starts and reveals staging is indeed in a bad state
  • 13:40 - jskarbek declares an incident in Slack
  • 13:42 - dev-escalation called upon
  • 13:50 - culprit MR identified
  • 13:55 - revert procedure started for identified MR
  • 14:45 - rollback of staging completed

Corrective Actions

Incident Review

Summary

  1. Service(s) affected: ~"Service::Web"
  2. Team attribution: ~"group::adoption" ~"devops::growth"
  3. Time to detection: 0 minutes
  4. Minutes downtime or degradation: < 60 minutes

Metrics

Customer Impact

  1. Who was impacted by this incident? (i.e. external customers, internal customers)
    1. Internal Users - the Delivery team
  2. What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
    1. The instance was not usable in any way; any call to GitLab resulted in an HTTP 500
  3. How many customers were affected?
    1. n/a
  4. If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
    1. n/a

What were the root causes?

("5 Whys")

Incident Response Analysis

  1. How was the incident detected?
    1. Blackbox probes pinging the staging instance began to fail due to HTTP 500 response codes (a minimal probe sketch follows this list)
  2. How could detection time be improved?
    1. Detection was immediate (time to detection was 0 minutes), so no improvement is needed
  3. How was the root cause diagnosed?
    1. Errors in Sentry
  4. How could time to diagnosis be improved?
    1. Finding the culprit commit and MR from the Sentry error was fast, but maybe we should make sure that all SREs know how to do that.
  5. How did we reach the point where we knew how to mitigate the impact?
    1. The default go-to for this style of change is to evaluate a revert. This was the first course of action we considered and the one we ultimately took.
  6. How could time to mitigation be improved?
    1. None identified
  7. What went well?
    1. Team collaboration: working together, we quickly identified the root cause
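
A minimal stand-in for the blackbox-style check that detected the incident, assuming any 5xx response or connection error counts as a probe failure (the real alerting runs through Prometheus blackbox probes; the URL and helper below are illustrative):

```ruby
require "net/http"
require "uri"

# Minimal stand-in for a blackbox-style HTTP probe: any 5xx response or
# connection error counts as a failed probe.
def probe_healthy?(url)
  response = Net::HTTP.get_response(URI(url))
  !(500..599).cover?(response.code.to_i)
rescue StandardError
  false
end

# During the incident every request returned an HTTP 500, so this check
# would have failed on each probe interval.
probe_healthy?("https://staging.gitlab.com/users/sign_in")
```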

Post Incident Analysis

  1. Did we have other events in the past with the same root cause?
    1. ...
  2. Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
    1. ...
  3. Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
    1. Code change: gitlab-org/gitlab!47852 (merged)

Lessons Learned

  1. We attempted to roll back staging using our documented procedures, which did not work as designed.
  2. Documentation covers the need to ensure version compatibility with schema changes of varying types (a minimal sketch follows this list): https://docs.gitlab.com/ee/development/multi_version_compatibility.html#stale-cache-in-issue-or-merge-request-descriptions-and-comments
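
One mitigation pattern from the linked documentation is to version the cache key, so that application code from different deploys never reads each other's entries. A minimal sketch using an in-memory stand-in for Rails.cache; the constant, struct, and key format are illustrative, not the actual ReleaseHighlight implementation:

```ruby
require "active_support"
require "active_support/cache"

CACHE = ActiveSupport::Cache::MemoryStore.new # stand-in for Rails.cache

class ReleaseHighlight
  # Bump this whenever the shape of the cached object changes, so stale
  # entries written by the previous deploy are simply never read.
  CACHE_VERSION = 2

  QueryResult = Struct.new(:items, :next_page, keyword_init: true)

  def self.paginated(page: 1)
    CACHE.fetch(cache_key(page), expires_in: 3600) do
      # Placeholder for the expensive lookup the real method performs.
      QueryResult.new(items: ["example highlight"], next_page: nil)
    end
  end

  def self.cache_key(page)
    "release_highlight:v#{CACHE_VERSION}:page-#{page}"
  end
end

ReleaseHighlight.paginated(page: 1).items # => ["example highlight"]
```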

