
2020-12-04 staging HTTP 500s after latest deploy

Summary

An MR that changed the type of object we cache in ReleaseHighlight.paginated caused exceptions on staging.
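
The change is linked under "Post Incident Analysis" below. As a minimal sketch of the failure mode, assuming the old code cached a plain Hash and the new code expects an object with an `items` method (the names and data here are illustrative, not the actual GitLab code):

```ruby
require "active_support"
require "active_support/cache"

# Stand-in for Rails.cache; the shapes below are hypothetical.
cache = ActiveSupport::Cache::MemoryStore.new

# Deploy N: the old code writes a plain Hash into the cache.
cache.write("release_highlights", { "items" => ["13.6 highlights"] })

# Deploy N+1: the new code expects a different object type.
QueryResult = Struct.new(:items, keyword_init: true)

# fetch returns the stale Hash instead of running the block, because the
# key is still warm from the previous deploy.
cached = cache.fetch("release_highlights") do
  QueryResult.new(items: ["13.6 highlights"])
end

cached.items # raises NoMethodError: the stale entry is a Hash, not a QueryResult
```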

Timeline

All times UTC.

2020-12-04

  • 13:33 - Blackbox probe alert for staging paging EOC
  • 13:36 - @hphilipps (EOC) pings RMs asking if staging is okay
  • 13:37 - investigation by RMs starts and reveals staging is indeed in a bad state
  • 13:40 - jskarbek declares an incident in Slack
  • 13:42 - dev-escalation called upon
  • 13:50 - culprit MR identified
  • 13:55 - revert procedure started for identified MR
  • 14:45 - rollback of staging completed

Corrective Actions

Incident Review

Summary

  1. Service(s) affected: ~"Service::Web"
  2. Team attribution: ~"group::adoption" ~"devops::growth"
  3. Time to detection: 0 minutes
  4. Minutes downtime or degradation: < 60 minutes

Metrics

Customer Impact

  1. Who was impacted by this incident? (i.e. external customers, internal customers)
    1. Internal Users - the Delivery team
  2. What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
    1. The instance was not usable in any way; any call to GitLab resulted in an HTTP 500
  3. How many customers were affected?
    1. n/a
  4. If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
    1. n/a

What were the root causes?

("5 Whys")

Incident Response Analysis

  1. How was the incident detected?
    1. Blackbox probes pinging the staging instance began to fail due to HTTP 500 response codes (a minimal probe sketch follows this list)
  2. How could detection time be improved?
    1. Detection was immediate (time to detection was 0 minutes), so no improvement is needed
  3. How was the root cause diagnosed?
    1. Errors in Sentry
  4. How could time to diagnosis be improved?
    1. Finding the culprit commit and MR from the Sentry error was fast, but maybe we should make sure that all SREs know how to do that.
  5. How did we reach the point where we knew how to mitigate the impact?
    1. The default go-to for this style of change is to evaluate a revert. This was the first course of action we considered and the one we ultimately took.
  6. How could time to mitigation be improved?
    1. None identified
  7. What went well?
    1. Team collaboration: working together, we quickly identified the root cause
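
A minimal stand-in for the blackbox-style check that detected the incident, assuming any 5xx response or connection error counts as a probe failure (the real alerting runs through Prometheus blackbox probes; the URL and helper below are illustrative):

```ruby
require "net/http"
require "uri"

# Minimal stand-in for a blackbox-style HTTP probe: any 5xx response or
# connection error counts as a failed probe.
def probe_healthy?(url)
  response = Net::HTTP.get_response(URI(url))
  !(500..599).cover?(response.code.to_i)
rescue StandardError
  false
end

# During the incident every request returned an HTTP 500, so this check
# would have failed on each probe interval.
probe_healthy?("https://staging.gitlab.com/users/sign_in")
```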

Post Incident Analysis

  1. Did we have other events in the past with the same root cause?
    1. ...
  2. Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
    1. ...
  3. Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
    1. Code change: gitlab-org/gitlab!47852 (merged)

Lessons Learned

  1. We attempted to roll back staging using our documented procedures, which did not work as designed.
  2. Documentation covers the need to ensure version compatibility with schema changes of varying types (a minimal sketch follows this list): https://docs.gitlab.com/ee/development/multi_version_compatibility.html#stale-cache-in-issue-or-merge-request-descriptions-and-comments
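
One mitigation pattern from the linked documentation is to version the cache key, so that application code from different deploys never reads each other's entries. A minimal sketch using an in-memory stand-in for Rails.cache; the constant, struct, and key format are illustrative, not the actual ReleaseHighlight implementation:

```ruby
require "active_support"
require "active_support/cache"

CACHE = ActiveSupport::Cache::MemoryStore.new # stand-in for Rails.cache

class ReleaseHighlight
  # Bump this whenever the shape of the cached object changes, so stale
  # entries written by the previous deploy are simply never read.
  CACHE_VERSION = 2

  QueryResult = Struct.new(:items, :next_page, keyword_init: true)

  def self.paginated(page: 1)
    CACHE.fetch(cache_key(page), expires_in: 3600) do
      # Placeholder for the expensive lookup the real method performs.
      QueryResult.new(items: ["example highlight"], next_page: nil)
    end
  end

  def self.cache_key(page)
    "release_highlight:v#{CACHE_VERSION}:page-#{page}"
  end
end

ReleaseHighlight.paginated(page: 1).items # => ["example highlight"]
```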

