2020-12-04 staging HTTP 500s after latest deploy
Summary
An MR that changed the type of object we cache in `ReleaseHighlight.paginated` is causing exceptions.
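For context, a minimal sketch of the failure mode, assuming (as the stale-cache guidance linked under Lessons Learned describes) that new code read a cache entry written by the previous version. The key, value, and method names here are illustrative, not the actual MR:

```ruby
require "active_support"
require "active_support/cache"

cache = ActiveSupport::Cache::MemoryStore.new

# The previous version cached a plain Hash under this (hypothetical) key.
cache.write("release_highlight:page-1", { items: ["13.6 highlights"] })

# The new version expects a paginated object and calls a method on whatever
# the cache returns; the stale Hash does not respond to it, and the
# resulting NoMethodError surfaces to the user as an HTTP 500.
stale = cache.read("release_highlight:page-1")
stale.next_page # => NoMethodError: undefined method `next_page'
```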
Timeline
All times UTC.
2020-12-04
- 13:33 - Blackbox probe alert for staging pages the EOC
- 13:36 - @hphilipps (EOC) pings RMs asking if staging is okay
- 13:37 - investigation by the RMs starts and reveals staging is indeed in a bad state
- 13:40 - @jskarbek declares an incident in Slack
- 13:42 - dev-escalation called upon
- 13:50 - culprit MR identified
- 13:55 - revert procedure started for identified MR
- 14:45 - rollback of staging completed
Corrective Actions
- gitlab-org/gitlab#291067 (closed)
- Fix rollback deployment pipeline - delivery#1392 (closed)
Incident Review
Summary
- Service(s) affected: ~"Service::Web"
- Team attribution: ~"group::adoption" ~"devops::growth"
- Time to detection: 0 minutes
- Minutes downtime or degradation: < 60 minutes
Metrics
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - Internal users (the Delivery team)
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - The instance was not usable in any way; any call to GitLab resulted in an HTTP 500.
- How many customers were affected?
  - n/a (staging only)
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
  - n/a
What were the root causes? ("5 Whys")
- gitlab-org/gitlab!47852 changed the type of object cached by `ReleaseHighlight.paginated`. After the deploy, the new code read stale cache entries written in the old format and raised exceptions on every request, resulting in HTTP 500s.
Incident Response Analysis
- How was the incident detected?
  - Blackbox probes pinging the staging instance began to fail on the HTTP 500 response codes (a sketch of this style of check follows this list).
- How could detection time be improved?
  - No; the probes alerted immediately (time to detection was 0 minutes).
- How was the root cause diagnosed?
  - Through the errors reported in Sentry, which pointed to the culprit commit and MR.
- How could time to diagnosis be improved?
  - Finding the culprit commit and MR from the Sentry error was fast, but we should make sure that all SREs know how to do that.
- How did we reach the point where we knew how to mitigate the impact?
  - The default go-to for this style of change is to evaluate a revert. This was our first, and ultimately our chosen, course of action.
- How could time to mitigation be improved?
  - The documented rollback procedure did not work as designed; fixing the rollback deployment pipeline (delivery#1392) would reduce time to mitigation.
- What went well?
  - Team collaboration: working together, we quickly identified the root cause.
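For illustration, a minimal sketch of the style of check a blackbox probe performs: a plain HTTP GET where any non-2xx status fails the probe. The URL is an assumption for the sketch; the real probes are defined in our monitoring configuration:

```ruby
require "net/http"

# Probe-style check: any non-2xx response from staging counts as a failure.
# During this incident every request returned HTTP 500, so probes failed.
response = Net::HTTP.get_response(URI("https://staging.gitlab.com/users/sign_in"))
status = response.is_a?(Net::HTTPSuccess) ? "ok" : "failing"
puts "probe #{status}: HTTP #{response.code}"
```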
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - ...
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - ...
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  - Code change: gitlab-org/gitlab!47852 (merged)
Lessons Learned
- We attempted to roll back staging using our documented procedures, which did not work as designed (see delivery#1392 under Corrective Actions).
- Documentation covers the need to ensure version compatibility with schema and cache changes of varying types: https://docs.gitlab.com/ee/development/multi_version_compatibility.html#stale-cache-in-issue-or-merge-request-descriptions-and-comments (a sketch of one such mitigation, cache-key versioning, follows).
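One common mitigation from that guidance is to prevent old and new entries from colliding by embedding a format version in the cache key, so the first deploy of new code sees cache misses instead of stale values of the wrong type. A minimal sketch; the constant and key format are illustrative, not GitLab's actual implementation:

```ruby
# Bump this whenever the class or shape of the cached value changes, so new
# code never reads entries written in the old format.
CACHE_FORMAT_VERSION = 2

def release_highlight_cache_key(page)
  "release_highlight:v#{CACHE_FORMAT_VERSION}:page-#{page}"
end

# Entries under the old "release_highlight:v1:..." keys are never read by
# the new code; they simply expire, while the first read under v2 is a miss
# that repopulates the cache in the new format.
```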