Incident Review: 500 error when trying to access repos/projects for some users

Key Information

Metric                 | Value
Customers Affected     | 6,000 unique users, 15 reporting customers
Requests Affected      | 90,733 (link)
Incident Severity      | severity2
Start Time             | 09:05 UTC
End Time               | 10:27 UTC
Total Duration         | 1 hr 22 min
Link to Incident Issue | #18514 (closed)

Summary

Details

Outcomes/Corrective Actions

  1. Replace custom InfoRefsUploadPack's caching wit... (gitlab-org/gitaly#6371)

Learning Opportunities

What went well?

  1. We disabled a feature flag and the system recovered.

What was difficult?

  1. Understanding the root cause of the issue.

Review Guidelines

This review should be completed by the team that owns the service causing the alert. That team has the most context on what caused the problem and what information is needed for an effective fix. The EOC or IMOC may create this issue, but unless they are also on the service-owning team, they should assign someone from that team as the DRI.

For the person opening the Incident Review

  • Set the title to Incident Review: (Incident issue name)
  • Assign a Service::* label (most likely matching the one on the incident issue)
  • Set a Severity::* label which matches the incident
  • In the Key Information section, make sure to include a link to the incident issue
  • Find and assign a DRI from the team which owns the service (check their Slack channel or assign the team's manager). The DRI for the incident review is the issue assignee.
  • Announce the incident review in the incident channel on Slack.
:mega: @here An incident review issue was created for this incident with <USER> assigned as the DRI.
If you have any review feedback please add it to <ISSUE_LINK>.

For the assigned DRI

  • Fill in the remaining fields in the Key Information section, using the incident issue as a reference. Feel free to ask the EOC or other folks involved if anything is difficult to find.
  • If there are metrics showing Customers Affected or Requests Affected, link those metrics in those fields
  • Create a few short sentences in the Summary section summarizing what happened (TL;DR)
  • Use the Details section to write a few paragraphs explaining what happened
  • Link any corrective actions and describe any other actions or outcomes from the incident
  • Consider the implications for self-managed and Dedicated instances. For example, do any bug fixes need to be backported?
  • Add any appropriate labels based on the incident issue and discussions
  • Once discussion wraps up in the comments, summarize any takeaways in the Details section
  • Close the review before the due date
Edited by Max Orefice