2020-05-27: The `sidekiq` service (`main` stage) has an apdex score below SLO
Summary
The `sidekiq` service (`main` stage) has an apdex score below SLO.
Timeline
All times UTC.
2020-05-27
- 22:18 - Alert firing https://gitlab.pagerduty.com/incidents/PX1R0HO
- 22:38 - alejandro declares incident in Slack using the `/incident declare` command.
- 22:58 - Alert cleared without intervention https://gitlab.pagerduty.com/incidents/PX1R0HO
2020-05-28
- 00:27 - Alert firing https://gitlab.pagerduty.com/incidents/PD8A8YR/
- 00:32 - Alert cleared without intervention https://gitlab.pagerduty.com/incidents/PD8A8YR/
- 01:38 - Alert firing https://gitlab.pagerduty.com/incidents/PYWDU8B/
- 01:43 - Alert cleared without intervention https://gitlab.pagerduty.com/incidents/PYWDU8B/
- 02:31 - Alert firing https://gitlab.pagerduty.com/incidents/PVMTEIB
- 02:36 - Alert cleared without intervention https://gitlab.pagerduty.com/incidents/PVMTEIB
Incident Review
Summary
Over a period of two days, SREs on call received multiple PagerDuty alerts about the apdex of the `sidekiq` service in `gprd` falling below SLO. This was determined to be caused by a single queue, `authorized_projects`, receiving a massive spike of messages, temporarily making Sidekiq unable to perform within our apdex target. The spikes in the queue are caused by known functionality within GitLab that runs whenever someone changes project/group/user permissions in a large hierarchy, as sketched below.
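To make the fan-out concrete, here is a minimal sketch of the pattern, assuming a plain Sidekiq worker; the class and method names are illustrative, not GitLab's actual implementation.

```ruby
# Minimal sketch of the fan-out pattern (names are illustrative, not
# GitLab's actual implementation).
require 'sidekiq'

class AuthorizedProjectsWorker
  include Sidekiq::Worker
  sidekiq_options queue: :authorized_projects

  # Recalculates which projects a single user may access.
  def perform(user_id)
    # ... expensive recalculation of the user's project authorizations ...
  end
end

# A permission change on a large group/project hierarchy enqueues one job
# per affected member, so tens of thousands of jobs can land on the queue
# at once and push worker latency past the apdex threshold.
def refresh_authorizations(affected_user_ids)
  affected_user_ids.each do |user_id|
    AuthorizedProjectsWorker.perform_async(user_id)
  end
end
```

Bulk-enqueue helpers such as Sidekiq's `Sidekiq::Client.push_bulk` reduce Redis round-trips when scheduling, but the per-user work still arrives on the queue all at once.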
- Service(s) affected: Sidekiq
- Team attribution: sre-coreinfra
- Minutes downtime or degradation: Approximately 45 minutes over 2 days
Metrics
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
I believe all customers, external and internal, were affected.
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
During the periods of degradation, any actions relating to changing permissions on GitLab.com may have taken longer than normal to apply.
- How many customers were affected?
We did not receive any reports that any customers were affected
- If a precise customer impact number is unknown, what is the estimated potential impact?
Impact in terms of people affected is potentially quite wide, but the actual impact on end users' experience was likely minimal.
Incident Response Analysis
- How was the event detected?
Through PagerDuty alerts relating to apdex violations.
- How could detection time be improved?
I don't believe detection time could be improved. Comparing the actual spikes against the alert times, the alerts fired almost exactly when the problems started.
- How did we reach the point where we knew how to mitigate the impact?
We did not reach a point where we could mitigate the impact; we could only wait for it to subside.
- How could time to mitigation be improved?
As we were unable to mitigate it (GitLab code itself being the culprit), having any mechanism to mitigate this would be an improvement; one possible shape of such a mechanism is sketched below.
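As an illustration only, a mitigation could deduplicate refresh jobs per user while one is already pending. This is a hedged sketch using a Redis `SET NX` lock, not the approach GitLab ultimately shipped; the key name and TTL are assumptions.

```ruby
# Sketch: drop duplicate per-user refresh jobs while one is still pending.
# Key naming and the 10-minute TTL are assumptions for illustration.
require 'sidekiq'

class DedupedAuthorizedProjectsWorker
  include Sidekiq::Worker
  sidekiq_options queue: :authorized_projects

  LOCK_TTL = 10 * 60 # seconds

  # Enqueue only if no refresh for this user is already pending.
  def self.perform_async_deduped(user_id)
    acquired = Sidekiq.redis do |redis|
      # SET ... NX EX: returns false when the key already exists.
      redis.set("authorized_projects:pending:#{user_id}", 1, nx: true, ex: LOCK_TTL)
    end
    perform_async(user_id) if acquired
  end

  def perform(user_id)
    Sidekiq.redis { |redis| redis.del("authorized_projects:pending:#{user_id}") }
    # ... recalculate the user's project authorizations ...
  end
end
```

The key is deleted at the start of `perform`, so a permission change arriving mid-run still schedules a fresh refresh, and the TTL guards against jobs that die before deleting the key. A spike then collapses to at most one queued job per user.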
Post Incident Analysis
- How was the root cause diagnosed?
Once the problem queue was identified from Grafana dashboards, it was tribal knowledge and discussion between SREs that recognized this as a long-standing issue with GitLab's architecture.
- How could time to diagnosis be improved?
Ultimately we reached a point in the investigation where it was determined that fixing this issue requires code changes in GitLab itself. We reached this point by looking at the queue in question, understanding why there was such a large spike in the queue, and then determining through GitLab issues that this is known and in the process of being fixed.
Time to diagnosis could be improved by updating runbook documentation to explicitly point out that apdex violations caused by `authorized_projects` are well known (with a link to the issues/epics in question).
- Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident?
The GitLab project does; the most important is https://gitlab.com/gitlab-org/gitlab/-/issues/218383
- Was this incident triggered by a change (deployment of code or change to infrastructure. If yes, have you linked the issue which represents the change?)?
5 Whys
- Sidekiq queue processing slowed to the point that the service's apdex score fell below our SLO (the apdex formula is given after this list).
- This was because Sidekiq had a sudden, massive spike in the number of messages it needed to process, which our Sidekiq nodes could not keep up with at their current size and load.
- This was because the `authorized_projects` queue received a huge spike of messages to process very quickly.
- This was due to how GitLab code works; it was not an unknown event nor an unintentional side effect (it was intentional).
- This was because somewhere on GitLab.com a user changed user/group/project permissions for a large hierarchy. This puts a message onto the Sidekiq `authorized_projects` queue for every single permission that needs to be changed.
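For reference, the apdex score in the first "why" follows the standard Apdex definition over a window of measured jobs; GitLab's per-worker satisfied/tolerating latency thresholds are not reproduced here.

```math
\mathrm{apdex} = \frac{\mathit{satisfied} + \mathit{tolerating}/2}{\mathit{total}}
```

A burst of slow `authorized_projects` jobs inflates the tolerating and unsatisfied counts, dropping the score below the SLO until the queue drains.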
Lessons Learned
We should be prioritising the work in GitLab itself to ensure there are no known places where we actively violate our apdex. This also means that we in Infrastructure need to make Development sufficiently aware that we keep hitting these alerts and the problem keeps recurring.
Corrective Actions
- gitlab-org/gitlab#217637 (closed)
- https://gitlab.com/gitlab-org/gitlab/-/issues/218383
- gitlab-org/gitlab#218380 (closed)
- gitlab-org/gitlab#218379 (closed)