2020-05-27: The `sidekiq` service (`main` stage) has an apdex score below SLO
Summary
The `sidekiq` service (`main` stage) has an apdex score below SLO.
Timeline
All times UTC.
2020-05-27
- 22:18 - Alert firing https://gitlab.pagerduty.com/incidents/PX1R0HO
- 22:38 - alejandro declares incident in Slack using the `/incident declare` command.
- 22:58 - Alert cleared without intervention https://gitlab.pagerduty.com/incidents/PX1R0HO
2020-05-28
- 00:27 - Alert firing https://gitlab.pagerduty.com/incidents/PD8A8YR/
- 00:32 - Alert cleared without intervention https://gitlab.pagerduty.com/incidents/PD8A8YR/
- 01:38 - Alert firing https://gitlab.pagerduty.com/incidents/PYWDU8B/
- 01:43 - Alert cleared without intervention https://gitlab.pagerduty.com/incidents/PYWDU8B/
- 02:31 - Alert firing https://gitlab.pagerduty.com/incidents/PVMTEIB
- 02:36 - Alert cleared without intervention https://gitlab.pagerduty.com/incidents/PVMTEIB
Incident Review
Summary
Over a period of two days, SREs on call received multiple PagerDuty alerts about the apdex of the `sidekiq` service in `gprd` falling below SLO. This was determined to be caused by a single queue, `authorized_projects`, receiving a massive spike of messages, temporarily making Sidekiq unable to perform within our apdex target. The spikes in the queue are caused by known functionality within GitLab that runs whenever someone changes project/group/user permissions in a large hierarchy, as sketched below.
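To make the fan-out concrete, here is a minimal sketch of the pattern, assuming a plain Sidekiq worker; the class and method names are illustrative, not GitLab's actual implementation.

```ruby
# Minimal sketch of the fan-out pattern (names are illustrative, not
# GitLab's actual implementation).
require 'sidekiq'

class AuthorizedProjectsWorker
  include Sidekiq::Worker
  sidekiq_options queue: :authorized_projects

  # Recalculates which projects a single user may access.
  def perform(user_id)
    # ... expensive recalculation of the user's project authorizations ...
  end
end

# A permission change on a large group/project hierarchy enqueues one job
# per affected member, so tens of thousands of jobs can land on the queue
# at once and push worker latency past the apdex threshold.
def refresh_authorizations(affected_user_ids)
  affected_user_ids.each do |user_id|
    AuthorizedProjectsWorker.perform_async(user_id)
  end
end
```

Bulk-enqueue helpers such as Sidekiq's `Sidekiq::Client.push_bulk` reduce Redis round-trips when scheduling, but the per-user work still arrives on the queue all at once.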
- Service(s) affected: Sidekiq
- Team attribution: sre-coreinfra
- Minutes downtime or degradation: Approximately 45 minutes over 2 days
Metrics
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
I believe all customers, external and internal, were affected.
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
During the periods of degradation, any actions relating to changing permissions on GitLab.com may have taken longer than normal to apply.
- How many customers were affected?
We did not receive any reports that any customers were affected
- If a precise customer impact number is unknown, what is the estimated potential impact?
Impact in terms of people affected is potentially quite wide, but the actual impact on end users' experience was likely minimal.
Incident Response Analysis
- How was the event detected?
Through PagerDuty alerts relating to apdex violations.
- How could detection time be improved?
I don't believe detection time could be improved. Comparing the actual spikes against the alert times, the alerts fired almost exactly when the problems started.
- How did we reach the point where we knew how to mitigate the impact?
We did not reach a point where we could mitigate the impact; we could only wait for it to subside.
- How could time to mitigation be improved?
As we were unable to mitigate it (GitLab code itself being the culprit), having any mechanism to mitigate this would be an improvement; one possible shape of such a mechanism is sketched below.
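As an illustration only, a mitigation could deduplicate refresh jobs per user while one is already pending. This is a hedged sketch using a Redis `SET NX` lock, not the approach GitLab ultimately shipped; the key name and TTL are assumptions.

```ruby
# Sketch: drop duplicate per-user refresh jobs while one is still pending.
# Key naming and the 10-minute TTL are assumptions for illustration.
require 'sidekiq'

class DedupedAuthorizedProjectsWorker
  include Sidekiq::Worker
  sidekiq_options queue: :authorized_projects

  LOCK_TTL = 10 * 60 # seconds

  # Enqueue only if no refresh for this user is already pending.
  def self.perform_async_deduped(user_id)
    acquired = Sidekiq.redis do |redis|
      # SET ... NX EX: returns false when the key already exists.
      redis.set("authorized_projects:pending:#{user_id}", 1, nx: true, ex: LOCK_TTL)
    end
    perform_async(user_id) if acquired
  end

  def perform(user_id)
    Sidekiq.redis { |redis| redis.del("authorized_projects:pending:#{user_id}") }
    # ... recalculate the user's project authorizations ...
  end
end
```

The key is deleted at the start of `perform`, so a permission change arriving mid-run still schedules a fresh refresh, and the TTL guards against jobs that die before deleting the key. A spike then collapses to at most one queued job per user.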
Post Incident Analysis
- How was the root cause diagnosed?
Once the problem queue was identified from Grafana dashboards, it was tribal knowledge and discussion between SREs that recognized this as a long-standing issue with GitLab's architecture.
- How could time to diagnosis be improved?
Ultimately we reached a point in the investigation where it was determined that fixing this issue requires code changes in GitLab itself. We reached this point by looking at the queue in question, understanding why there was such a large spike in the queue, and then determining through GitLab issues that this is known and in the process of being fixed.
Time to diagnosis could be improved by updating runbook documentation to explicitly point out that apdex violations caused by `authorized_projects` are well known (with a link to the issues/epics in question).
- Do we have an existing backlog item that would've prevented or greatly reduced the impact of this incident?
The GitLab project does; the most important is https://gitlab.com/gitlab-org/gitlab/-/issues/218383
- Was this incident triggered by a change (deployment of code or change to infrastructure. If yes, have you linked the issue which represents the change?)?
5 Whys
- Sidekiq queue processing slowed to the point that the service's apdex score fell below our SLO (the apdex formula is given after this list).
- This was because Sidekiq had a sudden, massive spike in the number of messages it needed to process, which our Sidekiq nodes could not keep up with at their current size and load.
- This was because the `authorized_projects` queue received a huge spike of messages to process very quickly.
- This was due to how GitLab code works; it was not an unknown event nor an unintentional side effect (it was intentional).
- This was because somewhere on GitLab.com a user changed user/group/project permissions for a large hierarchy. This puts a message onto the Sidekiq `authorized_projects` queue for every single permission that needs to be changed.
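For reference, the apdex score in the first "why" follows the standard Apdex definition over a window of measured jobs; GitLab's per-worker satisfied/tolerating latency thresholds are not reproduced here.

```math
\mathrm{apdex} = \frac{\mathit{satisfied} + \mathit{tolerating}/2}{\mathit{total}}
```

A burst of slow `authorized_projects` jobs inflates the tolerating and unsatisfied counts, dropping the score below the SLO until the queue drains.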
Lessons Learned
We should be prioritising the work in GitLab itself to ensure there are no known places where we actively violate our apdex. This also means that we in Infrastructure need to make Development sufficiently aware that we keep hitting these alerts and the problem keeps recurring.
Corrective Actions
- gitlab-org/gitlab#217637 (closed)
- https://gitlab.com/gitlab-org/gitlab/-/issues/218383
- gitlab-org/gitlab#218380 (closed)
- gitlab-org/gitlab#218379 (closed)