RCA: 2020-02-12: The elastic_indexer Sidekiq queue (main stage) is not meeting its latency SLOs

Summary

Our default setting of sidekiq db connection pool size to be equal to max_concurrency was making it likely to have worker threads competing for db connections on nodes with a very low max_concurrency. As a result, we were seeing some jobs timing out on getting a db connection on those queues.

Service(s) affected : ServiceSidekiq
Team attribution :
Minutes downtime or degradation :

For calculating duration of event, use the Platform Metrics Dashboard to look at appdex and SLO violations.

Impact & Metrics

Start with the following:

What was the impact of the incident? (i.e. service outage, sub-service brown-out, exposure of sensitive data, ...)
Who was impacted by this incident? (i.e. external customers, internal customers, specific teams, ...)
How did the incident impact customers? (i.e. preventing them from doing X, incorrect display of Y, ...)
How many attempts were made to access the impacted service/feature?
How many customers were affected?
How many customers tried to access the impacted service/feature?

Include any additional metrics that are of relevance.

Provide any relevant graphs that could help understand the impact of the incident and its dynamics.

Detection & Response

Start with the following:

How was the incident detected?
Did alarming work as expected?
How long did it take from the start of the incident to its detection?
How long did it take from detection to remediation?
Were there any issues with the response to the incident? (i.e. bastion host used to access the service was not available, relevant team memeber wasn't page-able, ...)

Root Cause Analysis

The purpose of this document is to understand the reasons that caused an incident, and to create mechanisms to prevent it from recurring in the future. A root cause can never be a person, the way of writing has to refer to the system and the context rather than the specific actors.

Follow the "5 whys" in a blameless manner as the core of the root-cause analysis.

For this it is necessary to start with the incident, and question why it happened. Keep iterating asking "why?" 5 times. While it's not a hard rule that it has to be 5 times, it helps to keep questions get deeper in finding the actual root cause.

Keep in min that from one "why?" there may come more than one answer, consider following the different branches.

Example of the usage of "5 whys"

The vehicle will not start. (the problem)

Why? - The battery is dead.
Why? - The alternator is not functioning.
Why? - The alternator belt has broken.
Why? - The alternator belt was well beyond its useful service life and not replaced.
Why? - The vehicle was not maintained according to the recommended service schedule. (Fifth why, a root cause)

What went well

Start with the following:

Identify the things that worked well or as expected.
Any additional call-outs for what went particularly well.

What can be improved

Start with the following:

Using the root cause analysis, explain what can be improved to prevent this from happening again.
Is there anything that could have been done to improve the detection or time to detection?
Is there anything that could have been done to improve the response or time to response?
Is there an existing issue that would have either prevented this incident or reduced the impact?
Did we have any indication or beforehand knowledge that this incident might take place?

Corrective actions

From https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/9198#note_287817712

We have overridden the minimum connection pool size in gitlab.rb for the elasticsearch and export priorities and this has resolved the issue: https://ops.gitlab.net/gitlab-cookbooks/chef-repo/-/merge_requests/2668
The database connection pool should always have a minimum number pool size greater than some value (perhaps 5?) cc @mkaeppler @ayufan scalability#155
For higher concurrencies, it would be worth considering adding a few extra connections to the pool than the concurrency, as an insurance against deadlock cc @mkaeppler @ayufan scalability#155
We should report, through prometheus, connection pool utilisation in a process (on an interval, eg every 10 seconds?) - as a gauge metric. We should also record maximum pool size, so that we can add connection pool saturation as a saturation metric cc @mkaeppler @ayufan (scalability#153)
Track down the cause of the exception rescues in the elastic indexer workers cc @DylanGriffith gitlab-org/gitlab#205640 (closed)
Do not re-report old Sidekiq errors in future invocations gitlab-org/gitlab!25161 (merged)
Do no mask errors behind Sidekiq::JobRetry::Skip exceptions
Reindex lost ES index jobs cc @DylanGriffith - https://gitlab.com/gitlab-com/gl-infra/production/issues/1666

Guidelines

Edited Feb 20, 2020 by Henri Philipps