RCA Job queue pipeline_processing:pipeline_process is growing

Summary

CI jobs took a long time to complete because jobs piled up in the pipeline_processing:pipeline_process Sidekiq queue. Two pipelines generated a large number of Sidekiq jobs, the Sidekiq pipeline nodes maxed out their CPU, the pipeline_processing jobs issued many SQL calls, and the pgbouncer pool for Sidekiq became saturated.

RCA doc: https://docs.google.com/document/d/15UPwfmUFVmx6jtghlUoOod3JAGVa4BRycNqFA1OSrjs/edit#

Service(s) affected : ~"Service:Sidekiq"
Team attribution :
Minutes downtime or degradation : 240

Impact & Metrics

Start with the following:

What was the impact of the incident?
- CI jobs were delayed.
Who was impacted by this incident?
- All customers' CI pipelines.
How did the incident impact customers?
- It prevented them from running CI tests and deploys.
How many attempts were made to access the impacted service/feature?
How many customers were affected?
How many customers tried to access the impacted service/feature?

Include any additional metrics that are of relevance.

Provide any relevant graphs that could help understand the impact of the incident and its dynamics.

Detection & Response

Start with the following:

How was the incident detected?
- Support reported customer issues with CI pipelines.
Did alarming work as expected?
- We got Sidekiq single_node_cpu alerts and pgbouncer connection_pool saturation alerts, but no pages. There was no alert for queue size, which would have been a clear indication of the issue.
How long did it take from the start of the incident to its detection?
- 80m from the queue starting to rise until the first alert for Sidekiq CPU.
How long did it take from detection to remediation?
- 240m
Were there any issues with the response to the incident? (e.g. bastion host used to access the service was not available, relevant team member wasn't page-able, ...)
- The EOC became aware of the incident through reports from customer support, not from being paged for alerts.
- It was hard to find someone to help with the issue.
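
The missing queue-size alert noted above could look roughly like the following. This is a minimal sketch, not the alerting rule that was actually deployed: the `QueueAlertRule` type, the `should_alert` function, the threshold of 1000 jobs, and the sample values are all hypothetical and chosen only for illustration. The idea is to fire when the queue stays above a threshold for several consecutive samples, so that a short burst does not trigger an alert but sustained growth does.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class QueueAlertRule:
    """Hypothetical alert rule; thresholds are illustrative, not incident data."""
    max_size: int           # fire when the queue stays above this size...
    sustained_samples: int  # ...for this many consecutive samples


def should_alert(sizes: Sequence[int], rule: QueueAlertRule) -> bool:
    """Return True if the most recent `sustained_samples` sizes all exceed `max_size`."""
    if rule.sustained_samples <= 0:
        raise ValueError("sustained_samples must be positive")
    if len(sizes) < rule.sustained_samples:
        return False  # not enough history yet to decide
    recent = sizes[-rule.sustained_samples:]
    return all(size > rule.max_size for size in recent)


# Made-up queue-size samples, e.g. one per minute.
rule = QueueAlertRule(max_size=1000, sustained_samples=3)
steady = [120, 90, 1500, 200, 150]      # a single spike that drains again
growing = [120, 900, 1800, 4200, 9000]  # a queue that keeps piling up

print(should_alert(steady, rule))
print(should_alert(growing, rule))
```

Requiring several consecutive samples trades a few minutes of detection time for fewer false alarms. Given the 80m gap reported above, even a conservative window would likely have surfaced the problem much earlier.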

Root Cause Analysis

The purpose of this document is to understand the reasons that caused an incident, and to create mechanisms to prevent it from recurring in the future. A root cause can never be a person; the write-up has to refer to the system and the context rather than the specific actors.

Follow the "5 whys" in a blameless manner as the core of the root-cause analysis.

For this it is necessary to start with the incident and question why it happened. Keep iterating, asking "why?" 5 times. While it's not a hard rule that it has to be 5 times, it helps the questions dig deeper toward the actual root cause.

Keep in mind that one "why?" may produce more than one answer; consider following the different branches.

Example of the usage of "5 whys"

The vehicle will not start. (the problem)

Why? - The battery is dead.
Why? - The alternator is not functioning.
Why? - The alternator belt has broken.
Why? - The alternator belt was well beyond its useful service life and not replaced.
Why? - The vehicle was not maintained according to the recommended service schedule. (Fifth why, a root cause)

What went well

Start with the following:

Identify the things that worked well or as expected.
Any additional call-outs for what went particularly well.

What can be improved

Start with the following:

Using the root cause analysis, explain what can be improved to prevent this from happening again.
Is there anything that could have been done to improve the detection or time to detection?
Is there anything that could have been done to improve the response or time to response?
Is there an existing issue that would have either prevented this incident or reduced the impact?
Did we have any indication or beforehand knowledge that this incident might take place?

Corrective actions

List issues that have been created as corrective actions from this incident.
For each issue, include the following:
- Issue labeled as corrective action.
- Include an estimated date of completion of the corrective action.
- Include the named individual who owns the delivery of the corrective action.

- Increase CPU for Sidekiq nodes: production#997 (closed)
- Review pgbouncer pool config: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7403
- Optimize PipelineProcessWorker: https://gitlab.com/gitlab-org/gitlab-ce/issues/65414
- Deduplicate Sidekiq jobs: gitlab-org/gitlab#30585 (closed)
- Define Sidekiq SLOs: gitlab-org/gitlab#30174 (closed)
- Simplify Sidekiq setup: https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7219
- Improve Sidekiq observability
- Prevent customers from causing platform issues by adding per-client limits in all places
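
To illustrate the job deduplication item above, here is a minimal in-memory sketch. It does not show how Sidekiq or gitlab-org/gitlab#30585 implement deduplication. The `DedupQueue` class and the pipeline IDs are hypothetical, and the sketch assumes that processing a pipeline is idempotent, so one pending job per pipeline is enough no matter how many times it is requested.

```python
from collections import deque
from typing import Deque, Optional, Set


class DedupQueue:
    """Toy queue that holds at most one pending job per pipeline ID (hypothetical)."""

    def __init__(self) -> None:
        self._jobs: Deque[int] = deque()  # pipeline IDs in FIFO order
        self._pending: Set[int] = set()   # IDs that already have a queued job

    def enqueue(self, pipeline_id: int) -> bool:
        """Queue a job unless one for the same pipeline is already waiting."""
        if pipeline_id in self._pending:
            return False  # duplicate: dropped instead of growing the queue
        self._pending.add(pipeline_id)
        self._jobs.append(pipeline_id)
        return True

    def dequeue(self) -> Optional[int]:
        """Remove and return the oldest pipeline ID, or None if the queue is empty."""
        if not self._jobs:
            return None
        pipeline_id = self._jobs.popleft()
        # Once a job is picked up, a new one for the same pipeline may be queued.
        self._pending.discard(pipeline_id)
        return pipeline_id

    def __len__(self) -> int:
        return len(self._jobs)


# Five requests for pipeline 42 and one for pipeline 7 leave two queued jobs.
queue = DedupQueue()
accepted = [queue.enqueue(pid) for pid in (42, 42, 42, 7, 42, 42)]
print(accepted, len(queue))
```

With this shape, a pipeline that triggers many processing requests in a short time adds only one entry to the queue, which limits how far a small number of pipelines can inflate the backlog.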

Guidelines

Edited Apr 20, 2020 by Rachel Nienaber