2022-04-25: Sidekiq apdex drop
Incident Roles:
- Incident Manager/DRI: @afappiano
- EOC: @ahanselka
- CMOC: @gerardo
Current Status: Incident Mitigated
After further investigation, this appears to be related to OOM kills on GKE nodes. Each GKE pod has a consul agent that, under certain circumstances, will generate a spike in memory demand. If that memory is not available, the consul process will be rapidly and repeatedly killed. As @msmiley points out in his excellent analysis:
"Without a working consul
agent, the rails pods (web
, api
, git
, websockets
) repeatedly tried and failed to discover which postgres dbs were available as healthy up-to-date replicas. Without that service discovery working, the existing rails pods had to fallback to sending all of their queries to the last known primary db."
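To make the failure mode concrete, here is a minimal, purely illustrative Python sketch of the discovery-and-fallback behaviour described above. The Consul service name, agent address, and fallback endpoint are assumptions made for the illustration; this is not GitLab's actual Rails code.

```python
# Hypothetical illustration of the behaviour described above; service name,
# addresses, and fallback handling are assumptions, not production code.
import requests

CONSUL_AGENT = "http://127.0.0.1:8500"        # local consul agent (assumed address)
LAST_KNOWN_PRIMARY = "10.0.0.10:5432"         # placeholder for the cached primary endpoint


def healthy_replicas(service="postgres-replica"):
    """Ask the local consul agent for passing instances of the replica service."""
    try:
        resp = requests.get(
            f"{CONSUL_AGENT}/v1/health/service/{service}",
            params={"passing": "true"},
            timeout=2,
        )
        resp.raise_for_status()
        return [
            f"{entry['Service']['Address']}:{entry['Service']['Port']}"
            for entry in resp.json()
        ]
    except requests.RequestException:
        # Agent OOM-killed or unreachable: service discovery fails entirely.
        return []


def pick_read_endpoint():
    replicas = healthy_replicas()
    if replicas:
        return replicas[0]
    # Without working discovery, every read falls back to the last known
    # primary, which is what saturated the main database in this incident.
    return LAST_KNOWN_PRIMARY
```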
As a temporary fix, memory limits on the pods have been raised from 500MB to 1GB. This should give the GKE pods enough memory to keep the consul agents running even when memory demand spikes. We'll continue to monitor and observe the health of the system now that these limits have been raised.
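For illustration only, a minimal sketch of what the mitigation amounts to at the Kubernetes level, assuming the consul agent runs as a sidecar container named `consul` in a deployment named `gitlab-webservice` in the `gitlab` namespace (all assumed names); in practice the change would go through the usual configuration pipeline rather than an ad-hoc patch.

```python
# Sketch only: bump the consul sidecar's memory limit from 500Mi to 1Gi.
# Deployment, namespace, and container names below are assumptions.
from kubernetes import client, config

config.load_kube_config()                     # or config.load_incluster_config()
apps = client.AppsV1Api()

patch = {
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "consul",                            # assumed sidecar name
                        "resources": {"limits": {"memory": "1Gi"}},  # was 500Mi
                    }
                ]
            }
        }
    }
}

apps.patch_namespaced_deployment(
    name="gitlab-webservice",                 # assumed deployment name
    namespace="gitlab",                       # assumed namespace
    body=patch,
)
```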
Beyond that, we need to investigate the consul memory demand spikes more thoroughly. We should attempt to understand whether these spikes are expected behavior or the result of our particular configuration.
Next steps:
- Monitor and observe the consul agent on GKE pods today to understand if raising the memory limit from 500MB to 1GB is enough to keep things healthy (see the monitoring sketch below).
- Begin investigation into consul memory demand spikes to see what, if anything, can be done to improve this beyond just raising the hard memory limit.
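As a rough aid for the first next step, here is a hedged Python sketch that checks how close each consul sidecar's working-set memory is to the new 1GB limit via the Prometheus HTTP API; the Prometheus URL and the `container="consul"` label selector are assumptions about the local setup.

```python
# Rough check: is any consul sidecar approaching the new 1Gi memory limit?
# The Prometheus URL and the container label value are assumptions.
import requests

PROMETHEUS = "http://prometheus.example.internal:9090"   # placeholder URL
LIMIT_BYTES = 1 * 1024**3                                # new 1Gi limit
QUERY = 'max by (pod) (container_memory_working_set_bytes{container="consul"})'

resp = requests.get(f"{PROMETHEUS}/api/v1/query", params={"query": QUERY}, timeout=10)
resp.raise_for_status()

for sample in resp.json()["data"]["result"]:
    pod = sample["metric"].get("pod", "unknown")
    used = float(sample["value"][1])
    pct = 100 * used / LIMIT_BYTES
    flag = "WARN" if pct > 80 else "ok"
    print(f"{flag:4} {pod}: {used / 1024**2:.0f} MiB ({pct:.0f}% of limit)")
```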
For customers believed to be affected by this incident, please subscribe to this issue or monitor our status page for further updates.
Summary for CMOC notice / Exec summary:
- Customer Impact: intermittent site slowness
- Service Impact: The whole of gitlab.com
- Impact Duration: intermittent site slowness on 2022-04-25 and 2022-04-26
- Root cause: consul process failing on GKE pods as a result of OOM errors.
Timeline
Recent Events (available internally only):
- Deployments
- Feature Flag Changes
- Infrastructure Configurations
- GCP Events (e.g. host failure)
- Gitlab.com Latest Updates
All times UTC.
2022-04-25
- 15:13 - @engwan declares incident in Slack.
- 15:15 - @ahanselka and @afappiano discuss and ask @igorwwwwwwwwwwwwwwwwwwww to join the zoom.
- 15:26 - First notification sent out from CMOC.
- 15:32 - @afappiano escalates to @dbre for some DB assistance but no one is available.
- 15:34 - @msmiley joins to assist.
- 15:36 - @fzimmer joins and reports that pgbouncer changes were made on 2022-04-21.
- 15:43 - Second notification sent out from CMOC.
- 15:45 - @fzimmer asks @ayufan to join; @ayufan begins discussing potential causes of the issue with @igorwwwwwwwwwwwwwwwwwwww and @msmiley.
- 15:50 - Investigation continues as the team tries to determine if the pgbouncer change can be reverted safely.
- 16:01 - Team starts to look at other deploys that may have contributed to this problem.
- 16:03 - @ahanselka looks into the state of the sidekiq queue. Determines that we had a bad queue at 15:33, but things have recovered (in terms of the sidekiq queue).
- 16:06 - @msmiley suggests that we revert to the larger pgbouncer pool sizes that were in place previously.
- 16:08 - @igorwwwwwwwwwwwwwwwwwwww notes that he is in favor of raising these as suggested but notes we also need to do something similar for the CI pgbouncer.
- 16:10 - @fzimmer and @ayufan note that a rollback of a feature flag would be safer.
- 16:11 - @igorwwwwwwwwwwwwwwwwwwww suggests that we do the feature flag rollback but also adjust the pool size after the flag has been rolled back.
- 16:13 - A path forward is identified:
  - alter user gitlab to raise the per-user connection limit
  - create an MR to raise pgbouncer backend pools for the main pgbouncer
  - consider rolling the feature flag percentage back from 50%
- 16:17 - The safety and efficacy of the proposed changes are discussed.
- 16:19 - @igorwwwwwwwwwwwwwwwwwwww mentions that the thing to watch out for once these changes are applied is overall load on the main DB.
- 16:23 - @fzimmer summarizes the current working hypothesis: saturation of connections was caused by the reduction of pool size introduced by https://gitlab.com/gitlab-com/gl-infra/chef-repo/-/merge_requests/1679. Course of action: increase the pool size on the main database. The main risk here is increasing CPU load on the primary node of the main database. This was already suggested via https://gitlab.com/gitlab-com/gl-infra/reliability/-/issues/15628. Following this, we'll monitor the CI saturation. If it decreases after the pool size increase, this may have been caused by contention. Otherwise, we can reduce the percentage rollout feature flag for CI phase 4. We must be mindful that this will shift load back from the CI database to main.
- 16:29 - @msmiley notes that we may not be as saturated as we think, as the metric is slightly incorrect.
- 16:29 - Third notification sent out by CMOC.
- 16:36 - Folks working on the other incident determine that their problem is a result of this incident.
- 17:07 - Fourth notification sent out by CMOC.
- 17:08 - The status of the incident is changed to mitigated based on the fact that saturation levels are much lower and site slowness is no longer occurring.
- 17:26 - @msmiley applies the changes.
- 17:47 - @msmiley confirms that the changes have been applied.
- 17:47 - Fifth and final notification sent out by CMOC.
- 17:48 - @afappiano leaves the incident as mitigated but not resolved, as we want to continue to observe the system.
2022-04-26
- 01:21 - Monitoring detects an elevated error rate.
- 01:26 - @cindy declares an incident in Slack. This was initially thought to be a separate issue but, after some analysis, is actually related.
Takeaways
- ...
Corrective Actions
Same root cause as #6910 (closed), so the corrective actions are the same.
Note: In some cases we need to redact information from public view. We only do this in a limited number of documented cases. This might include the summary, timeline, or any other bits of information, as laid out in our handbook page. Any of this confidential data will be in a linked issue, only visible internally. By default, all information we can share will be public, in accordance with our transparency value.
Incident Review
- Ensure that the exec summary is completed at the top of the incident issue, the timeline is updated, and relevant graphs are included in the summary
- If there are any corrective action items mentioned in the notes on the incident, ensure they are listed in the "Corrective Actions" section
- Fill out the relevant sections below or link to the meeting review notes that cover these topics
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - All GitLab.com users
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - Workloads that involved async processing would have been unusually delayed (e.g. pipelines take longer to be created, import/export takes longer to be processed)
- How many customers were affected?
  - All GitLab.com users during the interval
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
  - ...
What were the root causes?
- Same as #6909 (closed). Consul failures meant Rails could not discover replicas and sent read traffic to the primary, saturating it and delaying sidekiq processing.
Incident Response Analysis
- How was the incident detected?
  - ...
- How could detection time be improved?
  - ...
- How was the root cause diagnosed?
  - ...
- How could time to diagnosis be improved?
  - ...
- How did we reach the point where we knew how to mitigate the impact?
  - ...
- How could time to mitigation be improved?
  - ...
- What went well?
  - ...
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - ...
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - ...
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  - ...
What went well?
- ...
Guidelines
Resources
- If the Situation Zoom room was utilised, the recording will be automatically uploaded to the Incident room Google Drive folder (private)