RCA: Degraded performance due to redis-cache overload
Summary
Starting July 1st at 08:00 UTC, we saw degraded performance and elevated 500 errors on Web and API, as well as delayed CI jobs. The immediate root cause was CPU saturation on the redis-cache primary, driven by a large number of expensive calls to redis-cache from the application.
Service(s) affected : ~"Service:Web"
Team attribution :
Minutes downtime or degradation : 540m, based on the web latency Apdex being below 95%
Impact & Metrics
- What was the impact of the incident?
  - Degraded performance and an elevated error rate on the Web and API components, and delayed CI jobs.
- Who was impacted by this incident?
  - All users of GitLab.com, mostly during EMEA business hours.
- How did the incident impact customers?
  - Slow-loading pages, 500 errors, and delayed CI jobs and pull mirrors.
Detection & Response
- How was the incident detected?
  - PagerDuty alert on `GitLabComLatencyWebCritical`
- Did alerting work as expected?
  - Yes.
- How long did it take from the start of the incident to its detection?
  - 5 minutes.
- How long did it take from detection to remediation?
  - 27 hours, until a patch eliminating the heavy application-config requests to Redis was deployed.
- Were there any issues with the response to the incident? (e.g. the bastion host used to access the service was not available, a relevant team member wasn't pageable, ...)
  - We should have detected much earlier that redis-cache was slowly becoming saturated.
Timeline
2019-07-01
- 07:56 UTC - connections queueing up at unicorn workers, latencies rise for web and api
- 08:01 UTC - PagerDuty alert on `GitLabComLatencyWebCritical`
- 08:05 UTC - Alert acknowledged by SRE on call
- 08:15 UTC - Job queue durations rise
- 08:54 UTC - Incident issue 928 opened
- 09:06 UTC - status.io incident opened
- 09:56 UTC - status.io update: "We are adding more workers..."
- 10:30 UTC - 4 new api and 4 web workers added to LBs https://ops.gitlab.net/gitlab-cookbooks/chef-repo/merge_requests/1338
- 11:24 UTC - support reports stuck CI jobs for customers (https://gitlab.zendesk.com/agent/tickets/125409)
- 12:09 UTC - new incident issue 929 opened for reports of delayed CI runners
- 12:11 UTC - tweet "jobs on shared runners being picked up at a low rate or appear being stuck..."
- 13:14 UTC - status.io update acknowledging CI pipeline delays
- 13:15 UTC - incident issue 929 closed again, as it is covered by 928
- 13:51 UTC - status.io update: "continue to investigate...", announcing incident issue URL
- 14:20 UTC - the additional workers were removed again to reduce the number of connections to redis-cache
- 16:51 UTC - status.io update: status changed to "monitoring", "CI jobs are catching up..."
- 18:11 UTC - status.io update: "back to normal levels..."
- 19:40 UTC - status.io incident resolved
2019-07-02
- 09:45 UTC - kernel update and reboot of redis-cache-03
- 10:06 UTC - unexpected failover to redis-cache-01
- 10:50 UTC - redis-cache-02 kernel upgrade and reboot
- 11:22 UTC - unexpected failover to redis-cache-02
- 11:20 UTC - patch eliminating application config requests to redis-cache deployed: https://ops.gitlab.net/gitlab-com/gl-infra/patcher/merge_requests/113 (https://gitlab.com/gitlab-org/gitlab-ee/merge_requests/14500)
- 11:20 UTC - CPU usage drops to 85%, network from 300Mb/s to under 100Mb/s, all metrics improve
- 12:15 UTC - redis-cache-01 kernel update and reboot
Root Cause Analysis
The web component had slower response times.
- Why? - Redis-cache had slower response times.
- Why? - Redis-cache was saturating its CPU.
- Why? - The application issued too many, and too expensive, requests to Redis.
- Why? - There was no awareness of, or testing for, how many Redis-cache requests the application would generate and how expensive they would be.
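To make the failure mode concrete: a single cheap-looking `Rails.cache.fetch` in a hot code path becomes one Redis round trip per request, and at GitLab.com's request volume that alone can saturate the cache primary. Below is a minimal sketch of the pattern and of the process-local fix; the class and key names are hypothetical, and this is not the actual GitLab patch.

```ruby
# Illustrative sketch -- hypothetical names, not the actual GitLab patch.
require 'active_support/cache'

# Before: every request pays a Redis round trip (plus deserialization of a
# large payload) for a rarely-changing settings blob.
def settings_via_redis
  Rails.cache.fetch('application_settings', expires_in: 1.minute) do
    ApplicationSetting.current # expensive query, but the result rarely changes
  end
end

# After: front the shared Redis cache with a per-process in-memory cache, so
# each Rails worker asks Redis at most once per minute instead of per request.
PROCESS_CACHE = ActiveSupport::Cache::MemoryStore.new

def settings_via_process_memory
  PROCESS_CACHE.fetch('application_settings', expires_in: 1.minute) do
    settings_via_redis
  end
end
```

The trade-off is up to a minute of staleness per worker process, which is acceptable for rarely-changing settings.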
What went well
- Alerting worked: we became aware of the web performance issues immediately.
- A lot of support from across engineering in finding the root cause and working on several remediations.
What can be improved
- detection of Redis performance issues (or, more generally, detecting saturation of a service/system)
- trend analysis and capacity planning
- finding the root cause of performance degradations - we sometimes don't follow up on degradations if they resolve on their own and we don't see a direct root cause at first sight, but they might be an indication of a deeper issue or trend.
Corrective actions
per @andrewn:
- Start monitoring various saturation metrics and add per-service SLOs: gitlab-com/runbooks!1188 (merged)
- Distributed tracing instrumentation of Rails caching: https://gitlab.com/gitlab-org/labkit-ruby/merge_requests/12
- Distributed tracing instrumentation of Redis calls: https://gitlab.com/gitlab-org/labkit-ruby/issues/2
- Discuss adding N+1-style limits on Redis calls in development and testing environments (no issue yet)
- Discuss adding size limits on Redis keys stored in the cache (no issue yet)
- Stop caching JUnit files in Redis: https://gitlab.com/gitlab-org/gitlab-ce/issues/64035
- Monitor cache misuse of Redis by application teams, e.g. with `redis-cli --bigkeys`
- Add a `redis_duration_ms` field to our Rails+API structured logs (no issue yet; a sketch of one possible approach follows this list)
- Add documentation on how to monitor Redis instances: gitlab-com/runbooks!1187 (merged)
- Consider breaking our Redis instances down further than the current persistent/cache pair (for example CI-cache, MergeRequest-cache, etc.)
- Discuss the possibility of moving to Redis Cluster or managed Redis instances (e.g. Redis Labs) (no issue yet)
- Use cached markdown fields for calculating participants: https://gitlab.com/gitlab-org/gitlab-ce/issues/63967
- Band-aid: disable JUnit reports: https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/30254
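For the `redis_duration_ms` item above, here is a minimal sketch of one way to accumulate Redis call time per request. It assumes redis-rb 4.x, where `Redis::Client#call` dispatches every command; the module name and thread-local field are ours, not an agreed implementation.

```ruby
# Illustrative sketch only -- assumes redis-rb 4.x, where Redis::Client#call
# dispatches every Redis command.
require 'redis'

module RedisDurationInstrumentation
  def call(command)
    start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    super
  ensure
    elapsed_ms = (Process.clock_gettime(Process::CLOCK_MONOTONIC) - start) * 1000.0
    # Accumulate per thread; a request-scoped store would let the total be
    # emitted as a redis_duration_ms field in the request's structured log line.
    Thread.current[:redis_duration_ms] =
      (Thread.current[:redis_duration_ms] || 0.0) + elapsed_ms
  end
end

Redis::Client.prepend(RedisDurationInstrumentation)
```

A structured-logging hook (e.g. Lograge's `custom_options`) could then emit and reset the accumulated value once per request.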
per @stanhu:
- Move Flipper caching away from Redis to an in-memory cache: https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/30276 (a generic sketch of this pattern follows the list)
- Move Geo checks away from Redis to an in-memory cache: https://gitlab.com/gitlab-org/gitlab-ee/merge_requests/14513
- Add Redis details to the Peek performance bar: https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/30191
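The first two MRs share one pattern: answer repeated boolean checks from process memory instead of Redis. A generic sketch of that pattern (this wrapper is illustrative, not GitLab's code):

```ruby
# Illustrative sketch of the general pattern behind both MRs -- not GitLab's code.
require 'active_support/cache'

class InMemoryCheckCache
  def initialize(ttl: 60)
    @cache = ActiveSupport::Cache::MemoryStore.new
    @ttl = ttl
  end

  # Answers repeated boolean checks (feature flags, Geo checks) from process
  # memory, consulting the Redis-backed source at most once per TTL per key.
  def enabled?(key)
    @cache.fetch("check:#{key}", expires_in: @ttl) { yield }
  end
end

FLAG_CACHE = InMemoryCheckCache.new(ttl: 60)
# Usage (hypothetical call): FLAG_CACHE.enabled?(:geo) { redis_backed_geo_check }
```

The cost is up to TTL seconds of staleness after a flag flips, in exchange for removing a Redis round trip from every check.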
per @rymai:
- Compress `Rails.cache` payloads that are bigger than a certain threshold (a configuration sketch follows)
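ActiveSupport's cache stores already support transparent compression, so this could plausibly be a configuration change. A minimal sketch assuming Rails 5.2+'s built-in `:redis_cache_store`; GitLab's actual cache store, URL variable, and threshold may differ:

```ruby
# config/environments/production.rb -- illustrative values, not GitLab's config.
config.cache_store = :redis_cache_store, {
  url: ENV['REDIS_CACHE_URL'],       # assumed environment variable
  compress: true,                    # compress payloads before writing to Redis
  compress_threshold: 16.kilobytes   # only entries larger than this are compressed
}
```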
per @bjk-gitlab:
- Clean up and improve the Redis cache metrics to be more useful: https://gitlab.com/gitlab-org/gitlab-ce/issues/64064