RCA: Degraded performance due to redis-cache overload
Summary
Starting July 1st at 08:00 UTC, we saw degraded performance and elevated 500 errors on Web and API, as well as delayed CI jobs. The immediate root cause was CPU saturation on the redis-cache primary, driven by a large number of expensive calls to redis-cache from the application.
Service(s) affected : ~"Service:Web"
Team attribution :
Minutes downtime or degradation : 540m, based on the web latency Apdex being below 95%
Impact & Metrics
- What was the impact of the incident?
  - Degraded performance and an elevated error rate on the Web and API components, and delayed CI jobs.
- Who was impacted by this incident?
  - All users of GitLab.com, mostly during EMEA business hours.
- How did the incident impact customers?
  - Slow-loading pages, 500 errors, and delayed CI jobs and pull mirrors.
Detection & Response
- How was the incident detected?
  - PagerDuty alert on `GitLabComLatencyWebCritical`
- Did alerting work as expected?
  - Yes.
- How long did it take from the start of the incident to its detection?
  - 5 minutes.
- How long did it take from detection to remediation?
  - 27 hours, until a patch eliminating the heavy application-config requests to Redis was deployed.
- Were there any issues with the response to the incident? (e.g. the bastion host used to access the service was not available, a relevant team member wasn't pageable, ...)
  - We should have detected much earlier that redis-cache was slowly becoming saturated.
Timeline
2019-07-01
- 07:56 UTC - connections queueing up at unicorn workers, latencies rise for web and api
- 08:01 UTC - PagerDuty alert on `GitLabComLatencyWebCritical`
- 08:05 UTC - Alert acknowledged by SRE on call
- 08:15 UTC - Job queue durations rise
- 08:54 UTC - Incident issue 928 opened
- 09:06 UTC - status.io incident opened
- 09:56 UTC - status.io update: "We are adding more workers..."
- 10:30 UTC - 4 new api and 4 web workers added to LBs https://ops.gitlab.net/gitlab-cookbooks/chef-repo/merge_requests/1338
- 11:24 UTC - support reports stuck CI jobs for customers (https://gitlab.zendesk.com/agent/tickets/125409)
- 12:09 UTC - new incident issue 929 opened for reports of delayed CI runners
- 12:11 UTC - tweet "jobs on shared runners being picked up at a low rate or appear being stuck..."
- 13:14 UTC - status.io update acknowledging CI pipeline delays
- 13:15 UTC - incident issue 929 closed again, as it is covered by 928
- 13:51 UTC - status.io update: "continue to investigate...", announcing incident issue URL
- 14:20 UTC - the additional workers were removed again to reduce the number of connections to redis-cache
- 16:51 UTC - status.io update: status changed to "monitoring", "CI jobs are catching up..."
- 18:11 UTC - status.io update: "back to normal levels..."
- 19:40 UTC - status.io incident resolved
2019-07-02
- 09:45 UTC - kernel update and reboot of redis-cache-03
- 10:06 UTC - unexpected failover to redis-cache-01
- 10:50 UTC - redis-cache-02 kernel upgrade and reboot
- 11:22 UTC - unexpected failover to redis-cache-02
- 11:20 UTC - patch eliminating application config requests to redis-cache deployed: https://ops.gitlab.net/gitlab-com/gl-infra/patcher/merge_requests/113 (https://gitlab.com/gitlab-org/gitlab-ee/merge_requests/14500)
- 11:20 UTC - CPU usage drops to 85%, network from 300Mb/s to under 100Mb/s, all metrics improve
- 12:15 UTC - redis-cache-01 kernel update and reboot
Root Cause Analysis
The web component had slower response times.
- Why? - Redis-cache had slower response times.
- Why? - Redis-cache was saturating its CPU.
- Why? - The application issued too many, and too expensive, requests to Redis.
- Why? - There was no awareness of, or testing for, how many Redis-cache requests the application would generate and how expensive they would be.
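To make the failure mode concrete: a single cheap-looking `Rails.cache.fetch` in a hot code path becomes one Redis round trip per request, and at GitLab.com's request volume that alone can saturate the cache primary. Below is a minimal sketch of the pattern and of the process-local fix; the class and key names are hypothetical, and this is not the actual GitLab patch.

```ruby
# Illustrative sketch -- hypothetical names, not the actual GitLab patch.
require 'active_support/cache'

# Before: every request pays a Redis round trip (plus deserialization of a
# large payload) for a rarely-changing settings blob.
def settings_via_redis
  Rails.cache.fetch('application_settings', expires_in: 1.minute) do
    ApplicationSetting.current # expensive query, but the result rarely changes
  end
end

# After: front the shared Redis cache with a per-process in-memory cache, so
# each Rails worker asks Redis at most once per minute instead of per request.
PROCESS_CACHE = ActiveSupport::Cache::MemoryStore.new

def settings_via_process_memory
  PROCESS_CACHE.fetch('application_settings', expires_in: 1.minute) do
    settings_via_redis
  end
end
```

The trade-off is up to a minute of staleness per worker process, which is acceptable for rarely-changing settings.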
What went well
- Alerting worked: we became aware of the web performance issues immediately.
- A lot of support from across engineering in finding the root cause and working on several remediations.
What can be improved
- detection of Redis performance issues (or, more generally, detecting saturation of a service/system)
- trend analysis and capacity planning
- finding the root cause of performance degradations - we sometimes don't follow up on degradations if they resolve on their own and we don't see a direct root cause at first sight, but they might be an indication of a deeper issue or trend.
Corrective actions
per @andrewn:
- Start monitoring various saturation metrics and add per-service SLOs: gitlab-com/runbooks!1188 (merged)
- Distributed tracing instrumentation of Rails caching: https://gitlab.com/gitlab-org/labkit-ruby/merge_requests/12
- Distributed tracing instrumentation of Redis calls: https://gitlab.com/gitlab-org/labkit-ruby/issues/2
- Discuss adding N+1-style limits on Redis calls in development and testing environments (no issue yet)
- Discuss adding size limits on Redis keys stored in the cache (no issue yet)
- Stop caching JUnit files in Redis: https://gitlab.com/gitlab-org/gitlab-ce/issues/64035
- Monitor cache misuse of Redis by application teams, e.g. with `redis-cli --bigkeys`
- Add a `redis_duration_ms` field to our Rails+API structured logs (no issue yet; a sketch of one possible approach follows this list)
- Add documentation on how to monitor Redis instances: gitlab-com/runbooks!1187 (merged)
- Consider breaking our Redis instances down further than the current persistent/cache pair (for example CI-cache, MergeRequest-cache, etc.)
- Discuss the possibility of moving to Redis Cluster or managed Redis instances (e.g. Redis Labs) (no issue yet)
- Use cached markdown fields for calculating participants: https://gitlab.com/gitlab-org/gitlab-ce/issues/63967
- Band-aid: disable JUnit reports: https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/30254
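For the `redis_duration_ms` item above, here is a minimal sketch of one way to accumulate Redis call time per request. It assumes redis-rb 4.x, where `Redis::Client#call` dispatches every command; the module name and thread-local field are ours, not an agreed implementation.

```ruby
# Illustrative sketch only -- assumes redis-rb 4.x, where Redis::Client#call
# dispatches every Redis command.
require 'redis'

module RedisDurationInstrumentation
  def call(command)
    start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    super
  ensure
    elapsed_ms = (Process.clock_gettime(Process::CLOCK_MONOTONIC) - start) * 1000.0
    # Accumulate per thread; a request-scoped store would let the total be
    # emitted as a redis_duration_ms field in the request's structured log line.
    Thread.current[:redis_duration_ms] =
      (Thread.current[:redis_duration_ms] || 0.0) + elapsed_ms
  end
end

Redis::Client.prepend(RedisDurationInstrumentation)
```

A structured-logging hook (e.g. Lograge's `custom_options`) could then emit and reset the accumulated value once per request.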
per @stanhu:
- Move Flipper caching away from Redis to an in-memory cache: https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/30276 (a generic sketch of this pattern follows the list)
- Move Geo checks away from Redis to an in-memory cache: https://gitlab.com/gitlab-org/gitlab-ee/merge_requests/14513
- Add Redis details to the Peek performance bar: https://gitlab.com/gitlab-org/gitlab-ce/merge_requests/30191
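The first two MRs share one pattern: answer repeated boolean checks from process memory instead of Redis. A generic sketch of that pattern (this wrapper is illustrative, not GitLab's code):

```ruby
# Illustrative sketch of the general pattern behind both MRs -- not GitLab's code.
require 'active_support/cache'

class InMemoryCheckCache
  def initialize(ttl: 60)
    @cache = ActiveSupport::Cache::MemoryStore.new
    @ttl = ttl
  end

  # Answers repeated boolean checks (feature flags, Geo checks) from process
  # memory, consulting the Redis-backed source at most once per TTL per key.
  def enabled?(key)
    @cache.fetch("check:#{key}", expires_in: @ttl) { yield }
  end
end

FLAG_CACHE = InMemoryCheckCache.new(ttl: 60)
# Usage (hypothetical call): FLAG_CACHE.enabled?(:geo) { redis_backed_geo_check }
```

The cost is up to TTL seconds of staleness after a flag flips, in exchange for removing a Redis round trip from every check.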
per @rymai:
- Compress `Rails.cache` payloads that are bigger than a certain threshold (a configuration sketch follows)
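ActiveSupport's cache stores already support transparent compression, so this could plausibly be a configuration change. A minimal sketch assuming Rails 5.2+'s built-in `:redis_cache_store`; GitLab's actual cache store, URL variable, and threshold may differ:

```ruby
# config/environments/production.rb -- illustrative values, not GitLab's config.
config.cache_store = :redis_cache_store, {
  url: ENV['REDIS_CACHE_URL'],       # assumed environment variable
  compress: true,                    # compress payloads before writing to Redis
  compress_threshold: 16.kilobytes   # only entries larger than this are compressed
}
```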
per @bjk-gitlab:
- Clean up and improve the Redis cache metrics to be more useful: https://gitlab.com/gitlab-org/gitlab-ce/issues/64064