RCA: Gitaly N+1 calls causing high latency and Sidekiq queues to grow
Incident: production#1039 (closed)
Rapid Action Issue: gitlab-com/www-gitlab-com#4997 (closed)
Summary
Commits with a massive number of tags caused jobs to make many Gitaly calls (an N+1 pattern), leading to higher Gitaly latency and growing Sidekiq queues.
For timeline see the incident issue: production#1039 (closed)
- Service(s) affected : ~"Service:Gitaly" ~"Service:Sidekiq" ~"Service:Web"
- Team attribution :
- Minutes downtime or degradation : 05:10 - 14:55 = 9h45m = 585m
For calculating the duration of the event, use the Platform Metrics Dashboard to look at Apdex and SLO violations.
Impact & Metrics
Start with the following:
- What was the impact of the incident?
- higher Gitaly latencies, Sidekiq queues growing, higher web latencies
- Who was impacted by this incident?
- all users waiting for background jobs or web hooks to be triggered or to finish
- How did the incident impact customers?
- How many attempts were made to access the impacted service/feature?
- How many customers were affected?
- How many customers tried to access the impacted service/feature?
Include any additional metrics that are of relevance.
Provide any relevant graphs that could help understand the impact of the incident and its dynamics.
Detection & Response
Start with the following:
- How was the incident detected?
- 05:29 UTC Gitaly latency Apdex alert in #alerts-general, which was noticed by the EOC at 06:30
- Did alarming work as expected?
- Not entirely: alerting on queue size should fire earlier, and Gitaly latency alerts should go to PagerDuty
- How long did it take from the start of the incident to its detection?
- queues started to grow at 05:10 and were detected by the EOC at 06:30 = 80m
- How long did it take from detection to remediation?
- 06:30 - 14:55 = 8h25m = 505m
- Were there any issues with the response to the incident?
Root Cause Analysis
Sidekiq jobs were piling up over hours.
- Why? - Jobs took longer to process.
- Why? - Gitaly latency was getting worse.
- Why? - Some jobs made many more Gitaly calls than usual.
- Why? - Those jobs were processing a massive number of tags, which caused N+1 call patterns against Gitaly (see the sketch after this list).
- Why? - A user's commits carried an unusually large number of tags.
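To make the third and fourth "why" concrete: the N+1 pattern means one RPC per tag instead of one RPC for the whole set, so per-call round-trip overhead multiplies with the number of tags. The following is a minimal, purely illustrative sketch, assuming hypothetical `find_tag` / `find_all_tags` helpers and a fixed per-call cost; it is not GitLab or Gitaly code.

```python
import time

RPC_OVERHEAD_S = 0.005  # assumed fixed round-trip cost per Gitaly call


def find_tag(repo, tag_name):
    """Hypothetical per-tag RPC: pays the round-trip cost for every tag."""
    time.sleep(RPC_OVERHEAD_S)
    return {"name": tag_name}


def find_all_tags(repo, tag_names):
    """Hypothetical batched RPC: pays the round-trip cost once."""
    time.sleep(RPC_OVERHEAD_S)
    return [{"name": t} for t in tag_names]


def resolve_tags_n_plus_one(repo, tag_names):
    # N+1 pattern: one call per tag, as in the incident.
    return [find_tag(repo, t) for t in tag_names]


def resolve_tags_batched(repo, tag_names):
    # Batched pattern: a single call regardless of tag count.
    return find_all_tags(repo, tag_names)


if __name__ == "__main__":
    tags = [f"v1.0.{i}" for i in range(1000)]

    start = time.monotonic()
    resolve_tags_n_plus_one("repo", tags)
    print(f"N+1:     {time.monotonic() - start:.1f}s for {len(tags)} tags")

    start = time.monotonic()
    resolve_tags_batched("repo", tags)
    print(f"batched: {time.monotonic() - start:.1f}s for {len(tags)} tags")
```

The sketch only shows why per-tag round trips scale so badly; the actual fix chosen is tracked in the corrective-action issues below.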
What went well
Start with the following:
- Identify the things that worked well or as expected.
- Any additional call-outs for what went particularly well.
What can be improved
- we should add limits for things like the number of tags
- improve performance for handling many tags
- improve Sidekiq architecture
- alerts for growing queue size should fire earlier
- Gitaly latency Apdex alerts should go to PagerDuty
- paging the CMOC via the /pd-mgr Slack command doesn't seem to work?
- change the severity label on the incident ticket in time to reflect our current rating of the incident severity
- make status.io updates more meaningful for customers
Corrective actions
- add chef config for the Sidekiq changes we made manually: https://ops.gitlab.net/gitlab-cookbooks/chef-repo/merge_requests/1619
- create `find_tag` RPC gitlab-org/gitaly#1848 (closed)
- implement `find_tag` RPC https://gitlab.com/gitlab-org/gitlab-ce/issues/65795
- PostReceive should have bounds on how many changes it processes (sketched below) https://gitlab.com/gitlab-org/gitlab-ce/issues/65804
- add timeouts to Gitaly calls from Sidekiq (sketched below) gitlab-com/www-gitlab-com#4997 (closed)
- make it possible to kill running Sidekiq jobs https://gitlab.com/gitlab-org/gitlab-ce/issues/51096
- re-architect queue implementation gitlab-com/www-gitlab-com#4951 (closed)
- page for Gitaly SLO alerts https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7391
- identify limits to prevent platform incidents https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7481
- add runbook for analyzing Gitaly pprof data https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7751
- document marquee_account_alerts and infra-escalation channels in oncall runbook
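Two of these corrective actions lend themselves to a small illustration: bounding how many changes a single job accepts, and putting a deadline on each Gitaly call so one slow call cannot occupy a worker indefinitely. The sketch below is a generic Python illustration under assumed names (`process_ref_change`, `MAX_CHANGES`, `GITALY_DEADLINE_S`); the real changes live in the Rails/Sidekiq and Gitaly client code paths tracked in the issues above.

```python
import concurrent.futures
import time

MAX_CHANGES = 1_000       # assumed upper bound on ref changes handled per job
GITALY_DEADLINE_S = 10.0  # assumed per-call deadline

# Shared worker pool. Note that a timed-out call keeps running in its thread,
# which is why "make it possible to kill running Sidekiq jobs" is listed as a
# separate corrective action.
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def process_ref_change(change):
    """Stand-in for a unit of work that ends up calling Gitaly."""
    time.sleep(0.001)
    return change


def call_with_deadline(fn, *args, deadline_s=GITALY_DEADLINE_S):
    # Stop waiting after the deadline so one slow call cannot hold a worker
    # (and, transitively, a Sidekiq queue) indefinitely.
    future = _pool.submit(fn, *args)
    try:
        return future.result(timeout=deadline_s)
    except concurrent.futures.TimeoutError:
        raise TimeoutError(f"call exceeded {deadline_s}s deadline") from None


def post_receive(changes):
    # Bound how much work a single job accepts; the remainder could be dropped,
    # sampled, or handed to a follow-up job depending on the eventual design.
    bounded = changes[:MAX_CHANGES]
    return [call_with_deadline(process_ref_change, c) for c in bounded]


if __name__ == "__main__":
    results = post_receive([f"refs/tags/v{i}" for i in range(5_000)])
    print(f"processed {len(results)} of 5000 changes")
```

In the real codebase the deadline would more likely be set on the Gitaly gRPC client rather than via a wrapper thread; the point is only that both bounds limit how much a single pathological push can consume.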