RCA: Gitaly N+1 calls causing high latency and Sidekiq queues to grow
Incident: production#1039 (closed)
Rapid Action Issue: gitlab-com/www-gitlab-com#4997 (closed)
Summary
Commits with a massive number of tags caused jobs to make many Gitaly calls (an N+1 pattern), leading to higher Gitaly latency and growing Sidekiq queues.
For timeline see the incident issue: production#1039 (closed)
- Service(s) affected : ~"Service:Gitaly" ~"Service:Sidekiq" ~"Service:Web"
- Team attribution :
- Minutes downtime or degradation : 05:10 - 14:55 = 9h45m = 585m
For calculating the duration of the event, use the Platform Metrics Dashboard to look at Apdex and SLO violations.
Impact & Metrics
Start with the following:
- What was the impact of the incident?
- higher Gitaly latencies, Sidekiq queues growing, higher web latencies
- Who was impacted by this incident?
- all users waiting for background jobs or web hooks to be triggered or to finish
- How did the incident impact customers?
- How many attempts were made to access the impacted service/feature?
- How many customers were affected?
- How many customers tried to access the impacted service/feature?
Include any additional metrics that are of relevance.
Provide any relevant graphs that could help understand the impact of the incident and its dynamics.
Detection & Response
Start with the following:
- How was the incident detected?
- 05:29 UTC Gitaly latency Apdex alert in #alerts-general, which was noticed by the EOC at 06:30
- Did alarming work as expected?
- Not entirely: alerting on queue size should fire earlier, and Gitaly latency alerts should go to PagerDuty
- How long did it take from the start of the incident to its detection?
- queues started to grow at 05:10 and were detected by the EOC at 06:30 = 80m
- How long did it take from detection to remediation?
- 06:30 - 14:55 = 8h25m = 505m
- Were there any issues with the response to the incident?
Root Cause Analysis
Sidekiq jobs were piling up over hours.
- Why? - Jobs took longer to process.
- Why? - Gitaly latency was getting worse.
- Why? - Some jobs made many more Gitaly calls than usual.
- Why? - Those jobs were processing a massive number of tags, which caused N+1 call patterns against Gitaly (see the sketch after this list).
- Why? - A user's commits carried an unusually large number of tags.
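To make the third and fourth "why" concrete: the N+1 pattern means one RPC per tag instead of one RPC for the whole set, so per-call round-trip overhead multiplies with the number of tags. The following is a minimal, purely illustrative sketch, assuming hypothetical `find_tag` / `find_all_tags` helpers and a fixed per-call cost; it is not GitLab or Gitaly code.

```python
import time

RPC_OVERHEAD_S = 0.005  # assumed fixed round-trip cost per Gitaly call


def find_tag(repo, tag_name):
    """Hypothetical per-tag RPC: pays the round-trip cost for every tag."""
    time.sleep(RPC_OVERHEAD_S)
    return {"name": tag_name}


def find_all_tags(repo, tag_names):
    """Hypothetical batched RPC: pays the round-trip cost once."""
    time.sleep(RPC_OVERHEAD_S)
    return [{"name": t} for t in tag_names]


def resolve_tags_n_plus_one(repo, tag_names):
    # N+1 pattern: one call per tag, as in the incident.
    return [find_tag(repo, t) for t in tag_names]


def resolve_tags_batched(repo, tag_names):
    # Batched pattern: a single call regardless of tag count.
    return find_all_tags(repo, tag_names)


if __name__ == "__main__":
    tags = [f"v1.0.{i}" for i in range(1000)]

    start = time.monotonic()
    resolve_tags_n_plus_one("repo", tags)
    print(f"N+1:     {time.monotonic() - start:.1f}s for {len(tags)} tags")

    start = time.monotonic()
    resolve_tags_batched("repo", tags)
    print(f"batched: {time.monotonic() - start:.1f}s for {len(tags)} tags")
```

The sketch only shows why per-tag round trips scale so badly; the actual fix chosen is tracked in the corrective-action issues below.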
What went well
Start with the following:
- Identify the things that worked well or as expected.
- Any additional call-outs for what went particularly well.
What can be improved
- we should add limits for things like the number of tags
- improve performance for handling many tags
- improve Sidekiq architecture
- alerts for growing queue size should fire earlier
- Gitaly latency Apdex alerts should go to PagerDuty
- paging the CMOC via the /pd-mgr Slack command doesn't seem to work?
- change the severity label on the incident ticket in time to reflect our current rating of the incident severity
- make status.io updates more meaningful for customers
Corrective actions
- add chef config for the Sidekiq changes we made manually: https://ops.gitlab.net/gitlab-cookbooks/chef-repo/merge_requests/1619
- create `find_tag` RPC gitlab-org/gitaly#1848 (closed)
- implement `find_tag` RPC https://gitlab.com/gitlab-org/gitlab-ce/issues/65795
- PostReceive should have bounds on how many changes it processes (sketched below) https://gitlab.com/gitlab-org/gitlab-ce/issues/65804
- add timeouts to Gitaly calls from Sidekiq (sketched below) gitlab-com/www-gitlab-com#4997 (closed)
- make it possible to kill running Sidekiq jobs https://gitlab.com/gitlab-org/gitlab-ce/issues/51096
- re-architect queue implementation gitlab-com/www-gitlab-com#4951 (closed)
- page for Gitaly SLO alerts https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7391
- identify limits to prevent platform incidents https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7481
- add runbook for analyzing Gitaly pprof data https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7751
- document marquee_account_alerts and infra-escalation channels in oncall runbook
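Two of these corrective actions lend themselves to a small illustration: bounding how many changes a single job accepts, and putting a deadline on each Gitaly call so one slow call cannot occupy a worker indefinitely. The sketch below is a generic Python illustration under assumed names (`process_ref_change`, `MAX_CHANGES`, `GITALY_DEADLINE_S`); the real changes live in the Rails/Sidekiq and Gitaly client code paths tracked in the issues above.

```python
import concurrent.futures
import time

MAX_CHANGES = 1_000       # assumed upper bound on ref changes handled per job
GITALY_DEADLINE_S = 10.0  # assumed per-call deadline

# Shared worker pool. Note that a timed-out call keeps running in its thread,
# which is why "make it possible to kill running Sidekiq jobs" is listed as a
# separate corrective action.
_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def process_ref_change(change):
    """Stand-in for a unit of work that ends up calling Gitaly."""
    time.sleep(0.001)
    return change


def call_with_deadline(fn, *args, deadline_s=GITALY_DEADLINE_S):
    # Stop waiting after the deadline so one slow call cannot hold a worker
    # (and, transitively, a Sidekiq queue) indefinitely.
    future = _pool.submit(fn, *args)
    try:
        return future.result(timeout=deadline_s)
    except concurrent.futures.TimeoutError:
        raise TimeoutError(f"call exceeded {deadline_s}s deadline") from None


def post_receive(changes):
    # Bound how much work a single job accepts; the remainder could be dropped,
    # sampled, or handed to a follow-up job depending on the eventual design.
    bounded = changes[:MAX_CHANGES]
    return [call_with_deadline(process_ref_change, c) for c in bounded]


if __name__ == "__main__":
    results = post_receive([f"refs/tags/v{i}" for i in range(5_000)])
    print(f"processed {len(results)} of 5000 changes")
```

In the real codebase the deadline would more likely be set on the Gitaly gRPC client rather than via a wrapper thread; the point is only that both bounds limit how much a single pathological push can consume.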