
RCA: Gitaly N+1 calls causing bad latency and Sidekiq queues to grow

Incident: production#1039 (closed)

Rapid Action Issue: gitlab-com/www-gitlab-com#4997 (closed)

Summary

Some commits with a massive number of tags caused background jobs to make many Gitaly calls, leading to higher Gitaly latency and growing Sidekiq queues.

For the timeline, see the incident issue: production#1039 (closed)

  • Service(s) affected : ~"Service:Gitaly" ~"Service:Sidekiq" ~"Service:Web"

  • Team attribution :

  • Minutes downtime or degradation : 05:10 - 14:55 = 9h45m = 585m

To calculate the duration of the event, use the Platform Metrics Dashboard to look at Apdex and SLO violations.


Impact & Metrics

Start with the following:

  • What was the impact of the incident?
    • Higher Gitaly latencies, growing Sidekiq queues, and higher web latencies.
  • Who was impacted by this incident?
    • All users waiting for background jobs or webhooks to be triggered or to finish.
  • How did the incident impact customers?
  • How many attempts were made to access the impacted service/feature?
  • How many customers were affected?
  • How many customers tried to access the impacted service/feature?

Include any additional metrics that are of relevance.

Provide any relevant graphs that could help understand the impact of the incident and its dynamics.

Detection & Response

Start with the following:

  • How was the incident detected?
  • Did alarming work as expected?
    • Alerting on queue size should fire earlier, and Gitaly latency alerts should go to PagerDuty.
  • How long did it take from the start of the incident to its detection?
    • Queues started to grow at 05:10 and were detected by the EOC at 06:30 = 80m
  • How long did it take from detection to remediation?
    • 06:30 - 14:55 = 8h25m = 505m
  • Were there any issues with the response to the incident?

Root Cause Analysis

Sidekiq jobs were piling up over the course of several hours.

  1. Why? - Jobs took longer to process.
  2. Why? - Gitaly latency was getting worse.
  3. Why? - Some jobs were making many more Gitaly calls.
  4. Why? - Those jobs were processing a massive number of tags, which causes N+1 problems for Gitaly (see the sketch below).
  5. Why? - A user's commits contained too many tags.
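
To make the N+1 pattern concrete, here is a minimal sketch. It is not GitLab's or Gitaly's actual code; the client class and its method names are hypothetical stand-ins for Gitaly RPCs. It contrasts a per-tag lookup, which issues one call per tag, with a single batched call that returns all tags at once:

```python
# Illustrative sketch only -- not GitLab's actual code. The client and its
# method names are invented stand-ins; each method call represents one
# Gitaly RPC (i.e. one network round trip).

class FakeGitalyClient:
    def __init__(self):
        self.rpc_count = 0

    def find_tag(self, repo, tag_name):
        self.rpc_count += 1  # one round trip per tag
        return {"name": tag_name}

    def find_all_tags(self, repo):
        self.rpc_count += 1  # a single round trip for all tags
        return [{"name": f"v{i}"} for i in range(10_000)]


client = FakeGitalyClient()
tag_names = [f"v{i}" for i in range(10_000)]

# N+1 pattern: one RPC per tag. A single Sidekiq job processing a commit
# with tens of thousands of tags turns into tens of thousands of Gitaly calls.
for name in tag_names:
    client.find_tag("group/project", name)
print("per-tag lookups:", client.rpc_count, "RPCs")

# Batched pattern: one RPC returns every tag, and the job filters in memory.
client.rpc_count = 0
all_tags = {tag["name"] for tag in client.find_all_tags("group/project")}
matched = all_tags & set(tag_names)
print("batched lookup:", client.rpc_count, "RPC for", len(matched), "tags")
```

Each extra RPC adds latency and load on Gitaly, which is what pushed job durations up and let the Sidekiq queues grow.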

What went well

Start with the following:

  • Identify the things that worked well or as expected.
  • Any additional call-outs for what went particularly well.

What can be improved

  • We should add limits for things like the number of tags (see the sketch after this list).
  • Improve performance when handling large numbers of tags.
  • Improve the Sidekiq architecture.
  • Alerts for growing queue size should fire earlier.
  • Gitaly latency Apdex alerts should go to PagerDuty.
  • Paging the CMOC via the /pd-mgr Slack command doesn't seem to work?
  • Change the severity label on the incident ticket in time to reflect our current rating of the incident severity.
  • Make status.io updates more meaningful for customers.
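
As a sketch of the first improvement above, a limit could be enforced before any per-tag work is enqueued. The limit value and names below are assumptions for illustration, not an existing GitLab setting:

```python
# Illustrative sketch only -- MAX_TAGS_PER_PUSH and check_tag_limit are
# hypothetical names; the limit value is chosen purely for illustration.

MAX_TAGS_PER_PUSH = 2_000  # assumed limit


class TooManyTagsError(Exception):
    pass


def check_tag_limit(tag_names):
    """Reject work that would fan out into an unbounded number of Gitaly calls."""
    if len(tag_names) > MAX_TAGS_PER_PUSH:
        raise TooManyTagsError(
            f"{len(tag_names)} tags referenced, limit is {MAX_TAGS_PER_PUSH}"
        )


# Usage: a push (or job) referencing an excessive number of tags is rejected
# up front, before any per-tag Gitaly calls are made.
try:
    check_tag_limit([f"v{i}" for i in range(5_000)])
except TooManyTagsError as err:
    print(err)
```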


Corrective actions

Guidelines
