
Incident Review: Degraded performance on gitlab.com

Incident Review

The DRI for the incident review is the issue assignee.

  • If applicable, ensure that the exec summary is completed at the top of the associated incident issue, the timeline tab is updated and relevant graphs are included.
  • If there are any corrective actions or infradev issues, ensure they are added as related issues to the original incident.
  • Fill out relevant sections below or link to the meeting review notes that cover these topics

Customer Impact

  1. Who was impacted by this incident? (i.e. external customers, internal customers)
    1. Users on gitlab.com
  2. What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
    1. Users reported slow requests, failures to load their CI/CD pipelines, and overall degraded performance on GitLab.com
  3. How many customers were affected?
    1. Everyone on gitlab.com
  4. If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
    1. ...
  5. Duration
    1. It took us about 1 hour to figure out where the problem was coming from.
    2. ~3 hours (14:15 UTC - 17:57 UTC)

What were the root causes?

We introduced a bug fix that unlocked a large number of pipelines, which then cascaded into unlocking many more job artifacts.

This bug fix ran a slow query that updated hundreds of rows in the ci_pipelines and ci_job_artifacts tables, increasing the number of dead tuples in both tables.

This caused database saturation, leading to a large increase in 500 errors site-wide and degraded performance on GitLab.com.
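
As an illustration of the failure mode, here is a simplified sketch (the worker, model, and column names are assumptions, not the code from the actual MR) of how a single unlock pass can fan out into large UPDATEs on ci_pipelines and ci_job_artifacts, leaving a dead tuple behind for every row it touches:

```ruby
# Hypothetical sketch only -- not the actual GitLab implementation.
class UnlockPipelinesWorker
  include Sidekiq::Worker

  def perform(project_id)
    # Unlock every previously locked pipeline for the project in one pass.
    pipelines = Ci::Pipeline.where(project_id: project_id, locked: :artifacts_locked)

    pipelines.in_batches(of: 500) do |batch|
      # UPDATE on ci_pipelines: one dead tuple per unlocked pipeline.
      batch.update_all(locked: :unlocked)

      # Cascading UPDATE on ci_job_artifacts: typically many artifact rows
      # per pipeline, so dead tuples accumulate much faster on this table.
      Ci::JobArtifact
        .where(job_id: Ci::Build.where(pipeline_id: batch.select(:id)).select(:id))
        .update_all(locked: :unlocked)
    end
  end
end
```

Run against years of accumulated locked pipelines, a pass like this produces far more write traffic and dead tuples than a steady-state run, which is consistent with the database saturation described above.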

Incident Response Analysis

  1. How was the incident detected?
    1. Alert on replication lag in CI database https://gitlab.slack.com/archives/C101F3796/p1680101410420179
    2. Incident was declared https://gitlab.slack.com/archives/C101F3796/p1680102335670589
  2. How could detection time be improved?
    1. ...
  3. How was the root cause diagnosed?
    1. Increased activity from the Sidekiq worker was detected in pg_stat_activity. The worker was then traced back to the MR that introduced it.
  4. How could time to diagnosis be improved?
    1. ...
  5. How did we reach the point where we knew how to mitigate the impact?
    1. Once the MR that introduced the worker was identified, a rollback was started.
  6. How could time to mitigation be improved?
    1. Feature flagging the new worker so that we could revert to the previous behaviour without having to roll back a deployment (see the sketch after this list).
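
A minimal sketch of that mitigation, assuming a GitLab-style `Feature.enabled?` check (the flag name, worker name, and helper methods are hypothetical):

```ruby
# Hypothetical sketch: gate the new unlock path behind a feature flag so it
# can be disabled at runtime instead of rolling back the deployment.
class UnlockPipelinesWorker
  include Sidekiq::Worker

  def perform(project_id)
    project = Project.find(project_id)

    if Feature.enabled?(:bulk_unlock_pipelines, project)
      unlock_in_bulk(project)     # new behaviour introduced by the MR
    else
      unlock_one_by_one(project)  # previous, known-good behaviour
    end
  end

  private

  # Placeholders for the two code paths; implementations omitted.
  def unlock_in_bulk(project); end
  def unlock_one_by_one(project); end
end
```

Disabling the flag (for example via ChatOps or a Rails console) would then switch traffic back to the previous code path within minutes, without a deploy.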

Post Incident Analysis

  1. Did we have other events in the past with the same root cause?
    1. None that we are aware of.
  2. Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
    1. No. We accumulated a large number of artifacts over the years, and Fix unlocking of job artifacts when pipelines s... (gitlab-org/gitlab!114426 - merged) was supposed to fix this going forward. We did not anticipate how large the accumulated artifact backlog would be, which is genuinely hard to determine in advance.
  3. Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
    1. Yes, a bug fix addressing gitlab-org/gitlab#387087 (closed) and gitlab-org/gitlab#266958 (closed).

What went well?

Guidelines
