2022-03-17: Site-wide performance degradation
Incident DRI
Current Status
GitLab.com performance was degraded from 10:50 to 11:20 UTC due to high load on the primary database caused by writes to the CI builds table. Some features and pages were unavailable during this time, and users also experienced inconsistent data due to replication lag. The incident has been resolved, but an investigation is ongoing.
Summary for CMOC notice / Exec summary:
- Customer Impact: Customers were unable to access features and pages on GitLab.com.
- Service Impact: Postgres
- Impact Duration: 10:50 UTC - 11:20 UTC (30 minutes)
- Root cause: Saturation
Timeline
Recent Events (available internally only):
- Deployments
- Feature Flag Changes
- Infrastructure Configurations
- GCP Events (e.g. host failure)
- GitLab.com Latest Updates
All times UTC.
2022-03-17
- 11:12 - @igorwwwwwwwwwwwwwwwwwwww declares incident in Slack.
- 11:25 - Andrew is currently investigating what we are running vacuum on.
- 11:28 - A dashboard shows a single Patroni replica having a huge burst while the others are stable. This could be a symptom of replication lag, where replicas drop out of the pool, putting more load on the smaller subset of remaining replicas.
- 11:29 - We are seeing a drop on most of the replicas and an increase on some of them.
- 11:29 - Concluding that replication lag could be a contributing factor.
- 11:29 - Capturing a CPU profile on the primary to gather data.
- 11:32 - Hypothesis: replication lag is a big factor in what we are seeing; when this lag increases, clients will stop talking to that replica, putting more load on the primary (and the rest of the fleet?).
- 11:37 - We have recovered, but since we don't know the cause, we don't know how long that will last. We are not categorizing this as Mitigated until we see more stability.
- 11:38 - Investigating the tuple statistics to see whether there was an increase in writes that could have contributed to the replication lag (see the query sketch after this timeline).
- 11:40 - Increase in updates; the impacted table is ci_builds.
- 11:47 - In 5 minutes we will make the determination whether or not to mark this as mitigated.
- 11:50 - A customer reported issues acquiring runners for their jobs, most likely related to this incident. They are no longer having these problems.
- 11:58 - We have marked the incident as mitigated.
- 12:41 - We have marked this incident as resolved.
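For reference, a minimal sketch of the tuple-statistics check referred to at 11:38/11:40. The exact queries run during the incident were not captured in this issue, so the column selection here is illustrative:

```sql
-- Per-table write counters since the last stats reset; a disproportionate
-- jump in n_tup_upd for ci_builds pointed at the cascading updates.
SELECT relname,
       n_tup_ins AS inserts,
       n_tup_upd AS updates,
       n_tup_del AS deletes
FROM pg_stat_user_tables
ORDER BY n_tup_upd DESC
LIMIT 10;
```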
Create related issues
Use the following links to create related issues to this incident if additional work needs to be completed after it is resolved:
- Support contact request
- Corrective action
- Investigation followup
- Confidential issue
- QA investigation
- Infradev
Takeaways
- This issue was difficult to diagnose because the symptoms (lots of UPDATE queries) were not easily linked to the trigger (the DELETE statement).
- Another big surprise to me was the behaviour of canceled statements continuously generating WAL pressure; this is a vector that exists in general for large UPDATE and DELETE statements, and even heavy read-only queries to some extent (a sketch of this dynamic follows this list).
- The dynamics of client behaviour in the presence of replication lag were also quite interesting to observe. In this case, we ended up saturating the pgbouncer backend pool, which is probably better than taking down an entire replica (or the primary).
- Another interesting aspect in terms of user-facing symptoms was flappiness in recently written records due to replication lag. I saw a newly created issue flapping in and out of existence.
- We got pretty lucky: this vector was left unmitigated for an entire week (due to difficulties while deleting the foreign key constraint), and a large customer started triggering runner deletions shortly after we managed to roll out the mitigation. Kudos to @msmiley for prioritizing this task. In hindsight, additional mitigations (e.g. blocking the runner deletion API endpoint via Cloudflare) would have been a good idea.
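To make the canceled-statement WAL dynamic concrete, here is a minimal psql sketch. The table name, runner id, and timeout value are illustrative rather than the exact production statements:

```sql
-- Measure WAL written by a statement that is canceled by statement_timeout.
-- Even though the transaction aborts, every row version written before the
-- cancellation has already been WAL-logged and shipped to replicas.
SELECT pg_current_wal_lsn() AS lsn_before \gset

SET statement_timeout = '15s';
DELETE FROM ci_runners WHERE id = 42;  -- cascades SET NULL updates into ci_builds
-- ERROR:  canceling statement due to statement timeout

RESET statement_timeout;
SELECT pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), :'lsn_before'))
       AS wal_written_before_cancellation;
```

Each retry of the delete repeats this work from the beginning, which is how the WAL volume was amplified during the incident.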
Corrective Actions
Corrective actions should be put here as soon as an incident is mitigated; ensure that all corrective actions mentioned in the notes below are included.
- https://gitlab.com/gitlab-org/gitlab/-/issues/356153 - Convert FK between `Ci::Build` and `Ci::Runner` to LFK
- https://gitlab.com/gitlab-org/gitlab/-/issues/356271
Note: In some cases we need to redact information from public view. We only do this in a limited number of documented cases. This might include the summary, timeline or any other bits of information, as laid out in our handbook page. Any of this confidential data will be in a linked issue, only visible internally. By default, all information we can share will be public, in accordance with our transparency value.
Incident Review
- Ensure that the exec summary is completed at the top of the incident issue, the timeline is updated, and relevant graphs are included in the summary.
- If there are any corrective action items mentioned in the notes on the incident, ensure they are listed in the "Corrective Actions" section.
- Fill out relevant sections below or link to the meeting review notes that cover these topics.
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - All customers using the site at the time.
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - 5xx errors, slow responses, stale data.
- How many customers were affected?
  - n/a
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
  - 60% traffic drop.
What were the root causes?
- We had some deletions on the runners table.
- Each deletion triggered cascading updates to set `ci_builds.runner_id` to `NULL` (a schema-level sketch follows this list).
- Since a runner can have very many builds, this generated a large number of updates.
- These delete queries timed out after the 15s statement timeout.
- The timeout did not prevent the uncommitted updates from generating WAL for 15 seconds before the transaction was aborted.
- Every retry of a runner deletion ran the uncommitted updates again, further amplifying the WAL generation.
- The WAL generation induced replication lag.
- The replication lag caused replicas to drop out of rotation, reducing overall capacity; this saturated the backend connections of the pgbouncer connection pools on the remaining replicas and the primary.
- The connection pool saturation resulted in the user-facing degradation, as clients had to wait for connections.
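A simplified, schema-level sketch of the cascade described above; the real tables have many more columns, and the actual constraint change is tracked in the corrective-action issues:

```sql
-- Simplified sketch of the foreign key that drove the incident.
-- With ON DELETE SET NULL, deleting one runner rewrites every build row
-- that still references it, inside the same DELETE statement.
CREATE TABLE ci_runners (
    id bigint PRIMARY KEY
);

CREATE TABLE ci_builds (
    id        bigint PRIMARY KEY,
    runner_id bigint REFERENCES ci_runners (id) ON DELETE SET NULL
);

-- A single runner deletion then behaves roughly like:
--   UPDATE ci_builds SET runner_id = NULL WHERE runner_id = <deleted runner id>;
-- For a runner with very many builds this exceeds the 15s statement_timeout,
-- yet every row touched before the timeout has already generated WAL.
DELETE FROM ci_runners WHERE id = 1;
```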
Incident Response Analysis
- How was the incident detected?
  - SRE on-call noticed stale data, indicating replication lag.
  - Alerts followed shortly after.
- How could detection time be improved?
  - It was actually pretty good.
- How was the root cause diagnosed?
  - Finding the trigger was difficult, as it was a complex interaction and the UPDATEs were tied to a DELETE statement.
- How could time to diagnosis be improved?
  - Newer pg_stat_statements versions attribute WAL generation to individual statements (see the query sketch at the end of this section).
- How did we reach the point where we knew how to mitigate the impact?
  - A lot of disproven theories, and then someone connected the dots on the DELETE queries (may have been @andrewn).
- How could time to mitigation be improved?
  - Prioritize working on a mitigation concurrently, in addition to a fix.
  - In this case, an API block rule in Cloudflare to prevent runner deletions would have been a good idea in retrospect.
- What went well?
  - Coordination was really good. Several threads of investigation were spread across team members.
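As a reference for the diagnosis improvement above, a minimal sketch of the WAL attribution available in pg_stat_statements on PostgreSQL 13 and later (the wal_* columns do not exist in older versions):

```sql
-- Top WAL-generating statements since the last stats reset.
SELECT queryid,
       calls,
       wal_records,
       pg_size_pretty(wal_bytes) AS wal_volume,
       left(query, 80)           AS query_sample
FROM pg_stat_statements
ORDER BY wal_bytes DESC
LIMIT 10;
```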
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - Not as far as I know.
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - Not as far as I know.
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  - No. There was a latent change, gitlab-org/gitlab!80203 (merged), but the incident was triggered several weeks later.
- What went well?
  - Coordination and collaboration.
  - Putting a maintenance window in place on PagerDuty to avoid flooding SRE on-call with alerts.
  - Tracking down the pieces contributing to the outage and refining the mental model along the way.
Guidelines
Resources
- If the Situation Zoom room was utilised, the recording will be automatically uploaded to the Incident room Google Drive folder (private).