2021-03-09 High number of database statement timeouts
(GDoc used for early notes - https://docs.google.com/document/d/11J-mrK_TNedUjdzWHOlNAuhla50lY5zkhDHCy-rhQv0/edit# )
High-level summary
A large number of expensive database queries were running and performing poorly on the primary database. This caused a high load across the database cluster. The load condition saturated available connections to the database leading to a slowdown across the platform. This slowdown meant that at a certain point no requests were served.
We performed an emergency maintenance operation on the database in order to mitigate the load increase. Once the database was capable of serving the queued requests, the platform recovered.
It is suspected that a known PostgreSQL optimizer bug, combined with an expensive query, is the root cause of this outage.
The root cause is explained in detail below.
Summary
Summary fully copied from https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12791#note_527321032 :
To summarize our findings:

We have identified the root cause: the PostgreSQL 11 optimizer overestimates the cardinality of the result of the outer nested loop when an `IN (..)` clause is estimated by the optimizer to return more than 200 results. So, any query with a `SELECT ... FROM R WHERE a IN ( {QUERY} )` clause, where the `QUERY` could potentially return more than 200 records, is a candidate for similar behavior.

As a minimal example, the following query has been used to showcase the problem:

```sql
SELECT * FROM foo WHERE i IN (SELECT i FROM generate_series(1, 100) i LIMIT 201);
```

The overestimation depends on the cardinality of `foo`, so it is more evident for large tables.

This appears to be a known PostgreSQL bug that has been patched in later versions. We have confirmed that this behavior is fixed in PG12.

If a list of values is provided in the `IN (..)` clause instead of a subquery, there is no overestimation of the cardinality of the result of the outer nested loop. This approach is evaluated in gitlab-org/gitlab#323985 (closed), but with concerns raised that those inner queries may return thousands of results on GitLab.com, which would affect both the planning times and our monitoring capabilities.

An alternative approach (gitlab-org/gitlab!56078 (merged)) is to replace the `IN (..)` clause in the query that caused the incident and evaluate whether we can use it at GitLab.com scale without issues.
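The misestimation can be observed directly in the query plan. A minimal sketch, assuming a PostgreSQL 11 instance and a freshly populated `foo` table (the table setup here is illustrative, not taken from the incident):

```sql
-- Hypothetical reproduction on PostgreSQL 11; `foo` is the table from the
-- minimal example above, filled with enough rows to make the effect visible.
CREATE TABLE foo AS SELECT generate_series(1, 1000000) AS i;
ANALYZE foo;

-- With the subquery estimated to return more than 200 rows, the PG11 planner
-- overestimates the rows flowing out of the outer nested loop. Comparing the
-- plan's estimated row count against the actual row count reported by
-- EXPLAIN ANALYZE exposes the gap.
EXPLAIN ANALYZE
SELECT * FROM foo WHERE i IN (SELECT i FROM generate_series(1, 300) i LIMIT 201);
```

On PG12, where the bug is fixed, the same comparison should show estimates close to the actual row counts.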
Timeline
View recent production deployment and configuration events (internal only)
All times UTC.
2021-03-09

- 04:45 - @ggillies receives page https://gitlab.pagerduty.com/incidents/P2JT1I4?utm_source=slack&utm_campaign=channel and raises incident https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3874
- 05:10 - @ggillies uses scrubber to block suspected abusers
- 05:15 - @ggillies uses scrubber to block suspected abusers
- 05:19 - @ggillies receives an overwhelmingly large number of pages indicating all parts of GitLab.com are experiencing issues
- 05:28 - @stanhu declares incident in Slack.
- 06:07 - automatic reindexing of `public.index_ci_builds_on_runner_id_and_id_desc` started
- ~06:45 - Database degradation is suspected due to a large number of queued queries, confirmed through PgBouncer high saturation. The project statistics query times out in production, while results are returned quickly in staging. Code inspection finds two possible configuration options: feature flag `namespace_storage_limit` and database setting `::Gitlab::CurrentSettings.automatic_purchased_storage_allocation?`, which forces us into the branch `::Namespaces::CheckStorageSizeService`; that branch has an early return on `namespace_storage_limit`, meaning we shouldn't be displaying the banner at all. We did not disable the feature flag; instead we disabled the configuration setting.
- 07:15 - @stanhu disables configuration `automatic_purchased_storage_allocation` to false through the database console
- 07:33 - added patroni-08 into db replica rotation
- 07:54 - @ggillies manually attempts to disable queue `update_namespace_statistics:namespaces_root_statistics` from being picked up by `sidekiq-catchall` by changing the configmap, unaware that it needs to be changed in the deployment instead
- 06:51 - @ggillies restores the state of the sidekiq-catchall deployment and configmap to match what they have in git
- 07:53 - `ANALYZE namespaces` resolves the DB slowness
- 08:26 - took patroni-08 out of rotation again
- 08:31 - Incident is mitigated and restoration of operations is publicly announced
- 09:10 - Incident is resolved and the system is fully operational
- 09:21 - toggled `automatic_purchased_storage_allocation` setting to `true`
Corrective Actions
Corrective actions for this incident are split between proposed tooling/process improvements and application changes.
Application changes
- Investigate `analyze` action taken as part of the incident https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12762
  - The issue spawned gitlab-org/gitlab#323985 (closed) as a smaller, targeted change to be shipped as soon as possible to alleviate some future risk. The solution is temporary: it splits a large complex query into smaller ones.
- Likely the long-term solution is replacing recursive queries altogether gitlab-org&4115
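The temporary mitigation above — splitting one large query into smaller ones — can be sketched as follows. This is a generic illustration; the table and column names are hypothetical, not the actual GitLab schema:

```sql
-- Instead of one query with an IN (subquery) clause that PG11 misestimates:
--   SELECT * FROM projects WHERE namespace_id IN (SELECT id FROM namespaces WHERE ...);
-- run the inner query first...
SELECT id FROM namespaces WHERE parent_id = 42;  -- step 1: fetch the ids (hypothetical filter)

-- ...then have the application interpolate the returned ids as a literal
-- value list, which the planner estimates accurately:
SELECT * FROM projects WHERE namespace_id IN (100, 101, 102);  -- step 2
```

The trade-off, noted in the summary above, is that a literal list with thousands of values inflates planning time and makes query monitoring harder.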
Tooling and process changes
- Improve runbook for disabling a specific sidekiq queue https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12765 (if we don't completely move sidekiq to k8s, since that is now unblocked; more info to follow)
- Ensure that we can apply `gitlab-com/k8s-workloads` updates while charts.gitlab.io is down delivery#1596 (closed)
- Prevent operational runner connections to GitLab.com delivery#227 (closed)
- Consistent way of managing configuration settings gitlab-org/gitlab#323911
- Create better alerting for CrashLoopBackOff for gitlab-shell https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12767
Note: In some cases we need to redact information from public view. We only do this in a limited number of documented cases. This might include the summary, timeline or any other bits of information, as laid out in our handbook page. Any such confidential data will be in a linked issue, only visible internally. By default, all information we can share will be public, in accordance with our transparency value.
Incident Review
Summary
For a period of approximately 2.5 hours (between 2021-03-09 05:19 UTC and 2021-03-09 07:53 UTC) GitLab.com experienced a full site outage: all customers and users of GitLab.com were unable to access the site, and CI jobs could not be executed. The underlying cause was determined to be specific database queries running against the main PostgreSQL cluster taking much longer than normal to execute (from milliseconds to over 2 minutes). This caused database connection pools to fill up, exhausting resources on the database and preventing anything that relied on the database from functioning. The incident was eventually resolved by running `ANALYZE namespaces;` inside the main Postgres database, in order to give accurate table statistics to the Postgres query planner, which had been making incorrect decisions about how to run specific queries, causing the database slowness.
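The mitigation step can be sketched as follows. The `ANALYZE` statement matches the one run during the incident; the follow-up statistics check is standard PostgreSQL, added here as an illustration:

```sql
-- Refresh the planner statistics for the table whose stale statistics were
-- misleading the query planner:
ANALYZE namespaces;

-- Verify when statistics were last gathered, distinguishing a manual ANALYZE
-- from autovacuum's automatic analyze:
SELECT relname, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'namespaces';
```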
- Service(s) affected: All services
- Team attribution:
- Time to detection: Minutes
- Minutes downtime or degradation: 154 minutes
Metrics
Metrics provided in comments to issue
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - All customers of GitLab.com
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - While occasionally some pages would load, the main customer experience was that trying to load any page of the GitLab.com application would return a generic `500 internal server error`. CI jobs running during the outage would also stall and fail.
- How many customers were affected?
  - The total number of GitLab.com customers
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
  - A 65% traffic drop
What were the root causes?
copied from https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12791#note_527321032 :
We have identified the root cause: the PostgreSQL 11 optimizer overestimates the cardinality of the result of the outer nested loop when an `IN (..)` clause is estimated by the optimizer to return more than 200 results. So, any query with a `SELECT ... FROM R WHERE a IN ( {QUERY} )` clause, where the `QUERY` could potentially return more than 200 records, is a candidate for similar behavior.

As a minimal example, the following query has been used to showcase the problem:

```sql
SELECT * FROM foo WHERE i IN (SELECT i FROM generate_series(1, 100) i LIMIT 201);
```

The overestimation depends on the cardinality of `foo`, so it is more evident for large tables.

This appears to be a known PostgreSQL bug that has been patched in later versions. We have confirmed that this behavior is fixed in PG12.

If a list of values is provided in the `IN (..)` clause instead of a subquery, there is no overestimation of the cardinality of the result of the outer nested loop. This approach is evaluated in gitlab-org/gitlab#323985 (closed), but with concerns raised that those inner queries may return thousands of results on GitLab.com, which would affect both the planning times and our monitoring capabilities.

An alternative approach (gitlab-org/gitlab!56078 (merged)) is to replace the `IN (..)` clause in the query that caused the incident and evaluate whether we can use it at GitLab.com scale without issues.
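A common way to replace an `IN (subquery)` clause is an `EXISTS` rewrite, sketched here with hypothetical table names (the actual rewrite shipped in gitlab-org/gitlab!56078 may differ):

```sql
-- Original shape that triggered the misestimation:
--   SELECT * FROM projects WHERE namespace_id IN (SELECT id FROM namespaces WHERE ...);
-- Equivalent rewrite using a correlated EXISTS, which gives the planner a
-- different (often more accurate) estimation path:
SELECT p.*
FROM projects p
WHERE EXISTS (
  SELECT 1
  FROM namespaces n
  WHERE n.id = p.namespace_id
    AND n.parent_id = 42  -- hypothetical filter standing in for the real predicate
);
```

Whether this helps depends on the plan the optimizer chooses, which is why the merged change was to be evaluated at GitLab.com scale before being relied on.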
Incident Response Analysis
- How was the incident detected?
  - Alerts sent from our standard observability stack
- How could detection time be improved?
  - Detection of the site-wide impact likely could not be improved; our monitoring caught this quickly, before e.g. people started reporting problems
  - Detecting the underlying problem in the database could be improved by having specific targeted alerting to identify that database queries are taking an abnormally long time. This might have allowed an EOC to be notified and to intervene before a full outage took place
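Such targeted alerting could be driven by a query like the following. This uses the standard PostgreSQL `pg_stat_activity` view; the 30-second threshold is an arbitrary illustration, not a tuned value:

```sql
-- Find backend processes whose current query has been running for an
-- abnormally long time; an alerting pipeline could fire on a non-empty result.
SELECT pid,
       now() - query_start AS duration,
       state,
       query
FROM pg_stat_activity
WHERE state <> 'idle'
  AND now() - query_start > interval '30 seconds'
ORDER BY duration DESC;
```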
- How was the root cause diagnosed?
  - Diagnosing that the problem was database related was very quick, and done through standard mechanisms such as looking at abnormalities in the triage dashboard.
- How could time to diagnosis be improved?
  - ...
- How did we reach the point where we knew how to mitigate the impact?
  - ...
- How could time to mitigation be improved?
  - ...
- What went well?
  - ...
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - ...
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - ...
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  - ...
Lessons Learned
- ...
Guidelines
Resources
- If the Situation Zoom room was utilised, recording will be automatically uploaded to Incident room Google Drive folder (private)