2021-03-09 High number of database statement timeouts
(GDoc used for early notes - https://docs.google.com/document/d/11J-mrK_TNedUjdzWHOlNAuhla50lY5zkhDHCy-rhQv0/edit# )
High-level summary
A large number of expensive database queries were running and performing poorly on the primary database. This caused a high load across the database cluster. The load condition saturated available connections to the database leading to a slowdown across the platform. This slowdown meant that at a certain point no requests were served.
We performed an emergency maintenance operation on the database in order to mitigate the load increase. Once the database was capable of serving the queued requests, the platform recovered.
It is suspected that a known PostgreSQL optimizer bug, combined with an expensive query, is the root cause of this outage.
The root cause is explained in detail below.
Summary
Summary fully copied from https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12791#note_527321032 :
To summarize our findings:

We have identified the root cause: the PostgreSQL 11 optimizer overestimates the cardinality of the result of the outer nested loop when an `IN (..)` clause is estimated by the optimizer to return more than 200 results. So, any query with a `SELECT ... FROM R WHERE a IN ( {QUERY} )` clause, where the `QUERY` could potentially return more than 200 records, is a candidate for similar behavior.

As a minimal example, the following query has been used to showcase the problem:

```sql
SELECT * FROM foo WHERE i IN (SELECT i FROM generate_series(1, 100) i LIMIT 201);
```

The overestimation depends on the cardinality of `foo`, so it is more evident for large tables.

This appears to be a known PostgreSQL bug that has been patched in later versions. We have confirmed that this behavior is fixed in PG12.

If a list of values is provided in the `IN (..)` clause instead of a subquery, there is no overestimation of the cardinality of the result of the outer nested loop. This approach is evaluated in gitlab-org/gitlab#323985 (closed), but with concerns raised that those inner queries may return thousands of results on GitLab.com, which would affect both the planning times and our monitoring capabilities.

An alternative approach (gitlab-org/gitlab!56078 (merged)) is to replace the `IN (..)` clause in the query that caused the incident and evaluate whether we can use it at GitLab.com scale without issues.
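The misestimation can be observed directly in the query plan. A minimal sketch, assuming a PostgreSQL 11 instance and a freshly populated `foo` table (the table setup here is illustrative, not taken from the incident):

```sql
-- Hypothetical reproduction on PostgreSQL 11; `foo` is the table from the
-- minimal example above, filled with enough rows to make the effect visible.
CREATE TABLE foo AS SELECT generate_series(1, 1000000) AS i;
ANALYZE foo;

-- With the subquery estimated to return more than 200 rows, the PG11 planner
-- overestimates the rows flowing out of the outer nested loop. Comparing the
-- plan's estimated row count against the actual row count reported by
-- EXPLAIN ANALYZE exposes the gap.
EXPLAIN ANALYZE
SELECT * FROM foo WHERE i IN (SELECT i FROM generate_series(1, 300) i LIMIT 201);
```

On PG12, where the bug is fixed, the same comparison should show estimates close to the actual row counts.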
Timeline
View recent production deployment and configuration events (internal only)
All times UTC.
2021-03-09

- 04:45 - @ggillies receives page https://gitlab.pagerduty.com/incidents/P2JT1I4?utm_source=slack&utm_campaign=channel and raises incident https://gitlab.com/gitlab-com/gl-infra/production/-/issues/3874
- 05:10 - @ggillies uses scrubber to block suspected abusers
- 05:15 - @ggillies uses scrubber to block suspected abusers
- 05:19 - @ggillies receives an overwhelmingly large number of pages indicating all parts of GitLab.com are experiencing issues
- 05:28 - @stanhu declares incident in Slack.
- 06:07 - automatic reindexing of `public.index_ci_builds_on_runner_id_and_id_desc` started
- ~06:45 - Database degradation is suspected due to a large number of queued queries, confirmed through PgBouncer high saturation. The project statistics query times out in production, while results are returned quickly in staging. Code inspection finds two possible configuration options: feature flag `namespace_storage_limit` and database setting `::Gitlab::CurrentSettings.automatic_purchased_storage_allocation?`, which forces us into the branch `::Namespaces::CheckStorageSizeService`; that branch has an early return on `namespace_storage_limit`, meaning we shouldn't be displaying the banner at all. We did not disable the feature flag; instead we disabled the configuration setting.
- 07:15 - @stanhu disables configuration `automatic_purchased_storage_allocation` to false through the database console
- 07:33 - added patroni-08 into db replica rotation
- 07:54 - @ggillies manually attempts to disable queue `update_namespace_statistics:namespaces_root_statistics` from being picked up by `sidekiq-catchall` by changing the configmap, unaware that it needs to be changed in the deployment instead
- 06:51 - @ggillies restores the state of the sidekiq-catchall deployment and configmap to match what they have in git
- 07:53 - `ANALYZE namespaces` resolves the DB slowness
- 08:26 - took patroni-08 out of rotation again
- 08:31 - Incident is mitigated and restoration of operations is publicly announced
- 09:10 - Incident is resolved and the system is fully operational
- 09:21 - toggled `automatic_purchased_storage_allocation` setting to `true`
Corrective Actions
Corrective actions for this incident are split between proposed tooling/process improvements and application changes.
Application changes
- Investigate `analyze` action taken as part of the incident https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12762
  - The issue spawned gitlab-org/gitlab#323985 (closed) as a smaller, targeted change to be shipped as soon as possible to alleviate some future risk. The solution is temporary: it splits a large complex query into smaller ones.
- Likely the long-term solution is replacing recursive queries altogether gitlab-org&4115
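The temporary mitigation above — splitting one large query into smaller ones — can be sketched as follows. This is a generic illustration; the table and column names are hypothetical, not the actual GitLab schema:

```sql
-- Instead of one query with an IN (subquery) clause that PG11 misestimates:
--   SELECT * FROM projects WHERE namespace_id IN (SELECT id FROM namespaces WHERE ...);
-- run the inner query first...
SELECT id FROM namespaces WHERE parent_id = 42;  -- step 1: fetch the ids (hypothetical filter)

-- ...then have the application interpolate the returned ids as a literal
-- value list, which the planner estimates accurately:
SELECT * FROM projects WHERE namespace_id IN (100, 101, 102);  -- step 2
```

The trade-off, noted in the summary above, is that a literal list with thousands of values inflates planning time and makes query monitoring harder.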
Tooling and process changes
- Improve runbook for disabling a specific sidekiq queue https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12765 (if we don't completely move sidekiq to k8s, since that is now unblocked; more info to follow)
- Ensure that we can apply `gitlab-com/k8s-workloads` updates while charts.gitlab.io is down delivery#1596 (closed)
- Prevent operational runner connections to GitLab.com delivery#227 (closed)
- Consistent way of managing configuration settings gitlab-org/gitlab#323911
- Create better alerting for CrashLoopBackOff for gitlab-shell https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12767
Note: In some cases we need to redact information from public view. We only do this in a limited number of documented cases. This might include the summary, timeline or any other bits of information, as laid out in our handbook page. Any such confidential data will be in a linked issue, only visible internally. By default, all information we can share will be public, in accordance with our transparency value.
Incident Review
Summary
For a period of approximately 2.5 hours (between 2021-03-09 05:19 UTC and 2021-03-09 07:53 UTC) GitLab.com experienced a full site outage: all customers and users of GitLab.com were unable to access the site, and CI jobs could not be executed. The underlying cause was determined to be specific database queries running against the main PostgreSQL cluster taking much longer than normal to execute (from milliseconds to over 2 minutes). This caused database connection pools to fill up, exhausting resources on the database and preventing anything that relied on the database from functioning. The incident was eventually resolved by running `ANALYZE namespaces;` inside the main Postgres database, in order to give accurate table statistics to the Postgres query planner, which had been making incorrect decisions about how to run specific queries, causing the database slowness.
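The mitigation step can be sketched as follows. The `ANALYZE` statement matches the one run during the incident; the follow-up statistics check is standard PostgreSQL, added here as an illustration:

```sql
-- Refresh the planner statistics for the table whose stale statistics were
-- misleading the query planner:
ANALYZE namespaces;

-- Verify when statistics were last gathered, distinguishing a manual ANALYZE
-- from autovacuum's automatic analyze:
SELECT relname, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'namespaces';
```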
- Service(s) affected: All services
- Team attribution:
- Time to detection: Minutes
- Minutes downtime or degradation: 154 minutes
Metrics
Metrics provided in comments to issue
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - All customers of GitLab.com
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - While occasionally some pages would load, the main customer experience was that trying to load any page of the GitLab.com application would return a generic `500 internal server error`. CI jobs running during the outage would also stall and fail.
- How many customers were affected?
  - The total number of GitLab.com customers
- If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
  - A 65% traffic drop
What were the root causes?
copied from https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12791#note_527321032 :
We have identified the root cause: the PostgreSQL 11 optimizer overestimates the cardinality of the result of the outer nested loop when an `IN (..)` clause is estimated by the optimizer to return more than 200 results. So, any query with a `SELECT ... FROM R WHERE a IN ( {QUERY} )` clause, where the `QUERY` could potentially return more than 200 records, is a candidate for similar behavior.

As a minimal example, the following query has been used to showcase the problem:

```sql
SELECT * FROM foo WHERE i IN (SELECT i FROM generate_series(1, 100) i LIMIT 201);
```

The overestimation depends on the cardinality of `foo`, so it is more evident for large tables.

This appears to be a known PostgreSQL bug that has been patched in later versions. We have confirmed that this behavior is fixed in PG12.

If a list of values is provided in the `IN (..)` clause instead of a subquery, there is no overestimation of the cardinality of the result of the outer nested loop. This approach is evaluated in gitlab-org/gitlab#323985 (closed), but with concerns raised that those inner queries may return thousands of results on GitLab.com, which would affect both the planning times and our monitoring capabilities.

An alternative approach (gitlab-org/gitlab!56078 (merged)) is to replace the `IN (..)` clause in the query that caused the incident and evaluate whether we can use it at GitLab.com scale without issues.
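A common way to replace an `IN (subquery)` clause is an `EXISTS` rewrite, sketched here with hypothetical table names (the actual rewrite shipped in gitlab-org/gitlab!56078 may differ):

```sql
-- Original shape that triggered the misestimation:
--   SELECT * FROM projects WHERE namespace_id IN (SELECT id FROM namespaces WHERE ...);
-- Equivalent rewrite using a correlated EXISTS, which gives the planner a
-- different (often more accurate) estimation path:
SELECT p.*
FROM projects p
WHERE EXISTS (
  SELECT 1
  FROM namespaces n
  WHERE n.id = p.namespace_id
    AND n.parent_id = 42  -- hypothetical filter standing in for the real predicate
);
```

Whether this helps depends on the plan the optimizer chooses, which is why the merged change was to be evaluated at GitLab.com scale before being relied on.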
Incident Response Analysis
- How was the incident detected?
  - Alerts sent from our standard observability stack
- How could detection time be improved?
  - Detection of the site-wide impact likely could not be improved; our monitoring caught this quickly, before e.g. people started reporting problems
  - Detecting the underlying problem in the database could be improved by having specific targeted alerting to identify that database queries are taking an abnormally long time. This might have allowed an EOC to be notified and to intervene before a full outage took place
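Such targeted alerting could be driven by a query like the following. This uses the standard PostgreSQL `pg_stat_activity` view; the 30-second threshold is an arbitrary illustration, not a tuned value:

```sql
-- Find backend processes whose current query has been running for an
-- abnormally long time; an alerting pipeline could fire on a non-empty result.
SELECT pid,
       now() - query_start AS duration,
       state,
       query
FROM pg_stat_activity
WHERE state <> 'idle'
  AND now() - query_start > interval '30 seconds'
ORDER BY duration DESC;
```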
- How was the root cause diagnosed?
  - Diagnosing that the problem was database related was very quick, and done through standard mechanisms such as looking at abnormalities in the triage dashboard.
- How could time to diagnosis be improved?
  - ...
- How did we reach the point where we knew how to mitigate the impact?
  - ...
- How could time to mitigation be improved?
  - ...
- What went well?
  - ...
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - ...
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - ...
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  - ...
Lessons Learned
- ...
Guidelines
Resources
- If the Situation Zoom room was utilised, recording will be automatically uploaded to Incident room Google Drive folder (private)