Summary of efforts and corrective actions for database incidents in March 2021
This issue is a spin-off from the incident in production#3875 (closed) and is a holding issue for summarising the status of the various corrective actions listed in production#3875 (closed).
On 2021-03-09, there was an outage caused by severe database degradation. A high-level summary is available.
On 2021-03-11, we had an incident that resembled the one from two days earlier. Fortunately, the impact was smaller because the offending query had a smaller scope, and there was no downtime component.
On 2021-03-15, we had an incident that was essentially a repetition of the above incidents. This incident did cause downtime.
We have had multiple efforts related to the incident:
Mitigate the queries which were a significant contributor to the outage
Two discussion streams: find quick solutions to relieve the pressure, and ensure that we resolve the whole problem.
In progress
- Added `DISTINCT` to the CTE queries and rolled it out to GitLab.com as a hot-patch. The feature flag was enabled globally at 00:53 on 2021-03-16. The analysis is being conducted in gitlab-org&5617 (comment 529881916). This is now part of Rapid action for query planner bug.
- Ongoing investigation into other possible contributing factors, such as index creation and wal-g generating I/O that additionally slowed down the database. Additional tuning was done in production#3974 (closed) and production#3969 (comment 529761989), but the results are inconclusive.
- Other performance improvements are being considered and are in review.
- Other finders need to take advantage of the change in the recursive namespace lookup. Primary candidates are the NPM, NuGet, and other package endpoints. This is a longer-term change that we should consider pursuing.
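To make the mitigation above concrete: the problem class is a recursive namespace traversal whose working set can explode with duplicate rows, and deduplicating the traversal output keeps it bounded. The sketch below is illustrative only, not the actual GitLab patch or schema: a hypothetical `namespaces(id, parent_id)` table, with SQLite standing in for PostgreSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE namespaces (id INTEGER PRIMARY KEY, parent_id INTEGER);
-- A tiny group hierarchy: 1 is top-level; 2 and 3 are children; 4 is under 2.
INSERT INTO namespaces VALUES (1, NULL), (2, 1), (3, 1), (4, 2);
""")

# Walk all namespaces under group 1. With UNION ALL the recursive term can
# emit the same namespace many times in dense hierarchies; deduplicating
# the result (DISTINCT here) keeps the row count the rest of the query
# has to process bounded.
rows = conn.execute("""
    WITH RECURSIVE descendants(id) AS (
        SELECT id FROM namespaces WHERE id = 1
        UNION ALL
        SELECT n.id
        FROM namespaces n
        JOIN descendants d ON n.parent_id = d.id
    )
    SELECT DISTINCT id FROM descendants ORDER BY id
""").fetchall()
print([r[0] for r in rows])  # [1, 2, 3, 4]
```

The real fix places the deduplication inside the CTE itself; the point is the same either way: the traversal should hand the planner a set, not a multiset.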
Done
- Change in recursive namespace lookup deployed to production at approx. 01:00 UTC on March 12. Encouraging results shown for large-customer Maven API timings with the Maven finder updates.
- The original quick-fix attempt is likely to be abandoned due to performance concerns raised on the suggested code change. Discussions on how to refactor the offending queries produced a new proposal that is in review. This proposal changes the nature of the query, and the hope is that it will reduce the general impact; it is unlikely to resolve the problem completely.
- Related issue where we are looking to prevent abusive behaviour in which multiple top-level groups are created, with a number of subgroups which are then used in CI. The MR is merged and has to be deployed and enabled in production before the weekend. The feature flag is `top_level_group_creation_enabled` and needs to be set to `false`. This can be done with `/chatops run feature set top_level_group_creation_enabled false`. A change issue tracks toggling the FF.
The root cause of the event
The root cause is now well understood.
More technically detailed findings are summarised in comments made by Yannis Roussos and Andreas Brandl.
A deep dive into the PG code, and a general discussion of long-term steps, took place in this recorded call.
Done
- Performed a restore and gathered data about why the queries were choosing inefficient plans at the time of the incident.
- The data has been analysed; we can reliably reproduce the problem, and the likely root cause has been found and subsequently confirmed.
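The restore-and-analyse step above comes down to replaying the query on a restored copy and asking the planner why it picked a given plan. A minimal illustration of that workflow, with made-up table and index names, and SQLite's `EXPLAIN QUERY PLAN` standing in for PostgreSQL's `EXPLAIN (ANALYZE, BUFFERS)`:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (id INTEGER PRIMARY KEY, namespace_id INTEGER, payload TEXT);
CREATE INDEX index_events_on_namespace_id ON events (namespace_id);
""")

# Ask the planner which access path it would choose for the query.
# On a restored copy you would compare this output between "healthy"
# and "incident" conditions to see where the plan flips.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM events WHERE namespace_id = ?", (42,)
).fetchall()
plan_text = " ".join(row[-1] for row in plan)
print(plan_text)  # e.g. a SEARCH step using index_events_on_namespace_id
```

The same loop on PostgreSQL (restore, run `EXPLAIN`, compare plans as statistics or parameters change) is what made the inefficient-plan choice reproducible.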