2018-06-19: Increased load on db causing some 500 errors on GitLab

Summary

At approximately 06:00 UTC we started to see increased database load on the secondary databases. Heavy query load on the issues and project_features tables put pressure on the secondaries, causing an increase in statement timeouts.
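This kind of table-level pressure is visible in Postgres's own statistics views. The sketch below is a minimal, hypothetical diagnostic, assuming Python with psycopg2 and a read-only connection to a secondary; the DSN is a placeholder, not part of the incident tooling. It compares sequential-scan and index-scan counters, and the last ANALYZE timestamps, for the tables named above.

```python
# Minimal diagnostic sketch (assumption: psycopg2 is available and the DSN
# below is replaced with a real read-only connection to a secondary).
import psycopg2

TABLES = ("issues", "project_features")  # tables named in this incident

QUERY = """
SELECT relname,
       seq_scan,            -- sequential scans started on the table
       idx_scan,            -- index scans started on the table
       last_analyze,        -- last manual ANALYZE
       last_autoanalyze     -- last autovacuum-triggered ANALYZE
FROM pg_stat_user_tables
WHERE relname = ANY(%s)
"""

def report_scan_stats(dsn: str) -> None:
    """Print scan and analyze statistics for the tables involved in the incident."""
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            cur.execute(QUERY, (list(TABLES),))
            for relname, seq, idx, last_an, last_auto in cur.fetchall():
                print(f"{relname}: seq_scan={seq} idx_scan={idx} "
                      f"last_analyze={last_an} last_autoanalyze={last_auto}")

if __name__ == "__main__":
    # Placeholder DSN; point this at a secondary.
    report_scan_stats("dbname=gitlabhq_production host=secondary.example.com")
```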

(Three screenshots taken 2018-06-19 at 09:44 AM omitted.)

Timeline (UTC)

Summary of https://docs.google.com/document/d/14N6ykqCm5iJk2flzEeeoh0Fm6ihQ7CW-pe2t9O2BKYI/edit. The doc remains private as it mentions projects and users.

  • 06h05 sequential scans on the postgres secondaries ramp up on the project_features and issues tables.
  • This leads to database query latencies, an increase in the number of QueryCanceled exceptions, an increase in 500s and increased latencies on GitLab.com.
  • 08h30 incident call starts
  • Due to the timing of the incident (starting at 06h00), the incident call focuses on the possibility of abuse or other external causes
  • 10h15 we start focusing on why certain complicated queries have resorted to table scans. The possibility is raised that this is still due to external load.
  • 11h18 we perform a reanalysis of all tables involved in the slow queries (see the sketch after this timeline). The problem is instantly solved as the query optimiser goes back to using indexes for the project_features table.
  • 11h20 it's noticed that the last analysis of statistics on the namespaces table happened at 06h05. The errors started at 06h06.
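The reanalysis at 11h18 presumably corresponds to running ANALYZE on the affected tables so that the planner's statistics, and hence its choice between index and sequential scans, are refreshed. Below is a minimal sketch of how that step could be scripted, assuming Python with psycopg2 and a connection that is allowed to run ANALYZE (on a streaming-replication setup that means the primary, since hot standbys are read-only); the DSN and table list are placeholders, not the commands actually run during the incident.

```python
# Sketch of the remediation step: refresh planner statistics with ANALYZE.
# Assumptions: psycopg2 is available; the DSN and table list are placeholders.
import psycopg2
from psycopg2 import sql

TABLES = ("project_features", "issues", "namespaces")

def reanalyze(dsn: str) -> None:
    conn = psycopg2.connect(dsn)
    conn.autocommit = True  # run each ANALYZE as its own statement
    try:
        with conn.cursor() as cur:
            for table in TABLES:
                # sql.Identifier quotes the table name safely.
                cur.execute(sql.SQL("ANALYZE {}").format(sql.Identifier(table)))
                print(f"analyzed {table}")
    finally:
        conn.close()

if __name__ == "__main__":
    # Placeholder DSN; ANALYZE has to run where writes are allowed.
    reanalyze("dbname=gitlabhq_production host=primary.example.com")
```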

Note that at present we still don't understand why the postgres query optimiser switched over to sequential scans for these queries.

Corrective Actions

  • Once the root cause is determined, we will be better placed to decide on corrective actions.
  • A runbook for dealing with query optimiser issues and for re-running statistics analysis on tables.
  • Warnings/alerts on QueryCanceled errors? (See the sketch below.)
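For the last item, one possible shape is to count and log statement-timeout cancellations explicitly instead of letting them surface only as generic 500s. The sketch below is a hypothetical illustration in Python with psycopg2 (which raises psycopg2.errors.QueryCanceled on statement timeouts); the wrapper and counter are invented names, not existing GitLab instrumentation. GitLab's own Ruby stack would instead see PG::QueryCanceled / ActiveRecord::QueryCanceled, but the idea is the same.

```python
# Hypothetical sketch: warn when statement timeouts (QueryCanceled) occur.
# Assumptions: psycopg2 >= 2.8 (for psycopg2.errors); names are illustrative.
import logging
from psycopg2 import errors

log = logging.getLogger("db.timeouts")
query_canceled_count = 0  # in practice this would feed a metrics/alerting system

def execute_with_timeout_warning(cur, query, params=None):
    """Run a query, logging a warning if Postgres cancels it (statement_timeout)."""
    global query_canceled_count
    try:
        cur.execute(query, params)
    except errors.QueryCanceled:
        query_canceled_count += 1
        log.warning("statement timeout (QueryCanceled) #%d for query: %.80s",
                    query_canceled_count, query)
        raise
```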