2020-12-01 "Reset CI minutes job" query timeouts
Summary
The job that runs on the first of the month to reset CI minutes used per namespace caused significant churn in the `namespaces` table, which led to poor query planning by Postgres and to query timeouts manifesting in various ways throughout the application stack. It was resolved (in the short term) by running `analyze` on the `namespaces` table.
Timeline
All times UTC.
2020-12-01
- 00:00 - CI minutes monthly reset job begins
- 01:30 - DB performance starts to degrade, intermittently and mildly at first
- 02:30 - DB performance degrades severely; CPU usage on patroni-06 (the primary) nears 100%, load average > 100
- 02:36 - IncreasedErrorRateOtherBackends fires; query timeouts are the cause
- 02:57 - Alert clears, but only briefly
- 02:59 - last of the CI monthly minute reset batches completes.
- 03:00 - The `Postgres transactions showing high rate of statement timeouts` alert fires, then resolves a minute later
- 03:01 - cmiskell declares incident in Slack
- 03:07 - Alerts start firing again, including error ratio burn rates on various services and components, all from the same cause (slow queries)
- 03:31 - Ran `analyze verbose namespaces` (see the sketch below the timeline)
- 03:32 - Errors stop, load drops on the DBs, things calm down and alerts gradually clear.
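For reference, the mitigation was a single statement on the primary. A minimal sketch of that command plus a follow-up check (the `pg_stat_user_tables` columns are standard Postgres; the exact session details are assumed):

```sql
-- Refresh planner statistics for the namespaces table (as run at 03:31 UTC).
ANALYZE VERBOSE namespaces;

-- Confirm the statistics were just refreshed.
SELECT relname, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'namespaces';
```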
Corrective Actions
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12027 - No ETA, @Finotto
- This CI-group EPIC: gitlab-org&4840 (closed)
Incident Review
Summary
From 01:30 UTC until 03:32 UTC on 2020-12-01, performance on gitlab.com was degraded due to database performance problems triggered by the monthly "CI minutes reset" job that runs from midnight on the 1st of every month. During that period all users may have experienced performance degradation (slow responses). After some other possibilities were examined, the problematic database table (`namespaces`) was speculatively identified, and once the statistics on that table were refreshed with `analyze`, performance quickly returned to normal.
- Service(s) affected: Service::Postgres, Service::Web, Service::API
- Team attribution: ~"team::Datastores"
- Time to detection: 60 minutes from first hints of problems (before actual alerting)
- Minutes downtime or degradation: 120 minutes
Metrics
API status codes
This shows the increase in 5xx errors on the API fleet, particularly from 02:40 and even more pronounced from 03:10 or so, with rapid recovery at mitigation.
API apdex
Shows the slightly more subtle impact, and that around 1.5-2% of requests were failing to meet the desired performance at the worst periods.
CPU on the primary DB node
Shows the excessive CPU usage on the primary database node. There was a similarly shaped increase on the replicas although not to the same degree (only 60% peaks).
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - All users of gitlab.com at the time.
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - Possible slow responses to requests (depending on the precise request and queries required), or HTTP 500 responses caused by database timeouts.
- How many customers were affected?
  - Up to 2% of requests failed to meet the desired apdex for a short period, and up to 1.5% failed to meet it for the last 35 minutes.
  - Up to ~102K logged-in users were active in this period, 2,729 of whom experienced one or more database timeouts.
  - Up to ~152K IP addresses (anonymous requests) were active, of which 677 experienced one or more database timeouts.
  - Some of those database timeouts may have occurred normally, but separating those out would be a substantial task with extremely limited value.
What were the root causes? ("5 Whys")
- Performance was degraded because the database was performing badly, with high CPU usage and long query times (including timeouts)
- This was because the statistics on the `namespaces` table caused poor query plans (see the sketch after this list)
- The statistics were bad because many/all rows are touched by the monthly CI minutes reset job, which was running at the time
- All the rows are touched because the data is "current" only, not accrued per month then left alone.
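To make the statistics link concrete, the staleness is visible in the standard `pg_stat_user_tables` view; a sketch of the kind of check that would show it (the query is illustrative, not one that was run during the incident):

```sql
-- How stale are the planner statistics for the tables the reset job touches?
-- n_mod_since_analyze counts row modifications since the last ANALYZE; a large
-- value relative to n_live_tup means the planner is working from
-- increasingly unrepresentative statistics.
SELECT relname,
       n_live_tup,
       n_mod_since_analyze,
       last_analyze,
       last_autoanalyze
FROM pg_stat_user_tables
WHERE relname IN ('namespaces', 'project_statistics');
```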
Incident Response Analysis
- How was the incident detected?
  - Multiple alerts, although interestingly not our apdex alerts (performance was within current SLOs and may have self-corrected before reaching critical levels)
- How could detection time be improved?
  - Perhaps some alerting on anomalous patterns in database node performance, although the spikes before we were alerted were fairly brief and would probably fall below any threshold we could set without causing regular false positives
- How was the root cause diagnosed?
  - The on-call SRE was already aware of the CI minutes reset performance problems from previous months (see gitlab-org/gitlab#260286 (closed) for an overview and links to other incidents) and was monitoring in general for DB performance issues, and #3100 (closed) had fired earlier implicating the `project_statistics` and `namespaces` tables.
- How could time to diagnosis be improved?
  - Nothing springs to mind; there are a huge number of reasons why performance can degrade, and it's always a matter of working through them. In this case we spent some time checking whether an increase in traffic was to blame, as that is a more common cause of these symptoms than bad database statistics.
- How did we reach the point where we knew how to mitigate the impact?
  - To be quite clear, we didn't know; the table analyze was somewhat speculative, based on the previous incident and a very crude look at `pg_stat_activity` (see the query sketch after this list). It was executed on the grounds that it was unlikely to make things worse and might help; the rapid improvement when it completed was a pleasant surprise.
- How could time to mitigation be improved?
  - More awareness by the on-call SRE that an `analyze <table>` is (mostly) safe to run at any time; knowing that, I probably would have tried it earlier (much more speculatively).
- What went well?
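For context on the "very crude look at `pg_stat_activity`" above, it amounted to scanning for long-running statements; a rough sketch of that kind of query (the filter values are illustrative, not the exact query used during the incident):

```sql
-- Long-running statements, longest first; during the incident many of these
-- involved the namespaces table.
SELECT pid,
       now() - query_start AS duration,
       state,
       wait_event_type,
       wait_event,
       left(query, 120) AS query
FROM pg_stat_activity
WHERE state <> 'idle'
  AND now() - query_start > interval '5 seconds'
ORDER BY duration DESC;
```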
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - Yes. The CI minutes reset job is a known problem. This month it was spread out over 3 hours to mitigate previously seen meltdowns, but this problem manifested around half-way through, when the reset job had touched enough rows in the two tables to skew the statistics to the point where query performance degraded. It's possible that the statistics problem was part of previous incidents caused by the same code, but was masked by (or followed immediately on from) the stampede of all the jobs running at once, such that we couldn't reasonably identify it as a separate problem.
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - gitlab-org&4840 (closed) - moving the monthly CI minutes counts out of the `namespaces` table (which is large, and central to many expensive queries) would avoid this monthly job having to touch many/all rows in that table. Additionally, the proposed solutions are lower impact: adding new rows per month rather than resetting/updating existing rows across all active namespaces (see the schema sketch after this list).
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  - No.
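To illustrate the "new rows per month" shape referenced above, a hypothetical table along these lines would let the monthly job insert fresh rows instead of updating every active row in `namespaces` (the table and column names are invented for illustration and are not the actual design in gitlab-org&4840):

```sql
-- Hypothetical: one row per namespace per month, written once and then left
-- alone, instead of resetting counters in place on the namespaces table.
CREATE TABLE ci_namespace_monthly_usage (
  id           bigserial PRIMARY KEY,
  namespace_id bigint    NOT NULL,
  month        date      NOT NULL,  -- first day of the month
  minutes_used integer   NOT NULL DEFAULT 0,
  UNIQUE (namespace_id, month)
);
```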
Lessons Learned
- gitlab-org&4840 (closed) is critical and should ideally be raised in priority.
- `analyze <table>` is a safe operation at most times; during deploy migrations executing DDL statements it might be mildly contra-indicated, but even then it's probably still reasonable, as our migrations should be safe even if they take a little longer, and the analyze will only take a minute or two (or, at worst, will be blocked by the DDL). If there is even the slightest suspicion that a table has bad statistics, this should be a go-to action (see the sketch below).
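On the safety point: `ANALYZE` takes only a `SHARE UPDATE EXCLUSIVE` lock, so it does not block normal reads and writes, only DDL (and vice versa). A quick sketch for checking whether a running analyze is itself waiting on a lock (assuming monitoring access to the primary):

```sql
-- Is a running ANALYZE waiting on a lock, and if so, which sessions block it?
SELECT pid,
       state,
       wait_event_type,
       wait_event,
       pg_blocking_pids(pid) AS blocked_by,
       query
FROM pg_stat_activity
WHERE query ILIKE 'analyze%';
```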