2020-12-01 "Reset CI minutes job" query timeouts
Summary
The job that runs on the first of the month to reset CI minutes used per namespace caused significant churn in the `namespaces` table, which led to poor query planning by Postgres and to query timeouts manifesting in various ways throughout the application stack. It was resolved (in the short term) by running `analyze` on the `namespaces` table.
Timeline
All times UTC.
2020-12-01
- 00:00 - CI minutes monthly reset job begins
- 01:30 - DB performance starts to degrade, intermittently and mildly at first
- 02:30 - DB performance degrades severely; CPU usage on patroni-06 (the primary) nears 100%, load average > 100
- 02:36 - IncreasedErrorRateOtherBackends fires; query timeouts are the cause
- 02:57 - Alert clears, but only briefly
- 02:59 - last of the CI monthly minute reset batches completes.
- 03:00 - The `Postgres transactions showing high rate of statement timeouts` alert fires, then resolves a minute later
- 03:01 - cmiskell declares incident in Slack
- 03:07 - Alerts start firing again, including error ratio burn rates on various services and components, all from the same cause (slow queries)
- 03:31 - Ran `analyze verbose namespaces` (see the sketch below the timeline)
- 03:32 - Errors stop, load drops on the DBs, things calm down and alerts gradually clear.
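For reference, the mitigation was a single statement on the primary. A minimal sketch of that command plus a follow-up check (the `pg_stat_user_tables` columns are standard Postgres; the exact session details are assumed):

```sql
-- Refresh planner statistics for the namespaces table (as run at 03:31 UTC).
ANALYZE VERBOSE namespaces;

-- Confirm the statistics were just refreshed.
SELECT relname, last_analyze, last_autoanalyze
FROM pg_stat_user_tables
WHERE relname = 'namespaces';
```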
Corrective Actions
- https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/12027 - No ETA, @Finotto
- This CI-group EPIC: gitlab-org&4840 (closed)
Incident Review
Summary
From 01:30 UTC until 03:32 UTC on 2020-12-01, performance on gitlab.com was degraded due to database performance problems triggered by the monthly "CI minutes reset" job that runs from midnight on the 1st of every month. During that period all users may have experienced performance degradation (slow responses). After some other possibilities were examined, the problematic database table (`namespaces`) was speculatively identified, and once the statistics on that table were refreshed with `analyze`, performance quickly returned to normal.
- Service(s) affected: Service::Postgres, Service::Web, Service::API
- Team attribution: ~"team::Datastores"
- Time to detection: 60 minutes from first hints of problems (before actual alerting)
- Minutes downtime or degradation: 120 minutes
Metrics
API status codes
This shows the increase in 5xx errors on the API fleet, particularly from 02:40 and even more pronounced from 03:10 or so, with rapid recovery at mitigation.
API apdex
Shows the slightly more subtle impact, and that around 1.5-2% of requests were failing to meet the desired performance at the worst periods.
CPU on the primary DB node
Shows the excessive CPU usage on the primary database node. There was a similarly shaped increase on the replicas although not to the same degree (only 60% peaks).
Customer Impact
- Who was impacted by this incident? (i.e. external customers, internal customers)
  - All users of gitlab.com at the time.
- What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
  - Possible slow responses to requests (depending on the precise request and queries required), or HTTP 500 responses caused by database timeouts.
- How many customers were affected?
  - Up to 2% of requests failed to meet the desired apdex for a short period, and up to 1.5% failed to meet it for the last 35 minutes.
  - Up to ~102K logged-in users were active in this period, 2,729 of whom experienced one or more database timeouts.
  - Up to ~152K IP addresses (anonymous requests) were active, of which 677 experienced one or more database timeouts.
  - Some of those database timeouts may have occurred normally, but separating those out would be a substantial task with extremely limited value.
What were the root causes? ("5 Whys")
- Performance was degraded because the database was performing badly, with high CPU usage and long query times (including timeouts)
- This was because the statistics on the `namespaces` table caused poor query plans (see the sketch after this list)
- The statistics were bad because many/all rows are touched by the monthly CI minutes reset job, which was running at the time
- All the rows are touched because the data is "current" only, not accrued per month then left alone.
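To make the statistics link concrete, the staleness is visible in the standard `pg_stat_user_tables` view; a sketch of the kind of check that would show it (the query is illustrative, not one that was run during the incident):

```sql
-- How stale are the planner statistics for the tables the reset job touches?
-- n_mod_since_analyze counts row modifications since the last ANALYZE; a large
-- value relative to n_live_tup means the planner is working from
-- increasingly unrepresentative statistics.
SELECT relname,
       n_live_tup,
       n_mod_since_analyze,
       last_analyze,
       last_autoanalyze
FROM pg_stat_user_tables
WHERE relname IN ('namespaces', 'project_statistics');
```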
Incident Response Analysis
- How was the incident detected?
  - Multiple alerts, although interestingly not our apdex alerts (performance was within current SLOs and may have self-corrected before reaching critical levels)
- How could detection time be improved?
  - Perhaps some alerting on anomalous patterns in database node performance, although the spikes before we were alerted were fairly brief and would probably fall below any threshold we could set without causing regular false positives
- How was the root cause diagnosed?
  - The on-call SRE was already aware of the CI minutes reset performance problems from previous months (see gitlab-org/gitlab#260286 (closed) for an overview and links to other incidents) and was monitoring in general for DB performance issues, and #3100 (closed) had fired earlier implicating the `project_statistics` and `namespaces` tables.
- How could time to diagnosis be improved?
  - Nothing springs to mind; there are a huge number of reasons why performance can degrade, and it's always a matter of working through them. In this case we spent some time checking whether an increase in traffic was to blame, as that is a more common cause of these symptoms than bad database statistics.
- How did we reach the point where we knew how to mitigate the impact?
  - To be quite clear, we didn't know; the table analyze was somewhat speculative, based on the previous incident and a very crude look at `pg_stat_activity` (see the query sketch after this list). It was executed on the grounds that it was unlikely to make things worse and might help; the rapid improvement when it completed was a pleasant surprise.
- How could time to mitigation be improved?
  - More awareness by the on-call SRE that an `analyze <table>` is (mostly) safe to run at any time; knowing that, I probably would have tried it earlier (much more speculatively).
- What went well?
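For context on the "very crude look at `pg_stat_activity`" above, it amounted to scanning for long-running statements; a rough sketch of that kind of query (the filter values are illustrative, not the exact query used during the incident):

```sql
-- Long-running statements, longest first; during the incident many of these
-- involved the namespaces table.
SELECT pid,
       now() - query_start AS duration,
       state,
       wait_event_type,
       wait_event,
       left(query, 120) AS query
FROM pg_stat_activity
WHERE state <> 'idle'
  AND now() - query_start > interval '5 seconds'
ORDER BY duration DESC;
```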
Post Incident Analysis
- Did we have other events in the past with the same root cause?
  - Yes. The CI minutes reset job is a known problem. This month it was spread out over 3 hours to mitigate previously seen meltdowns, but this problem manifested around half-way through, when the reset job had touched enough rows in the two tables to skew the statistics to the point where query performance degraded. It's possible that the statistics problem was part of previous incidents caused by the same code, but was masked by (or followed immediately on from) the stampede of all the jobs running at once, such that we couldn't reasonably identify it as a separate problem.
- Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
  - gitlab-org&4840 (closed) - moving the monthly CI minutes counts out of the `namespaces` table (which is large, and central to many expensive queries) would avoid this monthly job having to touch many/all rows in that table. Additionally, the proposed solutions are lower impact: adding new rows per month rather than resetting/updating existing rows across all active namespaces (see the schema sketch after this list).
- Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
  - No.
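To illustrate the "new rows per month" shape referenced above, a hypothetical table along these lines would let the monthly job insert fresh rows instead of updating every active row in `namespaces` (the table and column names are invented for illustration and are not the actual design in gitlab-org&4840):

```sql
-- Hypothetical: one row per namespace per month, written once and then left
-- alone, instead of resetting counters in place on the namespaces table.
CREATE TABLE ci_namespace_monthly_usage (
  id           bigserial PRIMARY KEY,
  namespace_id bigint    NOT NULL,
  month        date      NOT NULL,  -- first day of the month
  minutes_used integer   NOT NULL DEFAULT 0,
  UNIQUE (namespace_id, month)
);
```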
Lessons Learned
- gitlab-org&4840 (closed) is critical and should ideally be raised in priority.
- `analyze <table>` is a safe operation at most times; during deploy migrations executing DDL statements it might be mildly contra-indicated, but even then it's probably still reasonable, as our migrations should be safe even if they take a little longer, and the analyze will only take a minute or two (or, at worst, will be blocked by the DDL). If there is even the slightest suspicion that a table has bad statistics, this should be a go-to action (see the sketch below).
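On the safety point: `ANALYZE` takes only a `SHARE UPDATE EXCLUSIVE` lock, so it does not block normal reads and writes, only DDL (and vice versa). A quick sketch for checking whether a running analyze is itself waiting on a lock (assuming monitoring access to the primary):

```sql
-- Is a running ANALYZE waiting on a lock, and if so, which sessions block it?
SELECT pid,
       state,
       wait_event_type,
       wait_event,
       pg_blocking_pids(pid) AS blocked_by,
       query
FROM pg_stat_activity
WHERE query ILIKE 'analyze%';
```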