The apdex has recovered; however, we are still investigating the root cause. A concurrent increase in autovacuum activity on ci_pending_builds and ci_running_builds likely contributed to the statement timeouts.
13:30 - increase in autovacuum on ci_pending_builds and ci_running_builds begins
13:35 - Statement timeouts begin on ci_pending_builds
13:37 - Large spike in statement timeouts on project_builds table starts, apdex begins to fall
14:42 - alejandro declares incident in Slack.
14:50 - Statement timeouts on ci_pending_builds end
15:05 - apdex recovers
15:15 - autovacuum on ci_pending_builds and ci_running_builds returns to normal levels
Corrective Actions
Corrective actions should be added here as soon as an incident is mitigated; ensure that all corrective actions mentioned in the notes below are included.
...
Note:
In some cases we need to redact information from public view. We only do this in a limited number of documented cases, laid out in our handbook page. This might include the summary, the timeline, or any other bits of information. Any of this confidential data will be in a linked issue, only visible internally.
By default, all information we can share will be public, in accordance with our transparency value.
Incident Review
Summary
Service(s) affected:
Team attribution:
Time to detection:
Minutes downtime or degradation:
Metrics
Customer Impact
Who was impacted by this incident? (i.e. external customers, internal customers)
...
What was the customer experience during the incident? (i.e. preventing them from doing X, incorrect display of Y, ...)
...
How many customers were affected?
...
If a precise customer impact number is unknown, what is the estimated impact (number and ratio of failed requests, amount of traffic drop, ...)?
...
What were the root causes?
...
Incident Response Analysis
How was the incident detected?
...
How could detection time be improved?
...
How was the root cause diagnosed?
...
How could time to diagnosis be improved?
...
How did we reach the point where we knew how to mitigate the impact?
...
How could time to mitigation be improved?
...
What went well?
...
Post Incident Analysis
Did we have other events in the past with the same root cause?
...
Do we have existing backlog items that would've prevented or greatly reduced the impact of this incident?
...
Was this incident triggered by a change (deployment of code or change to infrastructure)? If yes, link the issue.
@stanhu very likely. There were a bunch of statement timeouts on project_builds, which I noted below right before I saw your comment. I was wrong and misread the query.
The output wouldn't go to the logs, but this error message (from the screenshot I linked above) looks like it might have been from someone typing on the database console?
Yeah, for security reasons, the parameters are filtered in the slow logs in Kibana. In the future, you would need to get the exact query from the CSV directly.
The PostgreSQL logs don't seem to have that many mentions of locks during this time period. Mostly these queries were canceled due to statement timeouts. Maybe there was a query estimate issue here. I wish we had the EXPLAIN output.
However, this might be an issue with Kibana – I couldn't find any autovacuum runs prior to June 16. So it looks like we can rely only on data since that date.
That being said, acute spikes -- especially on June 23 and 24 -- are obvious.
Tuple stats show that the workload is "INSERT, then DELETE" – the numbers of INSERTs and DELETEs match, a "queue-like" workload (a scalability anti-pattern for Postgres MVCC):
Other tables have tuple stats for dates prior to June 16 – but ci_running_builds doesn't. So, most likely, the INSERTs/DELETEs started on that date, causing frequent autovacuuming.
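For reference, the same churn can be read straight from cumulative statistics; a minimal query against pg_stat_user_tables (illustrative only, not the exact source of the charts above) would be:

```sql
-- Compare cumulative INSERT vs DELETE counts and autovacuum activity
-- for the two queue-like CI tables (illustrative check, not the dashboard source).
SELECT relname,
       n_tup_ins,          -- rows inserted since stats reset
       n_tup_del,          -- rows deleted since stats reset
       n_live_tup,
       n_dead_tup,
       last_autovacuum,
       autovacuum_count
FROM pg_stat_user_tables
WHERE relname IN ('ci_pending_builds', 'ci_running_builds');
```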
A quick guard against too-frequent autovacuums (which, as we can see above and as one would expect, don't save us from the bloat, because this workload is an anti-pattern for Postgres MVCC) is raising autovacuum_vacuum_threshold and autovacuum_analyze_threshold from the default of 50 to a few thousand.
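A minimal sketch of that per-table change (the values below are placeholders for "a few thousand", not an agreed setting):

```sql
-- Raise per-table autovacuum/autoanalyze trigger thresholds so that a small,
-- high-churn table is not vacuumed/analyzed every few dozen row changes.
ALTER TABLE ci_pending_builds SET (
  autovacuum_vacuum_threshold  = 5000,
  autovacuum_analyze_threshold = 5000
);
ALTER TABLE ci_running_builds SET (
  autovacuum_vacuum_threshold  = 5000,
  autovacuum_analyze_threshold = 5000
);
```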
To scale better, these tables and this workload need a redesign (e.g. 3 partitions, an INSERT-only workload to the "hot" partition plus periodic TRUNCATE for the "idle" partitions, or simply time-decay automated partitioning with partition pruning -- this could be done with TimescaleDB).
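A very rough sketch of the time-decay idea with native declarative partitioning (table names, column list, and ranges are hypothetical; the real schema would carry more columns and indexes):

```sql
-- Queue-like data partitioned by time; an old partition is TRUNCATEd
-- instead of being DELETEd from row by row, so there is no dead-tuple churn.
CREATE TABLE ci_pending_builds_partitioned (
    build_id   bigint      NOT NULL,
    created_at timestamptz NOT NULL
) PARTITION BY RANGE (created_at);

CREATE TABLE ci_pending_builds_p_idle
    PARTITION OF ci_pending_builds_partitioned
    FOR VALUES FROM ('2021-06-23') TO ('2021-06-24');

CREATE TABLE ci_pending_builds_p_hot
    PARTITION OF ci_pending_builds_partitioned
    FOR VALUES FROM ('2021-06-24') TO ('2021-06-25');

-- Once a partition no longer receives new rows, it can simply be truncated:
TRUNCATE TABLE ci_pending_builds_p_idle;
```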
Had a call with @grzesiek (thanks!); some numbers – current and forecasted for a 2-year horizon:
both ci_running_builds and ci_pending_builds are ~2000 rows now and will grow 10x (to 20k rows)
operation (INSERT/DELETE) rates: 10-20 per sec now, and 100-200 in 2y
INSERTs and DELETEs are happening in different transactions. However, some transactions have more than 1 statement (operations with ci_builds involved)
the duration of the "big query" is 100ms now – not very good, but kind of acceptable
from the plan of the "big query" we see that an index scan is applied to [almost] all of them anyway – it looks like there is no need for frequent stats recalculation (example: https://explain.depesz.com/s/HoDO) -- we can consider tuning autovacuum to run much, much less frequently (both the VACUUM and ANALYZE parts) // <-- this is for short-term ideas
there are business requirements that mean no rows older than 24h (by created_at) are allowed in both tables -- we can rely on this in the future when choosing a better strategy // <-- this is for longer-term ideas
This data should be useful for mitigating the problem of frequent autovacuums and frequent stats recalculation (a few times per minute during workday periods without spikes). However, there is still the question of what caused such huge spikes on June 24.
Double-checking the thesis @stanhu raised above that we had a lot of autovacuum runs for ci_[pending|running]_builds during the incident (starting June 24, ~13:40 UTC) – I cannot confirm it. The screenshot in comment #4972 (comment 610881804) looks like it's about autovacuums for ci_running_builds, but those bars at ~13:40 include not only autovacuums – the majority of them are statement timeouts. The query in Kibana was simply ci_running_builds (https://log.gprd.gitlab.net/goto/aa1657ff116dcdf92745ae30e8f1619c):
If we "zoom" in on one of the first high bars, we see that statement timeouts make up the majority of those log entries:
Cc @grzesiek @vitabaks @Finotto @stanhu @ahanselka @abrandl – this is important: high autovacuum frequency is not directly related to the problem. However, I still think that we might be dealing with a "plan flip" again. But I doubt it's related to the stats of ci_[pending|running]_builds – as we see from the plans, we scan all ~2k rows anyway (example: https://explain.depesz.com/s/HoDO), so the freshness of the stats for these small tables is not really important. So it might be a flip related to a recalculation of stats on some large table.
Conclusion & proposal
Frequent autovacuum on ci_[pending|running]_builds turned out to be a red herring.
We have seen timing coincidences like this before, where a vacuum on ci_builds finishing was closely related (time-wise) to another pathological symptom (statement timeouts) coming to an end.
The interesting part is that the query called out here (https://explain.depesz.com/s/HoDO) doesn't use ci_builds. Whether or not there is causality I'm not sure, but I wanted to note that we've seen this pattern before.
@abrandl the ci_builds table is being used when INSERTing into and DELETEing from ci_pending/running_builds, because we perform these operations in a transaction together with an UPDATE ci_builds SET ... during build status transitions (state machine).
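Roughly, a status transition looks like this (a simplified sketch; the actual statements, columns, and values in the application differ):

```sql
-- Simplified shape of a build state transition: the queue tables are
-- modified in the same transaction as the UPDATE on ci_builds.
BEGIN;
UPDATE ci_builds
   SET status = 'running', updated_at = now()
 WHERE id = 12345;                               -- hypothetical build id
DELETE FROM ci_pending_builds WHERE build_id = 12345;
INSERT INTO ci_running_builds (build_id, created_at)
VALUES (12345, now());
COMMIT;
```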
I wonder if this was actually a database-wide problem, not really something that was CI-specific. We just happen, by coincidence, to alert on the outcome of the builds queuing query degradation.
On the other hand, the marginalia sampler shows that mostly queuing queries were affected. Not sure why the rails_sql_latency per category shows other categories too.
I see a significant spike in the LWLock and IPC wait types (not sure what they mean):
LWLock: The backend is waiting for a lightweight lock. Each such lock protects a particular data structure in shared memory. wait_event will contain a name identifying the purpose of the lightweight lock. (Some locks have specific names; others are part of a group of locks each with a similar purpose.)
IPC: The server process is waiting for some activity from another process in the server. wait_event will identify the specific wait point.
How can I access the logs to see exactly which LWLock / IPC wait events we had to wait on?
@grzesiek thank you for the additional monitoring pictures.
In general, all of this shows normal behavior for a situation with very slow queries (which is not normal, of course, and the reasons are still not fully clear) -- and, as a result, many more active sessions.
How can I access the logs to see exactly which LWLock / IPC wait events we had to wait on?
As you mentioned above, this is sampled data -- it is based on observing pg_stat_activity, which has the columns wait_event and wait_event_type. LWLock indicates work with the buffer pool; IPC – inter-process communication. Note that the duration of those wait events is unknown – they may be VERY brief. We just observed samples, once per N seconds, and saw how many backends were "sitting" in which state.
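For an ad-hoc look, the same kind of sample can be taken manually; an illustrative query (not the exact sampler implementation) is:

```sql
-- Snapshot of what active backends are currently waiting on; running this
-- repeatedly (e.g. with psql's \watch 5) approximates the sampler's view.
SELECT now() AS sampled_at,
       wait_event_type,
       wait_event,
       count(*) AS backends
FROM pg_stat_activity
WHERE state = 'active'
GROUP BY 1, 2, 3
ORDER BY backends DESC;
```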
In my opinion, the main hypothesis is still a plan flip.
Ok, thanks @NikolayS! What are the next steps? Do we need to modify our infra to log plans for slow queries and wait for another incident to progress with this investigation?
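For context, one possible way to capture plans for slow queries (an option to evaluate, not a decision) is the auto_explain contrib module; the thresholds below are examples only:

```sql
-- Log the plan of any statement slower than 1s into the PostgreSQL log.
-- In production this would normally be configured via shared_preload_libraries
-- and postgresql.conf rather than per-session LOAD/SET.
LOAD 'auto_explain';
SET auto_explain.log_min_duration = '1s';
SET auto_explain.log_analyze      = off;   -- keep overhead low: plans without actual timings
SET auto_explain.log_format       = 'json';
```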
Marking this incident as resolved; if we think more investigation is needed, let's break it out into a new issue. If it would benefit from a sync discussion, feel free to also add a ~"review-requested" label and we can put it on the next review agenda.