Given that this is the case, we're going to have to not only address models that appear in the dashboard but also remove all instances of subtransaction-generating code from our codebase.
Identify lines of code in our codebase that use requires_new: true to generate subtransactions and determine an appropriate alternative that does not. For example, identify such lines using grep -R 'requires_new: true' (thanks @sgoldstein) and then implement an upsert in place of the subtransaction.
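As a rough illustration of what such a replacement might look like (the Preference model, its columns, and the unique index below are hypothetical, not actual code from our codebase):

```ruby
# Hypothetical "before" pattern: the nested transaction exists only so that a
# unique-constraint violation can be rescued without aborting the outer
# transaction; every attempt emits SAVEPOINT / RELEASE SAVEPOINT.
def record_theme(user_id)
  ApplicationRecord.transaction(requires_new: true) do
    Preference.create!(user_id: user_id, key: 'theme', value: 'dark')
  end
rescue ActiveRecord::RecordNotUnique
  Preference.find_by!(user_id: user_id, key: 'theme').update!(value: 'dark')
end

# Hypothetical "after" pattern: a single INSERT ... ON CONFLICT DO UPDATE via
# Rails 6's upsert, so no SAVEPOINT and therefore no subtransaction. This
# assumes a unique index on (user_id, key).
def record_theme(user_id)
  Preference.upsert(
    { user_id: user_id, key: 'theme', value: 'dark' },
    unique_by: %i[user_id key]
  )
end
```

The right alternative will depend on the call site; an upsert is just one option for cases where the subtransaction only existed to handle unique-constraint races.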
Please review the linked issues and the issues labelled with subtransactions to determine whether your team owns the model in question or is best placed to remove subtransactions from it.
Considerations
Removing a previously relied-upon subtransaction can introduce bugs, so each change should be reviewed with that risk in mind.
In some cases, existing logic will need to be replaced with an alternative that safely accounts for the state of the data without using subtransactions.
The presence of any subtransaction poses a risk to our PG cluster performance.
Mitigation work is considered infradev severity1 priority1.
History
With the new instrumentation available from #337843 (closed), we want to identify the cause of subtransaction-related database degradation and identify improvement opportunities.
I managed to spend some time today looking at the changes between PostgreSQL 12.7 and PostgreSQL 14 related to how subtransactions are handled. As @stanhu previously noticed, not much has changed since then.
An interesting one is https://github.com/postgres/postgres/commit/dee663f7843, which introduces another LWLock wait event: SubtransBuffer ("Waiting for I/O on a sub-transaction SLRU buffer"). This could make it easier to understand our case, but it is unlikely to fix a bug we might be seeing in our cluster (a bug is still something we have not ruled out at this point).
I see that in PostgreSQL 12 we also have a subtrans LWLock along with SubtransControlLock. We saw this lock a few times during the incidents too, and it is described as "Waiting for I/O on a subtransaction buffer". I still need to understand how it differs from the SubtransBuffer lock in PostgreSQL 14, because the difference in the code is apparent.
We had another instance of elevated SubtransControlLocks today between 12:49 and 12:55 UTC. It lasted around 6 minutes - not enough to cause an incident or database contention.
This increased queuing query latency a bit, but not enough to violate our runners' Apdex score; no user impact this time.
Again we saw a significant number of subtransactions created by the Notes model around 5 minutes before the degradation.
@grzesiek could you elaborate on the correlation with Notes? For example, do we know what code paths are generating these Notes-related subtransactions?
@sgoldstein we usually see spikes from the Notes model around incidents, but these almost never correlate exactly with the incidents. This might be a red herring; we simply do not have enough data to tell whether this is related.
We saw another occurrence of SubtransControlLocks causing contention today, but it was a minor degradation, lasting only 6 minutes, without user impact.
The new instrumentation did not log anything new during the incident. Either we didn't have 64+ nested subtransactions per transaction during the contention, or our instrumentation does not cover something (for example, Sidekiq workers).
During the performance degradation today we again saw a long-running transaction in our pg_stat_activity sampling. We investigated this and concluded that it was an autovacuum ANALYZE transaction working on the issues table.
@cat did a really great analysis of how transactions are processed by PostgreSQL, which allows us to benchmark another scenario involving a concurrent long-running transaction.
What we still don't know
Why are CI/CD build queuing queries affected the most? This might be sampling bias.
Why don't we see context switches caused by locking on the primary database?
How are long-running autovacuum ANALYZE transactions related to subtransaction locking?
Does our new instrumentation work correctly?
Is subtransaction locking a cause or a symptom?
Next steps
Validate that Sidekiq workers are covered by the new instrumentation.
Yeah, that was my test. I just tried a similar test on a K8S node, and I do NOT see the log message in Kibana. I think the application_json.log may not be parsed on K8S nodes.
I added that info to the standup notes for tomorrow. If there's more detail at this point on what needs to happen to confirm/resolve log forwarding issues let me know and I can add that info.
From the above thread, K8S nodes report these logs in the sidekiq Elasticsearch index with the subcomponent application_json. Omnibus nodes report these logs to the rails index with the tag application_json.
I added debugging to my Sidekiq instance and watched PostgreSQL statement logs. As far as I can tell, the instrumentation is working as designed.
I think that !68189 (merged) reduces some SAVEPOINT and RELEASE calls (one per vulnerability finding), but Note and Ci::Minutes::NamespaceMonthlyUsage appear to be dominating the subtransaction graphs now. We may want to lower the instrumentation threshold to 10 or so SAVEPOINTs to get more data. !68219 (merged) might reduce Ci::Minutes::NamespaceMonthlyUsage a bit.
@nikolay's data below is very interesting. It makes me wonder if replicas have to do a significant amount of work loading pg_subtrans data when there is a long transaction. I'm not quite sure how a replica gets this data, since subtransaction IDs aren't written to the WAL.
Fact: all the "performance degradation on replicas" incidents since the end of June had LWLock spikes, most likely SubtransControlLock, and, additionally, a long-running transaction (either autovacuum: ANALYZE or an application-generated one).
@stanhu @cdu1 here is an addition to the table in the issue description, including some June and July incidents. It also includes some link corrections plus info about long-running transactions (including both user-generated ones and, as we saw, autovacuum: ANALYZE).
Though we don't see the wait_event there, only wait_event_type -- it's LWLock, and the endpoint values are "POST /api/:version/jobs/request" -- this matches the August events, so we can conclude with high confidence that those LWLocks were SubtransControlLock ones. For the last one, 2021-07-14 at ~18:10 UTC, it was manually confirmed that those LWLocks were indeed SubtransControlLock: gitlab-com/gl-infra/production#5152 (comment 629654474).
I'm trying to understand the lock contention on the replicas (sorry if some of this was already discussed; I don't have much context since I was off until Monday). Since the lock is SubtransControlLock, and we only acquire (and then release) the exclusive lock when reading pg_subtrans from disk, does this mean:
Only queries using subtransactions are affected? The reasoning here would be that only queries reading/writing subtransaction data should wait for this lock (although, with enough queries waiting for this lock, the entire DB could get overloaded).
In the application, we write on the primary and then attempt to read from a replica in a subtransaction? Otherwise, I'm not really sure how we end up using subtransactions on the replicas (since @NikolayS mentioned read-only transactions don't even get txids assigned). I'm thinking mostly of some "read-your-own-write" sort of delay until the txid gets replicated, but I didn't really see logic for that when reading SlruPhysicalReadPage.
Looks like that's not very relevant now. Based on @NikolayS's analysis in #338410 (comment 652518469), which reproduced this, it does happen with only SELECTs on the replicas and a long-running tx on the primary, after multiple concurrent subtransactions on the primary.
If I understand correctly, it appears that performance drops on the replicas because we exceed the number of transaction IDs we can keep in working memory. As discussed in #338410 (comment 652117324), the subtransaction LRU (SLRU) cache in PostgreSQL can store up to 65K transaction IDs. Once we exceed that limit, performance on the replicas drops because we have to go to disk more often to read the pg_subtrans information.
Even without subtransactions, we see a performance drop because we have to go to disk. It's just that reading from pg_subtrans is more expensive for some reason (this comment amused me: https://github.com/postgres/postgres/commit/573a71a5da70d6e2503c8f53e3b4f26b3b6d738d). With our workload, it appears to induce a context switch storm for some reason.
Next steps:
Get rid of SAVEPOINT and RELEASE.
Report this problem to the PostgreSQL mailing list. It would be nice to have metrics around this to warn when we start to get SLRU misses, and to see if there are performance optimizations that can be made to prevent the context switch storm; a rough sketch of what such a check might look like is below.
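For the metrics idea, something along these lines could work. This is only a sketch: pg_stat_slru is available from PostgreSQL 13 onward (so it assumes the post-upgrade cluster), and the 'Subtrans' row name, the class/method names, and the alert threshold are all assumptions, not existing code.

```ruby
# Sketch of a periodic check for pg_subtrans SLRU cache misses using the
# pg_stat_slru view (PostgreSQL 13+). Names and threshold are hypothetical.
class SubtransSlruCheck
  MISS_RATIO_THRESHOLD = 0.01 # hypothetical alerting threshold

  def self.check!
    row = ApplicationRecord.connection.select_one(<<~SQL)
      SELECT blks_hit, blks_read
      FROM pg_stat_slru
      WHERE name = 'Subtrans'
    SQL
    return if row.nil?

    hits  = row['blks_hit'].to_f
    reads = row['blks_read'].to_f
    total = hits + reads
    return if total.zero?

    miss_ratio = reads / total
    return if miss_ratio <= MISS_RATIO_THRESHOLD

    Rails.logger.warn("pg_subtrans SLRU miss ratio is #{miss_ratio.round(4)}")
  end
end
```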
@stanhu, thank you for the summary, all seems right. One correction to clarify the meaning and avoid ambiguity: "Even without subtransactions, we see a performance drop because we have to go to disk." This is about the case when we initially had a lot of subtransactions used in parallel with a long-running transaction -- then we may stop having subtransactions at all and have only the long-running transaction. That alone is enough to keep experiencing performance degradation on replicas until this transaction finishes (and allows Postgres to clean up the SLRU cache). If we run the same workloads but without subtransactions initially, there are no degradations. This allows us to conclude that getting rid of SAVEPOINTs will solve the problem.
@stanhu I don't think so -- as far as I know how the Postgres project is organized (very decentralized development), it's all up to volunteers (individuals and companies) to review pending patches. We can definitely help move the patches forward if we test them and provide feedback to the pgsql-hackers mailing list. Additionally, what is possible:
I can try asking people who reviewed earlier versions of those patches to have a look again, and
bring more attention to the patches by showing their importance in various community channels
We made significant progress in determining the root cause of the database contention caused by subtransactions. The likely root cause, reproduced in a benchmarking environment, is described in #338346 (comment 652623314).
The root cause description answers all the questions we had in the previous status update!
The most likely root cause is a PostgreSQL bug.
What we still don't know
We just want to confirm the root cause hypothesis!
Next steps
Confirm root cause analysis.
Design a mitigation plan that will result in fixing the database contention problem.
Implement the mitigation plan in collaboration with engineering leaders.
Characteristics: any place where we use a transaction with requires_new: true is suspect.
@stanhu and @grzesiek will identify low-hanging fruit from a quick scan of requires_new: true and report it in issues under the epic &6540 (closed). These issues will be passed to the query owner groups to fix.
@craig-gomes will have the DB team implement a RuboCop rule to warn about and prevent new subtransactions.
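For reference, such a rule might look roughly like the sketch below. The cop name, namespace, and message are hypothetical; the real rule is up to the DB team.

```ruby
# Hypothetical RuboCop cop that flags ActiveRecord transactions opened with
# `requires_new: true`, since these are implemented as SAVEPOINT-based
# subtransactions in PostgreSQL.
module RuboCop
  module Cop
    module Database
      class AvoidSubtransactions < RuboCop::Cop::Base
        MSG = 'Avoid `requires_new: true`; subtransactions (SAVEPOINT/RELEASE) ' \
              'can degrade PostgreSQL performance.'

        # Matches `transaction(requires_new: true)` with or without an explicit
        # receiver, e.g. `ApplicationRecord.transaction(requires_new: true)`.
        def_node_matcher :subtransaction?, <<~PATTERN
          (send _ :transaction (hash <(pair (sym :requires_new) true) ...>))
        PATTERN

        def on_send(node)
          add_offense(node) if subtransaction?(node)
        end
      end
    end
  end
end
```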
@tkuah - Yes we should. It appeared that the evidence was pointing to the long transactions being outside the application (e.g. vacuum queries) but I don't think we have enough information to say that definitively yet. @NikolayS has been doing some analysis and may have some more insight on what is currently known about the long running queries.
@tkuah @stanhu This is great 👍 I didn't know we had visibility into what's inside long-running transactions.
@sgoldstein autovacuum's activity (actually ANALYZE) was seen in one case and was definitely not present in some others -- so it's still a bit of a grey area. As I mentioned, if VACUUM itself could be considered a long-running transaction, we should be having many incidents per day, so I suspect only the ANALYZE variant of it. Planning to double-check.
But what is definitely happening: any regular (user-generated) long-running transaction can provoke the problem. It can be anything; all we need is to have a new XID assigned, and we can even skip dealing with tables. I've just successfully built a fully synthetic experiment showing the problem, where I used select txid_current(), pg_sleep_for(...); as the long-running transaction -- a regular SELECT that gets a new XID assigned and waits for the specified time.
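For anyone who wants to try reproducing something similar, a rough sketch of that kind of setup is below. This is not the exact experiment: the connection URLs, the scratch table, the thread and iteration counts, and the sleep duration are all hypothetical, and the real benchmark was driven differently.

```ruby
# Sketch of a synthetic reproduction: a long-running transaction on the
# primary, concurrent sessions generating many subtransactions on the primary,
# and plain SELECTs on a replica that should eventually show
# SubtransControlLock / pg_subtrans I/O waits once the cached sub-XIDs overflow.
require 'pg'

PRIMARY = ENV.fetch('PRIMARY_DATABASE_URL')
REPLICA = ENV.fetch('REPLICA_DATABASE_URL')

PG.connect(PRIMARY).exec('CREATE TABLE IF NOT EXISTS subtx_scratch (id int)')

# 1. Long-running transaction on the primary with an XID assigned.
long_tx = Thread.new do
  conn = PG.connect(PRIMARY)
  conn.exec('BEGIN')
  conn.exec("SELECT txid_current(), pg_sleep_for(interval '10 minutes')")
  conn.exec('COMMIT')
end

# 2. Concurrent sessions generating many subtransactions (SAVEPOINT + write).
writers = 8.times.map do
  Thread.new do
    conn = PG.connect(PRIMARY)
    500.times do
      conn.exec('BEGIN')
      100.times do |i|
        conn.exec("SAVEPOINT s#{i}")
        conn.exec('INSERT INTO subtx_scratch VALUES (1)')
        conn.exec("RELEASE SAVEPOINT s#{i}")
      end
      conn.exec('COMMIT')
    end
  end
end

# 3. Plain SELECTs on a replica while watching pg_stat_activity for
#    LWLock waits.
reader = Thread.new do
  conn = PG.connect(REPLICA)
  loop { conn.exec('SELECT count(*) FROM subtx_scratch') }
end

[long_tx, *writers].each(&:join)
reader.kill
```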