ProjectDestroyWorker sidekiq job holds db transaction open for 1 hour
Summary
A sidekiq job held open a postgres transaction for 60 minutes (at which point presumably sidekiq reached its deadline and aborted the job, rolling back the db transaction).
The fact that a transaction was open for so long triggered a PagerDuty alert "Firing 11 - PostgreSQL_ReplicaStaleXmin". This alert just means that all of the replica dbs have noticed that there is at least 1 very old transaction open on the primary db that cannot be replicated until it ends (commit/rollback).
Practically this contributes to bloating of all high-churn tables and indexes, which has the long-term effect of reducing overall database efficiency. So it's not a dire problem, but it's worth preventing if it's a repeating pattern.
In this case, it looks like it might be appropriate for the sidekiq job to release its db transaction rather than hold it continuously throughout the job.
More information will be added as we investigate the issue.
Timeline
All times UTC.
2020-04-21
- 14:17 - Sidekiq job id
b5177b792f9d972172d53f52
starts transaction in postgres. - 15:02 - PagerDuty alert "Firing 11 - PostgreSQL_ReplicaStaleXmin"
- 15:17 - Transaction ends. Alert clears. Presumably Sidekiq job aborted after reaching a 60-minute deadline.
Graph of the alerting metric
This metric indicates that the db replicas all detected the presence of an old transaction on the primary.
For context, we alert on this because it contributes to bloating of tables and indexes that have heavy writing during this timespan (because it can impede efficient vacuuming of dead tuples), and this can have a long term effect on database performance.