Long-running database transaction monitoring
Coming from production#3390 (comment 490544849), we regularly see a few long-running transactions scattered across the day.
This can be seen here:
We have seen some transactions take hours. This is a problem, particularly for schema migrations but also elsewhere. Schema migrations require a lock on the object, and that lock can be blocked for a long time by long-running transactions (even a read lock prevents DDL). This causes incidents like production#3390 (closed).
Now we do monitor the length of transactions (see above), but typically we don't log them or capture any further detail. To get to the root cause of a long-running transaction, one needs to observe it in realtime (by monitoring `pg_stat_activity` to get some query traces).
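As a sketch of what that manual realtime check looks like, here is a minimal Python helper. The column names (`pid`, `backend_xid`, `xact_start`, `query`) come from the `pg_stat_activity` view; the `long_transactions` function and the 5-minute threshold are assumptions for illustration, and the actual database connection is left out so the filtering logic stands on its own.

```python
from datetime import datetime, timedelta, timezone

# Assumption: 5 minutes as the "long transaction" threshold, per the sketch below.
LONG_TXN_THRESHOLD = timedelta(minutes=5)

# The query one would run by hand against pg_stat_activity: backend_xid is the
# transaction id, xact_start the time the current transaction started.
ACTIVITY_QUERY = """
SELECT pid, backend_xid, xact_start, query
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start
"""

def long_transactions(rows, now=None, threshold=LONG_TXN_THRESHOLD):
    """Filter (pid, xid, xact_start, query) rows down to transactions that
    have been open longer than the threshold, annotated with their duration."""
    now = now or datetime.now(timezone.utc)
    return [
        (pid, xid, now - started, query)
        for pid, xid, started, query in rows
        if now - started > threshold
    ]
```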
Can we get some sort of automated monitoring for long-running transactions?
A sketch:
- Any transaction > 5 minutes is considered 'long' (for discussion)
- We actively sample `pg_stat_activity` (say every 5 seconds)
- From the sampling, we log `pg_stat_activity.query`, the transaction id and overall duration
This would allow us to retroactively go in and easily see which components caused a long-running transaction.
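The sampling loop above can be sketched as follows. This is not a finished implementation: `fetch_activity` and `log` are hypothetical hooks (the real version would run a query against `pg_stat_activity` and write to whatever log pipeline we use), and the 5-second interval and 5-minute threshold are the values proposed above.

```python
import time
from datetime import datetime, timedelta, timezone

SAMPLE_INTERVAL = 5.0                      # seconds between samples (proposed)
LONG_TXN_THRESHOLD = timedelta(minutes=5)  # "long" threshold (proposed)

def sample_once(fetch_activity, log, now=None, threshold=LONG_TXN_THRESHOLD):
    """One sampling pass: fetch pg_stat_activity rows and log every
    transaction open longer than the threshold.

    fetch_activity() must return (pid, backend_xid, xact_start, query)
    tuples; log(record) receives a dict with the fields the issue asks for:
    the query, the transaction id and the overall duration.
    """
    now = now or datetime.now(timezone.utc)
    flagged = 0
    for pid, xid, started, query in fetch_activity():
        duration = now - started
        if duration > threshold:
            log({"pid": pid, "xid": xid,
                 "duration_s": duration.total_seconds(),
                 "query": query})
            flagged += 1
    return flagged

def run_sampler(fetch_activity, log, interval=SAMPLE_INTERVAL):
    """Sample forever at the given interval (e.g. every 5 seconds)."""
    while True:
        sample_once(fetch_activity, log)
        time.sleep(interval)
```

Keeping the fetch and log sides injectable means the flagging logic can be exercised without a live database.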