Tweaking (decreasing) idle_in_transaction_session_timeout on Production
|Change Objective||Describe the objective of the change|
|Change Team Members||@ahmadsherif|
|Tested in staging|
|Schedule of the change||Date and time (with timezone)|
|Duration of the change||This variable has
|Downtime Component||No downtime required|
|Detailed steps for the change. Each step must include:||See bellow|
idle_in_transaction_session_timeout is set to
1min which is a large value for current production traffic.
@abrandl recommended to reduce to
15s, although, it may probably be too short and could generate undesired cancelled jobs. So, we are going to decrease this gradually until we see that we get cancelled large runs, without cancelling too many jobs. That is, first iteration will be set it at
30s and, if no perceptible changes (that is, long jobs never get targeted by this threshold, consequently no action) it can be decreased to its half
- All Patroni nodes must be in a healthy state and relatively short replication lag (under 100MB delay).
- Full list of all nodes (replicas first, leader at the end).
alter system set idle_in_transaction_session_timeout = '30s'in the first node of the list.
- Monitor errors and verify the service messages. If no errors, change is safe.
- Iterate the change across all replicas, with its corresponding verification.
- Persist the configuration on the configuration file (*).
Once the change is applied to all replicas, the leader is the next:
alter system set idle_in_transaction_session_timeout = '30s'in the leader
Verify if sessions are being killed and monitor any other related issues through
tail -vf /var/log/gitlab/postgresql/postgresql.log.
idle_in_transaction_session_timeout is set to 30s, there should be visible the impact when large transactions occur. If no transaction is being killed when large activity issued, it might be necessary to reduce it to its half, by using the same procedure above.
Currently, PgBouncer supports
idle_transaction_timeout and it is currently disabled (0) in write-pools.
Probably, as this is write issue and to avoid changes on Patroni cluster, it is possible to apply this change
at PgBouncer side, avoiding the execution all over the Patroni fleet. Although, this is only applicable for the writing transactions (pgbouncer-02/03). It only requires SIGHUP and this value is in seconds.
(*) Step require to be reviewed and commented by DBRE