Tweaking (decreasing) idle_in_transaction_session_timeout on Production
C1
Production Change - Criticality 1Change Objective | Describe the objective of the change |
---|---|
Change Type |
idle_in_transaction_session_timeout needs to be decreased in order to terminate long transactions within long idle in transaction state. Currently sits in 1min, which is relatively high considering the high traffic. |
Services Impacted | patroni-* |
Change Team Members | @ahmadsherif |
Change Severity | C1 |
Buddy check | @emanuel_ongres |
Tested in staging | |
Schedule of the change | Date and time (with timezone) |
Duration of the change | This variable has user scope, meaning that set will apply effectively. Although, if persisted in configuration, a SIGHUP is necessary for refreshing it across all active sessions |
Downtime Component | No downtime required |
Detailed steps for the change. Each step must include: | See bellow |
General
Currently, idle_in_transaction_session_timeout
is set to 1min
which is a large value for current production traffic.
@abrandl recommended to reduce to 15s
, although, it may probably be too short and could generate undesired cancelled jobs. So, we are going to decrease this gradually until we see that we get cancelled large runs, without cancelling too many jobs. That is, first iteration will be set it at 30s
and, if no perceptible changes (that is, long jobs never get targeted by this threshold, consequently no action) it can be decreased to its half 15s
.
Prerequisites
-
All Patroni nodes must be in a healthy state and relatively short replication lag (under 100MB delay). -
Full list of all nodes (replicas first, leader at the end).
Execution Steps
-
alter system set idle_in_transaction_session_timeout = '30s'
in the first node of the list. -
Monitor errors and verify the service messages. If no errors, change is safe. -
Iterate the change across all replicas, with its corresponding verification. -
Persist the configuration on the configuration file (*).
Once the change is applied to all replicas, the leader is the next:
-
alter system set idle_in_transaction_session_timeout = '30s'
in the leader -
Verify if sessions are being killed and monitor any other related issues through tail -vf /var/log/gitlab/postgresql/postgresql.log
.- Useful queries that can be used Locks, Activity with Locks and General Activity Query
Verification Steps
Once the idle_in_transaction_session_timeout
is set to 30s, there should be visible the impact when large transactions occur. If no transaction is being killed when large activity issued, it might be necessary to reduce it to its half, by using the same procedure above.
Related Issues
PgBouncer Notes
Currently, PgBouncer supports idle_transaction_timeout
and it is currently disabled (0) in write-pools.
Probably, as this is write issue and to avoid changes on Patroni cluster, it is possible to apply this change
at PgBouncer side, avoiding the execution all over the Patroni fleet. Although, this is only applicable for the writing transactions (pgbouncer-02/03). It only requires SIGHUP and this value is in seconds.
(*) Step require to be reviewed and commented by DBRE