Pause migration while autovacuum is running for the table
Extracted from #353395 (comment 892991285)
Overview
- Indicator: Active autovacuum on the table the migration works on (yes/no)
- Source: query primary on pg_stat_activity
- Action: Pause migration while autovacuum is running on the same table the migration works on
- Needs prometheus: No (see Caveat)
This is a higher-level, table-level indicator. Ideally, autovacuum is already tuned to sane levels on the system side (to achieve good autovacuum results in the first place), so the indicator needs no separate tuning. A running autovacuum can be seen as a sign of a high rate of churn, so we pause further updates until it has finished.
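As a minimal sketch of the yes/no check described above, the following assumes that autovacuum workers show up in pg_stat_activity with backend_type = 'autovacuum worker' and a query text such as "autovacuum: VACUUM public.events". The names ACTIVITY_SQL and autovacuum_active_on are hypothetical, quoted identifiers are not handled, and treating an ANALYZE-only run as "not active" is an assumption rather than a decision taken in this issue.

```python
import re

# Assumed shape of an autovacuum entry in pg_stat_activity.query, e.g.
#   "autovacuum: VACUUM public.events"
#   "autovacuum: VACUUM ANALYZE public.events (to prevent wraparound)"
AUTOVACUUM_QUERY = re.compile(
    r"^autovacuum: VACUUM(?: ANALYZE)? (?P<schema>[^.\s]+)\.(?P<table>\S+)"
)

# Query a caller would run on the primary to fetch the candidate rows.
ACTIVITY_SQL = """
SELECT query
FROM pg_stat_activity
WHERE backend_type = 'autovacuum worker'
"""


def autovacuum_active_on(table, activity_queries, schema="public"):
    """Return True if any query text describes an autovacuum VACUUM of schema.table."""
    for query in activity_queries:
        match = AUTOVACUUM_QUERY.match(query)
        if match and match.group("schema") == schema and match.group("table") == table:
            return True
    return False
```

The function takes the query texts as a plain list so that the parsing can be exercised without a database connection; a caller would pass in the rows returned by ACTIVITY_SQL.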
❗ Caveat: Querying pg_stat_activity is currently not possible on .com for the gitlab user (permission denied). We will have to grant permissions or work around this (e.g. with a custom function or an alternative way of getting this information).
Alternatively, the same information is available through Prometheus: max(pg_stat_activity_autovacuum_age_in_seconds{env="gprd"}) by (relname)
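For the Prometheus route, a rough sketch could evaluate the expression through the standard instant-query HTTP endpoint (/api/v1/query) and read the per-table ages from the response. The base URL, the helper names, and the rule "a table counts as under autovacuum when it appears with an age above zero" are assumptions for illustration; how the metric behaves when no autovacuum is running would need to be checked against the real exporter.

```python
import json
import urllib.parse
import urllib.request

PROMQL = 'max(pg_stat_activity_autovacuum_age_in_seconds{env="gprd"}) by (relname)'


def fetch_instant_query(base_url, promql=PROMQL, timeout=10):
    """Run an instant query against a Prometheus server and return the decoded JSON."""
    params = urllib.parse.urlencode({"query": promql})
    url = base_url.rstrip("/") + "/api/v1/query?" + params
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return json.load(resp)


def autovacuum_ages(response):
    """Map relname -> autovacuum age in seconds from an instant-query response."""
    if response.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    ages = {}
    for sample in response["data"]["result"]:
        relname = sample["metric"].get("relname")
        if relname is not None:
            # An instant-vector sample is [timestamp, "value-as-string"].
            ages[relname] = float(sample["value"][1])
    return ages


def autovacuum_active_on(table, ages):
    """Assumption: a table is under autovacuum if it is reported with an age > 0."""
    return ages.get(table, 0.0) > 0.0
```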
Discussion
In particular with large tables, this means that we may pause the data migration for many hours until the autovacuum run has finished. This gives priority to regular application-side updates.
The caveat here is that very large tables can lead to very long autovacuum runs, which in turn would dramatically reduce the throughput of data migrations. This is specific to GitLab.com, and we can decide how to handle it once the indicator is implemented and feature flagged. In any case, the underlying problem is the large tables and the long autovacuum times (which have many more implications for database health). Slowing migrations down in these cases is another expression of large tables being a problem.
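The pausing behaviour discussed here can be sketched as a loop that checks the indicator before every batch. This is only an illustration, not how the migration framework is implemented: run_migration, process_batch, autovacuum_active and the 60-second poll interval are all hypothetical.

```python
import time


def run_migration(batches, process_batch, autovacuum_active,
                  poll_interval=60.0, sleep=time.sleep):
    """Process batches in order, pausing while autovacuum runs on the table.

    autovacuum_active is a zero-argument callable returning True while an
    autovacuum is active on the migrated table. Returns how many times the
    loop had to wait, which gives a rough measure of lost throughput.
    """
    paused_polls = 0
    for batch in batches:
        # Application-side updates get priority: wait until autovacuum is done.
        while autovacuum_active():
            paused_polls += 1
            sleep(poll_interval)
        process_batch(batch)
    return paused_polls
```

Because the loop has no upper bound on the wait, a long autovacuum on a large table stalls the migration for its whole duration, which is the throughput cost described above.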
Out of scope
- Detect whether autovacuum would be necessary (system side problem) - may be a follow up
- Detect whether all autovacuum workers are busy (system side problem) - seems unnecessary (current capacity used) and is well monitored