Cannot disable per-session statement timeouts for post-deployment migrations
Context
We've just started shipping post-deployment migrations. The first use case (part of #853 (closed)), is to create new indexes on a partitioned table, which involves CONCURRENTLY
creating an individual index on all partitions and then the corresponding index on the parent table.
We started by creating indexes on just two partitions (gitlab-com/gl-infra/production#8165 (closed)). This worked just fine, with a total runtime of ~30s. We now wanted to execute the post-deployment migrations that will create the remaining partition indexes but it failed due to exceeding the statement timeout limit (gitlab-com/gl-infra/production#8212 (comment 1237508756)). This is configured to 15 seconds on the PostgreSQL side for GitLab.com.
We were lucky in gitlab-com/gl-infra/production#8165 (closed), as each partition index was created in less than 15s. That did not happen in the next batch, thus the failure.
Problem
Note: Although this problem is being reported in the context of #853 (closed), this is not only a problem for this particular issue but for all future post-deployment migrations whose execution may exceed the configured limit (regardless of the value).
Initially, I thought we could overcome this by toggling the statement timeout on the registry side around these post-deployment migrations like so. However, this won't work for production because with the connection pool on the registry side and PgBouncer's pooling after that, index creation statements and the preceding SET statement_timeout TO 0
would most likely be executed in different sessions, so the latter wouldn't have an effect on the former.
We cannot rely on transactions for these migrations either (as a way to ensure that all statements within would be executed in the same session) because creating an index CONCURRENTLY
cannot be done within transactions.
The other option, likely ideal, and similar to how it's done in Rails, would be to grab an exclusive connection from the pool and use that for the entirety of the migration execution (instead of letting the abstraction grab one from the pool for every statement execution). This can be easily done in Go (https://pkg.go.dev/database/sql#DB.Conn), however, the migration tool we're using (https://github.com/rubenv/sql-migrate) does not have support for that, so we're stuck.
Solution
The ideal solution is to contribute to the upstream migrations lib by adding support to run a migration's up and/or down statements in an exclusive connection instead of relying on the pool abstraction.
At the API level, this could be implemented by adding a new e.g. ExclusiveConn
parameter to the Migration
struct, similar to DisableTransaction*
, which we're already using. If this was set to true on the migration's definition, the lib would grab and use an exclusive connection to execute all its statements.