Retry DDL migrations with low lock_timeout
Extracted from the discussion in #34325 (closed).

To increase the chance that DDL migrations succeed, we want to implement optional behavior that lets us control how a migration is retried.
Let's look at an example:
```ruby
def up
  drop_table :foo
end
```
Let's assume `foo` -> `bar` through a foreign key constraint. This would require an exclusive lock on `bar`, too, because of the FK. Let's also assume `bar` is a table with a lot of traffic.
## Today's behavior
In the worst case, the migration fails after hitting the statement timeout (15s), because it may be unable to acquire the lock on `bar` within that time frame.

While the `DROP TABLE` is waiting, any further lock requests on `bar` cannot be granted (lock queue). This means that all reads and writes to this table are blocked for up to 15s, which in turn stalls every incoming request that interacts with `bar`.
## Proposed behavior
```ruby
def up
  with_lock_retries(timeout: 500.milliseconds) do
    drop_table :foo
  end
end
```
Here, we set the `lock_timeout` to a low value (500ms) and attempt the migration step (`DROP TABLE foo`). If we don't manage to acquire the lock (i.e., we hit a lock timeout exception), we retry.
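A minimal sketch of how such a retry loop could work. The helper name matches the example above, but the error class, the `max_attempts` parameter, and the simulated lock failure are hypothetical stand-ins for what would really be a Postgres `lock_timeout` error inside a transaction:

```ruby
# Hypothetical stand-in for the lock timeout error Postgres would raise.
class LockTimeoutError < StandardError; end

# Sketch: retry the yielded DDL step until it succeeds or we give up.
def with_lock_retries(timeout:, max_attempts: 10)
  attempts = 0
  begin
    attempts += 1
    # In a real migration this would run inside a transaction that first
    # executes `SET LOCAL lock_timeout = '#{timeout}ms'`, then the DDL.
    yield
  rescue LockTimeoutError
    raise if attempts >= max_attempts
    sleep(0) # placeholder for a backoff sleep between attempts
    retry
  end
end

# Usage: simulate a lock that only becomes free on the third attempt.
calls = 0
with_lock_retries(timeout: 500) do
  calls += 1
  raise LockTimeoutError if calls < 3
end
calls # => 3
```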
The retry can be implemented with:

- linear/exponential backoff
- an overall timeout for the migration (10 minutes or similar)
- increasing `lock_timeout` up to an upper threshold
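One way to combine these three ideas is to precompute a retry schedule: the `lock_timeout` grows toward an upper threshold while the sleep between attempts backs off exponentially, and the whole schedule is bounded by an overall budget. All constants below are illustrative, not proposed defaults:

```ruby
# Illustrative bounds (not proposed defaults).
MAX_LOCK_TIMEOUT_MS = 10_000            # upper threshold for lock_timeout
OVERALL_BUDGET_MS   = 10 * 60 * 1000    # ~10 minutes for the whole migration

# Build the list of attempts: each entry doubles both the lock_timeout
# (capped at the threshold) and the backoff sleep, until the next attempt
# would exceed the overall budget.
def retry_schedule(initial_lock_timeout_ms: 500, base_sleep_ms: 100)
  schedule = []
  elapsed  = 0
  attempt  = 0
  loop do
    lock_timeout = [initial_lock_timeout_ms * 2**attempt, MAX_LOCK_TIMEOUT_MS].min
    sleep_ms     = base_sleep_ms * 2**attempt
    break if elapsed + lock_timeout > OVERALL_BUDGET_MS
    schedule << { lock_timeout_ms: lock_timeout, sleep_ms: sleep_ms }
    elapsed += lock_timeout + sleep_ms
    attempt += 1
  end
  schedule
end

schedule = retry_schedule
schedule.first # => { lock_timeout_ms: 500, sleep_ms: 100 }
```

The cap on `lock_timeout` keeps the worst-case blocking of other traffic bounded even on late attempts, while the growing sleep gives the lock queue on `bar` time to drain between tries.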
We can start by marking this new behavior as opt-in where needed (as in the example above), or make it the default for all migrations.
This way, we:

- have a guaranteed maximum blocking time for other reads/writes, substantially lower than the statement timeout
- increase the likelihood of a successful migration even under high traffic.
What we don't get from this is the ability to decouple the success of these migrations from the success of a deploy. This may be a follow-up issue.