DB load balancer: Automatic retries may leak queries across transaction boundaries
In !32668 (merged), we found that an idle-in-transaction timeout inside an after_save
callback hook could cause an idle-in-transaction timeout.
PostgreSQL will sever the connection, and subsequent attempts to use the connection will trigger a ActiveRecord::StatementInvalid: PG::UnableToSend: no connection to the server
. However, https://gitlab.com/gitlab-org/gitlab/blob/828c3af6822db69bf50929ae3b74a6d76868b955/ee/lib/gitlab/database/load_balancing/load_balancer.rb#L154 catches these StatementInvalid
errors and retries them.
As @engwan showed in !32668 (comment 351497463), the danger here is that the retry could cause SQL queries to leak across transaction boundaries. For example, I think we demonstrated that this can happen:
- BEGIN TRANSACTION ...
- INSERT INTO ...
- after_save callbacks
- idle in transaction timeout happens, PostgreSQL severs connection and ends transaction.
- Some random SELECT or SQL call
➡ StatementInvalid
errors, EE load balancing retry runs again. - Rails thinks everything is okay, moves onto
after_commit
callbacks. -
after_commit
callbacks fail due to aborted INSERT - EE load balancer code retries previous step before failing
The PostgreSQL client library shows FATAL: terminating connection due to idle-in-transaction timeout
, but I'm not quite sure if this is error is passed to Rails clients. I see Rails has a QueryAborted
error for MySQL, but not for PostgreSQL.
@engwan was also concerned about other cases:
But I'm also concerned about the other cases. For example, Web to DB network issues. That would also be a StatementInvalid exception and lead to some queries being retried in a separate transaction right?
🤔