DB load balancer: Automatic retries may leak queries across transaction boundaries

In !32668 (merged), we found that an idle-in-transaction timeout inside an after_save callback hook could cause an idle-in-transaction timeout.

PostgreSQL will sever the connection, and subsequent attempts to use the connection will trigger a ActiveRecord::StatementInvalid: PG::UnableToSend: no connection to the server. However, https://gitlab.com/gitlab-org/gitlab/blob/828c3af6822db69bf50929ae3b74a6d76868b955/ee/lib/gitlab/database/load_balancing/load_balancer.rb#L154 catches these StatementInvalid errors and retries them.

As @engwan showed in !32668 (comment 351497463), the danger here is that the retry could cause SQL queries to leak across transaction boundaries. For example, I think we demonstrated that this can happen:

  1. BEGIN TRANSACTION ...
  2. INSERT INTO ...
  3. after_save callbacks
  4. idle in transaction timeout happens, PostgreSQL severs connection and ends transaction.
  5. Some random SELECT or SQL call StatementInvalid errors, EE load balancing retry runs again.
  6. Rails thinks everything is okay, moves onto after_commit callbacks.
  7. after_commit callbacks fail due to aborted INSERT
  8. EE load balancer code retries previous step before failing

The PostgreSQL client library shows FATAL: terminating connection due to idle-in-transaction timeout, but I'm not quite sure if this is error is passed to Rails clients. I see Rails has a QueryAborted error for MySQL, but not for PostgreSQL.

@engwan was also concerned about other cases:

But I'm also concerned about the other cases. For example, Web to DB network issues. That would also be a StatementInvalid exception and lead to some queries being retried in a separate transaction right? 🤔

Edited by Heinrich Lee Yu