
Database backup locks and partition manager job causing full outage

The last two weekends, we've had a full outage in the timeframe our DB backup was running, apparently caused by exclusive locks in Postgres. It looks like some internal partitioning jobs run automatically at the same time, triggering the DB lock contention.

Context:

  • Siemens self-hosted production setup, running the latest 17.1. DB is PostgreSQL 14.11 on RDS.
  • Nightly DB backups via a cron job running gitlab-backup create STRATEGY=copy (see the cron sketch after this list). Load at backup time is low, and we've been running this setup for 5+ years.
  • Separate frontend & Sidekiq nodes. Most data is in S3 object storage.
  • We've not yet completed the registry DB migration; we have just started step three (see #423459 (comment 1980823334)). The same RDS host serves both gitlab & registry.
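For reference, the backup is kicked off by a plain cron entry roughly like the one below. This is a sketch only: the schedule, user and log path are illustrative rather than our literal crontab.

```shell
# /etc/cron.d/gitlab-backup -- sketch of the nightly backup job (schedule/paths are illustrative)
# CRON=1 suppresses progress output; STRATEGY=copy copies data aside before archiving it.
26 22 * * * root /opt/gitlab/bin/gitlab-backup create STRATEGY=copy CRON=1 >> /var/log/gitlab/backup-cron.log 2>&1
```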

Example timeline:

  • 2024-07-06 22:26:35 CEST: start of the gitlab-rake DB dump. Increase of AccessShareLock in the DB, usual behaviour.
  • 2024-07-06 22:47:35 CEST: sidekiq-03's Sidekiq RSS memory protection kicks in and the Sidekiq supervisor restarts processes (Gitlab::Memory::Watchdog::Handlers::SidekiqHandler, "A worker terminated, shutting down the cluster").
  • 2024-07-06 22:48:28 CEST: a sidekiq-03 process reports starting "Checking state of dynamic postgres partitions" for Gitlab::Database::Partitioning::PartitionManager on multiple tables.
  • 2024-07-06 22:48:58 CEST: sidekiq-03 reports completing the full restart: "Booted Rails 7.0.8.4 application in production environment".
  • 2024-07-06 22:48:58 - 23:21:02 CEST: sidekiq-03 repeatedly logs Gitlab::Database::Partitioning::PartitionManager ActiveRecord::LockWaitTimeout errors, "retrying after sleep", with current_iteration going from 1 to 40.
  • 2024-07-06 23:21:02 CEST: Gitlab::Database::Partitioning::PartitionManager logs "Executing the migration without lock timeout" at iteration 41 on sidekiq-03. Linear increase of RowExclusiveLock in the DB. We start running out of available Puma connections.
  • 2024-07-07 00:03:00 CEST: exponential increase of RowExclusiveLock (lock counts can be inspected with the query sketched after this timeline). Unclear what causes it at this point; it could simply be that resources are being exhausted and more jobs keep starting.
  • 2024-07-07 01:59:15 CEST: the LB starts answering some requests with 502.
  • 2024-07-07 02:18:00 CEST: all Puma connections are busy; most LB responses are 502.
  • 2024-07-07 04:29:35 CEST: end of the gitlab-rake DB dump. All DB locks are released and GitLab starts recovering.
  • 2024-07-07 04:31:00 CEST: all LB 502 errors are gone, back to nominal state.
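For anyone who wants to look at the same lock picture during the backup window, something like the following snapshots it directly on the database. This is a sketch: the host, database and user are placeholders, and it relies only on stock Postgres (pg_locks, pg_stat_activity, pg_blocking_pids()), nothing GitLab-specific.

```shell
# Sketch: snapshot lock counts per mode while the dump is running (connection details are placeholders).
psql -h gitlab-rds.example.internal -U gitlab -d gitlabhq_production -c "
  SELECT mode, granted, count(*)
  FROM pg_locks
  GROUP BY mode, granted
  ORDER BY count(*) DESC;"

# And list which sessions are currently blocked and by whom:
psql -h gitlab-rds.example.internal -U gitlab -d gitlabhq_production -c "
  SELECT pid, pg_blocking_pids(pid) AS blocked_by, wait_event_type, left(query, 80) AS query
  FROM pg_stat_activity
  WHERE cardinality(pg_blocking_pids(pid)) > 0;"
```

If the PartitionManager DDL really is what queues up behind the dump, the second query should show it waiting on the dump session's PID, with other sessions queued behind it.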


Comments:

  • We see the new Sidekiq memory protection kicking in in the middle of this, though we're not sure it's related to the actual problem; it may just be coincidence that it shows up at the same time. What we can say is that since 17.0 we've been seeing a significant number of these memory-induced restarts of the Sidekiq workers (see the log grep sketch after these comments).
  • Although this has happened on the last two weekends, we also had a similar DB locks issue last night during the DB backup timeframe, not on a weekend; it was much smaller in scale and only caused a handful of 502s.
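In case it helps with correlating the two symptoms, something like the following pulls the watchdog restarts and the PartitionManager lock-wait retries out of the structured Sidekiq log. This is a sketch: the path is the Omnibus default and the jq field names assume the standard JSON log format.

```shell
# Sketch: correlate memory-watchdog restarts with PartitionManager lock-wait retries
# in the JSON Sidekiq log (path is the Omnibus default; adjust for your install).
jq -r 'select(((.message // "") + (.class // ""))
              | test("Watchdog|PartitionManager|LockWaitTimeout"))
       | [.time // "-", .class // "-", .message // "-"] | @tsv' \
  /var/log/gitlab/sidekiq/current
```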

@morefice As we know you've been working on the partitioning feature, perhaps you have an idea, or you can redirect this issue to whoever could help 🙇

/cc @max-wittig @ercan.ucan @bufferoverflow @fh1ch
