Database connection might go away between repo_healthy? and wiki_repo_healthy? in RepositoryCheck::BatchWorker

Summary

The customer's logs contained four examples of RepositoryCheck::BatchWorker failing, each for a different shard.

  "error_message": "PG::UnableToSend: server closed the connection unexpectedly\n\tThis probably means the server terminated abnormally\n\tbefore or while processing the request.\n",
  "error_class": "ActiveRecord::StatementInvalid",
  "error_backtrace": [
    "app/models/project.rb:432:in `wiki_enabled?'",
    "app/workers/repository_check/single_repository_worker.rb:62:in `has_wiki_changes?'",
    "app/workers/repository_check/single_repository_worker.rb:39:in `wiki_repo_healthy?'",
    "app/workers/repository_check/single_repository_worker.rb:29:in `project_healthy?'",
    "app/workers/repository_check/single_repository_worker.rb:14:in `perform'",
    "app/workers/repository_check/batch_worker.rb:53:in `block in perform_repository_checks'",
    "app/workers/repository_check/batch_worker.rb:48:in `each'",
    "app/workers/repository_check/batch_worker.rb:48:in `perform_repository_checks'",
    "app/workers/repository_check/batch_worker.rb:28:in `block in perform'",
    "app/services/concerns/exclusive_lease_guard.rb:29:in `try_obtain_lease'",
    "app/workers/repository_check/batch_worker.rb:27:in `perform'",
    "lib/gitlab/sidekiq_middleware/duplicate_jobs/strategies/until_executing.rb:20:in `perform'",
    "lib/gitlab/sidekiq_middleware/duplicate_jobs/duplicate_job.rb:57:in `perform'",

The return false unless project.wiki_enabled? check in has_wiki_changes? requires a database query.

It gets there from:

    def wiki_repo_healthy?(project)
      return true unless has_wiki_changes?(project)

which in turn runs immediately after the check of the main project repo:

    def project_healthy?(project)
      repo_healthy?(project) && wiki_repo_healthy?(project)
    end

Steps to reproduce

Example Project

What is the current bug behavior?

Hypothesis: repo_healthy?(project) takes a long time (e.g. for large repos), and by the time project.wiki_enabled? runs, the database connection has gone away.

I don't have the logs to be certain why. It could be an 'idle in transaction' issue, or the session could be idling on the network and getting dropped by either PgBouncer or the load balancer.

What is the expected correct behavior?

The code should account for the fact that git fsck may take a while, and should not assume the database connection is still usable afterwards.
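One way to picture this is a small retry wrapper around the database call that follows the slow check. The following is a minimal sketch only, not GitLab's code: with_connection_retry, StaleConnectionError and the reconnect hook are hypothetical names invented for illustration. In the real application the rescued class would be ActiveRecord::StatementInvalid (as in the backtrace above), and the reconnect step might be something like ActiveRecord::Base.connection.verify!, though whether that is appropriate alongside GitLab's database load balancing has not been verified here.

```ruby
# Hypothetical stand-in for the error seen in the backtrace
# (ActiveRecord::StatementInvalid in the real application).
class StaleConnectionError < StandardError; end

# Runs the block. If it fails with a stale-connection error, calls the
# supplied reconnect hook and retries, up to `attempts` tries in total.
# The final failure is re-raised unchanged.
def with_connection_retry(reconnect:, attempts: 2, error_class: StaleConnectionError)
  tries = 0
  begin
    tries += 1
    yield
  rescue error_class
    raise if tries >= attempts

    reconnect.call
    retry
  end
end

# Usage: the first call fails as if the connection had been dropped
# during a long git fsck; the second call succeeds after the reconnect hook.
calls = 0
wiki_enabled = with_connection_retry(reconnect: -> { puts "reconnecting" }) do
  calls += 1
  raise StaleConnectionError, "server closed the connection unexpectedly" if calls == 1

  true
end
```

A retry like this only helps for idempotent reads such as wiki_enabled?; it would need more care around writes or open transactions.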

Relevant logs and/or screenshots

Output of checks

Results of GitLab environment info

Expand for output related to GitLab environment info


(For installations with omnibus-gitlab package run and paste the output of:
`sudo gitlab-rake gitlab:env:info`)

(For installations from source run and paste the output of:
`sudo -u git -H bundle exec rake gitlab:env:info RAILS_ENV=production`)

Results of GitLab application Check

The customer is on 14.2.7, but when I checked, the workers' code here and here hadn't changed since 14.2, so they would likely see this on the current release unless the underlying git fsck has become a lot faster.

Expand for output related to the GitLab application check

(For installations with omnibus-gitlab package run and paste the output of:
`sudo gitlab-rake gitlab:check SANITIZE=true`)

(For installations from source run and paste the output of:
`sudo -u git -H bundle exec rake gitlab:check RAILS_ENV=production SANITIZE=true`)

(we will only investigate if the tests are passing)

Possible fixes
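One option, offered as an untested sketch: read the database-backed state before starting the slow repository check, so that no query is issued immediately after a long fsck. The code below is a simplified, self-contained illustration rather than the real worker. The repo_check and wiki_check callables are hypothetical stand-ins for repo_healthy? and the wiki fsck, and the real has_wiki_changes? does more than test wiki_enabled?.

```ruby
# Simplified sketch of project_healthy? with the database read moved
# ahead of the potentially long-running repository check.
# `repo_check` and `wiki_check` are hypothetical callables standing in
# for the real fsck-based checks.
def project_healthy?(project, repo_check:, wiki_check:)
  # Database query happens first, while the connection is still fresh.
  check_wiki = project.wiki_enabled?

  # Potentially slow: git fsck on the main repository.
  return false unless repo_check.call(project)

  # Nothing further to verify when the wiki is disabled.
  return true unless check_wiki

  wiki_check.call(project)
end

# Usage with a stub project (no database involved).
StubProject = Struct.new(:wiki_enabled) do
  def wiki_enabled?
    wiki_enabled
  end
end

healthy = project_healthy?(
  StubProject.new(true),
  repo_check: ->(_project) { true },
  wiki_check: ->(_project) { true }
)
```

Reordering alone would not protect any database access that happens after the checks finish, so it may need to be combined with re-verifying or retrying the connection.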

Edited Jul 30, 2022 by Ben Prescott (ex-GitLab)