refs changed while git fsck is running are likely to be reported as missing, resulting in false positives for repository checks
- Summary
- Workaround
- Steps to reproduce
- Example Project
- What is the current bug behavior?
- What is the expected correct behavior?
- Relevant logs and/or screenshots
- Output of checks
- Possible fixes
DISCLAIMER: This page contains information related to upcoming products, features, and functionality. It is important to note that the information presented is for informational purposes only. Please do not rely on this information for purchasing or planning purposes. The development, release, and timing of any products, features, or functionality may be subject to change or delay and remain at the sole discretion of GitLab Inc.
Summary
Customers report false positives from repository checks, with missing commits being incorrectly reported. Investigation shows that these are, or were, HEAD commits on branches, so the reference to that SHA came from the refs in the repository.
It's most commonly seen on large repositories that take a long time to check.
It appears that `git fsck` doesn't keep track of additions to the list of valid commits in the repository, and then when it checks all the refs against the list, refs that have changed during the `git fsck` aren't valid.
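The suspected race can be illustrated with a toy model. The sketch below is not how Git implements `git fsck`; it assumes, purely for illustration, that the check snapshots the set of known objects first and resolves refs afterwards. The helper `toy_fsck` and the hash layout are hypothetical.

```ruby
require "set"

# Toy model only: a repository is a set of object IDs plus a ref -> object ID map.
repo = {
  objects: Set["a1", "b2"],
  refs: { "refs/heads/main" => "b2" }
}

# Hypothetical two-phase check: snapshot the objects first, then read the refs.
# The optional block stands in for a push that lands between the two phases.
def toy_fsck(repo)
  known = repo[:objects].dup       # phase 1: snapshot of the objects seen so far
  yield repo if block_given?       # a concurrent push happens here
  # phase 2: any ref pointing outside the snapshot is reported as missing
  repo[:refs].values.reject { |sha| known.include?(sha) }
end

quiet = toy_fsck(repo)
racy = toy_fsck(repo) do |r|
  r[:objects] << "c3"                  # the new commit really exists...
  r[:refs]["refs/heads/main"] = "c3"   # ...and the branch now points at it
end
rerun = toy_fsck(repo)

puts "quiet run: #{quiet.inspect}"
puts "racy run:  #{racy.inspect}"
puts "re-run:    #{rerun.inspect}"
```

In this model only the racy run reports a missing object, and a re-run with no concurrent push is clean again, which matches the pattern customers see.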
Planned resolution
We believe this is an inherent race in Git. The planned solution is to serialize transactions, which is Gitaly functionality we are getting ready for general availability. See &13306
Workaround
Overview
Most changes to a branch or tag in a Git repository while `git fsck` is running will cause an error.
At a high level, the goal of the workaround is to reduce false positives. In detail, it aims to:
- Reduce the chance of `git fsck` running while changes are being made, by allowing administrators to schedule the check.
- Try to get a 'clean' run by attempting the check up to three times, as coded. Retries are only performed if a failure occurs, and once a 'clean' check has been obtained, the script is done with that project and writes the last-checked date for the project.
  - If there is genuinely a fault with the repository, retrying the check will not make any difference, and GitLab will report that the check failed.
  - If a bot or a developer pushes to the repository during all three retries, there will still be a false positive. This problem isn't expected to be completely fixed until the Gitaly Transactions feature is generally available.
- Prevent Sidekiq from checking the repository, by running the check more frequently (every 14 days) so that the coded threshold for Sidekiq (one month) should never be reached.
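The retry behaviour described above could look roughly like the following. This is a sketch, not the workaround script itself: `check_with_retries` and the `check` callable are hypothetical stand-ins for the real repository check, and only the idea of a `tries` array with one element per attempt comes from this issue.

```ruby
# Hypothetical sketch: run a check once per element of `tries`,
# stopping at the first clean run.
# `check` is any callable that returns true for a clean run.
def check_with_retries(tries, check)
  tries.each do |attempt|
    return { clean: true, attempts: attempt } if check.call
  end
  { clean: false, attempts: tries.length }
end

tries = [1, 2, 3] # one element per attempt; add elements to allow more retries

# Simulated check that fails once (a push landed mid-check) and then succeeds.
results = [false, true, true]
flaky = check_with_retries(tries, -> { results.shift })
puts flaky.inspect

# A genuinely broken repository fails every attempt, so retrying does not hide it.
broken = check_with_retries(tries, -> { false })
puts broken.inspect
```

Only a clean outcome would lead to the last-checked date being written; a failure on every attempt is reported as a failed check.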
Potential issues
The datestamp for the last check is only written by the workaround script if the check succeeds.
This allows for further attempts to be made to get a successful `git fsck`. The workaround code and Sidekiq both check this datestamp, so either would be able to perform additional checks, but Sidekiq will record a datestamp for the failed check.
This prevents Sidekiq re-checking for another month, but it'll also prevent the workaround code checking for another two weeks.
If you still get false positives even after three retries, try to find a quieter time to check the repository, and/or increase the number of retries by adding more elements to the `tries` array.
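The interaction between the two schedulers can be modelled with a shared timestamp. The following is an illustrative sketch only: the helper name `due?`, the 13-day and 30-day values (approximating "two weeks" and "one month"), and the sample dates are assumptions, not values read from GitLab's code.

```ruby
require "date"

script_threshold_days  = 13 # assumption: approximates the script's two-week interval
sidekiq_threshold_days = 30 # assumption: approximates Sidekiq's one-month threshold

# A project is due when it has never been checked or the last check is old enough.
def due?(last_checked, today, threshold_days)
  last_checked.nil? || (today - last_checked) >= threshold_days
end

# The script's check failed, so it wrote no datestamp: the old date still applies
# and the script will try again on its next scheduled run.
last_checked = Date.new(2024, 1, 1)
script_retry = due?(last_checked, Date.new(2024, 1, 15), script_threshold_days)

# Sidekiq then ran a check that failed, and recorded the date anyway.
# A week later neither the script nor Sidekiq considers the project due.
last_checked = Date.new(2024, 2, 1)
today = Date.new(2024, 2, 8)
script_after_sidekiq  = due?(last_checked, today, script_threshold_days)
sidekiq_after_sidekiq = due?(last_checked, today, sidekiq_threshold_days)

puts "script retries after its own failure: #{script_retry}"
puts "script due after Sidekiq failure:     #{script_after_sidekiq}"
puts "Sidekiq due after Sidekiq failure:    #{sidekiq_after_sidekiq}"
```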
Workaround deployment and implementation
These steps are for a packaged GitLab (Omnibus) installation. The script should work on other deployments, with some modifications such as:
- For a GitLab deployment in Kubernetes or Docker, you'll need to identify a way to schedule it. The paths may be different, and the script will need to be deployed each time the pod/container is spun up.
- For self-compiled installations, the paths will be different.
- Select a GitLab node configured to run Rails (Sidekiq or Puma). If your environment has a node selected to run database migrations during upgrades, this could be a good candidate.
- Determine the account to use: it'll be the `gitlab_user` from `/opt/gitlab/etc/gitlab-rails-rc`.
  - Usually the account is `git`; the following steps assume this.
- Copy the script below to the server in a location readable by that account (source).
  check-repository-workaround.rb
  - Not root's home directory (read more about rails runner).
  - The following steps assume: `/var/opt/gitlab/check-repository-workaround.rb`
  - It does not need to be executable.
- If you need to check multiple projects at different times, for example to accommodate different time zones when developers aren't working, you'll need more than one copy of the script. The project list is hard coded in the script.
- Determine the project IDs of affected projects.
  - The GitLab Web UI can be used to determine the project ID.
  - Affected projects will have been reported regularly as having failed repository checks, but when you re-run the check, the check passes.
  - Affected projects are likely to be large repositories. The `git fsck` will take longer to run, and so there's more chance of a `git push` to the repository while the check is running.
- Modify the `projects` array at the top of the script to list the projects that need to be checked.
- Set up a cron job:
  echo '45 22 * * 6 git /usr/bin/gitlab-rails runner /var/opt/gitlab/check-repository-workaround.rb >> /var/log/gitlab/gitlab-rails/cron_dot_d_check-repository-workaround.log 2>&1' > /etc/cron.d/check-repository-workaround
  - This uses the assumed account (`git`) and path for the script (`/var/opt/gitlab`).
  - See `man 5 crontab` for details on the time and date fields: this example runs at 22:45 on a Saturday evening.
  - The log will be rotated by the `logrotate` service deployed as part of packaged GitLab (see `gitlab-ctl status`).
  - Schedule the script to run at least weekly. After a successful check, the script only checks each project every two weeks (13-14 days). However, if the project check keeps failing and the script runs every week, for example, it gets more opportunities to obtain a successful check. If it doesn't get a successful check after 2-3 weeks, the Sidekiq job will trigger a check as well.
- It can also be run manually, in addition to or instead of a cron job. Projects with a recent check (either by the script, or by Sidekiq) will not get checked: the script logs this.
  sudo /usr/bin/gitlab-rails runner /var/opt/gitlab/check-repository-workaround.rb
Steps to reproduce
- Start with a large repository so `git fsck` takes long enough. Reproduced on a 4.5 GB repo (comprising the `master` branch of the nixpkgs repo).
  - It may be necessary to inject this repo into your test environment by transferring the `.git` directory directly into the `git-data/repositories` directory tree. More direct methods like pushing 4.5 GB, or pulling it in via project import, typically fail.
- Kick off a repository check
- Push commits and tags to the repo
- `git fsck` fails with missing commit errors.
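The push step can be scripted. The sketch below is one possible way to generate that traffic, assuming a local clone with a remote named `origin`; the helper names, the tag naming scheme, and the default interval are illustrative choices, not taken from the test harness project.

```ruby
# Hypothetical helper: the git commands for one round of
# "new empty commit + new tag + push both".
def push_round_commands(n)
  tag = "fsck-race-#{n}"
  [
    ["git", "commit", "--allow-empty", "-m", "fsck race test #{n}"],
    ["git", "tag", tag],
    ["git", "push", "origin", "HEAD", "refs/tags/#{tag}"]
  ]
end

# Runs the rounds inside an existing clone, pausing between rounds.
# Stops at the first command that fails.
def push_while_checking(clone_dir, rounds:, interval: 5)
  1.upto(rounds) do |n|
    push_round_commands(n).each do |cmd|
      system(*cmd, chdir: clone_dir) or raise "command failed: #{cmd.join(' ')}"
    end
    sleep interval
  end
end

# Dry run: print the commands for the first round without touching any repository.
push_round_commands(1).each { |cmd| puts cmd.join(" ") }
```

Calling `push_while_checking("/path/to/clone", rounds: 100)` after kicking off the repository check would keep changing refs while the check runs; the path and round count here are placeholders.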
Example Project
What is the current bug behavior?
False positives about missing commits in projects.
What is the expected correct behavior?
Repository check accounts for changes to the repository that occur while the check is happening.
Relevant logs and/or screenshots
Could not fsck repository: missing tag 626f030a8754caacac6e8b54e8a39244b8a9c345
missing tag ebfc03685d9346b4c2d41ca8fe32cef026785aca
missing tag 4c3c05ba88caede6e5037a542b77d1ce948f0f82
missing tag 61c10504123f1020fa3d924c84d71d59f84849b1
missing tag 47b9c0ce0f2d69f015c5e39e7dcf1d65e9b6a603
missing tag fbcbc1b06787b65bcbebe35fa495186562b8ffcb
missing commit 2840c7e0097bcfe0caeb3ee44ef6d62c76332255
missing tag 2b2ac8d60ddd148719703924662c66d56bad8eb7
missing tag 24ecc8c98fdaf448a5fb2a509c1dab05d42d1c78
- Reproduced by pushing to a branch and pushing a new tag every five seconds while a 27-minute repository check ran.
- 152 of the 173 tags added were reported as missing.
- The 152nd change to the branch was also reported as missing.
- Public test harness project
GitLab team members can read more in a confidential issue
Output of checks
Reported by a customer on both 14.x and 16.11.
Reproduced on 17.0 and 17.5.1