git 2.15+ doesn't play nicely with stale worktrees

In https://gitlab.com/gitlab-org/gitlab-ce/issues/44068, we found that users trying to fetch, push, or pull repositories would get errors that indicate a bad object:

remote: fatal: bad object HEAD
fatal: bad object HEAD

It turns out the problem is that many of these repositories have stale worktrees from squash-rebase attempts that were never cleaned up. You can see it in this strace: git walks all the worktrees, looks up their references, and then fails to find the object that a stale worktree's HEAD points to:

# /opt/gitlab/embedded/bin/git rev-list 470ec851e3fd2393da60c5b77d59ffa1701a5903 --not --all  
<snip>
open("worktrees/squash-7463504/HEAD", O_RDONLY) = 3
read(3, "9c8af0299f1d7f8808af127e796627e5"..., 256) = 41
read(3, "", 215)                        = 0
close(3)                                = 0
lstat("./objects/9c/8af0299f1d7f8808af127e796627e57c2c23f6", 0x7ffe188a87b0) = -1 ENOENT (No such file or directory)
open("./objects/pack", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
getdents(3, /* 5 entries */, 32768)     = 264
getdents(3, /* 0 entries */, 32768)     = 0
close(3)                                = 0
open("./objects/9c/8af0299f1d7f8808af127e796627e57c2c23f6", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("./objects/pack", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
getdents(3, /* 5 entries */, 32768)     = 264
getdents(3, /* 0 entries */, 32768)     = 0
close(3)                                = 0
lstat("./objects/9c/8af0299f1d7f8808af127e796627e57c2c23f6", 0x7ffe188a87f0) = -1 ENOENT (No such file or directory)
write(2, "fatal: bad object HEAD\n", 23fatal: bad object HEAD
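A rough way to spot the same condition without strace is sketched below. It reads each worktrees/*/HEAD file and asks git whether the object it names exists. The helper name list_stale_worktrees is hypothetical, and the sketch assumes it is given the path to a bare repository and that HEAD holds a raw object ID, as in the strace above. If the path is not a valid repository, every entry is reported, so treat the output as a hint rather than a verdict.

```shell
#!/bin/sh
# Sketch only: print "<worktree-name> <object-id>" for every worktree whose
# HEAD names an object that git cannot find.
# list_stale_worktrees is a hypothetical helper, not part of git or GitLab.
list_stale_worktrees() {
    repo=$1
    for head in "$repo"/worktrees/*/HEAD; do
        [ -f "$head" ] || continue      # the glob matched nothing
        read -r value < "$head" || true # tolerate a missing trailing newline
        case $value in
            "ref: "*) continue ;;       # symbolic HEAD, not the case in the strace
        esac
        # cat-file -e exits non-zero when the object is not in the object database
        if ! git --git-dir="$repo" cat-file -e "$value" 2>/dev/null; then
            name=${head%/HEAD}
            echo "${name##*/} $value"
        fi
    done
}
```

Calling list_stale_worktrees with the repository path prints one line per suspect worktree and nothing when all HEADs resolve.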

The problem looks like it hit us in multiple places:

  1. Unicorn errors
  2. During a git push

Unicorn errors

Stale worktrees break at least two code paths: the /internal lookup, and the Web UI when it attempts to create a temporary branch.

API /internal lookup

Sentry error: https://sentry.gitlap.com/gitlab/gitlabcom/issues/141134/

In the /internal lookup, it looks like the problem happens because git rev-list tries to walk all working trees, including leftover stale ones. According to https://git-scm.com/docs/git-rev-list, the following argument was added in git 2.15 (https://github.com/git/git/commit/32619f99f9):

--single-worktree
By default, all working trees will be examined by the following options when there are more than one (see git-worktree[1]): --all, --reflog and --indexed-objects. This option forces them to examine the current working tree only.

It looks like adding the --single-worktree argument makes things work properly:

/opt/gitlab/embedded/bin/git rev-list 470ec851e3fd2393da60c5b77d59ffa1701a5903 --single-worktree --not --all
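Since the argument only exists from git 2.15 on, a caller that may also run against git 2.14 would need to add it conditionally. The sketch below shows one way to do that. Both helper names are hypothetical, the version check assumes the usual "git version X.Y.Z" output of git --version, and --single-worktree is kept in the same position as in the command above.

```shell
#!/bin/sh
# Sketch only: succeed when a "git version X.Y.Z" string is 2.15 or newer.
# supports_single_worktree and rev_list_new_commits are hypothetical helpers.
supports_single_worktree() {
    version=${1#git version }
    major=${version%%.*}
    rest=${version#*.}
    minor=${rest%%.*}
    [ "$major" -gt 2 ] || { [ "$major" -eq 2 ] && [ "$minor" -ge 15 ]; }
}

# Run the rev-list from the /internal lookup, adding --single-worktree
# only when the installed git reports a version that has it.
rev_list_new_commits() {
    repo=$1
    newrev=$2
    if supports_single_worktree "$(git --version)"; then
        git --git-dir="$repo" rev-list "$newrev" --single-worktree --not --all
    else
        git --git-dir="$repo" rev-list "$newrev" --not --all
    fi
}
```

This only helps for rev-list invocations we issue ourselves.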

Local git fetch

Sentry error: https://sentry.gitlap.com/gitlab/gitlabcom/issues/142438/

A local git fetch breaks in the same way:

# /opt/gitlab/embedded/bin/git fetch --no-tags -f . master:test
fatal: bad object HEAD
error: . did not send all necessary objects

During a git push

It looks like git-receive-pack also runs rev-list, but here we don't have any control over how git behaves with worktrees: --single-worktree is an argument of the rev-list command only, and receive-pack invokes rev-list itself. In a push, this is what happens on the server side:

[pid 13799] execve("/bin/sh", ["/bin/sh", "-c", "git-receive-pack '/tmp/stanhu/test2.git/'", "git-receive-pack '/tmp/stanhu/test2.git/'"], [/* 19 vars */]) = 0
[pid 13800] execve("/opt/gitlab/embedded/libexec/git-core/git-receive-pack", ["git-receive-pack", "/tmp/stanhu/test2.git/"], [/* 19 vars */]) = 0
[pid 13802] execve("/opt/gitlab/embedded/libexec/git-core/git", ["/opt/gitlab/embedded/libexec/git-core/git", "pack-objects", "--all-progress-implied", "--revs", "--stdout", "--thin", "--delta-base-offset", "-q"], [/* 20 vars */]) = 0
[pid 13804] execve("/opt/gitlab/embedded/libexec/git-core/git", ["/opt/gitlab/embedded/libexec/git-core/git", "unpack-objects", "--pack_header=2,3", "-q", "--strict"], [/* 23 vars */]) = 0
[pid 13806] execve("/opt/gitlab/embedded/libexec/git-core/git", ["/opt/gitlab/embedded/libexec/git-core/git", "rev-list", "--objects", "--stdin", "--not", "--all", "--quiet"], [/* 23 vars */] <unfinished ...>
[pid 13806] <... execve resumed> )      = 0
[pid 13807] execve("/opt/gitlab/embedded/libexec/git-core/git", ["/opt/gitlab/embedded/libexec/git-core/git", "rev-list", "--objects", "--stdin", "--not", "--all", "--quiet"], [/* 23 vars */] <unfinished ...>
[pid 13807] <... execve resumed> )      = 0
[pid 13809] execve("hooks/pre-receive", ["hooks/pre-receive"], [/* 24 vars */] <unfinished ...>
[pid 13809] <... execve resumed> )      = 0
[pid 13812] execve("/opt/gitlab/embedded/libexec/git-core/git", ["/opt/gitlab/embedded/libexec/git-core/git", "gc", "--auto", "--quiet"], [/* 20 vars */]) = 0

Next Steps

| Step | Status |
| --- | --- |
| We'll have to revert to git 2.14 until we can figure out how to deal with stale worktrees | DONE |
| We should be more vigilant about cleaning up stale worktrees | https://gitlab.com/gitlab-org/gitlab-ce/issues/44115 => gitaly!622 (merged) |
| We need to investigate why omnibus-gitlab ships two copies of the git binary | omnibus-gitlab#3265 |
| We should investigate whether there are any arguments/environment variables we can pass to git to ignore stale worktrees | DONE: --single-worktree only applies to rev-list. Doesn't work for fetch or push. |
| We should talk to @chriscool and other git maintainers if we can make Git more tolerant of stale worktrees | No need; focus on cleaning worktrees, see https://gitlab.com/gitlab-org/gitlab-ce/issues/44100#note_62503761 |
| Gather how many repositories have stale worktrees | gitlab-com/infrastructure#3832 |
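
For the cleanup and counting steps, a dry-run listing of old worktree directories might look like the sketch below. It is not what gitaly!622 implements. The helper name list_old_worktrees and the age threshold are assumptions, and directory modification time is only a rough proxy for "stale". Nothing is deleted.

```shell
#!/bin/sh
# Sketch only: print worktree admin directories that have not been modified
# for more than max_age_minutes. Dry run: nothing is removed.
# list_old_worktrees and the threshold are assumptions for illustration.
list_old_worktrees() {
    repo=$1
    max_age_minutes=$2
    [ -d "$repo/worktrees" ] || return 0
    # -mindepth, -maxdepth and -mmin are GNU/BSD find extensions, not POSIX
    find "$repo/worktrees" -mindepth 1 -maxdepth 1 -type d -mmin "+$max_age_minutes"
}
```

For example, list_old_worktrees /path/to/repo.git 360 would list directories untouched for more than six hours. Both the path and the six-hour value are placeholders.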
Edited by Stan Hu