git 2.15+ doesn't play nicely with stale worktrees
In https://gitlab.com/gitlab-org/gitlab-ce/issues/44068, we found that users trying to fetch, push, or pull repositories would get errors that indicate a bad object:
remote: fatal: bad object HEAD
fatal: bad object HEAD
It turns out the problem is that many of these repositories have stale worktrees from squash-rebase attempts that were never cleaned up. You can see it in this strace as git attempts to walk all the worktrees, look up their references, and find the corresponding objects:
# /opt/gitlab/embedded/bin/git rev-list 470ec851e3fd2393da60c5b77d59ffa1701a5903 --not --all
<snip>
open("worktrees/squash-7463504/HEAD", O_RDONLY) = 3
read(3, "9c8af0299f1d7f8808af127e796627e5"..., 256) = 41
read(3, "", 215) = 0
close(3) = 0
lstat("./objects/9c/8af0299f1d7f8808af127e796627e57c2c23f6", 0x7ffe188a87b0) = -1 ENOENT (No such file or directory)
open("./objects/pack", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
getdents(3, /* 5 entries */, 32768) = 264
getdents(3, /* 0 entries */, 32768) = 0
close(3) = 0
open("./objects/9c/8af0299f1d7f8808af127e796627e57c2c23f6", O_RDONLY|O_CLOEXEC) = -1 ENOENT (No such file or directory)
open("./objects/pack", O_RDONLY|O_NONBLOCK|O_DIRECTORY|O_CLOEXEC) = 3
fstat(3, {st_mode=S_IFDIR|0755, st_size=4096, ...}) = 0
getdents(3, /* 5 entries */, 32768) = 264
getdents(3, /* 0 entries */, 32768) = 0
close(3) = 0
lstat("./objects/9c/8af0299f1d7f8808af127e796627e57c2c23f6", 0x7ffe188a87f0) = -1 ENOENT (No such file or directory)
write(2, "fatal: bad object HEAD\n", 23fatal: bad object HEAD
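The check that fails in the trace can be approximated by hand. The following is a minimal sketch, not GitLab's tooling: `list_stale_worktrees` is a hypothetical helper that reads each `worktrees/<name>/HEAD` in a bare repository and uses `git cat-file -e` to test whether the object it names still exists. It assumes detached HEADs (a bare object ID, as in the trace) and skips symbolic ones.

```shell
# Hypothetical helper (not part of git or GitLab): print "<worktree> <object-id>"
# for every worktree whose detached HEAD names an object missing from the repo.
# The body runs in a subshell so its variables do not leak into the caller.
list_stale_worktrees() (
    repo=$1
    for head in "$repo"/worktrees/*/HEAD; do
        # With no worktrees the glob stays unexpanded; skip it.
        [ -f "$head" ] || continue
        name=$(basename "$(dirname "$head")")
        value=$(cat "$head")
        case $value in
            "ref: "*)
                # Symbolic HEAD: out of scope for this sketch.
                continue ;;
        esac
        # cat-file -e exits non-zero when the object does not exist.
        if ! git --git-dir="$repo" cat-file -e "$value" 2>/dev/null; then
            printf '%s %s\n' "$name" "$value"
        fi
    done
)
```

Calling `list_stale_worktrees /path/to/repo.git` (the path is a placeholder) prints one line per worktree whose HEAD object is missing, and nothing when every worktree is healthy.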
The problem looks like it hit us in multiple places:
- Unicorn errors
- During a git push
Unicorn errors
Stale worktrees break at least two code paths: the /internal API lookup, and the Web UI when it attempts to create a temporary branch.
API /internal lookup
Sentry error: https://sentry.gitlap.com/gitlab/gitlabcom/issues/141134/
In the /internal lookup, the problem appears to happen because git rev-list tries to walk all working trees, including leftover stale ones. According to https://git-scm.com/docs/git-rev-list, the --single-worktree argument was added in git 2.15 (https://github.com/git/git/commit/32619f99f9):
--single-worktree
By default, all working trees will be examined by the following options when there are more than one (see git-worktree[1]): --all, --reflog and --indexed-objects. This option forces them to examine the current working tree only.
It looks like adding the --single-worktree argument makes things work properly:
/opt/gitlab/embedded/bin/git rev-list 470ec851e3fd2393da60c5b77d59ffa1701a5903 --single-worktree --not --all
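Since the flag only exists from git 2.15 onward, a caller that has to run against mixed git versions would need to add it conditionally. A minimal sketch of that idea, where both function names are hypothetical and the version parsing assumes the usual `git version X.Y.Z` output format:

```shell
# Hypothetical helper: succeeds when a "git --version" string reports 2.15+.
supports_single_worktree() (
    version=${1#"git version "}   # "2.16.2"
    major=${version%%.*}          # "2"
    rest=${version#*.}            # "16.2"
    minor=${rest%%.*}             # "16"
    [ "$major" -gt 2 ] || { [ "$major" -eq 2 ] && [ "$minor" -ge 15 ]; }
)

# Hypothetical wrapper: run the same rev-list query as above against the
# repository in $1 for revision $2, adding --single-worktree when available.
rev_list_new_objects() (
    repo=$1
    rev=$2
    if supports_single_worktree "$(git --version)"; then
        git --git-dir="$repo" rev-list "$rev" --single-worktree --not --all
    else
        # Older git never walks other worktrees for --all, so no flag is needed.
        git --git-dir="$repo" rev-list "$rev" --not --all
    fi
)
```

This only helps where we invoke rev-list ourselves; it does nothing for rev-list processes that git spawns internally.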
Local git fetch
Sentry error: https://sentry.gitlap.com/gitlab/gitlabcom/issues/142438/
A local git fetch breaks in the same way:
# /opt/gitlab/embedded/bin/git fetch --no-tags -f . master:test
fatal: bad object HEAD
error: . did not send all necessary objects
During a git push
It looks like git-receive-pack also runs rev-list, but we don't have any control over how git behaves with worktrees; the --single-worktree argument only applies to the rev-list command. You can see that in a push, this is what happens on the server side:
[pid 13799] execve("/bin/sh", ["/bin/sh", "-c", "git-receive-pack '/tmp/stanhu/test2.git/'", "git-receive-pack '/tmp/stanhu/test2.git/'"], [/* 19 vars */]) = 0
[pid 13800] execve("/opt/gitlab/embedded/libexec/git-core/git-receive-pack", ["git-receive-pack", "/tmp/stanhu/test2.git/"], [/* 19 vars */]) = 0
[pid 13802] execve("/opt/gitlab/embedded/libexec/git-core/git", ["/opt/gitlab/embedded/libexec/git-core/git", "pack-objects", "--all-progress-implied", "--revs", "--stdout", "--thin", "--delta-base-offset", "-q"], [/* 20 vars */]) = 0
[pid 13804] execve("/opt/gitlab/embedded/libexec/git-core/git", ["/opt/gitlab/embedded/libexec/git-core/git", "unpack-objects", "--pack_header=2,3", "-q", "--strict"], [/* 23 vars */]) = 0
[pid 13806] execve("/opt/gitlab/embedded/libexec/git-core/git", ["/opt/gitlab/embedded/libexec/git-core/git", "rev-list", "--objects", "--stdin", "--not", "--all", "--quiet"], [/* 23 vars */] <unfinished ...>
[pid 13806] <... execve resumed> ) = 0
[pid 13807] execve("/opt/gitlab/embedded/libexec/git-core/git", ["/opt/gitlab/embedded/libexec/git-core/git", "rev-list", "--objects", "--stdin", "--not", "--all", "--quiet"], [/* 23 vars */] <unfinished ...>
[pid 13807] <... execve resumed> ) = 0
[pid 13809] execve("hooks/pre-receive", ["hooks/pre-receive"], [/* 24 vars */] <unfinished ...>
[pid 13809] <... execve resumed> ) = 0
[pid 13812] execve("/opt/gitlab/embedded/libexec/git-core/git", ["/opt/gitlab/embedded/libexec/git-core/git", "gc", "--auto", "--quiet"], [/* 20 vars */]) = 0
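Because receive-pack spawns rev-list itself, the remaining lever on the push path is to remove the stale worktree metadata. A minimal sketch using git's own `git worktree prune`, under the assumption that the stale entries are ones whose checkout directory has already been deleted (entries whose checkout still exists are not touched by prune). The helper name and the dry-run-first flow are illustrative only and are not a description of what GitLab or Gitaly do:

```shell
# Hypothetical helper: preview, then prune, worktree metadata in "$1"/worktrees.
# Assumption: the worktrees' checkout directories are already gone.
prune_worktrees() (
    repo=$1
    # Report what would be removed without changing anything.
    git --git-dir="$repo" worktree prune --dry-run --verbose || exit 1
    # Remove the stale metadata for real.
    git --git-dir="$repo" worktree prune --verbose
)
```

Checking the dry-run output before the real prune is a cautious choice for production repositories; on a test copy the second command alone would do.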
Next Steps
| Step | Status |
|---|---|
| We'll have to revert to git 2.14 until we can figure out how to deal with stale worktrees | DONE |
| We should be more vigilant about cleaning up stale worktrees | https://gitlab.com/gitlab-org/gitlab-ce/issues/44115 => gitaly!622 (merged) |
| We need to investigate why omnibus-gitlab ships two copies of the git binary | omnibus-gitlab#3265 |
| We should investigate whether there are any arguments/environment variables we can pass to git to ignore stale worktrees | DONE: --single-worktree only applies to rev-list. Doesn't work for fetch or push. |
| We should ask @chriscool and other git maintainers whether we can make Git more tolerant of stale worktrees | No need; focus on cleaning worktrees, see https://gitlab.com/gitlab-org/gitlab-ce/issues/44100#note_62503761 |
| Gather how many repositories have stale worktrees | gitlab-com/infrastructure#3832 |