Gitaly hooks symlinks may be deleted after 10 days of uptime when 'noatime' is set on /tmp, preventing all tasks executed by hooks from running

Summary

Update: See Will's comment for what is actually happening. With noatime set on /tmp, or after long periods of inactivity, the Git hook symlinks get deleted. Amazon Linux sets noatime on /tmp by default. The current workarounds are to remove noatime, modify the service which deletes /tmp files, or touch the files periodically. On most distros, overriding the tmpfiles.d service behavior is a simple way to avoid this issue. See the original description below.
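As a rough illustration of the "touch the files periodically" workaround, here is a minimal shell sketch. The function names `has_noatime` and `refresh_symlinks` are hypothetical, and the directory holding the hook symlinks is not stated in this description, so it is passed in as an argument rather than hard-coded. It also assumes GNU coreutils, since `touch -h` is what updates the link itself instead of its target.

```shell
#!/usr/bin/env bash
# Sketch only: function names and the directory argument are placeholders,
# not paths or tools confirmed by this issue.

# Succeeds if the given mount point is listed with the noatime option.
# $1: mount point (for example /tmp)
# $2: mounts table to read (defaults to /proc/mounts)
has_noatime() {
  local mount_point="$1" mounts_file="${2:-/proc/mounts}"
  awk -v mp="$mount_point" '
    $2 == mp { opts = $4 }            # the last matching entry wins
    END {
      n = split(opts, o, ",")
      for (i = 1; i <= n; i++) if (o[i] == "noatime") found = 1
      exit(found ? 0 : 1)
    }
  ' "$mounts_file"
}

# Refresh the timestamps of every symlink under a directory so that an
# age-based /tmp cleaner sees them as recently used.
# $1: directory that contains the symlinks
refresh_symlinks() {
  local dir="$1"
  find "$dir" -type l -exec touch -h {} +
}
```

If `/tmp` is not a separate mount, `has_noatime /tmp` reports failure, and the filesystem that actually contains `/tmp` would need to be checked instead. `refresh_symlinks` could be run from a scheduled job at an interval shorter than the cleanup age (the title mentions 10 days).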

Sometimes, on a self-managed GitLab instance (seen so far in two cases that I know of, both on 14.8.2), Gitaly stops firing post-receive hooks across the entire instance, which breaks every feature that depends on them.

I learned about this when working with a customer on a support ticket (internal), including an extended call. Additionally, @wchandler was able to reproduce it briefly.

Potential workaround

Restarting Gitaly may be enough to work around the issue for now:

sudo gitlab-ctl restart gitaly

Steps to reproduce

Right now, we aren't sure how to reproduce this.

What is the current bug behavior?

Post-receive hooks don't fire. As a result, many other parts of GitLab stop working:

- Pushes from a terminal don't provide a link to create a merge request.
- Activity is not logged (such as in the Project information -> Activity page).
- Cached repository information is stale. This results in at least a couple of weird behaviors on a new branch created while the bug is occurring:
  - Clicking a file gives inactive Edit and Web IDE buttons, with a popover which states "You can only edit files when you are on a branch."
  - Merge requests have the following message next to the inactive Merge button:
    ```
    The source branch `new_branch` does not exist. Please restore it or use a different branch.
    ```

  These may be fixed by running `sudo gitlab-rake cache:clear`.
- GitLab CI/CD pipelines are not triggered.
- Webhooks/integrations are not triggered.

What is the expected correct behavior?

Post-receive hooks should fire. Gitaly logs should have entries that look like the following (on a single node instance):

{
  "content_length_bytes": 217,
  "correlation_id": "01FY7GRMGVMNN260XZB5NY3TPP",
  "duration_ms": 51,
  "level": "info",
  "method": "POST",
  "msg": "Finished HTTP request",
  "status": 200,
  "time": "2022-03-15T19:27:04.348Z",
  "url": "http://unix/api/v4/internal/post_receive"
}

Then, the rest of the GitLab functionality triggered by the built-in post-receive hooks should work.

Relevant logs and/or screenshots

Right now, the surest sign of this happening seems to be the absence of calls to the post_receive API endpoint in the logs.
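A minimal sketch of that check, assuming the Gitaly log is a plain-text file of JSON entries. The function name `count_post_receive_calls` is hypothetical, and the log path in the usage note is an assumption about an Omnibus install that may differ on other setups.

```shell
#!/usr/bin/env bash
# Sketch only: counts log lines that mention the internal post_receive
# endpoint. The function name is a placeholder.
# $1: path to a Gitaly log file
count_post_receive_calls() {
  local log_file="$1" count
  # grep -c prints 0 and exits with status 1 when nothing matches; treat that
  # as a valid result and fail only on real errors such as a missing file.
  count=$(grep -c -F '/api/v4/internal/post_receive' "$log_file") \
    || [ "$?" -eq 1 ] || return 2
  printf '%s\n' "$count"
}
```

For example, `count_post_receive_calls /var/log/gitlab/gitaly/current` (path assumed) printing `0` on an instance that has received pushes since the log was last rotated would be consistent with the behavior described here. A nonzero count only shows that some calls were logged, not that every push triggered one.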

Output of checks

Results of GitLab application Check

I will ask the customer to run the commands to fill out the information below, but their instance was a single node on 14.8.2, and not doing anything unusual.

Possible fixes

I've yet to track down how/where this could be happening.

Edited Apr 18, 2022 by Andrew Conrad
