Storage migrations that take too long flood the file server with duplicate git upload-pack commands
Summary
When a job moves a repo's storage from one file server to another, the git upload-pack command may take a long time to complete for larger repos. When this occurs, the job itself times out, but the command keeps running. Because the job is set to retry (currently 3 times), each retry starts the same command on the same repo and floods the file server.
This introduces the following risks:
- The repo may become corrupted as multiple git commands are doing the same thing on a large repo
- File server performance will plummet due to the IO load this command causes
- The job will most likely eventually fail
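The failure mode described in the summary can be reproduced in miniature. The following is only a sketch, not GitLab's job code: it assumes the job waits on a child process with a timeout and gives up without killing it. The names `run_job_once` and `run_with_retries` are hypothetical, and a sleeping Python process stands in for a slow git upload-pack.

```python
import subprocess
import sys


def run_job_once(cmd, timeout_s):
    """Start cmd and wait up to timeout_s for it to finish.

    On timeout the job gives up but does NOT kill the child,
    which mirrors the behavior described in the summary.
    """
    proc = subprocess.Popen(cmd)
    try:
        proc.wait(timeout=timeout_s)
    except subprocess.TimeoutExpired:
        pass  # the job "times out"; the command keeps running
    return proc


def run_with_retries(cmd, timeout_s, attempts):
    """Run the job up to `attempts` times; return children still alive."""
    leaked = []
    for _ in range(attempts):
        proc = run_job_once(cmd, timeout_s)
        if proc.poll() is None:
            leaked.append(proc)  # still running after the job gave up
        else:
            break  # command finished, no further attempt needed
    return leaked


if __name__ == "__main__":
    # Hypothetical stand-in for a slow command on a large repo.
    slow_cmd = [sys.executable, "-c", "import time; time.sleep(30)"]
    leaked = run_with_retries(slow_cmd, timeout_s=0.2, attempts=3)
    print(f"{len(leaked)} copies of the command are still running")
    for proc in leaked:
        proc.kill()
        proc.wait()
```

Every attempt that times out leaves one more copy of the command behind, which matches the three identical processes in the logs below.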
Steps to reproduce
- Create a repo that is a few hundred gigabytes in size
- Tell that repo to move to a different file server
What is the expected correct behavior?
The git upload-pack command should run only once, be allowed to succeed, and be given ample time for the project to move to its new location.
Relevant logs and/or screenshots
- On a project that is 149GB in size:
root@file-23-stor-gprd.c.gitlab-production.internal:~# ps -efl | grep hashed | grep -v grep | grep 47e03
0 S git 6708 16651 0 80 0 - 267002 poll_s 16:34 ? 00:00:00 /opt/gitlab/embedded/bin/git upload-pack /var/opt/gitlab/git-data/repositories/@hashed/47/e0/47e03b7ac922cefbfa7c07ad3736497397d05953c4e78dd59586703428e5f24a.git
0 S git 13865 16651 0 80 0 - 267002 poll_s 15:43 ? 00:00:00 /opt/gitlab/embedded/bin/git upload-pack /var/opt/gitlab/git-data/repositories/@hashed/47/e0/47e03b7ac922cefbfa7c07ad3736497397d05953c4e78dd59586703428e5f24a.git
0 S git 32542 16651 0 80 0 - 267002 poll_s 16:18 ? 00:00:00 /opt/gitlab/embedded/bin/git upload-pack /var/opt/gitlab/git-data/repositories/@hashed/47/e0/47e03b7ac922cefbfa7c07ad3736497397d05953c4e78dd59586703428e5f24a.git
Output of checks
While this happens on GitLab.com, there's a good chance this happens on any GitLab instance that supports the Project.change_repository_storage functionality.
Possible fixes
GitLab needs to recognize when a job is still in progress and its background processing is simply taking a long time. If GitLab knew this, it would not retry the job because of a "timeout."
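One way to tell "slow but alive" apart from "dead" is a heartbeat. This is a minimal sketch under assumptions, not an existing GitLab mechanism: the `Heartbeat` class, the `should_retry` helper, and the staleness threshold are all hypothetical. The idea is that the long-running work calls `beat()` periodically, and a retry is scheduled only when the deadline has passed and the heartbeat has gone stale.

```python
import time


class Heartbeat:
    """Tracks the last time a long-running job reported progress."""

    def __init__(self, stale_after_s, clock=time.monotonic):
        self.stale_after_s = stale_after_s
        self._clock = clock
        self._last_beat = clock()

    def beat(self):
        """Called by the job whenever the background work is still alive."""
        self._last_beat = self._clock()

    def is_stale(self):
        """True when no beat arrived within the staleness window."""
        return self._clock() - self._last_beat > self.stale_after_s


def should_retry(heartbeat, deadline_passed):
    """Retry only if the deadline passed AND the work stopped reporting."""
    return deadline_passed and heartbeat.is_stale()
```

With this shape, a job that has exceeded its deadline but is still sending heartbeats is left alone instead of being started again.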
We should prevent multiple git upload-pack commands from running on the same repo.
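A per-repo lock is one possible way to enforce this. The sketch below is an assumption-laden illustration, not GitLab's implementation: it uses an advisory `flock` on a lock file named after a hash of the repo path, so it only works on Unix-like hosts and only guards callers on the same machine. `LOCK_DIR`, `lock_path_for`, and `repo_lock` are hypothetical names; a real deployment would likely need the lock to live on the file server or in a shared store.

```python
import contextlib
import fcntl
import hashlib
import os
import tempfile

# Assumption: any directory shared by all job workers on this host would do.
LOCK_DIR = tempfile.gettempdir()


def lock_path_for(repo_path):
    """Derive a stable lock-file name from the repo path."""
    digest = hashlib.sha256(repo_path.encode("utf-8")).hexdigest()
    return os.path.join(LOCK_DIR, f"repo-move-{digest}.lock")


@contextlib.contextmanager
def repo_lock(repo_path):
    """Yield True if this caller holds the repo's lock, False otherwise.

    The lock is non-blocking: a second caller is told immediately that
    another operation is already running instead of queueing behind it.
    The kernel drops the lock if the holder dies, so no stale locks remain.
    """
    fd = os.open(lock_path_for(repo_path), os.O_CREAT | os.O_RDWR, 0o600)
    try:
        try:
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            yield False  # someone else is already working on this repo
            return
        try:
            yield True
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)
```

A retry would wrap its work in `with repo_lock(path) as acquired:` and skip launching the command when `acquired` is `False`.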
