When a storage migration takes too long, job retries drown the file server

Summary

When a job moves a repo's storage from one file server to another, the git upload-pack command may take a long time to complete for larger repos. When this occurs, the job itself times out, but the command keeps running. The job is set to retry (currently 3 times), so each retry starts the same command on the same repo and floods the file server.

This introduces the following risks:

The repo may become corrupted because multiple git commands are doing the same thing on a large repo at the same time
The file server performance will plummet due to the I/O load this command causes
The job will most likely eventually fail

Steps to reproduce

Create a repo that is a few hundred gigabytes in size
Tell that repo to move to a different file server

What is the expected correct behavior?

The git upload-pack command should run only once and be allowed to succeed, with ample time for the project to move to its new location.

Relevant logs and/or screenshots

On a project that is 149GB in size:

root@file-23-stor-gprd.c.gitlab-production.internal:~# ps -efl | grep hashed | grep -v grep | grep 47e03
0 S git       6708 16651  0  80   0 - 267002 poll_s 16:34 ?       00:00:00 /opt/gitlab/embedded/bin/git upload-pack /var/opt/gitlab/git-data/repositories/@hashed/47/e0/47e03b7ac922cefbfa7c07ad3736497397d05953c4e78dd59586703428e5f24a.git
0 S git      13865 16651  0  80   0 - 267002 poll_s 15:43 ?       00:00:00 /opt/gitlab/embedded/bin/git upload-pack /var/opt/gitlab/git-data/repositories/@hashed/47/e0/47e03b7ac922cefbfa7c07ad3736497397d05953c4e78dd59586703428e5f24a.git
0 S git      32542 16651  0  80   0 - 267002 poll_s 16:18 ?       00:00:00 /opt/gitlab/embedded/bin/git upload-pack /var/opt/gitlab/git-data/repositories/@hashed/47/e0/47e03b7ac922cefbfa7c07ad3736497397d05953c4e78dd59586703428e5f24a.git

Output of checks

While this happens on GitLab.com, there's a good chance this happens on any GitLab instance that supports the Project.change_repository_storage functionality.

Possible fixes

GitLab needs to recognize when a job is still running and its background processing is simply taking a really long time. If GitLab knew this, the job wouldn't be retried due to a "timeout."
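
One way to picture this is a heartbeat: the worker keeps renewing a short-lived lease while the long command runs, and the retry logic only fires once that lease has lapsed. The sketch below is illustrative only and is not how GitLab implements its jobs. The names JobLease, should_retry, and run_with_heartbeat are hypothetical, and the in-memory dict stands in for whatever shared store a real deployment would need.

```python
import subprocess
import time


class JobLease:
    """Hypothetical in-memory lease table; a real setup would use a shared store."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._expires_at = {}

    def heartbeat(self, job_id):
        # Push the expiry forward; called repeatedly while the command runs.
        self._expires_at[job_id] = self._clock() + self._ttl

    def release(self, job_id):
        # Drop the lease once the command has finished (or failed).
        self._expires_at.pop(job_id, None)

    def is_alive(self, job_id):
        expires = self._expires_at.get(job_id)
        return expires is not None and self._clock() < expires


def should_retry(lease, job_id):
    # Retry only when no worker has renewed the lease recently.
    return not lease.is_alive(job_id)


def run_with_heartbeat(cmd, lease, job_id, interval=1.0):
    """Run cmd to completion, renewing the lease every `interval` seconds."""
    proc = subprocess.Popen(cmd)
    try:
        while True:
            lease.heartbeat(job_id)
            try:
                return proc.wait(timeout=interval)
            except subprocess.TimeoutExpired:
                continue  # still running: loop around and renew the lease
    finally:
        lease.release(job_id)
```

With this shape, a slow move is distinguishable from a dead one: as long as the worker is alive and heartbeating, should_retry returns False, so no second copy of the command is started.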

We should prevent multiple git upload-pack commands from running on the same repo.
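
A minimal sketch of such a guard, assuming a per-repo lock file created atomically on the machine that starts the command. The names repo_migration_lock and RepoBusyError, and the lock file naming, are hypothetical. The sketch also does not handle stale locks left behind by a crashed process, which a real implementation would need to address.

```python
import contextlib
import hashlib
import os
import tempfile


class RepoBusyError(RuntimeError):
    """Raised when another move already holds the lock for this repo."""


@contextlib.contextmanager
def repo_migration_lock(repo_path, lock_dir=None):
    """Allow only one holder per repo path at a time (hypothetical helper)."""
    lock_dir = lock_dir or tempfile.gettempdir()
    # Hash the path so the lock file name is short and filesystem-safe.
    digest = hashlib.sha256(repo_path.encode("utf-8")).hexdigest()
    lock_path = os.path.join(lock_dir, f"repo-move-{digest}.lock")
    try:
        # O_CREAT | O_EXCL fails if the file already exists, which makes
        # acquiring the lock a single atomic step.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise RepoBusyError(f"a move is already running for {repo_path}") from None
    try:
        # Record the owner's PID to help whoever has to debug a stuck lock.
        os.write(fd, str(os.getpid()).encode("ascii"))
        yield lock_path
    finally:
        os.close(fd)
        os.unlink(lock_path)
```

A retried job would wrap the command in `with repo_migration_lock(path):` and treat RepoBusyError as "already in progress" rather than starting another git upload-pack against the same repo.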

Relatable Material

Edited Feb 02, 2019 by John Skarbek