Allow us to determine the niceness of processes in Gitaly

Problem to solve

Certain processes that spin up on Gitaly nodes can seriously degrade a file server. Let's think about ways to reduce the risk of an outage by adjusting the way Gitaly spins up child processes.

Target audience

  • Sidney, Systems Administrator, https://design.gitlab.com/research/personas#persona-sidney

Further details

During a recent storage migration on GitLab.com, we introduced a self-inflicted performance issue: one file server was choking on a few situations that prevented it from responding to Gitaly requests in a timely manner. Here are a few scenarios where Gitaly could take preventive action:

  • we see multiple git upload-pack commands operating on the same repo
  • certain git functions take too much IO
  • processes kicked off by Sidekiq time out and are retried while Gitaly is still doing the processing

Proposal

We need to figure out what improvements can be made to Gitaly and our Sidekiq workers to avoid the above scenarios. One example is moving a project's repository storage from one server to another, which in some situations can take a really long time. If the Sidekiq job times out waiting on Gitaly, the job is marked as a failure and retried. This causes Gitaly to spin up a second task on the server that does exactly what the first, still-running task is doing.
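
One way to picture a fix for the retry scenario is a registry of in-flight operations: a retried request carrying the same key attaches to the work that is already running instead of starting a duplicate. The sketch below is in Python purely for illustration (Gitaly itself is written in Go). `InFlightRegistry`, `run_once`, and the key format are hypothetical names, not existing Gitaly or Sidekiq APIs, and the sketch assumes the retry reaches the same process that owns the original work.

```python
import threading
from concurrent.futures import Future


class InFlightRegistry:
    """Hypothetical helper: run at most one operation per key at a time."""

    def __init__(self):
        self._lock = threading.Lock()
        self._inflight = {}  # key -> Future for the operation in progress

    def run_once(self, key, fn):
        # Decide under the lock whether this caller owns the work or
        # should wait for a caller that already started it.
        with self._lock:
            future = self._inflight.get(key)
            owner = future is None
            if owner:
                future = Future()
                self._inflight[key] = future

        if owner:
            try:
                future.set_result(fn())
            except BaseException as exc:
                future.set_exception(exc)
            finally:
                # Remove the key so later requests start fresh work.
                with self._lock:
                    del self._inflight[key]

        # Owners return their own result; waiters block until it is ready.
        return future.result()
```

A retried storage move would call something like `registry.run_once(("move", project_id), do_move)`, where `project_id` and `do_move` are again placeholders. A real deployment would likely need the registry to live somewhere both the original job and the retry can see, which this single-process sketch does not address.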

Some processes that Gitaly spins up inevitably cause high IO on a project because of what they do. git upload-pack is a great example: because it needs to read through the entire repo, it takes a lot of disk operations to complete. If enough of these are running on a server, we'll eventually hit an IO limit and the performance of the server will suffer, hindering every other operation on that server. We should evaluate whether we can use Linux kernel features to set the niceness of these and other IO-heavy operations.
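
As a rough illustration of what "setting the niceness" of a child process could look like, the Python sketch below raises the child's CPU niceness between fork and exec, and separately builds a command line that wraps a command in `ionice`. Note that niceness governs CPU scheduling, while IO priority is a separate setting, and whether the kernel honors an IO class depends on the IO scheduler in use. `run_niced` and `with_io_class` are hypothetical helper names, and the sketch is in Python only for illustration (Gitaly itself is written in Go).

```python
import os
import subprocess
import sys


def with_io_class(argv, io_class=3):
    """Prefix argv with ionice; class 3 is the "idle" class in util-linux."""
    return ["ionice", "-c", str(io_class), *argv]


def run_niced(argv, niceness=10):
    """Run argv as a child whose CPU niceness is raised by `niceness`."""

    def lower_priority():
        # Runs in the child after fork() and before exec(), so only the
        # child is affected; the parent keeps its own priority.
        os.nice(niceness)

    return subprocess.run(
        argv, preexec_fn=lower_priority, capture_output=True, text=True, check=True
    )


if __name__ == "__main__":
    # os.nice(0) changes nothing and returns the current niceness.
    child = run_niced([sys.executable, "-c", "import os; print(os.nice(0))"])
    print("parent niceness:", os.nice(0), "child niceness:", child.stdout.strip())
```

The two helpers compose: `run_niced(with_io_class(["git", "upload-pack", path]))` would lower both priorities for a single command, assuming `ionice` is installed on the node and `path` points at a repository.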

Sometimes, when jobs get retried or customers send too many commands, we can wind up with the same process running multiple times on a Gitaly server. One we often see is git upload-pack, most likely due to the nature of this command. We should evaluate whether running several of these commands at once is safe, and find a way to queue up this type of work. Running the same command on a project multiple times can severely impact the performance of the server.
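
Queueing this type of work could take the form of a per-repository concurrency limit: callers beyond the limit wait for a slot instead of piling more load onto the server. The following is a minimal sketch, assuming an in-process limiter keyed by repository path. `PerRepoLimiter`, `slot`, and `max_per_repo` are hypothetical names, and Python is used only for illustration (Gitaly itself is written in Go).

```python
import threading
from contextlib import contextmanager


class PerRepoLimiter:
    """Hypothetical helper: cap concurrent operations per repository."""

    def __init__(self, max_per_repo):
        self._max = max_per_repo
        self._lock = threading.Lock()
        self._semaphores = {}  # repo -> semaphore guarding that repo

    @contextmanager
    def slot(self, repo):
        # Look up or create the repo's semaphore under the lock.
        with self._lock:
            semaphore = self._semaphores.get(repo)
            if semaphore is None:
                semaphore = threading.BoundedSemaphore(self._max)
                self._semaphores[repo] = semaphore

        # Blocks here when the repo already has max_per_repo operations
        # running, which effectively queues the caller.
        semaphore.acquire()
        try:
            yield
        finally:
            semaphore.release()
```

A handler would wrap the expensive call, for example `with limiter.slot(repo_path): run_upload_pack(repo_path)`, where `run_upload_pack` is a placeholder. Operations on different repositories do not wait on each other in this sketch; a production version would probably also want a queue length cap and a wait timeout so callers do not block indefinitely.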

Links / references

  • https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/6147
  • https://gitlab.com/gitlab-org/gitlab-ee/issues/9563
Edited Jun 27, 2025 by 🤖 GitLab Bot 🤖