gitaly-ruby processes SIGKILL'd too frequently
Hi. We run an internal deployment of GitLab Community Edition 11.7.5.
We have been seeing intermittent "Something went wrong" errors on a variety of pages in the GitLab UI. Yesterday, someone clicked "Merge" on a merge request. Once the pipeline succeeded, GitLab attempted the merge and reported "Something went wrong." When this happens, GitLab remembers that the merge button failed and stops offering it on that merge request, so the author had to close the MR and file an identical new one so they could click the merge button again. By the time they did, it turned out the merge had in fact occurred (the merge commit was visible in the target branch's history), but GitLab still believed the old MR hadn't been merged.
I decided to investigate this incident in some depth.
983 is the ID of the MR whose merge failed. Its merge was interleaved with the merge of 984, so I'm showing both in the log excerpt, and the 983 merge took an alarmingly long time: it was ultimately killed before it finished. Is there any sort of deadlock that can occur between concurrent merge operations?
I started by looking at /opt/gitlab/embedded/service/gitlab-rails/log/production.log:
product/cso!984 - Git merge started on JID df4a5a1e49a36963914fca92
...
Started POST "/api/v4/internal/pre_receive" for 127.0.0.1 at 2019-03-26 15:03:27 -0400
product/cso!983 - Git merge started on JID
...
product/cso!984 - Git merge finished on JID df4a5a1e49a36963914fca92 commit 5d76cfa33e5451a11f136f1fff1e4a68902ce165
product/cso!984 - Post merge started on JID df4a5a1e49a36963914fca92 with state locked
...
product/cso!984 - Post merge finished on JID df4a5a1e49a36963914fca92 with state merged
...
Started PUT "/api/v4/jobs/585101" for 127.0.0.1 at 2019-03-26 15:16:27 -0400
MergeService ERROR: product/cso!983 - 14:transport is closing
MergeService ERROR: product/cso!983 - Something went wrong during merge
Started GET "/product/cso/merge_requests/843/ci_environments_status" for 127.0.0.1 at 2019-03-26 15:16:27 -0400
OK, so "transport is closing" is probably the best we're going to get from this log. This was quite a small merge request, so it seems abnormal that it would take 13 minutes. The server generally sits at 10% CPU utilization with 100 GB of free memory.
So next I looked at the gitaly logs around the time of the failed merge:
2019-03-26_19:16:26.84819 time="2019-03-26T19:16:26Z" level=info msg="sending SIGKILL" worker.name=gitaly-ruby.0 worker.pid=1321
2019-03-26_19:16:27.18284 time="2019-03-26T19:16:27Z" level=warning msg=exited error="signal: killed" supervisor.args="[bundle exec bin/ruby-cd /var/opt/gitlab/gitaly /opt/gitlab/embedded/service/gitaly-ruby/bin/gitaly-ruby 2098 /tmp/gitaly-ruby175072047/socket.0]" supervisor.name=gitaly-ruby.0
Note that these timestamps are UTC while the earlier Rails log is EDT (UTC-4), so 19:16:26Z is 15:16:26 local: the SIGKILL lands one second before the MergeService error above.
It looks likely that the gitaly-ruby worker was killed while carrying out the merge. I found the change that I believe introduced the killing: commit 776c0e65.
I can see that it kills gitaly-ruby processes that have leaked too much memory, so I guess the leaks are expected. My question is, under what circumstances will it kill a worker that is currently carrying out an RPC? Is there a way to have the workers voluntarily exit in between tasks so that tasks don't fail? Would it be possible to have them exit after executing K tasks, for a user configurable K?
Looking at our logs, I can see that one of these workers is being SIGKILL'd something like every 5-10 minutes. My understanding is that each time this happens it is potentially causing a random failure for someone, if that worker happened to be doing a task at the time. Is this normal?
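To quantify this, I bucketed the SIGKILL events by hour. A sketch of the counting, using an inline sample here rather than the real supervisor log (on an omnibus install that should be /var/log/gitlab/gitaly/current; the timestamps below are made up, but the msg="sending SIGKILL" format matches our logs):

```shell
# Inline sample standing in for the real supervisor log
# (hypothetical timestamps for illustration).
cat > /tmp/gitaly-sample.log <<'EOF'
2019-03-26_19:06:11 time="2019-03-26T19:06:11Z" level=info msg="sending SIGKILL" worker.name=gitaly-ruby.3 worker.pid=900
2019-03-26_19:16:26 time="2019-03-26T19:16:26Z" level=info msg="sending SIGKILL" worker.name=gitaly-ruby.0 worker.pid=1321
2019-03-26_19:24:02 time="2019-03-26T19:24:02Z" level=info msg="sending SIGKILL" worker.name=gitaly-ruby.7 worker.pid=955
EOF

# Total kills, then kills bucketed by hour (the field before the
# first ':' is the date_hour prefix, e.g. 2019-03-26_19).
grep -c 'sending SIGKILL' /tmp/gitaly-sample.log
grep 'sending SIGKILL' /tmp/gitaly-sample.log | cut -d: -f1 | sort | uniq -c
```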
I have since changed the configuration to raise various thresholds beyond the low default settings:
gitaly['ruby_max_rss'] = 1000000000 # That's 1GB.
gitaly['ruby_graceful_restart_timeout'] = '30m' # Grace time for a gitaly-ruby process to finish ongoing requests
gitaly['ruby_restart_delay'] = '5m' # Period of sustained high RSS that needs to be observed before restarting gitaly-ruby
gitaly['ruby_num_workers'] = 30 # Number of gitaly-ruby worker processes. Minimum 2, default 2.
However, I can still see in today's logs that ruby_graceful_restart_timeout is being hit every time something is killed: every single SIGTERM is followed 30 minutes later by a SIGKILL, and SIGKILLs are still happening at roughly 10-minute intervals on average. Do you think all these tasks are really taking more than 30 minutes? Could it be that SIGTERM is being ignored in error?
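Here is roughly how I checked the TERM-to-KILL pairing, again with an inline two-line sample (I'm assuming the SIGTERM event is logged with the same msg="sending ..." shape as the SIGKILL one; the worker/pid values are made up):

```shell
# Hypothetical sample showing the pattern I see in the real log:
# a SIGTERM followed exactly 30 minutes later by a SIGKILL for the
# same worker.
cat > /tmp/gitaly-signals.log <<'EOF'
2019-03-27_14:46:26 time="2019-03-27T14:46:26Z" level=info msg="sending SIGTERM" worker.name=gitaly-ruby.4 worker.pid=2207
2019-03-27_15:16:26 time="2019-03-27T15:16:26Z" level=info msg="sending SIGKILL" worker.name=gitaly-ruby.4 worker.pid=2207
EOF

# Interleave TERM and KILL events in time order so the 30-minute gap
# per worker is visible.
grep -E 'sending SIG(TERM|KILL)' /tmp/gitaly-signals.log | sort
```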
The typical 'top' output for the gitaly subtree is:
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
5233 git 20 0 5324436 76880 8264 S 13.9 0.1 215:41.26 `- /opt/gitlab/embedded/bin/gitaly /var/opt/gitlab/gitaly/config.toml
19868 git 20 0 3011764 846680 15596 S 0.3 0.6 1:23.47 `- ruby /opt/gitlab/embedded/service/gitaly-ruby/bin/gitaly-ruby 5233 /tmp/gitaly-ruby185709972/socket.23
16266 git 20 0 3142904 810732 16168 S 0.0 0.6 1:19.22 `- ruby /opt/gitlab/embedded/service/gitaly-ruby/bin/gitaly-ruby 5233 /tmp/gitaly-ruby185709972/socket.15
27064 git 20 0 3209336 751800 16124 S 0.3 0.6 1:10.39 `- ruby /opt/gitlab/embedded/service/gitaly-ruby/bin/gitaly-ruby 5233 /tmp/gitaly-ruby185709972/socket.20
17419 git 20 0 3208452 714232 16064 S 0.3 0.5 1:03.60 `- ruby /opt/gitlab/embedded/service/gitaly-ruby/bin/gitaly-ruby 5233 /tmp/gitaly-ruby185709972/socket.6
9503 git 20 0 3274208 989.8m 16164 S 0.3 0.8 1:11.39 `- ruby /opt/gitlab/embedded/service/gitaly-ruby/bin/gitaly-ruby 5233 /tmp/gitaly-ruby185709972/socket.25
14387 git 20 0 3144304 859176 15964 S 0.3 0.7 1:07.61 `- ruby /opt/gitlab/embedded/service/gitaly-ruby/bin/gitaly-ruby 5233 /tmp/gitaly-ruby185709972/socket.17
24267 git 20 0 3282708 886980 15876 S 0.3 0.7 1:02.04 `- ruby /opt/gitlab/embedded/service/gitaly-ruby/bin/gitaly-ruby 5233 /tmp/gitaly-ruby185709972/socket.28
28603 git 20 0 3274076 759548 16112 S 0.3 0.6 1:05.44 `- ruby /opt/gitlab/embedded/service/gitaly-ruby/bin/gitaly-ruby 5233 /tmp/gitaly-ruby185709972/socket.3
4534 git 20 0 3011176 668120 16164 S 0.3 0.5 1:00.62 `- ruby /opt/gitlab/embedded/service/gitaly-ruby/bin/gitaly-ruby 5233 /tmp/gitaly-ruby185709972/socket.10
6066 git 20 0 3274360 720040 16156 S 0.3 0.5 0:57.41 `- ruby /opt/gitlab/embedded/service/gitaly-ruby/bin/gitaly-ruby 5233 /tmp/gitaly-ruby185709972/socket.2
7857 git 20 0 3011340 682888 15968 S 0.3 0.5 0:53.95 `- ruby /opt/gitlab/embedded/service/gitaly-ruby/bin/gitaly-ruby 5233 /tmp/gitaly-ruby185709972/socket.8
10890 git 20 0 3077216 870632 16160 S 0.0 0.7 0:59.78 `- ruby /opt/gitlab/embedded/service/gitaly-ruby/bin/gitaly-ruby 5233 /tmp/gitaly-ruby185709972/socket.21
8348 git 20 0 3077900 562624 16256 S 0.3 0.4 0:49.14 `- ruby /opt/gitlab/embedded/service/gitaly-ruby/bin/gitaly-ruby 5233 /tmp/gitaly-ruby185709972/socket.11
17811 git 20 0 3078244 364432 16160 S 0.3 0.3 0:37.90 `- ruby /opt/gitlab/embedded/service/gitaly-ruby/bin/gitaly-ruby 5233 /tmp/gitaly-ruby185709972/socket.26
12000 git 20 0 3399048 768956 18620 S 0.3 0.6 0:35.97 `- ruby /opt/gitlab/embedded/service/gitaly-ruby/bin/gitaly-ruby 5233 /tmp/gitaly-ruby185709972/socket.12
13984 git 20 0 3273864 612280 15812 S 0.3 0.5 0:28.55 `- ruby /opt/gitlab/embedded/service/gitaly-ruby/bin/gitaly-ruby 5233 /tmp/gitaly-ruby185709972/socket.29
25992 git 20 0 3076220 399428 15568 S 0.3 0.3 0:26.25 `- ruby /opt/gitlab/embedded/service/gitaly-ruby/bin/gitaly-ruby 5233 /tmp/gitaly-ruby185709972/socket.13
7481 git 20 0 3142784 384464 15844 S 0.3 0.3 0:23.24 `- ruby /opt/gitlab/embedded/service/gitaly-ruby/bin/gitaly-ruby 5233 /tmp/gitaly-ruby185709972/socket.18
20733 git 20 0 3142656 437324 15784 S 0.0 0.3 0:22.09 `- ruby /opt/gitlab/embedded/service/gitaly-ruby/bin/gitaly-ruby 5233 /tmp/gitaly-ruby185709972/socket.24
13660 git 20 0 3203140 436936 18496 S 0.0 0.3 0:20.65 `- ruby /opt/gitlab/embedded/service/gitaly-ruby/bin/gitaly-ruby 5233 /tmp/gitaly-ruby185709972/socket.19
27079 git 20 0 3142752 379436 16468 S 0.0 0.3 0:18.24 `- ruby /opt/gitlab/embedded/service/gitaly-ruby/bin/gitaly-ruby 5233 /tmp/gitaly-ruby185709972/socket.0
29017 git 20 0 3076344 355072 16464 S 0.3 0.3 0:20.41 `- ruby /opt/gitlab/embedded/service/gitaly-ruby/bin/gitaly-ruby 5233 /tmp/gitaly-ruby185709972/socket.1
30712 git 20 0 3077252 305388 16048 S 0.3 0.2 0:17.93 `- ruby /opt/gitlab/embedded/service/gitaly-ruby/bin/gitaly-ruby 5233 /tmp/gitaly-ruby185709972/socket.4
24976 git 20 0 3010796 188176 16172 S 0.7 0.1 0:15.19 `- ruby /opt/gitlab/embedded/service/gitaly-ruby/bin/gitaly-ruby 5233 /tmp/gitaly-ruby185709972/socket.16
28365 git 20 0 3011240 247856 16528 S 0.3 0.2 0:16.62 `- ruby /opt/gitlab/embedded/service/gitaly-ruby/bin/gitaly-ruby 5233 /tmp/gitaly-ruby185709972/socket.14
30279 git 20 0 3011580 233900 16116 S 0.0 0.2 0:15.78 `- ruby /opt/gitlab/embedded/service/gitaly-ruby/bin/gitaly-ruby 5233 /tmp/gitaly-ruby185709972/socket.27
2753 git 20 0 3010820 214100 16180 S 0.3 0.2 0:16.07 `- ruby /opt/gitlab/embedded/service/gitaly-ruby/bin/gitaly-ruby 5233 /tmp/gitaly-ruby185709972/socket.7
21150 git 20 0 3011680 173368 16256 S 0.3 0.1 0:12.39 `- ruby /opt/gitlab/embedded/service/gitaly-ruby/bin/gitaly-ruby 5233 /tmp/gitaly-ruby185709972/socket.5
9698 git 20 0 3010360 182356 15812 S 0.0 0.1 0:08.27 `- ruby /opt/gitlab/embedded/service/gitaly-ruby/bin/gitaly-ruby 5233 /tmp/gitaly-ruby185709972/socket.9
18151 git 20 0 2801492 64640 15588 S 0.3 0.0 0:05.44 `- ruby /opt/gitlab/embedded/service/gitaly-ruby/bin/gitaly-ruby 5233 /tmp/gitaly-ruby185709972/socket.22
Given that the RSS distribution looks fairly uniform, it seems likely that these workers just work their way up to 1 GB and get killed, over and over. Based on the worker count and kill frequency, the expected lifetime of a worker is something like 5 hours.
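For reference, the arithmetic behind that lifetime estimate, using the worker count from our config and the observed kill rate:

```shell
# With 30 workers and one SIGKILL somewhere in the pool every ~10
# minutes, the kills cycle through the pool, so each worker survives
# about 30 * 10 = 300 minutes before its turn comes up.
workers=30
kill_interval_min=10
lifetime_min=$(( workers * kill_interval_min ))
echo "~$(( lifetime_min / 60 )) hours"   # -> ~5 hours
```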