gitaly-ruby processes SIGKILL'd too frequently
Hi. We run an internal deployment of GitLab Community Edition 11.7.5.
We have been seeing intermittent "Something went wrong" errors on a variety of pages in the GitLab UI. Yesterday, someone clicked "Merge" on a merge request. Once the pipeline succeeded, GitLab attempted the merge and reported "Something went wrong." When this happens, GitLab remembers that the merge button failed and stops offering it on that merge request, so the author had to close the MR and file an identical new one so they could click the merge button again. By the time they did, it turned out the merge had in fact occurred (the merge commit was visible in the target branch's history), but GitLab still believed the old MR hadn't been merged.
I decided to investigate this incident in some depth.
983 is the ID of the MR whose merge failed. Its merge was interleaved with the merge of 984, so I'm showing both in the log excerpt, and the 983 merge took an alarmingly long time: it was ultimately killed before it finished. Is there any sort of deadlock that can occur between concurrent merge operations?
I started by looking at /opt/gitlab/embedded/service/gitlab-rails/log/production.log:
product/cso!984 - Git merge started on JID df4a5a1e49a36963914fca92
...
Started POST "/api/v4/internal/pre_receive" for 127.0.0.1 at 2019-03-26 15:03:27 -0400
product/cso!983 - Git merge started on JID
...
product/cso!984 - Git merge finished on JID df4a5a1e49a36963914fca92 commit 5d76cfa33e5451a11f136f1fff1e4a68902ce165
product/cso!984 - Post merge started on JID df4a5a1e49a36963914fca92 with state locked
...
product/cso!984 - Post merge finished on JID df4a5a1e49a36963914fca92 with state merged
...
Started PUT "/api/v4/jobs/585101" for 127.0.0.1 at 2019-03-26 15:16:27 -0400
MergeService ERROR: product/cso!983 - 14:transport is closing
MergeService ERROR: product/cso!983 - Something went wrong during merge
Started GET "/product/cso/merge_requests/843/ci_environments_status" for 127.0.0.1 at 2019-03-26 15:16:27 -0400
OK, so "transport is closing" is probably the best we're going to get from this log. This was quite a small merge request, so it seems abnormal that it would take 13 minutes. The server generally sits at 10% CPU utilization with 100 GB of free memory.
So next I looked at the gitaly logs around the time of the failed merge:
2019-03-26_19:16:26.84819 time="2019-03-26T19:16:26Z" level=info msg="sending SIGKILL" worker.name=gitaly-ruby.0 worker.pid=1321
2019-03-26_19:16:27.18284 time="2019-03-26T19:16:27Z" level=warning msg=exited error="signal: killed" supervisor.args="[bundle exec bin/ruby-cd /var/opt/gitlab/gitaly /opt/gitlab/embedded/service/gitaly-ruby/bin/gitaly-ruby 2098 /tmp/gitaly-ruby175072047/socket.0]" supervisor.name=gitaly-ruby.0
Note that these timestamps are UTC while the earlier Rails log is EDT (UTC-4), so 19:16:26Z is 15:16:26 local: the SIGKILL lands one second before the MergeService error above.
It looks likely that the gitaly-ruby worker was killed while carrying out the merge. I found the change that I believe introduced the killing: commit 776c0e65.
I can see that it kills gitaly-ruby processes that have leaked too much memory, so I guess the leaks are expected. My question is, under what circumstances will it kill a worker that is currently carrying out an RPC? Is there a way to have the workers voluntarily exit in between tasks so that tasks don't fail? Would it be possible to have them exit after executing K tasks, for a user configurable K?
Looking at our logs, I can see that one of these workers is being SIGKILL'd something like every 5-10 minutes. My understanding is that each time this happens it is potentially causing a random failure for someone, if that worker happened to be doing a task at the time. Is this normal?
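To quantify this, I bucketed the SIGKILL events by hour. A sketch of the counting, using an inline sample here rather than the real supervisor log (on an omnibus install that should be /var/log/gitlab/gitaly/current; the timestamps below are made up, but the msg="sending SIGKILL" format matches our logs):

```shell
# Inline sample standing in for the real supervisor log
# (hypothetical timestamps for illustration).
cat > /tmp/gitaly-sample.log <<'EOF'
2019-03-26_19:06:11 time="2019-03-26T19:06:11Z" level=info msg="sending SIGKILL" worker.name=gitaly-ruby.3 worker.pid=900
2019-03-26_19:16:26 time="2019-03-26T19:16:26Z" level=info msg="sending SIGKILL" worker.name=gitaly-ruby.0 worker.pid=1321
2019-03-26_19:24:02 time="2019-03-26T19:24:02Z" level=info msg="sending SIGKILL" worker.name=gitaly-ruby.7 worker.pid=955
EOF

# Total kills, then kills bucketed by hour (the field before the
# first ':' is the date_hour prefix, e.g. 2019-03-26_19).
grep -c 'sending SIGKILL' /tmp/gitaly-sample.log
grep 'sending SIGKILL' /tmp/gitaly-sample.log | cut -d: -f1 | sort | uniq -c
```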
I have since changed the configuration to raise various thresholds beyond the low default settings:
gitaly['ruby_max_rss'] = 1000000000 # That's 1GB.
gitaly['ruby_graceful_restart_timeout'] = '30m' # Grace time for a gitaly-ruby process to finish ongoing requests
gitaly['ruby_restart_delay'] = '5m' # Period of sustained high RSS that needs to be observed before restarting gitaly-ruby
gitaly['ruby_num_workers'] = 30 # Number of gitaly-ruby worker processes. Minimum 2, default 2.
However, I can still see in today's logs that ruby_graceful_restart_timeout is being hit every time something is killed: every single SIGTERM is followed 30 minutes later by a SIGKILL, and SIGKILLs are still happening at roughly 10-minute intervals on average. Do you think all these tasks are really taking more than 30 minutes? Could it be that SIGTERM is being ignored in error?
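Here is roughly how I checked the TERM-to-KILL pairing, again with an inline two-line sample (I'm assuming the SIGTERM event is logged with the same msg="sending ..." shape as the SIGKILL one; the worker/pid values are made up):

```shell
# Hypothetical sample showing the pattern I see in the real log:
# a SIGTERM followed exactly 30 minutes later by a SIGKILL for the
# same worker.
cat > /tmp/gitaly-signals.log <<'EOF'
2019-03-27_14:46:26 time="2019-03-27T14:46:26Z" level=info msg="sending SIGTERM" worker.name=gitaly-ruby.4 worker.pid=2207
2019-03-27_15:16:26 time="2019-03-27T15:16:26Z" level=info msg="sending SIGKILL" worker.name=gitaly-ruby.4 worker.pid=2207
EOF

# Interleave TERM and KILL events in time order so the 30-minute gap
# per worker is visible.
grep -E 'sending SIG(TERM|KILL)' /tmp/gitaly-signals.log | sort
```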
The typical 'top' output for the gitaly subtree is:
  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
5233 git 20 0 5324436 76880 8264 S 13.9 0.1 215:41.26 `- /opt/gitlab/embedded/bin/gitaly /var/opt/gitlab/gitaly/config.toml
19868 git 20 0 3011764 846680 15596 S 0.3 0.6 1:23.47 `- ruby /opt/gitlab/embedded/service/gitaly-ruby/bin/gitaly-ruby 5233 /tmp/gitaly-ruby185709972/socket.23
16266 git 20 0 3142904 810732 16168 S 0.0 0.6 1:19.22 `- ruby /opt/gitlab/embedded/service/gitaly-ruby/bin/gitaly-ruby 5233 /tmp/gitaly-ruby185709972/socket.15
27064 git 20 0 3209336 751800 16124 S 0.3 0.6 1:10.39 `- ruby /opt/gitlab/embedded/service/gitaly-ruby/bin/gitaly-ruby 5233 /tmp/gitaly-ruby185709972/socket.20
17419 git 20 0 3208452 714232 16064 S 0.3 0.5 1:03.60 `- ruby /opt/gitlab/embedded/service/gitaly-ruby/bin/gitaly-ruby 5233 /tmp/gitaly-ruby185709972/socket.6
9503 git 20 0 3274208 989.8m 16164 S 0.3 0.8 1:11.39 `- ruby /opt/gitlab/embedded/service/gitaly-ruby/bin/gitaly-ruby 5233 /tmp/gitaly-ruby185709972/socket.25
14387 git 20 0 3144304 859176 15964 S 0.3 0.7 1:07.61 `- ruby /opt/gitlab/embedded/service/gitaly-ruby/bin/gitaly-ruby 5233 /tmp/gitaly-ruby185709972/socket.17
24267 git 20 0 3282708 886980 15876 S 0.3 0.7 1:02.04 `- ruby /opt/gitlab/embedded/service/gitaly-ruby/bin/gitaly-ruby 5233 /tmp/gitaly-ruby185709972/socket.28
28603 git 20 0 3274076 759548 16112 S 0.3 0.6 1:05.44 `- ruby /opt/gitlab/embedded/service/gitaly-ruby/bin/gitaly-ruby 5233 /tmp/gitaly-ruby185709972/socket.3
4534 git 20 0 3011176 668120 16164 S 0.3 0.5 1:00.62 `- ruby /opt/gitlab/embedded/service/gitaly-ruby/bin/gitaly-ruby 5233 /tmp/gitaly-ruby185709972/socket.10
6066 git 20 0 3274360 720040 16156 S 0.3 0.5 0:57.41 `- ruby /opt/gitlab/embedded/service/gitaly-ruby/bin/gitaly-ruby 5233 /tmp/gitaly-ruby185709972/socket.2
7857 git 20 0 3011340 682888 15968 S 0.3 0.5 0:53.95 `- ruby /opt/gitlab/embedded/service/gitaly-ruby/bin/gitaly-ruby 5233 /tmp/gitaly-ruby185709972/socket.8
10890 git 20 0 3077216 870632 16160 S 0.0 0.7 0:59.78 `- ruby /opt/gitlab/embedded/service/gitaly-ruby/bin/gitaly-ruby 5233 /tmp/gitaly-ruby185709972/socket.21
8348 git 20 0 3077900 562624 16256 S 0.3 0.4 0:49.14 `- ruby /opt/gitlab/embedded/service/gitaly-ruby/bin/gitaly-ruby 5233 /tmp/gitaly-ruby185709972/socket.11
17811 git 20 0 3078244 364432 16160 S 0.3 0.3 0:37.90 `- ruby /opt/gitlab/embedded/service/gitaly-ruby/bin/gitaly-ruby 5233 /tmp/gitaly-ruby185709972/socket.26
12000 git 20 0 3399048 768956 18620 S 0.3 0.6 0:35.97 `- ruby /opt/gitlab/embedded/service/gitaly-ruby/bin/gitaly-ruby 5233 /tmp/gitaly-ruby185709972/socket.12
13984 git 20 0 3273864 612280 15812 S 0.3 0.5 0:28.55 `- ruby /opt/gitlab/embedded/service/gitaly-ruby/bin/gitaly-ruby 5233 /tmp/gitaly-ruby185709972/socket.29
25992 git 20 0 3076220 399428 15568 S 0.3 0.3 0:26.25 `- ruby /opt/gitlab/embedded/service/gitaly-ruby/bin/gitaly-ruby 5233 /tmp/gitaly-ruby185709972/socket.13
7481 git 20 0 3142784 384464 15844 S 0.3 0.3 0:23.24 `- ruby /opt/gitlab/embedded/service/gitaly-ruby/bin/gitaly-ruby 5233 /tmp/gitaly-ruby185709972/socket.18
20733 git 20 0 3142656 437324 15784 S 0.0 0.3 0:22.09 `- ruby /opt/gitlab/embedded/service/gitaly-ruby/bin/gitaly-ruby 5233 /tmp/gitaly-ruby185709972/socket.24
13660 git 20 0 3203140 436936 18496 S 0.0 0.3 0:20.65 `- ruby /opt/gitlab/embedded/service/gitaly-ruby/bin/gitaly-ruby 5233 /tmp/gitaly-ruby185709972/socket.19
27079 git 20 0 3142752 379436 16468 S 0.0 0.3 0:18.24 `- ruby /opt/gitlab/embedded/service/gitaly-ruby/bin/gitaly-ruby 5233 /tmp/gitaly-ruby185709972/socket.0
29017 git 20 0 3076344 355072 16464 S 0.3 0.3 0:20.41 `- ruby /opt/gitlab/embedded/service/gitaly-ruby/bin/gitaly-ruby 5233 /tmp/gitaly-ruby185709972/socket.1
30712 git 20 0 3077252 305388 16048 S 0.3 0.2 0:17.93 `- ruby /opt/gitlab/embedded/service/gitaly-ruby/bin/gitaly-ruby 5233 /tmp/gitaly-ruby185709972/socket.4
24976 git 20 0 3010796 188176 16172 S 0.7 0.1 0:15.19 `- ruby /opt/gitlab/embedded/service/gitaly-ruby/bin/gitaly-ruby 5233 /tmp/gitaly-ruby185709972/socket.16
28365 git 20 0 3011240 247856 16528 S 0.3 0.2 0:16.62 `- ruby /opt/gitlab/embedded/service/gitaly-ruby/bin/gitaly-ruby 5233 /tmp/gitaly-ruby185709972/socket.14
30279 git 20 0 3011580 233900 16116 S 0.0 0.2 0:15.78 `- ruby /opt/gitlab/embedded/service/gitaly-ruby/bin/gitaly-ruby 5233 /tmp/gitaly-ruby185709972/socket.27
2753 git 20 0 3010820 214100 16180 S 0.3 0.2 0:16.07 `- ruby /opt/gitlab/embedded/service/gitaly-ruby/bin/gitaly-ruby 5233 /tmp/gitaly-ruby185709972/socket.7
21150 git 20 0 3011680 173368 16256 S 0.3 0.1 0:12.39 `- ruby /opt/gitlab/embedded/service/gitaly-ruby/bin/gitaly-ruby 5233 /tmp/gitaly-ruby185709972/socket.5
9698 git 20 0 3010360 182356 15812 S 0.0 0.1 0:08.27 `- ruby /opt/gitlab/embedded/service/gitaly-ruby/bin/gitaly-ruby 5233 /tmp/gitaly-ruby185709972/socket.9
18151 git 20 0 2801492 64640 15588 S 0.3 0.0 0:05.44 `- ruby /opt/gitlab/embedded/service/gitaly-ruby/bin/gitaly-ruby 5233 /tmp/gitaly-ruby185709972/socket.22
Given that the RSS distribution looks fairly uniform, it seems likely that these workers just work their way up to 1 GB and get killed, over and over. Based on the worker count and kill frequency, the expected lifetime of a worker is something like 5 hours.
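For reference, the arithmetic behind that lifetime estimate, using the worker count from our config and the observed kill rate:

```shell
# With 30 workers and one SIGKILL somewhere in the pool every ~10
# minutes, the kills cycle through the pool, so each worker survives
# about 30 * 10 = 300 minutes before its turn comes up.
workers=30
kill_interval_min=10
lifetime_min=$(( workers * kill_interval_min ))
echo "~$(( lifetime_min / 60 )) hours"   # -> ~5 hours
```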