GitLab is getting very slow, sometimes with HTTP errors - maybe because the unicorn worker_timeout parameter is not working?
We are running a GitLab setup with about 3,000 users and about 300 GB of project data. We see performance issues regularly - every 2 to 3 days (web frontend / API calls).
First of all:
System information
System:
Current User: git
Using RVM: no
Ruby Version: 2.4.5p335
Gem Version: 2.7.6
Bundler Version: 1.16.2
Rake Version: 12.3.1
Redis Version: 3.2.12
Git Version: 2.18.1
Sidekiq Version: 5.2.1
Go Version: unknown
Running on ECS / Docker - 2 containers in parallel on gitlab/gitlab-ce:11.4.5-ce.0,
instance type m4.4xlarge, local EBS, no NFS.
As you can see from the target response times behind the load balancer, performance degrades noticeably. If I restart Unicorn with
gitlab-ctl restart unicorn
it takes about 20 seconds and the UI is up and responding (much faster) again.
Any ideas what could cause this?
If I look into the log files, I can see the following:
W, [2018-12-10T15:26:45.225553 #76173] WARN -- : #<Unicorn::HttpServer:0x00007f3410b4aa78>: worker (pid: 76173) exceeds memory limit (467265536.0 bytes > 438931584 bytes)
W, [2018-12-10T15:26:12.802705 #55399] WARN -- : #<Unicorn::HttpServer:0x00007f3410b4aa78>: worker (pid: 55399) exceeds memory limit (487145472.0 bytes > 460798392 bytes)
W, [2018-12-10T15:23:23.881979 #90458] WARN -- : #<Unicorn::HttpServer:0x00007f3410b4aa78>: worker (pid: 90458) exceeds memory limit (540641792.0 bytes > 540635450 bytes)
W, [2018-12-10T15:21:54.911044 #82682] WARN -- : #<Unicorn::HttpServer:0x00007f5b64f4bb60>: worker (pid: 82682) exceeds memory limit (461782528.0 bytes > 446529511 bytes)
W, [2018-12-10T15:18:50.639896 #24389] WARN -- : #<Unicorn::HttpServer:0x00007f5b64f4bb60>: worker (pid: 24389) exceeds memory limit (554089472.0 bytes > 553282884 bytes)
W, [2018-12-10T15:09:36.333666 #70726] WARN -- : #<Unicorn::HttpServer:0x00007f3410b4aa78>: worker (pid: 70726) exceeds memory limit (482471424.0 bytes > 476889684 bytes)
W, [2018-12-10T15:06:48.796787 #61923] WARN -- : #<Unicorn::HttpServer:0x00007f3410b4aa78>: worker (pid: 61923) exceeds memory limit (457211392.0 bytes > 441531244 bytes)
W, [2018-12-10T15:06:08.522062 #23613] WARN -- : #<Unicorn::HttpServer:0x00007f3410b4aa78>: worker (pid: 23613) exceeds memory limit (538663936.0 bytes > 536596382 bytes)
Every Unicorn worker seems to have a different memory limit - is this correct?
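From what I've read, the differing limits seem to come from unicorn-worker-killer, which the omnibus package configures with a min/max range (`unicorn['worker_memory_limit_min']` / `unicorn['worker_memory_limit_max']`) and which draws a random limit in between for each worker. A rough illustration of that behaviour (the 400/650 MB values are my assumption about the defaults, not taken from our config):

```ruby
# Rough sketch (my assumption, not GitLab's actual code): each worker draws a
# random memory limit between the configured min and max, which would explain
# why every worker logs a different "exceeds memory limit" threshold.

def random_memory_limit(min_bytes, max_bytes)
  min_bytes + Random.rand(max_bytes - min_bytes)
end

min = 400 * 1024 * 1024 # unicorn['worker_memory_limit_min'] (assumed default)
max = 650 * 1024 * 1024 # unicorn['worker_memory_limit_max'] (assumed default)

puts "this worker's limit: #{random_memory_limit(min, max)} bytes"
```

The limits in the log above (roughly 439 to 553 MB) would all fall inside such a range.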
Somewhere I read that a Unicorn worker restarting every 23 seconds is quite normal. Our 32 workers terminate about 100 times per day - so every worker about 3 times a day, i.e. roughly every 8 hours.
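A quick sanity check of that estimate:

```ruby
# ~100 worker terminations per day spread over 32 workers
restarts_per_day = 100.0
workers = 32

per_worker = restarts_per_day / workers # restarts per worker per day
hours_between = 24 / per_worker         # hours between restarts of one worker

puts "each worker restarts roughly every #{hours_between.round(1)} hours"
```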
Apart from that, shouldn't the worker_timeout parameter lead to a restart of the workers every x seconds (with x provided via that parameter)?
###! Time between sampling of unicorn socket metrics, in seconds
# gitlab_rails['monitoring_unicorn_sampler_interval'] = 10
##! Tweak unicorn settings.
##! Docs: https://docs.gitlab.com/omnibus/settings/unicorn.html
unicorn['worker_timeout'] = 60
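If I understand the Unicorn docs correctly, worker_timeout is a per-request watchdog rather than a lifetime limit: each worker refreshes a heartbeat while serving a request, and the master only kills a worker whose heartbeat is older than the timeout. A minimal sketch of that reading (my own illustration, not GitLab or Unicorn code):

```ruby
# Hedged sketch of Unicorn's timeout mechanism as I understand it: the master
# kills a worker only when its heartbeat is older than worker_timeout, i.e.
# when it is stuck on a single request. Healthy or idle workers are never
# restarted by this setting, so it would not cap worker lifetime.

WORKER_TIMEOUT = 60 # seconds, matching the gitlab.rb setting above

def worker_stuck?(last_heartbeat, now = Time.now)
  (now - last_heartbeat) > WORKER_TIMEOUT
end

puts worker_stuck?(Time.now - 120) # a worker stuck for 2 minutes would be killed
puts worker_stuck?(Time.now - 5)   # a responsive worker is left alone
```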
But I can't see any timeout restarts in the log files.
What I can see are lines like this:
Unicorn::WorkerKiller send SIGQUIT (pid: 61773) alive: 137714 sec (trial 1)
Unicorn::WorkerKiller send SIGQUIT (pid: 31801) alive: 244384 sec (trial 1)
Unicorn::WorkerKiller send SIGQUIT (pid: 64041) alive: 334004 sec (trial 1)
Unicorn::WorkerKiller send SIGQUIT (pid: 21509) alive: 242968 sec (trial 1)
These Unicorn workers have been alive for more than 2 days...
Most of the workers live about 1,000 to 5,000 seconds - but even that is way more than the 60 seconds configured in gitlab.rb.
I added an extractor for this "alive" value to create a graph, so we have a better overview of how long our Unicorn workers live.
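For reference, a minimal version of such an extractor could look like this (the log format is taken from the WorkerKiller lines quoted above; the sample lines are just the ones from this post):

```ruby
# Pull the "alive: N sec" value out of Unicorn::WorkerKiller log lines and
# summarize worker lifetimes. Sample input copied from the log excerpt above.
lines = [
  "Unicorn::WorkerKiller send SIGQUIT (pid: 61773) alive: 137714 sec (trial 1)",
  "Unicorn::WorkerKiller send SIGQUIT (pid: 31801) alive: 244384 sec (trial 1)",
]

lifetimes = lines.map { |l| l[/alive: (\d+) sec/, 1] }.compact.map(&:to_i)

puts "max lifetime: #{(lifetimes.max / 3600.0).round(1)} hours"
```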
I think all of these performance issues are related to the long lifetime of our Unicorn workers.