GitLab is getting very slow, sometimes with HTTP errors. Could it be that the unicorn `worker_timeout` parameter is not working?
We are running a GitLab setup with about 3,000 users and about 300 GB of projects. We see performance issues (web frontend / API calls) regularly, every 2 to 3 days. First of all:

```
System information
System:
Current User:    git
Using RVM:       no
Ruby Version:    2.4.5p335
Gem Version:     2.7.6
Bundler Version: 1.16.2
Rake Version:    12.3.1
Redis Version:   3.2.12
Git Version:     2.18.1
Sidekiq Version: 5.2.1
Go Version:      unknown
```

We run on ECS / Docker, 2 containers in parallel, on `gitlab/gitlab-ce:11.4.5-ce.0`, instance type m4.4xlarge, local EBS, no NFS.

![Unbenannt](/uploads/09f25d5a61aa286f8160facb73fdc81f/Unbenannt.PNG)

As you can see, these are the target response times behind the load balancer. If I restart unicorn with

```
gitlab-ctl restart unicorn
```

it takes about 20 seconds and the UI is up and responding (much faster) again. Any ideas what can cause this?

If I look into the log files, I can see the following:

```
2018-12-10 15:26:46.007 W, [2018-12-10T15:26:45.225553 #76173] WARN -- : #<Unicorn::HttpServer:0x00007f3410b4aa78>: worker (pid: 76173) exceeds memory limit (467265536.0 bytes > 438931584 bytes)
2018-12-10 15:26:12.997 W, [2018-12-10T15:26:12.802705 #55399] WARN -- : #<Unicorn::HttpServer:0x00007f3410b4aa78>: worker (pid: 55399) exceeds memory limit (487145472.0 bytes > 460798392 bytes)
2018-12-10 15:23:23.913 W, [2018-12-10T15:23:23.881979 #90458] WARN -- : #<Unicorn::HttpServer:0x00007f3410b4aa78>: worker (pid: 90458) exceeds memory limit (540641792.0 bytes > 540635450 bytes)
2018-12-10 15:21:55.801 W, [2018-12-10T15:21:54.911044 #82682] WARN -- : #<Unicorn::HttpServer:0x00007f5b64f4bb60>: worker (pid: 82682) exceeds memory limit (461782528.0 bytes > 446529511 bytes)
2018-12-10 15:18:50.731 W, [2018-12-10T15:18:50.639896 #24389] WARN -- : #<Unicorn::HttpServer:0x00007f5b64f4bb60>: worker (pid: 24389) exceeds memory limit (554089472.0 bytes > 553282884 bytes)
2018-12-10 15:09:36.867 W, [2018-12-10T15:09:36.333666 #70726] WARN -- : #<Unicorn::HttpServer:0x00007f3410b4aa78>: worker (pid: 70726) exceeds memory limit (482471424.0 bytes > 476889684 bytes)
2018-12-10 15:06:48.799 W, [2018-12-10T15:06:48.796787 #61923] WARN -- : #<Unicorn::HttpServer:0x00007f3410b4aa78>: worker (pid: 61923) exceeds memory limit (457211392.0 bytes > 441531244 bytes)
2018-12-10 15:06:08.788 W, [2018-12-10T15:06:08.522062 #23613] WARN -- : #<Unicorn::HttpServer:0x00007f3410b4aa78>: worker (pid: 23613) exceeds memory limit (538663936.0 bytes > 536596382 bytes)
```

Every unicorn worker seems to have a different memory limit. Is this correct?

Somewhere I read that a unicorn worker restarting every 23 seconds is quite normal. Our 32 workers terminate about 100 times per day, so every worker about 3 times per day, i.e. every 8 hours.

Apart from that, shouldn't the `worker_timeout` parameter lead to a restart of the workers every x seconds (x being the configured value)?

```
root@11767c2a2743:/# cat /etc/gitlab/gitlab.rb | grep unicorn
###! Time between sampling of unicorn socket metrics, in seconds
# gitlab_rails['monitoring_unicorn_sampler_interval'] = 10
##! Tweak unicorn settings.
##! Docs: https://docs.gitlab.com/omnibus/settings/unicorn.html
unicorn['worker_timeout'] = 60
```

But I can't see any timeout restarts in the log files. I can see lines like this:

```
Unicorn::WorkerKiller send SIGQUIT (pid: 61773) alive: 137714 sec (trial 1)
Unicorn::WorkerKiller send SIGQUIT (pid: 31801) alive: 244384 sec (trial 1)
Unicorn::WorkerKiller send SIGQUIT (pid: 64041) alive: 334004 sec (trial 1)
Unicorn::WorkerKiller send SIGQUIT (pid: 21509) alive: 242968 sec (trial 1)
```

These unicorns have been alive for more than 2 days. Most of the workers live about 1,000 to 5,000 seconds, which is still way more than the 60 seconds configured in `gitlab.rb`. I added an extractor for this `alive` value to create a graph, for a better overview of how long our unicorn workers live.

I think all the performance issues are related to the long lifetime of our unicorn workers.
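For reference, the extractor for the `alive` value is roughly the following. This is a minimal sketch: the line format is taken from the `Unicorn::WorkerKiller` log excerpt above, while the method name and the sample input are just illustration.

```ruby
# Pull the per-worker lifetime (in seconds) out of Unicorn::WorkerKiller
# log lines such as:
#   Unicorn::WorkerKiller send SIGQUIT (pid: 61773) alive: 137714 sec (trial 1)
ALIVE_RE = /Unicorn::WorkerKiller send SIG\w+ \(pid: (\d+)\) alive: (\d+) sec/

# Returns a hash of pid => seconds the worker was alive when it was killed.
def worker_lifetimes(lines)
  lines.each_with_object({}) do |line, out|
    if (m = line.match(ALIVE_RE))
      out[m[1].to_i] = m[2].to_i
    end
  end
end

# Sample input taken from the log excerpt above.
sample = <<~LOG
  Unicorn::WorkerKiller send SIGQUIT (pid: 61773) alive: 137714 sec (trial 1)
  Unicorn::WorkerKiller send SIGQUIT (pid: 31801) alive: 244384 sec (trial 1)
LOG

lifetimes = worker_lifetimes(sample.lines)
lifetimes.each { |pid, secs| puts "pid #{pid} lived #{secs / 3600} h" }
```

Feeding the whole production log through this and plotting the values gives the lifetime overview mentioned above.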
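Regarding the differing memory limits: as far as I understand, the `unicorn-worker-killer` gem used by Omnibus GitLab draws a random limit per worker between a configured minimum and maximum (so that not all workers die at the same moment), which would explain the different values in the log above. A sketch of that idea follows; this is not GitLab's actual code, and the byte values are illustrative, not the real defaults.

```ruby
# Sketch of the per-worker limit randomization as I understand
# unicorn-worker-killer's OOM handler: each worker picks a random
# limit in [min_bytes, max_bytes) once, then gets SIGQUIT when its
# memory use exceeds that limit.
def pick_memory_limit(min_bytes, max_bytes, rng = Random.new)
  min_bytes + rng.rand(max_bytes - min_bytes)
end

min = 400 * 1024 * 1024  # illustrative lower bound (400 MB)
max = 650 * 1024 * 1024  # illustrative upper bound (650 MB)
limit = pick_memory_limit(min, max)
puts "this worker's limit: #{limit} bytes"
```

If this is right, the varying limits are expected behavior and only the very long `alive` times are the anomaly.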