2021-01-26 Puma failing after HUP during deployments (#3434) · Issues · GitLab.com / GitLab Infrastructure Team / Production

2021-01-26 Puma failing after HUP during deployments

During deployment to canary, web-cny-04 was added back into rotation but was in state DOWN. This failed the deploy job. This is expected because we wait for HAProxy to tell us that the server is UP. Though we probably should have stopped earlier because the HUP clearly did not succeed. The before and after status of puma showed the same PID, and this has been used as our check to validate that we've successfully restarted the service as desired. https://ops.gitlab.net/gitlab-com/gl-infra/deployer/-/jobs/2936691 More concerning, Deployer did not properly detect that the HUP was unsuccessful. Notes from the `server`web-cny-04` as it is while troubleshooting: ``` # gitlab-ctl status run: gitlab-kas: (pid 24387) 3865s; run: log: (pid 27025) 668104s run: gitlab-workhorse: (pid 18170) 127023s; run: log: (pid 2564) 18218581s run: logrotate: (pid 17216) 1450s; run: log: (pid 2561) 18218581s run: nginx: (pid 27088) 668102s; run: log: (pid 2565) 18218581s run: puma: (pid 24930) 3853s; run: log: (pid 2569) 18218581s # curl localhost:8080/-/readiness {"status":"failed","master_check":[{"status":"failed","message":"unexpected Master check result: false"}]} ``` The logs from Puma: <details><summary>Logs from puma</summary> ``` 2021-01-26_16:54:38.72738 Received HUP from runit, sending INT signal to perform graceful restart 2021-01-26_16:54:38.72756 Sending INT signal to Puma 27852... 2021-01-26_16:54:38.72822 Waiting for Puma 27852 to exit (1)... 2021-01-26_16:54:38.74210 Received HUP from runit, sending INT signal to perform graceful restart 2021-01-26_16:54:38.74224 Sending INT signal to Puma 27852... 2021-01-26_16:54:38.74275 Waiting for Puma 27852 to exit (1)... 2021-01-26_16:54:39.72885 Waiting for Puma 27852 to exit (2)... 2021-01-26_16:54:39.74339 Waiting for Puma 27852 to exit (2)... 2021-01-26_16:54:40.72954 Waiting for Puma 27852 to exit (3)... 2021-01-26_16:54:40.74403 Waiting for Puma 27852 to exit (3)... 2021-01-26_16:54:41.73017 Waiting for Puma 27852 to exit (4)... 2021-01-26_16:54:41.74466 Waiting for Puma 27852 to exit (4)... 2021-01-26_16:54:42.03018 Received HUP from runit, sending INT signal to perform graceful restart 2021-01-26_16:54:42.03038 Sending INT signal to Puma 27852... 2021-01-26_16:54:42.03096 Waiting for Puma 27852 to exit (1)... 2021-01-26_16:54:42.73082 Waiting for Puma 27852 to exit (5)... 2021-01-26_16:54:42.74528 Waiting for Puma 27852 to exit (5)... 2021-01-26_16:54:43.03156 Waiting for Puma 27852 to exit (2)... 2021-01-26_16:54:43.73151 Waiting for Puma 27852 to exit (6)... 2021-01-26_16:54:43.74591 Waiting for Puma 27852 to exit (6)... 2021-01-26_16:54:44.03242 Waiting for Puma 27852 to exit (3)... 2021-01-26_16:54:44.73221 Waiting for Puma 27852 to exit (7)... 2021-01-26_16:54:44.74655 Waiting for Puma 27852 to exit (7)... 2021-01-26_16:54:45.03316 Waiting for Puma 27852 to exit (4)... 2021-01-26_16:54:45.73302 Waiting for Puma 27852 to exit (8)... 2021-01-26_16:54:45.74725 Waiting for Puma 27852 to exit (8)... 2021-01-26_16:54:46.03387 Waiting for Puma 27852 to exit (5)... 2021-01-26_16:54:46.73377 Waiting for Puma 27852 to exit (9)... 2021-01-26_16:54:46.74801 Waiting for Puma 27852 to exit (9)... 2021-01-26_16:54:47.03460 Waiting for Puma 27852 to exit (6)... 2021-01-26_16:54:47.73445 Waiting for Puma 27852 to exit (10)... 2021-01-26_16:54:47.74869 Waiting for Puma 27852 to exit (10)... 2021-01-26_16:54:48.03533 Waiting for Puma 27852 to exit (7)... 2021-01-26_16:54:48.73517 Waiting for Puma 27852 to exit (11)... 2021-01-26_16:54:48.74929 Waiting for Puma 27852 to exit (11)... 2021-01-26_16:54:49.03603 Waiting for Puma 27852 to exit (8)... 2021-01-26_16:54:49.73606 Waiting for Puma 27852 to exit (12)... 2021-01-26_16:54:49.75579 Waiting for Puma 27852 to exit (12)... 2021-01-26_16:54:50.03672 Waiting for Puma 27852 to exit (9)... 2021-01-26_16:54:50.73694 Waiting for Puma 27852 to exit (13)... 2021-01-26_16:54:50.73696 control/h: 26: kill: No such process 2021-01-26_16:54:50.73696 2021-01-26_16:54:50.73696 Puma 27852 did exit. 2021-01-26_16:54:50.76911 Waiting for Puma 27852 to exit (13)... 2021-01-26_16:54:50.76912 control/h: 26: kill: No such process 2021-01-26_16:54:50.76913 2021-01-26_16:54:50.76913 Puma 27852 did exit. 2021-01-26_16:54:51.03742 Waiting for Puma 27852 to exit (10)... 2021-01-26_16:54:51.03744 control/h: 26: kill: No such process 2021-01-26_16:54:51.03744 2021-01-26_16:54:51.03744 Puma 27852 did exit. 2021-01-26_16:54:51.24915 {"timestamp":"2021-01-26T16:54:51.249Z","pid":24930,"message":"Puma starting in cluster mode..."} 2021-01-26_16:54:51.24916 {"timestamp":"2021-01-26T16:54:51.249Z","pid":24930,"message":"* Puma version: 5.1.1 (ruby 2.7.2-p137) (\"At Your Service\")"} 2021-01-26_16:54:51.24917 {"timestamp":"2021-01-26T16:54:51.249Z","pid":24930,"message":"* Min threads: 1"} 2021-01-26_16:54:51.24917 {"timestamp":"2021-01-26T16:54:51.249Z","pid":24930,"message":"* Max threads: 4"} 2021-01-26_16:54:51.24918 {"timestamp":"2021-01-26T16:54:51.249Z","pid":24930,"message":"* Environment: production"} 2021-01-26_16:54:51.24919 {"timestamp":"2021-01-26T16:54:51.249Z","pid":24930,"message":"* Master PID: 24930"} 2021-01-26_16:54:51.24921 {"timestamp":"2021-01-26T16:54:51.249Z","pid":24930,"message":"* Workers: 16"} 2021-01-26_16:54:51.24921 {"timestamp":"2021-01-26T16:54:51.249Z","pid":24930,"message":"* Restarts: (✔) hot (✖) phased"} 2021-01-26_16:54:51.24922 {"timestamp":"2021-01-26T16:54:51.249Z","pid":24930,"message":"* Preloading application"} 2021-01-26_16:55:30.18680 {"timestamp":"2021-01-26T16:55:30.186Z","pid":24930,"message":"* Listening on unix:///var/opt/gitlab/gitlab-rails/sockets/gitlab.socket"} 2021-01-26_16:55:30.18693 {"timestamp":"2021-01-26T16:55:30.186Z","pid":24930,"message":"* Listening on http://0.0.0.0:8080"} 2021-01-26_16:55:30.18698 {"timestamp":"2021-01-26T16:55:30.186Z","pid":24930,"message":"! WARNING: Detected 2 Thread(s) started in app boot:"} 2021-01-26_16:55:30.18704 {"timestamp":"2021-01-26T16:55:30.186Z","pid":24930,"message":"! #\u003cThread:0x00007fe8b4c4d390 /opt/gitlab/embedded/lib/ruby/gems/2.7.0/gems/activerecord-6.0.3.4/lib/active_record/connection_adapters/abstract/connection_pool.rb:334 sleep\u003e - /opt/gitlab/embedded/lib/ruby/gems/2.7.0/gems/activerecord-6.0.3.4/lib/active_record/connection_adapters/abstract/connection_pool.rb:337:in `sleep'"} 2021-01-26_16:55:30.18711 {"timestamp":"2021-01-26T16:55:30.187Z","pid":24930,"message":"! #\u003cThread:0x00007fe89a33a9c0 /opt/gitlab/embedded/lib/ruby/gems/2.7.0/gems/rack-timeout-0.5.2/lib/rack/timeout/support/scheduler.rb:73 sleep\u003e - /opt/gitlab/embedded/lib/ruby/gems/2.7.0/gems/rack-timeout-0.5.2/lib/rack/timeout/support/scheduler.rb:91:in `sleep'"} 2021-01-26_16:55:30.18718 {"timestamp":"2021-01-26T16:55:30.187Z","pid":24930,"message":"Use Ctrl-C to stop"} 2021-01-26_18:03:48.62538 Received TERM from runit, sending to process group (-PID) 2021-01-26_18:04:33.85168 {"timestamp":"2021-01-26T18:04:33.851Z","pid":26376,"message":"Puma starting in cluster mode..."} 2021-01-26_18:04:33.85170 {"timestamp":"2021-01-26T18:04:33.851Z","pid":26376,"message":"* Puma version: 5.1.1 (ruby 2.7.2-p137) (\"At Your Service\")"} 2021-01-26_18:04:33.85170 {"timestamp":"2021-01-26T18:04:33.851Z","pid":26376,"message":"* Min threads: 1"} 2021-01-26_18:04:33.85171 {"timestamp":"2021-01-26T18:04:33.851Z","pid":26376,"message":"* Max threads: 4"} 2021-01-26_18:04:33.85171 {"timestamp":"2021-01-26T18:04:33.851Z","pid":26376,"message":"* Environment: production"} 2021-01-26_18:04:33.85173 {"timestamp":"2021-01-26T18:04:33.851Z","pid":26376,"message":"* Master PID: 26376"} 2021-01-26_18:04:33.85173 {"timestamp":"2021-01-26T18:04:33.851Z","pid":26376,"message":"* Workers: 16"} 2021-01-26_18:04:33.85175 {"timestamp":"2021-01-26T18:04:33.851Z","pid":26376,"message":"* Restarts: (✔) hot (✖) phased"} 2021-01-26_18:04:33.85175 {"timestamp":"2021-01-26T18:04:33.851Z","pid":26376,"message":"* Preloading application"} 2021-01-26_18:05:12.52683 {"timestamp":"2021-01-26T18:05:12.526Z","pid":26376,"message":"* Listening on unix:///var/opt/gitlab/gitlab-rails/sockets/gitlab.socket"} 2021-01-26_18:05:12.52695 {"timestamp":"2021-01-26T18:05:12.526Z","pid":26376,"message":"* Listening on http://0.0.0.0:8080"} 2021-01-26_18:05:12.52700 {"timestamp":"2021-01-26T18:05:12.526Z","pid":26376,"message":"! WARNING: Detected 2 Thread(s) started in app boot:"} 2021-01-26_18:05:12.52706 {"timestamp":"2021-01-26T18:05:12.527Z","pid":26376,"message":"! #\u003cThread:0x00007fa7b7b955a8 /opt/gitlab/embedded/lib/ruby/gems/2.7.0/gems/activerecord-6.0.3.4/lib/active_record/connection_adapters/abstract/connection_pool.rb:334 sleep\u003e - /opt/gitlab/embedded/lib/ruby/gems/2.7.0/gems/activerecord-6.0.3.4/lib/active_record/connection_adapters/abstract/connection_pool.rb:337:in `sleep'"} 2021-01-26_18:05:12.52711 {"timestamp":"2021-01-26T18:05:12.527Z","pid":26376,"message":"! #\u003cThread:0x00007fa79d795a80 /opt/gitlab/embedded/lib/ruby/gems/2.7.0/gems/rack-timeout-0.5.2/lib/rack/timeout/support/scheduler.rb:73 sleep\u003e - /opt/gitlab/embedded/lib/ruby/gems/2.7.0/gems/rack-timeout-0.5.2/lib/rack/timeout/support/scheduler.rb:91:in `sleep'"} 2021-01-26_18:05:12.52718 {"timestamp":"2021-01-26T18:05:12.527Z","pid":26376,"message":"Use Ctrl-C to stop"} ``` </details> In the above logs, a human intervened at 18:03. Because the server is noted as DOWN in haproxy, it is not receiving any traffic. There's nothing exciting in the rails or workhorse logs at all. This issue manifested itself on the same package install for web-cny-06 as well: https://ops.gitlab.net/gitlab-com/gl-infra/deployer/-/jobs/2937928 --- To fix the issue, a stop/start of gitlab was issued on the `web-cny-04` --- ## Milestones * [x] Determine why HUP'ing our servers has become problematic and inconsistent * [x] Determine why deployer is not detecting this - https://gitlab.com/gitlab-com/gl-infra/delivery/-/issues/1513#note_494465192 * We should not be adding servers back to haproxy if we failed to HUP the service

issue