Enabling "db_load_balancing" makes the GitLab HA cluster unusable when a Patroni node is shut down forcefully
Summary
During availability testing of an HA cluster based on the 3k reference setup, we found that the GitLab service becomes unavailable when a Patroni node is shut down forcefully.
According to `/var/log/gitlab/gitlab-rails/database_load_balancing.log`, when the services on a Patroni node are stopped via `gitlab-ctl stop`, database load balancing marks that node as offline. When the Patroni node is shut down forcefully, however, the node's status does not seem to be checked, so database load balancing never updates it and keeps treating the node as available.
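A minimal sketch for checking what database load balancing last recorded for each host. It assumes the log contains one JSON object per line. The field names `db_host` and `event` are assumptions and may differ between GitLab versions, so adjust them to match the actual entries.

```python
import json


def last_event_per_host(lines, host_key="db_host", event_key="event"):
    """Return the most recent event recorded for each host.

    host_key and event_key are assumed field names, not confirmed ones.
    """
    latest = {}
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip blank or non-JSON lines
        if not isinstance(entry, dict):
            continue
        host = entry.get(host_key)
        if host is not None:
            latest[host] = entry.get(event_key)
    return latest


if __name__ == "__main__":
    path = "/var/log/gitlab/gitlab-rails/database_load_balancing.log"
    try:
        with open(path, encoding="utf-8") as handle:
            for host, event in last_event_per_host(handle).items():
                print(host, event)
    except OSError as exc:
        print(f"could not read {path}: {exc}")
```

If the forcefully stopped node never shows up with an offline-type event after the shutdown, that would match the behavior described above.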
Steps to reproduce
- Set up an HA cluster by following the 3k cluster documentation.
- Make sure database load balancing is enabled in the GitLab Rails and Sidekiq configuration.
- Forcefully shut down a Patroni node.
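To compare the two shutdown modes from a Rails node, a small TCP probe can help. In general, a host whose service was stopped refuses connections immediately, while a host that was powered off may not answer at all, so the connection attempt only ends when the timeout expires. Whether this difference is what affects database load balancing here is an assumption, not something confirmed in this report. The host and port below are hypothetical placeholders.

```python
import socket
import time


def probe(host, port, timeout=2.0):
    """Try a TCP connection and report the outcome and elapsed seconds."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            outcome = "open"
    except ConnectionRefusedError:
        outcome = "refused"   # host is up, nothing listens on the port
    except socket.timeout:
        outcome = "timeout"   # no answer at all within the timeout
    except OSError as exc:
        outcome = f"error: {exc}"
    return outcome, time.monotonic() - start


if __name__ == "__main__":
    # Hypothetical target: replace with the address and port of a Patroni node.
    outcome, elapsed = probe("127.0.0.1", 5432, timeout=2.0)
    print(f"{outcome} after {elapsed:.2f}s")
```

Running the probe once after `gitlab-ctl stop` and once after a forced shutdown shows whether the node fails fast or only after the timeout.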
Example Project
N/A
What is the current bug behavior?
The GitLab service of the cluster is unavailable.
What is the expected correct behavior?
The service is still available.
Relevant logs and/or screenshots
Expand for Puma logs
==> /var/log/gitlab/puma/current <==
2023-06-08_06:40:25.89442 {"timestamp":"2023-06-08T06:40:25.894Z","pid":59844,"message":"* Master PID: 59844"}
2023-06-08_06:40:25.89443 {"timestamp":"2023-06-08T06:40:25.894Z","pid":59844,"message":"* Workers: 8"}
2023-06-08_06:40:25.89444 {"timestamp":"2023-06-08T06:40:25.894Z","pid":59844,"message":"* Restarts: (✔) hot (✖) phased"}
2023-06-08_06:40:25.89445 {"timestamp":"2023-06-08T06:40:25.894Z","pid":59844,"message":"* Preloading application"}
2023-06-08_06:41:01.53819 {"timestamp":"2023-06-08T06:41:01.538Z","pid":59844,"message":"* Listening on unix:///var/opt/gitlab/gitlab-rails/sockets/gitlab.socket"}
2023-06-08_06:41:01.53830 {"timestamp":"2023-06-08T06:41:01.538Z","pid":59844,"message":"* Listening on http://0.0.0.0:8080"}
2023-06-08_06:41:01.53834 {"timestamp":"2023-06-08T06:41:01.538Z","pid":59844,"message":"! WARNING: Detected 2 Thread(s) started in app boot:"}
2023-06-08_06:41:01.53838 {"timestamp":"2023-06-08T06:41:01.538Z","pid":59844,"message":"! #\u003cThread:0x00007f82d2cd5080 /opt/gitlab/embedded/lib/ruby/gems/3.0.0/gems/rack-timeout-0.6.3/lib/rack/timeout/support/scheduler.rb:73 sleep\u003e - /opt/gitlab/embedded/lib/ruby/gems/3.0.0/gems/rack-timeout-0.6.3/lib/rack/timeout/support/scheduler.rb:91:in `sleep'"}
2023-06-08_06:41:01.53842 {"timestamp":"2023-06-08T06:41:01.538Z","pid":59844,"message":"! #\u003cThread:0x00007f82d9d9e3e8 /opt/gitlab/embedded/lib/ruby/gems/3.0.0/gems/sentry-ruby-5.8.0/lib/sentry/session_flusher.rb:81 sleep\u003e - /opt/gitlab/embedded/lib/ruby/gems/3.0.0/gems/sentry-ruby-5.8.0/lib/sentry/session_flusher.rb:83:in `sleep'"}
2023-06-08_06:41:01.53848 {"timestamp":"2023-06-08T06:41:01.538Z","pid":59844,"message":"Use Ctrl-C to stop"}

==> /var/log/gitlab/puma/puma_stdout.log <==
Note: GC compacting is currently disabled. Refer to config/initializers_before_autoloader/003_gc_compact.rb for details.
{"timestamp":"2023-06-08T06:41:01.653Z","pid":59844,"message":"! Friendly fork preparation complete."}
{"timestamp":"2023-06-08T06:41:02.266Z","pid":59844,"message":"- Worker 0 (PID: 60001) booted in 0.6s, phase: 0"}
{"timestamp":"2023-06-08T06:41:02.276Z","pid":59844,"message":"- Worker 1 (PID: 60003) booted in 0.6s, phase: 0"}
{"timestamp":"2023-06-08T06:41:02.278Z","pid":59844,"message":"- Worker 2 (PID: 60005) booted in 0.6s, phase: 0"}
{"timestamp":"2023-06-08T06:41:02.278Z","pid":59844,"message":"- Worker 3 (PID: 60007) booted in 0.58s, phase: 0"}
{"timestamp":"2023-06-08T06:41:02.313Z","pid":59844,"message":"- Worker 4 (PID: 60009) booted in 0.6s, phase: 0"}
{"timestamp":"2023-06-08T06:41:02.325Z","pid":59844,"message":"- Worker 6 (PID: 60013) booted in 0.59s, phase: 0"}
{"timestamp":"2023-06-08T06:41:02.325Z","pid":59844,"message":"- Worker 5 (PID: 60011) booted in 0.61s, phase: 0"}
{"timestamp":"2023-06-08T06:41:02.342Z","pid":59844,"message":"- Worker 7 (PID: 60015) booted in 0.6s, phase: 0"}

==> /var/log/gitlab/puma/puma_stderr.log <==
source=rack-timeout id=01H2CWBPPP716GGSW7VKTFT0HN timeout=60000ms service=60000ms state=timed_out at=error
source=rack-timeout id=01H2CWBXHHN5QT7187PFJ4BVX5 timeout=60000ms service=60000ms state=timed_out at=error
source=rack-timeout id=01H2CWBZ8DBY3WDPRWA857RHSX timeout=60000ms service=60000ms state=timed_out at=error
source=rack-timeout id=01H2CWC4CB29E7K6JV8XQZ1K5S timeout=60000ms service=60000ms state=timed_out at=error
source=rack-timeout id=01H2CWCB76K8BDZZ26ZXN88EY4 timeout=60000ms service=60000ms state=timed_out at=error
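The rack-timeout entries in puma_stderr.log are plain `key=value` pairs, so a short script can count how many requests hit the 60000ms limit. This is a rough sketch that assumes one entry per line in the log file. The function names are illustrative only.

```python
from collections import Counter


def parse_kv(line):
    """Split a 'key=value key=value ...' log line into a dict."""
    return dict(tok.split("=", 1) for tok in line.split() if "=" in tok)


def count_rack_timeout_states(lines):
    """Count rack-timeout entries by their 'state' field."""
    counts = Counter()
    for line in lines:
        fields = parse_kv(line)
        if fields.get("source") == "rack-timeout":
            counts[fields.get("state")] += 1
    return counts


if __name__ == "__main__":
    path = "/var/log/gitlab/puma/puma_stderr.log"
    try:
        with open(path, encoding="utf-8") as handle:
            print(dict(count_rack_timeout_states(handle)))
    except OSError as exc:
        print(f"could not read {path}: {exc}")
```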
Expand for Workhorse logs
{"correlation_id":"01H2CVY97PA6G546CKJRQJCYJ4","duration_ms":1999,"error":"badgateway: failed to receive response: context canceled","level":"error","method":"GET","msg":"","time":"2023-06-08T14:43:42+08:00","uri":"/-/readiness"}
{"content_type":"application/json; charset=utf-8","correlation_id":"01H2CVY97PA6G546CKJRQJCYJ4","duration_ms":1999,"host":"jihu-ha.futureman.xin","level":"info","method":"GET","msg":"access","proto":"HTTP/1.1","referrer":"","remote_addr":"127.0.0.1:0","remote_ip":"127.0.0.1","route":"^/-/(readiness|liveness)$","status":499,"system":"http","time":"2023-06-08T14:43:42+08:00","ttfb_ms":1999,"uri":"/-/readiness","user_agent":"","written_bytes":26}
{"correlation_id":"01H2CVYG2HGRXATFG259H02WNX","duration_ms":2000,"error":"badgateway: failed to receive response: context canceled","level":"error","method":"GET","msg":"","time":"2023-06-08T14:43:49+08:00","uri":"/-/readiness"}
{"content_type":"application/json; charset=utf-8","correlation_id":"01H2CVYG2HGRXATFG259H02WNX","duration_ms":2000,"host":"jihu-ha.futureman.xin","level":"info","method":"GET","msg":"access","proto":"HTTP/1.1","referrer":"","remote_addr":"127.0.0.1:0","remote_ip":"127.0.0.1","route":"^/-/(readiness|liveness)$","status":499,"system":"http","time":"2023-06-08T14:43:49+08:00","ttfb_ms":2000,"uri":"/-/readiness","user_agent":"","written_bytes":26}
{"correlation_id":"01H2CVYPXCDRZNT4MRRAVSHE30","duration_ms":2000,"error":"badgateway: failed to receive response: context canceled","level":"error","method":"GET","msg":"","time":"2023-06-08T14:43:56+08:00","uri":"/-/readiness"}
{"content_type":"application/json; charset=utf-8","correlation_id":"01H2CVYPXCDRZNT4MRRAVSHE30","duration_ms":2001,"host":"jihu-ha.futureman.xin","level":"info","method":"GET","msg":"access","proto":"HTTP/1.1","referrer":"","remote_addr":"127.0.0.1:0","remote_ip":"127.0.0.1","route":"^/-/(readiness|liveness)$","status":499,"system":"http","time":"2023-06-08T14:43:56+08:00","ttfb_ms":2001,"uri":"/-/readiness","user_agent":"","written_bytes":26}
{"error":"keywatcher: pubsub receive: EOF","level":"error","msg":"","time":"2023-06-08T14:43:58+08:00"}
{"address":"10.12.1.6:6379","level":"info","msg":"redis: dialing","network":"tcp","time":"2023-06-08T14:43:58+08:00"}
{"correlation_id":"01H2CVYHFF9XHN9FKD03SQ5ZV4","duration_ms":10000,"error":"badgateway: failed to receive response: context canceled","level":"error","method":"GET","msg":"","time":"2023-06-08T14:43:59+08:00","uri":"/-/readiness"}
{"content_type":"application/json; charset=utf-8","correlation_id":"01H2CVYHFF9XHN9FKD03SQ5ZV4","duration_ms":10000,"host":"jihu-ha.futureman.xin","level":"info","method":"GET","msg":"access","proto":"HTTP/1.1","referrer":"","remote_addr":"127.0.0.1:0","remote_ip":"127.0.0.1","route":"^/-/(readiness|liveness)$","status":499,"system":"http","time":"2023-06-08T14:43:59+08:00","ttfb_ms":10000,"uri":"/-/readiness","user_agent":"","written_bytes":26}
{"correlation_id":"01H2CVYXR8NEPCVQ2JZDTTEVJZ","duration_ms":2000,"error":"badgateway: failed to receive response: context canceled","level":"error","method":"GET","msg":"","time":"2023-06-08T14:44:03+08:00","uri":"/-/readiness"}
{"content_type":"application/json; charset=utf-8","correlation_id":"01H2CVYXR8NEPCVQ2JZDTTEVJZ","duration_ms":2001,"host":"jihu-ha.futureman.xin","level":"info","method":"GET","msg":"access","proto":"HTTP/1.1","referrer":"","remote_addr":"127.0.0.1:0","remote_ip":"127.0.0.1","route":"^/-/(readiness|liveness)$","status":499,"system":"http","time":"2023-06-08T14:44:03+08:00","ttfb_ms":2000,"uri":"/-/readiness","user_agent":"","written_bytes":26}
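To see at a glance how the `/-/readiness` probes fared, the Workhorse access entries can be tallied by HTTP status. This sketch assumes one JSON object per line and uses only the `msg`, `uri`, and `status` fields visible in the entries above. The log path in the usage part is an assumption for an Omnibus installation.

```python
import json
from collections import Counter


def readiness_statuses(lines, uri="/-/readiness"):
    """Count HTTP statuses of access-log entries for the given URI."""
    counts = Counter()
    for line in lines:
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip blank or non-JSON lines
        if not isinstance(entry, dict):
            continue
        if entry.get("msg") == "access" and entry.get("uri") == uri:
            counts[entry.get("status")] += 1
    return counts


if __name__ == "__main__":
    # Assumed location of the Workhorse log; adjust if it differs.
    path = "/var/log/gitlab/gitlab-workhorse/current"
    try:
        with open(path, encoding="utf-8") as handle:
            print(dict(readiness_statuses(handle)))
    except OSError as exc:
        print(f"could not read {path}: {exc}")
```

In the excerpt above every readiness access entry has status 499, meaning the client gave up before Puma answered.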
Output of checks
N/A
Results of GitLab environment info
Expand for output related to GitLab environment info
(For installations with omnibus-gitlab package run and paste the output of: `sudo gitlab-rake gitlab:env:info`) (For installations from source run and paste the output of: `sudo -u git -H bundle exec rake gitlab:env:info RAILS_ENV=production`)
Results of GitLab application Check
Expand for output related to the GitLab application check
(For installations with omnibus-gitlab package run and paste the output of: `sudo gitlab-rake gitlab:check SANITIZE=true`) (For installations from source run and paste the output of: `sudo -u git -H bundle exec rake gitlab:check RAILS_ENV=production SANITIZE=true`) (we will only investigate if the tests are passing)
Possible fixes