Zero-downtime deployment of Puma
Currently, for zero-downtime deployment we use gitlab-ctl hup unicorn.
This does not work for Puma, as Puma gives signals a different meaning than Unicorn does.
It seems impossible to support zero-downtime deployment of Puma without heavy hacks, because for Unicorn we rely on a mix of SIGUSR2+SIGQUIT.
This is the exact flow difference between the two services:
Unicorn does the following:
- SIGUSR2 re-execs the binary,
- The re-exec'd binary accepts new connections (I think so),
- We wait some time,
- We then issue SIGQUIT, which stops the master and leaves only the re-exec'd binary,
- The master stops accepting new connections (I think so).
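The Unicorn sequence above can be sketched as a small Ruby helper. This only illustrates the SIGUSR2, wait, SIGQUIT ordering described here and is not how gitlab-ctl hup unicorn is actually implemented. The method name, the default wait of 30 seconds, and the injectable signaler and sleeper arguments are assumptions made so the sketch can be exercised without a real Unicorn master.

```ruby
# Hypothetical helper: shows the order of signals only.
def unicorn_rolling_restart(master_pid, wait: 30,
                            signaler: Process.method(:kill),
                            sleeper: Kernel.method(:sleep))
  # Ask the running master to re-exec itself; old and new masters coexist.
  signaler.call("USR2", master_pid)
  # Give the re-exec'd master time to boot (the duration is an assumption).
  sleeper.call(wait)
  # Gracefully stop the old master, leaving only the re-exec'd one.
  signaler.call("QUIT", master_pid)
end
```

Calling unicorn_rolling_restart(pid) with the defaults would send real signals to the given process, so the injectable arguments are mainly there to make a dry run possible.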
For Puma, SIGUSR1:
- If preload_app=false, a graceful restart of all workers is performed, which means that each started worker loads the application itself,
- If preload_app=true, it behaves exactly like SIGUSR2.
For Puma, SIGUSR2:
- It gracefully shuts down all workers with SIGTERM (this is the meaning of graceful shutdown),
- During the graceful shutdown it allows all workers to finish processing,
- It stops accepting new requests,
- It re-execs itself,
- It blocks on accepting new requests until the application finishes loading,
- It processes new requests.
Proposal
One solution is to accept that Puma, on a single-node installation, will not accept new connections for a brief period of 40-60s.
This is completely avoidable by using health checks with a blackout period, as described in gitlab#30201 (comment 224452030). This would gracefully disconnect the node from the load balancer during the reconfigure.
One solution to the problem of application restarts, likely the best one to implement everywhere, would be to give /-/health a blackout period: when the signal is issued, we move into the blackout period, during which the health check returns not-ready, allowing the load balancer to take action and switch to other nodes.
Once the application restarts the healthcheck would start returning OK.
This is easy to implement for all services and would allow gradual, automated restarts of services.
The flow would be like this (e.g. Puma):
- SIGUSR2 is sent,
- Puma receives it and marks that it wants to be restarted,
- From now on, /-/health returns NOT READY (or a relevant status),
- After some predefined period (likely 10s), the restart/shutdown proceeds and new connections are no longer processed,
- Once the application restarts, the health check returns READY.
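The flow above can be sketched in Ruby as a small state object that a /-/health endpoint could consult. This is a minimal illustration under stated assumptions, not an existing Puma or GitLab API: the RestartGate class, its method names, the 503 status used for NOT READY, and the injectable clock are all hypothetical.

```ruby
# Hypothetical sketch of a health check with a blackout period.
class RestartGate
  def initialize(blackout_seconds: 10,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @blackout_seconds = blackout_seconds
    @clock = clock
    @restart_requested_at = nil
  end

  # Called when the restart signal arrives: only records the request.
  def request_restart!
    @restart_requested_at ||= @clock.call
  end

  # True while a restart is pending; /-/health reports NOT READY.
  def blackout?
    !@restart_requested_at.nil?
  end

  # True once the blackout period has elapsed and the restart may proceed.
  def restart_due?
    blackout? && (@clock.call - @restart_requested_at) >= @blackout_seconds
  end

  # Rack-style response triple for /-/health.
  def health_response
    if blackout?
      [503, { "content-type" => "text/plain" }, ["NOT READY"]]
    else
      [200, { "content-type" => "text/plain" }, ["READY"]]
    end
  end
end
```

In this sketch the restarted process starts with a fresh RestartGate, so its health check reports READY again without any extra bookkeeping. The load balancer is assumed to poll /-/health more often than the blackout period, so it has time to move traffic to other nodes.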
This flow is very easy to implement for each service, as it uses a standard restart mechanism and does not require complex logic or additional wrappers. This workflow could easily be adopted for all our applications (Gitaly, Workhorse, Pages, Unicorn/Puma), making them consistent.
Secondly, this workflow would also work very well with our Cloud-Native installation, as Cloud-Native now handles rolling updates very well.