Zero-downtime deployment of Puma

Currently, for zero-downtime deployment we use gitlab-ctl hup unicorn.

This does not work for Puma, because Puma assigns different meanings to signals than Unicorn does.

It seems impossible to support zero-downtime deployment of Puma without heavy hacks, because for Unicorn we rely on a mix of SIGUSR2 and SIGQUIT.

This is the exact difference in flow between the two services.

Unicorn does the following:

SIGUSR2 re-execs the binary,
The re-exec'd process accepts new connections (I think so),
We wait some time,
We then issue SIGQUIT, which stops the old master and leaves only the re-exec'd process,
The old master stops accepting new connections (I think so).
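
The Unicorn steps above can be sketched roughly as follows. This is a minimal illustration, not what gitlab-ctl hup unicorn actually runs: the helper name unicorn_handover, the wait_seconds value, and the injectable signaller and sleeper arguments are all assumptions made for this example.

```ruby
# Hypothetical helper: performs the USR2-then-QUIT handover described above.
# `signaller` and `sleeper` are injectable so the flow can be exercised
# without touching real processes; by default they use Process.kill and sleep.
def unicorn_handover(master_pid, wait_seconds: 60,
                     signaller: Process.method(:kill),
                     sleeper: Kernel.method(:sleep))
  # Ask the running master to re-exec itself; a new master starts alongside it.
  signaller.call("USR2", master_pid)
  # Give the new master time to boot (the wait length is an assumed value).
  sleeper.call(wait_seconds)
  # Gracefully stop the old master, leaving only the re-exec'd one.
  signaller.call("QUIT", master_pid)
end
```

The key point is that two processes overlap for a while, which is what keeps connections being accepted throughout.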

For Puma, SIGUSR1 does the following:

If preload_app=false, all workers are gracefully restarted, which means each newly started worker loads the application itself,
If preload_app=true, it behaves exactly like SIGUSR2.

For Puma, SIGUSR2 does the following:

It gracefully shuts down all workers with SIGTERM (this is what graceful shutdown means),
During the graceful shutdown it lets the workers finish processing,
It stops accepting new requests,
It re-execs itself,
It blocks new requests until the application finishes loading,
It processes new requests.

Proposal

One solution is to accept that, on a single-node installation, Puma will not accept new connections for a brief period of 40-60s.

This is completely avoidable by using health checks with a blackout period, as described in gitlab#30201 (comment 224452030). The node would then be disconnected gracefully from the load balancer during the reconfigure.

One solution to the application restart problem, and likely the best one to implement everywhere, is to give /-/health a blackout period. When the signal is issued we enter the blackout period, during which the health check returns not-ready; this allows the load balancer to take action and switch to other nodes.

Once the application restarts, the health check starts returning OK again.

This is easy to implement for all services and would allow gradual, automated restarts.
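
To illustrate what a gradual, automated restart could look like on top of such a health check, here is a minimal sketch. Everything in it is hypothetical: the rolling_restart name, the restart and healthy callables (which the caller would back with a real signal and a real /-/health probe), and the timeout and interval values.

```ruby
# Hypothetical rolling restart: nodes are restarted one at a time, and the
# next node is only touched once the previous one reports healthy again.
# `restart` and `healthy` are callables supplied by the caller (assumptions).
def rolling_restart(nodes, restart:, healthy:, timeout: 120, interval: 1,
                    sleeper: Kernel.method(:sleep))
  nodes.each do |node|
    restart.call(node)
    waited = 0
    # Poll the health check until the node is back, or give up after `timeout`.
    until healthy.call(node)
      raise "#{node} did not become healthy within #{timeout}s" if waited >= timeout
      sleeper.call(interval)
      waited += interval
    end
  end
end
```

Stopping at the first node that fails to come back keeps the remaining nodes serving traffic.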

The flow would look like this (e.g. for Puma):

SIGUSR2 is sent,
Puma receives it and marks that a restart is wanted,
From now on /-/health returns NOT READY (or a relevant status),
After a predefined period (likely 10s), the restart/shutdown proceeds and new connections are no longer processed,
Once the application restarts, the health check returns READY.
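
The state tracking behind this flow could be sketched as below. This is only an illustration under assumptions: RestartBlackout and its method names are hypothetical and not part of Puma or GitLab, and wiring it to the actual signal handler and to /-/health is left out.

```ruby
# Hypothetical readiness tracker for the flow above. The clock is injectable
# so the behaviour can be checked without waiting in real time.
class RestartBlackout
  def initialize(blackout_seconds: 10,
                 clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
    @blackout_seconds = blackout_seconds
    @clock = clock
    @requested_at = nil
  end

  # Called when the restart signal arrives: note that a restart is wanted.
  def request_restart
    @requested_at ||= @clock.call
  end

  # What /-/health would report: not ready as soon as a restart is requested.
  def ready?
    @requested_at.nil?
  end

  # True once the blackout period has elapsed and the real restart may begin.
  def restart_due?
    !@requested_at.nil? && (@clock.call - @requested_at) >= @blackout_seconds
  end

  # Called once the application has finished restarting.
  def restart_finished
    @requested_at = nil
  end
end
```

The health endpoint would consult ready?, while the restart logic would wait for restart_due? before shutting down, giving the load balancer the blackout period to move traffic elsewhere.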

This flow is very easy to implement for each service: it uses the standard restart mechanism and does not require complex logic or additional wrappers. The same workflow could easily be adopted for all our applications (Gitaly, Workhorse, Pages, Unicorn/Puma), making them consistent.

Secondly, this workflow would also suit our Cloud-Native installation, as Cloud-Native now handles rolling updates very well.

Edited May 31, 2022 by 🤖 GitLab Bot 🤖