Delay workhorse shutdown so that rails can cleanly terminate websocket connections
Summary
On .com we are seeing spikes of 503
status codes for the Websockets service when pods are scaled. From an initial investigation, it looks like our shutdown sequence is not optimal for websockets, given that Workhorse reverse-proxies them at the TCP level.
When a pod is terminated, both the Workhorse container and the Puma container are sent a TERM signal. For Workhorse, we were previously not propagating this signal, but we recently made a fix so that it receives it and initiates a shutdown. For normal web connections this is fine, but for websockets, which are proxied at the TCP level, it is likely that Workhorse will terminate the connection ungracefully instead of letting Rails close it.
The easiest solution I can see would be to add a configurable preStop
hook option for the Workhorse container with a sleep, so that we can give Rails some time to gracefully close the connections.
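A minimal sketch of what such a preStop hook could look like on the Workhorse container spec (the sleep duration and the chart values key that would expose it are assumptions, not an existing chart option):

```yaml
# Hypothetical: delay delivery of TERM to Workhorse so Rails has a
# window to close websocket connections first. Kubernetes runs the
# preStop handler to completion before sending TERM to the container.
lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "sleep 15"]
```

Note that the pod's `terminationGracePeriodSeconds` would need to exceed the sleep duration plus Workhorse's own shutdown time, otherwise the kubelet will SIGKILL the container before it finishes.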
See the relevant thread in gitlab-org/gitlab#363096 (comment 980570509)