Delay workhorse shutdown so that rails can cleanly terminate websocket connections
Summary
On .com we are seeing spikes of 503
status codes for the Websockets service when pods are scaled. From an initial investigation, it looks like our shutdown sequence is not optimal for websockets, given that Workhorse reverse-proxies them at the TCP level.
When a pod is terminated, both the Workhorse container and the Puma container are sent a TERM signal. For Workhorse, we were previously not propagating this signal, but we recently made a fix so that it receives it and initiates a shutdown. For normal web connections this is fine, but for websockets, which are proxied at the TCP level, it is likely that Workhorse will terminate the connection ungracefully instead of letting Rails close it.
The easiest solution I can see would be to add a configurable preStop
hook option for the Workhorse container with a sleep, so that we can give Rails some time to gracefully close the connections.
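A minimal sketch of what such a preStop hook could look like on the Workhorse container spec (the sleep duration and the chart values key that would expose it are assumptions, not an existing chart option):

```yaml
# Hypothetical: delay delivery of TERM to Workhorse so Rails has a
# window to close websocket connections first. Kubernetes runs the
# preStop handler to completion before sending TERM to the container.
lifecycle:
  preStop:
    exec:
      command: ["/bin/sh", "-c", "sleep 15"]
```

Note that the pod's `terminationGracePeriodSeconds` would need to exceed the sleep duration plus Workhorse's own shutdown time, otherwise the kubelet will SIGKILL the container before it finishes.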
See the relevant thread in gitlab-org/gitlab#363096 (comment 980570509)