What is the workflow of workhorse and rails when asked to shut down for Action Cable workloads?

Problem Statement

The EOCs have been paged each time we've attempted to move forward with enabling more features with Action Cable, otherwise known as our websockets workload or deployment. See the following incidents:

Note that we've ruled out memory saturation as the root cause.

What we've noticed is that ONLY the loadbalancer SLI for Errors is triggering:

image

No SLIs are being breached for puma or workhorse. Source

During a deep-dive investigation, we found that workhorse terminates almost immediately when the Pod is asked to quit.

image

Source

But the rails container will continue to run until we hit our configured blackoutSeconds of 240 seconds.

image

Source

With workhorse being the proxy for all traffic between the client and Action Cable, it doesn't make sense to let the webservice container wait any amount of time before quitting. Instead, when workhorse exits (and it appears to both exit cleanly and close any ongoing connections), the webservice container should exit at nearly the same time. Doing so should reduce the number of errors seen: instead of reporting HTTP 503s from a container for which no workhorse is running, we'd have no Pod left serving that healthcheck at all, which reduces the error ratio since the Pod won't be receiving any new client traffic anyway.
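One way to see those 503s in isolation is to probe the Puma container directly while it sits in the blackout window. A minimal sketch, assuming the webservice container listens on Puma's default port 8080 and serves the usual /-/readiness healthcheck; the namespace and Pod name are placeholders, not values from this investigation:

# Reach the webservice (Puma) container of a terminating Pod whose workhorse
# has already exited (namespace, Pod name, and port are assumptions).
kubectl -n gitlab port-forward pod/gitlab-webservice-default-<hash> 8080:8080 &

# While Puma waits out the blackout window, we expect its readiness check to
# return 503 even though the process is otherwise healthy.
curl -s -o /dev/null -w '%{http_code}\n' http://localhost:8080/-/readiness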

Testing

On staging, we used curl to open a mock connection to websockets:

% curl 'https://staging.gitlab.com/-/cable' \
  -H 'Pragma: no-cache' \
  -H 'Origin: https://staging.gitlab.com' \
  -H 'Accept-Language: en-US,en;q=0.9,ja-JP;q=0.8,ja;q=0.7' \
  -H 'Sec-WebSocket-Key: byql8fEEpD90j5ZKS7CPLg==' \
  -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36' \
  -H 'Upgrade: websocket' \
  -H 'Sec-WebSocket-Extensions: permessage-deflate; client_max_window_bits' \
  -H 'Sec-GPC: 1' \
  -H 'Cache-Control: no-cache' \
  -H 'Connection: Upgrade' \
  -H 'Sec-WebSocket-Version: 13' \
  -H 'Sec-WebSocket-Protocol: actioncable-v1-json, actioncable-unsupported' \
  --compressed \
  --http1.1 \
  -v

This worked very well: curl was occasionally outputting ping messages, and we knew we were connected to a Pod in one of our clusters. This test coincides with the screenshots of logs above. We then asked Kubernetes to terminate both Pods that were running in this cluster (we did not know which Pod we were connected to), which begins the standard shutdown procedure of a Pod (a rough sketch of the kubectl commands follows the timeline below). For websockets:

gantt
  title websockets
  dateFormat  mm
  section Section
    SIGTERM gitlab-webservice  :gw, active, 10s
    SIGTERM workhorse          :w, active, 10s
    workhorse exits            :we, after w, 10s
    gitlab-webservice waits    :gww, after gw, 4m
    gitlab-webservice exits    :gwe, after gww, 10s
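The termination itself was triggered with plain kubectl, roughly as follows; the namespace, label selector, and Pod names are illustrative assumptions rather than the exact values used:

# List the websockets Pods running in the cluster under test.
kubectl -n gitlab get pods -l app=webservice

# Ask Kubernetes to terminate both Pods, which kicks off the standard shutdown
# sequence: SIGTERM to every container, then the termination grace period.
kubectl -n gitlab delete pod \
  gitlab-webservice-default-<hash-1> \
  gitlab-webservice-default-<hash-2>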

Our curl exited almost immediately and cleanly. The two Pods went into a Terminating state, with only the gitlab-webservice container still reporting ready. The workhorse containers exited. During the time the workhorse container is gone, Puma is idling inside the blackoutSeconds window:

2022-05-20-SWLCvJcKM3.flamegraph.svg
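The per-container picture described above can be confirmed straight from the Pod status; a sketch, with the namespace and Pod name again assumed:

# Expect workhorse to show a terminated state while gitlab-webservice is still
# running out the blackout window.
kubectl -n gitlab get pod gitlab-webservice-default-<hash> \
  -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.state}{"\n"}{end}'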

For giggles, here's a baseline for Puma when it's idle but hasn't been asked to shut down just yet:

2022-05-20-4KVdeS1GQm.flamegraph.svg

Proposal

I'd like to propose that we remove our existing blackoutSeconds of 240 seconds entirely. But prior to doing so, we'd like some assistance from Engineering to understand whether the above behavior is expected and safe for our clients.

I do not currently believe blackoutSeconds is useful, because workhorse is our proxy. Since our testing showed the Action Cable connection terminating immediately, Puma lying around waiting for 240 seconds is effectively wasted resources. And since Puma returns HTTP 503s during this period while handling no traffic at all, the error ratio for it is effectively 100%.
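If Engineering confirms the behavior above is expected, the change itself should be small. A sketch, assuming the deployment consumes the charts' gitlab.webservice.shutdown.blackoutSeconds value; the release name and chart reference are placeholders, not our actual deployment tooling:

# Drop the 240-second blackout (or set it to 0) so Puma exits alongside
# workhorse instead of idling for four minutes while returning 503s.
helm upgrade gitlab gitlab/gitlab \
  --reuse-values \
  --set gitlab.webservice.shutdown.blackoutSeconds=0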

References