What is the workflow of workhorse and rails when being asked to shut down for Action Cable workloads?
Problem Statement
The EOCs have been paged each time we've attempted to move forward with enabling more features with Action Cable, otherwise known as our websockets workload or deployment. See the following incidents:
Note that we've ruled out memory saturation as the root cause.
What we've noticed is that ONLY the load balancer SLI for Errors is triggering; no SLIs are being breached for puma or workhorse. Source
During a deep-dive investigation, it turns out that workhorse terminates almost immediately when the Pod is asked to quit, while the rails container continues to run until we hit our configured blackoutSeconds of 240 seconds.
With workhorse being the proxy for all traffic between the client and Action Cable, it doesn't make sense to let the webservice container wait any amount of time before quitting. Instead, when workhorse exits, which appears to both exit cleanly and close any ongoing connections, the webservice container should exit at nearly the same time. Doing so should reduce the errors seen: instead of a container with no running workhorse answering health checks with HTTP 503s, the Pod would be gone entirely, which reduces the error ratio since it won't be receiving any new client traffic anyway.
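To make the blackout behavior concrete, here is a minimal Ruby sketch of what a blackout period does. This is an illustration only, not the actual GitLab implementation in `lib/gitlab/cluster/lifecycle_events.rb`; the class and method names are assumptions for the sketch. The idea: once the shutdown signal arrives, the process simply waits out the configured number of seconds before letting Puma stop, serving nothing useful in the meantime.

```ruby
# Illustration only: a simplified stand-in for the blackoutSeconds behaviour.
# The real implementation lives in lib/gitlab/cluster/lifecycle_events.rb;
# names here are hypothetical.
class BlackoutSketch
  def initialize(blackout_seconds)
    @blackout_seconds = blackout_seconds
  end

  # Called from the shutdown hook: keep the process alive (while health
  # checks elsewhere return 503) until the blackout period has elapsed.
  def do_blackout
    return if @blackout_seconds <= 0

    sleep(@blackout_seconds)
  end
end
```

With `blackoutSeconds: 240`, this is 240 seconds of a process that holds no client connections once workhorse has already closed them.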
Testing
Using staging, we created a mock connection using curl to open a connection to websockets:
% curl 'https://staging.gitlab.com/-/cable' \
-H 'Pragma: no-cache' \
-H 'Origin: https://staging.gitlab.com' \
-H 'Accept-Language: en-US,en;q=0.9,ja-JP;q=0.8,ja;q=0.7' \
-H 'Sec-WebSocket-Key: byql8fEEpD90j5ZKS7CPLg==' \
-H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/100.0.4896.60 Safari/537.36' \
-H 'Upgrade: websocket' \
-H 'Sec-WebSocket-Extensions: permessage-deflate; client_max_window_bits' \
-H 'Sec-GPC: 1' \
-H 'Cache-Control: no-cache' \
-H 'Connection: Upgrade' \
-H 'Sec-WebSocket-Version: 13' \
-H 'Sec-WebSocket-Protocol: actioncable-v1-json, actioncable-unsupported' \
--compressed \
--http1.1 \
-v
And this worked very well: curl was occasionally outputting ping messages, and we knew we were connected to a Pod in one of our clusters. This test coincides with the screenshots of logs above. We then asked Kubernetes to terminate both Pods running in this cluster (we did not know which Pod we were connected to), which begins the standard shutdown procedure of a Pod. For websockets:
gantt
title websockets
dateFormat mm
section Section
SIGTERM gitlab-webservice :gw, active, 10s
SIGTERM workhorse :w, active, 10s
workhorse exits :we, after w, 10s
gitlab-webservice waits :gww, after gw, 4m
gitlab-webservice exits :gwe, after gww, 10s
Our curl exited nearly immediately and cleanly. The two Pods went into a Terminating state, with only the gitlab-webservice container still being marked ready. The workhorse containers exited. During the time the workhorse container is gone, Puma is idling, still inside its blackoutSeconds:
Proposal
I'd like to propose that we remove our existing blackoutSeconds of 240 seconds entirely. But prior to doing so, we'd like some assistance from Engineering to understand whether the above behavior is expected and safe for our clients.
I do not currently suspect that blackoutSeconds is useful at the moment, because workhorse is our proxy. Since local testing showed an immediate termination of the Action Cable connection, Puma waiting around for 240 seconds is effectively wasted resources. And since during this period Puma is returning HTTP 503s and not handling any traffic at all, the error ratio is effectively 100%.
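Concretely, the proposal amounts to a change along these lines in the .com values file. This is a sketch only: the exact key path is an assumption based on the webservice chart, not a verified excerpt from the linked values.yaml.gotmpl.

```yaml
# Hypothetical sketch of the proposed change; the key path is an
# assumption, not a verified excerpt from the values file.
gitlab:
  webservice:
    shutdown:
      blackoutSeconds: 0  # was 240; drop the idle wait after SIGTERM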
References
- blackoutSeconds helm chart config for .com: https://gitlab.com/gitlab-com/gl-infra/k8s-workloads/gitlab-com/-/blob/6225b20c06dd54ed4b03f3ed7ce0aa4acfd25e29/releases/gitlab/values/values.yaml.gotmpl#L650
- Implementation of blackoutSeconds: https://gitlab.com/gitlab-org/gitlab/-/blob/6c62a1c0239530221856458f176171d881663180/lib/gitlab/cluster/lifecycle_events.rb#L122