WebSocket nodes on Kubernetes
As a first step towards the goal of the Real-Time Working Group, shipping real-time assignee updates to self-hosted customers, we would like to maintain open WebSocket connections while users view issues on gitlab.com.
To accomplish this, it is advisable to proxy WebSocket requests to separate nodes, isolated from the current Web/API nodes. These nodes will use Action Cable initially. By building in observability from the start, we should be able to gauge how many simultaneous connections we will need to support and the resources required to do so.
Following a suggestion made in the WG, and since this is a brand new feature, we can take a Kubernetes-first approach, using the migration of Sidekiq as a precedent. This issue is intended to help us identify, split, and track the work.
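To make the proxying idea concrete, here is a minimal sketch of how a routing layer might tell WebSocket requests apart from ordinary Web/API traffic. It relies only on the handshake headers defined in RFC 6455 (`Upgrade: websocket` plus an `Upgrade` token in `Connection`). The backend labels `"websocket"` and `"web"` are hypothetical names for this illustration; the real decision would live in whatever proxy sits in front of the nodes, not in Python.

```python
def is_websocket_upgrade(headers):
    """Return True if the headers request a WebSocket upgrade (RFC 6455 handshake)."""
    # Header names are case-insensitive, so normalise them first.
    normalized = {name.lower(): value for name, value in headers.items()}
    upgrade_tokens = [t.strip().lower() for t in normalized.get("upgrade", "").split(",")]
    connection_tokens = [t.strip().lower() for t in normalized.get("connection", "").split(",")]
    return "websocket" in upgrade_tokens and "upgrade" in connection_tokens


def choose_backend(headers):
    """Pick a backend pool label. Both labels are hypothetical placeholders."""
    return "websocket" if is_websocket_upgrade(headers) else "web"
```

For example, `choose_backend({"Upgrade": "websocket", "Connection": "keep-alive, Upgrade"})` selects the WebSocket pool, while a plain page request falls through to the existing Web/API pool.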
Requirements of a K8s Deployment
- Access to existing or new Redis PubSub nodes.
- Access to the database.
- Monitoring: number of connections, resource usage.
- Log aggregation.
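As an illustration of the monitoring requirement, the sketch below tracks the number of currently open connections and renders it as a gauge in the Prometheus text exposition format. Both the choice of that format and the metric name `websocket_open_connections` are assumptions made for this example. The real service would use Action Cable's own instrumentation rather than this Python class.

```python
import threading


class ConnectionGauge:
    """Thread-safe count of currently open WebSocket connections."""

    def __init__(self, name="websocket_open_connections"):  # hypothetical metric name
        self.name = name
        self._value = 0
        self._lock = threading.Lock()

    def opened(self):
        """Record that a connection was established."""
        with self._lock:
            self._value += 1

    def closed(self):
        """Record that a connection was closed; never drops below zero."""
        with self._lock:
            self._value = max(0, self._value - 1)

    def value(self):
        with self._lock:
            return self._value

    def render(self):
        """Render the gauge in the Prometheus text exposition format."""
        return f"# TYPE {self.name} gauge\n{self.name} {self.value()}\n"
```

Sampling a gauge like this over time, alongside per-pod CPU and memory, is what would let us relate connection counts to resource usage.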
Tasks
- Write application logs to stdout and stderr, so that Kubernetes can handle logging;
- Containerize GitLab, including Action Cable and the real-time feature work;
- Configure a Kubernetes deployment for the container;
- Configure Helm charts for dependencies (see #749 (comment 318716984));
- Decide on using existing versus new Redis nodes for pub/sub;
- Implement service discovery for Redis pub/sub cluster;
- Implement service discovery for database;
- Configure Ingress for WebSocket K8s pods;
- Set up a staging deployment;
- Proxy WebSocket connections to K8s cluster;
- Enable feature flag for a small, internal project for testing on staging;
- When verified, set up a production deployment, and
- Enable feature flag for gitlab-org/gitlab.
Getting to this point would allow us to estimate the resources and number of pods required to serve gitlab.com, at least to an order of magnitude.
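The estimate itself is simple arithmetic once the two inputs have been measured. The sketch below assumes that a pod has a roughly fixed connection capacity and that we want some spare headroom. The function name, its parameters, and every number used with it are hypothetical placeholders, not measurements.

```python
import math


def pods_required(peak_connections, connections_per_pod, headroom=0.0):
    """Estimate pod count from peak connections and per-pod capacity.

    headroom is the fraction of spare capacity to add (0.25 means 25% extra).
    """
    if connections_per_pod <= 0:
        raise ValueError("connections_per_pod must be positive")
    if peak_connections < 0 or headroom < 0:
        raise ValueError("peak_connections and headroom must be non-negative")
    return math.ceil(peak_connections * (1 + headroom) / connections_per_pod)
```

With placeholder inputs of 50,000 peak connections, 5,000 connections per pod, and 25% headroom, `pods_required(50_000, 5_000, headroom=0.25)` gives 13. The real inputs would come from the staging and gitlab-org/gitlab rollouts described above.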
@marin I've tried to split the work out as best I understand it, using the Sidekiq work as a precedent, so that we can spread it across as many teams as possible, including the Plan team. I'll ask other members of the WG to review this issue and add suggestions, as I've surely missed some things. It could be promoted to an epic, with the tasks split out into separate issues, once we know what they should be.
I would really appreciate your input, and that of the Delivery team, on the work and how to proceed.