Implement gRPC ingress support
For historical reasons, kas only accepts WebSocket traffic from the agents. Inside that WebSocket connection we hide the real gRPC traffic: agentk/agentw and kas actually speak gRPC, but wrapped in WebSocket. This wrapping is pure overhead; it gives us no benefits.
For GitLab Runner Job Router (gitlab-org&19607) we'd need Runner and then later admission controllers (GitLab Runner Admissions Controller (gitlab-org&10811)) to talk gRPC to kas too. We really want to avoid the overhead (RAM/CPU/latency) of (un)wrapping tens/hundreds of thousands of connections.
Hence, I propose we enable gRPC ingress for kas on all GitLab platforms: SaaS, Dedicated, and Self-managed (Helm chart and Omnibus). GDK too.
This needs to be rigorously tested, but I believe it will not require any breaking changes for users. Agents that connect using WebSocket today will keep working (no user action required). Users who want better performance would be able to switch to gRPC by adjusting the URL scheme in the agent command line (chart parameter), i.e. `wss://kas.example.com` -> `grpcs://kas.example.com`. We could later release an agent update that tries gRPC first even with a `ws`/`wss` URL.
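Concretely, the switch might look like this. The flag and chart value names below are from memory and should be verified against the agent's CLI and Helm chart documentation; the hostname is illustrative:

```shell
# Today: gRPC wrapped in WebSocket
agentk --kas-address=wss://kas.example.com

# After enabling gRPC ingress: plain gRPC over TLS - only the scheme changes
agentk --kas-address=grpcs://kas.example.com

# For chart-managed agents (value name assumed, verify against the chart):
helm upgrade gitlab-agent gitlab/gitlab-agent \
  --set config.kasAddress=grpcs://kas.example.com
```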
Option 1 - status quo
Keep wrapping gRPC traffic in WebSocket.
Pros:
- No infra changes.
- Faster to implement.
Cons:
- Keep paying the RAM/CPU/latency cost of (un)wrapping all traffic. The more connections we have, the higher the cost, and hence the more kas instances we'll need. This is hard to estimate, but perhaps 1-5% overhead for CPU and RAM usage? Not sure about latency, but it cannot be unaffected either.
- At the moment it's only agentk (and agentw), but in the future it will be Runner, admission controllers, etc. - more systems to migrate if we don't do it now, before those systems start relying on kas.
- (Un)wrapping adds operational complexity and reduces reliability: more things can go wrong, e.g. more places where a timeout may occur.
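To put a rough number on the wire-level part of that overhead: each WebSocket data frame adds a small header on top of the gRPC message it carries (RFC 6455), and client-to-server frames must additionally be masked, which forces an extra pass over every payload byte. A minimal sketch of the per-frame byte cost (header sizes are from the RFC; in practice the CPU cost is dominated by the masking/copying pass, not the header bytes):

```go
package main

import "fmt"

// wsFrameHeaderSize returns the WebSocket frame header size in bytes for a
// frame carrying a payload of n bytes (per RFC 6455 framing rules).
func wsFrameHeaderSize(n int, masked bool) int {
	size := 2 // base header: FIN/opcode byte + mask bit/payload-length byte
	switch {
	case n > 65535:
		size += 8 // 64-bit extended payload length
	case n > 125:
		size += 2 // 16-bit extended payload length
	}
	if masked {
		size += 4 // masking key (client-to-server frames must be masked)
	}
	return size
}

func main() {
	// Example: a 512-byte gRPC message sent by the agent in one frame
	// costs an extra 8 bytes of framing (2 base + 2 extended length + 4 mask).
	fmt.Println(wsFrameHeaderSize(512, true))
}
```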
✅ Option 2 - use gRPC directly
Adjust nginx/HAProxy/load balancers/etc. to ensure gRPC traffic passes through unaffected where necessary. kas already supports both traffic types on the same port, so no work is needed on the kas side.
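As an illustration of the proxy-side adjustment, an nginx in front of kas could pass gRPC through natively rather than treating it as opaque TCP. The hostnames, ports, and paths below are assumptions for the sketch, not our actual config:

```nginx
# Illustrative only: terminate TLS and forward gRPC to kas natively.
upstream kas {
    server 127.0.0.1:8150;  # assumed kas listen address
}

server {
    listen 443 ssl http2;   # gRPC requires HTTP/2
    server_name kas.example.com;

    ssl_certificate     /etc/nginx/tls/kas.crt;
    ssl_certificate_key /etc/nginx/tls/kas.key;

    location / {
        grpc_pass grpc://kas;
    }
}
```

Existing WebSocket agents would still need the usual `proxy_pass` + `Upgrade` header handling on the same endpoint, since kas accepts both traffic types on one port.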
Pros:
- No WebSocket (un)wrapping overhead.
- Easier to explain (to users and engineers) and to understand what is going on: no "unwrapping" to reason about.
- We could switch Cloudflare to gRPC mode (it's in TCP mode at the moment) and get better visibility and protection. There have been issues with native mode though, see Fix Cloudflare grpc tunnel timeout configuration (gitlab-org/gitlab#509586 - closed).
- Because the proxies (nginx/HAProxy/load balancers) will see plain gRPC traffic, they will be able to produce proper access logs, which makes it easier to troubleshoot issues.
Cons:
- Some work required, potentially delaying delivery of the Job Router and admission control functionality.
  Mitigation: in practice, adjusting things to support gRPC directly can happen in parallel with the design and implementation of all the needed changes in kas, Rails, and Runner. The only dependency is at the very end: gRPC traffic has to work before we launch the Job Router functionality. If we dedicate an engineer to this ASAP, there may be no delay at all - it may be ready before the feature itself is.