Implement gRPC ingress support
For historical reasons, kas only accepts WebSocket traffic from the agents. Inside that WebSocket connection we hide the real gRPC traffic: agentk/agentw and kas actually speak gRPC, but wrapped in WebSocket. This wrapping is pure overhead; it gives us no benefits.
For GitLab Runner Job Router (gitlab-org&19607) we'd need Runner and then later admission controllers (GitLab Runner Admissions Controller (gitlab-org&10811)) to talk gRPC to kas too. We really want to avoid the overhead (RAM/CPU/latency) of (un)wrapping tens/hundreds of thousands of connections.
Hence, I propose we enable gRPC ingress for kas on all GitLab platforms: SaaS, Dedicated, and Self-managed (Helm chart and Omnibus). GDK too.
This needs to be rigorously tested, but I believe it will not require any breaking changes for users. Agents that connect using WebSocket today will keep working (no user action required). Users who want better performance would be able to switch to gRPC by adjusting the URL scheme in the agent command line (chart parameter), i.e. `wss://kas.example.com` -> `grpcs://kas.example.com`. We could later release an agent update that tries gRPC first even with a `ws`/`wss` URL.
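Concretely, the switch might look like this. The flag and chart value names below are from memory and should be verified against the agent's CLI and Helm chart documentation; the hostname is illustrative:

```shell
# Today: gRPC wrapped in WebSocket
agentk --kas-address=wss://kas.example.com

# After enabling gRPC ingress: plain gRPC over TLS - only the scheme changes
agentk --kas-address=grpcs://kas.example.com

# For chart-managed agents (value name assumed, verify against the chart):
helm upgrade gitlab-agent gitlab/gitlab-agent \
  --set config.kasAddress=grpcs://kas.example.com
```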
Option 1 - status quo
Keep wrapping gRPC traffic in WebSocket.
Pros:
- No infra changes.
- Faster to implement.
Cons:
- Keep paying the RAM/CPU/latency cost of (un)wrapping all traffic. The more connections we have, the higher the cost, and hence the more kas instances we'll need. This is hard to estimate, but perhaps 1-5% overhead for CPU and RAM usage? Not sure about latency, but it cannot be unaffected either.
- At the moment it's only agentk (and agentw), but in the future it will be Runner, admission controllers, etc. - more systems to migrate if we don't do it now, before those systems start relying on kas.
- (Un)wrapping adds operational complexity and reduces reliability: more things can go wrong, e.g. more places where a timeout may occur.
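To put a rough number on the wire-level part of that overhead: each WebSocket data frame adds a small header on top of the gRPC message it carries (RFC 6455), and client-to-server frames must additionally be masked, which forces an extra pass over every payload byte. A minimal sketch of the per-frame byte cost (header sizes are from the RFC; in practice the CPU cost is dominated by the masking/copying pass, not the header bytes):

```go
package main

import "fmt"

// wsFrameHeaderSize returns the WebSocket frame header size in bytes for a
// frame carrying a payload of n bytes (per RFC 6455 framing rules).
func wsFrameHeaderSize(n int, masked bool) int {
	size := 2 // base header: FIN/opcode byte + mask bit/payload-length byte
	switch {
	case n > 65535:
		size += 8 // 64-bit extended payload length
	case n > 125:
		size += 2 // 16-bit extended payload length
	}
	if masked {
		size += 4 // masking key (client-to-server frames must be masked)
	}
	return size
}

func main() {
	// Example: a 512-byte gRPC message sent by the agent in one frame
	// costs an extra 8 bytes of framing (2 base + 2 extended length + 4 mask).
	fmt.Println(wsFrameHeaderSize(512, true))
}
```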
✅ Option 2 - use gRPC directly
Adjust nginx/HAProxy/load balancers/etc. to ensure gRPC traffic passes through unaffected where necessary. kas already supports both traffic types on the same port, so no work is needed on the kas side.
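As an illustration of the proxy-side adjustment, an nginx in front of kas could pass gRPC through natively rather than treating it as opaque TCP. The hostnames, ports, and paths below are assumptions for the sketch, not our actual config:

```nginx
# Illustrative only: terminate TLS and forward gRPC to kas natively.
upstream kas {
    server 127.0.0.1:8150;  # assumed kas listen address
}

server {
    listen 443 ssl http2;   # gRPC requires HTTP/2
    server_name kas.example.com;

    ssl_certificate     /etc/nginx/tls/kas.crt;
    ssl_certificate_key /etc/nginx/tls/kas.key;

    location / {
        grpc_pass grpc://kas;
    }
}
```

Existing WebSocket agents would still need the usual `proxy_pass` + `Upgrade` header handling on the same endpoint, since kas accepts both traffic types on one port.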
Pros:
- No WebSocket (un)wrapping overhead.
- Easier to explain (to users and engineers) and to understand what is going on: no "unwrapping" to reason about.
- We could switch Cloudflare to gRPC mode (it's in TCP mode at the moment) and get better visibility and protection. There have been issues with native mode though, see Fix Cloudflare grpc tunnel timeout configuration (gitlab-org/gitlab#509586 - closed).
- Because the proxies (nginx/HAProxy/load balancers) will see plain gRPC traffic, they will be able to produce proper access logs, which makes it easier to troubleshoot issues.
Cons:
- Some work required, potentially delaying delivery of the Job Router and admission control functionality.
  Mitigation: in practice, adjusting things to support gRPC directly can happen in parallel with the design and implementation of all the needed changes in kas, Rails, and Runner. The only dependency is at the very end: gRPC traffic has to work before we launch the Job Router functionality. If we dedicate an engineer to this ASAP, there may be no delay at all - it may be ready before the feature itself is.