# [Meta] Support kubectl exec/attach/cp/port-forward

## Release notes
Until now, users of the agent for Kubernetes had to work around the CI/CD workflow's lack of support for `kubectl exec`/`attach`/`cp`/`port-forward` calls. GitLab now supports these calls on top of the SPDY protocol. If your load balancer or reverse proxy supports SPDY, you can use `kubectl exec`/`attach`/`cp`/`port-forward` with CI/CD workflows. Both the GitLab Charts and Omnibus use NGINX and are configured to support SPDY out of the box.

Unfortunately, we already know that at least some cloud providers do not support SPDY. We are working with the Kubernetes community to ship WebSocket support in Kubernetes, which will be the solution for many cloud-hosted GitLab instances, including GitLab SaaS.
## Current state
- agentk can connect to kas via WebSockets or via gRPC directly. WebSockets can be used to wrap the actual gRPC traffic, making it possible to use HTTP load balancers and/or reverse proxies that cannot proxy gRPC or HTTP/2.
- On GitLab.com, agentk connects to kas via WebSockets (`wss://kas.gitlab.com`). We did it this way because it was faster (fewer unknowns) and because there was no mechanism for managing (rotating?) TLS certs (or secrets?) in the GitLab.com deployment to Kubernetes. I don't remember exactly now - it was something like that.
- Chart always uses WebSockets.
- Omnibus still doesn't use a separate domain for kas. kas traffic is accepted on the same domain, just on a separate URL path, i.e. it uses WebSockets too.
## Issues with current state

- The CI tunnel does not support the `exec`/`attach`/`cp`/`port-forward` `kubectl` commands because they use SPDY (discovered in gitlab-org/cluster-integration/gitlab-agent#186 (closed)). SPDY uses the HTTP `Upgrade` mechanism to upgrade the connection from HTTP/1.1. The GKE HTTP load balancer that we use supports WebSocket connection upgrades only, not the long-deprecated SPDY, i.e. the current load balancer cannot pass SPDY traffic. It seems that we need a TCP load balancer.
- Encapsulating gRPC traffic in a WebSocket connection adds extra moving parts on both the client and the server side. Ideally, long term, we'd use gRPC directly. This is not urgent, but since we are looking at load balancing anyway (to unblock solving the issue above), it's worth keeping in mind too.
## Options to address the above issues

Options A and B presented below require a TCP load balancer, because an HTTP GKE load balancer does not support SPDY upgrades. What would also work is a TLS+TCP load balancer (i.e. a load balancer that terminates TLS and passes cleartext TCP to the backend), but I don't see that in the docs. Using a TCP load balancer means terminating TLS at the backend (kas), which means we cannot use Google-managed TLS certs, which is unfortunate.

Another thing to keep in mind is that we cannot ask all agent users to change the kas URL their agents use on a certain date (e.g. when %15.0 is released). Even as a breaking change, it's too much. We should provide a way for users to migrate to gRPC at their own pace. We could set a removal date (e.g. 16.0), but before deprecating WebSockets we need to ensure Omnibus and the Chart are compatible with gRPC-only mode. Because of this, the options below support WebSockets and gRPC simultaneously without breaking anyone.
Docs for GKE load balancing:
- GKE docs (use navigation on the left side): https://cloud.google.com/kubernetes-engine/docs/how-to/service-parameters
- GCP load balancing docs: https://cloud.google.com/load-balancing/docs
All options require finishing connection upgrade support in kas.
### Option A - same domain for gRPC and HTTP traffic

A single domain, such as `kas.gitlab.com` for the GitLab.com deployment, that accepts all traffic:
- `wss://kas.gitlab.com/` for compatibility with existing agents (i.e. WebSocket traffic).
- `grpcs://kas.gitlab.com` for direct gRPC traffic from agents.
- `https://kas.gitlab.com/some-path/k8s-proxy/` for Kubernetes traffic proxying for the CI tunnel and other future features. SPDY is handled here.

A TCP load balancer is required to pass gRPC, WebSocket (for the duration of the migration), and SPDY traffic.
#### Work required

- Users: switch to the new URL before the old one stops working (in, e.g., a year). No rush, no breakage.
- kas: add a new listen port and implement traffic sniffing on it - accept the TCP connection, unwrap TLS, and look at the first bytes. HTTP/2 connections start with a fixed client preface that looks like an HTTP/1.x request line. kas could read it and, depending on what it got, pass the accepted connection to either the gRPC server or the HTTP server. The HTTP server could then route the request based on the URL path (WebSocket agent traffic vs Kubernetes API reverse proxy).
- Chart: support TCP load balancing and TLS termination in kas.
- Omnibus: support TCP load balancing and TLS termination in kas.
- GitLab.com infra:
  - Provision a new load balancer working in TCP mode. Once the new kas is deployed, test that everything works, then switch the DNS record to point at the new load balancer. The old load balancer will keep working, as kas will accept traffic on both the old and new ports simultaneously - no disruption to clients, and everything can be safely rolled back.
  - Remove the old load balancer once everything is rolled out.
#### Pros

- Conceptually simple from the user's point of view. A single domain for all Kubernetes/kas things.
- A single domain is easier to manage than two - one cert vs two, etc. I can imagine this is a big deal for our self-managed users.
- Migration from WebSockets to gRPC without any breakage.
#### Cons

- Traffic sniffing can be considered a hack 😄 I don't think it's a big deal in this case, because we are essentially just trying to determine whether a connection is HTTP/2 or HTTP/1.1.
- Not using managed TLS certs.
- Extra complexity in Chart and Omnibus to handle TCP load balancing and certs.
#### Unknowns
- How hard is it to get certs working in Chart and Omnibus?
- How hard is it to get TCP load balancer working in our GitLab.com instance?
- ?
### Option B - separate domains for gRPC and HTTP traffic

Two domains - one for gRPC traffic and another for HTTP/WebSocket traffic:
- `wss://kas.gitlab.com/` for compatibility with existing agents (i.e. WebSocket traffic).
- `grpcs://kas-grpc.gitlab.com` for direct gRPC traffic from agents.
- `https://kas.gitlab.com/some-path/k8s-proxy/` for Kubernetes traffic proxying for the CI tunnel and other future features. SPDY is handled here.

We cannot use separate ports on the same domain because Kubernetes `Ingress` doesn't support that.
#### Work required

- Users: switch to the new URL before the old one stops working (in, e.g., a year). No rush, no breakage.
- kas: support handling WebSocket vs Kubernetes API proxy requests based on the URL path on a single port (currently two ports are used).
- Chart: support TCP load balancing and TLS termination in kas.
- Omnibus: support TCP load balancing and TLS termination in kas.
- GitLab.com infra:
  - Provision a new load balancer working in TCP mode. Once the new kas is deployed, test that everything works, then switch the DNS record to point at the new load balancer. The old load balancer will keep working, pointed at the same backend port - no disruption to clients, and everything can be safely rolled back.
  - Provision a new load balancer working in "HTTP/2 to the backend" mode. This one is for gRPC; point it at the gRPC port.
  - Remove the old load balancers once everything is rolled out.
#### Pros
- Migration to gRPC from WebSockets without any breakage.
#### Cons
- Not using managed TLS certs.
- Extra complexity in Chart and Omnibus to handle TCP load balancing and certs.
#### Unknowns
- How hard is it to get certs working in Chart and Omnibus?
- How hard is it to get TCP load balancer working in our GitLab.com instance?
- ?
### Option C

Implement WebSocket support in `kubectl` for the commands that rely on SPDY today.
#### Work required

- Users: upgrade to a `kubectl` version with the change, once released. Other tools that use client-go would need to pick up the change too to benefit (I'm not sure what those tools are; these bits of code are probably not used by other tools). No other changes needed - things just start to work as they should. Migration to gRPC can be done later by using an additional domain with a load balancer in "HTTP/2 to the backend" mode.
- Kubernetes: there is a draft PR that implements WebSocket support; it even got reviewed. To implement this option, we'd need to pick it up, make it work well, pass reviews, and get it merged. @ash2k looked at the code and it needs work. The approach used in the PR would enable the `attach`, `exec`, and `cp` commands, as they use the same underlying piece of code. `port-forward` is a separate story, and the PR does not do anything for it; it will likely be more complicated.
- kas: route agent traffic vs Kubernetes API proxy traffic based on the URL path.
#### Pros
- Fixes the issue once and for all.
- No need for load balancer changes to resolve this issue.
#### Cons

- Getting such a change merged can take quite some time.
- A single HTTP-mode load balancer does not support gRPC and WebSockets simultaneously: when the load balancer is in "HTTP/2 to the backend" mode, it cannot accept WebSocket connections, and vice versa - the docs say "HTTP/2 is not supported" in the WebSockets section. So it's not clear how to migrate to gRPC in this scenario, where the load balancer has to stay in "HTTP/1 to the backend" mode so that WebSockets keep working.
### Option D (B + C)

Implement WebSocket support in `kubectl` and use two domains - one for WebSockets and Kubernetes proxying, and another for gRPC agent connections.
## Proposal

Go with option C and then maybe extend it into option D. More details in #346248 (comment 932802961).

- Finish gitlab-org/cluster-integration/gitlab-agent!632 (merged) to be ready for WebSocket support in `kubectl`.
- Change `kubectl exec`/`attach`/`cp` to use the WebSocket protocol.
- Change `kubectl port-forward` to use the WebSocket protocol.