
[Meta] Support kubectl exec/attach/cp/port-forward

Release notes

Until now, users of the agent for Kubernetes had to work around the CI/CD workflow's lack of support for kubectl exec/attach/cp/port-forward calls. GitLab now supports these calls on top of the SPDY protocol. If your load balancer or reverse proxy supports SPDY, you can use kubectl exec/attach/cp/port-forward with CI/CD workflows. Both the GitLab Charts and Omnibus use NGINX and are configured to support SPDY out of the box.

Unfortunately, we already know that at least some cloud providers do not support SPDY. We are working with the Kubernetes community to ship WebSocket support in Kubernetes, which will be the solution for many cloud-hosted GitLab instances, including GitLab SaaS.

Current state

  • agentk can connect to kas via WebSockets or via gRPC directly. WebSockets can be used to wrap the actual gRPC traffic to make it possible to use HTTP load balancers and/or reverse proxies that cannot proxy gRPC or HTTP/2.
  • On GitLab.com, agentk connects to kas via WebSockets (wss://kas.gitlab.com). We did it this way because it was faster to ship (fewer unknowns) and because there was no mechanism for managing (rotating?) TLS certs (or secrets?) in the GitLab.com deployment to Kubernetes. I don't remember the exact details now, but it was something like that.
  • Chart always uses WebSockets.
  • Omnibus still doesn't use a separate domain for kas; kas traffic is accepted on the same domain, just on a separate URL path. That is, it uses WebSockets too.
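
The transport choice described in the first two bullets can be sketched as scheme-based selection. The helper below is illustrative only (it is not agentk's actual code): wss:// means the gRPC stream is wrapped in WebSocket frames so plain HTTP load balancers can proxy it, while grpcs:// means direct gRPC over HTTP/2.

```go
package main

import (
	"fmt"
	"net/url"
)

// transportFor sketches how a client like agentk could pick a transport from
// the configured kas URL scheme. Hypothetical helper, not agentk's real code.
func transportFor(kasURL string) (string, error) {
	u, err := url.Parse(kasURL)
	if err != nil {
		return "", err
	}
	switch u.Scheme {
	case "ws", "wss":
		// Wrap the gRPC byte stream in WebSocket frames so that HTTP/1.1-only
		// load balancers and reverse proxies can pass it through.
		return "websocket-wrapped gRPC", nil
	case "grpc", "grpcs":
		// Talk gRPC (HTTP/2) directly; requires an LB that can pass HTTP/2.
		return "direct gRPC over HTTP/2", nil
	default:
		return "", fmt.Errorf("unsupported scheme %q", u.Scheme)
	}
}

func main() {
	t, _ := transportFor("wss://kas.gitlab.com")
	fmt.Println(t) // websocket-wrapped gRPC
}
```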

Issues with current state

  • The CI tunnel does not support the exec/attach/cp/port-forward kubectl commands because they use SPDY (discovered in gitlab-org/cluster-integration/gitlab-agent#186 (closed)). SPDY uses the HTTP Upgrade mechanism to upgrade the connection from HTTP/1.1. The GKE HTTP load balancer that we use supports WebSocket connection upgrades only, not the long-deprecated SPDY, i.e. the current load balancer cannot pass SPDY traffic. It seems that we need a TCP load balancer.
  • Encapsulating gRPC traffic in a WebSocket connection adds extra moving parts on both the client and the server side. Ideally, long term we'd use gRPC directly. This is not urgent, but since we are looking at load balancing anyway (to unblock the issue above), it's worth keeping in mind too.

Options to address the above issues

Options A and B presented below require a TCP load balancer because an HTTP GKE load balancer does not support SPDY upgrades. What would also work is a TLS+TCP load balancer (i.e. a load balancer that terminates TLS and passes cleartext TCP to the backend), but I don't see that in the docs. Using a TCP load balancer means terminating TLS at the backend (kas), which means we cannot use Google-managed TLS certs which is unfortunate.

Another thing to keep in mind is that we cannot ask all agent users to change the kas URL their agents use by a certain date (e.g. when %15.0 is released). Even as a breaking change, it's too much. We should provide a way for users to migrate to gRPC at their own pace. We could set a removal date (e.g. 16.0), but before deprecating WebSockets we need to ensure Omnibus and the Chart are compatible with gRPC-only mode. Because of this, the options below support WebSockets and gRPC simultaneously without breaking anyone.

Docs for GKE load balancing:

All options require finishing connection upgrade support in kas.

Option A - same domain for gRPC and HTTP traffic

A single domain, such as kas.gitlab.com for GitLab.com deployment, that accepts all traffic:

  • wss://kas.gitlab.com/ for compatibility with existing agents (i.e. WebSocket traffic).
  • grpcs://kas.gitlab.com for direct gRPC traffic from agents.
  • https://kas.gitlab.com/some-path/k8s-proxy/ for Kubernetes traffic proxying for CI tunnel and other future features. SPDY is handled here.

A TCP load balancer is required to pass gRPC, WebSocket (for the duration of the migration), and SPDY traffic.

Work required

  • users: switch to a new URL before the old one stops working in e.g. a year. No rush, no breakage.
  • kas: add a new listen port and implement traffic sniffing on it: accept the TCP connection, unwrap TLS, and look at the first bytes. HTTP/2 uses a fixed client preface whose first line (PRI * HTTP/2.0) parses like an HTTP/1.1 request line. kas could read it and, depending on what it got, pass the accepted connection to either the gRPC server or the HTTP server. The HTTP server could then route the request based on the URL path (WebSocket agent traffic vs Kubernetes API reverse proxy).
  • Chart: support TCP load balancing and TLS termination in kas.
  • Omnibus: support TCP load balancing and TLS termination in kas.
  • GitLab.com infra:
    • Provision a new load balancer working in TCP mode. Once the new kas is deployed, test that everything works, then switch the DNS record to point at the new load balancer. The old load balancer will keep working because kas will accept traffic on both the old and new ports simultaneously, so there is no disruption to clients and everything can be safely rolled back.
    • Remove the old load balancer once everything is rolled out.

Pros

  • Conceptually simple from the user's point of view. A single domain for all Kubernetes/kas things.
  • Single domain is easier to manage than two. One cert vs two, etc. I can imagine this is a big deal for our self-managed users.
  • Migration to gRPC from WebSockets without any breakage.

Cons

  • Traffic sniffing can be considered a hack 😄 I don't think it's a big deal in this case because we are essentially just determining whether the connection is HTTP/2 or HTTP/1.1.
  • Not using managed TLS certs.
  • Extra complexity in Chart and Omnibus to handle TCP load balancing and certs.

Unknowns

  • How hard is it to get certs working in Chart and Omnibus?
  • How hard is it to get TCP load balancer working in our GitLab.com instance?
  • ?

Option B - separate domains for gRPC and HTTP traffic

Two domains - one for gRPC traffic and another one for HTTP/WebSocket.

  • wss://kas.gitlab.com/ for compatibility with existing agents (i.e. WebSocket traffic).
  • grpcs://kas-grpc.gitlab.com for direct gRPC traffic from agents.
  • https://kas.gitlab.com/some-path/k8s-proxy/ for Kubernetes traffic proxying for CI tunnel and other future features. SPDY is handled here.

We cannot use separate ports on the same domain because Kubernetes Ingress doesn't support that.

Work required

  • users: switch to a new URL before the old one stops working in e.g. a year. No rush, no breakage.
  • kas: Need to support handling WebSocket vs Kubernetes API proxying requests based on a URL path on a single port (currently two ports).
  • Chart: support TCP load balancing and TLS termination in kas.
  • Omnibus: support TCP load balancing and TLS termination in kas.
  • GitLab.com infra:
    • Provision a new load balancer working in TCP mode. Once the new kas is deployed, test that everything works, then switch the DNS record to point at the new load balancer. The old load balancer will keep working, pointed at the same backend port; there is no disruption to clients and everything can be safely rolled back.
    • Provision a new load balancer working in "HTTP/2 to the backend" mode. This one is for gRPC; point it at the gRPC port.
    • Remove the old load balancer once everything is rolled out.

Pros

  • Migration to gRPC from WebSockets without any breakage.

Cons

  • Not using managed TLS certs.
  • Extra complexity in Chart and Omnibus to handle TCP load balancing and certs.

Unknowns

  • How hard is it to get certs working in Chart and Omnibus?
  • How hard is it to get TCP load balancer working in our GitLab.com instance?
  • ?

Option C

Implement WebSocket support in kubectl for commands that rely on SPDY today.

Work required

  • Users: upgrade to a kubectl release that includes the change, once it ships. Other tools that use client-go need to pick up the change too to benefit (I'm not sure what those tools are; these bits of code are probably not used by other tools). No other changes needed; things just start to work as they should. Migration to gRPC can be done later by using an additional domain with a load balancer in "HTTP/2 to the backend" mode.
  • Kubernetes: There is a draft PR that implements WebSocket support, and it even got reviewed. To implement this option we'd need to pick it up, make it work well, pass reviews, and get it merged. @ash2k looked at the code and it needs work. The approach used in the PR would enable the attach, exec, and cp commands, as they use the same underlying piece of code. port-forward is a separate story: the PR does nothing for it, and it will likely be more complicated.
  • kas: route agent vs Kubernetes API proxy traffic based on URL path.

Pros

  • Fixes the issue once and for all.
  • No need for load balancer changes to resolve this issue.

Cons

  • Getting such a change merged can take quite some time.
  • A single HTTP-mode load balancer does not support gRPC and WebSockets simultaneously: when the load balancer is in "HTTP/2 to the backend" mode, it cannot accept WebSocket connections (and vice versa). The docs say HTTP/2 is not supported in the WebSockets section. So it's not clear how to migrate to gRPC in this scenario: to keep WebSockets working, the load balancer must stay in "HTTP/1 to the backend" mode.

Option D (B + C)

Implement WebSocket support in kubectl and use two domains - one for WebSockets and Kubernetes proxying and another one for gRPC agent connections.

Proposal

Go with option C and then maybe extend it into option D. More details below in #346248 (comment 932802961).

Edited by Mikhail Mazurskiy