Reverse tunnel shutdown problem
A complex bug was introduced in kas->agentk communications support (!1547 - merged). It lay dormant and went unnoticed in testing because it does not cause any obvious problems on its own. The bug is only triggered by code that was enabled by default in Enable receptive agents by default (!1763 - merged). When that code reached production, it caused an incident: 2024-08-29: Redis CPU Saturated (gitlab-com/gl-infra/production#18469 - closed).
Bug description
The kas -> agentk reverse tunnel, which is used for receptive agents, also establishes connections to the agent API server to make the requests that come in via the reverse tunnel. Some of those RPCs are long-running (GetConfiguration, Connect): they sit open until there is something to do. There is a special context (ageCtx) to abort these RPCs while they are in the waiting stage. It is a soft abort that causes no disruption once an RPC starts doing actual work; it is more like a notification.
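A minimal sketch of that soft-abort pattern (names like handleLongPoll and the work channel are hypothetical; the real handlers are gRPC streaming RPCs, but the select on ageCtx is the same idea):

```go
package main

import (
	"context"
	"time"
)

// handleLongPoll stands in for a long-running handler such as GetConfiguration
// or Connect. While it is only waiting, a canceled ageCtx makes it return
// cleanly; once real work arrives, ageCtx is no longer consulted, so in-flight
// work is never disrupted.
func handleLongPoll(rpcCtx, ageCtx context.Context, work <-chan func() error) error {
	for {
		select {
		case <-rpcCtx.Done():
			// The RPC itself was canceled by the client or the connection died.
			return rpcCtx.Err()
		case <-ageCtx.Done():
			// Soft abort: we were only waiting, so returning here disrupts nothing.
			return nil
		case job := <-work:
			// Actual work: run it to completion regardless of ageCtx.
			if err := job(); err != nil {
				return err
			}
		}
	}
}

func main() {
	ageCtx, cancelAge := context.WithCancel(context.Background())
	work := make(chan func() error)

	go func() {
		// Simulate the receptive agents code shutting down a bit later.
		time.Sleep(100 * time.Millisecond)
		cancelAge()
	}()

	_ = handleLongPoll(context.Background(), ageCtx, work)
}
```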
When the receptive agents code needs to shut down, it needs to tear down its connections to the agent API server. The only clean way to do that without potentially disrupting an RPC that is being proxied is to use the agent API server's ageCtx to cleanly stop the server-side RPC handlers.
The agent API server is actually two servers. They have identical handlers registered, but one listens for network connections and the other for in-memory connections from the receptive agents code. They share ageCtx.
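As an illustration of the two-servers-one-set-of-handlers setup, here is a rough sketch that uses the standard gRPC health service as a stand-in for the real agent API handlers and bufconn for the in-memory listener; the actual kas wiring differs:

```go
package main

import (
	"context"
	"net"

	"google.golang.org/grpc"
	"google.golang.org/grpc/health"
	healthpb "google.golang.org/grpc/health/grpc_health_v1"
	"google.golang.org/grpc/test/bufconn"
)

// newAgentAPIServers builds two gRPC servers with identical handlers: one for
// network connections from normal agents and one for in-memory connections
// from the receptive agents code. The health service is only a stand-in for
// the real agent API handlers, which would also capture ageCtx.
func newAgentAPIServers(ageCtx context.Context) (netSrv, memSrv *grpc.Server, memLis *bufconn.Listener) {
	netSrv = grpc.NewServer()
	memSrv = grpc.NewServer()
	for _, s := range []*grpc.Server{netSrv, memSrv} {
		// Identical handlers on both servers; the real handlers would use
		// ageCtx to stop waiting RPCs.
		healthpb.RegisterHealthServer(s, health.NewServer())
	}
	_ = ageCtx // captured by the real handlers, unused by the stand-in
	return netSrv, memSrv, bufconn.Listen(1024 * 1024)
}

func main() {
	ageCtx, cancelAge := context.WithCancel(context.Background())
	defer cancelAge()

	netSrv, memSrv, memLis := newAgentAPIServers(ageCtx)

	netLis, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		panic(err)
	}

	go netSrv.Serve(netLis) // normal agent connections
	go memSrv.Serve(memLis) // in-memory connections from the receptive agents code

	// Real code would block here and run the shutdown sequence on exit.
}
```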
The receptive agents code starts and stops before the agent API server (and other servers), since it has a client that talks over the in-memory connection to that server. We stop the client before the server to avoid disrupting the RPCs that are being handled.
When the receptive agents code needs to stop, it cancels the agent API server's ageCtx. This affects not only the in-memory connections but also the normal network connections from agents. Normally this context is only canceled after the server has sent a GOAWAY frame and closed the listener, so that clients cannot reach it. But in this case (see the sketch after the list):
- ageCtx is canceled.
- The agent API server waits for the listener grace duration (5 seconds).
- Network and in-memory listeners are closed.
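A rough sketch of this shutdown sequence, with hypothetical names and the grace period shortened for the example; the normal, non-buggy order would be to stop the servers (GOAWAY sent, listeners closed) before canceling ageCtx:

```go
package main

import (
	"context"
	"time"

	"google.golang.org/grpc"
)

// shutdownAgentAPI models the sequence above.
func shutdownAgentAPI(cancelAge context.CancelFunc, netSrv, memSrv *grpc.Server, grace time.Duration) {
	// 1. ageCtx is canceled: waiting handlers on BOTH servers start returning
	//    empty responses.
	cancelAge()
	// 2. Listener grace duration (5 seconds in production): listeners are
	//    still open and no GOAWAY has been sent, so normal agents can keep
	//    connecting during this window.
	time.Sleep(grace)
	// 3. Only now are the listeners closed and GOAWAY frames sent.
	netSrv.GracefulStop()
	memSrv.GracefulStop()
}

func main() {
	_, cancelAge := context.WithCancel(context.Background())
	netSrv := grpc.NewServer()
	memSrv := grpc.NewServer()
	// The order below is the buggy one described above.
	shutdownAgentAPI(cancelAge, netSrv, memSrv, 100*time.Millisecond)
}
```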
During these 5 seconds, certain server APIs return an empty response (they want the client to go away to free up the connection). This is ok for the in-memory server, as its clients are all from the receptive agents code. But it is absolutely not ok for the network agent API server, as its clients are normal agents. They get an "Ok" reply and immediately reconnect. Normally they would not reuse the same TCP connection because of the GOAWAY frame they would have received, but in this case no GOAWAY had been sent. So the agents keep hammering kas, and kas keeps immediately replying with "Ok" without doing any work.
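A hypothetical model of the agent side of this loop (the real agentk client code differs, but the shape is the same):

```go
// Package agentloop models the agent behaviour described above.
package agentloop

import "context"

// ConfigClient stands in for the agent's kas client. GetConfiguration normally
// blocks (long poll) until there is new configuration to deliver.
type ConfigClient interface {
	GetConfiguration(ctx context.Context) (cfgReceived bool, err error)
}

// Run treats an empty "Ok" reply as "try again". During the 5-second window
// described above kas answers immediately and without a GOAWAY, so every retry
// reuses the same TCP connection and the loop spins, hitting kas (and its
// Redis-backed rate limiter) at full speed.
func Run(ctx context.Context, c ConfigClient) {
	for ctx.Err() == nil {
		cfgReceived, err := c.GetConfiguration(ctx)
		if err != nil || !cfgReceived {
			continue // immediate retry, as observed in the incident
		}
		// ... apply the received configuration ...
	}
}
```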
All requests from normal agents go through the rate limiter, which tracks the per-agent rate in Redis. This is why the load on Redis spiked, and that is what caused the incident.
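For illustration only, here is a deliberately simplified per-agent rate check in Redis. The INCR + EXPIRE scheme and the names below are assumptions, not the actual kas rate limiter; the point is merely that every agent request translates into Redis work, so a tight retry loop across many agents drives Redis CPU up.

```go
// Package ratesketch is a simplified per-agent rate check backed by Redis.
package ratesketch

import (
	"context"
	"fmt"
	"time"

	"github.com/redis/go-redis/v9"
)

// AllowRequest increments a per-agent, per-second counter and compares it
// against the limit. Every call does Redis work, which is why a storm of
// immediate retries from many agents saturates Redis.
func AllowRequest(ctx context.Context, rdb *redis.Client, agentID int64, limit int64) (bool, error) {
	key := fmt.Sprintf("agent_limit:%d:%d", agentID, time.Now().Unix())
	n, err := rdb.Incr(ctx, key).Result()
	if err != nil {
		return false, err
	}
	// Let the per-second bucket expire so stale keys do not accumulate.
	if err := rdb.Expire(ctx, key, 2*time.Second).Err(); err != nil {
		return false, err
	}
	return n <= limit, nil
}
```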