Investigate GOAWAY errors

Example gitlab-org/charts/gitlab#5988 (comment 2418487942):

{"time":"2025-03-26T07:14:08.990583846Z","level":"ERROR","msg":"Error handling a connection","mod_name":"agentk2kas_tunnel","agent_id":1114102,"error":"rpc error: code = Unavailable desc = closing transport due to:
 connection error: desc = \"error reading from server: failed to get reader: failed to read frame header: read tcp 10.112.2.17:39944->172.65.247.105:443: read: connection reset by peer\", received prior goaway: cod
e: NO_ERROR, debug data: \"max_age\""}
{"time":"2025-03-26T07:14:08.990760877Z","level":"ERROR","msg":"Error handling a connection","mod_name":"agentk2kas_tunnel","agent_id":1114102,"error":"rpc error: code = Unavailable desc = closing transport due to:
 connection error: desc = \"error reading from server: failed to get reader: failed to read frame header: read tcp 10.112.2.17:39944->172.65.247.105:443: read: connection reset by peer\", received prior goaway: cod
e: NO_ERROR, debug data: \"max_age\""}

If only we had a proper gRPC API to get the maximum connection age.

Update on investigation

Mikhail and I (Timo) discussed the options we have without a proper gRPC API to get the maximum connection age for an RPC, and eventually improved our maximum connection age approximation in Improve max connection age calculation for gRPC... (!3067 - merged). It's behind a feature flag that we are rolling out gradually. The feature flag is environment-variable based, because we can't use our other Rails-based feature flags for this.
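
For illustration, here is a rough sketch of what such a client-side approximation can look like (a hypothetical helper, not the actual agentk/kas code): the client assumes a server-side max connection age, subtracts a safety margin plus jitter, and proactively re-establishes the connection before the server would send its GOAWAY.

```go
package main

import (
	"context"
	"log"
	"math/rand"
	"time"
)

// connectionLoop is a hypothetical sketch: it re-establishes a long-lived
// connection slightly before an assumed server-side max connection age, so the
// client controls the cutover instead of being cut off by a server GOAWAY.
func connectionLoop(ctx context.Context, assumedServerMaxAge time.Duration, connect func(context.Context) error) {
	for ctx.Err() == nil {
		// Subtract a safety margin plus random jitter so that not every
		// client reconnects at the same instant.
		margin := time.Duration(float64(assumedServerMaxAge) * 0.1)
		jitter := time.Duration(rand.Int63n(int64(margin) + 1))
		budget := assumedServerMaxAge - margin - jitter

		connCtx, cancel := context.WithTimeout(ctx, budget)
		err := connect(connCtx) // runs until the budget expires or the connection fails
		cancel()
		if err != nil {
			log.Printf("connection ended: %v", err)
		}
	}
}

func main() {
	// Dummy connect function that just blocks until its context is done.
	connect := func(ctx context.Context) error {
		<-ctx.Done()
		return ctx.Err()
	}
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()
	connectionLoop(ctx, 2*time.Second, connect) // illustrative ages
}
```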

Because of the gap between when we cancel the connections based on our calculated age and when gRPC actually drains the connection with a GOAWAY, we see the clients bouncing back with connection retries to the same KAS replicas over the same connection (we are wrapped in WebSockets anyway), which makes the connection attempts spike. We can observe that in the metrics:

(Screenshot: metrics showing the spike in connection attempts)

We expect this metric not to spike (at least not as much) once the fix is in place.
