Incorrect internal server shutdown order
Problem in agentk
When agentk is handling tunnel connections and receives a termination signal (SIGTERM, etc.), it starts gracefully shutting down all servers. One of them is the internal gRPC server, which handles incoming requests from kas that arrive via the tunnel. If there is a long-running connection, such as a watch or a WebSocket/SPDY request for the Kubernetes proxy API, shutting everything down may take a while. During that time, new requests may fail because the internal listener may already have shut down. This happens because the internal server is shut down concurrently with the other servers, so there is a race between them.
A related problem is that modules are shut down after all the servers. One of the modules is the reverse tunnel module, so it keeps establishing tunnels even after the servers have started their graceful shutdown. kas may then still send requests over those tunnels and hit a server that is shutting down and no longer accepts connections because its listener has already been closed. This leads to errors:
^C
{"level":"error","time":"2022-05-13T10:00:30.629+1000","msg":"Error handling a connection","mod_name":"reverse_tunnel","error":"NewStream(): rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing listener closed, cannot dial\"","correlation_id":"01G2XBE5G0TK93XDHH736N46HH"}
{"level":"error","time":"2022-05-13T10:00:30.629+1000","msg":"Error handling a connection","mod_name":"reverse_tunnel","error":"NewStream(): rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing listener closed, cannot dial\"","correlation_id":"01G2XBE5G4QVJJF5965XQS1K9E"}
{"level":"error","time":"2022-05-13T10:00:30.656+1000","msg":"Error handling a connection","mod_name":"reverse_tunnel","error":"NewStream(): rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing listener closed, cannot dial\"","correlation_id":"01G2XBEEW8A2E2YW3YRD0PBEEY"}
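The failing interleaving can be replayed in miniature. The following is only a sketch under stated assumptions, not agentk code: `inMemListener`, `errListenerClosed`, and `forwardAfterShutdown` are hypothetical names, and the listener is a simplified stand-in for whatever the internal server actually listens on. It shows that once the listener is closed, a tunnel that is still alive can no longer dial it.

```go
package main

import (
	"errors"
	"sync"
)

// errListenerClosed mimics the "listener closed, cannot dial" text in the logs.
var errListenerClosed = errors.New("listener closed, cannot dial")

// inMemListener is a hypothetical in-memory listener for the internal server.
type inMemListener struct {
	mu     sync.Mutex
	closed bool
}

// Close marks the listener as closed, as a server shutdown would.
func (l *inMemListener) Close() {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.closed = true
}

// Dial is what a tunnel does to forward a request from kas.
func (l *inMemListener) Dial() error {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.closed {
		return errListenerClosed
	}
	return nil
}

// forwardAfterShutdown replays the bad interleaving: the internal server
// closes its listener first, then a still-open tunnel forwards a request.
func forwardAfterShutdown() error {
	l := &inMemListener{}
	if err := l.Dial(); err != nil { // before shutdown, dialing works
		return errors.New("unexpected failure before shutdown")
	}
	l.Close()       // the internal server shuts down
	return l.Dial() // the tunnel is still alive and tries to forward
}
```

Calling `forwardAfterShutdown()` returns the "listener closed" error, which is the situation the reverse tunnel module reports in the log lines above.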
Solution
Shutdown should proceed in the following order:
- Stop establishing reverse tunnel connections. This will prevent kas from starting new RPCs to the internal server.
- Wait until there are no established reverse tunnel connections, i.e. until the module's Run() returns (ensure it blocks until all connections quit). We need to wait here because a tunnel may still be established and may receive a request right after the internal server has shut down, which results in an error like the ones above.
- Start graceful shutdown of the internal API server. It should be near-instantaneous because, with no tunnels left, there should be no connections to it.
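The ordering above can be sketched in Go as follows. This is a minimal sketch, not agentk's actual code: `tunnelModule`, `internalServer`, `eventLog`, and `shutdownInOrder` are hypothetical names, and the fake server's `GracefulStop` only borrows the name of the grpc-go method. The point is that the server is stopped only after `Run()` has returned.

```go
package main

import (
	"context"
	"sync"
)

// eventLog records shutdown steps in the order they happen.
type eventLog struct {
	mu     sync.Mutex
	events []string
}

func (l *eventLog) add(e string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.events = append(l.events, e)
}

// tunnelModule is a hypothetical stand-in for the reverse tunnel module.
type tunnelModule struct {
	connections int
	log         *eventLog
}

// Run keeps the tunnel connections open until ctx is cancelled and
// returns only after every connection goroutine has quit.
func (m *tunnelModule) Run(ctx context.Context) {
	var wg sync.WaitGroup
	for i := 0; i < m.connections; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			<-ctx.Done() // a real connection would serve requests until told to stop
			m.log.add("tunnel closed")
		}()
	}
	wg.Wait()
}

// internalServer is a hypothetical stand-in for the internal gRPC server.
type internalServer struct{ log *eventLog }

func (s *internalServer) GracefulStop() { s.log.add("internal server stopped") }

// shutdownInOrder applies the three steps: stop the tunnels, wait for
// Run to return, then stop the internal server.
func shutdownInOrder(connections int) []string {
	log := &eventLog{}
	mod := &tunnelModule{connections: connections, log: log}
	srv := &internalServer{log: log}

	ctx, cancel := context.WithCancel(context.Background())
	done := make(chan struct{})
	go func() {
		defer close(done)
		mod.Run(ctx)
	}()

	cancel()           // step 1: stop establishing and serving tunnels
	<-done             // step 2: wait until Run() has returned
	srv.GracefulStop() // step 3: no tunnels remain, so this should be quick
	return log.events
}
```

With this structure, `shutdownInOrder(3)` records three tunnel closures before the server stop, regardless of how the connection goroutines are scheduled.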
Problem in kas
kas has an internal API server too. It is used for routing requests to agentk, i.e. kas talks to it as if it were talking to a correct agentk.
Solution
This internal server needs to shut down after all other servers have shut down.
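One possible shape for this, again as a hedged sketch rather than the real kas code: the other servers are stopped concurrently, and the internal server is stopped only once all of them have finished. The `stopper` interface, `recorder`, `namedServer`, `stopInternalLast`, and the server names are all hypothetical.

```go
package main

import "sync"

// stopper is a hypothetical interface for anything that can shut down gracefully.
type stopper interface{ GracefulStop() }

// recorder notes the order in which servers finish stopping.
type recorder struct {
	mu    sync.Mutex
	order []string
}

func (r *recorder) add(name string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.order = append(r.order, name)
}

// namedServer is a fake server that records when it is stopped.
type namedServer struct {
	name string
	rec  *recorder
}

func (s namedServer) GracefulStop() { s.rec.add(s.name) }

// stopInternalLast stops all other servers concurrently, waits for them
// to finish, and only then stops the internal server.
func stopInternalLast(internal stopper, others ...stopper) {
	var wg sync.WaitGroup
	for _, s := range others {
		wg.Add(1)
		go func(s stopper) {
			defer wg.Done()
			s.GracefulStop()
		}(s)
	}
	wg.Wait()
	internal.GracefulStop()
}

// demoStopOrder runs the shutdown with two placeholder servers and
// returns the recorded order.
func demoStopOrder() []string {
	rec := &recorder{}
	stopInternalLast(
		namedServer{name: "internal", rec: rec},
		namedServer{name: "server A", rec: rec},
		namedServer{name: "server B", rec: rec},
	)
	return rec.order
}
```

The order of "server A" and "server B" in the result is not deterministic, but "internal" is always the last entry.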