Duo Agent Platform: long-running connections should be less stateful and should reconnect automatically
Problem
We're seeing many issues with our long-running connections that lead to broken or stuck workflows; see gitlab-org/modelops/applied-ml/code-suggestions/ai-assist#961 (comment 2722899242).
We recently switched to sending requests via Workhorse in gitlab-org/editor-extensions/gitlab-lsp!2152 (merged), which likely increases our risk of disconnects because it chains multiple stateful connections (Client -> Workhorse -> DWS).
Solution
See the idea at !193149 (comment 2543346462).
In summary:
- The Workhorse -> DWS connection is maintained even if the client drops, so the flow can keep running for a period of time.
- The Client -> Workhorse connection automatically reconnects when it is dropped.
- Redis is used as an intermediate distribution mechanism for Workhorse -> Client communication, because a client that reconnects might end up connected to a different Workhorse server than the one currently connected to DWS on its behalf.
- The Workhorse -> DWS connection can automatically reconnect as well, if the client is still connected.
- We still need to detect clients that have been disconnected for too long and shut down their Workhorse -> DWS connection, since those flows will be held up waiting for user interaction anyway.
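The replay and timeout behaviour in the list above can be sketched as follows. This is a minimal in-memory illustration, not the proposed implementation: `FlowSession`, its method names, the event ID scheme, and the 30-second limit are all hypothetical, and the `events` list merely stands in for whatever Redis structure would actually be used.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class FlowSession:
    """Hypothetical per-workflow state; `events` stands in for data kept in Redis."""
    max_disconnect_seconds: float
    events: list = field(default_factory=list)  # list of (event_id, payload)
    disconnected_at: Optional[float] = None

    def publish(self, payload):
        # Workhorse -> Client messages are appended with increasing IDs.
        event_id = len(self.events) + 1
        self.events.append((event_id, payload))
        return event_id

    def read_after(self, last_seen_id):
        # A reconnecting client asks for everything after the last ID it saw.
        return [event for event in self.events if event[0] > last_seen_id]

    def client_disconnected(self, now):
        # Remember only the start of the current disconnect.
        if self.disconnected_at is None:
            self.disconnected_at = now

    def client_reconnected(self):
        self.disconnected_at = None

    def should_shut_down_upstream(self, now):
        # True once the client has been away longer than the allowed window.
        return (
            self.disconnected_at is not None
            and now - self.disconnected_at > self.max_disconnect_seconds
        )


session = FlowSession(max_disconnect_seconds=30.0)
session.publish("step 1")
session.publish("step 2")               # client has seen events 1 and 2
session.client_disconnected(now=100.0)
session.publish("step 3")               # flow keeps running while client is away
still_ok = session.should_shut_down_upstream(now=120.0)
missed = session.read_after(last_seen_id=2)
session.client_reconnected()
```

Because the client only needs to present the last event ID it saw, any Workhorse server with access to the shared store could serve the reconnect.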
In short, this idea makes both connections reconnect automatically and makes the Client -> Workhorse part approximately stateless, so a reconnect gets you back to where you were before.
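A client-side reconnect loop for this might look roughly like the sketch below. Exponential backoff is an assumption here (the issue does not specify a retry policy), and `reconnect`, `backoff_delays`, `last_seen_id`, and the `connect` callable are hypothetical names used only for illustration.

```python
import time


def backoff_delays(base=0.5, cap=30.0, factor=2.0):
    """Yield exponentially growing delays, capped at `cap` seconds."""
    delay = base
    while True:
        yield min(delay, cap)
        delay *= factor


def reconnect(connect, last_seen_id, max_attempts=5, sleep=time.sleep):
    """Call connect(last_seen_id) until it succeeds or attempts run out."""
    delays = backoff_delays()
    for attempt in range(1, max_attempts + 1):
        try:
            # Passing the cursor is what lets the server replay missed events.
            return connect(last_seen_id)
        except ConnectionError:
            if attempt == max_attempts:
                raise
            sleep(next(delays))


# Usage with a fake connection that fails twice before succeeding.
attempts = []
slept = []


def flaky_connect(last_seen_id):
    attempts.append(last_seen_id)
    if len(attempts) < 3:
        raise ConnectionError("dropped")
    return f"resumed after event {last_seen_id}"


result = reconnect(flaky_connect, last_seen_id=2, sleep=slept.append)
```

In practice some random jitter would usually be added to the delays so that many clients dropped at the same moment do not all reconnect in lockstep.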
This depends on #548686 because, as soon as we start reconnecting rapidly, we'll likely see duplicate runs of the same workflow.
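To illustrate the duplicate-run risk, here is one common guard: starting a workflow at most once per workflow ID. Whether #548686 takes this approach is not stated here, so treat this purely as an assumption; `WorkflowRegistry` and `start_once` are hypothetical names, and the in-process dictionary would need to be replaced by an atomic operation in a shared store (for example Redis `SET` with the `NX` option) to work across multiple Workhorse servers.

```python
class WorkflowRegistry:
    """Hypothetical guard that starts each workflow ID at most once."""

    def __init__(self):
        self._running = {}

    def start_once(self, workflow_id, start):
        # Return (run, started); `started` is False when a run already exists.
        if workflow_id in self._running:
            return self._running[workflow_id], False
        run = start(workflow_id)
        self._running[workflow_id] = run
        return run, True


registry = WorkflowRegistry()
starts = []


def start_flow(workflow_id):
    starts.append(workflow_id)
    return f"run-for-{workflow_id}"


first, first_started = registry.start_once("wf-1", start_flow)
# A rapid reconnect issues the same start request again.
second, second_started = registry.start_once("wf-1", start_flow)
```

With this guard, the second request attaches to the existing run instead of launching a new one.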