
Duo Agent Platform stateful long-running connections should be less stateful and should reconnect

Problem

We're seeing a lot of issues with our long-running connections that lead to broken/stuck workflows; see gitlab-org/modelops/applied-ml/code-suggestions/ai-assist#961 (comment 2722899242).

We recently switched to sending requests via Workhorse in gitlab-org/editor-extensions/gitlab-lsp!2152 (merged), which likely increases our risk of disconnects because it chains multiple stateful connections.

Solution

See the idea at !193149 (comment 2543346462).

In summary:

  1. The Workhorse -> DWS connection is maintained even if the client drops, so the flow can keep running for a period of time.
  2. The Client -> Workhorse connection automatically reconnects when it is dropped.
  3. Redis is used as an intermediate distribution mechanism for Workhorse -> Client communication, because a client that reconnects might end up connected to a different Workhorse server than the one currently connected to DWS on its behalf.
  4. The Workhorse -> DWS connection can automatically reconnect as well, if the client is still connected.
  5. We still need to detect clients that have been disconnected for too long and shut down their Workhorse -> DWS connections, as those flows will be held up waiting for user interaction anyway.
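
To make points 3 and 5 concrete, here is a minimal sketch, not the proposed implementation. It assumes events are kept per workflow with increasing IDs so that any Workhorse node can replay what a reconnecting client missed, and that a grace period decides when to give up on an absent client. `EventLog` is an in-memory stand-in for whatever Redis structure ends up being used. `DisconnectTracker`, `grace_seconds`, and all method names are hypothetical.

```python
from collections import defaultdict


class EventLog:
    """In-memory stand-in for the Redis-backed channel (hypothetical)."""

    def __init__(self):
        self._events = defaultdict(list)  # workflow_id -> list of payloads

    def append(self, workflow_id, payload):
        """Called by the Workhorse node that holds the DWS connection."""
        self._events[workflow_id].append(payload)
        return len(self._events[workflow_id])  # 1-based event ID

    def read_after(self, workflow_id, last_seen_id):
        """Called by whichever Workhorse node the client reconnected to."""
        events = self._events[workflow_id]
        return [(i + 1, p) for i, p in enumerate(events) if i + 1 > last_seen_id]


class DisconnectTracker:
    """Decides when to shut down Workhorse -> DWS for an absent client."""

    def __init__(self, grace_seconds):
        self.grace_seconds = grace_seconds  # assumed tunable, not from the issue
        self._disconnected_at = {}  # workflow_id -> time of first disconnect

    def client_disconnected(self, workflow_id, now):
        # Keep the earliest disconnect time if we are notified more than once.
        self._disconnected_at.setdefault(workflow_id, now)

    def client_reconnected(self, workflow_id):
        self._disconnected_at.pop(workflow_id, None)

    def should_shut_down(self, workflow_id, now):
        since = self._disconnected_at.get(workflow_id)
        return since is not None and now - since >= self.grace_seconds
```

In this sketch a client that reconnects sends the last event ID it saw, and the node it lands on calls `read_after` to catch it up, regardless of which node is talking to DWS.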

In short, this idea makes both of these connections reconnect automatically, and it makes the Client -> Workhorse part approximately stateless, so a reconnect gets you back to where you were before.
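
One way the client side could behave is sketched below, under stated assumptions: the client remembers the ID of the last event it received and passes it on every reconnect, so the server side needs no per-connection state. The function `stream_with_reconnect`, the `connect(last_seen_id)` callback, and the backoff parameters are all hypothetical and are not taken from the LSP or Workhorse code.

```python
import time


def stream_with_reconnect(connect, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Yield event payloads, reconnecting with exponential backoff on drops.

    `connect(last_seen_id)` is an assumed callback that opens a connection
    and returns an iterable of (event_id, payload) pairs newer than
    `last_seen_id`. It may raise ConnectionError at any point.
    """
    last_seen_id = 0
    attempt = 0
    while True:
        try:
            for event_id, payload in connect(last_seen_id):
                if event_id <= last_seen_id:
                    continue  # ignore anything replayed twice
                last_seen_id = event_id
                attempt = 0  # progress resets the backoff
                yield payload
            return  # the server closed the stream cleanly
        except ConnectionError:
            attempt += 1
            if attempt > max_attempts:
                raise  # give up and surface the error to the caller
            sleep(base_delay * 2 ** (attempt - 1))
```

Because the resume position travels with the request, it does not matter which Workhorse server answers the reconnect, which is what "approximately stateless" means in this sketch.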

This work depends on #548686, because as soon as we start reconnecting rapidly we'll likely see duplicate runs of the same workflow.
