Lock workflow in Redis when running workflow in workhorse
What does this MR do and why?
Today, our Duo Agent Platform features open a long-running websocket connection to workhorse. Workhorse proxies websocket messages over gRPC to the Duo Workflow Service and intercepts some of these messages.
When a new flow is triggered, we always open a new workhorse websocket connection. The client then sends a StartWorkflowRequest message over that connection, which triggers the flow to start in Duo Workflow Service. This message also includes the workflow ID, which this MR relies on. This is the first time workhorse actually uses that ID.
As described in #548686 and #580581, users can currently trigger the same workflow so that it runs twice concurrently. When this happens, behaviour is undefined and likely bad for users. You can see in the recorded videos on this MR that the undefined behaviour usually results in one of the chat histories disappearing.
This MR implements a simple locking mechanism using the existing Redis
distributed locking code we have in workhorse. When a
StartWorkflowRequest message is received in an open websocket
connection we attempt to acquire a lock for that workflowID in Redis. If
we cannot then we fail immediately. Otherwise we acquire the lock and
only release it once both the websocket and gRPC connections are closed.
This new behaviour is behind a feature flag called duo_workflow_lock_concurrent_flows, which we can hopefully remove quickly. I wanted a feature flag because I'm worried there may be cases of orphaned locks, and if that happens often enough it could be much worse than the problem I'm trying to fix.
References
Screenshots or screen recordings
| Before | After |
|---|---|
| concurrency-without-lock | concurrency-with-lock-firefox |
How to set up and validate locally
- Set up Duo Agent Platform locally in GDK https://gitlab.com/gitlab-org/gitlab-development-kit/-/blob/main/doc/howto/duo_agent_platform.md
- Recompile workhorse with: cd workhorse && make && gdk restart workhorse
- Enable the feature flag duo_workflow_lock_concurrent_flows
- Start a new agentic chat
- Open a new tab and then open agentic chat in the 2nd tab. It should show the same chat you just messaged in the first tab
- Send another message in the first tab
- Send a message in the 2nd tab (to the same chat) before the first tab finishes
- The 2nd tab should just fail and show you nothing. This is desired as opposed to having 2 separate chats running concurrently and interfering with each other
- In Firefox you can see that the websocket is closed with status code 1013 (Try Again Later). Chrome doesn't seem to show that for websockets.
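
For readers unfamiliar with the 1013 status seen in the last step: per RFC 6455, a websocket close frame carries a 2-byte big-endian status code followed by an optional UTF-8 reason. The sketch below only illustrates that wire format. The helper names and the reason string are hypothetical, and this is not how workhorse necessarily builds the frame.

```go
package main

import "encoding/binary"

// closeTryAgainLater is the registered websocket close code 1013.
const closeTryAgainLater uint16 = 1013

// closePayload builds the body of a close frame: a 2-byte big-endian
// status code followed by the reason text. Control frame payloads are
// limited to 125 bytes, so the reason should stay under 124 bytes.
func closePayload(code uint16, reason string) []byte {
	buf := make([]byte, 2+len(reason))
	binary.BigEndian.PutUint16(buf, code)
	copy(buf[2:], reason)
	return buf
}

// parseClosePayload is the inverse of closePayload. ok is false when the
// payload is too short to contain a status code.
func parseClosePayload(p []byte) (code uint16, reason string, ok bool) {
	if len(p) < 2 {
		return 0, "", false
	}
	return binary.BigEndian.Uint16(p), string(p[2:]), true
}
```

A client that wants to react to the lock failure could parse the close payload this way and treat 1013 as "this workflow is busy, retry later".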
MR acceptance checklist
Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.
Related to #580581