Agent Platform locking and heartbeat mechanism for reliable shutdown and preventing concurrent execution
Problem
In our original architecture plan at https://handbook.gitlab.com/handbook/engineering/architecture/design-documents/duo_workflow/ we mentioned the need for locking a workflow when it is already running:
To avoid multiple Workflow service instances running on the same workflow, the Workflow service must always acquire a lock with GitLab before it starts running. When it suspends, it will release the lock and similarly there will be a timeout state if it has not checkpointed in the last 60 seconds. GitLab will not accept checkpoints from a timed out run of the Workflow service
Since our VS Code UI can already resume workflows when they get stuck, and since we are likely to want to improve automatic reconnection/resume, we will be increasing the likelihood of edge cases where a workflow is running concurrently in 2 places. Both places will be checkpointing and overwriting each other's checkpoints, likely leading to a confusing conversation history.
This locking mechanism, combined with the heartbeat, solves another problem: clients have no reliable way to know whether a workflow has failed. Right now we regularly have disconnects somewhere in our stack (for many possible reasons), and whenever this happens it leaves the workflow state stale. Our Duo Workflow Service always attempts to update the status after it fails, but these status updates almost always fail because the status update has to be sent over the same gRPC connection that was just dropped. This means that any clients watching the status in Rails won't see the workflow marked as failed. We have a cleanup process that cleans this up after 30 minutes, but this is too long: users will be sitting there looking at a stuck workflow for 30 minutes. There are also other issues with this mechanism being discussed at gitlab-org/modelops/applied-ml/code-suggestions/ai-assist#1438 (comment 2790577429).
Solution
- When a workflow starts, the Duo Workflow Service should acquire a lock with GitLab Rails. This can be included in the initial request to fetch the workflow data or can be a subsequent HTTP request like POST /ai/duo_workflows/workflows/<WORKFLOW_ID>/lock.
- GitLab Rails will return a lock ID (random number generated and stored in Redis or Postgres).
- All future checkpoint saves to GitLab Rails must include the lock ID. GitLab Rails will reject any checkpoint whose lock ID does not match the current lock ID for the workflow.
- Duo Workflow Service will send a "heartbeat" request PUT /ai/duo_workflows/workflows/<WORKFLOW_ID>/lock/<LOCK_ID> to GitLab Rails every 15s. To be reliable, this heartbeat mechanism will likely require concurrent HTTP requests to Rails, and that will likely depend on solving the problems described at gitlab-org/modelops/applied-ml/code-suggestions/ai-assist!3268 (comment 2768139527) and gitlab-org/modelops/applied-ml/code-suggestions/ai-assist!3268 (comment 2784642810). Beyond the Duo Workflow Service side of concurrent requests, we'd likely need to solve a similar problem in workhorse, because the messages are consumed by a single goroutine, which means it can't actually make parallel HTTP requests at the moment. Solving that will be a net win for workflow performance anyway: checkpointing (inside Langgraph) already runs concurrently with the workflow, so our current mechanism of only allowing a single message in the outbox potentially blocks workflow progression while a checkpoint is in flight.
- GitLab Rails will immediately invalidate the lock and mark the workflow as failed if no heartbeat is received for 30s.
- All clients (gitlab-lsp and agentic web UI) need to subscribe (either through a GraphQL subscription or polling) to status updates on any workflow they are viewing. As soon as a workflow is marked as failed, they should display this to the user.
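The Rails side of the steps above can be sketched as a small in-memory model. This is only an illustration of the proposed rules, not the actual implementation: the `LockRegistry` class, its method names, and the `created`/`running`/`failed` status strings are hypothetical, the real storage would be Redis or Postgres, and the clock is injected so the 30s timeout can be exercised without sleeping.

```python
import secrets
from dataclasses import dataclass, field
from typing import Optional

HEARTBEAT_TIMEOUT_S = 30.0  # from the proposal: fail after 30s without a heartbeat


@dataclass
class WorkflowState:
    status: str = "created"
    lock_id: Optional[str] = None
    last_heartbeat: float = 0.0
    checkpoints: list = field(default_factory=list)


class LockRegistry:
    """Hypothetical in-memory stand-in for the lock state kept by GitLab Rails."""

    def __init__(self, clock):
        self._clock = clock  # callable returning the current time in seconds
        self._workflows = {}

    def _get(self, workflow_id):
        return self._workflows.setdefault(workflow_id, WorkflowState())

    def _expire_if_stale(self, wf):
        # Invalidate the lock and fail the workflow once heartbeats stop.
        if wf.lock_id is not None and self._clock() - wf.last_heartbeat >= HEARTBEAT_TIMEOUT_S:
            wf.lock_id = None
            wf.status = "failed"

    def status(self, workflow_id):
        wf = self._get(workflow_id)
        self._expire_if_stale(wf)
        return wf.status

    def acquire(self, workflow_id):
        """Return a new lock ID, or None if another run still holds the lock."""
        wf = self._get(workflow_id)
        self._expire_if_stale(wf)
        if wf.lock_id is not None:
            return None
        wf.lock_id = secrets.token_hex(16)
        wf.last_heartbeat = self._clock()
        wf.status = "running"
        return wf.lock_id

    def heartbeat(self, workflow_id, lock_id):
        """Refresh the lock; False means the caller no longer holds it."""
        wf = self._get(workflow_id)
        self._expire_if_stale(wf)
        if wf.lock_id is None or wf.lock_id != lock_id:
            return False
        wf.last_heartbeat = self._clock()
        return True

    def save_checkpoint(self, workflow_id, lock_id, checkpoint):
        """Accept a checkpoint only from the current lock holder."""
        wf = self._get(workflow_id)
        self._expire_if_stale(wf)
        if wf.lock_id is None or wf.lock_id != lock_id:
            return False
        wf.checkpoints.append(checkpoint)
        return True


# Walk through the scenario with a fake clock.
now = [0.0]
registry = LockRegistry(clock=lambda: now[0])
lock_a = registry.acquire(123)                              # first run gets the lock
second_attempt = registry.acquire(123)                      # concurrent run is refused
now[0] = 15.0
beat_ok = registry.heartbeat(123, lock_a)                   # heartbeat at 15s keeps it alive
now[0] = 50.0                                               # 35s of silence
late_save = registry.save_checkpoint(123, lock_a, "late")   # rejected, lock expired
status_after_timeout = registry.status(123)
```

In this model expiry is checked lazily on each call. A real implementation would also need something that marks the workflow as failed when nobody calls in, such as a scheduled job or a key expiry; that choice is left open here.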
Technical Details
We should consider the performance implications of acquiring this lock. Note the efforts in gitlab-org/modelops/applied-ml/code-suggestions/ai-assist#1169 (closed). We should consider having an API which acquires the lock and returns the workflow in the same request, so that locking does not add an extra round trip. It could also return a lock ID which is later used to unlock the workflow when it is complete.
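As a rough sketch of that combined API, the two functions below acquire the lock and return the workflow in one call, then release the lock on completion. Everything here is an assumption made for illustration: the function names, the dictionary-based storage, the `goal` field, and the response shape are hypothetical and not taken from the existing Rails API.

```python
import secrets


def start_workflow(workflows, workflow_id):
    """Fetch the workflow and acquire its lock in one step.

    Returns None when the workflow is already locked by another run.
    """
    wf = workflows[workflow_id]
    if wf.get("lock_id") is not None:
        return None
    wf["lock_id"] = secrets.token_hex(16)
    # One response carries both the workflow data and the lock ID.
    return {
        "workflow": {"id": workflow_id, "goal": wf["goal"]},
        "lock_id": wf["lock_id"],
    }


def release_lock(workflows, workflow_id, lock_id):
    """Unlock the workflow, but only for the holder of the current lock ID."""
    wf = workflows[workflow_id]
    if lock_id is None or wf.get("lock_id") != lock_id:
        return False
    wf["lock_id"] = None
    return True


# Hypothetical stored state for one workflow.
workflows = {123: {"goal": "fix the failing pipeline", "lock_id": None}}
response = start_workflow(workflows, 123)
duplicate = start_workflow(workflows, 123)
released = release_lock(workflows, 123, response["lock_id"])
```

Requiring the lock ID on release means a timed-out run cannot unlock a workflow that a newer run has since taken over.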
What about just using workhorse to update the status?
There have been some discussions about the fact that our new architecture of requests flowing through workhorse gives us a slightly more reliable way of updating statuses. Some work already went into that in !203460 (closed). This mechanism is likely better than what we have today, but it's still flawed. The problem is that workhorse itself can disconnect from Duo Workflow Service, or could be shut down or crash for any reason. When this happens we're still left with a stuck workflow that Rails does not know about, and hence clients cannot know about it either. We also expect workhorse and Duo Workflow Service connections to be dropped when either of these components is deployed. While the deployer may give them some time for graceful shutdown, it won't wait forever, and workflows are intended to be long-running processes, so we should expect to see a reasonable number of flows interrupted by deploys.
The idea described in this issue reflects some distributed systems constraints we have:
- Duo Workflow Service is the one doing the work. Only if it is running is the workflow progressing.
- GitLab Rails is the only authoritative state for the workflow. The only way it can remain authoritative is to prevent state transitions that don't match its current understanding.
- Duo Workflow Service cannot answer questions from Rails like "Hey are you still running workflow 123" because any attempt to contact Duo Workflow Service will reach a randomly load balanced instance that has no way of knowing if other instances are running workflow 123.
- The clients and workhorse are unreliable participants that are needed to facilitate communication between these systems. Duo Workflow Service cannot communicate with GitLab Rails without them.
- The above 2 statements mean that there is never a reliable communication channel between Duo Workflow Service and GitLab Rails, and therefore we need something like a periodic health check to guarantee that a workflow will eventually reach a terminal state in the database.
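On the Duo Workflow Service side, the periodic health check could be a background task that runs alongside the workflow. The sketch below uses asyncio and assumes a hypothetical `send_heartbeat` coroutine that wraps the PUT request and returns False when Rails rejects the lock ID. Stopping the workflow once the lock is lost is also an assumption of this sketch, and transport errors are not handled.

```python
import asyncio

HEARTBEAT_INTERVAL_S = 15.0  # from the proposal: one heartbeat every 15s


async def heartbeat_loop(send_heartbeat, interval_s):
    """Send heartbeats until Rails reports that the lock is no longer ours."""
    while True:
        await asyncio.sleep(interval_s)
        if not await send_heartbeat():
            return


async def _cancel(task):
    # Cancel a task and wait for it to finish unwinding.
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass


async def run_with_heartbeat(workflow, send_heartbeat, interval_s=HEARTBEAT_INTERVAL_S):
    """Run workflow() while heartbeating; abort it if the lock is lost."""
    beat = asyncio.create_task(heartbeat_loop(send_heartbeat, interval_s))
    work = asyncio.create_task(workflow())
    done, _ = await asyncio.wait({beat, work}, return_when=asyncio.FIRST_COMPLETED)
    if work in done:
        await _cancel(beat)
        return work.result()
    # The heartbeat loop ended first, so this run no longer holds the lock.
    await _cancel(work)
    raise RuntimeError("lock lost, workflow aborted")


# Demo with fakes and short intervals instead of real HTTP calls.
beats = []


async def fake_send_heartbeat():
    beats.append("beat")
    return True


async def fake_workflow():
    await asyncio.sleep(0.2)
    return "finished"


result = asyncio.run(run_with_heartbeat(fake_workflow, fake_send_heartbeat, interval_s=0.02))
```

Because the heartbeat runs as its own task, a slow checkpoint request does not delay it, which is the reason the proposal calls for concurrent requests. With a 15s interval and a 30s timeout, a single delayed or dropped heartbeat does not fail the workflow.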