Workhorse: Close ongoing DAP connections on graceful shutdown
Problem
When GitLab Workhorse receives a graceful shutdown signal (SIGINT or SIGTERM), it calls srv.Shutdown(ctx) on the HTTP server (see workhorse/cmd/gitlab-workhorse/main.go). According to the net/http.Server.Shutdown documentation:
Shutdown gracefully shuts down the server without interrupting any active connections. Shutdown works by first closing all open listeners, then closing all idle connections, and then waiting indefinitely for connections to return to idle and then shut down.
If the provided context expires before the shutdown is complete, Shutdown returns the context's error, otherwise it returns any error returned from closing the Server's underlying Listener(s).
When Shutdown is called, Serve, ListenAndServe, and ListenAndServeTLS immediately return ErrServerClosed. Make sure the program doesn't exit and waits instead for Shutdown to return.
Shutdown does not attempt to close nor wait for hijacked connections such as WebSockets. The caller of Shutdown should separately notify such long-lived connections of shutdown and wait for them to close, if desired.
Duo Agent Platform (DAP) WebSocket connections are hijacked connections that are not managed by srv.Shutdown(). These connections remain active because:
- The WebSocket upgrade happens in workhorse/internal/ai_assist/duoworkflow/handler.go via upgrader.Upgrade(w, r, nil), which hijacks the underlying TCP connection
- The connection is managed by a runner that continuously reads/writes messages in goroutines (handleWebSocketMessages and handleAgentMessages in workhorse/internal/ai_assist/duoworkflow/runner.go)
- srv.Shutdown() does not close or wait for these hijacked WebSocket connections
As a result, Workhorse will complete its HTTP server shutdown immediately, but the DAP WebSocket connections will continue running until the process is forcefully terminated or the configured ShutdownTimeout expires.
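The behavior described above can be reproduced outside Workhorse with the standard library alone. The following is a minimal sketch, not Workhorse code: the function name hijackSurvivesShutdown and the raw "ping" payload are made up for illustration, and a plain http.Hijacker stands in for the WebSocket upgrade.

```go
package main

import (
	"context"
	"errors"
	"io"
	"net"
	"net/http"
	"time"
)

// hijackSurvivesShutdown (hypothetical helper) reports whether a hijacked
// connection is still usable after http.Server.Shutdown has returned.
func hijackSurvivesShutdown() (bool, error) {
	hijacked := make(chan net.Conn, 1)
	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		hj, ok := w.(http.Hijacker)
		if !ok {
			http.Error(w, "hijacking unsupported", http.StatusInternalServerError)
			return
		}
		conn, _, err := hj.Hijack()
		if err != nil {
			return
		}
		hijacked <- conn // from here on the server no longer tracks conn
	})

	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return false, err
	}
	srv := &http.Server{Handler: handler}
	go srv.Serve(ln) // returns http.ErrServerClosed once Shutdown is called

	client, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		return false, err
	}
	defer client.Close()
	if _, err := client.Write([]byte("GET /ws HTTP/1.1\r\nHost: example\r\n\r\n")); err != nil {
		return false, err
	}

	var serverConn net.Conn
	select {
	case serverConn = <-hijacked:
	case <-time.After(2 * time.Second):
		return false, errors.New("handler did not hijack the connection")
	}
	defer serverConn.Close()

	// Shutdown closes the listener and idle connections, then returns
	// without touching the hijacked connection.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		return false, err
	}

	// The hijacked connection still carries data after Shutdown returned.
	if _, err := serverConn.Write([]byte("ping")); err != nil {
		return false, nil
	}
	buf := make([]byte, 4)
	if err := client.SetReadDeadline(time.Now().Add(2 * time.Second)); err != nil {
		return false, err
	}
	if _, err := io.ReadFull(client, buf); err != nil {
		return false, nil
	}
	return string(buf) == "ping", nil
}
```

In this sketch Shutdown returns without error while the hijacked connection is still open, which is the gap the rest of this issue is about.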
Expected Behavior
During graceful shutdown, Workhorse should:
- Stop accepting new DAP WebSocket connections
- Separately notify and gracefully close existing DAP WebSocket connections by:
  - Sending a StopWorkflow request to the agent
  - Sending a WebSocket close frame to the client
  - Cleaning up associated resources (gRPC streams, MCP managers, etc.)
- Wait for all DAP connections to close before the process exits
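The per-connection close sequence listed above could look roughly like the sketch below. Everything here is an assumption made for illustration: the agentStream, clientSocket, and resource interfaces, the shutdownConnection function, and the recorder fake do not exist in Workhorse. Only the two timeout names and values are taken from this issue.

```go
package main

import (
	"context"
	"errors"
	"time"
)

// Hypothetical seams standing in for the real gRPC stream, WebSocket,
// and other per-connection resources.
type agentStream interface {
	SendStopWorkflow(ctx context.Context) error
}

type clientSocket interface {
	WriteCloseFrame(deadline time.Time) error
}

type resource interface {
	Close() error
}

// Timeout values as quoted in this issue.
const (
	wsStopWorkflowTimeout = 10 * time.Second
	wsCloseTimeout        = 5 * time.Second
)

// shutdownConnection runs the three steps in order. A failing step does not
// stop the later ones, so cleanup always gets a chance to run; all errors
// are joined (errors.Join needs Go 1.20 or newer).
func shutdownConnection(ctx context.Context, agent agentStream, ws clientSocket, resources ...resource) error {
	var errs []error

	// 1. Ask the agent to stop the workflow, bounded by its own timeout.
	stopCtx, cancel := context.WithTimeout(ctx, wsStopWorkflowTimeout)
	errs = append(errs, agent.SendStopWorkflow(stopCtx))
	cancel()

	// 2. Tell the client the connection is going away.
	errs = append(errs, ws.WriteCloseFrame(time.Now().Add(wsCloseTimeout)))

	// 3. Release everything else tied to the connection.
	for _, r := range resources {
		errs = append(errs, r.Close())
	}
	return errors.Join(errs...)
}

// recorder is a test fake that implements all three interfaces and
// remembers the order in which it was called.
type recorder struct{ steps []string }

func (r *recorder) SendStopWorkflow(ctx context.Context) error {
	r.steps = append(r.steps, "stop_workflow")
	return nil
}

func (r *recorder) WriteCloseFrame(deadline time.Time) error {
	r.steps = append(r.steps, "close_frame")
	return nil
}

func (r *recorder) Close() error {
	r.steps = append(r.steps, "cleanup")
	return nil
}

// demoShutdownOrder runs the sequence against the fake and returns the
// recorded step order.
func demoShutdownOrder() ([]string, error) {
	rec := &recorder{}
	err := shutdownConnection(context.Background(), rec, rec, rec)
	return rec.steps, err
}
```

Continuing after a failed step is a design choice of this sketch: if the agent is unreachable, the client should still receive a close frame and resources should still be released.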
Current Implementation Details
The DAP connection lifecycle is managed in workhorse/internal/ai_assist/duoworkflow/runner.go:
- runner.Execute() spawns two goroutines that run until an error occurs or the connection closes
- runner.Close() handles cleanup but is only called when the handler's defer executes (after Execute returns)
- The closeWebSocketConnection() method properly closes the WebSocket with a 5-second timeout
- There's already logic to send StopWorkflow requests when the client closes the connection
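Because cleanup only runs after Execute returns, a shutdown signal has to make Execute return first. The sketch below shows one way that could work, using a cancellable context as the shutdown signal. It is a simplified stand-in, not the actual runner: execute, handle, and demoExecuteStopsOnShutdown are hypothetical names, and the two loops only block on the context instead of reading real messages.

```go
package main

import (
	"context"
	"errors"
)

// execute is a simplified stand-in for the runner's Execute: it starts two
// loops and returns on the first error or when ctx is cancelled.
func execute(ctx context.Context, clientLoop, agentLoop func(context.Context) error) error {
	errCh := make(chan error, 2) // buffered so neither goroutine blocks on exit
	go func() { errCh <- clientLoop(ctx) }()
	go func() { errCh <- agentLoop(ctx) }()

	select {
	case err := <-errCh:
		return err
	case <-ctx.Done():
		return ctx.Err()
	}
}

// handle mirrors the handler shape described above: cleanup is deferred,
// so it only runs once execute has returned.
func handle(ctx context.Context, cleanup func()) error {
	defer cleanup()
	// Placeholder loop: a real loop would read and forward messages.
	waitForCancel := func(ctx context.Context) error {
		<-ctx.Done()
		return ctx.Err()
	}
	return execute(ctx, waitForCancel, waitForCancel)
}

// demoExecuteStopsOnShutdown cancels the context to simulate a shutdown
// signal and reports whether cleanup ran and whether the returned error
// was context.Canceled.
func demoExecuteStopsOnShutdown() (cleaned bool, cancelled bool) {
	ctx, cancel := context.WithCancel(context.Background())
	cancel() // simulate the shutdown signal
	err := handle(ctx, func() { cleaned = true })
	return cleaned, errors.Is(err, context.Canceled)
}
```

With this shape the existing deferred cleanup path is reused for shutdown, rather than adding a second cleanup path that could race with the first.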
Proposed Solution
Implement a mechanism to track active DAP connections and close them during graceful shutdown:
- Maintain a registry of active DAP runners in the upstream handler or a dedicated connection manager
- When shutdown is initiated (before or alongside srv.Shutdown()), iterate through active connections and trigger their graceful closure
- Wait for all connections to close (with appropriate timeout handling)
- Ensure the shutdown timeout (cfg.ShutdownTimeout.Duration) accounts for the time needed to close DAP connections (currently wsStopWorkflowTimeout = 10s + wsCloseTimeout = 5s)
This will ensure that Workhorse can shut down cleanly without forcefully terminating active workflows or leaving orphaned connections.
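A registry along the lines proposed above might be sketched as follows. This is one possible shape under stated assumptions, not a final design: the closer interface, the registry type and its methods, fakeConn, and minShutdownBudget are all hypothetical. The two timeout constants reuse the names and values quoted in this issue.

```go
package main

import (
	"context"
	"sync"
	"time"
)

// closer is a hypothetical view of a DAP runner: Shutdown asks it to close
// gracefully and blocks until it has done so or ctx expires.
type closer interface {
	Shutdown(ctx context.Context) error
}

// Timeout values as quoted in this issue.
const (
	wsStopWorkflowTimeout = 10 * time.Second
	wsCloseTimeout        = 5 * time.Second
)

// minShutdownBudget is the worst-case time a single connection needs to
// close; the server's shutdown timeout should be at least this long.
func minShutdownBudget() time.Duration {
	return wsStopWorkflowTimeout + wsCloseTimeout
}

// registry tracks active connections. Keys must be comparable, so callers
// are expected to register pointer values.
type registry struct {
	mu      sync.Mutex
	closing bool
	active  map[closer]struct{}
}

func newRegistry() *registry {
	return &registry{active: make(map[closer]struct{})}
}

// Register adds c. It returns false once shutdown has started, in which
// case the caller should reject the new connection.
func (r *registry) Register(c closer) bool {
	r.mu.Lock()
	defer r.mu.Unlock()
	if r.closing {
		return false
	}
	r.active[c] = struct{}{}
	return true
}

// Unregister removes c, for example from the handler's deferred cleanup.
func (r *registry) Unregister(c closer) {
	r.mu.Lock()
	defer r.mu.Unlock()
	delete(r.active, c)
}

// CloseAll stops new registrations, shuts down all active connections
// concurrently, and waits for them or for ctx to expire.
func (r *registry) CloseAll(ctx context.Context) error {
	r.mu.Lock()
	r.closing = true
	conns := make([]closer, 0, len(r.active))
	for c := range r.active {
		conns = append(conns, c)
	}
	r.mu.Unlock()

	done := make(chan struct{})
	go func() {
		var wg sync.WaitGroup
		for _, c := range conns {
			wg.Add(1)
			go func(c closer) {
				defer wg.Done()
				_ = c.Shutdown(ctx) // errors are ignored in this sketch
				r.Unregister(c)
			}(c)
		}
		wg.Wait()
		close(done)
	}()

	select {
	case <-done:
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// fakeConn is a test double that closes instantly.
type fakeConn struct{ closed bool }

func (f *fakeConn) Shutdown(ctx context.Context) error {
	f.closed = true
	return nil
}

// demoRegistry registers three fake connections, closes them all, and then
// tries to register a late connection.
func demoRegistry() (closed int, lateAccepted bool, err error) {
	r := newRegistry()
	conns := []*fakeConn{{}, {}, {}}
	for _, c := range conns {
		r.Register(c)
	}
	ctx, cancel := context.WithTimeout(context.Background(), minShutdownBudget())
	defer cancel()
	err = r.CloseAll(ctx)
	for _, c := range conns {
		if c.closed {
			closed++
		}
	}
	return closed, r.Register(&fakeConn{}), err
}
```

Closing connections concurrently keeps the total wait near the worst case for a single connection (10s + 5s = 15s with the current constants) instead of scaling with the number of open connections.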