Workhorse: Close ongoing DAP connections on graceful shutdown

Problem

When GitLab Workhorse receives a graceful shutdown signal (SIGINT or SIGTERM), it calls srv.Shutdown(ctx) on the HTTP server (see workhorse/cmd/gitlab-workhorse/main.go). According to the net/http.Server.Shutdown documentation:

Shutdown gracefully shuts down the server without interrupting any active connections. Shutdown works by first closing all open listeners, then closing all idle connections, and then waiting indefinitely for connections to return to idle and then shut down.

If the provided context expires before the shutdown is complete, Shutdown returns the context's error, otherwise it returns any error returned from closing the Server's underlying Listener(s).

When Shutdown is called, Serve, ListenAndServe, and ListenAndServeTLS immediately return ErrServerClosed. Make sure the program doesn't exit and waits instead for Shutdown to return.

Shutdown does not attempt to close nor wait for hijacked connections such as WebSockets. The caller of Shutdown should separately notify such long-lived connections of shutdown and wait for them to close, if desired.

Duo Agent Platform (DAP) WebSocket connections are hijacked connections that are not managed by srv.Shutdown(). These connections remain active because:

  1. The WebSocket upgrade happens in workhorse/internal/ai_assist/duoworkflow/handler.go via upgrader.Upgrade(w, r, nil), which hijacks the underlying TCP connection
  2. The connection is managed by a runner that continuously reads/writes messages in goroutines (handleWebSocketMessages and handleAgentMessages in workhorse/internal/ai_assist/duoworkflow/runner.go)
  3. srv.Shutdown() does not close or wait for these hijacked WebSocket connections

As a result, srv.Shutdown() can return as soon as the regular HTTP connections are idle, while the DAP WebSocket connections keep running until the process is forcefully terminated or the configured ShutdownTimeout expires.
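
The behavior described above can be reproduced with the standard library alone. The following is a minimal sketch, not Workhorse code: the handler hijacks a plain TCP connection instead of performing a WebSocket upgrade, and hijackedSurvivesShutdown is a hypothetical helper name. It checks that data can still flow over the hijacked connection after Shutdown has returned.

```go
package main

import (
	"context"
	"fmt"
	"io"
	"net"
	"net/http"
	"time"
)

// hijackedSurvivesShutdown starts a server whose only handler hijacks the
// connection, calls Shutdown, and reports whether the hijacked connection
// was still usable afterwards. (Hypothetical helper for illustration.)
func hijackedSurvivesShutdown() (bool, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return false, err
	}

	hijacked := make(chan net.Conn, 1)
	srv := &http.Server{Handler: http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		conn, _, err := w.(http.Hijacker).Hijack()
		if err != nil {
			return
		}
		hijacked <- conn // the handler returns, but the connection stays open
	})}
	go srv.Serve(ln)

	client, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		return false, err
	}
	defer client.Close()
	fmt.Fprint(client, "GET / HTTP/1.1\r\nHost: example\r\n\r\n")
	serverSide := <-hijacked
	defer serverSide.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		return false, err
	}

	// Shutdown has returned; try to use the hijacked connection anyway.
	if _, err := serverSide.Write([]byte("still open")); err != nil {
		return false, nil
	}
	client.SetReadDeadline(time.Now().Add(2 * time.Second))
	buf := make([]byte, len("still open"))
	_, err = io.ReadFull(client, buf)
	return err == nil && string(buf) == "still open", nil
}

func main() {
	alive, err := hijackedSurvivesShutdown()
	fmt.Println("hijacked connection alive after Shutdown:", alive, "error:", err)
}
```

Because the server stops tracking a connection once it is hijacked, Shutdown has nothing left to wait for in this sketch, which mirrors the situation with the DAP WebSocket connections.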

Expected Behavior

During graceful shutdown, Workhorse should:

  1. Stop accepting new DAP WebSocket connections
  2. Separately notify and gracefully close existing DAP WebSocket connections by:
    • Sending a StopWorkflow request to the agent
    • Sending a WebSocket close frame to the client
    • Cleaning up associated resources (gRPC streams, MCP managers, etc.)
  3. Wait for all DAP connections to close before the process exits

Current Implementation Details

The DAP connection lifecycle is managed in workhorse/internal/ai_assist/duoworkflow/runner.go:

  • runner.Execute() spawns two goroutines that run until an error occurs or the connection closes
  • runner.Close() handles cleanup but is only called when the handler's defer executes (after Execute returns)
  • The closeWebSocketConnection() method properly closes the WebSocket with a 5-second timeout
  • There's already logic to send StopWorkflow requests when the client closes the connection

Proposed Solution

Implement a mechanism to track active DAP connections and close them during graceful shutdown:

  1. Maintain a registry of active DAP runners in the upstream handler or a dedicated connection manager
  2. When shutdown is initiated (before or alongside srv.Shutdown()), iterate through active connections and trigger their graceful closure
  3. Wait for all connections to close (with appropriate timeout handling)
  4. Ensure the shutdown timeout (cfg.ShutdownTimeout.Duration) accounts for the time needed to close DAP connections (currently wsStopWorkflowTimeout of 10s plus wsCloseTimeout of 5s, 15s in total)

This will ensure that Workhorse can shut down cleanly without forcefully terminating active workflows or leaving orphaned connections.

Edited by Igor Drozdov