Workhorse: shutdown DWS conns during blackout

What does this MR do and why?

This MR implements graceful shutdown of Duo Agent Platform (DAP) WebSocket connections during Workhorse's blackout period, addressing the issue described in Workhorse: Close ongoing DAP connections on gra... (#579793).

Problem: When Workhorse receives a shutdown signal, upgraded WebSocket connections (like DAP workflows) are not automatically managed by Go's http.Server.Shutdown(). Previously, these connections would continue running until forcefully terminated, causing workflows to fail with generic errors and incomplete status updates.
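
The behavior behind this problem can be reproduced in isolation. The following is a minimal sketch, not Workhorse code: it assumes a plain net/http server on a loopback port, and the helper name hijackedOutlivesShutdown is hypothetical. The handler hijacks the connection the way a WebSocket upgrade does, the server is shut down, and the sketch then checks whether the hijacked connection still carries data.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"io"
	"net"
	"net/http"
	"time"
)

// hijackedOutlivesShutdown (hypothetical helper) reports whether a hijacked
// connection is still usable after http.Server.Shutdown has returned.
func hijackedOutlivesShutdown() (bool, error) {
	ln, err := net.Listen("tcp", "127.0.0.1:0")
	if err != nil {
		return false, err
	}

	hijacked := make(chan net.Conn, 1)
	srv := &http.Server{Handler: http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		hj, ok := w.(http.Hijacker)
		if !ok {
			return
		}
		// Take over the raw connection, as a WebSocket upgrade would.
		conn, _, err := hj.Hijack()
		if err != nil {
			return
		}
		hijacked <- conn
	})}
	go srv.Serve(ln)

	client, err := net.Dial("tcp", ln.Addr().String())
	if err != nil {
		return false, err
	}
	defer client.Close()
	if _, err := io.WriteString(client, "GET / HTTP/1.1\r\nHost: localhost\r\n\r\n"); err != nil {
		return false, err
	}

	var serverSide net.Conn
	select {
	case serverSide = <-hijacked:
	case <-time.After(2 * time.Second):
		return false, errors.New("handler never hijacked the connection")
	}
	defer serverSide.Close()

	// Shutdown closes listeners and idle connections, but it does not
	// track connections that have been hijacked.
	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()
	if err := srv.Shutdown(ctx); err != nil {
		return false, err
	}

	// If the hijacked connection survived, this round trip succeeds.
	if _, err := io.WriteString(serverSide, "ping"); err != nil {
		return false, nil
	}
	client.SetReadDeadline(time.Now().Add(2 * time.Second))
	buf := make([]byte, 4)
	if _, err := io.ReadFull(client, buf); err != nil {
		return false, nil
	}
	return string(buf) == "ping", nil
}

func main() {
	alive, err := hijackedOutlivesShutdown()
	fmt.Println("hijacked connection alive after Shutdown:", alive, "err:", err)
}
```

Because the server no longer owns such connections, something else has to track them and close them, which is the role the upgraded connections manager plays in this MR.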

Solution: This MR initiates graceful shutdown of DAP WebSocket connections during the blackout period (the delay before Workhorse actually shuts down). This gives active workflows time to:

  • Complete naturally if they're close to finishing
  • Send StopWorkflow requests to the agent
  • Update workflow status in Rails
  • Send proper WebSocket close frames to clients
  • Clean up resources (gRPC streams, MCP managers, etc.)

Key changes:

  1. Modified UpgradedConnsManager.Shutdown() to accept a time duration instead of a context, allowing it to run asynchronously during the blackout period
  2. Initiated upgraded connection shutdown in a goroutine during the blackout period, running concurrently with the delay
  3. Allocated most of the blackout period for graceful workflow completion, reserving 10 seconds for forceful termination if needed
  4. Simplified shutdown logic by removing the separate shutdown.ShutdownAll() call for the server

Impact: With this change and the infrastructure configuration in gitlab-com/gl-infra/k8s-workloads/gitlab-com!4946 (merged) (180s blackout period), DAP workflows will have sufficient time to complete gracefully during pod shutdown, reducing workflow failures and improving user experience.

Related to: #579793

Edited by Igor Drozdov
