Fix Cloudflare gRPC tunnel timeout configuration

Background

During an investigation of https://gitlab.com/gitlab-org/gitlab/-/issues/501170#investigation-and-corrective-actions-delivered-so-far, it was discovered that Cloudflare's timeout settings are causing premature termination of gRPC tunnels after approximately 1 minute of idle time. As a temporary mitigation, Cloudflare has been disabled for the Duo Workflow Service.

Current Status

Action Required

Once Cloudflare resolves the timeout configuration issue:

  1. Review the solution provided by Cloudflare
  2. Re-enable the Cloudflare firewall for the Duo Workflow Service
  3. Verify service functionality

Backup plan

Send messages from Service

  1. Introduce a new Heartbeat message in the server Action stream (protobuf change only)
    1. gitlab-org/modelops/applied-ml/code-suggestions/ai-assist!3050
  2. Update all executor clients to ignore this message
    1. gitlab-org/duo-workflow/duo-workflow-executor!215
  3. Ensure the updated executor that ignores these messages is deployed everywhere, or confirm that existing clients can handle the extra action.
    1. The Go executor exits at https://gitlab.com/gitlab-org/duo-workflow/duo-workflow-executor/-/blob/83547fd764c28b6ecabbd49a2c8cf76dc64a3f8e/internal/services/runner/runner.go#L158 as soon as it receives an action it doesn't know about
    2. We need to check how the Node executor behaves when it receives an action it doesn't know about. We also need to consider whether we can be sure about backwards compatibility here. Since the VS Code extension updates automatically, are we OK to ship a breaking change that assumes the extension is updated shortly after we release the VS Code update?
  4. Finally, deploy the code on Duo Workflow Service that actually sends the heartbeats
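
The steps above can be sketched roughly as follows. This is a minimal illustration in Python under stated assumptions, not the actual service or executor code: the `HEARTBEAT` dict stands in for the proposed protobuf Heartbeat action, and `with_heartbeats`, `handle_action`, and the action type names are all hypothetical. The idea is that the service emits a heartbeat whenever the Action stream has been idle for a chosen interval, and an updated client skips it instead of exiting on an unknown action.

```python
import asyncio
from typing import AsyncIterator, List, Optional

# Hypothetical stand-in for the new protobuf Heartbeat action.
HEARTBEAT = {"type": "heartbeat"}


async def with_heartbeats(
    actions: "asyncio.Queue[Optional[dict]]",
    interval: float,
) -> AsyncIterator[dict]:
    """Yield queued actions, plus a heartbeat whenever the queue stays
    empty for `interval` seconds. A `None` item ends the stream."""
    while True:
        try:
            action = await asyncio.wait_for(actions.get(), timeout=interval)
        except asyncio.TimeoutError:
            # Nothing to send for `interval` seconds: keep the stream busy.
            yield HEARTBEAT
            continue
        if action is None:
            return
        yield action


def handle_action(action: dict, handled: List[dict]) -> None:
    """Client side (assumed behavior after the update): skip heartbeats
    rather than treating them as an unknown action."""
    if action["type"] == "heartbeat":
        return
    handled.append(action)


async def demo() -> List[str]:
    """Run a short stream with an idle gap longer than the interval."""
    queue: "asyncio.Queue[Optional[dict]]" = asyncio.Queue()

    async def producer() -> None:
        await queue.put({"type": "run_command"})
        await asyncio.sleep(0.25)  # idle gap, longer than the interval below
        await queue.put({"type": "read_file"})
        await queue.put(None)

    task = asyncio.create_task(producer())
    seen = [a["type"] async for a in with_heartbeats(queue, interval=0.1)]
    await task
    return seen


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In a real deployment the interval would need to be comfortably shorter than the roughly 1 minute idle timeout observed with Cloudflare. The ordering of the steps still matters: the sending side should only be enabled once every client is known to skip the heartbeat.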

Send messages from Executor: Already tried and it doesn't work

Now that we've verified that removing Cloudflare fixes the problem, we hope that the issue can be fixed on their end. If that can't happen quickly, then as a backup plan we might consider sending messages more actively through the channel so that Cloudflare thinks it's alive. Probably the easiest way to do this would be:

  1. Add a heartbeat message subtype for the stream
  2. Update Duo Workflow Service to ignore the heartbeat (probably at https://gitlab.com/gitlab-org/duo-workflow/duo-workflow-service/-/blob/bc4fc47d5aeb7374e1d3599cd7e8273d628a6bd2/duo_workflow_service/server.py#L125 and never put it in the inbox)
  3. Add a timer to Duo Workflow Executor to send a heartbeat periodically down the stream
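
For reference, the filtering in item 2 could look something like the sketch below. It is an assumption-laden illustration rather than the code at the linked line: events are modeled as plain dicts with a hypothetical `kind` key, and `receive_events` and the `inbox` queue are invented names that only mirror the idea of dropping heartbeats before they reach the inbox.

```python
import asyncio
from typing import AsyncIterator


async def receive_events(
    stream: AsyncIterator[dict],
    inbox: "asyncio.Queue[dict]",
) -> None:
    """Forward client events to the inbox, dropping heartbeats.

    The `kind` key and the "heartbeat" value are hypothetical stand-ins
    for the proposed heartbeat message subtype.
    """
    async for event in stream:
        if event.get("kind") == "heartbeat":
            # Keep-alive traffic only: never put it in the inbox.
            continue
        await inbox.put(event)
```

Only non-heartbeat events reach the workflow logic, so the heartbeat exists purely to generate traffic on the connection.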

It may also be possible to use a whole new gRPC message type, so long as it goes down the same gRPC connection, but we'd need to test that this still keeps the stream alive.

Edited by Dylan Griffith