Fix Cloudflare gRPC tunnel timeout configuration
Background
During an investigation of https://gitlab.com/gitlab-org/gitlab/-/issues/501170#investigation-and-corrective-actions-delivered-so-far, it was discovered that Cloudflare's timeout settings are causing premature termination of gRPC tunnels after approximately 1 minute of idle time. As a temporary mitigation, Cloudflare has been disabled for the Duo Workflow Service.
Current Status
- The root cause was most likely identified in #509586 (comment 2624852814). It suggests that the only workaround is to send periodic heartbeat messages down our gRPC stream more often, which is being tested in gitlab-org/modelops/applied-ml/code-suggestions/ai-assist!2949 (merged)
- Cloudflare support tickets are open and being tracked:
- Cloudflare is currently disabled for Duo Workflow Service
- Cloudflare support found a secret setting for us and we've enabled that on production in https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/merge_requests/11688
- We're going to roll this out again via a feature flag once we merge !200293 (merged)
Action Required
Once Cloudflare resolves the timeout configuration issue:
- Review the solution provided by Cloudflare
- Re-enable Cloudflare firewall for Duo Workflow Service
- Verify service functionality
Backup plan
Send messages from Service
- Introduce a new Heartbeat message in the server Action stream (protobuf change only)
- Update all executor clients to ignore this message
- Ensure the updated executor that ignores these messages is deployed everywhere, or confirm that clients can handle this extra action.
  - The Go executor exits at https://gitlab.com/gitlab-org/duo-workflow/duo-workflow-executor/-/blob/83547fd764c28b6ecabbd49a2c8cf76dc64a3f8e/internal/services/runner/runner.go#L158 as soon as it receives an action it doesn't know about
  - We need to check how the node executor behaves if it receives an action it doesn't know about. We also need to consider whether we can be sure about backwards compatibility here. Since the VS Code extension updates automatically, are we OK to ship a breaking change that assumes it is updated shortly after we release the VS Code update?
- Finally, deploy code on Duo Workflow Service that actually sends the heartbeats
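The service-side steps above can be sketched roughly as follows. This is a minimal illustration, not the actual Duo Workflow Service or executor code: the `Action` dataclass stands in for the protobuf message, and `with_heartbeats`, `run_executor`, the `"heartbeat"` kind, and the use of `None` as an end-of-stream marker are all hypothetical names and conventions chosen for the example. It shows the service emitting a heartbeat whenever the stream has been idle for `interval` seconds, and a client that skips heartbeats instead of exiting on an action it does not recognize.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator, List

HEARTBEAT = "heartbeat"  # hypothetical action kind, not from the real protobuf


@dataclass
class Action:
    """Hypothetical stand-in for the protobuf Action message."""
    kind: str
    payload: str = ""


async def with_heartbeats(outbox: asyncio.Queue, interval: float) -> AsyncIterator[Action]:
    """Service side: yield queued actions, plus a heartbeat after each idle interval.

    A None item on the queue ends the stream (an assumption of this sketch).
    """
    while True:
        try:
            action = await asyncio.wait_for(outbox.get(), timeout=interval)
        except asyncio.TimeoutError:
            # Nothing was sent for `interval` seconds: keep the stream active.
            yield Action(kind=HEARTBEAT)
            continue
        if action is None:
            return
        yield action


async def run_executor(stream: AsyncIterator[Action]) -> List[Action]:
    """Client side: ignore heartbeats and collect every other action."""
    handled = []
    async for action in stream:
        if action.kind == HEARTBEAT:
            continue  # must be skipped, not treated as an unknown action
        handled.append(action)
    return handled
```

The ordering in the list matters for the same reason the sketch has two halves: a client without the `continue` branch would treat the heartbeat as an unknown action, so the sending side should only be enabled once such clients are no longer in use.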
Send messages from Executor: Already tried and it doesn't work
Now that we've verified that removing Cloudflare fixes the problem, we hope that the issue can be fixed on their end. If that can't happen quickly, then as a backup plan we might consider sending messages more actively through the channel so that Cloudflare thinks it's alive. Probably the easiest way to do this would be:
- Add a heartbeat message subtype for the stream
- Update Duo Workflow Service to ignore the heartbeat (probably at https://gitlab.com/gitlab-org/duo-workflow/duo-workflow-service/-/blob/bc4fc47d5aeb7374e1d3599cd7e8273d628a6bd2/duo_workflow_service/server.py#L125, and never put it in the inbox)
- Add a timer to Duo Workflow Executor to send a heartbeat periodically down the stream
It may also be possible to use a whole new gRPC message type so long as it's going down the same gRPC connection but we'd need to test that this still keeps the stream alive.
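For reference, the executor-side variant could look roughly like the sketch below. As noted above, this approach was already tried and did not keep the tunnel alive, so it is shown only to make the shape of the idea concrete. Everything here is hypothetical: `ClientEvent`, `send_heartbeats`, `receive_events`, and the `"heartbeat"` kind are illustrative names, and the real executor is written in Go rather than Python.

```python
import asyncio
from dataclasses import dataclass
from typing import AsyncIterator, Awaitable, Callable

HEARTBEAT = "heartbeat"  # hypothetical event kind


@dataclass
class ClientEvent:
    """Hypothetical stand-in for a message sent from executor to service."""
    kind: str
    payload: str = ""


async def send_heartbeats(
    send: Callable[[ClientEvent], Awaitable[None]],
    interval: float,
    stop: asyncio.Event,
) -> None:
    """Executor side: send a heartbeat every `interval` seconds until `stop` is set."""
    while not stop.is_set():
        try:
            # Wake up early if the stream is being shut down.
            await asyncio.wait_for(stop.wait(), timeout=interval)
        except asyncio.TimeoutError:
            await send(ClientEvent(kind=HEARTBEAT))


async def receive_events(events: AsyncIterator[ClientEvent], inbox: asyncio.Queue) -> None:
    """Service side: drop heartbeats and put every other event in the inbox."""
    async for event in events:
        if event.kind == HEARTBEAT:
            continue  # never reaches the inbox
        await inbox.put(event)
```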