Investigate and resolve large gRPC payloads from DWS breaking flows

Problem

The logs show a considerable number of entries indicating that the Duo Workflow Service (DWS) is trying to send a message larger than 4 MiB. Each of these instances results in an ExecuteBatchError, which fails the flow. They also break the gRPC connection before the flow fails, which means the state is not updated and the user will likely see minimal feedback in the UI about what went wrong.
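
To make the failure mode concrete, here is a minimal sketch of a size check that could run before a serialized payload is handed to gRPC. It is not taken from the Duo Workflow Service code: `check_payload_size` and `PayloadTooLargeError` are hypothetical names, and the 4 MiB limit is assumed from the size reported in the logs.

```python
# Assumed limit: 4 MiB, matching the message size reported in the logs.
MAX_GRPC_MESSAGE_BYTES = 4 * 1024 * 1024


class PayloadTooLargeError(Exception):
    """Hypothetical error raised before an oversized payload is sent."""


def check_payload_size(payload: bytes, limit: int = MAX_GRPC_MESSAGE_BYTES) -> int:
    """Return the payload size in bytes, or raise if it exceeds the limit."""
    size = len(payload)
    if size > limit:
        # Failing here, before the send, leaves the connection intact so the
        # caller can still update state and report the problem.
        raise PayloadTooLargeError(
            f"payload is {size} bytes, limit is {limit} bytes"
        )
    return size
```

A guard like this would only surface the problem earlier and with a clearer error. It does not address why the payload grew too large in the first place.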

source

Metrics/Dashboards

Logs from Duo Workflow Service to count instances of message limit exceeded: https://cloudlogging.app.goo.gl/6sMqFDcXZJz9K9g1A
Logs from Duo Workflow Service for ExecuteBatchError: https://cloudlogging.app.goo.gl/BF51y7gzVrJY4xYT7
Workhorse logs of payload too large: https://log.gprd.gitlab.net/app/r/s/uZeuG
Grafana metrics for ExecuteBatchError: https://dashboards.gitlab.net/d/3d9c7954-2669-4782-9206-b714c8a589fa/dws-log-based-dashboard?orgId=1&from=now-24h&to=now&timezone=utc&var-interval=5%20MINUTE&var-workflow=all&var-client=all&var-gitlab_realm=all&var-gitlab_version=all&var-gitlab_host_name=all&var-gcp_project_id=gitlab-runway-production&var-grpc_status_code=all&var-grpc_context_details=all&viewPanel=panel-19

TODO

Figure out what it looks like to a user when this exact situation happens by trying to send a checkpoint that is too large
1. Web UI
2. VS Code
Investigate how many unique users are being impacted by this
Investigate specific instances of this to see what the root cause was for that instance
Create followup issues to address whatever we learn
Monitor for 1-2 weeks to see whether the number of large, workflow-breaking payloads goes down
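
For the unique-users item above, one possible approach is to export the matching log entries as JSON lines and count distinct users offline. The sketch below assumes each entry is a JSON object with a `message` field and a `user_id` field. Both field names are hypothetical and would need to be replaced with whatever the real log schema uses.

```python
import json
from collections import Counter


def count_affected_users(log_lines, marker="ExecuteBatchError", user_field="user_id"):
    """Count matching log entries per user from JSON-lines log output.

    The "message" field and the default user_field are assumed names,
    not the actual log schema.
    """
    counts = Counter()
    for line in log_lines:
        line = line.strip()
        if not line:
            continue
        try:
            entry = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip lines that are not valid JSON
        if not isinstance(entry, dict):
            continue
        if marker not in str(entry.get("message", "")):
            continue
        user = entry.get(user_field)
        if user is not None:
            counts[user] += 1
    return counts
```

`len(count_affected_users(lines))` then gives the number of unique users, and `most_common()` on the result shows which users hit the error most often.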

Edited Oct 14, 2025 by Dylan Griffith