Investigate and Resolve Large gRPC payloads from DWS breaking flows
Problem
We can see in the logs that there are a considerable number of entries indicating that we are trying to send a message larger than 4MiB out from the Duo Workflow Service. Each of these instances results in an ExecuteBatchError, which fails the flow. They also break the gRPC connection before failing the flow, which means the state is not updated and the user will likely see minimal feedback in the UI about what went wrong.
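To make the failure mode concrete, here is a minimal sketch of a pre-send size guard. It is not the Duo Workflow Service implementation: `check_payload_size`, `PayloadTooLargeError`, and the `checkpoint` key are hypothetical names, and JSON serialization stands in for whatever encoding the service actually uses on the wire. The only value taken from the problem statement is the 4MiB limit.

```python
import json

# Limit taken from the problem statement: messages larger than 4MiB fail.
MAX_MESSAGE_BYTES = 4 * 1024 * 1024


class PayloadTooLargeError(Exception):
    """Hypothetical error raised before a send that would exceed the limit."""


def serialized_size(payload: dict) -> int:
    # Assumption: JSON is used here only as a stand-in for the real wire encoding.
    return len(json.dumps(payload).encode("utf-8"))


def check_payload_size(payload: dict, limit: int = MAX_MESSAGE_BYTES) -> int:
    """Return the serialized size in bytes, or raise if it exceeds the limit."""
    size = serialized_size(payload)
    if size > limit:
        raise PayloadTooLargeError(
            f"payload is {size} bytes, which exceeds the limit of {limit} bytes"
        )
    return size
```

A guard of this kind would let the service fail the flow with an explicit error while the connection is still usable, rather than after the connection has already broken. The same helper could also be used to build a deliberately oversized checkpoint when reproducing the problem.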
Metrics/Dashboards
- Logs from Duo Workflow Service to count instances of message limit exceeded: https://cloudlogging.app.goo.gl/6sMqFDcXZJz9K9g1A
- Logs from Duo Workflow Service for ExecuteBatchError: https://cloudlogging.app.goo.gl/BF51y7gzVrJY4xYT7
- Workhorse logs of payload too large: https://log.gprd.gitlab.net/app/r/s/uZeuG
- Grafana metrics for ExecuteBatchError: https://dashboards.gitlab.net/d/3d9c7954-2669-4782-9206-b714c8a589fa/dws-log-based-dashboard?orgId=1&from=now-24h&to=now&timezone=utc&var-interval=5%20MINUTE&var-workflow=all&var-client=all&var-gitlab_realm=all&var-gitlab_version=all&var-gitlab_host_name=all&var-gcp_project_id=gitlab-runway-production&var-grpc_status_code=all&var-grpc_context_details=all&viewPanel=panel-19
TODO
- Figure out what it looks like to a user when this exact situation happens by trying to send a checkpoint that is too large
  - Web UI
  - VS Code
- Investigate how many unique users are impacted by this
- Investigate specific instances of this to determine the root cause of each
- Create follow-up issues to address whatever we learn
- Monitor for 1-2 weeks to see whether the number of large, flow-breaking payloads is reduced
Edited by Dylan Griffith
