Ensure Duo Workflow Service is enforcing a consistent 4MiB gRPC message limit

Problem to solve

See gitlab-org/gitlab#560498 (closed) for context. In short, gRPC generally uses a 4MiB message limit, but our clients and proxy layers enforce it inconsistently, which leads to difficult-to-debug disconnections.

Since the server both receives and sends gRPC messages, we need to ensure that it consistently honors this 4MiB limit. Having it reject and log messages that exceed 4MiB will make debugging much easier, both in production and locally.

Proposal

There are likely gRPC settings that do all of this for us, as there are in the Go Executor. The Duo Workflow Service should configure:

- A 4MiB message limit, logging and rejecting messages larger than this
- It should never send outgoing messages exceeding this limit either
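
As a minimal sketch of the two points above, assuming the service is built on grpc-python (the user agent in the example logs further down reports grpc-python, but how the server is actually constructed is an assumption), both directions can be capped with the standard channel arguments `grpc.max_receive_message_length` and `grpc.max_send_message_length`. The helper name `message_size_options` is hypothetical and used only for illustration.

```python
# Hypothetical helper; the name is illustrative and not taken from the service.
MAX_MESSAGE_BYTES = 4 * 1024 * 1024  # 4 MiB = 4194304 bytes


def message_size_options(limit: int = MAX_MESSAGE_BYTES) -> list[tuple[str, int]]:
    """Build gRPC channel arguments that cap both message directions."""
    return [
        ("grpc.max_receive_message_length", limit),  # incoming messages
        ("grpc.max_send_message_length", limit),     # outgoing messages
    ]


# Assumed usage with grpc-python (not executed here):
#   server = grpc.aio.server(options=message_size_options())
```

Setting both values explicitly, even though they may match the defaults, keeps the limit visible in code and guards against a default changing in a later gRPC release.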

Further details

Client size limits

At the moment each client enforces a message size limit:

Server size limit

Default gRPC limits in Duo Workflow Service are 4MiB for incoming and outgoing messages.
By default the gRPC server handles oversized messages by doing the following:
- Raises a non-OK status code (like RESOURCE_EXHAUSTED)
- Sends an error message to the client (e.g. Received message larger than max (X vs. 4194304))
- Closes the gRPC connection
- Raises an error in the application
From my local testing, I can see the following errors being raised when the client sends oversized messages:
- RuntimeError: async generator raised StopAsyncIteration. This seems to be raised for an oversized startRequest message in server.py: gRPC rejects the oversized message and closes the connection, so the gRPC async iterator has no more messages to process and the StopAsyncIteration propagates.
- asyncio.exceptions.CancelledError. This seems to be raised for oversized messages sent during workflow execution, while awaiting workflow_task in server.py: gRPC forcibly cancels workflow_task when the connection closes. At the same time, send_events() stops with a StopAsyncIteration error, which is handled at L290.

Example StopAsyncIteration error

{
    "event": "Finished ExecuteWorkflow RPC",
    "logger": "grpc",
    "level": "info",
    "correlation_id": "555fc472-1be7-4a1b-858d-aa7862f1b8c7",
    "gitlab_global_user_id": "777",
    "workflow_id": "undefined",
    "duration_s": 0.0025215420027961954,
    "request_arrived_at": "2025-09-12T09:17:29.700856+00:00",
    "cpu_s": 0.0015339999999999243,
    "grpc_type": "BIDI_STREAM",
    "grpc_service_name": "DuoWorkflow",
    "grpc_method_name": "ExecuteWorkflow",
    "servicer_context_code": "OK",
    "gitlab_host_name": null,
    "gitlab_realm": "self-managed",
    "gitlab_instance_id": null,
    "gitlab_authentication_type": "oidc",
    "gitlab_version": "18.4.0-pre",
    "user_agent": "grpc-python/1.73.1 grpc-c/48.0.0 (osx; chttp2)",
    "exception_message": "async generator raised StopAsyncIteration",
    "exception_class": "RuntimeError",
    "exception_backtrace": "Traceback (most recent call last):\n  File \"/Users/tim/gitlab/gdk/gdk/gitlab-ai-gateway/duo_workflow_service/server.py\", line 156, in ExecuteWorkflow\n    start_workflow_request: contract_pb2.ClientEvent = await anext(\n                                                       ^^^^^^^^^^^^\nStopAsyncIteration\n\nThe above exception was the direct cause of the following exception:\n\nTraceback (most recent call last):\n  File \"/Users/tim/gitlab/gdk/gdk/gitlab-ai-gateway/duo_workflow_service/interceptors/monitoring_interceptor.py\", line 155, in monitoring\n    yield\n  File \"/Users/tim/gitlab/gdk/gdk/gitlab-ai-gateway/duo_workflow_service/interceptors/monitoring_interceptor.py\", line 128, in stream_behavior\n    async for behavior_response in behavior(\n  File \"/Users/tim/gitlab/gdk/gdk/gitlab-ai-gateway/.venv/lib/python3.12/site-packages/dependency_injector/wiring.py\", line 1077, in _patched\n    async for obj in fn(*args, **kwargs):\nRuntimeError: async generator raised StopAsyncIteration\n",
    "workflow_definition": null,
    "timestamp": "2025-09-12T09:17:29.703442Z"
}

Example CancelledError

{
    "event": "Client-side streaming has been closed.",
    "logger": "server",
    "level": "info",
    "correlation_id": "6f4d536f-faba-4f83-9536-357aff0a2b54",
    "gitlab_global_user_id": "777",
    "workflow_id": "test-normal-message",
    "timestamp": "2025-09-12T09:17:29.754286Z"
}

Followed by:

{
  "event": "",
  "logger": "exceptions",
  "level": "error",
  "correlation_id": "6f4d536f-faba-4f83-9536-357aff0a2b54",
  "gitlab_global_user_id": "777",
  "workflow_id": "test-normal-message",
  "status_code": null,
  "exception_class": "CancelledError",
  "additional_details": {
    "workflow_id": "test-normal-message",
    "source": "duo_workflow_service.workflows.chat.workflow"
  },
  "timestamp": "2025-09-12T09:17:29.754544Z",
  "exception": "Traceback (most recent call last):\n  File \"/Users/tim/gitlab/gdk/gdk/gitlab-ai-gateway/duo_workflow_service/workflows/abstract_workflow.py\", line 224, in _compile_and_run_graph\n    await fetch_workflow_and_container_data(\n  File \"/Users/tim/gitlab/gdk/gdk/gitlab-ai-gateway/duo_workflow_service/gitlab/gitlab_api.py\", line 148, in fetch_workflow_and_container_data\n    response = await client.graphql(query, variables)\n               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/Users/tim/gitlab/gdk/gdk/gitlab-ai-gateway/duo_workflow_service/gitlab/executor_http_client.py\", line 83, in graphql\n    response = await asyncio.wait_for(\n               ^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/Users/tim/.local/share/mise/installs/python/3.12.11/lib/python3.12/asyncio/tasks.py\", line 520, in wait_for\n    return await fut\n           ^^^^^^^^^\n  File \"/Users/tim/gitlab/gdk/gdk/gitlab-ai-gateway/duo_workflow_service/executor/action.py\", line 57, in _execute_action\n    actionResponse = await _execute_action_and_get_action_response(metadata, action)\n                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^\n  File \"/Users/tim/gitlab/gdk/gdk/gitlab-ai-gateway/duo_workflow_service/executor/action.py\", line 38, in _execute_action_and_get_action_response\n    event: contract_pb2.ClientEvent = await inbox.get()\n                                      ^^^^^^^^^^^^^^^^^\n  File \"/Users/tim/.local/share/mise/installs/python/3.12.11/lib/python3.12/asyncio/queues.py\", line 158, in get\n    await getter\nasyncio.exceptions.CancelledError"
}
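
The traceback above ends at `await inbox.get()`, which is where a cancelled task surfaces its `CancelledError`. The following standard-library sketch reproduces that shape under stated assumptions: `wait_for_client_event` is a hypothetical stand-in for the code awaiting the inbox, and the explicit `cancel()` call simulates whatever cancels the task when the connection closes.

```python
import asyncio


async def wait_for_client_event(inbox: asyncio.Queue):
    """Stand-in for code that blocks on the next client event."""
    return await inbox.get()


async def main():
    inbox = asyncio.Queue()
    workflow_task = asyncio.create_task(wait_for_client_event(inbox))
    await asyncio.sleep(0)  # let the task start and block on inbox.get()

    # Simulates the task being cancelled when the connection goes away.
    workflow_task.cancel()
    try:
        await workflow_task
    except asyncio.CancelledError as exc:
        # The exception carries no message, so it says nothing about the cause.
        return type(exc).__name__, str(exc)


error_name, error_message = asyncio.run(main())
print(error_name, repr(error_message))
```

Because the exception itself is empty, any hint about the cause has to be recorded by the code that decides to cancel, or by a size check performed before the connection is torn down.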

Current limitations

The current implementation, which relies on the default handling, logs either StopAsyncIteration or CancelledError with no indication of what caused them. Because both are generic asyncio errors, they can have many different causes.

Clients will receive a more helpful RESOURCE_EXHAUSTED error with an error message explaining that the message was oversized, but because clients can run locally, we can't rely on these errors for observability of the DWS server.
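
One possible way to close this observability gap is an explicit server-side size check that logs the reason before rejecting. The sketch below is an illustration, not the service's implementation: `check_message_size`, `OversizedMessageError`, and the logger name are all hypothetical, and it assumes the serialized bytes of a message are available (for protobuf messages, for example via `SerializeToString()`).

```python
import logging

MAX_MESSAGE_BYTES = 4 * 1024 * 1024  # 4 MiB

log = logging.getLogger("message_size")  # hypothetical logger name


class OversizedMessageError(ValueError):
    """Hypothetical error type for messages above the size limit."""


def check_message_size(
    payload: bytes, direction: str, limit: int = MAX_MESSAGE_BYTES
) -> int:
    """Return the payload size, or log and raise if it exceeds the limit."""
    size = len(payload)
    if size > limit:
        # Log on the server so the cause is visible without client logs.
        log.error(
            "Oversized gRPC message: direction=%s size=%d limit=%d",
            direction, size, limit,
        )
        raise OversizedMessageError(
            f"{direction} message is {size} bytes, limit is {limit}"
        )
    return size
```

A check like this could run on outgoing messages before they are sent, so the server never emits a message that a client or proxy would reject, and the log line names the direction and size instead of leaving only a generic asyncio error.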

Links / references

Edited Sep 12, 2025 by Tim Morriss