Pass HTTP execution error to DWS and improve logs

What does this MR do and why?

Improved logging

We're investigating connection drops in Workhorse when running DAP flows. We want to improve the logs because:

  1. If serveHTTPSafe fails, the "Sending HTTP response event" line is never logged, so there are no logs at all for that request ID.
  2. Errors logged by nullResponseWriter do not include a correlation ID, which makes them hard to trace.

Passing the error to DWS

  1. Previously, if HTTP request execution failed, we stopped execution and dropped the connection. Now we send the error back to DWS so that the LLM can act on it.

This MR improves logging and passes the error back to DWS.

References

Related to gitlab-org/modelops/applied-ml/code-suggestions/ai-assist#1789


How to set up and validate locally

  1. Temporarily apply this diff to simulate a large response from the GraphQL request:
     ```diff
     --- a/ee/app/presenters/ai/duo_workflows/workflow_checkpoint_event_presenter.rb
     +++ b/ee/app/presenters/ai/duo_workflows/workflow_checkpoint_event_presenter.rb
     @@ -8,6 +8,11 @@ class WorkflowCheckpointEventPresenter < Gitlab::View::Presenter::Delegated
            DuoMessage = Struct.new(:content, :message_type, :status, :tool_info,
              :timestamp, :correlation_id, :role, keyword_init: true)

     +      # TODO: Remove this - temporary override to reproduce large response bug
     +      def metadata
     +        { "dummy_payload" => "x" * (5 * 1024 * 1024) }.to_json
     +      end
     +
     ```
  2. gdk restart rails-web
  3. cd workhorse && make gitlab-workhorse && gdk restart workhorse
  4. Run a chat flow. It should finish normally. Send one more message and check the Workhorse logs.
  5. You should see log lines such as "nullResponseWriter: partial write due to size limit" that include a correlation ID and a request ID.
  6. DWS no longer reports a Cancelled error. You may instead see an Internal error, because the flow cannot be resumed without the initial GraphQL call.

MR acceptance checklist

Evaluate this MR against the MR acceptance checklist. It helps you analyze changes to reduce risks in quality, performance, reliability, security, and maintainability.

Edited by Halil Coban
