Retry mechanism in step executor

Description

When v2/chat/agent returns an error event, we should retry the execution to increase the chance that the operation succeeds.

Proposal

Add a retryable field to the Error agent event.
When retryable: true, we should retry the operation. This covers temporary system failures (e.g. a system overload error).
When retryable: false, we should surface the error to the user, e.g. Something went wrong during the request. Try "/clean" and request again. This covers client errors, e.g. invalid_request_error or exceeding the max token length limit.
When retryable is nil, do nothing.
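
The three cases above can be summarized as a small decision function. This is only an illustrative sketch in Python: the `Action` enum and the `action_for_error_event` helper are hypothetical names, not part of either codebase, and the payload is assumed to be a plain dictionary with an optional `retryable` key.

```python
from enum import Enum
from typing import Any, Dict, Optional


class Action(Enum):
    """Hypothetical outcomes for an error event (names are illustrative)."""

    RETRY = "retry"
    SURFACE_TO_USER = "surface_to_user"
    NO_OP = "no_op"


def action_for_error_event(data: Dict[str, Any]) -> Action:
    """Map the `retryable` field of an error event payload to an action."""
    retryable: Optional[bool] = data.get("retryable")

    if retryable is True:
        # Temporary failure such as a system overload: try again.
        return Action.RETRY
    if retryable is False:
        # Client-side problem: retrying will not help, so tell the user.
        return Action.SURFACE_TO_USER
    # Field missing or null: do nothing special.
    return Action.NO_OP
```

Note that the Ruby accessor in the example below uses `data["retryable"] || false`, so in that sketch a nil value ends up being treated the same as false.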

Example

GitLab-Sidekiq:

diff --git a/ee/lib/gitlab/duo/chat/agent_events/error.rb b/ee/lib/gitlab/duo/chat/agent_events/error.rb
index 8c2b6c4da6f5..ec154a7eb1ce 100644
--- a/ee/lib/gitlab/duo/chat/agent_events/error.rb
+++ b/ee/lib/gitlab/duo/chat/agent_events/error.rb
@@ -8,6 +8,10 @@ class Error < BaseEvent
           def message
             data["message"]
           end
+
+          def retryable
+            data["retryable"] || false
+          end
         end
       end
     end
diff --git a/ee/lib/gitlab/llm/chain/agents/single_action_executor.rb b/ee/lib/gitlab/llm/chain/agents/single_action_executor.rb
index 3b93189903a7..842f78c10952 100644
--- a/ee/lib/gitlab/llm/chain/agents/single_action_executor.rb
+++ b/ee/lib/gitlab/llm/chain/agents/single_action_executor.rb
@@ -19,6 +19,7 @@ class SingleActionExecutor
           attr_accessor :iterations
 
           MAX_ITERATIONS = 10
+          MAX_RETRY_STEP_FORWARD = 1
 
           # @param [String] user_input - a question from a user
           # @param [Array<Tool>] tools - an array of Tools defined in the tools module.
@@ -36,7 +37,7 @@ def initialize(user_input:, tools:, context:, response_handler:, stream_response
 
           def execute
             MAX_ITERATIONS.times do
-              events = step_forward
+              events = with_agent_retry { step_forward }
 
               raise EmptyEventsError if events.empty?
 
@@ -294,6 +295,26 @@ def current_blob
           def chat_feature_setting
             ::Ai::FeatureSetting.find_by_feature(:duo_chat)
           end
+
+          def with_agent_retry
+            retries = 0
+
+            begin
+              yield
+            rescue AgentEventError => ex
+              raise ex if retries >= MAX_RETRY_STEP_FORWARD
+              raise ex unless ex.retryable?
+
+              log_warn(message: "Retrying agent step forward",
+                event_name: 'retry',
+                ai_component: 'duo_chat',
+                ai_error_class: ex.class.name,
+                ai_error_message: ex.message)
+
+              retries += 1
+              retry
+            end
+          end
         end
       end
     end
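
For readers less familiar with Ruby's `retry` keyword, the control flow of `with_agent_retry` above can be approximated in Python as follows. This is a sketch under assumptions: the `AgentEventError` class here is a stand-in with a `retryable` attribute, and the logging call only loosely mirrors `log_warn`.

```python
import logging
from typing import Callable, TypeVar

T = TypeVar("T")
logger = logging.getLogger(__name__)

MAX_RETRY_STEP_FORWARD = 1


class AgentEventError(Exception):
    """Stand-in for the Ruby AgentEventError; `retryable` is assumed."""

    def __init__(self, message: str, retryable: bool = False) -> None:
        super().__init__(message)
        self.retryable = retryable


def with_agent_retry(step: Callable[[], T],
                     max_retries: int = MAX_RETRY_STEP_FORWARD) -> T:
    """Run `step`, re-running it while the error is retryable and budget remains."""
    retries = 0
    while True:
        try:
            return step()
        except AgentEventError as ex:
            # Give up when the retry budget is spent or the error is permanent.
            if retries >= max_retries or not ex.retryable:
                raise
            logger.warning("Retrying agent step forward: %s", ex)
            retries += 1
```

With `MAX_RETRY_STEP_FORWARD = 1`, a step is attempted at most twice: the initial call plus one retry. Errors that are not retryable are re-raised after the first attempt.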

AI-Gateway:

diff --git a/ai_gateway/chat/agents/react.py b/ai_gateway/chat/agents/react.py
index a64d0ebd..bdc4740f 100644
--- a/ai_gateway/chat/agents/react.py
+++ b/ai_gateway/chat/agents/react.py
@@ -251,7 +251,12 @@ class ReActAgent(Prompt[ReActAgentInputs, TypeAgentEvent]):
 
                 events.append(event)
         except Exception as e:
-            yield AgentError(message=str(e))
+            retryable = False
+
+            if "overloaded_error" in str(e):
+                retryable = True
+
+            yield AgentError(message=str(e), retryable=retryable)
             raise
 
         if any(isinstance(e, AgentFinalAnswer) for e in events):
diff --git a/ai_gateway/chat/agents/typing.py b/ai_gateway/chat/agents/typing.py
index 674f2a7f..d2673311 100644
--- a/ai_gateway/chat/agents/typing.py
+++ b/ai_gateway/chat/agents/typing.py
@@ -45,6 +45,7 @@ class AgentUnknownAction(AgentBaseEvent):
 class AgentError(AgentBaseEvent):
     type: str = "error"
     message: str
+    retryable: bool
 
 
 TypeAgentEvent = TypeVar(
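
The classification step in the react.py diff above can be isolated into a small helper, which makes it easier to add further retryable error types later. The following is a minimal sketch, not the AI Gateway implementation: the dataclass is a stand-in for the real `AgentError` event, and `RETRYABLE_MARKERS` and `to_agent_error` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Tuple

# Substrings that mark an upstream error as temporary (assumed list;
# the diff above only checks for "overloaded_error").
RETRYABLE_MARKERS: Tuple[str, ...] = ("overloaded_error",)


@dataclass
class AgentError:
    """Simplified stand-in for the AgentError event in typing.py."""

    message: str
    retryable: bool
    type: str = "error"


def to_agent_error(exc: Exception) -> AgentError:
    """Build an error event, flagging it retryable when a known marker matches."""
    text = str(exc)
    retryable = any(marker in text for marker in RETRYABLE_MARKERS)
    return AgentError(message=text, retryable=retryable)
```

Matching on the error message text is fragile, since it depends on the exact wording the upstream provider uses. Checking an exception class or a structured error code would be more robust where one is available.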

Edited Sep 27, 2024 by Shinya Maeda