Architecture decision: entry point and execution engine for GitLabDuo mentions and Slack integration

@thomas-schmidt and I had a sync on the potential approaches for GitLabDuo mentions and Slack integration to figure out whether we can reuse the same approach for both use cases. Here is a summary:

Our requirements

We want a semi-async conversational agent experience: a user sends a message (MR comment, Slack message, etc.) and gets a response fast enough to feel conversational — not streaming like web chat, but not minutes either. This applies to @GitLabDuo mentions in MRs today and extends to Slack and potentially other messaging surfaces.

Both our approaches need to meet these requirements:

Latency: fast enough to feel responsive, ideally under 5 seconds for simple questions (complex code analysis may take longer).
Non-negotiable requirements:
- Composite identity: security requires this. The agent acts on behalf of a user with appropriate permissions, not as a privileged service account.
As much as possible:
- Visibility into AI interactions: users (and ideally admins) should be able to see what the agent did, including tool calls, through sessions, chat history, a record of the intent LLM call, or similar. This is important for governance, trust, and debugging.
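
One way to read the composite identity requirement is that an action is allowed only when both the requesting user and the agent's own scope permit it. The following is a minimal Python sketch under that assumption. The `Principal` type and the permission names are hypothetical and do not reflect GitLab's actual authorization model.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Principal:
    """A hypothetical actor with a flat set of permission names."""
    name: str
    permissions: frozenset[str]


def composite_allows(user: Principal, agent: Principal, action: str) -> bool:
    # Assumption: effective permissions are the intersection of what the
    # user may do and what the agent is scoped to do.
    return action in (user.permissions & agent.permissions)


user = Principal("alice", frozenset({"read_code", "create_note"}))
agent = Principal("duo-agent", frozenset({"read_code", "create_note", "push_code"}))

print(composite_allows(user, agent, "create_note"))  # True: both allow it
print(composite_allows(user, agent, "push_code"))    # False: the user lacks it
```

The point of the intersection is that the agent never gains more access than the user who mentioned it, which is what distinguishes this from a privileged service account.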

The 2 approaches currently in flight

Headless Workhorse (headless injector)

A Sidekiq worker POSTs to a headless endpoint. Workhorse intercepts the request, runs the workflow (currently the agentic chat flow) to completion via gRPC to DWS, and proxies tool calls back to Rails. The worker then reads the result and posts it as @GitLabDuo.

There is a working POC that demonstrates this end-to-end with very low latency (~1-2s for simple questions).

Messaging adapter via CI runner

A @GitLabDuo mention (e.g., in a Slack message or issue comment) triggers a CI workload in a lightweight, auto-created duo-workspace project. The workflow runs on a runner, and lifecycle events (e.g., WorkloadFinishedEvent) post the response back — as an MR note, Slack reply, or any other messaging surface (depending on the adapter that triggers it).

There is a working POC that demonstrates this end-to-end.

The two decisions

We think this discussion mixes two independent architectural decisions that are worth separating:

Entry point: How does a mention get into the system and how does the response get back?
Execution engine: Where does the agent actually run?

The current approaches bundle these together: the headless Workhorse approach pairs a Sidekiq-triggered entry point with Workhorse execution, and the messaging adapter pairs its entry point with runner execution. They don't have to be coupled. For example, the messaging adapter could use Workhorse as its execution engine instead of a runner. Separating the two makes the tradeoffs clearer.
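
The separation can be illustrated with a small Python sketch in which the entry point only depends on an engine interface. All class and function names here are hypothetical placeholders, and the engines return canned strings instead of running a real workflow.

```python
from typing import Callable, Protocol


class ExecutionEngine(Protocol):
    """Anything that can run an agent workflow to completion."""

    def run(self, prompt: str) -> str: ...


class WorkhorseEngine:
    # Hypothetical stand-in for the headless Workhorse path.
    def run(self, prompt: str) -> str:
        return f"[workhorse] {prompt}"


class RunnerEngine:
    # Hypothetical stand-in for the CI runner path.
    def run(self, prompt: str) -> str:
        return f"[runner] {prompt}"


def handle_mention(
    prompt: str, engine: ExecutionEngine, reply: Callable[[str], None]
) -> None:
    # The entry point knows how to receive a mention and how to reply,
    # but not where the agent runs.
    reply(engine.run(prompt))


replies: list[str] = []
handle_mention("summarize this MR", WorkhorseEngine(), replies.append)
handle_mention("summarize this MR", RunnerEngine(), replies.append)
print(replies)
```

Swapping the engine changes nothing in `handle_mention`, which is the property the two decisions below rely on.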

Decision 1: Entry point

Dimension	Current mention flow (Sidekiq → Workhorse)	Messaging adapter
How it works	Sidekiq worker POSTs to a headless Workhorse endpoint. Includes an upfront intent classification call (silent, user doesn't see it) to route the request.	`TriggerFlowService` creates a workflow with a callback context. On completion, `CallbackWorker` dispatches the result to the right adapter (Slack, MR note, etc.).
Reusability across surfaces	Purpose-built for MR mentions. Each new surface (Slack, issues, etc.) would need its own integration path, or we would end up rebuilding the messaging adapter pattern.	Single entry point. Adding a new surface means adding an adapter, not a new pipeline.
Routing / intent	Intent classification happens upfront as a silent call before processing.	Routing is handled by which flow is triggered. No intent classification today.
Maturity	Working POC for MR mentions.	Adapter pattern exists. An MR mention adapter would be new but follows the established pattern.
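
The adapter column above can be sketched as a registry keyed by surface, with the callback context deciding where a finished workflow's result goes. This is an illustrative Python sketch only. `CallbackContext`, `ADAPTERS`, and `dispatch` are hypothetical names, not the actual `TriggerFlowService` or `CallbackWorker` implementation, and the adapters return strings instead of calling Slack or the notes API.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class CallbackContext:
    surface: str  # which adapter should deliver the result, e.g. "slack"
    target: str   # where to deliver it, e.g. a channel or an MR reference


# Registry of delivery functions, one per messaging surface.
ADAPTERS: dict[str, Callable[[CallbackContext, str], str]] = {}


def adapter(surface: str):
    """Register a delivery function for one surface."""
    def register(fn):
        ADAPTERS[surface] = fn
        return fn
    return register


@adapter("slack")
def post_slack_reply(ctx: CallbackContext, result: str) -> str:
    return f"slack reply in {ctx.target}: {result}"


@adapter("mr_note")
def post_mr_note(ctx: CallbackContext, result: str) -> str:
    return f"note on {ctx.target}: {result}"


def dispatch(ctx: CallbackContext, result: str) -> str:
    # Called once the workflow has finished; picks the adapter by surface.
    try:
        handler = ADAPTERS[ctx.surface]
    except KeyError:
        raise ValueError(f"no adapter registered for {ctx.surface!r}") from None
    return handler(ctx, result)


print(dispatch(CallbackContext("mr_note", "!123"), "LGTM with two nits"))
```

Under this shape, supporting a new surface is one more registered function, while triggering and completion handling stay unchanged.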

Key question: Do we want a single entry point pattern for all messaging surfaces, or are surface-specific integrations acceptable?

Decision 2: Execution engine

This is independent of the entry point. Either entry point could use either engine.

Dimension	Workhorse headless	CI Runner
Composite identity	Not yet — needs to be added (currently bound to `ai_workflow` scope)	In place
Response time for simple questions	~1-2s	~5s (lightweight workspace project, no clone). Depends on runner availability and would improve with faster runner startup; with fast enough startup, a cloned project could always be provided without response time concerns.
Permission control	The chat flow has write tools that require human approval, and it is unclear how approval would work in a headless scenario. Today the upfront LLM intent classification gates this, so write actions are never routed to agentic chat (More Info).	Controlled via the flow definition (which tools are available); composite identity enforces permission boundaries.
Repo access	Limited to API-provided data	Has access to a development environment (computer): can do API requests but also clone code if it needs deeper context — an extra capability, not a latency penalty by default
Infrastructure	New headless Workhorse service to build and maintain	Uses existing runner infrastructure that all foundational flows already use
Runner compute	Does not consume runner compute	Uses runner compute

Key questions:

Is ~5s acceptable? If yes, the latency difference is not a deciding factor.
How much weight does composite identity carry? If it's a blocker for shipping (as it seems from our security requirements), the runner path is ready today; Workhorse needs additional work.
Is a new execution path justified? There are already several triggering approaches. Is the latency gain worth adding another path to maintain?

Separate topics

These need decisions but are independent of entry point and execution engine. They can be iterated on separately:

Topic	Notes
Which flow / agent to use	For read-only questions, we can restrict the agent to read-only tools. The Duo Developer flow works today but can be swapped out or can delegate to specialized flows.
Context injection	What context do we automatically provide to the agent? API-fetched context vs. cloned repo vs. hybrid. Running a CI job in the lightweight `duo-workspace` project is fast, but if we ran it in the MR's own project, the automatic repo clone for CI jobs could take a long time.
Intent classification	Where does it live — upfront (as in the current mention flow) or within the flow itself? An upfront call adds latency and opacity; an in-flow approach gives the agent more control but may need guardrails.
Chat history / visibility	Currently filtered for chat flow — anything using chat flow shows up there. Needs thought on how agent interactions surface distinctly.
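
The read-only restriction mentioned under "Which flow / agent to use" could be as simple as filtering the tool list before the flow starts. A minimal Python sketch, assuming each tool carries a flag saying whether it writes. The `Tool` type and the tool names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    writes: bool  # True if the tool can modify project state


def read_only_tools(tools: list[Tool]) -> list[Tool]:
    # Drop every tool that can modify state, keeping the original order.
    return [tool for tool in tools if not tool.writes]


all_tools = [
    Tool("get_merge_request", writes=False),
    Tool("read_file", writes=False),
    Tool("create_commit", writes=True),
]

print([tool.name for tool in read_only_tools(all_tools)])
```

Because the filter runs before execution, it does not depend on an upfront intent call or on human approval to keep write actions out of read-only requests.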

Next steps

POC: Validate the messaging adapter path for @GitLabDuo MR mentions end-to-end (latency + response quality)
Discuss and resolve the two decisions above with input from principal engineers across both approaches
Align with product on timelines, since they determine how much time we have to make this work