ADR 008: Duo Messaging Service (754f76e9) · Commits · GitLab.com / Content Sites / handbook

content/handbook/engineering/architecture/design-documents/duo_workflow/_index.md

+23 −0

		@@ -209,6 +209,29 @@ our executors and the Duo Workflow Service and therefore remove the need for our
		executors to proxy requests to the GitLab instance for self-managed as
		documented below.

		#### From messaging services (Slack, Teams, etc.)

		The Duo Messaging Service allows users to trigger workflows from external
		messaging platforms by @mentioning Duo. It uses the same CI pipeline execution
		path as remote workflows but with a different trigger mechanism:

		1. User @mentions Duo in a messaging service (e.g., Slack)
		2. The messaging service sends an event to GitLab Rails
		3. A messaging adapter translates the event into a goal and callback context
		4. The orchestrator resolves the user's `duo_default_namespace`, finds or
		creates a `duo-workspace` project, and triggers a `developer/v1` flow
		5. The agent runs in CI with the same composite identity as Duo Developer
		6. When the workflow completes, a `CallbackWorker` (subscribed to
		`WorkloadFinishedEvent` via EventStore) delivers the result back through
		the adapter to the messaging service

		The adapter pattern allows adding new messaging platforms by implementing a
		small interface (~5 methods) without changing the orchestration or execution
		infrastructure.

		For the full architecture, see
		[ADR 008: Duo Messaging Service](decisions/008_duo_messaging_service.md).

		### Self-managed architecture

		#### With local Workflow service

content/handbook/engineering/architecture/design-documents/duo_workflow/decisions/008_duo_messaging_service.md

0 → 100644

+354 −0

		---
		title: "Duo Agent Platform ADR 008: Duo Messaging Service"
		status: proposed
		creation-date: "2026-04-17"
		authors: [ "@thomas-schmidt" ]
		coach: [ ]
		approvers: [ ]
		owning-stage: "~devops::ai_powered"
		participating-stages: []
		toc_hide: true
		---

		## Context

		We want users to interact with Duo from external messaging services — starting
		with Slack, then Microsoft Teams, WhatsApp, Telegram, and others. A user
		@mentions Duo, gives it a task, and Duo works on it asynchronously and posts
		back the result.

		Two challenges are specific to messaging:

		1. CI pipelines require a project, but messaging services have no project
		context
		2. Multiple messaging platforms need to be supported without duplicating
		orchestration logic

		### Alternatives considered

		Five approaches were investigated:

		1. CI job (Flows API) — Trigger a CI pipeline via the existing Flows
		infrastructure. Battle-tested, ADR 004 compliant, no Workhorse or DWS
		changes. The only approach that provides a real execution environment —
		the agent can git clone, run tests, install tools, and do full development
		tasks. Downside: CI startup latency (~10s with empty project). Requires a
		project for the pipeline — solved by auto-creating a workspace project.

		2. WebSocket blocking — Sidekiq worker opens a WebSocket to Workhorse,
		keeps it open for the full workflow duration. Simple, supports streaming.
		Downside: blocks a Sidekiq thread for up to 5 minutes per request, limiting
		throughput to ~50 concurrent workflows per Sidekiq process. No execution
		environment — the agent runs inside Workhorse with no filesystem, no git,
		no ability to run commands. Limits the agent to read-only API interactions
		with no path to development tasks.

		3. WebSocket fire-and-forget — Sidekiq opens WebSocket, sends start
		request, disconnects immediately. Blocked: prototyping revealed Workhorse
		terminates the workflow when the client disconnects (sends `StopWorkflow` on
		clean close, tears down gRPC on abnormal close). Would require Workhorse
		changes to add a headless/detached mode. Same execution environment
		limitation as option 2.

		4. Direct gRPC — Sidekiq opens a gRPC bidi stream directly to DWS.
		Lower latency, type-safe. Violates ADR 004 (introduces a second path to
		DWS). Must reimplement HTTP action proxying in Ruby. No established pattern
		for gRPC bidi streaming from Sidekiq in the codebase. Same execution
		environment limitation — no filesystem or tooling available.

		5. Workhorse headless HTTP — New Workhorse endpoint that accepts a
		workflow trigger via HTTP POST, manages the gRPC stream internally.
		Requires cross-team Workhorse changes (~50-100 lines of Go) and a
		modified runner lifecycle. Same execution environment limitation as
		options 2-4 — no path to development tasks without additional architecture.

		## Decision

		Use the Flows API (CI job) approach with an adapter pattern for
		multi-platform support and a per-namespace workspace project to provide CI
		context.

		### Architecture

		```mermaid
		graph TB
		classDef messaging fill:#dbeafe,stroke:#93c5fd,color:#1e3a5f
		classDef adapter fill:#d1fae5,stroke:#6ee7b7,color:#065f46
		classDef orchestrator fill:#ffedd5,stroke:#fdba74,color:#7c2d12
		classDef execution fill:#ede9fe,stroke:#c4b5fd,color:#3b0764
		classDef callback fill:#fef9c3,stroke:#fde047,color:#713f12
		classDef workspace fill:#f3e8ff,stroke:#d8b4fe,color:#581c87

		subgraph MSG["💬 MESSAGING SERVICES"]
		Slack(["Slack"])
		MTeams(["Microsoft Teams"])
		Others(["WhatsApp · Telegram · ..."])
		end

		subgraph ADAPT["🔌 ADAPTERS — one per messaging service"]
		direction LR
		SA["Slack Adapter<br/>👀 ✅ ❌"]
		TA["Teams Adapter"]
		OA["..."]
		end

		subgraph ORCH["⚙️ ORCHESTRATOR"]
		direction LR
		O1["Resolve user's<br/>duo_default_namespace<br/>(root namespace)"] --> O2["Find or create<br/>duo-workspace project"] --> O3["Delegate to<br/>ExecuteWorkflowService"]
		end

		subgraph EXEC["🏃 CI RUNNER"]
		CI["Agent executes in duo-workspace<br/><i>Tools · GitLab API · MCP · git clone</i>"]
		end

		subgraph CBGRP["📬 ASYNC CALLBACK"]
		CW["CallbackWorker<br/><i>Subscribes to WorkloadFinishedEvent</i>"]
		end

		subgraph WS["📁 duo-workspace — per top-level namespace"]
		direction LR
		W1["agent-config.yml<br/><i>image · scripts · cache</i>"]
		W2["AGENTS.md<br/><i>instructions</i>"]
		W3["CI/CD vars<br/><i>secrets · keys</i>"]
		W4["Runner tags<br/><i>dedicated runners</i>"]
		end

		Slack --> SA
		MTeams --> TA
		Others --> OA

		SA & TA & OA -->|"goal + callback_context"| ORCH
		O3 -->|"start pipeline"| CI
		CI -.->|"WorkloadFinishedEvent"| CW
		CW -.->|"deliver_result / on_flow_failed"| ADAPT
		O2 -.-|"creates / uses"| WS

		class Slack,MTeams,Others messaging
		class SA,TA,OA adapter
		class O1,O2,O3 orchestrator
		class CI execution
		class CW callback
		class W1,W2,W3,W4 workspace
		```

		Solid arrows = synchronous calls; dashed arrows = async events.

		### Request flow

		```mermaid
		sequenceDiagram
		participant User
		participant Slack as Messaging Service
		participant Adapter
		participant Orchestrator
		participant CI as CI Runner
		participant Worker as CallbackWorker

		User->>Slack: @duo find open MRs for project X
		Slack->>Adapter: event

		rect rgb(209, 250, 229)
		Note right of Adapter: Trigger phase (sync)
		Adapter->>Orchestrator: trigger(goal, callback_context)
		Orchestrator->>Orchestrator: resolve namespace → workspace project
		Orchestrator-->>Adapter: success
		Adapter->>Slack: 👀 on_flow_started
		end

		rect rgb(237, 233, 254)
		Note right of CI: Execution phase (async)
		Orchestrator->>CI: start pipeline
		CI->>CI: Agent uses tools, APIs,<br/>git clone as needed
		end

		rect rgb(254, 249, 195)
		Note right of Worker: Callback phase (async)
		CI-->>Worker: WorkloadFinishedEvent
		Worker->>Worker: Extract answer from checkpoints
		Worker-->>Adapter: deliver_result
		Adapter->>Slack: Post answer in thread
		Adapter->>Slack: 👀 → ✅ on_flow_completed
		end
		```

		### Key design choices

		Agent flow via Flows API, delegating to `ExecuteWorkflowService`. The
		orchestrator triggers an agent flow in the workspace project using the existing
		Flows API. It delegates to the same `ExecuteWorkflowService` used by the
		existing trigger paths, avoiding duplication of privilege handling, token
		generation, and workflow start logic. The messaging service passes the thread
		context as the goal. Initially this uses the same `developer/v1` flow that
		powers Duo Developer, giving the agent full capabilities (tools, GitLab API,
		MCP, git) from day one.

		`duo-workspace` auto-created project. A private, empty project per
		top-level namespace provides CI pipeline context. This is the path forward for
		the internal MVC. The exact project name (`duo-workspace`) is not final and can
		be iterated on in a follow-up. The workspace project is
		created at the root namespace of the user's `duo_default_namespace` — for
		example, if the user's default namespace is `gitlab-org/editor-extensions`, the
		workspace project is created at `gitlab-org/duo-workspace`, not
		`gitlab-org/editor-extensions/duo-workspace`. This keeps one workspace project
		per top-level group, avoiding proliferation of projects across nested
		namespaces.
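The root-namespace rule can be illustrated with a minimal sketch. The helper name `workspace_project_path` and the constant are hypothetical and are not taken from the GitLab codebase; only the `duo-workspace` name and the example paths come from this ADR.

```ruby
# Hypothetical helper: name and signature are illustrative only.
WORKSPACE_PROJECT_NAME = 'duo-workspace' # the ADR notes this name is not final

# Returns the full path of the workspace project for a default namespace path.
# Only the first path segment (the root namespace) is used.
def workspace_project_path(default_namespace_path)
  root = default_namespace_path.to_s.split('/').first
  raise ArgumentError, 'default namespace path is empty' if root.nil? || root.empty?

  "#{root}/#{WORKSPACE_PROJECT_NAME}"
end

puts workspace_project_path('gitlab-org/editor-extensions') # prints gitlab-org/duo-workspace
puts workspace_project_path('gitlab-org')                    # prints gitlab-org/duo-workspace
```

Both a nested and a top-level default namespace resolve to the same project, which is what keeps the count at one workspace project per top-level group.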

		The workspace project is created when the admin enables the `developer/v1` flow
		for the namespace (using admin permissions), with a fallback find-or-create at
		trigger time for robustness. This avoids permission issues since regular users
		may not have `create_projects` access. Existing namespaces that already have
		`developer/v1` enabled before this ships will need a backfill migration
		(follow-up).

		Teams customize the workspace project (Docker image, AGENTS.md, skills, CI
		variables, runner tags) using existing project features. This follows the
		same pattern as Security Policy Projects.

		Namespace resolution via `duo_default_namespace`. No new configuration —
		reuses the existing user preference. The root namespace of this preference
		determines the top-level group for workspace project resolution.

		`developer/v1` must be enabled. The orchestrator validates upfront that the
		`developer/v1` foundational flow is enabled for the user's namespace. If not,
		messaging returns an actionable `:flow_not_enabled` error guiding the user to
		ask their admin to enable it. This early check avoids confusing downstream
		failures (e.g., "Could not resolve service account") and lets each adapter
		craft an appropriate user-facing message.
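A minimal sketch of the upfront check, assuming a hypothetical `TriggerResult` struct and `trigger_flow` entry point. Only the `:flow_not_enabled` reason and the `developer/v1` flow name come from this ADR; the block stands in for delegating to `ExecuteWorkflowService`, and the message wording is invented for illustration.

```ruby
# Hypothetical result object; not the real service response class.
TriggerResult = Struct.new(:status, :reason, keyword_init: true) do
  def success?
    status == :success
  end
end

REQUIRED_FLOW = 'developer/v1'

# enabled_flows stands in for a real lookup of flows enabled for the namespace.
# The block stands in for delegating to ExecuteWorkflowService.
def trigger_flow(enabled_flows:)
  unless enabled_flows.include?(REQUIRED_FLOW)
    return TriggerResult.new(status: :error, reason: :flow_not_enabled)
  end

  yield if block_given?
  TriggerResult.new(status: :success)
end

# Each adapter maps the reason to its own user-facing wording.
def slack_error_text(result)
  case result.reason
  when :flow_not_enabled
    'Duo is not enabled for your namespace yet. Please ask your admin to enable the Developer flow.'
  else
    'Something went wrong while starting Duo.'
  end
end

result = trigger_flow(enabled_flows: [])
puts slack_error_text(result) unless result.success?
```

Returning a symbolic reason rather than a finished message is what lets each adapter phrase the error for its own platform.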

		Adapter pattern. Each messaging platform implements an adapter with
		lifecycle hooks (`deliver_result`, `deliver_error`, `on_flow_started`,
		`on_flow_completed`, `on_flow_failed`). The orchestrator, workspace project,
		and callback infrastructure are shared.

		EventStore callback. `CallbackWorker` subscribes to
		`WorkloadFinishedEvent`, checks for `messaging_callback_context` on the
		workflow record (JSONB column), and delivers results through the adapter.
		No GraphQL, no polling. The callback context contains adapter-specific
		delivery information, e.g. for Slack it could be something like:

		```json
		{
		"adapter": "slack",
		"team_id": "T0123ABC",
		"channel_id": "C0123ABC",
		"thread_ts": "1234567890.123456",
		"message_ts": "1234567890.123456",
		"user_id": "U0123ABC"
		}
		```
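As a rough sketch of how such a context could drive delivery, the code below looks up an adapter by the `adapter` key and calls the hooks named in this ADR. `RecordingAdapter`, `handle_workload_finished`, the argument lists, and the hook ordering on failure are all assumptions made for illustration; a real worker would read the context from the workflow record and extract the answer from checkpoints.

```ruby
require 'json'

# Hypothetical in-memory adapter: records calls instead of calling a platform API.
class RecordingAdapter
  attr_reader :calls

  def initialize
    @calls = []
  end

  def deliver_result(context, answer)
    @calls << [:deliver_result, context['channel_id'], answer]
  end

  def deliver_error(context, message)
    @calls << [:deliver_error, context['channel_id'], message]
  end

  def on_flow_completed(context)
    @calls << [:on_flow_completed, context['message_ts']]
  end

  def on_flow_failed(context)
    @calls << [:on_flow_failed, context['message_ts']]
  end
end

# Hypothetical dispatch. Returns false when no callback context is stored,
# meaning the workflow was not triggered from a messaging service.
def handle_workload_finished(context_json, adapters:, succeeded:, answer: nil)
  return false if context_json.nil?

  context = JSON.parse(context_json)
  adapter = adapters.fetch(context.fetch('adapter'))

  if succeeded
    adapter.deliver_result(context, answer)
    adapter.on_flow_completed(context)
  else
    adapter.deliver_error(context, 'The workflow did not complete.')
    adapter.on_flow_failed(context)
  end
  true
end

slack = RecordingAdapter.new
context_json = '{"adapter":"slack","channel_id":"C0123ABC","message_ts":"1234567890.123456"}'
handle_workload_finished(context_json, adapters: { 'slack' => slack }, succeeded: true, answer: '3 open MRs')
p slack.calls
```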

		Reuses the `developer/v1` catalog service account. Messaging is a trigger
		mechanism for `developer/v1`, not a separate flow. The service account identity
		reflects the flow being executed, not the trigger source. The orchestrator
		resolves the existing SA created when an admin enables the Developer flow for
		the namespace. No separate messaging SA is created. If `developer/v1` is not
		enabled, there is no SA, and messaging returns a clear error. The SA uses
		`composite_identity_enforced: true` — the same security model used by
		Duo Developer and other agent platform flows. Effective permissions are the
		intersection of the triggering user's and the service account's access.
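The intersection rule can be pictured with plain sets. The ability names below are illustrative only, and the real permission model is richer than a flat set; the sketch only shows that an action must be allowed for both identities.

```ruby
require 'set'

# Illustrative ability names, not an exhaustive or authoritative list.
user_abilities = Set[:read_project, :read_merge_request, :create_note]
service_account_abilities = Set[:read_project, :read_merge_request, :push_code]

# Composite identity: an action is allowed only if BOTH identities allow it.
effective_abilities = user_abilities & service_account_abilities

puts effective_abilities.include?(:read_merge_request) # true
puts effective_abilities.include?(:push_code)          # false: the user lacks it
puts effective_abilities.include?(:create_note)        # false: the service account lacks it
```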

		### Path to streaming and human approval

		The architecture extends to real-time progress and interactive features without
		changing the core design:

		```mermaid
		sequenceDiagram
		participant CI as CI Runner
		participant Rails as Rails
		participant CW as CheckpointCallbackWorker
		participant Adapter as Messaging Adapter
		participant Slack as Slack
		participant User as User

		CI->>Rails: Save checkpoint
		Rails-->>CW: CheckpointCreatedEvent (via EventStore)
		CW->>Adapter: on_checkpoint_created(context, diff)
		Adapter->>Slack: Status update ("Searching issues...")

		Note over CI,Slack: When approval is required:
		CI->>Rails: Save checkpoint (approval_required)
		Rails-->>CW: CheckpointCreatedEvent
		CW->>Adapter: on_approval_requested(context, details)
		Adapter->>Slack: Interactive message (Approve / Reject)
		User->>Slack: Clicks "Approve"
		Slack->>Rails: Interaction payload
		Rails->>Rails: Write approval → resume workflow
		```

		A new `CheckpointCallbackWorker` subscribes to a `WorkflowCheckpointCreatedEvent`
		(shown as `CheckpointCreatedEvent` in the diagram above). It is separate from
		`CallbackWorker` because checkpoint events have different characteristics
		(high frequency, different retry semantics). Each step is event-driven; no
		persistent connections are needed. Approval state is persisted on the
		workflow record, so the flow can be stopped and restarted.
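A minimal sketch of how a checkpoint handler might route events to the two future hooks. The hook names come from the adapter interface in this ADR; `ConsoleAdapter`, `handle_checkpoint`, and the checkpoint hash shape (`:status`, `:summary`) are assumptions, since this part of the design is not built yet.

```ruby
# Hypothetical adapter that logs instead of posting to a messaging platform.
class ConsoleAdapter
  attr_reader :log

  def initialize
    @log = []
  end

  def on_checkpoint_created(_context, diff)
    @log << "status: #{diff}"
  end

  def on_approval_requested(_context, details)
    @log << "approval needed: #{details}"
  end
end

# Routes one checkpoint event to the matching hook. Nothing stays connected
# between events; each call is independent.
def handle_checkpoint(checkpoint, context:, adapter:)
  if checkpoint[:status] == :approval_required
    adapter.on_approval_requested(context, checkpoint[:summary])
  else
    adapter.on_checkpoint_created(context, checkpoint[:summary])
  end
end

adapter = ConsoleAdapter.new
context = { 'adapter' => 'slack' }
handle_checkpoint({ status: :running, summary: 'Searching issues...' }, context: context, adapter: adapter)
handle_checkpoint({ status: :approval_required, summary: 'run shell command' }, context: context, adapter: adapter)
puts adapter.log
```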

		### Adapter interface

		A v1 adapter must implement only two methods. Every other hook is optional,
		with a no-op default in the base class, and is added when the corresponding
		infrastructure is built.

		| Method | Purpose | Called by | Required? |
		|---|---|---|---|
		| `deliver_result` | Post the final answer | `CallbackWorker` | Yes |
		| `deliver_error` | Post an error message | `CallbackWorker` | Yes |
		| `on_flow_started` | Signal work started (e.g., 👀) | Trigger service | Optional |
		| `on_flow_completed` | Signal work done (e.g., ✅) | `CallbackWorker` | Optional |
		| `on_flow_failed` | Signal failure (e.g., ❌ + error) | Trigger service and `CallbackWorker` | Optional |
		| `on_checkpoint_created` | Intermediate progress update | `CheckpointCallbackWorker` | Optional (future) |
		| `on_approval_requested` | Post approval prompt | `CheckpointCallbackWorker` | Optional (future) |
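The interface in the table above could be sketched as a base class with two abstract methods and no-op optional hooks. The method names are from the table; the class names and argument lists are assumptions made for illustration and may differ from the real implementation.

```ruby
# Hypothetical base class for messaging adapters.
class BaseMessagingAdapter
  # Required: subclasses must implement these two methods.
  def deliver_result(_context, _answer)
    raise NotImplementedError, "#{self.class} must implement deliver_result"
  end

  def deliver_error(_context, _message)
    raise NotImplementedError, "#{self.class} must implement deliver_error"
  end

  # Optional hooks: no-op by default, overridden only when a platform needs them.
  def on_flow_started(_context); end
  def on_flow_completed(_context); end
  def on_flow_failed(_context, _error = nil); end
  def on_checkpoint_created(_context, _diff); end
  def on_approval_requested(_context, _details); end
end

# Minimal adapter: implements only the two required methods.
class PlainTextAdapter < BaseMessagingAdapter
  attr_reader :sent

  def initialize
    @sent = []
  end

  def deliver_result(_context, answer)
    @sent << "Answer: #{answer}"
  end

  def deliver_error(_context, message)
    @sent << "Error: #{message}"
  end
end

adapter = PlainTextAdapter.new
adapter.on_flow_started({})        # no-op inherited from the base class
adapter.deliver_result({}, 'done')
puts adapter.sent
```

With no-op defaults, callers can invoke every hook unconditionally, and a new platform starts working as soon as the two delivery methods exist.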

		### Responsibility split: pre-flow checks vs adapter lifecycle

		Platform-specific pre-flight checks (authentication, authorization, feature
		flags, license validation) remain in the entry-point service (e.g.,
		`AppMentionedService` for Slack). These happen before Duo is involved and may
		require platform-specific responses (e.g., an OAuth authorization link for
		unlinked Slack users).

		The adapter handles flow lifecycle only: `on_flow_started`, `on_flow_completed`,
		`on_flow_failed`, `deliver_result`, `deliver_error`. This keeps adapter
		implementations focused on delivery mechanics rather than auth logic.

		### Startup time

		| Step | Today (large project) | With duo-workspace |
		|---|---|---|
		| Git clone | Seconds–minutes | Near-instant (empty repo) |
		| Docker image | Default, pulled each time | Custom via `agent-config.yml`, cached |
		| `duo-cli` install | `npm install` each run (~15s) | Pre-baked into custom image |

		Prototyping showed end-to-end response times under 10 seconds with an empty
		workspace project. This is acceptable for async messaging. Teams can optimize
		further by customizing the workspace project (cached images, dedicated
		runners, pre-installed tools).

		## Pros

		- Battle-tested CI/Flows infrastructure — no new execution runtime
		- No Workhorse or DWS changes required
		- ADR 004 compliant
		- Every CI improvement benefits messaging for free
		- Adapter pattern cleanly separates platform-specific concerns
		- Workspace project is a natural customization surface (image, skills, secrets)
		- Streaming and human approval extend the same architecture additively
		(new EventStore subscriptions, new adapter hooks — no core changes)

		## Cons

		- CI startup latency (~10s with empty project) is slower than a direct
		service call, though acceptable for async messaging
		- Auto-creating projects and service accounts adds implicit resources to
		namespaces
		- Adapter hooks are invoked from different call sites (trigger service vs.
		callback worker) — requires clear documentation for new adapter authors

		## Implementation

		- [Issue](https://gitlab.com/gitlab-org/gitlab/-/work_items/590434)

		### Feature flag

		The entire flow is gated behind the
		[`slack_duo_agent`](https://gitlab.com/gitlab-org/gitlab/-/work_items/592185)
		feature flag (per-user), which already gates the `AppMentionedService`.