Commit 754f76e9 authored by Thomas Schmidt, committed by Shekhar Patnaik

ADR 008: Duo Messaging Service

parent d899499f
+23 −0
@@ -209,6 +209,29 @@ our executors and the Duo Workflow Service and therefore remove the need for our
executors to proxy requests to the GitLab instance for self-managed as
documented below.

#### From messaging services (Slack, Teams, etc.)

The Duo Messaging Service allows users to trigger workflows from external
messaging platforms by @mentioning Duo. It uses the same CI pipeline execution
path as remote workflows but with a different trigger mechanism:

1. User @mentions Duo in a messaging service (e.g., Slack)
2. The messaging service sends an event to GitLab Rails
3. A messaging adapter translates the event into a goal and callback context
4. The orchestrator resolves the user's `duo_default_namespace`, finds or
   creates a `duo-workspace` project, and triggers a `developer/v1` flow
5. The agent runs in CI with the same composite identity as Duo Developer
6. When the workflow completes, a `CallbackWorker` (subscribed to
   `WorkloadFinishedEvent` via EventStore) delivers the result back through
   the adapter to the messaging service

The adapter pattern allows adding new messaging platforms by implementing a
small interface (~5 methods) without changing the orchestration or execution
infrastructure.

For the full architecture, see
[ADR 008: Duo Messaging Service](decisions/008_duo_messaging_service.md).

### Self-managed architecture

#### With local Workflow service
+354 −0
---
title: "Duo Agent Platform ADR 008: Duo Messaging Service"
status: proposed
creation-date: "2026-04-17"
authors: [ "@thomas-schmidt" ]
coach: [ ]
approvers: [ ]
owning-stage: "~devops::ai_powered"
participating-stages: []
toc_hide: true
---

## Context

We want users to interact with Duo from external messaging services — starting
with Slack, then Microsoft Teams, WhatsApp, Telegram, and others. A user
@mentions Duo, gives it a task, and Duo works on it asynchronously and posts
back the result.

Two challenges are specific to messaging:

1. CI pipelines require a project, but messaging services have no project
   context
2. Multiple messaging platforms need to be supported without duplicating
   orchestration logic

### Alternatives considered

Five approaches were investigated:

1. **CI job (Flows API)** — Trigger a CI pipeline via the existing Flows
   infrastructure. Battle-tested, ADR 004 compliant, no Workhorse or DWS
   changes. The only approach that provides a real execution environment —
   the agent can git clone, run tests, install tools, and do full development
   tasks. Downside: CI startup latency (~10s with empty project). Requires a
   project for the pipeline — solved by auto-creating a workspace project.

2. **WebSocket blocking** — Sidekiq worker opens a WebSocket to Workhorse,
   keeps it open for the full workflow duration. Simple, supports streaming.
   Downside: blocks a Sidekiq thread for up to 5 minutes per request, limiting
   throughput to ~50 concurrent workflows per Sidekiq process. No execution
   environment — the agent runs inside Workhorse with no filesystem, no git,
   no ability to run commands. Limits the agent to read-only API interactions
   with no path to development tasks.

3. **WebSocket fire-and-forget** — Sidekiq opens WebSocket, sends start
   request, disconnects immediately. **Blocked**: prototyping revealed Workhorse
   terminates the workflow when the client disconnects (sends `StopWorkflow` on
   clean close, tears down gRPC on abnormal close). Would require Workhorse
   changes to add a headless/detached mode. Same execution environment
   limitation as option 2.

4. **Direct gRPC** — Sidekiq opens a gRPC bidi stream directly to DWS.
   Lower latency, type-safe. **Violates ADR 004** (introduces a second path to
   DWS). Must reimplement HTTP action proxying in Ruby. No established pattern
   for gRPC bidi streaming from Sidekiq in the codebase. Same execution
   environment limitation — no filesystem or tooling available.

5. **Workhorse headless HTTP** — New Workhorse endpoint that accepts a
   workflow trigger via HTTP POST, manages the gRPC stream internally.
   **Requires cross-team Workhorse changes** (~50-100 lines of Go) and a
   modified runner lifecycle. Same execution environment limitation as
   options 2-4 — no path to development tasks without additional architecture.
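
The throughput limit quoted for option 2 can be turned into a sustained start rate with a little arithmetic. The sketch below uses only the two figures from the text (about 50 concurrent workflows per Sidekiq process, up to 5 minutes each) and assumes the worst case where every workflow runs for the full 5 minutes; the method name is illustrative.

```ruby
# Back-of-envelope capacity for option 2 (WebSocket blocking).
# Both inputs come from the text above; treating every workflow as hitting
# the 5-minute ceiling is a worst-case assumption.
CONCURRENT_WORKFLOWS_PER_PROCESS = 50 # threads tied up at once
MAX_WORKFLOW_MINUTES = 5              # upper bound a thread stays blocked

# With every slot busy for the full duration, a slot frees up only once per
# workflow duration, so the sustained start rate is slots / duration.
def sustained_starts_per_minute(slots, minutes_per_workflow)
  slots.fdiv(minutes_per_workflow)
end

puts sustained_starts_per_minute(CONCURRENT_WORKFLOWS_PER_PROCESS, MAX_WORKFLOW_MINUTES)
```

Under those assumptions a single process could start roughly 10 new workflows per minute before requests begin to queue.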

## Decision

Use the **Flows API (CI job)** approach with an **adapter pattern** for
multi-platform support and a **per-namespace workspace project** to provide CI
context.

### Architecture

```mermaid
graph TB
    classDef messaging fill:#dbeafe,stroke:#93c5fd,color:#1e3a5f
    classDef adapter fill:#d1fae5,stroke:#6ee7b7,color:#065f46
    classDef orchestrator fill:#ffedd5,stroke:#fdba74,color:#7c2d12
    classDef execution fill:#ede9fe,stroke:#c4b5fd,color:#3b0764
    classDef callback fill:#fef9c3,stroke:#fde047,color:#713f12
    classDef workspace fill:#f3e8ff,stroke:#d8b4fe,color:#581c87

    subgraph MSG["💬 MESSAGING SERVICES"]
        Slack(["Slack"])
        MTeams(["Microsoft Teams"])
        Others(["WhatsApp · Telegram · ..."])
    end

    subgraph ADAPT["🔌 ADAPTERS — one per messaging service"]
        direction LR
        SA["Slack Adapter<br/>👀 ✅ ❌"]
        TA["Teams Adapter"]
        OA["..."]
    end

    subgraph ORCH["⚙️ ORCHESTRATOR"]
        direction LR
        O1["Resolve user's<br/>duo_default_namespace<br/>(root namespace)"] --> O2["Find or create<br/>duo-workspace project"] --> O3["Delegate to<br/>ExecuteWorkflowService"]
    end

    subgraph EXEC["🏃 CI RUNNER"]
        CI["Agent executes in duo-workspace<br/><i>Tools · GitLab API · MCP · git clone</i>"]
    end

    subgraph CBGRP["📬 ASYNC CALLBACK"]
        CW["CallbackWorker<br/><i>Subscribes to WorkloadFinishedEvent</i>"]
    end

    subgraph WS["📁 duo-workspace — per top-level namespace"]
        direction LR
        W1["agent-config.yml<br/><i>image · scripts · cache</i>"]
        W2["AGENTS.md<br/><i>instructions</i>"]
        W3["CI/CD vars<br/><i>secrets · keys</i>"]
        W4["Runner tags<br/><i>dedicated runners</i>"]
    end

    Slack --> SA
    MTeams --> TA
    Others --> OA

    SA & TA & OA -->|"goal + callback_context"| ORCH
    O3 -->|"start pipeline"| CI
    CI -.->|"WorkloadFinishedEvent"| CW
    CW -.->|"deliver_result / on_flow_failed"| ADAPT
    O2 -.-|"creates / uses"| WS

    class Slack,MTeams,Others messaging
    class SA,TA,OA adapter
    class O1,O2,O3 orchestrator
    class CI execution
    class CW callback
    class W1,W2,W3,W4 workspace
```

**Solid arrows** = synchronous calls &nbsp;&nbsp; **Dashed arrows** = async events

### Request flow

```mermaid
sequenceDiagram
    participant User
    participant Slack as Messaging Service
    participant Adapter
    participant Orchestrator
    participant CI as CI Runner
    participant Worker as CallbackWorker

    User->>Slack: @duo find open MRs for project X
    Slack->>Adapter: event

    rect rgb(209, 250, 229)
        Note right of Adapter: Trigger phase (sync)
        Adapter->>Orchestrator: trigger(goal, callback_context)
        Orchestrator->>Orchestrator: resolve namespace → workspace project
        Orchestrator-->>Adapter: success
        Adapter->>Slack: 👀 on_flow_started
    end

    rect rgb(237, 233, 254)
        Note right of CI: Execution phase (async)
        Orchestrator->>CI: start pipeline
        CI->>CI: Agent uses tools, APIs,<br/>git clone as needed
    end

    rect rgb(254, 249, 195)
        Note right of Worker: Callback phase (async)
        CI-->>Worker: WorkloadFinishedEvent
        Worker->>Worker: Extract answer from checkpoints
        Worker-->>Adapter: deliver_result
        Adapter->>Slack: Post answer in thread
        Adapter->>Slack: 👀 → ✅ on_flow_completed
    end
```

### Key design choices

**Agent flow via Flows API, delegating to `ExecuteWorkflowService`.** The
orchestrator triggers an agent flow in the workspace project using the existing
Flows API. It delegates to the same `ExecuteWorkflowService` used by the
existing trigger paths, avoiding duplication of privilege handling, token
generation, and workflow start logic. The messaging service passes the thread
context as the goal. Initially this uses the same `developer/v1` flow that
powers Duo Developer, giving the agent full capabilities (tools, GitLab API,
MCP, git) from day one.

**`duo-workspace` auto-created project.** A private, empty project per
top-level namespace provides CI pipeline context. This is the path forward for
the internal MVC. The exact project name (`duo-workspace`) is not final and can
be iterated on in a follow-up. The workspace project is
created at the **root namespace** of the user's `duo_default_namespace` — for
example, if the user's default namespace is `gitlab-org/editor-extensions`, the
workspace project is created at `gitlab-org/duo-workspace`, not
`gitlab-org/editor-extensions/duo-workspace`. This keeps one workspace project
per top-level group, avoiding proliferation of projects across nested
namespaces.
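
The root-namespace rule above can be illustrated with a small sketch. The helper name `workspace_project_path` is hypothetical, and plain string splitting stands in for however Rails actually walks the namespace hierarchy.

```ruby
# Hypothetical helper: derives the workspace project path from the full path
# of a user's duo_default_namespace. `duo-workspace` is the provisional
# project name from the text.
WORKSPACE_PROJECT_NAME = "duo-workspace"

def workspace_project_path(default_namespace_path)
  # The first path segment is the top-level (root) namespace.
  root = default_namespace_path.split("/").first
  raise ArgumentError, "namespace path is empty" if root.nil? || root.empty?

  "#{root}/#{WORKSPACE_PROJECT_NAME}"
end

puts workspace_project_path("gitlab-org/editor-extensions")
puts workspace_project_path("gitlab-org")
```

Both calls resolve to `gitlab-org/duo-workspace`, which is the point of the rule: nested namespaces share one workspace project.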

The workspace project is created when the admin enables the `developer/v1` flow
for the namespace (using admin permissions), with a fallback find-or-create at
trigger time for robustness. This avoids permission issues since regular users
may not have `create_projects` access. Existing namespaces that already have
`developer/v1` enabled before this ships will need a backfill migration
(follow-up).
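
The two creation paths (on enablement, and as a fallback at trigger time) only work together if the operation is idempotent. A minimal in-memory sketch, with a hypothetical `SketchWorkspaceRegistry` standing in for the projects table:

```ruby
# In-memory stand-in for project storage. A real implementation would need a
# database-level uniqueness guarantee to stay safe under concurrent triggers.
class SketchWorkspaceRegistry
  def initialize
    @projects = {}
  end

  # Called when an admin enables developer/v1 (primary path) and again at
  # trigger time (fallback). Both calls converge on the same project.
  def find_or_create(root_namespace)
    @projects[root_namespace] ||= {
      path: "#{root_namespace}/duo-workspace",
      visibility: :private
    }
  end

  def count
    @projects.size
  end
end

registry = SketchWorkspaceRegistry.new
on_enable = registry.find_or_create("gitlab-org")
at_trigger = registry.find_or_create("gitlab-org")
puts on_enable.equal?(at_trigger)
```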

Teams customize the workspace project (Docker image, AGENTS.md, skills, CI
variables, runner tags) using existing project features. This follows the same
pattern as Security Policy Projects.

**Namespace resolution via `duo_default_namespace`.** No new configuration —
reuses the existing user preference. The root namespace of this preference
determines the top-level group for workspace project resolution.

**`developer/v1` must be enabled.** The orchestrator validates upfront that the
`developer/v1` foundational flow is enabled for the user's namespace. If not,
messaging returns an actionable `:flow_not_enabled` error guiding the user to
ask their admin to enable it. This early check avoids confusing downstream
failures (e.g., "Could not resolve service account") and lets each adapter
craft an appropriate user-facing message.
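
A rough sketch of the upfront check follows. Only the `:flow_not_enabled` reason and the `developer/v1` flow name come from the text; `TriggerResult`, `SketchOrchestrator`, `user_message_for`, and the message wording are hypothetical.

```ruby
# Hypothetical result object returned by the orchestrator.
TriggerResult = Struct.new(:status, :reason, keyword_init: true) do
  def success?
    status == :success
  end
end

class SketchOrchestrator
  REQUIRED_FLOW = "developer/v1"

  # enabled_flows_by_namespace, e.g. { "gitlab-org" => ["developer/v1"] },
  # stands in for whatever lookup Rails performs.
  def initialize(enabled_flows_by_namespace)
    @enabled_flows = enabled_flows_by_namespace
  end

  def trigger(root_namespace:, goal:)
    # Fail early, before any service account or project lookup.
    unless @enabled_flows.fetch(root_namespace, []).include?(REQUIRED_FLOW)
      return TriggerResult.new(status: :error, reason: :flow_not_enabled)
    end

    # Workspace resolution and the pipeline start for `goal` would go here.
    TriggerResult.new(status: :success, reason: nil)
  end
end

# Each adapter maps the shared error reason to its own wording.
def user_message_for(result)
  case result.reason
  when :flow_not_enabled
    "Duo is not enabled for your group yet. Ask an admin to enable the Developer flow."
  else
    "Something went wrong while starting Duo."
  end
end
```

Returning a symbolic reason rather than a formatted string keeps the orchestrator platform-neutral and leaves presentation to the adapter.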

**Adapter pattern.** Each messaging platform implements an adapter with
lifecycle hooks (`deliver_result`, `deliver_error`, `on_flow_started`,
`on_flow_completed`, `on_flow_failed`). The orchestrator, workspace project,
and callback infrastructure are shared.

**EventStore callback.** `CallbackWorker` subscribes to
`WorkloadFinishedEvent`, checks for `messaging_callback_context` on the
workflow record (JSONB column), and delivers results through the adapter.
No GraphQL, no polling. The callback context contains adapter-specific
delivery information, e.g. for Slack it could be something like:

```json
{
  "adapter": "slack",
  "team_id": "T0123ABC",
  "channel_id": "C0123ABC",
  "thread_ts": "1234567890.123456",
  "message_ts": "1234567890.123456",
  "user_id": "U0123ABC"
}
```
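
To show how such a context might drive delivery, here is a hedged sketch of the routing step. The adapter class, the registry hash, and `handle_workload_finished` are hypothetical; only the `adapter` key, the Slack field names, and the `deliver_result` / `deliver_error` hooks come from this document.

```ruby
require "json"

# Hypothetical adapter that records what it would post instead of calling
# the Slack API.
class SketchSlackAdapter
  attr_reader :posted

  def initialize
    @posted = []
  end

  def deliver_result(context, answer)
    @posted << { channel: context["channel_id"], thread_ts: context["thread_ts"], text: answer }
  end

  def deliver_error(context, message)
    @posted << { channel: context["channel_id"], thread_ts: context["thread_ts"], text: "Error: #{message}" }
  end
end

# Mirrors what a WorkloadFinishedEvent subscriber might do: skip workflows
# without a messaging context, otherwise route to the adapter named in it.
def handle_workload_finished(raw_context, adapters:, answer: nil, error: nil)
  return :skipped if raw_context.nil?

  context = JSON.parse(raw_context)
  adapter = adapters.fetch(context.fetch("adapter"))
  if error
    adapter.deliver_error(context, error)
  else
    adapter.deliver_result(context, answer)
  end
  :delivered
end
```

Because the worker only reads the `adapter` key, the remaining fields stay opaque to shared code and can differ per platform.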

**Reuses the `developer/v1` catalog service account.** Messaging is a trigger
mechanism for `developer/v1`, not a separate flow. The service account identity
reflects the flow being executed, not the trigger source. The orchestrator
resolves the existing service account (SA) created when an admin enables the
Developer flow for the namespace. No separate messaging SA is created. If
`developer/v1` is not enabled, there is no SA, and messaging returns a clear
error. The SA uses
`composite_identity_enforced: true` — the same security model used by
Duo Developer and other agent platform flows. Effective permissions are the
intersection of the triggering user's and the service account's access.
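
The intersection rule can be made concrete with a toy example. Abilities are modelled here as plain symbols, which is a simplification of how GitLab evaluates policies; the ability names and the helper are illustrative only.

```ruby
require "set"

# Illustrative only: an action is allowed when BOTH the triggering user and
# the service account are allowed to perform it.
def effective_abilities(user_abilities, service_account_abilities)
  user_abilities.to_set & service_account_abilities.to_set
end

user = %i[read_project read_issue create_merge_request]
service_account = %i[read_project create_merge_request push_code]

p effective_abilities(user, service_account).to_a.sort
```

In this example `read_issue` drops out because the service account lacks it, and `push_code` drops out because the user lacks it.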

### Path to streaming and human approval

The architecture extends to real-time progress and interactive features without
changing the core design:

```mermaid
sequenceDiagram
    participant CI as CI Runner
    participant Rails as Rails
    participant CW as CheckpointCallbackWorker
    participant Adapter as Messaging Adapter
    participant Slack as Slack
    participant User as User

    CI->>Rails: Save checkpoint
    Rails-->>CW: CheckpointCreatedEvent (via EventStore)
    CW->>Adapter: on_checkpoint_created(context, diff)
    Adapter->>Slack: Status update ("Searching issues...")

    Note over CI,Slack: When approval is required:
    CI->>Rails: Save checkpoint (approval_required)
    Rails-->>CW: CheckpointCreatedEvent
    CW->>Adapter: on_approval_requested(context, details)
    Adapter->>Slack: Interactive message (Approve / Reject)
    User->>Slack: Clicks "Approve"
    Slack->>Rails: Interaction payload
    Rails->>Rails: Write approval → resume workflow
```

A new `CheckpointCallbackWorker` subscribes to a `WorkflowCheckpointCreatedEvent`
(shown as `CheckpointCreatedEvent` in the diagram above). It is separate from
`CallbackWorker` because checkpoint events have different characteristics
(high frequency, different retry semantics). Each step is event-driven; no
persistent connections are needed. Approval state is persisted on the workflow
record and the flow can be stopped and restarted.
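
A minimal sketch of how a checkpoint subscriber could choose between the two future hooks. The checkpoint hash shape, the `:approval_required` key, and both class and method names other than the hook names are assumptions made for illustration.

```ruby
# Hypothetical adapter that records hook invocations.
class RecordingAdapter
  attr_reader :calls

  def initialize
    @calls = []
  end

  def on_checkpoint_created(_context, diff)
    @calls << [:on_checkpoint_created, diff]
  end

  def on_approval_requested(_context, details)
    @calls << [:on_approval_requested, details]
  end
end

# Routes one checkpoint event to the matching adapter hook. Each call is
# independent, so no connection has to stay open between checkpoints.
def handle_checkpoint(adapter, context, checkpoint)
  if checkpoint[:approval_required]
    adapter.on_approval_requested(context, checkpoint[:details])
  else
    adapter.on_checkpoint_created(context, checkpoint[:diff])
  end
end

adapter = RecordingAdapter.new
handle_checkpoint(adapter, {}, { diff: "Searching issues..." })
handle_checkpoint(adapter, {}, { approval_required: true, details: "Run tests?" })
p adapter.calls
```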

### Adapter interface

The v1 adapter needs only two required methods: `deliver_result` and
`deliver_error`. All other hooks are optional, with no-op defaults in the base
class, and are added when the corresponding infrastructure is built.

| Method | Purpose | Called by | Required? |
|---|---|---|---|
| `deliver_result` | Post the final answer | `CallbackWorker` | Yes |
| `deliver_error` | Post an error message | `CallbackWorker` | Yes |
| `on_flow_started` | Signal work started (e.g., 👀) | Trigger service | Optional |
| `on_flow_completed` | Signal work done (e.g., ✅) | `CallbackWorker` | Optional |
| `on_flow_failed` | Signal failure (e.g., ❌ + error) | Trigger service and `CallbackWorker` | Optional |
| `on_checkpoint_created` | Intermediate progress update | `CheckpointCallbackWorker` | Optional (future) |
| `on_approval_requested` | Post approval prompt | `CheckpointCallbackWorker` | Optional (future) |
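
One way the table above could translate into a base class is sketched below. Method names follow the table; the class names, argument lists, and the `conversation_id` context key are assumptions, not the actual implementation.

```ruby
# Hypothetical base class: required hooks raise, optional hooks are no-ops.
class SketchBaseAdapter
  def deliver_result(_context, _answer)
    raise NotImplementedError, "#{self.class} must implement deliver_result"
  end

  def deliver_error(_context, _message)
    raise NotImplementedError, "#{self.class} must implement deliver_error"
  end

  # Optional lifecycle hooks default to doing nothing.
  def on_flow_started(_context); end
  def on_flow_completed(_context); end
  def on_flow_failed(_context, _message); end
  def on_checkpoint_created(_context, _diff); end
  def on_approval_requested(_context, _details); end
end

# A new platform only has to override the two required methods.
class SketchTeamsAdapter < SketchBaseAdapter
  attr_reader :outbox

  def initialize
    @outbox = []
  end

  def deliver_result(context, answer)
    @outbox << [context["conversation_id"], answer]
  end

  def deliver_error(context, message)
    @outbox << [context["conversation_id"], "Error: #{message}"]
  end
end
```

With this shape, shared code can call every hook unconditionally, and an adapter that ignores reactions or progress updates needs no extra code.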

### Responsibility split: pre-flow checks vs adapter lifecycle

Platform-specific pre-flight checks (authentication, authorization, feature
flags, license validation) remain in the entry-point service (e.g.,
`AppMentionedService` for Slack). These happen before Duo is involved and may
require platform-specific responses (e.g., an OAuth authorization link for
unlinked Slack users).

The adapter handles flow lifecycle only: `on_flow_started`, `on_flow_completed`,
`on_flow_failed`, `deliver_result`, `deliver_error`. This keeps adapter
implementations focused on delivery mechanics rather than auth logic.

### Startup time

| Step | Today (large project) | With duo-workspace |
|---|---|---|
| Git clone | Seconds–minutes | Near-instant (empty repo) |
| Docker image | Default, pulled each time | Custom via `agent-config.yml`, cached |
| `duo-cli` install | `npm install` each run (~15s) | Pre-baked into custom image |

Prototyping showed end-to-end response times under 10 seconds with an empty
workspace project. This is acceptable for async messaging. Teams can optimize
further by customizing the workspace project (cached images, dedicated runners,
pre-installed tools).

## Pros

- Battle-tested CI/Flows infrastructure — no new execution runtime
- No Workhorse or DWS changes required
- ADR 004 compliant
- Every CI improvement benefits messaging for free
- Adapter pattern cleanly separates platform-specific concerns
- Workspace project is a natural customization surface (image, skills, secrets)
- Streaming and human approval extend the same architecture additively
  (new EventStore subscriptions, new adapter hooks — no core changes)

## Cons

- CI startup latency (~10s with empty project) is slower than a direct
  service call, though acceptable for async messaging
- Auto-creating projects and service accounts adds implicit resources to
  namespaces
- Adapter hooks are invoked from different call sites (trigger service vs.
  callback worker) — requires clear documentation for new adapter authors

## Implementation

- [Issue](https://gitlab.com/gitlab-org/gitlab/-/work_items/590434)

### Feature flag

The entire flow is gated behind the
[`slack_duo_agent`](https://gitlab.com/gitlab-org/gitlab/-/work_items/592185)
feature flag (per-user), which already gates the `AppMentionedService`.