Approval Human In The Loop Node (HITL) (#20652) · Epics · GitLab.org

Approval Human In The Loop Node (HITL)

### Release Notes

You can now insert Human-in-the-Loop (HITL) approval checkpoints directly into your AI Custom Flows, giving you the ability to pause execution and review what the agent has done and what it plans to do next. With explicit Approve, Reject, or Modify buttons surfaced right in the agent session details view, you stay in control of critical decisions without sacrificing the speed and automation benefits of running AI-powered flows.

## Background

**Problem Statement:** Enterprise users building automated workflows with AI agents need the ability to pause execution for human review and approval. Without this capability, organizations cannot safely deploy agentic workflows in high-stakes scenarios such as production deployments, security-sensitive operations, or compliance-regulated processes.

**Why This Matters:**

1. **Trust & Safety:** Non-deterministic AI agents require human oversight at critical decision points to prevent costly mistakes
2. **Regulatory Compliance:** EU AI Act and other regulations mandate human-in-the-loop observability and multi-person approval for automated decision systems
3. **Enterprise Adoption Blocker:** To deploy complex automation flows, Enterprises will expect HITL capabilities
4. **Competitive Gap:** All major competitors (LangGraph, AutoGen, n8n) offer native HITL functionality

**Current State:** GitLab Duo Agent Platform flows run continuously without interruption. Users cannot inject human judgment, validate agent decisions, or pause for approval before critical actions execute.

**Desired State:** Users can configure human approval checkpoints at any point in their flows, with flexible notification methods, approval routing, and validation rules that match their organizational requirements.

## Main Goals

Enable enterprises to safely adopt AI agent workflows by providing human oversight and verification at critical decision points, reducing risk while maintaining automation benefits.

**The Problem:** Enterprises cannot safely deploy autonomous AI agents for critical operations because LLM non-determinism creates unacceptable risk. Without human verification at decision points, one agent mistake could cause production outages, security breaches, or compliance violations.

**Our Solution:** Configurable Human-in-the-Loop nodes that let users define exactly when and how human oversight occurs, balancing automation speed with human judgment at critical moments.

**This Enables:**

- :white_check_mark: **Trust:** Humans verify high-stakes decisions before execution
- :white_check_mark: **Compliance:** Meet EU AI Act and regulatory requirements for human oversight
- :white_check_mark: **Adoption:** Remove one of the blockers to enterprise agentic workflow deployment
- :white_check_mark: **Learning:** Capture human corrections to improve agent behavior over time

### Secondary Goals

1. Integrate seamlessly with GitLab's existing approval and notification systems
2. Support both synchronous (immediate) and asynchronous (delayed) approval patterns
3. Provide complete audit trails for compliance and post-mortems
4. Enable flexible multi-approver workflows (any/all logic)
5. Allow policy enforcement through approval validation rules

<details>
<summary>

### Example Scenarios

</summary>

#### 1) GitLab Internal Example

Trigger: Merge Request approved and merged to main branch

Agent Action 1: Analyze changed files

- Scans commit diff
- Identifies: database migrations, API changes, config updates
- Runs SAST/DAST security scans
- Checks compliance with OWASP/PCI-DSS rules

↓

Agent Action 2: Generate deployment plan

- Creates rollback strategy
- Estimates deployment window
- Identifies affected microservices
- Calculates blast radius (% of users affected)

↓

:octagonal_sign: HITL CHECKPOINT: Security & Compliance Review

- Pauses workflow
- Creates GitLab Issue with deployment summary
- @mentions security team lead and SRE on-call
- Displays in custom GitLab UI panel:
  * Risk score: HIGH (touches payment processing)
  * Database changes: 3 migrations detected
  * Security scan results: 0 critical, 2 medium findings
  * Compliance checklist: :white_check_mark: PCI-DSS, :white_check_mark: SOC2, :warning: Manual review needed
  * Estimated deployment time: 15 minutes

↓

Human Decision Options:

- :white_check_mark: APPROVE → Continue to deployment
- :warning: APPROVE WITH CONDITIONS → Deploy to canary environment first
- :pencil2: REQUEST CHANGES → Specify what needs fixing, workflow stops
- :x: REJECT → Block deployment, require manual intervention

↓

\[If Approved\] Agent Action 3: Execute deployment

- Deploys to production using GitLab CI/CD
- Monitors deployment health
- Runs smoke tests
- Updates deployment tracking in GitLab

↓

Agent Action 4: Post-deployment validation

- Monitors error rates for 30 minutes
- Checks performance metrics
- If anomalies detected → automatic rollback + alert

↓

Agent Action 5: Documentation

- Auto-updates deployment log in GitLab Wiki
- Creates incident response runbook if new endpoints added
- Comments on original MR with deployment results

#### 2) High Priority bugfix

"Customer X (Enterprise, $500K ARR) reports payment processing failures"

↓

Agent Action 1: Analyze customer impact (Multi-system)

- Queries Salesforce API: Customer tier, ARR, contract terms
- Queries AWS CloudWatch: Error rates, affected users
- Queries GitLab: Recent deployments to payment service
- Queries PagerDuty: Current on-call engineer

↓

Agent Decision:

- Customer tier: Enterprise (top 10 customer)
- Revenue impact: $500K ARR at risk
- Affected users: 1,247 users (8% of customer's user base)
- Recent change: Payment service deployed 2 hours ago
- Classification: **P0 - Production Incident**

↓

Agent Action 2: Create incident response (Jira + GitLab)

- Creates Jira incident ticket with severity P0
- Links Salesforce case to Jira
- Creates GitLab Issue in payment-service project
- @mentions on-call SRE and payment team lead
- Starts incident timeline in PagerDuty

↓

Agent Action 3: Generate hotfix proposal

- Analyzes recent commits in GitLab
- Identifies: Commit abc123 introduced validation bug
- Generates rollback MR OR fix MR based on complexity
- Runs tests in isolated GitLab CI/CD pipeline
- Estimates: "Rollback recommended - 5 min fix vs 2 hour debug"

↓

:octagonal_sign: HITL CHECKPOINT #1: Incident Commander Approval

- Pauses workflow
- Sends Slack notification to #incidents channel
- Opens approval panel in:
  * GitLab (primary)
  * Jira incident ticket (comment with link)
  * Slack (interactive buttons)
- Displays context:
  * Customer: X (Enterprise, $500K ARR)
  * Impact: 1,247 users affected
  * Recommendation: ROLLBACK to commit xyz789
  * Alternative: Apply hotfix
  * Tests: :white_check_mark: All passing on rollback branch
  * Estimated resolution: 15 minutes with rollback

↓

Incident Commander Decision:

- :white_check_mark: APPROVE ROLLBACK
- :warning: APPROVE HOTFIX (requires engineering lead approval)
- :pencil2: REQUEST INVESTIGATION (escalate to engineering VP)
- :wrench: MANUAL INTERVENTION (disable auto-remediation)

↓

\[If Approved Rollback\] Agent Action 4: Execute rollback

- Creates rollback MR in GitLab
- Auto-merges (emergency bypass enabled)
- Triggers GitLab CI/CD pipeline
- Deploys to AWS ECS using GitLab runners
- Monitors AWS CloudWatch metrics

↓

Agent Action 5: Validate fix

- Runs synthetic transaction tests
- Checks error rates in Datadog/CloudWatch
- Validates with test transaction in Salesforce sandbox

↓

:octagonal_sign: HITL CHECKPOINT #2: Validation Confirmation

- Pauses before customer notification
- Shows results dashboard:
  * Error rate: 0.1% → 0.01% :white_check_mark:
  * Test transactions: 100/100 successful :white_check_mark:
  * Customer-specific test: Payment processed :white_check_mark:
  * System health: All metrics normal :white_check_mark:
- Asks: "Confirm fix successful before notifying customer?"

↓

Customer Success Decision:

- :white_check_mark: CONFIRMED - Notify customer
- :warning: PARTIAL - Monitor for 30 more minutes
- :x: ISSUE PERSISTS - Re-escalate

↓

\[If Confirmed\] Agent Action 6: Close incident loop (Multi-system)

- Salesforce:
  - Updates case with resolution details
  - Adds timeline of fix
  - Sets case status: "Resolved"
  - Schedules follow-up call
- Jira:
  - Transitions ticket to "Resolved"
  - Links to rollback MR in GitLab
  - Adds RCA template for post-mortem
- GitLab:
  - Merges rollback MR
  - Creates follow-up issue: "Investigate payment validation bug"
  - Assigns to payment team
  - Labels: "tech-debt", "post-mortem-required"
- PagerDuty:
  - Resolves incident
  - Logs resolution time: 18 minutes
- Slack:
  - Posts resolution summary to #incidents
  - Thanks responders
  - Schedules post-mortem meeting
- AWS:
  - Tags deployment in AWS with incident ID
  - Updates CloudWatch dashboard with annotation

↓

Agent Action 7: Generate post-mortem draft

- Analyzes incident timeline
- Identifies root cause
- Suggests preventive measures
- Creates GitLab Wiki page
- Assigns to incident commander for review

</details>

## Technical Components (High-Level)

1. **HITL Node (Flow Builder UI)**
   - Drag-and-drop node in visual flow builder
   - Configuration panel for approval settings
2. **Approval Engine**
   - Manages approval state and routing
3. **Notification Service**
   - Sends approval requests via configured channels
   - Tracks delivery and read receipts
4. **Validation Service**
   - Enforces approval policies
   - Checks permissions and business rules
5. **Audit Log**
   - Records all approval decisions
   - Queryable for compliance reporting

## MVC Version

1) **Approval Request Delivery Options:** Configure how approval requests reach approvers:
   1) GitLab Platform:
      - GitLab To-Do
      - Email (with approval link)
2) **Approval Routing:** Configure who must approve for the flow to continue:
   1) **Single Approver:**
      - Default approver (flow initiator)
3) **Approval Decision Options:** Approvers can respond with:
   1) **Basic Decisions:**
      - :white_check_mark: **APPROVE:** Continue flow as planned
      - :x: **REJECT:** Block flow execution, end workflow
      - :pencil2: **MODIFY:** Block flow, provide feedback
   2) **Input Methods:**
      - **Visual:** Buttons/dropdowns in approval UI (primary)
      - **Text-based:** Natural language in comments (parsed by agent)
        - Example: Comment "Reject - use CONCURRENT for index creation"

## Post MVC Additions

1) **Approval Request Delivery Options:** Configure how approval requests reach approvers:
   1) GitLab Platform:
      - Duo Chat mention (if approver is online)
      - GitLab Issue (comment with approval widget)
   2) External Options:
      - Slack/Teams interactive message
      - Custom webhook to external approval system
2) **Approval Routing:** Configure who must approve for the flow to continue:
   - Flow creator (default)
   - Specific user (@username)
   - First available from a list
   - **Multiple Approvers (requires approval_mode setting):**
     - **ANY (OR logic):** Any 1 out of N approvers can approve
       - Example: "Any SRE from \[@alice, @bob, @charlie\]"
     - **ALL (AND logic):** All N approvers must approve
       - Example: "Both @security-lead AND @compliance-officer"
   - **User Groups:**
     - GitLab group members (e.g., @security-team)
     - Custom role (e.g., "SRE on-call" from PagerDuty integration)
   - **Dynamic Routing:**
     - Route based on flow state (e.g., if risk=HIGH, require 2 approvers)
     - Escalation rules (e.g., if no response in 1 hour, notify manager)
3) **Timeout Configuration:** Define what happens if approvers don't respond:
   - **Timeout Settings:**
     - Duration: 5m, 15m, 7d, etc. (configurable)
     - Action on timeout:
       - **Auto-continue:** Proceed as if approved
       - **Auto-cancel:** End workflow execution (default)
       - **Escalate:** Notify different approver group
       - **Custom action:** Run specific node (e.g., notify Slack, create incident)
4) **Reviewer Notes**
   - Allow approvers to add context when approving/rejecting
   - Stored in audit log for compliance
   - Example: "Approved - but requires follow-up security audit in Q2"
5) **HITL Validation Service**
   - Validate approval responses before accepting
   - Check: User permissions, business rules, compliance requirements
   - Example validations:
     - "Is approver in security team?" (permission check)
     - "Is CI/CD pipeline green?" (business rule)
     - "Is it outside deployment window?" (time-based rule)
   - Block invalid approvals with clear error messages
6) **Handoff Mode (Asynchronous Pattern)**
   - For long-running approvals (hours/days)
   - Workflow terminates after HITL request (frees resources)
   - New workflow starts when human responds
   - Use case: Weekend security reviews, multi-day legal approvals
   - Technical: Requires state persistence and webhook callbacks
7) **Time Travel / Checkpoint Rewind**
   - Allow rewinding to previous checkpoint and replaying forward
   - Use cases:
     - Wrong approver responded → rewind and get correct approval
     - Agent made wrong decision → rewind and guide differently
     - Testing/debugging → replay scenario with different inputs
   - Requires: Checkpoint storage, state restoration logic
   - See detailed example section below:

<details>
<summary>

###### Time Travel Example:

</summary>

- **User Story:** "As a flow creator, when someone makes a mistake in the approval process (wrong person approved, wrong decision made), I want to rewind the workflow to that checkpoint and fix the mistake without restarting the entire flow."
- **Technical Requirements:**
  - Store checkpoint at every node execution (not just HITL nodes)
  - Checkpoint includes: full state, timestamp, node ID, metadata
  - UI shows checkpoint timeline with ability to select any past checkpoint
  - Rewinding invalidates all checkpoints after the selected one
  - Replaying from checkpoint maintains audit trail (shows "rewound from checkpoint X")
- **Practical example:**
  1) Example: Wrong Approver
     - Checkpoint 1: MR created for production deployment
     - Checkpoint 2: Agent analyzes → requires SRE approval
     - Checkpoint 3: HITL approval request sent
     - Checkpoint 4: Junior engineer (Bob) approves
     - Checkpoint 5: Validation fails: "Bob is not authorized"
  2) With Time Travel:
     - :clock1: Admin rewinds to checkpoint 3 (before Bob approved)
     - Admin updates state:
       - Remove Bob's approval attempt
       - Notify correct approvers (SRE team)
     - Replay from checkpoint 3:
       - Checkpoint 4: Senior SRE (Alice) approves
       - Checkpoint 5: Validation passes :white_check_mark:
       - Checkpoint 6: Deployment proceeds

</details>
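
The ANY/ALL approval modes listed under Approval Routing could be modeled roughly as follows. This is a minimal sketch, not the platform's implementation: `ApprovalRequest`, `ApprovalMode`, `Decision`, and the status strings are hypothetical names, and the rule that a single REJECT or MODIFY blocks the flow is an assumption this epic does not spell out for multi-approver cases.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class ApprovalMode(Enum):
    ANY = "any"  # OR logic: one approval out of N is enough
    ALL = "all"  # AND logic: every configured approver must approve


class Decision(Enum):
    APPROVE = "approve"
    REJECT = "reject"
    MODIFY = "modify"


@dataclass
class ApprovalRequest:
    """Hypothetical in-memory model of one HITL checkpoint."""

    approvers: set[str]
    mode: ApprovalMode = ApprovalMode.ANY
    decisions: dict[str, Decision] = field(default_factory=dict)

    def record(self, user: str, decision: Decision) -> None:
        # Only configured approvers may respond.
        if user not in self.approvers:
            raise PermissionError(f"{user} is not a configured approver")
        self.decisions[user] = decision

    def status(self) -> str:
        # Assumption: any REJECT or MODIFY blocks the flow immediately.
        if any(d is not Decision.APPROVE for d in self.decisions.values()):
            return "blocked"
        approved = len(self.decisions)
        if self.mode is ApprovalMode.ANY and approved >= 1:
            return "approved"
        if self.mode is ApprovalMode.ALL and approved == len(self.approvers):
            return "approved"
        return "pending"
```

With `mode=ApprovalMode.ALL` and approvers `{"security-lead", "compliance-officer"}`, the request stays `"pending"` until both have approved, which matches the "Both @security-lead AND @compliance-officer" example.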
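
The Timeout Configuration item gives durations such as `5m`, `15m`, and `7d` and names four timeout actions. One way to evaluate that setting is sketched below. The function names, the action strings, and support for an `h` (hours) unit are assumptions for illustration; only the minute and day forms appear in this epic.

```python
import re
from datetime import datetime, timedelta, timezone

# Assumption: "h" is accepted alongside the "m" and "d" units shown in the epic.
_UNITS = {"m": "minutes", "h": "hours", "d": "days"}


def parse_duration(text: str) -> timedelta:
    """Parse a duration like '5m', '15m' or '7d' into a timedelta."""
    match = re.fullmatch(r"(\d+)([mhd])", text.strip())
    if match is None:
        raise ValueError(f"unsupported duration: {text!r}")
    amount, unit = match.groups()
    return timedelta(**{_UNITS[unit]: int(amount)})


def resolve_timeout(
    requested_at: datetime,
    now: datetime,
    duration: str,
    action: str = "auto_cancel",  # the epic lists Auto-cancel as the default
) -> str:
    """Return 'waiting' while the window is open, else the configured action."""
    if now - requested_at < parse_duration(duration):
        return "waiting"
    return action
```

A scheduler could call `resolve_timeout` periodically and dispatch on the returned string (`"auto_continue"`, `"auto_cancel"`, `"escalate"`, or a custom node identifier).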
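
The checkpoint rewind requirements (store a checkpoint per node, invalidate later checkpoints on rewind, keep the rewind in the audit trail) can be illustrated with a small in-memory sketch. `Checkpoint` and `CheckpointTimeline` are hypothetical names, and a real implementation would need persistent storage rather than a Python list.

```python
from __future__ import annotations

import copy
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Any


@dataclass
class Checkpoint:
    index: int
    node_id: str
    state: dict[str, Any]
    timestamp: datetime


@dataclass
class CheckpointTimeline:
    """Hypothetical store: one checkpoint per node execution."""

    checkpoints: list[Checkpoint] = field(default_factory=list)
    audit_log: list[str] = field(default_factory=list)

    def record(self, node_id: str, state: dict[str, Any]) -> Checkpoint:
        cp = Checkpoint(
            index=len(self.checkpoints) + 1,
            node_id=node_id,
            state=copy.deepcopy(state),  # snapshot, not a live reference
            timestamp=datetime.now(timezone.utc),
        )
        self.checkpoints.append(cp)
        self.audit_log.append(f"checkpoint {cp.index}: {node_id}")
        return cp

    def rewind(self, index: int) -> dict[str, Any]:
        """Drop every checkpoint after `index` and return its saved state."""
        if not 1 <= index <= len(self.checkpoints):
            raise IndexError(f"no checkpoint {index}")
        del self.checkpoints[index:]  # invalidates all later checkpoints
        self.audit_log.append(f"rewound from checkpoint {index}")
        return copy.deepcopy(self.checkpoints[-1].state)
```

In the wrong-approver example, calling `rewind(3)` after checkpoint 5 leaves checkpoints 1 to 3 in place, so the next `record` call becomes the new checkpoint 4 while the audit log still shows Bob's original attempt and the rewind.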
