Add gitlab-ai-principles-distiller gem and AI Catalog flow

What does this MR do and why?

Add the implementation pieces that the weekly principles-sync CI job will eventually drive: a standalone gem (gems/gitlab-ai-principles-distiller/) containing the orchestrator script, the AI Catalog flow provisioner, the distillation prompt, and the manifest config. The CI YAML wiring lives in a follow-up MR (!235014) so AI/Catalog reviewers and pipeline maintainers can each focus on their own domains.

This MR delivers issue https://gitlab.com/gitlab-org/gitlab/-/issues/599663. The follow-up CI integration is tracked in https://gitlab.com/gitlab-org/gitlab/-/issues/597600.

Background

An earlier prototype of this script called Duo once per principle, through either the Duo CLI or the chat/completions REST endpoint. Both backends fail in production:

  • Duo CLI: Errno::E2BIG on Linux runners. The security prompt (~170 KiB) exceeds Linux MAX_ARG_STRLEN of 128 KiB per argv string. macOS ARG_MAX=1MiB masked the bug locally.
  • chat/completions REST: hard 1000-character cap on the content field (ee/lib/api/chat.rb:50).
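
The argv limit can be checked up front. The following is a minimal sketch, not code from the gem: fits_in_argv? is a hypothetical helper, and it assumes the Linux limit of 128 KiB per argv string includes the terminating NUL byte.

```ruby
# Linux caps each individual argv string at MAX_ARG_STRLEN (128 KiB),
# independently of the total size allowed for all arguments.
MAX_ARG_STRLEN = 128 * 1024

# Returns true when the string can be passed as a single argv entry on Linux.
# Assumption: the limit includes the terminating NUL byte, hence the +1.
def fits_in_argv?(arg)
  arg.bytesize + 1 <= MAX_ARG_STRLEN
end

prompt = "x" * (170 * 1024) # stand-in for the ~170 KiB security prompt
puts fits_in_argv?(prompt)
puts fits_in_argv?("short goal")
```

A prompt of that size fails the check, which is the condition under which spawning the CLI raises Errno::E2BIG on Linux runners.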

Per Fabio's suggestion, this MR adopts the Duo Agent Platform Workflow API (POST /api/v4/ai/duo_workflows/workflows with start_workflow=true). The agent runs server-side in a child CI pipeline, reads files directly from the source branch via gitaly, and reports results through the GraphQL duoWorkflowWorkflows query. There is no argv-size or content-length issue.
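
As a rough illustration of the call shape, the sketch below builds (but does not send) the start request. Only the endpoint path, start_workflow, and ai_catalog_item_consumer_id come from this description. The project_id and goal fields, the Bearer header, and the helper name build_start_workflow_request are assumptions made for the example.

```ruby
require "json"
require "net/http"
require "uri"

# Builds the POST that starts a Duo Workflow without sending it.
# project_id and goal are illustrative assumptions, not confirmed payload fields.
def build_start_workflow_request(host:, token:, project:, consumer_id:, goal:)
  uri = URI.join(host, "/api/v4/ai/duo_workflows/workflows")
  request = Net::HTTP::Post.new(uri)
  request["Authorization"] = "Bearer #{token}"
  request["Content-Type"] = "application/json"
  request.body = JSON.generate(
    project_id: project,
    goal: goal,
    start_workflow: true,
    ai_catalog_item_consumer_id: consumer_id
  )
  request
end

request = build_start_workflow_request(
  host: "https://gitlab.example.com",
  token: "placeholder-token",
  project: "gitlab-org/gitlab",
  consumer_id: 123,
  goal: "Distill the security principle"
)
puts request.path
puts request.body
```

Because the prompt travels in the flow definition and the request body rather than in argv, neither of the limits listed above applies to this path.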

The Workflow API requires an AI Catalog item to drive the run. Since ai_catalog_item_consumer_id only accepts items of type flow, we provision a Catalog Flow (not Agent) whose YAML definition has a single AgentComponent carrying the distillation prompt and a curated read-only toolset. See Custom agents docs for context.

Implementation

The implementation ships as a standalone gem under gems/gitlab-ai-principles-distiller/ (extracted from earlier scripts/ai/principles_distiller/ per reviewer feedback for clearer ownership boundaries).

Library (lib/gitlab/principles_distiller/):

  • sync.rb — orchestrator. Instance-based class (one Sync per CLI invocation, one per spec example). Detects per-principle drift via the existing checksum frontmatter, triggers one Duo Workflow per affected principle (capped at 4 in parallel via MAX_CONCURRENT_DISTILLATIONS), polls each workflow via GraphQL until terminal state (finished / failed / stopped), extracts the assistant's final message from latestCheckpoint.duoMessages, and writes it back. With --push, it opens an MR via the REST API. Mixes in Sync::AutoMr for branch-and-MR helpers.
  • sync/manifest.rb (Sync::Manifest class): manifest loading, validation, frontmatter handling, affected-principle detection, AGENTS.md / SKILL.md generation, and prerequisite-note injection. Owns the SSOT file_cache (per-instance mutex) and runs validate_principles! at load time to fail-fast on missing sources: keys.
  • sync/workflow.rb (Sync::Workflow class): Duo Workflow API client + polling + diagnostics. Shared across all parallel_distill threads; memoized ||= readers are safe under sharing because each computes a deterministic value from ENV.
  • sync/auto_mr.rb (Sync::AutoMr module, mixed into Sync): builds and pushes the auto-MR (including per-principle SSOT diff embedding, idempotent same-day re-runs, error-cleanup rescue path).
  • sync/diff.rb (Sync::Diff module, extend self): stateless text helpers for stripping LLM preamble, suppressing rephrasing noise (Jaccard similarity over word tokens, MATCH_THRESHOLD = 0.6), and deciding whether output is materially different from the prior version.
  • graphql_client.rb: GraphqlClient shared by Workflow and ProvisionFlow. Returns the parsed data payload, or raises GraphqlClient::Error on transport failure, non-2xx, or a non-empty errors array. Error policy (warn vs abort) is the caller's choice.
  • workspace.rb — process-wide pointer to the repository workspace. Set via --workspace or falls back to $CI_PROJECT_DIR.
  • env.rb — env var names read by the gem, as constants so typos surface as NameError rather than silent nil reads.
  • provision_flow.rb — idempotent provisioner for the AI Catalog Flow "Agent Principles Distiller" in gitlab-org/gitlab. Creates the flow on first run, releases a new version only when the YAML definition (system prompt + toolset) has drifted, and ensures an ItemConsumer exists binding the flow to the project. Designed to run before sync.rb so prompt edits in git automatically propagate to the catalog.
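
To illustrate the rephrasing-noise check in sync/diff.rb, here is a minimal sketch of Jaccard similarity over word tokens. The 0.6 threshold is the one quoted above. The tokenizer regex, the helper names, and treating a score equal to the threshold as a match are assumptions for the example, not the gem's actual code.

```ruby
require "set"

MATCH_THRESHOLD = 0.6 # value quoted in the description above

# Lowercased word tokens as a set. The tokenizer regex is an assumption.
def word_tokens(text)
  text.downcase.scan(/[[:alnum:]_]+/).to_set
end

# Jaccard similarity: |A ∩ B| / |A ∪ B|. Two empty inputs count as identical.
def jaccard(a, b)
  left = word_tokens(a)
  right = word_tokens(b)
  union = left | right
  return 1.0 if union.empty?

  (left & right).size.to_f / union.size
end

# Lines scoring at or above the threshold are treated as a rephrasing.
def rephrasing?(old_line, new_line)
  jaccard(old_line, new_line) >= MATCH_THRESHOLD
end

puts rephrasing?("Validate all user input", "Validate all user input carefully")
puts rephrasing?("Validate all user input", "Prefer keyset pagination")
```

The first pair shares four of five distinct tokens (0.8) and would be suppressed as noise; the second pair shares none and would count as a material change.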

Binaries (bin/):

  • gitlab-ai-principles-distiller-sync — thin wrapper around Gitlab::PrinciplesDistiller::Sync.run
  • gitlab-ai-principles-distiller-provision-flow — thin wrapper around Gitlab::PrinciplesDistiller::ProvisionFlow.run

Both binaries accept --workspace PATH so callers (CI jobs, local invocations) can point at the GitLab checkout containing the SoT data files. Default is ENV['CI_PROJECT_DIR']. Required env vars (AGENT_PRINCIPLES_CATALOG_PROJECT, CI_DEFAULT_BRANCH, and GITLAB_API_TOKEN for --push) abort with helpful messages when missing.
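
A minimal sketch of the pattern, assuming hypothetical constant names (the three variable names are the ones listed above; the Env module layout and missing_env helper are illustrative). The example warns rather than aborts so it can run to completion.

```ruby
# Hypothetical stand-in for env.rb: variable names live in constants, so a typo
# such as Env::CATALOG_PROJET raises NameError instead of silently reading nil.
module Env
  CATALOG_PROJECT = "AGENT_PRINCIPLES_CATALOG_PROJECT"
  DEFAULT_BRANCH  = "CI_DEFAULT_BRANCH"
  API_TOKEN       = "GITLAB_API_TOKEN"
end

# Returns the names of required variables that are unset or blank.
# The API token is only required when --push is requested.
def missing_env(env, push:)
  required = [Env::CATALOG_PROJECT, Env::DEFAULT_BRANCH]
  required << Env::API_TOKEN if push
  required.select { |name| env[name].to_s.strip.empty? }
end

missing = missing_env({ "CI_DEFAULT_BRANCH" => "master" }, push: true)
warn "Missing required environment variables: #{missing.join(', ')}" unless missing.empty?
```

Passing the environment in as a hash, rather than reading ENV directly, keeps the check easy to exercise from specs.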

Source-of-truth content (lives in the monolith, not the gem):

  • .ai/principles/distillation_prompt.md — the system prompt for the distiller flow. Edits here flow into the catalog via provision_flow.rb.
  • .ai/principles/manifest.yml — renamed from sources.yml (matches the existing MANIFEST_PATH constant; reflects that the file is the SoT for both source listings and target/auto-MR config). Adds an auto_mr: block (branch prefix, title template, labels, remove_source_branch) so target settings live in YAML rather than hardcoded in Ruby. The script fails fast if any required key is missing.
  • .ai/principles/README.md — documents the two-stage flow (provisioner + sync), required CI variables, local invocation, reviewer expectations, and the auth limitation flagged below.
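
The fail-fast check on the auto_mr: block could look roughly like the sketch below. The key spellings and sample values are assumptions derived from the settings listed above (branch prefix, title template, labels, remove_source_branch); validate_auto_mr! is a hypothetical helper name.

```ruby
require "yaml"

# Key names mirror the settings listed above. Their exact spelling in
# manifest.yml is an assumption.
REQUIRED_AUTO_MR_KEYS = %w[branch_prefix title_template labels remove_source_branch].freeze

# Raises with a descriptive message when the block or any required key is absent.
def validate_auto_mr!(manifest)
  config = manifest["auto_mr"]
  raise ArgumentError, "manifest is missing the auto_mr: block" unless config.is_a?(Hash)

  missing = REQUIRED_AUTO_MR_KEYS.reject { |key| config.key?(key) }
  raise ArgumentError, "auto_mr is missing keys: #{missing.join(', ')}" unless missing.empty?

  config
end

# Illustrative values only.
manifest = YAML.safe_load(<<~YAML)
  auto_mr:
    branch_prefix: principles-sync
    title_template: "Sync AI principles"
    labels: [documentation]
    remove_source_branch: true
YAML

puts validate_auto_mr!(manifest)["branch_prefix"]
```

Checking key presence rather than truthiness lets remove_source_branch: false pass validation.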

The catalog Flow has been provisioned in production (Flow ID gid://gitlab/Ai::Catalog::Item/1009160, Consumer ID 7368818). Idempotent re-runs of provision_flow.rb confirm no drift between the production catalog and the YAML definition checked in here.
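
The idempotency property amounts to a three-way decision. The sketch below is one way to express it, assuming drift is detected by comparing parsed YAML structures; the actual provisioner may compare strings or checksums instead, and provisioning_action is a hypothetical name.

```ruby
require "yaml"

# Decides what an idempotent provisioner should do. remote_definition is the
# YAML currently released in the catalog, or nil when the flow does not exist.
def provisioning_action(remote_definition, local_definition)
  return :create if remote_definition.nil?

  # Compare parsed structures so indentation differences do not count as drift.
  same = YAML.safe_load(remote_definition) == YAML.safe_load(local_definition)
  same ? :noop : :release_new_version
end

local = "components:\n  - name: distiller\n    prompt: v2\n"
puts provisioning_action(nil, local)
puts provisioning_action("components:\n- name: distiller\n  prompt: v1\n", local)
puts provisioning_action("components:\n- name: distiller\n  prompt: v2\n", local)
```

A re-run against an unchanged definition falls into the :noop branch, which is the behavior the production re-runs above confirm.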

Gem packaging notes

  • The gem is registered on RubyGems.org as a reserved stub at version 0.0.0 (raise "Reserved for GitLab" pattern, per docs). Local development version is 0.1.0 via path: reference.
  • gitlab_rubygems co-ownership is awaiting email confirmation by the gem-account holder. To unblock CI, the gem's .gitlab-ci.yml temporarily sets skip_gem_validation: true. Tracked in https://gitlab.com/gitlab-org/gitlab/-/issues/599747 (mirrors the precedent in gems/csv_builder/).
  • Gemfile.lock includes 5 platforms (arm64-darwin, ruby, x86_64-darwin, x86_64-linux, x86_64-linux-gnu) — narrower than typical sibling gems since CI runs on x86_64 Linux only and dev happens on Darwin.
  • Direct dependencies are kept minimal: runtime depends only on rainbow; dev depends on rake, rspec, gitlab-styles, rubocop-rspec. The lockfile resolves to 44 gems total.

Risk

The gem and flow definition land dormant. There is no caller until the follow-up CI integration MR (!235014) adds the job. Local manual invocation is the only path to exercise the code until then (from inside gems/gitlab-ai-principles-distiller/):

AGENT_PRINCIPLES_CATALOG_PROJECT=gitlab-org/gitlab \
CI_DEFAULT_BRANCH=master \
  bin/gitlab-ai-principles-distiller-sync --dry-run --workspace ../..

Reviewing this MR

The commit history reflects an iterative cleanup cycle on top of the initial implementation. Reviewers reading commit-by-commit will see roughly four phases:

Phase 1 — Initial implementation, packaging, and review-feedback iterations. Original scripts, gem extraction, gem packaging, retry/heartbeat hardening, error diagnostics, and the move of auto-MR target config into manifest.yml.

Phase 2 — Adversarial review pass. An end-to-end self-review surfaced ~15 small defects and design smells; each landed as its own commit so the rationale is bisectable. Highlights: an uninitialized-local bug in Workflow#poll's ever_running flag; shell-injection consistency (last three backtick calls converted to arg-form IO.popen); fail-fast on principles missing sources:; per-instance mutex on Manifest's file_cache; canonical to_json (not inspect) for checksum field encoding; thread-safety invariants for parallel distillation; the GraphQL transport extracted to a shared GraphqlClient (used by both Workflow and ProvisionFlow with different rescue policies).
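
For reviewers unfamiliar with the backtick-to-IO.popen conversion, the sketch below shows why the array form is safer. It is illustrative only: run_with_args is a hypothetical helper, and the child process is Ruby itself standing in for the commands the gem actually runs.

```ruby
require "rbconfig"

# Array form: IO.popen executes the program directly with no shell in between,
# so each element arrives as one literal argument.
def run_with_args(*argv)
  IO.popen(argv, &:read)
end

# A value that would run a second command if interpolated into a backtick string.
hostile = "main; echo INJECTED"

# The child simply echoes back its first argument.
output = run_with_args(RbConfig.ruby, "-e", "print ARGV.first", hostile)
puts output
```

The child receives the whole string, semicolon included, as a single argument, so nothing after the semicolon is interpreted as a command.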

Phase 3 — Refactor + structural cleanup. Sync converted from singleton-class to instance-based (no more reset!, no more global mutable state); ~35 private methods clustered under private keywords in each class/module; comments trimmed (-271 lines of over-explanatory commentary accumulated during the iteration); dead code removed (generate_agents_md tree, GENERATED_PATHS); sync_spec.rb split into per-class spec files matching the lib layout.

Phase 4 — AR sign-off additions (per access request !43931 (closed)). Docs expansion in .ai/principles/README.md documenting why this gem uses a non-composite-identity SA (code-anchored argument that composite_identity_enforced is not required on the SA for our caller path) plus a "Schedule ownership & recovery" runbook. Workflow log lines now tag the workflow ID with the principle name for easier triage of parallel-distillation traces. Auto-MR descriptions now embed permalinks to .ai/principles/manifest.yml and the CI YAML. Cross-domain approval-routing follow-up tracked in #599920.

The aggregate diff in this MR's "Changes" tab shows the renamed source file as a deletion plus an addition rather than as a rename, because git's rename detection gives up when the two versions have low similarity. Per-commit diffs, and the spec and manifest renames in the aggregate diff, still display correctly.

Tests

Per-class specs under gems/gitlab-ai-principles-distiller/spec/gitlab/principles_distiller/:

  • sync_spec.rb — orchestration: distill_and_write_principles, announce_distillation_start, distill_principle retry loop, parse_options, MAX_CONCURRENT_DISTILLATIONS.
  • sync/manifest_spec.rb — manifest loading + validation, frontmatter handling, checksum routing, AGENTS.md/SKILL.md generation, prerequisite-note injection, thread-safe file caching, auto_mr_config validation.
  • sync/workflow_spec.rb — Duo Workflow API client: assistant-content extraction, failure-details diagnostics, sleep-with-heartbeat, build_goal/build_additional_context, validate_config!.
  • sync/auto_mr_spec.rb — branch+MR flow: principle diff section assembly, truncate_diff caps, find_open_mr_iid, prefetch_prior_shas!, push_remote_url, create_branch_and_mr end-to-end including error-cleanup rescue.
  • sync/diff_spec.rb — stateless text helpers (strip_preamble, reduce_noise, meaningful?, internal Jaccard/section helpers via Diff.send).
  • graphql_client_spec.rb — shared GraphQL client transport: data payload, error handling, Authorization header, query+variables JSON body, host normalization.
  • provision_flow_spec.rb — prompt loading, project gid resolution, init, parse options.

Total: 176 examples, all passing locally and in the gem's child CI pipeline.

Run locally with:

cd gems/gitlab-ai-principles-distiller
bundle exec rspec spec/

MR acceptance checklist

Evaluate this MR against the MR acceptance checklist.

Edited by Pedro Pombeiro
