Add gitlab-ai-principles-distiller gem and AI Catalog flow
What does this MR do and why?
Add the implementation pieces that the weekly principles-sync CI job will eventually drive: a standalone gem (gems/gitlab-ai-principles-distiller/) containing the orchestrator script, the AI Catalog flow provisioner, the distillation prompt, and the manifest config. The CI YAML wiring lives in a follow-up MR (!235014) so AI/Catalog reviewers and pipeline maintainers can each focus on their own domains.
This MR delivers issue https://gitlab.com/gitlab-org/gitlab/-/issues/599663. The follow-up CI integration tracks https://gitlab.com/gitlab-org/gitlab/-/issues/597600.
Background
An earlier prototype of this script called Duo once per principle, through either the Duo CLI or the chat/completions REST endpoint. Both backends fail in production:
- Duo CLI: Errno::E2BIG on Linux runners. The security prompt (~170 KiB) exceeds Linux MAX_ARG_STRLEN of 128 KiB per argv string. macOS ARG_MAX=1MiB masked the bug locally.
- chat/completions REST: hard 1000-character cap on the content field (ee/lib/api/chat.rb:50).
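The Duo CLI failure can be reproduced without spawning anything. The sketch below is illustrative only and not part of the gem: it assumes the Linux default of 131072 bytes for MAX_ARG_STRLEN, with the trailing NUL counted against the limit, and fits_in_argv? is a hypothetical helper name.

```ruby
# Linux caps every single argv/envp string at MAX_ARG_STRLEN
# (assumed here: 128 KiB, counting the terminating NUL byte).
MAX_ARG_STRLEN = 128 * 1024

# Returns true when +arg+ is small enough to pass as one argv string on Linux.
# Hypothetical helper, for illustration only.
def fits_in_argv?(arg)
  arg.bytesize + 1 <= MAX_ARG_STRLEN
end

prompt = 'x' * (170 * 1024) # stand-in for the ~170 KiB security prompt
puts fits_in_argv?(prompt)          # false: exec would raise Errno::E2BIG
puts fits_in_argv?('short prompt')  # true
```

A guard like this only explains the failure. Moving the prompt out of argv entirely, as the Workflow API does, is what removes it.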
Per Fabio's suggestion, this MR adopts the Duo Agent Platform Workflow API (POST /api/v4/ai/duo_workflows/workflows with start_workflow=true). The agent runs server-side in a child CI pipeline, reads files directly from the source branch via gitaly, and reports results through the GraphQL duoWorkflowWorkflows query. There is no argv-size or content-length issue.
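The poll-until-terminal pattern can be sketched as below. This is a minimal illustration, not the code in sync/workflow.rb: the status source is injected as a callable instead of running the real GraphQL query, and the interval, attempt cap, method name, and status casing are all assumptions.

```ruby
# Terminal states as listed in this MR description.
TERMINAL_STATES = %w[finished failed stopped].freeze

# Polls +fetch_status+ (any callable returning the current status string)
# until a terminal state is seen or +max_attempts+ polls have been made.
# +sleeper+ is injectable so specs can skip real sleeping.
def poll_until_terminal(fetch_status, interval: 5, max_attempts: 120, sleeper: method(:sleep))
  max_attempts.times do
    status = fetch_status.call.to_s.downcase
    return status if TERMINAL_STATES.include?(status)

    sleeper.call(interval)
  end
  raise "workflow did not reach a terminal state after #{max_attempts} polls"
end

# Usage with a canned status sequence and a no-op sleeper.
statuses = %w[CREATED RUNNING RUNNING FINISHED].each
puts poll_until_terminal(-> { statuses.next }, sleeper: ->(_seconds) {})
```

Injecting the sleeper and the status source keeps the loop testable without network access or wall-clock delays.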
The Workflow API requires an AI Catalog item to drive the run. Since ai_catalog_item_consumer_id only accepts items of type flow, we provision a Catalog Flow (not Agent) whose YAML definition has a single AgentComponent carrying the distillation prompt and a curated read-only toolset. See Custom agents docs for context.
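For orientation, a start-workflow request might be assembled roughly as follows. Only the endpoint path, start_workflow, and ai_catalog_item_consumer_id come from this description. The project_id and goal fields, the PRIVATE-TOKEN header choice, and the helper name are assumptions made for the sketch, and the request is built but never sent.

```ruby
require 'json'
require 'net/http'
require 'uri'

# Builds (does not send) a POST that would start one Duo Workflow run.
# Hypothetical helper; project_id and goal are assumed field names.
def build_start_workflow_request(host:, token:, project_id:, consumer_id:, goal:)
  uri = URI.join(host, '/api/v4/ai/duo_workflows/workflows')
  request = Net::HTTP::Post.new(uri)
  request['PRIVATE-TOKEN'] = token
  request['Content-Type'] = 'application/json'
  request.body = JSON.generate(
    project_id: project_id,
    goal: goal,
    ai_catalog_item_consumer_id: consumer_id, # must reference an item of type flow
    start_workflow: true
  )
  request
end

request = build_start_workflow_request(
  host: 'https://gitlab.com',
  token: 'placeholder-token',
  project_id: 'gitlab-org/gitlab',
  consumer_id: 7_368_818,
  goal: 'Distill the security principle'
)
puts request.path
puts request.body
```

Because the prompt travels in the catalog flow definition and the files are read server-side, the request body stays small regardless of prompt size.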
Implementation
The implementation ships as a standalone gem under gems/gitlab-ai-principles-distiller/ (extracted from earlier scripts/ai/principles_distiller/ per reviewer feedback for clearer ownership boundaries).
Library (lib/gitlab/principles_distiller/):
- sync.rb: orchestrator. Instance-based class (one Sync per CLI invocation, one per spec example). Detects per-principle drift via the existing checksum frontmatter, triggers one Duo Workflow per affected principle (capped at 4 in parallel via MAX_CONCURRENT_DISTILLATIONS), polls each workflow via GraphQL until terminal state (finished/failed/stopped), extracts the assistant's final message from latestCheckpoint.duoMessages, and writes it back. With --push, it opens an MR via the REST API. Mixes in Sync::AutoMr for branch-and-MR helpers.
- sync/manifest.rb: Sync::Manifest class. Manifest loading, validation, frontmatter handling, affected-principle detection, AGENTS.md / SKILL.md generation, and prerequisite-note injection. Owns the SSOT file_cache (per-instance mutex) and runs validate_principles! at load time to fail fast on missing sources: keys.
- sync/workflow.rb: Sync::Workflow class. Duo Workflow API client + polling + diagnostics. Shared across all parallel_distill threads; memoized ||= readers are safe under sharing because each computes a deterministic value from ENV.
- sync/auto_mr.rb: Sync::AutoMr module mixed into Sync. Builds and pushes the auto-MR (including per-principle SSOT diff embedding, idempotent same-day re-runs, error-cleanup rescue path).
- sync/diff.rb: Sync::Diff module (extend self). Stateless text helpers for stripping LLM preamble, suppressing rephrasing noise (Jaccard similarity over word tokens, MATCH_THRESHOLD = 0.6), and deciding whether output is materially different from the prior version.
- graphql_client.rb: GraphqlClient shared by Workflow and ProvisionFlow. Returns the parsed data payload, or raises GraphqlClient::Error on transport failure, non-2xx, or a non-empty errors array. Error policy (warn vs abort) is the caller's choice.
- workspace.rb: process-wide pointer to the repository workspace. Set via --workspace or falls back to $CI_PROJECT_DIR.
- env.rb: env var names read by the gem, as constants so typos surface as NameError rather than silent nil reads.
- provision_flow.rb: idempotent provisioner for the AI Catalog Flow "Agent Principles Distiller" in gitlab-org/gitlab. Creates the flow on first run, releases a new version only when the YAML definition (system prompt + toolset) has drifted, and ensures an ItemConsumer exists binding the flow to the project. Designed to run before sync.rb so prompt edits in git automatically propagate to the catalog.
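The rephrasing-noise check in sync/diff.rb relies on Jaccard similarity over word tokens. The sketch below shows the general technique under stated assumptions and is not the module's actual code: the tokenizer regex, the >= comparison against the threshold, and the method names are guesses. Only the measure itself and MATCH_THRESHOLD = 0.6 come from this description.

```ruby
require 'set'

MATCH_THRESHOLD = 0.6

# Lowercased set of word tokens (assumed tokenization).
def word_tokens(text)
  text.downcase.scan(/[[:alnum:]_]+/).to_set
end

# Jaccard similarity: |A & B| / |A | B|, defined as 1.0 for two empty inputs.
def jaccard(left_text, right_text)
  left = word_tokens(left_text)
  right = word_tokens(right_text)
  union = left | right
  return 1.0 if union.empty?

  (left & right).size.to_f / union.size
end

# True when the new line looks like a rewording of the old one.
def rephrasing?(old_line, new_line)
  jaccard(old_line, new_line) >= MATCH_THRESHOLD
end

old_line = 'Always validate user input before saving'
new_line = 'Always validate user input prior to saving'
puts jaccard(old_line, new_line)                                   # 0.625 (5 shared / 8 total)
puts rephrasing?(old_line, new_line)                               # true
puts rephrasing?('Use parameterized queries', 'Never log secrets') # false
```

Lines that clear the threshold would be treated as noise and kept as they were, so the auto-MR diff shows only material changes.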
Binaries (bin/):
- gitlab-ai-principles-distiller-sync: thin wrapper around Gitlab::PrinciplesDistiller::Sync.run
- gitlab-ai-principles-distiller-provision-flow: thin wrapper around Gitlab::PrinciplesDistiller::ProvisionFlow.run
Both binaries accept --workspace PATH so callers (CI jobs, local invocations) can point at the GitLab checkout containing the SoT data files. Default is ENV['CI_PROJECT_DIR']. Required env vars (AGENT_PRINCIPLES_CATALOG_PROJECT, CI_DEFAULT_BRANCH, and GITLAB_API_TOKEN for --push) abort with helpful messages when missing.
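The constants-plus-abort pattern for env vars can be sketched as follows. The variable names are the ones listed above. The constant names, the module layout, the fetch_required! helper, and the message text are hypothetical and may not match env.rb.

```ruby
# Env var names as constants: a typo such as Env::CATALOG_PROJCT raises
# NameError at the call site instead of silently reading nil from ENV.
module Env
  CATALOG_PROJECT = 'AGENT_PRINCIPLES_CATALOG_PROJECT'
  DEFAULT_BRANCH = 'CI_DEFAULT_BRANCH'
  API_TOKEN = 'GITLAB_API_TOKEN'
end

# Returns the value of +name+, or aborts with a message when it is unset
# or empty. +env+ is injectable so the lookup can be exercised with a Hash.
def fetch_required!(name, env: ENV)
  value = env[name].to_s
  return value unless value.empty?

  abort "Missing required environment variable #{name}."
end

fake_env = { Env::CATALOG_PROJECT => 'gitlab-org/gitlab' }
puts fetch_required!(Env::CATALOG_PROJECT, env: fake_env) # gitlab-org/gitlab
```

Kernel#abort prints the message to stderr and exits with a non-zero status, which is what makes a misconfigured CI job fail visibly.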
Source-of-truth content (lives in the monolith, not the gem):
- .ai/principles/distillation_prompt.md: the system prompt for the distiller flow. Edits here flow into the catalog via provision_flow.rb.
- .ai/principles/manifest.yml: renamed from sources.yml (matches the existing MANIFEST_PATH constant; reflects that the file is the SoT for both source listings and target/auto-MR config). Adds an auto_mr: block (branch prefix, title template, labels, remove_source_branch) so target settings live in YAML rather than hardcoded in Ruby. The script fails fast if any required key is missing.
- .ai/principles/README.md: documents the two-stage flow (provisioner + sync), required CI variables, local invocation, reviewer expectations, and the auth limitation flagged below.
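The fail-fast check on the auto_mr: block can be sketched as follows. Only remove_source_branch is a key name taken from this description. The other three key names, the sample values, and the helper name are assumptions, so consult manifest.yml for the real schema.

```ruby
require 'yaml'

# Assumed key names for the four settings listed above.
REQUIRED_AUTO_MR_KEYS = %w[branch_prefix title_template labels remove_source_branch].freeze

# Returns the auto_mr config Hash, raising KeyError when the block or any
# required key is absent, so a bad manifest stops the run before any API call.
def auto_mr_config(manifest)
  config = manifest.fetch('auto_mr') { raise KeyError, 'manifest is missing the auto_mr: block' }
  missing = REQUIRED_AUTO_MR_KEYS.reject { |key| config.key?(key) }
  raise KeyError, "auto_mr is missing: #{missing.join(', ')}" unless missing.empty?

  config
end

# Placeholder values, not the contents of the real manifest.
manifest = YAML.safe_load(<<~MANIFEST)
  auto_mr:
    branch_prefix: principles-sync
    title_template: "Sync AI principles (%{date})"
    labels: [documentation]
    remove_source_branch: true
MANIFEST

puts auto_mr_config(manifest)['branch_prefix'] # principles-sync
```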
The catalog Flow has been provisioned in production (Flow ID gid://gitlab/Ai::Catalog::Item/1009160, Consumer ID 7368818). Idempotent re-runs of provision_flow.rb confirm no drift between the production catalog and the YAML definition checked in here.
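The create / release / no-op decision behind those idempotent re-runs can be sketched like this. It is a simplified illustration that assumes drift is detected by comparing the published definition text with the local one. The real provisioner may compare differently, and provision_action is a hypothetical name.

```ruby
require 'digest'

# Decides what a provisioning run should do, given the definition currently
# published in the catalog (nil when the flow does not exist yet) and the
# definition rendered from the files checked into git.
def provision_action(published_definition, local_definition)
  return :create if published_definition.nil?

  same = Digest::SHA256.hexdigest(published_definition) ==
         Digest::SHA256.hexdigest(local_definition)
  same ? :noop : :release_new_version
end

local = "prompt: distill the principle\ntools: [read_file]\n"
puts provision_action(nil, local)                    # create
puts provision_action(local, local)                  # noop
puts provision_action(local, "#{local}# edited\n")   # release_new_version
```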
Gem packaging notes
- The gem is registered on RubyGems.org as a reserved stub at version 0.0.0 (raise "Reserved for GitLab" pattern, per docs). Local development version is 0.1.0 via path: reference.
- gitlab_rubygems co-ownership confirmation is pending email confirmation by the gem-account holder. To unblock CI, the gem's .gitlab-ci.yml temporarily sets skip_gem_validation: true. Tracked in https://gitlab.com/gitlab-org/gitlab/-/issues/599747 (mirrors the precedent in gems/csv_builder/).
- Gemfile.lock includes 5 platforms (arm64-darwin, ruby, x86_64-darwin, x86_64-linux, x86_64-linux-gnu), narrower than typical sibling gems since CI runs on x86_64 Linux only and dev happens on Darwin.
- Direct dependencies are kept minimal: runtime depends only on rainbow; dev depends on rake, rspec, gitlab-styles, rubocop-rspec. The lockfile resolves to 44 gems total.
Risk
The gem and flow definition land dormant. There is no caller until the follow-up CI integration MR (!235014) adds the job. Local manual invocation is the only path to exercise the code until then (from inside gems/gitlab-ai-principles-distiller/):
AGENT_PRINCIPLES_CATALOG_PROJECT=gitlab-org/gitlab \
CI_DEFAULT_BRANCH=master \
bin/gitlab-ai-principles-distiller-sync --dry-run --workspace ../..
Reviewing this MR
The commit history reflects an iterative cleanup cycle on top of the initial implementation. Reviewers reading commit-by-commit will see roughly four phases:
Phase 1 — Initial implementation, packaging, and review-feedback iterations. Original scripts, gem extraction, gem packaging, retry/heartbeat hardening, error diagnostics, and the move of auto-MR target config into manifest.yml.
Phase 2 — Adversarial review pass. An end-to-end self-review surfaced ~15 small defects and design smells; each landed as its own commit so the rationale is bisectable. Highlights: an uninitialized-local bug in Workflow#poll's ever_running flag; shell-injection consistency (last three backtick calls converted to arg-form IO.popen); fail-fast on principles missing sources:; per-instance mutex on Manifest's file_cache; canonical to_json (not inspect) for checksum field encoding; thread-safety invariants for parallel distillation; the GraphQL transport extracted to a shared GraphqlClient (used by both Workflow and ProvisionFlow with different rescue policies).
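The backtick-to-IO.popen conversion mentioned in Phase 2 follows the usual Ruby pattern: pass the command as an array so no shell ever parses the arguments. The sketch below is illustrative and not lifted from the gem. run_argv is a hypothetical helper, and it shells out to the current Ruby interpreter only so the example stays self-contained.

```ruby
require 'rbconfig'

# Runs a command given as separate argv elements and returns its stdout.
# With the array form, IO.popen executes the program directly, so shell
# metacharacters inside an argument are passed through as plain data.
def run_argv(*argv)
  IO.popen(argv, &:read)
end

hostile = 'main; echo INJECTED'

# A backtick call interpolating this value would hand "; echo INJECTED" to
# the shell as a second command. Here the child receives it as one argument.
output = run_argv(RbConfig.ruby, '-e', 'print ARGV.first', hostile)
puts output # main; echo INJECTED
```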
Phase 3 — Refactor + structural cleanup. Sync converted from singleton-class to instance-based (no more reset!, no more global mutable state); ~35 private methods clustered under private keywords in each class/module; comments trimmed (-271 lines of over-explanatory commentary accumulated during the iteration); dead code removed (generate_agents_md tree, GENERATED_PATHS); sync_spec.rb split into per-class spec files matching the lib layout.
Phase 4 — AR sign-off additions (per access request !43931 (closed)). Docs expansion in .ai/principles/README.md documenting why this gem uses a non-composite-identity SA (code-anchored argument that composite_identity_enforced is not required on the SA for our caller path) plus a "Schedule ownership & recovery" runbook. Workflow log lines now tag the workflow ID with the principle name for easier triage of parallel-distillation traces. Auto-MR descriptions now embed permalinks to .ai/principles/manifest.yml and the CI YAML. Cross-domain approval-routing follow-up tracked in #599920.
The aggregate diff in this MR's "Changes" tab does not show the rename for the renamed source file because git's heuristic gives up on a deletion + low-similarity addition, but per-commit diffs and the spec/manifest renames in the aggregate still display correctly.
Tests
Per-class specs under gems/gitlab-ai-principles-distiller/spec/gitlab/principles_distiller/:
- sync_spec.rb: orchestration: distill_and_write_principles, announce_distillation_start, distill_principle retry loop, parse_options, MAX_CONCURRENT_DISTILLATIONS.
- sync/manifest_spec.rb: manifest loading + validation, frontmatter handling, checksum routing, AGENTS.md/SKILL.md generation, prerequisite-note injection, thread-safe file caching, auto_mr_config validation.
- sync/workflow_spec.rb: Duo Workflow API client: assistant-content extraction, failure-details diagnostics, sleep-with-heartbeat, build_goal/build_additional_context, validate_config!.
- sync/auto_mr_spec.rb: branch+MR flow: principle diff section assembly, truncate_diff caps, find_open_mr_iid, prefetch_prior_shas!, push_remote_url, create_branch_and_mr end-to-end including error-cleanup rescue.
- sync/diff_spec.rb: stateless text helpers (strip_preamble, reduce_noise, meaningful?, internal Jaccard/section helpers via Diff.send).
- graphql_client_spec.rb: shared GraphQL client transport: data payload, error handling, Authorization header, query+variables JSON body, host normalization.
- provision_flow_spec.rb: prompt loading, project gid resolution, init, parse options.
Total: 176 examples, all passing locally and in the gem's child CI pipeline.
Run locally with:
cd gems/gitlab-ai-principles-distiller
bundle exec rspec spec/
Related
- Issue: https://gitlab.com/gitlab-org/gitlab/-/issues/599663
- Follow-up CI integration: !235014 (target https://gitlab.com/gitlab-org/gitlab/-/issues/597600)
- Follow-up gem ownership confirmation: https://gitlab.com/gitlab-org/gitlab/-/issues/599747
- Earlier distillation pipeline: https://gitlab.com/gitlab-org/gitlab/-/issues/597599
- Future component extraction: https://gitlab.com/gitlab-org/gitlab/-/issues/599498
- Parent epic: gitlab-org&21742
MR acceptance checklist
Evaluate this MR against the MR acceptance checklist.