Add Knowledge Graph traversal ID authorization with observability

What does this MR do and why?

Adds traversal ID authorization for the Knowledge Graph. This is Layer 2 of the three-layer security model: namespace-level access control via traversal ID prefix matching.

Rails computes which namespaces a user has Reporter+ access to, compacts them using a trie algorithm (capped at 500 IDs), and encodes them into a JWT for the GKG query engine. Results are cached in Redis with a 5-minute TTL and invalidated via EventStore subscriptions when user authorization changes.
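The encode step can be sketched with a hand-rolled HS256 JWT. This is illustrative only -- the claim names, TTL, and secret handling here are assumptions, and the real code would use a JWT library rather than building the token by hand:

```ruby
require 'json'
require 'base64'
require 'openssl'

# Hand-rolled HS256 JWT for illustration; claim names are assumptions.
def encode_traversal_jwt(traversal_ids, secret)
  header  = { alg: 'HS256', typ: 'JWT' }
  payload = { traversal_ids: traversal_ids, exp: Time.now.to_i + 300 }
  signing_input = [header, payload]
    .map { |part| Base64.urlsafe_encode64(part.to_json, padding: false) }
    .join('.')
  signature = Base64.urlsafe_encode64(
    OpenSSL::HMAC.digest('SHA256', secret, signing_input), padding: false
  )
  "#{signing_input}.#{signature}"
end

token = encode_traversal_jwt(['1/22/'], 'dev-secret')
```

The payload size grows with every traversal ID included, which is the constraint that motivates the 500-ID cap below.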

What changed

  • Authorization context, observability (logger + Prometheus metrics), and cache invalidation worker
  • Cache invalidation via AuthorizationsAddedEvent / AuthorizationsRemovedEvent event subscriptions
  • Rails.cache.fetch with 5-minute TTL (36-545x speedup on warm reads)

How Zoekt and the Knowledge Graph handle traversal IDs

The core difference is where the prefix matching happens.

Knowledge Graph approach (our MR)

Rails does all the heavy lifting upfront:

  1. Query DB for ALL groups user has Reporter+ access to
  2. Build a trie from ALL those traversal IDs
  3. Compact the trie output down to ≤500 entries
  4. Serialize the compacted list into a JWT payload
  5. Send the JWT to the GKG query engine
  6. GKG runs ClickHouse SQL: WHERE startsWith(traversal_path, prefix)

The JWT has a size constraint -- you can't stuff thousands of traversal IDs into it. That's why the 500 cap and the TraversalIdCompactor exist. All the computation happens in Rails before the query even reaches ClickHouse.
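The prefix-compaction idea can be sketched in a few lines. This is a toy, not the actual TraversalIdCompactor: collapsing entries beyond deduplication would over-grant access, which is presumably why a fallback path (and the increment_compaction_fallback metric) exists rather than aggressive merging:

```ruby
# Toy sketch: drop traversal IDs already covered by an ancestor prefix,
# then enforce the cap. Not the real TraversalIdCompactor.
def compact_traversal_ids(paths, cap: 500)
  unique = paths.uniq
  # "1/22/3/" is covered by "1/22/" (and by "1/"): any prefix match on the
  # ancestor also matches the descendant, so the descendant is redundant.
  deduped = unique.reject do |path|
    unique.any? { |other| other != path && path.start_with?(other) }
  end
  # Still over the cap? Fall back rather than collapsing to parents,
  # since a parent prefix would grant more than the user is authorized for.
  deduped.size <= cap ? deduped : :fallback
end
```

A user who is a Reporter on a group and on ten of its subgroups contributes a single prefix after deduplication.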

Zoekt approach

Zoekt pushes the filtering to the search server:

# For group-level search, Zoekt just looks up ONE group:
def get_traversal_ids_for_group(group_id)
  Group.find(group_id).elastic_namespace_ancestry
  # => "9970-123-456-"  (dash-separated ancestor chain)
end

Then sends it as a regex metadata filter directly to the Zoekt server:

# filters.rb:104-108
def by_traversal_ids(traversal_ids, context: nil)
  by_meta(key: 'traversal_ids', value: "^#{traversal_ids}", context: context)
end

# This produces a JSON filter like:
{ meta: { key: "traversal_ids", value: "^9970-123-456-" } }

The Zoekt index stores traversal_ids as a metadata string on every indexed repository. The Zoekt server applies the ^ regex prefix match -- "show me all repos whose traversal_ids string starts with 9970-123-456-". This is a simple string operation done server-side on the search index, not in PostgreSQL.
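The trailing dash is what keeps that prefix match exact at ID boundaries -- without it, a prefix for group 123 would also match an unrelated group 1234. A quick illustration with hypothetical IDs:

```ruby
# The trailing dash keeps prefix matching exact at ID boundaries:
# "^9970-123-" matches descendants of group 123 but not group 1234.
indexed_traversal_ids = ['9970-123-456-', '9970-1234-', '42-88-']
prefix_regex = /^9970-123-/

matches = indexed_traversal_ids.grep(prefix_regex)
# => ["9970-123-456-"]
```

The slash-terminated ClickHouse format ("1/22/3/") relies on the same boundary trick, just with a different separator.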

Why Zoekt doesn't need compaction

  1. Group search: Zoekt only needs ONE traversal ID prefix (the group being searched). It calls Group.find(group_id).elastic_namespace_ancestry which returns the ancestor chain as a dash-separated string ("9970-123-456-"). No GroupsFinder, no trie, no compaction.
  2. Global search: Zoekt uses AccessBranchBuilder which does call GroupsFinder and builds a trie via authorized_traversal_ids_for_groups. But it sends each resulting traversal ID as a separate filter combined with OR logic:
# access_branch_builder.rb:117-118
traversal_ids = authorized_traversal_ids_for_groups(groups)
traversal_ids.map { |t| Filters.by_traversal_ids(t) }
# => [{ meta: { key: "traversal_ids", value: "^9970-" } },
#     { meta: { key: "traversal_ids", value: "^42-88-" } },
#     ...]

Zoekt's query engine handles arbitrarily many OR'd metadata filters efficiently. There's no JWT payload size constraint, so there's no need to compact down to 500.

                         Knowledge Graph                                Zoekt
Where filtering happens  ClickHouse (via JWT payload)                   Zoekt server (via query filters)
Transport format         JWT (size-constrained)                         JSON query payload (no hard limit)
Needs compaction?        Yes (500 cap for JWT)                          No (server handles any count)
DB queries               GroupsFinder (all Reporter+ groups)            Single Group.find for group search
Traversal ID format      "1/22/3/" (slash, for ClickHouse startsWith)   "9970-123-456-" (dash, for regex ^ match)
Database query and execution plan

No new tables, indexes, or migrations. Uses existing Search::GroupsFinder (same query path as Elasticsearch/Zoekt authorization). Results cached via Rails.cache with 5-minute TTL.
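The caching pattern is the standard Rails.cache.fetch with an expiry; a stdlib-only stand-in (illustrative names, not the MR's code) shows the cold/warm behavior being measured below:

```ruby
# Minimal stand-in for Rails.cache.fetch with a TTL: the first call runs
# the block (the GroupsFinder query), later calls within the TTL return
# the cached value, and delete forces a cold read again.
class TtlCache
  Entry = Struct.new(:value, :expires_at)

  def initialize
    @store = {}
  end

  def fetch(key, expires_in:)
    entry = @store[key]
    return entry.value if entry && Time.now < entry.expires_at

    value = yield
    @store[key] = Entry.new(value, Time.now + expires_in)
    value
  end

  def delete(key)
    @store.delete(key)
  end
end

cache = TtlCache.new
calls = 0
2.times { cache.fetch('traversal_ids:42', expires_in: 300) { calls += 1; ['1/22/'] } }
# The block ran once; the second read was served from cache.
```

The real cache additionally needs the event-driven invalidation described above, since a 5-minute TTL alone would let revoked users keep stale authorizations for up to five minutes.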

SQL (UNION of direct memberships + shared groups)

-- Branch 1: Direct group memberships
SELECT namespaces.* FROM namespaces
WHERE namespaces.type = 'Group'
AND namespaces.id IN (
  SELECT members.source_id FROM members
  LEFT OUTER JOIN users ON users.id = members.user_id
  WHERE members.source_type = 'Namespace'
    AND members.type = 'GroupMember'
    AND members.user_id = :user_id
    AND members.requested_at IS NULL
    AND members.access_level >= 20
    AND ((members.user_id IS NULL AND members.invite_token IS NOT NULL)
         OR users.state = 'active')
)
UNION
-- Branch 2: Groups shared via group_group_links
SELECT namespaces.* FROM namespaces
WHERE namespaces.type = 'Group'
AND namespaces.id IN (
  SELECT group_group_links.shared_group_id FROM group_group_links
  WHERE group_group_links.shared_with_group_id IN (
    SELECT members.source_id FROM members
    LEFT OUTER JOIN users ON users.id = members.user_id
    WHERE members.source_type = 'Namespace'
      AND members.type = 'GroupMember'
      AND members.user_id = :user_id
      AND members.requested_at IS NULL
      AND members.access_level >= 20
      AND ((members.user_id IS NULL AND members.invite_token IS NOT NULL)
           OR users.state = 'active')
  )
  AND (group_group_links.expires_at > NOW() OR group_group_links.expires_at IS NULL)
  AND group_group_links.group_access >= 20
)

EXPLAIN ANALYZE (GDK, PostgreSQL 16.11)

Unique  (cost=14.76..14.99 rows=2 width=485) (actual time=0.024..0.025 rows=1 loops=1)
  Buffers: shared hit=12
  ->  Sort  (cost=14.76..14.76 rows=2 width=485)
        Sort Key: all 45 namespace columns
        Sort Method: quicksort  Memory: 25kB
        ->  Append
              ->  Nested Loop
                    ->  Index Scan using idx_members_on_user_and_source_and_source_type_and_member_role
                          Index Cond: (user_id = 2, source_type = 'Namespace')
                          Filter: (requested_at IS NULL AND access_level >= 20 AND type = 'GroupMember')
                    ->  Index Scan using index_namespaces_on_type_and_id
              ->  (group_group_links branch: 0 rows, namespace lookup "never executed")
Planning Time: 1.298 ms
Execution Time: 0.071 ms

How to set up and validate locally

Check out the branch and open a Rails console:

git checkout gkg-traversal-id-authorization
bundle exec rails console

1. Basic traversal ID computation

user = User.where(admin: false).joins(:group_members)
  .where(members: { access_level: Gitlab::Access::REPORTER..Gitlab::Access::OWNER }).first
context = Analytics::KnowledgeGraph::AuthorizationContext.new(user)
result = context.reporter_plus_traversal_ids
# => {:group_traversal_ids=>["1/22/"]}

2. Rails.cache cold vs warm

cache_key = "analytics:knowledge_graph:traversal_ids:#{user.id}"
Rails.cache.delete(cache_key)
# Cold call: ~0.003s (DB query)
Analytics::KnowledgeGraph::AuthorizationContext.new(user).reporter_plus_traversal_ids
Rails.cache.read(cache_key).present? # => true
# Warm call: ~0.00007s (Redis read, 36x+ speedup)
Analytics::KnowledgeGraph::AuthorizationContext.new(user).reporter_plus_traversal_ids

3. Cache invalidation

Analytics::KnowledgeGraph::AuthorizationContext.expire_cache_for_user(user.id)
Rails.cache.read(cache_key) # => nil (cleared)

4. Event worker invalidation

# Populate cache
Analytics::KnowledgeGraph::AuthorizationContext.new(user).reporter_plus_traversal_ids
Rails.cache.read(cache_key).present? # => true
# Simulate authorization change event
worker = Analytics::KnowledgeGraph::ExpireTraversalIdCacheWorker.new
worker.handle_event(OpenStruct.new(data: { user_ids: [user.id] }))
Rails.cache.read(cache_key) # => nil (cleared by worker)

5. Logger and metrics

Gitlab::KnowledgeGraph::AuthorizationLogger.build.class
# => Gitlab::KnowledgeGraph::AuthorizationLogger

Gitlab::Metrics::KnowledgeGraph::TraversalIds.respond_to?(:observe_traversal_ids_count)  # => true
Gitlab::Metrics::KnowledgeGraph::TraversalIds.respond_to?(:increment_compaction_fallback) # => true

MR acceptance checklist

This MR was evaluated against the MR acceptance checklist.

Edited by Michael Angelo Rivera
