[FF] `search_finders_redis_cache` -- Rollout

Summary

This issue tracks the production rollout of the feature that is currently behind the search_finders_redis_cache feature flag.

Introduced by !224795 (closed) and finished in !231935 (merged). The flag controls a short-lived (5 min) Redis cache in Search::GroupsFinder and Search::ProjectsFinder, with invalidation driven by Search::ExpireFinderCacheWorker through a per-user cache-version bump.

Related: #594620 (investigate cache-key expiration and TTL tuning before rollout).

Owners

  • Most appropriate Slack channel to reach out to: #g_global_search
  • Best individual to reach out to: @terrichu

Expectations

What are we expecting to happen?

When the flag is enabled, Search::GroupsFinder and Search::ProjectsFinder read the user's authorized groups / projects from Redis when an unexpired entry exists, and fall back to the PG query on a miss. Cache entries are per-user with a 5-minute TTL. Any membership, authorization, link, or role event listed in SearchSubscriptions#register_finder_cache_events bumps the per-user cache version via Search::ExpireFinderCacheWorker, so the next read rebuilds the cache.
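For reviewers unfamiliar with version-bump invalidation, here is a minimal sketch of the pattern described above. It is not the code from the MRs: FakeStore, VersionedFinderCache, and the key layout are hypothetical stand-ins, and an in-memory hash with an injectable clock takes the place of Redis.

```ruby
# Hypothetical illustration only; these names do not come from the GitLab codebase.

# In-memory stand-in for Redis with per-key expiry and an injectable clock.
class FakeStore
  def initialize(clock)
    @clock = clock
    @data = {}
  end

  # Returns the stored value, or nil when the key is missing or expired.
  def read(key)
    entry = @data[key]
    return nil if entry.nil?

    value, expires_at = entry
    return value if expires_at.nil? || @clock.call < expires_at

    @data.delete(key)
    nil
  end

  def write(key, value, ttl: nil)
    expires_at = ttl ? @clock.call + ttl : nil
    @data[key] = [value, expires_at]
    value
  end

  def increment(key)
    write(key, (read(key) || 0) + 1)
  end
end

# The per-user version is part of the cache key, so bumping the version makes
# every existing entry for that user unreachable without deleting anything.
class VersionedFinderCache
  TTL_SECONDS = 5 * 60 # the 5-minute TTL from the description

  def initialize(store)
    @store = store
  end

  # Returns the cached value, or runs the block (the "PG query") on a miss.
  def fetch(user_id, scope)
    key = cache_key(user_id, scope)
    cached = @store.read(key)
    return cached unless cached.nil?

    @store.write(key, yield, ttl: TTL_SECONDS)
  end

  # What an invalidation worker would do for a membership/authorization event.
  def expire(user_id)
    @store.increment(version_key(user_id))
  end

  private

  def version_key(user_id)
    "finder_cache:version:#{user_id}"
  end

  def cache_key(user_id, scope)
    version = @store.read(version_key(user_id)) || 0
    "finder_cache:#{scope}:#{user_id}:v#{version}"
  end
end
```

In this sketch, entries written under the old version are never read again after a bump and simply age out through their TTL, which is why the Redis keyspace deserves monitoring during rollout.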

Expected net effect: significantly lower PG load on every request path that queries the user's authorized groups / projects during search and during Knowledge Graph authorization (Orbit API, MCP).

What can go wrong and how would we detect it?

  • Stale results after membership change: if a given change path does not trigger the event-store invalidation, the user sees stale authorized groups/projects for up to 5 minutes. Detect via reports of "I just got added/removed but search still behaves like before". Mitigate by rolling back the flag.
  • Redis memory growth: per-user keys with a 5-minute TTL scale as O(#active_users * #access_levels * #feature_variants). Monitor the search cache keyspace in Gitlab::Redis::Cache.
  • Cache miss thundering herd: many users whose cache entries have expired hit the finder at once. Redis and the DB should both absorb this, but watch PG search_user:* query rates during incremental rollout.
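
To turn the Redis memory growth bound above into a rough upper estimate before each rollout step, a back-of-the-envelope helper like the following may be useful. The method names are hypothetical and every input in the usage lines is a placeholder, not a measured value; substitute real figures from the dashboards.

```ruby
# Hypothetical helpers for sizing the keyspace; all inputs must come from
# real measurements. The numbers at the bottom are placeholders only.

# Upper bound on live keys: one entry per (user, access level, feature variant).
def estimated_key_count(active_users:, access_levels:, feature_variants:)
  active_users * access_levels * feature_variants
end

# Rough memory footprint, given an average serialized entry size in bytes.
def estimated_memory_bytes(key_count:, avg_entry_bytes:)
  key_count * avg_entry_bytes
end

# Placeholder inputs, chosen only to show the arithmetic.
keys = estimated_key_count(active_users: 100_000, access_levels: 2, feature_variants: 3)
bytes = estimated_memory_bytes(key_count: keys, avg_entry_bytes: 2_048)
puts "#{keys} keys, ~#{(bytes / (1024.0**3)).round(2)} GiB"
```

Because entries expire after 5 minutes, active_users here means users who hit the finders within a 5-minute window, scaled by the current rollout percentage.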

Most relevant dashboards: search-api, redis-cache, and the PG-query dashboards for Search::GroupsFinder / Search::ProjectsFinder.

Rollout Steps

Note: Please make sure to run the chatops commands in the Slack channel that gets impacted by the command.

Rollout on non-production environments

  • Verify the MR with the feature flag is merged to master and has been deployed to non-production environments with /chatops gitlab run auto_deploy status <merge-commit-of-your-feature>
  • Deploy the feature flag at 50% on non-production: /chatops gitlab run feature set search_finders_redis_cache 50 --actors --dev --pre --staging --staging-ref
  • Monitor that error rates and search latency did not regress.
  • Enable the feature globally on non-production: /chatops gitlab run feature set search_finders_redis_cache true --dev --pre --staging --staging-ref
  • Verify on staging-canary that search returns the expected authorized groups/projects.
  • Run the Orbit E2E validation (KG query path exercises the same finder cache).

Rollout on production

  • Enable on 1% of actors: /chatops gitlab run feature set search_finders_redis_cache 1 --actors
  • Monitor Redis memory and PG query rates for 24 hours.
  • Scale to 10% of actors, then 25%, 50%, and 100%, monitoring Redis memory and PG query rates between each step.
  • Enable globally: /chatops gitlab run feature set search_finders_redis_cache true

Rollout on GitLab.com

Same as production above.

Release the feature with the feature flag

  • After the feature has shipped in at least one full release with default_enabled: true, remove the flag: delete the feature flag definition and the if: guards in SearchSubscriptions#register_finder_cache_events.

Rollback steps

If issues are observed:

  • Disable globally: /chatops gitlab run feature set search_finders_redis_cache false

/cc @terrichu @michaelangeloio