Address Credentials Inventory Personal Access Tokens Performance on groups with more than 6k enterprise users

Related issues

Main Overarching Issue

  • The first request of groups/:id/manage/personal_access_tokens and groups/:id/manage/resource_access_tokens consistently times out on our largest groups (30k users).
    • We will get more accurate thresholds once the scripted creation of enterprise_users & credentials in the staging-ref group passes 40k.

Summary

During the rollout of the CredentialsInventoryPersonalAccessTokensFinder feature flag, I found the following:

  • Database Lab's execution time for a query should be ignored, whether higher or lower than production
    • !193375 (merged) showed a 99.2% improvement (around ~500ms) during development testing
    • After merging, it showed an ~8s execution time in production, even with index coverage and usage
    • Our largest enterprise user group of 30k still hit database timeouts (~15.8s > 15s DB limit)
  • What matters most is the total buffer count under a cold cache (resetting the DB session per new query), as confirmed with the Database team (see the sketch after this list)
  • This meant we needed another factor to ensure the optimization gives our largest groups appropriate execution times (under the 15s database timeout & GitLab's 1s general query guideline) and is tested at scale in production or a similar environment.
    • I created a staging-ref group with [{:name=>"Credentials Inventory Testing", :enterprise_user_count=>18606, :bot_user_count=>29472, :pat_count=>7495}]; further seeding is still pending.
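
A minimal sketch of the cold-cache buffer comparison described above, assuming a Rails/GitLab console; the helper and placeholder query are hypothetical, and DISCARD ALL only resets session state (it does not flush Postgres shared buffers):

```ruby
# Hypothetical helper: compare query variants by buffer counts, not timing.
# The "Buffers: shared hit/read" totals in the plan are the numbers that
# transfer between environments; ms readings (e.g. Database Lab's) do not.
def explain_buffers(sql)
  conn = ActiveRecord::Base.connection
  conn.execute("DISCARD ALL") # fresh session state per measurement
  conn.execute("EXPLAIN (ANALYZE, BUFFERS) #{sql}").values.join("\n")
end

# Placeholder query, not the actual finder SQL:
puts explain_buffers("SELECT id FROM personal_access_tokens WHERE user_id IN (1, 2, 3)")
```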

Current Behaviour

Logs

PATs Benchmarking on Staging-ref

| Users | PersonalAccessTokens | Execution Time from Postman |
|-------|----------------------|-----------------------------|
| 2500  | 1255                 | 1.1s  |
| 3331  | 2069                 | 2.58s |
| 4412  | 3181                 | 5.93s |
| 5005  | 3773                 | 6.30s |
| 5843  | 4589                 | 10.5s |
| 6635  | 5372                 | 15.4s |

RATs Benchmarking on Staging-ref

Pending further data creation. At 29k RATs and 18k enterprise users, performance is blazing fast; I'll test again once the user count is close to 40k.

Expected Behaviour

PATs Benchmarking on Staging-ref

| Users | PersonalAccessTokens (target stg-ref vs current customer max) | Execution Time from Postman |
|-------|---------------------------------------------------------------|-----------------------------|
| 50000 (current is ~30k) | 80k (current customer PATs: ~63k) | well below 15s timeouts |

RATs Benchmarking on Staging-ref

| Users | ResourceAccessTokens (target stg-ref vs current customer max) | Execution Time from Postman |
|-------|----------------------------------------------------------------|-----------------------------|
| 50000 (current is ~30k) | 29k (current customer RATs: ~3k) | well below 15s timeouts |

Proposed Solutions

Caching

  • Create a ::PreWarmGroupCredentialsInventoryCacheWorker#perform(group_id) background job that runs every 5 minutes and caches user_ids for large enterprise groups in the RedisCacheStore (see the sketch after this list)
  • Cache busting based on newly added credentials and group size (2k threshold)
  • Create a ::GroupCredentialsInventoryCacheService(top_level_group) to handle the main work, edge cases (min_group_size...) & defaults
  • UX/UI consideration: add a last_updated_at docs note or modal if Product/Database disagree with short TTLs for high availability
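
A hypothetical sketch of the worker/service pair above; the class names come from these bullets, while the threshold, TTL, cache key, and user scope are assumptions rather than settled design:

```ruby
class PreWarmGroupCredentialsInventoryCacheWorker
  include ApplicationWorker # GitLab's Sidekiq worker mixin

  # Assumed to be driven by a 5-minute cron schedule for qualifying groups.
  def perform(group_id)
    group = Group.find_by_id(group_id)
    return unless group

    GroupCredentialsInventoryCacheService.new(group).execute
  end
end

class GroupCredentialsInventoryCacheService
  MIN_GROUP_SIZE = 2_000 # assumed "2k" group-size threshold
  TTL = 10.minutes       # assumed; short TTLs are the open Product/DB question

  def initialize(top_level_group)
    @group = top_level_group
  end

  def execute
    return if user_ids.size < MIN_GROUP_SIZE

    # Rails.cache is backed by the RedisCacheStore in GitLab
    Rails.cache.write(cache_key, user_ids, expires_in: TTL)
  end

  private

  def cache_key
    ['credentials_inventory', 'user_ids', @group.id] # assumed key format
  end

  def user_ids
    # Hypothetical scope; the real list would come from the optimized finder queries
    @user_ids ||= @group.enterprise_users.pluck(:id)
  end
end
```

Cache busting on newly added credentials could then be a Rails.cache.delete(cache_key) call from the token-creation path, with the pre-warm job repopulating the key on its next run.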

Long Term

  • More caching improvements:
    • The current code relies only on offset & limit to retain the sorting from the optimized CredentialsInventoryPAT & RATfinder queries, with heavy index usage. The need for keyset pagination primarily arises from the API and its scraping-related use cases at high Nth pages: since offset pagination scales linearly with page depth (the database still scans PAGE * PAGE_SIZE rows), the cache will be heavily utilized if those cases become common.
    • The #keyset_paginate method works with the Efficient IN Operator optimization after some changes (see the sketch after this list). Please note that although implementing this in the new finder might be feasible, switching to keyset pagination would be a breaking change, affecting the existing credentials_inventory (SM & .com) codepaths, in addition to the API needing to support backward compatibility.

  • Auth ReArchitecture (cells sharding work...)
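
For reference, a rough sketch of the keyset path using GitLab's documented keyset_paginate helper; the finder constructor arguments and ordering below are assumptions:

```ruby
# Hypothetical usage; only keyset_paginate itself is GitLab's documented API.
relation = CredentialsInventoryPersonalAccessTokensFinder
  .new(group: top_level_group) # assumed constructor arguments
  .execute

# A cursor over a stable order replaces OFFSET, so page N no longer scans
# PAGE * PAGE_SIZE rows.
paginator = relation.order(:id).keyset_paginate(cursor: params[:cursor], per_page: 20)

paginator.records              # current page of tokens
paginator.cursor_for_next_page # opaque cursor the API client sends back
```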
