Address Credentials Inventory Personal Access Tokens Performance on groups with more than 6k enterprise users

Related issues

Main Overarching Issue

  • The first request of groups/:id/manage/personal_access_tokens and groups/:id/manage/resource_access_tokens consistently times out on our largest groups (30k users).
    • We will get more accurate thresholds once the scripted creation of enterprise_users & credentials in the staging-ref group passes 40k.

Summary

During the rollout of the CredentialsInventoryPersonalAccessTokensFinder feature flag, I found the following:

  • Database Lab's execution time for a query should be ignored, whether higher or lower than production
    • !193375 (merged) showed a 99.2% improvement (around ~500ms) during development testing
    • After merging, it showed an ~8s execution time in production, even with index coverage and usage
    • Our largest enterprise user group of 30k still hit database timeouts (~15.8s > 15s DB limit)
  • What matters most is the total buffer count under a cold cache (resetting the DB session per new query), as confirmed with the Database team (see the sketch after this list)
  • This meant we needed another factor to ensure the optimization gives our largest groups appropriate execution times (under the 15s database timeout & GitLab's 1s general query guideline) and is tested at scale in production or a similar environment.
    • I created a staging-ref group with [{:name=>"Credentials Inventory Testing", :enterprise_user_count=>18606, :bot_user_count=>29472, :pat_count=>7495}]; further seeding is still pending.
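
A minimal sketch of the cold-cache buffer comparison described above, assuming a Rails/GitLab console; the helper and placeholder query are hypothetical, and DISCARD ALL only resets session state (it does not flush Postgres shared buffers):

```ruby
# Hypothetical helper: compare query variants by buffer counts, not timing.
# The "Buffers: shared hit/read" totals in the plan are the numbers that
# transfer between environments; ms readings (e.g. Database Lab's) do not.
def explain_buffers(sql)
  conn = ActiveRecord::Base.connection
  conn.execute("DISCARD ALL") # fresh session state per measurement
  conn.execute("EXPLAIN (ANALYZE, BUFFERS) #{sql}").values.join("\n")
end

# Placeholder query, not the actual finder SQL:
puts explain_buffers("SELECT id FROM personal_access_tokens WHERE user_id IN (1, 2, 3)")
```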

Current Behaviour

Logs

PATs Benchmarking on Staging-ref

| Users | PersonalAccessTokens | Execution Time from Postman |
|-------|----------------------|-----------------------------|
| 2500  | 1255                 | 1.1s  |
| 3331  | 2069                 | 2.58s |
| 4412  | 3181                 | 5.93s |
| 5005  | 3773                 | 6.30s |
| 5843  | 4589                 | 10.5s |
| 6635  | 5372                 | 15.4s |

RATs Benchmarking on Staging-ref

Pending further data creation. At 29k RATs and 18k enterprise users, performance is blazing fast; I'll test again once the user count is close to 40k.

Expected Behaviour

PATs Benchmarking on Staging-ref

| Users | PersonalAccessTokens (target stg-ref vs current customer max) | Execution Time from Postman |
|-------|---------------------------------------------------------------|-----------------------------|
| 50000 (current is ~30k) | 80k (current customer PATs: ~63k) | well below 15s timeouts |

RATs Benchmarking on Staging-ref

| Users | ResourceAccessTokens (target stg-ref vs current customer max) | Execution Time from Postman |
|-------|----------------------------------------------------------------|-----------------------------|
| 50000 (current is ~30k) | 29k (current customer RATs: ~3k) | well below 15s timeouts |

Proposed Solutions

Caching

  • Create a ::PreWarmGroupCredentialsInventoryCacheWorker#perform(group_id) background job that runs every 5 minutes and caches user_ids for large enterprise groups in the RedisCacheStore (see the sketch after this list)
  • Cache busting based on newly added credentials and group size (2k threshold)
  • Create a ::GroupCredentialsInventoryCacheService(top_level_group) to handle the main work, edge cases (min_group_size...) & defaults
  • UX/UI consideration: add a last_updated_at docs note or modal if Product/Database disagree with short TTLs for high availability
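
A hypothetical sketch of the worker/service pair above; the class names come from these bullets, while the threshold, TTL, cache key, and user scope are assumptions rather than settled design:

```ruby
class PreWarmGroupCredentialsInventoryCacheWorker
  include ApplicationWorker # GitLab's Sidekiq worker mixin

  # Assumed to be driven by a 5-minute cron schedule for qualifying groups.
  def perform(group_id)
    group = Group.find_by_id(group_id)
    return unless group

    GroupCredentialsInventoryCacheService.new(group).execute
  end
end

class GroupCredentialsInventoryCacheService
  MIN_GROUP_SIZE = 2_000 # assumed "2k" group-size threshold
  TTL = 10.minutes       # assumed; short TTLs are the open Product/DB question

  def initialize(top_level_group)
    @group = top_level_group
  end

  def execute
    return if user_ids.size < MIN_GROUP_SIZE

    # Rails.cache is backed by the RedisCacheStore in GitLab
    Rails.cache.write(cache_key, user_ids, expires_in: TTL)
  end

  private

  def cache_key
    ['credentials_inventory', 'user_ids', @group.id] # assumed key format
  end

  def user_ids
    # Hypothetical scope; the real list would come from the optimized finder queries
    @user_ids ||= @group.enterprise_users.pluck(:id)
  end
end
```

Cache busting on newly added credentials could then be a Rails.cache.delete(cache_key) call from the token-creation path, with the pre-warm job repopulating the key on its next run.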

Long Term

  • More caching improvements:
    • The current code relies only on offset & limit to retain the sorting from the optimized CredentialsInventoryPAT & RATfinder queries, with heavy index usage. The need for keyset pagination primarily arises from the API and its scraping-related use cases at high Nth pages: since offset pagination scales linearly with page depth (the database still scans PAGE * PAGE_SIZE rows), the cache will be heavily utilized if those cases become common.
    • The #keyset_paginate method works with the Efficient IN Operator optimization after some changes (see the sketch after this list). Please note that although implementing this in the new finder might be feasible, switching to keyset pagination would be a breaking change, affecting the existing credentials_inventory (SM & .com) codepaths, in addition to the API needing to support backward compatibility.

  • Auth ReArchitecture (cells sharding work...)
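
For reference, a rough sketch of the keyset path using GitLab's documented keyset_paginate helper; the finder constructor arguments and ordering below are assumptions:

```ruby
# Hypothetical usage; only keyset_paginate itself is GitLab's documented API.
relation = CredentialsInventoryPersonalAccessTokensFinder
  .new(group: top_level_group) # assumed constructor arguments
  .execute

# A cursor over a stable order replaces OFFSET, so page N no longer scans
# PAGE * PAGE_SIZE rows.
paginator = relation.order(:id).keyset_paginate(cursor: params[:cursor], per_page: 20)

paginator.records              # current page of tokens
paginator.cursor_for_next_page # opaque cursor the API client sends back
```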
