# Address Credentials Inventory Personal Access Tokens performance on groups with more than 6k enterprise users
## Related issues

### Main Overarching Issue
- The first request of `groups/:id/manage/personal_access_tokens` and `groups/:id/manage/resource_access_tokens` consistently times out on our largest groups (30k users).
- We will get more accurate thresholds once the staging-ref group's scripted enterprise_users & credentials creation passes 40k.
## Summary
During the rollout of the `CredentialsInventoryPersonalAccessTokensFinder` feature flag, I found the following:
- The execution time Database Lab reports for a query should be ignored, whether it is higher or lower than production
- !193375 (merged) showed a 99.2% (~500ms) increase during development testing
- After merging, it showed an ~8s execution time, even with index coverage and usage
- Our largest enterprise user group of 30k still hit database timeouts (~15.8s > 15s DB limit)
- What matters most is the total buffer count for a cold cache (resetting the DB session per new query), as confirmed with the Database team
- This meant we needed another factor to make sure the optimization for our largest groups has appropriate execution times (below the 15s database timeout & below GitLab's 1s general query guideline) and is tested at scale in production or a similar environment.
- I created a staging-ref group with `[{:name=>"Credentials Inventory Testing", :enterprise_user_count=>18606, :bot_user_count=>29472, :pat_count=>7495}]`, with further seeding pending.
## Current Behaviour
### Logs
- API Logs: https://log.gprd.gitlab.net/app/r/s/P1jQt
- .com UI logs: https://log.gprd.gitlab.net/app/r/s/p1RxQ
### PATs Benchmarking on Staging-ref
| Users | PersonalAccessTokens | Execution Time from Postman |
|---|---|---|
| 2500 | 1255 | 1.1s |
| 3331 | 2069 | 2.58s |
| 4412 | 3181 | 5.93s |
| 5005 | 3773 | 6.30s |
| 5843 | 4589 | 10.5s |
| 6635 | 5372 | 15.4s |
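The table above can be sanity-checked with a quick calculation (illustrative only, using the table's own numbers): per-PAT cost grows as the group grows, which is consistent with a non-linear cost such as offset scans or cold-cache buffer misses, rather than a simple per-row cost.

```ruby
# Per-row cost check on the PATs benchmark table above.
ROWS = [
  # [users, pats, seconds] — copied from the benchmark table
  [2500, 1255, 1.1],
  [3331, 2069, 2.58],
  [4412, 3181, 5.93],
  [5005, 3773, 6.30],
  [5843, 4589, 10.5],
  [6635, 5372, 15.4],
]

ms_per_pat = ROWS.map { |_users, pats, seconds| (seconds * 1000.0 / pats).round(2) }
# ~0.88 ms/PAT at 1.2k PATs vs ~2.87 ms/PAT at 5.4k PATs: per-row cost
# roughly triples as the group grows, so timeouts at 30k users are expected.
```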
### RATs Benchmarking on Staging-ref
Pending creation. At 29k RATs and 18k enterprise users, performance is blazing fast. I'll test again once the user count is close to 40k.
## Expected Behaviour
### PATs Benchmarking on Staging-ref
| Users | PersonalAccessTokens (target stg-ref vs current customer max) | Execution Time from Postman |
|---|---|---|
| 50000 (current is ~30k) | 80k (current customer PATs: ~63k) | well below 15s timeouts |
### RATs Benchmarking on Staging-ref
| Users | ResourceAccessTokens (target stg-ref vs current customer max) | Execution Time from Postman |
|---|---|---|
| 50000 (current is ~30k) | 29k (current customer RATs: ~3k) | well below 15s timeouts |
## Proposed Solutions
### Caching
- Create a `::PreWarmGroupCredentialsInventoryCacheWorker#perform(group_id)` background job that runs every 5 minutes in RedisCacheStore and caches user_ids for large enterprise groups
- Cache busting based on any newly added credentials and group size (2k)
- Create a `::GroupCredentialsInventoryCacheService(top_level_group)` to handle the main work, edge cases (min_group_size...), and defaults
- UX/UI consideration? Consider adding a `last_updated_at` docs note or modal if Product/Database disagrees with short TTLs for high availability
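A rough sketch of how the pieces above could fit together. Everything here is an assumption for discussion, not the real implementation: the class name mirrors the proposal, `Cache` is a tiny in-memory stand-in for Rails.cache / RedisCacheStore, and `loader` stands in for the expensive finder query.

```ruby
# Minimal in-memory stand-in for Rails.cache / RedisCacheStore (assumption,
# used so the sketch runs without Rails).
class Cache
  Entry = Struct.new(:value, :expires_at)

  def initialize
    @store = {}
  end

  # Return the cached value if still fresh, otherwise recompute and store it.
  def fetch(key, expires_in:)
    entry = @store[key]
    return entry.value if entry && entry.expires_at > Time.now

    value = yield
    @store[key] = Entry.new(value, Time.now + expires_in)
    value
  end

  def delete(key)
    @store.delete(key)
  end
end

# Hypothetical shape of ::GroupCredentialsInventoryCacheService: cache
# user_ids only for large enterprise groups, with a TTL matching the
# pre-warm worker's 5-minute cadence and explicit busting on new credentials.
class GroupCredentialsInventoryCacheService
  MIN_GROUP_SIZE = 2_000 # per the proposal: only cache groups above ~2k
  TTL_SECONDS = 5 * 60   # matches the 5-minute pre-warm schedule

  def initialize(group_id:, cache:, loader:)
    @group_id = group_id
    @cache = cache
    @loader = loader # stands in for the expensive finder query
  end

  def user_ids(group_size:)
    # Small groups are fast enough without caching; bypass entirely.
    return @loader.call if group_size < MIN_GROUP_SIZE

    @cache.fetch(cache_key, expires_in: TTL_SECONDS) { @loader.call }
  end

  # Cache busting hook for newly added credentials.
  def bust!
    @cache.delete(cache_key)
  end

  private

  def cache_key
    "credentials_inventory:user_ids:#{@group_id}"
  end
end
```

The pre-warm worker would then just call `user_ids` on a schedule for each large group, so real requests mostly hit a warm cache.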
### Long Term
- More caching improvements:
  - https://docs.gitlab.com/development/cached_queries/
  - Passing in multiple params, filters, and sorts
- The current code relies only on the offset & limit to retain the sorting from the optimized CredentialsInventory PAT & RAT finder queries, with heavy index usage. The need for keyset pagination primarily arises from the API and its scraping-related use cases at high Nth pages. Since offset pagination scales linearly (`PAGE * PAGE_SIZE` rows scanned), the cache will be heavily utilized if those cases become common.
- The `#keyset_paginate` method works out of the box with the Efficient `IN` Operator with some changes. Note that although implementing this in the new finder might be feasible, switching to keyset pagination would be a breaking change, affecting the existing credentials_inventory (SM & .com) codepaths, in addition to the API needing to support backward compatibility.
- Auth ReArchitecture (Cells sharding work...)
  - Revisit #550985 (closed) if needed
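The offset-vs-keyset trade-off above can be sketched in miniature (illustrative only, not GitLab code): a sorted array stands in for an index on `id`, and the "rows touched" counts model what the database scan would do.

```ruby
# Why deep offset pages are expensive while keyset pages are not.
Record = Struct.new(:id)

TABLE = (1..10_000).map { |i| Record.new(i) } # already sorted by id

# OFFSET/LIMIT: the database still walks PAGE * PAGE_SIZE rows before it can
# return the page, so cost grows linearly with page depth.
def offset_paginate(table, page:, per_page:)
  rows_touched = page * per_page
  [table.drop((page - 1) * per_page).take(per_page), rows_touched]
end

# Keyset: seek past the last id of the previous page. With an index, the
# database touches only ~per_page rows regardless of page depth (the Ruby
# select here just simulates the index seek).
def keyset_paginate(table, after_id:, per_page:)
  rows = table.select { |r| r.id > after_id }.take(per_page)
  [rows, rows.size]
end

_, offset_cost = offset_paginate(TABLE, page: 1_000, per_page: 10)
_, keyset_cost = keyset_paginate(TABLE, after_id: 9_990, per_page: 10)
# offset_cost: 10_000 rows touched for page 1000; keyset_cost: 10
```

Both return the same rows (ids 9991..10000); only the work done to reach them differs, which is the argument for keyset pagination at high Nth pages.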