Skip to content

[Error Budget] Improve `Groups::UsageQuotasController#index` performance

Summary

The grouputilization error budgets are being impacted by the performance of the Groups::UsageQuotasController#index action.

Example apdex failures from Kibana:

Screenshot_2023-10-26_at_11.43.27

Proposal

The bulk of the time of these failures is time spent in DB, so we can focus our attention on that and see if there are areas we can improve the query performance, or make related changes (e.g. reducing how often we need to make the query via caching).

DB queries

In this view, the largest offending queries being executed are:

  1. Generating a pending member count
Query details ```

On a cold cache, this query is very slow and for our larger customers, the request probably times out:

Time: 1.910 min
  - planning: 11.657 ms
  - execution: 1.910 min
    - I/O read: 2.478 min
    - I/O write: 0.000 ms

Shared buffers:
  - hits: 327554 (~2.50 GiB) from the buffer pool
  - reads: 92754 (~724.60 MiB) from the OS file cache, including disk I/O
  - dirtied: 6477 (~50.60 MiB)
  - writes: 0

https://console.postgres.ai/gitlab/gitlab-production-tunnel-pg12/sessions/23430/commands/75311

It performs better on a warm cache, but still not great: https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/23430/commands/75312

  1. Determines if a combined Free User Limit and Storage Limit banner should be displayed
Query details

This goes through the full_user_count method, which performs poorly for huge group hierarchies:

Time: 19.771 s  
  - planning: 11.245 ms  
  - execution: 19.759 s  
    - I/O read: 21.140 s  
    - I/O write: 0.000 ms  
  
Shared buffers:  
  - hits: 160267 (~1.20 GiB) from the buffer pool  
  - reads: 10814 (~84.50 MiB) from the OS file cache, including disk I/O  
  - dirtied: 357 (~2.80 MiB)  
  - writes: 0 

https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/23430/commands/75314

It performs better on a warm cache: https://postgres.ai/console/gitlab/gitlab-production-tunnel-pg12/sessions/23430/commands/75318

The class being used performs three similar queries, all with similar performance as they go via the billed_user_finder:

Screenshot_2023-10-26_at_12.53.54

these queries have been tested against one of our largest known group hierarchies, they're certainly worst-case scenarios

Additional notes

The likelihood is that there won't be much that can be done to improve the query performance, it's known that they're slow and inefficient.

We might see significant improvement when initiatives like https://gitlab.com/gitlab-org/customers-gitlab-com/-/issues/7130+ move forward.

For now, we can look into caching and other related improvements:

Generating a pending member count

Currently, the query is made to determine whether or not a link is displayed to the pending members view.

Focussing on reducing how often the query is made for a group, some options are:

# Proposal Estimated effort Estimated impact/notes
1 Utilise ReactiveCaching for the pending member count Medium-high Although this won't improve the initial query performance, subsequent queries should be much faster (for the life of the cache), we employ this same technique for the billable members count
2 Always display a pending members tab Low-medium This removes the need for the query entirely, but adds a new tab to be built on the frontend

See #429205 (comment 1619973329) for the initial discussion/context