Database contention in ci_namespace_monthly_usages updates for large, nested top-level group
### Summary

A ~"GitLab Dedicated" customer has a deeply nested organization of subgroups and projects, with a single top-level namespace containing 90% of the projects and 90% of the CI activity. As a result, every CI job that completes updates the CI minutes usage in the same `ci_namespace_monthly_usages` table.

Sidekiq workers, such as `UpdateProjectAndNamespaceUsageWorker`, saturated the database with long-running connections, slowing other services. This was mitigated by disabling the job using `drop_sidekiq_jobs_Ci::Minutes::UpdateProjectAndNamespaceUsageWorker`. The side effect of disabling the job is that the instance no longer gathers pipeline usage data:

![image](/uploads/ec199f7aeda4c86fafc6959f9be7d5cb/image.png)

### Possible fixes

The corrective action identified: buffer the namespace CI minute aggregation via Redis.

> `Ci::Minutes::UpdateProjectAndNamespaceUsageWorker` increments the CI minutes and shared runners duration for a given project and its top-level namespace. It is executed for every CI job that completes.
>
> One idea to solve this could be to use Redis to buffer updates, as we do with the `CounterAttribute` module, and flush them to the database regularly. One challenge to take into consideration here is that we use the "current usage" to enforce the CI minutes limit and as an abuse-prevention system.
> Waiting too long to flush values to the database could cause the current usage not to be the single source of truth (SSoT), unless we also add in the temporary usage held in Redis.

### Challenges

* We need to take into account manual minutes resets such as `Ci::Minutes::ResetUsageService`, which need to work well with buffered increments in Redis (e.g. wipe the Redis increments too).
* How does this work with live tracking of CI minutes (`Ci::Minutes::TrackLiveConsumptionService`)? When checking the actual consumption of CI minutes, we should always return the database value plus the pending increment in Redis as the SSoT.
* We may want to configure the worker delay for flushing counters to the database via `counter_attribute`, e.g. flushing values more frequently than every 10 minutes.
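To illustrate the buffering idea, here is a minimal sketch of a `CounterAttribute`-style buffered counter. It is not GitLab's actual implementation: `BufferedUsageCounter`, its method names, and the plain `Hash` stand-ins for Redis and the `ci_namespace_monthly_usages` table are all hypothetical, chosen only to show the shape of the approach (cheap increments per job, one batched write per flush, SSoT read as database value plus pending buffer).

```ruby
# Sketch of Redis-buffered CI minutes tracking. Hashes stand in for
# Redis and for ci_namespace_monthly_usages; all names are hypothetical.
class BufferedUsageCounter
  def initialize
    @buffer = {}            # stand-in for Redis: key => pending increment
    @database = Hash.new(0) # stand-in for persisted amount_used per namespace
  end

  # Called on every job completion: a cheap in-memory/Redis increment
  # instead of an UPDATE that contends on the same namespace row.
  def increment(namespace_id, minutes)
    key = "ci_usage:#{namespace_id}"
    @buffer[key] = @buffer.fetch(key, 0) + minutes
  end

  # Periodically flush pending increments: one UPDATE per namespace
  # instead of one per completed job.
  def flush!
    @buffer.keys.each do |key|
      namespace_id = key.split(':').last.to_i
      pending = @buffer.delete(key)
      @database[namespace_id] += pending
    end
  end

  # SSoT read for limit enforcement: persisted value plus anything
  # still buffered, so quota checks do not lag behind real usage.
  def current_usage(namespace_id)
    @database[namespace_id] + @buffer.fetch("ci_usage:#{namespace_id}", 0)
  end
end
```

Because `current_usage` sums both sources, quota enforcement sees the same total before and after a flush, which addresses the SSoT concern raised above.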
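The reset challenge can also be sketched concretely: if a manual reset (as in `Ci::Minutes::ResetUsageService`) only zeroes the database row, increments still buffered in Redis would "resurrect" pre-reset usage on the next flush. The snippet below demonstrates this with the same hypothetical in-memory stand-ins; `UsageStore` and its methods are illustrative, not GitLab code.

```ruby
# Sketch of why a manual reset must also wipe buffered Redis increments.
# Hashes stand in for Redis and the database; all names are hypothetical.
class UsageStore
  attr_reader :db

  def initialize
    @db = Hash.new(0)     # persisted amount_used per namespace
    @buffer = Hash.new(0) # pending Redis increments per namespace
  end

  def track(namespace_id, minutes)
    @buffer[namespace_id] += minutes
  end

  def flush!
    @buffer.each { |namespace_id, pending| @db[namespace_id] += pending }
    @buffer.clear
  end

  # A correct reset wipes the pending increments too; zeroing only the
  # database row would let the next flush! re-apply pre-reset usage.
  def reset!(namespace_id)
    @db[namespace_id] = 0
    @buffer.delete(namespace_id)
  end
end
```

Usage: after `track(1, 50)`, `flush!`, `track(1, 25)`, then `reset!(1)`, a subsequent `flush!` leaves the namespace at zero, because the buffered 25 minutes were discarded along with the persisted 50.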