Database contention on `ci_namespace_monthly_usages` updates for a large, nested top-level group
Summary
A GitLab Dedicated customer has a deeply nested hierarchy of subgroups and projects in which a single top-level namespace contains 90% of the projects and 90% of the CI activity. As a result, every completing CI job updates the CI minutes usage on the same row of the `ci_namespace_monthly_usages` table. Sidekiq workers, such as `UpdateProjectAndNamespaceUsageWorker`, saturated the database with long-running connections, slowing other services. This was mitigated by disabling the job via `drop_sidekiq_jobs_Ci::Minutes::UpdateProjectAndNamespaceUsageWorker`.
The side effect of disabling the job is that the instance no longer gathers pipeline usage data.
Possible fixes
The corrective action identified: Buffer the namespace CI minute aggregation via Redis.
`Ci::Minutes::UpdateProjectAndNamespaceUsageWorker` increments the CI minutes and shared runners duration for a given project and its top-level namespace. It is executed for every CI job that completes. One idea is to buffer the updates in Redis, as we do with the `CounterAttribute` module, and flush them to the database at regular intervals. One challenge to take into consideration is that we use the "current usage" to enforce the CI minutes limit and as an abuse-prevention system. Waiting too long to flush values to the database could mean the stored usage is no longer the single source of truth (SSoT), unless we also add up the temporary usage buffered in Redis when reading it.
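As a rough sketch of the buffering idea, the counter below models the `CounterAttribute` pattern: each completed job performs a cheap Redis `INCRBYFLOAT` instead of an `UPDATE` on the contended row, and a periodic flush applies the accumulated total in a single database write. The class and key names are illustrative, not GitLab's actual implementation, and a tiny in-memory stand-in replaces Redis so the sketch runs anywhere.

```ruby
# Minimal in-memory stand-in for the few Redis commands used below,
# so the sketch runs without a Redis server.
class FakeRedis
  def initialize
    @data = Hash.new("0")
  end

  def incrbyfloat(key, amount)
    @data[key] = (@data[key].to_f + amount).to_s
  end

  def getset(key, value)
    old = @data[key]
    @data[key] = value.to_s
    old
  end

  def get(key)
    @data[key]
  end
end

# Hypothetical buffered counter modeled on the CounterAttribute pattern.
class BufferedMinutesCounter
  def initialize(redis, &persist)
    @redis = redis
    @persist = persist # e.g. one UPDATE ... SET amount_used = amount_used + ?
  end

  # Called for every completed CI job: an O(1) Redis write instead of
  # an update on the contended ci_namespace_monthly_usages row.
  def increment(namespace_id, minutes)
    @redis.incrbyfloat(key(namespace_id), minutes)
  end

  # Periodic flush: atomically read-and-reset the buffer, then apply
  # the accumulated amount to the database in a single update.
  def flush(namespace_id)
    buffered = @redis.getset(key(namespace_id), 0).to_f
    @persist.call(namespace_id, buffered) if buffered.positive?
    buffered
  end

  # SSoT read: database value plus whatever is still buffered in Redis.
  def current_usage(namespace_id, database_value)
    database_value + @redis.get(key(namespace_id)).to_f
  end

  private

  def key(namespace_id)
    "ci:minutes:buffer:#{namespace_id}"
  end
end
```

Note that `current_usage` always combines the database value with the pending Redis increments, which is what keeps limit enforcement accurate between flushes.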
Challenges
- We need to take into consideration manual minutes resets, like `Ci::Minutes::ResetUsageService`, which must work well with buffered increments in Redis (e.g. wipe the Redis increments too).
- How does this work with live tracking of CI minutes (`Ci::Minutes::TrackLiveConsumptionService`)? When checking the actual consumption of CI minutes, we should always return the database value plus the increment in Redis as the SSoT.
- We may want to configure the worker delay for flushing counters to the database via `counter_attribute`, e.g. flushing values more frequently than every 10 minutes.
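The first challenge above can be illustrated with a small sketch: a manual reset (cf. `Ci::Minutes::ResetUsageService`) should discard the pending Redis increments in the same operation that zeroes the database row, so a later flush cannot resurrect pre-reset usage. The function and key layout here are assumptions for illustration, and a plain Hash stands in for Redis.

```ruby
# Illustrative reset flow: wipe the buffered increments for the namespace
# before (or together with) zeroing the database row, so no pre-reset
# increment survives to be flushed later.
def reset_usage(buffer, namespace_id)
  buffer.delete("ci:minutes:buffer:#{namespace_id}") # discard pending increments
  # ...then e.g. UPDATE ci_namespace_monthly_usages SET amount_used = 0 WHERE ...
end
```

Ordering matters: if the database row were zeroed first while increments remained in Redis, a flush racing with the reset could re-add pre-reset usage.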