Reduce CPU utilization on the postgres primary stemming from ContainerExpirationPolicies::CleanupContainerRepositoryWorker
From https://gitlab.com/gitlab-com/gl-infra/capacity-planning/-/issues/892#note_1307833246
The `ContainerExpirationPolicies::CleanupContainerRepositoryWorker` is a `LimitedCapacity::Worker` currently running with a concurrency of 10.
It looks like we're not actually deleting a lot of tags most of the time (https://log.gprd.gitlab.net/goto/b9e34e50-bf1a-11ed-85ed-e7557b0a598c); we're mostly processing the backlog of unfinished cleanups. Those unfinished cleanups are processed on a best-effort basis, so perhaps we'd be okay with reducing the concurrency. We should start by logging when the cleanup policy was supposed to run (`next_run_at`) so we can use that to see whether we're behind on the policies that have to run. Then we can see how far we can lower concurrency without impacting the repositories that are already caught up.
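To make the "are we behind?" question measurable, a minimal sketch of the lag computation we'd want in the log payload (the helper name and log keys are assumptions for illustration, not existing code):

```ruby
require 'time'

# Hypothetical helper: given a policy's next_run_at and the current time,
# return how many seconds behind schedule the cleanup is (0 if on time).
def cleanup_lag_seconds(next_run_at, now)
  [(now - next_run_at).to_i, 0].max
end

# Example: a policy that was due at midnight, observed at 01:30 UTC.
next_run_at = Time.utc(2023, 3, 10, 0, 0, 0)
now         = Time.utc(2023, 3, 10, 1, 30, 0)

# The worker could emit something like this alongside its existing log line:
log_payload = {
  'cleanup_policy_next_run_at' => next_run_at.iso8601,
  'cleanup_policy_lag_seconds' => cleanup_lag_seconds(next_run_at, now)
}
puts log_payload['cleanup_policy_lag_seconds'] # => 5400
```

With that field in the logs, we can lower the concurrency step by step and watch whether the lag for due policies starts growing.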
After that, to be able to scale this up again as usage of container repositories and their automatic cleanup grows, we should look into denormalizing the query.
It executes 2 queries that are quite expensive because of the joins they need to do: we need information from both `container_repositories` and `container_expiration_policies`. It performs an update on the state of `container_repository`. Perhaps we could move all of the attributes for running these cleanups onto `container_expiration_policies` and make that the sole record that we update in the transaction with the lock. Then, for performing the actual cleanup, we can issue a simple `SELECT * FROM container_repositories WHERE id = ?` query.
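The denormalized flow could look roughly like this (a sketch only; the column names and lock strategy are assumptions for illustration, not the actual schema):

```sql
-- Before (sketch): picking work needs a join, because the schedule lives on
-- container_expiration_policies while cleanup state lives on container_repositories.
SELECT container_repositories.*
FROM container_repositories
JOIN container_expiration_policies
  ON container_expiration_policies.project_id = container_repositories.project_id
WHERE container_expiration_policies.enabled = TRUE
  AND container_expiration_policies.next_run_at < NOW()
LIMIT 1
FOR UPDATE SKIP LOCKED;

-- After (sketch): the policy row carries everything needed to pick and lock
-- work, so the join disappears and the locked transaction touches one table.
UPDATE container_expiration_policies
SET cleanup_status = 'cleanup_ongoing'
WHERE id = (
  SELECT id
  FROM container_expiration_policies
  WHERE enabled = TRUE
    AND next_run_at < NOW()
  LIMIT 1
  FOR UPDATE SKIP LOCKED
);

-- Fetching the repository to clean then becomes a trivial primary-key lookup:
SELECT * FROM container_repositories WHERE id = ?;
```

The win is that the contended, locked transaction only ever reads and writes `container_expiration_policies`, while the wide join is replaced by a cheap lookup by id.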