Skip to content

Improve cleanup policies selection during their execution

🌲 Context

Cleanup policies are executed on a regular basis by a cron job. During this execution, the cron job selects the cleanup policies that need to be executed using the .runnable scope.

The issue is that this scope selects a far too big set (700K+ rows).

Why do we have such a high set, it's because cleanup policies are created when a new project is created and by default the policy is enabled = it will be picked up by the cron job for execution.

🔍 What does this MR do?

As explained in #263110 (closed), we're going to reduce the size of the selected scope by selecting only policies that are actually linked to container repositories. Policies with no container repositories don't need to be executed as their execution is a no-op.

This MR will:

  • add a new .executable scope in the ContainerExpirationPolicy
    • this scope will select policies that are runnable_schedules and have at least one container repository.
  • add .each_bactch support to ContainerExpirationPolicy so that we can iterate on a large set more efficiently
  • add a new index to improve the ~performance of .executable.each_batch
  • update the relevant specs

💽 Database considerations

Before this MR, the cron worker was using .find_each to loop on the policies set. As stated in the documentation:

NOTE: It's not possible to set the order. That is automatically set to ascending on the primary key (“id ASC”) to make the batch ordering work. This also means that this method only works when the primary key is orderable (e.g. an integer or string).

So a order by policy_id will be imposed and here an example of the resulting query:

SELECT "container_expiration_policies".*
FROM "container_expiration_policies"
WHERE "container_expiration_policies"."enabled" = TRUE
  AND (next_run_at < '2020-10-09 08:40:06.780007')
ORDER BY "container_expiration_policies"."project_id" ASC
LIMIT 1000;

Here is the explain plan: https://explain.depesz.com/s/XxqE . We can notice that the index index_container_expiration_policies_on_next_run_at_and_enabled is not used at all. Thanks to @sabrams explanations 🙏, this is due to the ORDER BY "container_expiration_policies"."project_id" ASC imposed by .find_each

To fix that we're going to add a new index on columns project_id, next_run_at and enabled.

This MR will modify the query to include a check for ContainerRepository existence and so we end up with a query like this one:

SELECT "container_expiration_policies"."project_id"
FROM "container_expiration_policies"
WHERE "container_expiration_policies"."enabled" = TRUE
  AND (next_run_at < '2020-10-09 09:00:45.082074')
  AND (EXISTS
         (SELECT 1
          FROM "container_repositories"
          WHERE (container_repositories.project_id = container_expiration_policies.project_id)))
ORDER BY "container_expiration_policies"."project_id" ASC
LIMIT 1000

Here is the explain plan: https://explain.depesz.com/s/bUbP. This time, we can see that the new index based on the 3 columns is properly used. 🚀

🔼 Migration up

== 20201009090954 AddIndexWithProjectIdToContainerExpirationPolicies: migrating
-- transaction_open?()
   -> 0.0000s
-- index_exists?(:container_expiration_policies, [:project_id, :next_run_at, :enabled], {:name=>"idx_container_exp_policies_on_project_id_next_run_at_enabled", :algorithm=>:concurrently})
   -> 0.0020s
-- execute("SET statement_timeout TO 0")
   -> 0.0002s
-- add_index(:container_expiration_policies, [:project_id, :next_run_at, :enabled], {:name=>"idx_container_exp_policies_on_project_id_next_run_at_enabled", :algorithm=>:concurrently})
   -> 0.0030s
-- execute("RESET ALL")
   -> 0.0002s
== 20201009090954 AddIndexWithProjectIdToContainerExpirationPolicies: migrated (0.0057s)

🔽 Migration down

== 20201009090954 AddIndexWithProjectIdToContainerExpirationPolicies: reverting
-- transaction_open?()
   -> 0.0000s
-- indexes(:container_expiration_policies)
   -> 0.0030s
-- execute("SET statement_timeout TO 0")
   -> 0.0002s
-- remove_index(:container_expiration_policies, {:algorithm=>:concurrently, :name=>"idx_container_exp_policies_on_project_id_next_run_at_enabled"})
   -> 0.0017s
-- execute("RESET ALL")
   -> 0.0002s
== 20201009090954 AddIndexWithProjectIdToContainerExpirationPolicies: reverted (0.0054s)

Screenshots

n / a

Does this MR meet the acceptance criteria?

Conformity

Availability and Testing

Security

If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:

  • Label as security and @ mention @gitlab-com/gl-security/appsec
  • The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
  • Security reports checked/validated by a reviewer from the AppSec team
Edited by David Fernandez

Merge request reports