
Implement the loopless mode for the cleanup policy worker

🎱 Context

Cleanup policies are tasks executed by a background worker. We won't detail the task here, but it's a non-trivial process.

Currently, each enabled cleanup policy ready to run is read by a cron worker. This cron worker will:

  1. "Mark" container repositories
  2. Enqueue jobs for the related limited capacity worker

The main problem is that (1.) is not scalable. The SQL query executed to get all the runnable policies is quite complex. In addition, the worker loops on each row and then marks the linked container repositories.

This is issue #267546 (closed).

The found solution is detailed in #267546 (closed).

Since we're changing the whole logic for both workers, this effort is split into steps and gated behind a feature flag so that we can "roll back" to the current logic at will:

  1. Introduce the feature flag + update the cron worker according to #267546 (closed). This is !56962 (merged).
  2. Prepare the limited worker for changes. This is !56989 (merged).
  3. (You are here) Update the limited worker to support the new logic

We have two sets that contain all the ContainerRepository records in need of a cleanup.

  1. scheduled: this set is for when we hit a policy next_run_at. The linked container repositories must be cleaned up asap. This is the "high priority" set. The selection conditions for this set are:
    • policy enabled set to true
    • policy next_run_at in the past
    • ContainerRepository cleanup state in unscheduled or scheduled (for compatibility with the old logic)
    • these are implemented in the ContainerRepository.with_scheduled_cleanup scope
  2. unfinished: this is for all the leftover work from (1.). In other words, when there is no work in (1.), the backend should check this set. The conditions are:
    • policy enabled
    • ContainerRepository cleanup state in unfinished
    • these are implemented in the ContainerRepository.with_unfinished_cleanup scope

The worker will need to mix those two sets and order the results in a way that elements of (1.) go first. That's already what we do in the old logic.
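The selection and ordering described above can be sketched in plain Ruby. This is an illustrative simulation, not the actual implementation: the real code uses ActiveRecord scopes (`with_scheduled_cleanup`, `with_unfinished_cleanup`), and the `Repo` struct and field names here are hypothetical stand-ins. The policy `enabled` condition is assumed true for brevity.

```ruby
# Illustrative stand-in for a ContainerRepository row joined with its policy.
Repo = Struct.new(:id, :cleanup_status, :policy_next_run_at, keyword_init: true)

NOW = Time.now

repos = [
  Repo.new(id: 1, cleanup_status: :unfinished,  policy_next_run_at: NOW + 3600),
  Repo.new(id: 2, cleanup_status: :unscheduled, policy_next_run_at: NOW - 60),
  Repo.new(id: 3, cleanup_status: :scheduled,   policy_next_run_at: NOW - 120),
]

# Set (1), "scheduled": policy next_run_at in the past,
# cleanup state unscheduled or scheduled.
scheduled = repos.select do |r|
  r.policy_next_run_at < NOW && [:unscheduled, :scheduled].include?(r.cleanup_status)
end

# Set (2), "unfinished": leftover work, cleanup state unfinished.
unfinished = repos.select { |r| r.cleanup_status == :unfinished }

# Mix both sets; elements of set (1) go first.
work_queue = scheduled + unfinished
work_queue.map(&:id) # => [2, 3, 1]
```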

In addition, we will need to schedule the next policy run. We will do this when:

  • All the cleanup started_at timestamps of the linked repositories are after the current policy's next_run_at.
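That rescheduling condition can be expressed as a small predicate. The sketch below is illustrative only: `reschedulable?` and its arguments are hypothetical names, not the real worker's API.

```ruby
require 'time'

# Returns true when every linked repository's cleanup started after the
# policy's next_run_at, meaning all scheduled work has been picked up and
# the next policy run can be planned. Names are illustrative.
def reschedulable?(policy_next_run_at, cleanup_started_ats)
  cleanup_started_ats.all? { |started_at| started_at > policy_next_run_at }
end

next_run_at = Time.parse('2021-04-22 10:00:00 UTC')

reschedulable?(next_run_at, [next_run_at + 60, next_run_at + 120]) # => true
reschedulable?(next_run_at, [next_run_at + 60, next_run_at - 30])  # => false
```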

🤔 What does this MR do?

  • Updates the worker to use the above conditions/logic.
    • This MR removes previously introduced inner classes as we no longer need them
    • The worker can support the feature flag by extracting the modified parts in functions. Each function will select the proper body depending on the feature flag state.
  • Updates the related specs.
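The flag-gating pattern described above can be sketched like this. Everything here is a hypothetical stand-in: the real code checks the flag via GitLab's `Feature.enabled?`, while this sketch uses an environment variable so it runs standalone, and the method name and return values are illustrative.

```ruby
# Hypothetical stand-in for GitLab's Feature.enabled? check.
def loopless_enabled?
  ENV['CLEANUP_LOOPLESS'] == 'true'
end

# Each modified part of the worker is extracted into a function that
# selects the proper body depending on the feature flag state.
def next_container_repository
  if loopless_enabled?
    'repository from scheduled/unfinished sets' # new, loopless logic
  else
    'repository marked by the cron worker'      # old logic
  end
end

ENV['CLEANUP_LOOPLESS'] = 'true'
next_container_repository # => "repository from scheduled/unfinished sets"
```

Keeping both bodies behind one function per touched spot is what makes the "rollback at will" property cheap: toggling the flag switches every extracted function at once.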

🖼 Screenshots (strongly suggested)

n/a

📏 Does this MR meet the acceptance criteria?

Conformity

Availability and Testing

Security

If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:

  • [-] Label as security and @ mention @gitlab-com/gl-security/appsec
  • [-] The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
  • [-] Security reports checked/validated by a reviewer from the AppSec team

💾 Database Review

Migration Up

$ rails db:migrate
== 20210422142647 AddProjectIdNextRunAtIndexToContainerExpirationPolicies: migrating 
-- transaction_open?()
   -> 0.0000s
-- index_exists?(:container_expiration_policies, [:project_id, :next_run_at], {:name=>"idx_container_exp_policies_on_project_id_next_run_at", :where=>"enabled = true", :algorithm=>:concurrently})
   -> 0.0025s
-- execute("SET statement_timeout TO 0")
   -> 0.0005s
-- add_index(:container_expiration_policies, [:project_id, :next_run_at], {:name=>"idx_container_exp_policies_on_project_id_next_run_at", :where=>"enabled = true", :algorithm=>:concurrently})
   -> 0.0044s
-- execute("RESET ALL")
   -> 0.0006s
== 20210422142647 AddProjectIdNextRunAtIndexToContainerExpirationPolicies: migrated (0.0088s) 

Migration Down

$ rails db:rollback
== 20210422142647 AddProjectIdNextRunAtIndexToContainerExpirationPolicies: reverting 
-- transaction_open?()
   -> 0.0000s
-- index_exists?(:container_expiration_policies, [:project_id, :next_run_at], {:name=>"idx_container_exp_policies_on_project_id_next_run_at", :algorithm=>:concurrently})
   -> 0.0025s
-- execute("SET statement_timeout TO 0")
   -> 0.0005s
-- remove_index(:container_expiration_policies, {:name=>"idx_container_exp_policies_on_project_id_next_run_at", :algorithm=>:concurrently, :column=>[:project_id, :next_run_at]})
   -> 0.0053s
-- execute("RESET ALL")
   -> 0.0005s
== 20210422142647 AddProjectIdNextRunAtIndexToContainerExpirationPolicies: reverted (0.0096s) 

Queries

Note: for all the explain plans, the following commands were executed on pg.ai before the explain command:

  1. reset
  2. exec CREATE INDEX idx_container_exp_policies_on_project_id_next_run_at ON container_expiration_policies USING btree (project_id, next_run_at) WHERE (enabled = true)
  3. exec VACUUM ANALYZE container_expiration_policies
