
Implement the loopless mode for the cleanup policy worker

🎱 Context

Cleanup policies are tasks executed by a background worker. We won't detail the task here, but it's a non-trivial process.

Currently, each enabled cleanup policy ready to run is read by a cron worker. This cron worker will:

  1. "Mark" container repositories
  2. Enqueue jobs for the related limited capacity worker

The main problem is that (1.) is not scalable. The SQL query executed to get all the runnable policies is quite complex. In addition, the worker loops on each row and then marks the linked container repositories.

This is issue #267546 (closed).

The found solution is detailed in #267546 (closed).

Since we're changing the whole logic for both workers, this effort is split into steps and gated behind a feature flag so that we can "roll back" to the current logic at will:

  1. Introduce the feature flag + update the cron worker according to #267546 (closed). This is !56962 (merged).
  2. Prepare the limited worker for changes. This is !56989 (merged).
  3. (You are here) Update the limited worker to support the new logic

We have two sets that contain all the ContainerRepository records in need of a cleanup.

  1. scheduled: this set is for when we hit a policy next_run_at. The linked container repositories must be cleaned up asap. This is the "high priority" set. The selection conditions for this set are:
    • policy enabled set to true
    • policy next_run_at in the past
    • ContainerRepository cleanup state in unscheduled or scheduled (for compatibility with the old logic)
    • these are implemented in the ContainerRepository.with_scheduled_cleanup scope
  2. unfinished: this is for all the leftover work from (1.). In other words, when there is no work in (1.), the backend should check this set. The conditions are:
    • policy enabled
    • ContainerRepository cleanup state in unfinished
    • these are implemented in the ContainerRepository.with_unfinished_cleanup scope

The worker will need to mix those two sets and order the results in a way that elements of (1.) go first. That's already what we do in the old logic.
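The selection and ordering described above can be sketched in plain Ruby. This is an illustrative simulation, not the actual implementation: the real code uses ActiveRecord scopes (`with_scheduled_cleanup`, `with_unfinished_cleanup`), and the `Repo` struct and field names here are hypothetical stand-ins. The policy `enabled` condition is assumed true for brevity.

```ruby
# Illustrative stand-in for a ContainerRepository row joined with its policy.
Repo = Struct.new(:id, :cleanup_status, :policy_next_run_at, keyword_init: true)

NOW = Time.now

repos = [
  Repo.new(id: 1, cleanup_status: :unfinished,  policy_next_run_at: NOW + 3600),
  Repo.new(id: 2, cleanup_status: :unscheduled, policy_next_run_at: NOW - 60),
  Repo.new(id: 3, cleanup_status: :scheduled,   policy_next_run_at: NOW - 120),
]

# Set (1), "scheduled": policy next_run_at in the past,
# cleanup state unscheduled or scheduled.
scheduled = repos.select do |r|
  r.policy_next_run_at < NOW && [:unscheduled, :scheduled].include?(r.cleanup_status)
end

# Set (2), "unfinished": leftover work, cleanup state unfinished.
unfinished = repos.select { |r| r.cleanup_status == :unfinished }

# Mix both sets; elements of set (1) go first.
work_queue = scheduled + unfinished
work_queue.map(&:id) # => [2, 3, 1]
```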

In addition, we will need to schedule the next policy run. We will do this when:

  • All the cleanup started_at timestamps of the linked repositories are after the current policy's next_run_at.
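That rescheduling condition can be expressed as a small predicate. The sketch below is illustrative only: `reschedulable?` and its arguments are hypothetical names, not the real worker's API.

```ruby
require 'time'

# Returns true when every linked repository's cleanup started after the
# policy's next_run_at, meaning all scheduled work has been picked up and
# the next policy run can be planned. Names are illustrative.
def reschedulable?(policy_next_run_at, cleanup_started_ats)
  cleanup_started_ats.all? { |started_at| started_at > policy_next_run_at }
end

next_run_at = Time.parse('2021-04-22 10:00:00 UTC')

reschedulable?(next_run_at, [next_run_at + 60, next_run_at + 120]) # => true
reschedulable?(next_run_at, [next_run_at + 60, next_run_at - 30])  # => false
```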

🤔 What does this MR do?

  • Updates the worker to use the above conditions/logic.
    • This MR removes previously introduced inner classes as we no longer need them
    • The worker can support the feature flag by extracting the modified parts in functions. Each function will select the proper body depending on the feature flag state.
  • Updates the related specs.
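The flag-gating pattern described above can be sketched like this. Everything here is a hypothetical stand-in: the real code checks the flag via GitLab's `Feature.enabled?`, while this sketch uses an environment variable so it runs standalone, and the method name and return values are illustrative.

```ruby
# Hypothetical stand-in for GitLab's Feature.enabled? check.
def loopless_enabled?
  ENV['CLEANUP_LOOPLESS'] == 'true'
end

# Each modified part of the worker is extracted into a function that
# selects the proper body depending on the feature flag state.
def next_container_repository
  if loopless_enabled?
    'repository from scheduled/unfinished sets' # new, loopless logic
  else
    'repository marked by the cron worker'      # old logic
  end
end

ENV['CLEANUP_LOOPLESS'] = 'true'
next_container_repository # => "repository from scheduled/unfinished sets"
```

Keeping both bodies behind one function per touched spot is what makes the "rollback at will" property cheap: toggling the flag switches every extracted function at once.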

🖼 Screenshots (strongly suggested)

n/a

📏 Does this MR meet the acceptance criteria?

Conformity

Availability and Testing

Security

If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:

  • [-] Label as security and @ mention @gitlab-com/gl-security/appsec
  • [-] The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
  • [-] Security reports checked/validated by a reviewer from the AppSec team

💾 Database Review

Migration Up

$ rails db:migrate
== 20210422142647 AddProjectIdNextRunAtIndexToContainerExpirationPolicies: migrating 
-- transaction_open?()
   -> 0.0000s
-- index_exists?(:container_expiration_policies, [:project_id, :next_run_at], {:name=>"idx_container_exp_policies_on_project_id_next_run_at", :where=>"enabled = true", :algorithm=>:concurrently})
   -> 0.0025s
-- execute("SET statement_timeout TO 0")
   -> 0.0005s
-- add_index(:container_expiration_policies, [:project_id, :next_run_at], {:name=>"idx_container_exp_policies_on_project_id_next_run_at", :where=>"enabled = true", :algorithm=>:concurrently})
   -> 0.0044s
-- execute("RESET ALL")
   -> 0.0006s
== 20210422142647 AddProjectIdNextRunAtIndexToContainerExpirationPolicies: migrated (0.0088s) 

Migration Down

$ rails db:rollback
== 20210422142647 AddProjectIdNextRunAtIndexToContainerExpirationPolicies: reverting 
-- transaction_open?()
   -> 0.0000s
-- index_exists?(:container_expiration_policies, [:project_id, :next_run_at], {:name=>"idx_container_exp_policies_on_project_id_next_run_at", :algorithm=>:concurrently})
   -> 0.0025s
-- execute("SET statement_timeout TO 0")
   -> 0.0005s
-- remove_index(:container_expiration_policies, {:name=>"idx_container_exp_policies_on_project_id_next_run_at", :algorithm=>:concurrently, :column=>[:project_id, :next_run_at]})
   -> 0.0053s
-- execute("RESET ALL")
   -> 0.0005s
== 20210422142647 AddProjectIdNextRunAtIndexToContainerExpirationPolicies: reverted (0.0096s) 

Queries

Note: for all the explain plans, the following commands were executed on pg.ai before the explain command:

  1. reset
  2. exec CREATE INDEX idx_container_exp_policies_on_project_id_next_run_at ON container_expiration_policies USING btree (project_id, next_run_at) WHERE (enabled = true)
  3. exec VACUUM ANALYZE container_expiration_policies
