Skip to content

Add timeout support in Projects::ContainerRepository::DeleteTagsService#execute

David Fernandez requested to merge 208193-limit-delete-tags-service-runtime into master

🤔 What does this MR do?

MR 1 of #208193 (comment 362910703) MR plan:

Put a timeout around DeleteTagsService#execute but only under those conditions:

  • The GitLab Registry is in use. (This is what gitlab.com uses)
  • The feature flag container_registry_expiration_policies_throttling is enabled
  • Application Setting container_registry_delete_tags_service_timeout set to anything else than nil or 0

Design choices

  • For a better understanding how this service is used, see the schema in #208193 (comment 362910703).
  • This service function is really simple: it receives a list of tags, it will loop on them and sending the appropriate request to the Container Registry to delete them.
  • The main issue is that these requests to the Container Registry can be really slow.
  • We can easily imagine what happens when the service receives let's say 100K tags to delete. Well, let's summarize it with 🐢
  • One upcoming feature is having Container Registry Cleanup Policies open to all projects. The policies are executed (with a background worker) under a given schedule and each execution will use this service to delete a given list of tags.
  • To avoid resources starvation, this MR puts an execution time limit to this service.
    • When the limit is hit, the loop is stopped.
    • As such, we can have a partial execution which is an accepted outcome.
    • The default value of 250 seconds has been selected so that we are way below from the 300 seconds threshold that is applied for background workers (see gitlab-com/gl-infra/scalability#461 (comment 370427582))
  • To have a better control of this execution time limit value, it is implemented as an application setting (container_registry_delete_tags_service_timeout)
    • The corresponding UI has been added to the container settings in the admin UI (located at /admin/application_settings/ci_cd)
    • This way, we can update this value directly from the UI
  • As stated above, this MR is part of a larger effort to limit resources usage during Container Registry Cleanup Policies execution.
    • This being a deep change from the current behavior, all the limits we are implementing in the larger effort are gated behind a feature flag: container_registry_expiration_policies_throttling. This provides us a safety switch between non throttled executions (current implementation) <-> throttled executions. This switch can be used in the case of a production issue.
    • This MR being the first of the larger effort, it is the one introducing the feature flag.
  • To be complete in this MR, the corresponding documentation has been updated with a Caution block to indicate that Cleanup Policies can be partially executed.

🍪 Database Migrations

Up

$ rails db:migrate:up VERSION=20200710113437 
== 20200710113437 AddContainerRegistryDeleteTagsServiceTimeoutToApplicationSettings: migrating 
-- add_column(:application_settings, :container_registry_delete_tags_service_timeout, :integer, {:default=>250, :null=>false})
   -> 0.0054s
== 20200710113437 AddContainerRegistryDeleteTagsServiceTimeoutToApplicationSettings: migrated (0.0054s) 

Down

$ rails db:migrate:down VERSION=20200710113437 
== 20200710113437 AddContainerRegistryDeleteTagsServiceTimeoutToApplicationSettings: reverting 
-- remove_column(:application_settings, :container_registry_delete_tags_service_timeout)
   -> 0.0165s
== 20200710113437 AddContainerRegistryDeleteTagsServiceTimeoutToApplicationSettings: reverted (0.0166s) 

📸 Screenshots

The admin UI:

Screenshot_2020-08-19_at_15.04.12

🛃 Does this MR meet the acceptance criteria?

Conformity

Availability and Testing

Security

If this MR contains changes to processing or storing of credentials or tokens, authorization and authentication methods and other items described in the security review guidelines:

  • [-] Label as security and @ mention @gitlab-com/gl-security/appsec
  • [-] The MR includes necessary changes to maintain consistency between UI, API, email, or other methods
  • [-] Security reports checked/validated by a reviewer from the AppSec team
Edited by David Fernandez

Merge request reports