[Feature flag] Rollout of `container_registry_expiration_policies_throttling`

What

Roll out the `container_registry_expiration_policies_throttling` feature flag.

Owners

  • Team: Package
  • Most appropriate Slack channel to reach out to: #s_package
  • Best individual to reach out to: @10io

Expectations

What are we expecting to happen?

This flag will enable some limits around the container tags cleanup services / workers. See the analysis in #208193 (comment 362910703).

With the feature flag enabled

  • A new application setting is available for the container registry: container_registry_delete_tags_service_timeout
  • https://gitlab.com/gitlab-org/gitlab/-/blob/master/app/services/projects/container_repository/gitlab/delete_tags_service.rb#L17 will run for at most the duration set in ::Gitlab::CurrentSettings.current_application_settings.container_registry_delete_tags_service_timeout

With the feature flag disabled

  • https://gitlab.com/gitlab-org/gitlab/-/blob/master/app/services/projects/container_repository/gitlab/delete_tags_service.rb#L17 can run for an arbitrary amount of time
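The difference between the two modes can be illustrated with a minimal sketch. This is not the code in delete_tags_service.rb: the method name `delete_tags`, its parameters, the returned hash, and the injectable `clock` are all hypothetical, and the timeout is assumed to be expressed in seconds.

```ruby
# Hypothetical sketch only: none of these names come from the GitLab codebase.
# Deletes tags one by one by yielding each tag to the caller's block.
# With throttling enabled, stops as soon as `timeout` seconds (assumed unit)
# have elapsed; with throttling disabled, runs until every tag is processed.
def delete_tags(tags, throttling_enabled:, timeout:,
                clock: -> { Process.clock_gettime(Process::CLOCK_MONOTONIC) })
  started_at = clock.call
  deleted = []

  tags.each do |tag|
    timed_out = throttling_enabled && (clock.call - started_at) >= timeout
    # Report what was deleted so far instead of silently dropping the rest.
    return { status: :error, message: 'timeout', deleted: deleted } if timed_out

    yield tag # caller-supplied deletion step, e.g. a registry API call
    deleted << tag
  end

  { status: :success, deleted: deleted }
end

# Usage with a fake clock that advances one second per reading.
ticks = 0
fake_clock = -> { ticks += 1 }
result = delete_tags(%w[a b c], throttling_enabled: true, timeout: 2, clock: fake_clock) { |_tag| }
puts result.inspect
```

Returning the partial list of deleted tags on timeout is a design choice of this sketch; it makes the failure mode visible to the caller, which can then retry the remaining tags in a later run.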

What might happen if this goes wrong?

  • Tags that should be deleted might not be deleted

What can we monitor to detect problems with this?

  • Container registry: https://dashboards.gitlab.net/d/registry-main/registry-overview?orgId=1&var-PROMETHEUS_DS=Global&var-environment=gprd&var-stage=main&var-sigma=2
  • Sentry on both workers:
    • https://sentry.gitlab.net/gitlab/gitlabcom/?query=is%3Aunresolved+%22ContainerExpirationPolicies%3A%3ACleanupContainerRepositoryWorker%22
    • https://sentry.gitlab.net/gitlab/gitlabcom/?query=is%3Aunresolved%20%22ContainerExpirationPolicyWorker%22
  • Thanos dashboard on the current load (e.g. number of repositories to clean up) for these workers: https://thanos-query.ops.gitlab.net/graph?g0.range_input=30m&g0.max_source_resolution=0s&g0.expr=max(limited_capacity_worker_remaining_work_count%7Bworker%3D%22ContainerExpirationPolicies%3A%3ACleanupContainerRepositoryWorker%22%2C%20env%3D%22gprd%22%7D)&g0.tab=0&g1.range_input=1h&g1.max_source_resolution=0s&g1.expr=max(limited_capacity_worker_max_running_jobs%7Bworker%3D%22ContainerExpirationPolicies%3A%3ACleanupContainerRepositoryWorker%22%2C%20env%3D%22gprd%22%7D)&g1.tab=0&g2.range_input=1h&g2.max_source_resolution=0s&g2.expr=min(limited_capacity_worker_running_jobs%7Bworker%3D%22ContainerExpirationPolicies%3A%3ACleanupContainerRepositoryWorker%22%2C%20env%3D%22gprd%22%7D)&g2.tab=0

Beta groups/projects

n/a. This feature flag is global for the container registry tags cleanup system.

Roll Out Steps

  • Enable on staging
  • Test on staging
    • Impossible to fully test on staging due to https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/11509
  • Ensure that documentation has been updated
  • [-] Enable on GitLab.com for individual groups/projects listed above and verify behaviour
    • Not applicable: the feature flag is global
  • Coordinate a time to enable the flag with #production and #g_delivery on Slack.
  • Announce on the issue an estimated time this will be enabled on GitLab.com
  • Enable on GitLab.com by running chatops command in #production
  • Cross-post the chatops Slack command to #support_gitlab-com (more guidance on when this is necessary is in the dev docs) and to your team channel
  • Announce on the issue that the flag has been enabled
  • Remove feature flag and add changelog entry
    • Remove the preloaded option in #with_runnable_policy in ContainerExpirationPolicyWorker
    • !50858 (comment 496385677)
  • After the flag removal is deployed, clean up the feature flag by running the chatops command in the #production channel
Edited May 03, 2022 by David Fernandez