[Feature flag] Rollout of `ci_delete_objects`

What

Related issues: #220422 (closed), https://gitlab.com/gitlab-org/gitlab/-/issues/223034, https://gitlab.com/gitlab-org/gitlab/-/issues/233939

Related merge requests: !42095 (merged), !42237 (merged), !39464 (merged), !43100 (merged), !42242 (merged)

We have changed how Ci::DestroyExpiredJobArtifactsService works. Instead of removing expired artifacts one by one, it now works in batches of 100: it copies the information needed to identify the associated object storage files into a new table called ci_deleted_objects, deletes the records from ci_job_artifacts, and updates the project_statistics for the affected projects.
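
The batched flow described above can be sketched roughly as follows. This is an in-memory illustration only, not the actual service code: the `Artifact` struct, the `destroy_expired_in_batches` method, and the plain arrays and hash standing in for the ci_job_artifacts, ci_deleted_objects, and project_statistics tables are all hypothetical names chosen for the example.

```ruby
# Hypothetical in-memory model; the real service works against database tables.
BATCH_SIZE = 100

Artifact = Struct.new(:id, :project_id, :file_path, :bytes, :expired)

# artifacts          - Array standing in for ci_job_artifacts
# deleted_objects    - Array standing in for ci_deleted_objects
# project_statistics - Hash of project_id => stored bytes
# Returns the number of expired artifacts that were processed.
def destroy_expired_in_batches(artifacts, deleted_objects, project_statistics)
  expired = artifacts.select(&:expired)

  expired.each_slice(BATCH_SIZE) do |batch|
    # 1. Copy what is needed to find the files in object storage later.
    batch.each { |artifact| deleted_objects << { file_path: artifact.file_path } }

    # 2. Delete the artifact records themselves.
    ids = batch.map(&:id)
    artifacts.reject! { |artifact| ids.include?(artifact.id) }

    # 3. Subtract the freed bytes from each affected project's statistics.
    batch.group_by(&:project_id).each do |project_id, group|
      project_statistics[project_id] -= group.sum(&:bytes)
    end
  end

  expired.size
end
```

In the real service, steps 1 to 3 are expected to happen atomically for each batch; the sketch does not model transactions.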

Ci::DeleteObjectsWorker does the actual removal from object storage. This worker is configured to run concurrently, with a default max_running_jobs of 0, meaning that it will not execute any jobs unless one of the concurrency feature flags is on. Raising the concurrency setting takes effect only after the next execution of Ci::ScheduleDeleteObjectsCronWorker, which should happen every 16 minutes. Lowering it should reduce the number of running jobs immediately.

Feature flags:

  • ci_delete_objects - turning this FF on changes how we remove expired job artifacts. Once it is on, we should see bulk inserts into ci_deleted_objects and mass deletes from ci_job_artifacts.
  • ci_delete_objects_low_concurrency - turning this FF on sets max_running_jobs to 2
  • ci_delete_objects_medium_concurrency - turning this FF on sets max_running_jobs to 20, provided ci_delete_objects_low_concurrency is off
  • ci_delete_objects_high_concurrency - turning this FF on sets max_running_jobs to 50, provided ci_delete_objects_low_concurrency and ci_delete_objects_medium_concurrency are both off
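
The precedence between the three concurrency flags can be sketched as below. This is a minimal illustration of the rules listed above, not the worker's actual code: the `max_running_jobs` method here and its `enabled_flags` argument (a list of enabled flag names) are assumptions made for the example.

```ruby
# Hypothetical helper: enabled_flags is a collection of enabled flag names.
# The lowest enabled concurrency level wins; with none enabled the worker is idle.
def max_running_jobs(enabled_flags)
  if enabled_flags.include?(:ci_delete_objects_low_concurrency)
    2
  elsif enabled_flags.include?(:ci_delete_objects_medium_concurrency)
    20
  elsif enabled_flags.include?(:ci_delete_objects_high_concurrency)
    50
  else
    0 # default: no jobs are executed
  end
end
```

Under these rules, enabling ci_delete_objects_high_concurrency has no effect while either of the lower-concurrency flags is still on.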

Future work

Because of #281688 (closed), we didn't get to check ci_delete_objects_medium_concurrency and ci_delete_objects_high_concurrency. Their cleanup is tracked in #287632 (closed).

Owners

  • Team: ~"group::continuous integration"
  • Most appropriate Slack channel to reach out to: #g_ci
  • Best individual to reach out to: @mbobin

Expectations

What are we expecting to happen?

  • The number of expired job artifacts should go down
  • Storage quota for projects should go down

What might happen if this goes wrong?

  • Operations on ci_job_artifacts are atomic: nothing should be persisted into ci_deleted_objects without also being removed from ci_job_artifacts, so reverting the feature flag should be safe.

What can we monitor to detect problems with this?

Thanos queries as explained at #247103 (comment 435056335)

Roll Out Steps

  • Enable on staging
  • Test on staging
  • Ensure that documentation has been updated
  • Coordinate a time to enable the flag with #production and #g_delivery on Slack
  • Announce on the issue an estimated time this will be enabled on GitLab.com
  • Enable on GitLab.com by running chatops command in #production
  • Cross-post the chatops Slack command to #support_gitlab-com (the dev docs have more guidance on when this is necessary) and to your team channel
  • Announce on the issue that the flag has been enabled
  • Remove feature flag and add changelog entry
  • After the flag removal is deployed, clean up the feature flag by running chatops command in #production channel
Edited by Marius Bobin