Support concurrent create and delete while refreshing project statistics

What does this MR do and why?

This MR addresses a race condition on the counter that happens when job artifacts are concurrently created or deleted during a refresh. A detailed problem description can be found in #373469.

There are a few scenarios that need to be addressed:

  1. At the beginning of a refresh, the counter and database column are reset to zero. After this happens, there might still be decrements to the counter that bring it negative. Since the records have already been deleted from the database, the refresh will not correct the counter, resulting in a net negative value.
  2. At the beginning of a refresh, the counter and database column are reset to zero. After this happens, there might still be increments to the counter. These record IDs might then be rediscovered by the refresh, resulting in a double increment.
  3. During the refresh, job artifacts might be deleted as a result of regular activity. If this happens before the relevant records are accounted for by the refresh, it leads to a net negative counter value, similar to point 1 above.
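Scenario 1 can be sketched with a plain-Ruby simulation (no Redis; the variable names are illustrative assumptions, not the actual counter implementation):

```ruby
# A decrement for an already-deleted artifact arrives after the reset,
# leaving the counter permanently negative.
counter = 100     # buffered counter value before the refresh starts
counter = 0       # refresh begins: counter and database column reset to zero
counter -= 30     # late decrement arrives for an artifact deleted pre-reset

# The deleted record no longer exists, so the refresh's recalculation
# never adds the 30 back:
recalculated = 0
final = counter + recalculated
puts final        # => -30, a net negative value the refresh cannot correct
```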

*(Screenshot: Screenshot_2022-12-06_at_2.46.53_PM)*

Solution

This MR changes the counter to temporarily use a separate refresh key during a refresh. During this period, the counter ignores updates that should not affect it and deduplicates repeated increments for the same record.

*(Screenshot: Screenshot_2022-12-15_at_4.45.30_PM)*

The main changes are:

  1. Modifies BufferedCounter to toggle its increment behaviour during a refresh. Several Lua scripts are added for this implementation, because the operations need to happen atomically in Redis.
  2. Adds a finalizing step to the refresh process, in which the recalculated value in the refresh key is moved into the existing counter key.
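The deduplication behaviour the Lua scripts provide can be illustrated with an in-memory sketch (a hypothetical `RefreshCounter` class with a Ruby hash standing in for Redis; this is not the actual `BufferedCounter` API):

```ruby
# Sketch of the refresh-time rules: an increment applies once per record
# id, and a decrement applies only when a matching increment was seen.
class RefreshCounter
  attr_reader :refresh_amount

  def initialize
    @refresh_amount = 0
    @seen_ids = {} # record id => increment amount already applied
  end

  # Applied at most once per id; a rediscovered record is a no-op,
  # which prevents a double increment.
  def increment(id, amount)
    return if @seen_ids.key?(id)

    @seen_ids[id] = amount
    @refresh_amount += amount
  end

  # Applied only if a matching increment exists, which prevents a
  # net negative value.
  def decrement(id)
    amount = @seen_ids.delete(id)
    @refresh_amount -= amount if amount
  end
end

c = RefreshCounter.new
c.increment(1, 100) # live increment during the refresh
c.increment(1, 100) # same artifact rediscovered by the refresh: deduplicated
c.decrement(2)      # decrement with no matching increment: ignored
c.decrement(1)      # matched decrement: applied
c.increment(3, 50)
puts c.refresh_amount # => 50
```

In the real implementation these checks run as Lua scripts so that the read-and-update is atomic in Redis; the sketch above is sequential and single-threaded.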

The refresh process would look as such:

  1. A BuildArtifactSizeRefresh record is created in the created state. At this point it has no effect; it is just waiting to be processed.
  2. RefreshBuildArtifactsSizeStatisticsWorker picks up the refresh and moves it into running. From this point, new increments are pushed into a separate refresh_key in Redis and deduplicated at the same time. The deduplication ensures that no decrement happens unless there is an existing increment for the same ID, avoiding a net negative value. Likewise, a repeated increment from a job artifact that was created and then rediscovered by the refresh is deduplicated, avoiding a double increment.
  3. BuildArtifactSizeRefresh moves into the finalizing state once the batching over the existing records finishes. This delays the final step of moving the recalculated amount into the counter key. The delay ensures that job artifacts that were deleted before their increments reached Redis do not result in a net negative value.
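The finalizing step can be sketched as follows (a plain Ruby hash stands in for Redis, and the key names are illustrative, not the actual keys GitLab uses):

```ruby
# The recalculated amount under the refresh key is moved into the
# regular counter key, and the refresh key is removed. In the real
# implementation this move is atomic (a Lua script); here it is
# sequential for illustration.
store = {
  "counter:build_artifacts_size" => 0,              # reset at refresh start
  "counter:build_artifacts_size:refresh" => 12_345  # recalculated amount
}

def finalize(store, key)
  refresh_key = "#{key}:refresh"
  store[key] = store.fetch(key, 0) + store.delete(refresh_key).to_i
end

finalize(store, "counter:build_artifacts_size")
puts store["counter:build_artifacts_size"] # => 12345
```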

Screenshots or screen recordings

Screenshots are required for UI changes, and strongly recommended for all other merge requests.

How to set up and validate locally

  1. Enable the project_statistics_bulk_increment feature flag.
  2. Create a project and run a bunch of pipelines with artifacts to set up baseline statistics. Note: make sure Ci::ArchiveTraceWorker has completed.
  3. Validate that project.reload.job_artifacts.sum(&:size) equals project.statistics.build_artifacts_size.
  4. Validate a refresh without any concurrent activity.
    1. Enqueue a refresh using Projects::BuildArtifactsSizeRefresh.enqueue([project]).
    2. Manually enqueue Projects::ScheduleRefreshBuildArtifactsSizeStatisticsWorker in Admin > Monitor > Background Jobs > Cron, to avoid waiting for the cron scheduler. This kicks off the refresh process.
    3. Verify that:
      1. the refresh is in the finalizing state: Projects::BuildArtifactsSizeRefresh.last.state is 4.
      2. project.reload.statistics.build_artifacts_size is 0.
      3. Projects::FinalizeProjectStatisticsRefreshWorker is enqueued.
      4. Wait until FinalizeProjectStatisticsRefreshWorker executes and see that FlushIncrementsWorker is enqueued. At this point, validate that the counter key holds the correct amount: Gitlab::Counters::BufferedCounter.new(project.statistics, :build_artifacts_size).get is the same as project.reload.job_artifacts.sum(&:size).
      5. Wait until FlushIncrementsWorker has completed and validate that project.reload.job_artifacts.sum(&:size) equals project.statistics.build_artifacts_size.
  5. Validate a refresh while artifacts are being deleted and new pipelines are running.
    1. Enqueue a refresh using Projects::BuildArtifactsSizeRefresh.enqueue([project]).
    2. Before enqueuing Projects::ScheduleRefreshBuildArtifactsSizeStatisticsWorker, perform various actions such as calling DestroyBatchService, creating a new pipeline, and deleting job artifacts.
    3. Manually enqueue Projects::ScheduleRefreshBuildArtifactsSizeStatisticsWorker in Admin > Monitor > Background Jobs > Cron.
    4. Continue creating new pipelines or deleting job artifacts from the console while the refresh is ongoing.
    5. Wait until Projects::BuildArtifactsSizeRefresh.last.state is 4.
    6. Wait until Projects::FinalizeProjectStatisticsRefreshWorker and FlushIncrementsWorker have completed.
    7. Validate that project.reload.job_artifacts.sum(&:size) equals project.statistics.build_artifacts_size.

MR acceptance checklist

This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.

Related to #373469

Edited by Albert