# Background migration to fix non-zero `packages_size` project statistics
## 🚀 Context
In the project statistics object, we have a metric called `packages_size` that is the aggregation (sum) of all the package file sizes.
In #363010 (closed), we discovered that some `packages_size` values seemed to be "out of sync". For example, a `packages_size` of 10MB while the sum of all package file sizes would be 5KB. The presumed root cause of this desync is concurrent updates: the `packages_size` metric could be updated in a burst, and during that burst race conditions could appear and cause some updates to get lost.
To solve this, we switched the `packages_size` metric updates to a "buffered" counter. In very short words, instead of updating the database column directly, we use a Redis entry that we `INCRBY`. Then, once in a while (every 10 minutes at most), a job is responsible for moving (or "flushing") the updates from Redis to the database. This way, Redis acts as a "protection layer" for the database.
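The buffered counter idea can be sketched as follows. This is a minimal illustration with hypothetical names, not the actual GitLab classes: a plain Hash stands in for Redis so the sketch is self-contained, whereas in production the buffer is a Redis key updated with `INCRBY` and the flush runs in a periodic job.

```ruby
# Minimal sketch of a buffered counter (hypothetical names). The store
# plays the role of Redis: increments accumulate there, and a periodic
# flush applies the total delta to the database column in one update.
class BufferedCounter
  def initialize(store, key)
    @store = store
    @key = key
  end

  # Hot path: every package file upload or delete only touches the buffer.
  def increment(delta)
    @store[@key] = @store.fetch(@key, 0) + delta
  end

  # Cold path (at most every 10 minutes): move the accumulated delta to
  # the database column in a single relative update, e.g.
  # UPDATE project_statistics SET packages_size = packages_size + delta
  def flush(statistics)
    delta = @store.delete(@key) || 0
    statistics[:packages_size] += delta
    statistics
  end
end

store = {}
counter = BufferedCounter.new(store, 'packages_size:42')
counter.increment(1024)
counter.increment(2048)
stats = counter.flush(packages_size: 0)
stats[:packages_size] # => 3072, applied in one update instead of two
```

Because concurrent bursts all land on the same buffer entry before a single flush touches the column, the lost-update race on the database row disappears.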
That's all nice, but what about the existing out-of-sync data? Well, we don't have a lot of choices here: we need background migrations to detect incoherent `packages_size` values and fix them. That's Background migration to fix incorrect statistic... (#378433 - closed).
Given the amount/scale in play here, we are going to split the background migrations in two:

- Project statistics with a `packages_size` different than `0`. 👈 This is this MR.
- Project statistics with a `packages_size` set to `0` while some package files exist. That's !108226 (closed).
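For the case handled in this MR, "incoherent" boils down to a simple comparison: a statistics row is out of sync when its non-zero `packages_size` differs from the sum of its package files' sizes (files with a `NULL` size are skipped). A hypothetical sketch of that check, the real migration does this with SQL aggregates rather than in Ruby:

```ruby
# Hypothetical incoherence check: stored counter vs. sum of file sizes.
# nil sizes are ignored, mirroring the `size IS NOT NULL` condition.
def incoherent?(packages_size, package_file_sizes)
  packages_size != package_file_sizes.compact.sum
end

# Numbers from the example above: stored 10MB, files actually sum to 5KB.
stored = 10 * 1024 * 1024
actual_sizes = [5 * 1024, nil]

incoherent?(stored, actual_sizes) # => true: this row needs fixing
```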
## 🔬 What does this MR do and why?
- Introduce a background migration to deal with incoherent and non-zero `packages_size`.
- Introduce 2 temporary indexes to support that migration.
- Add the related specs.
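Reconstructed from the migration output in the database review section below, the two temporary indexes amount to these `add_index` calls:

```ruby
# Partial index to iterate only over statistics rows with a non-zero
# packages_size.
add_index :project_statistics, [:id, :project_id, :packages_size],
  where: 'packages_size != 0',
  name: 'tmp_idx_project_statistics_on_non_zero_packages_size',
  algorithm: :concurrently

# Partial index to sum package file sizes per package, skipping NULL sizes.
add_index :packages_package_files, [:package_id, :size],
  where: 'size IS NOT NULL',
  name: 'tmp_idx_package_files_on_non_zero_size',
  algorithm: :concurrently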
## 📺 Screenshots or screen recordings
None
## ⚗ How to set up and validate locally
Let's create some packages for the first project. In a Rails console:

```ruby
def fixture_file_upload(*args, **kwargs)
  Rack::Test::UploadedFile.new(*args, **kwargs)
end

10.times { FactoryBot.create(:npm_package, project: Project.first) }
```
Wait at least 10 minutes, so that updates to `packages_size` are flushed.
Now, check the original `packages_size` and update it to make it incoherent:

```ruby
Project.first.statistics.packages_size # 4096000, note this value
Project.first.statistics.update!(packages_size: 50)
```
Let's run the migration:

```shell
rails db:migrate
```
Wait for at least 10 minutes and check the `packages_size` again. It should be fixed:

```ruby
Project.first.statistics.packages_size # 4096000, as it was originally
```
The background migration is working as expected.
## 🚥 MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
- [ ] I have evaluated the MR acceptance checklist for this MR.
## 💾 Database review

### ⤴ Migrations up
```shell
main: == 20221226145951 AddTemporaryPackagesIndexToProjectStatistics: migrating =====
main: -- transaction_open?()
main:    -> 0.0000s
main: -- view_exists?(:postgres_partitions)
main:    -> 0.0012s
main: -- index_exists?(:project_statistics, [:id, :project_id, :packages_size], {:where=>"packages_size != 0", :name=>"tmp_idx_project_statistics_on_non_zero_packages_size", :algorithm=>:concurrently})
main:    -> 0.0066s
main: -- execute("SET statement_timeout TO 0")
main:    -> 0.0006s
main: -- add_index(:project_statistics, [:id, :project_id, :packages_size], {:where=>"packages_size != 0", :name=>"tmp_idx_project_statistics_on_non_zero_packages_size", :algorithm=>:concurrently})
main:    -> 0.0035s
main: -- execute("RESET statement_timeout")
main:    -> 0.0006s
main: == 20221226145951 AddTemporaryPackagesIndexToProjectStatistics: migrated (0.0299s)
main: == 20221226150648 AddTemporarySizeIndexToPackageFiles: migrating ==============
main: -- transaction_open?()
main:    -> 0.0000s
main: -- view_exists?(:postgres_partitions)
main:    -> 0.0007s
main: -- index_exists?(:packages_package_files, [:package_id, :size], {:where=>"size IS NOT NULL", :name=>"tmp_idx_package_files_on_non_zero_size", :algorithm=>:concurrently})
main:    -> 0.0067s
main: -- add_index(:packages_package_files, [:package_id, :size], {:where=>"size IS NOT NULL", :name=>"tmp_idx_package_files_on_non_zero_size", :algorithm=>:concurrently})
main:    -> 0.0024s
main: == 20221226150648 AddTemporarySizeIndexToPackageFiles: migrated (0.0164s) =====
main: == 20221226153252 QueueFixIncorrectNonZeroPackagesSizeOnProjectStatistics: migrating
main: == 20221226153252 QueueFixIncorrectNonZeroPackagesSizeOnProjectStatistics: migrated (0.0515s)
```
### ⤵ Migrations down
```shell
main: == 20221226153252 QueueFixIncorrectNonZeroPackagesSizeOnProjectStatistics: reverting
main: == 20221226153252 QueueFixIncorrectNonZeroPackagesSizeOnProjectStatistics: reverted (0.0502s)
main: == 20221226150648 AddTemporarySizeIndexToPackageFiles: reverting ==============
main: -- transaction_open?()
main:    -> 0.0000s
main: -- view_exists?(:postgres_partitions)
main:    -> 0.1241s
main: -- indexes(:packages_package_files)
main:    -> 0.0131s
main: -- execute("SET statement_timeout TO 0")
main:    -> 0.0009s
main: -- remove_index(:packages_package_files, {:algorithm=>:concurrently, :name=>"tmp_idx_package_files_on_non_zero_size"})
main:    -> 0.0031s
main: -- execute("RESET statement_timeout")
main:    -> 0.0005s
main: == 20221226150648 AddTemporarySizeIndexToPackageFiles: reverted (0.1570s) =====
main: == 20221226145951 AddTemporaryPackagesIndexToProjectStatistics: reverting =====
main: -- transaction_open?()
main:    -> 0.0000s
main: -- view_exists?(:postgres_partitions)
main:    -> 0.1243s
main: -- indexes(:project_statistics)
main:    -> 0.0082s
main: -- execute("SET statement_timeout TO 0")
main:    -> 0.0005s
main: -- remove_index(:project_statistics, {:algorithm=>:concurrently, :name=>"tmp_idx_project_statistics_on_non_zero_packages_size"})
main:    -> 0.0037s
main: -- execute("RESET statement_timeout")
main:    -> 0.0006s
main: == 20221226145951 AddTemporaryPackagesIndexToProjectStatistics: reverted (0.1530s)
```
### ⚙ Queries

There is no `UPDATE` query as the update will go through the buffered counter.
### 🏎 Background migration execution time analysis for gitlab.com
We have around 107 000 project statistics with a non-zero `packages_size`.

Given that the batch size is 200, we need 535 iterations.
Given the very small execution time to find the bounds of each sub-batch, we will ignore those queries and use only the `SELECT` query. In addition, the Redis interactions should be pretty quick here, so we will ignore these too.
This gives us:

`535 * 275ms + 534 * 2000ms (delay between 2 sub-batches) = 1215125ms`, which is about 20.25 minutes, a little more than 20 minutes.