# Background migration to fix non-zero `packages_size` project statistics
## 🚀 Context
In the project statistics object, we have a metric called `packages_size` that is the aggregation (sum) of all the package file sizes.
In #363010 (closed), we discovered that some `packages_size` values seemed to be "out of sync". For example, a `packages_size` of 10MB while the sum of all package file sizes would be 5KB. The presumed root cause of this desync is concurrent updates: the `packages_size` metric could be updated in a burst, and during that burst race conditions could appear and cause some updates to get lost.
To solve this, we switched the `packages_size` metric updates to a "buffered" counter. In very short words, instead of updating the database column directly, we use a Redis entry that we `INCRBY`. Then, once in a while (every 10 minutes at most), a job is responsible for moving (or "flushing") the updates from Redis to the database. This way, Redis acts as a "protection layer" for the database.
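The buffered counter idea can be sketched as follows. This is a minimal illustration with hypothetical names, not the actual GitLab classes: a plain Hash stands in for Redis so the sketch is self-contained, whereas in production the buffer is a Redis key updated with `INCRBY` and the flush runs in a periodic job.

```ruby
# Minimal sketch of a buffered counter (hypothetical names). The store
# plays the role of Redis: increments accumulate there, and a periodic
# flush applies the total delta to the database column in one update.
class BufferedCounter
  def initialize(store, key)
    @store = store
    @key = key
  end

  # Hot path: every package file upload or delete only touches the buffer.
  def increment(delta)
    @store[@key] = @store.fetch(@key, 0) + delta
  end

  # Cold path (at most every 10 minutes): move the accumulated delta to
  # the database column in a single relative update, e.g.
  # UPDATE project_statistics SET packages_size = packages_size + delta
  def flush(statistics)
    delta = @store.delete(@key) || 0
    statistics[:packages_size] += delta
    statistics
  end
end

store = {}
counter = BufferedCounter.new(store, 'packages_size:42')
counter.increment(1024)
counter.increment(2048)
stats = counter.flush(packages_size: 0)
stats[:packages_size] # => 3072, applied in one update instead of two
```

Because concurrent bursts all land on the same buffer entry before a single flush touches the column, the lost-update race on the database row disappears.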
That's all nice, but what about the existing out-of-sync data? Well, we don't have a lot of choices here: we need background migrations to detect incoherent `packages_size` values and fix them. That's Background migration to fix incorrect statistic... (#378433 - closed).
Given the amount/scale in play here, we are going to split the background migrations in two:

- Project statistics with a `packages_size` different than `0`. 👈 This is this MR.
- Project statistics with a `packages_size` set to `0` while some package files exist. That's !108226 (closed).
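For the case handled in this MR, "incoherent" boils down to a simple comparison: a statistics row is out of sync when its non-zero `packages_size` differs from the sum of its package files' sizes (files with a `NULL` size are skipped). A hypothetical sketch of that check, the real migration does this with SQL aggregates rather than in Ruby:

```ruby
# Hypothetical incoherence check: stored counter vs. sum of file sizes.
# nil sizes are ignored, mirroring the `size IS NOT NULL` condition.
def incoherent?(packages_size, package_file_sizes)
  packages_size != package_file_sizes.compact.sum
end

# Numbers from the example above: stored 10MB, files actually sum to 5KB.
stored = 10 * 1024 * 1024
actual_sizes = [5 * 1024, nil]

incoherent?(stored, actual_sizes) # => true: this row needs fixing
```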
## 🔬 What does this MR do and why?
- Introduce a background migration to deal with incoherent and non-zero `packages_size`.
- Introduce 2 temporary indexes to support that migration.
- Add the related specs.
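Reconstructed from the migration output in the database review section below, the two temporary indexes amount to these `add_index` calls:

```ruby
# Partial index to iterate only over statistics rows with a non-zero
# packages_size.
add_index :project_statistics, [:id, :project_id, :packages_size],
  where: 'packages_size != 0',
  name: 'tmp_idx_project_statistics_on_non_zero_packages_size',
  algorithm: :concurrently

# Partial index to sum package file sizes per package, skipping NULL sizes.
add_index :packages_package_files, [:package_id, :size],
  where: 'size IS NOT NULL',
  name: 'tmp_idx_package_files_on_non_zero_size',
  algorithm: :concurrently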
## 📺 Screenshots or screen recordings
None
## ⚗ How to set up and validate locally
Let's create some packages for the first project. In a Rails console:

```ruby
def fixture_file_upload(*args, **kwargs)
  Rack::Test::UploadedFile.new(*args, **kwargs)
end

10.times { FactoryBot.create(:npm_package, project: Project.first) }
```
Wait at least 10 minutes, so that updates to `packages_size` are flushed.
Now, check the original `packages_size` and update it to make it incoherent:

```ruby
Project.first.statistics.packages_size # 4096000, note this value
Project.first.statistics.update!(packages_size: 50)
```
Let's run the migration:

```shell
rails db:migrate
```
Wait for at least 10 minutes and check the `packages_size` again. It should be fixed:

```ruby
Project.first.statistics.packages_size # 4096000, as it was originally
```
The background migration is working as expected.
## 🚥 MR acceptance checklist
This checklist encourages us to confirm any changes have been analyzed to reduce risks in quality, performance, reliability, security, and maintainability.
- [ ] I have evaluated the MR acceptance checklist for this MR.
## 💾 Database review

### ⤴ Migrations up
```shell
main: == 20221226145951 AddTemporaryPackagesIndexToProjectStatistics: migrating =====
main: -- transaction_open?()
main:    -> 0.0000s
main: -- view_exists?(:postgres_partitions)
main:    -> 0.0012s
main: -- index_exists?(:project_statistics, [:id, :project_id, :packages_size], {:where=>"packages_size != 0", :name=>"tmp_idx_project_statistics_on_non_zero_packages_size", :algorithm=>:concurrently})
main:    -> 0.0066s
main: -- execute("SET statement_timeout TO 0")
main:    -> 0.0006s
main: -- add_index(:project_statistics, [:id, :project_id, :packages_size], {:where=>"packages_size != 0", :name=>"tmp_idx_project_statistics_on_non_zero_packages_size", :algorithm=>:concurrently})
main:    -> 0.0035s
main: -- execute("RESET statement_timeout")
main:    -> 0.0006s
main: == 20221226145951 AddTemporaryPackagesIndexToProjectStatistics: migrated (0.0299s)
main: == 20221226150648 AddTemporarySizeIndexToPackageFiles: migrating ==============
main: -- transaction_open?()
main:    -> 0.0000s
main: -- view_exists?(:postgres_partitions)
main:    -> 0.0007s
main: -- index_exists?(:packages_package_files, [:package_id, :size], {:where=>"size IS NOT NULL", :name=>"tmp_idx_package_files_on_non_zero_size", :algorithm=>:concurrently})
main:    -> 0.0067s
main: -- add_index(:packages_package_files, [:package_id, :size], {:where=>"size IS NOT NULL", :name=>"tmp_idx_package_files_on_non_zero_size", :algorithm=>:concurrently})
main:    -> 0.0024s
main: == 20221226150648 AddTemporarySizeIndexToPackageFiles: migrated (0.0164s) =====
main: == 20221226153252 QueueFixIncorrectNonZeroPackagesSizeOnProjectStatistics: migrating
main: == 20221226153252 QueueFixIncorrectNonZeroPackagesSizeOnProjectStatistics: migrated (0.0515s)
```
### ⤵ Migrations down
```shell
main: == 20221226153252 QueueFixIncorrectNonZeroPackagesSizeOnProjectStatistics: reverting
main: == 20221226153252 QueueFixIncorrectNonZeroPackagesSizeOnProjectStatistics: reverted (0.0502s)
main: == 20221226150648 AddTemporarySizeIndexToPackageFiles: reverting ==============
main: -- transaction_open?()
main:    -> 0.0000s
main: -- view_exists?(:postgres_partitions)
main:    -> 0.1241s
main: -- indexes(:packages_package_files)
main:    -> 0.0131s
main: -- execute("SET statement_timeout TO 0")
main:    -> 0.0009s
main: -- remove_index(:packages_package_files, {:algorithm=>:concurrently, :name=>"tmp_idx_package_files_on_non_zero_size"})
main:    -> 0.0031s
main: -- execute("RESET statement_timeout")
main:    -> 0.0005s
main: == 20221226150648 AddTemporarySizeIndexToPackageFiles: reverted (0.1570s) =====
main: == 20221226145951 AddTemporaryPackagesIndexToProjectStatistics: reverting =====
main: -- transaction_open?()
main:    -> 0.0000s
main: -- view_exists?(:postgres_partitions)
main:    -> 0.1243s
main: -- indexes(:project_statistics)
main:    -> 0.0082s
main: -- execute("SET statement_timeout TO 0")
main:    -> 0.0005s
main: -- remove_index(:project_statistics, {:algorithm=>:concurrently, :name=>"tmp_idx_project_statistics_on_non_zero_packages_size"})
main:    -> 0.0037s
main: -- execute("RESET statement_timeout")
main:    -> 0.0006s
main: == 20221226145951 AddTemporaryPackagesIndexToProjectStatistics: reverted (0.1530s)
```
### ⚙ Queries

There is no `UPDATE` query as the update will go through the buffered counter.
### 🏎 Background migration execution time analysis for gitlab.com
We have around 107 000 project statistics with a non-zero `packages_size`.

Given that the batch size is 200, we need 535 iterations.
Given the very small execution time to find the bounds of each sub-batch, we will ignore those queries and use only the `SELECT` query. In addition, the Redis interactions should be pretty quick here, so we will ignore these too.
This gives us:

`535 * 275ms + 534 * 2000ms (delay between 2 sub-batches) = 1215125ms`, which is about 20.25 minutes, a little more than 20 minutes.