We have received a report where the "packages" storage type shows 6.9 GiB of storage space being used, but when we go into Packages and Containers for this project, we only see 1.26 GiB being used.
This is affecting customers as they are reaching the storage limits.
Please note this issue is scheduled for 15.2 as an investigation so we can understand the scope of the problem prior to scheduling a potential fix.
@arihantar No. The issue seems to be related to incrementing/decrementing the packages size over time. The issue https://gitlab.com/gitlab-org/gitlab/-/issues/238558+ is more related to the overall storage size. It would be relevant to this problem if the storage size were out of sync with the sum of the other sizes.
This issue is more about the packages size itself being off.
From a quick look, I think this may be because the container registry size is not considered in the individual calculation. Still, as we go through the migration, we are now able to report total storage for the container registry.
Our best bet here is for @10io and @jdrpereira to take a quick look.
Please note we are swamped with the container registry migration work, so it may take some time to do a deep dive into this.
@michelletorres This issue is for the package registry. Let me apply the proper label.
Basically, the usage quota for the package registry is updated incrementally. That means that:
when a package file is added, we take the current total and update it by adding the package file size.
when a package file is removed, we take the current total and update it by subtracting the package file size.
At no point do we sum all the package file sizes in a project to update the project's package registry usage quota. In the code, statistics like this are called incremental statistics.
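To make the idea concrete, here is a minimal sketch of an incremental statistic (illustrative only, simplified; not the actual GitLab implementation):

```ruby
# Minimal sketch of an incremental statistic (illustrative only).
class ProjectStatistics < ApplicationRecord
  # Apply a signed delta instead of recomputing the total from scratch.
  def self.increment_packages_size(project_id, delta)
    where(project_id: project_id)
      .update_all(["packages_size = COALESCE(packages_size, 0) + ?", delta])
  end
end

# On package file creation:
#   ProjectStatistics.increment_packages_size(project.id, package_file.size)
# On package file removal:
#   ProjectStatistics.increment_packages_size(project.id, -package_file.size)
```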
I don't know how this could happen. We did update how package files are removed (with a marking system), but ultimately a background job processes the package file destruction and destroys the package file by calling package_file.destroy!, which was the existing way to destroy files.
We need more investigation on this one to understand what happened.
@10io, this made me wonder, even if we were to find a bug and fix it, it seems that an occasional package size "refresh" or "check" might be useful to ensure that any divergence gets caught and/or corrected. Trusting that the incrementally changing value is always correct may not be the most reliable way to look at these statistics. That might be out of the scope of this issue though since we still want to understand what might cause a divergence.
I spent a little time looking at how the project statistics updates work and reading through the issue history. If there is an issue with the statistics not updating correctly, it will take a while to investigate. We could schedule that as the initial deliverable since the unknown is so large here.
@arihantar I noticed mention of a script used to remove packages. Can we find out what the script did, or how it interacted with GitLab to remove the files? The project statistics rely on callbacks, so when a given package or file is deleted, an action is created to update the statistics. I believe there are ways to delete a package without triggering these callbacks, which would lead to inconsistent statistics.
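For context on how callbacks can be bypassed, here is a generic Rails illustration (not the customer's script, which we haven't seen):

```ruby
# Generic Rails behaviour (illustrative): only one of these would be used; shown side by side.
package_file = Packages::PackageFile.last   # any file, purely for illustration

# Option 1: runs the after_destroy callbacks, so the statistics delta is applied.
package_file.destroy!

# Option 2: removes the row but skips callbacks, so statistics are never adjusted.
package_file.delete

# Bulk and raw-SQL removals behave like Option 2: rows disappear, statistics stay unchanged.
Packages::PackageFile.where(id: package_file.id).delete_all
```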
@sabrams I've been wondering the same: could we have a background job that could be triggered by users to "refresh" the package size statistic? We would need to be careful about how the background job is designed: trying to get the total size by summing the sizes of all package files in a single query might not work here, as a project can have a large set of package files. A batched loop should allow us to make the queries more scalable.
Going further with that idea, perhaps we can leverage UpdateProjectStatisticsWorker (or even better, ProjectCacheWorker, which has exclusive lease key support) and simply mark packages_size in ProjectStatistics as "refreshable" (constant COLUMNS_TO_REFRESH), then implement an #update_packages_size function. Finally, it's just a matter of having an API to trigger UpdateProjectStatisticsWorker for the packages_size stat.
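A rough sketch of what such a refresh could look like, assuming a batched sum is acceptable (the worker name, batch size, and structure are made up; the real implementation would likely be wired into UpdateProjectStatisticsWorker as described above):

```ruby
# Hypothetical refresh job (names and structure are illustrative, not a final design).
class RefreshPackagesSizeWorker
  include ApplicationWorker

  def perform(project_id)
    project = Project.find_by_id(project_id)
    return unless project

    total = 0
    # Sum in batches so projects with very many package files don't issue one huge query.
    Packages::PackageFile
      .for_package_ids(project.packages.select(:id))
      .in_batches(of: 1_000) { |batch| total += batch.sum(:size) }

    # Note: a real implementation would also need to keep storage_size consistent.
    project.statistics.update!(packages_size: total)
  end
end
```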
@10io @jdrpereira did you get a chance to look at this? The customer is still affected by this issue. Do we have a workaround they can use until this is fixed?
Would it be possible to manually fix the situation for this specific customer? The fix would need a Rails console on production with write access. Could we use a production change for this?
Before running this script, make sure that the number of package files is low.
stats = ProjectStatistics.where(project_id: <project_id>)
package_files = Packages::PackageFile.for_package_ids(Packages::Package.where(project_id: <project_id>).select(:id))
package_files.count # /!\ check here that this number is reasonable /!\
package_size = package_files.sum(:size)
stats.update!(packages_size: package_size)
Also, isn't this something we should be running regularly if it's a known issue?
Before automating this, I would rather have an API (which can then power a button on the frontend), as suggested here.
Detecting unsynced package statistics might be quite challenging.
We could also change the direction in which we manage those package statistics: instead of maintaining incremental statistics, enqueue a statistics refresh whenever a package file is created or removed.
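In code, that direction could look something like this (a sketch only, reusing the hypothetical refresh worker from the earlier comment):

```ruby
# Sketch: instead of applying a delta, Packages::PackageFile could enqueue a full
# recalculation whenever a file is created or removed (worker name is hypothetical).
class Packages::PackageFile < ApplicationRecord
  belongs_to :package

  after_commit on: [:create, :destroy] do
    RefreshPackagesSizeWorker.perform_async(package.project_id)
  end
end
```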
Hopefully, the investigation here will bring some light on what is happening and from there, we will be able to take the best course of action.
For %15.2, due to the level of unknowns, we are scheduling this as an investigation to start and will re-evaluate once we have a proposed solution. So for now, I will weight this as backend-weight2.
Each project has a "statistics" object (model ProjectStatistics). This object holds a bunch of metrics about the related project. One of these is packages_size, which represents how much space (object storage or file system) package assets (package files) take.
Metrics are updated in one of two ways:
1. The whole computation is redone entirely (by a background job).
2. An incremental update is applied (a delta is added to or subtracted from the current value).
For packages_size, (2.) is used. In simple terms, when a package file is created or destroyed, the size of that package file is added to or subtracted from the packages_size project statistic.
The above is implemented through a concern (UpdateProjectStatistics). As we can see here and here, the incremental updates are done using after_save and after_destroy (Active Record) callbacks.
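For reference, the wiring looks roughly like this (a paraphrase from memory; the linked code is authoritative and may differ in detail):

```ruby
# Roughly how the model opts into incremental project statistics (paraphrase, not exact source).
class Packages::PackageFile < ApplicationRecord
  include UpdateProjectStatistics

  # Which ProjectStatistics column to adjust, and which attribute provides the delta.
  update_project_statistics project_statistics_name: :packages_size, statistic_attribute: :size
end
```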
♻ Package file destruction
Package asset destruction doesn't happen inline anymore (see https://gitlab.com/gitlab-org/gitlab/-/issues/348166). Instead, packages and/or package files are "marked" as pending_destruction. Objects marked in this way are invisible within the Package Registry: users can't see them in the frontend and they can't pull them through package manager CLI tools (such as npm).
Marked objects are processed by background jobs that we call cleaners.
For marked packages, cleaners will simply mark all the related package files as pending_destruction.
For marked package files, cleaners will destroy them one by one until all the marked package files are removed. At this point, the package itself is removed too if there are no package files left.
Note that the background job uses the standard destroy! method to destroy the package file.
🐛 The Bug
Given the feedback in this issue's description, something is not working when the background job calls destroy! on the package file: the Active Record instance is properly removed from the database, but the packages_size statistic is not decreased.
🔦 Analysis
I will take notes here as I go through the analysis.
First thing, let's detail how the incremental update happens:
There is an after_destroy callback that is executed.
That callback is only executed if the project is not pending delete.
after_destroy callbacks are executed right before committing the database transaction.
This callback will compute the delta to apply. In this case, it's simply -<package file size>.
A statistic update is scheduled using a run_after_commit callback.
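Putting the steps above together, the destroy path behaves roughly like this (a simplified paraphrase, not the exact concern code; the increment_statistic call shape is illustrative):

```ruby
# Simplified paraphrase of the after_destroy path described in the steps above.
class Packages::PackageFile < ApplicationRecord
  belongs_to :package
  delegate :project, to: :package

  after_destroy do
    next if project.pending_delete?      # (2) skip projects that are being deleted

    delta = -read_attribute(:size)       # (4) the delta is minus the package file size

    # (5) the statistics update only runs once the surrounding transaction commits.
    run_after_commit do
      ProjectStatistics.increment_statistic(project, :packages_size, delta)
    end
  end
end
```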
As we can see here and here, we can have up to 2 database transactions involved when destroying a package file.
This makes me wonder whether the database transactions could have a negative side effect on the project statistics updates.
Let's create a package with 2 files and mark one of the files as pending_destruction. Then we run the cleanup background job and check the logs.
(A)
TRANSACTION (0.2ms) BEGIN /*application:sidekiq,correlation_id:39f25cd5b728bbd8be6d3c866f96c3a4,jid:548dd7426da6e8c7e960f071,endpoint_id:Packages::CleanupPackageFileWorker,db_config_name:main,line:/app/models/concerns/packages/destructible.rb:11:in `next_pending_destruction'*/
Packages::PackageFile Load (0.4ms) SELECT "packages_package_files".* FROM "packages_package_files" WHERE "packages_package_files"."status" = 1 ORDER BY "packages_package_files"."id" ASC LIMIT 1 FOR UPDATE SKIP LOCKED /*application:sidekiq,correlation_id:39f25cd5b728bbd8be6d3c866f96c3a4,jid:548dd7426da6e8c7e960f071,endpoint_id:Packages::CleanupPackageFileWorker,db_config_name:main,line:/app/models/concerns/packages/destructible.rb:11:in `next_pending_destruction'*/
TRANSACTION (1.3ms) COMMIT /*application:sidekiq,correlation_id:39f25cd5b728bbd8be6d3c866f96c3a4,jid:548dd7426da6e8c7e960f071,endpoint_id:Packages::CleanupPackageFileWorker,db_config_name:main,line:/lib/gitlab/database.rb:332:in `commit'*/

(B)
TRANSACTION (0.2ms) BEGIN /*application:sidekiq,correlation_id:39f25cd5b728bbd8be6d3c866f96c3a4,jid:548dd7426da6e8c7e960f071,endpoint_id:Packages::CleanupPackageFileWorker,db_config_name:main,line:/app/workers/concerns/packages/cleanup_artifact_worker.rb:15:in `block in perform_work'*/
Packages::PackageFile Destroy (0.5ms) DELETE FROM "packages_package_files" WHERE "packages_package_files"."id" = 784 /*application:sidekiq,correlation_id:39f25cd5b728bbd8be6d3c866f96c3a4,jid:548dd7426da6e8c7e960f071,endpoint_id:Packages::CleanupPackageFileWorker,db_config_name:main,line:/app/workers/concerns/packages/cleanup_artifact_worker.rb:15:in `block in perform_work'*/
TRANSACTION (0.4ms) COMMIT /*application:sidekiq,correlation_id:39f25cd5b728bbd8be6d3c866f96c3a4,jid:548dd7426da6e8c7e960f071,endpoint_id:Packages::CleanupPackageFileWorker,db_config_name:main,line:/lib/gitlab/database.rb:332:in `commit'*/
ProjectStatistics Load (1.0ms) SELECT "project_statistics".* FROM "project_statistics" WHERE "project_statistics"."project_id" = 131 LIMIT 1 /*application:sidekiq,correlation_id:39f25cd5b728bbd8be6d3c866f96c3a4,jid:548dd7426da6e8c7e960f071,endpoint_id:Packages::CleanupPackageFileWorker,db_config_name:main,line:/app/models/project_statistics.rb:121:in `increment_statistic'*/
ProjectStatistics Update All (0.7ms) UPDATE "project_statistics" SET packages_size = COALESCE(packages_size, 0) + (-8), storage_size = COALESCE(storage_size, 0) + (-8) WHERE "project_statistics"."project_id" = 131 /*application:sidekiq,correlation_id:39f25cd5b728bbd8be6d3c866f96c3a4,jid:548dd7426da6e8c7e960f071,endpoint_id:Packages::CleanupPackageFileWorker,db_config_name:main,line:/app/models/project_statistics.rb:149:in `columns_to_increment'*/

(C)
TRANSACTION (0.7ms) BEGIN /*application:sidekiq,correlation_id:39f25cd5b728bbd8be6d3c866f96c3a4,jid:548dd7426da6e8c7e960f071,endpoint_id:Packages::CleanupPackageFileWorker,db_config_name:main,line:/app/workers/packages/cleanup_package_file_worker.rb:26:in `block in after_destroy'*/
Packages::PackageFile Exists? (0.4ms) SELECT 1 AS one FROM "packages_package_files" WHERE "packages_package_files"."package_id" = 84 LIMIT 1 /*application:sidekiq,correlation_id:39f25cd5b728bbd8be6d3c866f96c3a4,jid:548dd7426da6e8c7e960f071,endpoint_id:Packages::CleanupPackageFileWorker,db_config_name:main,line:/app/workers/packages/cleanup_package_file_worker.rb:26:in `block in after_destroy'*/
TRANSACTION (0.2ms) COMMIT /*application:sidekiq,correlation_id:39f25cd5b728bbd8be6d3c866f96c3a4,jid:548dd7426da6e8c7e960f071,endpoint_id:Packages::CleanupPackageFileWorker,db_config_name:main,line:/lib/gitlab/database.rb:332:in `commit'*/
↳ lib/gitlab/database.rb:332:in `commit'
We can see that:
We have 3 transactions
1 for selecting the next package file to destroy. (A)
1 for actually destroying the package file. (B)
1 for optionally destroying the package if the package is empty. (C)
We have a correct statistics update between (B) and (C) (outside of a transaction).
The only way we can destroy the package file but not update the statistics is to somehow execute (B) but not the statistics update.
I'm now suspicious of this part. We effectively silence all exceptions. The problem is that this silencing happens within the transaction block. According to the documentation:
Also have in mind that exceptions thrown within a transaction block will be propagated (after triggering the ROLLBACK), so you should be ready to catch those in your application code.
So, we could be nullifying the ROLLBACK trigger by swallowing all the exceptions.
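To make the suspicion concrete, here is a generic illustration (not the actual worker code) of how rescuing inside a transaction block can hide a failure:

```ruby
# Generic illustration: the rescue sits *inside* the transaction block, so Active
# Record never sees the exception and no ROLLBACK is triggered.
package_file = Packages::PackageFile.last   # any file, purely for illustration

ApplicationRecord.transaction do
  package_file.destroy!          # the DELETE is issued inside this transaction
  raise "hypothetical failure"   # stands in for any error raised later in the block
rescue StandardError
  # Rescued here: the ROLLBACK that would normally undo the DELETE never happens,
  # the transaction commits, the rest of the block's work is skipped, and the
  # error is silently lost.
end
```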
Ran the analysis. I still don't fully understand how this can happen.
Ran the background job locally and everything was fine. I'm now directing the analysis towards exception handling: the current code swallows all errors, and this could have side effects.
To get more visibility, I created an MR where exceptions are now logged. That MR went ⚡ through the reviews and has MWPS set. Waiting for it to be deployed to refine the analysis.