Previously, due to statistics calculations, artifact sizes accumulated in the usage quota may report incorrect sizes. Now, to ensure artifacts are reporting the most accurate size, we have added a backstage script in GitLab to automatically recalculate sizes. You may notice occasionally artifact statistics will fluctuate to 0 at times as this script is recalculating, although the statistics will return to normal shortly.
Problem to solve
Resolving bug #23327 (closed) requires multiple steps, and this issue covers one of them: write a script that can forcefully recalculate all artifact sizes and document how to use it. For example, a rake task that iterates over projects.
Write a script that can forcefully recalculate all artifacts size - #238536 (closed)
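For illustration only, the kind of rake task the problem statement hints at might look like the sketch below. It assumes a size helper such as Ci::JobArtifact.artifacts_size_for (referenced later in this thread); the task name and the `refresh_build_artifacts_size!` helper are hypothetical placeholders, not the eventual implementation.

```ruby
# Hypothetical sketch: walk every project and ask a (hypothetical) helper to
# refresh its build artifacts size statistic.
namespace :gitlab do
  desc 'Forcefully recalculate build artifacts size for all projects'
  task recalculate_build_artifacts_sizes: :environment do
    Project.find_each(batch_size: 100) do |project|
      # `refresh_build_artifacts_size!` stands in for whatever recalculation
      # entry point this issue ends up adding; it is not an existing method.
      project.statistics.refresh_build_artifacts_size!
    end
  end
end
```

As the discussion below shows, any real implementation also has to coordinate with the Redis-backed counters, so a plain loop like this is only a starting point.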
This page may contain information related to upcoming products, features and functionality.
It is important to note that the information presented is for informational purposes only, so please do not rely on the information for purchasing or planning purposes.
Just like with all projects, the items mentioned on the page are subject to change or delay, and the development, release, and timing of any products, features, or functionality remain at the sole discretion of GitLab Inc.
@cheryl.li, I'd see it as a portion of a ~bug fix. I didn't get a chance to point this one before heading out on vacation today. Happy to do it when I get back.
Thanks, @allison.browne
Upon rereading this issue, I think we need to wait for Steps 1-3 to be done on #23327 (comment 415747400), and we're still part-way through Step 2. I can't tell from the discussions whether those are required for us to complete Step 4, i.e. this issue.
@thaoyeager Wonder if we should actually push this to ~"candidate::13.7" and complete the work from: #23327 (closed)?
@cheryl.li @jheimbuck_gl correct. This issue is to provide the ability to recalculate the actual size of artifacts given that the aggregation has had problems in the past (#23327 (closed) even mentions the presence of negative artifacts size).
This issue is more for a post-bugfix scenario where we have fixed the aggregation (increment/decrement artifacts size when creating/removing artifacts) from now on, but the value we started from was wrong. It's still not refined since we may decide to use different approaches depending on how we can scale them. For example, a top-level group admin could trigger a recount that runs in the background (rejecting a recount if one is still in progress).
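To make the "post-bugfix" part concrete, here is an illustrative (not actual) model of the increment/decrement aggregation being referred to. GitLab's real code defers these updates through the Redis-backed efficient counters; the simplified callbacks below write straight to the statistics row.

```ruby
# Illustrative only: artifact creation/removal adjusts the project's
# build_artifacts_size statistic. If this bookkeeping ever misses or double
# counts an event, the stored total drifts from reality, which is the
# situation this issue is about repairing.
class JobArtifact < ApplicationRecord
  belongs_to :project

  after_create_commit  { project.statistics.increment!(:build_artifacts_size, size) }
  after_destroy_commit { project.statistics.increment!(:build_artifacts_size, -size) }
end
```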
@jheimbuck_gl @shampton it's not very clear to me how we intend to run this script and at what scale. Are we planning to run this on all projects, or will this be triggered by a group admin, as noted by @fabiopitino on #238536 (comment 595546546)?
@jheimbuck_gl @shampton follow-up question: are we aware of any billing / storage quota related intricacies that we need to consider when recalculating the artifacts size and storage size of each project? I am thinking it's possible that a project would suddenly have an increase in storage used once the artifacts size stats have been recalculated. Or is this something we shouldn't worry about, and we can just go straight to recalculating all project storage related stats?
@iamricecake I could be wrong, but from what I understood from the parent epic is that the current billing and quotas are working fine, but the storage statistics that the user sees is not. From the epic:
the user is not able to see which job is consuming the space and even admin has problems finding that out
Which makes it seem like when we do this migration, no billing will change. The user will just be able to now see more detailed statistics of why they are being billed.
I am thinking it's possible that a project would suddenly have an increase in storage used once the artifacts size stats have been recalculated.
@jheimbuck_gl I think @iamricecake is right here. Recalculating the stats means recalculating how much storage the customer has used so far. Artifacts size is included in the storage size which is accounted against the storage limit.
The goal of this issue is really "how do we fix the current values of artifacts statistics"? We fixed how these are updated (with Efficient Counters) but the starting point could have been a wrong number.
It could be a script for admins, to be run instance wide. It could be a background job that is triggered on-demand by support (affecting only 1 top-level namespace / customer). Maybe we could schedule different batches via background jobs over the next few months to process all the namespaces. These are some of the options.
The goal of this issue is really "how do we fix the current values of artifacts statistics"? We fixed how these are updated (with Efficient Counters) but the starting point could have been a wrong number.
@fabiopitino do I understand correctly that we want to fix the stats but not trigger any storage limits? I am not very familiar with the storage limits implications. If we do this as a background migration that fixes all project statistics on .com, how impactful is this to projects suddenly hitting their limits? Will they suddenly have failing pipelines or something worse?
Thanks @iamricecake and @fabiopitino - The next issue in this progression is to run this script as a database migration so the counts are accurate. If we recalculate and storage is above quota the repository will be locked and changes cannot be pushed. Users can use the Job Artifacts API to delete old artifacts at that point.
The next issue in this progression is to run this script as a database migration so the counts are accurate
@jheimbuck_gl just to clarify, I am currently implementing this as a background migration that migrates all project statistics. Do we intend to have the actual migration done on another release?
Just an update on what will be done in this issue, as discussed during my 1:1 with @shampton:
No migration will be done in this issue. A class would be added which handles recalculating the stats, but no migration would be using it yet. The class will be used in #332994 (closed).
For 14.3, a release post should be done just to give a heads-up on the possible effects the migration will have once we do it in #332994 (closed) for 14.4.
@iamricecake thanks for the ping. A release post wouldn't be the normal deprecation notice. I wonder if an in-app message would be best here, one that links to an issue with more information? @williamchia what do you think about that path forward?
@jheimbuck_gl I might be missing some context, but as far as I understand, we do announce deprecations in the release post. I'm pretty sure every release post has a "Deprecations" section.
Here are some examples when we deprecated managed apps:
Some elements that worked well are that we gave plenty of notice, linked to the issues/epic for more info, and created a migration guide. (This may not need an entire migration guide depending on how intrusive the deprecation is.)
If you go with the in-app notification route, I would recommend doing this in addition to listing a notice in the deprecation section of the official release notes (aka the release post). I think you could also go with release notes only and not the in-app notification.
@williamchia agreed, deprecations belong in the release post! I meant that in this case we aren't removing anything so adding this as a deprecation may not be the best fit.
The change here is that we'll be running a script that recalculates the stored size of artifacts for projects. This could result in a user's storage quota increasing and them not being able to push to their repo until some of those artifacts are removed. A combo of blog / in-app announcement may cover this, similar to what we did for the artifact expiration changes.
Unfortunately, I don't think I'll be able to finish the script in time for 14.3. I will need more time to do this. The main issue I am trying to figure out in my implementation is how to avoid a race condition wherein the background migration worker is recalculating the artifacts size while there is also an enqueued increment worker from the counter_attribute. So work here will have to resume after our FCL. Apologies for the delay @jheimbuck_gl @shampton.
@fabiopitino for the meantime, if ever you have some ideas, please feel free to share 🙏
@iamricecake can you drop a health update on this issue when you have a chance? I'd like to confirm we're ready for rollout in %14.5 and the messaging leading up to that soon. Thanks!
@jheimbuck_gl apologies for the late update. I've been busy with FCL related work. Unfortunately, given the shorter milestone, I may be able to go back to working on this next week, but I am not confident I would have enough time to finish the work on time.
@jheimbuck_gl ah, good point. I think it is still safe to release the notice in 14.4 in preparation for 14.5, which is where the migration will land.
@jheimbuck_gl @shampton apologies for the late update here. Working on the other workhorse and artifact copying issues took a lot of my time and focus and I lost track of this. Given I will be OOO the first week of November, I am not confident anymore that this can make it into 14.5.
@morefice and I paired on this today and we have run out of ideas how to tackle the race condition problem.
Here are some notes from our pairing session:
So far, this is what I have planned to run within a background migration (a rough sketch follows this list):
run Ci::JobArtifact.artifacts_size_for(project) on project
check if artifacts size is different from the one stored in project statistics build_artifacts_size
if different, then we need to fix the stats:
look into using with_exclusive_lease - so that if there's an existing counter flushing happening, it finishes first before we fix the stats.
BLOCKER: We still need to figure out how to handle an incoming flush job that is based off an artifact that has already been included in the migration script's recalculation.
We need to prevent the incoming increment because that will count the artifact size twice.
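A minimal sketch of the plan above, assuming it runs as a background migration job; `artifacts_size_for` and `build_artifacts_size` are the names from the list, while the worker class itself is hypothetical:

```ruby
# Hypothetical worker: recalculate one project's build_artifacts_size and fix
# the stored statistic when it has drifted.
class RecalculateBuildArtifactsSizeWorker
  include ApplicationWorker

  def perform(project_id)
    project = Project.find_by_id(project_id)
    return unless project

    actual_size = Ci::JobArtifact.artifacts_size_for(project)

    # Nothing to do when the stored statistic already matches the real total.
    return if project.statistics.build_artifacts_size == actual_size

    # Naive fix: in practice this write has to coordinate with the
    # counter_attribute flush (hence with_exclusive_lease above), otherwise
    # pending increments get applied twice - see the race condition flow below.
    project.statistics.update!(build_artifacts_size: actual_size)
  end
end
```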
Here's a sample flow of the race condition:
job artifact A with the size of 10 created
redis incrby 10
enqueue worker to flush for 10mins
force flush redis counters (this flushes 10 into the statistics table), now redis counter attribute is 0.
before query (Ci::JobArtifact.artifacts_size_for(project)) was executed, job artifact B with the size of 7 created (job artifact B will be included in the query)
redis incrby 7, redis counter value is now 7
run long query (Ci::JobArtifact.artifacts_size_for(project)) for like 5 mins (which includes the 10 and 7 sizes of Job artifact A and B)
while query is running, job artifact C with the size of 5 created (job artifact C is not included in the query anymore)
redis incrby 5, counter value is now 12
migration script finishes on the project
after 10 mins the worker flushes the 12 (the total of job artifacts B and C), thus adding the 7 again onto what is already a fixed stat, when we actually only want to keep the 5. But given these 2 increments are combined in a single redis attribute, it's hard to figure out which increments are still valid.
At this point, we've run out of ideas how to address the race condition.
Note that there is a possibility that we can simplify this further, for example by having a rake task that we can run on specific projects at different times depending on their inactive hours. This would temporarily block them from triggering any pipelines to prevent the race condition. This idea came from @rkenney2 during our coffee chat. But I realize this also may not be viable for the gitlab project itself, as running the query to fetch the total artifacts size took almost 2 hours, and I don't think we would want to block our pipelines that long.
@fabiopitino if ever you are free, can we schedule a pairing session with you?
@iamricecake @morefice could we run steal_increments into a different key so that we set the increments to 0 at the time we start the counting query? Also, can the counting and update happen using the same with_exclusive_lease(lock_key) with a longer TTL so we ensure nothing flushes increments in the meantime?
```ruby
module CounterAttribute
  # - `lock_ttl` could be passed from the outside or defined in each class including CounterAttribute.
  #   This TTL should take in consideration the time it takes to sum and update the counter.
  # - `relation` would be `Ci::JobArtifact.artifacts_size_for(project)`
  def refresh_counter_from_database!(attribute, lock_ttl:)
    raise ArgumentError, "Block required to execute the calculation" unless block_given?

    lock_key = counter_lock_key(attribute)

    # using the same lock key we ensure `flush_increments_to_database!` does not run
    with_exclusive_lease(lock_key, ttl: lock_ttl) do
      increment_key = counter_key(attribute)
      recalculation_key = counter_recalculating_key(attribute)

      # move existing increments into a separate key because those are already in
      # ci_job_artifacts table. Any new increments from this point on will be
      # added to Redis and flushed after the update completes.
      steal_increments(increment_key, recalculation_key)

      new_value = yield

      transaction do
        update!(attribute, new_value)

        # if we fail to delete the key we revert the update and the `recalculation_key`
        # keeps the stolen increments that already exist in the database.
        # When retrying this, new stolen increments are added to the same key.
        redis_state { |redis| redis.del(recalculation_key) }
      end

      # This ensures we refresh root namespace statistics async
      execute_after_flush_callbacks
    end
  end

  def counter_recalculating_key(attribute)
    counter_key(attribute) + ':recalculating'
  end
end

# Usage:
project.statistics.refresh_counter_from_database!(:build_artifacts_size, lock_ttl: 2.hours) do
  Ci::JobArtifact.artifacts_size_for(project)
end
```
Concerns I have are:
while query is running, job artifact C with the size of 5 created (job artifact C is not included in the query anymore)
How confident are we that C won't be included in a long running count query?
Do we need to use each_batch to avoid query timeouts?
could we run steal_increments into a different key so that we set the increments to 0 at the time we start the counting query? Also, can the counting and update happen using the same with_exclusive_lease(lock_key) with a longer TTL so we ensure nothing flushes increments in the meantime?
Yes and yes, exactly what we discussed with @iamricecake this week!
How confident are we that C won't be included in a long running count query?
This is a good question! If we run our query outside of a transaction we will not have this issue as we will not see new values being added in the meantime.
Do we need to use each_batch to avoid query timeouts?
Probably something we should use as this is the query for gitlab.com which takes quite some time to complete.
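For illustration, a batched version of the size query could look roughly like this; the batch size and the use of the `job_artifacts` association are assumptions, not the reviewed implementation:

```ruby
# Sketch: sum artifact sizes in batches instead of one long-running query.
def batched_build_artifacts_size(project)
  total = 0

  project.job_artifacts.each_batch(of: 10_000) do |batch|
    total += batch.sum(:size).to_i
  end

  total
end
```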
```ruby
steal_increments(increment_key, recalculation_key)

# after steal_increments but before new_value query is executed, a redis counter increment
# from a newly created artifact was made thus not being included in the stolen increments
# but will be included in the new_value query.

new_value = yield

# after we've done updating the stats table, the redis counter value will then be flushed
# into the DB later after the migration is done, but that increment was already included
# in the query. This would end up counting the artifact's size twice.
```
Do you think this is a valid concern? Or am I missing something here?
@iamricecake we are first persisting the artifact and then, after_commit, we increment the statistic in Redis. So, it could happen that we steal the increments first, then an artifact is added (or deleted) and the counter incremented (or decremented) before we run the query.
Do we need to use each_batch to avoid query timeouts?
If we need to also batch the calculation, which I think we should, it means that artifacts already counted in the previous batch could have been removed in the meantime.
I think if we can get to a point where the recalculation is highly accurate (but not 100%) it should still be fine. We could allow group admins to refresh the artifacts calculation manually if they believe it's inaccurate at some point in time.
We also need to consider that, without a data retention policy, this recalculation could take longer and longer over time.
@rkenney2 following up on the previous question about accuracy of the recalculation. Also I noticed I was unassigned from this issue. Just to confirm, are we planning to move this to 14.7?
Just a heads-up that I am still working on this and it will likely miss 14.6 anyway, as I would need to ensure the queries are performant and would need more time to test.
This may be a bit of a far-fetched idea for deduplicating the updates, but what if we used a Redis sorted set to queue up the artifact size updates, with a timestamp as the score?
During regular flushing, it could flush values with timestamp before current time and remove the entries accordingly from the set.
```plaintext
# current time: 10003
ZRANGE counter_key 0 10003 BYSCORE
# 10, 7, 5 - sum and flush to DB
ZREMRANGEBYSCORE counter_key 0 10003  # remove flushed values
```
During the one time recalculation, it removes all entries with an earlier timestamp from the set, so it does not get re-added by the next flush.
```plaintext
artifact A: ZADD counter_key 10000 10
artifact B: ZADD counter_key 10001 7
---- recalculate at timestamp 10001: Ci::JobArtifact.artifacts_size_for(project) includes artifact A & B
artifact C added: ZADD counter_key 10002 5
---- still recalculating: remove A & B from counter_key
---- recalculating done
# next flush at timestamp 10005
ZRANGE counter_key 0 10005 BYSCORE
> artifact C: 5
```
@alberts-gitlab @iamricecake is there a good way that we can get insight into user impact on this? Are we able to see how many projects currently have the old artifact solution?
@alberts-gitlab brought up in our 1:1 that it could be a good idea to set up a dry-run when creating the script, to see what the effects of the script would be before actually running the real thing. That's something to consider here.
@iamricecake I meant to run the script only to query and show what would be the updated project's artifact size, without actually updating the database.
@alberts-gitlab your dry-run idea is great 👍 I am thinking now, to be extra safe, we can open a separate issue which would block this one. The blocking issue would be for implementing the dry-run script. The script will then run the query and generate a report on affected projects. What do you think about this?
Though we still have to consider the race condition that can happen, which may result in inaccurate dry-run results. I will book a pairing session with you so we can think about this further.
Instead of the dry-run, Albert came up with a great idea of running queries on Sisense to get a better look at the number of affected projects. See Albert's comment about the dashboard below. The plan is to use the same query to limit the projects that we will run the migration script on.
Now regarding the race condition:
Based on #238536 (comment 745091654), the race condition can happen and is inevitable. This might result in some inaccurate values once the migration script runs. The good thing is that we can rerun the queries on Sisense to determine if there are still projects with inaccurate storage stats, and then we can run the recalculate script on them again.
Thanks @alberts-gitlab for the awesome pairing session, really helpful 🙇
@alberts-gitlab - if you go to the Hamburger menu in the top left corner, you should see the Dashboard Preferences option. Click that and it will popup a modal with a navigation menu in the left hand column. Select Permissions from that menu. Finally, you will see "Everyone" or "Engineering" as an option, I would say select "Edit as Copy" so we have an SSOT that remains unchanged if someone edits it incorrectly. If you want to grant all Edit permissions to "Engineering" that makes sense too! Click save after and we all should be able to edit it.
I gave "Everyone" all Read & Write access. @iamricecake let me know if you still cannot edit the charts. You should now see a "edit" button when you hover over the top right corner of each chart.
@alberts-gitlab and I paired yesterday and laid out a more concrete plan of the work needed to be done here.
There will be 2 separate MRs that Albert and I will work on in parallel:
Rake task to list the project IDs that have inaccurate build artifacts size stats from Sisense.
This would generate the list of project IDs that we will then pass as arguments to the next rake task.
We decided to fetch from Sisense instead of running the query on the actual database because of query performance. Running the same query we ran on Sisense on the actual DB would probably timeout.
Rake task to update the project stats build artifacts size of each project from the given list of IDs.
This will do the bulk of the work, enqueueing a worker for each project, which then recalculates the stats (see the sketch after this list).
This MR would need a review from Scalability because of the amount of workers it will enqueue.
The rake task here will have the flexibility of accepting any number of project IDs, and can be rerun anytime as needed, on .com or on self-managed instances.
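For context, here is a rough sketch of what the second rake task could look like; the task name, argument handling, and worker class are hypothetical placeholders rather than the final implementation.

```ruby
# Hypothetical sketch: accept a list of project IDs and enqueue one refresh
# worker per project, so the rake task itself stays cheap and can be rerun.
namespace :gitlab do
  desc 'Recalculate build_artifacts_size for the given project IDs'
  task :refresh_build_artifacts_size, [:project_ids] => :environment do |_task, args|
    project_ids = args[:project_ids].to_s.split(/[\s,]+/).map(&:to_i)

    Project.where(id: project_ids).find_each do |project|
      # Placeholder worker name; the real class recalculates the statistic
      # for a single project in the background.
      RecalculateBuildArtifactsSizeWorker.perform_async(project.id)
    end
  end
end
```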
On my end, I am still working on the script. Unfortunately, I won't be able to have it reviewed today (Friday), which was my initial target. I will spend some time over the weekend to continue working on it, and hopefully have it reviewed by Monday. But given the scale of impact of the script, the review process may take more time. With that said, I am putting this "At Risk", as it is highly likely to carry over to 14.9.
Apologies for the delay.
Huge shout-out to @alberts-gitlab for pairing with me on this 🙇 It has been very helpful.
@iamricecake It turns out that the API URL to download the CSV from Sisense will have to be public. I'm not sure if this would be a problem. @shampton what do you think?
Here's the link to the pairing notes. Regarding the security of exposing the CSV URL to the public, I think given they are only a list of IDs, it's fine. But we can also get some thoughts from our counterpart App Sec engineer @dcouture.
LGTM @iamricecake, thanks for checking! I agree the list of IDs is OK, especially since the affected projects aren't affected by a vulnerability or anything like that. 👍
Let's get an update on the review Monday so I know whether to update the in-app messaging timeline.
@jreporter just a quick update, I have not been able to open an MR for review yet. Planning to open the draft MR of my work tomorrow/later, Tuesday my time, for review. But I am not confident this can make it to 14.8.
I have been working with a customer in this ticket who is having difficulty recalculating artifacts for a specific project. We initially thought it was related to #346261 (closed).
Deleting artifacts prior to a certain date didn't work. In our testing there were no outstanding large artifacts left.
We also tried to manually call `UpdateProjectStatisticsWorker.perform_async(94, ["commit_count", "repository_size", "storage_size", "lfs_objects_size", "build_artifacts_size", "pipeline_artifacts_size"])`
All in all, the customer's project is showing 60GB of artifacts (build_artifacts_size), but those artifacts are nowhere to be found. The customer's artifacts are typically around 6840b and we deleted all of them older than a week, so it seems unlikely that the project would be at 60GB+ with only a week's worth of artifacts. Pipelines were also deleted. We're hoping this issue will help solve this kind of problem, since build_artifacts_size seems largely broken.
Draft MR was opened for review, and there are some implementation changes that I am currently applying as discussed on !81306 (comment 853136610). I am aiming to have this reviewed again on Tuesday, March 1st.
thank you so much for this update! I am really excited to see this work make progress, as it will be really impactful for our top 12 cross functional project work! Thank you 🙇
!81306 (merged) is now back for review, updated based on recent discussion with the reviewers. I still need to do manual testing on it and add tests, but wanted to get the implementation details out for review as early as possible to catch any potential issues.
@iamricecake - I have added release notes to this issue which will be used to populate the release post MR:
Previously, due to statistics calculations, artifact sizes accumulated in the usage quota may report incorrect sizes. Now, to ensure artifacts are reporting the most accurate size, we have added a backstage script in GitLab to automatically recalculate sizes. You may notice occasionally artifact statistics will fluctuate to 0 at times as this script is recalculating, although the statistics will return to normal shortly.
!81306 (merged) implementation details have been reviewed thoroughly and I am now currently doing some testing locally and adding the specs.
I am putting this on "Needs Attention" as I may need a helping hand with !81306 (comment 868749122). We would need to add a new index on the huge ci_job_artifacts table, so the process of adding one is not as simple. This should be done in a separate MR and ideally be merged before the weekend, so that the migration runs over the weekend and adds the new index.
!81306 (merged) is back in review again. Did a lot of refactoring based on recent feedback. Lots of great ideas from reviewers. I feel we're close to getting this merged 🤞
!81306 (merged) I believe is close to being merged (disabled under feature flags). Please note that depending on when the MR gets merged, it's possible that the rake task won't be included in the 14.9 release, but executing the script on .com would happen in 14.9.
We have 2 issues for rolling out the feature flag:
@jreporter !83194 (merged) has been merged, but the script/rake task did not make it into the 14.9 release. Though I don't think this matters that much, as we will be running it on .com as soon as it gets deployed to production.
@jreporter
Hey, just making sure that you've thought about this. Currently, our Artifacts quota isn't calculated correctly. I'm afraid that after the calculation is fixed we'll be over our limits and the project will be locked. This could cause us great headaches, as development could freeze until we're able to clear out artifacts or buy more storage.
Would we be able to know what the correct usage of our artifacts is and get a warning before our repos would be locked?
Got it. Yes, we currently have a broadcast message in the application letting people know of these statistics changing. I will let @doniquesmit advise on other messaging you can expect.
@alberts-gitlab I remember you had an idea of doing a "dry-run" migration to see what the result would be before officially running it. Is this still something we want to do, or have we decided against that? Just curious if this "dry-run" would help us get a list of projects and their correct usage so that we could notify them beforehand.
@jreporter The project limit notification banners will trigger when the 10GB repo limit is approached and reached.
However, AFAIK Artifacts aren't currently included in the 10GB repo limit per project, so this change shouldn't cause projects to get locked when the Artifact size changes. @amandarueda @csouthard please correct me if I'm wrong here.
@shampton Given !83194 (merged) has already been merged and deployed to production I think we can close this one out now. The next step is to work on #332994 (closed) to run the rake task on production.
Is there any information on how to trigger this script in a self-managed instance? Or does this run automatically in 14.10 or later? @jreporter @iamricecake?
The script won't be automatically executed; the rake task has to be run manually. We don't have documentation for it yet, and that will be addressed in #355901 (closed).
@maxraab Please note that the script is still actively being developed so please take caution. Here's a quick rough guide on how to run the script on your installation:
First, make sure to have one of the `projects_build_artifacts_size_refresh_*` feature flags enabled. Let's say `projects_build_artifacts_size_refresh_medium`.
Then determine the list of project IDs that you would want to recalculate. This list will have to be stored in a remote CSV file, in a format like:
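For illustration, the CSV could be as simple as a single column of project IDs (the exact header the script expects is an assumption here):

```plaintext
PROJECT_ID
1001
1002
1003
```

Then, assuming the rake task interface from the MR, host that file somewhere reachable and pass its URL as the task argument, for example with something like `sudo gitlab-rake "gitlab:refresh_project_statistics_build_artifacts_size[https://example.com/project_ids.csv]"`. Check lib/tasks in your GitLab version for the exact task name, since the script is still being developed.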
Thank you @iamricecake. Is there another issue I can follow to stay up to date? I am very interested in the current state of development and would like to follow the next steps. Will this happen automatically with a later update?
Your explanation seems to work well, by the way.