Verify expired job artifacts are removed quickly on gitlab.com
As a follow-up to #322817 (closed), ensure the expired job artifacts on gitlab.com continue to get removed. Per the comment from @drew in #322817 (comment 876119887), the expectation is that it'll take a month to remove all expired artifacts, so instead of keeping that issue open, we'll use this issue to track progress.
Checking on our new, performant process (Status: It's done!)
Expired artifacts are being removed 50k at a time every 7 minutes, roughly 10M artifacts/day, as they were before. I'll keep monitoring this execution for sheer duration, as well as some of the finer performance metrics like vacuum frequency and dead tuple scans across all indexes on the table. If, after a few days, it looks like we have significant headroom, I'll look to increase the pace of removal. The current backlog of expired, unlocked artifacts is about 26 days' worth of execution, which in practice will take longer as artifacts continue to expire during that time.
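The pace figures above can be checked with a little arithmetic. The sketch below is only an estimate: it assumes every run hits the 50k cap, and the inflow of newly expired artifacts (2M/day in the example) is a hypothetical number chosen for illustration, not a measured one.

```python
import math

BATCH_SIZE = 50_000      # artifacts removed per execution (from the text above)
INTERVAL_MINUTES = 7     # worker schedule (from the text above)


def removals_per_day(batch_size=BATCH_SIZE, interval_minutes=INTERVAL_MINUTES):
    """Upper bound on artifacts removed per day if every run hits the cap."""
    runs_per_day = 24 * 60 / interval_minutes
    return batch_size * runs_per_day


def days_to_drain(backlog, removed_per_day, newly_expired_per_day=0.0):
    """Days until the backlog is gone, given a steady inflow of newly expired artifacts."""
    net = removed_per_day - newly_expired_per_day
    if net <= 0:
        return math.inf  # removal never catches up with the inflow
    return backlog / net


if __name__ == "__main__":
    per_day = removals_per_day()
    backlog = 26 * per_day  # backlog size implied by "26 days' worth of execution"
    print(f"{per_day:,.0f} artifacts/day at the cap")
    print(f"{days_to_drain(backlog, per_day):.1f} days with no inflow")
    # Hypothetical inflow: 2M artifacts newly expiring per day.
    print(f"{days_to_drain(backlog, per_day, 2_000_000):.1f} days with inflow")
```

Any non-zero inflow pushes the drain time past 26 days, which is why the estimate is a lower bound.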
Good news everyone! Artifacts created since November 2021 are being removed within 7 minutes of their expiration timestamp. The Sisense dashboard, as explained below, is not useful for tracking this effort because it doesn't reflect deleted rows. But the numbers coming out of Kibana show that we're executing the worker for a relatively short time, and removing in the ballpark of 10k artifacts every 7 minutes, peaking a little bit higher around 13k and dropping as low as 1.5k during off hours. This correlates with the pace of artifact creation, and indicates that they're not waiting around in any kind of lengthy queue.
- Re-enable ci_destroy_all_expired_service and expire up to 10k artifacts (based on ci_job_artifacts.locked) every 7 minutes. Verify sustainable performance in Kibana & Thanos.
- Increase worker throughput cap to 50k every 7 minutes. Verify sustainable performance in Kibana & Thanos.
- Watch Kibana for execution metrics to consistently remain below 300s and 50k removed artifacts per execution. When both of these numbers remain substantially below those thresholds, we can verify that there is no longer a queue of expired artifacts waiting to be removed.
- Verify that the removal index index_ci_job_artifacts_on_expire_at_for_removal has a negligible number of records in it on production. This partial index represents the removal "queue". As long as the number is below what we expect to remove in a single pass, we can assert that there is no queue.
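The two "no queue" checks in this list can be written down as simple predicates. This is a minimal sketch, not how Kibana or the worker evaluates anything: the Execution record, the function names, and the 50% margin used to interpret "substantially below" are all assumptions made for illustration. Only the 300s and 50k thresholds come from the checklist.

```python
from dataclasses import dataclass

MAX_DURATION_S = 300   # execution time ceiling from the checklist
MAX_REMOVED = 50_000   # per-execution removal cap from the checklist
MARGIN = 0.5           # hypothetical: "substantially below" read as under 50%


@dataclass
class Execution:
    """One worker run as it might be read off the dashboard (hypothetical shape)."""
    duration_s: float
    removed: int


def has_headroom(executions, margin=MARGIN):
    """True when every observed run stays well under both thresholds."""
    if not executions:
        return False  # no data is not evidence of headroom
    return all(
        e.duration_s < MAX_DURATION_S * margin and e.removed < MAX_REMOVED * margin
        for e in executions
    )


def queue_is_empty(index_row_count, per_pass_cap=MAX_REMOVED):
    """True when the removal index holds fewer rows than one pass can remove."""
    return index_row_count < per_pass_cap
```

For example, runs that remove 10k to 13k artifacts in under a minute pass has_headroom, while a run that hits the 50k cap does not.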
Checking on our older artifacts backlog
The Kibana dashboard has separate panels now for the new, performant artifact expiration and the older, slower artifact expiration where the locked status is still unknown in the artifacts table and we need to go over to the ci_pipelines table to get it.
- Kick off Ci::UpdateLockedUnknownArtifactsWorker, and expire up to 10k artifacts every 7 minutes. Verify sustainable performance in Kibana & Thanos.
- Increase worker throughput cap to 50k every 7 minutes. Verify sustainable performance in Kibana & Thanos.
- Watch Kibana for execution metrics to consistently remain below 300s and 50k removed artifacts per execution. When both of these numbers remain substantially below those thresholds, we can verify that there is no longer a queue of expired artifacts waiting to be removed.
- Verify that the only production records left in the table with an unknown status are job traces. Since we removed the expiration dates from all trace artifacts, this worker will never pick up those JobArtifact records, and never mark them as locked or attempt to delete them. At this point, Kibana should report 0 artifacts being removed and 0 artifacts being locked by this worker. No records are ever added to the cohort of artifacts that this worker targets, so when its job is finished it will drop to exactly 0, at which point we can close this issue and start planning for removal of the worker in %16.0.
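The exit condition for the backlog worker (0 removed and 0 locked) is easy to state as a check over recent runs. The sketch below assumes the per-run counts have been exported as (removed, locked) pairs, oldest first. That input shape and the min_runs guard against a single quiet run are hypothetical and not part of the worker or the dashboard.

```python
def trailing_idle_runs(executions):
    """Count the most recent consecutive runs that removed and locked nothing.

    `executions` is a list of (removed, locked) pairs, oldest first.
    """
    count = 0
    for removed, locked in reversed(executions):
        if removed != 0 or locked != 0:
            break
        count += 1
    return count


def backlog_finished(executions, min_runs=3):
    """True once the last `min_runs` runs all report exactly 0 removed and 0 locked.

    min_runs is a hypothetical safety margin, not a value from the issue.
    """
    return trailing_idle_runs(executions) >= min_runs
```

Because no new records ever join the cohort this worker targets, a sustained run of zeros should be permanent rather than a lull.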
Monitoring links
Sisense doesn't track deletion well, so we can't use that to monitor the ci_job_artifacts table right now. If we can in the future, I'll add a link back here for us to use.
Thanos: A good representation of overall health, showing that we're not killing the db
- Thanos charts of INSERT and DELETE on the ci_job_artifacts table. This gives a high-level picture of ingress/egress on the table to show that the overall size is moving in the right direction.
- Another Thanos page with index tuple reads alongside the number of dead tuples on the table. This will help us make sure that indexes are generally available to application traffic and we're not getting filled with so many dead tuples that the indexes become useless.
Kibana: A good representation of the progress our workers are making against the backlog of expired artifacts waiting for removal
- A Kibana dashboard tracking artifacts removed from the table by the more performant ExpireBuildArtifactsWorker that handles the removal of expired artifacts on an ongoing basis.
- This saved Kibana search that shows the number of artifacts removed per worker execution.
What do we do if something looks wrong?
Turn the cap back down to 10k! Turn the service off completely! Try turning it back on later! Whether or not we're running this service at any particular time doesn't really matter. It's an important-but-not-urgent kind of thing. As long as we're making steady progress against this backlog of artifacts, it's not especially important that we go quickly at any specific moment. If there's an incident of any kind, even something artifacts-adjacent, and we want to reduce churn on the table, these flags are available to you.
- ci_destroy_all_expired_service (REMOVED): On/off switch for the DestroyAllExpiredService, called by ExpireBuildArtifactsWorker, that uses newer, performant logic to remove expired artifacts created after November 2021.
- ci_artifact_fast_removal_large_loop_limit (REMOVED): Switch between a cap of 10k and 50k artifacts processed per execution of ExpireBuildArtifactsWorker. Removed on 2022-04-08 and permanently set to 50k.
- ci_job_artifacts_backlog_work: On/off switch for execution of the Ci::UpdateLockedUnknownArtifactsWorker worker that uses gnarlier, slower logic to remove artifacts created before December 2021 or mark them as locked.
- ci_job_artifacts_backlog_large_loop_limit: Switch between a cap of 10k and 50k artifacts processed per execution of Ci::UpdateLockedUnknownArtifactsWorker.
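How the two remaining flags combine can be summarized as a small lookup. This models the behaviour described in the list above and is not GitLab's feature flag API: the function name and the plain mapping used to hold flag states are assumptions for illustration, and a missing flag is treated as disabled.

```python
from typing import Mapping

SMALL_CAP = 10_000   # cap when the large loop limit flag is off
LARGE_CAP = 50_000   # cap when the large loop limit flag is on


def backlog_worker_cap(flags: Mapping[str, bool]) -> int:
    """Artifacts the backlog worker may process per execution, given flag states."""
    if not flags.get("ci_job_artifacts_backlog_work", False):
        return 0  # worker is switched off, so nothing is processed
    if flags.get("ci_job_artifacts_backlog_large_loop_limit", False):
        return LARGE_CAP
    return SMALL_CAP
```

Turning the cap back down corresponds to disabling only the large loop limit flag. Turning the service off completely corresponds to disabling ci_job_artifacts_backlog_work, regardless of the other flag.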