Refactor how we store expiring information for job artifacts
Summary
After rolling out https://gitlab.com/gitlab-org/gitlab/-/issues/233939 we can delete expired artifacts as soon as we can find them. Which leads to this problem, the query that looks for expired artifacts is highly inefficient. In !47496 (merged) we have stopped it from timing out at the expense of how long back in time we could go.
How the data is structured:
On the ci_job_artifacts
table we have a column that tells us when an artifact should expire, called expire_at
. But #16267 (closed) changed how we look for expired artifacts, we also need to consider ci_pipelines.locked
value. In order to find a batch of artifacts that can be removed we need to join three tables: ci_job_artifacts
, ci_builds
, and ci_pipelines
(we need to introduce ci_builds
because we don't have a foreign key for pipelines on job artifacts). This query is problematic because it needs to look at tens of thousands of expired artifacts to find a few that are unlocked.
The suggested solution for this was to move the locked
column on ci_job_artifacts
and create an index that covers expire_at
and locked
columns.
But we could try to solve more problems here:
- centralize the expiry data into a single place because it is stored in 3 separate tables and it's becoming hard to manage.
-
ci_job_artifacts.expire_at
- timestamp indicating when the artifact should be removed. -
ci_pipelines.locked
- integer indicating if expired artifacts could be removed - #281688 (closed) -
ci_builds.artifacts_expire_at
- timestamp - this one is used in the UI to show if artifacts are available or not.
-
- identify artifacts that are kept by users - #263234 (comment 466277687)
Theci_job_artifacts.expire_at
column can hold null values indicating that the artifacts must not be removed because users chose to keep the around by clicking theKeep
button in the UI or usedartifacts:expire_in:never
in the YAML config. But because we didn't have a default expiry policy for .com, we have a lot more artifacts with null values in this column. - UX problem:
The clock for an artifact starts as soon as it's uploaded, but with #16267 (closed) I think it should start after the artifact is unlocked because we can go
There is no grace period between the states. I would like to see when the artifacts are supposed to be removed because going from locked to removed is a bit surprising for me.
Improvements
I'm thinking of introducing a new table to hold this information:
table name: ci_job_artifacts_policies
column name data type
job_id FK to ci_builds
state enum - this could be `unlocked`, `pipeline_locked`, `user_locked`
expire_at timestamp
expire_in interval - this will hold the value defined by the user in `artifacts:expire_in` if we chose to start the clock after unlocking them.
removed_at timestamp
With this table in place, the query to find expiring artifacts becomes:
policies = Ci::JobArtifactsPolicy
.where('expire_at < ?', Time.current)
.where(removed_at: nil)
.where(state: :unlocked)
policies.each_batch(of: 100) do |policies_slice|
Ci::JobArtifact
.erasable # we don't want to remove job traces
.where(job_id: policies_slice.select(:job_id))
.destroy_all # the destroy process is simplified here
policies_slice.update_all(removed_at: Time.current) # we mark policies as removed
end
This is better because we are using only one table to do the heavy filtering and we can back the database query with an efficient index. We are still doing a join
and apply a filter on the joined relationship but the filter is going to remove only policies_slice.size
artifact records from the relationship.
Risks
I think the background migration to populate ci_job_artifacts_policies
could take a lot of time, so we would need to run the two solutions concurrently for that period of time.
Involved components
- We will have to update
Ci::UnlockArtifactsService
, but keep its efficiency. - I think we should create the policy record when we create the first artifact in
Ci::CreateJobArtifactsService
. If a job doesn't have any other artifact types besides trace, the record should be optional. -
Ci::DestroyExpiredJobArtifactsService
will need to have its query updated. - Update how we
Keep
artifacts: https://gitlab.com/gitlab-org/gitlab/-/blob/120c72350e46683c0c85d30db73b378bb66206a9/app/controllers/projects/artifacts_controller.rb#L85-89 - Set the state touser_locked
and do not update theexpire_at
value. - Update how
artifacts:expire_in:never
works. It should set thestate
touser_locked
and theexpire_at
value to the application default.
We could explore these queries in more detail with a thin database clone before starting the actual work.