Implement stale pending direct uploads cleaner

Erick Bajao requested to merge eb-stale-direct-upload-cleaner into master

This is part of the work for #285597 (closed).

This implements a cron worker that regularly cleans up stale pending direct uploads.

What are stale pending direct uploads?

In !106740 (merged), we implemented a new mechanism that supports direct uploads straight to the final location, instead of uploading to a temporary location and relying on CarrierWave to move the file to its final location.

With that change, Workhorse uploads directly to the final location on object storage and also skips deleting the uploaded object. Because of this, it is possible for Workhorse to complete the upload to object storage but then hit an error when it finalizes against the Rails endpoint, leaving a dangling object without a matching job artifact record.
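
To make these dangling objects discoverable later, each pending final-location upload is tracked in Redis. As a rough illustration only (a sketch using the hash name, key format, and Redis wrapper that appear in the test steps further down, not the exact implementation), an entry maps the object's final path to the epoch timestamp at which the upload was prepared:

    # Sketch only: record a pending direct upload by storing its final object path
    # as a field in the `pending_direct_uploads` Redis hash, with the current epoch
    # timestamp as the value (the same hash the test steps below read and update).
    Gitlab::Redis::SharedState.with do |redis|
      redis.hset(
        'pending_direct_uploads',
        'artifacts:@final/7d/4f/0f402ab69dd0f52d403f74ad331fe0b55494e49c22a49d1284c2fcceb1d2',
        Time.current.utc.to_i
      )
    end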

This is where the stale pending uploads cleanup worker comes into play. It regularly iterates through the Redis hash that tracks pending uploads and checks each entry's timestamp. If an entry is older than 3 hours, it is considered stale and we assume the upload did not complete successfully. The worker then deletes the entry from the Redis hash and also deletes the corresponding dangling object if one exists.
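
For reference, here is a minimal sketch of that cleanup pass. It is not the worker's actual code; it only reuses the ObjectStorage::PendingDirectUpload helpers and Redis hash shown in the test steps below, and the deletion of the dangling object itself is elided because it depends on the object storage client:

    # Sketch of the cleanup pass, not the actual DeleteStaleDirectUploadsWorker code.
    staleness_limit = 3.hours

    ObjectStorage::PendingDirectUpload.each do |pending_upload|
      tracked_at = Gitlab::Redis::SharedState.with do |redis|
        redis.hget('pending_direct_uploads', pending_upload.key).to_i
      end

      # Entries newer than 3 hours are still within the grace period.
      next if tracked_at > staleness_limit.ago.utc.to_i

      # Stale: drop the tracking entry; deleting the dangling object at the
      # final location (if it still exists) is elided in this sketch.
      Gitlab::Redis::SharedState.with do |redis|
        redis.hdel('pending_direct_uploads', pending_upload.key)
      end
    end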

How to test locally

This is a bit tricky to test locally and requires some manual hardcoding to trick the system into thinking something is already "stale".

We can test 2 scenarios:

  1. Ensure that the worker does not delete non-stale pending uploads
  2. Ensure that the worker deletes stale pending uploads (older than 3 hours)

So let's set it up:

  1. Ensure you have object storage and direct upload enabled on your GDK.
  2. Make sure you have the ci_artifacts_upload_to_final_location feature flag enabled (via the rails console).
  3. Set up a project that uploads an artifact. You can fork this sample project: https://gitlab.com/iamricecake/basic-artifact-test.
  4. Run a pipeline and ensure artifacts are being uploaded correctly. Confirm it by downloading the artifact from the job details page.
  5. Before we retry the job to test our scenarios, we first need to trigger a fake error so that when Workhorse completes the upload and hits the finalize endpoint, the API errors out and we end up with a dangling object in storage with no matching job artifact DB record.
    • Edit Ci::JobArtifacts::CreateService#execute:
      def execute(artifacts_file, params, metadata_file: nil)
         return error(foo: 1) # Add this line at the very beginning to force an error to be returned immediately
        
         ...
      end
  6. Now reload the job details page, and then retry the job. The job should eventually fail after the runner attempts the upload 3 times. This results in 3 pending upload entries and 3 dangling objects.
    • To confirm this, we can run the following in the rails console; the output may look something like:
      > ObjectStorage::PendingDirectUpload.each {|pd| puts pd.key}
      artifacts:@final/7d/4f/0f402ab69dd0f52d403f74ad331fe0b55494e49c22a49d1284c2fcceb1d2
      artifacts:@final/a8/23/1a7f966ccdfd224c15cac4e45c4e1cb4e7b9a7327edca39134701eb67604
      artifacts:@final/8d/70/72443b3047e9dc55a93cb2701fdd4a4c14252f9a48c05e3c3235e246be43
    • You can check MinIO for the presence of these objects.
  7. At this point, these pending upload entries are not considered stale yet. We can verify that the worker does not delete them by force-enqueueing the cron job (object_storage_delete_stale_direct_uploads_worker) at http://gdk.test:3000/admin/sidekiq/cron.
    • Before running the job, you can run gdk tail rails-background-jobs | grep DeleteStaleDirectUploadsWorker to see when the worker has executed.
    • Enqueue the worker, and once it has completed, confirm on the rails console that the entries are still there:
      > ObjectStorage::PendingDirectUpload.each {|pd| puts pd.key}
      artifacts:@final/7d/4f/0f402ab69dd0f52d403f74ad331fe0b55494e49c22a49d1284c2fcceb1d2
      artifacts:@final/a8/23/1a7f966ccdfd224c15cac4e45c4e1cb4e7b9a7327edca39134701eb67604
      artifacts:@final/8d/70/72443b3047e9dc55a93cb2701fdd4a4c14252f9a48c05e3c3235e246be43
  8. Now let's test the second scenario, in which the worker should delete stale entries. (If you want to double-check the staleness of individual entries, see the inspection sketch after this list.)
    • We need to manually update the timestamps on the Redis entries so that we don't have to wait for 3 hours:
      > faketime = 4.hours.ago.utc.to_i
      > ObjectStorage::PendingDirectUpload.each {|pd| Gitlab::Redis::SharedState.with {|r| r.hset('pending_direct_uploads', pd.key, faketime)} }
    • Now re-enqueue the cron worker.
    • Once it completes, check the rails console; there should be no entries left:
      > ObjectStorage::PendingDirectUpload.each {|pd| puts pd.key}
      nil
    • You can also check MinIO to confirm that the previously uploaded object paths are now gone as well.
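
If you want to double-check why the worker did or did not delete a particular entry, a small console snippet like the one below can help. It is only a debugging aid, not part of this MR, and assumes the same Redis hash and helpers used in the steps above:

    # Debugging aid: print each tracked pending upload, when it was recorded,
    # and whether it is already past the 3-hour staleness threshold.
    ObjectStorage::PendingDirectUpload.each do |pd|
      tracked_at = Gitlab::Redis::SharedState.with { |r| r.hget('pending_direct_uploads', pd.key) }.to_i
      stale = tracked_at < 3.hours.ago.utc.to_i
      puts "#{pd.key} tracked_at=#{Time.at(tracked_at).utc} stale=#{stale}"
    end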