Change Request [STG] Run rake task `gitlab:cleanup:list_orphan_job_artifact_final_objects`
Staging Change
Change Summary
Related to gitlab-org/gitlab#423074 (comment 1740429063).
This is a request to execute the `gitlab:cleanup:list_orphan_job_artifact_final_objects` rake task (added in gitlab-org/gitlab!143737 (merged)) on staging.
We will open a separate CR issue to run the same rake task on production; for now, we want to validate this on staging first.
What does the rake task do?
Due to the bug (gitlab-org/gitlab#423074 (closed)), we ended up with a huge number of orphan job artifact objects in object storage. An orphan job artifact object is an object that no longer has a corresponding job artifact record in our database.
As planned in gitlab-org/gitlab#423074 (comment 1740429063), this first rake task is for identifying all orphan job artifact objects and listing them in a CSV file. We won't be deleting any objects yet in this change request, just identifying them. On a separate CR issue, we will run a 2nd rake task that will then process the generated CSV file and delete the orphan objects.
We expect this rake task to run for a very long time. This is because it will have to go through each object under the `@final` directory of each project in object storage. We can't filter by creation date, so we have to go through each object.
Due to the long-running nature of the rake task, we added a "resume from last page marker" functionality to it. So if the rake task is abruptly interrupted (e.g. the node is killed) after already processing hundreds of pages, it will resume from the last known marker when we re-run it.
Here's how to execute the rake task. If `FILENAME` is not specified, the rake task will generate a CSV file named `orphan_job_artifact_final_objects.csv`:

> rake 'gitlab:cleanup:list_orphan_job_artifact_final_objects'

If we want to use a custom filename:

> FILENAME='custom_filename.csv' rake 'gitlab:cleanup:list_orphan_job_artifact_final_objects'
Please note that when we re-run the rake task and there's an existing file that matches the used filename (either custom or default), it will just append new entries into that file. If no matching file is found, the rake task will create a new file with the used filename and then continue adding entries into it. This feature can be beneficial if we decide to process the CSV files with multiple instances of the 2nd rake task running in parallel, one for each generated CSV file. For example:
- We run the rake task the first time:

  > FILENAME='output_1.csv' rake 'gitlab:cleanup:list_orphan_job_artifact_final_objects'

- We then abruptly end the running rake task, and re-run it again but this time with a different filename:

  > FILENAME='output_2.csv' rake 'gitlab:cleanup:list_orphan_job_artifact_final_objects'

- This second run will resume from the last known page marker, but will add newly found orphan entries into the new CSV file.
To keep it simple for now, we can choose to just run it once and use one file. I will update this description if the SRE review says otherwise.
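If we do end up with multiple per-run CSV files, they could be merged into a single file for the 2nd rake task with something like this (a sketch; `output_1.csv`, `output_2.csv`, and `combined.csv` are example names, and this assumes duplicate entries can only appear when a page was re-scanned after a restart):

```shell
# Merge the per-run CSVs into one file, dropping any duplicate
# entries left over from a partially re-scanned page
sort -u output_1.csv output_2.csv > combined.csv
```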
The generated CSV file doesn't have headers. Each entry consists of 2 comma-separated values and will look like:

35/13/35135aaa6cc23891b40cb3f378c53a17a1127210ce60e125ccf03efcfdaec458/@final/1a/1a/5abfa4ec66f1cc3b681a4d430b8b04596cbd636f13cdff44277211778f26,201

The first value is the object path, and the second value is the object size. We can write up a script later on if we want to parse the CSV file and sum up the total file size that can be cleaned up.
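For example, since the size is always the last comma-separated value, a quick total could be computed with awk (a sketch, not the planned script; this assumes the size column is in bytes):

```shell
# Count the entries and sum the size column (last field) of the
# generated CSV to estimate how much storage could be reclaimed
awk -F',' '{ count += 1; bytes += $NF } END { printf "%d objects, %d bytes\n", count, bytes }' \
  orphan_job_artifact_final_objects.csv
```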
Change Details
- Services Impacted - Google Cloud Storage
- Change Technician - @iamricecake, SRE to execute the rake task TBD
- Change Reviewer - @igorwwwwwwwwwwwwwwwwwwww
- Time tracking - Can be hours or days depending on the total number of objects that need to be processed. We don't have an accurate count for this.
  - On staging, a rough estimate based off of the job artifact records in the DB counts 242,956 objects in `@final` directories. Assuming 1 second per object (it should actually be faster than this, but this is a worst-case number to add some buffer), that equates to roughly 67 hours for the whole rake task execution.
- Downtime Component - N/A
Detailed steps for the change
Change Steps - steps to take to execute the change
NOTE: Execute all these first on staging.
Estimated Time to Complete (mins) - hours to days; see Time tracking above.
- Set label ~change::in-progress: /label ~change::in-progress
- Run the rake task on the console-01-sv-gstg.c.gitlab-staging-1.internal host.
  - Since this is a potentially long-running task, we should run it inside of a `screen` or `tmux` session.
  - Run `sudo gitlab-rake 'gitlab:cleanup:list_orphan_job_artifact_final_objects'`.
    - This command will automatically drop privileges to the `git` user and cd into `/opt/gitlab/embedded/service/gitlab-rails`.
  - If the rake task is abruptly interrupted, we just re-execute the same command and it will resume from the last known page marker, appending newly found entries to the previously generated `orphan_job_artifact_final_objects.csv` (default filename).
  - Ensure that the rake task finished completely by checking that it printed out `Done`.
  - Ensure the generated file `orphan_job_artifact_final_objects.csv` (or the custom filename, if we chose to use one) exists.
  - Keep the file for later use, because a separate CR issue will execute a 2nd rake task that processes the CSV file and deletes the orphan objects.
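The completion checks above can be scripted roughly as follows (a sketch; `rake_output.log` is a hypothetical file capturing the task's stdout, e.g. via `tee`, and the default CSV filename is assumed):

```shell
# Confirm the task printed "Done" and the generated CSV is non-empty
if grep -q 'Done' rake_output.log && [ -s orphan_job_artifact_final_objects.csv ]; then
  echo "rake task finished and CSV generated"
else
  echo "rake task did not finish cleanly" >&2
fi
```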
- Set label ~change::complete: /label ~change::complete
Rollback
Rollback steps - steps to be taken in the event of a need to rollback this change
There's no rollback for this change. We are not actually deleting anything here.
Monitoring
We should watch for throttling or rejected requests from GCP caused by making too many requests in a short amount of time. If this happens, we will need to modify the rake task to add a delay between requests.
Change Reviewer checklist
- Check if the following applies:
  - The scheduled day and time of execution of the change is appropriate.
  - The change plan is technically accurate.
  - The change plan includes estimated timing values based on previous testing.
  - The change plan includes a viable rollback plan.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
- Check if the following applies:
  - The complexity of the plan is appropriate for the corresponding risk of the change (i.e. the plan contains clear details).
  - The change plan includes success measures for all steps/milestones during the execution.
  - The change adequately minimizes risk within the environment/service.
  - The performance implications of executing the change are well-understood and documented.
  - The specified metrics/monitoring dashboards provide sufficient visibility for the change.
    - If not, is it possible (or necessary) to make changes to observability platforms for added visibility?
  - The change has a primary and secondary SRE with knowledge of the details available during the change window.
  - The change window has been agreed with Release Managers in advance of the change. If the change is planned for APAC hours, this issue has an agreed pre-change approval.
  - The labels ~"blocks deployments" and/or ~"blocks feature-flags" are applied as necessary.
Change Technician checklist
- Check if all items below are complete:
  - The change plan is technically accurate.
  - This Change Issue is linked to the appropriate Issue and/or Epic.
  - Change has been tested in staging and results noted in a comment on this issue.
  - A dry-run has been conducted and results noted in a comment on this issue.
  - The change execution window respects the Production Change Lock periods.
  - For C1 and C2 change issues, the change event is added to the GitLab Production calendar.
  - For C1 and C2 change issues, the SRE on-call has been informed prior to the change being rolled out. (In the #production channel, mention @sre-oncall and this issue and await their acknowledgement.)
  - For C1 and C2 change issues, the SRE on-call provided approval with the ~eoc_approved label on the issue.
  - For C1 and C2 change issues, the Infrastructure Manager provided approval with the ~manager_approved label on the issue.
  - Release managers have been informed prior to any C1, C2, or ~"blocks deployments" change being rolled out. (In the #production channel, mention @release-managers and this issue and await their acknowledgment.)
  - There are currently no active incidents that are ~severity::1 or ~severity::2.
  - If the change involves doing maintenance on a database host, an appropriate silence targeting the host(s) should be added for the duration of the change.