Improve how we deal with projects that are pending_delete but whose removal job failed
Problem to solve
We have a documented procedure in: https://gitlab.com/gitlab-org/gitlab/blob/a67ad6249dc784f328ce23d77bd7ae1e8ebe57b5/doc/administration/troubleshooting/gitlab_rails_cheat_sheet.md#L193-193
The main problem is that these failures have no visibility. With legacy storage they did not go unnoticed, because we would be creating a name conflict on disk, which could lead to people pinging support. With Hashed Storage, this is no longer the case.
During the Hashed Storage migration in https://gitlab.com/gitlab-com/gl-infra/production/issues/935 we found that there are still 163 projects marked as pending_delete whose removal failed and was then forgotten.
Intended users
- Internal persona
Further details
Proposal
We can do a few things here.
- Expose projects whose removal is in a stale state in a rake task (we could check a combination of pending_delete and updated_at against a defined threshold).
- Create a cronjob to retry the stale ones from time to time
- Add the stale removal to the system_checks. A check should fail when there are pending_delete projects that are past our threshold.
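To make the staleness rule concrete, here is a minimal sketch of the check described in the list above, written in Python purely for illustration. It is not GitLab code: the `Project` dataclass, both function names, and the shape of the check result are hypothetical, and the threshold value is left to the caller because the proposal does not define one. The only assumption taken from the proposal is that a removal counts as stale when `pending_delete` is set and `updated_at` is older than the threshold.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class Project:
    # Hypothetical stand-in for a project record; only the two fields
    # named in the proposal (pending_delete, updated_at) are modelled.
    id: int
    pending_delete: bool
    updated_at: datetime


def stale_pending_deletes(projects, threshold, now=None):
    """Return projects flagged pending_delete whose updated_at is older than now - threshold."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - threshold
    return [p for p in projects if p.pending_delete and p.updated_at < cutoff]


def stale_removal_check(projects, threshold, now=None):
    """Return (passed, stale_ids); the check passes only when nothing is stale."""
    stale = stale_pending_deletes(projects, threshold, now)
    return (not stale, [p.id for p in stale])
```

The same `stale_pending_deletes` result could feed all three ideas: the rake task would print it, the periodic job would retry removal for each entry, and the system check would fail whenever it is non-empty.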
Permissions and Security
System access (terminal)
Documentation
Availability & Testing
What does success look like, and how can we measure that?
- We have no pending_delete projects that are stale (and no bug preventing their removal)
- We have system checks telling us when a project should have been deleted already but has not been
What is the type of buyer?
Links / references