Delayed deletion of groups by default, to avoid catastrophes

Problem to solve

My current understanding is that deleting a group removes its database metadata immediately and enqueues a job to remove the associated Git repositories from storage.

A single command can delete a deeply nested hierarchy of groups and projects, which is a very risky state of affairs. If an application bug causes groups to be deleted unexpectedly (e.g. !40353), the current deletion procedure turns that bug into a disaster for the affected customer (e.g. https://gitlab.com/gitlab-com/gl-infra/infrastructure/-/issues/11148). One such incident is currently causing the GitLab.com infrastructure team a lot of pain.

Proposal

I propose that deleting a group only set a "deleted" flag in the database. Resources flagged this way appear deleted to the customer, but Support or admins can clear the flag within some time period if the deletion was accidental. A regular cleanup job then deletes the metadata and associated Git data of groups that have been flagged for longer than that period.
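To make the proposed flow concrete, here is a minimal in-memory sketch in Python, for illustration only. All names (`Group`, `GroupStore`, `soft_delete`, `restore`, `purge_expired`, `retention`) are hypothetical and are not GitLab's models or API. Storing a `deleted_at` timestamp instead of a plain boolean is an assumption: it lets the cleanup job tell how long a group has been flagged. Flagging only the top-level group hides its whole subtree, so undoing an accidental deletion of a nested hierarchy means clearing a single flag.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Dict, List, Optional


@dataclass
class Group:
    # Hypothetical stand-in for a group record; deleted_at acts as the "deleted" flag.
    id: int
    name: str
    parent_id: Optional[int] = None
    deleted_at: Optional[datetime] = None


class GroupStore:
    """In-memory sketch of soft deletion with a restore window (hypothetical)."""

    def __init__(self, retention: timedelta) -> None:
        self.retention = retention  # how long a flagged group stays restorable
        self.groups: Dict[int, Group] = {}

    def add(self, group: Group) -> None:
        self.groups[group.id] = group

    def _hidden(self, group: Group) -> bool:
        # A group appears deleted if it or any of its ancestors carries the flag.
        current: Optional[Group] = group
        while current is not None:
            if current.deleted_at is not None:
                return True
            parent = current.parent_id
            current = self.groups.get(parent) if parent is not None else None
        return False

    def visible_names(self) -> List[str]:
        # What the customer sees: only groups not hidden by a flag.
        return sorted(g.name for g in self.groups.values() if not self._hidden(g))

    def soft_delete(self, group_id: int, now: datetime) -> None:
        # Deleting only sets the flag; metadata and Git data are left in place.
        self.groups[group_id].deleted_at = now

    def restore(self, group_id: int, now: datetime) -> bool:
        # Support/admins clear the flag, but only while the retention window is open.
        group = self.groups.get(group_id)
        if group is None or group.deleted_at is None:
            return False
        if now - group.deleted_at >= self.retention:
            return False
        group.deleted_at = None
        return True

    def _subtree(self, root_id: int) -> List[int]:
        # Collect root_id plus every descendant that still exists.
        result: List[int] = []
        stack = [root_id]
        while stack:
            gid = stack.pop()
            result.append(gid)
            stack.extend(g.id for g in self.groups.values() if g.parent_id == gid)
        return result

    def purge_expired(self, now: datetime) -> List[int]:
        # Regular cleanup job: hard-delete subtrees flagged for at least `retention`.
        expired = [
            g.id for g in self.groups.values()
            if g.deleted_at is not None and now - g.deleted_at >= self.retention
        ]
        purged: List[int] = []
        for root_id in expired:
            if root_id not in self.groups:
                continue  # already removed as part of an expired ancestor
            for gid in self._subtree(root_id):
                # A real cleanup would also remove the Git repositories here.
                del self.groups[gid]
                purged.append(gid)
        return sorted(purged)


if __name__ == "__main__":
    t0 = datetime(2021, 2, 1, tzinfo=timezone.utc)
    store = GroupStore(retention=timedelta(days=7))
    store.add(Group(1, "parent"))
    store.add(Group(2, "child", parent_id=1))
    store.add(Group(3, "other"))

    store.soft_delete(1, t0)                          # hides parent and child
    print(store.visible_names())                      # ['other']
    print(store.restore(1, t0 + timedelta(days=2)))   # True
    print(store.visible_names())                      # ['child', 'other', 'parent']
    store.soft_delete(1, t0 + timedelta(days=3))
    print(store.purge_expired(t0 + timedelta(days=11)))  # [1, 2]
```

The sketch only ever hard-deletes inside `purge_expired`, so a bug that triggers `soft_delete` unexpectedly is recoverable until the retention window closes. A real system would need the same "hidden if any ancestor is flagged" rule applied consistently to every query path, which this sketch does not attempt to address.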

Some tiers have a configurable waiting period (see ApplicationSettings#deletion_adjourned_period and #223013 (comment 392443712)), but this issue is about delayed deletion by default.

This is blocked by the implementation of delayed deletion of projects by default.

Edited Feb 05, 2021 by Dan Jensen