Geo: Avoid orphaned data by tracking data paths
The following discussion from !3595 (merged) should be addressed:
- @nick.thomas started a discussion: We don't clean up files associated with the project here :( that's a bug - we should have a follow-up issue for it.
A Geo secondary that uses selective sync only syncs a subset of all groups. The repository and any associated files for projects in those groups are synchronized, and projects outside of those groups are left untouched.
When the selective sync list is changed, the primary sends an event to notify the secondary. In response, the secondary removes the repositories of any projects that are no longer in the list. However, it doesn't clean up their associated files.
Not all files are associated with a project, so this is a fairly tricky thing to solve. A suggestion I made some time ago was to add an optional `project_id` to the `Geo::FileRegistry` model, which could be filled in at synchronization time. Doing this would simplify the implementation of this issue significantly, and would also allow us to simplify some other queries related to file synchronization.
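To illustrate the idea, here is a rough sketch in plain Ruby (not ActiveRecord) of how an optional `project_id` could be used to find file registry entries that fall out of scope. `FileRegistryEntry` and `entries_to_clean_up` are hypothetical names made up for this example, and the in-memory struct only stands in for a real registry row.

```ruby
require 'set'

# Hypothetical in-memory stand-in for a file registry row.
# project_id is optional: nil means the file is not tied to any project.
FileRegistryEntry = Struct.new(:file_id, :file_type, :project_id, keyword_init: true)

# Returns the entries whose project is no longer in the selective sync scope.
# Entries without a project_id are never selected, because they are not
# project-scoped and need a different rule.
def entries_to_clean_up(entries, synced_project_ids)
  entries.select do |entry|
    !entry.project_id.nil? && !synced_project_ids.include?(entry.project_id)
  end
end

entries = [
  FileRegistryEntry.new(file_id: 1, file_type: 'attachment', project_id: 10),
  FileRegistryEntry.new(file_id: 2, file_type: 'attachment', project_id: 20),
  FileRegistryEntry.new(file_id: 3, file_type: 'avatar', project_id: nil)
]

# Project 20 was dropped from the selective sync list.
stale = entries_to_clean_up(entries, Set[10])
puts stale.map(&:file_id).inspect
```

With the project recorded at synchronization time, the cleanup becomes a single filter on the registry instead of a lookup through each file type's own associations.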
I also don't know how LFS objects behave in this situation at the moment. If an LFS object is referenced by two projects, but only one of them is in the selective sync scope, then we should keep the object. It should only be removed if none of the projects are in the selective sync scope. I suspect that in actuality, the selective sync list is ignored here.
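The retention rule described above could be sketched as follows. This is only an illustration of the intended behaviour, not a description of what Geo does today: `removable_lfs_object_ids` is a hypothetical helper, and the hash of LFS object IDs to referencing project IDs is an assumed input shape.

```ruby
require 'set'

# lfs_links: hypothetical hash of LFS object ID => array of the IDs of
# projects that reference the object.
# An object is removable only if none of its projects are in the selective
# sync scope. An object with no referencing projects is also treated as
# removable here.
def removable_lfs_object_ids(lfs_links, synced_project_ids)
  lfs_links
    .reject { |_oid, project_ids| project_ids.any? { |id| synced_project_ids.include?(id) } }
    .keys
end

lfs_links = {
  'oid-a' => [1, 2], # shared; project 1 is still synced, so keep it
  'oid-b' => [2, 3], # no referencing project is synced, so remove it
  'oid-c' => [1]     # only referenced by a synced project, so keep it
}

puts removable_lfs_object_ids(lfs_links, Set[1]).inspect
```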
Currently
- On deletion, the primary Geo site creates a Geo event with the old path.
- The secondaries use the path in the event to delete the resource.
- If an event is lost, or if selective sync changes, then `RegistryConsistencyWorker` will end up deleting registry records. In that case, the resource(s) will remain, wasting storage.
Proposal
Add path/location field(s) to secondary Geo site registry tables.
This way, moves and deletions can be handled more easily and without storing critical data in Redis jobs.
The original ask in this issue would be satisfied since `RegistryConsistencyWorker` already deletes registries of resources that are not associated with a selected project.
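As a rough sketch of the proposal, assuming each registry record stores the path of the data it tracks, registry deletion and data deletion could happen together. `RegistryRecord` and `prune_registry` are hypothetical names, and the code uses plain Ruby structs and temporary files rather than the real registry tables or `RegistryConsistencyWorker`.

```ruby
require 'fileutils'
require 'set'
require 'tmpdir'

# Hypothetical registry row that also records where the data lives on disk.
RegistryRecord = Struct.new(:resource_id, :project_id, :path, keyword_init: true)

# Drops registry records for projects that are not selected and removes the
# data at each dropped record's stored path. Returns the records that remain.
# A real implementation would also need to validate that the path lies inside
# the expected storage root before deleting anything.
def prune_registry(records, selected_project_ids)
  keep, drop = records.partition { |r| selected_project_ids.include?(r.project_id) }
  drop.each { |r| FileUtils.rm_rf(r.path) }
  keep
end

Dir.mktmpdir do |root|
  kept_path = File.join(root, 'project-1.dat')
  orphan_path = File.join(root, 'project-2.dat')
  [kept_path, orphan_path].each { |path| File.write(path, 'data') }

  records = [
    RegistryRecord.new(resource_id: 1, project_id: 1, path: kept_path),
    RegistryRecord.new(resource_id: 2, project_id: 2, path: orphan_path)
  ]

  # Project 2 is no longer selected, so its record and its data are removed.
  remaining = prune_registry(records, Set[1])
  puts remaining.map(&:resource_id).inspect
  puts File.exist?(orphan_path)
end
```

Because the path comes from the registry record rather than from an event, a lost event no longer leaves data behind.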