Implement a reassign process that does not use placeholder references
About
The new tables we introduced for placeholder references have quickly become among our largest.
See https://gitlab-com.gitlab.io/gl-infra/platform/stage-groups-index/import-and-integrate.html.
At time of writing (27 Feb 2025), sizes (across primary and all replicas and including all indexes):
- import_source_user_placeholder_references is already 135.72 GB
- import_placeholder_memberships including indexes is 6.86 GB
Problem
Those tables look like they are going to keep growing reasonably quickly.
The new data retention guidelines are asking us to consider data retention lifecycles (essentially, can we delete, or move outside of PG, any of our data).
Part of the context for this initiative is the risk that the size of GitLab's data poses to its availability:
- https://gitlab.com/gitlab-com/gl-security/product-security/data-security/data-security-team/-/issues/60
- https://gitlab.com/groups/gitlab-org/-/epics/16520
Proposal
Technically, the placeholder references are only needed when the Import User is used (see #443556 (comment 1858315704)). See proof of concept.
So, in this issue, we will:
- Create a new reassign process that does not use placeholder references
- Apply a throttling mechanism similar to https://gitlab.com/gitlab-org/gitlab/-/issues/493977
- Test reassignment thoroughly on a 3k instance
- Roll out the feature flag
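As a rough illustration of the first two bullets, the sketch below reassigns contributions by updating the user column directly, in small batches, and pauses while a health check reports the database as overloaded. Everything in it is an assumption made for illustration: the `notes` table, the `author_id` column, the `db_is_healthy` callback and the batch size are hypothetical, it runs against an in-memory SQLite database rather than PostgreSQL, and the throttling approach in the linked issue may well differ.

```python
import sqlite3
import time


def reassign_in_batches(conn, table, column, placeholder_id, real_user_id,
                        db_is_healthy, batch_size=2, pause_seconds=0.01,
                        max_retries=5):
    """Point rows owned by a placeholder user at the real user, batch by batch.

    No placeholder-reference rows are consulted: the rows to update are found
    by querying the contribution table itself. Table and column names are
    hypothetical and must come from a trusted allow-list, because they are
    interpolated into the SQL text.
    """
    total = 0
    retries = 0
    while True:
        # Throttle: back off while the (hypothetical) health check fails.
        if not db_is_healthy():
            retries += 1
            if retries > max_retries:
                raise RuntimeError("database unhealthy, giving up for now")
            time.sleep(pause_seconds)
            continue
        retries = 0

        # Pick the next batch of rows still owned by the placeholder user.
        ids = [row[0] for row in conn.execute(
            f"SELECT id FROM {table} WHERE {column} = ? ORDER BY id LIMIT ?",
            (placeholder_id, batch_size))]
        if not ids:
            return total

        marks = ",".join("?" * len(ids))
        conn.execute(
            f"UPDATE {table} SET {column} = ? WHERE id IN ({marks})",
            (real_user_id, *ids))
        conn.commit()
        total += len(ids)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE notes (id INTEGER PRIMARY KEY, author_id INTEGER)")
conn.executemany("INSERT INTO notes (author_id) VALUES (?)",
                 [(100,), (100,), (7,), (100,), (100,), (100,)])

# Simulated health check: reports "unhealthy" once, then healthy again.
health = iter([True, False, True, True, True, True, True, True])
reassigned = reassign_in_batches(conn, "notes", "author_id", 100, 42,
                                 lambda: next(health, True))
```

Without placeholder references, the lookup in each batch filters on the user column alone, so its cost depends on whether that column is indexed in each table being reassigned.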
Follow-up issues will:
- Update importers to not create placeholder refe... (#575651)
- Delete placeholder references that are no longe... (#575649)
Continue to enforce boundaries of top-level namespace
We would need to continue treating the top-level namespace as the boundary, so that if objects are moved outside it we don't re-map their users. See #472725 and how it is a security requirement: https://gitlab.com/gitlab-com/gl-security/product-security/appsec/appsec-team/-/issues/845#note_2136599804.
We would have to scope the query with the project_id or namespace_id.
Fortunately, at this point, due to how cells work, most tables have project_id and namespace_id.
TODO: Verify that all tables in Import::PlaceholderReferences::AliasResolver have sharding keys. Otherwise, we might need to wait for sharding keys to be added to certain tables before we can do this piece.
For tables associated with only project_id and not namespace_id, we would need an efficient way to get all project IDs of the top-level namespace, including for large and deeply nested top-level groups. TODO: The performance aspect of this needs to be investigated.
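To make the project ID lookup concrete, here is a minimal sketch that collects every project under a top-level namespace with a recursive query. The `namespaces` and `projects` tables and their `parent_id` and `namespace_id` columns are a simplified, hypothetical schema on an in-memory SQLite database, not the real GitLab schema, and the sketch says nothing about how such a lookup performs on large, deeply nested hierarchies.

```python
import sqlite3

# Walk the namespace tree downwards from the root, then join to projects.
DESCENDANT_PROJECTS_SQL = """
WITH RECURSIVE descendants(id) AS (
    SELECT id FROM namespaces WHERE id = :root_id
    UNION ALL
    SELECT n.id FROM namespaces n
    JOIN descendants d ON n.parent_id = d.id
)
SELECT p.id FROM projects p
JOIN descendants d ON p.namespace_id = d.id
ORDER BY p.id
"""


def project_ids_under(conn, root_namespace_id):
    """Return the IDs of all projects nested anywhere under the namespace."""
    rows = conn.execute(DESCENDANT_PROJECTS_SQL, {"root_id": root_namespace_id})
    return [row[0] for row in rows]


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE namespaces (id INTEGER PRIMARY KEY, parent_id INTEGER);
CREATE TABLE projects (id INTEGER PRIMARY KEY, namespace_id INTEGER);
-- Namespace 1 is a top-level group with subgroup 2 and nested subgroup 3.
-- Namespace 4 is a separate top-level group.
INSERT INTO namespaces VALUES (1, NULL), (2, 1), (3, 2), (4, NULL);
INSERT INTO projects VALUES (10, 1), (11, 3), (12, 4);
""")

ids_in_scope = project_ids_under(conn, 1)  # projects 10 and 11, not 12
```

A scoped reassignment would then restrict its update to rows whose project_id is in this list.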
After developing a proof of concept, it became apparent that enforcing the boundaries is not feasible. The primary issue is that additional database indexes would be necessary for the reassignment process to work efficiently. For example, we would need an index on the user column and the sharding key.
The boundaries will therefore apply only to memberships: a membership won't be created if the group or project has been moved outside the top-level namespace.
How much data could this reduce?
As of 27 February 2025, some data for import_source_user_placeholder_references:
- 65,441,193 records
- 1,410,611 records associated with an import user
- 4,576,253 records associated with no user (reassignment has happened) - this is odd - we shouldn't have this many records.
Discounting the roughly 4.6 million records associated with no user, a rough, hand-wavy guess is that the size of the table could be reduced by ~43x.
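The arithmetic behind that estimate can be reproduced from the figures above. This assumes that only the records associated with an import user would still need to be kept, and that table size scales roughly with record count; both are simplifications.

```python
total_records = 65_441_193
import_user_records = 1_410_611
no_user_records = 4_576_253

# Leave out the unexpected "no user" records, then compare what exists today
# with what would remain if only import-user records were kept.
records_considered = total_records - no_user_records
reduction_factor = records_considered / import_user_records

print(f"{records_considered:,} / {import_user_records:,} = {reduction_factor:.1f}x")
# prints "60,864,940 / 1,410,611 = 43.1x"
```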