Make it possible to extract "production-like" data for a staging environment
In https://gitlab.com/gitlab-org/gitlab-ee/issues/11334, @ashmckenzie is proposing a way to bootstrap a Geo cluster on demand for testing purposes. The main issue we have today is the lack of "real data" to test the environment in a more realistic way.
Past attempts relied on redacting private data wherever possible, but that is an incomplete solution: any change added after the redaction code was written invariably requires a matching change in the redaction code.
We can look at this from a different perspective:
The main problem with production data is handling non-public information. If we consider that anything that is public is also "crawlable" right now, we could select a slice of public data only and just randomize the access restrictions on it, which would simulate "real data" without exposing private information.
Some redaction would still be needed on the user model, but that is a single table, so it's easier to handle.
As an example, get 5000 public projects, and while importing them make some of them private, some internal, etc.
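A minimal sketch of that randomization step, assuming the imported projects are available as plain dicts. The `randomize_visibility` helper, the weights, and the dict shape are hypothetical; the numeric levels are GitLab's private/internal/public values. A real import would presumably also need to keep a project's level consistent with its parent group's.

```python
import random

# GitLab's visibility level values: private, internal, public.
PRIVATE, INTERNAL, PUBLIC = 0, 10, 20


def randomize_visibility(projects, weights=(0.3, 0.2, 0.5), seed=None):
    """Return copies of the project dicts with a randomly assigned visibility_level.

    The weights (private, internal, public) are an arbitrary assumption.
    """
    rng = random.Random(seed)
    levels = (PRIVATE, INTERNAL, PUBLIC)
    return [
        {**project, "visibility_level": rng.choices(levels, weights=weights)[0]}
        for project in projects
    ]


# Hypothetical input: 5000 projects that were all public when extracted.
projects = [{"id": i, "visibility_level": PUBLIC} for i in range(1, 5001)]
randomized = randomize_visibility(projects, seed=42)
```

Passing a fixed `seed` keeps the resulting dataset reproducible between runs, which may help when comparing test results across environments.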
The data extraction is probably the hardest part, but here is one idea:
- build some database views that restrict the visibility of the data to only what is public (because visibility in our data model depends on just a few tables, it's actually not hard to limit this)
- limit the amount of exposed content with the help of these views (limiting to a range of IDs, for example)
- recreate every other table that relates to one of these by limiting the data that "intersects" with our limited projects/groups/namespaces.
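The three steps above could look roughly like this. It is only an illustration: it uses an in-memory SQLite database instead of a PostgreSQL replica, a heavily simplified two-table schema, and hypothetical `export_*` view names.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Simplified stand-ins for the real tables (assumption, not the actual schema).
CREATE TABLE projects (id INTEGER PRIMARY KEY, name TEXT, visibility_level INTEGER);
CREATE TABLE issues (id INTEGER PRIMARY KEY, project_id INTEGER, title TEXT);

INSERT INTO projects VALUES (1, 'public-a', 20), (2, 'private-b', 0),
                            (3, 'public-c', 20), (4, 'internal-d', 10);
INSERT INTO issues VALUES (1, 1, 'on public-a'), (2, 2, 'on private-b'),
                          (3, 3, 'on public-c'), (4, 4, 'on internal-d');

-- Steps 1 and 2: only public projects (level 20), capped to a range of IDs.
CREATE VIEW export_projects AS
  SELECT * FROM projects
  WHERE visibility_level = 20 AND id BETWEEN 1 AND 2;

-- Step 3: a related table, limited to rows that intersect the projects above.
CREATE VIEW export_issues AS
  SELECT issues.* FROM issues
  JOIN export_projects ON export_projects.id = issues.project_id;
""")

exported_projects = conn.execute("SELECT id FROM export_projects ORDER BY id").fetchall()
exported_issues = conn.execute("SELECT id FROM export_issues ORDER BY id").fetchall()
```

Because `export_issues` is defined on top of `export_projects`, tightening the project filter automatically narrows every dependent view.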
Writing the views by hand is tedious, but as there is a clear pattern here, we can probably build a template and automate it. We can also probably build them as materialized views and "automate" the whole process:
- Use a replica
- Run the views
- Export data from the views only (which means we will be exporting only "public" data)
- Reimport into a new instance and randomize the access restrictions afterwards.
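As a rough sketch of the template idea, the view definitions could be generated from a table-to-parent mapping. The `RELATED_TABLES` mapping, the `export_` prefix, and the `build_view_sql` helper are all hypothetical, and the sketch assumes an `export_projects` view already exists; the generated statements use PostgreSQL's `CREATE MATERIALIZED VIEW` syntax.

```python
from string import Template

# One statement per related table, joined against its parent's export view.
VIEW_TEMPLATE = Template(
    "CREATE MATERIALIZED VIEW export_$table AS\n"
    "SELECT t.* FROM $table t\n"
    "JOIN export_$parent p ON p.id = t.$foreign_key;"
)

# Hypothetical mapping: table -> (parent table, foreign key column).
RELATED_TABLES = {
    "issues": ("projects", "project_id"),
    "merge_requests": ("projects", "target_project_id"),
}


def build_view_sql(related_tables):
    """Render one CREATE MATERIALIZED VIEW statement per related table."""
    return [
        VIEW_TEMPLATE.substitute(table=table, parent=parent, foreign_key=fk)
        for table, (parent, fk) in related_tables.items()
    ]


statements = build_view_sql(RELATED_TABLES)
print("\n\n".join(statements))
```

In practice the mapping would probably be derived from the schema's foreign keys rather than maintained by hand, and tables with extra privacy flags would need additional conditions that this template does not cover.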