Refresh staging environment with production data volume
The staging environment was forked from production more than 1.5 years ago (I don't know exactly when; it might date back even earlier). This means that the data volume in staging is not nearly as large as in production:
For example, for PostgreSQL:
- staging is at 0.6 TB whereas
- production is at 3.4 TB of total database size.
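For reference, these figures can be reproduced with a quick size query against each database. The sketch below is a minimal example only, assuming direct database access via psycopg2; the connection strings are placeholders, not the real DSNs.

```python
# Minimal sketch for checking total database size per environment.
# The DSNs below are placeholders; substitute real connection details.
import psycopg2

DSNS = {
    "staging": "host=staging-db.example.internal dbname=gitlabhq_production",
    "production": "host=prod-db.example.internal dbname=gitlabhq_production",
}

for env, dsn in DSNS.items():
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            # pg_database_size() returns the on-disk size of the current database in bytes.
            cur.execute("SELECT pg_size_pretty(pg_database_size(current_database()))")
            (size,) = cur.fetchone()
            print(f"{env}: {size}")
```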
This means that any testing on staging does not reflect reality in terms of database size. This can lead to issues with data migrations, database performance testing, etc. A recent example was Elasticsearch indexing, which succeeded in staging but failed in production due to size (cf. https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/7090#note_187121907).
The proposal here is to update staging once with production data and scrub sensitive information from it (just like we did when staging was created/refreshed the last time); a rough sketch of such a scrubbing step follows below.
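For illustration only, the scrubbing step could look roughly like the sketch below. It assumes direct SQL access to the freshly restored staging database; the table and column names and the statements themselves are hypothetical examples, not the actual scrubbing rules used last time.

```python
# Minimal sketch of a post-restore scrubbing pass on the staging database.
# Statements and the DSN are illustrative placeholders.
import psycopg2

SCRUB_STATEMENTS = [
    # Replace real e-mail addresses with deterministic placeholder values.
    "UPDATE users SET email = 'user' || id || '@example.com'",
    # Invalidate stored credentials and tokens so staging cannot act on behalf of real users.
    "UPDATE users SET encrypted_password = '', reset_password_token = NULL",
]

def scrub(dsn: str) -> None:
    with psycopg2.connect(dsn) as conn:
        with conn.cursor() as cur:
            for statement in SCRUB_STATEMENTS:
                cur.execute(statement)
        # Leaving the connection context manager commits the transaction.

if __name__ == "__main__":
    scrub("host=staging-db.example.internal dbname=gitlabhq_production")  # placeholder DSN
```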
For the longer term, we may want to find ways to automate this process (see Related issues).