Set up DB replica for analytics pulls into EDW
Organization:
- In this issue, we want to set up access to the archive replica and configure it so it can be used by the ELT loads.
- In a related issue https://gitlab.com/gitlab-com/gl-infra/infrastructure/issues/5970, we are going to provide network-level access to said replica for the ELT runner jobs (with appropriate firewall rules).
Discussion summary:
- The ELT load will only need read-only permissions (hence a replica is fine).
- No strict SLA guarantees regarding uptime/availability/performance.
- Queries from the ELT load are expected to finish within 5 minutes. Statement timeouts will be enforced accordingly.
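The read-only access and 5-minute budget above could be enforced at the role level on the restored replica. A minimal sketch, assuming Postgres and hypothetical role/database names (`elt_reader`, `gitlabhq_production`), run against the restored instance rather than the production primary:

```sql
-- Sketch only: role, database, and schema names are assumptions.
CREATE ROLE elt_reader WITH LOGIN PASSWORD '<redacted>';
GRANT CONNECT ON DATABASE gitlabhq_production TO elt_reader;
GRANT USAGE ON SCHEMA public TO elt_reader;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO elt_reader;

-- Enforce the ~5-minute query budget discussed above:
ALTER ROLE elt_reader SET statement_timeout = '5min';
```

Since the role gets only `SELECT`, no strict isolation beyond the replica itself should be needed; the statement timeout keeps a runaway ELT query from holding resources.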
Our current method for getting updated production data into our warehouse is as follows:
- Update the pseudonymizer config if possible: https://gitlab.com/gitlab-org/gitlab-ee/blob/master/config/pseudonymizer.yml
- Ping someone on the production team to run the pseudonymizer (usually Stan)
- Hope that nothing fails or bombs out while writing CSVs to GCS
- Import generated CSVs from object storage into Snowflake
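For reference, the final import step typically looks something like the following in Snowflake. This is a sketch only; the stage, table, and path names (`pseudonymizer_gcs_stage`, `raw.gitlab_dotcom.users`) are assumptions, not our actual object names:

```sql
-- Sketch only: stage, schema, table, and path names are hypothetical.
COPY INTO raw.gitlab_dotcom.users
  FROM @raw.gitlab_dotcom.pseudonymizer_gcs_stage/users/
  FILE_FORMAT = (TYPE = CSV SKIP_HEADER = 1 FIELD_OPTIONALLY_ENCLOSED_BY = '"')
  ON_ERROR = 'ABORT_STATEMENT';
```

With a replica we could pull directly instead of staging CSVs, which removes the "hope nothing bombs out" step entirely.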
This presents a number of challenges, not least the time it takes from a member of the production team every time we want updated data.
Reading through https://about.gitlab.com/handbook/engineering/infrastructure/database/disaster_recovery.html and https://gitlab.com/gitlab-restore/postgres-gprd, it seems we should be able to set up an isolated replica that is refreshed on a regular basis with production data. (We already do this regularly for the version, customer, and license dbs: https://gitlab.com/gitlab-restore/version.gitlab.com#vpc-setup.) We don't need up-to-the-minute updates; a delay of 1-2 weeks is fine.
If we follow the same model, where only our project's runner is whitelisted for access, I think it will simplify the security footprint and reduce risk. We'd still have an explicit list of fields we'd be pulling, which security would approve for updates and for any increased data pulls.
My timing request is that something be stood up by the end of January; we have some OKRs related to modeling the dotcom data.
@stanhu does this seem reasonable? I know it'd save you some time.
@dawsmith thoughts on this? Is this something @yguo could setup?
@jeremy @tszuromi just FYI - I'm trying to close the time between updates for production data.
cc @tlapiana