GitLab.com ELT
We need to get data from GitLab.com into a data warehouse which we can use for analytics purposes. There are two main types of data:
- Sample "lived in" groups/projects to use for developing features like cycle analytics
- Aggregate metrics on product usage across a wide set of customers
To achieve these goals, we plan to do the following.
Proposal
Build this as a rake task within GitLab, so customers can benefit as well.
- Copy all public projects in the gitlab-org group into the BizOps DW, except confidential issues. User IDs, names, etc. are all included. We will refresh this daily, so any GDPR right-to-be-forgotten changes in prod should simply be reflected here. Ignore the users table, to avoid issues with PATs and other content.
- Create a process for reviewing, approving, and running aggregated metrics on the unaltered backup of GitLab.com. For example, a PM has a query they want to add to our ELT process, so they submit it in the form of an MR. It gets reviewed by the necessary people, merged, and is then executed in the next daily run. The results are copied back over to the DW, along with the data above. If we get the cycle time on this to something reasonable like 24 hours, it should be workable as an MVC.
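The query review flow in the second item could look roughly like the sketch below. It assumes that merged queries live as one .sql file each in a directory of the repository, and that each file holds a single SELECT statement. The function name, directory layout, and output naming are hypothetical, and sqlite3 only stands in for the real database connection.

```python
import csv
import sqlite3
from pathlib import Path


def run_approved_queries(conn, query_dir, out_dir):
    """Run every merged .sql file in query_dir and write one CSV per query.

    Assumption: each file contains exactly one SELECT statement that has
    already been reviewed and merged through the MR process.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for sql_file in sorted(Path(query_dir).glob("*.sql")):
        cursor = conn.execute(sql_file.read_text())
        header = [col[0] for col in cursor.description]
        # Hypothetical naming: the CSV is named after the query file.
        out_path = out_dir / f"{sql_file.stem}.csv"
        with out_path.open("w", newline="") as fh:
            writer = csv.writer(fh)
            writer.writerow(header)
            writer.writerows(cursor)
        written.append(out_path)
    return written
```

Only the aggregated result rows leave the production side in this arrangement, so the review step can focus on whether a query's output is safe to copy into the DW.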
In the future, as we get more comfortable and learn more about what we need, we can bring in additional projects or iterate on the anonymization.
Current proposal:
- The rake task runs in the production VPC and outputs CSV files to local disk.
- A script then uploads those CSV files to object storage.
- A BizOps CI job then picks up the latest CSV file from object storage and copies it to local disk.
- The BizOps CI job then imports the CSV file into the data warehouse.
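The four steps above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the actual rake task: a local directory stands in for object storage, sqlite3 stands in for the data warehouse, and the column list, file naming scheme, and function names are all hypothetical.

```python
import csv
import shutil
import sqlite3
from pathlib import Path

COLUMNS = ["id", "project_id", "author_id", "title"]  # hypothetical column subset


def export_issues(rows, export_dir, stamp):
    """Step 1: write non-confidential issues to a CSV on local disk."""
    export_dir = Path(export_dir)
    export_dir.mkdir(parents=True, exist_ok=True)
    path = export_dir / f"issues-{stamp}.csv"
    with path.open("w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=COLUMNS, extrasaction="ignore")
        writer.writeheader()
        for row in rows:
            if not row.get("confidential"):
                writer.writerow(row)
    return path


def upload(path, bucket_dir):
    """Step 2: push the CSV to object storage (a directory stands in here)."""
    bucket_dir = Path(bucket_dir)
    bucket_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy(path, bucket_dir))


def fetch_latest(bucket_dir, work_dir):
    """Step 3: copy the newest export to the CI job's local disk."""
    # ISO-style date stamps sort lexicographically, so max() picks the newest.
    latest = max(Path(bucket_dir).glob("issues-*.csv"), key=lambda p: p.name)
    work_dir = Path(work_dir)
    work_dir.mkdir(parents=True, exist_ok=True)
    return Path(shutil.copy(latest, work_dir))


def import_csv(path, dw):
    """Step 4: replace the warehouse table with the CSV contents."""
    with Path(path).open(newline="") as fh:
        rows = [tuple(r[c] for c in COLUMNS) for r in csv.DictReader(fh)]
    dw.execute("DROP TABLE IF EXISTS issues")
    dw.execute(f"CREATE TABLE issues ({', '.join(COLUMNS)})")
    dw.executemany(
        f"INSERT INTO issues VALUES ({', '.join('?' * len(COLUMNS))})", rows
    )
    dw.commit()
    return len(rows)
```

Replacing the whole table on each import, rather than appending, is one way to make sure rows deleted in production also disappear from the warehouse on the next daily refresh.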
This has a few benefits:
- No access is required from the BizOps VPC to the Production VPC. It's a push from higher privilege to lower privilege.
- Filtered data is coming over the fence, which should reduce the compliance risk/burden. (Not that we shouldn't be careful.)
Current schema: https://gitlab.com/gitlab-org/gitlab-ee/blob/50feeffa41a3340b3044699fba28b96718adfc2d/lib/assets/pseudonymity_dump.yml (per @jschatz1, this is already out of date).
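As an illustration of how a schema file like this could drive the export, the sketch below keeps only whitelisted columns per table and replaces selected columns with a keyed hash. The structure of SCHEMA, the "whitelist" and "pseudo" key names, and the choice of HMAC-SHA256 are assumptions made for this example and are not taken from the linked file.

```python
import hashlib
import hmac

# Hypothetical schema: table -> exported columns and columns to pseudonymize.
SCHEMA = {
    "issues": {
        "whitelist": ["id", "project_id", "author_id"],
        "pseudo": ["author_id"],
    },
}


def pseudonymize(value, key):
    """Replace a value with a keyed SHA-256 digest so joins still line up."""
    return hmac.new(key, str(value).encode(), hashlib.sha256).hexdigest()


def apply_schema(table, row, key, schema=SCHEMA):
    """Keep only whitelisted columns and hash the ones marked as pseudo."""
    spec = schema[table]
    return {
        col: pseudonymize(row[col], key) if col in spec["pseudo"] else row[col]
        for col in spec["whitelist"]
    }
```

Because the same key always maps the same input to the same digest, a hashed author_id can still be joined across tables without exposing the original value.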