De-Identify PII in the Data Warehouse (outdated)
This proposal is outdated because we changed the technical solution. The new epic can be found here: https://gitlab.com/groups/gitlab-org/-/epics/6309. To avoid confusion, we closed this epic, as the comments and concerns were targeted at the previous solution. We will still gather internal and community feedback.

<details><summary>Previous Proposal</summary>

This epic tracks our work for de-identifying PII data.

### Problem Statement

To respect the privacy of our users while leveraging unique user IDs to support product and internal use cases, we need to build a process and system that de-identifies PII data before it can be used to improve our products and run GitLab more efficiently.

### Proposed Solution

To effectively de-identify PII data we need to implement the following:

1. Identified data will be loaded into a "secure holding area" in the data warehouse, unlinked from the rest of our product analytics data. ~data
1. This data will only be accessible by a select set of SREs and Support Admins for the sole purpose of keeping the SaaS system running and managing operations. Permissions will be managed with the established Permifrost permissioning process for Snowflake and the GitLab Access Authorization process.
1. Identified data will then be de-identified and copied to a general-purpose storage area where it will be available for broader access and established use cases.
1. The de-identification process will:
   - remove the user-id from the incoming data stream and replace it with an 'AnonId' (which is required for counting unique events)
   - create the 'AnonId' by sending the user-id and a secret SALT through a hash (proposed: [SHA-2/256](https://en.wikipedia.org/wiki/SHA-2))
   - ensure that only the 'AnonId' is available outside of the "secure holding area"
1. Begin adding `user_id` and `project_id` to all SaaS Snowplow events.

```mermaid
graph LR
  subgraph application[Application]
    application_data[Identified Data]
  end
  subgraph analytics_collectors[Analytics Collectors]
    application_data[Identified Data] --> analytics_collectors_data[Identified Data]
  end
  subgraph data_warehouse[Data Warehouse]
    subgraph secure_access[Secure Access Database]
      analytics_collectors_data[Identified Data] --> secure_access_data[Identified Data]
    end
    subgraph deidentification_process[De-Identification Process]
      secure_access_data[Identified Data] --> deidentification_process_data[De-Identified Data]
    end
    subgraph general_access[General Access Database]
      deidentification_process_data[De-Identified Data] --> general_access_data[De-Identified Data]
    end
  end
  subgraph analytics_ui[Analytics UI]
    general_access_data[De-Identified Data] --> analytics_ui_data[De-Identified Data]
  end
```

**Access Levels**

- Secure access: Site Reliability Engineers, Support Engineers
- General access: Everyone outside of "Secure Access" such as Product, Sales, UX, Engineering, etc.

## Next Steps

- [x] Align with e-group on our privacy policy and stance for data anonymization
- [x] Blog post
- [ ] Internal feedback
- [ ] Community feedback
- [ ] Identified Data will be loaded into a special "secure holding area" in the data warehouse. There, it will only be accessible by a select set of SREs and Support Admins for the sole purpose of keeping the SaaS system running and managing operations. (Done by Data Team)
- [ ] Identified Data will then be de-identified and copied to a general-purpose storage area where it will be available for broader access and established use cases (for PMs, Eng, UX, Sales, etc.). (Done by Data Team)
- [ ] Build the De-Identifier
- [ ] Begin adding `user_id`, `project_id`, etc. to all Snowplow events. (Done by Product Intelligence)

</details>
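
For illustration only, the de-identification step described in the previous proposal (hashing the user-id together with a secret SALT using SHA-256 to produce an 'AnonId') could be sketched roughly as below. This is not the De-Identifier that was planned or built; the function names `make_anon_id` and `deidentify_event`, the `anon_id` field name, the `:` separator, and the placeholder salt value are all assumptions made for this example.

```python
import hashlib

# Hypothetical placeholder: a real salt would be stored as a secret,
# never hard-coded alongside the data it protects.
SECRET_SALT = b"replace-with-a-real-secret"


def make_anon_id(user_id: int, salt: bytes = SECRET_SALT) -> str:
    """Hash the salt and user ID with SHA-256 and return the hex digest."""
    payload = salt + b":" + str(user_id).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def deidentify_event(event: dict, salt: bytes = SECRET_SALT) -> dict:
    """Return a copy of the event with `user_id` replaced by `anon_id`."""
    # Copy every field except the identifying one; the input is not mutated.
    clean = {key: value for key, value in event.items() if key != "user_id"}
    clean["anon_id"] = make_anon_id(event["user_id"], salt)
    return clean


if __name__ == "__main__":
    event = {"user_id": 42, "project_id": 7, "event": "page_view"}
    print(deidentify_event(event))
```

Because the same user-id and salt always produce the same digest, unique events can still be counted per 'AnonId' in the general access area, while the raw user-id stays in the secure holding area. A keyed construction such as HMAC-SHA-256 (`hmac.new` in the Python standard library) would be a reasonable alternative to plain concatenation, though the proposal itself only names SHA-2/256.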
epic