Pseudonymization of URLs
Update: We are about to implement Option 3
We currently track URLs as part of our Snowplow frontend events. Moving forward, we want to de-identify the URLs we send, based on #336779 (closed), as the URLs can contain PII.
There are three possible approaches:
1. Submit route structures only (recommended, technically the least effort)
By submitting only the route format, we remove all PII and IDs and replace them with the corresponding path parameter names (for example `:group_id`).
Pro:
- When we add `namespace_id`, we can still analyse flows per namespace
- Easier to implement
- Querying specific pages becomes easier, as we could write queries like `select count(*) from x where url = ':group_id/-/issues'`
Con:
- Need to check if any existing queries depend on the current format
- ?
| URL | Route |
|---|---|
| gitlab.com/my-group/my-awesome-project | gitlab.com/:group_id/:project_id |
| gitlab.com/my-group/issues | gitlab.com/:group_id/issues |
| gitlab.com/checkout | gitlab.com/checkout |
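
As a rough illustration of option 1, the mapping from URL to route could be sketched as below. The `ROUTES` table, the `to_route` helper, and the `:unknown` fallback are hypothetical names made up for this sketch. In the real application the route template would more likely come from the router itself than from hand-written patterns.

```python
import re

# Hypothetical route table: (pattern for the path, route template).
# More specific routes must come before more general ones.
ROUTES = [
    (re.compile(r"^/checkout$"), "/checkout"),
    (re.compile(r"^/[^/]+/issues$"), "/:group_id/issues"),
    (re.compile(r"^/[^/]+/[^/]+$"), "/:group_id/:project_id"),
]


def to_route(url: str) -> str:
    """Replace a scheme-less URL such as 'gitlab.com/a/b' with its route template."""
    # Drop query string and fragment, which may also carry PII.
    url = url.split("?", 1)[0].split("#", 1)[0]
    host, _, path = url.partition("/")
    path = "/" + path
    for pattern, template in ROUTES:
        if pattern.match(path):
            return host + template
    # Assumption: fail closed, so an unrecognised path is never sent as-is.
    return host + "/:unknown"


print(to_route("gitlab.com/my-group/my-awesome-project"))
print(to_route("gitlab.com/my-group/issues"))
print(to_route("gitlab.com/checkout"))
```

Falling back to a fixed placeholder for unmatched paths is one possible design choice. It trades some analytics coverage for the guarantee that a raw path never leaves the client.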
2. Submit the URLs with pseudonymized parts (highest effort)
We would need to identify each route and determine which parts of it are PII and which are not.
Pro
- Might not break existing queries
Con
- Harder to implement, as we need to process and analyse each submitted URL, especially as our pseudonymization service is likely implemented at the collector level
| URL | De-Identified |
|---|---|
| gitlab.com/my-group/my-awesome-project | gitlab.com/anonymizedstring1/anonymizedstring2 |
| gitlab.com/my-group/issues | gitlab.com/anonymizedstring1/issues |
| gitlab.com/checkout | gitlab.com/checkout |
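
A minimal sketch of option 2, assuming a keyed hash (HMAC-SHA256) is used for pseudonymization. The issue does not say how the pseudonymization service works, so the hashing scheme, `SECRET_KEY`, `SAFE_SEGMENTS`, and both helper functions are assumptions made for illustration only.

```python
import hashlib
import hmac

# Assumption: a secret key held by the pseudonymization service.
SECRET_KEY = b"example-key-not-for-production"

# Hypothetical allowlist of path segments that are known not to be PII.
SAFE_SEGMENTS = {"issues", "checkout", "-"}


def pseudonymize_segment(segment: str) -> str:
    """Map a segment to a stable, non-reversible token (truncated HMAC-SHA256)."""
    digest = hmac.new(SECRET_KEY, segment.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def pseudonymize_url(url: str) -> str:
    """Pseudonymize every path segment of a scheme-less URL not on the allowlist."""
    host, *segments = url.split("/")
    masked = [
        seg if seg in SAFE_SEGMENTS else pseudonymize_segment(seg)
        for seg in segments
    ]
    return "/".join([host, *masked])


print(pseudonymize_url("gitlab.com/my-group/my-awesome-project"))
print(pseudonymize_url("gitlab.com/my-group/issues"))
print(pseudonymize_url("gitlab.com/checkout"))
```

Because the same input always yields the same token, the group segment is identical across the first two URLs, which keeps per-group analysis possible. A per-segment allowlist is cruder than the per-route analysis this option calls for, which is part of why the effort is rated highest.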
3. Submit Route structure with Namespace and Project IDs (effort needs to be investigated)
Similar to option 1 in that we remove all PII, but we keep the namespace and project IDs in the URL.
Pro
- Querying specific pages is still relatively easy, as we could write queries like `select count(*) from x where url = ':group_id/-/issues'`, and we gain the extra ability to query for specific namespaces even when there is no namespace in the event
- Does not rely on the `namespace_id` being in the event
- This can be done on the application layer and does not affect the de-identification service work
Con
- Need to check if there are any performance concerns and how much effort it is to resolve the `namespace_id` and `project_id` on all paths (there might already be something in our code that we can re-use)
- Need to check if any existing queries depend on the current format
| URL | Route |
|---|---|
| gitlab.com/my-group/my-awesome-project | gitlab.com/group:123/project:356 |
| gitlab.com/my-group/my-awesome-project/some_folder/some_file.js | gitlab.com/group:123/project:356/:repository_path |
| gitlab.com/my-group/issues | gitlab.com/group:123/issues |
| gitlab.com/checkout | gitlab.com/checkout |
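
Option 3 could look roughly like the following sketch. The dictionaries stand in for whatever lookup the application already performs to resolve a path to a namespace and project. All names (`GROUP_IDS`, `PROJECT_IDS`, `STATIC_ROUTES`, `GROUP_PAGES`, `to_route_with_ids`) and the `:unknown` fallback are hypothetical, and the IDs are illustrative.

```python
# Hypothetical lookup tables standing in for database or cache lookups.
GROUP_IDS = {"my-group": 123}
PROJECT_IDS = {("my-group", "my-awesome-project"): 356}

# Hypothetical sets of path segments that never contain PII.
STATIC_ROUTES = {"checkout"}
GROUP_PAGES = {"issues"}


def to_route_with_ids(url: str) -> str:
    """Replace group and project paths with their IDs and mask everything else."""
    host, *segments = url.split("/")
    if not segments or (len(segments) == 1 and segments[0] in STATIC_ROUTES):
        return url

    group, rest = segments[0], segments[1:]
    group_id = GROUP_IDS.get(group)
    if group_id is None:
        # Assumption: fail closed when the namespace cannot be resolved.
        return f"{host}/:unknown"

    parts = [host, f"group:{group_id}"]
    if not rest:
        return "/".join(parts)
    if len(rest) == 1 and rest[0] in GROUP_PAGES:
        return "/".join(parts + rest)

    project_id = PROJECT_IDS.get((group, rest[0]))
    if project_id is None:
        return f"{host}/:unknown"
    parts.append(f"project:{project_id}")
    if len(rest) > 1:
        # Anything below the project root is collapsed into one placeholder.
        parts.append(":repository_path")
    return "/".join(parts)


print(to_route_with_ids("gitlab.com/my-group/my-awesome-project/some_folder/some_file.js"))
print(to_route_with_ids("gitlab.com/my-group/issues"))
```

Each URL needs up to two lookups here, which is where the performance concern listed under Con would show up if the IDs are not already available in the request context.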
Edited by Nicolas Dular