Pseudonymization of URLs

Update: We are about to implement Option 3

We currently track URLs as part of our Snowplow frontend events. Going forward, we want to de-identify the URLs we send, based on #336779 (closed), because URLs can contain PII.

There are three possible approaches:

1. Submit route structures only (recommended, technically the least effort)

By submitting only the route format, we remove all PII and IDs from the URL and replace them with the names of the route parameters (for example :group_id).

Pro:

  • When we add namespace_id, we can still analyse flows per namespace
  • Easier to implement
  • Querying specific pages becomes easier, as we could write queries like select count(*) from x where url = ":group_id/-/issues"

Con:

  • Need to check if any existing queries depend on the current format
  • ?

| URL | Route |
| --- | --- |
| gitlab.com/my-group/my-awesome-project | gitlab.com/:group_id/:project_id |
| gitlab.com/my-group/issues | gitlab.com/:group_id/issues |
| gitlab.com/checkout | gitlab.com/checkout |
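
As a rough illustration of Option 1, the mapping from a concrete URL to its route structure could be sketched as below. The `ROUTES` table, the `to_route` function and the `:unknown` fallback are assumptions made for this example only. In the real application the router already knows which route matched, so a hand-written pattern list like this would likely not be needed.

```python
import re
from urllib.parse import urlsplit

# Hypothetical route table for illustration: (pattern, route template).
# Order matters: more specific routes must come before generic ones.
ROUTES = [
    (re.compile(r"^/checkout$"), "/checkout"),
    (re.compile(r"^/[^/]+/issues$"), "/:group_id/issues"),
    (re.compile(r"^/[^/]+/[^/]+$"), "/:group_id/:project_id"),
]


def to_route(url: str) -> str:
    """Replace a concrete URL with its route structure (host + template)."""
    # Accept both "gitlab.com/..." and "https://gitlab.com/..." inputs.
    parts = urlsplit(url if "//" in url else "//" + url)
    for pattern, template in ROUTES:
        if pattern.match(parts.path):
            return parts.netloc + template
    # Assumption: unmatched paths are never sent raw, so no PII leaks through.
    return parts.netloc + "/:unknown"


if __name__ == "__main__":
    for example in (
        "gitlab.com/my-group/my-awesome-project",
        "gitlab.com/my-group/issues",
        "gitlab.com/checkout",
    ):
        print(example, "->", to_route(example))
```

Only the path is inspected, so query strings and fragments are dropped as a side effect.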

2. Submit the URLs with pseudonymized parts (highest effort)

We would need to identify each route and see what part of the route is PII data and what not.

Pro

  • Might not break existing queries

Con

  • Harder to implement, as we need to process and analyse each submitted URL, especially as our pseudonymization service is likely implemented on the collector level

| URL | De-Identified |
| --- | --- |
| gitlab.com/my-group/my-awesome-project | gitlab.com/anonymizedstring1/anonymizedstring2 |
| gitlab.com/my-group/issues | gitlab.com/anonymizedstring1/issues |
| gitlab.com/checkout | gitlab.com/checkout |
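
To make the per-route effort of Option 2 more concrete, here is a minimal sketch that replaces only the PII segments of a path with a keyed hash. Everything in it is an assumption for illustration: the `RULES` table, the `SECRET_KEY`, the choice of HMAC-SHA256 and the 16-character truncation. It does not describe how the actual pseudonymization service works.

```python
import hashlib
import hmac
import re

# Hypothetical key for illustration; a real key would come from secure config.
SECRET_KEY = b"example-key-do-not-use"

# Hypothetical per-route rules: a path pattern plus the indexes of the
# path segments that are considered PII on that route.
RULES = [
    (re.compile(r"^checkout$"), ()),
    (re.compile(r"^[^/]+/issues$"), (0,)),
    (re.compile(r"^[^/]+/[^/]+$"), (0, 1)),
]


def pseudonymize(value: str) -> str:
    """Return a stable keyed hash, so equal inputs give equal pseudonyms."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def deidentify(url: str) -> str:
    """Pseudonymize the PII segments of a scheme-less URL like 'host/path'."""
    host, _, path = url.partition("/")
    segments = path.split("/")
    for pattern, pii_indexes in RULES:
        if pattern.match(path):
            for index in pii_indexes:
                segments[index] = pseudonymize(segments[index])
            return host + "/" + "/".join(segments)
    # Assumption: on unknown routes every segment is treated as PII.
    return host + "/" + "/".join(pseudonymize(s) for s in segments)
```

Because the hash is deterministic, my-group maps to the same string on every route, which is what lets existing per-group queries keep working. The cost is that every route needs its own rule.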

3. Submit Route structure with Namespace and Project IDs (effort needs to be investigated)

Similar to Option 1, in that we remove all PII from the URL, but we keep the namespace and project IDs in the URL in place of their names.

Pro

  • Querying specific pages is still relatively easy, as we could write queries like select count(*) from x where url = ":group_id/-/issues", and we gain the option to query for specific namespaces even when the event itself carries no namespace
  • Does not rely on the namespace_id being in the event
  • This can be done on the application layer and does not affect the de-identification service work

Con

  • Need to check whether there are any performance concerns and how much effort it is to resolve the namespace_id and project_id on all paths (there might already be something in our code that we can re-use)
  • Need to check if any existing queries depend on the current format

| URL | Route |
| --- | --- |
| gitlab.com/my-group/my-awesome-project | gitlab.com/group:123/project:356 |
| gitlab.com/my-group/my-awesome-project/some_folder/some_file.js | gitlab.com/group:123/project:356/:repository_path |
| gitlab.com/my-group/issues | gitlab.com/group:123/issues |
| gitlab.com/checkout | gitlab.com/checkout |
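
A minimal sketch of Option 3, assuming the application can resolve names to IDs: the dictionaries `GROUP_IDS` and `PROJECT_IDS` stand in for whatever lookup the application layer actually provides, and `STATIC_ROUTES`, `to_route_with_ids` and the `:unknown` fallback are likewise hypothetical. The IDs 123 and 356 are the illustrative values used for my-group and my-awesome-project.

```python
# Hypothetical lookups standing in for the application's own resolution
# of namespace and project IDs (for example from the current request).
GROUP_IDS = {"my-group": 123}
PROJECT_IDS = {("my-group", "my-awesome-project"): 356}

# Hypothetical set of routes that contain no namespace or project at all.
STATIC_ROUTES = {"checkout"}


def to_route_with_ids(url: str) -> str:
    """Turn a scheme-less URL into a route that keeps group/project IDs."""
    host, _, path = url.partition("/")
    if path in STATIC_ROUTES:
        return url
    segments = path.split("/")
    group_id = GROUP_IDS.get(segments[0])
    if group_id is None:
        # Assumption: never fall back to sending the raw path.
        return f"{host}/:unknown"
    prefix = f"{host}/group:{group_id}"
    if len(segments) == 1:
        return prefix
    if segments[1:] == ["issues"]:
        return f"{prefix}/issues"
    project_id = PROJECT_IDS.get((segments[0], segments[1]))
    if project_id is None:
        return f"{prefix}/:unknown"
    prefix = f"{prefix}/project:{project_id}"
    # Anything below the project is collapsed into one placeholder.
    return prefix if len(segments) == 2 else f"{prefix}/:repository_path"
```

The two dictionary lookups are where the performance question from the Con list would arise: in a real implementation each one could mean a database or cache hit per tracked event unless the IDs are already available in the request context.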
Edited by Nicolas Dular