Add page_type to custom context in our Snowplow pageview tracker
Parent issue: gitlab-org/gitlab#207930 (closed)
Problem
As described in this issue #84 (closed) :
At the moment, some requests could be stated as
I want to see how many people checked their User preferences over the last month. And could we put that in a dashboard?
Of course, this is feasible, but it will be very difficult to maintain such a dashboard over time. To identify which users check their user preferences, we will use a regular expression on the URL path. This has 3 fundamental problems:
- Very intensive, therefore not scalable. We can do it for one page, but for 100? Or 1,000?
- The URL could change over time (new features added, UI changed...), and then our regular expression will break. If we are good enough, we will catch the error, but that means adding an extra regular expression to our SQL query.
- Regex can't fully cover some cases. That means the metrics won't be accurate, and understanding the deviation will be very hard.
Another problem that has surfaced recently is that we are using more and more complicated regular expressions to build Monthly Active Users per stage. These regular expressions are not robust and not 100% reliable. They are a good start, but definitely not a sustainable option.
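To make the fragility concrete, here is a minimal sketch comparing a path-based regex with a match on a page type. The paths, the `profiles:preferences:show` value, and the function names are hypothetical and only serve as an illustration; they are not taken from our actual queries.

```typescript
// Hypothetical example: the same page before and after a URL change.
// The page type (controller and action) stays the same; only the path moves.
const before = { path: "/profile/preferences", pageType: "profiles:preferences:show" };
const after = { path: "/-/profile/preferences", pageType: "profiles:preferences:show" };

// Path-based detection: has to be updated every time the route changes.
const preferencesPathPattern = /^\/profile\/preferences\/?$/;

function isPreferencesByPath(path: string): boolean {
  return preferencesPathPattern.test(path);
}

// Page-type-based detection: a plain equality check, independent of the URL.
function isPreferencesByPageType(pageType: string): boolean {
  return pageType === "profiles:preferences:show";
}

console.log(isPreferencesByPath(before.path), isPreferencesByPath(after.path));
console.log(isPreferencesByPageType(before.pageType), isPreferencesByPageType(after.pageType));
```

With the regex, the second pageview silently drops out of the metric; with a page type, both are counted without touching the query.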
Proposal
Snowplow has a feature called Custom Contexts. These contexts can be used to attach extra metadata to a specific page. For example, Snowplow illustrates this with a website selling movie posters: on each page, a custom context adds extra information about the product visited, the customer visiting...
How could we apply that in GitLab's context:
The idea would be to enrich the pageviews with some additional data. My initial thinking was to create a context that looks like this:
{
  schema: "iglu:com.gitlab/pageview_context/jsonschema/1-0-0",
  data: {
    page_type: 'projects:issues:show' // example uses the existing body element 'page' value
  }
}
This would be the ideal scenario. There might be shortcuts, for example using document.body.dataset.page, which indicates the controller and the action used for a specific page.
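As a rough sketch of that shortcut, the context could be built from the `page` value on the body element. The helper name `buildPageviewContext` and the `PageviewContext` type are hypothetical, and the schema URI is the one proposed above, which does not exist yet. The tracker call in the trailing comment assumes the positional form of the Snowplow JavaScript tracker v2 and should be checked against the tracker version we actually run.

```typescript
// Shape of the proposed custom context (self-describing JSON).
interface PageviewContext {
  schema: string;
  data: { page_type: string };
}

// Schema URI as proposed in this issue; it would still need to be published.
const PAGEVIEW_CONTEXT_SCHEMA = "iglu:com.gitlab/pageview_context/jsonschema/1-0-0";

// Builds the context from the body's data-page value.
// Returns null when the value is missing or blank, so no invalid context is sent.
function buildPageviewContext(page: string | undefined): PageviewContext | null {
  const pageType = page?.trim();
  if (!pageType) {
    return null;
  }
  return {
    schema: PAGEVIEW_CONTEXT_SCHEMA,
    data: { page_type: pageType },
  };
}

// In the browser this might be wired up roughly as follows (assumed v2 call form):
//   const context = buildPageviewContext(document.body.dataset.page);
//   window.snowplow('trackPageView', null, context ? [context] : []);
```

Returning null for a missing value keeps pageviews flowing on pages without a `page` attribute, instead of attaching a context that would fail schema validation.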
Some open questions relate more to our Snowplow infrastructure. These areas are at the limit of my knowledge about Snowplow infra:
In an IaaS context (so infra hosted by Snowplow), we would normally need to follow this todo before we start sending custom contexts and self-describing events. In our own infra, I think we can replicate this by using Iglu. Was this part of our migration? Or is it something we need to work on with the infra team? From my limited research, I would say we need to implement it!