Snowplow Tracking Improvement
This issue now seems unblocked and will be released soon. This will help Growth Data Analysts in their analysis and give them a better understanding of GitLab.com usage.
However, several topics still need to be addressed to make Snowplow tracking efficient and useful.
Introducing custom context for each page view
What problem do we see?
At the moment, a typical request could be stated as:
I want to see how many people checked their User preferences over the last month. Could we put that in a dashboard?
Of course, this is feasible, but such a dashboard would be very difficult to maintain over time. To identify which users check their user preferences, we would use a regular expression on the URL path. This has 3 fundamental problems:
- Very labor-intensive and therefore not scalable. We can do it for one request, but for 100? Or 1000?
- The URL can change over time (new features added, UI changed...), and then our regular expression will break. If we are careful, we will catch the error, but that means adding yet another regular expression to our SQL query.
- Regular expressions can't fully cover some cases. That means the metrics won't be accurate, and understanding the deviation will be very hard.
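To make the fragility concrete, here is a minimal sketch of the kind of URL regex an analyst would have to maintain. The route shapes are assumptions for illustration; when a route changes (for example, a `/-/` segment is added), the regex silently stops matching:

```javascript
// Illustration only: a regex guessing the page type from the URL path.
const isMergeRequestPage = (path) =>
  /^\/[^/]+\/[^/]+\/merge_requests\/\d+$/.test(path);

console.log(isMergeRequestPage('/gitlab-data/analytics/merge_requests/1317')); // true
// The same page under a hypothetical newer "/-/" route no longer matches:
console.log(isMergeRequestPage('/gitlab-data/analytics/-/merge_requests/1317')); // false
```

Every such route change means another pattern bolted onto the SQL query, with no error raised when it drifts out of date.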
What would be the solution?
Snowplow has a feature called Custom Contexts. These contexts can be used to attach extra metadata to a specific page. For example, Snowplow documents the case of a website selling movie posters: on each page, a custom context adds extra information about the product being viewed, the customer visiting, and so on.
We could apply custom contexts to GitLab very easily. As a first idea, we could add the type of page to each page view. A non-comprehensive list of the main page_types on GitLab.com could be:
- merge_request
- issue
- wiki
- snippet
- board
- repository
- pipelines
- jobs
- group
- project
- ...
We can also add some context about the page visited, such as:
- action_type: in my mind, for wikis, issues, and MRs there could be 3 types of actions, which are view, create, edit
- namespace_id, project_id: for example, when creating an issue you are always part of a project/group, so it would be super convenient to have these in a custom context...
So when a user visits this page for example: https://gitlab.com/gitlab-data/analytics/merge_requests/1317
we could send, along with the pageview event, a custom context that looks like this:
{
  schema: "iglu:com.example_company/page/jsonschema/1-2-1",
  data: {
    page_type: 'merge_request',
    merge_request_id: 1317,
    namespace_id: 4347861, // gitlab-data group_id
    project_id: 4409640,   // gitlab-data/analytics project_id
  }
}
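As a sketch of how this could be wired up, a small helper could build the context object, which the Snowplow JavaScript tracker accepts as the custom-context array on a page view. The Iglu schema URI and the helper name are assumptions for illustration, not an existing GitLab schema:

```javascript
// Builds the hypothetical "page" custom context for a merge request page.
// The iglu schema URI below is a placeholder, not a real registered schema.
function buildMergeRequestContext(mergeRequestId, namespaceId, projectId) {
  return {
    schema: 'iglu:com.gitlab/page/jsonschema/1-0-0',
    data: {
      page_type: 'merge_request',
      merge_request_id: mergeRequestId,
      namespace_id: namespaceId,
      project_id: projectId,
    },
  };
}

// With the Snowplow JS tracker loaded, the context would ride along on the pageview:
// window.snowplow('trackPageView', null, [buildMergeRequestContext(1317, 4347861, 4409640)]);
```

Because the context is built where the page is rendered, Engineering controls the values and analysts never have to reverse-engineer them from the URL.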
These are my initial thoughts, but many more dimensions could be added to these custom contexts. We could also consider having several custom contexts...
That means that instead of relying on Growth Data Analysts to crunch data and write complicated regexes (creating technical debt and a knowledge gap), we spread the tracking knowledge between Engineering and the Data Team.
Another advantage of this setup is that it moves tracking improvements much closer to feature development: tracking always becomes part of a Feature Design Issue. This will allow us to be more proactive and also more scalable.
Something worth mentioning: it will also make data much easier for PMs to query directly in Periscope. They will no longer need regexes, just plain references like page_type = 'issue'.
Introducing key structured events
TBC
Set up a proper QA
So, we have seen 3 different types of bugs/issues over the last 6 months:
- The Snowplow user_id wasn't set up properly when we moved from Fishtown to our own infrastructure. This turned into a long discussion about GDPR and its legal implications.
- Structured events were not set up correctly and were sending bad events to the tracker.
- Sessions haven't been tracked correctly for the last 6 months, due to a change in our localStorageStrategy.
That shows the lack of a solid and resilient QA workflow. There is a clear lack of ownership of this data and its quality, and we do nothing proactively.
The idea would be, again, to move Tracking QA much closer to the Engineering testing workflow. Snowplow Micro is a promising candidate which could become part of our automated test suite.
- Snowplow users can use Snowplow Micro to add tests to their automated test suites on any platform that:
  - Simulate particular situations, and check that the data sent is as expected.
  - Validate that the right event data is sent with each event.
  - Validate that the right entities / contexts are attached to each event.
  - Validate that the right values are sent with each event.
- Snowplow users can then release new versions of their apps, websites and servers with confidence that new changes won't break the tracking setup.
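As a minimal sketch of such an automated check: Snowplow Micro exposes the events it received over HTTP, so a test can simulate a page view and then assert on the captured payload. The event shape below is a simplified illustration, not Micro's exact response format:

```javascript
// Checks that a captured event carries our hypothetical page custom context
// with the expected page_type. In a real suite, `event` would be fetched
// from a local Snowplow Micro instance after simulating the page view.
function hasPageContext(event, expectedPageType) {
  const contexts = (event.contexts && event.contexts.data) || [];
  return contexts.some(
    (ctx) => ctx.data && ctx.data.page_type === expectedPageType
  );
}

// Illustrative captured event (simplified shape):
const sampleEvent = {
  event: 'page_view',
  contexts: {
    data: [{
      schema: 'iglu:com.gitlab/page/jsonschema/1-0-0', // placeholder schema
      data: { page_type: 'merge_request', merge_request_id: 1317 },
    }],
  },
};

console.log(hasPageContext(sampleEvent, 'merge_request')); // true
```

A failing assertion here would catch regressions like the user_id and session bugs above before they ship, instead of months later in the warehouse.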
Additionally, manual QA will still be needed. For that, we would also need a proper workflow, which should be part of our release workflow: when we ship a new feature with new tracking, how do we test it? Which steps need validation? A failure in tracking QA should block any type of release.