# Product Analytics via Usage Ping MVC - Parent Issue

## Overview
This issue is a convergence of our work on Self-Managed Event Tracking https://gitlab.com/gitlab-org/telemetry/-/issues/373 and Product Analytics gitlab-org/gitlab#211568 (closed).

Once the Product Analytics MVC MR gitlab-org/gitlab!27730 (closed) is merged, we will have a `product_analytics_events` table which will hold events from external applications and from a GitLab instance.

Using our existing Usage Ping feature, we need to begin looking at ways to aggregate the GitLab instance event data so it can be sent back to us via Usage Ping.

The purpose of this Usage Ping data is to help us build a better GitLab. Data about how GitLab is used is collected to better understand which parts of GitLab need improvement and which features to build next.
## MVC

The goal of this MVC is to aggregate the Snowplow data in the `product_analytics_events` table so it can be sent back to us via Usage Ping. The key is to aggregate the data in a way that is useful for reporting purposes in Sisense.
Some ideas we've explored include:

- `product_analytics_per_day` daily aggregation table
- `product_analytics_counters_per_user_per_day` daily aggregation table
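As a concrete illustration of the first idea, here is a minimal plain-Ruby sketch of the kind of rollup a `product_analytics_per_day` table might hold: one counter row per (day, category, action). The field names (`:category`, `:action`, `:collector_tstamp`) are assumptions for illustration, not the actual schema.

```ruby
require "date"

# Hypothetical sketch: roll up raw Snowplow-style events into per-day counts,
# the kind of summary a product_analytics_per_day table might hold.
# Field names (:category, :action, :collector_tstamp) are assumptions.
def aggregate_per_day(events)
  events.group_by { |e| [e[:collector_tstamp].to_date, e[:category], e[:action]] }
        .map do |(day, category, action), group|
    { day: day, category: category, action: action, event_count: group.size }
  end
end

events = [
  { category: "epics",  action: "promote", collector_tstamp: DateTime.parse("2020-05-01T10:00:00") },
  { category: "epics",  action: "promote", collector_tstamp: DateTime.parse("2020-05-01T15:30:00") },
  { category: "issues", action: "close",   collector_tstamp: DateTime.parse("2020-05-02T09:00:00") }
]

rows = aggregate_per_day(events)
# rows contains 2 summary rows: 2 epic promotions on May 1, 1 issue close on May 2
```

Shipping only these counters (rather than raw events) in the Usage Ping keeps the payload small and avoids sending user-level data.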
## Long term concerns (out of scope for MVC)

### Scale of data
GitLab.com currently sees up to 18.2M events per day, with peaks of 1.25M events per hour (roughly 20,833 events per minute). These split into about 17M good events and 1.2M bad events per day, where a good event is one structured according to Snowplow's defined schema. Link to Snowplow Summary Dashboard.
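A quick sanity check of these figures (the per-minute rate follows from the per-hour peak):

```ruby
# Sanity-check the GitLab.com event-rate figures quoted above.
good_per_day = 17_000_000
bad_per_day  = 1_200_000
total_per_day = good_per_day + bad_per_day   # 18.2M events per day

peak_per_hour = 1_250_000
peak_per_min  = peak_per_hour / 60           # ~20,833 events per minute at peak
```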
### Data Retention

Regarding the retention policy of GitLab.com Snowplow events: our Snowflake data warehouse has unlimited retention. Link to Snowplow dbt. Here's what's currently in Snowflake, dating back to around August 2018:
- Events (Good Events): 753 GB, 3,281,848,447 rows
- Bad Events: 256 GB, 186,793,156 rows
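These totals also give a rough per-row storage estimate, which is useful input when thinking about Postgres retention for `product_analytics_events`. This is back-of-the-envelope arithmetic from the figures above, not a measured value:

```ruby
# Rough storage-per-row estimate from the Snowflake figures above.
GB = 1024**3
good_bytes_per_row = (753 * GB) / 3_281_848_447.0   # ~246 bytes per good event
bad_bytes_per_row  = (256 * GB) / 186_793_156.0     # ~1.5 KB per bad event
```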
### Go Collector
- Make the Go collector write raw events instead of also handling parsing https://gitlab.com/gitlab-org/telemetry/-/issues/383#note_344191119
- Have a Sidekiq job handle parsing events
- Have three tables https://gitlab.com/gitlab-org/telemetry/-/issues/383#note_344199347
  - `product_analytics_raw_events`: similar to the Versions app, where we save the original usage ping payload before parsing https://gitlab.com/gitlab-services/version-gitlab-com/-/issues/281#plan
  - `product_analytics_events`: the equivalent of the Events table in Snowflake
  - `product_analytics_raw_bad_events`: the equivalent of our existing Bad Events table in Snowflake
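The three-table flow above can be sketched in plain Ruby: the collector persists each payload verbatim, and a background job (Sidekiq in the real design) later parses payloads into the good or bad table. The table names come from this issue; everything else (arrays as stand-ins for tables, the required `"event"` key, the validation rule) is a hypothetical simplification.

```ruby
require "json"

# In-memory stand-ins for the three tables proposed above.
product_analytics_raw_events     = []  # raw payloads, written by the collector
product_analytics_events         = []  # parsed events matching the expected schema
product_analytics_raw_bad_events = []  # payloads that failed parsing/validation

# Collector step: persist the payload verbatim, no parsing yet.
def collect(raw_table, payload)
  raw_table << { payload: payload, received_at: Time.now }
end

# Background-job step (a Sidekiq worker in the real design): parse each raw
# payload and route it to the good or bad table. The required "event" key
# is an assumption for illustration.
def parse_raw_events(raw_table, good_table, bad_table)
  raw_table.each do |row|
    begin
      event = JSON.parse(row[:payload])
      raise "missing event name" unless event["event"]
      good_table << event
    rescue StandardError => e
      bad_table << { payload: row[:payload], error: e.message }
    end
  end
end

collect(product_analytics_raw_events, '{"event":"page_view","user_id":1}')
collect(product_analytics_raw_events, "not valid json")
parse_raw_events(product_analytics_raw_events,
                 product_analytics_events,
                 product_analytics_raw_bad_events)
```

Keeping the raw payload around means a parsing bug can be fixed and the affected events replayed, which is the same reason the Versions app stores the original usage ping payload.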
## Next Steps for MVC

- Review Snowplow Go Collector https://gitlab.com/gitlab-org/snowplow-go-collector
- Review and merge Product Analytics MVC MR gitlab-org/gitlab!27730 (closed)
- Standardize Snowplow Tracking gitlab-org/gitlab#207930 (closed)
- Restructure Usage Ping gitlab-org/gitlab#217362 (closed)
- Aggregate Snowplow data and resurface it via Usage Ping
  - `product_analytics_per_day` daily aggregation table
  - `product_analytics_counters_per_user_per_day` daily aggregation table
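For the second aggregation table, a minimal sketch of what `product_analytics_counters_per_user_per_day` might compute: one counter row per (user, day). As before, the event field names are assumptions, not the actual schema.

```ruby
require "date"

# Hypothetical sketch of the product_analytics_counters_per_user_per_day
# rollup: one counter row per (user, day). Field names are assumptions.
def counters_per_user_per_day(events)
  events.group_by { |e| [e[:user_id], e[:collector_tstamp].to_date] }
        .map do |(user_id, day), group|
    { user_id: user_id, day: day, event_count: group.size }
  end
end

events = [
  { user_id: 1, collector_tstamp: DateTime.parse("2020-05-01T08:00:00") },
  { user_id: 1, collector_tstamp: DateTime.parse("2020-05-01T17:00:00") },
  { user_id: 2, collector_tstamp: DateTime.parse("2020-05-01T09:00:00") }
]

rows = counters_per_user_per_day(events)
# rows contains 2 counter rows: user 1 with 2 events, user 2 with 1 event on May 1
```

Note that per-user counters would still need a further aggregation step (e.g. counting distinct active users) before leaving the instance, since Usage Ping payloads should not contain user-level data.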
## Next Steps for Long Term Concerns

- Think about scalability of the collector (17M events per day on GitLab.com)
- Define retention periods for the `product_analytics_events` table
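Whatever retention period is chosen, enforcing it amounts to periodically dropping rows older than a cutoff. A minimal sketch, assuming a hypothetical 90-day period (the actual period is exactly what this step would decide):

```ruby
require "date"

# Hypothetical sketch of applying a retention period to
# product_analytics_events: drop events older than the cutoff.
# The 90-day period is an assumption, not a decided value.
RETENTION_DAYS = 90

def prune_expired(events, today)
  cutoff = today - RETENTION_DAYS
  events.reject { |e| e[:collector_tstamp].to_date < cutoff }
end

today = Date.new(2020, 6, 1)
events = [
  { id: 1, collector_tstamp: DateTime.parse("2020-01-15T00:00:00") },  # older than cutoff, dropped
  { id: 2, collector_tstamp: DateTime.parse("2020-05-20T00:00:00") }   # within 90 days, kept
]

kept = prune_expired(events, today)
```

In a real deployment this would be a scheduled job deleting in batches (or dropping time-based partitions) rather than a full-table scan, given the event volumes discussed above.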