Issue created May 11, 2020 by Sid Reddy (@sid_reddy)

Product Analytics via Usage Ping MVC - Parent Issue

Overview

This issue brings together our work on Self-Managed Event Tracking https://gitlab.com/gitlab-org/telemetry/-/issues/373 and Product Analytics gitlab#211568 (closed).

Once the Product Analytics MVC MR gitlab!27730 (closed) is merged, we will have a product_analytics_events table which will hold events from external applications and from a GitLab instance.

Using our existing Usage Ping feature, we will need to look at ways to aggregate the GitLab instance event data so it can be sent back to us via Usage Ping.

The purpose of this Usage Ping data is to help us build a better GitLab. Data about how GitLab is used is collected to better understand which parts of GitLab need improvement and which features to build next.

MVC

The goal of this MVC is to aggregate Snowplow data in the product_analytics_events table so it can be sent back to us via Usage Ping. The key is to aggregate the data in a way that is useful for reporting in Sisense.

Some ideas we've explored include:

  • product_analytics_per_day daily aggregation table
  • product_analytics_counters_per_user_per_day daily aggregation table
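
As a rough illustration of the two aggregation ideas above, here is a minimal sketch in Python. The row fields (user_id, event_name, collector_tstamp) and the function names are assumptions made for this example; the actual product_analytics_events schema and the final shape of the aggregation tables are not defined in this issue.

```python
from collections import Counter
from datetime import datetime

# Hypothetical event rows. Field names are assumptions for illustration,
# not the real product_analytics_events schema.
events = [
    {"user_id": 1, "event_name": "page_view", "collector_tstamp": "2020-05-11T09:15:00"},
    {"user_id": 1, "event_name": "page_view", "collector_tstamp": "2020-05-11T17:40:00"},
    {"user_id": 2, "event_name": "click", "collector_tstamp": "2020-05-11T10:00:00"},
    {"user_id": 1, "event_name": "click", "collector_tstamp": "2020-05-12T08:05:00"},
]


def event_day(row):
    """Return the calendar day (YYYY-MM-DD) on which the event was collected."""
    return datetime.fromisoformat(row["collector_tstamp"]).date().isoformat()


def aggregate_per_day(rows):
    """Count events per (day, event_name), as a per-day table might."""
    counts = Counter()
    for row in rows:
        counts[(event_day(row), row["event_name"])] += 1
    return dict(counts)


def aggregate_per_user_per_day(rows):
    """Count events per (day, user_id, event_name), as a per-user table might."""
    counts = Counter()
    for row in rows:
        counts[(event_day(row), row["user_id"], row["event_name"])] += 1
    return dict(counts)


per_day = aggregate_per_day(events)
per_user_per_day = aggregate_per_user_per_day(events)
```

Only the aggregated counts, not the individual events, would then need to travel in the Usage Ping payload. The per-user variant keeps more detail at the cost of more rows.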

Long term concerns (out of scope for MVC)

Scale of data

GitLab.com currently sees up to 18.2M events per day, with peaks of 1.25M events per hour, or roughly 20,833 events per minute. These split into about 17M good events and 1.2M bad events per day. A good event is one that is structured according to Snowplow's defined schema. Link to Snowplow Summary Dashboard.
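
To make the scale concrete, the daily and hourly figures quoted above can be converted to per-minute and per-second rates. This is plain arithmetic on the numbers in this issue; the variable names are only for illustration.

```python
# Figures quoted in this issue for GitLab.com.
EVENTS_PER_DAY = 18_200_000
PEAK_EVENTS_PER_HOUR = 1_250_000
GOOD_EVENTS_PER_DAY = 17_000_000
BAD_EVENTS_PER_DAY = 1_200_000

# Average load implied by the daily total.
avg_per_hour = EVENTS_PER_DAY / 24

# Peak load implied by the hourly peak.
peak_per_minute = PEAK_EVENTS_PER_HOUR / 60
peak_per_second = PEAK_EVENTS_PER_HOUR / 3600

# Share of events that fail schema validation.
bad_share = BAD_EVENTS_PER_DAY / EVENTS_PER_DAY

print(f"average per hour: {avg_per_hour:,.0f}")
print(f"peak per minute:  {peak_per_minute:,.0f}")
print(f"peak per second:  {peak_per_second:,.0f}")
print(f"bad event share:  {bad_share:.1%}")
```

On these figures the hourly peak works out to about 20,833 events per minute, or about 347 per second, and bad events make up roughly 6.6% of daily traffic.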

Data Retention

Regarding the retention policy for GitLab.com Snowplow events: our Snowflake data warehouse has unlimited retention. Link to Snowplow dbt. Here's what is currently in Snowflake, dating back to around Aug 2018:

  • Events (Good Events): 753 GB, 3,281,848,447 rows
  • Bad Events: 256 GB, 186,793,156 rows
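
For sizing discussions, the totals above can be turned into an average stored size per row. This is a back-of-the-envelope sketch that assumes decimal gigabytes (10^9 bytes); the reported sizes may reflect compressed warehouse storage, so treat the results as rough indicators only.

```python
GB = 10**9  # assumption: sizes are reported in decimal gigabytes


def avg_bytes_per_row(size_gb, rows):
    """Average stored bytes per row for a table of the given size."""
    return size_gb * GB / rows


# Totals quoted in this issue.
good_avg = avg_bytes_per_row(753, 3_281_848_447)
bad_avg = avg_bytes_per_row(256, 186_793_156)

print(f"good events: ~{good_avg:.0f} bytes per row")
print(f"bad events:  ~{bad_avg:.0f} bytes per row")
```

On these figures a bad event row is several times larger on average than a good event row, which may be worth keeping in mind when retention periods are defined.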

Go Collector

  • Make the Go collector write raw events instead of also handling parsing https://gitlab.com/gitlab-org/telemetry/-/issues/383#note_344191119
  • Have a Sidekiq job handle parsing events
  • Have three tables https://gitlab.com/gitlab-org/telemetry/-/issues/383#note_344199347
    • product_analytics_raw_events: similar to the Versions app, where we save the original usage ping payload before parsing https://gitlab.com/gitlab-services/version-gitlab-com/-/issues/281#plan
    • product_analytics_events: the equivalent of the Events table in Snowflake
    • product_analytics_raw_bad_events: the equivalent of our existing Bad Events table in Snowflake
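
The split proposed above (the collector stores raw payloads, and a background job parses them into good or bad events) could look roughly like the sketch below. It is written in Python for illustration only; the actual collector is in Go and the job would run in Sidekiq. The JSON payload format, the REQUIRED_FIELDS list, and the function name are assumptions, not the real Snowplow schema or an existing API.

```python
import json

# Hypothetical required fields. The real Snowplow schema is much richer.
REQUIRED_FIELDS = ("event_id", "event_name", "collector_tstamp")


def parse_raw_events(raw_events):
    """Parse raw payloads and route each one to a good or bad list.

    The raw payloads themselves are left untouched, mirroring the idea
    of keeping product_analytics_raw_events as the original record.
    """
    good, bad = [], []
    for raw in raw_events:
        try:
            event = json.loads(raw)
        except json.JSONDecodeError:
            bad.append({"payload": raw, "error": "invalid JSON"})
            continue
        if not isinstance(event, dict):
            bad.append({"payload": raw, "error": "payload is not an object"})
            continue
        missing = [field for field in REQUIRED_FIELDS if field not in event]
        if missing:
            bad.append({"payload": raw, "error": "missing fields: " + ", ".join(missing)})
        else:
            good.append(event)
    return good, bad


# Stand-in for rows in product_analytics_raw_events.
raw_table = [
    '{"event_id": "a1", "event_name": "page_view", "collector_tstamp": "2020-05-11T09:15:00"}',
    '{"event_id": "a2", "event_name": "click"}',
    "not json at all",
]

good_events, bad_events = parse_raw_events(raw_table)
```

Keeping the collector to a plain write keeps the hot path cheap, while parsing failures end up in a separate table with the original payload and a reason attached.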

Next Steps for MVC

  • Review Snowplow Go Collector https://gitlab.com/gitlab-org/snowplow-go-collector
  • Review and merge Product Analytics MVC MR gitlab!27730 (closed)
  • Standardize Snowplow Tracking gitlab#207930 (closed)
  • Restructure Usage Ping gitlab#217362 (closed)
  • Aggregate Snowplow Data and resurface via Usage Ping
    • product_analytics_per_day daily aggregation table
    • product_analytics_counters_per_user_per_day daily aggregation table

Next Steps for Long Term Concerns

  • Think about scalability of collector (17M events per day on GitLab.com)
  • Define retention periods of product_analytics_events table
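
Once a retention period is chosen, enforcing it could be as simple as dropping rows older than a cutoff. The sketch below is only an illustration: the retention_days value, the collector_tstamp field, and the prune_expired function are hypothetical, since this issue leaves the retention period undefined.

```python
from datetime import datetime, timedelta


def prune_expired(rows, now, retention_days):
    """Return only the rows collected within the retention window."""
    cutoff = now - timedelta(days=retention_days)
    return [row for row in rows if row["collector_tstamp"] >= cutoff]


# Stand-in rows for product_analytics_events.
rows = [
    {"event_id": "old", "collector_tstamp": datetime(2020, 1, 1)},
    {"event_id": "recent", "collector_tstamp": datetime(2020, 8, 1)},
]

# The 90-day value is an arbitrary example, not a proposed policy.
kept = prune_expired(rows, now=datetime(2020, 8, 14), retention_days=90)
```

In a real table the same rule would typically run as a scheduled delete, or be handled by dropping old partitions, rather than filtering rows in memory.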
Edited Aug 14, 2020 by 🤖 GitLab Bot 🤖