Product Analytics Collector Component
    Closed Epic created by Tim Zallmann

    At the moment we are using Jitsu (https://jitsu.com/) to start collecting tracking events as soon as possible for our analytics offering. Jitsu is one part of the bigger Analytics Collector Stack (Jitsu, Clickhouse, Cube).

    This epic maps out all the work needed for a possible replacement of Jitsu in the stack, as well as all consecutive steps we need for product analytics. This would apply in case the concerns about using Jitsu outweigh the estimated time investment that would be needed to get us there. Jitsu currently takes care of Steps 1 + 3; Step 2 would be needed either way.

    Target language and technology for a possible setup would be Go. Each bullet point below is a single issue / work item.

    Overview

    (Diagram: Analytics_Charts)

    Requirements

    1. Be available for both self-managed and GitLab.com users
      • This collector and data store will be used for Product Analytics and other features, which will all be available on self-managed and .com.
    2. Data needs to end up in Clickhouse, since the rest of the Product Analytics stack is based on serving data from Clickhouse.
    3. If a 3rd-party project is selected, it should:
      • Be open source and have a license that allows us to use it.
      • Have a large, active community contributing to it recently.
    4. We do not need an all-in-one replacement; a combination of solutions covering the different parts of the space that Jitsu covers would be possible.

    Deployment models

    The collector should be built in a way that it can be deployed in a self-managed environment directly on hardware without requiring a public cloud. This will enable our self-managed users to maintain full control over their infrastructure if so desired.

    It also will mean that GitLab.com and GitLab Dedicated can use the same setup, though those will likely be deployed on a public cloud environment instead of raw hardware.

    Step 1: Collector replacement

    Events Endpoint

    This is the actual endpoint that user projects would be communicating with. The architecture setup is HA (highly available). A minimal sketch of this endpoint follows the list below.

    • Configuration of Clickhouse Connection
    • Project definition to support multiple collecting projects on 1 collector instance
    • Create Clickhouse DB for project
    • HTTP/HTTPS endpoint for receiving events
    • Public Keys for JS based tracking, Server side key generation and handling
    • HTTP Origins per project configuration and filtering
    • Event data validation
    • Rate Limiting possibility
    • Basic Event enrichment for User Agent String parsing, Location based on IP
    • Batched Event saving to Clickhouse DB with configured credentials and configured time span
    • Basic Logging setup to have insight into behaviour and problems of collection
    • Full client-side SDK implementation of our own (right now an encapsulation of an existing SDK)
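
    A minimal sketch (in Go, per the target language above) of what the events endpoint could look like. The route name, the Event shape, the flush interval and the insertBatch stub are illustrative assumptions, not a decided API:

```go
// Hypothetical sketch of the events endpoint: receive events over HTTP,
// validate them, and flush batches to Clickhouse on a configurable interval.
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"sync"
	"time"
)

// Event is a minimal assumed payload; the real schema would follow the
// shared event taxonomy (&8645).
type Event struct {
	ProjectID  string          `json:"project_id"`
	Name       string          `json:"name"`
	UserAgent  string          `json:"user_agent"`
	Payload    json.RawMessage `json:"payload"`
	ReceivedAt time.Time       `json:"-"`
}

type collector struct {
	mu     sync.Mutex
	buffer []Event
}

// handleEvent validates and buffers a single incoming event.
func (c *collector) handleEvent(w http.ResponseWriter, r *http.Request) {
	var e Event
	if err := json.NewDecoder(r.Body).Decode(&e); err != nil || e.ProjectID == "" || e.Name == "" {
		http.Error(w, "invalid event", http.StatusBadRequest)
		return
	}
	e.UserAgent = r.UserAgent() // basic enrichment; IP-based location would go here too
	e.ReceivedAt = time.Now().UTC()

	c.mu.Lock()
	c.buffer = append(c.buffer, e)
	c.mu.Unlock()
	w.WriteHeader(http.StatusAccepted)
}

// flushLoop periodically writes buffered events to Clickhouse.
func (c *collector) flushLoop(interval time.Duration) {
	for range time.Tick(interval) {
		c.mu.Lock()
		batch := c.buffer
		c.buffer = nil
		c.mu.Unlock()
		if len(batch) == 0 {
			continue
		}
		// insertBatch stands in for a batched INSERT via a Clickhouse client,
		// using the configured per-project credentials.
		if err := insertBatch(batch); err != nil {
			log.Printf("flush failed: %v", err) // basic logging for insight into problems
		}
	}
}

func insertBatch(events []Event) error { return nil } // stub for the sketch

func main() {
	c := &collector{}
	go c.flushLoop(5 * time.Second) // configured time span
	http.HandleFunc("/v1/events", c.handleEvent)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

    The important property is that events are acknowledged quickly and written to Clickhouse in batches rather than once per request.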

    Configurator endpoint

    The configurator endpoint will be used to execute specific tasks on a collector instance (as collectors run separately from GitLab itself, see &8562). A rough sketch follows the list below.

    • Endpoint to set up a new project, triggered later by the GitLab instance
    • Update Project settings (HTTP Origins)
    • Rotate Public and Private Keys
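
    A rough sketch of the configurator endpoint covering project creation and key rotation. Route names, the in-memory project store and the key format are assumptions for illustration only:

```go
// Illustrative sketch only: the configurator endpoint exposes admin operations
// that a GitLab instance can call on a running collector.
package main

import (
	"crypto/rand"
	"encoding/hex"
	"encoding/json"
	"net/http"
)

// Project holds per-project collector configuration.
type Project struct {
	ID          string   `json:"id"`
	HTTPOrigins []string `json:"http_origins"` // allowed origins for JS-based tracking
	PublicKey   string   `json:"public_key"`
	privateKey  string   // never serialized back to the caller
}

var projects = map[string]*Project{} // in-memory stand-in for a real store

// newKey generates a random key; a real implementation would use proper key material.
func newKey() string {
	b := make([]byte, 16)
	rand.Read(b)
	return hex.EncodeToString(b)
}

func main() {
	mux := http.NewServeMux()

	// Create a new collecting project, triggered by the GitLab instance.
	mux.HandleFunc("/admin/projects", func(w http.ResponseWriter, r *http.Request) {
		var p Project
		if err := json.NewDecoder(r.Body).Decode(&p); err != nil || p.ID == "" {
			http.Error(w, "invalid project", http.StatusBadRequest)
			return
		}
		p.PublicKey, p.privateKey = newKey(), newKey()
		projects[p.ID] = &p
		json.NewEncoder(w).Encode(p) // returns the public key to the caller
	})

	// Rotate the public and private keys of an existing project.
	mux.HandleFunc("/admin/projects/rotate-keys", func(w http.ResponseWriter, r *http.Request) {
		p, ok := projects[r.URL.Query().Get("id")]
		if !ok {
			http.Error(w, "unknown project", http.StatusNotFound)
			return
		}
		p.PublicKey, p.privateKey = newKey(), newKey()
		json.NewEncoder(w).Encode(p)
	})

	http.ListenAndServe(":8081", mux)
}
```

    Updating project settings such as HTTP origins would follow the same pattern as the creation route.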

    Step 2: Collector extension for tracking events

    Extending the core functionality of the tracker to make more advanced analytics scenarios possible and scalable.

    Live data enrichment

    Further Enrichment of live event data

    • Anonymization of events data - for more secure scenarios having the capability to anonymize based on rules
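
    A small sketch of what rule-based anonymization could look like. The rule shape (salted hashing of selected fields, dropping others) and the event representation are assumptions for illustration:

```go
// A minimal sketch of rule-based anonymization, assuming events are simple
// string maps and fields listed in a per-project rule set are hashed or dropped
// before the event is stored.
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// AnonymizationRules lists event fields that must not be stored in clear text.
type AnonymizationRules struct {
	Salt       string
	HashFields []string // e.g. "user_id", "email"
	DropFields []string // e.g. "ip_address"
}

// Anonymize applies the rules to a single event in place.
func Anonymize(event map[string]string, rules AnonymizationRules) {
	for _, f := range rules.HashFields {
		if v, ok := event[f]; ok {
			sum := sha256.Sum256([]byte(rules.Salt + v))
			event[f] = hex.EncodeToString(sum[:])
		}
	}
	for _, f := range rules.DropFields {
		delete(event, f)
	}
}

func main() {
	e := map[string]string{"user_id": "42", "ip_address": "203.0.113.7", "name": "page_view"}
	Anonymize(e, AnonymizationRules{Salt: "per-project-salt", HashFields: []string{"user_id"}, DropFields: []string{"ip_address"}})
	fmt.Println(e) // user_id is now a salted hash, ip_address is gone
}
```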

    Post-data enrichment

    • Session aggregation and saving of data to a new table - checking when sessions have ended based on a configurable timeout, then taking all of their events and crunching out static results (original referrer, marketing tracking, duration, is new user, drop-off page, etc.) into a new sessions table to make it easier to query in the analytics stage (a sketch follows this list)
    • Funnels analysis
    • ML for funnel discovery, pattern analysis etc.
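
    A sketch of the session aggregation step described above. The event and session fields and the timeout handling are illustrative assumptions, not a final schema:

```go
// Hypothetical sketch of post-hoc session aggregation: group raw events by
// session, consider a session ended once no event arrived within a
// configurable timeout, and crunch out static results for a sessions table.
package main

import (
	"fmt"
	"sort"
	"time"
)

type RawEvent struct {
	SessionID string
	Page      string
	Referrer  string
	At        time.Time
}

type Session struct {
	ID               string
	OriginalReferrer string
	Duration         time.Duration
	DropOffPage      string
}

// AggregateEnded returns one Session per session whose last event is older
// than the timeout; still-active sessions are left for the next run.
func AggregateEnded(events []RawEvent, timeout time.Duration, now time.Time) []Session {
	bySession := map[string][]RawEvent{}
	for _, e := range events {
		bySession[e.SessionID] = append(bySession[e.SessionID], e)
	}

	var out []Session
	for id, evs := range bySession {
		sort.Slice(evs, func(i, j int) bool { return evs[i].At.Before(evs[j].At) })
		last := evs[len(evs)-1]
		if now.Sub(last.At) < timeout {
			continue // session still considered active
		}
		out = append(out, Session{
			ID:               id,
			OriginalReferrer: evs[0].Referrer,
			Duration:         last.At.Sub(evs[0].At),
			DropOffPage:      last.Page,
		})
	}
	return out
}

func main() {
	now := time.Now()
	events := []RawEvent{
		{SessionID: "s1", Page: "/", Referrer: "google.com", At: now.Add(-45 * time.Minute)},
		{SessionID: "s1", Page: "/pricing", At: now.Add(-40 * time.Minute)},
	}
	// Ended sessions would then be inserted into a dedicated sessions table.
	fmt.Println(AggregateEnded(events, 30*time.Minute, now))
}
```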

    Step 3: Import of custom production data

    As the mid-term strategy is to combine the tracked events data with already existing production data to build connections and extend the possibilities for product analytics, we would use one of the existing projects for bringing over a wide variety of data sources into the same Clickhouse DB, giving us joined analytics capabilities (for example "Show all users in organizations larger than 500 people who signed up in the last 90 days"). Possible tools: https://airbyte.com/ or https://meltano.com/

    Edited by Sebastian Rehm

    Activity

    • Tim Zallmann added epic &8417 (closed) as parent epic
    • Tim Zallmann changed title from Product Analytics Tracker to Product Analytics Tracker Component
    • Tim Zallmann changed the description ·
    • Tim Zallmann changed title from Product Analytics Tracker Component to Product Analytics Collector Component
    • Tim Zallmann changed the description ·
    • Tim Zallmann changed the description ·
    • Sam Kerr

        @timzallmann thanks for putting this together!

        • Project definition to support multiple projects on 1 instance

        This one reads a bit oddly to me. Is a project referring to itself multiple times with this? Is this referring to GitLab instance or a collector instance? I'm assuming GitLab, since we want each customer project to have its own Analytics stack (collector, Clickhouse, Cube), so a given collector instance wouldn't need to be associated with multiple projects.

      • Tim Zallmann
        Author

        A collector would be able to host multiple collector "projects" (happy to use any other naming, but this is the name Jitsu uses). This has the advantage that in the long run, multiple GitLab projects or groups can link to and show the data from one collector project - for example, the GitLab project's data project could be shared, with the Plan group only having dashboards about Plan features, etc.

        Thanks for the feedback, will try to make it clearer in the top description

    • Tim Zallmann changed the description ·
    • Tim Zallmann changed the description ·
    • Robert Hunt marked this epic as related to &8645
    • Robert Hunt

      Linked the Analytics Event Taxonomy (&8645) epic to this one, as to get a unified shared event taxonomy we're probably going to need this collector system :see_no_evil:

      • Sebastian Rehm

        @timzallmann, @stkerr I have a few further questions based on this description:

        1. Did we already discover any limitations of Jitsu for our use case apart from the fear around its future development?
        2. At least at a quick glance, I could not see a natural way to extend Jitsu with regards to Step 2 apart from forking it or directly contributing to it. Am I correct in understanding that this would be the way to go as soon as we want to extend the collector?
        3. I'm wondering, how important is it to us that the functionality from Step 3 is part of our offering from the start? At least to me this is potentially a separate product, e.g. Segment calls itself a CDP (Customer Data Platform) for being able to bundle data from lots of different sources, while our main objective is to provide easy analytics. I have previously used Segment and I'm wary of this due to two reasons:
          • It is often not flexible enough to define which data to load from the external source. As a result you reach for a tool like Meltano / Airbyte. We had multiple sources where we started with Segment and then implemented them in Meltano as well to have more freedom (e.g. load additional fields).
          • This is where a lot of potential for bugs and stuff getting out of date lies if Jitsu is not kept up to date since you rely on connecting to external APIs, which can be a moving target. From my understanding, a lot of work at Segment goes into keeping all those connectors up to date. This was also a major pain point in our usage of Meltano (Singer taps not being up to date).
        4. In my understanding Jitsu, akin to Segment, also offers the possibility to send events to other destinations apart from the Clickhouse DB. This is not mentioned here, so my assumption is that this is also not functionality that we'd want to offer?
      • Tim Zallmann
        Author

        @bastirehm

        1. For the short and mid term I'm not aware of anything

        2. You can extend the enrichment process in Jitsu - https://jitsu.com/docs/internals/jitsu-server#lookup-enrichment-step - and for post-data enrichment we would end up with additional tooling ourselves.

        3. Definitely long term, but the idea is mainly to connect an event with further metadata, so you can for example filter down on events done by users from companies with more than 500 users, etc.

        4. Good question

      • Sam Kerr

        @bastirehm

        1. This won't be critical for the start, but is where we want to go in the longer term. This is a good point to call out that there will be a potentially large maintenance cost for us building something ourselves compared to another platform maintaining them (and then the risk of bugs in that platform).
        2. This isn't something we're planning on doing. The primary goal is to provide analytics for our users to add to their apps & then have that displayed in GitLab. Providing SDKs for them to add to their apps to send the data to a non-GitLab destination is not the direction we plan to go in.

        cc @timzallmann

      • Sebastian Rehm

        @stkerr, @timzallmann

        1. From my point of view as a former user of these tools: giving me access to the underlying event database to be able to connect a tool like Airbyte on my own would already be a big selling point compared to other analytics solutions, and I'd have a lot more trust in a dedicated solution (Airbyte, Meltano ...) handling this well than Jitsu.

        I'm bringing this up in the current context since, in my opinion, it radically changes the needed effort whether we "only" need to build a scalable collector or also need to offer this additional functionality.

        Edited by Sebastian Rehm
      • Tim Zallmann
        Author

        FYI: Jitsu uses Airbyte and Singer for that task

      • Sebastian Rehm

        @timzallmann Yup, and that makes sense, but it still introduces a layer of abstraction and potential bugs (e.g. Jitsu needs to keep Airbyte up to date, vs. using Airbyte directly).

        I'm wondering: when it comes to estimating the effort to replace Jitsu, should we estimate the steps independently? At least in my mind that would make sense, since we could realistically build Step 1 as an MVC, and for Step 3 e.g. just offer documentation on how to integrate your Clickhouse instance with Airbyte.

        Edited by Sebastian Rehm
    • Sam Kerr mentioned in issue gitlab#384589
      • Mikołaj Wawrzyniak

        Thank you for putting the proposal together @timzallmann. To begin with, I wanted to ask whether the open source offering of Snowplow was already considered. Right now GitLab is using a Snowplow pipeline to collect around 50 million events per business day, and the Snowplow pipeline on AWS requires close to no maintenance once set up. It covers points 1 and 2 from the requirements list, and it allowed GitLab to implement custom user pseudonymization, which proves that it offers some flexibility to adjust the tool to individual use cases.

        Additionally, if for whatever reason Snowplow is not a viable option, we might at least consider taking some learnings from their streaming architecture, which seems to be well proven in production.

        I would also like to suggest extracting point 3 into its own epic. Building a reliable and performant ETL pipeline at scale is a challenging task: there have been multiple attempts to replace the existing GitLab.com pipeline, which has a decreasing-performance problem, with alternative ones (see: https://gitlab.com/gitlab-data/analytics/-/issues/11930), and there is another parallel effort to stream data out to downstream analytics systems (gitlab#382172). Those efforts also uncovered shortcomings of automated out-of-the-box solutions like Singer taps. Having this requirement as an independent one might enable the other two to move at a higher pace, while this one can get enough effort and attention based on expected ROI.

      • Tim Zallmann
        Author

        @mikolaj_wawrzyniak Open to any suggestions - Snowplow was considered a little bit! I thought Snowplow does not have any connector to Clickhouse, or does it? That was the main reason to take another look at alternatives; since we are this early in the process, we have the possibility to pick other systems. And especially the out-of-the-box enrichers and the work we have done so far would make it interesting.

      • Tim Zallmann
        Author

        That was one of the systems that was also interesting - https://buz.dev/

      • Mikołaj Wawrzyniak

        @timzallmann there is no official connector to CH that I am aware of, however CH is able to ingest data from S3 (see docs) and Snowplow can write to S3 (this is even how it is done at GitLab), so the only missing piece would be some sort of cron scheduler to trigger data ingestion
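
        For illustration, a minimal sketch of that missing scheduler piece could look like the following - a loop that periodically triggers ingestion via ClickHouse's s3() table function (table name, bucket path, credentials and the clickhouse-go driver are assumptions here, not a worked-out design):

```go
// A rough sketch of the "missing cron piece": periodically ask ClickHouse to
// pull Snowplow output from S3 via the s3() table function.
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/ClickHouse/clickhouse-go/v2" // assumed driver; registers "clickhouse"
)

func main() {
	// Placeholder DSN; real credentials and host would come from configuration.
	db, err := sql.Open("clickhouse", "clickhouse://default:@localhost:9000/analytics")
	if err != nil {
		log.Fatal(err)
	}

	// ClickHouse reads the files directly from S3; this job only schedules it.
	const ingest = `
		INSERT INTO events
		SELECT *
		FROM s3('https://example-bucket.s3.amazonaws.com/snowplow/*.json',
		        'AWS_KEY_ID', 'AWS_SECRET', 'JSONEachRow')`

	for range time.Tick(15 * time.Minute) {
		if _, err := db.Exec(ingest); err != nil {
			log.Printf("ingestion failed: %v", err)
		}
	}
}
```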

      • Sebastian Rehm

        @mikolaj_wawrzyniak, @timzallmann my assumption was that Snowplow is excluded because of the setup and being tied to a public cloud. I could not find any documentation around setting up Snowplow in your own infra (e.g. a Kubernetes cluster). Others also seem to have a hard time doing it. My understanding was that the fact that you could easily set it up on your own infra would be one of the main advantages of our analytics solution, since this is a blocker for many SM customers.

      • Mikołaj Wawrzyniak

        Great point @bastirehm :100: I had not considered the case where SM would like to avoid using a public cloud. Based on the implementation of the Snowplow stream collector (see repo and example Kafka configuration), it looks like it supports a Kafka sink, which at least on paper should enable administrators to run the Snowplow stack on premise without the need for a public cloud. However, this approach is not mentioned in the documentation, so I'm not sure how good and stable the support for it is.

        @timzallmann regardless of whether it would be a blocker for moving onward with Snowplow, it seems to me that it would be great to state in the requirements that the Jitsu replacement needs to be runnable on premise without a connection to any public cloud services. That also raises another concern: I believe an important factor to include when thinking about the whole Analytics stack is maintenance effort and resource cost. However convenient it might be from an engineering perspective to have full-fledged streaming infrastructure included in the stack, it might be very discouraging to end users from small to medium organisations if they need to set up and maintain more complex infrastructure just for analytics than what is used to support their product.

      • Sam Kerr

        @mikolaj_wawrzyniak

        However, this approach is not mentioned in the documentation, so I'm not sure how good and stable the support for it is.

        I added some more details about this in the epic description above.

        cc @bastirehm @timzallmann

      • Mikołaj Wawrzyniak

        Thank you @stkerr, however I'm not sure I concur with one part:

        As such, it should be built so all these use cases can use the same approach.

        As mentioned in my previous comment, different organisations might have very different needs, and a one-size-fits-all approach might lead us to a situation where no one is really happy with the result. For example, for GitLab Dedicated the usage of a public cloud would probably be very desirable, since the whole instance resides there; on the other hand, air-gapped self-managed instances might not even be able to interact with a public cloud; and finally, a small self-managed instance might not want to spend the effort and money to set up a high-scale analytics stack and wants something small and efficient. With that in mind, for the sake of an MVC I would suggest identifying a single target audience and building a solution for them, while keeping the others in mind to ensure the required flexibility so the solution can be adapted to the other groups' needs.

      • Sebastian Rehm

        @mikolaj_wawrzyniak @stkerr Should we maybe rewrite this part to target the possibility of deployment on a self-managed instance / on-prem infra as the baseline? To me this feels like the biggest opportunity compared to other analytics solutions, and if it can be deployed there, it by definition should also be possible to deploy it in e.g. a public cloud; it might just be a tad more complex than using public cloud primitives from the start.

      • Sam Kerr

        @bastirehm @mikolaj_wawrzyniak

        target the possibility of deployment on a self-managed instance / on-prem infra as baseline

        I like your proposed reframing of the problem. GitLab.com becomes just another deployment of the on-premise setup in this case.
