Product Analytics Collector Component

    Closed Epic created by Tim Zallmann

    At the moment we are using Jitsu (https://jitsu.com/) to collect tracking events for our analytics offering. Jitsu is one part of the bigger Analytics Collector stack (Jitsu, ClickHouse, Cube).

    This epic maps out all the work needed for a possible replacement of Jitsu in the stack, as well as all subsequent steps we need for product analytics. This would apply in the case that the concerns about using Jitsu are bigger than the estimated time investment that would be needed to get us there. Jitsu currently takes care of Steps 1 and 3; Step 2 would be needed either way.

    Target language and technology for a possible setup would be Go. Each bullet point below is a single issue / work item.

    Overview

    [Image: Analytics_Charts__5_]

    Requirements

    1. Be available for both self-managed and GitLab.com users
      • This collector and data store will be used for Product Analytics and other features, which will all be available on self-managed and .com.
    2. Data needs to end up in ClickHouse, since the rest of the Product Analytics stack is based on serving data from ClickHouse.
    3. If a 3rd-party project is selected, it should:
      • Be open source and have a license that allows us to use it.
      • Have a large, active community contributing to it recently.
    4. We do not need an all-in-one replacement; a combination of solutions to different parts of the space covered by Jitsu would be possible.

    Deployment models

    The collector should be built in a way that it can be deployed in a self-managed environment directly on hardware without requiring a public cloud. This will enable our self-managed users to maintain full control over their infrastructure if so desired.

    It will also mean that GitLab.com and GitLab Dedicated can use the same setup, though those will likely be deployed in a public cloud environment instead of on raw hardware.

    Step 1: Collector replacement

    Events Endpoint

    This is the actual endpoint that user projects would communicate with. The architecture is set up for high availability (HA).

    • Configuration of the ClickHouse connection
    • Project definition to support multiple collecting projects on 1 collector instance
    • Create a ClickHouse DB per project
    • HTTP/HTTPS endpoint for receiving events
    • Public keys for JS-based tracking, server-side key generation and handling
    • HTTP origins per project: configuration and filtering
    • Event data validation
    • Rate-limiting capability
    • Basic event enrichment: user-agent string parsing, location based on IP
    • Batched event saving to the ClickHouse DB with configured credentials and a configured time span
    • Basic logging setup to gain insight into the behaviour and problems of collection
    • Full own client-side SDK implementation (right now only an encapsulation)

    Configurator endpoint

    The configurator endpoint will be used to execute specific tasks in a collector instance (as collector instances run separately from GitLab itself, see &8562).

    • Endpoint to set up a new project, triggered later by the GitLab instance
    • Update project settings (HTTP origins)
    • Rotate public and private keys
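    A minimal sketch of the key-rotation task, assuming a hypothetical ProjectConfig shape and plain random hex keys; the actual key scheme and persistence are still open.

```go
// Sketch of the configurator's key-rotation task. The ProjectConfig shape
// and the hex-key format are assumptions for illustration only.
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// ProjectConfig holds per-project collector settings.
type ProjectConfig struct {
	ProjectID  string
	Origins    []string // allowed HTTP origins
	PublicKey  string   // embedded in JS-based tracking snippets
	PrivateKey string   // used server-side only
}

// newKey returns n random bytes, hex-encoded.
func newKey(n int) string {
	b := make([]byte, n)
	if _, err := rand.Read(b); err != nil {
		panic(err) // crypto/rand failing is unrecoverable here
	}
	return hex.EncodeToString(b)
}

// RotateKeys replaces both keys, invalidating previously issued ones.
func (c *ProjectConfig) RotateKeys() {
	c.PublicKey = newKey(16)
	c.PrivateKey = newKey(32)
}

func main() {
	cfg := ProjectConfig{ProjectID: "p1", Origins: []string{"https://example.com"}}
	cfg.RotateKeys()
	old := cfg.PublicKey
	cfg.RotateKeys()
	fmt.Println("rotated:", old != cfg.PublicKey)
}
```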

    Step 2: Collector extension for tracking events

    Extending the core functionality of the tracker to make more advanced analytics scenarios possible and scalable.

    Live data enrichment

    Further Enrichment of live event data

    • Anonymization of event data - for more secure scenarios, having the capability to anonymize based on rules
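    One way to implement rule-based anonymization is to replace identifier fields with an HMAC of their value under a per-project secret, so the same user still correlates across events without the raw identifier being stored. A sketch in Go; the field names and rule format are hypothetical.

```go
// Sketch of rule-based event anonymization using HMAC-SHA256 with a
// per-project secret. Field names and the rule set are illustrative.
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// anonymize returns a stable pseudonym for value under the given secret.
func anonymize(secret, value string) string {
	mac := hmac.New(sha256.New, []byte(secret))
	mac.Write([]byte(value))
	return hex.EncodeToString(mac.Sum(nil))
}

// anonymizeFields applies anonymize to every field listed in rules.
func anonymizeFields(secret string, event map[string]string, rules []string) {
	for _, field := range rules {
		if v, ok := event[field]; ok {
			event[field] = anonymize(secret, v)
		}
	}
}

func main() {
	ev := map[string]string{"user_id": "u-42", "ip": "203.0.113.7", "event": "page_view"}
	anonymizeFields("project-secret", ev, []string{"user_id", "ip"})
	fmt.Println(ev["event"], len(ev["user_id"])) // non-identifier fields stay untouched
}
```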

    Post-data enrichment

    • Session aggregation and saving of the data to a new table - detecting when sessions have ended based on a configurable timeout, then taking all events and crunching out static results (original referrer, marketing tracking, duration, is new user, drop-off page, etc.) into a new sessions table to make querying easier in the analytics stage
    • Funnels analysis
    • ML for funnel discovery, pattern analysis etc.

    Step 3: Import of custom production data

    As the mid-term strategy is to combine the tracked event data with already existing production data to build connections and extend the possibilities for product analytics, we would use one of the existing projects for bringing a wide variety of data sources over into the same ClickHouse DB, giving us joined analytics capabilities (for example, "Show all users in organizations larger than 500 people who signed up in the last 90 days"). Possible tools: https://airbyte.com/ or https://meltano.com/
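    Once imported production tables live next to the event data in the same ClickHouse database, the example question becomes a plain join. A sketch in Go that only renders the query; all table and column names are hypothetical and would depend on the import configuration.

```go
// Sketch of the illustrative joined-analytics query. Table and column names
// (users, organizations, member_count, signed_up_at) are assumptions; the
// real schema would come from the Airbyte/Meltano import setup.
package main

import "fmt"

// usersInLargeRecentOrgs renders a ClickHouse query for: users who signed up
// in the last `days` days and belong to organizations above `minSize` members.
func usersInLargeRecentOrgs(minSize, days int) string {
	return fmt.Sprintf(`SELECT u.id, u.email
FROM users AS u
INNER JOIN organizations AS o ON u.organization_id = o.id
WHERE o.member_count > %d
  AND u.signed_up_at >= now() - INTERVAL %d DAY`, minSize, days)
}

func main() {
	fmt.Println(usersInLargeRecentOrgs(500, 90))
}
```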

    Edited by Sebastian Rehm
