Product Analytics Collector Component
    Closed Epic created by Tim Zallmann

    At the moment we are using Jitsu (https://jitsu.com/) to start collecting tracking events as soon as possible for our analytics offering. Jitsu is one part of the bigger Analytics Collector Stack (Jitsu, Clickhouse, Cube).

    This epic maps out all the work needed for a possible replacement of Jitsu in the stack, as well as all consecutive steps we need for product analytics. This would apply in case the concerns about using Jitsu outweigh the estimated time investment that would be needed to get us there. Jitsu currently takes care of Steps 1 + 3; Step 2 would be needed either way.

    Target language and technology for a possible setup would be Go. Each bullet point below is a single issue / work item.

    Overview

    (Diagram: Analytics_Charts)

    Requirements

    1. Be available for both self-managed and GitLab.com users
      • This collector and data store will be used for Product Analytics and other features, which will all be available on self-managed and .com.
    2. Data needs to end up in Clickhouse, since the rest of the Product Analytics stack is based on serving data from Clickhouse.
    3. If a 3rd-party project is selected, it should:
      • Be open source and have a license that allows us to use it.
      • Have a large, active community contributing to it recently.
    4. We do not need an all-in-one replacement; a combination of solutions covering the different parts of the space that Jitsu covers would be possible.

    Deployment models

    The collector should be built in a way that it can be deployed in a self-managed environment directly on hardware without requiring a public cloud. This will enable our self-managed users to maintain full control over their infrastructure if so desired.

    It also will mean that GitLab.com and GitLab Dedicated can use the same setup, though those will likely be deployed on a public cloud environment instead of raw hardware.

    Step 1: Collector replacement

    Events Endpoint

    This is the actual endpoint that user projects would be communicating with. The architecture setup is HA (highly available). A minimal sketch of this endpoint follows the list below.

    • Configuration of Clickhouse Connection
    • Project definition to support multiple collecting projects on 1 collector instance
    • Create Clickhouse DB for project
    • HTTP/HTTPS endpoint for receiving events
    • Public Keys for JS based tracking, Server side key generation and handling
    • HTTP Origins per project configuration and filtering
    • Event data validation
    • Rate Limiting possibility
    • Basic Event enrichment for User Agent String parsing, Location based on IP
    • Batched Event saving to Clickhouse DB with configured credentials and configured time span
    • Basic Logging setup to have insight into behaviour and problems of collection
    • Full client-side SDK implementation of our own (right now an encapsulation of an existing SDK)
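
    A minimal sketch (in Go, per the target language above) of what the events endpoint could look like. The route name, the Event shape, the flush interval and the insertBatch stub are illustrative assumptions, not a decided API:

```go
// Hypothetical sketch of the events endpoint: receive events over HTTP,
// validate them, and flush batches to Clickhouse on a configurable interval.
package main

import (
	"encoding/json"
	"log"
	"net/http"
	"sync"
	"time"
)

// Event is a minimal assumed payload; the real schema would follow the
// shared event taxonomy (&8645).
type Event struct {
	ProjectID  string          `json:"project_id"`
	Name       string          `json:"name"`
	UserAgent  string          `json:"user_agent"`
	Payload    json.RawMessage `json:"payload"`
	ReceivedAt time.Time       `json:"-"`
}

type collector struct {
	mu     sync.Mutex
	buffer []Event
}

// handleEvent validates and buffers a single incoming event.
func (c *collector) handleEvent(w http.ResponseWriter, r *http.Request) {
	var e Event
	if err := json.NewDecoder(r.Body).Decode(&e); err != nil || e.ProjectID == "" || e.Name == "" {
		http.Error(w, "invalid event", http.StatusBadRequest)
		return
	}
	e.UserAgent = r.UserAgent() // basic enrichment; IP-based location would go here too
	e.ReceivedAt = time.Now().UTC()

	c.mu.Lock()
	c.buffer = append(c.buffer, e)
	c.mu.Unlock()
	w.WriteHeader(http.StatusAccepted)
}

// flushLoop periodically writes buffered events to Clickhouse.
func (c *collector) flushLoop(interval time.Duration) {
	for range time.Tick(interval) {
		c.mu.Lock()
		batch := c.buffer
		c.buffer = nil
		c.mu.Unlock()
		if len(batch) == 0 {
			continue
		}
		// insertBatch stands in for a batched INSERT via a Clickhouse client,
		// using the configured per-project credentials.
		if err := insertBatch(batch); err != nil {
			log.Printf("flush failed: %v", err) // basic logging for insight into problems
		}
	}
}

func insertBatch(events []Event) error { return nil } // stub for the sketch

func main() {
	c := &collector{}
	go c.flushLoop(5 * time.Second) // configured time span
	http.HandleFunc("/v1/events", c.handleEvent)
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

    The important property is that events are acknowledged quickly and written to Clickhouse in batches rather than once per request.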

    Configurator endpoint

    The configurator endpoint will be used to execute specific tasks on a collector instance (as collectors run separately from GitLab itself, see &8562). A rough sketch follows the list below.

    • Endpoint to set up a new project, triggered later by the GitLab instance
    • Update Project settings (HTTP Origins)
    • Rotate Public and Private Keys
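
    A rough sketch of the configurator endpoint covering project creation and key rotation. Route names, the in-memory project store and the key format are assumptions for illustration only:

```go
// Illustrative sketch only: the configurator endpoint exposes admin operations
// that a GitLab instance can call on a running collector.
package main

import (
	"crypto/rand"
	"encoding/hex"
	"encoding/json"
	"net/http"
)

// Project holds per-project collector configuration.
type Project struct {
	ID          string   `json:"id"`
	HTTPOrigins []string `json:"http_origins"` // allowed origins for JS-based tracking
	PublicKey   string   `json:"public_key"`
	privateKey  string   // never serialized back to the caller
}

var projects = map[string]*Project{} // in-memory stand-in for a real store

// newKey generates a random key; a real implementation would use proper key material.
func newKey() string {
	b := make([]byte, 16)
	rand.Read(b)
	return hex.EncodeToString(b)
}

func main() {
	mux := http.NewServeMux()

	// Create a new collecting project, triggered by the GitLab instance.
	mux.HandleFunc("/admin/projects", func(w http.ResponseWriter, r *http.Request) {
		var p Project
		if err := json.NewDecoder(r.Body).Decode(&p); err != nil || p.ID == "" {
			http.Error(w, "invalid project", http.StatusBadRequest)
			return
		}
		p.PublicKey, p.privateKey = newKey(), newKey()
		projects[p.ID] = &p
		json.NewEncoder(w).Encode(p) // returns the public key to the caller
	})

	// Rotate the public and private keys of an existing project.
	mux.HandleFunc("/admin/projects/rotate-keys", func(w http.ResponseWriter, r *http.Request) {
		p, ok := projects[r.URL.Query().Get("id")]
		if !ok {
			http.Error(w, "unknown project", http.StatusNotFound)
			return
		}
		p.PublicKey, p.privateKey = newKey(), newKey()
		json.NewEncoder(w).Encode(p)
	})

	http.ListenAndServe(":8081", mux)
}
```

    Updating project settings such as HTTP origins would follow the same pattern as the creation route.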

    Step 2: Collector extension for tracking events

    Extending the core functionality of the tracker to make more advanced analytics scenarios possible and scalable.

    Live data enrichment

    Further Enrichment of live event data

    • Anonymization of events data - for more secure scenarios having the capability to anonymize based on rules
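
    A small sketch of what rule-based anonymization could look like. The rule shape (salted hashing of selected fields, dropping others) and the event representation are assumptions for illustration:

```go
// A minimal sketch of rule-based anonymization, assuming events are simple
// string maps and fields listed in a per-project rule set are hashed or dropped
// before the event is stored.
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// AnonymizationRules lists event fields that must not be stored in clear text.
type AnonymizationRules struct {
	Salt       string
	HashFields []string // e.g. "user_id", "email"
	DropFields []string // e.g. "ip_address"
}

// Anonymize applies the rules to a single event in place.
func Anonymize(event map[string]string, rules AnonymizationRules) {
	for _, f := range rules.HashFields {
		if v, ok := event[f]; ok {
			sum := sha256.Sum256([]byte(rules.Salt + v))
			event[f] = hex.EncodeToString(sum[:])
		}
	}
	for _, f := range rules.DropFields {
		delete(event, f)
	}
}

func main() {
	e := map[string]string{"user_id": "42", "ip_address": "203.0.113.7", "name": "page_view"}
	Anonymize(e, AnonymizationRules{Salt: "per-project-salt", HashFields: []string{"user_id"}, DropFields: []string{"ip_address"}})
	fmt.Println(e) // user_id is now a salted hash, ip_address is gone
}
```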

    Post-data enrichment

    • Session aggregation and saving of data to a new table - checking when sessions have ended based on a configurable timeout, then taking all of their events and crunching out static results (original referrer, marketing tracking, duration, is new user, drop-off page, etc.) into a new sessions table to make it easier to query in the analytics stage (a sketch follows this list)
    • Funnels analysis
    • ML for funnel discovery, pattern analysis etc.
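
    A sketch of the session aggregation step described above. The event and session fields and the timeout handling are illustrative assumptions, not a final schema:

```go
// Hypothetical sketch of post-hoc session aggregation: group raw events by
// session, consider a session ended once no event arrived within a
// configurable timeout, and crunch out static results for a sessions table.
package main

import (
	"fmt"
	"sort"
	"time"
)

type RawEvent struct {
	SessionID string
	Page      string
	Referrer  string
	At        time.Time
}

type Session struct {
	ID               string
	OriginalReferrer string
	Duration         time.Duration
	DropOffPage      string
}

// AggregateEnded returns one Session per session whose last event is older
// than the timeout; still-active sessions are left for the next run.
func AggregateEnded(events []RawEvent, timeout time.Duration, now time.Time) []Session {
	bySession := map[string][]RawEvent{}
	for _, e := range events {
		bySession[e.SessionID] = append(bySession[e.SessionID], e)
	}

	var out []Session
	for id, evs := range bySession {
		sort.Slice(evs, func(i, j int) bool { return evs[i].At.Before(evs[j].At) })
		last := evs[len(evs)-1]
		if now.Sub(last.At) < timeout {
			continue // session still considered active
		}
		out = append(out, Session{
			ID:               id,
			OriginalReferrer: evs[0].Referrer,
			Duration:         last.At.Sub(evs[0].At),
			DropOffPage:      last.Page,
		})
	}
	return out
}

func main() {
	now := time.Now()
	events := []RawEvent{
		{SessionID: "s1", Page: "/", Referrer: "google.com", At: now.Add(-45 * time.Minute)},
		{SessionID: "s1", Page: "/pricing", At: now.Add(-40 * time.Minute)},
	}
	// Ended sessions would then be inserted into a dedicated sessions table.
	fmt.Println(AggregateEnded(events, 30*time.Minute, now))
}
```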

    Step 3: Import of custom production data

    As the mid-term strategy is to combine the tracked events data with already existing production data to build connections and extend the possibilities for product analytics, we would use one of the existing projects for bringing over a wide variety of data sources into the same Clickhouse DB, giving us joined analytics capabilities (for example "Show all users in organizations larger than 500 people who signed up in the last 90 days"). Possible tools: https://airbyte.com/ or https://meltano.com/

    Edited by Sebastian Rehm

    Activity

    • Tim Zallmann added epic &8417 (closed) as parent epic
    • Tim Zallmann changed title from Product Analytics Tracker to Product Analytics Tracker Component
    • Tim Zallmann changed the description ·
    • Tim Zallmann changed title from Product Analytics Tracker Component to Product Analytics Collector Component
    • Tim Zallmann changed the description ·
    • Tim Zallmann changed the description ·
    • Sam Kerr

        @timzallmann thanks for putting this together!

        • Project definition to support multiple projects on 1 instance

        This one reads a bit oddly to me. Is a project referring to itself multiple times with this? Is this referring to GitLab instance or a collector instance? I'm assuming GitLab, since we want each customer project to have its own Analytics stack (collector, Clickhouse, Cube), so a given collector instance wouldn't need to be associated with multiple projects.

      • Tim Zallmann
        Author

        A collector would be able to host multiple collector "projects" (happy to use any other naming, but this is the name Jitsu uses). This has the advantage that in the long run, multiple GitLab projects or groups can link to and show the data from one collector project - for example, the GitLab project's data project could be shared, with the Plan group only having dashboards about Plan features, etc.

        Thanks for the feedback, will try to make it clearer in the top description

    • Tim Zallmann changed the description ·
    • Tim Zallmann changed the description ·
    • Robert Hunt marked this epic as related to &8645
    • Robert Hunt

      Linked the Analytics Event Taxonomy (&8645) epic to this one, as to get a unified shared event taxonomy we're probably going to need this collector system :see_no_evil:

      • Sebastian Rehm

        @timzallmann, @stkerr I have a few further questions based on this description:

        1. Did we already discover any limitations of Jitsu for our use case apart from the fear around its future development?
        2. At least at a quick glance, I could not see a natural way to extend Jitsu with regards to Step 2 apart from forking it or directly contributing to it. Am I correct in understanding that this would be the way to go as soon as we want to extend the collector?
        3. I'm wondering, how important is it to us that the functionality from Step 3 is part of our offering from the start? At least to me this is potentially a separate product, e.g. Segment calls itself a CDP (Customer Data Platform) for being able to bundle data from lots of different sources, while our main objective is to provide easy analytics. I have previously used Segment and I'm wary of this due to two reasons:
          • It is often not flexible enough to define which data to load from the external source. As a result you reach for a tool like Meltano / Airbyte. We had multiple sources where we started with Segment and then implemented them in Meltano as well to have more freedom (e.g. load additional fields).
          • This is where a lot of potential for bugs and stuff getting out of date lies if Jitsu is not kept up to date since you rely on connecting to external APIs, which can be a moving target. From my understanding, a lot of work at Segment goes into keeping all those connectors up to date. This was also a major pain point in our usage of Meltano (Singer taps not being up to date).
        4. In my understanding Jitsu, akin to Segment, also offers the possibility to send events to other destinations apart from the Clickhouse DB. This is not mentioned here, so my assumption is that this is also not functionality that we'd want to offer?
      • Tim Zallmann
        Author

        @bastirehm

        1. For the short and mid term I'm not aware of anything

        2. You can extend the enrichment process in Jitsu - https://jitsu.com/docs/internals/jitsu-server#lookup-enrichment-step - and for post-data enrichment we would end up with additional tooling ourselves.

        3. Definitely long term, but the idea is mainly to connect an event with further metadata, so you can for example filter down on events done by users from companies with more than 500 users, etc.

        4. Good question

      • Sam Kerr

        @bastirehm

        1. This won't be critical for the start, but is where we want to go in the longer term. This is a good point to call out that there will be a potentially large maintenance cost for us building something ourselves compared to another platform maintaining them (and then the risk of bugs in that platform).
        2. This isn't something we're planning on doing. The primary goal is to provide analytics for our users to add to their apps & then have that displayed in GitLab. Providing SDKs for them to add to their apps to send the data to a non-GitLab destination is not the direction we plan to go in.

        cc @timzallmann

      • Sebastian Rehm

        @stkerr, @timzallmann

        1. From my point of view as a former user of these tools: giving me access to the underlying event database to be able to connect a tool like Airbyte on my own would already be a big selling point compared to other analytics solutions, and I'd have a lot more trust in a dedicated solution (Airbyte, Meltano ...) handling this well than Jitsu.

        I'm bringing this up in the current context since, in my opinion, it radically changes the needed effort whether we "only" need to build a scalable collector or also need to offer this additional functionality.

        Edited by Sebastian Rehm
      • Tim Zallmann
        Author

        FYI: Jitsu uses Airbyte and Singer for that task

      • Sebastian Rehm

        @timzallmann Yup, and that makes sense, but it still introduces a layer of abstraction and potential bugs (e.g. Jitsu needs to keep Airbyte up to date, vs. using Airbyte directly).

        I'm wondering: when it comes to estimating the effort to replace Jitsu, should we estimate the steps independently? At least in my mind that would make sense, since we could realistically build Step 1 as an MVC, and for Step 3 e.g. just offer documentation on how to integrate your Clickhouse instance with Airbyte.

        Edited by Sebastian Rehm
    • Sam Kerr mentioned in issue gitlab#384589
      • Mikołaj Wawrzyniak

        Thank you for putting the proposal together @timzallmann. To begin with, I wanted to ask whether the open source offering of Snowplow was already considered. Right now GitLab is using a Snowplow pipeline to collect around 50 million events per business day, and the Snowplow pipeline on AWS requires close to no maintenance once set up. It covers points 1 and 2 from the requirements list, and it allowed GitLab to implement custom user pseudonymization, which proves that it offers some flexibility to adjust the tool to individual use cases.

        Additionally, if for whatever reason Snowplow is not a viable option, we might at least consider taking some learnings from their streaming architecture, which seems to be well proven in production.

        I would also like to suggest extracting point 3 into its own epic. Building a reliable and performant ETL pipeline at scale is a challenging task: there have been multiple attempts to replace the existing GitLab.com pipeline, which has a decreasing-performance problem, with alternative ones (see: https://gitlab.com/gitlab-data/analytics/-/issues/11930), and there is another parallel effort to stream data out to downstream analytics systems (gitlab#382172). Those efforts also uncovered shortcomings of automated out-of-the-box solutions like Singer taps. Having this requirement as an independent one might enable the other two to move at a higher pace, while this one can get enough effort and attention based on expected ROI.

      • Tim Zallmann
        Author

        @mikolaj_wawrzyniak Open to any suggestions - Snowplow was considered a little bit! I thought Snowplow does not have any connector to Clickhouse, or does it? That was the main reason to take another look at alternatives; since we are this early in the process, we have the possibility to pick other systems. And especially the out-of-the-box enrichers and the work we have done so far would make it interesting.

      • Tim Zallmann
        Author

        That was one of the systems that was also interesting - https://buz.dev/

      • Mikołaj Wawrzyniak

        @timzallmann there is no official connector to CH that I am aware of, however CH is able to ingest data from S3 (see docs) and Snowplow can write to S3 (this is even how it is done at GitLab), so the only missing piece would be some sort of cron scheduler to trigger data ingestion
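
        For illustration, a minimal sketch of that missing scheduler piece could look like the following - a loop that periodically triggers ingestion via ClickHouse's s3() table function (table name, bucket path, credentials and the clickhouse-go driver are assumptions here, not a worked-out design):

```go
// A rough sketch of the "missing cron piece": periodically ask ClickHouse to
// pull Snowplow output from S3 via the s3() table function.
package main

import (
	"database/sql"
	"log"
	"time"

	_ "github.com/ClickHouse/clickhouse-go/v2" // assumed driver; registers "clickhouse"
)

func main() {
	// Placeholder DSN; real credentials and host would come from configuration.
	db, err := sql.Open("clickhouse", "clickhouse://default:@localhost:9000/analytics")
	if err != nil {
		log.Fatal(err)
	}

	// ClickHouse reads the files directly from S3; this job only schedules it.
	const ingest = `
		INSERT INTO events
		SELECT *
		FROM s3('https://example-bucket.s3.amazonaws.com/snowplow/*.json',
		        'AWS_KEY_ID', 'AWS_SECRET', 'JSONEachRow')`

	for range time.Tick(15 * time.Minute) {
		if _, err := db.Exec(ingest); err != nil {
			log.Printf("ingestion failed: %v", err)
		}
	}
}
```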

      • Sebastian Rehm

        @mikolaj_wawrzyniak, @timzallmann my assumption was that Snowplow is excluded because of the setup and being tied to a public cloud. I could not find any documentation around setting up Snowplow in your own infra (e.g. a Kubernetes cluster). Others also seem to have a hard time doing it. My understanding was that the fact that you could easily set it up on your own infra would be one of the main advantages of our analytics solution, since this is a blocker for many SM customers.

      • Mikołaj Wawrzyniak

        Great point @bastirehm :100: I had not considered the case where SM would like to avoid using a public cloud. Based on the implementation of the Snowplow stream collector (see repo and example Kafka configuration), it looks like it supports a Kafka sink, which at least on paper should enable administrators to run the Snowplow stack on premise without the need for a public cloud. However, this approach is not mentioned in the documentation, so I'm not sure how good and stable the support for it is.

        @timzallmann regardless of whether it would be a blocker for moving onward with Snowplow, it seems to me that it would be great to state in the requirements that the Jitsu replacement needs to be runnable on premise without a connection to any public cloud services. That also raises another concern: I believe an important factor to include when thinking about the whole Analytics stack is maintenance effort and resource cost. However convenient it might be from an engineering perspective to have full-fledged streaming infrastructure included in the stack, it might be very discouraging to end users from small to medium organisations if they need to set up and maintain more complex infrastructure just for analytics than what is used to support their product.

      • Sam Kerr

        @mikolaj_wawrzyniak

        However, this approach is not mentioned in the documentation, so I'm not sure how good and stable the support for it is.

        I added some more details about this in the epic description above.

        cc @bastirehm @timzallmann

      • Mikołaj Wawrzyniak

        Thank you @stkerr, however I'm not sure I concur with one part:

        As such, it should be built so all these use cases can use the same approach.

        As mentioned in my previous comment, different organisations might have very different needs, and a one-size-fits-all approach might lead us to a situation where no one is really happy with the result. For example, for GitLab Dedicated the usage of a public cloud would probably be very desirable, since the whole instance resides there; on the other hand, air-gapped self-managed instances might not even be able to interact with a public cloud; and finally, a small self-managed instance might not want to spend the effort and money to set up a high-scale analytics stack and wants something small and efficient. With that in mind, for the sake of an MVC I would suggest identifying a single target audience and building a solution for them, while keeping the others in mind to ensure the required flexibility so the solution can be adapted to the other groups' needs.

      • Sebastian Rehm

        @mikolaj_wawrzyniak @stkerr Should we maybe rewrite this part to target the possibility of deployment on a self-managed instance / on-prem infra as the baseline? To me this feels like the biggest opportunity compared to other analytics solutions, and if it can be deployed there, it by definition should also be possible to deploy it in e.g. a public cloud; it might just be a tad more complex than using public cloud primitives from the start.

      • Sam Kerr

        @bastirehm @mikolaj_wawrzyniak

        target the possibility of deployment on a self-managed instance / on-prem infra as baseline

        I like your proposed reframing of the problem. GitLab.com becomes just another deployment of the on-premise setup in this case.
