POC: Snowplow as Jitsu Replacement
Problem
There are concerns around the long-term viability and scalability of Jitsu as the basis of our Product Analytics offering. At the same time, based on the research outlined in https://gitlab.com/gitlab-org/analytics-section/product-intelligence/proposals/-/merge_requests/10, creating an alternative from scratch does not seem like a good time investment.
Snowplow is an analytics collector technology we already use. It has a few downsides but also many advantages when compared to Jitsu (see #386446 (comment 1223494530)). Its main advantages are:
- Maturity of the project
- Proven to be scalable
- Different options for the deployment
- Many pre-built tracking SDKs we could wrap for GitLab purposes
- Proven enrichment system
However, there are questions that need to be answered before we can make a decision on moving forward with Snowplow:
- What should a local setup look like that provides a complete tracking pipeline similar to what we currently have in the Analytics DevKit?
- How could such a setup be bundled for self-managed instances and individual projects on GitLab.com? The aim would be that every project that wants to add analytics gets its own isolated setup of Snowplow/ClickHouse/Cube.
- How do we efficiently get data from Snowplow into ClickHouse at GitLab.com scale?
Proposal
This POC should primarily answer the first question (local setup) while also generating ideas for the second and third questions, which could be addressed in follow-up POCs.
Desired Outcome
A basic devkit that runs Snowplow instead of Jitsu locally and streams events into ClickHouse.
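
To make the desired outcome easier to verify, a smoke test could push one event through the local pipeline. The following is only a rough sketch of the request body such a test might build, following the Snowplow tracker protocol (`tp2` endpoint, `payload_data` envelope). The host and port in `COLLECTOR_URL`, the tracker version string, and the helper names are assumptions for illustration and are not part of the devkit.

```python
import json
import time
import uuid

# Assumption: the devkit exposes a Snowplow collector on this local address.
# The host and port are hypothetical; the /tp2 path is the collector's POST endpoint.
COLLECTOR_URL = "http://localhost:9090/com.snowplowanalytics.snowplow/tp2"

# Self-describing envelope used by the Snowplow tracker protocol for POST bodies.
PAYLOAD_SCHEMA = "iglu:com.snowplowanalytics.snowplow/payload_data/jsonschema/1-0-4"


def build_page_view(app_id, page_url, page_title):
    """Build a tracker-protocol payload containing a single page view event."""
    event = {
        "e": "pv",                            # event type: page view
        "p": "web",                           # platform
        "tv": "devkit-smoke-0.1",             # tracker version (hypothetical label)
        "aid": app_id,                        # application ID
        "eid": str(uuid.uuid4()),             # unique event ID
        "dtm": str(int(time.time() * 1000)),  # device timestamp in milliseconds
        "url": page_url,
        "page": page_title,
    }
    return {"schema": PAYLOAD_SCHEMA, "data": [event]}


def encode_payload(payload):
    """Serialize the payload to the UTF-8 JSON bytes sent as the POST body."""
    return json.dumps(payload).encode("utf-8")
```

A smoke test would POST `encode_payload(build_page_view(...))` to `COLLECTOR_URL` with `Content-Type: application/json; charset=utf-8` and then check that a matching row shows up in ClickHouse. How events get from the collector into ClickHouse is exactly what the POC needs to work out, so that part is left open here.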