Snowplow as Jitsu Replacement PoC
Objective
This PoC was created to evaluate whether Snowplow can serve as a replacement for Jitsu in our Product Analytics stack. The primary aim was to create a working local setup where Snowplow collects data into Clickhouse, similar to the existing devkit. At the same time, this should help us answer the question of what a self-contained setup for Product Analytics that uses Snowplow instead of Jitsu (e.g. in a Kubernetes cluster) could look like.
Architecture
A Snowplow setup typically consists of five components:
- An SDK that sends events.
- A collector that receives events and stores them on a queue in a binary format.
- A queueing system where the collector stores events for the enricher to pick up, and where the enricher stores them again after processing.
- An enricher that picks events up from the queue and stores them back in a human-readable format (as tab-separated values).
- A loader that loads the data into a final storage system.
For the purposes of this PoC, we chose the following setup:
I reuse the existing Snowplow Micro implementation in the GitLab monolith to redirect Snowplow events from a local GitLab instance to our local collector. The PoC supports both RabbitMQ and Kafka, but was originally built with Kafka. Snowplow only recently released Collector/Enricher support for RabbitMQ, and it seems that it might not be supported in the long run, while Kafka support is in a better state. Both RabbitMQ and Kafka would generally work for our setup, but they have different advantages:
- RabbitMQ is easier to set up and scales vertically to high throughput numbers.
- Kafka has a more involved setup but is probably one of the most scalable and battle-tested solutions for event streaming (see Shopify capturing every change from their monolith on Kafka). The Kafka project is also in the process of simplifying the setup (see the removal of ZooKeeper).
- Clickhouse can pick up events directly from RabbitMQ/Kafka (PostHog uses this method for their open-source solution).
For the PoC, the IP anonymization enrichment is set up to remove the last octet of the user's IP address. Instead of using any of the Snowplow-provided loaders, we load data directly into Clickhouse via its support for consuming streams from RabbitMQ/Kafka.
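Since we bypass the Snowplow-provided loaders, the Clickhouse side of the pipeline can be sketched roughly as below. This is a simplified illustration rather than the exact PoC schema: the topic name, table names, and the four-column subset are assumptions for the sketch (a real enriched event has well over a hundred columns).

```sql
-- Kafka engine table that consumes the enricher's TSV output.
-- Topic and broker names are assumptions, not the PoC's actual values.
CREATE TABLE snowplow_events_queue
(
    app_id String,
    collector_tstamp DateTime,
    event String,
    user_ipaddress String
)
ENGINE = Kafka
SETTINGS kafka_broker_list = 'kafka:9092',
         kafka_topic_list = 'snowplow-enriched-good',
         kafka_group_name = 'clickhouse',
         kafka_format = 'TSV';

-- Final storage table for queries.
CREATE TABLE snowplow_events
(
    app_id String,
    collector_tstamp DateTime,
    event String,
    user_ipaddress String
)
ENGINE = MergeTree
ORDER BY collector_tstamp;

-- Materialized view that continuously moves rows from the queue
-- table into the storage table as they arrive on Kafka.
CREATE MATERIALIZED VIEW snowplow_events_mv TO snowplow_events AS
SELECT * FROM snowplow_events_queue;
```

The RabbitMQ variant follows the same pattern with Clickhouse's RabbitMQ engine instead of the Kafka engine, which is why the final event table looks identical in both setups.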
Local Setup
To try the setup on your own machine, do the following within this repository:
```shell
$ git checkout snowplow-poc
$ cd utils
$ bundle install
$ rake run_snowplow USE_KAFKA=true
```
This starts the relevant containers in the correct order and configures Clickhouse/Kafka to work with Snowplow. Afterwards, you can stop the setup with `docker-compose down`. You can switch to the RabbitMQ setup by removing the `USE_KAFKA` env variable and reconfiguring the Clickhouse tables with `rake reset_clickhouse_tables`. The final event table in Clickhouse looks the same regardless of the queueing/log system being used. If RabbitMQ has any problems starting or stopping, restart Docker to make it work again.
Make sure Snowplow Micro is set up in your GDK to run on port 9091 (see instructions).
In your GDK Procfile, search for `snowplow-micro` and comment out the line so that Snowplow Micro no longer starts:

```shell
# snowplow-micro: exec docker run --rm --mount type=bind,source=/Users/srehm/Development/gitlab/gitlab-development-kit/snowplow,destination=/config -p 9091:9091 snowplow/snowplow-micro:latest --collector-config /config/snowplow_micro.conf --iglu /config/iglu.json
```
Then run your GDK, e.g. with `gdk start`.
Trying it out
If you set everything up correctly, you should be able to browse through your local GDK and generate events that Clickhouse automatically picks up.
To see the events, go to localhost:18123 (password: test) and enter the following query:

```sql
SELECT * FROM snowplow_events
```
Here is a file with 100 of the resulting events: snowplow.tsv
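Beyond selecting raw rows, a small aggregate query is a convenient sanity check that events are actually flowing while you browse the GDK. This is a hypothetical example: it assumes the final table keeps the canonical Snowplow enriched-event column names (`app_id`, `event`), which should hold if the columns were mapped one-to-one.

```sql
-- Hypothetical sanity check: count collected events per app and event type.
-- Assumes the canonical Snowplow columns app_id and event exist in the table.
SELECT app_id, event, count(*) AS event_count
FROM snowplow_events
GROUP BY app_id, event
ORDER BY event_count DESC
```

If browsing the GDK generates new page views and structured events, the counts should grow as Clickhouse consumes the stream.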
You can inspect the RabbitMQ setup through its management UI at http://localhost:15672/ (user: guest, password: guest).
Thoughts on the PoC
To me this proves that a local setup comparable to Jitsu is easily possible with Snowplow, and at this point I do not see a blocker to moving forward with replacing Jitsu with Snowplow, based on its maturity, our familiarity with it, and the overall feature set (enrichments, SDKs). Using Snowplow means we would have to take care of one additional infrastructure component (Kafka/RabbitMQ) for higher-load self-managed setups. At the same time, it would give us a lot of flexibility, since Snowplow would allow us to replace Kafka/RabbitMQ with the local file system for minimal setups, or to opt for a setup based on Kinesis/Google Pub/Sub for very high-load ones (like our current Snowplow pipeline).
Not covered in the PoC
I did not spend time setting up any kind of handling for "bad" events, and I also did not verify again that all the columns match correctly. Both points should be checked before we start using this actively.
Next Steps
- Replicate the setup with Kafka instead of RabbitMQ
- Create a cloud setup and do load testing
Questions for the Reviewers
- Can you see any general blockers or open questions that would prevent us from going forward with replacing Jitsu with Snowplow based on this PoC?
- Which further changes are necessary to make this completely usable as a local setup for product analytics?
Related to gitlab-org/gitlab#388797 (closed)