
Snowplow as Jitsu Replacement PoC

Sebastian Rehm requested to merge snowplow-poc into main

Objective

This PoC was created to evaluate whether Snowplow can serve as a replacement for Jitsu in our Product Analytics stack. The primary aim of the PoC was to create a working local setup where Snowplow collects data into Clickhouse, similar to the existing devkit. At the same time, this should also help us answer the question of what a self-contained setup for Product Analytics that uses Snowplow instead of Jitsu (e.g. in a Kubernetes cluster) could look like.

Architecture

A Snowplow setup typically consists of five components:

  • An SDK that sends events.
  • A collector that receives events and stores them on a queue in a binary format.
  • An enricher that picks events up from the queue and stores them again in a human-readable format (as tab-separated values).
  • A queueing system where the collector stores events for the enricher to pick up, and where the enricher stores the enriched events again.
  • A loader that loads the data into a final storage system.

For the purposes of our PoC, we chose the following setup:

[Architecture diagram: Snowplow pipeline with Kafka]

I reuse the existing Snowplow Micro implementation in the GitLab monolith to redirect Snowplow events from a local GitLab instance to our local collector. The PoC supports both RabbitMQ and Kafka, but was originally built with Kafka. Snowplow only recently released a supported collector/enricher for RabbitMQ, and it might not be supported in the long run, while Kafka support is more mature. Both RabbitMQ and Kafka would generally work for our setup but have different advantages:

  • RabbitMQ is easier to set up and scales vertically to high throughput numbers.
  • Kafka has a more involved setup but is probably one of the most scalable and battle-tested solutions for event streaming (see Shopify capturing every change from their monolith on Kafka). They are also in the process of simplifying the setup (see the removal of ZooKeeper).
  • Clickhouse can pick up events directly from RabbitMQ/Kafka (PostHog uses this method for their open-source solution).

For the PoC, the IP anonymization enrichment is set up to remove the last octet of the user's IP address. Instead of using any of the Snowplow-provided loaders, we load data directly into Clickhouse via its support for consuming streams from RabbitMQ / Kafka.
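
Roughly, and only as an illustration (the table, topic, and field names below are my assumptions, not the PoC's actual schema), the Kafka path into Clickhouse looks like this:

-- 1. Queue-facing table: each enriched event arrives as one tab-separated line.
CREATE TABLE snowplow_enriched_queue
(
    line String
)
ENGINE = Kafka
SETTINGS
    kafka_broker_list = 'kafka:9092',             -- broker address (assumption)
    kafka_topic_list  = 'snowplow-enriched-good', -- enriched-events topic (assumption)
    kafka_group_name  = 'clickhouse-loader',
    kafka_format      = 'LineAsString';

-- 2. Final storage table, reduced here to a few of the many TSV fields.
CREATE TABLE snowplow_events_sketch
(
    app_id           String,
    collector_tstamp DateTime,
    event            String
)
ENGINE = MergeTree
ORDER BY collector_tstamp;

-- 3. Materialized view: split each TSV line and copy selected fields over.
--    Field positions are illustrative, not the canonical enriched-event layout.
CREATE MATERIALIZED VIEW snowplow_events_sketch_mv TO snowplow_events_sketch AS
SELECT
    fields[1]                          AS app_id,
    parseDateTimeBestEffort(fields[4]) AS collector_tstamp,
    fields[6]                          AS event
FROM (SELECT splitByChar('\t', line) AS fields FROM snowplow_enriched_queue);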

Local Setup

To try the setup on your own machine, do the following within this repository:

$ git checkout snowplow-poc
$ cd utils
$ bundle install
$ rake run_snowplow USE_KAFKA=true

This starts the relevant containers in the correct order and configures Clickhouse/Kafka to work with Snowplow. Afterwards you can stop the setup with docker-compose down. You can use the RabbitMQ setup by removing the USE_KAFKA env variable and reconfiguring the Clickhouse tables with rake reset_clickhouse_tables USE_KAFKA=true. The final event table in Clickhouse looks the same regardless of the queueing/log system being used. If RabbitMQ has problems starting or stopping, restart Docker to make it work again.
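
Presumably only the queue-facing table changes between the two modes; a hedged sketch of the RabbitMQ variant (exchange name and settings are assumptions, not taken from the rake task) could look like:

-- Same idea as the Kafka engine table above, but reading from RabbitMQ instead.
CREATE TABLE snowplow_enriched_queue
(
    line String
)
ENGINE = RabbitMQ
SETTINGS
    rabbitmq_host_port     = 'rabbitmq:5672',          -- broker address (assumption)
    rabbitmq_exchange_name = 'snowplow-enriched-good', -- exchange name (assumption)
    rabbitmq_format        = 'LineAsString';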

Make sure Snowplow Micro is set up in your GDK to run on port 9091 (see instructions). In your GDK Procfile, search for snowplow-micro and comment out the line so that Snowplow Micro no longer starts:

# snowplow-micro: exec docker run --rm --mount type=bind,source=/Users/srehm/Development/gitlab/gitlab-development-kit/snowplow,destination=/config -p 9091:9091 snowplow/snowplow-micro:latest --collector-config /config/snowplow_micro.conf --iglu /config/iglu.json

Then run your GDK, e.g. with gdk start.

Trying it out

If you set everything up correctly, you should be able to browse through your local GDK and generate events that Clickhouse automatically picks up. To see the events, go to localhost:18123 (password: test) and enter the following query:

SELECT * FROM snowplow_events
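
For a quick sanity check you can also aggregate the collected events; assuming the canonical Snowplow column names event and collector_tstamp, something like:

-- Count the events of the last hour per event type (column names assumed).
SELECT event, count() AS events
FROM snowplow_events
WHERE collector_tstamp > now() - INTERVAL 1 HOUR
GROUP BY event
ORDER BY events DESC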

Here is a file with 100 of the resulting events: snowplow.tsv

You can check out the RabbitMQ setup in a graphical manner at http://localhost:15672/ (user: guest, password: guest)

Thoughts on the PoC

To me this proves that a local setup comparable to Jitsu is easily possible with Snowplow, and at this point I do not see a blocker for moving forward with replacing Jitsu with Snowplow, based on its maturity, our familiarity with it, and the overall feature set (enrichments, SDKs). Using Snowplow means that we'd have to take care of one additional infrastructure component (Kafka/RabbitMQ) for higher-load self-managed setups. At the same time it would give us a lot of flexibility, since Snowplow would allow us to replace Kafka/RabbitMQ with just the local file system for minimal setups, or to opt for a setup based on Kinesis/Google Pub/Sub for very high-load ones (like our current Snowplow pipeline).

Not covered in the PoC

I did not spend time setting up any kind of handling for "bad" events, and I also did not verify again that all the columns match correctly. This should be checked before we start to use this actively.

Next Steps

  • Replicate the setup with Kafka instead of RabbitMQ
  • Create a cloud setup and do load testing

Question to the Reviewers

  • Can you see any general blockers or open questions that would prevent us from going forward with replacing Jitsu with Snowplow based on this PoC?

  • Which further changes are necessary to make this fully usable as a local setup for product analytics?

Related to gitlab-org/gitlab#388797 (closed)
