
Replace jitsu with snowplow for gathering data

Sebastian Rehm requested to merge bastirehm-snowplow-devkit-conversion into main

Overview

This MR adds Snowplow and Kafka to the devkit. In general Snowplow works as shown here:

(Diagram: Snowplow → Kafka → ClickHouse)

Four new components are added to the docker-compose file:

  • snowplow_collector: The endpoint that receives events and writes them to Kafka. It is currently configured to listen on localhost:9091, since this is also where a local GDK sends events by default.
  • snowplow_enrich: This component takes raw events from Kafka, validates that they comply with their schemas, and could optionally apply additional enrichments.
  • kafka: Kafka serves as intermediate storage for the raw and enriched events before they are ingested into ClickHouse.
  • zookeeper: ZooKeeper is responsible for coordinating the Kafka cluster.

The enriched events are stored in Kafka as lines of tab-separated values (TSV). The script at utils/clickhouse_snowplow_setup.rb creates the initial tables in ClickHouse to ingest the data. The setup, which uses the Kafka engine in ClickHouse, works in the following way:

  1. A queue table, in our case snowplow_queue, pulls data from a Kafka topic as soon as it is available. Its columns map to the TSV structure of Snowplow. This queue table cannot be queried directly; its sole purpose is the ingestion of data.
  2. One or more long-term storage tables have the same columns as the original snowplow_queue.
  3. One or more materialized views, called snowplow_consumer in our case, select the appropriate data from snowplow_queue and move it into the correct snowplow_events table. The currently configured consumer only takes events with an empty app_id, since the GDK by default sends all events with an empty app_id. (A SQL sketch of this setup follows the list.)
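For orientation, here is a minimal sketch of what this looks like as ClickHouse DDL. The column list is heavily abbreviated and the broker/topic settings are placeholders, not the devkit's actual values; the real statements are generated by utils/clickhouse_snowplow_setup.rb.

-- Minimal sketch; abbreviated columns, placeholder broker/topic names.

-- 1. Queue table: consumes Snowplow TSV rows from a Kafka topic.
CREATE TABLE snowplow_queue
(
    app_id String,
    collector_tstamp DateTime64(3),
    event String,
    event_id String
    -- ...the remaining Snowplow enriched-event columns...
)
ENGINE = Kafka
SETTINGS kafka_broker_list = 'kafka:9092',
         kafka_topic_list = 'snowplow_enriched',
         kafka_group_name = 'clickhouse_consumer',
         kafka_format = 'TSV';

-- 2. Long-term storage table with the same columns.
CREATE TABLE snowplow_events
(
    app_id String,
    collector_tstamp DateTime64(3),
    event String,
    event_id String
    -- ...the remaining columns...
)
ENGINE = MergeTree
ORDER BY collector_tstamp;

-- 3. Materialized view: moves rows with an empty app_id (the GDK default)
--    from the queue into the storage table as they arrive.
CREATE MATERIALIZED VIEW snowplow_consumer TO snowplow_events AS
SELECT *
FROM snowplow_queue
WHERE app_id = '';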

How to run

Setup

  1. Run docker compose -f docker-compose.yml -f docker-compose.snowplow.yml up to start all services. You might need to recreate the clickhouse service so that the necessary SQL files are available.
  2. Run rake setup_snowplow_queue to set up the initial queue that ingests from Kafka.
  3. Run rake setup_snowplow_events app_id=[your-app-id] to set up a table called snowplow_events inside a database called [your-app-id]_db, together with a view that moves the data with app_id=[your-app-id] from the main queue into your database. For usage with the GDK, omit the app_id. (See the sketch after this list for roughly what this creates.)
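As an illustration only, the app_id variant creates something along these lines for a hypothetical app_id of my_app; the exact names, columns, and DDL come from utils/clickhouse_snowplow_setup.rb.

-- Hypothetical example for app_id=my_app; names and columns are assumptions.
CREATE DATABASE my_app_db;

-- Storage table with the same (abbreviated here) columns as snowplow_queue.
CREATE TABLE my_app_db.snowplow_events
(
    app_id String,
    collector_tstamp DateTime64(3),
    event String,
    event_id String
    -- ...the remaining columns...
)
ENGINE = MergeTree
ORDER BY collector_tstamp;

-- View that pulls only this project's events out of the shared queue.
CREATE MATERIALIZED VIEW my_app_db.snowplow_consumer
TO my_app_db.snowplow_events AS
SELECT *
FROM snowplow_queue
WHERE app_id = 'my_app';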

Getting events from GDK

  1. Make sure Snowplow Micro is set up in your GDK to run on port 9091 (see instructions).
  2. In your GDK Procfile, search for snowplow-micro and comment out the following line so that Snowplow Micro no longer starts:
# snowplow-micro: exec docker run --rm --mount type=bind,source=/Users/srehm/Development/gitlab/gitlab-development-kit/snowplow,destination=/config -p 9091:9091 snowplow/snowplow-micro:latest --collector-config /config/snowplow_micro.conf --iglu /config/iglu.json

Seeing the events

  1. Go to http://localhost:18123 and enter the password test.
  2. Then run your GDK, e.g. with gdk start.
  3. Click around in your local GitLab instance.
  4. Run SELECT * FROM snowplow_events ORDER BY collector_tstamp DESC to see the incoming events (an example query is shown below).
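For example, to see the most recent events with a readable column subset (column names follow the standard Snowplow enriched-event model; adjust them if the devkit tables differ):

-- Most recent events first; limit columns and rows to keep the output readable.
SELECT event_id, event, app_id, collector_tstamp
FROM snowplow_events
ORDER BY collector_tstamp DESC
LIMIT 20;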

Open questions

  • Integration with Cube in general in the devkit
  • Multi-project setup
  • Do we need sample data to start with?

Relates to gitlab-org/gitlab#390846 (closed)

