
Replace jitsu with snowplow for gathering data

Sebastian Rehm requested to merge bastirehm-snowplow-devkit-conversion into main

Overview

This MR adds Snowplow and Kafka to the devkit. In general Snowplow works as shown here:

(Diagram: Snowplow → Kafka → ClickHouse)

Four new components are added to the docker-compose file:

  • snowplow_collector: The endpoint that receives events and writes them to Kafka. It is currently configured to listen on localhost:9091, since this is also where a local GDK sends events by default.
  • snowplow_enrich: This component takes raw events from Kafka, validates that they comply with their schemas, and could optionally apply additional enrichments.
  • kafka: Kafka serves as intermediate storage for the raw and enriched events before they are ingested into ClickHouse.
  • zookeeper: ZooKeeper is responsible for coordinating the Kafka cluster.

The enriched events are stored in Kafka as lines of tab-separated values (TSV). The script at utils/clickhouse_snowplow_setup.rb creates the initial tables in ClickHouse to ingest the data. The setup, which uses the Kafka engine in ClickHouse, works in the following way:

  1. A queue table, in our case snowplow_queue, pulls data from a Kafka topic as soon as it is available. Its columns map to the TSV structure of Snowplow. This queue table cannot be queried directly; its sole purpose is the ingestion of data.
  2. One or more long-term storage tables have the same columns as the original snowplow_queue.
  3. One or more materialized views, called snowplow_consumer in our case, select the appropriate data from snowplow_queue and move it into the correct snowplow_events table. The currently configured consumer only takes events with an empty app_id, since the GDK by default sends all events with an empty app_id. (A SQL sketch of this setup follows the list.)
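For orientation, here is a minimal sketch of what this looks like as ClickHouse DDL. The column list is heavily abbreviated and the broker/topic settings are placeholders, not the devkit's actual values; the real statements are generated by utils/clickhouse_snowplow_setup.rb.

-- Minimal sketch; abbreviated columns, placeholder broker/topic names.

-- 1. Queue table: consumes Snowplow TSV rows from a Kafka topic.
CREATE TABLE snowplow_queue
(
    app_id String,
    collector_tstamp DateTime64(3),
    event String,
    event_id String
    -- ...the remaining Snowplow enriched-event columns...
)
ENGINE = Kafka
SETTINGS kafka_broker_list = 'kafka:9092',
         kafka_topic_list = 'snowplow_enriched',
         kafka_group_name = 'clickhouse_consumer',
         kafka_format = 'TSV';

-- 2. Long-term storage table with the same columns.
CREATE TABLE snowplow_events
(
    app_id String,
    collector_tstamp DateTime64(3),
    event String,
    event_id String
    -- ...the remaining columns...
)
ENGINE = MergeTree
ORDER BY collector_tstamp;

-- 3. Materialized view: moves rows with an empty app_id (the GDK default)
--    from the queue into the storage table as they arrive.
CREATE MATERIALIZED VIEW snowplow_consumer TO snowplow_events AS
SELECT *
FROM snowplow_queue
WHERE app_id = '';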

How to run

Setup

  1. Run docker compose -f docker-compose.yml -f docker-compose.snowplow.yml up to start all services. You might need to recreate the clickhouse service so that the necessary SQL files are available.
  2. Run rake setup_snowplow_queue to set up the initial queue that ingests from Kafka.
  3. Run rake setup_snowplow_events app_id=[your-app-id] to set up a table called snowplow_events inside a database called [your-app-id]_db, together with a view that moves the data with app_id=[your-app-id] from the main queue into your database. For usage with the GDK, omit the app_id. (See the sketch after this list for roughly what this creates.)
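As an illustration only, the app_id variant creates something along these lines for a hypothetical app_id of my_app; the exact names, columns, and DDL come from utils/clickhouse_snowplow_setup.rb.

-- Hypothetical example for app_id=my_app; names and columns are assumptions.
CREATE DATABASE my_app_db;

-- Storage table with the same (abbreviated here) columns as snowplow_queue.
CREATE TABLE my_app_db.snowplow_events
(
    app_id String,
    collector_tstamp DateTime64(3),
    event String,
    event_id String
    -- ...the remaining columns...
)
ENGINE = MergeTree
ORDER BY collector_tstamp;

-- View that pulls only this project's events out of the shared queue.
CREATE MATERIALIZED VIEW my_app_db.snowplow_consumer
TO my_app_db.snowplow_events AS
SELECT *
FROM snowplow_queue
WHERE app_id = 'my_app';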

Getting events from GDK

  1. Make sure Snowplow Micro is set up in your GDK to run on port 9091 (see instructions).
  2. In your GDK Procfile, search for snowplow-micro and comment out the following line so that Snowplow Micro no longer starts:
# snowplow-micro: exec docker run --rm --mount type=bind,source=/Users/srehm/Development/gitlab/gitlab-development-kit/snowplow,destination=/config -p 9091:9091 snowplow/snowplow-micro:latest --collector-config /config/snowplow_micro.conf --iglu /config/iglu.json

Seeing the events

  1. Go to http://localhost:18123 and enter the password test.
  2. Then run your GDK, e.g. with gdk start.
  3. Click around in your local GitLab instance.
  4. Run SELECT * FROM snowplow_events ORDER BY collector_tstamp DESC to see the incoming events (an example query is shown below).
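For example, to see the most recent events with a readable column subset (column names follow the standard Snowplow enriched-event model; adjust them if the devkit tables differ):

-- Most recent events first; limit columns and rows to keep the output readable.
SELECT event_id, event, app_id, collector_tstamp
FROM snowplow_events
ORDER BY collector_tstamp DESC
LIMIT 20;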

Open questions

  • Integration with Cube in general in the devkit
  • Multi-project setup
  • Do we need sample data to start with?

Relates to gitlab-org/gitlab#390846 (closed)

