# Replace Jitsu with Snowplow for gathering data
## Overview
This MR adds Snowplow and Kafka to the devkit. In general, events are received by a collector, buffered in Kafka, validated and enriched, and finally ingested into ClickHouse.
Four new components are added to the docker-compose file:

- `snowplow_collector`: The endpoint that receives events and stores them in Kafka. It is currently configured to listen on `localhost:9091`, since this is also where a local GDK sends events by default.
- `snowplow_enrich`: Takes events from Kafka, checks that they comply with the schema, and could optionally perform additional enrichments.
- `kafka`: Serves as intermediary storage for received raw events and enriched events before they are ingested into ClickHouse.
- `zookeeper`: Responsible for coordinating and scaling Kafka.
The enriched events are stored as lines of tab-separated values (TSV) in Kafka. The script at `utils/clickhouse_snowplow_setup.rb` creates the initial tables in ClickHouse to ingest the data. The setup using the Kafka engine in ClickHouse works in the following way (a SQL sketch follows the list):

- A queue table, in our case `snowplow_queue`, pulls data from a Kafka topic as soon as it is available. Its columns map to the TSV structure of Snowplow. This queue table cannot be queried directly; its sole purpose is the ingestion of data.
- One or multiple long-term storage tables that have the same columns as the original `snowplow_queue`.
- One or multiple materialized views, called `snowplow_consumer` in our case, select the appropriate data from the `snowplow_queue` and move it into the correct `snowplow_events` table. The currently configured consumer only takes the events with an empty `app_id`, since the GDK by default has an empty `app_id` for any events being sent.
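
A minimal sketch of this pattern, assuming the broker address, topic name, consumer group, and column subset shown here (the real Snowplow enriched-event TSV has far more columns; the authoritative DDL lives in `utils/clickhouse_snowplow_setup.rb`):

```sql
-- Queue table: pulls TSV rows straight from Kafka. Cannot be queried directly.
-- Column list heavily abbreviated for illustration.
CREATE TABLE snowplow_queue
(
    app_id           String,
    collector_tstamp DateTime64(3),
    event_name       String
    -- ...remaining Snowplow TSV columns...
)
ENGINE = Kafka
SETTINGS kafka_broker_list = 'kafka:9092',       -- assumption: broker address inside the compose network
         kafka_topic_list  = 'enriched',         -- assumption: topic written by snowplow_enrich
         kafka_group_name  = 'clickhouse_consumer',
         kafka_format      = 'TSV';

-- Long-term storage table with the same columns as the queue.
CREATE TABLE snowplow_events AS snowplow_queue
ENGINE = MergeTree
ORDER BY collector_tstamp;

-- Materialized view: continuously moves rows with an empty app_id
-- (the GDK default) from the queue into the storage table.
CREATE MATERIALIZED VIEW snowplow_consumer TO snowplow_events
AS SELECT * FROM snowplow_queue WHERE app_id = '';
```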
## How to run

### Setup
- Run `docker compose -f docker-compose.yml -f docker-compose.snowplow.yml up` to start all services. You might need to recreate the `clickhouse` service so that the necessary SQL files are available.
- Run `rake setup_snowplow_queue` to set up the initial queue that ingests from Kafka.
- Run `rake setup_snowplow_events app_id=[your-app-id]` to set up a table called `snowplow_events` inside a database called `[your-app-id]_db`, and a view that moves the data with `app_id=[your-app-id]` from the main queue into your database (a sketch of the resulting DDL follows this list). For usage with the GDK, omit the `app_id`.
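
For illustration, here is roughly what that rake task plausibly creates for `app_id=myapp` (`myapp` and `myapp_db` are placeholders; the authoritative DDL is generated by `utils/clickhouse_snowplow_setup.rb`):

```sql
-- Assumption: illustrative DDL only; names derived from the description above.
CREATE DATABASE IF NOT EXISTS myapp_db;

-- Storage table mirroring the shared queue's columns.
CREATE TABLE myapp_db.snowplow_events AS snowplow_queue
ENGINE = MergeTree
ORDER BY collector_tstamp;

-- View that routes only this app's events out of the shared queue.
CREATE MATERIALIZED VIEW myapp_db.snowplow_consumer TO myapp_db.snowplow_events
AS SELECT * FROM snowplow_queue WHERE app_id = 'myapp';
```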
### Getting events from GDK
- Make sure Snowplow Micro is set up in your GDK to run on port 9091 (see instructions).
- In your GDK Procfile, search for `snowplow-micro` and comment out the following line so that Snowplow Micro no longer starts:
```
# snowplow-micro: exec docker run --rm --mount type=bind,source=/Users/srehm/Development/gitlab/gitlab-development-kit/snowplow,destination=/config -p 9091:9091 snowplow/snowplow-micro:latest --collector-config /config/snowplow_micro.conf --iglu /config/iglu.json
```
### Seeing the events
- Go to http://localhost:18123 and enter the password `test`.
- Then run your GDK, e.g. with `gdk start`.
- Click around in your local GitLab version.
- Run `SELECT * FROM snowplow_events ORDER BY collector_tstamp DESC` to see the incoming events (a broader example query follows this list).
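
Beyond the raw `SELECT`, a grouped query gives a quick overview of which event types are arriving; this sketch assumes the standard Snowplow `event_name` column is mapped in the table:

```sql
-- Events per type over the last hour.
SELECT event_name, count() AS events
FROM snowplow_events
WHERE collector_tstamp > now() - INTERVAL 1 HOUR
GROUP BY event_name
ORDER BY events DESC;
```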
## Open questions
- Integration with Cube in general in the devkit
- Multi-project setup
- Do we need sample data for the start?
Relates to gitlab-org/gitlab#390846