Set up a Snowplow log parser for stored event data
Description
We'd like to use Snowplow for tracking pageviews and events on GitLab.com. In Snowplow, trackers fire events, which are received and logged by collectors. Trackers send data to collectors by making a GET
request for a tracking pixel.
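To make that concrete: the pixel request is just a GET to the collector with the event encoded in the query string. The sketch below builds such a URL for a pageview event; the collector host is hypothetical, and the parameter names (`e=pv` for pageview, `url`, `page`, `aid`, `p`) follow Snowplow's tracker protocol.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical collector endpoint -- substitute the real host once
# a collector is set up. "/i" is the conventional pixel path.
COLLECTOR = "https://collector.example.com/i"

def pageview_pixel_url(page_url, page_title, app_id="gitlab"):
    """Build the tracking-pixel GET URL for a single pageview event."""
    params = {
        "e": "pv",           # event type: pageview
        "url": page_url,     # URL of the page being viewed
        "page": page_title,  # page title
        "aid": app_id,       # application id
        "p": "web",          # platform
    }
    return COLLECTOR + "?" + urlencode(params)

url = pageview_pixel_url("https://gitlab.com/explore", "Explore")
```

In practice the tracker JavaScript constructs this request itself; the sketch only illustrates what ends up in the collector's logs for the enrichment step to parse.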
Once log data is sent to a collector, we then need a runner to parse/clean the log data and send it to S3. Once it's in S3, we can ETL it into our data warehouse where it can be visualized in Looker.
Proposal
This step is referred to as "Enrich" in Snowplow's pipeline, which is detailed in their documentation and this meta issue. It seems likely that we'll use Snowplow's EmrEtlRunner
for this.
This step is dependent on setting up a collector and having logfiles to enrich by tracking events and sending them to the collector.
As noted in the setup documentation for EmrEtlRunner, we'll need to:
- Install and host EmrEtlRunner somewhere.
- Configure it to process and enrich data from the collector, and schedule it to run periodically.
Specific configuration/enrichments are TBD.
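While the specifics are TBD, a minimal sketch of what the EmrEtlRunner `config.yml` will need to cover may help scope the work. All bucket names and values below are illustrative placeholders, not our actual settings; the key structure follows the sample config in the EmrEtlRunner setup documentation.

```yaml
# Illustrative sketch only -- see the sample config.yml in the
# EmrEtlRunner docs for the full set of required keys.
aws:
  access_key_id: <%= ENV['AWS_ACCESS_KEY_ID'] %>
  secret_access_key: <%= ENV['AWS_SECRET_ACCESS_KEY'] %>
  s3:
    region: us-east-1          # placeholder region
    buckets:
      log: s3://example-snowplow/logs            # EMR job logs
      raw:
        in:
          - s3://example-snowplow/collector-logs # collector output to enrich
        processing: s3://example-snowplow/raw/processing
        archive: s3://example-snowplow/raw/archive
      enriched:
        good: s3://example-snowplow/enriched/good
        bad: s3://example-snowplow/enriched/bad
  emr:
    region: us-east-1          # where the EMR cluster runs
```

Scheduling would then come down to invoking `snowplow-emr-etl-runner run` against this config on a recurring basis (e.g. from cron), per the setup documentation.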
Links / references
- Meta issue for setting up Snowplow on GitLab.com: https://gitlab.com/gitlab-org/gitlab-ee/issues/6329
- Snowplow documentation: https://github.com/snowplow/snowplow/wiki