Set up a Snowplow collector for tracking events on GitLab.com
Description
We'd like to use Snowplow for tracking pageviews and events on GitLab.com. In Snowplow, trackers fire events, which are received and logged by collectors. Trackers send data to collectors by making a GET request for a tracking pixel.
This issue tracks the creation of a collector, which will merrily log these requests. We can't do tracking without a collector.
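To make the tracker/collector exchange concrete, here is a minimal sketch of what a pixel collector does with one incoming request. It is not Snowplow's implementation: the function name `handle_pixel_request`, the in-memory `log` list, and the `/i` path with the `e`, `url`, and `aid` parameters are assumptions chosen for illustration.

```python
import base64
from urllib.parse import urlsplit, parse_qs

# A transparent 1x1 GIF, the payload a tracking pixel returns.
PIXEL = base64.b64decode(
    "R0lGODlhAQABAIAAAAAAAP///yH5BAEAAAAALAAAAAABAAEAAAIBRAA7"
)


def handle_pixel_request(request_target, log):
    """Record the query parameters of one GET request and return the pixel.

    `request_target` is the path plus query string of the request.
    `log` is any list-like sink; a real collector would write to S3 or a stream.
    Returns a (status, headers, body) tuple.
    """
    parts = urlsplit(request_target)
    if parts.path != "/i":  # assumed pixel path for this sketch
        return 404, {}, b""
    # Keep the first value of each parameter; percent-encoding is decoded.
    event = {key: values[0] for key, values in parse_qs(parts.query).items()}
    log.append(event)
    return 200, {"Content-Type": "image/gif"}, PIXEL


log = []
status, headers, body = handle_pixel_request(
    "/i?e=pv&url=https%3A%2F%2Fgitlab.com%2F&aid=gitlab", log
)
```

The event data travels entirely in the query string, so the collector's only job is to log the request and answer with the image.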
Proposal
Snowplow collectors use a tracking pixel and log GET requests for the pixel. To stand up a collector, we need to do two things:
- Decide on the collector we'd like to use. The 3 collector options are described here, with the CloudFront Collector being the most commonly used. The Scala Stream Collector also seems to be stable and is recommended for future usage.
  - The CloudFront Collector places its logs in S3.
  - The Scala Stream Collector pushes its logs to a Kinesis stream.
- Set up the collector, as described in the installation guide.
  - We should probably set up at least 2 collector groups: one pixel/logs for GitLab.com production and one for staging/testing.
  - On AWS, compute-optimized instances are preferred.
  - We should have at least 3-4 instances to increase log/shard parallelization.
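As a rough sketch of how separate production and staging collector groups might look from the tracker side, the snippet below picks a collector host per environment and builds the pixel URL. The hostnames, the `COLLECTORS` mapping, the `pixel_url` helper, and the `/i` path are all hypothetical placeholders, not values from the installation guide.

```python
from urllib.parse import urlencode

# Hypothetical collector hostnames, one per collector group.
COLLECTORS = {
    "production": "collector.example.com",
    "staging": "collector-staging.example.com",
}


def pixel_url(environment, params):
    """Build the tracking-pixel URL for the given environment."""
    host = COLLECTORS[environment]
    return f"https://{host}/i?{urlencode(params)}"


staging_url = pixel_url("staging", {"e": "pv", "aid": "gitlab"})
```

Keeping the two groups behind different hostnames means staging and test traffic never lands in the production logs.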
Links / references
- Meta issue for setting up Snowplow on GitLab.com: https://gitlab.com/gitlab-org/gitlab-ee/issues/6329
- Snowplow documentation: https://github.com/snowplow/snowplow/wiki