Snowplow is an open-source event analytics platform that collects and processes behavioral event data. A commercial entity offers Snowplow as SaaS and also maintains the open-source product. The platform's [architecture overview](https://github.com/snowplow/snowplow/#snowplow-technology-101) and source code are available on GitHub, with comprehensive documentation on how it works and how it is set up.
GitLab has managed its own Snowplow infrastructure since June 2019, when we transitioned from a third-party service to self-hosted infrastructure. From the data team's perspective, the core event flow remained the same: events are sent through the collector and enricher, then stored in S3.
As of December 2024, we have switched over to a new Snowplow env called `aws-snowplow-prd`, more information is detailed in the [Snowplow internal handbook](https://gitlab.com/gitlab-com/content-sites/internal-handbook/-/blob/main/content/handbook/enterprise-data/platform/infrastructure/_index.md?ref_type=heads#aws-snowplow-for-cpaa-internal-analytics).
#### Snowplow - adding new `app_id`
When a new application should be tracked by `Snowplow`, here are a few things to consider.

Choosing the right `app_id` and collector URL should be done in coordination with the data team.
The collector URL stays the same: `snowplowprd.trx.gitlab.net`. Any `app_id` is fine if there are no other concerns around enabling tracking on `CustomersPortal` staging as well.
> **Note:** Any unexpected events *(with a wrong `app_id`)* are normally dropped.
The only model with `app_id` filtering is [snowplow_base_events](https://dbt.gitlabdata.com/#!/model/model.snowplow.snowplow_base_events), which flows downstream to the page view models.
To add a new `app_id` to [snowplow_base_events](https://dbt.gitlabdata.com/#!/model/model.snowplow.snowplow_base_events) *(and the downstream page view models)*, update the `snowplow:app_ids`
variable in the dbt package. Those values are set in the `dbt_project.yml`
file. As an example, [here is an issue](https://gitlab.com/gitlab-data/analytics/-/issues/16552) to update the variable.
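For illustration only, a `dbt_project.yml` excerpt setting the variable might look like the following; the `app_id` values shown here are placeholders, not the real list:

```yaml
# dbt_project.yml (excerpt) -- placeholder app_ids for illustration only
vars:
  snowplow:
    'snowplow:app_ids': ['gitlab', 'new_app_staging']
```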
#### GitLab Implementation
The original design document for moving our Snowplow infrastructure from a third-party hosting service to first-party is in the [Infrastructure design library](https://gitlab.com/gitlab-com/gl-infra/readiness/-/tree/master/library/snowplow). It was written before the build started and contains many of the assumptions and design decisions.
Snowplow is built with Terraform on AWS documented in the [`config-mgmt` project](https://ops.gitlab.net/gitlab-com/gl-infra/config-mgmt/-/blob/main/environments/aws-snowplow/README.md).
For a detailed walk-through of our setup, watch [this GitLab Unfiltered internal video](https://www.youtube.com/watch?v=fK9aw3bHFBg&feature=youtu.be).
Enriched events are stored in TSV format in the bucket `s3://gitlab-com-snowplow-events/output/`.
Bad events are stored as JSON in `s3://gitlab-com-snowplow-events/enriched-bad/`.
For both buckets, there are paths that follow a date format of `/YYYY/MM/DD/HH/<data>`.
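As a sketch of that layout, the hourly key prefix for a given event time can be derived like this (the helper function is hypothetical; only the `/YYYY/MM/DD/HH/` layout comes from the setup above):

```python
from datetime import datetime

# Build the hourly S3 key prefix for a given event time, following the
# /YYYY/MM/DD/HH/ layout used by both the good and bad event buckets.
# The function name is illustrative, not part of the actual pipeline.
def hourly_prefix(bucket_path: str, ts: datetime) -> str:
    return f"{bucket_path}{ts:%Y/%m/%d/%H}/"

print(hourly_prefix("s3://gitlab-com-snowplow-events/output/", datetime(2019, 7, 1, 5)))
# s3://gitlab-com-snowplow-events/output/2019/07/01/05/
```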
#### Data Warehousing
<details><summary>Click to expand</summary>
#### Snowpipe
Once events are available in S3, we ingest them into the data warehouse using [Snowpipe](https://docs.snowflake.net/manuals/user-guide/data-load-snowpipe-intro.html#introduction-to-snowpipe). This is a feature of our Snowflake Data Warehouse.
An [Amazon SQS](https://aws.amazon.com/sqs/) event queue was set up for the good and bad event paths.
To run properly, Snowpipe needs a "stage" in Snowflake and a table to write to.
The good and bad S3 paths each have their own Stage within Snowflake.
These are named `gitlab_events` and `gitlab_bad_events`, respectively. They are owned by the `LOADER` role.
The good and bad events each have their own target table as well.
To force a refresh of the pipe so that Snowpipe picks up older events:

```sql
ALTER PIPE gitlab_good_event_pipe REFRESH;
```
</details>
#### dbt
To materialize data from the RAW database to PROD for querying, we have implemented a partitioning strategy within dbt. By default, the snowplow models and the [Fishtown snowplow package](https://github.com/dbt-labs/snowplow) will write to a schema scoped to the current month in the PREP database. For July 2019, the schema would be `snowplow_2019_07`.
Within each monthly partition all of the base models and the models generated by the package are written for all events that have a derived timestamp that matches the partition date. Different monthly partitions can be generated by passing in variables to dbt at run time:
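As a minimal sketch of the naming convention only (the helper function is hypothetical; the `snowplow_YYYY_MM` pattern comes from the example above):

```python
# Derive the monthly partition schema name; the function name is
# illustrative, the "snowplow_YYYY_MM" convention is from the text above.
def partition_schema(year: int, month: int) -> str:
    return f"snowplow_{year}_{month:02d}"

print(partition_schema(2019, 7))  # snowplow_2019_07
```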
Some models downstream of the monthly partitions (ex. [fct_behavior_structured_event](https://gitlab.com/gitlab-data/analytics/-/blob/master/transform/snowflake-dbt/models/common/facts_sales_and_marketing/fct_behavior_structured_event.sql) and [mart_behavior_structured_event](https://gitlab.com/gitlab-data/analytics/-/blob/master/transform/snowflake-dbt/models/common_mart/mart_behavior_structured_event.sql)) use the `incremental_backfill_date` variable to set the start and end for a backfill. This can be used locally to test backfilling a month of data by using the following variable at run time:
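For example, a local backfill run for July 2019 could be invoked with a command shaped like the following (a sketch only; model selection flags are omitted and the exact invocation may differ):

```python
import json

# Build the --vars payload for a local backfill of a single month.
# Any day within the target month can be used for the date value.
backfill_vars = {"incremental_backfill_date": "2019-07-01"}
command = f"dbt run --vars '{json.dumps(backfill_vars)}'"
print(command)
```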
Backfills are done via Airflow. The [`dbt_snowplow_backfill` DAG](https://gitlab.com/gitlab-data/analytics/blob/master/dags/transformation/dbt_snowplow_backfill.py) will generate a task for each month from July 2018 to the current month.
#### Do Not Track
Our Snowplow tracking configuration and implementations respect the [Do Not Track (DNT) header](https://en.wikipedia.org/wiki/Do_Not_Track) whenever it is present in a user's browser.
#### Duo data redaction
We only keep Duo free form feedback for 60 days in snowflake. This is managed by the [duo_data_redaction DAG](https://gitlab.com/gitlab-data/analytics/-/blob/master/dags/general/duo_data_redaction.py), which runs daily, removing contents of the `extendedFeedback` attribute in the `contexts` column for all feedback response Snowplow events in `RAW` and `PREP`. This timeline allows for our full-refresh process to complete, updating all downstream data, within 90 days for compliance.
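The actual redaction is performed by the DAG directly in Snowflake; purely as an illustrative sketch of the transformation (the function name and payload shape are assumptions, not the DAG's code), blanking the attribute in a contexts payload looks like:

```python
import json

# Illustrative only: remove the extendedFeedback attribute from a Snowplow
# contexts payload. The real redaction is done by the duo_data_redaction DAG
# against RAW and PREP in Snowflake; the payload shape here is an assumption.
def redact_extended_feedback(contexts_json: str) -> str:
    contexts = json.loads(contexts_json)
    for ctx in contexts.get("data", []):
        ctx.get("data", {}).pop("extendedFeedback", None)
    return json.dumps(contexts)
```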
### Snowplow improvement: SQL scripting for issue fixing
To generate a fix script for an issue, do the following:
1. Open an issue in the project [snowplow-fix-scripting](https://gitlab.com/gitlab-data/snowplow-fix-scripting)
2. Open an MR in the same project
3. Edit the `config.yml` file to adjust your logic.
4. Run the ✏️generate_sql job in the 📚scripting stage of the MR pipeline.
The ✏️generate_sql job is a manually triggered job in the GitLab `CI/CD` pipeline that generates SQL scripts based on provided parameters. It runs in the 📚scripting stage of the pipeline.
To run this job successfully, the following environment variables must be set:
* Required environment variables:
  * `DATE_FROM`: Start date for the data range to process, in the format `YYYY-MM-DD`.
  * `DATE_TO`: End date for the data range to process, in the format `YYYY-MM-DD`.
* Optional environment variables:
  * `LOG_LEVEL`: Sets the logging verbosity (defaults to `DEBUG` if not provided). Allowed values: `DEBUG|INFO|WARNING|ERROR|CRITICAL`.
  * `DATABASE_PREFIX`: Optional prefix for database objects or connections. If no value is provided, production code is generated against `RAW`, `PREP`, and `PROD`. Otherwise, enter a prefix for the database name, e.g. `22822-SNOWPLOW-IMPROVEMENT-SQL-SCRIPTING-FOR-ISSUE-FIXING`.
Usually the flow requires the pipeline to be executed twice (though not always): once to generate a testing script to run against the development databases, and once to generate a production script to run against the production databases.
* For testing databases, set `DATABASE_PREFIX` to the prefix of the development databases (e.g. `22822-SNOWPLOW-IMPROVEMENT-SQL-SCRIPTING-FOR-ISSUE-FIXING`), and the code will be generated with that prefix applied to the database names.
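As a sketch under an explicit assumption (the generator's actual naming and quoting are not reproduced here; the prefix-joining scheme below is illustrative only):

```python
# Assumption: DATABASE_PREFIX is prepended to each standard database name;
# the separator used here is illustrative, not the generator's actual output.
def target_databases(prefix=None):
    bases = ["RAW", "PREP", "PROD"]
    if not prefix:
        return bases  # production run: plain database names
    return [f"{prefix}_{base}" for base in bases]

print(target_databases("22822-SNOWPLOW-IMPROVEMENT-SQL-SCRIPTING-FOR-ISSUE-FIXING"))
```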