Enhance Telemetry across Meltano, SDK, and Hub
Taken from meltano#2972

## Idea/Problem Statement

Currently, we use Google Analytics (GA) to collect anonymous telemetry data from Meltano and the Hub. We need to upgrade our telemetry strategy to get the most out of it, including the events we send, our collection and processing infrastructure, and how we share and use this data. The goal is to be fully anonymous, openly share the data, and make it easy to opt out. All of this will set us up to listen efficiently to what our users want as our ecosystem grows.

## Data Usage Strategy

The purpose of collecting this data is to gain insights that improve the product and the ecosystem as a whole. We want to analyze the telemetry data from Meltano, the Hub, and the SDK to help make product decisions. Which features are most used or not used? Do we need to focus more engineering effort in certain areas?

We also want to make as much of this data as possible useful to the users in our community. For now this can take the form of extra data to enrich the Hub, raw S3 data sets for the community (if they want to do their own analysis), and aggregated usage metrics to assure users that the ecosystem is healthy and growing.

## Event Collection Strategy

We already collect telemetry data from Meltano and the Hub. We want to continue doing that, with some improvements to fill existing gaps, and also layer on events coming from SDK-based connectors. Meltano should collect events about how users are running Meltano and how it's performing at the job level, while the SDK should collect events at the data level.

* Meltano:
  * EL job start/stop timestamps
  * T job start/stop timestamps
  * Exit codes
  * Meltano version
  * Caller script (Airflow, Dagster, direct, etc.)
  * Environment
  * Installed features: orchestrators (Airflow, Dagster), dbt, utilities (Lightdash/Superset), files (Docker)
  * Plugin variant, name, pip_url, version
* SDK:
  * Start/stop timestamps
  * Record counts
  * New spec features and capabilities (ACTIVATE_VERSION and BATCH)
  * Exceptions
  * Tests run
  * SDK version
  * Input arguments/capabilities used
* Hub: ?

## Technical Design

### Implementation Considerations

As an open-source-focused team, we wanted to choose a telemetry platform that is open source and can be run in our own environment, so that we can ensure the safekeeping of our users' anonymous data.

#### Snowplow Deployment

We will run a self-hosted Snowplow deployment on AWS that writes to S3 and uses Athena for [dbt transformations](https://github.com/dbt-labs/snowplow). The intended design is inspired by and based on [GitLab's implementation](https://about.gitlab.com/handbook/business-technology/data-team/platform/snowplow/).

* We will use Snowplow as our event processing tool. It was evaluated against others in the space, but it's preferred because it's open source, scalable, and flexible (web, CLI, etc.).
* AWS will be the cloud provider, since almost everyone is familiar with it and a common Snowplow deployment uses some AWS-specific services like Kinesis.

#### Event Publishing Changes

The following needs to be done by the event-producing systems, probably in this order unless steps 1 and 3 are combined.

1. The current GA events need to be converted to Snowplow events in Meltano and the Hub.
1. The SDK needs to implement Snowplow tracking events.
1. Meltano and the Hub need to add the additional events that were missing from the GA implementation.

### Operational Considerations

#### Automation

All of the AWS infrastructure is managed using Terraform.

#### Monitoring

Logs need to be collected (CloudWatch?) and alerts need to be set up so that we are aware of any issues with the platform.

#### Testing Environment

A non-production testing environment needs to be available for testing new changes to the Snowplow setup, or for testing new tracking events, without interfering with production data. This would probably just be an ephemeral environment, but the Terraform configuration needs to support multiple deployments.

## Open Items/Questions

* [ ] Where should the Terraform infrastructure live?
* [ ] What Hub events are we not tracking right now?
* [ ] How should we monitor our infrastructure? CloudWatch?
* [ ] How do we feel about the testing environment?
* [ ] Do we want to run GA in parallel with Snowplow for any reason? Real-time dashboards for the Hub?
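
To make the Event Collection Strategy more concrete, here is a minimal sketch of how one Meltano EL job event might be wrapped in a Snowplow-style self-describing JSON envelope (a `schema` URI plus a `data` object), with an opt-out check applied before anything is built. Everything specific here is an assumption for illustration only: the `ELJobEvent` class and its field names, the `iglu:com.example/...` schema URI, and the `EXAMPLE_TELEMETRY_OPT_OUT` environment variable are hypothetical and are not taken from Meltano, the SDK, or this proposal.

```python
import json
import os
from dataclasses import asdict, dataclass
from typing import Mapping, Optional

# Hypothetical schema URI; a real deployment would register its own schemas.
EL_JOB_SCHEMA = "iglu:com.example/el_job/jsonschema/1-0-0"
# Hypothetical opt-out variable name, used only in this sketch.
OPT_OUT_ENV_VAR = "EXAMPLE_TELEMETRY_OPT_OUT"


@dataclass
class ELJobEvent:
    """Job-level fields drawn from the Meltano list above (names are illustrative)."""

    started_at: str
    stopped_at: str
    exit_code: int
    meltano_version: str
    caller: str
    environment: str
    plugin_name: str
    plugin_variant: str


def telemetry_enabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return False when the user has opted out via the environment variable."""
    env = os.environ if env is None else env
    return env.get(OPT_OUT_ENV_VAR, "").strip().lower() not in {"1", "true", "yes"}


def build_payload(
    event: ELJobEvent, env: Optional[Mapping[str, str]] = None
) -> Optional[dict]:
    """Wrap the event in a self-describing envelope, or return None if opted out."""
    if not telemetry_enabled(env):
        return None
    return {"schema": EL_JOB_SCHEMA, "data": asdict(event)}


example_event = ELJobEvent(
    started_at="2022-01-01T00:00:00Z",
    stopped_at="2022-01-01T00:05:00Z",
    exit_code=0,
    meltano_version="0.0.0",
    caller="direct",
    environment="dev",
    plugin_name="tap-example",
    plugin_variant="example",
)

# An empty mapping stands in for the environment so the example is deterministic.
print(json.dumps(build_payload(example_event, env={}), indent=2))
```

Checking the opt-out before the payload is constructed means no event data is assembled at all for users who have disabled telemetry, which fits the goal of making it easy to opt out.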