Simplify Service Ping Context
Problem
The `gitlab_service_ping` context is used to communicate to the data warehouse that a Snowplow event is a mirror of a RedisHLL event. However, it is currently needlessly complex and error-prone. The context accepts both `key_path` and `event_name` attributes. The idea is to use `key_path` if there is a one-to-one relationship between a metric and an event, which is currently the case for Redis all-time counters according to the schema. If there is a many-to-many relationship, where the event is used in many metrics, `event_name` is supposed to be used. In practice, this means that the presence of either `event_name` or `key_path` is used to differentiate between unique, time-based counts using RedisHLL (dbt model) and all-time counts using Redis (dbt model).
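As a rough illustration of the current behaviour, the following sketch builds the context with exactly one of the two attributes and then infers the counter type from whichever attribute is present. The function names, the schema URI, and the example values are hypothetical placeholders, not the actual GitLab implementation.

```python
from typing import Optional

# Placeholder URI: the real Iglu schema URI and version are not given in this issue.
SERVICE_PING_SCHEMA = "iglu:example/gitlab_service_ping/jsonschema/1-0-0"


def build_legacy_context(
    event_name: Optional[str] = None, key_path: Optional[str] = None
) -> dict:
    """Build the context as it works today: exactly one attribute is set."""
    # Reject both "neither set" and "both set".
    if (event_name is None) == (key_path is None):
        raise ValueError("exactly one of event_name or key_path must be set")
    if event_name is not None:
        data = {"event_name": event_name}
    else:
        data = {"key_path": key_path}
    return {"schema": SERVICE_PING_SCHEMA, "data": data}


def mirrored_counter(context: dict) -> str:
    """Infer the mirrored counter type from which attribute is present."""
    data = context["data"]
    if "event_name" in data:
        return "redis_hll"  # unique, time-based counts
    if "key_path" in data:
        return "redis"  # all-time counts
    raise ValueError("context has neither event_name nor key_path")
```

The second function shows why the setup is fragile: the counter type is never stated explicitly, it is only implied by which key happens to be present.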
This distinction is based on the understanding that mapping a Snowplow event to a RedisHLL/Redis metric would otherwise not work in dbt. However, it creates multiple problems:
- JSON schema, which is used for Snowplow contexts, does not support mutually exclusive properties, so the rule that `key_path` and `event_name` must not both be set at the same time has to be enforced in code, which currently only happens on the backend.
- It obscures the fact that we are never really mirroring metrics, but always only specific events, which can then be used in metrics.
- It essentially duplicates data that is already available in the metric definition in the form of the `time_frame: all_time` and the `data_source: redis` instead of `data_source: redis_hll` attributes.
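To illustrate the last point, here is a minimal sketch of how the counter type could be read directly from a metric definition instead of being implied by the context. It assumes the definition is available as a plain dictionary with `data_source` and `time_frame` keys. The function name, the return labels, and the example definitions are hypothetical.

```python
def counter_kind(metric_definition: dict) -> str:
    """Classify a metric using only attributes of its definition."""
    data_source = metric_definition.get("data_source")
    time_frame = metric_definition.get("time_frame")
    if data_source == "redis_hll":
        return "unique_time_based"  # RedisHLL: unique, time-based counts
    if data_source == "redis" and time_frame == "all_time":
        return "all_time"  # Redis: all-time counts
    return "other"  # not covered by the mirroring discussed here


# Hypothetical definitions, reduced to the attributes that matter here.
all_time_metric = {
    "key_path": "counts.example_metric",
    "data_source": "redis",
    "time_frame": "all_time",
}
unique_metric = {
    "key_path": "redis_hll_counters.example_metric_weekly",
    "data_source": "redis_hll",
    "time_frame": "7d",
}

print(counter_kind(all_time_metric))
print(counter_kind(unique_metric))
```

With this information available on the metric side, the context itself does not need to encode the counter type.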
Desired Outcome
An easier-to-understand system where we ideally send only one value to map between a Redis/RedisHLL event and a Snowplow event.
Proposed Solution
- Remove `key_path` from the ServicePing context.
- Always send `event_name` in the ServicePing context for events that use it (see code).
- Migrate all currently mirrored Redis all-time events to send `event_name` instead of `key_path`. The query `SELECT DISTINCT metrics_path FROM workspace_customer_success.wk_rpt_event_based_metric_counts_namespace_all_time` should give the key paths of all metrics that are currently migrated.
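The steps above could look roughly like the following sketch: a context builder that only accepts `event_name`, plus a helper that rewrites legacy context data carrying `key_path`. The schema URI, the function names, and the key-path-to-event mapping are hypothetical. In practice the mapping would have to be assembled from the query results and the corresponding metric definitions.

```python
# Placeholder URI: the real Iglu schema URI and version are not given in this issue.
SERVICE_PING_SCHEMA = "iglu:example/gitlab_service_ping/jsonschema/1-0-0"


def build_simplified_context(event_name: str) -> dict:
    """Build the context with event_name as its only attribute."""
    if not event_name:
        raise ValueError("event_name is required")
    return {"schema": SERVICE_PING_SCHEMA, "data": {"event_name": event_name}}


def migrate_legacy_data(data: dict, event_for_key_path: dict) -> dict:
    """Rewrite legacy context data so that it only carries event_name."""
    if "event_name" in data:
        return {"event_name": data["event_name"]}
    key_path = data["key_path"]
    if key_path not in event_for_key_path:
        raise KeyError(f"no event mapped for key_path {key_path!r}")
    return {"event_name": event_for_key_path[key_path]}


# Hypothetical mapping from a mirrored all-time metric to its event.
event_for_key_path = {"counts.example_metric": "example_event"}

legacy = {"key_path": "counts.example_metric"}
print(migrate_legacy_data(legacy, event_for_key_path))
```

Because only one attribute remains, there is no mutual-exclusion rule left to enforce outside the JSON schema.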
Separate issue for updating the DBT models accordingly: https://gitlab.com/gitlab-data/analytics/-/issues/16600