Product Analytics event pipeline should be more resilient to Snowplow events that do not match our schema
Summary
An onboarding user sent Snowplow-compatible events directly to the prod-1 collector without using an SDK. Although the events were valid against Snowplow's schemas, ClickHouse, which currently connects to Kafka to ingest events, got stuck trying to process events that did not match our table schema, and all event storage was halted.
Details
Event collection was working normally until an onboarding customer tried sending an event that did not match our schema:
curl -v \
--request POST \
--url https://collector.prod-1.gl-product-analytics.com/com.snowplowanalytics.snowplow/tp2 \
--header 'Content-Type: application/json' \
--data '{
"schema": "iglu:com.snowplowanalytics.snowplow/payload_data/jsonschema/1-0-4",
"data": [
{
"e": "pv",
"url": "https://gitlab.com/groups/namespace3628337/-/epics/5",
"page": "GitLab7",
"aid": "gp-9Xv1xHeFInHNDSqpSEQ",
"eid": "74427589-7c3a-4f56-8193-3e436f1d205b",
"dtm": "1692871226000",
"p": "web",
"uid": "user",
"vid": "2",
"cd": "24",
"vp": "600x400",
"tv": "js-3.6.0"
}
]
}'
The event was deemed valid and written to the `good` Kafka topic in our event collection pipeline. It was then processed by `snowplow-enricher` and placed into the `enriched` topic.
Currently, in our Product Analytics stack, ClickHouse subscribes to Kafka topics and saves incoming events to a table. (This may change with https://gitlab.com/gitlab-org/analytics-section/product-analytics/analytics-stack/-/issues/50+, where the direction would be reversed, since ClickHouse Cloud cannot connect to external Kafka clusters.)
Since the `snowplow_queue` events table has a fixed set of columns, these events deemed "good" could not be inserted: they were missing too much information to fit the table.
As a result, all event processing stalled from 15:10 to 19:15 UTC, as ClickHouse could not advance past the problematic events.
By inspecting the ClickHouse logs, we identified the offending offsets in the Kafka logs and deleted the records causing the issue. No other events were lost.
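For future reference, this kind of offset deletion can be done with Kafka's bundled `kafka-delete-records.sh` tool, which truncates a partition so that every record below a given offset is removed. A sketch only: the broker address, topic, partition, and offset below are placeholders, not the actual values from this incident (the real offsets came out of the ClickHouse logs).

```shell
# Placeholder values for illustration; kafka-delete-records.sh removes
# every record in the partition with an offset BELOW the one given, so
# the offset should point just past the last problematic record.
cat > /tmp/delete-offsets.json <<'EOF'
{
  "version": 1,
  "partitions": [
    { "topic": "enriched", "partition": 0, "offset": 43 }
  ]
}
EOF

# Placeholder broker address; uncomment to run against the cluster:
# kafka-delete-records.sh \
#   --bootstrap-server kafka.prod-1.internal:9092 \
#   --offset-json-file /tmp/delete-offsets.json
```

Because everything before the problematic records had already been consumed, truncating up to that offset loses nothing else.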
Since we are planning to support other SDKs that don't have browser-specific properties, we should ensure inserts still succeed even when events do not match the table schema exactly.
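One option worth evaluating: ClickHouse's Kafka table engine supports a `kafka_skip_broken_messages` setting that lets the consumer skip a bounded number of unparseable rows per block instead of stalling. A sketch of what the source table definition could look like; the table name, columns, broker address, and consumer group below are illustrative, not our actual configuration:

```shell
# Illustrative DDL held in a variable; table/column/broker names are
# placeholders, not the real snowplow_queue definition.
ddl=$(cat <<'EOF'
CREATE TABLE snowplow_queue_src
(
    app_id    String,
    event     String,
    page_url  String
    -- ... remaining snowplow_queue columns
)
ENGINE = Kafka
SETTINGS
    kafka_broker_list = 'kafka.prod-1.internal:9092',
    kafka_topic_list = 'enriched',
    kafka_group_name = 'clickhouse-snowplow',
    kafka_format = 'JSONEachRow',
    kafka_skip_broken_messages = 10
EOF
)

# Placeholder invocation; uncomment to apply against the cluster:
# echo "$ddl" | clickhouse-client
```

Note that skipped messages are dropped silently, so this would complement, not replace, correct good/bad topic routing.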
Proposal
- Make sure Snowplow events are properly categorized into the `bad` and `good` topics.
- Ensure compatibility with events sent through any of our other SDKs.
References/Links
- Slack thread of the originating event
- Slack thread tracking the incident
  - The Slack thread includes an event dump of the events around the time the pipeline stalled; notably, the `page_view` events were the ones that specifically caused the issue.