Incident: AI Gateway SSOT data source missing namespaces in events related to 'request_duo_chat_response'
Problem
High priority incident:
Events with matching gsc_correlation_id show smaller gsc_feature_enabled_by_namespace_ids arrays in AI Gateway events (app_id = 'gitlab_ai_gateway') compared to standard Snowplow events (app_id = 'gitlab') for ~45% of correlation ids (see query in Detection section). This means we are potentially under-reporting AI feature usage at the account and namespace level in our SSOT. This incident seems to be limited to chat usage where event_action = 'request_duo_chat_response' in app_id = 'gitlab', but scope should be validated by engineering.
Our hope is that there is a simple explanation for why events with matching gsc_correlation_id would show different namespace identifiers in gsc_feature_enabled_by_namespace_ids. If so, we'd like to document the reasoning.
Detection
- Initially flagged by @ddeng1 during Duo Chat usage analysis here https://gitlab.com/gitlab-data/product-analytics/-/issues/2362#note_2236489848
- Further investigation revealed that for the same event (matched by gsc_correlation_id), the gsc_feature_enabled_by_namespace_ids arrays contain different numbers of namespace IDs between the two tracking methodologies (standard vs. AI Gateway). See query below.
- This discrepancy became trackable after November 13th, when gsc_feature_enabled_by_namespace_ids started being captured consistently.
Query:
SELECT DISTINCT
  ai.behavior_date,
  ai.gsc_correlation_id,
  ai.app_id AS ai_gw_app_id,
  legacy.app_id AS standard_snowplow_events_app_id,
  ai.unit_primitive,
  ARRAY_AGG(DISTINCT ai.feature) WITHIN GROUP (ORDER BY ai.feature ASC) AS feature_array,
  ai.gsc_feature_enabled_by_namespace_ids AS ai_gsc_feature_enabled_by_namespace_ids,
  -- array size = number of commas in the serialized array + 1
  LENGTH(ai.gsc_feature_enabled_by_namespace_ids) - LENGTH(REPLACE(ai.gsc_feature_enabled_by_namespace_ids, ',', '')) + 1 AS ai_namespace_array_size,
  legacy.gsc_feature_enabled_by_namespace_ids AS legacy_gsc_feature_enabled_by_namespace_ids,
  LENGTH(legacy.gsc_feature_enabled_by_namespace_ids) - LENGTH(REPLACE(legacy.gsc_feature_enabled_by_namespace_ids, ',', '')) + 1 AS legacy_namespace_array_size,
  legacy.event_action,
  -- TRUE only when the two serialized arrays are identical
  ai_gsc_feature_enabled_by_namespace_ids = legacy_gsc_feature_enabled_by_namespace_ids AS namespace_array_match_check
FROM workspace_product.wk_rpt_ai_gateway_events_flattened_with_features ai
JOIN common_mart.mart_behavior_structured_event legacy
  ON legacy.gsc_correlation_id = ai.gsc_correlation_id
  AND ai.behavior_date = legacy.behavior_date
  AND legacy.app_id != 'gitlab_ai_gateway' -- joining to standard Snowplow events, not AI Gateway events
WHERE ai.behavior_at >= '2024-11-13'
  AND legacy.gsc_feature_enabled_by_namespace_ids IS NOT NULL
  AND namespace_array_match_check = FALSE
GROUP BY ALL;
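The size columns in the query count commas in the serialized array, and the match check compares the serialized values as-is. The Python sketch below mirrors that logic and adds an order-insensitive comparison, which may help separate pure reorderings from real differences in namespace IDs. It assumes the column is serialized as a comma-separated list, optionally wrapped in brackets; that format and the function names are assumptions for illustration, not something this issue confirms.

```python
def comma_count_size(raw: str) -> int:
    """Mirror of the SQL expression LENGTH(x) - LENGTH(REPLACE(x, ',', '')) + 1."""
    return len(raw) - len(raw.replace(",", "")) + 1


def parse_namespace_ids(raw: str) -> set[str]:
    """Parse a serialized ID list into a set.

    Assumed formats (hypothetical): '[1, 2, 3]' or '1,2,3'.
    """
    cleaned = raw.strip().strip("[]")
    return {part.strip().strip('"') for part in cleaned.split(",") if part.strip()}


def same_namespaces(ai_raw: str, legacy_raw: str) -> bool:
    """Order- and whitespace-insensitive comparison of two serialized ID lists."""
    return parse_namespace_ids(ai_raw) == parse_namespace_ids(legacy_raw)
```

One thing this makes visible: the comma-count expression returns 1 for an empty list such as '[]', so the size columns cannot distinguish an empty array from a single-element one.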
Impact
Impact is significant if we are not capturing all of the namespaces enabling usage in gsc_feature_enabled_by_namespace_ids for events triggered in app_id = 'gitlab_ai_gateway' compared to those that share a gsc_correlation_id in app_id = 'gitlab'.
Business Impacts:
- Systematic under-reporting of Chat usage at account & namespace level
- Missing namespace & account data in our designated SSOT for AI feature tracking
- Potential impact on critical business metrics and decision-making
- Risk of incorrect resource allocation and capacity planning due to understated usage
Affected Systems/Teams:
- PDI, MS&A & other functional teams using AI feature usage data
- Account teams tracking customer adoption
- Business stakeholders making decisions based on feature usage metrics
- Capacity planning teams
Additional information
Events matched using gsc_correlation_id
Discrepancy appears between:
- Standard Snowplow events:
  - event_action = 'request_duo_chat_response'
  - app_id = 'gitlab'
- AI Gateway events:
  - app_id = 'gitlab_ai_gateway'
  - 19 unit primitive requests are mapped to 'request_duo_chat_response' via gsc_correlation_id
Outcome of investigation should include answers to the following questions:
- Is it expected to see differences in gsc_feature_enabled_by_namespace_ids in events with the same gsc_correlation_id? If so, why?
- Do events in app_id = 'gitlab_ai_gateway' capture all namespaces enabling usage in gsc_feature_enabled_by_namespace_ids, or are some namespaces missing, and why?
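To answer these questions for a given gsc_correlation_id, it may help to look at which namespace IDs appear on only one side, rather than only at the array sizes. A minimal Python sketch follows; the function name, the result keys, and the IDs in the usage example are made up for illustration.

```python
def namespace_diff(ai_ids: list[int], standard_ids: list[int]) -> dict[str, list[int]]:
    """Return the namespace IDs present on only one side of a matched event pair."""
    ai, std = set(ai_ids), set(standard_ids)
    return {
        "only_in_standard": sorted(std - ai),    # IDs the AI Gateway event lacks
        "only_in_ai_gateway": sorted(ai - std),  # IDs the standard event lacks
    }


# Usage with made-up IDs: the standard event carries one extra namespace.
diff = namespace_diff(ai_ids=[101, 202], standard_ids=[101, 202, 303])
print(diff)
```

Reporting both directions avoids assuming in advance which of the two sources is the correct one.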
Checklist
- Assigned severity tags based on this guidance
- Assigned to PM and EM of group analytics instrumentation
- Posted link to incident in g_analyze_analytics_instrumentation and tagged both PM and EM of the group
<---- TO BE FILLED BY ASSIGNEE / RESOLUTION DRI---->
Summary
Root Cause
Resolution
It turns out we were passing different namespace IDs in the request_duo_chat_response action, which is incorrect, as confirmed by @shinya.maeda.
MR !174486 has been merged to make that action consistent with the AI Gateway namespace_ids.
So the current AI Gateway implementation is correct and should be considered the SSOT.
Next steps
We might need to update the namespace_ids for the request_duo_chat_response action in the data warehouse.
