Incident: AI Gateway SSOT data source missing namespaces in events related to 'request_duo_chat_response'

Problem

High priority incident:

Events with matching gsc_correlation_id show smaller gsc_feature_enabled_by_namespace_ids arrays in AI Gateway events (app_id = 'gitlab_ai_gateway') compared to standard Snowplow events (app_id = 'gitlab') for ~45% of correlation ids (see query in Detection section). This means we are potentially under-reporting AI feature usage at the account and namespace level in our SSOT. This incident seems to be limited to chat usage where event_action = 'request_duo_chat_response' in app_id = 'gitlab', but scope should be validated by engineering.

Our hope is that there is a simple explanation for why events with matching gsc_correlation_id would show different namespace identifiers in gsc_feature_enabled_by_namespace_ids. If so, we'd like to document the reasoning.

Detection

  • Initially flagged by @ddeng1 during Duo Chat usage analysis here https://gitlab.com/gitlab-data/product-analytics/-/issues/2362#note_2236489848
  • Further investigation revealed that for the same event (matched by gsc_correlation_id), the gsc_feature_enabled_by_namespace_ids arrays contain different numbers of namespace IDs between the two tracking methodologies (standard vs. AI Gateway). See the query below.
  • This discrepancy became trackable after November 13th, 2024, when gsc_feature_enabled_by_namespace_ids started being captured consistently

Query:

    SELECT DISTINCT
          ai.behavior_date,
          ai.gsc_correlation_id,
          ai.app_id
            AS ai_gw_app_id,
          legacy.app_id
            AS standard_snowplow_events_app_id,
          ai.unit_primitive,
          ARRAY_AGG(DISTINCT ai.feature) WITHIN GROUP (ORDER BY ai.feature ASC)
            AS feature_array,
          ai.gsc_feature_enabled_by_namespace_ids
            AS ai_gsc_feature_enabled_by_namespace_ids,
          -- array size is approximated as (number of commas + 1)
          LENGTH(ai.gsc_feature_enabled_by_namespace_ids) - LENGTH(REPLACE(ai.gsc_feature_enabled_by_namespace_ids, ',', '')) + 1
            AS ai_namespace_array_size,
          legacy.gsc_feature_enabled_by_namespace_ids
            AS legacy_gsc_feature_enabled_by_namespace_ids,
          LENGTH(legacy.gsc_feature_enabled_by_namespace_ids) - LENGTH(REPLACE(legacy.gsc_feature_enabled_by_namespace_ids, ',', '')) + 1
            AS legacy_namespace_array_size,
          legacy.event_action,
          ai_gsc_feature_enabled_by_namespace_ids = legacy_gsc_feature_enabled_by_namespace_ids
            AS namespace_array_match_check
      FROM workspace_product.wk_rpt_ai_gateway_events_flattened_with_features ai
      JOIN common_mart.mart_behavior_structured_event legacy
        ON legacy.gsc_correlation_id = ai.gsc_correlation_id
        AND ai.behavior_date = legacy.behavior_date
        AND legacy.app_id != 'gitlab_ai_gateway' -- join to standard Snowplow events, not AI Gateway events
      WHERE ai.behavior_at >= '2024-11-13'
       AND legacy.gsc_feature_enabled_by_namespace_ids IS NOT NULL
       AND namespace_array_match_check = FALSE -- keep only pairs whose arrays differ
      GROUP BY ALL;
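
The query above lists only the mismatched pairs, so it does not yield the share of affected correlation ids on its own. A minimal sketch of how that share could be computed follows. It assumes the same tables, join keys, and filters as the detection query, and the CTE names and output column names are illustrative. It is not necessarily how the ~45% figure was originally derived.

```sql
-- Sketch only: share of correlation ids whose namespace arrays differ.
-- Reuses the tables, join keys, and filters of the detection query.
WITH paired AS (
    SELECT DISTINCT
        ai.gsc_correlation_id,
        ai.gsc_feature_enabled_by_namespace_ids
          = legacy.gsc_feature_enabled_by_namespace_ids AS arrays_match
    FROM workspace_product.wk_rpt_ai_gateway_events_flattened_with_features ai
    JOIN common_mart.mart_behavior_structured_event legacy
      ON legacy.gsc_correlation_id = ai.gsc_correlation_id
     AND ai.behavior_date = legacy.behavior_date
     AND legacy.app_id != 'gitlab_ai_gateway'
    WHERE ai.behavior_at >= '2024-11-13'
      AND legacy.gsc_feature_enabled_by_namespace_ids IS NOT NULL
),
per_correlation_id AS (
    -- A correlation id counts as mismatched if any of its pairs differ.
    -- Pairs where the AI Gateway value is NULL compare to NULL and are
    -- not counted as mismatches, matching the detection query's filter.
    SELECT
        gsc_correlation_id,
        BOOLOR_AGG(arrays_match = FALSE) AS has_mismatch
    FROM paired
    GROUP BY gsc_correlation_id
)
SELECT
    COUNT(*)                AS total_correlation_ids,
    COUNT_IF(has_mismatch)  AS mismatched_correlation_ids,
    ROUND(100 * COUNT_IF(has_mismatch) / NULLIF(COUNT(*), 0), 1) AS mismatch_pct
FROM per_correlation_id;
```

Counting per correlation id rather than per row avoids inflating the rate when one standard event maps to several AI Gateway unit primitive requests.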

Impact

Impact is significant if we are not capturing all of the namespaces enabling usage in gsc_feature_enabled_by_namespace_ids for events triggered in app_id = 'gitlab_ai_gateway' compared to those that share a gsc_correlation_id in app_id = 'gitlab'.

Business Impacts:

  • Systematic under-reporting of Chat usage at account & namespace level
  • Missing namespace & account data in our designated SSOT for AI feature tracking
  • Potential impact on critical business metrics and decision-making
  • Risk of incorrect resource allocation and capacity planning due to understated usage

Affected Systems/Teams:

  • PDI, MS&A & other functional teams using AI feature usage data
  • Account teams tracking customer adoption
  • Business stakeholders making decisions based on feature usage metrics
  • Capacity planning teams

Additional information

Events matched using gsc_correlation_id

Discrepancy appears between:

  • Standard Snowplow events: event_action = 'request_duo_chat_response' - app_id = 'gitlab'
  • AI Gateway events: app_id = 'gitlab_ai_gateway' - 19 unit primitive requests are mapped to 'request_duo_chat_response' via gsc_correlation_id

The outcome of the investigation should answer the following questions:

  1. Is it expected to see differences in gsc_feature_enabled_by_namespace_ids in events with the same gsc_correlation_id? If so, why?
  2. Do events in app_id = 'gitlab_ai_gateway' capture all namespaces enabling usage in gsc_feature_enabled_by_namespace_ids, or are some namespaces missing? If some are missing, why?
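
To help answer the second question, the namespace ids themselves can be compared rather than only the array sizes. The sketch below lists ids that appear in the standard Snowplow event but not in any AI Gateway event sharing the same gsc_correlation_id. It assumes the column holds (or casts to) a comma-separated list of numeric ids, possibly wrapped in brackets or quotes. The CTE and output column names are hypothetical.

```sql
-- Sketch only: namespace ids present on the standard Snowplow side
-- but absent from the AI Gateway side for the same correlation id.
-- Assumes ids are numeric and comma-separated once cast to VARCHAR.
WITH paired AS (
    SELECT DISTINCT
        ai.gsc_correlation_id,
        ai.gsc_feature_enabled_by_namespace_ids::VARCHAR     AS ai_ids,
        legacy.gsc_feature_enabled_by_namespace_ids::VARCHAR AS legacy_ids
    FROM workspace_product.wk_rpt_ai_gateway_events_flattened_with_features ai
    JOIN common_mart.mart_behavior_structured_event legacy
      ON legacy.gsc_correlation_id = ai.gsc_correlation_id
     AND ai.behavior_date = legacy.behavior_date
     AND legacy.app_id != 'gitlab_ai_gateway'
    WHERE ai.behavior_at >= '2024-11-13'
      AND legacy.gsc_feature_enabled_by_namespace_ids IS NOT NULL
),
legacy_exploded AS (
    -- Strip everything except digits and commas, then one row per id.
    SELECT DISTINCT p.gsc_correlation_id, f.value::VARCHAR AS namespace_id
    FROM paired p,
         LATERAL FLATTEN(INPUT => SPLIT(REGEXP_REPLACE(p.legacy_ids, '[^0-9,]', ''), ',')) f
    WHERE f.value::VARCHAR != ''
),
ai_exploded AS (
    SELECT DISTINCT p.gsc_correlation_id, f.value::VARCHAR AS namespace_id
    FROM paired p,
         LATERAL FLATTEN(INPUT => SPLIT(REGEXP_REPLACE(p.ai_ids, '[^0-9,]', ''), ',')) f
    WHERE f.value::VARCHAR != ''
)
SELECT
    l.gsc_correlation_id,
    l.namespace_id AS namespace_id_missing_from_ai_gateway
FROM legacy_exploded l
LEFT JOIN ai_exploded a
  ON a.gsc_correlation_id = l.gsc_correlation_id
 AND a.namespace_id = l.namespace_id
WHERE a.namespace_id IS NULL
ORDER BY 1, 2;
```

Swapping the two exploded CTEs in the final SELECT gives the reverse direction (ids present only on the AI Gateway side), which helps distinguish "missing" ids from ids that simply differ between the two sources.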

Checklist

  • Assigned severity tags based on this guidance
  • Assigned to PM and EM of group::analytics instrumentation
  • Posted link to incident in g_analyze_analytics_instrumentation and tagged both PM and EM of the group

<---- TO BE FILLED BY ASSIGNEE / RESOLUTION DRI---->

Summary

Root Cause

Resolution

It turns out we were passing different namespace ids in the request_duo_chat_response action, which is incorrect, as confirmed by @shinya.maeda.

MR !174486 has been merged to make the namespace ids in this action uniform with the AI Gateway namespace_ids.

So, the current AI Gateway implementation is correct and should be considered the SSOT.

Next steps

We might need to update the namespace_ids for the request_duo_chat_response action in the data warehouse.

Edited by Ankit Panchal