Instrument GKG for Product Analytics (Snowplow) (#21189) · Epics · GitLab.org

Instrument GKG for Product Analytics (Snowplow)

## Overview

Defines the Snowplow event schemas and instrumentation plan for GitLab Knowledge Graph (GKG/Orbit). This is product analytics: events go to Snowflake via the GitLab Internal Events framework. Operational metrics (OTel/Prometheus) are covered separately in `docs/design-documents/observability.md`.

**North Star Metric:** Weekly Active Namespaces (WAN): namespaces with at least 1 successful GKG query in the last 7 days.

**Iglu schema work:** gitlab-org/gitlab#596156

---

## Custom Iglu Schema: `knowledge_graph_context`

The standard `gitlab_standard` context does not have the fields GKG needs. We need a custom context attached to all GKG Snowplow events.

| Field | Type | Nullable | Example | Why it's needed |
|-------|------|----------|---------|-----------------|
| `correlation_id` | string | no | `"d3f1a2b4-7e8c-..."` | Standard request trace ID. Enables end-to-end debugging of DAP → GKG call chains. |
| `namespace_id` | string | no | `"14523891"` | The unit of adoption. Without this we cannot compute WAN (North Star) or any per-namespace metric. |
| `root_namespace_id` | string | no | `"9812345"` | Top-level group (TLG) of the caller's namespace. Required for billing rollup, licensing attribution, and enterprise segmentation. Available as a token claim in the monolith. |
| `global_user_id` | string | yes | `"usr_8819234"` | Identity of the calling user. Required for MAU/DAU metrics and per-user behavior analysis. Null for bot or service account calls. |
| `user_type` | enum `human`, `service_account`, `bot` | no | `"human"` | Distinguishes human users from automated callers. Prevents bots and CI pipelines from inflating human adoption metrics. |
| `is_gitlab_team_member` | boolean | yes | `false` | Separates internal dogfooding from real external adoption. Null if not determinable. |
| `source_type` | enum `dap`, `mcp`, `rest_api`, `cli` | no | `"mcp"` | Separates zero-rated DAP usage from billable MCP usage. Core to monetization and access method analysis. |
| `tool_name` | enum `query_graph`, `get_graph_schema` | no | `"query_graph"` | Distinguishes query execution from schema discovery. Different behavior, different product signals. |
| `tier` | enum `premium`, `ultimate` | no | `"ultimate"` | Required for segmenting adoption by license tier and for pricing analysis. |
| `deployment_type` | enum `com`, `dedicated`, `self_managed` | no | `"com"` | Required for SM/Dedicated rollout attribution and billing path segmentation. |
| `session_id` | string | yes | `"sess_a1b2c3d4"` | Correlates multiple GKG tool calls within a single DAP session. Null for non-DAP access. |

---

## Events to Instrument

### 1. `gkg_query_executed`

Fired on every call to the webserver (`query_graph` and `get_graph_schema`).

| Field | Type | Nullable | Example | Why it's needed |
|-------|------|----------|---------|-----------------|
| `query_type` | enum `find_nodes`, `traverse`, `explore`, `aggregate`, `schema` | no | `"traverse"` | Shows which query patterns agents prefer. High traverse = real dependency analysis happening. |
| `entity_types` | array of string | yes | `Merge Request, Issue, etc.` | Shows the entities queried by GKG. |
| `result_status` | enum `success`, `empty`, `error` | no | `"success"` | `empty` is the silent failure mode: the query ran but returned nothing. Without this we cannot see it. |
| `result_row_count` | integer | no | `42` | Measures query usefulness. A query returning 1 row vs 500 rows tells a different story about agent behavior. |
| `rows_redacted` | integer | no | `3` | Shows authorization overhead on results. High redaction = namespace scoping is too narrow for the query. |
| `duration_ms` | integer | no | `340` | End-to-end latency for SLO tracking and user experience monitoring. |
| `compile_duration_ms` | integer | no | `45` | Isolates query compilation time. Spikes here indicate schema complexity or planner bugs. |
| `execute_duration_ms` | integer | no | `115` | Isolates ClickHouse execution time. Spikes here indicate data volume or query inefficiency. |
| `authorization_duration_ms` | integer | no | `180` | Authorization is expected to dominate latency. Tracks whether Rails auth caching is working. |
| `ch_read_rows` | integer | no | `128400` | ClickHouse scan volume per query. Direct input to cost modeling per query type. |
| `ch_read_bytes` | integer | no | `18874368` | Bytes scanned per query. Primary ClickHouse cost driver - used for pricing and quota decisions. |
| `ch_memory_bytes` | integer | no | `4194304` | Peak memory per query. Required to set safe per-query memory limits in ClickHouse. |
| `entity_types_queried` | array of string | no | `["projects", "pipelines"]` | Shows which parts of the SDLC graph are actually used. Informs indexing and ontology prioritization. |
| `traversal_depth` | integer | yes | `3` | Hop count for traverse queries. Deep traversals are expensive - needed to detect and cap runaway queries. Null for non-traverse queries. |
| `error_reason` | enum `security_rejected`, `execution_failed`, `authorization_failed`, `timeout`, `rate_limited`, `validation_failed`, `allowlist_rejected` | yes | `null` | Classifies failures so we can distinguish auth problems from ClickHouse problems from bad queries. Null on success. |

---

### 2. `gkg_indexing_completed`

Fired when an indexing job finishes for a namespace (code graph and SDLC).

| Field | Type | Nullable | Example | Why it's needed |
|-------|------|----------|---------|-----------------|
| `indexing_type` | enum `initial`, `incremental` | no | `"incremental"` | Initial indexes are expensive and rare. Incremental indexes are frequent and cheap. Different SLOs apply to each. |
| `index_domain` | enum `code`, `sdlc` | no | `"code"` | Code graph and SDLC indexing are separate pipelines with different failure modes and coverage metrics. |
| `trigger_type` | enum `push`, `scheduled`, `manual` | no | `"push"` | Distinguishes real-time indexing from background jobs. Push-triggered failures are user-visible. |
| `status` | enum `success`, `partial`, `failed` | no | `"partial"` | `partial` is the critical case: the namespace looks indexed but some files are missing. Silent gap in coverage. |
| `languages_indexed` | array of string | no | `["ruby", "python"]` | Shows which languages were touched. Required for per-language success rate analysis. |
| `file_count_processed` | integer | no | `4821` | Total successfully parsed files. Denominator for coverage calculations. |
| `file_count_skipped` | integer | no | `14` | Files skipped due to locks or checkpoints. Persistent skips indicate pipeline stalls. |
| `file_count_errored` | integer | no | `3` | Files that failed to parse. Combined with language data, pinpoints which languages are unreliable. |
| `node_count_directory` | integer | no | `312` | Graph structure metric. Tracks repo complexity over time. |
| `node_count_file` | integer | no | `4821` | Should match `file_count_processed`. Discrepancies indicate indexing bugs. |
| `node_count_definition` | integer | no | `38400` | Functions, classes, methods indexed. Primary measure of code graph richness. |
| `node_count_imported_symbol` | integer | no | `21000` | Cross-file references captured. More imports = better blast radius and dependency analysis. |
| `node_count_edge` | integer | no | `194000` | Total relationships in the graph. More edges = richer answers to traversal queries. |
| `sdlc_entity_type` | string | yes | `"pipelines"` | Which SDLC entity was indexed in this run. Needed to track per-entity coverage and freshness. Null for code indexing runs. |
| `sdlc_rows_processed` | integer | yes | `15200` | Rows written for this SDLC entity. Tracks data volume and detects empty or truncated indexing runs. Null for code indexing runs. |
| `watermark_lag_seconds` | number | no | `127.4` | Seconds between current watermark and wall clock. Measures how stale the SDLC graph data is. |
| `repository_resolution` | enum `cache_hit`, `incremental`, `full_download`, `full_download_fallback` | no | `"cache_hit"` | Shows whether we efficiently reused cached archives or had to re-download. Drives infrastructure cost analysis. |
| `duration_ms` | integer | no | `8420` | End-to-end indexing time. Used for SLO tracking and capacity planning. |
| `error_stage` | enum `decode`, `repository_fetch`, `indexing`, `arrow_conversion`, `write`, `checkpoint` | yes | `null` | Pinpoints where in the pipeline a failure occurred. Essential for debugging partial and failed indexes. Null on success. |

---

### 3. `gkg_namespace_enabled`

Fired when an admin enables GKG for a namespace.

| Field | Type | Nullable | Example | Why it's needed |
|-------|------|----------|---------|-----------------|
| `namespace_type` | enum `group`, `project` | no | `"group"` | Group-level enablement covers all child projects. Needed to understand activation scope. |
| `enabled_by_user_id` | string | no | `"8819234"` | Identifies who enables GKG. Enables cohort analysis: do admin-enabled namespaces activate faster than self-serve? |

---

### 4. `gkg_dap_session_summary`

Fired at the end of a DAP session where GKG tools were available. DAP-only.

| Field | Type | Nullable | Example | Why it's needed |
|-------|------|----------|---------|-----------------|
| `total_gkg_calls` | integer | no | `9` | Primary measure of GKG depth-of-use per session. Distinguishes shallow from deep adoption. |
| `total_tool_calls` | integer | no | `23` | Total tool calls including non-GKG. Lets us compute GKG share of agent activity per session. |
| `tools_used` | array of string | no | `["query_graph", "get_graph_schema"]` | Which specific GKG tools were called. Shows whether agents use schema discovery before querying. |
| `fallback_to_rest` | boolean | no | `false` | Whether the agent called `gitlab_api_get` for data GKG covers. High fallback rate = GKG coverage gaps or agent trust issues. |
| `agent_name` | string | no | `"orbit_agent"` | Which DAP agent used GKG. Some agents may adopt GKG faster than others. |
| `flow_type` | string | no | `"duo_chat"` | DAP flow context. Required for segmenting GKG usage by user interaction type. |
| `input_tokens` | integer | no | `8940` | Session input tokens. Compared against non-GKG sessions to prove GKG reduces token consumption. |
| `output_tokens` | integer | no | `2810` | Session output tokens. Part of the token efficiency story. |
| `cache_read_tokens` | integer | no | `1200` | Prompt cache hits. GKG-heavy sessions may benefit more from caching due to repeated context. |

---

### 5. `gkg_schema_introspected`

Fired on `get_graph_schema` calls. Separate from query events because schema calls are discovery behavior: an agent or user trying to understand what data is available.

| Field | Type | Nullable | Example | Why it's needed |
|-------|------|----------|---------|-----------------|
| `schema_version` | string | no | `"2.1.0"` | Tracks which schema version agents are discovering. Detects when agents are running against stale schemas after ontology changes. |
| `domain_filter` | array of string | yes | `["code", "sdlc"]` | Whether the caller filtered the schema to a specific domain. Agents that filter are more sophisticated users. Null if no filter was passed. |

---

## What Each Event Unlocks

| Product Question | Source |
|-----------------|--------|
| Weekly Active Namespaces (North Star) | `gkg_query_executed` - distinct `namespace_id` / 7 days |
| DAU | `gkg_query_executed` - distinct `global_user_id` / day |
| Query volume by access method | `knowledge_graph_context` - `source_type` |
| Which query types are used most | `gkg_query_executed` - `query_type` |
| Which SDLC entities are most queried | `gkg_query_executed` - `entity_types_queried` |
| Silent failures (empty results) | `gkg_query_executed` - `result_status = empty` |
| Authorization overhead | `gkg_query_executed` - `authorization_duration_ms` vs `duration_ms` |
| ClickHouse cost per query type | `gkg_query_executed` - `ch_read_bytes` by `query_type` |
| Indexing success rate by language | `gkg_indexing_completed` - `status` + `languages_indexed` |
| Graph size and coverage per namespace | `gkg_indexing_completed` - node and edge counts |
| Data freshness / staleness | `gkg_indexing_completed` - `watermark_lag_seconds` |
| Activation funnel: enable to first query | `gkg_namespace_enabled` cohort vs first `gkg_query_executed` |
| DAP adoption rate | `gkg_dap_session_summary` - `total_gkg_calls > 0` |
| Token cost impact of GKG | `gkg_dap_session_summary` - `input_tokens` by `total_gkg_calls` |
| Agent fallback rate to REST | `gkg_dap_session_summary` - `fallback_to_rest` |
| Schema discovery usage | `gkg_schema_introspected` - frequency by `source_type` |

---

## Implementation Notes

- **Siphon is out of scope.** It reports via Prometheus; no product-level analytics are needed here. Confirmed with @ahegyi and @arun.sori in gitlab-org/gitlab#596156.
- **DWS layer events** (`orbit_dap_tool_called`, `orbit_dap_tool_failed`, `orbit_dap_fallback`) are specified in gitlab-org/orbit/knowledge-graph#434 and emit from DWS, not GKG. Coordinate with the DWS team.
- **Iglu schema** for `knowledge_graph_context` needs an MR in the iglu repository, tracked in gitlab-org/gitlab#596156.
- **Rust SDK:** Internal Events tracking in Rust is blocked on labkit-rs!46 (in review, @nbelokolodov).
- **Billing instrumentation** is separate from product analytics, tracked in the GKG Monetization Engineering epic (gitlab-org&21198) and ADR 007.
- **Type notes for Iglu MR:** `enum` fields are `type: string` with an `enum` constraint in JSON Schema. Nullable fields use `type: ["string", "null"]` or `["integer", "null"]` as appropriate.
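
The type notes for the Iglu MR can be illustrated with a small sketch. The fragment below covers only five of the `knowledge_graph_context` fields, omits the Iglu self-describing wrapper, and assumes that only non-nullable fields are listed under `required`; none of that is settled by this epic. The `validate` helper is a hypothetical, hand-rolled check for just the `type` and `enum` keywords, not a replacement for a real JSON Schema validator.

```python
# Hypothetical fragment of the knowledge_graph_context schema (illustrative only).
# enum fields are "type": "string" plus an "enum" constraint;
# nullable fields list "null" as an extra allowed type.
KNOWLEDGE_GRAPH_CONTEXT_FRAGMENT = {
    "type": "object",
    "properties": {
        "namespace_id": {"type": "string"},
        "global_user_id": {"type": ["string", "null"]},
        "user_type": {"type": "string", "enum": ["human", "service_account", "bot"]},
        "is_gitlab_team_member": {"type": ["boolean", "null"]},
        "source_type": {"type": "string", "enum": ["dap", "mcp", "rest_api", "cli"]},
    },
    # Assumption: nullable fields are optional keys; the Iglu MR may decide otherwise.
    "required": ["namespace_id", "user_type", "source_type"],
}

JSON_TYPES = {"string": str, "boolean": bool, "integer": int, "null": type(None)}


def matches_type(value, type_name):
    """Return True if value is an instance of the named JSON type."""
    if type_name == "integer":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, int) and not isinstance(value, bool)
    return isinstance(value, JSON_TYPES[type_name])


def validate(payload, schema):
    """Check required keys, types and enum membership; return a list of error strings."""
    errors = []
    props = schema["properties"]
    for name in schema.get("required", []):
        if name not in payload:
            errors.append(f"{name}: missing required field")
    for name, value in payload.items():
        if name not in props:
            errors.append(f"{name}: unexpected field")
            continue
        spec = props[name]
        allowed = spec["type"] if isinstance(spec["type"], list) else [spec["type"]]
        if not any(matches_type(value, t) for t in allowed):
            errors.append(f"{name}: expected {allowed}, got {type(value).__name__}")
        elif "enum" in spec and value not in spec["enum"]:
            errors.append(f"{name}: {value!r} not in {spec['enum']}")
    return errors


if __name__ == "__main__":
    context = {
        "namespace_id": "14523891",
        "global_user_id": None,  # nullable: bot or service account call
        "user_type": "service_account",
        "is_gitlab_team_member": None,
        "source_type": "mcp",
    }
    print(validate(context, KNOWLEDGE_GRAPH_CONTEXT_FRAGMENT))
```

A payload with `"user_type": "robot"` or `"namespace_id": null` would be rejected by this sketch, which is the behavior the enum and non-nullable columns in the context table describe.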

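The North Star definition (namespaces with at least 1 successful GKG query in the last 7 days) can be sketched as a small aggregation over `gkg_query_executed` events. This is an illustration only: the real metric would be computed in Snowflake, and the row shape used here (`event`, `timestamp` keys next to the documented `namespace_id` and `result_status`) is an assumption. It also assumes that only `result_status = success` counts as successful, so `empty` and `error` are excluded.

```python
from datetime import datetime, timedelta, timezone


def weekly_active_namespaces(events, as_of):
    """Return the set of namespace_ids with >= 1 successful query in the 7 days up to as_of."""
    window_start = as_of - timedelta(days=7)
    return {
        e["namespace_id"]
        for e in events
        if e["event"] == "gkg_query_executed"
        and e["result_status"] == "success"  # assumption: "empty" does not count
        and window_start < e["timestamp"] <= as_of
    }


def _row(namespace_id, status, day):
    """Build a hypothetical event row for January 2025 (illustrative data only)."""
    return {
        "event": "gkg_query_executed",
        "namespace_id": namespace_id,
        "result_status": status,
        "timestamp": datetime(2025, 1, day, 12, 0, tzinfo=timezone.utc),
    }


SAMPLE_EVENTS = [
    _row("1", "success", 14),
    _row("1", "success", 10),  # same namespace, counted once
    _row("2", "empty", 13),    # silent failure, not counted
    _row("3", "success", 1),   # outside the 7-day window
    _row("4", "error", 14),    # failed query, not counted
    _row("5", "success", 9),
]

if __name__ == "__main__":
    as_of = datetime(2025, 1, 15, 0, 0, tzinfo=timezone.utc)
    active = weekly_active_namespaces(SAMPLE_EVENTS, as_of)
    print(sorted(active), len(active))
```

Counting distinct `namespace_id` values rather than events is what keeps a single heavy namespace from inflating the metric.
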