Instrument GKG for Product Analytics (Snowplow)
## Overview
Defines the Snowplow event schemas and instrumentation plan for GitLab Knowledge Graph (GKG/Orbit). This covers product analytics only: events flow to Snowflake via the GitLab Internal Events framework. Operational metrics (OTel/Prometheus) are covered separately in `docs/design-documents/observability.md`.
**North Star Metric:** Weekly Active Namespaces (WAN): namespaces with at least 1 successful GKG query in the last 7 days.
**Iglu schema work:** gitlab-org/gitlab#596156
---
## Custom Iglu Schema: `knowledge_graph_context`
The standard `gitlab_standard` context does not have the fields GKG needs. We need a custom context attached to all GKG Snowplow events.
| Field | Type | Nullable | Example | Why it's needed |
|-------|------|----------|---------|-----------------|
| `correlation_id` | string | no | `"d3f1a2b4-7e8c-..."` | Standard request trace ID. Enables end-to-end debugging of DAP → GKG call chains. |
| `namespace_id` | string | no | `"14523891"` | The unit of adoption. Without this we cannot compute WAN (North Star) or any per-namespace metric. |
| `root_namespace_id` | string | no | `"9812345"` | Top-level group (TLG) of the caller's namespace. Required for billing rollup, licensing attribution, and enterprise segmentation. Available as token claim in the monolith. |
| `global_user_id` | string | yes | `"usr_8819234"` | Identity of the calling user. Required for MAU/DAU metrics and per-user behavior analysis. Null for bot or service account calls. |
| `user_type` | enum `human`, `service_account`, `bot` | no | `"human"` | Distinguishes human users from automated callers. Prevents bots and CI pipelines from inflating human adoption metrics. |
| `is_gitlab_team_member` | boolean | yes | `false` | Separates internal dogfooding from real external adoption. Null if not determinable. |
| `source_type` | enum `dap`, `mcp`, `rest_api`, `cli` | no | `"mcp"` | Separates zero-rated DAP usage from billable MCP usage. Core to monetization and access method analysis. |
| `tool_name` | enum `query_graph`, `get_graph_schema` | no | `"query_graph"` | Distinguishes query execution from schema discovery. Different behavior, different product signals. |
| `tier` | enum `premium`, `ultimate` | no | `"ultimate"` | Required for segmenting adoption by license tier and for pricing analysis. |
| `deployment_type` | enum `com`, `dedicated`, `self_managed` | no | `"com"` | Required for SM/Dedicated rollout attribution and billing path segmentation. |
| `session_id` | string | yes | `"sess_a1b2c3d4"` | Correlates multiple GKG tool calls within a single DAP session. Null for non-DAP access. |
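
To make the shape of the context concrete, here is a minimal sketch of how one payload could be assembled as a self-describing JSON entity. The schema URI (vendor and version) and the `build_context` helper are assumptions for illustration only; the real URI is decided by the Iglu MR. The `correlation_id` value is a placeholder, and the other values are the examples from the table above.

```python
import json

# Assumption: vendor and version are placeholders until the Iglu MR lands.
SCHEMA_URI = "iglu:com.gitlab/knowledge_graph_context/jsonschema/1-0-0"

# Fields marked non-nullable in the table above.
REQUIRED = (
    "correlation_id", "namespace_id", "root_namespace_id", "user_type",
    "source_type", "tool_name", "tier", "deployment_type",
)
# Fields marked nullable in the table above.
NULLABLE = ("global_user_id", "is_gitlab_team_member", "session_id")


def build_context(**fields) -> dict:
    """Hypothetical helper: wrap the context fields in a self-describing JSON."""
    missing = [name for name in REQUIRED if fields.get(name) is None]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    # Nullable fields that were not passed are emitted as explicit nulls.
    data = {name: fields.get(name) for name in REQUIRED + NULLABLE}
    return {"schema": SCHEMA_URI, "data": data}


context = build_context(
    correlation_id="example-correlation-id",  # placeholder, not a real trace ID
    namespace_id="14523891",
    root_namespace_id="9812345",
    global_user_id="usr_8819234",
    user_type="human",
    is_gitlab_team_member=False,
    source_type="mcp",
    tool_name="query_graph",
    tier="ultimate",
    deployment_type="com",
    # session_id omitted: null for non-DAP access
)
print(json.dumps(context, indent=2))
```

Emitting nullable fields as explicit nulls rather than dropping them is a design choice in this sketch, not a requirement stated above.
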
---
## Events to Instrument
### 1. `gkg_query_executed`
Fired on every call to the webserver (`query_graph` and `get_graph_schema`).
| Field | Type | Nullable | Example | Why it's needed |
|-------|------|----------|---------|-----------------|
| `query_type` | enum `find_nodes`, `traverse`, `explore`, `aggregate`, `schema` | no | `"traverse"` | Shows which query patterns agents prefer. High traverse = real dependency analysis happening. |
| `entity_types` | array of string | yes | `["Merge Request", "Issue"]` | Shows which entities the query targets. |
| `result_status` | enum `success`, `empty`, `error` | no | `"success"` | `empty` is the silent failure mode: query ran but returned nothing. Without this we cannot see it. |
| `result_row_count` | integer | no | `42` | Measures query usefulness. A query returning 1 row vs 500 rows tells a different story about agent behavior. |
| `rows_redacted` | integer | no | `3` | Shows authorization overhead on results. High redaction = namespace scoping is too narrow for the query. |
| `duration_ms` | integer | no | `340` | End-to-end latency for SLO tracking and user experience monitoring. |
| `compile_duration_ms` | integer | no | `45` | Isolates query compilation time. Spikes here indicate schema complexity or planner bugs. |
| `execute_duration_ms` | integer | no | `115` | Isolates ClickHouse execution time. Spikes here indicate data volume or query inefficiency. |
| `authorization_duration_ms` | integer | no | `180` | Authorization is expected to dominate latency. Tracks whether Rails auth caching is working. |
| `ch_read_rows` | integer | no | `128400` | ClickHouse scan volume per query. Direct input to cost modeling per query type. |
| `ch_read_bytes` | integer | no | `18874368` | Bytes scanned per query. Primary ClickHouse cost driver - used for pricing and quota decisions. |
| `ch_memory_bytes` | integer | no | `4194304` | Peak memory per query. Required to set safe per-query memory limits in ClickHouse. |
| `entity_types_queried` | array of string | no | `["projects", "pipelines"]` | Shows which parts of the SDLC graph are actually used. Informs indexing and ontology prioritization. |
| `traversal_depth` | integer | yes | `3` | Hop count for traverse queries. Deep traversals are expensive - needed to detect and cap runaway queries. Null for non-traverse. |
| `error_reason` | enum `security_rejected`, `execution_failed`, `authorization_failed`, `timeout`, `rate_limited`, `validation_failed`, `allowlist_rejected` | yes | `null` | Classifies failures so we can distinguish auth problems from ClickHouse problems from bad queries. Null on success. |
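
The relationship between `result_status`, `error_reason`, and the timing fields can be sketched as follows. This is an illustration under stated assumptions, not the webserver's actual logic: it assumes `empty` is judged on the rows returned to the caller, and it treats any gap between `duration_ms` and the three stage timers as unattributed time. Both function names are hypothetical.

```python
from typing import Optional


def classify_result(row_count: int, error_reason: Optional[str]) -> str:
    """Map a finished query to a `result_status` value."""
    if error_reason is not None:
        return "error"
    # Assumption: "empty" is judged on the rows actually returned to the caller.
    return "success" if row_count > 0 else "empty"


def unattributed_ms(event: dict) -> int:
    """Latency not covered by the compile, execute, and authorization timers."""
    staged = (
        event["compile_duration_ms"]
        + event["execute_duration_ms"]
        + event["authorization_duration_ms"]
    )
    return event["duration_ms"] - staged


# Example values taken from the table above.
example = {
    "duration_ms": 340,
    "compile_duration_ms": 45,
    "execute_duration_ms": 115,
    "authorization_duration_ms": 180,
}
print(classify_result(42, None))       # prints "success"
print(classify_result(0, None))        # prints "empty"
print(classify_result(0, "timeout"))   # prints "error"
print(unattributed_ms(example))        # prints 0
```

A large unattributed value would point at time spent outside the three measured stages, such as serialization or network overhead.
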
---
### 2. `gkg_indexing_completed`
Fired when an indexing job finishes for a namespace (code graph and SDLC).
| Field | Type | Nullable | Example | Why it's needed |
|-------|------|----------|---------|-----------------|
| `indexing_type` | enum `initial`, `incremental` | no | `"incremental"` | Initial indexes are expensive and rare. Incremental are frequent and cheap. Different SLOs apply to each. |
| `index_domain` | enum `code`, `sdlc` | no | `"code"` | Code graph and SDLC indexing are separate pipelines with different failure modes and coverage metrics. |
| `trigger_type` | enum `push`, `scheduled`, `manual` | no | `"push"` | Distinguishes real-time indexing from background jobs. Push-triggered failures are user-visible. |
| `status` | enum `success`, `partial`, `failed` | no | `"partial"` | `partial` is the critical case: the namespace looks indexed but some files are missing. Silent gap in coverage. |
| `languages_indexed` | array of string | no | `["ruby", "python"]` | Shows which languages were touched. Required for per-language success rate analysis. |
| `file_count_processed` | integer | no | `4821` | Total successfully parsed files. Denominator for coverage calculations. |
| `file_count_skipped` | integer | no | `14` | Files skipped due to locks or checkpoints. Persistent skips indicate pipeline stalls. |
| `file_count_errored` | integer | no | `3` | Files that failed to parse. Combined with language data, pinpoints which languages are unreliable. |
| `node_count_directory` | integer | no | `312` | Graph structure metric. Tracks repo complexity over time. |
| `node_count_file` | integer | no | `4821` | Should match `file_count_processed`. Discrepancies indicate indexing bugs. |
| `node_count_definition` | integer | no | `38400` | Functions, classes, methods indexed. Primary measure of code graph richness. |
| `node_count_imported_symbol` | integer | no | `21000` | Cross-file references captured. More imports = better blast radius and dependency analysis. |
| `node_count_edge` | integer | no | `194000` | Total relationships in the graph. More edges = richer answers to traversal queries. |
| `sdlc_entity_type` | string | yes | `"pipelines"` | Which SDLC entity was indexed in this run. Needed to track per-entity coverage and freshness. Null for code indexing runs. |
| `sdlc_rows_processed` | integer | yes | `15200` | Rows written for this SDLC entity. Tracks data volume and detects empty or truncated indexing runs. Null for code indexing runs. |
| `watermark_lag_seconds` | number | no | `127.4` | Seconds between current watermark and wall clock. Measures how stale the SDLC graph data is. |
| `repository_resolution` | enum `cache_hit`, `incremental`, `full_download`, `full_download_fallback` | no | `"cache_hit"` | Shows whether we efficiently reused cached archives or had to re-download. Drives infrastructure cost analysis. |
| `duration_ms` | integer | no | `8420` | End-to-end indexing time. Used for SLO tracking and capacity planning. |
| `error_stage` | enum `decode`, `repository_fetch`, `indexing`, `arrow_conversion`, `write`, `checkpoint` | yes | `null` | Pinpoints where in the pipeline a failure occurred. Essential for debugging partial and failed indexes. Null on success. |
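
Several rows above describe invariants that can be checked on each event before analysis. The sketch below encodes three of them: `node_count_file` matching `file_count_processed`, `error_stage` being null on success, and the SDLC fields being null for code runs. The `indexing_anomalies` function is a hypothetical helper, and applying the file-count check only to code runs is an assumption.

```python
def indexing_anomalies(event: dict) -> list[str]:
    """Return descriptions of invariants from the table that this event violates."""
    problems = []
    is_code_run = event["index_domain"] == "code"

    # Assumption: the file-count invariant is only meaningful for code runs.
    if is_code_run and event["node_count_file"] != event["file_count_processed"]:
        problems.append("node_count_file does not match file_count_processed")

    if event["status"] == "success" and event["error_stage"] is not None:
        problems.append("error_stage is set on a successful run")

    sdlc_fields = (event["sdlc_entity_type"], event["sdlc_rows_processed"])
    if is_code_run and any(value is not None for value in sdlc_fields):
        problems.append("SDLC fields are set on a code indexing run")

    return problems


# A code indexing run using the example counts from the table above.
example = {
    "index_domain": "code",
    "status": "partial",
    "file_count_processed": 4821,
    "node_count_file": 4821,
    "sdlc_entity_type": None,
    "sdlc_rows_processed": None,
    "error_stage": None,
}
print(indexing_anomalies(example))  # prints []
```
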
---
### 3. `gkg_namespace_enabled`
Fired when an admin enables GKG for a namespace.
| Field | Type | Nullable | Example | Why it's needed |
|-------|------|----------|---------|-----------------|
| `namespace_type` | enum `group`, `project` | no | `"group"` | Group-level enablement covers all child projects. Needed to understand activation scope. |
| `enabled_by_user_id` | string | no | `"8819234"` | Identifies who enables GKG. Enables cohort analysis: do admin-enabled namespaces activate faster than self-serve? |
---
### 4. `gkg_dap_session_summary`
Fired at end of a DAP session where GKG tools were available. DAP-only.
| Field | Type | Nullable | Example | Why it's needed |
|-------|------|----------|---------|-----------------|
| `total_gkg_calls` | integer | no | `9` | Primary measure of GKG depth-of-use per session. Distinguishes shallow from deep adoption. |
| `total_tool_calls` | integer | no | `23` | Total tool calls including non-GKG. Lets us compute GKG share of agent activity per session. |
| `tools_used` | array of string | no | `["query_graph", "get_graph_schema"]` | Which specific GKG tools were called. Shows whether agents use schema discovery before querying. |
| `fallback_to_rest` | boolean | no | `false` | Whether the agent called `gitlab_api_get` for data that GKG covers. High fallback rate = GKG coverage gaps or agent trust issues. |
| `agent_name` | string | no | `"orbit_agent"` | Which DAP agent used GKG. Some agents may adopt GKG faster than others. |
| `flow_type` | string | no | `"duo_chat"` | DAP flow context. Required for segmenting GKG usage by user interaction type. |
| `input_tokens` | integer | no | `8940` | Session input tokens. Compared against non-GKG sessions to prove GKG reduces token consumption. |
| `output_tokens` | integer | no | `2810` | Session output tokens. Part of the token efficiency story. |
| `cache_read_tokens` | integer | no | `1200` | Prompt cache hits. GKG-heavy sessions may benefit more from caching due to repeated context. |
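
Two derived measures mentioned above, GKG share of tool calls and the token comparison between sessions that did and did not call GKG, might be computed along these lines. The function names and the sample sessions are hypothetical; only the first session uses the example values from the table. A raw difference in means is a starting point for analysis, not proof of a causal effect.

```python
from statistics import mean
from typing import Optional


def gkg_call_share(session: dict) -> Optional[float]:
    """Fraction of a session's tool calls that went to GKG, or None if there were none."""
    total = session["total_tool_calls"]
    return session["total_gkg_calls"] / total if total else None


def mean_input_tokens_by_gkg_use(sessions: list[dict]) -> dict:
    """Average `input_tokens` for sessions that called GKG versus those that did not."""
    used = [s["input_tokens"] for s in sessions if s["total_gkg_calls"] > 0]
    unused = [s["input_tokens"] for s in sessions if s["total_gkg_calls"] == 0]
    return {
        "gkg_used": mean(used) if used else None,
        "gkg_not_used": mean(unused) if unused else None,
    }


# Illustrative data: the first row matches the table examples, the rest are made up.
sessions = [
    {"total_gkg_calls": 9, "total_tool_calls": 23, "input_tokens": 8940},
    {"total_gkg_calls": 0, "total_tool_calls": 12, "input_tokens": 12000},
    {"total_gkg_calls": 0, "total_tool_calls": 0, "input_tokens": 500},
]
print(gkg_call_share(sessions[0]))
print(mean_input_tokens_by_gkg_use(sessions))
```
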
---
### 5. `gkg_schema_introspected`
Fired on `get_graph_schema` calls. Separate from query events because schema calls are discovery behavior: an agent or user trying to understand what data is available.
| Field | Type | Nullable | Example | Why it's needed |
|-------|------|----------|---------|-----------------|
| `schema_version` | string | no | `"2.1.0"` | Tracks which schema version agents are discovering. Detects when agents are running against stale schemas after ontology changes. |
| `domain_filter` | array of string | yes | `["code", "sdlc"]` | Whether the caller filtered schema to a specific domain. Agents that filter are more sophisticated users. Null if no filter passed. |
---
## What Each Event Unlocks
| Product Question | Source |
|-----------------|--------|
| Weekly Active Namespaces (North Star) | `gkg_query_executed` - distinct `namespace_id` where `result_status = success`, trailing 7 days |
| DAU | `gkg_query_executed` - distinct `global_user_id` / day |
| Query volume by access method | `knowledge_graph_context` - `source_type` |
| Which query types are used most | `gkg_query_executed` - `query_type` |
| Which SDLC entities are most queried | `gkg_query_executed` - `entity_types_queried` |
| Silent failures (empty results) | `gkg_query_executed` - `result_status = empty` |
| Authorization overhead | `gkg_query_executed` - `authorization_duration_ms` vs `duration_ms` |
| ClickHouse cost per query type | `gkg_query_executed` - `ch_read_bytes` by `query_type` |
| Indexing success rate by language | `gkg_indexing_completed` - `status` + `languages_indexed` |
| Graph size and coverage per namespace | `gkg_indexing_completed` - node and edge counts |
| Data freshness / staleness | `gkg_indexing_completed` - `watermark_lag_seconds` |
| Activation funnel: enable to first query | `gkg_namespace_enabled` cohort vs first `gkg_query_executed` |
| DAP adoption rate | `gkg_dap_session_summary` - `total_gkg_calls > 0` |
| Token cost impact of GKG | `gkg_dap_session_summary` - `input_tokens` by `total_gkg_calls` |
| Agent fallback rate to REST | `gkg_dap_session_summary` - `fallback_to_rest` |
| Schema discovery usage | `gkg_schema_introspected` - frequency by `source_type` |
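
As a worked example of the first row, the North Star metric could be computed from raw `gkg_query_executed` events roughly like this. It is a sketch, not the production query: it assumes each event carries a `timestamp` (a hypothetical field name for the collector timestamp), that only `result_status = success` counts as a successful query, and that the window is a trailing 7 days rather than a calendar week.

```python
from datetime import datetime, timedelta, timezone


def weekly_active_namespaces(events: list[dict], as_of: datetime) -> int:
    """Count distinct namespaces with at least one successful query in the 7 days up to `as_of`."""
    window_start = as_of - timedelta(days=7)
    active = {
        event["namespace_id"]
        for event in events
        # Assumption: `empty` and `error` results do not count as successful.
        if event["result_status"] == "success"
        and window_start < event["timestamp"] <= as_of
    }
    return len(active)


# Illustrative events; namespace IDs and timestamps are made up.
now = datetime(2025, 1, 15, tzinfo=timezone.utc)
events = [
    {"namespace_id": "1", "result_status": "success", "timestamp": now - timedelta(days=1)},
    {"namespace_id": "1", "result_status": "success", "timestamp": now - timedelta(days=2)},
    {"namespace_id": "2", "result_status": "empty", "timestamp": now - timedelta(days=1)},
    {"namespace_id": "3", "result_status": "success", "timestamp": now - timedelta(days=8)},
]
print(weekly_active_namespaces(events, now))  # prints 1
```

Namespace `2` is excluded because its only query returned nothing, and namespace `3` because its query falls outside the window.
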
---
## Implementation Notes
- **Siphon is out of scope.** Siphon reports via Prometheus; no product-level analytics needed here. Confirmed with @ahegyi and @arun.sori in gitlab-org/gitlab#596156.
- **DWS layer events** (`orbit_dap_tool_called`, `orbit_dap_tool_failed`, `orbit_dap_fallback`) are specified in gitlab-org/orbit/knowledge-graph#434 and emit from DWS, not GKG. Coordinate with the DWS team.
- **Iglu schema** for `knowledge_graph_context` needs an MR in the iglu repository, tracked in gitlab-org/gitlab#596156.
- **Rust SDK:** Internal Events tracking in Rust is blocked on labkit-rs!46 (in review, @nbelokolodov).
- **Billing instrumentation** is separate from product analytics, tracked in the GKG Monetization Engineering epic (gitlab-org&21198) and ADR 007.
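- **Type notes for Iglu MR:** `enum` fields are `type: string` with an `enum` constraint in JSON Schema. Nullable fields use `type: ["string", "null"]` or `["integer", "null"]` as appropriate. Nullable enum fields (`error_reason`, `error_stage`) must also list `null` among the `enum` values, because `enum` constrains every value regardless of `type`.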
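
The type notes can be sketched as a small generator that turns one table row into a JSON Schema property fragment. `json_schema_property` is a hypothetical helper, not part of any existing tooling. It also handles one detail worth checking in the MR: in JSON Schema, `enum` applies to every value regardless of `type`, so a nullable enum has to include `null` in its value list.

```python
from typing import Any, Optional, Sequence


def json_schema_property(
    base_type: str,
    nullable: bool = False,
    enum: Optional[Sequence[str]] = None,
) -> dict[str, Any]:
    """Build a JSON Schema property fragment from one row of the field tables."""
    prop: dict[str, Any] = {"type": [base_type, "null"] if nullable else base_type}
    if enum is not None:
        values: list[Any] = list(enum)
        # `enum` constrains every instance regardless of `type`, so a nullable
        # enum must list null explicitly or null values fail validation.
        if nullable:
            values.append(None)
        prop["enum"] = values
    return prop


source_type = json_schema_property("string", enum=["dap", "mcp", "rest_api", "cli"])
session_id = json_schema_property("string", nullable=True)
traversal_depth = json_schema_property("integer", nullable=True)
error_stage = json_schema_property(
    "string",
    nullable=True,
    enum=["decode", "repository_fetch", "indexing", "arrow_conversion", "write", "checkpoint"],
)
print(source_type)
print(error_stage)
```
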