feat(graph): gkgaas query planner: json schema to ast scaffolding
What does this MR do and why?
This change introduces a complete query compilation pipeline that converts LLM-generated JSON queries into SQL-oriented Abstract Syntax Trees (ASTs). The implementation establishes a two-step pipeline: first, JSON is parsed and schema-validated into in-memory Go structs; then those structs are translated (i.e., "lowered") into an AST-like data structure. This creates a foundation for executing graph queries against ClickHouse.
Specifically, we have three components of the pipeline with a strict separation of concerns:
- Input Layer (internal/engine/input.go): Parses raw JSON into typed Go structures (Input, InputNode, InputRelationship, InputAggregation, InputPath). This layer handles polymorphic JSON schemas where fields like type can be either a string or an array, and filters can be simple equality values or operator-based objects.
- AST Layer (internal/engine/ast.go): Defines SQL-oriented node types (Query, RecursiveCTE, TableRef, Expr) that directly map to SQL constructs. The AST uses interface-based polymorphism with marker methods (node(), expr(), tableRef()) to enforce type safety.
- Lowering Layer (internal/engine/lower.go): Converts Input structures into AST nodes through query-type-specific logic (lowerTraversal, lowerAggregation, lowerPathFinding). This phase constructs JOIN chains, WHERE clauses, and aggregation expressions.
The compilation entry point (internal/engine/main.go) coordinates these phases: parse JSON → validate against schema → lower to AST.
Pipeline Design Details (in order of execution)
1. JSON Schema Definition (internal/engine/conf/schema.json)
A 785-line JSON Schema document defines the contract between LLM output and the query engine. Key specifications:
- Query Types: Enum of traversal, aggregation, path_finding, pattern
- Node Labels: Enum constraining valid GitLab entity types (e.g., Issue, MergeRequest, CiPipeline)
- Relationship Types: Enum of 60+ relationship names (e.g., AUTHORED_ISSUE, CI_PIPELINE_TO_MR, BLOCKS)
- Conditional Requirements: If query_type is aggregation, then the aggregations array must have minItems: 1. If path_finding, then the path object is required.
- Security Constraints: max_hops and max_depth capped at 3 to prevent resource exhaustion
The schema includes 7 example queries demonstrating traversal with filters, aggregation with GROUP BY, path finding, variable-length relationships, advanced filter operators, multi-relationship unions, and collection aggregations.
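To make the conditional requirements concrete, here is a minimal sketch that restates them as plain Go checks. It is a hand-written stand-in for illustration only; the MR enforces these rules through the JSON Schema, not through code like this. The query struct and its field names are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// maxHopsLimit mirrors the cap of 3 described in the schema summary.
const maxHopsLimit = 3

// query is a hypothetical, trimmed-down view of the input.
// The field names are illustrative and not taken from input.go.
type query struct {
	QueryType    string
	Aggregations []string
	HasPath      bool
	MaxHops      int
}

// checkConditional restates the schema's conditional rules as Go checks.
func checkConditional(q query) error {
	switch q.QueryType {
	case "traversal", "pattern":
		// No extra requirements are described for these types.
	case "aggregation":
		if len(q.Aggregations) < 1 {
			return errors.New("aggregation queries need at least one aggregation")
		}
	case "path_finding":
		if !q.HasPath {
			return errors.New("path_finding queries need a path object")
		}
	default:
		return fmt.Errorf("unknown query_type %q", q.QueryType)
	}
	if q.MaxHops > maxHopsLimit {
		return fmt.Errorf("max_hops %d exceeds limit %d", q.MaxHops, maxHopsLimit)
	}
	return nil
}

func main() {
	fmt.Println(checkConditional(query{QueryType: "aggregation"}))
	fmt.Println(checkConditional(query{QueryType: "traversal", MaxHops: 2}))
}
```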
2. Schema Validation Integration (internal/engine/parse.go)
Added dependency github.com/xeipuuv/gojsonschema (visible in go.mod and go.sum changes). The ValidateSchema function accepts a schema string and a Go value, using NewGoLoader to convert the value to its JSON representation for validation. Returns formatted validation errors from the schema library.
3. Polymorphic JSON Parsing (internal/engine/input.go)
The input parser handles two forms of polymorphism:
Filter Polymorphism: A filter can be:
- Simple: "state": "opened" (shorthand for equality)
- Operator-based: "created_at": {"op": "gte", "value": "2024-01-01"}
The parseFilter function attempts to unmarshal as an operator-based filter first, falling back to a simple value. The InputFilter struct tracks which form was used via the IsSimple boolean field.
Relationship Type Polymorphism: The type field accepts:
- Single string: "type": "BLOCKS"
- Array: "type": ["BLOCKS", "RELATES_TO"] (for UNION ALL semantics)
The parseRelationship function unmarshals raw.Type as json.RawMessage, then attempts single-string unmarshal before falling back to array unmarshal. Default values are applied during parsing: limit defaults to 30, min_hops to 1, max_hops to 1, direction to outgoing, and id_property to id.
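A minimal sketch of the string-or-array handling and the relationship defaults might look like this. The struct layout, the signature, and the assumption that min_hops, max_hops, and direction live on the relationship are illustrative guesses; only the default values and the json.RawMessage approach come from the description.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// InputRelationship is a trimmed-down, hypothetical version of the real type.
type InputRelationship struct {
	Types     []string
	Direction string
	MinHops   int
	MaxHops   int
}

func parseRelationship(data []byte) (InputRelationship, error) {
	// Pointer fields distinguish "absent" from an explicit zero.
	var raw struct {
		Type      json.RawMessage `json:"type"`
		Direction string          `json:"direction"`
		MinHops   *int            `json:"min_hops"`
		MaxHops   *int            `json:"max_hops"`
	}
	if err := json.Unmarshal(data, &raw); err != nil {
		return InputRelationship{}, err
	}

	// Start from the documented defaults, then override.
	rel := InputRelationship{Direction: "outgoing", MinHops: 1, MaxHops: 1}

	// Try a single string first, then fall back to an array of strings.
	var single string
	if err := json.Unmarshal(raw.Type, &single); err == nil {
		rel.Types = []string{single}
	} else if err := json.Unmarshal(raw.Type, &rel.Types); err != nil {
		return InputRelationship{}, fmt.Errorf("type must be a string or an array of strings: %w", err)
	}

	if raw.Direction != "" {
		rel.Direction = raw.Direction
	}
	if raw.MinHops != nil {
		rel.MinHops = *raw.MinHops
	}
	if raw.MaxHops != nil {
		rel.MaxHops = *raw.MaxHops
	}
	return rel, nil
}

func main() {
	one, _ := parseRelationship([]byte(`{"type": "BLOCKS"}`))
	many, _ := parseRelationship([]byte(`{"type": ["BLOCKS", "RELATES_TO"], "max_hops": 3}`))
	fmt.Printf("%+v\n%+v\n", one, many)
}
```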
4. SQL-Oriented AST Design (internal/engine/ast.go)
The AST file (232 lines) defines types that directly represent SQL semantics:
Expression Types:
- ColumnRef: table.column reference
- Literal: Constant values (supports int, string, bool, nil, []any)
- FuncCall: Aggregates and functions (e.g., COUNT(x), SUM(x), arrayConcat(...))
- BinaryOp: Infix operators (=, !=, <, >, AND, OR, IN, LIKE)
- UnaryOp: Prefix/suffix operators (NOT, IS NULL, IS NOT NULL)
Table Reference Types:
- TableScan: Physical table access with optional type filter (e.g., nodes AS i WHERE label = 'Issue')
- Join: Combines two TableRef instances with a join condition (INNER or LEFT)
Top-Level Query Types:
- Query: Standard SELECT with From, Where, GroupBy, OrderBy, Limit
- RecursiveCTE: Recursive common table expression with Base (anchor), Recursive (recursive term), MaxDepth, and Final query
Builder functions (Col, Lit, Func, Eq, And, Or) provide ergonomic construction.
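For readers unfamiliar with the marker-method pattern, the sketch below shows how unexported methods restrict which types can be used as expressions, and how small builders keep construction readable. The type and builder names follow the description; the field names, builder signatures, and the render helper are assumptions and may differ from ast.go.

```go
package main

import "fmt"

// Node and Expr use unexported marker methods so that only types in
// this package can satisfy them.
type Node interface{ node() }

type Expr interface {
	Node
	expr()
}

type ColumnRef struct{ Table, Column string }
type Literal struct{ Value any }
type BinaryOp struct {
	Op          string
	Left, Right Expr
}

func (ColumnRef) node() {}
func (ColumnRef) expr() {}
func (Literal) node()   {}
func (Literal) expr()   {}
func (BinaryOp) node()  {}
func (BinaryOp) expr()  {}

// Builders (signatures are assumed, not copied from the MR).
func Col(table, column string) Expr { return ColumnRef{Table: table, Column: column} }
func Lit(v any) Expr                { return Literal{Value: v} }
func Eq(l, r Expr) Expr             { return BinaryOp{Op: "=", Left: l, Right: r} }

// And left-folds its arguments into nested AND nodes; it returns nil
// when called with no arguments.
func And(exprs ...Expr) Expr {
	if len(exprs) == 0 {
		return nil
	}
	out := exprs[0]
	for _, e := range exprs[1:] {
		out = BinaryOp{Op: "AND", Left: out, Right: e}
	}
	return out
}

// render is a hypothetical helper for this example only. It does not
// escape string literals and must not be used to build real SQL.
func render(e Expr) string {
	switch x := e.(type) {
	case ColumnRef:
		return x.Table + "." + x.Column
	case Literal:
		if s, ok := x.Value.(string); ok {
			return "'" + s + "'"
		}
		return fmt.Sprint(x.Value)
	case BinaryOp:
		return "(" + render(x.Left) + " " + x.Op + " " + render(x.Right) + ")"
	default:
		return "?"
	}
}

func main() {
	where := And(
		Eq(Col("i", "state"), Lit("opened")),
		Eq(Col("i", "id"), Lit(42)),
	)
	fmt.Println(render(where))
}
```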
5. Query Lowering Logic (internal/engine/lower.go)
The Lower function dispatches on input.QueryType to three specialized lowering functions:
Traversal Lowering (lowerTraversal):
- Constructs JOIN chain: For each relationship, joins edges table with target node table
- Handles bidirectional joins: When direction: "both", generates OR condition: (from = X OR to = X)
- Builds WHERE clause combining node ID filters, ID ranges, and property filters
- Returns Query with node IDs in SELECT clause
Aggregation Lowering (lowerAggregation):
- Same JOIN construction as traversal
- Builds SELECT with GROUP BY columns first, then aggregation functions
- Maps function names: collect → groupArray (ClickHouse-specific)
- Handles aggregation_sort: Orders by Nth aggregation result
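The SELECT ordering and function-name mapping can be illustrated with a small sketch. Only the collect → groupArray mapping and the COUNT/SUM names are taken from this description; the remaining map entries, the agg type, and the agg_N alias scheme are assumptions made for the example.

```go
package main

import "fmt"

// aggFuncs maps input function names to SQL function names.
// Entries other than count, sum, and collect are assumed.
var aggFuncs = map[string]string{
	"count":   "COUNT",
	"sum":     "SUM",
	"avg":     "AVG",
	"min":     "MIN",
	"max":     "MAX",
	"collect": "groupArray", // ClickHouse-specific
}

// agg is a hypothetical, simplified aggregation request.
type agg struct{ Func, Arg string }

// selectList places GROUP BY columns first, then one aliased
// aggregate per request. The alias gives aggregation_sort a stable
// name to order by (for example, agg_1 for the second aggregation).
func selectList(groupBy []string, aggs []agg) ([]string, error) {
	cols := append([]string{}, groupBy...)
	for i, a := range aggs {
		fn, ok := aggFuncs[a.Func]
		if !ok {
			return nil, fmt.Errorf("unsupported aggregation %q", a.Func)
		}
		cols = append(cols, fmt.Sprintf("%s(%s) AS agg_%d", fn, a.Arg, i))
	}
	return cols, nil
}

func main() {
	cols, err := selectList(
		[]string{"u.id"},
		[]agg{{Func: "count", Arg: "i.id"}, {Func: "collect", Arg: "i.title"}},
	)
	fmt.Println(cols, err)
}
```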
Path Finding Lowering (lowerPathFinding):
- Constructs RecursiveCTE with three parts:
  - Base: Selects start node with initial path array
  - Recursive: Joins previous iteration with edges, extends path, increments depth
  - Final: Filters to end node, orders by depth (shortest first)
- Prevents cycles: WHERE NOT has(p.path, n.id)
- Enforces depth limit: WHERE p.depth < max_depth
The buildFrom helper constructs nested JOIN trees by iterating through relationships, assigning edge aliases (e0, e1, ...), and determining join direction based on direction field.
6. Visualization and Debugging Tools (internal/engine/visualize.go)
Two visualization functions support debugging:
VisualizeNode: Renders AST as tree using box-drawing characters:
└── Query
    ├── SELECT
    │   ├── i.id AS i_id
    │   └── p.id AS p_id
    ├── FROM
    │   └── INNER JOIN
    │       ├── TableScan(nodes AS i) [Issue]
    │       ├── TableScan(edges AS e0)
    │       └── ON i.id = e0.from_id
PrettyPrintQuery: Renders Query as SQL-like text with indentation, including WHERE, GROUP BY, ORDER BY, and LIMIT clauses.
Both functions appear in test output (visible in internal/engine/tests/lower_test.go via t.Log calls).
Related Issues
Testing
I have implemented basic test cases for now.
Performance Analysis
- This merge request does not introduce any performance regression.