Proposal: Create "One Parser" - A Unified Static Code Analysis Library

Vision Statement

Establish a single, efficient, and reliable static code analysis library (One Parser) built in Rust, serving as the foundation for diverse code intelligence features across GitLab, from server-side indexing (Knowledge Graph, Embeddings) to client-side analysis (Language Server, Web IDE). The initial scope is AI and Editor features.

Problem to Solve

Background

The need for advanced code understanding is growing within GitLab, powering features such as:

Knowledge Graph: Requires detailed analysis of code entities (classes, functions, modules) and their relationships (calls, imports, inheritance) to build a comprehensive graph representation of codebases.
Codebase Chat Context / Embeddings: Needs to parse code to create meaningful chunks (e.g., by class or function) for generating embeddings used in semantic search and AI-driven chat features.
Language Server / IDE Features: Must analyze code (including uncommitted local changes) to provide real-time feedback, navigation, and context-aware assistance directly to the developer.
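
To make the chunking requirement for Embeddings concrete, here is a minimal sketch of splitting a file into one chunk per definition. It assumes a parser has already produced byte ranges for each definition; `DefinitionSpan`, `Chunk`, and `chunk_by_definition` are hypothetical names used for illustration and are not part of any existing GitLab or tree-sitter API.

```rust
/// Hypothetical byte range of one definition (class, function, ...) in a file.
/// In practice these ranges would come from the parser's syntax tree.
#[derive(Debug, Clone, PartialEq)]
pub struct DefinitionSpan {
    pub name: String,
    pub start_byte: usize,
    pub end_byte: usize, // exclusive
}

/// One chunk of source text, labelled with the definition it came from.
#[derive(Debug, PartialEq)]
pub struct Chunk<'a> {
    pub name: &'a str,
    pub text: &'a str,
}

/// Slice `source` into one chunk per definition span.
/// Spans that are out of range or do not fall on UTF-8 character
/// boundaries are skipped instead of causing a panic.
pub fn chunk_by_definition<'a>(
    source: &'a str,
    spans: &'a [DefinitionSpan],
) -> Vec<Chunk<'a>> {
    spans
        .iter()
        .filter_map(|span| {
            source
                .get(span.start_byte..span.end_byte)
                .map(|text| Chunk { name: &span.name, text })
        })
        .collect()
}
```

Each chunk borrows from the original source string, so no text is copied until a consumer (for example, an embeddings indexer) decides it needs an owned value.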

Discussions between the Knowledge Graph and Codebase Chat Context sub-teams revealed overlapping requirements for static analysis. tree-sitter is the core dependency identified for parsing across multiple languages.

Building sophisticated static code analysis capabilities in silos leads to several problems:

Duplicated Effort: Teams may build and maintain separate parsing logic, wasting engineering resources.
Inconsistency: Different parsers might produce varying results or support different language features/versions, leading to inconsistent user experiences across features.
Maintenance Overhead: Maintaining multiple parsing systems increases the long-term burden on engineering.
Client-Side Challenges: Providing features that react to local code changes (e.g., in the Web IDE or via Language Servers) requires efficient, potentially client-side parsing. Integrating server-only parsing solutions with local changes is complex and often leads to stale results.

A unified approach is needed to provide a consistent, maintainable, and performant foundation for code analysis across GitLab, capable of running both server-side and client-side.

Proposed Solution: Rust Library

We propose building "One Parser": A shared static code analysis library implemented in Rust.

Core Characteristics:

Project: Name TBD; a working name could be gitlab-code-parser.
Exports: A Rust crate, Wasm bindings, and any other interop layers that consumers require.
Parsing Engine: Leverages the tree-sitter library and its ecosystem of language grammars.
Design: A largely stateless library. It will accept source code content (as a string) and the language identifier as input.
Output: Returns structured information derived from the Abstract Syntax Tree (AST), such as class/function definitions, imports/exports, method calls, resolved aliases, etc. The exact output structure will need to be defined based on the requirements of initial consumers (Knowledge Graph, Embeddings, Language Server).
Versioning: The library will be versioned to allow consuming applications (e.g., Knowledge Graph indexer, Embeddings indexer, Language Server) to adopt updates independently.
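
The characteristics above suggest an entry point that is a pure function of the source text and a language identifier. The following is only a sketch of that shape: every type and function name here (`Language`, `Definition`, `Analysis`, `AnalysisError`, `analyze`) is hypothetical, and the real output structure still has to be defined with the initial consumers. To keep the example free of external crates, the body uses a naive line scan as a placeholder where the real library would parse with tree-sitter and walk the syntax tree.

```rust
/// Languages this sketch knows about (a hypothetical subset).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Language {
    Python,
    Ruby,
}

impl Language {
    /// Map a language identifier string to a known language.
    pub fn from_id(id: &str) -> Option<Self> {
        match id {
            "python" => Some(Language::Python),
            "ruby" => Some(Language::Ruby),
            _ => None,
        }
    }
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DefinitionKind {
    Class,
    Function,
}

#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Definition {
    pub kind: DefinitionKind,
    pub name: String,
    pub line: usize, // 1-based line number
}

/// Hypothetical output shape: definitions and imports found in one file.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct Analysis {
    pub definitions: Vec<Definition>,
    pub imports: Vec<String>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum AnalysisError {
    UnsupportedLanguage(String),
}

/// Stateless entry point: the result depends only on the two arguments.
pub fn analyze(source: &str, language_id: &str) -> Result<Analysis, AnalysisError> {
    let language = Language::from_id(language_id)
        .ok_or_else(|| AnalysisError::UnsupportedLanguage(language_id.to_string()))?;
    let import_keyword = match language {
        Language::Python => "import ",
        Language::Ruby => "require ",
    };

    // PLACEHOLDER: a real implementation would parse with tree-sitter.
    // This keyword scan exists only so the sketch runs without dependencies.
    let mut analysis = Analysis::default();
    for (index, raw_line) in source.lines().enumerate() {
        let line = raw_line.trim_start();
        if let Some(rest) = line.strip_prefix("class ") {
            analysis.definitions.push(Definition {
                kind: DefinitionKind::Class,
                name: identifier(rest),
                line: index + 1,
            });
        } else if let Some(rest) = line.strip_prefix("def ") {
            analysis.definitions.push(Definition {
                kind: DefinitionKind::Function,
                name: identifier(rest),
                line: index + 1,
            });
        } else if let Some(rest) = line.strip_prefix(import_keyword) {
            analysis.imports.push(rest.trim().to_string());
        }
    }
    Ok(analysis)
}

/// Take the leading run of identifier characters from `text`.
fn identifier(text: &str) -> String {
    text.chars()
        .take_while(|c| c.is_alphanumeric() || *c == '_')
        .collect()
}
```

Because `analyze` holds no state between calls, the same function could serve a batch indexer and an editor that re-analyzes a buffer on every change. An unknown language identifier surfaces as an explicit `AnalysisError` rather than an empty result.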

Why Rust?

Rust directly addresses the core requirements of building a performant, reliable, and versatile static analysis library suitable for both server-side and client-side deployment within GitLab.

Some of Rust's benefits include:

Tree-sitter Integration:
- tree-sitter, the chosen parsing framework, is fundamental to this project. Rust offers first-class, well-maintained Tree-sitter bindings directly provided by the tree-sitter organization via the official tree-sitter crate.
- This ensures reliable access to the latest Tree-sitter features, a wide range of language grammars, and stability backed by the core maintainers, reducing dependency risk compared to less active or third-party bindings in other languages.
WebAssembly (Wasm) Compilation out of the Box:
- A critical requirement is the ability to run the parser efficiently client-side (e.g., within the Language Server or Web IDE) to handle local code changes and provide real-time feedback.
- Rust has robust support for compiling to Wasm, crucially including its C dependencies. This means the entire parser, including the core Tree-sitter C library and C-based language grammars, can be compiled into a single Wasm module.
- This allows the exact same Rust codebase to run both natively on the server and directly within JavaScript/TypeScript environments with near-native performance, avoiding the need for complex Inter-Process Communication (IPC) or out-of-process execution for client-side use cases.

Performance, Memory Safety and Reliability:
- Static code analysis demands high performance, especially for large codebases or real-time scenarios.
- Rust provides C/C++ level performance without garbage collection pauses.
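- Rust's ownership model and borrow checker rule out common errors such as null pointer dereferences, use-after-free, and data races in safe code at compile time, without runtime overhead. Buffer overflows are prevented by bounds checks, which carry a small runtime cost where the compiler cannot elide them.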
- This leads to a more reliable and secure library, necessary for a foundational component used across multiple GitLab features.
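
As a small illustration of the safety points above, the sketch below uses only the Rust standard library. The function names are hypothetical, and counting lines is a stand-in for real per-file analysis. The first function shows how `Option` and checked indexing replace null pointers and out-of-bounds reads; the second shares read-only data across threads, which the compiler accepts only because nothing mutates it concurrently.

```rust
use std::thread;

/// Look up the name at `index`, if there is one.
/// `Option` forces the caller to handle the "no result" case,
/// and `get` is bounds-checked, so an invalid index cannot read
/// past the end of the slice.
pub fn definition_name(names: &[String], index: usize) -> Option<&str> {
    names.get(index).map(String::as_str)
}

/// Process each file on its own thread and collect the results in order.
/// Scoped threads may borrow `files` because the scope guarantees that
/// every thread finishes before the borrow ends.
pub fn count_lines_parallel(files: &[String]) -> Vec<usize> {
    thread::scope(|scope| {
        // Spawn all workers first so they run concurrently.
        let handles: Vec<_> = files
            .iter()
            .map(|file| scope.spawn(move || file.lines().count()))
            .collect();
        // Then wait for each worker and gather its result.
        handles
            .into_iter()
            .map(|handle| handle.join().expect("worker thread panicked"))
            .collect()
    })
}
```

If a worker tried to mutate `files` while other threads were reading it, the program would be rejected at compile time instead of failing intermittently in production.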
Cross-Language Integration (via Wasm):
- When compiled to Wasm, tools like wasm-bindgen facilitate creating type-safe bindings between Rust and JavaScript/TypeScript.
- This ensures that data structures passed between the Rust Wasm module and the consuming JavaScript environment (like the Language Server) adhere to defined types, catching integration errors at compile time rather than runtime.

Potential Challenges with Choosing Rust

Choosing Rust involves some trade-offs:

Upskilling & Resourcing: Rust tends to have a steeper learning curve than Go. Developer training will take time, and finding engineers with existing Rust skills (within GitLab or externally) may require more effort.
- Note: @dgruzd has explained: "We actually have several Rust projects at GitLab already and team members with Rust experience. As an example, recently, I've advocated for and assisted with the migration of the GLQL compiler from Haskell to Rust (which uses WASM)."
- Additionally, AI tooling may make it easier for developers to onboard onto the project.
Maintenance: Long-term support depends on maintaining sufficient Rust expertise within the responsible team(s).
Development Pace: Initial development might be slower due to the learning curve and potentially longer compile times compared to more familiar languages.

Why not Golang?

At the heart of this project is tree-sitter. Because it is a core dependency, we must properly vet any runtime that will use both the core tree-sitter package and its ecosystem of language grammars (TypeScript, Python, Ruby, Zig, etc.).

There are three reasons why Golang is not a good fit for this use case:

Lack of interoperability: We cannot easily export Golang tree-sitter related code to WASM. This is because the most widely used Go tree-sitter binding (smacker/go-tree-sitter) relies on cgo to compile the tree-sitter C dependencies. In order for us to export WASM, we would have to use a Go WASM runtime (like wazero) and talk to WASM builds of tree-sitter and of each additional language grammar.
- WASM is a requirement if we wish to run any AST logic directly in the Language Server's Node.js runtime or in the Web IDE.
Lack of support: At the time of this posting, the latest merge into smacker/go-tree-sitter was 8 months ago, and the repository has 400 stars. These bindings cover only a subset of the available tree-sitter languages, plus a few additional community bindings. By contrast, the maintainers of tree-sitter (hosted by the tree-sitter org) provide the Rust crate themselves. Given the low activity and limited language support, betting on the Golang binding is a risk.
Bundle Size: Exporting Golang as WASM would result in a larger WASM binary, since the entire Golang runtime would need to be packaged.

More Details on Golang Research

Dealing with Tree-sitter's C dependency: go-tree-sitter by default uses cgo to call the Tree-sitter C library and grammar code. Go's js/wasm target does not support cgo; the Go compiler cannot compile C code to WASM on its own (stackoverflow.com).

In other words, if you try to build the standard go-tree-sitter code to WASM, you'll hit errors because the C parts won't compile. This is a fundamental limitation: "Compiled C code and WASM bytecode exist in different universes and do not know about each other" without a compatible toolchain (stackoverflow.com).

Note: Someone managed to get tree-sitter working in Go (for WASM) by running wazero and embedding the binary inside: https://github.com/malivvan/tree-sitter. We'd likely have to build the bindings from scratch if we went with this approach.

As of 2025, the Go team has not integrated an official way to compile C code via cgo into WASM output (there was discussion about using Emscripten or LLVM for this, but it's not yet in Go) (github.com)

Edited Apr 07, 2025 by Michael Angelo Rivera