Skip to content

Draft: feat: ruby parser

Example Ruby Parser

This MR outlines how ast-grep can be utilized to extract code structure and relationships from Ruby source code.

Workflow

The analysis process follows these steps:

  1. Initialization:

    • A RubyAnalyzer instance is created, holding the source code, file path, the parsed Abstract Syntax Tree (AST) as an AstGrep object, and preloaded ast-grep rules.
  2. Rule Execution:

    • The analyzer executes the loaded ast-grep rules against the code's AST via run_rules (defined in rules.rs).
    • This generates a raw list of MatchInfo structs. Each MatchInfo represents a location where a rule pattern matched the code, containing details like the matched text, location (Range), the ID of the rule that matched, and any captured meta-variables (EnvNode).
  3. Indexing:

    • The build_fqn_and_node_indices function (in indexing.rs) performs a single traversal of the AST to construct two essential indices:
      • node_index_map: A HashMap<ByteRange, Node> enabling fast (O(1)) lookup of AST nodes by their byte range. This is encapsulated within a NodeIndex struct for ease of use.
      • node_fqn_map: A HashMap<ByteRange, (Node, Arc<Vec<String>>)> mapping the byte range of definition name nodes (like classes, modules, methods) to both the node itself and its precomputed Fully Qualified Name (FQN).
  4. Definition Analysis:

    • find_definitions (in definitions.rs) processes the MatchInfo list.
    • For matches classified as definitions (RubyMatchKind::Definition), it leverages the node_fqn_map and NodeIndex to efficiently fetch the definition's name node and its precomputed FQN.
    • The results are structured into DefinitionMatch structs.
  5. Reference Analysis (Graph-Based):

    • Graph Construction: FlowGraph::build (in flow_graph.rs) builds a data flow graph:
      • It utilizes build_assignment_map (from assignment_tracer.rs) to identify all assignment matches (RubyMatchKind::Assignment).
      • It employs LineMatchIndex::build (in flow_graph.rs) to index matches by line number, facilitating spatial lookups.
      • It relies on the NodeIndex for accessing AST nodes.
      • Graph nodes correspond to MatchInfo instances. Edges signify potential data flow paths (e.g., a reference flows to its preceding assignment; an assignment flows to the match representing its assigned value).
      • Crucially, during construction, it precomputes and caches the FQN of the origin of the value assigned in assignment nodes. This is done using get_assignment_value_origin_fqn (from assignment_tracer.rs) and the NodeIndex.
    • Tracing: find_references (in compute_references.rs) iterates through reference matches (RubyMatchKind::Reference):
      • For each reference, it initiates a trace starting from the reference's corresponding node within the FlowGraph using trace_from_graph.
      • The trace traverses the graph edges (e.g., from a reference node to an assignment node, then from the assignment node to its value node).
      • If the trace encounters an assignment node, it retrieves the cached FQN of the assignment's value origin directly from the FlowGraph's cache (get_cached_assignment_fqn).
      • If the original reference was part of a method call (e.g., variable.method_name), the method name is appended to the resolved origin FQN.
      • Fallbacks: If graph tracing fails to resolve an FQN, alternative strategies are employed: using extract_fqn_from_call_node or extract_ruby_fqn (from fqn.rs) on the reference node itself, or parsing constant paths directly (e.g., Module::Class).
      • The outcome is packaged into a ReferenceMatch struct, associating the original MatchInfo with the resolved origin_fqn.
  6. Result Aggregation:

    • The RubyAnalyzer gathers the DefinitionMatch and ReferenceMatch results.
    • These are converted into final DefinitionInfo and ReferenceInfo structs (using string representations for FQNs).
    • The aggregated results are returned within an AnalysisResult object.
graph TD
    A[Start: Code + Rules] --> B["RubyAnalyzer"];
    B --> C{Run ast-grep Rules};
    C --> D[Raw MatchInfo List];
    D --> E{"Build Indices (NodeIndex, FQN Map)"};
    E --> F[Indexed Data];
    F --> G{"Find Definitions (using FQN Map)"};
    G --> H[DefinitionInfo List];
    F --> I{"Build FlowGraph (using Assignments, LineIndex, NodeIndex)"};
    I --> J[FlowGraph + Cached FQNs];
    J --> K{"Find References (Trace on FlowGraph)"};
    K --> L[ReferenceInfo List];
    M[AnalysisResult] --> N["End"];
    H --> M;
    L --> M;

Key Concepts

  • RubyAnalyzer (analyzer.rs): The central struct that orchestrates the analysis. It holds the source code, AST (AstGrep), rules, and exposes methods like matches (for raw matches) and analyze (for the full analysis).
  • Rules (.yaml files): Define what to find using ast-grep patterns. They are categorized (e.g., definitions.yaml, references.yaml, assignment.yaml) and combined dynamically. Rules determine the RubyMatchKind (e.g., Assignment, Definition, Reference, Other) for each match.
  • MatchInfo (rules.rs): Represents a single, raw match found by ast-grep. Contains the matched text, location (Range), rule ID, captured meta-variables (EnvNode), and its assigned RubyMatchKind.
  • NodeIndex (indexing.rs): A wrapper around a HashMap<ByteRange, Node>, providing efficient O(1) lookup of AST nodes using their byte offsets (ByteRange). Built once during indexing.
  • Fully Qualified Name (FQN):
    • Representation: Defined by the Fqn struct (in assignment_tracer.rs), holding an Arc<Vec<String>> representing the name parts (e.g., ["Module", "Class", "method"]).
    • Extraction (fqn.rs): Contains logic to determine FQNs:
      • extract_ruby_fqn: Determines the FQN by walking up the AST from a given node to identify enclosing module, class, and method scopes.
      • extract_fqn_from_call_node: Specifically handles method call structures (e.g., a.b.c, A::B.c) to resolve the receiver path and method name.
      • extract_enclosing_module_or_class_fqn: Retrieves only the module/class scope FQN for a node.
    • Precomputation (indexing.rs): FQNs for definition nodes are computed during the initial build_fqn_and_node_indices traversal and stored in the node_fqn_map.
    • Assignment Value Origin FQN (assignment_tracer.rs): get_assignment_value_origin_fqn uses the NodeIndex and FQN extraction functions to find the FQN of the source of the value being assigned. For example, in x = MyClass.new, it resolves the FQN to ["MyClass", "new"]. This is vital for tracing references through assignments.
  • FlowGraph (flow_graph.rs):
    • Represents data flow relationships between MatchInfo instances.
    • Nodes correspond to matches; edges indicate flow direction (e.g., reference -> preceding assignment, assignment -> assigned value).
    • Built using assignment matches (via build_assignment_map), the LineMatchIndex, and the NodeIndex.
    • Crucially, it caches the precomputed FQN of the origin of assigned values (get_cached_assignment_fqn) associated with assignment nodes. This caching significantly speeds up reference tracing by avoiding redundant FQN computations.
    • Used by compute_references.rs (specifically trace_from_graph) to follow data flow paths from a reference match back to its potential origin(s).
  • LineMatchIndex (flow_graph.rs): An auxiliary index mapping line numbers to the MatchInfo instances spanning those lines. Used for potentially faster spatial lookups during FlowGraph construction, although direct ByteRange-based lookups via NodeIndex are often sufficient.
  • DefinitionMatch / DefinitionInfo (definitions.rs, analyzer.rs): Represents a successfully identified definition. Links the original MatchInfo with its resolved FQN (obtained efficiently from the precomputed node_fqn_map). DefinitionInfo is the final, serializable version.
  • ReferenceMatch / ReferenceInfo (compute_references.rs, analyzer.rs): Represents a successfully resolved reference. Links the original MatchInfo with the best-effort origin_fqn determined primarily through FlowGraph tracing (leveraging cached assignment value FQNs) and fallback mechanisms. ReferenceInfo is the final, serializable version.
  • Assignment Helpers (assignment_tracer.rs): This module provides helper functions related to assignments, primarily used during graph building and tracing:
    • build_assignment_map: Creates a map where keys are variable names and values are sorted lists of their corresponding assignment MatchInfos.
    • find_assignment_before_pos: Given a variable name and a position, efficiently finds the last assignment MatchInfo for that variable occurring before the position (using binary search on the map).
    • get_assignment_value_origin_fqn: Determines the FQN of the value being assigned (detailed in the FQN section).
    • Defines shared types like Fqn, TraceStep, and TraceReason used in the tracing process.
graph LR
    A["RubyAnalyzer"] --> B["Indexing (build_fqn_and_node_indices)"]
    A --> C["Definitions (find_definitions)"]
    A --> D["References (find_references)"]
    D --> E["FlowGraph (build, trace)"]
    E --> F["Assignment Helpers"]
    E --> G["FQN Helpers"]
    B --> C
    B --> D
    F --> E
    G --> E
    C --> H["DefinitionInfo"]
    D --> I["ReferenceInfo"]
    H --> J["AnalysisResult"]
    I --> J
Edited by Michael Angelo Rivera

Merge request reports

Loading