
Document and expand Advanced Finders architecture

Background

The Advanced Finders concept was originally proposed in this design document MR to create a unified interface for accessing data from multiple backends (PostgreSQL, Elasticsearch/OpenSearch, and potentially ClickHouse). The original MR was closed because the concept has since been partially implemented through GLQL and other initiatives, but there is still a need to formalize this architecture and make it available as a broader engineering tool.

What Advanced Finders Aimed to Accomplish

The original design document proposed:

Core Objectives

  1. Unified Interface: Create a consistent API for accessing data from either PostgreSQL or Advanced Search (Elasticsearch/OpenSearch)
  2. Performance Optimization: Enable filtered searches to leverage Advanced Search when available, improving performance for complex queries
  3. Backend Selection: Intelligent routing between data sources (a minimal sketch follows this list) based on:
    • Advanced Search availability
    • Query complexity
    • Parameter support allowlists
    • Data freshness and indexing lag
  4. Consistent Results: Return paginated collections with metadata rather than ActiveRecord relations to support multi-backend results
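
The routing decision in objective 3 could look roughly like the sketch below. Every name in it (BackendSelector, advanced_search_available?, the Backend constants) is a hypothetical illustration rather than an existing GitLab API:

    module AdvancedFinder
      module Backend
        Postgres = :postgres
        AdvancedSearch = :advanced_search
      end

      class BackendSelector
        def initialize(current_user, params, supported_params:)
          @current_user = current_user
          @params = params
          @supported_params = supported_params
        end

        # Route to Advanced Search only when it is available and every
        # requested parameter is on that backend's allowlist.
        def select
          return Backend::Postgres unless advanced_search_available?
          return Backend::Postgres unless (@params.keys - @supported_params).empty?

          Backend::AdvancedSearch
        end

        private

        # Availability would depend on licensing, indexing status, and
        # freshness/indexing-lag guarantees; stubbed for the sketch.
        def advanced_search_available?
          true
        end
      end
    end

    # Example: both parameters are allowlisted, so Advanced Search is chosen.
    selector = AdvancedFinder::BackendSelector.new(nil, { state: 'opened', search: 'flaky' },
                                                   supported_params: [:state, :search])
    selector.select # => :advanced_search

Defaulting to PostgreSQL whenever a check fails keeps behaviour identical to today's finders for any query the newer backend cannot satisfy.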

Key Features

  • Parameter Support Allowlisting: Enable gradual migration by maintaining, per backend, an allowlist of supported query parameters
  • Transparent Backend Selection: Automatic selection with option for explicit backend specification
  • Unified Pagination: Support for offset-based, keyset, and scroll-based pagination across different backends using opaque page tokens (see the token sketch after this list)
  • Permission Safety: Redaction logic as a final safety net to ensure no unauthorized data is returned
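
As an illustration of the opaque page token idea, one possible encoding (purely hypothetical, not the current implementation) wraps the backend-specific cursor so callers never need to know whether it is an offset, a keyset value, or a scroll ID:

    require 'base64'
    require 'json'

    module AdvancedFinder
      class PageToken
        # Wrap a backend-specific cursor (offset, keyset values, scroll ID)
        # in a single opaque, URL-safe string.
        def self.encode(backend:, cursor:)
          Base64.urlsafe_encode64({ backend: backend, cursor: cursor }.to_json)
        end

        def self.decode(token)
          JSON.parse(Base64.urlsafe_decode64(token), symbolize_names: true)
        end
      end
    end

    token = AdvancedFinder::PageToken.encode(backend: :advanced_search, cursor: { scroll_id: 'abc123' })
    AdvancedFinder::PageToken.decode(token)
    # => { backend: "advanced_search", cursor: { scroll_id: "abc123" } }

Because the token records which backend produced it, a follow-up page request can be routed back to the same data source even if automatic selection would otherwise choose a different one.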

Architecture Changes

  • Replace ActiveRecord relation returns with FinderResult objects (sketched below) containing:
    • Collection of model instances
    • Pagination metadata
    • Backend information (which data source was used)
  • Support for both automatic and explicit backend selection:
    # Automatic selection
    result = AdvancedFinder::Issues.new(current_user, params).execute
    
    # Explicit backend
    result = AdvancedFinder::Issues.new(
      current_user, 
      params.merge(backend: AdvancedFinder::Backend::AdvancedSearch)
    ).execute
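
A FinderResult along the lines described above might look like the sketch below; the field names are assumptions made for illustration, not the shape of any existing class:

    module AdvancedFinder
      class FinderResult
        attr_reader :records, :backend, :next_page_token, :total_count

        def initialize(records:, backend:, next_page_token: nil, total_count: nil)
          @records = records                 # model instances, already permission-redacted
          @backend = backend                 # e.g. :postgres or :advanced_search
          @next_page_token = next_page_token # opaque cursor for the next page, if any
          @total_count = total_count         # may be approximate or absent for some backends
        end

        def has_next_page?
          !next_page_token.nil?
        end
      end
    end

Callers would treat backend as diagnostic information for logging and instrumentation rather than branching business logic on it.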

Current State and Need

As noted in the final comment, Advanced Finders is becoming increasingly important as GitLab expands its use of different data stores (PostgreSQL, Elasticsearch, ClickHouse, Knowledge graph, etc.).

Current implementations include:

  • GLQL (GitLab Query Language) work items API
  • Various search improvements leveraging multiple backends

Missing pieces:

  1. Documentation: No formal documentation on how to leverage Advanced Finders
  2. Selection Criteria: No documented criteria for appropriate backend selection logic
  3. Engineering Guidelines: No guidance for engineers on when and how to use this pattern
  4. Self-managed Considerations: No guidance on how this architecture supports the different data store configurations of self-managed instances

Proposed Next Steps

  1. Document Current Implementation

    • Create developer documentation for existing Advanced Finders patterns
    • Document the GLQL implementation as a reference example
    • Provide guidelines on backend selection criteria
  2. Expand Architecture Guidelines

    • Define when Advanced Finders should be used vs traditional finders
    • Document performance considerations and trade-offs
    • Create patterns for new data store integrations (ClickHouse, Knowledge graph)
  3. Self-managed Strategy

    • Define how Advanced Finders work when certain backends aren't available (see the degradation sketch after this list)
    • Document graceful degradation patterns
    • Consider feature flag strategies for progressive rollout
  4. Engineering Toolset

    • Make Advanced Finders a standard part of the engineering toolkit
    • Provide templates/generators for new finder implementations
    • Create testing patterns for multi-backend scenarios
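
To make the self-managed and rollout points in step 3 concrete, the sketch below shows one possible degradation path: try Advanced Search behind a feature flag, and fall back to PostgreSQL whenever the backend is disabled, unavailable, or failing. The flag name and helper methods are assumptions, not the current implementation:

    module AdvancedFinder
      class Issues
        def initialize(current_user, params)
          @current_user = current_user
          @params = params
        end

        def execute
          return postgres_execute unless advanced_search_enabled?

          begin
            advanced_search_execute
          rescue StandardError
            # A real implementation would rescue specific search-client errors.
            # Falling back keeps the feature working on instances without a
            # healthy (or any) Advanced Search cluster.
            postgres_execute
          end
        end

        private

        def advanced_search_enabled?
          # A feature flag gives a progressive rollout path and a kill switch;
          # licensing and indexing-status checks would also live here.
          feature_flag_enabled? && advanced_search_available?
        end

        # Stubs standing in for real checks and queries.
        def feature_flag_enabled? = true
        def advanced_search_available? = false
        def advanced_search_execute = []
        def postgres_execute = []
      end
    end

The same pattern generalises to other backends: each new data store gets an availability check, a parameter allowlist, and a fallback, so a store that only exists on GitLab.com never blocks self-managed instances.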

Questions for Discussion

  1. Should we formalize Advanced Finders as a standard architectural pattern?
  2. What documentation do we need to make this accessible to all engineering teams?
  3. How do we handle backend selection criteria to avoid performance issues?
  4. What's the strategy for self-managed instances with limited data store availability?
  5. How do we integrate this with emerging data stores like ClickHouse and Knowledge graph?

References