AI Evaluation Documentation Consolidation
Original Research Query
Problem
It can be hard for engineers to know when and how to evaluate AI features. The new AI feature development playbook will provide a high-level overview (gitlab-org/gitlab!193250 (merged)), but it leaves the details to be filled in. Engineers need to know what tools to use, why, when, and how.
Tasks
This is part of gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation&53 (closed)
Previous task: gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#743 (closed)
Current task: gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#674 (closed)
Collect the documentation on evaluation tools and processes, and organize them into a set of guidelines covering which tools to use, when, why, and how.
- Should cover feature evaluation, prompt evaluation, tool evaluation, model evaluation, and latency evaluation
References
Consider these and linked/related resources for inclusion or as reference material:
- All docs within:
- https://gitlab.com/gitlab-org/gitlab/-/tree/master/doc/development/ai_features
- https://gitlab.com/gitlab-org/modelops/applied-ml/code-suggestions/ai-assist/-/tree/main/docs
- https://gitlab.com/gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/evaluation-runner/-/tree/main/docs?ref_type=heads
- https://gitlab.com/gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library/-/tree/main/doc?ref_type=heads
- https://gitlab.com/gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library/-/tree/main/doc/eli5?ref_type=heads
- https://gitlab.com/gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/datasets/-/tree/main/doc?ref_type=heads
- https://gitlab.com/gitlab-org/quality/ai-model-latency-tester/-/tree/main/docs?ref_type=heads
Also consider general GitLab docs within https://gitlab.com/gitlab-org/gitlab/-/tree/master/doc/development
Note
- The Prompt Library repository includes two projects: the original promptlib and the updated ELI5. promptlib is obsolete, and both are being replaced by the CEF. See gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library!1424 (merged). Existing documentation might refer to the Prompt Library, ELI5, or the CEF; new documentation should refer to the CEF instead of the Prompt Library or ELI5.
Research Report
GitLab AI Feature Evaluation: A Comprehensive Guide for Engineers
1. Introduction
Engineers at GitLab face a significant challenge in understanding when, why, and how to effectively evaluate AI-powered features. While high-level playbooks exist, a clear, consolidated, and actionable set of guidelines detailing specific tools, methodologies, and best practices has been lacking. This report synthesizes extensive research into GitLab's ongoing efforts to address this gap. It aims to provide engineers with a comprehensive understanding of the evolving AI evaluation landscape at GitLab, the tools available, and the processes to follow for various evaluation types, including feature, prompt, tool, model, and latency evaluation.
The core of GitLab's strategy involves a two-pronged approach:
- Consolidation of AI Evaluation Tooling: Migrating from disparate tools like the original Prompt Library and ELI5 towards a unified Centralized Evaluation Framework (CEF).
- Consolidation of AI Developer Documentation: Centralizing all AI-related documentation to provide a Single Source of Truth (SSoT) for developers.
This report will delve into these initiatives, outlining the current state, key tools, and the practical steps engineers can take to evaluate AI features effectively.
2. The Evolving AI Evaluation Landscape at GitLab
2.1. The Need for Clear Guidelines
The problem statement is clear: engineers find it difficult to navigate the AI evaluation process. As highlighted in [Document developer workflow to enable efficient... (gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#674 - closed)], the existing AI feature development playbook, while being rewritten ([Rewrite the AI feature development playbook (gitlab-org/gitlab!193250 - merged)]), provides a high-level overview but leaves the detailed "how-to" to be filled in.
Problem to solve (from [Document developer workflow to enable efficient... (gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#674 - closed)]): "It can be hard for engineers to know when and how to evaluate AI features. The AI feature development playbook has some information, but it's incomplete and does not necessarily provide the relevant information in an easy-to-follow, practical style. At minimum, engineers need to know what tools to use, but also when and how it's appropriate to use which tools. This requires at least some understanding of how evaluation fits within the overall software development workflow."
2.2. Strategic Shift: Centralized Evaluation Framework (CEF)
GitLab is strategically moving towards a Centralized Evaluation Framework (CEF). This initiative is tracked under Epic [🎯 Consolidate AI Evaluation Tooling (gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation&37 - closed)].
Goal (from [🎯 Consolidate AI Evaluation Tooling (gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation&37 - closed)]): "Create a unified, powerful Centralized Evaluation Framework (CEF) by consolidating ELI5 and Prompt Library."
The objectives of this consolidation include:
- Creating a unified evaluation solution.
- Enhancing flexibility and user-friendliness.
- Improving documentation and guidance.
- Streamlining processes for AI feature evaluation.
A critical step in this consolidation is Merge Request [chore: consolidate CEF project structure (gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library!1424 - merged)], which renames ELI5 to CEF and removes the old promptlib code.
MR Description (from [chore: consolidate CEF project structure (gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library!1424 - merged)]): "chore: consolidate CEF project structure
- Remove promptlib code and dependencies.
- Move ELI5 to root and rename to CEF.
- Consolidate dependencies and toolings."
Out of scope (from [chore: consolidate CEF project structure (gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library!1424 - merged)]): "The following work items will be in a separate MR.
- Fix documentation"
This means that while the tooling is being unified, the comprehensive documentation on how to use CEF is a subsequent effort, primarily tracked under Epic [[Scope adjustment] Phase 4: documentation and c... (gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation&53 - closed)] ([Scope adjustment] Phase 4: documentation and clean up).
The architectural blueprint for this consolidation is detailed in Issue [Draft blueprint to consolidate evaluation tooli... (gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#467 - closed)] (Draft blueprint to consolidate evaluation tooling (CEF, ELI5, Langsmith)) and the associated handbook MR ([Adds draft for consolidating evaluation tooling (gitlab-com/content-sites/handbook!8216 - merged)]). This blueprint clarifies the roles of CEF, LangSmith, and ELI5 (now part of CEF):
- CEF: For large-scale, production-representative feature evaluation, used towards the end of the development cycle.
- LangSmith: For rapid prompt experimentation, dataset initiation, and capturing failure examples, used from Day 1 of development.
- ELI5 (as part of CEF): Automation layer for LangSmith, streamlining dataset creation, evaluation scripts, and CI/CD pipelines for mature features.
2.3. Documentation Consolidation
Parallel to tooling consolidation, there's a major effort to centralize all AI-related developer documentation, tracked under Issue [Consolidate GitLab AI Developer/Contributor Doc... (gitlab-org/gitlab#514510 - closed)] (Consolidate GitLab AI Developer/Contributor Documentation).
Background (from [Consolidate GitLab AI Developer/Contributor Doc... (gitlab-org/gitlab#514510 - closed)]): "We currently have AI-related documentation spread across multiple locations, making it difficult for developers and users to find relevant information. This initiative aims to consolidate all AI documentation into a single, organized location."
Proposed Solution (from [Consolidate GitLab AI Developer/Contributor Doc... (gitlab-org/gitlab#514510 - closed)]): The plan is to consolidate all AI documentation under https://docs.gitlab.com/ee/development/ai_features/ with clear organization by topic.
This ensures that the new evaluation guidelines will be part of a unified, easily discoverable documentation set.
3. Key Initiatives and Components for Evaluation Guidelines
3.1. AI Feature Development Playbook Rewrite
The AI Feature Development Playbook is being rewritten to serve as the SSoT for the high-level AI feature development workflow. This is tracked by Issue [Update the AI feature development playbook (gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#743 - closed)] (Update the AI feature development playbook) and implemented in Merge Request [Rewrite the AI feature development playbook (gitlab-org/gitlab!193250 - merged)] (Rewrite the AI feature development playbook).
Purpose of Playbook Rewrite (from [Rewrite the AI feature development playbook (gitlab-org/gitlab!193250 - merged)]): "- improve our guidelines for engineers for evaluating AI features
- create an SSoT for the AI feature development workflow...
- provide an overview, with links to more detailed information"
The new playbook outlines a 5-phase iterative structure: Plan, Develop, Test & Evaluate, Deploy, Monitor. The "Test & Evaluate" phase explicitly mentions various evaluation types and links to key resources, though it acknowledges the need for more detailed follow-up documentation.
Relevant Content from doc/development/ai_features/ai_feature_development_playbook.md (via [Rewrite the AI feature development playbook (gitlab-org/gitlab!193250 - merged)]):
### 3. 🧪 **Test & Evaluate**
The test and evaluate phase is where we assess the quality, performance, and safety of AI features. This phase is closely aligned with the develop phase, as testing and evaluation are often iterative processes that inform further development. It supplements the [develop and test phase of the build track of the product development flow](https://handbook.gitlab.com/handbook/product-development/product-development-flow/#build-phase-2-develop--test).
This phase includes:
- **Model evaluation:** Assessing the performance of the underlying AI model against predefined metrics and benchmarks.
- **Feature evaluation:** Testing the end-to-end AI feature from a user perspective, ensuring it meets functional and non-functional requirements.
- **Prompt evaluation:** Systematically testing prompts to ensure they elicit desired responses and avoid unintended behaviors.
- **Latency evaluation:** Measuring the response time of AI features to ensure they meet performance targets.
- **Safety and ethical evaluation:** Identifying and mitigating potential biases, fairness issues, and other ethical concerns.
**Resources:**
- [AI evaluation and testing (internal)](https://internal.gitlab.com/handbook/product/ai-strategy/ai-integration-effort/ai_testing_and_evaluation/)
- [Evaluation Runner](https://gitlab.com/gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/evaluation-runner)
- [Prompt Library (CEF)](https://gitlab.com/gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library)
- [AI Model Latency Tester](https://gitlab.com/gitlab-org/quality/ai-model-latency-tester)
- [Datasets](https://gitlab.com/gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/datasets)
Relevant Comments on Playbook Rewrite ([Rewrite the AI feature development playbook (gitlab-org/gitlab!193250 - merged)]):
- @achueshev (2025-06-03 10:04:47 UTC):
  "This is a great start! I like the structure and the clear separation of phases. ...
  - Evaluation details: The "Test & Evaluate" section is a good high-level overview. For engineers, it would be extremely helpful to have more concrete examples or links to specific tools/processes for each type of evaluation (model, feature, prompt, latency, safety). For instance, what does "conduct model evaluation" actually entail? Are there specific metrics, frameworks, or internal tools we recommend?"
- @mlapierre (2025-06-03 14:00:00 UTC) (Author of MR):
  "Thanks for the feedback!
  - Evaluation details: Yes, this is the plan. The next issue in the epic is to update the evaluation documentation. This MR is meant to be the high-level overview, and then we'll link to the detailed guides."
3.2. Detailed Evaluation Workflow Documentation
The task of creating these detailed "how-to" guides is captured in Issue [Document developer workflow to enable efficient... (gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#674 - closed)] (Document developer workflow to enable efficient evaluations).
Proposal (from [Document developer workflow to enable efficient... (gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#674 - closed)]): "- Collect the documentation on evaluation tools and processes, and organize them into a set of guidelines covering which tools to use, and when and how it's appropriate to use which tools. This will include:
- A high-level overview of the evaluation process within the AI feature development workflow.
- Detailed guidance on specific evaluation types (feature, prompt, tool, model, latency).
- Practical examples and best practices."
Relevant Comments on Workflow Documentation ([Document developer workflow to enable efficient... (gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#674 - closed)]):
- @mlapierre (2025-06-05):
  "This issue is about documenting the how of evaluation, and it's a dependency for #743 (which is about the what and when)."
- @achueshev (2025-06-05):
  "Yes, that's the idea. This issue is about documenting the developer workflow to enable efficient evaluations. It should cover how to use the CEF, when to use it, and why."
- @achueshev (2025-06-05):
  "I think we should create a new `ai_evaluation` directory under `doc/development/ai_features` and put all the new docs there. This will make it easier to find all the evaluation-related docs in one place."
4. Guidelines for Specific Evaluation Types
The following sections outline the tools, processes, and considerations for each type of AI evaluation, based on the consolidated research. All new documentation and practices should align with the Centralized Evaluation Framework (CEF).
4.1. Feature Evaluation
- What: End-to-end assessment of an AI feature's performance, user experience, and impact in a real-world or production-representative context.
- Why: To validate the overall value proposition, identify regressions, ensure user satisfaction, and measure business impact.
- When: Throughout the development lifecycle: during development (e.g., A/B testing, dogfooding), pre-release, and for continuous post-deployment monitoring.
- How (Tools & Process):
- CEF: The primary framework for large-scale, production-representative evaluations.
- Evaluation Runner: For automating scheduled evaluation runs (daily, etc.) using CEF.
- A/B Testing: Comparing the AI feature against a control or alternative versions; a minimal comparison sketch follows this list.
- User Feedback: Collecting qualitative (interviews, surveys) and quantitative (analytics) data.
- Business Metrics: Tracking Key Performance Indicators (KPIs) relevant to the feature.
- The AI Feature Development Playbook provides the high-level workflow.
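For the A/B testing step mentioned above, the sketch below shows one illustrative way to compare per-group success rates using only the Python standard library. The group sizes, success counts, and the choice of a two-proportion z-test are hypothetical placeholders, not a prescribed GitLab process.

```python
# Minimal sketch: compare an AI feature's success rate between a control and an
# experiment group with a two-proportion z-test (standard library only).
# All counts below are hypothetical placeholders.
from math import erf, sqrt


def two_proportion_z_test(successes_a: int, total_a: int,
                          successes_b: int, total_b: int) -> tuple[float, float]:
    """Return (z statistic, two-sided p-value) for the difference in proportions."""
    p_a = successes_a / total_a
    p_b = successes_b / total_b
    pooled = (successes_a + successes_b) / (total_a + total_b)
    se = sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value


if __name__ == "__main__":
    # Control group: current feature; experiment group: candidate change.
    z, p = two_proportion_z_test(successes_a=412, total_a=500,
                                 successes_b=441, total_b=500)
    print(f"z={z:.2f}, p={p:.3f}")
```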
4.2. Prompt Evaluation
Prompt evaluation is crucial for features leveraging Large Language Models (LLMs). GitLab has significantly enhanced capabilities in this area.
- What: Assessing the quality, effectiveness, and safety of prompts used to interact with LLMs.
- Why: To optimize LLM outputs, ensure consistency, reduce hallucinations, improve relevance, and align with desired behavior.
- When:
- During initial prompt engineering and iterative development.
- Whenever a prompt is modified.
- When the underlying LLM is updated.
- As part of regular regression testing in CI/CD.
- How (Tools & Process):
- CEF (ELI5 component) & AI Gateway (AIGW): ELI5 is integrated with AIGW, allowing prompt evaluations to run directly from AIGW merge requests via CI jobs. This is a core part of Epic [[Scope adjustment] AIGW setup to evaluate prompts (gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation&43 - closed)] ([Scope adjustment] AIGW setup to evaluate prompts) and Epic [Prompt Evaluation Orchestration (gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation&49 - closed)] (Prompt Evaluation Orchestration).
- Evaluators within CEF:
- ExactMatchEvaluator: For direct comparison of actual vs. expected outputs. (See [Implement a generic ExactMatchEvaluator to eval... (gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#664)])

```python
# Conceptual snippet from gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#664.
# BaseEvaluator and EvaluationResult are provided by the CEF evaluation framework.
from typing import TypedDict


class EvaluationInput(TypedDict):
    expected_answer: str
    actual_answer: str


class ExactEvaluator(BaseEvaluator[EvaluationInput]):
    def _run(self, inputs: EvaluationInput) -> EvaluationResult:
        if inputs["expected_answer"] == inputs["actual_answer"]:
            return EvaluationResult(key="exact_match", score=1.0)
        else:
            return EvaluationResult(key="exact_match", score=0.0)
```
- LLMJudgeEvaluator: Uses an LLM to assess the correctness and quality of prompt outputs for more nuanced evaluations; a fuller, hypothetical sketch follows this list. (See [Implement a generic LLMJudgeEvaluator to asses ... (gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#665 - closed)])

```python
# Conceptual snippet from gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#665.
PROMPT_SYSTEM = """
You are an AI assistant tasked with evaluating the quality of a response.
Compare the expected output with the actual output and provide a score based on relevance, coherence, and accuracy.
"""
```
- Dataset Generation: Initial datasets can be auto-generated using LLMs ([Automatically generate initial datasets for pro... (gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#708 - closed)]), then refined.
- LangSmith: Used for logging, tracking evaluation experiments, and managing datasets.
- Documentation: Enhanced documentation for prompt evaluation processes is available, stemming from Issue [Enhance documentation about the process of prom... (gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#721 - closed)] (Enhance documentation about the process of prompt evaluation), which was resolved by Merge Request [chore(docs): enhance documentation for prompt e... (gitlab-org/modelops/applied-ml/code-suggestions/ai-assist!2456 - merged)].
- Guidelines for Efficient Prompts: To be developed under Issue [Draft: Re-visit guidelines about creating effic... (gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#676 - closed)] (Draft: Re-visit guidelines about creating efficient prompts).
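As referenced in the LLMJudgeEvaluator item above, the sketch below illustrates the general shape of an LLM-as-judge evaluation step. The `judge` callable, the score parsing, and the prompt wording (adapted from the conceptual snippet) are hypothetical; the actual evaluator interfaces are defined in the CEF (prompt-library) repository.

```python
# Minimal, hypothetical sketch of an LLM-as-judge evaluation step.
# `judge` stands in for any LLM call that returns a short numeric rating;
# it is not a real CEF or LangSmith API.
from typing import Callable

PROMPT_SYSTEM = """
You are an AI assistant tasked with evaluating the quality of a response.
Compare the expected output with the actual output and provide a score
from 0 to 1 based on relevance, coherence, and accuracy. Reply with the
score only.
"""


def llm_judge_score(judge: Callable[[str, str], str],
                    expected_answer: str,
                    actual_answer: str) -> float:
    """Ask a judge model to rate the actual answer against the expected one."""
    user_prompt = (
        f"Expected output:\n{expected_answer}\n\n"
        f"Actual output:\n{actual_answer}"
    )
    raw = judge(PROMPT_SYSTEM, user_prompt)
    try:
        score = float(raw.strip())
    except ValueError:
        score = 0.0  # Treat unparseable judge replies as a failed evaluation.
    return max(0.0, min(1.0, score))


if __name__ == "__main__":
    # Dummy judge used only to make the sketch runnable end to end.
    fake_judge = lambda system, user: "0.8"
    print(llm_judge_score(fake_judge, "4", "four"))
```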
4.3. Tool Evaluation
- What: Assessing the performance, suitability, and reliability of specific AI tools or components (e.g., a particular LLM, vector database, ReAct agent capabilities).
- Why: To make informed decisions during architectural design, tool selection, and upgrades.
- When: During technology selection, feature development (especially for agentic systems), and when considering tool updates.
- How (Tools & Process):
- CEF (ELI5 component) & Prompt Registry: Evaluate prompts against specific tools, especially for ReAct tool selection accuracy ([Connect ELI5 Workspace to Prompt Registry (gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#622 - closed)]).
- LangSmith Evaluation Framework: Can be integrated with pytest/vitest for unit-testing prompts and tool interactions ([Integrate LangSmith Evaluation Framework with A... (gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#647 - closed)]); see the pytest sketch after this list.
- Benchmarking: Comparing different tools against defined criteria (accuracy, speed, cost).
- Documenting Existing Evaluators: An ongoing task ([Re-visit documentation about existing evaluatio... (gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#671 - closed)]) will help identify available tools within CEF.
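To illustrate the pytest-based unit testing mentioned above, the sketch below checks that a tool-selection step picks the expected tool for a handful of inputs. `select_tool` is a hypothetical stand-in for the agent or prompt under test; wiring these tests into LangSmith's pytest integration (so runs are logged as experiments) is possible but not shown here.

```python
# Minimal pytest sketch for tool-selection accuracy. `select_tool` is a
# hypothetical stand-in for the ReAct agent or prompt being evaluated.
# With LangSmith's pytest integration configured, tests like these could
# additionally be marked (e.g. @pytest.mark.langsmith) to log results as
# experiments; that setup is omitted here.
import pytest


def select_tool(question: str) -> str:
    """Placeholder: a real test would call the agent/prompt under test."""
    if "merge request" in question:
        return "gitlab_mr_reader"
    return "documentation_search"


@pytest.mark.parametrize(
    ("question", "expected_tool"),
    [
        ("Summarize merge request !1424", "gitlab_mr_reader"),
        ("How do I configure CI caching?", "documentation_search"),
    ],
)
def test_tool_selection(question: str, expected_tool: str) -> None:
    assert select_tool(question) == expected_tool
```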
4.4. Model Evaluation
- What: Assessing the performance of underlying AI models (e.g., LLMs, fine-tuned models) on specific tasks or benchmarks.
- Why: To understand model capabilities, identify biases, track performance improvements, and ensure model safety and reliability.
- When: During model selection, fine-tuning, before deployment, and periodically to monitor for drift.
- How (Tools & Process):
- CEF: Use CEF with appropriate datasets and evaluators (e.g., LLMJudgeEvaluator) to assess model outputs.
- Datasets: Crucial for comprehensive model evaluation. See the Dataset Management section below.
- Metrics: Define and track relevant metrics (e.g., accuracy, F1-score, BLEU, ROUGE, perplexity, fairness metrics); a small metrics sketch follows this list.
- LangSmith: For experiment tracking and comparing performance across different models or versions.
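As noted in the Metrics item above, simple reference-based metrics can be computed without any framework. The sketch below shows exact match and a token-level F1 over a couple of hypothetical prediction/reference pairs; production evaluations would normally rely on CEF evaluators and LangSmith datasets rather than ad-hoc scripts.

```python
# Minimal sketch: exact match and token-level F1 for reference-based model
# evaluation. The prediction/reference pairs are hypothetical examples.
from collections import Counter


def exact_match(prediction: str, reference: str) -> float:
    return float(prediction.strip() == reference.strip())


def token_f1(prediction: str, reference: str) -> float:
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    pairs = [
        ("the pipeline failed due to a timeout", "the pipeline failed because of a timeout"),
        ("use git rebase", "use git rebase"),
    ]
    em = sum(exact_match(p, r) for p, r in pairs) / len(pairs)
    f1 = sum(token_f1(p, r) for p, r in pairs) / len(pairs)
    print(f"exact match: {em:.2f}, token F1: {f1:.2f}")
```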
4.5. Latency Evaluation
- What: Measuring the response time of AI features and their underlying components.
- Why: To ensure a responsive user experience, meet performance Service Level Agreements (SLAs), and optimize resource usage.
- When: Throughout the development lifecycle, especially during integration testing, performance testing, and post-deployment monitoring.
- How (Tools & Process):
- ai-model-latency-tester: GitLab has a dedicated tool for this, documented at https://gitlab.com/gitlab-org/quality/ai-model-latency-tester/-/tree/main/docs.
- CEF: May include components for latency measurement (e.g., `cef/codesuggestions/summarize_latency.py`, found in [chore: consolidate CEF project structure (gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library!1424 - merged)]).
- Benchmarking: Measuring response times under various load conditions; a minimal measurement sketch follows this list.
- Profiling: Identifying bottlenecks in the AI inference pipeline.
- Integration into CI/CD for continuous performance monitoring.
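For the benchmarking step referenced above, a minimal latency measurement can be sketched with `requests` and the standard library, as shown below. The endpoint URL, payload, and sample count are hypothetical placeholders; the ai-model-latency-tester project remains the supported tool for systematic latency testing.

```python
# Minimal sketch: measure end-to-end response latency of a model endpoint and
# report median and p95. The URL and payload are hypothetical placeholders.
import statistics
import time

import requests

ENDPOINT = "https://example.invalid/v1/completions"  # placeholder, not a real GitLab endpoint
PAYLOAD = {"prompt": "def hello_world():", "max_tokens": 16}


def measure_latencies(samples: int = 20) -> list[float]:
    latencies = []
    for _ in range(samples):
        start = time.perf_counter()
        requests.post(ENDPOINT, json=PAYLOAD, timeout=30)
        latencies.append(time.perf_counter() - start)
    return latencies


if __name__ == "__main__":
    latencies = measure_latencies()
    p95 = statistics.quantiles(latencies, n=20)[-1]  # 95th percentile cut point
    print(f"median: {statistics.median(latencies):.3f}s, p95: {p95:.3f}s")
```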
4.6. Dataset Management for Evaluation
High-quality datasets are fundamental to all AI evaluation efforts.
- What: Creating, managing, versioning, and curating datasets used for training and evaluating AI models and features.
- Why: To ensure evaluations are robust, reliable, representative of real-world scenarios, and can detect regressions or biases.
- When: Continuously, as features evolve and new data becomes available.
- How (Tools & Process):
- LangSmith: The emerging Single Source of Truth (SSoT) for evaluation datasets and results; a dataset-creation sketch follows this list.
- CEF: Includes capabilities for dataset generation and management.
- Migration from Legacy Systems: Dataset creation pipelines are being migrated from the old Prompt Library (which used BigQuery/Apache Beam) to ELI5/CEF, leveraging LangSmith. This is tracked in Issue [List PL dataset creation pipelines not covered ... (gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#661 - closed)] (List PL dataset creation pipelines not covered in ELI5) and Epic [[Scope adjustment] Phase 2.5: Move PL dataset c... (gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation&52 - closed)] ([Scope adjustment] Phase 2.5: Move PL dataset creation pipeline to ELI5).
Problem Statement (from [List PL dataset creation pipelines not covered ... (gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#661 - closed)]): "The Prompt Library contains logic for creating datasets to run evaluations. This logic uses BigQuery and Apache Beam. Since we rely on LangSmith and given our evaluation consolidation efforts, this dataset logic is no longer maintained well and needs to be moved to ELI5."
- Guidelines for Building Datasets: To be developed under Issue [Re-visit documentation about existing dataset c... (gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#672 - closed)] (Draft: Re-visit guidelines about building evaluation datasets).
- Existing Datasets: A list of available datasets can be found at https://datasets-gitlab-org-modelops-ai-model-validation-b35d3d2afe403e.gitlab.io/#coverage (referenced in [Eval coverage (and gaps) for Duo Workflow / Age... (gitlab-org/gitlab#547712)]).
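As referenced in the LangSmith item above, the sketch below shows one way to seed a small evaluation dataset with the LangSmith Python SDK. It assumes the `langsmith` package is installed and a LangSmith API key is configured; the dataset name and examples are hypothetical placeholders.

```python
# Minimal sketch: seed a LangSmith dataset for evaluation. Assumes the
# `langsmith` package and a LANGSMITH_API_KEY environment variable are set up.
# The dataset name and examples are hypothetical placeholders.
from langsmith import Client

client = Client()

dataset = client.create_dataset(
    dataset_name="duo-chat-sample-eval",  # hypothetical name
    description="Small seed dataset for prompt/feature evaluation.",
)

client.create_examples(
    inputs=[
        {"question": "How do I revert a commit in GitLab?"},
        {"question": "Summarize this merge request."},
    ],
    outputs=[
        {"expected_answer": "Use the Revert option on the commit page or `git revert`."},
        {"expected_answer": "A concise summary of the MR's changes."},
    ],
    dataset_id=dataset.id,
)
```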
5. Current Status, Gaps, and Future Work
GitLab's AI evaluation framework is actively evolving.
- Progress:
- The foundational consolidation of CEF is well underway ([chore: consolidate CEF project structure (gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library!1424 - merged)]).
- Phase 1 of CEF consolidation (ELI5 moved to Prompt Library repo) is complete.
- Phase 2 (Core Integration) and 2.5 (Dataset Migration) are in progress, migrating evaluators and dataset pipelines.
- Documentation for prompt evaluation has been enhanced ([Enhance documentation about the process of prom... (gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#721 - closed)]).
- The AI Feature Development Playbook rewrite has been merged ([Rewrite the AI feature development playbook (gitlab-org/gitlab!193250 - merged)]).
- Identified Gaps (examples from [Eval coverage (and gaps) for Duo Workflow / Age... (gitlab-org/gitlab#547712)] for Duo Workflow / Agentic Duo Chat):
- Evaluation of disambiguation steps (human-AI interaction).
- Evaluation for non-Python languages.
These specific gaps highlight areas where new evaluation methodologies or datasets might be needed.
- Future Work (Phase 4: Documentation and Clean Up - [[Scope adjustment] Phase 4: documentation and c... (gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation&53 - closed)]):
- Finalizing comprehensive documentation and guidelines (the focus of this report).
- Documenting existing evaluators ([Re-visit documentation about existing evaluatio... (gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#671 - closed)]).
- Revisiting guidelines for building evaluation datasets ([Re-visit documentation about existing dataset c... (gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#672 - closed)]).
- Revisiting guidelines for creating efficient prompts ([Draft: Re-visit guidelines about creating effic... (gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#676 - closed)]).
- Archiving old `prompt-library` code.
- Estimating the completion date for the overall evaluation consolidation ([Estimate the completion date for the evaluation... (gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#728 - closed)]).
6. Conclusion and Recommendations for Engineers
GitLab is committed to providing a robust and well-documented AI evaluation ecosystem. The transition to the Centralized Evaluation Framework (CEF), coupled with comprehensive documentation efforts, will empower engineers to evaluate AI features more effectively and consistently.
Key Recommendations for Engineers:
- Embrace the Centralized Evaluation Framework (CEF): Familiarize yourself with CEF as it becomes the standard for AI evaluation at GitLab. Stay updated on its development and documentation.
- Leverage LangSmith: Utilize LangSmith for dataset management, experiment tracking, and analyzing evaluation results.
- Follow the AI Feature Development Playbook: Use the updated playbook as the primary guide for the overall AI feature development lifecycle, including high-level evaluation strategies.
- Consult Detailed Evaluation Guidelines: As they become available (under `doc/development/ai_features/ai_evaluation`), refer to these detailed guides for specific instructions on "when, why, and how" to conduct different types of evaluations using CEF.
- Contribute to Datasets: Actively contribute to and refine evaluation datasets in LangSmith to improve the quality and coverage of evaluations.
- Stay Informed: Monitor the progress of relevant epics and issues, particularly those under Epic [[Scope adjustment] Phase 4: documentation and c... (gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation&53 - closed)] (Phase 4: documentation and clean up) and Issue [Consolidate GitLab AI Developer/Contributor Doc... (gitlab-org/gitlab#514510 - closed)] (Consolidate GitLab AI Developer/Contributor Documentation).
- Provide Feedback: Engage with the teams developing these frameworks and documentation to share your experiences and help refine the guidelines.
By adopting these practices and leveraging the evolving tools and documentation, GitLab engineers can significantly enhance the quality, reliability, and performance of AI-powered features.
This issue was automatically created by the GitLab Research Agent on 2025-06-06T04:19:37.492Z