AI Model Validation FY-25 Q1 Report
💥 Overview
groupai model validation started off the year with two team members returning to the team post-borrow: @AndrasHerczeg and @tle_gitlab. We also welcomed our awesome Product Manager, @susie.bee, to GitLab and to the team, and welcomed @pcalder to the Data Science Section.
In the first quarter, the primary focus was to support groupduo chat for GA and to continue evaluating new models while creating a robust experimentation framework to iterate on and build quality-driven Gen AI features.
🎺 Highlights
🤖 Chat:
- We built the chat dashboard and provided a daily tracking mechanism for all of its data.
- We built custom datasets for Code Generation, Explanation, Documentation, and Issue-Epic.
- Furthermore, we built a pipeline and introduced the Collective Judge metric as an enhancement of the LLM Judge metrics (see the sketch after this list).
- While working with chat, we also introduced a new experimentation framework that uses a subset of data for daily runs.
- It was awesome to see the collaboration between groupai model validation and groupduo chat as we iterated through the experiments.
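For illustration, here is a minimal sketch of the Collective Judge idea, assuming it aggregates the scores of several independent LLM judges into a single score; the function names, the 1-5 scale, and the stub judges are hypothetical placeholders, not the actual Prompt Library implementation.

```python
from statistics import mean, median
from typing import Callable, Iterable

# A "judge" is any callable that scores an answer for a question, e.g. on a 1-5 scale.
Judge = Callable[[str, str], float]

def collective_judge(question: str, answer: str, judges: Iterable[Judge],
                     aggregate: str = "mean") -> float:
    """Score an answer with several independent LLM judges and aggregate.

    Aggregating over multiple judges (different prompts or models) is meant to
    reduce the bias and variance of a single LLM-as-judge score.
    """
    scores = [judge(question, answer) for judge in judges]
    return median(scores) if aggregate == "median" else mean(scores)

if __name__ == "__main__":
    # Stub judges standing in for real model-backed graders.
    lenient = lambda q, a: 4.0
    strict = lambda q, a: 3.0
    middling = lambda q, a: 3.5
    print(collective_judge("How do I create an issue?", "Use the + menu.",
                           [lenient, strict, middling]))  # -> 3.5
```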
🔢 Foundational Models:
- We added Claude 3, Gemini, and Mixtral to our repertoire. Additionally, we began exploring how we could incorporate evaluations of our competitors in the space, an effort that will extend into the next quarter, along with the integration of more models.
Knowledge Sharing and Thought Leadership:
- The team gathered for a fruitful poster session during the summit. AI Model Validation Poster.pdf
- We organized three sessions on how to utilize CEF with Code Create, Duo Chat, and Custom Models: Knowledge Sharing Session for experimentation w... (gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#208 - closed) and Knowledge sharing of CEF with Custom Models (gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#278 - closed).
- @susie.bee published our first blog on our evaluation process at GitLab using the custom Prompt Library.
- @pcalder delivered a thought leadership talk on AI in the education space in New Zealand.
- @mray2020 held several talks in the research space, discussing strategies for surpassing industry benchmarks, and provided a soft pitch for our NeurIPS submission; next quarter we will be submitting a few research papers on our novel approach to language models as a step as we move to behavioral models.
- @tmccaslin presented on AI-powered GitLab at the GitLab Summit as well as at Google Next. Slides are attached below.
- Our AI Continuity Plan is a pillar of our AI Transparency Center, which was announced during our Q1 earnings call, providing industry thought leadership and customer confidence in our model validation process.
👀 Lowlights
- One of the lowlights was the scalability of the Prompt Library as we add new features, pipelines, and metrics, something we will address further in Q2 and Q3.
- We will be moving to GitLab CI to trigger the pipeline, as team members outside of Model Validation do not have a clear understanding of the runs. Migrate Centralised Evaluation daily runs from ... (gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#203 - closed)
- We would also like to make evaluation more developer friendly, as the process for experimentation is still complex, something we plan to address in Q2, Q3, and Q4.
🔍 Opportunity and Focus for Q2
In addition to addressing the lowlights, the focus for Q2 will include:
- Evaluation of Vulnerability Explanation.
- Continuing support for Chat with a refined dataset for slash commands and other use-cases.
- Solution validation of Root Cause Analysis.
- Continued support for Code Suggestions by incorporating performance metrics (latency) in addition to quality metrics, plus a separate dashboard for the Code Suggestions feature (see the latency sketch after this list). Add Latency Metric to CEF (gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#280 - closed)
- We will also continue supporting the evaluation of language models as they are released this quarter, including different variants of Gemma, Gemini, and IBM Granite.
- We also plan to explore more metrics and tuning for evaluation purposes as part of Category:AI Research and to complete the submission of research papers for NeurIPS https://gitlab.com/gitlab-com/marketing/corporate_marketing/corporate-marketing/-/issues/8807.
- We will continue working on documentation and blueprints, knowledge sharing, and thought leadership to align with our transparency efforts.
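As a rough illustration of the latency work referenced above, the sketch below shows one way to time a completion request alongside a quality score; `evaluate_with_latency`, the stub model, and the stub scorer are hypothetical placeholders, not the actual CEF or Code Suggestions API integration.

```python
import time
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalResult:
    completion: str
    latency_seconds: float
    quality_score: float

def evaluate_with_latency(prompt: str,
                          generate: Callable[[str], str],
                          score: Callable[[str, str], float]) -> EvalResult:
    """Time a single completion request and score its quality.

    `generate` stands in for the model endpoint and `score` for a quality
    metric such as an LLM-judge correctness score.
    """
    start = time.perf_counter()
    completion = generate(prompt)
    latency = time.perf_counter() - start
    return EvalResult(completion, latency, score(prompt, completion))

if __name__ == "__main__":
    fake_model = lambda p: "def add(a, b):\n    return a + b"
    fake_score = lambda p, c: 4.2
    result = evaluate_with_latency("# write an add function", fake_model, fake_score)
    print(f"{result.latency_seconds:.3f}s, quality={result.quality_score}")
```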
Q2-OKR
- Team and Engineering OKR: https://gitlab.com/gitlab-com/gitlab-OKRs/-/work_items/7215
- Product OKR: https://gitlab.com/gitlab-com/gitlab-OKRs/-/work_items/7209 and https://gitlab.com/gitlab-com/gitlab-OKRs/-/work_items/7206 and https://gitlab.com/gitlab-com/gitlab-OKRs/-/work_items/7216
Success and Insights
- We have published the chat dashboard: https://lookerstudio.google.com/u/0/reporting/151b233a-d6ad-413a-9ebf-ea6efbf5387b/page/p_dt1c6y4xed.
- As we utilized CEF to iterate on experiments, we observed the average correctness improve from 3.24 to 3.83, as depicted in the trend, which also includes migrating to Claude 3.
- We worked on a total of 24 experiments, with a few of them significantly impacting accuracy.
- The daily subset data for evaluation, allowing engineers to rapidly experiment, has been a success. This is a small subset that only takes 10 minutes to run, enabling engineers to test with every MR and every prompt before pushing to production. This framework is reproducible and can be used to evaluate other features as well (see the sketch after this list).
- We have integrated the Code Suggestions API into the Prompt Library, enabling us to evaluate the feature in reference to foundational models.
- We closed the Q1 OKR on chat at 96% (https://gitlab.com/gitlab-com/gitlab-OKRs/-/work_items/5944) as the scope changed, and we aim to add slash commands this quarter. We deprioritized the OKR on blueprints last quarter and closed it (https://gitlab.com/gitlab-com/gitlab-OKRs/-/work_items/6033) at 73%, due to prioritizing daily runs of chat and adding new models.
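To illustrate the daily-subset approach mentioned above, here is a minimal sketch of selecting a small, deterministic slice of an evaluation dataset so per-MR runs stay fast; the `daily_subset` function, the 5% fraction, and the `id` field are assumptions for illustration rather than the framework's actual sampling logic.

```python
import hashlib

def daily_subset(dataset, fraction=0.05, seed="chat-eval"):
    """Select a small, deterministic subset of evaluation examples.

    Hashing each example id with a fixed seed gives a stable subset, so every
    MR is evaluated against the same examples and runs stay fast (minutes,
    not hours). `example["id"]` is a hypothetical field; real datasets may
    key examples differently.
    """
    threshold = int(fraction * 0xFFFFFFFF)
    subset = []
    for example in dataset:
        digest = hashlib.sha256(f"{seed}:{example['id']}".encode()).digest()
        if int.from_bytes(digest[:4], "big") <= threshold:
            subset.append(example)
    return subset

if __name__ == "__main__":
    data = [{"id": str(i), "question": f"q{i}"} for i in range(1000)]
    print(len(daily_subset(data)))  # roughly 5% of the 1000 examples
```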
Praise
- We would like to take the opportunity to thank @tlinz, @juan-silva, and @oregand for working better together as we make chat better.
- A special thanks to @shinya.maeda for being the founding engineer using the experimentation framework, and to the chat engineers iterating on experiments with it: @lesley-r, @bcardoso-, @lulalala. (We may be missing names here, feel free to tag.)
- A special thanks to @mikolaj_wawrzyniak for his contribution to the Prompt Library for Code Suggestions.
- Kudos to @AndrasHerczeg, @tle_gitlab, @HongtaoYang, and @srayner for iterating and delivering on our roadmap!
