Model Validation Weekly Report - 07-15
🎺 Overview
Model Validation is dedicated to developing a centralized Evaluation and Decision Science Framework for Gen AI models and features. In 17.2, the main focus is collaborating closely with the feature teams to streamline large-scale evaluations for Root Cause Analysis, Vulnerability Explanation, Code Review, and Duo Workflow. Further, we plan to integrate additional foundation models and continue building the framework as a developer-friendly tool.
📣 Completed Last Week
Root Cause Analysis
- We have a path forward to move to the staging environment to analyse the features and create a daily run. Details are here: Use a non-prod environment for evaluating GitLa... (gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#346 - closed)
- We have worked on the classification and will take feedback to extend the dataset beyond its current 409 prompts. Next week we will delve further into areas where the feature performs poorly and into different classifications of pipeline errors, as well as looking into scheduled runs.
Vulnerability Explanation
- We added Resolve to the daily run today and will iterate on feedback from the dashboard.
- Evaluation of the 9,000-item set in staging will happen post-GA.
- We plan to iterate on metrics based on testing and the experimentation framework
Code Review
- We have completed the first iteration of the dataset, as described here: gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#348 (comment 1986907982). Review and knowledge sharing are planned as next steps.
Duo-Workflow
- We have an action plan for creating the first pipeline-error dataset as an offshoot of our Root Cause Analysis dataset. We will add more layers to understand the diffs of the MR. The first exploration is manual until we reverse engineer how to scale it.
🎯 Focus for This Week
Feature Epic
Here is the epic with the current status for all the features the AI Model Validation group is supporting.
Root Cause Analysis
- We continue to work on the GraphQL endpoint to understand how we can set up the daily run and add additional data and classification to the pipeline errors. We will also start working on the daily run pipeline to incorporate up to ~2,000 prompts; a minimal sketch of the kind of query involved is shown below.
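For context, here is a minimal sketch of how a daily run could pull failed pipelines through the GitLab GraphQL API. The project path, token handling, and the exact fields queried are illustrative assumptions, not the actual pipeline implementation.

```python
# Hypothetical sketch: fetch recent failed pipelines for a project via the GitLab GraphQL API.
import os
import requests

GITLAB_GRAPHQL_URL = "https://gitlab.com/api/graphql"

QUERY = """
query failedPipelines($fullPath: ID!, $first: Int!) {
  project(fullPath: $fullPath) {
    pipelines(status: FAILED, first: $first) {
      nodes {
        iid
        status
        createdAt
      }
    }
  }
}
"""

def fetch_failed_pipelines(full_path: str, first: int = 50) -> list[dict]:
    """Return recent failed pipelines for the given project path."""
    response = requests.post(
        GITLAB_GRAPHQL_URL,
        json={"query": QUERY, "variables": {"fullPath": full_path, "first": first}},
        headers={"Authorization": f"Bearer {os.environ['GITLAB_TOKEN']}"},
        timeout=30,
    )
    response.raise_for_status()
    return response.json()["data"]["project"]["pipelines"]["nodes"]

if __name__ == "__main__":
    # Example project path; the real daily run would target the projects under evaluation.
    for pipeline in fetch_failed_pipelines("gitlab-org/gitlab", first=10):
        print(pipeline["iid"], pipeline["status"], pipeline["createdAt"])
```

A scheduled CI job could run a script like this daily and feed the resulting pipeline errors into the classification and prompt-extension work described above.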
Vulnerability Explanation
- We will continue to iterate based on feedback and the experimentation recommendations.
Duo-Workflow
- We will be co-creating the first dataset for LangSmith testing and working to better understand how to scale it across the various projects; see the sketch below for the general shape of that dataset upload.
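As a rough illustration of the LangSmith piece, the sketch below shows one way a first dataset could be registered with the LangSmith Python client. The dataset name, example fields, and content are hypothetical placeholders rather than the team's actual schema.

```python
# Hypothetical sketch: upload a small evaluation dataset to LangSmith.
from langsmith import Client

client = Client()  # reads the LangSmith API key from the environment

dataset = client.create_dataset(
    dataset_name="duo-workflow-pipeline-errors-v0",
    description="First exploratory dataset for Duo Workflow pipeline-error scenarios.",
)

# Placeholder examples; real entries would come from the Root Cause Analysis offshoot dataset.
examples = [
    {
        "inputs": {"job_log": "ERROR: yarn install failed with exit code 1"},
        "outputs": {"expected_root_cause": "Dependency installation failure"},
    },
]

for example in examples:
    client.create_example(
        inputs=example["inputs"],
        outputs=example["outputs"],
        dataset_id=dataset.id,
    )
```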
Code Review
- We will prepare docs for knowledge sharing and gather any additional feedback on the first rudimentary dataset.
📖 Gen AI Reading List
This paper presents an automated data-generation pipeline to synthesize high-quality datasets for function-calling applications. It shows that 7B models trained on the curated datasets outperform GPT-4 and other state-of-the-art models on the Berkeley Function-Calling Benchmark, and it releases a 60K-entry dataset to support research on function-calling agents.
This paper presents CriticGPT, a new model based on GPT-4 that writes critiques of responses generated by ChatGPT. It was trained with RLHF on a large number of inputs containing mistakes that it had to critique, and is built to help human trainers spot mistakes during RLHF; the authors report that CriticGPT critiques are preferred by trainers over ChatGPT critiques in 63% of cases on naturally occurring bugs.
Searching for Best Practices in RAG
This paper surveys best practices for building effective RAG workflows and proposes strategies that focus on performance and efficiency, including emerging multimodal retrieval techniques.
Scaling Synthetic Data Creation
This paper proposes a collection of 1 billion diverse personas to facilitate the creation of diverse synthetic data for different scenarios. It uses a novel persona-driven data synthesis methodology to generate diverse and distinct data covering a wide range of perspectives. To measure the quality of the synthetic datasets, the authors perform an out-of-distribution evaluation on MATH: a model fine-tuned on their 1.07M synthesized math problems achieves 64.9% on MATH, matching the performance of gpt-4-turbo-preview at only 7B scale.
👀 What's happening in AI Company-Wide?
- We're excited to hear about your experiences using BigQuery for evaluation data in our AI development projects. Your insights are crucial for improving our processes and tools. 🔍 We've put together a quick survey to gather your feedback: https://forms.gle/Bzrij84RCuUQxXaE9
- We have the Manage and Data Science Fun Team day on July 26th!: gitlab-org/manage/general-discussion#17681
Praise
- We would like to extend our gratitude to the entire team and to the extended AI teams for their dedicated efforts.
👏🏻 👏🏼 Thanks, all, for making it through the reading as well. If you would like to subscribe to this report, tag yourself in the issue or ping us in #g_ai_model_validation.