Model Validation Weekly Report 2024-05-20
⛳ Overview
Model Validation is dedicated to developing a centralised Evaluation and Decision Science Framework for Gen AI models and features. In 17.1, our main focus is close collaboration with the Chat team to streamline large-scale evaluations for code- and search-related tasks and for Vulnerability Explanation. Further, we plan to integrate additional Code-Gemma variants, as well as Mistral and Mixtral variants, into the ensemble of models, and to document the architecture through a blueprint.
📣 Completed Last Week
Chat Evaluation
- The slash command is being added to the daily runs, and results will start being reflected this week.
- We are also iterating on the Code Generation dataset to include GitLab-specific code beyond industry benchmarks: Iterating on Code Generation with GitLab speci... (gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#133 - closed)
- We had a sync with @NickHertz last week on running a batch on follow-up questions: Test follow-up questions in Chat to ensure it c... (gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#154 - closed). We are hoping to run a batch soon.
Code Completion
- We have added response time as the first performance metric. Thanks @acook.gitlab for the contribution. We will now be working on the weekly runs of Code Suggestion as well. A first look at the draft dashboard is available here; a sketch of how such a timing metric can be captured follows below.
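As an illustration, here is a minimal sketch of how a per-request response-time metric could be captured and aggregated in an evaluation run. The `query_model` callable and the summary field names are hypothetical placeholders, not the actual pipeline code:

```python
import time
import statistics

def timed_completion(query_model, prompt):
    """Call a code-completion model and record wall-clock response time.

    `query_model` stands in for whichever model client the pipeline uses.
    """
    start = time.perf_counter()
    response = query_model(prompt)
    elapsed_s = time.perf_counter() - start
    return response, elapsed_s

def summarize_response_times(times_s):
    # Aggregate per-request latencies into the summary stats a dashboard would plot.
    return {
        "mean_s": statistics.mean(times_s),
        "p50_s": statistics.median(times_s),
        "p95_s": statistics.quantiles(times_s, n=20)[18],  # 95th percentile
        "max_s": max(times_s),
    }
```

Using `time.perf_counter()` rather than `time.time()` avoids clock-adjustment artifacts when timing short intervals.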
Vulnerability Explanation
- We have the first POC of data curation, with 8,173 vulnerabilities as the first dataset: https://gitlab.com/gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/vulnerabilityexplanation. We are adding the code to the datasets to be able to create the prompts (a sketch of this step is below). We are also working on the metric pipeline and tracking toward a first look at the dashboard by June 10th.
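To illustrate the prompt-creation step, here is a minimal sketch under the assumption that each curated record pairs vulnerability metadata with the affected source snippet; the record fields and template are hypothetical, not the dataset's actual schema:

```python
from dataclasses import dataclass

@dataclass
class VulnerabilityRecord:
    identifier: str    # e.g. a CWE identifier
    name: str
    description: str
    language: str
    code_snippet: str  # the affected source code attached during curation

PROMPT_TEMPLATE = """You are a security expert. Explain the following vulnerability.

Vulnerability: {name} ({identifier})
Description: {description}

Affected {language} code:
{code_snippet}

Explain why this code is vulnerable and how it could be remediated."""

def build_prompt(record: VulnerabilityRecord) -> str:
    # Render one evaluation prompt per curated vulnerability record.
    return PROMPT_TEMPLATE.format(
        name=record.name,
        identifier=record.identifier,
        description=record.description,
        language=record.language,
        code_snippet=record.code_snippet,
    )
```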
🎯 Focus for This Week
Chat Evaluation
- We will continue working and iterating on the items from the previous week.
- We will continue to support the pattern investigation for the experiments here.
- We are further refining the documentation specific to experimentation, based on the feedback here: Documenting a detailed instruction guide on usi... (gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#298 - closed)
Code Completion and Competitive Intelligence
- We will continue to work on the weekly run as well as the dashboard, to be published this week.
Foundational Model Evaluations
- We will be adding GPT-4o to the pipeline this week, and we plan to also leverage it as an LLM-based judge for Vulnerability Explanation, based on evaluation and judge results: Model Request - GPT-4o (gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#297 - closed). A sketch of the LLM-as-judge pattern is shown below.
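As a sketch of the LLM-as-judge pattern, assuming an OpenAI-style chat client; the rubric, JSON contract, and `client` wiring are illustrative assumptions rather than the pipeline's actual judge implementation:

```python
import json

JUDGE_PROMPT = """You are grading a vulnerability explanation.

Question:
{question}

Candidate answer:
{answer}

Score the answer from 1 (poor) to 5 (excellent) for correctness and
completeness. Reply with JSON only: {{"score": <int>, "rationale": "<string>"}}"""

def judge_answer(client, question: str, answer: str) -> dict:
    """Ask a judge model (e.g. GPT-4o) to grade a candidate answer."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(question=question, answer=answer)}],
        temperature=0,  # deterministic grading reduces judge variance
    )
    return json.loads(response.choices[0].message.content)
```

Pinning the temperature to 0 and requiring a strict JSON reply keeps judge scores comparable across weekly runs.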
Vulnerability Explanation
- We will be improving the vulnerability explanation data curation and working on the metric pipeline.
Note: This week the team will be submitting a conference paper to NeurIPS and will be spending significant time documenting Evaluation for the paper.
📖 Gen AI Reading List
- This paper studies the impact of fine-tuning on new knowledge on the hallucination tendencies of LLMs; the setup includes fine-tuning examples that introduce new knowledge; it shows that LLMs struggle to acquire new factual knowledge via fine-tuning, and finds that as new knowledge is learned, the model's tendency to hallucinate increases.
- This paper presents a family of token-based mixed-modal models for generating images and text in any arbitrary sequence; it reports state-of-the-art performance in image captioning, outperforms Llama 2 in text-only tasks, is competitive with Mixtral 8x7B and Gemini-Pro, and exceeds the performance of Gemini Pro and GPT-4V on a new long-form mixed-modal generation evaluation.
- This paper provides an easily reproducible recipe for online iterative RLHF; it discusses theoretical insights and algorithmic principles of online iterative RLHF as well as a practical implementation.
- This paper trains a hypernetwork that takes a tokenizer as input and predicts the corresponding embeddings; it demonstrates generalization to new tokenizers with both encoder and decoder LLMs, and reports that the method achieves performance close to the original models' performance in cross-lingual and coding tasks while reducing the length of the tokenized sequence.
👀 What's happening in AI Company-Wide?
- With the recent AI Prioritization announcement, the prioritization spreadsheet can be found here. Specifically for Model Validation, we will be working closely with the feature teams to prioritize feature evaluation. Following Vulnerability Explanation, the next feature will be Root Cause Analysis. We will further expand to add more performance metrics as well as system evaluation such as RAG.
Praise
- We would like to extend our gratitude to the entire team and to the extended AI Core teams for their dedicated efforts. 👏🏻 👏🏼 Thanks all for making it through the reading as well. If anyone would like to subscribe to this report, do tag yourself in the issue or ping us in #g_ai_model_validation.
- We would like to thank @acook.gitlab for his contribution of including Response Time in the Code Suggestion Pipeline.
- We would also like to thank @bcardoso- for his contribution to the Prompt Library Documentation for the Code Suggestion Pipeline.