Model Validation Weekly Report - 04-08
⛳ Overview
Model Validation is dedicated to developing a centralised Evaluation and Decision Science Framework for Gen AI models and features. In 16.10, the main focus lies in collaborating closely with the Chat team to streamline large-scale evaluations for code- and search-related tasks. Further, we plan to integrate Code-Gemma into the ensemble of models, validate the solution for Explain the Vulnerability, and document the architecture through a blueprint.
📣 Completed Last Week
Chat Evaluation
- We observed a decline in the Correctness Score for the Epic/Issue Dataset in the Eval Pipeline from 31 March (3.42) to 1 April (3.13). See Investigation on Duo Chat regression for Issue-... (gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/ai-experiments#27 - closed)
- We have a draft MR ready for review covering slash commands for the Explain the Code datasets: Central Evaluation Framework - add tests for /e... (gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#153 - closed)
- We are also iterating on the Code Generation Dataset to include GitLab-specific code beyond industry benchmarks: Iterating on Code Generation with GitLab speci... (gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#133 - closed)
Code Completion
- We have completed iterating on the Code Completion pipeline to include the latest Claude 3 models, and the Grafana dashboard has also been updated!
🎊 Iterating on Code Suggestion Pipeline (gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/ai-experiments#24 - moved). We will be updating the dashboard for Code Suggestions this week as well!
🎯 Focus for This Week
Chat Evaluation
- We will continue working on expanding the documentation and code generation datasets.
- We will continue to support the pattern analysis and investigation for the experiments here: https://gitlab.com/gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/ai-experiments. We have found a regression on the Issue/Epic dataset; details here: Investigation on Duo Chat regression for Issue-... (gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/ai-experiments#27 - closed)
- We are improving the developer experience for the Prompt Library, based on feedback from the knowledge sharing session, by making inputs and outputs much easier to follow: Remove input_adapter to improve usability of pr... (gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#217 - closed) and Auto detect input schema, remove input_adapter (gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library!369 - merged)
Code Completion and Competitive Intelligence
- We will start working on Competitive Intelligence for Duo Features https://gitlab.com/gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/ai-experiments/-/issues/21+ now that we have iterated upon the Code Suggestion pipeline.
Foundational Model Evaluations
- We will be evaluating Code-Gemma this week: https://gitlab.com/groups/gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/-/epics/4
📖 Gen AI Reading List
Long-context LLMs Struggle with Long In-Context Learning
This paper evaluates 13 long-context LLMs on long in-context learning and finds that they perform relatively well below a token length of 20K; however, once the context exceeds 20K tokens, the performance of most LLMs except GPT-4 drops dramatically.
JetMoE: Reaching LLaMA2 Performance with 0.1M Dollars
This paper proposes an 8B model trained at a cost of less than $0.1 million that outperforms LLaMA2-7B, showing that LLM training can be much cheaper than generally thought. JetMoE-8B has 24 blocks, each containing two MoE layers: Mixture of Attention heads (MoA) and Mixture of MLP Experts (MoE); each MoA and MoE layer has 8 experts, of which 2 are activated for each input token, giving 2.2B active parameters.
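The sparse routing described above (8 experts per layer, 2 activated per token) can be sketched as a toy top-2 MoE layer. This is a minimal illustration, not JetMoE's actual implementation: the hidden size is arbitrary, and the experts here are plain linear maps rather than the paper's MLP and attention-head experts.

```python
import numpy as np

rng = np.random.default_rng(0)

NUM_EXPERTS = 8   # experts per MoE layer, as described for JetMoE
TOP_K = 2         # experts activated per input token
D_MODEL = 16      # toy hidden size, illustrative only

# Toy linear experts; JetMoE's real experts are MLPs or attention heads.
experts = [rng.standard_normal((D_MODEL, D_MODEL)) for _ in range(NUM_EXPERTS)]
router = rng.standard_normal((D_MODEL, NUM_EXPERTS))  # gating network

def moe_layer(x):
    """Route a token vector x to its top-2 experts and mix their outputs."""
    logits = x @ router                      # one score per expert
    top = np.argsort(logits)[-TOP_K:]        # indices of the 2 highest-scoring experts
    weights = np.exp(logits[top])
    weights /= weights.sum()                 # softmax over the selected experts only
    # Only the chosen experts run, which is where the compute savings come from.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

token = rng.standard_normal(D_MODEL)
out = moe_layer(token)
```

Because only 2 of the 8 experts execute per token, the layer computes roughly a quarter of a dense equivalent's expert FLOPs, which is how an 8B-parameter model ends up with only 2.2B active parameters.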
Representation Finetuning for LMs
This paper proposes a method for representation fine-tuning (ReFT) that operates on a frozen base model and learns task-specific interventions on hidden representations; in other words, by manipulating a small fraction of model representations it is possible to effectively steer model behavior to achieve better downstream performance at inference time; also proposes LoReFT as a drop-in replacement for PEFTs that is 10-50x more parameter efficient.
SWE-agent
This paper presents a new open-source agentic system that can automatically solve GitHub issues with accuracy similar to Devin on SWE-bench; the agent interacts with a specialized terminal that enables processing of files and executable tests to achieve good performance. On SWE-bench, SWE-agent resolves 12.29% of issues, achieving state-of-the-art performance on the full test set.
This article shows how, with Opera, one can download and run large language models locally, with over 150 models from more than 50 families available.
👀 What's Happening in AI Company-Wide?
- There are also various other roles within the core AI teams; do check the job board.
- AI Framework Weekly Updates can be found here: https://gitlab.com/gitlab-org/ai-powered/ai-framework/team-hq/-/issues/36+
- There is a new AI Slack channel dedicated to reading feeds only: #ai_reading_feed
Praise
We would like to extend our gratitude to the entire team and to the extended AI Core teams for their dedicated efforts.