Model Validation Weekly Report 04-08

Overview

Model Validation is dedicated to developing a centralised Evaluation and Decision Science Framework for GenAI models and features. In 16.10, the main focus is on collaborating closely with the Chat team to streamline large-scale evaluations for code- and search-related tasks. Further, we plan to integrate CodeGemma into the ensemble of models, solution-validate "explain the vulnerability", and meticulously document the architecture through a blueprint.

📣 Completed Last Week

Chat Evaluation

Code Completion

🎯 Focus for This Week

Chat Evaluation

  1. We will continue working on expanding the doc and code generation dataset.
  2. We will continue to support the pattern work and investigation for the experiments here: https://gitlab.com/gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/ai-experiments. We have found a regression on the issue/epic dataset. Details here: Investigation on Duo Chat regression for Issue-... (gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/ai-experiments#27 - closed)
  3. We are improving the developer experience based on the knowledge-sharing session for the Prompt Library by making inputs and outputs much easier to follow: Remove input_adapter to improve usability of pr... (gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#217 - closed) and Auto detect input schema, remove input_adapter (gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library!369 - merged)

Code Completion and Competitive Intelligence

  1. We will start working on Competitive Intelligence for Duo Features https://gitlab.com/gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/ai-experiments/-/issues/21+ now that we have iterated upon the Code Suggestion Pipeline.

Foundational Model Evaluations

  1. We will be evaluating CodeGemma this week: https://gitlab.com/groups/gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/-/epics/4

📖 Gen AI Reading List

Long-Context LLMs Struggle with Long In-Context Learning

This paper evaluates 13 long-context LLMs on long in-context learning and finds that they perform relatively well below a token length of 20K. However, once the context window exceeds 20K, performance for most LLMs except GPT-4 dips dramatically.

JetMoE

This paper proposes an 8B model trained at a cost of less than $0.1 million that outperforms LLaMA2-7B; it shows that LLM training can be much cheaper than generally thought. JetMoE-8B has 24 blocks where each block has two MoE layers: Mixture of Attention heads (MoA) and Mixture of MLP Experts (MoE); each MoA and MoE layer has 8 experts, and 2 experts are activated for each input token, giving 2.2B active parameters.
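The top-2-of-8 expert activation described above is generic sparse MoE gating. A minimal NumPy sketch of that routing idea follows; this is illustrative only, not JetMoE's actual implementation, and `gate_w` and `experts` are hypothetical names:

```python
import numpy as np

def top2_moe(x, gate_w, experts):
    """Route one token through the top-2 of the available experts.

    Illustrative sketch of sparse MoE gating (not JetMoE's real code):
    a linear gate scores every expert, only the 2 best run, and their
    outputs are mixed with softmax weights over the selected pair.
    """
    logits = gate_w @ x                     # one gating score per expert
    top2 = np.argsort(logits)[-2:]          # indices of the 2 best experts
    weights = np.exp(logits[top2])
    weights /= weights.sum()                # softmax over the chosen pair
    # Only the chosen experts execute, so ~2/8 of expert params are "active".
    return sum(w * experts[i](x) for w, i in zip(weights, top2))

rng = np.random.default_rng(0)
d, n_experts = 16, 8
gate_w = rng.normal(size=(n_experts, d))
# Each "expert" here is just a random linear map for demonstration.
experts = [(lambda W: (lambda x: W @ x))(rng.normal(size=(d, d)))
           for _ in range(n_experts)]
y = top2_moe(rng.normal(size=d), gate_w, experts)
print(y.shape)  # (16,)
```

Because only 2 of 8 experts run per token, the active-parameter count stays far below the total parameter count, which is the source of JetMoE's 2.2B-active / 8B-total split.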

Representation Finetuning for LMs

This paper proposes a method for representation fine-tuning (ReFT) that operates on a frozen base model and learns task-specific interventions on hidden representations; in other words, by manipulating a small fraction of model representations it is possible to effectively steer model behavior to achieve better downstream performance at inference time; also proposes LoReFT as a drop-in replacement for PEFTs that is 10-50x more parameter efficient.
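The "intervention on hidden representations" idea can be sketched in a few lines. Below is a hedged NumPy sketch of a LoReFT-style low-rank intervention as described in the paper, where a frozen hidden state `h` is edited only inside an r-dimensional subspace; the variable names are illustrative, not the paper's code:

```python
import numpy as np

def loreft(h, R, W, b):
    """LoReFT-style intervention (illustrative sketch).

    Edits the frozen hidden state h only within the r-dimensional
    subspace spanned by the orthonormal rows of R:
        h' = h + R^T (W h + b - R h)
    W, b, R are the small learned parameters; the base model is untouched.
    """
    return h + R.T @ (W @ h + b - R @ h)

rng = np.random.default_rng(1)
d, r = 64, 4                                     # hidden size, intervention rank
R = np.linalg.qr(rng.normal(size=(d, r)))[0].T   # r x d with orthonormal rows
W = rng.normal(size=(r, d))
b = rng.normal(size=r)
h = rng.normal(size=d)
h_edited = loreft(h, R, W, b)
print(h_edited.shape)  # (64,)
```

The parameter efficiency claim follows from counting: the learned parameters are only R, W, and b, i.e. O(r·d) values with r much smaller than d, versus the weight-sized updates of typical PEFT methods.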

SWE-Agent

This paper presents a new open-source agentic system that can automatically solve GitHub issues with accuracy similar to Devin on SWE-bench; the agent interacts with a specialized terminal interface, which enables effective processing of files and execution of tests to achieve good performance. On SWE-bench, SWE-agent resolves 12.29% of issues, achieving state-of-the-art performance on the full test set.

Opera

This article shows how Opera lets users download and run large language models locally on their computers, with over 150 models from more than 50 families available.

👀 What's happening in AI company-wide?

Praise

We would like to extend our gratitude to the entire team and to the extended AI Core teams for their dedicated efforts. 👏🏻 👏🏼 Thanks, all, for making it through the reading as well. If you would like to subscribe to this report, tag yourself in the issue or ping us in #g_ai_model_validation.