Model Validation Weekly Report -04-15

Overview

Model Validation is dedicated to developing a centralised Evaluation and Decision Science Framework for Gen AI models and features. In 16.10, the main focus is on collaborating closely with the Chat team to streamline large-scale evaluations for code- and search-related tasks. Further, we plan to integrate Code-Gemma into the ensemble of models, validate the solution for Explain this Vulnerability, and document the architecture in a blueprint.

📣 Completed Last Week

Chat Evaluation

Code Completion

🎯 Focus for This Week

Chat Evaluation

  1. We will continue working on expanding the doc, collaborating with Legal, and building out the code generation dataset.
  2. We will continue to support the pattern analysis and investigation for the experiments here.
  3. We are improving the developer experience, following the knowledge-sharing session on the Prompt Library, by making inputs and outputs much easier to follow: Remove input_adapter to improve usability of pr... (gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#217 - closed) and Auto detect input schema, remove input_adapter (gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library!369 - merged).

Code Completion and Competitive Intelligence

  1. We will start working on Competitive Intelligence for Duo Features https://gitlab.com/gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/ai-experiments/-/issues/21+ now that we have iterated on the Code Suggestion Pipeline.

Foundational Model Evaluations

  1. We are working on the Mistral and Code-Gemma evaluations this week: https://gitlab.com/groups/gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/-/epics/4 and Adding Mistral OS Mixtral Models to Prompt Library (gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#187 - closed).

📖 Gen AI Reading List

LM-Guided Chain-of-Thought

This paper applies knowledge distillation to a small LM using rationales generated by a large LM, with the hope of narrowing the gap in reasoning capabilities; the rationale is generated by the lightweight LM and the answer prediction is then left to the frozen large LM; this resource-efficient approach avoids the need to fine-tune the large model and instead offloads rationale generation to the small language model; the knowledge-distilled LM is further optimized with reinforcement learning using several rationale-oriented and task-oriented reward signals; the LM-guided CoT prompting approach proposed in this paper outperforms both standard prompting and CoT prompting; self-consistency decoding further enhances performance.
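
For illustration, the sketch below wires up the two-stage setup described above with Hugging Face pipelines: a small LM drafts the rationale and a frozen larger LM produces the answer. The model names, prompt templates, and helper function are stand-ins chosen for this example, not the paper's actual setup.

```python
# Minimal sketch of LM-guided CoT inference: a small LM writes the rationale,
# a frozen larger LM predicts the answer. Model names are illustrative stand-ins.
from transformers import pipeline

rationale_lm = pipeline("text2text-generation", model="google/flan-t5-small")
answer_lm = pipeline("text2text-generation", model="google/flan-t5-xl")

def lm_guided_cot(question: str) -> str:
    # Step 1: the lightweight LM generates a chain-of-thought rationale.
    rationale = rationale_lm(
        f"Question: {question}\nLet's think step by step.",
        max_new_tokens=128,
    )[0]["generated_text"]

    # Step 2: the frozen large LM answers, conditioned on the question
    # plus the small-LM rationale (the large model is never fine-tuned).
    answer = answer_lm(
        f"Question: {question}\nRationale: {rationale}\nAnswer:",
        max_new_tokens=32,
    )[0]["generated_text"]
    return answer

print(lm_guided_cot("If a train travels 60 km in 45 minutes, what is its average speed in km/h?"))
```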

Best Practices on Synthetic Data

This paper from Google DeepMind provides an overview of synthetic data research, covering applications, challenges, and future directions; it discusses important topics when working with synthetic data, such as ensuring quality, factuality, fidelity, unbiasedness, trustworthiness, privacy, and more.

Representation Finetuning for LMs

This paper proposes representation fine-tuning (ReFT), a method that operates on a frozen base model and learns task-specific interventions on hidden representations; in other words, by manipulating a small fraction of model representations it is possible to effectively steer model behavior and achieve better downstream performance at inference time; it also proposes LoReFT as a drop-in replacement for existing PEFT methods that is 10-50x more parameter-efficient.
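
To make the "interventions on hidden representations" idea concrete, here is a minimal PyTorch sketch of a LoReFT-style edit, assuming the published form h + R^T(Wh + b − Rh). It skips details such as the orthonormality constraint on R and how interventions are attached to specific layers and token positions, so treat it as a reading aid rather than the paper's implementation.

```python
# Minimal LoReFT-style intervention sketch (assumed edit form: h + R^T(Wh + b - Rh)).
# Only the intervention parameters are trained; the base model stays frozen.
import torch
import torch.nn as nn

class LoReFTIntervention(nn.Module):
    def __init__(self, hidden_dim: int, rank: int):
        super().__init__()
        # R spans a low-rank subspace of the hidden space (orthonormality omitted here).
        self.R = nn.Parameter(torch.randn(rank, hidden_dim) * 0.02)
        # W and b define the learned target values inside that subspace.
        self.proj = nn.Linear(hidden_dim, rank)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq_len, hidden_dim) hidden states from a frozen layer.
        subspace_value = h @ self.R.T          # R h
        target_value = self.proj(h)            # W h + b
        edit = target_value - subspace_value   # change to apply, in the subspace
        return h + edit @ self.R               # write the edit back into hidden space

# Typically attached to a frozen transformer layer via a forward hook so that
# only these few parameters receive gradients during fine-tuning.
```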

CodeGemma

This paper walks through a family of open code LLMs based on Gemma; the CodeGemma 7B models excel in mathematical reasoning and match the code capabilities of other open models; the instruction-tuned CodeGemma 7B model is the more powerful model for Python coding, as assessed via the HumanEval benchmark; results also suggest that the model performs best on GSM8K among 7B models; the CodeGemma 2B model achieves SoTA code completion and is designed for fast code infilling and deployment in latency-sensitive settings.
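
Since the 2B model is pitched at latency-sensitive infilling, the snippet below sketches how a fill-in-the-middle (FIM) prompt is typically assembled for such models. The sentinel token names are assumed to follow the CodeGemma convention; confirm them against the model card before relying on them.

```python
# Sketch of a fill-in-the-middle (FIM) prompt for a code-infilling model.
# Sentinel token names are assumptions based on the CodeGemma convention.
def build_fim_prompt(prefix: str, suffix: str) -> str:
    return f"<|fim_prefix|>{prefix}<|fim_suffix|>{suffix}<|fim_middle|>"

prompt = build_fim_prompt(
    prefix="def mean(xs):\n    total = ",
    suffix="\n    return total / len(xs)\n",
)
# The model is expected to generate the missing middle span (e.g. "sum(xs)"),
# stopping at an end-of-sequence or file-separator token.
print(prompt)
```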

👀 What's happening in AI company-wide?

Praise

We would like to extend our gratitude to the entire team and to the extended AI Core teams for their dedicated efforts. 👏🏻 👏🏼 Thanks, all, for making it through the reading as well. If anyone would like to subscribe to this report, tag yourself in the issue or ping us in #g_ai_model_validation.
