Model Validation Weekly Report -02-11
⛳ Overview
Model Validation is dedicated to developing a centralized Evaluation and Decision Science Framework for Gen AI models and features. In 16.9, the main focus is on collaborating closely with the Chat team to streamline large-scale evaluations for code- and search-related tasks. Further, we plan to integrate Gemini into the ensemble of models and to meticulously document the architecture through a blueprint.
📣 Completed Last Week
Chat Evaluation
- We have integrated the second metric, cosine similarity, for chat evaluation: Adding Similarity score to Chat pipeline to com... (gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#132 - closed). We have also merged the LLM-based Evaluator Judge to standardise the LLM Evaluator metrics and will be running the pipeline to incorporate all the chat datasets: Iterating on Chat Eval Metrics (gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#134 - closed). A rough sketch of the similarity metric follows this list.
- We recorded an end-to-end walkthrough video for developers on using Chat for diagnostic purposes for code generation with the MBPP dataset, available here: https://youtu.be/U2CW95yylMs
- We have started looking into the data schema for the final table, including adding a prompt ID, task ID, etc., so that we can scale the datasets and gain a better understanding of the final output table: Building Schema for the various data sources an... (gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#151 - closed). A hypothetical sketch of such a record follows this list.
- We had a great meeting with @NickHertz to better understand the Chat bashes as well as the diary study, and we will be working with him on the requirements for testing as a proxy for production.
- We have started working on a Looker Studio prototype of the dashboard here: https://lookerstudio.google.com/reporting/151b233a-d6ad-413a-9ebf-ea6efbf5387b and will incorporate feedback from @tlinz this week before the next draft in Grafana.
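As a rough illustration of the similarity metric mentioned above, here is a minimal sketch of a cosine-similarity score between a Chat answer and a reference answer. It assumes the sentence-transformers package and a generic `all-MiniLM-L6-v2` embedding model; the embedding model and wiring in the actual pipeline may differ.

```python
# Minimal sketch: cosine similarity between a Chat answer and a reference answer.
# Assumes sentence-transformers; the embedding model used in the real pipeline may differ.
from sentence_transformers import SentenceTransformer
import numpy as np

_model = SentenceTransformer("all-MiniLM-L6-v2")

def cosine_similarity_score(answer: str, reference: str) -> float:
    """Return the cosine similarity in [-1, 1] between two texts."""
    emb = _model.encode([answer, reference])
    a, b = emb[0], emb[1]
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Example usage:
# score = cosine_similarity_score(chat_answer, ground_truth_answer)
```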
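For the output-table schema work above, a hypothetical sketch of the kind of row-level record being discussed is shown below. The `prompt_id` and `task_id` fields come from the issue; the remaining fields and names are illustrative assumptions, not the agreed schema.

```python
from dataclasses import dataclass, asdict

@dataclass
class ChatEvalRecord:
    # Identifiers to make datasets scalable and joinable across sources.
    prompt_id: str
    task_id: str
    dataset: str          # e.g. "mbpp" or a GitLab-project source (illustrative)
    question: str
    expected_answer: str
    model_answer: str
    # Metric columns (illustrative names).
    similarity_score: float
    llm_judge_score: float

# A row can then be serialized for the final output table:
# row = asdict(ChatEvalRecord(...))
```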
🎯 Focus for This Week
Chat Evaluation
- This week, we embark on the development of the data pipeline and the final verdict metric for the chat evaluation system. Simultaneously, we aim to augment the dataset by leveraging our existing open-source dataset obtained from 14 GitLab projects for code explanation and refactoring: Code Explanation / Refractoring with GitLab Data (gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#131 - closed). A hypothetical sketch of the verdict aggregation follows this list.
- We will be working on the feedback for the dashboard prototype and aim to have the first draft of the Grafana dashboard, with Issue/Epic and Code Generation data, by the next update.
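To make the "final verdict" idea above concrete, here is a hypothetical sketch of combining the per-answer metrics into a single pass/fail verdict. The weights and threshold are placeholder assumptions for illustration, not the metric we will ship.

```python
# Hypothetical aggregation of per-answer metrics into a final verdict.
# Weights and threshold are placeholders for illustration only.
def final_verdict(similarity_score: float, llm_judge_score: float,
                  sim_weight: float = 0.4, judge_weight: float = 0.6,
                  threshold: float = 0.7) -> str:
    """Combine metric scores (assumed normalized to [0, 1]) into a pass/fail verdict."""
    combined = sim_weight * similarity_score + judge_weight * llm_judge_score
    return "pass" if combined >= threshold else "fail"

# Example: final_verdict(0.82, 0.9) -> "pass"
```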
📖 Gen AI Reading List
This paper explores an LLM-based agent that can utilize 16K APIs from Rapid API; proposes a simple framework consisting of 1) a hierarchical API-retriever to identify relevant API candidates to a query, 2) a solver to resolve user queries, and 3) a self-reflection mechanism to reactivate AnyTool if the initial solution is impracticable; this tool leverages the function calling capability of GPT-4 so no further training is needed; the hierarchical API-retriever is inspired by a divide-and-conquer approach to help reduce the search scope of the agents which leads to overcoming limitations around context length in LLMs; the self-reflection component helps with resolving easy and complex queries efficiently
This paper presents a study on the scaling property of raw agents instantiated by LLMs; finds that performance scales when increasing agents by simply using a sampling-and-voting method
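The sampling-and-voting idea from that paper is simple enough to sketch: draw several independent samples for the same query and take the majority answer. A minimal illustration follows; the `generate` callable stands in for any LLM call and is an assumption.

```python
from collections import Counter
from typing import Callable

def sample_and_vote(generate: Callable[[str], str], query: str, n: int = 5) -> str:
    """Draw n independent samples for the same query and return the majority answer."""
    answers = [generate(query) for _ in range(n)]
    most_common_answer, _count = Counter(answers).most_common(1)[0]
    return most_common_answer
```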
This paper discusses the essential aspects of LLM-based multi-agent systems; it includes a summary of recent applications for problem-solving and world simulation; it also discusses datasets, benchmarks, challenges, and future opportunities to encourage further research and development from researchers and practitioners
This paper proposes an indirect reasoning method to strengthen the reasoning power of LLMs; it employs the logic of contrapositives and contradictions to tackle IR tasks such as factual reasoning and mathematical proof; it consists of two key steps: 1) enhance the comprehensibility of LLMs by augmenting data and rules (i.e., the logical equivalence of contrapositive), and 2) design prompt templates to stimulate LLMs to implement indirect reasoning based on proof by contradiction; experiments on LLMs like GPT-3.5-turbo and Gemini Pro show that the proposed method enhances the overall accuracy of factual reasoning by 27.33% and mathematical proof by 31.43% compared to traditional direct reasoning methods. Something that can be tested for Chat and Code Suggestions?
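If we did want to try this on Chat or Code Suggestions, a first experiment could be as simple as a prompt template that asks the model to reason by contradiction. The wording below is an illustrative assumption, not the paper's actual template.

```python
# Illustrative prompt template for indirect (proof-by-contradiction) reasoning.
# The exact wording and protocol in the paper may differ.
INDIRECT_REASONING_TEMPLATE = """\
Claim: {claim}

Instead of proving the claim directly, assume the claim is FALSE.
Step 1: State the negation of the claim explicitly.
Step 2: Derive the consequences of that negation.
Step 3: If you reach a contradiction with known facts, conclude the claim is TRUE;
otherwise, conclude it is FALSE.
Answer with TRUE or FALSE and show your reasoning.
"""

def build_indirect_prompt(claim: str) -> str:
    """Fill the template with a claim to be checked by indirect reasoning."""
    return INDIRECT_REASONING_TEMPLATE.format(claim=claim)
```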
👀 What's happening in AI company-wide?
- There is a new Slack channel, AI-Portfolio, to surface challenges, connect on XFN topics, and consume, clarify, and provide feedback on updates and asks on anything AI related.
- We have added a page to the internal handbook on AI Evaluation and Testing for AI-powered features, which covers all three types of testing used to assess the effectiveness of AI-powered features. Link: https://internal.gitlab.com/handbook/product/ai-strategy/ai-integration-effort/ai_testing_and_evaluation/
- @NickHertz, @enf, and @alasch conducted a diary study to learn how various DevSecOps personas use conversational AI tools, like ChatGPT, Bard, etc., to support their workflows, what benefits they perceive, and what capabilities they'd like to see. As part of this research, participants logged examples of their AI usage over a two-week period, and we mapped their tasks against the DevSecOps loop. Here is the deck with insights.
- There are also various other roles open within the core AI teams; do check the job board.
- AI Framework Weekly Updates can be found here: https://gitlab.com/gitlab-org/ai-powered/ai-framework/team-hq/-/issues/36+
Praise
We would like to extend our gratitude to the entire team and to the extended AI Core teams for their dedicated efforts.