Model Validation Weekly Report-06-11

🎺 Overview

Model Validation is dedicated to developing a centralized Evaluation and Decision Science Framework for Gen AI models and features. In 17.2, the main focus lies in collaborating closely with the feature teams to streamline large-scale evaluations for Root Cause Analysis and Vulnerability Explanation. Further, we plan to integrate additional foundational models and continue building the framework as a developer-friendly tool.

📣 Completed Last Week

Root Cause Analysis

  1. We have the first draft of the dataset curated for 14 projects with 420 prompts: Creation of a RCA prompt library (gitlab-org/modelops/ai-model-validation-and-research/ai-evaluation/prompt-library#254 - closed). We will now run our answering model and metric pipeline to have the first draft of the dashboard ready for the week of June 17th, including the current feature analysis.
  2. We would like to add more classifications to the pipeline to better diagnose whether there are certain categories on which the LLM-based applications do not perform well.
  3. Post June 21st, we will have the benchmark and an understanding of the changes needed for RCA to be GA-ready.

Vulnerability Explanation

  1. We ran a sample prompt analysis on a set of 250 prompts.
  2. The dashboard draft is in progress and is being updated with new data.

Here is a video update on what the data and dashboard look like for Vulnerability Explanation and RCA:

Model_Validation_Update__Vulnerability_Explanation_and_Root_Cause_Analysis

🎯 Focus for This Week

Feature Epic

Here is the epic with the current status of all the features the AI Model Validation group is supporting.


Root Cause Analysis

  1. We will continue completing the first draft of the dashboard for the foundational model and feature pipelines. We will also start working on the local experimentation inference pipeline for local seeding simultaneously, so that we are able to test the changes we will need post June 21st.

Foundational Model Evaluations

  1. We have tested GPT-4o for ETV and will be adding the other tasks for assessment.
  2. We will be looking into Claude-Igloo as the next model.

Vulnerability Explanation

  1. The dashboard is ready, and we will be running the full pipeline for the models this week, aiming for June 12th, as the team is on PTO for the majority of the week.
  2. We will also be looking into the feature pipeline and how Vulnerability Resolve and Explain work.
  3. We will continue enriching the data as well.

📖 Gen AI Reading List

Buffer of Thoughts

This paper presents a thought-augmented reasoning approach to enhance the accuracy, efficiency, and robustness of LLM-based reasoning; it leverages a meta-buffer containing high-level thoughts (thought templates) distilled from problem-solving processes; the relevant thought template is then retrieved and instantiated with task-specific reasoning structures for the thought-augmented reasoning process; it demonstrates SOTA performance on 10 challenging tasks while requiring 12% of the cost of multi-query prompting methods like Tree-of-Thoughts.

Aligning LLMs with Demonstrated Feedback

This paper proposes a method to align LLMs to a specific setting via a very small number of demonstrations as feedback; it aligns LLM outputs to a user's demonstrated behaviors and can learn fine-grained style and task alignment across domains; it outperforms few-shot prompting, SFT, and self-play methods on the tested benchmarks.

AgentGym

This paper presents a new framework featuring various environments and tasks for broad, real-time, and concurrent agent exploration; it builds a generally capable LLM-based agent with self-evolution abilities and explores its potential beyond previously seen data across tasks and environments.

Guide for Evaluating LLMs

This paper presents a new attention mechanism that can be trained in parallel (like Transformers) and be updated efficiently with new tokens, requiring constant memory usage for inference (like RNNs); the attention formulation is based on the parallel prefix scan algorithm, which enables efficient computation of attention's many-to-many RNN output; it achieves comparable performance to Transformers on 38 datasets while being more time- and memory-efficient.

👀 What's happening in AI Company-Wide?

  • With the recent AI Prioritization announcement (the prioritization spreadsheet can be found here), Model Validation has shifted to working closely with feature teams to prioritize feature evaluation. Following Vulnerability Explanation, the next feature will be Root Cause Analysis. We will further expand to add more performance metrics as well as system evaluations such as RAG.

Praise

  1. We would like to extend our gratitude to the entire team and to the extended AI Core teams for their dedicated efforts.👏🏻 👏🏼 Thanks, all, for making it through the reading as well. If anyone would like to subscribe to this report, tag yourself in the issue or ping us in #g_ai_model_validation.