16.3 Planning - AI Assisted: AI Evaluation
🤖 AI-Assisted: AI Evaluations - Direction
The AI Evaluations category focuses on assessing the performance and quality of the AI models that power code generation and completion. Initially, we are evaluating a subset of Google models, code-gecko and text-bison, across 12 programming languages. This evaluation is crucial to developing and improving Code Suggestions: it deepens our understanding of how well the models perform and identifies areas that require enhancement.
Themes
✨ Code Suggestions AI Evaluation
Our goal for AI evaluations on Code Suggestions is to assess what makes a high-quality prompt for each language and each kind of code semantics, to map a taxonomy of code completion and code generation, and to add similarity metrics against historically written code. We intend to improve Code Suggestions through comprehensive and robust assessment of generative AI models, leading to a reliable, efficient, and user-friendly product. As a long-term initiative, we want to evaluate the models' Quality, Cost, and Latency. In support of GitLab's vision for AI, areas of interest and improvement are organized as follows.
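To make the "similarity metrics to historically written code" idea concrete, here is a minimal Python sketch that scores a model suggestion against the code a developer actually wrote for the same context. The `suggestion_similarity` helper and the use of `difflib` are illustrative assumptions, not the metric the team has committed to.

```python
from difflib import SequenceMatcher


def suggestion_similarity(suggestion: str, historical_code: str) -> float:
    """Return a 0..1 similarity ratio between a model suggestion and the
    code a developer historically wrote for the same context.
    (Illustrative helper; not the production metric.)"""
    return SequenceMatcher(None, suggestion, historical_code).ratio()


# Example: compare a completion against the historically written line.
suggested = "return [x * 2 for x in items]"
actual = "return [item * 2 for item in items]"
print(f"similarity: {suggestion_similarity(suggested, actual):.2f}")
```

A character-level ratio like this is only one possible baseline; token-level or semantic similarity measures could be swapped in behind the same interface.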
FY24 Q3 OKR: Increase user acceptance rate with High-Quality Suggestions
Driving usage of Code Suggestions and increasing the user acceptance rate is a critical business initiative. We will improve the accuracy of Code Suggestions by refining the prompt engine and building a database of prompts for validating code completion at scale, producing High-Quality Suggestions.
- Building a prompt library for model evaluation at scale (a minimal sketch of what one library entry might contain follows this list)
- Scope of A/B Testing Platform for Prompt Engineering Experiments
- Ad hoc manual testing to help improve the quality of Code Suggestions: help us evaluate JS, Python, Go, and C model responses.
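As a rough illustration of what one record in such a prompt library could look like, the sketch below defines a hypothetical `PromptLibraryEntry`. The field names and taxonomy tags are assumptions made for illustration, not the actual schema.

```python
from dataclasses import dataclass, field


@dataclass
class PromptLibraryEntry:
    """One record in a prompt library used to evaluate code-completion and
    code-generation models at scale (field names are illustrative)."""
    language: str                      # e.g. "python", "go"
    task: str                          # "completion" or "generation"
    prompt: str                        # code context sent to the model
    reference: str                     # historically written code to compare against
    tags: list[str] = field(default_factory=list)  # taxonomy labels, e.g. ["file-io"]


entry = PromptLibraryEntry(
    language="python",
    task="completion",
    prompt="def read_lines(path):\n    ",
    reference="with open(path) as f:\n        return f.readlines()",
    tags=["file-io"],
)
```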
Outcomes
- In the subsequent 1-3 releases, we will better understand what prompt engineering is needed for each language and use case for Code Suggestions. Through this, we will be able to report metrics on whether completions are correct (consistent with the facts), harmless (not including completions that might be offensive), and helpful (they help the developer accomplish their goal), as sketched after this list.
- Based on the phase 1 evaluations across 12 languages, we will have on the order of X,000 prompts chunked into different formats. This will help us understand which models do well for which languages and how to use that information to run A/B tests on prompt transformations.
- Our collaboration with the AI framework team will then be grounded in facts as we move changes into production.
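As a hedged sketch of how the per-language metrics above might be aggregated once correctness, harmlessness, and helpfulness scores exist, the example below averages hypothetical rater scores. The score values, dictionary keys, and `summarize` helper are illustrative assumptions only.

```python
from statistics import mean

# Hypothetical per-suggestion scores from human raters or an LLM judge,
# keyed by the three dimensions named above (values are illustrative).
results = [
    {"language": "python", "correct": 1.0, "harmless": 1.0, "helpful": 0.8},
    {"language": "python", "correct": 0.5, "harmless": 1.0, "helpful": 0.6},
    {"language": "go",     "correct": 1.0, "harmless": 1.0, "helpful": 1.0},
]


def summarize(rows, language):
    """Average each rubric dimension for one language."""
    subset = [r for r in rows if r["language"] == language]
    return {k: mean(r[k] for r in subset) for k in ("correct", "harmless", "helpful")}


print(summarize(results, "python"))
# {'correct': 0.75, 'harmless': 1.0, 'helpful': 0.7}
```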
🔗 Helpful Links
- Slack channels
- Issue boards - overview of all workflow stages
- Metrics