16.3: Scope of A/B Testing Platform for Prompt Engineering Experiments
Overview
This issue scopes the requirements for an A/B testing platform focused on prompt engineering, something like Optimizely for prompts: a tool purely for rapid experiments in AI-assisted features. The purpose of the platform is a lean system that tracks control and test versions so we can test and validate prompt-engineering hypotheses before rolling them out to production.
Functionality Requirements
- Isolate experiments from one another so that experiment X does not interfere with experiment Y.
- Define a success criterion based on a measurable metric (e.g. acceptance rate by language, resource usage, request time). We want to make data-driven decisions rather than decide on gut feeling; metrics come first. If an experiment can't be measured, we should first develop a metric for it.
- Allow developers to easily manage experiments. A developer should have autonomy over the lifecycle of an experiment without requiring a new deployment. This includes starting and stopping the experiment, but also choosing a winning variant (A/B/.../N) to receive 100% of the traffic once a decision is made.
- Run statistical analysis to determine whether a shift in the measured metric is significant or can be attributed to random variation (a t-test comes to mind).
- Experiment version control and dashboards
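The significance check above could be sketched as a two-sample Welch's t-test in pure Python. This is a minimal illustration, not the platform's implementation; the function name and the sample acceptance data are hypothetical, and a real analysis would likely use a library such as `scipy.stats.ttest_ind(equal_var=False)` and compare the resulting p-value against a chosen threshold.

```python
import math

def welch_t_test(a: list[float], b: list[float]) -> tuple[float, float]:
    """Return (t statistic, approximate degrees of freedom) for two samples."""
    n1, n2 = len(a), len(b)
    m1, m2 = sum(a) / n1, sum(b) / n2
    # Unbiased sample variances
    v1 = sum((x - m1) ** 2 for x in a) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in b) / (n2 - 1)
    se = math.sqrt(v1 / n1 + v2 / n2)
    t = (m1 - m2) / se
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (v1 / n1 + v2 / n2) ** 2 / (
        (v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1)
    )
    return t, df

# Hypothetical per-request acceptance outcomes (1 = accepted, 0 = rejected)
control = [1, 0, 1, 1, 0, 1, 0, 1]
variant = [1, 1, 1, 0, 1, 1, 1, 1]
t, df = welch_t_test(variant, control)
```

With so few samples the t statistic here would not clear a typical significance threshold, which is exactly the kind of call the platform should make for us instead of a gut-feeling judgement.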
High-level Workflow
- Build a small Experimentation engine in the Model Gateway to distribute requests through different experiments/variants.
- Include the `experiments` data in the telemetry payload of the `/v2/completions` response.
- Store the `experiments` data in the client, e.g. VSCode.
- Send the `experiments` data in the subsequent `/v2/completions` request from the client.
- Record experiments using Prometheus labels and aggregate/visualise acceptance rate in Grafana.
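One way the experimentation engine in the Model Gateway could distribute requests across variants is deterministic bucketing: hash the experiment name together with a stable user identifier so each user always sees the same variant, and assignments stay statistically independent across experiments. A minimal sketch, with hypothetical function and experiment names:

```python
import hashlib

def assign_variant(user_id: str, experiment: str,
                   variants: list[str], weights: list[int]) -> str:
    """Deterministically pick a variant for a user within one experiment.

    Hashing (experiment, user_id) together keeps the assignment stable
    per user and uncorrelated with assignments in other experiments.
    """
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % sum(weights)
    cumulative = 0
    for variant, weight in zip(variants, weights):
        cumulative += weight
        if bucket < cumulative:
            return variant
    return variants[-1]

# Hypothetical usage: a 50/50 split. Changing the weights to [0, 100]
# routes 100% of traffic to the winner without a redeploy.
chosen = assign_variant("user-123", "code-gecko-temperature", ["control", "test"], [50, 50])
```

Because the weights live in configuration rather than code, promoting a winning variant is a config change, which satisfies the "no new deployment" requirement above.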
Experiment MVC
- With language suffix: gitlab-org/modelops/applied-ml/code-suggestions/ai-assist!305 (merged)
- Temperature of Code-Gecko: Experiment with a lower temperature with code-g... (gitlab-org/modelops/applied-ml/code-suggestions/ai-assist#229)
Resources
Edited by Tan Le