Calibrate LLM Judge to be a better proxy for the user
Objective:
To enable efficient development, we have been building a central evaluation framework that developers can use to test how prompt changes improve the chat. We use an LLM as a judge to evaluate the chat answers. However, it seems that the LLM judge is not a good proxy for the users. The goal of this research is to understand why this is so and how we could calibrate the LLM judge to be a better user proxy.
Details: why it seems that the LLM judge is a poor proxy for the user
I put all feedback from all chat bashes into one file. I found that the only type of question for which we have enough data to be statistically relevant is questions about how to use GitLab. (This is probably because the bashes happened in the GitLab Web App and not in IDEs.) So I compared how such chat questions improved over time in the eyes of the bash users vs. the Central Eval Framework (CEF) ratings:
- The CEF results show a clear improvement just before the GA release (34% -> 7% poor answers).
- In the user bashes we see an improvement, but it is not nearly as pronounced as in the CEF.
So I conclude that, at least for GitLab docs-related questions, the CEF is not yet a good proxy for the user. Still, the CEF has helped improve the results, so there is definitely a lot of value in it. In particular, it has helped us get rid of the large number of "I don't know how I can help" answers, which most likely contributed to improving the user ratings.
I have not made a detailed analysis of the questions and answers in the chat bash. Such an analysis could probably reveal more insights into how we could adjust the CEF to be a better proxy; I will leave this to the data scientists. We would probably have to weed through the questions, clean them, let the CEF rate them, and then adjust the rating so it matches the user rating. In any case, we now have all the chat bash data in this one file.
Metric:
Just a proposal:
- Root mean square (RMS) of the difference between the LLM judge rating and the user rating (sketched below).
- Note that the user rating and the LLM judge rating use different scales that need to be aligned:
  - LLM judge: 1 to 4
  - user: 1 to 5
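A minimal sketch of this metric, assuming we rescale both ratings to [0, 1] before comparing (the rescaling choice is my assumption, not a settled decision):

```python
import numpy as np

def rescale(ratings: np.ndarray, lo: float, hi: float) -> np.ndarray:
    """Map ratings from the [lo, hi] scale onto [0, 1] so the two scales are comparable."""
    return (ratings - lo) / (hi - lo)

def rms_diff(judge_ratings: np.ndarray, user_ratings: np.ndarray) -> float:
    """Root mean square of the difference between judge and user ratings."""
    judge = rescale(judge_ratings, 1, 4)  # LLM judge rates on a 1-4 scale
    user = rescale(user_ratings, 1, 5)    # users rate on a 1-5 scale
    return float(np.sqrt(np.mean((judge - user) ** 2)))

# Example with three rated answers: 0.0 would mean perfect agreement.
print(rms_diff(np.array([4, 2, 1]), np.array([5, 4, 1])))  # ~0.24
```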
Dataset:
Chat bash questions, answers, and user ratings for GitLab docs-related questions.
Proposed approach:
- Have an LLM (or a human) go through the user rating (column F) and "How could Duo Chat's response have been improved?" (column I), and check whether the user rating refers to the answer itself or to some other problem (e.g. the chat took forever to answer or stalled). Filter out all cases where the rating does not seem to relate to the answer itself.
- Take the user question (column E) and the chat response (column G) and run them through the LLM judge.
- Compare the user rating (column F) to the LLM judge rating (e.g. via the RMS of the difference; see the first sketch after this list).
- Optional: run the user question (column E) through the chat again and see whether the answer we get today is better than at the time of the bash. This is interesting but does not help calibrate the LLM judge.
- Experiment with the LLM judge prompt containing its task description, to see whether it can be adjusted so that the RMS difference between user rating and LLM judge rating gets smaller (see the second sketch after this list).
  - Maybe use an LLM to fiddle with this prompt and propose adjustments.
- If this is successful, change the LLM judge in production, at least for the docs-related questions. If there is reason to believe that the adjustments to the judge's task description are generic in nature, also update the LLM judge for the other question types.
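A sketch of the filter-judge-compare steps above, assuming the bash file has been exported to CSV. The file name, the column headers, the `rating_about_answer` pre-filter flag, and the `llm_judge` helper are all hypothetical placeholders for the real data and the real CEF judge call:

```python
import csv

import numpy as np

JUDGE_PROMPT = "Rate how well the answer addresses the question, from 1 (poor) to 4 (great)."

def llm_judge(prompt: str, question: str, answer: str) -> int:
    """Placeholder for the real LLM judge call; returns a 1-4 rating."""
    raise NotImplementedError

with open("chat_bash_feedback.csv", newline="") as f:  # hypothetical export of the bash file
    rows = [
        r for r in csv.DictReader(f)
        # Keep only rows where the LLM/human pre-filter concluded that the
        # user rating is about the answer itself (based on column I).
        if r.get("rating_about_answer") == "yes"
    ]

judge = np.array([llm_judge(JUDGE_PROMPT, r["question"], r["chat_response"]) for r in rows])
user = np.array([float(r["user_rating"]) for r in rows])  # column F, 1-5 scale

# Rescale both ratings to [0, 1] and take the RMS difference, as in the Metric section.
rms = np.sqrt(np.mean(((judge - 1) / 3 - (user - 1) / 4) ** 2))
print(f"RMS(judge vs. user) = {rms:.3f}")
```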
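And a sketch of the prompt-adjustment loop, reusing `llm_judge` and the filtered `rows` from the previous sketch. `propose_prompt_variant` stands in for an LLM (or a human) rewriting the judge's task description, and greedy hill climbing is just one possible search strategy:

```python
import numpy as np

def evaluate_prompt(prompt: str, rows: list[dict]) -> float:
    """RMS difference between judge and user ratings for a candidate judge prompt."""
    judge = np.array([llm_judge(prompt, r["question"], r["chat_response"]) for r in rows])
    user = np.array([float(r["user_rating"]) for r in rows])
    return float(np.sqrt(np.mean(((judge - 1) / 3 - (user - 1) / 4) ** 2)))

def propose_prompt_variant(prompt: str, rms: float) -> str:
    """Placeholder: ask an LLM to rewrite the judge prompt so it tracks users better."""
    raise NotImplementedError

best_prompt, best_rms = JUDGE_PROMPT, evaluate_prompt(JUDGE_PROMPT, rows)
for _ in range(10):  # a few greedy iterations; stop early if no variant improves
    candidate = propose_prompt_variant(best_prompt, best_rms)
    rms = evaluate_prompt(candidate, rows)
    if rms < best_rms:  # keep only variants that reduce the disagreement with users
        best_prompt, best_rms = candidate, rms
```

If a better prompt is found this way, it would make sense to validate it on a held-out subset of the bash data before changing the judge in production, so we do not overfit the prompt to this one file.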
Metrics:
1. Control Metric Score:
2. Experiment Metric Score:
3. Variance: