Are we overfitting with Code Suggestions Prompts?
Context
@shekharpatnaik ran the HumanEval code benchmark on a number of models for comparison with our API. The dataset contains Python code-generation examples, mostly logical problems; each problem consists of a comment describing the task plus unit tests to check whether the generated code is correct. I've tried to format the output from the models to give them all the best chance of success. The code for the eval can be found here.
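For readers unfamiliar with the benchmark, a HumanEval-style check roughly amounts to executing the model's completion together with the problem's unit tests and counting a pass if no assertion fails. The sketch below is illustrative only; the sample problem, the `passes` helper, and the `entry_point` argument are assumptions for the example, not the actual eval code linked above.

```python
def passes(prompt: str, completion: str, test_code: str, entry_point: str) -> bool:
    """Return True if prompt + completion satisfies the problem's unit tests.

    HumanEval tasks ship a `check(candidate)` function in their test code
    and name the function under test via an entry point.
    """
    env: dict = {}
    try:
        # Execute the reassembled solution and its tests in one namespace.
        exec(prompt + completion + "\n" + test_code, env)
        env["check"](env[entry_point])
        return True
    except Exception:
        return False


# Illustrative task in the HumanEval shape (not from the real dataset):
PROMPT = 'def add(a, b):\n    """Return the sum of a and b."""\n'
TEST = (
    "def check(candidate):\n"
    "    assert candidate(2, 3) == 5\n"
    "    assert candidate(-1, 1) == 0\n"
)

print(passes(PROMPT, "    return a + b\n", TEST, "add"))  # True
print(passes(PROMPT, "    return a - b\n", TEST, "add"))  # False
```

A score like 84/164 is then just the number of problems whose completion passes this check, divided by the dataset size.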
The outcome of this evaluation is the following:
- GitLab Code Suggestions (with intent generation): 84/164 (~51%)
- OpenAI gpt-3.5-turbo: 115/164 (~70%)
- Codellama2 7b 4bq: 60/164 (~36%)
- Dolphin Mistral 7b 4bq: 65/164 (~39%)
- Deepseek Coder 6.7b-instruct 4bq: 84/164 (~51%)
- Claude 2.1: 99/164 (~60%)
Indentation handling caused some issues for GitLab Code Suggestions, but even accounting for that, its score is still significantly lower than Claude 2.1's.
Proposal
Go deeper with this experiment to see if we have an overfitting issue.