Iterating on Code Suggestion Pipeline

Overview

As we iterate on the code completion pipeline to add more models and make it scalable, here is the list of things we are looking to iterate on:

  1. Currently the Code Suggestion Pipeline only has Code Completion. Add Code Generation to the Pipeline.
  2. Update to the latest versions of the Gecko, text-bison, and code-bison models.
  3. @srayner to document other refactoring of the pipeline as well. The goals are as follows:
    • Generate OutputV5 for the Grafana Dashboards
    • Create config files that work for the various models, along with prompt templates to go with those configs, so that future users or automated systems (like daily runs) have a working starting point to run these models and get good results. Doing this would result in the OutputV5 data being of reasonable quality (a sketch of such a config follows this list).
    • Document the process to successfully run all of these models along with the constituent parts of the config files for future system users.
  4. Add daily runs and a Code Suggestion API (similar to the duo chat API).
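
As a rough sketch of the kind of per-model config described above (all field names here are hypothetical placeholders, not the pipeline's actual schema), a single entry might look like:

    # Hypothetical sketch of a per-model config entry. Field names are
    # illustrative placeholders, not the pipeline's actual schema.
    VERTEX_CODE_GECKO_CONFIG = {
        "model_provider": "vertex",
        "model_name": "code-gecko",
        "prompt_template": "templates/vertex_code_completion.txt",
        "include_suffix": True,        # pass the suffix alongside the prefix
        "post_transforms": [],         # e.g. whitespace trimming steps
        "output_version": "OutputV5",  # target format for the Grafana dashboards
    }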

Making the Code Completion Pipeline Work

Please note that iterations 0 through 4 were added after the fact, so some minor iterations are missing. Going forward, the iterations will be tracked more closely.

Iteration 0:

Run the pipeline with the example configurations provided.

Result: The results were not promising (below are the vertex and huggingface models, but claude was equally dubious).

Screenshot_2024-03-19_at_11.31.53_am Screenshot_2024-03-19_at_11.33.56_am

Iteration 1:

(vertex only) In the case of vertex, set different prompt templates for text models vs code models. code-gecko and code-bison expect to just be given code, whereas text-bison will expect text and will return just text.

Results: Created two dedicated prompt templates, one for vertex code models and one for vertex text models. Much more promising for code-gecko (although still lower than before), but text-bison and code-bison were still low.
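
For illustration only (these are not the exact templates created in this iteration), the split looks roughly like this: the code models are handed code to continue, while the text model gets a natural-language instruction.

    # Illustrative sketch of the two template styles -- not the exact
    # templates used. code-gecko / code-bison are fed code directly;
    # text-bison is asked in prose and answers in prose.
    VERTEX_CODE_TEMPLATE = "{prefix}"  # code models: complete the code after the prefix

    VERTEX_TEXT_TEMPLATE = (
        "Complete the following {lang_id} code. "
        "Return only the code that should come next, with no explanation.\n\n"
        "{prefix}"
    )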

Iteration 2:

(vertex specific) There was a problem in the last iteration related to sending the suffix to code-bison despite the template not containing the suffix. After some digging, it turned out that "include_suffix": true needed to be added to the config for the code-bison run. This iteration tests that change.
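
A sketch of the code-bison run config for this iteration ("include_suffix" is the key named above; the surrounding fields are placeholders):

    # Sketch of the code-bison run config. Only "include_suffix" is the key
    # confirmed above; the other fields are illustrative placeholders.
    CODE_BISON_CONFIG = {
        "model_provider": "vertex",
        "model_name": "code-bison",
        "prompt_template": "templates/vertex_code_completion.txt",  # hypothetical path
        "include_suffix": True,  # send the suffix to code-bison even though the template omits it
    }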

Result: code-bison is now showing results that are in line with code-gecko.

Iteration 3

(vertex only) Transformations: we have a series of transformations that add function definitions and attempt to nudge the model toward generating a specific language; these were explored here.
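
As a rough sketch of what one such transformation can look like (the function and its name are hypothetical, not the pipeline's actual transforms):

    # Hypothetical pre-transform in the spirit described above: prepend a
    # comment naming the target language to nudge the model toward
    # generating code in that language.
    def add_language_hint(prefix: str, lang_id: str) -> str:
        comment_markers = {"python": "#", "ruby": "#", "go": "//", "javascript": "//"}
        marker = comment_markers.get(lang_id.lower(), "//")
        return f"{marker} This file is written in {lang_id}\n{prefix}"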

Results: Improvements in all vertex models of between 2 and 4%, but still lower than the OutputV4 runs.

Iteration 4

Switched focus to anthropic and added debugging, which helped determine that the extract_completion_xml post-transform and what the template was asking for were not aligned (the template says to put the code in <result> tags, but the post-transform looks for <completion> tags). Made them both use <completion> so as to avoid future conflicts with the duo-chat eval pipeline (it uses the same library).
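
For reference, a minimal sketch of what the post-transform does after the change (illustrative only; the actual extract_completion_xml lives in the shared library):

    import re

    # Illustrative sketch of the post-transform behaviour after aligning on
    # <completion> tags -- not the shared library's actual implementation.
    def extract_completion_xml(model_output: str) -> str:
        match = re.search(r"<completion>(.*?)</completion>", model_output, re.DOTALL)
        if match is None:
            raise ValueError("no <completion> tag found in model output")
        return match.group(1).strip()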

Results: The number of parsing errors has dropped, but only marginally; the XML parser is still not able to parse the results despite the <completion> tag being present.

Iteration 5

Use this prompt template for Claude models:

Human: Here is the content of a file '{file_path}' written in {lang_id} enclosed in <code></code> tags.
Please write only the {lang_id} code where the <complete> tag is located. Again please only return {lang_id} code

<code>
    {prefix}<complete>
    {suffix}
</code>

Assistant:

This replaces the template that requires XML output (as the XML parsing seems to be insufficient for Claude output). Additionally, remove "extract_completion_xml" from the post-transform steps, because we are no longer requesting XML.
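
In config terms, the change for this iteration is roughly the following (field names and the remaining steps are placeholders, not the pipeline's actual schema):

    # Sketch of the Claude run config after this iteration: the plain-code
    # template above, with extract_completion_xml removed from the
    # post-transform steps. Field names are illustrative placeholders.
    CLAUDE_CONFIG = {
        "model_provider": "anthropic",
        "prompt_template": "templates/claude_code_completion.txt",  # the template above
        "post_transforms": [
            # "extract_completion_xml",  # removed: we no longer request XML output
            "strip_whitespace",          # placeholder for whatever steps remain
        ],
    }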

Results: After processing for 16 hours there was a nondescript network failure and the whole pipeline crashed, yielding no data. Below are the logs of relevance.

Screenshot_2024-04-03_at_08.26.24

I am going to break up this run into smaller runs and kick it off again, with a focus on Claude-3 models.
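
A sketch of the chunking approach (the callable and its signature are hypothetical, not the pipeline's real entry point), so that a single network failure only costs one chunk rather than the whole 16-hour run:

    from typing import Callable, Sequence

    # Hypothetical sketch of breaking one large run into smaller ones so a
    # transient failure only loses a single chunk. run_pipeline is a
    # placeholder for the pipeline's actual entry point.
    def run_in_chunks(run_pipeline: Callable[[Sequence[dict]], None],
                      examples: Sequence[dict],
                      chunk_size: int = 500) -> None:
        for start in range(0, len(examples), chunk_size):
            chunk = examples[start:start + chunk_size]
            try:
                run_pipeline(chunk)
            except Exception as err:  # e.g. a transient network failure
                print(f"chunk starting at {start} failed: {err}; continuing")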
