Iterating on Code Generation with GitLab-specific Data

Problem to solve

As LLMs become more robust, the variability between them starts to plateau on generic datasets. To truly assess and identify the best LLMs for our use cases, we need to expand into more complex code for prompts and responses. We currently use MBPP for code generation, but this dataset features code that is insufficiently complex for our purposes. We need to build a customized library.

Proposal

We will use historic datasets to build a custom library for GitLab validation of foundational LLMs.

We will take the GitLab project and additional open source datasets and, starting with the current 14 open source repos, chunk them using various techniques based on pattern and intent. Patterns would include:

  1. Full file for YAML
  2. Chunk from comment to comment
  3. Chunk from function to function
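The three patterns above can be sketched as a single dispatch function. This is a minimal illustration only: the function name, the regex-free line-based heuristics, and the Python-style `def` detection are assumptions, and a production chunker would likely use language-aware parsing instead.

```python
def chunk_source(path: str, text: str) -> list[str]:
    """Split a source file into chunks based on the patterns above.

    Illustrative sketch only: real chunkers would use language-aware
    parsers rather than line prefixes.
    """
    if path.endswith((".yml", ".yaml")):
        # Pattern 1: YAML files are kept whole.
        return [text]

    chunks: list[str] = []
    current: list[str] = []
    for line in text.splitlines():
        stripped = line.strip()
        # Patterns 2 and 3: start a new chunk at each comment or
        # function definition (Python-style shown for brevity).
        if (stripped.startswith("#") or stripped.startswith("def ")) and current:
            chunks.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        chunks.append("\n".join(current))
    return chunks
```

For example, a Python file with two top-level functions yields two chunks, while any YAML file yields exactly one.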

After that, we would reverse engineer each chunk by asking an LLM of our choice to produce the code-generation question, and use consensus filtering with the question and chunk to compare how Chat responds versus the foundational APIs.

Further details

Iteration 1 (!437 (merged) )

The goal is to create a code-generation dataset built from real GitLab codebases, so that it reflects real-world code-generation complexity.

The process of generating this dataset:

  1. Chunk GitLab codebases into blocks of functions and classes.
  2. Have an LLM read each chunk and ask it to generate the prompt that would produce that code.
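Step 2 above amounts to building a meta-prompt around each chunk. A minimal sketch follows; the function name and the exact wording of the instruction are hypothetical, not the prompt actually used.

```python
def build_reverse_prompt(code_chunk: str, language: str) -> str:
    """Build the instruction asking an LLM to reverse-engineer the
    prompt that would generate an existing code chunk (step 2 above).

    The wording here is illustrative only.
    """
    return (
        f"Below is a {language} code snippet from a real codebase.\n"
        "Write the single, self-contained instruction a developer would "
        "give to an AI assistant to generate this exact code. "
        "Return only the instruction.\n\n"
        f"```{language}\n{code_chunk}\n```"
    )
```

The resulting (prompt, chunk) pair then becomes one example in the dataset.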

This dataset will have:

  • Diverse language examples because it uses real-world codebases
  • Challenging code-generation use cases, including generating whole functions or classes, as well as more granular code blocks like loops, match statements, and if/else branches

Links / references

https://leaddev.com/tech/researchers-say-generative-ai-isnt-replacing-devs-any-time-soon

Edited by Hongtao Yang