Data Preparation for Code Suggestions Fine Tuning
This epic captures work related to finding sensible defaults for preparing customer data for fine-tuning of code generation models. Once a customer has selected their data for use in fine-tuning, GitLab will automate processes for:
**Data Extraction**
* extract data identified by customer
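A minimal sketch of the extraction step, assuming the customer identifies data by file extension within a repository checkout (the default extension set below is purely illustrative):

```python
from pathlib import Path

# Hypothetical defaults: extensions a customer might select for extraction.
DEFAULT_EXTENSIONS = {".py", ".rb", ".go", ".js"}

def extract_files(repo_root, extensions=DEFAULT_EXTENSIONS):
    """Collect source files under repo_root matching the selected extensions."""
    root = Path(repo_root)
    return sorted(
        str(p) for p in root.rglob("*")
        if p.is_file() and p.suffix in extensions
    )
```

In practice, selection criteria might also include paths, branches, or `.gitattributes`-style rules rather than extensions alone.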
**Data Cleaning**
* standardize formatting: apply consistent indentation, line breaks, and style
* de-duplication: remove duplicated code blocks
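The cleaning steps above can be sketched as a normalize-then-hash pass; this is one possible approach, not the implementation GitLab has committed to:

```python
import hashlib

def normalize(code):
    """Standardize formatting: strip trailing whitespace, collapse blank runs."""
    lines = [ln.rstrip() for ln in code.splitlines()]
    out, prev_blank = [], False
    for ln in lines:
        if ln == "":
            if not prev_blank:
                out.append(ln)
            prev_blank = True
        else:
            out.append(ln)
            prev_blank = False
    return "\n".join(out).strip() + "\n"

def dedupe(snippets):
    """Drop exact duplicates after normalization, keeping the first occurrence."""
    seen, kept = set(), []
    for s in snippets:
        h = hashlib.sha256(normalize(s).encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append(s)
    return kept
```

Hashing normalized text catches exact and whitespace-only duplicates; fuzzy near-duplicate detection (e.g. MinHash) would be a further refinement.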
**Data Chunking**
* code generation: identify and extract example chunks that include potential prompts (such as function signatures or comments) as well as the code that follows them, to serve as input/output pairs for training
* code completion: split code into sensible units, such as functions or classes; then split each unit into an input and expected output.
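A toy illustration of function-level chunking, using Python's `ast` module to split each function into a prompt (the signature) and a completion (the body). This handles Python only; a real pipeline would need per-language parsers (e.g. tree-sitter):

```python
import ast

def chunk_functions(source):
    """Split Python source into (prompt, completion) pairs per function:
    the signature line(s) serve as the prompt, the body as the completion."""
    tree = ast.parse(source)
    lines = source.splitlines()
    pairs = []
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            # Lines before the first body statement form the signature/prompt.
            sig_end = node.body[0].lineno - 1
            prompt = "\n".join(lines[node.lineno - 1 : sig_end])
            completion = "\n".join(lines[sig_end : node.end_lineno])
            pairs.append((prompt, completion))
    return pairs
```

For example, `def add(a, b):` becomes the prompt and `    return a + b` the expected completion.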
**Data Formatting**
* tokenize per the requirements of the base model to be trained (e.g. [Mistral requires Byte-Pair Encoding (BPE)](https://docs.mistral.ai/guides/tokenization/))
* structure in a format suitable for training (e.g. a plain text file)
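Tokenization itself depends on the base model's tokenizer, but structuring examples for training can be sketched with the standard library. The `"prompt"`/`"completion"` field names below are a hypothetical schema; actual keys depend on the fine-tuning platform:

```python
import json

def to_jsonl(pairs):
    """Serialize (prompt, completion) pairs as JSON Lines, one example per line.
    Field names are illustrative, not a platform-mandated schema."""
    return "\n".join(
        json.dumps({"prompt": p, "completion": c}) for p, c in pairs
    ) + "\n"
```

JSON Lines keeps each training example self-contained on one line, which streams well and avoids delimiter collisions that raw concatenated plain text can suffer from.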
**Split Dataset**
* split the dataset into training, validation, and test sets
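The split step can be sketched as a deterministic shuffle-and-partition; the 80/10/10 ratios are illustrative defaults, not values this epic has settled on:

```python
import random

def split_dataset(examples, train=0.8, validate=0.1, seed=42):
    """Shuffle deterministically, then partition into train/validate/test sets.
    Ratios here are assumed defaults for illustration only."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * validate)
    return (
        shuffled[:n_train],
        shuffled[n_train : n_train + n_val],
        shuffled[n_train + n_val :],
    )
```

Fixing the seed makes the split reproducible across runs, which matters when comparing fine-tuned model checkpoints.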
### Research work
Fine-tuning research work can be found [in this repo](https://gitlab.com/gitlab-org/ai-powered/custom-models/cm-research)
### Definition of Done
Customers are readily able to use their GitLab repo data to enable fine tuning. Customers can use sensible defaults to:
* [ ] Identify files to be used as a basis for fine-tuning https://gitlab.com/gitlab-org/gitlab/-/issues/505077+
* [ ] Chunk using sensible defaults per use-case https://gitlab.com/gitlab-org/gitlab/-/issues/501453+
* [ ] De-duplicate
* [ ] Format as required for fine-tuning platform https://gitlab.com/gitlab-org/gitlab/-/issues/501454+
* [ ] Split the dataset into training, test, and validation sets