Data Preparation for Code Suggestions Fine Tuning
This epic captures work related to finding sensible defaults for preparing customer data for fine-tuning of code generation models. Once a customer has selected their data for use in fine-tuning, GitLab will automate processes for:
**Data Extraction**
* extract data identified by customer
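A minimal sketch of the extraction step, assuming the customer identifies data by file extension within a repository checkout (the default extension set below is purely illustrative):

```python
from pathlib import Path

# Hypothetical defaults: extensions a customer might select for extraction.
DEFAULT_EXTENSIONS = {".py", ".rb", ".go", ".js"}

def extract_files(repo_root, extensions=DEFAULT_EXTENSIONS):
    """Collect source files under repo_root matching the selected extensions."""
    root = Path(repo_root)
    return sorted(
        str(p) for p in root.rglob("*")
        if p.is_file() and p.suffix in extensions
    )
```

In practice, selection criteria might also include paths, branches, or `.gitattributes`-style rules rather than extensions alone.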
**Data Cleaning**
* standardize formatting: apply consistent indentation, line breaks, and style
* de-duplication: remove duplicated code blocks
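The cleaning steps above can be sketched as a normalize-then-hash pass; this is one possible approach, not the implementation GitLab has committed to:

```python
import hashlib

def normalize(code):
    """Standardize formatting: strip trailing whitespace, collapse blank runs."""
    lines = [ln.rstrip() for ln in code.splitlines()]
    out, prev_blank = [], False
    for ln in lines:
        if ln == "":
            if not prev_blank:
                out.append(ln)
            prev_blank = True
        else:
            out.append(ln)
            prev_blank = False
    return "\n".join(out).strip() + "\n"

def dedupe(snippets):
    """Drop exact duplicates after normalization, keeping the first occurrence."""
    seen, kept = set(), []
    for s in snippets:
        h = hashlib.sha256(normalize(s).encode()).hexdigest()
        if h not in seen:
            seen.add(h)
            kept.append(s)
    return kept
```

Hashing normalized text catches exact and whitespace-only duplicates; fuzzy near-duplicate detection (e.g. MinHash) would be a further refinement.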
**Data Chunking**
* code generation: identify and extract example chunks that include potential prompts (such as function signatures or comments) as well as the code that follows them, to serve as input/output pairs for training
* code completion: split code into sensible units, such as functions or classes; then split each unit into an input and expected output.
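A toy illustration of function-level chunking, using Python's `ast` module to split each function into a prompt (the signature) and a completion (the body). This handles Python only; a real pipeline would need per-language parsers (e.g. tree-sitter):

```python
import ast

def chunk_functions(source):
    """Split Python source into (prompt, completion) pairs per function:
    the signature line(s) serve as the prompt, the body as the completion."""
    tree = ast.parse(source)
    lines = source.splitlines()
    pairs = []
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            # Lines before the first body statement form the signature/prompt.
            sig_end = node.body[0].lineno - 1
            prompt = "\n".join(lines[node.lineno - 1 : sig_end])
            completion = "\n".join(lines[sig_end : node.end_lineno])
            pairs.append((prompt, completion))
    return pairs
```

For example, `def add(a, b):` becomes the prompt and `    return a + b` the expected completion.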
**Data Formatting**
* tokenize per the requirements of the base model to be trained (e.g. [Mistral requires Byte-Pair Encoding (BPE)](https://docs.mistral.ai/guides/tokenization/))
* structure in a format suitable for training (e.g. a plain text file)
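Tokenization itself depends on the base model's tokenizer, but structuring examples for training can be sketched with the standard library. The `"prompt"`/`"completion"` field names below are a hypothetical schema; actual keys depend on the fine-tuning platform:

```python
import json

def to_jsonl(pairs):
    """Serialize (prompt, completion) pairs as JSON Lines, one example per line.
    Field names are illustrative, not a platform-mandated schema."""
    return "\n".join(
        json.dumps({"prompt": p, "completion": c}) for p, c in pairs
    ) + "\n"
```

JSON Lines keeps each training example self-contained on one line, which streams well and avoids delimiter collisions that raw concatenated plain text can suffer from.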
**Split Dataset**
* split the dataset into training, validation, and test sets
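The split step can be sketched as a deterministic shuffle-and-partition; the 80/10/10 ratios are illustrative defaults, not values this epic has settled on:

```python
import random

def split_dataset(examples, train=0.8, validate=0.1, seed=42):
    """Shuffle deterministically, then partition into train/validate/test sets.
    Ratios here are assumed defaults for illustration only."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * validate)
    return (
        shuffled[:n_train],
        shuffled[n_train : n_train + n_val],
        shuffled[n_train + n_val :],
    )
```

Fixing the seed makes the split reproducible across runs, which matters when comparing fine-tuned model checkpoints.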
### Research work
Fine-tuning research work can be found [in this repo](https://gitlab.com/gitlab-org/ai-powered/custom-models/cm-research)
### Definition of Done
Customers are readily able to use their GitLab repo data to enable fine tuning. Customers can use sensible defaults to:
* [ ] Identify files to be used as a basis for fine-tuning https://gitlab.com/gitlab-org/gitlab/-/issues/505077+
* [ ] Chunk using sensible defaults per use-case https://gitlab.com/gitlab-org/gitlab/-/issues/501453+
* [ ] De-duplicate
* [ ] Format as required for fine-tuning platform https://gitlab.com/gitlab-org/gitlab/-/issues/501454+
* [ ] Split the dataset into training, test, and validation sets