Data Preparation for Code Suggestions Fine Tuning
This epic captures work related to finding ideal defaults for preparing customer data for fine-tuning of code generation models. Once a customer has selected their data for use in fine-tuning, GitLab will automate the following processes:

#### Data Extraction

* extract the data identified by the customer

#### Data Cleaning

* standardize formatting: use consistent indentation, line breaks, and style
* de-duplication: remove duplicated code blocks

#### Data Chunking

* code generation: identify and extract example chunks that include potential prompts (such as function signatures or comments) along with the snippets that follow them, to serve as input/output pairs for training
* code completion: split code into sensible units, such as functions or classes; then split each unit into an input and an expected output

#### Data Formatting

* tokenize per the requirements of the base model to be trained (e.g. [Mistral requires Byte-Pair Encoding (BPE)](https://docs.mistral.ai/guides/tokenization/))
* structure in a format suitable for training (e.g. a plain text file)

#### Split Dataset

* split the dataset into training, validation, and test sets

### Research work

Fine-tuning research work can be found [in this repo](https://gitlab.com/gitlab-org/ai-powered/custom-models/cm-research).

### Definition of Done

Customers are readily able to use their GitLab repo data to enable fine-tuning. Customers can use sensible defaults to:

* [ ] Identify files to be used as a basis for fine-tuning https://gitlab.com/gitlab-org/gitlab/-/issues/505077+
* [ ] Chunk using sensible defaults per use case https://gitlab.com/gitlab-org/gitlab/-/issues/501453+
* [ ] De-duplicate
* [ ] Format as required by the fine-tuning platform https://gitlab.com/gitlab-org/gitlab/-/issues/501454+
* [ ] Split the dataset into train, test, and validation sets
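To make the code-completion chunking step concrete, here is a minimal sketch that splits Python source into function-level units and divides each unit into a prompt (the signature line) and a completion (the body). This is illustrative only, not the actual GitLab implementation: the function name `function_chunks` is invented, it handles only top-level, undecorated functions, and real defaults would need to cover classes, nested definitions, and other languages.

```python
import ast

def function_chunks(source: str):
    """Split Python source into (prompt, completion) pairs, one per
    top-level function: the prompt is the signature, the completion
    is the body. Illustrative sketch; ignores decorators and classes."""
    tree = ast.parse(source)
    lines = source.splitlines()
    pairs = []
    for node in tree.body:
        if isinstance(node, ast.FunctionDef):
            # Signature spans from the `def` line up to the first body statement.
            sig_end = node.body[0].lineno - 1
            prompt = "\n".join(lines[node.lineno - 1:sig_end])
            completion = "\n".join(lines[sig_end:node.end_lineno])
            pairs.append((prompt, completion))
    return pairs

example = '''\
def add(a, b):
    return a + b

def sub(a, b):
    return a - b
'''
for prompt, completion in function_chunks(example):
    print(repr(prompt), "->", repr(completion))
```

Splitting at the signature boundary mirrors how a completion model is queried in practice: the editor context ends at (or near) a signature or comment, and the model is asked to produce the body.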
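The de-duplication and dataset-split steps can likewise be sketched with the standard library. This covers only exact-match de-duplication (keyed on a hash of whitespace-normalized text) and a simple shuffled split; production pipelines typically add near-duplicate detection (e.g. MinHash) and per-repository stratification. The function names here are illustrative.

```python
import hashlib
import random

def deduplicate(chunks):
    """Drop exact duplicates, keyed on a hash of whitespace-normalized
    text. Sketch only: near-duplicates are not detected."""
    seen, unique = set(), []
    for chunk in chunks:
        key = hashlib.sha256(" ".join(chunk.split()).encode()).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(chunk)
    return unique

def split_dataset(chunks, train=0.8, validate=0.1, seed=0):
    """Shuffle and split into train/validate/test sets; the test share
    is whatever remains after the train and validate fractions."""
    shuffled = list(chunks)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * validate)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])

# Chunks differing only in whitespace collapse to one example.
data = deduplicate(["a = 1", "a  =  1", "b = 2"])
train, validate, test = split_dataset(data, seed=42)
```

A fixed seed keeps the split reproducible across runs, which matters when fine-tuning is re-triggered on the same customer data.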
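For the formatting step, one common structure is JSON Lines, with one training example per line. The sketch below assumes a `prompt`/`completion` field schema, which is an illustrative choice; the actual field names and file format depend on the fine-tuning platform, and BPE tokenization itself is handled by the base model's tokenizer rather than reimplemented here.

```python
import json

def write_jsonl(pairs, path):
    """Write (prompt, completion) pairs as JSON Lines, one example per
    line. Field names are illustrative; match them to the platform."""
    with open(path, "w", encoding="utf-8") as f:
        for prompt, completion in pairs:
            record = {"prompt": prompt, "completion": completion}
            f.write(json.dumps(record) + "\n")
```

JSON Lines keeps each example independently parseable, so the train/validate/test files can be streamed and sharded without loading the whole dataset into memory.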