Multi-Language Model Tuning (Ensemble of Models)
Overview
To support multiple languages, we have decided to fine-tune the model on a variety of datasets. To start, we will look into a few datasets, including the PolyCoder dataset.
Ensemble of models
Our idea for the ensemble is to use a bucket of models. For our problem, this means routing each request to the best model for the programming language being requested. For now, we discussed having the following model list:
| Model | Target language(s) |
|---|---|
| CodeGen-MONO | Python |
| CodeGen-MULTI | C, C++, Go, Java, JavaScript, and others |
| CodeGen-{MULTI,NL} fine-tuned on a language of choice (@bcardoso- TODO: add languages we are experimenting on @HongtaoYang and @bcardoso-) | Language of choice |
| Google Model | Back-up traffic |
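As a rough illustration of the routing idea, here is a minimal sketch, assuming the ensemble is keyed by language; the model identifiers, the fallback entry, and the `route_model` helper are illustrative assumptions, not a final design:

```python
# Minimal sketch of language-based routing for the model ensemble.
# Model identifiers and the fallback entry are assumptions, not the final config.
LANGUAGE_TO_MODEL = {
    "python": "codegen-mono",
    "c": "codegen-multi",
    "cpp": "codegen-multi",
    "go": "codegen-multi",
    "java": "codegen-multi",
    "js": "codegen-multi",
}

FALLBACK_MODEL = "google-model"  # back-up traffic


def route_model(language: str) -> str:
    """Return the model that should serve a completion request for `language`."""
    return LANGUAGE_TO_MODEL.get(language.lower(), FALLBACK_MODEL)


# Example: route_model("python") -> "codegen-mono"
```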
We will use CodeGen-MULTI or CodeGen-NL as the base model and fine-tune it for some languages using the dataset described below.
Dataset
This section describes the dataset, including data preprocessing, used to fine-tune the model for our needs.
Obtaining raw data
Our raw dataset contains non-preprocessed source code from GitHub for a total of 13 languages. To build this dataset, we rely on several data sources (the BigQuery public dataset released by Google, as well as the GitHub Archive on BigQuery). We also extend the CodeParrot SQL script with additional heuristics that potentially increase the quality of the dataset. We include in the raw dataset only repos that have at least 50 stars, 50 watches, or 70 commits. The dataset contains only permissive licenses approved by Legal. All files are smaller than 1 MB, similar to the Codex, PolyCoder, and CodeParrot settings.
Supported Languages:
- c
- cpp
- csharp
- go
- java
- js
- php
- python
- ruby
- rust
- scala
- ts
- kotlin
Where do we store the raw dataset? We use our private BigQuery instance, table unreview-poc-390200e5.gl_code_suggestions.repo_contents_v2.
Where did we push the SQL file used to build the dataset? Model Dev repo - file.
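As a rough sketch of how the repo-level heuristics could be expressed with the BigQuery Python client (the metadata table name, column names, and the exact query below are illustrative assumptions; the authoritative version is the SQL file in the Model Dev repo):

```python
# Illustrative only: queries a hypothetical repo-metadata table and applies the
# star/watch/commit heuristics described above. Not the actual production SQL.
from google.cloud import bigquery

client = bigquery.Client(project="unreview-poc-390200e5")

QUERY = """
SELECT repo_name
FROM `unreview-poc-390200e5.gl_code_suggestions.repo_metadata`  -- hypothetical table
WHERE stars >= 50 OR watches >= 50 OR commits >= 70
"""

eligible_repos = {row.repo_name for row in client.query(QUERY).result()}
print(f"{len(eligible_repos)} repos pass the popularity heuristics")
```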
Data preprocessing
We implement a Dataflow pipeline to efficiently preprocess the raw dataset for fine-tuning the model. Here is the list of filters we provide now (a sketch of a few of them follows the list):
- exact deduplication based on MD5 hash (similar to Codex, Polycoder, CodeParrot)
- filter by maximum line length (1000) and maximum mean line length (100) in the file (similar to Codex, Polycoder, CodeParrot)
- filter out auto-generated files by looking for specific keywords (similar to Codex, CodeParrot)
- filter by maximum fraction (25%) of non-alphanumeric characters (similar to Codex, CodeParrot)
- filter by maximum fraction (10%) of hexadecimal numbers
- infer language from the file extension
- clean copyrights by looking for specific keywords
- remove decorations, e.g., `----`, `*******`, from comments
- redact email and IPv4/IPv6 addresses (similar to SantaCoder)
- redact secrets (similar to SantaCoder)
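A minimal sketch of a few of these filters in plain Python, with thresholds taken from the list above; the function names are assumptions, and the real implementation lives in the Dataflow pipeline:

```python
import hashlib

# Thresholds from the filter list above.
MAX_LINE_LENGTH = 1000
MAX_MEAN_LINE_LENGTH = 100
MAX_NON_ALNUM_FRACTION = 0.25

_seen_hashes: set[str] = set()


def is_duplicate(content: str) -> bool:
    """Exact deduplication based on the MD5 hash of the file content."""
    digest = hashlib.md5(content.encode("utf-8")).hexdigest()
    if digest in _seen_hashes:
        return True
    _seen_hashes.add(digest)
    return False


def passes_line_length_filters(content: str) -> bool:
    """Reject files with overly long lines or a high mean line length."""
    lines = content.splitlines() or [""]
    max_len = max(len(line) for line in lines)
    mean_len = sum(len(line) for line in lines) / len(lines)
    return max_len <= MAX_LINE_LENGTH and mean_len <= MAX_MEAN_LINE_LENGTH


def passes_alnum_filter(content: str) -> bool:
    """Reject files where too many characters are non-alphanumeric."""
    if not content:
        return False
    non_alnum = sum(1 for ch in content if not ch.isalnum() and not ch.isspace())
    return non_alnum / len(content) <= MAX_NON_ALNUM_FRACTION


def keep_file(content: str) -> bool:
    return (
        not is_duplicate(content)
        and passes_line_length_filters(content)
        and passes_alnum_filter(content)
    )
```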
The Dataflow pipeline divides the preprocessed dataset into train, test, and val splits at a ratio of 80%, 10%, and 10%.
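One way to make such a split deterministic is to bucket by a hash of the repo name; this is only a sketch under that assumption, and the actual pipeline may split differently (e.g., per file):

```python
import hashlib


def assign_split(repo_name: str) -> str:
    """Deterministically bucket a repo into train/val/test at roughly 80/10/10."""
    bucket = int(hashlib.sha256(repo_name.encode("utf-8")).hexdigest(), 16) % 100
    if bucket < 80:
        return "train"
    if bucket < 90:
        return "val"
    return "test"


# Example: assign_split("fastapi") always returns the same split for that repo.
```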
Where do we store the preprocessed dataset? We're going to use our private BigQuery instance as well as GCS. We need to implement all filters first before running the DF pipeline on the full raw dataset.
Where did we push the source code of the DF pipeline? Model Dev repo.
- Pipeline to preprocess the raw dataset - file.
- Pipeline to export the preprocessed dataset from BQ to GCS - file.
What is the schema of the preprocessed dataset?
{
  "repo_name": "fastapi",
  "content": "from fastapi import FastAPI\nfrom datetime import datetime\n\napp = FastAPI()\n\n@app.get(\"/\")\ndef hello_world():\n ...",
  "file_path": "cool_project/cool_function.py",
  "language": "python"
}
Other Technical details
In the literature, it has been shown that training on a single language yields better performance on that language than a model trained on multiple languages (e.g., CodeGen-MONO > CodeGen-MULTI on Python). So as a first step, we will use CodeGen-MULTI as the base model and fine-tune it on the other languages we aim to support.
We can finetune on a group of similar languages or on single languages. We need more experimentation to decide.
- Extract single-language datasets. We can use a DF job to extract data from the BigQuery dataset and follow the deduplication and filtering steps from CodeGen (jaxformer).
- Construct batches and train the model on the next-token-prediction task. We also need to decide which layers to fine-tune; chances are we don't need to fine-tune the whole model (see the sketch after this list).
- We need to observe the validation loss going down as a sanity check on whether the fine-tuning works. The tricky part is that the validation dataset must not contain any file seen during the initial codegen-multi training or during our fine-tuning. One way to achieve this is to use our own repos as validation datasets.
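A minimal sketch of partial fine-tuning with Hugging Face Transformers, assuming a CodeGen checkpoint and freezing everything except the last few transformer blocks; the checkpoint name, the number of unfrozen blocks, and the one-step training loop are illustrative assumptions:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Salesforce/codegen-2B-multi"  # assumed checkpoint; adjust as needed
UNFROZEN_BLOCKS = 4  # illustrative: fine-tune only the top transformer blocks

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Freeze everything, then unfreeze the last few blocks and the LM head.
for param in model.parameters():
    param.requires_grad = False
for block in model.transformer.h[-UNFROZEN_BLOCKS:]:
    for param in block.parameters():
        param.requires_grad = True
for param in model.lm_head.parameters():
    param.requires_grad = True

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-5
)

# Next-token prediction: labels are the inputs; the model shifts them internally.
batch = tokenizer(["def hello_world():\n"], return_tensors="pt")
outputs = model(**batch, labels=batch["input_ids"])
outputs.loss.backward()
optimizer.step()
```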
Fine Tuning
Finetune Directions
Based on the data we have, I think we can try the following directions:
- Different Input Data
  - Single (or subset) language finetuning. Language buckets:
    - c, c++
    - python, go
    - ??
  - NL prompt vs code prompt (out-of-scope for now)
- Different Objectives
  - (Default) next-token prediction
  - Masked token prediction: randomly mask some tokens, and ask the model to predict them based on the code before and after (see the masking sketch after this list).
  - Code refinement: randomly alter a code block to make it incorrect, then ask the model to identify and correct it.
- Different Layers. We can finetune different layers (bottom, top, some new heads, etc.) instead of the whole model.
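A rough sketch of how the masked-token objective could be framed; the mask rate, the stand-in mask token, and computing the loss only at corrupted positions are all assumptions for illustration, and how masking interacts with a decoder-only model still needs experimentation:

```python
import torch

MASK_RATE = 0.15  # assumption: fraction of tokens to corrupt


def make_masked_batch(input_ids: torch.Tensor, mask_token_id: int):
    """Corrupt a fraction of tokens and compute the loss only at those positions.

    Illustrative framing only; labels of -100 are ignored by the standard
    cross-entropy loss used by Hugging Face causal-LM heads.
    """
    corrupted = input_ids.clone()
    labels = torch.full_like(input_ids, -100)

    mask = torch.rand(input_ids.shape) < MASK_RATE
    labels[mask] = input_ids[mask]
    corrupted[mask] = mask_token_id
    return corrupted, labels
```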
Finetuning on Colab
(This is deprecated; we are using a GCE VM.)
We plan to fine-tune with DeepSpeed using Google Colab for the 2GB model first and then move to the development environment that will have the 7GB model. Script: https://colab.research.google.com/drive/1SBULe7o0xTUoXFRBlAgZxDzSPS4LL2Os#scrollTo=taxR3iZLuR6h
It's best if we can have a VM for experimentation.
Finetuning on GCE VM
Training script: https://gitlab.com/gitlab-org/modelops/applied-ml/code-suggestions/model-dev/-/tree/main/training
- DeepSpeed for multi-GPU training (work in progress).
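A minimal sketch of wiring DeepSpeed into the training script; the ZeRO stage, batch size, checkpoint, and other config values below are placeholder assumptions, not the settings used in the training repo:

```python
import deepspeed
from transformers import AutoModelForCausalLM

# Placeholder config; the actual values belong in the training repo.
DS_CONFIG = {
    "train_batch_size": 8,
    "gradient_accumulation_steps": 1,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-5}},
}

model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen-2B-multi")

# deepspeed.initialize wraps the model in an engine that handles multi-GPU
# data parallelism, ZeRO partitioning, and mixed precision.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=DS_CONFIG,
)

# Launched with: deepspeed train.py (one process per GPU)
```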
Evaluation
We will conduct extrinsic and intrinsic evaluation of finetuned models.
Extrinsic benchmarks
- HumanEval (colab notebook)
- APPS (TODO)
Intrinsic
- Calculating the model's perplexity
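A small sketch of the intrinsic perplexity measurement, assuming a held-out set of file contents and a causal LM; the checkpoint name and the simple per-file loss averaging are assumptions:

```python
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Salesforce/codegen-2B-multi"  # assumed checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()


@torch.no_grad()
def perplexity(files: list[str]) -> float:
    """Average per-file cross-entropy over a held-out set, exponentiated."""
    losses = []
    for content in files:
        batch = tokenizer(content, return_tensors="pt", truncation=True, max_length=2048)
        out = model(**batch, labels=batch["input_ids"])
        losses.append(out.loss.item())
    return math.exp(sum(losses) / len(losses))
```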
Meeting Notes:
https://docs.google.com/document/d/1zLoJi6Hc4YprS1lL53AChaeHhhADl5gtYhT8mclOra4/edit