Multi-Language Model Tuning (Ensemble of Models)
Overview
To support multiple languages, we have decided to fine-tune the model on a variety of datasets. To start, we will look into a few datasets, including the PolyCoder dataset.
Ensemble of models
Our idea for the ensemble is to use a bucket of models. For our problem, this means routing each request to the best model for the programming language being requested. For now, we discussed having the following model list:
| Model | Target language(s) |
|---|---|
| CodeGen-MONO | Python |
| CodeGen-MULTI | C, C++, Go, Java, JavaScript, and others |
| CodeGen-{MULTI,NL} fine-tuned on a language of choice (@bcardoso- TODO: add languages we are experimenting on @HongtaoYang and @bcardoso-) | Language of choice |
| Google Model | Back-up traffic |
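As a rough illustration of the routing idea, here is a minimal sketch, assuming the ensemble is keyed by language; the model identifiers, the fallback entry, and the `route_model` helper are illustrative assumptions, not a final design:

```python
# Minimal sketch of language-based routing for the model ensemble.
# Model identifiers and the fallback entry are assumptions, not the final config.
LANGUAGE_TO_MODEL = {
    "python": "codegen-mono",
    "c": "codegen-multi",
    "cpp": "codegen-multi",
    "go": "codegen-multi",
    "java": "codegen-multi",
    "js": "codegen-multi",
}

FALLBACK_MODEL = "google-model"  # back-up traffic


def route_model(language: str) -> str:
    """Return the model that should serve a completion request for `language`."""
    return LANGUAGE_TO_MODEL.get(language.lower(), FALLBACK_MODEL)


# Example: route_model("python") -> "codegen-mono"
```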
We will use CodeGen-MULTI or CodeGen-NL as the base model and fine-tune it for some languages using the dataset described below.
Dataset
This section describes the dataset, including data preprocessing, used to fine-tune the model for our needs.
Obtaining raw data
Our raw dataset contains non-preprocessed source code from GitHub for a total of 13 languages. To build this dataset, we rely on several data sources (the BigQuery public dataset released by Google, as well as the GitHub Archive on BigQuery). We also extend the CodeParrot SQL script with additional heuristics that potentially increase the quality of the dataset. We include in the raw dataset only repos that have at least 50 stars, 50 watches, or 70 commits. The dataset contains only permissive licenses approved by Legal. All files are smaller than 1 MB, similar to the Codex, PolyCoder, and CodeParrot settings.
Supported Languages:
- c
- cpp
- csharp
- go
- java
- js
- php
- python
- ruby
- rust
- scala
- ts
- kotlin
Where do we store the raw dataset? We use our private BigQuery instance, table unreview-poc-390200e5.gl_code_suggestions.repo_contents_v2.
Where did we push the SQL file used to build the dataset? Model Dev repo - file.
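As a rough sketch of how the repo-level heuristics could be expressed with the BigQuery Python client (the metadata table name, column names, and the exact query below are illustrative assumptions; the authoritative version is the SQL file in the Model Dev repo):

```python
# Illustrative only: queries a hypothetical repo-metadata table and applies the
# star/watch/commit heuristics described above. Not the actual production SQL.
from google.cloud import bigquery

client = bigquery.Client(project="unreview-poc-390200e5")

QUERY = """
SELECT repo_name
FROM `unreview-poc-390200e5.gl_code_suggestions.repo_metadata`  -- hypothetical table
WHERE stars >= 50 OR watches >= 50 OR commits >= 70
"""

eligible_repos = {row.repo_name for row in client.query(QUERY).result()}
print(f"{len(eligible_repos)} repos pass the popularity heuristics")
```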
Data preprocessing
We implement a Dataflow pipeline to efficiently preprocess the raw dataset for fine-tuning the model. Here is the list of filters we provide now (a sketch of a few of them follows the list):
- exact deduplication based on MD5 hash (similar to Codex, Polycoder, CodeParrot)
- filter by maximum line length (1000) and maximum mean line length (100) in the file (similar to Codex, Polycoder, CodeParrot)
- filter out auto-generated files by looking for specific keywords (similar to Codex, CodeParrot)
- filter by maximum fraction (25%) of non-alphanumeric characters (similar to Codex, CodeParrot)
- filter by maximum fraction (10%) of hexadecimal numbers
- infer language from the file extension
- clean copyrights by looking for specific keywords
- remove decorations, e.g., `----`, `*******`, from comments
- redact email and IPv4/IPv6 addresses (similar to SantaCoder)
- redact secrets (similar to SantaCoder)
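A minimal sketch of a few of these filters in plain Python, with thresholds taken from the list above; the function names are assumptions, and the real implementation lives in the Dataflow pipeline:

```python
import hashlib

# Thresholds from the filter list above.
MAX_LINE_LENGTH = 1000
MAX_MEAN_LINE_LENGTH = 100
MAX_NON_ALNUM_FRACTION = 0.25

_seen_hashes: set[str] = set()


def is_duplicate(content: str) -> bool:
    """Exact deduplication based on the MD5 hash of the file content."""
    digest = hashlib.md5(content.encode("utf-8")).hexdigest()
    if digest in _seen_hashes:
        return True
    _seen_hashes.add(digest)
    return False


def passes_line_length_filters(content: str) -> bool:
    """Reject files with overly long lines or a high mean line length."""
    lines = content.splitlines() or [""]
    max_len = max(len(line) for line in lines)
    mean_len = sum(len(line) for line in lines) / len(lines)
    return max_len <= MAX_LINE_LENGTH and mean_len <= MAX_MEAN_LINE_LENGTH


def passes_alnum_filter(content: str) -> bool:
    """Reject files where too many characters are non-alphanumeric."""
    if not content:
        return False
    non_alnum = sum(1 for ch in content if not ch.isalnum() and not ch.isspace())
    return non_alnum / len(content) <= MAX_NON_ALNUM_FRACTION


def keep_file(content: str) -> bool:
    return (
        not is_duplicate(content)
        and passes_line_length_filters(content)
        and passes_alnum_filter(content)
    )
```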
The Dataflow pipeline divides the preprocessed dataset into train, test, and val splits at a ratio of 80%, 10%, and 10%.
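One way to make such a split deterministic is to bucket by a hash of the repo name; this is only a sketch under that assumption, and the actual pipeline may split differently (e.g., per file):

```python
import hashlib


def assign_split(repo_name: str) -> str:
    """Deterministically bucket a repo into train/val/test at roughly 80/10/10."""
    bucket = int(hashlib.sha256(repo_name.encode("utf-8")).hexdigest(), 16) % 100
    if bucket < 80:
        return "train"
    if bucket < 90:
        return "val"
    return "test"


# Example: assign_split("fastapi") always returns the same split for that repo.
```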
Where do we store the preprocessed dataset? We're going to use our private BigQuery instance as well as GCS. We need to implement all filters first before running the DF pipeline on the full raw dataset.
Where did we push the source code of the DF pipeline? Model Dev repo.
- Pipeline to preprocess the raw dataset - file.
- Pipeline to export the preprocessed dataset from BQ to GCS - file.
What is the schema of the preprocessed dataset?
{
  "repo_name": "fastapi",
  "content": "from fastapi import FastAPI\nfrom datetime import datetime\n\napp = FastAPI()\n\n@app.get(\"/\")\ndef hello_world():\n ...",
  "file_path": "cool_project/cool_function.py",
  "language": "python"
}
Other Technical details
In the literature, it has been shown that training on a single language yields better performance on that language than a model trained on multiple languages (e.g., CodeGen-MONO > CodeGen-MULTI on Python). So as a first step, we will use CodeGen-MULTI as the base model and fine-tune it on the other languages we aim to support.
We can finetune on a group of similar languages or on single languages. We need more experimentation to decide.
- Extract single-language datasets. We can use a DF job to extract data from the BigQuery dataset and follow the deduplication and filtering steps from CodeGen (jaxformer).
- Construct batches and train the model on the next-token-prediction task. We also need to decide which layers to fine-tune; chances are we don't need to fine-tune the whole model (see the sketch after this list).
- We need to observe the validation loss going down as a sanity check on whether the fine-tuning works. The tricky part is that the validation dataset must not contain any file seen during the initial codegen-multi training or during our fine-tuning. One way to achieve this is to use our own repos as validation datasets.
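A minimal sketch of partial fine-tuning with Hugging Face Transformers, assuming a CodeGen checkpoint and freezing everything except the last few transformer blocks; the checkpoint name, the number of unfrozen blocks, and the one-step training loop are illustrative assumptions:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Salesforce/codegen-2B-multi"  # assumed checkpoint; adjust as needed
UNFROZEN_BLOCKS = 4  # illustrative: fine-tune only the top transformer blocks

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

# Freeze everything, then unfreeze the last few blocks and the LM head.
for param in model.parameters():
    param.requires_grad = False
for block in model.transformer.h[-UNFROZEN_BLOCKS:]:
    for param in block.parameters():
        param.requires_grad = True
for param in model.lm_head.parameters():
    param.requires_grad = True

optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-5
)

# Next-token prediction: labels are the inputs; the model shifts them internally.
batch = tokenizer(["def hello_world():\n"], return_tensors="pt")
outputs = model(**batch, labels=batch["input_ids"])
outputs.loss.backward()
optimizer.step()
```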
Fine Tuning
Finetune Directions
Based on the data we have, I think we can try the following directions:
- Different Input Data
  - Single (or subset) language finetuning. Language buckets:
    - c, c++
    - python, go
    - ??
  - NL prompt vs code prompt (out-of-scope for now)
- Different Objectives
  - (Default) next-token prediction
  - Masked token prediction: randomly mask some tokens, and ask the model to predict them based on the code before and after (see the masking sketch after this list).
  - Code refinement: randomly alter a code block to make it incorrect, then ask the model to identify and correct it.
- Different Layers. We can finetune different layers (bottom, top, some new heads, etc.) instead of the whole model.
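A rough sketch of how the masked-token objective could be framed; the mask rate, the stand-in mask token, and computing the loss only at corrupted positions are all assumptions for illustration, and how masking interacts with a decoder-only model still needs experimentation:

```python
import torch

MASK_RATE = 0.15  # assumption: fraction of tokens to corrupt


def make_masked_batch(input_ids: torch.Tensor, mask_token_id: int):
    """Corrupt a fraction of tokens and compute the loss only at those positions.

    Illustrative framing only; labels of -100 are ignored by the standard
    cross-entropy loss used by Hugging Face causal-LM heads.
    """
    corrupted = input_ids.clone()
    labels = torch.full_like(input_ids, -100)

    mask = torch.rand(input_ids.shape) < MASK_RATE
    labels[mask] = input_ids[mask]
    corrupted[mask] = mask_token_id
    return corrupted, labels
```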
Finetuning on Colab
(This is deprecated; we are using a GCE VM.)
We plan to fine-tune with DeepSpeed using Google Colab for the 2GB model first and then move to the development environment that will have the 7GB model. Script: https://colab.research.google.com/drive/1SBULe7o0xTUoXFRBlAgZxDzSPS4LL2Os#scrollTo=taxR3iZLuR6h
It's best if we can have a VM for experimentation.
Finetuning on GCE VM
Training script: https://gitlab.com/gitlab-org/modelops/applied-ml/code-suggestions/model-dev/-/tree/main/training
- DeepSpeed for multi-GPU training (work in progress).
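A minimal sketch of wiring DeepSpeed into the training script; the ZeRO stage, batch size, checkpoint, and other config values below are placeholder assumptions, not the settings used in the training repo:

```python
import deepspeed
from transformers import AutoModelForCausalLM

# Placeholder config; the actual values belong in the training repo.
DS_CONFIG = {
    "train_batch_size": 8,
    "gradient_accumulation_steps": 1,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-5}},
}

model = AutoModelForCausalLM.from_pretrained("Salesforce/codegen-2B-multi")

# deepspeed.initialize wraps the model in an engine that handles multi-GPU
# data parallelism, ZeRO partitioning, and mixed precision.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=DS_CONFIG,
)

# Launched with: deepspeed train.py (one process per GPU)
```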
Evaluation
We will conduct extrinsic and intrinsic evaluation of finetuned models.
Extrinsic benchmarks
- HumanEval (colab notebook)
- APPS (TODO)
Intrinsic
- Calculating the model's perplexity
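A small sketch of the intrinsic perplexity measurement, assuming a held-out set of file contents and a causal LM; the checkpoint name and the simple per-file loss averaging are assumptions:

```python
import math

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "Salesforce/codegen-2B-multi"  # assumed checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()


@torch.no_grad()
def perplexity(files: list[str]) -> float:
    """Average per-file cross-entropy over a held-out set, exponentiated."""
    losses = []
    for content in files:
        batch = tokenizer(content, return_tensors="pt", truncation=True, max_length=2048)
        out = model(**batch, labels=batch["input_ids"])
        losses.append(out.loss.item())
    return math.exp(sum(losses) / len(losses))
```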
Meeting Notes:
https://docs.google.com/document/d/1zLoJi6Hc4YprS1lL53AChaeHhhADl5gtYhT8mclOra4/edit