Model Refinements

Model refinements involve two related changes: improving model quality by updating the next-token decision strategy, and introducing postprocessing layers to mask PII (personally identifiable information).

Next-token decision strategy:

Autoregressive models use a variety of next-token decision strategies:

Greedy search
Beam search
Top-K sampling
Top-p sampling
Contrastive search

More examples can be found in https://huggingface.co/blog/how-to-generate and https://huggingface.co/blog/introducing-csearch. According to recent research, contrastive search is the state of the art (SOTA). However, the Triton backend has no contrastive search implementation, and pure Python Codegen models with contrastive search show low throughput and high memory consumption. We have therefore settled on top-p sampling with the FauxPilot-Codegen model, following the original paper.
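To make the difference between greedy search and top-p sampling concrete, here is a minimal pure-Python sketch. It is not the Triton backend's implementation; the function names and the toy probability vector are hypothetical and only illustrate the idea of sampling from the smallest set of tokens whose cumulative probability reaches top_p.

```python
import random


def greedy(probs):
    """Greedy search: always pick the most probable token id."""
    return max(range(len(probs)), key=probs.__getitem__)


def top_p_candidates(probs, top_p):
    """Return the smallest set of token ids whose cumulative probability reaches top_p."""
    ranked = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)
    kept, cumulative = [], 0.0
    for token_id in ranked:
        kept.append(token_id)
        cumulative += probs[token_id]
        if cumulative >= top_p:
            break
    return kept


def sample_top_p(probs, top_p, rng=random):
    """Top-p sampling: draw one token id from the nucleus, weighted by probability."""
    kept = top_p_candidates(probs, top_p)
    weights = [probs[i] for i in kept]
    return rng.choices(kept, weights=weights, k=1)[0]


# Toy next-token distribution over a 4-token vocabulary (hypothetical values).
probs = [0.5, 0.3, 0.15, 0.05]
print(greedy(probs))                 # 0
print(top_p_candidates(probs, 0.9))  # [0, 1, 2]
print(sample_top_p(probs, 0.9))      # one of 0, 1, 2
```

Greedy search is deterministic, while top-p sampling drops only the low-probability tail, so the least likely token (id 3 here) is never drawn.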

Hyperparameters used by the FauxPilot-Codegen model:

# Number of tokens to generate
REQUEST_OUTPUT_LEN = 16

# Model hyperparameters
MODEL_TEMPERATURE = .2
MODEL_REPETITION_PENALTY = 1
MODEL_TOP_K = 0
MODEL_TOP_P = .95
MODEL_PAD_ID = 50256
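
As a rough illustration of what these values mean, the sketch below applies a temperature and a repetition penalty to a toy logit vector. It assumes the conventional definitions (logits divided by the temperature, and a CTRL-style penalty on already generated tokens); the backend may implement them differently, and the helper names and logits are hypothetical.

```python
import math

MODEL_TEMPERATURE = 0.2
MODEL_REPETITION_PENALTY = 1


def apply_repetition_penalty(logits, generated_ids, penalty):
    """Lower the logits of tokens that were already generated (CTRL-style, assumed)."""
    out = list(logits)
    for token_id in set(generated_ids):
        value = out[token_id]
        # Dividing a positive logit or multiplying a negative one both reduce the score.
        out[token_id] = value / penalty if value > 0 else value * penalty
    return out


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; temperatures below 1 sharpen the distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # toy values
penalized = apply_repetition_penalty(logits, generated_ids=[0], penalty=MODEL_REPETITION_PENALTY)
sharp = softmax_with_temperature(penalized, MODEL_TEMPERATURE)
flat = softmax_with_temperature(penalized, 1.0)
print(penalized == logits)  # True: a penalty of 1 changes nothing
print(sharp[0] > flat[0])   # True: a low temperature favours the top token
```

Under these assumptions, a repetition penalty of 1 is a no-op, and a temperature of 0.2 concentrates most of the probability mass on the highest-scoring tokens before the top-p cutoff is applied. A top-k of 0 is assumed here to mean that top-k filtering is disabled.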

PII data anonymization:

We rely on the work done by the SantaCoder authors (https://github.com/bigcode-project/bigcode-dataset/tree/main/pii) and apply regular expressions to identify and mask sensitive data. One difference is that we apply anonymization during postprocessing, that is, to the model output rather than to the training data. We attempt to anonymize the following PII:

email addresses
IPv4/IPv6 addresses
various secrets, such as GitLab tokens. To detect secrets, we rely on the work of the BigCode project and on the detect-secrets Python library. We can detect and mask the following secrets:
basic auth, e.g., git clone https://username:1eeccr334f@gitlab.com/username/repository.git
Artifactory credentials
SendGrid tokens
Azure storage tokens
Discord tokens
Twilio tokens
secret-sounding variable names. Supported use cases: https://github.com/Yelp/detect-secrets/blob/master/tests/plugins/keyword_test.py#L126
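
The regex-based masking described above can be sketched as follows. The patterns and placeholder strings are simplified, hypothetical stand-ins, not the BigCode or detect-secrets expressions, and IPv6 and token-specific detectors are left out for brevity.

```python
import re

# Illustrative patterns only; production expressions are considerably stricter.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
OCTET = r"(?:25[0-5]|2[0-4]\d|1?\d?\d)"
IPV4_RE = re.compile(r"\b(?:" + OCTET + r"\.){3}" + OCTET + r"\b")
BASIC_AUTH_RE = re.compile(
    r"(?P<scheme>https?://)(?P<user>[^\s:/@]+):(?P<password>[^\s/@]+)@"
)


def mask_pii(text):
    """Replace credentials, email addresses and IPv4 addresses with placeholders."""
    # Mask URL credentials first so the email pattern never sees "password@host".
    text = BASIC_AUTH_RE.sub(r"\g<scheme>\g<user>:<PASSWORD>@", text)
    text = EMAIL_RE.sub("<EMAIL>", text)
    text = IPV4_RE.sub("<IP_ADDRESS>", text)
    return text


completion = "git clone https://username:1eeccr334f@gitlab.com/username/repository.git"
print(mask_pii(completion))
# git clone https://username:<PASSWORD>@gitlab.com/username/repository.git
```

Because the function only needs the generated text, it can run as a postprocessing step on each completion before the completion is returned.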

Outcomes - cc @mray2020

The detect-secrets library contains other useful regular expressions (e.g., for GitHub tokens). However, these detectors often require an additional HTTP request to verify the found secret, which is not practical during postprocessing. To support masking other tokens, we need to update the required detect-secrets detectors. We can do this in &3 (closed) after testing the existing work.
We can further improve the masking of secret-sounding variable names using the same detect-secrets library. In this case, we need to parse file extensions to get correct masking, e.g., for Go or C++ source code. See the examples here: https://github.com/Yelp/detect-secrets/blob/master/tests/plugins/keyword_test.py
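
To show why the file extension matters, here is a hedged sketch of extension-aware masking for secret-sounding variable names. It does not use the detect-secrets API; the keyword list, the per-extension assignment syntax, and all names are hypothetical simplifications.

```python
import os
import re

# Hypothetical keyword list; detect-secrets ships its own, longer denylist.
KEYWORDS = r"(?:password|passwd|secret|api_?key|token)"

# Hypothetical mapping from file extension to the assignment operator to expect.
ASSIGNMENT_BY_EXT = {
    ".py": r"=",
    ".go": r":?=",  # Go allows both "=" and ":="
    ".cpp": r"=",
}
DEFAULT_ASSIGNMENT = r"[:=]"

QUOTED_VALUE = r"""(?P<quote>["'])(?P<value>[^"']+)(?P=quote)"""


def build_pattern(filename):
    """Build a keyword-assignment pattern suited to the file's language."""
    ext = os.path.splitext(filename)[1].lower()
    assign = ASSIGNMENT_BY_EXT.get(ext, DEFAULT_ASSIGNMENT)
    return re.compile(
        r"(?P<prefix>\b\w*" + KEYWORDS + r"\w*\s*" + assign + r"\s*)" + QUOTED_VALUE,
        re.IGNORECASE,
    )


def mask_secret_assignments(code, filename):
    """Replace quoted values assigned to secret-sounding names with a placeholder."""
    pattern = build_pattern(filename)
    return pattern.sub(r"\g<prefix>\g<quote><SECRET>\g<quote>", code)


print(mask_secret_assignments('password := "hunter2"', "main.go"))
# password := "<SECRET>"
print(mask_secret_assignments('api_key = "abc123"', "config.py"))
# api_key = "<SECRET>"
```

With the Python pattern, the Go short declaration `password := "hunter2"` would go unmasked, which is the kind of miss that extension-specific patterns are meant to avoid.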

Edited Jan 25, 2023 by Alexander Chueshev