Model Refinements
Model refinements involve two related changes: improving model quality by updating the next-token decision strategy, and introducing a postprocessing layer to mask PII (personally identifiable information).
Next-token decision strategy:
Autoregressive models use a variety of next-token decision strategies:
- Greedy search
- Beam search
- Top-K sampling
- Top-p sampling
- Contrastive search
More examples can be found in https://huggingface.co/blog/how-to-generate and https://huggingface.co/blog/introducing-csearch. According to recent research, contrastive search is the state-of-the-art (SOTA) decoding method. Because the Triton backend lacks a contrastive search implementation, we settled on top-p sampling with the FauxPilot-Codegen model, following the original paper. Pure Python Codegen models with contrastive search implemented show low throughput and high memory consumption.
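To make the difference between some of these strategies concrete, here is a minimal sketch of how greedy search, top-k, and top-p choose their candidate tokens. The toy distribution and the function names are assumptions made for illustration; this is not the Triton backend's implementation.

```python
def greedy(probs):
    """Pick the single most likely token."""
    return max(probs, key=probs.get)


def top_k_filter(probs, k):
    """Keep the k most likely tokens and renormalize."""
    kept = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)[:k]
    total = sum(p for _, p in kept)
    return {tok: p / total for tok, p in kept}


def top_p_filter(probs, p):
    """Keep the smallest set of most likely tokens whose cumulative
    probability reaches p (the "nucleus"), then renormalize."""
    kept, cumulative = [], 0.0
    for tok, prob in sorted(probs.items(), key=lambda kv: kv[1], reverse=True):
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= p:
            break
    total = sum(prob for _, prob in kept)
    return {tok: prob / total for tok, prob in kept}


# Hypothetical next-token distribution after the prompt "import ".
NEXT_TOKEN_PROBS = {"numpy": 0.5, "os": 0.3, "sys": 0.15, "antigravity": 0.05}

print(greedy(NEXT_TOKEN_PROBS))
print(sorted(top_k_filter(NEXT_TOKEN_PROBS, 2)))
print(sorted(top_p_filter(NEXT_TOKEN_PROBS, 0.9)))
```

Top-k always keeps a fixed number of candidates, while top-p adapts the candidate set to how confident the model is. Sampling then draws the next token from the renormalized candidates.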
Hyperparameters used by the FauxPilot-Codegen model:
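# Number of tokens to generate per request
REQUEST_OUTPUT_LEN = 16
# Model hyperparameters
MODEL_TEMPERATURE = 0.2
MODEL_REPETITION_PENALTY = 1  # 1 means no repetition penalty
MODEL_TOP_K = 0  # 0 disables top-k filtering, leaving top-p in control
MODEL_TOP_P = 0.95
MODEL_PAD_ID = 50256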
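As a rough illustration of what the temperature and repetition penalty values do to the logits before sampling, consider the sketch below. The logits are made up, the helper names are hypothetical, and the penalty follows the commonly used CTRL-style formulation, which is an assumption about the backend rather than a description of it.

```python
import math

MODEL_TEMPERATURE = 0.2
MODEL_REPETITION_PENALTY = 1


def apply_repetition_penalty(logits, seen_token_ids, penalty):
    """CTRL-style penalty (assumed formulation): make already generated
    tokens less likely. A penalty of 1 leaves the logits unchanged."""
    adjusted = list(logits)
    for token_id in set(seen_token_ids):
        value = adjusted[token_id]
        adjusted[token_id] = value / penalty if value > 0 else value * penalty
    return adjusted


def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities; a temperature below 1 sharpens them."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for four candidate tokens; token 0 was already generated.
logits = [2.0, 1.5, 0.5, -1.0]
penalized = apply_repetition_penalty(logits, [0], MODEL_REPETITION_PENALTY)

plain = softmax_with_temperature(penalized, 1.0)
sharp = softmax_with_temperature(penalized, MODEL_TEMPERATURE)
print([round(p, 3) for p in plain])
print([round(p, 3) for p in sharp])
```

With a temperature of 0.2 most of the probability mass moves to the top candidate, so top-p sampling with 0.95 typically draws from only a few tokens, which suits code completion where near-deterministic output is preferred.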
PII data anonymization:
We rely on the work done by the SantaCoder authors (https://github.com/bigcode-project/bigcode-dataset/tree/main/pii) and apply regular expressions to identify and mask sensitive data. One difference is that we apply anonymization during postprocessing. We attempt to anonymize the following PII:
- email addresses
- IPv4/IPv6 addresses
- various secrets like GitLab tokens
To detect secrets, we rely on the work of the BigCode project and on the detect-secrets Python library. We are able to detect and mask the following secrets:
- basic auth, e.g.,
git clone https://username:1eeccr334f@gitlab.com/username/repository.git
- artifactory credentials
- sendgrid tokens
- azure storage tokens
- discord tokens
- twilio tokens
- secret-sounding variable names; the use cases we support are listed at https://github.com/Yelp/detect-secrets/blob/master/tests/plugins/keyword_test.py#L126
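The regex-based masking described above could look roughly like the following sketch. The patterns and placeholder strings are simplified assumptions made for illustration; they are not the BigCode or detect-secrets patterns, and IPv6 is left out for brevity.

```python
import re

# Illustrative patterns only; production patterns are more thorough.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IPV4_RE = re.compile(
    r"\b(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|1?\d?\d)\b"
)
BASIC_AUTH_RE = re.compile(
    r"(?P<scheme>[a-zA-Z][a-zA-Z0-9+.-]*://)[^/\s:@]+:[^/\s:@]+@"
)


def mask_pii(text):
    """Replace credentials, emails, and IPv4 addresses with placeholders."""
    # Mask URL credentials first so "user:pass@host" is not mistaken
    # for an email address by the next pattern.
    text = BASIC_AUTH_RE.sub(r"\g<scheme><CREDENTIALS>@", text)
    text = EMAIL_RE.sub("<EMAIL>", text)
    text = IPV4_RE.sub("<IP_ADDRESS>", text)
    return text


completion = (
    "git clone https://username:1eeccr334f@gitlab.com/username/repository.git\n"
    "# contact jane.doe@example.com, server 192.168.0.1"
)
print(mask_pii(completion))
```

Because masking runs in postprocessing, a function like this would be applied to each generated completion before it is returned to the user.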
cc @mray2020
Outcomes:
- The detect-secrets library contains other useful regular expressions (e.g., for GitHub tokens). However, these often require an additional HTTP request to verify the found secret, which is not feasible in postprocessing. To support masking other tokens, we need to update the required detect-secrets detectors. We can do this in &3 (closed) after testing the existing work.
- We can further improve the masking of secret-sounding variable names using the same detect-secrets library. In this case, we need to parse file extensions to get correct masking, e.g., for Go or C++ source code. Please check the examples here: https://github.com/Yelp/detect-secrets/blob/master/tests/plugins/keyword_test.py
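To show why the file extension matters for masking secret-sounding variable names, here is a small sketch that picks a pattern based on the extension. The keyword list, patterns, placeholder, and function name are all hypothetical and much simpler than the detect-secrets keyword detector; the sketch does not call the detect-secrets API.

```python
import os
import re

# Hypothetical keyword list; detect-secrets ships a far richer set.
KEYWORDS = r"(?:password|passwd|secret|api_?key|token)"

# Quoted assignment such as: db_password = "hunter2"
QUOTED_ASSIGNMENT = re.compile(
    rf"(?P<prefix>\b\w*{KEYWORDS}\w*\s*=\s*)"
    r"(?P<quote>['\"])(?P<value>[^'\"]+)(?P=quote)",
    re.IGNORECASE,
)
# Go also allows short variable declarations such as: apiKey := "abc123"
GO_ASSIGNMENT = re.compile(
    rf"(?P<prefix>\b\w*{KEYWORDS}\w*\s*:?=\s*)"
    r"(?P<quote>[\"`])(?P<value>[^\"`]+)(?P=quote)",
    re.IGNORECASE,
)

PATTERNS_BY_EXTENSION = {".go": GO_ASSIGNMENT}


def mask_keyword_secrets(line, filename):
    """Mask string values assigned to secret-sounding variable names."""
    extension = os.path.splitext(filename)[1].lower()
    pattern = PATTERNS_BY_EXTENSION.get(extension, QUOTED_ASSIGNMENT)
    return pattern.sub(r"\g<prefix>\g<quote><SECRET>\g<quote>", line)


print(mask_keyword_secrets('db_password = "hunter2"', "settings.py"))
print(mask_keyword_secrets('apiKey := "abc123"', "main.go"))
print(mask_keyword_secrets('apiKey := "abc123"', "main.py"))
```

The last call leaves the line untouched because the default pattern does not know the Go `:=` syntax, which is the kind of miss that extension-aware masking is meant to avoid.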