Skip to content

Use BlendedDataset to select samples from different languages with different probabilities.

Hongtao Yang requested to merge weighted_datasets into language_identifier_prompt

Introduced BlendedDataset, which is a blend of multiple datasets. Each dataset within it contains only code of a single language. BlendedDataset will draw sample from each dataset with the given probability

The goal is to train on all 13 languages, but given higher probabilities to new languages.

Merge request reports