Split into train/dev/test
Currently, the splitting script divides the input dataset into two parts: train
and test
.
It should divide the input dataset to three parts: train
, dev
, and test
.
Additionally, we should make sure that the dev
set has similar properties to the test
set in the sense that it also contains a non-negligible number of unseen MWEs w.r.t. train.
NOTE: we stick to the definition that a MWE is unseen if the same bag of lemmas does not occur in train
.