... | @@ -25,7 +25,7 @@ The annotation of corpora, in most languages, uses the central PARSEME annotatio |
## File format conversion
PARSEME provides scripts to convert between CUPT, CoNLL-U and FoLiA (and also the deprecated PARSEME-TSV format). They can be found in the [PARSEME utilities](https://gitlab.com/parseme/utilities/) repository, in the folder `st-organizers`. The most important scripts are:
* `to_cupt.py`: this script converts an input file into CUPT. The input can be in FoLiA, CUPT, CoNLL-UP (of which CoNLL-U is an instance) or PARSEME-TSV (deprecated) format, and the script detects the input format automatically. If your source file is in FoLiA and was uploaded to FLAT in 2019 or before, you should supply a companion CoNLL-U file with the `--conllu` option to provide the morphosyntactic information. Newer FoLiA files have the corresponding CoNLL-U information embedded (as long as it was provided when the document was uploaded to FLAT). The script can also align the input with a corresponding CoNLL-U file (e.g. a newer UD treebank version). If a CoNLL-U file is provided with the `--conllu` option, the information in the CoNLL-U file takes precedence, except for the MWE annotations present only in the `--input` file. The script can correct minor tokenization incompatibilities, treating the CoNLL-U file as the reference. Other options control the presence of `NonVMWE` annotations, MWE annotations on multiword tokens (forbidden in CUPT) and the names of columns in CoNLL-UP (if they are not present in the header, as is the case for standard CoNLL-U).
* `to_folia.py`: like `to_cupt.py`, this script converts any of the supported input formats into FoLiA. It is already integrated into FLAT, so it should not be necessary to run it manually.
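
As a rough illustration of the `to_cupt.py` usage described above, the following Python sketch assembles a command line from the two options named in this section, `--input` and `--conllu`. Everything else is an assumption made for the example: the helper name `build_to_cupt_command`, the script path inside a local clone of the utilities repository, and the file names. The script may require further options that are not covered here, so check its usage message before relying on this.

```python
import shlex
import sys
from typing import List, Optional


def build_to_cupt_command(
    input_path: str,
    conllu_path: Optional[str] = None,
    # Assumption: path of the script inside a local clone of the repository.
    script: str = "st-organizers/to_cupt.py",
) -> List[str]:
    """Assemble an argument list for a to_cupt.py call (hypothetical helper)."""
    # Run the script with the current Python interpreter.
    command = [sys.executable, script, "--input", input_path]
    # The companion CoNLL-U file is optional; it is needed for older FoLiA
    # files or when aligning with a newer treebank version.
    if conllu_path is not None:
        command += ["--conllu", conllu_path]
    return command


if __name__ == "__main__":
    # Hypothetical file names, used only to show the resulting command line.
    cmd = build_to_cupt_command("annotated.folia.xml", "treebank.conllu")
    print(shlex.join(cmd))
```

The helper only builds and prints the command, so it can be inspected before anything is executed. To run the conversion, the returned list could be passed to `subprocess.run(cmd, check=True)`; where the converted CUPT output ends up is not stated in this section and should be verified against the script itself.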