... | ... | @@ -27,8 +27,24 @@ PARSEME provides scripts to convert between CUPT, CoNLL-U and FoliA. They can be |
|
|
* `to_cupt.py`: this script converts its input file into CUPT. The input can be in FoLiA, CUPT, CoNLL-UP (of which CoNLL-U is an instance) or PARSEME-TSV (deprecated) format, and the script detects the input format automatically. The script can also align the input with a corresponding CoNLL-U file (e.g. a newer UD treebank version). If a CoNLL-U file is provided with the `--conllu` option, the information in the CoNLL-U file takes priority, except for MWE annotations that are present only in the `--input` file. The script can correct incompatible tokenization, on the assumption that the CoNLL-U file is correct. Other options control the presence of `NonVMWE` annotations, MWE annotations on multiword tokens (forbidden in CUPT) and the names of the columns in CoNLL-UP (if they are not given in the header, as is the case for standard CoNLL-U).
|
|
|
* `to_folia.py`: like `to_cupt.py`, this script converts any supported input into FoLiA. It is already integrated into FLAT, so running it manually should not be necessary.
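
As an illustration of the alignment use case for `to_cupt.py`, here is a minimal sketch that uses only the `--input` and `--conllu` options mentioned above. The `TO_CUPT` path and the `realign_cupt` helper are hypothetical names introduced for this example, and the sketch assumes that the script writes the resulting CUPT to standard output. The script may require further options, so treat its own usage message as authoritative.

```shell
#!/usr/bin/env bash
# Sketch: re-align an annotated file with a newer CoNLL-U release.
# TO_CUPT is a hypothetical location; point it at to_cupt.py in your clone.
TO_CUPT="${TO_CUPT:-utilities/to_cupt.py}"

realign_cupt() {
  # $1 = annotated input file (its MWE annotations are kept)
  # $2 = newer CoNLL-U file (its other information takes priority)
  # $3 = output file
  # Assumption: the converted CUPT is written to standard output.
  "$TO_CUPT" --input "$1" --conllu "$2" > "$3"
}

# Usage: realign_cupt old.cupt new-ud.conllu realigned.cupt
```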
|
|
|
|
|
|
|
|
|
## Morphosyntactic annotations: UDPipe
|
|
|
|
|
|
If your corpus does not have manual morphological and morphosyntactic annotations, you should generate them with an automatic UD-compatible parser such as UDPipe. We provide a script and some instructions below to make this process easier.
|
|
|
|
|
|
1. Download the UDPipe model for your language:
|
|
|
- Models are described on the [UDPipe pretrained models](https://ufal.mff.cuni.cz/udpipe/models) page. One model is available per corpus, so there may be more than one model for your language. You can compare the scores of these models on the UDPipe page.
|
|
|
- On the [Clarin page](https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2998), you can download models trained on version 2.4 of UD.
|
|
|
2. Download the PARSEME utilities repository:
|
|
|
- `git clone git@gitlab.com:parseme/utilities.git` (if not already done) or `git pull` (to get latest files)
|
|
|
3. Run UDPipe:
|
|
|
- Assuming that `MODELPATH` is the path to your language's model (e.g. `udpipe-ud-2.0-170801/romanian-ud-2.0-170801.udpipe`), run the following command:\
|
|
|
`utilities/lang-leaders/pre-annot/run_udpipe_raw.sh MODELPATH raw-001.txt raw-002.txt ...`
|
|
|
- The first time you run this script, it automatically downloads and compiles UDPipe, so it may take a few extra minutes to run.
|
|
|
- UDPipe performs sentence splitting. To keep the current split instead, edit `run_udpipe_raw.sh`, the script that runs UDPipe, so that it uses the option `--tokenizer=presegmented` (see lines 59-60).
|
|
|
- If you already have a recent version of UDPipe installed on your computer, you can prefix the script call with `UDPIPE_PATH=<path-to-udpipe-folder>` (the path to the folder, not to the executable).
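
Steps 2 and 3 can be sketched as a small shell helper. This is only an illustration: the `preannotate` and `run` functions and the `UDPIPE_DIR` and `DRY_RUN` variables are names made up for this example. The repository URL, the script path and the `UDPIPE_PATH` variable are the ones given in the steps above. By default the sketch only prints the commands it would run.

```shell
#!/usr/bin/env bash
# Sketch of steps 2 and 3. Assumed for this example (not part of the
# PARSEME scripts): the UDPIPE_DIR and DRY_RUN variables and both functions.
MODELPATH="${MODELPATH:-udpipe-ud-2.0-170801/romanian-ud-2.0-170801.udpipe}"
UDPIPE_DIR="${UDPIPE_DIR:-}"   # folder of an existing UDPipe install, or empty
DRY_RUN="${DRY_RUN:-1}"        # 1 = only print the commands, 0 = run them

run() {
  # Print the command; execute it only when DRY_RUN is 0.
  echo "+ $*"
  if [ "$DRY_RUN" = "0" ]; then "$@"; fi
}

preannotate() {
  # Step 2: clone the utilities repository, or update an existing clone.
  if [ -d utilities/.git ]; then
    run git -C utilities pull
  else
    run git clone git@gitlab.com:parseme/utilities.git
  fi
  # Step 3: run UDPipe on the raw files given as arguments,
  # reusing an installed UDPipe when UDPIPE_DIR is set.
  if [ -n "$UDPIPE_DIR" ]; then
    run env UDPIPE_PATH="$UDPIPE_DIR" \
      utilities/lang-leaders/pre-annot/run_udpipe_raw.sh "$MODELPATH" "$@"
  else
    run utilities/lang-leaders/pre-annot/run_udpipe_raw.sh "$MODELPATH" "$@"
  fi
}

# Usage: DRY_RUN=0 preannotate raw-001.txt raw-002.txt
```

Keeping the dry run as the default makes it easy to check the model path and file list before anything is downloaded or compiled.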
|
|
|
Since the corpus is large, you may want to parallelise this process by running several jobs, e.g. on a cluster. We can help if needed; just drop us a line. For more details on UDPipe's options, check [UDPipe's user manual](https://ufal.mff.cuni.cz/udpipe/users-manual).
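
On a single multi-core machine, one possible way to run several jobs is `xargs -P`. The sketch below is an assumption-laden illustration, not a PARSEME-provided script: the `run_parallel` function and the `RUN_UDPIPE` and `JOBS` variables are hypothetical names, and it assumes that the script can safely be called once per input file. Because the script downloads and compiles UDPipe on its first run, it is probably safer to complete one ordinary run before starting parallel jobs.

```shell
#!/usr/bin/env bash
# Sketch: run the pre-annotation script on several raw files in parallel.
# Assumed for this example: the RUN_UDPIPE and JOBS variables, the
# run_parallel helper, and one script invocation per input file.
RUN_UDPIPE="${RUN_UDPIPE:-utilities/lang-leaders/pre-annot/run_udpipe_raw.sh}"
MODELPATH="${MODELPATH:-udpipe-ud-2.0-170801/romanian-ud-2.0-170801.udpipe}"
JOBS="${JOBS:-4}"   # maximum number of simultaneous jobs

run_parallel() {
  # Pass the file names NUL-separated so that spaces in names are safe,
  # then start one invocation per file, at most $JOBS at a time.
  printf '%s\0' "$@" | xargs -0 -n 1 -P "$JOBS" "$RUN_UDPIPE" "$MODELPATH"
}

# Usage: run_parallel raw-*.txt
```

On a cluster, the same one-invocation-per-file idea would map onto the scheduler's job arrays instead of `xargs`.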
|
|
|
|
|
|
## Consistency checks scripts
|
|
|
|
|
|
PARSEME provides scripts to increase the consistency of annotations. Their use is described in the LL's guide to [enhancing existing corpora](Enhancing-existing-corpora). The scripts can be found in the [PARSEME utilities](https://gitlab.com/parseme/utilities/) repository.
|
... | ... | |