... | ... | @@ -30,19 +30,22 @@ PARSEME provides scripts to convert between CUPT, CoNLL-U and FoliA. They can be |
|
|
|
|
|
## Morphosyntactic annotations: UDPipe
|
|
|
|
|
|
If your corpus does not have manual morphological and morphosyntactic annotations, you should generate them using an automatic UD-compatible parser such as UDPipe. We provide a script and some instructions below to make this process easier.
|
|
|
If your corpus does not have manual morphological and morphosyntactic annotations, you should generate them using an automatic UD-compatible parser such as UDPipe. We provide a script and some instructions below to make this process easier. The input files can be FoLiA, CUPT, CoNLL-U, parseme-tsv, or raw text (UTF-8, LF line endings, one sentence per line).
|
|
|
|
|
|
1. Download the UDPipe model for your language:
|
|
|
1. Download the UDPipe **model** for your language:
|
|
|
- Models are described on the [UDPipe pretrained models](https://ufal.mff.cuni.cz/udpipe/models) page. One model is available per training corpus, so there may be more than one model for your language. You can compare the scores of these models on the UDPipe page.
|
|
|
- On the [CLARIN page](https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2998), you can download models trained on version 2.4 of UD.
|
|
|
2. Download the PARSEME utilities repository:
|
|
|
- `git clone git@gitlab.com:parseme/utilities.git` (if not already done) or `git pull` (to get the latest files)
|
|
|
3. Run UDPipe:
|
|
|
- Assuming that `MODELPATH` is the path to your language's model (e.g. `udpipe-ud-2.0-170801/romanian-ud-2.0-170801.udpipe`), run the following command:\
|
|
|
`utilities/lang-leaders/pre-annot/run_udpipe_raw.sh MODELPATH raw-001.txt raw-002.txt ...`
|
|
|
- Assuming your corpus is in files `input-001.txt input-002.txt ...`, and
|
|
|
- assuming that `MODELPATH` is the path to your language's model (e.g. `udpipe-ud-2.0-170801/romanian-ud-2.0-170801.udpipe`),
|
|
|
- run the following command:\
|
|
|
`utilities/lang-leaders/pre-annot/run_udpipe.sh MODELPATH input-001.txt input-002.txt ...`
|
|
|
- If this is the first time you run this script, it will automatically download and compile UDPipe (so it may take some extra minutes to run).
|
|
|
- UDPipe will perform sentence splitting, but if you prefer to keep the current split you must edit the script which runs UDPipe `run_udpipe_raw.sh` and use the option `--tokenizer=presegmented` (see lines 59-60).
|
|
|
- If the input files are raw text (UTF-8, LF line endings, one sentence per line), use option `-r` of the script to parse them. UDPipe splits sentences itself; if you prefer to keep the current split, you must use option `-s` of the script.
|
|
|
- If you already have a recent version of UDPipe installed on your computer, you can prefix the script call with `UDPIPE_PATH=<path-to-udpipe-folder>` (the path to the folder, not to the executable).
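The raw-text requirements above (UTF-8, LF line endings) can be checked before parsing. The following is a minimal sketch, not part of the PARSEME utilities: `check_raw_input` is a hypothetical helper name, and it assumes `iconv` and `grep` are available on your system.

```shell
# Hypothetical helper (not in the PARSEME repository): check_raw_input FILE...
# Returns 0 if every file is valid UTF-8 and contains no CR characters.
check_raw_input() {
    cr=$(printf '\r')          # a literal carriage return, portable across sh/bash
    status=0
    for f in "$@"; do
        # iconv fails on byte sequences that are not valid UTF-8
        if ! iconv -f UTF-8 -t UTF-8 "$f" >/dev/null 2>&1; then
            echo "$f: not valid UTF-8" >&2
            status=1
        fi
        # any CR means CRLF (Windows) or CR (old Mac) line endings
        if grep -q "$cr" "$f"; then
            echo "$f: contains CR characters, expected LF line endings" >&2
            status=1
        fi
    done
    return "$status"
}
```

For example, `check_raw_input input-001.txt input-002.txt && echo "inputs look fine"` would print the message only if both files pass.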
|
|
|
|
|
|
Since the corpus is large, you may want to parallelise this process by running several jobs, e.g. on a cluster. We can help if needed; just drop us a line. For more details on UDPipe's options, check [UDPipe's user manual](https://ufal.mff.cuni.cz/udpipe/users-manual).
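On a single multi-core machine, one possible way to run several jobs at once is `xargs -P`. This is only a sketch under assumptions: `annotate_parallel`, `JOBS` and `RUN_UDPIPE` are hypothetical names introduced here, file names must not contain whitespace, and it assumes that separate invocations of the script do not write to the same output. Since the first run downloads and compiles UDPipe, it is probably safer to run the script once on a single file before starting parallel jobs.

```shell
# Hypothetical helper (not in the PARSEME repository): annotate_parallel FILE...
# Runs the pre-annotation script once per input file, up to $JOBS at a time.
# MODELPATH must point to your UDPipe model, as in the instructions above.
annotate_parallel() {
    if [ "$#" -eq 0 ] || [ -z "$MODELPATH" ]; then
        echo "usage: MODELPATH=<model> annotate_parallel FILE..." >&2
        return 2
    fi
    # One file name per line; xargs passes one file (-n 1) to each job
    # and keeps at most $JOBS jobs (default: 4) running in parallel (-P).
    printf '%s\n' "$@" |
        xargs -P "${JOBS:-4}" -n 1 \
            "${RUN_UDPIPE:-utilities/lang-leaders/pre-annot/run_udpipe.sh}" "$MODELPATH"
}
```

A call such as `MODELPATH=<path-to-model> JOBS=8 annotate_parallel input-*.txt` would then process up to eight files at a time.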
|
|
|
|
|
|
## Consistency checks scripts
|
... | ... | |