Changes

Carlos Ramisch · 8b42ba22
--- a/PARSEME-tools.md
+++ b/PARSEME-tools.md
@@ -27,8 +27,24 @@ PARSEME provides scripts to convert between CUPT, CoNLL-U and FoliA. They can be
  * `to_cupt.py`: this script converts a file given as input into CUPT. The input can be in FoLiA, CUPT, CoNLL-UP (of which CoNLL-U is an instance) or PARSEME-TSV (deprecated) formats. The script automatically detects the input format. It is also possible to use this script to align the input with a corresponding CoNLL-U file (e.g. a newer UD treebank version). If a CoNLL-U file is provided with the `--conllu` option, the information in the CoNLL-U will be prioritary, except for the MWE annotations present only in the `--input` file. The script is capable of correcting incompatible tokenization, considering that the CoNLL-U file is correct. Other options control the presence of `NonVMWE` annotations, MWE annotations on multiword tokens (forbidden in CUPT) and the names of columns in CoNLL-UP (if not present in the header, as it is the case for standard CoNLL-U).
  * `to_folia.py`: similarly to above, this script converts anything into FoLiA. This script is already integrated into FLAT so it should not be necessary to run it manually.

+
 ## Morphosyntactic annotations: UDPipe

+If your corpus does not have manual morphological and morphosyntactic annotations, you should generate them using an automatic UD-compatible parser such as UDPipe. We provide a script and some instructions below to make this process easier.
+
+1. Download the UDPipe model for your language:
+    - Models are described on the [UDPipe pretrained models](https://ufal.mff.cuni.cz/udpipe/models) page. One model is available per corpus, so there may be more than one model for your language. You can compare the scores of these models on the UDPipe page.
+    - On the [CLARIN page](https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2998), you can download models trained on version 2.4 of UD.
+2. Download the PARSEME utilities repository:
+    - `git clone git@gitlab.com:parseme/utilities.git` (if not already done) or `git pull` (to get latest files)
+3. Run UDPipe:
+    - Assuming that `MODELPATH` is the path to your language's model (e.g. `udpipe-ud-2.0-170801/romanian-ud-2.0-170801.udpipe`), run the following command:\
+  `utilities/lang-leaders/pre-annot/run_udpipe_raw.sh MODELPATH raw-001.txt raw-002.txt ...`
+    - The first time you run this script, it automatically downloads and compiles UDPipe, so it may take a few extra minutes.
+    - UDPipe performs sentence splitting by default. If you prefer to keep the current split, edit the script that runs UDPipe (`run_udpipe_raw.sh`) and use the option `--tokenizer=presegmented` (see lines 59-60).
+    - If you already have a recent version of UDPipe installed on your computer, you can prefix the script call with `UDPIPE_PATH=<path-to-udpipe-folder>` (path to the folder, not to the executable).
+Since the corpus is large, you may want to parallelise this process by running several jobs, e.g. on a cluster. We can help if needed; just drop us a line. For more details on UDPipe's options, check [UDPipe's user manual](https://ufal.mff.cuni.cz/udpipe/users-manual).
+
 ## Consistency checks scripts

 PARSEME provides scripts to increase the consistency of annotations. Their use is described on the LL's guide to [enhance existing corpora](Enhancing-existing-corpora). They can be found in the [PARSEME utilities](https://gitlab.com/parseme/utilities/) repository.
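
The parallelisation suggested in the UDPipe section added by this change could look roughly like the sketch below, for a single machine rather than a cluster. It rests on assumptions that the wiki does not confirm: that separate invocations of `run_udpipe_raw.sh` are independent of each other, and that your `xargs` supports `-0` and `-P` (GNU and BSD versions do). The `RUNNER` and `JOBS` variables and the `annotate_parallel` function are hypothetical names introduced here for illustration.

```shell
#!/bin/sh
# Sketch only: run the pre-annotation script on several raw files in parallel.
# ASSUMPTION: each call to run_udpipe_raw.sh is independent of the others.
# RUNNER, JOBS and annotate_parallel are illustrative names, not part of
# the PARSEME utilities.

RUNNER="${RUNNER:-utilities/lang-leaders/pre-annot/run_udpipe_raw.sh}"
JOBS="${JOBS:-4}"   # maximum number of simultaneous processes

# Usage: annotate_parallel MODELPATH raw-001.txt raw-002.txt ...
annotate_parallel() {
  model="$1"
  shift
  # Pass the file names NUL-separated so names with spaces stay intact,
  # then start one RUNNER call per file, at most $JOBS at a time.
  printf '%s\0' "$@" | xargs -0 -n 1 -P "$JOBS" "$RUNNER" "$model"
}

# When the script is called with a model and at least one file, run it.
if [ "$#" -ge 2 ]; then
  annotate_parallel "$@"
fi
```

Since the first run downloads and compiles UDPipe, it is probably safer to run the script once on a single file before starting parallel jobs, so that several processes do not attempt the installation at the same time.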