Changes

Carlos Ramisch · bc665fd0
--- a/PARSEME-tools.md
+++ b/PARSEME-tools.md
@@ -30,19 +30,22 @@ PARSEME provides scripts to convert between CUPT, CoNLL-U and FoliA. They can be

 ## Morphosyntactic annotations: UDPipe

-If your corpus does not have manual morphological and morphosyntactic annotations, you can/should generate them using an automatic UD-compatible parser such as UDPipe. We provide a script and some instructions below to make this process easier.
+If your corpus does not have manual morphological and morphosyntactic annotations, you can/should generate them using an automatic UD-compatible parser such as UDPipe. We provide a script and some instructions below to make this process easier. The input files can be FoLiA, CUPT, CoNLL-U, parseme-tsv, or raw text (UTF-8, LF line endings, one sentence per line).

-1. Download the UDPipe model for your language:
+1. Download the UDPipe **model** for your language:
    - Models are described on the [UDPipe pretrained models](https://ufal.mff.cuni.cz/udpipe/models) page. One model is available per corpus, so you may have more than one model for your language. You can compare the scores of these models on the UDPipe page.
    - On [Clarin page](https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2998), you can download models trained on version 2.4 of UD.
 2. Download the PARSEME utilities repository:
    - `git clone git@gitlab.com:parseme/utilities.git` (if not already done) or `git pull` (to get latest files)
 3. Run UDPipe:
-    - Assuming that `MODELPATH` is the path to your language's model (e.g. `udpipe-ud-2.0-170801/romanian-ud-2.0-170801.udpipe`), run the following command:\
-  `utilities/lang-leaders/pre-annot/run_udpipe_raw.sh MODELPATH raw-001.txt raw-002.txt ...`
+    - Assuming your corpus is in files `input-001.txt input-002.txt ...`, and
+    - assuming that `MODELPATH` is the path to your language's model (e.g. `udpipe-ud-2.0-170801/romanian-ud-2.0-170801.udpipe`), 
+    - run the following command:\
+  `utilities/lang-leaders/pre-annot/run_udpipe.sh MODELPATH input-001.txt input-002.txt ...`
    - If this is the first time you run this script, it will automatically download and compile UDPipe (so it may take some extra minutes to run).
-    - UDPipe will perform sentence splitting, but if you prefer to keep the current split you must edit the script which runs UDPipe `run_udpipe_raw.sh` and use the option `--tokenizer=presegmented` (see lines 59-60).
+    - If the input files are in raw text format (UTF-8, LF line endings, one sentence per line), use option `-r` of the script to parse them. UDPipe splits sentences by default; to keep the current split, use option `-s` of the script.
    - If you already have a recent version of UDPipe installed on your computer, you can precede the script call by `UDPIPE_PATH=<path-to-udpipe-folder>` (path to the folder, not to the executable). 
+
 Since the corpus is large, you may want to parallelise this process by running several jobs, e.g. on a cluster. We can help if needed, just drop us a line. For more details on UDPipe's options, check [UDPipe's user manual](https://ufal.mff.cuni.cz/udpipe/users-manual).

 ## Consistency checks scripts