... | ... | @@ -37,15 +37,18 @@ PARSEME provides scripts to convert between CUPT, CoNLL-U and FoLiA (and also th |
|
|
## File format validation
|
|
|
TODO
|
|
|
|
|
|
## Morphosyntactic annotations: UDPipe
|
|
|
## Morphosyntactic annotations
|
|
|
|
|
|
If your corpus does not have manual morphological and morphosyntactic annotations, you should generate them using an automatic UD-compatible parser such as UDPipe. We provide instructions below to make this process easier. The input files can be FoLiA, CUPT, CoNLL-U, parseme-tsv, or raw text (UTF-8, LF line endings, one sentence per line).
|
|
|
### Generating morphosyntactic annotations before annotating MWEs
|
|
|
|
|
|
### Running UDPipe online
|
|
|
If your corpus does not have manual morphological and morphosyntactic annotations, you should generate them using an automatic UD-compatible parser such as UDPipe. The best option is to do so before you start annotating MWEs, since knowledge of the morphosyntax can help you identify important words in a sentence (e.g. verbs).
|
|
|
We provide instructions below to make this process easier. The input files can be FoLiA, CUPT, CoNLL-U, parseme-tsv, or raw text (UTF-8, LF line endings, one sentence per line).
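The raw-text requirements above (UTF-8, LF line endings, one sentence per line) can be checked before parsing. The following is a minimal sketch, not part of the PARSEME scripts: the function name `check_raw_text` is hypothetical, and treating an empty line as a problem is an assumption derived from the "one sentence per line" rule.

```python
def check_raw_text(data):
    """Return a list of problems found in the bytes of a raw-text corpus file."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError as err:
        # Nothing else can be checked reliably if decoding fails.
        return ["not valid UTF-8: {}".format(err)]

    problems = []
    if "\r" in text:
        problems.append("contains CR characters (expected LF line endings)")

    lines = text.split("\n")
    if lines and lines[-1] == "":
        lines.pop()  # a single trailing newline is fine
    for number, line in enumerate(lines, start=1):
        # Assumption: an empty line is not a sentence, so it is reported.
        if not line.strip():
            problems.append("line {} is empty".format(number))
    return problems
```

For a file on disk, the bytes can be read with `pathlib.Path("corpus.txt").read_bytes()` (the file name is only an example) and passed to the function; an empty list means no problem was detected.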
|
|
|
|
|
|
#### Running UDPipe online
|
|
|
|
|
|
If you have a raw corpus in the .txt format, and if it is not too big, you can directly [parse it online](https://lindat.mff.cuni.cz/services/udpipe/), provided that UDPipe has a model for your language (around 200 languages are served in version 2). This will produce files in the .conllu format, which you can then download and upload to FLAT for manual MWE annotation.
|
|
|
|
|
|
### Running UDPipe locally
|
|
|
#### Running UDPipe locally
|
|
|
|
|
|
1. Download the UDPipe **model** for your language:
|
|
|
- Pretrained models are described on the [UDPipe models page](https://ufal.mff.cuni.cz/udpipe/models). One model per corpus is available, so you may have more than one model to choose from for your language. You can compare the scores of these models on the UDPipe models page.
|
... | ... | @@ -64,13 +67,16 @@ If you have a raw corpus in the .txt format, and if it is not too big, you can d |
|
|
|
|
|
Since the corpus is large, you may want to parallelise this process by running several jobs, e.g. on a cluster. We can help if needed; just drop us a line. For more details on UDPipe's options, check [UDPipe's user manual](https://ufal.mff.cuni.cz/udpipe/users-manual).
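One way to prepare such parallel jobs is to split the list of input files into batches and build one `run_udpipe.sh` command line per batch. The sketch below only builds the command strings and does not run anything. The helper names `make_batches` and `build_commands` are hypothetical, and it assumes the script accepts a model path followed by any number of input files, as in the invocation shown in this guide.

```python
import shlex

# Script path as used elsewhere in this guide.
RUN_UDPIPE = "utilities/lang-leaders/pre-annot/run_udpipe.sh"


def make_batches(files, n_jobs):
    """Distribute files round-robin over at most n_jobs non-empty batches."""
    if n_jobs < 1:
        raise ValueError("n_jobs must be at least 1")
    batches = [files[i::n_jobs] for i in range(n_jobs)]
    return [batch for batch in batches if batch]


def build_commands(model_path, files, n_jobs):
    """Return one shell command line per batch (nothing is executed here)."""
    return [
        shlex.join([RUN_UDPIPE, model_path, *batch])
        for batch in make_batches(files, n_jobs)
    ]
```

Each returned string can then be submitted as a separate job, for instance through your cluster's scheduler. Round-robin splitting keeps the batches similar in file count, though not necessarily in size.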
|
|
|
|
|
|
### Running UDPipe on MWE-annotated files
|
|
|
### Generating morphosyntactic annotations after annotating MWEs
|
|
|
Suppose you performed the annotation of MWEs before taking care of the underlying morphosyntactic annotations. No worries: you can complete these with our scripts.
|
|
|
|
|
|
#### Running UDPipe on MWE-annotated files
|
|
|
|
|
|
Suppose your corpus is already tokenized and annotated for MWEs, and is in the .cupt (or .folia) format, but lacks morphosyntactic annotation. To enhance it with UDPipe, proceed as above, this time passing your .cupt files to the parser:
|
|
|
`utilities/lang-leaders/pre-annot/run_udpipe.sh MODELPATH input-001.cupt input-002.cupt ...`
|
|
|
Any pre-existing information, other than tokenisation, will be overwritten. Therefore, if you already have part of the morphosyntactic annotation (e.g. UPOS tags) which you want to keep, run UDPipe in a customized way (see below).
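Before running the parser, it can help to see which morphosyntactic columns of a .cupt file already contain values and would therefore be overwritten. The sketch below is a hypothetical helper, not one of the PARSEME scripts. It assumes the standard CUPT layout (the ten CoNLL-U columns followed by PARSEME:MWE, tab-separated) and counts every value other than `_` as filled; note that `_` in FEATS can also legitimately mean "no features".

```python
# 0-based positions of the morphosyntactic columns in a CUPT/CoNLL-U token line.
MORPHOSYNTAX_COLUMNS = {
    "LEMMA": 2, "UPOS": 3, "XPOS": 4, "FEATS": 5, "HEAD": 6, "DEPREL": 7,
}


def filled_columns(cupt_text):
    """Count, per column, the token lines whose value is not '_'."""
    counts = {name: 0 for name in MORPHOSYNTAX_COLUMNS}
    for line in cupt_text.split("\n"):
        if not line.strip() or line.startswith("#"):
            continue  # skip sentence separators and comment/metadata lines
        fields = line.split("\t")
        if len(fields) < 8:
            continue  # not a token line in the assumed layout
        for name, index in MORPHOSYNTAX_COLUMNS.items():
            if fields[index] != "_":
                counts[name] += 1
    return counts
```

A column with a non-zero count holds annotation that a plain run would replace, which is the case where the customized procedure is the safer choice.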
|
|
|
|
|
|
### Running UDPipe on partly annotated files
|
|
|
#### Running UDPipe on partly annotated files
|
|
|
|
|
|
Suppose your corpus is already tokenized, annotated for MWEs, and annotated for morphology (LEMMA, UPOS and FEATS columns) but not for syntax (HEAD and DEPREL columns). You can use UDPipe in a custom way:
|
|
|
|
... | ... | |