... | ... | @@ -13,18 +13,18 @@ Quick links: |
|
|
The annotation of corpora, in most languages, uses the central PARSEME annotation platform. Below you will find the link to the platform and the user guide. If you do not have an account on FLAT yet, you have to ask the core organisers to create one for you and your team.
|
|
|
|
|
|
* [FLAT annotation platform](http://mwe.phil.hhu.de/): the PARSEME instance of [FLAT](https://github.com/proycon/flat), developed by Maarten van Gompel and hosted at the University of Düsseldorf.
|
|
|
* [FLAT user guide](https://docs.google.com/document/d/1zd_VhXQTel_IRVQ_u6s2wvJttwBHdDIk5YtWDMa3QW4/edit#) for PARSEME annotation
|
|
|
* [FLAT user guide](https://docs.google.com/document/d/1zd_VhXQTel_IRVQ_u6s2wvJttwBHdDIk5YtWDMa3QW4/edit#) containing instructions for PARSEME corpus annotation
|
|
|
|
|
|
## File format documentation
|
|
|
|
|
|
* **CUPT**: Most files in PARSEME use the [CUPT format](http://multiword.sourceforge.net/cupt-format/) (short for **C**oNll-**U** **P**arseme-**T**SV). CUPT is the PARSEME version/instance of extended [CoNLL-U format](https://universaldependencies.org/format.html), which has been defined jointly with [Universal Dependencies](http://universaldependencies.org/). The generic meta-format extending CoNLL-U is called [CoNLL-U Plus](https://universaldependencies.org/ext-format.html).
|
|
|
* **CoNLL-U**: the [CoNLL-U format](https://universaldependencies.org/format.html) is used in the [Universal Dependencies](http://universaldependencies.org/) project to represent and release morphological and syntactic annotations (i.e. treebanks) for many languages. PARSEME often relies on UD annotations, both manual (in treebanks) and automatic (output of tools like [UDPipe](##morphosyntactic-annotations-udpipe)). Our [conversion scripts](#file-format-conversion) can deal with CoNLL-U and perform integration of MWE annotations with UD annotations.
|
|
|
* **FoLiA**: files in FLAT are manipulated using a generic XML format called [FoLiA](https://proycon.github.io/folia/). We provide tools to convert from FoLiA to CUPT and vice-versa below, as well as integration with UD's CoNLL-U format.
|
|
|
* **CUPT**: Most files in PARSEME use the [CUPT format](http://multiword.sourceforge.net/cupt-format/) (short for **C**oNLL-**U** **P**arseme-**T**SV). CUPT is the PARSEME version/instance of extended [CoNLL-U format](https://universaldependencies.org/format.html), which has been defined jointly with [Universal Dependencies](http://universaldependencies.org/). The generic meta-format extending CoNLL-U is called [CoNLL-U Plus](https://universaldependencies.org/ext-format.html).
|
|
|
* **CoNLL-U**: the [CoNLL-U format](https://universaldependencies.org/format.html) is used in the [Universal Dependencies](http://universaldependencies.org/) project to represent and release morphological and syntactic annotations (i.e. treebanks) for many languages. PARSEME often relies on UD annotations, both manual (in treebanks) and automatic (output of tools like [UDPipe](#morphosyntactic-annotations-udpipe)). Our [conversion scripts](#file-format-conversion) can deal with CoNLL-U and perform integration of MWE annotations with UD-style morphosyntactic annotations.
|
|
|
* **FoLiA**: files in FLAT are manipulated using a generic XML format called [FoLiA](https://proycon.github.io/folia/). We provide tools to [convert](#file-format-conversion) from FoLiA to CUPT and vice-versa, as well as integration with UD's CoNLL-U format.
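
As an illustration of the CUPT layout described above, here is a minimal Python sketch that reads a CUPT file and decodes its last column. It assumes the usual `# global.columns = ...` header line and a `PARSEME:MWE` column in which `*` means "no MWE", `_` means "underspecified", and other values are `;`-separated codes such as `1:LVC.full` (first token of an MWE) or `1` (continuation token). The function names and the toy sentence are hypothetical and are not taken from the PARSEME utilities.

```python
def parse_mwe_column(value):
    """Parse one PARSEME:MWE cell into a list of (mwe_id, category) pairs.

    The category is None on continuation tokens, where only the id is given.
    """
    if value in ("*", "_"):  # "*": no MWE, "_": underspecified
        return []
    codes = []
    for item in value.split(";"):
        mwe_id, _, category = item.partition(":")
        codes.append((int(mwe_id), category or None))
    return codes


def read_cupt(lines):
    """Yield sentences as lists of {column name: value} dictionaries."""
    columns, sentence = None, []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith("# global.columns ="):
            columns = line.split("=", 1)[1].split()
        elif line.startswith("#"):
            continue  # other sentence-level comments are skipped in this sketch
        elif not line.strip():
            if sentence:
                yield sentence
                sentence = []
        else:
            if columns is None:
                raise ValueError("missing '# global.columns' header")
            sentence.append(dict(zip(columns, line.split("\t"))))
    if sentence:
        yield sentence


# Toy sentence written for this example (not taken from a PARSEME corpus).
SAMPLE = "\n".join([
    "# global.columns = ID FORM LEMMA UPOS XPOS FEATS HEAD DEPREL DEPS MISC PARSEME:MWE",
    "1\tShe\tshe\tPRON\t_\t_\t2\tnsubj\t_\t_\t*",
    "2\ttook\ttake\tVERB\t_\t_\t0\troot\t_\t_\t1:LVC.full",
    "3\ta\ta\tDET\t_\t_\t4\tdet\t_\t_\t*",
    "4\tdecision\tdecision\tNOUN\t_\t_\t2\tobj\t_\t_\t1",
])

sentences = list(read_cupt(SAMPLE.splitlines()))
mwe_codes = [parse_mwe_column(token["PARSEME:MWE"]) for token in sentences[0]]
```

In this toy sentence, tokens 2 and 4 share MWE id 1, and only the first of them carries the category.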
|
|
|
|
|
|
## File format conversion
|
|
|
|
|
|
PARSEME provides scripts to convert between CUPT, CoNLL-U and FoliA. They can be found in the [PARSEME utilities](https://gitlab.com/parseme/utilities/) repository, in the folder `st-organizers`. The most important scripts are:
|
|
|
* `to_cupt.py`: this script converts a file given as input into CUPT. The input can be in FoLiA, CUPT, CoNLL-UP (of which CoNLL-U is an instance) or PARSEME-TSV (deprecated) formats. The script automatically detects the input format. It is also possible to use this script to align the input with a corresponding CoNLL-U file (e.g. a newer UD treebank version). If a CoNLL-U file is provided with the `--conllu` option, the information in the CoNLL-U will be prioritary, except for the MWE annotations present only in the `--input` file. The script is capable of correcting incompatible tokenization, considering that the CoNLL-U file is correct. Other options control the presence of `NonVMWE` annotations, MWE annotations on multiword tokens (forbidden in CUPT) and the names of columns in CoNLL-UP (if not present in the header, as it is the case for standard CoNLL-U).
|
|
|
PARSEME provides scripts to convert between CUPT, CoNLL-U and FoLiA (and also the deprecated PARSEME-TSV format). They can be found in the [PARSEME utilities](https://gitlab.com/parseme/utilities/) repository, in the folder `st-organizers`. The most important scripts are:
|
|
|
* `to_cupt.py`: this script converts a file given as input into CUPT. The input can be in FoLiA, CUPT, CoNLL-UP (of which CoNLL-U is an instance) or PARSEME-TSV (deprecated) formats. The script automatically detects the input format. It is also possible to use this script to align the input with a corresponding CoNLL-U file (e.g. a newer UD treebank version). If a CoNLL-U file is provided with the `--conllu` option, the information in the CoNLL-U file takes priority, except for the MWE annotations present only in the `--input` file. The script can correct minor tokenization incompatibilities, assuming that the CoNLL-U file is correct. Other options control the presence of `NonVMWE` annotations, MWE annotations on multiword tokens (forbidden in CUPT) and the names of columns in CoNLL-UP (if not present in the header, as is the case for standard CoNLL-U).
|
|
|
* `to_folia.py`: similarly to `to_cupt.py`, this script converts its input into FoLiA. It is already integrated into FLAT, so it should not be necessary to run it manually.
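
As a rough sketch of how the alignment use case of `to_cupt.py` might be invoked, the Python snippet below only assembles a command line from the two options named above (`--input` and `--conllu`). The interpreter name `python3`, the file names, and the helper `build_to_cupt_command` are assumptions made for illustration. The snippet does not run the script, and how the script delivers its output is not covered here.

```python
import shlex


def build_to_cupt_command(input_path, conllu_path=None,
                          script="st-organizers/to_cupt.py"):
    """Return an argv list for to_cupt.py (hypothetical helper).

    Only the --input and --conllu options mentioned in this page are used.
    """
    argv = ["python3", script, "--input", input_path]
    if conllu_path is not None:
        # Align the annotated input with a corresponding CoNLL-U file.
        argv += ["--conllu", conllu_path]
    return argv


# Placeholder file names; replace them with your own paths.
command = build_to_cupt_command("annotated.folia.xml", "ud-treebank.conllu")
print(shlex.join(command))
```

The resulting list can be passed to `subprocess.run` from the root of a local clone of the utilities repository.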
|
|
|
|
|
|
|
... | ... | @@ -34,7 +34,7 @@ If your corpus does not have manual morphological and morphosyntactic annotations |
|
|
|
|
|
1. Download the UDPipe **model** for your language:
|
|
|
    - Models are described on the [UDPipe pretrained models](https://ufal.mff.cuni.cz/udpipe/models) page. One model is available per corpus, so you may have more than one model for your language. You can compare the scores of these models on the UDPipe page.
|
|
|
- On [Clarin page](https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2998), you can download models trained on version 2.4 of UD.
|
|
|
    - On the [Clarin page](https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2998), you can download models trained on version 2.4 of UD (the latest at the time of writing; use more recent versions if available).
|
|
|
2. Download the PARSEME utilities repository:
|
|
|
- `git clone git@gitlab.com:parseme/utilities.git` (if not already done) or `git pull` (to get latest files)
|
|
|
3. Run UDPipe:
|
... | ... | @@ -50,7 +50,7 @@ Since the corpus is large, you may want to parallelise this process by running s |
|
|
|
|
|
## Consistency checks scripts
|
|
|
|
|
|
PARSEME provides scripts to increase the consistency of annotations. Their use is described on the LL's guide to [enhance existing corpora](Enhancing-existing-corpora). They can be found in the [PARSEME utilities](https://gitlab.com/parseme/utilities/) repository.
|
|
|
PARSEME provides scripts to increase the consistency of annotations. Their use is described in the LL's guide to [enhance existing corpora](Enhancing-existing-corpora). They can be found in the [PARSEME utilities](https://gitlab.com/parseme/utilities/) repository. The consistency check is based on lemmas: it verifies whether annotations concerning the same sets of lemmas use the same labels across the whole corpus. It can also help spot skipped expressions. However, only potential problems are found: the corrections still need to be examined and performed manually.
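
The lemma-based idea can be sketched in a few lines of Python. This is only an illustration of the principle, not the code of the PARSEME consistency scripts: the function name, the input triples and the sentence identifiers are all hypothetical.

```python
from collections import defaultdict


def find_inconsistent_labels(annotations):
    """Group annotated MWEs by their set of lemmas and report label clashes.

    `annotations` is an iterable of (lemmas, label, location) triples.
    Returns {lemma tuple: {label: [locations]}} for every lemma set that
    was annotated with more than one label.
    """
    labels_by_lemmas = defaultdict(lambda: defaultdict(list))
    for lemmas, label, location in annotations:
        key = tuple(sorted(lemmas))  # word order is ignored in this sketch
        labels_by_lemmas[key][label].append(location)
    return {
        key: {label: locations for label, locations in by_label.items()}
        for key, by_label in labels_by_lemmas.items()
        if len(by_label) > 1
    }


# Toy annotations invented for this example.
toy_annotations = [
    (["take", "decision"], "LVC.full", "sent-1"),
    (["decision", "take"], "VID", "sent-7"),
    (["kick", "bucket"], "VID", "sent-3"),
]
suspicious = find_inconsistent_labels(toy_annotations)
```

Here only the lemma set `("decision", "take")` is reported, because it carries two different labels. As noted above, such output only points at candidates, and a human still has to decide which annotation is correct.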
|
|
|
|
|
|
## Error mining: Grew-match
|
|
|
|
... | ... | @@ -61,5 +61,7 @@ PARSEME provides scripts to increase the consistency of annotations. Their use i |
|
|
* [PARSEME utilities](https://gitlab.com/parseme/utilities/): a repository containing useful scripts for corpus management, including parsemetsv<->CUPT conversion, adjudication, consistency checks, and corpus statistics. LLs may need to run some of these scripts with the help of core organizers.
|
|
|
* [Development Gitlab space](https://gitlab.com/parseme/sharedtask-data-dev) (for authorised users): contains development versions of the corpora, double-aligned corpora for IAA calculation, system results from previous editions, and various scripts for ST organizers (automating system evaluation, publishing the results, running IAA). Since 2020, we have been gradually moving the development versions of language corpora to dedicated Gitlab repositories, keeping in this repository only organisation data (preliminary results, IAA data, internal scripts).
|
|
|
* [PARSEME guidelines](https://gitlab.com/parseme/sharedtask-guidelines): a repository hosting the HTML guidelines and issues page (LLs generally do not need to edit the guidelines directly but they do participate in raising and solving issues).
|
|
|
* [Description of PARSEME repositories](https://docs.google.com/document/d/1Wkx7bWTR04TXFVypPKy-qYi4ugc_034BtfskDeLDoGU/). This document may require updates and its content should be slowly moved here. Please send us a message if you find any inconsistency.
|
|
|
* [Description of PARSEME repositories](https://docs.google.com/document/d/1Wkx7bWTR04TXFVypPKy-qYi4ugc_034BtfskDeLDoGU/).
|
|
|
TODO: This document may require updates and its content should be slowly moved here.
|
|
|
|