Commit 1b97d117 authored by Candito

added differences from 1.0 version

parent 90986edc
@@ -6,9 +6,10 @@ This is the README file from the PARSEME verbal multiword expressions (VMWEs) co
Corpora
-------
The verbal MWEs have been annotated in the following corpora:
-1. `sequoia`: all the 3099 sentences of the [Sequoia Treebank](https://www.rocq.inria.fr/alpage-wiki/tiki-index.php?page=CorpusSequoia)
-2. `fr-ud`: the 2.1 version of the French universal dependencies treebank (recently renamed "GDS" for Google dataset)
-3. `fr_partut-ud`: the 2.1 UD version of the French part of the ParTUT
+1. `sequoia`: the [Sequoia Treebank](https://www.rocq.inria.fr/alpage-wiki/tiki-index.php?page=CorpusSequoia) (3099 sentences)
+2. `fr-ud`: the 2.1 version of the French universal dependencies treebank (recently renamed "GDS" for Google dataset) (16449 sentences)
+3. `fr_partut-ud`: the 2.1 UD version of the French part of the ParTUT (1020 sentences)
4. `fr_pud-ud`: the first 500 sentences of the French part of the 2.1 UD version of the Parallel Universal Dependencies (PUD) treebanks created for the [CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies](http://universaldependencies.org/conll17/).
@@ -22,10 +23,18 @@ The data are in the [.cupt](http://multiword.sourceforge.net/cupt-format) format
* MISC (column 10): No-space information available. Automatically annotated.
* PARSEME:MWE (column 11): Manually annotated. The following [VMWE categories](http://parsemefr.lif.univ-mrs.fr/parseme-st-guidelines/1.1/?page=030_Categories_of_VMWEs) are annotated: VID, LVC.full, LVC.cause, IRV, MVC.
-The CoNLL-U columns are those found in the UD 2.1 release (for the Sequoia corpus, the UD 2.1 version results from an automatic conversion by Bruno Guillaume).
+The CoNLL-U columns (1-10) are those found in the UD 2.1 release (for the Sequoia corpus, the UD 2.1 version results from an automatic conversion by Bruno Guillaume).
So the annotation schemes for POS tags and syntactic dependencies are relatively homogeneous.
Note though that differences remain, as the UD guidelines may have been interpreted differently by the various teams that produced the different corpora.
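
As an illustration of the column layout described above, here is a minimal sketch of reading the PARSEME:MWE column (column 11) from `.cupt` lines. It assumes tab-separated columns, `*` for tokens outside any VMWE, `_` for underspecified annotation, and `;` between multiple VMWE codes on one token. The helper name `read_cupt_mwes` and the sample sentence are hypothetical and are not taken from the corpora.

```python
def read_cupt_mwes(lines):
    """Yield (token_id, form, mwe_codes) for each token line of .cupt input.

    Assumptions (not stated in this README): columns are tab-separated,
    '*' marks a token outside any VMWE, '_' marks underspecified
    annotation, and ';' separates several VMWE codes on one token.
    """
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank sentence separators and comment/metadata lines.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) != 11:
            raise ValueError(f"expected 11 columns, got {len(cols)}: {line!r}")
        mwe = cols[10]  # PARSEME:MWE is column 11 (index 10)
        codes = [] if mwe in ("*", "_") else mwe.split(";")
        yield cols[0], cols[1], codes


# Invented sample sentence, for illustration only.
sample = [
    "# text = Il prend une décision",
    "1\tIl\til\tPRON\t_\t_\t2\tnsubj\t_\t_\t*",
    "2\tprend\tprendre\tVERB\t_\t_\t0\troot\t_\t_\t1:LVC.full",
    "3\tune\tun\tDET\t_\t_\t4\tdet\t_\t_\t*",
    "4\tdécision\tdécision\tNOUN\t_\t_\t2\tobj\t_\t_\t1",
    "",
]

tokens = list(read_cupt_mwes(sample))
for token_id, form, codes in tokens:
    print(token_id, form, codes)
```

In this sketch the first token of a VMWE carries an index and category (`1:LVC.full`), while later tokens of the same expression repeat only the index (`1`).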
+Differences from 1.0 dataset
+----------------------------
+The 1.1 dataset follows the slightly modified [1.1 PARSEME shared task guidelines](https://parsemefr.lis-lab.fr/parseme-st-guidelines/1.1/-/
+The 1.0 annotations (sequoia and fr-ud corpora) were modified to match the 1.1 guidelines.
+Annotations on fr-pud and fr-partut are new in the 1.1 version.
Tokenization
------------
......