Commit d65bae8f authored by Silvio Ricardo Cordeiro

Analyze new version of FR, with LEMMA field falling back to UDPipe-generated lemmas

parent d08ab20f
This is the README file from the PARSEME verbal multiword expressions (VMWEs) corpus for French, edition 1.1.
The verbal MWEs have been annotated in the following corpora:
1. `sequoia`: all the 3099 sentences of the [Sequoia Treebank](
2. `fr-ud`: version 2.1 of the French Universal Dependencies treebank (recently renamed "GSD" for Google dataset)
3. `fr_partut-ud`: the 2.1 UD version of the French part of the ParTUT
4. `fr_pud-ud`: the first 500 sentences of the French part of the 2.1 UD version of the Parallel Universal Dependencies (PUD) treebanks created for the [CoNLL 2017 shared task on Multilingual Parsing from Raw Text to Universal Dependencies](
Provided annotations
The data are in the [.cupt]( format. Here is detailed information about some columns:
* LEMMA (column 3): Available.
* UPOS (column 4): Available. Manually annotated.
* HEAD and DEPREL (columns 7 and 8): Available. Manually annotated. The inventory is [Universal Dependency Relations](
* MISC (column 10): No-space information available. Automatically annotated.
* PARSEME:MWE (column 11): Manually annotated. The following [VMWE categories]( are annotated: VID, LVC.full, LVC.cause, IRV, MVC.
The CoNLL-U columns are those found in the UD 2.1 release (for the Sequoia corpus, the UD 2.1 version results from an automatic conversion by Bruno Guillaume).
As a result, the annotation schemes for POS tags and syntactic dependencies are relatively homogeneous.
Note, though, that differences remain, as the UD guidelines may have been interpreted differently by the various teams that produced the different corpora.
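As a rough illustration of how column 11 can be read, the sketch below counts VMWEs per category. It assumes the usual PARSEME 1.1 encoding, which is not spelled out in this README: `*` for tokens outside any VMWE, `_` for unannotated tokens, `id:CATEGORY` on the first token of a VMWE, a bare `id` on its remaining tokens, and `;` between codes when a token belongs to several VMWEs. The function name `count_vmwes` and the sample sentence are made up for the example.

```python
from collections import Counter


def count_vmwes(cupt_lines):
    """Count VMWEs per category from an iterable of .cupt lines."""
    counts = Counter()
    for line in cupt_lines:
        line = line.rstrip("\n")
        # Skip blank sentence separators and comment/metadata lines.
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        if len(cols) < 11:
            continue  # not a full 11-column token line
        mwe_field = cols[10]  # PARSEME:MWE is column 11 (1-based)
        if mwe_field in ("*", "_"):
            continue
        for code in mwe_field.split(";"):
            # Assumption: only the first token of a VMWE carries "id:CATEGORY".
            if ":" in code:
                _, category = code.split(":", 1)
                counts[category] += 1
    return counts


# Hypothetical sentence, not taken from the corpus.
sample = [
    "# text = Il prend une décision",
    "1\tIl\til\tPRON\t_\t_\t2\tnsubj\t_\t_\t*",
    "2\tprend\tprendre\tVERB\t_\t_\t0\troot\t_\t_\t1:LVC.full",
    "3\tune\tun\tDET\t_\t_\t4\tdet\t_\t_\t*",
    "4\tdécision\tdécision\tNOUN\t_\t_\t2\tobj\t_\t_\t1",
    "",
]
print(dict(count_vmwes(sample)))
```

Counting on the first token only avoids counting a multi-token VMWE once per component.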
* The tokenization is that of the French UD treebanks, in which the following contractions appear as multi-word tokens (e.g. 1-2 au), split into words:
E.g.: Au soleil
1-2 Au
1 à
2 le
3 soleil
The list of contractions is:
Note that the only ambiguous cases are "des" / "du". Depending on the context, these tokens are either a plain determiner or are split into the preposition "de" + the determiner "le" / "les".
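The multi-word token layout shown for "Au soleil" can be handled with a few lines of code. The sketch below is one possible approach, not part of the corpus tooling: it separates range lines (IDs such as `1-2`) from syntactic words, using only the ID and FORM columns. The helper name `split_rows` is hypothetical, and the code assumes integer word IDs (no empty nodes such as `1.1`).

```python
def split_rows(rows):
    """Separate multi-word token ranges from syntactic words.

    `rows` is a list of (ID, FORM) pairs taken from columns 1 and 2.
    Returns (surface_tokens, words).
    """
    surface, words = [], []
    covered_until = 0  # last word ID covered by the most recent range line
    for tok_id, form in rows:
        if "-" in tok_id:
            # Range line such as "1-2": keep the surface form once.
            _, end = tok_id.split("-")
            covered_until = int(end)
            surface.append(form)
        else:
            words.append(form)
            # A word outside any range is also its own surface token.
            if int(tok_id) > covered_until:
                surface.append(form)
    return surface, words


rows = [("1-2", "Au"), ("1", "à"), ("2", "le"), ("3", "soleil")]
surface, words = split_rows(rows)
print(surface)  # ['Au', 'soleil']
print(words)    # ['à', 'le', 'soleil']
```

Keeping both views is useful because the dependency and VMWE annotations attach to the words, while the surface tokens reproduce the original text.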
The VMWE annotations were performed by Marie Candito, Mathieu Constant, Caroline Pasquer, Yannick Parmentier, Carlos Ramisch, and Jean-Yves Antoine.
The annotations for the new test set for the 1.1 shared task were performed by Marie Candito.
The VMWE annotations are distributed under the terms of the [CC-BY v4 license]( As far as the CoNLL-U files are concerned, the UD part of the corpus is under [CC BY-NC-SA 4.0]( and the Sequoia part is under [LGPL-LR]( UD sentences can be identified by their `sentid` prefixed with `fr-ud`.
Language: FR
## File: FR/dev.cupt
* Sentences: 2236
* Tokens: 56254
* Total VMWEs: 629
* `IRV`: 154
* `LVC.cause`: 15
* `LVC.full`: 252
* `MVC`: 1
* `VID`: 207
Language: FR
## File: FR/test.cupt
* Sentences: 1606
* Tokens: 39489
* Total VMWEs: 498
* `IRV`: 108
* `LVC.cause`: 14
* `LVC.full`: 160
* `MVC`: 4
* `VID`: 212
Language: FR
## File: FR/train.cupt
* Sentences: 17225
* Tokens: 432389
* Total VMWEs: 4550
* `IRV`: 1247
* `LVC.cause`: 68
* `LVC.full`: 1470
* `MVC`: 19
* `VID`: 1746
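The per-category counts in the three statistics blocks above should add up to the reported totals. A minimal sanity check, with the numbers copied from those blocks and a hypothetical helper `is_consistent`, could look like this:

```python
# Counts copied from the dev, test and train statistics listed above.
stats = {
    "dev": {"total": 629, "IRV": 154, "LVC.cause": 15,
            "LVC.full": 252, "MVC": 1, "VID": 207},
    "test": {"total": 498, "IRV": 108, "LVC.cause": 14,
             "LVC.full": 160, "MVC": 4, "VID": 212},
    "train": {"total": 4550, "IRV": 1247, "LVC.cause": 68,
              "LVC.full": 1470, "MVC": 19, "VID": 1746},
}


def is_consistent(entry):
    """Return True if the per-category counts sum to the reported total."""
    per_category = sum(v for k, v in entry.items() if k != "total")
    return per_category == entry["total"]


for name, entry in stats.items():
    print(name, is_consistent(entry))
```

All three splits pass this check.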