The PARSEME international community develops and maintains multilingual corpora annotated for multiword expressions (MWEs). The community originated from a homonymous COST Action (2013--2017). One of the main activities of the community is to organise and provide resources for the PARSEME shared tasks.
Table of contents:
- PARSEME corpora
- Multiword expressions
- Shared tasks
- Why and how to contribute
- Contacts and communication
- Language teams
A multiword expression (MWE) is a combination of words which exhibits lexical, morphosyntactic, semantic, pragmatic and/or statistical idiosyncrasies. MWEs encompass diverse linguistic objects such as idioms (to pull the strings 'to make use of one's influence to gain an advantage'), compounds (a hot dog), light-verb constructions (to pay a visit), rhetorical figures (as busy as a bee), institutionalized phrases (traffic light) and multiword named entities (European Central Bank).
A prominent feature of many MWEs, especially of verbal idioms such as to pull the strings, is their non-compositional semantics, that is, the fact that their meaning cannot be deduced from the meanings of their components and from their syntactic structure in a way deemed regular for the given language. For this reason, MWEs pose special challenges both to linguistic modeling (e.g. as linguistic objects crossing boundaries between lexicon and grammar) and to natural language processing (NLP) applications, especially to those which rely on semantic interpretation of text (e.g. information retrieval, information extraction or machine translation). A prerequisite for an appropriate handling of MWEs is their automatic identification (addressed in the PARSEME shared tasks).
The PARSEME corpora were initially created for the shared tasks and were released shortly after the evaluation phase on LINDAT/CLARIN under open licences (mostly Creative Commons):
- Download temporary release 1.2 (2020): covers 14 languages and is being used in the ongoing PARSEME shared task 1.2 on semi-supervised identification of verbal multiword expressions (VMWEs), which places special emphasis on identifying unseen VMWEs. The corpus statistics can be found on the above shared task page. For more details on this release, see the upcoming shared task 1.2 description paper. For each language covered in this edition two corpora are provided (the links below are temporary and will later be replaced by permanent ones):
- Annotated corpus, where verbal MWEs are manually annotated, and morphosyntactic data either stem from the original corpora or are automatically generated. The annotation guidelines are mostly the same as in edition 1.1, but some languages extended and enhanced their annotations, new languages appeared and a few previously covered languages are not covered this time. Each annotated corpus is split into a test, a development/validation and a training set, so that a minimum number of unseen VMWEs is present in both test and dev and so that their proportion is close to the average.
- Raw corpus, where VMWEs are not annotated and morphosyntax is automatically tagged. The raw corpora are meant for automatic discovery of VMWEs unseen in the training corpus.
- Download release 1.1 (2018): covers 19 languages and contains verbal MWEs only, with updated categories and improved/extended annotations with respect to edition 1.0. Corpora were split into a test set (~500 VMWEs), a development/validation set (~500 VMWEs) and a training set (variable size). Corpus statistics can be found on the shared task 1.1 page. For a more detailed description of this release, see the shared task 1.1 description paper.
- Download release 1.0 (2017): covers 18 languages and contains verbal MWEs only. Corpora were split into a test set (~500 VMWEs) and a training set (variable size). Corpus statistics can be found on the shared task 1.0 page. A more detailed description can be found in the corpus description chapter.
Before the official 1.0 and 1.1 releases, the shared task participants had access to the corpora via the sharedtask-data repository. Although the data on the repo should be identical to the releases, we recommend downloading the corpora from the official LINDAT/CLARIN releases.
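The notion of "unseen" VMWEs mentioned for release 1.2 can be illustrated with a minimal sketch. It assumes, as a simplification, that a VMWE counts as unseen when the multiset of its component lemmas never occurs as an annotated VMWE in the training data; the authoritative definition is the one given in the shared task 1.2 documentation. The function names and the toy data below are hypothetical and are not part of any PARSEME tool.

```python
def lemma_key(lemmas):
    """Order-insensitive key for a VMWE: its sorted, lower-cased lemmas.

    Sorting keeps repeated lemmas, so the key behaves like a multiset.
    """
    return tuple(sorted(lemma.lower() for lemma in lemmas))


def unseen_vmwes(train_vmwes, test_vmwes):
    """Return the test VMWEs whose lemma multiset never occurs in training.

    Assumption: each VMWE is given as a list of the lemmas of its
    lexicalized components, e.g. ["pay", "visit"].
    """
    seen = {lemma_key(vmwe) for vmwe in train_vmwes}
    return [vmwe for vmwe in test_vmwes if lemma_key(vmwe) not in seen]


# Toy data (hypothetical): lemmas of annotated VMWEs in each split.
train = [["pay", "visit"], ["pull", "string"]]
test = [["visit", "pay"], ["take", "place"], ["pull", "string"]]

unseen = unseen_vmwes(train, test)
print(unseen)                   # [['take', 'place']]
print(len(unseen) / len(test))  # proportion of unseen VMWEs in the test set
```

Comparing lemmas rather than surface forms means that an inflected or reordered occurrence of a known expression (here "visit ... pay") is still treated as seen.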
Active languages are those participating in the latest annotation campaign, currently (early 2020) 1.2. We manage continuous corpus development and enhancements on per-language git repositories. These repositories may contain work in progress. For an explanation of how these repositories should be structured, see here.
The links named latest in the table below refer to the latest version available in the GitLab repository. Grew-match data and consistency check results are updated automatically after each push to GitLab.
|German (DE)||1.2||PARSEME_corpus_DE||1.1 / latest||latest|
|Greek (EL)||1.2||PARSEME_corpus_EL||1.1 / latest||latest|
|Basque (EU)||1.2||PARSEME_corpus_EU||1.1 / latest||latest|
|French (FR)||1.2||PARSEME_corpus_FR||1.1 / latest||latest|
|Hebrew (HE)||1.2||PARSEME_corpus_HE||1.1 / latest||latest|
|Hindi (HI)||1.2||PARSEME_corpus_HI||1.1 / latest||latest|
|Italian (IT)||1.2||PARSEME_corpus_IT||1.1 / latest||latest|
|Polish (PL)||1.2||PARSEME_corpus_PL||1.1 / latest||latest|
|Portuguese (PT)||1.2||PARSEME_corpus_PT||1.1 / latest||latest|
|Romanian (RO)||1.2||PARSEME_corpus_RO||1.1 / latest||latest|
|Turkish (TR)||1.2||PARSEME_corpus_TR||1.1 / latest||latest|
Standby languages are those which participated in previous annotation campaigns or shared tasks but are not part of the latest annotation campaign. Their git repositories contain the last version of the data from the previous annotation campaigns.
|Arabic (AR)||1.1||not freely available|
The PARSEME corpora are annotated for verbal multiword expressions according to the following guidelines:
Improvements to the guidelines are discussed via GitLab issues:
For most languages, morphosyntactic data (parts of speech, lemmas, morphological features and/or syntactic dependencies) are also provided in the corpora. Depending on the language, the information comes from treebanks (e.g., Universal Dependencies) or from automatic parsers trained on treebanks (e.g., UDPipe).
Annotation of other MWE categories (nominal, functional, etc.) is planned but guidelines do not exist yet.
The goal of the PARSEME shared tasks is to provide a framework for teams developing automatic MWE identification systems. Participants are given training and development data and must produce predictions which are then compared to the gold annotations in the test sets using predefined evaluation metrics.
- PARSEME shared task on semi-supervised identification of verbal multiword expressions - edition 1.2 (2020), organized as part of the MWE-LEX 2020 workshop
- PARSEME shared task on automatic identification of verbal MWEs - edition 1.1 (2018)
- PARSEME shared task on automatic identification of verbal MWEs - edition 1.0 (2017)
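To make the comparison between system predictions and gold annotations concrete, here is a minimal sketch of an exact-match, per-MWE precision/recall/F1 computation. It is an illustration under stated assumptions, not the official evaluation script: each MWE is represented as a hypothetical pair of a sentence identifier and the set of token positions it covers, MWE categories are ignored, and a prediction only counts as correct if it covers exactly the same tokens as a gold MWE. The metrics actually used are those defined for each shared task edition.

```python
def mwe_based_scores(gold, pred):
    """Exact-match precision, recall and F1 over sets of MWEs.

    Assumption: each MWE is a hashable (sentence_id, frozenset_of_token_ids)
    pair; a predicted MWE is correct only if it matches a gold MWE exactly.
    """
    gold, pred = set(gold), set(pred)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data (hypothetical): two gold MWEs, two predictions, one exact match.
gold = {("s1", frozenset({2, 4})), ("s2", frozenset({1, 2}))}
pred = {("s1", frozenset({2, 4})), ("s2", frozenset({1, 3}))}

p, r, f = mwe_based_scores(gold, pred)
print(p, r, f)  # 0.5 0.5 0.5
```

Exact matching is strict: the second prediction shares one token with a gold MWE but earns no credit. A token-level variant that rewards partial overlap would be a natural complement to this sketch.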
Why and how to contribute
PARSEME is a free and open initiative: anyone interested in annotating multiword expressions is welcome to contribute. Here are some reasons to do so:
- Integration into a multilingual, friendly and enthusiastic international community
- Contribution to open science (our data and publications are available under open licenses)
- Stimulating experience in creating synergies between different languages and linguistic traditions
- Mature annotation methodology and technical infrastructure
- Training in the annotation methodology
- Joint publications
What we cannot provide, however, is funding for the annotation work. All our language teams fund their work locally.
If you have never participated in the community and you wish to do so, you should:
- Express your interest by contacting the core organisers via email@example.com.
- Read the Language Leader guide
If you have already contributed in previous editions of the annotation campaigns for PARSEME shared tasks 1.0 and 1.1, then you will find updated instructions in the Language Leader guide.
The 2020 annotation campaign involves three activities: annotating new data (for some languages), enhancing the existing corpora (optional) and preparing raw corpora (mandatory). The Language Leader guide and the dedicated pages provide more details on these tasks. The specific timeline will be announced soon, but teams are expected to send their corpora to the core organisers before March 1, 2020.
- Academic papers describing the corpora and the initiative
- Corpus annotation campaign pages for previous shared task editions:
Contacts and communication
- firstname.lastname@example.org - Internal communication among annotators, language leaders, technical experts, guidelines experts, core organizers
- email@example.com - Internal communication among language leaders, technical experts, guidelines experts, core organizers
- firstname.lastname@example.org - Contact with the core organizers
- verbal-mwe - Announcements to shared task participants
To contact the organizers of the shared tasks and maintainers of the corpora, please use the parseme-st-core address. Use the same address if you wish to subscribe to one of the mailing lists above.
The language teams for the 2020 annotation campaign are as follows (LL stands for language leaders):
- Balto-Slavic group:
- Polish (PL): Agata Savary (LL), Jakub Waszczuk (LL), Emilia Palka-Binkiewicz
- Germanic group:
- German (DE): Timm Lichte (LL), Rafael Ehren
- Swedish (SV): Sara Stymne (LL), Elsa Erenmalm, Gustav Finnveden, Bernadeta Griciūtė, Ellinor Lindqvist, Eva Pettersson
- Romance group:
- French (FR): Marie Candito (LL), Matthieu Constant, Bruno Guillaume, Carlos Ramisch, Caroline Pasquer, Yannick Parmentier, Jean-Yves Antoine, Agata Savary
- Italian (IT): Johanna Monti (LL), Carola Carlino, Valeria Caruso, Maria Pia di Buono, Antonio Pascucci, Annalisa Raffone, Anna Riccio, Federico Sangati, Giulia Speranza
- Brazilian Portuguese (PT): Carlos Ramisch (LL), Renata Ramisch (LL), Silvio Ricardo Cordeiro, Helena de Medeiros Caseli, Isaac Miranda, Alexandre Rademaker, Oto Vale, Aline Villavicencio, Gabriela Wick Pedro, Rodrigo Wilkens, Leonardo Zilio
- Romanian (RO): Verginica Barbu Mititelu (LL), Monica-Mihaela Rizea, Mihaela Ionescu, Mihaela Onofrei
- Other languages:
- Chinese (ZH): Menghan Jiang (LL), Hongzhi Xu (LL), Jia Chen, Xiaomin Ge, Fangyuan Hu, Sha Hu, Minli Li, Siyuan Liu, Zhenzhen Qin, Ruilong Sun, Chengwen Wang, Huangyang Xiao, Peiyi Yan, Tsy Yih, Ke Yu, Songping Yu, Si Zeng, Yongchen Zhang, Yun Zhao
- Greek (EL): Voula Giouli (LL), Vassiliki Foufi, Aggeliki Fotopoulou, Stella Markantonatou, Stella Papadelli, Sevasti Louizou
- Basque (EU): Uxoa Iñurrieta (LL), Itziar Aduriz, Ainara Estarrona, Itziar Gonzalez, Antton Gurrutxaga, Larraitz Uria, Ruben Urizar
- Irish (GA): Abigail Walsh (LL), Jennifer Foster, Teresa Lynn
- Hebrew (HE): Chaya Liebeskind (LL), Hevi Elyovich, Yaakov Ha-Cohen Kerner, Ruth Malka
- Hindi (HI): Archna Bhatia (LL), Ashwini Vaidya (LL), Kanishka Jain, Vandana Puri, Shraddha Ratori, Vishakha Shukla, Shubham Srivastava
- Turkish (TR): Tunga Güngör (LL), Zeynep Yirmibeşoğlu, Gozde Berk, Berna Erden