The PARSEME international community develops and maintains multilingual corpora annotated for multiword expressions (MWEs). The community grew out of the COST Action of the same name (2013–2017). One of its main activities is to organise the PARSEME shared tasks and provide resources for them.
Table of contents:
- PARSEME corpora
- Multiword expressions
- Shared tasks
- Why and how to contribute
- Contacts and communication
- Language teams
A multiword expression (MWE) is a combination of words which exhibits lexical, morphosyntactic, semantic, pragmatic and/or statistical idiosyncrasies. MWEs encompass diverse linguistic objects such as idioms (to pull the strings 'to make use of one's influence to gain an advantage'), compounds (a hot dog), light-verb constructions (to pay a visit), rhetorical figures (as busy as a bee), institutionalized phrases (traffic light) and multiword named entities (European Central Bank).
A prominent feature of many MWEs, especially of verbal idioms such as to pull the strings, is their non-compositional semantics, that is, the fact that their meaning cannot be deduced from the meanings of their components and from their syntactic structure in a way deemed regular for the given language. For this reason, MWEs pose special challenges both to linguistic modeling (e.g. as linguistic objects crossing boundaries between lexicon and grammar) and to natural language processing (NLP) applications, especially to those which rely on semantic interpretation of text (e.g. information retrieval, information extraction or machine translation). A prerequisite for an appropriate handling of MWEs is their automatic identification (addressed in the PARSEME shared tasks).
The PARSEME corpora were initially created for the shared tasks and were released shortly after the evaluation phase on LINDAT/CLARIN under open licences (mostly Creative Commons):
- Download temporary release 1.2 (2020): covers 14 languages and is being used in the ongoing PARSEME shared task 1.2 on semi-supervised identification of verbal multiword expressions (VMWEs), which places special emphasis on identifying unseen VMWEs. The corpus statistics can be found on the above shared task page. For more details of this release, see the upcoming shared task 1.2 description paper. For each language covered in this edition, two corpora are provided (the links below are temporary and will later be replaced by permanent ones):
  - Annotated corpus, where verbal MWEs are manually annotated, and morphosyntactic data either stem from the original corpora or are automatically generated. The annotation guidelines are mostly the same as in edition 1.1, but some languages extended and enhanced their annotations, new languages were added and a few previously covered languages are not covered this time. Each annotated corpus is split into a test, a development/validation and a training set, so that a minimum number of unseen VMWEs is present in both test and dev and so that their proportion is close to the average.
  - Raw corpus, where VMWEs are not annotated and morphosyntax is automatically tagged. The raw corpora are meant for automatic discovery of VMWEs unseen in the training corpus.
- Download release 1.1 (2018): covers 19 languages and contains verbal MWEs only, with updated categories and improved/extended annotations with respect to edition 1.0. Corpora were split into a test set (~500 VMWEs), a development/validation set (~500 VMWEs) and a training set (variable size). Corpus statistics can be found on the shared task 1.1 page. For a more detailed description of this release, see the shared task 1.1 description paper.
- Download release 1.0 (2017): covers 18 languages and contains verbal MWEs only. Corpora were split into a test set (~500 VMWEs) and a training set (variable size). Corpus statistics can be found on the shared task 1.0 page. A more detailed description can be found in the corpus description chapter.
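The notion of unseen VMWEs mentioned for release 1.2 can be illustrated with a minimal sketch. It assumes, for illustration only, that a VMWE counts as unseen when the multiset of its component lemmas never occurs among the VMWEs annotated in the training data; the authoritative definition is the one given in the shared task 1.2 materials. The function names and the toy data are hypothetical.

```python
from collections import Counter


def lemma_key(lemmas):
    """Represent a VMWE by the multiset of its component lemmas (order-insensitive)."""
    return frozenset(Counter(lemmas).items())


def unseen_vmwes(train_vmwes, test_vmwes):
    """Return the test VMWEs whose lemma multiset never occurs in the training set.

    Assumption: this is an illustrative definition of "unseen", not the official one.
    """
    seen = {lemma_key(vmwe) for vmwe in train_vmwes}
    return [vmwe for vmwe in test_vmwes if lemma_key(vmwe) not in seen]


# Hypothetical toy data: each VMWE is given as the list of its component lemmas.
train = [["pay", "visit"], ["pull", "string"]]
test = [["visit", "pay"], ["take", "decision"]]

unseen = unseen_vmwes(train, test)
unseen_ratio = len(unseen) / len(test)
```

In this toy example, "visit pay" counts as seen because word order is ignored, so only "take decision" is reported as unseen.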
Before the official 1.0 and 1.1 releases, the shared task participants had access to the corpora via the sharedtask-data repository. Although the data on the repo should be identical to the releases, we recommend downloading the corpora from the official LINDAT/CLARIN releases.
We manage continuous corpus development and improvement on per-language git repositories. These repositories may contain work in progress. For an explanation of how these repositories should be structured, see here. These repositories are referred to under Working version in the table below. Consistency check results and grew-match data are updated automatically after each push on GitLab.
Improved versions of the corpus are released on a regular basis. Up to version 1.2 these releases coincided with PARSEME shared tasks. From now on, we plan to release new corpus versions independently of the shared tasks (likely once a year).
The PARSEME corpora are annotated for verbal multiword expressions according to the following guidelines:
Improvements to the guidelines are discussed via GitLab issues:
For most languages, morphosyntactic data (parts of speech, lemmas, morphological features and/or syntactic dependencies) are also provided in the corpora. Depending on the language, the information comes from treebanks (e.g., Universal Dependencies) or from automatic parsers trained on treebanks (e.g., UDPipe).
Annotation of other MWE categories (nominal, functional, etc.) is planned but guidelines do not exist yet.
The goal of the PARSEME shared tasks is to provide a framework for teams developing automatic MWE identification systems. Participants are given training and development data and must produce predictions which are then compared to the gold annotations in the test sets using predefined evaluation metrics.
- PARSEME shared task on semi-supervised identification of verbal multiword expressions - edition 1.2 (2020), organized as part of the MWE-LEX 2020 workshop
- PARSEME shared task on automatic identification of verbal MWEs - edition 1.1 (2018)
- PARSEME shared task on automatic identification of verbal MWEs - edition 1.0 (2017)
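The comparison of system predictions with gold annotations can be sketched as follows. This is a simplified illustration, not the official evaluation script: it assumes that each MWE is represented as a pair of a sentence identifier and a set of token positions, and that a prediction counts as correct only if it matches a gold MWE exactly. The metrics actually used are defined on the respective shared task pages and may differ. The function name and the toy data are hypothetical.

```python
def precision_recall_f1(gold, predicted):
    """Exact-match precision, recall and F1 over two sets of MWEs."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical toy data: (sentence id, positions of the tokens forming the MWE).
gold = {(1, frozenset({2, 4})), (2, frozenset({1, 2}))}
pred = {(1, frozenset({2, 4})), (2, frozenset({1, 3}))}

precision, recall, f1 = precision_recall_f1(gold, pred)
```

Here one of the two predictions matches a gold MWE exactly, so precision, recall and F1 all equal 0.5. A partially correct prediction, such as the second one, earns no credit under this exact-match assumption.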
Why and how to contribute
PARSEME is a free and open initiative: anyone interested in annotating multiword expressions is welcome to contribute. Here are some reasons to do so:
- Integration into a multilingual, friendly and enthusiastic international community
- Contribution to open science (our data and publications are available under open licenses)
- Stimulating experience in creating synergies between different languages and linguistic traditions
- Mature annotation methodology and technical infrastructure
- Training in the annotation methodology
- Joint publications
What we cannot provide, however, is funding for the annotation work. All our language teams fund their work locally.
If you have never participated in the community and you wish to do so, you should:
- Express your interest by contacting the core organisers via firstname.lastname@example.org.
- Read the Language Leader guide
If you have already contributed in previous editions of the annotation campaigns for PARSEME shared tasks 1.0 and 1.1, then you will find updated instructions in the Language Leader guide.
The post-2020 annotation campaigns: From 2021 on, the PARSEME annotation efforts are no longer strictly linked to the PARSEME shared tasks. We plan to organize regular releases of the PARSEME corpora for new languages and enhanced versions of the corpora in the previously included languages. We are also working on extending the PARSEME guidelines to all MWE categories (not only verbal ones). More details about the organization of these efforts will appear soon. One new language currently being added to the PARSEME suite is Serbian.
The 2020 annotation campaign: the 2020 annotation campaign involves three activities: annotating new data (for some languages), enhancing the existing corpora (optional) and preparing raw corpora (mandatory). The Language Leader guide and the dedicated pages provide more details on these tasks. The specific timeline will be announced soon, but teams are expected to send their corpora to the core organisers before March 1, 2020.
- Academic papers describing the corpora and the initiative
- Corpus annotation campaign pages for previous shared task editions:
Contacts and communication
- email@example.com - Internal communication among annotators, language leaders, technical experts, guidelines experts, core organizers
- firstname.lastname@example.org - Internal communication among language leaders, technical experts, guidelines experts, core organizers
- email@example.com - Contact with the core organizers
- verbal-mwe - Announcements to shared task participants
To contact the organizers of the shared tasks and maintainers of the corpora, please use the parseme-st-core address. Use the same address if you wish to subscribe to one of the mailing lists above.
The following language teams co-authored the PARSEME corpus (LL stands for language leader):
- Balto-Slavic group:
- Bulgarian (BG): Ivelina Stoyanova (LL v1.0/1.1), Tsvetana Dimitrova (v1.0/1.1), Svetlozara Leseva (v1.0/1.1), Valentina Stefanova (v1.0/1.1), Maria Todorova (v1.0/1.1)
- Czech (CS): Eduard Bejček (LL v1.0), Zdeňka Urešová (v1.0)
- Croatian (HR): Maja Buljan (LL v1.1), Goranka Blagus (v1.1), Ivo-Pavao Jazbec (v1.1), Nikola Ljubešić (v1.1), Ivana Matas (v1.1), Jan Šnajder (v1.1)
- Lithuanian (LT): Jolanta Kovalevskaitė (LL v1.1), Agne Bielinskiene (v1.1), Loic Boizou (v1.1)
- Polish (PL): Agata Savary (LL v1.0/1.1/1.2), Jakub Waszczuk (LL v1.2), Emilia Palka-Binkiewicz (v1.1)
- Serbian (SR): Cvetana Krstev (LL), Anđela Antić, Isidora Jaknić
- Slovene (SL): Polona Gantar (LL v1.1), Simon Krek (LL v1.0/1.1), Špela Arhar Holdt (v1.1), Jaka Čibej (v1.1), Teja Kavčič (v1.1), Taja Kuzman (v1.0/1.1)
- Germanic group:
- English (EN): Abigail Walsh (LL v1.1), Claire Bonial (v1.1), Paul Cook (v1.1), Jamie Findlay (v1.1), Teresa Lynn (v1.1), John McCrae (v1.1), Nathan Schneider (v1.1), Clarissa Somers (v1.1)
- German (DE): Fabienne Cap (LL v1.0), Timm Lichte (LL v1.1/1.2), Rafael Ehren (v1.1/1.2), Glorianna Jagfeld (v1.0)
- Swedish (SV): Fabienne Cap (LL v1.0), Sara Stymne (LL v1.2, annotator v1.0), Elsa Erenmalm (v1.2), Gustav Finnveden (v1.2), Bernadeta Griciūtė (v1.2), Ellinor Lindqvist (v1.2), Joakim Nivre (v1.0), Eva Pettersson (v1.0/1.2)
- Romance group:
- French (FR): Marie Candito (LL v1.0/1.1/1.2), Matthieu Constant (v1.0/1.1/1.2), Bruno Guillaume, Carlos Ramisch (v1.0/1.1/1.2), Caroline Pasquer (v1.0/1.1/1.2), Yannick Parmentier (v1.0/1.1/1.2), Jean-Yves Antoine (v1.0/1.1/1.2), Agata Savary (v1.0/1.1/1.2), Ismail El Maarouf (v1.0)
- Italian (IT): Johanna Monti (LL v1.0/1.1/1.2), Carola Carlino, Valeria Caruso (v1.0/1.1/1.2), Manuela Cherchi (v1.0), Maria Pia di Buono (v1.0/1.1/1.2), Antonio Pascucci (v1.1/1.2), Annalisa Raffone (v1.0/1.1/1.2), Anna Riccio (v1.1/1.2), Federico Sangati (v1.0/1.2), Anna De Santis (v1.0), Giulia Speranza
- Brazilian Portuguese (PT): Carlos Ramisch (LL v1.2, annotator v1.0/1.1), Renata Ramisch (LL v1.1/1.2, annotator v1.0), Silvio Ricardo Cordeiro (LL v1.0, annotator v1.1/1.2), Helena de Medeiros Caseli (v1.0/1.1/1.2), Isaac Miranda, Alexandre Rademaker, Oto Vale, Aline Villavicencio (v1.0/1.1/1.2), Gabriela Wick Pedro, Rodrigo Wilkens, Leonardo Zilio (v1.1/1.2)
- Romanian (RO): Verginica Barbu Mititelu (LL v1.0/1.1/1.2), Monica-Mihaela Rizea (v1.0/1.1/1.2), Mihaela Ionescu (v1.0/1.1/1.2), Mihaela Onofrei (v1.0/1.1/1.2)
- Spanish (ES): Carla Parra Escartín (LL v1.0/1.1), Cristina Aceta (v1.0/1.1), Itziar Aduriz (v1.0), Uxoa Iñurrieta (v1.0), Carlos Herrero (v1.0), Alfredo Maldonado (v1.1), Héctor Martínez Alonso (v1.0/1.1), Belem Priego Sanchez (v1.0/1.1)
- Other languages:
- Arabic (AR): Abdelati Hawwari (LL v1.1), Mona Diab (v1.1), Mohamed Elbadrashiny (v1.1), Rehab Ibrahim (v1.1)
- Basque (EU): Uxoa Iñurrieta (LL v1.1/1.2), Itziar Aduriz (v1.1/1.2), Ainara Estarrona (v1.1/1.2), Itziar Gonzalez (v1.1/1.2), Antton Gurrutxaga (v1.1/1.2), Larraitz Uria (v1.1/1.2), Ruben Urizar (v1.1/1.2)
- Chinese (ZH): Menghan Jiang (LL v1.2), Hongzhi Xu (LL v1.2), Jia Chen (v1.2), Xiaomin Ge (v1.2), Fangyuan Hu (v1.2), Sha Hu (v1.2), Minli Li (v1.2), Siyuan Liu (v1.2), Zhenzhen Qin (v1.2), Ruilong Sun (v1.2), Chengwen Wang (v1.2), Huangyang Xiao (v1.2), Peiyi Yan (v1.2), Tsy Yih (v1.2), Ke Yu (v1.2), Songping Yu (v1.2), Si Zeng (v1.2), Yongchen Zhang (v1.2), Yun Zhao (v1.2)
- Farsi (FA): Behrang QasemiZadeh (LL v1.0/1.1), Shiva Taslimipoor (v1.1)
- Greek (EL): Voula Giouli (LL v1.0/1.1/1.2), Aggeliki Fotopoulou (v1.0/1.1/1.2), Vassiliki Foufi (v1.0/1.1/1.2), Sevasti Louizou (v1.0/1.2), Stella Markantonatou (v1.1/1.2), Stella Papadelli (v1.1/1.2), Natasa Theoxari (v1.1)
- Hebrew (HE): Chaya Liebeskind (LL v1.0/1.1/1.2), Yaakov Ha-Cohen Kerner (LL v1.0, annotator v1.1/1.2), Hevi Elyovich (v1.0/1.1/1.2), Ruth Malka (v1.0/1.1/1.2)
- Hindi (HI): Archna Bhatia (LL v1.1/1.2), Ashwini Vaidya (LL v1.1/1.2), Kanishka Jain (v1.1/1.2), Vandana Puri (v1.1/1.2), Shraddha Ratori (v1.1/1.2), Vishakha Shukla (v1.1/1.2), Shubham Srivastava (v1.1/1.2)
- Hungarian (HU): Veronika Vincze (LL v1.0/1.1), Katalin Simkó (v1.0/1.1), Viktória Kovács (v1.0/1.1)
- Irish (GA): Abigail Walsh (LL v1.2), Jennifer Foster (v1.2), Teresa Lynn (v1.2)
- Maltese (MT): Lonneke van der Plaas (LL v1.0), Luke Galea (LL v1.0), Greta Attard (v1.0), Kirsty Azzopardi (v1.0), Janice Bonnici (v1.0), Jael Busuttil (v1.0), Ray Fabri (v1.0), Alison Farrugia (v1.0), Sara Anne Galea (v1.0), Albert Gatt (v1.0), Anabelle Gatt (v1.0), Amanda Muscat (v1.0), Michael Spagnol (v1.0), Nicole Tabone (v1.0), Marc Tanti (v1.0)
- Turkish (TR): Tunga Güngör (LL v1.1/1.2), Gülşen Eryiğit (LL v1.0), Kübra Adalı (LL v1.0), Gozde Berk (v1.1/1.2), Tutkum Dinç (v1.0), Berna Erden (v1.1/1.2), Ayşenur Miral (v1.0), Mert Boz (v1.0), Zeynep Yirmibeşoğlu (v1.2)