The PARSEME international community develops and maintains multilingual corpora annotated for multiword expressions (MWEs).
The community originated from a homonymous COST Action (2013--2017).
One of the main activities of the community is to organise and provide resources for the PARSEME shared tasks.
A multiword expression (MWE) is a combination of words which exhibits lexical, morphosyntactic, semantic, pragmatic and/or statistical idiosyncrasies. MWEs encompass diverse linguistic objects such as idioms (to pull the strings 'to make use of one's influence to gain an advantage'), compounds (a hot dog), light-verb constructions (to pay a visit), rhetorical figures (as busy as a bee), institutionalized phrases (traffic light) and multiword named entities (European Central Bank).
A prominent feature of many MWEs, especially of verbal idioms such as to pull the strings, is their non-compositional semantics, that is, the fact that their meaning cannot be deduced from the meanings of their components and from their syntactic structure in a way deemed regular for the given language.
For this reason, MWEs pose special challenges both to linguistic modeling (e.g. as linguistic objects crossing boundaries between lexicon and grammar) and to natural language processing (NLP) applications, especially to those which rely on semantic interpretation of text (e.g. information retrieval, information extraction or machine translation). A prerequisite for an appropriate handling of MWEs is their automatic identification (addressed in the PARSEME shared tasks).
The PARSEME corpora were initially created for the shared tasks and were released shortly after the evaluation phase on LINDAT/CLARIN under open licences (mostly Creative Commons):
Download temporary release 1.2 (2020): covers 14 languages and is being used in the ongoing PARSEME shared task 1.2 on semi-supervised identification of verbal multiword expressions (VMWEs), which places special emphasis on identifying unseen VMWEs. The corpus statistics can be found on the shared task 1.2 page. For more details of this release, see the upcoming shared task 1.2 description paper. For each language covered in this edition, two corpora are provided (the links below are temporary and will later be replaced by permanent ones):
Annotated corpus, where verbal MWEs are manually annotated, and morphosyntactic data either stem from the original corpora or are automatically generated. The annotation guidelines are mostly the same as in edition 1.1, but some languages extended and enhanced their annotations, new languages appeared and a few previously covered languages are not covered this time. Each annotated corpus is split into a test set, a development/validation set and a training set, so that a minimum number of unseen VMWEs is present in both the test and the development set and so that their proportion is close to the average.
Raw corpus, where VMWEs are not annotated and morphosyntax is automatically tagged. The raw corpora are meant for automatic discovery of VMWEs unseen in the training corpus.
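To make the notion of an unseen VMWE concrete, the following Python sketch flags test VMWEs whose multiset of lemmas never occurs among the training VMWEs. The matching criterion, the function names and the toy data are assumptions made for illustration only; the shared task 1.2 description paper gives the authoritative definition.

```python
def lemma_key(lemmas):
    """Order-insensitive key for a VMWE: its lower-cased lemmas, sorted."""
    return tuple(sorted(lemma.lower() for lemma in lemmas))


def unseen_vmwes(train_vmwes, test_vmwes):
    """Return the test VMWEs whose lemma multiset is absent from training."""
    seen = {lemma_key(vmwe) for vmwe in train_vmwes}
    return [vmwe for vmwe in test_vmwes if lemma_key(vmwe) not in seen]


def unseen_ratio(train_vmwes, test_vmwes):
    """Proportion of test VMWEs that are unseen with respect to training."""
    if not test_vmwes:
        return 0.0
    return len(unseen_vmwes(train_vmwes, test_vmwes)) / len(test_vmwes)


# Toy data (hypothetical): each VMWE is given as a list of component lemmas.
train = [["pay", "visit"], ["pull", "string"], ["take", "place"]]
test = [["visit", "pay"], ["make", "decision"], ["pull", "string"], ["kick", "bucket"]]

print(unseen_vmwes(train, test))
print(unseen_ratio(train, test))
```

Because the key ignores word order, "visit ... pay" in the test data counts as seen once "pay a visit" has been annotated in training.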
Download release 1.1 (2018): covers 19 languages and contains verbal MWEs only, with updated categories and improved/extended annotations with respect to edition 1.0. Corpora were split into a test set (~500 VMWEs), a development/validation set (~500 VMWEs) and a training set (variable size). Corpus statistics can be found on the shared task 1.1 page. For a more detailed description of this release, see the shared task 1.1 description paper.
Before the official 1.0 and 1.1 releases, the shared task participants had access to the corpora via the sharedtask-data repository. Although the data in that repository should be identical to the releases, we recommend downloading the corpora from the official LINDAT/CLARIN releases.
Active languages are those participating in the latest annotation campaign, currently (early 2020) 1.2. We manage continuous corpus development and enhancements on per-language git repositories. These repositories may contain work in progress. For an explanation of how these repositories should be structured, see here.
The links named latest in the table below refer to the latest version available in the GitLab repository. Grew-match data and consistency check results are updated automatically after each push to GitLab.
Standby languages are those which participated in previous annotation campaigns or shared tasks but are not part of the latest annotation campaign. Their git repositories contain the last version of the data from the previous annotation campaigns.
For most languages, morphosyntactic data (parts of speech, lemmas, morphological features and/or syntactic dependencies) are also provided in the corpora. Depending on the language, the information comes from treebanks (e.g., Universal Dependencies) or from automatic parsers trained on treebanks (e.g., UDPipe).
Annotation of other MWE categories (nominal, functional, etc.) is planned but guidelines do not exist yet.
The goal of the PARSEME shared tasks is to provide a framework for teams developing automatic MWE identification systems. Participants are given training and development data and must produce predictions which are then compared to the gold annotations in the test sets using predefined evaluation metrics.
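As an illustration of comparing predictions to gold annotations, the sketch below computes precision, recall and F1 over exactly matching MWEs. It assumes that a predicted MWE counts as correct only if its sentence and its full set of token positions match a gold MWE. The data representation and the function name are hypothetical, and the official metrics are those defined on the shared task pages.

```python
def precision_recall_f1(gold, pred):
    """Exact-match scores over two collections of MWE annotations."""
    gold, pred = set(gold), set(pred)
    true_positives = len(gold & pred)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy data (hypothetical): each MWE is (sentence id, set of token positions).
gold = {(1, frozenset({2, 5})), (2, frozenset({1, 2})), (3, frozenset({4, 6}))}
pred = {(1, frozenset({2, 5})), (2, frozenset({1, 3}))}

p, r, f = precision_recall_f1(gold, pred)
print(f"P={p:.2f} R={r:.2f} F1={f:.2f}")
```

In this toy example, the second prediction shares one token with a gold MWE but still counts as an error, because matching is exact.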