Multiword Expressions toolkit
- Official website: http://mwetoolkit.sf.net
- Project repo: https://gitlab.com/mwetoolkit/mwetoolkit/
- Latest release: version 1.1
- Release date: October 08, 2015
- Authors: Carlos Ramisch, Silvio Ricardo Cordeiro, Vitor de Araujo, Sandra Castellanos
The mwetoolkit aids in the automatic identification and extraction of multiword expressions (MWEs) from running text. These include idioms (kick the bucket), noun compounds (cable car), phrasal verbs (take off, give up), etc.
Even though it focuses on multiword expressions, the framework is fairly complete and can be useful in any corpus-based study in computational linguistics. The mwetoolkit can be applied to virtually any text collection, language, and MWE type. It is a command-line tool written mostly in Python. Its development started in 2010 as a PhD thesis, and the project remains active (see commit logs).
Up-to-date documentation and details about the tool can be found at the mwetoolkit website: http://mwetoolkit.sourceforge.net/
Please refer to the website for up-to-date installation instructions.
2) QUICK START
To install the mwetoolkit, download it from the Git repository using the following command:
git clone --depth=1 "https://gitlab.com/mwetoolkit/mwetoolkit.git"
As the code evolves fast, we recommend using the Git version instead of old releases, and running git pull to have access to the latest improvements.
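The clone and update commands above can be wrapped in one small script. This is a minimal sketch, not part of the toolkit: the function names `run` and `fetch_mwetoolkit`, the `DRY_RUN` variable, and the default target folder `mwetoolkit` are assumptions made for illustration. Only the `git clone --depth=1` and `git pull` commands come from the instructions above.

```shell
#!/bin/sh
# Sketch only: run, fetch_mwetoolkit and DRY_RUN are hypothetical names,
# not part of the mwetoolkit.
REPO_URL="https://gitlab.com/mwetoolkit/mwetoolkit.git"

# Run a command, or only print it when DRY_RUN is set to 1.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Clone the repository if the target folder holds no Git checkout yet,
# otherwise update the existing checkout.
fetch_mwetoolkit() {
    target="${1:-mwetoolkit}"
    if [ -d "$target/.git" ]; then
        run git -C "$target" pull
    else
        run git clone --depth=1 "$REPO_URL" "$target"
    fi
}
```

Setting `DRY_RUN=1` before calling `fetch_mwetoolkit` only prints the Git command that would be executed, which is a cheap way to check the script before letting it touch the network.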
Once you have downloaded the toolkit, navigate to the main folder and run the command below to compile the C libraries used by the toolkit.
The toy folder contains a set of files for performing a toy experiment.
You can try to run the whole pipeline by calling
Specific documentation about the examples is in the script itself, as comments.
4) REGRESSION TESTS
The test folder contains regression tests for most scripts. To test your installation of the mwetoolkit, navigate to this folder and then call the test script:
cd test
./testAll.sh
Should one of the tests fail, please send a copy of the output and a brief description of your configuration (operating system, version, machine) to our email.
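The configuration details requested above (operating system, version, machine) can be gathered with standard `uname` calls. The following is a minimal sketch under that assumption; `describe_system` is a hypothetical helper name and the report file name is arbitrary. Neither is provided by the toolkit.

```shell
#!/bin/sh
# Sketch only: describe_system is a hypothetical helper, not part of the mwetoolkit.
describe_system() {
    echo "operating system: $(uname -s)"   # kernel name, e.g. Linux or Darwin
    echo "version: $(uname -r)"            # kernel release
    echo "machine: $(uname -m)"            # hardware architecture
}

# Example (run from the test folder): save the test output together with
# the system description in a single file that can be attached to an email.
# { ./testAll.sh 2>&1; describe_system; } > mwetoolkit-report.txt
```

The commented-out last line shows one way to combine the test output and the description; it is left commented so that sourcing the script does not start the tests.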
Note on compiling the C libraries: if you skip the compilation step, the toolkit will still work, but it will use a Python version (much slower and possibly obsolete!) of the indexing and counting scripts. This may be acceptable for small corpora.
Note on regression tests: on Mac OS, some tests will appear to fail when they actually succeed; the only differences are in the rounding of the less significant digits of floating-point numbers.
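To check whether a reported failure is only such a rounding difference, two output files can be compared with a numeric tolerance. This is a minimal sketch, assuming plain-text files with whitespace-separated fields; the helper name `approx_same` and the default tolerance of 1e-6 are hypothetical, and this is not how testAll.sh itself compares outputs.

```shell
#!/bin/sh
# Sketch only: approx_same is a hypothetical helper, not part of the mwetoolkit.
# Usage: approx_same FILE1 FILE2 [TOLERANCE]
# Succeeds (exit status 0) if both files have the same lines and fields,
# where numeric fields may differ by at most TOLERANCE (absolute difference).
approx_same() {
    awk -v tol="${3:-1e-6}" '
        function isnum(s) { return s ~ /^[-+]?([0-9]+\.?[0-9]*|\.[0-9]+)([eE][-+]?[0-9]+)?$/ }
        function abs(x) { return x < 0 ? -x : x }
        # First file: remember every line.
        NR == FNR { first[FNR] = $0; n1 = FNR; next }
        # Second file: compare field by field with the remembered line.
        {
            n2 = FNR
            if (split(first[FNR], a) != NF) { bad = 1; exit }
            for (i = 1; i <= NF; i++) {
                if (isnum(a[i]) && isnum($i)) {
                    if (abs(a[i] - $i) > tol) { bad = 1; exit }
                } else if (a[i] != $i) { bad = 1; exit }
            }
        }
        # Fail on any mismatch or if the files have different line counts.
        END { exit (bad || n1 != n2) ? 1 : 0 }
    ' "$1" "$2"
}
```

If `approx_same` succeeds for an expected and an actual output file, the two differ at most in the low-order digits of their numbers.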