This site contains all publicly available material for our KONVENS 2016 contribution, "Parsing Free-Form Language Learner Data: Current State and Error Analysis".

Parsing Free-Form Language Learner Data: Current State and Error Analysis

This repository contains material that was used or produced for the following paper:

Christine Köhn, Tobias Staron and Arne Köhn. 2016. Parsing Free-Form Language Learner Data: Current State and Error Analysis. In Proceedings of the 13th Conference on Natural Language Processing (KONVENS). pages 135-145, Bochum, Germany, September. http://nbn-resolving.de/urn:nbn:de:gbv:18-228-7-2269

Links to the previously published software we used can be found below.

For material that is too large to upload, or for any other references or material missing here, please contact us directly (see below).

Currently, this repository contains only the gold standard annotations for Falko-100dep. More is coming soon!

Falko-100dep corpus

The sentences were randomly sampled from the FalkoEssayL2 corpus v2.4 downloaded from http://www.linguistik.hu-berlin.de/de/institut/professuren/korpuslinguistik/forschung/falko/zugang.

  • 100dep_gold: Gold standard annotations (gold labeled dependencies and gold PoS tags) for 100 sentences from the FalkoEssayL2 corpus

License for text: Creative Commons Attribution 3.0 Unported (CC BY 3.0)

License for annotation: Creative Commons Attribution 4.0 International (CC BY 4.0)
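As a rough illustration of how the gold dependencies and PoS tags might be read from one of the .conll files, here is a minimal sketch. It assumes the files follow the tab-separated, ten-column CoNLL-X layout (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL); this layout is an assumption and should be checked against the actual files. The names Token and read_conll_sentence, as well as the sample tokens and labels, are hypothetical and not taken from the corpus.

```python
from typing import Iterable, List, NamedTuple


class Token(NamedTuple):
    index: int   # position of the token in the sentence (1-based)
    form: str    # surface form
    pos: str     # PoS tag (POSTAG column)
    head: int    # index of the head token, 0 for the root
    deprel: str  # dependency label


def read_conll_sentence(lines: Iterable[str]) -> List[Token]:
    """Read one sentence, ASSUMING tab-separated CoNLL-X columns."""
    tokens = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        cols = line.split("\t")
        if len(cols) < 8:
            raise ValueError(f"expected at least 8 columns, got {len(cols)}: {line!r}")
        tokens.append(Token(int(cols[0]), cols[1], cols[4], int(cols[6]), cols[7]))
    return tokens


# Hypothetical two-token sample; tags and labels are for illustration only.
sample = [
    "1\tIch\tich\tPPER\tPPER\t_\t2\tSUBJ\t_\t_\n",
    "2\tschlafe\tschlafen\tVVFIN\tVVFIN\t_\t0\tS\t_\t_\n",
]
sentence = read_conll_sentence(sample)
```

To read a file from disk, the same function can be called on an open file handle, for example `read_conll_sentence(open(path, encoding="utf-8"))`; the UTF-8 encoding is likewise an assumption.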

File names for sentences

ID_ESSAYID_SENTENCE.cda or sentence_ID.conll:

ID encodes the proficiency level of the writer in terms of the Common European Framework of Reference for Languages (CEFR):

ID 0001 → level B2,

ID 0002 → level C1,

ID 0003 → level C2,

ID 0004 → level B2,...

ESSAYID corresponds to the original Excel file name (e.g. kne03_2006_06_L2v2.4 → kne03_2006_06_L2v2.4.xls), which itself encodes meta information about the essay (v2.4 for the version; for the others, see Reznicek et al. 2012).

SENTENCE is the sentence number according to the manually annotated sentence boundaries on the ZH0 level (see Reznicek et al. 2012).
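The naming scheme for the .cda files can be taken apart programmatically. The following is a minimal sketch, not part of the published material: the function name parse_cda_name and the example file name are hypothetical, and the exact formatting of the SENTENCE field is an assumption. Since ESSAYID itself contains underscores, the sketch treats the first field as ID and the last field as SENTENCE. Only the four ID-to-level mappings listed above are included; any other ID yields no level.

```python
import re
from typing import NamedTuple, Optional

# Only the mappings stated in this README; further IDs are deliberately not guessed.
KNOWN_LEVELS = {"0001": "B2", "0002": "C1", "0003": "C2", "0004": "B2"}

# First field = ID, last field = SENTENCE, everything in between = ESSAYID.
_CDA_NAME = re.compile(r"^(?P<id>[^_]+)_(?P<essay>.+)_(?P<sentence>[^_]+)\.cda$")


class SentenceFile(NamedTuple):
    sentence_id: str
    essay_id: str
    sentence: str
    cefr_level: Optional[str]  # None if the ID is not listed in KNOWN_LEVELS


def parse_cda_name(file_name: str) -> SentenceFile:
    """Split an ID_ESSAYID_SENTENCE.cda file name into its parts."""
    match = _CDA_NAME.match(file_name)
    if match is None:
        raise ValueError(f"not an ID_ESSAYID_SENTENCE.cda name: {file_name!r}")
    sentence_id = match.group("id")
    return SentenceFile(
        sentence_id,
        match.group("essay"),
        match.group("sentence"),
        KNOWN_LEVELS.get(sentence_id),
    )


# Hypothetical file name, built from the ESSAYID example above.
info = parse_cda_name("0001_kne03_2006_06_L2v2.4_12.cda")
```

The original Excel file name can then be recovered as `info.essay_id + ".xls"`.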


Software

https://gitlab.com/nats/jwcdg - jwcdg parser (includes links to other required components)

https://github.com/taolei87/RBGParser - RBGParser

https://www.cs.cmu.edu/~ark/TurboParser/ - TurboParser (including TurboTagger)

Data on request

The trained models are too large to upload. If you are interested in them, contact us (see below).

If you are interested in more information about the hybrid approach, contact Tobias Staron directly.


Contact

Christine Köhn - ckoehn at informatik.uni-hamburg.de

Tobias Staron - staron at informatik.uni-hamburg.de

Arne Köhn - koehn at informatik.uni-hamburg.de