CoNLL-U and CoNLL-RDF export
Fixes
Partially addresses #126 (closed)
Partially addresses #242 (closed)
Description
Cleans up some of the view traits used for exporting, and implements views for CoNLL-U and CoNLL-RDF.
Type of PR
This PR is a feature.
Technicalities
- adds CoNLL-U and CoNLL-RDF converters to the Node.js service
- Note: this is not optimal, as Node.js has to communicate with the converters through the CLI. However, the servers could easily be replaced by Python and Java servers respectively, in this container or another, without having to change the PHP end of things.
- adds a CLI for CoNLL-U that supports stdin/stdout
- adds caching for CoNLL-U and CoNLL-RDF conversions (!)
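For reviewers who want a mental model of the caching item above, here is a minimal sketch in Python. It is not the code in this PR: `ConversionCache` and `fake_convert` are hypothetical names, and the assumption that results are keyed by output format plus a hash of the input text is mine.

```python
import hashlib


class ConversionCache:
    """In-memory cache keyed by output format and a hash of the input text."""

    def __init__(self, convert):
        self._convert = convert  # callable(text, fmt) -> str (hypothetical)
        self._store = {}
        self.misses = 0  # how many times the converter actually ran

    def get(self, text, fmt):
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        key = (fmt, digest)
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._convert(text, fmt)
        return self._store[key]


def fake_convert(text, fmt):
    # Stand-in for the real converter, which would be invoked through its CLI.
    return f"[{fmt}] {text}"


cache = ConversionCache(fake_convert)
line = "1\tsila3\tsila[unit]\tN"
first = cache.get(line, "conll-u")
second = cache.get(line, "conll-u")  # served from the cache, no second conversion
```

Keying on a hash of the content rather than on the record ID means a changed annotation is reconverted automatically, at the cost of hashing the input on every request.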
Tests
- Run
curl -LH 'Accept: text/x-conll-u' http://127.0.0.1:2354/inscriptions/2341997
- Run
curl -LH 'Accept: text/x-conll+turtle' http://127.0.0.1:2354/inscriptions/2341997
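The same requests can be scripted. The sketch below uses only Python's standard library; the host, port, path, and media types are taken from the curl commands above, while `build_request` and `fetch` are hypothetical helper names.

```python
from urllib.request import Request, urlopen

BASE_URL = "http://127.0.0.1:2354"  # local dev server from the curl commands


def build_request(path, media_type):
    """Build a GET request that asks for the given export format."""
    return Request(f"{BASE_URL}{path}", headers={"Accept": media_type})


def fetch(path, media_type):
    """Send the request (redirects are followed, like curl -L) and return the body."""
    with urlopen(build_request(path, media_type)) as response:
        return response.read().decode("utf-8")


conllu_request = build_request("/inscriptions/2341997", "text/x-conll-u")
rdf_request = build_request("/inscriptions/2341997", "text/x-conll+turtle")
```

Calling `fetch("/inscriptions/2341997", "text/x-conll-u")` needs the framework running locally with the conversion service enabled.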
Checklist:
- My pull request has a descriptive title (not a vague title like "Update index.md").
- My pull request targets the phoenix/develop branch of the repository.
- My commit messages follow best practices.
- My code follows the established code style of the repository.
- I added tests for the changes I made (if applicable).
- I added or updated documentation (if applicable).
- I tried running the project locally and verified that there are no visible errors.
Activity
Same as with the rest of inscription APIs, i.e.:
curl -LH 'Accept: text/x-conll-u' http://127.0.0.1:2354/artifacts/100188
curl -LH 'Accept: text/x-conll+turtle' http://127.0.0.1:2354/artifacts/100188
However, is_latest needs to be set. Also, forgot to mention: CoNLL-RDF breaks on the CDLI-CoNLL of P100149 because the word(?) identifiers contain single quotes, which doesn't work with RDF URIs.
Yes! The CDLI-CoNLL files are actually public domain :) https://github.com/cdli-gh/mtaac_gold_corpus (they are stored in the to_dict folder).
Edited by Émilie Pagé-Perron
Ah, good. Seems like I'm not the first one though: https://github.com/acoli-repo/conll-rdf/issues/11#issuecomment-505987456. Apparently the ID column should contain numbers only (except perhaps hyphens?). I can rename the ID column though (CDLI_ID?), so it treats it as a regular column and generates new IDs automatically:
:s1_51 a nif:Word ; nif:nextWord :s1_52 ; conll:CDLI_ID "r.7'.2" ; conll:FORM "sila3" ; conll:SEGM "sila[unit]" ; conll:XPOSTAG "N" .
instead of the invalid
:s1_r.7'.2 a nif:Word ; nif:nextWord :s1_r.7'.3 ; conll:ID "r.7'.2" ; conll:FORM "sila3" ; conll:SEGM "sila[unit]" ; conll:XPOSTAG "N" .
But is it a restriction in the RDF format that was implemented in the software, or a restriction based on usage in the software itself? In other words, why is conll-rdf expecting a number for the ID? @max-ionov
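To make the renaming idea from this thread concrete, here is a small Python sketch. It is an illustration only and not part of conll-rdf or this PR: `rename_unsafe_id_column` is a hypothetical helper, and the rule that a "safe" ID is an integer or an integer range such as `3-4` is an assumption based on the "numbers only (except perhaps hyphens?)" remark.

```python
import re

# Assumption: a safe ID is an integer or an integer range like "3-4".
NUMERIC_ID = re.compile(r"^\d+(-\d+)?$")


def rename_unsafe_id_column(columns, rows, new_name="CDLI_ID"):
    """Return the column names, with ID renamed if any ID value is not numeric."""
    if "ID" not in columns:
        return list(columns)
    idx = columns.index("ID")
    if all(NUMERIC_ID.match(row[idx]) for row in rows):
        return list(columns)  # IDs are fine, keep the ID column as it is
    # Otherwise demote ID to an ordinary column so new IDs get generated.
    return [new_name if name == "ID" else name for name in columns]


columns = ["ID", "FORM", "SEGM", "XPOSTAG"]
rows = [["r.7'.2", "sila3", "sila[unit]", "N"]]
renamed = rename_unsafe_id_column(columns, rows)
```

With the row above, the single quote in `r.7'.2` fails the numeric check, so the original identifier would survive as a plain `conll:CDLI_ID` literal instead of ending up in the URI.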
Note: I disabled the service again to fix the CI. You can enable it locally (see !177 (70c61415)).
mentioned in issue #216 (closed)
mentioned in commit cfd99f7f