... | ... | @@ -39,9 +39,47 @@ PARSEME provides scripts to convert between CUPT, CoNLL-U and FoLiA (and also th |
|
|
PARSEME provides a validation script, `parseme_validate.py`, which checks the content of [CUPT](https://multiword.sourceforge.net/cupt-format) files. It is located in the `st-organizers/release-preparation` directory of the PARSEME [utilities](https://gitlab.com/parseme/utilities) repository. This is the official PARSEME validator, described in more detail below.
|
|
|
|
|
|
Validation is organized into levels. The script can optionally be told to validate your data only up to a specific level.
|
|
|
* **Level 1** (CUPT backbone): At this level, the validator exclusively tests the order of lines, newline encoding, and conducts core tests to ensure the file's integrity. It invokes the [UD validator](https://universaldependencies.org/validation-rules.html) at level 1 and supplements it with new tests designed for the CUPT format. For instance, one such test ensures that the first line appropriately specifies **global.columns**, and that the `ID` and `PARSEME:MWE` columns are present.
|
|
|
* **Level 2** (PARSEME and UD contents): On this level, the script utilizes the [UD validator](https://universaldependencies.org/validation-rules.html) at level 2 for morphosyntax examination, and implements additional tests for the PARSEME content, such as a syntax check for the `PARSEME:MWE` column.
|
|
|
* **Level 3** (PARSEME releases): This is the conclusive level intended for PARSEME releases. Before any corpus can be released, it must successfully pass this level. Example tests at this level include the prohibition of the `NotMWE` tag and the exclusion of the `metadata` field.
|
|
|
* **Level 1** (CUPT backbone): At this level, the validator exclusively tests the order of lines, newline encoding, and conducts core tests to ensure the file's integrity. It invokes the [UD validator](https://universaldependencies.org/validation-rules.html) at level 1 and supplements it with new tests designed for the CUPT format. The comprehensive list of tests conducted at this level includes:
|
|
|
* Column ID verification:
|
|
|
* Checks that the token IDs follow a sequentially ascending order, i.e., 1, 2, and so forth.
|
|
|
* Confirms that multiword tokens are indexed with integer intervals such as 1-2 or 3-5, with lines representing these tokens positioned before the first word in the range. These ranges must be non-empty and should not overlap.
|
|
|
* Ensures that empty nodes indexed as i.1, i.2, etc., appear immediately after a token with index i.
|
|
|
* Verifies that the empty node identifier is located prior to any multiword tokens.
|
|
|
* Asserts that an empty line does not contain any space characters.
|
|
|
* Requires a single blank line after each sentence annotation.
|
|
|
* Confirms that comments are allowed only before a sentence's tokenization.
|
|
|
* Checks that all non-empty lines begin with a number or the '#' character.
|
|
|
* Expects an empty line at the end of the file.
|
|
|
* Allows only Unix-style LF line termination.
|
|
|
* Ensures that the first line correctly indicates **global.columns** and includes the `ID` and `PARSEME:MWE` columns.
|
|
|
* Checks that the number of columns in each sentence's tokenization aligns with the number of columns outlined in the global.columns line.
|
|
|
* Leading and trailing spaces, as well as two or more consecutive spacing characters, are not permissible in the columns. Additionally, column content should not be empty, and spaces within multiword tokens are not allowed.
|
|
|
* **Level 2** (PARSEME and UD contents): On this level, the script utilizes the [UD validator](https://universaldependencies.org/validation-rules.html) at level 2 for morphosyntax examination, and implements additional tests for the PARSEME content, such as a syntax check for the `PARSEME:MWE` column. The complete list of tests performed at this level consists of the following:
|
|
|
* `PARSEME:MWE` column:
|
|
|
* Content should contain a star "*", an underscore "_", or a list of VMWE codes separated by semicolons.
|
|
|
* Content should be underscore "_" for the blind version.
|
|
|
* A VMWE code should comprise a VMWE identifier, followed by a colon ':' and a VMWE category label (for instance, 1:VID), or only the VMWE identifier.
|
|
|
* The VMWE category must be one of the defined labels (IAV; IRV; LS.ICV; LVC.cause; LVC.full; MVC; NotMWE; VID; VPC.full; VPC.semi).
|
|
|
* A VMWE code must include the category on the first token of the respective VMWE and consist of the VMWE identifier alone on subsequent tokens of the same VMWE.
|
|
|
* VMWE identifiers should form a sequentially ascending order (1, 2, ...) and must fall within the token ID range.
|
|
|
* A multiword token should contain only a star "*" or underscore "_" in the `PARSEME:MWE` column.
|
|
|
* UD columns:
|
|
|
* Validation of general constraints on valid characters, for example, `UPOS` should only contain [A-Z].
|
|
|
* The `UPOS` tag must be present and should match one of the 17 tags defined by UD.
|
|
|
* `FEATS`, if present, must conform to the prescribed format (in the form Feature=Value, starting with [A-Z0-9] and containing only [A-Za-z0-9]).
|
|
|
* Each feature in the `FEATS` field should have unique, sorted values.
|
|
|
* A multiword token should have empty values "_" in all fields except `MISC` and `PARSEME:MWE`.
|
|
|
* `HEAD` and `DEPREL` must be present, and the universal part of `DEPREL` must be one of the 37 known labels.
|
|
|
* An empty node should have empty values "_" in `HEAD` and `DEPREL`.
|
|
|
* `HEAD` and `DEPS` should refer to existing identifiers.
|
|
|
* `DEPS` must be correctly formatted and must not contain cycles.
|
|
|
* `MISC` attributes (SpaceAfter|Lang|Translit|LTranslit|Gloss|LId|LDeriv) should not appear more than once.
|
|
|
* The graph reconstructed from `HEAD` and `DEPREL` should form a single-root, connected, and cycle-free tree.
|
|
|
* Metadata:
|
|
|
* The `source_sent_id` field must contain three parts separated by spaces (prefix-uri file-path-under-root sentence-id).
|
|
|
* The `source_sent_id` field should be present only once and its id must be unique across the corpus.
|
|
|
* The `text` field should be present only once and must not end with a space.
|
|
|
* **Level 3** (PARSEME releases): This is the conclusive level intended for PARSEME releases. Before any corpus can be released, it must successfully pass this level. Tests at this level include the prohibition of the `NotMWE` tag and the exclusion of the `metadata` field.
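
To make the Level 1 header checks concrete, here is a minimal Python sketch of the **global.columns** test and the column-count test. It assumes the first line has the form `# global.columns = ID FORM ...` and that token lines are tab-separated; the function and constant names are hypothetical and are not taken from `parseme_validate.py`.

```python
# Illustrative sketch only: names below are assumptions, not the
# validator's actual API.
GLOBAL_COLUMNS_PREFIX = "# global.columns ="
REQUIRED_COLUMNS = ("ID", "PARSEME:MWE")


def parse_global_columns(first_line):
    """Return the declared column names, or None if the header is invalid."""
    if not first_line.startswith(GLOBAL_COLUMNS_PREFIX):
        return None
    columns = first_line[len(GLOBAL_COLUMNS_PREFIX):].split()
    # Both ID and PARSEME:MWE must be declared.
    if any(name not in columns for name in REQUIRED_COLUMNS):
        return None
    return columns


def has_expected_column_count(token_line, columns):
    """Check that a tab-separated token line has one field per declared column."""
    return len(token_line.split("\t")) == len(columns)
```

A file whose first line fails `parse_global_columns` would be rejected before any token line is examined, which mirrors the idea that Level 1 guards the file's backbone.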
|
|
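The Level 1 rules on the `ID` column can be sketched in a similar way. The function below is a simplified illustration, not the validator's code: it takes the `ID` values of one sentence in file order and reports violations of the ordering rules for words, multiword-token ranges, and empty nodes. The name `check_id_column` and the error wording are hypothetical.

```python
import re

WORD_RE = re.compile(r"[1-9][0-9]*")
RANGE_RE = re.compile(r"([1-9][0-9]*)-([1-9][0-9]*)")
EMPTY_RE = re.compile(r"([0-9]+)\.([1-9][0-9]*)")


def check_id_column(ids):
    """Return a list of error messages for one sentence's ID values."""
    errors = []
    next_word = 1          # next plain word ID we expect
    next_minor = 1         # next empty-node index after the current word
    last_range_end = 0     # end of the most recent multiword-token range
    range_pending = False  # True between a range line and its first word
    for raw in ids:
        if WORD_RE.fullmatch(raw):
            if int(raw) != next_word:
                errors.append(f"expected word ID {next_word}, found {raw}")
            next_word = int(raw) + 1
            next_minor = 1
            range_pending = False
        elif (m := RANGE_RE.fullmatch(raw)):
            start, end = int(m.group(1)), int(m.group(2))
            if start > end:
                errors.append(f"empty range {raw}")
            if start <= last_range_end:
                errors.append(f"overlapping range {raw}")
            if start != next_word:
                errors.append(f"misplaced range {raw}")
            last_range_end = max(last_range_end, end)
            range_pending = True
        elif (m := EMPTY_RE.fullmatch(raw)):
            major, minor = int(m.group(1)), int(m.group(2))
            if range_pending:
                errors.append(f"empty node {raw} after a multiword token line")
            if major != next_word - 1 or minor != next_minor:
                errors.append(f"unexpected empty node {raw}")
            next_minor = minor + 1
        else:
            errors.append(f"malformed ID {raw!r}")
    if last_range_end >= next_word:
        errors.append("range extends past the last word")
    return errors
```

For example, `check_id_column(["1", "2-3", "2", "3", "3.1", "4"])` returns an empty list, while `check_id_column(["1", "3"])` reports the skipped word ID.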
|
|
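For Level 2, the syntax rule for the `PARSEME:MWE` column can be illustrated as follows. This is a sketch under stated assumptions: it checks a single cell in isolation (so it does not verify that the category appears only on the first token of a VMWE, nor that identifiers are consecutive), it uses the standard `re` module rather than the third-party `regex` module, and the function name is hypothetical.

```python
import re

# Labels copied from the Level 2 list above.
VMWE_CATEGORIES = {
    "IAV", "IRV", "LS.ICV", "LVC.cause", "LVC.full",
    "MVC", "NotMWE", "VID", "VPC.full", "VPC.semi",
}

# A VMWE code: an identifier, optionally followed by ":" and a category.
# Assumption: identifiers are positive integers without leading zeros.
_CODE_RE = re.compile(r"([1-9][0-9]*)(?::(.+))?")


def is_valid_mwe_cell(cell):
    """Check the syntax of one PARSEME:MWE cell."""
    if cell in ("*", "_"):
        return True
    for code in cell.split(";"):
        match = _CODE_RE.fullmatch(code)
        if match is None:
            return False
        category = match.group(2)
        if category is not None and category not in VMWE_CATEGORIES:
            return False
    return True


print(is_valid_mwe_cell("1:VID;2"))  # True
print(is_valid_mwe_cell("1:FOO"))    # False
```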
|
|
The script is launched automatically whenever you push to your language's GitLab repository. For more information on this automatic run, please refer to the corresponding [section about managing branches](Managing-branches-in-the-git-repository-of-your-language). To check data before uploading, you can also run the `parseme_validate.py` script locally. Always keep your local copy updated to the latest version. The validator requires Python 3 and the third-party Python module `regex`, which can be installed via pip. Prior to invoking the validator, the following steps may be necessary:
|
|
|
|
... | ... | |