Title: SANKOFA – Semantic Annotation of Knowledge Fostering Akoma Ntoso
3.2.3 [ Qualifier of Preambular and Operational ](#three-2-3)
4. [ Technologies Details ](#four)
4.1 [NER](#four-1)
4.1.1 [Query phases](#four-1-1)
4.1.2 [Conclusion and Result](#four-1-2)
4.2 [Classifier SDG](#four-2)
4.2.1 [Background Information](#four-2-1)
4.2.2 [BoW-TFIDF](#four-2-2)
4.2.3 [Averaged Word Embedding and Doc2Vec](#four-2-3)
4.2.4 [Our Solution for SDG Classification](#four-2-4)
4.2.5 [Conclusion and Results](#four-2-5)
4.3 [Qualifier of Preambular and Operational](#four-3)
4.3.1 [A final pattern](#four-3-1)
4.3.2 [Resolution of Mistakes](#four-3-2)
4.3.3 [Dynamic Pattern Research](#four-3-3)
4.3.4 [Conclusion and Results](#four-3-4)
5. [Akoma Ntoso Marker-Converter](#five)
5.1 [Process of Conversion](#five-1)
5.2 [Conclusion and Results](#five-2)
6. [RDF Generation](#six)
7. [Milestones](#seven)
8. [Installation](#eight)
The SANKOFA project intends to produce a web application (API) that is capable of converting UN resolutions into Akoma Ntoso XML enriched with semantic annotations.

To achieve the above-mentioned objective, the project is divided into tasks:
1. **Detect NER** like role, event, organization, location, date.
2. **Classify the sentences** that talk about **SDG** using the definitions of the SDGIO.
3. **Qualify the sentences** as operational or **preambular** using linguistic patterns, and also detect the proposed action (e.g., noting, decides).
4. **Recognise the document structure** in its main parts: coverPage, preface, preamble, mainBody, conclusions, annexes. We also use the information detected in the first three steps in order to distinguish the hierarchical structure of the document and to tell the preambular sentences apart from the operational ones.
5. **Convert** all the extracted knowledge in **Akoma Ntoso**.
6. **Interpret** the extracted knowledge and create semantic assertions using the existing ontologies (e.g., ALLOT, UNDO, SDGIO, etc.). The idea is to create an RDF repository with those assertions.
Some principles lead our solution:
2. To use **authentic** sources and **authoritative** information, using **FRBR** approach for declaring the provenance;
3. To design the solution following ontology design patterns principles [xx];
4. To provide a **scalable** method that is customizable also for **other UN agencies**;
5. To design a **modularized** solution that is adaptable to **other kinds of documents** (e.g., report of conference, order of the day, constitution, basic texts, etc.);
6. To apply the same tools, with minimal customization, to the other **five languages** of the UN, following the principles of **portability** and **customization**.
The RESTful service for testing the whole pipeline is available [here](http://bach.cirsfid.unibo.it/unresolution2akn/)
The repo with all the documentation is available [here](https://gitlab.com/CIRSFID/un-challange-2019)
The license for all the material published in the light of this UN Challenge is: Creative Commons Attribution 4.0 International
<a name="three"></a>
## 3. Methodology of Knowledge Extraction
<a name="three-1"></a>
### 3.1 Related work
There are several methodologies for coping with the above-mentioned tasks, especially when starting from Word files that include some styling information.
<a name="three-1-1"></a>
#### 3.1.1 Use the Word styles
One easy method is to use the **word docx information** and the **styles** (e.g., italic, notes, font, size) for detecting the semantic parts of the text (e.g., italic could indicate an action, font size could indicate a structural part). This method apparently produces a very high percentage of success, but it is prone to several side effects: i) it depends on the specific style rules applied by a specific organization/agency of the UN; ii) those rules have changed over time, and this affects the accuracy of the method; iii) the style rules can change language by language (e.g., Chinese, Arabic, Russian); iv) some mistakes in the editing phase are possible; v) **scalability**, **modularization** and **portability** are not guaranteed. For these reasons, we don’t use this method in our solution.
<a name="three-1-2"></a>
#### 3.1.2 Machine Learning
For applying this method we need a corpus marked up beforehand as a gold standard and a strong training phase, followed by an evaluation phase made by real domain experts. This approach needs a long-term project for producing a robust corpus, a training set, and a supervised evaluation. For those reasons we think that this methodology could be properly used on the UN resolution document collections only in a second phase, when a relevant XML AKN corpus is available. Machine learning in the legal domain currently suffers from three main problems: i) machine learning is limited to a fragment of text (e.g., a sentence), but in the legal domain the context and the relationships between sentences are extremely important; ii) sentences include normative citations and references, and machine learning usually neglects this very significant aspect; iii) the legal domain is dynamic over time, and the rules detected in one period are not valid in another, so the regularities detected by machine learning are based on historical series that change over time. Finally, it is not **language independent**: we would need a training set for each of the six languages managed by the UN.
<a name="three-1-3"></a>
#### 3.1.3 Frames
Using frames like FrameNet or FRED [xx] is a good solution, but it needs a modelling of the most important situations in the UN resolution domain in order to avoid excessive fragmentation and to reduce the dimension of the tree. Modelling situations takes time and effort: we need to involve domain experts and organize a feedback workflow. Finally, the situations should be modelled for each language used in the UN, and some tools available in this sector are very effective only in English.
<a name="three1-4"></a>
#### 3.1.4 Linked open data
Some techniques use linked open data tools for detecting information in the text and merging it with Wikipedia information or with Linked Open Data Cloud ontologies. This method is not accurate and could produce invalid (not legally valid) information, which is not authoritative because it is not checked and validated. Secondly, the legal domain changes over time, so we need an ontological level capable of managing modifications over time, such as authority competences. One of the most important principles of our solution is to track the provenance of the information using the FRBR approach.
<a name="three-2"></a>
### 3.2 Our Solutions
We prefer to use the following technologies in order to make the results authoritative (not mixed with external, unchecked sources), legally valid, accurate, scalable, modular, portable, and language independent.
<a name="three-2-1"></a>
#### 3.2.1 NER
NER is a method for detecting fine-grained information inside the text, such as roles, persons, organizations, dates, locations, etc. We have used this approach for detecting legal knowledge inside the sentences.
We have to recognize the following named entities inside United Nations (UN) documents:
1. Roles (e.g., Secretary-General);
2. Organizations (e.g., United-Nations);
3. Deadlines (e.g., by 2030);
4. Persons (e.g., Ban Ki-moon);
5. Geo-political entities (e.g., countries like Nigeria);
6. Places (e.g., Vienna).
<a name="three-2-2"></a>
#### 3.2.2 Classifier SDG
We have to identify whether a paragraph of a United Nations (UN) document is related to one or more Sustainable Development Goals (SDG). Furthermore, every SDG may have different targets (sub-SDGs) that may change in time (some of the sub-SDGs have a short- or mid-term deadline: 2020, 2030).
More in detail, we need a Natural Language technique that should respect at least the following requirements:
1. The algorithm should be able to measure how similar a given paragraph is to an SDG and a sub-SDG.
2. The algorithm should allow us to easily change the SDG (and sub-SDG) definitions, without incurring in significantly slow and error-prone pre-processing processes (e.g. a slow model training process).
**Solution:** If we represent a SDG by a document describing it, then the 1st requirement can be met using geometrical encodings (embeddings) of words/documents. In other words, if we are able to associate a numerical vector to every document, then we can easily compute the similarity of two vectors/documents (eg. through cosine similarity).
Many models exist for document embedding, and probably some of the most famous are:
- BoW TF-IDF
- Averaged Word2Vec/GloVe/fastText
- Doc2Vec
- Etc…
The first model (BoW TF-IDF) is probably the fastest to build/train, especially because it does not require a huge amount of training data and it does not require hyper-parameter tuning. The other models are much slower to train (because they are usually modelled with Artificial Neural Networks), they perform better when trained on huge datasets, and they usually depend on a lot of hyper-parameters. Fortunately, many pre-trained Word2Vec/GloVe/fastText models, trained on very big and generic datasets, are easily available on the web. However, these pre-trained models are not optimized for most domain-specific tasks.
In order to perform SDG classification, we have to build a model for every SDG and sub-SDG, but the available SDG descriptions are too few for training an ANN-based model.
For this reason we designed a new ensemble method that effectively combines generic (non domain-specific) Averaged GloVe document similarities with domain-specific BoW-TFIDF document similarities.
This way we do not need any complicated and error-prone learning phase for building the document embedding models, thus allowing us to easily tackle also the 2nd requirement, because we can easily change the index corpus.
<a name="three-2-3"></a>
#### 3.2.3 Qualifier of Preambular and Operational
The Qualifier module has to perform two tasks:
1. qualifying a given paragraph as “preambular” or “operational”;
2. identifying the “Terms” that characterize the qualification, the so-called action (e.g., “alsoConsidering”).
Within the pipeline, it is used to process paragraphs one by one, giving as a result:
1. a qualification (“preambular” or “operational”);
2. a starting and ending offset which indicates where the terms start and end.
**Methods**
A paragraph is “preambular” when it belongs to a preamble. Usually it starts with a verb in –ing form, or an adjective. Some examples are:

Sentence beginning | Term
--------------------------------------- | --------------------
Concerned… | concerned
However, we sometimes find preambular sentences in the body of the resolution, as in this example:
![alt text](preambular-operational.png)
Figure 1 - A/RES/68/247 B, N1429631.doc, N1429631.xml
A paragraph is “operational” when it is not part of a preamble. Usually it starts with a verb in the present tense. Some examples are:

Sentence beginning | Term
--------------------------------------- | --------------------
Takes note with appreciation… | takesNoteWithAppreciation
Reaffirm also… | reaffirmAlso
As already stated in[…], encourages… | encourages
The module uses a tokenizer to create **a POS-Tag sequence of the first tokens**; in fact, the first tokens contain all the information we need.
For this reason, we create a list composed of pairs of tokens and labels. For example, considering just the first seven tokens of the sentence “Having considered in depth the question of Western Sahara”, we will have a tokenization as follows:
TOKEN | POS-tag
----------------|----------------
Having | VBG
considered | VBN
in | IN
depth | NN
the | DT
question | NN
of | IN
The first token already shows that we are dealing with a “preambular” sentence.
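For illustration, a tokenization like the one above could be produced with spaCy (a minimal sketch; `en_core_web_md` is the model mentioned in the Installation section, and the real pipeline may differ):

~~~~
import spacy

nlp = spacy.load("en_core_web_md")

sentence = "Having considered in depth the question of Western Sahara"
doc = nlp(sentence)

# print the (token, POS-tag) pairs of the first seven tokens
for token in list(doc)[:7]:
    print(f"{token.text:<12}{token.tag_}")
~~~~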
The qualifier is also able to deal with more complex structures, where the first tokens are not enough to fulfil a prediction, like those sentences that start with an introductory part:
> “As a contribution to the preparatory process for this Global Compact, we recognize the … ”
In this case, the qualifier will also check the other parts of the sentence, searching for significant patterns of POS-tags, in the attempt to perform a qualification.
<a name="four"></a>
## 4. Technologies Details
<a name="four-1"></a>
### 4.1 NER
We used the Spacy [1] Named-Entities Recognizer (NER) for recognizing all the entities except the 1st one (the roles). The Spacy NER is based on Artificial Neural Networks (ANN). Spacy has several pre-trained models for English:
We found that the Spacy NER performs poorly in recognizing roles (accuracy close to zero); for this reason we adapted the algorithm used for SDG classification in order to perform Role Entity Recognition. In particular:
- we changed the query similarity and classification phases.
<a name="four-1-1"></a>
#### 4.1.1 Query phases
There are three query phases:
1. pre-processing: in this phase we build a list of composite tokens
Now, if a similarity group G has more than one composite token, then we have to combine the similarities of its tokens.
In order to do that, we sum the similarity vectors of the composite tokens in G, obtaining in this way the similarity vector S of G; then we can sort these similarities in descending order and get a similarity ranking. Finally, every group G having S > T, where T = 1.25 is the group threshold, can be classified as a group of words representing a Role Entity.
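A minimal numpy sketch of this grouping-and-thresholding step (variable names and the aggregation of S into a single score are illustrative assumptions):

~~~~
import numpy as np

# assumption: each composite token of a group G comes with a vector of
# similarities against every document in the roles corpus
group_G = {
    "secretary": np.array([0.90, 0.20, 0.10]),
    "general":   np.array([0.75, 0.30, 0.15]),
}
GROUP_THRESHOLD = 1.25  # T, as reported above

# the similarity vector S of G is the sum of its tokens' similarity vectors
S = sum(group_G.values())

# rank the roles corpus by descending similarity
ranking = np.argsort(-S)

# classify G as a Role Entity if its best similarity exceeds the threshold T
print(bool(S[ranking[0]] > GROUP_THRESHOLD))
~~~~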
<a name="four-1-2"></a>
#### 4.1.2 Conclusion and Result
The Named-Entities Recognizer we used is the same default NER coming with Spacy. But that NER has not been trained to recognize Role Entities, thus we have modified the algorithm used for SDG classification in order to perform Role Entity Recognition (RER).
One of the advantages of the new RER algorithm is that:
- it performs quite well with relatively small training sets.
But the main disadvantage of the new RER algorithm is that its time complexity depends on the size of the corpus; in other words, the bigger the roles corpus, the slower the algorithm (linearly).
The results are the following.
The NER extraction of geo-political entities in Spacy:
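A minimal sketch of such an extraction with the default Spacy pipeline (the input sentence is invented for illustration; the real pipeline runs on the resolution paragraphs):

~~~~
import spacy

nlp = spacy.load("en_core_web_md")

text = "Recalling its previous resolutions on the question of Western Sahara, and welcoming the efforts of Nigeria and Morocco,"
doc = nlp(text)

# geo-political entities carry the label "GPE" in the default spaCy models
print([(ent.text, ent.start_char, ent.end_char) for ent in doc.ents if ent.label_ == "GPE"])
~~~~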
<a name="four-2"></a>
### 4.2 Classifier SDG
#### 4.2.1 Background Information
We know that several techniques exist for learning geometrical encodings of words from their co-occurrence information (how frequently they appear together in large text corpora). The goal of word embeddings is to capture the meaning of a word using a low-dimensional vector, and these embeddings are ubiquitous in natural language processing because they allow us to perform arithmetical operations on natural language words.
Document embedding is somehow related to word embedding, but it is a different task because the granularity of the embedder input shifts from words to documents. For this reason, document embedding is said to be harder to get than word embedding and in many cases it is built upon word embeddings.
Some famous document embedding techniques are:
- BoW-TFIDF;
- Singular Value Decomposition (used by Latent Semantic Analysis);
- Averaged Word2Vec / GloVe / fastText;
- Doc2Vec.
#### 4.2.2 BoW-TFIDF
BoW-TFIDF stands for: **Bag of Words based Term Frequency - Inverse Document Frequency**. In BoW, documents are described by word occurrences while completely ignoring the relative position information of the words in the document. BoW tokenizes the documents, counts the occurrences of the tokens, and returns them as a sparse matrix.
The BoW model can reasonably convert raw text to numbers. However, if our purpose is to **identify signature (important) words in a document**, there is a better transformation that we can apply. Here, by “signature words in a document” we mean all those words that are important to summarize the meaning of the document.
The TF-IDF is the product of two statistics: term frequency and inverse document frequency. **Term frequency** (TF) is basically the output of the BoW model. For a specific document, it determines how important a word is by looking at how frequently it appears in the document. Term frequency measures the local importance of the word. [2]
The second component of TF-IDF is the **Inverse Document Frequency (IDF)**. For a word to be considered a signature word of a document, it shouldn’t appear that often in the other documents. Thus, the frequency among different documents of a signature word must be low; in other words, its inverse document frequency must be high. For example, the word “and” might reasonably have a high TF for a specific document, but this does not mean that “and” is an important word for that document; in fact, it also appears in most other documents, so its IDF is low.
The TF-IDF is the product of these two frequencies. For a word to have high TF-IDF in a document, it must appear a lot of times in said document and must be absent in the other documents. It must be a signature word of the document. [2]
The similarity between TF-IDF embeddings is said to be **syntagmatic (topic related)** and it is usually measured through cosine similarity.
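A minimal sketch of this syntagmatic similarity, here computed with scikit-learn for brevity (our own TF-IDF pipeline, described in Section 4.2.4, is built with a fixed dictionary and a custom pre-processing chain):

~~~~
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# toy corpus: one (shortened) SDG description per class
corpus = [
    "end poverty in all its forms everywhere",
    "ensure healthy lives and promote well-being for all at all ages",
]
query = ["eradicating extreme poverty for all people everywhere"]

vectorizer = TfidfVectorizer()
corpus_vectors = vectorizer.fit_transform(corpus)  # build the TF-IDF model on the corpus
query_vector = vectorizer.transform(query)         # embed the query with the same vocabulary

print(cosine_similarity(query_vector, corpus_vectors))  # one similarity per corpus document
~~~~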
#### 4.2.3 Averaged Word Embedding and Doc2Vec
Word2Vec [9], GloVe [5] and fastText [10] are unsupervised learning algorithms for word embedding, based on **artificial neural networks**. Word embedding is a type of mapping that allows words with similar meaning to have similar representation. The basic idea behind word embedding (and distributional semantics) can be summed up in the so-called **distributional hypothesis** [6]: linguistic items with similar distributions have similar meanings; words that are used and occur in the same contexts tend to purport similar meanings.
All the aforementioned word embedding algorithms consist of a shallow or deep Artificial Neural Network, usually trained by means of Stochastic Gradient Descent, intuitively with the goal of optimally predicting a word given its context or vice versa.
For example, the **GloVe** model is trained on the non-zero entries of a global word-word co-occurrence matrix, which tabulates how frequently words co-occur with one another in a given corpus. GloVe is essentially a log-bilinear model with a weighted least-squares objective. The main intuition underlying the model is the simple observation that ratios of word-word co-occurrence probabilities have the potential for encoding some form of meaning. The training objective of GloVe is to learn word vectors such that their dot product equals the logarithm of the words' probability of co-occurrence. [5]
It is worth mentioning that both Word2Vec and GloVe are not able to handle new words (words not used during the training phase), while fastText can produce **word embeddings also for new words**.
A naive approach to building document embeddings through word embedding might be averaging the word embeddings of a document. This naive approach is called **Averaged Word Embedding (AWE)**. One of the disadvantages of this document embedding technique is that it is not sensitive to word ordering: for example, the sentences “This is a meaningful sentence” and “This a meaningful sentence is” have the same AWE, but only the first one is properly formed and has a meaning. Another disadvantage of ANN-based word embedding algorithms is that they usually require a significantly large amount of training data.
A more sophisticated approach for document embedding might be **Doc2Vec** [8]. Doc2Vec is very powerful when combined with ANNs and it represents the state of the art for document embedding but, like AWE, it might require a huge amount of data in order to express its real potential. Thus Doc2Vec is probably the best choice if we have a big enough dataset of documents regarding a specific linguistic domain, but this is usually not the case when working with **uncommon languages or very specific domains**.
Anyway, an important aspect of these representations is the ability to solve word analogies in the form “A is to B what C is to D”, by using simple arithmetic. For example, in Word2Vec, we might see that the following word embeddings equations are valid [1]:
- “Paris  -  France + Germany = Berlin”
- “King - Man + Woman = Queen”
Thus, the similarity between these embeddings is said to be paradigmatic and it is usually measured through cosine similarity.
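A minimal sketch of Averaged Word Embedding with spaCy (whose medium and large English models ship pre-trained word vectors; for such models `Doc.similarity` computes the cosine between the averaged vectors):

~~~~
import numpy as np
import spacy

nlp = spacy.load("en_core_web_md")  # model shipping pre-trained word vectors

doc_a = nlp("end poverty in all its forms everywhere")
doc_b = nlp("eradicating extreme poverty for all people")

# Averaged Word Embedding: average the word vectors of the tokens
awe_a = np.mean([token.vector for token in doc_a], axis=0)
awe_b = np.mean([token.vector for token in doc_b], axis=0)

# paradigmatic similarity = cosine of the averaged vectors
cosine = np.dot(awe_a, awe_b) / (np.linalg.norm(awe_a) * np.linalg.norm(awe_b))
print(cosine, doc_a.similarity(doc_b))
~~~~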
#### 4.2.4 Our Solution for SDG Classification
If we represent SDGs (and sub-SDGs) by descriptive documents taken from [3], then we want to produce good-enough document embeddings of these SDG descriptions in order to easily perform document similarity through arithmetic.
We might produce document embeddings by training from scratch a Doc2Vec algorithm or some other ANN-based algorithm, but our dataset is made of SDG descriptions taken from [3] and it is too small for any ANN-based approach, because the SDGs are just 17 and even the sub-SDGs are always fewer than 20. In other words, under these conditions it seems that we cannot train an ad hoc ANN-based model without data augmentation, because we definitely do not have enough data.
For this reason we designed a new ensemble method that effectively combines **generic (non domain-specific) Averaged GloVe** document similarities with **domain-specific BoW TF-IDF** document similarities.
Our solution tries to exploit the best from both the aforementioned techniques:
- The GloVe model is much slower to train and it requires a huge amount of data for proper training, but there exist many models pre-trained on very generic datasets that can be easily exploited. We use a GloVe model pre-trained on data coming from Common Crawl; this data is not domain-specific, thus the resulting word embeddings tend to lose information when used in specific domains such as UN documents. A solution to the domain-specificity problem can be transfer learning, but even when adopting the classical transfer learning approach we would need a significantly bigger dataset than the one we have. Thus, we use AWE based on GloVe to model only generic (non domain-specific) information (e.g. semantic relationships among non domain-specific words).
- The BoW-TFIDF model is fast to train and it does not require a huge dataset or hyper-parameter tuning, but it is a shallow learning technique and it lacks semantic expressivity when compared to techniques such as GloVe. However, TF-IDF has been specifically designed to extract document signatures, i.e. topic information. Thus, we can use BoW-TFIDF to model domain-specific information.
Let A (the query) and B (a corpus document) be two distinct documents; we want to compute the similarity between A and B.
In order to do that, we combine:
- The cosine similarity of the TFIDF embeddings of A and B: that is a **topical/syntagmatic** similarity, extracted by populating the vectors with information on which text regions the linguistic items occur in [4]. On the semantic level, syntagmatic associations indicate compatible combinations of words.
- The cosine similarity of the average of the GloVe word embeddings of A and B: that is a **paradigmatic** similarity, extracted by populating the vectors with information on which other linguistic items the items co-occur with \[4\]. On the semantic level, paradigmatic substitutions allow items from a semantic set (synonyms, antonyms, etc.) to be grouped together.
In other words, the idea behind this ensemble is to combine the unique and different properties of the aforementioned similarities, in order to get a new paradigmatic similarity potentially able to express topic similarity on a domain on which the GloVe model has not been trained.
Our way of combining TFIDF with Word2Vec/GloVe differs from the one adopted, for example, in \[7\] or in \[8\]. In \[7,8\] the document embedding is obtained by averaging the Word2Vec/GloVe word embeddings weighted by their TF-IDF weights. In our approach, instead, we combine document similarities rather than word embeddings. In principle, the technique adopted in [7] might be used to improve the paradigmatic similarity used in our technique.
The **pipeline** of our **algorithm** is as follows:
1. **Corpus pre-processing**: for building the TF-IDF model we need properly formatted data.
2. **TF-IDF model building**: only once, before performing any query.
3. **Query pre-processing**: same as corpus pre-processing.
4. **Query similarity computation**.
5. **Query classification**: classify the query according to its similarity.
As the backbone Python library we decided to use Spacy [11] for pre-processing and word embedding. Furthermore, we used the Snowball stemmer implementation coming with NLTK [12] for stemming.
The default Spacy Language (Pre-)Processing Pipeline is the following:
- Tokenizer: Segment text into tokens.
- Tagger: Assign part-of-speech tags.
- Parser: Assign dependency labels.
- NER: Detect and label named entities.
During the TF-IDF pre-processing phase (for both corpus documents and queries) we perform the following steps:
1. Replace upper-cases with lower-cases.
2. Replace every occurrence of "sustainable development goal” with “sdg”, and set as word embedding of “sdg” the sum of the embeddings of "sustainable”, “development” and “goal”.
3. Replace every occurrence of:
   - “sdg 1”, “sdg 2”, …, “sdg 17”;
   - “1st sdg”, “2nd sdg”, …, “17th sdg”;
   - “first sdg”, “second sdg”, …, “seventeenth sdg”;

   with “sdg1”, “sdg2”, …, “sdg17”, and set as their word embeddings the sum of the embedding of “sdg” with the embedding of the respective number.
4. Perform tokenization and lemmatization.
5. Perform **stemming** on lemmas.
6. Remove stop-words (as defined by Spacy) and punctuation.
We have empirically observed that **stemming helps TF-IDF in achieving greater generalization** and better results in SDG classification.
We decided to consider the words _“Sustainable Development Goal”_ as a unique token and, furthermore, to give a unique token to every SDG (SDG1 stands for the first SDG, and so on); this is the reason behind the occurrence replacements described at points 2 and 3. We took this decision in order to better classify all those SDGs explicitly mentioned through their unique identifier (e.g. SDG 3 for the third SDG).
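A condensed sketch of this pre-processing chain (the replacement rules are reduced to two regular expressions here; the real module also handles the ordinal forms and the word-embedding bookkeeping described above):

~~~~
import re
import spacy
from nltk.stem.snowball import SnowballStemmer

nlp = spacy.load("en_core_web_md")
stemmer = SnowballStemmer("english")

def preprocess(text):
    text = text.lower()                                            # 1. lower-case
    text = re.sub(r"sustainable development goals?", "sdg", text)  # 2. "sustainable development goal" -> "sdg"
    text = re.sub(r"sdg (\d{1,2})", r"sdg\1", text)                # 3. "sdg 3" -> "sdg3"
    # 4-6. tokenize, lemmatize, stem, and drop stop-words and punctuation
    return [stemmer.stem(token.lemma_)
            for token in nlp(text)
            if not token.is_stop and not token.is_punct]

print(preprocess("Reaffirming Sustainable Development Goal 3 on healthy lives."))
~~~~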
The corpus is usually defined by N different documents:
- For the sub-SDG classification problem we build the corpus similarly to the SDG classification one, but instead of using the definitions of the SDGs as corpus we use the definitions of the sub-SDGs of a SDG. Thus we have a separate sub-SDG classifier for every SDG.
We have empirically observed that the bigger the SDG (or sub-SDG) definition, the better the resulting TF-IDF model is at computing syntagmatic similarities.
In other words, if we use as SDG definitions only the SDG titles (a few tokens per SDG), the resulting model in practice performs poorly. But if we use as SDG definitions the whole text used in [3] for describing every SDG (hundreds of tokens per SDG), then the resulting TFIDF model performs much better.
We **build the TF-IDF model only once**, using the corpus and performing the following steps:
1. Build a fixed Dictionary of all possible words in the corpus.
2. Using the aforementioned Dictionary, get the Bag-of-Word of every document in the corpus.
3. Build the TF-IDF model using the aforementioned BoWs.
After we have built the TF-IDF model we can compute a query similarity as follows:
1. Get the BoW of a query Q using the fixed Dictionary and compute its TF-IDF vector.
2. Compute the TFIDF cosine similarity (T) between the query vector and the vector of every document in the original corpus. The result should be a vector T of N real numbers in [0,1], each one representing the similarity of the query and a document in the corpus.
3. Compute the GloVe AWE cosine similarity (G) between the whole corpus and the query, using the pre-defined Spacy method. The result should be a vector G of N real numbers in [0,1], each one representing the similarity of the query and a document in the corpus.
4. Compute R: the average of G. R should be a measure of how much Q is relevant to the SDG (or sub-SDG) topic.
5. Compute C = (G + T)\*R. Where C here is the combined similarity. Furthermore in this formula G is said to be the semantic shift, while R is said to be the paradigmatic topic weight.
The intuitive idea behind using the semantic shift G and the paradigmatic topic weight R is that the TFIDF similarity T is high for a query Q and a document D when the query words and the document words are similar, but T is a syntagmatic similarity and thus may be lower when Q contains words in the synsets of D. Thus, in order to address the aforementioned synset-words problem we sum T with a paradigmatic similarity G before scaling it by R. We scale (T + G) by R in order to give significantly more similarity to the queries paradigmatically more inherent to the corpus topics.
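Assuming T and G are already available as numpy vectors (one entry per corpus document), the combined similarity of a query can be sketched as follows:

~~~~
import numpy as np

# hypothetical similarities of one query against a corpus of three SDG descriptions
T = np.array([0.10, 0.55, 0.05])  # TF-IDF (syntagmatic) cosine similarities
G = np.array([0.40, 0.70, 0.35])  # averaged-GloVe (paradigmatic) cosine similarities

R = G.mean()     # paradigmatic topic weight: how inherent the query is to the corpus topic
C = (G + T) * R  # combined similarity, as in step 5 above

print(C)
~~~~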
Now that we have the combined similarity, we can use it in order to perform SDG (or sub-SDG) classification. A query can be classified as related to a specific SDG or not. Thus, we have to understand when a query is not related to any SDG. In order to do this, we have to choose a similarity threshold T.
Empirically we observe that the bigger the query, the smaller the value of C tends to be. Thus we hypothesize that T is a function of the size of the query; for this reason, in order to perform SDG classification using C, we perform the following steps:
1. We compute W = C \* (1 + L), where L is the base-2 logarithm of the number of tokens in the query. This is called log-length scaling.
2. We sum the weighted similarity of all the bias documents to the weighted similarity of the corresponding class document, thus obtaining the biased similarity B. In other words, the bias documents add bias to the corresponding class document only.
3. Let M be the average of B, we compute B = B - M in order to center the biased similarity vector B. Please note that we do NOT normalize B by its standard deviation. The goal of centering B is to give more focus on the variance of the query similarity to the corpus.
4. We sort the class documents D ordered by descending biased similarity B.
5. We set T = 0.75, and we get the index of all the class documents V having B > T.
6. If the set of V is empty, then the query Q is said to be not related to any class (document). Otherwise we have the ranking of the most related classes to Q (one or even more).
The intuitive idea behind the scaling of C by L is that the bigger is the query Q, the (smoothly) lower is C. We sum 1 to L before scaling because otherwise queries having length 1 would have W equal to 0. Queries having length 1 might be reasonable, for example a query containing only the token “SDG1” should be classified as SDG 1.
We set T = 0.75 because we empirically found that 0.75 is a good threshold. But a more robust approach would be to apply an automatic regression technique on a labelled dataset in order to find the optimal value of T.
<a name="four-2-1"></a>
### 4.2.1 Conclusion and Results
#### 4.2.5 Conclusion and Results
We designed a new ensemble method that effectively combines generic (non domain-specific) Averaged GloVe document similarities with domain-specific TF-IDF document similarities, for achieving SDG classification of UN document paragraphs. Furthermore, we have also shown how to improve this algorithm by using the Universal Sentence Encoder [13]. The algorithm we described is quite versatile and powerful. In fact, it is able to perform multi-class classification, it does not require much hyper-parameter tuning (practically only the value of T has to be tuned), it is super fast to train, it allows us to easily change the class definitions without incurring in significantly slow and error-prone pre-processing processes, and it performs quite well with relatively small training sets. But it is very important to mention that we have not properly evaluated our results, due to the lack of a gold standard or even a big enough test set annotated by domain experts.
We separately annotated two different test-sets (annotated by different people):
- Test-Set A: made of 128 annotated paragraphs (by F. Sovrano)
- Test-Set B: made of 112 annotated paragraphs (by F. Draicchio)
Those 2 sets share around 50 paragraphs.
We performed several experiments in order to understand how good our classifier is at recognizing concepts related to SDG goals (we have not tested SDG targets yet).
The classifier output is a prediction set; every time the classifier predicts more than one class, we compare the prediction set with the set of annotated classes:
- If the intersection set is not empty, then we use as Truth and Guess the first class in the intersection set. Please note that a set in Python is “randomly” ordered.
We use Truth and Guess to compute all the following statistics, by using the functions provided by sklearn.metrics [18].
For the algorithm described so far, we get the following statistics:
Test-Set A
<a name="four-3"></a>
### 4.3 Qualifier of Preambular and Operational
Let first_word, second_word, third_word, etc., be the first words of a paragraph, and let first_label, second_label, etc., be their labels. The qualifier searches for patterns **following strict linguistic patterns** and exploiting **the power of the SpaCy POS-tagger**.
In particular, the algorithm searches for patterns in a straightforward way: it starts checking the paragraph from the first token-tag pair, decides what to do next, and moves step by step through the following words.
Only in limited cases does it search for specific tokens (in rare verbal phrases such as “bearing in mind” or “pay tribute to”).
In the other cases, the qualifier searches for patterns such as COMMA + VERB and COMMA + PRONOUN (some patterns are used only for getting phrasal verbs).
<a name="four-3-1"></a>
#### 4.3.1 A final pattern
After having detected the main term (e.g. “alsoRequest”), the qualifier also searches for a final pattern “with” + JJ + NN (e.g. “with grave concern”), in order to expand the term into “alsoRequestWithGraveConcern”.
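A sketch of how such a final pattern could be matched with the spaCy `Matcher` (illustrative only; the actual qualifier uses its own pattern machinery):

~~~~
import spacy
from spacy.matcher import Matcher

nlp = spacy.load("en_core_web_md")
matcher = Matcher(nlp.vocab)

# "with" followed by an adjective (JJ) and a noun (NN), e.g. "with grave concern"
matcher.add("WITH_JJ_NN", [[{"LOWER": "with"}, {"TAG": "JJ"}, {"TAG": "NN"}]])

doc = nlp("Also requests with grave concern that the parties cooperate fully")
for _, start, end in matcher(doc):
    tail = "".join(token.text.capitalize() for token in doc[start:end])
    print("alsoRequest" + tail)  # -> alsoRequestWithGraveConcern
~~~~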
<a name="four-3-2"></a>
#### 4.3.2 Resolution of Mistakes
Interestingly, we also prepared the algorithm to prevent POS-tag mistakes.
Sometimes, for example, a paragraph starting with a verb is tagged as a noun. For example, if the first word is “requests”, the POS-tagger could interpret it as an NNS (plural noun). For this reason, we created a function that adds a virtual subject before any NN or NNS found at the beginning of a paragraph, because such tokens are very likely to be verbs. With this adjustment, the POS-tagger correctly detects a verb instead of a noun.
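A sketch of the virtual-subject adjustment (the prepended pronoun and the function name are illustrative; only the tagging of the original first token is affected):

~~~~
import spacy

nlp = spacy.load("en_core_web_md")

def tag_first_token(paragraph):
    first = nlp(paragraph)[0]
    if first.tag_ in ("NN", "NNS"):
        # a paragraph-initial noun is very likely a mis-tagged verb:
        # re-tag with a virtual subject in front and skip it in the result
        return nlp("It " + paragraph)[1]
    return first

token = tag_first_token("Requests the Secretary-General to submit a report")
print(token.text, token.tag_)  # ideally a verb tag (e.g. VBZ) after the adjustment
~~~~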
<a name="four-3-3"></a>
#### 4.3.3 Dynamic Pattern Research
In some cases, a paragraph can start with informative components that do not determine the qualification of the paragraph.
For example, in the sentence “Given the recent facts, urges…” the part before the comma is automatically excluded by the qualifier. In this way, only “Urges” will be considered as an element for discriminating between the ‘preambular’ and ‘operational’ categories, and for creating the right term (“urge”).
**The ability to generalize and the need for some ad hoc pattern**
For some special cases, ad-hoc rules have been created, for example: “Bearing in mind”, “Keeping in mind”, “Pay tribute”. However, their occurrence is not frequent. Having chosen very general POS-tag patterns, the qualifier’s ability to generalize is relatively high.
**Exceptions**
We have detected two rare exceptions like the following:
1. Sentences starting with "re-", for example:
1. Sentences starting with a 3rd person verb that begins with "re-", for example:
> *Re-emphasizes the need to …*
![alt text](re-emphasizes.png)

Figure 6 - N0448880.doc
2. Inside the **declarations** that are **annexes of resolutions** (e.g., A/RES/66/2), we find sentences starting with a subject pronoun (usually “we”) separated from its own predicate, with other verbal forms included in that separation:
> *we, heads of state and government, assembled at the united nations on 27 september 2018 to undertake a comprehensive …*
![alt text](we-sentence.png)

Figure 7 - N1247866.pdf

For now those exceptions are not detected, but we can fix this in one month.
<a name="four-3-1"></a>
### 4.3.1 Conclusions and Results
#### 4.3.4 Conclusions and Results
After having created a dataset of 1000 sentences, manually annotated as “preambular” or “operational”, we have elaborated a preliminary testing/validation, which is the following:
![alt text](confmatrix_qual.png)
The Marker-Converter converts this information into \<term\> elements with the related TLC elements in the references:
<p><term refersTo="#underlining">Underlining</term> the fact that mutual understanding, dialogue, cooperation, transparency and confidence-building are important elements in all activities for the promotion and protection of human rights,</p>
</container>
~~~~
Figure 8 - N1642803.doc, N1642803.xml
~~~~
<container name="reaffirmingAlso" eId="container_preamble_pg14">
<p><term refersTo="#reaffirmingAlso">Reaffirming also</term> the hope that, in appointing highly qualified lecturers for the seminars to be held within the framework of the fellowship programmes in international law, account would be taken of the need to secure the representation of major legal systems and balance among various geographical regions,</p>
</container>
~~~~
Figure 9 - N1248556.doc, N1248556.xml
<a name="five"></a>
## 5. Akoma Ntoso Marker-Converter
This module produces the Akoma Ntoso XML using regular expressions and heuristics for detecting coverPage, preface, preamble, body, conclusions, annexes, and tables.
This module reuses the knowledge from all the previous steps to correctly mark up the semantic parts of the text.
**Even if the AKN4UN guidelines define the resolutions as a \<documentCollection\> composed of different parts, in this challenge we have preferred to simplify the structure using \<statement\> only. It will not be a big deal to wrap the result in a \<documentCollection\> later on.**
<a name="five-1"></a>
### 5.1 Process of Conversion
The first step of the conversion consists in loading the provided word document and converting it (or rather: its parts) into txt.
The second step is parsing the text top to bottom and using pattern matching to identify structural elements such as the document title, number, the paragraphs, sections, annexes and so on.
The pattern matching process uses replus, which provides a method to write modular, template-based, extensible regular expressions.
Depending on the result of the pattern matching on the text, the text itself is mapped as-is into Objects that work as a proxy for Akoma Ntoso xml generation.
Before the objects are appended, the text is qualified via a paragraph_qualifier, whose job is to determine whether the text represents a preambular or an operational element; then it is appended accordingly.
Once the objects are all appended, a downward recursive algorithm is used to ensure that all the elements are placed according to their hierarchical value (e.g. if a section and a paragraph happen to be siblings, the latter will be set as a child of the former).
The next step is to correctly generate the eIds of the objects (which may have prefixes depending on their parent(s)).
Once the structure is in place, it is possible to run pattern matching and machine learning algorithms to identify all inline elements, such as dates, references, roles, organizations and so on.
The inline pattern matching also uses replus; the match objects are passed through a series of resolvers which will extract the metadata, build the attributes and the corresponding Top-Level Concept to be added to the AKN references.
Other than regexes, inline elements are recognized using spaCy with some customized NER.
Once the structure and the inlines are done, another spaCy-powered custom algorithm identifies SDG with their targets and respective indicators. The results are mapped into AKN keywords, references and custom name-spaced (akn4un) elements that will link the results to their corresponding elements.
The last step simply consists in writing the AKN to an XML file and validating it.
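The write-and-validate step could, for instance, rely on lxml and the Akoma Ntoso XSD (a sketch; the schema location and file names are placeholders):

~~~~
from lxml import etree

# placeholders: the real paths depend on the project layout
schema = etree.XMLSchema(etree.parse("akomantoso30.xsd"))
tree = etree.parse("data/output/N1642803.xml")

if schema.validate(tree):
    print("valid Akoma Ntoso document")
else:
    for error in schema.error_log:
        print(error.line, error.message)
~~~~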
<a name="five-2"></a>
### 5.2 Conclusion and Results
We have processed the UN documents with the following results:
<a name="seven"></a>
## 7. Milestones

We have completed about 60% of the work with respect to the expected tasks. We need more time to complete the following activities:
3. QUALIFIER: adjectives, etc. in the preambular sentences – duration: one month; effort: one month;
4. MARKER-CONVERTER: 1) eId of Annex – one week; 2) table, component – two weeks; 3) semantic parts and references – two weeks. Duration: one month; effort: one month;
5. ONTOLOGY/RDF: Duration: two months; effort: two months;
6. All the tasks need a scientific evaluation by third parties. We would like to ask external people (e.g., students of the Summer School LEX or UN experts) – Duration: three months; effort: three months;
7. Refinement of the software – three months.
<a name="eight"></a>
## 8. Installation
- requires python3.7+
- clone this repo
- cd to the repo/development
- create a virtual environment: _python3 -m venv venv_
- load the virtual environment: _source venv/bin/activate_
- install the dependencies: _pip install -r requirements.txt_
- install the spaCy model: _python -m spacy download en_core_web_md_
<a name="eight-1"></a>
### 8.1 Usage
- download all the documents: python run.py --download
- to parse one document: python run.py --parse \<filepath\>
- to parse all the documents: python run.py --parseall
- to use with a GUI: python run.py --gui \[--port: port_no\] (it will return \*.akn zip archives, also saved locally in keld/server/converted/)
**All the converted files will be written in the directory _data/output_**
<a name="eight-2"></a>
### 8.2 Troubleshooting
If you are experiencing problems with import errors, export the PYTHONPATH so that it includes the repository root, for example:
~~~~
# adjust the path to your checkout
export PYTHONPATH="$(pwd):$PYTHONPATH"
~~~~