# Status: Draft

# Natural Language Processing Scenario

Human communication relies more and more on digital text exchange, such as websites, emails and chat systems. Although humans can easily interpret a text written in a natural language they have learned, the vast amount of data to read and scan for useful information is simply overwhelming. Therefore, during the last decades, many advances have been made in using machines for automatic classification, interpretation and translation of textual data.

The most prominent Natural Language Processing (NLP) systems are search engines, since many people use them on a daily basis. But correctly interpreting a user's search query and the content of websites to provide the most relevant search results is a major challenge. For example, a user who is looking for the final score of last night's soccer game could formulate the search query "Bayern München score". The search engine must then be capable of interpreting "Bayern München" as an organisation and not as the two locations "Bayern" and "München". In this case, the word "score" is a good hint for the search engine to correctly tag the entity "Bayern München".

## Named Entity Recognition

The search engine example describes a common NLP information extraction task called Named Entity Recognition (NER), where a sentence or sequence of words is given to an algorithm as input, which then produces a sequence of NER tags as output. The NER problem can be approached from two angles: a computational linguist can program a complex set of handwritten rules (e.g. regular expressions) to tag words, but creating this kind of expert system requires a lot of human labor, because every sentence is different and each word can occur in countless contexts. We therefore focus on a more general approach based on machine learning techniques, where an algorithm learns to solve the problem from an existing set of training data. A well trained and generalized machine learning system will then be able to produce the correct output for a given sentence, even if that sentence was not included in the training data. The requirement for this to be successful is a large text corpus of sentences in which each word has already been correctly annotated with its corresponding named entity tag.

## Named Entity Tags

The following sentence is a *simplified* example from the [GermEval 2014](https://www.lt.informatik.tu-darmstadt.de/de/data/german-named-entity-recognition/) corpus:

| Term | Tag |
| --- | --- |
| Bayern | ORG |
| München | ORG |
| ist | O |
| wieder | O |
| alleiniger | O |
| Top- | O |
| Favorit | O |
| auf | O |
| den | O |
| Gewinn | O |
| der | O |
| deutschen | LOC |
| Fußball-Meisterschaft | O |
| . | O |

In this example each term of the sentence is annotated with a named entity tag. Both terms "Bayern" and "München" are tagged with `ORG` (organisation), as expected in this context. The term "deutschen" is tagged as `LOC` (location). Every term not representing a named entity is tagged as `O`. This notation scheme is called NER-IO.
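
To make the tagger's input and output concrete, here is a deliberately naive lookup sketch in Python. It assigns one NER-IO tag per term from a tiny hard-coded lexicon, which only stands in for the decisions a real system has to learn from training data:

```python
# Toy lexicon "tagger", for illustrating the task interface only:
# one NER-IO tag per input term, "O" for every term not in the lexicon.
LEXICON = {"Bayern": "ORG", "München": "ORG", "deutschen": "LOC"}

def tag_io(terms):
    return [LEXICON.get(term, "O") for term in terms]

sentence = ["Bayern", "München", "ist", "wieder", "alleiniger", "Top-",
            "Favorit", "auf", "den", "Gewinn", "der", "deutschen",
            "Fußball-Meisterschaft", "."]

for term, tag in zip(sentence, tag_io(sentence)):
    print(f"{term}\t{tag}")
```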

With NER-IO it is indistinguishable whether "Bayern München" is a single `ORG` entity or two separate entities. The following example shows the NER-IOB notation, which solves this problem.

| Term | Tag |
| --- | --- |
| Bayern | B-ORG |
| München | I-ORG |
| ist | O |
| wieder | O |
| alleiniger | O |
| Top- | O |
| Favorit | O |
| auf | O |
| den | O |
| Gewinn | O |
| der | O |
| deutschen | B-LOC |
| Fußball-Meisterschaft | O |
| . | O |

In IOB notation, each `B-` tag marks the beginning of an entity and each `I-` tag marks its continuation.
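
As a small illustration of what the notation encodes, the following sketch recovers entity spans from an IOB-tagged term sequence (the helper `iob_to_entities` is a name introduced here, not a library function):

```python
# Decode an IOB tag sequence into (entity text, entity type) spans.
def iob_to_entities(terms, tags):
    entities, current, current_type = [], [], None
    for term, tag in zip(terms, tags):
        if tag.startswith("B-"):        # beginning of a new entity
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [term], tag[2:]
        elif tag.startswith("I-") and current:  # continuation of an entity
            current.append(term)
        else:                           # "O" closes any open entity
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        entities.append((" ".join(current), current_type))
    return entities

terms = ["Bayern", "München", "ist", "wieder", "alleiniger", "Top-",
         "Favorit", "auf", "den", "Gewinn", "der", "deutschen",
         "Fußball-Meisterschaft", "."]
tags = ["B-ORG", "I-ORG", "O", "O", "O", "O", "O",
        "O", "O", "O", "O", "B-LOC", "O", "O"]

print(iob_to_entities(terms, tags))
# [('Bayern München', 'ORG'), ('deutschen', 'LOC')]
```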

## Algorithms

Named Entity Recognition is a sequence-to-sequence learning problem, because each sentence can be seen as a sequence of terms, which should be automatically tagged with an equally long sequence of NER tags.

Algorithms tackling this problem include Hidden Markov Models (HMM), Maximum Entropy Markov Models (MEMM) and Conditional Random Fields (CRF), which are based on Bayesian statistics (counting occurrences of term sequences to calculate a probabilistic model), as well as Deep Learning approaches like Recurrent Neural Networks (RNN) and Convolutional Neural Networks (CNN).
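
As a minimal hands-on sketch of the first of these algorithms, the following snippet trains NLTK's supervised HMM tagger on the Spanish CoNLL 2002 data listed in the corpora table below. It assumes NLTK is installed and the corpus has been downloaded; the Lidstone smoothing is our choice to avoid zero probabilities for unseen words:

```python
# A minimal sketch: supervised HMM tagging with NLTK on CoNLL 2002 (Spanish).
# Assumes: pip install nltk
import nltk
from nltk.corpus import conll2002
from nltk.probability import LidstoneProbDist
from nltk.tag import hmm

nltk.download("conll2002")

# conll2002.iob_sents() yields (word, POS, IOB) triples; the HMM trainer
# expects sequences of (observation, state) pairs, so keep word and NER tag.
train = [[(word, iob) for word, pos, iob in sent]
         for sent in conll2002.iob_sents("esp.train")]

# Lidstone smoothing keeps unseen words from receiving zero probability.
trainer = hmm.HiddenMarkovModelTrainer()
tagger = trainer.train_supervised(
    train, estimator=lambda fd, bins: LidstoneProbDist(fd, 0.1, bins))

# The tags produced for an unseen sentence depend on the trained model.
print(tagger.tag("El Real Madrid ganó en Barcelona .".split()))
```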

The deep.TEACHING project provides educational material for students to gain basic knowledge about the problem domain, the required programming, math and statistics skills, as well as the mentioned algorithms and their evaluation. Students will also learn how to construct complex machine learning systems that can incorporate several algorithms at once.

## Related Problems

Learning about Named Entity Recognition and applicable algorithms will provide students with knowledge transferable to similar problems. Similar NLP tasks include Part-of-Speech tagging (tagging terms as nouns, verbs, etc.) and sentence splitting (tagging the end of a sequence). Besides text, sequence learning is also applicable to signal processing (e.g. detecting sleep stages in biosignals from sleep medicine), audio data (e.g. text-to-speech) and various other fields.

## NER Corpora

| Corpus | Language | Samples | Source | Info |
| --- | --- | --- | --- | --- |
| GermEval 2014 | German | 31302 | [Link](https://www.lt.informatik.tu-darmstadt.de/de/data/german-named-entity-recognition/) | [Notebook](notebooks/text-information-extraction/data-exploration/germ-eval-2014.ipynb) |
| CoNLL 2002 | Spanish | 11755 | [Link](https://github.com/teropa/nlp/tree/master/resources/corpora/conll2002) | Available in NLTK (Python) |
| CoNLL 2002 | Dutch | 23896 | [Link](https://github.com/teropa/nlp/tree/master/resources/corpora/conll2002) | Available in NLTK (Python) |
| Annotated Corpus for Named Entity Recognition | English | 47959 | [Link](https://www.kaggle.com/abhinavwalia95/entity-annotated-corpus) | Download requires Kaggle account |
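
Assuming NLTK is installed and the `conll2002` corpus has been downloaded (`nltk.download('conll2002')`), the CoNLL 2002 sample counts in the table can be verified by summing the train and test splits:

```python
from nltk.corpus import conll2002

for language, prefix in [("Spanish", "esp"), ("Dutch", "ned")]:
    splits = [f"{prefix}.train", f"{prefix}.testa", f"{prefix}.testb"]
    total = sum(len(conll2002.iob_sents(fileid)) for fileid in splits)
    print(language, total)  # Spanish: 11755, Dutch: 23896
```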

## Educational Materials

*Work in Progress.*


* Probability Theory
  * [Bayes' Rule](../notebooks/math/bayes-theorem.ipynb)
  * [Exercise: Bayes' Rule](../notebooks/math/exercise-bayes-rule.ipynb)
  * [Exercise: Cookie Problem](../notebooks/math/exercise-cookie-problem.ipynb)
  * [Bayesian Networks by Example](../notebooks/graphical-models/directed/bayesian-networks-by-example.ipynb)
  * [Exercise: D-Separation](../notebooks/graphical-models/directed/exercise-d-separation.ipynb)


* Graphical Models
  * Markov Models - [Exercise: Bi-Gram Language Model](../notebooks/sequence-learning/exercise-bi-gram-language-model.ipynb)
  * Hidden Markov Models (HMM) - [Exercise: Hidden Markov Models](../notebooks/sequence-learning/exercise-hidden-markov-models.ipynb)
  * Maximum Entropy Markov Models (MEMM) - [Exercise: MEMM](../notebooks/sequence-learning/exercise-memm.ipynb)
  * Linear-Chain Conditional Random Fields (CRF) - [Exercise: Linear-Chain CRF](../notebooks/natural-language-processing/text-information-extraction/exercise-linear-chain-crf.ipynb)


* Linear Models 
  * Linear Regression - see course [Introduction to Machine Learning](introduction-to-ml.md)
  * Logistic Regression - see course [Introduction to Machine Learning](introduction-to-ml.md)

* Text Classification
  * Naive Bayes - [Exercise: Naive Bayes](../notebooks/natural-language-processing/text-classification/exercise-naive-bayes.ipynb)

* Neural Networks
  * Feed Forward Artificial Neural Networks (ANN) - see course [Introduction to Neural Networks](introduction-to-nns.md)
  * Backpropagation - see course [Introduction to Neural Networks](introduction-to-nns.md)
  * Convolutional Neural Networks (CNN) - see course [Convolutional Neural Networks](cnns.md) (recommended)
  * Recurrent Neural Networks (RNN) **TODO Create Notebook**