Indicates whether a term is present in a document.

`X_train[i,j]` is $1$ if $document_i$ contains $term_j$, otherwise it is $0$.

* **y_train : *array, shape = (n_documents,)***

The target vector indicating the category of each document. There are two distinct categories, $0$ and $1$.

`y_train[i]` is the category of $document_i$.

* **lexicon : *list, \_\_len\_\_ = n_features***

A list of strings that stores the actual term corresponding to each term id. Useful if you've identified some term ids as significant and want to understand what they actually mean.

E.g. `lexicon[5247] == "chicken"`

The following code snippet provides an example of obtaining information about a single document.

```python
def doc_info(idx):
    doc = X_train[idx, :].toarray().reshape((-1,))
    cat_idx = y_train[idx]
    cat_name = twenty_train.target_names[cat_idx]
    term_count = np.array(lexicon)[np.where(doc > 0)].shape[0]
    return locals()

doc_info(123)
```

## Exercise - Mutual Information

Calculate the mutual information of each term and return a dict that maps each term id to the mutual information of the term. Later, this function will be invoked with `X_train` and `y_train` as arguments. Refer to the docstring for details.

**Task:**

Implement the following Python function.

**Hint:**

Recall the definition of the mutual information of a term $x$ and the class label $y$:

$$I(x;y) = \sum_{x \in \{0,1\}} \sum_{y \in \{0,1\}} p(x,y) \log \frac{p(x,y)}{p(x)\,p(y)}$$

When dealing with text classification and binary features, here is an example with concrete values and their meaning:

- $p(y=0)$: The probability that a document is in class $0$, computed as the number of documents in class $0$ divided by the total number of documents.
- $p(x=1)$: The probability that a document contains term $x$.
- $p(x=1, y=0)$: The probability that a document contains term $x$ and is in class $0$.

Further: to avoid division by $0$ when calculating the mutual information, it is common practice to always add $1$ when counting documents.

```python
def select_features(X, y):
    """Calculate the mutual information of all terms.
    Assumes y comprises two distinct categories 0 and 1.

    :param X: A document-term matrix
    :param y: A target vector
    :return: A dictionary containing n_features items. Each entry maps
        the term id (int) to the mutual information of the term (float or
        float-like)
    """
    raise NotImplementedError()
```

The next two cells are scratch cells you can use to experiment with indexing into the data:

```python
X_train[:, 3].toarray().flatten()
```

```python
a = np.array([[0, 1, 1], [1, 0, 0]])
a[a == 0]
```

```python
ex_mi = select_features(X_train, y_train)
best_terms = sorted(ex_mi.items(), key=lambda kv: kv[1], reverse=True)
```

If your implementation is correct, the output of the cell below should look similar to the following (depending on the base of the logarithm used, the numbers may vary, but the order should be the same):

```
Terms with the greatest mutual information
god................. 0.19173825649655832
atheists............ 0.17111409529128105
keith............... 0.11713774078039724
cco................. 0.10858678795625454
atheism............. 0.10115308769302479
schneider........... 0.0986931815614237
pitt................ 0.09657796740920438
religion............ 0.0962422171570431
allan............... 0.09502007881084548
gordon.............. 0.08813798072387438
```

```python
print('Terms with the greatest mutual information')
for (term_idx, score) in best_terms[:10]:
    print('{} {}'.format(lexicon[term_idx].ljust(20, "."), score))
```

## Training a feature selector

This segment defines a feature selector class which implements two methods:

* **fit(X, y)**

  Takes a document-term matrix and a target vector. It learns the mutual information of each term using the function you just implemented.

* **transform(X)**

  Takes a document-term matrix and returns a subset of it. The subset has shape (n_samples, self.k_features) and its columns correspond to the k best features.
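The fit/transform interface described above can be illustrated end-to-end on a toy dataset. The sketch below is *not* the notebook's actual selector: `toy_mutual_information` and `ToySelector` are hypothetical names, the data is a made-up 4×3 binary matrix, and the smoothing simply adds $1$ to every count, which is one of several conventions consistent with the hint.

```python
import numpy as np

def toy_mutual_information(X, y):
    """Mutual information of each binary feature with a binary label,
    using add-one smoothed counts (a simplified stand-in for the exercise)."""
    n_docs, n_features = X.shape
    scores = {}
    for j in range(n_features):
        mi = 0.0
        for xv in (0, 1):
            for yv in (0, 1):
                # add-one smoothed estimates of p(x,y), p(x), p(y)
                p_xy = (np.sum((X[:, j] == xv) & (y == yv)) + 1) / (n_docs + 1)
                p_x = (np.sum(X[:, j] == xv) + 1) / (n_docs + 1)
                p_y = (np.sum(y == yv) + 1) / (n_docs + 1)
                mi += p_xy * np.log2(p_xy / (p_x * p_y))
        scores[j] = mi
    return scores

class ToySelector:
    """Minimal selector following the fit/transform interface described above."""
    def __init__(self, k_features=2):
        self.k_features = k_features

    def fit(self, X, y):
        mi = toy_mutual_information(X, y)
        # term ids sorted by descending mutual information
        self.best_indices_ = sorted(mi, key=mi.get, reverse=True)
        return self

    def transform(self, X):
        return X[:, self.best_indices_[:self.k_features]]

# Toy corpus: columns 0 and 1 perfectly separate the classes, column 2 is noise.
X = np.array([[1, 0, 1],
              [1, 0, 0],
              [0, 1, 1],
              [0, 1, 0]])
y = np.array([0, 0, 1, 1])

sel = ToySelector(k_features=2).fit(X, y)
print(sel.transform(X).shape)         # (4, 2)
print(sorted(sel.best_indices_[:2]))  # [0, 1] -- the two informative columns
```

Unlike the `MutualInformation` class below, this sketch stores the ranking on the instance rather than in a class-level cache, so refitting with new data recomputes the scores.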
```python
from sklearn.base import BaseEstimator

class MutualInformation(BaseEstimator):
    cached_best_indices = None

    def __init__(self, k_features=10):
        self.k_features = k_features

    def fit(self, X, y=None):
        """Upon fitting, calculate the best features. The result is cached in a
        static map."""
        if MutualInformation.cached_best_indices is None:
            mi = select_features(X, y)
            mi_sorted = sorted(mi.keys(), key=mi.__getitem__, reverse=True)
            MutualInformation.cached_best_indices = mi_sorted
        return self

    def transform(self, X):
        """Return a subset of X which contains only the best columns."""
        indices = MutualInformation.cached_best_indices
        selected_terms = indices[:self.k_features]
        subset = X[:, selected_terms]
        return subset
```

### Testing our Selection

Now that we have a way to identify the k best features, the question arises which value k should take.
In the concluding part of this notebook, we set up a [pipeline](https://scikit-learn.org/stable/modules/compose.html) to streamline the task of `Vectorization → Feature selection → Classification` and predict the categories of the test set. We increase the value of k in each iteration and observe how the number of selected features affects the accuracy of the prediction.

As classifier we use Multinomial Naive Bayes from the scikit-learn library.

```python
# First we download the test set
twenty_test = fetch_20newsgroups('./newsgroups_dataset',
                                 subset='test',
                                 categories=categories,
                                 shuffle=True,
                                 random_state=42)
print('(Down)load complete!')
```

```python
min_zoom = 600
accuracies = []

# log distribution of k values from 1 to n_features
k_values = set(np.geomspace(1, len(lexicon), dtype=int))
k_values = sorted(k_values.union(set(np.geomspace(min_zoom, len(lexicon), dtype=int))))

for k in k_values:
    pipe = Pipeline([
        ('vectorizer', CountVectorizer(binary=binary, stop_words=stop_words, min_df=min_df)),
        ('feature_selector', MutualInformation(k_features=k)),
        ('classifier', classifier)
    ])
    pipe.fit(twenty_train.data, twenty_train.target)
    prediction = pipe.predict(twenty_test.data)
    # or use sklearn.metrics.accuracy_score(twenty_test.target, prediction)
    accuracy = np.mean(prediction == twenty_test.target)
    accuracies.append(accuracy)

k_values = np.array(k_values)
accuracies = np.array(accuracies)

print('highest accuracy of {} achieved with top {} features used'.format(accuracies.max(), k_values[accuracies.argmax()]))
```

When the calculation is done, draw the plot.
```python
plt.figure(figsize=(10, 8))
plt.plot(k_values, accuracies, marker='.')
plt.xscale('log')
plt.xlabel('k features used')
plt.minorticks_off()
plt.gca().get_xaxis().set_major_formatter(ScalarFormatter())
plt.ylabel('Accuracy')
using_all = accuracies[-1]
plt.plot(k_values, using_all * np.ones_like(k_values), color='orange', label='Using all features')
plt.legend()
plt.show()
```

The second plot zooms in on the region `k > min_zoom`:

```python
plt.figure(figsize=(10, 8))
plt.plot(k_values[k_values > min_zoom], accuracies[k_values > min_zoom], marker='.')
plt.xscale('log')
plt.xlabel('k features used')
plt.minorticks_off()
plt.gca().get_xaxis().set_major_formatter(ScalarFormatter())
plt.ylabel('Accuracy')
using_all = accuracies[-1]
plt.plot(k_values[k_values > min_zoom], using_all * np.ones_like(k_values[k_values > min_zoom]), color='orange', label='Using all features')
plt.legend()
plt.show()
```

## Literature

### Notebook License (CC-BY-SA 4.0)

This notebook by Diyar Oktay, Christian Herta, Klaus Strohmenger is licensed under a [Creative Commons Attribution-ShareAlike 4.0 International License](http://creativecommons.org/licenses/by-sa/4.0/).

Based on a work at https://gitlab.com/deep.TEACHING.

### Code License (MIT)

*The following license only applies to code cells of the notebook.*

Copyright 2018 Diyar Oktay, Christian Herta, Klaus Strohmenger

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.