StellarGraph Node Classification
================================

This [MLHub](https://mlhub.ai) package demonstrates and provides a
guide to graph machine learning for node classification using
[StellarGraph](https://www.stellargraph.io/). The demonstration is
based on the [Node Classification with GCN
notebook](https://stellargraph.readthedocs.io/en/stable/demos/node-classification/gcn-node-classification.html).

It uses the [Cora](https://linqs.soe.ucsc.edu/data) dataset of
academic publications. Each publication has a subject, and links
between publications represent citations.

For an introduction to graphs, see the
[StellarGraph Guide to Graphs](https://medium.com/stellargraph/knowing-your-neighbours-machine-learning-on-graphs-9b7c3d0d5896).

Visit the GitLab repository for more details:
<https://gitlab.com/kayontoga/sgnc>

Usage
-----

- To install mlhub (Ubuntu 20.04 LTS):

```console
$ pip3 install mlhub
$ ml configure
```

- To install and configure the demo:

```console
$ ml install sgnc
$ ml configure sgnc
$ ml readme sgnc
$ ml commands sgnc
```

*Note:* on the first *configure*, TensorFlow is installed if it is not
already available. This is a large download of some 500 MB and can
take some time. It is installed only once: it will not be downloaded
again when sgnc is updated.

Demonstration
-------------

```console
====================================
StellarGraph for Node Classification
====================================

Welcome to a demonstration of node classification in a graph knowledge
structure. StellarGraph is used to represent the graph. The sample
dataset is a well known public network dataset known as Cora. It is
available from linqs.soe.ucsc.edu/data and consists of nodes, which
are academic publications, and edges, which represent citations. The
nodes have been classified into seven subject areas as we will see below.

A Graph Convolution Network (GCN) is used to build a classification
model to predict the subject area of a publication based on the graph
structure. This neural network model includes a graph convolution
layer that uses the graph adjacency matrix to learn from a
publication's citations.

This demonstration will prepare the dataset, create the GCN layers,
and then train a model and evaluate its performance.

Press Enter to continue: 

===================
Dataset Description
===================

The dataset, available through the StellarGraph package itself, has
been attached and is ready to be loaded into the StellarGraph data
structures.

The Cora dataset consists of 2708 scientific publications classified
into one of seven classes. The citation network consists of 5429
links. Each publication in the dataset is described by a 0/1-valued
word vector indicating the absence/presence of the corresponding word
from the dictionary. The dictionary consists of 1433 unique words.

Press Enter to continue: 

===========
Graph Shape
===========

We can ask for information about the StellarGraph structure to confirm
it matches the description above.

StellarGraph: Undirected multigraph
 Nodes: 2708, Edges: 5429

 Node types:
  paper: [2708]
    Features: float32 vector, length 1433
    Edge types: paper-cites->paper

 Edge types:
    paper-cites->paper: [5429]
        Weights: all 1 (default)
        Features: none

Press Enter to continue: 

====================
Subject Distribution
====================

Each publication has a subject attribute, which will be the target of
the classification model. The full dataset has the following
distribution of subjects.

                        subject
Neural_Networks             818
Probabilistic_Methods       426
Genetic_Algorithms          418
Theory                      351
Case_Based                  298
Reinforcement_Learning      217
Rule_Learning               180

Press Enter to continue: 

=====================
Splitting the Dataset
=====================

The dataset is split into three subsets, as is usual for building
models: a training set of 140 node labels and a tuning (validation)
set of 500 node labels, leaving 2068 node labels for the test set.

The training dataset has the following subject distribution:

                        subject
Neural_Networks              42
Probabilistic_Methods        22
Genetic_Algorithms           22
Theory                       18
Case_Based                   16
Reinforcement_Learning       11
Rule_Learning                 9

Press Enter to continue: 

======================
Machine Learning Model
======================

The Graph Convolution Network (GCN) model is now being built. For a
small dataset this takes just a few seconds.

Using GCN (local pooling) filters...

Press Enter to continue: 

=================
Accuracy and Loss
=================

For neural networks the learning happens over a series of so-called
epochs. After each epoch we expect the accuracy of the model to
improve and the loss to reduce. At some point the curves flatten out
and we gain little by performing any further training.

We will display two plots here, one for the accuracy and the other for
the loss. Each shows the performance measure against both the
training dataset (always expected to show better performance) and the
tuning (validation) dataset (which gives a less biased estimate of
performance).

Accuracy is the percentage of observations correctly classified: the
higher the accuracy the better. The loss is a measure of the
difference between the predicted and actual values: the smaller the
loss the better.

Close the window with Ctrl-w.
```
![](performance.png)
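
The graph convolution used by the model above can be sketched in plain
numpy. This is a toy illustration of one GCN layer (a 3-node graph
with random features and weights), not the Cora data or the
StellarGraph implementation:

```python
import numpy as np

# Toy undirected graph: 3 nodes, edges 0-1 and 1-2.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)

# GCN propagation matrix: add self-loops, then symmetrically
# normalise: A_hat = D^{-1/2} (A + I) D^{-1/2}.
A_tilde = A + np.eye(3)
d = A_tilde.sum(axis=1)
D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
A_hat = D_inv_sqrt @ A_tilde @ D_inv_sqrt

# Node features (3 nodes x 4 features) and a layer weight matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))
W = rng.normal(size=(4, 2))

# One graph convolution layer: each node's new representation mixes
# its own features with its neighbours', followed by a ReLU.
H = np.maximum(A_hat @ X @ W, 0.0)
print(H.shape)  # (3, 2)
```
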
```console
Press Enter to continue: 

================
Test Set Metrics
================

The test set is a hold-out dataset, not used in the model building at
all, unlike the training and tuning (validation) datasets. The
performance measured on the test dataset is an unbiased (i.e., more
realistic) estimate of the performance of the model in general.

For our model the accuracy is estimated to be 83% and the loss is
estimated to be 0.63.


Press Enter to continue: 

==================
Sample Predictions
==================


                      Predicted                  Actual  Correct
31336           Neural_Networks         Neural_Networks     True
1061127           Rule_Learning           Rule_Learning     True
1106406  Reinforcement_Learning  Reinforcement_Learning     True
13195    Reinforcement_Learning  Reinforcement_Learning     True
37879     Probabilistic_Methods   Probabilistic_Methods     True
1126012   Probabilistic_Methods   Probabilistic_Methods     True
1107140  Reinforcement_Learning                  Theory    False
1102850         Neural_Networks         Neural_Networks     True
31349           Neural_Networks         Neural_Networks     True
1106418                  Theory                  Theory     True
1123188         Neural_Networks         Neural_Networks     True
1128990      Genetic_Algorithms      Genetic_Algorithms     True
109323    Probabilistic_Methods   Probabilistic_Methods     True
217139                   Theory              Case_Based    False
31353           Neural_Networks         Neural_Networks     True
32083           Neural_Networks         Neural_Networks     True
1126029  Reinforcement_Learning  Reinforcement_Learning     True
1118017              Case_Based         Neural_Networks    False
49482           Neural_Networks         Neural_Networks     True
753265          Neural_Networks         Neural_Networks     True

Press Enter to continue: 
```
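
The 140/500/2068 split shown in the transcript is a stratified sample
over the subject labels. A minimal sketch of that style of split,
assuming scikit-learn and pandas are available (the labels here are
rebuilt from the distribution above, not read from Cora):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Rebuild a stand-in for Cora's node_subjects: 2708 labels, 7 classes.
counts = {"Neural_Networks": 818, "Probabilistic_Methods": 426,
          "Genetic_Algorithms": 418, "Theory": 351, "Case_Based": 298,
          "Reinforcement_Learning": 217, "Rule_Learning": 180}
subjects = pd.Series([s for s, n in counts.items() for _ in range(n)])

# 140 training labels and 500 tuning (validation) labels, each drawn
# in proportion to the subject distribution; the rest form the test set.
train, rest = train_test_split(subjects, train_size=140,
                               stratify=subjects, random_state=42)
val, test = train_test_split(rest, train_size=500,
                             stratify=rest, random_state=42)
print(len(train), len(val), len(test))  # 140 500 2068
```

Stratifying keeps rare subjects such as Rule_Learning represented even
in the small 140-label training set.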