# Neural Networks and Deep Learning by Michael Nielsen

## Intro

- *Neural network* - biologically inspired programming paradigm enabling a computer to learn through observation
- *Deep learning* - techniques for learning in neural networks

## Chapter 1 - Using neural nets to recognize handwritten digits

- Neural networks leverage supplied *training examples* against a programmed system that learns from the examples by inferring rules for pattern recognition.
- common formulaic symbols
    - *Σ* - sum (sigma uppercase)
        - *σ* - sigmoid (sigma lowercase)
    - *≡* - equivalence
    - *∂* - partial derivative
    - *η* - learning rate
    - *Δ* - delta
        - *δ* - delta error (lowercase)
    - *∇* - gradient (∇C as gradient vector) (pronounced "nabla")
    - *ϵ* - epsilon (small quantity)
    - *⊙* - Hadamard product (elementwise multiplication of two matrices of the same dimensions, producing a matrix of those same dimensions)
    - *λ* - regularization parameter
- artificial neurons
    - *perceptron* - an artificial neuron with n binary inputs and 1 binary output. A *weight* is a real number associated with each input; the output is determined by whether the *weighted sum (Σ)* of the inputs meets a real number *threshold* value.
        - Varying the weights and threshold results in different *decision making models*
        - Layering sets of perceptrons enables abstract and sophisticated decision making, as the output of a perceptron may be an input to any number of perceptrons in a subsequent layer
        - Formula (`w` = weight vector, `x` = input vector, `b` = bias)
            - `output = w⋅x+b > 0 ? 1 : 0`
            - *bias* - measure of how easy it is for the perceptron to fire
        - A perceptron can mimic a NAND gate, and thus a neural network can mimic circuits and achieve computational universality. Unlike conventional circuit design, however, *learning algorithms* can automatically tune the weights and biases of a neural network, resulting in a learned and potentially superior design.
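A minimal perceptron sketch (the function name is my own, not from the book); with weights (-2, -2) and bias 3 it behaves as a NAND gate, the gate from which any logical circuit can be built:

```python
def perceptron(w, b, x):
    """Output 1 if the weighted sum plus bias is positive, else 0."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if z > 0 else 0

# Truth table for the NAND-configured perceptron:
for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x, perceptron([-2, -2], 3, x))  # 1, 1, 1, 0
```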
    - *sigmoid* - an artificial neuron with n inputs and 1 output where each input and output is 0, 1, or a fraction between. A *weight* is a real number associated with each input; the *weighted sum (Σ)* plus the *bias* is passed through the sigmoid function to determine the output.
        - AKA - logistic neuron
        - Formula (`σ` = sigmoid function, `w` = weight vector, `x` = input vector, `b` = bias)
            - `z = w⋅x+b`
            - `output = σ(z) ≡ 1/(1+e^−z)`
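A sketch of a sigmoid neuron in pure Python (the helper names are my own):

```python
import math

def sigmoid(z):
    """sigma(z) = 1 / (1 + e^-z), squashing z into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_neuron(w, b, x):
    """Weighted sum plus bias, passed through the sigmoid."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return sigmoid(z)

# With weights (-2, -2) and bias 3, the output is smooth rather than binary:
print(sigmoid_neuron([-2, -2], 3, (1, 1)))  # about 0.269, not a hard 0
```

Unlike a perceptron, small changes in the weights and bias produce small changes in the output, which is what makes gradient-based learning possible.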
- *activation function* - defines a neuron's output given a set of input(s)
    - *linear perceptron* 
    - *sigmoid function*
    - *ReLU* (rectified linear unit)
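The three activation functions above, sketched in plain Python (not the book's code):

```python
import math

def linear(z):
    """Linear perceptron: output is the raw weighted input."""
    return z

def sigmoid(z):
    """Squashes any real z into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def relu(z):
    """Rectified linear unit: zero for negative z, identity otherwise."""
    return max(0.0, z)

print(linear(-2.0), sigmoid(-2.0), relu(-2.0))
```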
- neural network
    - anatomy
        1. input layer comprised of input neuron(s)
        2. hidden layer(s) comprised of neuron(s) informing the next layer
        3. output layer comprised of output neuron(s)
    - types
        - feedforward (no loops)
            - a single input determines the activations of all the neurons through the remaining layers
        - recurrent (looping allowed where select neuron(s) only fire for a limited duration)
            - nets in which there is some notion of dynamic change over time (there are many models)
    - *cost function* - quantifies how far the network's output, given its current weights and biases, is from the desired output
    - *gradient descent* - minimization algorithm aiding neural net learning by minimizing the cost function
        - Ball in valley analogy for a cost function of two variables
            - x,z as vars (floor plane) and y as function result (vertical valley depth)
        - Starting from a random point (the function variables), the algorithm repeatedly takes a small step in the direction that most decreases the cost, using the update rule `v → v′ = v−η∇C`. The negative gradient points "downhill," so each step moves the ball deeper into the valley. This is the "minimization" of gradient descent, as we seek the global minimum output.
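A toy gradient descent sketch; the cost function and names here are my own illustration, minimizing `C(v1, v2) = v1² + v2²` with the update rule `v → v − η∇C`:

```python
def grad_C(v):
    """Gradient of C(v1, v2) = v1^2 + v2^2 is (2*v1, 2*v2)."""
    return [2 * v[0], 2 * v[1]]

def gradient_descent(v, eta=0.1, steps=100):
    """Repeatedly step against the gradient: v -> v - eta * grad_C(v)."""
    for _ in range(steps):
        g = grad_C(v)
        v = [vi - eta * gi for vi, gi in zip(v, g)]
    return v

v = gradient_descent([3.0, -4.0])
print(v)  # both components shrink toward the minimum at (0, 0)
```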
    - The goal of training a neural net is to find weights and biases which minimize the cost function so that inputs not in the training set result in accurate output
    - *stochastic gradient descent* - optimized variant of gradient descent that approximates the gradient by sampling random training inputs in mini-batches
    - *mini-batch* - random sample of inputs from a training set
    - *training epoch* - one complete pass through the full training set (i.e., enough mini-batches to exhaust the training inputs)
    - *online/incremental learning* - gradient descent with a mini-batch of 1 input
    - *hyper-parameters* - settings such as the epoch count, mini-batch size, and learning rate; they exclude the weights and biases, which the network learns itself
        - I imagine you could get meta and use an algorithm that learns the best hyper-parameters
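A sketch of mini-batch sampling for stochastic gradient descent (the helper name and toy data are my own):

```python
import random

def mini_batches(training_data, batch_size):
    """Shuffle the training data and split it into mini-batches."""
    data = list(training_data)
    random.shuffle(data)
    return [data[k:k + batch_size] for k in range(0, len(data), batch_size)]

# One epoch = one pass over all mini-batches.
batches = mini_batches(range(10), batch_size=3)
print(len(batches))  # 4 batches, of sizes 3, 3, 3, 1
```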
    - *shallow neural network* - neural network with one hidden layer
    - *deep neural network* - neural network with two or more hidden layers (layers of abstraction)

## Chapter 2 - How the backpropagation algorithm works

- *backpropagation* - a fast algorithm composed of four equations used to compute the gradient of the cost function
    - Equation descriptions
        - An equation for the error in the output layer
        - An equation for the error `δˡ` in terms of the error in the next layer
        - An equation for the rate of change of the cost with respect to any bias in the network
        - An equation for the rate of change of the cost with respect to any weight in the network
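        In Nielsen's notation, the four equations (BP1-BP4) can be written as follows (a sketch from the book's derivation; `L` is the final layer, `a` the activations):

        ```latex
        \delta^L = \nabla_a C \odot \sigma'(z^L)                     % (BP1) error in the output layer
        \delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l)     % (BP2) error in terms of the next layer's error
        \frac{\partial C}{\partial b^l_j} = \delta^l_j               % (BP3) rate of change of cost w.r.t. any bias
        \frac{\partial C}{\partial w^l_{jk}} = a^{l-1}_k \delta^l_j  % (BP4) rate of change of cost w.r.t. any weight
        ```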
    - "Think of backpropagation as a way of computing the gradient of the cost function by systematically applying the chain rule from multi-variable calculus."
    - "The backpropagation algorithm is a clever way of keeping track of small perturbations to the weights (and biases) as they propagate through the network, reach the output, and then affect the cost"
- *weighted input* - `zˡⱼ` is just the weighted input to the activation function for neuron `j` in layer `l`

## Chapter 3 - Improving the way neural networks learn

- Using the *cross-entropy* cost function instead of the *quadratic* cost function greatly improves the speed of learning. Essentially, when initial weight and bias values are far from their respective to-be-determined ideals, the quadratic cost function learns slowly. Replacing it with cross-entropy resolves this issue.
    - This is true in nets of sigmoid neurons, quadratic cost is fine for nets of perceptron/linear neurons.
    - "Roughly speaking, the idea is that the cross-entropy is a measure of surprise"
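A sketch of the cross-entropy cost for sigmoid output neurons, `C = −(1/n) Σ [y ln a + (1−y) ln(1−a)]` (the function name is my own):

```python
import math

def cross_entropy_cost(outputs, targets):
    """Average cross-entropy over output activations a and targets y."""
    n = len(outputs)
    return -sum(
        y * math.log(a) + (1 - y) * math.log(1 - a)
        for a, y in zip(outputs, targets)
    ) / n

# A confidently wrong output (a near 0 when y = 1) yields a large cost,
# so learning stays fast even when the neuron starts out badly wrong:
print(cross_entropy_cost([0.01], [1]))  # large cost
print(cross_entropy_cost([0.99], [1]))  # small cost
```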
- *softmax function* - a softmax layer outputs a probability distribution, which is often useful/convenient for interpreting the output activations
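A sketch of the softmax function (names are my own):

```python
import math

def softmax(z):
    """Map a vector of activations to a probability distribution."""
    exps = [math.exp(zi) for zi in z]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])
print(probs)       # the largest input gets the largest probability
print(sum(probs))  # sums to 1 (up to floating point)
```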
- *saturation* - state in which a neuron predominantly outputs values close to the asymptotic ends of the bounded activation function (negatively impacts learning)
- *overfitting* - training a network so closely to its training data that it fails to generalize to new inputs
- *hold out* - validation data: a portion of the training data held out of normal training and used to learn good hyper-parameters
- *early stopping* - computing the classification accuracy on the validation data after each epoch and stopping training once that accuracy has stopped improving
- *regularization* - techniques to help reduce overfitting
    - weight decay - modification to a cost function in pursuit of small weights
    - dropout - modification to hidden layers by excluding random neuron sets during training
    - artificial training data expansion - duplicate training data and tweak it before appending it to the original training data set (rotating/translating/skewing images for example)
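A sketch of the weight-decay (L2) update: the regularized cost adds `λ/2n Σw²`, so each gradient step first rescales the weight by `(1 − ηλ/n)` (the function name and the example values of `eta`, `lmbda`, and `n` are illustrative):

```python
def decayed_update(w, grad_w, eta=0.5, lmbda=5.0, n=50000):
    """w -> (1 - eta*lmbda/n) * w - eta * grad_w"""
    return (1 - eta * lmbda / n) * w - eta * grad_w

# Even with a zero gradient, the weight shrinks slightly each step,
# which is the "pursuit of small weights":
print(decayed_update(1.0, 0.0))  # 0.99995
```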
- *weight initialization* - weights input to a neuron are initialized as Gaussian random variables with mean 0 and standard deviation 1 divided by the square root of the number of input connections to the neuron (i.e., 1/√n_in)
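A pure-Python sketch of that 1/√n_in initialization (names are my own):

```python
import math
import random

def init_weights(n_in):
    """Gaussian weights with mean 0 and standard deviation 1/sqrt(n_in)."""
    return [random.gauss(0.0, 1.0 / math.sqrt(n_in)) for _ in range(n_in)]

w = init_weights(100)
print(len(w))  # 100 weights, each drawn from N(0, (1/10)^2)
```

Keeping the standard deviation small stops the weighted input `z` from growing with the neuron's fan-in, which would otherwise saturate a sigmoid neuron from the start.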
- *learning rate schedule* - dynamically adjusting the learning rate (as opposed to defining it constant) for improving network performance
- Building neural networks can be very time consuming for a few core reasons:
    1. Determining hyper-parameters that truly work for a given problem can take a lot of time
    2. Computation (in real time) is expensive, which is why there are so many tricks/optimization hacks in the effort to find ideal hyper-parameters. Improvements in computation enable faster iteration toward finding ideals.

## Chapter 4 - A visual proof that neural nets can compute any function

- *universality theorem* - neural networks with a single hidden layer can be used to approximate any continuous function to any desired precision
    - the single hidden layer is composed of neuron pairs mimicking a *step function* where `s=−b/w`. Each neuron pair is mapped over the interval 0-1 across the entire neuron count. Choosing a large weight in conjunction with the step neuron's position in 0-1 yields a calculable bias (`b=−ws`). So the more pairs you have, the more accurate the approximation.
        - Example: 6 neurons as 3 pairs map to [[0,.33],[.33,.66],[.66,1]]
    - "In essence, we use single-layer neural networks to build a lookup table for the function"

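A sketch of the step-function trick: with a large weight `w`, a sigmoid neuron approximates a step at position `s = −b/w`, so choosing `b = −ws` places the step (names and the choice `w = 1000` are my own):

```python
import math

def sigmoid(z):
    """Numerically safe sigmoid, valid for very large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def step_neuron(x, s, w=1000.0):
    """Sigmoid neuron acting as a step at position s, via b = -w*s."""
    b = -w * s
    return sigmoid(w * x + b)

print(step_neuron(0.2, s=0.5))  # well before the step: output near 0
print(step_neuron(0.8, s=0.5))  # well after the step: output near 1
```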
## Chapter 5 - Why are deep neural networks hard to train?

- *unstable gradient problem* - "if we use standard gradient-based learning techniques, different layers in the network will tend to learn at wildly different speeds"
    - *vanishing gradient problem* - the gradient shrinks as it propagates backward, so earlier layers learn more slowly than later layers
        - "...empirically it is typically found in sigmoid networks that gradients vanish exponentially quickly in earlier layers. As a result, learning slows down in those layers. This slowdown isn't merely an accident or an inconvenience: it's a fundamental consequence of the approach we're taking to learning."
    - *exploding gradient problem* - the gradient grows as it propagates backward, so earlier layers learn faster than later layers
- "...the choice of activation function, the way weights are initialized, and even details of how learning by gradient descent is implemented... choice of network architecture and other hyper-parameters" make it hard to train deep nets
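A sketch of why gradients vanish in deep sigmoid nets: each layer multiplies the backpropagated gradient by a term like `w·σ′(z)`, and `σ′(z) ≤ 1/4`, so with modest weights the product shrinks exponentially with depth (the numbers here are my own illustration):

```python
import math

def sigmoid_prime(z):
    """Derivative of the sigmoid: sigma(z) * (1 - sigma(z)), at most 1/4."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

# Best case for a sigmoid: unit weights and z = 0, where sigmoid'(0) = 0.25.
# The gradient factor reaching back through 10 layers:
factor = 1.0
for _ in range(10):
    factor *= 1.0 * sigmoid_prime(0.0)
print(factor)  # 0.25**10, roughly 1e-6: the earliest layers barely learn
```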

## Chapter 6 - Deep learning

- *convolutional neural network* - a feed-forward net in which the connectivity pattern between its neurons is inspired by the organization of the animal visual cortex (spatial/feature/pattern understanding)
    - components:
        1. *local receptive field* - each neuron of a hidden layer consumes the input of a region of input neurons, where a stride length helps determine how the regions are defined
            - traditional nets in contrast map every single input to each neuron in a subsequent layer
        2. *shared weights and biases* - each neuron of a hidden layer shares the same weights (applied to its region) and bias, enabling the hidden layer to detect a particular "feature" or "pattern" anywhere in the image
            - traditional nets in contrast do not share the weights and bias
            - *feature map* - the map from the input layer to a hidden layer defined by one set of shared weights and bias (one detected feature)
            - *filter* - the set of shared weights and bias that defines a feature map
            - *convolutional layer* - a layer consisting of 1 or more feature maps
        3. *pooling layers* - a layer succeeding each feature map of a convolutional layer that prepares a condensed feature map for subsequent layers
            - *max-pooling* - outputs the maximum activation within each region, roughly asking whether a feature is found anywhere in that region
    - "convolutional networks are well adapted to the translation invariance of images"
    - Convolutional nets have a fraction of the parameters (due to shared weights and biases) compared to traditional nets, and thus train and perform much faster
- *deep belief network* - is a generative model net that can "run the network backward" which enables generating values for the input activations (ex. generate images of MNIST handwritten digits based on network settings)
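A sketch of 2x2 max-pooling over a 4x4 activation map (the toy data and function name are my own):

```python
def max_pool(feature_map, size=2):
    """Condense a feature map by taking the max of each size x size region."""
    rows = len(feature_map)
    cols = len(feature_map[0])
    return [
        [
            max(
                feature_map[r + dr][c + dc]
                for dr in range(size)
                for dc in range(size)
            )
            for c in range(0, cols, size)
        ]
        for r in range(0, rows, size)
    ]

activations = [
    [1, 3, 2, 1],
    [4, 2, 0, 1],
    [5, 1, 9, 2],
    [0, 2, 3, 4],
]
print(max_pool(activations))  # [[4, 2], [5, 9]]
```

Each output cell reports only the strongest activation in its region, condensing the feature map to a quarter of its size while keeping the "was this feature found here?" signal.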