Description:
- Made up of nodes or units, connected by links
- Each link has an associated weight; each unit has an activation level
- Each node has an input function (typically a weighted sum of its inputs), an activation function, and an output (a minimal unit is sketched after this list)
- The mapping from X to Y can be non-linear, and the network can still be trained with stochastic gradient descent
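A minimal sketch of one unit, assuming a weighted-sum input function with a bias term (the function names, numbers, bias, and the sigmoid choice are illustrative, not from the notes):

```python
import numpy as np

# A single unit: input function = weighted sum, then an activation function.
def unit_output(x, w, b, activation):
    net = np.dot(w, x) + b      # input function: sum of weighted inputs (plus bias)
    return activation(net)      # activation function maps the net input to the output

# Illustrative numbers; the sigmoid here is just one possible activation choice.
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.1, 0.4, -0.3])
print(unit_output(x, w, b=0.2, activation=sigmoid))   # a single scalar output
```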
Activation functions:
- ex: the binary step activation function, which outputs 1 when the net input is at or above a threshold and 0 otherwise
- Other activation functions are also common, e.g., sigmoid, tanh, ReLU (a few are sketched below)
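A small sketch of the binary step function alongside a couple of other standard activations (the threshold of 0 is an assumption; the definitions are the usual ones, not taken from the notes):

```python
import numpy as np

def binary_step(z):
    """Binary step: output 1 if the net input is at least the threshold (0 here), else 0."""
    return np.where(z >= 0, 1.0, 0.0)

# Other common activations mentioned above.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

z = np.array([-2.0, 0.0, 3.0])
print(binary_step(z))   # [0. 1. 1.]
print(sigmoid(z))       # smooth squashing into (0, 1)
print(relu(z))          # [0. 0. 3.]
```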
Multi-layer neural network:
Theorem 9 in CIML:
- Two-layer networks are universal function approximators
- The approximation error can be made arbitrarily small (a standard formal statement is sketched below)
- A two-layer neural network can approximate any continuous function, given enough neurons in the hidden layer.
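A standard way to state this result formally, paraphrased in its usual form rather than quoted verbatim from CIML (the compact domain K, the bias terms b_i, and the sigmoid-like choice of ϕ are part of the standard statement, not of the notes above):

```latex
% For a continuous F on a compact set K and any epsilon > 0, there exist a
% hidden-layer size m and parameters {v_i, w_i, b_i} such that the two-layer
% network is uniformly within epsilon of F (for suitable non-polynomial phi,
% e.g. a sigmoid).
\exists\, m,\ \{v_i, w_i, b_i\}_{i=1}^{m} \quad \text{s.t.} \quad
\sup_{x \in K} \Bigl| F(x) - \sum_{i=1}^{m} v_i\,\phi\bigl(w_i^{\top} x + b_i\bigr) \Bigr| \le \epsilon
```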
Expressiveness of NN:
- Deeper layers of a trained network learn more and more complex functions
Compositionality via mathematics:
- Given a library of simple functions, ex: sin, cos, log, exp
- If each node computes one of these functions, a node in the next layer can compute (see the sketch after this list):
- Linear combination: f(x) = ∑_i a_i g_i(x)
- Composition: f(x) = g_1(g_2(...g_n(x)...))
- Deep learning relies on hierarchical compositionality:
- vision: pixels → edge → texton → motif → part → object
- speech: sample → spectral → formant → motif → phone → word
- NLP: character → word → NP/VP/.. → clause → sentence → story
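A small sketch of these two ways of combining a function library (the library dict and the example functions f1, f2 are made up for illustration):

```python
import numpy as np

# Hypothetical library of simple functions (matching the examples above).
library = {"sin": np.sin, "cos": np.cos, "log": np.log, "exp": np.exp}

def linear_combination(funcs, coeffs):
    """f(x) = sum_i a_i * g_i(x): combine library functions in one layer."""
    return lambda x: sum(a * g(x) for a, g in zip(coeffs, funcs))

def composition(funcs):
    """f(x) = g_1(g_2(...g_n(x)...)): stack library functions layer by layer."""
    def f(x):
        for g in reversed(funcs):   # apply g_n first, g_1 last
            x = g(x)
        return x
    return f

f1 = linear_combination([library["sin"], library["exp"]], [2.0, 0.5])   # 2*sin(x) + 0.5*exp(x)
f2 = composition([library["log"], library["exp"], library["cos"]])      # log(exp(cos(x)))
print(f1(1.0), f2(1.0))
```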
SGD (stochastic gradient descent):
- If the optimization objective is convex, SGD will reach the global minimum
- If not, SGD still tends to perform well in practice
- Either way, we need a loss function whose gradient we can compute (a minimal SGD loop is sketched below)
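A minimal SGD loop, sketched for a convex case (squared loss on a linear model); the synthetic data, learning rate, and number of epochs are made-up illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                  # synthetic inputs
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=200)    # noisy linear targets

w = np.zeros(3)                                # weights to learn
lr = 0.05                                      # step size
for epoch in range(20):
    for i in rng.permutation(len(y)):          # one training example at a time
        err = y[i] - X[i] @ w                  # residual under the current weights
        w += lr * err * X[i]                   # gradient step on the squared loss 0.5*err**2
print(w)  # close to true_w: the squared loss is convex in w, so SGD reaches the minimum
```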
Training a neural network:
- The backpropagation algorithm = gradient descent + the chain rule
- It searches for a set of weight values that minimizes the total error of the network over the set of training examples
- Training repeats the following two passes (see the end-to-end sketch at the end of this section):
- forward pass: compute the outputs of all units in the network and the error at the output layer
- backward pass: the network error is used for updating the weights
- Starting at the output layer, the error is propagated backwards through the network, layer by layer.
- This is done by recursively computing the local gradient of each neuron.
- Gradient of the objective w.r.t. the output-layer weights v:
- ∇_{v_i} L = −E[(Y − ∑_{j=1}^{m} v_j ϕ(w_j^T X)) ϕ(w_i^T X)]
- Gradient of the objective w.r.t. the hidden-unit weights (chain rule, shown for w_1, where ϕ_1 = ϕ(w_1^T X)):
- ∇_{w_1} L = (dL/dϕ_1) ∇_{w_1} ϕ_1
- dL/dϕ_1 = −E[(Y − ∑_{i=1}^{m} v_i ϕ(w_i^T X)) v_1]
- ∇_{w_1} ϕ_1 = E[ϕ′(w_1^T X) X]
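A sketch of the forward and backward passes for the two-layer model above, replacing the expectations in the gradients with a single sampled example; the toy target, tanh activation, layer sizes, and learning rate are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 2, 8                                    # input dimension, number of hidden units
W = 0.5 * rng.normal(size=(m, d))              # row i holds w_i
v = 0.5 * rng.normal(size=m)                   # output-layer weights

phi = np.tanh
dphi = lambda z: 1.0 - np.tanh(z) ** 2         # phi'(z)

lr = 0.05
for step in range(5000):
    x = rng.normal(size=d)
    y = np.sin(x[0]) + 0.5 * x[1]              # toy target to regress on

    # Forward pass: outputs of all units, then the error at the output layer.
    z = W @ x                                  # net inputs w_i^T x
    h = phi(z)                                 # hidden activations phi(w_i^T x)
    y_hat = v @ h                              # network output sum_i v_i phi(w_i^T x)
    err = y - y_hat                            # (Y - sum_i v_i phi(w_i^T X))

    # Backward pass: propagate the error and update the weights (sample-based gradients).
    grad_v = -err * h                          # gradient w.r.t. output weights v
    grad_W = np.outer(-err * v * dphi(z), x)   # (dL/dphi_i) * grad_{w_i} phi_i, stacked as rows
    v -= lr * grad_v
    W -= lr * grad_W
```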