Definition:
- Properties:
- Online:
- Look at one example at a time, and update the model as soon as we make an error
- In contrast to batch algorithms, which update parameters only after seeing the entire training dataset
- Error-driven: We only update parameters if we make an error.
- Practical considerations:
- The order of training examples matters; a random order works better
- Early stopping: to avoid overfitting
- Simple modifications dramatically improve performance: voting or averaging
- The hypothesis h is a hyperplane that separates a linearly separable sample
- The bias scalar b can be folded into the weights by adding an extra constant feature dimension (e.g., $x_0 = 1$, so that b becomes the weight $w_0$); see the short sketch just below
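A minimal sketch of the bias-folding trick, assuming NumPy; the numbers are made up for illustration:

```python
import numpy as np

# Hypothetical 3-dimensional example x, weights w, and bias b (made-up values).
x = np.array([0.5, -1.0, 2.0])
w = np.array([1.0, 0.25, -0.5])
b = 0.3

# Fold the bias into the weights: prepend a constant feature x_0 = 1,
# so the bias b becomes the extra weight w_0.
x_aug = np.concatenate(([1.0], x))
w_aug = np.concatenate(([b], w))

# Both forms compute the same activation.
assert np.isclose(w @ x + b, w_aug @ x_aug)
```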
Perceptron prediction algorithm:
- PerceptronTest($w_0, w_1, \ldots, w_D, b, \hat{x}$):
- $a \leftarrow \sum_{d=1}^{D} w_d \hat{x}_d + b$ // compute activation for the test example
- return sign($a$)
- Standard Perceptron training algorithm: PerceptronTrain(D, MaxIter); a runnable sketch follows the pseudocode
- $w_d \leftarrow 0$, for all $d = 1 \ldots D$ // initialize weights
- $b \leftarrow 0$ // initialize bias
- for iter = 1 … MaxIter do
- for all $(x, y) \in D$ do
- $a \leftarrow \sum_{d=1}^{D} w_d x_d + b$ // compute activation for this example
- if $y\,a \leq 0$ then // mistake: the prediction has the wrong sign (or zero activation)
- $w_d \leftarrow w_d + y\,x_d$, for all $d = 1 \ldots D$ // update weights
- $b \leftarrow b + y$ // update bias
- end if
- end for
- end for
- return $w_0, w_1, \ldots, w_D, b$
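A minimal NumPy sketch of PerceptronTest and PerceptronTrain above; the random shuffling of the example order and the toy data are additions for illustration:

```python
import numpy as np

def perceptron_test(w, b, x):
    """PerceptronTest: predict one example's label from the sign of the activation."""
    a = np.dot(w, x) + b          # compute activation for the test example
    return 1 if a > 0 else -1

def perceptron_train(X, y, max_iter=10, rng=None):
    """PerceptronTrain: X is an N x D array of examples, y holds labels in {-1, +1}."""
    rng = np.random.default_rng() if rng is None else rng
    N, D = X.shape
    w = np.zeros(D)               # initialize weights
    b = 0.0                       # initialize bias
    for _ in range(max_iter):
        for i in rng.permutation(N):        # random example order helps in practice
            a = np.dot(w, X[i]) + b         # compute activation for this example
            if y[i] * a <= 0:               # mistake: wrong sign (or zero activation)
                w = w + y[i] * X[i]         # update weights
                b = b + y[i]                # update bias
    return w, b

# Toy usage on a small linearly separable dataset (made-up data).
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = perceptron_train(X, y)
print([perceptron_test(w, b, x) for x in X])   # should recover [1, 1, -1, -1]
```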
- Other variations predict based on both the final and the intermediate weight vectors:
- These require keeping track of the "survival time" $c^{(k)}$ of each weight vector $(w^{(k)}, b^{(k)})$, i.e., how many examples it classifies correctly before the next update
- The voted perceptron:
- $\hat{y} = \operatorname{sign}\!\left(\sum_{k=1}^{K} c^{(k)} \operatorname{sign}\!\left(w^{(k)} \cdot \hat{x} + b^{(k)}\right)\right)$
- The averaged perceptron:
- $\hat{y} = \operatorname{sign}\!\left(\sum_{k=1}^{K} c^{(k)} \left(w^{(k)} \cdot \hat{x} + b^{(k)}\right)\right)$
- Averaged perceptron with an efficient decision rule (a sketch in code follows below):
- Precompute the weighted sum of the $w^{(k)}$ and the weighted sum of the biases, so prediction needs only one dot product
- $\hat{y} = \operatorname{sign}\!\left(\left(\sum_{k=1}^{K} c^{(k)} w^{(k)}\right) \cdot \hat{x} + \sum_{k=1}^{K} c^{(k)} b^{(k)}\right)$
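A sketch of the averaged perceptron with the efficient decision rule, assuming NumPy; rather than storing every intermediate $(w^{(k)}, b^{(k)}, c^{(k)})$, it keeps running survival-time-weighted sums during training (the function names and the exact survival-time convention are my own):

```python
import numpy as np

def averaged_perceptron_train(X, y, max_iter=10):
    """Accumulate the current (w, b) after every example; the totals equal the
    survival-time-weighted sums  sum_k c^(k) w^(k)  and  sum_k c^(k) b^(k)."""
    N, D = X.shape
    w, b = np.zeros(D), 0.0
    w_sum, b_sum = np.zeros(D), 0.0       # running weighted sums
    for _ in range(max_iter):
        for i in range(N):
            if y[i] * (np.dot(w, X[i]) + b) <= 0:   # mistake: perceptron update
                w = w + y[i] * X[i]
                b = b + y[i]
            w_sum += w        # the current vector "survives" this example
            b_sum += b
    return w_sum, b_sum       # unnormalized averages; rescaling does not change the sign

def averaged_perceptron_test(w_sum, b_sum, x):
    # Efficient decision rule: a single dot product with the summed weights.
    return 1 if np.dot(w_sum, x) + b_sum > 0 else -1
```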
- Sometimes the perceptron cannot converge (e.g., when the data is not linearly separable)
Convergence of perceptron:
- Theorem (Block and Novikoff):
- If a training dataset $D = \{(x^{(1)}, y^{(1)}), \ldots, (x^{(N)}, y^{(N)})\}$ is linearly separable with margin $\gamma$ by a unit-norm hyperplane $w^*$ ($\|w^*\| = 1$) with $b = 0$,
- then the perceptron training algorithm converges after at most $\frac{R^2}{\gamma^2}$ errors during training (assuming $\|x\| < R$ for all examples)
- Margin of a dataset:
- Distance between the separating hyperplane (w,b) and the nearest point in dataset
- $\text{margin}(D, w, b) = \begin{cases} \min_{(x,y) \in D} \; y\,(w \cdot x + b) & \text{if } (w, b) \text{ separates } D \\ -\infty & \text{otherwise} \end{cases}$
- We want the hyperplane with the largest attainable margin on D: $\text{margin}(D) = \sup_{w, b} \text{margin}(D, w, b)$ (a toy computation follows below)
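A toy computation of margin(D, w, b) following the definition above, assuming NumPy; the dataset and the unit-norm hyperplane are made up:

```python
import numpy as np

def margin(X, y, w, b):
    """margin(D, w, b): min over examples of y * (w . x + b) if (w, b)
    separates the data, and -inf otherwise."""
    scores = y * (X @ w + b)
    return scores.min() if (scores > 0).all() else -np.inf

# Made-up dataset and a unit-norm separating hyperplane.
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = np.array([1.0, 1.0]) / np.sqrt(2.0)    # ||w|| = 1, so this is also the geometric margin
print(margin(X, y, w, b=0.0))              # 3/sqrt(2) ≈ 2.12, the nearest point's distance
```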
- Perceptron converges quickly when margin is large, slowly when margin is small
- Bound does not depend on number of training examples, nor on number of features
- The proof guarantees that the perceptron converges, but not necessarily to the maximum-margin separator (there may be several valid separating hyperplanes $w^*$); a proof sketch of the bound follows below
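One standard way to derive the bound, assuming $b = 0$, $\|w^*\| = 1$, margin $\gamma$, $\|x\| \le R$, and the mistake update $w \leftarrow w + y\,x$:

```latex
% Let w^{(k)} denote the weights after the k-th mistake, with w^{(0)} = 0.
\begin{align*}
% Progress: each mistake on (x, y) adds at least \gamma to the projection onto w^*,
% because y (w^* \cdot x) \ge \gamma for every training example:
w^{(k)} \cdot w^* &= \bigl(w^{(k-1)} + y\,x\bigr) \cdot w^*
  \ge w^{(k-1)} \cdot w^* + \gamma
  \;\Rightarrow\; w^{(k)} \cdot w^* \ge k\gamma \\
% Control: the squared norm grows by at most R^2 per mistake,
% since a mistake means y\,(w^{(k-1)} \cdot x) \le 0:
\| w^{(k)} \|^2 &= \| w^{(k-1)} \|^2 + 2\,y\,(w^{(k-1)} \cdot x) + \| x \|^2
  \le \| w^{(k-1)} \|^2 + R^2
  \;\Rightarrow\; \| w^{(k)} \|^2 \le k R^2 \\
% Combine via Cauchy--Schwarz, using \| w^* \| = 1:
k\gamma \le w^{(k)} \cdot w^* &\le \| w^{(k)} \| \le \sqrt{k}\,R
  \;\Rightarrow\; k \le \frac{R^2}{\gamma^2}
\end{align*}
```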
- Practical Implications
- Sensitive to noise: No convergence or accuracy guarantee if the data is not linearly separable due to noise
- Linear separability in practice
- Data is often linearly separable in practice
- Especially when the number of features is much larger than the number of examples
- Overfitting:
- Mitigate with early stopping and averaging