Definition:
- Properties:
- Online:
- Look at one example at a time, and update the model as soon as we make an error
- In contrast to batch algorithms, which update parameters only after seeing the entire training dataset
- Error-driven: We only update parameters if we make an error.
- Practical considerations:
- The order of training examples matters; a random order works better
- Early stopping: to avoid overfitting
- Simple modifications dramatically improve performance: voting or averaging
- The hypothesis h is a hyperplane that separates a linearly separable sample
- The bias scalar b can be folded into the weights by adding an extra constant feature dimension (e.g., $x_0 = 1$, so that b becomes the weight $w_0$); see the short sketch just below
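A minimal sketch of the bias-folding trick, assuming NumPy; the numbers are made up for illustration:

```python
import numpy as np

# Hypothetical 3-dimensional example x, weights w, and bias b (made-up values).
x = np.array([0.5, -1.0, 2.0])
w = np.array([1.0, 0.25, -0.5])
b = 0.3

# Fold the bias into the weights: prepend a constant feature x_0 = 1,
# so the bias b becomes the extra weight w_0.
x_aug = np.concatenate(([1.0], x))
w_aug = np.concatenate(([b], w))

# Both forms compute the same activation.
assert np.isclose(w @ x + b, w_aug @ x_aug)
```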
Perceptron prediction algorithm:
- PerceptronTest($w_0, w_1, \ldots, w_D, b, \hat{x}$):
- $a \leftarrow \sum_{d=1}^{D} w_d \hat{x}_d + b$ // compute activation for the test example
- return sign($a$)
- Standard Perceptron training algorithm: PerceptronTrain(D, MaxIter); a runnable sketch follows the pseudocode
- $w_d \leftarrow 0$, for all $d = 1 \ldots D$ // initialize weights
- $b \leftarrow 0$ // initialize bias
- for iter = 1 … MaxIter do
- for all $(x, y) \in D$ do
- $a \leftarrow \sum_{d=1}^{D} w_d x_d + b$ // compute activation for this example
- if $y\,a \leq 0$ then // mistake: the prediction has the wrong sign (or zero activation)
- $w_d \leftarrow w_d + y\,x_d$, for all $d = 1 \ldots D$ // update weights
- $b \leftarrow b + y$ // update bias
- end if
- end for
- end for
- return $w_0, w_1, \ldots, w_D, b$
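A minimal NumPy sketch of PerceptronTest and PerceptronTrain above; the random shuffling of the example order and the toy data are additions for illustration:

```python
import numpy as np

def perceptron_test(w, b, x):
    """PerceptronTest: predict one example's label from the sign of the activation."""
    a = np.dot(w, x) + b          # compute activation for the test example
    return 1 if a > 0 else -1

def perceptron_train(X, y, max_iter=10, rng=None):
    """PerceptronTrain: X is an N x D array of examples, y holds labels in {-1, +1}."""
    rng = np.random.default_rng() if rng is None else rng
    N, D = X.shape
    w = np.zeros(D)               # initialize weights
    b = 0.0                       # initialize bias
    for _ in range(max_iter):
        for i in rng.permutation(N):        # random example order helps in practice
            a = np.dot(w, X[i]) + b         # compute activation for this example
            if y[i] * a <= 0:               # mistake: wrong sign (or zero activation)
                w = w + y[i] * X[i]         # update weights
                b = b + y[i]                # update bias
    return w, b

# Toy usage on a small linearly separable dataset (made-up data).
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = perceptron_train(X, y)
print([perceptron_test(w, b, x) for x in X])   # should recover [1, 1, -1, -1]
```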
- Other variations predict based on both the final and the intermediate weight vectors:
- These require keeping track of the "survival time" $c^{(k)}$ of each weight vector $(w^{(k)}, b^{(k)})$, i.e., how many examples it classifies correctly before the next update
- The voted perceptron:
- $\hat{y} = \operatorname{sign}\!\left(\sum_{k=1}^{K} c^{(k)} \operatorname{sign}\!\left(w^{(k)} \cdot \hat{x} + b^{(k)}\right)\right)$
- The averaged perceptron:
- $\hat{y} = \operatorname{sign}\!\left(\sum_{k=1}^{K} c^{(k)} \left(w^{(k)} \cdot \hat{x} + b^{(k)}\right)\right)$
- Averaged perceptron with an efficient decision rule (a sketch in code follows below):
- Precompute the weighted sum of the $w^{(k)}$ and the weighted sum of the biases, so prediction needs only one dot product
- $\hat{y} = \operatorname{sign}\!\left(\left(\sum_{k=1}^{K} c^{(k)} w^{(k)}\right) \cdot \hat{x} + \sum_{k=1}^{K} c^{(k)} b^{(k)}\right)$
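A sketch of the averaged perceptron with the efficient decision rule, assuming NumPy; rather than storing every intermediate $(w^{(k)}, b^{(k)}, c^{(k)})$, it keeps running survival-time-weighted sums during training (the function names and the exact survival-time convention are my own):

```python
import numpy as np

def averaged_perceptron_train(X, y, max_iter=10):
    """Accumulate the current (w, b) after every example; the totals equal the
    survival-time-weighted sums  sum_k c^(k) w^(k)  and  sum_k c^(k) b^(k)."""
    N, D = X.shape
    w, b = np.zeros(D), 0.0
    w_sum, b_sum = np.zeros(D), 0.0       # running weighted sums
    for _ in range(max_iter):
        for i in range(N):
            if y[i] * (np.dot(w, X[i]) + b) <= 0:   # mistake: perceptron update
                w = w + y[i] * X[i]
                b = b + y[i]
            w_sum += w        # the current vector "survives" this example
            b_sum += b
    return w_sum, b_sum       # unnormalized averages; rescaling does not change the sign

def averaged_perceptron_test(w_sum, b_sum, x):
    # Efficient decision rule: a single dot product with the summed weights.
    return 1 if np.dot(w_sum, x) + b_sum > 0 else -1
```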
- Sometimes the perceptron cannot converge (e.g., when the data is not linearly separable)
Convergence of perceptron:
- Theorem (Block and Novikoff):
- If a training dataset $D = \{(x^{(1)}, y^{(1)}), \ldots, (x^{(N)}, y^{(N)})\}$ is linearly separable with margin $\gamma$ by a unit-norm hyperplane $w^*$ ($\|w^*\| = 1$) with $b = 0$,
- then the perceptron training algorithm converges after at most $\frac{R^2}{\gamma^2}$ errors during training (assuming $\|x\| < R$ for all examples)
- Margin of a dataset:
- Distance between the separating hyperplane (w,b) and the nearest point in dataset
- $\text{margin}(D, w, b) = \begin{cases} \min_{(x,y) \in D} \; y\,(w \cdot x + b) & \text{if } (w, b) \text{ separates } D \\ -\infty & \text{otherwise} \end{cases}$
- We want the hyperplane with the largest attainable margin on D: $\text{margin}(D) = \sup_{w, b} \text{margin}(D, w, b)$ (a toy computation follows below)
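A toy computation of margin(D, w, b) following the definition above, assuming NumPy; the dataset and the unit-norm hyperplane are made up:

```python
import numpy as np

def margin(X, y, w, b):
    """margin(D, w, b): min over examples of y * (w . x + b) if (w, b)
    separates the data, and -inf otherwise."""
    scores = y * (X @ w + b)
    return scores.min() if (scores > 0).all() else -np.inf

# Made-up dataset and a unit-norm separating hyperplane.
X = np.array([[2.0, 1.0], [1.0, 3.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w = np.array([1.0, 1.0]) / np.sqrt(2.0)    # ||w|| = 1, so this is also the geometric margin
print(margin(X, y, w, b=0.0))              # 3/sqrt(2) ≈ 2.12, the nearest point's distance
```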
- Perceptron converges quickly when margin is large, slowly when margin is small
- Bound does not depend on number of training examples, nor on number of features
- The proof guarantees that the perceptron converges, but not necessarily to the maximum-margin separator (there may be several valid separating hyperplanes $w^*$); a proof sketch of the bound follows below
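One standard way to derive the bound, assuming $b = 0$, $\|w^*\| = 1$, margin $\gamma$, $\|x\| \le R$, and the mistake update $w \leftarrow w + y\,x$:

```latex
% Let w^{(k)} denote the weights after the k-th mistake, with w^{(0)} = 0.
\begin{align*}
% Progress: each mistake on (x, y) adds at least \gamma to the projection onto w^*,
% because y (w^* \cdot x) \ge \gamma for every training example:
w^{(k)} \cdot w^* &= \bigl(w^{(k-1)} + y\,x\bigr) \cdot w^*
  \ge w^{(k-1)} \cdot w^* + \gamma
  \;\Rightarrow\; w^{(k)} \cdot w^* \ge k\gamma \\
% Control: the squared norm grows by at most R^2 per mistake,
% since a mistake means y\,(w^{(k-1)} \cdot x) \le 0:
\| w^{(k)} \|^2 &= \| w^{(k-1)} \|^2 + 2\,y\,(w^{(k-1)} \cdot x) + \| x \|^2
  \le \| w^{(k-1)} \|^2 + R^2
  \;\Rightarrow\; \| w^{(k)} \|^2 \le k R^2 \\
% Combine via Cauchy--Schwarz, using \| w^* \| = 1:
k\gamma \le w^{(k)} \cdot w^* &\le \| w^{(k)} \| \le \sqrt{k}\,R
  \;\Rightarrow\; k \le \frac{R^2}{\gamma^2}
\end{align*}
```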
- Practical Implications
- Sensitive to noise: No convergence or accuracy guarantee if the data is not linearly separable due to noise
- Linear separability in practice
- Data is often linearly separable in practice
- Especially when the number of features is much larger than the number of examples
- Overfitting:
- Mitigate with early stopping and averaging