Definition:
- Objective function: $\arg\min_{w,b} \mathcal{L}(w,b) = \underbrace{\sum_{n=1}^{N} \mathbb{1}[y_n(w^T x_n + b) < 0]}_{\text{loss}} + \underbrace{\lambda R(w,b)}_{\text{regularizer}}$
The 0-1 loss:
- Small changes in $w, b$ can lead to big changes in the loss value
- 0-1 loss is non-smooth, non-convex.
- → Approximate the 0-1 loss with surrogate loss functions (examples below take $b = 0$):
- Hinge loss: $[1 - y_n w^T x_n]_+ = \max(0, 1 - y_n w^T x_n)$
- Log loss: $\log(1 + \exp(-y_n w^T x_n))$
- Exponential loss: $\exp(-y_n w^T x_n)$
- Problem: optimizing the 0-1 loss above, exactly or approximately, is NP-hard in general
- Solution: use a surrogate loss that tracks (or upper-bounds) the 0-1 loss but is easier to optimize
- Different loss function approximations and regularizers lead to specific algorithms
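For concreteness, here is a minimal sketch (assuming NumPy; the function names are mine, not from the notes) that evaluates each surrogate at a few margin values $m = y_n w^T x_n$:

```python
import numpy as np

def zero_one_loss(m):
    return (m < 0).astype(float)      # 1[m < 0]

def hinge_loss(m):
    return np.maximum(0.0, 1.0 - m)   # [1 - m]_+

def log_loss(m):
    return np.log1p(np.exp(-m))       # log(1 + exp(-m))

def exp_loss(m):
    return np.exp(-m)

margins = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for name, f in [("0-1", zero_one_loss), ("hinge", hinge_loss),
                ("log", log_loss), ("exp", exp_loss)]:
    print(f"{name:>5}: {f(margins)}")
```

Unlike the 0-1 loss, the surrogates change smoothly as the margin crosses zero, which is what makes gradient-based optimization possible.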
The regularization term:
- favors simple solutions (an inductive bias: Occam's Razor)
- Ideally, we want most entries of w to be zero, so prediction depends only on a small number of features
- Formally, we want the count regularizer $R^{\text{cnt}}(w,b) = \sum_{d=1}^{D} \mathbb{1}[w_d \neq 0]$ (the number of nonzero weights)
- Again, it’s NP-hard ⇒ approximation
- e.g., we encourage w to be small
- Norm-based regularizers:
- $\|w\|_2^2 = \sum_{d=1}^{D} w_d^2$
- $\|w\|_1 = \sum_{d=1}^{D} |w_d|$
- $\|w\|_p = \left(\sum_{d=1}^{D} |w_d|^p\right)^{1/p}$
- $\ell_p$ norms can be used as regularizers
- Smaller $p$ favors sparse vectors $w$
- i.e., most entries of $w$ are close to or equal to 0
- $\ell_2$ norm: convex, smooth, easy to optimize
- $\ell_1$ norm: encourages sparse $w$; convex, but not smooth at the axis points
- $\ell_p$ norm, $p < 1$: becomes non-convex and hard to optimize
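To see why smaller $p$ favors sparsity, compare how each norm scores a dense vs. a sparse vector with the same $\ell_2$ norm (a small sketch assuming NumPy; the names are illustrative):

```python
import numpy as np

def lp_norm(w, p):
    """||w||_p = (sum_d |w_d|^p)^(1/p)"""
    return np.sum(np.abs(w) ** p) ** (1.0 / p)

dense  = np.array([0.5, 0.5, 0.5, 0.5])  # weight spread across features
sparse = np.array([1.0, 0.0, 0.0, 0.0])  # weight on a single feature

for p in [2.0, 1.0, 0.5]:
    print(f"p={p}: dense={lp_norm(dense, p):.3f}, sparse={lp_norm(sparse, p):.3f}")
# Both vectors have l2 norm 1, so l2 does not distinguish them;
# l1 (and especially l0.5) penalizes the dense vector more, favoring sparsity.
```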
Gradient descent:
- Idea: take iterative steps to update the parameters in the direction of the negative gradient
- GradientDescent($F$, $K$, $\eta^{(1)}, \dots, \eta^{(K)}$):
- $z^{(0)} \leftarrow \langle 0, 0, \dots, 0 \rangle$
- for $k = 1, \dots, K$ do:
- $g^{(k)} \leftarrow \nabla_z F \,|_{z^{(k-1)}}$ // compute gradient at current location
- $z^{(k)} \leftarrow z^{(k-1)} - \eta^{(k)} g^{(k)}$ // take a step down the gradient
- end for
- return $z^{(K)}$
- where $F$ is the function to minimize, $K$ is the number of steps, and $\eta^{(k)}$ is the step size (learning rate) at step $k$ (a runnable sketch follows the stopping-criteria list below)
- when to stop?
- When the gradient gets close to zero
- When the objective stops changing much
- When the parameters stop changing much
- Early stopping: when performance on a held-out validation set plateaus
- Step size: usually start with large steps and decrease them over time
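A minimal runnable sketch of the GradientDescent pseudocode above, assuming NumPy; the objective $F(z) = \|z - c\|^2$, the fixed step size, and the gradient-norm stopping test are my illustrative choices, combining the stopping criteria listed above:

```python
import numpy as np

def gradient_descent(grad_F, dim, eta=0.1, max_steps=1000, tol=1e-6):
    z = np.zeros(dim)                  # z^(0) <- <0, ..., 0>
    for k in range(max_steps):
        g = grad_F(z)                  # compute gradient at current location
        if np.linalg.norm(g) < tol:    # stop when the gradient is near zero
            break
        z = z - eta * g                # take a step down the gradient
    return z

c = np.array([3.0, -1.0])
grad = lambda z: 2.0 * (z - c)         # gradient of F(z) = ||z - c||^2
print(gradient_descent(grad, dim=2))   # converges to the minimizer c = [3, -1]
```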
- Subgradient:
- Problem: some objective functions are not differentiable everywhere, e.g., hinge loss, $\ell_1$ norm
- Solution: subgradient optimization (differentiate piecewise); these functions are non-differentiable only at a few kink points, and the probability of landing exactly on one is zero
- Subgradient of hinge loss:
- For a given example n
- $\partial_w \max\{0,\, 1 - y_n(w \cdot x_n + b)\}$
- $= \partial_w \begin{cases} 0 & \text{if } y_n(w \cdot x_n + b) > 1 \\ 1 - y_n(w \cdot x_n + b) & \text{otherwise} \end{cases}$
- $= \begin{cases} 0 & \text{if } y_n(w \cdot x_n + b) > 1 \\ -y_n x_n & \text{otherwise} \end{cases}$
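The piecewise result translates directly into code; the bias case follows the same derivation with $\partial_b$, giving $-y_n$ in the violated branch (a sketch assuming NumPy; the function name is mine):

```python
import numpy as np

def hinge_subgradient(w, b, x, y):
    """A subgradient of max(0, 1 - y(w.x + b)) with respect to (w, b)."""
    if y * (np.dot(w, x) + b) > 1:
        return np.zeros_like(w), 0.0   # margin satisfied: subgradient is 0
    # Gradient descent steps opposite this direction, i.e. w <- w + eta*y*x,
    # matching the update in HingeRegularizedGD below.
    return -y * x, -y
```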
- HingeRegularizedGD($\mathcal{D}$, $\lambda$, MaxIter):
- $w \leftarrow \langle 0, 0, \dots, 0 \rangle$, $b \leftarrow 0$ // initialize weights and bias
- for iter $= 1 \dots$ MaxIter do:
- $g \leftarrow \langle 0, 0, \dots, 0 \rangle$, $g_b \leftarrow 0$ // initialize gradients of weights and bias
- for all $(x, y) \in \mathcal{D}$ do:
- if $y(w \cdot x + b) \leq 1$ then
- $g \leftarrow g + yx$ // update weight gradient
- $g_b \leftarrow g_b + y$ // update bias derivative
- end if
- end for
- $g \leftarrow g - \lambda w$ // add in regularization term
- $w \leftarrow w + \eta g$ // update weights
- $b \leftarrow b + \eta g_b$ // update bias
- end for
- return $w, b$
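A runnable sketch of HingeRegularizedGD, assuming NumPy; the step size $\eta$ is left implicit in the pseudocode, so a fixed value is supplied here as a parameter, and the default values of $\lambda$ and $\eta$ are my own:

```python
import numpy as np

def hinge_regularized_gd(X, y, lam=0.1, eta=0.01, max_iter=100):
    """X: (N, D) feature matrix; y: (N,) labels in {-1, +1}."""
    N, D = X.shape
    w = np.zeros(D)                            # initialize weights
    b = 0.0                                    # initialize bias
    for _ in range(max_iter):
        g = np.zeros(D)                        # gradient accumulator for w
        g_b = 0.0                              # gradient accumulator for b
        for x_n, y_n in zip(X, y):
            if y_n * (np.dot(w, x_n) + b) <= 1:  # margin violated
                g += y_n * x_n                 # update weight gradient
                g_b += y_n                     # update bias derivative
        g -= lam * w                           # add in regularization term
        w += eta * g                           # update weights
        b += eta * g_b                         # update bias
    return w, b

# Tiny usage example on linearly separable data:
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
w, b = hinge_regularized_gd(X, y)
print(np.sign(X @ w + b))                      # should recover y
```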