Definition:
- Objective function: $\arg\min_{w,b} \mathcal{L}(w,b) = \underbrace{\sum_{n=1}^{N} \mathbb{1}[y_n(w^T x_n + b) < 0]}_{\text{loss}} + \underbrace{\lambda R(w,b)}_{\text{regularizer}}$
The 0-1 loss:
- Small changes in $w, b$ can lead to big changes in the loss value
- 0-1 loss is non-smooth, non-convex.
- → Approximate the 0-1 loss with surrogate loss functions (examples below take $b = 0$):
- Hinge loss: $[1 - y_n w^T x_n]_+ = \max(0, 1 - y_n w^T x_n)$
- Log loss: $\log(1 + \exp(-y_n w^T x_n))$
- Exponential loss: $\exp(-y_n w^T x_n)$
- Problem: optimizing the 0-1 loss above, exactly or approximately, is NP-hard in general
- Solution: use a surrogate loss that tracks (or upper-bounds) the 0-1 loss but is easier to optimize
- Different loss function approximations and regularizers lead to specific algorithms
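For concreteness, here is a minimal sketch (assuming NumPy; the function names are mine, not from the notes) that evaluates each surrogate at a few margin values $m = y_n w^T x_n$:

```python
import numpy as np

def zero_one_loss(m):
    return (m < 0).astype(float)      # 1[m < 0]

def hinge_loss(m):
    return np.maximum(0.0, 1.0 - m)   # [1 - m]_+

def log_loss(m):
    return np.log1p(np.exp(-m))       # log(1 + exp(-m))

def exp_loss(m):
    return np.exp(-m)

margins = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for name, f in [("0-1", zero_one_loss), ("hinge", hinge_loss),
                ("log", log_loss), ("exp", exp_loss)]:
    print(f"{name:>5}: {f(margins)}")
```

Unlike the 0-1 loss, the surrogates change smoothly as the margin crosses zero, which is what makes gradient-based optimization possible.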
The regularization term:
- favors simple solutions (an inductive bias: Occam's Razor)
- Ideally, we want most entries of w to be zero, so prediction depends only on a small number of features
- Formally, we want the count regularizer $R^{\text{cnt}}(w,b) = \sum_{d=1}^{D} \mathbb{1}[w_d \neq 0]$ (the number of nonzero weights)
- Again, it’s NP-hard ⇒ approximation
- e.g., we encourage w to be small
- Norm-based regularizers:
- $\|w\|_2^2 = \sum_{d=1}^{D} w_d^2$
- $\|w\|_1 = \sum_{d=1}^{D} |w_d|$
- $\|w\|_p = \left(\sum_{d=1}^{D} |w_d|^p\right)^{1/p}$
- $\ell_p$ norms can be used as regularizers
- Smaller $p$ favors sparse vectors $w$
- i.e., most entries of $w$ are close to or equal to 0
- $\ell_2$ norm: convex, smooth, easy to optimize
- $\ell_1$ norm: encourages sparse $w$; convex, but not smooth at the axis points
- $\ell_p$ norm, $p < 1$: becomes non-convex and hard to optimize
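To see why smaller $p$ favors sparsity, compare how each norm scores a dense vs. a sparse vector with the same $\ell_2$ norm (a small sketch assuming NumPy; the names are illustrative):

```python
import numpy as np

def lp_norm(w, p):
    """||w||_p = (sum_d |w_d|^p)^(1/p)"""
    return np.sum(np.abs(w) ** p) ** (1.0 / p)

dense  = np.array([0.5, 0.5, 0.5, 0.5])  # weight spread across features
sparse = np.array([1.0, 0.0, 0.0, 0.0])  # weight on a single feature

for p in [2.0, 1.0, 0.5]:
    print(f"p={p}: dense={lp_norm(dense, p):.3f}, sparse={lp_norm(sparse, p):.3f}")
# Both vectors have l2 norm 1, so l2 does not distinguish them;
# l1 (and especially l0.5) penalizes the dense vector more, favoring sparsity.
```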
Gradient descent:
- Idea: take iterative steps to update the parameters in the direction of the negative gradient
- GradientDescent($F$, $K$, $\eta^{(1)}, \dots, \eta^{(K)}$):
- $z^{(0)} \leftarrow \langle 0, 0, \dots, 0 \rangle$
- for $k = 1, \dots, K$ do:
- $g^{(k)} \leftarrow \nabla_z F \,|_{z^{(k-1)}}$ // compute gradient at current location
- $z^{(k)} \leftarrow z^{(k-1)} - \eta^{(k)} g^{(k)}$ // take a step down the gradient
- end for
- return $z^{(K)}$
- where $F$ is the function to minimize, $K$ is the number of steps, and $\eta^{(k)}$ is the step size (learning rate) at step $k$ (a runnable sketch follows the stopping-criteria list below)
- when to stop?
- When the gradient gets close to zero
- When the objective stops changing much
- When the parameters stop changing much
- Early stopping: when performance on a held-out validation set plateaus
- Step size: usually start with large steps and decrease them over time
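A minimal runnable sketch of the GradientDescent pseudocode above, assuming NumPy; the objective $F(z) = \|z - c\|^2$, the fixed step size, and the gradient-norm stopping test are my illustrative choices, combining the stopping criteria listed above:

```python
import numpy as np

def gradient_descent(grad_F, dim, eta=0.1, max_steps=1000, tol=1e-6):
    z = np.zeros(dim)                  # z^(0) <- <0, ..., 0>
    for k in range(max_steps):
        g = grad_F(z)                  # compute gradient at current location
        if np.linalg.norm(g) < tol:    # stop when the gradient is near zero
            break
        z = z - eta * g                # take a step down the gradient
    return z

c = np.array([3.0, -1.0])
grad = lambda z: 2.0 * (z - c)         # gradient of F(z) = ||z - c||^2
print(gradient_descent(grad, dim=2))   # converges to the minimizer c = [3, -1]
```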
- Subgradient:
- Problem: some objective functions are not differentiable everywhere, e.g., hinge loss, $\ell_1$ norm
- Solution: subgradient optimization (differentiate piecewise); these functions are non-differentiable only at a few kink points, and the probability of landing exactly on one is zero
- Subgradient of hinge loss:
- For a given example n
- $\partial_w \max\{0,\, 1 - y_n(w \cdot x_n + b)\}$
- $= \partial_w \begin{cases} 0 & \text{if } y_n(w \cdot x_n + b) > 1 \\ 1 - y_n(w \cdot x_n + b) & \text{otherwise} \end{cases}$
- $= \begin{cases} 0 & \text{if } y_n(w \cdot x_n + b) > 1 \\ -y_n x_n & \text{otherwise} \end{cases}$
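The piecewise result translates directly into code; the bias case follows the same derivation with $\partial_b$, giving $-y_n$ in the violated branch (a sketch assuming NumPy; the function name is mine):

```python
import numpy as np

def hinge_subgradient(w, b, x, y):
    """A subgradient of max(0, 1 - y(w.x + b)) with respect to (w, b)."""
    if y * (np.dot(w, x) + b) > 1:
        return np.zeros_like(w), 0.0   # margin satisfied: subgradient is 0
    # Gradient descent steps opposite this direction, i.e. w <- w + eta*y*x,
    # matching the update in HingeRegularizedGD below.
    return -y * x, -y
```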
- HingeRegularizedGD($\mathcal{D}$, $\lambda$, MaxIter):
- $w \leftarrow \langle 0, 0, \dots, 0 \rangle$, $b \leftarrow 0$ // initialize weights and bias
- for iter $= 1 \dots$ MaxIter do:
- $g \leftarrow \langle 0, 0, \dots, 0 \rangle$, $g_b \leftarrow 0$ // initialize gradients of weights and bias
- for all $(x, y) \in \mathcal{D}$ do:
- if $y(w \cdot x + b) \leq 1$ then
- $g \leftarrow g + yx$ // update weight gradient
- $g_b \leftarrow g_b + y$ // update bias derivative
- end if
- end for
- $g \leftarrow g - \lambda w$ // add in regularization term
- $w \leftarrow w + \eta g$ // update weights
- $b \leftarrow b + \eta g_b$ // update bias
- end for
- return $w, b$
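A runnable sketch of HingeRegularizedGD, assuming NumPy; the step size $\eta$ is left implicit in the pseudocode, so a fixed value is supplied here as a parameter, and the default values of $\lambda$ and $\eta$ are my own:

```python
import numpy as np

def hinge_regularized_gd(X, y, lam=0.1, eta=0.01, max_iter=100):
    """X: (N, D) feature matrix; y: (N,) labels in {-1, +1}."""
    N, D = X.shape
    w = np.zeros(D)                            # initialize weights
    b = 0.0                                    # initialize bias
    for _ in range(max_iter):
        g = np.zeros(D)                        # gradient accumulator for w
        g_b = 0.0                              # gradient accumulator for b
        for x_n, y_n in zip(X, y):
            if y_n * (np.dot(w, x_n) + b) <= 1:  # margin violated
                g += y_n * x_n                 # update weight gradient
                g_b += y_n                     # update bias derivative
        g -= lam * w                           # add in regularization term
        w += eta * g                           # update weights
        b += eta * g_b                         # update bias
    return w, b

# Tiny usage example on linearly separable data:
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -1.5], [-2.0, -0.5]])
y = np.array([1, 1, -1, -1])
w, b = hinge_regularized_gd(X, y)
print(np.sign(X @ w + b))                      # should recover y
```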