Definition: a linear classifier predicts ŷ = sign(w·x + b); training chooses w, b to minimize a loss term plus a regularization term:

  min over w, b of  Σ_n ℓ(y_n, w·x_n + b) + λ R(w, b)

The 0-1 loss: ℓ^(0-1)(y, ŷ) = 1[y·ŷ ≤ 0] with ŷ = w·x + b, i.e., count 1 for each misclassified example:

  • Small changes in w, b can lead to big changes in the loss value
  • 0-1 loss is non-smooth, non-convex.
  • Approximate the 0-1 loss with surrogate loss functions (written for the case b = 0; see the sketch after this list):
    • Hinge loss: max(0, 1 − y(w·x))
    • Log loss: log(1 + exp(−y(w·x)))
    • Exponential loss: exp(−y(w·x))
  • Problem: the 0-1 loss is NP-hard to optimize exactly, or even approximately, in general
  • Solution: use a surrogate loss that tracks (and typically upper-bounds) the 0-1 loss
    • Different choices of surrogate loss and regularizer lead to different specific algorithms
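A minimal numeric sketch of the surrogate losses above, assuming NumPy (the function names are ad hoc, not from any library). It evaluates each loss at a few values of the margin y(w·x) to show how the convex surrogates track the 0-1 loss:

```python
import numpy as np

def zero_one_loss(y, score):
    # 1 if the example is misclassified (margin y*score <= 0), else 0
    return float(y * score <= 0)

def hinge_loss(y, score):
    return max(0.0, 1.0 - y * score)

def log_loss(y, score):
    return float(np.log(1.0 + np.exp(-y * score)))

def exp_loss(y, score):
    return float(np.exp(-y * score))

# Evaluate each loss over a range of margins y * (w . x); the hinge and
# exponential losses upper-bound the 0-1 loss, the log loss tracks it closely.
for margin in [-2.0, -0.5, 0.0, 0.5, 2.0]:
    print(f"margin={margin:+.1f}  0-1={zero_one_loss(1, margin):.0f}  "
          f"hinge={hinge_loss(1, margin):.3f}  log={log_loss(1, margin):.3f}  "
          f"exp={exp_loss(1, margin):.3f}")
```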

The regularization term:

  • Encourages simple solutions (inductive bias; Occam’s Razor: prefer the simplest hypothesis consistent with the data)
  • Ideally, we want most entries of w to be zero, so prediction depends only on a small number of features
  • Formally, we would like to minimize the number of nonzero entries of w, i.e., the ℓ0 “norm” ||w||_0
  • Again, it’s NP-hard ⇒ approximation
    • e.g., we instead encourage the entries of w to be small
  • Norm-based regularizers (see the sketch after this list):
    • ℓp norms, ||w||_p = (Σ_d |w_d|^p)^(1/p), can be used as regularizers
    • Smaller p favors sparse vectors
      • i.e., most entries of w are close to or equal to 0
    • ℓ2 norm: convex, smooth, easy to optimize
    • ℓ1 norm: encourages sparse w; convex, but not smooth at the axis points (where some w_d = 0)
    • ℓp norm with p < 1: becomes non-convex and hard to optimize
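A small illustration of why smaller p favors sparsity, assuming NumPy (the helper name lp_norm is made up for this note). Two weight vectors with identical ℓ2 norm are penalized very differently by the ℓ1 and ℓ0.5 penalties:

```python
import numpy as np

def lp_norm(w, p):
    # ||w||_p = (sum_d |w_d|^p)^(1/p); for p < 1 this is not a true norm
    return np.sum(np.abs(w) ** p) ** (1.0 / p)

# A sparse weight vector and a dense one with the same l2 norm (both equal 3).
w_sparse = np.array([3.0, 0.0, 0.0, 0.0])
w_dense  = np.array([1.5, 1.5, 1.5, 1.5])

for p in [2.0, 1.0, 0.5]:
    print(f"p={p}:  sparse={lp_norm(w_sparse, p):.3f}  dense={lp_norm(w_dense, p):.3f}")

# Smaller p penalizes the dense vector more heavily relative to the sparse one,
# which is why small p (e.g., the l1 norm) encourages sparsity.
```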

Optimize with Gradient Descent:

  • Idea: take iterative steps to update the parameters in the direction of the negative gradient (i.e., downhill on the objective)
  • GradientDescent(F, K, η_1, η_2, …):
    • z^(0) ← some initial value (e.g., all zeros)
    • for k = 1, …, K do:
      • g^(k) ← ∇_z F(z^(k−1))          // compute gradient at current location
      • z^(k) ← z^(k−1) − η_k g^(k)     // take a step down the gradient
    • end for
    • return z^(K)
  • where F is the function to minimize, K is the number of steps, and η_k is the step size (learning rate) at step k
  • when to stop?
    • When the gradient gets close to zero
    • When the objective stops changing much
    • When the parameters stop changing much
    • Early stopping: when performance on a held-out validation set plateaus (or starts to degrade)
  • Step size: usually start with large steps and get smaller
  • Subgradient:
    • Problem: some objective functions are not differentiable everywhere, e.g., the hinge loss and the ℓ1 norm
    • Solution: subgradient optimization (differentiate piecewise and use any valid subgradient at the kinks); these functions are non-differentiable only at isolated points (margin exactly 1 for the hinge loss, w_d = 0 for the ℓ1 norm), which are essentially never hit exactly in practice
    • Subgradient of the hinge loss max(0, 1 − y_n(w·x_n + b)) for a given example (x_n, y_n):
      • if y_n(w·x_n + b) > 1: the loss is 0, so the gradient is 0
      • otherwise: a subgradient w.r.t. w is −y_n x_n and w.r.t. b is −y_n
    • HingeRegularizedGD(D, λ, MaxIter) — λ is the regularization strength, η the step size (a runnable sketch follows this list):
      • w ← ⟨0, …, 0⟩, b ← 0            // initialize weights and bias
      • for iter = 1 … MaxIter do
        • g ← ⟨0, …, 0⟩, g_b ← 0        // initialize gradient of weights and bias
        • for all (x, y) ∈ D do
          • if y(w·x + b) ≤ 1 then
            • g ← g + y x               // update weight gradient
            • g_b ← g_b + y             // update bias derivative
          • end if
        • end for
        • g ← g − λ w                   // add in regularization term
        • w ← w + η g                   // update weights
        • b ← b + η g_b                 // update bias
      • end for
      • return w, b
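To make the pseudocode above concrete, here is a minimal, self-contained Python sketch of subgradient descent on the ℓ2-regularized hinge objective, assuming NumPy (the function name, default hyperparameters, and toy data are my own, not from any particular library). It also includes a decaying step size and a gradient-norm stopping check, as discussed above:

```python
import numpy as np

def hinge_regularized_gd(X, y, lam=0.1, eta0=0.1, max_iter=100, tol=1e-6):
    """Subgradient descent on  sum_n max(0, 1 - y_n (w.x_n + b)) + (lam/2) ||w||^2.

    X is an (N, D) feature matrix; y holds labels in {-1, +1}.
    """
    w = np.zeros(X.shape[1])                      # initialize weights and bias
    b = 0.0
    for it in range(1, max_iter + 1):
        g_w = np.zeros(X.shape[1])                # initialize gradient of weights and bias
        g_b = 0.0
        for x_n, y_n in zip(X, y):
            if y_n * (np.dot(w, x_n) + b) <= 1:   # inside the margin or misclassified
                g_w += y_n * x_n                  # update weight gradient (negative subgradient)
                g_b += y_n                        # update bias derivative
        g_w -= lam * w                            # add in regularization term
        eta = eta0 / np.sqrt(it)                  # step size: start large, get smaller
        w += eta * g_w                            # update weights
        b += eta * g_b                            # update bias
        if np.linalg.norm(g_w) < tol and abs(g_b) < tol:
            break                                 # stop when the (sub)gradient is near zero
    return w, b

# Toy usage on a linearly separable data set.
X = np.array([[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1, 1, -1, -1])
w, b = hinge_regularized_gd(X, y)
print("w =", w, "b =", b)
print("predictions:", np.sign(X @ w + b))  # should match y
```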

Binary Classification with Linear Models