Definition:

  • Learn and improve from “experience” without emperically programmed
  • Most central problem is generalization, how to generalize the data learned to the data that the model never seen before
    • principle of induction
  • Data set
    • comes from a set of samples
    • Each sample is a pair of input-output data
      • input : a set of features (independent variables)
      • output : a label, a set of features (dependent variable)
      • Test set: final exam to test the ability of model to generalize
        • if test given is too different from what was taught, the algorithm can’t generalize pass its experience, it is a bad test
        • also a bad test if the test is exact same as what was taught, it doesn’t generalize at all
        • a good test should test the ability to generalize
  • Target function
    • unknown
  • Function hypotheses
    • a set of possible functions that can map input to output
    • We aim to find such best approximates
  • Loss function:
    • determines how bad is the predicted label compared to the true label
    • ex: in classification, if , 1 if
  • Data generating distribution:
  • Expected loss:
    • We wish to minimize expected loss
    • as we dont know but only samples of , we assume Law of large number, therefore, the sample has same distribution as D

Two approaches in learning:

  • Eager learning:
    • ex: Decision trees
    • Learn/train: induce from an abstract model from the data
    • Test/Predict/Classify: apply learned model to new data
  • Lazy learning:
    • ex: Nearest neighbor
    • Learn: store data in memory
    • Test/Predict/Classify: compare new data to stored data
    • Properties:
      • retians all information seen in training
      • complex hypothesis space
      • classification can be slow
  • To predict the target/label using features: Supervised Learning
  • To find interesting patterns in data: Unsupervised learning
  • Semi-supervised learning
  • Reinforcement Learning
  • Deep Learning

Inductive Bias:

  • What the machine learns might differ from what we intended
    • ex: recognizing US vs Russian tanks model performs bad due to the quality of input pictures
  • Due to not enough variety of data to narrow down the relevant concept to learn

Fitting

  • Training Error over training data,
  • True Error over training data,
  • Overfitting
    • if
      • test error as we cant know the true error
    • Amount of overfitting =

Decision Tree

Classification Model and Nearest Neighbor for Classification

Perceptron Training

Linear Classifier


HMM

Model evaluation


Practical Consideration:

Hypothesis Testing
  • Accuracy score of once cant tell if a model is better than another, perhaps due to chance
Debugging:
  • Check implementation is correct
    • measure loss rather than accuracy
    • replace it with another easier dataset (toy)
  • Is data too noisy?
  • is learning problem too hard?
Strategies for isolating causes of errors:
  • Is representation adequate?
    • Can you learn if you add a cheating feature that perfectly correlates with correct class?
  • Train/test mismatch?
    • Try re-selecting train/test by shuffling training data and test together
  • Do you have enough data? Try training on 80% of the training set, how much does it hurt performance?
Hyperparameter tuning with validation set vs cross-validation
  • Cross-validation: loop through every hyperparameter setting for a hyperparameter and try which one leads to least error
  • N-fold cross validation:
    • Instead of a single test-training split, split data into N equal-sized parts
    • Train and test N different classifiers and report average & std.deviation accuracy
Model evaluation
  • Formalizing errors:
      • Estimation error (variance) + approximation error (bias)
        • estimation error = error of chosen f to optimal
        • approximation error = error of of the hypothesis chosen
      • F is set of all possible classifiers of a hypothesis
      • is optimal classifier
Bias/variance trade-off
  • if the learning algorithm is too flexible, it may fit each training dataset differently, high variance
    • only one of bias or variance can be reduced at once and the optimum point should be chosen
Class imbalance problem
  • How to deal with:
    1. Choose the right metrics: choose the one that is more relevant and important (ex: recall on cancer prediction is more important)
    2. Data-level methods: resampling
      • Undersampling: remove samples from the majority class
        • can cause loss of information
        • ex: Tomek Links, find pairs of close samples of opposite classes and remove the majority class in each pair
          • Make decision boundary clear but make model underfit (learn less)
          • only work with low-dimension data
      • Oversampling: add more examples to the minority class
        • can cause overfitting
        • ex: SMOTE
          • Synthesize samples of minority class as convex (linear) combinations of existing points and their nearest neighbors of same class
          • only work with low-dimension data
    3. Algorithm-level methods:
      • Instead of naive loss where all samples contribute equally to the loss,
      • Idea: training samples we care about should contribute more to the loss
      • Ex:
        • Cost-sensitive learning
          • let be cost of class i but classified as class j
          • The loss caused by instance x of class i will become the weighted average of all possible classifications of instance x:
        • class-balanced loss
          • Give more weight to rare classes - then you incentivize the model to learn to classify them better.
        • focal loss
          • Give more weight to the examples that the model is having difficulty with.
            • downweighs well-classified samples
          • estimates probability of the model for class
          • Cross-entropy:
          • Focal loss:

A probabilistic view:

  • Bayes theorem
  • Joint Distribution
  • Bayes Optimal Classifier
    • Assume we know the data generating function,
    • We define Bayes Optimal Classifier as
    • Theorem: Of all the classifiers, the Bayes Optimal Classifier achieve the smallest zero/one loss
  • Training = estimating from the finite training set
    • typically estimate as parameters of a probability distributions
    • assumption: independent and identically distributed
    • with maximum likelihood
    • If
    • Naive Bayes classifier

Logistic regression:

  • Binary classification:
    • Predicted label of is which is a function that takes in a set of parameters and sample of input
  • We let be a Sigmoid Function as it range from 0 to 1
  • Logistic regression with Maximum Likelihood Estimation:
      • the parameter that make the product of observed samples as likely as possible
    • taking
    • Let

Multiclass classification:

  • Classification Model
  • Reductions for Multi-class Classification
    • Given
      1. An input space and number of classes
      2. An unknown distribution over
    • Learning: Find a function minimizing
    • In most task, this works for , for larger , frame problem differently
    • Reduction 1: One-versus-All
      • Train K number of binary classifiers, each classifier classifies whether a sample is in that class or not
      • At test time
        • If only one classifier predicts positive, predict that class
        • if not, Break ties randomly
      • OneVersusAllTraing(D_multiclass, BinaryTraing)
          for i = 1 to K:
            D_bin <- relabel D_multiclass so class i is positive and !i is negative
            f_i <- BinaryTrain(D_bin)
         endfor
          return f_1, f_2,..,f_K
        
      • OneVersusAllTest(f_1, f_2,...,f_K, hat_x)
        score <- <0,0,...,0> //initialize K-many scores to 0
        for i = 1 to K:
            y <- f_i(hat_x)
            score_i <- score_i + y
        end for
        return argmax_k(score_k)
        
    • Reduction 2: All-versus-All
      • Train number of binary classifiers, each classifier classifies whether a sample is in that class i or class j for every pair of class
  • Recall that:
    • Let be array of
    • softmax function
  • This can be extended to work with multl-label classification:
  • Recall the binary case
  • Multi-label case

Neural Network

PCA

  • better representations of our data point
    • to visualize better, remove noise, less resource requirements, better classification
  • Sample variance of data project on vector is
  • Objective: Finding vector that maximize sample variance of projected data Lagrangian folds constants into objective:
    • solution: such
  • The eigenvalue denotes the amount of variability captured along dimension
    • sample variance of projection
  • if we rank eigenvalues from large to small
    • the 1st principle component of is associated with largest eigenvalue
    • the 2nd principle component of is associated with 2nd largest eigenvalue
  • It also minimizes construction error:
  • limit to only linear projection and only based on covariance

Autoencoder

  • Its non-linear dimensionality reduction method
  • If autoencoder doesnt have activation function, it is works same as PCA
    • is activation function such as sigmoid
    • is the latent representation
  • Then
  • pretraining helps the model start with weights that have already been optimized for general patterns, improving learning efficiency and potentially leading to better performance
  • Pretraining process with autoencoders
    • Pretraining step: train a sequence of shallow autoencoders, greedily one layer at a time, using unsupervised data
    • Fine-tuning step 1: train the last layer using supervised data
    • Fine-tuning step 2: use backpropagation to fine-tune the entire network using supervised data.
    • Why does this work?: it is easier to train one layer at a time and can utilize unlabeled data

Kernel methods/tricks:

  • Feature mapping:
    • Add new dimensions to the vector so that its linearly separable
    • example:
    • Cons: more expensive to train, require more training examples
  • Kernel methods:
    • rewrite linear models so that the mapping never needs to be explicitly computed, only depend on the dot products between 2 examples
      • and
    • Replace dot product by
    • example: is as same as
  • Kernels: Formally defined
    • each kernel has an associated feature mapping that takes input (input space) and maps to (feture space)
    • Kernel k takes 2 inputs and gives their similarity in space,
    • needs to be s vector space with a dot product defined on it, also called Hilbert Space
  • Mercer’s condition
    • Not all function can be used as a kernal function, it must satisfy Mercer’s condition
    • For to be a kernel function:
      • There must exist a Hilbert Space for which defines a dot product
      • The above is true if is a positive definite function
        • For all square integrable functions
  • Constructing combinations of kernels:
    • Let be two kernel functions then the following are also kernel fns:
      • : direct sum
      • : scalar product
      • : direct sum
    • Commonly used kernel fns:
      • Linear (trivial) kernel: (mapping fn is identity - no mapping)
      • Quadratic kernel: or
      • Polynomial kernel (of degree d): or
      • Radial basis function (RBF) kernel:
  • Kernelizing the perceptron:
    • Naive approach: train a perceptron in the new feature space
      • if :
    • Perceptron representer theorem: During a run of the perceptron algorithm, the weight vector w can always be represented as a linear combination of the expanded training data
    • We can use the perceptron representer theorem to compute activation as a dot product between examples:

Support vector machines

  • In the dataset that is linearly separable, there are many hyperplane that helps us classify the dataset. But what is the “best” hyperplane?
  • SVM: a hyperplane based linear classifier defined by
  • Assume that the data is linearly separable
    • The problem reduced to
      • s.t
    • Large margin > small good generalization
    • How to solve?
      • Lagranian achieve dual problem
      • Solve the maximum of dual function in dual problem
      • According to KKT condition, the in the lagranian is mostly 0.
  • If the data is not linearly separable:
    • add the slack variable
    • s.t
    • small large margin and more misclassified samples
    • large small margin and less misclassified samples
  • Soft SVM:

Self-supervised learning:

  • A version of unsupervised learning where data provides the supervision.
    • A type of representation learning.
  • In general, withhold some part of the data and the task a neural network to predict it from the remaining parts.
    • Self-supervised learning tasks are also known as pretext tasks.
  • Details decide what pretext task the network tries to solve
  • Depending on the quality of the pretext task, good semantic features can be obtained without actual labels.
  • Cognitive principles:
    • reconstruct from corrupted (or partial) version
      • Denoising Autoencoder
      • in-painting
      • Colorization, Split-brain encoder (split image into 2, each generate the other half)
    • visual common sense tasks
      • relative patch prediction
      • jigsaw puzzles
      • rotation
    • contrastive learning
      • word2vec
      • contrastive predictive coding (CPC)
      • Instance discrimination
    • In general, considering spatial and temporal dimensions
      • Predict any part of the input from any other part
      • predict future from the past/recent past
      • preduct past from present
      • predict top from bottom
      • predict the occluded from the visible
      • pretend there is a part of the input you dont know and predict that

Generative Modeling:

  • Simple generative models:
    • histograms
      • If we know for each data point of a history
      • We can sample by having cumulative distribution from for all
      • Then return the smallest for
      • new datapoint is created
      • but it comes too expensive in higher dimetion
      • training in data distribution, poor generalization
  • Parameterized distribution: Likelihood-based Generative Model
    • approximate the function with parameters
    • Learn so that
    • There will be many choices for model design, each with different tradeoffs and different compatibility criteria
    • Set up model class: a set of parameterized distribution
    • Search/optimization problem
    • the loss fn and search problem must work with large dataset and yield that matches loss is small enough
    • Maximum likelihood,
    • enough dataabd model family is expressive enought solving it will yield parameters
    • Solve with Stochastic gradient descent
    • Equivalent to minimizing KL divergence between the empirical data distribution and the model
      • .