Sunday, January 04, 2015

Machine Learning Course Summary (Part 1)

A summary of Stanford's Machine Learning class by Andrew Ng


  • Part 1
    • Supervised vs. Unsupervised learning, Linear Regression, Logistic Regression, Gradient Descent
  • Part 2
    • Regularization, Neural Networks
  • Part 3
    • Debugging and Diagnostic, Machine Learning System Design
  • Part 4
    • Support Vector Machine, Kernels
  • Part 5
    • K-means algorithm, Principal Component Analysis (PCA) algorithm
  • Part 6
    • Anomaly detection, Multivariate Gaussian distribution
  • Part 7
    • Recommender Systems, Collaborative filtering algorithm, Mean normalization
  • Part 8
    • Stochastic gradient descent, Mini batch gradient descent, Map-reduce and data parallelism

Introduction

  • “algorithms for inferring unknowns from knowns”
  • Arthur Samuel was a pioneer of ML; he created the first self-learning program, which played checkers.
  • “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E” – Tom Mitchell
  • Use for Data mining (e.g. web click data, medical records, biology) and applications which can’t be programmed by hand (e.g. autonomous helicopter, handwriting recognition, natural language processing, computer vision…)
  • Supervised Learning
    • The "right answers" (labels) are given in the training data.
    • Regression – Predict continuous output (e.g. house price)
    • Classification – Discrete valued output 0 or 1 (e.g. breast cancer, malignant or benign)
    • Features – E.g. Size of house, No of bedrooms in house, Clump thickness, uniformity of cell size.
  • Unsupervised Learning
    • No labels are given; the algorithm has to find structure (e.g. clusters) in the data.
    • Possible applications:
      • Social network analysis
      • Market segmentation
      • Astronomical data analysis
      • Gene classification
      • Organize computing clusters in data center.
    • Cocktail party problem – separate two speakers' voices (and background noise) from mixed audio recordings.

Supervised Learning

Training Set -> Learning Algorithm -> [ X -> Hypothesis -> Estimated Value (Y) ]

Linear Regression

  • Linear Regression with One Variable (Univariate Linear Regression)
    • Hypothesis: h(x) = theta_0 + theta_1 * x
    • Parameters: theta_0, theta_1
    • Cost function: J(theta_0, theta_1) = (1 / 2m) * sum over i = 1..m of (h(x^(i)) - y^(i))^2
    • Goal: minimize J(theta_0, theta_1) over theta_0, theta_1


  • Gradient Descent Algorithm
    • Use “gradient descent” algorithm to minimize the “cost function” in linear regression model
    • Start with some initial values, e.g. theta_0 = 0, theta_1 = 0.
    • Keep changing theta_0 and theta_1 to reduce the cost function J(theta_0, theta_1) until we hopefully end up at a minimum.

repeat until convergence {
    theta_j := theta_j - alpha * (d/d theta_j) J(theta_0, theta_1)   (simultaneously update j = 0 and j = 1)
}

For linear regression the partial derivatives work out to:

    theta_0 := theta_0 - alpha * (1/m) * sum over i of (h(x^(i)) - y^(i))
    theta_1 := theta_1 - alpha * (1/m) * sum over i of (h(x^(i)) - y^(i)) * x^(i)

    • If α is too small, gradient descent can be slow.
    • If α is too large, gradient descent can overshoot the minimum. It may fail to converge, or even diverge.
    • There can be multiple local optima but only one global optimum.
    • For linear regression the cost function is "bowl" shaped (convex), so gradient descent converges to the global optimum (see the code sketch below).
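
The update rule above translates almost directly into code. A minimal sketch in Python/NumPy (my own code, not the course's Octave; the function name gradient_descent and the toy data are made up):

    import numpy as np

    def gradient_descent(x, y, alpha=0.01, num_iters=1500):
        """Univariate linear regression fitted with batch gradient descent."""
        m = len(y)
        theta0, theta1 = 0.0, 0.0              # start with some initial guess
        for _ in range(num_iters):
            h = theta0 + theta1 * x            # hypothesis for all m examples
            # Compute both gradients before updating (simultaneous update).
            grad0 = (1.0 / m) * np.sum(h - y)
            grad1 = (1.0 / m) * np.sum((h - y) * x)
            theta0 -= alpha * grad0
            theta1 -= alpha * grad1
        return theta0, theta1

    # Toy data: y is roughly 2 + 3x, so theta should approach (2, 3).
    x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
    y = np.array([2.1, 4.9, 8.2, 11.1, 13.9])
    print(gradient_descent(x, y, alpha=0.05, num_iters=5000))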

  • Linear Algebra (101)
    • Matrices and Vectors
    • Addition + Scalar multiplication
    • Matrix-vector multiplication
    • Matrix-Matrix multiplication
    • Properties: Commutative, Associative, Identity Matrix
    • Inverse and Transpose
  • Linear Regression with Multiple Features
    • Hypothesis: h(x) = theta_0 + theta_1 * x_1 + theta_2 * x_2 + ... + theta_n * x_n
    • Convert x into a vector (with x_0 = 1)
    • Convert theta into a vector

    Vectorized form: h(x) = theta' * x (the inner product of theta and x)

  • Gradient Descent for multiple variables

theta_j := theta_j - alpha * (1/m) * sum over i of (h(x^(i)) - y^(i)) * x_j^(i)   (simultaneously update for j = 0..n)

  • Feature Scaling
    • Get every feature into approximately a -1 <= x <= 1 range. A range like -3 to 3 is also acceptable, but much wider ranges are not.
    • Use "mean normalization": x := (x - u) / s, where u is the average value of x in the training set and s is the range (max - min) or the standard deviation (see the sketch below).
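
A minimal sketch of mean normalization in NumPy (my own code; mean_normalize is a hypothetical helper, not from the course):

    import numpy as np

    def mean_normalize(X):
        """Scale each column (feature) of X to roughly the [-1, 1] range."""
        mu = X.mean(axis=0)                    # per-feature average
        s = X.max(axis=0) - X.min(axis=0)      # range; X.std(axis=0) also works
        return (X - mu) / s, mu, s

    # Example: house sizes and number of bedrooms have very different scales.
    X = np.array([[2104.0, 5.0], [1600.0, 3.0], [2400.0, 4.0]])
    X_norm, mu, s = mean_normalize(X)
    print(X_norm)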

  • Learning rate
    • If α is too small: slow convergence.
    • If α is too large: may not decrease on every iteration; may not converge.
    • To choose α, try  0.001, 0.01, 0.1, 1....

  • Features and Polynomial regression
    • Features can be calculated from other features (e.g. area = frontage * depth as a new feature for a house); see the sketch below.

e.g. h(x) = theta_0 + theta_1 * (size) + theta_2 * (size)^2 + theta_3 * (size)^3; with polynomial features, feature scaling becomes especially important because the ranges differ greatly.
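
A small illustrative sketch (my own code; the frontage/depth numbers are made up) of deriving a new feature and building polynomial terms:

    import numpy as np

    frontage = np.array([50.0, 40.0, 60.0])
    depth = np.array([30.0, 35.0, 20.0])

    # A new feature derived from two existing ones.
    size = frontage * depth

    # Polynomial terms of the single "size" feature.
    X_poly = np.column_stack([size, size**2, size**3])

    # The columns now span wildly different ranges, so feature scaling
    # (mean normalization above) matters even more here.
    print(X_poly)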

  • Normal Equation
    • Method to solve for theta analytically

theta = (X' * X)^(-1) * X' * y, where X is the m x (n+1) design matrix and y is the m-vector of targets.

Gradient descent: needs a choice of alpha and many iterations, but works well even when the number of features n is large. Normal equation: no alpha and no iterations, but computing (X'X)^(-1) is roughly O(n^3), so it becomes slow when n is very large.
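
A minimal sketch of the normal equation in NumPy (my own code; the course works in Octave). Using the pseudo-inverse pinv instead of a plain inverse also covers the non-invertible case discussed next:

    import numpy as np

    def normal_equation(X, y):
        """Solve for theta analytically: theta = pinv(X'X) X'y.

        X is the m x (n+1) design matrix with a leading column of ones.
        """
        return np.linalg.pinv(X.T @ X) @ X.T @ y

    # Toy data generated from y = 2 + 3x.
    X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
    y = np.array([2.0, 5.0, 8.0, 11.0])
    print(normal_equation(X, y))               # approximately [2, 3]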

  • Normal Equation: Non-Invertibility
    • X'X may be non-invertible (singular/degenerate). Common causes: redundant, linearly dependent features (e.g. size in feet^2 and size in m^2), or too many features (m <= n) - delete some features or use regularization.
    • In Octave, use pinv (pseudo-inverse) rather than inv; pinv still produces a usable theta even when X'X is non-invertible.

Logistic Regression

  • Classification
    • Examples: Email (spam / not spam), online transactions (fraudulent: yes/no), tumor (malignant / benign)
    • The output y is 0 or 1; the hypothesis outputs an estimate of the probability that y = 1
    • Use Sigmoid function/Logistic Function

h(x) = g(theta' * x), where g(z) = 1 / (1 + e^(-z)) is the sigmoid (logistic) function.

Interpretation: h(x) = estimated probability that y = 1 for input x, parameterized by theta.
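
A tiny sketch of the sigmoid hypothesis in Python (my own code, not from the course):

    import numpy as np

    def sigmoid(z):
        """Logistic function g(z) = 1 / (1 + e^(-z)); maps any real z into (0, 1)."""
        return 1.0 / (1.0 + np.exp(-z))

    def predict_proba(theta, X):
        """h(x) = g(theta' * x): the estimated probability that y = 1."""
        return sigmoid(X @ theta)

    print(sigmoid(np.array([-10.0, 0.0, 10.0])))   # roughly [0, 0.5, 1]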

  • Decision Boundary
    • It is a property of the hypothesis (and its parameters), not of the dataset
    • Linear and Non-Linear boundaries.
    • With higher-order polynomial features it is possible to form much more complex non-linear decision boundaries.

Predict y = 1 when h(x) >= 0.5, i.e. when theta' * x >= 0, and y = 0 otherwise; the set of points where theta' * x = 0 is the decision boundary.


  • Cost Function
    • How do we choose the parameters theta? The squared-error cost used for linear regression is non-convex when combined with the sigmoid hypothesis, so logistic regression uses a different cost:

Cost(h(x), y) = -log(h(x))      if y = 1
Cost(h(x), y) = -log(1 - h(x))  if y = 0

Combined over the training set: J(theta) = -(1/m) * sum over i of [ y^(i) * log(h(x^(i))) + (1 - y^(i)) * log(1 - h(x^(i))) ]
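
A minimal vectorized sketch of this cost and its gradient (my own NumPy code; cost_and_gradient is a hypothetical helper):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def cost_and_gradient(theta, X, y):
        """Logistic regression cost J(theta) and its gradient.

        X is m x (n+1) with a leading column of ones; y holds 0/1 labels.
        """
        m = len(y)
        h = sigmoid(X @ theta)
        J = -(1.0 / m) * np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))
        grad = (1.0 / m) * (X.T @ (h - y))
        return J, grad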

  • Gradient Descent

theta_j := theta_j - alpha * (1/m) * sum over i of (h(x^(i)) - y^(i)) * x_j^(i) - identical in form to the linear regression update, but h(x) is now the sigmoid hypothesis.

  • Advanced Optimizations
    • Algorithms: Conjugate gradient, BFGS, L-BFGS
    • Advantages: no need to manually pick the learning rate alpha; often faster than gradient descent
    • Disadvantages: more complex (see the usage sketch below)
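
As an illustration of handing the cost and gradient to an off-the-shelf optimizer: the course does this with Octave's fminunc; the scipy.optimize.minimize / BFGS combination below is my own substitution, and the toy data is made up:

    import numpy as np
    from scipy.optimize import minimize

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def cost(theta, X, y):
        h = sigmoid(X @ theta)
        return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))

    def grad(theta, X, y):
        return X.T @ (sigmoid(X @ theta) - y) / len(y)

    X = np.array([[1.0, 1.0], [1.0, 2.0], [1.0, 3.0],
                  [1.0, 4.0], [1.0, 5.0], [1.0, 6.0]])
    y = np.array([0.0, 0.0, 1.0, 0.0, 1.0, 1.0])

    # No learning rate has to be chosen; BFGS picks its own step sizes.
    result = minimize(cost, x0=np.zeros(2), args=(X, y), jac=grad, method="BFGS")
    print(result.x)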

  • Multi-class Classification: One-vs-all
    • Examples: Email tagging (work, friends, family, hobby), medical diagnosis (not ill, cold, flu, ...), weather (sunny, cloudy, rain, snow)
    • Train a logistic regression classifier h^(i)(x) for each class i to predict the probability that y = i.
    • On a new input x, pick the class i that maximizes h^(i)(x); a small sketch follows below.

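
A rough one-vs-all sketch in Python (my own code; train_one_vs_all and predict_one_vs_all are hypothetical names, and the inner optimizer is scipy's BFGS rather than anything specified in the course):

    import numpy as np
    from scipy.optimize import minimize

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def train_one_vs_all(X, y, num_classes):
        """Fit one logistic regression classifier per class."""
        n = X.shape[1]
        all_theta = np.zeros((num_classes, n))
        for c in range(num_classes):
            yc = (y == c).astype(float)        # 1 for class c, 0 for everything else
            cost = lambda t, yc=yc: -np.mean(
                yc * np.log(sigmoid(X @ t) + 1e-12)
                + (1 - yc) * np.log(1 - sigmoid(X @ t) + 1e-12))
            all_theta[c] = minimize(cost, np.zeros(n), method="BFGS",
                                    options={"maxiter": 100}).x
        return all_theta

    def predict_one_vs_all(all_theta, X):
        """For each example, pick the class whose classifier reports the highest probability."""
        return np.argmax(sigmoid(X @ all_theta.T), axis=1)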

Regularization

  • Problem of Overfitting

Underfitting (high bias): the hypothesis is too simple and does not even fit the training set well. Overfitting (high variance): the hypothesis (e.g. a high-order polynomial) fits the training set very well but fails to generalize to new examples.

    • Options to address overfitting:
      • Reduce number of features
        • Manually select which features to keep
        • Model selection algorithm (later in course)
      • Regularization
        • Keep all features but reduce magnitude/values of parameters theta.
        • Works well when we have lots of features, each of which contributes a bit to predicting y.
    • Too much regularization can “underfit” the training set and this can lead to worse performance even for examples not in the training set.

  • Cost Function
    • If lambda is too large (e.g. 10^10) then the algorithm underfits (it fails to fit even the training set)

J(theta) = (1/2m) * [ sum over i of (h(x^(i)) - y^(i))^2 + lambda * sum over j = 1..n of theta_j^2 ]   (theta_0 is not regularized)

  • Regularization with Linear Regression

theta_0 := theta_0 - alpha * (1/m) * sum over i of (h(x^(i)) - y^(i))
theta_j := theta_j * (1 - alpha * lambda / m) - alpha * (1/m) * sum over i of (h(x^(i)) - y^(i)) * x_j^(i)   for j = 1..n
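
A minimal sketch of one regularized update step (my own NumPy code; regularized_gradient_step is a hypothetical helper), with the bias term theta_0 left unregularized as in the update rules above:

    import numpy as np

    def regularized_gradient_step(theta, X, y, alpha, lam):
        """One gradient descent step for regularized linear regression."""
        m = len(y)
        h = X @ theta
        grad = (1.0 / m) * (X.T @ (h - y))
        reg = (lam / m) * theta
        reg[0] = 0.0                           # do not regularize theta_0
        return theta - alpha * (grad + reg)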

  • Normal Equation

theta = (X'X + lambda * L)^(-1) * X' * y, where L is the (n+1) x (n+1) identity matrix with its top-left entry set to 0; adding lambda * L also makes the matrix invertible even when X'X is singular.

  • Regularization with Logistic Regression

J(theta) = -(1/m) * sum over i of [ y^(i) * log(h(x^(i))) + (1 - y^(i)) * log(1 - h(x^(i))) ] + (lambda / 2m) * sum over j = 1..n of theta_j^2

Neural Networks

  • Introduction

    • Algorithms that try to mimic the brain. Very widely used in the 80s and early 90s; popularity diminished in the late 90s.

    • The "one learning algorithm" hypothesis: route a sensory signal to a different area of the brain and that area learns to process it. E.g. the auditory cortex can learn to see, and the somatosensory cortex can learn to see.
      Other examples: "seeing" with your tongue, human echolocation, a third eye implanted in a frog.


  • Model Representation

a_i^(j) = "activation" of unit i in layer j; Theta^(j) = matrix of weights controlling the function mapping from layer j to layer j + 1.

If the network has s_j units in layer j and s_(j+1) units in layer j + 1, then Theta^(j) has dimension s_(j+1) x (s_j + 1); the +1 accounts for the bias unit.

  • Forward propagation

a^(1) = x (with bias a_0^(1) = 1); z^(2) = Theta^(1) * a^(1); a^(2) = g(z^(2)) (add bias a_0^(2) = 1); z^(3) = Theta^(2) * a^(2); h(x) = a^(3) = g(z^(3)).
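
A minimal sketch of forward propagation for a 3-layer network in NumPy (my own code; the course uses Octave):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def forward_propagate(x, Theta1, Theta2):
        """Forward propagation: input layer -> one hidden layer -> output layer.

        Theta1 maps layer 1 to layer 2 and Theta2 maps layer 2 to layer 3;
        both include a bias column as their first column.
        """
        a1 = np.concatenate(([1.0], x))                 # add bias unit x_0 = 1
        z2 = Theta1 @ a1
        a2 = np.concatenate(([1.0], sigmoid(z2)))       # add bias unit a_0^(2) = 1
        z3 = Theta2 @ a2
        return sigmoid(z3)                              # a^(3) = h(x)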

  • Non-linear classification example: XOR/XNOR

y = x1 XNOR x2 cannot be separated by a linear decision boundary, so a single logistic unit is not enough; a small network of units can compute it.

  • Non-linear classification example: AND

x1 AND x2 can be computed by a single unit: h(x) = g(-30 + 20*x1 + 20*x2).

  • Non-linear classification example: OR

x1 OR x2 can be computed by a single unit: h(x) = g(-10 + 20*x1 + 20*x2).

  • XNOR

x1 XNOR x2 is built from a small network: hidden unit a_1 computes x1 AND x2 (weights -30, 20, 20), hidden unit a_2 computes (NOT x1) AND (NOT x2) (weights 10, -20, -20), and the output unit computes a_1 OR a_2 (weights -10, 20, 20).
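
A tiny sketch (my own Python code) wiring those units together with the lecture's weights:

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    AND_W = np.array([-30.0, 20.0, 20.0])    # x1 AND x2
    NOR_W = np.array([10.0, -20.0, -20.0])   # (NOT x1) AND (NOT x2)
    OR_W  = np.array([-10.0, 20.0, 20.0])    # x1 OR x2

    def xnor(x1, x2):
        a = np.array([1.0, x1, x2])          # input with bias unit
        h1 = sigmoid(AND_W @ a)              # hidden unit 1
        h2 = sigmoid(NOR_W @ a)              # hidden unit 2
        return sigmoid(OR_W @ np.array([1.0, h1, h2]))   # output: h1 OR h2

    for x1 in (0, 1):
        for x2 in (0, 1):
            print(x1, x2, round(float(xnor(x1, x2))))    # 1 exactly when x1 == x2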

  • Multi-class classification

Use one output unit per class and represent y as a vector, e.g. y = [1; 0; 0; 0] for the first class; the predicted class is the output unit with the largest activation.

  • Cost Function

J(Theta) = -(1/m) * sum over i = 1..m and k = 1..K of [ y_k^(i) * log((h(x^(i)))_k) + (1 - y_k^(i)) * log(1 - (h(x^(i)))_k) ]
           + (lambda / 2m) * sum over all layers l and all non-bias weights of (Theta_ji^(l))^2

    • As in logistic regression, the bias-unit weights are NOT included in the regularization term of the neural network cost function.
    • Just as in logistic regression, a large value of lambda penalizes large parameter values, thereby reducing the chance of overfitting the training set.

  • Backpropagation algorithm

For each training example, after forward propagation compute the "error" terms:

    delta^(3) = a^(3) - y
    delta^(2) = (Theta^(2))' * delta^(3) .* g'(z^(2)),   where g'(z^(2)) = a^(2) .* (1 - a^(2))

Accumulate Delta^(l) := Delta^(l) + delta^(l+1) * (a^(l))' over the training set; then D^(l) = (1/m) * Delta^(l) (plus (lambda/m) * Theta^(l) for the non-bias entries) gives the partial derivatives of J(Theta) with respect to Theta^(l).
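
A vectorized sketch of backpropagation for a 3-layer network (my own NumPy code, processing all m examples at once rather than looping example by example; backprop is a hypothetical function name):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def backprop(Theta1, Theta2, X, Y, lam):
        """Gradients of the regularized cost for a 3-layer network.

        X is m x n (no bias column), Y is m x K one-hot labels,
        Theta1 is s2 x (n+1) and Theta2 is K x (s2+1).
        """
        m = X.shape[0]
        # Forward pass.
        A1 = np.hstack([np.ones((m, 1)), X])                # m x (n+1)
        Z2 = A1 @ Theta1.T                                  # m x s2
        A2 = np.hstack([np.ones((m, 1)), sigmoid(Z2)])      # m x (s2+1)
        A3 = sigmoid(A2 @ Theta2.T)                         # m x K, h(x)

        # Backward pass: "error" terms for the output and hidden layers.
        D3 = A3 - Y                                         # delta^(3)
        D2 = (D3 @ Theta2)[:, 1:] * sigmoid(Z2) * (1 - sigmoid(Z2))  # delta^(2)

        Theta1_grad = D2.T @ A1 / m
        Theta2_grad = D3.T @ A2 / m
        # Regularize everything except the bias columns.
        Theta1_grad[:, 1:] += (lam / m) * Theta1[:, 1:]
        Theta2_grad[:, 1:] += (lam / m) * Theta2[:, 1:]
        return Theta1_grad, Theta2_grad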

  • Unrolling parameters

In Octave: thetaVec = [Theta1(:); Theta2(:); Theta3(:)]; recover the matrices with reshape, e.g. Theta1 = reshape(thetaVec(1:110), 10, 11).

  • Gradient Checking
    • There may be bugs in forward/back propagation algorithms even if the cost function looks correct.
    • Gradient checking helps identify these bugs.

gradApprox(i) = ( J(theta + EPSILON * e_i) - J(theta - EPSILON * e_i) ) / (2 * EPSILON), a two-sided difference for each component of theta, with e.g. EPSILON = 10^-4.

    • Implementation Note:
      • Implement backprop to compute DVec (the unrolled gradients).
      • Implement the numerical gradient check to compute gradApprox.
      • Make sure they give similar values.
      • Turn off gradient checking and use the backprop code for learning.
    • Be sure to disable your gradient checking code before training your classifier. If you run the numerical gradient computation on
      every iteration of gradient descent (or in the inner loop of costFunction(…)), your code will be very slow.
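
A minimal sketch of the numerical check in Python (my own code; numerical_gradient is a hypothetical helper):

    import numpy as np

    def numerical_gradient(cost_fn, theta, eps=1e-4):
        """Two-sided numerical approximation of the gradient of cost_fn at theta."""
        grad_approx = np.zeros_like(theta)
        for i in range(len(theta)):
            plus, minus = theta.copy(), theta.copy()
            plus[i] += eps
            minus[i] -= eps
            grad_approx[i] = (cost_fn(plus) - cost_fn(minus)) / (2 * eps)
        return grad_approx

    # Compare against the analytic gradient from backprop, then turn this OFF:
    # assert np.allclose(numerical_gradient(cost_fn, theta_vec), d_vec, atol=1e-6)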

  • Random initialization
    • Initializing theta to 0 works for logistic regression, but it does not work for neural networks.
    • If we initialize all theta values to 0, then after each update the parameters corresponding to the inputs going into each of the hidden units remain identical to each other.
    • This is the "problem of symmetric weights".
    • To solve this, randomly initialize the theta values.

Initialize each Theta_ij^(l) to a random value in [-epsilon, epsilon] for some small epsilon.
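
A minimal sketch in NumPy (my own code; the value epsilon = 0.12 is just an illustrative small constant):

    import numpy as np

    def random_init(rows, cols, epsilon=0.12):
        """Weight matrix with entries drawn uniformly from [-epsilon, epsilon].

        This breaks the symmetry that all-zero initialization would create.
        """
        return np.random.uniform(-epsilon, epsilon, size=(rows, cols))

    Theta1 = random_init(25, 401)    # e.g. 25 hidden units, 400 inputs + bias
    Theta2 = random_init(10, 26)     # e.g. 10 output classes, 25 hidden units + bias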

  • Training a neural network

First pick a network architecture: the number of input units (the dimension of x), the number of output units (the number of classes), and the number of hidden layers and hidden units (a reasonable default is a single hidden layer; with more than one hidden layer, use the same number of units in each).

Then:
  1. Randomly initialize the weights.
  2. Implement forward propagation to get h(x^(i)) for any x^(i).
  3. Implement code to compute the cost function J(Theta).
  4. Implement backpropagation to compute the partial derivatives of J(Theta).
  5. Use gradient checking to compare the backprop gradients with a numerical estimate, then disable gradient checking.
  6. Use gradient descent or an advanced optimization method with backpropagation to try to minimize J(Theta) as a function of Theta.

 
