Saturday, January 24, 2015

Machine Learning Course Summary (Part 6)

Summary of Stanford's Machine Learning class by Andrew Ng


  • Part 1
    • Supervised vs. Unsupervised learning, Linear Regression, Logistic Regression, Gradient Descent
  • Part 2
    • Regularization, Neural Networks
  • Part 3
    • Debugging and Diagnostics, Machine Learning System Design
  • Part 4
    • Support Vector Machine, Kernels
  • Part 5
    • K-means algorithm, Principal Component Analysis (PCA) algorithm
  • Part 6
    • Anomaly detection, Multivariate Gaussian distribution
  • Part 7
    • Recommender Systems, Collaborative filtering algorithm, Mean normalization
  • Part 8
    • Stochastic gradient descent, Mini batch gradient descent, Map-reduce and data parallelism

Anomaly detection

  • Examples
    • Fraud Detection
      • x(i) = features of user i's activities.
      • Model p(x) from the data.
      • Identify unusual users by checking which have p(x) < epsilon.
    • Manufacturing
    • Monitoring computers in a data center
  • Algorithm
    • Choose features that you think might be indicative of anomalous examples.
    • Fit parameters μ1, …, μn and σ1², …, σn²
    • Given a new example x, compute p(x)
    • Flag an anomaly if p(x) < epsilon (a minimal sketch follows the figure below)

image
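
As a rough illustration of the algorithm above, here is a minimal NumPy sketch (not from the course materials); X_train, x_new and epsilon are hypothetical placeholders, and epsilon would normally be chosen on a cross-validation set.

  import numpy as np

  def fit_gaussian(X):
      # Estimate mu_j and sigma^2_j for each feature of the (m x n) matrix X.
      return X.mean(axis=0), X.var(axis=0)

  def p(x, mu, sigma2):
      # Product of the per-feature Gaussian densities p(x_j; mu_j, sigma^2_j).
      coeff = 1.0 / np.sqrt(2 * np.pi * sigma2)
      return np.prod(coeff * np.exp(-((x - mu) ** 2) / (2 * sigma2)))

  # Hypothetical data: fit on (mostly normal) examples, then score a new example.
  X_train = np.random.randn(1000, 2)
  mu, sigma2 = fit_gaussian(X_train)
  epsilon = 0.02
  x_new = np.array([4.0, 4.0])
  print("anomaly" if p(x_new, mu, sigma2) < epsilon else "normal")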

  • Aircraft engines example
    • 10000     good (normal) engines
    • 20     flawed engines (anomalous)
    • Alternative 1
      • Training set: 6000 good engines
      • CV: 2000 good engines (y=0), 10 anomalous (y=1)
      • Test: 2000 good engines (y=0), 10 anomalous (y=1)
    • Alternative 2 (not recommended, since it reuses the same examples for CV and test):
      • Training set: 6000 good engines
      • CV: 4000 good engines (y=0), 10 anomalous (y=1)
      • Test: the same 4000 good engines (y=0), 10 anomalous (y=1)
    • Algorithm Evaluation (see the sketch after the figure below)

image
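
Because anomalies (y=1) are so rare, plain classification accuracy is misleading; the evaluation instead computes precision, recall and F1 on a labeled cross-validation set and picks the epsilon that maximizes F1. A minimal sketch under those assumptions (p_cv and y_cv are hypothetical arrays of p(x) values and 0/1 labels for the CV examples):

  import numpy as np

  def f1_score(y_true, y_pred):
      tp = np.sum((y_pred == 1) & (y_true == 1))
      fp = np.sum((y_pred == 1) & (y_true == 0))
      fn = np.sum((y_pred == 0) & (y_true == 1))
      if tp == 0:
          return 0.0
      precision = tp / (tp + fp)
      recall = tp / (tp + fn)
      return 2 * precision * recall / (precision + recall)

  def select_threshold(p_cv, y_cv):
      # Try a range of epsilons and keep the one with the best F1 on the CV set.
      best_eps, best_f1 = 0.0, 0.0
      for eps in np.linspace(p_cv.min(), p_cv.max(), 1000):
          f1 = f1_score(y_cv, (p_cv < eps).astype(int))
          if f1 > best_f1:
              best_eps, best_f1 = eps, f1
      return best_eps, best_f1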

  • Anomaly detection vs. Supervised learning

      Anomaly detection

      • Very small number of positive examples; large number of negative examples.
      • Many different “types” of anomalies. Hard for any algorithm to learn from positive examples what the anomalies look like; future anomalies may look nothing like any of the anomalous examples we’ve seen so far.
      • Examples: fraud detection, manufacturing (e.g. aircraft engines), monitoring machines in a data center.

      Supervised learning

      • Large number of positive and negative examples.
      • Enough positive examples for the algorithm to get a sense of what positive examples are like; future positive examples are likely to be similar to the ones in the training set.
      • Examples: email spam classification, weather prediction (sunny/rainy/etc.), cancer classification.
    • Choose what features to use
      • Plot a histogram of each feature to check that it looks roughly Gaussian; if not, transform it (e.g. log(x)).

    image

      • Original vs. Multivariate Gaussian model (a density sketch follows the figure below)

    image
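
    The multivariate Gaussian model replaces the product of independent per-feature Gaussians with a single density using a full covariance matrix Sigma, so it can capture correlations between features (at higher computational cost, and it needs m > n). A minimal NumPy sketch, not taken from the course code:

      import numpy as np

      def fit_multivariate_gaussian(X):
          # Mean vector and full covariance matrix of the (m x n) data matrix X.
          mu = X.mean(axis=0)
          Xc = X - mu
          Sigma = (Xc.T @ Xc) / X.shape[0]
          return mu, Sigma

      def p_multivariate(x, mu, Sigma):
          n = mu.shape[0]
          diff = x - mu
          norm = 1.0 / (np.power(2 * np.pi, n / 2) * np.sqrt(np.linalg.det(Sigma)))
          return norm * np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff))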

    Tuesday, January 13, 2015

    Installing node-canvas on Win64 and Visual Studio 2013 Update 4

     

    While exploring machine learning using JavaScript I came across this nice video and github project by Heather Arthur.

    I decided to try it out, but I had to resolve quite a few issues. Here is the takeaway from the solution:

    • Install 32bit version of python instead of 64bit.

      • This helps avoid a lot of issues later when installing Python libraries: the installers look for details in the registry to check whether a specific version of Python is installed, and with the 64-bit installation they weren’t able to detect it.

    • node-canvas requires cairo which can be installed on Windows via the GTK+ package.

      • Download the 64-bit 2.22 all-in-one bundle on 64-bit machines, or else the C++ compiler will throw linker errors while building the package.
      • Make sure to follow the instructions in the gtk+-bundle_2.22.1-20101229_win64.README.txt file at the root of the bundle.
      • Copy all the contents of the bundle to the “C:\GTK” folder, as node-gyp expects them there.

    • Use the --msvs_version=2013 switch to install the ‘canvas’ package

      • I got this error while installing the ‘canvas’ npm package from Visual Studio
    C:\Program Files (x86)\MSBuild\Microsoft.Cpp\v4.0\V120\Microsoft.Cpp.Platform.targets(64,5): error MSB8020: The build tools for Visual Studio 2010 (Platform Toolset = 'v100') cannot be found. To build using the v100 build tools, please install Visual Studio 2010 build tools.  Alternatively, you may upgrade to the current Visual Studio tools by selecting the Project menu or right-click the solution, and then selecting "Upgrade Solution...".
      • I had to do 2 things to resolve it:
        • Update node-gyp to the latest version, including the one that is internally used by node
        • Use ‘npm install --msvs_version=2013 canvas’ from the command line to install the package.

    Finally, add the canvas package with its version number to the dependencies section of package.json to make Visual Studio happy…
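
    For reference, the dependencies entry looks roughly like this (the version shown is only a placeholder; pin whichever version you actually installed):

      {
        "dependencies": {
          "canvas": "1.1.6"
        }
      }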

    Sunday, January 11, 2015

    Machine Learning Course Summary (Part 5)

    Summary of Stanford's Machine Learning class by Andrew Ng


    • Part 1
      • Supervised vs. Unsupervised learning, Linear Regression, Logistic Regression, Gradient Descent
    • Part 2
      • Regularization, Neural Networks
    • Part 3
      • Debugging and Diagnostics, Machine Learning System Design
    • Part 4
      • Support Vector Machine, Kernels
    • Part 5
      • K-means algorithm, Principal Component Analysis (PCA) algorithm
    • Part 6
      • Anomaly detection, Multivariate Gaussian distribution
    • Part 7
      • Recommender Systems, Collaborative filtering algorithm, Mean normalization
    • Part 8
      • Stochastic gradient descent, Mini batch gradient descent, Map-reduce and data parallelism

    K-means Algorithm

    • Clustering

    image

    • K-means algorithm
      • Randomly initialize the cluster centroids
      • Assign each point to its closest centroid
      • Move each centroid to the mean of the points assigned to it
      • Repeat until the assignments stop changing

    image

    image

    image

      • *Note – if a centroid has NO POINTS assigned to it, then eliminate that cluster centroid

    image

      • If the random initialization is unlucky, K-means can get stuck in a “local optimum”

    image

      • To deal with this random-initialization problem, run K-means many times (e.g. 100) with different random initializations and pick the clustering with the lowest cost (see the sketch after the figure below).
        • Note:
          • if K is small (between 2 and 10), then multiple random initializations are likely to find a better local optimum
          • if K is large, then multiple random initializations may not help or make a huge difference

    image
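
    A minimal NumPy sketch of K-means with multiple random initializations, along the lines described above (this is not the course's Octave code; X, K and the number of restarts are placeholders):

      import numpy as np

      def kmeans(X, K, iters=100, seed=None):
          rng = np.random.default_rng(seed)
          centroids = X[rng.choice(len(X), K, replace=False)].copy()   # random init from data points
          for _ in range(iters):
              d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)  # (m, K) distances
              labels = d.argmin(axis=1)                                          # assignment step
              for k in range(K):                                                 # move step
                  if np.any(labels == k):   # a centroid with no points keeps its position here
                      centroids[k] = X[labels == k].mean(axis=0)
          d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
          cost = np.mean(np.min(d, axis=1) ** 2)                                 # distortion
          return centroids, d.argmin(axis=1), cost

      def kmeans_best_of(X, K, restarts=100):
          # Run K-means several times and keep the clustering with the lowest cost.
          return min((kmeans(X, K, seed=s) for s in range(restarts)), key=lambda r: r[2])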

    • How do we choose the number of clusters?
      • Elbow method

    image

    image

    Dimensionality Reduction

    • Reduce data from 2D to 1D and 3D to 2D

    image

    image

    image

    image

    image

    Principal Component Analysis (PCA) Algorithm

    • Data Preprocessing

    image

    image

    • Algorithm
      • Reduce data from n dimensions to k dimensions
      • Compute the “covariance matrix” Sigma
      • Compute the “eigenvectors” of Sigma, e.g. via SVD (see the sketch after the figures below)

    image

    image
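
    A minimal NumPy sketch of these steps (mean normalization, covariance matrix, SVD, projection); this is an illustrative reimplementation, not the course's Octave code:

      import numpy as np

      def pca_fit(X):
          # Mean-normalize the (m x n) matrix X, then take the SVD of the covariance matrix.
          mu = X.mean(axis=0)
          Xn = X - mu
          Sigma = (Xn.T @ Xn) / X.shape[0]           # n x n covariance matrix
          U, S, _ = np.linalg.svd(Sigma)             # columns of U are the principal directions
          return mu, U, S

      def pca_project(X, mu, U, k):
          # Reduce X from n to k dimensions.
          return (X - mu) @ U[:, :k]

      def pca_recover(Z, mu, U, k):
          # Map k-dimensional points back to (approximations of) the original n dimensions.
          return Z @ U[:, :k].T + mu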

      • Summary

    image

    • Choosing K (number of principal components)
      • Typically choose k to be the smallest value such that most of the variance is retained
      • Retaining 95% to 99% of the variance is a common choice

    image

      • Increase k and check the retained variance until it reaches 99%; using the SVD output this can be computed without re-running PCA for each k (see the sketch after the figure below).

    image
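
    Using the singular values S returned by the pca_fit sketch above, the smallest k that retains a target fraction of the variance can be read off directly; a short sketch:

      import numpy as np

      def smallest_k(S, target=0.99):
          # Smallest k such that the top-k components retain at least `target` of the variance.
          retained = np.cumsum(S) / np.sum(S)
          return int(np.searchsorted(retained, target) + 1)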

    Applying Principal Component Analysis (PCA)

    • Supervised Learning Speedup
      • Extract the inputs x(i), ignoring the labels for this step
      • Run PCA on the training-set inputs only to learn the mapping
      • Apply that mapping to get a new, lower-dimensional training set (and later apply the same mapping to the CV and test inputs)
      • Train logistic regression or another algorithm on the reduced inputs (see the sketch after the figure below)

    image
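
    A minimal sketch of this speedup recipe using scikit-learn (an illustrative choice, not part of the course); the data and dimensions are placeholders:

      import numpy as np
      from sklearn.decomposition import PCA
      from sklearn.linear_model import LogisticRegression

      # Placeholder data: 1000 labeled examples with 500 features, plus some test inputs.
      X_train, y_train = np.random.randn(1000, 500), np.random.randint(0, 2, 1000)
      X_test = np.random.randn(100, 500)

      pca = PCA(n_components=50).fit(X_train)   # mapping learned from the training inputs only
      Z_train = pca.transform(X_train)          # reduced training inputs
      Z_test = pca.transform(X_test)            # same mapping applied to the test inputs

      clf = LogisticRegression(max_iter=1000).fit(Z_train, y_train)
      print(clf.predict(Z_test[:5]))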

    • Application of PCA
      • Compression
        • Reduce memory/disk needed to store data
        • Speed up learning algorithm
      • Visualization
      • DO NOT USE PCA to prevent overfitting (to reduce the number of features), use regularization instead.
      • Before implementing PCA, first try running whatever you want to do with the original/raw data. Only if that doesn’t do what you want, then implement PCA

    Machine Learning Course Summary (Part 4)

    Summary of Stanford's Machine Learning class by Andrew Ng


    • Part 1
      • Supervised vs. Unsupervised learning, Linear Regression, Logistic Regression, Gradient Descent
    • Part 2
      • Regularization, Neural Networks
    • Part 3
      • Debugging and Diagnostics, Machine Learning System Design
    • Part 4
      • Support Vector Machine, Kernels
    • Part 5
      • K-means algorithm, Principal Component Analysis (PCA) algorithm
    • Part 6
      • Anomaly detection, Multivariate Gaussian distribution
    • Part 7
      • Recommender Systems, Collaborative filtering algorithm, Mean normalization
    • Part 8
      • Stochastic gradient descent, Mini batch gradient descent, Map-reduce and data parallelism

    Support Vector Machines

    • Hypothesis and Decision Boundary 
      • Hypothesis 

    image

      • Decision Boundary

    image

    • Kernels
      • For a non-linear decision boundary we could use higher-order polynomial features, but is there a different/better choice of features?
      • High-order polynomials can be computationally expensive for image-processing-type problems.
      • Given x, compute new features depending on proximity to landmarks.

    image

    image

    image

      • When x is near the landmark l(1) = [3, 5], the feature f1 rises towards 1 (the “bump” in the plot); the sigma² value makes the bump narrower or wider.

    image

      • Predict “1” when theta0 + theta1 f1 + theta2 f2 + theta3 f3  >= 0
      • Let’s assume theta0 = -0.5, theta1 = 1, theta2 = 1, theta3 = 0
      • For x(1),
        • f1 ≈ 1 (as it is close to l(1))
        • f2 ≈ 0 (as it is far from l(2))
        • f3 ≈ 0 (as it is far from l(3))
      • Hypothesis:
        • = theta0 + theta1 * 1 + theta2 * 0 + theta3 * 0
        • = -0.5 + 1
        • = 0.5 (which is >= 0, so we predict 1; see the sketch after the figure below)

    image
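
    A small NumPy sketch of this worked example; the landmark coordinates are placeholders standing in for l(1), l(2), l(3) in the figure, and theta is the vector assumed above:

      import numpy as np

      def gaussian_kernel(x, l, sigma2=1.0):
          # Similarity to landmark l: close to 1 near the landmark, close to 0 far away.
          return np.exp(-np.sum((x - l) ** 2) / (2 * sigma2))

      landmarks = [np.array([3.0, 5.0]), np.array([-2.0, 1.0]), np.array([1.0, -3.0])]
      theta = np.array([-0.5, 1.0, 1.0, 0.0])          # theta0..theta3 from the example

      def predict(x):
          f = np.array([1.0] + [gaussian_kernel(x, l) for l in landmarks])   # f0 = 1
          return int(theta @ f >= 0)                                          # predict 1 if >= 0

      print(predict(np.array([3.0, 5.0])))     # near l(1): f1 ~ 1, f2 ~ f3 ~ 0 -> predicts 1
      print(predict(np.array([10.0, 10.0])))   # far from every landmark -> predicts 0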

      • How do we choose the landmarks?
      • What similarity functions can we use other than the Gaussian kernel?
        • * In practice, put a landmark at every training example, i.e. make l(i) the same as x(i)

    image

      • Kernels do not combine well with logistic regression because of the computational cost
      • SVM implementations with kernels use computational optimizations and run much faster

    image

    image

    Using SVM

      Use software packages (liblinear, libsvm, …) to solve for parameters theta.

      Need to Specify:

      • Choice of parameter C
      • Choice of Kernel (similarity function)
        1. No Kernel (“linear kernel”)
          • Predict “y=1” if theta transpose x > 0
          • No landmarks or features f(i) from x.
          • Choose this when n is large and m is small (the number of features is large and there are relatively few training examples)
        2. Gaussian Kernel
          • Predict “y=1” if theta transpose f > 0
          • Use landmarks
          • Need to choose sigma square
          • Choose this when n is small and m is large. ( very large training set but small number of features)
          • Do perform feature scaling before using the Gaussian kernel
        3. Other choices
          • All similarity functions need to satisfy a technical condition called “Mercer’s Theorem”
          • Polynomial kernel
          • More esoteric:
            1. string kernel
            2. chi-square kernel
            3. histogram intersection kernel

      Many SVM packages already have built-in multi-class classification functionality; if not, use the one-vs-all method. A usage sketch follows.
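
      A minimal sketch of using an off-the-shelf SVM via scikit-learn (which wraps liblinear/libsvm); this is an illustrative choice rather than the course's Octave workflow. In scikit-learn's RBF kernel, gamma plays the role of 1 / (2 · sigma²), and the data here is a toy placeholder:

        import numpy as np
        from sklearn.pipeline import make_pipeline
        from sklearn.preprocessing import StandardScaler
        from sklearn.svm import SVC, LinearSVC

        # Toy data with a non-linear boundary (a circle).
        X = np.random.randn(200, 2)
        y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1).astype(int)

        # 1. No kernel ("linear kernel"): sensible when n is large and m is small.
        linear_svm = LinearSVC(C=1.0).fit(X, y)

        # 2. Gaussian (RBF) kernel: sensible when n is small and m is large;
        #    perform feature scaling before using it.
        rbf_svm = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0, gamma=0.5)).fit(X, y)

        print(rbf_svm.predict(np.array([[0.0, 0.0], [3.0, 3.0]])))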

      • Logistic Regression vs. SVM

      image

       

      Monday, January 05, 2015

      Machine Learning Course Summary (Part 2)

       

      Summary of Stanford's Machine Learning class by Andrew Ng


      • Part 1
        • Supervised vs. Unsupervised learning, Linear Regression, Logistic Regression, Gradient Descent
      • Part 2
        • Regularization, Neural Networks
      • Part 3
        • Debugging and Diagnostics, Machine Learning System Design
      • Part 4
        • Support Vector Machine, Kernels
      • Part 5
        • K-means algorithm, Principal Component Analysis (PCA) algorithm
      • Part 6
        • Anomaly detection, Multivariate Gaussian distribution
      • Part 7
        • Recommender Systems, Collaborative filtering algorithm, Mean normalization
      • Part 8
        • Stochastic gradient descent, Mini batch gradient descent, Map-reduce and data parallelism

      Regularization

      • Problem of Overfitting

      image_thumb84

      image_thumb87

        • Options to address overfitting:
          • Reduce number of features
            • Manually select which features to keep
            • Model selection algorithm (later in course)
          • Regularization
            • Keep all features but reduce magnitude/values of parameters theta.
            • works well when we have lots of features, each of which contributes a bit to predicting y.
        • Too much regularization can “underfit” the training set and this can lead to worse performance even for examples not in the training set.

      • Cost Function
        • if lambda is too large (e.g. 10^10), then the algorithm results in “underfitting” (it fails to fit even the training set); a cost-function sketch follows the figure below

      image_thumb89
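
      For concreteness, a short illustrative NumPy sketch of the regularized linear regression cost; note that the bias parameter theta[0] is not penalized, and a huge lambda pushes all the other parameters toward zero, which is exactly the underfitting case above:

        import numpy as np

        def cost(theta, X, y, lam):
            # Regularized linear regression cost J(theta); theta[0] is not regularized.
            m = len(y)
            errors = X @ theta - y
            return (errors @ errors) / (2 * m) + lam / (2 * m) * np.sum(theta[1:] ** 2)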

      • Regularization with Linear Regression

      image_thumb93

      • Normal Equation

      image_thumb95

      • Regularization with Logistic Regression

      image_thumb97

      Neural Networks

      • Introduction

        • Algorithms that try to mimic the brain. They were very widely used in the 80s and early 90s; their popularity diminished in the late 90s.

        • Route a new sensory signal to almost any brain area and it will learn to deal with it, e.g. the auditory cortex learns to see, the somatosensory cortex learns to see.
          Other examples: seeing with your tongue, human echolocation, a third eye for a frog.

      image_thumb102

      image_thumb100

      • Model Representation

      image_thumb104

      image_thumb106

      • Forward propagation

      image_thumb108
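
      A minimal NumPy sketch of forward propagation through one hidden layer with sigmoid activations (the layer sizes and the weight matrices Theta1, Theta2 are placeholders, named following the course's convention):

        import numpy as np

        def sigmoid(z):
            return 1.0 / (1.0 + np.exp(-z))

        def forward(x, Theta1, Theta2):
            # x: (n,) input; Theta1: (hidden, n+1); Theta2: (K, hidden+1)
            a1 = np.concatenate(([1.0], x))                      # input layer plus bias unit
            a2 = np.concatenate(([1.0], sigmoid(Theta1 @ a1)))   # hidden activations plus bias
            return sigmoid(Theta2 @ a2)                          # output layer: h_theta(x)

        # Placeholder sizes: 3 inputs, 5 hidden units, 4 output classes.
        Theta1 = np.random.randn(5, 4) * 0.1
        Theta2 = np.random.randn(4, 6) * 0.1
        print(forward(np.array([1.0, 2.0, 3.0]), Theta1, Theta2))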

      • Non-linear classification example: XOR/XNOR

      image_thumb110

      • Non-linear classification example: AND

      image_thumb115

      • Non-linear classification example: OR

      image_thumb117

      • XNOR

      image_thumb119

      • Multi-class classification

      image_thumb121

      • Cost Function

      image_thumb123

      image_thumb125

        • As with logistic regression, we DO NOT include the “bias unit” parameters in the regularization term of the neural network cost.
        • Just as in logistic regression, a large value of lambda penalizes large parameter values, thereby reducing the chances of overfitting the training set.

      • Backpropagation algorithm

      image_thumb127

      image_thumb129

      • Unrolling parameters

      image_thumb131

      • Gradient Checking
        • There may be bugs in forward/back propagation algorithms even if the cost function looks correct.
        • Gradient checking helps identify these bugs.

      image_thumb136

        • Implementation Note:
          • Implement backprop to compute DVec (unrolled ).
          • Implement numerical gradient check to compute gradApprox.
          • Make sure they give similar values.
          • Turn off gradient checking; use the backprop code for learning.
        • Be sure to disable your gradient checking code before training your classifier. If you run the numerical gradient computation on
          every iteration of gradient descent (or in the inner loop of costFunction(…)), your code will be very slow. A short sketch of the numerical check follows.
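
        A minimal sketch of the numerical gradient check (not the course's Octave code): approximate each partial derivative with a two-sided difference and compare it against the analytic gradient; the quadratic cost below is just a stand-in for the real cost function.

          import numpy as np

          def numerical_gradient(cost_fn, theta, eps=1e-4):
              # Two-sided finite-difference approximation of the gradient at theta.
              grad = np.zeros_like(theta)
              for i in range(len(theta)):
                  bump = np.zeros_like(theta)
                  bump[i] = eps
                  grad[i] = (cost_fn(theta + bump) - cost_fn(theta - bump)) / (2 * eps)
              return grad

          # Stand-in cost with a known gradient; in practice cost_fn would be the NN cost
          # on the unrolled parameter vector, and grad_exact the backprop gradient (DVec).
          cost_fn = lambda t: np.sum(t ** 2)
          theta = np.array([1.0, -2.0, 3.0])
          grad_approx = numerical_gradient(cost_fn, theta)
          grad_exact = 2 * theta
          print(np.linalg.norm(grad_approx - grad_exact))   # should be very small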

      • Random initialization
        • Initializing theta to 0 works for logistic regression, but it does not work for neural networks.
        • If we initialize theta to 0, then after each update the parameters corresponding to the inputs going into each hidden unit are identical.
        • This causes the “problem of symmetric weights”: every hidden unit computes the same function of the input.
        • To break the symmetry, randomly initialize the theta values (a sketch follows the figure below).

      image_thumb138
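
      A short sketch of symmetry-breaking initialization, drawing each weight uniformly from [-epsilon_init, epsilon_init]; the layer sizes below are placeholders:

        import numpy as np

        def random_init(l_in, l_out, epsilon_init=0.12):
            # Weights in [-epsilon_init, epsilon_init] for a layer with l_in inputs plus a bias.
            return np.random.rand(l_out, l_in + 1) * 2 * epsilon_init - epsilon_init

        Theta1 = random_init(400, 25)   # e.g. 400 inputs -> 25 hidden units
        Theta2 = random_init(25, 10)    # 25 hidden units -> 10 output classes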

      • Training a neural network

      image_thumb144

      image_thumb141

      image_thumb142

       
