(Here are my notes from Week 1 of Stanford's Machine Learning class by Andrew Ng.)
Definition: "Algorithms for inferring unknowns from knowns"
- Database mining
  - studying large data sets such as web-click data, medical records, biology and engineering data
- Applications that can't be programmed by hand
  - autonomous helicopters, handwriting recognition, natural language processing, computer vision
- Self-customizing programs
  - Amazon and Netflix recommendations
- Understanding human learning
- Real artificial intelligence
Tom Mitchell provides a more modern definition: "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."
Example: playing checkers.
- E = the experience of playing many games of checkers
- T = the task of playing checkers
- P = the probability that the program will win the next game
Supervised Learning
In the housing-price example (regression), the size of the house in square feet is a feature.
In the breast-cancer example (classification), tumor size is a feature, but there can be multiple other features like clump thickness, uniformity of cell size, etc.
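For intuition (my own toy illustration; the feature names come from the lecture's breast-cancer example, but the numbers are made up): a training example is just its feature values plus a label.

```python
# One hypothetical training example:
# features x = (tumor size, clump thickness, uniformity of cell size)
# label    y = 1 for malignant, 0 for benign
x = (5.2, 3.0, 1.0)
y = 1
```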
How do we create an algorithm that can deal with an "infinite" number of features?
Unsupervised Learning
- No structure (no labels) is given
- We need to cluster the data; examples:
  - Google News (grouping related news stories)
  - organizing computing clusters
  - social network analysis
  - market segmentation
  - astronomical data analysis
  - gene classification for individuals
- Cocktail party problem: separating the overlapping voices of two speakers talking at the same time
Linear Regression with One Variable (Model Representation)
The hypothesis is the prediction of the value y based on x, given the parameters theta.
How do we choose the thetas?
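Written out (standard course notation), the hypothesis for linear regression with one variable:

```latex
h_\theta(x) = \theta_0 + \theta_1 x
```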
The cost function is also known as the "squared error function".
Minimize J(theta_0, theta_1) so that our predictions are close to the actual y values.
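In symbols, with m training examples (x^(i), y^(i)), the squared-error cost from the lecture:

```latex
J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left( h_\theta(x^{(i)}) - y^{(i)} \right)^2
```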
Gradient Descent Algorithm
- It is important to perform a "simultaneous update" of theta_0 and theta_1 (see the sketch after this list)
- Alpha (α) = the learning rate
  - If alpha is too small, gradient descent can be slow
  - If alpha is too large, gradient descent can overshoot the minimum; it may fail to converge, or even diverge
- There can be multiple "local optima"
- There is only one "global optimum": the minimum over the entire surface
- The linear-regression cost function is "bowl"-shaped (convex), so its only local optimum is the global one
- Gradient descent can converge to a local minimum even with the learning rate α fixed: as we approach the minimum, the derivative term shrinks, so the steps automatically get smaller
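The update rule from the lecture, applied simultaneously for j = 0 and j = 1:

```latex
\theta_j := \theta_j - \alpha \frac{\partial}{\partial \theta_j} J(\theta_0, \theta_1)
```

And a minimal sketch of batch gradient descent for the one-variable model in code. This is my own illustration, not code from the class, and the names (gradient_descent, xs, ys, alpha, n_iters) are made up for the example:

```python
def gradient_descent(xs, ys, alpha=0.01, n_iters=1000):
    """Fit h(x) = theta0 + theta1 * x by minimizing the squared-error cost
    J = (1 / (2m)) * sum((h(x_i) - y_i)^2) with batch gradient descent."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(n_iters):
        # Prediction errors under the current thetas.
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        # Partial derivatives of J with respect to theta0 and theta1.
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Simultaneous update: both gradients were computed from the old
        # thetas before either parameter is overwritten.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1

# Toy data lying exactly on y = 1 + 2x; the result should approach
# theta0 ≈ 1, theta1 ≈ 2.
print(gradient_descent([0, 1, 2, 3], [1, 3, 5, 7], alpha=0.1, n_iters=5000))
```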