JoeKurokawa

Learning ML Part III: Regressions

Regressions

This is Part III of my blog series on machine learning algorithms. The core of my learning is based on Georgia Tech's ML class, which can be found on Udacity: https://www.udacity.com/course/machine-learning--ud262 . It is the same class that students take to earn credit in their online master's program.

Regression, in machine learning terms, is a form of supervised learning. It maps example inputs to outputs and comes up with a generalization that can be used to predict a future outcome. One thing to note about the data is that the output is continuous, not discrete.

The most common type of regression is linear regression. If you remember back to algebra, we first plot all the points on a graph and then find the best-fit line with the old

y = mx + b.
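As a minimal sketch of finding that best-fit line, the snippet below uses NumPy's `np.polyfit` with degree 1. The sample points are made up for illustration, and `predict` is a hypothetical helper name, not something from the course.

```python
import numpy as np

# Made-up sample points that lie roughly on a straight line (illustrative only).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 6.8, 9.1])

# With deg=1, np.polyfit returns the least-squares slope and intercept,
# highest power first: [m, b].
m, b = np.polyfit(x, y, deg=1)


def predict(x_new):
    """Evaluate the fitted line y = m*x + b at x_new."""
    return m * x_new + b


print(f"slope m = {m:.3f}, intercept b = {b:.3f}")
print(f"prediction at x = 5: {predict(5.0):.3f}")
```

Once m and b are known, predicting a future outcome is just plugging a new x into the line.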

From there, you can expand the idea to include polynomial fits (e.g. quadratic, cubic, octic, etc.). Remember from algebra that functions of different degrees exhibit different behaviors.

So let's say we have a set of points. We can inspect those points and eyeball which polynomial function they approximate. If it looks like a parabola, it's probably a degree 2 function; if it looks linear, it's probably a degree 1 function.

Can you generalize this in a mathematical way other than eyeballing the data? Yes, you can use linear algebra and least squares to find the coefficients that minimize the squared error. https://en.wikipedia.org/wiki/Polynomial_regression
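To make the "other than eyeballing" idea concrete, here is a rough sketch that fits several polynomial degrees to the same points and compares the sum of squared errors. The data is synthetic (a noisy parabola I made up), and `sum_squared_error` is a hypothetical helper name.

```python
import numpy as np

# Synthetic data: a parabola plus a little noise (assumed for illustration).
rng = np.random.default_rng(0)
x = np.linspace(-3, 3, 30)
y = 0.5 * x**2 - x + 2 + rng.normal(scale=0.3, size=x.size)


def sum_squared_error(degree):
    """Fit a polynomial of the given degree and return its squared error."""
    coeffs = np.polyfit(x, y, deg=degree)
    residuals = y - np.polyval(coeffs, x)
    return float(np.sum(residuals**2))


errors = {degree: sum_squared_error(degree) for degree in (1, 2, 3)}
for degree, error in errors.items():
    print(f"degree {degree}: sum of squared errors = {error:.3f}")
```

The error drops sharply going from degree 1 to degree 2 and barely changes after that, which matches the parabola shape. Note that the error on the same points you fit to never goes up as the degree increases, so this number alone cannot tell you when to stop adding degrees.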

Given a matrix of inputs X and a vector of outputs y, we can solve for the coefficients B. In matrix form, the model can be written as

y = XB + e

where row i of X is (1, xi, xi^2, ..., xi^m) and e is the error. From the above, if y1...yn is the output and x1...xn is the input, you can come up with the coefficients B0...Bm using the ordinary least squares estimate:

B = (X^T X)^-1 X^T y
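The ordinary least squares estimate, B = (X^T X)^-1 X^T y in matrix notation, can be sketched in NumPy as follows. This is a minimal illustration under my own assumptions: `fit_polynomial_ols` is a hypothetical function name, and the sample data is an exact quadratic chosen so the result is easy to check.

```python
import numpy as np


def fit_polynomial_ols(x, y, degree):
    """Return coefficients B0..Bm of a least-squares polynomial fit."""
    # Design matrix with columns 1, x, x^2, ..., x^degree.
    X = np.vander(x, N=degree + 1, increasing=True)
    # Solve (X^T X) B = X^T y instead of forming the inverse explicitly,
    # which gives the same answer with better numerical behavior.
    return np.linalg.solve(X.T @ X, X.T @ y)


# Points generated from y = 1 + 2x + 3x^2 with no noise (illustrative only).
x = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y = 1.0 + 2.0 * x + 3.0 * x**2

B = fit_polynomial_ols(x, y, degree=2)
print("coefficients B0, B1, B2:", B)
```

For larger or badly scaled problems, `np.linalg.lstsq` is usually a safer choice than the normal equations, but the version above mirrors the formula most directly.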

Cross Validations

The concept of cross validation is to hold out a smaller portion of the data as a test set and use the majority as a training set. The training set is used to come up with the regression model, and the test set is used to validate that model by checking its error. This assumes that the data is independent and identically distributed (all of the data comes from the same source).

You can split your data into four parts. You can use parts 1 through 3 as the training set and part 4 as the test set. Next you can use parts 1, 2, and 4 as the training set and part 3 as the test set, and so on. After using each part as a test set, you can average the errors and, when comparing candidate models, use the one that gives you the least error. It is just a way of checking the error and taking the best model as the one you use.
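The four-part split described above can be sketched by hand with NumPy. One common way to put it into practice, assumed here, is to average the test error over the four splits and compare that average across candidate polynomial degrees. The data is synthetic and `k_fold_mse` is a hypothetical helper name.

```python
import numpy as np


def k_fold_mse(x, y, degree, k=4, seed=0):
    """Average held-out mean squared error of a polynomial fit over k folds."""
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(x))   # shuffle before splitting
    folds = np.array_split(indices, k)  # k roughly equal parts
    fold_errors = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        # Fit on the training parts only, then score on the held-out part.
        coeffs = np.polyfit(x[train_idx], y[train_idx], deg=degree)
        predictions = np.polyval(coeffs, x[test_idx])
        fold_errors.append(np.mean((y[test_idx] - predictions) ** 2))
    return float(np.mean(fold_errors))


# Synthetic noisy parabola (assumed data, for illustration only).
rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 40)
y = 0.5 * x**2 - x + 2 + rng.normal(scale=0.3, size=x.size)

cv_errors = {degree: k_fold_mse(x, y, degree) for degree in (1, 2, 3, 4)}
best_degree = min(cv_errors, key=cv_errors.get)
for degree, error in cv_errors.items():
    print(f"degree {degree}: average test error = {error:.3f}")
print("degree with the lowest average test error:", best_degree)
```

Because every degree is scored on points it was not trained on, a needlessly high degree no longer gets rewarded just for hugging the training data.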

Okay, that's it for this week. Next week I plan on tackling my first project in Machine Learning, where I build a project from the concepts we have learned so far using Scikit Learn!

-Joe Kurokawa