Welcome to the first post in my series on machine learning. Each week I will post what I have learned about machine learning. I have no prior knowledge in this area and will attempt to learn things as I go using online tools. The core of my learning will be based on Georgia Tech's ML class, which can be found on Udacity: https://www.udacity.com/course/machine-learning--ud262 . It is the same class that students take to earn credit in their online master's program. This week, I touch on the different types of machine learning. There are three types of machine learning algorithms:
- Supervised learning
- Unsupervised learning
- Reinforcement learning
Supervised learning is the task of creating a model that maps an input to an output based on example input-output pairs. The model has to infer what comes next according to the training data. For example, say we have the ordered pairs (1,1), (2,4), (3,9). The next logical pair should be (4,16), because these points map to y = x². Given a set of training data, we can use supervised learning to find the relation and then check it against a testing set.
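Here is a minimal sketch of that idea using NumPy. The three training pairs come from the example above; choosing a degree-2 polynomial as the model family is my own assumption, not something the data tells us.

```python
import numpy as np

# Training examples from the text: inputs and their known outputs.
x_train = np.array([1.0, 2.0, 3.0])
y_train = np.array([1.0, 4.0, 9.0])

# Fit a degree-2 polynomial (the degree is an assumption about the model family).
coefficients = np.polyfit(x_train, y_train, deg=2)
model = np.poly1d(coefficients)

# Use the learned mapping on an input that was not in the training data.
prediction = float(model(4.0))
print(round(prediction, 6))
```

The model never sees the pair for x = 4; it only generalizes from the three labeled examples.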
Unsupervised learning is a type of machine learning algorithm used to draw inferences from datasets consisting of input data without labeled responses. There is an input but no specific mapping to an output. Say, for example, you have an amplitude graph of recorded narration. Without knowing what the recording is saying, and just by analyzing the amplitude and wavelength, we can use unsupervised learning to group segments that sound alike, which is a first step toward working out what words are being said. Another example would be image processing. Given a pile of unlabeled images, unsupervised learning can group together the ones with similar arrangements of pixels, so the cow pictures end up in the same cluster. Deciding that a given image actually is a cow requires labeled examples, which makes that step supervised learning.
The third type of machine learning is reinforcement learning. It deals with how software agents ought to take actions in an environment so as to maximize some notion of cumulative reward. For example, let's say we have a robot that can buy and sell stocks. Without knowing any of the rules of the stock market, such as what is a good trading strategy and what is not, the algorithm will trade stocks based on the negative or positive feedback it is given. It is rewarded when it makes money and not rewarded when it loses money. The way the robot goes about doing it is left up to the robot itself, without explicit code telling it what to do. This is a technique used by OpenAI to create an AI that was able to beat the best players in the popular online game Dota 2 within two weeks of learning.
Ensemble learning is a type of supervised learning that combines multiple learning methods to achieve a result. The combined methods can include regression, classification and other functions. This allows the use of multiple types of learning methods on one dataset, and it can get you a better result by averaging over all of the models. The advantage of ensemble learning is that, since you are averaging the models, you get results that are less noisy and have smaller variance, and you avoid some of the overfitting that comes with a single model.
Here is a simple example of an ensemble learning method called bagging:
- Take your training set and split it up into 5 sets of 5 data points.
- Fit a 3rd order polynomial to each of your 5 sets.
- Average the 5 fitted models.
The result is that you get a better result at predicting unseen data than you would by applying a 3rd degree or 4th degree polynomial over the whole set. The graph below shows housing data according to time:
In the graph above, the red X marks show the training data set and the green X shows a test data point. The red line is derived from the result of the bagging algorithm and the blue line is derived from applying a 4th order polynomial over the whole training set. You can see that the red line predicts the test point more closely than the blue line, which indicates that in this situation bagging has a slightly better outcome than applying a single regression over the training set.
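The bagging steps above can be sketched in a few lines of NumPy. The data here is made up (the housing data from the graph is not available), and the function name `bagged_predict` is my own. Note that this follows the post's recipe of splitting 25 points into 5 groups; textbook bagging usually draws each subset at random with replacement instead.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up noisy training data (an assumption; the post's housing data is not shown).
x_train = np.linspace(0.0, 1.0, 25)
y_train = np.sin(2 * np.pi * x_train) + rng.normal(0.0, 0.2, size=x_train.size)

def bagged_predict(x_new, n_models=5, degree=3):
    """Average the predictions of polynomials fit on 5 random groups of 5 points."""
    # Shuffle the indices, then cut them into n_models equal groups.
    groups = np.array_split(rng.permutation(x_train.size), n_models)
    predictions = []
    for idx in groups:
        coeffs = np.polyfit(x_train[idx], y_train[idx], deg=degree)
        predictions.append(np.polyval(coeffs, x_new))
    # The ensemble prediction is the mean of the individual models.
    return np.mean(predictions, axis=0)

x_test = np.array([0.25, 0.5, 0.75])
y_bagged = bagged_predict(x_test)
print(y_bagged)
```

Each individual polynomial is fit on very little data and is quite noisy on its own; the averaging step is what smooths that noise out.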
Learning ML part VIII. Nearest Neighbor Algorithm.
The Nearest Neighbor algorithm is an instance based learning algorithm that can handle both classification and regression. It is one of the simplest algorithms because any data point can be inferred from its nearest neighbor. See the diagram below: Let’s say you have a graph of red and blue points and we want to figure out what the green point is.
We can see that the nearest point to the green circle is red, so logically we could assume green is also red. There is also the K Nearest Neighbors (KNN) method, which takes into account not just the closest point but the k closest points. For example, in the same graph, if we look at the 3 nearest neighbors of the green circle, there are two reds and one blue. What should the green circle be then? If we are doing classification, we implement a voting system (count up the number of blues and reds and take the majority). If we are doing regression, we take the mean of the neighbors' values.
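To make the voting and averaging concrete, here is a small sketch in plain Python. The point coordinates, colors and values are invented for illustration, and the function names are my own.

```python
import math
from collections import Counter

def distance(a, b):
    """Euclidean distance between two 2D points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

def knn_classify(points, query, k):
    """Majority vote among the k stored points closest to the query."""
    neighbors = sorted(points, key=lambda p: distance(p[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

def knn_regress(points, query, k):
    """Mean of the values of the k stored points closest to the query."""
    neighbors = sorted(points, key=lambda p: distance(p[0], query))[:k]
    return sum(value for _, value in neighbors) / k

# Hypothetical labeled points: (position, color).
colored = [((1.0, 1.0), "red"), ((1.5, 2.0), "red"),
           ((3.0, 3.0), "blue"), ((5.0, 5.0), "blue")]
# Hypothetical points with a numeric value attached, e.g. a house price.
priced = [((1.0, 1.0), 100.0), ((1.5, 2.0), 120.0),
          ((3.0, 3.0), 200.0), ((5.0, 5.0), 400.0)]
green = (2.0, 2.0)

label_k1 = knn_classify(colored, green, k=1)
label_k3 = knn_classify(colored, green, k=3)
value_k3 = knn_regress(priced, green, k=3)
print(label_k1, label_k3, value_k3)
```

Notice there is no training step at all: the stored points are the model, and all the work happens when we query.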
So this algorithm is helpful when you are looking at a graph with geographical information, such as finding out what the cost of a house is in a neighborhood, or when mapping a certain classification onto a graph of data. For example, if we want to assign credit scores according to the financial data of people in a database, we can graph them and find their nearest neighbors.
So then, we come to the idea of instance based learning, or lazy learning. K Nearest Neighbors is considered an instance based learning method because learning takes a relatively short amount of time but querying takes a relatively longer amount of time. What is meant by that is KNN does not produce a function to predict outputs like a regression does. That makes KNN extremely cheap to train, because the training data is the model. However, KNN takes longer at query time, since it has to compute the distances to the stored points to find the neighbors, and it has to keep all the data points in memory. The opposite of this is eager learning methods like regression, decision trees and neural networks, which produce a model first.
KNN can also be described as a non-parametric method because it does not produce a function with a fixed set of parameters. It does not make assumptions about the form of the function behind the data; the data is the model.
That’s it for today and the K Nearest Neighbors algorithm.
Learning ML Part VII. Neural Networks
This week we will explore neural networks, a form of supervised learning that takes in multiple inputs and predicts a continuous output. The name comes from biological neurons, which the networks were modeled after.
Much like regression, a neural network attempts to predict a continuous output from given inputs. A basic example of a neural network is a perceptron. It is an algorithm that, based on the inputs combined with a set of weights, outputs 0 or 1.
Basic Perceptron Example
The formula goes like this: given inputs X1..Xn and corresponding weights, if the sum of the products of the weights and inputs is bigger than θ (the threshold), then the result is true, and if not it is false. Perceptron units can be combined to represent any logic function. For example, an AND statement can be written as such:
Say the inputs X1 and X2 can only be 0 or 1. If they are both 1, the output is 1. If they are both 0, the output is 0. If you solve the truth table for every combination, you get AND exactly. If you graph this you get the following:
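Here is a small sketch of a perceptron unit computing AND. The weights (0.5, 0.5) and threshold (0.75) are values I picked so the truth table works out; other choices work too.

```python
def perceptron(inputs, weights, threshold):
    """Output 1 if the weighted sum of the inputs is bigger than the threshold, else 0."""
    activation = sum(w * x for w, x in zip(weights, inputs))
    return 1 if activation > threshold else 0

# Assumed weights and threshold that make the unit behave like AND.
and_weights = (0.5, 0.5)
and_threshold = 0.75

and_table = [perceptron((x1, x2), and_weights, and_threshold)
             for x1 in (0, 1) for x2 in (0, 1)]
print(and_table)
```

Only the input (1, 1) produces a weighted sum of 1.0, which clears the threshold; every other combination sums to 0.5 or less.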
A single perceptron can only represent functions that are linearly separable, meaning there is a line that separates true from false. If the data is not linearly separable, we have to use another method called gradient descent.
This is linearly separable:
This is not:
Gradient descent is used when we model more complicated data that is not linearly separable. It seeks to minimize the error over the examples d in the training set:
Each weight is adjusted in proportion to (yd - w*xd) * xd, summed over the training examples, with yd as the target output, w*xd as the activation (from the perceptron model), and xd as the input. Although the final formula looks a lot like the previous perceptron model, it still works when the data is not linearly separable.
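As a rough sketch of that update rule, here is gradient descent on a tiny made-up dataset where the target is y = 1 + 2x. The learning rate and the number of iterations are my own assumptions, chosen so this small example converges.

```python
import numpy as np

# Made-up data: the first column is a constant bias input, the second is x.
X = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])   # generated from y = 1 + 2x

w = np.zeros(2)          # start with all weights at zero
learning_rate = 0.05     # assumed step size

for _ in range(2000):
    activation = X @ w                  # w * x_d for every example d
    step = (y - activation) @ X         # sum over d of (y_d - a_d) * x_d
    w += learning_rate * step           # move the weights to reduce the error

print(w)
```

The weights end up very close to [1, 2], the coefficients used to generate the data. If the learning rate is set too high the updates overshoot and the weights diverge instead.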
The sigmoid is an S-like function that gets applied to the activation. As the value of the activation (z) increases towards +inf, the sigmoid goes towards 1, and as z goes towards -inf, it goes towards 0. The sigmoid affects the activation of each unit. It acts as a middleware between the input and the output: the input is weighted, then put through the sigmoid function. By using sigmoids we can model more complex functions that are non-linear.
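A quick sketch of the sigmoid and of a single sigmoid unit. The function name `sigmoid_unit` and the example weights are my own, just for illustration.

```python
import math

def sigmoid(z):
    """S-shaped squashing function: maps any real activation into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_unit(inputs, weights):
    """Weight the inputs, sum them, then pass the activation through the sigmoid."""
    activation = sum(w * x for w, x in zip(weights, inputs))
    return sigmoid(activation)

middle = sigmoid(0.0)
high = sigmoid(10.0)
low = sigmoid(-10.0)
unit_output = sigmoid_unit((1.0, 0.0), (2.0, -1.0))   # assumed example weights
print(middle, high, low, unit_output)
```

Unlike the hard 0-or-1 threshold of the perceptron, the output changes smoothly as the activation changes.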
Putting it all Together
So putting it all together, the above diagram shows a neural network. We take in a bunch of inputs and we get an output. Between the inputs and the output we have sigmoid units, which are the circles. Basically, an input is weighted, put through the sigmoid function, and the output is passed to the next level. Each level in between is called a hidden layer because you can’t really see its inputs or outputs. As we go through the neural network, each level passes its outputs to the next level until you get to the end. This brings us to the idea of backpropagation, where the error at the output is passed backward through the network, layer by layer, so that each weight can be adjusted. This is made possible because the sigmoid units are differentiable, meaning we can work out how a small change in each weight changes the output and adjust the weight to better match the output we want. Since the network seeks to minimize error, every time we repeat this our errors get smaller and we get closer to the result we want, and that is where the learning takes place.
Last week we built a linear regression and a polynomial regression from historical Bitcoin prices: http://jkurokawa.com/2018/04/25/learning-ml-part-v-creating-a-model-using-scikit-learn/. This week we will make more models from historical prices of other cryptocurrencies like Litecoin, Ethereum and Ripple. Let's see if we can produce a linear model out of the data first. I am only analyzing this year's prices so as to eliminate some of the volatility experienced in the market last year, which would make it hard to come up with an accurate model.
Like last week, we will loop through the rows of the extracted csv using loadtxt, put the datetimes and corresponding prices into arrays and store them as variables.
And then we take all but the last 10 data points as the training set and the last 10 points as the testing set. We do this so that the model created from the training set can be validated against data it has not seen (a simple holdout check rather than full cross-validation). We print a plot for each of the three coins.
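Here is a minimal sketch of that split and linear fit. It uses NumPy's polyfit so it is self-contained (scikit-learn's LinearRegression would play the same role), and the prices are made up rather than taken from the real csv.

```python
import numpy as np

# Made-up stand-in for one coin: 40 days of prices with a downward trend.
days = np.arange(40, dtype=float)
prices = 100.0 - 0.5 * days + np.sin(days)

# All but the last 10 points for training, the last 10 for testing.
x_train, y_train = days[:-10], prices[:-10]
x_test, y_test = days[-10:], prices[-10:]

# Degree-1 least squares fit, i.e. a straight line.
slope, intercept = np.polyfit(x_train, y_train, deg=1)

# Score the line on the points it has never seen.
predicted = slope * x_test + intercept
mse = float(np.mean((y_test - predicted) ** 2))
print("The coef. is:", slope)
print("The MSE is:", mse)
```

Because prices are a time series, holding out the most recent points (instead of random ones) tests the thing we actually care about: predicting forward in time.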
and the console output is this:
The MSE for LTC is: 145.5845122372679
The coef. for LTC is: [-7.26416342e-06]
The MSE for ETH is: 16752.089816773823
The coef. for ETH is: [-7.52252359e-05]
The MSE for RIP is: 0.12567067740217847
The coef. for RIP is: [-2.76329702e-07]
We see that the error for ETH is high, so the linear model does not predict the price of Ethereum well. The errors for the LTC and RIP models are much smaller, although MSE grows with the square of the price, so part of that gap simply reflects that these coins trade at lower prices.
The above graph depicts just the linear model plotted against the testing data.
We see that the LTC prices do not follow the model very well. Let's try to examine LTC prices using polynomial regression:
Just like last week, we will cycle through the various polynomial degrees and make a model as we go along. At every degree we will print out the MSE and plot the model on our graph.
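The loop over degrees can be sketched like this. The data is synthetic, and I scale x to the range [-1, 1] because raw Unix timestamps raised to high powers produce badly conditioned fits; both of those are my own choices, not what the original script did.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic data; x is scaled to [-1, 1] to keep the polynomial fit well conditioned.
x = np.linspace(-1.0, 1.0, 60)
y = 1.0 + 2.0 * x - 3.0 * x**3 + rng.normal(0.0, 0.1, size=x.size)

mse_by_degree = {}
for degree in range(2, 7):
    coeffs = np.polyfit(x, y, deg=degree)      # least squares fit of this degree
    fitted = np.polyval(coeffs, x)             # model values at the same x
    mse_by_degree[degree] = float(np.mean((y - fitted) ** 2))
    print("the mean square error for degree", degree, "is:", mse_by_degree[degree])
```

One thing to keep in mind: when the MSE is measured on the same points used for fitting, it can only go down as the degree goes up, so a held-out test set is the fairer judge of which degree is best.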
the coefficients for the degree 2 is: [ 1.38845504e+05 -1.74472013e-04 5.47667364e-14]
the mean square error for degree 2 is: 1024.9864388458207
the coefficients for the degree 3 is: [ 3.62199542e+02 4.07698308e-03 -5.36497513e-12 1.76491388e-21]
the mean square error for degree 3 is: 924.1754899788522
the coeficients for the degree 4 is: [ 1.33493162e+03 1.16188108e-30 2.68213895e-12 -3.53062801e-21
the mean square error for degree 4 is: 924.6724109907932
the coeficients for the degree 5 is: [ 6.82213034e+02 1.61019028e-02 -2.10148658e-11 1.38657496e-21
the mean square error for degree 5 is: 921.2089860046844
the coefficients for the degree 6 is: [-6.10565773e-01 -8.30229727e-46 3.78262115e-36 2.90815712e-18
-5.74468520e-27 3.78262283e-36 -8.30229711e-46]
the mean square error for degree 6 is: 515.7131605348862
As you can see, the MSE generally decreases as the degree goes up (degree 4 is marginally higher than degree 3). So as we go higher in polynomial degree, we see a closer and closer fit to the data. The polynomial of degree 6 has the lowest MSE, so it seems to be the best fit.
Congrats! You made Regressions in Sci-Kit Learn!
This week we will develop our model further with regression. The goal of my project is to see if there is a reliable model that can be produced from Bitcoin or any of the altcoins, mainly Ethereum, Ripple, and Litecoin.
We will pick up where we left off in Part IV. After you have imported the data through numpy, we will process the training data.
After importing the data from the csv, we will take the date in each row and convert it to a datetime object. From the datetime object, we will convert it to a Unix timestamp and put it in an array, dates. For prices, we will take the closing value and put it in an array, prices.
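A sketch of that conversion using only the standard library. The two rows and the date format "%Y-%m-%d" are assumptions for illustration; the real rows come from the csv.

```python
import calendar
from datetime import datetime

def to_unix_timestamp(date_text, date_format="%Y-%m-%d"):
    """Parse a date string and return seconds since 1970-01-01 UTC."""
    parsed = datetime.strptime(date_text, date_format)
    # timegm treats the parsed time as UTC, so the result does not
    # depend on the timezone of the machine running the script.
    return calendar.timegm(parsed.timetuple())

# Made-up (date, closing price) rows standing in for the csv data.
rows = [("2018-01-01", "13657.20"), ("2018-01-02", "14982.10")]

dates = [to_unix_timestamp(date_text) for date_text, _ in rows]
prices = [float(close) for _, close in rows]
print(dates, prices)
```

Regression needs numeric inputs, which is why the date strings are turned into timestamps before fitting.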
We will use linear regression first to try it out. You can see a good example here: Linear Reg. Example
In the above image, you can see that a linear regression does not quite fit the training data. Let's try a polynomial regression:
The plot of the polynomial graph is above. The code is as follows:
Here we enumerate over degrees 3 to 6 and plot the results. We see that the degree 6 plot is the closest to the scatter plot, so we will use that. The coefficients are: [-5.40037231e+06, 5.23410271e-01, -1.44786669e-09, 1.51985693e-18, -7.11067134e-28, 1.24872305e-37]
Remember that polynomial regressions are of the form y = b0 + b1*x + b2*x^2 + ... + bn*x^n, where b0 through bn are the coefficients.
So the equation is of the form:
To get the MSE for the degree 6 plot we call:
and we get 1095563.04, which is a significantly high number, so we cannot count on the accuracy of our model. Next week we will see if we can get our model closer, and whether any other digital currencies can be modeled using linear or polynomial regression.
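For reference, here is what the mean squared error computes, written out by hand in NumPy. scikit-learn provides a `mean_squared_error` function in `sklearn.metrics` that does the same job; the numbers below are made up.

```python
import numpy as np

def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    return float(np.mean((actual - predicted) ** 2))

# Made-up actual and predicted prices.
mse = mean_squared_error([3.0, 5.0, 7.0], [2.0, 5.0, 9.0])
rmse = mse ** 0.5   # the square root puts the error back in price units
print(mse, rmse)
```

Since the MSE is in squared units, its square root is easier to interpret. For the value reported above, that is roughly 1,047 in price units.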
You Just Finished your First Model in Sci-Kit Learn!
Welcome back to this week's post in my machine learning series. This week, we will create our first machine learning project using scikit-learn. This post will cover just the setup of the development environment and how to import your first dataset. First, I am working in Ubuntu 16.04 on a PC, but everything can be installed on any OS or computer. In order to install scikit-learn you need Python >= 2.7 or Python >= 3.6. You can look up how to install Python from other resources or the Python homepage. After doing so, follow the next steps:
Step 1: Install Anaconda
You can install scikit-learn using a single command:
pip install -U scikit-learn
However, you will need all the other dependencies such as numpy and scipy. The easiest way to get scikit-learn and all the dependencies is to get Anaconda. Anaconda is a scientific package of libraries and software used by data scientists. It also comes with useful tools like Jupyter Lab and Notebook, so it is worth downloading: https://www.anaconda.com
Step 2: Installing and configuring PyCharm
In order to work in Python, I recommend using an IDE instead of just the terminal. I use PyCharm by JetBrains since the community version is free and it's easy to use. It can be found at https://www.jetbrains.com/pycharm/ . To install it, you can use the installation package or use the command
sudo apt-get install pycharm-community
Once you are done installing, open up the IDE, start a new project and click File -> Settings
In the project sidebar, go to Project Interpreter
Then, in the drop-down at the top, select Show All. This should open up a new window.
Click the green + sign on the right
This should open up a new window. On the left, click Conda Environment. Once you do, make sure the Python version is correct in the dropdown; if not, change it. Click OK, and you should see your Anaconda environment loaded into your interpreter list with your version of Python. For my project I used 2.7.
Now we will add the scikit-learn library to our project.
Select your new Anaconda Python interpreter and click the green + sign
This opens up a new window. Search for scikit-learn. Once you have selected it, click Install Package to install. Once done, go back and repeat the process for the following libraries: numpy, pandas, and scipy
Step 3: Obtaining Your First Dataset
When considering your first project, it can often be difficult to figure out what you will be doing. Go to kaggle.com/datasets to browse and experiment with datasets. Kaggle is a great resource for data scientists in general. I would highly recommend looking at a few different datasets before starting a project. Since the project will take over a month to complete, I highly suggest picking a topic you find interesting in order to keep your motivation alive. I find that you do better on side projects you find interesting than on topics you choose at random. Here are the steps I took to get an idea for starting my project:
- Figure out an interesting result to predict. For example, can you predict the price of a home given its location and square footage? Can you find other indicators, like the number of bedrooms and garages, that correlate positively with price?
- Can you attempt to predict the price of a certain stock?
- Can you calculate the strength of concrete given a dataset of water content related to strength?
- In the case of predicting home prices, we will need a regression with multiple input variables, because the outcome is a continuous price that depends on several variables (a classification algorithm would apply if we were predicting a category instead)
- In the case of stock prices, you can use a simple regression because it is based on just two variables, time and price
For my first Machine Learning project I will use regression to predict the future price of different digital currencies. The dataset I will be using is this: All Cryptocurrency Data
Once you have picked out a good dataset we will import it in python using numpy or pandas.
Step 4: Uploading your Data using Numpy or Pandas
Now that we have all the libraries imported into PyCharm, we can start writing code. You can use the numpy library or pandas.
import numpy by
import numpy as np
or pandas as
import pandas as pd
The first few statements use numpy. Let's deconstruct them:
Using the loadtxt command, open crypto-markets.csv:
data = np.loadtxt("crypto-markets.csv",delimiter=",",dtype=object, skiprows=1, usecols=[2,3,8])
Here is a breakdown of the arguments:
delimiter is set to commas for csv
dtype is set to object because the csv is a mixed datatype of strings and numbers.
skiprows is set to 1 to skip the first row, which is just headers
usecols is set to a list because we only need columns 2, 3, and 8 and do not need the rest of the data
In order to reference this data, we just index data as if it were an array, as such:
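Here is a small sketch of that indexing. To keep it runnable without the real file, it reads a tiny in-memory csv whose column layout and values I made up, and it uses dtype=str instead of dtype=object so the fields come back as plain strings.

```python
import io
import numpy as np

# Made-up sample with 9 columns, standing in for crypto-markets.csv.
sample_csv = io.StringIO(
    "slug,symbol,name,date,ranknow,open,high,low,close\n"
    "bitcoin,BTC,Bitcoin,2018-01-01,1,14112.2,14112.2,13154.7,13657.2\n"
    "bitcoin,BTC,Bitcoin,2018-01-02,1,13625.0,15444.6,13163.6,14982.1\n"
)

data = np.loadtxt(sample_csv, delimiter=",", dtype=str,
                  skiprows=1, usecols=[2, 3, 8])

first_row = data[0]                          # one row: name, date, close
all_dates = data[:, 1]                       # one column across all rows
closing_prices = data[:, 2].astype(float)    # convert the text prices to numbers
print(first_row, all_dates, closing_prices)
```

After usecols is applied the kept columns are renumbered from 0, so column 8 of the file becomes column 2 of data.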
And the code for pandas is similar if you choose to use it over numpy:
data = pd.read_csv("crypto-markets.csv", usecols=[2,3,8])
Great job, you have set up your computer for your project!!
This is part II of my blog on machine learning algorithms. The core of my learning will be based on Georgia Tech's ML class, which can be found on Udacity: https://www.udacity.com/course/machine-learning--ud262 . It is the same class that students take to earn credit in their online master's program.
Regression, in terms of machine learning, is a form of supervised learning. It is used to map example data of inputs and outputs and come up with a generalization that can be used to predict a future outcome. One thing to note is that the output is continuous and not discrete.
The most common type of regression is linear regression. If you remember back to algebra, we will first plot all the points on a graph and find the best fit line with the old
y = mx + b.
From there, you can expand on the idea to include polynomial fits (i.e. squared, cubic, octic, etc.). Remember back to algebra, where functions of different degrees exhibit different behaviors.
So let's say we have a set of points. We can inspect those points and eyeball what polynomial function they approximate. If it looks like a parabola, it's probably a degree 2 function; if it looks linear, it's probably a degree 1 function.
Can you generalize this in a mathematical way other than eyeballing the data? Yes, you can use linear algebra and least squares to find the coefficients that minimize the error. https://en.wikipedia.org/wiki/Polynomial_regression
Given a matrix of inputs X and a vector of outputs y, we can solve for the coefficients B. In matrix form the model can be written as y = XB + e, where e is the error.
From the above, if y1...yn are the outputs and x1...xn are the inputs, you can come up with the coefficients B0...Bm using the ordinary least squares estimate B = (X^T X)^-1 X^T y.
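Here is a sketch of the ordinary least squares estimate in NumPy, using the standard normal equation B = (X^T X)^-1 X^T y. The data points are made up from known coefficients so we can check that the answer comes back right.

```python
import numpy as np

# Made-up data generated from known coefficients: y = 1 + 2x + 0.5x^2.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 1.0 + 2.0 * x + 0.5 * x**2

degree = 2
# Design matrix X: one column per power of x (1, x, x^2).
X = np.vander(x, degree + 1, increasing=True)

# Normal equation, solved as a linear system rather than forming the inverse.
B = np.linalg.solve(X.T @ X, X.T @ y)
print(B)
```

Solving the linear system gives the same answer as computing the inverse explicitly but is more stable numerically.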
The concept of cross validation is using a smaller portion of the data as a test set and the majority as a training set. The training set is used to come up with the regression model, and the test set is used to validate that model by checking the error. This assumes that the data is independent and identically distributed (the data all comes from the same source).
You can split your data into four parts. You can use parts 1 through 3 as the training set and part 4 as the test set. Next you can use parts 1, 2, 4 as the training set and part 3 as the test set, and so on. After each part has been used as the test set, you average the errors and pick the model (for example, the polynomial degree) that gives you the least average error. It is just a way of checking the error and taking the best model as the one you use.
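The four-part procedure above can be sketched as follows. The data is synthetic, the function name `k_fold_mse` is my own, and comparing polynomial degrees 1 to 3 is just one example of "models" to choose between.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data: a straight line plus noise.
x = np.linspace(0.0, 1.0, 40)
y = 1.0 + 2.0 * x + rng.normal(0.0, 0.1, size=x.size)

def k_fold_mse(x, y, degree, k=4):
    """Average test MSE over k folds for a polynomial of the given degree."""
    folds = np.array_split(rng.permutation(len(x)), k)   # shuffle, then cut into k parts
    errors = []
    for i in range(k):
        test_idx = folds[i]                               # one part is the test set
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        coeffs = np.polyfit(x[train_idx], y[train_idx], deg=degree)
        predicted = np.polyval(coeffs, x[test_idx])
        errors.append(np.mean((y[test_idx] - predicted) ** 2))
    return float(np.mean(errors))

scores = {degree: k_fold_mse(x, y, degree) for degree in (1, 2, 3)}
best_degree = min(scores, key=scores.get)
print(scores, best_degree)
```

Every point gets used for testing exactly once, so the averaged error is a steadier estimate than a single train/test split.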
Okay, that's it for this week. Next week I plan on tackling my first project in machine learning, where I build a project from the concepts we have learned so far using scikit-learn!
Machine Learning Part II: Decision trees.
This is part II of my blog on machine learning algorithms. The core of my learning will be based on Georgia Tech's ML class, which can be found on Udacity. It is the same class that students take to earn credit in their online master's program. This week we will cover decision trees and what they are used for. A decision tree is just a series of choices that lead to a certain conclusion. For example, let's say that you are going to play ball if it is sunny outside, it is not muddy, and the humidity is low. However, you will not play ball if any of those conditions is not true. Each conclusion (to play or not to play) is based on a series of true or false conditions. Here is another example below:
The above tree shows a "20 questions" style deduction where each successive question tries to narrow down what the conclusion is. It asks more general questions first before narrowing down to the more specific questions. Would narrowing down from specific to general make more sense? Probably not. If your first question was "is it a pig?", you may have a small chance of getting it right off the bat, but most likely you get a no, and it will tell you nothing about what you can ask next to narrow your guesses down further.
Making Decision Trees
Making a decision tree is fairly simple. "A and B" is a condition, and it is expressed as below:
"A or B" is a condition, and it is expressed as below:
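Since the diagrams are hard to describe in words, here is one way to write the two trees as nested if statements. Each if is a decision node and each return is a leaf; the function names are my own.

```python
def a_and_b(a, b):
    """Decision tree for "A and B": B only matters on the branch where A is true."""
    if a:
        if b:
            return True
        return False
    return False

def a_or_b(a, b):
    """Decision tree for "A or B": B only matters on the branch where A is false."""
    if a:
        return True
    if b:
        return True
    return False

and_results = [a_and_b(a, b) for a in (False, True) for b in (False, True)]
or_results = [a_or_b(a, b) for a in (False, True) for b in (False, True)]
print(and_results, or_results)
```

In both trees the root asks about A, and only one of the two branches ever needs to ask about B.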
Algorithm for making decision trees.
An algorithm for building decision trees is ID3:
- Find the best attribute, A
- Assign A as the decision attribute for the node
- For each value of A, create a descendant of the node
- Sort the training examples to the leaves
- If the examples are perfectly classified, then stop.
- Else iterate over the leaves
Example of ID3
How would you convert a table into a decision tree using ID3? The ID3 algorithm builds decision trees using a top-down greedy approach. The example table below is broken down into attributes and classifications. An attribute is a decision node and a classification is the end outcome of the tree. The attributes are Outlook, Temp, Humidity, and Wind. The classification is whether we play or not. We said that the first step for ID3 is to find the "best" attribute. The best attribute is the one that gives you the most information gain.
Information gain measures how much splitting the data on an attribute reduces the mix of classifications. Low information gain is when the classifications (Yes or No) remain evenly distributed within each group after the split, and high information gain is when the split produces less evenly distributed groups; it is more opinionated. To actually calculate this value we will use entropy.
Entropy is a measure of how mixed a dataset is. For a binary classification, it ranges from 0 to log2(2) = 1, where 0 means all the data in the set has the same classification and 1 means the classifications are split evenly.
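Entropy formula: Entropy(S) = -sum over classifications i of p_i * log2(p_i), where p_i is the proportion of examples in S with classification i.
For the more specific case of a Yes/No classification, the entropy formula becomes: Entropy(S) = -p_yes * log2(p_yes) - p_no * log2(p_no)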
From entropy we can obtain gain, which measures the reduction in entropy that results from partitioning the data on that attribute.
Finally, here is an example of how gain is calculated across all attributes. The attribute with the highest gain will become the root node.
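Here is a small sketch of entropy and gain in plain Python. The four-row table is made up to keep the arithmetic easy to check by hand; it is not the table from the post. Gain is computed as the entropy of the whole set minus the weighted entropy of the groups after the split.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (base 2) of a list of classifications."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def information_gain(rows, labels, attribute):
    """Entropy before the split minus the weighted entropy after it."""
    remainder = 0.0
    for value in set(row[attribute] for row in rows):
        subset = [label for row, label in zip(rows, labels)
                  if row[attribute] == value]
        remainder += (len(subset) / len(labels)) * entropy(subset)
    return entropy(labels) - remainder

# Made-up table: wind decides the outcome, outlook tells us nothing.
rows = [
    {"outlook": "sunny", "wind": "weak"},
    {"outlook": "rain", "wind": "weak"},
    {"outlook": "sunny", "wind": "strong"},
    {"outlook": "rain", "wind": "strong"},
]
labels = ["yes", "yes", "no", "no"]

gains = {name: information_gain(rows, labels, name) for name in ("outlook", "wind")}
root = max(gains, key=gains.get)
print(entropy(labels), gains, root)
```

Splitting on wind leaves two pure groups (gain of 1), while splitting on outlook leaves both groups as mixed as before (gain of 0), so ID3 would pick wind as the root.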
After the root is found the Tree should look like this:
The process should be repeated (i.e. find the gain for each attribute, take the highest, and set it as the root of the subtree), but this time for the subset of the training data that has the outlook of Sunny. That is the basic idea of decision trees and ID3. Look out next week for more machine learning algorithms.