Welcome back to this week's installment of our Machine Learning series. This week, we will start our first machine learning project using scikit-learn. This post covers only setting up the development environment and loading your first dataset. I am working in Ubuntu 16.04 on a PC, but everything here can be installed on any major operating system. To install scikit-learn you need Python >= 2.7 or Python >= 3.6. You can find instructions for installing Python on the Python homepage or in other resources. Once Python is installed, follow the steps below:
Step 1: Install Anaconda
You can install scikit-learn using a single command:
pip install -U scikit-learn
However, scikit-learn also depends on other libraries such as numpy and scipy. The easiest way to get scikit-learn and all of its dependencies is to install Anaconda. Anaconda is a distribution of scientific libraries and software used by data scientists. It also comes with useful tools like JupyterLab and Jupyter Notebook, so it is worth downloading: https://www.anaconda.com
Step 2: Installing and configuring PyCharm
To work in Python, I recommend using an IDE instead of just the terminal. I use PyCharm by JetBrains because the Community edition is free and easy to use. It can be found at https://www.jetbrains.com/pycharm/. To install it, you can use the installation package from that page or, if the package is available in your configured repositories, run this command:
sudo apt-get install pycharm-community
Once you are done installing, open the IDE, start a new project, and click File -> Settings.
In the project sidebar, go to Project Interpreter.
Then, in the drop-down at the top, select Show All. This should open a new window.
Click the green + sign on the right.
This should open another window. On the left, click Conda Environment, then make sure the Python version in the drop-down is correct and change it if it is not. Click OK, and you should see your Anaconda environment loaded into your interpreter list with your version of Python. For my project I used 2.7.
Now we will add the scikit-learn library to our project.
Select your new Anaconda Python interpreter and click the green + sign.
This opens a new window where you can search for scikit-learn. Once you have selected it, click Install Package. When it finishes, go back and repeat the process for the following libraries: numpy, pandas, and scipy.
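Once the packages are installed, a quick way to confirm that the interpreter can see them is to try importing each one. The following is a minimal sketch, not a feature of PyCharm or Anaconda. The helper name check_packages is hypothetical, and it assumes each library exposes a __version__ attribute (it reports "unknown" when one is missing).

```python
from __future__ import print_function

import importlib
import sys


def check_packages(names):
    """Return {import_name: version string, or None if the import fails}."""
    found = {}
    for name in names:
        try:
            module = importlib.import_module(name)
        except ImportError:
            found[name] = None
        else:
            found[name] = getattr(module, "__version__", "unknown")
    return found


if __name__ == "__main__":
    print("Interpreter:", sys.executable)
    print("Python version:", sys.version.split()[0])
    # scikit-learn is imported under the name "sklearn".
    report = check_packages(["sklearn", "numpy", "pandas", "scipy"])
    for name, version in report.items():
        print(name, "->", version if version is not None else "NOT INSTALLED")
```

If any line reports NOT INSTALLED, the selected interpreter is probably not the Anaconda environment you configured, or the package install did not finish.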
Step 3: Obtaining Your First Dataset
When considering your first project, it can often be difficult to figure out what you will be doing. Go to kaggle.com/datasets to browse and experiment with datasets. Kaggle is a great resource for data scientists in general. I would highly recommend looking at a few different datasets before starting a project. Since the project will take over a month to complete, I highly suggest picking a topic you find interesting in order to keep your motivation alive. I find that I do better work on side projects I find interesting than on topics chosen at random. Here are the steps I took to come up with an idea for my project:
- Figure out an interesting result to predict. For example, can you predict the price of a home given its location and square footage? Can you find other indicators, like the number of bedrooms and garages, that correlate positively with price?
- Can you attempt to predict the price of a certain stock?
- Can you calculate the strength of concrete given a dataset relating water content to strength?
- In the case of predicting home prices, the outcome is a continuous number that depends on multiple variables, so we need a regression algorithm with several input features (classification is for predicting categories, not prices).
- In the case of stock prices, you can use regression with just two variables: time and price.
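To make the idea of regression on time and price concrete, here is a minimal sketch that fits a straight line to (day, price) pairs with ordinary least squares in plain Python. The numbers are made up for illustration and the function name fit_line is hypothetical. For real work, scikit-learn provides ready-made regression models.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up example: a price observed on days 0 through 4.
days = [0, 1, 2, 3, 4]
prices = [10.0, 12.0, 14.0, 16.0, 18.0]

slope, intercept = fit_line(days, prices)

# Use the fitted line to estimate the price on a day we have not seen.
predicted_day_5 = slope * 5 + intercept
print(slope, intercept, predicted_day_5)
```

The sample points lie exactly on the line price = 2 * day + 10, so the fit recovers that slope and intercept. Real price data is noisy, and a straight line is only a first approximation.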
For my first Machine Learning project I will use regression to predict the future price of different digital currencies. The dataset I will be using is this: All Cryptocurrency Data
Once you have picked out a good dataset, we will load it in Python using numpy or pandas.
Step 4: Loading Your Data Using Numpy or Pandas
Now that we have all the libraries installed in PyCharm, we can start writing code. You can use either the numpy library or pandas.
Import numpy with
import numpy as np
or pandas with
import pandas as pd
The first few statements use numpy, so let's deconstruct them:
Using the loadtxt function, open crypto-markets.csv:
data = np.loadtxt("crypto-markets.csv",delimiter=",",dtype=object, skiprows=1, usecols=[2,3,8])
Here is a breakdown of the arguments:
delimiter is set to a comma because the file is a CSV
dtype is set to object because the CSV contains a mix of strings and numbers
skiprows is set to 1 to skip the first row, which contains only headers
usecols is set to a list because we only need columns 2, 3, and 8 (counting from 0) and can ignore the rest of the data
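To see what these arguments do without downloading anything, the same call can be run against a small in-memory CSV. This is a minimal sketch: the sample rows and the column names (slug, symbol, name, and so on) are assumptions made for illustration, not values taken from the actual dataset.

```python
import io

import numpy as np

# Made-up sample: a header row followed by two data rows with nine columns.
sample_csv = u"""slug,symbol,name,date,ranknow,open,high,low,close
coin-a,AAA,Coin A,2018-01-01,1,100.0,110.0,95.0,100.5
coin-a,AAA,Coin A,2018-01-02,1,100.5,120.0,99.0,115.25
"""

data = np.loadtxt(
    io.StringIO(sample_csv),  # a file-like object stands in for the real file
    delimiter=",",            # fields are separated by commas
    dtype=object,             # keep every field as a generic Python object
    skiprows=1,               # skip the header row
    usecols=[2, 3, 8],        # keep only columns 2, 3, and 8 (zero-based)
)

print(data.shape)  # (rows, kept columns)
print(data)
```

In this sample layout, columns 2, 3, and 8 are the name, the date, and the closing price. Check the header of your own file before choosing indices.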
To reference this data, we simply index data as we would any array.
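A minimal sketch of that kind of indexing follows. The array here is built by hand from made-up rows so the example runs on its own, and the three columns (name, date, closing price) are an assumption about what was kept.

```python
import numpy as np

# Made-up rows shaped like a loaded result: name, date, closing price.
data = np.array(
    [
        ["Coin A", "2018-01-01", "100.5"],
        ["Coin A", "2018-01-02", "115.25"],
        ["Coin B", "2018-01-01", "2.0"],
    ],
    dtype=object,
)

first_row = data[0]      # one row with all three kept fields
first_name = data[0, 0]  # a single cell (row 0, column 0)
all_dates = data[:, 1]   # one column across every row

# In this sample the prices are stored as text, so convert before doing math.
closes = data[:, 2].astype(float)

print(first_row)
print(first_name)
print(all_dates)
print(closes.mean())
```

Converting the price column to float is what makes numeric operations such as mean() possible. The object array itself only stores generic Python values.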
The code for pandas is similar if you choose to use it instead of numpy:
data = pd.read_csv("crypto-markets.csv", usecols=[2,3,8])
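As a rough illustration of what the pandas version returns, here is a minimal sketch run against a small in-memory CSV. The sample rows and column names are made up for illustration. Unlike loadtxt, read_csv treats the first row as the header by default and infers a type for each column, so no skiprows or dtype argument is needed here.

```python
import io

import pandas as pd

# Made-up sample: a header row followed by two data rows with nine columns.
sample_csv = u"""slug,symbol,name,date,ranknow,open,high,low,close
coin-a,AAA,Coin A,2018-01-01,1,100.0,110.0,95.0,100.5
coin-a,AAA,Coin A,2018-01-02,1,100.5,120.0,99.0,115.25
"""

# usecols takes the same zero-based column positions as in the numpy call.
data = pd.read_csv(io.StringIO(sample_csv), usecols=[2, 3, 8])

print(data.head())            # first rows of the table
print(data.dtypes)            # the type pandas inferred for each column
print(data["close"].mean())   # columns are addressed by header name
```

Because the numeric column is parsed as numbers, you can compute on it directly without a separate conversion step.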