
# This Python 3 environment comes with many helpful analytics libraries installed

# It is defined by the kaggle/python docker image: https://github.com/kaggle/docker-python

# For example, here's several helpful packages to load in

import numpy as np # linear algebra

import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the "../input/" directory.

# For example, running this (by clicking run or pressing Shift+Enter) will list the files in the input directory

from subprocess import check_output

print(check_output(["ls", "../input"]).decode("utf8"))

# Any results you write to the current directory are saved as output.

ex1data2.txt

Machine Learning #1: Linear Regression in the beginning

This tutorial series is for absolute beginners in machine learning, and for those who want to review the fundamentals and practice building the algorithms from scratch.

What is Machine Learning?

"Machine Learning is the science (and art) of programming computers so they can learn from data "

Aurelion Geron, 2017

Types of Machine Learning Systems

reference 1: Hands-On Machine Learning with Scikit-Learn and TensorFlow

reference 2: Machine Learning, Stanford University by Andrew Ng


There are many different types of Machine Learning systems, so it is usually best to classify them into broad categories based on:

* Whether or not they are trained with human supervision (supervised, unsupervised, semisupervised, and Reinforcement Learning)

* Whether or not they can learn incrementally on the fly (online versus batch learning)

* Whether they work by simply comparing new data points to known data points, or instead detect patterns in the training data and build a predictive model, much like scientists do (instance-based versus model-based learning)

Let's look at the very first criterion a bit more closely...

Supervised/Unsupervised Learning

Machine Learning systems are usually classified according to the amount and type of supervision
they get during training. There are four major categories: supervised learning, unsupervised learning,
semisupervised learning, and Reinforcement Learning.

Let us tackle Supervised Learning for now

Supervised Learning

In supervised learning, the training data you feed to the algorithm includes the desired solutions, called labels.

A typical supervised learning task is classification. The spam filter is a good example of this: it is trained with many example emails along with their class (spam or ham), and it must learn how to classify new emails.

Another typical task is to predict a target numeric value, such as the price of a car, given a set of features (mileage, age, brand, etc.) called predictors. This sort of task is called regression. To train the system, you need to give it many examples of cars, including both their predictors and their labels (i.e., their prices).

Let's try Regression!


Linear Regression: Univariate

Let's start with a very simple linear regression task using a sample dataset called Portland Housing Prices, wherein we are given some features of a house (e.g. area, no. of rooms) and must predict the target price.

To make things much simpler, let us use only one feature, or in this case one variable; this is also known as univariate linear regression. That is, we are only going to use the 'Area' of a given house to train a linear model.

Let's get the data and examine it!

#importing dependencies

import numpy as np #python library for scientific computing

import pandas as pd #python library for data analysis and dataframes

data = pd.read_csv('../input/ex1data2.txt', header=None)

data.head()

      0  1       2
0  2104  3  399900
1  1600  3  329900
2  2400  3  369000
3  1416  2  232000
4  3000  4  539900

The data itself does not contain feature names or labels, so let's set that up first. According to the source, the first column is the size of the house in sq. ft., followed by the no. of bedrooms, and lastly the price.

data.columns = ['Size', 'Bedroom', 'Price']

data.head()

   Size  Bedroom   Price
0  2104        3  399900
1  1600        3  329900
2  2400        3  369000
3  1416        2  232000
4  3000        4  539900

Let us remove the 'Bedroom' feature since we are doing univariate linear regression.

data.drop('Bedroom', axis=1, inplace=True)

data.head()

   Size   Price
0  2104  399900
1  1600  329900
2  2400  369000
3  1416  232000
4  3000  539900

#data = data.sample(frac=1)

#data.head()

Now that looks much simpler! Let's plot our data to get a sense of how well a linear model could fit it.

# necessary dependencies for plotting

import matplotlib.pyplot as plt #python library for plot and graphs

%matplotlib inline

plt.plot(data.Size, data.Price, 'r.')

plt.show()

From the plot we can see that there is a high correlation between housing area and housing price (obviously), and therefore we could use a line (a linear model) to fit this data.

# another way to test the correlation

data.corr()

          Size     Price
Size   1.000000  0.854988
Price  0.854988  1.000000

Linear Model

The idea of linear regression is to fit a line to a set of points. So let's use the line function given by:

$f(x) = y = mx + b$

where $m$ is the slope and $b$ is our y-intercept, or, in a more general form (multiple variables),

$h(x) = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n$

such that for a single variable, where $n = 1$,

$h(x) = \theta_0 + \theta_1 x_1$

when $x_0 = 1$,

where the $\theta$ values are our parameters (intercept and slope) and $h(x)$ is our hypothesis, or predicted value.
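To make the hypothesis concrete before we wrap it in a class, here is a minimal sketch in plain NumPy; the θ values below are made up purely for illustration:

import numpy as np

def hypothesis(theta, x1):
    # h(x) = theta_0 * x_0 + theta_1 * x_1, with x_0 = 1
    return theta[0] + theta[1] * x1

theta = np.array([2.0, 3.0])  # hypothetical intercept and slope
print(hypothesis(theta, 2))   # 2 + 3*2 = 8.0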

class LinearModel():
    def __init__(self, features, target):
        self.X = features
        self.y = target

    def GradDesc(self, parameters, learningRate, cost):
        self.a = learningRate
        self.c = cost
        self.p = parameters
        return self.a, self.Cost(self.c), self.p

    def Cost(self, c):
        # placeholder returns for now; the real cost functions come later
        if c == 'RMSE':
            return self.y
        elif c == 'MSE':
            return self.X

X = 1
y = 0

a = LinearModel(5, 4)
print(a.GradDesc(2, 0.01, 'MSE'))
print(a.Cost('RMSE'))

(0.01, 5, 2)
4
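Note that the Cost method above is only a stub so far. As a hedged sketch of what an MSE cost could eventually compute, consider the function below; the prediction and target arrays are hypothetical, not taken from the housing data:

import numpy as np

def mse(predictions, targets):
    # mean squared error: the average of the squared residuals
    return np.mean((predictions - targets) ** 2)

print(mse(np.array([8.0, 11.0, 14.0]), np.array([7.0, 12.0, 14.0])))  # 0.666...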

Matrix Math

As it turns out, using matrices and vectors is actually very convenient for these types of problems (stating the obvious). To demonstrate that, let's work through an example:

# given a matrix A (3x2) and a matrix B (2x1)

A = np.array([[1, 2],
              [1, 3],
              [1, 4]])

B = np.array([[2], [3]])

print('A =')

print(A,'\nsize =',A.shape)

print('\nB =')

print(B,'\nsize =',B.shape)

A =
[[1 2]
 [1 3]
 [1 4]]
size = (3, 2)

B =
[[2]
 [3]]
size = (2, 1)
Suppose A is our feature matrix X and B is our parameter matrix θ, that is,

$X = \begin{bmatrix} 1 & 2 \\ 1 & 3 \\ 1 & 4 \end{bmatrix}, \quad \theta = \begin{bmatrix} 2 \\ 3 \end{bmatrix}$

Remember that we have our linear model

$h(x) = \theta_0 x_0 + \theta_1 x_1$

We know that

$X_0 = \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix}, \quad X_1 = \begin{bmatrix} 2 \\ 3 \\ 4 \end{bmatrix}, \quad \theta = \begin{bmatrix} 2 \\ 3 \end{bmatrix}$

then we can actually use the matrix dot product to do the multiplication and the addition at the same time (and faster):

$H = \begin{bmatrix} \theta_0 X_{00} + \theta_1 X_{01} \\ \theta_0 X_{10} + \theta_1 X_{11} \\ \theta_0 X_{20} + \theta_1 X_{21} \end{bmatrix} = \begin{bmatrix} \theta_0 + \theta_1 X_{01} \\ \theta_0 + \theta_1 X_{11} \\ \theta_0 + \theta_1 X_{21} \end{bmatrix} = \begin{bmatrix} 2 + 3(2) \\ 2 + 3(3) \\ 2 + 3(4) \end{bmatrix} = \begin{bmatrix} 8 \\ 11 \\ 14 \end{bmatrix}$

can be as simple as

$H = X \cdot \theta$

Yes, that is the power of Matrices!
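We can verify the worked example directly with NumPy; X and theta below are simply the matrices A and B from the earlier cell, renamed to match the math:

import numpy as np

X = np.array([[1, 2],
              [1, 3],
              [1, 4]])   # feature matrix; the first column is x0 = 1
theta = np.array([[2],
                  [3]])  # parameter vector

H = X.dot(theta)
print(H)
# [[ 8]
#  [11]
#  [14]]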
