6a Classification
The Classification algorithm is a Supervised Learning technique used to
identify the category of new observations on the basis of training data. In
Classification, a program learns from the given dataset or observations and then
classifies new observations into one of a number of classes or groups, such as Yes or No, 0
or 1, Spam or Not Spam, cat or dog, etc. Classes can also be called targets, labels or
categories.
It takes labeled input data, which means each input comes with the corresponding
output. In a classification algorithm, the input variable (x) is mapped to a discrete
output (y).
Consider the case of a mall owner selling computers. The customers span all ages
and incomes, as shown below. The goal is to predict whether a customer will buy a
computer or not. Each data point has the attributes age and income.
[Figure: customers X1 and X2 plotted in (age, income) space, with validation and test sets held out]
Normalize the values to X1 = <0.15, 0.25>, with label Y1 = o = -1, and
X2 = <0.4, 0.45>, with label Y2 = x = +1.
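As a sketch of how a classifier assigns a label to a new observation, the two normalized customers above can serve as training data for a minimal 1-nearest-neighbour rule (the feature values and labels come from the example; the function name is illustrative):

```python
import math

# Training data from the example: normalized (age, income) pairs,
# label -1 = will not buy (o), label +1 = will buy (x)
train = [((0.15, 0.25), -1),   # X1
         ((0.40, 0.45), +1)]   # X2

def classify(point):
    """Return the label of the nearest training point (1-nearest-neighbour)."""
    _, label = min(train, key=lambda item: math.dist(item[0], point))
    return label

# A new customer close to X2 is predicted to buy:
print(classify((0.38, 0.50)))  # -> 1
```

With only two training points this is the simplest possible classifier, but it already shows the supervised pattern: labeled inputs in, a predicted class for an unseen input out.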
Inductive bias refers to a set of (explicit or implicit) assumptions made by a learning
algorithm in order to perform induction, that is, to generalize a finite set of
observations (training data) into a general model of the domain. The inductive bias
(also known as learning bias) of a learning algorithm is the set of assumptions that
the learner uses to predict outputs for inputs that it has not encountered.
[Figure: error on the test set relative to the target Y]
Applications
Credit card fraud detection – valid transaction or not
Sentiment analysis – opinion mining
Medical diagnosis – risk analysis
Churn prediction – employee loyalty
Common models are based on Artificial Neural Networks (ANN), Support Vector
Machines (SVM), Decision Trees, and Bayesian Networks.
6b Regression (Prediction)
The output is no longer discrete, but continuous. A regression problem is when the
output variable is a real or continuous value, such as “salary” or “weight”.
Example: Temperature at different times of day and night
[Figure: three candidate curves fitted to temperature versus time-of-day data; the most complex curve passes through every point]
The last curve fits all the noise in the data – overfitting, which has to be avoided.
Overfitting occurs when a machine learning model tries to cover all the data
points, or more data points than necessary, in the given dataset. Because of
this, the model starts capturing the noise and inaccurate values present in the dataset,
and these factors reduce the efficiency and accuracy of the model.
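A quick way to see overfitting numerically is to fit the same noisy points with a straight line and with a high-degree polynomial; the data, seed and degrees below are made-up illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.arange(6, dtype=float)                # e.g. times of day
y = 2.0 * x + rng.normal(0, 1.0, size=6)     # roughly linear data plus noise

# A degree-1 fit captures the trend; a degree-5 polynomial through
# 6 points passes through every point, i.e. it memorizes the noise.
lin = np.polyval(np.polyfit(x, y, 1), x)
high = np.polyval(np.polyfit(x, y, 5), x)

print(np.mean((y - lin) ** 2))    # small but nonzero training error
print(np.mean((y - high) ** 2))   # essentially zero training error
```

On fresh data the high-degree fit would do worse than the line; that gap between training error and test error is the signature of overfitting.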
Many different models can be used for regression. The simplest is Linear
Regression, which tries to fit the data with the best hyperplane (a straight line
when there is a single input variable) through the points.
Types of Regression
Linear Regression
Linear regression is one of the easiest and most popular Machine Learning
algorithms. It is a statistical method that is used for predictive analysis. Linear
regression makes predictions for continuous/real or numeric variables such as sales,
salary, age, product price, etc.
The linear regression algorithm models a linear relationship between a dependent
variable (y) and one or more independent variables (x), hence the name. Because
the relationship is linear, the model describes how the value of the dependent
variable changes with the value of the independent variable.
The linear regression model therefore provides a sloped straight line representing
the relationship between the variables.
Mathematically, we can represent linear regression by the (hypothesis) function
y = a0 + a1x, where
y = dependent variable (target variable)
x = independent variable (predictor variable)
a0 = intercept of the line (gives an additional degree of freedom)
a1 = linear regression coefficient (scale factor applied to each input value).
When working with linear regression, our main goal is to find the best fit line, which
means the error between predicted values and actual values should be minimized.
The best fit line will have the least error.
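One common way to obtain the best fit line is the closed-form least-squares solution for a0 and a1; a minimal pure-Python sketch, with made-up data (the function names are illustrative):

```python
def fit_linear(xs, ys):
    """Ordinary least squares for y = a0 + a1*x, minimizing the squared error."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: covariance of x and y divided by variance of x
    a1 = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
         sum((x - mean_x) ** 2 for x in xs)
    a0 = mean_y - a1 * mean_x   # intercept passes through the mean point
    return a0, a1

# On exactly linear data y = 2x, the fit recovers a0 = 0 and a1 = 2:
a0, a1 = fit_linear([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
print(a0, a1)  # -> 0.0 2.0
```

On real, noisy data the recovered line will not pass through every point; it is simply the line with the least total squared error, as described above.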
Different values for the weights or coefficients of the line (a0, a1) give different
lines of regression, so we need to calculate the best values for a0 and a1 to find the
best fit line. To calculate them, we use a cost function.
Cost function-
o The different values for the weights or coefficients of the line (a0, a1) give
different lines of regression, and the cost function is used to estimate the
values of the coefficients for the best fit line.
o Cost function optimizes the regression coefficients or weights. It measures
how a linear regression model is performing.
o We can use the cost function to find the accuracy of the mapping function,
which maps the input variable to the output variable. This mapping function is
also known as Hypothesis function.
For Linear Regression, we use the Mean Squared Error (MSE) cost function, which
is the average of the squared errors between the predicted values and actual
values. It can be written as:
MSE = (1/N) * Σi (yi - (a1xi + a0))^2
Here,
N = total number of observations
yi = actual value
(a1xi + a0) = predicted value
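Instead of the closed form, the MSE can also be minimized iteratively; a minimal gradient-descent sketch (the data, learning rate and iteration count are illustrative choices):

```python
# Gradient descent on MSE = (1/N) * sum (y - (a1*x + a0))^2
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]          # exactly y = 1 + 2x

a0, a1 = 0.0, 0.0                  # start from an arbitrary line
lr = 0.05                          # learning rate (step size)
n = len(xs)
for _ in range(5000):
    # Partial derivatives of the MSE with respect to a0 and a1
    g0 = -2.0 / n * sum(y - (a1 * x + a0) for x, y in zip(xs, ys))
    g1 = -2.0 / n * sum((y - (a1 * x + a0)) * x for x, y in zip(xs, ys))
    a0 -= lr * g0                  # step downhill on the cost surface
    a1 -= lr * g1

print(round(a0, 3), round(a1, 3))  # -> 1.0 2.0
```

Each iteration moves (a0, a1) a small step against the gradient of the cost, so the MSE decreases until the coefficients of the best fit line are recovered.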