
MACHINE LEARNING

PROGRAM: B.TECH (CSE-DATA SCIENCE)

SEM-V

TAKEN BY: PROF. SHWETA LOONKAR

shwetaloonkar@gmail.com
Syllabus

Unit 1 – Introduction: What is Machine Learning. Supervised Learning. Unsupervised Learning. (2 hours)

Unit 2 – Linear Model Selection and Regularization: Linear regression. Hypothesis representation. Gradient descent. Cost function. Linear regression with multiple variables. Normal Equation. Polynomial regression. Logistic regression. Hypothesis representation. Gradient descent. Cost function. Regularization. (8 hours)

Unit 3 – Moving Beyond Linearity: Neural networks. Hypothesis representation. Cost function. Back propagation. Activation function. (5 hours)

Unit 4 – Machine Learning System Design: Evaluating hypothesis. Train – Validation – Test. Bias and variance curves. Error analysis. Error metrics for skewed classes. Precision and recall tradeoff. (2 hours)

Unit 5 – Tree-Based Methods: The Basics of Decision Trees, Regression Trees, Classification Trees, Trees Versus Linear Models, Advantages and Disadvantages of Trees, Bagging, Random Forests, Boosting. (4 hours)

Unit 6 – Support Vector Machines: Maximal Margin Classifier, Support Vector Classifiers, Support Vector Machines, SVMs with More than Two Classes, Relationship to Logistic Regression, ROC Curves, Application to Gene Expression Data. (4 hours)

Unit 7 – Unsupervised Learning: The Challenge of Unsupervised Learning, Principal Components Analysis, Clustering Methods, K-Means Clustering, Hierarchical Clustering, Anomaly detection and large-scale machine learning. (5 hours)

Total: 30 hours
Teaching and Evaluation Scheme

Program: B. Tech. CSDS          Semester: II
Course/Module: Machine Learning          Module Code:

Teaching Scheme: Lecture 2 hours per week, Practical 2 hours per week, Tutorial 0 hours per week, Credit 3.

Evaluation Scheme:
• Internal Continuous Assessment (ICA) (Marks - 50): marks scaled to 50.
• Term End Examinations (TEE) (Marks - 100 in question paper): marks scaled to 50.
What is Learning??
• Learning is a process that improves
the knowledge of an AI program by
making observations about its
environment.
• To understand the different types of AI
learning models, we can use two of the
main elements of human learning
processes:
• Knowledge: from the knowledge perspective, learning models can be classified based on the
representation of input and output data points.
• Feedback: AI learning models can be classified based on the interactions with the outside
environment, users and other external factors.

Difference Between AI, ML and DL


Existence of AI, ML and Deep Learning
What is Machine Learning??
Machine learning (ML) is a branch of artificial intelligence (AI) that enables computers to “self-
learn” from training data and improve over time, without being explicitly programmed. Machine
learning algorithms are able to detect patterns in data and learn from them, in order to make
their own predictions.
State of the Art Applications for ML
What are the steps involved in building Machine
Learning models?
Any machine learning model development can broadly be divided into six
steps:
•Problem definition involves converting a Business Problem to a
machine learning problem
•Hypothesis generation is the process of creating a possible
business hypothesis and potential features for the model
•Data Collection requires you to collect the data for testing your
hypothesis and building the model
•Data Exploration and cleaning helps you remove outliers, missing
values and then transform the data into the required format
•Modeling is where you actually build the machine learning
models
•Once built, you will deploy the models
Supervised Learning
• Supervised learning, as the name indicates, has the presence of a supervisor as a teacher.
• Basically, supervised learning is when we teach or train the machine using data that is well labeled.
• This means some data is already tagged with the correct answer.
• After that, the machine is provided with a new set of examples (data) so that the supervised learning algorithm
analyses the training data (set of training examples) and produces a correct outcome from the labeled data.
Supervised Learning Process: Two Steps

 Learning (training): learn a model using the training data.

 Testing: test the model using unseen test data to assess the model accuracy:

Accuracy = (Number of correct classifications) / (Total number of test cases)
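To make these two steps concrete, here is a minimal sketch using scikit-learn; the library, dataset, and classifier are illustrative assumptions rather than anything prescribed in these notes.

```python
# Minimal learn-then-test sketch (scikit-learn, iris data and a decision tree are
# illustrative choices, not prescribed by the course).
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Step 1 (learning/training): fit a model on the training data only.
model = DecisionTreeClassifier().fit(X_train, y_train)

# Step 2 (testing): evaluate on unseen test data.
y_pred = model.predict(X_test)
print("Accuracy =", accuracy_score(y_test, y_pred))  # correct classifications / total test cases
```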
What do we mean by Learning?
• Given

• a data set D,

• a task T, and

• a performance measure M,

a computer system is said to learn from D to perform the task T if after learning the system’s
performance on T improves as measured by M.

• In other words, the learned model helps the system to perform T better as compared to no
learning.
An Example
• Data: Loan application data
• Task: Predict whether a loan should be approved or not.
• Performance measure: accuracy.

No learning: classify all future applications (test data) to the majority class (i.e., Yes):
Accuracy = 9/15 = 60%.
• We can do better than 60% with learning.
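As a hedged sketch of this "no learning" baseline, the toy labels below mirror the 9 Yes / 6 No split from the example; the use of scikit-learn's DummyClassifier is an illustrative choice.

```python
# "No learning" baseline: always predict the majority class (toy data mirroring 9 Yes / 6 No).
import numpy as np
from sklearn.dummy import DummyClassifier

y = np.array(["Yes"] * 9 + ["No"] * 6)   # 15 loan applications, 9 approved
X = np.zeros((15, 1))                    # features are irrelevant to this baseline

baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
print(baseline.score(X, y))              # 9/15 = 0.6, i.e. 60% accuracy
```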
Fundamental Assumption of Learning
Assumption: The distribution of training examples is identical to the distribution of test
examples (including future unseen examples).

• In practice, this assumption is often violated to a certain degree.


• Strong violations will clearly result in poor classification accuracy.
• To achieve good accuracy on the test data, training examples must be sufficiently
representative of the test data.
Steps Involved in Supervised Learning:
•First, determine the type of training dataset.
•Collect/gather the labelled training data.
•Split the data into a training set, a validation set, and a test set.
•Determine the input features of the training dataset, which should carry enough information for the
model to accurately predict the output.
•Determine a suitable algorithm for the model, such as a support vector machine or a decision tree.
•Execute the algorithm on the training dataset. Sometimes a validation set, a subset of the training
data, is needed to tune the control parameters.
•Evaluate the accuracy of the model on the test set; if the model predicts the correct outputs, the
model is accurate.
Types of Supervised Learning
• Supervised learning can be further divided into two types of problems: classification and regression (both are discussed in detail below).
Unsupervised Learning
• Unsupervised learning is the training of a machine using information that is neither classified
nor labeled.
• It allows the algorithm to act on that information without guidance.
• Here the task of the machine is to group unsorted information according to similarities, patterns,
and differences without any prior training of data.
• Unlike supervised learning, no teacher is provided, which means no correct answers are given to the
machine. Therefore the machine is restricted to finding the hidden structure in unlabeled data by
itself.
Reinforcement Learning
• Reinforcement learning is an area of Machine Learning.
• It is about taking suitable action to maximize reward in a particular situation.
• It is employed by various software and machines to find the best possible behavior or path it
should take in a specific situation.
• Reinforcement learning differs from supervised learning in that, in supervised learning, the training
data comes with the answer key, so the model is trained with the correct answer itself.
• In reinforcement learning, there is no answer key; the reinforcement agent decides what to do to
perform the given task. In the absence of a training dataset, it is bound to learn from its
experience.
Terminologies Used in Reinforcement Learning

Agent – the sole decision-maker and learner.
Environment – the physical world in which the agent learns and decides which actions to perform.
Action – the set of actions the agent can perform.
State – the current situation of the agent in the environment.
Reward – for each action selected by the agent, the environment gives a reward. It is usually a scalar
value and is nothing but feedback from the environment.
Policy – the strategy (decision-making rule) the agent uses to map situations (states) to actions.
Value Function – the value of a state is the reward that can be accumulated starting from that state
while following the policy.
Model – not every RL agent uses a model of its environment; when one is used, it maps state–action
pairs to probability distributions over next states.

A minimal sketch putting these terms together is given after the workflow below.
Reinforcement Learning Workflow

– Create the environment
– Define the reward
– Create the agent
– Train and validate the agent
– Deploy the policy
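To tie these terms and the workflow together, here is a very small Q-learning sketch on a one-dimensional corridor; the environment, reward, and hyperparameters are illustrative assumptions and not part of the course material.

```python
# Tiny Q-learning sketch: agent, environment (a 5-state corridor), actions, states,
# rewards, a Q-table standing in for the value function, and a greedy policy at the end.
# All numbers here are illustrative assumptions.
import random

N_STATES, GOAL = 5, 4
ACTIONS = [-1, +1]                         # move left / move right
Q = [[0.0, 0.0] for _ in range(N_STATES)]  # Q[state][action]
alpha, gamma, epsilon = 0.1, 0.9, 0.2      # learning rate, discount, exploration rate

for episode in range(500):
    state = 0
    while state != GOAL:
        # The agent picks an action (epsilon-greedy with respect to the current Q-values).
        a = random.randrange(2) if random.random() < epsilon else Q[state].index(max(Q[state]))
        # The environment returns the next state and a scalar reward.
        next_state = min(max(state + ACTIONS[a], 0), N_STATES - 1)
        reward = 1.0 if next_state == GOAL else 0.0
        # Value update: move Q(s, a) toward reward + gamma * max_a' Q(s', a').
        Q[state][a] += alpha * (reward + gamma * max(Q[next_state]) - Q[state][a])
        state = next_state

# Deploy the policy: read off the greedy action in each state.
policy = ["right" if q[1] >= q[0] else "left" for q in Q]
print(policy)   # should be "right" in every non-goal state
```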
Semi Supervised Learning
• Where an incomplete training signal is given: a training set with some (often many) of the
target outputs missing.
• There is a special case of this principle known as Transduction where the entire set of problem
instances is known at learning time, except that part of the targets are missing.
• Semi-supervised learning is an approach to machine learning that combines a small amount of labeled
data with a large amount of unlabeled data during training. Semi-supervised learning falls
between unsupervised learning and supervised learning.
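A short sketch of this idea, assuming scikit-learn's SelfTrainingClassifier (unlabeled samples are marked with -1); the dataset and the fraction of hidden labels are illustrative.

```python
# Semi-supervised sketch: a few labelled points plus many unlabelled ones (label -1).
import numpy as np
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.semi_supervised import SelfTrainingClassifier
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
rng = np.random.RandomState(0)
y_partial = y.copy()
y_partial[rng.rand(len(y)) < 0.7] = -1          # hide roughly 70% of the labels

model = SelfTrainingClassifier(SVC(probability=True))
model.fit(X, y_partial)                         # the wrapper pseudo-labels the unlabelled points
print(accuracy_score(y, model.predict(X)))      # checked against the full labels
```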
What are some of the latest achievements and
developments in machine learning?
Some of the latest achievements of machine learning include:
•Winning DOTA2 against the professional players (OpenAI’s development)
• Beating Lee Sedol at the traditional game of Go (Google DeepMind’s algorithm)
• Google saving up to 40% of electricity in its data centers by using Machine Learning
• Writing entire essays and poetry, and creating movies from scratch using Natural Language
Processing (NLP) techniques (Multiple breakthroughs, the latest being OpenAI’s GPT-2)
• Creating and generating images and videos from scratch (this is both incredibly creative and
worryingly accurate)
• Building automated machine learning models. This is revolutionizing the field by expanding the
circle of people who can work with machine learning to include non-technical folks as well
• Building machine learning models in the browser itself! (A Google creation – TensorFlow.js)
What are some of the Challenges in the
adoption of Machine Learning?
While machine learning has made tremendous progress in the last few years, there are some big challenges that
still need to be solved. It is an area of active research and I expect a lot of effort to solve these problems in the
coming time.
Huge data required: It takes a huge amount of data to train a model today. For example – if you want to
classify Cats vs. Dogs based on images (and you don’t use an existing model) – you would need the model to be
trained on thousands of images. Compare that to a human – we typically explain the difference between Cat and
Dog to a child by using 2 or 3 photos
High compute required: As of now, machine learning and deep learning models require huge computations
to achieve simple tasks (simple according to humans). This is why the use of special hardware including GPUs
and TPUs is required. The cost of computations needs to come down for machine learning to make a next-level
impact
Interpretation of models is difficult at times: Some modeling techniques can give us high accuracy but
are difficult to explain. This can leave the business owners frustrated. Imagine being a bank, but you cannot tell
why you declined a loan for a customer!
New and better algorithms required: Researchers are consistently looking out for new and better
algorithms to address some of the problems mentioned above
More Data Scientists needed: Further, since the domain has grown so quickly – there aren’t many people
with the skill sets required to solve the vast variety of problems. This is expected to remain so for the next few
years. So, if you are thinking about building a career in machine learning – you are in a good position!
Types of Learning
For instance, suppose you are given a basket filled with different kinds of fruits. Now the first
step is to train the machine with all the different fruits one by one like this:

• If the shape of the object is rounded and has a depression at the top, is red in color, then it
will be labeled as –Apple.
• If the shape of the object is a long curving cylinder having Green-Yellow color, then it will
be labeled as –Banana.
Now suppose that, after training, the machine is given a new, separate fruit, say a banana from the
basket, and is asked to identify it.

Since the machine has already learned from the previous data, it now has to use that knowledge wisely.
It will first classify the fruit by its shape and colour, confirm the fruit name as BANANA, and put it
in the Banana category. Thus the machine learns from the training data (the basket containing fruits)
and then applies that knowledge to the test data (the new fruit).
Supervised learning is classified into two categories of algorithms:
• Classification: A classification problem is when the output variable is a category, such as
“Red” or “blue” , “disease” or “no disease”.
• Regression: A regression problem is when the output variable is a real value, such as
“dollars” or “weight”.
Supervised learning deals with or learns with “labeled” data. This implies that some data is
already tagged with the correct answer.
Steps

How Supervised Learning Works

In supervised learning, models are trained using a labelled dataset, where the model learns about
each type of data. Once the training process is completed, the model is tested on held-out test
data (data not used during training), and then it predicts the output.

The working of Supervised learning can be easily understood by the below example and
diagram:

Suppose we have a dataset of different types of shapes which includes square, rectangle, triangle,
and Polygon. Now the first step is that we need to train the model for each shape.

o If the given shape has four sides, and all the sides are equal, then it will be labelled as
a Square.
o If the given shape has three sides, then it will be labelled as a triangle.
o If the given shape has six equal sides, then it will be labelled as hexagon.

Now, after training, we test our model using the test set, and the task of the model is to identify
the shape.
The machine is already trained on all types of shapes, and when it finds a new shape, it classifies
the shape on the basis of its number of sides, and predicts the output.

1. Regression

Regression algorithms are used if there is a relationship between the input variable and the
output variable. It is used for the prediction of continuous variables, such as Weather forecasting,
Market Trends, etc. Below are some popular Regression algorithms which come under
supervised learning:

o Linear Regression
o Regression Trees
o Non-Linear Regression
o Bayesian Linear Regression
o Polynomial Regression

2. Classification

Classification algorithms are used when the output variable is categorical, which means there are
two (or more) classes such as Yes-No, Male-Female, True-False, etc. A typical example is spam
filtering. Popular classification algorithms which come under supervised learning include:

o Random Forest
o Decision Trees
o Logistic Regression
o Support vector Machines
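Since spam filtering and logistic regression are both mentioned above, here is a small sketch in that spirit; the four messages are invented purely for illustration and scikit-learn is an assumed library choice.

```python
# Toy text classification in the spirit of spam filtering (invented messages, illustrative only).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

messages = ["win a free prize now", "lowest price guaranteed",
            "meeting moved to 3pm", "see you at lunch tomorrow"]
labels = ["spam", "spam", "ham", "ham"]

# Bag-of-words features followed by a logistic regression classifier.
clf = make_pipeline(CountVectorizer(), LogisticRegression())
clf.fit(messages, labels)
print(clf.predict(["claim your free prize"]))   # likely ['spam']
```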

Advantages of Supervised learning:

o With the help of supervised learning, the model can predict the output on the basis of
prior experiences.
o In supervised learning, we can have an exact idea about the classes of objects.
o Supervised learning model helps us to solve various real-world problems such as fraud
detection, spam filtering, etc.
o Supervised learning allows us to collect data and produce outputs based on previous
experience.
o Helps to optimize performance criteria with the help of experience.
o Supervised machine learning helps to solve various types of real-world computation
problems.

Disadvantages of supervised learning:

o Supervised learning models are not suitable for handling complex tasks.
o Supervised learning cannot predict the correct output if the test data is different from the
training dataset.
o Training requires a lot of computation time.
o In supervised learning, we need enough knowledge about the classes of objects.
o Classifying big data can be challenging.

Unsupervised

For instance, suppose the machine is given an image containing both dogs and cats that it has never seen before.

The machine has no idea about the features of dogs and cats, so it cannot categorize the image as
"dogs and cats". But it can categorize the pictures according to their similarities, patterns, and
differences: it can easily divide the picture collection into two parts, where the first part may
contain all the pictures having dogs in them and the second part may contain all the pictures having
cats in them. Here the machine has not learned anything beforehand, which means there is no training
data or labelled examples.
It allows the model to work on its own to discover patterns and information that were
previously undetected. It mainly deals with unlabelled data.
Unsupervised learning is classified into two categories of algorithms:
• Clustering: A clustering problem is where you want to discover the inherent groupings in
the data, such as grouping customers by purchasing behavior.
• Association: An association rule learning problem is where you want to discover rules that
describe large portions of your data, such as people that buy X also tend to buy Y.
Types of Unsupervised Learning:
Clustering
1. Exclusive (partitioning)
2. Agglomerative
3. Overlapping
4. Probabilistic
Clustering Types:
1. Hierarchical clustering
2. K-means clustering
3. Principal Component Analysis
4. Singular Value Decomposition
5. Independent Component Analysis
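As a concrete illustration of clustering, here is a minimal k-means sketch; scikit-learn and the synthetic blob data are assumed purely for demonstration.

```python
# Minimal k-means clustering sketch on synthetic, unlabelled data.
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans

X, _ = make_blobs(n_samples=300, centers=3, random_state=42)   # labels are ignored
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X)

print(kmeans.cluster_centers_)   # discovered group centres
print(kmeans.labels_[:10])       # cluster assigned to the first few points
```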
Supervised vs. Unsupervised Machine Learning

• Input Data: Supervised – algorithms are trained using labeled data. Unsupervised – algorithms are used against data that is not labeled.
• Computational Complexity: Supervised – simpler method. Unsupervised – computationally complex.
• Accuracy: Supervised – highly accurate. Unsupervised – less accurate.
• No. of classes: Supervised – number of classes is known. Unsupervised – number of classes is not known.
• Data Analysis: Supervised – uses offline analysis. Unsupervised – uses real-time analysis of data.
• Algorithms used: Supervised – Linear and Logistic Regression, Random Forest, Support Vector Machine, Neural Network, etc. Unsupervised – K-Means clustering, Hierarchical clustering, Apriori algorithm, etc.

Unit-2 Linear Regression Numericals
Linear regression is the most basic and commonly used type of predictive analysis. One variable is considered to
be an explanatory variable, and the other is considered to be a dependent variable. For example, a modeler
might want to relate the weights of individuals to their heights using a linear regression model.
There are several types of regression analysis available to the researcher.
Simple linear regression

• One dependent variable (interval or ratio)


• One independent variable (interval or ratio or dichotomous)
Multiple linear regression

• One dependent variable (interval or ratio)


• Two or more independent variables (interval or ratio or dichotomous)
Logistic regression

• One dependent variable (binary)


• Two or more independent variable(s) (interval or ratio or dichotomous)
Ordinal regression

• One dependent variable (ordinal)


• One or more independent variable(s) (nominal or dichotomous)
Multinomial regression

• One dependent variable (nominal)


• One or more independent variable(s) (interval or ratio or dichotomous)
Discriminant analysis

• One dependent variable (nominal)


• One or more independent variable(s) (interval or ratio)
The formula for the linear regression equation is given by:

y = a + bx

where a and b are given by the following formulas:

b (slope) = [n∑xy − (∑x)(∑y)] / [n∑x² − (∑x)²]

a (intercept) = [(∑y)(∑x²) − (∑x)(∑xy)] / [n∑x² − (∑x)²]

Where,
x and y are the two variables on the regression line.
b = slope of the line.
a = y-intercept of the line.
x = values of the first data set.
y = values of the second data set.
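The two formulas can be applied directly in code; the plain-Python helper below is only an illustrative sketch and uses the data of the solved example that follows.

```python
# Compute intercept a and slope b of y = a + bx from the summation formulas above.
def linear_regression(xs, ys):
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)       # slope
    a = (sum_y * sum_x2 - sum_x * sum_xy) / (n * sum_x2 - sum_x ** 2)  # intercept
    return a, b

a, b = linear_regression([2, 4, 6, 8], [3, 7, 5, 10])
print(a, b)   # 1.5 and 0.95, matching the solved example below
```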

Solved Examples
Question: Find linear regression equation for the following two sets of data:

x 2 4 6 8

y 3 7 5 10
Solution:
Construct the following table:

x      y      x²     xy
2      3      4      6
4      7      16     28
6      5      36     30
8      10     64     80
∑x = 20   ∑y = 25   ∑x² = 120   ∑xy = 144

b = [n∑xy − (∑x)(∑y)] / [n∑x² − (∑x)²]
  = (4 × 144 − 20 × 25) / (4 × 120 − 20²)
  = 76 / 80
b = 0.95

a = [(∑y)(∑x²) − (∑x)(∑xy)] / [n∑x² − (∑x)²]
  = (25 × 120 − 20 × 144) / (4 × 120 − 20²)
  = 120 / 80
a = 1.5
Linear regression is given by:
y = a + bx
y = 1.5 + 0.95 x
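If NumPy is available, the same result can be cross-checked with a library fit; this is only a verification sketch.

```python
# Cross-check of the solved example with NumPy's least-squares polynomial fit.
import numpy as np

x = np.array([2, 4, 6, 8])
y = np.array([3, 7, 5, 10])
slope, intercept = np.polyfit(x, y, 1)   # degree-1 (straight line) fit
print(slope, intercept)                  # approximately 0.95 and 1.5, i.e. y = 1.5 + 0.95x
```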
Linear Regression
Problems with Solutions

Linear regression and modelling problems are presented below, along with their solutions. A linear
regression calculator and grapher may also be used to check answers and create more opportunities
for practice.

Review
If the plot of n pairs of data (x , y) for an experiment appears to indicate a "linear relationship" between y
and x, then the method of least squares may be used to write a linear relationship between x and y.
The least squares regression line is the line that minimizes the sum of the squares (d1² + d2² + d3² + d4²) of
the vertical deviations from each data point to the line (see the figure below as an example of 4 points).

Figure 1. Linear regression where the sum of the squared vertical distances d1, d2, d3, d4 between observed and
predicted (line and its equation) values is minimized.

The least squares regression line for the set of n data points is given by the equation of a line in
slope-intercept form:

y = a x + b

where a and b are given by

a = (nΣxy - ΣxΣy) / (nΣx² - (Σx)²)

b = (1/n)(Σy - aΣx)

Figure 2. Formulas for the constants a and b included in the linear regression.

• Problem 1

Consider the following set of points: {(-2 , -1) , (1 , 1) , (3 , 2)}


a) Find the least square regression line for the given data points.
b) Plot the given points and the regression line in the same rectangular system of axes.

• Problem 2

a) Find the least square regression line for the following set of data

{(-1 , 0),(0 , 2),(1 , 4),(2 , 5)}

b) Plot the given points and the regression line in the same rectangular system of axes.

• Problem 3

The values of x and their corresponding values of y are shown in the table below

x 0 1 2 3 4

y 2 3 5 4 6

a) Find the least square regression line y = a x + b.


b) Estimate the value of y when x = 10.

• Problem 4

The sales of a company (in million dollars) for each year are shown in the table below.

x (year) 2005 2006 2007 2008 2009


y (sales) 12 19 29 37 45

a) Find the least square regression line y = a x + b.

b) Use the least squares regression line as a model to estimate the sales of the company in 2012.

Solutions to the Above Problems

Problem 1

a) Let us organize the data in a table.

x      y      xy     x²
-2     -1     2      4
1      1      1      1
3      2      6      9
Σx = 2   Σy = 2   Σxy = 9   Σx² = 14

We now use the above formulas to calculate a and b as follows:

a = (nΣxy - ΣxΣy) / (nΣx² - (Σx)²) = (3*9 - 2*2) / (3*14 - 2²) = 23/38

b = (1/n)(Σy - aΣx) = (1/3)(2 - (23/38)*2) = 5/19

b) We now graph the regression line given by y = a x + b and the given points.

Figure 3. Graph of linear regression in problem 1.

Problem 2

a) We use a table as follows:

x      y      xy     x²
-1     0      0      1
0      2      0      0
1      4      4      1
2      5      10     4
Σx = 2   Σy = 11   Σxy = 14   Σx² = 6

We now use the above formulas to calculate a and b as follows:

a = (nΣxy - ΣxΣy) / (nΣx² - (Σx)²) = (4*14 - 2*11) / (4*6 - 2²) = 17/10 = 1.7

b = (1/n)(Σy - aΣx) = (1/4)(11 - 1.7*2) = 1.9

b) We now graph the regression line given by y = ax + b and the given points.

Figure 4. Graph of linear regression in problem 2.

Problem 3

a) We use a table to calculate a and b.

x      y      xy     x²
0      2      0      0
1      3      3      1
2      5      10     4
3      4      12     9
4      6      24     16
Σx = 10   Σy = 20   Σxy = 49   Σx² = 30

We now calculate a and b using the least square regression formulas for a and b.

a = (nΣxy - ΣxΣy) / (nΣx² - (Σx)²) = (5*49 - 10*20) / (5*30 - 10²) = 0.9

b = (1/n)(Σy - aΣx) = (1/5)(20 - 0.9*10) = 2.2

b) Now that we have the least square regression line y = 0.9 x + 2.2, substitute x by 10 to find the
value of the corresponding y.
y = 0.9 * 10 + 2.2 = 11.2

Problem 4

a) We first change the variable x into t such that t = x - 2005, so that t represents the
number of years after 2005. Using t instead of x makes the numbers smaller and therefore more
manageable. The table of values becomes:

t (years after 2005)    0    1    2    3    4

y (sales)               12   19   29   37   45

We now use the table to calculate a and b included in the least squares regression line formula.

t      y      ty     t²
0      12     0      0
1      19     19     1
2      29     58     4
3      37     111    9
4      45     180    16
Σt = 10   Σy = 142   Σty = 368   Σt² = 30

We now calculate a and b using the least square regression formulas for a and b.

a = (nΣty - ΣtΣy) / (nΣt² - (Σt)²) = (5*368 - 10*142) / (5*30 - 10²) = 8.4

b = (1/n)(Σy - aΣt) = (1/5)(142 - 8.4*10) = 11.6

b) In 2012, t = 2012 - 2005 = 7


The estimated sales in 2012 are: y = 8.4 * 7 + 11.6 = 70.4 million dollars.
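The same change of variable can be reproduced in code; the sketch below assumes NumPy and simply re-derives a, b, and the 2012 estimate from the table.

```python
# Problem 4 revisited in code, using the shift t = x - 2005 to keep the numbers small.
import numpy as np

years = np.array([2005, 2006, 2007, 2008, 2009])
sales = np.array([12, 19, 29, 37, 45])

t = years - 2005
a, b = np.polyfit(t, sales, 1)        # slope a and intercept b of y = a t + b
print(a, b)                           # approximately 8.4 and 11.6
print(a * (2012 - 2005) + b)          # estimated 2012 sales, approximately 70.4
```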

Example 9.9

Calculate the regression coefficient and obtain the lines of regression for the following data

Solution:

Regression coefficient of X on Y
(i) Regression equation of X on Y

(ii) Regression coefficient of Y on X

(iii) Regression equation of Y on X


Y = 0.929X–3.716+11

= 0.929X+7.284

The regression equation of Y on X is Y= 0.929X + 7.284

Example 9.10

Calculate the two regression equations of X on Y and Y on X from the data given below, taking deviations
from the actual means of X and Y.

Estimate the likely demand when the price is Rs.20.

Solution:

Calculation of Regression equation

(i) Regression equation of X on Y


(ii) Regression Equation of Y on X

When X is 20, Y will be

= –0.25 (20)+44.25

= –5+44.25

= 39.25 (when the price is Rs. 20, the likely demand is 39.25)

Example 9.11

Obtain regression equation of Y on X and estimate Y when X=55 from the following

Solution:
(i) Regression coefficients of Y on X
(ii) Regression equation of Y on X

Y – 51.57 = 0.942(X – 48.29)

Y = 0.942X – 45.49 + 51.57

Y = 0.942X + 6.08

The regression equation of Y on X is Y = 0.942X + 6.08.

Estimation of Y when X = 55:

Y = 0.942(55) + 6.08 = 57.89

Example 9.12

Find the means of X and Y variables and the coefficient of correlation between them from the following
two regression equations:

2Y–X–50 = 0

3Y–2X–10 = 0.

Solution:

We are given

2Y–X–50 = 0 ... (1)

3Y–2X–10 = 0 ... (2)

Solving equation (1) and (2)

We get Y = 90

Putting the value of Y in equation (1)

We get X = 130

Calculating the correlation coefficient:

Let us assume equation (1) to be the regression equation of Y on X:

2Y = X + 50, i.e. Y = 0.5X + 25, so byx = 0.5

Then equation (2) is the regression equation of X on Y:

2X = 3Y – 10, i.e. X = 1.5Y – 5, so bxy = 1.5

Since byx × bxy = 0.5 × 1.5 = 0.75 ≤ 1, the assumption is valid, and

r = √(byx × bxy) = √0.75 ≈ 0.866 (positive, because both regression coefficients are positive).
Example 9.13

Find the means of X and Y variables and the coefficient of correlation between them from the following
two regression equations:

4X–5Y+33 = 0

20X–9Y–107 = 0

Solution:

We are given

4X–5Y+33 = 0 ... (1)

20X–9Y–107 =0 ... (2)

Solving equation (1) and (2)

We get Y = 17

Putting the value of Y in equation (1)

We get X = 13

Calculating the correlation coefficient:

Let us assume equation (1) to be the regression equation of X on Y:

4X = 5Y – 33, i.e. X = 1.25Y – 8.25, so bxy = 1.25

Let us assume equation (2) to be the regression equation of Y on X:

9Y = 20X – 107, i.e. Y = (20/9)X – (107/9), so byx = 20/9 ≈ 2.22

But this is not possible, because both regression coefficients are greater than one, so their
product exceeds 1 (whereas bxy × byx = r² ≤ 1).

So our above assumption is wrong. Therefore, treating equation (1) as the regression equation of Y on X
and equation (2) as the regression equation of X on Y, we get:

5Y = 4X + 33, i.e. Y = 0.8X + 6.6, so byx = 0.8

20X = 9Y + 107, i.e. X = 0.45Y + 5.35, so bxy = 0.45

r = √(byx × bxy) = √(0.8 × 0.45) = √0.36 = 0.6

Hence the means are X̄ = 13 and Ȳ = 17, and the coefficient of correlation is r = 0.6.

Example 9.16

For 5 pairs of observations the following results are obtained: ∑X = 15, ∑Y = 25, ∑X² = 55, ∑Y² = 135,
∑XY = 83. Find the equations of the lines of regression and estimate the value of X on the first line
when Y = 12, and the value of Y on the second line when X = 8.

Solution:

X̄ = ∑X/n = 15/5 = 3 and Ȳ = ∑Y/n = 25/5 = 5.

Regression coefficient of Y on X:
byx = (n∑XY – ∑X∑Y) / (n∑X² – (∑X)²) = (5×83 – 15×25) / (5×55 – 15²) = 40/50 = 0.8

Regression line of Y on X:
Y – 5 = 0.8(X – 3)
Y = 0.8X + 2.6

When X = 8, the value of Y is estimated as
Y = 0.8(8) + 2.6 = 9

Regression coefficient of X on Y:
bxy = (n∑XY – ∑X∑Y) / (n∑Y² – (∑Y)²) = (5×83 – 15×25) / (5×135 – 25²) = 40/50 = 0.8

Regression line of X on Y:
X – 3 = 0.8(Y – 5)
X = 0.8Y – 1

When Y = 12, the value of X is estimated as
X = 0.8(12) – 1 = 8.6
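Both regression lines can also be computed directly from the given sums; the plain-Python sketch below is only a cross-check of the working above.

```python
# Example 9.16 from the given sums only: both regression lines and both estimates.
n, Sx, Sy, Sx2, Sy2, Sxy = 5, 15, 25, 55, 135, 83

x_bar, y_bar = Sx / n, Sy / n                        # means: 3 and 5
byx = (n * Sxy - Sx * Sy) / (n * Sx2 - Sx ** 2)      # regression coefficient of Y on X: 0.8
bxy = (n * Sxy - Sx * Sy) / (n * Sy2 - Sy ** 2)      # regression coefficient of X on Y: 0.8

print(y_bar + byx * (8 - x_bar))     # Y on X line at X = 8  -> 9.0
print(x_bar + bxy * (12 - y_bar))    # X on Y line at Y = 12 -> 8.6
```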
