
Machine Learning with Python: A Practical Introduction

LEARNING OBJECTIVES

In this course, you will:

o Explore examples of Machine Learning and the libraries and languages used to create them.

o Apply the appropriate form of regression to a data set for estimation.

o Apply an appropriate classification method for a particular Machine Learning challenge.

o Use the correct clustering algorithms on different data sets.

o Explain how recommendation systems work, and implement one on a data set.

o Demonstrate your understanding of Machine Learning in an assessed project.

Syllabus

Module 1 - Introduction to Machine Learning

o What is Machine Learning?

Module 2 - Regression

o Linear Regression

o Non-Linear Regression

Module 3 - Classification

o K-Nearest Neighbours

o Decision Trees

o Logistic Regression

o Support Vector Machine

Module 4 - Clustering
o k-Means Clustering

o Hierarchical Clustering

o Density-based Clustering

Module 5 - Recommender Systems

o Content-based Recommendation Engines

Final Assignment

GRADING SCHEME
This section contains information for those earning a certificate. Those
auditing the course can skip this section and click next.

1. The course contains 5 Graded Quizzes, 1 per module. The 5 Graded Quizzes carry equal weight and together account for 75% of the total grade; the Final Assignment carries a weight of 25% of the total grade.

2. The minimum passing mark for the course is 70%.

3. Permitted attempts are per question:

One attempt - For True/False questions

Two attempts - For any question other than True/False

4. There are no penalties for incorrect attempts.

5. Clicking the "Final Check" button when it appears means your submission is FINAL. You will NOT be able to resubmit your answer for that question again.

6. Check your grades in the course at any time by clicking on the "Progress" tab.
Module 1: Introduction to Machine Learning

Module Introduction

In this module, you will learn about the applications of Machine Learning in different
fields such as healthcare, banking, telecommunication, and others. You'll gain a general
overview of Machine Learning topics such as supervised vs unsupervised learning, and
the usage of each algorithm. Also, you will understand the advantage of using Python
libraries for implementing Machine Learning models. 

Learning Objectives

 To give examples of Machine Learning.

 To demonstrate the Python libraries for Machine Learning.

 To classify Supervised vs. Unsupervised algorithms.

Video 1

Hello and welcome to machine learning with Python.
In this course, you'll learn how machine learning
is used in many key fields and industries.
For example, in the healthcare industry,
data scientists use machine learning to predict whether a human cell that is
believed to be at risk of developing cancer is either benign or malignant.
As such, machine learning can play
a key role in determining a person's health and welfare.
You'll also learn about the value of decision trees,
and how building a good decision tree from historical data
helps doctors to prescribe the proper medicine for each of their patients.
You'll learn how bankers use machine learning to make
decisions on whether to approve loan applications.
You will learn how to use machine learning to do bank customer segmentation,
a task that is usually not feasible to do manually on huge volumes of varied data.
In this course, you'll see how machine learning helps websites such as YouTube, Amazon,
or Netflix develop recommendations to their customers about various products or
services such as which movies they might be
interested in going to see or which books to buy.
There is so much that you can do with machine learning.
Here, you'll learn how to use popular Python libraries to build your model.
For example given an automobile dataset,
we can use the scikit-learn library to estimate
the CO2 emission of cars using their engine size or cylinders.
We can even predict what the CO2 emissions will
be for a car that hasn't even been produced yet.
We'll see how the telecommunications industries can predict customer churn.
You can run and practice the code of
all these samples using the built-in lab environment in this course.
You don't have to install anything to your computer or do anything on the cloud.
All you have to do is click a button to start the lab environment in your browser.
The code for the samples is already written in the Python language in Jupyter notebooks,
and you can run it to see the results or change it to understand the algorithms better.
So, what will you be able to achieve by taking this course?
Well, by putting in just a few hours a week over the next few weeks,
you'll get new skills to add to your resume such as regression,
classification, clustering, scikit-learn, and scipy.
You'll also get new projects that you can add to
your portfolio including cancer detection,
predicting economic trends, predicting customer churn,
recommendation engines, and many more.
You'll also get a certificate in machine learning to prove your competency and share
it anywhere you like online or offline such as LinkedIn profiles and social media.
So, let's get started.

Video 2
Hello, and welcome!
In this video I will give you a high level introduction to Machine Learning.
So let’s get started.
This is a human cell sample extracted from a patient,
and this cell has characteristics. For example, its clump thickness is 6, its uniformity
of cell size is 1, its marginal adhesion is 1, and so on.
One of the interesting questions we can ask, at this point is: Is this a benign or malignant
cell?
In contrast with a benign tumor, a malignant tumor is a tumor that may invade its surrounding
tissue or spread around the body, and diagnosing it early might be the key to a patient’s
survival.
One could easily presume that only a doctor with years of experience could diagnose that
tumor and say if the patient is developing cancer or not.
Right?
Well, imagine that you’ve obtained a dataset containing characteristics of thousands of
human cell samples extracted from patients who were believed to be at risk of developing
cancer.
Analysis of the original data showed that many of the characteristics differed significantly
between benign and malignant samples.
You can use the values of these cell characteristics in samples from other patients to give an
early indication of whether a new sample might be benign or malignant.
You should clean your data, select a proper algorithm for building a prediction model,
and train your model to understand patterns of benign or malignant cells within the data.
Once the model has been trained by going through data iteratively, it can be used to predict
your new or unknown cell with a rather high accuracy.
This is machine learning!
It is the way that a machine learning model can do a doctor’s task or at least help
that doctor make the process faster.
Now, let me give a formal definition of machine learning.
Machine learning is the subfield of computer science that gives "computers the ability
to learn without being explicitly programmed.”
Let me explain what I mean when I say “without being explicitly programmed.”
Assume that you have a dataset of images of animals such as cats and dogs, and you want
to have software or an application that can recognize and differentiate them.
The first thing that you have to do here is interpret the images as a set of feature sets.
For example, does the image show the animal’s eyes?
If so, what is their size?
Does it have ears?
What about a tail?
How many legs?
Does it have wings?
Prior to machine learning, each image would be transformed to a vector of features.
Then, traditionally, we had to write down some rules or methods in order to get computers
to be intelligent and detect the animals.
But, it was a failure.
Why?
Well, as you can guess, it needed a lot of rules, highly dependent on the current dataset,
and not generalized enough to detect out-of-sample cases.
This is when machine learning entered the scene.
Using machine learning allows us to build a model that looks at all the feature sets,
and their corresponding types of animals, and it learns the pattern of each animal.
It is a model built by machine learning algorithms.
It detects without explicitly being programmed to do so.
In essence, machine learning follows the same process that a 4-year-old child uses to learn,
understand, and differentiate animals.
So, machine learning algorithms, inspired by the human learning process, iteratively
learn from data, and allow computers to find hidden insights.
These models help us in a variety of tasks, such as object recognition, summarization,
recommendation, and so on.
Machine Learning impacts society in a very influential way.
Here are some real-life examples.
First, how do you think Netflix and Amazon recommend videos, movies, and TV shows to their
users?
They use Machine Learning to produce suggestions that you might enjoy!
This is similar to how your friends might recommend a television show to you, based
on their knowledge of the types of shows you like to watch.
How do you think banks make a decision when approving a loan application?
They use machine learning to predict the probability of default for each applicant, and then
approve
or refuse the loan application based on that probability.
Telecommunication companies use their customers’ demographic data to segment them, or
predict
if they will unsubscribe from their company the next month.
There are many other applications of machine learning that we see every day in our daily
life, such as chatbots, logging into our phones or even computer games using face recognition.
Each of these use different machine learning techniques and algorithms.
So, let’s quickly examine a few of the more popular techniques.
The Regression/Estimation technique is used for predicting a continuous value. For example,
predicting things like the price of a house based on its characteristics, or to estimate
the Co2 emission from a car’s engine.
A Classification technique is used for Predicting the class or category of a case, for example,
if a cell is benign or malignant, or whether or not a customer will churn.
Clustering groups similar cases; for example, it can find similar patients, or it can be used
for customer segmentation in the banking field.
Association technique is used for finding items or events that often co-occur, for example,
grocery items that are usually bought together by a particular customer.
Anomaly detection is used to discover abnormal and unusual cases, for example, it is used
for credit card fraud detection.
Sequence mining is used for predicting the next event, for instance, the click-stream
in websites.
Dimension reduction is used to reduce the size of data.
And finally, recommendation systems, this associates people's preferences with others
who have similar tastes, and recommends new items to them, such as books or movies.
We will cover some of these techniques in the next videos.
By this point, I’m quite sure this question has crossed your mind, “What is the difference
between these buzzwords that we keep hearing these days, such as Artificial intelligence
(or AI), Machine Learning and Deep Learning?”
Well, let me explain what is different between them.
In brief, AI tries to make computers intelligent in order to mimic the cognitive functions
of humans.
So, Artificial Intelligence is a general field with a broad scope including: Computer Vision,
Language Processing, Creativity, and Summarization.
Machine Learning is the branch of AI that covers the statistical part of artificial
intelligence.
It teaches the computer to solve problems by looking at hundreds or thousands of examples,
learning from them, and then using that experience to solve the same problem in new
situations.
And Deep Learning is a very special field of Machine Learning where computers can actually
learn and make intelligent decisions on their own.
Deep learning involves a deeper level of automation in comparison with most machine learning
algorithms.
Now that we’ve completed the introduction to Machine Learning, subsequent videos will
focus on reviewing two main components: First, you’ll be learning about the purpose
of Machine Learning and where it can be applied in the real world; and
Second, you’ll get a general overview of Machine Learning topics, such as supervised
vs unsupervised learning, model evaluation and various Machine Learning algorithms.
So now that you have a sense of what's in store on this journey, let's continue
our exploration of Machine Learning!
Thanks for watching!

Video 3

Hello and welcome. In this video,
we'll talk about how to use Python for machine learning. So let's get started.
Python is a popular and powerful general purpose programming language
that recently emerged as the preferred language among data scientists.
You can write your machine-learning algorithms using Python, and it works very well.
However, there are a lot of modules and libraries already implemented in Python,
that can make your life much easier.
We try to introduce the Python packages in
this course and use them in the labs to give you better hands-on experience.
The first package is NumPy which is
a math library to work with N-dimensional arrays in Python.
It enables you to do computation efficiently and effectively.
It is better than regular Python for numerical work because of its efficient array operations.
For example, for working with arrays, dictionaries,
functions, datatypes, and images, you need to know NumPy.
SciPy is a collection of numerical algorithms and domain specific toolboxes,
including signal processing, optimization,
statistics and much more.
SciPy is a good library for scientific and high performance computation.
Matplotlib is a very popular plotting package that provides 2D plotting,
as well as 3D plotting.
Basic knowledge about these three packages which are built on top of Python,
is a good asset for data scientists who want to work with real-world problems.
If you're not familiar with these packages,
I recommend that you take the data analysis with Python course first.
This course covers most of the useful topics in these packages.
Pandas library is a very high-level Python library
that provides high performance easy to use data structures.
It has many functions for data importing, manipulation and analysis.
In particular, it offers data structures and
operations for manipulating numerical tables and timeseries.
SciKit Learn is a collection of algorithms and tools for
machine learning which is our focus here
and which you'll learn to use within this course.
As we'll be using SciKit Learn quite a bit in the labs,
let me explain more about it and show you why it is so popular among data scientists.
SciKit Learn is a free Machine Learning Library for the Python programming language.
It has most of the classification,
regression and clustering algorithms,
and it's designed to work with
the Python numerical and scientific libraries NumPy and SciPy.
Also, it includes very good documentation.
On top of that,
implementing machine learning models with SciKit Learn
is really easy with a few lines of Python code.
Most of the tasks that need to be done in a machine learning pipeline are
implemented already in Scikit Learn including pre-processing of data,
feature selection, feature extraction, train test splitting,
defining the algorithms, fitting models,
tuning parameters, prediction, evaluation, and exporting the model.
Let me show you an example of what SciKit Learn looks like when you use this library.
You don't have to understand the code for now but just see
how easily you can build a model with just a few lines of code.
Basically, machine-learning algorithms benefit from standardization of the dataset.
If there are some outliers or fields with different scales in your dataset,
you have to fix them.
The pre-processing package of SciKit Learn provides several common utility functions and
transformer classes to change
raw feature vectors into a suitable form of vector for modeling.
You have to split your dataset into train and test sets to
train your model and then test the model's accuracy separately.
SciKit Learn can split arrays or matrices into
random train and test subsets for you in one line of code.
Then you can set up your algorithm.
For example, you can build a classifier using a support vector classification algorithm.
We call our estimator instance CLF and initialize its parameters.
Now you can train your model with the train
set by passing our training set to the fit method;
the CLF model then learns to classify unknown cases.
Then we can use our test set to run predictions,
and the result tells us what the class of each unknown value is.
Also, you can use the different metrics to evaluate your model accuracy.
For example, using a confusion matrix to show the results.
And finally, you save your model.
You may find all or some of these machine-learning terms confusing but don't worry,
we'll talk about all of these topics in the following videos.
The most important point to remember is that the entire process of
a machine learning task can be done simply in a few lines of code using SciKit Learn.
Please notice that though it is possible,
it would not be that easy if you want to do all of this using NumPy or SciPy packages.
And of course, it needs much more coding if you use
pure Python programming to implement all of these tasks.

Thanks for watching.
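The transcript above walks through a typical SciKit Learn workflow without showing the code. Below is a minimal sketch of such a pipeline, assuming made-up data and illustrative parameter values; it is not the course's lab code, only an example of the steps described (pre-processing, train/test splitting, defining the algorithm, fitting, prediction, evaluation, and exporting the model).

import pickle
import numpy as np
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import confusion_matrix

# Illustrative data: 100 samples, 4 numeric features, a binary class label
X = np.random.rand(100, 4)
y = np.random.randint(0, 2, size=100)

# Pre-processing: standardize the raw feature vectors
X = preprocessing.StandardScaler().fit_transform(X)

# Split arrays into random train and test subsets in one line
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=4)

# Define the algorithm: a support vector classifier, conventionally named clf
clf = SVC(gamma='auto')

# Fit (train) the model on the training set
clf.fit(X_train, y_train)

# Predict the class of the unknown (test) cases
y_hat = clf.predict(X_test)

# Evaluate, for example with a confusion matrix
print(confusion_matrix(y_test, y_hat))

# Export (save) the trained model
with open('clf_model.pkl', 'wb') as f:
    pickle.dump(clf, f)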


Video 4
Hello, and welcome.
In this video we'll introduce supervised algorithms versus unsupervised algorithms.
So, let's get started.
An easy way to begin grasping the concept of
supervised learning is by looking directly at the words that make it up.
Supervise, means to observe,
and direct the execution of a task, project, or activity.
Obviously we aren't going to be supervising a person;
instead, we will be supervising a machine learning model that
might be able to produce classification regions like we see here.
So, how do we supervise a machine learning model?
We do this by teaching the model,
that is we load the model with knowledge so that we can have it predict future instances.
But this leads to the next question which is,
how exactly do we teach a model?
We teach the model by training it with some data from a labeled dataset.
It's important to note that the data is labeled,
and what does a labeled dataset look like?
Well, it could look something like this.
This example is taken from the cancer dataset.
As you can see, we have some historical data for patients,
and we already know the class of each row.
Let's start by introducing some components of this table.
The names up here which are called clump thickness,
uniformity of cell size,
uniformity of cell shape,
marginal adhesion and so on are called attributes.
The columns are called features which include the data.
If you plot this data,
and look at a single data point on a plot,
it'll have all of these attributes, making up
a row on this chart, also referred to as an observation.
Looking directly at the value of the data,
you can have two kinds.
The first is numerical.
When dealing with machine learning,
the most commonly used data is numeric.
The second is categorical,
that is, it's non-numeric because it contains characters rather than numbers.
In this case, it's categorical because this dataset is made for classification.
There are two types of supervised learning techniques.
They are classification, and regression.
Classification is the process of predicting a discrete class label, or category.
Regression is the process of predicting
a continuous value as opposed to predicting a categorical value in classification.
Look at this dataset.
It is related to CO2 emissions of different cars.
It includes; engine size, cylinders,
fuel consumption, and CO2 emission of various models of automobiles.
Given this dataset, you can use regression to predict
the CO2 emission of a new car by using other fields such as engine size,
or number of cylinders.
Since we know the meaning of supervised learning,
what do you think unsupervised learning means?
Yes, unsupervised learning is exactly as it sounds.
We do not supervise the model,
but we let the model work on its own to discover
information that may not be visible to the human eye.
It means, the unsupervised algorithm trains on the dataset,
and draws conclusions on unlabeled data.
Generally speaking, unsupervised learning has more difficult algorithms
than supervised learning since we know little to no information about the data,
or the outcomes that are to be expected.
Dimension reduction, density estimation,
market basket analysis, and clustering are
the most widely used unsupervised machine learning techniques.
Dimensionality reduction, and/or feature selection,
play a large role in this by reducing
redundant features to make the classification easier.
Market basket analysis is a modeling technique
based upon the theory that if you buy a certain group of items,
you're more likely to buy another group of items.
Density estimation is a very simple concept that is
mostly used to explore the data to find some structure within it.
And finally, clustering:
Clustering is considered to be one of
the most popular unsupervised machine learning techniques used for grouping data points,
or objects that are somehow similar.
Cluster analysis has many applications in different domains,
whether it be a bank's desire to segment its customers based on certain characteristics,
or helping an individual to organize and group his
or her favorite types of music.
Generally speaking though, clustering is used mostly for discovering structure,
summarization, and anomaly detection.
So, to recap, the biggest difference between supervised
and unsupervised learning is that supervised learning deals with
labeled data while unsupervised learning deals with unlabeled data.
In supervised learning, we have
machine learning algorithms for classification and regression.
In unsupervised learning, we have methods such as clustering.
In comparison to supervised learning,
unsupervised learning has fewer models
and fewer evaluation methods that can be used
to ensure that the outcome of the model is accurate.
As such, unsupervised learning creates
a less controllable environment as the machine is
creating outcomes for us. Thanks for watching.
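The contrast described in this video can be made concrete with a minimal scikit-learn sketch; the toy data, the choice of K-nearest neighbours and k-means, and all parameter values here are illustrative assumptions, not course material. A supervised classifier is fit on features together with labels, while an unsupervised clustering algorithm is fit on the features alone.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

# Toy feature matrix: 6 observations, 2 features each (values are made up)
X = np.array([[1.0, 2.0], [1.2, 1.9], [0.9, 2.1],
              [8.0, 8.5], [8.2, 7.9], [7.9, 8.1]])

# Supervised learning: the labels y are part of the training data
y = np.array([0, 0, 0, 1, 1, 1])
clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)
print(clf.predict([[1.1, 2.0]]))   # predicts a class label for a new case

# Unsupervised learning: no labels, the algorithm finds groups on its own
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)                  # cluster assignments discovered from the data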

Module 2: Regression

Module Introduction

In this module, you will get a brief introduction to regression. You will learn about linear,
non-linear, simple, and multiple regression, and their applications. You will apply all these
methods to two different data sets in the lab sections. Also, you will learn how to
evaluate your regression model, and calculate its accuracy.

Learning Objectives
 To understand the basics of regression.

 To apply Simple and Multiple, Linear and Non-Linear Regression on a data set for
estimation.
Video 1



2. Hello, and welcome.

3. In this video,

4. we'll be giving a brief introduction to regression.

5. So, let's get started.

6. Look at this data set,

7. it's related to CO_2 emissions from different cars.

8. It includes engine size, number of cylinders,

9. fuel consumption, and CO_2 emission from various automobile models.

10. The question is, given this data set,

11. can we predict the CO_2 emission of a car,

12. using other fields such as engine size, or cylinders?

13. Let's assume we have some historical data from different cars,

14. and assume that a car such as,

15. in row nine has not been manufactured yet,

16. but we're interested in estimating its approximate CO_2 emission, after production.

17. Is it possible?

18. We can use regression methods to predict a continuous value such as CO_2
emission,

19. using some other variables.


20. Indeed, regression is the process of predicting a continuous value.

21. In regression, there are two types of variables,

22. a dependent variable, and one or more independent variables.

23. The dependent variable, can be seen as the state,

24. target, or final goal we study,

25. and try to predict,

26. and the independent variables,

27. also known as explanatory variables,

28. can be seen as the causes of those states.

29. The independent variables are shown conventionally by X,

30. and the dependent variable is notated by Y.

31. Our regression model relates Y,

32. or the dependent variable,

33. to a function of X i.e.

34. The independent variables.

35. The key point in the regression,

36. is that our dependent value should be continuous

37. and cannot be a discrete value.

38. However, the independent variable or variables,

39. can be measured on either a categorical,

40. or continuous measurement scale.

41. So, what we want to do here is to use the historical data of some cars,

42. using one or more of their features and from that data, make a model.

43. We use regression to build such a regression estimation model,

44. then the model is used to predict the expected CO_2 emission for a new, or
unknown car.

45. Basically, there are two types of regression models.

46. Simple regression, and multiple regression.

47. Simple regression is when

48. one independent variable is used to estimate a dependent variable.


49. It can be either linear, or non-linear.

50. For example, predicting CO_2 emission using the variable of engine size.

51. Linearity of regression is based on the nature of

52. the relationship between independent and dependent variables.

53. When more than one independent variable is present,

54. the process is called multiple linear regression.

55. For example, predicting CO_2 emission using engine size,

56. and the number of cylinders in any given car.

57. Again, depending on the relation between dependent and independent variables,

58. it can be either linear or nonlinear regression.

59. Let's examine some sample applications of regression.

60. Essentially, we use regression when we want to estimate a continuous value.

61. For instance, one of the applications of

62. regression analysis could be in the area of sales forecasting.

63. You can try to predict a salesperson's total yearly sales from independent variables,

64. such as age, education,

65. and years of experience.

66. It can also be used in the field of psychology for example,

67. to determine individual satisfaction based on demographic and psychological


factors.

68. We can use regression analysis to predict the price of a house in an area,

69. based on its size,

70. number of bedrooms, and so on.

71. We can even use it to predict employment income,

72. for independent variables such as hours of work, education,

73. occupation, sex, age, years of experience, and so on.

74. Indeed you can find many examples of the usefulness of

75. regression analysis in these and many other fields,

76. or domains such as finance,

77. healthcare, retail, and more.


78. We have many regression algorithms.

79. Each of them has its own importance,

80. and a specific condition to which their application is best suited,

81. and while we've covered just a few of them in this course,

82. it gives you enough base knowledge for you to explore different regression
techniques.

83. Thanks for watching.
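As a compact summary of the notation introduced in this transcript (a sketch of the standard forms, not a slide from the course): in simple linear regression the model is

\hat{y} = \theta_0 + \theta_1 x_1

and in multiple linear regression it is

\hat{y} = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_n x_n

where \hat{y} is the predicted (dependent) value and x_1, \dots, x_n are the independent variables; non-linear regression replaces the right-hand side with a non-linear function of the x's.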


Video 2
Simple Linear Regression



2. Hello and welcome.

3. In this video, we'll be covering linear regression.

4. You don't need to know any linear algebra to understand topics in linear regression.

5. This high-level introduction will give you enough background information on linear

6. regression to be able to use it effectively on your own problems.

7. So let's get started.

8. Let's take a look at this data set.

9. It's related to the Co2 emission of different cars.

10. It includes engine size, cylinders, fuel consumption and

11. Co2 emissions for various car models.

12. The question is, given this data set,

13. can we predict the Co2 emission of a car using another field such as engine size?

14. Quite simply, yes.

15. We can use linear regression to predict a continuous value

16. such as Co2 emission by using other variables.

17. Linear regression is the approximation of a linear model

18. used to describe the relationship between two or more variables.

19. In simple linear regression, there are two variables,

20. a dependent variable and an independent variable.

21. The key point in the linear regression

22. is that our dependent value should be continuous and cannot be a discrete value.
23. However, the independent variables can be measured on either a categorical or

24. continuous measurement scale.

25. There are two types of linear regression models.

26. They are simple regression and multiple regression.

27. Simple linear regression is when one independent variable is used

28. to estimate a dependent variable.

29. For example, predicting Co2 emission using the engine size variable.

30. When more than one independent variable is present the process is called

31. multiple linear regression, for example,

32. predicting Co2 emission using engine size and cylinders of cars.

33. Our focus in this video is on simple linear regression.

34. Now let's see how linear regression works.

35. Okay, so let's look at our data set again.

36. To understand linear regression, we can plot our variables here.

37. We show engine size as an independent variable and

38. emission as the target value that we would like to predict.

39. A scatter plot clearly shows the relation between variables where changes in

40. one variable explain or possibly cause changes in the other variable.

41. Also, it indicates that these variables are linearly related.

42. With linear regression you can fit a line through the data.

43. For instance, as the engine size increases, so do the emissions.

44. With linear regression you can model the relationship of these variables.

45. A good model can be used to predict what the approximate emission of each car is.

46. How do we use this line for prediction now?

47. Let us assume for a moment that the line is a good fit of the data.

48. We can use it to predict the emission of an unknown car.

49. For example, for a sample car with engine size 2.4,

50. you can find the emission is 214.

51. Now, let's talk about what the fitting line actually is.

52. We're going to predict the target value y.


53. In our case using the independent variable engine size represented by x1.

54. The fit line is shown traditionally as a polynomial.

55. In a simple regression problem, a single x,

56. the form of the model would be theta 0 plus theta 1 x1.

57. In this equation, y hat is the dependent variable, or the predicted value.

58. And x1 is the independent variable.

59. Theta 0 and theta 1 are the parameters of the line that we must adjust.

60. Theta 1 is known as the slope or

61. gradient of the fitting line and theta 0 is known as the intercept.

62. Theta 0 and theta 1 are also called the coefficients of the linear equation.

63. You can interpret this equation as y hat being

64. a function of x1, or y hat being dependent on x1.

65. How would you draw a line through the points?

66. And how do you determine which line fits best?

67. Linear regression estimates the coefficients of the line.

68. This means we must calculate theta 0 and

69. theta 1 to find the best line to fit the data.

70. This line would best estimate the emission of the unknown data points.

71. Let's see how we can find this line or, to be more precise,

72. how we can adjust the parameters to make the line the best fit for the data.

73. For a moment, let's assume we've already found the best fit line for our data.

74. Now, let's go through all the points and check how well they align with this line.

75. Best fit here means that if we have, for instance,

76. a car with engine size x1 = 5.4 and

77. actual Co2 = 250,

78. its Co2 should be predicted very close to the actual value,

79. which is y = 250 based on historical data.

80. But if we use the fit line, or better to say

81. using our polynomial with known parameters to predict the Co2 emission,

82. it will return y hat = 340.


83. Now if you compare the actual value of the emission of the car with what

84. we've predicted using our model, you will find out that we have a 90 unit error.

85. This means our prediction line is not accurate.

86. This error is also called the residual error.

87. So we can say the error is the distance from the data point

88. to the fitted regression line.

89. The mean of all residual errors shows how poorly the line

90. fits with the whole data set.

91. Mathematically it can be shown by the equation Mean Squared Error, shown as MSE.

92. Our objective is to find a line where the mean of all these errors is minimized.

93. In other words,

94. the mean error of the prediction using the fit line should be minimized.

95. Let's reword it more technically.

96. The objective of linear regression, is to minimize this MSE equation and

97. to minimize it, we should find the best parameters theta 0 and theta 1.

98. Now the question is how to find theta 0 and

99. theta 1 in such a way that it minimizes this error?

100. How can we find such a perfect line?

101. Or said another way, how should we find the best parameters for our line?

102. Should we move the line a lot randomly and

103. calculate the MSE value every time and choose the minimum one?

104. Not really.

105. Actually, we have two options here.

106. Option one, we can use a mathematic approach, or

107. option two, we can use an optimization approach.

108. Let's see how we could easily use a mathematic formula to find the theta 0 and

109. theta 1.

110. As mentioned before, theta 0 and

111. theta 1 in the simple linear regression are the coefficients of the fit line.

112. We can use a simple equation to estimate these coefficients.


113. That is, given that it's a simple linear regression with only two parameters,

114. and knowing that theta 0 and theta 1 are the intercept and

115. slope of the line, we can estimate them directly from our data.

116. It requires that we calculate the mean of the independent and dependent or

117. target columns from the data set.

118. Notice that all of the data must be available to traverse and

119. calculate the parameters.

120. It can be shown that the intercept and

121. slope can be calculated using these equations.

122. We can start off by estimating the value for theta 1.

123. This is how you can find the slope of a line based on the data.

124. X bar is the average value for the engine size in our data set.

125. Please consider that we have nine rows here, rows 0 to 8.

126. First we calculate the average of x1 and of y,

127. then we plug it into the slope equation to find theta 1.

128. The xi and yi in the equation refer to the fact that we

129. need to repeat these calculations across all values in our data set.

130. And i refers to the ith value of x or y.

131. Applying all values, we find theta 1 equals 39.

132. It is our second parameter.

133. It is used to calculate the first parameter

134. which is the intercept of the line.

135. Now we can plug theta 1 into the line equation to find theta 0.

136. It is easily calculated that theta 0 equals 125.74.

137. So these are the two parameters for the line,

138. where theta 0 is also called the bias coefficient, and

139. theta 1 is the coefficient for the engine size column.

140. As a side note, you really don't need to remember the formula for

141. calculating these parameters, as most of the libraries used for machine learning

142. in Python, R and Scala can easily find these parameters for you.
143. But it's always good to understand how it works.

144. Now, we can write down the polynomial of the line.

145. So we know how to find the best fit for our data and its equation.

146. Now the question is how can we use it to predict the emission of a new car

147. based on its engine size?

148. After we found the parameters of the linear equation,

149. making predictions is as simple as solving the equation for a specific set of inputs.

150. Imagine we are predicting Co2 emission, or y,

151. from engine size, or x for the automobile in record number 9.

152. Our linear regression model representation for

153. this problem would be y hat= theta 0 + theta 1 x1.

154. Or if we map it to our data set,

155. it would be Co2Emission =theta 0 + theta 1 EngineSize.

156. As we saw, we can find theta 0,

157. theta 1 using the equations that we just talked about.

158. Once found, we can plug in the equation of the linear model.

159. For example, let's use theta 0 = 125 and theta 1 = 39.

160. So we can rewrite the linear model as Co2Emission

161. equals 125 plus 39 EngineSize.

162. Now let's plug in the 9th row of our data set and

163. calculate the Co2 emission for a car with an engine size of 2.4.

164. So Co2Emission = 125 + 39 x 2.4.

165. Therefore, we can predict that the Co2Emission for

166. this specific car would be 218.6.

167. Let's talk a bit about why linear regression is so useful.

168. Quite simply, it is the most basic regression to use and understand.

169. In fact, one reason why linear regression is so useful is that it's fast.

170. It also doesn't require tuning of parameters.

171. So something like tuning the K parameter in K-nearest neighbors, or

172. the learning rate in neural networks isn't something to worry about.
173. Linear regression is also easy to understand, and highly interpretable.

174. Thanks for watching this video.
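The transcript above refers to slope and intercept equations that appear only on the course slides. Here is a minimal NumPy sketch of the closed-form least-squares estimates it describes, using a small made-up engine-size/CO2 table; the numbers are illustrative, not the course dataset.

import numpy as np

# Illustrative data: engine size (x) and CO2 emissions (y) for a few cars
x = np.array([2.0, 2.4, 1.5, 3.5, 3.5, 3.7, 3.7])
y = np.array([196., 221., 136., 255., 244., 230., 232.])

# Closed-form least-squares estimates for simple linear regression:
# theta1 = sum((x_i - x_bar)(y_i - y_bar)) / sum((x_i - x_bar)^2)
# theta0 = y_bar - theta1 * x_bar
x_bar, y_bar = x.mean(), y.mean()
theta1 = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
theta0 = y_bar - theta1 * x_bar
print(theta0, theta1)

# Prediction for a new car, e.g. engine size 2.4, as in the transcript's example
y_hat = theta0 + theta1 * 2.4
print(y_hat)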

Video 3



2. Hello and welcome.

3. In this video,

4. we'll be covering model evaluation.

5. So let's get started.

6. The goal of regression is to build a model to accurately predict an unknown case.

7. To this end, we have to perform regression evaluation after building the model.

8. In this video, we'll introduce and discuss two types of

9. evaluation approaches that can be used to achieve this goal.

10. These approaches are train and test on the same dataset and train/test split.

11. We'll talk about what each of these are,

12. as well as the pros and cons of using each of these models.

13. Also, we'll introduce some metrics for accuracy of regression models.

14. Let's look at the first approach.

15. When considering evaluation models,

16. we clearly want to choose the one that will give us the most accurate results.

17. So, the question is,

18. how can we calculate the accuracy of our model?

19. In other words, how much can we trust this model for prediction of

20. an unknown sample using

21. a given dataset and having built a model such as linear regression?

22. One of the solutions is to select a portion of our dataset for testing.

23. For instance, assume that we have 10 records in our dataset.

24. We use the entire dataset for training,

25. and we build a model using this training set.

26. Now, we select a small portion of the dataset,


27. such as row number six to nine,

28. but without the labels.

29. This set is called a test set,

30. which has the labels,

31. but the labels are not used for prediction; they are used only as ground truth.

32. The labels are called actual values of the test set.

33. Now we pass the feature set of the testing portion

34. to our built model and predict the target values.

35. Finally, we compare the predicted values by

36. our model with the actual values in the test set.

37. This indicates how accurate our model actually is.

38. There are different metrics to report the accuracy of the model,

39. but most of them work generally based on

40. the similarity of the predicted and actual values.

41. Let's look at one of the simplest metrics to

42. calculate the accuracy of our regression model.

43. As mentioned, we just compare the actual values y with the predicted values,

44. which is noted as y hat for the testing set.

45. The error of the model is calculated as the average difference

46. between the predicted and actual values for all the rows.

47. We can write this error as an equation.

48. So, the first evaluation approach we just talked about is the simplest one,

49. train and test on the same dataset.

50. Essentially, the name of this approach says it all.

51. You train the model on the entire dataset,

52. then you test it using a portion of the same dataset.

53. In a general sense,

54. when you test with a dataset in which you know the target value for each data point,

55. you're able to obtain a percentage of accurate predictions for the model.

56. This evaluation approach would most likely have a high training accuracy and
57. a low out-of-sample accuracy, since

58. the model knows all of the testing data points from the training.

59. What is training accuracy and out-of-sample accuracy?

60. We said that training and testing on the same dataset produces a high training accuracy,

61. but what exactly is training accuracy?

62. Training accuracy is the percentage of

63. correct predictions that the model makes when using the test dataset.

64. However, a high training accuracy isn't necessarily a good thing.

65. For instance, having a high training accuracy may result in over-fitting the data.

66. This means that the model is overly trained to the dataset,

67. which may capture noise and produce a non-generalized model.

68. Out-of-sample accuracy is the percentage of correct predictions that

69. the model makes on data that the model has not been trained on.

70. Doing a train and test on the same dataset will most likely have

71. low out-of-sample accuracy due to the likelihood of being over-fit.

72. It's important that our models have

73. high out-of-sample accuracy because the purpose of our model is,

74. of course, to make correct predictions on unknown data.

75. So, how can we improve out-of-sample accuracy?

76. One way is to use another evaluation approach called train/test split.

77. In this approach, we select a portion of our dataset for training, for example,

78. row zero to five,

79. and the rest is used for testing,

80. for example, row six to nine.

81. The model is built on the training set.

82. Then, the test feature set is passed to the model for prediction.

83. Finally, the predicted values for

84. the test set are compared with the actual values of the testing set.

85. The second evaluation approach is called train/test split.

86. Train/test split involves splitting the dataset


87. into training and testing sets respectively,

88. which are mutually exclusive.

89. After which, you train with the training set and test with the testing set.

90. This will provide a more accurate evaluation on out-of-sample accuracy because

91. the testing dataset is not part of the dataset that has been used to train the model.

92. It is more realistic for real-world problems.

93. This means that we know the outcome of each data point in the dataset,

94. making it great to test with.

95. Since this data has not been used to train the model,

96. the model has no knowledge of the outcome of these data points.

97. So, in essence, it's truly out-of-sample testing.

98. However, please ensure that you train your model with the testing set afterwards,

99. as you don't want to lose potentially valuable data.

100. The issue with train/test split is that it's highly

101. dependent on the datasets on which the data was trained and tested.

102. The variation of this causes train/test split to have

103. a better out-of-sample prediction than training and testing on the same dataset,

104. but it still has some problems due to this dependency.

105. Another evaluation model, called K-fold cross-validation,

106. resolves most of these issues.

107. How do you fix a high variation that results from a dependency?

108. Well, you average it.

109. Let me explain the basic concept of K-fold

110. cross-validation to see how we can solve this problem.

111. The entire dataset is represented by the points in the image at the top left.

112. If we have K equals four folds,

113. then we split up this dataset as shown here.

114. In the first fold for example,

115. we use the first 25 percent of the dataset for testing and the rest for training.

116. The model is built using the training set and is evaluated using the test set.
117. Then, in the next round or in the second fold,

118. the second 25 percent of the dataset is

119. used for testing and the rest for training the model.

120. Again, the accuracy of the model is calculated.

121. We continue for all folds.

122. Finally, the result of all four evaluations are averaged.

123. That is, the accuracy of each fold is then averaged,

124. keeping in mind that each fold is distinct,

125. where no training data in one fold is used in another.

126. K-fold cross-validation in its simplest form performs multiple train/test splits,

127. using the same dataset where each split is different.

128. Then, the results are averaged to produce a more consistent out-of-sample accuracy.

129. We wanted to show you an evaluation model that

130. addressed some of the issues we've described in the previous approaches.

131. However, going in-depth with K-fold cross-validation model

132. is out of the scope for this course. Thanks for watching.
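As a hedged illustration of the approaches described in this transcript, here is a minimal scikit-learn sketch; the toy feature/target arrays, the linear regression estimator, and the split sizes are assumptions made for the example, not the course's lab code.

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score

# Illustrative data: one feature and a continuous target with some noise
X = np.arange(20, dtype=float).reshape(-1, 1)
y = 3.0 * X.ravel() + np.random.normal(scale=2.0, size=20)

# Train/test split: mutually exclusive training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)
print("Out-of-sample R^2:", model.score(X_test, y_test))

# K-fold cross-validation (here K = 4): several different splits, scores averaged
scores = cross_val_score(LinearRegression(), X, y, cv=4)
print("Average R^2 across the 4 folds:", scores.mean())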

Video 4

1. Hello and welcome.

2. In this video,

3. we'll be covering accuracy metrics for model evaluation. So let's get started.

4. Evaluation metrics are used to explain the performance of a model.

5. Let's talk more about the model evaluation metrics that are used for regression.

6. As mentioned, basically, we can compare the actual values and predicted values,

7. to calculate the accuracy of our regression model.

8. Evaluation metrics, provide a key role in the development of a model,

9. as it provides insight to areas that require improvement.

10. We'll be reviewing a number of model evaluation metrics including

11. mean absolute error, mean squared error, and root mean squared error.

12. But before we get into defining these,


13. we need to define what an error actually is.

14. In the context of regression,

15. the error of the model is the difference between

16. the data points and the trend line generated by the algorithm.

17. Since there are multiple data points,

18. an error can be determined in multiple ways.

19. Mean absolute error is the mean of the absolute value of the errors.

20. This is the easiest of the metrics to understand,

21. since it's just the average error.

22. Mean squared error is the mean of the squared error.

23. It's more popular than mean absolute error

24. because the focus is geared more towards large errors.

25. This is due to the squared term

26. amplifying larger errors much more than smaller ones.

27. Root mean squared error is the square root of the mean squared error.

28. This is one of the most popular of the evaluation metrics

29. because root mean squared error is interpretable

30. in the same units as the response vector or y units,

31. making it easy to relate its information.

32. Relative absolute error,

33. where y bar is the mean value of y,

34. takes the total absolute error and normalizes it by

35. dividing by the total absolute error of the simple predictor.

36. Relative squared error is very similar to relative absolute error

37. but is widely adopted by the data science community,

38. as it is used for calculating R squared.

39. R squared is not an error per se

40. but is a popular metric for the accuracy of your model.

41. It represents how close the data values are

42. to the fitted regression line.


43. The higher the R-squared,

44. the better the model fits your data.

45. Each of these metrics can be used for quantifying the accuracy of your prediction.

46. The choice of metric

47. completely depends on the type of model,

48. your data type, and domain of knowledge.

49. Unfortunately, further review is out of scope of this course.

50. Thanks for watching.
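The metrics in this video are described only in words; for reference, a sketch of the standard formulas (with y_i the actual values, \hat{y}_i the predicted values, \bar{y} the mean of y, and n the number of test rows) is:

\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} |y_i - \hat{y}_i|, \qquad
\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \qquad
\mathrm{RMSE} = \sqrt{\mathrm{MSE}}

\mathrm{RAE} = \frac{\sum_{i=1}^{n} |y_i - \hat{y}_i|}{\sum_{i=1}^{n} |y_i - \bar{y}|}, \qquad
\mathrm{RSE} = \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}, \qquad
R^2 = 1 - \mathrm{RSE}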

Lab 1 - Simple Linear Regression


Estimated time needed: 15 minutes
Objectives
After completing this lab you will be able to:

 Use scikit-learn to implement simple Linear Regression


 Create a model, train, test, and use the model

Importing Needed packages


[ ]:

import matplotlib.pyplot as plt
import pandas as pd
import pylab as pl
import numpy as np
%matplotlib inline

Downloading Data
To download the data, we will use !wget to download it from IBM Object Storage.
[ ]:

!wget -O FuelConsumption.csv https://cf-courses-data.s3.us.cloud-object-storage.appdomain.cloud/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module%202/data/FuelConsumptionCo2.csv
Did you know? When it comes to Machine Learning, you will likely be working with large datasets. As a
business, where can you host your data? IBM is offering a unique opportunity for businesses, with 10 Tb
of IBM Cloud Object Storage: Sign up now for free

Understanding the Data


FuelConsumption.csv:
We have downloaded a fuel consumption dataset, FuelConsumption.csv, which contains model-specific
fuel consumption ratings and estimated carbon dioxide emissions for new light-duty vehicles for retail sale
in Canada. Dataset source
 MODELYEAR e.g. 2014
 MAKE e.g. Acura
 MODEL e.g. ILX
 VEHICLE CLASS e.g. SUV
 ENGINE SIZE e.g. 4.7
 CYLINDERS e.g 6
 TRANSMISSION e.g. A6
 FUEL CONSUMPTION in CITY(L/100 km) e.g. 9.9
 FUEL CONSUMPTION in HWY (L/100 km) e.g. 8.9
 FUEL CONSUMPTION COMB (L/100 km) e.g. 9.2
 CO2 EMISSIONS (g/km) e.g. 182 --> low --> 0

Reading the data in


[ ]:

df = pd.read_csv("FuelConsumption.csv")

# take a look at the dataset
df.head()

Data Exploration
Let's first have a descriptive exploration of our data.
[ ]:

# summarize the data
df.describe()
Let's select some features to explore further.
[ ]:

cdf = df[['ENGINESIZE','CYLINDERS','FUELCONSUMPTION_COMB','CO2EMISSIONS']]
cdf.head(9)
We can plot each of these features:
[ ]:

viz = cdf[['CYLINDERS','ENGINESIZE','CO2EMISSIONS','FUELCONSUMPTION_COMB']]
viz.hist()
plt.show()
Now, let's plot each of these features against the Emission, to see how linear their relation is:
[ ]:

plt.scatter(cdf.FUELCONSUMPTION_COMB, cdf.CO2EMISSIONS, color='blue')
plt.xlabel("FUELCONSUMPTION_COMB")
plt.ylabel("Emission")
plt.show()
[ ]:
plt.scatter(cdf.ENGINESIZE, cdf.CO2EMISSIONS, color='blue')
plt.xlabel("Engine size")
plt.ylabel("Emission")
plt.show()

Practice
Plot CYLINDERS vs the Emission, to see how linear their relation is:
[ ]:

# write your code here

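One possible solution, a sketch that follows the same plotting pattern used for the other features above:

[ ]:

plt.scatter(cdf.CYLINDERS, cdf.CO2EMISSIONS, color='blue')
plt.xlabel("Cylinders")
plt.ylabel("Emission")
plt.show()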

Creating train and test dataset

Train/Test Split involves splitting the dataset into training and testing sets respectively, which are mutually
exclusive. After which, you train with the training set and test with the testing set. This will provide a more
accurate evaluation of out-of-sample accuracy because the testing dataset is not part of the dataset that
has been used to train the model. It is more realistic for real-world problems.

This means that we know the outcome of each data point in this dataset, making it great to test with! And
since this data has not been used to train the model, the model has no knowledge of the outcome of these
data points. So, in essence, it is truly an out-of-sample testing.

Let's split our dataset into train and test sets: 80% of the entire data for training, and 20% for testing.
We create a mask to select random rows using the np.random.rand() function:
[ ]:

msk = np.random.rand(len(df)) < 0.8


train = cdf[msk]
test = cdf[~msk]

Simple Regression Model


Linear Regression fits a linear model with coefficients B = (B1, ..., Bn) to minimize the 'residual sum of
squares' between the actual value y in the dataset, and the predicted value yhat using linear approximation.

Train data distribution


[ ]:

plt.scatter(train.ENGINESIZE, train.CO2EMISSIONS, color='blue')
plt.xlabel("Engine size")
plt.ylabel("Emission")
plt.show()

Modeling

Using sklearn package to model data.


[ ]:

from sklearn import linear_model


regr = linear_model.LinearRegression()
train_x = np.asanyarray(train[['ENGINESIZE']])
train_y = np.asanyarray(train[['CO2EMISSIONS']])
regr.fit(train_x, train_y)
# The coefficients
print ('Coefficients: ', regr.coef_)
print ('Intercept: ',regr.intercept_)
As mentioned before, Coefficient and Intercept in the simple linear regression, are the parameters of the
fit line. Given that it is a simple linear regression, with only 2 parameters, and knowing that the parameters
are the intercept and slope of the line, sklearn can estimate them directly from our data. Notice that all of
the data must be available to traverse and calculate the parameters.

Plot outputs

We can plot the fit line over the data:


[ ]:

plt.scatter(train.ENGINESIZE, train.CO2EMISSIONS, color='blue')
plt.plot(train_x, regr.coef_[0][0]*train_x + regr.intercept_[0], '-r')
plt.xlabel("Engine size")
plt.ylabel("Emission")

Evaluation
We compare the actual values and predicted values to calculate the accuracy of a regression model.
Evaluation metrics play a key role in the development of a model, as they provide insight into areas that
require improvement.

There are different model evaluation metrics; let's use MSE here to calculate the accuracy of our model
based on the test set:

- Mean Absolute Error: It is the mean of the absolute value of the errors. This is the easiest of the
metrics to understand since it's just the average error.
- Mean Squared Error (MSE): Mean Squared Error (MSE) is the mean of the squared error. It's
more popular than Mean Absolute Error because the focus is geared more towards large errors,
since the squared term increases the penalty for larger errors relative to smaller ones.
- Root Mean Squared Error (RMSE): the square root of the MSE.
- R-squared is not an error, but is a popular metric for the accuracy of your model. It represents how
close the data are to the fitted regression line. The higher the R-squared, the better the model fits
your data. The best possible score is 1.0 and it can be negative (because the model can be arbitrarily
worse).
[ ]:

from sklearn.metrics import r2_score

test_x = np.asanyarray(test[['ENGINESIZE']])
test_y = np.asanyarray(test[['CO2EMISSIONS']])
test_y_ = regr.predict(test_x)

print("Mean absolute error: %.2f" % np.mean(np.absolute(test_y_ - test_y)))
print("Residual sum of squares (MSE): %.2f" % np.mean((test_y_ - test_y) ** 2))
print("R2-score: %.2f" % r2_score(test_y , test_y_) )
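The metric list above also mentions RMSE, which this cell does not print; assuming the same test_y and test_y_ arrays from the cell above, it can be added with one more line:

[ ]:

print("Root mean squared error: %.2f" % np.sqrt(np.mean((test_y_ - test_y) ** 2)))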

Want to learn more?


IBM SPSS Modeler is a comprehensive analytics platform that has many machine learning algorithms. It
has been designed to bring predictive intelligence to decisions made by individuals, by groups, by systems
– by your enterprise as a whole. A free trial is available through this course: SPSS Modeler.

Also, you can use Watson Studio to run these notebooks faster with bigger datasets. Watson Studio is
IBM's leading cloud solution for data scientists, built by data scientists. With Jupyter notebooks, RStudio,
Apache Spark and popular libraries pre-packaged in the cloud, Watson Studio enables data scientists to
collaborate on their projects without having to install anything. Join the fast-growing community of
Watson Studio users today with a free account at Watson Studio.
Video-5
Multiple linear regression



2. Hello, and welcome.

3. In this video,

4. we'll be covering multiple linear regression.

5. As you know, there are two types of linear regression models,

6. simple regression and multiple regression.

7. Simple linear regression is when

8. one independent variable is used to estimate a dependent variable.

9. For example, predicting CO_2 emission using the variable of engine size.

10. In reality, there are multiple variables that predict the CO_2 emission.

11. When multiple independent variables are present,

12. the process is called multiple linear regression.

13. For example, predicting CO_2 emission using

14. engine size and the number of cylinders in the car's engine.

15. Our focus in this video is on multiple linear regression.

16. The good thing is that multiple linear regression

17. is the extension of the simple linear regression model.

18. So, I suggest you go through

19. the simple linear regression video first if you haven't watched it already.

20. Before we dive into a sample dataset and see how multiple linear regression works,

21. I want to tell you what kind of problems it can solve,

22. when we should use it, and specifically,

23. what kind of questions we can answer using it.

24. Basically, there are two applications for multiple linear regression.

25. First, it can be used when we would like to identify the strength of

26. the effect that the independent variables have on the dependent variable.

27. For example, does revision time, test anxiety,

28. lecture attendance and gender have any effect on exam performance of students?

29. Second, it can be used to predict the impact of changes, that is,
30. to understand how the dependent variable changes

31. when we change the independent variables.

32. For example, if we were reviewing a person's health data,

33. a multiple linear regression can tell you how much

34. that person's blood pressure goes up or down for

35. every unit increase or decrease in

36. a patient's body mass index holding other factors constant.

37. As is the case with simple linear regression,

38. multiple linear regression is a method of predicting a continuous variable.

39. It uses multiple variables called independent variables or predictors

40. that best predict the value of the target

41. variable which is also called the dependent variable.

42. In multiple linear regression,

43. the target value Y,

44. is a linear combination of independent variables X.

45. For example, you can predict how much CO_2 a car might

46. emit due to independent variables such as the car's engine size,

47. number of cylinders, and fuel consumption.

48. Multiple linear regression is very useful because you can examine

49. which variables are significant predictors of the outcome variable.

50. Also, you can find out how each feature impacts the outcome variable.

51. Again, as is the case in simple linear regression,

52. if you manage to build such a regression model,

53. you can use it to predict the emission amount of

54. an unknown case such as record number nine.

55. Generally, the model is of the form y hat equals theta zero,

56. plus theta one x_1,

57. plus theta two x_2 and so on,

58. up to theta n x_n.

59. Mathematically, we can show it as a vector form as well.


60. This means it can be shown as a dot product of two vectors;

61. the parameters vector and the feature set vector.

62. Generally, we can show the equation for a multidimensional space as theta transpose x,

63. where theta is an n by one vector of unknown parameters in a multi-dimensional space,

64. and x is the vector of the featured sets,

65. as theta is a vector of coefficients and is

66. supposed to be multiplied by x. Conventionally,

67. it is shown as transpose theta.

68. Theta is also called the parameters or weight vector of the regression equation.

69. Both these terms can be used interchangeably,

70. and x is the feature set which represents a car.

71. For example, x_1 for engine size or x_2 for cylinders, and so on.

72. The first element of the feature set would be set to one,

73. because it turns that theta zero into the intercept or bias

74. parameter when the vector is multiplied by the parameter vector.
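Collecting the narration above into one line, and using the same symbols, the model can be written as:

    y hat = theta_0 + theta_1 x_1 + theta_2 x_2 + ... + theta_n x_n = theta^T x

where theta = (theta_0, theta_1, ..., theta_n) is the parameter (weight) vector and x = (1, x_1, ..., x_n) is the feature vector, with a leading 1 so that theta_0 acts as the intercept.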

75. Please notice that theta transpose x in

76. a one-dimensional space is the equation of a line,

77. it is what we use in simple linear regression.

78. In higher dimensions when we have more than one input

79. or x the line is called a plane or a hyperplane,

80. and this is what we use for multiple linear regression.

81. So, the whole idea is to find the best fit hyperplane for our data.

82. To this end and as is the case in linear regression,

83. we should estimate the values for theta vector that

84. best predict the value of the target field in each row.

85. To achieve this goal,

86. we have to minimize the error of the prediction.

87. Now, the question is,

88. how do we find the optimized parameters?

89. To find the optimized parameters for our model,


90. we should first understand what the optimized parameters are,

91. then we will find a way to optimize the parameters.

92. In short, optimized parameters are the ones which lead to a model with the fewest errors.

93. Let's assume for a moment that we have already found the parameter vector of our
model,

94. it means we already know the values of theta vector.

95. Now we can use the model and the feature set of the first row of

96. our dataset to predict the CO_2 emission for the first car, correct?

97. If we plug the feature set values into the model equation,

98. we find y hat.

99. Let's say for example,

100. it returns 140 as the predicted value for this specific row,

101. what is the actual value?

102. Y equals 196.

103. How different is the predicted value from the actual value of 196?

104. Well, we can calculate it quite simply as 196 subtract 140,

105. which of course equals 56.

106. This is the error of our model only for one row or one car in our case.

107. As is the case in linear regression,

108. we can say the error here is the distance from

109. the data point to the fitted regression model.

110. The mean of these squared residual errors shows how poorly the model represents the data set,

111. it is called the mean squared error, or MSE.

112. Mathematically, MSE can be shown by an equation.

113. While this is not the only way to expose the error of a multiple linear regression model,

114. it is one of the most popular ways to do so.

115. The best model for our data set is the one with minimum error for all prediction values.

116. So, the objective of multiple linear regression is to minimize the MSE equation.

117. To minimize it, we should find the best parameters theta, but how?

118. Okay, how do we find the parameter or coefficients for multiple linear regression?
119. There are many ways to estimate the value of these coefficients.

120. However, the most common methods are

121. the ordinary least squares and optimization approach.

122. Ordinary least squares tries to estimate the values of

123. the coefficients by minimizing the mean square error.

124. This approach uses the data as a matrix and uses

125. linear algebra operations to estimate the optimal values for the theta.

126. The problem with this technique is the time complexity of calculating

127. matrix operations as it can take a very long time to finish.

128. When the number of rows in your data set is less than 10,000,

129. you can think of this technique as an option.

130. However, for greater values,

131. you should try other faster approaches.

132. The second option is to use an optimization algorithm to find the best parameters.

133. That is, you can use a process of optimizing the values of

134. the coefficients by iteratively minimizing the error of the model on your training data.

135. For example, you can use gradient descent which

136. starts optimization with random values for each coefficient,

137. then calculates the errors and tries to minimize them

138. by changing the coefficients over multiple iterations.

139. Gradient descent is a proper approach if you have a large data set.

140. Please understand however, that there are other approaches to estimate

141. the parameters of the multiple linear regression that you can explore on your own.
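As a rough sketch of the ordinary least squares idea described above, the closed-form "normal equation" can be written in a few lines of NumPy. The numbers below are made up for illustration and are not the course dataset.

import numpy as np

# Toy data: five cars described by [engine size, cylinders], with CO2 emissions as the target
X = np.array([[2.0, 4], [2.4, 4], [3.5, 6], [3.6, 6], [5.0, 8]])
y = np.array([196.0, 221.0, 255.0, 258.0, 320.0])

# Prepend a column of ones so the first parameter acts as the intercept (theta zero)
X_b = np.c_[np.ones((X.shape[0], 1)), X]

# Ordinary least squares: theta = (X^T X)^-1 X^T y (pinv is used for numerical safety)
theta = np.linalg.pinv(X_b.T @ X_b) @ X_b.T @ y
print(theta)  # [intercept, coefficient for engine size, coefficient for cylinders]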

142. After you find the best parameters for your model,

143. you can go to the prediction phase.

144. After we found the parameters of the linear equation,

145. making predictions is as simple as solving the equation for a specific set of inputs.

146. Imagine we are predicting CO_2 emission or Y

147. from other variables for the automobile in record number nine.

148. Our linear regression model representation for


149. this problem would be y hat equals theta transpose x.

150. Once we find the parameters,

151. we can plug them into the equation of the linear model.

152. For example, let's use theta zero equals 125,

153. theta one equals 6.2,

154. theta two equals 14, and so on.

155. If we map it to our data set,

156. we can rewrite the linear model as CO_2 emissions equals

157. 125 plus 6.2 multiplied by engine size,

158. plus 14 multiplied by cylinder, and so on.

159. As you can see, multiple linear regression

160. estimates the relative importance of predictors.

161. For example, it shows cylinder has higher impact

162. on CO_2 emission amounts in comparison with engine size.

163. Now, let's plug in the ninth row of our data set and calculate

164. the CO_2 emission for a car with the engine size of 2.4.

165. So, CO_2 emission equals 125 plus 6.2 times 2.4,

166. plus 14 times four, and so on.

167. We can predict the CO_2 emission for this specific car would be 214.1.

168. Now, let me address some concerns that you might

169. already be having regarding multiple linear regression.

170. As you saw, you can use

171. multiple independent variables to predict a target value in multiple linear regression.

172. It sometimes results in a better model compared to using

173. a simple linear regression which uses

174. only one independent variable to predict the dependent variable.

175. Now the question is, how

176. many independent variables should we use for the prediction?

177. Should we use all the fields in our data set?

178. Does adding independent variables to


179. a multiple linear regression model always increase the accuracy of the model?

180. Basically, adding too many independent variables without

181. any theoretical justification may result in an overfit model.

182. An overfit model is a real problem because it is too

183. complicated for your data set and not general enough to be used for prediction.

184. So, it is recommended to avoid using many variables for prediction.

185. There are different ways to avoid overfitting a model in regression,

186. however that is outside the scope of this video.

187. The next question is,

188. should independent variables be continuous?

189. Basically, categorical independent variables can be incorporated

190. into a regression model by converting them into numerical variables.

191. For example, given a binary variable such as car type,

192. we can code it as a dummy variable: zero for manual and one for automatic cars.

193. As a last point,

194. remember that multiple linear regression is a specific type of linear regression.

195. So, there needs to be a linear relationship between

196. the dependent variable and each of your independent variables.

197. There are a number of ways to check for linear relationship.

198. For example, you can use scatter plots and then visually check for linearity.

199. If the relationship displayed in your scatter plot is not linear,

200. then you need to use non-linear regression.

201. This concludes our video. Thanks for watching.
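For reference, here is a minimal scikit-learn sketch of the multiple linear regression described in this video, in the same style as the simple-regression lab above. It assumes the same train/test dataframes and column names as that lab (ENGINESIZE, CYLINDERS, FUELCONSUMPTION_COMB, CO2EMISSIONS); adjust the names if your copy of the data differs.

[ ]:

import numpy as np
from sklearn import linear_model

regr = linear_model.LinearRegression()
x = np.asanyarray(train[['ENGINESIZE', 'CYLINDERS', 'FUELCONSUMPTION_COMB']])
y = np.asanyarray(train[['CO2EMISSIONS']])
regr.fit(x, y)
print('Coefficients: ', regr.coef_)

# Evaluate on the test set
x_test = np.asanyarray(test[['ENGINESIZE', 'CYLINDERS', 'FUELCONSUMPTION_COMB']])
y_test = np.asanyarray(test[['CO2EMISSIONS']])
y_hat = regr.predict(x_test)
print("Mean Squared Error (MSE): %.2f" % np.mean((y_hat - y_test) ** 2))
print("Variance score: %.2f" % regr.score(x_test, y_test))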

Video-6
Non-linear regression

1. Hello and welcome.

2. In this video,

3. we'll be covering non-linear regression basics. So, let's get started.


4. These data points correspond to China's gross domestic product or GDP from 1960-
2014.

5. The first column is the years and the second is

6. China's corresponding annual gross domestic income in US dollars for that year.

7. This is what the data points look like.

8. Now, we have a couple of interesting questions.

9. First, can GDP be predicted based on time?

10. Second, can we use a simple linear regression to model it?

11. Indeed. If the data shows a curvy trend,

12. then linear regression would not produce

13. very accurate results when compared to a non-linear regression.

14. Simply because, as the name implies,

15. linear regression presumes that the data is linear.

16. The scatter plot shows that there seems to be a strong relationship between GDP and
time,

17. but the relationship is not linear.

18. As you can see, the growth starts off slowly,

19. then from 2005 onward,

20. the growth is very significant.

21. Finally, it decelerates slightly in the 2010s.

22. It looks like either a logistical or exponential function.

23. So, it requires a special estimation method of the non-linear regression procedure.

24. For example, if we assume that the model for these data points are exponential
functions,

25. such as Y hat equals Theta zero plus Theta

26. one times Theta two raised to the power of X,

27. our job is to estimate the parameters of the model, i.e., Thetas,

28. and use the fitted model to predict GDP for unknown or future cases.

29. In fact, many different regressions exists

30. that can be used to fit whatever the dataset looks like.

31. You can see a quadratic and cubic regression lines here,
32. and it can go on and on to infinite degrees.

33. In essence, we can call all of these polynomial regression,

34. where the relationship between the independent variable X and

35. the dependent variable Y is modeled as an Nth degree polynomial in X.

36. With many types of regression to choose from,

37. there's a good chance that one will fit your dataset well.

38. Remember, it's important to pick a regression that fits the data the best.

39. So, what is polynomial regression?

40. Polynomial regression fits a curved line to your data.

41. A simple example of polynomial with degree three is shown as Y hat equals Theta zero

42. plus Theta 1_X plus Theta 2_X squared plus Theta 3_X cubed or to the power of three,

43. where Thetas are parameters to be estimated that

44. makes the model fit perfectly to the underlying data.

45. Though the relationship between X and Y is

46. non-linear here, and polynomial regression can fit it,

47. a polynomial regression model can still be expressed as linear regression.

48. I know it's a bit confusing,

49. but let's look at an example.

50. Given the third degree polynomial equation,

51. by defining X_1 equals X and X_2 equals X squared or X to the power of two and so on,

52. the model is converted to a simple linear regression with new variables as Y hat equals

53. Theta zero plus Theta one X_1 plus Theta two X_2 plus Theta three X_3.

54. This model is linear in the parameters to be estimated, right?

55. Therefore, this polynomial regression is considered to

56. be a special case of traditional multiple linear regression.

57. So, you can use the same mechanism as linear regression to solve such a problem.

58. Therefore, polynomial regression models can be fit using the method of least squares.

59. Least squares is a method for estimating

60. the unknown parameters in a linear regression model by minimizing the sum of

61. the squares of the differences between


62. the observed dependent variable in

63. the given dataset and those predicted by the linear function.
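A minimal sketch of this "polynomial regression as linear regression" idea with scikit-learn. The data below is generated just for illustration: PolynomialFeatures builds the x, x squared, and x cubed columns, and an ordinary LinearRegression is then fit on them.

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Made-up one-feature data that roughly follows a cubic trend
x = np.arange(1, 11, dtype=float).reshape(-1, 1)
y = 2 + 0.5 * x**3 - 3 * x + np.random.normal(0, 5, size=x.shape)

# Transform x into [1, x, x^2, x^3]; the model stays linear in its parameters
poly = PolynomialFeatures(degree=3)
x_poly = poly.fit_transform(x)

model = LinearRegression().fit(x_poly, y)
print(model.intercept_, model.coef_)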

64. So, what is non-linear regression exactly?

65. First, non-linear regression is a method to model

66. a non-linear relationship between

67. the dependent variable and a set of independent variables.

68. Second, for a model to be considered non-linear,

69. Y hat must be a non-linear function of the parameters Theta,

70. not necessarily the features X.

71. When it comes to non-linear equation,

72. it can be the shape of exponential,

73. logarithmic, and logistic, or many other types.

74. As you can see in all of these equations,

75. the change of Y hat depends on changes in the parameters Theta,

76. not necessarily on X only.

77. That is, in non-linear regression,

78. a model is non-linear by parameters.

79. In contrast to linear regression,

80. we cannot use the ordinary least squares method to fit the data in non-linear regression.

81. In general, estimation of the parameters is not easy.
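As a rough illustration (not the exact steps of the course lab), SciPy's curve_fit can estimate such parameters numerically. The sigmoid function and the data below are placeholders chosen only to show the call.

import numpy as np
from scipy.optimize import curve_fit

# A logistic (sigmoid) model: non-linear in the parameters beta_1 and beta_2
def sigmoid(x, beta_1, beta_2):
    return 1.0 / (1.0 + np.exp(-beta_1 * (x - beta_2)))

# Made-up, normalized x/y values standing in for year and GDP
x_data = np.linspace(0, 1, 20)
y_data = sigmoid(x_data, 10, 0.5) + np.random.normal(0, 0.02, x_data.size)

# curve_fit returns the estimated parameters and their covariance
popt, pcov = curve_fit(sigmoid, x_data, y_data)
print("beta_1 = %.2f, beta_2 = %.2f" % tuple(popt))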

82. Let me answer two important questions here.

83. First, how can I know if a problem is linear or non-linear in an easy way?

84. To answer this question,

85. we have to do two things.

86. The first is to visually figure out if the relation is linear or non-linear.

87. It's best to plot bivariate plots of output variables with each input variable.

88. Also, you can calculate the correlation coefficient

89. between independent and dependent variables, and if,

90. for all variables, it is 0.7 or higher,

91. there is a linear tendency and thus,


92. it's not appropriate to fit a non-linear regression.

93. The second thing we have to do is to use non-linear regression instead of

94. linear regression when we cannot

95. accurately model the relationship with linear parameters.

96. The second important question is,

97. how should I model my data if it displays a non-linear relationship on a scatter plot?

98. Well, to address this,

99. you have to use either a polynomial regression,

100. use a non-linear regression model,

101. or transform your data,

102. which is not in scope for this course.

103. Thanks for watching.

Lab 2,3,4

Module-3
Classification

Video1



2. Hello, in this video,

3. we'll give you an introduction to classification.

4. So let's get started.

5. In machine learning classification is

6. a supervised learning approach which can be thought of as

7. a means of categorizing or classifying some unknown items into a discrete set of classes.

8. Classification attempts to learn the relationship between a set

9. of feature variables and a target variable of interest.

10. The target attribute in classification is a categorical variable with discrete values.

11. So, how does classification and classifiers work?

12. Given a set of training data points along with the target labels,

13. classification determines the class label for an unlabeled test case.
14. Let's explain this with an example.

15. A good sample of classification is the loan default prediction.

16. Suppose a bank is concerned about the potential for loans not to be repaid.

17. If previous loan default data can be used to predict

18. which customers are likely to have problems repaying loans,

19. these bad risk customers can either have

20. their loan application declined or offered alternative products.

21. The goal of a loan default predictor is to use

22. existing loan default data which has information about the customers such as age,
income,

23. education et cetera, to build a classifier,

24. pass a new customer or potential future default to the model,

25. and then label it, i.e., label the data points as defaulter or not defaulter.

26. Or for example zero or one.

27. This is how a classifier predicts an unlabeled test case.

28. Please notice that this specific example was about a binary classifier with two values.

29. We can also build classifier models for

30. both binary classification and multi-class classification.

31. For example, imagine that you've collected data about a set of patients,

32. all of whom suffered from the same illness.

33. During their course of treatment,

34. each patient responded to one of three medications.

35. You can use this labeled dataset with

36. a classification algorithm to build a classification model.

37. Then you can use it to find out which drug might be

38. appropriate for a future patient with the same illness.

39. As you can see, it is a sample of multi-class classification.

40. Classification has different business use cases as well.

41. For example, to predict the category to which a customer belongs,

42. for churn detection where we predict whether


43. a customer switches to another provider or brand,

44. or to predict whether or not a customer responds to a particular advertising campaign.

45. Data classification has several applications in a wide variety of industries.

46. Essentially, many problems can be expressed as

47. associations between feature and target variables,

48. especially when labelled data is available.

49. This provides a broad range of applicability for classification.

50. For example, classification can be used for email filtering, speech recognition,

51. handwriting recognition, biometric identification,

52. document classification and much more.

53. Here we have the types of classification algorithms and machine learning.

54. They include decision trees,

55. naive bayes, linear discriminant analysis,

56. k-nearest neighbor, logistic regression,

57. neural networks, and support vector machines.

58. There are many types of classification algorithms.

59. We will only cover a few in this course. Thanks for watching.

Video



2. Hello and welcome.

3. In this video,

4. we'll be covering the K-Nearest Neighbors algorithm.

5. So, let's get started.

6. Imagine that a telecommunications provider has

7. segmented its customer base by service usage patterns,

8. categorizing the customers into four groups.

9. If demographic data can be used to predict group membership, the

10. company can customize offers for individual prospective customers.

11. This is a classification problem.


12. That is, given the dataset with predefined labels,

13. we need to build a model to be used to predict the class of a new or unknown case.

14. The example focuses on using demographic data, such as

15. region, age, and marital status to predict usage patterns.

16. The target field called custcat has

17. four possible values that correspond to the four customer groups as follows:

18. Basic Service, E Service,

19. Plus Service, and Total Service.

20. Our objective is to build a classifier.

21. For example, using rows zero to seven to predict the class of row eight.

22. We will use a specific type of classification called K-Nearest Neighbor.

23. Just for sake of demonstration,

24. let's use only two fields as predictors specifically,

25. age and income, and then

26. plot the customers based on their group membership.

27. Now, let's say that we have a new customer.

28. For example, record number eight,

29. with a known age and income.

30. How can we find the class of this customer?

31. Can we find one of the closest cases and assign the same class label to our new
customer?

32. Can we also say that the class of

33. our new customer is most probably group four, i.e., Total Service,

34. because its nearest neighbor is also of class four?

35. Yes, we can. In fact,

36. it is the first nearest neighbor.

37. Now, the question is,

38. to what extent can we trust our judgment which is based on the first nearest neighbor?

39. It might be a poor judgment especially

40. if the first nearest neighbor is a very specific case or an outlier, correct?
41. Now, let's look at our scatter plot again.

42. Rather than choose the first nearest neighbor,

43. what if we chose the five nearest neighbors and did

44. a majority vote among them to define the class of our new customer?

45. In this case, we'd see that

46. three out of five nearest neighbors tell us to go for class three,

47. which is Plus Service.

48. Doesn't this make more sense?

49. Yes. In fact, it does.

50. In this case, the value of K in the K-Nearest Neighbors algorithm is five.

51. This example highlights the intuition behind the K-Nearest Neighbors algorithm.

52. Now, let's define the K Nearest Neighbors.

53. The K-Nearest Neighbors algorithm is a classification algorithm that

54. takes a bunch of labeled points and uses them to learn how to label other points.

55. This algorithm classifies cases based on their similarity to other cases.

56. In K-Nearest Neighbors, data points that are near each other are said to be neighbors.

57. K-Nearest Neighbors is based on this paradigm.

58. Similar cases with the same class labels are near each other.

59. Thus, the distance between two cases is a measure of their dissimilarity.

60. There are different ways to calculate the similarity or conversely,

61. the distance or dissimilarity of two data points.

62. For example, this can be done using Euclidean distance.

63. Now, let's see how the K-Nearest Neighbors algorithm actually works.

64. In a classification problem,

65. the K-Nearest Neighbors algorithm works as follows.

66. One, pick a value for K. Two,

67. calculate the distance from the new case hold out from each of the cases in the dataset.

68. Three, search for the K-observations in

69. the training data that are nearest to the measurements of the unknown data point.

70. And four, predict the response of the unknown data point
71. using the most popular response value from the K-Nearest Neighbors.

72. There are two parts in this algorithm that might be a bit confusing.

73. First, how to select the correct K and second,

74. how to compute the similarity between cases,

75. for example, among customers.

76. Let's first start with the second concern.

77. That is, how can we calculate the similarity between two data points?

78. Assume that we have two customers,

79. customer one and customer two,

80. and for a moment, assume that these two customers have only one feature,

81. age. We can easily use a specific type of

82. Minkowski distance to calculate the distance of these two customers,

83. it is indeed the Euclidean distance.

84. Distance of X_1 from X_2 is root of 34 minus 30 to power of two, which is four.

85. What about if we have more than one feature?

86. For example, age and income.

87. If we have income and age for each customer,

88. we can still use the same formula but this time,

89. we're using it in a two dimensional space.

90. We can also use the same distance metric for multidimensional vectors.

91. Of course, we have to normalize our feature

92. set to get the accurate dissimilarity measure.

93. There are other dissimilarity measures as well that

94. can be used for this purpose but as mentioned,

95. it is highly dependent on datatype and

96. also the domain that classification is done for it.
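A small illustration of the Euclidean distance and normalization just described; the two "customers" and the min/max values used for scaling are made up.

import numpy as np

# Two made-up customers described by [age, income]
customer_1 = np.array([34.0, 190.0])
customer_2 = np.array([30.0, 200.0])

# Scale each feature to the 0-1 range first, otherwise income would dominate the distance
feature_min = np.array([18.0, 50.0])
feature_range = np.array([70.0 - 18.0, 400.0 - 50.0])
c1 = (customer_1 - feature_min) / feature_range
c2 = (customer_2 - feature_min) / feature_range

# Euclidean distance: square root of the sum of squared differences
print(np.sqrt(np.sum((c1 - c2) ** 2)))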

97. As mentioned, K in K-Nearest Neighbors is the number of nearest neighbors to
examine.

98. It is supposed to be specified by the user.

99. So, how do we choose the right K?


100. Assume that we want to find the class of

101. the customer noted as question mark on the chart.

102. What happens if we choose a very low value of K?

103. Let's say, K equals one.

104. The first nearest point would be blue,

105. which is class one.

106. This would be a bad prediction,

107. since more of the points around it are magenta or class four.

108. In fact, since its nearest neighbor is blue we can say that we capture

109. the noise in the data or we chose one of the points that was an anomaly in the data.

110. A low value of K causes a highly complex model as well,

111. which might result in overfitting of the model.

112. It means the prediction process is not

113. generalized enough to be used for out-of-sample cases.

114. Out-of-sample data is data that is outside of the data set used to train the model.

115. In other words, it cannot be trusted to be used for prediction of unknown samples.

116. It's important to remember that overfitting is bad,

117. as we want a general model that works for any data,

118. not just the data used for training.

119. Now, on the opposite side of the spectrum,

120. if we choose a very high value of K such as K equals 20,

121. then the model becomes overly generalized.

122. So, how can we find the best value for K?

123. The general solution is to reserve a part of

124. your data for testing the accuracy of the model.

125. Once you've done so,

126. choose K equals one and then use the training part for modeling

127. and calculate the accuracy of prediction using all samples in your test set.

128. Repeat this process increasing the K and see which K is best for your model.

129. For example, in our case,


130. K equals four will give us the best accuracy.
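A minimal sketch of this "try several values of K and keep the most accurate one" idea with scikit-learn. The data here is randomly generated in place of the telecom dataset, so the best K will differ from the value quoted above.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn import metrics

# Stand-in data with four classes, in place of the custcat dataset
X, y = make_classification(n_samples=400, n_features=5, n_informative=4,
                           n_redundant=0, n_classes=4, random_state=4)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=4)

# Train a KNN model for each K and report its accuracy on the held-out test set
for k in range(1, 11):
    neigh = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    y_hat = neigh.predict(X_test)
    print("K = %2d, accuracy = %.3f" % (k, metrics.accuracy_score(y_test, y_hat)))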

131. Nearest neighbors analysis can also be used to compute values for a continuous target.

132. In this situation, the average or median target value of

133. the nearest neighbors is used to obtain the predicted value for the new case.

134. For example, assume that you are predicting the price of a home based on its feature
set,

135. such as number of rooms,

136. square footage, the year it was built, and so on.

137. You can easily find the three nearest neighbor houses

138. of course not only based on distance but

139. also based on all the attributes and then

140. predict the price of the house as the median of its neighbors.

141. This concludes this video.

142. Thanks for watching

Video



2. Hello, and welcome!

3. In this video, we’ll be covering evaluation metrics for classifiers.

4. So let’s get started.

5. Evaluation metrics explain the performance of a model.

6. Let’s talk more about the model evaluation metrics that are used for classification.

7. Imagine that we have an historical dataset which shows the customer churn for a
telecommunication

8. company.

9. We have trained the model, and now we want to calculate its accuracy using the test set.

10. We pass the test set to our model, and we find the predicted labels.

11. Now the question is, “How accurate is this model?”

12. Basically, we compare the actual values in the test set with the values predicted by

13. the model, to calculate the accuracy of the model.


14. Evaluation metrics play a key role in the development of a model, as they provide insight

15. into areas that might require improvement.

16. There are different model evaluation metrics but we just talk about three of them here,

17. specifically: Jaccard index, F1-score, and Log Loss.

18. Let’s first look at one of the simplest accuracy measurements, the Jaccard index -- also

19. known as the Jaccard similarity coefficient.

20. Let’s say y shows the true labels of the churn dataset.

21. And y-hat shows the predicted values by our classifier.

22. Then we can define Jaccard as the size of the intersection divided by the size of the

23. union of two label sets.

24. For example, for a test set of size 10, with 8 correct predictions, or 8 intersections,

25. the accuracy by the Jaccard index would be 0.66.
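Written out with the numbers from this example (8 correct predictions, both label sets of size 10):

    J(y, y-hat) = (size of intersection) / (size of union) = 8 / (10 + 10 - 8) = 8 / 12 ≈ 0.66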

26. If the entire set of predicted labels for a sample strictly matches with the true set

27. of labels, then the subset accuracy is 1.0; otherwise it is 0.0.

28. Another way of looking at accuracy of classifiers is to look at a confusion matrix.

29. For example, let’s assume that our test set has only 40 rows.

30. This matrix shows the correct and wrong predictions, in comparison with the actual

31. labels.

32. Each confusion matrix row shows the Actual/True labels in the test set, and the columns
show

33. the predicted labels by classifier.

34. Let's look at the first row.

35. The first row is for customers whose actual churn value in the test set is 1.

36. As you can calculate, out of 40 customers, the churn value of 15 of them is 1.

37. And out of these 15, the classifier correctly predicted 6 of them as 1, and 9 of them as 0.

38. This means that for 6 customers, the actual churn value was 1, in the test set, and the

39. classifier also correctly predicted those as 1.

40. However, while the actual label of 9 customers was 1, the classifier predicted those as 0,

41. which is not very good.

42. We can consider this as an error of the model for the first row.
43. What about the customers with a churn value 0?

44. Let’s look at the second row.

45. It looks like there were 25 customers whose churn value was 0.

46. The classifier correctly predicted 24 of them as 0, and wrongly predicted one of them as
1.

47. So, it has done a good job in predicting the customers with a churn value of 0.

48. A good thing about the confusion matrix is that it shows the model’s ability to correctly

49. predict or separate the classes.

50. In the specific case of a binary classifier, such as this example, we can interpret these

51. numbers as the count of true positives, false negatives, true negatives, and false
positives.

52. Based on the count of each section, we can calculate the precision and recall of each

53. label.

54. Precision is a measure of the accuracy, provided that a class label has been predicted.

55. It is defined by: precision = True Positive / (True Positive + False Positive).

56. And Recall is the true positive rate.

57. It is defined as: Recall = True Positive / (True Positive + False Negative).

58. So, we can calculate the precision and recall of each class.

59. Now we’re in the position to calculate the F1 scores for each label, based on the
precision

60. and recall of that label.

61. The F1 score is the harmonic average of the precision and recall, where an F1 score
reaches

62. its best value at 1 (which represents perfect precision and recall) and its worst at 0.

63. It is a good way to show that a classifier has a good value for both recall and precision.

64. It is defined using the F1-score equation.
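For reference, the equation and the class-1 numbers from the confusion matrix above (6 true positives, 1 false positive, 9 false negatives):

    F1 = 2 x (precision x recall) / (precision + recall)
    precision = 6 / (6 + 1) ≈ 0.86, recall = 6 / (6 + 9) = 0.40, so F1 ≈ 2 x 0.86 x 0.40 / (0.86 + 0.40) ≈ 0.55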

65. For example, the F1-score for class 0 (i.e. churn=0), is 0.83, and the F1-score for class 1

66. (i.e. churn=1), is 0.55.

67. And finally, we can tell the average accuracy for this classifier is the average of the

68. F1-score for both labels, which is 0.72 in our case.

69. Please notice that both Jaccard and F1-score can be used for multi-class classifiers as
70. well, which is out of scope for this course.

71. Now let's look at another accuracy metric for classifiers.

72. Sometimes, the output of a classifier is the probability of a class label, instead of the

73. label.

74. For example, in logistic regression, the output can be the probability of customer churn,

75. i.e., yes (or equals to 1).

76. This probability is a value between 0 and 1.

77. Logarithmic loss (also known as Log loss) measures the performance of a classifier
where

78. the predicted output is a probability value between 0 and 1.

79. So, for example, predicting a probability of 0.13 when the actual label is 1, would

80. be bad and would result in a high log loss.

81. We can calculate the log loss for each row using the log loss equation, which measures

82. how far each prediction is, from the actual label.

83. Then, we calculate the average log loss across all rows of the test set.

84. It is obvious that ideal classifiers have progressively smaller values of log loss.

85. So, the classifier with lower log loss has better accuracy.
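A small sketch of how these three metrics are typically computed with scikit-learn; the label and probability arrays below are made up rather than taken from the churn test set.

import numpy as np
from sklearn.metrics import jaccard_score, f1_score, log_loss

# Made-up true labels, predicted labels, and predicted probabilities of class 1
y_true = np.array([1, 1, 0, 0, 1, 0, 0, 1, 0, 0])
y_pred = np.array([1, 0, 0, 0, 1, 0, 1, 1, 0, 0])
y_prob = np.array([0.9, 0.4, 0.2, 0.1, 0.8, 0.3, 0.6, 0.7, 0.2, 0.1])

print("Jaccard :", jaccard_score(y_true, y_pred))
print("F1-score:", f1_score(y_true, y_pred, average='weighted'))
print("Log loss:", log_loss(y_true, y_prob))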

86. Thanks for watching!

Video

1. Hello and welcome.

2. In this video,

3. we're going to introduce and examine decision trees.

4. So let's get started.

5. What exactly is a decision tree?

6. How do we use them to help us classify?

7. How can I grow my own decision tree?

8. These may be some of the questions that you have in

9. mind from hearing the term decision tree.

10. Hopefully, you'll soon be able to answer


11. these questions and many more by watching this video.

12. Imagine that you're a medical researcher compiling data for a study.

13. You've already collected data about a set of

14. patients all of whom suffered from the same illness.

15. During their course of treatment,

16. each patient responded to one of two medications.

17. We call them drug A and drug B.

18. Part of your job is to build a model to find out which drug

19. might be appropriate for a future patient with the same illness.

20. The feature sets of this dataset are age, gender,

21. blood pressure, and cholesterol of our group of

22. patients and the target is the drug that each patient responded to.

23. It is a sample of binary classifiers, and you can

24. use the training part of the data set to build a decision tree

25. and then use it to predict the class of an unknown patient.

26. In essence, to come up with a decision on which drug to prescribe to a new patient.

27. Let's see how a decision tree is built for this dataset.

28. Decision trees are built by splitting the training set into distinct nodes,

29. where one node contains all of or most of one category of the data.

30. If we look at the diagram here,

31. we can see that it's a patient's classifier.

32. So as mentioned, we want to prescribe a drug to a new patient,

33. but the decision to choose drug A or B will be influenced by the patient's situation.

34. We start with age,

35. which can be young, middle aged or senior.

36. If the patient is middle aged,

37. then we'll definitely go for drug B.

38. On the other hand, if we have a young or a senior patient,

39. will need more details to help us determine which drug to prescribe.

40. The additional decision variables can be things such as cholesterol levels,
41. gender or blood pressure.

42. For example, if the patient is female,

43. then we will recommend drug A,

44. but if the patient is male,

45. then will go for drug B.

46. As you can see,

47. decision trees are about testing an attribute and

48. branching the cases based on the result of the test.

49. Each internal node corresponds to a test,

50. and each branch corresponds to a result of the test,

51. and each leaf node assigns a patient to a class.

52. Now the question is,

53. how can we build such a decision tree?

54. Here is the way that a decision tree is built.

55. A decision tree can be constructed by considering the attributes one by one.

56. First, choose an attribute from our dataset.

57. Calculate the significance of the attribute in the splitting of the data.

58. In the next video,

59. we will explain how to calculate the significance of

60. an attribute to see if it's an effective attribute or not.

61. Next, split the data based on the value of the best attribute,

62. then go to each branch and repeat it for the rest of the attributes.

63. After building this tree,

64. you can use it to predict the class of unknown cases; or in our case,

65. the proper drug for a new patient based on his or her characteristics.

66. This concludes this video.

67. Thanks for watching.

Video



2. Hello and welcome. In this video,

3. we'll be covering the process of building decision trees.

4. So, let's get started.

5. Consider the drug data set again.

6. The question is, how do we build a decision tree based on that data set?

7. Decision trees are built using recursive partitioning to classify the data.

8. Let's say we have 14 patients in our data set,

9. the algorithm chooses the most predictive feature to split the data on.

10. What is important in making a decision tree,

11. is to determine which attribute is the best or more

12. predictive to split data based on the feature.

13. Let's say we pick cholesterol as the first attribute to split data,

14. it will split our data into two branches.

15. As you can see,

16. if the patient has high cholesterol we cannot say

17. with high confidence that drug B might be suitable for him.

18. Also, if the patient's cholesterol is normal,

19. we still don't have sufficient evidence or information to

20. determine if either drug A or drug B is in fact suitable.

21. It is a sample of bad attribute selection for splitting data.

22. So, let's try another attribute.

23. Again, we have our 14 cases,

24. this time we picked the sex attribute of patients.

25. It will split our data into two branches, male and female.

26. As you can see, if the patient is female,

27. we can say drug B might be suitable for her with high certainty.

28. But if the patient is male,

29. we don't have sufficient evidence or

30. information to determine if drug A or drug B is suitable.

31. However, it is still a better choice in comparison with


32. the cholesterol attribute because the resulting nodes are more pure.

33. It means nodes that are either mostly drug A or drug B.

34. So, we can say the sex attribute is more significant than cholesterol,

35. or in other words it's more predictive than the other attributes.

36. Indeed, predictiveness is based on decrease in impurity of nodes.

37. We're looking for the best feature to decrease the impurity of patients in the leaves,

38. after splitting them up based on that feature.

39. So, the sex feature is a good candidate in

40. the following case because it results in nodes that are almost pure.

41. Let's go one step further.

42. For the male patient branch,

43. we again test other attributes to split the sub-tree.

44. We test cholesterol again here,

45. as you can see it results in even more pure leaves.

46. So we can easily make a decision here.

47. For example, if a patient is male and his cholesterol is high,

48. we can certainly prescribe drug A,

49. but if it is normal,

50. we can prescribe drug B with high confidence.

51. As you might notice,

52. the choice of attribute to split data is very

53. important and it is all about purity of the leaves after the split.

54. A node in the tree is considered pure if 100 percent of the cases

55. in that node fall into a specific category of the target field.

56. In fact, the method uses recursive partitioning to split

57. the training records into segments by minimizing the impurity at each step.

58. Impurity of nodes is calculated by entropy of data in the node.

59. So, what is entropy?

60. Entropy is the amount of information disorder or the amount of randomness in the data.

61. The entropy in the node depends on


62. how much random data is in that node and is calculated for each node.

63. In decision trees, we're looking for trees that have the smallest entropy in their nodes.

64. The entropy is used to calculate the homogeneity of the samples in that node.

65. If the samples are completely homogeneous,

66. the entropy is zero and if the samples are equally divided it has an entropy of one.

67. This means if all the data in a node are either drug A or drug B,

68. then the entropy is zero,

69. but if half of the data are drug A and other half are B then the entropy is one.

70. You can easily calculate the entropy of a node using the frequency table of

71. the attribute through the entropy formula where

72. P is for the proportion or ratio of a category,

73. such as drug A or B.

74. Please remember though that you don't have to calculate these as

75. it's easily calculated by the libraries or packages that you use.

76. As an example, let's calculate the entropy of the data set before splitting it.

77. We have nine occurrences of drug B and five of drug A.

78. You can embed these numbers into the entropy formula to

79. calculate the impurity of the target attribute before splitting it.

80. In this case, it is 0.94.
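Written out with those counts (9 of drug B and 5 of drug A out of 14 patients):

    Entropy = -p(B) log2 p(B) - p(A) log2 p(A)
            = -(9/14) log2(9/14) - (5/14) log2(5/14) ≈ 0.94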

81. So, what is entropy after splitting?

82. Now, we can test different attributes to find the one with the most predictiveness,

83. which results in two more pure branches.

84. Let's first select the cholesterol of the patient and

85. see how the data gets split based on its values.

86. For example, when it is normal we have six for drug B,

87. and two for drug A.

88. We can calculate the entropy of this node based on

89. the distribution of drug A and B, which is 0.81 in this case.

90. But, when cholesterol is high,

91. the data is split into three for drug B and three for drug A.
92. Calculating its entropy, we can see it would be 1.0.

93. We should go through all the attributes and calculate the entropy

94. after the split and then choose the best attribute.

95. Okay. Let's try another field.

96. Let's choose the sex attribute for the next check.

97. As you can see, when we use the sex attribute to split the data,

98. when its value is female,

99. we have three patients that responded to

100. drug B and four patients that responded to drug A.

101. The entropy for this node is 0.98 which is not very promising.

102. However, on the other side of the branch,

103. when the value of the sex attribute is male,

104. the result is more pure, with six for drug B and only one for drug A.

105. The entropy for this group is 0.59.

106. Now, the question is between

107. the cholesterol and sex attributes which one is a better choice?

108. Which one is better as the first attribute to divide the dataset into two branches?

109. Or in other words,

110. which attribute results in more pure nodes for our drugs?

111. Or in which tree do we have less entropy after splitting rather than before splitting?

112. The sex attribute with entropy of 0.98 and 0.59 or

113. the cholesterol attribute with entropy of 0.81 and 1.0 in its branches.

114. The answer is the tree with the higher information gain after splitting.

115. So, what is information gain?

116. Information gain is the information that can

117. increase the level of certainty after splitting.

118. It is the entropy of a tree before the split

119. minus the weighted entropy after the split by an attribute.

120. We can think of information gain and entropy as opposites.

121. As entropy or the amount of randomness decreases,


122. the information gain or amount of certainty increases and vice versa.

123. So, constructing a decision tree is all about

124. finding attributes that return the highest information gain.

125. Let's see how information gain is calculated for the sex attribute.

126. As mentioned, the information gained is the entropy of the tree

127. before the split minus the weighted entropy after the split.

128. The entropy of the tree before the split is 0.94,

129. the portion of female patients is seven out of 14 and its entropy is 0.985.

130. Also, the portion of men is seven out of 14 and the entropy of the male node is 0.592.

131. The result in the square brackets here is the weighted entropy after the split.

132. So, the information gain of the tree if we use

133. the sex attribute to split the data set is 0.151.
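Putting those numbers into one line:

    Gain(sex) = 0.94 - [ (7/14) x 0.985 + (7/14) x 0.592 ] = 0.94 - 0.789 ≈ 0.151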

134. As you could see, we will consider the entropy

135. over the distribution of samples falling under

136. each leaf node and we'll take a weighted average of

137. that entropy weighted by the proportion of samples falling under that leaf.

138. We can calculate the information gain of the tree if we use cholesterol as well.

139. It is 0.048.

140. Now, the question is,

141. which attribute is more suitable?

142. Well, as mentioned, the tree with the higher information gained after splitting,

143. this means the sex attribute.

144. So, we select the sex attribute as the first splitter.

145. Now, what is the next attribute after branching by the sex attribute?

146. Well, as you can guess,

147. we should repeat the process for each branch and test each of

148. the other attributes to continue to reach the most pure leaves.

149. This is the way you build a decision tree. Thanks for watching.
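A minimal scikit-learn sketch of the entropy-based tree described in this video. The feature matrix here is randomly generated rather than the 14-patient drug dataset, so treat the numbers as placeholders.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn import metrics

# Stand-in binary data in place of the age/sex/blood pressure/cholesterol features
X, y = make_classification(n_samples=200, n_features=4, n_informative=3,
                           n_redundant=0, random_state=3)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=3)

# criterion="entropy" makes each split maximize the information gain described above
drug_tree = DecisionTreeClassifier(criterion="entropy", max_depth=4)
drug_tree.fit(X_train, y_train)

y_hat = drug_tree.predict(X_test)
print("Accuracy:", metrics.accuracy_score(y_test, y_hat))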

Video
2. Hello and welcome.

3. In this video, we'll learn a machine learning method called

4. Logistic Regression which is used for classification.

5. In examining this method, we'll specifically answer these three questions.

6. What is logistic regression?

7. What kind of problems can be solved by logistic regression?

8. In which situations do we use logistic regression?

9. So let's get started.

10. Logistic regression is a statistical and machine learning technique for classifying records

11. of a dataset based on the values of the input fields.

12. Let's say we have a telecommunication dataset that we'd like to

13. analyze in order to understand which customers might leave us next month.

14. This is historical customer data where each row represents one customer.

15. Imagine that you're an analyst at this company and you have to find out who is leaving
and

16. why?

17. You'll use the dataset to build a model based on

18. historical records and use it to predict the future churn within the customer group.

19. The dataset includes information about services that

20. each customer has signed up for, customer account information, demographic
information

21. about customers like gender and age range and also customers who've left the company

22. within the last month.

23. The column is called churn.

24. We can use logistic regression to build a model for

25. predicting customer churn using the given features.

26. In logistic regression, we use one or more independent variables such as tenure, age,

27. and income to predict an outcome, such as churn, which we call the dependent variable

28. representing whether or not customers will stop using the service.

29. Logistic regression is analogous to linear regression but tries to predict a categorical
30. or discrete target field instead of a numeric one.

31. In linear regression, we might try to predict a continuous value of variables such as the

32. price of a house, blood pressure of a patient, or fuel consumption of a car.

33. But in logistic regression, we predict a variable which is binary such as yes/no, true/false,

34. successful or not successful, pregnant/not pregnant,

35. and so on, all of which can be coded as zero or one.

36. In logistic regression independent variables should be continuous.

37. If categorical, they should be dummy or indicator coded.

38. This means we have to transform them to some continuous value.

39. Please note that logistic regression can be used for both binary classification and multi-
class

40. classification.

41. But for simplicity in this video, we'll focus on binary classification.

42. Let's examine some applications of logistic regression before we explain how they work.

43. As mentioned, logistic regression is a type of classification algorithm, so it can be

44. used in different situations.

45. For example, to predict the probability of a person having a heart attack within a specified

46. time period, based on our knowledge of the person's age,

47. sex, and body mass index.

48. Or to predict the chance of mortality in an injured patient or to predict whether a patient

49. has a given disease such as diabetes based on

50. observed characteristics of that patient such as weight,

51. height, blood pressure, and results of various blood tests and so on.

52. In a marketing context, we can use it to predict the likelihood of a customer purchasing

53. a product or halting a subscription as we've done in our churn example.

54. We can also use logistic regression to predict the probability of failure of a given process,

55. system or product.

56. We can even use it to predict the likelihood of a homeowner defaulting on a mortgage.

57. These are all good examples of problems that can be solved using logistic regression.

58. Notice that in all these examples not only do we predict the class of each case,
59. we also measure the probability of a case belonging to a specific class.

60. There are different machine algorithms which can classify or estimate a variable.

61. The question is, when should we use logistic regression?

62. Here are four situations in which logistic regression is a good candidate.

63. First, when the target field in your data is categorical or specifically is binary.

64. Such as zero/one, yes/no, churn or no churn, positive/negative and so on.

65. Second, you need the probability of your prediction.

66. For example, if you want to know what the probability is of a customer buying a product.

67. Logistic regression returns a probability score between zero and one for a given sample

68. of data.

69. In fact, logistic regression predicts the probability of that sample and we map the

70. cases to a discrete class based on that probability.

71. Third, if your data is linearly separable.

72. The decision boundary of logistic regression is a line or a plane or a hyper plane.

73. A classifier will classify all the points on one side of the decision boundary as belonging

74. to one class and all those on the other side as belonging to the other class.

75. For example, if we have just two features and are not applying any polynomial
processing

76. we can obtain an inequality like Theta zero plus Theta 1x1 plus theta 2x2 is greater than

77. zero, which is a half-plane easily plottable.

78. Please note that in using logistic regression, we can also achieve a complex decision
boundary

79. using polynomial processing as well, which is out of scope here.

80. You'll get more insight from decision boundaries when you understand how

81. logistic regression works.

82. Fourth, you need to understand the impact of a feature.

83. You can select the best features based on the statistical significance of the logistic

84. regression model coefficients or parameters.

85. That is, after finding the optimum parameters, a feature X with the weight Theta one
close

86. to zero has a smaller effect on the prediction than features


87. with large absolute values of Theta one.

88. Indeed, it allows us to understand the impact an independent variable

89. has on the dependent variable while controlling other independent variables.

90. Let's look at our dataset again.

91. We defined the independent variables as X and dependent variable as Y.

92. Notice, that for the sake of simplicity we can code the target or dependent values to

93. zero or one.

94. The goal of logistic regression is to build a model to predict the class of each sample

95. which in this case is a customer, as well as the probability of each sample belonging

96. to a class.

97. Given that, let's start to formalize the problem.

98. X is our dataset in the space of real numbers of m by n.

99. That is, of m dimensions or features and n records, and Y is the class that we want to

100. predict, which can be either zero or one.

101. Ideally, a logistic regression model, so-called Y hat, can predict that the class of the
customer

102. is one, given its features X.

103. It can also be shown quite easily that the probability of a customer being in class zero

104. can be calculated as one minus the probability that the class of the customer is one.

105. Thanks for watching this video.
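A minimal sketch of fitting and using such a model with scikit-learn. The churn-style data here is randomly generated, and predict_proba returns the per-class probabilities discussed above.

import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# Stand-in data in place of the telecom churn dataset (features X, binary churn label y)
X, y = make_classification(n_samples=300, n_features=6, n_informative=4,
                           n_redundant=0, random_state=2)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=2)

model = LogisticRegression(C=0.01, solver='liblinear').fit(X_train, y_train)

y_hat = model.predict(X_test)          # predicted class (0 or 1)
y_prob = model.predict_proba(X_test)   # probability of class 0 and class 1 for each sample
print(y_hat[:5])
print(y_prob[:5])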

Video



2. Hello and welcome. In this video,

3. we will learn the difference between linear regression and logistic regression.

4. We go over linear regression and see why it

5. cannot be used properly for some binary classification problems.

6. We also look at the sigmoid function,

7. which is the main part of logistic regression. Let's start.

8. Let's look at the telecommunication dataset again.


9. The goal of logistic regression is to build a model to predict the class of

10. each customer and also the probability of each sample belonging to a class.

11. Ideally, we want to build a model, y hat,

12. that can estimate that the class of a customer is one given its feature is

13. x. I want to emphasize that y is the label's vector,

14. also called actual values,

15. that we would like to predict, and y hat

16. is the vector of the predicted values by our model.

17. Mapping the class labels to integer numbers,

18. can we use linear regression to solve this problem?

19. First, let's recall how linear regression works to better understand logistic regression.

20. Forget about the churn prediction for a minute and assume

21. our goal is to predict the income of customers in the dataset.

22. This means that instead of predicting churn,

23. which is a categorical value,

24. let's predict income, which is a continuous value.

25. So, how can we do this?

26. Let's select an independent variable such as

27. customer age and predict the dependent variable such as income.

28. Of course, we can have more features but for the sake of simplicity,

29. let's just take one feature here.

30. We can plot it and show age as

31. an independent variable and income as the target value we would like to predict.

32. With linear regression, you can fit a line or polynomial through the data.

33. We can find this line through training our model or

34. calculating it mathematically based on the sample sets.

35. We'll say, this is a straight line through the sample set.

36. This line has an equation shown as a plus bx1.

37. Now, use this line to predict the continuous value, y.

38. That is, use this line to predict the income of


39. an unknown customer based on his or her age, and it is done.

40. What if we want to predict churn?

41. Can we use the same technique to predict

42. a categorical field such as churn? Okay, let's see.

43. Say, we're given data on customer churn and our goal

44. this time is to predict the churn of customers based on their age.

45. We have a feature,

46. age denoted as x1,

47. and a categorical feature, churn,

48. with two classes, churn is yes and churn is no.

49. As mentioned, we can map yes and no to integer values zero and one.

50. How can we model it now?

51. Well, graphically, we could represent our data with a scatterplot,

52. but this time, we have only two values for the y-axis.

53. In this plot, class zero is denoted in red, and class one is denoted in blue.

54. Our goal here is to make a model based on

55. existing data to predict if a new customer is red or blue.

56. Let's do the same technique that we used for linear regression here to

57. see if we can solve the problem for a categorical attribute such as churn.

58. With linear regression, you again can fit a polynomial through the data,

59. which is shown traditionally as a plus bx.

60. This polynomial can also be shown traditionally as Theta0 plus Theta1 x1.

61. This line has two parameters which are shown with

62. vector Theta where the values of the vector are Theta0 and Theta1.

63. We can also show the equation of this line formally as Theta transpose x.

64. Generally, we can show the equation for a multidimensional space as Theta transpose x,

65. where Theta is the parameters of the line in

66. two-dimensional space or parameters of a plane in three-dimensional space, and so on.

67. As Theta is a vector of parameters and is supposed to be multiplied by x,

68. it is shown conventionally as Theta transpose.


69. Theta is also called the weight vector or the confidences of the equation,

70. with both these terms used

71. interchangeably, and X is the feature set which represents a customer.

72. Anyway, given a dataset,

73. and all the feature sets x, the Theta parameters can be

74. calculated through an optimization algorithm or mathematically,

75. which results in the equation of the fitting line.

76. For example, the parameters of this line are minus one and 0.1,

77. and the equation for the line is minus one plus 0.1 x1.

78. Now, we can use this regression line to predict the churn of a new customer.

79. For example, for our customer or, let's say,

80. a data point with x value of age equals 13,

81. we can plug the value into the line formula,

82. and the y value is calculated and returns a number.

83. For instance, for p1 point,

84. we have Theta transpose x equals minus 1 plus 0.1 times x1,

85. equals minus 1 plus 0.1 times 13, equals 0.3.

86. We can show it on our graph.

87. Now, we can define a threshold here.

88. For example, at 0.5 to define the class.

89. So, we write a rule here for our model,

90. y hat, which allows us to separate class zero from class one.

91. If the value of Theta transpose x is less than 0.5,

92. then the class is zero.

93. Otherwise, if the value of Theta transpose x is more than 0.5,

94. then the class is one, and

95. because our customer's y value is less than the threshold,

96. we can say it belongs to class zero based on our model.
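A minimal sketch of this threshold rule in Python, using the parameter values quoted above (the helper name is only illustrative):

```python
# A minimal sketch of the threshold rule above, with the quoted parameters.
theta0, theta1 = -1.0, 0.1

def predict_class(age, threshold=0.5):
    y = theta0 + theta1 * age          # Theta transpose x with one feature
    return 0 if y < threshold else 1   # below the threshold -> class zero

print(predict_class(13))   # -1 + 0.1 * 13 = 0.3, which is < 0.5, so class 0
```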

97. But there is one problem here.

98. What is the probability that this customer belongs to class zero?
99. As you can see, it's not the best model to solve this problem.

100. Also, there are some other issues which verify that

101. linear regression is not the proper method for classification problems.

102. So, as mentioned, if we use the regression line to calculate the class of a point,

103. it always returns a number such as three or negative two, and so on.

104. Then, we should use a threshold, for example,

105. 0.5, to assign that point to either class of zero or one.

106. This threshold works as a step function that

107. outputs zero or one regardless of how big or small,

108. positive or negative the input is.

109. So, using the threshold,

110. we can find the class of a record.

111. Notice that in the step function,

112. no matter how big the value is,

113. as long as it's greater than 0.5,

114. it simply equals one and vice versa.

115. Regardless of how small the value y is,

116. the output would be zero if it is less than 0.5.

117. In other words, there is no difference between

118. a customer who has a value of one or 1,000.

119. The outcome would be one.

120. Instead of having this step function,

121. wouldn't it be nice if we had a smoother line,

122. one that would project these values between zero and one?

123. Indeed, the existing method does not really

124. give us the probability of a customer belonging to a class,

125. which is very desirable.

126. We need a method that can give us the probability of falling in the class as well.

127. So, what is the scientific solution here?

128. Well, if instead of using Theta transpose x,


129. we use a specific function called sigmoid,

130. then sigmoid of Theta transpose x gives us the probability of

131. a point belonging to a class instead of the value of y directly.

132. I'll explain this sigmoid function in a second,

133. but for now, please accept that it will do the trick.

134. Instead of calculating the value of Theta transpose x directly,

135. it returns the probability that a Theta transpose x is very big or very small.

136. It always returns a value between 0 and 1,

137. depending on how large the Theta transpose x actually is.

138. Now, our model is sigmoid of Theta transpose x,

139. which represents the probability that the output is 1 given x.

140. Now, the question is,

141. what is the sigmoid function?

142. Let me explain in detail what sigmoid really is.

143. The sigmoid function, also called the logistic function,

144. resembles the step function and is used by

145. the following expression in the logistic regression.

146. The sigmoid function looks a bit complicated at first,

147. but don't worry about remembering this equation,

148. it'll make sense to you after working with it.

149. Notice that in the sigmoid equation,

150. when Theta transpose x gets very big,

151. the e power minus Theta transpose x in the denominator of the fraction

152. becomes almost 0, and the value of the sigmoid function gets closer to 1.

153. If Theta transpose x is very small,

154. the sigmoid function gets closer to 0.

155. Depicting this on the sigmoid plot,

156. when Theta transpose x gets bigger,

157. the value of the sigmoid function gets closer to 1, and

158. also, if the Theta transpose x is very small,


159. the sigmoid function gets closer to 0.

160. So, the sigmoid function's output is always between 0 and 1,

161. which makes it proper to interpret the results as probabilities.

162. It is obvious that when the outcome of the sigmoid function gets closer to 1,

163. the probability of y equals 1 given x goes up.

164. In contrast, when the sigmoid value is closer to 0,

165. the probability of y equals 1 given x is very small.

166. So what is the output of our model when we use the sigmoid function?

167. In logistic regression, we model the probability that an input, x,

168. belongs to the default class y equals 1,

169. and we can write this formally as probability of y equals 1 given x.

170. We can also write probability of y belongs to

171. class 0 given x is 1 minus probability of y equals 1 given x.

172. For example, the probability of a customer leaving the company can be

173. shown as probability of churn equals 1 given a customer's income and age,

174. which can be, for instance, 0.8, and

175. the probability of churn is 0 for the same customer given a customer's income and

176. age can be calculated as 1 minus 0.8 equals 0.2.
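A minimal sketch of this idea, assuming illustrative weight and feature values rather than the course's fitted model:

```python
import numpy as np

def sigmoid(z):
    # The logistic (sigmoid) function: always returns a value between 0 and 1.
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([-1.0, 0.1])    # assumed weights [theta0, theta1]
x = np.array([1.0, 13.0])        # [1, age]; the leading 1 multiplies theta0

p_churn = sigmoid(theta @ x)     # P(y = 1 | x)
p_stay = 1.0 - p_churn           # P(y = 0 | x)
print(p_churn, p_stay)
```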

177. So, now our job is to train the model to set its parameter values in

178. such a way that our model is a good estimate of probability of y equals 1 given x.

179. In fact, this is what a good classifier model

180. built by logistic regression is supposed to do for us.

181. Also, it should be a good estimate of probability of y belongs to

182. class 0 given x that can be shown as 1 minus sigmoid of Theta transpose x.

183. Now, the question is,

184. how can we achieve this?

185. We can find Theta through the training process.

186. So, let's see what the training process is.

187. Step one, initialize Theta vector

188. with random values as with most machine learning algorithms.


189. For example, minus 1 or 2.

190. Step two, calculate the model output,

191. which is sigmoid of Theta transpose x.

192. For example, for a customer in your training set,

193. x in Theta transpose x is the feature vector's values.

194. For example, the age and income of the customer, for instance,

195. 2 and 5, and Theta is the confidence or weight that you've set in the previous step.

196. The output of this equation is the prediction value,

197. in other words, the probability that the customer belongs to class 1.

198. Step three, compare the output of our model,

199. y hat, which could be a value of,

200. let's say, 0.7, with the actual label of the customer,

201. which is for example, 1, for churn.

202. Then, record the difference as our model's error for this customer,

203. which would be 1 minus 0.7,

204. which of course, equals 0.3.

205. This is the error for only one customer out of all the customers in the training set.

206. Step four, calculate the error for

207. all customers as we did in the previous steps and add up these errors.

208. The total error is the cost of your model and is calculated by the model's cost function.

209. The cost function, by the way,

210. basically represents how to calculate the error of the model, which is

211. the difference between the actual and the model's predicted values.

212. So, the cost shows how poorly the model is estimating the customers' labels.

213. Therefore, the lower the cost,

214. the better the model is at estimating the customers' labels correctly.

215. So, what we want to do is to try to minimize this cost.

216. Step five, but because the initial values for Theta were chosen randomly,

217. it's very likely that the cost function is very high,

218. so we change the Theta in such a way to hopefully reduce the total cost.
219. Step six, after changing the values of Theta,

220. we go back to step two,

221. then we start another iteration and calculate the cost of the model again.

222. We keep doing those steps over and over,

223. changing the values of Theta each time until the cost is low enough.

224. So, this brings up two questions.

225. First, how can we change the values of

226. Theta so that the cost is reduced across iterations?

227. Second, when should we stop the iterations?

228. There are different ways to change the values of Theta,

229. but one of the most popular ways is gradient descent.

230. Also, there are various ways to stop iterations,

231. but essentially you stop training by calculating

232. the accuracy of your model and stop it when it's satisfactory.
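In practice, a library handles this whole training loop for you. Here is a minimal sketch with scikit-learn, assuming made-up age/income churn data rather than the course's dataset:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Made-up (age, income) rows and churn labels, purely for illustration.
X = np.array([[25, 40], [30, 60], [45, 80], [50, 120], [23, 30], [60, 150]])
y = np.array([1, 1, 0, 0, 1, 0])           # 1 = churn, 0 = no churn

model = LogisticRegression(max_iter=1000)  # the iterative optimization happens in fit
model.fit(X, y)
print(model.predict_proba([[35, 70]]))     # [P(y=0|x), P(y=1|x)] for a new customer
```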

233. Thanks for watching this video.

Video: Logistic Regression Training



2. Hello and welcome! In this video we'll learn more about training a logistic

3. regression model. Also, we'll be discussing how to change the parameters

4. of the model to better estimate the outcome. Finally, we talk about the cost

5. function and gradient descent in logistic regression as a way to optimize

6. the model, so let's start. The main objective of training in logistic

7. regression is to change the parameters of the model so as to be the best

8. estimation of the labels of the samples in the data set. For example, the customer

9. churn. How do we do that? In brief, first we have to look at the

10. cost function and see what the relation is between the cost function and the

11. parameters theta. So, we should formulate the cost function, then using the

12. derivative of the cost function, we can find how to change the parameters to

13. reduce the cost or rather the error. Let's dive into it to see how it works.
14. But before I explain it I should highlight for you that it needs

15. some basic mathematical background to understand it. However, you shouldn't

16. worry about it as most data science languages like Python, R, and Scala have

17. some packages or libraries that calculate these parameters for you. So,

18. let's take a look at it. Let's first find the cost function equation for a sample

19. case. To do this, we can use one of the customers in the churn problem.

20. There's normally a general equation for calculating the cost. The cost function

21. is the difference between the actual values of Y and our model output, y-hat.

22. This is a general rule for most cost functions in machine learning. We can

23. show this as the cost of our model comparing it with actual labels, which is

24. the difference between the predicted value of our model and actual value of

25. the target field, where the predicted value of our model is sigmoid of theta

26. transpose X. Usually the square of this equation is used because of the

27. possibility of the negative result and for the sake of simplicity, half of this

28. value is considered as the cost function through the derivative process.

29. Now we can write the cost function for all the samples in our training set. For

30. example, for all customers we can write it as the average sum of the cost

31. functions of all cases. It is also called the mean squared error and, as it is a

32. function of a parameter vector theta, it is shown as J of theta. Okay, good. We have

33. the cost function, now how do we find or set the best weights or parameters that

34. minimize this cost function? The answer is we should calculate the

35. minimum point of this cost function and it'll show us the best parameters for

36. our model. Although we can find the minimum point of a function using the

37. derivative of a function, there's not an easy way to find the global minimum

38. point for such an equation. Given this complexity, describing how to reach the

39. global minimum for this equation is outside the scope of this video. So what

40. is the solution? Well, we should find another cost function instead; one which

41. has the same behavior but is easier to find its minimum point. Let's plot the

42. desirable cost function for our model. Recall that our model is y-hat. Our

43. actual value is Y which equals 0 or 1 and our model tries to estimate it as we
44. want to find a simple cost function for our model. For a moment assume that our

45. desired value for Y is 1. This means our model is best if it estimates Y equals 1. In this
case we need a cost function that returns zero if the outcome of our

46. model is 1, which is the same as the actual label. And, the cost should keep

47. increasing as the outcome of our model gets farther from 1. And cost should be

48. very large if the outcome of our model is close to zero. We can see that the

49. minus log function provides such a cost function for us. It means if the actual

50. value is 1 and the model also predicts 1, the minus log function returns zero cost.

51. But, if the prediction is smaller than 1, the

52. minus log function returns a larger cost value. So, we can use the minus log

53. function for calculating the cost of our logistic regression model. So if you

54. recall, we previously noted that in general it is difficult to calculate the

55. derivative of the cost function. Well, we can now change it with the minus log of

56. our model. We can easily prove that in the case that desirable Y is 1, the cost

57. can be calculated as minus log y-hat and in the case that desirable Y is 0 the

58. cost can be calculated as minus log 1 minus y hat. Now we can plug it into our

59. total cost function and rewrite it as this function. So, this is the logistic

60. regression cost function. As you can see for yourself,

61. it penalizes situations in which the class is 0 and the model output is 1 and

62. vice versa. Remember, however, that y-hat does not

63. return a class as output, but it's a value between 0 and 1 which should be interpreted

64. as a probability. Now we can easily use this function to find the parameters of

65. our model in such a way as to minimize the cost. Okay, let's recap what we've

66. done. Our objective was to find a model that best estimates the actual labels.

67. Finding the best model means finding the best parameters theta for that model. So,

68. the first question was, how do we find the best parameters for our model? Well,

69. by finding and minimizing the cost function of our model. In other words, to

70. minimize the J of theta we just defined. The next question is how do we minimize

71. the cost function. The answer is, using an optimization approach. There are

72. different optimization approaches but we use one of the most famous and effective
73. approaches here, gradient descent. The next question is what is gradient

74. descent? Generally, gradient descent is an iterative approach to finding the

75. minimum of a function. Specifically, in our case, gradient descent is a technique

76. to use the derivative of a cost function to change the parameter values to

77. minimize the cost or error. Let's see how it works. The main objective of gradient

78. descent is to change the parameter values so as to minimize the cost.

79. How can gradient descent do that? Think of the parameters or weights in our

80. model to be in a two-dimensional space. For example, theta 1 theta 2 for two

81. feature sets, age and income. Recall the cost function, J, that we discussed in the

82. previous slides. We need to minimize the cost function J, which is a function of

83. variables theta 1 and theta 2. So let's add a dimension for the observed cost or

84. error, J function. Let's assume that if we plot the cost function based on all

85. possible values of theta 1 and theta 2, we can see something like this. It

86. represents the error value for different values of parameters that is error, which

87. is a function of the parameters. This is called the error curve or error bowl of

88. your cost function. Recall that we want to use this error bole to find the best

89. parameter values that result in minimizing the cost value. Now, the

90. question is, which point is the best point for your cost function? Yes, you

91. should try to minimize your position on the error curve. So, what should you do?

92. You have to find the minimum value of the cost by changing the parameters, but

93. which way? Will you add some value to your weights or deduct some value? And

94. how much would that value be? You can select random parameter values that

95. locate a point on the bowl. You can think of our starting point being the yellow

96. point. You change the parameters by delta theta 1 and delta theta 2, and take one

97. step on the surface. Let's assume we go down one step in the bowl. As long as

98. we're going downwards, we can go one more step. The steeper the slope, the further we
can

99. step, and we can keep taking steps. As we approach the lowest point, the slope

100. diminishes, so we can take smaller steps, until we reach a flat surface. This is

101. the minimum point of our curve and the optimum theta 1 theta 2.
102. What are these steps really? I mean in which direction should we take these

103. steps to make sure we descend? And how big should the steps be? To find the

104. direction and size of these steps, in other words, to find how to update the

105. parameters, you should calculate the gradient of the cost function at that

106. point. The gradient is the slope of the surface at every point and the direction

107. of the gradient is the direction of the greatest uphill. Now the question is, how

108. do we calculate the gradient of a cost function at a point? If you select a

109. random point on this surface, for example the yellow point, and take the partial

110. derivative of J of theta with respect to each parameter at that point, it gives

111. you the slope of the move for each parameter at that point. Now, if we move

112. in the opposite direction of that slope it guarantees that we go down in the

113. error curve. For example, if we calculate the derivative of J with respect to

114. theta 1, we find out that it is a positive number. This indicates that

115. the function is increasing as theta 1 increases. So to decrease J we should

116. move in the opposite direction. This means to move in the direction of the

117. negative derivative for theta 1, i.e., the slope. We have to calculate it for other

118. parameters as well at each step. The gradient value also indicates how big of

119. a step to take. If the slope is large we should take a large step because we're

120. far from the minimum. If the slope is small we should take a smaller step.

121. Gradient descent takes increasingly smaller steps towards the minimum with

122. each iteration. The partial derivative of the cost function J is calculated using

123. this expression. If you want to know how the derivative of the J function is

124. calculated, you need to know the derivative concept, which is beyond our scope here.

125. But to be honest, you don't really need to remember all the details about it as

126. you can easily use this equation to calculate the gradients. So, in a nutshell,

127. this equation returns the slope of that point and we should update the parameter

128. in the opposite direction of the slope. A vector of all these slopes is the

129. gradient vector and we can use this vector to change or update all the

130. parameters. We take the previous values of the parameters and subtract the error

131. derivative. This results in the new parameters for theta that we know will
132. decrease the cost. Also, we multiply the gradient value by a constant value, mu,

133. which is called the learning rate. Learning rate gives us additional

134. control on how fast we move on the surface. In sum, we can simply say:

135. gradient descent is like taking steps in the current direction of the slope and

136. the learning rate is like the length of the step you take. So, these would be our

137. new parameters. Notice that it's an iterative operation and in each

138. iteration we update the parameters and minimize the cost until the algorithm

139. converges on an acceptable minimum. Okay, let's recap what we've done to this

140. point by going through the training algorithm again, step by step. Step one, we

141. initialize the parameters with random values. Step two, we feed the cost

142. function with the training set and calculate the cost. We expect a high

143. error rate as the parameters are set randomly. Step three, we calculate the

144. gradient of the cost function, keeping in mind that we have to use a partial

145. derivative. So, to calculate the gradient vector we need all the training data to

146. feed the equation for each parameter. Of course this is an expensive part of the

147. algorithm but there are some solutions for this. Step four, we update the weights

148. with new parameter values. Step 5, here we go back to step 2 and feed the cost

149. function again, which has new parameters. As was explained

150. earlier, we expect less error as we're going down the error surface. We continue

151. this loop until we reach a sufficiently small value of cost or some limited number of

152. iterations. Step 6, the parameters should be roughly found after some iterations.

153. This means the model is ready and we can use it to predict the probability of a

154. customer staying or leaving. Thanks for watching this video.
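The training loop just recapped can be sketched in a few lines of NumPy. This is only an illustration on made-up data with assumed hyperparameters, not the course's own implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Each row is [1, age, income]; the leading 1 multiplies theta0. Values are made up.
X = np.array([[1, 2, 5], [1, 3, 6], [1, 5, 8],
              [1, 6, 12], [1, 2, 3], [1, 7, 15]], dtype=float)
y = np.array([1, 1, 0, 0, 1, 0], dtype=float)   # 1 = churn, 0 = no churn

rng = np.random.default_rng(0)
theta = rng.normal(scale=0.1, size=X.shape[1])  # step 1: random initialization
mu = 0.01                                       # the learning rate from the video

for _ in range(3000):
    p = np.clip(sigmoid(X @ theta), 1e-12, 1 - 1e-12)       # step 2: model output
    cost = -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)) # log-loss cost J(theta)
    gradient = X.T @ (p - y) / len(y)                        # slope of J at this theta
    theta -= mu * gradient                                   # step against the slope

print(theta, cost)
```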


Video: Support Vector Machine (SVM)



2. Hello and welcome.

3. In this video, we will learn a machine learning method called,


4. Support Vector Machine, or SVM, which is used for classification.

5. So let's get started.

6. Imagine that you've obtained a dataset containing characteristics

7. of thousands of human cell samples

8. extracted from patients who were believed to be at risk of developing cancer.

9. Analysis of the original data showed that many of the characteristics differed

10. significantly between benign and malignant samples.

11. You can use the values of these cell characteristics

12. in samples from other patients,

13. to give an early indication of whether a new sample might be benign or malignant.

14. You can use Support Vector Machine, or

15. SVM, as a classifier to train your model to

16. understand patterns within the data that might show, benign or malignant cells.

17. Once the model has been trained, it can be used to predict your new or

18. unknown cell with rather high accuracy.

19. Now, let me give you a formal definition of SVM.

20. A Support Vector Machine is a supervised algorithm

21. that can classify cases by finding a separator.

22. SVM works by first mapping data to a high dimensional feature space so that data

23. points can be categorized, even when the data are not otherwise linearly separable.

24. Then, a separator is estimated for the data.

25. The data should be transformed in such a way

26. that a separator could be drawn as a hyperplane.

27. For example, consider the following figure, which shows the distribution of

28. a small set of cells only based on their unit size and clump thickness.

29. As you can see, the data points fall into two different categories.

30. It represents a linearly non separable data set.

31. The two categories can be separated with a curve but not a line.

32. That is, it represents a linearly non separable data set,

33. which is the case for most real world data sets.
34. We can transfer this data to a higher-dimensional space, for

35. example, mapping it to a three-dimensional space.

36. After the transformation,

37. the boundary between the two categories can be defined by a hyperplane.

38. As we are now in three-dimensional space, the separator is shown as a plane.

39. This plane can be used to classify new or unknown cases.

40. Therefore, the SVM algorithm

41. outputs an optimal hyperplane that categorizes new examples.

42. Now, there are two challenging questions to consider.

43. First, how do we transfer data in such a way

44. that a separator could be drawn as a hyperplane?

45. And two, how can we find the best or

46. optimized hyperplane separator after transformation?

47. Let's first look at transforming data to see how it works.

48. For the sake of simplicity, imagine that our dataset is one-dimensional data.

49. This means we have only one feature x.

50. As you can see, it is not linearly separable.

51. So what can we do here?

52. Well, we can transfer it into a two-dimensional space.

53. For example, you can increase the dimension of data by mapping x into

54. a new space using a function with outputs x and x squared.

55. Now the data is linearly separable, right?

56. Notice that as we are in a two-dimensional space, the hyperplane is a line

57. dividing a plane into two parts where each class lays on either side.

58. Now we can use this line to classify new cases.
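A tiny sketch of that one-dimensional mapping, with made-up points:

```python
import numpy as np

# Illustrative 1-D points: the class depends on how far x is from zero,
# so no single threshold on x alone separates them.
x = np.array([-3, -2, -1, 0, 1, 2, 3], dtype=float)
y = np.array([ 1,  1,  0, 0, 0, 1, 1])

X_mapped = np.column_stack([x, x ** 2])   # map x -> (x, x^2)
print(X_mapped)
# In the (x, x^2) plane a horizontal line such as x^2 = 2.5 now separates
# the two classes, and that line plays the role of the hyperplane.
```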

59. Basically, mapping data into a higher-dimensional space is called,

60. kernelling.

61. The mathematical function used for the transformation is known as the kernel

62. function, and can be of different types, such as linear,

63. polynomial, Radial Basis Function or RBF, and sigmoid.


64. Each of these functions has its own characteristics, its pros and cons, and

65. its equation.

66. But the good news is that you don't need to know them as most of them

67. are already implemented in libraries of data science programming languages.

68. Also, as there's no easy way of knowing which function performs best with any

69. given dataset, we usually choose different functions in turn and compare the results.
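A minimal sketch of that "try different kernels and compare" idea, using scikit-learn's built-in benign/malignant cell dataset (similar in spirit to the cell samples described earlier, but not the course's exact data):

```python
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = datasets.load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    clf = make_pipeline(StandardScaler(), SVC(kernel=kernel))  # scale, then fit the SVM
    clf.fit(X_train, y_train)
    print(kernel, clf.score(X_test, y_test))   # compare accuracy across kernels
```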

70. Now we get to another question.

71. Specifically, how do we find the right or optimized separator after transformation?

72. Basically, SVMs are based on the idea of finding a hyperplane

73. that best divides a data set into two classes as shown here.

74. As we're in a two-dimensional space, you can think of the hyperplane

75. as a line that linearly separates the blue points from the red points.

76. One reasonable choice as the best hyperplane is the one that represents

77. the largest separation or margin between the two classes.

78. So the goal is to choose a hyperplane with as big a margin as possible.

79. Examples closest to the hyperplane are support vectors.

80. It is intuitive that only support vectors matter for achieving our goal.

81. And thus, other training examples can be ignored.

82. We tried to find the hyperplane in such a way that

83. it has the maximum distance to support vectors.

84. Please note that the hyperplane and the decision boundary lines

85. have their own equations.

86. So finding the optimized hyperplane can be formalized using an equation which

87. involves quite a bit more math, so I'm not going to go through it here in detail.

88. That said, the hyperplane is learned from training data

89. using an optimization procedure that maximizes the margin.

90. And like many other problems, this optimization problem can

91. also be solved by gradient descent, which is out of scope of this video.

92. Therefore, the output of the algorithm is the values w and b for the line.

93. You can make classifications using this estimated line.


94. It is enough to plug in input values into the line equation.

95. Then, you can calculate whether an unknown point is above or below the line.

96. If the equation returns a value greater than 0,

97. then the point belongs to the first class which is above the line, and vice-versa.
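A small sketch of that final classification step, with hypothetical values for w and b (the real values come out of the SVM optimization):

```python
import numpy as np

w = np.array([0.5, -0.25])   # hypothetical learned weights
b = -1.0                     # hypothetical learned intercept

def classify(point):
    score = np.dot(w, point) + b     # plug the input into the line equation
    return 1 if score > 0 else 0     # above the line -> first class

print(classify(np.array([4.0, 1.0])))   # 0.5*4 - 0.25*1 - 1 = 0.75 > 0 -> class 1
```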

98. The two main advantages of support vector machines are that they're

99. accurate in high-dimensional spaces.

100. And they use a subset of training points in the decision function called,

101. support vectors, so it's also memory efficient.

102. The disadvantages of Support Vector Machines

103. include the fact that the algorithm is prone to

104. over-fitting if the number of features is much greater than the number of samples.

105. Also, SVMs do not directly provide probability estimates,

106. which are desirable in most classification problems.

107. And finally, SVMs are not very efficient computationally

108. if your dataset is very big, such as when you have more than 1,000 rows.

109. And now our final question is, in which situation should I use SVM?

110. Well, SVM is good for image analysis tasks,

111. such as image classification and handwritten digit recognition.

112. Also, SVM is very effective in text mining tasks,

113. particularly due to its effectiveness in dealing with high-dimensional data.

114. For example, it is used for

115. detecting spam, text category assignment and sentiment analysis.

116. Another application of SVM is in gene expression data classification,

117. again, because of its power in high-dimensional data classification.

118. SVM can also be used for other types of machine learning problems,

119. such as regression, outlier detection and clustering.

120. I'll leave it to you to explore more about these particular problems.

121. This concludes this video,thanks for watching.

Video: Clustering



2. Hello and welcome.

3. In this video we'll give you a high level introduction to clustering,

4. its applications, and different types of clustering algorithms.

5. Let's get started! Imagine that you have

6. a customer dataset and you need to apply customer segmentation on this historical data.

7. Customer segmentation is the practice of partitioning

8. a customer base into groups of individuals that have similar characteristics.

9. It is a significant strategy,

10. as it allows the business to target specific groups of customers,

11. so as to more effectively allocate marketing resources.

12. For example, one group might contain customers who are high profit and low risk.

13. That is, more likely to purchase products or subscribe for a service.

14. Knowing this information allows a business to devote

15. more time and attention to retaining these customers.

16. Another group might include customers from nonprofit organizations and so on.

17. A general segmentation process is not usually feasible for large volumes of varied data,

18. therefore you need an analytical approach to

19. deriving segments and groups from large datasets.

20. Customers can be grouped based on several factors, including

21. age, gender, interests, spending habits and so on.

22. The important requirement is to use the available data to

23. understand and identify how customers are similar to each other.

24. Let's learn how to divide a set of customers into categories,

25. based on characteristics they share.

26. One of the most adopted approaches that can be used

27. for customer segmentation is clustering.

28. Clustering can group data in an unsupervised way,

29. based on the similarity of customers to each other.


30. It will partition your customers into mutually exclusive groups.

31. For example, into three clusters.

32. The customers in each cluster are similar to each other demographically.

33. Now we can create a profile for each group,

34. considering the common characteristics of each cluster.

35. For example, the first group made up of affluent and middle aged customers.

36. The second is made up of young,

37. educated and middle income customers,

38. and the third group includes young and low income customers.

39. Finally, we can assign each individual in

40. our dataset to one of these groups or segments of customers.

41. Now imagine that you cross join this segmented dataset with

42. the dataset of the product or services that customers purchase from your company.

43. This information would really help to understand and predict the differences in

44. individual customers' preferences and their buying behaviors across various products.

45. Indeed, having this information would allow your company to

46. develop highly personalized experiences for each segment.

47. Customer segmentation is one of the popular usages of clustering.

48. Cluster analysis also has many other applications in different domains.

49. So let's first define clustering and then we'll look at other applications.

50. Clustering means finding clusters in a dataset, unsupervised.

51. So what is a cluster?

52. A cluster is a group of data points or objects in

53. a dataset that are similar to other objects in the group,

54. and dissimilar to datapoints in other clusters.

55. Now the question is, "What is the difference between clustering and classification?"

56. Let's look at our customer dataset again.

57. Classification algorithms predict categorical class labels.

58. This means assigning instances to predefined classes such as defaulted or not defaulted.

59. For example, if an analyst wants to analyze


60. customer data in order to know which customers might default on their payments,

61. she uses a labeled dataset as

62. training data and uses classification approaches such as a decision tree,

63. Support Vector Machines or SVM,

64. or logistic regression, to predict the default value for a new or unknown customer.

65. Generally speaking, classification is a supervised learning

66. where each training data instance belongs to a particular class.

67. In clustering however, the data is unlabeled and the process is unsupervised.

68. For example, we can use a clustering algorithm such

69. as k-means to group similar customers as mentioned,

70. and assign them to a cluster,

71. based on whether they share similar attributes,

72. such as; age, education, and so on.

73. While I'll be giving you some examples in different industries,

74. I'd like you to think about more samples of clustering.

75. In the retail industry,

76. clustering is used to find associations among customers based on

77. their demographic characteristics and use

78. that information to identify buying patterns of various customer groups.

79. Also, it can be used in recommendation systems to find a group of

80. similar items or similar users and use it for collaborative filtering,

81. to recommend things like books or movies to customers.

82. In banking, analysts find clusters of

83. normal transactions to find the patterns of fraudulent credit card usage.

84. Also they use clustering to identify clusters of customers.

85. For instance, to find loyal customers versus churned customers.

86. In the insurance industry,

87. clustering is used for fraud detection in claims analysis,

88. or to evaluate the insurance risk of certain customers based on their segments.

89. In publication media, clustering is used to auto


90. categorize news based on its content or to tag news,

91. then cluster it so as to recommend similar news articles to readers.

92. In medicine, it can be used to characterize patient behavior,

93. based on their similar characteristics.

94. So as to identify successful medical therapies for different illnesses or in biology,

95. clustering is used to group genes with similar expression patterns

96. or to cluster genetic markers to identify family ties.

97. If you look around you can find many other applications of clustering,

98. but generally clustering can be used for one of the following purposes:

99. exploratory data analysis, summary generation or reducing the scale,

100. outlier detection, especially to be used for fraud detection or noise removal,

101. finding duplicates in datasets, or as a pre-processing step for either prediction,

102. other data mining tasks or as part of a complex system.

103. Let's briefly look at different clustering algorithms and their characteristics.

104. Partition-based clustering is a group of

105. clustering algorithms that produces sphere-like clusters,

106. such as; K-Means, K-Medians or Fuzzy c-Means.

107. These algorithms are relatively efficient and are

108. used for medium and large sized databases.

109. Hierarchical clustering algorithms produce trees of clusters,

110. such as agglomerative and divisive algorithms.

111. This group of algorithms are very intuitive

112. and are generally good for use with small size datasets.

113. Density-based clustering algorithms produce arbitrary shaped clusters.

114. They are especially good when dealing with

115. spatial clusters or when there is noise in your data set.

116. For example, the DBSCAN algorithm.

117. This concludes our video. Thanks for watching!

Video: K-Means Clustering



2. Hello and welcome. In this video,

3. we'll be covering K-Means Clustering. So let's get started.

4. Imagine that you have a customer dataset and you need to

5. apply customer segmentation on this historical data.

6. Customer segmentation is the practice of partitioning

7. a customer base into groups of individuals that have similar characteristics.

8. One of the algorithms that can be used for customer segmentation is K-Means clustering.

9. K-Means can group data in an

10. unsupervised way, based on the similarity of customers to each other.

11. Let's define this technique more formally.

12. There are various types of clustering algorithms such as partitioning,

13. hierarchical or density-based clustering.

14. K-Means is a type of partitioning clustering, that is,

15. it divides the data into K non-overlapping subsets or

16. clusters without any cluster internal structure or labels.

17. This means, it's an unsupervised algorithm.

18. Objects within a cluster are very similar,

19. and objects across different clusters are very different or dissimilar.

20. As you can see, for using K-Means we have to find similar samples:

21. for example, similar customers.

22. Now, we face a couple of key questions.

23. First, how can we find the similarity of samples in clustering, and

24. then how do we measure how similar two customers are with regard to their
demographics?

25. Though the objective of K-Means is to form clusters in

26. such a way that similar samples go into a cluster,

27. and dissimilar samples fall into different clusters,

28. it can be shown that instead of a similarity metric,


29. we can use dissimilarity metrics.

30. In other words, conventionally the distance of

31. samples from each other is used to shape the clusters.

32. So we can say K-Means tries to minimize

33. the intra-cluster distances and maximize the inter-cluster distances.

34. Now, the question is,

35. how can we calculate the dissimilarity or distance of two cases such as two customers?

36. Assume that we have two customers,

37. we will call them Customer one and two.

38. Let's also assume that we have only one feature for

39. each of these two customers and that feature is age.

40. We can easily use a specific type of

41. Minkowski distance to calculate the distance of these two customers.

42. Indeed, it is the Euclidean distance.

43. Distance of x1 from x2 is the square root of (34 minus 30) squared, which is four.

44. What about if we have more than one feature,

45. for example age and income.

46. For example, if we have income and age for each customer,

47. we can still use the same formula but this time in a two-dimensional space.

48. Also, we can use the same distance measure for multidimensional vectors.

49. Of course, we have to normalize our feature

50. set to get the accurate dissimilarity measure.

51. There are other dissimilarity measures as well that can be used for this purpose,

52. but it is highly dependent on

53. the datatype and also the domain that the clustering is being done for.

54. For example you may use Euclidean distance,

55. Cosine similarity, Average distance, and so on.

56. Indeed, the similarity measure highly controls how the clusters are formed,

57. so it is recommended to understand the domain knowledge of your dataset and

58. datatype of features and then choose the meaningful distance measurement.
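A minimal sketch of the Euclidean distance used above; the ages are the ones quoted, and the income values are made up for the two-feature case:

```python
import numpy as np

def euclidean(a, b):
    # Square root of the sum of squared feature differences.
    return np.sqrt(np.sum((np.asarray(a, float) - np.asarray(b, float)) ** 2))

print(euclidean([34], [30]))            # one feature: sqrt((34 - 30)^2) = 4.0
print(euclidean([34, 190], [30, 200]))  # two features: age and (made-up) income
```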
59. Now, let's see how K-Means clustering works.

60. For the sake of simplicity,

61. let's assume that our dataset has only two features:

62. the age and income of customers.

63. This means, it's a two-dimensional space.

64. We can show the distribution of customers using a scatter plot:

65. The Y-axis indicates age and the X-axis shows income of customers.

66. We try to cluster the customer dataset into

67. distinct groups or clusters based on these two dimensions.

68. In the first step, we should determine the number of clusters.

69. The key concept of the K-Means algorithm

70. is that it randomly picks a center point for each cluster.

71. It means we must initialize K which represents number of clusters.

72. Essentially, determining the number of clusters in

73. a dataset or K is a hard problem in K-Means,

74. that we will discuss later.

75. For now, let's put K equals three here for our sample dataset.

76. It is like we have three representative points for our clusters.

77. These three data points are called centroids of clusters

78. and should be of same feature size of our customer feature set.

79. There are two approaches to choose these centroids.

80. One, we can randomly choose three observations out of

81. the dataset and use these observations as the initial means.

82. Or two, we can create three random points as centroids of

83. the clusters which is our choice that is shown in the plot with red color.

84. After the initialization step which was defining the centroid of each cluster,

85. we have to assign each customer to the closest center.

86. For this purpose, we have to calculate the distance of

87. each data point or in our case each customer from the centroid points.

88. As mentioned before, depending on the nature


89. of the data and the purpose for which clustering is being

90. used different measures of distance may be used to place items into clusters.

91. Therefore, you will form a matrix where each row

92. represents the distance of a customer from each centroid.

93. It is called the Distance Matrix.

94. The main objective of K-Means clustering is to minimize the distance of data points from

95. the centroid of this cluster and maximize the distance from other cluster centroids.

96. So, in this step,

97. we have to find the closest centroid to each data point.

98. We can use the distance matrix to find the nearest centroid to datapoints.

99. Finding the closest centroids for each data point,

100. we assign each data point to that cluster.

101. In other words, all the customers will fall to

102. a cluster based on their distance from centroids.

103. We can easily say that it does not result in

104. good clusters because the centroids were chosen randomly at first.

105. Indeed, the model would have a high error.

106. Here, error is the total distance of each point from its centroid.

107. It can be shown as within-cluster sum of squares error.

108. Intuitively, we try to reduce this error.

109. It means we should shape clusters in such a way that

110. the total distance of all members of a cluster from its centroid be minimized.

111. Now, the question is,

112. how can we turn it into better clusters with less error?

113. Okay, we move centroids.

114. In the next step,

115. each cluster center will be updated to be the mean for datapoints in its cluster.

116. Indeed, each centroid moves according to their cluster members.

117. In other words the centroid of each of the three clusters becomes the new mean.

118. For example, if point A's coordinates are 7.4 and 3.6,

119. and point B's features are 7.8 and 3.8,

120. the new centroid of this cluster with two points would be the average of them,

121. which is 7.6 and 3.7.
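That centroid update is just the mean of the cluster's member points, as this one-line check shows:

```python
import numpy as np

points = np.array([[7.4, 3.6], [7.8, 3.8]])   # the two member points above
print(points.mean(axis=0))                     # -> [7.6, 3.7], the new centroid
```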

122. Now, we have new centroids.

123. As you can guess, once again we will have to

124. calculate the distance of all points from the new centroids.

125. The points are reclustered and the centroids move again.

126. This continues until the centroids no longer move.

127. Please note that whenever a centroid moves,

128. each point's distance to the centroid needs to be measured again.

129. Yes, K-Means is an iterative algorithm and we

130. have to repeat steps two to four until the algorithm converges.

131. In each iteration, it will move the centroids,

132. calculate the distances from

133. new centroids and assign data points to the nearest centroid.

134. It results in the clusters with minimum error or the most dense clusters.

135. However, as it is a heuristic algorithm,

136. there is no guarantee that it will converge to the global

137. optimum and the result may depend on the initial clusters.

138. It means, this algorithm is guaranteed to converge to

139. a result, but the result may be a local optimum i.e.

140. not necessarily the best possible outcome.

141. To solve this problem,

142. it is common to run the whole process multiple times with different starting conditions.

143. This means with randomized starting centroids,

144. it may give a better outcome.

145. As the algorithm is usually very fast,

146. it wouldn't be any problem to run it multiple times.
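A minimal sketch of the whole procedure with scikit-learn, on made-up age/income values; the n_init argument is what runs the algorithm several times with different random starting centroids:

```python
import numpy as np
from sklearn.cluster import KMeans

# Made-up (age, income) values, purely for illustration.
X = np.array([[25, 30], [27, 32], [45, 80], [47, 85], [60, 40], [62, 38]])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print(km.labels_)            # which cluster each customer falls into
print(km.cluster_centers_)   # final centroids: the mean of each cluster's members
print(km.inertia_)           # within-cluster sum of squares, the "error" above
```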

147. Thanks for watching this video.

Video: More on K-Means



2. Hello and welcome.

3. In this video, we'll look at k-Means accuracy and characteristics.

4. Let's get started.

5. Let's define the algorithm more concretely,

6. before we talk about its accuracy.

7. A k-Means algorithm works by randomly placing k centroids, one for each cluster.

8. The farther apart the clusters are placed, the better.

9. The next step is to calculate the distance of each data point or

10. object from the centroids.

11. Euclidean distance is used to measure the distance from the object to the centroid.

12. Please note, however, that you can also use

13. different types of distance measurements, not just Euclidean distance.

14. Euclidean distance is used because it's the most popular.

15. Then, assign each data point or object to its closest centroid creating a group.

16. Next, once each data point has been classified to a group,

17. recalculate the position of the k centroids.

18. The new centroid position is determined by the mean of all points in the group.

19. Finally, this continues until the centroids no longer move.

20. Now, the question is,

21. how can we evaluate the goodness of the clusters formed by k-Means?

22. In other words, how do we calculate the accuracy of k-Means clustering?

23. One way is to compare the clusters with the ground truth, if it's available.

24. However, because k-Means is an unsupervised algorithm

25. we usually don't have ground truth in real world problems to be used.

26. But there is still a way to say how bad each cluster is,

27. based on the objective of the k-Means.

28. This value is the average distance between data points within a cluster.

29. Also, average of the distances of data points from their cluster centroids
30. can be used as a metric of error for the clustering algorithm.

31. Essentially, determining the number of clusters in a data set, or

32. k as in the k-Means algorithm, is a frequent problem in data clustering.

33. The correct choice of K is often ambiguous because it's very dependent on the shape

34. and scale of the distribution of points in a dataset.

35. There are some approaches to address this problem, but

36. one of the techniques that is commonly used is to run the clustering across

37. the different values of K and looking at a metric of accuracy for clustering.

38. This metric can be the mean distance between data points and their cluster's centroid,

39. which indicates how dense our clusters are, or

40. to what extent we minimize the error of clustering.

41. Then, looking at the change of this metric, we can find the best value for K.

42. But the problem is that with increasing the number of clusters,

43. the distance of centroids to data points will always reduce.

44. This means increasing K will always decrease the error.

45. So, the value of the metric as a function of K is plotted and

46. the elbow point is determined where the rate of decrease sharply shifts.

47. It is the right K for clustering.

48. This method is called the elbow method.
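A minimal sketch of the elbow method on synthetic blob data (illustrative only):

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=4, random_state=0)

inertias = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)   # this error always drops as k grows

# Plotting inertias against k and looking for the "elbow", where the rate of
# decrease sharply shifts, suggests a reasonable value of K (here around 4).
print(inertias)
```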

49. So let's recap k-Means clustering:

50. k-Means is a partition-based clustering which is A,

51. relatively efficient on medium and large sized data sets;

52. B, produces sphere-like

53. clusters because the clusters are shaped around the centroids;

54. and C, its drawback is that we should pre-specify the number of clusters,

55. and this is not an easy task.

56. Thanks for watching.

Video: Hierarchical Clustering
2. Hello and welcome.

3. In this video,

4. we'll be covering Hierarchical Clustering.

5. So, let's get started.

6. Let's look at this chart.

7. An international team of scientists led by UCLA biologists used

8. this dendrogram to report genetic data from more than 900 dogs from 85 breeds,

9. and more than 200 wild gray wolves worldwide,

10. including populations from North America,

11. Europe, the Middle East, and East Asia.

12. They used molecular genetic techniques to analyze more than 48,000 genetic markers.

13. This diagram shows hierarchical clustering of

14. these animals based on the similarity in their genetic data.

15. Hierarchical clustering algorithms build a hierarchy of

16. clusters where each node is a cluster consisting of the clusters of its daughter nodes.

17. Strategies for hierarchical clustering generally fall into

18. two types, divisive and agglomerative.

19. Divisive is top down,

20. so you start with all observations in a large cluster and break it

21. down into smaller pieces. Think about divisive as dividing the cluster.

22. Agglomerative is the opposite of divisive.

23. So it is bottom up,

24. where each observation starts in its own cluster and

25. pairs of clusters are merged together as they move up the hierarchy.

26. Agglomeration means to amass or collect things,

27. which is exactly what this does with the cluster.

28. The agglomerative approach is more popular among

29. data scientists and so it is the main subject of this video.

30. Let's look at a sample of agglomerative clustering.


31. This method builds the hierarchy from

32. the individual elements by progressively merging clusters.

33. In our example, let's say we want to cluster

34. six cities in Canada based on their distances from one another.

35. They are Toronto, Ottawa,

36. Vancouver, Montreal, Winnipeg, and Edmonton.

37. We construct a distance matrix at this stage,

38. where the number in row i, column j is the distance between the ith and jth cities.

39. In fact, this table shows the distances between each pair of cities.

40. The algorithm is started by assigning each city to its own cluster.

41. So if we have six cities,

42. we have six clusters each containing just one city.

43. Let's note each city by showing the first two characters of its name.

44. The first step is to determine which cities,

45. let's call them clusters from now on,

46. to merge into a cluster.

47. Usually, we want to take the two closest clusters according to the chosen distance.

48. Looking at the distance matrix,

49. Montreal and Ottawa are the closest clusters so we make a cluster out of them.

50. Please notice that we just use a simple one-dimensional distance feature here,

51. but our object can be multidimensional and distance measurement can either be
Euclidean,

52. Pearson, average distance or many others depending on data type and domain
knowledge.

53. Anyhow, we have to merge these two closest cities in the distance matrix as well.

54. So, rows and columns are merged as the cluster is constructed.

55. As you can see in the distance matrix,

56. rows and columns related to Montreal and

57. Ottawa cities are merged as the cluster is constructed.

58. Then, the distances from all cities to this new merged cluster get updated. But how?

59. For example, how do we calculate the distance


60. from Winnipeg to the Ottawa/Montreal cluster?

61. Well, there are different approaches but let's assume for example,

62. we just select the distance from the center of the Ottawa/Montreal cluster to Winnipeg.

63. Updating the distance matrix,

64. we now have one less cluster.

65. Next, we look for the closest clusters once again.

66. In this case, Ottawa,

67. Montreal, and Toronto are the closest ones which creates another cluster.

68. In the next step, the closest distance is between

69. the Vancouver cluster and the Edmonton cluster.

70. Forming a new cluster,

71. the data in the matrix table gets updated.

72. Essentially, the rows and columns are merged

73. as the clusters are merged and the distance updated.

74. This is a common way to implement this type of

75. clustering and has the benefit of caching distances between clusters.

76. In the same way,

77. agglomerative algorithm proceeds by merging clusters,

78. and we repeat it until all clusters are merged and the tree becomes completed.

79. It means, until all cities are clustered into a single cluster of size six.

80. Hierarchical clustering is typically visualized as a dendrogram as shown on this slide.

81. Each merge is represented by a horizontal line.

82. The y-coordinate of the horizontal line is the similarity of

83. the two clusters that were merged where cities are viewed as singleton clusters.

84. By moving up from the bottom layer to the top node,

85. a dendrogram allows us to reconstruct the history

86. of merges that resulted in the depicted clustering.

87. Essentially, hierarchical clustering does not require a prespecified number of clusters.

88. However, in some applications,

89. we want a partition of disjoint clusters just as in flat clustering.


90. In those cases, the hierarchy needs to be cut at some point.

91. For example here, cutting in a specific level of similarity,

92. we create three clusters of similar cities.
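As a rough illustration of the procedure described above, here is a minimal sketch using SciPy's hierarchical clustering utilities. The city distances below are made-up placeholders rather than the exact values from the distance matrix in the video, and average linkage is chosen arbitrarily.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

# Illustrative (not real) pairwise distances between the six cities
cities = ["To", "Ot", "Va", "Mo", "Wi", "Ed"]
dist = np.array([
    [0,    400,  3360, 500,  1500, 2700],
    [400,  0,    3550, 170,  1450, 2850],
    [3360, 3550, 0,    3680, 1870, 820],
    [500,  170,  3680, 0,    1820, 3000],
    [1500, 1450, 1870, 1820, 0,    1200],
    [2700, 2850, 820,  3000, 1200, 0],
], dtype=float)

# linkage() expects a condensed distance matrix; it merges clusters bottom-up
Z = linkage(squareform(dist), method="average")

# Cut the hierarchy so that we end up with three flat clusters
labels = fcluster(Z, t=3, criterion="maxclust")
print(dict(zip(cities, labels)))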

93. This concludes this video. Thanks for watching.


Video
More on hierarchical clustering



2. Hello and welcome.

3. In this video, we'll be covering more details about hierarchical clustering.

4. Let's get started.

5. Let's look at agglomerative algorithm for hierarchical clustering.

6. Remember that agglomerative clustering is a bottom up approach.

7. Let's say our data set has n data points.

8. First, we want to create n clusters, one for each data point.

9. Then, each point is assigned as a cluster.

10. Next, we want to compute the distance proximity matrix

11. which will be an n by n table.

12. After that, we want to iteratively run the following steps until the specified
13. cluster number is reached, or until there is only one cluster left.

14. First, merge the two nearest clusters.

15. Distances are computed already in the proximity matrix.

16. Second, update the proximity matrix with the new values.

17. We stop after we've reached the specified number of clusters, or

18. there is only one cluster remaining with the result stored in a dendrogram.
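To make the loop above concrete, here is a rough, unoptimized sketch of the agglomerative procedure on a handful of made-up 2-D points. It uses single-linkage distances purely for illustration; a production implementation would cache and update the proximity matrix instead of recomputing it.

import numpy as np

X = np.array([[1.0, 1.0], [1.2, 0.8], [5.0, 5.0], [5.1, 4.9], [9.0, 1.0]])
n_clusters = 2

clusters = [[i] for i in range(len(X))]      # start with one cluster per point

def single_link(a, b):
    # shortest distance between any point in cluster a and any point in cluster b
    return min(np.linalg.norm(X[i] - X[j]) for i in a for j in b)

while len(clusters) > n_clusters:
    # find the pair of clusters with the smallest linkage distance
    pairs = [(single_link(clusters[i], clusters[j]), i, j)
             for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
    _, i, j = min(pairs)
    clusters[i] = clusters[i] + clusters[j]  # merge cluster j into cluster i
    del clusters[j]                          # one less cluster

print(clusters)                              # -> [[0, 1], [2, 3, 4]] for this toy data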

19. So in the proximity matrix, we have to measure the distances between clusters and

20. also merge the clusters that are nearest.

21. So, the key operation is the computation of the proximity between

22. the clusters with one point and also clusters with multiple data points.

23. At this point, there are a number of key questions that need to be answered.

24. For instance, how do we measure the distances between these clusters and

25. how do we define the nearest among clusters?

26. We also can ask, which points do we use?

27. First, let's see how to calculate the distance between two clusters

28. with one point each.

29. Let's assume that we have a data set of patients and

30. we want to cluster them using hierarchical clustering.

31. So our data points are patients with a featured set of three dimensions.

32. For example, age, body mass index (BMI), and blood pressure.

33. We can use different distance measurements to calculate the proximity matrix.

34. For instance, Euclidean distance.

35. So, if we have a data set of n patients,

36. we can build an n by n dissimilarity distance matrix.

37. It will give us the distance of clusters with one data point.
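For instance, a dissimilarity matrix for a handful of hypothetical patients could be built with scikit-learn's pairwise distance helper; the age, BMI and blood-pressure values below are invented for illustration.

import numpy as np
from sklearn.metrics.pairwise import euclidean_distances
from sklearn.preprocessing import MinMaxScaler

patients = np.array([
    [54, 27.1, 130],   # age, BMI, blood pressure (made-up values)
    [61, 31.4, 142],
    [35, 22.8, 118],
    [47, 25.0, 125],
])

# The features are on very different scales, so normalize before measuring distance
scaled = MinMaxScaler().fit_transform(patients)
dist_matrix = euclidean_distances(scaled)    # n-by-n distance matrix
print(np.round(dist_matrix, 2))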

38. However, as mentioned, we merge clusters in agglomerative clustering.

39. Now the question is, how can we calculate the distance between clusters

40. when there are multiple patients in each cluster?

41. We can use different criteria to find the closest clusters and merge them.

42. In general, it completely depends on the data type, dimensionality of data and
43. most importantly, the domain knowledge of the data set.

44. In fact, different approaches to defining the distance between clusters

45. distinguish the different algorithms.

46. As you might imagine, there are multiple ways we can do this.

47. The first one is called single linkage clustering.

48. Single linkage is defined as the shortest distance between two points in

49. each cluster, such as point a and b.

50. Next up is complete linkage clustering.

51. This time, we are finding the longest distance between the points in

52. each cluster, such as the distance between point a and b.

53. The third type of linkage is average linkage clustering or the mean distance.

54. This means we're looking at the average distance of each point from one cluster

55. to every point in another cluster.

56. The final linkage type to be reviewed is centroid linkage clustering.

57. Centroid is the average of the feature sets of points in a cluster.

58. This linkage takes into account the centroid of each cluster

59. when determining the minimum distance.
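These four criteria map directly onto the method argument of SciPy's linkage function; the sketch below just shows how each one would be selected, using random made-up data.

import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(0)
X = rng.random((10, 3))                     # 10 made-up patients, 3 features each

Z_single   = linkage(X, method="single")    # shortest distance between clusters
Z_complete = linkage(X, method="complete")  # longest distance between clusters
Z_average  = linkage(X, method="average")   # mean pairwise distance
Z_centroid = linkage(X, method="centroid")  # distance between cluster centroids
print(Z_single.shape)                       # (n - 1) merges, one per row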

60. There are three main advantages to using hierarchical clustering.

61. First, we do not need to specify the number of clusters required for

62. the algorithm.

63. Second, hierarchical clustering is easy to implement.

64. And third, the dendrogram produced is very useful in understanding the data.

65. There are some disadvantages as well.

66. First, the algorithm can never undo any previous steps.

67. So for example, the algorithm clusters two points and

68. later on, we see that the connection was not a good one.

69. The program can not undo that step.

70. Second, the time complexity for the clustering can result in very long

71. computation times in comparison with efficient algorithms such as K-means.

72. Finally, if we have a large data set, it can become difficult to determine
73. the correct number of clusters by the dendrogram.

74. Now, let's compare hierarchical clustering with K-means.

75. K-means is more efficient for large data sets.

76. In contrast to K-means,

77. hierarchical clustering does not require the number of clusters to be specified.

78. Hierarchical clustering gives more than one partitioning depending on

79. the resolution, whereas K-means gives only one partitioning of the data.

80. Hierarchical clustering always generates the same clusters,

81. in contrast with K-means, which may return different clusters each time it is run,

82. due to random initialization of centroids.
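Both points can be seen in a small sketch with scikit-learn, shown here only as an illustration on made-up data: the hierarchical model is cut by a distance threshold rather than a cluster count and returns the same labels on every run, while K-means needs K up front and depends on its random initialization.

import numpy as np
from sklearn.cluster import AgglomerativeClustering, KMeans

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (20, 2))])

# No cluster count needed: cut the tree at a chosen distance threshold instead
agg = AgglomerativeClustering(n_clusters=None, distance_threshold=2.0)
print(np.unique(agg.fit_predict(X)))        # deterministic for a given dataset

# K-means needs K up front, and its labels depend on centroid initialization
for seed in (0, 1, 2):
    km = KMeans(n_clusters=2, n_init=1, random_state=seed)
    print(km.fit_predict(X)[:5])            # cluster IDs may differ between runs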

83. Thanks for watching.

Video
DBSCAN



2. Hello and welcome. In this video,

3. we'll be covering DBSCAN,

4. a density-based clustering algorithm which is

5. appropriate to use when examining spatial data.

6. So let's get started.

7. Most of the traditional clustering techniques such as K-Means, hierarchical,

8. and Fuzzy clustering can be used to group data in an unsupervised way.

9. However, when applied to tasks with

10. arbitrary shaped clusters or clusters within clusters,

11. traditional techniques might not be able to achieve good results, that is

12. elements in the same cluster might not share

13. enough similarity or the performance may be poor.

14. Additionally, while partitioning based algorithms such as

15. K-Means may be easy to understand and implement in practice,

16. the algorithm has no notion of outliers that is,


17. all points are assigned to a cluster even if they do not belong in any.

18. In the domain of anomaly detection,

19. this causes problems as anomalous points will be

20. assigned to the same cluster as normal data points.

21. The anomalous points pull the cluster centroid towards

22. them making it harder to classify them as anomalous points.

23. In contrast, density-based clustering locates regions of

24. high density that are separated from one another by regions of low density.

25. Density in this context is defined as the number of points within a specified radius.

26. A specific and very popular type of density-based clustering is DBSCAN.

27. DBSCAN is particularly effective for tasks

28. like class identification on a spatial context.

29. A wonderful attribute of the DBSCAN algorithm is that it can

30. find arbitrarily shaped clusters without being affected by noise.

31. For example, this map shows the location of weather stations in Canada.

32. DBSCAN can be used here to find

33. the group of stations which show the same weather condition.

34. As you can see, it not only finds different arbitrarily shaped clusters, it can also find

35. the denser parts of the data while ignoring less dense areas or noise.

36. Now, let's look at this clustering algorithm to see how it works.

37. DBSCAN stands for Density-Based Spatial Clustering of Applications with Noise.

38. This technique is one of

39. the most common clustering algorithms, which works based on the density of objects.

40. DBSCAN works on the idea that if a particular point belongs to

41. a cluster, it should be near lots of other points in that cluster.

42. It works based on two parameters:

43. radius and minimum points.

44. R determines a specified radius that if it includes enough points within it,

45. we call it a dense area.

46. M determines the minimum number of data points we


47. want in a neighborhood to define a cluster.

48. Let's define radius as two units.

49. For the sake of simplicity,

50. assume it has radius of two centimeters around a point of interest.

51. Also, let's set the minimum point or M to be six points including the point of interest.

52. To see how DBSCAN works,

53. we have to determine the type of points.

54. Each point in our dataset can be either a core,

55. border, or outlier point.

56. Don't worry, I'll explain what these points are in a moment.

57. But the whole idea behind the DBSCAN algorithm

58. is to visit each point and find its type first,

59. then we group points as clusters based on their types.

60. Let's pick a point randomly.

61. First, we check to see whether it's a core data point.

62. So, what is a core point?

63. A data point is a core point if within

64. the R-neighborhood of the point there are at least M points.

65. For example, as there are six points in the two centimeter neighborhood of the red point,

66. we mark this point as a core point.

67. Okay, what happens if it's not a core point?

68. Let's look at another point.

69. Is this point a core point?

70. No. As you can see,

71. there are only five points in this neighborhood including the yellow point,

72. so what kind of point is this one?

73. In fact, it is a border point.

74. What is a border point?

75. A data point is a border point if (a)

76. its neighborhood contains fewer than M data points, and (b)


77. it is reachable from some core point.

78. Here, reachability means it is within R distance from a core point.

79. It means that even though the yellow point is

80. within the two centimeter neighborhood of the red point,

81. it is not by itself a core point because it

82. does not have at least six points in its neighborhood.

83. We continue with the next point.

84. As you can see, it is also a core point

85. and all points around it which are not core points are border points.

86. Next core point and next core point.

87. Let's pick this point.

88. You can see it is not a core point nor is it a border point.

89. So, we'd label it as an outlier.

90. What is an outlier?

91. An outlier is a point that is not a core point and

92. also is not close enough to be reachable from a core point.

93. We continue and visit all the points in the dataset

94. and label them as either core, border, or outlier.

95. The next step is to connect core points that are

96. neighbors and put them in the same cluster.

97. So, a cluster is formed as

98. at least one core point, plus all reachable core points, plus all their border points.

99. This simply shapes all the clusters and finds the outliers as well.

100. Let's review this one more time to see why DBSCAN is cool.

101. DBSCAN can find arbitrarily shaped clusters.

102. It can even find a cluster completely surrounded by a different cluster.

103. DBSCAN has a notion of noise and is robust to outliers.

104. On top of that,

105. DBSCAN is very practical for use in many real-world problems because it does

106. not require one to specify the number of clusters such as K in K-means.
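As a rough sketch of how these ideas look in code, scikit-learn's DBSCAN exposes the radius as eps and the minimum number of points as min_samples; the data and parameter values below are invented for illustration.

import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(42)
dense_blob_1 = rng.normal(loc=[0.0, 0.0], scale=0.2, size=(50, 2))
dense_blob_2 = rng.normal(loc=[3.0, 3.0], scale=0.2, size=(50, 2))
noise = rng.uniform(low=-2, high=5, size=(10, 2))
X = np.vstack([dense_blob_1, dense_blob_2, noise])

db = DBSCAN(eps=0.5, min_samples=6).fit(X)  # eps plays the role of R, min_samples of M
labels = db.labels_                         # label -1 marks outlier (noise) points

print("clusters found:", len(set(labels)) - (1 if -1 in labels else 0))
print("points labelled as noise:", list(labels).count(-1))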
Video
Recommender system



2. Hello, and welcome! In this video, we’ll be going through a

3. quick introduction to recommendation systems. So, let’s get started.

4. Even though peoples’ tastes may vary, they generally follow patterns.

5. By that, I mean that there are similarities in the things that people tend to like … or

6. another way to look at it, is that people tend to like things in the same category or

7. things that share the same characteristics. For example, if you’ve recently purchased

8. a book on Machine Learning in Python and you’ve enjoyed reading it, it’s very likely that

9. you’ll also enjoy reading a book on Data Visualization.

10. People also tend to have similar tastes to those of the people they’re close to in

11. their lives. Recommender systems try to capture these patterns

12. and similar behaviors, to help predict what else you might like.

13. Recommender systems have many applications that I’m sure you’re already familiar

14. with. Indeed, Recommender systems are usually at

15. play on many websites. For example, suggesting books on Amazon and

16. movies on Netflix. In fact, everything on Netflix’s website


17. is driven by customer selection. If a certain movie gets viewed frequently

18. enough, Netflix’s recommender system ensures that that movie gets an increasing
number

19. of recommendations.

20. Another example can be found in a daily-use mobile app, where a recommender engine
is

21. used to recommend anything from where to eat to what job to apply to.

22. On social media, sites like Facebook or LinkedIn, regularly recommend friendships.

23. Recommender systems are even used to personalize your experience on the web.

24. For example, when you go to a news platform website, a recommender system will make
note

25. of the types of stories that you clicked on and make recommendations on which types of

26. stories you might be interested in reading in future.

27. There are many of these types of examples and they are growing in number every day.

28. So, let’s take a closer look at the main benefits of using a recommendation system.

29. One of the main advantages of using recommendation systems is that users get a
broader exposure

30. to many different products they might be interested in.

31. This exposure encourages users towards continual usage or purchase of their product.

32. Not only does this provide a better experience for the user but it benefits the service
provider,

33. as well, with increased potential revenue and better security for its customers.

34. There are generally 2 main types of recommendation systems: Content-based and
collaborative filtering.

35. The main difference between each, can be summed up by the type of statement that a
consumer

36. might make. For instance, the main paradigm of a Content-based

37. recommendation system is driven by the statement: “Show me more of the same of what
I've liked before."

38. Content-based systems try to figure out what

39. a user's favorite aspects of an item are, and then make recommendations on items that

40. share those aspects.


41. Collaborative filtering is based on a user saying, “Tell me what's popular among my

42. neighbors because I might like it too.” Collaborative filtering techniques find similar

43. groups of users, and provide recommendations based on similar tastes within that group.

44. In short, it assumes that a user might be interested in what similar users are interested in.

45. Also, there are Hybrid recommender systems,

46. which combine various mechanisms.

47. In terms of implementing recommender systems, there are 2 types: Memory-based and
Model-based.

48. In memory-based approaches, we use the entire user-item dataset to generate a


recommendation

49. system. It uses statistical techniques to approximate

50. users or items. Examples of these techniques include: Pearson

51. Correlation, Cosine Similarity and Euclidean Distance, among others.

52. In model-based approaches, a model of users is developed in an attempt to learn their

53. preferences. Models can be created using Machine Learning

54. techniques like regression, clustering, classification, and so on.
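For a sense of what the memory-based similarity computations mentioned above look like, here is a tiny sketch comparing two made-up user rating vectors; the numbers are purely illustrative.

import numpy as np
from scipy.stats import pearsonr
from scipy.spatial.distance import cosine

user_a = np.array([5.0, 3.0, 4.0, 1.0])    # made-up ratings of the same four items
user_b = np.array([4.0, 3.5, 5.0, 2.0])

pearson_sim, _ = pearsonr(user_a, user_b)
cosine_sim = 1 - cosine(user_a, user_b)    # scipy's cosine() returns a distance
euclidean_dist = np.linalg.norm(user_a - user_b)

print("Pearson: %.2f, cosine: %.2f, Euclidean distance: %.2f"
      % (pearson_sim, cosine_sim, euclidean_dist))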

55. This is the end of our video. Thanks for watching!

Video 2
Content-based recommender



2. Hello, and welcome.

3. In this video,

4. we'll be covering Content-Based Recommender Systems. So let's get started.

5. A Content-based recommendation system tries

6. to recommend items to users based on their profile.

7. The user's profile revolves around that user's preferences and tastes.

8. It is shaped based on user ratings,

9. including the number of times that user has clicked on

10. different items or perhaps even liked those items.

11. The recommendation process is based on the similarity between those items.
12. Similarity or closeness of items is

13. measured based on the similarity in the content of those items.

14. When we say content,

15. we're talking about things like the item's category,

16. tag, genre, and so on.

17. For example, if we have four movies,

18. and if the user likes or rates the first two items,

19. and if Item 3 is similar to Item 1 in terms of their genre,

20. the engine will also recommend Item 3 to the user.

21. In essence, this is what content-based recommender system engines do.

22. Now, let's dive into a content-based recommender system to see how it works.

23. Let's assume we have a data set of only six movies.

24. This data set shows movies that our user has

25. watched and also the genre of each of the movies.

26. For example, Batman versus Superman is in the Adventure,

27. Super Hero genre and Guardians of the Galaxy is in the Comedy,

28. Adventure, Super Hero and Science-fiction genres.

29. Let's say the user has watched and rated three movies so

30. far and she has given a rating of two out of 10 to the first movie,

31. 10 out of 10 to the second movie and eight out of 10 to the third.

32. The task of the recommender engine is to recommend one of

33. the three candidate movies to this user, or, in other

34. words, we want to predict what the user's possible rating would

35. be of the three candidate movies if she were to watch them.

36. To achieve this, we have to build the user profile.

37. First, we create a vector to show

38. the user's ratings for the movies that she's already watched.

39. We call it Input User Ratings.

40. Then, we encode the movies through the one-hot encoding approach.

41. The genres of the movies are used here as the feature set.


42. We use the first three movies to make this matrix,

43. which represents the movie feature set matrix.

44. If we multiply these two matrices we can get the weighted feature set for the movies.

45. Let's take a look at the result.

46. This matrix is also called the Weighted Genre matrix and represents

47. the interests of the user for each genre based on the movies that she's watched.

48. Now, given the Weighted Genre Matrix,

49. we can shape the profile of our active user.

50. Essentially, we can aggregate

51. the weighted genres and then normalize them to find the user profile.

52. It clearly indicates that she likes superhero movies more than other genres.

53. We use this profile to figure out what movie is proper to recommend to this user.

54. Recall that we also had three candidate movies for

55. recommendation that haven't been watched by the user;

56. we encode these movies as well.

57. Now we're in the position where we have to figure out

58. which of them is most suited to be recommended to the user.

59. To do this, we simply multiply the User Profile matrix by the candidate Movie Matrix,

60. which results in the Weighted Movies Matrix.

61. It shows the weight of each genre with respect to the User Profile.

62. Now, if we aggregate these weighted ratings,

63. we get the active user's possible interest level in these three movies.

64. In essence, it's our recommendation lists,

65. which we can sort to rank the movies and recommend them to the user.

66. For example, we can say that the Hitchhiker's Guide to the Galaxy

67. has the highest score in our list, so it's an appropriate one to recommend to the user.
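The whole procedure fits in a few lines of NumPy. The sketch below follows the same steps with an invented genre encoding and the ratings 2, 10 and 8 from the example; the candidate movies and genre columns are placeholders, not the actual data from the video.

import numpy as np

# One-hot genre encoding of the three movies the user has already rated
# Columns (made up): Adventure, Super Hero, Comedy, Sci-Fi
watched_genres = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
])
input_ratings = np.array([2, 10, 8])

# Weighted genre matrix, then aggregate and normalize to get the user profile
weighted = watched_genres * input_ratings[:, None]
profile = weighted.sum(axis=0)
profile = profile / profile.sum()

# Score the unwatched candidate movies by their genre overlap with the profile
candidate_genres = np.array([
    [1, 0, 1, 1],
    [0, 1, 0, 0],
    [1, 1, 0, 1],
])
scores = candidate_genres @ profile
print(dict(zip(["Candidate 1", "Candidate 2", "Candidate 3"], np.round(scores, 2))))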

68. Now, you can come back and fill in the predicted ratings for the user.

69. So, to recap what we've discussed so far,

70. the recommendation in a content-based system is based on

71. the user's taste and the content or feature set of the items.


72. Such a model is very efficient.

73. However, in some cases, it doesn't work.

74. For example, assume that we have a movie in the drama genre,

75. which the user has never watched.

76. So, this genre would not be in her profile.

77. Therefore, she will only get recommendations related to genres that are already in

78. her profile and the recommender engine may never recommend any movie within other
genres.

79. This problem can be solved by other types of

80. recommender systems such as collaborative filtering. Thanks for watching.

Video 3
Collaborative recommender



2. Hello, and welcome.

3. In this video,

4. we'll be covering a recommender system technique called collaborative filtering.

5. So let's get started.

6. Collaborative filtering is based on the fact that

7. relationships exist between products and people's interests.

8. Many recommendation systems use collaborative filtering to find these relationships

9. and to give an accurate recommendation of

10. a product that the user might like or be interested in.

11. Collaborative filtering has basically two approaches: user-based and item-based.

12. User-based collaborative filtering is based on the user similarity or neighborhood.

13. Item-based collaborative filtering is based on similarity among items.

14. Let's first look at the intuition behind the user-based approach.

15. In user-based collaborative filtering,

16. we have an active user for whom the recommendation is aimed.

17. The collaborative filtering engine first looks for users who are similar.

18. That is, users who share the active user's rating patterns.
19. Collaborative filtering bases this similarity on things like history,

20. preference, and choices that users make when buying,

21. watching, or enjoying something.

22. For example, movies that similar users have rated highly.

23. Then it uses the ratings from these similar users to predict

24. the possible ratings by the active user for a movie that she had not previously watched.

25. For instance, if two users are similar or

26. are neighbors in terms of their interested movies,

27. we can recommend a movie to the active user that her neighbor has already seen.

28. Now, let's dive into the algorithm to see how all of this works.

29. Assume that we have a simple user item matrix,

30. which shows the ratings of four users for five different movies.

31. Let's also assume that our active user has

32. watched and rated three out of these five movies.

33. Let's find out which of the two movies that

34. our active user hasn't watched should be recommended to her.

35. The first step is to discover how similar

36. the active user is to the other users. How do we do this?

37. Well, this can be done through

38. several different statistical and vectorial techniques such as

39. distance or similarity measurements including Euclidean Distance,

40. Pearson Correlation, Cosine Similarity, and so on.

41. To calculate the level of similarity between two users,

42. we use the three movies that both the users have rated in the past.

43. Regardless of what we use for similarity measurement,

44. let's say for example,

45. the similarity could be 0.7,

46. 0.9, and 0.4 between the active user and other users.

47. These numbers represent similarity weights or

48. proximity of the active user to other users in the dataset.


49. The next step is to create a weighted rating matrix.

50. We just calculated the similarity of users to our active user in the previous slide.

51. Now, we can use it to calculate the possible opinion

52. of the active user about our two target movies.

53. This is achieved by multiplying the similarity weights by the user ratings.

54. It results in a weighted ratings matrix,

55. which represents the user's neighbors' opinions

56. about our two candidate movies for recommendation.

57. In fact, it incorporates the behavior of other users and gives

58. more weight to the ratings of those users who are more similar to the active user.

59. Now, we can generate the recommendation matrix by aggregating all of the weighted
rates.

60. However, as three users rated

61. the first potential movie and two users rated the second movie,

62. we have to normalize the weighted rating values.

63. We do this by dividing it by the sum of the similarity index for users.

64. The result is the potential rating that our active user will

65. give to these movies based on her similarity to other users.

66. It is obvious that we can use it to rank

67. the movies for providing recommendations to our active user.
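A minimal sketch of this computation is shown below. The similarity weights 0.7, 0.9 and 0.4 come from the example above, while the neighbors' ratings for the two candidate movies are invented; NaN marks a movie a neighbor has not rated.

import numpy as np

similarity = np.array([0.7, 0.9, 0.4])      # active user vs. the three other users

# Each row: one neighbor's ratings for the two candidate movies (made-up values)
neighbor_ratings = np.array([
    [3.0, np.nan],
    [4.0, 5.0],
    [4.0, 3.0],
])

weighted = neighbor_ratings * similarity[:, None]
rated_mask = ~np.isnan(neighbor_ratings)

# Normalize by the summed similarity of the users who actually rated each movie
predicted = np.nansum(weighted, axis=0) / (similarity[:, None] * rated_mask).sum(axis=0)
print(np.round(predicted, 2))               # predicted ratings for the two movies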

68. Now, let's examine what's different

69. between user-based and item-based collaborative filtering.

70. In the user-based approach,

71. the recommendation is based on users of

72. the same neighborhood with whom he or she shares common preferences.

73. For example, as User 1 and User 3 both liked Item 3 and Item 4,

74. we consider them as similar or neighbor users,

75. and recommend Item 1 which is positively rated by User 1 to User 3.

76. In the item-based approach,

77. similar items build neighborhoods on the behavior of users.


78. Please note however, that it is not based on their contents.

79. For example, Item 1 and Item 3 are considered

80. neighbors as they were positively rated by both User 1 and User 2.

81. So, Item 1 can be recommended to User 3 as he has already shown interest in Item 3.

82. Therefore, the recommendations here are based on

83. the items in the neighborhood that a user might prefer.

84. Collaborative filtering is a very effective recommendation system.

85. However, there are some challenges with it as well.

86. One of them is data sparsity.

87. Data sparsity happens when you have a large data set of

88. users who generally rate only a limited number of items.

89. As mentioned, collaborative based recommenders can only

90. predict scoring of an item if there are other users who have rated it.

91. Due to sparsity, we might not have enough ratings in the user item

92. dataset which makes it impossible to provide proper recommendations.

93. Another issue to keep in mind is something called cold start.

94. Cold start refers to the difficulty

95. the recommendation system has when there is a new user,

96. and as such a profile doesn't exist for them yet.

97. Cold start can also happen when we have a new item which has not received a rating.

98. Scalability can become an issue as well.

99. As the number of users or items increases and the amount of data expands,

100. collaborative filtering algorithms will begin to suffer drops in performance,

101. simply due to the growth of the similarity computation.

102. There are some solutions for each of these challenges

103. such as using hybrid based recommender systems,

104. but they are out of scope of this course.

105. Thanks for watching.

LABS-1
Simple Linear Regression

Simple Linear Regression


Estimated time needed: 15 minutes

Objectives
After completing this lab you will be able to:

 Use scikit-learn to implement simple Linear Regression


 Create a model, train,test and use the model

Importing Needed packages

import matplotlib.pyplot as plt


import pandas as pd
import pylab as pl
import numpy as np
%matplotlib inline

Downloading Data

To download the data, we will use !wget to download it from IBM Object Storage.
!wget -O [Link] [Link]
[Link]/IBMDeveloperSkillsNetwork-ML0101EN-SkillsNetwork/labs/Module
%202/data/[Link]

[Link]:
We have downloaded a fuel consumption dataset, [Link], which contains model-specific
fuel consumption ratings and estimated carbon dioxide emissions for new light-duty vehicles for retail sale
in Canada. Dataset source
 MODELYEAR e.g. 2014
 MAKE e.g. Acura
 MODEL e.g. ILX
 VEHICLE CLASS e.g. SUV
 ENGINE SIZE e.g. 4.7
 CYLINDERS e.g 6
 TRANSMISSION e.g. A6
 FUEL CONSUMPTION in CITY(L/100 km) e.g. 9.9
 FUEL CONSUMPTION in HWY (L/100 km) e.g. 8.9
 FUEL CONSUMPTION COMB (L/100 km) e.g. 9.2
 CO2 EMISSIONS (g/km) e.g. 182 --> low --> 0

Reading the data in


df = pd.read_csv("[Link]")

# take a look at the dataset


df.head()

Data Exploration

Lets first have a descriptive exploration on our data.

# summarize the data


df.describe()

Lets select some features to explore more.


cdf = df[['ENGINESIZE','CYLINDERS','FUELCONSUMPTION_COMB','CO2EMISSIONS']]
cdf.head(9)
We can plot each of these features:

viz = cdf[['CYLINDERS','ENGINESIZE','CO2EMISSIONS','FUELCONSUMPTION_COMB']]
viz.hist()
plt.show()

Now, let's plot each of these features against Emission, to see how linear their relationship is:

plt.scatter(cdf.FUELCONSUMPTION_COMB, cdf.CO2EMISSIONS, color='blue')


plt.xlabel("FUELCONSUMPTION_COMB")
plt.ylabel("Emission")
plt.show()

Practice
Plot CYLINDERS against Emission, to see how linear their relationship is:

plt.scatter(cdf.CYLINDERS, cdf.CO2EMISSIONS, color='blue')

plt.xlabel("CYLINDER")

plt.ylabel("Emission")
plt.show()

Click here for the solution


plt.scatter(cdf.CYLINDERS, cdf.CO2EMISSIONS, color='blue')
plt.xlabel("Cylinders")
plt.ylabel("Emission")
plt.show()

Creating train and test dataset

Train/Test Split involves splitting the dataset into training and testing sets
respectively, which are mutually exclusive. After which, you train with the training set
and test with the testing set. This will provide a more accurate evaluation on out-of-
sample accuracy because the testing dataset is not part of the dataset that has been
used to train the model. It is more realistic for real-world problems.

This means that we know the outcome of each data point in this dataset, making it
great to test with! And since this data has not been used to train the model, the
model has no knowledge of the outcome of these data points. So, in essence, it is
truly an out-of-sample testing.

Let's split our dataset into train and test sets: 80% of the entire data for training, and
the remaining 20% for testing. We create a mask to select random rows
using the np.random.rand() function:

msk = np.random.rand(len(df)) < 0.8

train = cdf[msk]
test = cdf[~msk]

Simple Regression Model

Linear Regression fits a linear model with coefficients B = (B1, ..., Bn) to minimize the
'residual sum of squares' between the actual value y in the dataset, and the predicted
value yhat using linear approximation.

Train data distribution

plt.scatter(train.ENGINESIZE, train.CO2EMISSIONS, color='blue')

plt.xlabel("Engine size")

plt.ylabel("Emission")
plt.show()

Modeling
Using sklearn package to model data.
from sklearn import linear_model
regr = linear_model.LinearRegression()
train_x = np.asanyarray(train[['ENGINESIZE']])
train_y = np.asanyarray(train[['CO2EMISSIONS']])
regr.fit(train_x, train_y)
# The coefficients
print ('Coefficients: ', regr.coef_)
print ('Intercept: ',regr.intercept_)

Coefficients: [[39.33605979]]
Intercept: [124.31439341]
As mentioned before, Coefficient and Intercept in the simple linear regression, are the parameters of the
fit line. Given that it is a simple linear regression, with only 2 parameters, and knowing that the parameters
are the intercept and slope of the line, sklearn can estimate them directly from our data. Notice that all of
the data must be available to traverse and calculate the parameters.

Plot outputs

We can plot the fit line over the data:


plt.scatter(train.ENGINESIZE, train.CO2EMISSIONS, color='blue')
plt.plot(train_x, regr.coef_[0][0]*train_x + regr.intercept_[0], '-r')
plt.xlabel("Engine size")
plt.ylabel("Emission")

Evaluation

We compare the actual values and predicted values to calculate the accuracy of a
regression model. Evaluation metrics play a key role in the development of a
model, as they provide insight into areas that require improvement.

There are different model evaluation metrics; let's use MSE here to calculate the
accuracy of our model based on the test set:

- Mean Absolute Error (MAE): the mean of the absolute value of the errors. This is the easiest of the
metrics to understand, since it’s just the average error.
- Mean Squared Error (MSE): the mean of the squared error. It’s
more popular than mean absolute error because the focus is geared more towards large errors:
the squared term amplifies larger errors far more than smaller
ones.
- Root Mean Squared Error (RMSE): the square root of the MSE, expressed in the same units as the target.
- R-squared is not an error metric, but is a popular measure of the accuracy of your model. It represents how
close the data are to the fitted regression line. The higher the R-squared, the better the model fits
your data. The best possible score is 1.0 and it can be negative (because the model can be arbitrarily
worse).

from sklearn.metrics import r2_score

test_x = np.asanyarray(test[['ENGINESIZE']])
test_y = np.asanyarray(test[['CO2EMISSIONS']])
test_y_ = regr.predict(test_x)

print("Mean absolute error: %.2f" % np.mean(np.absolute(test_y_ - test_y)))


print("Residual sum of squares (MSE): %.2f" % np.mean((test_y_ - test_y) ** 2))
print("R2-score: %.2f" % r2_score(test_y, test_y_))

Want to learn more?


IBM SPSS Modeler is a comprehensive analytics platform that has many machine
learning algorithms. It has been designed to bring predictive intelligence to decisions
made by individuals, by groups, by systems – by your enterprise as a whole. A free
trial is available through this course, available here: SPSS Modeler
Also, you can use Watson Studio to run these notebooks faster with bigger datasets.
Watson Studio is IBM's leading cloud solution for data scientists, built by data
scientists. With Jupyter notebooks, RStudio, Apache Spark and popular libraries pre-
packaged in the cloud, Watson Studio enables data scientists to collaborate on their
projects without having to install anything. Join the fast-growing community of
Watson Studio users today with a free account at Watson Studio

Lab
Multiple Linear Regression

Multiple Linear Regression


Estimated time needed: 15 minutes

Objectives
After completing this lab you will be able to:

 Use scikit-learn to implement Multiple Linear Regression


 Create a model, train,test and use the model

Table of contents
1. Understanding the Data
2. Reading the Data in
3. Multiple Regression Model
4. Prediction
5. Practice

Importing Needed packages


import matplotlib.pyplot as plt
import pandas as pd
import pylab as pl
import numpy as np
%matplotlib inline

Downloading Data

To download the data, we will use !wget to download it from IBM Object Storage.
!wget -O [Link] [Link]
[Link]/IBMDeveloperSkillsNetwork-ML0101EN-
SkillsNetwork/labs/Module%202/data/[Link]

Understanding the Data


[Link]:
We have downloaded a fuel consumption dataset, [Link], which contains model-specific
fuel consumption ratings and estimated carbon dioxide emissions for new light-duty vehicles for retail sale
in Canada. Dataset source
 MODELYEAR e.g. 2014
 MAKE e.g. Acura
 MODEL e.g. ILX
 VEHICLE CLASS e.g. SUV
 ENGINE SIZE e.g. 4.7
 CYLINDERS e.g 6
 TRANSMISSION e.g. A6
 FUELTYPE e.g. z
 FUEL CONSUMPTION in CITY(L/100 km) e.g. 9.9
 FUEL CONSUMPTION in HWY (L/100 km) e.g. 8.9
 FUEL CONSUMPTION COMB (L/100 km) e.g. 9.2
 CO2 EMISSIONS (g/km) e.g. 182 --> low --> 0

Reading the data in


df = pd.read_csv("[Link]")

# take a look at the dataset


df.head()
Let's select some features that we want to use for regression.
cdf =
df[['ENGINESIZE','CYLINDERS','FUELCONSUMPTION_CITY','FUELCONSUMPTION_HWY'
,'FUELCONSUMPTION_COMB','CO2EMISSIONS']]
cdf.head(9)
Let's plot Emission values with respect to Engine size:

plt.scatter(cdf.ENGINESIZE, cdf.CO2EMISSIONS, color='blue')


plt.xlabel("Engine size")
plt.ylabel("Emission")
plt.show()
Creating train and test dataset

Train/Test Split involves splitting the dataset into training and testing sets
respectively, which are mutually exclusive. After which, you train with the training set
and test with the testing set. This will provide a more accurate evaluation on out-of-
sample accuracy because the testing dataset is not part of the dataset that has been
used to train the model. It is more realistic for real-world problems.

This means that we know the outcome of each data point in this dataset, making it
great to test with! And since this data has not been used to train the model, the
model has no knowledge of the outcome of these data points. So, in essence, it’s
truly an out-of-sample testing.
msk = np.random.rand(len(df)) < 0.8
train = cdf[msk]
test = cdf[~msk]

Train data distribution

plt.scatter(train.ENGINESIZE, train.CO2EMISSIONS, color='blue')


plt.xlabel("Engine size")
plt.ylabel("Emission")
plt.show()

Multiple Regression Model


In reality, there are multiple variables that predict the CO2 emission. When more than one independent
variable is present, the process is called multiple linear regression. For example, predicting CO2 emission
using FUELCONSUMPTION_COMB, ENGINESIZE and CYLINDERS of cars. The good thing here is that
multiple linear regression is an extension of the simple linear regression model.
from sklearn import linear_model
regr = linear_model.LinearRegression()
x = np.asanyarray(train[['ENGINESIZE','CYLINDERS','FUELCONSUMPTION_COMB']])
y = np.asanyarray(train[['CO2EMISSIONS']])
regr.fit(x, y)
# The coefficients
print ('Coefficients: ', regr.coef_)

As mentioned before, Coefficient and Intercept are the parameters of the fit line.


Given that it is a multiple linear regression model with 3 independent variables, and knowing that the
parameters are the intercept and the coefficients of the hyperplane, sklearn can estimate
them from our data. Scikit-learn uses the plain Ordinary Least Squares method to solve
this problem.
Ordinary Least Squares (OLS)
OLS is a method for estimating the unknown parameters in a linear regression
model. OLS chooses the parameters of a linear function of a set of explanatory
variables by minimizing the sum of the squares of the differences between the target
dependent variable and those predicted by the linear function. In other words, it tries
to minimize the sum of squared errors (SSE) or mean squared error (MSE) between
the target variable (y) and our predicted output (ŷ) over all samples in the dataset.
OLS can find the best parameters using one of the following methods (a small sketch of the closed-form approach follows this list):

- Solving the model parameters analytically using closed-form equations


- Using an optimization algorithm (Gradient Descent, Stochastic Gradient Descent, Newton’s
Method, etc.)
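As a small illustration of the closed-form route (not part of the lab itself), the normal equations can be solved directly with NumPy on made-up data; scikit-learn's LinearRegression does this work for you.

import numpy as np

rng = np.random.default_rng(0)
X = rng.random((100, 3))                     # 100 samples, 3 features (made up)
true_coef = np.array([40.0, 5.0, -3.0])
y = X @ true_coef + 120.0 + rng.normal(0, 1, 100)

X_b = np.hstack([np.ones((100, 1)), X])      # add a column of 1s for the intercept
theta = np.linalg.solve(X_b.T @ X_b, X_b.T @ y)   # solve (X^T X) theta = X^T y

print("intercept:", round(theta[0], 2))
print("coefficients:", np.round(theta[1:], 2))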

Prediction
y_hat = regr.predict(test[['ENGINESIZE','CYLINDERS','FUELCONSUMPTION_COMB']])
x = np.asanyarray(test[['ENGINESIZE','CYLINDERS','FUELCONSUMPTION_COMB']])
y = np.asanyarray(test[['CO2EMISSIONS']])
print("Residual sum of squares: %.2f"
      % np.mean((y_hat - y) ** 2))

# Explained variance score: 1 is perfect prediction


print('Variance score: %.2f' % regr.score(x, y))

Explained variance regression score:


If ŷ is the estimated target output, y the corresponding (correct) target output, and Var is the variance (the
square of the standard deviation), then the explained variance is estimated as follows:

explainedVariance(y, ŷ) = 1 − Var(y − ŷ) / Var(y)

The best possible score is 1.0; lower values are worse.

Practice
Try to use multiple linear regression with the same dataset, but this time use FUELCONSUMPTION_CITY
and FUELCONSUMPTION_HWY instead of
FUELCONSUMPTION_COMB. Does it result in better accuracy?

regr = linear_model.LinearRegression()
x = np.asanyarray(train[['ENGINESIZE','CYLINDERS','FUELCONSUMPTION_CITY','FUELCONSUMPTION_HWY']])
y = np.asanyarray(train[['CO2EMISSIONS']])
regr.fit(x, y)
print ('Coefficients: ', regr.coef_)
y_ = regr.predict(test[['ENGINESIZE','CYLINDERS','FUELCONSUMPTION_CITY','FUELCONSUMPTION_HWY']])
x = np.asanyarray(test[['ENGINESIZE','CYLINDERS','FUELCONSUMPTION_CITY','FUELCONSUMPTION_HWY']])
y = np.asanyarray(test[['CO2EMISSIONS']])
print("Residual sum of squares: %.2f" % np.mean((y_ - y) ** 2))
print('Variance score: %.2f' % regr.score(x, y))

Lab
Non-Linear Regression
