
Mid 1 (5th batch)

3 Mark
Mid 2 (5th batch)
1. Compute the probability that a fruit which is {Yellow, Sweet, Long} is a Mango, given the dataset.
2. Calculate accuracy, error, precision, and recall (a formula sketch follows this list).
3. Describe the processes to detect an output by random forest algorithm
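For question 2, here is a minimal formula sketch in Python, assuming hypothetical confusion-matrix counts (TP, FP, FN, TN are illustrative values, not taken from any given dataset):

# Hypothetical confusion-matrix counts (illustrative only, not from the exam dataset).
TP, FP, FN, TN = 40, 10, 5, 45

accuracy  = (TP + TN) / (TP + TN + FP + FN)   # fraction of predictions that are correct
error     = 1 - accuracy                      # fraction of predictions that are wrong
precision = TP / (TP + FP)                    # of predicted positives, how many are truly positive
recall    = TP / (TP + FN)                    # of true positives, how many were found

print(accuracy, error, precision, recall)     # 0.85 0.15 0.8 0.888...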

Random Forest is a popular machine learning algorithm that belongs to the supervised
learning technique. It can be used for both Classification and Regression problems in
ML. Random Forest is a classifier that contains a number of decision trees on various
subsets of the given dataset and takes the average to improve the predictive accuracy of
that dataset.

It is based on the concept of ensemble learning, which is a process of combining multiple classifiers to solve a complex problem and to improve the performance of the model.

The working process can be explained in the steps below (a minimal code sketch follows the list):

Step-1: Select K random data points from the training set.
Step-2: Build the decision trees associated with the selected data points (subsets).
Step-3: Choose the number N of decision trees that you want to build.
Step-4: Repeat Steps 1 & 2 until N trees are built.
Step-5: For a new data point, find the prediction of each decision tree and assign the new data point to the category that wins the majority of votes.
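A minimal sketch of these steps using scikit-learn's RandomForestClassifier; the synthetic data and the parameter values are assumptions for illustration only:

# The five steps above map onto scikit-learn's RandomForestClassifier.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption, not the question's dataset).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators = N (Step-3); each tree is grown on a bootstrap sample of the
# training set (Steps 1, 2, 4) using a random subset of features per split.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt", random_state=0)
forest.fit(X_train, y_train)

# Step-5: every tree votes, and the majority class is returned.
print(forest.predict(X_test[:5]))
print("test accuracy:", forest.score(X_test, y_test))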
4. Positively labelled data points: (3,1), (3,-1), (6,1), (6,-1); negatively labelled data points: (1,0), (0,1), (0,-1), (-1,0).
5. Consider the below sample data set. In this data set, we have four predictor variables,
namely:

a. Weight
b. Blood flow
c. Blocked Arteries
d. Chest Pain

Create a Random Forest for the above data and check the following query.
Source: https://medium.com/edureka/random-forest-classifier-92123fd2b5f9

(THE PROCESS, NOT THE ACTUAL SOLUTION)

Step 1: Create a Bootstrapped Data Set

Bootstrapping is an estimation method used to make predictions on a data set by re-sampling it. To create a bootstrapped data set, we must randomly select samples from the original data set. A point to note here is that we can select the same sample more than once.

In the above figure, I have randomly selected samples from the original data set and
created a bootstrapped data set. Simple, isn’t it? Well, in real-world problems you’ll
never get such a small data set, thus creating a bootstrapped data set is a little more
complex.
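A minimal sketch of the bootstrapping idea with NumPy, assuming a small table of rows indexed 0..3 (the row count is illustrative); because we sample with replacement, some rows appear more than once and some not at all:

import numpy as np

rng = np.random.default_rng(0)
n_rows = 4                                   # e.g. the 4 patients in the sample table (assumption)

# Draw row indices with replacement: duplicates are allowed.
boot_idx = rng.integers(0, n_rows, size=n_rows)

# Rows that were never drawn form the Out-Of-Bag set used later in Step 5.
oob_idx = np.setdiff1d(np.arange(n_rows), boot_idx)

print("bootstrapped rows:", boot_idx)
print("out-of-bag rows:  ", oob_idx)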

Step 2: Creating Decision Trees

● Our next task is to build a Decision Tree using the bootstrapped data set created in the previous step. Since we're making a Random Forest, we will not consider every variable at every split; instead, we'll only use a random subset of variables at each step.
● In this example, we’re only going to consider two variables at each step. So, we
begin at the root node, here we randomly select two variables as candidates for
the root node.
● Let’s say we selected Blood Flow and Blocked arteries. Out of these 2 variables,
we must now select the variable that best separates the samples. For the sake of
this example, let’s say that Blocked Arteries is a more significant predictor and
thus assign it as the root node.
● Our next step is to repeat the same process for each of the upcoming branch
nodes. Here, we again select two variables at random as candidates for the branch
node and then choose a variable that best separates the samples.
● Just like this, we build the tree by only considering random subsets of variables at
each step. By following the above process, our tree would look something like
this:

We just created our first Decision tree.
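A rough sketch of the "two random candidate variables per node" idea, assuming binary 0/1 features and using Gini impurity to score each candidate split (the toy data and column meanings are assumptions):

import numpy as np

def gini(labels):
    # Gini impurity of a label array (0 = perfectly pure node).
    if len(labels) == 0:
        return 0.0
    p = np.bincount(labels, minlength=2) / len(labels)
    return 1.0 - np.sum(p ** 2)

def best_of_random_pair(X, y, rng):
    # Pick 2 candidate columns at random and return the one whose split is purer.
    candidates = rng.choice(X.shape[1], size=2, replace=False)
    scores = []
    for col in candidates:
        left, right = y[X[:, col] == 0], y[X[:, col] == 1]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
        scores.append(weighted)
    return candidates[int(np.argmin(scores))]

# Toy binary data: the 4 columns could stand for Weight, Blood Flow, Blocked Arteries, Chest Pain.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(8, 4))
y = rng.integers(0, 2, size=8)
print("chosen variable for this node (column index):", best_of_random_pair(X, y, rng))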

Step 3: Go back to Step 1 and Repeat

As I mentioned earlier, a Random Forest is a collection of Decision Trees. Each Decision Tree predicts the output class based on the respective predictor variables used in that tree. Finally, the outcome of all the Decision Trees in the Random Forest is recorded, and the class with the majority of votes becomes the output class.

Thus, we must now create more decision trees by considering a subset of random
predictor variables at each step. To do this, go back to step 1, create a new bootstrapped
data set and then build a Decision Tree by considering only a subset of variables at each
step. So, by following the above steps, our Random Forest would look something like
this:
This iteration is performed hundreds of times, creating multiple decision trees, with each tree computing its output using a subset of randomly selected variables at each step.

Having such a variety of Decision Trees in a Random Forest is what makes it more
effective than an individual Decision Tree created using all the features and the whole
data set.
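A rough sketch of this repetition, assuming scikit-learn's DecisionTreeClassifier as the base learner (the data, the tree count, and max_features=2 are illustrative assumptions):

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=150, n_features=4, random_state=2)
rng = np.random.default_rng(2)

trees = []
for _ in range(100):                                # repeat Steps 1 & 2 many times
    idx = rng.integers(0, len(X), size=len(X))      # a fresh bootstrapped data set
    tree = DecisionTreeClassifier(max_features=2)   # only 2 candidate variables per split
    tree.fit(X[idx], y[idx])
    trees.append(tree)

print("number of trees in the forest:", len(trees))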

Step 4: Predicting the outcome of a new data point

Now that we’ve created a random forest, let’s see how it can be used to predict whether a
new patient has heart disease or not.

The below diagram has the data about the new patient. All we have to do is run this data
down the decision trees that we made.

The first tree shows that the patient has heart disease, so we keep track of that in a table as shown in the figure.
Similarly, we run this data down the other decision trees and keep track of the class predicted by each tree. After running the data down all the trees in the Random Forest, we check which class got the majority of votes. In our case, the class 'Yes' received the most votes, hence it's clear that the new patient has heart disease.

To conclude, we bootstrapped the data and used the aggregate from all the trees to make a decision; this process is known as Bagging (Bootstrap Aggregating).
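A minimal sketch of the bagging (majority vote) step; the list of per-tree votes is a made-up example standing in for running the new patient's data down every tree:

from collections import Counter

# Hypothetical votes collected by running the new patient's data down each tree.
votes = ["Yes", "Yes", "No", "Yes", "No", "Yes"]

# The class with the most votes becomes the forest's prediction.
prediction = Counter(votes).most_common(1)[0][0]
print("majority vote:", prediction)        # -> "Yes": the forest predicts heart disease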

Step 5: Evaluate the Model

Our final step is to evaluate the Random Forest model. Earlier, while creating the bootstrapped data set, we left out one entry/sample because we duplicated another sample. In a real-world problem, about one third of the original data set is not included in the bootstrapped data set.

The below figure shows the entry that didn’t end up in the bootstrapped data set.

The samples that are not included in the bootstrapped data set are known as the Out-Of-Bag (OOB) data set. The Out-Of-Bag data set is used to check the accuracy of the model; since the model wasn't built using this OOB data, it gives us a good indication of whether the model is effective or not.

In our case, the output class for the OOB sample is 'No'. So, for our Random Forest model to be accurate, running the OOB data down the Decision Trees should give a majority of 'No' votes. This process is carried out for all the OOB samples; in our case we had only one, but in most problems there are usually many more.

Therefore, eventually, we can measure the accuracy of a Random Forest by the proportion of OOB samples that are correctly classified.

The proportion of OOB samples that are incorrectly classified is called the Out-Of-Bag
Error.
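A minimal sketch of measuring OOB accuracy and error with scikit-learn, assuming synthetic data; oob_score_ reports the accuracy obtained on the out-of-bag samples:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption).
X, y = make_classification(n_samples=300, n_features=4, random_state=1)

# oob_score=True evaluates each sample using only the trees that never saw it.
forest = RandomForestClassifier(n_estimators=200, oob_score=True, random_state=1)
forest.fit(X, y)

print("OOB accuracy:", forest.oob_score_)
print("OOB error:   ", 1 - forest.oob_score_)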
Mid 1 (4th batch)
Question 1:

Use the k-means algorithm and Euclidean distance to cluster the following 8 examples
into 3 clusters:

A1=(2,10), A2=(2,5), A3=(8,4), A4=(5,8), A5=(7,5), A6=(6,4), A7=(1,2), A8=(4,9)

Suppose that the initial seeds (centers of each cluster) are A1, A4 and A7. Run the
k-means algorithm for 1 epoch only. At the end of this epoch show:

a) The new clusters (i.e. the examples belonging to each cluster)

b) The centers of the new clusters

c) Draw a 10 by 10 space with all the 8 points and show the clusters after the first epoch
and the new centroids.

d) How many more iterations are needed to converge? Draw the result for each epoch.

Source:
https://webdocs.cs.ualberta.ca/~zaiane/courses/cmput695/F07/exercises/Exercises695Clu
s-solution.pdf

Soln: https://imgur.com/TeSQLtG
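A minimal sketch of one k-means epoch on the points above, using A1, A4, A7 as the initial centers (A8=(4,9) is taken from the cited exercise sheet); it prints the clusters and the recomputed centers for parts (a) and (b):

import numpy as np

points = {
    "A1": (2, 10), "A2": (2, 5), "A3": (8, 4), "A4": (5, 8),
    "A5": (7, 5), "A6": (6, 4), "A7": (1, 2), "A8": (4, 9),
}
centers = np.array([points["A1"], points["A4"], points["A7"]], dtype=float)

# One epoch: assign every point to its nearest center (Euclidean distance).
clusters = {0: [], 1: [], 2: []}
for name, p in points.items():
    dists = np.linalg.norm(centers - np.array(p), axis=1)
    clusters[int(np.argmin(dists))].append(name)

# Recompute each center as the mean of its assigned points.
new_centers = [np.mean([points[n] for n in names], axis=0) for names in clusters.values()]

print("clusters after epoch 1:", clusters)
print("new centers:", new_centers)

Repeating the assignment and update steps until no point changes cluster answers part (d).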
Question 2:

a) Why is the Naïve Bayesian classification called Naïve?

Soln:
https://www.tutorialspoint.com/why-na-ve-bayesian-is-classifications-called-na-v
e
https://medium.com/@kirudang/why-naive-bayes-is-called-naive-and-what-are-th
e-benefits-of-being-naive-180757155b69

Naive Bayes is a simple and powerful algorithm for predictive modeling. It is called "naive" because it assumes that each input variable is independent. This is a strong assumption and unrealistic for real data; however, the technique is very effective on a large range of complex problems.

Example: a typical classification problem: is a strawberry ripe enough to harvest?

Let's say we have two independent variables in this case: fruit size and fruit color. To show that the algorithm's assumptions are only partially valid, pose the following two critical questions (a small numerical sketch follows the list):

● Are Size and Color independent? Not really; a positive correlation can clearly be seen: as the fruit grows, it expands in size and changes in color.
● Do Size and Color contribute equally to the outcome of "being ripe"? Not really! Although it greatly depends on the type of fruit, we can roughly anticipate the answer.
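A minimal numerical sketch of the naive (independence) assumption for the strawberry example; all the probabilities below are made-up illustrative values, not estimates from real data:

# Hypothetical class priors and per-feature likelihoods (illustrative only).
prior_ripe, prior_unripe = 0.6, 0.4
p_large_given_ripe, p_red_given_ripe = 0.8, 0.9
p_large_given_unripe, p_red_given_unripe = 0.3, 0.2

# The "naive" step: treat Size and Color as independent given the class,
# so the joint likelihood factorises into a product of per-feature terms.
score_ripe = prior_ripe * p_large_given_ripe * p_red_given_ripe
score_unripe = prior_unripe * p_large_given_unripe * p_red_given_unripe

# Normalise the two scores into posterior probabilities.
total = score_ripe + score_unripe
print("P(ripe   | large, red) =", score_ripe / total)     # ~0.95
print("P(unripe | large, red) =", score_unripe / total)   # ~0.05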
b)

Soln: http://eprints.dinus.ac.id/6216/1/Exercises_Classification.pdf
