
Data Science with Python – Microsoft

Coincent Assignment

Name : Nellore Sai Nikhil

1. Perform Exploratory Data Analysis (EDA) on Iris dataset.


Ans. Exploratory Data Analysis (EDA) is the process of examining a dataset to understand its variables, their relationships, trends and patterns before building any model. This process includes several steps :

1. Importing and loading the dataset.


First, we import all the libraries required to perform EDA and load the dataset file
using the “pd.read_csv()” function, storing it in a DataFrame named “iris”.
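A minimal sketch of this step (the file name "Iris.csv" is an assumption; use the path of your own copy):

# Import the libraries commonly used for EDA
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the dataset (the file name is assumed; adjust the path to your copy)
iris = pd.read_csv("Iris.csv")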

2. Understanding the Big Picture.


Now, we start performing EDA. “iris.head()” and “iris.tail()” are used to check the
first few rows and the last few rows of the dataset.

“iris.shape” gives us the dimensionality of the dataset, i.e. the number of rows and
the number of columns.
“iris.columns” gives us the names of the variables present in the dataset.

The “iris.describe()” method summarizes the numeric variables in the dataset: the count
of values, mean, standard deviation, minimum, quartiles and maximum. It gives us a good
picture of the distribution of the data.
“iris.info()” gives us a short summary of the dataset, such as the data type of each
column, the non-null counts and the memory usage.
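These calls, continuing from the “iris” DataFrame loaded above, look roughly like:

print(iris.head())       # first few rows
print(iris.tail())       # last few rows
print(iris.shape)        # (number of rows, number of columns)
print(iris.columns)      # variable names
print(iris.describe())   # count, mean, std, min, quartiles, max for numeric columns
iris.info()              # data types, non-null counts, memory usage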
3. Understanding the Variables
“iris.duplicated().sum()” prints the number of duplicated rows in our dataset.
“iris.isnull().sum()” returns the number of missing values in each column of the iris dataset.
We can also see how many records belong to each class of a variable.
The “iris.Species.value_counts()” method is used for this purpose, and
“iris.Species.value_counts(normalize = True)” shows the classes as proportions
(percentages).

We can also understand the variables using histograms and bar graphs. The seaborn
and matplotlib libraries are used for this purpose.
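A short sketch of these checks and plots, continuing from the same “iris” DataFrame and assuming the class column is named "Species":

print(iris.duplicated().sum())                    # number of duplicated rows
print(iris.isnull().sum())                        # missing values per column
print(iris.Species.value_counts())                # count of each species
print(iris.Species.value_counts(normalize=True))  # same counts as proportions

# Visual summaries: a count plot for the class label and histograms for the features
sns.countplot(x="Species", data=iris)
plt.show()
iris.hist(figsize=(8, 6))
plt.show()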
4. Study the relationship
In the above step, we understood what kind of variables are present, how many
empty values there are, what the size of the dataset is, how many records there are,
and what the data type of each variable is. In this iris dataset there are no
null or empty values, so we did not have to fill any. If there are empty values in a
dataset, we need to fill them or remove those rows from the dataset.
“sns.pairplot(iris)” plots every variable in the iris dataset against every other
variable using scatterplots. This plot is very helpful for understanding the
relationships between all the variables in our dataset, since it compares each
variable of the dataset with the others.
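For example, assuming the same “iris” DataFrame and a "Species" column:

import seaborn as sns
import matplotlib.pyplot as plt

# Scatterplots of every numeric variable against every other, coloured by species
sns.pairplot(iris, hue="Species")
plt.show()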
2. What is Decision Tree? Draw decision tree by taking the example of Play
Tennis.
Ans. Decision Tree : It is a supervised machine learning algorithm that is used to solve
both classification and regression problems. It is a tree-structured classifier with a
root node, internal nodes and leaf nodes, connected by branches. The internal nodes
represent the features of the dataset, and the leaf nodes represent the output of the
model. The root node and the internal nodes are also called decision nodes.

Dataset :

Decision Tree for Play Tennis Dataset :
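As an illustrative sketch of how such a tree can be produced in code, the snippet below fits a scikit-learn decision tree on a hand-typed Play-Tennis-style table; the values are the standard textbook ones and are an assumption, not necessarily the exact table above.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

# A small Play-Tennis-style dataset (typical textbook values, shown here as an assumption)
data = pd.DataFrame({
    "Outlook":     ["Sunny", "Sunny", "Overcast", "Rain", "Rain", "Rain", "Overcast",
                    "Sunny", "Sunny", "Rain", "Sunny", "Overcast", "Overcast", "Rain"],
    "Temperature": ["Hot", "Hot", "Hot", "Mild", "Cool", "Cool", "Cool",
                    "Mild", "Cool", "Mild", "Mild", "Mild", "Hot", "Mild"],
    "Humidity":    ["High", "High", "High", "High", "Normal", "Normal", "Normal",
                    "High", "Normal", "Normal", "Normal", "High", "Normal", "High"],
    "Wind":        ["Weak", "Strong", "Weak", "Weak", "Weak", "Strong", "Strong",
                    "Weak", "Weak", "Weak", "Strong", "Strong", "Weak", "Strong"],
    "PlayTennis":  ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
                    "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"],
})

# One-hot encode the categorical features and fit an entropy-based tree
X = pd.get_dummies(data.drop(columns="PlayTennis"))
y = data["PlayTennis"]
tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)

# Print the learned tree as text (root, decision nodes and leaves)
print(export_text(tree, feature_names=list(X.columns)))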


3. In k-means or KNN, we use Euclidean distance to calculate the distance
between nearest neighbours. Why not Manhattan distance ?
Ans. Euclidean Distance : The Euclidean distance is mostly used when the data points
have integer or float (numeric) values. This metric is most commonly used in the
k-NN and k-means clustering algorithms. It is the shortest, straight-line distance
between two data points. Mathematically,
De = sqrt( (p1 – q1)² + (p2 – q2)² ) [for 2 dimensions]
and
De = sqrt( Σ (pi – qi)² ) [where i = 1 to n, for n-dimensional space].

Manhattan Distance : It is the sum of the absolute differences between the coordinates
of the points across all dimensions. It is also known as the “city block distance” and is
used when we want the distance between two real-valued vectors measured along
axis-aligned paths. Mathematically,
M.D = | p1 – q1 | + | p2 – q2 | [for 2 dimensions]
and
M.D = Σ | pi – qi | [where i = 1 to n, for n-dimensional space]
We usually prefer Euclidean distance over Manhattan distance in k-NN and k-means
clustering because it gives the true shortest (straight-line) distance between data
points, which matches the idea of the “nearest” neighbour for numeric features; in
k-means, the cluster mean is exactly the point that minimizes the squared Euclidean
distance, so the metric fits the algorithm naturally. Using this distance generally
improves the accuracy of the model, which results in more correct output.
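As a small illustration, the two metrics can be computed side by side with NumPy (the points are arbitrary example values):

import numpy as np

p = np.array([1.0, 2.0, 3.0])
q = np.array([4.0, 6.0, 8.0])

euclidean = np.sqrt(np.sum((p - q) ** 2))   # sqrt(9 + 16 + 25) ≈ 7.07
manhattan = np.sum(np.abs(p - q))           # 3 + 4 + 5 = 12
print(euclidean, manhattan)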

4. How to test and know whether or not we have an overfitting problem?


Ans. In order to know whether the machine learning model we have built has an overfitting
problem, we calculate the accuracy (and precision) on the training set while training
the model, and again on the test set in the testing stage. If the model performs well
in the training stage but not in the testing stage, that is, if the accuracy is high
on the training data but low on the test data, then we can conclude that there is an
overfitting problem.
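A minimal sketch of this check, assuming a scikit-learn classifier and a held-out test split on the iris data:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)

# A large gap between training and testing accuracy suggests overfitting
print("train accuracy:", model.score(X_train, y_train))
print("test accuracy: ", model.score(X_test, y_test))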

5. How is KNN different from k-means clustering?


Ans.
1. KNN is a supervised machine learning algorithm, while k-means clustering is an
unsupervised machine learning algorithm.
2. KNN is used to solve classification and regression problems, while k-means is used
to solve clustering problems.
3. KNN uses labelled data to make predictions, while k-means uses unlabelled data.
4. KNN analyses the ‘k’ nearest data points and classifies a new data point based on
them, while k-means analyses the distance of unlabelled data points to the cluster
means and groups the points into specific clusters.
5. In KNN, ‘k’ refers to the number of nearest neighbours of the new data point; in
k-means, ‘k’ refers to the number of clusters or groups.
6. KNN combines the class labels of the ‘k’ nearest points to predict the class of the
new data point, while k-means partitions the data points into ‘k’ clusters so that the
points within each cluster are close to each other.

6. Can you explain the difference between a Test Set and a Validation Set?
Ans. Test Set : This data is used to test the machine learning model after the training
stage is complete. It should contain the kinds of data that we face in real-life
scenarios. It is used only once, on the final model. It is a subset of the original
dataset, but it does not contain the cases present in the training set.
Validation Set : This data set is used for validation during the training stage of the
model. Unlike the test set, it is considered part of the training stage. It is used
only to evaluate the model during training, not to train it; that is, the model does
not learn anything from this data set.
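One common way to obtain the three sets is two calls to train_test_split; the 60/20/20 split below is only an assumed example:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First hold out the test set, then carve a validation set out of the remainder
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=0)
# X_train : used to fit the model
# X_val   : used to evaluate and tune the model during training
# X_test  : used once, on the final model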

7. How can you avoid overfitting in KNN?


Ans. Methods to avoid overfitting in KNN are :
More data to train : Increasing the size of the training set helps the model
generalize better to real-case scenarios and increases the accuracy on the validation
and test data too.

Simplify the model : While building the model, we use various features on which the
output of the model depends. Choose only the important ones on which the output
majorly depends.

Early Stopping : In this technique, we continuously monitor the model during the
training stage and stop the training session whenever its performance starts
decreasing, i.e. its validation accuracy drops.

Cross Validation : In this technique, we divide the training dataset into subsets
(folds) and validate the model on each subset after training it on the remaining ones,
as sketched below.
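Cross-validation in particular is easy to sketch with scikit-learn; here an assumed 5-fold split is used to pick a reasonable value of k for KNN:

from sklearn.datasets import load_iris
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Average 5-fold cross-validation accuracy for several values of k;
# a very small k tends to overfit, so we prefer the k with the best validation score
for k in (1, 3, 5, 7, 9):
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    print(k, scores.mean())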

8. What is Precision?
Ans. It is the percentage of points correctly predicted as the positive class out of the
total number of positive predictions made by the model.
Precision = TP / (TP + FP) where,
TP (True Positive) : correct prediction as Positive
FP (False Positive) : wrongly predicted as Positive
9. Explain How a ROC Curve works.
Ans. The full form of ROC is Receiver Operating Characteristic curve. It is a method used
to evaluate the performance of our model and is used in binary classification problems.
Let us understand this with the help of an example.
Assume that we have built a machine learning model that predicts not just the output
class but a probabilistic score for it. The higher the score for a class, the higher
the chance that the new data point belongs to that class.
Let the data set be :

xi yi yp
x1 1 1.2
x2 0 0.8
x3 1 0.92
x4 1 1.56

Steps for Obtaining a ROC Curve :


1. Sort the data in descending order of the probabilistic score (yp) of the predicted
class label.

xi yi yp
x4 1 1.56
x3 1 1.2
x1 1 0.92
x2 0 0.8

2. Thresholding : We take each value of yp in turn as the threshold value (T) and
compare every yp with it. If a point's score is greater than or equal to the
threshold, it is predicted as the positive class; otherwise it is predicted as the
other class.

3. In this step, we find the TPR and FPR for each and every case.
TPR : True Positive Rate = TP / (TP + FN).
FPR : False Positive Rate = FP / (FP + TN).

4. Plot the graph between TPR and FPR. The curve obtained in this graph is called
the Receiver Operating Characteristic Curve. It has FPR on the x-axis and TPR on
the y-axis.
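In practice the TPR/FPR pairs and the curve are usually obtained with scikit-learn; a small sketch on assumed synthetic data with an assumed logistic regression model:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression().fit(X_train, y_train)
scores = model.predict_proba(X_test)[:, 1]          # probabilistic score of the positive class

fpr, tpr, thresholds = roc_curve(y_test, scores)    # TPR and FPR at every threshold
print("AUC:", auc(fpr, tpr))

plt.plot(fpr, tpr)
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.show()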
10. What is Accuracy?
Ans. It is defined as the ratio of the number of correctly classified points to the total
number of points in the data set, i.e. the ratio of the number of correct predictions
to the total number of predictions made by the model.

Accuracy = no.of points correctly classified / total no.of points


Or
Accuracy = (TP + TN) / (TP + TN + FP + FN )

TP ( True Positive ) : Correctly predicted as positive


FP (False Positive) : Wrongly predicted as positive
TN (True Negative) : Correctly predicted as negative
FN (False Negative) : Wrongly predicted as negative.

Its value always lies between 0 and 1. [ 0 <= Accuracy <= 1 ].

11. What is F1 Score ?


Ans. It is difficult to compare two models when one has low precision but high recall and
the other has high precision but low recall. The F1 score is used for this purpose:
it helps us assess recall and precision simultaneously. It is calculated using,
F1 = (2 * Recall * Precision) / (Recall + Precision).

12. What is Recall ?


Ans. It is defined as “how many of the points that are really or actually positive are
predicted to be positive.”
Recall = no.of points correctly classified as positive / no.of points actually positive
Or
Recall = TP / ( TP + FN )
TP ( True Positive ) : Correctly predicted as positive.
FN (False Negative) : Wrongly predicted as negative.
13. What is a Confusion Matrix, and Why do we Need it?
Ans. Confusion Matrix : It can be defined as a matrix that summarizes all the
predictions made by the model.
For binary classification it can be drawn as :

                 Predicted : No     Predicted : Yes
Actual : No           TN                  FP
Actual : Yes          FN                  TP

TP ( True Positive ) : Correctly predicted as positive


FP (False Positive) : Wrongly predicted as positive
TN (True Negative) : Correctly predicted as negative
FN (False Negative) : Wrongly predicted as negative.

Need for Confusion Matrix :


1. It helps us calculate what percentage of the predictions were correct and what
percentage were wrong.

2. It compares the predicted class label with the actual class label.

3. It lets us evaluate the model using metrics like Accuracy, Precision, Recall and F1
Score, which are calculated from this matrix (see the sketch below).

4. It tells us not only the errors made by the model but also the type of error, that
is, type-1 (false positive) or type-2 (false negative).
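A short sketch of the matrix and the metrics derived from it, using scikit-learn on assumed example labels and predictions:

from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score)

y_true = [0, 0, 1, 1, 1, 0, 1, 0]   # actual class labels (assumed example values)
y_pred = [0, 1, 1, 1, 0, 0, 1, 0]   # labels predicted by the model

# Rows are actual classes, columns are predicted classes: [[TN, FP], [FN, TP]]
print(confusion_matrix(y_true, y_pred))
print("accuracy :", accuracy_score(y_true, y_pred))
print("precision:", precision_score(y_true, y_pred))
print("recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))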

14.What do you mean by AUC curve?


Ans. The full form of AUC is “Area Under the ROC Curve.” As the name says, it is the area
under the ROC curve, and it ranges between 0 and 1. A value close to 0 means the model
is not working properly (a bad model), while a value close to 1 means it is a very good
model. It summarizes the whole ROC curve as a single number.
Properties :
1. For unbalanced data, the AUC can be high even for a dumb / very simple model.
2. It does not depend on the actual yp values; it only depends on the order of the yp values.

15. What is Precision-Recall Trade-Off?


Ans. There are many real-life scenarios where we would like to increase or decrease the
precision of the model to get the desired behaviour. But when we do this, the recall
value also changes: precision and recall cannot both be made high at the same time by
moving the decision threshold. As we increase the precision, the recall usually
decreases, and vice versa. This is called the Precision-Recall Trade-off.
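The trade-off can be seen by sweeping the decision threshold; a minimal sketch using scikit-learn's precision_recall_curve on assumed synthetic data:

from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

scores = LogisticRegression().fit(X_train, y_train).predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, scores)

# As the threshold rises, precision generally increases while recall falls
for p, r, t in zip(precision[:-1:20], recall[:-1:20], thresholds[::20]):
    print(f"threshold={t:.2f}  precision={p:.2f}  recall={r:.2f}")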

16. What are Decision Trees?


Ans. Decision Tree is a supervised machine learning algorithm that is used to solve both
classification and regression problems. It has a tree-like structure, so it is a
tree-structured classifier with a root node, internal nodes and leaf nodes.
There are three common decision tree algorithms :
1. ID3
2. C4.5
3. CART
Entropy, Information Gain, and Gini Impurity are the metrics that are used to
construct a decision tree.

17. Explain the structure of the decision tree ?


Ans. A decision tree is a supervised machine learning algorithm that is used to solve both
classification and regression problems. It is a tree-like structure that starts with
a root node and ends with leaf nodes.

Root Node : The decision tree starts with a root node. This root node generates the
child nodes and has no incoming branches.

Internal Nodes : They come between the root node and the leaf nodes. These nodes are
also called “decision nodes” because they are used to make decisions and have
multiple outgoing branches. They represent the features of the dataset.

Branches : The decisions taken by the decision nodes are represented by these
branches.
Leaf Node : The final output class label is represented by the leaf nodes. They are the
end nodes of the decision tree. They do not have any outgoing branches. They
represent the final prediction made by the model.
Example : the root node branches into internal (decision) nodes, and each internal
node in turn leads to one or more leaf nodes that hold the final predictions.

18. What are some advantages of using Decision Trees?


Ans. The advantages of using decision trees are :
1. They can be used to solve both classification and regression problems, which
makes them more flexible than many other machine learning algorithms.
2. Their tree or flowchart-like structure makes the algorithm, the data and the
solution very easy and simple to understand and interpret.
3. They show us all the possible outcomes of a problem.
4. They can handle variables or features of different data types.
5. Many algorithms have problems when there are null values present in the dataset,
whereas decision trees can also handle this type of problem.

19. How is a Random Forest related to Decision Trees?


Ans. Like a decision tree, Random Forest is a supervised learning algorithm that is used
to solve classification and regression problems. The algorithm uses a large number of
decision trees to predict the output, which improves the model by increasing its
accuracy. It draws many random subsets from the given data set and constructs a
decision tree for each subset. Instead of relying on one decision tree, it combines
the outputs of many trees and predicts the output from the average of the trees'
outputs (or, for classification, their majority vote). This type of technique is
called “ensemble learning.”
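A minimal sketch of this relationship in code: after fitting, a scikit-learn RandomForestClassifier exposes its individual decision trees (the data and parameters are only examples):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# 100 decision trees, each trained on a random subset of the rows and features
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

print(len(forest.estimators_))   # the individual DecisionTreeClassifier objects
print(forest.predict(X[:3]))     # prediction combines the outputs of all the trees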
20. How are the different nodes of decision trees represented?
Ans. The Decision tree consists of 3 types of nodes : Root Node, Internal Nodes and Leaf
Nodes.

1. Root Node : It is the parent node of the decision tree. There is always exactly one
root node in the entire decision tree. It represents the feature or attribute that
gives the best first split of the data set (the highest Information Gain).
2. Internal Nodes : They represent the features or attributes that are present in the
dataset. They are also called decision nodes because they are used to make decisions
while predicting the output.
3. Leaf Node : They represent the final output or class label predicted by the model.
These are the last nodes in the decision tree.

21. What type of node is considered Pure?


Ans. A node is said to be pure when its Entropy is equal or almost equal to zero, that is,
when all the data points in that node belong to a single class label. Equivalently,
the lower the Gini Impurity of the node, the purer the node is said to be.

22. How would you deal with an Overfitted Decision Tree?


Ans. Decision trees are usually quite prone to overfitting. This problem can be avoided
or overcome using the “pruning” technique, in which we remove branches or nodes of
the decision tree. There are 2 types of pruning techniques :
1. Pre-Pruning : In this method, we stop the tree-building process at an early stage,
before it produces the leaf nodes. While splitting at each node during tree-building,
we perform cross-validation and measure the error. If this error is not reducing at a
stage, we stop that split and check an alternate split.

2. Post-Pruning : This pruning creates a decision tree with minimum cross-validation
error. In this method, we check for overfitting after the decision tree is completely
built. The data set is partitioned into many subsets. If there is overfitting, we cut
the tree's leaf nodes from the bottom up until the cross-validated error is minimum.
The smaller tree is better than a larger tree with more error.
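Both ideas have rough counterparts in scikit-learn; a short sketch with illustrative parameter values:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruning: limit the tree while it is being built
pre_pruned = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5).fit(X_train, y_train)

# Post-pruning: grow the full tree, then prune it with cost-complexity pruning (ccp_alpha)
post_pruned = DecisionTreeClassifier(ccp_alpha=0.02).fit(X_train, y_train)

print(pre_pruned.score(X_test, y_test), post_pruned.score(X_test, y_test))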
23. What are some disadvantages of using Decision Trees and how would
you solve them?
Ans. Disadvantages :
1. Decision trees are more prone to overfitting and do not generalize well to new data.
2. A small variation in the data can produce a completely different decision tree.
3. They show large variation in the predicted class label of a new data point when
small changes are made to the training set.
4. They are expensive to train compared to other algorithms.
Methods to solve these disadvantages :
1. Pruning of the decision tree can solve the problem of overfitting. There are 2 types
of pruning : pre-pruning and post-pruning.
2. Bagging or averaging the estimates of several trees can reduce the variance of the
decision tree.

24. What is Gini Index and how is it used in Decision Trees?


Ans. Gini Index : It is similar to Entropy. If the Gini Index of a feature or attribute in
the dataset is 0, then it is said to be a pure attribute. It is calculated using :
G.I = 1 – Σ (Pi)²
where i runs from 1 to n and n = the number of values in the set.
It is used during the building of the decision tree : the feature used to split a node
is selected based on its Gini Index, and features with a lower value of G.I are
preferred first.

25. How would you define the Stopping Criteria for decision trees?
Ans. We stop growing the decision tree when any one of the following conditions is met.
1. When all the data points in the corresponding subset of the dataset belong to the
same class label.
2. When the number of data points in a node is less than the specified minimum.
3. In the decision tree, the level of the root node is 1, its child nodes are level 2,
their child nodes are level 3, and so on. While building the decision tree, if the
level of the current node is greater than the specified maximum depth, we stop
growing the tree.
4. When the improvement in class impurity, that is, the Information Gain, becomes too low.
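These stopping criteria map onto the usual hyperparameters of a tree implementation; a sketch with scikit-learn, where the values are only examples:

from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    max_depth=5,                 # stop when the node level exceeds the specified maximum
    min_samples_split=10,        # stop when a node has fewer samples than the minimum
    min_impurity_decrease=0.01,  # stop when the impurity improvement becomes too low
)
# tree.fit(X, y) would then grow the tree only until one of these criteria is met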
26. What is Entropy?
Ans. It is defined as the measure of randomness in the data or impurity of a variable or
feature of the given dataset. Mathematically, it can be calculated using
H(Y) or E(Y) = - Σ P(Yi) * log2[P(Yi)]
Where,
H(Y) = E(Y) = Entropy
Y = feature in the dataset
‘i’ runs from 1 to ‘k’
k = no.of class labels.

27. How do we measure the Information?


Ans. We measure the Information using the following three parameters :
1. Entropy : It is defined as the measure of randomness in the data or impurity of a
variable or feature of the given dataset. Mathematically, it can be calculated using

H(Y) or E(Y) = - Σ P(Yi) * log2[P(Yi)].

2. Information Gain : It is the difference between the entropy of the parent node and
the weighted average of the entropies of the child nodes produced by splitting on a
feature.

I.G (S,a) = E(S) – [(weighted average) * E(each child subset)]


or
I.G(S,a) = E(S) – Σ ( |Sv| / |S| )*E(Sv)

3. Gini Index : It is similar to Entropy. If the Gini Index of a feature or attribute
in the dataset is 0, then it is said to be a pure attribute. It is calculated using
G.I = 1 – Σ (Pi)²
where i runs from 1 to n and n = the number of values in the set.
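These three quantities can also be computed directly; a small sketch with NumPy, where the parent and child label arrays are assumed example values:

import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def gini(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1 - np.sum(p ** 2)

parent = np.array(["Yes"] * 9 + ["No"] * 5)          # labels before the split
children = [np.array(["Yes"] * 6 + ["No"] * 1),      # labels in each child node
            np.array(["Yes"] * 3 + ["No"] * 4)]

# Information Gain = parent entropy minus the weighted entropy of the children
info_gain = entropy(parent) - sum(len(c) / len(parent) * entropy(c) for c in children)
print(entropy(parent), gini(parent), info_gain)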

28. What is the difference between Post-pruning and Pre-pruning?


Ans. Post-pruning creates a decision tree with minimum cross-validation error. In this
method, we check for overfitting after the decision tree is completely built. The data
set is partitioned into many subsets. If there is overfitting, we cut the tree's leaf
nodes from the bottom up until the cross-validated error is minimum. The smaller tree
is better than a larger tree with more error.

Pre-pruning is a method in which we stop the tree-building process at an early stage,
before it produces the leaf nodes. While splitting at each node during tree-building,
we perform cross-validation and measure the error. If this error is not reducing at a
stage, we stop that split and check an alternate split.

29. Compare Linear Regression and Decision Tree.


Ans.

1. Linear Regression is a supervised learning algorithm; Decision Trees are also
supervised learning algorithms.
2. Linear Regression is used to solve only regression-based problems, while Decision
Trees are used to solve both regression and classification problems.
3. Linear Regression is used when there is a linear relationship between the features
and the output variable in the dataset, while Decision Trees are used when the
relationship between the features and the output variable is complex.
4. Linear Regression is more prone to underfitting, while Decision Trees are more
prone to overfitting.

30. What is the relationship between Information Gain and Information Gain Ratio?
Ans. Information Gain is the difference between the entropy of the parent node and the
weighted average of the entropies of the child nodes produced by splitting on a feature.
I.G (S,a) = E(S) – [(weighted average) * E(each child subset)]
or
I.G(S,a) = E(S) – Σ ( |Sv| / |S| ) * E(Sv)
The Information Gain Ratio is the ratio of the Information Gain to the Intrinsic
Information. The Intrinsic Information is the entropy of the proportions of the
sub-datasets produced by the split.
IGR = I.G / I.I
Where,
I.I = – Σ ( |Sj| / |S| ) * log2( |Sj| / |S| )
From the above we can say that, for a given split, the Information Gain Ratio is
directly proportional to the Information Gain and inversely proportional to the
Intrinsic Information.
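A small worked sketch of the ratio, where the split sizes and the Information Gain value are assumed examples:

import numpy as np

# Assumed example: a split that sends 8, 4 and 2 of the 14 samples into three branches
sizes = np.array([8, 4, 2])
proportions = sizes / sizes.sum()

# Intrinsic Information of the split
intrinsic_info = -np.sum(proportions * np.log2(proportions))

info_gain = 0.25                                  # assumed Information Gain of this split
print("IGR =", info_gain / intrinsic_info)        # Information Gain Ratio = IG / II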
31. Compare Decision Trees and k-Nearest Neighbours.
Ans.
1. Both the Decision trees and k – Nearest Neighbours are supervised machine
learning algorithms.
2. Both Algorithms can be used to solve classification and regression based
problems.
3. Since both belong to the supervised learning algorithms, they both work on
labelled data set.
4. Decision trees are faster than the k-NN algorithm at making predictions when the
size of the data set is very large.
5. A decision tree is a tree-structured classifier, whereas k-NN predicts the class
label of a new data point based on the class labels of the data points nearest to it.
6. Entropy, Information Gain and Gini Impurity are the metrics used in decision trees,
whereas Euclidean distance, accuracy, precision, F1 score and recall are the metrics
used with k-NN.

32. While building Decision Tree how do you choose which attribute to split
at each node?
Ans. While splitting each node during the building of the decision tree, the attribute is
chosen based on its Information Gain.
I.G (S,a) = E(S) – [(weighted average)*E(each feature)]
or
I.G(S,a) = E(S) – Σ ( |Sv| / |S| )*E(Sv)
The attribute with the highest information gain is chosen at each node.

33. How would you compare different Algorithms to build Decision Trees?
Ans. We always choose the machine learning algorithm that is most appropriate and gives
us the most accurate prediction of the output variable. In order to do this, we must
compare the algorithms. To compare them, we calculate metrics such as accuracy,
precision, F1 score, recall, etc. The choice also depends on the type of problem
statement and on the dataset given to us: if the problem is a classification or
regression problem, then we use supervised learning algorithms. We also compare the
algorithms on how quickly they predict the correct output. We choose the decision
tree algorithm out of all the machine learning algorithms when the problem is
classification based or regression based; we use decision trees mostly for
classification problems where the output depends on a relatively small number of
features.
34. How do you Gradient Boost decision trees?
Ans. Gradient boosting is a method in which we combine various weak learners in a
sequential manner so that together they form a strong, accurate learner. In order to
gradient boost decision trees, we combine many trees in series; each tree is trained
on the errors of the previous trees and corrects them, which in turn increases the
efficiency and accuracy of the model.
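A minimal sketch with scikit-learn's GradientBoostingClassifier, which adds shallow trees sequentially, each fitted to the errors of the current ensemble (data and parameters assumed):

from sklearn.datasets import load_iris
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 200 shallow trees added sequentially; learning_rate scales each tree's contribution
model = GradientBoostingClassifier(n_estimators=200, learning_rate=0.1,
                                   max_depth=2, random_state=0)
model.fit(X_train, y_train)
print(model.score(X_test, y_test))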

35. What are the differences between Decision Trees and Neural Networks?
Ans. Definition :
Decision Tree is a supervised machine learning algorithm that is used to solve both
classification and regression problems. It has a tree-like structure, so it is a
tree-structured classifier with a root node, internal nodes and leaf nodes. Entropy,
Information Gain, and Gini Impurity are the metrics that are used to construct a
decision tree.
A Neural Network is a machine learning method, central to deep learning, that teaches
the computer or model to learn from its mistakes and improve itself continuously. It
is used to process data and helps the model make predictions with greater accuracy.

Structure :
Decision tree
It has a tree-like structure, so it is a tree-structured classifier. It has the
following parts :
1. Root Node
2. Internal / Decision Nodes
3. Leaf Nodes
4. Branches
Neural Network
The structure of a Neural Network is inspired by the human brain. It
consists of interconnected artificial neurons arranged in 3 layers :
1. Input Layer
2. Hidden Layer
3. Output Layer
