
Statistical Data Mining

Weekly Assignment 3
1. Explain what classification is.

• Classification is a data analysis task: the process of finding a model that describes and distinguishes data classes and concepts.
• It is the problem of identifying to which of a set of categories (sub-populations) a new observation belongs, on the basis of a training set of data containing observations whose category membership is known.

2. Briefly explain the following terms used in classification.

A. attribute vector

𝑋 = (𝑥1, 𝑥2, …, 𝑥𝑛), depicting n measurements made on the tuple from n database attributes 𝐴1, 𝐴2, …, 𝐴𝑛, respectively.

B. class label attribute

Each tuple, 𝑋, is assumed to belong to a predefined class, as determined by another database attribute called the class label attribute.

C. training tuples

The individual tuples that make up the training set are called training tuples; they are selected from the database under analysis. In the context of classification, data tuples are also referred to as samples, examples, instances, data points, or objects.

D. testing tuples

Tuples that are independent of the training tuples and are used in the classification step to test the constructed model and estimate its accuracy.

E. instances/data points/objects

Alternative names for data tuples in the context of classification; as noted above, data tuples may also be called samples or examples.

F. supervised learning

Because the class label of each training tuple is provided, this step is also called supervised learning; that is, the learning of the classifier is "supervised" in that it is told to which class each training tuple belongs.

3. Explain the following two steps of the classification process.

a. Learning step: Construction of Classification Model


• Different algorithms are used to build a classifier by making the model learn from the available training set. The model must be trained so that it can predict accurate results.

b. Classification step

• The model is used to predict class labels: the constructed model is tested on test data, and the accuracy of the classification rules is thereby estimated.
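
The two steps can be illustrated with a short sketch. This is a minimal example, assuming scikit-learn and its bundled iris dataset; the library and dataset choice are not part of the assignment material.

    # Minimal sketch of the two classification steps (assumes scikit-learn).
    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import accuracy_score

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=1)

    # a. Learning step: build the classifier from the training tuples.
    model = DecisionTreeClassifier().fit(X_train, y_train)

    # b. Classification step: predict labels for the test tuples and
    #    estimate the accuracy of the classification rules.
    print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))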

4. Briefly explain the pre-processing steps for classification.

Pre-processing is a data mining technique that transforms raw data into an understandable format. Raw (real-world) data is often incomplete and cannot be fed through a model as-is; doing so would cause errors. That is why data must be pre-processed before being sent through a model.

a. Data Cleaning
Remove noise and correct inconsistencies in data.

b. Data Integration
Data with different representations are put together and conflicts within the data are
resolved.

c. Data Transformation
Data is normalized and generalized. Normalization scales attribute values so that they fall within a small, specified range, such as 0.0 to 1.0.

d. Data Reduction
When the volume of data is huge, databases can become slow, costly to access, and challenging to store properly. The data reduction step aims to obtain a reduced representation of the data in the data warehouse.
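
As a small illustration of steps (a) and (c), here is a sketch using pandas and scikit-learn; the toy column names and the choice of min-max scaling are assumptions made for the example.

    # Hypothetical toy data; "age" and "income" are invented column names.
    import pandas as pd
    from sklearn.preprocessing import MinMaxScaler

    df = pd.DataFrame({"age": [25, None, 47, 35],
                       "income": [50000, 64000, None, 120000]})

    # a. Data cleaning: fill in missing values (here, with the column mean).
    df = df.fillna(df.mean())

    # c. Data transformation: min-max normalization into the [0, 1] range.
    df[["age", "income"]] = MinMaxScaler().fit_transform(df[["age", "income"]])
    print(df)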

5. Explain how classification is done using the Decision Tree Induction method.

• A decision tree is a supervised learning method used for classification and regression tasks in data mining.
• It is a tree that helps us make decisions.
• The method creates a classification or regression model in the form of a tree structure.
• It divides the data set into smaller and smaller subsets while, at the same time, incrementally developing the decision tree.
• The final tree is a tree with decision nodes and leaf nodes.
• A decision node has at least two branches; leaf nodes represent a classification or decision.
• Leaf nodes cannot be split any further. The topmost decision node in the tree, which corresponds to the best predictor variable, is called the root node.
• Decision trees can handle both categorical and numeric data. (A toy example follows.)
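
As a concrete picture of the root node, decision nodes, and leaf nodes, the classic play-tennis tree can be written as a nested structure; the attribute names are the usual textbook example, not taken from this assignment.

    # Root node "outlook" is the best predictor; nested dicts are decision
    # nodes, plain strings are leaf nodes carrying the class label.
    tree = {"outlook": {
        "sunny":    {"humidity": {"high": "no", "normal": "yes"}},
        "overcast": "yes",
        "rainy":    {"windy": {"true": "no", "false": "yes"}},
    }}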

6. Discuss the advantages of using decision trees for classification.

• It does not require any domain knowledge.
• It is easy to understand.
• The learning and classification steps of decision trees are simple and fast.
• Decision trees can handle high-dimensional data.

7. What is meant by data partition in the decision tree classification process?

There are two main types of decision trees.

1. Classification trees
2. Regression trees

• In classification trees, the decision variable is categorical/discrete.
• Such trees are constructed through a process called binary recursive partitioning: an iterative process of splitting the data into partitions and then splitting each branch further.

8. What is an attribute selection measure?

Attribute selection measures are also called splitting rules because they determine how the tuples at a given node are to be split.

9. Briefly discuss the following three attribute selection measures.

a. Information Gain:

• Information gain refers to the decrease in entropy after the data set is split on an attribute; it is also called entropy reduction. Building a decision tree is about finding the attributes that return the highest information gain.

b. Gain Ratio:

• The information gain measure is biased toward tests with many outcomes.
• Gain ratio is defined as

𝐺𝑎𝑖𝑛𝑅𝑎𝑡𝑖𝑜(𝐴) = 𝐺𝑎𝑖𝑛(𝐴) / 𝑆𝑝𝑙𝑖𝑡𝐼𝑛𝑓𝑜(𝐴)

• The attribute with the maximum gain ratio is selected as the splitting attribute.

c. Gini Index:

• The Gini index is calculated by subtracting the sum of the squared probabilities of each class from one. It tends to favor larger partitions and is easy to implement, whereas information gain tends to favor smaller partitions with many distinct values. (The sketch below computes all three measures on a toy split.)
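
A minimal sketch in plain Python comparing the three measures on a toy binary split; the label lists are invented for illustration.

    from math import log2
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def gini(labels):
        # Gini index: one minus the sum of squared class probabilities.
        n = len(labels)
        return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

    def info_gain(parent, children):
        # Entropy reduction achieved by the split.
        n = len(parent)
        return entropy(parent) - sum(len(c) / n * entropy(c) for c in children)

    def gain_ratio(parent, children):
        # SplitInfo penalizes tests with many (small) outcomes.
        n = len(parent)
        split_info = -sum(len(c) / n * log2(len(c) / n) for c in children)
        return info_gain(parent, children) / split_info

    # Toy split of 10 tuples into two partitions.
    parent = ["yes"] * 6 + ["no"] * 4
    children = [["yes"] * 5 + ["no"], ["yes"] + ["no"] * 3]
    print(info_gain(parent, children), gain_ratio(parent, children), gini(parent))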

10. Clearly explain the decision tree algorithm.


• The decision tree algorithm may appear long, but the basic technique is quite simple:
• The algorithm takes three parameters: D, attribute_list, and Attribute_selection_method.
• Generally, we refer to D as a data partition. Initially, D is the entire set of training tuples and their associated class labels (the input training data).
• The parameter attribute_list is the set of attributes describing the tuples.
• Attribute_selection_method specifies a heuristic procedure for choosing the attribute that "best" discriminates the given tuples according to class; this procedure applies an attribute selection measure.

• Calculate the information gain of each feature.
• Given that not all rows belong to the same class, split the dataset S into subsets using the feature for which the information gain is maximum.
• Make a decision tree node using the feature with the maximum information gain.
• If all rows belong to the same class, make the current node a leaf node with that class as its label.
• Repeat for the remaining features until we run out of features or the decision tree consists entirely of leaf nodes. (These steps are sketched in code below.)
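
A compact ID3-style sketch of the steps above. The data layout (a list of dicts with a "class" key) is an assumption made for the example; the trees produced have the same nested-dict form as in question 5.

    from math import log2
    from collections import Counter

    def entropy(labels):
        n = len(labels)
        return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

    def best_feature(rows, features):
        # Maximum information gain = minimum weighted entropy of the subsets.
        def weighted_entropy(f):
            values = {r[f] for r in rows}
            return sum(len(s) / len(rows) * entropy([r["class"] for r in s])
                       for s in ([r for r in rows if r[f] == v] for v in values))
        return min(features, key=weighted_entropy)

    def build_tree(rows, features):
        labels = [r["class"] for r in rows]
        if len(set(labels)) == 1:        # all rows in the same class -> leaf
            return labels[0]
        if not features:                 # ran out of features -> majority leaf
            return Counter(labels).most_common(1)[0][0]
        f = best_feature(rows, features)
        rest = [a for a in features if a != f]
        # One branch per value of the chosen feature; recurse on each subset.
        return {f: {v: build_tree([r for r in rows if r[f] == v], rest)
                    for v in {r[f] for r in rows}}}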

11. Derive three decision trees for the following data set by using the above three attribute selection measures. (Table 6.1 on page 299)

Note: Clearly show the final tree with all the steps of the answer.

12. Briefly discuss the following two tree pruning techniques.

a. Pre-Pruning: The tree is pruned by halting its construction early.

b. Post-Pruning: This approach removes a sub-tree from a fully grown tree.

13. Clearly explain the following two tree pruning algorithms. (You may need additional references.)
a. Reduced Error

• Start with the complete tree; run the test data through it and record, at each node, the number of examples falling in each class.
• For each non-leaf node, count the number of errors if the sub-tree is retained and the number of errors if the sub-tree is pruned.
• A pruned node usually produces fewer errors on the test data than the sub-tree it replaces.
• The difference between the two error counts (if positive) is a measure of the gain from pruning the sub-tree.
• Among all nodes, choose the one with the largest difference as the sub-tree to prune.
• Continue this process, including nodes where the difference is reduced to zero, until further pruning would increase the misclassification rate.
• This produces the smallest version of the most accurate tree with respect to the test set.
• There may be a number of sub-trees with the same (largest) difference.
• The approach generates a sequence of trees, ending with the smallest minimum-error tree on the test data. (A code sketch of this procedure follows.)
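
A sketch of reduced-error pruning over trees in the nested-dict form used above. This is a greedy bottom-up variant written for illustration under stated assumptions, not a verbatim transcription of the referenced algorithm.

    from collections import Counter

    def classify(tree, row, default):
        # Walk the nested-dict tree; unseen attribute values fall back to default.
        while isinstance(tree, dict):
            feature, branches = next(iter(tree.items()))
            tree = branches.get(row[feature], default)
        return tree

    def errors(tree, rows, default):
        return sum(classify(tree, r, default) != r["class"] for r in rows)

    def reduced_error_prune(tree, rows, default):
        if not isinstance(tree, dict) or not rows:
            return tree
        feature, branches = next(iter(tree.items()))
        # Prune bottom-up: each child sees only the test rows that reach it.
        pruned = {feature: {v: reduced_error_prune(
                                sub, [r for r in rows if r[feature] == v], default)
                            for v, sub in branches.items()}}
        # Candidate leaf: majority class of the test rows reaching this node.
        leaf = Counter(r["class"] for r in rows).most_common(1)[0][0]
        # Keep the leaf if it makes no more errors on the test data than the sub-tree.
        if errors(leaf, rows, default) <= errors(pruned, rows, default):
            return leaf
        return pruned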

b. Pessimistic Error

• The goal is to avoid the need for a separate test data set.
• The misclassification rate of a tree on its own training data is overly optimistic.
• If that rate is used for pruning, it produces an excessively large tree.
• It is therefore recommended to apply a continuity correction for the binomial distribution so that the misclassification rate is estimated more accurately.
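
Stated as formulas (this is the C4.5-style formulation, added here as an assumption since the notes do not spell it out): if a node 𝑡 covers 𝑁 training cases with 𝑒(𝑡) errors, and the sub-tree 𝑇𝑡 rooted at 𝑡 has 𝐿 leaves, the continuity correction adds half an error per leaf:

𝑒′(𝑡) = 𝑒(𝑡) + 1/2 (corrected error if 𝑡 is collapsed to a leaf)
𝑒′(𝑇𝑡) = (sum of errors over the leaves of 𝑇𝑡) + 𝐿/2 (corrected error of the sub-tree)
SE = sqrt( 𝑒′(𝑇𝑡) · (𝑁 − 𝑒′(𝑇𝑡)) / 𝑁 )

The sub-tree 𝑇𝑡 is pruned when 𝑒′(𝑡) ≤ 𝑒′(𝑇𝑡) + SE.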

14. Apply the above two tree pruning algorithms and check whether the above tree can be optimized.

15. Clearly explain the Classification and Regression Tree (CART) method.


• The decision tree generated by CART is strictly binary: each decision node has exactly two branches.
• CART recursively partitions the records of the training dataset into subsets of records with similar values for the target attribute.
• At each decision node, the CART algorithm performs an exhaustive search over all available variables and all possible split values, and then selects the best split according to the following criterion.
• Let 𝛷(𝑠|𝑡) be a measure of the "goodness" of a candidate split 𝑠 at node 𝑡, where

𝛷(𝑠|𝑡) = 2𝑃𝐿𝑃𝑅 ∑ |𝑃(𝑗|𝑡𝐿) − 𝑃(𝑗|𝑡𝑅)|, with the sum taken over 𝑗 = 1, …, (number of classes),

and where

𝑡𝐿 = left child node of node 𝑡
𝑡𝑅 = right child node of node 𝑡
𝑃𝐿 = (number of records at 𝑡𝐿) / (number of records in training set)
𝑃𝑅 = (number of records at 𝑡𝑅) / (number of records in training set)
𝑃(𝑗|𝑡𝐿) = (number of class 𝑗 records at 𝑡𝐿) / (number of records at 𝑡)
𝑃(𝑗|𝑡𝑅) = (number of class 𝑗 records at 𝑡𝑅) / (number of records at 𝑡)

• The optimal split is whichever split maximizes the measure 𝛷(𝑠|𝑡) over all possible splits at node 𝑡. (A small computational sketch follows.)
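
A small sketch computing 𝛷(𝑠|𝑡) for candidate splits on one numeric attribute, following the formula above; the record layout and the toy credit-risk values are assumptions made for the example.

    # Records are (attribute value, class) pairs; the classes are hypothetical.
    def goodness_of_split(records, split_value, classes):
        left = [c for v, c in records if v <= split_value]
        right = [c for v, c in records if v > split_value]
        n = len(records)                       # records at the parent node t
        p_l, p_r = len(left) / n, len(right) / n
        # Sum over classes of |P(j|tL) - P(j|tR)|, both relative to node t.
        diff = sum(abs(left.count(j) / n - right.count(j) / n) for j in classes)
        return 2 * p_l * p_r * diff

    records = [(23, "bad"), (31, "bad"), (42, "good"), (55, "good"), (60, "good")]
    best = max({v for v, _ in records},
               key=lambda s: goodness_of_split(records, s, {"good", "bad"}))
    print("best split: attribute <=", best)    # perfect split at 31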

16. Derive the CART tree for the following credit risk example.
