Weekly Assignment 3
1. Explain what classification is.
Classification is a data analysis task: the process of finding a model that describes and distinguishes
data classes and concepts.
It is the problem of identifying to which of a set of categories (sub-populations) a new observation
belongs, on the basis of a training set of data containing observations whose category membership
is known.
A. attribute vector
C. training tuples
The individual tuples that make up the training set are called training tuples; they are selected from
the database being analyzed. In the context of classification, data tuples may also be called
samples, examples, instances, data points, or objects.
D. testing tuples
F. supervised learning
Because the class label of each training tuple is provided, this step is also called supervised
learning (that is, the learning of the classifier is "supervised" in that it is told which class each
training tuple belongs to).
b. Classification step
The model is used to predict class labels; it is tested on test data to
estimate the accuracy of the classification rules.
Pre-processing is a data mining technique that transforms raw data into an understandable format. Raw
(real-world) data is almost always incomplete, and such data cannot be fed directly into a model without
causing errors. That is why we need to preprocess data before passing it to a model.
a. Data Cleaning
Remove noise and correct inconsistencies in data.
b. Data Integration
Data with different representations are put together and conflicts within the data are
resolved.
c. Data Transformation
Data is normalized and generalized. Normalization here means scaling attribute values so
that they fall within a small, specified range, such as 0.0 to 1.0.
d. Data Reduction
When the volume of data is huge, databases can become slower, costly to access, and
challenging to store properly. The data reduction step aims to produce a reduced
representation of the data in a data warehouse.
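The cleaning and transformation steps above can be sketched in a few lines. This is a minimal illustration, not a library API: `clean_missing` and `min_max_normalize` are hypothetical helper names, showing mean imputation (a simple data-cleaning strategy) followed by min-max normalization (a common transformation before classification).

```python
def clean_missing(values, fill=None):
    """Data cleaning: replace missing entries (None) with the mean
    of the values that are present."""
    present = [v for v in values if v is not None]
    fill = sum(present) / len(present) if fill is None else fill
    return [fill if v is None else v for v in values]

def min_max_normalize(values):
    """Data transformation: rescale numbers to the [0, 1] range."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # constant column: avoid division by zero
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

raw = [10, None, 30, 50]
cleaned = clean_missing(raw)          # the None becomes the mean, 30.0
print(min_max_normalize(cleaned))     # [0.0, 0.5, 0.5, 1.0]
```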
6. Discuss what are the advantages of using Decision Tree for classification?
1. Classification trees
2. Regression trees
Attribute selection measures are also called splitting rules because they determine how to split the
tuples at a given node.
a. Information Gain:
Information gain refers to the decrease in entropy after the data set is split. It is also
called entropy reduction. Building a decision tree is about discovering the attributes that
return the highest information gain.
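The definition above can be computed directly: entropy of the labels before the split, minus the weighted entropy of each branch after it. A small self-contained sketch (the function names are illustrative):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, groups):
    """Entropy reduction when `labels` is split into `groups`
    (each group is the label list of one branch)."""
    n = len(labels)
    remainder = sum(len(g) / n * entropy(g) for g in groups)
    return entropy(labels) - remainder

labels = ["yes", "yes", "no", "no"]
# A split that separates the classes perfectly removes all entropy:
print(information_gain(labels, [["yes", "yes"], ["no", "no"]]))  # 1.0
```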
b. Gain Ratio:
The information gain measure is biased toward tests with many outcomes.
Gain ratio is defined as
GainRatio(A) = Gain(A) / SplitInfo(A)
The attribute with the maximum gain ratio is selected as the splitting attribute.
c. Gini Index:
It is calculated by subtracting the sum of the squared probabilities of each class from
one. The Gini index tends to favor larger partitions and is easy to implement, whereas
information gain tends to favor smaller partitions with many distinct values.
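The Gini index of a node, Gini(t) = 1 − Σ p_j², is a few lines of code (illustrative sketch):

```python
from collections import Counter

def gini(labels):
    """Gini index: 1 minus the sum of squared class probabilities."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini(["yes", "yes", "no", "no"]))  # 0.5 (maximally impure for 2 classes)
print(gini(["yes", "yes", "yes"]))       # 0.0 (pure node)
```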
11. Derive three decision trees for the following data set by using above three attribute selection
measure. (Table 6.1 on page 299)
Note: Clearly show the final tree with all the steps of the answer.
13. Clearly explain following two tree pruning algorithms (You may need additional references)
a. Reduced Error
Start with the complete tree, then run the test data through it, recording the count of each
class at each node.
For each non-leaf node, count the number of errors made if the subtree is retained, and the
number of errors made if the subtree is pruned (replaced by a leaf).
A pruned node often produces fewer errors on the test data than the subtree it replaces.
The difference between the two error counts (if positive) is a measure of the gain from
pruning that subtree.
Of all candidate nodes, choose the one with the largest difference and prune its subtree.
Continue this process, including nodes where the difference is zero, until further
pruning would increase the misclassification rate.
This produces the smallest version of the most accurate tree on the test set.
There may be several subtrees with the same (largest) difference.
This approach generates a sequence of trees, ending with the smallest minimum-error tree on
the test data.
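The reduced-error procedure above can be sketched on a toy tree. This is a simplified illustration, assuming a hypothetical tree representation (nested dicts for internal nodes, plain labels for leaves), and it prunes bottom-up whenever replacing a subtree by a majority-class leaf does not increase errors on the held-out test data:

```python
def classify(node, x):
    """Walk the tree: internal nodes test a feature, leaves hold a label."""
    while isinstance(node, dict):
        node = node["branches"][x[node["feature"]]]
    return node

def errors(node, data):
    return sum(1 for x, y in data if classify(node, x) != y)

def majority(data):
    labels = [y for _, y in data]
    return max(set(labels), key=labels.count)

def reduced_error_prune(node, test_data):
    """Replace a subtree by a leaf whenever that does not increase
    the error count on the held-out test data."""
    if not isinstance(node, dict) or not test_data:
        return node
    for value, child in node["branches"].items():
        subset = [(x, y) for x, y in test_data if x[node["feature"]] == value]
        node["branches"][value] = reduced_error_prune(child, subset)
    leaf = majority(test_data)
    if errors(leaf, test_data) <= errors(node, test_data):
        return leaf
    return node

tree = {"feature": "outlook",
        "branches": {"sunny": "no",
                     "rain": {"feature": "wind",
                              "branches": {"weak": "yes", "strong": "no"}}}}
test = [({"outlook": "rain", "wind": "weak"}, "yes"),
        ({"outlook": "rain", "wind": "strong"}, "yes"),
        ({"outlook": "sunny", "wind": "weak"}, "no")]
pruned = reduced_error_prune(tree, test)
# The "wind" subtree made one test error; the leaf "yes" makes none,
# so the subtree is replaced by that leaf.
print(pruned["branches"]["rain"])  # yes
```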
b. Pessimistic Error
The goal is to avoid the need for a separate test data set.
The misclassification rate of a tree on its own training data is overly
optimistic; if it is used directly for pruning, it produces an excessively large tree.
Quinlan therefore recommends a continuity correction, so that the binomial distribution
estimates the misclassification rate more accurately.
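One common form of Quinlan's pessimistic criterion can be sketched as follows: add 0.5 to the error count per leaf (the continuity correction), and prune when the leaf's corrected error count is within one standard error of the subtree's. This is an illustrative simplification, and the function names are hypothetical:

```python
from math import sqrt

def corrected_errors(errors, leaves):
    """Continuity-corrected error count: add 0.5 per leaf."""
    return errors + 0.5 * leaves

def prune_decision(subtree_errors, subtree_leaves, leaf_errors, n):
    """Prune if the leaf's corrected error count is within one standard
    error of the subtree's corrected error count, over n training records."""
    e_sub = corrected_errors(subtree_errors, subtree_leaves)
    se = sqrt(e_sub * (n - e_sub) / n)
    e_leaf = corrected_errors(leaf_errors, 1)
    return e_leaf <= e_sub + se

# A subtree with 3 leaves and 2 training errors on 20 records: replacing it
# by a leaf that makes 4 errors still passes the pessimistic test.
print(prune_decision(2, 3, 4, 20))   # True
print(prune_decision(2, 3, 8, 20))   # False: 8 leaf errors is too many
```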
14. Apply above two tree pruning algorithms and check whether the above tree can be optimized.
P_R = (number of records at t_R) / (number of records in the training set)
Then the optimal split is whichever split maximizes the measure Φ(s|t) over all possible splits at
node t.
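For a binary split this measure can be computed directly. The sketch below assumes the goodness-of-split form from Breiman et al.'s CART, Φ(s|t) = 2 · P_L · P_R · Σ_j |P(j|t_L) − P(j|t_R)|, where P_L and P_R are the fractions of records sent left and right; the function name is illustrative:

```python
def phi(left_labels, right_labels, classes):
    """Goodness-of-split Phi(s|t) = 2 * P_L * P_R * sum_j |P(j|t_L) - P(j|t_R)|
    for a binary split that sends `left_labels` left and `right_labels` right."""
    n = len(left_labels) + len(right_labels)
    p_l, p_r = len(left_labels) / n, len(right_labels) / n
    diff = sum(abs(left_labels.count(j) / len(left_labels)
                   - right_labels.count(j) / len(right_labels))
               for j in classes)
    return 2 * p_l * p_r * diff

# A balanced split that separates the two classes perfectly scores 1.0:
print(phi(["good", "good"], ["bad", "bad"], ["good", "bad"]))  # 1.0
```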
16. Derive CART Tree for the following credit risk example.