
Statistical Data Mining

Weekly Assignment 5

1. Clearly explain how the scalability issue in Decision Tree algorithms is managed using the following
methods.

a. SLIQ

 SLIQ can handle both categorical and continuous-valued attributes.
 It proposes presorting techniques for disk-resident data sets that are too large to fit in memory.
 It defines new data structures to facilitate tree construction.
 SLIQ employs disk-resident attribute lists and a single memory-resident class list (see the sketch after this list).
 Each attribute has an associated attribute list, indexed by a RID (record identifier).
 Each tuple is represented by a linkage of one entry from each attribute list to an entry in the class list (holding the class label of the given tuple), which in turn is linked to its corresponding leaf node in the decision tree.
 The class list remains in memory because it is frequently accessed and modified during the building and pruning phases.
 The size of the class list grows proportionally with the number of tuples in the training set.
 When the class list cannot fit into memory, the performance of SLIQ decreases.
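
As an illustration, here is a minimal Python sketch of the SLIQ data structures described above; the attribute names, values, and leaf labels are hypothetical, not from the original implementation:

```python
# Disk-resident attribute lists: one list of (value, RID) pairs per
# attribute, each presorted by value.
attribute_lists = {
    "age":    sorted([(23, 0), (31, 1), (45, 2)]),
    "salary": sorted([(40_000, 0), (60_000, 1), (15_000, 2)]),
}

# Memory-resident class list: RID -> [class label, current leaf node].
class_list = {
    0: ["yes", "N1"],
    1: ["no",  "N1"],
    2: ["yes", "N1"],
}

# Evaluating a split scans one attribute list in sorted order and follows
# the RID link into the class list to find each tuple's class and leaf.
for value, rid in attribute_lists["salary"]:
    label, leaf = class_list[rid]
    print(f"salary={value}: class={label}, leaf={leaf}")
```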

b. SPRINT

 SPRINT can also handle both categorical and continuous-valued attributes.
 Like SLIQ, it proposes presorting techniques for disk-resident data sets that are too large to fit in memory.
 It defines new data structures to facilitate tree construction.
 SPRINT uses a different attribute list data structure, which holds the class and RID information together with the attribute values (see the sketch after this list).
 When a node is split, the attribute lists are partitioned and distributed among the resulting child nodes accordingly.
 When a list is partitioned, the order of the records in the list is maintained; hence, partitioning lists does not require resorting.
 SPRINT was designed to be easily parallelized, further contributing to its scalability.
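
A minimal sketch of how a SPRINT attribute list can be partitioned at a split without resorting; the records and the split value are made up for the example:

```python
# SPRINT attribute list: one (value, class, RID) record per tuple,
# presorted by value exactly once.
attribute_list = [
    (15, "no", 2), (23, "yes", 0), (31, "no", 1), (45, "yes", 3),
]

def partition(attr_list, split_value):
    """Split an attribute list on the test value <= split_value.

    Scanning in order and appending is a stable partition, so both child
    lists stay sorted and no resorting is needed after the split.
    """
    left, right = [], []
    for record in attr_list:
        (left if record[0] <= split_value else right).append(record)
    return left, right

left_child, right_child = partition(attribute_list, split_value=30)
print(left_child)   # [(15, 'no', 2), (23, 'yes', 0)] -- still sorted
print(right_child)  # [(31, 'no', 1), (45, 'yes', 3)] -- still sorted
```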

c. RainForest

 RainForest adapts to the amount of main memory available and can be applied to any decision tree induction algorithm.
 At each tree node, it maintains an AVC-set (Attribute-Value, Classlabel) for each attribute, which summarizes the class distribution (a count per class label) for each value of that attribute among the training tuples at the node.
 The set of all AVC-sets at a node forms its AVC-group. Because the size of an AVC-set depends only on the number of distinct attribute values and class labels, not on the number of training tuples, AVC-groups are typically small enough to fit in memory (see the sketch below).
 Splits are evaluated directly on the AVC-sets, so the full training set never has to reside in memory.
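
A minimal sketch of building an AVC-set and AVC-group in Python; the tuples and attribute names are hypothetical:

```python
from collections import Counter, defaultdict

# Hypothetical training tuples at a node: (age, income, class label).
tuples = [
    ("youth", "high", "no"),
    ("youth", "low", "yes"),
    ("senior", "high", "yes"),
    ("senior", "low", "yes"),
]

def avc_set(rows, attr_index):
    """AVC-set for one attribute: attribute value -> class distribution."""
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[attr_index]][row[-1]] += 1
    return {value: dict(dist) for value, dist in counts.items()}

# AVC-group: the collection of all AVC-sets at this node.
avc_group = {"age": avc_set(tuples, 0), "income": avc_set(tuples, 1)}
print(avc_group["age"])  # {'youth': {'no': 1, 'yes': 1}, 'senior': {'yes': 2}}
```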

2. Clearly explain the Naive Bayesian algorithm for classification.


 This is a classification technique based on Bayes' theorem, with an assumption of independence between predictors. In simple terms, a naive Bayes classifier assumes that the presence of a particular feature in a class is independent of the presence of any other feature.
 Naive Bayes models are easy to build and are especially useful for very large data sets. Along with its simplicity, naive Bayes can also outperform even highly sophisticated classification methods.
 Bayes' theorem provides a way to calculate the posterior probability P(c|x) from P(c), P(x), and P(x|c). Look at the following equations:

P(c|x) = P(x|c) P(c) / P(x)

P(c|X) = P(x_1|c) × P(x_2|c) × ... × P(x_n|c) × P(c)
Above,

 P(c|x) is the posterior probability of the class (c, target) given the predictor (x, attributes).
 P(c) is the prior probability of the class.
 P(x|c) is the likelihood, i.e., the probability of the predictor given the class.
 P(x) is the prior probability of the predictor.
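
To make the computation concrete, here is a minimal Python sketch of a naive Bayes posterior calculation; the toy data and feature values are made up for the example:

```python
from collections import Counter, defaultdict

# Hypothetical training data: each row is (x1, x2, class label).
data = [
    ("sunny", "hot", "no"), ("sunny", "mild", "yes"),
    ("rainy", "mild", "yes"), ("rainy", "hot", "no"),
    ("sunny", "hot", "yes"),
]

# Priors P(c) and per-feature likelihoods P(x_i | c) estimated from counts.
class_counts = Counter(row[-1] for row in data)
likelihoods = defaultdict(Counter)   # (feature index, class) -> value counts
for *features, label in data:
    for i, value in enumerate(features):
        likelihoods[(i, label)][value] += 1

def posterior_score(features, label):
    """P(x1|c) * P(x2|c) * ... * P(c): the numerator of Bayes' theorem.

    P(x) is omitted because it is the same for every class and does not
    change which class maximizes the score.
    """
    score = class_counts[label] / len(data)          # prior P(c)
    for i, value in enumerate(features):
        score *= likelihoods[(i, label)][value] / class_counts[label]
    return score

x = ("sunny", "hot")
print({c: posterior_score(x, c) for c in class_counts})
```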

3. What is/are the main assumption(s) of the Naive Bayesian Algorithm?

 There are no dependence relationships among the attributes; that is, the attributes are conditionally independent given the class.
 It is assumed that there are no functional dependencies among the attributes.
 Each feature is given the same weight (or importance).
4. Consider the training data in Table 6.1. Using the naive Bayesian classifier, we classify the tuple X = (age = youth, income = Medium, student = Yes, credit_rating = Fair). First, the prior probability of each class is computed:
P(BuysComputer = Yes) = 9/14 = 0.643

P(BuysComputer = No) = 5/14 = 0.357

To compute P(X|C_i) for i = 1, 2, we compute the following conditional probabilities:

P(age = youth | BuysComputer = Yes) = 2/9 = 0.222
P(age = youth | BuysComputer = No) = 3/5 = 0.600
P(income = Medium | BuysComputer = Yes) = 4/9 = 0.444
P(income = Medium | BuysComputer = No) = 2/5 = 0.400
P(student = Yes | BuysComputer = Yes) = 6/9 = 0.667
P(student = Yes | BuysComputer = No) = 1/5 = 0.200
P(credit_rating = Fair | BuysComputer = Yes) = 6/9 = 0.667
P(credit_rating = Fair | BuysComputer = No) = 2/5 = 0.400
P(X | BuysComputer = Yes)
= P(age = youth | BuysComputer = Yes) × P(income = Medium | BuysComputer = Yes) × P(student = Yes | BuysComputer = Yes) × P(credit_rating = Fair | BuysComputer = Yes)
= 0.222 × 0.444 × 0.667 × 0.667
= 0.044

Similarly,

P(X | BuysComputer = No)
= P(age = youth | BuysComputer = No) × P(income = Medium | BuysComputer = No) × P(student = Yes | BuysComputer = No) × P(credit_rating = Fair | BuysComputer = No)
= 0.600 × 0.400 × 0.200 × 0.400
= 0.019
To find the class C_i that maximizes P(X|C_i) P(C_i), we compute:

P(X | BuysComputer = Yes) P(BuysComputer = Yes) = 0.044 × 0.643 = 0.028

P(X | BuysComputer = No) P(BuysComputer = No) = 0.019 × 0.357 = 0.007

Therefore, the naive Bayesian classifier predicts BuysComputer = Yes for the tuple X.
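
A short Python check of the arithmetic above, using only the probabilities already computed from Table 6.1:

```python
# Priors and conditional probabilities taken from the worked example above,
# in the order (age, income, student, credit_rating) of the tuple X.
prior = {"Yes": 9 / 14, "No": 5 / 14}
likelihood = {
    "Yes": [2 / 9, 4 / 9, 6 / 9, 6 / 9],
    "No":  [3 / 5, 2 / 5, 1 / 5, 2 / 5],
}

for c in ("Yes", "No"):
    p_x_given_c = 1.0
    for p in likelihood[c]:
        p_x_given_c *= p                 # P(X | C_i): product of conditionals
    score = p_x_given_c * prior[c]       # P(X | C_i) * P(C_i)
    print(f"P(X|{c}) * P({c}) = {score:.3f}")

# Prints roughly 0.028 for Yes and 0.007 for No, so X is classified as Yes.
```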

5. Discuss advantages and disadvantages of Naive Bayesian Algorithm.

Advantages

 It can easily and quickly predict the class of a test data set, and it also performs well in multi-class prediction.
 When the independence assumption holds, a naive Bayes classifier performs better than other models (such as logistic regression) and requires less training data.
 It performs well with categorical input variables compared with numerical variables. For numerical variables, a normal distribution is assumed (a bell-shaped curve, which is a strong assumption).

Disadvantages

 If a categorical variable has a category in the test data set that was not observed in the training data set, the model assigns it a probability of 0 (zero) and is unable to make a prediction. This is often referred to as the "zero-frequency" problem.
 Another limitation of naive Bayes is the assumption of independent predictors. In real life, it is almost impossible to obtain a set of completely independent predictors.

6. Clearly explain what a Bayesian Belief Network (BBN) is.

 Bayesian Belief Networks, or "Bayesian networks" for short, provide a simple way of applying Bayes' theorem to complex problems.
 Bayesian networks are a network-based framework for representing and analyzing models involving uncertainty.
 They are used for intelligent decision aids, intelligent diagnostic aids, data mining, etc.
 They were invented as a result of the cross-fertilization of ideas among the artificial intelligence, decision analysis, and statistics communities.

7. Discuss how BBNs differ from other knowledge representation and probabilistic analysis tools.

 BBNs differ from other knowledge-based systems because uncertainty is handled in a statistically rigorous yet efficient and simple way.
 They differ from other probabilistic analysis tools because of their network representation of problems, their use of Bayesian statistics, and the synergy between these.

8. Explain how a BBN is trained using the gradient descent strategy.


 “Gradient descent is an optimization algorithm for finding a local minimum of a differentiable function. Gradient descent is simply used to find the values of a function's parameters (coefficients) that minimize a cost function as much as possible.”
 Gradient descent is an iterative process for finding the minimum of a function. It is an optimization algorithm used to find the parameters (coefficients) at which a function attains its minimum value. It cannot always guarantee finding the global minimum, because it may get stuck in a local minimum.
 Optimization is an important part of machine learning and deep learning.
 Almost every machine learning algorithm has an optimization algorithm at its core that minimizes its cost function.
 When we fit a straight line with linear regression, we optimize the intercept and slope (see the sketch after this list).
 When we use logistic regression for classification, we optimize the shape of the fitted curve, and when we use the t-SNE algorithm, we optimize the clusters.
 Note that in all the cases above, the gradient descent procedure itself remains the same.
 In a BBN, the entries of the conditional probability tables are treated as weights, analogous to the weights in a neural network, and are iteratively updated along the gradient of the log-likelihood of the training data until a (possibly local) optimum is reached.
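
As an illustration, here is a minimal gradient descent sketch in Python that fits the intercept and slope of a straight line by minimizing the mean squared error; the toy data and learning rate are made up for the example:

```python
# Toy data generated from y = 2x + 1 (made up for illustration).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

intercept, slope = 0.0, 0.0   # initial parameter guesses
learning_rate = 0.05
n = len(xs)

for step in range(2000):
    # Gradients of the mean squared error with respect to each parameter.
    grad_intercept = sum(2 * ((intercept + slope * x) - y)
                         for x, y in zip(xs, ys)) / n
    grad_slope = sum(2 * ((intercept + slope * x) - y) * x
                     for x, y in zip(xs, ys)) / n

    # Move each parameter a small step against its gradient.
    intercept -= learning_rate * grad_intercept
    slope -= learning_rate * grad_slope

print(f"intercept ~ {intercept:.3f}, slope ~ {slope:.3f}")  # approaches 1 and 2
```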

9.

P(W|C) = P(W|S,R) × P(R|C) × P(S|C) + P(W|S,¬R) × P(¬R|C) × P(S|C)
       + P(W|¬S,R) × P(R|C) × P(¬S|C) + P(W|¬S,¬R) × P(¬R|C) × P(¬S|C)
= 0.99 × 0.8 × 0.1 + 0.9 × 0.2 × 0.1 + 0.9 × 0.8 × 0.9 + 0 × 0.2 × 0.9
= 0.7452
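
A small Python check of this computation, marginalizing out S and R. The variable names assume the classic cloudy/sprinkler/rain/wet-grass network, which this question appears to use; the CPT values are read off the calculation above:

```python
from itertools import product

# Conditional probability table entries (True = the event occurs).
p_s_given_c = {True: 0.1, False: 0.9}   # P(S | C), P(¬S | C)
p_r_given_c = {True: 0.8, False: 0.2}   # P(R | C), P(¬R | C)
p_w_given_sr = {                        # P(W | S, R)
    (True, True): 0.99, (True, False): 0.9,
    (False, True): 0.9, (False, False): 0.0,
}

# P(W|C) = sum over all states (s, r) of P(W|s,r) * P(s|C) * P(r|C).
p_w_given_c = sum(
    p_w_given_sr[(s, r)] * p_s_given_c[s] * p_r_given_c[r]
    for s, r in product((True, False), repeat=2)
)
print(p_w_given_c)  # 0.7452
```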
