• Decision tree induction is a common technique in data mining that is used to generate
a predictive model from a dataset. This technique involves constructing a tree-like
structure, where each internal node represents a test on an attribute, each branch
represents the outcome of the test, and each leaf node represents a prediction. The
goal of decision tree induction is to build a model that can accurately predict the
outcome of a given event, based on the values of the attributes in the dataset.
• To build a decision tree, the algorithm first selects the attribute that best splits the data
into distinct classes. This is typically done using a measure of impurity, such as
entropy or the Gini index, which measures the degree of disorder in the data. The
algorithm then repeats this process for each branch of the tree, splitting the data into
smaller and smaller subsets until all of the data is classified.
• Decision tree induction is a popular technique in data mining because it is easy to
understand and interpret, and it can handle both numerical and categorical data.
Additionally, decision trees can handle large amounts of data, and they can be
updated with new data as it becomes available. However, decision trees can be prone
to overfitting, where the model becomes too complex and does not generalize well to
new data. As a result, data scientists often use techniques such as pruning to simplify
the tree and improve its performance.
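The impurity measures mentioned above (entropy and the Gini index) can be sketched in a few lines of Python. This is a minimal illustration; the function names are ours, not from any particular library:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini index: chance of misclassifying a randomly drawn label."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

# A 50/50 split is maximally impure; a pure node has zero impurity.
print(entropy(["yes", "yes", "no", "no"]))  # 1.0
print(gini(["yes", "yes", "no", "no"]))     # 0.5
print(gini(["yes", "yes", "yes"]))          # 0.0
```

A decision tree algorithm evaluates candidate splits by computing the impurity of each resulting subset and choosing the attribute that reduces it the most.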
According to Bayes' theorem:

P(A|B) = P(B|A) P(A) / P(B)

Where,
P(A|B) is the Posterior probability: the probability of hypothesis A given the observed event B.
P(B|A) is the Likelihood: the probability of the evidence B given that hypothesis A is true.
P(A) is Prior Probability: Probability of hypothesis before observing the evidence.
P(B) is Marginal Probability: Probability of Evidence.
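Putting the four terms together is a one-line computation. The numbers below are assumed for illustration only, not taken from the text:

```python
# Posterior from Bayes' theorem: P(A|B) = P(B|A) * P(A) / P(B)
def posterior(likelihood, prior, marginal):
    return likelihood * prior / marginal

# Illustrative values (assumed): P(B|A)=0.8, P(A)=0.3, P(B)=0.4
print(round(posterior(likelihood=0.8, prior=0.3, marginal=0.4), 3))  # 0.6
```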
Working of Naïve Bayes' Classifier:
Working of Naïve Bayes' Classifier can be understood with the help of the below example:
Suppose we have a dataset of weather conditions and a corresponding target variable "Play".
Using this dataset, we need to decide whether or not we should play on a particular day,
according to the weather conditions. To solve this problem, we follow the steps below:
1. Convert the given dataset into frequency tables.
2. Generate a likelihood table by finding the probabilities of the given features.
3. Use Bayes' theorem to calculate the posterior probability.
Problem: If the weather is sunny, should the player play or not?
Solution: To solve this, first consider the below dataset:
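The original weather table is not reproduced here, so the sketch below uses a small, hypothetical (Outlook, Play) dataset to walk through the three steps: build the frequency tables, derive the likelihoods, and apply Bayes' theorem:

```python
from collections import Counter

# Hypothetical weather data (Outlook, Play) -- illustrative only,
# not the original table from the text.
data = [("Sunny", "No"), ("Sunny", "No"), ("Overcast", "Yes"),
        ("Rainy", "Yes"), ("Rainy", "Yes"), ("Rainy", "No"),
        ("Overcast", "Yes"), ("Sunny", "Yes"), ("Sunny", "Yes"),
        ("Rainy", "Yes"), ("Sunny", "Yes"), ("Overcast", "Yes"),
        ("Overcast", "Yes"), ("Rainy", "No")]

n = len(data)
play = Counter(label for _, label in data)                 # frequency table over "Play"
sunny = Counter(label for o, label in data if o == "Sunny")  # frequencies on sunny days
p_sunny = sum(sunny.values()) / n                          # marginal P(Sunny)

# Bayes: P(Yes|Sunny) = P(Sunny|Yes) * P(Yes) / P(Sunny), and likewise for "No"
p_yes_sunny = (sunny["Yes"] / play["Yes"]) * (play["Yes"] / n) / p_sunny
p_no_sunny = (sunny["No"] / play["No"]) * (play["No"] / n) / p_sunny
print(round(p_yes_sunny, 3), round(p_no_sunny, 3))
```

With these (made-up) counts, P(Yes|Sunny) exceeds P(No|Sunny), so the classifier would predict "play"; with the original table the counts, and possibly the decision, would differ.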
Data Mining can be referred to as knowledge mining from data, knowledge extraction,
data/pattern analysis, data archaeology, and data dredging. In this article, we will see
techniques to evaluate the accuracy of classifiers.
HoldOut
In the holdout method, the dataset is randomly divided into three subsets:
• The training set is the subset of the dataset used to build predictive
models.
• The validation set is the subset of the dataset used to assess the
performance of the model built in the training phase. It provides a test platform for
fine-tuning the model’s parameters and selecting the best-performing model. Not all
modeling algorithms need a validation set.
• The test set, or unseen examples, is the subset of the dataset used to assess the
likely future performance of the model. If a model fits the training set much better
than it fits the test set, overfitting is the probable cause.
Typically, two-thirds of the data is allocated to the training set and the remaining one-
third to the test set.
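The two-thirds/one-third split described above can be sketched as follows (a minimal illustration; the function name and seed are ours):

```python
import random

def holdout_split(data, seed=0):
    """Shuffle, then split: about two-thirds training, one-third test."""
    rng = random.Random(seed)
    shuffled = data[:]
    rng.shuffle(shuffled)
    cut = round(len(shuffled) * 2 / 3)
    return shuffled[:cut], shuffled[cut:]

train, test = holdout_split(list(range(30)))
print(len(train), len(test))  # 20 10
```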
Random Subsampling
• Random subsampling is a variation of the holdout method in which the holdout
procedure is repeated K times.
• Each repetition randomly splits the data into a training set and a
test set.
• The model is trained on the training set, and the mean squared error (MSE) is
computed from its predictions on the test set.
• Because the MSE depends on the particular split (a new split can give a different
MSE), relying on a single split is not recommended; repeating the split and averaging
reduces this variance.
• The overall error estimate is calculated as E = \frac{1}{K} \sum_{i=1}^{K} E_{i}
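The procedure can be sketched with a toy "predict-the-training-mean" model standing in for a real learner (function names and the example data are illustrative):

```python
import random

def mse(y_true, y_pred):
    """Mean squared error between two equal-length sequences."""
    return sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)

def random_subsampling(ys, K=5, seed=0):
    """Repeat the holdout split K times and average the per-split MSE."""
    rng = random.Random(seed)
    errors = []
    for _ in range(K):
        data = ys[:]
        rng.shuffle(data)
        cut = round(len(data) * 2 / 3)
        train, test = data[:cut], data[cut:]
        pred = sum(train) / len(train)        # toy model: predict the mean
        errors.append(mse(test, [pred] * len(test)))
    return sum(errors) / K                    # E = (1/K) * sum of E_i

print(random_subsampling([1.0, 2.0, 3.0, 4.0, 5.0, 6.0]))
```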
Cross-Validation
• K-fold cross-validation is used when only a limited amount of data is
available, to achieve an unbiased estimate of the model's performance.
• Here, we divide the data into K subsets of equal sizes.
• We build models K times, each time leaving out one of the subsets from training
and using it as the test set.
• If K equals the sample size, this is called “leave-one-out” cross-validation.
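The partition into K equal-sized folds can be sketched as follows (a minimal illustration with index lists; the function name is ours):

```python
def kfold_indices(n, k):
    """Partition indices 0..n-1 into k nearly equal, disjoint folds."""
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    return folds

folds = kfold_indices(10, 5)
for test_fold in folds:
    train_idx = [j for f in folds if f is not test_fold for j in f]
    # ... train on train_idx, evaluate on test_fold ...
print(folds)  # [[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]
```

In practice the data is shuffled before partitioning; with n = 10 and k = 10 this reduces to leave-one-out.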
Bootstrapping
• Bootstrapping is a technique for making estimations from data by averaging
the estimates obtained from smaller data samples.
• The bootstrapping method involves the iterative resampling of a dataset with
replacement.
• With resampling, instead of estimating a statistic only once on the complete data,
we can estimate it many times.
• Repeating this multiple times helps to obtain a vector of estimates.
• Bootstrapping can compute variance, expected value, and other relevant statistics of
these estimates.
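The resampling loop above can be sketched as follows, using the sample mean as the statistic (the function name, data, and seed are illustrative):

```python
import random
import statistics

def bootstrap_estimates(data, n_resamples=1000, seed=0):
    """Resample with replacement and recompute the statistic each time."""
    rng = random.Random(seed)
    n = len(data)
    return [statistics.mean(rng.choices(data, k=n)) for _ in range(n_resamples)]

data = [2.0, 4.0, 4.0, 5.0, 7.0, 9.0]
ests = bootstrap_estimates(data)
print(statistics.mean(ests))      # expected value of the estimate
print(statistics.variance(ests))  # variance of the estimate
```

The vector of estimates approximates the sampling distribution of the statistic, which is what makes the variance and expected value computable.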