
Labeling

Entropy
• Entropy is a measure of randomness (impurity) in a set of class labels.
• An ideal split partitions the data set into subsets with lower entropy.
• Higher entropy means higher randomness; lower entropy means less randomness.
• The split that reduces entropy the most has the highest information gain.
• Underfitting: the tree is too shallow to capture the pattern in the data.
• Overfitting: the tree is too deep and memorizes the training data.
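The entropy measure above can be sketched in a few lines. This is a minimal Python illustration (the slides later use R); the function name is my own:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    counts = Counter(labels)
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in counts.values())

# A pure set has zero entropy; a 50/50 mix of two classes has the maximum, 1 bit.
print(entropy(["yes", "yes", "yes", "yes"]))  # 0.0
print(entropy(["yes", "yes", "no", "no"]))    # 1.0
```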
Random forest
• In the random forest approach, a large number of decision trees are
created.
• Every observation is fed into every decision tree.
• The most common outcome across the trees is used as the final
output.
• A new observation is fed into all the trees, and a majority vote across
the trees decides its classification.
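The voting step described above can be sketched as follows. Python is used here for illustration (the slides use R), and the tiny "trees" are hypothetical stand-ins for real fitted decision trees:

```python
from collections import Counter

# Three hypothetical "trees", each a function mapping an observation to a class.
trees = [
    lambda x: "spam" if x["caps"] > 5 else "ham",
    lambda x: "spam" if x["links"] > 2 else "ham",
    lambda x: "spam" if x["caps"] > 5 or x["links"] > 3 else "ham",
]

def forest_predict(trees, observation):
    """Feed the observation into every tree and take a majority vote."""
    votes = [tree(observation) for tree in trees]
    return Counter(votes).most_common(1)[0][0]

# Two of the three trees vote "spam", so the forest outputs "spam".
print(forest_predict(trees, {"caps": 9, "links": 1}))  # spam
```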
• The R package "randomForest" is used to create random forests.
• Syntax
randomForest(formula, data)
Following is the description of the parameters used:
• formula is a formula describing the predictor and response variables.
• data is the name of the data set used.
Random forest
• Ensembling is a supervised learning technique in which multiple
models are trained on the same training dataset and their individual
outputs are combined by some rule to derive the final output.
• Random forest is one such powerful ensemble machine learning
algorithm: it creates multiple decision trees and combines the output
generated by each of them.
• Each tree computes the information gain of candidate splits at every
node and splits on the attribute with maximum information gain.
• The forest combines the outputs of the individual trees to come
up with its own final output.
• The dataset is taken from the UCI website and can be found at this link. The
data contains 7 variables: six explanatory (Buying Price, Maintenance,
NumDoors, NumPersons, BootSpace, Safety) and one response variable
(Condition).
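The split rule described above (choose the attribute with maximum information gain at each node) can be sketched like this. The toy rows below are invented for illustration and are not the UCI car dataset:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels, in bits."""
    counts = Counter(labels)
    n = len(labels)
    return sum(-(c / n) * log2(c / n) for c in counts.values())

def information_gain(rows, labels, attribute):
    """Parent entropy minus the weighted entropy of the child subsets."""
    n = len(labels)
    children = {}
    for row, label in zip(rows, labels):
        children.setdefault(row[attribute], []).append(label)
    weighted = sum(len(ls) / n * entropy(ls) for ls in children.values())
    return entropy(labels) - weighted

# Toy data: "safety" perfectly predicts the label, "doors" does not.
rows = [
    {"safety": "high", "doors": 2},
    {"safety": "high", "doors": 4},
    {"safety": "low", "doors": 2},
    {"safety": "low", "doors": 4},
]
labels = ["acc", "acc", "unacc", "unacc"]

best = max(["safety", "doors"], key=lambda a: information_gain(rows, labels, a))
print(best)  # safety (gain 1.0) beats doors (gain 0.0)
```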