ID3
The ID3 (Iterative Dichotomiser 3) algorithm is a classic decision tree algorithm used in machine learning for classification tasks. Its
primary goal is to recursively partition the data based on the features to create a decision tree that can be used for classification.
Entropy measures the impurity or disorder of a set of data:
Entropy(S) = - Σ (P_i log2 P_i)
Entropy is maximum when all classes are equally likely, indicating maximum disorder, and
it is minimum (zero) when the set is pure (contains instances of only one class).
Gini Index
The Gini index, like entropy, is a measure of impurity or disorder in a set of data. It is often used in the context of decision trees to evaluate
the quality of a split.
Gini(S) = 1 - Σ (P_i^2)
Similar to entropy, the Gini index is maximum when the classes are equally likely (maximum impurity) and minimum when the set is pure.
The Gini index is computationally less intensive than entropy, making it a popular choice in decision tree algorithms.
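The two impurity measures above can be sketched in a few lines of Python; the function names and the toy label lists here are illustrative, not part of any particular library.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (log base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gini(labels):
    """Gini index of a list of class labels."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

# A 50/50 split maximizes both measures; a pure set has zero impurity.
print(entropy(["a", "a", "b", "b"]))  # 1.0
print(gini(["a", "a", "b", "b"]))     # 0.5
```

Note that for a two-class split the maximum of entropy is 1.0 while the maximum of the Gini index is 0.5; both drop to 0 on a pure set.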
Information Gain
Information gain is a concept used in decision tree algorithms, particularly in the context of feature selection for splitting nodes.
In the context of decision trees, the information gain for a given feature is a measure of how well that feature separates the data into classes. It is
used to decide which feature to split on at each node of the tree.
Information gain is then calculated as the difference between the impurity of the parent dataset and the weighted impurity of the child nodes:
Gain(S, A) = Entropy(S) - Σ_v (|S_v| / |S|) Entropy(S_v)
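This difference can be computed directly; the sketch below uses entropy as the impurity measure and a hypothetical pure two-way split on a toy label list.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (log base 2) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted

# Splitting a perfectly mixed parent into two pure children recovers
# all of the parent's entropy.
parent = ["yes", "yes", "no", "no"]
print(information_gain(parent, [["yes", "yes"], ["no", "no"]]))  # 1.0
```

At each node, ID3 evaluates this quantity for every candidate feature and splits on the one with the highest gain.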
Tree Pruning
Tree pruning is a technique used in decision tree algorithms to prevent overfitting and improve the generalization
ability of the model. Pruning involves removing parts of the tree that do not provide significant predictive power,
simplifying the model.
Let's say you are building a decision tree for a classification task. As a pre-pruning measure, you decide to limit the maximum depth of the tree to 5 levels.
This means that the tree will stop growing once it reaches a depth of 5, and no further splits will be performed beyond that depth. This helps prevent the
tree from becoming overly complex and overfitting the training data.
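A minimal sketch of where that depth check lives in the recursion. This is not a full ID3 implementation: the split rule here just halves the rows, and the function and variable names are made up for illustration; only the pre-pruning stopping condition is the point.

```python
def build_tree(rows, labels, depth=0, max_depth=5):
    """Grow a toy decision tree, stopping at max_depth (pre-pruning)."""
    # Pre-pruning: stop once the depth limit is hit, the node is pure,
    # or there is nothing left to split.
    if depth >= max_depth or len(set(labels)) <= 1 or len(rows) < 2:
        # Leaf: predict the majority class at this node.
        return {"leaf": max(set(labels), key=labels.count), "depth": depth}
    mid = len(rows) // 2  # placeholder split; real ID3 picks by information gain
    return {
        "depth": depth,
        "left": build_tree(rows[:mid], labels[:mid], depth + 1, max_depth),
        "right": build_tree(rows[mid:], labels[mid:], depth + 1, max_depth),
    }

def tree_depth(node):
    """Depth of the deepest leaf in the tree."""
    if "leaf" in node:
        return node["depth"]
    return max(tree_depth(node["left"]), tree_depth(node["right"]))

tree = build_tree(list(range(100)), ["a", "b"] * 50, max_depth=5)
print(tree_depth(tree))  # never exceeds 5
```

However large and mixed the data, no leaf ends up deeper than the limit, which is exactly the pre-pruning behavior described above.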
Divisive Dendrogram for hierarchical clustering
Dendrogram:
A dendrogram is a tree-like diagram that visualizes the hierarchy of clusters produced by hierarchical clustering. It is particularly useful in understanding the
relationships between different clusters and deciding the appropriate number of clusters to use for a particular application.
Key Concepts:
Nodes:
Points where branches split are called nodes.
Steps in Divisive Clustering
Start with a Single Cluster:
Initially, all data points are in a single cluster.
Dissimilarity Measure:
Use a dissimilarity measure (e.g., Euclidean distance) to quantify the dissimilarity between data points or clusters.
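The steps above rely on a pairwise dissimilarity matrix; a minimal sketch using Euclidean distance (the point coordinates are arbitrary examples):

```python
import math

def euclidean(p, q):
    """Euclidean distance between two points given as coordinate tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

points = [(0, 0), (3, 4), (6, 8)]
# Pairwise dissimilarity matrix; divisive clustering consults this to
# decide which cluster to split next and where to split it.
matrix = [[euclidean(p, q) for q in points] for p in points]
print(matrix[0][1])  # 5.0
print(matrix[0][2])  # 10.0
```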
Medoids vs. Centroids
Difference:
The centroid is the average position of all points in a cluster, while the medoid is the data point
with the minimum dissimilarity to all other points in the cluster.
Centroid calculation involves taking the mean of the coordinates, while finding the medoid involves selecting an
actual data point.
The centroid is sensitive to outliers, while the medoid is more robust because it is less influenced by extreme
values.
The centroid is used in K-means, while the medoid is used in K-medoids and related algorithms.
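The contrast can be made concrete in a few lines; the points below are made-up examples, with one deliberate outlier to show the robustness difference.

```python
import math

def euclidean(p, q):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def centroid(points):
    """Mean of the coordinates; usually not one of the original points."""
    dim = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points) for d in range(dim))

def medoid(points):
    """The actual data point with minimum total dissimilarity to the others."""
    return min(points, key=lambda p: sum(euclidean(p, q) for q in points))

points = [(0, 0), (1, 0), (0, 1), (10, 10)]  # (10, 10) is an outlier
print(centroid(points))  # (2.75, 2.75) -- dragged toward the outlier
print(medoid(points))    # an original point, staying in the dense region
```

The centroid is pulled well away from the three clustered points, while the medoid remains one of them.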
tanh(z) = (e^{z} - e^{-z}) / (e^{z} + e^{-z})
σ(z) = 1 / (1 + e^{-z})
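The two activation functions above can be implemented directly from their formulas and checked numerically; the identity tanh(z) = 2σ(2z) - 1, which shows tanh is a rescaled sigmoid, is a standard fact used here as a sanity check.

```python
import math

def sigmoid(z):
    """Logistic sigmoid: sigma(z) = 1 / (1 + e^{-z})."""
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    """Hyperbolic tangent: (e^z - e^{-z}) / (e^z + e^{-z})."""
    return (math.exp(z) - math.exp(-z)) / (math.exp(z) + math.exp(-z))

# tanh is a rescaled, recentered sigmoid: tanh(z) = 2*sigma(2z) - 1.
z = 0.7
print(abs(tanh(z) - (2 * sigmoid(2 * z) - 1)) < 1e-12)  # True
```

Sigmoid squashes its input into (0, 1) while tanh maps it into (-1, 1), which is why tanh outputs are zero-centered.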