_____________________________
WEEK 5
Classification using Decision Trees
Here's a breakdown of the key concepts you've listed:
Classification model: decision tree:
• A decision tree is a supervised learning algorithm that uses a tree-like structure to classify data
points.
• Each internal node represents a test on a feature (attribute).
• Each branch represents the outcome of the test.
• Leaf nodes represent the final prediction (class label).
Goal of decision tree construction:
• The goal is to build a tree that makes accurate predictions on unseen data.
• This is achieved by splitting the data into increasingly pure subsets based on the features.
Impurity:
• Impurity measures how mixed the data is at a given node.
• The higher the impurity, the more mixed the data is, and the less certain we are about the class label.
• Common impurity measures include Gini impurity and entropy.
Purity measures:
• Purity measures how well a node is classified into a single class.
• The higher the purity, the better the classification.
• Purity is calculated by dividing the number of data points belonging to the majority class by the
total number of data points.
Calculating impurity:
• The specific formula for calculating impurity depends on the chosen measure.
• For Gini impurity: one minus the sum of squared class probabilities, Gini = 1 - sum(p_i^2).
• For entropy: the negative sum of each class probability times its log, Entropy = -sum(p_i * log2(p_i)). See the sketch below.
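As a concrete illustration, here is a minimal Python sketch (not from the notes; it assumes numpy is available) that computes both measures for a small node by hand:

import numpy as np

def gini(labels):
    # Gini impurity: 1 minus the sum of squared class probabilities
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    # Entropy: negative sum of p * log2(p) over the classes present
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

node = ["yes", "yes", "yes", "no"]  # 3 of one class, 1 of the other
print(gini(node))     # 1 - (0.75^2 + 0.25^2) = 0.375
print(entropy(node))  # -(0.75*log2(0.75) + 0.25*log2(0.25)) ≈ 0.811

A perfectly pure node scores 0 under both measures; a 50/50 node scores 0.5 (Gini) and 1.0 (entropy).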
Identifying pure sub-groups:
• Identifying pure sub-groups helps us make more accurate predictions.
• This is because pure sub-groups are more likely to belong to a single class.
Decision tree construction:
• The decision tree is built top-down by recursively splitting the data.
• At each split, the feature and threshold value that best separate the classes are chosen.
• This process continues until all nodes are pure or some stopping criteria are met; the sketch below prints the resulting split structure.
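A minimal sketch of this process, assuming scikit-learn and its bundled Iris dataset (neither is specified in the notes):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
# Grow a small tree top-down; each printed line is one feature/threshold test
tree = DecisionTreeClassifier(criterion="gini", max_depth=2).fit(X, y)
print(export_text(tree))

The first line of the printout is the first split; the indented lines are the subsequent splits that refine it.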
Tree diagrams: first split, second split:
• Tree diagrams visually represent the decision tree structure.
• They show the feature and value used at each split.
• The first split is typically the most informative and has the largest impact on the tree's performance.
• Subsequent splits refine the classification further.
Final partitioning:
• The final partitioning refers to the state of the tree once all nodes are pure or the stopping criteria
are met.
• Each leaf node represents a distinct class prediction.
Full tree:
• A full tree is a decision tree that has been grown without any stopping criteria.
• This can lead to overfitting, where the tree learns the training data too well and cannot generalize
to unseen data.
Calculating information gain of a split:
• Information gain measures how much a split improves the purity of the data.
• It's calculated by subtracting the weighted average impurity of the child nodes after the split from the impurity before the split (see the sketch after this list).
• Higher information gain indicates a better split.
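A sketch of this calculation, reusing the gini helper from the impurity sketch above:

def information_gain(parent, left, right, impurity=gini):
    # Impurity before the split, minus the size-weighted impurity after it
    n = len(parent)
    after = (len(left) / n) * impurity(left) + (len(right) / n) * impurity(right)
    return impurity(parent) - after

parent = ["yes", "yes", "yes", "no", "no", "no"]
left, right = ["yes", "yes", "yes"], ["no", "no", "no"]
print(information_gain(parent, left, right))  # 0.5 -- a perfect split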
Building a tree - stopping criteria:
• Various stopping criteria can be used to prevent overfitting (see the sketch after this list).
• Common criteria include:
◦ Minimum number of data points in a node
◦ Minimum information gain threshold
◦ Maximum tree depth
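These criteria map directly onto parameters of scikit-learn's DecisionTreeClassifier; a sketch (the parameter names are scikit-learn's, not the notes'):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(
    min_samples_leaf=5,          # minimum number of data points in a node
    min_impurity_decrease=0.01,  # minimum impurity-reduction (gain) threshold
    max_depth=3,                 # maximum tree depth
).fit(X, y)
print(tree.get_depth())  # will not exceed 3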
Overfitting and underfitting:
• Overfitting occurs when the tree learns the training data too well and cannot generalize to unseen
data.
• This results in poor performance on the test set.
• Underfitting occurs when the tree is not complex enough to capture the relationships in the data.
• This also results in poor performance on the test set.
Possible causes of overfitting:
• Too little training data
• Noise in the training data
• Too complex a tree (deep tree, many splits)
How to avoid overfitting:
• Use proper stopping criteria
• Pruning techniques such as pre-pruning and post-pruning
• Validating tree size on held-out data
Pre-pruning and post-pruning:
• Pre-pruning stops tree growth early, before the tree is fully grown.
• This is done by evaluating the information gain of candidate splits or using statistical tests.
• Post-pruning removes sub-trees from a fully grown tree.
• This is done by evaluating performance on a separate validation set (see the sketch below).
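A sketch of post-pruning using scikit-learn's minimal cost-complexity pruning (the ccp_alpha mechanism is scikit-learn's; the notes do not prescribe a specific method), choosing the pruning strength on a held-out validation set:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Candidate pruning strengths from the full tree's cost-complexity path
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

best_alpha, best_score = 0.0, 0.0
for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
    score = pruned.score(X_val, y_val)  # performance on the validation set
    if score > best_score:
        best_alpha, best_score = alpha, score
print(best_alpha, best_score)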
_____________________________
WEEK 6
Logistic Regression for Classification
Here's an explanation of the concepts you listed:
Logistic Regression:
• A statistical model used for classification tasks.
• Predicts the probability of an event occurring (e.g., spam email, credit card fraud) based on independent variables.
• Uses a logistic function (sigmoid function) to map the linear combination of features to a probability between 0 and 1 (see the sketch below).
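A sketch of the sigmoid itself (plain numpy; nothing model-specific here):

import numpy as np

def sigmoid(z):
    # Squashes any real-valued linear combination into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

# z = b0 + b1*x1 + ... + bk*xk is the linear combination of features
print(sigmoid(0.0))   # 0.5 -- the 50/50 point
print(sigmoid(4.0))   # ~0.98
print(sigmoid(-4.0))  # ~0.02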
Linear Probability Model:
• A simpler model that predicts the probability directly as a linear function of the features (ordinary linear regression fitted to a 0/1 outcome).
• Works reasonably when probabilities stay mid-range, but struggles with non-linear relationships and can predict values outside 0 to 1.
Issues with Linear Regression:
• Assumes a linear relationship between features and target variable, which may not hold true for
real-world data.
• Outputs can fall outside the valid range for probabilities (0 to 1).
• Not well-suited for multi-class classification.
The Logistic Regression Model:
• Uses the logistic function to transform the linear combination of features into a probability between 0 and 1.
• The predicted probability is therefore a non-linear (S-shaped) function of the features, even though the log-odds remain linear.
• Can be extended to handle multi-class classification.
Non-linear Probability Model:
• Logistic regression can be combined with non-linear transformations of features to capture complex relationships.
• This allows the model to learn non-linear decision boundaries.
Interpreting Coefficients:
• Coefficients in logistic regression represent the change in log-odds for the target variable given a
unit change in the corresponding feature.
• Positive coefficients increase the log-odds, making the class more likely.
• Negative coefficients decrease the log-odds, making the class less likely (see the odds-ratio sketch below).
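For concreteness, a sketch with a made-up coefficient (the value 0.7 is purely illustrative, not from the notes):

import numpy as np

b1 = 0.7  # hypothetical fitted coefficient for one feature

# A one-unit increase in the feature adds b1 to the log-odds,
# which multiplies the odds by exp(b1)
print(np.exp(b1))  # ~2.01: the odds roughly double per unit increase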
Maximum Likelihood Estimation (MLE):
• A method used to estimate the model parameters that maximize the likelihood of the observed
data.
• Involves iteratively updating the parameters to maximize the log-likelihood function (written out in the sketch below).
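The log-likelihood being maximized can be written out directly; a sketch for binary 0/1 labels:

import numpy as np

def log_likelihood(y, p):
    # y: observed 0/1 labels; p: model's predicted P(y = 1)
    # MLE picks the coefficients that make this sum as large as possible
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

y = np.array([1, 0, 1, 1])
p = np.array([0.9, 0.2, 0.8, 0.6])
print(log_likelihood(y, p))  # closer to 0 means a better fit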
Multi-Class Classification:
• Logistic regression can be extended to handle multiple classes.
• One-vs-rest or one-vs-one strategies are commonly used for multi-class classification (a one-vs-rest sketch follows below).
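A one-vs-rest sketch using scikit-learn's OneVsRestClassifier wrapper on the three-class Iris data (the wrapper and dataset are assumptions, not from the notes):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

X, y = load_iris(return_X_y=True)  # three classes

# One binary logistic regression is fitted per class
ovr = OneVsRestClassifier(LogisticRegression(max_iter=1000)).fit(X, y)
print(ovr.predict(X[:3]))        # predicted class labels
print(ovr.predict_proba(X[:3]))  # one probability per class, per row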
Decision Trees vs. Logistic Regression:
• Decision trees are easy to interpret and can capture non-linear relationships without feature engineering.
• Logistic regression produces well-calibrated probabilities and interpretable coefficients, and often generalizes better from limited data.
Learning Curve Comparison:
• Learning curves show the performance of a model as the training data size increases.
• Logistic regression tends to have a smoother learning curve than decision trees, which are more prone to overfitting (a sketch producing both curves follows below).
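A sketch that computes such curves for both models with scikit-learn's learning_curve (the dataset and settings are illustrative assumptions):

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import learning_curve
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

for name, model in [("logistic", LogisticRegression(max_iter=5000)),
                    ("tree", DecisionTreeClassifier(random_state=0))]:
    sizes, train_scores, val_scores = learning_curve(model, X, y, cv=5)
    # Mean cross-validated accuracy at each training-set size
    print(name, val_scores.mean(axis=1).round(3))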
Overfitting in Linear Regression:
• When the model learns the training data too well and cannot generalize to unseen data.
• Regularization techniques like L1 and L2 can be used to control overfitting.
Removing Variables using p-values:
• Removing variables based solely on p-values can be misleading.
• Variables with high p-values may still be important for the model's overall performance.
Shrinkage (Regularization) Methods:
• Techniques used to reduce the complexity of the model and prevent overfitting.
• L1 and L2 regularization penalize large coefficients, forcing the model to be more conservative.
Lasso (L1 Regularization):
• Shrinks coefficients towards zero, leading to sparse models with some features being completely
eliminated.
• Useful for feature selection and reducing model complexity.
Ridge Regression (L2 Regularization):
• Shrinks coefficients towards zero but does not eliminate them, so every feature stays in the model.
• Works well when many features each carry a small effect, but yields less sparse models than Lasso.
L1 vs. L2 Regularization in Logistic Regression:
• L1 regularization can lead to sparse models with better interpretability.
• L2 regularization tends to be more stable, especially when features are correlated.
• The best choice depends on the specific problem and the desired properties of the model (see the sketch below contrasting the two).
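A sketch contrasting the two penalties in scikit-learn's LogisticRegression (the dataset and the regularization strength C are illustrative; penalty="l1" requires a compatible solver such as liblinear):

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

X, y = load_breast_cancer(return_X_y=True)

# Smaller C means a stronger penalty in scikit-learn
lasso = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
ridge = LogisticRegression(penalty="l2", C=0.1, max_iter=5000).fit(X, y)

# L1 drives some coefficients exactly to zero; L2 only shrinks them
print(np.sum(lasso.coef_ == 0), "coefficients zeroed by L1")
print(np.sum(ridge.coef_ == 0), "coefficients zeroed by L2")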