
1. Describe in brief different types of regression algorithms.

Answer:
1. Linear Regression
Regression is a technique used to model and analyze the relationships between variables, and often how they together contribute to producing a particular outcome. A linear
regression refers to a regression model that is made up entirely of linear variables. Beginning
with the simple case, Single Variable Linear Regression is a technique used to model the
relationship between a single input independent variable (feature variable) and an output
dependent variable using a linear model, i.e. a line.

The more general case is Multi Variable Linear Regression where a model is created for the
relationship between multiple independent input variables (feature variables) and an output
dependent variable. The model remains linear in that the output is a linear combination of the
input variables. We can model a multi-variable linear regression as the following:

Y = a_1*X_1 + a_2*X_2 + a_3*X_3 + … + a_n*X_n + b

Where a_n are the coefficients, X_n are the variables and b is the bias. As we can see, this
function does not include any non-linearities, so it is only suited for modelling data where the
relationship between the inputs and the output is linear. It is quite easy to understand, as we are simply weighting the importance of each feature
variable X_n using the coefficient weights a_n. We determine these weights a_n and the
bias b using Stochastic Gradient Descent (SGD). Check out the illustration below for a more
visual picture!
Illustration of how Gradient Descent finds the optimal parameters for a Linear Regression
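As a rough sketch (not part of the original text), the multi-variable model above can be fitted with SGD using scikit-learn; the synthetic data and the coefficient values 3, -2 and 5 below are invented purely for illustration.

import numpy as np
from sklearn.linear_model import SGDRegressor

# Made-up data following y = 3*x1 - 2*x2 + 5 plus a little noise
rng = np.random.RandomState(0)
X = rng.rand(200, 2)
y = 3 * X[:, 0] - 2 * X[:, 1] + 5 + 0.01 * rng.randn(200)

# SGD estimates the weights a_n and the bias b by minimizing squared error
model = SGDRegressor(max_iter=1000, tol=1e-4, random_state=0)
model.fit(X, y)

print("coefficients a_n:", model.coef_)   # roughly [3, -2]
print("bias b:", model.intercept_)        # roughly 5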

A few key points about Linear Regression:

 Fast and easy to model and is particularly useful when the relationship to be modeled is
not extremely complex and if you don’t have a lot of data.

 Very intuitive to understand and interpret.

 Linear Regression is very sensitive to outliers.

2. Polynomial Regression
When we want to create a model that is suitable for handling data with a non-linear relationship, we
need to use polynomial regression. In this regression technique, the best-fit line is not a straight
line; it is a curve that fits the data points. In a polynomial regression, the power of
some independent variables is more than 1. For example, we can have something like:

Y = a_1*X_1 + a_2*(X_2)² + a_3*(X_3)⁴ + … + a_n*X_n + b

We can give some variables exponents, leave others without, and also select the exact exponent
we want for each variable. However, selecting the exact exponent of each variable naturally
requires some knowledge of how the data relates to the output. See the illustration below for a
visual comparison of linear vs polynomial regression.
Linear vs Polynomial Regression with data that is non-linearly separable
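As an illustrative sketch only (not from the original answer), a polynomial regression of this kind can be built in scikit-learn by expanding the features and fitting a linear model on top; the cubic toy data below is made up.

import numpy as np
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Invented curved data: y = 0.5*x^3 - x plus noise
rng = np.random.RandomState(0)
x = np.linspace(-3, 3, 100).reshape(-1, 1)
y = 0.5 * x.ravel() ** 3 - x.ravel() + rng.randn(100)

# Degree-3 polynomial regression: expands x into [x, x^2, x^3] before the linear fit
poly_model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
poly_model.fit(x, y)

print(poly_model.predict([[2.0]]))   # prediction on the fitted curve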
A few key points about Polynomial Regression:

 Able to model data with non-linear relationships, which linear regression cannot do. It is much more flexible in general and can model some fairly complex relationships.

 Full control over the modelling of the feature variables (which exponent to set for each).

 Requires careful design: you need some knowledge of the data in order to select the best exponents.

 Prone to overfitting if the exponents are poorly selected.

3. Ridge Regression
A standard linear or polynomial regression will fail in the case where there is high collinearity
among the feature variables. Collinearity is the existence of near-linear relationships among the
independent variables. The presence of hight collinearity can be determined in a few different
ways:

 A regression coefficient is not significant even though, theoretically, that variable should
be highly correlated with Y.

 When you add or delete an X feature variable, the regression coefficients change
dramatically.

 Your X feature variables have high pairwise correlations (check the correlation matrix).

We can first look at the optimization function of a standard linear regression to gain some insight as to how ridge regression can help:

min || Xw - y ||²

Where X represents the feature variables, w represents the weights, and y represents the ground
truth. Ridge Regression is a remedial measure taken to alleviate collinearity amongst regression
predictor variables in a model. Collinearity is a phenomenon in which one feature variable in a
multiple regression model can be linearly predicted from the others with a substantial degree of
accuracy. Because the feature variables are correlated in this way, the coefficient estimates of the
final regression model become unstable, i.e. the model has high variance.

To alleviate this issue, Ridge Regression adds a small squared bias factor (an L2 penalty) to the optimization:

min || Xw - y ||² + z|| w ||²

Such a squared bias factor shrinks the feature variable coefficients, introducing a small amount of bias into the model but greatly reducing the variance.

A few key points about Ridge Regression:

 The assumptions of this regression are the same as those of least squares regression, except that normality is not assumed.

 It shrinks the values of the coefficients but does not set them exactly to zero, so it performs no feature selection.
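As a minimal sketch (the data below is synthetic and the alpha value arbitrary, not from the original text), ridge regression in scikit-learn stabilizes the coefficients when two features are nearly collinear:

import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Two nearly collinear features (made-up data to illustrate the effect)
rng = np.random.RandomState(0)
x1 = rng.randn(100)
x2 = x1 + 0.01 * rng.randn(100)          # almost a copy of x1
X = np.column_stack([x1, x2])
y = 3 * x1 + 0.1 * rng.randn(100)

# Ordinary least squares: the two coefficients can blow up in opposite directions
print(LinearRegression().fit(X, y).coef_)

# Ridge (alpha plays the role of z above): coefficients are shrunk and stabilized
print(Ridge(alpha=1.0).fit(X, y).coef_)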

4. Lasso Regression
Lasso Regression is quite similar to Ridge Regression in that both techniques have the same
premise: we again add a biasing term to the regression optimization function in order to
reduce the effect of collinearity and thus the model variance. However, instead of using a squared
bias like ridge regression, lasso uses an absolute value bias:

min || Xw - y ||² + z|| w ||

There are a few differences between Ridge and Lasso regression that essentially come down to
the differences in the properties of L2 and L1 regularization:

 Built-in feature selection: this is frequently mentioned as a useful property of the L1-norm, which the L2-norm does not have. It is a result of the L1-norm tending to produce sparse coefficients. For example, suppose the model has 100 coefficients but only 10 of them are non-zero; this is effectively saying that "the other 90 predictors are useless in predicting the target values". The L2-norm produces non-sparse coefficients, so it does not have this property. Thus one can say that Lasso regression performs a form of "parameter selection", since the feature variables that aren't selected have a total weight of 0.

 Sparsity: refers to only a few entries in a matrix (or vector) being non-zero. The L1-norm has the property of producing many coefficients with zero or very small values and only a few large coefficients. This is connected to the previous point, where Lasso performs a type of feature selection.

 Computational efficiency: the L1-norm does not have an analytical solution, but the L2-norm does. This allows L2-norm solutions to be calculated efficiently. However, L1-norm solutions have the sparsity property, which allows them to be used with sparse algorithms, making the calculations more computationally efficient.
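A small sketch, using synthetic data and an arbitrary alpha (not from the original text), of the sparsity and feature-selection behaviour described above:

import numpy as np
from sklearn.linear_model import Lasso, Ridge

# 100 features but only 10 actually influence y (made-up sparse setting)
rng = np.random.RandomState(0)
X = rng.randn(200, 100)
true_w = np.zeros(100)
true_w[:10] = rng.uniform(1, 3, size=10)
y = X @ true_w + 0.1 * rng.randn(200)

lasso = Lasso(alpha=0.1).fit(X, y)
ridge = Ridge(alpha=0.1).fit(X, y)

# Lasso drives most irrelevant coefficients exactly to zero; ridge only shrinks them
print("non-zero Lasso coefficients:", np.sum(lasso.coef_ != 0))
print("non-zero Ridge coefficients:", np.sum(ridge.coef_ != 0))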

5. ElasticNet Regression
ElasticNet is a hybrid of the Lasso and Ridge Regression techniques. It uses both the L1 and L2
regularization, taking on the effects of both techniques:

min || Xw - y ||² + z_1|| w || + z_2|| w ||²

A practical advantage of trading off between Lasso and Ridge is that it allows Elastic-Net to
inherit some of Ridge's stability under rotation.
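A minimal sketch (synthetic data, arbitrary alpha and l1_ratio, not part of the original answer) of ElasticNet in scikit-learn, where l1_ratio balances the L1 term (z_1) against the L2 term (z_2):

from sklearn.linear_model import ElasticNet
from sklearn.datasets import make_regression

# Synthetic data with informative and redundant features (sizes are made up)
X, y = make_regression(n_samples=200, n_features=50, n_informative=10,
                       noise=5.0, random_state=0)

# alpha controls the overall penalty strength; l1_ratio balances L1 vs L2
enet = ElasticNet(alpha=0.5, l1_ratio=0.5, max_iter=10000, random_state=0)
enet.fit(X, y)

print("number of selected (non-zero) coefficients:", (enet.coef_ != 0).sum())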

A few key points about ElasticNet Regression:

 It encourages a grouping effect in the case of highly correlated variables, rather than zeroing some of them out like Lasso does.

 There are no limitations on the number of selected variables.

3. What is regularization? What is the advantage of Lasso regression over ridge regression?

Ans:

Regularization
This is a form of regression that constrains/regularizes or shrinks the coefficient estimates
towards zero. In other words, this technique discourages learning a more complex or flexible
model, so as to avoid the risk of overfitting.
A simple relation for linear regression looks like this. Here Y represents the learned relation and β
represents the coefficient estimates for the different variables or predictors (X).

Y ≈ β0 + β1X1 + β2X2 + …+ βpXp

The fitting procedure involves a loss function, known as the residual sum of squares or RSS. The
coefficients are chosen such that they minimize this loss function.

Now, this will adjust the coefficients based on your training data. If there is noise in the training
data, then the estimated coefficients won’t generalize well to the future data. This is where
regularization comes in and shrinks or regularizes these learned estimates towards zero.

Ridge Regression

In ridge regression, the RSS is modified by adding a shrinkage penalty, so that the quantity minimized becomes RSS + λ Σj βj². The coefficients are then estimated by minimizing this function. Here, λ is the tuning
parameter that decides how much we want to penalize the flexibility of our model. The increase
in flexibility of a model is represented by an increase in its coefficients, and if we want to minimize
the above function, then these coefficients need to be small. This is how the Ridge regression
technique prevents coefficients from rising too high. Also, notice that we shrink the estimated
association of each variable with the response, except the intercept β0. This intercept is a measure
of the mean value of the response when xi1 = xi2 = … = xip = 0.

When λ = 0, the penalty term has no effect, and the estimates produced by ridge regression will be
equal to the least squares estimates. However, as λ → ∞, the impact of the shrinkage penalty grows, and the
ridge regression coefficient estimates approach zero. As can be seen, selecting a good value
of λ is critical, and cross validation comes in handy for this purpose. This kind of shrinkage penalty, based on the sum of squared coefficients, is known as the L2 norm.

The coefficients produced by the standard least squares method are scale equivariant,
i.e. if we multiply each input by c, then the corresponding coefficients are scaled by a factor of 1/c.
Therefore, regardless of how the predictor is scaled, the product of predictor and
coefficient (Xjβj) remains the same. However, this is not the case with ridge regression, and
therefore we need to standardize the predictors, i.e. bring them to the same scale, before
performing ridge regression. This is typically done by dividing each predictor by its estimated standard deviation:

x̃_ij = x_ij / sqrt( (1/n) Σ_i (x_ij − x̄_j)² )

Lasso

Lasso is another variation, in which the quantity minimized is RSS + λ Σj |βj|. It is clear that this variation
differs from ridge regression only in how it penalizes the high coefficients: it uses |βj|
(modulus) instead of the squares of β as its penalty. In statistics, this is known as the L1 norm.

Let's take a look at the above methods from a different perspective. Ridge regression can be
thought of as solving an equation where the summation of the squares of the coefficients is less than or
equal to s, and the Lasso can be thought of as an equation where the summation of the moduli of the
coefficients is less than or equal to s. Here, s is a constant that exists for each value of the shrinkage
factor λ. These equations are also referred to as constraint functions.

Consider there are 2 parameters in a given problem. Then, according to the above formulation,
the ridge regression constraint is expressed by β1² + β2² ≤ s. This implies that the ridge regression coefficients
are given by the point with the smallest RSS (loss function) among all points that lie within the circle β1² + β2² ≤ s.

Similarly, for lasso, the constraint becomes |β1| + |β2| ≤ s. This implies that the lasso coefficients are given by the point with
the smallest RSS (loss function) among all points that lie within the diamond |β1| + |β2| ≤ s.

The image below describes these equations.


Credit: An Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani

The above image shows the constraint functions (green areas) for lasso (left) and ridge
regression (right), along with the contours of RSS (red ellipses). Points on a given ellipse share the same value
of RSS. For a very large value of s, the green regions will contain the center of the ellipse, making
the coefficient estimates of both regression techniques equal to the least squares estimates. But this
is not the case in the above image. In this case, the lasso and ridge regression coefficient estimates
are given by the first point at which an ellipse contacts the constraint region. Since ridge
regression has a circular constraint with no sharp points, this intersection will not generally
occur on an axis, and so the ridge regression coefficient estimates will be exclusively non-
zero. However, the lasso constraint has corners at each of the axes, and so the ellipse will often
intersect the constraint region at an axis. When this occurs, one of the coefficients will equal
zero. In higher dimensions (where there are many more than 2 parameters), many of the coefficient
estimates may equal zero simultaneously.

This sheds light on the obvious disadvantage of ridge regression: model
interpretability. It will shrink the coefficients of the least important predictors very close to zero,
but it will never make them exactly zero. In other words, the final model will include all
predictors. However, in the case of the lasso, the L1 penalty has the effect of forcing some of the
coefficient estimates to be exactly equal to zero when the tuning parameter λ is sufficiently
large. Therefore, the lasso method also performs variable selection and is said to yield sparse
models.

What does Regularization achieve?


A standard least squares model tends to have some variance in it, i.e. the model won't generalize
well to a data set different from its training data. Regularization significantly reduces the
variance of the model without a substantial increase in its bias. So the tuning parameter λ, used in
the regularization techniques described above, controls the impact on bias and variance. As the
value of λ rises, it reduces the value of the coefficients and thus reduces the variance. Up to a point,
this increase in λ is beneficial, as it only reduces the variance (hence avoiding overfitting)
without losing any important properties in the data. But after a certain value, the model starts
losing important properties, giving rise to bias in the model and thus underfitting. Therefore, the
value of λ should be carefully selected.
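As a rough sketch of this selection step (not part of the original text; the data is synthetic), scikit-learn's cross-validated estimators can search a grid of candidate λ values, which the library calls alpha:

import numpy as np
from sklearn.linear_model import RidgeCV, LassoCV
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=300, n_features=20, noise=10.0, random_state=0)

# Cross-validation over a grid of candidate λ values (alpha in scikit-learn)
alphas = np.logspace(-3, 3, 50)

ridge = RidgeCV(alphas=alphas).fit(X, y)
lasso = LassoCV(alphas=alphas, cv=5, max_iter=10000).fit(X, y)

print("best ridge alpha:", ridge.alpha_)
print("best lasso alpha:", lasso.alpha_)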
2. Illustrate with an example the algorithm gini gain with respect
to decision tree.
Ans:
Decision Trees are a type of Supervised Machine Learning (that is you explain what the input is
and what the corresponding output is in the training data) where the data is continuously split
according to a certain parameter. The tree can be explained by two entities, namely decision
nodes and leaves. The leaves are the decisions or the final outcomes. And the decision nodes are
where the data is split.

An example of a decision tree can be explained using the binary tree above. Let's say you want to
predict whether a person is fit, given information like their age, eating habits, physical
activity, etc. The decision nodes here are questions like 'What is the age?', 'Does the person exercise?',
'Does the person eat a lot of pizza?', and the leaves are outcomes like 'fit' or 'unfit'. In
this case it was a binary classification problem (a yes/no type problem).
There are two main types of Decision Trees:

1. Classification trees (Yes/No types)


What we've seen above is an example of a classification tree, where the outcome was a variable
like 'fit' or 'unfit'. Here the decision variable is categorical.

2. Regression trees (Continuous data types)

Here the decision or the outcome variable is Continuous, e.g. a number like 123.
Working
Now that we know what a Decision Tree is, we'll see how it works internally. There are many
algorithms out there which construct Decision Trees, but one of the best known is the ID3
algorithm. ID3 stands for Iterative Dichotomiser 3.
Before discussing the ID3 algorithm, we'll go through a few definitions.
Entropy
Entropy, also called Shannon entropy, is denoted by H(S) for a finite set S and is the measure of
the amount of uncertainty or randomness in the data. For a set whose classes c occur with proportions p(c), it is defined as

H(S) = Σc −p(c) log2 p(c)

Intuitively, it tells us about the predictability of a certain event. For example, consider a coin toss
whose probability of heads is 0.5 and probability of tails is 0.5. Here the entropy is the highest
possible, since there is no way of determining what the outcome might be. Alternatively, consider
a coin which has heads on both sides; the outcome of such a toss can be predicted perfectly,
since we know beforehand that it will always be heads. In other words, this event has no
randomness, hence its entropy is zero.
In particular, lower values imply less uncertainty while higher values imply high uncertainty.
Information Gain
Information gain, also called Kullback-Leibler divergence, is denoted by IG(S, A) for a set S and is
the effective change in entropy after deciding on a particular attribute A. It measures the relative
change in entropy with respect to the independent variables:

IG(S, A) = H(S) − H(S | A)

Alternatively,

IG(S, A) = H(S) − Σx P(x) H(Sx)

where IG(S, A) is the information gain from applying feature A, H(S) is the entropy of the entire
set, and the second term is the entropy after applying feature A, with P(x) the
probability of each value x of A and Sx the subset of examples for which A takes the value x.
Let’s understand this with the help of an example
Consider a piece of data collected over the course of 14 days, where the features are Outlook,
Temperature, Humidity and Wind, and the outcome variable is whether golf was played on the day.
Now, our job is to build a predictive model which takes the above 4 parameters and predicts
whether golf will be played on the day. We'll build a decision tree to do that using the ID3
algorithm.
 
Day  Outlook   Temperature  Humidity  Wind    Play Golf
D1   Sunny     Hot          High      Weak    No
D2   Sunny     Hot          High      Strong  No
D3   Overcast  Hot          High      Weak    Yes
D4   Rain      Mild         High      Weak    Yes
D5   Rain      Cool         Normal    Weak    Yes
D6   Rain      Cool         Normal    Strong  No
D7   Overcast  Cool         Normal    Strong  Yes
D8   Sunny     Mild         High      Weak    No
D9   Sunny     Cool         Normal    Weak    Yes
D10  Rain      Mild         Normal    Weak    Yes
D11  Sunny     Mild         Normal    Strong  Yes
D12  Overcast  Mild         High      Strong  Yes
D13  Overcast  Hot          Normal    Weak    Yes
D14  Rain      Mild         High      Strong  No

The ID3 algorithm performs the following tasks recursively:

1. Create a root node for the tree.
2. If all examples are positive, return the leaf node 'positive'.
3. Else, if all examples are negative, return the leaf node 'negative'.
4. Calculate the entropy of the current state, H(S).
5. For each attribute 'x', calculate the entropy with respect to that attribute, denoted H(S, x).
6. Select the attribute which has the maximum value of IG(S, x).
7. Remove the attribute that offers the highest IG from the set of attributes.
8. Repeat until we run out of attributes, or the decision tree consists only of leaf nodes.

Now we'll go ahead and grow the decision tree. The initial step is to calculate H(S), the entropy
of the current state. In the above example, we can see that in total there are 5 No's and 9 Yes's.

Yes  No  Total
9    5   14

H(S) = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.94

Remember that the entropy is 0 if all members belong to the same class, and 1 when half of them
belong to one class and the other half belong to the other class, which is perfect randomness. Here it is 0.94,
which means the distribution is fairly random.
Now the next step is to choose the attribute that gives us highest possible Information
Gain which we’ll choose as the root node.
Let's start with 'Wind':

IG(S, Wind) = H(S) − Σx P(x) H(Sx)

where x ranges over the possible values of the attribute. Here, the attribute 'Wind' takes two possible
values in the sample data, hence x = {Weak, Strong}.
We'll have to calculate H(S_weak) and H(S_strong). Amongst all the 14 examples we have 8 places where the wind
is Weak and 6 where the wind is Strong.

Wind = Weak  Wind = Strong  Total
8            6              14

Now, out of the 8 Weak examples, 6 of them were 'Yes' for Play Golf and 2 of them were 'No'
for Play Golf. So we have:

H(S_weak) = −(6/8) log2(6/8) − (2/8) log2(2/8) = 0.811

Similarly, out of the 6 Strong examples, we have 3 examples where the outcome was 'Yes' for
Play Golf and 3 where we had 'No' for Play Golf:

H(S_strong) = −(3/6) log2(3/6) − (3/6) log2(3/6) = 1.0

Remember, here half the items belong to one class while the other half belong to the other, hence we have
perfect randomness.
Now we have all the pieces required to calculate the Information Gain:

IG(S, Wind) = H(S) − (8/14) H(S_weak) − (6/14) H(S_strong) = 0.94 − (8/14)(0.811) − (6/14)(1.0) = 0.048

This tells us that considering 'Wind' as the feature gives an information gain of 0.048. Now we must similarly calculate the Information Gain for all the
features.
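The hand calculation above can be checked with a short script; this is an illustrative sketch (not part of the original answer) that hard-codes the 14-day table and computes IG(S, x) for each attribute:

from collections import Counter
from math import log2

# The 14-day Play Golf table from above: (Outlook, Temperature, Humidity, Wind, Play Golf)
data = [
    ("Sunny", "Hot", "High", "Weak", "No"),
    ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),
    ("Rain", "Mild", "High", "Strong", "No"),
]
features = ["Outlook", "Temperature", "Humidity", "Wind"]

def entropy(labels):
    # H(S) = sum over classes of -p(c) * log2(p(c))
    n = len(labels)
    return -sum((count / n) * log2(count / n) for count in Counter(labels).values())

def information_gain(rows, col):
    # IG(S, A) = H(S) - sum over values x of P(x) * H(S_x)
    labels = [row[-1] for row in rows]
    gain = entropy(labels)
    for value in set(row[col] for row in rows):
        subset = [row[-1] for row in rows if row[col] == value]
        gain -= (len(subset) / len(rows)) * entropy(subset)
    return gain

for index, name in enumerate(features):
    print(name, round(information_gain(data, index), 3))
# Prints approximately: Outlook 0.247, Temperature 0.029, Humidity 0.152, Wind 0.048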
Doing so gives IG(S, Outlook) = 0.246, IG(S, Humidity) = 0.151, IG(S, Temperature) = 0.029 and IG(S, Wind) = 0.048. We can clearly see that IG(S, Outlook) has the highest information gain of 0.246, hence we
choose the Outlook attribute as the root node. At this point, the decision tree looks like this.

Here we observe that whenever the Outlook is Overcast, Play Golf is always 'Yes'. This is no
coincidence; this simple subtree results precisely because Outlook gives the highest information gain.
Now how do we proceed from this point? We can simply apply recursion; you might want to
look back at the algorithm steps described earlier.
Now that we've used Outlook, we have three attributes remaining: Humidity, Temperature, and
Wind. And we had three possible values of Outlook: Sunny, Overcast, and Rain. The Overcast
node already ended up as the leaf node 'Yes', so we're left with two subtrees to compute:
Sunny and Rain.

The table where the value of Outlook is Sunny looks like this:

Temperature  Humidity  Wind    Play Golf
Hot          High      Weak    No
Hot          High      Strong  No
Mild         High      Weak    No
Cool         Normal    Weak    Yes
Mild         Normal    Strong  Yes

In a similar fashion, we compute the information gain for the Sunny subtree and obtain IG(S_sunny, Humidity) = 0.971, IG(S_sunny, Temperature) = 0.571 and IG(S_sunny, Wind) = 0.020.

As we can see, the highest Information Gain is given by Humidity. Proceeding in the same
way with the Outlook = Rain subset gives us Wind as the attribute with the highest information gain.
The final Decision Tree looks something like this.

Code:
Let’s see an example in Python

import pydotplus
from sklearn.datasets import load_iris
from sklearn import tree
from IPython.display import Image, display

__author__ = "Mayur Kulkarni <mayur.kulkarni@xoriant.com>"


def load_data_set():
    """
    Loads the iris data set

    :return: data set instance
    """
    iris = load_iris()
    return iris


def train_model(iris):
    """
    Train decision tree classifier

    :param iris: iris data set instance
    :return: classifier instance
    """
    clf = tree.DecisionTreeClassifier()
    clf = clf.fit(iris.data, iris.target)
    return clf


def display_image(clf, iris):
    """
    Displays the decision tree image

    :param clf: classifier instance
    :param iris: iris data set instance
    """
    dot_data = tree.export_graphviz(clf, out_file=None,
                                    feature_names=iris.feature_names,
                                    class_names=iris.target_names,
                                    filled=True, rounded=True)
    graph = pydotplus.graph_from_dot_data(dot_data)
    display(Image(data=graph.create_png()))


if __name__ == '__main__':
    iris_data = load_data_set()
    decision_tree_classifier = train_model(iris_data)
    display_image(clf=decision_tree_classifier, iris=iris_data)
Conclusion:
Below is a summary of what we've studied in this blog:

1. Entropy measures the discriminatory power of an attribute for a classification task. It defines the amount of randomness in an attribute with respect to the classes; minimal entropy means the attribute's values fall almost entirely into one class, giving it good discriminatory power for classification.
2. Information Gain is used to rank attributes for splitting at a given node in the tree. Attributes are ranked by information gain in decreasing order.
3. The recursive ID3 algorithm creates the decision tree.

4. Write the importance of PCA in machine learning model improvement.
Introduction to Dimensionality Reduction

 
Machine Learning: Machine learning is a field of study which allows computers to "learn" like humans without any need for explicit programming.
What is Predictive Modeling: Predictive modeling is a probabilistic process that allows us to
forecast outcomes on the basis of some predictors. These predictors are basically features that
come into play when deciding the final result, i.e. the outcome of the model.

What is Dimensionality Reduction?


In machine learning classification problems, there are often too many factors on the basis of
which the final classification is done. These factors are basically variables called features. The
higher the number of features, the harder it gets to visualize the training set and then work on it.
Sometimes, most of these features are correlated, and hence redundant. This is where
dimensionality reduction algorithms come into play. Dimensionality reduction is the process of
reducing the number of random variables under consideration, by obtaining a set of principal
variables. It can be divided into feature selection and feature extraction.
Why is Dimensionality Reduction important in Machine Learning and Predictive
Modeling?
An intuitive example of dimensionality reduction can be discussed through a simple e-mail
classification problem, where we need to classify whether the e-mail is spam or not. This can
involve a large number of features, such as whether or not the e-mail has a generic title, the
content of the e-mail, whether the e-mail uses a template, etc. However, some of these features
may overlap. In another condition, a classification problem that relies on both humidity and
rainfall can be collapsed into just one underlying feature, since both of the aforementioned are
correlated to a high degree. Hence, we can reduce the number of features in such problems. A 3-
D classification problem can be hard to visualize, whereas a 2-D one can be mapped to a simple
2 dimensional space, and a 1-D problem to a simple line. The below figure illustrates this
concept, where a 3-D feature space is split into two 1-D feature spaces, and later, if found to be
correlated, the number of features can be reduced even further.
Components of Dimensionality Reduction
There are two components of dimensionality reduction:
 Feature selection: In this, we try to find a subset of the original set of variables, or
features, to get a smaller subset which can be used to model the problem. It usually
involves three ways:
1. Filter
2. Wrapper
3. Embedded
 Feature extraction: This reduces the data in a high dimensional space to a lower
dimension space, i.e. a space with lesser no. of dimensions.
Methods of Dimensionality Reduction
The various methods used for dimensionality reduction include:
 Principal Component Analysis (PCA)
 Linear Discriminant Analysis (LDA)
 Generalized Discriminant Analysis (GDA)
Dimensionality reduction may be both linear or non-linear, depending upon the method used.
The prime linear method, called Principal Component Analysis, or PCA, is discussed below.
Principal Component Analysis
This method was introduced by Karl Pearson. It works on the condition that when data in a
higher-dimensional space is mapped to a lower-dimensional space, the variance of the data
in the lower-dimensional space should be maximized.

It involves the following steps:

 Construct the covariance matrix of the data.
 Compute the eigenvectors of this matrix.
 The eigenvectors corresponding to the largest eigenvalues are used to reconstruct a large fraction of the variance of the original data.

Hence, we are left with a smaller number of eigenvectors, and there may have been some data
loss in the process. But the most important variance should be retained by the remaining
eigenvectors.
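The steps above can be sketched directly with NumPy; this is an illustrative sketch only, where the 3-D data is synthetic and keeping 2 components is an arbitrary choice:

import numpy as np

# Made-up 3-D data in which the third feature is mostly redundant
rng = np.random.RandomState(0)
X = rng.randn(200, 3)
X[:, 2] = 0.9 * X[:, 0] + 0.1 * rng.randn(200)

# Step 1: centre the data and construct the covariance matrix
X_centered = X - X.mean(axis=0)
cov = np.cov(X_centered, rowvar=False)

# Step 2: compute eigenvalues and eigenvectors of the covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)

# Step 3: keep the eigenvectors with the largest eigenvalues (here, the top 2)
order = np.argsort(eigvals)[::-1]
components = eigvecs[:, order[:2]]

# Project the 3-D data onto the 2-D principal subspace
X_reduced = X_centered @ components
print(X_reduced.shape)   # (200, 2)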
Advantages of Dimensionality Reduction
 It helps in data compression, and hence reduced storage space.
 It reduces computation time.
 It also helps remove redundant features, if any.
Disadvantages of Dimensionality Reduction
 It may lead to some amount of data loss.
 PCA tends to find linear correlations between variables, which is sometimes undesirable.
 PCA fails in cases where the mean and covariance are not enough to define the dataset.
 We may not know how many principal components to keep; in practice, some rules of thumb are applied.

5. What is cross validation? Discuss briefly the different types of cross validation methods used.

Ans:

Validation
The process of deciding whether the numerical results quantifying hypothesized relationships
between variables are acceptable as descriptions of the data is known as validation. Generally,
an error estimation for the model is made after training, better known as evaluation of residuals.
In this process, a numerical estimate of the difference between the predicted and original responses is made,
also called the training error. However, this only gives us an idea of how well our model does
on the data used to train it. Now, it is possible that the model is underfitting or overfitting the data. So,
the problem with this evaluation technique is that it does not give an indication of how well the
learner will generalize to an independent/unseen data set. Getting this idea about our model
is known as Cross Validation.

Holdout Method
Now, a basic remedy for this involves removing a part of the training data and using it to get
predictions from the model trained on the rest of the data. The error estimation then tells how our
model is doing on unseen data, i.e. the validation set. This is a simple kind of cross validation
technique, also known as the holdout method. Although this method doesn't take any overhead
to compute and is better than traditional validation, it still suffers from issues of high
variance. This is because it is not certain which data points will end up in the validation set, and
the result might be entirely different for different sets.

K-Fold Cross Validation


As there is never enough data to train your model, removing a part of it for validation poses a
problem of underfitting. By reducing the training data, we risk losing important patterns/trends
in the data set, which in turn increases the error induced by bias. So, what we require is a method that
provides ample data for training the model and also leaves ample data for validation. K-Fold cross
validation does exactly that.

In K-Fold cross validation, the data is divided into k subsets. The holdout method is then
repeated k times, such that each time one of the k subsets is used as the test/validation set
and the other k-1 subsets are put together to form the training set. The error estimation is
averaged over all k trials to give the total effectiveness of our model. As can be seen, every data point
gets to be in the validation set exactly once, and gets to be in the training set k-1 times. This
significantly reduces bias, as we are using most of the data for fitting, and also significantly
reduces variance, as most of the data is also used in the validation set. Interchanging the
training and test sets also adds to the effectiveness of this method. As a general rule backed by
empirical evidence, K = 5 or 10 is generally preferred, but nothing is fixed and it can take any
value.
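A minimal sketch of K-Fold cross validation with scikit-learn (the iris data and the choice of a decision tree are placeholders, not from the original text):

from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# 5-fold cross validation: each sample is used for validation exactly once
kfold = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=kfold)

print("per-fold accuracy:", scores)
print("average accuracy:", scores.mean())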

Stratified K-Fold Cross Validation


In some cases, there may be a large imbalance in the response variables. For example, in a dataset
concerning the prices of houses, there might be a large number of houses having a high price. Or, in the case
of classification, there might be several times more negative samples than positive samples. For
such problems, a slight variation of the K-Fold cross validation technique is made, such that
each fold contains approximately the same percentage of samples of each target class as the
complete set, or, in the case of prediction problems, the mean response value is approximately equal
in all the folds. This variation is also known as Stratified K-Fold.
The above validation techniques are also referred to as non-exhaustive cross validation
methods: they do not compute all ways of splitting the original sample, i.e. you just have to
decide how many subsets need to be made. They are approximations of the methods explained
below, also called exhaustive methods, which compute all possible ways the data can be split
into training and test sets.

Leave-P-Out Cross Validation


This approach leaves p data points out of the training data, i.e. if there are n data points in the original
sample, then n-p samples are used to train the model and p points are used as the validation set.
This is repeated for all combinations in which the original sample can be split this way, and then
the error is averaged over all trials to give the overall effectiveness.

This method is exhaustive in the sense that it needs to train and validate the model for all possible
combinations, and for moderately large p it can become computationally infeasible.

A particular case of this method is p = 1, known as Leave-One-Out cross
validation. This method is generally preferred over Leave-P-Out because it does not suffer
from the same intensive computation, as the number of possible combinations is simply equal to the number of data
points in the original sample, n.
Cross Validation is a very useful technique for assessing the effectiveness of your model,
particularly in cases where you need to mitigate overfitting. It is also useful in determining the
hyperparameters of your model, in the sense of which parameter values will result in the lowest test
error. These are all the basics you need to get started with cross validation. You can get started with
all kinds of validation techniques using Scikit-Learn, which gets you up and running with just a
few lines of code in Python.
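For instance, a rough sketch along those lines (the iris data and the logistic regression model are just placeholders):

from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# Stratified K-Fold keeps the class proportions roughly equal in every fold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
print("stratified 5-fold accuracy:", cross_val_score(model, X, y, cv=skf).mean())

# Leave-one-out: n folds of size 1 (exhaustive, but feasible for 150 samples)
print("leave-one-out accuracy:", cross_val_score(model, X, y, cv=LeaveOneOut()).mean())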

6. What are ensemble machine learning models? Write short notes on Bagging, Boosting and Random forest algorithm.

Ans:
