Module 3:
Modeling and Evaluation
Faculty: Santosh Chapaneri
ML Modeling
• The structured representation of raw input data as a meaningful data set depends on the problem to be solved and the type of data.
• E.g., when the problem is related to prediction and the target field is numeric and continuous, a regression model is used; when the target field is categorical, a classification model is used.
santosh.chapaneri@ieee.org
Selecting a Model
• For the feature matrix X, the output is y = f(X) + e, where f is the unknown target function the model tries to approximate and e is the irreducible error.
Selecting a Model
• Recall the No Free Lunch Theorem: no single model works best for every problem.
• That’s why, while doing the data exploration (as we did before), we need to understand the characteristics of the data in order to select a suitable model.
Predictive Models
• Predictive models predict the value of a category or class to which a test data instance belongs.
• 3. Prediction of potential flu patients and the demand for flu shots next winter
Descriptive Models
• There is no target y in the case of unsupervised learning.
• Descriptive models that group together similar data instances, i.e. data instances having similar values of the different features, are called clustering models, e.g. k-means.
• 1. Customer grouping or segmentation based on social, demographic, etc. attributes
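As a concrete sketch of a clustering model, here is a minimal pure-Python k-means on made-up 2-D "customer" points (both the data and the `kmeans` helper are illustrative, not from the slides):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal k-means sketch: alternate assignment and centroid update."""
    centers = random.Random(seed).sample(points, k)
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # Assignment step: each point joins its nearest center's cluster
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centers]
            clusters[d.index(min(d))].append(p)
        # Update step: each center moves to the mean of its cluster
        centers = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
            if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
    return centers, clusters

# Two obvious customer segments: low spenders and high spenders
pts = [(1, 1), (1, 2), (2, 1), (9, 9), (9, 10), (10, 9)]
centers, clusters = kmeans(pts, k=2)
```

With clearly separated groups like these, the algorithm recovers the two segments regardless of the random initial centers.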
Training a Model
• Holdout method
• k-fold Cross-Validation
• Bootstrap sampling
• Also, the label value of the test data is not known in advance. That is the reason why a part of the input data is held back (that is how the name holdout originates) for evaluation of the model.
• This subset of the input data is used as the test data for measuring the performance of the trained model; typically the remaining 70%–80% of the data is used for training.
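A minimal holdout split in pure Python (the `holdout_split` helper and the toy data are illustrative):

```python
import random

def holdout_split(X, y, test_frac=0.3, seed=42):
    """Hold back a fraction of the labelled data as the test set (sketch)."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    n_test = int(len(X) * test_frac)
    test_idx, train_idx = idx[:n_test], idx[n_test:]
    return ([X[i] for i in train_idx], [X[i] for i in test_idx],
            [y[i] for i in train_idx], [y[i] for i in test_idx])

X = [[i] for i in range(10)]
y = [i % 2 for i in range(10)]
X_train, X_test, y_train, y_test = holdout_split(X, y)
# 7 instances train the model; 3 held-out instances evaluate it
```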
• Stratified random sampling distributes the data instances of each class proportionally amongst the training and test data sets.
• A special variant, called repeated holdout, is employed to ensure the randomness of the composed data sets: several random holdout splits are made and the average performance across the iterations is reported.
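A sketch of a stratified holdout split (helper and data are made up): each class contributes the same fraction to the test set, so class proportions are preserved:

```python
import random
from collections import defaultdict

def stratified_holdout(y, test_frac=0.25, seed=0):
    """Split indices so each class is represented proportionally (sketch)."""
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label, idx in by_class.items():
        rng.shuffle(idx)
        n_test = int(len(idx) * test_frac)
        test_idx += idx[:n_test]
        train_idx += idx[n_test:]
    return train_idx, test_idx

y = ['a'] * 8 + ['b'] * 4          # imbalanced classes, 2:1 ratio
train_idx, test_idx = stratified_holdout(y)
# test set holds 2 'a' and 1 'b' — the same 2:1 ratio as the full data
```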
• In 10-fold cross-validation, the data set is divided into 10 folds, each containing approximately 10% of the data. One of the folds is used as the test data for validating the performance of a model trained on the remaining 9 folds (90% of the data). This is repeated 10 times, once for each of the 10 folds being used as the test data with the remaining folds as the training data. The average performance across all folds is reported.
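The fold bookkeeping above can be sketched in a few lines (the `kfold_indices` generator is illustrative; real model training is elided to a comment):

```python
def kfold_indices(n, k=10):
    """Yield (train, test) index pairs for k-fold cross-validation (sketch)."""
    folds = [list(range(i, n, k)) for i in range(k)]   # round-robin folds
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train_idx, test_idx

fold_sizes = []
for train_idx, test_idx in kfold_indices(100, k=10):
    # here: train a model on the 90 training instances,
    # evaluate it on the 10 held-out instances, record the score
    fold_sizes.append((len(train_idx), len(test_idx)))
```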
• Bootstrap sampling creates training and test data sets from the input data set.
• It uses the technique of Simple Random Sampling with Replacement.
• Bootstrapping randomly picks data instances from the input data set, with the possibility of the same instance being picked more than once; from an input data set of N instances, bootstrapping can create one or more training data sets having N data instances, some data instances being repeated multiple times. The instances never picked can serve as the test data.
• Useful in case of input data sets of small size, i.e. having very few data instances.
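A bootstrap sample in pure Python (the helper is an illustrative sketch): draw N indices with replacement, and treat the never-drawn "out-of-bag" instances as test data:

```python
import random

def bootstrap_sample(n, seed=0):
    """Simple random sampling with replacement: draw n of n indices (sketch)."""
    rng = random.Random(seed)
    train_idx = [rng.randrange(n) for _ in range(n)]
    drawn = set(train_idx)
    # instances never drawn (~36.8% of the data on average) form the test set
    oob_idx = [i for i in range(n) if i not in drawn]
    return train_idx, oob_idx

train_idx, oob_idx = bootstrap_sample(100)
```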
• An eager learner constructs a generalized model during the training phase itself; once training is completed, the learner is ready with the model and doesn’t need to refer back to the training data.
• Eager learners take more time in the learning phase than lazy learners, but classify new instances quickly.
• E.g. SVM, Decision Tree (DT), Naïve Bayes (NB), Neural Network (NN)
• Lazy learning completely skips the abstraction and generalization processes.
• A lazy learner doesn’t ‘learn’ anything; it uses the training data as-is and defers the actual work of classification until a test instance arrives, hence lazy learning.
• Due to its dependency on the given training data instances, it is also known as instance-based learning.
• Lazy learners take very little time in training because not much of the model is actually built then; most of the computation happens at prediction time.
• E.g. KNN
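A minimal sketch of a lazy learner: KNN classification done entirely at prediction time (the helper and the toy data are made up for illustration):

```python
from collections import Counter

def knn_predict(X_train, y_train, x, k=3):
    """Lazy learning: nothing is built at training time; prediction
    scans the stored training instances for the k nearest neighbours."""
    dists = sorted(
        (sum((a - b) ** 2 for a, b in zip(p, x)), label)
        for p, label in zip(X_train, y_train)
    )
    top_k = [label for _, label in dists[:k]]
    return Counter(top_k).most_common(1)[0][0]

X_train = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
y_train = ['low', 'low', 'low', 'high', 'high', 'high']
knn_predict(X_train, y_train, (2, 2))   # -> 'low'
```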
• Generalization is challenging because the unknown data in the test data set may differ from the training data.
• Underfitting: If the target function is kept too simple, it may not be able to capture the essential nuances and represent the underlying data well.
• Underfitting results in both poor performance on the training data and poor generalization to test data.
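A tiny numeric illustration of underfitting (made-up data): a constant model is too simple for linear data, so its error is large even on the training set itself:

```python
# training data that is truly linear: y = 2x
xs = [0, 1, 2, 3, 4]
ys = [0, 2, 4, 6, 8]

const_pred = sum(ys) / len(ys)          # too-simple model: always predict the mean
linear_pred = lambda x: 2 * x           # model matching the underlying function

mse_const = sum((y - const_pred) ** 2 for y in ys) / len(ys)
mse_linear = sum((y - linear_pred(x)) ** 2 for x, y in zip(xs, ys)) / len(ys)
# mse_const is large even on training data (underfitting); mse_linear is 0
```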
• The first case is where the model has correctly classified data instances as belonging to the class of interest, i.e. True Positives (TP).
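The four confusion-matrix cases can be counted directly (the `confusion_counts` helper and the win/loss data are illustrative):

```python
def confusion_counts(y_true, y_pred, positive='win'):
    """Count the four confusion-matrix cells for a two-class problem (sketch)."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, fp, tn, fn

y_true = ['win', 'win', 'loss', 'loss', 'win', 'loss']
y_pred = ['win', 'loss', 'loss', 'win', 'win', 'loss']
tp, fp, tn, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / (tp + fp + tn + fn)   # 4 of 6 correct
```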
• If a malignant tumour is wrongly classified as benign, then this misclassification is fatal. It matters far less if a higher number of benign tumours are wrongly classified as malignant.
• In such problems, there are some measures of model performance that matter more than overall accuracy.
• Accuracy = (TP + TN) / (TP + FP + TN + FN): the proportion of all data instances that are correctly classified.
• Precision = TP / (TP + FP): the fraction of positive predictions that are actually positive. For the example of win/loss prediction, precision indicates how often the model predicts a win correctly.
• Recall = TP / (TP + FN): the number of true positives divided by the total number of positives. For the example of win/loss prediction, recall resembles what proportion of the total wins were predicted correctly.
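Both measures computed on made-up win/loss predictions (1 = win, 0 = loss; the helper is an illustrative sketch):

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for a two-class problem (sketch)."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp)   # how often a predicted win is a real win
    recall = tp / (tp + fn)      # what share of the real wins were found
    return precision, recall

y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1]
p, r = precision_recall(y_true, y_pred)   # 3 TP, 1 FP, 1 FN
```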
• The ROC curve plots the true positive rate (TPR) against the false positive rate (FPR) at various classification thresholds.
• The area under curve (AUC) value is the area of the two-dimensional space under the ROC curve from (0, 0) to (1, 1), where each point on the curve gives a pair of TPR and FPR values at a specific classification threshold.
• AUC value ranges from 0 to 1, with an AUC of less than 0.5 indicating that the model performs worse than random guessing.
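AUC can equivalently be computed as the probability that a randomly chosen positive is scored above a randomly chosen negative; a minimal sketch on made-up scores:

```python
def auc_score(y_true, scores):
    """AUC = probability a random positive is ranked above a random negative."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [1, 1, 0, 0]
perfect = auc_score(y_true, [0.9, 0.8, 0.3, 0.2])   # every positive above every negative
inverted = auc_score(y_true, [0.2, 0.3, 0.8, 0.9])  # worse than random guessing
```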
• Model parameter tuning is the process of adjusting a model's fitting options. E.g. in KNN, using different values of K, the model can be tuned.
• Ensemble: Several models with diverse strengths are combined together. One model may learn one type of data set well while struggling with another type of data set. Another model may perform well with the data set which the first one struggled with.
• Ensembling helps in averaging out the biases of the different models and also in reducing the variance, yielding a stronger combined model.
• In a heterogeneous ensemble, the models combined are quite varied, e.g. SVM, neural network, KNN, etc.
• The outputs from the different models are combined using a combiner such as majority voting (for classification) or averaging (for regression).
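A combiner for class predictions can be as simple as majority voting; the three prediction lists below are hypothetical outputs of an SVM, a neural network, and a KNN model:

```python
from collections import Counter

def majority_vote(predictions):
    """Combine class predictions from several models by majority voting (sketch)."""
    combined = []
    for votes in zip(*predictions):
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined

# hypothetical outputs of three heterogeneous models on four test instances
svm_pred = ['win', 'loss', 'win', 'win']
nn_pred  = ['win', 'win',  'win', 'loss']
knn_pred = ['loss', 'loss', 'win', 'win']
majority_vote([svm_pred, nn_pred, knn_pred])   # -> ['win', 'loss', 'win', 'win']
```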
• Bagging (bootstrap aggregating) uses bootstrap sampling to generate multiple training data sets, which are used to train a set of models using the same learning algorithm. The outcomes are combined by majority voting (classification) or by averaging (regression).
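A bagging sketch on made-up data: each bootstrap sample "trains" the same weak learner (here a trivial 1-nearest-neighbour rule, an assumption for brevity), and the predictions are combined by majority vote:

```python
import random
from collections import Counter

def bagging_predict(X_train, y_train, x, n_models=11, seed=0):
    """Bagging sketch: one weak learner (1-NN) per bootstrap sample,
    combined by majority voting."""
    rng = random.Random(seed)
    votes = []
    for _ in range(n_models):
        # bootstrap sample: n indices drawn with replacement
        sample = [rng.randrange(len(X_train)) for _ in range(len(X_train))]
        # "train" = store the sample; predict with 1-NN restricted to it
        best = min(sample,
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(X_train[i], x)))
        votes.append(y_train[best])
    return Counter(votes).most_common(1)[0][0]

X_train = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
y_train = ['low', 'low', 'low', 'high', 'high', 'high']
bagging_predict(X_train, y_train, (2, 2))
```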
• Initialize: give every training example an equal weight
• Loop:
– Apply the weak learner to the weighted training examples (instead of the orig. training set, draw bootstrap samples with weighted probability)
– Increase the weights of misclassified examples
• Final prediction: weighted majority voting over the trained classifiers
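The loop above can be sketched as a minimal AdaBoost on 1-D data with threshold stumps as the weak learners (the data and all helper names are illustrative assumptions; labels are +1/-1):

```python
import math

def adaboost(X, y, n_rounds=5):
    """AdaBoost sketch: reweight examples each round, keep (alpha, stump) pairs."""
    n = len(X)
    w = [1.0 / n] * n                       # uniform initial weights
    ensemble = []                           # (alpha, threshold, direction)
    for _ in range(n_rounds):
        # weak learner: the threshold stump with the lowest weighted error
        best = None
        for thr in X:
            for direction in (+1, -1):
                pred = [direction if x >= thr else -direction for x in X]
                err = sum(wi for wi, p, t in zip(w, pred, y) if p != t)
                if best is None or err < best[0]:
                    best = (err, thr, direction, pred)
        err, thr, direction, pred = best
        alpha = 0.5 * math.log((1 - err) / max(err, 1e-10))
        ensemble.append((alpha, thr, direction))
        # increase the weight of misclassified examples, then renormalize
        w = [wi * math.exp(-alpha * p * t) for wi, p, t in zip(w, pred, y)]
        s = sum(w)
        w = [wi / s for wi in w]
    return ensemble

def predict(ensemble, x):
    """Weighted majority vote of the trained stumps."""
    score = sum(a * (d if x >= thr else -d) for a, thr, d in ensemble)
    return 1 if score >= 0 else -1

X = [1, 2, 3, 6, 7, 8]
y = [-1, -1, -1, 1, 1, 1]
model = adaboost(X, y)
```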