
Chapter 2

Preparing to Model
Overview
✔ Machine Learning activities
✔ Types of data in Machine Learning
✔ Structures of data
✔ Descriptive statistics
✔ Data visualization
✔ Pre-Processing
Machine Learning Activities / Detailed Process of ML
• Step-1: Preparing to Model

✔ Understand the type of data in given input dataset

✔ Explore the data to understand the data quality

✔ Explore the relationships amongst the data elements

✔ Find potential issues in data

✔ Remediate data if needed

✔ Data-preprocessing (if needed):


- Dimensionality reduction
- Feature subset selection
• Step-2: Learning
✔ Data partitioning
✔ Model selection
✔ Cross-validation
• Step-3: Performance Evaluation
✔ Examine the model performance
✔ Visualize performance trade-offs using ROC curves
• Step-4: Performance Improvement
✔ Tuning the model
✔ Ensembling
✔ Bagging
✔ Boosting
Types of Data

✔ Qualitative data (Categorical data)
- Nominal data
- Ordinal data
✔ Quantitative data (Numeric data)
- Interval data
- Ratio data
Qualitative data (Categorical data):
✔ Provides information about the quality of an object, i.e. information which cannot be measured numerically.

✔ Examples: - Quality of students (Good or Bad)
            - Name and roll number of students

✔ Can be subdivided as,
- Nominal data
- Ordinal data
✔ Nominal Data: Has no numeric value, but has a named value.
✔ Examples: - Blood groups (A, AB)
            - Nationalities (Indian, British, American)
            - Gender (Male, Female, Other)
✔ Mathematical operations such as addition, subtraction, and multiplication cannot be performed; consequently, statistical measures based on them (such as the mean) cannot be computed either.
✔ A basic count is possible, so the mode, i.e. the most frequently occurring value, can be identified for nominal data.
✔ Ordinal Data: In addition to possessing the properties of nominal data, its values can be naturally ordered.
✔ Examples: - Customer satisfaction: “Very Happy”, “Happy”, “Unhappy”
            - Hardness of metal: “Very hard”, “Hard”, “Soft”
✔ Basic counting is possible, so the mode can be identified.
✔ Ordering is possible, so the median and quartiles can also be identified.

Quantitative data (Numeric Data):
✔ Relates to information about the quantity of an object.
✔ Example: Consider the attribute “marks”. It can be measured on a scale.
✔ Types: - Interval data
         - Ratio data
Interval data: Not only is the value known, but the exact difference between values is also known.

✔ Examples: Date, Time

✔ Mathematical operations such as addition and subtraction are possible.

✔ Mean, median, mode, and standard deviation can be computed.

✔ However, there is no true zero point, so ratios between values are not meaningful.


Ratio data:

✔ Has the same properties as interval data, but in addition has a true zero point, so there is an equal and definitive ratio between data values (e.g. height, weight).
Structure of Data
✔ Data Structure: Basic building block of computer
programming that helps to organize, manage, and store data
for efficient search and retrieval.

✔ Types of Data: Numeric and Categorical data.

✔ The approach for exploring each type of data is different.

✔ For a standard data set, a data dictionary is available for reference.
✔ A data dictionary is a metadata repository.
- Contains all information related to the structure of each data element in the data set.
- Gives detailed information on each of the attributes.

✔ If the data dictionary is not available, use the standard library functions of the machine learning tool to explore the structure of the data set (see the pandas sketch below).

The University of California, Irvine (UCI) Machine Learning Repository is a collection of 400+ data sets which serve as benchmarks for researchers and practitioners in the machine learning community.
Auto MPG Dataset
(Prediction of fuel consumption in miles per gallon)
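
Where no data dictionary is available, a few standard library calls reveal the structure of the data set. A minimal sketch using Python's pandas, assuming a local CSV copy of the Auto MPG data (the file name auto_mpg.csv is an assumption):

    import pandas as pd

    # Load a local copy of the Auto MPG data (the file name is assumed here).
    df = pd.read_csv("auto_mpg.csv")

    # Structure of each data element: column names, types, non-null counts.
    df.info()

    # Summary statistics for the numeric attributes.
    print(df.describe())

    # First few records, for a quick visual check.
    print(df.head())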
Exploring Numerical Data
✔ Two most effective mathematical plots to explore numerical
data: Box plot and Histogram.

✔ The measures of central tendency of data are very important.

✔ Central tendencies are numerical values used to represent the mid-value or central value of a large collection of numerical data.

✔ The measures of central tendency of data: Mean, Median, Mode

Mean: Sum of all data values divided by the count of data elements.
e.g.: Mean of the observations 21, 89, 34, 67, and 96 = 61.4

Median: Value of the element appearing in the middle of an ordered list of data elements.
e.g.: Median of 21, 34, 67, 89, and 96 = 67

Mode: The value that has the highest frequency in a given set of values.
Why are Mean and Median important for measuring central tendency?

• If the mean = median, the data is symmetrically distributed about the mean.

• The greater the difference between these measures, the more asymmetrical (skewed) the data.
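
As a quick check of these measures, a minimal sketch using Python's built-in statistics module on the sample observations above:

    import statistics

    values = [21, 89, 34, 67, 96]

    print(statistics.mean(values))        # 61.4
    print(statistics.median(values))      # 67 (middle of the sorted list)
    print(statistics.mode([1, 2, 2, 3]))  # 2 (most frequent value)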
Understanding Data Spread
(Drilling down at granular level)

✔ Granular view of the data spread in the form of,


• Dispersion of data
• Position of the different data values

✔ Consider the data values of two attributes,
- Attribute 1: 44, 45, 46, 47, and 48 (Mean and Median = 46)
- Attribute 2: 34, 39, 46, 52, and 59 (Mean and Median = 46)
The Attribute 1 values are more concentrated or clustered around the mean/median value, whereas the Attribute 2 values are quite spread out or dispersed.
✔ Variance is used to measure the extent of dispersion of the data, or to find out how much the different values of the data are spread out.

Variance(x) = (1/n) * Σ (xᵢ - x̄)²

where
x = variable or attribute whose variance is to be measured,
x̄ = mean of the values of x,
n = number of observations or values of variable x.
Standard Deviation

Standard deviation is the square root of the variance. A larger value of variance or standard deviation indicates more dispersion in the data, and vice versa.
Calculate the variance for the example below,
Attribute 1, values: 44, 45, 46, 47, and 48
Attribute 2, values: 34, 39, 46, 52, and 59

Answers: Variance of Attribute 1 = 2
         Variance of Attribute 2 = 79.6

This measure shows that the Attribute 1 values are quite concentrated around the mean while the Attribute 2 values are extremely spread out.
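
These answers can be verified with Python's statistics module; pvariance computes the population variance, dividing by n as in the formula above:

    import statistics

    attr1 = [44, 45, 46, 47, 48]
    attr2 = [34, 39, 46, 52, 59]

    print(statistics.pvariance(attr1))  # 2
    print(statistics.pvariance(attr2))  # 79.6
    print(statistics.pstdev(attr2))     # ~8.92, square root of the variance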
Measuring data value position
✔ Median: Gives the central data value
(Divides the entire data set into two halves)
✔ If the first half of the data (below the median) is divided again into two halves, so that each half consists of one quarter of the data set, the median of that first half is known as the first quartile, or Q1. (The same applies to the second half, giving the third quartile, Q3.)

Any data set has five summary values: minimum, first quartile (Q1), median (Q2), third quartile (Q3), and maximum.
• Quantiles: Specific points in a data set which divide the data
set into equal parts or equally sized quantities.

• Variants of Quantile:
• Quartile: Divides data set into four parts.
• Percentile: Divides the data set into 100 parts.

We still cannot be sure whether any outlier is present in the data.
(Solution: visualization methods such as the Box Plot and Histogram.)
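
A minimal sketch of the five-number summary using NumPy's quantile function. Note that several quartile conventions exist; NumPy's default linear interpolation may differ slightly from hand-computed Tukey quartiles:

    import numpy as np

    values = np.array([34, 39, 46, 52, 59])

    # Five-number summary: minimum, Q1, median (Q2), Q3, maximum.
    summary = np.quantile(values, [0.0, 0.25, 0.5, 0.75, 1.0])
    print(summary)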
Plotting and exploring numerical data
Box Plots (also called box and whisker plots)
• A box plot is an extremely effective mechanism to get a one-shot view of the data and understand its nature.
• It gives a standard visualization of the five-number summary statistics of the data: minimum, first quartile (Q1), median (Q2), third quartile (Q3), and maximum.
[Figure: a box plot, with the box spanning the IQR and whiskers extending above and below]
Detailed Interpretation of Box Plot

• The box indicates the range in which the middle 50% of all data lies: the inter-quartile range (IQR). (Width of the box = IQR = Q3 - Q1)

• Median is represented by the line or band within the box.

• The lower end of the box is the 1st quartile (Q1) and the upper end is the 3rd quartile (Q3).

• The upper and lower whiskers represent scores outside the middle
50%.
• The lower whisker extends up to 1.5 times the inter-quartile range (IQR) below the bottom of the box.
• The upper whisker extends up to 1.5 times the inter-quartile range (IQR) above the top of the box.

Lower Limit = Q1 - 1.5*IQR
Upper Limit = Q3 + 1.5*IQR

• The data values falling beyond the lower/upper whiskers are the ones with unusually low or high values respectively. These are the outliers.
Exercise:
Observations:
100,120,110,150,110,140,130,170,120,220,140,110

Draw the Box Plot for the given data.
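
A minimal sketch of this exercise with matplotlib, which draws a Tukey-style box plot and applies the 1.5*IQR whisker rule automatically:

    import matplotlib.pyplot as plt

    data = [100, 120, 110, 150, 110, 140, 130, 170, 120, 220, 140, 110]

    # matplotlib's boxplot uses the 1.5*IQR whisker rule by default, so
    # unusually high values (here, 220) appear as individual outlier points.
    plt.boxplot(data)
    plt.title("Box plot of the exercise data")
    plt.show()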


Uses of Box Plot:

✔ Box plots provide a visual summary of the data.
✔ The median gives the central value of the data.
✔ Box plots show the skewness of the data.
✔ Dispersion or spread of the data can be visualized via the minimum and maximum.
✔ Gives us an idea about the outliers.
FYI: There are different variants of box plots.
The one we learnt is the Tukey box plot. Famous
mathematician John W. Tukey introduced this
type of box plot (1969).
Histogram
• Helps in understanding the distribution of numeric data over a series of intervals called ‘bins’.
• Takes different shapes depending on the nature of the data, e.g. its skewness.
• The height of each bar reflects the total count of data elements whose value falls within the specific bin, i.e. the frequency.

• A Histogram represents,
o Frequency of different data points in the dataset.
o Location of the center of data.
o The spread of dataset.
o Skewness/variance of dataset.
o Presence of outliers in the dataset.
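
A minimal matplotlib sketch using the same exercise data as before; the bins parameter controls how the value range is split into intervals:

    import matplotlib.pyplot as plt

    data = [100, 120, 110, 150, 110, 140, 130, 170, 120, 220, 140, 110]

    # The height of each bar is the count of values falling in that bin.
    plt.hist(data, bins=6)
    plt.xlabel("Value")
    plt.ylabel("Frequency")
    plt.title("Histogram of the exercise data")
    plt.show()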
Difference b/w Box Plot and Histogram

• A box plot is a data display that draws a box over a number line to show the interquartile range of the data. The ‘whiskers’ of a box plot show the least and greatest values in the data set.

• A histogram is a special kind of bar graph that shows a bar for a range of data values instead of a single value.
Exploring Categorical Data
✔ Explore how many unique values there are, and the proportion (or percentage) of the count of data elements for each.

Car names:
1. Chevrolet Chevelle malibu
2. Ford torino
3. Pontiac Catalina
4. Chevrolet impala
5. Ford galaxie 500
6. Amc rebel sst
7. Amc ambassador dpl
8. Chevrolet impala catalina
9. Ford torino
10. Pontiac catalina

Counts by maker:
Car name    Count
Chevrolet   3
Ford        3
Amc         2
Pontiac     2
✔ The statistical measure “mode” is applicable to categorical attributes.

✔ A frequency distribution of an attribute having a single mode is called ‘unimodal’; one with two modes is called ‘bimodal’; and one with multiple modes is called ‘multimodal’.
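
A sketch of this kind of exploration with pandas, assuming a column of manufacturer names extracted from the car names above:

    import pandas as pd

    makers = pd.Series(["Chevrolet", "Ford", "Pontiac", "Chevrolet", "Ford",
                        "Amc", "Amc", "Chevrolet", "Ford", "Pontiac"])

    # Unique values and their counts.
    print(makers.value_counts())

    # Proportion (percentage) of each value.
    print(makers.value_counts(normalize=True) * 100)

    # Mode: the most frequent value(s); here Chevrolet and Ford (bimodal).
    print(makers.mode())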
Scatter Plot
• Helps in visualizing bivariate relationships.
(i.e. relationship b/w two variables.)
• It is a two-dimensional plot in which points or dots are drawn on
coordinates.
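
A minimal matplotlib sketch, assuming two numeric attributes such as horsepower and mpg from the Auto MPG data; the values below are made up for illustration:

    import matplotlib.pyplot as plt

    # Hypothetical (horsepower, mpg) pairs, for illustration only.
    horsepower = [130, 165, 150, 95, 88, 70]
    mpg = [18, 15, 16, 25, 27, 32]

    # Each dot is one observation plotted against both attributes.
    plt.scatter(horsepower, mpg)
    plt.xlabel("horsepower")
    plt.ylabel("mpg")
    plt.show()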
Two-way Cross-tabulations
(also called cross-tab or contingency table)
• Gives the relationship of two categorical attributes in a concise
way.
• Matrix format that presents a summarized view of the bivariate
frequency distribution.
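
A sketch using pandas' crosstab function on two hypothetical categorical attributes; each cell counts how often the row and column values occur together:

    import pandas as pd

    df = pd.DataFrame({
        "maker":     ["Ford", "Ford", "Chevrolet", "Chevrolet", "Amc", "Ford"],
        "cylinders": [8, 6, 8, 8, 6, 4],
    })

    # Matrix view of the bivariate frequency distribution.
    print(pd.crosstab(df["maker"], df["cylinders"]))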
Data Quality
✔ Success of machine learning depends largely on the quality
of data.
✔ Right quality data helps to achieve better prediction
accuracy.

Two common types of problems in data:
o Some data elements without a value, or data with missing values.
o Data elements having values surprisingly different from those of the other elements (outliers).
Factors causing the Data Quality Issues
1. Incorrect sample set selection:
• The data may not reflect the normal or regular scenario due to incorrect selection of the sample set.

• Example: Data of sales during the festive season is used to predict the sales during the off season.
2. Errors in Data Collection:
• Resulting in outliers and missing values.

• In the process of manually collecting data, the possibility of wrongly recording data, either in terms of value or unit, is high.
Examples: - 20.67 is wrongly recorded as 206.7 or 2.067
          - cm is wrongly recorded as m or mm

• This results in data elements which have abnormally high or low values compared with other elements (termed outliers).
• Also, it is possible that the data is not recorded at all.
Example: In a survey conducted to collect data, some respondents may leave questions unanswered, resulting in missing values.
Data Remediation
• To achieve good efficiency, the issues in data quality need to be remediated.

• Handling the Outliers:


✔ Outliers are data elements with abnormally high or low values which may impact prediction accuracy.
✔ Once they are identified, a decision has to be taken on whether to amend those values.
✔ If the outliers are natural, i.e. the value of the data element is surprisingly high/low because of a valid reason, then we should not amend them.
• Remove outliers: If the number of records which are outliers is small, simply remove them.

• Imputation: Another way is to impute the value with the mean, median, or mode.
- The value of the most similar data element may also be used for imputation.

• Capping: Capping means setting a limit for the feature and setting the value of all outliers exceeding the limit to the value of the limit.

If a dataset contains significant outliers, they should be treated separately in the statistical model. In such a case, the data should be treated as two different groups, a model should be built for each group, and then the outputs can be combined.
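
A sketch of these three remediation options in pandas, assuming a numeric column x with one obvious outlier; the limits reuse the 1.5*IQR rule from the box plot discussion:

    import pandas as pd

    x = pd.Series([44, 45, 46, 47, 48, 120])  # 120 is an obvious outlier

    q1, q3 = x.quantile(0.25), x.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    inliers = (x >= lower) & (x <= upper)

    # Option 1 - remove: drop records falling outside the limits.
    removed = x[inliers]

    # Option 2 - impute: replace outliers with the median of the inliers.
    imputed = x.where(inliers, removed.median())

    # Option 3 - cap: clip outliers to the limit values.
    capped = x.clip(lower=lower, upper=upper)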
Handling missing values
• Eliminate records having a missing value of data elements:
- If missing values are within tolerable limits, remove the records having such data elements.
- Example: In the case of the Auto MPG data set, in only 6 out of 398 records is the value of the attribute ‘horsepower’ missing.
- Records cannot be removed if a large portion of the data has missing values.


• Imputing missing values:
- Imputation is a method to assign a value to the data elements having missing values.
- The mean/median/mode is the most frequently assigned value.
- If the data is quantitative, all missing values are imputed with the mean, median, or mode of the remaining values under the same attribute.
- If the data is qualitative, all missing values are imputed with the mode of all remaining values of the same attribute.
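
A pandas sketch of simple imputation, assuming a quantitative column 'horsepower' and a qualitative column 'origin' with missing entries:

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({
        "horsepower": [130.0, 165.0, np.nan, 95.0],   # quantitative
        "origin":     ["USA", None, "USA", "Japan"],  # qualitative
    })

    # Quantitative: impute with the mean (median or mode work the same way).
    df["horsepower"] = df["horsepower"].fillna(df["horsepower"].mean())

    # Qualitative: impute with the mode of the remaining values.
    df["origin"] = df["origin"].fillna(df["origin"].mode()[0])

    print(df)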
• Estimate missing values:
- If there are data points similar to the ones with missing attribute values, then the attribute values from those similar data points can be planted in place of the missing value.
- For finding similar data points or observations, a distance function can be used.
- Example: Assume that the weight of a Russian student of age 12 years and height 5 ft. is missing. Then the weight of any other Russian student having age close to 12 years and height close to 5 ft. can be assigned.
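
scikit-learn's KNNImputer implements this idea: it fills a missing value from the nearest observations as measured by a distance function. A minimal sketch with hypothetical (age, height, weight) rows:

    import numpy as np
    from sklearn.impute import KNNImputer

    # Hypothetical rows: [age (years), height (ft), weight (kg)];
    # the weight in the first row is missing.
    X = np.array([
        [12.0, 5.0, np.nan],
        [12.0, 5.1, 41.0],
        [13.0, 5.0, 43.0],
        [16.0, 5.9, 60.0],
    ])

    # Fill the gap from the 2 nearest neighbours (Euclidean distance),
    # i.e. the students closest in age and height.
    imputer = KNNImputer(n_neighbors=2)
    print(imputer.fit_transform(X))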
Pre-Processing
Dimensionality reduction
✔ High-dimensional data sets need a high amount of computational
space and time.

✔ Features which are not useful may degrade the performance of machine learning algorithms.

✔ Most machine learning algorithms perform better if the dimensionality of the data set (i.e. the number of features in the data set) is reduced.
✔ Dimensionality reduction helps in reducing irrelevance and
redundancy in features.

✔ It is easier to understand a model if the number of features involved in the learning activity is small.

✔ Dimensionality reduction: techniques for reducing the dimensionality of a data set by creating new attributes that combine the original attributes.
✔ Widely used method: Principal Component Analysis (PCA)

✔ PCA is a statistical technique to convert a set of correlated variables into a set of transformed, uncorrelated variables called principal components.

✔ The principal components are linear combinations of the original variables.

✔ The principal components are uncorrelated, and the leading components capture the maximum amount of variability in the data.
• Challenge: The original attributes are lost due to the transformation.

• Another commonly used technique: Singular Value Decomposition (SVD).
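
A minimal scikit-learn sketch of PCA, reducing hypothetical 4-feature data to 2 principal components:

    import numpy as np
    from sklearn.decomposition import PCA

    # Hypothetical data: 6 observations, 4 correlated features.
    rng = np.random.default_rng(0)
    base = rng.normal(size=(6, 2))
    X = np.hstack([base, base + 0.1 * rng.normal(size=(6, 2))])

    # Keep the 2 components that capture the most variability.
    pca = PCA(n_components=2)
    X_reduced = pca.fit_transform(X)

    print(X_reduced.shape)                # (6, 2)
    print(pca.explained_variance_ratio_)  # variability captured per component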
Feature Subset Selection (Or Feature Selection)
✔ Tries to find the optimal subset of the entire feature set which significantly reduces computational cost without any major impact on learning accuracy.

✔ A feature is considered,
o irrelevant if it plays an insignificant role in classifying or grouping together a set of data instances;
o potentially redundant when the information contributed by the feature is more or less the same as that of one or more other features.
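
A scikit-learn sketch of a simple filter approach to feature subset selection, scoring each feature against the target and keeping the k best (the Iris data set is used here purely for illustration):

    from sklearn.datasets import load_iris
    from sklearn.feature_selection import SelectKBest, f_classif

    X, y = load_iris(return_X_y=True)  # 150 instances, 4 features

    # Score each feature against the class labels and keep the best 2,
    # discarding the less relevant ones.
    selector = SelectKBest(score_func=f_classif, k=2)
    X_selected = selector.fit_transform(X, y)

    print(X_selected.shape)        # (150, 2)
    print(selector.get_support())  # boolean mask of the selected features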
