Preparing to Model
Overview
✔ Machine Learning activities
✔ Types of data in Machine Learning
✔ Structures of data
✔ Descriptive statistics
✔ Data visualization
✔ Pre-Processing
Machine Learning Activities / Detailed Process of ML
• Step-1: Preparing to Model
✔ A basic count is possible for nominal data, so the mode (the most
frequently occurring value) can be identified.
✔ Ordinal Data: In addition to possessing the properties of nominal data,
ordinal values can also be arranged in a natural order.
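The difference between the two data types can be sketched in a few lines of Python. The attribute names and values below are hypothetical and serve only as an illustration: the nominal attribute supports counting (and therefore the mode), while the ordinal attribute can also be ranked, which makes a median meaningful.

```python
from statistics import mode

# Hypothetical nominal attribute: categories with no inherent order.
blood_groups = ["A", "O", "B", "O", "AB", "O", "A"]
most_common = mode(blood_groups)  # counting is the only valid operation

# Hypothetical ordinal attribute: categories with a natural order.
satisfaction_order = ["unhappy", "neutral", "happy"]
rank = {label: i for i, label in enumerate(satisfaction_order)}

responses = ["happy", "unhappy", "neutral", "happy", "neutral"]
sorted_responses = sorted(responses, key=rank.__getitem__)

# Because ordinal values can be ranked, the middle value is well defined.
median_response = sorted_responses[len(sorted_responses) // 2]

print(most_common, median_response)
```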
A measure of dispersion, such as variance or standard deviation, shows
that attribute 1 values are quite concentrated around the mean while
attribute 2 values are extremely spread out.
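As a rough illustration of the contrast between a concentrated and a spread-out attribute, the sketch below compares two attributes that share the same mean but differ in standard deviation. The two value lists are illustrative assumptions, not data taken from these slides.

```python
from statistics import mean, pstdev

# Illustrative values only: both attributes have the same mean.
attribute_1 = [44, 46, 48, 45, 47]
attribute_2 = [34, 46, 59, 39, 52]

for name, values in (("attribute 1", attribute_1), ("attribute 2", attribute_2)):
    # pstdev is the population standard deviation (square root of variance).
    print(f"{name}: mean={mean(values)}, std dev={pstdev(values):.2f}")
```

The means are identical, so the mean alone cannot distinguish the two attributes; the standard deviation does.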
Measuring data value position
✔ Median: Gives the central data value
(Divides the entire data set into two halves)
✔ Quartiles: If the first half of the data is itself divided into two
halves, so that each half consists of one-quarter of the data set,
the median of that first half is known as the first quartile or Q1.
(Likewise, the median of the second half is the third quartile or Q3.)
• Variants of Quantile:
• Quartile: Divides data set into four parts.
• Percentile: Divides the data set into 100 parts.
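A minimal sketch of both variants, using Python's standard `statistics.quantiles` function on a hypothetical data set. Note that this function interpolates between data points, so its cut points can differ slightly from the median-of-halves procedure described above; several quantile conventions exist.

```python
from statistics import median, quantiles

data = list(range(1, 21))  # hypothetical data set: 1, 2, ..., 20

# n=4 yields the 3 cut points (Q1, Q2, Q3) that split the data into 4 parts.
quartiles = quantiles(data, n=4)

# n=100 yields the 99 cut points that split the data into 100 parts.
percentiles = quantiles(data, n=100)

print("quartiles:", quartiles)
print("number of percentile cut points:", len(percentiles))
```

The second quartile coincides with the median, which is why the median is often written as Q2.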
We still cannot be sure whether any outlier is present in the data.
(Solution: visualization methods such as Box Plot and Histogram)
Plotting and exploring numerical data
Box Plots (also called box and whisker plot)
• Box plot is an extremely effective mechanism to get a one-shot view
and understand the nature of the data.
• Gives a standard visualization of the five-number summary statistics
of the data: minimum, first quartile (Q1), median (Q2), third
quartile (Q3), and maximum.
[Figure: box plot with the two whiskers and the IQR labelled]
Detailed Interpretation of Box Plot
• The box indicates the range in which the middle 50% of all data lies:
the inter-quartile range (IQR). (Width of box = IQR = Q3 - Q1)
• The lower end of the box is the 1st quartile and the upper end is the
3rd quartile.
• The upper and lower whiskers represent scores outside the middle
50%.
• The lower whisker extends up to 1.5 times the inter-quartile range
(or IQR) below the bottom of the box; likewise, the upper whisker
extends up to 1.5 times the IQR above the top of the box.
• Data values beyond the lower/upper whiskers are unusually low or high
values respectively. These are the outliers.
Exercise:
Observations:
100,120,110,150,110,140,130,170,120,220,140,110
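One way to work through the exercise is sketched below. It computes the five-number summary with the median-of-halves procedure described earlier, then applies the 1.5 x IQR whisker rule to flag outliers. The function name `five_number_summary` is a hypothetical helper, and leaving the median out of both halves when the count is odd is an assumed convention; other quartile conventions give slightly different values.

```python
from statistics import median

def five_number_summary(values):
    """Five-number summary via the median-of-halves method (needs >= 2 values)."""
    data = sorted(values)
    n = len(data)
    half = n // 2
    lower = data[:half]
    # Assumed convention: for an odd count, the median belongs to neither half.
    upper = data[half:] if n % 2 == 0 else data[half + 1:]
    return {
        "min": data[0],
        "q1": median(lower),
        "median": median(data),
        "q3": median(upper),
        "max": data[-1],
    }

observations = [100, 120, 110, 150, 110, 140, 130, 170, 120, 220, 140, 110]
summary = five_number_summary(observations)

iqr = summary["q3"] - summary["q1"]
lower_fence = summary["q1"] - 1.5 * iqr  # farthest reach of the lower whisker
upper_fence = summary["q3"] + 1.5 * iqr  # farthest reach of the upper whisker
outliers = [x for x in observations if x < lower_fence or x > upper_fence]

print(summary)
print("IQR:", iqr, "fences:", (lower_fence, upper_fence), "outliers:", outliers)
```

Under these assumptions Q1 = 110, the median is 125, Q3 = 145 and the IQR is 35, so the whiskers can reach from 57.5 to 197.5 and the only outlier is 220.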
• A histogram represents:
o Frequency of different data points in the dataset.
o Location of the center of the data.
o The spread of the dataset.
o Skewness/variance of the dataset.
o Presence of outliers in the dataset.
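The properties listed above can be seen in a simple text histogram. The sketch below bins the exercise observations into equal-width bins; the helper `histogram_counts`, the bin width of 20 and the starting edge of 100 are illustrative assumptions, and the helper expects integer values no smaller than the starting edge.

```python
from collections import Counter

def histogram_counts(values, bin_width, start):
    """Return (left_edge, count) pairs for equal-width bins beginning at start."""
    counts = Counter((v - start) // bin_width for v in values)
    n_bins = max(counts) + 1  # include empty bins up to the last occupied one
    return [(start + i * bin_width, counts.get(i, 0)) for i in range(n_bins)]

observations = [100, 120, 110, 150, 110, 140, 130, 170, 120, 220, 140, 110]
bin_width = 20
bins = histogram_counts(observations, bin_width=bin_width, start=100)

for left_edge, count in bins:
    # Each bin covers [left_edge, left_edge + bin_width).
    print(f"[{left_edge}, {left_edge + bin_width}) | {'#' * count}")
```

Most values fall in the lowest bins, the counts thin out to the right, and the single value in the last bin sits apart from the rest after two empty bins, which is how a right skew and an outlier appear in a histogram.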
Difference between Box Plot and Histogram
- If there are data points similar to the ones with missing attribute
values, then the attribute values from those similar data points
can be substituted for the missing value.
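A minimal sketch of this idea: copy the missing value from the single most similar complete record. The helper `impute_from_nearest`, the attribute names and the records are all hypothetical, and similarity is taken to be plain Euclidean distance over the attributes the incomplete record does have. In practice the attributes would usually be scaled first so that no single one dominates the distance.

```python
import math

def impute_from_nearest(records, target, missing_key):
    """Return a copy of target with missing_key filled from the nearest record."""
    # Compare only on attributes that the incomplete record actually has.
    shared = [k for k in target if k != missing_key and target[k] is not None]
    # Only records that have the attribute can donate a value.
    donors = [r for r in records if r.get(missing_key) is not None]
    nearest = min(
        donors,
        key=lambda r: math.dist([r[k] for k in shared], [target[k] for k in shared]),
    )
    return {**target, missing_key: nearest[missing_key]}

# Hypothetical complete records and one record with a missing "age".
records = [
    {"height": 170, "weight": 68, "age": 30},
    {"height": 182, "weight": 85, "age": 45},
    {"height": 160, "weight": 55, "age": 22},
]
incomplete = {"height": 171, "weight": 70, "age": None}

filled = impute_from_nearest(records, incomplete, "age")
print(filled)
```

The original record is left unchanged and a filled copy is returned, which keeps it easy to tell observed values from imputed ones.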