STML

Statistics for Machine Learning
Content
Purposes of Statistics
Descriptive Statistics
Shape of data
Numerical Measures of Central Tendency
Numerical Measures of Variability
Notion of Normal Distribution
Covariance and Correlation
Purposes of Statistics
Descriptive Statistics
▶ Organize
▶ Summarize
▶ Simplify
▶ Presentation of data
Inferential Statistics
▶ Generalize from samples to populations
▶ Hypothesis testing
Predictive Analysis
▶ Relationships among variables
In short:
▶ Descriptive statistics describes sample data.
▶ Inferential statistics and predictive analysis make predictions about the population.
Data we normally encounter...
Classification of Data
Data can be broadly classified into two types:
i. Categorical
Categorical variables take category or label values and place an individual into one of several groups. Each observation can be placed in only one category, and the categories are mutually exclusive. For example, smoking status is a categorical variable with two groups: nonsmoker and smoker.
ii. Quantitative
Quantitative variables take numerical values and represent some kind of measurement. Age is an example of a quantitative variable because it can take on multiple numerical values. Weight and height are also examples of quantitative variables.
Types of Data
▶ Nominal Data: Nominal data is a categorical data type. It describes qualitative characteristics or groups, with no order or rank between categories.
Examples of nominal data include:
▶ Gender, ethnicity, eye colour, blood type
▶ Brand of refrigerator/motor vehicle/television owned
▶ Ordinal Data: It is similar to nominal data in terms of categories, but it has a meaningful order or rank between the options.
Some examples of ordinal data:
▶ Income level (e.g. low income, middle income, high income)
▶ Level of agreement (e.g. strongly disagree, disagree, neutral, agree, strongly agree)
▶ Interval Data: The interval level is a numerical level of measurement which, like the ordinal scale, places variables in order. Unlike the ordinal scale, however, the interval scale has a known and equal distance between each value on the scale.
Some examples:
▶ Temperature in degrees Fahrenheit or Celsius
▶ IQ score
▶ Income categorized
▶ Ratio Data: Like interval data, it is ordered/ranked and the numerical distance between points is consistent. What makes it different from interval data is that a measurement of zero means that there is nothing of that variable.
Some examples:
▶ Weight in grams (continuous)
▶ Number of employees at a company (discrete)
▶ Speed in miles per hour (continuous)
Stevens classification categorizes data according to four basic properties:
1. Description
At this level of measurement all we can do is name or label things. We cannot perform any arithmetic with nominal-level data. All we can do is count the frequencies with which the things occur.
2. Order
This scale enables us to order the items of interest using ordinal numbers. Ordinal numbers denote an item’s position or rank in a sequence: first, second, third, and so on.
3. Distance
The interval level has an inherent order, and in addition the distance between values on the scale is known and equal.
4. Origin
The addition of a non-arbitrary zero allows us to calculate the numerical relationship between values using ratios.
For example: a person who weighs 150 pounds weighs twice as much as a person who weighs only 75 pounds and half as much as a person who weighs 300 pounds. We can calculate ratios like these because the scale for weight in pounds starts at zero pounds.
These are also referred to as the primary scales of measurement.
Different types of Data
Primary scales of measurement: Ratio Data
This scale has description, order, distance, and origin.
Primary scales of measurement: Interval Data
This scale has description, order, and distance, but no origin.
Primary scales of measurement: Ordinal Data
This scale has description and order, but no distance and no origin.
Primary scales of measurement: Nominal Data
This scale has description, but no order, no distance, and no origin.
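The four scales and their properties can be written down as a small lookup table. The sketch below is only an illustration; the names `SCALE_PROPERTIES`, `has_property` and `ratios_are_meaningful` are hypothetical and do not come from the slides.

```python
# Hypothetical lookup table: which of the four Stevens properties each scale has.
SCALE_PROPERTIES = {
    "nominal":  {"description"},
    "ordinal":  {"description", "order"},
    "interval": {"description", "order", "distance"},
    "ratio":    {"description", "order", "distance", "origin"},
}

def has_property(scale: str, prop: str) -> bool:
    """Return True if the given scale of measurement has the given property."""
    return prop in SCALE_PROPERTIES[scale]

def ratios_are_meaningful(scale: str) -> bool:
    """Ratios such as 'twice as heavy' need a non-arbitrary zero (origin)."""
    return has_property(scale, "origin")

print(ratios_are_meaningful("ratio"))     # e.g. weight in pounds
print(ratios_are_meaningful("interval"))  # e.g. temperature in Celsius
```

Each scale simply adds one property to the scale below it, which is why a set per scale is enough here.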
To begin, not all data are of the same type
Descriptive Statistics
Statistics: A Single variable
Three important defining characteristics of any set of data for a given variable are:
▶ Shape of Data (also called Distribution)
▶ Measure of Location (also called Central Tendency)
▶ Spread of data (also called Variability)
Example
▶ A mobile phone maker claims that the battery in their brand of mobile phone will last 48 hours under normal usage conditions.
▶ The QC department at their manufacturing facility tests this claim routinely using randomly selected mobile phones.
▶ The data for the just-concluded tests on a certain day has just arrived...
It is unrealistic to expect that each battery would last exactly 48 hours. There is bound to be variability in the life of the battery. This variability can be due to variations in the quality of input components, manufacturing conditions, workers involved, or just unexplainable random fluctuations.
So, the claim that we need to verify is:
▶ On average, a mobile phone battery (made by this company) will last for 48 hours.
▶ To verify this claim, it will be impossible to test the entire output. (Population)
▶ We will need to devise a method of sampling to obtain a sample of phones and test them.
▶ We need to understand what the sample data indicates. (Descriptive Statistics)
▶ We need to draw a conclusion about the population battery life. (Inferential Statistics)
Data
Thirty batteries were tested, and we have the number of hours each battery lasted.
The above variables are quantitative.
Let's say the 30 batteries were made in three different plants (A, B, C), and we had the data related to that.
These variables are qualitative.
Summarizing Quantitative Data - Ordering
We can now quickly read: Max = 68, Min = 27, and some concentration of values around 50.
Summarizing Quantitative Data: Ungrouped Frequency Distributions
You can see the concentration of observation values between 48 and 54 more clearly.
Summarizing Quantitative Data: Ungrouped Frequency Distributions - Graphical
Summarizing Quantitative Data: Grouped Frequency Distributions – Grouping 1
You can see that 25 of 30 observations have a value between 42 and 59.
Of course, we have lost the information on how many observations had a value of 46, 50, etc.
A grouped frequency distribution is not unique. There are many ways in which the groupings can be made.
The right number of groups and the class intervals are subjective and depend on the data.
Anything over 10 groups becomes difficult to read and comprehend.
It is not advisable to vary the class intervals (e.g. Hrs: 63 – 68, 55 – 62, 50 – 54).
Summarizing Quantitative Data: Grouped Frequency Distributions – Open Ended Groups
Open-ended groups bring focus to data points that need greater attention.
But they are not amenable to certain mathematical computations (for example, the average).
Histograms
▶ Divide the range of values in the sample set into small intervals and count how many observations fall within each interval.
▶ For each interval, plot a rectangle with width equal to the interval size and height equal to the number of observations in the interval.
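The two histogram steps above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the `hours` list is hypothetical (it is not the 30-battery data set from the slides), `histogram_counts` is a made-up helper name, and the values are assumed to be integers.

```python
from collections import Counter

def histogram_counts(values, start, width):
    """Count observations per interval [start + k*width, start + (k+1)*width)."""
    counts = Counter((v - start) // width for v in values)
    return {(start + k * width, start + (k + 1) * width): counts[k]
            for k in sorted(counts)}

# Hypothetical battery lives in hours (NOT the data set from the slides).
hours = [27, 33, 42, 45, 46, 48, 50, 50, 50, 52, 54, 59, 68]
bins = histogram_counts(hours, start=20, width=10)
for (low, high), n in bins.items():
    print(f"{low:>3}-{high - 1:<3} {'#' * n}")  # bar length = number of observations
```

Intervals that contain no observations are simply omitted by this sketch; a plotting library would normally draw them as empty bars.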
Summarizing Quantitative Data: Relative Frequency Distributions – Proportions/Percentages
Summarizing Quantitative Data: Cumulative Frequency Distributions
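Since the frequency tables on these slides were figures, here is a small sketch of how ungrouped, relative and cumulative frequencies relate to each other. The `hours` sample is hypothetical and is not the data from the slides.

```python
from collections import Counter
from itertools import accumulate

hours = [46, 48, 48, 50, 50, 50, 52, 54, 54, 59]  # hypothetical sample

freq = Counter(hours)                        # ungrouped frequency distribution
values = sorted(freq)
n = len(hours)
relative = {v: freq[v] / n for v in values}  # proportions of the sample
cumulative = dict(zip(values, accumulate(freq[v] for v in values)))

for v in values:
    print(v, freq[v], f"{relative[v]:.0%}", cumulative[v])
```

The last cumulative frequency always equals the sample size, which is a quick sanity check on any such table.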
Summarizing Quantitative Data: Basic Statistical Measures - Shape of data
Basic Statistical Measures: Shape of data - Examples
Basic Statistical measures: Measures of location
▶ The frequency distributions (tables and graphs) help us understand the data by summarizing them.
▶ Measures of location and dispersion simplify it further by providing single-number representations of the data.
▶ The central tendency is a score value on which a distribution tends to center.
▶ The most common measure is the average, and it signifies the typical, usual, representative value of the data.
The 3 M’s (Mean, Median, Mode)
Summarizing Quantitative Data: Central Tendency – Mode (and proportion)
The mode is the data value that occurs most frequently in the data set.
Mode: 50 Hours (3/30 × 100 = 10%)
Mode
Advantages
i. Only sensible measure for qualitative data
ii. More appropriate for quantitative data which are inherently discrete
Disadvantages
i. There may not be a single mode (multimodal data)
ii. It does not use all the data available
iii. Poor sampling stability - large variations across samples
iv. Not very mathematically tractable
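A short sketch of the mode and its proportion, including the qualitative and multimodal cases mentioned above. Both samples are hypothetical, and `mode_with_share` is an illustrative helper name, not something from the slides.

```python
from collections import Counter
from statistics import multimode

hours = [46, 48, 48, 50, 50, 50, 52, 54, 54, 59]  # hypothetical sample
plants = ["A", "B", "A", "C", "B", "A"]            # hypothetical qualitative data

def mode_with_share(data):
    """Return (modes, percentage of observations taking a modal value)."""
    modes = multimode(data)                # all most-frequent values
    share = 100 * Counter(data)[modes[0]] / len(data)
    return modes, share

print(mode_with_share(hours))    # quantitative, discrete
print(mode_with_share(plants))   # qualitative: the mode still works
print(multimode([1, 1, 2, 2]))   # multimodal data: no single mode
```

`statistics.multimode` returns every most-frequent value, which makes the "no single mode" disadvantage visible instead of silently picking one.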
Summarizing Quantitative Data: Central Tendency – Mean
The mean is the arithmetic average of all the data values in the data set.
Mean: 49.67 Hours
The formula is given by
\[ \bar{x} = \frac{x_1 + x_2 + \cdots + x_N}{N}. \]
Mean
Advantages
i. Uses all the data values available
ii. Moderate sampling stability
iii. Highly mathematically tractable
Disadvantages
i. More sensitive to extreme values – supposing instead of 33 and 27 Hrs we had 24 and 20 Hrs, the sum would fall by 16 Hrs and the mean would be 49.13 Hrs
ii. Not appropriate for qualitative data – you can’t get an average Gender, for example
iii. Not appropriate for open ended distributions
iv. Not appropriate for skewed distributions
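The sensitivity of the mean to extreme values can be illustrated with a small sketch. The sample below is hypothetical (it is not the battery data from the slides); the point is only that changing one value moves the mean while the median stays put.

```python
from statistics import mean, median

hours = [33, 46, 48, 50, 50, 52, 54, 59]   # hypothetical sample
x_bar = sum(hours) / len(hours)            # (x1 + ... + xN) / N
assert x_bar == mean(hours)

# Replace the smallest value with a more extreme one.
extreme = [9 if h == 33 else h for h in hours]
print(x_bar, mean(extreme))            # the mean shifts by (33 - 9) / 8 = 3 hours
print(median(hours), median(extreme))  # the median is unchanged
```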
Median
The median is the data value in the distribution that divides the data into
two groups having equal frequencies – the center point of the data set.
▶ If n is odd, the median is the data value of the ((n + 1)/2)th observation, when ordered.
▶ If n is even, the median is the midway point between the (n/2)th observation and the (n/2 + 1)th observation.
Here the median is 50 Hours
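The odd/even rule can be sketched directly. `median_by_rank` is an illustrative name, and the inputs are hypothetical; the standard library's `statistics.median` gives the same results.

```python
def median_by_rank(data):
    """Median via the odd/even rule, using 1-based ranks on the ordered data."""
    ordered = sorted(data)
    n = len(ordered)
    if n % 2 == 1:
        return ordered[(n + 1) // 2 - 1]                 # the ((n + 1)/2)th observation
    lower, upper = ordered[n // 2 - 1], ordered[n // 2]  # (n/2)th and (n/2 + 1)th
    return (lower + upper) / 2

print(median_by_rank([54, 46, 50]))        # n = 3, odd
print(median_by_rank([54, 46, 50, 52]))    # n = 4, even
```

The `- 1` terms only convert the 1-based ranks used on the slide to Python's 0-based list indices.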
Median
Advantages
▶ Simple to compute.
▶ Very appropriate for skewed distributions.
▶ Great sampling stability.
▶ Most appropriate for open ended distributions.
▶ Appropriate for ordered qualitative data.
Disadvantages
▶ Does not use all data values.
▶ Less mathematically tractable compared to mean.
Summarizing Qualitative Data: Frequency Table, Central Tendency, Bar Graph and Pie Chart
The modal value is “Plant A”, which accounts for 40% of the observations.
Summarizing Quantitative Data: Dispersion - Concept
▶ Data Set 1: 2, 6, 10, 10, 14, 18
▶ Data Set 2: 1, 2, 10, 10, 18, 19
▶ Data Set 3: 10, 10, 10, 10, 10, 10
For each of the three data sets: Mean = 10, Median = 10, Mode = 10
What is the difference between the three data sets?
This difference is characterized as dispersion (or variability between the individual data values) and is captured by measures such as range, inter-quartile range, standard deviation, and coefficient of variation.
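As a quick check of the point made here, the sketch below computes the three location measures and two dispersion measures for the three data sets. The `summary` dictionary is just an illustrative container.

```python
from statistics import mean, median, mode, stdev

data_sets = {
    "Data Set 1": [2, 6, 10, 10, 14, 18],
    "Data Set 2": [1, 2, 10, 10, 18, 19],
    "Data Set 3": [10, 10, 10, 10, 10, 10],
}

summary = {}
for name, xs in data_sets.items():
    summary[name] = {
        "mean": mean(xs), "median": median(xs), "mode": mode(xs),
        "range": max(xs) - min(xs),
        "sd": stdev(xs),   # sample standard deviation (n - 1 in the denominator)
    }
    print(name, summary[name])
```

The location measures agree across all three sets, so only the range and standard deviation tell them apart.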
Summarizing Quantitative Data: Dispersion - Range
The range is defined as R = (Maximum data value) – (Minimum data value).
Range = 68 – 27 = 41
Advantages
▶ Easy to compute
▶ Provides a quick understanding of the total spread of the data values
Disadvantages
▶ Cannot be used for qualitative data
▶ Uses only two data values
▶ Highly influenced by extreme data values
▶ Poor sampling stability
Summarizing Quantitative Data: Dispersion – Inter-Quartile Range
* Maximum (100th percentile) = Q4
* 75th percentile = Q3
* Median (50th percentile) = Q2
* 25th percentile = Q1
* Minimum (0th percentile) = Q0
* Range = Q4 – Q0
* Inter-Quartile Range = Q3 − Q1
Advantages
▶ Good sampling stability, less influenced by extreme values, appropriate for skewed distributions.
Disadvantages
▶ Not computable for qualitative variables, does not use all the data, not amenable to further mathematical operations.
Summarizing Quantitative Data: Dispersion – Box (Whisker) Plots
IQR = 8. Any data beyond the whiskers is to be inspected for identifying outliers. The whisker multiplier of 1.5 (times the IQR) is the default, and we can change it.
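The quartiles, the IQR and the 1.5 × IQR rule can be sketched as follows. The sample is hypothetical, `iqr_and_fences` is an illustrative name, and the "inclusive" quantile method is an assumption: several quartile conventions exist and they can give slightly different values on small samples.

```python
from statistics import quantiles

def iqr_and_fences(data, k=1.5):
    """Return (Q1, Q3, IQR, lower fence, upper fence) for whisker multiplier k."""
    q1, _q2, q3 = quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    return q1, q3, iqr, q1 - k * iqr, q3 + k * iqr

hours = [27, 33, 42, 45, 46, 48, 50, 50, 50, 52, 54, 59, 68]  # hypothetical
q1, q3, iqr, low, high = iqr_and_fences(hours)
to_inspect = [h for h in hours if h < low or h > high]
print(q1, q3, iqr, (low, high), to_inspect)
```

Changing `k` widens or narrows the fences, which is the adjustable default mentioned on the slide.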
Summarizing Quantitative Data: Dispersion – Standard Deviation (SD)
▶ Standard deviation s (for a sample) is defined as
\[ s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}. \]
▶ It is an indication of “How much do individual observations vary from the central measure?”
▶ For the given data we can compute s as 7.81 hours.
▶ It has the same units as the variable (hours in this case).
▶ The range of 41 Hrs can be read as: all data is covered within 5.25 standard deviations (41/7.81).
▶ The IQR of 8 hours can be read as: the middle 50% of data is covered within 1.02 standard deviations (8/7.81).
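The sample standard deviation formula can be implemented directly and compared with the standard library. The `hours` sample is hypothetical and `sample_sd` is an illustrative name.

```python
from math import sqrt
from statistics import stdev

def sample_sd(xs):
    """s = sqrt( sum((x_i - x_bar)^2) / (n - 1) )."""
    n = len(xs)
    x_bar = sum(xs) / n
    return sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))

hours = [46, 48, 50, 50, 52, 54]   # hypothetical sample, in hours
s = sample_sd(hours)
print(s, stdev(hours))                 # both use the n - 1 denominator
print((max(hours) - min(hours)) / s)   # range expressed in standard deviations
```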
Comparison with Standard Normal Plot
Advantages
▶ Like mean, uses all the data values.
▶ Has good sampling stability.
▶ The measure is mathematically tractable.
Disadvantages
▶ Cannot be used for qualitative data.
▶ Is influenced by extreme values/outliers.
▶ Not very appropriate for skewed data.
Coefficient of Variation (CV)
▶ Data Set 1: 2, 8, 10, 12, 18
Mean = 10, SD = 5.22, CV = 5.22/10 = 0.522 (or) 52.2%
▶ Data Set 2: 102, 108, 110, 112, 118
Mean = 110, SD = 5.22, CV = 5.22/110 = 0.047 (or) 4.7%
Which data set has more variation?
\[ c_v = \frac{\sigma}{\mu} \]
where σ is the standard deviation and µ is the mean.
▶ Useful for comparing two distributions.
▶ Also useful for comparing two different variables (since, by definition, the CV is dimensionless).
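The two CV values can be reproduced with a short sketch. Note one assumption made explicit here: the SD of 5.22 quoted for both data sets matches the population standard deviation (denominator n); with the sample formula (denominator n − 1) it would be about 5.83. `coefficient_of_variation` is an illustrative name.

```python
from statistics import mean, pstdev

def coefficient_of_variation(xs):
    """CV = sigma / mu, with sigma as the population standard deviation."""
    return pstdev(xs) / mean(xs)

set_1 = [2, 8, 10, 12, 18]
set_2 = [102, 108, 110, 112, 118]
print(round(pstdev(set_1), 2), round(pstdev(set_2), 2))   # same SD
print(round(coefficient_of_variation(set_1), 3))
print(round(coefficient_of_variation(set_2), 3))
```

Both sets have the same absolute spread, so the CV differs only because the means differ.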
Summarizing Qualitative Data: Dispersion – Index of Diversity (Gini Impurity)
The Index of Diversity is given by
\[ 1 - \sum_{i=1}^{k} p_i^2 \implies 1 - (0.4)^2 - (0.3)^2 - (0.3)^2 = 0.66 \]
If all the elements have come from just one plant, then Index of Diversity = 1 – (1)² – (0)² – (0)² = 0.
Index of Diversity (Gini Impurity) (contd.)
If the elements have come in equally from all plants, then Index of Diversity = 1 – 3 × (1/3)² ≈ 0.667.
Index of qualitative variation
\[ \text{Index of qualitative variation} = \frac{1 - \sum_{i=1}^{k} p_i^2}{(k-1)/k} = \left(1 - (0.4)^2 - (0.3)^2 - (0.3)^2\right) \times \frac{3}{2} = 0.99. \]
Index of qualitative variation will be 0 if all the elements have come from just one plant.
Index of qualitative variation (contd.)
Index of qualitative variation will be 1 if the elements have come in equally from all plants.
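Both indices can be sketched from category counts. The function names are illustrative, and the counts of 12, 9 and 9 batteries are an assumption that corresponds to the 40% / 30% / 30% plant shares used in the formulas.

```python
from collections import Counter

def index_of_diversity(labels):
    """Gini impurity: 1 - sum of squared category proportions."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def index_of_qualitative_variation(labels, k):
    """Index of diversity rescaled by its maximum (k - 1) / k for k categories."""
    return index_of_diversity(labels) / ((k - 1) / k)

# Assumed counts matching plant shares of 40% / 30% / 30% out of 30 batteries.
plants = ["A"] * 12 + ["B"] * 9 + ["C"] * 9
print(round(index_of_diversity(plants), 2))
print(round(index_of_qualitative_variation(plants, k=3), 2))
print(index_of_diversity(["A"] * 30))   # a single plant: no diversity
```

The number of categories `k` is passed explicitly because a category with zero observations would not appear in the labels.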
Data: Summary of ...
Mean = 49.67 Hrs
Median = 50 Hrs
Mode = 50 Hrs (10%)
SD = 7.81 Hrs
IQR = 8 Hrs
Range = 41 Hrs
CV = 15.8%
Mode: Plant A (40%)
Index of diversity (or) Gini impurity = 0.66
Index of qualitative variation = 0.99
More than One Variable
Data Tables: Cross Tabulations
The higher the family monthly income, the more likely it is that the household (HH) will have more than one car.
The larger the family size, the more likely it is that the household will have more than one car.
Data Plots: Three variables at a go
Correlation
Measure of Associations
\[ \operatorname{cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n-1} \]
\[ SD^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} \]
We get cov = 1019.5
Measure of Associations - Covariance
Difficulties with Covariance – Issues of comparisons
Between Calorie Intake and Calories burnt, which is associated more with Weight?
Difficulties with Covariance – Issues of comparisons - Further complicated by units
Covariance is unbounded, and the value also depends on the units of data
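The unit dependence of covariance can be seen in a short sketch. The calorie and weight values are hypothetical (they are not the data behind cov = 1019.5), and `covariance` is an illustrative implementation of the sample covariance with the n − 1 denominator.

```python
def covariance(xs, ys):
    """Sample covariance: sum((x_i - x_bar) * (y_i - y_bar)) / (n - 1)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical data: daily calorie intake and body weight in kilograms.
calories = [1800, 2000, 2200, 2400, 2600]
weight_kg = [60, 64, 70, 72, 79]
weight_g = [w * 1000 for w in weight_kg]   # same weights, different unit

print(covariance(calories, weight_kg))
print(covariance(calories, weight_g))      # 1000 times larger: unit-dependent
```

Nothing about the relationship changed between the two calls, only the unit of weight, which is why covariances are hard to compare across variables.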
Measure of Associations - Correlations
▶ Correlation between two variables x, y is
\[ r = r_{xy} = \frac{\operatorname{cov}(x, y)}{S_x \times S_y} = \frac{1}{n-1} \sum_{i=1}^{n} \frac{(x_i - \bar{x})}{S_x} \frac{(y_i - \bar{y})}{S_y} \]
This is Pearson’s correlation coefficient.
Pearson’s correlation coefficient is:
▶ Dimensionless, therefore comparable across different variables
▶ Between −1 and +1
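A minimal sketch of Pearson's coefficient, written to mirror the formula (covariance divided by the two sample standard deviations). The data are hypothetical and `pearson_r` is an illustrative name.

```python
from math import sqrt

def pearson_r(xs, ys):
    """r = cov(x, y) / (S_x * S_y), with n - 1 denominators throughout."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    cov = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)
    s_x = sqrt(sum((x - x_bar) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - y_bar) ** 2 for y in ys) / (n - 1))
    return cov / (s_x * s_y)

calories = [1800, 2000, 2200, 2400, 2600]   # hypothetical data
weight_kg = [60, 64, 70, 72, 79]
weight_g = [w * 1000 for w in weight_kg]

r_kg = pearson_r(calories, weight_kg)
r_g = pearson_r(calories, weight_g)
print(r_kg, r_g)   # equal up to floating-point rounding: r is dimensionless
```

Rescaling a variable multiplies both the covariance and its standard deviation by the same factor, so the ratio is unaffected.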
The absolute value of the coefficient indicates the strength of the relationship between two variables (the closer it is to 1, the greater the strength).
The sign of the coefficient indicates the direction of the relationship:
▶ (+) Variables move in the same direction
▶ (−) Variables move in the opposite direction
A scatter plot provides a good hint to the possible correlation value between two variables.
Pearson’s correlation coefficient provides the direction and strength of a linear relationship between two variables.
Spearman Rank Correlation
Spearman’s rank correlation measures the strength and direction of association between two ranked variables. It gives a measure of the monotonicity of the relation between two variables, i.e. how well the relationship between two variables could be represented using a monotonic function.
▶ Degree of association between two variables
▶ Linear or nonlinear association
▶ As x increases, y increases or decreases monotonically
▶ Spearman rank correlation computation for n observations:
\[ r_s = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} \]
where $d_i$ is the difference between the ranks given to the two variables' values for each item of the data.
▶ $r_s$ takes a value between −1 (negative association) and 1 (positive association). $r_s = 0$ means no association.
▶ Can be used if the association is nonlinear, and can be applied to ordinal variables.
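The rank-difference formula can be sketched as below. The data are hypothetical, the helper names are illustrative, and the sketch assumes there are no tied values (the simple formula is only exact in that case).

```python
def ranks(values):
    """Rank 1 for the smallest value; assumes no ties (a simplifying assumption)."""
    order = sorted(values)
    return [order.index(v) + 1 for v in values]

def spearman_rs(xs, ys):
    """r_s = 1 - 6 * sum(d_i^2) / (n * (n^2 - 1)), valid when there are no ties."""
    n = len(xs)
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

x = [1, 2, 3, 4, 5]                    # hypothetical data
y_monotone = [v ** 3 for v in x]       # nonlinear but monotonically increasing
y_reversed = [-v for v in x]           # monotonically decreasing
print(spearman_rs(x, y_monotone), spearman_rs(x, y_reversed))
```

The cubic relationship is not linear, yet the ranks agree perfectly, which is what a monotonic association means here.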
Kendall rank correlation coefficient
▶ Kendall rank correlation (non-parametric) is an alternative to Spearman correlation (also non-parametric), and is often preferred when the sample size is small and there are many tied ranks.
▶ Kendall rank correlation is used to test the similarities in the ordering of data when it is ranked by quantities.
▶ It determines the strength of association based on concordance and discordance between the pairs.
▶ Concordant: Ordered in the same way (consistency). A pair of observations is considered concordant if (x2 − x1) and (y2 − y1) have the same sign.
▶ Discordant: Ordered differently (inconsistency). A pair of observations is considered discordant if (x2 − x1) and (y2 − y1) have opposite signs.
Kendall’s correlation coefficient is used to measure association between two ordinal variables.
▶ Kendall rank correlation coefficient
\[ \tau = \frac{(\text{number of concordant pairs}) - (\text{number of discordant pairs})}{n(n-1)/2}. \]
Example: Two experts ranking on food items
▶ Let’s take the item pair 1 and 3.
▶ Both experts have given a lower rank to item 3 compared to item 1.
▶ Therefore, item pair 1 and 3 is considered to be a concordant pair.
▶ Let’s take the item pair 2 and 4.
▶ Expert 1 has given a lower rank to item 4 compared to item 2.
▶ But Expert 2 has given a higher rank to item 4 compared to item 2.
▶ Therefore, item pair 2 and 4 is considered to be a discordant pair.
▶ You repeat this process for all the 21 pairs of items as shown in the grid.
▶ We have 15 concordant pairs and 6 discordant pairs, so τ = (15 − 6)/21 ≈ 0.43.
▶ Kendall’s τ can take a maximum value of +1 (when all pairs are concordant) and a minimum value of −1 (when all pairs are discordant).
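The pair-by-pair procedure can be sketched as follows. The ranking table from the slide is not available in the text, so the two rank lists below are hypothetical: they were chosen only so that 7 items give 15 concordant and 6 discordant pairs, with pair (1, 3) concordant and pair (2, 4) discordant as in the example. `kendall_tau` is an illustrative name and assumes no tied ranks.

```python
from itertools import combinations

def kendall_tau(rank_a, rank_b):
    """tau = (concordant - discordant) / (n * (n - 1) / 2); assumes no ties."""
    concordant = discordant = 0
    for i, j in combinations(range(len(rank_a)), 2):
        sign = (rank_a[j] - rank_a[i]) * (rank_b[j] - rank_b[i])
        if sign > 0:
            concordant += 1      # both experts order the pair the same way
        elif sign < 0:
            discordant += 1      # the experts disagree on the pair
    n_pairs = len(rank_a) * (len(rank_a) - 1) // 2
    return concordant, discordant, (concordant - discordant) / n_pairs

# Hypothetical ranks for 7 items (NOT the ranks shown on the slide).
expert_1 = [1, 2, 3, 4, 5, 6, 7]
expert_2 = [1, 5, 2, 3, 7, 6, 4]
concordant, discordant, tau = kendall_tau(expert_1, expert_2)
print(concordant, discordant, round(tau, 2))
```

Checking all pairs is quadratic in the number of items, which is fine for small rankings like this one.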
Summary
▶ Data Types
▶ Classification of Data
▶ Basic Statistical Measures
  ▶ Shape of the data
  ▶ Central Tendency (3M’s)
  ▶ Dispersion
▶ Summarizing Qualitative Data
  ▶ Frequency Table
  ▶ Central Tendency
  ▶ Bar Graph
  ▶ Pie Chart
▶ Summarizing Quantitative Data: Dispersion
  ▶ Range
  ▶ Inter-Quartile Range
  ▶ Box Plot
  ▶ Standard Deviation (SD)
  ▶ Comparison with Normal Distribution
  ▶ Coefficient of Variation (CV)
Summary (Contd.)
▶ Summarizing Qualitative Data: Dispersion
  ▶ Index of Diversity (Gini Impurity)
  ▶ Index of Qualitative Variation
▶ Summary of
  ▶ Qualitative Data
  ▶ Quantitative Data
▶ More than one variable: Data tables
▶ Measures of Association
  ▶ Covariance
  ▶ Pearson’s Correlation
  ▶ Spearman Rank Correlation
  ▶ Kendall Rank Correlation Coefficient
The End
