
Statistics for Machine Learning

1 / 79
Content

Purposes of Statistics

Descriptive Statistics
Shape of data
Numerical Measures of Central Tendency
Numerical Measures of Variability
Notion of Normal Distribution

Covariance and Correlation

2 / 79
Purposes of Statistics

Descriptive Statistics
▶ Organize
▶ Summarize
▶ Simplify
▶ Presentation of data

Inferential Statistics
▶ Generalize from samples to populations
▶ Hypothesis testing

Predictive Analysis
▶ Relationships among variables

▶ Descriptive statistics: describe the sample data
▶ Inferential statistics and predictive analysis: make predictions about the
population

3 / 79
Data we normally encounter...

4 / 79
Classification of Data

Data can be broadly classified into two types


i. Categorical
Categorical variables take category or label values and place an
individual into one of several groups.
Each observation can be placed in only one category, and the
categories are mutually exclusive.
For example, smoking status is a categorical variable with two groups:
nonsmoker and smoker.
ii. Quantitative
Quantitative variables take numerical values and represent some kind of
measurement.
Age is an example of a quantitative variable because it is measured on a
numerical scale.
Weight and height are also examples of quantitative variables.
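
The distinction shows up in how each kind of variable is summarized. The sketch below uses made-up records (the names `people`, `status_counts` and `mean_age` are illustrative, not from the slides): a categorical variable is summarized by counting group membership, while a quantitative variable supports arithmetic such as averaging.

```python
from collections import Counter
from statistics import mean

# Hypothetical records: (smoking status, age in years). Not real data.
people = [
    ("nonsmoker", 34),
    ("smoker", 51),
    ("nonsmoker", 27),
    ("nonsmoker", 45),
    ("smoker", 38),
]

# Categorical variable: each person falls into exactly one group,
# so the natural summary is a count per category.
status_counts = Counter(status for status, _ in people)

# Quantitative variable: the values are measurements, so arithmetic
# such as averaging is meaningful.
mean_age = mean(age for _, age in people)

print(status_counts)
print(mean_age)
```

Averaging the labels "smoker" and "nonsmoker" would make no sense, which is the practical meaning of the categorical/quantitative split.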

5 / 79
Types of Data

6 / 79
Types of Data

▶ Nominal Data: a categorical data type that describes qualitative
  characteristics or groups, with no order or rank between categories.
  Examples of nominal data include:
    ▶ Gender, ethnicity, eye colour, blood type
    ▶ Brand of refrigerator/motor vehicle/television owned
▶ Ordinal Data: similar to nominal data in that it consists of categories,
  but with a meaningful order or rank between the options.
  Some examples of ordinal data:
    ▶ Income level (e.g. low income, middle income, high income)
    ▶ Level of agreement (e.g. strongly disagree, disagree, neutral, agree,
      strongly agree)

7 / 79
Types of Data

▶ Interval Data: a numerical level of measurement which, like the ordinal
  scale, places values in order. Unlike the ordinal scale, however, the
  interval scale has a known and equal distance between each value on the
  scale.
  Some examples:
    ▶ Temperature in degrees Fahrenheit or Celsius
    ▶ IQ score
    ▶ Income categorized into equal-width ranges
▶ Ratio Data: like interval data, it is ordered/ranked and the numerical
  distance between points is consistent. What makes it different from
  interval data is that a measurement of zero means that there is nothing
  of that variable.
  Some examples:
    ▶ Weight in grams (continuous)
    ▶ Number of employees at a company (discrete)
    ▶ Speed in miles per hour (continuous)
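
A short sketch of why the zero point matters. The temperatures and weights are made-up values, and `celsius_to_kelvin` is a helper defined here for illustration. Differences are meaningful on the Celsius (interval) scale, but ratios only become meaningful on a scale with a true zero, such as Kelvin or grams.

```python
def celsius_to_kelvin(temp_c: float) -> float:
    """Convert Celsius (interval scale) to Kelvin (ratio scale with a true zero)."""
    return temp_c + 273.15


warm_c, cool_c = 20.0, 10.0  # hypothetical temperatures

# Differences are meaningful on an interval scale.
difference_c = warm_c - cool_c

# The Celsius ratio is 2.0, but it does not mean "twice as hot",
# because 0 degrees Celsius is an arbitrary zero.
naive_ratio = warm_c / cool_c

# Kelvin has a non-arbitrary zero, so this ratio is meaningful.
kelvin_ratio = celsius_to_kelvin(warm_c) / celsius_to_kelvin(cool_c)

# Weight in grams is ratio data: 150 g really is twice 75 g.
weight_ratio = 150 / 75

print(difference_c, naive_ratio, round(kelvin_ratio, 4), weight_ratio)
```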

8 / 79
Stevens' classification categorizes data according to four
basic properties:

1. Description
At the nominal level, all we can do is name or label things. We
cannot perform any arithmetic with nominal-level data; all we can
do is count the frequencies with which the things occur.
2. Order
The ordinal level enables us to order the items of interest using ordinal
numbers. Ordinal numbers denote an item's position or rank in a
sequence: first, second, third, and so on.

9 / 79
Stevens' classification categorizes data according to four
basic properties: (contd.)

3. Distance
The interval level has an inherent order and, in addition, a known and
equal distance between adjacent values on the scale.
4. Origin
A non-arbitrary zero allows us to express the numerical relationship
between values as ratios.
For example: a person who weighs 150 pounds weighs twice as
much as a person who weighs only 75 pounds and half as much as a
person who weighs 300 pounds. We can calculate ratios like these
because the scale for weight in pounds starts at a true zero.
These four levels are also referred to as the primary scales of measurement.

10 / 79
Different types of Data

11 / 79
Primary scales of measurement: Ratio Data

This scale has description, order, distance, and origin.

13 / 79
Primary scales of measurement: Interval Data

This scale has description, order, and distance, but no origin.

14 / 79
Primary scales of measurement: Ordinal Data

This scale has description and order, but no distance and no origin.

15 / 79
Primary scales of measurement: Nominal Data

This scale has description, but no order, no distance, and no origin.
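
The four slides above form a cumulative hierarchy, which can be written down as a small lookup table. This is only a sketch; the names `PROPERTIES`, `SCALES` and `has_property` are illustrative choices, not part of the slides.

```python
# The four properties, in the order in which the scales acquire them.
PROPERTIES = ("description", "order", "distance", "origin")

# Each scale has every property up to and including the one it adds.
SCALES = {
    "nominal": PROPERTIES[:1],
    "ordinal": PROPERTIES[:2],
    "interval": PROPERTIES[:3],
    "ratio": PROPERTIES[:4],
}


def has_property(scale: str, prop: str) -> bool:
    """Return True if the given scale of measurement has the property."""
    return prop in SCALES[scale]


for scale, props in SCALES.items():
    missing = [p for p in PROPERTIES if p not in props]
    print(f"{scale:<8} has: {', '.join(props):<40} lacks: {', '.join(missing) or '-'}")
```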

16 / 79
To begin, not all data are of the same type

17 / 79
Descriptive Statistics

19 / 79
Statistics: A Single variable

Three important defining characteristics of any set of data for a given
variable are:
▶ Shape of data (also called distribution)
▶ Measure of location (also called central tendency)
▶ Spread of data (also called variability)

20 / 79
Example

▶ A mobile phone maker claims that the battery in their brand of
mobile phone will last 48 hours under normal usage conditions.
▶ The QC department at their manufacturing facility tests this claim
routinely using randomly selected mobile phones.
▶ The data from the tests just concluded on a certain day have just
arrived...
It is unrealistic to expect that each battery would last exactly 48 hours.
There is bound to be variability in the life of the battery. This variability
can be due to variations in the quality of input components,
manufacturing conditions, workers involved, or just unexplained
random fluctuations.

21 / 79
So, the claim that we need to verify is:

▶ On average, a mobile phone battery (made by this company) will
last for 48 hours.
▶ To verify this claim, it is impractical to test the entire output
(the population).
▶ We will need to devise a method of sampling to obtain a sample of
phones and test them.
▶ We need to understand what the sample data indicate (descriptive
statistics).
▶ We need to draw a conclusion about the population battery life
(inferential statistics).

22 / 79
Data

Thirty batteries were tested, and we have the
number of hours each battery lasted.
This variable (battery life in hours) is quantitative.

Let's say the 30 batteries
were made in three different
plants (A, B, C), and we have
data on which plant made each battery.
This variable (plant) is qualitative.

23 / 79
Summarizing Quantitative Data - Ordering

Once the data are ordered, we can quickly read off Max = 68, Min = 27, and
some concentration of values around 50.
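
A minimal sketch of the ordering step. The raw table of 30 observations is not reproduced in the text, so `hours` below is a made-up sample of 12 values, chosen only to share the minimum, maximum and most frequent value quoted in the slides.

```python
# Hypothetical battery lives in hours (NOT the 30 observations from the slides).
hours = [50, 46, 52, 27, 59, 50, 68, 42, 48, 33, 54, 50]

# Ordering the data makes the extremes and any clustering easy to see.
ordered = sorted(hours)
shortest, longest = ordered[0], ordered[-1]

print(ordered)
print("Min =", shortest, "Max =", longest)
```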

24 / 79
Summarizing Quantitative Data: Ungrouped Frequency
Distributions

You can see the concentration of observation values between 48 and 54
more clearly.
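
An ungrouped frequency distribution simply counts how often each distinct value occurs. The sketch below uses a made-up sample of 12 values (an assumption, since the original table is not in the text) and `collections.Counter` from the standard library.

```python
from collections import Counter

# Hypothetical battery lives in hours (NOT the 30 observations from the slides).
hours = [50, 46, 52, 27, 59, 50, 68, 42, 48, 33, 54, 50]

# One count per distinct value.
frequency = Counter(hours)

# Print the table in value order, with a simple tally mark per observation.
for value in sorted(frequency):
    print(f"{value:>3} h  {frequency[value]}  {'*' * frequency[value]}")
```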

25 / 79
Summarizing Quantitative Data: Ungrouped Frequency
Distributions - Graphical

26 / 79
Summarizing Quantitative Data: Grouped Frequency
Distributions – Grouping 1

You can see that 25 of the 30 observations have a value between 42 and 59.

Of course, we have lost the information on how many observations had a
value of 46, 50, etc.
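
Grouping can be sketched as integer division by the class width. Everything specific here is an assumption for illustration: the 12-value sample, the helper name `grouped_frequency`, and the choice of `start=24` and `width=6`. The helper also assumes whole-hour data with no value below `start`.

```python
from collections import Counter

# Hypothetical battery lives in hours (NOT the 30 observations from the slides).
hours = [50, 46, 52, 27, 59, 50, 68, 42, 48, 33, 54, 50]


def grouped_frequency(values, start, width):
    """Count integer values per class interval [lower, lower + width - 1].

    Assumes every value is >= start.
    """
    counts = Counter((v - start) // width for v in values)
    table = []
    for k in range(max(counts) + 1):
        lower = start + k * width
        table.append((lower, lower + width - 1, counts.get(k, 0)))
    return table


table = grouped_frequency(hours, start=24, width=6)
for lower, upper, count in table:
    print(f"{lower}-{upper}: {count}")
```

Calling the same helper with a different `start` or `width` gives a different, equally valid table, and in every case the individual values inside each class are no longer recoverable from the counts.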

27 / 79
Summarizing Quantitative Data: Grouped Frequency
Distributions – Grouping 2

No grouped frequency distribution is unique: the same data can be grouped
in many ways.

The right number of groups and the class-interval width are subjective and
depend on the data.
Anything over 10 groups becomes difficult to read and comprehend.
It is not advisable to vary the class-interval width (e.g. Hrs: 63 – 68,
55 – 62, 50 – 54).

28 / 79
Summarizing Quantitative Data: Grouped Frequency
Distributions – Open Ended Groups

No grouped frequency distribution is unique: the same data can be grouped
in many ways.

Open-ended groups bring focus to data points that need greater
attention.
However, they are not amenable to certain mathematical computations (for
example, the average), because an open-ended class has no defined midpoint.

29 / 79
Summarizing Quantitative Data: Grouped Frequency
Distributions - Graphical

Histograms
▶ Divide the range of values in the sample into small intervals and count
how many observations fall within each interval.
▶ For each interval, plot a rectangle with width equal to the interval size
and height equal to the number of observations in the interval.
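
The two steps above can be sketched as a text histogram, with no plotting library needed. The sample and the interval choice (`start`, `width`) are assumptions made for illustration. Each bar is drawn sideways, so its length plays the role of the rectangle's height.

```python
# Hypothetical battery lives in hours (NOT the 30 observations from the slides).
hours = [50, 46, 52, 27, 59, 50, 68, 42, 48, 33, 54, 50]
start, width = 24, 6  # hypothetical class-interval choice; start <= min(hours)

# Step 1: count how many observations fall within each interval.
n_bins = (max(hours) - start) // width + 1
counts = [0] * n_bins
for h in hours:
    counts[(h - start) // width] += 1

# Step 2: draw one bar per interval, proportional to its count.
for k, count in enumerate(counts):
    lower = start + k * width
    print(f"{lower:>2}-{lower + width - 1:<2} | {'#' * count}")
```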

30 / 79
Summarizing Quantitative Data: Grouped Frequency
Distributions - Graphical

31 / 79
Summarizing Quantitative Data: Relative Frequency
Distributions – Proportions/Percentages

32 / 79
Summarizing Quantitative Data: Cumulative Frequency
Distributions
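
The tables for these slides did not survive in the text, so here is a hedged sketch of both ideas on a made-up sample. The relative frequency is each count expressed as a percentage of all observations, and the cumulative frequency is the running total of counts up to and including each value.

```python
from collections import Counter
from itertools import accumulate

# Hypothetical battery lives in hours (NOT the 30 observations from the slides).
hours = [50, 46, 52, 27, 59, 50, 68, 42, 48, 33, 54, 50]

n = len(hours)
frequency = Counter(hours)
values = sorted(frequency)

# Relative frequency: share of all observations, as a percentage.
relative = [100 * frequency[v] / n for v in values]

# Cumulative frequency: number of observations with a value <= v.
cumulative = list(accumulate(frequency[v] for v in values))

for v, rel, cum in zip(values, relative, cumulative):
    print(f"{v:>3} h  {frequency[v]}  {rel:5.1f}%  {cum:>2}")
```

The last cumulative entry always equals the sample size, which is a quick sanity check on the table.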

33 / 79
Summarizing Quantitative Data: Cumulative Frequency
Distributions - Graphical

34 / 79
Summarizing Quantitative Data: Basic Statistical
Measures - Shape of data

35 / 79
Basic Statistical Measures: Shape of data - Examples

36 / 79
Basic Statistical measures: Measures of location

▶ Frequency distributions (tables and graphs) help us understand data by
summarizing them.
▶ Measures of location and dispersion simplify the data further by providing
single-number representations.
▶ The central tendency is a score value on which a distribution tends
to center.
▶ The most common measure is the average; it signifies a typical, usual,
representative value of the data.
The 3 M’s (Mean, Median, Mode)

37 / 79
Summarizing Quantitative Data: Central Tendency –
Mode (and proportion)

The mode is the data value that occurs most frequently in the data set.

Mode: 50 Hours (3 of the 30 batteries: 3/30 × 100 = 10%)
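
A sketch of the same calculation on a made-up sample of 12 values (the original 30 observations are not in the text, so the proportion here is 3/12 rather than 3/30). `Counter.most_common(1)` returns the most frequent value together with its count.

```python
from collections import Counter

# Hypothetical battery lives in hours (NOT the 30 observations from the slides).
hours = [50, 46, 52, 27, 59, 50, 68, 42, 48, 33, 54, 50]

# Most frequent value and how many times it occurs.
mode_value, mode_count = Counter(hours).most_common(1)[0]

# Proportion of the sample taking the modal value, as a percentage.
proportion_pct = 100 * mode_count / len(hours)

print(f"Mode: {mode_value} hours ({mode_count}/{len(hours)} = {proportion_pct:.0f}%)")
```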

38 / 79
Mode

Advantages
i. Only sensible measure for qualitative data
ii. More appropriate for quantitative data which are inherently discrete

Disadvantages
i. There may not be a single mode (multimodal data)
ii. It does not use all the data available
iii. Poor sampling stability - large variations across samples
iv. Not very mathematically tractable
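
Two of these points are easy to demonstrate with `statistics.multimode` (Python 3.8+), which returns every value tied for most frequent. Both lists below are made-up illustrations: plant labels stand in for qualitative data, and the second sample is deliberately bimodal.

```python
from statistics import multimode

# Qualitative data: a mean is meaningless, but the mode is well defined.
plants = ["A", "B", "A", "C", "A", "B"]  # hypothetical plant labels
plant_modes = multimode(plants)

# Multimodal data: two values tie for most frequent, so there is no single mode.
bimodal_hours = [48, 50, 50, 52, 54, 54]  # hypothetical values
hour_modes = multimode(bimodal_hours)

print(plant_modes)
print(hour_modes)
```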

39 / 79
Summarizing Quantitative Data: Central Tendency – Mean

The mean is the arithmetic average of all the data values in the data set.

Mean: 49.67 Hours

The formula is given by
\[ \bar{x} = \frac{x_1 + x_2 + \cdots + x_N}{N}. \]
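
The formula translates directly into code: add up all N values and divide by N. The sample below is made up (12 values, not the 30 from the slides), so its mean differs from the 49.67 hours quoted above.

```python
# Hypothetical battery lives in hours (NOT the 30 observations from the slides).
hours = [50, 46, 52, 27, 59, 50, 68, 42, 48, 33, 54, 50]

n = len(hours)               # N in the formula
mean_hours = sum(hours) / n  # (x_1 + x_2 + ... + x_N) / N

print(f"N = {n}, sum = {sum(hours)}, mean = {mean_hours} hours")
```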

40 / 79
Mean

Advantages
i. Uses all the data values available
ii. Moderate sampling stability
iii. Highly mathematically tractable

Disadvantages
i. More sensitive to extreme values: supposing that instead of 33 and 27 Hrs we had 24 and 20 Hrs, the mean would be 49.27 Hrs
ii. Not appropriate for qualitative data: you can’t get an average Gender, for example
iii. Not appropriate for open-ended distributions
iv. Not appropriate for skewed distributions

41 / 79
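The sensitivity to extreme values can be sketched as follows. The ten-value sample is hypothetical (the full data set is not reproduced here); only the substitution of 27 and 33 by 20 and 24 follows the example above.

```python
from statistics import mean, median

# Hypothetical weekly hours; only the values 27, 33, 20 and 24 echo the slide.
hours = [27, 33, 45, 48, 50, 50, 52, 55, 60, 68]
shifted = [20, 24, 45, 48, 50, 50, 52, 55, 60, 68]  # 27 -> 20, 33 -> 24

mean_before, mean_after = mean(hours), mean(shifted)
median_before, median_after = median(hours), median(shifted)

print(f"mean:   {mean_before:.2f} -> {mean_after:.2f}")
print(f"median: {median_before:.2f} -> {median_after:.2f}")
```

In this sketch the mean moves when the two smallest values change, while the median stays where it was.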
Summarizing Quantitative Data: Central Tendency –
Median

The median is the data value in the distribution that divides the data into
two groups having equal frequencies – the center point of the data set.

▶ If n is odd, the Median is the data value of the ((n + 1)/2)th
observation, when ordered.
▶ If n is even, the Median is the midway point of the (n/2)th
observation and the (n/2 + 1)th observation, when ordered.

Here the median is 50 Hours

42 / 79
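The odd/even rule above can be sketched directly. The function name `median_by_rule` and both samples are hypothetical; the result is compared against `statistics.median` as a sanity check.

```python
from statistics import median

def median_by_rule(values):
    """Median via the ordered-position rule for odd and even n."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of empty data is undefined")
    if n % 2 == 1:
        # ((n + 1)/2)th observation, converted to a 0-based index
        return ordered[(n + 1) // 2 - 1]
    # midway between the (n/2)th and (n/2 + 1)th observations
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

odd_sample = [52, 48, 50, 55, 47]        # hypothetical, n = 5
even_sample = [52, 48, 50, 55, 47, 60]   # hypothetical, n = 6

odd_median = median_by_rule(odd_sample)
even_median = median_by_rule(even_sample)
print(odd_median, even_median)
```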
Median

Advantages
▶ Simple to compute.
▶ Very appropriate for skewed distributions.
▶ Great sampling stability.
▶ Most appropriate for open-ended distributions.
▶ Appropriate for ordered qualitative data.

Disadvantages
▶ Does not use all data values.
▶ Less mathematically tractable compared to mean.

43 / 79
Summarizing Qualitative Data: Frequency Table, Central
Tendency, Bar Graph and Pie Chart

The modal value is “Plant A”, with a relative frequency of 40%.

44 / 79
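A frequency table and its mode can be sketched with `collections.Counter`. The ten-item sample and the plant names B and C are assumptions; only Plant A's 40% share comes from the slide.

```python
from collections import Counter

# Hypothetical sample of 10 items; plant names B and C are assumed.
items = ["A"] * 4 + ["B"] * 3 + ["C"] * 3

counts = Counter(items)
n = len(items)
relative = {plant: count / n for plant, count in counts.items()}
modal_plant, modal_count = counts.most_common(1)[0]

for plant in sorted(counts):
    print(f"Plant {plant}: count={counts[plant]}, share={relative[plant]:.0%}")
print(f"Mode: Plant {modal_plant} ({modal_count / n:.0%})")
```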
Summarizing Quantitative Data: Dispersion - Concept

▶ Data Set 1: 2, 6, 10, 10, 14, 18
Mean = 10, Median = 10, Mode = 10
▶ Data Set 2: 1, 2, 10, 10, 18, 19
Mean = 10, Median = 10, Mode = 10
▶ Data Set 3: 10, 10, 10, 10, 10, 10
Mean = 10, Median = 10, Mode = 10

What is the difference between the three data sets?

This difference is characterized as Dispersion (or Variability between the
individual data values) and is captured by measures such as Range, Inter-Quartile
Range, Standard Deviation, and Coefficient of Variation

45 / 79
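A short sketch of the comparison above, using the three data sets from the slide and Python's standard `statistics` module. The central measures agree, while the range and the sample standard deviation differ.

```python
from statistics import mean, median, mode, stdev

data_sets = {
    "Data Set 1": [2, 6, 10, 10, 14, 18],
    "Data Set 2": [1, 2, 10, 10, 18, 19],
    "Data Set 3": [10, 10, 10, 10, 10, 10],
}

summary = {}
for name, values in data_sets.items():
    summary[name] = {
        "mean": mean(values),
        "median": median(values),
        "mode": mode(values),
        "range": max(values) - min(values),
        "sd": stdev(values),  # sample SD (n - 1 denominator)
    }
    print(name, summary[name])
```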
Summarizing Quantitative Data: Dispersion - Range
The range is defined as R = (Maximum Data Value) – (Minimum Data Value)

Range = 68 – 27 = 41

Advantages
▶ Easy to compute
▶ Provides a quick understanding of the total spread of the data values

Disadvantages
▶ Cannot be used for qualitative data
▶ Uses only two data values
▶ Highly influenced by extreme data values
▶ Poor sampling stability

46 / 79
Summarizing Quantitative Data: Dispersion
Inter-Quartile Range

* Maximum (100th percentile) = Q4
* 75th percentile = Q3
* Median (50th percentile) = Q2
* 25th percentile = Q1
* Minimum (0th percentile) = Q0
* Range = Q4 − Q0
* Inter-Quartile Range = Q3 − Q1

47 / 79
Inter-Quartile Range

Advantages
▶ Good sampling stability
▶ Less influenced by extreme values
▶ Appropriate for skewed distributions

Disadvantages
▶ Not computable for qualitative variables
▶ Does not use all the data
▶ Not amenable to further mathematical operations

48 / 79
Summarizing Quantitative Data: Dispersion – Box
(Whisker) Plots

IQR = 8. Any data point beyond the whiskers should be inspected as a possible
outlier. By convention the whiskers extend up to 1.5 × IQR beyond the quartiles;
1.5 is only the default multiplier and can be changed.

49 / 79
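The whisker rule can be sketched as below. The nine-value sample is hypothetical, chosen so that IQR = 8, minimum = 27 and maximum = 68 as quoted in the slides. Quartile conventions differ between tools; the "inclusive" method of `statistics.quantiles` is just one choice.

```python
from statistics import quantiles

# Hypothetical hours, chosen so that IQR = 8, min = 27 and max = 68.
hours = [27, 44, 46, 48, 50, 52, 54, 56, 68]

# Quartile conventions differ between tools; "inclusive" is one common choice.
q1, q2, q3 = quantiles(hours, n=4, method="inclusive")
iqr = q3 - q1

k = 1.5  # conventional whisker multiplier; can be changed
lower_fence = q1 - k * iqr
upper_fence = q3 + k * iqr
outliers = [x for x in hours if x < lower_fence or x > upper_fence]

print(f"Q1={q1}, Q2={q2}, Q3={q3}, IQR={iqr}")
print(f"fences: [{lower_fence}, {upper_fence}], points to inspect: {outliers}")
```

Raising `k` widens the fences, so fewer points are flagged for inspection.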
Summarizing Quantitative Data Dispersion: Standard
Deviation (SD)

▶ Standard Deviation s (for sample) is defined as
\[ s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}. \]
▶ It is an indication of “How much do individual observations vary
from the Central measure?”
▶ For the given data we can compute s as 7.81 hours.
▶ It has the same units as the variable (hours in this case).
▶ The Range of 41 Hrs can be read as: all data is covered within 5.25
Standard Deviations (41/7.81).
▶ The IQR of 8 hours can be read as: the middle 50% of data is
covered within 1.02 Standard Deviations (8/7.81).

50 / 79
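A minimal sketch of the sample SD formula, written out term by term and checked against `statistics.stdev`. The function name `sample_sd` is hypothetical, and the six values are an illustrative sample rather than the hours data, which is not reproduced here. The last lines repeat the slide's arithmetic for the range and IQR in SD units.

```python
from math import sqrt
from statistics import stdev

def sample_sd(values):
    """Sample standard deviation with the n - 1 denominator."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two observations")
    x_bar = sum(values) / n
    return sqrt(sum((x - x_bar) ** 2 for x in values) / (n - 1))

values = [2, 6, 10, 10, 14, 18]   # illustrative sample
s = sample_sd(values)
print(f"s = {s:.3f} (statistics.stdev gives {stdev(values):.3f})")

# Expressing the slide's range and IQR in SD units.
sd_hours, range_hours, iqr_hours = 7.81, 41, 8
range_in_sd = range_hours / sd_hours
iqr_in_sd = iqr_hours / sd_hours
print(f"range / SD = {range_in_sd:.2f}, IQR / SD = {iqr_in_sd:.2f}")
```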
Comparison with Standard Normal Plot

Advantages
▶ Like mean, uses all the data values.
▶ Has good sampling stability.
▶ The measure is mathematically tractable.

Disadvantages
▶ Cannot be used for qualitative data.
▶ Is influenced by extreme values/outliers.
▶ Not very appropriate for skewed data.

51 / 79
Summarizing Quantitative Data: Dispersion
Coefficient of Variation (CV)

▶ Data Set 1: 2, 8, 10, 12, 18
Mean = 10, SD = 5.22, CV = 5.22/10 = 0.522 (or) 52.2%
▶ Data Set 2: 102, 108, 110, 112, 118
Mean = 110, SD = 5.22, CV = 5.22/110 = 0.047 (or) 4.7%

Which data set has more variation?

\[ c_v = \frac{\sigma}{\mu} \]
where σ is the standard deviation and µ is the mean.
▶ Useful for comparing two distributions.
▶ Also useful for comparing two different variables (since, by
definition, the CV is dimensionless).

52 / 79
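The two CV values above can be reproduced as follows. One observation, stated as an assumption about how the slide's numbers were obtained: SD = 5.22 matches the population formula (`statistics.pstdev`), whereas the sample formula with n − 1 would give about 5.83 for these five values.

```python
from statistics import mean, pstdev, stdev

def coefficient_of_variation(values):
    """CV = SD / mean, using the population SD to match the slide's 5.22."""
    return pstdev(values) / mean(values)

set_1 = [2, 8, 10, 12, 18]
set_2 = [102, 108, 110, 112, 118]

cv_1 = coefficient_of_variation(set_1)
cv_2 = coefficient_of_variation(set_2)

print(f"Set 1: SD={pstdev(set_1):.2f}, CV={cv_1:.1%}")
print(f"Set 2: SD={pstdev(set_2):.2f}, CV={cv_2:.1%}")
print(f"Sample SD (n - 1) of Set 1 for comparison: {stdev(set_1):.2f}")
```

Both sets have the same absolute spread, but relative to its mean Data Set 1 varies far more.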
Summarizing Qualitative Data: Dispersion
Index of Diversity (Gini Impurity)

Index of Diversity is given by
\[ 1 - \sum_{i=1}^{k} p_i^2 \implies 1 - (0.4)^2 - (0.3)^2 - (0.3)^2 = 0.66 \]

If all the elements have come from just one plant then
\[ \text{Index of Diversity} = 1 - (1)^2 - (0)^2 - (0)^2 = 0 \]

53 / 79
Summarizing Qualitative Data: Dispersion
Index of Diversity (Gini Impurity) (contd.)

If the elements have come in equally from all plants then
\[ \text{Index of Diversity} = 1 - 3 \left(\tfrac{1}{3}\right)^2 \approx 0.667. \]

54 / 79
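The three cases (mixed, single plant, equal shares) can be checked with a small helper. The function name `gini_impurity` is hypothetical; the proportions are those used on the slides.

```python
def gini_impurity(proportions):
    """Index of diversity: 1 minus the sum of squared category proportions."""
    if abs(sum(proportions) - 1.0) > 1e-9:
        raise ValueError("proportions must sum to 1")
    return 1.0 - sum(p * p for p in proportions)

mixed = gini_impurity([0.4, 0.3, 0.3])        # 1 - 0.16 - 0.09 - 0.09
single = gini_impurity([1.0, 0.0, 0.0])       # everything from one plant
equal = gini_impurity([1 / 3, 1 / 3, 1 / 3])  # equal shares, k = 3

print(f"mixed={mixed:.3f}, single={single:.3f}, equal={equal:.3f}")
```

For k categories the index is largest at equal shares, where it equals (k − 1)/k.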
Summarizing Qualitative Data: Dispersion
Index of qualitative variation

Index of qualitative variation =
\[ \frac{1 - \sum_{i=1}^{k} p_i^2}{(k-1)/k} = \left(1 - (0.4)^2 - (0.3)^2 - (0.3)^2\right) \times \frac{3}{2} = 0.99. \]

Index of qualitative variation will be 0 if all the elements have come
from just one plant.

55 / 79
Summarizing Qualitative Data: Dispersion
Index of qualitative variation (contd.)

Index of qualitative variation will be 1 if the elements have come in
equally from all plants.

56 / 79
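A sketch of the normalisation: dividing the index of diversity by its maximum (k − 1)/k rescales it to the interval from 0 to 1. The function name is hypothetical; the proportions are those from the slides.

```python
def index_of_qualitative_variation(proportions):
    """Index of diversity divided by its maximum possible value (k - 1)/k."""
    k = len(proportions)
    if k < 2:
        raise ValueError("need at least two categories")
    diversity = 1.0 - sum(p * p for p in proportions)
    return diversity / ((k - 1) / k)

iqv_mixed = index_of_qualitative_variation([0.4, 0.3, 0.3])
iqv_single = index_of_qualitative_variation([1.0, 0.0, 0.0])
iqv_equal = index_of_qualitative_variation([1 / 3, 1 / 3, 1 / 3])

print(f"mixed={iqv_mixed:.2f}, single={iqv_single:.2f}, equal={iqv_equal:.2f}")
```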
Data: Summary of ...

Mean = 49.67 Hrs
Median = 50 Hrs
Mode = 50 Hrs (10%)
SD = 7.81 Hrs
IQR = 8 Hrs
Range = 41 Hrs
CV = 15.8%

57 / 79
Data: Summary of ...

Mode: Plant A (40%)
Index of diversity (or) Gini impurity = 0.66
Index of qualitative variation = 0.99

58 / 79
More than One Variable

59 / 79
Data Tables: Cross Tabulations

The higher the Family Monthly Income, the more likely it is that the household (HH)
will have more than one Car.
The higher the Family Size, the more likely it is that the household will have more
than one Car.

60 / 79
Data Plots: Three variables at a go

61 / 79
Correlation

62 / 79
Measure of Associations

Covariance and Correlation

\[ \operatorname{cov}(x, y) = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n-1} \]
\[ \mathrm{SD}^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1} \]

We get cov = 1019.5

63 / 79
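A minimal sketch of the sample covariance formula. The calorie and weight figures are hypothetical (the data behind cov = 1019.5 is not reproduced here), and `sample_covariance` is an illustrative name. Re-expressing weight in grams shows how the value depends on units.

```python
def sample_covariance(xs, ys):
    """Sample covariance with the n - 1 denominator."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with n >= 2")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

# Hypothetical data, for illustration only.
calories = [1800, 2000, 2200, 2400, 2600]
weight_kg = [60, 64, 66, 71, 74]
weight_g = [w * 1000 for w in weight_kg]

cov_kg = sample_covariance(calories, weight_kg)
cov_g = sample_covariance(calories, weight_g)
print(f"cov (kg) = {cov_kg}, cov (g) = {cov_g}")
```

The same relationship yields a covariance 1000 times larger when weight is recorded in grams.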
Measure of Associations - Covariance

Difficulties with Covariance – Issues of comparisons

Between Calorie Intake and Calories burnt, which is associated more with
Weight?

64 / 79
Measure of Associations - Covariance

Difficulties with Covariance – Issues of comparisons - Further complicated
by units

Covariance is unbounded, and the value also depends on the units of data

65 / 79
Measure of Associations - Correlations

▶ Correlation between two variables x, y is
\[ r = r_{xy} = \frac{\operatorname{cov}(x, y)}{S_x \times S_y} = \frac{1}{n-1} \sum_{i=1}^{n} \frac{(x_i - \bar{x})}{S_x} \frac{(y_i - \bar{y})}{S_y} \]
This is Pearson’s correlation coefficient.

Pearson’s correlation coefficient is:
▶ Dimensionless, therefore comparable across different variables
▶ Bounded between -1 and +1

66 / 79
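The two properties above can be sketched with a from-scratch implementation. The data and the name `pearson_r` are hypothetical; the n − 1 factors in the covariance and in the two standard deviations cancel, so the code works with raw sums of deviations.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation; the n - 1 factors cancel between cov and SDs."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length samples with n >= 2")
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant variable")
    return sxy / sqrt(sxx * syy)

# Hypothetical data, for illustration only.
calories = [1800, 2000, 2200, 2400, 2600]
weight_kg = [60, 64, 66, 71, 74]
weight_g = [w * 1000 for w in weight_kg]

r_kg = pearson_r(calories, weight_kg)
r_g = pearson_r(calories, weight_g)
r_neg = pearson_r(calories, [-w for w in weight_kg])
print(f"r (kg) = {r_kg:.4f}, r (g) = {r_g:.4f}, r (negated) = {r_neg:.4f}")
```

Changing the unit of weight leaves r unchanged, and flipping the sign of one variable only flips the sign of r.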
Measure of Associations - Correlations

The absolute value of the coefficient indicates the strength of the relationship
between two variables (the closer it is to 1, the greater the strength).
The sign of the coefficient indicates the direction of the relationship:
▶ (+) Variables move in the same direction
▶ (-) Variables move in the opposite direction

67 / 79
Measure of Associations - Correlations

A scatter plot provides a good hint of the likely correlation value
between two variables.

68 / 79
Measure of Associations - Correlations

Pearson’s correlation coefficient provides the direction and strength of a
linear relationship between two variables.

69 / 79
Spearman Rank Correlation
Spearman’s rank correlation measures the strength and direction of
association between two ranked variables. It basically gives the measure
of monotonicity of the relation between two variables i.e. how well the
relationship between two variables could be represented using a
monotonic function.
▶ Degree of association between two variables
▶ Linear or nonlinear association
▶ As x increases, y increases or decreases monotonically

70 / 79
Spearman Rank Correlation
▶ Spearman rank correlation computation for n observations:
\[ r_s = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)} \]
where d_i is the difference in the ranks given to the two variables’ values for
each item of the data.
▶ Example:

▶ r_s takes a value between -1 (negative association) and 1 (positive
association); r_s = 0 means no monotonic association.
▶ Can be used if the association is nonlinear and can be applied to
ordinal variables.

71 / 79
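A minimal sketch of the rank-difference formula. The two rankings are hypothetical, and the shortcut formula is exact only when there are no tied ranks, which is assumed here.

```python
def spearman_from_ranks(rank_x, rank_y):
    """Spearman's r_s from two rankings without ties."""
    n = len(rank_x)
    if n != len(rank_y) or n < 2:
        raise ValueError("need two equal-length rankings with n >= 2")
    d_squared = sum((rx - ry) ** 2 for rx, ry in zip(rank_x, rank_y))
    return 1 - 6 * d_squared / (n * (n ** 2 - 1))

# Hypothetical ranks given by two judges to five items (no ties).
judge_1 = [1, 2, 3, 4, 5]
judge_2 = [2, 1, 4, 3, 5]

r_s = spearman_from_ranks(judge_1, judge_2)
r_same = spearman_from_ranks(judge_1, judge_1)
r_reversed = spearman_from_ranks(judge_1, judge_1[::-1])
print(r_s, r_same, r_reversed)
```

Here the squared rank differences sum to 4, so r_s = 1 − 24/120 = 0.8. Identical and fully reversed rankings give the two extremes.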
Kendall rank correlation coefficient

▶ Kendall rank correlation (non-parametric) is the best alternative to
Spearman correlation (non-parametric) if the sample size is small
and has many tied ranks.
▶ Kendall rank correlation is used to test the similarity of the
orderings of the data when it is ranked by each of two quantities.
▶ It determines the strength of association based on concordance
and discordance between pairs of observations.
▶ Concordant: Ordered in the same way (consistency). A pair of
observations is considered concordant if (x2 − x1) and (y2 − y1) have
the same sign.
▶ Discordant: Ordered differently (inconsistency). A pair of
observations is considered discordant if (x2 − x1) and (y2 − y1) have
opposite signs.

72 / 79
Kendall rank correlation coefficient

Kendall’s correlation coefficient is used to measure association between
two ordinal variables.
▶ Kendall rank correlation coefficient
\[ \tau = \frac{\text{Number of concordant pairs} - \text{Number of discordant pairs}}{n(n-1)/2}. \]

73 / 79
Kendall rank correlation coefficient

Example: Two experts ranking on food items

▶ Let’s take the item pair 1 and 3.
▶ Both experts have given a lower rank to item 3 compared to item 1.
▶ Therefore, item pair 1 and 3 is considered to be a concordant pair.

74 / 79
Kendall rank correlation coefficient

Example: Two experts ranking on food items


▶ Consider the item pair 2 and 4.
▶ Expert 1 has given a lower rank to item 4 compared to item 2.
▶ But Expert 2 has given a higher rank to item 4 compared to item 2.
▶ Therefore, item pair 2 and 4 is considered to be a discordant pair.

75 / 79
Kendall rank correlation coefficient

Example: Two experts ranking on food items


▶ Repeat this process for all 21 pairs of items, as shown in the grid.
▶ We have 15 concordant pairs and 6 discordant pairs.
▶ Kendall’s τ can take a maximum value of +1 (when all pairs are
concordant) and a minimum value of −1 (when all pairs are discordant).
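
As a worked check, the counts above can be substituted into the definition of τ (concordant minus discordant pairs, divided by n(n − 1)/2). This assumes the 21 pairs come from n = 7 items, since 7 · 6 / 2 = 21; the grid itself is not reproduced here.

```latex
\tau = \frac{15 - 6}{7(7 - 1)/2} = \frac{9}{21} = \frac{3}{7} \approx 0.43
```

Under that assumption the value indicates a moderate positive agreement between the two experts, well inside the range from −1 to +1.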

76 / 79
Summary
▶ Data Types
▶ Classification of Data
▶ Basic Statistical Measures
  ▶ Shape of the data
  ▶ Central Tendency (3M’s)
  ▶ Dispersion
▶ Summarizing Qualitative Data
  ▶ Frequency Table
  ▶ Central Tendency
  ▶ Bar Graph
  ▶ Pie Chart
▶ Summarizing Quantitative Data: Dispersion
  ▶ Range
  ▶ Inter-Quartile Range
  ▶ Box Plot
  ▶ Standard Deviation (SD)
  ▶ Comparison with Normal Distribution
  ▶ Coefficient of Variation (CV)
77 / 79
Summary (Contd.)

▶ Summarizing Qualitative Data: Dispersion
  ▶ Index of Diversity (Gini Impurity)
  ▶ Index of Qualitative Variation
▶ Summary of
  ▶ Qualitative Data
  ▶ Quantitative Data
▶ More than one variable: Data tables
▶ Measures of Association
  ▶ Covariance
  ▶ Pearson’s Correlation
  ▶ Spearman Rank Correlation
  ▶ Kendall Rank Correlation Coefficient

78 / 79
The End

79 / 79