
Qualifier

WEEK 1-4 NOTES


STATISTICS

INDEX

Week - 1
Week - 2
Week - 3
Week - 4

By :- Rushikesh Chavan
WEEK-1

INTRO & TYPE OF DATA - BASIC DEFINITION

• What is Statistics ?
---> Statistics is the art of learning from data. It is concerned with the
collection of data, their subsequent description, and their analysis,
which leads to the drawing of conclusions.

- The part of statistics concerned with the description and summarization
  of data is called Descriptive Statistics.
- The part of statistics concerned with the drawing of conclusions from
  data is called Inferential Statistics.

Descriptive Statistics does not necessarily seek to explain why something is
happening or to make predictions about the future. Instead, it focuses on
presenting facts and characteristics as they are.

- To be able to draw a conclusion from the data, we must take into account
  the possibility of chance, i.e. Probability.



POPULATION AND SAMPLE

Population :- The total collection of all the elements that we are interested in is
called the population.
Sample :- A subgroup of the population that will be studied in detail is
called a sample.

Example :- Take a bag full of colourful balls, of which we as observers want to study
only the red balls. Then :

- The bag of colourful balls is the Population.
- The red balls are the Sample.



PURPOSE OF STATISTICAL ANALYSIS

- If the purpose of the analysis is to examine and explore information
  for its own intrinsic interest only, then the study is Descriptive.

- If the information is obtained from a sample of a population and the purpose
  of the study is to use that information to draw conclusions / predictions about
  the population, then the study is Inferential.

- A descriptive study may be performed either on a sample or on a population.

- When an inference or prediction is made about the population based on
  information obtained from the sample, the study becomes inferential.



WHAT IS DATA ?

In order to learn something, we need to collect data.

Data are the facts and figures which are collected, analysed and summarized for
presentation and interpretation (understanding).

- Statistics relies on data, i.e. the information that is around us.

- Why do we collect Data ?

  The primary reason for collecting data is that we are interested in knowing the
  characteristics of groups. These could be groups of people,
  places, things, or events.

  For Ex :- YouTube wants to know which videos people loved the most.
  That is why it added like and dislike buttons to its app :
  it is simply collecting data.

- Data Collection :-

  - Data Available :- Published data ( https://data.gov.in/ )
  - Data not Available :- We need to generate or collect it.


- Unstructured and Structured Data :-

- Unstructured Data :-

  Example :- Social media posts

  Description :- Unstructured data refers to information that does not have a
  specified, predefined format. Social media posts, such as
  tweets or posts on FB & Instagram, are prime examples of
  unstructured data. These posts can contain a mix of text, images,
  videos, hashtags, mentions & emojis.

- Structured Data :-

  Example :- Excel spreadsheet

  Description :- Structured data, on the other hand, has a well-defined and
  organized format. An Excel spreadsheet is an excellent example
  of structured data. Each row and column in the spreadsheet has
  a specific purpose and follows a consistent data model. This
  makes it easy to perform operations like sorting, filtering
  & aggregating data, as well as conducting structured analysis.



VARIABLES & CASES

Case :- A unit (ex :- a student) from which the data is collected.

Variable :- A characteristic or attribute that varies across units (ex :- a student's height).

Note :- The student data is represented in tabular form.

When organizing data in tabular form, these two points should be considered :

- Rows represent the cases :- For each case, the same attributes are recorded.
- Columns represent the variables :- For each variable, the same type of value is
  recorded for each case.

[Figure: student data table. Each row is a case (for example, Anjali is a case); each column is a variable (for example, Date of birth is a variable).]



CLASSIFICATION OF DATA

Categorical Data :- Also called Qualitative variables.

- Identifies group membership.
- We cannot perform any meaningful mathematical operation on it.
- Ex :- In a students data set, Gender is a categorical variable because it has two categories,
  M (Male) & F (Female), and we can classify every student into one of them.

Numerical Data :- Also called Quantitative variables.

- Describes the numerical properties of the cases ( ex :- a student's height ).
- Ex :- In a students data set, Marks is numerical data.

Measurement Unit :- The scale that defines the meaning of numerical data, such as weights measured
in kilograms, prices in rupees, height in cm, etc.
Note :- If values are entered for a numerical variable ( ex :- height ), then the unit must be
the same for every case.



CROSS - SECTIONAL & TIME SERIES DATA

• Time Series :- If the data is recorded over a period of time, then it is called
  time-series data.
• Time-Plot :- A graph of a time series showing values in chronological order is
  known as a time-plot.

Ex :- The data collected to observe the temperature in Delhi on seven different
days is time-series data, because the data is recorded for only one place (i.e. Delhi)
and it is recorded over a period of time (i.e. seven different days).

• Cross-Sectional Data :- If the data is observed at the same point in time, then it is
  called cross-sectional data.

Ex :- The data collected to observe the temperature of Delhi, Chennai, Jaipur and
Bhopal on a particular day is cross-sectional data, because the data is recorded at
the same time and it is observed for several places.



SCALES OF MEASUREMENT

We have four scales of measurement : nominal, ordinal, interval and
ratio. Data collection requires one of these scales of measurement.

• Nominal Scale of Measurement :-

When the data for a variable consist of labels or names used to identify a
characteristic of an observation, the scale of measurement is considered a
nominal scale.

Example : Name, Board, Gender, Blood group, etc.

- Nominal variables can be numerically coded : for example, male (M) and
  female (F) can each be assigned a number. The numbers serve only as labels.

- A nominal scale consists only of categories or labels, which do not have any order.

• Ordinal Scale of Measurement :-

When data exhibit the properties of nominal data and the order or rank of the
data is meaningful, the scale of measurement is considered an ordinal scale.

Ex :- When a customer buys something on Amazon, the customer rates the
product quality, i.e. one star to five stars, or poor / good / excellent.
This data has the properties of nominal data, and it can also be ranked
w.r.t. product quality.

Note :-
- We can code an ordinal scale of measurement : bad, good and excellent can each
  be assigned a number.
- The codes must preserve the order bad < good < excellent.

• Interval Scale of Measurement :-

If the data have all the properties of ordinal data (i.e. the data have an order)
and the interval between values is expressed in a fixed unit of measurement,
then the scale of measurement is an interval scale.

Note :- Data with an interval scale of measurement are always numeric, and we can find
the difference between any two values.
Ex :- Suppose an air-conditioned room is at one temperature in °C and the
temperature outside the room is higher.
It is correct to state the difference between the two temperatures in °C,
but it is incorrect to say that outdoors is twice as hot as indoors,
because 0 °C is not an absolute zero.


• Ratio Scale of Measurement :-

If the data have all the properties of interval data and the ratio of two values
is meaningful, then the scale of measurement is a ratio scale.

Example : Height (in cm), Weight (in kg), Marks, etc. Data such as
height, weight and marks can be added, subtracted, multiplied or divided,
because they all have an absolute zero.



WEEK-2



FREQUENCY DISTRIBUTION

Definition :-
- A frequency distribution of qualitative data is a way to organize and
display data by listing the distinct values or categories and how many times each
value or category appears in a dataset.
- Each row of a frequency table lists a category along with the Number of cases
in the category .

Let's use a simple example to illustrate this concept :-


Imagine you have a survey where you asked people about their favorite colors, and
you recorded their responses. Here are the responses you received from people:

Responses :-
1. Red
2. Blue
3. Green
4. Red
5. Red
6. Blue
7. Yellow
8. Green
9. Red
10. Purple

COLOURS        FREQUENCY
Blue           2
Green          2
Purple         1
Red            4
Yellow         1
Grand Total    10

Ex :- For the category Blue, the number of cases is 2.
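The counting step can also be done in code. The following is a minimal sketch in Python (the notes themselves use Google Sheets); the variable names `responses` and `frequency` are chosen here for illustration only.

```python
from collections import Counter

# The ten survey responses listed above.
responses = ["Red", "Blue", "Green", "Red", "Red",
             "Blue", "Yellow", "Green", "Red", "Purple"]

# Counter maps each distinct category to the number of times it occurs.
frequency = Counter(responses)

# Print the frequency table, one category per row, in alphabetical order.
for colour in sorted(frequency):
    print(f"{colour:<12}{frequency[colour]}")
print(f"{'Grand Total':<12}{sum(frequency.values())}")
```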



• STEPS TO DISPLAY THE FREQUENCY TABLE IN GOOGLE SHEETS

1. Select / highlight the cells containing the data you want to summarize.
2. In the menu bar, click Insert and choose the Pivot table option.
3. Choose where to create the Pivot Table : it can go in either a new sheet
   or an existing sheet (here, a new sheet).
4. After creating the Pivot Table, go to the editor :
   • 1st --> Add the Rows
   • 2nd --> Add the Values

• FINAL RESULT OF FREQUENCY TABLE :-



RELATIVE FREQUENCY

⚫ Definition :-
The ratio of a frequency to the total number of observations
is called the Relative Frequency.

Steps to construct a relative frequency distribution :-

Step 1 :- Obtain a frequency distribution of the data.
Step 2 :- Divide each frequency by the total number of observations.

COLOURS        FREQUENCY    RELATIVE FREQUENCY
Blue           2            0.2    ( Frequency / Grand Total = 2 / 10 = 0.2 )
Green          2            0.2
Purple         1            0.1
Red            4            0.4
Yellow         1            0.1
Grand Total    10           1

Note :- The sum of the relative frequencies is always 1.
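As a rough illustration of the two steps, the sketch below divides each frequency by the grand total. It is written in Python with illustrative names (`frequency`, `relative_frequency`), not something taken from the notes.

```python
# Step 1: the frequency distribution from the colour survey.
frequency = {"Blue": 2, "Green": 2, "Purple": 1, "Red": 4, "Yellow": 1}

# Step 2: divide each frequency by the total number of observations.
total = sum(frequency.values())
relative_frequency = {colour: f / total for colour, f in frequency.items()}

for colour, rf in relative_frequency.items():
    print(f"{colour:<8}{frequency[colour]:>3}{rf:>6.1f}")
```

Up to floating-point rounding, the relative frequencies add up to 1.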



STEPS TO DISPLAY A CHART IN GOOGLE SHEETS

1. Select / highlight the cells containing the data you want to visualize in chart format.
2. In the menu bar, click Insert and choose the Chart option.
3. In the chart type dropdown menu, select the type of chart, for ex :- Pie Chart.

Some Examples of Charts :-



DESCRIBING OF CATEGORICAL DATA

Charts of categorical data :-

- The two most common displays of a categorical variable are the bar chart & the pie chart.
- Both describe a categorical variable by displaying its frequency table.

PIE CHARTS :-

A Pie Chart is a circle divided into pieces or wedges proportional to the
relative frequencies of the qualitative data.

[Figure: pie chart of the colour data, with one piece / wedge labelled.]

The pie chart is based on this frequency distribution table :

COLOURS        FREQUENCY    RELATIVE FREQUENCY
Blue           2            0.2
Green          2            0.2
Purple         1            0.1
Red            4            0.4
Yellow         1            0.1
Grand Total    10           1

Note :- To find the angle required for each wedge, use Degree = relative frequency × 360.
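A small sketch of the wedge-angle rule, assuming the colour frequencies from the table above; the name `angles` is illustrative. The relative frequency is written as f / total, so the angle is 360 × f / total.

```python
frequency = {"Blue": 2, "Green": 2, "Purple": 1, "Red": 4, "Yellow": 1}
total = sum(frequency.values())

# Degree = relative frequency x 360, with relative frequency = f / total.
angles = {colour: 360 * f / total for colour, f in frequency.items()}

for colour, angle in angles.items():
    print(f"{colour:<8}{angle:>6.1f} degrees")
```

The wedge angles add up to the full circle of 360 degrees.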


Bar Chart :-

A bar chart displays the

• Distinct values of the qualitative data :- on the Horizontal Axis
• Frequency or Relative Frequency ( as a proportion / % ) :- on the Vertical Axis

The frequency or relative frequency of each distinct value is represented by a
vertical bar whose height is equal to the frequency / R.F. of that distinct value.

- The bars should be positioned in such a way that they do not touch
  each other.

[Figure: bar chart of the colour data, with the vertical axis, the horizontal axis and one vertical bar labelled.]

The bar chart is based on this frequency distribution table :

COLOURS        FREQUENCY    RELATIVE FREQUENCY
Blue           2            0.2
Green          2            0.2
Purple         1            0.1
Red            4            0.4
Yellow         1            0.1
Grand Total    10           1


Pareto Charts :-

When the categories in a bar chart are sorted by frequency
( conventionally from greater frequency to lower frequency ),
the bar chart is known as a Pareto Chart.

• If the categorical variable is ordinal, then the bar chart must
  preserve the ordering of the categories.

[Figure: basic example of a Pareto chart, with bars running from greater frequency to lower frequency.]



BEST PRACTICES WHILE GRAPHING

First - Know your purpose for every table or graph you create.

- Choose the table / graph that serves the purpose.
- If the purpose is just to count the data and represent the counts
  as a table, use a Frequency Table.
- If you want to compare the counts of two or more different values,
  use a Bar Chart.
- If the purpose is to know the share of each category ( for example, each state )
  out of 100 %, use a Pie Chart.

Labels in Charts :-

Label or annotate your data first : a chart gives a better visualization and
communicates its idea better only when the data is labelled or annotated.

[Figure: example of a well-annotated graph.]


Suppose a bar chart consists of too many categories, and a large number of those
categories have small counts. We should not exclude them, because that would
cause a loss of data. Instead, we should combine them into another category named
"Other", so that no data is lost.

[Figure: example of a bar chart with an "Other" category.]


Note :-
- A pie chart is best suited when we want to emphasize the proportions
  rather than the frequencies of the categories. Each slice in a pie
  chart must have a distinct colour.
- Some examples where bar charts can be used :-
  - An e-commerce company wants to know how many sources generated
    income above a certain value.
  - An e-commerce company is interested in knowing the amount of transactions
    made through different payment modes.
- All the bars in a bar chart can have the same colour.



MISLEADING GRAPHS
The Area Principle :-

[Figure: two pictogram bar graphs. ( 1 ) Improper Scaling ; ( 2 ) Regular Graph ( Proper Scaling ).]

The area principle says that the area occupied by a part of the graph should be
proportional to the amount of data it represents.

• In the improperly scaled pictogram bar graph ( 1 ), the image for B is actually 9 times
  as large as the image for A, so this graph does not obey the area principle.

• The regular graph ( 2 ) obeys the area principle.

• Any graph that does not obey the area principle is a Misleading Graph.



Truncated Graphs :- Another common violation of the area principle occurs when
the baseline of a bar chart is not at zero.

[Figure: a Truncated Bar Graph compared with a Regular Bar Graph.]

Indicating A Y-axis Break :-



ROUND-OFF ERROR

Category    Percentage    Round-Off
A           22.3          22
B           35.6          36
C           12.6          13
D           11            11
E           18.5          19
Total       100 %         101

It is important to check for round-off errors. Round-off errors occur when table
entries are percentages or proportions that have been rounded : the total
may then differ slightly from 100 % or 1. This shows up in a pie chart whose
slices do not add up to exactly 100 %.

In the table, the total of the original percentages is 100 %. Suppose we round off the
values ( with 18.5 rounded up to 19 ) and draw a pie chart from the rounded values.

This pie chart has a round-off error, because the total of all rounded entries is 101 %,
which is different from 100 %.



MODE & MEDIAN

MODE :- The mode of a categorical variable is the most common category,
i.e. the category with the highest frequency.

The mode corresponds to :-
(a) The longest bar in a bar chart.
(b) The widest slice in a pie chart.
(c) The first category shown in a Pareto chart ( when sorted from greater to lower frequency ).

Basic Ex :-

COLOURS        FREQUENCY    RELATIVE FREQUENCY
Blue           2            0.2
Green          2            0.2
Purple         1            0.1
Red            4            0.4
Yellow         1            0.1
Grand Total    10           1

[Figure: Pareto chart of the colour data, from greater frequency to lower frequency.]

In this figure, Red has the longest bar.
Thus, the mode of the dataset is the category " Red ".



For the pie-chart representation of the same dataset :-

The category " Red " has the widest slice. Thus, the mode of the dataset is the category " Red ".



BIMODAL & MULTIMODAL Data :-

If two or more categories share the same highest frequency, then the data
is called Bimodal ( in the case of two ) or Multimodal ( more than two ).

COLOURS        FREQUENCY    RELATIVE FREQUENCY
Blue           4            0.3333
Green          2            0.1667
Purple         1            0.0833
Red            4            0.3333
Yellow         1            0.0833
Grand Total    12           1

In the bar chart of this data, both categories " Red " & " Blue " have the highest frequency, so the data is bimodal.
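One way to find the mode, including the bimodal and multimodal cases, is to collect every category whose frequency equals the highest frequency. The helper `modes` below is a hypothetical name used for this sketch only.

```python
def modes(frequency):
    """Return every category that shares the highest frequency, sorted by name."""
    highest = max(frequency.values())
    return sorted(c for c, f in frequency.items() if f == highest)

# Unimodal example: the 10-response colour table.
first_table = {"Blue": 2, "Green": 2, "Purple": 1, "Red": 4, "Yellow": 1}
# Bimodal example: the 12-response colour table.
second_table = {"Blue": 4, "Green": 2, "Purple": 1, "Red": 4, "Yellow": 1}

print(modes(first_table))
print(modes(second_table))
```

A result with one entry means a single mode, two entries mean bimodal data, and more than two mean multimodal data.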



MEDIAN :-
• The median of an ordinal variable is the middle value when all the values
are arranged in order.

• If there is an even number of observations, then we can choose the
  category on either side of the middle of the sorted list as the median.

• Imagine you have a list of ordinal data representing the education levels of
a group of people: "High School," "Associate's Degree," "Bachelor's Degree,"
"Master's Degree," and "Ph.D."

• If you arrange these education levels in order from the least to the most
advanced, it would look like this:

1. High School
2. Associate's Degree
3. Bachelor's Degree
4. Master's Degree
5. Ph.D.

Since there is an odd number of observations ( five ), the median is the education
level that falls right in the middle of the ordered list, i.e. " Bachelor's Degree ".

Note: Median can be defined only for ordinal data whereas mode can be defined
for both nominal as well as ordinal data.



WEEK-3



DESCRIBING THE NUMERICAL DATA

• Organizing Numerical Data :-

( VARIABLE TYPE )

DISCRETE VARIABLE :- A discrete variable usually involves a count of something.

Ex :- Number of spelling mistakes.
      Number of accidents in a month in a particular city, etc.

CONTINUOUS VARIABLE :- A continuous variable involves a measurement of
something.

Ex :- Weight of a person.
      Height of a person, etc.

First group the observations into classes ( categories ), and then treat the classes as
the distinct values of the quantitative data.
Now we can construct frequency and relative frequency ( R.F. ) distributions of the data.



ORGANIZING DISCRETE DATA ( Single value )

If the data set contains only a relatively small number of different values, then it is
convenient to represent it in a frequency table.

Each class represents a distinct value along with its frequency of occurrence.

For Example :- The dataset reports which colour each person likes in a
group of 12 people.

COLOURS        FREQUENCY OF OCCURRENCE    RELATIVE FREQUENCY
Blue           4                          0.3333
Green          2                          0.1667
Purple         1                          0.0833
Red            4                          0.3333
Yellow         1                          0.0833
Grand Total    12                         1



ORGANIZING CONTINUOUS DATA

- Organize the data into a number of classes to make the data understandable.
- There are a few guidelines that need to be followed :-

▪ Number of classes :- The appropriate number is a subjective choice.
  The rule of thumb is to have 5 - 20 classes.
▪ Each observation should belong to some class & no observation should belong
  to more than one class.
▪ It is common, but not essential, to choose class intervals of equal length.



Some New Terms :-
• Lower Class Limit : The smallest value that could go in a class.
• Upper Class Limit : The largest value that could go in a class.
  Ex :- For the class 1 - 10, 1 is the lower class limit & 10 is the upper class limit.

• Class Width : The difference between the lower limit of a class
  & the lower limit of the next-higher class.
• Class Mark : The average of the two class limits of a class.

• A class interval contains its left-end boundary point but not its right-end boundary point.
  Ex :- The interval 30 - 40 means [ 30 , 40 ), i.e. 30 is included in the interval but 40 is not,
  as 40 is the lower limit of the next-higher class.

For example :- Organizing continuous data on the weight of bags in a class.

Class Interval    Frequency
1 - 10            7
10 - 20           10
20 - 30           9
30 - 40           14
40 - 50           14
50 - 60           11
60 - 70           6
70 - 80           13
80 - 90           7
90 - 100          9

Note :- A graph of such a frequency distribution is known as a Histogram.
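The class terms can be tried out with a short sketch. It assumes the left-closed, right-open convention [ lower , upper ) described above and uses the class limits from 30 to 100; `class_mark`, `class_width` and `find_class` are hypothetical helper names.

```python
def class_mark(lower, upper):
    """Average of the two class limits."""
    return (lower + upper) / 2

def class_width(lower, next_lower):
    """Lower limit of the next-higher class minus the lower limit of this class."""
    return next_lower - lower

def find_class(value, limits):
    """Return the class [lower, upper) containing value, or None if there is none."""
    for lower, upper in zip(limits, limits[1:]):
        if lower <= value < upper:  # left end included, right end excluded
            return (lower, upper)
    return None

limits = [30, 40, 50, 60, 70, 80, 90, 100]

print(class_mark(30, 40))      # mid-point of the class 30 - 40
print(class_width(30, 40))     # width of the class 30 - 40
print(find_class(40, limits))  # 40 belongs to 40 - 50, not to 30 - 40
```

Because the right end is excluded, every observation falls into at most one class, which is the guideline stated for organizing continuous data.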



STEM-AND-LEAF DIAGRAM

The Stem-and-Leaf Diagram is also known as a stem-plot.

In this graph, each observation is separated into two parts :-
1st part :- Stem - it consists of all but the rightmost digit.
2nd part :- Leaf - the rightmost digit.

Example :-

Note :- If the data are all two-digit numbers, then we can let the stem
of a data value be the tens digit and the leaf be the ones digit.

So, 75 is represented as follows :-

Stem    Leaf
7       5

If there are two values, 75 and 78 :-

Stem    Leaf
7       5 8



Steps to construct a Stem-plot :-

- Split each observation into
  Stem - all but the rightmost digit.
  Leaf - the rightmost digit.

- Write the stems from smallest to largest in a vertical column
  to the left of a vertical rule.

  Stem ( left of the vertical rule ) | Leaf ( right of the vertical rule )

- Write each leaf to the right of the vertical rule,
  in the row that contains the appropriate stem.

- Arrange the leaves in each row in ascending order.



- Example :-
Draw a stem-and-leaf plot for the dataset 15, 22, 29, 36, 31, 23, 45,
10, 25, 28, 48, which are the ages of 11 patients admitted to a certain
hospital.

Ages of patients in increasing order :
10, 15, 22, 23, 25, 28, 29, 31, 36, 45, 48

The stem-and-leaf plot for the above dataset is :

Stem | Leaf
1    | 0 5          ----> 10 15
2    | 2 3 5 8 9    ----> 22 23 25 28 29
3    | 1 6          ----> 31 36
4    | 5 8          ----> 45 48

You can observe that the leaves are arranged in ascending order.
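The construction steps can be sketched in Python as follows. The sketch assumes non-negative whole numbers, so that the stem is the value with its last digit removed; the function name `stem_and_leaf` is illustrative.

```python
from collections import defaultdict

def stem_and_leaf(data):
    """Group the leaves (rightmost digits) by stem (all remaining digits)."""
    plot = defaultdict(list)
    for value in sorted(data):           # sorting keeps leaves in ascending order
        stem, leaf = divmod(value, 10)   # e.g. 75 -> stem 7, leaf 5
        plot[stem].append(leaf)
    return dict(plot)

ages = [15, 22, 29, 36, 31, 23, 45, 10, 25, 28, 48]
plot = stem_and_leaf(ages)

# Stems from smallest to largest, leaves to the right of the vertical rule.
for stem in sorted(plot):
    print(stem, "|", " ".join(str(leaf) for leaf in plot[stem]))
```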



DESCRIPTIVE MEASURES

Descriptive measures are quantities whose values are determined
by the data and can be used to summarize a data set.

• Types of Descriptive Measures :-

1. Measures of Central Tendency :-

   These are the measures that indicate the most typical
   value or centre of a data set.

2. Measures of Dispersion :-

   These measures indicate the variability or spread of a
   dataset.

• What are Outliers :-

An outlier is an observation that lies far away from most of the other values
in the data. As an example, consider a dataset of the ages of a group of people :
[21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 120]. If we look closely,
most of the ages are between 21 and 30 years. The age ‘120’ is an
outlier.



1. MEASURES OF CENTRAL TENDENCY

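A. MEAN :-
The most commonly used measure of central tendency is the mean.
The mean of a data set is the sum of the observations divided by the
number of observations.

- The mean is usually referred to as the Arithmetic Average,
  i.e. divide the sum of the values by the number of values.

- Formulas for Different Types of Observations :-

1) Sample Mean :

   $$\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

2) Population Mean :

   $$\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$$

3) Mean for Grouped Data --- for discrete single-value data :

   $$\bar{x} = \frac{\sum f_i x_i}{\sum f_i}$$

4) Mean for Grouped Data --- for continuous data :

   $$\bar{x} = \frac{\sum f_i m_i}{\sum f_i}$$

Here n is the sample size, N is the population size, $f_i$ is the frequency of the i-th
value or class, and $m_i$ is the mid-point of the i-th class interval.

Note :- The mean is sensitive to outliers.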



Understanding Formulas With Example

1) Sample Mean :

Example :- 2 , 12 , 5 , 7 , 6 , 7 , 3 ---> find the sample mean of this data.

n = 7 ---> as there are 7 observations

$$\bar{x} = \frac{2 + 12 + 5 + 7 + 6 + 7 + 3}{7} = \frac{42}{7} = 6$$

2) Mean for Grouped Data ( discrete single-value data ) :

$$\bar{x} = \frac{\sum f_i x_i}{\sum f_i}$$

f ---> is the frequency

Ex :- Suppose we ask a group of people which number they like from
0 to 9.

Result -> 2 , 1 , 3 , 4 , 5 , 2 , 3 , 3 , 3 , 4 , 4 , 1 , 2 , 3 , 4 ---> n = 15

Value ( Xi )    FREQUENCY ( Fi )    Fi Xi
1               2                   2
2               3                   6
3               5                   15
4               4                   16
5               1                   5
Grand Total     n = 15              Sum = 44

$$\bar{x} = \frac{44}{15} \approx 2.93$$



3) Mean for Grouped Data ( continuous data ) :

$$\bar{x} = \frac{\sum f_i m_i}{\sum f_i}$$

f ---> is the frequency.
m ---> is the mid-point of the class interval.

Class Interval    Frequency ( fi )    Mid-Point ( mi )    fi mi
30 - 40           14                  35                  490
40 - 50           14                  45                  630
50 - 60           11                  55                  605
60 - 70           6                   65                  390
70 - 80           13                  75                  975
80 - 90           7                   85                  595
90 - 100          9                   95                  855
TOTAL             74                                      4540

$$\bar{x} = \frac{4540}{74} \approx 61.35$$
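The grouped-data calculation in this table can be reproduced with a few lines of Python. This is only a sketch: each class is written as a hypothetical tuple (lower limit, upper limit, frequency), and the mid-point stands in for every observation in the class, so the result approximates the true mean.

```python
# (lower limit, upper limit, frequency) for each class interval in the table.
classes = [(30, 40, 14), (40, 50, 14), (50, 60, 11), (60, 70, 6),
           (70, 80, 13), (80, 90, 7), (90, 100, 9)]

# Sum of frequencies and sum of frequency x mid-point.
total_f = sum(f for _, _, f in classes)
total_fm = sum(f * (lower + upper) / 2 for lower, upper, f in classes)

grouped_mean = total_fm / total_f
print(total_f, total_fm, round(grouped_mean, 2))
```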



ADDING & MULTIPLYING A CONSTANT

• Adding a Constant

Let $y_i = x_i + c$, where c is a constant. Then

$$\bar{y} = \bar{x} + c$$

$\bar{x}$ ---> is the old mean, before adding the constant.
$\bar{y}$ ---> is the new mean, after adding the constant ( c ).

Example : Taking the dataset of marks of students

- 68 , 79 , 38 , 68 , 35 , 70 , 61 , 47 , 58 , 66 ----> $\bar{x}$ = 590 / 10 = 59
- Suppose the teacher has decided to add 5 marks to each student's marks.
- Then the final data is
  73 , 84 , 43 , 73 , 40 , 75 , 66 , 52 , 63 , 71 ----> $\bar{y}$ = 640 / 10 = 64

So we can observe that

$\bar{y}$ = 59 + 5 = 64

• Multiplying by a Constant

Let $y_i = c \, x_i$, where c is a constant. Then

$$\bar{y} = c \, \bar{x}$$

- 68 , 79 , 38 , 68 , 35 , 70 , 61 , 47 , 58 , 66 ----> $\bar{x}$ = 59
- Suppose the teacher has decided to scale each mark down to 40 % of its value,
  so we have to multiply each mark in the data by 0.4.
- Then the final data is
  27.2 , 31.6 , 15.2 , 27.2 , 14 , 28 , 24.4 , 18.8 , 23.2 , 26.4 ----> $\bar{y}$ = 236 / 10 = 23.6

So we can observe that

$\bar{y}$ = 59 × 0.4 = 23.6
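A quick numerical check of both rules, as a sketch. It assumes the ten marks 68, 79, 38, 68, 35, 70, 61, 47, 58, 66, which is the list that matches the shifted and scaled values shown above; the variable names are illustrative.

```python
from statistics import mean

marks = [68, 79, 38, 68, 35, 70, 61, 47, 58, 66]

shifted = [x + 5 for x in marks]    # add the constant c = 5 to every mark
scaled = [x * 0.4 for x in marks]   # multiply every mark by the constant c = 0.4

old_mean = mean(marks)
print(old_mean, mean(shifted), mean(scaled))
```

The mean of the shifted marks equals the old mean plus 5, and the mean of the scaled marks equals the old mean times 0.4, up to floating-point rounding.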



B. MEDIAN :-
The median is also used to measure central tendency, because the median
of a data set is the middle value in its ordered list
( ordered list ex :- increasing order ).

• 4 , 12 , 3 , 10 , 9 , 16 , 15 ---> Find the median

Arrange the data in increasing order :

3 , 4 , 9 , 10 , 12 , 15 , 16 ----> No. of observations ( n ) = 7

If the number of observations is odd, then

Median = value of the $\left(\frac{n+1}{2}\right)$-th observation

Median = value of the $\frac{7+1}{2}$ = 4th observation

Median = 10

• 4 , 12 , 3 , 10 , 9 , 16 ---> Find the median

Arrange the data in increasing order :

3 , 4 , 9 , 10 , 12 , 16 ----> No. of observations ( n ) = 6

If the number of observations is even, then

Median = average of the $\left(\frac{n}{2}\right)$-th and $\left(\frac{n}{2}+1\right)$-th observations

Median = average of the 3rd and 4th observations = $\frac{9 + 10}{2}$ = 9.5

Median = 9.5

Note :- The median is not sensitive to outliers.
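The odd and even cases can be written as one small function. This is a sketch with an illustrative name (`median`); Python's standard library offers `statistics.median`, which gives the same results for these examples.

```python
def median(data):
    """Middle value of the ordered data (average of the two middle values if n is even)."""
    ordered = sorted(data)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # Odd n: the (n + 1) / 2-th observation, which is index n // 2 (0-based).
        return ordered[middle]
    # Even n: average of the (n / 2)-th and (n / 2 + 1)-th observations.
    return (ordered[middle - 1] + ordered[middle]) / 2

print(median([4, 12, 3, 10, 9, 16, 15]))   # odd number of observations
print(median([4, 12, 3, 10, 9, 16]))       # even number of observations
```

Replacing the largest value 16 by a far larger number leaves the median of the first dataset unchanged, which illustrates why the median is not sensitive to outliers.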



ADDING & MULTIPLYING A CONSTANT

• Adding a Constant

Let $y_i = x_i + c$, where c is a constant. Then

new median = old median + c

old median ---> is the median before adding the constant.
new median ---> is the median after adding the constant ( c ).

Example : Taking the dataset of marks of students

- 68 , 79 , 38 , 70 , 61 , 47 , 58 , 66
- 38 , 47 , 58 , 61 , 66 , 68 , 70 , 79 ---> median = ( 61 + 66 ) / 2 = 63.5
- Suppose the teacher has decided to add 5 marks to each student's marks.
- Then the final data is
- 43 , 52 , 63 , 66 , 71 , 73 , 75 , 84 ----> median = ( 66 + 71 ) / 2 = 68.5

So we can observe that

new median = 63.5 + 5 = 68.5

• Multiplying by a Constant

Let $y_i = c \, x_i$, where c is a constant. Then

new median = c × old median

- 68 , 79 , 38 , 70 , 61 , 47 , 58 , 66
- 38 , 47 , 58 , 61 , 66 , 68 , 70 , 79 ---> median = ( 61 + 66 ) / 2 = 63.5
- Suppose the teacher has decided to scale each mark down to 40 % of its value,
  so we have to multiply each mark in the data by 0.4.
- Then the final data is
  15.2 , 18.8 , 23.2 , 24.4 , 26.4 , 27.2 , 28 , 31.6 ----> median = ( 24.4 + 26.4 ) / 2 = 25.4

So we can observe that

new median = 63.5 × 0.4 = 25.4



C. MODE :-
The mode is a measure of central tendency.
The mode of the data is its most frequently occurring value.
- If no value occurs more than once, then the data has no mode.
- Example :- 2 , 12 , 5 , 7 , 6 , 7 , 3
  ----> As 7 occurs twice, 7 is the mode.

- Example :- 2 , 105 , 5 , 7 , 6 , 3
  ----> No mode

• Adding a Constant

Let $y_i = x_i + c$, where c is a constant. Then

new mode = old mode + c

old mode ---> is the mode before adding the constant.
new mode ---> is the mode after adding the constant ( c ).

Example : Taking the dataset of marks of students

- 68 , 79 , 38 , 68 , 35 , 70 , 61 , 47 , 58 , 66 ----> Mode is 68
- Suppose the teacher has decided to add 5 marks to each student's marks.
- Then the final data is
  73 , 84 , 43 , 73 , 40 , 75 , 66 , 52 , 63 , 71 ----> Mode is 73

So we can observe that
new mode = 68 + 5 = 73

• Multiplying by a Constant

Let $y_i = c \, x_i$, where c is a constant. Then

new mode = c × old mode

- 68 , 79 , 38 , 68 , 35 , 70 , 61 , 47 , 58 , 66 ----> Mode is 68
- Multiplying the complete data by 0.4 gives
  27.2 , 31.6 , 15.2 , 27.2 , 14 , 28 , 24.4 , 18.8 , 23.2 , 26.4 ----> Mode is 27.2

So we can observe that
new mode = 68 × 0.4 = 27.2



2. MEASURES OF DISPERSION

A measure of dispersion indicates the amount of variation, or
spread, in a dataset.

Some measures of dispersion are :

1. Range

2. Variance

3. Standard Deviation

4. Interquartile Range



1) RANGE :- The range of a dataset is the difference between its
largest & smallest values.

Range = Max - Min

- Examples :-
1 ) Find the range of the dataset 1 , 2 , 3 , 4 , 5 .
----> Here,
max value :- 5
min value :- 1

Range = 5 - 1 = 4

2 ) Find the range of the dataset 1 , 2 , 3 , 4 , 15 .
----> Here,
max value :- 15
min value :- 1

Range = 15 - 1 = 14

- From the above examples we can observe that the range is sensitive to outliers :
  the two datasets differ in just one value ( 5 vs. 15 ),
  yet this makes a large difference in the range.



2) VARIANCE :-

Imagine you and your friends have just played several games of basketball and each
scored a different number of points. Now, you want to understand how much each person’s
score differs from the average score. This is where variance comes in!

Variance is a statistical measurement that describes how much the individual numbers in a data
set differ from the average value.
Here’s how it works :

1. Calculate the mean of the data.
2. Calculate each deviation = difference between $x_i$ and the mean --> $(x_i - \bar{x})$
3. Square each deviation --> $(x_i - \bar{x})^2$
4. Finally, use these formulas :-

Population Variance :

$$\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2$$

Sample Variance :

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$$

The result is the variance!

- A high variance means that the scores are spread out and people scored very
  differently from each other.
- A low variance means that everyone’s scores were close to the average ( mean ).



Example : Consider the dataset 68, 79, 38, 68, 35, 70, 61, 47, 58, 66.
(1) Compute the population & sample variance of the dataset.

Solution :

No.      Data ( xi )    xi − x̄     ( xi − x̄ )²
1        68             9          81
2        79             20         400
3        38             -21        441
4        68             9          81
5        35             -24        576
6        70             11         121
7        61             2          4
8        47             -12        144
9        58             -1         1
10       66             7          49
Total    590            0          1898

Mean = 590 / 10 = 59

a) Population Variance = 1898 / 10 = 189.8

b) Sample Variance = 1898 / 9 ≈ 210.89
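The four steps of the variance calculation can be followed line by line in a short sketch. The variable names are illustrative; the standard library functions `statistics.pvariance` and `statistics.variance` should give the same two results.

```python
data = [68, 79, 38, 68, 35, 70, 61, 47, 58, 66]
n = len(data)

# 1. Mean of the data.
mean = sum(data) / n
# 2. Deviation of each value from the mean.
deviations = [x - mean for x in data]
# 3. Squared deviations.
squared_deviations = [d ** 2 for d in deviations]
# 4. Divide the sum of squares by n (population) or n - 1 (sample).
population_variance = sum(squared_deviations) / n
sample_variance = sum(squared_deviations) / (n - 1)

print(mean, sum(squared_deviations))
print(round(population_variance, 2), round(sample_variance, 2))
```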



ADDING & MULTIPLYING THE CONSTANT

If we add a constant ( C ) to each Xi of the data set, then

New Variance = Old Variance

Example : Consider the dataset 68, 79, 38, 68, 35, 70, 61, 47, 58, 66, with C = 5.

Xi       Xi + 5    Xi − 59    ( Xi + 5 ) − 64    (( Xi + 5 ) − 64 )² = ( Xi − 59 )²
68       73        9          9                  81
79       84        20         20                 400
38       43        -21        -21                441
68       73        9          9                  81
35       40        -24        -24                576
70       75        11         11                 121
61       66        2          2                  4
47       52        -12        -12                144
58       63        -1         -1                 1
66       71        7          7                  49
590      640                                     Sum = 1898

Mean ( Xi ) = 59
Mean ( Xi + 5 ) = 59 + 5 = 64

a) Population Variance = 1898 / 10 = 189.8

b) Sample Variance = 1898 / 9 ≈ 210.89

Both values are the same as before adding the constant, so

New Variance = Old Variance



If we multiply each Xi of the data set by a constant ( C ), then

New Variance = Old Variance × C²

Example : The same dataset, with C = 5.

Xi       Xi × 5    Xi − 59    ( Xi × 5 ) − 295    (( Xi × 5 ) − 295 )²
68       340       9          45                  2025
79       395       20         100                 10000
38       190       -21        -105                11025
68       340       9          45                  2025
35       175       -24        -120                14400
70       350       11         55                  3025
61       305       2          10                  100
47       235       -12        -60                 3600
58       290       -1         -5                  25
66       330       7          35                  1225
590      2950                                     Sum = 47450

Mean ( Xi ) = 59
Mean ( Xi × 5 ) = 59 × 5 = 295

a) Old Population Variance = 189.8

b) New Population Variance = 47450 / 10 = 189.8 × 5² = 4745

c) Old Sample Variance = 1898 / 9 ≈ 210.89

d) New Sample Variance = 47450 / 9 = 210.89 × 5² ≈ 5272.2
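Both variance rules can be checked numerically. The sketch below uses `pvariance` (population variance) and `variance` (sample variance) from Python's standard `statistics` module, with C = 5 as in the tables; the variable names are illustrative.

```python
from statistics import pvariance, variance

data = [68, 79, 38, 68, 35, 70, 61, 47, 58, 66]
c = 5

shifted = [x + c for x in data]   # add the constant to every value
scaled = [x * c for x in data]    # multiply every value by the constant

# Adding a constant leaves both variances unchanged.
print(pvariance(data), pvariance(shifted))
# Multiplying by a constant multiplies both variances by c squared.
print(pvariance(scaled), round(variance(scaled), 2))
```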



STANDARD DEVIATION :-

The square root of the sample variance is called the
Sample Standard Deviation.

- Formula :-

$$s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$$

Units of Standard Deviation :-

- If we have a dataset of the weights of 10 students measured in kg, then the
  unit of the variance will be kg² and the unit of the standard deviation will be kg.

ADDING & MULTIPLYING A CONSTANT :

- If we add a constant ( C ) to each Xi of the data set, then

  New Standard Deviation = Old Standard Deviation

- If we multiply each Xi of the data set by a constant ( C ), then

  New Standard Deviation = Old Standard Deviation × |C|



PERCENTILE

The " 100p percentile " ---> way of ranking data .


If a data value is the " 100p percentile "
---> it means that at least 100p percent of the data values are less than or equal to it .

• For example : If a student scores in the 90th percentile on a test ,
it means they scored higher than or equal to 90 % of the students .

At least 100( 1 - p ) percent of the data values are greater than or equal to it
---> This means if we subtract the percentile rank ( in decimal form ) from 1 and
multiply it by 100 , we get the percentage of data that is equal to or more
than our data value .

• For example : If a student is in the 90th percentile then 100( 1 - 0.90 ) = 10 %
So 10 % of students scored the same or higher than this student .

In simple terms , the 100p percentile tells us where a particular value stands w.r.t.
the other data values .
It is just like saying " This value is higher than X % of all values & lower than Y % " .

[ Figure : data split by the percentiles P1 , P2 , P3 , ... , P97 , P98 , P99 , with 1 % of the data in each part ]

We can see P99 is the 99th percentile ,
so 99 % of the data is less than it and 1 % is greater than it



Computing Percentile

1) Arrange the data in increasing order .

[ 68, 38, 66, 79, 61, 47, 68, 35, 70, 58 ]
Increasing order ---> [ 35, 38, 47, 58, 61, 66, 68, 68, 70, 79 ]

2) n ---> number of observations = 10
p ---> percentile in decimal form ex:- 25th percentile ---> 0.25
If ( n × p ) is not an integer then determine the smallest integer greater than np
( n × p ) = ( 10 × 0.25 ) = 2.5 --> not an integer
Smallest integer greater than ( n × p ) is 3
Therefore the 3rd observation of the ordered data set is the 25th percentile --> which is 47

3) If np is an integer then find the average of the values in positions np and np + 1
Example : - [ 35, 38, 47, 58, 61, 66, 68, 68, 70, 79 ]
n = 10 & p = 0.50 ---> because we have to find the 50th percentile
( n × p ) = 10 × 0.50 = 5
Since np is an integer ,
we take the average of the 5th observation ( = 61 ) and the 6th observation ( = 66 )
50th percentile = ( 61 + 66 ) / 2 = 63.5
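
The three steps above can be sketched in Python as follows . This is only an illustration of the rule in these notes ( other textbooks and spreadsheet functions use different interpolation rules ) . The function name `percentile` is hypothetical , and the percentile is passed as a whole number such as 25 instead of 0.25 , an assumption made here so that the test " is np an integer ? " avoids floating-point rounding .

```python
def percentile(values, percent):
    """Return the percent-th percentile of a non-empty list (0 < percent < 100)."""
    if not 0 < percent < 100:
        raise ValueError("percent must be strictly between 0 and 100")
    ordered = sorted(values)                      # step 1: increasing order
    n = len(ordered)
    # n * p written as a whole part plus a remainder out of 100
    position, remainder = divmod(n * percent, 100)
    if remainder == 0:
        # np is an integer: average the values in positions np and np + 1
        return (ordered[position - 1] + ordered[position]) / 2
    # np is not an integer: take the smallest integer greater than np
    return ordered[position]


data = [68, 38, 66, 79, 61, 47, 68, 35, 70, 58]
print(percentile(data, 25))   # 47
print(percentile(data, 50))   # 63.5
```

Positions in the notes are counted from 1 while Python lists are indexed from 0 , which is why position np + 1 becomes `ordered[position]` .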



QUARTILES

• The sample 25th percentile is called the first quartile .
• The sample 50th percentile is called the second quartile & median .
• The sample 75th percentile is called the third quartile .

[ Figure : the entire data from Min to Max ; the 1st Quartile marks 25 % of the data , the 2nd Quartile ( Median ) marks 50 % of the data and the 3rd Quartile marks 75 % of the data ]

These Quartiles break up the data set into four parts



The Five Number Summary
- Minimum
- Q1 : First Quartile or Lower Quartile
- Q2 : Second Quartile or Median
- Q3 : Third Quartile or Upper Quartile
- Maximum

The Interquartile Range ( IQR )

- The Interquartile Range , IQR , is the difference between the 3rd & 1st quartiles .

IQR = Q3 - Q1
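
As an illustration , the five number summary and the IQR can be computed with a short Python sketch . It assumes the percentile rule from the " Computing Percentile " page ( next position up when np is not an integer , average of positions np and np + 1 when it is ) ; the helper names `percentile` and `five_number_summary` are hypothetical , and the data is the dataset used earlier in these notes .

```python
def percentile(values, percent):
    """Percentile by the np / np + 1 rule; percent is a whole number like 25."""
    ordered = sorted(values)
    position, remainder = divmod(len(ordered) * percent, 100)
    if remainder == 0:
        return (ordered[position - 1] + ordered[position]) / 2
    return ordered[position]


def five_number_summary(values):
    """Return Minimum, Q1, Q2 (Median), Q3 and Maximum as a dictionary."""
    return {
        "min": min(values),
        "Q1": percentile(values, 25),
        "Q2": percentile(values, 50),
        "Q3": percentile(values, 75),
        "max": max(values),
    }


data = [68, 38, 66, 79, 61, 47, 68, 35, 70, 58]
summary = five_number_summary(data)
iqr = summary["Q3"] - summary["Q1"]   # IQR = Q3 - Q1

print(summary)
print("IQR =", iqr)
```

For this dataset the sketch gives Q1 = 47 , Q2 = 63.5 , Q3 = 68 and IQR = 21 .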



WEEK-4



CONTINGENCY TABLE

• It shows the distribution of one categorical variable in rows and another in columns , and is
used to study the association between the two variables .

• For example :- suppose you have two categorical variables , gender ( Male / Female ) and
smartphone ownership ( YES / NO ) , recorded for 100 students , and you have to find out
whether ownership of a smartphone is associated with gender . A contingency table
classifies the outcomes of one variable in rows and the other in columns .

                Own a Smart Phone
GENDER          NO    YES    Row Total
Female          10    34     44
Male            14    42     56
Column Total    24    76     100

- The no. of males and females among the 100 students is represented in rows .
- The no. of YES / NO outcomes is represented in columns .

- If all the variables are nominal variables , then the order in which their categories are
represented doesn't matter .
Ex : - YES & NO / Gender ( Male / Female ) .
- But if the variables are ordinal then the order for representing them really matters .
Ex : - High , Medium , Low
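
To make the counting behind a contingency table concrete , here is a minimal Python sketch . The raw records are not given in these notes , so the list `records` is rebuilt from the table counts purely for illustration ; all variable names are hypothetical .

```python
from collections import Counter

# One (gender, owns_smartphone) pair per student, rebuilt from the table counts.
records = (
    [("Female", "NO")] * 10 + [("Female", "YES")] * 34
    + [("Male", "NO")] * 14 + [("Male", "YES")] * 42
)

counts = Counter(records)          # how many times each pair occurs
rows = ["Female", "Male"]          # nominal variable: any order is fine
cols = ["NO", "YES"]

table = {r: {c: counts[(r, c)] for c in cols} for r in rows}
row_totals = {r: sum(table[r].values()) for r in rows}
col_totals = {c: sum(table[r][c] for r in rows) for c in cols}
grand_total = sum(row_totals.values())

for r in rows:
    print(r, table[r], "Row Total =", row_totals[r])
print("Column Total", col_totals, "Total =", grand_total)
```

For an ordinal variable the list `rows` would simply be written in the natural order of the categories , for example High , Medium , Low .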



• Steps to Construct Contingency Table

Step 1 :- Select all data by using ( Ctrl + A )

Step 2 :- Click on Insert & select Pivot Table

Step 3 :- Create Pivot Table



Step 4 :- Add the following aspects in the
Pivot Table using the Pivot Table Editor .

Result : -



• What to do if there are Ordinal Variables in the Data

For example :- suppose one of your two categorical variables is Income with categories
( HIGH , MEDIUM , LOW ) , which is an Ordinal Variable , and you have to find out
whether ownership of a smartphone ( YES / NO ) is associated with the Income of 100 people .
A contingency table classifies the outcomes of one variable in rows and the other in columns .

                Own a Smartphone
INCOME          NO    YES    Row Total
HIGH            2     18     20
MEDIUM          27    39     66
LOW             9     5      14
Column Total    38    62     100

If the Variables are Ordinal then we should maintain the correct order of the
categories ( here HIGH , MEDIUM , LOW ) in the table .



Row Relative Frequencies

As we know , Relative Frequency is the frequency divided by the total number of observations .

Similarly , the Row Relative Frequency is obtained by
dividing each cell frequency in a row by its row total .

Own a Smart Phone


GENDER NO YES Row Total
Female 10 34 44
Male 14 42 56
Column Total 24 76 100

Q. Find the Row Relative Frequency of YES & NO for Female and for Male ,
w.r.t. their row totals .

A. Row totals : Female = 44 , Male = 56 ( total no. of observations = 100 )

Yes ( Female ) = 34 / 44 = 0.7727 ; Yes ( Female ) = 77.27 %

No ( Female ) = 10 / 44 = 0.2273 ; No ( Female ) = 22.73 %

Yes ( Male ) = 42 / 56 = 0.75 ; Yes ( Male ) = 75 %

No ( Male ) = 14 / 56 = 0.25 ; No ( Male ) = 25 %
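
The same division can be written as a short Python sketch . The nested dictionary `table` and the function name `row_relative` are illustrative choices , not something taken from the notes .

```python
table = {
    "Female": {"NO": 10, "YES": 34},
    "Male": {"NO": 14, "YES": 42},
}


def row_relative(table):
    """Divide each cell frequency in a row by that row's total."""
    result = {}
    for row, cells in table.items():
        row_total = sum(cells.values())
        result[row] = {col: count / row_total for col, count in cells.items()}
    return result


row_rf = row_relative(table)
for row, cells in row_rf.items():
    print(row, {col: f"{100 * value:.2f} %" for col, value in cells.items()})
```

Each row of the result adds up to 1 ( i.e. 100 % ) , which is a handy way to check the calculation .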



Column Relative Frequencies

As we know , Relative Frequency is the frequency divided by the total number of observations .

Similarly , the Column Relative Frequency is obtained by
dividing each cell frequency in a column by its column total .

Own a Smart Phone


GENDER NO YES Row Total
Female 10 34 44
Male 14 42 56
Column Total 24 76 100

Q. Find the proportion of female and male participants .

Total no. of observations = 100

Female % = 44 / 100 = 44 %
Male % = 56 / 100 = 56 %

Q. Find the proportion of females and of males among those who said No , and
among those who said Yes ( i.e. w.r.t. the column totals 24 and 76 ) .

No ( Male ) = 14 / 24 = 0.5833 ; No ( Male ) = 58.33 %

No ( Female ) = 10 / 24 = 0.4167 ; No ( Female ) = 41.67 %

Yes ( Female ) = 34 / 76 = 0.4474 ; Yes ( Female ) = 44.74 %

Yes ( Male ) = 42 / 76 = 0.5526 ; Yes ( Male ) = 55.26 %
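
A matching Python sketch for the column version is given below . The only change from the row calculation is that each cell is divided by its column total ; the variable names are again illustrative .

```python
table = {
    "Female": {"NO": 10, "YES": 34},
    "Male": {"NO": 14, "YES": 42},
}
columns = ["NO", "YES"]

# Column totals: add up each column over all rows.
col_totals = {c: sum(cells[c] for cells in table.values()) for c in columns}

# Column relative frequency: cell frequency divided by its column total.
col_rf = {
    row: {c: cells[c] / col_totals[c] for c in columns}
    for row, cells in table.items()
}

for row, cells in col_rf.items():
    print(row, {c: f"{100 * value:.2f} %" for c, value in cells.items()})
```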



Association Between Two Variable

Two categorical variables are associated if knowing information about one variable
gives information about the other variable too .
To check this we use the Row / Column Relative Frequencies .

• If the Row / Column Relative Frequencies are the same for all Rows / Columns then the
two variables are not associated with each other

• If the Row / Column Relative Frequencies differ across Rows / Columns then
the two variables are associated with each other
• Row Relative Freq ( Gender example )

                Own a Smart Phone
GENDER          NO        YES       Row Total
Female          22.73%    77.27%    44
Male            25.00%    75.00%    56
Column Total    24%       76%       100

- You can observe that there is no major difference between the Rows , as the RF in each
Row lies between 22 - 25 % for NO & 75 - 78 % for YES , so the variables aren't associated
with each other .

• Column Relative Freq ( Gender example )

                Own a Smart Phone
GENDER          NO        YES       Row Total
Female          41.67%    44.74%    44%
Male            58.33%    55.26%    56%
Column Total    24        76        100

- You can observe that there is no major difference between the Columns , as the RF in each
Column is close to the overall share ( 42 - 45 % Female & 55 - 58 % Male ) , so the
variables aren't associated with each other .

• Row Relative Freq ( Income example )

                Own a Smartphone
INCOME          NO        YES       Row Total
HIGH            10.00%    90.00%    20
MEDIUM          40.91%    59.09%    66
LOW             64.29%    35.71%    14
Column Total    38.00%    62.00%    100

- You can observe that there is a major difference between the Rows ; they don't lie in the
same range . The variables are associated .

• Column Relative Freq ( Income example )

                Own a Smartphone
INCOME          NO        YES       Row Total
HIGH            5.26%     29.03%    20.00%
MEDIUM          71.05%    62.90%    66.00%
LOW             23.68%    8.06%     14.00%
Column Total    38        62        100

- You can observe that there is a major difference between the Columns ; they don't lie in the
same range . The variables are associated .
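
One rough way to put a number on " same " versus " different " is to compare the share of YES across the rows of each table . The Python sketch below does this for both examples . Reporting the gap between the largest and smallest share is an illustrative choice made here , not a rule from the notes , and no cut-off value is implied .

```python
gender = {
    "Female": {"NO": 10, "YES": 34},
    "Male": {"NO": 14, "YES": 42},
}
income = {
    "HIGH": {"NO": 2, "YES": 18},
    "MEDIUM": {"NO": 27, "YES": 39},
    "LOW": {"NO": 9, "YES": 5},
}


def yes_share_by_row(table):
    """Row relative frequency of YES for every row of the table."""
    return {row: cells["YES"] / (cells["NO"] + cells["YES"])
            for row, cells in table.items()}


def spread(shares):
    """Gap between the largest and the smallest share."""
    return max(shares.values()) - min(shares.values())


gender_spread = spread(yes_share_by_row(gender))
income_spread = spread(yes_share_by_row(income))

print(f"Gender: YES share differs by {100 * gender_spread:.2f} percentage points")
print(f"Income: YES share differs by {100 * income_spread:.2f} percentage points")
```

The gap is small for Gender ( rows look alike , no association ) and large for Income ( rows look very different , association ) .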



STACKED BAR CHART

As we know , a bar chart summarizes the data for the categorical variable
under consideration , with the length of each bar representing the
frequency of occurrence of a particular category .

A Stacked Bar Chart also represents the counts for a category . In
addition , each bar is further broken down into smaller segments ,
each segment representing the frequency of one category of the second
variable within that bar .
It is also referred to as a Segmented Bar Chart .

- Standard Stacked Bar Chart


Own a Smart Phone
GENDER NO YES Row Total
Female 10 34 44
Male 14 42 56
Column Total 24 76 100

[ Figure : Standard Stacked Bar Chart ; each bar is divided into Segments ]

- 100% Stacked Bar Chart

Own a Smart Phone


GENDER NO YES Row Total
Female 22.73% 77.27% 44
Male 25.00% 75.00% 56
Column Total 24% 76% 100


- Standard Stacked Bar Chart

                Own a Smartphone
INCOME          NO    YES    Row Total
HIGH            2     18     20
MEDIUM          27    39     66
LOW             9     5      14
Column Total    38    62     100

- 100% Stacked Bar Chart

                Own a Smartphone
INCOME          NO        YES       Row Total
HIGH            10.00%    90.00%    20
MEDIUM          40.91%    59.09%    66
LOW             64.29%    35.71%    14
Column Total    38.00%    62.00%    100



We can observe from the above Stacked Bar Chart that there is an association
between the variables , because the segments differ a lot from bar to bar .
In the chart below there is no such association , because the data
represented by the two bars is almost the same .



SCATTER PLOT

• Used for looking at the association between two numerical variables . A Scatter Plot is a
graph that displays pairs of values as points on a two-dimensional plane .

- How to decide the Variable's Axis ( X or Y axis ) ?

Y ---> Response Variable ( depends on the Explanatory Variable )
X ---> Explanatory Variable ( Independent Variable )

AGE    HEIGHT    Points ( X , Y )
1      75        ( 1 , 75 )
2      85        ( 2 , 85 )
3      94        ( 3 , 94 )
4      101       ( 4 , 101 )
5      108       ( 5 , 108 )

- Age is on the X - Axis because it is the Independent Variable .
- Height is on the Y - Axis because it is Dependent on Age .



Example 2 : Prices Of Homes
Data for Graph I                            Data for Graph II
SIZE ( 1000 sq feet )   PRICE ( INR )       SIZE ( 1000 sq feet )   PRICE ( INR )
0.8                     68                  0.5                     201
1                       81                  0.6                     69
1.1                     72                  0.9                     122
1.3                     91                  1.1                     133
1.6                     87                  1.3                     207
1.8                     56                  1.4                     71
2.3                     83                  1.5                     149
2.3                     112                 2                       122
2.5                     93                  2.2                     188
2.5                     100                 2.6                     198
2.7                     136                 2.7                     88
3.1                     109                 2                       207
3.1                     122                 3.1                     133
3.2                     159                 2.3                     206
3.4                     170                 3.4                     90

[ Figure : scatter plots of PRICE against SIZE , Graph I and Graph II ]

• In graph ( I ) , from the highlighted points , we can say that the price of houses
increases with size , i.e. the Price variable is associated with the Size variable .

• In graph ( II ) , from the highlighted points , we can say that the Price variable
is not associated with the Size variable , because some houses with small size
have high prices & some houses with large size have low prices .



DESCRIBING ASSOCIATION

When describing the association between variables in a scatter plot ,
there are four key questions that need to be answered

• Direction : Does the pattern of the graph trend up , down , or both ?

• Curvature : Does the pattern appear to be linear or does it curve ?

• Variation : Are the points tightly clustered along the pattern ?

• Outliers : Is there something that falls outside the pattern ?



DESCRIBING ASSOCIATION : DIRECTION

[ Figure : Up-trend & Linear ]

[ Figure : Down-trend & Linear ]

[ Figure : Curved but Down-trend ]

[ Figure : Curved but Up-trend ]

[ Figure : Tightly Clustered points ]

[ Figure : Variable ( loosely scattered ) points ]

• Outliers :- The red-circled points in the figure are outliers because they don't follow the pattern


MEASURE OF ASSOCIATION



COVARIANCE : - It quantifies the strength of the linear association between two
numerical variables .

In simple words , if you have two variables , let's call them X & Y ,
Covariance tells you how changes in X are associated with changes in Y .

Covariance ( +ve ) ---> as X increases , Y also tends to increase .
Covariance ( -ve ) ---> as X increases , Y tends to decrease .

X value nature    Y value nature    Signs ( Similar / Different ) of Deviations
Large             Large             Similar Signs
Small             Small             Similar Signs
Large             Small             Different Signs
Small             Large             Different Signs

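• Formulas :-
Population Covariance = $\dfrac{\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})}{N}$

Sample Covariance = $\dfrac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{n - 1}$

• Unit Of Covariance :- If X is in Kg and Y is in meter ---> Unit = Kg × meter

• Formula for Google Sheet :- COVARIANCE.P(data_y, data_x)
                              COVARIANCE.S(data_y, data_x)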



Example 1

Same nature of values

AGE    HEIGHT    DEVIATION of X    DEVIATION of Y
1      75        -2                -17.6             Small value , Small value -> Same Sign
2      85        -1                -7.6
3      94         0                 1.4
4      101        1                 8.4              Large value , Large value -> Same Sign
5      108        2                15.4
Mean of X = 3 , Mean of Y = 92.6

Population Covariance = ( 35.2 + 7.6 + 0 + 8.4 + 30.8 ) / 5 = 82 / 5 = 16.4 ---> +ve covariance

Sample Covariance = ( 35.2 + 7.6 + 0 + 8.4 + 30.8 ) / ( 5 - 1 ) = 82 / 4 = 20.5 ---> +ve covariance

As we know
Covariance ( +ve ) ---> as X increases , Y also tends to increase .

In the table above , X is increasing and Y is increasing .



Example 2

Different nature of values

AGE    HEIGHT    DEVIATION of X    DEVIATION of Y
1      6         -2                 2                Small value , Large value -> Different Sign
2      5         -1                 1
3      4          0                 0
4      3          1                -1                Large value , Small value -> Different Sign
5      2          2                -2
Mean of X = 3 , Mean of Y = 4

Population Covariance = ( -4 - 1 + 0 - 1 - 4 ) / 5 = -10 / 5 = -2 ---> -ve covariance

Sample Covariance = ( -4 - 1 + 0 - 1 - 4 ) / ( 5 - 1 ) = -10 / 4 = -2.5 ---> -ve covariance

As we know
Covariance ( -ve ) ---> as X increases , Y tends to decrease .

In the table above , X is increasing and Y is decreasing .
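
Both examples can be reproduced with a few lines of Python . This is a minimal sketch of the population and sample covariance formulas ; the function name `covariance` and its `sample` flag are illustrative , they are not the Google Sheets functions .

```python
def covariance(xs, ys, sample=False):
    """Population covariance by default; sample covariance if sample=True."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of the products of the deviations of X and Y
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (n - 1 if sample else n)


# Example 1: both variables increase together
age = [1, 2, 3, 4, 5]
height = [75, 85, 94, 101, 108]
print(covariance(age, height), covariance(age, height, sample=True))

# Example 2: one variable increases while the other decreases
x = [1, 2, 3, 4, 5]
y = [6, 5, 4, 3, 2]
print(covariance(x, y), covariance(x, y, sample=True))
```

Up to floating-point rounding , the first line should show 16.4 and 20.5 , and the second -2 and -2.5 .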



CORRELATION :- It is represented by ' r ' . Correlation is the easiest to understand measure of
linear association between two numerical variables .

- The value of correlation ' r ' ranges from -1 to +1 .
- Correlation ---> +ve & close to +1 ---> strong association . X & Y both increase
- Correlation ---> -ve & close to -1 ---> strong association . X increases , Y decreases
- Correlation ---> close to 0 --> weak association --> no linear association .

Formula :- $r = \dfrac{\mathrm{Cov}(x, y)}{s_x \, s_y} = \dfrac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum (X_i - \bar{X})^2 \, \sum (Y_i - \bar{Y})^2}}$

Cov( x , y ) ---> Covariance of x & y

Where $s_x$ & $s_y$ = Standard Deviation of x & y

[ Figure : two scatter plots of y against x , one for r near -1 and one for r near +1 ]



Example : 1

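AGE    HEIGHT    DEVIATION of X    ( DEV. of X )²    DEVIATION of Y    ( DEV. of Y )²    PRODUCT of DEVIATIONS
1      75        -2                4                 -17.6             309.76            35.2
2      85        -1                1                 -7.6              57.76             7.6
3      94         0                0                  1.4              1.96              0
4      101        1                1                  8.4              70.56             8.4
5      108        2                4                 15.4              237.16            30.8
Mean of X = 3 , Mean of Y = 92.6
Sum of ( DEV. of X )² = 10 , Sum of ( DEV. of Y )² = 677.2 , Sum of products = 82

Sx = √( 10 / 4 ) = 1.58

Sy = √( 677.2 / 4 ) = 13.01

r = Cov( x , y ) / ( Sx × Sy ) = 20.5 / ( √( 10 / 4 ) × √( 677.2 / 4 ) ) = 82 / √( 10 × 677.2 ) = 0.9964

--> Correlation ( r ) ---> +ve & close to +1 ---> strong association .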



Example : 2

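X    Y    DEVIATION of X    ( DEV. of X )²    DEVIATION of Y    ( DEV. of Y )²    PRODUCT of DEVIATIONS
1    6    -2                4                  2                4                 -4
2    5    -1                1                  1                1                 -1
3    4     0                0                  0                0                  0
4    3     1                1                 -1                1                 -1
5    2     2                4                 -2                4                 -4
Mean of X = 3 , Mean of Y = 4
Sum of ( DEV. of X )² = 10 , Sum of ( DEV. of Y )² = 10 , Sum of products = -10

Sx = √( 10 / 4 ) = 1.58

Sy = √( 10 / 4 ) = 1.58

r = Cov( x , y ) / ( Sx × Sy ) = -2.5 / ( √( 10 / 4 ) × √( 10 / 4 ) ) = -10 / √( 10 × 10 ) = -1

--> Correlation ( r ) ---> -ve & equal to -1 ---> strong association .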
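
The two correlation examples can be verified with the Python sketch below . It uses the sum form of the formula , in which the ( n - 1 ) terms of the sample covariance and the sample standard deviations cancel ; the function name `correlation` is illustrative .

```python
import math


def correlation(xs, ys):
    """Correlation r = sum of products / sqrt(sum of squares of X * sum of squares of Y)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


age = [1, 2, 3, 4, 5]
height = [75, 85, 94, 101, 108]
print(round(correlation(age, height), 4))   # 0.9964

x = [1, 2, 3, 4, 5]
y = [6, 5, 4, 3, 2]
print(correlation(x, y))                    # -1.0
```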



FITTING OF LINE

R² ---> Goodness of Fit Measure

Where ,
0 <= R² <= 1

In the above graph ;
Equation of the line : Height = 8.2 ( Age ) + 68

Y = 8.2X + 68 ---> Slope = 8.2 ( so slope is +ve )

Goodness of Fit Measure = R² = 0.993

Also , R² = r² , where r is the Correlation , so r = ± √R²

So , r = √R² = √0.993 = 0.996 --> check example 1 of correlation ( r = 0.9964 ) .

As the slope is positive , the value of r is also +ve



R² ---> Goodness of Fit Measure
Where ,
0 <= R² <= 1

In the above graph ;
Equation of the line : Height = -1 ( Age ) + 7

Y = -1X + 7 ---> Slope = -1 ( so slope is -ve )

Goodness of Fit Measure = R² = 1

Also , R² = r² , where r is the Correlation , so r = ± √R²

So , the magnitude of r = √R² = √1 = 1

As the slope of the graph is -ve , r will also be -ve

r = -1 --> check example 2 of correlation .
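
The line equations and R² values quoted above can be reproduced with the Python sketch below . It assumes the fitted line is the ordinary least-squares line ( the usual spreadsheet trendline ) , which these notes do not state explicitly ; the function name `fit_line` is illustrative .

```python
import math


def fit_line(xs, ys):
    """Least-squares line y = slope * x + intercept, plus R squared."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # R squared = 1 - (unexplained variation / total variation)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return slope, intercept, 1 - ss_res / ss_tot


slope1, intercept1, r2_1 = fit_line([1, 2, 3, 4, 5], [75, 85, 94, 101, 108])
slope2, intercept2, r2_2 = fit_line([1, 2, 3, 4, 5], [6, 5, 4, 3, 2])

# r has the magnitude sqrt(R squared) and the sign of the slope
r1 = math.copysign(math.sqrt(r2_1), slope1)
r2 = math.copysign(math.sqrt(r2_2), slope2)

print(slope1, intercept1, round(r2_1, 3), round(r1, 4))
print(slope2, intercept2, r2_2, r2)
```

Up to rounding , the first dataset gives slope 8.2 , intercept 68 and R² = 0.993 , and the second gives slope -1 , intercept 7 , R² = 1 and r = -1 .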



Association Between a Categorical Variable & Numerical Variable

• The categorical variable has two different categories ( Dichotomous )

Point Bi-Serial Correlation Coefficient

X ---> Numerical Variable
Y ---> Categorical Variable ( Dichotomous variable )

• As Y is a Dichotomous Variable ,
code one category as Y = 0 & the other as Y = 1 ,
just for grouping them
• Calculate the mean value of X for these two categories separately

If category is coded as Y = 0 ---> mean = Y0
If category is coded as Y = 1 ---> mean = Y1

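• Find P0 & P1 ( the proportion of observations in each category )

P0 = ( no. of observations coded Y = 0 ) / ( total no. of observations )

P1 = ( no. of observations coded Y = 1 ) / ( total no. of observations )

• Find Sx ( Standard Deviation ) of the X variable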

• Formula : -
rpb =
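
The steps above can be sketched in Python . Since the formula itself is missing from these notes , the sketch assumes one common textbook form , r_pb = ( ( mean of group 1 - mean of group 0 ) / Sx ) × √( P0 × P1 ) , with Sx taken as the population standard deviation ( divide by n ) . Some texts subtract the means in the opposite order , which only flips the sign . The data ( scores of six students and a 0 / 1 pass flag ) is made up purely for illustration .

```python
import math


def point_biserial(xs, ys):
    """Point bi-serial correlation for numeric xs and 0/1 coded ys (assumed formula)."""
    n = len(xs)
    group0 = [x for x, y in zip(xs, ys) if y == 0]
    group1 = [x for x, y in zip(xs, ys) if y == 1]
    mean0 = sum(group0) / len(group0)        # mean of X where Y = 0
    mean1 = sum(group1) / len(group1)        # mean of X where Y = 1
    p0 = len(group0) / n                     # proportion coded 0
    p1 = len(group1) / n                     # proportion coded 1
    mean_x = sum(xs) / n
    # Assumption: population standard deviation of X (divide by n)
    s_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    return (mean1 - mean0) / s_x * math.sqrt(p0 * p1)


# Hypothetical data, not from the notes
scores = [40, 50, 60, 70, 80, 90]
passed = [0, 0, 0, 1, 1, 1]

r_pb = point_biserial(scores, passed)
print(round(r_pb, 4))
```

Under these assumptions the result equals the ordinary correlation r between X and the 0 / 1 coded Y , which is a useful cross-check .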
