Week 1-4 Statistics Notes

Qualifier
WEEK 1-4 NOTES

STATISTICS
INDEX
Week -1
Week - 2
Week - 3
Week - 4
By :- Rushikesh Chavan
WEEK-1
INTRO & TYPE OF DATA - BASIC DEFINITION
• What is Statistics ?
---> Statistics is the art of learning from data. It is concerned with the
collection of data, their subsequent description, and their analysis,
which leads to the drawing of conclusions.
- The part of statistics concerned with the description and summarization of
data is called Descriptive Statistics.
- The part of statistics concerned with the drawing of conclusions from data
is called Inferential Statistics.
- Descriptive Statistics doesn't necessarily seek to explain why something is
happening or make predictions about the future. Instead, it focuses on
presenting facts and characteristics as they are.
- To be able to draw a conclusion from the data, we must take into account
the possibility of chance, i.e. Probability.

POPULATION AND SAMPLE
Population :- The total collection of all the elements that we are interested in is
called the population.
Sample :- A subgroup of the population that will be studied in detail is
called a sample.
Example :- A bag full of colourful balls, where we as observers want to look only at
the red balls. So -->
the bag of colourful balls is the Population
and the red balls are the Sample.

PURPOSE OF STATISTICAL ANALYSIS
- Note that if the purpose of the analysis is to examine and explore information
for its own intrinsic interest only, then the study is Descriptive.
- If the information is obtained from a sample of a population and the purpose
of the study is to use that information to draw conclusions / predictions about
the population, then the study is Inferential.
- A descriptive study may be performed either on a sample or on a population.
- When an inference or prediction is made about the population based on
information obtained from the sample, the study becomes inferential.

WHAT IS DATA ?
In order to learn something we need to collect data.
So Data are the facts and figures which are collected, analysed and summarized for
presentation and interpretation (understanding).
- Statistics relies on data, the information that is around us.
- Why do we collect Data ?
The primary reason for collecting data is that we are interested in knowing about the
characteristics of groups; these could be groups of people,
places, things, or events.
For Ex :- YouTube
wants to know which videos people loved the most.
That's why it added the like / dislike buttons in its app,
so it is simply collecting data.
- Data Collection :-
- Data Available :- Published Data ( https://data.gov.in/ )
- Data not Available :- need to generate or collect it
- Unstructured and Structured data :-
- Unstructured Data :-
Example : - Social Media Posts

Description :- Unstructured data refers to information that does not have a
specified, predefined format. Social media posts, such as
tweets or posts on FB & Instagram, are prime examples of
unstructured data. These posts can contain a mix of text, images,
videos, hashtags, mentions & emojis.
- Structured Data :-
Example :- Excel Spreadsheet

Description :- Structured data, on the other hand, has a well-defined and
organized format. An Excel spreadsheet is an excellent example
of structured data. Each row and column in the spreadsheet has
a specific purpose and follows a consistent data model. This
makes it easy to perform operations like sorting, filtering
& aggregating data, as well as conducting structured analysis.

VARIABLES & CASES
Case :- A unit (ex :- a student) from which the data is collected.

Variable :- A characteristic or attribute that varies across units (ex :- a student's height).
Note :- The student data is represented in tabular form.

For organizing data in tabular form these two points should be considered :
- Rows represent cases :- For each case, the same attributes are recorded.
- Columns represent variables :- For each variable, the same type of value is
recorded for each case.
(Rows -> Cases ; ex :- Anjali is a case)
(Columns -> Variables ; ex :- Date of birth is a variable)

CLASSIFICATION OF DATA
Categorical Data :- Also called a Qualitative variable.

It identifies group membership;
we cannot perform any meaningful mathematical operation on it.
Ex :- In the students data set, Gender is a categorical variable because it has two categories,
(M) Male and (F) Female, and every student can be classified into one of them.
Numerical Data :- Also called a Quantitative variable.

It describes a numerical property of the cases (ex :- a student's height).
Ex :- In the students data set, Marks is numerical data.
Measurement Unit :- The scale defines the meaning of numerical data, such as weights measured
in kilograms, prices in rupees, heights in cm, etc.
Note :- When values of a numerical variable (ex :- height) are entered, the unit must be
the same for every case.

CROSS - SECTIONAL & TIME SERIES DATA
• Time Series :- If the data is recorded over a period of time, then it is called
time-series data.
• Time-Plot :- A graph of a time series showing the values in chronological order is
known as a time-plot.
Ex :- The data collected to observe the temperature in Delhi on seven different
days is time-series data, because the data is recorded for only one place (i.e. Delhi)
and it is recorded over a period of time (i.e. seven different days).
• Cross-Sectional Data :- If the data is observed for several cases at the same point in time,
then it is called cross-sectional data.
Ex :- The data collected to observe the temperature of Delhi, Chennai, Jaipur and
Bhopal on a particular day is cross-sectional data, because the data is recorded at
the same time and it is observed for several places.

SCALES OF MEASUREMENT
We have four scales of measurement, called the nominal, ordinal, interval and
ratio scales. Every variable we collect data on is measured on one of these scales.
• Nominal Scale of Measurement :-
When the data for a variable consist of labels or names used to identify the
characteristic of an observation, that scale of measurement is considered a
nominal scale.
Example: Name , Board , Gender, Blood group etc.
- These nominal variable can be numerically coded

Like (M) - male as & (F) - female as
- A nominal scale consists only of categories or labels, which do not have any order.
Ordinal Scale of Measurement :- When data exhibits the properties of nominal data and the
order or rank of the data is meaningful, then the scale of
measurement is considered an ordinal scale.
Ex :- When a customer visits Amazon and buys something, the customer gives a
rating for product quality, i.e. one star to five stars / poor, good, excellent.
This data has the properties of nominal data, and it can also be ranked
w.r.t. product quality.
Note :-
- We can code an ordinal scale of measurement, as
- bad can be coded as ,
- good can be coded as
- excellent can be coded as .
- There is an order in , , .
Interval Scale of measurement :- If the data has all the properties of ordinal data (i.e.
there is an order in the data) and the interval between values
is expressed in a fixed unit of measurement,
then it is on an Interval Scale of measurement.
Note :- Data with an interval scale of measurement is always numeric, and we can find
the difference between any two values.
Ex :- If a room has AC & that room has temperature °C,
temperature outside the room is °C.
It is correct to say that the difference in temperature is °C,
but it is incorrect to say that the outdoor is twice as hot as indoor.

.
Ratio scale of measurement :- If the data has all the properties of interval data and
the ratio of two values is meaningful, then the scale of
measurement is a ratio scale.
Example: Height (in cm), Weight (in kg), Marks, etc. Data such as
height, weight and marks can be added, subtracted, multiplied or divided,
because they all have an absolute zero.

WEEK-2

FREQUENCY DISTRIBUTION
Definition :-
- A frequency distribution of qualitative data is a way to organize and
display data by listing the distinct values or categories and how many times each
value or category appears in a dataset.
- Each row of a frequency table lists a category along with the Number of cases
in the category .
Let's use a simple example to illustrate this concept :-

Imagine you have a survey where you asked people about their favorite colors, and
you recorded their responses. Here are the responses you received from people:
Responses :
1. Red
2. Blue
3. Green
4. Red
5. Red
6. Blue
7. Yellow
8. Green
9. Red
10. Purple

COLOURS        FREQUENCY
Blue           2
Green          2
Purple         1
Red            4
Yellow         1
Grand Total    10
(ex :- the category is Blue and the number of cases in it is 2)
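The counting behind the frequency table above can be sketched in a few lines of Python. This is only an illustration: the names `responses` and `frequency_table` are made up for this example, and the notes themselves build the table with a Google Sheets pivot table.

```python
from collections import Counter

# The ten survey responses listed in the notes.
responses = ["Red", "Blue", "Green", "Red", "Red",
             "Blue", "Yellow", "Green", "Red", "Purple"]

def frequency_table(values):
    """Return {category: number of cases}, sorted by category name."""
    counts = Counter(values)
    return dict(sorted(counts.items()))

table = frequency_table(responses)
for colour, freq in table.items():
    print(f"{colour:<12}{freq}")
print(f"{'Grand Total':<12}{sum(table.values())}")
```

Each key is a distinct category and each value is the number of cases in that category, which is exactly one row of the frequency table.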

• STEPS TO DISPLAY THE FREQUENCY TABLE IN GOOGLE SHEETS
• Select / highlight the cells having the
data you want to summarize.
• In the menu bar click on the Insert
option & go to the Pivot Table option.
• You can create the Pivot Table
in either a new sheet or an existing sheet.
• After creating the Pivot Table go to the editor :
• 1st --> Add the rows
• 2nd --> Add the values
• FINAL RESULT OF FREQUENCY TABLE :-

RELATIVE FREQUENCY
⚫ Definition :-
The ratio of the frequency to the total number of observations
is called the Relative Frequency.
Steps to construct a relative frequency distribution :-
Step 1 :- Obtain a frequency distribution of the data.
Step 2 :- Divide each frequency by the total number of observations.

COLOURS        FREQUENCY    Relative Frequency
Blue           2            0.2    ( Freq / Grand Total ) = 2 / 10 = 0.2
Green          2            0.2
Purple         1            0.1
Red            4            0.4
Yellow         1            0.1
Grand Total    10           1
The sum of the relative frequencies is always 1.
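As a rough sketch of the two steps above (the function name `relative_frequencies` is an assumption chosen for this example), each frequency is divided by the grand total:

```python
# Frequency distribution from the colour example in the notes.
frequencies = {"Blue": 2, "Green": 2, "Purple": 1, "Red": 4, "Yellow": 1}

def relative_frequencies(freq):
    """Divide each frequency by the total number of observations."""
    total = sum(freq.values())
    return {category: count / total for category, count in freq.items()}

rel = relative_frequencies(frequencies)
for category, value in rel.items():
    print(f"{category:<8}{value}")
print("Sum:", sum(rel.values()))
```

Multiplying each relative frequency by 100 gives the percentage share of that category.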

STEPS TO DISPLAY CHARTS IN GOOGLE SHEETS
• Select / highlight the cells having the data
you want to visualize in chart format.
• In the menu bar click on the Insert
option & go to the Chart option.
• In the chart type dropdown menu you can select
the type of chart, for ex :- Pie Chart.
Some Examples of Charts :-

DESCRIBING OF CATEGORICAL DATA
Charts of categorical data :-
- The two most common displays of a categorical variable are the bar chart & the pie chart.
- Both describe a categorical variable by displaying its frequency table.
PIE CHARTS :-
A Pie Chart is a circle divided into pieces or wedges proportional to the
relative frequencies of the qualitative data.
The pie chart is based on this frequency distribution table :

COLOURS        FREQUENCY    Relative Frequency
Blue           2            0.2
Green          2            0.2
Purple         1            0.1
Red            4            0.4
Yellow         1            0.1
Grand Total    10           1
Note : to find the angle (in degrees) of each wedge use Degree = relative frequency × 360
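To illustrate the wedge formula, here is a small sketch; `wedge_angles` is a hypothetical helper name. It computes frequency × 360 / total, which is the same quantity as relative frequency × 360.

```python
frequencies = {"Blue": 2, "Green": 2, "Purple": 1, "Red": 4, "Yellow": 1}

def wedge_angles(freq):
    """Angle of each pie wedge in degrees: relative frequency x 360."""
    total = sum(freq.values())
    return {category: count * 360 / total for category, count in freq.items()}

angles = wedge_angles(frequencies)
for category, degrees in angles.items():
    print(f"{category:<8}{degrees} degrees")
```

The angles always add up to 360 degrees, just as the relative frequencies add up to 1.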

.
Bar Chart :-
A bar chart displays the
• Distinct values of the qualitative data :- on the Horizontal Axis
• Frequency or Relative Frequency ( count / % ) :- on the Vertical Axis
The frequency or relative frequency of each distinct value is represented by a
vertical bar whose height is equal to the frequency / R.F. of that distinct value.
- The bars should be positioned in such a way that they do not touch
each other.
The bar chart is based on this frequency distribution table :

COLOURS        FREQUENCY    Relative Frequency
Blue           2            0.2
Green          2            0.2
Purple         1            0.1
Red            4            0.4
Yellow         1            0.1
Grand Total    10           1
Pareto Charts :-
When the categories in a bar chart are sorted by frequency
( i.e. greater frequency to lower frequency / lower to greater ),
the bar chart is known as a Pareto Chart.
• If the categorical variable is ordinal, then the bar chart must
preserve the ordering of the categories.

BEST PRACTICES WHILE GRAPHING
First - Know your purpose for every table / graph you create
- Choose the table / graph that serves the purpose.
- If the purpose is just to count the data and represent it
as a table, go for the Frequency Table.
- If you want to compare the values of two or
more different categories, go for a Bar Chart.
- If the purpose is to know the share of each category (ex :- each state) out of
100 %, go for a Pie Chart.
Labels in Chart :-
Label or annotate your data, because a labelled or annotated chart gives a better
visualization and communicates the idea better.
Suppose a Bar Chart consists of too many categories & a large number of those
categories have small counts. We should not exclude them, because that would
cause a loss of data; instead we should combine them into one more category named
"Other", so that no data is lost.
Note : -
- Pie chart is best suited when we want to emphasize the proportion
rather than the frequency of the categories. Each slice in a pie
chart must have a distinct colour .
- Some examples where Bar Charts can be used :-
- An e-commerce company wants to know how many sources generated
income above a certain value .
- An e-commerce company is interested in knowing amount of transaction
by different payment modes.
- Also all the bars in a bar chart can have the same colour

MISLEADING GRAPHS
The Area Principle :-
The area principle says that the area occupied by a part of the graph should be
proportional to the amount of data it represents.
[Figure: ( 1 ) Improper Scaling ; ( 2 ) Regular Graph ( Proper Scaling )]
• In the improperly scaled pictogram bar graph ( 1 ), the image for B is actually 9 times
as large as the image for A, so this graph doesn't obey the area principle.
• The regular graph ( 2 ) obeys the area principle.
• If a graph doesn't obey the area principle then it is a Misleading Graph.

Truncated Graphs :- Another common violation is when the baseline of
a bar chart is not at zero.
[Figure: Truncated Bar Graph ; Regular Bar Graph]
Indicating A Y-axis Break :-
ROUND-OFF ERROR
Category    Percentage    Round-Off
A           22.3          22
B           35.6          36
C           12.6          13
D           11            11
E           18.5          19
Total       100 %         101
It is important to check for round-off errors. Round-off errors occur when table
entries are percentages or proportions that have been rounded: the total may then differ
slightly from 100 % or 1. This can result in a misleading pie chart.
In the table, the total of the original percentages is 100 %. Suppose we round off the
values (halves rounded up) and draw a pie chart from the rounded values.
That pie chart has a round-off error, because the total of the rounded entries is 101 %,
which is different from 100 %.

MODE & MEDIAN
MODE :- The Mode of a categorical variable is the most common category,
i.e. the category with the highest frequency.
The mode corresponds to :- (a) the longest bar in a bar chart.
(b) the widest slice in a pie chart.
(c) the first category shown in a Pareto chart ( when sorted from greater to lower frequency ).
Basic Ex :-

COLOURS        FREQUENCY    Relative Frequency
Blue           2            0.2
Green          2            0.2
Purple         1            0.1
Red            4            0.4
Yellow         1            0.1
Grand Total    10           1
In the bar chart of this table Red has the longest bar.
Thus, the mode of the dataset is the category " Red ".

For the pie-chart representation of the same dataset :-
Red has the widest slice. Thus the mode of the dataset is again the category " Red ".

BIMODAL & MULTIMODAL Data : -
If two or more categories share the same highest frequency, then the data
is called Bimodal (in the case of two) or Multimodal (more than two).

COLOURS        FREQUENCY    Relative Frequency
Blue           4            0.3333
Green          2            0.1667
Purple         1            0.0833
Red            4            0.3333
Yellow         1            0.0833
Grand Total    12           1
In the bar chart of this table, both categories " Red " & " Blue " have the highest frequency,
so the data is bimodal.

MEDIAN :-
• The median of an ordinal variable is the middle value when all the values
are arranged in order.
• If there are an even number of observations, then we can choose the

category on either side of the middle of the sorted list as the median.
• Imagine you have a list of ordinal data representing the education levels of
a group of people: "High School," "Associate's Degree," "Bachelor's Degree,"
"Master's Degree," and "Ph.D."
• If you arrange these education levels in order from the least to the most
advanced, it would look like this:
1. High School
2. Associate's Degree
3. Bachelor's Degree
4. Master's Degree
5. Ph.D.
The median, in this case, would be the education level that falls right in the
middle when you have an odd number of observations. i.e. " Bachelor's Degree ".
Note: Median can be defined only for ordinal data whereas mode can be defined
for both nominal as well as ordinal data.

WEEK-3

DESCRIBING THE NUMERICAL DATA
• Organizing Numerical Data :-
( VARIABLE TYPE )
DISCRETE VARIABLE :- A discrete variable usually involves a count of something.
Ex :- Number of spelling mistakes.
Number of accidents in a month in a particular city, etc.
CONTINUOUS VARIABLE :- A continuous variable involves a measurement of
something.
Ex :- Weight of a person.
Height of a person, etc.
First group the observations into classes ( categories ) and then treat the classes as
the distinct values of quantitative data.
Now we can construct frequency and relative frequency ( R.F. ) distributions of the data.

ORGANIZING DISCRETE DATA ( Single value )
If the data set contains only a relatively small number of different values then it is
convenient to represent it in a frequency table .
Each class represents a distinct value along with its frequency of occurrence .
For Example :- The dataset reports which colour each person likes in a
group of 12 people.

COLOURS        FREQUENCY OF OCCURRENCE    RELATIVE FREQUENCY
Blue           4                          0.3333
Green          2                          0.1667
Purple         1                          0.0833
Red            4                          0.3333
Yellow         1                          0.0833
Grand Total    12                         1

ORGANIZING CONTINUOUS DATA
- Organize the data into a number of classes to make the data understandable .
- But there are few guidelines that need to be followed :-
▪ Number of classes :- The appropriate number is a subjective choice ,

The rule of thumb is to have 5-20 classes .
▪ Each observation should belong to some class & no observation should belong
to more than one class .
▪ It is common but not essential to choose class intervals of equal length .

Some New Terms :-
• Lower Class Limit : The smallest value that could go in a class.
• Upper Class Limit : The largest value that could go in the class.
1-10 ---> so 1 is the lower class limit & 10 is the upper class limit
• Class Width : The difference between the lower limit of a class
& the lower limit of the next-higher class.
• Class Mark : The average of the two class limits of a class.
• A class interval contains its left end point but not its right end point.
Suppose there is an interval 30 - 40 :
[ 30 - 40 ) i.e. 30 is included in the interval but 40 is not included,
as 40 is the lower limit of the next-higher class.
For example :- Organizing continuous data on the weights of bags in a class.

Class Interval    Frequency
1 - 10            7
10 - 20           10
20 - 30           9
30 - 40           14
40 - 50           14
50 - 60           11
60 - 70           6
70 - 80           13
80 - 90           7
90 - 100          9
Note :- A graph of such a frequency distribution, with one bar per class interval, is known as a Histogram.
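The class-interval convention above (left end included, right end excluded) can be sketched as follows. The raw weights in this example are made up for illustration, since the notes give only the finished table, and `class_frequencies` is a hypothetical helper name.

```python
def class_frequencies(values, start, width, n_classes):
    """Count the values falling in each class [lower, upper).

    The left end of a class is included and the right end is excluded,
    so a value of 40 falls in 40 - 50, not in 30 - 40.
    Values outside all the classes are ignored.
    """
    counts = [0] * n_classes
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < n_classes:
            counts[index] += 1
    return [(start + i * width, start + (i + 1) * width, counts[i])
            for i in range(n_classes)]

# Hypothetical raw weights, not taken from the notes.
weights = [30, 34.5, 39.9, 40, 47, 52, 59.5, 60]
table = class_frequencies(weights, start=30, width=10, n_classes=4)
for lower, upper, freq in table:
    print(f"{lower} - {upper}: {freq}")
```

Because every value lands in exactly one class, the guideline that no observation should belong to more than one class is satisfied automatically.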

STEM-AND-LEAF DIAGRAM
The Stem-and-Leaf Diagram is also known as a stem-plot.
In this graph each observation is separated into two parts :-
1st part :- Stem - it consists of all but the rightmost digit.
2nd part :- Leaf - the rightmost digit.
Example :-
Note :- If the data are all two-digit numbers, then we could let the stem
of a data value be the tens digit and the leaf be the ones digit.
So, 75 is represented as follows :-
Stem | Leaf
7    | 5
If there are two values 75, 78 :-
Stem | Leaf
7    | 5 8

Steps to construct a Stem-plot :-
- Split each observation into
Stem - all but the rightmost digit.
Leaf - the rightmost digit.
- Write the stems from smallest to largest in a vertical column
to the left of a vertical rule.
- Write each leaf to the right of the vertical rule,
in the row which contains the appropriate stem.
- Arrange the leaves in each row in ascending order.
- Example :-
Draw a stem-and-leaf plot for the dataset 15, 22, 29, 36, 31, 23, 45,
10, 25, 28, 48 which are the ages of 11 patients admitted to a certain
hospital.
Ages of patients in increasing order : 10, 15, 22, 23, 25, 28, 29, 31, 36, 45, 48
The stem-and-leaf plot for the above dataset is :
Stem | Leaf
1    | 0 5          ----> 10 15
2    | 2 3 5 8 9    ----> 22 23 25 28 29
3    | 1 6          ----> 31 36
4    | 5 8          ----> 45 48
You can observe how the leaves are arranged in ascending order.
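The construction steps can be sketched in Python as below. The sketch assumes non-negative whole numbers, and `stem_and_leaf` is a name chosen for this example.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (all but the last digit) to its leaves (last digits).

    Sorting the values first puts the leaves of every stem in
    ascending order. Assumes non-negative integers.
    """
    plot = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)
        plot[stem].append(leaf)
    return dict(plot)

ages = [15, 22, 29, 36, 31, 23, 45, 10, 25, 28, 48]
plot = stem_and_leaf(ages)
for stem in sorted(plot):
    print(stem, "|", " ".join(str(leaf) for leaf in plot[stem]))
```

Here `divmod(v, 10)` splits a number such as 75 into the stem 7 and the leaf 5 in one step.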

DESCRIPTIVE MEASURES
Descriptive measures are quantities whose values are determined
by the data and can be used to summarize a data set.
• Types of Descriptive Measures :-
1. Measures of Central Tendency :-
These are the measures that indicate the most typical
value or centre of a data set.
2. Measures of Dispersion :-
These measures indicate the variability or spread of a
dataset.
• What are Outliers :-
An outlier is an observation that lies far away from most of the other values in the dataset.
As an example, consider a dataset of the ages of a group of people:
[21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 120]. If we look closely,
most of the ages are between 21 and 30 years. The age '120' is an
outlier.

1. MEASURES OF CENTRAL TENDENCY
A. MEAN :-
The most commonly used measure of central tendency is the mean.
The mean of a data set is the sum of the observations divided by the
number of observations.
- The mean is usually referred to as the Arithmetic Average.
- Formulas for Different Types of Observations :-
1) Sample Mean : $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
2) Population Mean : $\mu = \frac{\sum_{i=1}^{N} x_i}{N}$
3) Mean for Grouped Data :
--- for Discrete single value data : $\bar{x} = \frac{\sum_i f_i x_i}{n}$, where $n = \sum_i f_i$
--- for Continuous data : $\bar{x} = \frac{\sum_i f_i m_i}{n}$, where $m_i$ is the mid-point of class $i$ and $n = \sum_i f_i$
Note :- The Mean is sensitive to outliers

Understanding Formulas With Example
1) Sample Mean :
Example :- 2 , 12 , 5 , 7 , 6 , 7 , 3 ---> find the sample mean of this data
n = 7 ---> as there are 7 observations
$\bar{x} = \frac{2 + 12 + 5 + 7 + 6 + 7 + 3}{7} = \frac{42}{7} = 6$

2) Mean for Grouped Data ( Discrete single value data ) :
f ---> is the frequency
Ex :- Suppose we ask a group of people which number they like from
0 to 9
Result -> 2 , 1 , 3 , 4 , 5 , 2 , 3 , 3 , 3 , 4 , 4 , 1 , 2 , 3 , 4 ---> n = 15

Value ( $x_i$ )    Frequency ( $f_i$ )    $f_i x_i$
1                  2                      2
2                  3                      6
3                  5                      15
4                  4                      16
5                  1                      5
Grand Total        n = 15                 Sum = 44
$\bar{x} = \frac{44}{15} \approx 2.93$

3) Mean for Grouped Data ( Continuous data ) :
f ---> is the frequency.
m ---> is the mid-point of the class interval.

Class Interval    Frequency ( $f_i$ )    Mid-Point ( $m_i$ )    $f_i m_i$
30 - 40           14                     35                     490
40 - 50           14                     45                     630
50 - 60           11                     55                     605
60 - 70           6                      65                     390
70 - 80           13                     75                     975
80 - 90           7                      85                     595
90 - 100          9                      95                     855
TOTAL             74                                            4540
$\bar{x} = \frac{4540}{74} \approx 61.35$
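Both grouped-data means follow the same pattern, sum of (frequency × value) divided by the total frequency, so one small function covers them. This is a minimal sketch; `grouped_mean` is a name invented for this example.

```python
def grouped_mean(values, frequencies):
    """Mean of grouped data: sum(f_i * x_i) / sum(f_i)."""
    n = sum(frequencies)
    return sum(f * x for x, f in zip(values, frequencies)) / n

# Discrete single-value data from the notes: values 1..5 and their frequencies.
discrete_mean = grouped_mean([1, 2, 3, 4, 5], [2, 3, 5, 4, 1])

# Continuous data from the notes: use the mid-point of each class interval.
classes = [(30, 40), (40, 50), (50, 60), (60, 70), (70, 80), (80, 90), (90, 100)]
freqs = [14, 14, 11, 6, 13, 7, 9]
midpoints = [(lower + upper) / 2 for lower, upper in classes]
continuous_mean = grouped_mean(midpoints, freqs)

print(round(discrete_mean, 2), round(continuous_mean, 2))
```

For continuous data the result is only an approximation of the true mean, because every observation in a class is treated as if it sat at the class mid-point.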

ADDING & MULTIPLYING A CONSTANT
Adding a constant : Let $y_i = x_i + c$ where c is a constant, then $\bar{y} = \bar{x} + c$
$\bar{x}$ ---> is the old mean, without adding the constant.
$\bar{y}$ ---> is the new mean, after adding the constant ( c ).
Example : Taking the dataset of marks of students
- 68 , 79 , 38 , 68 , 35 , 70 , 61 , 47 , 58 , 66 ----> $\bar{x}$ = 59
- Suppose the teacher has decided to add 5 marks to each student's marks.
- Then the final data is
73 , 84 , 43 , 73 , 40 , 75 , 66 , 52 , 63 , 71 ----> $\bar{y}$ = 64
So we can observe that
$\bar{y}$ = 59 + 5 = 64

Multiplying by a constant : Let $y_i = c \, x_i$ where c is a constant, then $\bar{y} = c \, \bar{x}$
- 68 , 79 , 38 , 68 , 35 , 70 , 61 , 47 , 58 , 66 ----> $\bar{x}$ = 59
- Suppose the teacher has decided to scale each mark down to 40 % of its value.
So we have to multiply each mark in the data by 0.4.
27.2 , 31.6 , 15.2 , 27.2 , 14 , 28 , 24.4 , 18.8 , 23.2 , 26.4 ----> $\bar{y}$ = 23.6
So we can observe that
$\bar{y}$ = 59 × 0.4 = 23.6

B. MEDIAN :-
The median is also used to measure central tendency. The median
of a data set is the middle value in the ordered list
( ordered list ex :- increasing order ).
• 4 , 12 , 3 , 10 , 9 , 16 , 15 ---> Find the median
Arrange the data in increasing order
3 , 4 , 9 , 10 , 12 , 15 , 16 ----> No. of observations ( n ) = 7
If the number of observations is odd then
Median = the $\frac{n + 1}{2}$-th observation
Median = the $\frac{7 + 1}{2}$ = 4th observation
Median = 10
• 4 , 12 , 3 , 10 , 9 , 16 ---> Find the median
Arrange the data in increasing order
3 , 4 , 9 , 10 , 12 , 16 ----> No. of observations ( n ) = 6
If the number of observations is even then
Median = the average of the $\frac{n}{2}$-th and $\left( \frac{n}{2} + 1 \right)$-th observations
Median = the average of the 3rd and 4th observations = $\frac{9 + 10}{2}$ = 9.5
Median = 9.5
Note :- The median is not sensitive to outliers
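The odd / even rule for the median can be sketched like this (a minimal illustration; the function name `median` is chosen for this example and is not part of the notes):

```python
def median(values):
    """Middle value of the ordered data.

    Odd n:  the (n + 1) / 2-th observation.
    Even n: the average of the n / 2-th and (n / 2 + 1)-th observations.
    """
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([4, 12, 3, 10, 9, 16, 15]))    # odd number of observations
print(median([4, 12, 3, 10, 9, 16]))        # even number of observations
print(median([4, 12, 3, 10, 9, 16, 1500]))  # largest value replaced by an outlier
```

In the third call the largest value has been replaced by a made-up outlier, and the median is the same as in the first call, which illustrates why the median is not sensitive to outliers.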

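ADDING & MULTIPLYING A CONSTANT
Adding a constant : Let $y_i = x_i + c$ where c is a constant, then New Median = Old Median + c
Old Median ---> is the median without adding the constant.
New Median ---> is the median after adding the constant ( c ).
- 68 , 79 , 38 , 70 , 61 , 47 , 58 , 66
- 38 , 47 , 58 , 61 , 66 , 68 , 70 , 79 ---> median = $\frac{61 + 66}{2}$ = 63.5
- Suppose the teacher has decided to add 5 marks to each student's marks.
- 43 , 52 , 63 , 66 , 71 , 73 , 75 , 84 ----> median = $\frac{66 + 71}{2}$ = 68.5
New Median = 63.5 + 5 = 68.5

Multiplying by a constant : Let $y_i = c \, x_i$ where c is a constant, then New Median = c × Old Median
- 68 , 79 , 38 , 70 , 61 , 47 , 58 , 66
- 38 , 47 , 58 , 61 , 66 , 68 , 70 , 79 ---> median = $\frac{61 + 66}{2}$ = 63.5
- Suppose the teacher has decided to scale each mark down to 40 % of its value.
So we have to multiply each mark in the data by 0.4.
- 15.2 , 18.8 , 23.2 , 24.4 , 26.4 , 27.2 , 28 , 31.6 ----> median = $\frac{24.4 + 26.4}{2}$ = 25.4
New Median = 63.5 × 0.4 = 25.4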

C. MODE :-
It is a measure of central tendency.
The mode of the data is its most frequently occurring value.
- If no value occurs more than once then the data has no mode.
- Example :- 2 , 12 , 5 , 7 , 6 , 7 , 3
----> As 7 occurs twice, 7 is the Mode
- Example :- 2 , 105 , 5 , 7 , 6 , 3
----> no Mode
• Adding a Constant
Let $y_i = x_i + c$ where c is a constant, then New Mode = Old Mode + c
Old Mode ---> is the mode without adding the constant.
New Mode ---> is the mode after adding the constant ( c ).
- 68 , 79 , 38 , 68 , 35 , 70 , 61 , 47 , 58 , 66 ----> Mode is 68
73 , 84 , 43 , 73 , 40 , 75 , 66 , 52 , 63 , 71 ----> Mode is 73
New Mode = 68 + 5 = 73
• Multiplying by a Constant
Let $y_i = c \, x_i$ where c is a constant, then New Mode = c × Old Mode
- 68 , 79 , 38 , 68 , 35 , 70 , 61 , 47 , 58 , 66 ----> Mode is 68
multiplying the complete data by 0.4
- 27.2 , 31.6 , 15.2 , 27.2 , 14 , 28 , 24.4 , 18.8 , 23.2 , 26.4 --> Mode is 27.2
New Mode = 68 × 0.4 = 27.2
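The effect of adding or multiplying a constant on all three measures can be checked with Python's standard `statistics` module. This sketch uses the ten marks 68, 79, 38, 68, 35, 70, 61, 47, 58, 66 for every measure (the median example in the notes uses a shorter list), and `summary` is a helper name made up for this example.

```python
import statistics

marks = [68, 79, 38, 68, 35, 70, 61, 47, 58, 66]

def summary(values):
    """Return (mean, median, mode) of the values."""
    return (statistics.mean(values),
            statistics.median(values),
            statistics.mode(values))

original = summary(marks)
shifted = summary([x + 5 for x in marks])    # add the constant c = 5
scaled = summary([x * 0.4 for x in marks])   # multiply by the constant c = 0.4

print("original:", original)
print("shifted: ", shifted)
print("scaled:  ", scaled)
```

Every entry of `shifted` is the matching entry of `original` plus 5, and every entry of `scaled` is 0.4 times the original one, up to floating-point rounding.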

2. MEASURES OF DISPERSION
Measure of dispersion indicates the amount of variation , or

spread in dataset .
Some measures of dispersion are :
1. Range
2. Variance
3. Standard Deviation
4. Interquartile Range

1) RANGE :- The range of a dataset is the difference between its
largest & smallest values
Range = Max - Min
- Examples :-
1 ) Find the range of the dataset 1 , 2 , 3 , 4 , 5 .
----> Here ,
max value :- 5
min value :- 1
Range = 5 - 1 = 4
2 ) Find the range of the dataset 1 , 2 , 3 , 4 , 15 .
----> Here ,
max value :- 15
min value :- 1
Range = 15 - 1 = 14
- From the above examples we can observe that the range is sensitive to outliers,
because there is just one difference between these two datasets, i.e. 5 & 15,
which makes a large difference in the range.

2) VARIANCE :-
Imagine you and your friends have just played several games of basketball and each
scored a different number of points. Now, you want to understand how much each person’s
score differs from the average score. This is where variance comes in!
Variance is a statistical measurement that describes how much the individual numbers in a data
set differ from the average value.
Here's how it works:
1. Calculate the mean of the data.
2. Calculate each deviation = difference between $x_i$ and the mean --> $(x_i - \bar{x})$
3. Square each deviation --> $(x_i - \bar{x})^2$
4. Finally, use these formulas :-
Population Variance : $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
Sample Variance : $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
The result is the variance!

- A high variance means that the scores are spread out and people scored very
differently from each other.
- A low variance means that everyone’s scores were close to the average ( Mean ) .

Example: Consider the dataset 68, 79, 38, 68, 35, 70, 61, 47, 58, 66.
(1) Compute the Population & Sample variance of the dataset. Solution:

No.      $x_i$    $x_i - \bar{x}$    $(x_i - \bar{x})^2$
1        68       9                  81
2        79       20                 400
3        38       -21                441
4        68       9                  81
5        35       -24                576
6        70       11                 121
7        61       2                  4
8        47       -12                144
9        58       -1                 1
10       66       7                  49
Total    590      0                  1898
Mean = $\frac{590}{10}$ = 59
a) Population Variance = $\frac{1898}{10}$ = 189.8
b) Sample Variance = $\frac{1898}{9}$ = 210.89
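The same calculation can be sketched step by step in Python. The hand-written `variance` function below is an illustration only; the standard library's `statistics.pvariance` and `statistics.variance` are used as a cross-check.

```python
import statistics

marks = [68, 79, 38, 68, 35, 70, 61, 47, 58, 66]

def variance(values, sample=True):
    """Sum of squared deviations from the mean, divided by n - 1 (sample) or n (population)."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = [(x - mean) ** 2 for x in values]
    divisor = n - 1 if sample else n
    return sum(squared_deviations) / divisor

population_variance = variance(marks, sample=False)
sample_variance = variance(marks, sample=True)
print(round(population_variance, 2), round(sample_variance, 2))

# Cross-check against the standard library.
print(statistics.pvariance(marks), statistics.variance(marks))
```

The only difference between the two variances is the divisor: n for a population and n - 1 for a sample.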

ADDING & MULTIPLYING THE CONSTANT
If we add a constant ( C ) to each $x_i$ of the data set then
New Variance = Old Variance
Example: Consider the dataset 68, 79, 38, 68, 35, 70, 61, 47, 58, 66 and C = 5

$x_i$    $x_i + 5$    $x_i - 59$    $(x_i + 5) - 64$    $((x_i + 5) - 64)^2 = (x_i - 59)^2$
68       73           9             9                   81
79       84           20            20                  400
38       43           -21           -21                 441
68       73           9             9                   81
35       40           -24           -24                 576
70       75           11            11                  121
61       66           2             2                   4
47       52           -12           -12                 144
58       63           -1            -1                  1
66       71           7             7                   49
590      640                                            1898
Mean ( $x_i$ ) = 59
Mean ( $x_i + 5$ ) = 59 + 5 = 64
a) Population Variance = $\frac{1898}{10}$ = 189.8
b) Sample Variance = $\frac{1898}{9}$ = 210.89

If we multiply each $x_i$ of the data set by a constant ( C ) then
New Variance = Old Variance × C²
Example: the same dataset with C = 5

$x_i$    $5 x_i$    $x_i - 59$    $5 x_i - 295$    $(5 x_i - 295)^2$
68       340        9             45               2025
79       395        20            100              10000
38       190        -21           -105             11025
68       340        9             45               2025
35       175        -24           -120             14400
70       350        11            55               3025
61       305        2             10               100
47       235        -12           -60              3600
58       290        -1            -5               25
66       330        7             35               1225
590      2950                                      47450
Mean ( $x_i$ ) = 59
Mean ( $5 x_i$ ) = 59 × 5 = 295
a) Old Population Variance = 189.8
b) New Population Variance = $\frac{47450}{10}$ = 189.8 × 5² = 4745
c) Old Sample Variance = 210.89
d) New Sample Variance = $\frac{47450}{9}$ = 210.89 × 5² ≈ 5272.2

STANDARD DEVIATION :-
The quantity which is the square root of the Sample Variance is the
Sample Standard Deviation.
- Formula :- $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
Units of Standard Deviation :-
- If we have a dataset of the weights of 10 students measured in kg, then the
unit of the variance will be kg² and the unit of the standard deviation will be kg.
ADDING & MULTIPLYING A CONSTANT :
- If we add a constant ( C ) to each $x_i$ of the data set then
New Standard Deviation = Old Standard Deviation
- If we multiply each $x_i$ of the data set by a constant ( C ) then
New Standard Deviation = Old Standard Deviation × |C|

PERCENTILE
The " 100p percentile " ---> a way of ranking data ( p is a number between 0 and 1 ).
If a data value is at the " 100p percentile "
---> it means that at least 100p percent of the data values are less than or equal to it.
• For example : If a student scores in the 90th percentile on a test,
it means they scored higher than or equal to 90 % of the students.
Also, at least 100( 1 - p ) percent of the data values are greater than or equal to it
---> This means if we subtract the percentile rank ( in decimal form ) from 1 and
multiply by 100, we get the percentage of data that is equal to or more
than our data value.
• For example : If a student is in the 90th percentile then 100( 1 - 0.90 ) = 10 %
So 10 % of students scored the same as or higher than this student.
In simple terms the 100p percentile tells us where a particular value stands w.r.t.
the other data values.
It is just like saying " This value is higher than X % of all values & lower than Y % ".
The percentiles P1 , P2 , P3 , ... , P97 , P98 , P99 divide the ordered data into parts of 1 % each.
P99 is the 99th percentile,
so 99 % of the data is less than it and
1 % is greater than it.

Computing Percentile
1) Arrange the data in increasing order.
[ 68, 38, 66, 79, 61, 47, 68, 35, 70, 58 ]
Increasing order ---> [ 35, 38, 47, 58, 61, 66, 68, 68, 70, 79 ]
2) n ---> number of observations = 10
p ---> percentile in decimal form, ex :- 25th percentile ---> p = 0.25
If n × p is not an integer then determine the smallest integer greater than n × p.
n × p = 10 × 0.25 = 2.5 --> not an integer
The smallest integer greater than n × p is 3.
Therefore the 3rd observation of the ordered data set is the 25th percentile --> which is 47
3) If n × p is an integer then find the average of the values in positions n × p and n × p + 1.
Example :- [ 35, 38, 47, 58, 61, 66, 68, 68, 70, 79 ]
n = 10 & p = 0.50 ---> because we have to find the 50th percentile
n × p = 10 × 0.50 = 5
Since n × p is an integer,
we take the average of the 5th observation ( = 61 ) and the 6th observation ( = 66 ).
50th percentile = $\frac{61 + 66}{2}$ = 63.5
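The percentile rule above can be sketched as follows. To avoid floating-point trouble when testing whether n × p is an integer, this illustrative function takes the percentile as a whole number k (so p = k / 100) and works with integer arithmetic; the name `percentile` and that design choice are assumptions of this example.

```python
def percentile(values, k):
    """k-th percentile (k a whole number, 0 < k < 100) using the n * p rule.

    If n * p is not an integer, take the observation at the smallest
    integer position greater than n * p. If n * p is an integer, average
    the observations at positions n * p and n * p + 1 (1-based).
    """
    if not values or not 0 < k < 100:
        raise ValueError("need non-empty data and 0 < k < 100")
    ordered = sorted(values)
    n = len(ordered)
    position, remainder = divmod(n * k, 100)  # n * p = position + remainder / 100
    if remainder:
        # 1-based position (position + 1) is index `position` in 0-based Python.
        return ordered[position]
    return (ordered[position - 1] + ordered[position]) / 2

marks = [68, 38, 66, 79, 61, 47, 68, 35, 70, 58]
print(percentile(marks, 25))
print(percentile(marks, 50))
```

Other textbooks and software packages use slightly different percentile rules, so results from a spreadsheet or library may not match this rule exactly.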

QUARTILES
• The sample 25th percentile is called the first quartile.
• The sample 50th percentile is called the second quartile & also the median.
• The sample 75th percentile is called the third quartile.
[Figure: the entire data from Min to Max, with the 1st Quartile ( 25 % of data below it ),
the 2nd Quartile / Median ( 50 % of data below it ) and the 3rd Quartile ( 75 % of data below it ) marked]
These quartiles break the data set up into four parts.

The Five Number Summary
- Minimum
- Q1 : First Quartile or Lower Quartile
- Q2 : Second Quartile or Median
- Q3 : Third Quartile or Upper Quartile
- Maximum
The Interquartile Range ( IQR )
- The interquartile range, IQR, is the difference between the 3rd & 1st quartiles.
IQR = Q3 - Q1
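As a rough sketch, the five number summary and the IQR can be computed with the n × p percentile rule from these notes. The helper names `percentile` and `five_number_summary` are made up for this example, and the marks dataset from the percentile section is reused.

```python
def percentile(values, k):
    """k-th percentile (whole number k, 0 < k < 100) using the n * p rule of the notes."""
    ordered = sorted(values)
    position, remainder = divmod(len(ordered) * k, 100)
    if remainder:
        return ordered[position]
    return (ordered[position - 1] + ordered[position]) / 2

def five_number_summary(values):
    """Return (minimum, Q1, Q2, Q3, maximum)."""
    return (min(values),
            percentile(values, 25),
            percentile(values, 50),
            percentile(values, 75),
            max(values))

marks = [68, 38, 66, 79, 61, 47, 68, 35, 70, 58]
minimum, q1, q2, q3, maximum = five_number_summary(marks)
iqr = q3 - q1
print(minimum, q1, q2, q3, maximum, "IQR =", iqr)
```

Unlike the range, the IQR ignores the lowest and highest quarters of the data, so a single extreme value has little effect on it.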

WEEK-4

CONTINGENCY TABLE
• It shows the distribution of one variable in rows and another in columns, and is used to study
the association between the two variables.
• For example :- suppose you have two categorical variables, gender ( Male / Female ) and
smartphone ownership ( YES / NO ), recorded for 100 students, and you have to find out
whether ownership of a smartphone is associated with gender. A contingency table classifies
the outcomes of one variable in rows and of the other in columns.

                Own a Smart Phone
GENDER          NO      YES     Row Total
Female          10      34      44
Male            14      42      56
Column Total    24      76      100
The no. of males and females among the 100 students is represented in rows.
The no. of YES / NO outcomes is represented in columns.
- If the variables are nominal, then the order in which their categories are
represented doesn't matter.
Ex :- YES & NO / Gender ( Male / Female ).
- But if the variables are ordinal then the order in which they are represented really matters.
Ex :- High , Medium , Low

• Steps to Construct Contingency Table
Step 1 :- Select all the data by using
( Ctrl + A )
Step 2 :- Click on Insert & select Pivot Table
Step 3 :- Create the Pivot Table
Step 4 :- Add the required rows, columns & values in the
Pivot Table using the Pivot Table Editor.
Result :-

• What to do if there are Ordinal Variables in the Data
For example :- suppose one of the two categorical variables is Income
( HIGH , MEDIUM , LOW ) , which is an ordinal variable , and you have to find out
whether ownership of a smartphone ( YES / NO ) is associated with the income of 100 people .
A contingency table classifies the outcomes of one variable in
rows and of the other in columns .

                 Own a Smartphone
INCOME           NO     YES    Row Total
HIGH             2      18     20
MEDIUM           27     39     66
LOW              9      5      14
Column Total     38     62     100

If the variables are ordinal , then we should maintain the correct order of the categories in the table .

Row Relative Frequencies
As we know , relative frequency is the frequency divided by the total number of observations .
Similarly , the Row Relative Frequency is obtained as follows :
Divide each cell frequency in a row by its row total .

                 Own a Smart Phone
GENDER           NO     YES    Row Total
Female           10     34     44
Male             14     42     56
Column Total     24     76     100

Total no. of observations = 100
Q. Find the relative frequency of YES & NO for females and for males , w.r.t. their row totals .
A.
Yes ( Female ) = 34 / 44 = 0.7727 ; Yes ( Female ) = 77.27 %
No ( Female ) = 10 / 44 = 0.2273 ; No ( Female ) = 22.73 %
Yes ( Male ) = 42 / 56 = 0.75 ; Yes ( Male ) = 75 %
No ( Male ) = 14 / 56 = 0.25 ; No ( Male ) = 25 %
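
The division of each cell by its row total can be sketched as follows . The dictionary layout and the function name are assumptions made only for this illustration ; the counts are the ones from the smartphone table .

```python
counts = {
    "Female": {"NO": 10, "YES": 34},
    "Male": {"NO": 14, "YES": 42},
}

def row_relative_frequencies(table):
    result = {}
    for row, cells in table.items():
        row_total = sum(cells.values())
        # Divide each cell frequency in the row by that row's total.
        result[row] = {col: freq / row_total for col, freq in cells.items()}
    return result

row_rf = row_relative_frequencies(counts)
for row, cells in row_rf.items():
    print(row, {col: f"{value:.2%}" for col, value in cells.items()})
```

Each row of the result adds up to 1 ( i.e. 100 % ) , which is a quick way to check the calculation .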

Column Relative Frequencies
As we know , relative frequency is the frequency divided by the total number of observations .
Similarly , the Column Relative Frequency is obtained as follows :
Divide each cell frequency in a column by its column total .

                 Own a Smart Phone
GENDER           NO     YES    Row Total
Female           10     34     44
Male             14     42     56
Column Total     24     76     100

Total no. of observations = 100
Q. Find the proportion of female and male participants .
A.
Female % = 44 / 100 = 44 %
Male % = 56 / 100 = 56 %
Q. Find the proportion of females and of males among those who said NO , and among those who said YES .
A.
No ( Male ) = 14 / 24 = 0.5833 ; No ( Male ) = 58.33 %
No ( Female ) = 10 / 24 = 0.4167 ; No ( Female ) = 41.67 %
Yes ( Female ) = 34 / 76 = 0.4474 ; Yes ( Female ) = 44.74 %
Yes ( Male ) = 42 / 76 = 0.5526 ; Yes ( Male ) = 55.26 %
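
A similar sketch works for column relative frequencies ; here each cell is divided by its column total . As before , the data layout and the function name are assumptions for illustration only .

```python
counts = {
    "Female": {"NO": 10, "YES": 34},
    "Male": {"NO": 14, "YES": 42},
}

def column_relative_frequencies(table):
    columns = list(next(iter(table.values())))
    # Column total = sum of that column's cell over all rows.
    col_totals = {c: sum(cells[c] for cells in table.values()) for c in columns}
    return {
        row: {c: cells[c] / col_totals[c] for c in columns}
        for row, cells in table.items()
    }

col_rf = column_relative_frequencies(counts)
for row, cells in col_rf.items():
    print(row, {col: f"{value:.2%}" for col, value in cells.items()})
```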

Association Between Two Variables
Two categorical variables are associated if knowing information about one variable
gives information about the other variable too .
To check this we use the Row / Column Relative Frequencies .
• If the Row / Column relative frequencies are the same ( or nearly the same ) for all Rows / Columns , then the
two variables are not associated with each other .
• If the Row / Column relative frequencies differ across Rows / Columns , then
the two variables are associated with each other .

• Row Relative Freq ( Gender vs Own a Smart Phone )
GENDER           NO        YES       Row Total
Female           22.73%    77.27%    44
Male             25.00%    75.00%    56
Column Total     24%       76%       100
- You can observe that there is no major difference between the rows : the relative frequencies lie between
22.73 - 25 % for NO and 75 - 77.27 % for YES , so the variables aren't associated with each other .

• Column Relative Freq ( Gender vs Own a Smart Phone )
GENDER           NO        YES       Row Total
Female           41.67%    44.74%    44%
Male             58.33%    55.26%    56%
Column Total     24        76        100
- You can observe that there is no major difference between the columns : Female is 41.67 % vs 44.74 % and
Male is 58.33 % vs 55.26 % , so the variables aren't associated with each other .

• Row Relative Freq ( Income vs Own a Smartphone )
INCOME           NO        YES       Row Total
HIGH             10.00%    90.00%    20
MEDIUM           40.91%    59.09%    66
LOW              64.29%    35.71%    14
Column Total     38.00%    62.00%    100
- You can observe that there is a major difference between the rows ; they do not lie in the same
range . The variables are associated .

• Column Relative Freq ( Income vs Own a Smartphone )
INCOME           NO        YES       Row Total
HIGH             5.26%     29.03%    20.00%
MEDIUM           71.05%    62.90%    66.00%
LOW              23.68%    8.06%     14.00%
Column Total     38        62        100
- You can observe that there is a major difference between the columns ; they do not lie in the same
range . The variables are associated .
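
The comparison of row relative frequencies can be summarised with one number per table . The sketch below is only an illustration , not a formal test : it assumes that the spread ( largest minus smallest ) of the YES share across rows is a reasonable way to express " how different the rows are " , and the function name is hypothetical . What counts as a " major " difference is still a judgement .

```python
def yes_share_spread(table, column="YES"):
    """Largest minus smallest row relative frequency of `column`, in percentage points."""
    shares = [cells[column] / sum(cells.values()) * 100 for cells in table.values()]
    return max(shares) - min(shares)

gender = {
    "Female": {"NO": 10, "YES": 34},
    "Male": {"NO": 14, "YES": 42},
}
income = {
    "HIGH": {"NO": 2, "YES": 18},
    "MEDIUM": {"NO": 27, "YES": 39},
    "LOW": {"NO": 9, "YES": 5},
}

gender_spread = yes_share_spread(gender)   # about 2.3 percentage points
income_spread = yes_share_spread(income)   # about 54.3 percentage points
print(round(gender_spread, 2), round(income_spread, 2))
```

A small spread ( Gender ) agrees with " not associated " , and a large spread ( Income ) agrees with " associated " .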

STACKED BAR CHART
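As we know , a bar chart summarizes the data for the categorical variable
under consideration , with the length of each bar representing the
frequency of occurrence of a particular category .
A Stacked Bar Chart also represents the counts for a category . In
addition , each bar is further broken down into smaller segments ,
each segment representing the frequency of a category of the second variable within that bar .
It is also referred to as a Segmented Bar Chart .

- Standard Stacked Bar Chart ( Gender ) : segments show counts
                 Own a Smart Phone
GENDER           NO     YES    Row Total
Female           10     34     44
Male             14     42     56
[ Figure : standard stacked bar chart , one bar per gender , segments for NO and YES ]

- 100% Stacked Bar Chart ( Gender ) : segments show row relative frequencies
                 Own a Smart Phone
GENDER           NO        YES       Row Total
Female           22.73%    77.27%    44
Male             25.00%    75.00%    56
Column Total     24%       76%       100
[ Figure : 100% stacked bar chart , one bar per gender ]

- Standard Stacked Bar Chart ( Income ) : segments show counts
                 Own a Smartphone
INCOME           NO     YES    Row Total
HIGH             2      18     20
MEDIUM           27     39     66
LOW              9      5      14
[ Figure : standard stacked bar chart , one bar per income level ]

- 100% Stacked Bar Chart ( Income ) : segments show row relative frequencies
                 Own a Smartphone
INCOME           NO        YES       Row Total
HIGH             10.00%    90.00%    20
MEDIUM           40.91%    59.09%    66
LOW              64.29%    35.71%    14
Column Total     38.00%    62.00%    100
[ Figure : 100% stacked bar chart , one bar per income level ]

In the 100% stacked bar chart for Income , the segments differ clearly from bar to bar ,
so there is an association between the variables . In the chart for Gender ,
the data represented by the two bars is almost the same , so the variables are not associated .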

SCATTER PLOT
• Used for looking at the association between numerical variables . A Scatter Plot is a
graph that displays pairs of values as points on a two-dimensional plane .
- How to decide the variable's axis ( X or Y axis ) ?
Y ---> Response Variable ( dependent on the explanatory variable )
X ---> Explanatory Variable ( independent variable )

AGE    HEIGHT    Point ( X , Y )
1      75        ( 1 , 75 )
2      85        ( 2 , 85 )
3      94        ( 3 , 94 )
4      101       ( 4 , 101 )
5      108       ( 5 , 108 )

- Age is on the X - axis because it is the independent variable .
- Height is on the Y - axis because it depends on age .

Example 2 : Prices Of Homes

Graph I                                   Graph II
SIZE ( 1000 sq feet )   PRICE ( INR )     SIZE ( 1000 sq feet )   PRICE ( INR )
0.8                     68                0.5                     201
1                       81                0.6                     69
1.1                     72                0.9                     122
1.3                     91                1.1                     133
1.6                     87                1.3                     207
1.8                     56                1.4                     71
2.3                     83                1.5                     149
2.3                     112               2                       122
2.5                     93                2.2                     188
2.5                     100               2.6                     198
2.7                     136               2.7                     88
3.1                     109               2                       207
3.1                     122               3.1                     133
3.2                     159               2.3                     206
3.4                     170               3.4                     90

[ Figures : scatter plot of Price against Size for Graph I and for Graph II , with some points highlighted ]
• In graph ( I ) , the highlighted points show that the price of houses tends to
increase with size , i.e. Price ( variable ) is associated with Size ( variable ) .
• In graph ( II ) , the highlighted points show that the price of houses
is not associated with the size of the house , because some houses with small size
have a high price & some houses with large size have a low price .

DESCRIBING ASSOCIATION
When describing the association between variables in a scatter plot ,
there are four key questions that need to be answered :
• Direction : Does the pattern trend up , down , or both ?
• Curvature : Does the pattern appear to be linear , or does it curve ?
• Variation : Are the points tightly clustered along the pattern ?
• Outliers : Did you find something that lies outside the pattern ?

DESCRIBING ASSOCIATION : DIRECTION , CURVATURE , VARIATION , OUTLIERS
[ Figures : example scatter plots showing ( a ) Up-trend & linear , ( b ) Down-trend & linear ,
( c ) Curved but down-trend , ( d ) Curved but up-trend , ( e ) Tightly clustered points vs variable points ,
( f ) Outliers circled in red ]
• Outliers :- The red circled points are outliers because they don't follow the pattern .

MEASURE OF ASSOCIATION

COVARIANCE :- It quantifies the strength of the linear association between two
numerical variables .
In simple words , if you have two variables , let's call them X & Y ,
covariance tells you how changes in X are associated with changes in Y .
Covariance ( +ve ) ---> as X increases , Y also tends to increase .
Covariance ( -ve ) ---> as X increases , Y tends to decrease .

X value nature    Y value nature    Signs of Deviations ( Similar / Different )
Large             Large             Similar Signs
Small             Small             Similar Signs
Large             Small             Different Signs
Small             Large             Different Signs

• Formulas :-
Population Covariance = $\frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{N}$
Sample Covariance = $\frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$
• Unit of Covariance :- If X is in Kg and Y is in meter ---> Unit = Kg × meter
• Formula for Google Sheet :- COVARIANCE.P(data_y, data_x) ---> population covariance
COVARIANCE.S(data_y, data_x) ---> sample covariance

Example 1
Same nature of value

AGE ( X )    HEIGHT ( Y )    DEVIATION of X    DEVIATION of Y    Note
1            75              -2                -17.6             Small value , Small value -> Same sign
2            85              -1                -7.6
3            94              0                 1.4
4            101             1                 8.4               Large value , Large value -> Same sign
5            108             2                 15.4
Mean of X = 3 , Mean of Y = 92.6

Sum of ( Deviation of X × Deviation of Y ) = 35.2 + 7.6 + 0 + 8.4 + 30.8 = 82
Population Covariance = 82 / 5 = 16.4 ---> +ve covariance
Sample Covariance = 82 / ( 5 - 1 ) = 82 / 4 = 20.5 ---> +ve covariance
As we know ,
Covariance ( +ve ) ---> as X increases , Y also tends to increase .
In the table , X is increasing and Y is increasing .

Example 2
Different nature of value

X    Y    DEVIATION of X    DEVIATION of Y    Note
1    6    -2                2                 Small value , Large value -> Different sign
2    5    -1                1
3    4    0                 0
4    3    1                 -1                Large value , Small value -> Different sign
5    2    2                 -2
Mean of X = 3 , Mean of Y = 4

Sum of ( Deviation of X × Deviation of Y ) = -4 - 1 + 0 - 1 - 4 = -10
Population Covariance = -10 / 5 = -2 ---> -ve covariance
Sample Covariance = -10 / ( 5 - 1 ) = -10 / 4 = -2.5 ---> -ve covariance
As we know ,
Covariance ( -ve ) ---> as X increases , Y tends to decrease .
In the table , X is increasing and Y is decreasing .
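
The two covariance examples can be checked with a few lines of Python . This is a minimal sketch : the function name and the sample flag are hypothetical , sample=False divides by N ( population covariance ) and sample=True divides by n - 1 ( sample covariance ) .

```python
def covariance(xs, ys, sample=False):
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of the deviations of X and Y from their means.
    total = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return total / (n - 1 if sample else n)

# Example 1: age and height.
age = [1, 2, 3, 4, 5]
height = [75, 85, 94, 101, 108]
pop_cov_1 = covariance(age, height)                  # about 16.4
sample_cov_1 = covariance(age, height, sample=True)  # about 20.5

# Example 2: Y falls as X rises.
x2 = [1, 2, 3, 4, 5]
y2 = [6, 5, 4, 3, 2]
pop_cov_2 = covariance(x2, y2)                       # -2.0
sample_cov_2 = covariance(x2, y2, sample=True)       # -2.5
print(pop_cov_1, sample_cov_1, pop_cov_2, sample_cov_2)
```

The two modes should correspond to COVARIANCE.P and COVARIANCE.S in Google Sheets .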

CORRELATION :- It is represented by ' r ' . Correlation is the easiest-to-understand measure of
linear association between two numerical variables .
- The value of correlation ' r ' ranges from -1 to +1 .
- Correlation ---> +ve & close to +1 ---> strong association . X & Y both increase .
- Correlation ---> -ve & close to -1 ---> strong association . X increases , Y decreases .
- Correlation ---> close to 0 ---> weak association ---> no linear association .
Formula :- $r = \frac{\mathrm{Cov}(x, y)}{s_x s_y} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \, \sum (y_i - \bar{y})^2}}$
Where Cov( x , y ) ---> covariance of x & y
$s_x$ & $s_y$ = standard deviations of x & y
[ Figure : two scatter plots of y against x , one with r near -1 ( falling line ) and one with r near +1 ( rising line ) ]

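Example : 1
X    Y      Dev of X    ( Dev of X )^2    Dev of Y    ( Dev of Y )^2    Dev of X × Dev of Y
1    75     -2          4                 -17.6       309.76            35.2
2    85     -1          1                 -7.6        57.76             7.6
3    94     0           0                 1.4         1.96              0
4    101    1           1                 8.4         70.56             8.4
5    108    2           4                 15.4        237.16            30.8
Mean of X = 3 , Mean of Y = 92.6 , Sum of products = 82
Sum of ( Dev of X )^2 = 10 , Sum of ( Dev of Y )^2 = 677.2
s_x = sqrt( 10 / 4 ) = 1.58
s_y = sqrt( 677.2 / 4 ) = 13.01
r = Cov( x , y ) / ( s_x × s_y ) = 20.5 / ( sqrt( 2.5 ) × sqrt( 169.3 ) ) = 20.5 / 20.573 = 0.9964
--> Correlation ( r ) ---> +ve & close to +1 ---> strong association .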

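Example : 2
X    Y    Dev of X    ( Dev of X )^2    Dev of Y    ( Dev of Y )^2    Dev of X × Dev of Y
1    6    -2          4                 2           4                 -4
2    5    -1          1                 1           1                 -1
3    4    0           0                 0           0                 0
4    3    1           1                 -1          1                 -1
5    2    2           4                 -2          4                 -4
Mean of X = 3 , Mean of Y = 4 , Sum of products = -10
Sum of ( Dev of X )^2 = 10 , Sum of ( Dev of Y )^2 = 10
s_x = sqrt( 10 / 4 ) = 1.58
s_y = sqrt( 10 / 4 ) = 1.58
r = Cov( x , y ) / ( s_x × s_y ) = -2.5 / ( sqrt( 2.5 ) × sqrt( 2.5 ) ) = -2.5 / 2.5 = -1
--> Correlation ( r ) ---> -ve & equal to -1 ---> strong association .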
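
Both correlation examples can be reproduced with the short sketch below . It assumes r is computed as the sum of products of deviations divided by the square root of the product of the two sums of squared deviations , which is algebraically the same as Cov( x , y ) / ( s_x s_y ) ; the function name is hypothetical .

```python
import math

def correlation(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    # The (n - 1) factors of the covariance and standard deviations cancel out.
    return sxy / math.sqrt(sxx * syy)

r1 = correlation([1, 2, 3, 4, 5], [75, 85, 94, 101, 108])  # Example 1
r2 = correlation([1, 2, 3, 4, 5], [6, 5, 4, 3, 2])         # Example 2
print(round(r1, 4), r2)
```

Example 1 gives r of about 0.9964 and Example 2 gives r = -1 , in line with the hand calculations .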

FITTING OF LINE
R2 ---> Goodness of Fit Measure
Where ,
0 <= R2 <= 1

In the above graph ( data of correlation Example 1 ) ;
Equation of the line : Height = 8.2 ( Age ) + 68
Y = 8.2X + 68 ---> Slope = 8.2 ( so slope is +ve )
Goodness of Fit Measure = R2 = 0.993
Also , R2 = r^2 ( i.e. the square of the correlation r )
So , r = sqrt( R2 ) = sqrt( 0.993 ) = 0.9965 --> agrees with example 1 of correlation ( r = 0.9964 ) up to rounding .
As the slope is positive , the value of r is also +ve .

In the above graph ( data of correlation Example 2 ) ;
Equation of the line : Y = -1 ( X ) + 7
Y = -1X + 7 ---> Slope = -1 ( so slope is -ve )
Goodness of Fit Measure = R2 = 1
Also , R2 = r^2 ( i.e. the square of the correlation r )
So , the size of r is sqrt( R2 ) = sqrt( 1 ) = 1
As the slope of the graph is -ve , r is also -ve .
r = -1 --> check example 2 of correlation .
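
The line equations and R2 values can be checked with the sketch below . It assumes that the fitted line is the ordinary least squares line ( the notes do not state the fitting method , but this assumption reproduces the numbers shown ) ; the function name is hypothetical .

```python
import math

def fit_line(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)   # for a straight-line fit, R2 = r^2
    return slope, intercept, r_squared

slope1, intercept1, r2_1 = fit_line([1, 2, 3, 4, 5], [75, 85, 94, 101, 108])
slope2, intercept2, r2_2 = fit_line([1, 2, 3, 4, 5], [6, 5, 4, 3, 2])

# r has the size sqrt(R2) and the sign of the slope.
r_1 = math.copysign(math.sqrt(r2_1), slope1)
r_2 = math.copysign(math.sqrt(r2_2), slope2)
print(slope1, intercept1, round(r2_1, 3), slope2, intercept2, r2_2)
```

For the age and height data this gives a slope of 8.2 , an intercept of 68 and R2 of about 0.993 ; for the second data set it gives slope -1 , intercept 7 and R2 = 1 .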

Association Between a Categorical Variable & Numerical Variable
• The categorical variable has two different categories ( Dichotomous )
Point-Biserial Correlation Coefficient
X ---> Numerical Variable
Y ---> Categorical Variable ( Dichotomous variable )
• As Y is a dichotomous variable ,
code one category as Y = 0 & the other as Y = 1 ,
just for grouping them .
• Calculate the mean value of X for these two categories separately :
If category is coded as Y = 0 ---> mean = Y0
If category is coded as Y = 1 ---> mean = Y1
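• Find P0 & P1 ( the proportion of observations in each category )
P0 = ( number of observations with Y = 0 ) / ( total number of observations )
P1 = ( number of observations with Y = 1 ) / ( total number of observations )
• Find Sx ( Standard Deviation ) of X variable
• Formula :-
rpb = $\frac{Y_1 - Y_0}{S_x} \sqrt{P_0 P_1}$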
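
The steps above can be sketched in Python as follows . This is a sketch under stated assumptions , not the definitive formula of the notes : it assumes the common form rpb = ( ( Y1 - Y0 ) / Sx ) × sqrt( P0 × P1 ) , it assumes Sx is the population standard deviation of all X values ( dividing by n ) , and the data and the function name are hypothetical . With these assumptions rpb equals the ordinary correlation r between X and the 0 / 1 codes .

```python
import math
import statistics

def point_biserial(xs, codes):
    """xs: numerical values; codes: 0/1 group code for each value."""
    group0 = [x for x, c in zip(xs, codes) if c == 0]
    group1 = [x for x, c in zip(xs, codes) if c == 1]
    n = len(xs)
    mean0 = sum(group0) / len(group0)        # mean of X where Y = 0
    mean1 = sum(group1) / len(group1)        # mean of X where Y = 1
    p0, p1 = len(group0) / n, len(group1) / n
    s_x = statistics.pstdev(xs)              # assumption: population SD (divide by n)
    return (mean1 - mean0) / s_x * math.sqrt(p0 * p1)

# Hypothetical data: X is a score, Y is a two-category variable coded 0 / 1.
scores = [2, 4, 6, 8]
group = [0, 0, 1, 1]
r_pb = point_biserial(scores, group)
print(round(r_pb, 4))
```

Swapping the 0 and 1 codes only changes the sign of the result , not its size .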
