Week 1-4 Statistics Notes
INDEX
Week -1
Week - 2
Week - 3
Week - 4
By :- Rushikesh Chavan
WEEK-1
• What is Statistics ?
---> Statistics is the art of learning from data. It is concerned with the
collection of data, their subsequent description and their analysis,
which leads to the drawing of conclusions.
- The part of statistics concerned with the description and summarization of
data is called Descriptive Statistics.
- The part of statistics concerned with the drawing of conclusions from data
is called Inferential Statistics.
- To be able to draw a conclusion from the data, we must take into account
the possibility of chance, i.e. Probability.
Population :- The total collection of all the elements that we are interested in is
called the population.
Sample :- A subgroup of the population that will be studied in detail is
called a sample.
Example :- A bag full of colourful balls, of which we as observers want to look only
at the red balls. So -->
- Note that if the purpose of the analysis is to examine and explore Information
for its own intrinsic interest only then the study is Descriptive .
For Ex :- YouTube
They want to know which videos people loved the most.
That is why they added like and dislike buttons to their app,
so they are simply collecting the data.
- Data Collection :-
- Unstructured Data :-
- Structured Data :-
- Rows represent the cases :- For each case, the same attributes are recorded.
- Columns represent variables :- For each variable, the same type of value is
recorded for each case.
(Rows -> Cases)
(Date of birth is variable)
(Anjali is a case )
(These Columns represents variables)
Measurement Unit :- The scale defines the meaning of numerical data, such as weights measured
in kilograms, prices in rupees, height in cm, etc.
Note :- If data is entered as a numerical data type (ex :- height), then the unit must be
the same for every case.
• Time Series :- If the data is recorded over a period of time, then it is called
time-series data.
• Time-Plot :- A graph of a time series showing values in chronological order is
known as a time-plot.
Ex:- The data collected to observe the temperature in Delhi for seven different
days is a time-series data. Because, data is recorded only for one place (i.e. Delhi)
and it is recorded over a period of time (i.e. seven different days).
Ex :- The data collected to observe the temperature of Delhi, Chennai, Jaipur and
Bhopal on a particular day is a cross-sectional data. Because, data is recorded at
the same time and it is observed for several places.
When the data for a variable consist of labels or names used to identify the
characteristic of an observation, that scale of measurement is considered a
nominal scale.
- A nominal scale is just categories or labels that do not have any order.
Ex :- When a customer visits Amazon and buys something, the customer rates the
product quality, i.e. one star to five stars / poor, good, excellent.
Such data have the properties of nominal data and, in addition, can be ranked
w.r.t. product quality; this is an ordinal scale of measurement.
Note :-
- We can code an ordinal scale of measurement, as
- bad can be coded as ,
- good can be coded as
- excellent can be coded as .
- There is an order in , , .
Interval Scale of measurement :- If the data have all the properties of ordinal data
(i.e. the data have an order) and the interval between values is expressed in a
fixed unit of measurement, then it is an interval scale of measurement.
Note :- Data with an interval scale of measurement are always numeric, and we can find
the difference between any two values.
Ex :- If a room has AC & that room has temperature °C,
temperature outside the room is °C.
It is correct to say that the difference in temperature is °C,
but it is incorrect to say that the outdoor is twice as hot as indoor.
Ratio scale of measurement :- If the data have all the properties of interval data and
the ratio of two values is meaningful , then the scale of
measurement is ratio scale .
Example: Height (in cm), weight (in kg), marks, etc. Data such as height, weight
and marks can be added, subtracted, multiplied or divided, because they all
have the absolute zero property.
Definition :-
- A frequency distribution of qualitative data is a way to organize and
display data by listing the distinct values or categories and how many times each
value or category appears in a dataset.
- Each row of a frequency table lists a category along with the Number of cases
in the category .
Data ( 10 observations ) :-
1. Red
2. Blue
3. Green
4. Red
5. Red
6. Blue
7. Yellow
8. Green
9. Red
10. Purple

COLOURS        FREQUENCY
Blue           2
Green          2
Purple         1
Red            4
Yellow         1
Grand Total    10

Category - Blue : the Blue category has 2 cases ( observations 2 and 6 ).
⚫ Definition :-
The ratio of Frequency to the total number of observation
is called Relative Frequency .
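The frequency table and the relative frequencies for the colour example can be checked with a few lines of Python. This is only a minimal sketch; the variable names are illustrative and not part of the notes.

```python
from collections import Counter

# The ten colour observations from the example.
colours = ["Red", "Blue", "Green", "Red", "Red",
           "Blue", "Yellow", "Green", "Red", "Purple"]

frequency = Counter(colours)             # category -> number of cases
total = sum(frequency.values())          # total number of observations

# Relative frequency = frequency / total number of observations.
relative_frequency = {c: f / total for c, f in frequency.items()}

for colour in sorted(frequency):
    print(f"{colour:<8}{frequency[colour]:>3}{relative_frequency[colour]:>6.1f}")
print(f"{'Total':<8}{total:>3}")
```

The relative frequencies always add up to 1, which is a quick check that no case was dropped.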
- The two most common displays of a categorical variable are the bar chart and the pie chart.
- Both describe a categorical variable by displaying its frequency table.
PIE CHARTS :-
Note : To find the angle required for each wedge, use Degree = relative frequency × 360
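As a small illustration of the wedge rule, the sketch below converts relative frequencies into angles. The relative frequencies are the ones from the colour example in these notes; the variable names are assumptions made for the illustration.

```python
# Relative frequencies of the colour example.
relative_frequency = {"Blue": 0.2, "Green": 0.2, "Purple": 0.1,
                      "Red": 0.4, "Yellow": 0.1}

# Degree = relative frequency x 360
degrees = {colour: rf * 360 for colour, rf in relative_frequency.items()}

for colour, angle in degrees.items():
    print(f"{colour:<8}{angle:6.1f} degrees")
```

Because the relative frequencies sum to 1, the wedge angles sum to 360 degrees (up to floating-point rounding).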
Bar Chart :-
- The bars should be positioned so that they do not touch each other.
Vertical Axis
Horizontal Axis
Pareto Charts :-
Lower Frequency
Labels in Chart :-
Suppose a bar chart has too many categories, and a large number of those categories
have small counts. We should not exclude them, because that would cause a loss of
data; instead we should combine them into another category named "Other", so that
no data is lost.
Ex :-
Note : -
- Pie chart is best suited when we want to emphasize the proportion
rather than the frequency of the categories. Each slice in a pie
chart must have a distinct colour .
- Some examples where Bar Charts can be used :-
- An e-commerce company wants to know how many sources generated
income above a certain value .
- An e-commerce company is interested in knowing amount of transaction
by different payment modes.
- Also all the bars in a bar chart can have the same colour
The area principle says that the area occupied by a part of the graph should
correspond to the amount of data it represents.
• In the improperly scaled pictogram bar graph ( 1 ), the image for B is actually 9 times
as large as the image for A, so this graph doesn't obey the area principle.
• If a graph doesn't obey the area principle, then it is a misleading graph.
It is important to check for round-off errors. Round-off errors occur when table
entries are percentages or proportions: after rounding, the total may differ slightly
from 100% or 1. This might carry over into a pie chart drawn from the table.
In the table, the total is 100%. Suppose we round off the values and
draw a pie chart as follows:
This pie chart has round-off errors because the total of all entries is 100.5%,
which is different from 100%.
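A quick way to spot a round-off error is to add up the rounded entries before drawing the chart. The counts below are hypothetical (they are not the table from the notes) and were chosen only because thirds do not round cleanly.

```python
# Hypothetical counts: three equally frequent categories.
counts = [1, 1, 1]
total = sum(counts)

# Percentages rounded to one decimal place, as they might appear in a table.
rounded_percentages = [round(100 * c / total, 1) for c in counts]
rounded_total = sum(rounded_percentages)

print(rounded_percentages)                       # each entry is 33.3
print(f"rounded total = {rounded_total:.1f}%")   # not exactly 100%
```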
Basic Ex:-
COLOURS FREQUENCY Relative Frequency
Blue 2 0.2
Green 2 0.2
Purple 1 0.1
Red 4 0.4
Yellow 1 0.1
Grand Total 10 1
The category with the widest slice has the highest frequency. Thus the mode of the
dataset is the category "Red".
If two or more categories share the same highest frequency, then the data
is called bimodal (in the case of two) or multimodal (more than two).
In the above bar chart, both categories "Red" and "Blue" have the highest frequency.
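The idea of a mode, including the bimodal case, can be sketched as follows. The function name `modes` is illustrative, the tied list is hypothetical, and returning an empty list when every value occurs exactly once follows the "no mode" convention these notes use later.

```python
from collections import Counter

def modes(values):
    """Return every value that attains the highest frequency."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1 and len(counts) > 1:
        return []                      # every value occurs once: no mode
    return sorted(v for v, c in counts.items() if c == highest)

colours = ["Red", "Blue", "Green", "Red", "Red",
           "Blue", "Yellow", "Green", "Red", "Purple"]
print(modes(colours))                                    # one mode

# Hypothetical data in which Red and Blue tie for the highest frequency.
print(modes(["Red", "Blue", "Red", "Blue", "Green"]))    # bimodal
```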
• Imagine you have a list of ordinal data representing the education levels of
a group of people: "High School," "Associate's Degree," "Bachelor's Degree,"
"Master's Degree," and "Ph.D."
• If you arrange these education levels in order from the least to the most
advanced, it would look like this:
1. High School
2. Associate's Degree
3. Bachelor's Degree
4. Master's Degree
5. Ph.D.
The median, in this case, would be the education level that falls right in the
middle when you have an odd number of observations. i.e. " Bachelor's Degree ".
Note: Median can be defined only for ordinal data whereas mode can be defined
for both nominal as well as ordinal data.
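For ordinal data the median needs only the ordering, not arithmetic. A minimal sketch, assuming one observation per education level as in the example above (the `rank` dictionary is an illustrative helper, not something from the notes):

```python
# Education levels listed from least to most advanced.
levels = ["High School", "Associate's Degree", "Bachelor's Degree",
          "Master's Degree", "Ph.D."]
rank = {level: position for position, level in enumerate(levels)}

# Five observations, deliberately given out of order.
observations = ["Ph.D.", "High School", "Master's Degree",
                "Bachelor's Degree", "Associate's Degree"]

ordered = sorted(observations, key=lambda level: rank[level])
median = ordered[len(ordered) // 2]    # middle item for an odd count
print(median)
```

A nominal variable has no such ordering, which is why only the mode is defined for it.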
( VARIABLE TYPE )
First group the observation into classes ( categories ) and then treat the classes as
the distinct values of quantitative data .
Now we can construct frequency and relative-frequency (R.F.) distributions of the data.
If the data set contains only a relatively small number of different values then it is
convenient to represent it in a frequency table .
Each class represents a distinct value along with its frequency of occurrence .
For Example :- The dataset reports which colour each person likes in a
group of 12 people.
- Organize the data into a number of classes to make the data understandable.
- But there are a few guidelines that need to be followed :-
• A class interval contains its left-end boundary point but not its right-end boundary point.
Suppose there is an interval 30 - 40 :
[ 30 , 40 ) i.e. 30 is included in the interval but 40 is not included,
as 40 is the lower limit of the next higher class.
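The left-closed, right-open convention can be illustrated with integer division. This is a sketch under the assumption of equal class widths of 10; the marks are hypothetical and chosen only to show what happens at the boundaries.

```python
from collections import Counter

def class_interval(value, width=10):
    """Return the [lower, upper) class that contains value."""
    lower = (value // width) * width
    return (lower, lower + width)

# Hypothetical marks sitting on and around the class boundaries.
marks = [30, 34, 39, 40, 41, 49, 50]

table = Counter(class_interval(m) for m in marks)
for (lower, upper), count in sorted(table.items()):
    print(f"[{lower} - {upper})   {count}")
```

Note that 40 is counted in [40 - 50), not in [30 - 40).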
Example :-
Note : - If the data are all two-digit numbers, then we could let the stem
of a data value be the tens digit and the leaf be the ones digit.
So , 75 is represented as follows :-
Stem Leaf
7 5
Stem Leaf
7 58
- In each observation :
Stem - consists of all but the rightmost digit.
Leaf - the rightmost digit.
Ages of Patients :- 10 , 15 , 22 , 23 , 25 , 28 , 29 , 31 , 36 , 45 , 48

Stem   Leaf
1      0 5          ----> 10 15
2      2 3 5 8 9    ----> 22 23 25 28 29
3      1 6          ----> 31 36
4      5 8          ----> 45 48
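A stem-and-leaf display for two-digit data can be built by splitting each value into its tens digit and ones digit. The sketch below uses the patient ages as read off the stem-and-leaf rows above; the variable names are illustrative.

```python
from collections import defaultdict

# Patient ages, as read off the stem-and-leaf rows.
ages = [10, 15, 22, 23, 25, 28, 29, 31, 36, 45, 48]

stems = defaultdict(list)
for age in sorted(ages):
    stem, leaf = divmod(age, 10)   # stem = tens digit, leaf = ones digit
    stems[stem].append(leaf)

print("Stem  Leaf")
for stem in sorted(stems):
    print(f"{stem:<5} {' '.join(str(leaf) for leaf in stems[stem])}")
```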
2. Measures of Dispersion :-
A. MEAN :-
The most commonly used measure of central tendency is the mean.
The mean of a data set is the sum of the observations divided by the
number of observations.
1) Sample Mean : $\bar{x} = \frac{x_1 + x_2 + \dots + x_n}{n}$
2) Population Mean : $\mu = \frac{x_1 + x_2 + \dots + x_N}{N}$
Value ( Xi )   FREQUENCY ( Fi )   Fi Xi
1              2                  2
2              3                  6
3              5                  15
4              4                  16
5              1                  5
Grand Total    n = 15             Sum = 44
Mean = Sum / n = 44 / 15 ≈ 2.93
f ---> is the frequency.
m ---> is the mid-point of the class interval.
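The frequency-table calculation can be reproduced directly. This is a minimal sketch using the table values above; for grouped data the same sum would use the class mid-points m in place of the values.

```python
values = [1, 2, 3, 4, 5]          # Xi
frequencies = [2, 3, 5, 4, 1]     # Fi

n = sum(frequencies)                                          # number of observations
weighted_sum = sum(f * x for f, x in zip(frequencies, values))  # sum of Fi * Xi
mean = weighted_sum / n

print(f"n = {n}, sum = {weighted_sum}, mean = {mean:.2f}")
```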
- 68 , 79 , 38 , 68 , 35 , 70 , 61 , 47 , 58 , 66 ----> mean = 590 / 10 = 59
- Suppose the teacher has decided to add 5 marks to each student's marks.
- Then the final data is
73 , 84 , 43 , 73 , 40 , 75 , 66 , 52 , 63 , 71 ----> mean = 640 / 10 = 64 = 59 + 5
Median Observation =
Median Observation = 10
Median Observation =
- 68 , 79 , 38 , 70 , 61 , 47 , 58 , 66
- 38 , 47 , 58 , 61 , 66 , 68 , 70 , 79 ---> median = ( 61 + 66 ) / 2 = 63.5
- Suppose the teacher has decided to add 5 marks to each student's marks.
- Then the final data is
- 43 , 52 , 63 , 66 , 71 , 73 , 75 , 84 ----> median = ( 66 + 71 ) / 2 = 68.5 = 63.5 + 5
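The effect of adding a constant on the mean and the median can be verified with Python's standard `statistics` module. The sketch uses the ten marks from the mean example and the eight marks from the median example.

```python
import statistics

# Ten marks used in the mean example.
marks = [68, 79, 38, 68, 35, 70, 61, 47, 58, 66]
# Eight marks used in the median example.
median_marks = [68, 79, 38, 70, 61, 47, 58, 66]

mean_before = statistics.mean(marks)
mean_after = statistics.mean([m + 5 for m in marks])

median_before = statistics.median(median_marks)
median_after = statistics.median([m + 5 for m in median_marks])

print(mean_before, mean_after)       # the mean shifts by 5
print(median_before, median_after)   # the median shifts by 5
```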
- Example :- 2 , 105 , 5 , 7 , 6 , 3
----> no Mode
• Adding a Constant
- 68 , 79 , 38 , 68 , 35 , 70 , 61 , 47 , 58 , 66 ----> Mode is 68
- Suppose the teacher has decided to add 5 marks to each student's marks.
- Then the final data is
73 , 84 , 43 , 73 , 40 , 75 , 66 , 52 , 63 , 71 ----> Mode is 73
So we can observe that
new mode = old mode + 5 = 68 + 5 = 73
• Multiplying a Constant
- Suppose each mark is multiplied by 0.4. Then the final data is
27.2 , 31.6 , 15.2 , 27.2 , 14 , 28 , 24.4 , 18.8 , 23.2 , 26.4 --> mode is 27.2
So we can observe that
new mode = old mode × 0.4 = 68 × 0.4 = 27.2
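The same check works for the mode. A minimal sketch with `statistics.mode`, using the ten marks of the example (the value 35 is included because the shifted and scaled lists both contain its image, 40 and 14):

```python
import statistics

marks = [68, 79, 38, 68, 35, 70, 61, 47, 58, 66]

mode_original = statistics.mode(marks)
mode_shifted = statistics.mode([m + 5 for m in marks])     # add 5 to every mark
mode_scaled = statistics.mode([m * 0.4 for m in marks])    # multiply every mark by 0.4

print(mode_original, mode_shifted, round(mode_scaled, 1))
```

The scaled mode is rounded for display because 68 × 0.4 is not stored exactly in floating point.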
1. Range
2. Variance
3. Standard Deviation
4. Interquartile Range
- Examples : -
1 ) Find the range of the dataset 1 , 2 , 3 , 4 , 5 .
----> Here ,
max value :- 5
min value :- 1
Range = 5 - 1 = 4
Range = 15 - 1 = 14
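The range is simply the maximum minus the minimum. In the sketch below the first dataset is the one from the example; the second dataset is hypothetical, chosen so that a single large value gives a range of 14.

```python
def data_range(data):
    """Range = largest value - smallest value."""
    return max(data) - min(data)

print(data_range([1, 2, 3, 4, 5]))     # dataset from the example
print(data_range([1, 2, 3, 4, 15]))    # hypothetical dataset with one large value
```

One extreme value is enough to change the range a lot, which is why the range is sensitive to outliers.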
Imagine you and your friends have just played several games of basketball and each
scored a different number of points. Now, you want to understand how much each person’s
score differs from the average score. This is where variance comes in!
Variance is a statistical measurement that describes how much individual numbers in a data
set differ from the average value
Here’s how it works:
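Population Variance : $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
Sample Variance : $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$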
Example: Consider the dataset 68, 79, 38, 68, 35, 70, 61, 47, 58, 66
Mean = 590 / 10 = 59

Xi     Xi + 5   Xi - 59   ( Xi + 5 ) - 64   (( Xi + 5 ) - 64)² = ( Xi - 59 )²
68     73       9         9                 81
79     84       20        20                400
38     43       -21       -21               441
68     73       9         9                 81
35     40       -24       -24               576
70     75       11        11                121
61     66       2         2                 4
47     52       -12       -12               144
58     63       -1        -1                1
66     71       7         7                 49
Sum :  590      640

Sum of (( Xi + 5 ) - 64)² = 1898
Mean ( Xi ) = 59
Mean ( Xi + 5 ) = 59 + 5 = 64
Xi     Xi × 5   Xi - 59   ( Xi × 5 ) - 295   (( Xi × 5 ) - 295)²
68     340      9         45                 2025
79     395      20        100                10000
38     190      -21       -105               11025
68     340      9         45                 2025
35     175      -24       -120               14400
70     350      11        55                 3025
61     305      2         10                 100
47     235      -12       -60                3600
58     290      -1        -5                 25
66     330      7         35                 1225
Sum :  590      2950

Sum of (( Xi × 5 ) - 295)² = 47450
Mean ( Xi ) = 59
Mean ( Xi × 5 ) = 59 × 5 = 295
The quantity which is the square root of the Sample Variance is the
Sample Standard Deviation.
- Formula :- $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
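The two tables can be reproduced numerically: adding 5 leaves the sum of squared deviations unchanged, while multiplying by 5 multiplies it by 25. The sketch below uses the sample (n - 1) formulas from the standard `statistics` module; the helper name is illustrative.

```python
import math
import statistics

marks = [68, 79, 38, 68, 35, 70, 61, 47, 58, 66]
shifted = [x + 5 for x in marks]    # add a constant
scaled = [x * 5 for x in marks]     # multiply by a constant

def sum_of_squared_deviations(data):
    mean = sum(data) / len(data)
    return sum((x - mean) ** 2 for x in data)

for name, data in [("Xi", marks), ("Xi + 5", shifted), ("Xi x 5", scaled)]:
    ss = sum_of_squared_deviations(data)
    variance = statistics.variance(data)    # sample variance, divides by n - 1
    std_dev = statistics.stdev(data)        # sample standard deviation
    print(f"{name:<7} SS = {ss:8.0f}  variance = {variance:9.2f}  sd = {std_dev:6.2f}")
```

So adding a constant does not change the variance or the standard deviation, while multiplying by c multiplies the variance by c² and the standard deviation by |c|.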
The sample 100p percentile is a data value such that :
- At least 100p percent of the data values are less than or equal to it.
- At least 100( 1 - p ) percent of the data values are greater than or equal to it.
---> This means that if we subtract the percentile rank (in decimal form) from 1 and
multiply the result by 100, we get the percentage of data values that are equal to
or greater than this value.
In simple terms, the 100p percentile tells us where a particular value stands w.r.t.
the other data values.
It is just like saying "This value is higher than X % of all values and lower than Y %".
[Figure: the percentiles P1, P2, P3, ..., P97, P98, P99 divide the data into parts of 1% each]
3) If np is an integer then find the average of the values in position np and np+1
Example :- [35, 38, 47, 58, 61, 66, 68, 68, 70, 79]
n = 10 & p = 0.50 ---> because we have to find the 50th percentile
np = 10 × 0.50 = 5
Since np is an integer,
we need to take the average of the 5th observation ( 61 ) and the 6th observation ( 66 ).
50th percentile = ( 61 + 66 ) / 2 = 63.5
[Figure: the entire data from Min to Max, marked at the 1st quartile ( 25 % of data ),
the 2nd quartile or median ( 50 % of data ) and the 3rd quartile ( 75 % of data )]
- The Interquartile range , IQR is the difference between the 1st & 3rd quartile.
IQR = Q3 - Q1
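The percentile rule and the IQR can be sketched together. The notes only state the case where np is an integer; for the non-integer case this sketch assumes the common textbook convention of taking the value at the next position above np, which may differ from what spreadsheet or library functions return.

```python
import math

def percentile(data, p):
    """100p percentile for 0 < p < 1, using the position rule described in the lead-in."""
    ordered = sorted(data)
    position = len(ordered) * p               # np
    if position == int(position):
        k = int(position)
        # np is an integer: average the values in positions np and np + 1.
        return (ordered[k - 1] + ordered[k]) / 2
    # Assumption: otherwise take the value in the next position above np.
    return ordered[math.ceil(position) - 1]

data = [35, 38, 47, 58, 61, 66, 68, 68, 70, 79]

median = percentile(data, 0.50)
q1 = percentile(data, 0.25)
q3 = percentile(data, 0.75)
iqr = q3 - q1                                  # IQR = Q3 - Q1

print(median, q1, q3, iqr)
```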
• It shows the distribution of one variable in rows and another in columns, and is used
to study the association between the two variables.
• For example :- Suppose you have two categorical variables, gender (Male / Female) and
smartphone ownership (YES / NO), and you have to find out whether ownership of a
smartphone is associated with gender among 100 students. A contingency table
classifies the outcomes of one variable in rows and of the other in columns.
Own a Smartphone
GENDER         NO    YES   Row Total
Female         10    34    44
Male           14    42    56
Column Total   24    76    100
( The number of males and females among the 100 students is represented in the rows. )
- If all the variables are nominal, then the order in which their categories are
listed doesn't matter.
Ex :- YES & NO / Gender ( Male / Female ).
- But if the variables are ordinal, then the order in which they are listed really matters.
Ex :- High , Medium , Low
Result : -
GENDER NO YES Row Total
HIGH 2 18 20
MEDIUM 27 39 66
LOW 9 5 14
Column Total 38 62 100
A. Find the relative frequency of YES ( Female & Male ) and NO ( Female & Male )
w.r.t. their total population ( the row total ).
No ( Male ) = 14 / 56 = 0.25 ; No ( Male ) = 25 %
Q. Find the proportion of females ( No & Yes ) w.r.t. the total population that said No & Yes,
and the proportion of males ( No & Yes ) w.r.t. the total population that said No & Yes.
• If the relative row / column frequencies are the same for all rows / columns, then the
two variables are not associated with each other.
• If the relative row / column frequencies differ across rows / columns, then
the two variables are associated with each other.
• Row Relative Freq
- You can observe that there is no major difference between the rows, as the RF in each
row lies between 22-28% ( NO ) and 72-78% ( YES ), so the variables aren't associated
with each other.
• Column Relative Freq
- You can observe that there is no major difference between the columns, as the RF in
each column lies between 41-59%, so the variables aren't associated with each other.
- When there is a major difference between the rows ( or between the columns ), i.e. the
relative frequencies do not lie in the same range, the variables are associated.
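The row and column relative frequencies can be computed for both contingency tables in these notes to see the contrast. This is a sketch; the function names and the dictionary layout are illustrative choices.

```python
def row_relative(table):
    """Each cell divided by its row total."""
    result = {}
    for row, counts in table.items():
        total = sum(counts.values())
        result[row] = {col: n / total for col, n in counts.items()}
    return result

def column_relative(table):
    """Each cell divided by its column total."""
    columns = list(next(iter(table.values())))
    totals = {col: sum(counts[col] for counts in table.values()) for col in columns}
    return {row: {col: n / totals[col] for col, n in counts.items()}
            for row, counts in table.items()}

# Gender vs. smartphone ownership (first table in the notes).
gender = {"Female": {"NO": 10, "YES": 34}, "Male": {"NO": 14, "YES": 42}}
# HIGH / MEDIUM / LOW vs. smartphone ownership (second table in the notes).
level = {"HIGH": {"NO": 2, "YES": 18}, "MEDIUM": {"NO": 27, "YES": 39},
         "LOW": {"NO": 9, "YES": 5}}

for name, table in [("gender", gender), ("level", level)]:
    print(name)
    for row, shares in row_relative(table).items():
        print("  ", row, {col: f"{100 * s:.2f}%" for col, s in shares.items()})
```

In the gender table the NO share is similar in both rows, whereas in the HIGH / MEDIUM / LOW table it changes sharply from row to row, which suggests association.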
A Stacked Bar Chart also represents the counts for a category. In
addition, each bar is further broken down into smaller segments,
each segment representing the frequency of a particular category of the second variable.
It is also referred to as a Segmented Bar Chart.
Segments
Own a Smartphone
GENDER NO YES Row Total
HIGH 2 18 20
MEDIUM 27 39 66
LOW 9 5 14
Column Total 38 62 100
Own a Smartphone
GENDER NO YES Row Total
HIGH 10.00% 90.00% 20
MEDIUM 40.91% 59.09% 66
LOW 64.29% 35.71% 14
Column Total 38.00% 62.00% 100
I - Graph II - Graph
• In graph ( I ), the highlighted points show that the price of a house increases
with its size, i.e. Price ( variable ) is associated with Size ( variable ).
• In graph ( II ), the highlighted points show that the price of a house is not
associated with its size, because there are smaller houses with higher prices
and larger houses with lower prices.
[Figure: two scatter plots, one showing an up-trend ( linear ) and one showing a down-trend ( linear )]
• Variable
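In simple words, if you have two variables, let's call them X & Y,
covariance tells you how changes in X are associated with changes in Y.
• Formulas :-
Population Covariance = $\frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{N}$
Sample Covariance = $\frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$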
AGE ( X )   HEIGHT ( Y )   DEVIATION of X   DEVIATION of Y
1           75             -2               -17.6
2           85             -1               -7.6
3           94             0                1.4
4           101            1                8.4
5           108            2                15.4
Mean :      3              92.6

- Small values of X go with small values of Y ---> the deviations have the same sign.
- Large values of X go with large values of Y ---> the deviations have the same sign.
As we know
Covariance ( +ve ) ---> when X is increasing, Y also tends to be increasing.
AGE ( X )   HEIGHT ( Y )   DEVIATION of X   DEVIATION of Y
1           6              -2               2
2           5              -1               1
3           4              0                0
4           3              1                -1
5           2              2                -2
Mean :      3              4

- Small values of X go with large values of Y ---> the deviations differ in sign.
- Large values of X go with small values of Y ---> the deviations differ in sign.
As we know
Covariance ( -ve ) ---> when X is increasing, Y tends to be decreasing.
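Both covariance examples can be reproduced with a short function that follows the sample covariance definition (dividing by n - 1). The function name is illustrative.

```python
def sample_covariance(xs, ys):
    """Sum of products of deviations, divided by n - 1."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    products = [(x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)]
    return sum(products) / (n - 1)

age = [1, 2, 3, 4, 5]
height = [75, 85, 94, 101, 108]    # Y increases with X
second_y = [6, 5, 4, 3, 2]         # Y decreases as X increases

print(f"{sample_covariance(age, height):.2f}")     # positive covariance
print(f"{sample_covariance(age, second_y):.2f}")   # negative covariance
```

The sign of the result matches the sign pattern of the deviations in the two tables.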
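Formula :- $r = \frac{\text{cov}(x, y)}{s_x s_y} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum (x_i - \bar{x})^2 \sum (y_i - \bar{y})^2}}$
[Figure: two scatter plots of y against x, one for r near -1 and one for r near +1]
For the Age / Height data :
$s_x = \sqrt{10 / 4}$ = 1.58
$s_y = \sqrt{677.2 / 4}$ = 13.01
$r = \frac{82}{\sqrt{10 \times 677.2}}$ = 0.9964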
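X     Y     ( X - 3 )   ( X - 3 )²   ( Y - 4 )   ( Y - 4 )²   ( X - 3 )( Y - 4 )
1     6     -2          4            2           4            -4
2     5     -1          1            1           1            -1
3     4     0           0            0           0            0
4     3     1           1            -1          1            -1
5     2     2           4            -2          4            -4
Mean of X = 3 , Mean of Y = 4 , Sum of products = -10
$s_x = \sqrt{10 / 4}$ = 1.58
$s_y = \sqrt{10 / 4}$ = 1.58
$r = \frac{-10}{\sqrt{10 \times 10}}$ = -1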
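The two correlation results can be checked with a direct implementation of r as the sum of products of deviations divided by the square root of the product of the two sums of squared deviations. The function name is illustrative.

```python
import math

def pearson_r(xs, ys):
    """Correlation coefficient computed from deviations about the means."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sum_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sum_xx = sum((x - mean_x) ** 2 for x in xs)
    sum_yy = sum((y - mean_y) ** 2 for y in ys)
    return sum_xy / math.sqrt(sum_xx * sum_yy)

age = [1, 2, 3, 4, 5]
height = [75, 85, 94, 101, 108]
second_y = [6, 5, 4, 3, 2]

print(f"{pearson_r(age, height):.4f}")     # close to +1: strong up-trend
print(f"{pearson_r(age, second_y):.4f}")   # exactly on a falling line
```

The n - 1 factors in the covariance and in the two standard deviations cancel, so this form gives the same r as dividing the sample covariance by s_x s_y.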
In the first graph ( Age / Height data ) ;
Equation of the line : Height = 8.2 ( Age ) + 68
In the second graph ;
Equation of the line : Height = -1 ( Age ) + 7
So , r = = = 1
• As Y is a dichotomous variable,
code one category as Y = 0 and the other as Y = 1,
just to group the observations.
• Calculate the mean values of these two groups separately
• Find P0 & P1
P0 =
P1 =
• Formula : -
rpb =
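Since the formulas did not survive here, the sketch below uses one common form of the point-biserial correlation, stated as an assumption rather than as the notes' own formula: $r_{pb} = \frac{\bar{x}_1 - \bar{x}_0}{s_x} \sqrt{p_0 p_1}$, where $\bar{x}_0$ and $\bar{x}_1$ are the means of the numerical variable in the groups Y = 0 and Y = 1, $p_0$ and $p_1$ are the proportions of cases in each group, and $s_x$ is the standard deviation of X computed with divisor n. The data are hypothetical.

```python
import math
import statistics

def point_biserial(xs, ys):
    """Point-biserial correlation between numeric xs and 0/1-coded ys (assumed form)."""
    group0 = [x for x, y in zip(xs, ys) if y == 0]
    group1 = [x for x, y in zip(xs, ys) if y == 1]
    p0 = len(group0) / len(xs)          # proportion of cases with Y = 0
    p1 = len(group1) / len(xs)          # proportion of cases with Y = 1
    s_x = statistics.pstdev(xs)         # standard deviation with divisor n
    mean_difference = statistics.mean(group1) - statistics.mean(group0)
    return mean_difference / s_x * math.sqrt(p0 * p1)

# Hypothetical data: a numeric score X and a yes/no variable coded as Y = 0 / 1.
scores = [2, 4, 6, 8, 10, 12]
coded = [0, 0, 0, 1, 1, 1]

r_pb = point_biserial(scores, coded)
print(f"{r_pb:.4f}")
```

With this form, the result equals the ordinary correlation coefficient computed on X and the 0/1-coded Y; swapping which category is coded 1 only flips the sign.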