Chapter-3: Statistical Analysis

CHAPTER-3
STATISTICAL ANALYSIS
TRP: An example
Television rating point (TRP) is a tool used to judge which programmes are viewed the most. It
gives an index of people's choices and of the popularity of a particular channel.
For the calculation, a device is attached to the TV sets in a few thousand viewers' homes across
different geographic and demographic sectors.
◦ The device is called a People's Meter. It records the time and the programme that a viewer
watches on a particular day for a certain period.
An average is then taken over a period, for example 30 days.
This can be further augmented with a Personal Interview Survey (PIS), which becomes the
basis for many studies and decisions.
Essentially, estimating TRP means analyzing data.
STATISTICS PREPARED BY CHITRAPRIYA N., CSE DEPT 2

Defining Data
A set of data is a collection of observed values representing one or more
characteristics of some objects or units.
Example: For TRP, the collected data consists of the following attributes.
◦ Age: A viewer’s age in years
◦ Sex: A viewer’s gender coded 1 for male and 0 for female
◦ Happy: A viewer’s general happiness
◦ NH for not too happy
◦ PH for pretty happy
◦ VH for very happy
◦ TVHours: The average number of hours a respondent watched TV during a day

Defining Data
Viewer# Age Sex Happy TVHours
… … … … …
… … … … …
55 34 F VH 5
… … … … …
Note:
 A data set is composed of information from a set of units.
 Information from a unit is known as an observation.
 An observation consists of one or more pieces of information about a unit; these are called
variables.
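A minimal Python sketch of these terms, assuming a hypothetical `Viewer` record type: each record is one observation and each field is one variable. Only viewer 55 comes from the table above; the other two rows are made-up values for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Viewer:
    """One observation: the variables recorded for a single unit (viewer)."""
    viewer_id: int
    age: int          # years
    sex: str          # "M" or "F", as written in the table
    happy: str        # "NH", "PH" or "VH"
    tv_hours: float   # average hours of TV per day


# The data set is the collection of observations.
data_set = [
    Viewer(55, 34, "F", "VH", 5),   # row shown in the table
    Viewer(56, 41, "M", "PH", 3),   # hypothetical row
    Viewer(57, 27, "F", "PH", 2),   # hypothetical row
]

# A simple summary: how often each happiness level occurs.
happy_counts = Counter(v.happy for v in data_set)
print(happy_counts)
```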
Defining Population
 A population is a data set representing all the entities of interest.

 A parameter is a numerical characteristic of the population.
Example: All TV viewers in the country/world.
Note:
1. The set of all people in the country/world is not the population here; only the TV viewers are.
2. For different surveys, the population may be completely different.
3. For statistical learning, it is important to define very carefully the population that we
intend to study.
Defining Sample
A sample is a data set consisting of a portion of a population.
A sample statistic is a numerical characteristic of the sample.
Example: All students studying in Class XII form a population, whereas the Class XII students of a
given school form a sample.
Note:
◦ Normally, in inferential statistics, a sample is obtained in such a way as to be representative
of the population.
◦ An unbiased sampling procedure is one in which each observation in the population has an
equal chance of being chosen for the sample.
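As a rough illustration of an unbiased draw, the sketch below takes a simple random sample with Python's standard `random` module. The population of 1000 numbered units and the sample size of 50 are assumptions made only for this example.

```python
import random

# Hypothetical population: 1000 units identified by the numbers 1..1000.
population = list(range(1, 1001))

rng = random.Random(42)               # fixed seed so the draw is repeatable
sample = rng.sample(population, 50)   # every unit is equally likely to be chosen

population_mean = sum(population) / len(population)   # a parameter
sample_mean = sum(sample) / len(sample)               # a sample statistic
print(population_mean, sample_mean)
```

The sample mean will usually be close to, but not equal to, the population mean; that gap is what inferential statistics has to account for.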

Introduction to Statistics
Statistics is a discipline defining a set of procedures used to collect and
interpret numerical data.
The discipline of statistics serves two purposes:
1. Statistical procedures can be used to describe the relevant characteristics
(dispersion, central tendency) of a body of data. This is called Descriptive
Statistics.
2. Statistical procedures can be utilized to help us make inferences or predictions
about a whole population based on information from a sample of the population.
This is called Inferential Statistics.

Defining Statistics
A statistic is a quantity calculated from data that describes a particular
characteristic of a sample.
Example: Inferential statistics in the context of TRP
◦ What is the overall frequency of the various levels of happiness?
◦ Is there a relationship between the age of a viewer and his/her general
happiness?
◦ Is there a relationship between the age of the viewer and the number of TV
hours watched?

Data Summarization
 To identify the typical characteristics of data (i.e., to have an overall picture).
 To identify which data should be treated as noise or outliers.
 The data summarization techniques can be classified into two broad
categories:
◦ Measures of location /centrality
◦ Measures of dispersion

Measurement of location/Centrality
◦ It is also called measuring the central tendency.
◦ A function of the sample values that summarizes the location information
into a single number is known as a measure of location.
◦ The most popular measures of location are

◦ Mean
◦ Median
◦ Mode
◦ Midrange

Mean of a sample

The mean (arithmetic average) is the most common measure of centrality.
The mean of a sample is denoted $\bar{x}$. The mean of the population is
denoted $\mu$.
The different kinds of mean are:
Simple mean
Weighted mean
Trimmed mean

Simple mean of a sample

It is also called the arithmetic mean or average, abbreviated as AM.
 If $x_1, x_2, x_3, \ldots, x_n$ are the sample values, the simple mean is defined as
$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

Weighted mean of a sample

It is also called the weighted arithmetic mean or weighted average.
 When each sample value $x_i$ is associated with a weight $w_i$, for i = 1,2,…,n, it
is defined as
$$\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$
Note
When all weights are equal, the weighted mean reduces to the simple mean.

Trimmed mean of a sample
If there are extreme values (also called outliers) in a sample, then the mean is
influenced greatly by those values.
To offset the effect of those extreme values, we can use the
trimmed mean.
The trimmed mean is defined as the mean obtained after chopping off values at
the high and low extremes.
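The three kinds of mean can be sketched in a few lines of Python. The function names and the sample values are illustrative assumptions; in particular, trimming a fixed count `k` from each end is one common convention, not necessarily the one the slides had in mind.

```python
def simple_mean(values):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(values) / len(values)


def weighted_mean(values, weights):
    """Sum of w_i * x_i divided by the sum of the weights."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


def trimmed_mean(values, k):
    """Drop the k smallest and k largest values, then average the rest."""
    ordered = sorted(values)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


sample = [8, 11, 5, 9, 7, 6, 3616]          # one extreme value
print(simple_mean(sample))                  # pulled far upward by 3616
print(trimmed_mean(sample, 1))              # 5 and 3616 are dropped first
print(weighted_mean([2, 4, 6], [1, 1, 1]))  # equal weights: same as simple mean
```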

Properties of mean
 If $\bar{x}_i$, i = 1,2,…,m are the means of m samples of sizes $n_1, n_2, \ldots, n_m$ respectively, then the mean of
the combined sample is given by
$$\bar{x} = \frac{\sum_{i=1}^{m} n_i \bar{x}_i}{\sum_{i=1}^{m} n_i}$$
(Distributive Measure)
 If a new observation $x_{n+1}$ is added to a sample of size n with mean $\bar{x}$, the new mean is given by
$$\bar{x}_{new} = \frac{n\bar{x} + x_{n+1}}{n + 1}$$
 If an existing observation $x_k$ is removed from a sample of size n with mean $\bar{x}$, the new mean is
given by
$$\bar{x}_{new} = \frac{n\bar{x} - x_k}{n - 1}$$
 If m observations with mean $\bar{x}_m$ are added to (removed from) a sample of size n with mean $\bar{x}$,
then the new mean is given by
$$\bar{x}_{new} = \frac{n\bar{x} \pm m\bar{x}_m}{n \pm m}$$
 If a constant c is subtracted from (or added to) each sample value, then the mean of the
transformed variable is linearly displaced by c. That is,
$$\bar{x}_{new} = \bar{x} \mp c$$
 If each observation is scaled by multiplying (dividing) by a non-zero constant c, then the
altered mean is given by
$$\bar{x}_{new} = \bar{x} * c$$
 where * is the × (multiplication) or ÷ (division) operator.
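These update rules are useful because the mean can be maintained without revisiting the raw data. A small Python sketch, with function names and sample values chosen only for illustration:

```python
def combined_mean(means, sizes):
    """Mean of several samples pooled together, from their means and sizes."""
    return sum(n * m for m, n in zip(means, sizes)) / sum(sizes)


def mean_after_adding(mean, n, new_value):
    """New mean after one observation is added to a sample of size n."""
    return (n * mean + new_value) / (n + 1)


def mean_after_removing(mean, n, old_value):
    """New mean after one observation is removed from a sample of size n."""
    return (n * mean - old_value) / (n - 1)


a = [3, 4, 6]
b = [7, 9]
mean_a = sum(a) / len(a)
mean_b = sum(b) / len(b)

pooled = combined_mean([mean_a, mean_b], [len(a), len(b)])
print(pooled, sum(a + b) / len(a + b))    # both routes give the same mean

grown = mean_after_adding(pooled, 5, 10)  # add the value 10
print(grown)
print(mean_after_removing(grown, 6, 10))  # remove it again
```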

Relation between mean, median and
mode
 A given set of data can be categorized into three categories:-
 Symmetric data
 Positively skewed data
 Negatively skewed data

Symmetric data
For symmetric data, the mean, median and mode all lie at the same point.
A symmetric distribution is one in which the side of the
distribution to the right of the mean is a mirror image of the left portion.

Positively skewed/Skewed Right data
 Here, mode occurs at a value smaller than the median.

Negatively skewed /Skewed Left data
 Here, mode occurs at a value greater than the median.
 Mean < median, with a long left tail and a short right tail.

NOTES:
The value of the mean is sensitive to the outliers in a data set whereas the
median is not.
The median will equal the mean if the distribution of the data is symmetric.
If a distribution is skewed to the right, outlier values in the data much larger
than the mean are pulling the value of the mean above the median.
If a distribution is skewed to the left, outlier values in the data much smaller
than the mean are pulling the value of the mean below the median.
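The ordering of mode, median and mean under skew can be checked on small data sets with Python's standard `statistics` module. The two lists below are invented purely to show a long right tail and a long left tail.

```python
from statistics import mean, median, mode

right_skewed = [1, 2, 2, 2, 3, 4, 5, 9, 17]       # a few large values
left_skewed = [1, 9, 13, 14, 15, 16, 16, 16, 17]  # a few small values

for name, data in [("right-skewed", right_skewed), ("left-skewed", left_skewed)]:
    print(name, "mode:", mode(data), "median:", median(data), "mean:", mean(data))
```

In the first list the large values pull the mean above the median; in the second the small values pull it below.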

Measures of dispersion/Variability
 Location measures alone are insufficient to understand data.
 Another set of commonly used summary statistics for continuous data are those
that measure the dispersion.
 A dispersion measure quantifies the extent of spread of the observations in a sample.
 Some important measures of dispersion are:
◦ Range
◦ Variance and Standard Deviation
◦ Mean Absolute Deviation (MAD)
◦ Absolute Average Deviation (AAD)
◦ Interquartile Range (IQR)

Range of a sample

It is the simplest measure of variability.
 Let X = {$x_1, x_2, x_3, \ldots, x_n$} be n sample values that are arranged in increasing order.
 The range R of these samples is then defined as:
 R = max(X) – min(X) = $x_n - x_1$
 EXAMPLE:
 In {4, 6, 9, 3, 7}
Increasing Order: X = {3, 4, 6, 7, 9}
Min(X) = 3
Max(X) = 9
Range = (9 - 3) = 6
 NOTE:
The range can be misleading if most of the values are concentrated in a narrow band of values, but
there are also a relatively small number of more extreme values.
Example: In {8, 11, 5, 9, 7, 6, 3616}:
The range is 3616 − 5 = 3611.
The single value of 3616 makes the range large, but most values are around 10.

Variance and Standard Deviation
 In statistics, variance measures the spread of a data set.
 It quantifies how far the numbers in the
data set are from the mean.
 A variance of zero means that all of the values within a
data set are identical; any non-zero variance is
positive.
 A large variance means that the numbers in a set are far from the
mean and from each other. A small variance means that the numbers
are closer together in value.

Variance and Standard Deviation

Let X = {$x_1, x_2, x_3, \ldots, x_n$} be n sample values.
 Then the variance, denoted $\sigma^2$, is defined as
$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2$$
 where $\bar{x}$ denotes the mean of the sample.
 That is, the variance is the average of the squared differences from the mean.
 The standard deviation, $\sigma$, of the samples is the square root of the variance:
$$\sigma = \sqrt{\sigma^2}$$
 The coefficient of variation expresses the standard deviation as a percentage of
the mean:
$$CV = \frac{\sigma}{\bar{x}} \times 100\%$$
 The coefficient of variation is useful in comparing the variation of data sets
that have different means.

Example
You and your friends have just measured the heights of your dogs (in millimeters):
The heights (at the shoulders) are: 600mm, 470mm, 170mm, 430mm and 300mm.
Find out the Mean, the Variance, and the Standard Deviation.

Example Solution:
1. Find the mean first:

Mean = (600 + 470 + 170 + 430 + 300)/5
= (1970)/5
= 394
2. To calculate the Variance, take each difference, square it, and then average the result:
Variance
σ² = (206² + 76² + (−224)² + 36² + (−94)²)/5
= (42436 + 5776 + 50176 + 1296 + 8836)/5
= (108520)/5
= 21704

Example Solution:
And the Standard Deviation is just the square root of Variance, so:
Standard Deviation
σ = √21704
= 147.32...
= 147 (to the nearest mm)
So, using the Standard Deviation we have a "standard" way of knowing what is normal, and what is
extra large or extra small.

NOTE:
If the data is a sample (a selection taken from a bigger population), then the
calculation changes!
When you have "N" data values that are:
◦ The population: divide by N when calculating the variance
◦ A sample: divide by N-1 when calculating the variance
Example: if our 5 dogs are just a sample of a bigger population of dogs, we divide by 4 instead
of 5 like this:
Sample Variance = 108,520 / 4 = 27,130
Sample Standard Deviation = √27,130 = 164.71... = 165 (to the nearest mm)
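The dog-height example can be reproduced with Python's standard `statistics` module, which provides separate functions for the population case (divide by N) and the sample case (divide by N-1). This is only a cross-check of the arithmetic; the variable name is an assumption.

```python
from statistics import mean, pstdev, pvariance, stdev, variance

heights_mm = [600, 470, 170, 430, 300]

print("mean:", mean(heights_mm))

# Treating the five dogs as the whole population: divide by N.
print("population variance:", pvariance(heights_mm))
print("population sd:", round(pstdev(heights_mm)))

# Treating the five dogs as a sample of a larger population: divide by N-1.
print("sample variance:", variance(heights_mm))
print("sample sd:", round(stdev(heights_mm)))
```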

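Formulas : Discrete variables
The "Population Standard Deviation":
$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2}$$
The "Sample Standard Deviation":
$$s = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar{x})^2}$$

Formulas : Standard deviation
calculated using a frequency table
The "Population Standard Deviation":
$$\sigma = \sqrt{\frac{\sum f\,(x - \mu)^2}{\sum f}}$$
The "Sample Standard Deviation":
$$s = \sqrt{\frac{\sum f\,(x - \bar{x})^2}{\sum f - 1}}$$
The larger the variance or standard deviation, the greater the variation in the data
around its mean.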

EXAMPLE:
Find an estimate of the variance and standard deviation of the following data for
the marks obtained in a test by 88 students:
Marks (x)       0 ≤ x < 10   10 ≤ x < 20   20 ≤ x < 30   30 ≤ x < 40   40 ≤ x < 50
Frequency (f)   6            16            24            25            17

Solution:
Marks         Frequency (f)   Mid-value (m)   mf     m-mean   (m-mean)^2   (m-mean)^2 * f
0 ≤ x < 10    6               5               30     -23.5    552.25       3313.5
10 ≤ x < 20   16              15              240    -13.5    182.25       2916
20 ≤ x < 30   24              25              600    -3.5     12.25        294
30 ≤ x < 40   25              35              875    6.5      42.25        1056.25
40 ≤ x < 50   17              45              765    16.5     272.25       4628.25
Total         88                              2510                         12208

Mean = 2510/88 = 28.52 (the deviations above use the rounded value 28.5)
σ² = 12208/88 = 138.727
σ = √138.727 = 11.778
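The grouped-data calculation can be sketched in Python as follows. The helper name `grouped_stats` is an assumption, and the code uses the unrounded mean, so its results can differ from the hand calculation in the last decimal places.

```python
import math


def grouped_stats(midpoints, freqs):
    """Return (mean, population variance, population sd) for grouped data."""
    total = sum(freqs)
    mean = sum(f * m for m, f in zip(midpoints, freqs)) / total
    variance = sum(f * (m - mean) ** 2 for m, f in zip(midpoints, freqs)) / total
    return mean, variance, math.sqrt(variance)


midpoints = [5, 15, 25, 35, 45]   # mid-values of the mark classes
freqs = [6, 16, 24, 25, 17]       # number of students in each class

mean, variance, sd = grouped_stats(midpoints, freqs)
print(round(mean, 2), round(variance, 2), round(sd, 2))
```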

Standard deviation using grouped variables
(continuous or discrete)
220 students were asked the number of hours per week they spent watching television. With
this information, calculate the mean and standard deviation of hours spent watching television
by the 220 students.
Number of hours per week spent watching television
Hours Number of students
10 to 14 2
15 to 19 12
20 to 24 23
25 to 29 60
30 to 34 77
35 to 39 38
40 to 44 8

Standard deviation using grouped variables
(continuous or discrete)
Hours      Number of students (f)   Midpoint (x)   xf     x-mean   (x-mean)^2   (x-mean)^2 * f
10 to 14   2                        12             24     -17.82   317.55       635.10
15 to 19   12                       17             204    -12.82   164.35       1972.23
20 to 24   23                       22             506    -7.82    61.15        1406.51
25 to 29   60                       27             1620   -2.82    7.95         477.14
30 to 34   77                       32             2464   2.18     4.75         365.93
35 to 39   38                       37             1406   7.18     51.55        1958.99
40 to 44   8                        42             336    12.18    148.35       1186.82
Total      220                                     6560                         8002.73

Mean = 6560/220 = 29.82
σ² = 8002.73/220 = 36.38
σ = √36.38 = 6.031

EXAMPLE
Assuming the frequency distribution is approximately normal, calculate the
interval within which 95% of the previous example's observations would be
expected to occur.
$\bar{x}$ = 29.82, s = 6.03
Calculate the interval using the following formula: $\bar{x}$ - 2s < x < $\bar{x}$ + 2s
◦ 29.82 - (2 × 6.03) < x < 29.82 + (2 × 6.03)
◦ 29.82 - 12.06 < x < 29.82 + 12.06
◦ 17.76 < x < 41.88
◦ This means that about 95% of students would be expected to spend
between 18 hours and 42 hours per week watching television.
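A short Python sketch of the same interval, recomputing the mean and standard deviation from the grouped television data rather than using the rounded values. The variable names are assumptions, and the two-standard-deviation rule is only the usual rough approximation for a normal distribution.

```python
import math

midpoints = [12, 17, 22, 27, 32, 37, 42]   # class midpoints (hours per week)
students = [2, 12, 23, 60, 77, 38, 8]      # number of students in each class

n = sum(students)
mean = sum(f * x for x, f in zip(midpoints, students)) / n
sd = math.sqrt(sum(f * (x - mean) ** 2 for x, f in zip(midpoints, students)) / n)

# Roughly 95% of a normal distribution lies within 2 standard deviations of the mean.
low, high = mean - 2 * sd, mean + 2 * sd
print(f"{low:.2f} < x < {high:.2f}")
```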

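Mean Absolute Deviation (MAD)
◦ Since the mean can be distorted by outliers, and the variance is computed using the mean, the variance is also
sensitive to outliers. To reduce the effect of outliers, two more robust measures of dispersion
are known. These are:
◦ Mean Absolute Deviation (MAD)
$$MAD(X) = \mathrm{median}\left(|x_1 - \bar{x}|, |x_2 - \bar{x}|, \ldots, |x_n - \bar{x}|\right)$$
Because it takes the median of the absolute deviations, this measure is more commonly called the median absolute deviation.
◦ Absolute Average Deviation (AAD)
$$AAD(X) = \frac{1}{n}\sum_{i=1}^{n} |x_i - \bar{x}|$$
where X = {$x_1, x_2, \ldots, x_n$} is the set of sample values of n observations and $\bar{x}$ is their mean.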
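A minimal sketch of the two measures in Python. It assumes that the absolute deviations are taken from the sample mean; many texts instead define MAD with deviations from the median, so treat the choice of centre here as an assumption. The data are the dog heights used earlier.

```python
from statistics import mean, median


def aad(values):
    """Average of the absolute deviations from the mean."""
    x_bar = mean(values)
    return sum(abs(x - x_bar) for x in values) / len(values)


def mad(values):
    """Median of the absolute deviations from the mean (assumed centre)."""
    x_bar = mean(values)
    return median(abs(x - x_bar) for x in values)


heights_mm = [600, 470, 170, 430, 300]   # mean is 394
print(aad(heights_mm))   # average of 206, 76, 224, 36, 94
print(mad(heights_mm))   # middle value of 36, 76, 94, 206, 224
```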

Interquartile Range
◦ Like MAD and AAD, the interquartile range, denoted IQR, is another robust measure of
dispersion.
◦ It overcomes the sensitivity to extreme data values.
◦ It is the range of the middle 50% of the data.
◦ To understand IQR, let us first define percentile and quartile.
◦ Percentile
◦ The percentile of a set of ordered data can be defined as follows:
o Given an ordinal or continuous attribute x and a number p between 0 and 100, the pth percentile $x_p$ is
a value of x such that p% of the observed values of x are less than $x_p$.
o Example: The 50th percentile is the value $x_{50}$ such that 50% of all values of x are less than $x_{50}$.
◦ Note: The median is the 50th percentile.

Percentile
Use the following set of stock prices (in dollars): 10, 7, 20, 12, 5, 15,
9, 18, 4, 12, 8, 14. Find the 10th percentile and the 50th percentile.
Solutions:
1. First sort the data in ascending order:
4, 5, 7, 8, 9, 10, 12, 12, 14, 15, 18, 20
2. There are 12 scores, so n = 12
3. To find the 10th percentile, we use the formula:
$$i = \left(\frac{p}{100}\right) n = \left(\frac{10}{100}\right) 12 = 1.2$$
If i is not an integer, round it up: the pth percentile is the value in the next whole position. Here 1.2 rounds up to 2.
Position 1 2 3 4 5 6 7 8 9 10 11 12
Data 4 5 7 8 9 10 12 12 14 15 18 20
The 10th percentile is the number in the 2nd position.
10th percentile = 5
4. To find the 50th percentile:
$$i = \left(\frac{p}{100}\right) n = \left(\frac{50}{100}\right) 12 = 6$$
If i is an integer, the pth percentile is the average of the values in positions i and i+1.
We need to find the 6th and 7th numbers in the sorted data set: 10 and 12.
50th percentile = (10 + 12)/2 = 11
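The position rule used in this example can be sketched in Python as below. The function name is an assumption, p is assumed to be a whole number strictly between 0 and 100, and the data must be non-empty. Other percentile conventions exist and can give slightly different values.

```python
def percentile(values, p):
    """pth percentile using the position i = (p / 100) * n."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(values)
    # whole = integer part of i, remainder tells us whether i is a whole number
    whole, remainder = divmod(p * len(ordered), 100)
    if remainder == 0:
        # i is an integer: average the values in positions i and i + 1 (1-based)
        return (ordered[whole - 1] + ordered[whole]) / 2
    # i is fractional: round up to the next position, which is index `whole`
    return ordered[whole]


prices = [10, 7, 20, 12, 5, 15, 9, 18, 4, 12, 8, 14]
print(percentile(prices, 10))   # value in the 2nd position
print(percentile(prices, 50))   # average of the 6th and 7th values
```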

Interquartile Range

Quartile
◦ The most commonly used percentiles are quartiles.
◦ The first quartile, denoted by $Q_1$, is the 25th percentile.
◦ The third quartile, denoted by $Q_3$, is the 75th percentile.
◦ The median, $Q_2$, is the 50th percentile.
The quartiles, including the median, give some indication of the center, spread and shape of a
distribution.
The distance between $Q_1$ and $Q_3$ is a simple measure of spread that gives the range covered by the
middle half of the data. This distance is called the interquartile range (IQR) and is defined as
IQR = $Q_3 - Q_1$

Quartile and Interquartile
Example: 5, 7, 4, 4, 6, 2, 8
1. Put them in order: 2, 4, 4, 5, 6, 7, 8
2. Cut the list into quarters:
2 4 4 5 6 7 8
Quartile (Q1) = 4 (lower quartile)
Quartile (Q2) = 5 (median)
Quartile (Q3) = 7 (upper quartile)
Interquartile Range = Q3-Q1= 7-4=3

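EXAMPLE
1, 3, 3, 4, 5, 6, 6, 7, 8, 8
Q1 = 3 (median of the lower half 1, 3, 3, 4, 5) and Q3 = 7 (median of the upper half 6, 6, 7, 8, 8)
Interquartile Range = Q3-Q1= 7-3=4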
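The two quartile examples can be reproduced with a short Python sketch. It assumes the "median of each half" convention, leaving out the middle value when the count is odd; the helper names are illustrative. The last lines compare the IQR with the range on data containing one extreme value.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median of the lower and upper halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    upper = ordered[half + n % 2:]   # skip the middle value when n is odd
    return median(lower), median(ordered), median(upper)


def iqr(values):
    q1, _, q3 = quartiles(values)
    return q3 - q1


print(quartiles([5, 7, 4, 4, 6, 2, 8]))
print(iqr([1, 3, 3, 4, 5, 6, 6, 7, 8, 8]))

with_outlier = [8, 11, 5, 9, 7, 6, 3616]
print(max(with_outlier) - min(with_outlier))   # range is dominated by 3616
print(iqr(with_outlier))                       # IQR is barely affected
```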

Covariance
Covariance provides insight into how two variables are
related to one another.
More precisely, covariance refers to the measure of how two
random variables in a data set will change together.
A positive covariance means that the two variables at hand are
positively related, and they move in the same direction.
A negative covariance means that the variables are inversely
related, or that they move in opposite directions.

Covariance
The formula for sample covariance is as follows:
$$Cov(X, Y) = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{N - 1}$$
In this formula, X represents the independent variable, Y represents the dependent variable, N
represents the number of data points in the sample, x-bar ($\bar{x}$) represents the mean of X, and
y-bar ($\bar{y}$) represents the mean of the dependent variable Y.

Are Covariance and Correlation The
Same Thing?
Simply put, no.
While both covariance and correlation indicate whether variables are positively
or inversely related to each other, they are not the same.
Correlation also indicates the degree to which the
variables tend to move together.
Covariance does not use a standardized unit of measurement: its value depends on the units of the two variables.
Correlation, on the other hand, standardizes the measure of interdependence
between two variables and tells researchers how closely the two
variables move together.

EXAMPLE
Calculate covariance for the following data set:
x: 2.1, 2.5, 3.6, 4.0
y: 8, 10, 12, 14
x     y     x-mean(x)   y-mean(y)   (x-mean(x))*(y-mean(y))
2.1   8     -0.95       -3          2.85
2.5   10    -0.55       -1          0.55
3.6   12    0.55        1           0.55
4     14    0.95        3           2.85
                                    Sum=6.8
mean(x)=3.05, mean(y)=11
Cov(x, y) = 6.8/3 = 2.267
The result is positive, meaning that the variables are positively related.
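EXAMPLE
To calculate the correlation
x: 2.1, 2.5, 3.6, 4.0
y: 8, 10, 12, 14
x     y     x-mean(x)   y-mean(y)   (x-mean(x))*(y-mean(y))   (x-mean(x))^2   (y-mean(y))^2
2.1   8     -0.95       -3          2.85                      0.9025          9
2.5   10    -0.55       -1          0.55                      0.3025          1
3.6   12    0.55        1           0.55                      0.3025          1
4     14    0.95        3           2.85                      0.9025          9
Sum                                 6.8                       2.41            20
The sample standard deviations are s_x = √(2.41/3) = 0.896 and s_y = √(20/3) = 2.582, so
r = Cov(x, y)/(s_x * s_y) = 2.267/(0.896 * 2.582) = 0.98
Equivalently, r = 6.8/(√2.41 * √20) = 6.8/(1.55 * 4.47) = 0.98
A positive correlation coefficient less than one indicates a less than perfect positive
correlation; a value this close to one indicates a strong positive relationship.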
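A plain-Python sketch of sample covariance and correlation for this data set. The function names are assumptions. The correlation divides the covariance by the two sample standard deviations, each computed with the same N-1 divisor as the covariance. The last lines rescale x by 100 to show that covariance depends on the units while correlation does not.

```python
import math


def covariance(xs, ys):
    """Sample covariance: sum of products of deviations divided by N - 1."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    return sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)


def correlation(xs, ys):
    """Covariance divided by the product of the sample standard deviations."""
    # covariance(v, v) is the sample variance of v
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))


x = [2.1, 2.5, 3.6, 4.0]
y = [8, 10, 12, 14]
print(round(covariance(x, y), 3), round(correlation(x, y), 3))

# Changing the unit of x (for example metres to centimetres) rescales the
# covariance by the same factor but leaves the correlation unchanged.
x_scaled = [100 * v for v in x]
print(round(covariance(x_scaled, y), 3), round(correlation(x_scaled, y), 3))
```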
EXAMPLE:
Suppose you take a sample of stock returns from the Excelsior Corporation and the
Adirondack Corporation from the years 2008 to 2012, as shown here.
What are the covariance and correlation between the stock returns?
Year   Excelsior Corp. Annual Return (percent) (X)   Adirondack Corp. Annual Return (percent) (Y)
2008   1                                             3
2009   –2                                            2
2010   3                                             4
2011   0                                             6
2012   3                                             0

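EXAMPLE:
The sample means are x-bar = (1 − 2 + 3 + 0 + 3)/5 = 1 and y-bar = (3 + 2 + 4 + 6 + 0)/5 = 3.
The sample covariance is
Cov(X, Y) = [(0)(0) + (−3)(−1) + (2)(1) + (−1)(3) + (2)(−3)]/(5 − 1) = −4/4 = −1
The sample variance of X is (0 + 9 + 4 + 1 + 4)/4 = 4.5 and the sample variance of Y is (0 + 1 + 1 + 9 + 9)/4 = 5.
The sample standard deviation of X is the square root of 4.5, or 2.1213.
The sample standard deviation of Y equals the square root of 5, or 2.2361.
The sample correlation formula gives you
r = Cov(X, Y)/(s_X * s_Y) = −1/(2.1213 × 2.2361) = −0.2108
The negative result shows that there is a weak negative correlation between the stock
returns of Excelsior and Adirondack. If two variables are perfectly negatively correlated (they
always move in opposite directions), their correlation will be –1. If two variables are
independent (unrelated to each other), their correlation will be 0. The correlation between
the returns to Excelsior and Adirondack stock is –0.2108, which indicates that the two
variables show a slight tendency to move in opposite directions.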
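The stock-return figures can be cross-checked with a few lines of Python, using `mean` and `stdev` from the standard `statistics` module and computing the covariance by hand with the N-1 divisor. The variable names are assumptions.

```python
from statistics import mean, stdev

excelsior = [1, -2, 3, 0, 3]    # X, annual returns in percent, 2008 to 2012
adirondack = [3, 2, 4, 6, 0]    # Y, annual returns in percent, 2008 to 2012

x_bar, y_bar = mean(excelsior), mean(adirondack)

# Sample covariance: sum of products of deviations divided by N - 1.
cov = sum((x - x_bar) * (y - y_bar)
          for x, y in zip(excelsior, adirondack)) / (len(excelsior) - 1)

# Sample correlation: covariance divided by both sample standard deviations.
r = cov / (stdev(excelsior) * stdev(adirondack))

print(cov, round(stdev(excelsior), 4), round(stdev(adirondack), 4), round(r, 4))
```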
