
CHAPTER-3

STATISTICAL ANALYSIS
TRP: An example
Television rating point (TRP) is a tool used to judge which programmes are viewed the most. It
gives an index of viewers’ choices and of the popularity of a particular channel.
For the calculation, a device is attached to the TV sets in a few thousand viewers’ houses across
different geographic and demographic sectors.
◦ The device is called a People's Meter. It records the time and the programme that a viewer
watches on a particular day, over a certain period.
◦ An average is then taken, for example, over a 30-day period.

The meter data can be further augmented with a Personal Interview Survey (PIS), which becomes the
basis for many studies and much decision making.

Essentially, we are to analyze data for TRP estimation.

STATISTICS PREPARED BY CHITRAPRIYA N., CSE DEPT 2


Defining Data
A set of data is a collection of observed values representing one or more
characteristics of some objects or units.
Example: For TRP, the data collected consist of the following attributes.
◦ Age: A viewer’s age in years
◦ Sex: A viewer’s gender, coded 1 for male and 0 for female
◦ Happy: A viewer’s general happiness
  ◦ NH for not too happy
  ◦ PH for pretty happy
  ◦ VH for very happy
◦ TVHours: The average number of hours a respondent watched TV during a day



Defining Data
Viewer#   Age   Sex   Happy   TVHours
…         …     …     …       …
…         …     …     …       …
55        34    F     VH      5
…         …     …     …       …

Note:
 A data set is composed of information from a set of units.

 Information from a unit is known as an observation.

 An observation consists of one or more pieces of information about a unit; these are called
variables.
Defining Population

 A population is a data set representing all the entities of interest.


 A parameter is a numerical characteristic of the population.

Example: All TV viewers in the country/world.

Note:
1. The population is not necessarily all people in the country/world; it is only the entities of interest (here, TV viewers).

2. For different surveys, the population may be completely different.

3. For statistical learning, it is important to define very carefully the population that we intend to study.

Defining Sample
A sample is a data set consisting of a portion of a population.
A sample statistic is a numerical characteristic of the sample.

Example: All students studying in Class XII form a population, whereas the Class XII students of a
given school form a sample.
Note:
◦ Normally, in inferential statistics, a sample is obtained in such a way as to be representative
of the population.
◦ An unbiased sampling procedure is one in which each observation in the population has an
equal chance of being chosen for the sample.
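
An equal-chance draw of this kind can be sketched with Python's standard library. The population of student roll numbers, the function name and the seed below are assumptions made only for illustration.

```python
import random


def simple_random_sample(population, k, seed=None):
    """Draw k distinct units; every unit is equally likely to be chosen."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    return rng.sample(population, k)


# Hypothetical population: roll numbers of 1000 Class XII students.
student_ids = list(range(1, 1001))
sample_ids = simple_random_sample(student_ids, k=50, seed=42)
print(len(sample_ids), "students drawn without replacement")
```

Sampling without replacement keeps any student from being counted twice.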



Introduction to Statistics
Statistics is a discipline defining a set of procedures used to collect and
interpret numerical data.
The discipline of statistics serves two purposes:
1. Statistical procedures can be used to describe the relevant characteristics
(dispersion, central tendency) of a body of data. This is called Descriptive
Statistics.
2. Statistical procedures can be utilized to help us make inferences or predictions
about a whole population based on information from a sample of the population.
This is called Inferential Statistics.



Defining Statistics
A statistic is a quantity calculated from data that describes a particular
characteristic of a sample.

Example: Inferential statistics in the context of TRP
◦ What is the overall frequency of the various levels of happiness?
◦ Is there a relationship between the age of a viewer and his/her general
happiness?
◦ Is there a relationship between the age of a viewer and the number of TV
hours watched?



Data Summarization
 To identify the typical characteristics of data (i.e., to have an overall picture).
 To identify which data should be treated as noise or outliers.
 The data summarization techniques can be classified into two broad
categories:

◦ Measures of location /centrality

◦ Measures of dispersion



Measurement of location/Centrality
◦ It is also called measuring the central tendency.
◦ A function of the sample values that summarizes the location information
into a single number is known as a measure of location.

◦ The most popular measures of location are


◦ Mean
◦ Median
◦ Mode
◦ Midrange



Mean of a sample

 The mean (arithmetic average) is the most common measure of centrality.

 The mean of a sample is denoted $\bar{x}$. The mean of the population is denoted $\mu$.
 The different kinds of mean in use are:
◦ Simple mean
◦ Weighted mean
◦ Trimmed mean



Simple mean of a sample

 It is also called simply the arithmetic mean or average, abbreviated as AM.

 If $x_1, x_2, x_3, \ldots, x_n$ are the sample values, the simple mean is defined as

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$



Weighted mean of a sample

 It is also called the weighted arithmetic mean or weighted average.

 When each sample value $x_i$ is associated with a weight $w_i$, for i = 1, 2, …, n, it
is defined as

$$\bar{x}_w = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$

Note
When all weights are equal, the weighted mean reduces to the simple mean.
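
A minimal sketch of the simple and weighted means in Python. The function names, the TV-hours values and the weights are hypothetical, chosen only to show that equal weights give back the simple mean.

```python
def simple_mean(values):
    """Arithmetic mean: sum of the values divided by their count."""
    return sum(values) / len(values)


def weighted_mean(values, weights):
    """Sum of weight * value divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


# Hypothetical TV hours for three viewers.
hours = [2.0, 4.0, 6.0]
print(simple_mean(hours))               # (2 + 4 + 6) / 3 = 4.0
print(weighted_mean(hours, [1, 1, 2]))  # (2 + 4 + 12) / 4 = 4.5
print(weighted_mean(hours, [3, 3, 3]))  # equal weights: same as the simple mean
```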



Trimmed mean of a sample
If there are extreme values (also called outliers) in a sample, then the mean is
influenced greatly by those values.
To offset the effect of those extreme values, we can use the concept of a
trimmed mean.
The trimmed mean is defined as the mean obtained after chopping off values at
the high and low extremes.
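
One way to sketch the trimmed mean in Python. The slides do not say how many values to chop off, so the `proportion` parameter (the fraction dropped from each end) is an assumption; the data set is the one used in the range example of this chapter.

```python
def trimmed_mean(values, proportion):
    """Mean after dropping `proportion` of the values from each end."""
    if not 0 <= proportion < 0.5:
        raise ValueError("proportion must be in [0, 0.5)")
    ordered = sorted(values)
    k = int(len(ordered) * proportion)   # number of values dropped per end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


data = [8, 11, 5, 9, 7, 6, 3616]
print(sum(data) / len(data))    # simple mean, dominated by 3616
print(trimmed_mean(data, 0.2))  # drops 5 and 3616, leaving a mean of 8.2
```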



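Properties of mean
 If $\bar{x}_i$, i = 1, 2, …, m, are the means of m samples of sizes $n_1, n_2, \ldots, n_m$ respectively, then the mean of
the combined sample is given by:

$$\bar{x} = \frac{\sum_{i=1}^{m} n_i \bar{x}_i}{\sum_{i=1}^{m} n_i}$$

(Distributive Measure)

 If a new observation $x_{n+1}$ is added to a sample of size n with mean $\bar{x}$, the new mean is given by

$$\bar{x}_{new} = \frac{n\bar{x} + x_{n+1}}{n + 1}$$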



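Properties of mean
 If an existing observation $x_k$ is removed from a sample of size n with mean $\bar{x}$, the new mean is
given by

$$\bar{x}_{new} = \frac{n\bar{x} - x_k}{n - 1}$$

 If m observations with mean $\bar{x}_m$ are added to (removed from) a sample of size n with mean $\bar{x}$,
then the new mean is given by

$$\bar{x}_{new} = \frac{n\bar{x} \pm m\bar{x}_m}{n \pm m}$$

where the + sign applies to addition and the − sign to removal.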
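
These update properties can be checked numerically. The sketch below assumes the standard forms (total = size × mean); the function names and the small samples are hypothetical.

```python
def combined_mean(means, sizes):
    """Mean of the pooled sample, given each sample's mean and size."""
    return sum(n * m for m, n in zip(means, sizes)) / sum(sizes)


def mean_after_adding(mean, n, new_value):
    """Mean after one observation is added to a sample of size n."""
    return (n * mean + new_value) / (n + 1)


def mean_after_removing(mean, n, old_value):
    """Mean after one observation is removed from a sample of size n."""
    return (n * mean - old_value) / (n - 1)


a, b = [2, 4, 6], [10, 20]        # sample means 4.0 and 15.0
print(combined_mean([4.0, 15.0], [3, 2]))  # same as the mean of a + b
print(mean_after_adding(4.0, 3, 10))       # mean of [2, 4, 6, 10]
print(mean_after_removing(4.0, 3, 6))      # mean of [2, 4]
```

None of the updates needs the raw observations, only the current mean and size.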



Properties of mean
 If a constant c is subtracted from (or added to) each sample value, then the mean of the
transformed variable is linearly displaced by c. That is,

$$\bar{x}_{new} = \bar{x} \mp c$$

with − for subtraction and + for addition.

 If each observation is scaled by multiplying (dividing) by a non-zero constant c, then the
altered mean is given by

$$\bar{x}_{new} = \bar{x} * c$$

 where * is the × (multiplication) or ÷ (division) operator.



Relation between mean, median and mode
 A given set of data falls into one of three categories:
 Symmetric data
 Positively skewed data
 Negatively skewed data



Symmetric data
For symmetric (unimodal) data, the mean, median and mode all lie at the same point.
A symmetric distribution is one in which the side of the
distribution to the right of the mean is a mirror image of the left side.



Positively skewed/Skewed Right data
 Here, the mode occurs at a value smaller than the median, and the long right tail pulls the mean above the median (mode < median < mean).



Negatively skewed /Skewed Left data
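 Here, the mode occurs at a value greater than the median, and the long left tail pulls the mean below the median (mean < median < mode).

[Figure: negatively skewed distribution with a long left tail and a short right tail; the mean lies to the left of the median.]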



NOTES:
The value of the mean is sensitive to the outliers in a data set whereas the
median is not.
The median will equal the mean if the distribution of the data is symmetric.
If a distribution is skewed to the right, outlier values in the data much larger
than the mean are pulling the value of the mean above the median.
If a distribution is skewed to the left, outlier values in the data much smaller
than the mean are pulling the value of the mean below the median.
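
The notes above can be illustrated with three small data sets. The values are hypothetical, chosen so that the median is 5 in every case while a single outlier moves the mean.

```python
import statistics

symmetric = [3, 4, 5, 6, 7]
skewed_right = [3, 4, 5, 6, 70]   # one very large value
skewed_left = [-60, 4, 5, 6, 7]   # one very small value

for name, data in [("symmetric", symmetric),
                   ("skewed right", skewed_right),
                   ("skewed left", skewed_left)]:
    # The median stays at 5; only the mean follows the outlier.
    print(name, "mean =", statistics.mean(data),
          "median =", statistics.median(data))
```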



Measures of dispersion/Variability
 Location measures alone are insufficient to understand data.
 Another set of commonly used summary statistics for continuous data are those
that measure the dispersion.
 A dispersion measure quantifies the extent of spread of the observations in a sample.
 Some important measures of dispersion are:
◦ Range
◦ Variance and Standard Deviation
◦ Mean Absolute Deviation (MAD)
◦ Absolute Average Deviation (AAD)
◦ Interquartile Range (IQR)



Range of a sample

 Let X = {$x_1, x_2, x_3, \ldots, x_n$} be n sample values arranged in increasing order.
 The range is the simplest measure of variability.
 The range R of the sample is then defined as:
 R = max(X) − min(X) = $x_n - x_1$

EXAMPLE:
In {4, 6, 9, 3, 7}
Increasing order: X = {3, 4, 6, 7, 9}
Min(X) = 3
Max(X) = 9
Range = (9 − 3) = 6

NOTE:
The range can be misleading if most of the values are concentrated in a narrow band of values, but
there are also a relatively small number of more extreme values.
Example: In {8, 11, 5, 9, 7, 6, 3616}:
The range is 3616 − 5 = 3611.
The single value of 3616 makes the range large, but most values are around 10.
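
Both range examples can be reproduced in a few lines of Python; the function name is an illustrative choice.

```python
def sample_range(values):
    """Range R = max(X) - min(X)."""
    return max(values) - min(values)


print(sample_range([4, 6, 9, 3, 7]))             # 9 - 3 = 6
print(sample_range([8, 11, 5, 9, 7, 6, 3616]))   # 3616 - 5 = 3611
print(sample_range([8, 11, 5, 9, 7, 6]))         # without the outlier: 11 - 5 = 6
```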



Variance and Standard Deviation
 In statistics, variance refers to the spread of a data set.
 It’s a measurement used to identify how far each number in the
data set is from the mean.
 A variance value of zero represents that all of the values within a
data set are identical, while all variances that are not equal to zero
will come in the form of positive numbers.
 A large variance means that the numbers in a set are far from the
mean and each other. A small variance means that the numbers
are closer together in value.



Variance and Standard Deviation

 Let X = {$x_1, x_2, x_3, \ldots, x_n$} be the values of n observations.
 Then the variance, denoted $\sigma^2$, is defined as:

$$\sigma^2 = \frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})^2$$

 where $\bar{x}$ denotes the mean of the sample.
 That is, the variance is the average of the squared differences from the mean.
 The standard deviation, $\sigma$, of the samples is the square root of the variance.
 The coefficient of variation expresses the standard deviation as a percentage of
the mean:

$$CV = \frac{\sigma}{\bar{x}} \times 100\%$$

 The coefficient of variation is useful in comparing the variation of data sets
that have different means.



Example

You and your friends have just measured the heights of your dogs (in millimeters):

The heights (at the shoulders) are: 600mm, 470mm, 170mm, 430mm and 300mm.
Find out the Mean, the Variance, and the Standard Deviation.



Example Solution:

1. Find the mean first:


Mean = (600 + 470 + 170 + 430 + 300)/5
  = (1970)/5
  = 394

2. To calculate the Variance, take each difference from the mean, square it, and then average the result:
Variance
σ² = (206² + 76² + (−224)² + 36² + (−94)²) / 5
   = (42436 + 5776 + 50176 + 1296 + 8836) / 5
   = 108520 / 5
   = 21704



Example Solution:

And the Standard Deviation is just the square root of Variance, so:

Standard Deviation

σ = √21704

  = 147.32...

  = 147 (to the nearest mm)

So, using the Standard Deviation we have a "standard" way of knowing what is normal, and what is
extra large or extra small.



NOTE:
If the data is a Sample (a selection taken from a bigger Population), then the
calculation changes!
When you have "N" data values that are:
◦ The Population: divide by N when calculating Variance
◦ A Sample: divide by N-1 when calculating Variance
Example: if our 5 dogs are just a sample of a bigger population of dogs, we divide by 4 instead
of 5, like this:
Sample Variance = 108,520 / 4 = 27,130
Sample Standard Deviation = √27,130 = 164.71... = 165 (to the nearest mm)
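
The dog-height calculation can be sketched in Python to show the effect of dividing by N versus N-1. The function name and its `sample` flag are illustrative choices, not part of the original example.

```python
import math


def variance(values, sample=False):
    """Average squared deviation from the mean; N-1 divisor if sample=True."""
    n = len(values)
    mean = sum(values) / n
    squared_diffs = sum((x - mean) ** 2 for x in values)
    return squared_diffs / (n - 1 if sample else n)


heights = [600, 470, 170, 430, 300]   # dog heights in mm

pop_var = variance(heights)                   # 108520 / 5
sample_var = variance(heights, sample=True)   # 108520 / 4
print(pop_var, math.sqrt(pop_var))            # 21704.0 and about 147.3
print(sample_var, math.sqrt(sample_var))      # 27130.0 and about 164.7
```

The standard library functions `statistics.pvariance` and `statistics.variance` compute the same two quantities.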



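Formulas : Discrete variables

The "Population Standard Deviation":

$$\sigma = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2}$$

The "Sample Standard Deviation":

$$s = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N} (x_i - \bar{x})^2}$$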



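Formulas : Standard deviation calculated using a frequency table
The "Population Standard Deviation":

$$\sigma = \sqrt{\frac{\sum f\,(x - \bar{x})^2}{\sum f}}$$

The "Sample Standard Deviation":

$$s = \sqrt{\frac{\sum f\,(x - \bar{x})^2}{\sum f - 1}}$$

Here f is the frequency of the value (or class mid-value) x.
The larger the variance or standard deviation, the greater the variation in the data
around its mean.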



EXAMPLE:
Find an estimate of the variance and standard deviation of the following data for
the marks obtained in a test by 88 students:

Marks (x)       0 ≤ x < 10   10 ≤ x < 20   20 ≤ x < 30   30 ≤ x < 40   40 ≤ x < 50
Frequency (f)   6            16            24            25            17



Solution:
Marks         Frequency (f)   Mid-value (m)   mf     m − mean   (m − mean)^2   (m − mean)^2 * f
0 ≤ x < 10    6               5               30     -23.5      552.25         3313.5
10 ≤ x < 20   16              15              240    -13.5      182.25         2916
20 ≤ x < 30   24              25              600    -3.5       12.25          294
30 ≤ x < 40   25              35              875    6.5        42.25          1056.25
40 ≤ x < 50   17              45              765    16.5       272.25         4628.25
Total         88                              2510                             12208

Mean = 2510 / 88 = 28.52 (the deviations in the table use the mean rounded to 28.5)
σ² = 12208 / 88 = 138.727
σ = √138.727 = 11.778
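
The tabular calculation can be sketched as a small Python function. It uses the population form (dividing by the total frequency), as the solution above does; the function name is an assumption. Because it keeps the unrounded mean, intermediate values differ slightly from the table, but the results agree to three decimals.

```python
import math


def grouped_mean_variance(midpoints, frequencies):
    """Mean and variance estimated from class mid-values and frequencies."""
    n = sum(frequencies)
    mean = sum(f * m for m, f in zip(midpoints, frequencies)) / n
    var = sum(f * (m - mean) ** 2 for m, f in zip(midpoints, frequencies)) / n
    return mean, var


midpoints = [5, 15, 25, 35, 45]       # mid-values of the mark classes
frequencies = [6, 16, 24, 25, 17]     # 88 students in total

mean, var = grouped_mean_variance(midpoints, frequencies)
sd = math.sqrt(var)
print(round(mean, 2), round(var, 3), round(sd, 3))
```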



Standard deviation using grouped variables
(continuous or discrete)
220 students were asked the number of hours per week they spent watching television. With
this information, calculate the mean and standard deviation of hours spent watching television
by the 220 students.
Number of hours per week spent watching television
Hours Number of students
10 to 14 2
15 to 19 12
20 to 24 23
25 to 29 60
30 to 34 77
35 to 39 38
40 to 44 8



Standard deviation using grouped variables
(continuous or discrete)
Hours      Number of students (f)   Mid-point (x)   xf     x − mean   (x − mean)^2   (x − mean)^2 * f
10 to 14   2                        12              24     -17.82     317.55         635.10
15 to 19   12                       17              204    -12.82     164.35         1972.23
20 to 24   23                       22              506    -7.82      61.15          1406.51
25 to 29   60                       27              1620   -2.82      7.95           477.14
30 to 34   77                       32              2464   2.18       4.75           365.93
35 to 39   38                       37              1406   7.18       51.55          1958.99
40 to 44   8                        42              336    12.18      148.35         1186.82
Total      220                                      6560                             8002.73

Mean = 6560 / 220 = 29.82
σ² = 8002.73 / 220 = 36.376 ≈ 36.38
σ = √36.376 = 6.031



EXAMPLE
 Assuming the frequency distribution is approximately normal, calculate the
interval within which 95% of the previous example's observations would be
expected to occur.
 $\bar{x}$ = 29.82, s = 6.03
Calculate the interval using the following formula: $\bar{x} - 2s < x < \bar{x} + 2s$
◦ 29.82 - (2 × 6.03) < x < 29.82 + (2 × 6.03)
◦ 29.82 - 12.06 < x < 29.82 + 12.06
◦ 17.76 < x < 41.88
◦ This means that about 95% of students are expected to spend
between 18 hours and 42 hours per week watching television.
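
A sketch of the same interval in Python, recomputing the mean and standard deviation from the grouped TV-hours table. The last step uses `statistics.NormalDist` to show why "mean ± 2 standard deviations" is read as roughly 95% under the normality assumption; the variable names are illustrative.

```python
import math
from statistics import NormalDist

midpoints = [12, 17, 22, 27, 32, 37, 42]   # class mid-points (hours)
students = [2, 12, 23, 60, 77, 38, 8]      # frequencies, 220 in total

n = sum(students)
mean = sum(f * x for x, f in zip(midpoints, students)) / n
variance = sum(f * (x - mean) ** 2 for x, f in zip(midpoints, students)) / n
sd = math.sqrt(variance)

lower, upper = mean - 2 * sd, mean + 2 * sd
# Share of a normal distribution within 2 standard deviations of its mean.
coverage = NormalDist().cdf(2) - NormalDist().cdf(-2)

print(round(lower, 2), round(upper, 2))    # interval end points
print(round(coverage, 4))                  # slightly above 0.95
```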



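Mean Absolute Deviation (MAD)
◦ The mean can be distorted by outliers, and since the variance is computed using the mean, it is also
sensitive to outliers. To reduce the effect of outliers, two more robust measures of dispersion are
used. These are:

◦ Mean Absolute Deviation (MAD)

$$MAD(X) = \mathrm{median}\left(|x_1 - \bar{x}|, |x_2 - \bar{x}|, \ldots, |x_n - \bar{x}|\right)$$

◦ Absolute Average Deviation (AAD)

$$AAD(X) = \frac{1}{n}\sum_{i=1}^{n} |x_i - \bar{x}|$$

where X = {$x_1, x_2, \ldots, x_n$} are the sample values of n observations and $\bar{x}$ is their mean. Because MAD is the median of the absolute deviations, it is more commonly called the median absolute deviation.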
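
A minimal sketch of the two measures in Python. It assumes that both take absolute deviations from the sample mean, with MAD taking their median and AAD their average; many references instead centre MAD on the median, so treat this form as an assumption. The data set is the one from the quartile example in this chapter.

```python
import statistics


def aad(values):
    """Average of the absolute deviations from the mean."""
    mean = sum(values) / len(values)
    return sum(abs(x - mean) for x in values) / len(values)


def mad(values):
    """Median of the absolute deviations from the mean (assumed form)."""
    mean = sum(values) / len(values)
    return statistics.median([abs(x - mean) for x in values])


data = [2, 4, 4, 5, 6, 7, 8]   # mean = 36/7
print(round(aad(data), 4))     # 78/49, about 1.5918
print(round(mad(data), 4))     # 8/7, about 1.1429
```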



Interquartile Range
◦ Like MAD and AAD, the interquartile range, denoted IQR, is another robust measure of dispersion.
◦ It overcomes the sensitivity to extreme data values.
◦ It is the range of the middle 50% of the data.
◦ To understand IQR, let us first define percentile and quartile.

◦ Percentile
◦ The percentile of a set of ordered data can be defined as follows:

o Given an ordinal or continuous attribute x and a number p between 0 and 100, the pth percentile $x_p$ is
a value of x such that p% of the observed values of x are less than $x_p$.
o Example: The 50th percentile is the value $x_{50\%}$ such that 50% of all values of x are less than $x_{50\%}$.
◦ Note: The median is the 50th percentile.



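Percentile
 Use the following set of stock prices (in dollars): 10, 7, 20, 12, 5, 15,
9, 18, 4, 12, 8, 14. Find the 10th percentile and the 50th percentile.
Solution:
1. First sort the data in ascending order:
4, 5, 7, 8, 9, 10, 12, 12, 14, 15, 18, 20
2. There are 12 scores, so n = 12.
3. To find the 10th percentile, we use the formula:

$$i = \left(\frac{p}{100}\right) n = \left(\frac{10}{100}\right) 12 = 1.2$$

If i is not an integer, round up: the pth percentile is the value in the next position, here position 2.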



Percentile
Position 1 2 3 4 5 6 7 8 9 10 11 12
Data 4 5 7 8 9 10 12 12 14 15 18 20

The 10th percentile is the number in the 2nd position.

10th percentile = 5

To find the 50th percentile:

$$i = \left(\frac{p}{100}\right) n = \left(\frac{50}{100}\right) 12 = 6$$

If i is an integer, the pth percentile is the average of the values in positions i and i+1.



Percentile
Position 1 2 3 4 5 6 7 8 9 10 11 12
Data 4 5 7 8 9 10 12 12 14 15 18 20

We need to find the 6th and 7th numbers in the sorted data set, which are 10 and 12.
50th percentile = (10 + 12) / 2 = 11
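
The position rule used in this example can be sketched in Python. The rounding-up step for a non-integer position is inferred from the worked answer (position 1.2 gives the 2nd value), and the function name is an assumption; p is taken to be a whole number so the position can be tested exactly with integer arithmetic.

```python
def percentile(values, p):
    """pth percentile (whole number p, 0 < p < 100) by the rule i = (p/100) * n."""
    if not values or not 0 < p < 100:
        raise ValueError("need data and 0 < p < 100")
    ordered = sorted(values)
    n = len(ordered)
    whole, remainder = divmod(p * n, 100)   # i = whole + remainder / 100
    if remainder == 0:
        # i is an integer: average the values in positions i and i + 1
        return (ordered[whole - 1] + ordered[whole]) / 2
    # i is not an integer: round up to position whole + 1 (index `whole`)
    return ordered[whole]


prices = [10, 7, 20, 12, 5, 15, 9, 18, 4, 12, 8, 14]
print(percentile(prices, 10))   # i = 1.2, so the 2nd value: 5
print(percentile(prices, 50))   # i = 6, so (10 + 12) / 2 = 11.0
```

Library routines such as `statistics.quantiles` interpolate between positions and can give slightly different values.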



Interquartile Range

 Quartile
◦ The most commonly used percentiles are quartiles.
◦ The first quartile, denoted by $Q_1$, is the 25th percentile.
◦ The third quartile, denoted by $Q_3$, is the 75th percentile.
◦ The median, $Q_2$, is the 50th percentile.
The quartiles, including the median, give some indication of the center, spread and shape of a
distribution.
The distance between $Q_1$ and $Q_3$ is a simple measure of spread that gives the range covered by the
middle half of the data. This distance is called the interquartile range (IQR) and is defined as
IQR = $Q_3 - Q_1$



Quartile and Interquartile
Example: 5, 7, 4, 4, 6, 2, 8
1. Put them in order: 2, 4, 4, 5, 6, 7, 8
2. Cut the list into quarters:
2  4  4  5  6  7  8

Quartile (Q1) = 4     Quartile (Q2) = 5     Quartile (Q3) = 7
Lower quartile        Median                Upper quartile

Interquartile Range = Q3 − Q1 = 7 − 4 = 3


EXAMPLE
1, 3, 3, 4, 5, 6, 6, 7, 8, 8

Q1 = 3, Q2 (median) = 5.5, Q3 = 7
Interquartile Range = Q3 − Q1 = 7 − 3 = 4
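
Both quartile examples can be reproduced by treating Q1 and Q3 as the 25th and 75th percentiles. The sketch below assumes the position rule i = (p/100) × n, rounding up when i is not an integer and averaging positions i and i+1 when it is; the function names are illustrative.

```python
def percentile(values, p):
    """pth percentile (whole number p, 0 < p < 100) by the rule i = (p/100) * n."""
    ordered = sorted(values)
    whole, remainder = divmod(p * len(ordered), 100)
    if remainder == 0:
        return (ordered[whole - 1] + ordered[whole]) / 2   # average of i and i+1
    return ordered[whole]                                   # rounded-up position


def iqr(values):
    """Interquartile range Q3 - Q1."""
    return percentile(values, 75) - percentile(values, 25)


first = [5, 7, 4, 4, 6, 2, 8]
second = [1, 3, 3, 4, 5, 6, 6, 7, 8, 8]
print(percentile(first, 25), percentile(first, 75), iqr(first))     # 4 7 3
print(percentile(second, 25), percentile(second, 75), iqr(second))  # 3 7 4
```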



Covariance
Covariance provides insight into how two variables are
related to one another.
More precisely, covariance refers to the measure of how two
random variables in a data set will change together.
A positive covariance means that the two variables at hand are
positively related, and they move in the same direction. 
A negative covariance means that the variables are inversely
related, or that they move in opposite directions. 



Covariance
 The formula for covariance is as follows:

$$Cov(X, Y) = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{N - 1}$$

In this formula, X represents the independent variable, Y represents the dependent variable, N
represents the number of data points in the sample, x-bar represents the mean of X, and
y-bar represents the mean of Y.



Are Covariance and Correlation The Same Thing?
Simply put, no.
While both covariance and correlation indicate whether variables are positively
or inversely related to each other, they are not the same.
Covariance is expressed in the units of the two variables, so it does not use one
standardized unit of measurement and its magnitude is hard to interpret.
Correlation, on the other hand, standardizes the measure of interdependence
between two variables, and so also informs researchers about the degree to which the two
variables move together.



EXAMPLE
Calculate covariance for the following data set:
x: 2.1, 2.5, 3.6, 4.0
y: 8, 10, 12, 14

x      y     x − mean(x)   y − mean(y)   (x − mean(x)) * (y − mean(y))
2.1    8     -0.95         -3            2.85
2.5    10    -0.55         -1            0.55
3.6    12    0.55          1             0.55
4.0    14    0.95          3             2.85
                                         Sum = 6.8
mean(x) = 3.05, mean(y) = 11

Cov(x, y) = 6.8 / 3 = 2.267

The result is positive, meaning that the variables are positively related.
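
The same covariance can be computed in a few lines of Python. The sketch divides by N − 1, as the worked example does (6.8 / 3 for four data points); the function name is an assumption.

```python
def sample_covariance(xs, ys):
    """Sum of products of deviations from the means, divided by N - 1."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long samples with at least 2 points")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    products = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return products / (n - 1)


x = [2.1, 2.5, 3.6, 4.0]
y = [8, 10, 12, 14]
cov_xy = sample_covariance(x, y)
print(round(cov_xy, 3))   # 6.8 / 3, about 2.267
```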
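EXAMPLE
To calculate the correlation for the same data:

x: 2.1, 2.5, 3.6, 4.0
y: 8, 10, 12, 14

x      y     x − mean(x)   y − mean(y)   (x − mean(x)) * (y − mean(y))   (x − mean(x))^2   (y − mean(y))^2
2.1    8     -0.95         -3            2.85                            0.9025            9
2.5    10    -0.55         -1            0.55                            0.3025            1
3.6    12    0.55          1             0.55                            0.3025            1
4.0    14    0.95          3             2.85                            0.9025            9
Sum                                      6.8                             2.41              20

The covariance (2.267) uses the N − 1 divisor, so the standard deviations must use it too:
Sample standard deviation of x = √(2.41 / 3) = 0.896
Sample standard deviation of y = √(20 / 3) = 2.582
Correlation = 2.267 / (0.896 × 2.582) ≈ 0.98

A positive correlation coefficient less than one indicates a less than perfect positive
correlation; a value this close to 1 indicates a strong positive linear relationship.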
EXAMPLE:
Suppose you take a sample of stock returns from the Excelsior Corporation and the
Adirondack Corporation from the years 2008 to 2012, as shown here.
What are the covariance and correlation between the stock returns?

Year   Excelsior Corp. Annual Return (percent) (X)   Adirondack Corp. Annual Return (percent) (Y)
2008   1                                             3
2009   –2                                            2
2010   3                                             4
2011   0                                             6
2012   3                                             0



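EXAMPLE:
The sample means are x-bar = (1 − 2 + 3 + 0 + 3) / 5 = 1 and y-bar = (3 + 2 + 4 + 6 + 0) / 5 = 3.

The sample covariance is
Cov(X, Y) = [(0)(0) + (−3)(−1) + (2)(1) + (−1)(3) + (2)(−3)] / (5 − 1) = −4 / 4 = −1

The sample variance of X is (0 + 9 + 4 + 1 + 4) / 4 = 4.5, and the sample variance of Y is (0 + 1 + 1 + 9 + 9) / 4 = 5.

The sample standard deviation of X is the square root of 4.5, or 2.1213.

The sample standard deviation of Y equals the square root of 5, or 2.2361.

The sample correlation formula gives you
r = Cov(X, Y) / (s_X × s_Y) = −1 / (2.1213 × 2.2361) = −0.2108
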

The negative result shows that there’s a weak negative correlation between the stock
returns of Excelsior and Adirondack. If two variables are perfectly negatively correlated (they
always move in opposite directions), their correlation will be –1. If two variables are
independent (unrelated to each other), their correlation will be 0. The correlation between
the returns to Excelsior and Adirondack stock is –0.2108, which indicates that the two
variables show a slight tendency to move in opposite directions.
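
The stock-return example can be checked with a short Python sketch. It assumes the sample (N − 1) forms of the covariance and standard deviation throughout; the function and variable names are illustrative.

```python
import math


def sample_correlation(xs, ys):
    """Sample covariance divided by the product of the sample standard deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    sd_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    sd_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    return cov / (sd_x * sd_y)


excelsior = [1, -2, 3, 0, 3]     # annual returns (percent), 2008 to 2012
adirondack = [3, 2, 4, 6, 0]

r = sample_correlation(excelsior, adirondack)
print(round(r, 4))   # -0.2108
```

The N − 1 divisors cancel in the ratio, so dividing by N everywhere would give the same r; mixing the two conventions would not.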




