
Quantitative Methods

Why statistics?
Decisions are often based on the analysis of data.
Statistics helps you make sense of data by providing tools to summarize, present and analyze it.
It also lets the decision maker gauge how much confidence to place in a decision.
Examples
How many newspapers should a vendor stock to maximize revenue?
Depends on the probability distribution of demand and the expected profit
Are two or more market segments significantly different?
Hypothesis testing
What proportion of people are happy with the Sixth Pay Commission report?
Parameter estimation
Sample vs. Population
The population is the entire group or collection of individuals, objects or things that we want information about.
A sample is the part of the population that we actually examine to gather information.
Example
We wish to find the average dividend percentage of all companies traded on the NSE.
All stocks traded on the NSE make up the population.
The 10% of stocks selected for gathering information make up the sample.

Subdivision within Statistics
Descriptive Statistics
Collect
Organize
Summarize
Display
Analyze
Inferential Statistics
Predict and forecast values of population parameters
Test hypotheses about values of population parameters
Make decisions
Descriptive statistics - data and frequency distribution
The following are the departure delays, in minutes, of 42 flights selected at random from a particular airport.

10 12 45
13 8 40
13 0 0
20 45 0
95 38 67
4 47 55
0 56 5
45 50 27
50 15 26
34 12 25
48 40 25
50 42 48
53 44 23
56 46 22
Frequency Distribution
A table with two columns listing:
Each group (class or interval) of values
The frequency associated with each group, i.e. the number of observations assigned to it
The sum of the frequencies is the number of observations
The class midpoint is the middle value of a group (class or interval)
The relative frequency is the proportion (or percentage) of total observations in each class
The sum of the relative frequencies = 1
Frequency distribution
Delay in minutes   Frequency   Relative frequency
0 - 15                    12                0.286
15 - 30                    8                0.190
30 - 45                    6                0.143
45 - 60                   14                0.333
60 or more                 2                0.048
Total                     42                1
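The table can be rebuilt from the raw delays with a short Python sketch. It assumes each class includes its lower limit and excludes its upper limit (so a delay of 15 falls in 15 - 30), which is the reading that reproduces the counts shown; the function name `frequency_distribution` is illustrative.

```python
# Departure delays (minutes) of the 42 sampled flights, read row by row from the data slide.
delays = [
    10, 12, 45, 13, 8, 40, 13, 0, 0, 20, 45, 0, 95, 38, 67, 4, 47, 55,
    0, 56, 5, 45, 50, 27, 50, 15, 26, 34, 12, 25, 48, 40, 25, 50, 42, 48,
    53, 44, 23, 56, 46, 22,
]

# Lower class limits. Assumption: a class includes its lower limit and
# excludes its upper limit; the last class ("60 or more") is open-ended.
edges = [0, 15, 30, 45, 60]
labels = ["0 - 15", "15 - 30", "30 - 45", "45 - 60", "60 or more"]


def frequency_distribution(values, lower_limits):
    """Count how many values fall in each class defined by its lower limit."""
    counts = [0] * len(lower_limits)
    for v in values:
        # Index of the last class whose lower limit does not exceed v.
        k = max(i for i, lower in enumerate(lower_limits) if v >= lower)
        counts[k] += 1
    return counts


frequencies = frequency_distribution(delays, edges)
relative = [f / len(delays) for f in frequencies]

for label, f, r in zip(labels, frequencies, relative):
    print(f"{label:<12}{f:>4}{r:>8.3f}")
print(f"{'Total':<12}{sum(frequencies):>4}{sum(relative):>8.3f}")
```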
Frequency distribution - histogram
Histogram: Frequency (vertical axis, 0 to 16) by Delay in Minutes (horizontal axis; classes 0 - 15, 15 - 30, 30 - 45, 45 - 60, 60 or more)
Two variable frequency distribution - cross tabulation
A joint frequency distribution of two variables (e.g. ownership of airline, delay in minutes)
Delay in minutes   0-15  15-30  30-45  45-60  60 or more  Total
Govt.                 5      2      5      9           0     21
Private               7      6      1      5           2     21
Total                12      8      6     14           2     42
Descriptive statistics - measures
Measures of Location
Measures of Variability
Skewness and Kurtosis
Association between two variables

Measures of Location
Arithmetic Mean
Median
Mode
Percentiles
Quartiles

Arithmetic mean
The mean of a data set is the average of all the data values.
Sample mean: $\bar{x} = \frac{\sum x_i}{n}$
Population mean: $\mu = \frac{\sum x_i}{N}$
Mean example
Average delay in flight departure
$\bar{x} = \frac{\sum x_i}{n}$ = 1354/42 = 32.2381 minutes
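The same arithmetic as a minimal Python sketch, using the 42 delays from the data slide (variable names are illustrative):

```python
# Departure delays (minutes) of the 42 sampled flights, read row by row from the data slide.
delays = [
    10, 12, 45, 13, 8, 40, 13, 0, 0, 20, 45, 0, 95, 38, 67, 4, 47, 55,
    0, 56, 5, 45, 50, 27, 50, 15, 26, 34, 12, 25, 48, 40, 25, 50, 42, 48,
    53, 44, 23, 56, 46, 22,
]

n = len(delays)     # number of observations
total = sum(delays) # sum of all x_i
mean = total / n    # sample mean

print(total, n, round(mean, 4))  # 1354 42 32.2381
```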
Median
It is the middle item in a data set that is arranged in ascending or descending order.
If there are n observations, the median is the (n+1)/2 th observation.
Computation rule
If n is odd, (n+1)/2 is an integer, so the median is that observation.
If n is even, use the average of the (n/2)th and (n/2 + 1)th observations.
Example
With 42 observations, the median is the average of the 21st and 22nd observations
= (34+38)/2
= 36
Sorted 42 observations (read down each column):
0 22 45
0 23 46
0 25 47
0 25 48
4 26 48
5 27 50
8 34 50
10 38 50
12 40 53
12 40 55
13 42 56
13 44 56
15 45 67
20 45 95
Mode
The mode is the most frequently occurring observation.
The mode in the example is 0, which occurs four times.
The greatest frequency can occur at two
or more different values.
If the data have exactly two modes, the
data are bimodal.
If the data have more than two modes, the
data are multimodal.
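A minimal Python sketch for finding the mode or modes of a data set; the helper name `modes` is illustrative. It returns every value that attains the greatest frequency, so bimodal and multimodal data are reported in full.

```python
from collections import Counter

# Departure delays (minutes) of the 42 sampled flights, read row by row from the data slide.
delays = [
    10, 12, 45, 13, 8, 40, 13, 0, 0, 20, 45, 0, 95, 38, 67, 4, 47, 55,
    0, 56, 5, 45, 50, 27, 50, 15, 26, 34, 12, 25, 48, 40, 25, 50, 42, 48,
    53, 44, 23, 56, 46, 22,
]


def modes(values):
    """Return (sorted list of most frequent values, their frequency)."""
    counts = Counter(values)
    top = max(counts.values())
    return sorted(v for v, c in counts.items() if c == top), top


print(modes(delays))           # ([0], 4)
print(modes([1, 1, 2, 2, 3]))  # ([1, 2], 2), a bimodal data set
```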
Percentiles and Quartiles
Given any set of ordered numerical observations:
The Pth percentile in the ordered set is the value below which lie P% (P percent) of the observations in the set.
The position of the Pth percentile is given by (n + 1)P/100, where n is the number of observations in the set.
Example
Calculate the 45th percentile of the airline delay data.
The position of the 45th percentile is 45(42+1)/100 = 19.35
Value of the 45th percentile
= 19th observation + 0.35 of (20th observation - 19th observation)
= 26 + 0.35(27 - 26)
= 26.35
Quartiles
Quartiles are special names for particular percentiles
Q1 = 25th percentile
Q2 = 50th percentile = median
Q3 = 75th percentile
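The position rule (n + 1)P/100 with linear interpolation between neighbouring observations can be sketched in Python as below. The function name `percentile` is illustrative, and clamping to the smallest or largest observation when the position falls outside 1..n is an assumption the slides do not cover. Library routines often use other interpolation rules and may give slightly different values.

```python
# Departure delays (minutes) of the 42 sampled flights, read row by row from the data slide.
delays = [
    10, 12, 45, 13, 8, 40, 13, 0, 0, 20, 45, 0, 95, 38, 67, 4, 47, 55,
    0, 56, 5, 45, 50, 27, 50, 15, 26, 34, 12, 25, 48, 40, 25, 50, 42, 48,
    53, 44, 23, 56, 46, 22,
]


def percentile(values, p):
    """P-th percentile using position (n + 1) * P / 100 and linear interpolation."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) * p / 100  # 1-based position, possibly fractional
    lower = int(position)         # whole part of the position
    fraction = position - lower   # fractional part of the position
    # Assumption: clamp to the ends when the position is outside 1..n.
    if lower < 1:
        return ordered[0]
    if lower >= n:
        return ordered[-1]
    below = ordered[lower - 1]    # the lower-th observation
    above = ordered[lower]        # the next observation
    return below + fraction * (above - below)


print(round(percentile(delays, 45), 2))  # 26.35
for name, p in [("Q1", 25), ("Q2", 50), ("Q3", 75)]:
    print(name, percentile(delays, p))
```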
Measures of Variability
Range
Interquartile Range
Variance
Standard Deviation
Coefficient of Variation
Range
The range of a data set is the difference
between the largest and smallest data values.
It is the simplest measure of variability.
It is very sensitive to the smallest and largest
data values.
Example from airline delay data
Range = 95 - 0 = 95 minutes
Interquartile range
The interquartile range of a data set is the
difference between the third quartile and the first
quartile.
It is the range for the middle 50% of the data.
It overcomes the sensitivity to extreme data
values.
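A short Python sketch of the range and the interquartile range for the delay data. The quartiles here are computed with the (n + 1)P/100 position rule and linear interpolation; the helper name `percentile` is illustrative, and other quartile conventions would give slightly different values.

```python
# Departure delays (minutes) of the 42 sampled flights, read row by row from the data slide.
delays = [
    10, 12, 45, 13, 8, 40, 13, 0, 0, 20, 45, 0, 95, 38, 67, 4, 47, 55,
    0, 56, 5, 45, 50, 27, 50, 15, 26, 34, 12, 25, 48, 40, 25, 50, 42, 48,
    53, 44, 23, 56, 46, 22,
]


def percentile(values, p):
    """P-th percentile at position (n + 1) * P / 100.

    Assumption: the position lies between 1 and n - 1, which holds for the
    quartiles of a data set of this size.
    """
    ordered = sorted(values)
    position = (len(ordered) + 1) * p / 100
    lower = int(position)
    fraction = position - lower
    return ordered[lower - 1] + fraction * (ordered[lower] - ordered[lower - 1])


data_range = max(delays) - min(delays)  # largest minus smallest value
q1 = percentile(delays, 25)             # first quartile
q3 = percentile(delays, 75)             # third quartile
iqr = q3 - q1                           # spread of the middle 50% of the data

print("Range:", data_range)  # Range: 95
print("IQR:", iqr)
```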
Variance
The variance is a measure of variability that utilizes all the data.
It is based on the difference between the value of each observation ($x_i$) and the mean ($\bar{x}$ for a sample, $\mu$ for a population).
Population variance: $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$
Sample variance: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$
Standard deviation
The standard deviation of a data set is the positive square root of the variance.
It is measured in the same units as the data, so it is easier than the variance to compare with the mean.
If the data set is a sample, the standard deviation is denoted s.
If the data set is a population, the standard deviation is denoted $\sigma$ (sigma).

Coefficient of Variation
The coefficient of variation indicates how large the standard deviation is in relation to the mean.
If the data set is a sample, the coefficient of variation is computed as follows:
$\frac{s}{\bar{x}}(100)$
If the data set is a population, the coefficient of variation is computed as follows:
$\frac{\sigma}{\mu}(100)$
Example
Variance
$s^2$ = 465.89 minutes squared

Standard Deviation
s = 21.585 minutes

Coefficient of Variation
= $\frac{s}{\bar{x}}(100)$ = 21.585/32.2381 (100) = 66.95%
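These figures can be reproduced with a short Python sketch. It treats the 42 flights as a sample (divisor n - 1), which is the choice that matches the values in the example; variable names are illustrative.

```python
from math import sqrt

# Departure delays (minutes) of the 42 sampled flights, read row by row from the data slide.
delays = [
    10, 12, 45, 13, 8, 40, 13, 0, 0, 20, 45, 0, 95, 38, 67, 4, 47, 55,
    0, 56, 5, 45, 50, 27, 50, 15, 26, 34, 12, 25, 48, 40, 25, 50, 42, 48,
    53, 44, 23, 56, 46, 22,
]

n = len(delays)
mean = sum(delays) / n

# Sum of squared deviations from the mean, shared by both variance formulas.
squared_deviations = sum((x - mean) ** 2 for x in delays)

sample_variance = squared_deviations / (n - 1)  # s^2, divisor n - 1
population_variance = squared_deviations / n    # sigma^2, if the data were a population
sample_sd = sqrt(sample_variance)               # s, in minutes
cv = sample_sd / mean * 100                     # coefficient of variation, in percent

print(round(sample_variance, 2))  # 465.89
print(round(sample_sd, 3))        # 21.585
print(round(cv, 2))               # 66.95
```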

Skewness
Skewness characterizes the degree of
asymmetry of a distribution around its
mean
Positively skewed
Symmetric or unskewed
Negatively skewed

Skewness (figures): skewed to left, symmetric, skewed to right
Skewness - measure
Skewness of a distribution is measured by
$\frac{\sum (X - \mu)^3}{n\,\sigma^3}$
For a given data set you may use
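The data-set formula on this slide did not survive extraction. One common option, shown here as a hedged sketch rather than the slide's own method, is the moment-based measure: the average cubed deviation from the mean divided by the cube of the standard deviation (divisor n). The function name `skewness` is illustrative, and some software applies small-sample adjustments, so its output can differ slightly.

```python
from math import sqrt


def skewness(values):
    """Moment-based skewness: mean cubed deviation divided by sigma cubed.

    Assumption: divisor n is used for both moments (population form).
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    return m3 / sqrt(m2) ** 3


# Illustrative data sets (not from the slides):
print(skewness([1, 2, 3]))      # symmetric data give 0
print(skewness([0, 0, 0, 10]))  # long right tail gives a positive value
print(skewness([0, 10, 10, 10]))  # long left tail gives a negative value
```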
Kurtosis
Kurtosis characterizes the relative
peakedness or flatness of a symmetric
distribution compared to the normal
distribution
Platykurtic (relatively flat)
Mesokurtic (normal)
Leptokurtic (relatively peaked)
Kurtosis (figures): platykurtic (flat distribution), mesokurtic (not too flat and not too peaked), leptokurtic (peaked distribution)
Kurtosis - measure
Kurtosis for a distribution is measured by
$\beta_2 - 3$
where
$\beta_2 = \frac{\sum (X - \mu)^4}{n\,\sigma^4}$
For a given data set you may use
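The data-set formula on this slide did not survive extraction. As a hedged sketch, the moment-based measure can be computed directly: the average fourth-power deviation divided by the squared variance (divisor n), minus 3 so that a normal distribution scores 0. The function name `excess_kurtosis` is illustrative, and software that applies small-sample adjustments will give slightly different numbers.

```python
def excess_kurtosis(values):
    """Moment-based kurtosis relative to the normal distribution (beta_2 - 3).

    Assumption: divisor n is used for both moments (population form).
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment (variance)
    m4 = sum((x - mean) ** 4 for x in values) / n  # fourth central moment
    return m4 / m2 ** 2 - 3


# Illustrative data sets (not from the slides):
flat = [-1, 1, -1, 1]             # all mass at two points: platykurtic
peaked = [0] * 8 + [10, -10]      # sharp centre with heavy tails: leptokurtic
print(excess_kurtosis(flat))      # negative
print(excess_kurtosis(peaked))    # positive
```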
Association between two variables
Delay Passengers Delay Passengers Delay Passengers
53 65 56 51 50 68
40 61 42 50 0 72
46 53 25 57 38 74
0 65 13 57 55 68
22 45 40 54 45 73
5 58 8 54 15 63
44 68 27 65 48 68
12 65 67 57 0 55
12 56 48 62 10 45
25 50 4 50 50 71
13 70 45 61 56 64
50 73 0 59 26 60
45 63 34 63 47 61
23 56 95 49 20 48
Association between two variables
Scatter plot
Covariance
Correlation Coefficient

Scatter Plot
Scatter Plots are used to identify any
underlying relationships among pairs of
data sets.
The plot consists of a scatter of points,
each point representing an observation.

Scatter Plot
Scatter plot: Delay vs Passengers (horizontal axis: Passengers, 0 to 80; vertical axis: Delay, 0 to 100)
Covariance
The covariance is a measure of the linear
association between two variables.
Positive values indicate a positive
relationship.
Negative values indicate a negative
relationship
If the data sets are samples, the covariance is denoted by $s_{xy}$:
$s_{xy} = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$
If the data sets are populations, the covariance is denoted by $\sigma_{xy}$:
$\sigma_{xy} = \frac{\sum (x_i - \mu_x)(y_i - \mu_y)}{N}$
Covariance = 20.42 in the airline example
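A minimal Python sketch of the sample covariance formula. To keep it short, it uses only the first five (Delay, Passengers) pairs from the leftmost columns of the data table, so its output is not the full-data figure quoted above; the function name `sample_covariance` is illustrative.

```python
def sample_covariance(xs, ys):
    """Sample covariance: sum of cross-deviations divided by n - 1."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return cross / (n - 1)


# First five (Delay, Passengers) pairs from the leftmost columns of the table.
delay = [53, 40, 46, 0, 22]
passengers = [65, 61, 53, 65, 45]

s_xy = sample_covariance(delay, passengers)
print(s_xy)  # a positive value indicates a positive linear association in this subset
```

Covariance depends on the units of both variables, which is why the correlation coefficient is usually preferred for judging the strength of the relationship.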
Correlation Coefficient
The coefficient can take on values between -1 and +1.
Values near -1 indicate a strong negative linear relationship.
Values near +1 indicate a strong positive linear relationship.
If the data sets are samples, the coefficient is
$r_{xy} = \frac{s_{xy}}{s_x s_y}$
If the data sets are populations, the coefficient is
$\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$
Correlation coefficient = 0.121 in the airline example
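A minimal Python sketch of the sample correlation coefficient, built from the sample covariance and the two sample standard deviations. It uses only the first five (Delay, Passengers) pairs from the leftmost columns of the data table, so its output is not the full-data figure quoted above; the function name `correlation` is illustrative.

```python
from math import sqrt


def correlation(xs, ys):
    """Sample correlation coefficient r = s_xy / (s_x * s_y)."""
    if len(xs) != len(ys):
        raise ValueError("xs and ys must have the same length")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    s_x = sqrt(sum((x - mean_x) ** 2 for x in xs) / (n - 1))
    s_y = sqrt(sum((y - mean_y) ** 2 for y in ys) / (n - 1))
    return s_xy / (s_x * s_y)


# First five (Delay, Passengers) pairs from the leftmost columns of the table.
delay = [53, 40, 46, 0, 22]
passengers = [65, 61, 53, 65, 45]

r = correlation(delay, passengers)
print(round(r, 3))

# Perfectly linear data sit at the ends of the scale.
print(correlation([1, 2, 3], [2, 4, 6]))  # close to +1
print(correlation([1, 2, 3], [3, 2, 1]))  # close to -1
```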
