Statistics
Data Information
Population
— a population is the group of all items of
interest to a statistics practitioner.
Sample
— A sample is a set of data drawn from the
population. [It is a part of a population]
Vocabulary of Statistics…(cont)
Parameter
— A descriptive measure of a population.
Statistic
— A descriptive measure of a sample.
[Figure: the sample is a subset of the population; a parameter describes the population, a statistic describes the sample]
Variables may be …
— Discrete and Continuous.
Vocabulary of Statistics…(cont)
Data
Data are the different numerical values associated
with a variable.
There are two ways to DESCRIBE the data...
One, Visualization
Two, Numerical Techniques: understanding data by summarizing it through various statistical measures.
We can also describe our data in terms of its various characteristics.
DATA CAN BE CHARACTERIZED BY
A NUMBER OF PARAMETERS…!!!
● CENTRAL TENDENCY
● DISPERSION
● SYMMETRY
● PEAKEDNESS
Describing Data Numerically
Mode Variance
Standard Deviation
Coefficient of Variation
Types of series
Individual series or series without frequencies
Frequency Series
Discrete series (or Frequency Array)
Continuous Series (Frequency Distribution or Series with
Class intervals)
Grouped Data
Ungrouped Data
FIRST…
MEASURES OF
CENTRAL
TENDENCY
Which measure should be
used?
Purpose
Common value
Mid value
Average
Algebraic treatment
Nature of Data
Open end C.I.
Varying C.I./unequal C.I.
Extreme Values
Size of the data
Measures of Central Tendency
Objective:
To describe data
One value that describes the entire data
bird's eye view
To facilitate comparison
At one point of time, or
Over a period of time
Also called central value/expected
value/average
Requisites of a Good Average
Easy to understand
Easy to compute
Based on all items in data
Should not be unduly affected by extreme
values
Subject to further algebraic treatment
Rigidly defined
Sampling stability
Calculate Mean and Median
Mean: arithmetic average, $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$
Median: midpoint of ranked values
Mode: most frequently observed value (if one exists)
Arithmetic Mean
The arithmetic mean (mean) is the most
common measure of central tendency
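For a population of N values:
$\mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + x_2 + \cdots + x_N}{N}$
where x_1, ..., x_N are the population values and N is the population size
For a sample of size n:
$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$
where x_1, ..., x_n are the observed values and n is the sample size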
Arithmetic Mean
(continued)
Example: the values 1, 2, 3, 4, 5 have Mean = (1 + 2 + 3 + 4 + 5) / 5 = 15 / 5 = 3
Replacing 5 with 10 gives Mean = (1 + 2 + 3 + 4 + 10) / 5 = 20 / 5 = 4
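A minimal Python sketch of the two small data sets above (1, 2, 3, 4, 5 and 1, 2, 3, 4, 10). It uses only the standard `statistics` module; the variable names are illustrative. It also computes the median, which the slides turn to next, so the two measures can be compared.

```python
from statistics import mean, median

# The two data sets from the example; the second replaces 5 with the larger value 10.
data_a = [1, 2, 3, 4, 5]
data_b = [1, 2, 3, 4, 10]

mean_a, mean_b = mean(data_a), mean(data_b)
median_a, median_b = median(data_a), median(data_b)

print("means:  ", mean_a, mean_b)
print("medians:", median_a, median_b)
```

One large value moves the mean from 3 to 4, while the middle ranked value stays at 3 in both sets.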
Based on all items
Subject to further algebraic treatment
Rigidly defined
It is a calculated value rather than a positional value such as the median
[Figure: two dot plots on a 0 to 10 axis; Median = 3 in both]
Mode
[Figure: two dot plots; on the 0 to 14 axis Mode = 9, on the 0 to 6 axis there is No Mode]
In symmetrical distribution,
Mean=Median=Mode
Which measure of location
is the "best"?
Q1 = 35 would mean?
Q3 = 44 would mean?
Days:           1   2   3   4   5   6   7   8   9   10
Ranked values:  29  31  35  39  39  40  43  44  44  52
Quartiles
Quartiles split the ranked data into 4 segments with
an equal number of values per segment (note that
the widths of the segments may be different)
Q1 Q2 Q3
The first quartile, Q1, is the value for which 25% of the
observations are smaller than or equal to Q1 and 75%
of the observations are larger than or equal to Q1
Q2 is the same as the median (50% are smaller, 50% are
larger)
Only 25% of the observations are greater than the third
quartile, Q3, and 75% of the observations are less than Q3
Quartile Formulas
The first quartile is the value in the 0.25(n+1) position of the ranked data
Example (n = 9):
Q1 is in the 0.25(9+1) = 2.5 position of the ranked data,
so use the value halfway between the 2nd and 3rd values,
so Q1 = 12.5
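The position rule above can be sketched in Python as follows. The function name `quartile_by_position` and the nine sample values are assumptions made for illustration (the slide's own data did not survive extraction); the values were chosen so that the 2nd and 3rd ranked values are 12 and 13. Other quartile conventions exist and can give slightly different answers.

```python
def quartile_by_position(values, p):
    """Value at the p * (n + 1) position of the ranked data.

    Positions are 1-based; a fractional position is resolved by linear
    interpolation between the two neighbouring ranked values.
    """
    ranked = sorted(values)
    n = len(ranked)
    position = p * (n + 1)
    lower = int(position)          # whole part of the position
    fraction = position - lower    # fractional part of the position
    if lower < 1:
        return ranked[0]
    if lower >= n:
        return ranked[-1]
    return ranked[lower - 1] + fraction * (ranked[lower] - ranked[lower - 1])


# Hypothetical ranked sample with n = 9 (not taken from the slides).
sample = [11, 12, 13, 16, 16, 17, 18, 21, 22]

q1 = quartile_by_position(sample, 0.25)  # position 2.5
q2 = quartile_by_position(sample, 0.50)  # position 5
q3 = quartile_by_position(sample, 0.75)  # position 7.5
print(q1, q2, q3)
```

For this sample the 2.5 position falls halfway between 12 and 13, which reproduces Q1 = 12.5.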
Percentiles and Quartiles
Skewness
SKEWNESS……
Asymmetrical Distribution
When frequencies are not equally distributed on both
sides of central value
The curve gets inclined to one side
Mean ≠ Median ≠ Mode
SKEWNESS……
A distribution can have
Positively skewed curve/Positive skewness:
The curve is inclined to the right
Mean > Median > Mode
(Q3 - Median) > (Median - Q1)
In this case, the majority of the observations have values less
than the mean, and the curve has a long tail on the right-hand side, i.e.
very few frequencies are spread over the high-value end of the
curve. (The long tail is caused by some extremely large values.)
Negatively skewed curve/Negative skewness:
The curve is inclined to the left
Mean < Median < Mode
(Q3 - Median) < (Median - Q1)
In this case, the majority of the observations have values greater
than the mean, and the curve has a long tail on the left-hand side, i.e.
very few frequencies are spread over the low-value end of the curve.
(The long tail is caused by some extremely low values.)
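The quartile comparison used above can be written as a small Python helper. This is only a sketch: the function name `skew_direction` and the example numbers are assumptions for illustration, and comparing quartile distances gives the direction of skew, not its magnitude.

```python
def skew_direction(q1, median, q3):
    """Classify skewness by comparing (Q3 - Median) with (Median - Q1)."""
    upper_gap = q3 - median
    lower_gap = median - q1
    if upper_gap > lower_gap:
        return "positive"   # longer spread above the median
    if upper_gap < lower_gap:
        return "negative"   # longer spread below the median
    return "symmetric"      # equal spread on both sides of the median


# Hypothetical quartile values.
print(skew_direction(10, 15, 30))  # upper gap 15 > lower gap 5
print(skew_direction(30, 45, 57))  # upper gap 12 < lower gap 15
```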
[Figure: probability distributions of RETURN for SHARE A and SHARE B; x-axis: Return (0 to 35), y-axis: Probability (0.00 to 0.30)]
In which share do you think you should invest?
Now, look again at the following two shares…

SHARE A                      SHARE B
Return (%)   Probability     Return (%)   Probability
2            0.05            1            0.02
9            0.29            4            0.08
12           0.24            7            0.10
16           0.17            9            0.13
19           0.12            12           0.16
23           0.07            16           0.18
28           0.03            21           0.31
30           0.04            30           0.02
Total        1.00            Total        1.00

                     SHARE A   SHARE B
Expected Return      14.10%    14.10%
Standard Deviation   6.40      6.40
Skewness             0.80      -1.07
THIRD……
MEASURE OF
PEAKEDNESS
KURTOSIS……
Leptokurtic curve/distribution:
A distribution with a relatively higher peak than the normal
distribution
It has a greater concentration of frequencies at the
centre than the normal distribution
Platykurtic curve/distribution:
A distribution that is relatively flatter than the normal
distribution
It has a lesser concentration of frequencies at the
centre than the normal distribution
[Figure: example histograms on a 1.0 to 5.0 axis, including a platykurtic example]
Not all mesokurtic curves are normal curves,
but all normal curves are mesokurtic curves
[Figure: box plot of age of respondent on a 34 to 46 axis, marking the lower quartile, median, and upper quartile]
Box-and-Whisker Plot
The plot can be oriented horizontally or vertically
Example:
X minimum = 12, Q1 = 30, Median (Q2) = 45, Q3 = 57, X maximum = 70
Each of the four segments contains 25% of the observations
Question!
Geometric mean
Used to measure the rate of change of a variable
over time
$\bar{x}_g = \sqrt[n]{x_1 \times x_2 \times \cdots \times x_n} = (x_1 \times x_2 \times \cdots \times x_n)^{1/n}$
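A short Python sketch of the geometric mean formula. It computes the nth root through logarithms, which is equivalent to the product form for positive values; the function name and the growth factors are hypothetical and chosen only to illustrate averaging a rate of change over time.

```python
import math


def geometric_mean(values):
    """n-th root of the product of n positive values, computed via logs."""
    if not values or any(v <= 0 for v in values):
        raise ValueError("geometric mean needs a non-empty list of positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical yearly growth factors: +10%, +20%, -10%.
growth_factors = [1.10, 1.20, 0.90]
average_factor = geometric_mean(growth_factors)
print(f"average growth factor per year: {average_factor:.4f}")
```

Applying the average factor three times gives the same overall growth as the three individual factors, which is why the geometric mean suits rates of change.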
1. Rates
2. The following should be considered:
If rates are to be averaged over Numerator units,
then HM should be used
If rates are to be averaged over Denominator units,
then AM should be used
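The rule above can be checked with a hypothetical speed example in Python, where the rate is km per hour, so km is the numerator unit and hours the denominator unit. The function names and all numbers are assumptions for illustration only.

```python
def harmonic_mean(rates):
    """Number of rates divided by the sum of their reciprocals."""
    return len(rates) / sum(1 / r for r in rates)


def arithmetic_mean(rates):
    """Sum of the rates divided by their count."""
    return sum(rates) / len(rates)


speeds = [60, 40]  # hypothetical speeds in km/h

# Equal distances (numerator units): 120 km at each speed.
true_avg_equal_distance = (120 + 120) / (120 / 60 + 120 / 40)

# Equal times (denominator units): 1 hour at each speed.
true_avg_equal_time = (60 * 1 + 40 * 1) / (1 + 1)

print(harmonic_mean(speeds), true_avg_equal_distance)
print(arithmetic_mean(speeds), true_avg_equal_time)
```

With equal distances the true average speed matches the harmonic mean; with equal times it matches the arithmetic mean.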
Fourth Measure,
Measures of Dispersion
MEASURES OF VARIATION &
SPREAD
The degree to which numerical data tend to
spread around an average value is called the
measure of variation or dispersion.
Dispersion is the degree of spread in a variable
- it describes the distance from the average.
Mean alone does not provide a complete or
sufficient description of data
Variation
Same center,
different variation
Range
Example:
[Figure: dot plot on a 0 to 14 axis]
Range = 14 - 1 = 13
Disadvantages of the Range
Ignores the way in which data are distributed,
i.e. it ignores all values except the two extremes
[Figure: two differently distributed data sets on a 7 to 12 axis; Range = 12 - 7 = 5 for both]
Sensitive to outliers
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5
Range = 5 - 1 = 4
1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120
Range = 120 - 1 = 119
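The two ranges above can be reproduced with a few lines of Python. The data are the 25 values listed on the slide; the helper name `value_range` is an illustrative choice.

```python
def value_range(values):
    """Largest value minus smallest value."""
    return max(values) - min(values)


# 11 ones, 8 twos, 4 threes, then the last two values.
without_outlier = [1] * 11 + [2] * 8 + [3] * 4 + [4, 5]
with_outlier = [1] * 11 + [2] * 8 + [3] * 4 + [4, 120]

print(value_range(without_outlier))
print(value_range(with_outlier))
```

Changing a single value out of 25 changes the range from 4 to 119.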
Disadvantages of the Range
No sampling stability
Then why use the range?
Applications?
Weather forecast
Quality control
Fluctuations in share prices
Interquartile Range
IQR = Q3 - Q1
Population Variance
$\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$
Where μ = population mean
N = population size
x_i = ith value of the variable x
Sample Variance
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$
Where $\bar{x}$ = arithmetic mean
n = sample size
x_i = ith value of the variable x
Population Standard Deviation
$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$
Sample Standard Deviation
$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$
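A minimal Python sketch of the population and sample formulas above, written out by hand so the divisor (N versus n - 1) is visible. The function names and the eight data values are hypothetical.

```python
import math


def population_variance(values):
    """Sum of squared deviations from the mean, divided by N."""
    mu = sum(values) / len(values)
    return sum((x - mu) ** 2 for x in values) / len(values)


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    x_bar = sum(values) / len(values)
    return sum((x - x_bar) ** 2 for x in values) / (len(values) - 1)


data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values, mean = 5

sigma = math.sqrt(population_variance(data))  # population standard deviation
s = math.sqrt(sample_variance(data))          # sample standard deviation
print(population_variance(data), sigma)
print(sample_variance(data), s)
```

The squared deviations sum to 32, so the population variance is 32 / 8 = 4 and the sample variance is 32 / 7, which is slightly larger.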
Standard Deviation
[Figure: three dot plots, Data A, Data B and Data C, each on an 11 to 21 axis]
Data A: s = 3.338 (compare to the two cases below)
Data B: s = 0.926 (values are concentrated near the mean)
Data C: s = 4.570 (values are dispersed far from the mean)
Advantages of Variance and
Standard Deviation
Commonly used as a measure of dispersion because of its
mathematical properties
Based on every item
Subject to further Algebraic treatment
Correlation, skewness, sampling etc.
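Stock A:
Average price last year = $50
Standard deviation = $5
$CV_A = \left(\frac{s}{\bar{x}}\right) \cdot 100\% = \frac{5}{50} \cdot 100\% = 10\%$
Stock B:
Average price last year = $100
Standard deviation = $5
$CV_B = \left(\frac{s}{\bar{x}}\right) \cdot 100\% = \frac{5}{100} \cdot 100\% = 5\%$
Both stocks have the same standard deviation, but stock B is less variable relative to its price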
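The coefficient of variation for the two stocks can be checked with a short Python sketch. The function name is an illustrative choice; the inputs are the means and standard deviations from the example (50 and 5 for stock A, 100 and 5 for stock B).

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return 100.0 * std_dev / mean


cv_a = coefficient_of_variation(5, 50)    # stock A
cv_b = coefficient_of_variation(5, 100)   # stock B
print(f"CV_A = {cv_a:.0f}%, CV_B = {cv_b:.0f}%")
```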
Which measure of dispersion to
use
Depends on
Type of data available
If extreme values & few items -> avoid S.D.
If skewed data -> avoid M.D. (when taken from the mean)
If there are gaps around quartiles -> avoid Q.D.
If there are open-end classes -> Q.D.
Purpose of investigation
If elementary treatment of statistical measure ->
Range, Q.D., M.D.
But for further statistical analysis -> S.D.
Question!
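The weighted mean of a set of data is
$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{n} = \frac{w_1 x_1 + w_2 x_2 + \cdots + w_n x_n}{n}$
Where w_i is the weight of the ith observation
and $n = \sum w_i$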
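A minimal Python sketch of the weighted mean. It divides the weighted sum by the sum of the weights, following the definition above; the function name, scores, and weights are hypothetical.

```python
def weighted_mean(values, weights):
    """Sum of weight * value, divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for w, x in zip(weights, values)) / sum(weights)


# Hypothetical scores where the second one counts three times as much.
scores = [80, 90]
weights = [1, 3]
result = weighted_mean(scores, weights)
print(result)
```

Here the weighted mean is (1 × 80 + 3 × 90) / 4 = 87.5, closer to the more heavily weighted score than the plain average of 85.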
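Approximations for Grouped Data
Suppose data are grouped into K classes, with
frequencies f1, f2, . . ., fK, and the midpoints of the
classes are m1, m2, . . ., mK
For a sample of n observations, the mean is
$\bar{x} = \frac{\sum_{i=1}^{K} f_i m_i}{n}$
where $n = \sum_{i=1}^{K} f_i$
and the variance is
$s^2 = \frac{\sum_{i=1}^{K} f_i (m_i - \bar{x})^2}{n - 1}$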
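The grouped-data approximations above can be sketched in Python as follows. The class midpoints and frequencies are hypothetical and the function names are illustrative; every observation in a class is treated as if it sat at the class midpoint, which is why the results are approximations.

```python
def grouped_mean(midpoints, frequencies):
    """Approximate mean: sum of f_i * m_i divided by n = sum of f_i."""
    n = sum(frequencies)
    return sum(f * m for f, m in zip(frequencies, midpoints)) / n


def grouped_sample_variance(midpoints, frequencies):
    """Approximate sample variance: sum of f_i * (m_i - mean)^2 over n - 1."""
    n = sum(frequencies)
    x_bar = grouped_mean(midpoints, frequencies)
    squared = sum(f * (m - x_bar) ** 2 for f, m in zip(frequencies, midpoints))
    return squared / (n - 1)


# Hypothetical classes 0-10, 10-20, 20-30 with their midpoints and frequencies.
midpoints = [5, 15, 25]
frequencies = [2, 5, 3]

g_mean = grouped_mean(midpoints, frequencies)
g_var = grouped_sample_variance(midpoints, frequencies)
print(g_mean, g_var)
```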
The Empirical Rule
For bell-shaped (approximately normal) distributions:
μ ± 1σ contains about 68% of the values in
the population or the sample
μ ± 2σ contains about 95% of the values in
the population or the sample
μ ± 3σ contains almost all (about 99.7%) of
the values in the population or the sample
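The percentages in the empirical rule can be checked against the normal distribution with Python's standard `math.erf`. This sketch assumes an exactly normal distribution; the function name is illustrative. For a normal variable, the share of values within k standard deviations of the mean is erf(k / √2).

```python
import math


def normal_coverage(k):
    """Share of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2))


coverage = {k: normal_coverage(k) for k in (1, 2, 3)}
for k, share in coverage.items():
    print(f"mu +/- {k} sigma: {share:.1%}")
```

The computed shares round to the 68%, 95%, and 99.7% figures used in the rule.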
A more general interpretation of the standard
deviation is derived from Chebyshev's Theorem,
which applies to all shapes of distributions
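The slide names the theorem without stating it. In its standard form, for any k > 1 at least 1 - 1/k² of the values lie within k standard deviations of the mean. A minimal Python sketch follows, with hypothetical skewed data and illustrative function names; the check uses the population standard deviation of the data set itself.

```python
import math


def chebyshev_lower_bound(k):
    """Minimum share of values within k standard deviations of the mean (k > 1)."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return 1 - 1 / k ** 2


def share_within(values, k):
    """Observed share of values within k population standard deviations."""
    mu = sum(values) / len(values)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in values) / len(values))
    inside = sum(1 for x in values if abs(x - mu) <= k * sigma)
    return inside / len(values)


skewed = [1, 1, 1, 2, 2, 3, 4, 50]  # hypothetical, strongly right-skewed data

bound = chebyshev_lower_bound(2)
observed = share_within(skewed, 2)
print(bound, observed)
```

The bound for k = 2 is 75%, looser than the 95% of the empirical rule, because it has to hold for every shape of distribution.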
What did we Learn?
Described measures of central tendency
Mean, median, mode
Illustrated the shape of the distribution
Symmetric, skewed
Described measures of variation
Range, interquartile range, variance and standard deviation,
coefficient of variation