SB 2024 Lecture3
4
Numerical Description
Three key characteristics of numerical data:
► Center: the extent to which the values of a numerical variable
group around a typical, or central, value.
► Where are the data values concentrated?
► What seem to be typical or middle data values? Is there
central tendency?
► Variability: the amount of dispersion, or scattering, away from a
central value that the values of a numerical variable show
► How much dispersion is there in the data?
► How spread out are the data values?
► Are there unusual values?
► Shape: the pattern of the distribution of values from the lowest
value to the highest value.
► Are the data values distributed symmetrically? Skewed?
Sharply peaked? Flat? Bimodal?
5
Measures of Central Tendency
6
Measures of Central Tendency
► Central tendency is a single value used to describe the center
point of a data set.
[Diagram: measures of central tendency, including the weighted mean]
7
Mean
► The mean, or average, is the most common measure of
central tendency.
► Calculate the mean by adding all the values in a data set and
then dividing the result by the number of observations.
8
Mean
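► For the population:
$$\mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + x_2 + \cdots + x_N}{N}$$
where $x_1, \ldots, x_N$ are the population values and $N$ is the population size.
► For the sample:
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$
where $x_1, \ldots, x_n$ are the observed values and $n$ is the sample size.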
9
Mean
► Suppose a sample of size n = 10 gives the following values:
4 3 8 10 0 2 3 7 5 2
$$\bar{x} = \frac{4 + 3 + 8 + 10 + 0 + 2 + 3 + 7 + 5 + 2}{10} = \frac{44}{10} = 4.4$$
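The calculation above can be checked with a short Python sketch. The variable names are illustrative only; the values are the ten observations from the example.

```python
from statistics import mean

# Sample values from the example above (n = 10).
values = [4, 3, 8, 10, 0, 2, 3, 7, 5, 2]

# Sample mean: sum of the observations divided by the number of observations.
x_bar = sum(values) / len(values)

print(x_bar)         # 4.4
print(mean(values))  # 4.4 (the standard library gives the same result)
```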
10
Mean
► Advantages:
► Easy to calculate
► Summarizes the data with a single value
► Disadvantages:
► Affected by outliers (values that are much higher or
lower than most of the data)
11
Mean
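► Affected by outliers (values that are much higher or lower than
most of the data)
► Without an outlier: $\frac{1 + 2 + 3 + 4 + 5}{5} = \frac{15}{5} = 3$, so Mean = 3
► With an outlier: $\frac{1 + 2 + 3 + 4 + 10}{5} = \frac{20}{5} = 4$, so Mean = 4
[Figure: two dot plots on a 0 to 10 axis, marking Mean = 3 and Mean = 4]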
12
Weighted Mean
► A weighted mean allows you to assign more weight to
certain values and less weight to others.
$$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$
= 3.56
► A’s grade point average is approximately 3.56.
14
Questions
► You invest in the stock market, and the returns are significantly
influenced by market conditions. The details are as follows:
Stock returns
Market condition   Likelihood   Returns
Up                 45%          20%
Neutral            20%          14%
Down               35%          -5%
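The question text for this slide is not reproduced here. Assuming it asks for the expected return, a minimal sketch treats the likelihoods as weights in a weighted mean. The function name `weighted_mean` is a hypothetical helper, not part of the lecture.

```python
def weighted_mean(values, weights):
    """Return sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for w, x in zip(weights, values)) / total_weight

# Figures from the table, in percent: Up, Neutral, Down.
returns_pct = [20, 14, -5]
likelihood_pct = [45, 20, 35]  # used as weights (assumption about the question)

expected_return = weighted_mean(returns_pct, likelihood_pct)
print(expected_return)  # 10.05
```

Dividing by the sum of the weights means the likelihoods do not need to be rescaled to sum to 1 first.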
15
The Median
16
Median
► The median is the value in the data set for which half the
observations are higher and half the observations are
lower.
17
Median
► Not affected by outliers
[Figure: two dot plots on a 0 to 10 axis; Median = 3 in both]
18
Median
► The location of the median:
$$\text{Median position} = \frac{n + 1}{2} \text{ position in the ordered data}$$
where n is the number of observations.
► Note that (n+1)/2 is not the value of the median, only the
position of the median in the ranked data
19
Median
► Suppose an ordered sample of size n = 9 gives the following
values:
0 3 4 7 9 12 13 17 25
$$\text{Median position} = \frac{9 + 1}{2} = 5$$
► The median is the 5th ordered value, 9.
20
Median
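► Suppose an ordered sample of size n = 10 gives the following
values:
0 3 4 7 9 12 13 17 25 4000
$$\text{Median position} = \frac{10 + 1}{2} = 5.5$$
► The median is the average of the 5th and 6th ordered values, (9 + 12)/2 = 10.5, so it is not affected by the outlier 4000.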
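A short Python sketch of the two median examples. It uses the position rule (n + 1)/2 and the standard library's `median`, which averages the two middle values when n is even; `median_position` is an illustrative helper name.

```python
from statistics import median

def median_position(n):
    """Position of the median in the ordered data: (n + 1) / 2."""
    return (n + 1) / 2

odd_sample = [0, 3, 4, 7, 9, 12, 13, 17, 25]          # n = 9
even_sample = [0, 3, 4, 7, 9, 12, 13, 17, 25, 4000]   # n = 10, with an outlier

print(median_position(len(odd_sample)), median(odd_sample))    # 5.0 9
print(median_position(len(even_sample)), median(even_sample))  # 5.5 10.5
```

Adding the extreme value 4000 moves the median only from 9 to 10.5, whereas the mean of the second sample would be dominated by that one observation.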
21
The Mode
23
Mode
24
Mode
[Figure: dot plot on a 0 to 6 axis; No Mode]
[Figure: dot plot on a 0 to 14 axis; Mode = 9]
25
Shape of a Distribution
►Describes how data are distributed
►Measures of shape
►Symmetric
►Skewed (Left or Right)
Left-skewed: Mean < Median    Symmetric: Mean = Median    Right-skewed: Median < Mean
27
Skewness
$$\text{Skewness} = \frac{n}{(n - 1)(n - 2)} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^3$$
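A minimal Python sketch of the adjusted sample skewness, n/((n − 1)(n − 2)) times the sum of cubed standardized deviations, with s the sample standard deviation. The function name and both data sets are illustrative assumptions, not taken from the slides.

```python
from math import sqrt

def sample_skewness(values):
    """Adjusted sample skewness: n/((n-1)(n-2)) * sum(((x - x_bar)/s)**3)."""
    n = len(values)
    if n < 3:
        raise ValueError("need at least three observations")
    x_bar = sum(values) / n
    # Sample standard deviation (divides by n - 1).
    s = sqrt(sum((x - x_bar) ** 2 for x in values) / (n - 1))
    if s == 0:
        raise ValueError("skewness is undefined when all values are equal")
    return n / ((n - 1) * (n - 2)) * sum(((x - x_bar) / s) ** 3 for x in values)

right_tailed = [4, 3, 8, 10, 0, 2, 3, 7, 5, 2]  # illustrative data: mean 4.4, median 3.5
symmetric = [1, 2, 3, 4, 5]                     # illustrative data: mean = median

print(sample_skewness(right_tailed) > 0)  # True: positive skew, longer right tail
print(sample_skewness(symmetric))         # 0.0
```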
28
Questions
►The number of students in 10 classes was recorded. The
stem and leaf diagram with 10’s digit for the stem shows
this information:
30
Measures of Variability
31
Measures of Variability
Variation
[Figure: dot plot on a 0 to 14 axis]
Range = 14 - 1 = 13
33
Range
►Advantages
►Easy to calculate and understand
►Disadvantages
►Based on only two values in the data set; ignores
the way in which the data are distributed
►Sensitive to outliers
34
Variance
►The variance is the average of the squared differences
between each data value and the mean.
►Sample variance is denoted by $s^2$
35
Variance
►N = population size
36
Variance
►Variance:
$$s^2 = \frac{(10-16)^2 + (12-16)^2 + (14-16)^2 + (15-16)^2 + (17-16)^2 + 2(18-16)^2 + (24-16)^2}{8-1} = \frac{130}{7} = 18.57$$
37
Standard Deviations
►Standard deviation is the square root of variance
38
Standard deviation
►Sample Variance:
$$s^2 = \frac{(10-16)^2 + (12-16)^2 + (14-16)^2 + (15-16)^2 + (17-16)^2 + 2(18-16)^2 + (24-16)^2}{8-1} = \frac{130}{7} = 18.57$$
Standard deviation: $s = \sqrt{18.57} = 4.31$
A measure of how far on average each data value is
from the mean of the sample.
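The worked example can be reproduced in Python. This sketch assumes the eight observations are 10, 12, 14, 15, 17, 18, 18, 24 with mean 16, a reading of the slide that is consistent with its sum of squared deviations (130) and n − 1 = 7.

```python
from statistics import stdev, variance

# Assumed data for the worked example (see the note above).
data = [10, 12, 14, 15, 17, 18, 18, 24]

n = len(data)
x_bar = sum(data) / n                                # 16.0
sum_sq_dev = sum((x - x_bar) ** 2 for x in data)     # 130.0

s2 = sum_sq_dev / (n - 1)   # sample variance divides by n - 1
s = s2 ** 0.5               # standard deviation is the square root of the variance

print(round(s2, 2), round(s, 2))                          # 18.57 4.31
print(round(variance(data), 2), round(stdev(data), 2))    # 18.57 4.31
```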
39
Comparing SD
40
Questions
► Scores for the TOEIC tests are presented below for a group of n =
12 students.
Gender   Score
Male     500
Male     450
Male     600
Male     520
Male     540
Male     390
Female   600
Female   700
Female   720
Female   300
Female   410
Female   270
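The question itself is not stated on the slide. Assuming it asks for a comparison of the two groups, one possible sketch computes the sample mean and sample standard deviation of each group from the scores above.

```python
from statistics import mean, stdev

# TOEIC scores from the table above, split by group.
scores = {
    "Male": [500, 450, 600, 520, 540, 390],
    "Female": [600, 700, 720, 300, 410, 270],
}

# (mean, sample standard deviation) for each group.
summary = {group: (mean(vals), stdev(vals)) for group, vals in scores.items()}

for group, (avg, sd) in summary.items():
    print(f"{group}: mean = {avg}, SD = {sd:.2f}")
```

Both groups have the same mean (500), so the center alone cannot distinguish them; the standard deviation shows that the female scores are far more spread out.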
41
Excel and STATA
► Excel Formulas
► Mean: =AVERAGE(Data Values)
► Weighted Mean: =SUMPRODUCT(Values, Weights)/SUM(Weights)
► Median: =MEDIAN(Data Values)
► Mode: =MODE(Data Values)
► Variance: =VAR.S(Data Values)
► SD: =STDEV.S(Data Values)
42
Excel
► Excel Tools: Data/Data Analysis/Descriptive Statistics
43
Excel
44
Megstat
45
Grouped Data
46
Grouped Data
► Suppose data are grouped into K classes, with frequencies f1, f2,
. . . fK, and the midpoints of the classes are m1, m2, . . ., mK.
► For a sample of n observations, the mean is
$$\bar{x} = \frac{\sum_{i=1}^{K} f_i m_i}{n}$$
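A minimal Python sketch of the grouped-data mean, computed in the usual way as the frequency-weighted average of the class midpoints, sum(f_i * m_i) / n. The midpoints and frequencies below are hypothetical; the slide's own table is not reproduced in this text.

```python
def grouped_mean(midpoints, frequencies):
    """Approximate mean for grouped data: sum(f_i * m_i) / n, with n = sum(f_i)."""
    if len(midpoints) != len(frequencies):
        raise ValueError("midpoints and frequencies must have the same length")
    n = sum(frequencies)
    if n == 0:
        raise ValueError("total frequency must be positive")
    return sum(f * m for f, m in zip(frequencies, midpoints)) / n

midpoints = [5, 15, 25, 35]   # hypothetical class midpoints m_1..m_K
frequencies = [2, 5, 8, 5]    # hypothetical class frequencies f_1..f_K (n = 20)

print(grouped_mean(midpoints, frequencies))  # 23.0
```

Because every observation in a class is represented by the class midpoint, the result approximates the mean of the raw data.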
47
Grouped Data
48
Grouped Data
50
Using the Mean and Standard
Deviation Together
52
Coefficient of Variation
53
Coefficient of Variation
$$CV = \frac{s}{\bar{x}} (100)$$
54
Coefficient of Variation example
►Stock market
Mean   100                    60
SD     20                     15
CV     20/100 × 100 = 20%     15/60 × 100 = 25%
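The two coefficients of variation above can be reproduced with a short sketch. The function name is illustrative; the inputs are the means and standard deviations from the example.

```python
def coefficient_of_variation(sd, mean):
    """Coefficient of variation in percent: (sd / mean) * 100."""
    if mean == 0:
        raise ValueError("CV is undefined when the mean is zero")
    return 100 * sd / mean

cv_first = coefficient_of_variation(sd=20, mean=100)
cv_second = coefficient_of_variation(sd=15, mean=60)

print(cv_first, cv_second)  # 20.0 25.0
```

The second stock has the smaller standard deviation but the larger CV, so it is more variable relative to its mean.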
55
Z-score
►The z-score is
►Zero for values equal to the mean
►Positive for values above the mean
►Negative for values below the mean
56
Z-score formula
►Sample z-score
$$z = \frac{x - \bar{x}}{s}$$
x = the data value of interest
s = the sample standard deviation
$\bar{x}$ = the sample mean
►Population z-score
$$z = \frac{x - \mu}{\sigma}$$
σ = the population standard deviation
μ = the population mean
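A minimal sketch of the sample z-score, (x − x̄)/s, applied to every value in a sample. The data are illustrative (ten small integers with mean 4.4), and `z_scores` is a hypothetical helper name.

```python
from statistics import mean, stdev

def z_scores(values):
    """Return the sample z-score (x - x_bar) / s for each value."""
    x_bar = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 in the denominator)
    if s == 0:
        raise ValueError("z-scores are undefined when all values are equal")
    return [(x - x_bar) / s for x in values]

values = [4, 3, 8, 10, 0, 2, 3, 7, 5, 2]  # illustrative sample, mean = 4.4
z = z_scores(values)

for x, score in zip(values, z):
    print(f"x = {x:2d}  z = {score:+.2f}")
```

Values above the mean get positive z-scores and values below it get negative ones, and the z-scores of a sample always sum to zero.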
57
Unusual observations
58
Z-score example
►Price for a glass of milk tea (size L) in Vietnam
59
Z-score example
►A price for a glass of milk tea (size L) in Vietnam
60
Empirical rule
[Figure: distribution centered at μ with the intervals μ ± 1σ, μ ± 2σ, and μ ± 3σ marked]
About 95% of the values in the population fall within ±2 standard
deviations of the mean.
61
Chebychev’s Theorem
► For any population with mean μ and standard deviation σ, and k >
1, the percentage of observations that fall within the interval
$$[\mu \pm k\sigma]$$
► is at least
$$\left(1 - \frac{1}{k^2}\right) \times 100\%$$
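The bound can be tabulated for a few values of k with a short sketch. The function name is illustrative.

```python
def chebyshev_lower_bound(k):
    """Minimum percentage of observations within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the theorem requires k > 1")
    return (1 - 1 / k**2) * 100

for k in (1.5, 2, 3):
    print(f"k = {k}: at least {chebyshev_lower_bound(k):.1f}%")
# k = 1.5: at least 55.6%
# k = 2: at least 75.0%
# k = 3: at least 88.9%
```

Because the theorem holds for any distribution, the guarantee for k = 2 (at least 75%) is weaker than the roughly 95% that the empirical rule gives for bell-shaped data.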
62
Relative Position
Percentiles, Quartiles, and Box Plots
63
Measures of relative position
► Measures:
► Percentiles
► Quartiles
► Interquartile range
64
Percentiles
65
Percentiles
66
Percentiles
$$L_p = (n + 1) \frac{p}{100}$$
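A sketch of the location rule L_p = (n + 1)p/100. When the location is not a whole number, this version interpolates linearly between the two neighboring ordered values; that interpolation step is an assumption, since the slide's worked example is not reproduced in this text. The data set is illustrative.

```python
def percentile_location(n, p):
    """Location of the p-th percentile in the ordered data: (n + 1) * p / 100."""
    return (n + 1) * p / 100

def percentile(sorted_values, p):
    """p-th percentile via the (n + 1) rule with linear interpolation (assumed)."""
    n = len(sorted_values)
    # Keep the location inside the range of valid positions 1..n.
    loc = min(max(percentile_location(n, p), 1), n)
    lower = int(loc)      # position of the ordered value at or below the location
    frac = loc - lower    # fractional distance toward the next ordered value
    if lower >= n:
        return sorted_values[-1]
    below, above = sorted_values[lower - 1], sorted_values[lower]
    return below + frac * (above - below)

data = [0, 3, 4, 7, 9, 12, 13, 17, 25]  # illustrative ordered sample, n = 9

print(percentile_location(len(data), 50), percentile(data, 50))  # 5.0 9.0
print(percentile_location(len(data), 25), percentile(data, 25))  # 2.5 3.5
```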
67
Percentiles
= 6082
68
Quartiles
► Quartiles are scale points that divide the sorted data into four
groups of approximately equal size.
► The first quartile, Q1, is the value for which 25% of the observations are
smaller and 75% are larger.
► Q2 is the same as the median (50% are smaller, 50% are larger)
► Only 25% of the observations are greater than the third quartile.
Questions?
► What is the median of the data values below Q2?
► What is the median of the data values above Q3?
Interquartile
Outliers
Outliers
►Formulas for the upper and lower limits of outliers
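The formulas themselves are not reproduced in this text. A commonly used rule, assumed here, flags values below Q1 − 1.5 × IQR or above Q3 + 1.5 × IQR, where IQR = Q3 − Q1. The sketch below uses the standard library's `quantiles` with the "exclusive" method, which places quartiles at (n + 1)p/100 positions; the data set is illustrative.

```python
from statistics import quantiles

def iqr_limits(values, multiplier=1.5):
    """Lower and upper outlier limits from the assumed 1.5 * IQR rule."""
    q1, _q2, q3 = quantiles(values, n=4, method="exclusive")
    iqr = q3 - q1
    return q1 - multiplier * iqr, q3 + multiplier * iqr

data = [0, 3, 4, 7, 9, 12, 13, 17, 25, 4000]  # illustrative sample with one extreme value

lower, upper = iqr_limits(data)
outliers = [x for x in data if x < lower or x > upper]

print(lower, upper, outliers)  # -19.125 41.875 [4000]
```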
73
Examples
74
Association Between Two Variables
(Covariance , Correlation)
75
Covariance
► The covariance measures the direction of the linear relationship between two
variables.
► The sample covariance:
$$\mathrm{Cov}(x, y) = s_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$
► Only concerned with the direction of the relationship (positive, negative, no
relationship)
► No causal effect is implied
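A minimal sketch of the sample covariance, the sum of (x_i − x̄)(y_i − ȳ) divided by n − 1, showing how its sign indicates the direction of the relationship. The function name and data are hypothetical.

```python
def sample_covariance(x, y):
    """Sample covariance: sum((x_i - x_bar) * (y_i - y_bar)) / (n - 1)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    if n < 2:
        raise ValueError("need at least two paired observations")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    return sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)

x = [1, 2, 3, 4, 5]            # hypothetical data
y_rising = [2, 4, 5, 4, 5]     # tends to increase with x
y_falling = [10, 8, 6, 4, 2]   # decreases as x increases

print(sample_covariance(x, y_rising))   # 1.5  (positive relationship)
print(sample_covariance(x, y_falling))  # -5.0 (negative relationship)
```

The magnitudes 1.5 and −5.0 depend on the units of x and y, which is why covariance is read for its sign rather than as a measure of strength.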
76
Covariance
77
Correlation Coefficient
► The sample correlation coefficient, $r_{xy}$, measures both the strength and
direction of the linear relationship between two variables.
► Formula for sample correlation coefficient:
$$r_{xy} = \frac{\mathrm{Cov}(x, y)}{s_x s_y}$$
where $s_x$ and $s_y$ are the standard deviations of the x and y variables.
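A sketch of the sample correlation coefficient as the sample covariance divided by the product of the two sample standard deviations. The function name and data are hypothetical.

```python
from math import sqrt

def sample_correlation(x, y):
    """Sample correlation: Cov(x, y) / (s_x * s_y)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    n = len(x)
    if n < 2:
        raise ValueError("need at least two paired observations")
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    cov = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
    s_x = sqrt(sum((xi - x_bar) ** 2 for xi in x) / (n - 1))
    s_y = sqrt(sum((yi - y_bar) ** 2 for yi in y) / (n - 1))
    if s_x == 0 or s_y == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return cov / (s_x * s_y)

x = [1, 2, 3, 4, 5]        # hypothetical data
y = [2, 4, 5, 4, 5]        # loosely increasing with x
y_exact = [2, 4, 6, 8, 10] # exactly y = 2x

print(round(sample_correlation(x, y), 4))        # 0.7746
print(round(sample_correlation(x, y_exact), 4))  # 1.0
```

Dividing by the standard deviations removes the units, so the result always lies between −1 and +1.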
78
Correlation Coefficients
► Unit free
79
Correlation Coefficients
80
Excel
► Covariance for the sample:
81
Exercise
82