
Session 3: Descriptive Statistics

Statistics for Business


Dr. Le Anh Tuan

Numerical Description
Three key characteristics of numerical data:
► Center: the extent to which the values of a numerical variable
group around a typical, or central, value.
► Where are the data values concentrated?
► What seem to be typical or middle data values? Is there
central tendency?
► Variability: the amount of dispersion, or scattering, away from a
central value that the values of a numerical variable show
► How much dispersion is there in the data?
► How spread out are the data values?
► Are there unusual values?
► Shape: the pattern of the distribution of values from the lowest
value to the highest value.
► Are the data values distributed symmetrically? Skewed?
Sharply peaked? Flat? Bimodal?
Measures of Central Tendency

Measures of Central Tendency
► Central tendency is a single value used to describe the center
point of a data set.
► Measures of central tendency: Mean, Weighted Mean, Median, Mode

Mean
► The mean, or average, is the most common measure of
central tendency.

► In statistics, we generally use the term mean instead of
average, and the mean has a specific formula.

► Calculate the mean by adding all the values in a data set and
then dividing the result by the number of observations.

Mean
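► For the population:

$$\mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{x_1 + x_2 + \cdots + x_N}{N}$$

where $x_1, \dots, x_N$ are the population values and N is the population size.

► For the sample:

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{x_1 + x_2 + \cdots + x_n}{n}$$

where $x_1, \dots, x_n$ are the observed values and n is the sample size.
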
Mean
► Suppose a sample of size n = 10 gives the following values:
4 3 8 10 0 2 3 7 5 2

► The sample mean is

$$\bar{x} = \frac{4 + 3 + 8 + 10 + 0 + 2 + 3 + 7 + 5 + 2}{10} = \frac{44}{10} = 4.4$$
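
The calculation above can be reproduced with a few lines of Python. This is a minimal sketch; the function name `sample_mean` is an illustrative choice, not part of the course material.

```python
def sample_mean(values):
    """Return the arithmetic mean: the sum of the values divided by their count."""
    if not values:
        raise ValueError("the mean of an empty data set is undefined")
    return sum(values) / len(values)


# Sample from the slide (n = 10)
sample = [4, 3, 8, 10, 0, 2, 3, 7, 5, 2]
print(sample_mean(sample))  # 44 / 10 = 4.4
```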

Mean
► Advantages:
► Easy to calculate
► Summarizes the data with a single value

► Disadvantages:
► Affected by outliers (values that are much higher or
lower than most of the data)

Mean
► Affected by outliers (values that are much higher or lower than
most of the data)

Values 1, 2, 3, 4, 5: $\bar{x} = \frac{1 + 2 + 3 + 4 + 5}{5} = \frac{15}{5} = 3$
Values 1, 2, 3, 4, 10: $\bar{x} = \frac{1 + 2 + 3 + 4 + 10}{5} = \frac{20}{5} = 4$

Weighted Mean
► A weighted mean allows you to assign more weight to
certain values and less weight to others.

► Formula for the weighted mean:

$$\bar{x} = \frac{\sum_{i=1}^{n} w_i x_i}{\sum_{i=1}^{n} w_i}$$

► $x_i$ = the ith data value
► $w_i$ = the weight for data value $x_i$
► $\sum_{i=1}^{n} w_i$ = the sum of all the weights
Weighted Mean
► A GPA is a weighted average that assigns greater weight to
courses with more credits. Grades typically range from 0 to 4.
For example, if A earned a grade of 4.0 in Science (worth 4
credits), 3 in English (worth 3 credits), and 3.5 in Physics
(worth 2 credits), what is A's GPA?
► Solution:
► Multiply each grade by its weight, add the products,
and then divide by the sum of the weights.

$$\bar{x} = \frac{4 \times 4 + 3 \times 3 + 3.5 \times 2}{4 + 3 + 2} = \frac{32}{9} \approx 3.56$$

► A's grade point average is approximately 3.56.
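
The GPA calculation can be checked with a short Python sketch. The helper name `weighted_mean` and the variable names are illustrative assumptions; the grades and credits are the ones from the example.

```python
def weighted_mean(values, weights):
    """Return sum(w_i * x_i) / sum(w_i)."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("the weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


grades = [4.0, 3.0, 3.5]   # Science, English, Physics
credit_hours = [4, 3, 2]   # weights: credits per course
gpa = weighted_mean(grades, credit_hours)
print(round(gpa, 2))  # 32 / 9, approximately 3.56
```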

Questions
► You invest in the stock market, where returns depend heavily on
market conditions. The details are as follows:

Stock returns
Market condition   Likelihood   Return
Up                 45%          20%
Neutral            20%          14%
Down               35%          -5%

1. Calculate the expected return for this investment.
2. If you invest US$1,000, calculate the expected monetary value.

The Median

Median
► The median is the value in the data set for which half of the
observations are higher and half of the observations are
lower.

► In other words, the median (M) is the 50th percentile, or
midpoint, of the ordered sample data.

Median
► Not affected by outliers

[Figure: two dot plots on a 0 to 10 axis; Median = 3 in both]

Median
► The location of the median:

$$\text{Median position} = \frac{n + 1}{2} \text{ position in the ordered data}$$

where n is the number of observations.

► If the number of observations is odd, the median
is the middle number.
► If the number of observations is even, the median
is the average of the two middle numbers.

► Note that (n + 1)/2 is not the value of the median, only the
position of the median in the ranked data.

Median
► Suppose an ordered sample of size n = 9 gives the following
values:
0 3 4 7 9 12 13 17 25

► The median position is

$$\frac{9 + 1}{2} = 5$$

► The median value is 9 (the 5th ordered value).

Median
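► Suppose an ordered sample of size n = 10 gives the following
values:
0 3 4 7 9 12 13 17 25 4000

► The median position is

$$\frac{10 + 1}{2} = 5.5$$

► The median value is the average of the 5th and 6th ordered values:

$$\frac{9 + 12}{2} = 10.5$$

► The median is not affected by the outlier (4000).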
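
The position rule and both examples can be sketched in Python as follows. The function names `median_position` and `median` are illustrative; the data are the two samples shown on the slides.

```python
def median_position(n):
    """Position of the median in the ordered data: (n + 1) / 2."""
    return (n + 1) / 2


def median(values):
    """Middle value for odd n; average of the two middle values for even n."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("the median of an empty data set is undefined")
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_sample = [0, 3, 4, 7, 9, 12, 13, 17, 25]
even_sample = [0, 3, 4, 7, 9, 12, 13, 17, 25, 4000]

print(median_position(len(odd_sample)), median(odd_sample))    # 5.0 9
print(median_position(len(even_sample)), median(even_sample))  # 5.5 10.5
```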

The Mode

Mode

►The mode is the value that appears most often in a data
set.
►If no data value or category occurs more than once,
we say that the mode does not exist.

►More than one mode can exist if two or more values
tie for the most frequent.

►The mode is most useful for discrete or categorical data
with only a few distinct data values. For continuous data or
data with a wide range, the mode is rarely useful.
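
The three cases above (no mode, one mode, several modes) can be illustrated with a small Python sketch. The function name `modes` and the sample lists are hypothetical and chosen only for illustration.

```python
from collections import Counter


def modes(values):
    """Return all most-frequent values, or [] when no value occurs more than once."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        return []  # no value repeats, so the mode does not exist
    return sorted(v for v, c in counts.items() if c == highest)


# Hypothetical data sets
print(modes([1, 2, 3, 4, 5]))     # []      (no mode)
print(modes([2, 5, 9, 9, 12]))    # [9]     (one mode)
print(modes([1, 1, 2, 2, 3]))     # [1, 2]  (two modes)
```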

Mode

[Figure: two dot plots; the first (axis 0 to 6) has no mode, the second (axis 0 to 14) has Mode = 9]

Shape of a Distribution
►Describes how data are distributed
►Measures of shape
►Symmetric
►Skewed (Left or Right)
Left-Skewed: Mean < Median
Symmetric: Mean = Median
Right-Skewed: Median < Mean

Skewness

$$\text{Skewness} = \frac{n}{(n - 1)(n - 2)} \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{s} \right)^3$$

Skewness < 0: left-skewed; Skewness = 0: symmetric; Skewness > 0: right-skewed
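
As a rough illustration, the sample skewness can be computed in Python as below. The sketch assumes the adjusted formula n / ((n - 1)(n - 2)) times the sum of cubed standardized deviations, with s the sample standard deviation; the function name and the test data are hypothetical.

```python
import math


def sample_skewness(values):
    """Adjusted sample skewness: n / ((n-1)(n-2)) * sum(((x - mean) / s) ** 3)."""
    n = len(values)
    if n < 3:
        raise ValueError("at least three observations are required")
    mean = sum(values) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    if s == 0:
        raise ValueError("skewness is undefined when all values are equal")
    return n / ((n - 1) * (n - 2)) * sum(((x - mean) / s) ** 3 for x in values)


print(sample_skewness([1, 2, 3, 4, 5]))    # symmetric data: 0
print(sample_skewness([1, 2, 3, 4, 10]))   # long right tail: positive
print(sample_skewness([1, 7, 8, 9, 10]))   # long left tail: negative
```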

Questions
►The number of students in 10 classes was recorded. The
stem-and-leaf diagram, with the tens digit as the stem, shows
this information:

1. What percentage of classes have more than 20 students?
2. How many classes have fewer than 5 students?
3. Find the mean, median, and mode.
4. Determine the shape of the distribution.
5. A class of 40 students is added to the sample. The teacher
says the median will remain the same. Is the teacher correct?
Please explain your answer.

Questions
►The dot plot provides information on the number of cities
each student in a class has visited.

1. How many observations are there in the sample?
2. How many students have visited more than 10 cities?
3. Find the mean, median, and mode.
4. Determine the shape of the distribution.

Measures of Variability

Measures of Variability
►Measures of variation: Range, Variance, Standard deviation

►Measures of variation give information on the spread or
variability of the data values.
►Variability provides a quantitative measure of the differences
between scores in a distribution and describes the degree to which
the scores are spread out or clustered together.

[Figure: distributions with the same center but different variation]

Range

►Simplest measure of variation

►Difference between the largest and the smallest
observations.
Range = Highest Value – Lowest Value

►Example (dot plot on a 0 to 14 axis): Range = 14 - 1 = 13

Range
►Advantages
►Easy to calculate and understand
►Disadvantages
►Based on only two numbers in the data set; ignores
the way in which the data are distributed

►Sensitive to outliers

Variance
►The variance is the average of the squared differences
between each data value and the mean.
►Sample variance is denoted by $s^2$:

$$s^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{x})^2}{n - 1}$$

►where $\bar{x}$ = sample mean
►n = sample size
►$X_i$ = ith value of the variable X

Variance

►Population variance is denoted by $\sigma^2$:

$$\sigma^2 = \frac{\sum_{i=1}^{N} (X_i - \mu)^2}{N}$$

►where $\mu$ = population mean
►N = population size
►$X_i$ = ith value of the variable X

Variance

►A sample includes 8 observations:
10 12 14 15 17 18 18 24
n = 8, mean $\bar{x} = 16$

►Variance:

$$s^2 = \frac{(10-16)^2 + (12-16)^2 + (14-16)^2 + (15-16)^2 + (17-16)^2 + 2(18-16)^2 + (24-16)^2}{8 - 1} = \frac{130}{7} = 18.57$$

Standard Deviations
►Standard deviation is the square root of the variance.

►It has the same units as the original data, making it more easily
interpreted than the variance.

►Sample standard deviation:

$$s = \sqrt{s^2} = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{x})^2}{n - 1}}$$

Standard deviation

►A sample includes 8 observations:
10 12 14 15 17 18 18 24
n = 8, mean $\bar{x} = 16$

►Sample variance:

$$s^2 = \frac{(10-16)^2 + (12-16)^2 + (14-16)^2 + (15-16)^2 + (17-16)^2 + 2(18-16)^2 + (24-16)^2}{8 - 1} = \frac{130}{7} = 18.57$$

►Standard deviation: $s = \sqrt{18.57} = 4.31$
►The standard deviation is a measure of how far, on average, each data value is
from the mean of the sample.
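
The worked example can be verified with a short Python sketch that uses the n - 1 denominator for the sample variance. The function names are illustrative choices.

```python
import math


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    if n < 2:
        raise ValueError("at least two observations are required")
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def sample_std(values):
    """Square root of the sample variance."""
    return math.sqrt(sample_variance(values))


data = [10, 12, 14, 15, 17, 18, 18, 24]
print(round(sample_variance(data), 2))  # 130 / 7, approximately 18.57
print(round(sample_std(data), 2))       # approximately 4.31
```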
Comparing SD

Questions
► Scores for the TOEIC test are presented below for a group of n =
12 students.

Gender   Score
Male     500
Male     450
Male     600
Male     520
Male     540
Male     390
Female   600
Female   700
Female   720
Female   300
Female   410
Female   270

1. Compute the variance for the entire group.
2. Which group (male or female) has the higher dispersion in scores?

Excel and STATA
► Excel Formulas
► Mean: =AVERAGE(Data Values)
► Weighted Mean: =SUMPRODUCT(Data Values, Weights)/SUM(Weights)
► Median: =MEDIAN(Data Values)
► Mode: =MODE(Data Values)
► Variance: =VAR.S(Data Values)
► SD: =STDEV.S(Data Values)

Excel
► Excel Tools: Data/Data Analysis/Descriptive Statistics

Excel

Megstat

Grouped Data

Grouped Data

► Suppose data are grouped into K classes, with frequencies f1, f2,
. . ., fK, and the midpoints of the classes are m1, m2, . . ., mK.
► For a sample of n observations, the mean is

$$\bar{x} = \frac{\sum_{i=1}^{K} f_i m_i}{n}$$

n: the total number of observations
K: the number of classes

► This mean is only an approximate value since the midpoint is just
an estimate of the value in each class.

Grouped Data

► The following table gives the frequency distribution of the
number of orders received each morning during the past 30
mornings at a coffee store. Calculate the mean.

Number of orders   Frequency (f)
0 - 4              4
5 - 9              5
10 - 14            10
15 - 19            7
20 - 24            4
Total              n = 30
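
One way to carry out this calculation is sketched below in Python. It assumes each class is represented by its midpoint (lower limit plus upper limit, divided by 2) and weights the midpoints by the class frequencies; the function and variable names are illustrative.

```python
def grouped_mean(class_limits, frequencies):
    """Approximate mean of grouped data: sum(f_i * m_i) / n, m_i = class midpoint."""
    if len(class_limits) != len(frequencies):
        raise ValueError("one frequency is required per class")
    midpoints = [(low + high) / 2 for low, high in class_limits]
    n = sum(frequencies)
    return sum(f * m for f, m in zip(frequencies, midpoints)) / n


# Coffee store table: classes of "number of orders" and their frequencies
classes = [(0, 4), (5, 9), (10, 14), (15, 19), (20, 24)]
freqs = [4, 5, 10, 7, 4]

# Midpoints are 2, 7, 12, 17, 22, so sum(f * m) = 370 and n = 30
print(round(grouped_mean(classes, freqs), 2))  # 370 / 30, approximately 12.33
```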

Grouped Data

► For a sample of n observations, the variance is

$$s^2 = \frac{\sum_{i=1}^{K} f_i (m_i - \bar{x})^2}{n - 1}$$

Using the Mean and Standard
Deviation Together

Coefficient of Variation

►The standard deviation is affected by the scale of the data.
►When sample means are very different, comparing
SDs can be misleading.

►The coefficient of variation, CV, expresses the SD as a
percentage of the mean.
►A high CV indicates high variability relative to the
size of the mean.
►A low CV indicates low variability relative to the size
of the mean.

►A smaller coefficient of variation indicates more
consistency within a set of data values.

Coefficient of Variation

►Sample coefficient of variation

$$CV = \frac{s}{\bar{x}} (100)$$

s = the sample standard deviation
$\bar{x}$ = the sample mean

►Population coefficient of variation

$$CV = \frac{\sigma}{\mu} (100)$$

$\sigma$ = the population standard deviation
$\mu$ = the population mean

Coefficient of Variation example
►Stock market

        Price for Stock A    Price for Stock B
Mean    100                  60
SD      20                   15
CV      = 20/100 × 100       = 15/60 × 100
        = 20%                = 25%

►Although Stock A has the larger standard deviation, its price is more
consistent relative to its mean (lower CV).
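
The comparison of the two stocks can be reproduced with the following Python sketch. The function name `coefficient_of_variation` is an illustrative choice; the means and standard deviations are those in the table.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation expressed as a percentage of the mean."""
    if mean == 0:
        raise ValueError("the CV is undefined when the mean is zero")
    return 100 * sd / mean


cv_a = coefficient_of_variation(sd=20, mean=100)
cv_b = coefficient_of_variation(sd=15, mean=60)
print(cv_a, cv_b)  # 20.0 25.0

# The stock with the smaller CV is the more consistent one relative to its mean
more_consistent = "A" if cv_a < cv_b else "B"
print(more_consistent)  # A
```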

Z-score

►A z-score identifies the number of standard deviations a
particular value is from the mean of its distribution.

►A z-score has no unit.

►A z-score is
►Zero for values equal to the mean
►Positive for values above the mean
►Negative for values below the mean

Z-score formula
►Sample z-score

$$z = \frac{x - \bar{x}}{s}$$

x = the data value of interest
s = the sample standard deviation
$\bar{x}$ = the sample mean

►Population z-score

$$z = \frac{x - \mu}{\sigma}$$

$\sigma$ = the population standard deviation
$\mu$ = the population mean

Unusual observations

►Based on its standardized z-score, a data value is classified
as:
►Unusual if |z| > 2 (beyond μ ± 2σ)
►Outlier if |z| > 3 (beyond μ ± 3σ)

Z-score example
►Prices for a glass of milk tea (size L) in Vietnam

15K 28K 43K 50K 70K 90K

►Average price: Mean = 49.33
►SD = 27.40
►How far is the price of KOI THÉ (90K) from the sample
mean of 49.33 (in SD increments)?

Z-score example
►Prices for a glass of milk tea (size L) in Vietnam

15K 28K 43K 50K 70K 90K

►$z_{KOI} = \frac{90 - 49.33}{27.40} = 1.48$
►The price of KOI is more than one standard deviation (1.48)
above the sample mean.
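
The z-score above can be recomputed from the raw prices with Python's standard `statistics` module, as in the sketch below. `statistics.stdev` uses the sample (n - 1) formula; the function name `z_score` and the classification helper are illustrative.

```python
import statistics


def z_score(x, mean, sd):
    """Number of standard deviations that x lies from the mean."""
    return (x - mean) / sd


def classify(z):
    """Apply the |z| > 2 (unusual) and |z| > 3 (outlier) rules."""
    if abs(z) > 3:
        return "outlier"
    if abs(z) > 2:
        return "unusual"
    return "typical"


prices = [15, 28, 43, 50, 70, 90]   # in thousands (K)
mean = statistics.mean(prices)      # about 49.33
sd = statistics.stdev(prices)       # sample SD, about 27.4
z_koi = z_score(90, mean, sd)

print(round(z_koi, 2))   # 1.48
print(classify(z_koi))   # typical
```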

Empirical rule

►If the data distribution is a bell-shaped, symmetrical curve
centered around the mean, we would expect:
►About 68% of the values in the population to fall within ±1
standard deviation of the mean (μ ± 1σ)
►About 95% of the values in the population to fall within ±2
standard deviations of the mean (μ ± 2σ)
►About 99.7% of the values in the population to fall within ±3
standard deviations of the mean (μ ± 3σ)

Chebychev’s Theorem
► For any population with mean μ and standard deviation σ, and k >
1, the percentage of observations that fall within the interval

$$[\mu \pm k\sigma]$$

► is at least

$$\left( 1 - \frac{1}{k^2} \right) \times 100\%$$

► Regardless of how the data are distributed, at least (1 - 1/k²) of the
values will fall within k standard deviations of the mean (for k > 1).
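
The bound can be tabulated and checked against a heavily skewed data set with the Python sketch below. The function names are illustrative, and the data set (treated here as a complete population, so `statistics.pstdev` is used) is reused from the median example purely as an assumption for demonstration.

```python
import statistics


def chebyshev_min_percent(k):
    """Minimum percentage of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("the bound is only informative for k > 1")
    return (1 - 1 / k**2) * 100


def percent_within(values, k):
    """Observed percentage of values within k population SDs of the mean."""
    mu = statistics.fmean(values)
    sigma = statistics.pstdev(values)
    inside = sum(1 for x in values if abs(x - mu) <= k * sigma)
    return inside / len(values) * 100


print(chebyshev_min_percent(2))            # 75.0
print(round(chebyshev_min_percent(3), 1))  # 88.9

# Skewed data with one extreme value: the bound still holds
data = [0, 3, 4, 7, 9, 12, 13, 17, 25, 4000]
for k in (2, 3):
    assert percent_within(data, k) >= chebyshev_min_percent(k)
```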

Relative Position
Percentiles, Quartiles, and Box Plots

Measures of relative position

► Measures of relative position compare the position of one
value in relation to other values in the data set.

► Measures:
► Percentiles
► Quartiles
► Interquartile range

Percentiles

► Percentiles divide the ordered data into 100 equal
groups.

► For example, suppose your EQ score is in the 80th percentile on a
standardized test. That means that 80% of the test-takers
scored below you.

► Generally, the pth percentile of a data set (where p is any
number between 1 and 100) is the value that at least p
percent of the observations will fall below.

Percentiles

Percentiles

► Arrange the data in ascending order.

► Compute $L_p$, the location of the pth percentile:

$$L_p = \frac{p}{100} (n + 1)$$

Percentiles

► Example: The 80th percentile for the starting salary data

► Step 1. Arrange the data in ascending order.
5710, 5755, 5850, 5880, 5880, 5890, 5920, 5940, 5950, 6050,
6130, 6325.

► Step 2. $L_{80} = \frac{p}{100}(n + 1) = \left( \frac{80}{100} \right)(12 + 1) = 10.4$

► Step 3. The 10th value is 6050 and the 11th is 6130, so
80th percentile = 6050 + 0.4×(6130–6050) = 6082
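
The three steps can be sketched in Python as follows. The function name `percentile` is illustrative, and clamping to the smallest or largest value when the location falls outside 1..n is an assumption of this sketch, not something stated on the slides.

```python
def percentile(values, p):
    """pth percentile using the location L_p = (p / 100) * (n + 1)."""
    ordered = sorted(values)                 # Step 1: ascending order
    n = len(ordered)
    location = p * (n + 1) / 100             # Step 2: location L_p
    whole = int(location)                    # integer part of the location
    fraction = location - whole              # fractional part of the location
    if whole < 1:                            # assumption: clamp below the first value
        return ordered[0]
    if whole >= n:                           # assumption: clamp above the last value
        return ordered[-1]
    lower, upper = ordered[whole - 1], ordered[whole]
    return lower + fraction * (upper - lower)  # Step 3: interpolate


salaries = [5710, 5755, 5850, 5880, 5880, 5890,
            5920, 5940, 5950, 6050, 6130, 6325]
print(round(percentile(salaries, 80), 2))  # 6082.0
```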

Quartiles
► Quartiles are scale points that divide the sorted data into four
groups of approximately equal size.

► The first quartile, Q1, is the value for which 25% of the observations are
smaller and 75% are larger.
► Q2 is the same as the median (50% are smaller, 50% are larger)
► Only 25% of the observations are greater than the third quartile.
Questions?
► What is the median of the data values below Q2?
► What is the median of the data values above Q3?
Interquartile Range

► The interquartile range describes the middle 50% of the data.

► It is not affected by extreme high- and low-valued observations (outliers).

► Interquartile range = 3rd quartile – 1st quartile
IQR = Q3 – Q1
Box-and-Whisker plot
► A box plot (also called a box-and-whisker plot) is a graphical
display showing the relative position of the five-number
summary:
Min , Q1 , Q2 , Q3 , Max
► It also shows outliers, if any.

Outliers
►Formulas for the upper and lower limits of outliers

►Upper Limit = Q3 + 1.5 × IQR
►Lower Limit = Q1 - 1.5 × IQR

►Values beyond these limits are considered outliers
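
A minimal Python sketch of these limits is shown below. It relies on `statistics.quantiles` from the standard library, whose default "exclusive" method is, to my understanding, based on the same (n + 1) location rule used for percentiles in this session; the function name and the reuse of the median example data are assumptions for illustration.

```python
import statistics


def outlier_limits(values):
    """Return (lower limit, upper limit) = (Q1 - 1.5*IQR, Q3 + 1.5*IQR)."""
    q1, _q2, q3 = statistics.quantiles(values, n=4)  # quartiles Q1, Q2, Q3
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


data = [0, 3, 4, 7, 9, 12, 13, 17, 25, 4000]
lower, upper = outlier_limits(data)
outliers = [x for x in data if x < lower or x > upper]

print(lower, upper)  # -19.125 41.875
print(outliers)      # [4000]
```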

Examples

Association Between Two Variables
(Covariance , Correlation)

Covariance
► The covariance measures the direction of the linear relationship between two
variables.
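► The population covariance:

$$Cov(x, y) = \sigma_{xy} = \frac{\sum_{i=1}^{N} (x_i - \mu_x)(y_i - \mu_y)}{N}$$

► The sample covariance:

$$Cov(x, y) = s_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{n - 1}$$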
► Only concerned with the direction of the relationship (positive, negative, no
relationship)
► No causal effect is implied

Covariance

► Covariance between two variables:

► Cov(x,y) > 0: x and y tend to move in the same direction

► Cov(x,y) < 0: x and y tend to move in opposite directions

► Cov(x,y) = 0: x and y have no linear relationship (this alone does not imply independence)

Correlation Coefficient
► The sample correlation coefficient, $r_{xy}$, measures both the strength and
direction of the linear relationship between two variables.

► Formula for population correlation coefficient:

$$\rho = \frac{Cov(x, y)}{\sigma_X \sigma_Y}$$

► Formula for sample correlation coefficient:

$$r = \frac{Cov(x, y)}{s_X s_Y}$$

where $\sigma_X$, $\sigma_Y$ (population) and $s_X$, $s_Y$ (sample) are the standard deviations of the x and y variables.

Correlation Coefficients
► Unit free

► Ranges between –1 and 1

► The closer to –1, the stronger the negative linear relationship

► The closer to 1, the stronger the positive linear relationship

► The closer to 0, the weaker any linear relationship
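
The sample covariance and correlation can be sketched in Python as below, using the n - 1 denominator throughout. The function names and the small data sets are hypothetical and chosen only so the arithmetic is easy to follow by hand.

```python
import math


def sample_covariance(x, y):
    """Sum of (x_i - x_bar)(y_i - y_bar), divided by n - 1."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y need the same length, with at least two pairs")
    x_bar, y_bar = sum(x) / len(x), sum(y) / len(y)
    return sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / (len(x) - 1)


def sample_correlation(x, y):
    """r = Cov(x, y) / (s_x * s_y); Cov(x, x) is the sample variance of x."""
    s_x = math.sqrt(sample_covariance(x, x))
    s_y = math.sqrt(sample_covariance(y, y))
    return sample_covariance(x, y) / (s_x * s_y)


# Hypothetical paired data
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
print(sample_covariance(x, y))             # 1.5
print(round(sample_correlation(x, y), 4))  # 0.7746

# Zero covariance does not mean "unrelated": here y depends exactly on x (y = x^2)
u = [-2, -1, 0, 1, 2]
v = [a**2 for a in u]
print(sample_covariance(u, v))             # 0.0
```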

Correlation Coefficients

Excel
► Covariance for the sample:

=COVARIANCE.S(X DATA VALUES, Y DATA VALUES)

► Correlation for the sample:

=CORREL(X DATA VALUES, Y DATA VALUES)

► Excel Tools: Data/Data Analysis/Covariance
Data/Data Analysis/Correlation

Exercise

► Review Session 3, Online Quiz 3.

► Read Chapter 4: Probability
