Chapter 1
AMA1501/1602
Introduction to Statistics for Business
1.1 Introduction
Statistics very often involves the collection of data. There are many ways to obtain
data, and the World Wide Web is one of them. The advantages and disadvantages of
common data collection methods are discussed below.
1.3.4 Observation
Primary data refers to data that is collected directly from the source for the first
time by the researcher. This data is original and has not been previously collected,
published, or analysed by anyone else. Examples of primary data collection
methods include surveys, interviews, observations, and experiments.
Secondary data, on the other hand, refers to data that has been collected and
published by someone else or for another purpose. This data has already been
analysed and interpreted by others, and the researcher uses it for their own
analysis. Examples of secondary data sources include government statistics,
academic journals, books, and reports.
Both primary and secondary data have their own advantages and disadvantages.
Primary data is more specific to the research question and can be tailored to the
researcher’s needs, but it can be time-consuming and expensive to collect.
Secondary data can be easily accessible and less expensive, but it may not be as
accurate or relevant to the research question as primary data. Therefore,
researchers often use a combination of both types of data to obtain a
comprehensive understanding of the phenomenon under investigation.
A graph is a method of presenting statistical data in visual form. The main purpose of
any chart is to give a quick, easy-to-read-and-interpret pictorial representation of data
which is more difficult to obtain from a table or a complete listing of the data.
Some basic rules for the construction of a statistical chart are listed below:
1 A clear and concise title that provides enough identification of the graph must be
included.
2 Each scale should have a scale caption that indicates the units used.
3 The zero point on the co-ordinate scale should be indicated. In situations where
space is limited, a scale break may be inserted to indicate its omission.
4 All items presented in the graph must be clearly labeled and legible, even when
viewed in black and white reprint.
There are many varieties of graphs. The most commonly used graphs are described
below.
Pie chart
Simple bar chart
Bi-directional bar chart
Multiple bar chart
Component bar chart
Other types of graphs
https://ec.europa.eu/eurostat/web/gisco/gisco-activities/statistical-atlas
AMA1501/1602 Introduction to Statistics for Business Chapter 1 - Descriptive Statistics 26 / 82
1.6 Frequency Distribution
A useful method for summarizing a set of data is the construction of a frequency table,
or a frequency distribution. That is, we divide the overall range of values into a
number of “classes” (or “bins”, “intervals”) and count the number of observations that
fall into each of these classes.
Example 1.1 A traffic inspector has counted the number of automobiles passing a
certain point in 100 successive 20-minute time periods. The observations are listed
below.
23 20 16 18 30 22 26 15 5 18
14 17 11 37 21 6 10 20 22 25
19 19 19 20 12 23 24 17 18 16
27 16 28 26 15 29 19 35 20 17
12 30 21 22 20 15 18 16 23 24
15 24 28 19 24 22 17 19 8 18
17 18 23 21 25 19 20 22 21 21
16 20 19 11 23 17 23 13 17 26
26 14 15 16 27 18 21 24 33 20
21 27 18 22 17 20 14 21 22 19
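As a rough illustration of how such a frequency table could be built, the Python sketch below tallies the Example 1.1 data into classes. The class layout (width 5, first class 5 - 9) is an assumption made for this sketch; the text does not fix the classes at this point, and the names `START`, `WIDTH` and `frequency_table` are hypothetical.

```python
from collections import Counter

# Traffic counts from Example 1.1 (100 successive 20-minute periods).
counts = [
    23, 20, 16, 18, 30, 22, 26, 15,  5, 18,
    14, 17, 11, 37, 21,  6, 10, 20, 22, 25,
    19, 19, 19, 20, 12, 23, 24, 17, 18, 16,
    27, 16, 28, 26, 15, 29, 19, 35, 20, 17,
    12, 30, 21, 22, 20, 15, 18, 16, 23, 24,
    15, 24, 28, 19, 24, 22, 17, 19,  8, 18,
    17, 18, 23, 21, 25, 19, 20, 22, 21, 21,
    16, 20, 19, 11, 23, 17, 23, 13, 17, 26,
    26, 14, 15, 16, 27, 18, 21, 24, 33, 20,
    21, 27, 18, 22, 17, 20, 14, 21, 22, 19,
]

START = 5  # assumed lower limit of the first class (not from the text)
WIDTH = 5  # assumed class width (not from the text)


def frequency_table(data, start, width):
    """Map each class (lower limit, upper limit) to its frequency.

    Assumes integer data with no value below `start`.
    """
    tally = Counter((x - start) // width for x in data)
    n_classes = (max(data) - start) // width + 1
    return {
        (start + k * width, start + (k + 1) * width - 1): tally[k]
        for k in range(n_classes)
    }


table = frequency_table(counts, START, WIDTH)
for (low, high), freq in table.items():
    print(f"{low:2d} - {high:2d}: {freq}")
```

A different choice of class width or starting point would give a different, equally valid table; the total frequency is always the number of observations.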
1.6.1 Histogram
Notes:
The vertical lines of a histogram should be the class boundaries.
The range of the variable should constitute the major portion of the graphs of
frequency distributions. If the smallest observation is far away from zero, then an
“axis break” sign should be introduced in the horizontal axis.
Example 1.2 (a) Construct a histogram for the traffic data in Example 1.1.
The frequency polygon is particularly useful when two or more distributions are to be
presented for comparison on the same graph.
Note:
A frequency curve can be obtained by smoothing the frequency polygon.
Example 1.2 (b) Construct a frequency polygon for the traffic data in Example 1.1.
A cumulative frequency distribution shows the total number of observations that fall
below a given value or range of values in a dataset.
Example 1.3 Construct a cumulative frequency polygon (also called an ogive) of the
distribution of traffic data in Example 1.1.
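The points of a "less than" ogive can be sketched in Python as running totals of the class frequencies, plotted against the upper class boundaries. The frequencies below were tallied from the Example 1.1 data under an assumed class layout (5 - 9, 10 - 14, ..., 35 - 39); that layout is an assumption of this sketch, not something stated in the text.

```python
from itertools import accumulate

# Assumed classes 5-9, 10-14, ..., 35-39 for the Example 1.1 data.
lower_limits = [5, 10, 15, 20, 25, 30, 35]
frequencies = [3, 9, 36, 35, 12, 3, 2]
WIDTH = 5

# Upper class boundary of each class, e.g. 9.5 for the class 5-9.
upper_boundaries = [low + WIDTH - 0.5 for low in lower_limits]

# Running totals: number of observations below each upper boundary.
cumulative = list(accumulate(frequencies))

# The ogive starts at the first lower boundary with cumulative frequency 0.
ogive = [(lower_limits[0] - 0.5, 0)] + list(zip(upper_boundaries, cumulative))

for boundary, total in ogive:
    print(f"less than {boundary:4.1f}: {total}")
```

Joining these points with straight lines gives the cumulative frequency polygon; the last point always equals the total frequency.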
When we work with numerical data, it is apparent that in most sets of data there is
a tendency for the observed values to group themselves about some interior values.
Some central values seem to be characteristic of the data. This phenomenon is
referred to as central tendency.
For a given set of data, the measure of location we use depends on what we mean by
middle; different definitions give rise to different measures. We shall consider some
more commonly used measures, namely arithmetic mean, median and mode.
The arithmetic mean can be calculated for any set of numerical data, and it is always unique.
However, extreme values affect the mean. Also, the arithmetic mean ignores the
degree of importance of different categories of data.
Solution.
\[
\text{mean} = \frac{20 + 18 + 15 + 15 + 14 + 12 + 11 + 9 + 7 + 6 + 4 + 1}{12} = \frac{132}{12} = 11
\]
1.7.2 Median
Median is defined as the middle item of all given observations arranged in order. It is
the value that separates the lower 50% of the dataset from the upper 50%.
If the dataset has an odd number of values, the median is the middle value.
If the dataset has an even number of values, the median is the average of the two
middle values.
The median is a useful measure of central tendency for datasets that have outliers or
are skewed, as it is less sensitive to extreme values than the mean. It is also useful for
ordinal data, where the values have a natural order but the differences between them
may not be equally meaningful.
Solution. The observations are arranged in order. There are 12 observations, so the
median is the average of the two middle values, 12 and 11.
\[
\text{median} = \frac{12 + 11}{2} = 11.5
\]
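The mean and median solutions above can be checked with a short Python sketch. It reuses the 12 observations from the mean solution and applies the odd/even rule for the median directly; the function names are illustrative only.

```python
def mean(values):
    """Arithmetic mean: sum of the observations divided by their number."""
    return sum(values) / len(values)


def median(values):
    """Middle value of the ordered data (average of the two middles if even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


observations = [20, 18, 15, 15, 14, 12, 11, 9, 7, 6, 4, 1]

print(mean(observations))    # 11.0
print(median(observations))  # 11.5
```

Python's standard `statistics` module offers `mean` and `median` functions that give the same results on this data.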
1.7.3 Mode
Mode is the value which occurs most frequently. The mode may not exist, and even if
it does, it may not be unique.
For ungrouped data, we simply find the value with the largest frequency. If all values
have the same frequency, no mode exists. If more than one value has the same largest
frequency, then the mode is not unique.
Note that the mode is independent of extreme values and it may be applied to
qualitative data.
mode = 15
2, 2, 2, 4, 5, 6, 7, 7, 7
Solution. The values which appear most frequently are 2 and 7. Therefore,
mode = 2 and 7
Note:
A dataset with two modes is said to be bimodal, while a set with more than two
modes may be described as multimodal. Having multiple modes can indicate that the
data is complex or has multiple underlying patterns or processes. It is important to
identify and interpret all the modes in the dataset to gain a clear understanding of the
distribution of the data.
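The counting rule for ungrouped data can be sketched in Python as follows. The helper `modes` is a hypothetical name; it returns an empty list when all values are equally frequent (no mode) and several values when the mode is not unique.

```python
from collections import Counter


def modes(values):
    """Return all modes in sorted order; an empty list means no mode exists."""
    if not values:
        return []
    freq = Counter(values)
    highest = max(freq.values())
    # If several distinct values all share the same frequency, no mode exists.
    if len(freq) > 1 and all(f == highest for f in freq.values()):
        return []
    return sorted(v for v, f in freq.items() if f == highest)


print(modes([20, 18, 15, 15, 14, 12, 11, 9, 7, 6, 4, 1]))  # [15]
print(modes([2, 2, 2, 4, 5, 6, 7, 7, 7]))                  # [2, 7]
print(modes([1, 2, 3]))                                    # []
print(modes(["red", "blue", "red"]))                       # ['red']
```

The last call shows that the same rule applies to qualitative data, since only frequencies are compared.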
1.7.4 Conclusion
For symmetrically distributed and unimodal data (i.e. only one mode), the mean,
median and mode can be used almost interchangeably.
Median divides the area of the distribution into two equal parts.
Example 1.8 There were two companies, Company A and Company B. Their salary
profiles, given as mean, median and mode, were as follows:
Company A Company B
Mean $30,000 $30,000
Median $30,000 $30,000
Mode Nil Nil
However, their detailed salary ($) structures could be completely different:
Hence it is necessary to have some measure of how the data are scattered. That is, we
want to know the dispersion, or variability, in a set of data.
1.8 Dispersion and Skewness
1.8.1 Range
Range is the difference between the largest and smallest values in the dataset. It gives
an idea of how spread out the data is, and how much variability there is in the dataset.
The range is easy to calculate, but it cannot be obtained when open-ended grouped data are
given.
It is also sensitive to outliers and extreme values, and it does not take into
account the distribution of the data.
Decile divides the distribution into ten equal parts. There are nine deciles such that
10% of the data are ≤ D1 ;
20% of the data are ≤ D2 ; and so on.
Percentile divides the distribution into one hundred equal parts. There are 99
percentiles such that
1% of the data are ≤ P1 ;
2% of the data are ≤ P2 ; and so on.
Fractile, even more flexible, divides the distribution into a convenient number of parts.
Quartiles are the most commonly used values of position. They divide the distribution into
four equal parts such that
25% of the data are ≤ Q1 = first quartile = lower quartile;
50% of the data are ≤ Q2 = second quartile = median;
75% of the data are ≤ Q3 = third quartile = upper quartile.
Steps:
1 Use the median to divide the ordered data set into two halves.
If there is an odd number of data points in the original ordered data set, do not
include the median (the central value in the ordered list) in either half.
If there is an even number of data points in the original ordered data set, split this
data set exactly in half.
2 The lower quartile value is the median of the lower half of the data. The upper
quartile value is the median of the upper half of the data.
The interquartile range (IQR) is a measure of variability that represents the spread of
the middle 50% of the data in a dataset. It is calculated as the difference between the
third quartile (Q3 ) and the first quartile (Q1 ), and provides an estimate of the spread
or dispersion of the data that is less sensitive to outliers than the range.
The semi-interquartile range (SIQR), also known as the quartile deviation, is a measure
of variability that is half the size of the interquartile range (IQR).
\[
\text{IQR} = Q_3 - Q_1, \qquad \text{SIQR} = \frac{Q_3 - Q_1}{2}
\]
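The median-of-halves steps for the quartiles, together with the IQR and SIQR formulas, can be sketched in Python as below. The data are the 12 observations used in the earlier mean and median solutions; the function names are illustrative. Note that other quartile conventions exist (statistical software often interpolates), so results from a library may differ slightly from this method.

```python
def median_of_sorted(ordered):
    """Median of an already ordered list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, Q2, Q3) by taking medians of the lower and upper halves."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # For an odd number of points the median itself belongs to neither half.
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median_of_sorted(lower), median_of_sorted(ordered), median_of_sorted(upper)


observations = [20, 18, 15, 15, 14, 12, 11, 9, 7, 6, 4, 1]
q1, q2, q3 = quartiles(observations)
iqr = q3 - q1
siqr = iqr / 2

print(q1, q2, q3)  # 6.5 11.5 15.0
print(iqr, siqr)   # 8.5 4.25
```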
Mean absolute deviation (MAD) is the mean of the absolute values of all deviations
from the mean. Therefore, it takes every item into account:
\[
\text{MAD} = \frac{\sum |x_i - \bar{x}|}{n}
\]
Note:
MAD is less sensitive to outliers than the variance and standard deviation because it is based on
the absolute, rather than squared, differences between each data point and the mean.
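As a small check of the definition, the Python sketch below computes the MAD as the mean of the absolute deviations from the mean. It reuses the 12 observations from the earlier mean solution (mean 11); the function name is illustrative.

```python
def mean_absolute_deviation(values):
    """Mean of |x - mean| over all observations."""
    n = len(values)
    centre = sum(values) / n
    return sum(abs(x - centre) for x in values) / n


observations = [20, 18, 15, 15, 14, 12, 11, 9, 7, 6, 4, 1]
mad = mean_absolute_deviation(observations)
print(round(mad, 4))  # 4.6667
```

The absolute deviations here sum to 56, so the MAD is 56 / 12.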
Variance and standard deviation are both measures of how spread out a set of data is
from its mean value.
Variance is calculated by taking the average of the squared differences between
each data point and the mean.
Standard deviation (s.d.) is the square root of variance. The advantage of using
standard deviation over variance is that standard deviation is expressed in the
same units as the original data, while variance is expressed in squared units, which
can be harder to interpret.
A high variance/s.d. indicates that the data points are widely spread out from the
mean, while a low variance/s.d. indicates that the data points are clustered closely
around the mean.
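The population variance, σ², is the mean of the squares of all deviations from the mean:
\[
\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}
\]
where μ is the population mean and N is the population size.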
Sample Variance
The sample variance, s², is the sum of the squares of all deviations from the sample
mean, divided by n − 1:
\[
s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1} = \frac{\sum (x_i - \bar{x})^2}{n - 1}
\]
where
$x_i$ is the value of the i-th item;
$\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$ is the sample mean;
n is the sample size.
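The difference between the two divisors can be seen in a short Python sketch. It treats the 12 observations from the earlier mean solution first as a whole population (divisor n) and then as a sample (divisor n − 1); the function names are illustrative, and the standard `statistics` module is used only as a cross-check.

```python
import math
import statistics


def population_variance(values):
    """Mean of the squared deviations from the mean (divisor n)."""
    n = len(values)
    centre = sum(values) / n
    return sum((x - centre) ** 2 for x in values) / n


def sample_variance(values):
    """Sum of squared deviations from the sample mean, divided by n - 1."""
    n = len(values)
    centre = sum(values) / n
    return sum((x - centre) ** 2 for x in values) / (n - 1)


observations = [20, 18, 15, 15, 14, 12, 11, 9, 7, 6, 4, 1]

pop_var = population_variance(observations)   # 366 / 12
samp_var = sample_variance(observations)      # 366 / 11
print(pop_var, math.sqrt(pop_var))
print(samp_var, math.sqrt(samp_var))

# Cross-check against the standard library.
print(statistics.pvariance(observations), statistics.variance(observations))
```

The sample variance is always slightly larger than the population formula applied to the same numbers, and the gap shrinks as n grows.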
Example 1.9 Find the coefficients of variation of the salesman salary and the clerical
salary.
                  x̄               s
Salesman salary   $916.76/month   $286.70/month
Clerical salary   $98.50/week     $20.55/week
Solution.
\[
CV_s = \frac{286.70}{916.76} \times 100\% = 31\%
\]
\[
CV_c = \frac{20.55}{98.50} \times 100\% = 21\%
\]
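The arithmetic in this solution can be reproduced with a short Python sketch, assuming the coefficient of variation is the standard deviation divided by the mean, expressed as a percentage (as used in the solution). The function name is illustrative.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation as a percentage of the mean."""
    return std_dev / mean * 100


cv_salesman = coefficient_of_variation(286.70, 916.76)  # monthly figures
cv_clerical = coefficient_of_variation(20.55, 98.50)    # weekly figures

print(f"{cv_salesman:.0f}%")  # 31%
print(f"{cv_clerical:.0f}%")  # 21%
```

Because the units cancel in the ratio, a monthly salary and a weekly salary can be compared directly.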
Example 1.10 Evaluate the different measurements of variation based on the salary
($) scales of the two companies in Example 1.8:
(a) Range
Company A:
$55,000 − $5,000 = $50,000
Company B:
$55,000 − $5,000 = $50,000
Company A:
(c) Variance
Company A:
\[
\frac{(5{,}000 - 30{,}000)^2 + (15{,}000 - 30{,}000)^2 + (25{,}000 - 30{,}000)^2 + (35{,}000 - 30{,}000)^2 + (45{,}000 - 30{,}000)^2 + (55{,}000 - 30{,}000)^2}{6} = 291{,}666{,}667 \text{ (dollars squared)}
\]
Company B:
\[
\frac{(5{,}000 - 30{,}000)^2 + (5{,}000 - 30{,}000)^2 + (5{,}000 - 30{,}000)^2 + (55{,}000 - 30{,}000)^2 + (55{,}000 - 30{,}000)^2 + (55{,}000 - 30{,}000)^2}{6} = 625{,}000{,}000 \text{ (dollars squared)}
\]
Standard deviation
Company A:
\[
\$\sqrt{291{,}666{,}667} = \$17{,}078
\]
Company B:
\[
\$\sqrt{625{,}000{,}000} = \$25{,}000
\]
Coefficient of variation
Company A:
\[
\frac{\$17{,}078}{\$30{,}000} \times 100\% = 56.93\%
\]
Company B:
\[
\frac{\$25{,}000}{\$30{,}000} \times 100\% = 83.33\%
\]
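The figures in Example 1.10 can be re-derived with the Python sketch below. The salary lists are read off the terms of the variance calculation; Company A's largest salary is taken as 55,000, which is what the stated range and the result 291,666,667 imply. The divisor is 6 (the number of salaries), as in the example. The function and key names are illustrative.

```python
import math

# Salaries read from the variance terms of Example 1.10.
company_a = [5_000, 15_000, 25_000, 35_000, 45_000, 55_000]  # 55,000 inferred
company_b = [5_000, 5_000, 5_000, 55_000, 55_000, 55_000]


def dispersion_summary(salaries):
    """Range, variance (divisor n), standard deviation and CV in percent."""
    n = len(salaries)
    mean = sum(salaries) / n
    variance = sum((x - mean) ** 2 for x in salaries) / n
    std_dev = math.sqrt(variance)
    return {
        "range": max(salaries) - min(salaries),
        "variance": variance,
        "std_dev": std_dev,
        "cv_percent": std_dev / mean * 100,
    }


summary_a = dispersion_summary(company_a)
summary_b = dispersion_summary(company_b)
for name, summary in (("A", summary_a), ("B", summary_b)):
    print(name, summary)
```

Both companies share the same mean and range, yet every measure that uses all the observations shows Company B to be more dispersed.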
For grouped data, the median can be found by first identifying the median class, which is
the class containing the median. Then apply the following formula to the median class:
\[
\text{median} = L_1 + \frac{\frac{n}{2} - C}{f_m} \times (L_2 - L_1)
\]
where
$L_1$ is the lower class boundary of the median class;
n is the total frequency (i.e. the sample size);
$C = f_1 + \cdots + f_{m-1}$ is the cumulative frequency just before the median class;
$f_m$ is the frequency of the median class;
$L_2$ is the upper class boundary of the median class.
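[Figure: histogram bar of the median class between L1 and L2 with frequency fm, showing the cumulative frequency C, the remainder n/2 − C, and the position of the median.]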
For grouped data, the mode can be found by first identifying the modal class, which is the
class with the highest frequency. Then apply the following formula to the modal class:
\[
\text{mode} = L_1 + \frac{d_1}{d_1 + d_2} \times (L_2 - L_1)
\]
where
$L_1$ is the lower class boundary of the modal class;
$d_1$ is the difference between the frequencies of the modal class and the previous class
and is always positive;
$d_2$ is the difference between the frequencies of the modal class and the following class
and is always positive;
$L_2$ is the upper class boundary of the modal class.
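[Figure: the modal class between L1 and L2 with its two neighbouring classes, showing d1, d2, and the mode dividing the class width in the ratio d1 : d2.]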
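For grouped data, the sample standard deviation is
\[
s = \sqrt{\frac{\sum f_i x_i^2 - \frac{(\sum f_i x_i)^2}{\sum f_i}}{\sum f_i - 1}}
\]
where
$x_i$ is the class mark of the i-th class;
$f_i$ is the frequency of the i-th class;
$\sum f_i$ is the total frequency (i.e. the sample size);
$\sum f_i x_i^2 = f_1 x_1^2 + f_2 x_2^2 + \cdots + f_k x_k^2$;
$(\sum f_i x_i)^2 = (f_1 x_1 + f_2 x_2 + \cdots + f_k x_k)^2$.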
Example 1.11 The following table shows gas consumption data for 100 cars during a
specific time period:
Gas consumption Frequency
10 − 19 1
20 − 29 0
30 − 39 1
40 − 49 4
50 − 59 7
60 − 69 16
70 − 79 19
80 − 89 20
90 − 99 17
100 − 109 11
110 − 119 3
120 − 129 1
(b) Median, Q1 , Q3
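The median class is 80 − 89, with $f_m = 20$ and $C = 48$.
\[
\text{median} = L_1 + \frac{\frac{n}{2} - C}{f_m} \times (L_2 - L_1) = 79.5 + \frac{50 - 48}{20} \times (89.5 - 79.5) = 80.5
\]
Similarly,
\[
Q_1 = 59.5 + \frac{25 - 13}{16} \times (69.5 - 59.5) = 67
\]
\[
Q_3 = 89.5 + \frac{75 - 68}{17} \times (99.5 - 89.5) = 93.6176
\]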
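The median and quartile calculations for Example 1.11 follow one pattern: locate the class containing a given fraction of the total frequency, then interpolate within it. A minimal Python sketch of that pattern is given below. It assumes integer class limits, so each class boundary lies 0.5 below the lower limit; `grouped_fractile` is a hypothetical helper name.

```python
# Gas consumption table from Example 1.11: classes 10-19, 20-29, ..., 120-129.
lower_limits = list(range(10, 130, 10))
frequencies = [1, 0, 1, 4, 7, 16, 19, 20, 17, 11, 3, 1]
WIDTH = 10


def grouped_fractile(lower_limits, frequencies, width, fraction):
    """Interpolate the value below which `fraction` of the data fall."""
    n = sum(frequencies)
    target = fraction * n
    cumulative = 0  # cumulative frequency just before the current class
    for low, freq in zip(lower_limits, frequencies):
        if freq > 0 and cumulative + freq >= target:
            l1 = low - 0.5  # lower class boundary
            return l1 + (target - cumulative) / freq * width
        cumulative += freq
    raise ValueError("fraction must lie between 0 and 1")


median = grouped_fractile(lower_limits, frequencies, WIDTH, 0.50)
q1 = grouped_fractile(lower_limits, frequencies, WIDTH, 0.25)
q3 = grouped_fractile(lower_limits, frequencies, WIDTH, 0.75)

print(median, q1, round(q3, 4))  # 80.5 67.0 93.6176
```

The same function gives deciles and percentiles by changing `fraction`, for example 0.1 for D1.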
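(c) Mode
[Figure: the modal class 79.5 to 89.5 with frequency 20, between neighbouring classes with frequencies 19 and 17, so that d1 : d2 = 1 : 3.]
\[
\text{mode} = L_1 + \frac{d_1}{d_1 + d_2} \times (L_2 - L_1) = 79.5 + \frac{20 - 19}{(20 - 19) + (20 - 17)} \times (89.5 - 79.5) = 82
\]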
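The grouped-mode calculation can be sketched in Python as follows. The sketch assumes a single modal class whose frequency exceeds at least one neighbour (otherwise d1 + d2 would be zero), and integer class limits so that the lower boundary is the lower limit minus 0.5. `grouped_mode` is a hypothetical helper name.

```python
# Gas consumption table from Example 1.11: classes 10-19, 20-29, ..., 120-129.
lower_limits = list(range(10, 130, 10))
frequencies = [1, 0, 1, 4, 7, 16, 19, 20, 17, 11, 3, 1]
WIDTH = 10


def grouped_mode(lower_limits, frequencies, width):
    """Mode of grouped data by interpolation inside the modal class."""
    m = max(range(len(frequencies)), key=lambda i: frequencies[i])
    previous = frequencies[m - 1] if m > 0 else 0
    following = frequencies[m + 1] if m < len(frequencies) - 1 else 0
    d1 = frequencies[m] - previous   # excess over the previous class
    d2 = frequencies[m] - following  # excess over the following class
    l1 = lower_limits[m] - 0.5       # lower class boundary
    return l1 + d1 / (d1 + d2) * width


mode = grouped_mode(lower_limits, frequencies, WIDTH)
print(mode)  # 82.0
```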
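(d) s.d.
\[
\begin{aligned}
s &= \sqrt{\frac{\sum f_i x_i^2 - \frac{(\sum f_i x_i)^2}{\sum f_i}}{\sum f_i - 1}}
   = \sqrt{\frac{(\sum f_i)(\sum f_i x_i^2) - (\sum f_i x_i)^2}{(\sum f_i)(\sum f_i - 1)}} \\
  &= \sqrt{\frac{n(\sum f_i x_i^2) - (\sum f_i x_i)^2}{n(n - 1)}}
   = \sqrt{\frac{100(671705) - (7970)^2}{100(100 - 1)}} \\
  &= 19.2
\end{aligned}
\]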
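The sums used in this calculation can be reproduced with a short Python sketch. Each class is represented by its class mark (the midpoint of its limits, e.g. 14.5 for 10 − 19), which is the usual grouped-data approximation; the variable names are illustrative.

```python
import math

# Gas consumption table from Example 1.11: classes 10-19, 20-29, ..., 120-129.
lower_limits = list(range(10, 130, 10))
frequencies = [1, 0, 1, 4, 7, 16, 19, 20, 17, 11, 3, 1]

# Class mark = midpoint of the class limits, e.g. (10 + 19) / 2 = 14.5.
class_marks = [low + 4.5 for low in lower_limits]

n = sum(frequencies)
sum_fx = sum(f * x for f, x in zip(frequencies, class_marks))
sum_fx2 = sum(f * x ** 2 for f, x in zip(frequencies, class_marks))

# Sample standard deviation for grouped data (divisor n - 1).
s = math.sqrt((n * sum_fx2 - sum_fx ** 2) / (n * (n - 1)))

print(n, sum_fx, sum_fx2)  # 100 7970.0 671705.0
print(round(s, 1))         # 19.2
```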
1.8.8 Skewness
Skewness describes the asymmetry of a distribution, that is, how the data pile up on one side. A number of
measures have been suggested to determine the skewness of a given distribution.
If the longer tail is on the right, we say that the distribution is skewed to the right, and the
coefficient of skewness is positive.
If the longer tail is on the left, we say that it is skewed to the left, and the coefficient
of skewness is negative.
Example 1.12
Note:
For moderately skewed distribution data, their relationship can be given by