Chapter 1

Descriptive Statistics
AMA1501/1602
Introduction to Statistics for Business
Contents
1.1 Introduction
1.2 Some Basic Definitions
1.3 Method of Data Collection
1.4 Primary and Secondary Data
1.5 Graphical Descriptions of Data
1.6 Frequency Distribution
1.7 Central Tendency
1.8 Dispersion and Skewness

1.1 Introduction
Statistics is concerned with the scientific method by which information is
collected, organized, analysed and interpreted for the purpose of description and
decision making.
Examples of statistics in everyday use include the Hang Seng Index, life and car insurance rates, the unemployment rate, and the Consumer Price Index.

There are two subdivisions of statistical method:

Descriptive Statistics – It deals with the presentation of numerical facts, or data, in either tabular or graphical form, and with the methodology of analysing the data.
Inferential Statistics – It involves techniques for making inferences about the whole population on the basis of observations obtained from samples.

1.2 Some Basic Definitions
Population – A population is the group from which data are to be collected.

Sample – A sample is a subset of a population.
Variable – A variable is a characteristic of the members of a population that differs in quality or quantity from one member to another.
Quantitative variable – A variable differing in quantity is called a quantitative variable, for example, the weight of a person or the number of people in a car.
Qualitative variable – A variable differing in quality is called a qualitative variable or attribute, for example, color, or the degree of damage of a car in an accident.
Discrete variable – A discrete variable is one for which no value may be assumed between two given adjacent values, for example, the number of children in a family.
Continuous variable – A continuous variable is one for which any value may be assumed between two given values, for example, the time for a 100-meter run.

1.3 Method of Data Collection
Statistics very often involves the collection of data. There are many ways to obtain data, and the World Wide Web is one of them. The advantages and disadvantages of common data collection methods are discussed below.

1.3.1 Postal/Online Questionnaire
The main advantages are:

The apparently low cost compared with other methods, although the cost per useful answer may well be high.
No need for a closely grouped sample as in personal
interviews, since the Post Office is acting as a field force.
There is no interviewer bias.
A considered reply can be given – the respondent has time to consult any
necessary documents.

The main disadvantages are:

The whole questionnaire can be read before answering (which in some circumstances is undesirable).
Spontaneous answers cannot be collected. Only simple
questions and instructions can be given.
The wrong person may complete the form.
Other persons’ opinions may be given, e.g. by a wife consulting her husband.
No control is possible over the speed of the reply.
A poor “response rate” (a low percentage of replies) will be obtained.

1.3.2 Telephone Interview
The main advantages are:

It is cheaper than personal interviews but tends to be dearer on average than postal questionnaires.
It can be carried out relatively quickly.
Help can be given if the person does not understand the question as worded.
The telephone can be used in conjunction with other survey methods, e.g. for
encouraging replies to postal surveys or making appointments for personal
interviews.
Spontaneous answers can be obtained.

The main disadvantages are:

In some countries not everybody owns a telephone; a survey carried out among telephone owners would therefore be biased towards the upper social classes of the community. However, the telephone can be used in industrial market research anywhere, since businesses are invariably on the telephone.
It is easy to refuse to be interviewed on the telephone simply by replacing the receiver. The response rate tends to be higher than for postal surveys but not as high as when personal interviews are used.
As in the postal questionnaire, it is not possible to check the characteristics of the
person who is replying, particularly with regard to age and social class.
The questionnaire cannot be too long or too involved.

1.3.3 Personal Interview
In market research this is by far the most commonly used way of collecting information from the general public.
The main advantages are:

A trained person may assess the interviewees in terms of age, social class and area of residence, and sometimes even assess the accuracy of the information given (e.g. by checking the pantry to see if certain goods are really there).
Help can be given to those respondents who are unable to understand the questions, although great care has to be taken that the interviewer’s own feelings do not enter into the wording of the question and so influence the answers of the respondents.
A well-trained interviewer can persuade a person to give an interview who might otherwise
have refused on a postal or telephone enquiry, so that a higher response rate, giving a
more representative cross-section of views, is obtained.
A great deal more information can be collected than is possible by the previous methods.
Interviews of three quarters of an hour are commonplace, and a great deal of information
can be gathered in this time.
The main disadvantages are:

It is far more expensive than other methods because interviewers have
to be recruited, trained and paid a suitable salary and expenses.
The interviewer may consciously or unconsciously bias the answers to
the question, in spite of being trained not to do so.
Persons may not like to give confidential or embarrassing information at a face-to-face
interview.
In general, people may tend to give information that they feel will impress the interviewer,
and show themselves in a better light, e.g. by claiming to read “quality” newspapers.
There is a possibility that the interviewer will cheat by not carrying out the interview or
carrying out only parts of it. All reputable organizations carry out quality control checks
to lessen the chances of this happening.
Some types of people are more difficult to locate and interview than others, e.g. travellers.
While this may not be important in some surveys, it will be in others, such as car surveys.
1.3.4 Observation
Observation can be carried out by trained observers, cameras, or closed-circuit
television.
Observation is used in a wide range of fields. For instance, anthropologists may live in
a primitive society, and social workers may become factory workers to learn about the
habits and customs of the community they are studying.
Observation can also be used in “before and after” studies, for example, observing the
flow of traffic in a supermarket before and after changes to the store layout. In
industry, many Work Study techniques rely on observing individuals or groups of
workers to establish the system of movements they employ, with the aim of eliminating
wasteful effort.
If there are not enough trained observers available or the movements are complicated,
cameras may be used, allowing for detailed analysis by repeatedly running the film.
Quality control checks and retail audits, a branch of market research, may also be
considered observation techniques.
The main advantages are:

The actual actions or habits of persons are observed, not
what the persons say they would do when questioned. It is
interesting to note that in one study only 40% of families
who stated they were going to buy a new car had actually
bought one when called upon a year later.
Observation may keep the system undisturbed. In some cases, to maintain high accuracy, it is undesirable for people to know that an experiment or change is to be made or is taking place.

The main disadvantages are:

The results of the observations depend on the skill and
impartiality of the observer.
It is often difficult in practice to obtain a truly random
sample of persons or events.
It is difficult to predict future behaviour from observation alone.
It is not possible to observe actions which took place before the study was
contemplated.
Opinions and attitudes cannot usually be obtained by observation.
In marketing, the frequency of a person’s purchase cannot be obtained by pure
observation. Nor can such forms of behaviour as church-going, smoking and
crossing roads, except by employing a continuous and lengthy (and hence
detectable) period of observation.
1.3.5 Reports and Published Statistics
International organizations such as the United Nations provide useful data. Governments also publish statistics on population, trade, production, and other topics. Reports on specialized topics, including scientific research, are available from governments, trade organizations, trade unions, universities, professional and scientific organizations, and local authorities. The World Wide Web is also an efficient source of data.

1.4 Primary and Secondary Data
Primary data refers to data that is collected directly from the source for the first
time by the researcher. This data is original and has not been previously collected,
published, or analysed by anyone else. Examples of primary data collection
methods include surveys, interviews, observations, and experiments.
Secondary data, on the other hand, refers to data that has been collected and
published by someone else or for another purpose. This data has already been
analysed and interpreted by others, and the researcher uses it for their own
analysis. Examples of secondary data sources include government statistics,
academic journals, books, and reports.

Both primary and secondary data have their own advantages and disadvantages.
Primary data is more specific to the research question and can be tailored to the
researcher’s needs, but it can be time-consuming and expensive to collect.
Secondary data can be easily accessible and less expensive, but it may not be as
accurate or relevant to the research question as primary data. Therefore,
researchers often use a combination of both types of data to obtain a
comprehensive understanding of the phenomenon under investigation.

1.5 Graphical Descriptions of Data
A graph is a method of presenting statistical data in visual form. The main purpose of
any chart is to give a quick, easy-to-read-and-interpret pictorial representation of data
which is more difficult to obtain from a table or a complete listing of the data.
Some basic rules for the construction of a statistical chart are listed below:
1 A clear and concise title that provides enough identification of the graph must be
included.
2 Each scale should have a scale caption that indicates the units used.
3 The zero point on the co-ordinate scale should be indicated. In situations where
space is limited, a scale break may be inserted to indicate its omission.
4 All items presented in the graph must be clearly labeled and legible, even when
viewed in black and white reprint.

There are many varieties of graphs. The most commonly used graphs are described below.
Pie chart
Simple bar chart
Bi-directional bar chart
Multiple bar chart
Component bar chart
Other types of graphs

Pie chart – Pie charts are widely used to show the component parts of a total. They are popular because of their simplicity. In constructing a pie chart, the central angle of each slice must be proportional to that component’s percentage of the total. The following pie chart gives the percentage distribution of educational attainment in Hong Kong in the year 2022 for the land-based non-institutional population aged 15 and over.
Source: Census and Statistics Department, HKSAR

Simple bar chart – The horizontal bar chart is also a simple and popular chart. Like the pie chart, the simple horizontal bar chart is a one-scale chart. In constructing a bar chart, the width of the bars is not important, but the length (or height) of each bar must be proportional to the value it represents. The following bar chart gives the monthly household income of Hong Kong in the year 2022.

Bi-directional bar chart – A bar chart can use either horizontal or vertical bars. A
bi-directional bar chart indicates both the positive and negative values. The following
example gives the highest and lowest recorded temperatures in 5 states across the
United States.
Source: National Centers for Environmental Information

Multiple bar chart – A multiple bar chart is a useful tool for quickly comparing
different sets of data. In the following example, the marital status of males and
females in Hong Kong in the year 2022 is compared using a multiple bar chart.

Component bar chart – A component bar chart subdivides the bars into different
sections. It is useful when the total of the components is of interest. The following
example gives the nutrient values of food.
Source: U.S. Department of Agriculture

Other types of graphs – Graphic presentations can be made more attractive through the use of careful layout and appropriate symbols. Sometimes information pertaining to different geographical areas can even be presented through the use of a so-called statistical map.
https://ec.europa.eu/eurostat/web/gisco/gisco-activities/statistical-atlas
1.6 Frequency Distribution
Statistical data obtained by means of a census, sample surveys or experiments usually consist of raw, unorganized sets of numerical values. Before these data can be used as a basis for inferences about the phenomenon under investigation or as a basis for decision, they must be summarized and the pertinent information must be extracted.
A useful method for summarizing a set of data is the construction of a frequency table,
or a frequency distribution. That is, we divide the overall range of values into a
number of “classes” (or “bins”, “intervals”) and count the number of observations that
fall into each of these classes.

The general rules for constructing a frequency distribution are:

1 There should be an appropriate number of classes, with neither too few nor too
many.
2 The classes should have equal widths wherever possible, but the first and last
classes may be open-ended to cater for extreme values.
3 Class limits represent the smallest and largest data values that can be included in the class. Class limits are actual data values.
4 Class boundaries provide values that eliminate gaps between the classes in the
frequency distribution. To find a class boundary, average the upper class limit of
one class and the lower class limit of the next class.
5 Each class should be represented by a class mark, also known as the class
midpoint of the i-th class. This can be found by calculating the simple average of
the class boundaries or the class limits of the same class.
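Rules 3 to 5 can be illustrated with a short sketch. It assumes integer-valued classes of width 5 running from 5 − 9 up to 35 − 39; the function names are illustrative only, and extending the first lower and last upper boundary by the same half-gap is an assumption, not something the rules state.

```python
# Class limits (lower, upper) for seven classes of width 5: 5-9, 10-14, ..., 35-39.
class_limits = [(lower, lower + 4) for lower in range(5, 40, 5)]


def class_boundaries(limits):
    """Close the gaps between classes.

    The boundary between two adjacent classes is the average of the upper
    limit of one class and the lower limit of the next. The outermost
    boundaries are extended by the same half-gap (an assumption).
    """
    gap = limits[1][0] - limits[0][1]  # distance between adjacent classes (1 here)
    return [(lo - gap / 2, hi + gap / 2) for lo, hi in limits]


def class_marks(limits):
    """Class mark (midpoint): the simple average of the two class limits."""
    return [(lo + hi) / 2 for lo, hi in limits]


boundaries = class_boundaries(class_limits)
marks = class_marks(class_limits)

for (lo, hi), (b_lo, b_hi), mark in zip(class_limits, boundaries, marks):
    print(f"{lo:>2} - {hi:<2}  boundaries {b_lo:>4} - {b_hi:<4}  mark {mark}")
```

Averaging the class boundaries instead of the class limits gives the same class marks, as rule 5 says.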

Example 1.1 A traffic inspector has counted the number of automobiles passing a certain point in 100 successive 20-minute time periods. The observations are listed below.
23 20 16 18 30 22 26 15 5 18
14 17 11 37 21 6 10 20 22 25
19 19 19 20 12 23 24 17 18 16
27 16 28 26 15 29 19 35 20 17
12 30 21 22 20 15 18 16 23 24
15 24 28 19 24 22 17 19 8 18
17 18 23 21 25 19 20 22 21 21
16 20 19 11 23 17 23 13 17 26
26 14 15 16 27 18 21 24 33 20
21 27 18 22 17 20 14 21 22 19

1 Setting up the classes
Choose a class width of 5 for each class; we then have seven classes, going from 5 to 9, from 10 to 14, . . ., and from 35 to 39.
2 Tallying and counting
Class      Count
5 − 9        3
10 − 14      9
15 − 19     36
20 − 24     35
25 − 29     12
30 − 34      3
35 − 39      2

3 Illustrating the data in tabular form
Frequency distribution for the traffic data

Number of autos per period Number of periods
5−9 3
10 − 14 9
15 − 19 36
20 − 24 35
25 − 29 12
30 − 34 3
35 − 39 2
Total 100
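The tallying step can be reproduced programmatically. The following is a minimal sketch, not part of the original example: `frequency_table` is an illustrative helper that assumes integer data and equal class widths, and it silently ignores values outside the listed classes.

```python
from collections import Counter

# The 100 observations of Example 1.1, row by row.
observations = [
    23, 20, 16, 18, 30, 22, 26, 15, 5, 18,
    14, 17, 11, 37, 21, 6, 10, 20, 22, 25,
    19, 19, 19, 20, 12, 23, 24, 17, 18, 16,
    27, 16, 28, 26, 15, 29, 19, 35, 20, 17,
    12, 30, 21, 22, 20, 15, 18, 16, 23, 24,
    15, 24, 28, 19, 24, 22, 17, 19, 8, 18,
    17, 18, 23, 21, 25, 19, 20, 22, 21, 21,
    16, 20, 19, 11, 23, 17, 23, 13, 17, 26,
    26, 14, 15, 16, 27, 18, 21, 24, 33, 20,
    21, 27, 18, 22, 17, 20, 14, 21, 22, 19,
]


def frequency_table(data, start, width, n_classes):
    """Count how many observations fall into each class.

    Class k covers the integers start + k*width up to start + (k+1)*width - 1.
    """
    counts = Counter((x - start) // width for x in data)
    return [
        ((start + k * width, start + (k + 1) * width - 1), counts.get(k, 0))
        for k in range(n_classes)
    ]


table = frequency_table(observations, start=5, width=5, n_classes=7)
for (lo, hi), count in table:
    print(f"{lo:>2} - {hi:<2} {count:>3}")
print("Total", sum(count for _, count in table))
```

The printed counts agree with the frequency distribution tabulated above.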

1.6.1 Histogram
A histogram is usually used to present frequency distributions graphically. This is constructed by drawing rectangles over each class. The area of each rectangle should be proportional to its frequency.
Notes:
The vertical lines of a histogram should be the class boundaries.
The range of the variable should constitute the major portion of the graphs of frequency distributions. If the smallest observation is far away from zero, then an “axis break” sign should be introduced in the horizontal axis.

Example 1.2 (a) Construct a histogram for the traffic data in Example 1.1.
Class limit    Class boundary    Frequency
5 − 9          4.5 − 9.5           3
10 − 14        9.5 − 14.5          9
15 − 19        14.5 − 19.5        36
20 − 24        19.5 − 24.5        35
25 − 29        24.5 − 29.5        12
30 − 34        29.5 − 34.5         3
35 − 39        34.5 − 39.5         2


1.6.2 Frequency Polygon
Another method to represent a frequency distribution graphically is by a frequency polygon. As in the histogram, the base line is divided into sections corresponding to the class intervals, but instead of drawing rectangles, the points plotted at successive class marks are connected by straight lines.
The frequency polygon is particularly useful when two or more distributions are to be
presented for comparison on the same graph.
Note:
A frequency curve can be obtained by smoothing the frequency polygon.

Example 1.2 (b) Construct a frequency polygon for the traffic data in Example 1.1.
Class limit    Class boundary    Class mark    Frequency
5 − 9          4.5 − 9.5          7              3
10 − 14        9.5 − 14.5        12              9
15 − 19        14.5 − 19.5       17             36
20 − 24        19.5 − 24.5       22             35
25 − 29        24.5 − 29.5       27             12
30 − 34        29.5 − 34.5       32              3
35 − 39        34.5 − 39.5       37              2


1.6.3 Cumulative Frequency Distribution and Cumulative Polygon
Cumulative frequency distribution shows the total number of observations that fall
below a given value or range of values in a dataset.
There are several reasons why cumulative frequency distribution is useful:

1 It provides a quick and easy way to visualize the distribution of a dataset.
2 It helps to identify outliers and extreme values in the dataset.
3 It allows us to calculate percentiles and quartiles easily.
4 It can be used to compare two or more datasets.
Overall, cumulative frequency distribution is a useful tool for anyone working with
data, whether in the fields of statistics, economics, psychology, or any other discipline
that deals with quantitative data.

Example 1.3 Construct a cumulative frequency polygon (also called an ogive) of the distribution of traffic data in Example 1.1.
Cumulative frequency distribution for the traffic data

Number of autos per period    Less than    Cumulative frequency
5 − 9                          9.5           3
10 − 14                       14.5          12
15 − 19                       19.5          48
20 − 24                       24.5          83
25 − 29                       29.5          95
30 − 34                       34.5          98
35 − 39                       39.5         100
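The cumulative column is simply a running total of the class frequencies. A minimal sketch using the standard library follows; starting the ogive at zero at the first lower class boundary is a common convention that is assumed here rather than taken from the example.

```python
from itertools import accumulate

# Class frequencies and upper class boundaries from Example 1.1.
frequencies = [3, 9, 36, 35, 12, 3, 2]
upper_boundaries = [9.5, 14.5, 19.5, 24.5, 29.5, 34.5, 39.5]

# Running total: number of observations less than each upper boundary.
cumulative = list(accumulate(frequencies))

# Points of the ogive; (4.5, 0) is the assumed starting point.
ogive_points = [(4.5, 0)] + list(zip(upper_boundaries, cumulative))

for boundary, total in ogive_points:
    print(f"less than {boundary:>4}: {total:>3}")
```

Plotting `ogive_points` and joining them with straight lines gives the cumulative frequency polygon.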

A cumulative frequency curve can similarly be drawn by smoothing the cumulative
frequency polygon.

1.6.4 Relative Frequency
Relative frequency distribution shows the proportion or percentage of observations that fall within a given class interval in a dataset.
Relative frequency of a class is defined as:

\[ \text{Relative frequency} = \frac{\text{Frequency of the class}}{\text{Total frequency}} \]
If the frequencies are changed to relative frequencies, then a relative frequency
histogram, a relative frequency polygon and a relative frequency curve can similarly be
constructed.
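As a small illustration, the relative frequencies of the traffic data in Example 1.1 can be computed as follows. This is a sketch; the variable names are illustrative only.

```python
# Class frequencies of the traffic data in Example 1.1.
frequencies = [3, 9, 36, 35, 12, 3, 2]
total = sum(frequencies)

# Relative frequency = frequency of the class / total frequency.
relative = [f / total for f in frequencies]

for f, r in zip(frequencies, relative):
    print(f"frequency {f:>2}  relative frequency {r:.2f}  ({r:.0%})")
```

The relative frequencies always add up to 1 (or 100%), which makes distributions based on different sample sizes directly comparable.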

1.7 Central Tendency
When we work with numerical data, it seems apparent that in most sets of data there is a tendency for the observed values to group themselves about some interior values.
Some central values seem to be the characteristics of the data. This phenomenon is
referred to as central tendency.
For a given set of data, the measure of location we use depends on what we mean by
middle; different definitions give rise to different measures. We shall consider some
more commonly used measures, namely arithmetic mean, median and mode.

1.7.1 Arithmetic Mean
The arithmetic population mean, µ, or simply called the mean, is obtained by adding together all of the measurements and dividing by the total number of measurements taken:
\[ \mu = \frac{x_1 + x_2 + \cdots + x_N}{N} = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{\sum x_i}{N} \]
where
\(x_i\) is the value of the i-th item;
\(N\) is the population size.
The arithmetic mean can be calculated for any numerical data, and it is always unique. Extreme values clearly affect the mean. The arithmetic mean also weights every observation equally, so it ignores any differences in importance between categories of data.

Example 1.4 Given the following set of ungrouped data:
20, 18, 15, 15, 14, 12, 11, 9, 7, 6, 4, 1
Find the mean of the ungrouped data.
Solution.
\[ \text{mean} = \frac{20 + 18 + 15 + 15 + 14 + 12 + 11 + 9 + 7 + 6 + 4 + 1}{12} = \frac{132}{12} = 11 \]
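The same calculation can be checked in a few lines of Python. This is a minimal sketch: `statistics.mean` is the standard-library function, while the variable names are illustrative.

```python
from statistics import mean

# Ungrouped data of Example 1.4.
data = [20, 18, 15, 15, 14, 12, 11, 9, 7, 6, 4, 1]

# Sum of all measurements divided by the number of measurements.
manual_mean = sum(data) / len(data)

print(sum(data), len(data), manual_mean)  # total, count, mean
print(mean(data))                         # standard-library equivalent
```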

1.7.2 Median
Median is defined as the middle item of all given observations arranged in order. It is
the value that separates the lower 50% of the dataset from the upper 50%.
If the dataset has an odd number of values, the median is the middle value.
If the dataset has an even number of values, the median is the average of the two
middle values.
The median is a useful measure of central tendency for datasets that have outliers or
are skewed, as it is less sensitive to extreme values than the mean. It is also useful for
ordinal data, where the values have a natural order but the differences between them
may not be equally meaningful.

Example 1.5 Find the median of the ungrouped data:
20, 18, 15, 15, 14, 12, 11, 9, 7, 6, 4, 1
Solution. The observations are already arranged in (descending) order. There are 12 observations, an even number, so the median is the average of the two middle values, 12 and 11.
\[ \text{median} = \frac{12 + 11}{2} = 11.5 \]
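The odd and even cases of the definition can be written out explicitly. The function below is an illustrative sketch; the standard library's `statistics.median` follows the same rule.

```python
def median(values):
    """Middle value of the ordered data (average of the two middle values if even)."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]                       # odd: the single middle value
    return (ordered[mid - 1] + ordered[mid]) / 2  # even: average of the two middle values


data = [20, 18, 15, 15, 14, 12, 11, 9, 7, 6, 4, 1]
print(median(data))       # even number of values
print(median(data[:-1]))  # dropping the last value gives an odd number of values
```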

1.7.3 Mode
Mode is the value which occurs most frequently. The mode may not exist, and even if
it does, it may not be unique.
For ungrouped data, we simply find the value with the largest frequency. If all values have the same frequency, no mode exists. If more than one value has the same largest frequency, then the mode is not unique.
Note that the mode is independent of extreme values and it may be applied to qualitative data.

Example 1.6 Find the mode of the ungrouped data:
20, 18, 15, 15, 14, 12, 11, 9, 7, 6, 4, 1
Solution. The value which appears most frequently is 15. Therefore,
mode = 15

Example 1.7 Find the mode of the ungrouped data:
2, 2, 2, 4, 5, 6, 7, 7, 7
Solution. The values which appear most frequently are 2 and 7. Therefore,
mode = 2 and 7
Note:
A dataset with two modes is said to be bimodal, while a set with more than two
modes may be described as multimodal. Having multiple modes can indicate that the
data is complex or has multiple underlying patterns or processes. It is important to
identify and interpret all the modes in the dataset to gain a clear understanding of the
distribution of the data.
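The rules for the mode of ungrouped data (no mode when all values are equally frequent, several modes when there are ties) can be sketched as follows. The function name and the choice to return a list are illustrative assumptions, and the input is assumed to be non-empty.

```python
from collections import Counter


def modes(values):
    """Return all modes of a non-empty dataset as a sorted list.

    An empty list means that no mode exists, i.e. there are several
    distinct values and all of them occur equally often.
    """
    counts = Counter(values)
    highest = max(counts.values())
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([20, 18, 15, 15, 14, 12, 11, 9, 7, 6, 4, 1]))  # Example 1.6
print(modes([2, 2, 2, 4, 5, 6, 7, 7, 7]))                  # Example 1.7 (bimodal)
print(modes([1, 2, 3]))                                    # no mode
```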

1.7.4 Conclusion
For symmetrically distributed and unimodal data (i.e. only one mode), the mean,
median and mode can be used almost interchangeably.
Mean can be interpreted as the center of gravity of the distribution.
Median divides the area of the distribution into two equal parts.
Mode is the highest point of the distribution.

1.8 Dispersion and Skewness
Sometimes the mean, median and mode may not be able to reflect the true picture of some data. The following example explains the reason.
Example 1.8 There were two companies, Company A and Company B. Their salary profiles, summarized by mean, median and mode, were as follows:
            Company A    Company B
Mean        $30,000      $30,000
Median      $30,000      $30,000
Mode        Nil          Nil
However, their detailed salary ($) structures could be completely different, for example:
Company A    5,000   15,000   25,000   35,000   45,000   55,000
Company B    5,000    5,000    5,000   55,000   55,000   55,000
Hence it is necessary to have some measures of how the data are scattered. That is, we want to know the dispersion, or variability, in a set of data.
1.8.1 Range
Range is the difference between the largest and smallest values in the dataset. It gives
an idea of how spread out the data is, and how much variability there is in the dataset.
The range is easy to calculate. However, it cannot be obtained from grouped data with open-ended classes, it is sensitive to outliers and extreme values, and it does not take into account how the data are distributed between the two extremes.

1.8.2 Deciles, Percentile, and Fractile
Deciles divide the distribution into ten equal parts. There are nine deciles such that
10% of the data are ≤ D1;
20% of the data are ≤ D2; and so on.
Percentiles divide the distribution into one hundred equal parts. There are 99 percentiles such that
1% of the data are ≤ P1;
2% of the data are ≤ P2; and so on.
Fractiles are more flexible still: they divide the distribution into any convenient number of parts.

1.8.3 Quartiles
Quartiles are the most commonly used values of position. They divide the distribution into four equal parts such that
25% of the data are ≤ Q1 = first quartile = lower quartile;
50% of the data are ≤ Q2 = second quartile = median;
75% of the data are ≤ Q3 = third quartile = upper quartile.
Steps:
1 Use the median to divide the ordered data set into two-halves.
If there is an odd number of data points in the original ordered data set, do not
include the median (the central value in the ordered list) in either half.
If there is an even number of data points in the original ordered data set, split this
data set exactly in half.
2 The lower quartile value is the median of the lower half of the data. The upper
quartile value is the median of the upper half of the data.
The interquartile range (IQR) is a measure of variability that represents the spread of
the middle 50% of the data in a dataset. It is calculated as the difference between the
third quartile (Q3 ) and the first quartile (Q1 ), and provides an estimate of the spread
or dispersion of the data that is less sensitive to outliers than the range.
The semi-interquartile range (SIQR), also known as the quartile deviation, is a measure
of variability that is half the size of the interquartile range (IQR).
\[ \text{IQR} = Q_3 - Q_1, \qquad \text{SIQR} = \frac{Q_3 - Q_1}{2} \]
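The two steps for finding quartiles, together with the IQR and SIQR, can be sketched as below. This implements only the median-of-halves method described in the steps; statistical packages often use other conventions that may give slightly different quartiles. Function names are illustrative, and the 12-value dataset from Example 1.4 is reused purely as sample input.

```python
def median_of_sorted(ordered):
    """Median of an already sorted, non-empty list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


def quartiles(values):
    """Return (Q1, Q2, Q3) using the median-of-halves method."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]
    # With an odd number of points the median itself is left out of both halves.
    upper = ordered[half + 1:] if n % 2 == 1 else ordered[half:]
    return median_of_sorted(lower), median_of_sorted(ordered), median_of_sorted(upper)


data = [20, 18, 15, 15, 14, 12, 11, 9, 7, 6, 4, 1]
q1, q2, q3 = quartiles(data)
iqr = q3 - q1
siqr = iqr / 2
print(q1, q2, q3, iqr, siqr)
```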

1.8.4 Mean Absolute Deviation
Mean absolute deviation (MAD) is the mean of the absolute values of all deviations
from the mean. Therefore, it takes every item into account:
\[ \text{MAD} = \frac{|x_1 - \mu| + |x_2 - \mu| + \cdots + |x_N - \mu|}{N} = \frac{\sum |x_i - \mu|}{N} \]
where
µ is the population mean.
Note:
MAD is less sensitive to outliers than the variance and standard deviation because it is based on
the absolute, rather than squared, differences between each data point and the mean.

1.8.5 Variance and Standard Deviation
Variance and standard deviation are both measures of how spread out a set of data is
from its mean value.
Variance is calculated by taking the average of the squared differences between
each data point and the mean.
Standard deviation (s.d.) is the square root of variance. The advantage of using
standard deviation over variance is that standard deviation is expressed in the
same units as the original data, while variance is expressed in squared units, which
can be harder to interpret.
A high variance/s.d. indicates that the data points are widely spread out from the
mean, while a low variance/s.d. indicates that the data points are clustered closely
around the mean.

Population Variance and Standard Deviation
The population variance, \(\sigma^2\), is the mean of the squares of all deviations from the mean:
\[ \sigma^2 = \frac{(x_1 - \mu)^2 + (x_2 - \mu)^2 + \cdots + (x_N - \mu)^2}{N} = \frac{\sum (x_i - \mu)^2}{N} \]
where
µ is the population mean.
The population standard deviation σ is defined as
\[ \sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}. \]

Sample Variance
The sample variance, \(s^2\), is the sum of the squares of all deviations from the sample mean, divided by \(n - 1\):
\[ s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1} = \frac{\sum (x_i - \bar{x})^2}{n - 1} \]
where
\(\bar{x} = \dfrac{x_1 + x_2 + \cdots + x_n}{n}\) is the sample mean;
\(n\) is the sample size.

Sample Standard Deviation

The sample standard deviation \(s\) is defined as \(s = \sqrt{s^2}\):
\[ s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} = \sqrt{\frac{\sum x_i^2 - \dfrac{\left(\sum x_i\right)^2}{n}}{n - 1}} \]
where
\(\sum x_i^2 = x_1^2 + x_2^2 + \cdots + x_n^2\) is the sum of squares;
\(\left(\sum x_i\right)^2 = (x_1 + x_2 + \cdots + x_n)^2\) is the square of the sum;
\(n\) is the sample size.
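The two forms of the sample standard deviation formula can be checked numerically against each other. The sketch below treats the 12 values of Example 1.4 as a sample, which is an assumption made only for illustration; `statistics.variance` and `statistics.stdev` are the standard-library functions that use the \(n - 1\) divisor.

```python
import math
from statistics import stdev, variance

# Treated here as a sample (assumption for illustration only).
sample = [20, 18, 15, 15, 14, 12, 11, 9, 7, 6, 4, 1]
n = len(sample)
x_bar = sum(sample) / n

# Definitional form: sum of squared deviations from the sample mean over n - 1.
s2_definition = sum((x - x_bar) ** 2 for x in sample) / (n - 1)

# Computational form: (sum of squares - square of sum / n) over n - 1.
sum_of_squares = sum(x * x for x in sample)
square_of_sum = sum(sample) ** 2
s2_shortcut = (sum_of_squares - square_of_sum / n) / (n - 1)

s = math.sqrt(s2_definition)
print(s2_definition, s2_shortcut, s)
print(variance(sample), stdev(sample))  # standard-library equivalents
```

The computational form avoids calculating each deviation separately, which is convenient for hand calculation.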

1.8.6 Coefficient of Variation
The coefficient of variation (CV) is expressed as a percentage, and it is a way of measuring the relative variability of a data set. It does not depend on the unit of measurement, so it can be used to make comparisons even when two samples differ in their means or relate to different types of measurements.
The coefficient of variation is given by:

\[ \text{CV} = \frac{\text{standard deviation}}{\text{mean}} \times 100\% \]

Example 1.9 Find the coefficients of variation of the salesman salary and the clerical salary.
                   Mean (x̄)         Standard deviation (s)
Salesman salary    $916.76/month    $286.70/month
Clerical salary    $98.50/week      $20.55/week
Solution.
\[ \text{CV}_s = \frac{286.70}{916.76} \times 100\% \approx 31\% \]
\[ \text{CV}_c = \frac{20.55}{98.50} \times 100\% \approx 21\% \]
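The calculation in Example 1.9 is short enough to verify directly. The function name below is illustrative only.

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation as a percentage of the mean."""
    return std_dev / mean * 100


cv_salesman = coefficient_of_variation(286.70, 916.76)  # monthly salary
cv_clerical = coefficient_of_variation(20.55, 98.50)    # weekly salary

print(f"Salesman: {cv_salesman:.0f}%")
print(f"Clerical: {cv_clerical:.0f}%")
```

Because the CV is unit-free, the monthly and weekly figures can be compared without converting them to a common pay period.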

Example 1.10 Evaluate the different measurements of variation based on the salary
($) scales of the two companies in Example 1.8:
Company A 5,000 15,000 25,000 35,000 45,000 55,000

Company B 5,000 5,000 5,000 55,000 55,000 55,000
(a) Range
Company A:
$55,000 − $5,000 = $50,000
Company B:
$55,000 − $5,000 = $50,000

(b) Mean absolute deviation
Company A:
\[ \frac{|5{,}000 - 30{,}000| + |15{,}000 - 30{,}000| + |25{,}000 - 30{,}000| + |35{,}000 - 30{,}000| + |45{,}000 - 30{,}000| + |55{,}000 - 30{,}000|}{6} = \$15{,}000 \]
Company B:
\[ \frac{|5{,}000 - 30{,}000| + |5{,}000 - 30{,}000| + |5{,}000 - 30{,}000| + |55{,}000 - 30{,}000| + |55{,}000 - 30{,}000| + |55{,}000 - 30{,}000|}{6} = \$25{,}000 \]

(c) Variance
Company A:
\[ \frac{(5{,}000 - 30{,}000)^2 + (15{,}000 - 30{,}000)^2 + (25{,}000 - 30{,}000)^2 + (35{,}000 - 30{,}000)^2 + (45{,}000 - 30{,}000)^2 + (55{,}000 - 30{,}000)^2}{6} = 291{,}666{,}667 \text{ (dollars squared)} \]
Company B:
\[ \frac{(5{,}000 - 30{,}000)^2 + (5{,}000 - 30{,}000)^2 + (5{,}000 - 30{,}000)^2 + (55{,}000 - 30{,}000)^2 + (55{,}000 - 30{,}000)^2 + (55{,}000 - 30{,}000)^2}{6} = 625{,}000{,}000 \text{ (dollars squared)} \]
(d) Standard deviation
Company A:
\[ \sqrt{291{,}666{,}667} = \$17{,}078 \]
Company B:
\[ \sqrt{625{,}000{,}000} = \$25{,}000 \]
(e) Coefficient of variation
Company A:
\[ \frac{\$17{,}078}{\$30{,}000} \times 100\% = 56.93\% \]
Company B:
\[ \frac{\$25{,}000}{\$30{,}000} \times 100\% = 83.33\% \]
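All five measures in Example 1.10 can be reproduced with one small function. This is a sketch using the population formulas (divisor N), as in the example; the function and key names are illustrative.

```python
import math


def dispersion_summary(values):
    """Range, MAD, population variance, s.d. and CV (%) of a dataset."""
    n = len(values)
    mu = sum(values) / n
    mad = sum(abs(x - mu) for x in values) / n        # mean absolute deviation
    variance = sum((x - mu) ** 2 for x in values) / n  # population variance
    sd = math.sqrt(variance)
    return {
        "range": max(values) - min(values),
        "mad": mad,
        "variance": variance,
        "sd": sd,
        "cv_percent": sd / mu * 100,
    }


company_a = [5_000, 15_000, 25_000, 35_000, 45_000, 55_000]
company_b = [5_000, 5_000, 5_000, 55_000, 55_000, 55_000]

summary_a = dispersion_summary(company_a)
summary_b = dispersion_summary(company_b)
for name, summary in (("A", summary_a), ("B", summary_b)):
    print(name, {key: round(value, 2) for key, value in summary.items()})
```

The two companies share the same range, but every other measure shows that the salaries in Company B are more widely dispersed.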

1.8.7 Measures of Grouped Data
For grouped data, the mean can be found by
\[ \bar{x} = \frac{\sum_{i=1}^{k} f_i x_i}{n} \]
where
\(x_i\) is the class mark of the i-th class;
\(f_i\) is the frequency of the i-th class;
\(k\) is the number of classes;
\(n = \sum_{i=1}^{k} f_i\) is the total frequency (i.e. the sample size).

For grouped data, the median can be found by first identifying the median class, which is the class containing the median. Then apply the following formula to the median class:
\[ \text{median} = L_1 + \frac{\frac{n}{2} - C}{f_m} \times (L_2 - L_1) \]
where
\(L_1\) is the lower class boundary of the median class;
\(n\) is the total frequency (i.e. the sample size);
\(C = f_1 + \cdots + f_{m-1}\) is the cumulative frequency just before the median class;
\(f_m\) is the frequency of the median class;
\(L_2\) is the upper class boundary of the median class.

[Figure: the median class, with boundaries L1 and L2 and frequency fm; the classes to its left have cumulative frequency C, and the median cuts off a further n/2 − C observations within the class.]

For grouped data, the mode can be found by first identifying the modal class, which is the class with the highest frequency. Then apply the following formula to the modal class:
\[ \text{mode} = L_1 + \frac{d_1}{d_1 + d_2} \times (L_2 - L_1) \]
where
\(L_1\) is the lower class boundary of the modal class;
\(d_1\) is the difference between the frequency of the modal class and that of the previous class, and is always positive;
\(d_2\) is the difference between the frequency of the modal class and that of the following class, and is always positive;
\(L_2\) is the upper class boundary of the modal class.

Geometrically, the mode can be represented on the histogram and obtained by using similar triangle properties.
[Figure: the modal class and its two neighbouring classes; the mode divides the interval from L1 to L2 in the ratio d1 : d2.]

The sample standard deviation (s.d.) is
\[ s = \sqrt{\frac{\sum f_i (x_i - \bar{x})^2}{\sum f_i - 1}} = \sqrt{\frac{\sum f_i x_i^2 - \dfrac{\left(\sum f_i x_i\right)^2}{\sum f_i}}{\sum f_i - 1}} \]
where
\(x_i\) is the class mark of the i-th class;
\(f_i\) is the frequency of the i-th class;
\(\sum f_i\) is the total frequency (i.e. the sample size);
\(\sum f_i x_i^2 = f_1 x_1^2 + f_2 x_2^2 + \cdots + f_k x_k^2\);
\(\left(\sum f_i x_i\right)^2 = (f_1 x_1 + f_2 x_2 + \cdots + f_k x_k)^2\).

Example 1.11 The following table shows gas consumption data for 100 cars during a
specific time period:
Gas consumption Frequency
10 − 19 1
20 − 29 0
30 − 39 1
40 − 49 4
50 − 59 7
60 − 69 16
70 − 79 19
80 − 89 20
90 − 99 17
100 − 109 11
110 − 119 3
120 − 129 1

Example 1.11
Find the sample mean, median, Q1 , Q3 , mode, and s.d..
Solution.
Gas consumption  Frequency (f_i)  Class boundary   Class mark (x_i)  f_i x_i  f_i x_i^2
10 − 19          1                9.5 − 19.5       14.5              14.5     210.25
20 − 29          0                19.5 − 29.5      24.5              0        0
30 − 39          1                29.5 − 39.5      34.5              34.5     1190.25
40 − 49          4                39.5 − 49.5      44.5              178      7921
50 − 59          7                49.5 − 59.5      54.5              381.5    20791.75
60 − 69          16               59.5 − 69.5      64.5              1032     66564
70 − 79          19               69.5 − 79.5      74.5              1415.5   105454.75
80 − 89          20               79.5 − 89.5      84.5              1690     142805
90 − 99          17               89.5 − 99.5      94.5              1606.5   151814.25
100 − 109        11               99.5 − 109.5     104.5             1149.5   120122.75
110 − 119        3                109.5 − 119.5    114.5             343.5    39330.75
120 − 129        1                119.5 − 129.5    124.5             124.5    15500.25
Sum              n = 100                                             7970     671705

(a) Sample mean

\[
\bar{x} = \frac{\sum f_i x_i}{n} = \frac{\sum f_i x_i}{\sum f_i}
= \frac{1 \times 14.5 + 0 \times 24.5 + 1 \times 34.5 + \cdots + 1 \times 124.5}{1 + 0 + 1 + \cdots + 1}
= \frac{7970}{100} = 79.7
\]
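
The arithmetic in part (a) can be checked with a few lines of Python. This is only a sketch for verification; the variable names are illustrative, and the class marks are computed as the midpoints of the class limits given in the table.

```python
# Class limits and frequencies from the Example 1.11 table.
limits = [(lo, lo + 9) for lo in range(10, 130, 10)]  # (10, 19), ..., (120, 129)
freqs = [1, 0, 1, 4, 7, 16, 19, 20, 17, 11, 3, 1]

# Class mark x_i = midpoint of each class.
marks = [(lo + hi) / 2 for lo, hi in limits]

n = sum(freqs)                                     # total frequency
sum_fx = sum(f * x for f, x in zip(freqs, marks))  # sum of f_i * x_i
mean = sum_fx / n

print(n, sum_fx, mean)
```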

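(b) Median, Q1, Q3

The median class is 80 − 89, with $L_1 = 79.5$, $L_2 = 89.5$, $f_m = 20$ and $C = 48$:
\[
\text{median} = L_1 + \frac{\frac{n}{2} - C}{f_m} \times (L_2 - L_1)
= 79.5 + \frac{50 - 48}{20} \times (89.5 - 79.5) = 80.5
\]
Similarly,
\[
Q_1 = 59.5 + \frac{25 - 13}{16} \times (69.5 - 59.5) = 67
\]
\[
Q_3 = 89.5 + \frac{75 - 68}{17} \times (99.5 - 89.5) = 93.6176
\]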
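
The quartiles use the same interpolation as the median, with n/2 replaced by the target cumulative count (n/4 for Q1, 3n/4 for Q3). A sketch under that assumption, applied to the Example 1.11 frequencies; `grouped_quantile` is a hypothetical helper name, not something defined in these notes.

```python
from itertools import accumulate


def grouped_quantile(boundaries, freqs, p):
    """Interpolated quantile of grouped data for a proportion p (0 < p < 1)."""
    n = sum(freqs)
    target = p * n                      # n/4, n/2 or 3n/4 for the quartiles
    cumulative = list(accumulate(freqs))
    # Assumption: use the first class whose cumulative frequency reaches the target.
    k = next(i for i, c in enumerate(cumulative) if c >= target)
    lower, upper = boundaries[k]
    c_before = cumulative[k - 1] if k > 0 else 0
    return lower + (target - c_before) / freqs[k] * (upper - lower)


# Class boundaries 9.5-19.5, ..., 119.5-129.5 and frequencies from Example 1.11.
boundaries = [(lo - 0.5, lo + 9.5) for lo in range(10, 130, 10)]
freqs = [1, 0, 1, 4, 7, 16, 19, 20, 17, 11, 3, 1]

q1 = grouped_quantile(boundaries, freqs, 0.25)
med = grouped_quantile(boundaries, freqs, 0.50)
q3 = grouped_quantile(boundaries, freqs, 0.75)
print(q1, med, round(q3, 4))
```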

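(c) Mode

[Figure: histogram of the modal class from 79.5 to 89.5 (frequency 20) with neighbouring frequencies 19 and 17, so d1 = 1 and d2 = 3; the mode divides the class in the ratio 1 : 3.]

\[
\text{mode} = L_1 + \frac{d_1}{d_1 + d_2} \times (L_2 - L_1)
= 79.5 + \frac{20 - 19}{(20 - 19) + (20 - 17)} \times (89.5 - 79.5) = 82
\]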

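(d) s.d.

\[
\begin{aligned}
s &= \sqrt{\frac{\sum f_i x_i^2 - \dfrac{\left(\sum f_i x_i\right)^2}{\sum f_i}}{\sum f_i - 1}}
   = \sqrt{\frac{\left(\sum f_i\right)\left(\sum f_i x_i^2\right) - \left(\sum f_i x_i\right)^2}{\left(\sum f_i\right)\left(\sum f_i - 1\right)}} \\
  &= \sqrt{\frac{n\left(\sum f_i x_i^2\right) - \left(\sum f_i x_i\right)^2}{n(n - 1)}}
   = \sqrt{\frac{100(671705) - (7970)^2}{100(100 - 1)}} \\
  &= 19.2
\end{aligned}
\]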
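
As a numerical check of part (d), the sketch below evaluates the standard deviation two ways, from the squared deviations about the mean and from the shortcut formula, using the class marks and frequencies of Example 1.11. The variable names are illustrative only.

```python
from math import sqrt

# Class marks 14.5, 24.5, ..., 124.5 and frequencies from Example 1.11.
marks = [lo + 4.5 for lo in range(10, 130, 10)]
freqs = [1, 0, 1, 4, 7, 16, 19, 20, 17, 11, 3, 1]

n = sum(freqs)
sum_fx = sum(f * x for f, x in zip(freqs, marks))       # sum of f_i * x_i
sum_fx2 = sum(f * x * x for f, x in zip(freqs, marks))  # sum of f_i * x_i^2
mean = sum_fx / n

# Definition: squared deviations about the mean, divided by n - 1.
s_definition = sqrt(sum(f * (x - mean) ** 2 for f, x in zip(freqs, marks)) / (n - 1))

# Shortcut form used in the worked solution.
s_shortcut = sqrt((n * sum_fx2 - sum_fx ** 2) / (n * (n - 1)))

print(round(s_definition, 4), round(s_shortcut, 4))
```

Both values should agree with each other up to floating-point rounding and with the 19.2 obtained above to one decimal place.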

1.8.8 Skewness
Skewness is an abstract quantity that shows how the data are piled up. A number of
measures have been suggested to determine the skewness of a given distribution.
If the longer tail is on the right, we say that the distribution is skewed to the right, and the
coefficient of skewness is positive.
If the longer tail is on the left, we say that the distribution is skewed to the left, and the
coefficient of skewness is negative.

Example 1.12
[Figure: two distribution curves, one skewed to the left (negatively skewed) and one skewed to the right (positively skewed).]

Pearson’s 1st coefficient of skewness
\[
SK_1 = \frac{\text{mean} - \text{mode}}{\text{standard deviation}}
\]
Pearson’s 2nd coefficient of skewness
\[
SK_2 = \frac{3(\text{mean} - \text{median})}{\text{standard deviation}}
\]
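Note:
For moderately skewed data, the two coefficients are approximately equal:
\[
\text{skewness} = \frac{\text{mean} - \text{mode}}{\text{standard deviation}}
\approx \frac{3(\text{mean} - \text{median})}{\text{standard deviation}}
\]
Then
\[
\text{mean} - \text{mode} \approx 3(\text{mean} - \text{median})
\]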
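
As an illustration only, the two coefficients can be evaluated for the summary values found in Example 1.11 (mean 79.7, median 80.5, mode 82, s.d. 19.2). The function names `pearson_sk1` and `pearson_sk2` are hypothetical, and applying the coefficients to that example is an addition here, not a step taken in the notes.

```python
def pearson_sk1(mean, mode, sd):
    """Pearson's 1st coefficient of skewness."""
    return (mean - mode) / sd


def pearson_sk2(mean, median, sd):
    """Pearson's 2nd coefficient of skewness."""
    return 3 * (mean - median) / sd


# Rounded summary values taken from Example 1.11.
mean, median, mode, sd = 79.7, 80.5, 82.0, 19.2

sk1 = pearson_sk1(mean, mode, sd)
sk2 = pearson_sk2(mean, median, sd)
print(round(sk1, 4), round(sk2, 4))
```

Both coefficients come out slightly negative and close to each other, which is consistent with the longer left tail of the gas consumption frequencies and with the approximation in the note.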
