
Descriptive Statistics

Chapter 1
AMA1501/1602
Introduction to Statistics for Business
Contents

1.1 Introduction
1.2 Some Basic Definitions
1.3 Method of Data Collection
1.4 Primary and Secondary Data
1.5 Graphical Descriptions of Data
1.6 Frequency Distribution
1.7 Central Tendency
1.8 Dispersion and Skewness

AMA1501/1602 Introduction to Statistics for Business Chapter 1 - Descriptive Statistics 2 / 82


1.1 Introduction
Statistics is concerned with the scientific method by which information is
collected, organized, analysed and interpreted for the purpose of description and
decision making.
Examples of statistics in everyday use include the Hang Seng Index, life and car
insurance rates, the unemployment rate, and the Consumer Price Index.

There are two subdivisions of statistical method:


Descriptive Statistics – It deals with the presentation of numerical facts, or data,
in either table or graph form, and with the methodology of analysing the data.
Inferential Statistics – It involves techniques for making inferences about the
whole population on the basis of observations obtained from samples.



1.2 Some Basic Definitions

Population – A population is the group from which data are to be collected.


Sample – A sample is a subset of a population.
Variable – A variable is a characteristic of any member of a population that
differs in quality or quantity from one member to another.
Quantitative variable – A variable differing in quantity is called a quantitative
variable, for example, the weight of a person or the number of people in a car.
Qualitative variable – A variable differing in quality is called a qualitative variable
or attribute, for example, color or the degree of damage of a car in an accident.
Discrete variable – A discrete variable is one for which no value may be assumed
between two given adjacent values, for example, the number of children in a family.
Continuous variable – A continuous variable is one for which any value may be
assumed between two given values, for example, the time for a 100-meter run.



1.3 Method of Data Collection

Statistics very often involves the collection of data. There are many ways to obtain
data, and the World Wide Web is one of them. The advantages and disadvantages of
common data collection methods are discussed below.

1.3.1 Postal/Online Questionnaire

The main advantages are:


The apparent low cost compared with other methods, although the cost per useful answer may well be high.
No need for a closely grouped sample as in personal interviews, since the Post Office is acting as a field force.
There is no interviewer bias.
A considered reply can be given – the respondent has time to consult any necessary documents.

The main disadvantages are:


The whole questionnaire can be read before answering (which in some circumstances is undesirable).
Spontaneous answers cannot be collected. Only simple questions and instructions can be given.
The wrong person may complete the form.
Other persons’ opinions may be given, e.g. by a wife consulting her husband.
No control is possible over the speed of the reply.
A poor “response rate” (a low percentage of replies) will be obtained.

1.3.2 Telephone Interview

The main advantages are:


It is cheaper than personal interviews but tends to be dearer on average than postal questionnaires.
It can be carried out relatively quickly.
Help can be given if the person does not understand the question as worded.
The telephone can be used in conjunction with other survey methods, e.g. for encouraging replies to postal surveys or making appointments for personal interviews.
Spontaneous answers can be obtained.

The main disadvantages are:


In some countries not everybody owns a telephone; therefore, a survey carried out among telephone owners would be biased towards the upper social classes of the community. The telephone can, however, be used in industrial market research anywhere, since businesses are invariably on the telephone.
It is easy to refuse to be interviewed on the telephone simply by replacing the receiver. The response rate tends to be higher than for postal surveys but not as high as when personal interviews are used.
As in the postal questionnaire, it is not possible to check the characteristics of the person who is replying, particularly with regard to age and social class.
The questionnaire cannot be too long or too involved.

1.3.3 Personal Interview
In market research this is by far the most commonly used way of collecting
information from the general public.
The main advantages are:
A trained person may assess the interviewees in terms of age, social class and area of residence, and even sometimes assess the accuracy of the information given (e.g. by checking the pantry to see if certain goods are really there).
Help can be given to those respondents who are unable to understand the questions, although great care has to be taken that the interviewer’s own feelings do not enter into the wording of the question and so influence the answers of the respondents.
A well-trained interviewer can persuade a person to give an interview who might otherwise have refused on a postal or telephone enquiry, so that a higher response rate, giving a more representative cross-section of views, is obtained.
A great deal more information can be collected than is possible by the previous methods. Interviews of three quarters of an hour are commonplace, and a great deal of information can be gathered in this time.
The main disadvantages are:
It is far more expensive than other methods because interviewers have to be recruited, trained and paid a suitable salary and expenses.
The interviewer may consciously or unconsciously bias the answers to the question, in spite of being trained not to do so.
Persons may not like to give confidential or embarrassing information at a face-to-face interview.
In general, people may tend to give information that they feel will impress the interviewer, and show themselves in a better light, e.g. by claiming to read “quality” newspapers.
There is a possibility that the interviewer will cheat by not carrying out the interview or carrying out only parts of it. All reputable organizations carry out quality control checks to lessen the chances of this happening.
Some types of people are more difficult to locate and interview than others, e.g. travellers. While this may not be important in some surveys, it will be in others, such as car surveys.
1.3.4 Observation
Observation can be carried out by trained observers, cameras, or closed-circuit
television.
Observation is used in a wide range of fields. For instance, anthropologists may live in
a primitive society, and social workers may become factory workers to learn about the
habits and customs of the community they are studying.
Observation can also be used in “before and after” studies, for example, observing the
flow of traffic in a supermarket before and after changes to the store layout. In
industry, many Work Study techniques rely on observing individuals or groups of
workers to establish the system of movements they employ, with the aim of eliminating
wasteful effort.
If there are not enough trained observers available or the movements are complicated,
cameras may be used, allowing for detailed analysis by repeatedly running the film.
Quality control checks and retail audits, a branch of market research, may also be
considered observation techniques.
The main advantages are:


The actual actions or habits of persons are observed, not what the persons say they would do when questioned. It is interesting to note that in one study only 40% of families who stated they were going to buy a new car had actually bought one when called upon a year later.
Observation may keep the system undisturbed. To maintain accuracy, it is in some cases undesirable for people to know that an experiment or change is to be made, or is taking place.

The main disadvantages are:


The results of the observations depend on the skill and impartiality of the observer.
It is often difficult in practice to obtain a truly random sample of persons or events.
It is difficult to predict future behaviour from pure observation.
It is not possible to observe actions which took place before the study was contemplated.
Opinions and attitudes cannot usually be obtained by observation.
In marketing, the frequency of a person’s purchase cannot be obtained by pure observation. Nor can such forms of behaviour as church-going, smoking and crossing roads, except by employing a continuous and lengthy (and hence detectable) period of observation.
1.3.5 Reports and Published Statistics

International organizations such as the United Nations provide useful data. Governments
also publish statistics on population, trade, production, and other topics. Reports on
specialized topics, including scientific research, are available from governments, trade
organizations, trade unions, universities, professional and scientific organizations, and
local authorities. The World Wide Web is also an efficient source of data.



1.4 Primary and Secondary Data

Primary data refers to data that is collected directly from the source for the first
time by the researcher. This data is original and has not been previously collected,
published, or analysed by anyone else. Examples of primary data collection
methods include surveys, interviews, observations, and experiments.
Secondary data, on the other hand, refers to data that has been collected and
published by someone else or for another purpose. This data has already been
analysed and interpreted by others, and the researcher uses it for their own
analysis. Examples of secondary data sources include government statistics,
academic journals, books, and reports.

Both primary and secondary data have their own advantages and disadvantages.

Primary data is more specific to the research question and can be tailored to the
researcher’s needs, but it can be time-consuming and expensive to collect.
Secondary data can be easily accessible and less expensive, but it may not be as
accurate or relevant to the research question as primary data. Therefore,
researchers often use a combination of both types of data to obtain a
comprehensive understanding of the phenomenon under investigation.



1.5 Graphical Descriptions of Data

A graph is a method of presenting statistical data in visual form. The main purpose of
any chart is to give a quick, easy-to-read-and-interpret pictorial representation of data
which is more difficult to obtain from a table or a complete listing of the data.

Some basic rules for the construction of a statistical chart are listed below:
1 A clear and concise title that provides enough identification of the graph must be
included.
2 Each scale should have a scale caption that indicates the units used.
3 The zero point on the co-ordinate scale should be indicated. In situations where
space is limited, a scale break may be inserted to indicate its omission.
4 All items presented in the graph must be clearly labeled and legible, even when
viewed in black and white reprint.

There are many varieties of graphs. The most commonly used graphs are listed below
and described in turn.

Pie chart
Simple bar chart
Bi-directional bar chart
Multiple bar chart
Component bar chart
Other types of graphs

Pie chart – Pie charts are widely used to show the component parts of a total. They
are popular because of their simplicity. In constructing a pie chart, the angle of each
slice at the center must be in proportion to its percentage of the total. The following
pie chart gives the percentage distribution of educational attainment in Hong Kong for
the population aged 15 and over who reside on land and are non-institutionalized, in
the year 2022.

Source: Census and Statistics Department, HKSAR


Simple bar chart – The horizontal bar chart is also a simple and popular chart. Like
the pie chart, the simple horizontal bar chart is a one-scale chart. In constructing a bar
chart, the width of the bars is not important, but the length of each bar must be in
proportion to the data. The following bar chart gives the monthly household income of
Hong Kong in the year 2022.

Source: Census and Statistics Department, HKSAR


Bi-directional bar chart – A bar chart can use either horizontal or vertical bars. A
bi-directional bar chart indicates both the positive and negative values. The following
example gives the highest and lowest recorded temperatures in 5 states across the
United States.

Source: National Centers for Environmental Information


Multiple bar chart – A multiple bar chart is a useful tool for quickly comparing
different sets of data. In the following example, the marital status of males and
females in Hong Kong in the year 2022 is compared using a multiple bar chart.

Source: Census and Statistics Department, HKSAR


Component bar chart – A component bar chart subdivides the bars into different
sections. It is useful when the total of the components is of interest. The following
example gives the nutrient values of food.

Source: U.S. Department of Agriculture


Other types of graphs – Graphic presentations can be made more attractive through
the use of careful layout and appropriate symbols. Sometimes information pertaining
to different geographical areas can even be presented through the use of a so-called
statistical map.

https://ec.europa.eu/eurostat/web/gisco/gisco-activities/statistical-atlas
1.6 Frequency Distribution

Statistical data obtained by means of censuses, sample surveys or experiments usually
consist of raw, unorganized sets of numerical values. Before these data can be used as
a basis for inferences about the phenomenon under investigation or as a basis for
decision, they must be summarized and the pertinent information must be extracted.

A useful method for summarizing a set of data is the construction of a frequency table,
or a frequency distribution. That is, we divide the overall range of values into a
number of “classes” (or “bins”, “intervals”) and count the number of observations that
fall into each of these classes.

The general rules for constructing a frequency distribution are:


1 There should be an appropriate number of classes, with neither too few nor too
many.
2 The classes should have equal widths wherever possible, but the first and last
classes may be open-ended to cater for extreme values.
3 Class limits represent the largest and smallest data values that can be included in
the class. Class limits are actual data values.
4 Class boundaries provide values that eliminate gaps between the classes in the
frequency distribution. To find a class boundary, average the upper class limit of
one class and the lower class limit of the next class.
5 Each class should be represented by a class mark, also known as the class
midpoint of the i-th class. This can be found by calculating the simple average of
the class boundaries or the class limits of the same class.
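
Rules 4 and 5 can be sketched in a few lines of Python. This is only an illustration: the function names and the sample class limits are assumptions, and the outermost boundaries are obtained by extending the first and last limits by the same half-gap, which presumes equal gaps between adjacent classes.

```python
# Hypothetical class limits (lower limit, upper limit), width 5, starting at 5.
class_limits = [(5, 9), (10, 14), (15, 19), (20, 24)]


def class_boundaries(limits):
    """Return (lower boundary, upper boundary) for each class.

    A boundary is the average of the upper limit of one class and the
    lower limit of the next, i.e. each limit is moved by half the gap.
    """
    gap = limits[1][0] - limits[0][1]  # e.g. 10 - 9 = 1
    half = gap / 2
    return [(lower - half, upper + half) for lower, upper in limits]


def class_marks(limits):
    """Return the midpoint (class mark) of each class."""
    return [(lower + upper) / 2 for lower, upper in limits]


boundaries = class_boundaries(class_limits)
marks = class_marks(class_limits)
print(boundaries)
print(marks)
```

Averaging the class boundaries instead of the class limits gives the same class marks, as rule 5 states.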



Example 1.1

Example 1.1 A traffic inspector has counted the number of automobiles passing a
certain point in 100 successive 20-minute time periods. The observations are listed
below.

23 20 16 18 30 22 26 15 5 18
14 17 11 37 21 6 10 20 22 25
19 19 19 20 12 23 24 17 18 16
27 16 28 26 15 29 19 35 20 17
12 30 21 22 20 15 18 16 23 24
15 24 28 19 24 22 17 19 8 18
17 18 23 21 25 19 20 22 21 21
16 20 19 11 23 17 23 13 17 26
26 14 15 16 27 18 21 24 33 20
21 27 18 22 17 20 14 21 22 19

1 Setting up the classes
Choosing a class width of 5 for each class gives seven classes: from 5 to 9, from
10 to 14, . . ., and from 35 to 39.

2 Tallying and counting

Class      Count
 5 − 9        3
10 − 14       9
15 − 19      36
20 − 24      35
25 − 29      12
30 − 34       3
35 − 39       2

3 Illustrating the data in tabular form

Frequency distribution for the traffic data

Number of autos per period   Number of periods
         5 − 9                       3
        10 − 14                      9
        15 − 19                     36
        20 − 24                     35
        25 − 29                     12
        30 − 34                      3
        35 − 39                      2
        Total                      100
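
The tallying step can be checked with a short Python sketch. The observations are those of Example 1.1; the names `WIDTH`, `START` and `class_of` are illustrative assumptions rather than part of the example.

```python
from collections import Counter

# The 100 observations from Example 1.1.
observations = [
    23, 20, 16, 18, 30, 22, 26, 15, 5, 18,
    14, 17, 11, 37, 21, 6, 10, 20, 22, 25,
    19, 19, 19, 20, 12, 23, 24, 17, 18, 16,
    27, 16, 28, 26, 15, 29, 19, 35, 20, 17,
    12, 30, 21, 22, 20, 15, 18, 16, 23, 24,
    15, 24, 28, 19, 24, 22, 17, 19, 8, 18,
    17, 18, 23, 21, 25, 19, 20, 22, 21, 21,
    16, 20, 19, 11, 23, 17, 23, 13, 17, 26,
    26, 14, 15, 16, 27, 18, 21, 24, 33, 20,
    21, 27, 18, 22, 17, 20, 14, 21, 22, 19,
]

WIDTH = 5  # class width chosen in step 1
START = 5  # lower limit of the first class


def class_of(x):
    """Return the (lower limit, upper limit) of the class containing x."""
    lower = START + ((x - START) // WIDTH) * WIDTH
    return (lower, lower + WIDTH - 1)


# Tally the observations by class.
counts = Counter(class_of(x) for x in observations)

for lower, upper in sorted(counts):
    print(f"{lower:>2} - {upper:<2}  {counts[(lower, upper)]:>3}")
print("Total ", sum(counts.values()))
```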

1.6.1 Histogram

A histogram is usually used to present frequency distributions graphically. It is
constructed by drawing a rectangle over each class. The area of each rectangle should
be proportional to the frequency of its class.

Notes:
The vertical lines of a histogram should be drawn at the class boundaries.
The range of the variable should occupy the major portion of the graph of a
frequency distribution. If the smallest observation is far away from zero, then an
“axis break” sign should be introduced on the horizontal axis.



Example 1.2

Example 1.2 (a) Construct a histogram for the traffic data in Example 1.1.

Class limit   Class boundary   Frequency
  5 − 9         4.5 − 9.5           3
 10 − 14        9.5 − 14.5          9
 15 − 19       14.5 − 19.5         36
 20 − 24       19.5 − 24.5         35
 25 − 29       24.5 − 29.5         12
 30 − 34       29.5 − 34.5          3
 35 − 39       34.5 − 39.5          2

1.6.2 Frequency Polygon

Another method of representing a frequency distribution graphically is the frequency
polygon. As in the histogram, the base line is divided into sections corresponding to
the class intervals, but instead of drawing rectangles, the points plotted at successive
class marks are connected.

The frequency polygon is particularly useful when two or more distributions are to be
presented for comparison on the same graph.

Note:
A frequency curve can be obtained by smoothing the frequency polygon.

Example 1.2 (b) Construct a frequency polygon for the traffic data in Example 1.1.

Class limit   Class boundary   Class mark   Frequency
  5 − 9         4.5 − 9.5           7            3
 10 − 14        9.5 − 14.5         12            9
 15 − 19       14.5 − 19.5         17           36
 20 − 24       19.5 − 24.5         22           35
 25 − 29       24.5 − 29.5         27           12
 30 − 34       29.5 − 34.5         32            3
 35 − 39       34.5 − 39.5         37            2

1.6.3 Cumulative Frequency Distribution and Cumulative Polygon

Cumulative frequency distribution shows the total number of observations that fall
below a given value or range of values in a dataset.

There are several reasons why cumulative frequency distribution is useful:


1 It provides a quick and easy way to visualize the distribution of a dataset.
2 It helps to identify outliers and extreme values in the dataset.
3 It allows us to calculate percentiles and quartiles easily.
4 It can be used to compare two or more datasets.
Overall, cumulative frequency distribution is a useful tool for anyone working with
data, whether in the fields of statistics, economics, psychology, or any other discipline
that deals with quantitative data.



Example 1.3

Example 1.3 Construct a cumulative frequency polygon (also called an ogive) of the
distribution of traffic data in Example 1.1.

Cumulative frequency distribution for the traffic data

Number of autos per period   Less than   Cumulative frequency
         5 − 9                  9.5                 3
        10 − 14                14.5                12
        15 − 19                19.5                48
        20 − 24                24.5                83
        25 − 29                29.5                95
        30 − 34                34.5                98
        35 − 39                39.5               100
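
The cumulative column is a running total of the class frequencies, which can be sketched as follows. The variable names are illustrative, and starting the ogive at the point (4.5, 0) is a common plotting convention assumed here rather than something stated in the example.

```python
from itertools import accumulate

# Upper class boundaries and class frequencies from Example 1.1.
upper_boundaries = [9.5, 14.5, 19.5, 24.5, 29.5, 34.5, 39.5]
frequencies = [3, 9, 36, 35, 12, 3, 2]

# Running total: number of observations below each upper boundary.
cumulative = list(accumulate(frequencies))

# Points to join when drawing the ogive; (4.5, 0) is an assumed starting
# point, since no observation lies below the first lower boundary.
ogive_points = [(4.5, 0)] + list(zip(upper_boundaries, cumulative))

for boundary, total in ogive_points:
    print(f"less than {boundary:>4}: {total}")
```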

A cumulative frequency curve can similarly be drawn by smoothing the cumulative
frequency polygon.

1.6.4 Relative Frequency

Relative frequency distribution shows the proportion or percentage of observations that
fall within a given class interval in a dataset.

Relative frequency of a class is defined as:

\[
\text{Relative frequency} = \frac{\text{Frequency of the class}}{\text{Total frequency}}
\]
If the frequencies are changed to relative frequencies, then a relative frequency
histogram, a relative frequency polygon and a relative frequency curve can similarly be
constructed.
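
As a small illustration, the relative frequencies for the traffic data of Example 1.1 can be computed directly from the definition. The variable names are assumptions made for this sketch.

```python
# Class frequencies from Example 1.1.
frequencies = [3, 9, 36, 35, 12, 3, 2]

total = sum(frequencies)

# Relative frequency = frequency of the class / total frequency.
relative = [f / total for f in frequencies]

for f, r in zip(frequencies, relative):
    print(f"frequency {f:>2} -> relative frequency {r:.2f} ({r:.0%})")
```

The relative frequencies always add up to 1 (or 100%), which is a useful check.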



1.7 Central Tendency

When we work with numerical data, it is apparent that in most sets of data there is
a tendency for the observed values to group themselves about some interior value.
Some central value seems to be characteristic of the data. This phenomenon is
referred to as central tendency.

For a given set of data, the measure of location we use depends on what we mean by
middle; different definitions give rise to different measures. We shall consider some
more commonly used measures, namely arithmetic mean, median and mode.

1.7.1 Arithmetic Mean

The arithmetic population mean, µ, or simply the mean, is obtained by adding
together all of the measurements and dividing by the total number of measurements
taken:

\[
\mu = \frac{x_1 + x_2 + \cdots + x_N}{N} = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{\sum x_i}{N}
\]

where
$x_i$ is the value of the i-th item;
$N$ is the population size.

The arithmetic mean can be calculated for any set of numerical data and it is always
unique. However, extreme values affect the mean. In addition, the arithmetic mean
gives every observation equal weight, so it ignores the degree of importance of
different categories of data.



Example 1.4

Example 1.4 Given the following set of ungrouped data:

20, 18, 15, 15, 14, 12, 11, 9, 7, 6, 4, 1

Find the mean of the ungrouped data.

Solution.

\[
\text{mean} = \frac{20 + 18 + 15 + 15 + 14 + 12 + 11 + 9 + 7 + 6 + 4 + 1}{12} = \frac{132}{12} = 11
\]
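
The same calculation can be sketched in Python, once from the formula and once with the standard library's `statistics.mean`. The variable names are illustrative.

```python
from statistics import mean

# Ungrouped data from Example 1.4.
data = [20, 18, 15, 15, 14, 12, 11, 9, 7, 6, 4, 1]

# Direct application of the formula: sum of the items divided by N.
mean_by_formula = sum(data) / len(data)

# The standard library gives the same value.
mean_by_library = mean(data)

print(sum(data), len(data), mean_by_formula)
```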

1.7.2 Median

Median is defined as the middle item of all given observations arranged in order. It is
the value that separates the lower 50% of the dataset from the upper 50%.

If the dataset has an odd number of values, the median is the middle value.
If the dataset has an even number of values, the median is the average of the two
middle values.

The median is a useful measure of central tendency for datasets that have outliers or
are skewed, as it is less sensitive to extreme values than the mean. It is also useful for
ordinal data, where the values have a natural order but the differences between them
may not be equally meaningful.



Example 1.5

Example 1.5 Find the median of the ungrouped data:

20, 18, 15, 15, 14, 12, 11, 9, 7, 6, 4, 1

Solution. The observations are already arranged in order. There are 12 observations,
so the median is the average of the two middle values, 12 and 11:

\[
\text{median} = \frac{12 + 11}{2} = 11.5
\]
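
A minimal Python sketch of the odd/even rule for the median, checked against `statistics.median`; the helper name `median_of` is an assumption made for illustration.

```python
from statistics import median

# Ungrouped data from Example 1.5.
data = [20, 18, 15, 15, 14, 12, 11, 9, 7, 6, 4, 1]


def median_of(values):
    """Median by the rule above: the middle value for an odd count,
    the average of the two middle values for an even count."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        return ordered[n // 2]
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


med = median_of(data)
print(med)
```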

1.7.3 Mode

Mode is the value which occurs most frequently. The mode may not exist, and even if
it does, it may not be unique.

For ungrouped data, we simply find the value with the largest frequency. If all values
have the same frequency, no mode exists. If more than one value has the same largest
frequency, then the mode is not unique.

Note that the mode is independent of extreme values and it may be applied to
qualitative data.



Example 1.6

Example 1.6 Find the mode of the ungrouped data:

20, 18, 15, 15, 14, 12, 11, 9, 7, 6, 4, 1

Solution. The value which appears most frequently is 15. Therefore,

mode = 15



Example 1.7

Example 1.7 Find the mode of the ungrouped data:

2, 2, 2, 4, 5, 6, 7, 7, 7

Solution. The values which appear most frequently are 2 and 7. Therefore,

mode = 2 and 7

Note:
A dataset with two modes is said to be bimodal, while a set with more than two
modes may be described as multimodal. Having multiple modes can indicate that the
data is complex or has multiple underlying patterns or processes. It is important to
identify and interpret all the modes in the dataset to gain a clear understanding of the
distribution of the data.
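
The counting rule for the mode can be sketched as follows. The function name `modes` is an assumption, and the treatment of ties follows the convention of this chapter (no mode when every value is equally frequent), which differs from `statistics.multimode` in the Python standard library.

```python
from collections import Counter


def modes(values):
    """Return the list of modes of `values`, sorted.

    Returns an empty list when two or more distinct values all share
    the same frequency, i.e. no mode exists.
    """
    counts = Counter(values)
    highest = max(counts.values())
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == highest)


print(modes([20, 18, 15, 15, 14, 12, 11, 9, 7, 6, 4, 1]))  # Example 1.6
print(modes([2, 2, 2, 4, 5, 6, 7, 7, 7]))                  # Example 1.7
print(modes([1, 2, 3, 4]))                                 # no mode
```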

1.7.4 Conclusion

For symmetrically distributed and unimodal data (i.e. only one mode), the mean,
median and mode can be used almost interchangeably.

Mean can be interpreted as the center of gravity of the distribution.

Median divides the area of the distribution into two equal parts.

Mode is the highest point of the distribution.



1.8 Dispersion and Skewness
Sometimes mean, median and mode may not be able to reflect the true picture of
some data. The following example explains the reason.

Example 1.8 There were two companies, Company A and Company B. Their salary
profiles, given in terms of mean, median and mode, were as follows:

          Company A   Company B
Mean      $30,000     $30,000
Median    $30,000     $30,000
Mode      Nil         Nil

However, their detailed salary ($) structures were completely different:

Company A   5,000   15,000   25,000   35,000   45,000   55,000
Company B   5,000    5,000    5,000   55,000   55,000   55,000

Hence it is necessary to have some measure of how the data are scattered. That is, we
want to know the dispersion, or variability, in a set of data.

1.8.1 Range

Range is the difference between the largest and smallest values in the dataset. It gives
an idea of how spread out the data is, and how much variability there is in the dataset.

The range is easy to calculate, but it cannot be obtained when open-ended grouped
data are given.

It is also sensitive to outliers and extreme values, and it does not take into
account the distribution of the data.
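
As an illustration of the last point, the short sketch below computes the range for the two salary structures of Example 1.8. The function name `value_range` is an assumption. Both companies have the same range even though their salaries are distributed very differently.

```python
# Salary structures ($) from Example 1.8.
company_a = [5000, 15000, 25000, 35000, 45000, 55000]
company_b = [5000, 5000, 5000, 55000, 55000, 55000]


def value_range(values):
    """Range = largest value minus smallest value."""
    return max(values) - min(values)


range_a = value_range(company_a)
range_b = value_range(company_b)
print(range_a, range_b)
```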

1.8.2 Deciles, Percentiles, and Fractiles

Decile divides the distribution into ten equal parts. There are nine deciles such that
10% of the data are ≤ D1 ;
20% of the data are ≤ D2 ; and so on.

Percentile divides the distribution into one hundred equal parts. There are 99
percentiles such that
1% of the data are ≤ P1 ;
2% of the data are ≤ P2 ; and so on.

Fractile, even more flexible, divides the distribution into a convenient number of parts.
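
For ungrouped data, cut points of this kind can be obtained with `statistics.quantiles` from the Python standard library, as sketched below. This is an illustration only: the library interpolates between neighbouring observations, and that rule is an assumption here, so the values may differ slightly from those given by other textbook conventions.

```python
from statistics import quantiles

# Ungrouped data from Example 1.4, used here purely for illustration.
data = [20, 18, 15, 15, 14, 12, 11, 9, 7, 6, 4, 1]

# n=10 gives the 9 deciles D1..D9; n=100 gives the 99 percentiles P1..P99.
deciles = quantiles(data, n=10)
percentiles = quantiles(data, n=100)

# Any other convenient number of parts (a fractile), e.g. five equal parts.
quintiles = quantiles(data, n=5)

print(len(deciles), len(percentiles), len(quintiles))
```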

1.8.3 Quartiles

Quartiles are the most commonly used values of position. They divide the distribution
into four equal parts such that
25% of the data are ≤ Q1 = first quartile = lower quartile;
50% of the data are ≤ Q2 = second quartile = median;
75% of the data are ≤ Q3 = third quartile = upper quartile.
Steps:
1 Use the median to divide the ordered data set into two-halves.
If there is an odd number of data points in the original ordered data set, do not
include the median (the central value in the ordered list) in either half.
If there is an even number of data points in the original ordered data set, split this
data set exactly in half.
2 The lower quartile value is the median of the lower half of the data. The upper
quartile value is the median of the upper half of the data.

The interquartile range (IQR) is a measure of variability that represents the spread of
the middle 50% of the data in a dataset. It is calculated as the difference between the
third quartile (Q3 ) and the first quartile (Q1 ), and provides an estimate of the spread
or dispersion of the data that is less sensitive to outliers than the range.

The semi-interquartile range (SIQR), also known as the quartile deviation, is a measure
of variability that is half the size of the interquartile range (IQR).

\[ \mathrm{IQR} = Q_3 - Q_1, \qquad \mathrm{SIQR} = \frac{Q_3 - Q_1}{2} \]


1.8.4 Mean Absolute Deviation

Mean absolute deviation (MAD) is the mean of the absolute values of all deviations
from the mean. Therefore, it takes every item into account:

\[ \mathrm{MAD} = \frac{|x_1 - \mu| + |x_2 - \mu| + \cdots + |x_N - \mu|}{N} = \frac{\sum |x_i - \mu|}{N} \]
where
\(x_i\) is the value of the i-th item;
\(\mu\) is the population mean;
\(N\) is the population size.

Note:
MAD is less sensitive to outliers than the variance and standard deviation because it
uses the absolute, rather than squared, differences between each data point and the mean.
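
As a rough illustration of the definition and of the note on outliers, the sketch below computes the MAD for a small data set and compares how much the MAD and the standard deviation grow when one extreme value is added. The data and the function names are hypothetical; the standard deviation here divides by N.

```python
from math import sqrt

def mean_absolute_deviation(values):
    """Mean of the absolute deviations from the mean."""
    mu = sum(values) / len(values)
    return sum(abs(x - mu) for x in values) / len(values)

def population_sd(values):
    """Square root of the mean squared deviation, used only for comparison."""
    mu = sum(values) / len(values)
    return sqrt(sum((x - mu) ** 2 for x in values) / len(values))

base = [2, 4, 6, 8, 10]         # hypothetical data
with_outlier = base + [100]     # the same data plus one extreme value

mad_growth = mean_absolute_deviation(with_outlier) / mean_absolute_deviation(base)
sd_growth = population_sd(with_outlier) / population_sd(base)

print(mean_absolute_deviation(base))   # 2.4
print(mad_growth < sd_growth)          # True: the outlier inflates the s.d. more
```

Both measures increase when the outlier is added, but squaring the deviations makes the standard deviation react more strongly.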


1.8.5 Variance and Standard Deviation

Variance and standard deviation are both measures of how spread out a set of data is
from its mean value.
Variance is calculated by taking the average of the squared differences between
each data point and the mean.
Standard deviation (s.d.) is the square root of variance. The advantage of using
standard deviation over variance is that standard deviation is expressed in the
same units as the original data, while variance is expressed in squared units, which
can be harder to interpret.
A high variance/s.d. indicates that the data points are widely spread out from the
mean, while a low variance/s.d. indicates that the data points are clustered closely
around the mean.


Population Variance and Standard Deviation

The population variance, \(\sigma^2\), is the mean of the squared deviations from the mean:

\[ \sigma^2 = \frac{(x_1 - \mu)^2 + (x_2 - \mu)^2 + \cdots + (x_N - \mu)^2}{N} = \frac{\sum (x_i - \mu)^2}{N} \]
where
\(x_i\) is the value of the i-th item;
\(\mu\) is the population mean;
\(N\) is the population size.

The population standard deviation \(\sigma\) is defined as
\[ \sigma = \sqrt{\sigma^2} = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}. \]
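
A short sketch of the population formulas, assuming a small hypothetical data set. The hand-written version follows the definition term by term, and Python's standard `statistics.pvariance` and `statistics.pstdev` are used as a cross-check.

```python
from math import isclose, sqrt
from statistics import pstdev, pvariance

data = [2, 4, 4, 4, 5, 5, 7, 9]    # hypothetical population, N = 8

N = len(data)
mu = sum(data) / N                                  # population mean
sigma2 = sum((x - mu) ** 2 for x in data) / N       # mean squared deviation
sigma = sqrt(sigma2)

print(mu, sigma2, sigma)    # 5.0 4.0 2.0

# The standard library agrees with the definition.
print(isclose(sigma2, pvariance(data)), isclose(sigma, pstdev(data)))
```

Note that the variance (4.0) is in squared units, while the standard deviation (2.0) is in the units of the data.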


Sample Variance

The sample variance, \(s^2\), is the sum of the squared deviations from the sample
mean divided by \(n - 1\):
\[ s^2 = \frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1} = \frac{\sum (x_i - \bar{x})^2}{n - 1} \]
where
\(x_i\) is the value of the i-th item;
\(\bar{x} = \dfrac{x_1 + x_2 + \cdots + x_n}{n}\) is the sample mean;
\(n\) is the sample size.


Sample Standard Deviation



The sample standard deviation \(s\) is defined as \(s = \sqrt{s^2}\):
\[ s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}} = \sqrt{\frac{\sum x_i^2 - \dfrac{\left(\sum x_i\right)^2}{n}}{n - 1}} \]
where
\(x_i\) is the value of the i-th item;
\(\sum x_i^2 = x_1^2 + x_2^2 + \cdots + x_n^2\) is the sum of squares;
\(\left(\sum x_i\right)^2 = (x_1 + x_2 + \cdots + x_n)^2\) is the square of the sum;
\(n\) is the sample size.
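
The two forms of the sample standard deviation (deviations from the sample mean, and the shortcut using the sum of squares and the square of the sum) can be checked against each other numerically. The sketch below assumes a small hypothetical sample and uses `statistics.stdev`, which also divides by n − 1, as a third check.

```python
from math import isclose, sqrt
from statistics import stdev

sample = [4, 8, 6, 5, 3]    # hypothetical sample, n = 5

n = len(sample)
x_bar = sum(sample) / n

# Definition: squared deviations from the sample mean, divided by n - 1.
s_definition = sqrt(sum((x - x_bar) ** 2 for x in sample) / (n - 1))

# Shortcut: sum of squares minus (square of sum) / n, divided by n - 1.
sum_sq = sum(x * x for x in sample)
sq_sum = sum(sample) ** 2
s_shortcut = sqrt((sum_sq - sq_sum / n) / (n - 1))

print(round(s_definition, 4), round(s_shortcut, 4))    # 1.9235 1.9235
print(isclose(s_definition, stdev(sample)))            # True
```

The shortcut form needs only the running totals of the values and their squares, which is why it is convenient for hand calculation.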


1.8.6 Coefficient of Variation

The coefficient of variation (CV) measures the relative variability of a data set and is
expressed as a percentage. It does not depend on the unit of measurement, so it can be
used to compare two samples even when they differ in means or relate to different types
of measurements.

The coefficient of variation is given by

\[ \mathrm{CV} = \frac{\text{standard deviation}}{\text{mean}} \times 100\% \]


Example 1.9 Find the coefficients of variation of the salesman salary and the clerical
salary.

                     x̄                 s
Salesman salary      $916.76/month     $286.70/month
Clerical salary      $98.50/week       $20.55/week

Solution.
\[ \mathrm{CV}_s = \frac{286.70}{916.76} \times 100\% \approx 31\% \]
\[ \mathrm{CV}_c = \frac{20.55}{98.50} \times 100\% \approx 21\% \]
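
The arithmetic of Example 1.9 can be reproduced with a few lines of Python. The function name `coefficient_of_variation` is an illustrative choice; the figures are those given in the example. Because the CV is a ratio, the monthly and weekly units cancel and the two results can be compared directly.

```python
def coefficient_of_variation(sd, mean):
    """Standard deviation as a percentage of the mean."""
    return sd / mean * 100

cv_salesman = coefficient_of_variation(286.70, 916.76)   # $/month over $/month
cv_clerical = coefficient_of_variation(20.55, 98.50)     # $/week over $/week

print(f"{cv_salesman:.0f}%  {cv_clerical:.0f}%")    # 31%  21%
```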


Example 1.10 Evaluate the different measures of variation based on the salary
($) scales of the two companies in Example 1.8:

Company A    5,000   15,000   25,000   35,000   45,000   55,000
Company B    5,000    5,000    5,000   55,000   55,000   55,000

(a) Range

Company A:
$55,000 − $5,000 = $50,000
Company B:
$55,000 − $5,000 = $50,000


(b) Mean absolute deviation

Company A:
\[ \frac{|5{,}000 - 30{,}000| + |15{,}000 - 30{,}000| + |25{,}000 - 30{,}000| + |35{,}000 - 30{,}000| + |45{,}000 - 30{,}000| + |55{,}000 - 30{,}000|}{6} = \$15{,}000 \]
Company B:
\[ \frac{|5{,}000 - 30{,}000| + |5{,}000 - 30{,}000| + |5{,}000 - 30{,}000| + |55{,}000 - 30{,}000| + |55{,}000 - 30{,}000| + |55{,}000 - 30{,}000|}{6} = \$25{,}000 \]


(c) Variance

Company A:
\[ \frac{(5{,}000 - 30{,}000)^2 + (15{,}000 - 30{,}000)^2 + (25{,}000 - 30{,}000)^2 + (35{,}000 - 30{,}000)^2 + (45{,}000 - 30{,}000)^2 + (55{,}000 - 30{,}000)^2}{6} \approx 291{,}666{,}667 \text{ (dollars squared)} \]
Company B:
\[ \frac{(5{,}000 - 30{,}000)^2 + (5{,}000 - 30{,}000)^2 + (5{,}000 - 30{,}000)^2 + (55{,}000 - 30{,}000)^2 + (55{,}000 - 30{,}000)^2 + (55{,}000 - 30{,}000)^2}{6} = 625{,}000{,}000 \text{ (dollars squared)} \]

(d) Standard deviation

Company A:
\[ \sqrt{291{,}666{,}667} \approx \$17{,}078 \]
Company B:
\[ \sqrt{625{,}000{,}000} = \$25{,}000 \]

(e) Coefficient of variation

Company A:
\[ \frac{\$17{,}078}{\$30{,}000} \times 100\% \approx 56.93\% \]
Company B:
\[ \frac{\$25{,}000}{\$30{,}000} \times 100\% \approx 83.33\% \]
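
The five measures in Example 1.10 can be recomputed together. The sketch below is one possible arrangement (the function name `dispersion_summary` and the dictionary keys are illustrative); like the worked solution, it divides by the number of salaries, i.e. it uses the population form of the variance.

```python
from math import sqrt

def dispersion_summary(values):
    """Range, MAD, variance, s.d. and CV, using the population (divide by N) form."""
    n = len(values)
    mu = sum(values) / n
    variance = sum((x - mu) ** 2 for x in values) / n
    sd = sqrt(variance)
    return {
        "range": max(values) - min(values),
        "mad": sum(abs(x - mu) for x in values) / n,
        "variance": variance,
        "sd": sd,
        "cv_percent": sd / mu * 100,
    }

company_a = [5000, 15000, 25000, 35000, 45000, 55000]
company_b = [5000, 5000, 5000, 55000, 55000, 55000]

summary_a = dispersion_summary(company_a)
summary_b = dispersion_summary(company_b)
for name, summary in (("A", summary_a), ("B", summary_b)):
    print(name, {key: round(value, 2) for key, value in summary.items()})
```

Both companies have the same range and the same mean ($30,000), yet every other measure is larger for Company B, whose salaries sit at the two extremes.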


1.8.7 Measures of Grouped Data

For grouped data, the mean can be found by

\[ \bar{x} = \frac{\sum_{i=1}^{k} f_i x_i}{n} \]
where
\(x_i\) is the class mark of the i-th class;
\(f_i\) is the frequency of the i-th class;
\(k\) is the number of classes;
\(n = \sum_{i=1}^{k} f_i\) is the total frequency (i.e. the sample size).


For grouped data, the median can be found by first identifying the median class, which is
the class containing the median. Then apply the following formula to the median class:
\[ \text{median} = L_1 + \frac{\frac{n}{2} - C}{f_m} \times (L_2 - L_1) \]
where
\(L_1\) is the lower class boundary of the median class;
\(n\) is the total frequency (i.e. the sample size);
\(C = f_1 + \cdots + f_{m-1}\) is the cumulative frequency just before the median class;
\(f_m\) is the frequency of the median class;
\(L_2\) is the upper class boundary of the median class.
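
The interpolation can be sketched as a small function. This is an illustration under stated assumptions, not a prescribed method: the name `grouped_quantile`, its parameter `p`, and the frequency table are hypothetical. Setting `p = 0.5` gives the median formula above; replacing n/2 by n/4 or 3n/4 (`p = 0.25`, `p = 0.75`) gives the quartiles in the same way.

```python
def grouped_quantile(boundaries, freqs, p):
    """Interpolate the value below which a fraction p of the grouped data lies.

    boundaries: (lower, upper) class boundaries in increasing order
    freqs:      class frequencies in the same order
    """
    n = sum(freqs)
    target = p * n        # n/2 for the median
    cumulative = 0        # C: cumulative frequency before the current class
    for (lower, upper), f in zip(boundaries, freqs):
        if f > 0 and cumulative + f >= target:
            return lower + (target - cumulative) / f * (upper - lower)
        cumulative += f
    raise ValueError("p must lie between 0 and 1")

# Hypothetical frequency table with n = 50.
boundaries = [(0, 10), (10, 20), (20, 30), (30, 40)]
freqs = [5, 15, 20, 10]

print(grouped_quantile(boundaries, freqs, 0.5))     # 22.5
print(grouped_quantile(boundaries, freqs, 0.25))    # 15.0
print(grouped_quantile(boundaries, freqs, 0.75))    # 28.75
```

For the median, the third class is the median class: C = 20, f_m = 20, so the result is 20 + (25 − 20)/20 × 10 = 22.5.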


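[Figure: the median class drawn as a histogram bar of frequency fm between L1 and L2. The cumulative frequency to the left of L1 is C, and the median cuts off a frequency of n/2 − C from the bar.]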


For grouped data, the mode can be found by first identifying the modal class, which is the
class with the highest frequency. Then apply the following formula to the modal class:
\[ \text{mode} = L_1 + \frac{d_1}{d_1 + d_2} \times (L_2 - L_1) \]
where
\(L_1\) is the lower class boundary of the modal class;
\(d_1\) is the frequency of the modal class minus the frequency of the previous class
and is always positive;
\(d_2\) is the frequency of the modal class minus the frequency of the following class
and is always positive;
\(L_2\) is the upper class boundary of the modal class.
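
A minimal sketch of the mode formula follows. The function name `grouped_mode` and the frequency table are hypothetical. The sketch assumes a single modal class, and, as a further assumption not stated in the text, treats a missing neighbour (when the modal class is the first or last class) as having frequency 0.

```python
def grouped_mode(boundaries, freqs):
    """Estimate the mode of grouped data from the modal class and its neighbours."""
    m = max(range(len(freqs)), key=lambda i: freqs[i])       # index of the modal class
    prev_f = freqs[m - 1] if m > 0 else 0                    # assumption: 0 if no previous class
    next_f = freqs[m + 1] if m < len(freqs) - 1 else 0       # assumption: 0 if no following class
    d1 = freqs[m] - prev_f
    d2 = freqs[m] - next_f
    lower, upper = boundaries[m]
    return lower + d1 / (d1 + d2) * (upper - lower)

# Hypothetical frequency table.
boundaries = [(0, 10), (10, 20), (20, 30), (30, 40)]
freqs = [5, 15, 20, 10]

print(round(grouped_mode(boundaries, freqs), 2))    # 23.33
```

Here d1 = 20 − 15 = 5 and d2 = 20 − 10 = 10, so the mode lies one third of the way from 20 to 30.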

Geometrically the mode can be represented by the following graph and can be obtained
by using similar triangle properties.

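[Figure: histogram of the modal class and its two neighbouring classes, marking d1 and d2. The mode divides the interval from L1 to L2 in the ratio d1 : d2.]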


For grouped data, the sample standard deviation (s.d.) is

\[ s = \sqrt{\frac{\sum f_i (x_i - \bar{x})^2}{\sum f_i - 1}} = \sqrt{\frac{\sum f_i x_i^2 - \dfrac{\left(\sum f_i x_i\right)^2}{\sum f_i}}{\sum f_i - 1}} \]

where
\(x_i\) is the class mark of the i-th class;
\(f_i\) is the frequency of the i-th class;
\(\sum f_i\) is the total frequency (i.e. the sample size);
\(\sum f_i x_i^2 = f_1 x_1^2 + f_2 x_2^2 + \cdots + f_k x_k^2\);
\(\left(\sum f_i x_i\right)^2 = (f_1 x_1 + f_2 x_2 + \cdots + f_k x_k)^2\).


Example 1.11 The following table shows gas consumption data for 100 cars during a
specific time period:
Gas consumption Frequency
10 − 19 1
20 − 29 0
30 − 39 1
40 − 49 4
50 − 59 7
60 − 69 16
70 − 79 19
80 − 89 20
90 − 99 17
100 − 109 11
110 − 119 3
120 − 129 1

Find the sample mean, median, Q1 , Q3 , mode, and standard deviation (s.d.).
Solution.
Gas consumption Frequency (fi ) Class boundary Class mark (xi ) fi xi fi x2i
10 − 19 1 9.5 − 19.5 14.5 14.5 210.25
20 − 29 0 19.5 − 29.5 24.5 0 0
30 − 39 1 29.5 − 39.5 34.5 34.5 1190.25
40 − 49 4 39.5 − 49.5 44.5 178 7921
50 − 59 7 49.5 − 59.5 54.5 381.5 20791.75
60 − 69 16 59.5 − 69.5 64.5 1032 66564
70 − 79 19 69.5 − 79.5 74.5 1415.5 105454.75
80 − 89 20 79.5 − 89.5 84.5 1690 142805
90 − 99 17 89.5 − 99.5 94.5 1606.5 151814.25
100 − 109 11 99.5 − 109.5 104.5 1149.5 120122.75
110 − 119 3 109.5 − 119.5 114.5 343.5 39330.75
120 − 129 1 119.5 − 129.5 124.5 124.5 15500.25
Sum n = 100 7970 671705

(a) Sample mean

\[ \bar{x} = \frac{\sum f_i x_i}{n} = \frac{\sum f_i x_i}{\sum f_i}
= \frac{1 \times 14.5 + 0 \times 24.5 + 1 \times 34.5 + \cdots + 1 \times 124.5}{1 + 0 + 1 + \cdots + 1}
= \frac{7970}{100} = 79.7 \]


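(b) Median, Q1 , Q3

The median class is 79.5 − 89.5, with fm = 20 and C = 48:
\[ \text{median} = L_1 + \frac{\frac{n}{2} - C}{f_m} \times (L_2 - L_1) = 79.5 + \frac{50 - 48}{20} \times (89.5 - 79.5) = 80.5 \]

Similarly,
\[ Q_1 = 59.5 + \frac{25 - 13}{16} \times (69.5 - 59.5) = 67 \]
\[ Q_3 = 89.5 + \frac{75 - 68}{17} \times (99.5 - 89.5) \approx 93.6176 \]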
17


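(c) Mode

The modal class is 79.5 − 89.5 with frequency 20. The neighbouring classes have frequencies 19 and 17, so d1 = 20 − 19 = 1 and d2 = 20 − 17 = 3 (ratio 1 : 3).

\[ \text{mode} = L_1 + \frac{d_1}{d_1 + d_2} \times (L_2 - L_1) = 79.5 + \frac{20 - 19}{(20 - 19) + (20 - 17)} \times (89.5 - 79.5) = 82 \]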


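(d) s.d.
\[ s = \sqrt{\frac{\sum f_i x_i^2 - \dfrac{\left(\sum f_i x_i\right)^2}{\sum f_i}}{\sum f_i - 1}}
= \sqrt{\frac{\left(\sum f_i\right)\left(\sum f_i x_i^2\right) - \left(\sum f_i x_i\right)^2}{\left(\sum f_i\right)\left(\sum f_i - 1\right)}} \]
\[ = \sqrt{\frac{n\left(\sum f_i x_i^2\right) - \left(\sum f_i x_i\right)^2}{n(n - 1)}}
= \sqrt{\frac{100(671705) - (7970)^2}{100(100 - 1)}} \approx 19.2 \]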
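
The column totals, the mean in part (a), and the standard deviation in part (d) can be checked from the frequency table. The sketch below takes the class marks and frequencies from Example 1.11; the variable names are illustrative. It evaluates both the shortcut form used in the solution and the definition based on deviations from the mean.

```python
from math import isclose, sqrt

class_marks = [14.5 + 10 * i for i in range(12)]        # 14.5, 24.5, ..., 124.5
freqs = [1, 0, 1, 4, 7, 16, 19, 20, 17, 11, 3, 1]

n = sum(freqs)
sum_fx = sum(f * x for f, x in zip(freqs, class_marks))
sum_fx2 = sum(f * x * x for f, x in zip(freqs, class_marks))
mean = sum_fx / n

# Shortcut form used in the worked solution.
s_shortcut = sqrt((n * sum_fx2 - sum_fx ** 2) / (n * (n - 1)))

# Definition: frequency-weighted squared deviations from the mean, divided by n - 1.
s_definition = sqrt(
    sum(f * (x - mean) ** 2 for f, x in zip(freqs, class_marks)) / (n - 1)
)

print(n, sum_fx, sum_fx2)                       # 100 7970.0 671705.0
print(round(mean, 1), round(s_shortcut, 1))     # 79.7 19.2
print(isclose(s_shortcut, s_definition))        # True
```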


1.8.8 Skewness

Skewness describes the asymmetry of a distribution, i.e. on which side the data pile up
and on which side the longer tail lies. A number of measures have been suggested to
determine the skewness of a given distribution.

If the longer tail is on the right, we say that it is skewed to the right, and the
coefficient of skewness is positive.
If the longer tail is on the left, we say that it is skewed to the left, and the coefficient
of skewness is negative.


Example 1.12

[Figure: two distribution curves. Left panel: skewed to the left (negatively skewed). Right panel: skewed to the right (positively skewed).]


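Pearson’s 1st coefficient of skewness

\[ SK_1 = \frac{\text{mean} - \text{mode}}{\text{standard deviation}} \]

Pearson’s 2nd coefficient of skewness

\[ SK_2 = \frac{3\,(\text{mean} - \text{median})}{\text{standard deviation}} \]

Note:
For moderately skewed distributions, the two coefficients are approximately equal:

\[ \text{skewness} = \frac{\text{mean} - \text{mode}}{\text{standard deviation}} \approx \frac{3\,(\text{mean} - \text{median})}{\text{standard deviation}} \]
Then
\[ \text{mean} - \text{mode} \approx 3\,(\text{mean} - \text{median}) \]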
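
As an illustration that is not part of the original examples, the two coefficients can be evaluated with the rounded results of Example 1.11 (mean 79.7, median 80.5, mode 82, s.d. 19.2). The function names below are hypothetical.

```python
def pearson_first(mean, mode, sd):
    """Pearson's 1st coefficient of skewness."""
    return (mean - mode) / sd

def pearson_second(mean, median, sd):
    """Pearson's 2nd coefficient of skewness."""
    return 3 * (mean - median) / sd

# Rounded results taken from Example 1.11.
mean, median, mode, sd = 79.7, 80.5, 82.0, 19.2

sk1 = pearson_first(mean, mode, sd)
sk2 = pearson_second(mean, median, sd)
print(round(sk1, 3), round(sk2, 3))    # -0.12 -0.125
```

Both coefficients are slightly negative and close to each other, which is consistent with a mildly left-skewed distribution and with the approximate relationship in the note.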
