
Descriptive Statistics

CHAPTER 2: Descriptive STATISTICS

2.1 Introduction to data presentation

Data presentation is an essential step before further statistical analysis is carried out.
Data are summarized and displayed visually so that researchers, managers and decision
makers can observe important features of the data and gain insight into the type
of model and analysis that should be used. Some common data presentations include the
frequency table, bar chart, pie chart, histogram, frequency curve, line graph, pictograph, stem-
and-leaf plot, box plot and ogive.

2.2 Organizing and Graphing Qualitative Data

Qualitative data can be classified into categories or classes. They are best presented in
the form of a frequency distribution, bar chart, pie chart or contingency table.

Frequency Distribution

One simple way of presenting qualitative data is by a frequency distribution. A frequency
distribution is a table consisting of columns and rows, in which one column lists the categories
of the data. A class is a category into which qualitative data can be classified. Class frequency is the
number of observations that fall in a particular class.

For example, a car dealer in Kuala Lumpur recorded the following numbers of cars by model in
the month of January 2005, as shown in Table 2.1.

The first column of Table 2.1 consists of the qualitative variable that is the car models, while
the second column is the frequency of the types of cars sold.

Table 2.1 The frequency distribution showing the stock of cars at one car dealer

Car model       Number of cars
Proton Waja     66
Proton Wira     50
Proton Saga     39
Proton Gen-2    25
Total           180

The variable consists of four models of cars, namely, Proton Waja, Proton Wira, Proton Saga
and Proton Gen-2. This type of table is usually called a frequency table or frequency distribution
for qualitative data. A frequency distribution summarises data into different classes and their
frequencies.

2.3 Pie Chart

A pie chart can be used to represent categorical data. It consists of one or more circles that
are divided into sectors. The sectors show the number of objects or percentage of each group
or category. The angle in the sector is proportional to the number or percentage of elements
in that category.

Guidelines for constructing a pie chart are as follows.

a) Choose a small number of categories (say 3 to 10) to represent the data being summarised.
Too many categories make the pie chart difficult to interpret.
b) Partition the circle to match the percentages for each of the categories.
c) Whenever possible, construct the pie chart so that percentages are either in ascending or
descending order. This method helps in the interpretation of the data.
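
Guideline (b) can be illustrated with a short calculation. The sketch below reuses the car counts from Table 2.1 and converts each count into a percentage and a sector angle; the function name sector_angles is hypothetical and chosen only for this illustration.

```python
# Sector sizes for a pie chart of the Table 2.1 car counts.
car_counts = {
    "Proton Waja": 66,
    "Proton Wira": 50,
    "Proton Saga": 39,
    "Proton Gen-2": 25,
}

def sector_angles(counts):
    """Return {category: (percentage, angle in degrees)} for each category."""
    total = sum(counts.values())
    return {name: (100 * c / total, 360 * c / total) for name, c in counts.items()}

angles = sector_angles(car_counts)
for name, (pct, deg) in angles.items():
    print(f"{name:<13} {pct:5.1f}%  {deg:6.1f} degrees")
```

The angles add up to 360 degrees because each one is the same fraction of the circle as the category is of the total.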

2.4 Bar Chart

A bar chart is another graphical method for describing data that can be divided into categories.
Bar charts are frequently used in newspapers, magazines, companies’ annual reports and
other presentations to convey and highlight information. A bar chart uses the lengths of
horizontal bars (in horizontal bar chart) or vertical columns (in vertical bar chart) to represent
quantities or percentages.

General guidelines for constructing a bar chart are as follows.

a. Label the vertical axis with the number of objects that fall into each category; and label the
categories along the horizontal axis.
b. Construct a rectangle over each category with the height of the rectangle equal to the
number of objects in that category. The base of each rectangle should be of the same
width.
c. Leave space between each category on the horizontal axis to distinguish between the
categories and to clarify the presentation.

There are three common types of bar charts. They are vertical (or horizontal) bar chart, cluster
bar chart and stacked bar chart.

Vertical Bar Chart

Figure 2.1 is an example of a vertical bar chart. The chart shows the quarterly profits of the
company ZA in the year 2014. The profits are higher in the second and third quarters of 2014.

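[Figure: vertical bar chart. Horizontal axis: Quarter (1st, 2nd, 3rd, 4th); vertical axis: RM (million); bar labels recovered from the chart: 40, 38, 30, 30.]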

Figure 2.1 Company ZA’s quarterly profit for the year 2014 in RM (million)

Cluster Bar Chart

A cluster bar chart uses several bars for each item. Figure 2.2 displays the quarterly profits of
three companies, AZ, BW and CT. The clustered chart enables us to see immediately that
company CT has higher quarterly profits than companies AZ and BW. Company AZ has the
lowest quarterly profits of the three companies.

[Figure: clustered bar chart. Horizontal axis: Quarter (1st, 2nd, 3rd, 4th); vertical axis: RM (million), 0 to 60; series: Company CT, Company BW, Company AZ.]

Figure 2.2 Quarterly profits of companies AZ, BW and CT



Stacked Bar Chart

A stacked bar chart (also called a component bar chart) shows the breakdown of each total into its
component bars. If the components are converted into percentages, a percentage
stacked bar chart is produced, in which all the bars are of equal height.

Contingency Table

A contingency table is also known as a cross-classification table. It is often desirable to
examine the categorical responses in terms of two qualitative variables simultaneously. Some
data can be grouped according to two criteria of classification. For example, a car
manufacturer might be interested to know whether colour preference for a car is independent
of gender. In this case, we could take a sample of car buyers or potential car buyers, record
their gender and their colour preference, and classify their responses by gender and
preferred colour. Consider Table 2.2.

Table 2.2 Cross-classification table for colour preference

         Red   Green   White   Black   Blue   Total
Men      30    10      26      33      1      100
Women    45    8       12      10      25     100
Total    75    18      38      43      26     200

In Table 2.2, the two variables are gender and colour. The table shows that men prefer
black or red cars while women prefer red cars. Men dislike blue cars while women dislike
green cars.
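
The totals in Table 2.2 can be reproduced from the cell counts. The following is a minimal sketch that recomputes the row totals, column totals and the percentage of each gender choosing each colour; the variable names are assumptions made for this illustration.

```python
# Cell counts from Table 2.2 (rows: gender, columns: colour).
colours = ["Red", "Green", "White", "Black", "Blue"]
table = {
    "Men": [30, 10, 26, 33, 1],
    "Women": [45, 8, 12, 10, 25],
}

# Marginal totals: sum across each row and down each column.
row_totals = {gender: sum(counts) for gender, counts in table.items()}
col_totals = [sum(col) for col in zip(*table.values())]
grand_total = sum(row_totals.values())

# Row percentages: share of each gender preferring each colour.
row_pct = {
    gender: [100 * c / row_totals[gender] for c in counts]
    for gender, counts in table.items()
}

for gender, pcts in row_pct.items():
    summary = ", ".join(f"{col} {p:.0f}%" for col, p in zip(colours, pcts))
    print(f"{gender}: {summary}")
```

Row percentages make the two genders directly comparable even when the numbers of men and women sampled differ.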

2.5 Organizing and Graphing Quantitative Data

Quantitative data are normally summarized in tabular forms. A frequency table is used to
simplify the data. Quantitative data can be divided into ungrouped and grouped data. Stem
and leaf plot, histogram, cumulative frequency distribution and ogive are frequently used to
display the data and to highlight the information.

Stem and leaf plot

One of the graphical methods for describing quantitative data is the stem and leaf plot, which
is widely used in exploratory data analysis when the data set is small. This plot separates data
entries into leading digits and trailing digits. It allows us to use the information contained in a
frequency distribution to show the range of data values (for example, scores), their
concentration, the shape of the distribution, whether there are specific values or scores not
represented and whether there are stray or extreme values or scores.

The guidelines for constructing a stem-and-leaf plot are as follows.

a) Split each score or value into two sets of digits. The first (or leading) set of digits is the
stem and the second (trailing) set of digits is the leaf.
b) List all the possible stem digits from the lowest to the highest.
c) For each score in the mass of data, write down the leaf digits on the line labelled by
the appropriate stem number.
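
The three steps above can be sketched in code. The example assumes non-negative whole-number scores, takes the tens digits as the stem and the units digit as the leaf, and uses hypothetical scores invented for illustration.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Map each stem (tens digits) to its sorted list of leaves (units digits)."""
    leaves_by_stem = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)          # step (a): split into stem and leaf
        leaves_by_stem[stem].append(leaf)   # step (c): write leaf on its stem line
    # step (b): list every possible stem, so stems with no scores stay visible
    stems = range(min(leaves_by_stem), max(leaves_by_stem) + 1)
    return {s: leaves_by_stem.get(s, []) for s in stems}

scores = [52, 67, 71, 58, 64, 75, 79, 91, 67, 73, 55, 94]  # hypothetical data
result = stem_and_leaf(scores)
for stem, leaves in result.items():
    print(f"{stem} | {' '.join(map(str, leaves))}")
```

The empty line for stem 8 shows at a glance that no score in the eighties was recorded.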

Frequency distribution for ungrouped data

A frequency distribution is a summary table where the data are grouped into a number of
classes. The objective is to obtain the number of responses associated with the different
values of the variables. The frequency of an observation is the number of times the
observation occurs in the data. Thus, for ungrouped data, the
frequency distribution is a table consisting of the observed values with their corresponding
frequencies.

Frequency distribution for grouped data

A frequency table summarizes the data collected by forming intervals of values and indicating
the number of data that fall into each interval. This frequency table with class intervals is
known as the frequency distribution of grouped data. Grouping data is often desirable
because it reduces the complexity of the data and helps to smooth out irregularities in the
distribution. However, there are some disadvantages to grouping the data. Some
information is lost when data are grouped into several class intervals. For example, if it is
known that there are six observations in an interval labelled 15–20, one cannot say whether
they are all at one end of the interval or are spread through it.

There are several guidelines that can be followed in constructing a grouped frequency
distribution. Firstly, the class intervals should be mutually exclusive. This means that the class
intervals should not overlap and must be clearly defined. Secondly, it is good practice to
ensure that class intervals are of equal width, except for open-ended classes. Thirdly, there
should be neither too few nor too many classes; with too many classes, some intervals may
contain no observations.
The rule of thumb is that the number of classes should not be less than 5 and should not be more
than 15. To determine the number of classes, k, for a set of data consisting of n observations,
the formula below can be used.

$$k = \frac{\log n}{\log 2}$$

For example, if the number of data is 60, then the number of classes is
$k = \frac{\log 60}{\log 2} = 5.9 \approx 6$ classes.
Finally, the number of the data (frequency) falling into each class is indicated in the frequency
table. It should be noted that the number of observations in the data is equal to the sum of the
frequencies.

∑𝑓 = 𝑛
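
The class-count formula can be checked numerically. In this sketch the fractional result is rounded up to the next whole number; that rounding choice is an assumption, since the text only shows 5.9 becoming 6. The function name is hypothetical.

```python
import math

def number_of_classes(n):
    """Return (raw k, whole-number k) using k = log n / log 2."""
    k = math.log(n) / math.log(2)
    return k, math.ceil(k)  # assumption: round up to a whole number of classes

raw_k, k = number_of_classes(60)
print(f"k = {raw_k:.1f}, so use {k} classes")
```

Any logarithm base gives the same k, because the ratio of the two logarithms does not depend on the base.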

Histogram

A histogram is a graphical representation of the frequency distribution in which bars represent
frequencies. The histogram is constructed by using class boundaries and frequencies of the
classes. The frequency is represented by the area of the bar. For equal class intervals, the area is
proportional to the height of the bar.

When plotting histograms, the random variable or phenomenon of interest is plotted along the
horizontal axis; the vertical axis represents the number, proportion or percentage of
observations per class interval, depending on whether the chart is a frequency histogram, a relative
frequency histogram or a percentage histogram respectively.

Vertical axis label             Type of chart
Number of observations          Frequency histogram
Proportions of observations     Relative frequency histogram
Percentages of observations     Percentage histogram

Frequency Polygon.

When a line graph of a class frequency is plotted against class mark, a frequency polygon is
obtained. If a histogram is available, the frequency polygon is obtained by connecting the mid-
points of the tops of the rectangles in the histogram. Two additional classes with zero
frequencies are added to the two ends of the histogram so that the two ends of the frequency
polygon are connected to the horizontal axis. Figure 2.3 illustrates the frequency polygon.

[Figure: frequency polygon. Horizontal axis: Class Boundary (0 to 16); vertical axis: frequency (0 to 40).]

Figure 2.3 Frequency polygon.



2.6 Cumulative Frequency Distribution and Ogives

Cumulative Frequency Distribution

There are two types of cumulative frequency distributions: ‘less than’ and ‘more than’. The ‘less
than’ cumulative frequency is more frequently used. Cumulative frequency is determined by
adding frequencies. The frequencies up to the upper boundary of each class interval are
progressively added and the cumulative totals are placed in a new column. Table 2.3 indicates
how the cumulative frequencies are calculated. In the same way, the cumulative relative
frequency is calculated by adding the relative frequencies progressively.

Table 2.3 Cumulative frequency distribution

Class     Class      Frequency   Cumulative   Relative    Cumulative relative
number                           frequency    frequency   frequency
1         27 – 29    2           2            0.04        0.04
2         29 – 31    9           11           0.18        0.22
3         31 – 33    13          24           0.26        0.48
4         33 – 35    14          38           0.28        0.76
5         35 – 37    8           46           0.16        0.92
6         37 – 39    4           50           0.08        1.00
Total                50
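
The running totals in Table 2.3 can be reproduced with a few lines of code. This is a minimal sketch using the class frequencies from the table; the variable names are chosen for illustration only.

```python
from itertools import accumulate

frequencies = [2, 9, 13, 14, 8, 4]   # class frequencies from Table 2.3
n = sum(frequencies)

cumulative = list(accumulate(frequencies))           # running total of frequencies
relative = [f / n for f in frequencies]              # share of each class
cumulative_relative = [c / n for c in cumulative]    # running share

for f, c, r, cr in zip(frequencies, cumulative, relative, cumulative_relative):
    print(f"{f:>3} {c:>3} {r:.2f} {cr:.2f}")
```

The last cumulative frequency always equals n, and the last cumulative relative frequency always equals 1.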

Ogive (Cumulative Frequency Curve)

An ogive (commonly known as the cumulative frequency curve) is a graph or a line chart
of a cumulative frequency distribution. There are two types of ogive: ‘less than’ and ‘more
than’. The ‘less than’ ogive is an increasing function that rises to the right. The ‘more than’ ogive
falls to the right. An ogive is drawn from the data in a cumulative frequency table.
Ogives for relative frequencies are used when two
cumulative distributions with different total frequencies are to be compared.

Example 2.1

Table 2.4 shows the number of service years of 120 employees at a firm called IZZY.

Table 2.4 Number of service years of employees of IZZY

Service years    Number of employees
1 – 4            16
5 – 8            20
9 – 12           28
13 – 16          24
17 – 20          16
21 – 24          11
25 – 28          5

From the data above, draw a ‘less than’ ogive.



Solution

The upper class boundary is calculated for every class, and the cumulative frequency below each boundary is listed in Table 2.5.

Table 2.5 Cumulative frequency of number of service years at IZZY

Service years     Cumulative frequency
Less than 0.5     0
Less than 4.5     16
Less than 8.5     36
Less than 12.5    64
Less than 16.5    88
Less than 20.5    104
Less than 24.5    115
Less than 28.5    120
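
The points of the ‘less than’ ogive can be derived from Table 2.4 as follows. The sketch assumes each class boundary lies halfway between the upper limit of one class and the lower limit of the next, which reproduces the boundaries 0.5, 4.5, and so on used in Table 2.5.

```python
from itertools import accumulate

class_limits = [(1, 4), (5, 8), (9, 12), (13, 16), (17, 20), (21, 24), (25, 28)]
frequencies = [16, 20, 28, 24, 16, 11, 5]

# Gap between consecutive classes (1 year here); boundaries sit in the middle of it.
gap = class_limits[1][0] - class_limits[0][1]
boundaries = [class_limits[0][0] - gap / 2] + [hi + gap / 2 for _, hi in class_limits]

# Cumulative frequency below each boundary, starting from zero.
cumulative = [0] + list(accumulate(frequencies))

ogive_points = list(zip(boundaries, cumulative))
for boundary, total in ogive_points:
    print(f"Less than {boundary:>4}: {total}")
```

Plotting these pairs and joining them with a line gives the ogive.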

[Figure: ‘less than’ ogive. Horizontal axis: service years (0 to 30); vertical axis: cumulative frequency (0 to 140).]

Figure 2.4 The service years of 120 employees at the firm IZZY

2.7 Numerical Descriptive Measures of Data: Measures of Central Tendency

Measures of central tendency are a few figures that reflect the centre of a data distribution. Each is a
single value located at the centre of a data set and can be taken as a summary value, which
explains the behaviour of that data set. Three values are often used as measures
of central tendency of a data set: the mean, median and mode.

Mean

Mean is a measure of central tendency that is computed by taking the sum of all data values,
divided by the number of data. For example, if the test scores of five students are 70, 80, 60,
90, 50 then the mean score of these students is $\frac{70+80+60+90+50}{5} = 70$. Mean is the most
commonly used measure of central tendency since it reflects the average score representing
the whole data set. For example, suppose the average score for students in class A is 60 and the
average score for students in class B is 70. This clearly shows that, on average, the students
in class B perform better than students in class A.

Median

Median is a value that lies in the centre of the data where half (50%) of the data are greater
than or equal to this value, and another half (50%) of the data are smaller than or equal to this value. To
determine the median, one has to arrange the data in ascending or descending order and then
select the central value of the data as the median.

Arranging the data supplied earlier in ascending order, we obtain 50,
60, 70, 80, 90. The centre value or median score is 70, and we can conclude that half of the
students in the class score 70 or below and another half score 70 or above.

Mode

Mode is the value that occurs most frequently in a data set. For example, if the data set is 50,
60, 70, 70, 70, 80, 90, 90 then the mode is 70 because it occurs most frequently (three times)
in the data set. Thus, with the mode 70, we can conclude that the most common score in
that particular test is 70.
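
The three measures can be checked with Python's standard statistics module. This is a minimal sketch using the two small data sets quoted in this section.

```python
import statistics

test_scores = [70, 80, 60, 90, 50]                 # data used for mean and median
mode_scores = [50, 60, 70, 70, 70, 80, 90, 90]     # data used for the mode

mean_score = statistics.mean(test_scores)      # sum of values / number of values
median_score = statistics.median(test_scores)  # middle value after sorting
mode_score = statistics.mode(mode_scores)      # most frequent value

print(mean_score, median_score, mode_score)
```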

2.8 Measures of Central Tendency for Ungrouped Data

Mean (Arithmetic Mean)

Mean is the most commonly used measure of central tendency. It is the average of a group of
data. Mean is calculated by summing up all the observations in the data set then dividing by
the number of data. The population mean is normally represented by the Greek letter μ,
pronounced "mu", while the sample mean is represented by X̄.

The mean can be computed either as the population mean, μ, or as the sample mean,
X̄. The following formulae are used to compute the mean.

$$\text{Population mean, } \mu = \frac{\sum x}{N} = \frac{x_1 + x_2 + x_3 + \dots + x_N}{N}$$

$$\text{Sample mean, } \bar{X} = \frac{\sum x}{n} = \frac{x_1 + x_2 + x_3 + \dots + x_n}{n}$$

where N = number of elements in the population, and
n = number of elements in the sample.

Median

Median is the middle value of an ordered array of data. If there is an odd number of
observations in the data that is arranged in ascending or descending order, the median is the
middle value of the data. However, if there is an even number of observations in the data, the
median is the average of the two middle numbers.

Finding median in a large number of observations

If there is a large number of observations, the median is determined by computing its position in
the ordered array. For example, if we have 77 observations, the median is the 39th term, as shown
below.

$$\text{Location of median} = \frac{n+1}{2} = \frac{77+1}{2} = 39$$

The median is the 39th observation in the data set that is arranged in ascending or
descending order. An advantage of using the median as a measure of central tendency is that the
median is unaffected by the magnitude of extreme values. The median is often the best
measure of location, especially in the analysis of variables such as cost of houses, income and
age. Irrespective of how large the data set is or whether the number of observations is even or odd, as long as
the data are arranged in order, the median can be obtained by finding its location using the
$\left(\frac{n+1}{2}\right)$th formula.
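
The location rule can be sketched in code. The function below (a hypothetical name) computes the 1-based location (n + 1)/2 and, when that location falls halfway between two positions, averages the two middle values as described for an even number of observations.

```python
import math

def median_by_location(values):
    """Return (location, median) using the (n + 1) / 2 position rule."""
    ordered = sorted(values)
    location = (len(ordered) + 1) / 2           # 1-based position in ordered data
    lower = ordered[math.floor(location) - 1]   # value at or just below the location
    upper = ordered[math.ceil(location) - 1]    # value at or just above the location
    return location, (lower + upper) / 2

# 77 observations (here simply 1..77): the median sits at the 39th position.
print(median_by_location(range(1, 78)))
# Even number of observations: location 2.5, average of 2nd and 3rd values.
print(median_by_location([3, 1, 4, 2]))
```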

Mode

Mode is the value that occurs most frequently in a set of data. Mode is located by arranging
the data in ascending or descending order. For categorical data, mode is the category that
has the highest frequency. For example, in Figure 2.5 the mode is bicycle.

[Figure: bar chart. Categories: Bus, Bicycle, Walking, Train; vertical axis: frequency (0 to 140).]

Figure 2.5 Data presenting transport usage in a particular school.

For quantitative data that is represented by a histogram, mode is the class interval with the
highest frequency.
[Figure: histogram. Horizontal axis: Price of book (0 to 80); vertical axis: Frequency (0 to 60).]
Figure 2.6 Data presenting frequency of the prices of books

Disadvantage of mode

A disadvantage of mode as a measure of central tendency is that it is not unique. A set of data
may have one, two or many modes or no mode at all.

2.9 Measures of Position for Ungrouped Data

Box-and-whisker plot

A box-and-whisker plot provides a useful graphical representation of data using minimum
(Xmin), maximum (Xmax), first quartile (Q₁), third quartile (Q₃) and median, as presented in
Figure 2.7.

[Figure: box-and-whisker plot with, from left to right, Minimum, Q₁, Median, Q₃ and Maximum marked.]

Figure 2.7 Box-and-whisker plot

The vertical line inside the box represents the location of the median value of the data set. The
vertical line on the left-hand side of the box represents the location of the first quartile (Q₁)
and the vertical line on the right-hand side of the box represents the location of the third quartile
(Q₃). The extreme end of the line (a whisker) connecting to the left-hand side of the box is the
location of the smallest value, Xmin and the extreme end of the line connecting to the right-
hand side of the box is the largest value, Xmax.

Box-and-whisker plots for various types of distribution are shown in Figures 2.8 (a)-(e).

For normally distributed data, the median is in the middle of the box and the whiskers are of equal
length. For negatively-skewed data, the whisker and rectangular box are longer on the left-hand
side. For positively-skewed data, the whisker and rectangular box are longer on the right-hand
side.

Quartiles

Quartiles are the most widely used measures of non-central location and they are used to
describe positional values of large sets of numerical data. While the median is the middle value
that splits the ordered data into halves (50% of the observations are smaller and 50% are
larger than the median), quartiles are descriptive measures that split the ordered data into four
quarters.

First quartile (Q₁)

The first quartile is a positional value where 25% of the observations are smaller and 75% are
larger than the value given by the following formula:

$$Q_1 = \frac{n+1}{4}\text{th value of ordered observations}$$

[Figure panels: (a) Bell-shaped distribution (normally distributed data); (b) Distribution skewed to the left; (c) Distribution skewed to the right; (d) Rectangular distribution; (e) U-shaped distribution.]

Figure 2.8 Types of box-and-whisker plots



Third quartile (Q₃)

The third quartile is a positional value where 75% of the observations are smaller and 25%
are larger than the value given by the following formula:

$$Q_3 = \frac{3(n+1)}{4}\text{th value of ordered observations}$$

The following lists the rules for obtaining quartile values.

(a) If the positioning point is an integer, the numerical observation on the positioning point
is chosen for the quartiles.
(b) If the positioning point is halfway between two integers, the average of their
corresponding observations is selected.
(c) If the positioning point is neither an integer nor a value halfway between two integers,
a simple rule to follow is, round off to the nearest integer and select the numerical value
of corresponding observation.
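
Rules (a) to (c) can be sketched as a small function. The function names are hypothetical, the data are invented for illustration, and the sketch assumes at least three observations so that both positioning points fall inside the data.

```python
def value_at_position(ordered, position):
    """Pick a value from ordered data at a 1-based positioning point."""
    whole = int(position)
    if position == whole:                        # rule (a): integer position
        return ordered[whole - 1]
    if position - whole == 0.5:                  # rule (b): halfway, take the average
        return (ordered[whole - 1] + ordered[whole]) / 2
    return ordered[round(position) - 1]          # rule (c): round to nearest integer

def quartiles(values):
    """Return (Q1, Q3) using the (n+1)/4 and 3(n+1)/4 positioning points."""
    ordered = sorted(values)
    n = len(ordered)
    q1 = value_at_position(ordered, (n + 1) / 4)
    q3 = value_at_position(ordered, 3 * (n + 1) / 4)
    return q1, q3

data = [12, 15, 11, 20, 18, 25, 30, 22, 17, 14, 27]   # hypothetical, n = 11
print(quartiles(data))             # positions 3 and 9
print(quartiles(range(1, 10)))     # n = 9: positions 2.5 and 7.5
print(quartiles(range(1, 9)))      # n = 8: positions 2.25 and 6.75
```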

2.10 Measures of Central Tendency for Grouped Data

When the amount of raw data is large, it is better presented in a frequency distribution table.
Computing measures of central tendency for grouped data is different from computing them for raw
data.

Mean

For grouped data, each class interval is represented by the mid-point of the interval, $x_i$. The
mean is calculated as

$$\text{Mean, } \bar{X} = \frac{\sum f x_i}{\sum f}$$
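
As an illustration, the grouped mean can be computed for the service-year data of Table 2.4. Applying the formula to that table is a choice made here for illustration; it is not a worked example from the text.

```python
class_limits = [(1, 4), (5, 8), (9, 12), (13, 16), (17, 20), (21, 24), (25, 28)]
frequencies = [16, 20, 28, 24, 16, 11, 5]

# Each class is represented by its mid-point x_i.
midpoints = [(lo + hi) / 2 for lo, hi in class_limits]

sum_fx = sum(f * x for f, x in zip(frequencies, midpoints))   # sum of f * x_i
sum_f = sum(frequencies)                                      # sum of f
grouped_mean = sum_fx / sum_f

print(f"sum f*x = {sum_fx}, sum f = {sum_f}, mean = {grouped_mean:.1f}")
```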

Median

For grouped data with a frequency distribution, the following method is used to find the median.
Create a column for the cumulative frequency and determine the position of the median in the
distribution. Once the median interval is determined, the median is calculated as

$$\text{Median} = L_m + \left[\frac{\frac{n}{2} - \sum f_{m-1}}{f_m}\right] \times c$$

where n = sample size,
Lₘ = lower boundary of the median class,
Σfₘ₋₁ = cumulative frequency before the median class,
fₘ = frequency of the median class, and
c = median class size.
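
The median formula can be sketched in code using the Table 2.4 service-year data with the class boundaries from Table 2.5. The function name is hypothetical, and the median class is taken to be the first class whose cumulative frequency reaches n/2.

```python
def grouped_median(boundaries, frequencies):
    """Median of grouped data: L + ((n/2 - F) / f) * c."""
    n = sum(frequencies)
    cum_before = 0                                 # cumulative frequency before class
    for (lower, upper), f in zip(boundaries, frequencies):
        if cum_before + f >= n / 2:                # this is the median class
            return lower + (n / 2 - cum_before) / f * (upper - lower)
        cum_before += f

boundaries = [(0.5, 4.5), (4.5, 8.5), (8.5, 12.5), (12.5, 16.5),
              (16.5, 20.5), (20.5, 24.5), (24.5, 28.5)]
frequencies = [16, 20, 28, 24, 16, 11, 5]

median = grouped_median(boundaries, frequencies)
print(f"median = {median:.2f} years")
```

Here n/2 = 60 falls in the class 8.5 to 12.5, which has 36 observations before it and a frequency of 28.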

Mode

For a grouped frequency distribution with continuous variables, the mode can be estimated using a
histogram. First, a histogram is drawn for the data and the class with the highest frequency
(commonly called the modal class) is identified. Next, two lines are drawn across the top of the
modal-class column: one from its top right-hand corner to the top right-hand corner of the column
before the modal class, and another from its top left-hand corner to the top
left-hand corner of the column after the modal class. At the point of intersection between
the two lines, a vertical line is drawn towards the horizontal axis of the histogram. The variable
value at which this line meets the axis is the mode of the distribution.

Mode can also be calculated as

$$\text{Mode} = L + \left[\frac{f_0 - f_1}{(f_0 - f_1) + (f_0 - f_2)}\right] \times c$$

where L = lower boundary of the class containing the mode,
c = size of the class containing the mode,
f₀ = frequency of the class containing the mode,
f₁ = frequency of the class before the class containing the mode, and
f₂ = frequency of the class after the class containing the mode.
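
The mode formula can be sketched with the same service-year data from Table 2.4, using the class boundaries of Table 2.5. The function name is hypothetical; a missing neighbouring class is treated as having frequency zero, which is an assumption of this sketch.

```python
def grouped_mode(boundaries, frequencies):
    """Mode of grouped data: L + (f0 - f1) / ((f0 - f1) + (f0 - f2)) * c."""
    i = max(range(len(frequencies)), key=frequencies.__getitem__)  # modal class index
    lower, upper = boundaries[i]
    f0 = frequencies[i]
    f1 = frequencies[i - 1] if i > 0 else 0                     # class before
    f2 = frequencies[i + 1] if i < len(frequencies) - 1 else 0  # class after
    return lower + (f0 - f1) / ((f0 - f1) + (f0 - f2)) * (upper - lower)

boundaries = [(0.5, 4.5), (4.5, 8.5), (8.5, 12.5), (12.5, 16.5),
              (16.5, 20.5), (20.5, 24.5), (24.5, 28.5)]
frequencies = [16, 20, 28, 24, 16, 11, 5]

mode = grouped_mode(boundaries, frequencies)
print(f"mode = {mode:.2f} years")
```

The modal class is 8.5 to 12.5 with f₀ = 28, f₁ = 20 and f₂ = 24. The formula is undefined when all three frequencies are equal, a case this sketch does not handle.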

Estimating the mode from a histogram

A histogram can be used to estimate the mode of the distribution. The following steps can be
taken.

1. Identify the modal class. (This is the class interval with the highest frequency.)
2. Draw two lines, AC and BD, as shown in Figure 2.9.
3. The mode is the value on the horizontal axis where AC and BD intersect.

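[Figure: histogram with vertical axis f. Lines AC and BD are drawn across the top of the modal class; the mode is marked on the horizontal axis below their intersection.]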
Figure 2.9 Estimating the mode from a histogram

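[Figure: histogram. Horizontal axis: class boundaries 0.5, 4.5, 8.5, 12.5, 16.5, 20.5, 24.5, 28.5, with the mode marked; vertical axis: frequency (0 to 30); points A, B and D marked on the modal class.]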

Figure 2.10 Estimating the mode of employees' working experience

2.11 Skewness

Skewness measures the lack of symmetry in a data distribution. The skewed portion is the
long and thin part of the curve. A skewed distribution means data are sparse at one end of the
distribution but dense at the other end. The value of skewness falls between -3.0 and 3.0. The
skewness value -3.0 indicates the distribution is extremely skewed to the left, while the
skewness value 3.0 indicates the data are extremely skewed to the right. If the data are perfectly
symmetrical, the skewness value is 0.0. However, in research, data are considered to be
approximately normally distributed if the skewness value falls between -1.0 and 1.0.

Skewness in Relation to Mean, Median and Mode.

The concept of skewness helps us understand the relation between three measures (mean,
median and mode) in a unimodal distribution (a distribution with a single peak or mode). The mode
is the highest point of the curve, or the apex, and the median is the middle value. The mean is usually
located somewhere towards the tail of the distribution curve because the mean is affected by all
values, including extreme ones. A bell-shaped or normal distribution curve has no skewness.
The mean, median and mode are all located at the centre of the distribution. The intensity of
the skew can simply be measured by subtracting the mode from the mean. If the mean is
greatly pulled to one side, then the difference between the mean and mode will be large. The
larger the magnitude, the more skewed the distribution. In a positively-skewed distribution, the
mean is found to the right of the mode. When the distribution is skewed to the left, the mean
is pulled to the left and is smaller than the mode. This distribution is called negatively-skewed. If
the difference between mode and mean is exactly zero, then the distribution is symmetrical.

The Relationship Between Mean, Median and Mode

(a) Mean > median > mode: The distribution is positively-skewed or is skewed to the right
(Figure 2.11).

[Figure: curve with mode, median and mean marked from left to right.]

Figure 2.11 Distribution is positively-skewed

(b) Mean = median = mode: The distribution is symmetrical and has zero skewness, meaning
the data are evenly distributed about the centre (Figure 2.12). As such, the mean, median and mode
are equal in value.

[Figure: symmetrical curve with mean, median and mode at the same central point.]

Figure 2.12 Distribution is symmetrical and has skewness = 0.0

(c) Mean < median < mode: The distribution is negatively-skewed or is skewed to the left
(Figure 2.13).

[Figure: curve with mean, median and mode marked from left to right.]

Figure 2.13 Distribution is negatively-skewed
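
The mean-minus-mode comparison can be sketched as below. The function name and the three small data sets are hypothetical, and the sketch assumes unimodal data, since the comparison is described for distributions with a single mode.

```python
import statistics

def skew_direction(values):
    """Classify skew by the sign of (mean - mode); assumes a single mode."""
    difference = statistics.mean(values) - statistics.mode(values)
    if difference > 0:
        return "positively skewed"   # mean pulled to the right of the mode
    if difference < 0:
        return "negatively skewed"   # mean pulled to the left of the mode
    return "symmetrical"

right_tail = [1, 2, 2, 2, 3, 4, 9]    # hypothetical data with a long right tail
balanced = [1, 2, 3, 3, 3, 4, 5]      # hypothetical symmetrical data
left_tail = [1, 6, 7, 8, 8, 8, 9]     # hypothetical data with a long left tail

for data in (right_tail, balanced, left_tail):
    print(data, "->", skew_direction(data))
```

The size of the difference, not just its sign, indicates how strongly the distribution is skewed.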

2.12 Measures of Position for Grouped Data

The position of grouped data can be measured by the first and third quartile denoted as Q₁
and Q₃ respectively. The first and third quartiles can be calculated based on the distribution
table. They can also be obtained by using an ogive (cumulative frequency curve).

First (Q₁) and Third (Q₃) Quartiles

Steps to obtain Q₁ and Q₃ values using formulae are as follows.

Step 1 Obtain the cumulative frequencies.

Step 2 Identify the first and third quartile classes. To do this, first obtain the locations of the first
and third quartiles using the formulae $\frac{n}{4}$ and $\frac{3n}{4}$ respectively. Then refer to the
cumulative frequency column and determine the classes in which Q₁
and Q₃ lie. Within these classes, the values of Q₁ and Q₃ can be determined.

Step 3 Find the first and third quartiles.

$$Q_1 = L_1 + \left[\frac{\frac{n}{4} - F_1}{f_1}\right] c_1$$

where L₁ = lower boundary of the first quartile class,
n = number of observations,
F₁ = cumulative frequency before the first quartile class,
f₁ = frequency of the first quartile class, and
c₁ = first quartile class size.

$$Q_3 = L_3 + \left[\frac{\frac{3n}{4} - F_3}{f_3}\right] c_3$$

where L₃ = lower boundary of the third quartile class,
n = number of observations,
F₃ = cumulative frequency before the third quartile class,
f₃ = frequency of the third quartile class, and
c₃ = third quartile class size.
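
Both quartile formulae share one pattern, so a single function can cover them. The sketch below applies it to the Table 2.4 service-year data with the class boundaries of Table 2.5; the function name and the choice of data are assumptions made for illustration.

```python
def grouped_quartile(boundaries, frequencies, fraction):
    """Quartile of grouped data: L + ((fraction * n - F) / f) * c."""
    target = fraction * sum(frequencies)          # n/4 for Q1, 3n/4 for Q3
    cum_before = 0                                # F: cumulative frequency before class
    for (lower, upper), f in zip(boundaries, frequencies):
        if cum_before + f >= target:              # quartile class found
            return lower + (target - cum_before) / f * (upper - lower)
        cum_before += f

boundaries = [(0.5, 4.5), (4.5, 8.5), (8.5, 12.5), (12.5, 16.5),
              (16.5, 20.5), (20.5, 24.5), (24.5, 28.5)]
frequencies = [16, 20, 28, 24, 16, 11, 5]

q1 = grouped_quartile(boundaries, frequencies, 1 / 4)
q3 = grouped_quartile(boundaries, frequencies, 3 / 4)
print(f"Q1 = {q1:.1f} years, Q3 = {q3:.1f} years")
```

With n = 120, the locations are 30 and 90, which fall in the classes 4.5 to 8.5 and 16.5 to 20.5 respectively.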

Steps to obtain Q₁ and Q₃ values using an ogive are as follows.

Step 1 Mark the first and third quartile locations on the y-axis. The first quartile location is $\frac{n}{4}$
and the third quartile location is $\frac{3n}{4}$.

Step 2 From each of the quartile locations marked on the y-axis, draw a horizontal line
to the ogive and, from the point where it meets the curve, a vertical line down to the x-axis.
The values at which the vertical lines cross the x-axis are Q₁ and Q₃.

2.13 Introduction to Measures of Dispersion

In the last subtopic, we learnt that mean, median and mode are measures of the central location of a
distribution. Two distributions may have the same central location but different dispersions
or spreads. This is shown by distributions A and B in Figure 2.14.

A distribution can be effectively described by its central location (using mean, median or mode)
and its dispersion. Dispersion of a distribution provides us with additional information on the
reliability of the measure of central location. If the data are widely dispersed, the central location is said
to be less representative of the data as a whole. The central location for data with little
dispersion is considered more reliable.

The measures of dispersion can be used to compare the dispersion of various samples. A widely
spread distribution should not be used for decision making. For example, a financial analyst
knows that widely dispersed earnings indicate a high risk to stockholders and creditors,
whereas a small dispersion of earnings indicates stable earnings and therefore a lower risk level.

Figure 2.14 Mean A and B

2.14 Understanding Dispersion in Data

Measures of dispersion help us to understand the spread or variability of a set of data. They give
additional information to judge the reliability of the measure of central tendency and help in
comparing the dispersion that is present in various samples.

There are several common measures of dispersion. They are range, interquartile range, semi-
interquartile range (also called quartile deviation), average absolute deviation, variance,
standard deviation and coefficient of variation. The more spread out or dispersed the data,
the larger the range, the interquartile range, the variance and the standard deviation. On
the other hand, the more concentrated or homogeneous the data, the smaller the various
measures of dispersion stated above. If the observations are all the same (no variation in the
data), the measures of dispersion will be zero. These measures of dispersion are non-negative
in value.

2.15 Distance Measures of Dispersion

Range for Ungrouped data

Range is the difference between the largest and the smallest observations in a set of data.

Range = Largest value – Smallest value

The range measures the total spread of the data. The range is a simple measure of total
variation or dispersion in the data. However, the weakness is, it does not take into account
how the data are distributed between the smallest and the largest values.

Two different sets of data may give the same range even though the values between the two
extremes are different. Moreover, the range is heavily influenced by the extreme values in the data.
Range is often used in quality control for production processes due to its simplicity.

Range for Grouped data

For grouped data in a frequency distribution, the range is the difference between the upper
boundary of the highest class (UB) and the lower boundary of the lowest class (LB).

Range = Upper boundary of highest class – Lower boundary of lowest class

Interquartile Range

Interquartile range is the difference between the third and the first quartiles in a set of data.
This measure considers the spread in the middle fifty percent of the data; therefore it is not
influenced by the extreme values of the data set. Interquartile range is obtained by subtracting
the first quartile from the third quartile.

Interquartile range = 𝑄3 - 𝑄1

Quartile Deviation (Semi-interquartile range)

Another measure of distance dispersion is the quartile deviation. Quartile deviation is defined
as follows.

$$\text{Quartile deviation} = \frac{1}{2}(Q_3 - Q_1)$$

It is also frequently called the semi-interquartile range.
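
The three distance measures can be computed together. This sketch uses hypothetical data and Python's statistics.quantiles, whose default method places the quartiles at the (n + 1)/4 and 3(n + 1)/4 positions. Note that it interpolates when a position is not a whole number, which can differ from a round-to-nearest rule; with the 11 values below both positions are whole numbers.

```python
import statistics

data = [12, 15, 11, 20, 18, 25, 30, 22, 17, 14, 27]   # hypothetical, n = 11

data_range = max(data) - min(data)                 # largest value - smallest value
q1, _, q3 = statistics.quantiles(data, n=4)        # quartile cut points
interquartile_range = q3 - q1                      # spread of the middle 50%
quartile_deviation = interquartile_range / 2       # semi-interquartile range

print(data_range, interquartile_range, quartile_deviation)
```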



2.16 Mean deviation (Average Absolute Deviation)

Mean deviation (or average absolute deviation) is calculated by summing up the absolute differences
between each observation and the mean. This value is then divided by the number of
observations.

$$\text{Mean deviation} = \frac{\sum |x - \text{mean}|}{n}$$

where x is the observation and n is the number of observations.

Mean deviation does not have the mathematical properties required for other
statistical work such as hypothesis testing, forecasting and quality control, so it is of limited use
there. The more useful measures of variation are variance and standard deviation.

2.17 Population variance and standard deviation

Population variance for ungrouped data

Variance determines how the values fluctuate around the mean. The population variance is
the mean value of the squares of the deviations from the mean. This is usually written as σ²
and read as sigma squared.

$$\sigma^2 = \frac{1}{N}\sum (X - \mu)^2$$

The population variance can also be written in an equivalent computational form.

$$\sigma^2 = \frac{1}{N}\sum X^2 - \mu^2$$

where σ² = population variance,

X = observation or value,

N = total number of observations in the population,

∑ = sum of values, and

μ = population mean.
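
As a check that the definitional and computational forms agree, the sketch below evaluates both on a small invented population (the data values and variable names are assumptions for illustration).

```python
import statistics

# Hypothetical population of N = 5 values.
population = [2, 4, 6, 8, 10]
N = len(population)
mu = sum(population) / N

# Definition: mean of the squared deviations from the mean.
var_definition = sum((x - mu) ** 2 for x in population) / N

# Computational form: mean of the squares minus the square of the mean.
var_shortcut = sum(x ** 2 for x in population) / N - mu ** 2

print(var_definition, var_shortcut)  # 8.0 8.0

# The standard library's population variance gives the same value.
library_value = statistics.pvariance(population)
```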

Population standard deviation for ungrouped data

The most practical and commonly used measure of variation is the population standard
deviation, which is represented by the symbol σ. This measure is the square root of the
population variance. Population standard deviation is defined as

$$\text{Population standard deviation, } \sigma = \sqrt{\sigma^2}$$

Thus, population standard deviation can be calculated using the following formulae.

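$$\sigma = \sqrt{\frac{1}{N}\sum (X - \mu)^2}, \qquad \sigma = \sqrt{\frac{1}{N}\sum X^2 - \mu^2}$$

or

$$\sigma = \sqrt{\frac{1}{N}\sum X^2 - \left(\frac{\sum X}{N}\right)^2}$$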
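
A brief sketch, on invented data, showing that taking the square root of either variance form gives the same population standard deviation, and that it matches `statistics.pstdev` from the Python standard library.

```python
import math
import statistics

# Hypothetical population.
population = [2, 4, 6, 8, 10]
N = len(population)
mu = sum(population) / N

# Square root of the mean squared deviation.
sigma_from_deviations = math.sqrt(sum((x - mu) ** 2 for x in population) / N)

# Square root of (mean of squares) minus (sum / N) squared.
sigma_from_squares = math.sqrt(
    sum(x ** 2 for x in population) / N - (sum(population) / N) ** 2
)

sigma_library = statistics.pstdev(population)
print(round(sigma_from_deviations, 4))  # 2.8284
```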

Population variance and standard deviation of grouped data

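For grouped data in the form of a distribution of class intervals, each class interval is
represented by its mid-point x and weighted by its class frequency f, with N = ∑f. The
following formulae for the population variance are used.

$$\sigma^2 = \frac{1}{N}\sum f(x - \mu)^2$$

$$\sigma^2 = \frac{1}{N}\sum fx^2 - \left(\frac{\sum fx}{N}\right)^2$$

or

$$\sigma^2 = \frac{1}{N}\sum fx^2 - \mu^2$$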
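
The sketch below applies the grouped-data formulae to a hypothetical frequency distribution (the mid-points and frequencies are invented for illustration), treating every observation in a class as if it sat at the class mid-point.

```python
import math

# Hypothetical grouped data: class mid-points and class frequencies.
midpoints = [5, 15, 25, 35]
frequencies = [2, 3, 4, 1]

N = sum(frequencies)                                             # total frequency
sum_fx = sum(f * x for f, x in zip(frequencies, midpoints))      # sum of f*x
sum_fx2 = sum(f * x ** 2 for f, x in zip(frequencies, midpoints))  # sum of f*x^2

mu = sum_fx / N

# Definition: frequency-weighted mean of squared deviations from mu.
var_definition = sum(f * (x - mu) ** 2 for f, x in zip(frequencies, midpoints)) / N

# Computational form: (sum f x^2) / N - (sum f x / N)^2.
var_shortcut = sum_fx2 / N - (sum_fx / N) ** 2

sigma = math.sqrt(var_definition)
print(mu, var_definition, var_shortcut)  # 19.0 84.0 84.0
```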

2.18 Sample Variance and Standard Deviation

Sample Variance for Ungrouped Data

Variance for a sample of n measurements is the sum of the squared distances of the
measurements from the mean divided by (n-1). This is normally denoted by the symbol s².

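The sample variance can be written in several equivalent forms.

$$s^2 = \frac{1}{n-1}\sum (x - \bar{x})^2$$

$$s^2 = \frac{1}{n-1}\left(\sum x^2 - n\bar{x}^2\right) \quad \text{or} \quad s^2 = \frac{1}{n-1}\left(\sum x^2 - \frac{(\sum x)^2}{n}\right)$$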

where n = number of observations or values in the sample,

x = observation or value,

x̄ = sample mean, and

∑x² = sum of the squares of all the observations.
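
To illustrate that the three forms of the sample variance agree, the sketch below evaluates each of them on a small invented sample (the data and variable names are assumptions for illustration). Note the divisor n - 1 rather than n.

```python
import statistics

# Hypothetical sample of n = 5 observations.
sample = [2, 4, 6, 8, 10]
n = len(sample)
x_bar = sum(sample) / n
sum_x = sum(sample)
sum_x2 = sum(x ** 2 for x in sample)

# Form 1: sum of squared deviations from the sample mean, divided by n - 1.
s2_definition = sum((x - x_bar) ** 2 for x in sample) / (n - 1)

# Form 2: (sum x^2 - n * x_bar^2) / (n - 1).
s2_mean_form = (sum_x2 - n * x_bar ** 2) / (n - 1)

# Form 3: (sum x^2 - (sum x)^2 / n) / (n - 1).
s2_sum_form = (sum_x2 - sum_x ** 2 / n) / (n - 1)

print(s2_definition, s2_mean_form, s2_sum_form)  # 10.0 10.0 10.0

# The standard library's sample variance also divides by n - 1.
s2_library = statistics.variance(sample)
```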

Sample Standard Deviation for Ungrouped Data

Sample standard deviation is the square root of the sample variance.

$$\text{Sample standard deviation, } s = \sqrt{s^2}$$

The sample standard deviation is calculated using the following formulae.

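$$s = \sqrt{\frac{1}{n-1}\sum (x - \bar{x})^2}$$

$$s = \sqrt{\frac{1}{n-1}\left(\sum x^2 - \frac{(\sum x)^2}{n}\right)}$$

$$s = \sqrt{\frac{1}{n-1}\left(\sum x^2 - n\bar{x}^2\right)}$$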

Interpreting variance and standard deviation

The variance and the standard deviation (the square root of the variance) measure the
average scatter around the mean. In other words, they measure the fluctuations of the data
values above and below their mean. The variance possesses certain useful mathematical
properties. However, its computation results in squared units such as squared percentages,
squared ringgit and squared centimetres. Thus, the primary measure of variation is the
standard deviation, as its value is in the original units of the data.

Sample variance and standard deviation of grouped data

For grouped data that is in the form of distribution of class intervals, the following formulae are
used.

$$\text{Sample variance, } s^2 = \frac{1}{n-1}\sum f(x - \bar{x})^2$$

$$s^2 = \frac{1}{n-1}\left(\sum fx^2 - n\bar{x}^2\right)$$

$$s^2 = \frac{1}{n-1}\left[\sum fx^2 - \frac{(\sum fx)^2}{n}\right]$$

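The sample standard deviation is the square root of the sample variance.

$$s = \sqrt{s^2}$$

$$s = \sqrt{\frac{1}{n-1}\sum f(x - \bar{x})^2}$$

$$s = \sqrt{\frac{1}{n-1}\left(\sum fx^2 - \frac{(\sum fx)^2}{n}\right)}$$

$$s = \sqrt{\frac{1}{n-1}\left(\sum fx^2 - n\bar{x}^2\right)}$$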
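
A minimal sketch of the grouped-data sample formulae, wrapped in a hypothetical helper `grouped_sample_variance` (the function name and the frequency table are assumptions for illustration). It uses the form with ∑fx² and (∑fx)²/n, where n = ∑f.

```python
import math


def grouped_sample_variance(midpoints, frequencies):
    """Sample variance of grouped data, with each class represented by its mid-point."""
    n = sum(frequencies)
    sum_fx = sum(f * x for f, x in zip(frequencies, midpoints))
    sum_fx2 = sum(f * x ** 2 for f, x in zip(frequencies, midpoints))
    # s^2 = [sum f x^2 - (sum f x)^2 / n] / (n - 1)
    return (sum_fx2 - sum_fx ** 2 / n) / (n - 1)


# Hypothetical grouped sample: class mid-points and class frequencies.
midpoints = [5, 15, 25, 35]
frequencies = [2, 3, 4, 1]

s2 = grouped_sample_variance(midpoints, frequencies)
s = math.sqrt(s2)
print(round(s2, 2), round(s, 2))  # 93.33 9.66
```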

2.19 Coefficient of Variation (Relative Dispersion)

Large standard deviations mean large variability within the data set. In some cases large
variability is desired, but in other cases small variability is desired. When comparing
distributions with different means and variances, a useful measure is the coefficient of
variation (written as CV). The coefficient of variation is the ratio of the standard deviation
to the arithmetic mean, expressed as a percentage.

$$CV = \frac{\text{Sample standard deviation}}{\text{Sample mean}} \times 100$$
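
The sketch below compares two invented samples that have the same standard deviation but different means, so their coefficients of variation differ (the data and the helper name `coefficient_of_variation` are assumptions for illustration).

```python
import statistics


def coefficient_of_variation(sample):
    """CV = sample standard deviation / sample mean * 100 (per cent)."""
    return statistics.stdev(sample) / statistics.mean(sample) * 100


# Hypothetical samples measured in different units.
heights_cm = [160, 170, 180]  # mean 170, sample standard deviation 10
weights_kg = [50, 60, 70]     # mean 60, sample standard deviation 10

cv_heights = coefficient_of_variation(heights_cm)
cv_weights = coefficient_of_variation(weights_kg)
print(round(cv_heights, 2), round(cv_weights, 2))  # 5.88 16.67
```

Although both samples have the same standard deviation, the weights are more variable relative to their mean.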

2.20 Skewness

Other than measures of central location and dispersion, another important measure of
distribution is the skewness of the distribution. A distribution can be symmetric, skewed to
the right or skewed to the left.

In a set of data, if the mode is greater than the mean, there are more values above the mean
than below it, while the smaller values are more widely spread and form a long tail on the
left. The distribution is negatively-skewed or skewed to the left (Figure 2.15).

Figure 2.15 Skewed to the left

However, if the mean is greater than the mode, the distribution is skewed to the right or
positively-skewed.

Figure 2.16 Skewed to the right

If the mean, median and mode are equal and the distribution is equally spread on both sides
of the mean, then it is said to be symmetrical. The distribution is usually bell-shaped.

Measure of Skewness

A simple measure of skewness is based on the sign of the difference between the mean and the
mode of the distribution.

1. If (mean − mode) is positive, the distribution is skewed to the right or
   positively-skewed.
2. If (mean − mode) is negative, the distribution is skewed to the left or
   negatively-skewed.
3. If (mean − mode) = 0, the distribution is symmetrical.
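
The three rules above can be sketched as a small function. The helper name `skew_direction` and the data sets are assumptions for illustration, each data set has a single mode, and the exact comparison with zero is only suitable for tidy examples like these.

```python
import statistics


def skew_direction(values):
    """Classify skewness from the sign of (mean - mode)."""
    difference = statistics.mean(values) - statistics.mode(values)
    if difference > 0:
        return "skewed to the right"
    if difference < 0:
        return "skewed to the left"
    return "symmetrical"


# Hypothetical data sets.
right_tail = [1, 2, 2, 2, 3, 4, 10]  # mean about 3.43, mode 2
left_tail = [1, 7, 8, 9, 9, 9, 10]   # mean about 7.57, mode 9
balanced = [1, 2, 3, 3, 3, 4, 5]     # mean 3, mode 3

print(skew_direction(right_tail))  # skewed to the right
print(skew_direction(left_tail))   # skewed to the left
print(skew_direction(balanced))    # symmetrical
```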

Pearson coefficient of skewness

Pearson coefficient of skewness is usually used to measure the skewness of the distribution.
It is calculated as follows.

$$\text{Skewness} = \frac{\text{Mean} - \text{Mode}}{\text{Standard deviation}} \quad \text{or} \quad \text{Skewness} = \frac{3(\text{Mean} - \text{Median})}{\text{Standard deviation}}$$

If the skewness is 0, the distribution is symmetrical.

If the skewness is positive, the distribution is skewed to the right or positively-skewed.

If the skewness is negative, the distribution is skewed to the left or negatively-skewed.
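
A minimal sketch of the median-based Pearson coefficient, using an invented sample. The text does not say which standard deviation to use, so the sample standard deviation (`statistics.stdev`) is an assumption here, as is the helper name `pearson_median_skewness`.

```python
import statistics


def pearson_median_skewness(sample):
    """Skewness = 3 * (mean - median) / standard deviation."""
    mean = statistics.mean(sample)
    median = statistics.median(sample)
    sd = statistics.stdev(sample)  # assumption: sample standard deviation
    return 3 * (mean - median) / sd


# Hypothetical sample with one large value pulling the mean to the right.
data = [2, 3, 4, 5, 16]  # mean 6, median 4

skewness = pearson_median_skewness(data)
print(round(skewness, 2))  # 1.05
```

The positive result indicates a distribution skewed to the right, in line with the rule stated above.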
