
DESCRIPTIVE STATISTICS

Introduction

One important use of descriptive statistics is to summarize a collection of
data in a clear and understandable way. Data need to be ordered, reduced and
presented in some attractive way so that real information may be extracted.

Organising Data

Ordering and reduction of data can be achieved by the use of tables,
graphs and descriptive measures (summary statistics), and these methods
complement each other.

TABULATIONS

The most common way of presenting information is in the form of a table.
Tables can make information easier to assimilate, showing patterns or
groupings at a glance.

Frequency Distributions

A frequency distribution shows a summarised grouping of data divided
into mutually exclusive classes, together with the number of occurrences in
each class. Frequency distributions are used for both qualitative and
quantitative data. A frequency distribution for qualitative data lists all
categories and the number of elements that belong to each category.

Example 2.1 The following were responses on degree preference of 25 students
who have just completed their A-Level studies.
Business Accounting Tourism BSc Medicine
Engineering Business Economics Accounting Pharmacy
BA Accounting Economics Business Accounting
Accounting Tourism BSc Pharmacy Economics
Accounting Economics Medicine BA Business

Construct a frequency distribution table for the above data set.

Solution

Degree preference Frequency


Accounting 6
Business 4
Tourism 2
Economics 4
BSc 2
Engineering 1
BA 2
Pharmacy 2
Medicine 2
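
The tally above can be reproduced with a short script. This is only a sketch: the list `responses` is typed in from Example 2.1, and the variable names are illustrative choices, not part of the notes.

```python
from collections import Counter

# Responses from Example 2.1, entered row by row.
responses = [
    "Business", "Accounting", "Tourism", "BSc", "Medicine",
    "Engineering", "Business", "Economics", "Accounting", "Pharmacy",
    "BA", "Accounting", "Economics", "Business", "Accounting",
    "Accounting", "Tourism", "BSc", "Pharmacy", "Economics",
    "Accounting", "Economics", "Medicine", "BA", "Business",
]

# Counter tallies how many times each category occurs.
frequency = Counter(responses)

for degree, count in frequency.items():
    print(f"{degree:<12} {count}")
```

Categories are printed in the order in which they first appear in the data, which matches the order used in the solution table.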

Grouped frequency distribution

Grouping is useful for condensing data. For quantitative data we group the
values into a reasonable number of non-overlapping classes. As the number of
classes increases, the table becomes less meaningful and more difficult to
interpret.

Example 2.2 The following are ages of 50 students in a class.

19 20 19 27 37
30 23 20 20 33
22 19 23 21 22
21 19 42 32 20
18 24 30 28 20
22 27 37 20 24
27 21 22 20 25
41 20 24 27 19
27 22 18 19 30
28 19 19 21 20

Using the classes 15-19, 20-24, 25-29, etc., construct a grouped frequency
distribution table.

Solution

Age Frequency
15-19 10
20-24 23
25-29 8
30-34 5
35-39 2
40-44 2
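
As a check on the table, the grouping can be sketched in code. The helper `class_label` and the constants `CLASS_WIDTH` and `FIRST_LOWER_LIMIT` are hypothetical names introduced for this illustration; they assume classes of width 5 starting at 15, as in Example 2.2.

```python
from collections import Counter

# Ages from Example 2.2, entered row by row.
ages = [
    19, 20, 19, 27, 37, 30, 23, 20, 20, 33,
    22, 19, 23, 21, 22, 21, 19, 42, 32, 20,
    18, 24, 30, 28, 20, 22, 27, 37, 20, 24,
    27, 21, 22, 20, 25, 41, 20, 24, 27, 19,
    27, 22, 18, 19, 30, 28, 19, 19, 21, 20,
]

CLASS_WIDTH = 5         # assumed: classes 15-19, 20-24, ...
FIRST_LOWER_LIMIT = 15  # assumed: lower limit of the first class


def class_label(age):
    """Return the label of the class that contains the given age."""
    offset = (age - FIRST_LOWER_LIMIT) // CLASS_WIDTH
    lower = FIRST_LOWER_LIMIT + offset * CLASS_WIDTH
    return f"{lower}-{lower + CLASS_WIDTH - 1}"


grouped = Counter(class_label(age) for age in ages)

for label in sorted(grouped):
    print(label, grouped[label])
```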

Relative frequency and Percentage distribution


$$\text{Relative frequency} = \frac{\text{frequency}}{\text{total frequency}}$$

It shows what proportion of the total frequency belongs to the corresponding
category.

Percent for a category is obtained by multiplying the relative frequency of
the category by 100.

$$\text{Percent} = \text{relative frequency} \times 100$$

Cumulative Frequencies/Percent

Another useful statistic that can be derived from a frequency distribution
is the cumulative frequency (or cumulative percent). It is the running total
of the frequencies (or percents), accumulated as one moves down the table.

Example 2.3 Referring to the students' degree preferences, the following
table incorporates the relative frequency, percent and cumulative percent.

Degree Preference Relative Freq Percent Cumulative Percent
Accounting 0.24 24 24
Business 0.16 16 40
Tourism 0.08 8 48
Economics 0.16 16 64
BSc 0.08 8 72
Engineering 0.04 4 76
BA 0.08 8 84
Pharmacy 0.08 8 92
Medicine 0.08 8 100
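
The three derived columns can be computed from the frequencies alone. The sketch below assumes the frequency table of Example 2.1; the dictionary names are illustrative.

```python
from itertools import accumulate

# Frequencies from the solution to Example 2.1, in table order.
frequency = {
    "Accounting": 6, "Business": 4, "Tourism": 2, "Economics": 4,
    "BSc": 2, "Engineering": 1, "BA": 2, "Pharmacy": 2, "Medicine": 2,
}

total = sum(frequency.values())

# Relative frequency = frequency / total frequency.
relative = {k: f / total for k, f in frequency.items()}

# Percent = relative frequency x 100 (computed as 100 * f / total).
percent = {k: 100 * f / total for k, f in frequency.items()}

# Cumulative percent is the running total of the percent column.
cumulative_percent = dict(zip(frequency, accumulate(percent.values())))

for k in frequency:
    print(f"{k:<12} {relative[k]:.2f} {percent[k]:.0f} {cumulative_percent[k]:.0f}")
```

The last cumulative value must equal 100, which is a quick check that no category has been missed.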

GRAPHICAL PRESENTATION OF DATA

Some graphs are ideal for qualitative data whilst some are ideal for quanti-
tative data.

Types of Graphs

Pie chart, bar graph, box and whisker diagram, histogram, stem and leaf,
scatter plot, population pyramid, e.t.c.

Pie Chart

A pie chart is used to split a particular quantity into its component pieces.
It is a convenient way of representing percentages or relative frequencies
(rather than frequencies). To construct a pie chart, draw a line from the
center of the circle to the outer edge and then construct the various pieces
of the pie-chart by drawing the corresponding angles.

Example 2.4 Referring to the students' degree preference data (Example 2.3),

$$\text{Angle for Accounting} = 24\% \text{ of } 360^\circ = 86.4^\circ$$


The following is the pie chart for the data.

Bar Graph

A bar graph is a graphical way of presenting a frequency distribution. Bar
charts are often used for qualitative or categorical data, although they can
be used quite effectively with quantitative data if the number of unique scores
in the data set is not large. A bar chart plots the number of times a par-
ticular value or category occurs in a data set, with the height of the bar
representing the number of observations with that score or in that category.
The Y-axis could represent any measurement unit: relative frequency, raw
count, percent, or whatever else is appropriate for the situation.

Example 2.5 A hardware store's quarterly sales, in thousands of dollars, are
as follows.

SEASON QUARTERLY SALES
1st quarter 290
2nd quarter 180
3rd quarter 260
4th quarter 120

Summarize the data in the form of a bar chart.

Histogram

A graph very similar to a bar graph is the histogram. It is ideal for
continuous data and, unlike a bar graph, has no gaps between the bars. It is
constructed by first selecting a number of "intervals" to be used. The choice
is a balance between reducing the information sufficiently and still providing
enough variability to picture the shape of the distribution.

Construction of Histogram of Equal Class Width

The area of each rectangle is very important when constructing a histogram.
If the class widths are equal, we construct a rectangle with each class
interval as a base and the height of the rectangle equal to the number of
observations in the class.

$$\text{Class width} = \text{Upper Class Boundary (UCB)} - \text{Lower Class Boundary (LCB)}$$


Example 2.6 The following data is amount of money spent by households
of a certain location in a week on basic commodities.

Money Spent (Dollars) Frequency


0-25 36
25-50 24
50-75 12
75-100 9
100-125 9
125-150 5
150-175 3
175-200 2

Construct a histogram for the data.

Solution The classes all have the same width, namely 25, so the heights of
the bars/rectangles are the corresponding frequencies for each class.

Construction of Histogram of Unequal Class Width

Sometimes the classes are not of equal width. Instead of using frequency as
the height of a bar/rectangle, for unequal class widths we make use of
frequency density.

$$\text{Frequency density} = \frac{\text{Frequency}}{\text{Class width}}$$

Stem and Leaf Diagram

They are useful in summarising reasonably sized data sets and unlike his-
tograms, result in no loss of information. By this we mean that it is possible
to retrieve the original data set from the stem and leaf diagram, which is not
the case when using a histogram. A stem and leaf diagram is constructed in
such a manner that each number is divided into two parts, a stem consisting
of one or more of the leading digits and a leaf, consisting of the remaining
digits.

Construction of a Stem and Leaf diagram

• Select the stems and list them in order in a vertical column.

• Draw a line to the right of the stems.

• For each observation, record the remaining digits, that is, the leaf, to
the right of the line in the row corresponding to the appropriate stem.

• Reorder the leaves in ascending order.

• If there are too many leaves corresponding to one stem, split the
stem, possibly into two.

• Have a key so that anyone can derive the original observations.

Example 2.7 The following data gives time taken in minutes to interview
participants in a certain survey.

18 42 20 58 56 11 34 42 54 23
66 41 52 23 16 42 37 50 19 36
32 42 53 30 14 29 25 47 21 12

Solution

Stem Leaf
1 1 2 4 6 8 9
2 0 1 3 3 5 9
3 0 2 4 6 7
4 1 2 2 2 2 7
5 0 2 3 4 6 8
6 6

Key: 1|1 means 11
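
The construction steps can be sketched as follows, assuming two-digit observations so that the tens digit is the stem and the units digit is the leaf. The variable names are illustrative.

```python
from collections import defaultdict

# Interview times (minutes) from Example 2.7.
times = [
    18, 42, 20, 58, 56, 11, 34, 42, 54, 23,
    66, 41, 52, 23, 16, 42, 37, 50, 19, 36,
    32, 42, 53, 30, 14, 29, 25, 47, 21, 12,
]

stems = defaultdict(list)

# Sorting first means the leaves of each stem are already in ascending order.
for value in sorted(times):
    stem, leaf = divmod(value, 10)  # tens digit is the stem, units digit the leaf
    stems[stem].append(leaf)

for stem in sorted(stems):
    leaves = " ".join(str(leaf) for leaf in stems[stem])
    print(f"{stem} | {leaves}")

print("Key: 1 | 1 means 11")
```

Because every leaf is kept, the original observations can be rebuilt from the diagram as 10 × stem + leaf.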

Frequency Polygon

It is a graph formed by joining the midpoints of the tops of successive bars
in a histogram by straight lines.

Cumulative frequency curve

It is constructed by putting class boundaries on the horizontal axis and
cumulative frequencies (absolute, relative or percentage) on the vertical
axis. When plotting, the upper boundary of each class is used. This differs
from a frequency polygon, which uses class midpoints.

Activity 2.1

1. The total number of children per 1000 married women are as follows
for women between the ages 15 and 45.

Age (Years) Number of children


15-20 718
20-25 993
25-30 1329
30-35 1788
35-40 2048
40-45 2167

(a) What do you achieve by tabulating the data in the form of a
frequency distribution?
(b) Form a relative frequency distribution
(c) What proportion of the children were born from married women
between the ages of 20 and 30?

2. A random sample of two-bedroom apartments in Avenues area, Harare,
revealed the following monthly rentals. Forty units were sampled.

249 180 190 230 349 299 300 175 305 205
225 160 195 395 245 275 155 225 360 180
205 239 305 260 155 255 310 230 305 210
320 168 309 380 190 225 365 302 285 395

(a) Determine the class intervals if 5 class intervals are desired for a
frequency distribution.
(b) Construct a cumulative frequency distribution.
(c) Construct a histogram for the data.
(d) What proportion of the apartments are priced over $299 per month?

3. The following data are examinations scores obtained by 30 students at
a certain college.

45 88 65 58 45 58 56 68 71 66
87 54 51 74 52 66 64 59 63 54
51 72 53 62 42 32 68 47 91 52

Construct a stem and leaf diagram for the data.

4. The daily number of photocopies made in an office are grouped into
a table having the classes 0-499, 500-999, 1000-1499, and 1500-1999
photocopies. Find

(a) the class boundaries


(b) the class midpoint
(c) the class width

DESCRIPTIVE MEASURES

Measures of Central Tendency

The purpose of a measure of central tendency is to determine the “centre”
of your values or possibly the “most typical” data value. The main measures
of central tendency are the mean, median and the mode.

Mean for Ungrouped Data

The mean is the most commonly used measure of central tendency. In most
cases, we deal with data from a sample and we refer to the arithmetic mean
as simply the sample mean. If the observations in a sample of size n are
$x_1, x_2, \dots, x_n$, then

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$$

Example 2.8 The data 3652, 4125, 9526, 2546 and 2328 are salaries of
company executives. Find the sample mean.

Solution

$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{3652 + 4125 + \dots + 2328}{5} = \frac{22177}{5} = 4435.4$$

Mean for Grouped Data

Sometimes we may have to work with data in the form of a frequency distri-
bution, called grouped data, when the raw data are not available. We do not
have the data values used to make this frequency distribution and so we are
forced to approximate the sample statistics. Suppose the data are grouped into
k classes with frequencies $f_1, f_2, \dots, f_k$ and midpoints
$x_1, x_2, \dots, x_k$; then the arithmetic mean for grouped data is defined as

$$\bar{x} = \frac{\sum_{i=1}^{k} f_i x_i}{n}$$

where $n = \sum_{i=1}^{k} f_i$.

Example 2.9 Making use of data in Example 2.6, calculate the sample
mean.

Money Spent (Dollars) LCB UCB Mid Point (xi ) Frequency (fi ) fi xi
0-25 0 25 12.5 36 450
25-50 25 50 37.5 24 900
50-75 50 75 62.5 12 750
75-100 75 100 87.5 9 787.5
100-125 100 125 112.5 9 1012.5
125-150 125 150 137.5 5 687.5
150-175 150 175 162.5 3 487.5
175-200 175 200 187.5 2 375
Sum 100 5450
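
$$\begin{aligned}
\bar{x} &= \frac{\sum_{i=1}^{k} f_i x_i}{n} \\
&= \frac{12.5 \times 36 + 37.5 \times 24 + \dots + 187.5 \times 2}{100} \\
&= \frac{5450}{100} \\
&= 54.5
\end{aligned}$$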
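
The same calculation can be sketched in code. Each class is stored as a tuple (LCB, UCB, frequency) taken from Example 2.6; this tuple layout and the variable names are choices made for the illustration.

```python
# (lower class boundary, upper class boundary, frequency) from Example 2.6.
spending = [
    (0, 25, 36), (25, 50, 24), (50, 75, 12), (75, 100, 9),
    (100, 125, 9), (125, 150, 5), (150, 175, 3), (175, 200, 2),
]

# n is the total frequency.
n = sum(freq for _, _, freq in spending)

# Each class contributes frequency x midpoint to the sum.
sum_fx = sum(freq * (lcb + ucb) / 2 for lcb, ucb, freq in spending)

grouped_mean = sum_fx / n
print(n, sum_fx, grouped_mean)
```

The result is an approximation of the mean of the raw data, because every observation in a class is treated as if it sat at the class midpoint.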

Median for Ungrouped Data

If a set of n observations is arranged in order of size then, if n is odd,
the median is the value of the middle observation; if n is even, the median
is the arithmetic mean of the two middle observations. That is:

(i) If n is odd and M is the value of the median, then

$$M = \text{the value of the } \left(\frac{n+1}{2}\right)\text{th observation.}$$

(ii) If n is even, the middle observations are the $\left(\frac{n}{2}\right)$th
and the $\left(\frac{n}{2} + 1\right)$th observations, and then

M = the mean value of these two observations

Example 2.10 Find the median of the numbers 356, 147, 216, 215, 191, 209,
187, 153, 278 and 133.

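Solution We first rank our observations in either ascending or descending
order; in ascending order they are 133, 147, 153, 187, 191, 209, 215, 216,
278 and 356. Here n = 10, that is, n is even, hence

$$\begin{aligned}
\text{median} &= \text{mean of the } \left(\frac{n}{2}\right)\text{th and } \left(\frac{n}{2} + 1\right)\text{th observations} \\
&= \text{mean of the 5th and 6th observations} \\
&= \frac{191 + 209}{2} = \frac{400}{2} = 200
\end{aligned}$$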
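
The two cases of the definition translate directly into code. This is a minimal sketch; the function name `median` is an illustrative choice.

```python
def median(values):
    """Median of ungrouped data, following the odd/even rule above."""
    ordered = sorted(values)
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        # n odd: the (n + 1)/2-th observation (index `middle` when counting from 0).
        return ordered[middle]
    # n even: mean of the (n/2)-th and (n/2 + 1)-th observations.
    return (ordered[middle - 1] + ordered[middle]) / 2


# Data from Example 2.10.
data = [356, 147, 216, 215, 191, 209, 187, 153, 278, 133]
print(median(data))
```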
Median for Grouped Data

It is not possible to find the exact value for the median for grouped data.
However, the median can be obtained using two approaches, that is, the
graphical method and the arithmetic approach.

Graphical Method

Here we make use of the cumulative frequency curve to find the median. To
find the median you locate the 50th percentile, or the $\left(\frac{n}{2}\right)$th
position of the absolute cumulative frequencies, on the vertical axis. Having
located this point, move horizontally until you reach the curve and then move
vertically downward to the horizontal axis. That position on the horizontal
axis is the median.

Arithmetic method

We can also make use of the cumulative frequency distribution to calculate
the median. We determine the location of the median class by calculating
$\frac{n}{2}$. We then determine the frequency of the median class, the lower
limit of the median class, the width of the class and the cumulative frequency
just before the median class. The median is given by:

$$\text{median} = L_m + \frac{c_m\left(\frac{n}{2} - F_{m-1}\right)}{f_m}$$

where $L_m$ is the lower limit of the median class,
$c_m$ is the class width of the median class,
$f_m$ is the frequency of the median class,
$F_{m-1}$ is the cumulative frequency of the class just before the median class and
$n = \sum_{i=1}^{k} f_i$ is the total number of observations.

Example 2.11 Making use of data in Example 2.6, calculate the sample me-
dian.

Here the data is grouped and, incorporating the cumulative frequency, we
have

Money Spent (Dollars) Frequency Cumulative Frequency


0-25 36 36
25-50 24 60
50-75 12 72
75-100 9 81
100-125 9 90
125-150 5 95
150-175 3 98
175-200 2 100

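n = 100 and $\frac{100}{2} = 50$, hence our median class is 25-50. Also
$L_m = 25$, $c_m = 25$, $f_m = 24$, $F_{m-1} = 36$. Thus:

$$\begin{aligned}
\text{median} &= L_m + \frac{c_m\left(\frac{n}{2} - F_{m-1}\right)}{f_m} \\
&= 25 + \frac{25\left(\frac{100}{2} - 36\right)}{24} \\
&= 25 + \frac{25(50 - 36)}{24} \\
&= 25 + \frac{350}{24} \\
&= 39.5833 \text{ (4 d.p.)}
\end{aligned}$$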
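
The arithmetic method can be sketched as a small function. It assumes classes are given in ascending order as (lower limit, upper limit, frequency) tuples; the function name `grouped_median` and this layout are illustrative.

```python
def grouped_median(classes):
    """Median of grouped data: L_m + c_m * (n/2 - F_{m-1}) / f_m."""
    n = sum(freq for _, _, freq in classes)
    if n == 0:
        raise ValueError("classes must contain at least one observation")
    cumulative_before = 0  # F_{m-1}: cumulative frequency before the current class
    for lower, upper, freq in classes:
        # The median class is the first class whose cumulative frequency reaches n/2.
        if freq > 0 and cumulative_before + freq >= n / 2:
            return lower + (upper - lower) * (n / 2 - cumulative_before) / freq
        cumulative_before += freq
    raise ValueError("median class not found")


# Data from Example 2.6.
spending = [
    (0, 25, 36), (25, 50, 24), (50, 75, 12), (75, 100, 9),
    (100, 125, 9), (125, 150, 5), (150, 175, 3), (175, 200, 2),
]
print(round(grouped_median(spending), 4))
```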

Mode for Ungrouped Data

The mode is simply the observation that occurs most frequently. This value is
found by inspection.

Mode for Grouped Data

The mode can be determined either graphically or by calculation.

Graphical Method

To identify the mode, first identify the tallest bar in the histogram. Join
each top corner of the tallest bar diagonally to the nearest top corner of the
neighbouring bar on the opposite side. The two diagonals will intersect at
some point. Then draw a vertical line through the point of intersection; the
mode is the point where this vertical line intersects the horizontal axis.

Arithmetic Method

We determine the modal class, that is, the class with the highest frequency.
Then the mode is given by:

$$\text{Mode} = L_m + \frac{c_m (f_m - f_{m-1})}{2 f_m - (f_{m-1} + f_{m+1})}$$

where $L_m$ is the lower limit of the modal class,
$c_m$ is the class width of the modal class,
$f_m$ is the frequency of the modal class,
$f_{m-1}$ is the frequency of the class just before the modal class and
$f_{m+1}$ is the frequency of the class just after the modal class.

Example 2.12 Making use of data used in Example 2.11, calculate the mode.

Solution The modal class is 0-25, that is, the class with the highest
frequency. We can also note that $L_m = 0$, $c_m = 25$, $f_m = 36$,
$f_{m-1} = 0$, $f_{m+1} = 24$ and hence

$$\begin{aligned}
\text{Mode} &= L_m + \frac{c_m (f_m - f_{m-1})}{2 f_m - (f_{m-1} + f_{m+1})} \\
&= 0 + \frac{25(36 - 0)}{2 \times 36 - (0 + 24)} \\
&= \frac{900}{48} \\
&= 18.75
\end{aligned}$$
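
A sketch of the arithmetic method for the mode follows. As in Example 2.12, a missing neighbouring class is given frequency 0; the function name `grouped_mode` and the tuple layout (lower limit, upper limit, frequency) are illustrative.

```python
def grouped_mode(classes):
    """Mode of grouped data: L_m + c_m (f_m - f_prev) / (2 f_m - (f_prev + f_next))."""
    freqs = [freq for _, _, freq in classes]
    m = freqs.index(max(freqs))  # first class with the highest frequency
    f_m = freqs[m]
    # A modal class at either end has no neighbour on that side; use 0 there.
    f_prev = freqs[m - 1] if m > 0 else 0
    f_next = freqs[m + 1] if m < len(freqs) - 1 else 0
    lower, upper, _ = classes[m]
    # Note: the denominator is zero if both neighbours have frequency f_m.
    return lower + (upper - lower) * (f_m - f_prev) / (2 * f_m - (f_prev + f_next))


# Data from Example 2.6.
spending = [
    (0, 25, 36), (25, 50, 24), (50, 75, 12), (75, 100, 9),
    (100, 125, 9), (125, 150, 5), (150, 175, 3), (175, 200, 2),
]
print(grouped_mode(spending))
```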
Activity 2.2

(1) What are the advantages and disadvantages of the

(a) mean
(b) median
(c) mode

(2) The following data were obtained from a survey requesting 30 different
families to list their weekly expenditure on food.

99 85 72 59 119 120 95 83 78 91
64 106 86 87 78 108 136 102 86 74
72 103 94 63 73 89 75 88 107 101

Calculate the mean, median and mode.

(3) The data below gives marks obtained by Applied Statistics students

Applied Statistics Mark Number of Students


11-20 4
21-30 7
31-40 12
41-50 18
51-60 29
61-70 13
71-80 8
81-90 5
91-100 4

Calculate,

(i) the mean,


(ii) the median,
(iii) the mode.

Measures of Position

They cannot really be called measures of central tendency, but they are
measures of location in that they give position of specified observations and
the most commonly used are quartiles, deciles and percentiles.

Quartiles

The quartiles divide the set of measurements into four equal parts. Twenty-
five per cent of the measurements are less than the lower quartile, fifty per
cent of the measurements are less than the median and seventy-five per cent
of the measurements are less than the upper quartile. So, fifty per cent of
the measurements are between the lower quartile and the upper quartile.
These are denoted by the symbols Q1 (lower or first quartile), Q2 (second
quartile) and Q3 (upper or third quartile).
Note: the median is equivalent to the second quartile.

Quartiles for Ungrouped Data

There are different approaches to the calculation of the first and third
quartiles. The approach we are going to adopt in this course is the
interpolation method because most statistical software packages make use of
this method. First we order the data in ascending order. Irrespective of the
sample size n, we have:

$$Q_1 = \text{the } \tfrac{1}{4}(n + 1)\text{th observation}$$

$$Q_3 = \text{the } \tfrac{3}{4}(n + 1)\text{th observation}$$

Usually $\frac{1}{4}(n + 1)$ and $\frac{3}{4}(n + 1)$ are not whole numbers,
and this is where interpolation arises.
For example, suppose $\frac{1}{4}(n + 1) = a.b$, where a is the integer part
and b is the decimal part. In this case

$$Q_1 = a\text{th observation} + 0.b\,[(a + 1)\text{th observation} - a\text{th observation}]$$

Example 2.13 Find the median, lower quartile, upper quartile and
interquartile range of the following data set of scores: 18 20 23 20 23 27 24 23 29

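Solution
Arrange the values in ascending order of magnitude:
18 20 20 23 23 23 24 27 29

Since n is odd, the median is the value of the $\frac{n+1}{2}$th observation, that is

$$\begin{aligned}
\text{median} &= \frac{n + 1}{2}\text{th observation} \\
&= \frac{9 + 1}{2}\text{th observation} \\
&= 5\text{th observation} = 23
\end{aligned}$$

For the lower quartile we have,

$$\begin{aligned}
Q_1 &= \tfrac{1}{4}(n + 1)\text{th observation} \\
&= \tfrac{1}{4}(9 + 1)\text{th observation} = 2.5\text{th observation} \\
&= 2\text{nd observation} + 0.5\,[3\text{rd observation} - 2\text{nd observation}] \\
&= 20 + 0.5(20 - 20) = 20
\end{aligned}$$

For the upper or third quartile we have,

$$\begin{aligned}
Q_3 &= \tfrac{3}{4}(n + 1)\text{th observation} \\
&= \tfrac{3}{4}(9 + 1)\text{th observation} = 7.5\text{th observation} \\
&= 7\text{th observation} + 0.5\,[8\text{th observation} - 7\text{th observation}] \\
&= 24 + 0.5(27 - 24) = 25.5
\end{aligned}$$

The interquartile range is the difference between the upper and lower quartiles, $Q_3 - Q_1 = 25.5 - 20 = 5.5$.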
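
The interpolation rule can be sketched as a function of the fraction (1/4 for Q1, 3/4 for Q3). The name `quantile_by_position` is hypothetical, and the clamping at the two ends is an added assumption for positions that fall outside the data, which the notes do not cover.

```python
import math


def quantile_by_position(values, fraction):
    """Value at position fraction * (n + 1), interpolating between neighbours."""
    ordered = sorted(values)
    n = len(ordered)
    position = fraction * (n + 1)  # 1-based position a.b
    a = math.floor(position)       # integer part
    b = position - a               # decimal part
    if a < 1:                      # assumption: clamp below the first observation
        return ordered[0]
    if a >= n:                     # assumption: clamp above the last observation
        return ordered[-1]
    # a-th observation + b * [(a + 1)-th observation - a-th observation]
    return ordered[a - 1] + b * (ordered[a] - ordered[a - 1])


# Scores from Example 2.13.
scores = [18, 20, 23, 20, 23, 27, 24, 23, 29]
q1 = quantile_by_position(scores, 0.25)
q3 = quantile_by_position(scores, 0.75)
print(q1, q3, q3 - q1)
```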

Quartiles for Grouped Data

The procedure is the same as the one for the median. The difference
lies in the identification of the quartile class and the quartile position.
The positions are identified by calculating $\frac{n}{4}$ and $\frac{3n}{4}$
for Q1 and Q3 respectively.
For the lower quartile we have,

$$Q_1 = L_q + \frac{c_q\left(\frac{n}{4} - F_{q-1}\right)}{f_q}$$

For the upper quartile we have,

$$Q_3 = L_q + \frac{c_q\left(\frac{3n}{4} - F_{q-1}\right)}{f_q}$$

where $L_q$ is the lower limit of the quartile class,
$c_q$ is the class width of the quartile class,
$f_q$ is the frequency of the quartile class,
$F_{q-1}$ is the cumulative frequency of the class just before the quartile class and
$n = \sum_{i=1}^{k} f_i$ is the total number of observations.

Example 2.14 Advertising expenditures constitute one of the important
components of the costs of goods sold. From the following data giving the
advertising expenditures (in thousands of dollars) of 50 companies, find the
lower and upper quartiles.

Advertising Expenditure Number of Companies


25-35 5
35-45 11
45-55 18
55-65 6
65-75 10

Solution

Advertising Expenditure Number of Companies (fi ) Fi


25-35 5 5
35-45 11 16
45-55 18 34
55-65 6 40
65-75 10 50

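$\frac{50}{4} = 12.5$, thus the class containing the lower quartile is 35-45.
Then $L_q = 35$, $c_q = 45 - 35 = 10$, $f_q = 11$, $F_{q-1} = 5$ and n = 50. Hence,

$$\begin{aligned}
Q_1 &= L_q + \frac{c_q\left(\frac{n}{4} - F_{q-1}\right)}{f_q} \\
&= 35 + \frac{10\left(\frac{50}{4} - 5\right)}{11} \\
&= 41.8182 \text{ (4 d.p.)}
\end{aligned}$$

$\frac{3 \times 50}{4} = 37.5$, thus the class containing the upper quartile is 55-65.
Then $L_q = 55$, $c_q = 65 - 55 = 10$, $f_q = 6$, $F_{q-1} = 34$ and n = 50. Hence,

$$\begin{aligned}
Q_3 &= L_q + \frac{c_q\left(\frac{3n}{4} - F_{q-1}\right)}{f_q} \\
&= 55 + \frac{10\left(\frac{3 \times 50}{4} - 34\right)}{6} \\
&= 60.8333 \text{ (4 d.p.)}
\end{aligned}$$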
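
Both quartiles use the same formula with a different position, so a single function with a `fraction` argument (1/4 or 3/4) is enough. This is a sketch; `grouped_quantile` and the tuple layout (lower limit, upper limit, frequency) are illustrative.

```python
def grouped_quantile(classes, fraction):
    """Grouped-data quantile: L_q + c_q * (fraction * n - F_{q-1}) / f_q."""
    n = sum(freq for _, _, freq in classes)
    if n == 0:
        raise ValueError("classes must contain at least one observation")
    target = fraction * n  # n/4 for Q1, 3n/4 for Q3
    cumulative_before = 0  # F_{q-1}
    for lower, upper, freq in classes:
        # The quartile class is the first whose cumulative frequency reaches the target.
        if freq > 0 and cumulative_before + freq >= target:
            return lower + (upper - lower) * (target - cumulative_before) / freq
        cumulative_before += freq
    raise ValueError("fraction must lie between 0 and 1")


# Advertising expenditure data from Example 2.14.
advertising = [(25, 35, 5), (35, 45, 11), (45, 55, 18), (55, 65, 6), (65, 75, 10)]
q1 = grouped_quantile(advertising, 0.25)
q3 = grouped_quantile(advertising, 0.75)
print(round(q1, 4), round(q3, 4))
```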
Deciles

These values divide the observations into 10 equal parts and are denoted
by D1 , D2 , ..., D9 , for example D3 has 30% of values below it and 70% above
it.

Percentiles

These values divide the observations into 100 equal parts and are denoted by
P1 , P2 , P3 , ..., P99 , for example P30 has 30% of values below it and 70% above
it.

Note

• Q1 = P25

• Q2 = D5 = P50

• Q3 = P75

Measures of Dispersion

Dispersion is the statistical term for the spread or variability of data. Mea-
sures of dispersion reflect the amount of spread or variability in a collection
of data. Consider the following data of height (cm) in two different teams.

Harare: 160, 161, 162, 162, 163, 164, 165, 167, 171, 175
Chinhoyi: 154, 156, 158, 159, 163, 164, 166, 172, 172, 186

Both teams have the same mean height of 165cm, but the distribution of the
height of the team members in Chinhoyi is more spread out.

Range
The range is the simplest measure of dispersion.
For ungrouped data

$$\text{Range} = \text{Maximum value} - \text{Minimum value}$$

For grouped data

Range = Highest Class Boundary − Lowest Class Boundary

Disadvantage: It is influenced by outliers, i.e. unusually high or
low data values.

Inter-Quartile Range, IQR


When the data is arranged in ascending order of size, the quartiles divide
the data into four parts, (refer to notes on Measures of Position).

IQR = Q3 − Q1

Advantage: It is not influenced by outliers.


Disadvantage: Does not show the spread of the whole group of data.

A measurement of spread can be derived from it, called the Semi
Inter-quartile Range

$$\text{Semi Inter-quartile Range} = \frac{1}{2}(Q_3 - Q_1)$$

Variance and Standard Deviation
This is the most commonly used measure of variability in statistical analysis.
Unlike the range and IQR, it takes into account all the observations in the
data set. The greater the variability in the data, the higher the value of the
statistic.
Disadvantage It is difficult or time-consuming to compute manually.

Steps in calculating sample variance

(i) Calculate the mean, x̄

(ii) Find the difference between each observation and the mean, xi − x̄.

(iii) Square the differences, $(x_i - \bar{x})^2$

(iv) Sum the squared differences, $\sum_{i=1}^{n} (x_i - \bar{x})^2$

(v) Compute $s^2$ by dividing the sum in (iv) by n − 1, i.e.

$$s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2$$

For the sample standard deviation, take the square root of the result in (v), i.e.

$$s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}$$
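
Steps (i) to (v) can be followed one by one in code. The sketch below uses the two teams' heights given at the start of this section; the function name `sample_variance` is illustrative.

```python
import math


def sample_variance(values):
    """Sample variance computed by following steps (i) to (v)."""
    n = len(values)
    mean = sum(values) / n                     # (i) the mean
    differences = [x - mean for x in values]   # (ii) x_i minus the mean
    squared = [d ** 2 for d in differences]    # (iii) square the differences
    total = sum(squared)                       # (iv) sum the squared differences
    return total / (n - 1)                     # (v) divide by n - 1


harare = [160, 161, 162, 162, 163, 164, 165, 167, 171, 175]
chinhoyi = [154, 156, 158, 159, 163, 164, 166, 172, 172, 186]

for name, heights in [("Harare", harare), ("Chinhoyi", chinhoyi)]:
    variance = sample_variance(heights)
    print(name, round(variance, 4), round(math.sqrt(variance), 4))
```

Both teams have the same mean, but the Chinhoyi variance comes out larger, in line with the observation that those heights are more spread out.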

NB: For a normal distribution, approximately 68% of the data fall within one
standard deviation of the mean, 95% within 2 standard deviations and 99.7%
(almost 100%) within 3 standard deviations of the mean. This is useful for
outlier detection.
Computational Formulae
For ungrouped data

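$$\text{Population variance, } \sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2$$

$$\text{Population standard deviation, } \sigma = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (x_i - \mu)^2}$$

$$\text{Sample variance, } s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2 = \frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n-1}$$

$$\text{Sample standard deviation, } s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2} = \sqrt{\frac{\sum_{i=1}^{n} x_i^2 - n\bar{x}^2}{n-1}}$$
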
For grouped data
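If the data are grouped into k classes with class intervals whose midpoints
are $x_1, x_2, \dots, x_k$ with frequencies of occurrence $f_1, f_2, \dots, f_k$
respectively, and $n = \sum_{i=1}^{k} f_i$, then

$$\text{Population variance, } \sigma^2 = \frac{1}{\sum_{i=1}^{k} f_i} \sum_{i=1}^{k} f_i (x_i - \mu)^2 = \frac{\sum_{i=1}^{k} f_i x_i^2}{\sum_{i=1}^{k} f_i} - \left(\frac{\sum_{i=1}^{k} f_i x_i}{\sum_{i=1}^{k} f_i}\right)^2$$

$$\text{Population standard deviation, } \sigma = \sqrt{\frac{1}{\sum_{i=1}^{k} f_i} \sum_{i=1}^{k} f_i (x_i - \mu)^2} = \sqrt{\frac{\sum_{i=1}^{k} f_i x_i^2}{\sum_{i=1}^{k} f_i} - \left(\frac{\sum_{i=1}^{k} f_i x_i}{\sum_{i=1}^{k} f_i}\right)^2}$$

$$\text{Sample variance, } s^2 = \frac{1}{n-1} \sum_{i=1}^{k} f_i (x_i - \bar{x})^2 = \frac{1}{n-1} \left(\sum_{i=1}^{k} f_i x_i^2 - \frac{\left(\sum_{i=1}^{k} f_i x_i\right)^2}{\sum_{i=1}^{k} f_i}\right)$$

$$\text{Sample standard deviation, } s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{k} f_i (x_i - \bar{x})^2} = \sqrt{\frac{1}{n-1} \left(\sum_{i=1}^{k} f_i x_i^2 - \frac{\left(\sum_{i=1}^{k} f_i x_i\right)^2}{\sum_{i=1}^{k} f_i}\right)}$$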
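
As an illustration of the computational form of the grouped sample variance, the sketch below applies it to the household spending data of Example 2.6. The notes do not work this example; the function name and the (LCB, UCB, frequency) tuple layout are choices made here.

```python
import math


def grouped_sample_variance(classes):
    """s^2 = (sum f x^2 - (sum f x)^2 / sum f) / (n - 1), with x the class midpoint."""
    n = sum(freq for _, _, freq in classes)
    midpoints = [(lcb + ucb) / 2 for lcb, ucb, _ in classes]
    freqs = [freq for _, _, freq in classes]
    sum_fx = sum(f * x for f, x in zip(freqs, midpoints))
    sum_fx2 = sum(f * x ** 2 for f, x in zip(freqs, midpoints))
    return (sum_fx2 - sum_fx ** 2 / n) / (n - 1)


# Data from Example 2.6.
spending = [
    (0, 25, 36), (25, 50, 24), (50, 75, 12), (75, 100, 9),
    (100, 125, 9), (125, 150, 5), (150, 175, 3), (175, 200, 2),
]

variance = grouped_sample_variance(spending)
print(round(variance, 4), round(math.sqrt(variance), 4))
```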

Effects on the dispersion measure with change in data

Removal of a certain value from the data

If the maximum or minimum value (assuming both are unique) in a data set is
removed, then the

(1) range will decrease

(2) standard deviation will usually decrease

(3) IQR may increase or decrease

Adding a common constant to the whole data set

If a constant k is added to every data point in a set of data, then the range,
IQR and standard deviation will not change.

Multiplying the whole data set by a constant

If every data point is multiplied by a constant k, the range, IQR and the
standard deviation will be |k| times the original values.
NB: The last 3 notions are essential in ZIMSTAT work when the concept
of weighting is considered, say on the bread-basket composition and their
contribution in the computation of the Consumer Price Index, CPI.
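
The effect of adding or multiplying by a constant can be checked numerically. The sketch below uses the Harare heights with an arbitrary constant k = 10 chosen purely for illustration, and relies on the standard library's `statistics.stdev` for the sample standard deviation.

```python
import statistics

harare = [160, 161, 162, 162, 163, 164, 165, 167, 171, 175]
k = 10  # arbitrary constant chosen for the illustration


def data_range(values):
    """Range = maximum value - minimum value."""
    return max(values) - min(values)


shifted = [x + k for x in harare]  # add k to every data point
scaled = [x * k for x in harare]   # multiply every data point by k

print(data_range(harare), data_range(shifted), data_range(scaled))
print(statistics.stdev(harare), statistics.stdev(shifted), statistics.stdev(scaled))
```

The shifted data keep the same range and standard deviation, while the scaled data have both multiplied by k.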
Coefficient of Variation
For distributions having the same mean, the distribution with the largest
standard deviation has the greatest variation. But when considering
distributions with different means, decision-makers cannot compare the
uncertainty in the distributions only by comparing standard deviations. The
coefficient of variation is a measure used to compare the variability in one
data set with that in another in situations in which a direct comparison of
standard deviations is not convenient or realistic. For example, in a study of
milk consumption in USA, it is reported that the mean number of gallons of
milk consumed per family unit per week is 8 with a standard deviation of 3
gallons. A similar study in Canada reports the mean consumption to be 12
litres with a sample standard deviation of 4 litres. It makes no sense to
compare these standard deviations directly because they are reported in
different units. The coefficient of variation of a data set is simply the
ratio of the standard deviation to the mean, expressed as a percentage.

$$\text{Coefficient of variation} = \frac{s}{\bar{x}} \times 100\%$$

NB: To interpret the coefficient of variation, we say that the data exhibit y%
of relative variation from the mean, where y is the value obtained from the
formula above.
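
Applied to the milk consumption figures quoted above, the calculation can be sketched as follows. The function name is illustrative.

```python
def coefficient_of_variation(standard_deviation, mean):
    """Ratio of the standard deviation to the mean, as a percentage."""
    return standard_deviation / mean * 100


# USA: mean 8 gallons, standard deviation 3 gallons.
cv_usa = coefficient_of_variation(3, 8)
# Canada: mean 12 litres, standard deviation 4 litres.
cv_canada = coefficient_of_variation(4, 12)

print(f"USA: {cv_usa:.1f}%  Canada: {cv_canada:.1f}%")
```

Because the units cancel in the ratio, the two percentages can be compared directly even though one study is in gallons and the other in litres.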

Measures of Shape
An important aspect of the description of a variable is its shape, which
indicates the frequency of values from different ranges of the variable. One is
typically interested in knowing how well the distribution of the variable
can be approximated by the normal distribution.
Skewness
Skewness is a measure of the asymmetry of the distribution relative to the
normal distribution. The normal distribution is symmetrical about its mean, so
its skewness is equal to zero. A distribution with a significant positive skew-
ness has a long right tail, whilst a distribution with a significant negative
skewness has a long left tail. For a skewed distribution, the mean tends to lie
on the same side of the mode as the longer tail. Skewness is a dimensionless
quantity.

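$$\text{skewness, } m_3 = \frac{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^3}{\left(\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2\right)^{3/2}}$$

Skewness can also be approximated by Pearson's coefficients:

$$\text{skewness} \approx \frac{\text{mean} - \text{mode}}{\text{standard deviation}} \approx \frac{3(\text{mean} - \text{median})}{\text{standard deviation}}$$

If $m_3 < 0$, then the data is negatively skewed.
If $m_3 = 0$, then the data is symmetrical.
If $m_3 > 0$, then the data is positively skewed.

A skewness coefficient is considered significant if
$\left|\frac{m_3}{s.d.(m_3)}\right| > 2$, where $s.d.(m_3) = \left(\frac{6}{n}\right)^{1/2}$
is the standard deviation of skewness and n is the sample size.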
Kurtosis

Kurtosis is the degree of peakedness in a distribution, usually taken relative
to a normal distribution. The peakedness property means that there is an
excess frequency at the center of the distribution.

$$m_4 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^4}{(n - 1)s^4} - 3$$

Positive values of $m_4$ indicate longer and thicker tails than a normal
distribution, whereas negative values of $m_4$ indicate shorter and thinner
tails. A distribution with positive kurtosis is called Leptokurtic and a
distribution with negative kurtosis is called Platykurtic. When $m_4 = 0$, the
distribution is called Mesokurtic.
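
The two shape measures can be computed together. This sketch follows the moment-based definitions in this section, with every sum divided by n − 1 and $s^2$ taken as the sample variance; the function name `shape_measures` is illustrative, and statistical packages may use slightly different divisors.

```python
def shape_measures(values):
    """Return (skewness m3, kurtosis m4) using divisors of n - 1 throughout."""
    n = len(values)
    mean = sum(values) / n
    deviations = [x - mean for x in values]
    variance = sum(d ** 2 for d in deviations) / (n - 1)  # s^2
    third_moment = sum(d ** 3 for d in deviations) / (n - 1)
    fourth_sum = sum(d ** 4 for d in deviations)
    skewness = third_moment / variance ** 1.5             # divide by s^3
    kurtosis = fourth_sum / ((n - 1) * variance ** 2) - 3  # divide by (n - 1) s^4
    return skewness, kurtosis


harare = [160, 161, 162, 162, 163, 164, 165, 167, 171, 175]
skewness, kurtosis = shape_measures(harare)

# Standard deviation of skewness, (6 / n) ** 0.5, used for the significance check.
sd_skewness = (6 / len(harare)) ** 0.5
print(skewness, kurtosis, abs(skewness / sd_skewness) > 2)
```

A positive skewness for the Harare heights reflects the few tall players that stretch the right-hand tail.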
Box and Whisker Diagram
This diagram was not mentioned when we discussed other graphical techniques
because it is constructed using some of the descriptive measures we have just
discussed, whereas the earlier graphs use the raw data. A Box and Whisker
diagram or plot illustrates the spread and skewness of a data set. It provides
a graphical five-number summary of the set of data by showing the quartiles
and the extreme values of the data. It is useful in identifying outliers, i.e.
unusually high or low values in a data set, which can be due to typographical
errors.
The whiskers extend from the box to the maximum value or $Q_3 + \frac{3}{2}(IQR)$
and to the minimum value or $Q_1 - \frac{3}{2}(IQR)$. If the maximum value is
less than $Q_3 + \frac{3}{2}(IQR)$, then the whisker extends to the maximum
value; otherwise it extends to $Q_3 + \frac{3}{2}(IQR)$. Values greater than
$Q_3 + \frac{3}{2}(IQR)$ are indicated by * and are treated as possible
outliers. If the minimum value is greater than $Q_1 - \frac{3}{2}(IQR)$, then
the whisker extends to the minimum value; otherwise it extends to
$Q_1 - \frac{3}{2}(IQR)$. Values less than $Q_1 - \frac{3}{2}(IQR)$ are
indicated by * and are treated as possible outliers.
Below is the box and whisker plot of the Harare players' heights.
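
The quantities needed to draw the plot can be computed as in the sketch below, which uses the Harare heights and the (n + 1) interpolation rule for the quartiles. The helper `quartile` is hypothetical and assumes the position falls strictly between the first and last observations.

```python
import math


def quartile(ordered, fraction):
    """Interpolated value at position fraction * (n + 1) of sorted data."""
    position = fraction * (len(ordered) + 1)
    a = math.floor(position)  # integer part of the position
    b = position - a          # decimal part of the position
    return ordered[a - 1] + b * (ordered[a] - ordered[a - 1])


harare = sorted([160, 161, 162, 162, 163, 164, 165, 167, 171, 175])

q1 = quartile(harare, 0.25)
q2 = quartile(harare, 0.50)
q3 = quartile(harare, 0.75)
iqr = q3 - q1

# Limits beyond which values are flagged with * as possible outliers.
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr

# Each whisker stops at the extreme value or at the limit, whichever is nearer the box.
whisker_low = max(min(harare), lower_limit)
whisker_high = min(max(harare), upper_limit)
outliers = [x for x in harare if x < lower_limit or x > upper_limit]

print(q1, q2, q3, iqr)
print(whisker_low, whisker_high, outliers)
```

For these data no height falls outside the limits, so both whiskers reach the extreme values.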

Applied Statistics I, STAT102 Assignment 2
A1 Given the data below on the heights of 100 randomly selected Certificate of
Applied Statistics students, compute the following statistics:

Height (inches) Midpoint, xi Frequency (fi )


59.5 - 62.5 61 5
62.5 - 65.5 64 18
65.5 - 68.5 67 42
68.5 - 71.5 70 27
71.5 - 74.5 73 8

(i) mean [3]


(ii) median [4]
(iii) mode [4]
(iv) range [2]
(v) sample standard deviation [5]

A2 Using the data on the two teams Harare and Chinhoyi in the notes on
measures of dispersion, calculate for each team the;

(a) median [2]


(b) mode [2]
(c) Q1 and Q3 and the IQR [6]
(d) skewness and comment [3,3]
(e) kurtosis and comment [3,3]
(f) Suppose the heights in Harare were measured in cm and those
in Chinhoyi were measured in inches, compute the coefficients of
variation for the two teams and comment. [5,5]

Due Date 11 October 2010
