Descriptive Statistics Module Overview

This document is a module on descriptive statistics that is intended as an introductory course. It contains 5 units that cover basic statistical concepts, methods of organizing and summarizing data graphically and numerically, and an introduction to analyzing data in SPSS. Common issues like handling missing data are also addressed.


Chancellor College, Mathematical Sciences Department

DESCRIPTIVE STATISTICS
Module for Diploma in Statistics

Fiskani J.M. Kondowe


Descriptive Statistics Module Developed by Fiskani J.M. Kondowe

Contents
UNIT 1: BASIC STATISTICAL CONCEPTS
1.1 STATISTICS
1.1.1 What is Statistics?
1.1.2 Types of statistics
1.2 DATA
1.2.1 Background
1.2.2 Types of data
1.3 SCALES OF MEASUREMENT

UNIT 2: GRAPHICAL SUMMARIES OF DATA
2.1 ORGANIZING AND GRAPHING QUALITATIVE DATA
2.1.1 Frequency distributions
2.1.2 Relative frequency and percentage distributions
2.1.3 Graphical presentation of qualitative data
2.2 ORGANIZING AND GRAPHING QUANTITATIVE DATA
2.2.1 Frequency distributions
2.2.2 Constructing frequency distribution tables
2.2.3 Relative frequency and percentage distributions
2.2.4 Graphing grouped data
2.2.5 Cumulative frequency distributions
2.2.6 Stem and leaf displays

UNIT 3: NUMERICAL SUMMARIES OF DATA
3.1 MEASURES OF CENTRAL TENDENCY
3.1.1 Mean
3.1.2 Median
3.1.3 Mode
3.2 MEASURES OF DISPERSION
3.2.1 Range
3.2.2 Variance and standard deviation
3.2.3 Coefficient of variation
3.3 MEASURES OF POSITION
3.3.1 Quartiles, Interquartile range
3.3.2 Deciles, percentiles and percentile rank
3.4 OTHER NUMERICAL ATTRIBUTES OF DATA
3.4.1 Skewness and kurtosis

UNIT 4: INTRODUCTION TO DESCRIPTIVE ANALYSIS USING SPSS
4.1 TABULAR PRESENTATION OF DATA
4.1.1 One variable tabulation
4.1.2 Cross-tabulations
4.1.3 Formatting SPSS cross-tabulations
4.2 GRAPHIC PRESENTATION OF DATA
4.2.1 Computing simple summary statistics
4.2.2 Drawing a pie chart
4.2.3 Drawing Box Plots
4.2.4 Drawing Histograms

UNIT 5: COMMON COMPLICATIONS WHEN ANALYSING SURVEY DATA
5.1 ANALYSIS OF MULTIPLE RESPONSE QUESTIONS
5.1.1 Entering multiple response questions data in SPSS
5.1.2 Analyzing multiple response questions in SPSS
5.2 PRESENCE OF MISSING VALUES IN DATA
5.2.1 Missing data mechanisms
5.2.2 Handling missing data
5.3 PRESENCE OF ZERO VALUES
5.3.1 Analysis with Zero values present
5.4 WEIGHTED VALUES
5.4.1 When to Use a Weighted Average

REFERENCES

PREFACE
Statistics is an important field of mathematics that is used to analyze, interpret, and predict
outcomes from data. Descriptive statistics will teach you the basic concepts used to describe
data. This is a great beginner course for those interested in Data Science, Economics, Psychology,
Machine Learning, Sports analytics and just about any other field.

This branch of statistics lays the foundation for all statistical knowledge, but it is not something
that you should learn simply so you can use it in the distant future. Descriptive statistics can be
used NOW, in English class, in physics class, in history, at the football stadium, in the grocery
store and in everything we do. Without statistics we couldn't plan our budgets, pay our taxes,
and evaluate classroom/office performance. Are you beginning to get the picture? We need
statistics!

This handbook is therefore designed for an introductory course in descriptive statistics and can
be used on its own or as a supplement to comparable texts in statistics. It can also be used as a
self-study guide by students wishing to pursue careers in Statistics.

ACKNOWLEDGEMENTS
I wish to thank the staff of Mathematical Sciences Department, National Statistical Office
through National Statistical Systems (NSS) and UNDP for constructive criticisms and provision of
donor funding to facilitate development of this module respectively. Most importantly, I thank
the two institutions for entrusting me with the responsibility to carry out this rigorous exercise.

I also wish to extend my gratitude to the module reviewers, friends and colleagues in the
department for constructive criticism and inputs towards development of this module. Your inputs
have contributed significantly to the successful completion of this students’ handbook.

UNIT 1: BASIC STATISTICAL CONCEPTS
Objectives
By the end of this topic, students should be able to:
• Define statistics
• Recognize the type of data in a dataset
• Summarize different types of data
• Differentiate between descriptive and inferential statistics
• Recognize the proper type of scale to use for data

1.1 STATISTICS
1.1.1 What is Statistics?
Statistics is a group of methods used to collect, analyze, present and interpret data to make
decisions.

Statistical analysis is a process that involves identifying the questions of interest, data collection
and analysis and producing a report. In real-life problems, the data collection and analysis steps
may be repeated more than once. Data is collected in order to shed light on some question of
importance to the engineer, biologist, climatologist, administrator, or other professional.

In many applications, the cycle of data collection and analysis is a central part in the quest for
improvement to systems and processes. The aim of statistics is to supply useful information to
people whose main area of expertise is not statistics. These people are not interested directly in
either data or statistical methods.

1.1.2 Types of statistics


The field of statistics basically has two aspects: theoretical and applied. Theoretical or
mathematical statistics deals with the development, derivation, and proof of statistical
theorems, formulas, rules, and laws. Applied statistics involves the applications of those
theorems, formulas, rules, and laws to solve real-world problems. Applied statistics can be
divided into two areas: descriptive statistics and inferential statistics. This course will focus on
the descriptive part of applied statistics.

Inferential statistics: Consists of methods that use sample results to help make decisions or
predictions about a population. More on inferential statistics will be tackled in other courses.

Descriptive statistics: Consists of methods for organizing, displaying and describing data by using
tables, graphs and summary measures. This type of analysis is useful in drawing conclusions on
large data sets as they are presented in summary form. For instance, suppose we have
information on the test scores of students enrolled in a statistics class. In statistical terminology,
the whole set of numbers that represents the scores of students is called a data set, the name of
each student is called an element, and the score of each student is called an observation. A data
set in its original form is usually very large. Consequently, such a data set is not very helpful in
drawing conclusions or making decisions. It is easier to draw conclusions from summary tables
and diagrams than from the original version of a data set. In descriptive statistics, we reduce
data to a manageable size by constructing tables, drawing graphs, or calculating summary
measures such as averages.

1.2 DATA
1.2.1 Background
Data contains information, and statistics serves to extract this information. Data is the basic
commodity of statistics; without it, there is no information on which to reach conclusions or
base decisions. However, the information in data is often not immediately obvious, especially in
large data sets.

Large data sets must be summarized before patterns and relationships can be seen, because there
is usually too much noise in the raw data to see the information that they contain. As such,
statistical methods which use graphical and numerical summaries to highlight important features
of the data are needed to ensure that the highest precision is achieved. In smaller data sets, it is
less important to summarize the data; the problem is usually that there is not enough
information to get a clear answer to questions of importance.

1.2.2 Types of data
Primary and secondary data

• Primary data is original data that has been collected specially for the purpose in mind.
For example, when the National Statistical Office (NSO) collects data for use in the
Demographic and Health Survey (DHS) report, it is considered to be primary data for the
DHS.
• Secondary data is data that was collected for another purpose but is used to draw
conclusions on a question other than the one for which it was collected. For example,
when a researcher uses data from the DHS report to answer a research question of
his/her own interest, the DHS data is considered secondary data.

Quantitative and qualitative data

• Quantitative data are observations measured on a numerical scale and consist of
numbers representing counts or measurements.
• Qualitative (attribute) data is data that cannot assume a numerical value but can be
classified into two or more nonnumeric categories based on some nonnumeric
characteristic.

Discrete and continuous data

• Discrete data results when the number of possible values is either a finite or countable
number (that is, the number of possible values is 0 or 1 or 2 and so on). For example, the
numbers of eggs that hens lay are discrete data because they represent counts.
• Continuous data results from infinitely many possible values that correspond to some
continuous scale that covers a range of values without gaps, interruptions or jumps. For
example, the ages of students in a class can assume any value over a continuous span.

Cross-sectional and time series data

• Cross-sectional data is data collected on different elements or variables at the same point
in time or for the same period of time. For example, the types of TVs owned by Lilongwe
residents in one particular year (e.g. 2007) can be considered cross-sectional data,
since all of it was collected for the same period (2007).
• Time series data is data collected on the same element or the same variable at different
points in time or for different periods of time. For instance, data on the total number of
TVs owned by Lilongwe residents, collected yearly over a period of five years, is
an example of time series data.

1.3 SCALES OF MEASUREMENT


These are ways in which data (variables/numbers) is defined and categorized. Each scale of
measurement has certain properties which in turn determine the appropriateness for use of
certain statistical analyses on the data. There are four generally used scales of measurement,
listed from weakest to strongest:

Nominal scale is where categorical values and/or numbers are used as labels for groups or
classes. For example, if our data set comprises the countries Malawi, Zambia and Angola, we
may designate Malawi as 1, Zambia as 2 and Angola as 3. In this case, the numbers 1, 2 and 3
stand only for the category to which the data points belong and do not represent any order.

Ordinal scale represents an ordered series of relationships or rank order. For example, individuals
competing in a contest may achieve first, second, or third place; these positions
represent ordinal data. Likert-type scales (such as “On a scale of 1 to 10 with 1 being no pain and
10 being high pain, how much pain are you in today?”) also represent ordinal data.

Interval scale represents quantity and has equal units, but zero is simply an additional point of
measurement rather than the absence of the quantity. The Fahrenheit scale is a clear example of
the interval scale of measurement; thus 60 degrees Fahrenheit and -10 degrees Fahrenheit are
interval data.

Ratio Scale is similar to the interval scale in such a way that it also represents quantity and has
equality of units. However, this scale also has an absolute zero (no numbers exist below zero).

Exercise 1.1

1. Briefly describe the meaning of the word statistics.
2. Briefly explain the types of statistics.
3. Explain the difference between cross-section and time-series data. Give an example of
each of these two types of data.
4. Classify the following as cross-section or time-series data.
i) Food bill of a family for each month of 2009.
ii) Number of armed robberies each year in Lilongwe from 1998 to 2009.
iii) Number of supermarkets in Malawian cities on December 31, 2009.
iv) Gross sales of 200 ice cream parlors in July 2009.
v) Average prices of houses in 100 cities.
vi) Salaries of 50 Airtel employees.
vii) Number of cars sold each year by Toyota Malawi from 1980 to 2009.
viii) Number of employees employed by Malawi Government each year from 1985 to
2009.
5. Briefly distinguish the following terms.
i) Quantitative and Qualitative data.
ii) Discrete and Continuous data.
6. Indicate which of the following variables are Discrete or continuous.
i) Number of persons in a family
ii) Colors of cars
iii) Marital status of people
iv) Time to commute from home to work
v) Number of errors in a person’s bank statement

UNIT 2: GRAPHICAL SUMMARIES OF DATA
Objectives

By the end of this topic, students should be able to:

• Identify the appropriate graphical summary and presentation for particular questions.
• Present data in a histogram and be able to interpret data when presented with a
histogram.
• Recognize the advantages and limitations of each method of presentation.
• Explain what can be gained and lost from data summary.
• Interpret graphical summaries to answer questions concerning proportions, extremes,
medians and quartiles for quantitative variables.

INTRODUCTION

The source of our statistical knowledge lies in the data. Once we obtain the sample data values,
one way to become acquainted with them is to display them in tables or graphically. Tables,
charts and graphs are very important tools in statistics because they communicate information
visually. These visual displays may reveal the patterns of behavior of the variables being studied.
Qualitative and quantitative data can be summarized differently in graphical form. This unit
focuses on graphical summaries of qualitative and quantitative data.

2.1 ORGANIZING AND GRAPHING QUALITATIVE DATA


2.1.1 Frequency distributions
A frequency distribution for qualitative data lists all categories and the number of elements that
belong to each of the categories. Table 2.1 is a frequency distribution table (or simply a frequency
table) that depicts the types of employment 100 students at a particular college intend to
engage in.

Table 2.1: Type of employment students intend to engage in

Type of employment (category)     Number of students (frequency)
Private companies/businesses      44
Federal government                16
State/local government            23
Own business                      17
Total                             100

Example 2-1: A sample of 30 employees from large companies was selected and asked how
stressful their jobs were. The responses were recorded as below (“very” is “very stressful”,
“somewhat” is “somewhat stressful”, “none” is “no stress at all”).

Table 2.2: Stress levels for employees

Somewhat   None       Somewhat   Very       Very       None
Very       Somewhat   Somewhat   Very       Somewhat   Somewhat
Very       Somewhat   None       Very       None       Somewhat
Somewhat   Very       Somewhat   Somewhat   Very       None
Somewhat   Very       Very       Somewhat   None       Somewhat

Solution: Note that the variable is classified into three categories (very stressful, somewhat
stressful and not stressful at all). The numbers or counts in each category are what we call
frequencies. The frequencies are illustrated in the table below:

Table 2.3: Stress levels frequency table

Stress on job         Frequency (f)
Very stressful        10
Somewhat stressful    14
Not stressful         6
Sum                   30
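
The counting step in Example 2-1 can also be checked with a short script. The following is a minimal sketch in Python, assuming the 30 responses from Table 2.2 are typed in row by row; the variable names are illustrative and not part of the module.

```python
from collections import Counter

# Responses from Table 2.2, entered row by row (lower-cased for consistency).
responses = [
    "somewhat", "none", "somewhat", "very", "very", "none",
    "very", "somewhat", "somewhat", "very", "somewhat", "somewhat",
    "very", "somewhat", "none", "very", "none", "somewhat",
    "somewhat", "very", "somewhat", "somewhat", "very", "none",
    "somewhat", "very", "very", "somewhat", "none", "somewhat",
]

# Counter tallies how many times each category occurs (its frequency).
frequencies = Counter(responses)

for category in ("very", "somewhat", "none"):
    print(f"{category:<10}{frequencies[category]:>3}")
print(f"{'Sum':<10}{sum(frequencies.values()):>3}")
```

The printed counts should agree with the frequencies in Table 2.3.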

2.1.2 Relative frequency and percentage distributions
Relative frequency of a category is obtained by dividing the frequency of that category by the
sum of all frequencies. It shows what fractional part or proportion of the total frequency belongs
to the corresponding category.

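Calculating relative frequency of a category

\[
\text{Relative frequency of a category} = \frac{\text{Frequency of that category}}{\text{Sum of all frequencies}}
\]

Percentage for a category is obtained by multiplying the relative frequency of that category by
100. It shows what percentage of the total frequency belongs to a corresponding category.

\[
\text{Percentage} = (\text{Relative frequency}) \times 100
\]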

Example 2-2: Determine the relative frequency and percentage distributions on stress levels for
the data from table 2.3.

Solution

Table 2.4: Relative frequencies and percentage distributions of stress on job

Stress on job         Relative frequency    Percentage (%)
Very stressful        10/30 = 0.333         0.333(100) = 33.3
Somewhat stressful    14/30 = 0.467         0.467(100) = 46.7
Not stressful         6/30 = 0.200          0.200(100) = 20.0
                      Sum = 1.000           Sum = 100
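
As a cross-check on Example 2-2, the same arithmetic can be sketched in Python. This is only an illustration, assuming the frequencies of Table 2.3 are stored in a dictionary; the names `frequencies`, `relative` and `percentage` are chosen here for convenience.

```python
# Frequencies taken from Table 2.3.
frequencies = {
    "Very stressful": 10,
    "Somewhat stressful": 14,
    "Not stressful": 6,
}

total = sum(frequencies.values())  # sum of all frequencies

# Relative frequency = frequency of the category / sum of all frequencies.
relative = {category: f / total for category, f in frequencies.items()}

# Percentage = relative frequency * 100.
percentage = {category: r * 100 for category, r in relative.items()}

for category in frequencies:
    print(f"{category:<20}{relative[category]:.3f}{percentage[category]:>8.1f}")
```

The relative frequencies always add up to 1 and the percentages to 100, which is a quick way to catch arithmetic slips.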

2.1.3 Graphical presentation of qualitative data
It is said that “a picture is worth a thousand words”. The same can be said of statistics, as
a graphical display can reveal at a glance the main characteristics of a dataset. Bar graphs and pie
charts are two types of graphs that are commonly used to display qualitative data.

Bar graph (bar chart) is a graph of bars whose heights represent the frequencies (or relative
frequencies/percentage frequencies) of respective categories.

Example 2-3: The data in Table 2.5 represent the percentages of price increases of some
consumer goods and services for the period December 1990 to December 2000 in Blantyre city.
Construct a bar chart for this data.

Table 2.5: Percentages of Price Increases of Some Consumer Goods and Services in Blantyre

Consumer good or service    Percentage price increase
Medical care                83.3%
Electricity                 22.1%
Residential rent            43.5%
Food                        41.1%
Consumer price index        35.8%
Apparel and upkeep          21.2%

Solution: Looking at Figure 2.1 we can identify where the maximum and minimum responses are
located, so that we can descriptively discuss the phenomenon whose behavior we want to
understand.

Figure 2.1: Bar graph on percentage price increase of consumer goods (vertical axis: percentage,
0 to 90; horizontal axis: MC, EI, RR, FD, CPI, A&U).
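
For readers without charting software at hand, the idea behind the bar graph can be illustrated with a rough text version. The sketch below is written in Python and assumes the values of Table 2.5; it draws horizontal bars whose lengths play the role of the bar heights, and then picks out the largest and smallest increases. The scaling of one `#` per two percentage points is an arbitrary choice made for this illustration.

```python
# Percentage price increases from Table 2.5.
price_increase = {
    "Medical care": 83.3,
    "Electricity": 22.1,
    "Residential rent": 43.5,
    "Food": 41.1,
    "Consumer price index": 35.8,
    "Apparel and upkeep": 21.2,
}

# One '#' stands for roughly two percentage points.
for item, pct in price_increase.items():
    bar = "#" * round(pct / 2)
    print(f"{item:<22}{bar} {pct}%")

# Categories with the highest and lowest price increase.
highest = max(price_increase, key=price_increase.get)
lowest = min(price_increase, key=price_increase.get)
print("Highest:", highest)
print("Lowest:", lowest)
```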


Pie Chart is a circle divided into sectors that represent the percentages (relative frequencies) of a
population or a sample that belongs to different categories. A pie chart is more commonly used
to display percentages although it can be used to display frequencies or relative frequencies. The
whole pie chart represents the total sample or population.

Example 2-4: Construct a pie chart for the combined percentages of carbon monoxide (CO) and
ozone (O3) emissions from different sources as listed in Table 2.6 below.

Table 2.6: Combined Percentages of CO and O3 Emissions

Source                    Percentage
Transportation (T)        63%
Industrial process (I)    10%
Fuel combustion (F)       14%
Solid waste (S)           5%
Miscellaneous (M)         8%

Solution: The pie chart is given in Figure 2.2.

Figure 2.2: Pie chart for combined percentages of CO and O3 emissions
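
When a pie chart is drawn by hand, each percentage has to be converted into a sector angle. Since the whole circle (360 degrees) represents 100%, a category's angle is its percentage divided by 100 and multiplied by 360. A minimal Python sketch of this conversion, assuming the percentages of Table 2.6 (the variable names are illustrative):

```python
# Combined percentages of CO and O3 emissions from Table 2.6.
emissions = {
    "Transportation": 63,
    "Industrial process": 10,
    "Fuel combustion": 14,
    "Solid waste": 5,
    "Miscellaneous": 8,
}

# Sector angle in degrees = (percentage / 100) * 360.
angles = {source: pct / 100 * 360 for source, pct in emissions.items()}

for source, angle in angles.items():
    print(f"{source:<20}{emissions[source]:>3}%{angle:>8.1f} degrees")
```

The angles should add up to 360 degrees, just as the percentages add up to 100.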

Exercise 2.1

1) Why do we need to group data in the form of a frequency table? Explain briefly.

2) How are the relative frequencies and percentages of categories obtained from the
frequencies of categories? Illustrate with the help of an example.
3) The following data give the results of a sample survey. The letters A, B, and C represent the
three categories.
A B B A C B C C C A C B C A C
C B C C A A B C C B C B A C A

i) Prepare a frequency distribution table.


ii) Calculate the relative frequencies and percentages for all categories.
iii) What percentage of the elements in this sample belong to category A or B or C?
iv) Draw a bar graph for the frequency distribution.
v) Draw a pie chart for the percentage distribution.

2.2 ORGANIZING AND GRAPHING QUANTITATIVE DATA


This section explains how one can group and graphically display quantitative data. Note that
all discussion in this section refers to quantitative data.

2.2.1 Frequency distributions
A frequency distribution of quantitative data lists all classes and the number of values that belong
to each class. Data presented in the form of a frequency distribution is called grouped data.

A class is an interval that includes all the values that fall between two numbers called the lower and
upper limits. The classes represent a variable and are non-overlapping (each value belongs to one
and only one class).

Class boundary (real class limit) is given by the midpoint of the upper limit of one class and the
lower limit of the next class. In Table 2.7 below, to find the midpoint of the upper limit of the
first class and the lower limit of the second class, we divide the sum of these two limits by 2.
Thus the class boundary is

\[
\frac{1000 + 1001}{2} = 1000.5
\]

The value 1000.5 is called the upper boundary of the first class and the lower boundary of the
second class.

Class width is the difference between the two boundaries of a class. Class midpoint (or class
mark) is obtained by dividing the sum of the two limits (or the two boundaries) of a class by
two.

Example 2-5: Construct a table showing the class boundaries, class midpoints, and class width for
the data in Table 2.7 on weekly earnings of 100 employees at a large company. Note that the data
has already been grouped into classes with corresponding frequencies of employees falling into
each class given.

Table 2.7: Weekly earnings for 100 employees

Weekly earnings (kwacha)    Number of employees
801 - 1000                  9
1001 - 1200                 22
1201 - 1400                 39
1401 - 1600                 15
1601 - 1800                 9
1801 - 2000                 6

Solution: From Table 2.7, the values 801, 1001, 1201, 1401, 1601 and 1801 are the lower limits, and
the values 1000, 1200, 1400, 1600, 1800 and 2000 are the upper limits of the six classes.

Table 2.8: Class boundaries, class widths and class midpoints for Table 2.7 data

Class Limits    Class Boundaries             Class Width    Class Midpoint
801 - 1000      800.5 to less than 1000.5    200            900.5
1001 - 1200     1000.5 to less than 1200.5   200            1100.5
1201 - 1400     1200.5 to less than 1400.5   200            1300.5
1401 - 1600     1400.5 to less than 1600.5   200            1500.5
1601 - 1800     1600.5 to less than 1800.5   200            1700.5
1801 - 2000     1800.5 to less than 2000.5   200            1900.5
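
The boundaries, widths and midpoints in Example 2-5 follow mechanically from the class limits. The Python sketch below is one way to reproduce them, assuming equally spaced classes so that half the gap between consecutive classes (here 0.5) can be subtracted from each lower limit and added to each upper limit; the names used are illustrative only.

```python
# Class limits from Table 2.7 as (lower limit, upper limit) pairs.
class_limits = [
    (801, 1000), (1001, 1200), (1201, 1400),
    (1401, 1600), (1601, 1800), (1801, 2000),
]

# Gap between the upper limit of one class and the lower limit of the next.
gap = class_limits[1][0] - class_limits[0][1]
half_gap = gap / 2

rows = []
for lower, upper in class_limits:
    lower_boundary = lower - half_gap
    upper_boundary = upper + half_gap
    width = upper_boundary - lower_boundary  # difference between the boundaries
    midpoint = (lower + upper) / 2           # sum of the two limits divided by 2
    rows.append((lower_boundary, upper_boundary, width, midpoint))
    print(f"{lower}-{upper}: {lower_boundary} to less than {upper_boundary}, "
          f"width {width}, midpoint {midpoint}")
```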

2.2.2 Constructing frequency distribution tables
When constructing a frequency distribution table for quantitative data, the following need to be
considered:

Number of classes: Usually the number of classes varies from 5 to 20 depending on the number
of observations in the data set. It is recommended to have more classes with a large data set.

Class width: It is possible to have classes of different sizes. It is preferable, however, to have the
same width for all classes. As such, an approximate width can be calculated depending on the
number of classes one intends to have.

\[
\text{Approximate class width} = \frac{\text{Largest value} - \text{Smallest value}}{\text{Number of classes}}
\]

Lower limit of the first class or the starting point: Any convenient number that is less than or equal
to the smallest value in the dataset can be used as the lower limit of the first class.

Example 2-6: The following data gives the total number of cell phones sold by Airtel on each of
the 30 days in a month. Construct a frequency distribution table.

Table 2.9: Airtel cellphone sales

8 25 11 15 29 22 10 5 17 21 23 14 19 27 14
22 13 26 16 18 12 9 26 20 16 23 20 16 16 21

Solution: In this dataset, the minimum value is 5 and the maximum value is 29. Suppose we
decide to group this data using five classes of equal width; then

\[
\text{Approximate width of each class} = \frac{29 - 5}{5} = 4.8
\]

We can round this up to a convenient number, say 5. Taking 5 as the lower limit of the first
class and a class width of 5, the classes are as shown below:

Table 2.10: Frequency distribution for the data on cellphone sales

Cell phones sold    Frequency (f)    Class midpoint
5-9                 3                7
10-14               6                12
15-19               8                17
20-24               8                22
25-29               5                27
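
The construction in Example 2-6 can be sketched in Python as follows. This is an illustration only, assuming the 30 daily sales figures from Table 2.9, five classes, and the choices made in the solution (a starting point of 5 and a rounded class width of 5); the variable names are not part of the module.

```python
# Daily cell phone sales from Table 2.9.
sales = [
    8, 25, 11, 15, 29, 22, 10, 5, 17, 21, 23, 14, 19, 27, 14,
    22, 13, 26, 16, 18, 12, 9, 26, 20, 16, 23, 20, 16, 16, 21,
]

number_of_classes = 5

# Approximate class width = (largest value - smallest value) / number of classes.
approx_width = (max(sales) - min(sales)) / number_of_classes

class_width = 5  # approximate width rounded up to a convenient number
start = 5        # lower limit of the first class (<= smallest value)

# Class limits: 5-9, 10-14, 15-19, 20-24, 25-29.
classes = [
    (start + i * class_width, start + (i + 1) * class_width - 1)
    for i in range(number_of_classes)
]

# Frequency of a class = number of values between its limits (inclusive).
frequencies = [sum(lower <= x <= upper for x in sales) for lower, upper in classes]

for (lower, upper), f in zip(classes, frequencies):
    print(f"{lower}-{upper}: frequency {f}, midpoint {(lower + upper) / 2}")
```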

2.2.3 Relative frequency and percentage distributions
Relative frequency and percentage distributions are calculated in the same way as for
qualitative data in section 2.1.2.

Example 2-7: Calculate the relative frequencies and percentages for the data in Table 2.9.

Solution: The relative frequencies and percentages for cell phone sales are given below:

Table 2.11: Relative frequency and percentage distributions for cell phone sales

Cell phones sold    Class boundaries         Relative frequency    Percentage (%)
5-9                 4.5 to less than 9.5     3/30 = 0.100          10.0
10-14               9.5 to less than 14.5    6/30 = 0.200          20.0
15-19               14.5 to less than 19.5   8/30 = 0.267          26.7
20-24               19.5 to less than 24.5   8/30 = 0.267          26.7
25-29               24.5 to less than 29.5   5/30 = 0.167          16.7

2.2.4 Graphing grouped data
Grouped (quantitative) data can be displayed using a histogram or polygon.

A histogram is a graph in which classes are marked on the horizontal axis and frequencies, relative
frequencies, or percentages are marked on the vertical axis. The frequencies, relative
frequencies or percentages are represented by heights of bars which are drawn adjacent to each
other without any gaps. Figure 2.3 shows the frequency histogram for data on cell phone sales.
Class limits have been used to mark classes on the horizontal axis. However, class boundaries can
also be used.

[Figure: bar chart titled "Frequencies of cellphone sales"; vertical axis: frequency (0 to 10); horizontal axis: cell phones sold (classes 5-9, 10-14, 15-19, 20-24, 25-29)]

Figure 2.3: Frequency histogram for Airtel cell phone sales

Note that similar graphs can be constructed for relative frequencies, percentage frequencies,
etc.

A polygon is a graph formed by joining the midpoints of the tops of successive bars in a histogram
with straight lines. In other words, it is a line graph constructed from the class midpoints and
frequencies for each class.

For a large dataset, as the number of classes is increased (and the width of classes decreased),
the frequency polygon eventually becomes a smooth curve and is called a frequency distribution
curve/frequency curve.

Example 2-8: Using the data in Table 2.9, construct a frequency polygon for Airtel cell phone
sales.

Solution: A polygon for cell phone sales, constructed using class midpoints, is given in Figure 2.4.

Figure 2.4: Cell phone sales polygon

2.2.5 Cumulative frequency distributions


A cumulative frequency distribution gives the total number of values that fall below the upper
boundary of a particular class.

Cumulative relative frequencies are obtained by dividing the cumulative frequencies by the total
number of observations in the data set. Cumulative percentages are obtained by multiplying the
cumulative relative frequencies by 100.

Cumulative frequency distribution tables can be constructed for a given dataset. Note that each
class has the same lower limit but a different upper limit.

Example 2-9: Using the data on cell phone sales from Example 2-6, prepare a cumulative frequency,
cumulative relative frequency and cumulative percentage distribution table for the number of cell
phones sold by the company.

Solution: With reference to the frequency table constructed in Example 2-6, the cumulative
frequency distribution table can be constructed as below:

Table 2.12: Cumulative frequency distribution table for Airtel cell phone sales

Class limits    Class boundaries          Cumulative frequency      Cumulative relative frequency    Cumulative percentage
5-9             4.5 to less than 9.5      3                         3/30 = 0.100                     10.0
5-14            4.5 to less than 14.5     3 + 6 = 9                 9/30 = 0.300                     30.0
5-19            4.5 to less than 19.5     3 + 6 + 8 = 17            17/30 = 0.567                    56.7
5-24            4.5 to less than 24.5     3 + 6 + 8 + 8 = 25        25/30 = 0.833                    83.3
5-29            4.5 to less than 29.5     3 + 6 + 8 + 8 + 5 = 30    30/30 = 1.000                    100.0
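
The running totals in the cumulative table can be reproduced with a few lines of Python. This is a minimal sketch rather than part of the module; it starts from the class frequencies 3, 6, 8, 8 and 5 of Example 2-6, and the variable names are illustrative only.

```python
from itertools import accumulate

frequencies = [3, 6, 8, 8, 5]                          # class frequencies from Example 2-6
upper_boundaries = [9.5, 14.5, 19.5, 24.5, 29.5]       # upper boundary of each class

n = sum(frequencies)                                   # total number of observations
cumulative = list(accumulate(frequencies))             # running totals
cum_relative = [round(c / n, 3) for c in cumulative]   # cumulative relative frequencies
cum_percent = [round(100 * c / n, 1) for c in cumulative]

for ub, c, r, p in zip(upper_boundaries, cumulative, cum_relative, cum_percent):
    print(f"less than {ub}: {c}  {r:.3f}  {p}%")
```

Pairing each upper boundary with its cumulative frequency gives the points that are plotted when the cumulative distribution is drawn as a graph.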

An ogive is a curve drawn for the cumulative frequency distribution by joining with straight lines
the dots marked above the upper boundaries of classes at heights equal to the cumulative
frequencies of respective classes.

To draw an ogive, the variable is marked on the horizontal axis and the cumulative frequencies
on the vertical axis. Then the dots are marked above the upper boundaries of various classes at
the heights equal to the corresponding cumulative frequencies. The ogive is obtained by joining
consecutive points with straight lines. Note that the ogive starts at the lower boundary of the
first class, where the cumulative frequency is zero, and ends at the upper boundary of the last
class; the upper boundary of each class is also the lower boundary of the next class.

Example: Construct an ogive for the cell phone sales data in Table 2.11.

Solution: An ogive for the data in Table 2.11 is constructed below. It is constructed using the
class boundaries and their respective cumulative frequencies.

Figure 2.5: Ogive for Airtel cell phone sales

2.2.6 Stem and leaf displays
A stem-and-leaf plot is a simple way of summarizing quantitative data that is well suited to
computer applications. When data sets are relatively small, stem-and-leaf plots are particularly
useful. In a stem-and-leaf plot, each data value is split into a “stem” and a “leaf.” The “leaf” is
usually the last digit of the number and the other digits to the left of the “leaf” form the “stem.”

Example 2-10: Construct a stem-and-leaf display for the following scores of 30 college students.

75 52 80 96 65 79 71 87 93 95 69 72 81 61 76
86 79 68 50 92 83 84 77 64 71 87 72 92 57 98
Table 2-13: Scores for college students

Solution: A stem and leaf display for college scores is given below

9 | 2 3 5 6 8 2
8 | 0 3 4 6 7 7 1
7 | 2 1 5 2 7 9 6 9 1
6 | 8 5 4 9 1
5 | 0 7 2
Figure 2.6: A stem and leaf display
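
A stem-and-leaf display is easy to build by program, which is one reason it is described as well suited to computer applications. The Python sketch below is an illustration only; it assumes two-digit scores, so the tens digit is the stem and the units digit is the leaf, and it ranks the leaves within each stem, which Figure 2.6 does not do.

```python
from collections import defaultdict

# Scores of 30 college students (Table 2-13)
scores = [75, 52, 80, 96, 65, 79, 71, 87, 93, 95, 69, 72, 81, 61, 76,
          86, 79, 68, 50, 92, 83, 84, 77, 64, 71, 87, 72, 92, 57, 98]

stems = defaultdict(list)
for score in scores:
    stem, leaf = divmod(score, 10)   # tens digit is the stem, units digit is the leaf
    stems[stem].append(leaf)

# Print the stems from largest to smallest, as in Figure 2.6
for stem in sorted(stems, reverse=True):
    leaves = " ".join(str(leaf) for leaf in sorted(stems[stem]))
    print(f"{stem} | {leaves}")
```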

A return period, also known as a recurrence interval (sometimes repeat interval), is an estimate of
the likelihood of an event, such as an earthquake, flood or a river discharge flow, occurring. It is a
statistical measurement typically based on historic data denoting the average recurrence interval
over an extended period of time, and is usually used for risk analysis (e.g. to decide whether a
project should be allowed to go forward in a zone of a certain risk, or to design structures to
withstand an event with a certain return period).

Exercise 2.2

1) Briefly explain the concept of cumulative frequency distribution. How are the cumulative
relative frequencies and cumulative percentages calculated?
2) Explain for what kind of frequency distribution an ogive is drawn. Can you think of any use
for an ogive? Explain.
3) The following table gives the frequency distribution of the number of ATM cards possessed
by 80 adults.
Number of ATM cards    0 to 3    4 to 7    8 to 11    12 to 15    16 to 19
Number of adults       18        26        22         11          3
a) Prepare a cumulative frequency distribution.
b) Calculate the cumulative relative frequencies and cumulative percentages for all classes.
c) Find the percentage of these adults who possess 7 or fewer ATM cards.
d) Draw an ogive for the cumulative percentage distribution.
e) Using the ogive, find the percentage of adults who possess 10 or fewer ATM cards.

UNIT 3: NUMERICAL SUMMARIES OF DATA

Objectives

By the end of this topic, students should be able to:

• Explain why it is important to summarize the variability of a dataset


• Provide and explain the role of the common summary statistics (mean, median, range,
maximum, minimum, quartiles, inter-quartile range, quartile deviation, mean deviation,
standard deviation, variance, coefficient of variation and degrees of freedom) for a simple
dataset from first principles.
• Explain the formulae for the variance, standard deviation and the mean deviation

Background

In the previous section we looked at some graphical and tabular techniques for describing a data
set. We shall now consider some numerical characteristics of a dataset which can provide more
detailed information about important features of a distribution. These summaries fall into
different groups, such as measures of central tendency, measures of variability/dispersion and
measures of position/relative position.

3.1 MEASURES OF CENTRAL TENDENCY


Measures of central tendency give the center of a histogram or a frequency distribution curve.
These measures give a picture of the distribution of a dataset. This section discusses three
different measures of central tendency namely mean, mode and median.

3.1.1 Mean
The mean, also known as the arithmetic mean, is the most frequently used measure of central
tendency. The mean is basically the average of a set of observations and is calculated differently
for grouped and ungrouped data. The mean calculated from the whole population data is
denoted $\mu$ and the mean calculated from sample data is denoted $\bar{x}$.

Mean for ungrouped data is obtained by dividing the sum of all values by the number of values in
the dataset.

\[ \text{Mean for population data: } \mu = \frac{\sum x}{N} \]

\[ \text{Mean for sample data: } \bar{x} = \frac{\sum x}{n} \]

Where $\sum x$ is the sum of all values, $N$ is the population size, $n$ is the sample size, $\mu$ is the
population mean and $\bar{x}$ is the sample mean.

Example 3-1: The following data give the time in months from hire to promotion for a random
sample of 25 software engineers from all software engineers employed by a large
telecommunications firm: 5, 7, 229, 453, 12, 14, 18, 14, 14, 483, 22, 21, 25, 23, 24, 34, 37, 34,
49, 64, 47, 67, 69, 192 and 125. Calculate the mean.

Solution: Since it is a random sample, we use the sample mean formula to calculate the mean,
where n = 25.

\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = \frac{2082}{25} = 83.28 \text{ months} \]

We can conclude that at this company, on average, it takes a software engineer 83.28 months to
get a promotion.
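
The arithmetic of Example 3-1 can be checked with a short Python sketch. It is an illustration only: the variable names are chosen here, and `statistics.mean` from the Python standard library is used simply as a cross-check on the hand formula.

```python
from statistics import mean

# Months from hire to promotion for the 25 sampled engineers (Example 3-1)
months_to_promotion = [5, 7, 229, 453, 12, 14, 18, 14, 14, 483, 22, 21, 25,
                       23, 24, 34, 37, 34, 49, 64, 47, 67, 69, 192, 125]

n = len(months_to_promotion)        # sample size
total = sum(months_to_promotion)    # sum of all values
sample_mean = total / n             # sample mean = (sum of x) / n

print(n, total, sample_mean)
print(mean(months_to_promotion))    # library result for comparison
```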

Example 3-2: The following are the ages (in years) of all eight employees of a small company: 53,
32, 61, 27, 39, 44, 49 and 57. Find the mean age of these employees.

Solution: Since the given data set includes all eight employees of the company, it represents the
population, hence N = 8.

\[ \mu = \frac{\sum_{i=1}^{N} x_i}{N} = \frac{362}{8} = 45.25 \text{ years} \]

Mean for grouped data: When we encounter situations where the data is grouped, we no longer
have individual data values, hence different formulae are used to calculate the mean for such data.

\[ \text{Mean for population data: } \mu = \frac{\sum mf}{N} \]

\[ \text{Mean for sample data: } \bar{x} = \frac{\sum mf}{n} \]

Where $N$ is the population size, $n$ is the sample size, $m$ is the class midpoint and $f$ is the
frequency of a class.

Example 3-3: The grouped data in Table 3.1 represent the number of children from birth through
the end of the teenage years in a large apartment complex. Find the mean for the data:

Table 3.1 Number of Children and Their Age Group

Class        0-3    4-7    8-11    12-15    16-19
Frequency    7      4      19      12       8

Solution: For simplicity of calculation we create a frequency table (Table 3.2).

Table 3.2: Frequency table

Class    f       m       mf
0-3      7       1.5     10.5
4-7      4       5.5     22
8-11     19      9.5     180.5
12-15    12      13.5    162
16-19    8       17.5    140
         n=50            ∑mf = 515

\[ \text{Sample mean: } \bar{x} = \frac{\sum mf}{n} = \frac{515}{50} = 10.30 \]

We can conclude that the average age of children in the apartment complex is about 10.3 years.
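
The grouped-data calculation can be sketched in Python as follows. This is an illustration under the same assumption the formula makes, namely that every value in a class is represented by the class midpoint; the variable names are not from the module.

```python
# Class limits and frequencies from Table 3.1
classes = [(0, 3), (4, 7), (8, 11), (12, 15), (16, 19)]
frequencies = [7, 4, 19, 12, 8]

midpoints = [(lower + upper) / 2 for lower, upper in classes]   # m for each class
n = sum(frequencies)                                            # sample size
sum_mf = sum(m * f for m, f in zip(midpoints, frequencies))     # sum of m * f
grouped_mean = sum_mf / n

print(midpoints)
print(n, sum_mf, grouped_mean)
```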

Effect of Outliers on mean


We can state that the value of the population mean $\mu$ is constant. However, the value of the
sample mean varies from sample to sample: it depends on what values of the population are
included in that sample. Sometimes a data set may contain a few very small or a few very large
values (outliers/extreme values). A major shortcoming of the mean as a measure of central
tendency is that it is very sensitive to outliers. Example 3-4 illustrates this point.

Example 3-4: Table 3.3 lists the money (in million Kwachas) spent by six Malawian television
stations in 2014.

Table 3.3: Money spent by Malawian TV stations in 2015

TV station     MBC      Luso    Timveni    Zodiak    Times    Nyasa
Money spent    337.9    22.4    31.8       19.8      9.0      27.5

Notice that MBC spent much more money compared to the other stations hence it is an outlier.
As such, we will show how the inclusion of this outlier affects the value of the mean.

Solution: If we do not include the expenditure for MBC (the outlier), the mean expenditure of the
five TV stations is:

\[ \frac{22.4 + 31.8 + 19.8 + 9.0 + 27.5}{5} = 22.1 \text{ million Kwacha} \]

Now, to see the impact of the outlier on the value of the mean, we include the expenditure of
MBC and find the mean expenditure of the six stations. This mean is:

\[ \frac{22.4 + 31.8 + 19.8 + 337.9 + 9.0 + 27.5}{6} = 74.73 \text{ million Kwacha} \]

Thus, including the MBC expenditure causes more than a threefold increase in the value of the
mean, which changes from 22.1 million Kwacha to 74.73 million Kwacha.

The preceding example should encourage us to be cautious. We should remember that the
mean is not always the best measure of central tendency because it is heavily influenced by
outliers. Sometimes other measures of central tendency give a more accurate impression of a
data set. For example, when a data set has outliers, instead of using the mean, we can use either
the trimmed mean or the median as a measure of central tendency.
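
The sensitivity of the mean to an outlier, and the relative stability of the median, can be seen by recomputing both with and without MBC. The Python sketch below is illustrative only; it uses the figures from Table 3.3 and the standard library functions `statistics.mean` and `statistics.median`.

```python
from statistics import mean, median

# Money spent (million Kwacha) by the six stations in Table 3.3
spending = {"MBC": 337.9, "Luso": 22.4, "Timveni": 31.8,
            "Zodiak": 19.8, "Times": 9.0, "Nyasa": 27.5}

all_stations = list(spending.values())
without_mbc = [amount for station, amount in spending.items() if station != "MBC"]

mean_without, mean_with = mean(without_mbc), mean(all_stations)
median_without, median_with = median(without_mbc), median(all_stations)

print(f"mean:   {mean_without:.2f} without MBC, {mean_with:.2f} with MBC")
print(f"median: {median_without:.2f} without MBC, {median_with:.2f} with MBC")
```

Including the outlier more than triples the mean, while the median moves only slightly.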

The trimmed mean is calculated by dropping a certain percentage of values from each end of a
ranked data set. The trimmed mean is especially useful as a measure of central tendency when a
data set contains a few outliers at each end. Suppose the following data give the ages (in years)
of 10 employees of a company: 47, 53, 38, 26, 39, 49, 19, 67, 31 and 23. To calculate the 10%
trimmed mean, first rank these data values in increasing order; then drop 10% of the smallest
values and 10% of the largest values. The mean of the remaining 80% of the values will give
the 10% trimmed mean. Note that this data set contains 10 values, and 10% of 10 is 1. Thus, if we
drop the smallest value and the largest value from this data set, the mean of the remaining 8
values will be called the 10% trimmed mean.

\[ \text{The mean of the data} = \frac{47 + 53 + 38 + 26 + 39 + 49 + 19 + 67 + 31 + 23}{10} = 39.2 \]

The ranked data set is: 19, 23, 26, 31, 38, 39, 47, 49, 53 and 67. Hence the 10% trimmed mean
will be given as:

\[ 10\% \text{ trimmed mean} = \frac{23 + 26 + 31 + 38 + 39 + 47 + 49 + 53}{8} = 38.25 \]
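
The trimming procedure can be written as a small Python function. This is a sketch, not a standard library routine: `trimmed_mean` is a hypothetical helper defined here, and it assumes that the percentage to drop corresponds to a whole number of values at each end.

```python
def trimmed_mean(values, percent):
    """Mean after dropping `percent`% of the ranked values from each end."""
    ranked = sorted(values)
    k = len(ranked) * percent // 100          # number of values dropped at each end
    kept = ranked[k:len(ranked) - k]
    return sum(kept) / len(kept)

ages = [47, 53, 38, 26, 39, 49, 19, 67, 31, 23]

ordinary_mean = trimmed_mean(ages, 0)         # nothing dropped
ten_percent_trimmed = trimmed_mean(ages, 10)  # smallest and largest value dropped

print(ordinary_mean, ten_percent_trimmed)
```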

3.1.2 Median
The median is the value of the middle term in a data set that has been ranked in increasing
order. If the number of observations is odd, the median is given by the middle term in the ranked
data. If the number of observations is even, then the median is given by the average of the values
of the two middle terms.

Example 3-5: The following data gives the prices (in millions of Kwachas) of 5 houses sold by an
estate agent: 34, 50, 12, 8 and 15. Find the median.

Solution: First rank the values in increasing order as follows

8 12 15 34 50

The median is 15.

Example 3-6: Table 3.4 below gives 2014 profits (rounded to billions of Kwachas) for banks in
Malawi. Find the median.

Table 3.4: 2014 profits for selected Malawian banks

Bank      NBS    FDH    MSB    FMB
Profit    52     24     36     57

Solution: Ranking the bank profits we get

24 36 52 57

The median is the average of the two middle values: (36+52)/2 = 44

The median gives the center of a histogram, with half of the data values to the left of the median
and half to the right of the median. The advantage of using the median as a measure of central
tendency is that it is not influenced by outliers. Consequently, the median is preferred over the
mean as a measure of central tendency for data sets that contain outliers.
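
The odd and even cases of the median rule can be sketched in Python. The function name `median_of` is chosen here for illustration; Python's standard library also provides `statistics.median`, which follows the same rule.

```python
def median_of(values):
    """Median: middle term if n is odd, average of the two middle terms if n is even."""
    ranked = sorted(values)
    n = len(ranked)
    middle = n // 2
    if n % 2 == 1:
        return ranked[middle]
    return (ranked[middle - 1] + ranked[middle]) / 2

house_prices = [34, 50, 12, 8, 15]    # Example 3-5 (odd number of observations)
bank_profits = [52, 24, 36, 57]       # Example 3-6 (even number of observations)

print(median_of(house_prices))
print(median_of(bank_profits))
```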

3.1.3 Mode
The mode is the value that occurs with the highest frequency in a dataset. A dataset can have no
mode, one mode (unimodal dataset), two modes (bimodal dataset) or more than two modes
(multimodal dataset). Bimodal and multimodal datasets have two values and more than two
values occurring with the same highest frequency respectively.

Example 3-7: In 2015 maize yields for five selected families were 200kg, 315kg, 400kg, 200kg and
178kg. Find the mode.

Solution: Mode is 200kg as it is the only value occurring with the highest frequency in the
dataset.

Example 3-8: The ages of 10 randomly selected students are 21, 15, 16, 19, 21, 19, 14, 25, 26 and 25
years. Find the mode.

Solution: The data set has three modes; 21, 19 and 25, each with a (highest)frequency of 2.

Note that a major shortcoming of the mode is that a data set may have none or may have more
than one mode, whereas it will have only one mean and only one median. For instance, a data
set with each value occurring only once has no mode, hence it is not easy to make any numerical
conclusions.
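
The three situations described above (one mode, several modes, no mode) can be sketched with Python's `collections.Counter`. The helper `modes` is hypothetical and follows the convention of this section that a data set in which every value occurs once has no mode.

```python
from collections import Counter

def modes(values):
    """Return all values that occur with the highest frequency (empty list if no mode)."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:
        return []                      # every value occurs only once: no mode
    return sorted(v for v, c in counts.items() if c == highest)

maize_yields = [200, 315, 400, 200, 178]                    # Example 3-7
student_ages = [21, 15, 16, 19, 21, 19, 14, 25, 26, 25]     # Example 3-8

print(modes(maize_yields))
print(modes(student_ages))
print(modes([1, 2, 3]))
```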

Exercise 3.1

1) Explain how the value of the median is determined for a data set that contains an odd
number of observations and for a data set that contains an even number of observations.
2) Briefly explain the meaning of an outlier. Is the mean or the median a better measure of
central tendency for a data set that contains outliers? Illustrate with the help of an
example.
3) Using an example, show how outliers can affect the value of the mean.
4) The following data give the numbers of car thefts that occurred in a city during the past
12 days: 6, 3, 7, 1, 1, 4, 3, 8, 7, 2, 6, 9, 1 and 5.
a) Find the mean, median, and mode car thefts during the 12 days
b) Find the mean car thefts per day, during the 12 days.

3.2 MEASURES OF DISPERSION


Measures of central tendency do not reveal the whole picture of the distribution of a data set.
Two data sets may have the same measures of central tendency but different
spread/variation/dispersion. The measures that help us to learn more about the spread of data
are called the measures of dispersion. This section focuses on some measures of dispersion.

3.2.1 Range
The range of a set of observations is the difference between the largest and smallest
observation/value of a data set. It is the simplest measure of dispersion to calculate.

\[ \text{Range} = \text{Largest value} - \text{Smallest value} \]

Example 3-9: National Bank of Malawi registered profits (in Billions of Kwachas) of 60, 34, 50, 46
and 37 for the years 2001, 2002, 2003, 2004 and 2005 respectively. Find the range of profits for
the given years.

Solution: The maximum profit is 60 billion Kwacha and the minimum profit is 34 billion Kwacha.
Therefore

\[ \text{Range} = (60 - 34) \text{ billion Kwacha} = 26 \text{ billion Kwacha} \]

Thus the profits over the five years have a range of 26 billion Kwacha.

Note that the range, like the mean, has the disadvantage of being influenced by outliers hence
the range is not a good measure of dispersion to use for a data set that contains outliers.
Another disadvantage of using the range as a measure of dispersion is that its calculation is
based on two values only (the largest and the smallest), all other values in a data set are ignored
when calculating the range.

3.2.2 Variance and standard deviation


The variance and standard deviation are more useful measures of variation because they use the
information contained in all the observations in the data set or population. The standard
deviation is found by taking the positive square root of the variance

The value of the standard deviation tells us how closely the values of a data set are clustered
around the mean. In general, a lower value of the standard deviation for a data set indicates that
the values of a data set are spread over a relatively smaller range around the mean while a larger
value indicates that the values are spread over a relatively larger range around the mean.

The variance calculated from population data is denoted by $\sigma^2$, and variance from sample data is
denoted $s^2$. Similarly, the standard deviation calculated from population data is denoted $\sigma$,
and the standard deviation calculated from sample data is denoted $s$.

Variance and standard deviation for ungrouped data
The standard deviation is basically the positive square root of the variance. The following are the
basic formulae used to calculate variance for ungrouped data.

\[ \text{Population variance: } \sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N} = \frac{\sum x^2 - \frac{(\sum x)^2}{N}}{N} \]

\[ \text{Sample variance: } s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n - 1} \]

Where $\mu$ is the population mean and $\bar{x}$ is the sample mean. The quantities $(x - \mu)$ or $(x - \bar{x})$ in the
above formulae are called the deviation of the $x$ value from the mean. Note that the sum of the
deviations of the $x$ values from the mean is always zero; that is, $\sum_{i=1}^{N} (x_i - \mu) = 0$ and
$\sum_{i=1}^{n} (x_i - \bar{x}) = 0$.

Example 3-10: The following data give the time in months from hire to promotion for a random
sample of 25 software engineers from all software engineers employed by a large
telecommunications firm: 5, 7, 229, 453, 12, 14, 18, 14, 14, 483, 22, 21, 25, 23, 24, 34, 37, 34,
49, 64, 47, 67, 69, 192 and 125. Calculate the variance and standard deviation.

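Solution: This is sample data so we will calculate the sample variance, with n = 25 (so n - 1 = 24) and $\bar{x} = 83.28$.

\[ s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1} = \frac{1}{24} \left[ (5 - 83.28)^2 + (7 - 83.28)^2 + \cdots + (125 - 83.28)^2 \right] \approx 16477.54 \text{ months}^2 \]

And the sample standard deviation is

\[ s = \sqrt{s^2} = \sqrt{16477.54} \approx 128.36 \text{ months} \]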

Variance and standard deviation for grouped data
The following are the basic formulae used to calculate variance for grouped data.

\[ \text{Population variance: } \sigma^2 = \frac{\sum f (m - \mu)^2}{N} = \frac{\sum m^2 f - \frac{(\sum mf)^2}{N}}{N} \]

\[ \text{Sample variance: } s^2 = \frac{\sum f (m - \bar{x})^2}{n - 1} = \frac{\sum m^2 f - \frac{(\sum mf)^2}{n}}{n - 1} \]

Where $m$ is the midpoint of a class and $f$ is the frequency of a class.

Example 3-11: The grouped data below represents the number of children from birth through the
end of the teenage years in a large apartment complex. Find the variance and standard deviation
for the data:

Class 0-3 4-7 8-11 12-15 16-19
Frequency 7 4 19 12 8

Solution: We construct a frequency table as below:

Class    f       m       mf       m²f
0-3      7       1.5     10.5     15.75
4-7      4       5.5     22       121
8-11     19      9.5     180.5    1714.75
12-15    12      13.5    162      2187
16-19    8       17.5    140      2450
         n=50            ∑mf = 515    ∑m²f = 6488.5
Table 3.5: Frequency table

\[ \text{Sample variance: } s^2 = \frac{\sum m^2 f - \frac{(\sum mf)^2}{n}}{n - 1} = \frac{6488.5 - \frac{(515)^2}{50}}{49} = 24.16 \]

\[ \text{Sample standard deviation: } s = \sqrt{s^2} = \sqrt{24.16} = 4.92 \]

Effect of Outliers on standard deviation


Outliers have the same effect on the standard deviation/variance as they do on the mean, hence the
need to take steps to remove them from a data set, just as we did with the mean in Section 3.1.1.
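
The sample variance calculations of Examples 3-10 and 3-11 can be checked with the short-cut formulas. The Python sketch below is illustrative only; the two helper functions are hypothetical names defined here, one for ungrouped data and one for grouped data with class midpoints and frequencies.

```python
from math import sqrt

def sample_variance(values):
    """Short-cut formula for ungrouped data: (sum x^2 - (sum x)^2 / n) / (n - 1)."""
    n = len(values)
    return (sum(x * x for x in values) - sum(values) ** 2 / n) / (n - 1)

def grouped_sample_variance(midpoints, frequencies):
    """Short-cut formula for grouped data: (sum m^2 f - (sum mf)^2 / n) / (n - 1)."""
    n = sum(frequencies)
    sum_mf = sum(m * f for m, f in zip(midpoints, frequencies))
    sum_m2f = sum(m * m * f for m, f in zip(midpoints, frequencies))
    return (sum_m2f - sum_mf ** 2 / n) / (n - 1)

# Example 3-10: months from hire to promotion
months = [5, 7, 229, 453, 12, 14, 18, 14, 14, 483, 22, 21, 25,
          23, 24, 34, 37, 34, 49, 64, 47, 67, 69, 192, 125]
var_months = sample_variance(months)
sd_months = sqrt(var_months)

# Example 3-11: class midpoints and frequencies
var_ages = grouped_sample_variance([1.5, 5.5, 9.5, 13.5, 17.5], [7, 4, 19, 12, 8])
sd_ages = sqrt(var_ages)

print(round(var_months, 2), round(sd_months, 2))
print(round(var_ages, 2), round(sd_ages, 2))
```

The three very large values in the promotion data (229, 453 and 483) account for most of the variance, which illustrates how strongly outliers inflate the standard deviation.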

3.2.3 Coefficient of variation


Whenever two samples have the same units of measure, the variance and standard deviation for
each can be compared directly. For example, in comparing mileage of two different car brands,
Brand A and Brand B, it was found that for a specific year, the standard deviation for brand A was
422 miles and the standard deviation for brand B was 350 miles. Here we could say that the
variation in mileage was greater in brand A. But what if we wanted to compare the standard
deviations of two different variables, such as the number of sales per salesperson over a 3-month
period and the commissions made by these salespeople? We use the coefficient of
variation.

The coefficient of variation (denoted CVar) is a statistic that allows you to compare standard
deviations when the units are different. It is given as a percentage using the formula below:

\[ \text{For samples: } CVar = \frac{s}{\bar{x}} \times 100\% \qquad \text{For populations: } CVar = \frac{\sigma}{\mu} \times 100\% \]

Where $s$ and $\sigma$ are the sample and population standard deviations respectively.

Example 3-12: The mean of the number of sales of cars over a 3-month period is 87, and the
standard deviation is 5. The mean of the commissions is MK5225000, and the standard deviation
is MK773000. Compare the variations of the two.

Solution: The coefficients of variation are

\[ CVar \text{ (Sales)} = \frac{5}{87} \times 100\% = 5.7\% \]

\[ CVar \text{ (Commissions)} = \frac{773000}{5225000} \times 100\% = 14.8\% \]

The coefficient of variation is larger for commissions, hence the commissions are more variable
than the sales.
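
Because the coefficient of variation is a ratio, the units cancel, which is what makes the comparison in Example 3-12 possible. A minimal Python sketch, with a hypothetical helper name:

```python
def coefficient_of_variation(std_dev, mean):
    """Standard deviation expressed as a percentage of the mean."""
    return std_dev / mean * 100

cv_sales = coefficient_of_variation(5, 87)                  # sales (number of cars)
cv_commissions = coefficient_of_variation(773000, 5225000)  # commissions (MK)

print(f"Sales: {cv_sales:.1f}%  Commissions: {cv_commissions:.1f}%")
```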

3.3 MEASURES OF POSITION
A measure of position determines the position of a single value in relation to other values in a
sample or a population data set. In this section we will discuss some of the measures of position.

3.3.1 Quartiles, Inter quartile range
Quartiles are three summary measures that divide a ranked data set into four equal
parts/quarters. These three summary measures are the first quartile (denoted by $Q_1$), the
second quartile (denoted by $Q_2$) and the third quartile (denoted by $Q_3$). Note that the data should be
ranked in increasing order before the quartiles are determined.

The first quartile is basically the value of the middle term among the observations that are less
than the median, and the third quartile is the value of the middle term among the observations
that are greater than the median.

Each of the portions in the figure below contains 25% of the observations of the data set
arranged in increasing order. The portions are separated by the three quartiles ($Q_1$, $Q_2$ and $Q_3$).

    25%    |    25%    |    25%    |    25%
           Q1          Q2          Q3

Figure 3.1 Quartiles

From figure 3.1, we can see that approximately 25% of the values are less than $Q_1$ and
approximately 75% are greater than $Q_1$, while approximately 75% of the data values are less than
$Q_3$ and approximately 25% are greater than $Q_3$.

Inter quartile range (IQR) is the difference between the third quartile and the first quartile of a
data set.

\[ IQR = Q_3 - Q_1 \]

3.3.2 Deciles, percentiles and percentile rank

Percentiles are the summary measures that divide a ranked data set (in increasing order) into 100
equal parts. Each dataset has 99 percentiles which divide it into 100 equal parts.

The first quartile is the 25th percentile and is often called the lower quartile, the second quartile is
the 50th percentile and is often called the middle quartile, and the third quartile is the 75th
percentile, often called the upper quartile. The kth percentile is denoted $P_k$, where k is an integer
in the range 1-99. For instance, the 25th percentile is denoted by $P_{25}$. Figure 3.2 shows the
positions of the 99 percentiles.

    1%  |  1%  |  1%  |  ...  |  1%  |  1%  |  1%
        P1     P2     P3     ...    P97    P98    P99

Figure 3.2 Percentiles

The approximate value of the kth percentile, denoted $P_k$, is given as follows

\[ P_k = \text{Value of the } \left( \frac{kn}{100} \right) \text{th term in a ranked data set} \]

Where k denotes the number of the percentile and n represents the sample size.

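Percentile rank for a particular value $x_i$ gives the percentage of values in the data set that are less
than $x_i$ and is given as:

\[ \text{Percentile rank of } x_i = \frac{\text{Number of values less than } x_i}{\text{Total number of values in the data set}} \times 100 \]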

Deciles are summary measures that divide a ranked data set (in increasing order) into 10 equal
groups. Note that the first decile ($d_1$) corresponds to $P_{10}$; the second decile ($d_2$) corresponds to $P_{20}$;
etc. Deciles can be found by using the formulas given for percentiles.

The relationships among percentiles, deciles, and quartiles are summarized as: Deciles are
denoted by $d_1, d_2, d_3, \ldots, d_9$, corresponding to $P_{10}, P_{20}, P_{30}, \ldots, P_{90}$ respectively. Quartiles are
denoted by $Q_1, Q_2, Q_3$, corresponding to $P_{25}, P_{50}, P_{75}$ respectively. The median is the same as
$P_{50}$ or $Q_2$ or $d_5$.

Example 3-13: Table 3.6 below gives 2014 expenditures (rounded to millions of Kwacha) for
12 private secondary schools in Malawi.

Table 3.6: 2014 expenditures for some Private Schools in Malawi

Private School/College       2014 expenditure
Bishop Mackenzie             8
Kamuzu Academy               12
Mary Mount Girls             7
Phwezi                       17
New Era                      14
Marist Boys                  45
Kalibu Academy               10
Matindi Girls Academy        13
Maranatha Girls              17
Peter Pan                    13
Perke Boys                   9
Michiru Boys                 11
a) Find the values of the three quartiles.
b) Where does the 2014 expenditure for Bishop Mackenzie fall in relation to these quartiles?
c) Find the inter quartile range.
d) Find the value of the 42nd percentile and give a brief interpretation of the 42nd percentile.
e) Find the percentile rank for the 14 million Kwacha expenditure of New Era Private School.

Solution: First we rank the data in increasing order

Position of value    1    2    3    4     5     6     7     8     9     10    11    12
Value                7    8    9    10    11    12    13    13    14    17    17    45

a) From the ranking, we note that $Q_1$ falls between the values 9 and 10 (positions 3 and 4), $Q_2$
falls between the values 12 and 13 (positions 6 and 7), and $Q_3$ falls between the values 14 and 17
(positions 9 and 10). Therefore the quartiles will be as follows:

\[ Q_1 = \frac{9 + 10}{2} = 9.5, \quad Q_2 = \frac{12 + 13}{2} = 12.5, \quad Q_3 = \frac{14 + 17}{2} = 15.5 \]

• $Q_1$ = 9.5 million Kwacha indicates that 25% of the schools in this sample spent less
than 9.5 million Kwacha and 75% of the schools spent more than 9.5 million
Kwacha.
• $Q_2$ = 12.5 million Kwacha indicates that half of the schools spent less than 12.5 million
Kwacha while the other half spent more than 12.5 million Kwacha.
• $Q_3$ = 15.5 million Kwacha indicates that 75% of the schools spent less than 15.5
million Kwacha while 25% of the schools spent more than 15.5 million Kwacha.
b) The expenditure for Bishop Mackenzie (8 million Kwacha) falls below the lower quartile.
c) $IQR = Q_3 - Q_1 = 15.5 - 9.5 = 6$ million Kwacha.
d) Using the data arranged in increasing order, the position of the 42nd percentile is:

\[ \frac{kn}{100} = \frac{42 \times 12}{100} = 5.04\text{th term} \]

The value of the 5.04th term can be approximated by the value of the 5th term in the
ranked data. Therefore $P_{42}$ = 42nd percentile = 11 million Kwacha. Thus, approximately 42% of
the schools in this sample spent less than 11 million Kwacha.
e) Percentile rank of 14:

\[ \text{Percentile rank of } 14 = \frac{8}{12} \times 100 = 66.67\% \]
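
The steps of Example 3-13 can be sketched in Python. This is an illustration of the method described in this section (quartiles as the middle terms of the lower and upper halves of the ranked data), not of the interpolation rules that statistical packages may use; all function names are hypothetical, and taking the nearest whole term for the percentile position is an assumption based on the approximation used in part d).

```python
def median_of(ranked):
    """Median of a list that is already ranked in increasing order."""
    n = len(ranked)
    middle = n // 2
    if n % 2 == 1:
        return ranked[middle]
    return (ranked[middle - 1] + ranked[middle]) / 2

def quartiles(values):
    ranked = sorted(values)
    n = len(ranked)
    half = n // 2
    lower_half = ranked[:half]            # observations below the median position
    upper_half = ranked[half + n % 2:]    # observations above the median position
    return median_of(lower_half), median_of(ranked), median_of(upper_half)

def percentile(values, k):
    ranked = sorted(values)
    position = k * len(ranked) / 100      # the (kn/100)th term
    index = max(1, round(position))       # assumption: use the nearest whole term
    return ranked[index - 1]

def percentile_rank(values, x):
    below = sum(1 for v in values if v < x)   # number of values less than x
    return below / len(values) * 100

expenditures = [8, 12, 7, 17, 14, 45, 10, 13, 17, 13, 9, 11]   # Table 3.6

q1, q2, q3 = quartiles(expenditures)
iqr = q3 - q1
p42 = percentile(expenditures, 42)
rank_14 = percentile_rank(expenditures, 14)

print(q1, q2, q3, iqr, p42, round(rank_14, 2))
```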

3.4 OTHER NUMERICAL ATTRIBUTES OF DATA


3.4.1 Skewness and kurtosis
Skewness ($\alpha_3$) is a measure of the degree of asymmetry of a frequency distribution about the
mean. When a distribution stretches to the right more than it does to the left it is said to be right
skewed/positively skewed. Similarly, a left skewed/negatively skewed distribution is one that
stretches asymmetrically to the left. Zero skewness implies a symmetric distribution about the
mean. Skewness of a population/distribution, denoted $\alpha_3$, can be calculated using the formula:

\[ \alpha_3 = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{x_i - \mu}{\sigma} \right)^3 \]

$\alpha_3 = 0$ implies zero skewness, $\alpha_3 < 0$ implies negative skewness and $\alpha_3 > 0$ implies positive
skewness. Note that two distributions can have the same mean, variance and skewness but
could still be significantly different in shape, hence the need to look at kurtosis.

Kurtosis ($\alpha_4$) is a measure of whether a distribution is peaked or flat relative to a normal
distribution. In general, the larger the kurtosis, the more peaked will be the distribution, and vice
versa. Kurtosis is calculated and reported either as an absolute or a relative value. Absolute
kurtosis is always a positive number while relative kurtosis can be a negative number. The two
are related as follows:

\[ \text{Relative kurtosis} = \text{Absolute kurtosis} - 3 = \alpha_4 - 3 \]

Absolute kurtosis of a population, denoted $\alpha_4$, is calculated using the following formula:

\[ \alpha_4 = \frac{1}{N} \sum_{i=1}^{N} \left( \frac{x_i - \mu}{\sigma} \right)^4 \]

A negative relative kurtosis ($\alpha_4 < 3$) implies a flatter distribution than the normal distribution
and is called platykurtic. A positive relative kurtosis ($\alpha_4 > 3$) implies a more peaked distribution
than the normal distribution and is called leptokurtic. Zero relative kurtosis ($\alpha_4 = 3$) implies a
distribution has the same kurtosis as a normal distribution and is called mesokurtic.

When analysis is done in a statistical package, first check whether kurtosis is reported as an
absolute or a relative value (SPSS, for example, reports a kurtosis of zero for a normal
distribution, that is, relative kurtosis). In terms of absolute kurtosis, the value is interpreted as
follows:

• Kurtosis > 3: Distribution is sharper than a normal distribution with thicker tails and most
values concentrated around the mean, which implies high probability for extreme values.
• Kurtosis < 3: Distribution is flatter than a normal distribution with a wider peak. The
probability for extreme values is less than for a normal distribution, and the values are
wider spread around the mean.
• Kurtosis = 3: Distribution is the same as a normal distribution.
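
The population formulas for skewness and absolute kurtosis given in this section can be sketched in Python as below. This is an illustration only: the function name is hypothetical, the first data set is made up to show a symmetric case, and the second reuses the ranked values from Example 3-13, treated as a population purely for illustration.

```python
from math import sqrt

def skewness_and_kurtosis(values):
    """Population skewness, absolute kurtosis and relative kurtosis."""
    n = len(values)
    mu = sum(values) / n
    sigma = sqrt(sum((x - mu) ** 2 for x in values) / n)   # population standard deviation
    z = [(x - mu) / sigma for x in values]                 # standardized values
    skewness = sum(v ** 3 for v in z) / n
    absolute_kurtosis = sum(v ** 4 for v in z) / n
    return skewness, absolute_kurtosis, absolute_kurtosis - 3

symmetric = [1, 2, 3, 4, 5]                                      # hypothetical data
stretched_right = [7, 8, 9, 10, 11, 12, 13, 13, 14, 17, 17, 45]  # values from Example 3-13

skew_sym, abs_kurt_sym, rel_kurt_sym = skewness_and_kurtosis(symmetric)
skew_right, abs_kurt_right, rel_kurt_right = skewness_and_kurtosis(stretched_right)

print(skew_sym, abs_kurt_sym, rel_kurt_sym)
print(skew_right, abs_kurt_right, rel_kurt_right)
```

The symmetric data set has skewness of zero and a negative relative kurtosis (flatter than normal), while the single large value of 45 gives the second data set a positive skewness.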

3.4.2 Box-and-whisker plots
These give a graphic presentation of data using five measures: the median, the first quartile, the
third quartile and the smallest and largest values in the dataset between the lower and upper
inner fences. It can help us visualize the center, spread and the skewness of a data set. It also
helps in detecting outliers.

Example: The following data are weights (in Kg’s) of diabetic patients at a Central hospital.
Construct a box-whisker plot for these data:

75 69 84 112 74 104 81 90 94 144 79 98

Solution

Step 1: First, rank the data in increasing order and calculate the value of the median, first
quartile, third quartile and the interquartile range. After ranking the data, these values are given
as: Median = 87, $Q_1$ = 77, $Q_3$ = 101, IQR = 24

Step 2: Find the points that are 1.5 × IQR below $Q_1$ and 1.5 × IQR above $Q_3$. These two are
called the lower and upper inner fences respectively.

Lower inner fence = $Q_1$ − 1.5 × 24 = 77 − 36 = 41,

Upper inner fence = $Q_3$ + 1.5 × 24 = 101 + 36 = 137

Step 3: Determine the smallest and largest values in the given dataset within the two inner
fences. Smallest value = 69, Largest value = 112

Step 4: Draw a horizontal line and mark the weight levels on it such that all the values in the
given data set are covered. Above the horizontal line, draw a box with its left side at the position
of the first quartile and the right side at the position of the third quartile. Inside the box, draw a
vertical line at the position of the median. Then, by drawing two lines, join the points of the
smallest and largest values within the two inner fences to the box. These two lines are called
whiskers. Note that a value that falls outside the two inner fences is called an outlier. Below is a
box-and-whisker plot for the data in this example.
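
The numerical steps of this example can be checked with a short Python sketch. It assumes the quartiles are taken as the medians of the lower and upper halves of the ranked data, which reproduces the values quoted in Step 1; statistical packages may use slightly different quartile rules. The variable and function names are illustrative only.

```python
def median(ranked):
    """Median of a list that is already sorted in increasing order."""
    n = len(ranked)
    mid = n // 2
    if n % 2 == 1:
        return ranked[mid]
    return (ranked[mid - 1] + ranked[mid]) / 2

weights = [75, 69, 84, 112, 74, 104, 81, 90, 94, 144, 79, 98]
ranked = sorted(weights)
n = len(ranked)

# Quartiles as medians of the lower and upper halves
# (the middle value is left out of both halves when n is odd).
q1 = median(ranked[: n // 2])
q3 = median(ranked[(n + 1) // 2 :])
iqr = q3 - q1

# Inner fences lie 1.5 x IQR beyond the quartiles.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Whiskers end at the most extreme values inside the fences;
# anything outside the fences is an outlier.
inside = [w for w in ranked if lower_fence <= w <= upper_fence]
whisker_low, whisker_high = inside[0], inside[-1]
outliers = [w for w in ranked if w < lower_fence or w > upper_fence]

print("Median, Q1, Q3, IQR:", median(ranked), q1, q3, iqr)
print("Inner fences:", lower_fence, upper_fence)
print("Whisker ends:", whisker_low, whisker_high)
print("Outliers:", outliers)
```

In this data set the only value beyond the inner fences is 144, so it is the value that would be marked as an outlier on the plot.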

Activity: Construct a box plot for the example above and show which values are outliers

Exercise 3.2

1) Briefly describe how the three quartiles are calculated for a data set. Illustrate by
calculating the three quartiles for two examples, the first with an odd number of
observations and the second with an even number of observations.
2) Explain how the interquartile range is calculated. Give one example.
3) Briefly describe how the percentiles are calculated for a data set.
4) Explain the concept of the percentile rank for an observation of a data set.
5) The following data give the weights (in kg) lost by 15 members of a health club at the
end of 2 months after joining the club.
5, 10, 8, 7, 25, 12, 5, 14, 11, 10, 21, 9, 8, 11, 18
i) Compute the values of the three quartiles and the interquartile range.
ii) Calculate the (approximate) value of the 82nd percentile.
iii) Find the percentile rank of 10.

UNIT 4: INTRODUCTION TO DESCRIPTIVE ANALYSIS USING SPSS
Objectives

By the end of this topic, students must be able to:

• Do cross tabulations on data in SPSS.


• Construct basic graphical summaries of data.

Introduction

This topic will introduce students to working with SPSS to perform simple descriptive analyses
such as the formation of tables, bar charts, histograms, etc. This section assumes that students
have entered/already have data in SPSS. As such, we will use some pre-loaded datasets in SPSS to
go through the practical. Throughout the chapter, the upward pointing arrow (↑) will be an
instruction for the learner to click on/open the icon/command that follows it.

4.1 TABULAR PRESENTATION OF DATA.


By the end of this section, a student should be able to create simple and complex tables to
summarize data sets. These will range from single variable frequency tabulation to hierarchical
cross-tabulation and multiple responses.

We will also learn how to summarize categorical data in tabular form as well as how to format
the tabulations to one’s preference. Summarizing the data this way offers a nice visual aid for
seeing the picture better. In the next sections, we shall continue with this philosophy and see
how we can visually present the data further using graphical techniques.

4.1.1 One variable tabulation
In this section, we shall use the file CSS telco Data to perform tabulations for internet services
with respect to region zones in the data set. We shall learn this through the next practice
exercise.

Practice 4-1: Forming frequency tables

1) (a) ↑Analyze; (b) ↑Descriptive Statistics; (c) ↑Frequencies;
2) ↑Geographic indicator (region) (i.e. choosing the variable that has zones of households);
3) Paste the variable into the ‘Variable(s)’ dialog box by left clicking on the triangular arrow;
4) √ [tick] ‘Display frequency tables’;
5) ↑OK (This action leads to a display of the output as shown in the figure below).

Figure 4.1: SPSS Output for a one-way frequency table

The output suggests that most (344) of the individuals who participated in this survey came from
households of zone 3. In the other columns, SPSS gives the percentage of responses that fall
within each zone, with the last column giving the cumulative percentages. The cumulative
percentage reaches 100% at the last category (zone 3), which confirms that every response has
been counted.
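
To make the columns of a one-way frequency table concrete, here is a minimal Python sketch that builds the same kind of table (count, percent and cumulative percent) from a list of category labels. The zone labels below are made-up values, not the telco data, and the function name is an assumption for illustration.

```python
from collections import Counter

def frequency_table(values):
    """Return rows of (category, frequency, percent, cumulative percent)."""
    counts = Counter(values)
    total = sum(counts.values())
    rows = []
    cumulative = 0.0
    for category in sorted(counts):
        percent = 100 * counts[category] / total
        cumulative += percent
        rows.append((category, counts[category], round(percent, 1), round(cumulative, 1)))
    return rows

# Hypothetical responses to a 'region' question.
regions = ["Zone 1", "Zone 2", "Zone 3", "Zone 3",
           "Zone 2", "Zone 3", "Zone 1", "Zone 3"]

table = frequency_table(regions)
for category, freq, percent, cum_percent in table:
    print(f"{category:<8}{freq:>4}{percent:>8}{cum_percent:>8}")
```

Whatever the data, the cumulative percent in the last row is 100, because by then every response has been counted.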

The window in which this output lies is the second main window of SPSS called Output
View. Note that at the bottom task bar of your screen, there is a new icon labeled as
‘Output1 – IBM SPSS Statistics Viewer’. To the left of that icon is the one for the CSS telco Data.
You can switch between the output and the data view mode by left clicking the window you want
to make your current SPSS window.

4.1.2 Cross-tabulations
In order to compare the categories of another variable (for example age category or marital
status) across all the zones, one may decide to have the zones as rows and the other variable as
the columns. Make sure that the variables you are performing cross-tabulations on are already
put in the desired categories. To perform the cross-tabulation, follow the steps in the next
practice exercise.

Practice 4-2: Creating a 2-way Table

1) Choose the ‘Select Cases’ option, then clear the ‘Zone 2’ selection by clicking ‘Reset’;
2) (a) ↑Analyze; (b) ↑Descriptive Statistics; (c) ↑Crosstabs;
3) ↑Geographic region (zones) (i.e. choose the zones variable);
4) Paste this variable into the ‘Row(s)’ dialog box via the triangular arrow next to it;
5) ↑Marital status (i.e. choose the variable whose categories are to be compared across zones);
6) Paste this variable into the ‘Column(s)’ dialog box via the triangular arrow next to it;
7) ↑OK (This action leads to a display of the output as shown in the figure below).

Figure 4.2: SPSS output for a cross-tabulation for geographic indicator against marital status

This gives the required output, which shows that there were more unmarried people who
participated in the survey across all zones.

Performing the same procedure on Geographic location and level of education variables, we
would get a table as below

Table 4.3: SPSS output for cross tabulation of geographic indicator against level of education

This gives an output that shows that there were more people with a high school degree that
participated in the study across all zones.
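
The counting behind a two-way table can be sketched in a few lines of Python. The zone and marital status values below are invented for illustration and are not taken from the telco data; the function name is likewise an assumption.

```python
from collections import Counter

def crosstab(row_values, col_values):
    """Count how many cases fall in each (row category, column category) cell."""
    cells = Counter(zip(row_values, col_values))
    row_labels = sorted(set(row_values))
    col_labels = sorted(set(col_values))
    return {r: {c: cells[(r, c)] for c in col_labels} for r in row_labels}

# Hypothetical data: one entry per respondent.
zone = ["Zone 1", "Zone 1", "Zone 2", "Zone 2", "Zone 3", "Zone 3", "Zone 3"]
marital = ["Unmarried", "Married", "Unmarried", "Unmarried",
           "Married", "Unmarried", "Unmarried"]

table = crosstab(zone, marital)
for row_label, row in table.items():
    print(row_label, row)

# Column totals, as in the 'Total' row of a cross-tabulation.
column_totals = {c: sum(row[c] for row in table.values()) for c in ["Married", "Unmarried"]}
print("Total", column_totals)
```

Each respondent contributes to exactly one cell, so the cell counts always add up to the number of respondents.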

If one had a multiple response question then one would define the multiple response groups as
described in section 5.1.1, then the frequency tables or cross-tabulations would be performed as
outlined in section 5.1.2.

Exercise 4.1: In the previous practice exercise, we tried to create a cross-tabulation of ‘zones’
versus ‘age in category’. Now, using the same data set, try to create a cross-tabulation of ‘retire’
versus ‘age in category’.

Exercise 4.2: Suppose that the variable ‘custcat’ in the CSS telcodata captures the customer
service category in which a respondent was placed: Basic service, E-service, Plus service, or
Total service. Perform a slightly more advanced cross-tabulation of ‘custcat’ by ‘zones’ as
follows. Place the variable ‘custcat’ in the Row(s) dialog box, the ‘internet service access
(internet)’ variable in the Column(s) dialog box and the ‘zones’ variable in the Layer 1 of 1
dialog box, to find out whether there were respondents who did not receive any internet service.

4.1.3 Formatting SPSS cross-tabulations
Perhaps one of the most versatile features of SPSS 20.0 for Windows cross-tabulation output is
the ability to format the tables professionally and to swap the rows and columns. For example, a
cross-tabulation may not be fully visible because the output table is longer in the horizontal
direction than in the vertical direction. One can then swap the rows and columns by taking the
following actions:

Practice 4-3: Swapping rows and columns

To swap rows and columns, perform the following actions:

1) R↑ (i.e. right click on the cross-tabulation output one wants to swap rows for columns);
2) ↑Edit Content
3) ↑in separate window (you may just close a window that pops up)
4) ↑Pivot
5) (a)↑Transpose rows and columns (b) ↑file
6) Close the (most) current window to see that the rows and columns have changed roles.

Practice 4-4: Editing SPSS cross-tabulations

To edit the contents of the table, perform the following actions:

1) R↑ on the appropriate table (i.e. right click the cross-tabulation one wants to edit);
2) ↑Edit Content
3) ↑in separate window (you may just close a window that pops up)
4) ↑Edit or go further with another choice of representing your data by choosing ‘graph’; or
5) ↑format, then get choice on ‘table looks’
6) ↑↑On the item or part of the item you want to edit (You may need one further click to
edit part of it); and
7) When one is through with the editing, one can click anywhere outside the table.

Practice 4-5: Choosing a professional format

To format the table in a professional way, one may use various inbuilt formats as follows.

1) ↑↑on the appropriate table (i.e. double click a table one wants to format for
presentation);
2) R↑ on the table you just double clicked;
3) ↑Edit content
4) ↑in separate window
5) ↑format, ↑Tablelooks…
6) Choose any format of tables among the list of formats appearing on this window by left
clicking on it e.g. Academic;
7) ↑Save Look , ↑OK
8) ↑file, ↑close

Recall that the SPSS output keeps appending new outputs at the bottom of the previous outputs.
If the output becomes very long with unnecessary outputs one can discard an output or some
part of it by left clicking on that output once and pressing or striking [del] or [delete] key on the
keyboard.

4.2 GRAPHIC PRESENTATION OF DATA


For this section, we will be using Employee data from the SPSS datasets. To open your SPSS
Employee data [↑File, ↑Open, ↑↑SPSS Employee data (or in the ‘Filename’ dialog box, type:
SPSS Employee data, ↑Open)].

4.2.1 Computing simple summary statistics
We shall now compute the following statistical values: the total (sum) number of employees, the
mean, the minimum and maximum data values, the range, the standard deviation and the
standard error. Note that you can perform these commands on any other set of continuous data.
Let us command SPSS to calculate these values on the ‘actual’ number of males selected. Follow
the steps:

1) ↑Analyze [Some versions of SPSS have ‘Statistics’ instead of ‘Analyze’ on the menu!]
2) ↑Descriptive Statistics
3) ↑Frequencies
4) Choose (by clicking) on the list of variables; the one that has correct male employee’s
number variable.
5) Click the triangular arrow to complete (paste) correct male employees number on the
window on the right.
NB: Any wrongly selected variable that is pasted can be removed by clicking it where it is
pasted, followed by clicking the [reversed!] triangular arrow tab.
6) ↑Statistics
7) Check (i.e. put a tick [√] by clicking on the object) the ‘Sum’, ‘Mean’, ‘Maximum’,
‘Minimum’, ‘Range’, ‘Std. Deviation’, ‘Variance’ and the ‘S.E. mean’ items.
8) ↑Continue, ↑OK

In the output, the first part of the table has the values N (valid and missing values). This is the
sample size that was used in the computation.
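
For readers who want to see what lies behind the ticked items, the sketch below computes the same summary statistics in Python for a small made-up sample. The data values are hypothetical, and the sample (n − 1) formulas for the standard deviation and variance are assumed here.

```python
import math
import statistics

def summarize(values):
    """Summary statistics matching the items ticked in the Frequencies dialog."""
    n = len(values)
    sd = statistics.stdev(values)  # sample standard deviation (divisor n - 1)
    return {
        "N": n,
        "Sum": sum(values),
        "Mean": statistics.mean(values),
        "Minimum": min(values),
        "Maximum": max(values),
        "Range": max(values) - min(values),
        "Std. Deviation": sd,
        "Variance": statistics.variance(values),
        "S.E. mean": sd / math.sqrt(n),  # standard error of the mean
    }

sample = [12, 15, 9, 20, 14]  # hypothetical continuous data
summary = summarize(sample)
for name, value in summary.items():
    print(f"{name:<15}{value}")
```

The standard error of the mean is simply the standard deviation divided by the square root of the sample size, which is why it shrinks as N grows.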

4.2.2 Drawing a pie chart


Using the data in section 4.2.1 above, draw a pie chart to show the proportions by
Minority classification. [Use: ↑Graphs, ↑Legacy Dialogs, ↑Pie, ↑Summaries for groups of cases,
↑Define, ↑move the variable ‘minority’ into the ‘Define Slices by’ dialog box, ↑OK].

This analysis will produce a pie chart for the variable of interest.

4.2.3 Drawing Box Plots
Let us suppose that we are not quite sure of the distribution of income among different age
categories from SPSS telcodata. Let us compare the income with respect to the age groups
selected. The steps are:

1) ↑Graphs. ↑Legacy dialogs. ↑Boxplot. ↑Define.
2) Select the variable (e.g. ‘income’) and paste in the ‘Variable’ dialog box.
3) Select the variable (e.g. ‘Custcatg’) into ‘Category Axis’ dialog box.
4) ↑OK

Figure 4.4: SPSS output for box-and-whisker plots

Note that boxplots help us identify outliers in the data set. For instance, in the output above we
can clearly see an outlier in the E-service customer category.

4.2.4 Drawing Histograms
Using any SPSS data set (e.g. telco data), one can still plot the distribution of the income
data/any variable using similar steps.

1) ↑Graphs. ↑Legacy dialogs. ↑Histogram. ↑Define
2) Select the variable (‘income’) and paste in the ‘Variable’ dialog box.
3) Select the variable (‘Educatg’) into ‘Column Axis’ dialog box.
4) ↑Titles [We want to put a title to the graph]
5) In the Title ‘Line 1’ dialog box, type the title of your graph (INCOME HISTOGRAM FOR
EDUCATIONAL LEVELS)
6) ↑Continue. ↑OK.
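
A histogram is built by counting how many values fall into each class interval. The Python sketch below shows that counting step and prints a simple text histogram. The income values, the bin width of 20 and the function name are assumptions made for illustration; SPSS chooses its own bins.

```python
def histogram_counts(values, bin_width, start=0):
    """Count how many values fall in each bin [lower, lower + bin_width)."""
    counts = {}
    for value in values:
        lower = start + bin_width * int((value - start) // bin_width)
        counts[lower] = counts.get(lower, 0) + 1
    return dict(sorted(counts.items()))

incomes = [12, 18, 25, 31, 33, 47, 52, 58, 64, 95]  # hypothetical incomes
counts = histogram_counts(incomes, bin_width=20)

# One row per bin; the number of '#' marks is the height of the bar.
for lower, count in counts.items():
    print(f"{lower:>3}-{lower + 20:<4}{'#' * count}")
```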

UNIT 5: COMMON COMPLICATIONS WHEN ANALYSING SURVEY DATA
Objectives

By the end of this topic, students should be able to:

• Properly handle zero values in data sets.
• Handle and analyze multiple response data.
• Know when to use weighted values.
• Handle missing values in a dataset.

5.1 ANALYSIS OF MULTIPLE RESPONSE QUESTIONS
Multiple responses refer to the situation when people are allowed to tick more than one answer
option for a single question. An example of a multiple response question could be:

Example 5.1: Question 1: Which of the following electrical appliances do you own? (Tick
all that apply)

a) TV
b) VCR
c) Stereo/CD player
d) PDA
e) Computer
f) Fax Machine

When presented with a question like this, one could tick on more than one answer hence the
need to tackle analysis for such questions. As such this section will focus on analysis of such
questions in SPSS. Note however, that the analysis can be done in other packages as well.

5.1.1 Entering multiple response questions data in SPSS


When entering multiple response questions, each response/category is entered as a separate
variable. For instance, the question on electrical appliances could be entered as six separate
variables (each coded separately) in SPSS Variable View as follows:

1. Do you have a TV?                 i) Yes   ii) No
2. Do you have a VCR?                i) Yes   ii) No
3. Do you have a Stereo/CD player?   i) Yes   ii) No
4. Do you have a PDA?                i) Yes   ii) No
5. Do you have a Computer?           i) Yes   ii) No
6. Do you have a fax machine?        i) Yes   ii) No

5.1.2 Analyzing multiple response questions in SPSS
A simple approach is to use the Multiple Response option. This procedure creates a single
summary table of counts and percentages based on several variables that contain responses to
one question. This would create one table that combines all six variables, rather than six
separate tables.

1. First, make note of how the variables of interest are coded. For this example there are six
categories (a-f)
2. Next, instruct SPSS that the set of variables represents responses to a single question. In
the menu bar, go to Analyze>Multiple Response>Define Variable Sets. To define a multiple
response set in SPSS we must specify the list of variables that make up the set, the type
of coding used, and a name.
3. Using the arrow button, place variables Q1_a (Owns a TV) through Q1_f(Owns a fax
machine) in the “Variables in Set” box.
4. Depending on how you entered the data, click:
• “Categories” and enter “1-6” for the range, if the data were entered as a repeated
question with all 6 responses included in the Values section.
• “Dichotomies” and enter “1” for the counted value, if each response was entered
as a separate question to which the respondent answered “yes” or “no”.
5. Give the new collapsed variable a name (e.g. item). Next, give the variable a label and
click “Add”. Notice that the set name now appears in the Multiple Response Sets list box.
The $ prefix distinguishes the set name from an ordinary SPSS variable name.
6. Click “Close”.
7. Return to Analyze > Multiple response. You will now see that two options have been
activated: Frequencies and Crosstabs. Below is an example of a frequency output for the
item variable. The table was created based on responses to the six variables (Q1_a to
Q1_f). The N column indicates how many respondents mentioned each item. The Percent
of Responses column indicates what percentage of the total number of items mentioned
is contained in each category. The Percent of Cases indicates what percentage of
respondents own items of each given type.

Table 5.1: SPSS output for multiple response variables

$item Frequencies

                                  Responses           Percent
                                  N        Percent    of Cases
$item(a)  Owns TV                 6337     26.4%      99.3%
          Owns VCR                6145     25.6%      96.3%
          Owns stereo/CD player   6206     25.8%      97.3%
          Owns PDA                1307     5.4%       20.5%
          Owns computer           2811     11.7%      44.1%
          Owns fax machine        1202     5.0%       18.8%
Total                             24008    100.0%     376.4%
a Dichotomy group tabulated at value 1.

Note that the column for total Percent of Cases has 376.4%. The reason that it is possible to have
over 100% is because each respondent can select more than one category. Theoretically, if
everyone selected all categories this percentage would be equal to 600%. Note that the multiple
response set that was created will remain active until a different data file is opened or you exit
SPSS.
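
The difference between “Percent of Responses” and “Percent of Cases” can be seen in a small Python sketch. The four respondents below are made up, and the sketch assumes that a case is counted only if the respondent ticked at least one item; treat both the data and that rule as assumptions for illustration.

```python
def multiple_response_frequencies(respondents, items, counted_value=1):
    """Tabulate a multiple dichotomy set: N, percent of responses, percent of cases."""
    counts = {item: sum(1 for r in respondents if r[item] == counted_value)
              for item in items}
    total_responses = sum(counts.values())
    # Assumption: a case is valid if at least one item holds the counted value.
    n_cases = sum(1 for r in respondents
                  if any(r[item] == counted_value for item in items))
    table = {}
    for item in items:
        table[item] = {
            "N": counts[item],
            "percent_of_responses": 100 * counts[item] / total_responses,
            "percent_of_cases": 100 * counts[item] / n_cases,
        }
    return table, total_responses, n_cases

# Hypothetical data: 1 = owns the item, 0 = does not.
respondents = [
    {"TV": 1, "VCR": 1, "PDA": 0},
    {"TV": 1, "VCR": 0, "PDA": 0},
    {"TV": 1, "VCR": 1, "PDA": 1},
    {"TV": 0, "VCR": 1, "PDA": 0},
]
table, total_responses, n_cases = multiple_response_frequencies(
    respondents, ["TV", "VCR", "PDA"])

for item, row in table.items():
    print(item, row["N"], round(row["percent_of_responses"], 1),
          round(row["percent_of_cases"], 1))
total_percent_of_cases = sum(row["percent_of_cases"] for row in table.values())
print("Total percent of cases:", total_percent_of_cases)
```

Percent of Responses divides by the number of ticks (7 here) and always totals 100%, while Percent of Cases divides by the number of respondents (4 here) and can total more than 100%.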

5.2 PRESENCE OF MISSING VALUES IN DATA

The concept of missing values is important to understand in order to successfully manage data.
If the missing values are not handled properly by the researcher, then he/she may end up
drawing an inaccurate inference about the data. As such, there is a need to know how to deal
with missing values in a dataset. This section will address some easy methods of dealing with
missing values.

5.2.1 Missing data mechanisms

There are different assumptions about missing data mechanisms:

Missing completely at random (MCAR): Suppose variable Y has some missing values. We will say
that these values are MCAR if the probability of missing data on Y is unrelated to the value of Y
itself or to the values of any other variable in the data set. However, it does allow for the
possibility that “missingness” on Y is related to the “missingness” on some other variable X.

Example: We want to assess which are the main determinants of income (such as age). The
MCAR assumption would be violated if people who did not report their income were, on
average, younger than people who reported it. This can be tested by dividing the sample into
those who did and did not report their income, and then testing a difference in mean age. If we
fail to reject the null hypothesis, then we can conclude that the MCAR is mostly fulfilled (there
could still be some relationship between missingness of Y and the values of Y).
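
The check described in this example can be sketched in Python. The module does not say which test to use, so the sketch assumes a Welch two-sample t statistic as one common choice; the ages below are invented, and the function name is illustrative.

```python
import math
import statistics

def welch_t(sample_a, sample_b):
    """Welch's t statistic for the difference between two sample means."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    var_a, var_b = statistics.variance(sample_a), statistics.variance(sample_b)
    standard_error = math.sqrt(var_a / len(sample_a) + var_b / len(sample_b))
    return (mean_a - mean_b) / standard_error

# Hypothetical ages, split by whether income was reported.
age_income_reported = [34, 45, 52, 41, 38, 47]
age_income_missing = [29, 33, 40, 31, 36]

t_stat = welch_t(age_income_reported, age_income_missing)
print("Mean age (income reported):", statistics.mean(age_income_reported))
print("Mean age (income missing): ", statistics.mean(age_income_missing))
print("Welch t statistic:", round(t_stat, 2))
```

A t statistic far from zero, judged against the appropriate critical value, would suggest that the two groups differ in mean age and therefore that the MCAR assumption is doubtful.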

Missing at random (MAR): This is a weaker assumption than MCAR which states that the
probability of missing data on Y is unrelated to the value of Y after controlling for other variables
in the analysis (say X).

Example: The MAR assumption would be satisfied if the probability of missing data on
income depended on a person’s age, but within each age group the probability of missing income
was unrelated to income. However, this cannot be tested because we do not know the values of
the missing data; thus, we cannot compare the values of those with and without missing data to
see if they systematically differ on that variable.

Not missing at random (NMAR): Missing values do depend on unobserved values.

Example: The data would be NMAR if people with high income are less likely to
report their income.

5.2.2 Handling missing data


In this section, we will look at simple methods of dealing with missing values. When faced
with missing values, the researcher may either leave them as they are or use data imputation to
replace them. If the number of cases with missing values is extremely small, the researcher
may drop those cases from the analysis. A common rule of thumb is that cases can be dropped if
they make up less than 5% of the sample. Below are some of the methods used in dealing with
missing data.

List-wise deletion (or complete case analysis): If a case has missing data for any of the variables,
then simply exclude that case from the analysis. It is usually the default in most statistical
packages.

Advantages: It can be used with any kind of statistical analysis and no special computational
methods are required.

Limitations: It can exclude a large fraction of the original sample. For example, suppose a data
set has 1,000 people and 20 variables, and each of the variables has missing data on 5% of the
cases. If the missing values occur independently across variables, the chance that a case is
complete is 0.95^20 ≈ 0.36, so you could expect to have complete data for only about 360
individuals, discarding the other 640. List-wise deletion works well when the data are missing
completely at random (MCAR), which rarely happens in reality.
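
The figure of about 360 complete cases, and list-wise deletion itself, can be sketched in Python. The calculation assumes that missingness is independent across the 20 variables; the small set of records and the function names are hypothetical.

```python
def expected_complete_cases(n_cases, n_variables, missing_rate):
    """Expected number of complete cases if missingness is independent across variables."""
    return n_cases * (1 - missing_rate) ** n_variables

def listwise_delete(records):
    """Keep only the records that have no missing (None) values."""
    return [r for r in records if all(v is not None for v in r.values())]

expected = expected_complete_cases(1000, 20, 0.05)
print("Expected complete cases:", round(expected))

# Hypothetical records; None marks a missing value.
records = [
    {"age": 34, "income": 120},
    {"age": 41, "income": None},
    {"age": None, "income": 95},
    {"age": 29, "income": 80},
]
complete = listwise_delete(records)
print("Records kept:", len(complete), "of", len(records))
```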

Imputation methods: Here, you replace each missing value with a reasonable guess, and then
carry out the analysis as if there were no missing values. There are two main imputation
techniques:

a) Marginal mean imputation: Compute the mean of X using the non-missing values and use
it to impute missing values of X.
Limitations: It leads to biased estimates of variances and covariances and, generally, it
should be avoided.

b) Conditional mean imputation: Suppose we are estimating a regression model with
multiple independent variables and one of them, say X, has missing values. We select
those cases with complete information and regress X on all the other independent
variables. Then, we use the estimated equation to predict X for those cases it is missing.

Limitations of imputation techniques in general: They lead to an underestimation of standard
errors and, thus, overestimation of test statistics. The main reason is that the imputed values are
completely determined by a model applied to the observed data, in other words, they contain no
error.
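
The effect of imputation on variability can be seen with marginal mean imputation in a short Python sketch. The values of X below are invented, with None marking a missing value, and the function name is an assumption for illustration.

```python
import statistics

def mean_impute(values):
    """Replace each missing value (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]

x = [4, 8, None, 6, 10, None]  # hypothetical variable with two missing values
observed = [v for v in x if v is not None]
imputed = mean_impute(x)

print("Mean (observed):", statistics.mean(observed))
print("Mean (imputed): ", statistics.mean(imputed))
print("Variance (observed):", statistics.variance(observed))
print("Variance (imputed): ", statistics.variance(imputed))
```

The mean is unchanged, but the variance falls because the imputed values sit exactly on the mean and add no spread; this is the sense in which imputed values "contain no error".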

5.3 PRESENCE OF ZERO VALUES


Zero values may be a simple and common part of most data sets. For instance, in a household
survey one may be asked to list how many assets they have in different categories e.g.
radios, bicycles, etc. Obviously some may not have these assets, hence bringing zero values into
our data set.

5.3.1 Analysis with Zero values present


For instance, questions like: How many livestock do you have? What was your yield of maize?,
How much rain fell yesterday?, etc. may be asked in a survey. Questions like these can be
regarded as observations and yield data as follows:

Table 5.2: Number of assets for particular questions in a survey

Question/observation 1 2 3 4 5 6 7 8 9 10
Response/value 3 8 0 0 5 6 0 7 0 1

Possible analysis of the dataset in Table 5.2 (including the zero values) would yield the following
summary statistics:

Total (sum) = 30, Number of observations = 10,
Mean = 3, Median = 2, etc.

An alternative analysis would be to separate the data set (i.e. not including the zero values
directly). This analysis would yield two sets of summary statistics as follows:

Total = 30, Number of observations = 10,
Proportion of zeros = 0.4 (40%), Number of zeros = 4,
Number of observations = 6 for the non-zero values,
Mean = 5 for the non-zero values, Median = 5.5, etc.

It is interesting to note that both analyses are valid depending on the precise objective and on
the type of data. However, the 2-step analysis is often appropriate where the data is split into
two (including/not including the zero values) and analyzed as such.
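
Both analyses of the data in Table 5.2 can be reproduced with a few lines of Python. The values are those given in the table; the variable names are illustrative.

```python
import statistics

values = [3, 8, 0, 0, 5, 6, 0, 7, 0, 1]  # responses from Table 5.2

# Analysis 1: keep the zero values in.
overall = {
    "total": sum(values),
    "n": len(values),
    "mean": statistics.mean(values),
    "median": statistics.median(values),
}

# Analysis 2 (two-step): report the zeros separately, then summarize the rest.
non_zero = [v for v in values if v != 0]
two_step = {
    "number_of_zeros": len(values) - len(non_zero),
    "proportion_of_zeros": (len(values) - len(non_zero)) / len(values),
    "n_non_zero": len(non_zero),
    "mean_non_zero": statistics.mean(non_zero),
    "median_non_zero": statistics.median(non_zero),
}

print(overall)
print(two_step)
```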

For example, when a question like “How many cattle do you have?” is asked, some would answer
“yes”: giving the number of cattle owned, while others will answer “no”: producing Zero values.
In analyzing and interpreting data from this question one could have summary statistics as
follows

Table 5.3: number of cattle in households

Farmer A B C D E
Number of cattle 3 4 6 0 3

In analyzing the data from Table 5.3, one would have two analyses as follows:

a) Including the zero value: The mean number of cattle per household can be calculated
as (3 + 4 + 6 + 0 + 3)/5 = 3.2.
Alternatively, one would say:
b) Excluding the zero value: 80% (4/5) of the farmers owned cattle and, among the cattle
owners, the mean was 4 cattle per farmer.

We can see that analysis “b” gives the true mean number of cattle in households that actually
have cattle. Hence for such a dataset, it is advisable to use analysis “b”.

5.4 WEIGHTED VALUES


In some situations, we are presented with data values that vary in their degree of importance. As
such, we may want to compute weighted values depending on degree of importance. In this
section we will focus on weighted means as they are the ones commonly used.

5.4.1 When to Use a Weighted Average


There are two main cases where you will generally use a weighted average instead of a
traditional average. The first is when you want to calculate an average that is based on different
percentage values for several categories. One example might be the calculation of a course
grade.

The second case is when you have a group of items that each has a frequency associated with it.
In these types of situations, using a weighted average can be much quicker and easier than the
traditional method of adding up each individual value and dividing by the total. This is especially
useful when you are dealing with large data sets that may contain hundreds or even thousands
of items but only a finite number of choices.

Example 2: Suppose we are given data on farm yields for two farmers (In tons/hectare) as
follows:

Table 5.4: Farm yields for two farmers

Farmer   Area (ha)   Yield (t/ha)   Production (tons)
A        5           1              5
B        0.5         2              1

From the data given in Table 5.4, one can compute different mean yields depending on the
question you are trying to answer:

• If you are interested in the farmer, then there are 2 farmers and the mean is
(1 + 2)/2 = 1.5 t/ha.
• If the area is the unit of interest, then there are 5.5 ha and Farmer A is 10 times
as important as Farmer B, so a weighted mean is produced. Here we need to weight each
yield by the area it represents, hence the mean will be
Mean = (1 × 5 + 2 × 0.5)/5.5 ≈ 1.1 t/ha
Here the areas are the “weights”; they are used when different observations represent
different proportions of the “population”.

Example 3: A student is enrolled in a biology course where the final grade is determined based
on the following categories: tests 40%, final exam 25%, quizzes 25%, and homework 10%. The
student has earned the following scores for each category: tests-83%, final exam-75%, quizzes-
90%, and homework-100%. We need to calculate the student's overall grade.

To calculate a weighted average with percentages, each category value must first be multiplied
by its percentage. Then all of these new values must be added together. In this example, we
must multiply the student's average on all tests (83) by the percentage that the tests are worth
toward the final grade (40%). Please note that all percentages must be converted to decimals
before you multiply. Similarly, the final exam score (75) will be multiplied by 0.25 (25%). The
same will be true for both the quizzes (90 × 0.25) and homework (100 × 0.10). Thus, the
overall calculation would be (83 × 0.40) + (75 × 0.25) + (90 × 0.25) + (100 × 0.10) = 33.2 + 18.75
+ 22.5 + 10 = 84.45, or 84% if rounded down.
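
Both weighted means in this section, the area-weighted yield from Table 5.4 and the course grade in Example 3, follow the same rule: multiply each value by its weight, add the products and divide by the sum of the weights. A minimal Python sketch, with an illustrative function name, is:

```python
def weighted_mean(values, weights):
    """Sum of value x weight, divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(v * w for v, w in zip(values, weights)) / sum(weights)

# Table 5.4: yields (t/ha) weighted by area (ha).
mean_yield = weighted_mean([1, 2], [5, 0.5])

# Example 3: scores weighted by each category's share of the final grade.
grade = weighted_mean([83, 75, 90, 100], [0.40, 0.25, 0.25, 0.10])

print("Area-weighted mean yield (t/ha):", round(mean_yield, 2))
print("Overall course grade:", round(grade, 2))
```

Because the grade weights already add up to 1, dividing by their sum changes nothing in that case, whereas for the yields the division by the total area of 5.5 ha is essential.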

Exercise 5-1: A student has earned the following averages in his history course: tests-90%,
quizzes-88%, papers-85%, and homework-95%. The overall course grade is comprised of tests
(30%), quizzes (20%), final exam (20%), papers (20%) and homework (10%). What score must he
earn on the final exam in order to earn a final grade of at least 90%?
