Descriptive Statistics Module Overview
Department.
DESCRIPTIVE STATISTICS
Module for Diploma in Statistics
1.1 STATISTICS
1.2 DATA
Descriptive Statistics Module Developed by Fiskani J.M. Kondowe
3.1.2 Median
3.2.1 Range
4.1.2 Cross-tabulations
4.2.4 Drawing Histograms
5.1.2 Analyzing multiple response questions in SPSS
REFERENCES
PREFACE
Statistics is an important field of mathematics that is used to analyze, interpret, and predict
outcomes from data. Descriptive statistics will teach you the basic concepts used to describe
data. This is a great beginner course for those interested in Data Science, Economics, Psychology,
Machine Learning, Sports analytics and just about any other field.
This branch of statistics lays the foundation for all statistical knowledge, but it is not something
that you should learn simply so you can use it in the distant future. Descriptive statistics can be
used NOW, in English class, in physics class, in history, at the football stadium, in the grocery
store and in everything we do. Without statistics we couldn't plan our budgets, pay our taxes,
and evaluate classroom/office performance. Are you beginning to get the picture? We need
statistics!
This handbook is therefore designed for an introductory course in descriptive statistics and can
be used on its own or as a supplement to comparable texts in statistics. It can also serve as a
self-study guide for students wishing to pursue careers in Statistics.
ACKNOWLEDGEMENTS
I wish to thank the staff of the Mathematical Sciences Department, and the National Statistical
Office through the National Statistical Systems (NSS) and UNDP, for constructive criticism and for
the provision of donor funding to facilitate development of this module, respectively. Most
importantly, I thank the two institutions for entrusting me with the responsibility to carry out
this rigorous exercise.
I also wish to extend my gratitude to the module reviewers, friends and colleagues in the
department for their constructive criticism and input towards the development of this module. Your
input contributed significantly to the successful completion of this students’ handbook.
UNIT 1: BASIC STATISTICAL CONCEPTS
Objectives
By the end of this topic, students should be able to:
• Define statistics
• Recognize the type of data in a dataset
• Summarize different types of data
• Differentiate between descriptive and inferential statistics
• Recognize the proper type of scale to use for data
1.1 STATISTICS
1.1.1 What is Statistics?
Statistics is a group of methods used to collect, analyze, present and interpret data to make
decisions.
Statistical analysis is a process that involves identifying the questions of interest, collecting
and analyzing data, and producing a report. In real-life problems, the data collection and analysis
steps may be repeated more than once. Data is collected in order to shed light on some question of
importance to the engineer, biologist, climatologist, administrator, or other professional.
In many applications, the cycle of data collection and analysis is a central part in the quest for
improvement to systems and processes. The aim of statistics is to supply useful information to
people whose main area of expertise is not statistics. These people are not interested directly in
either data or statistical methods.
Inferential statistics consists of methods that use sample results to help make decisions or
predictions about a population. More on inferential statistics will be tackled in other courses.
Descriptive statistics consists of methods for organizing, displaying and describing data by using
tables, graphs and summary measures. This type of analysis is useful in drawing conclusions on
large data sets as they are presented in summary form. For instance, suppose we have
information on the test scores of students enrolled in a statistics class. In statistical terminology,
the whole set of numbers that represents the scores of students is called a data set, the name of
each student is called an element, and the score of each student is called an observation. A data
set in its original form is usually very large. Consequently, such a data set is not very helpful in
drawing conclusions or making decisions. It is easier to draw conclusions from summary tables
and diagrams than from the original version of a data set. In descriptive statistics, we reduce
data to a manageable size by constructing tables, drawing graphs, or calculating summary
measures such as averages.
1.2 DATA
1.2.1 Background
Data contains information, and statistics serves to extract this information. Data is the basic
commodity of statistics; without it, there is no information on which to reach conclusions or base
decisions. However, the information in data is often not immediately obvious, especially in large
data sets.
Large data sets must be summarized before patterns and relationships can be seen, because there is
usually too much noise in the raw data to see the information that they contain. As such,
statistical methods that use graphical and numerical summaries to highlight important features
of the data are needed to ensure that the highest precision is achieved. For smaller data sets, it
is less important to summarize the data; the problem is usually that there is not enough
information to get a clear answer to questions of importance.
1.2.2 Types of data
Primary and secondary data
Primary data is original data that has been collected specifically for the purpose in mind.
For example, when the National Statistical Office (NSO) collects data for use in the
Demographic Health Survey (DHS) report, it is considered to be primary data for the
DHS.
Secondary data is data that was collected for another purpose but is used to draw
conclusions about a question other than the one it was collected for. For example, when a
researcher uses data from the DHS report to answer a research question of his/her own
interest, the DHS data is considered secondary data.
Discrete data results when the number of possible values is either finite or countable
(that is, the number of possible values is 0 or 1 or 2 and so on). For example, the
number of eggs that hens lay is discrete data because it represents counts.
Continuous data results from infinitely many possible values that correspond to some
continuous scale that covers a range of values without gaps, interruptions or jumps. For
example, the ages of students in a class can assume any value over a continuous span.
Cross-sectional and time series data
Cross-sectional data is data collected on different elements or variables at the same point
in time or for the same period of time. For example, the types of TVs owned by Lilongwe
residents in one particular year (e.g. 2007) can be considered cross-sectional data, since
it is all collected for the same period (2007).
Time series data is data collected on the same element or the same variable at different
points in time or for different periods of time. For instance, data on the total number of
TVs owned by Lilongwe residents each year, collected yearly over a period of five years, is
an example of time series data.
Nominal scale is where categorical values and/or numbers are used as labels for groups or
classes. For example, if our data set comprises the countries Malawi, Zambia and Angola, we
may designate Malawi as 1, Zambia as 2 and Angola as 3. In this case, the numbers 1, 2 and 3
stand only for the category to which the data points belong and do not represent any order.
Ordinal scale represents an ordered series of relationships or rank order. For example, individuals
competing in a contest may achieve first, second, or third place; these positions represent ordinal
data. Likert-type scales (such as “On a scale of 1 to 10 with 1 being no pain and 10 being high
pain, how much pain are you in today?”) also represent ordinal data.
Interval scale represents quantity and has equal units, but zero on this scale is simply an
additional point of measurement rather than the absence of the quantity. The Fahrenheit scale is a
clear example of the interval scale of measurement; thus 60 degrees Fahrenheit and -10 degrees
Fahrenheit are interval data.
Ratio scale is similar to the interval scale in that it also represents quantity and has
equality of units. However, this scale also has an absolute zero (no numbers exist below zero).
Exercise 1.1
3. Explain the difference between cross-section and time-series data. Give an example of
each of these two types of data.
4. Classify the following as cross-section or time-series data.
i) Food bill of a family for each month of 2009.
ii) Number of armed robberies each year in Lilongwe from 1998 to 2009.
iii) Number of supermarkets in Malawian cities on December 31, 2009.
iv) Gross sales of 200 ice cream parlors in July 2009.
v) Average prices of houses in 100 cities.
vi) Salaries of 50 Airtel employees.
vii) Number of cars sold each year by Toyota Malawi from 1980 to 2009.
viii) Number of employees employed by Malawi Government each year from 1985 to
2009.
5. Briefly distinguish the following terms.
i) Quantitative and Qualitative data.
ii) Discrete and Continuous data.
6. Indicate which of the following variables are discrete or continuous.
i) Number of persons in a family
ii) Colors of cars
iii) Marital status of people
iv) Time to commute from home to work
v) Number of errors in a person’s bank statement
UNIT 2: GRAPHICAL SUMMARIES OF DATA
Objectives
• Identify the appropriate graphical summary and presentation for particular questions.
• Present data in a histogram and be able to interpret data when presented with a
histogram.
• Recognize the advantages and limitations of each method of presentation.
• Explain what can be gained and lost from data summary.
• Interpret graphical summaries to answer questions concerning proportions, extremes,
medians and quartiles for quantitative variables.
INTRODUCTION
The source of our statistical knowledge lies in the data. Once we obtain the sample data values,
one way to become acquainted with them is to display them in tables or graphically. Tables,
charts and graphs are very important tools in statistics because they communicate information
visually. These visual displays may reveal the patterns of behavior of the variables being studied.
Qualitative and quantitative data can be summarized differently in graphical form. This unit
focuses on graphical summaries of qualitative and quantitative data.
Table 2.1: Type of employment students intend to engage in
2.1.2 Relative frequency and percentage distributions
Relative frequency of a category is obtained by dividing the frequency of that category by the
sum of all frequencies. It shows what fractional part or proportion of the total frequency belongs
to the corresponding category.
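$$\text{Relative frequency of a category} = \frac{\text{Frequency of that category}}{\text{Sum of all frequencies}}$$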
Percentage for a category is obtained by multiplying the relative frequency of that category by
100. It shows what percentage of the total frequency belongs to a corresponding category.
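The two definitions above can be sketched in a few lines of Python. The responses below are hypothetical values invented for illustration (they are not the data of Table 2.3), and the function name `relative_frequencies` is likewise an assumption of this sketch.

```python
from collections import Counter

# Hypothetical qualitative responses, invented purely for illustration.
responses = ["Low", "High", "Medium", "Low", "Low", "Medium", "High", "Low"]

def relative_frequencies(values):
    """Return {category: (frequency, relative frequency, percentage)}."""
    counts = Counter(values)
    total = sum(counts.values())          # sum of all frequencies
    return {cat: (f, f / total, 100 * f / total) for cat, f in counts.items()}

table = relative_frequencies(responses)
for category, (f, rel, pct) in table.items():
    print(f"{category:<8}{f:>3}{rel:>8.3f}{pct:>8.1f}%")
```

The relative frequencies always sum to 1 and the percentages to 100, which is a quick check on any table built by hand.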
Example 2-2: Determine the relative frequency and percentage distributions on stress levels for
the data from table 2.3.
Solution
2.1.3 Graphical presentation of qualitative data
It is said that “a picture is worth a thousand words”. The same can be said of statistics, as
a graphical display can reveal at a glance the main characteristics of a dataset. Bar graphs and pie
charts are two types of graphs that are commonly used to display qualitative data.
Bar graph (bar chart) is a graph of bars whose heights represent the frequencies (or relative
frequencies/percentage frequencies) of respective categories.
Example 2-3: The data in Table 2.5 represent the percentages of price increases of some
consumer goods and services for the period December 1990 to December 2000 in Blantyre city.
Construct a bar chart for this data.
Table 2.5: Percentages of Price Increases of Some Consumer Goods and Services in Blantyre
Solution: Looking at Figure 2.1 we can identify where the maximum and minimum responses are
located, so that we can descriptively discuss the phenomenon whose behavior we want to
understand.
Figure 2.1: Bar chart of percentage price increases (vertical axis: Percentage; categories: MC, EI, RR, FD, CPI, A&U)
Example 2-4: Construct a pie chart for the combined percentages of carbon monoxide (CO) and
ozone (O3) emissions from different sources as listed in Table 2.6 below.
Exercise 2.1
1) Why do we need to group data in the form of a frequency table? Explain briefly.
2) How are the relative frequencies and percentages of categories obtained from the
frequencies of categories? Illustrate with the help of an example.
3) The following data give the results of a sample survey. The letters A, B, and C represent the
three categories.
A B B A C B C C C A C B C A C
C B C C A A B C C B C B A C A
2.2.1 Frequency distributions
A frequency distribution of quantitative data lists all classes and the number of values that belong
to each class. The data presented in the form of a frequency distribution is called grouped data.
A class is an interval that includes all the values that fall within two numbers called the upper and
lower limits. The classes represent a variable and are non-overlapping (each value belongs to one
and only one class).
Class boundary (real class limit) is given by the midpoint of the upper limit of one class and the
lower limit of the next class. In table 2.7 below, to find the mid-point of the upper limit of the
first class and the lower limit of the second class, we divide the sum of these two limits by 2.
Thus the class boundary is
$$\frac{1000 + 1001}{2} = 1000.5$$
The value 1000.5 is called the upper boundary of the first class and the lower boundary of the
second class.
Class width is the difference between the two boundaries of a class. Class mid-point (class mark)
is obtained by dividing the sum of the two limits of a class (or, equivalently, of its two
boundaries) by two.
Example 2-5: Construct a table showing the class boundaries, class midpoints, and class width for
the data in Table 2.7 on daily earnings of 100 employees at a large company. Note that the data
has already been grouped into classes with corresponding frequencies of employees falling into
each class given.
Solution: From Table 2.7, the values 801, 1001, 1201, 1401, 1601 and 1801 are the lower limits, and
the values 1000, 1200, 1400, 1600, 1800 and 2000 are the upper limits of the six classes.
Table 2.8 Class boundaries, class widths and class mid-points for table 2.7 data
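The boundaries, widths and mid-points for these six classes can be recomputed directly from the class limits quoted in the solution. The Python sketch below is one way to do it; the helper name `class_summary` and its `gap` parameter (the distance between one class's upper limit and the next class's lower limit, taken to be 1 here) are assumptions of this illustration.

```python
# Class limits of the six classes, as listed in the solution to Example 2-5.
lower_limits = [801, 1001, 1201, 1401, 1601, 1801]
upper_limits = [1000, 1200, 1400, 1600, 1800, 2000]

def class_summary(lower, upper, gap=1):
    """Return (lower boundary, upper boundary, width, midpoint) for one class.

    `gap` is the distance between an upper limit and the next lower limit
    (for example 1000 -> 1001); a boundary sits halfway across that gap.
    """
    lower_boundary = lower - gap / 2
    upper_boundary = upper + gap / 2
    width = upper_boundary - lower_boundary
    midpoint = (lower + upper) / 2
    return lower_boundary, upper_boundary, width, midpoint

rows = [class_summary(lo, hi) for lo, hi in zip(lower_limits, upper_limits)]
for lo, hi, (lb, ub, width, mid) in zip(lower_limits, upper_limits, rows):
    print(f"{lo}-{hi}: boundaries {lb} to {ub}, width {width}, midpoint {mid}")
```

For the first class this gives boundaries 800.5 and 1000.5, agreeing with the value 1000.5 derived earlier for the boundary between the first and second classes.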
2.2.2 Constructing frequency distribution tables
When constructing a frequency distribution table for quantitative data, the following need to be
considered:
Number of classes: Usually the number of classes varies from 5 to 20 depending on the number
of observations in the data set. It is recommended to have more classes with a large data set.
Class width: It is possible to have classes of different sizes. It is preferable, however, to have
the same width for all classes. As such, an approximate width can be calculated depending on the
number of classes one intends to have.
Example 2-6: The following data gives the total number of cell phones sold by Airtel on each of
the 30 days in a month. Construct a frequency distribution table.
8 25 11 15 29 22 10 5 17 21 23 14 19 27 14
22 13 26 16 18 12 9 26 20 16 23 20 16 16 21
Solution: In this dataset, the minimum value is 5, and the maximum value is 29. Suppose we
decide to group this data using five classes of equal width, then
$$\text{Class width} = \frac{29 - 5}{5} = 4.8$$
We can round this up to a convenient number, say five. Therefore, taking five as the lower limit
of the first class and a class width of five, the classes are 5-9, 10-14, 15-19, 20-24 and 25-29.
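The grouping step can be checked with a short Python sketch that counts how many of the 30 daily sales figures fall into each class. The function name `frequency_table` is an assumption of this illustration; the data are those listed in Example 2-6.

```python
# Daily cell phone sales from Example 2-6.
sales = [8, 25, 11, 15, 29, 22, 10, 5, 17, 21, 23, 14, 19, 27, 14,
         22, 13, 26, 16, 18, 12, 9, 26, 20, 16, 23, 20, 16, 16, 21]

def frequency_table(data, start, width, n_classes):
    """Return a list of (lower limit, upper limit, frequency) tuples."""
    table = []
    for k in range(n_classes):
        lower = start + k * width
        upper = lower + width - 1      # whole-number data, so limits are integers
        freq = sum(lower <= x <= upper for x in data)
        table.append((lower, upper, freq))
    return table

approx_width = (max(sales) - min(sales)) / 5   # (29 - 5) / 5, rounded up to 5 in the text
table = frequency_table(sales, start=5, width=5, n_classes=5)
for lower, upper, freq in table:
    print(f"{lower:>2}-{upper:<2}  {freq}")
```

Every observation lands in exactly one class, so the frequencies must add up to the 30 days in the data set.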
2.2.3 Relative frequency and percentage distributions
Relative frequency and percentage distributions are calculated in the same way as for
qualitative data in section 2.1.2.
Example 2-7: Calculate the relative frequencies and percentages for data in Table 2.9
Solution: The relative frequencies and percentages for cell phone sales are given below:
Table 2.11: Relative frequency and percentage distributions for cell phone sales
2.2.4 Graphing grouped data
Grouped (quantitative) data can be displayed using a histogram or polygon.
Histogram is a graph in which classes are marked on the horizontal axis and frequencies, relative
frequencies, or percentages are marked on the vertical axis. The frequencies, relative
frequencies or percentages are represented by heights of bars which are drawn adjacent to each
other without any gaps. Figure 2.3 shows the frequency histogram for data on cell phone sales.
Class limits have been used to mark classes on the horizontal axis. However, class boundaries can
also be used.
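Figure 2.3: Frequency histogram for cell phone sales (horizontal axis: Cell phones sold, with classes 5-9, 10-14, 15-19, 20-24 and 25-29; vertical axis: frequency)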
Note that similar graphs can be constructed for relative frequencies, percentage frequencies,
etc.
Polygon is a graph formed by joining the midpoints of the tops of successive bars in a histogram
with straight lines. In other words, it is a line graph constructed from the class midpoints and
frequencies for each class.
For a large dataset, as the number of classes is increased (and the width of classes decreased),
the frequency polygon eventually becomes a smooth curve and is called a frequency distribution
curve/frequency curve.
Solution: A polygon for cell phone sales, constructed using class midpoints, is given in Figure 2.4.
Cumulative relative frequencies are obtained by dividing the cumulative frequencies by the total
number of observations in the data set. Cumulative percentages are obtained by multiplying the
cumulative relative frequencies by 100.
Example 2-9: Using the data on cell phone sales from Example 2-6, prepare a cumulative frequency,
cumulative relative frequency and cumulative percentage distribution table for the number of cell
phones sold by the company.
Solution: With reference to the frequency table constructed in Example 2-6, the cumulative
frequency distribution table can be constructed as below:
Table 2.12: Cumulative frequency distribution table for Airtel cell phone sales
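As a cross-check on the hand calculation, the cumulative columns can be rebuilt in Python from the raw sales figures of Example 2-6. This is a minimal sketch; the variable names are assumptions of the illustration, and the classes 5-9 to 25-29 are those used for the histogram.

```python
from itertools import accumulate

# Daily cell phone sales from Example 2-6 and the five classes of width five.
sales = [8, 25, 11, 15, 29, 22, 10, 5, 17, 21, 23, 14, 19, 27, 14,
         22, 13, 26, 16, 18, 12, 9, 26, 20, 16, 23, 20, 16, 16, 21]
classes = [(5, 9), (10, 14), (15, 19), (20, 24), (25, 29)]

frequencies = [sum(lo <= x <= hi for x in sales) for lo, hi in classes]
cumulative = list(accumulate(frequencies))       # running total of frequencies
n = len(sales)
cum_relative = [c / n for c in cumulative]       # cumulative frequency / total
cum_percent = [100 * r for r in cum_relative]    # cumulative relative frequency * 100

for (lo, hi), c, r, p in zip(classes, cumulative, cum_relative, cum_percent):
    print(f"{lo:>2}-{hi:<2}  {c:>3}  {r:.3f}  {p:.1f}%")
```

The last cumulative frequency always equals the total number of observations, so the last cumulative percentage is 100.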
An ogive is a curve drawn for the cumulative frequency distribution by joining with straight lines
the dots marked above the upper boundaries of classes at heights equal to the cumulative
frequencies of respective classes.
To draw an ogive, the variable is marked on the horizontal axis and the cumulative frequencies
on the vertical axis. Then the dots are marked above the upper boundaries of various classes at
the heights equal to the corresponding cumulative frequencies. The ogive is obtained by joining
consecutive points with straight lines. Note that the ogive starts at the lower boundary of the
first class (where the cumulative frequency is zero) and ends at the upper boundary of the last
class. Because the upper boundary of each class is also the lower boundary of the next class, the
intermediate points can equally be described as lying above the lower boundaries of the remaining
classes.
Example: Construct an ogive for the cell phone sales data in table 2.11.
Solution: An ogive for the data in table 2.11 is constructed below. It is constructed using the
lower class boundaries and their respective cumulative frequencies.
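The coordinates of the points to be joined can be listed before any drawing is done. The Python sketch below follows the construction described earlier: one starting point at the lower boundary of the first class with cumulative frequency zero, then one point above each upper boundary. The function name `ogive_points` and the `gap` parameter (the distance between consecutive class limits, 1 for these whole-number data) are assumptions of this illustration.

```python
# Daily cell phone sales from Example 2-6 and the five classes of width five.
sales = [8, 25, 11, 15, 29, 22, 10, 5, 17, 21, 23, 14, 19, 27, 14,
         22, 13, 26, 16, 18, 12, 9, 26, 20, 16, 23, 20, 16, 16, 21]
classes = [(5, 9), (10, 14), (15, 19), (20, 24), (25, 29)]

def ogive_points(data, classes, gap=1):
    """Return (class boundary, cumulative frequency) pairs for an ogive."""
    # Start at the lower boundary of the first class, where nothing has accumulated yet.
    points = [(classes[0][0] - gap / 2, 0)]
    running_total = 0
    for lower, upper in classes:
        running_total += sum(lower <= x <= upper for x in data)
        points.append((upper + gap / 2, running_total))   # dot above the upper boundary
    return points

points = ogive_points(sales, classes)
for boundary, cum_freq in points:
    print(boundary, cum_freq)
```

Joining these points with straight lines gives the ogive; any plotting tool, or graph paper, can be used for the drawing itself.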
2.2.6 Stem and leaf displays
A stem-and-leaf plot is a simple way of summarizing quantitative data that is well suited to
computer applications. When data sets are relatively small, stem-and-leaf plots are particularly
useful. In a stem-and-leaf plot, each data value is split into a “stem” and a “leaf.” The “leaf” is
usually the last digit of the number and the other digits to the left of the “leaf” form the “stem.”
75 52 80 96 65 79 71 87 93 95 69 72 81 61 76
86 79 68 50 92 83 84 77 64 71 87 72 92 57 98
Table 2-13: Scores for college students
Solution: A stem and leaf display for college scores is given below
9 | 2 2 3 5 6 8
8 | 0 1 3 4 6 7 7
7 | 1 1 2 2 5 6 7 9 9
6 | 1 4 5 8 9
5 | 0 2 7
Figure 2.6: A stem and leaf display
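Since the text notes that stem-and-leaf plots are well suited to computer applications, here is a minimal Python sketch that builds the display for the 30 college scores. It assumes two-digit values, so that the tens digit is the stem and the units digit is the leaf; the function name `stem_and_leaf` is an assumption of this illustration.

```python
from collections import defaultdict

# Scores for college students (Table 2-13).
scores = [75, 52, 80, 96, 65, 79, 71, 87, 93, 95, 69, 72, 81, 61, 76,
          86, 79, 68, 50, 92, 83, 84, 77, 64, 71, 87, 72, 92, 57, 98]

def stem_and_leaf(values):
    """Map each stem (tens digit) to its leaves (units digits) in ranked order."""
    display = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)     # e.g. 75 -> stem 7, leaf 5
        display[stem].append(leaf)
    return dict(display)

display = stem_and_leaf(scores)
for stem in sorted(display, reverse=True):
    print(stem, "|", " ".join(str(leaf) for leaf in display[stem]))
```

Sorting the values first produces ranked leaves on every stem, which makes the median and the extremes easy to read off the display.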
A return period, also known as a recurrence interval (sometimes repeat interval), is an estimate of
the likelihood of an event, such as an earthquake, flood or a river discharge flow to occur. It is a
statistical measurement typically based on historic data denoting the average recurrence interval
over an extended period of time, and is usually used for risk analysis (e.g. to decide whether a
project should be allowed to go forward in a zone of a certain risk, or to design structures to
withstand an event with a certain return period).
Exercise 2.2
1) Briefly explain the concept of cumulative frequency distribution. How are the cumulative
relative frequencies and cumulative percentages calculated?
2) Explain for what kind of frequency distribution an ogive is drawn. Can you think of any use
for an ogive? Explain.
3) The following table gives the frequency distribution of the number of ATM cards possessed
by 80 adults.
Number of ATM cards    0 to 3    4 to 7    8 to 11    12 to 15    16 to 19
Number of adults         18        26        22         11           3
a) Prepare a cumulative frequency distribution.
b) Calculate the cumulative relative frequencies and cumulative percentages for all classes.
c) Find the percentage of these adults who possess 7 or fewer ATM cards.
d) Draw an ogive for the cumulative percentage distribution.
e) Using the ogive, find the percentage of adults who possess 10 or fewer ATM cards.
UNIT 3: NUMERICAL SUMMARIES OF DATA
Objectives
Background
In the previous section we looked at some graphical and tabular techniques for describing a data
set. We shall now consider some numerical characteristics of a dataset which can provide more
detailed information about important features of a distribution. These summaries are in
different groups like Measures of central tendency, measures of variability/dispersion and
measures of position/relative position, etc.
3.1.1 Mean
The mean, also known as arithmetic mean is the most frequently used measure of central
tendency. The mean is basically the average of a set of observations and is calculated differently
for grouped and ungrouped data. The mean calculated from the whole population data is
denoted $\mu$ and the mean calculated from sample data is denoted $\bar{x}$.
Mean for ungrouped data is obtained by dividing the sum of all values by the number of values in
the dataset.
Mean for population data: $$\mu = \frac{\sum x}{N}$$
Mean for sample data: $$\bar{x} = \frac{\sum x}{n}$$
Where $\sum x$ is the sum of all values, $N$ is the population size, $n$ is the sample size, $\mu$ is the
population mean and $\bar{x}$ is the sample mean.
Example 3-1: The following data give the time in months from hire to promotion for a random
sample of 25 software engineers from all software engineers employed by a large
telecommunications firm: 5, 7, 229, 453, 12, 14, 18, 14, 14, 483, 22, 21, 25, 23, 24, 34, 37, 34,
49, 64, 47, 67, 69, 192 and 125. Calculate the mean.
Solution: Since it’s a random sample, we use the sample mean formula to calculate the mean
where n=25.
$$\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} = 83.28 \text{ months}$$
We can conclude that at this company, on average, it takes a software engineer 83.28 months to
get a promotion.
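The arithmetic in Example 3-1 can be verified with a few lines of Python. This is only a sketch of the sample mean formula applied to the 25 values given in the example; the variable names are assumptions of the illustration.

```python
# Months from hire to promotion for the 25 sampled engineers (Example 3-1).
months = [5, 7, 229, 453, 12, 14, 18, 14, 14, 483, 22, 21, 25, 23, 24,
          34, 37, 34, 49, 64, 47, 67, 69, 192, 125]

total = sum(months)          # sum of all sample values
n = len(months)              # sample size
sample_mean = total / n      # x-bar = (sum of x) / n

print(f"sum = {total}, n = {n}, mean = {sample_mean:.2f} months")
```

Note how far the mean sits from most of the values: only a handful of very long waits (229, 453, 483 months) pull it up, which anticipates the later discussion of outliers.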
Example 3-2: The following are the ages (in years) of all eight employees of a small company: 53,
32, 61, 27, 39, 44, 49 and 57. Find the mean age of these employees.
Solution: Since the given data set includes all eight employees of the company, it represents the
population hence N=8.
$$\mu = \frac{\sum_{i=1}^{N} x_i}{N} = 45.25 \text{ years}$$
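Mean for grouped data is calculated from the class midpoints and class frequencies.
Mean for population data: $$\mu = \frac{\sum mf}{N}$$
Mean for sample data: $$\bar{x} = \frac{\sum mf}{n}$$
Where $N$ is the population size, $n$ is the sample size, $m$ is the class midpoint and $f$ is the
frequency of a class.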
Example 3-3: The grouped data in Table 3.1 represent the number of children from birth through
the end of the teenage years in a large apartment complex. Find the mean for the data:
Class      f      m       mf
0-3        7     1.5     10.5
4-7        4     5.5     22
8-11      19     9.5    180.5
12-15     12    13.5    162
16-19      8    17.5    140
        n = 50          $\sum mf = 515$
Sample mean: $$\bar{x} = \frac{\sum_{i=1}^{n} m_i f_i}{n} = \frac{515}{50} = 10.30$$
We can conclude that the average age of children in the apartment complex is about 10.3 years.
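The grouped-data calculation of Example 3-3 can be reproduced in Python by computing each class midpoint, multiplying by its frequency and dividing the total by the number of observations. This is a minimal sketch using the classes and frequencies of Table 3.1; the variable names are assumptions of the illustration.

```python
# Classes (age in years) and frequencies from Example 3-3.
classes = [(0, 3), (4, 7), (8, 11), (12, 15), (16, 19)]
frequencies = [7, 4, 19, 12, 8]

midpoints = [(lower + upper) / 2 for lower, upper in classes]   # m
products = [m * f for m, f in zip(midpoints, frequencies)]      # mf
n = sum(frequencies)                                            # sample size
grouped_mean = sum(products) / n                                # sum(mf) / n

for (lower, upper), f, m, mf in zip(classes, frequencies, midpoints, products):
    print(f"{lower}-{upper}: f={f}, m={m}, mf={mf}")
print(f"n = {n}, sum(mf) = {sum(products)}, mean = {grouped_mean}")
```

Because every value in a class is represented by the class midpoint, the grouped mean is an approximation to the mean of the raw ages.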
included in that sample. Sometimes a data set may contain a few very small or a few very large
values (outliers/extreme values). A major shortcoming of the mean as a measure of central
tendency is that it is very sensitive to outliers. Example 3–4 illustrates this point.
Example 3-4: Table 3.3 lists the money (in million Kwachas) spent by six Malawian television
stations in 2014.
Notice that MBC spent much more money compared to the other stations hence it is an outlier.
As such, we will show how the inclusion of this outlier affects the value of the mean.
Solution: If we do not include the expenditure for MBC (the outlier), the mean expenditure of the
five TV stations is:
Now, to see the impact of the outlier on the value of the mean, we include the expenditure of
MBC and find the mean expenditure of the six stations. This mean is:
Thus, including the MBC expenditure causes more than a threefold increase in the value of the
mean, which changes from 22.1 million to 74.73 million Kwachas.
The preceding example should encourage us to be cautious. We should remember that the
mean is not always the best measure of central tendency because it is heavily influenced by
outliers. Sometimes other measures of central tendency give a more accurate impression of a
data set. For example, when a data set has outliers, instead of using the mean, we can use either
the trimmed mean or the median as a measure of central tendency.
Trimmed mean is calculated by dropping a certain percentage of values from each end of a
ranked data set. The trimmed mean is especially useful as a measure of central tendency when a
data set contains a few outliers at each end. Suppose the following data give the ages (in years)
of 10 employees of a company: 47, 53, 38, 26, 39, 49, 19, 67, 31 and 23. To calculate the 10%
trimmed mean, first rank these data values in increasing order; then drop 10% of the smallest
values and 10% of the largest values. The mean of the remaining 80% of the values will give
the 10% trimmed mean. Note that this data set contains 10 values, and 10% of 10 is 1. Thus, if we
drop the smallest value and the largest value from this data set, the mean of the remaining 8
values will be called the 10% trimmed mean.
$$\text{Mean} = \frac{47 + 53 + 38 + 26 + 39 + 49 + 19 + 67 + 31 + 23}{10} = 39.2$$
The ranked data set is: 19, 23, 26, 31, 38, 39, 47, 49, 53 and 67. Hence the 10% trimmed mean
will be given as:
$$10\% \text{ trimmed mean} = \frac{23 + 26 + 31 + 38 + 39 + 47 + 49 + 53}{8} = 38.25$$
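The trimming procedure just described (rank, drop the same share from each end, average the rest) can be written as a short Python function. This is a sketch under the assumption that the percentage is given as a whole number and that the number of values to drop is rounded down; the name `trimmed_mean` is an assumption of the illustration.

```python
# Ages (in years) of the 10 employees from the trimmed mean discussion.
ages = [47, 53, 38, 26, 39, 49, 19, 67, 31, 23]

def trimmed_mean(values, percent):
    """Mean after dropping `percent`% of the values from each end of the ranked data."""
    ranked = sorted(values)
    k = len(ranked) * percent // 100           # values dropped per end (rounded down)
    kept = ranked[k:len(ranked) - k]
    return sum(kept) / len(kept)

plain_mean = sum(ages) / len(ages)
ten_percent_trimmed = trimmed_mean(ages, 10)

print(f"mean = {plain_mean}, 10% trimmed mean = {ten_percent_trimmed}")
```

With a trimming percentage of 0 the function returns the ordinary mean, so the two measures can be compared using the same code.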
3.1.2 Median
The median is the value of the middle term in a data set that has been ranked in increasing
order. If the number of observations is odd, the median is given by the middle term in the ranked
data. If the number of observations is even, then the median is given by the average of the values
of the two middle terms.
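Example 3-5: The following data gives the prices (in millions of Kwachas) of 5 houses sold by an
estate agent: 34, 50, 12, 8 and 15. Find the median.
Solution: The ranked data are 8, 12, 15, 34 and 50. The middle (third) term is 15, so the median
price is 15 million Kwachas.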
Example 3-6: Table 3.4 below gives 2014 profits (rounded to billions of Kwachas) for banks in
Malawi. Find the median.
Table 3.4: 2014 profits for selected Malawian banks
24 36 52 57
The median gives the center of a histogram, with half of the data values to the left of the median
and half to the right of the median. The advantage of using the median as a measure of central
tendency is that it is not influenced by outliers. Consequently, the median is preferred over the
mean as a measure of central tendency for data sets that contain outliers.
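The odd and even cases of the median rule can be captured in one small Python function. The sketch below applies it to the five house prices of Example 3-5 (odd number of values) and, for the even case, to the ten employee ages used earlier for the trimmed mean; the function name `median` is an assumption of the illustration.

```python
def median(values):
    """Middle term of the ranked data, or the average of the two middle terms."""
    ranked = sorted(values)
    n = len(ranked)
    mid = n // 2
    if n % 2 == 1:                                 # odd number of observations
        return ranked[mid]
    return (ranked[mid - 1] + ranked[mid]) / 2     # even number of observations

house_prices = [34, 50, 12, 8, 15]                      # Example 3-5, n = 5
ages = [47, 53, 38, 26, 39, 49, 19, 67, 31, 23]         # trimmed mean data, n = 10

print(median(house_prices))
print(median(ages))
```

Replacing the largest price with a far larger value leaves the median unchanged, which is the resistance to outliers described above.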
3.1.3 Mode
The mode is the value that occurs with the highest frequency in a dataset. A dataset can have no
mode, one mode (unimodal dataset), two modes (bimodal dataset) or more than two modes
(multimodal dataset). In bimodal and multimodal datasets, two values and more than two values,
respectively, occur with the same highest frequency.
Example 3-7: In 2015, maize yields for five selected families were 200kg, 315kg, 400kg, 200kg and
178kg. Find the mode.
Solution: Mode is 200kg as it is the only value occurring with the highest frequency in the
dataset.
Example 3-8: The ages of 10 randomly selected students are 21, 15, 16, 19, 21, 19, 14, 25, 26 and
25 years. Find the mode.
Solution: The data set has three modes: 21, 19 and 25, each with a (highest) frequency of 2.
Note that a major shortcoming of the mode is that a data set may have none or may have more
than one mode, whereas it will have only one mean and only one median. For instance, a data
set with each value occurring only once has no mode, so the mode cannot be used to draw any
numerical conclusions about it.
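The three situations described for the mode (one mode, several modes, no mode) can all be handled by a single Python function that returns a list. This is a sketch following the convention stated above that a data set in which every value occurs once has no mode; the function name `modes` is an assumption of the illustration.

```python
from collections import Counter

def modes(values):
    """Return all values sharing the highest frequency, or [] if no value repeats."""
    counts = Counter(values)
    highest = max(counts.values())
    if highest == 1:                   # every value occurs once: no mode
        return []
    return sorted(v for v, c in counts.items() if c == highest)

maize_yields = [200, 315, 400, 200, 178]                    # Example 3-7 (kg)
student_ages = [21, 15, 16, 19, 21, 19, 14, 25, 26, 25]     # Example 3-8 (years)

print(modes(maize_yields))
print(modes(student_ages))
```

Returning a list rather than a single number makes the shortcoming explicit: the result may be empty or may contain several values.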
Exercise 3.1
1) Explain how the value of the median is determined for a data set that contains an odd
number of observations and for a data set that contains an even number of observations.
2) Briefly explain the meaning of an outlier. Is the mean or the median a better measure of
central tendency for a data set that contains outliers? Illustrate with the help of an
example.
3) Using an example, show how outliers can affect the value of the mean.
4) The following data give the numbers of car thefts that occurred in a city during the past
14 days: 6, 3, 7, 1, 1, 4, 3, 8, 7, 2, 6, 9, 1 and 5.
a) Find the mean, median, and mode of the car thefts during the 14 days.
b) Find the mean car thefts per day during the 14 days.
3.2.1 Range
The range of a set of observations is the difference between the largest and smallest
observation/value of a data set. It is the simplest measure of dispersion to calculate.
Example 3-9: National Bank of Malawi registered profits (in billions of kwacha) of 60, 34, 50, 46 and 37 for the years 2001, 2002, 2003, 2004 and 2005 respectively. Find the range of profits for the given years.
Solution: The maximum profit is 60 billion kwacha and the minimum profit is 34 billion kwacha. Therefore
$$\text{Range} = \text{largest value} - \text{smallest value} = 60 - 34 = 26$$
Thus the range of the profits over the five years is 26 billion kwacha.
Note that the range, like the mean, has the disadvantage of being influenced by outliers hence
the range is not a good measure of dispersion to use for a data set that contains outliers.
Another disadvantage of using the range as a measure of dispersion is that its calculation is
based on two values only (the largest and the smallest); all other values in a data set are ignored
when calculating the range.
The value of the standard deviation tells us how closely the values of a data set are clustered
around the mean. In general, a lower value of the standard deviation for a data set indicates that
the values of a data set are spread over a relatively smaller range around the mean while a larger
value indicates that the values are spread over a relatively larger range around the mean.
The variance calculated from population data is denoted by $\sigma^2$, and the variance from sample data is denoted by $s^2$. Similarly, the standard deviation calculated from population data is denoted by $\sigma$, and the standard deviation calculated from sample data is denoted by $s$. The standard deviation is the positive square root of the variance.
Population variance:
$$\sigma^2 = \frac{\sum_{i=1}^{N}(x_i - \mu)^2}{N} = \frac{\sum x^2 - \frac{(\sum x)^2}{N}}{N}$$
Sample variance:
$$s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1} = \frac{\sum x^2 - \frac{(\sum x)^2}{n}}{n-1}$$
Where $\mu$ is the population mean, $\bar{x}$ is the sample mean, $N$ is the population size and $n$ is the sample size. The quantities $(x - \mu)$ and $(x - \bar{x})$ in the above formulae are called the deviations of the $x$ value from the mean. Note that the sum of the deviations of the $x$ values from the mean is always zero; that is, $\sum_{i=1}^{N}(x_i - \mu) = 0$ and $\sum_{i=1}^{n}(x_i - \bar{x}) = 0$.
Example 3-10: The following data give the time in months from hire to promotion for a random sample of 25 software engineers from all software engineers employed by a large telecommunications firm: 5, 7, 229, 453, 12, 14, 18, 14, 14, 483, 22, 21, 25, 23, 24, 34, 37, 34, 49, 64, 47, 67, 69, 192 and 125. Calculate the variance and standard deviation.
Solution: This is sample data, so we calculate the sample variance, with $n = 25$ (so $n - 1 = 24$) and $\bar{x} = 83.28$:
$$s^2 = \frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1} = \frac{1}{24}\left[(5 - 83.28)^2 + (7 - 83.28)^2 + \dots + (125 - 83.28)^2\right] \approx 16478 \text{ months}^2$$
The standard deviation is therefore $s = \sqrt{16478} \approx 128.4$ months.
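As a check on this example, the following Python sketch computes the sample variance in both the definitional form and the shortcut form, and confirms that the deviations from the mean sum to zero (up to floating-point rounding). The variable names are illustrative; `mean` and `variance` come from Python's standard `statistics` module.

```python
from math import sqrt
from statistics import mean, variance

# Months from hire to promotion for the 25 sampled engineers (Example 3-10).
months = [5, 7, 229, 453, 12, 14, 18, 14, 14, 483, 22, 21, 25,
          23, 24, 34, 37, 34, 49, 64, 47, 67, 69, 192, 125]

n = len(months)
x_bar = mean(months)

# Deviations of each value from the sample mean; these always sum to zero.
deviations = [x - x_bar for x in months]

# Sample variance from the definition and from the shortcut formula.
definition_form = sum(d ** 2 for d in deviations) / (n - 1)
shortcut_form = (sum(x ** 2 for x in months) - sum(months) ** 2 / n) / (n - 1)

print(n, x_bar)                           # 25 83.28
print(abs(sum(deviations)) < 1e-9)        # True
print(round(definition_form))             # 16478
print(round(sqrt(definition_form), 1))    # 128.4
```

The library function `variance(months)` uses the same `n - 1` divisor and should agree with both hand-computed forms.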
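Variance and standard deviation for grouped data
The following are the basic formulae used to calculate variance for grouped data, where $f$ is the frequency of a class and $m$ is the midpoint of that class.
Population variance:
$$\sigma^2 = \frac{\sum f(m - \mu)^2}{N} = \frac{\sum m^2 f - \frac{(\sum mf)^2}{N}}{N}$$
Sample variance:
$$s^2 = \frac{\sum f(m - \bar{x})^2}{n-1} = \frac{\sum m^2 f - \frac{(\sum mf)^2}{n}}{n-1}$$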
Example 3-11: The grouped data below represent the number of children, from birth through the end of the teenage years, in a large apartment complex. Find the variance for the data:
Class       0-3   4-7   8-11   12-15   16-19
Frequency   7     4     19     12      8
Solution:
Class    f      m      mf      m²f
0-3      7      1.5    10.5    15.75
4-7      4      5.5    22      121
8-11     19     9.5    180.5   1714.75
12-15    12     13.5   162     2187
16-19    8      17.5   140     2450
         n=50          ∑mf = 515   ∑m²f = 6488.5
Table 3.5: Frequency table
$$s^2 = \frac{\sum m^2 f - \frac{(\sum mf)^2}{n}}{n-1} = \frac{6488.5 - \frac{(515)^2}{50}}{49} = 24.16$$
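The grouped-data calculation can be reproduced with a small Python sketch. The function name `grouped_sample_variance` and the `(lower limit, upper limit, frequency)` layout are illustrative choices; class midpoints are taken as the average of the two class limits, as in Table 3.5.

```python
# (lower limit, upper limit, frequency) for each class in Example 3-11.
classes = [(0, 3, 7), (4, 7, 4), (8, 11, 19), (12, 15, 12), (16, 19, 8)]


def grouped_sample_variance(classes):
    """Sample variance for grouped data using the shortcut formula."""
    n = sum(f for _, _, f in classes)
    # m = (lower + upper) / 2 is the class midpoint.
    sum_mf = sum((lo + hi) / 2 * f for lo, hi, f in classes)
    sum_m2f = sum(((lo + hi) / 2) ** 2 * f for lo, hi, f in classes)
    return (sum_m2f - sum_mf ** 2 / n) / (n - 1)


s2 = grouped_sample_variance(classes)
print(round(s2, 2))  # 24.16
```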
The coefficient of variation (denoted CVar) is a statistic that allows you to compare standard deviations when the units are different. It is given as a percentage using the formulae below:
For samples: $$\text{CVar} = \frac{s}{\bar{x}} \times 100\%$$
For populations: $$\text{CVar} = \frac{\sigma}{\mu} \times 100\%$$
Example 3-12: The mean of the number of sales of cars over a 3-month period is 87, and the standard deviation is 5. The mean of the commissions is MK5225000, and the standard deviation is MK773000. Compare the variations of the two.
$$\text{CVar (Sales)} = \left(\frac{5}{87}\right) \times 100 = 5.7\%$$
$$\text{CVar (Commissions)} = \left(\frac{773000}{5225000}\right) \times 100 = 14.8\%$$
The coefficient of variation is larger for commissions, hence the commissions are more variable
than the sales.
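A minimal Python sketch of this comparison is given below; the function name `coefficient_of_variation` is an illustrative choice, not part of any library.

```python
def coefficient_of_variation(std_dev, mean):
    """Coefficient of variation expressed as a percentage."""
    return std_dev / mean * 100


cv_sales = coefficient_of_variation(5, 87)
cv_commissions = coefficient_of_variation(773000, 5225000)

print(round(cv_sales, 1))         # 5.7
print(round(cv_commissions, 1))   # 14.8
print(cv_commissions > cv_sales)  # True
```

Because the units cancel in the ratio, the two percentages can be compared directly even though one is a count of cars and the other an amount in kwacha.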
3.3 MEASURES OF POSITION
A measure of position determines the position of a single value in relation to other values in a
sample or a population data set. In this section we will discuss some of the measures of position.
3.3.1 Quartiles, Inter quartile range
Quartiles are three summary measures that divide a ranked data set into four equal parts/quarters. These three summary measures are the first quartile (denoted by $Q_1$), the second quartile (denoted by $Q_2$) and the third quartile (denoted by $Q_3$). Note that the data should be ranked in increasing order before the quartiles are determined.
The first quartile is basically the value of the middle term among the observations that are less
than the median, and the third quartile is the value of the middle term among the observations
that are greater than the median.
Each of the portions in the figure below contains 25% of the observations of the data set arranged in increasing order. The portions are separated by the three quartiles ($Q_1$, $Q_2$ and $Q_3$).
From figure 3.1, we can see that approximately 25% of the values are less than $Q_1$ and 75% are greater than $Q_1$, and approximately 75% of the data values are less than $Q_3$ and 25% are greater than $Q_3$.
Inter quartile range (IQR) is the difference between the third quartile and the first quartile of a data set.
$$IQR = Q_3 - Q_1$$
Percentiles are the summary measures that divide a ranked data set (in increasing order) into 100 equal parts. Each data set has 99 percentiles, which divide it into 100 equal parts.
The first quartile is the 25th percentile and is often called the lower quartile, the second quartile is the 50th percentile and is often called the middle quartile, and the third quartile is the 75th percentile and is often called the upper quartile. The kth percentile is denoted $P_k$, where k is an integer in the range 1-99. For instance, the 25th percentile is denoted by $P_{25}$. Figure 3.2 shows the positions of the 99 percentiles.
[Figure 3.2: Positions of the 99 percentiles $P_1, P_2, P_3, \dots, P_{97}, P_{98}, P_{99}$, each portion containing 1% of the ranked data]
$$P_k = \text{value of the } \left(\frac{kn}{100}\right)\text{th term in a ranked data set}$$
Where k denotes the number of the percentile and n represents the sample size.
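The percentile rank of a particular value $x_i$ gives the percentage of values in the data set that are less than $x_i$ and is given as:
$$\text{Percentile rank of } x_i = \frac{\text{number of values less than } x_i}{\text{total number of values in the data set}} \times 100\%$$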
Deciles are summary measures that divide a ranked data set (in increasing order) into 10 equal groups. Note that the first decile ($D_1$) corresponds to $P_{10}$; the second decile ($D_2$) corresponds to $P_{20}$; etc. Deciles can be found by using the formulas given for percentiles.
The relationships among percentiles, deciles, and quartiles are summarized as follows: Deciles are denoted by $D_1, D_2, D_3, \dots, D_9$, corresponding to $P_{10}, P_{20}, P_{30}, \dots, P_{90}$ respectively. Quartiles are denoted by $Q_1, Q_2, Q_3$, corresponding to $P_{25}, P_{50}, P_{75}$ respectively. The median is the same as $P_{50}$ or $Q_2$ or $D_5$.
Example 3-13: Table 3.6 below gives 2014 expenditures (rounded to millions of kwacha) for 12 private secondary schools in Malawi.
Peter Pan 13
Perke Boys 9
Michiru Boys 11
a) Find the values of the three quartiles.
b) Where does the 2014 expenditure for Bishop Mackenzie fall in relation to these quartiles?
c) Find the inter quartile range.
d) Find the value of the 42nd percentile and give a brief interpretation of the 42nd percentile.
e) Find the percentile rank for the 14 million kwacha expenditure of New Era Private School.
Solution: First rank the expenditures in increasing order:
Position of value   1   2   3   4    5    6    7    8    9    10   11   12
Value               7   8   9   10   11   12   13   13   14   17   17   45
$Q_1$ lies between positions 3 and 4, $Q_2$ between positions 6 and 7, and $Q_3$ between positions 9 and 10.
a) From the ranking, we note that $Q_1$ falls between the values 9 and 10, $Q_2$ falls between the values 12 and 13, and $Q_3$ falls between the values 14 and 17. Therefore the quartiles will be as follows:
$$Q_1 = \frac{9 + 10}{2} = 9.5, \quad Q_2 = \frac{12 + 13}{2} = 12.5, \quad Q_3 = \frac{14 + 17}{2} = 15.5$$
$Q_1$ = 9.5 million kwacha indicates that 25% of the schools in this sample spent less than 9.5 million kwacha and 75% of the schools spent more than 9.5 million kwacha.
$Q_2$ = 12.5 million kwacha indicates that half of the schools spent less than 12.5 million kwacha while the other half spent more than 12.5 million kwacha.
$Q_3$ = 15.5 million kwacha indicates that 75% of the schools spent less than 15.5 million kwacha while 25% of the schools spent more than 15.5 million kwacha.
b) The expenditure for Bishop Mackenzie falls below the lower quartile.
c) $IQR = Q_3 - Q_1 = 15.5 - 9.5 = 6$ million kwacha.
d) Using the data arranged in increasing order, the position of the 42nd percentile is:
$$\frac{kn}{100} = \frac{42 \times 12}{100} = 5.04\text{th term}$$
The value of the 5.04th term can be approximated by the value of the 5th term in the ranked data. Therefore $P_{42}$ = 42nd percentile = 11 million kwacha. This means that approximately 42% of the schools spent 11 million kwacha or less.
e) Percentile rank of 14 $= \frac{8}{12} \times 100 = 66.67\%$
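The quartile, percentile and percentile-rank calculations in this example can be sketched in Python as follows. The function names are illustrative, and the sketch follows this module's conventions: quartiles are the medians of the lower and upper halves of the ranked data, a percentile is read off at the nearest whole position to kn/100, and the percentile rank counts values strictly less than the given value. Statistical packages often use other interpolation rules and may give slightly different answers.

```python
from statistics import median

# Ranked 2014 expenditures (millions of kwacha) from Example 3-13.
expenditures = sorted([7, 8, 9, 10, 11, 12, 13, 13, 14, 17, 17, 45])


def quartiles(ranked):
    """Q1, Q2, Q3 as medians of the lower half, all data, and upper half."""
    n = len(ranked)
    half = n // 2
    lower = ranked[:half]
    upper = ranked[half + (n % 2):]  # skip the middle value when n is odd
    return median(lower), median(ranked), median(upper)


def percentile_value(ranked, k):
    """Approximate P_k by the term nearest to position kn/100."""
    position = k * len(ranked) / 100
    index = max(1, round(position))  # positions are counted from 1
    return ranked[index - 1]


def percentile_rank(values, x):
    """Percentage of values strictly less than x."""
    return sum(1 for v in values if v < x) / len(values) * 100


q1, q2, q3 = quartiles(expenditures)
print(q1, q2, q3)                                   # 9.5 12.5 15.5
print(q3 - q1)                                      # 6.0
print(percentile_value(expenditures, 42))           # 11
print(round(percentile_rank(expenditures, 14), 2))  # 66.67
```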
$$\alpha_3 = \frac{\sum_{i=1}^{N}\left(\frac{x_i - \mu}{\sigma}\right)^3}{N}$$
$\alpha_3 = 0$ implies zero skewness, $\alpha_3 < 0$ implies negative skewness and $\alpha_3 > 0$ implies positive skewness. Note that two distributions can have the same mean, variance and skewness but could still be significantly different in shape, hence the need to look at kurtosis.
$$\alpha_4 = \frac{\sum_{i=1}^{N}\left(\frac{x_i - \mu}{\sigma}\right)^4}{N}$$
A negative relative kurtosis ($\alpha_4 < 3$) implies a flatter distribution than the normal distribution and is called platykurtic. A positive relative kurtosis ($\alpha_4 > 3$) implies a more peaked distribution than the normal distribution and is called leptokurtic. Zero relative kurtosis ($\alpha_4 = 3$) implies that a distribution has the same kurtosis as a normal distribution and is called mesokurtic.
When analysis is done in a statistical package, check which convention the package uses. The interpretation below is for absolute kurtosis; SPSS reports excess kurtosis (absolute kurtosis minus 3), so for SPSS output the same interpretation applies with 0 in place of 3:
• Kurtosis > 3: Distribution is more sharply peaked than a normal distribution, with thicker tails and most values concentrated around the mean, which implies a higher probability of extreme values.
• Kurtosis < 3: Distribution is flatter than a normal distribution with a wider peak. The probability of extreme values is less than for a normal distribution, and the values are spread more widely around the mean.
• Kurtosis = 3: Distribution has the same kurtosis as a normal distribution.
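The moment-based skewness and kurtosis formulas above can be sketched in Python as follows. This is a minimal illustration using the population forms (dividing by N); the function name is an assumption, and packages such as SPSS apply sample-size corrections, so their output will generally differ from these values.

```python
from math import sqrt


def moment_skewness_kurtosis(values):
    """Return (skewness, absolute kurtosis) using population moments."""
    n = len(values)
    mu = sum(values) / n
    sigma = sqrt(sum((x - mu) ** 2 for x in values) / n)
    # Standardised deviations (x - mu) / sigma.
    z = [(x - mu) / sigma for x in values]
    skewness = sum(v ** 3 for v in z) / n
    kurtosis = sum(v ** 4 for v in z) / n
    return skewness, kurtosis


# A symmetric data set: skewness is zero and kurtosis is below 3.
sym_skew, sym_kurt = moment_skewness_kurtosis([1, 2, 3, 4, 5])

# The school expenditures from Example 3-13, where the value 45 gives a
# long right tail: skewness is positive and kurtosis is above 3.
exp_skew, exp_kurt = moment_skewness_kurtosis(
    [7, 8, 9, 10, 11, 12, 13, 13, 14, 17, 17, 45])

print(abs(sym_skew) < 1e-9, round(sym_kurt, 2))  # True 1.7
print(exp_skew > 0, exp_kurt > 3)                # True True
```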
3.4.2 Box-and-whisker plots
These give a graphic presentation of data using five measures: the median, the first quartile, the
third quartile and the smallest and largest values in the dataset between the lower and upper
inner fences. It can help us visualize the center, spread and the skewness of a data set. It also
helps in detecting outliers.
Example: The following data are weights (in kg) of diabetic patients at a central hospital.
Construct a box-and-whisker plot for these data:
Solution
Step 1: First, rank the data in increasing order and calculate the value of the median, first quartile, third quartile and the interquartile range. After ranking the data, these values are given as: Median = 87, $Q_1$ = 77, $Q_3$ = 101, $IQR$ = 24.
Step 2: Find the points that are $1.5 \times IQR$ below $Q_1$ and $1.5 \times IQR$ above $Q_3$. These two points are called the lower and upper inner fences respectively. Here $1.5 \times 24 = 36$, so the lower inner fence is $77 - 36 = 41$ and the upper inner fence is $101 + 36 = 137$.
Step 3: Determine the smallest and largest values in the given dataset within the two inner fences. Smallest value = 69, Largest value = 112.
Step 4: Draw a horizontal line and mark the weight levels on it such that all the values in the given data set are covered. Above the horizontal line, draw a box with its left side at the position of the first quartile and the right side at the position of the third quartile. Inside the box, draw a vertical line at the position of the median. Then, by drawing two lines, join the points of the smallest and largest values within the two inner fences to the box. These two lines are called whiskers. Note that a value that falls outside the two inner fences is called an outlier. Below is a box-and-whisker plot for the data in this example.
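The fence calculation in Step 2 and the outlier rule in Step 4 can be sketched in Python. The quartiles 77 and 101 are taken from Step 1; the function names and the short list of weights passed to `outliers` are hypothetical and serve only to show how values outside the fences are flagged.

```python
def inner_fences(q1, q3):
    """Lower and upper inner fences, 1.5 x IQR beyond the quartiles."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr


def outliers(values, q1, q3):
    """Values that fall outside the two inner fences."""
    lower, upper = inner_fences(q1, q3)
    return [v for v in values if v < lower or v > upper]


print(inner_fences(77, 101))  # (41.0, 137.0)

# Hypothetical weights: 69 and 112 lie inside the fences, 150 and 30 do not.
print(outliers([69, 112, 150, 30], 77, 101))  # [150, 30]
```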
Activity: Construct a box plot for the example above and show which values are outliers
Exercise 3.2
1) Briefly describe how the three quartiles are calculated for a data set. Illustrate by
calculating the three quartiles for two examples, the first with an odd number of
observations and the second with an even number of observations.
2) Explain how the inter quartile range is calculated. Give one example.
3) Briefly describe how the percentiles are calculated for a data set.
4) Explain the concept of the percentile rank for an observation of a data set.
5) The following data give the weights (in kg) lost by 15 members of a health club at the
end of 2 months after joining the club.
5, 10, 8, 7, 25, 12, 5, 14, 11, 10, 21, 9, 8, 11, 18
i) Compute the values of the three quartiles and the inter quartile range.
ii) Calculate the (approximate) value of the 82nd percentile.
iii) Find the percentile rank of 10.
UNIT 4: INTRODUCTION TO DESCRIPTIVE ANALYSIS USING SPSS
Objectives
Introduction
This topic will introduce students to working with SPSS to perform simple descriptive analyses
like: Formation of tables, bar charts, histograms, etc. This section assumes that students have
entered/already have data in SPSS. As such we will use some pre-loaded datasets in SPSS to go
through the practical. Throughout the chapter, the upward pointing arrow (↑) is an
instruction for the learner to click on/open the icon/command that follows it.
We will also learn how to summarize categorical data in tabular form as well as how to format
the tabulations to one’s preference. Summarizing the data this way offers a nice visual aid for
seeing the picture better. In the next sections, we shall continue with this philosophy and see
how we can visually present the data further using graphical techniques.
4.1.1 One variable tabulation
In this section, we shall use the file CSS telco Data to perform tabulations for internet services
with respect to region zones in the data set. We shall learn this through the next practice
exercise.
3) Paste the variable into the ‘variable(s)’ dialog box by left clicking on the triangular arrow;
4) √ [tick] ‘Display frequency tables’
5) ↑OK (This action leads to a display of the output as shown in the figure below).
The output suggests that most (344) of the individuals who participated in this survey came from
households of zone 3. In the other columns, SPSS gives the percentages of responses that are
within each household size with the last column giving the cumulative percentages. The last
column particularly showed that in the places where the survey was carried out, members of
‘zone 3’ fully participated with 100% cumulative percentage.
The window in which this output lies is the second main window of SPSS, called Output
View. Note that at the bottom task bar of your screen, there is a new icon labeled
‘Output1 – IBM SPSS Statistics Viewer’. To the left of that icon is the one for the CSS telco Data.
You can switch between the output and the data view mode by left clicking the window you want
to be your current SPSS window.
4.1.2 Cross-tabulations
In order to compare age categories across all the zones, one may decide to have the zones as rows and Age cat as the columns. Make sure that the variables you are performing cross-tabulations on are already put in the desired categories. To perform the cross-tabulations, follow the steps in the next practice exercise.
Practice 4-2: Creating a 2-way Table
1) Choose the ‘select cases’ option. Then clear the ‘Zone 2’ selection by clicking ‘reset’;
2) ↑Analyze; ↑Descriptive Statistics; ↑Crosstabs;
3) ↑Geographic region (zones) (i.e. choose the zones variable);
4) Paste this variable into the ‘row(s)’ dialog box via the triangular arrow next to it;
5) ↑Marital status (i.e. choosing the variable that has age in categories);
6) Paste this variable into the ‘column(s)’ dialog box via the triangular arrow next to it;
7) ↑OK (This action leads to a display of the output as shown in the figure below).
Figure 4.2: SPSS output for a cross-tabulation for geographic indicator against marital status
This gives the required output, which shows that there were more unmarried people that participated in the survey across all zones.
Performing the same procedure on the geographic location and level of education variables, we would get a table as shown below:
Table 4.3: SPSS output for cross tabulation of geographic indicator against level of education
This gives an output that shows that there were more people with a high school degree that
participated in the study across all zones.
If one had a multiple response question then one would define the multiple response groups as
described in section 5.1.1, then the frequency tables or cross-tabulations would be performed as
outlined in section 5.1.2.
Exercise 4.2: Suppose that the variable ‘custcat’ in the CSS telco data captures information on which of the following customer service categories a respondent was placed in: Basic service, E-service, Plus service, and Total service. Perform a slightly more advanced cross-tabulation of ‘custcat’ by ‘zones’ as follows: place the variable ‘custcat’ in the row(s) dialog box, the ‘internet service access (internet)’ variable in the column(s) dialog box and the ‘zones’ variable in the Layer 1 of 1 dialog box, to find out whether there were some respondents who did not receive any internet service.
Practice 4-3: Swapping rows and columns
1) R↑ (i.e. right click on the cross-tabulation output one wants to swap rows for columns);
2) ↑Edit Content
3) ↑in separate window (you may just close a window that pops up)
4) ↑Pivot
5) (a)↑Transpose rows and columns (b) ↑file
6) Close the (most) current window to see that the rows and columns have changed roles.
Practice 4-4: Editing SPSS cross-tabulations
To edit the contents of the table, take the following actions:
1) R↑ on the appropriate table (i.e. right click the cross-tabulation one wants to edit);
2) ↑Edit Content
3) ↑in separate window (you may just close a window that pops up)
4) ↑Edit or go further with another choice of representing your data by choosing ‘graph’; or
5) ↑format, then get choice on ‘table looks’
6) ↑↑On the item or part of the item you want to edit (You may need one further click to
edit part of it); and when
7) One is through with the editing, one can click anywhere outside the table;
Practice 4-5: Choosing a professional format
To format the table in a professional way, one may use various inbuilt formats as follows.
1) ↑↑ on the appropriate table (i.e. double click the table one wants to format for presentation);
2) R↑ on the table you just double clicked;
3) ↑Edit content
4) ↑in separate window
5) ↑format, ↑Tablelooks…
6) Choose any format of tables among the list of formats appearing on this window by left
clicking on it e.g. Academic;
7) ↑Save Look , ↑OK
8) ↑file, ↑close
Recall that the SPSS output keeps appending new outputs at the bottom of the previous outputs.
If the output becomes very long with unnecessary outputs one can discard an output or some
part of it by left clicking on that output once and pressing or striking [del] or [delete] key on the
keyboard.
4.2.1 Computing simple summary statistics
We shall now compute the following statistical values: the total (sum) number of employees, the
mean, the minimum and maximum data values, the range, the standard deviation and the
standard error. Note that you can perform these commands on any other set of continuous data.
Let us command SPSS to calculate these values on the ‘actual’ number of males selected. Follow
the steps:
1) ↑Analyze [Some versions of SPSS have ‘Statistics’ instead of ‘Analyze’ on the menu!]
2) ↑Descriptive Statistics
3) ↑Frequencies
4) Choose (by clicking) on the list of variables; the one that has correct male employee’s
number variable.
5) Click the triangular arrow to complete (paste) correct male employees number on the
window on the right.
NB: Any wrongly selected variable that is pasted can be removed by clicking it where it is
pasted, followed by clicking the [reversed!] triangular arrow tab.
6) ↑Statistics
7) Check (i.e. put a tick [√] by clicking on the object) the ‘Sum’, ‘Mean’, ‘Maximum’,
‘Minimum’, ‘Range’, ‘Std. Deviation’, ‘Variance’ and the ‘S.E. mean’ items.
8) ↑Continue, ↑OK
In the output, the first part of the table has the values N (valid and missing values). This is the
sample size that was used in the computation.
This analysis will produce a pie chart for the variable of interest.
4.2.3 Drawing Box Plots
Let us suppose that we are not quite sure of the distribution of income among different age
categories from SPSS telcodata. Let us compare the income with respect to the age groups
selected. The steps are:
Note that boxplots help us identify outliers in the data set. For instance, in the output above we
can clearly see an outlier in the E-service customer category.
4.2.4 Drawing Histograms
Using any SPSS data set (e.g. telco data), one can still plot the distribution of the income
data/any variable using similar steps.
UNIT 5: COMMON COMPLICATIONS WHEN ANALYSING SURVEY DATA
Objectives
Example 5.1: Question 1: Which of the following electrical appliances do you own? (Tick all that apply)
a) TV
b) VCR
c) Stereo/CD player
d) PDA
e) Computer
f) Fax Machine
When presented with a question like this, one could tick on more than one answer hence the
need to tackle analysis for such questions. As such this section will focus on analysis of such
questions in SPSS. Note however, that the analysis can be done in other packages as well.
i) Yes i) Yes
ii) No ii) No
2. Do you have a VCR? 5. Do you have a fax machine?
i) Yes i) Yes
ii) No ii) No
3. Do you have a PDA?
i) Yes
ii) No
5.1.2 Analyzing multiple response questions in SPSS
A simple approach is to use the Multiple Response option. This procedure creates a single summary table of counts and percentages based on several variables that contain responses to one question. This would create one table that combines all six variables, rather than six separate tables.
1. First, make note of how the variables of interest are coded. For this example there are six
categories (a-f)
2. Next, instruct SPSS that the set of variables represents responses to a single question. In
the menu bar, go to Analyze>Multiple Response>Define Variable Sets. To define a multiple
response set in SPSS we must specify the list of variables that make up the set, the type
of coding used, and a name.
3. Using the arrow button, place variables Q1_a (Owns a TV) through Q1_f(Owns a fax
machine) in the “Variables in Set” box.
4. Depending on how you entered the data Click:
• “Categories” and add “1-6” for the range; if the data was entered as a repeated
question with all the 6 responses included in the values section.
• “Dichotomies” and add “1” for the counted value; if each response was
incorporated in a single question and a respondent had to answer “yes” or “no”
where applicable.
5. Give the new collapsed variable a name (e.g. item). Next, give the variable a label and
click “Add”. Notice that the set name now appears in the Multiple Response Sets list box.
The $ prefix distinguishes the set name from an ordinary SPSS variable name.
6. Click “close”.
7. Return to Analyze > Multiple response. You will now see that two options have been
activated: Frequencies and Crosstabs. Below is an example of a frequency output for the
item variable. The table was created based on responses to the six variables (Q1_a to
Q1_f). The N column indicates how many respondents mentioned each item. The Percent
of Responses column indicates what percentage of the total number of items mentioned
is contained in each category. The Percent of Cases indicates what percentage of
respondents own items of each given type.
Table 5.1: SPSS output for multiple response variables
$item Frequencies
                                   Responses           Percent of Cases
                                   N        Percent
$item(a)   Owns TV                 6337     26.4%      99.3%
           Owns VCR                6145     25.6%      96.3%
           Owns stereo/CD player   6206     25.8%      97.3%
           Owns PDA                1307     5.4%       20.5%
           Owns computer           2811     11.7%      44.1%
           Owns fax machine        1202     5.0%       18.8%
Total                              24008    100.0%     376.4%
a Dichotomy group tabulated at value 1.
Note that the total for the Percent of Cases column is 376.4%. It is possible to exceed 100% because each respondent can select more than one category. Theoretically, if everyone selected all six categories this percentage would be equal to 600%. Note that the multiple response set that was created will remain active until a different data file is opened or you exit SPSS.
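To see where the two percentage columns come from, the following Python sketch builds the same kind of table outside SPSS. The four respondents and their ticked items are hypothetical, the function name is illustrative, and the sketch assumes every respondent ticked at least one item.

```python
from collections import Counter

# Hypothetical survey: each set holds the items one respondent ticked.
responses = [
    {"TV", "VCR"},
    {"TV"},
    {"TV", "Computer", "PDA"},
    {"VCR", "TV"},
]


def multiple_response_table(responses):
    """Map each item to (N, percent of responses, percent of cases)."""
    counts = Counter(item for ticked in responses for item in ticked)
    total_responses = sum(counts.values())  # all ticks across respondents
    n_cases = len(responses)                # number of respondents
    return {
        item: (n, n / total_responses * 100, n / n_cases * 100)
        for item, n in counts.items()
    }


table = multiple_response_table(responses)
print(table["TV"])   # (4, 50.0, 100.0)
print(table["VCR"])  # (2, 25.0, 50.0)

# Percent of Responses sums to 100, Percent of Cases can exceed 100.
print(sum(row[1] for row in table.values()))  # 100.0
print(sum(row[2] for row in table.values()))  # 200.0
```

Here the Percent of Cases total is 200% because the four respondents ticked eight items between them, an average of two items each.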
The concept of missing values is important to understand in order to successfully manage data. If the missing values are not handled properly by the researcher, then he/she may end up drawing an inaccurate inference about the data. As such, there is a need to know how to deal with missing values in a dataset. This section will address some easy methods of dealing with missing values.
Example: We want to assess which are the main determinants of income (such as age). The
MCAR assumption would be violated if people who did not report their income were, on
average, younger than people who reported it. This can be tested by dividing the sample into
those who did and did not report their income, and then testing a difference in mean age. If we
fail to reject the null hypothesis, then we can conclude that the MCAR is mostly fulfilled (there
could still be some relationship between missingness of Y and the values of Y).
Missing at random (MAR): This is a weaker assumption than MCAR. It states that the probability of missing data on Y is unrelated to the value of Y after controlling for other variables in the analysis (say X).
Example: The MAR assumption would be satisfied if the probability of missing data on income depended on a person’s age, but within each age group the probability of missing income was unrelated to income. However, this cannot be tested because we do not know the values of the missing data; thus, we cannot compare the values of those with and without missing data to see if they systematically differ on that variable.
Example: The NMAR assumption would be fulfilled if people with high income are less likely to
report their income.
List-wise deletion (or complete case analysis): If a case has missing data for any of the variables, then simply exclude that case from the analysis. It is usually the default in most statistical packages.
Advantages: It can be used with any kind of statistical analysis and no special computational
methods are required.
Limitations: It can exclude a large fraction of the original sample. For example, suppose a data set has 1,000 people and 20 variables. If each of the variables has missing data on 5% of the cases, and the missingness is independent across variables, then you could expect to have complete data for only about 360 individuals ($0.95^{20} \times 1000 \approx 358$), discarding the other 640. It works well when the data are missing completely at random (MCAR), which rarely happens in reality.
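The figure of roughly 360 complete cases can be reproduced with a one-line calculation. The sketch below assumes that a value is missing on each variable independently of the others; the function name is illustrative.

```python
def expected_complete_cases(n_cases, n_variables, missing_rate):
    """Expected number of cases with no missing value on any variable,
    assuming missingness is independent across variables."""
    return n_cases * (1 - missing_rate) ** n_variables


complete = expected_complete_cases(1000, 20, 0.05)
print(round(complete))         # 358
print(round(1000 - complete))  # 642
```

With 20 variables each missing on only 5% of cases, list-wise deletion is expected to discard nearly two thirds of the sample.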
Imputation methods: Here, you replace each missing value with a reasonable guess, and then carry out the analysis as if there were no missing values. There are two main imputation techniques:
a) Marginal mean imputation: Compute the mean of X using the non-missing values and use
it to impute missing values of X.
Limitations: It leads to biased estimates of variances and covariances and, generally, it
should be avoided.
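The bias in the variance can be seen in a small Python sketch. The six observations below are hypothetical, with `None` marking a missing value; the function name is illustrative.

```python
from statistics import mean, variance

# Hypothetical variable X with two missing values.
observed = [4.0, 6.0, None, 8.0, None, 10.0]


def marginal_mean_impute(values):
    """Replace each missing value (None) by the mean of the known values."""
    known = [v for v in values if v is not None]
    fill = mean(known)
    return [fill if v is None else v for v in values]


known = [v for v in observed if v is not None]
imputed = marginal_mean_impute(observed)

print(imputed)                          # [4.0, 6.0, 7.0, 8.0, 7.0, 10.0]
print(mean(known), mean(imputed))       # 7.0 7.0
print(round(variance(known), 2))        # 6.67
print(round(variance(imputed), 2))      # 4.0
```

The mean is unchanged, but the sample variance shrinks because the imputed values sit exactly at the mean and add no spread while increasing the sample size.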
Question/observation 1 2 3 4 5 6 7 8 9 10
Response/value 3 8 0 0 5 6 0 7 0 1
Possible analysis of the dataset in table 5.2 (including the Zero values) would yield the following
summary statistics:
An alternative analysis would be to separate the data set (i.e. not including the zero values directly). This analysis would yield two sets of summary statistics as follows:
It is interesting to note that both analyses are valid depending on the precise objective and on
the type of data. However, the 2-step analysis is often appropriate where the data is split into
two (including/not including the zero values) and analyzed as such.
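The two approaches can be compared with a short Python sketch applied to the ten responses tabulated above. It reports only the mean and the share of non-zero responses; it is an illustration of the idea, not a reconstruction of the module's own summary tables.

```python
from statistics import mean

# Responses for the ten questions/observations tabulated above.
values = [3, 8, 0, 0, 5, 6, 0, 7, 0, 1]

# Analysis 1: include the zero values.
mean_with_zeros = mean(values)

# Analysis 2 (two steps): report the share of non-zero responses,
# then summarise the non-zero responses on their own.
non_zero = [v for v in values if v != 0]
share_non_zero = len(non_zero) / len(values) * 100
mean_non_zero = mean(non_zero)

print(mean_with_zeros)  # 3
print(share_non_zero)   # 60.0
print(mean_non_zero)    # 5
```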
For example, when a question like “How many cattle do you have?” is asked, some would answer
“yes”: giving the number of cattle owned, while others will answer “no”: producing Zero values.
In analyzing and interpreting data from this question one could have summary statistics as
follows
Farmer A B C D E
Number of cattle 3 4 6 0 3
In analyzing the data from table 5.3, one would have two analyses as follows
a) Including the zero value: The mean number of cattle per household can be calculated as: $(3 + 4 + 6 + 0 + 3)/5 = 3.2$
Alternatively one would say
b) Excluding the zero value: 80% (4/5) of the farmers owned cattle and, among the cattle owners, the mean was 4 cattle per farmer.
We can see that analysis “b” gives the true mean number of cattle in households that actually have cattle. Hence for such a dataset, it is advisable to use analysis b.
The second case is when you have a group of items that each has a frequency associated with it.
In these types of situations, using a weighted average can be much quicker and easier than the
traditional method of adding up each individual value and dividing by the total. This is especially
useful when you are dealing with large data sets that may contain hundreds or even thousands
of items but only a finite number of choices.
Example 2: Suppose we are given data on farm yields for two farmers (In tons/hectare) as
follows:
From the data given in table 5.4, one can compute different mean yields depending on what you
are trying to answer:
• If you are interested in the farmer, then there are 2 farmers and the mean is
(1+2)/2= 1.5 t/ha
• If the area is the unit of interest then there are 5.5 ha, and Farmer A is 10 times as important as Farmer B, so a weighted mean is produced. Here we need to weight each yield by the area it represents, hence the mean will be
Mean = (1*5 + 2*0.5)/5.5 = 1.1 t/ha
Here the areas are the “weights”; they are used when different observations represent different proportions of the “population”.
Example 3: A student is enrolled in a biology course where the final grade is determined based on the following categories: tests 40%, final exam 25%, quizzes 25%, and homework 10%. The student has earned the following scores for each category: tests 83%, final exam 75%, quizzes 90%, and homework 100%. We need to calculate the student's overall grade.
To calculate a weighted average with percentages, each category value must first be multiplied by its percentage. Then all of these new values must be added together. In this example, we must multiply the student's average on all tests (83) by the percentage that the tests are worth toward the final grade (40%). Please note that all percentages must be converted to decimals before you multiply. Similarly, the final exam score (75) will be multiplied by 0.25 (25%). The same will be true for both the quizzes (90 * 0.25) and homework (100 * 0.10). Thus, the overall calculation would be (83 * .40) + (75 * .25) + (90 * .25) + (100 * .10) = 33.2 + 18.75 + 22.5 + 10 = 84.45, or 84% if rounded down.
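Both weighted means in this section (the yields weighted by area in Example 2 and the course grade in Example 3) follow the same pattern, which can be sketched in Python as below. The function name `weighted_mean` is illustrative; dividing by the sum of the weights means the weights may be given as areas, decimals or whole percentages.

```python
def weighted_mean(values, weights):
    """Sum of value x weight divided by the sum of the weights."""
    total = sum(v * w for v, w in zip(values, weights))
    return total / sum(weights)


# Example 2: yields (t/ha) of Farmers A and B weighted by area (ha).
yield_mean = weighted_mean([1, 2], [5, 0.5])

# Example 3: category scores weighted by their share of the final grade (%).
grade = weighted_mean([83, 75, 90, 100], [40, 25, 25, 10])

print(round(yield_mean, 1))  # 1.1
print(round(grade, 2))       # 84.45
```

With equal weights the function reduces to the ordinary mean, for example `weighted_mean([1, 2], [1, 1])` gives 1.5, the mean yield per farmer.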
Exercise 5-1: A student has earned the following averages in his history course: tests 90%, quizzes 88%, papers 85%, and homework 95%. The overall course grade comprises tests (30%), quizzes (20%), final exam (20%), papers (20%) and homework (10%). What score must he earn on the final exam in order to earn a final grade of at least 90%?