
Polling Public Opinion

Version 2, 2015

Jelke Bethlehem
Contents
1. Public opinion ……………………………………………………………………………………………………………………. 5
2. Some history ……………………………………………………………………………………………………………………. 11
3. Setting up a poll ……………………………………………………………………………………………………………….. 17
4. Asking questions ……………………………………………………………………………………………………………… 23
5. Selecting a sample ……………………………………………………………………………………………………………. 41
6. Collecting data …………………………………………………………………………………………………………………. 57
7. Checking and correcting data ……………………………………………………………………………………………. 63
8. Computing estimates ………………………………………………………………………………………………………... 71
9. The non-response problem ………………………………………………………………………………………………. 93
10. Online polls ……………………………………………………………………………………………………………………. 105
11. Analysing the data ………………………………………………………………………………………………………….. 121
12. Publishing the results …………………………………………………………………………………………………….. 139
13. A checklist for polls ………………………………………………………………………………………………………… 143

References …………………………………………………………………………………………………………………………… 155


Index …………………………………………………………………………………………………………………………………… 159

1 Public opinion

1.1 What is public opinion?


What do you think is the most important problem facing this country today? Which party do you think
you are most likely to vote for? Do you approve or disapprove of the membership of the European
Union? To what extent do you favour or oppose the death penalty? Do you think the use of marijuana
should be made legal, or not? These are examples of questions you may have to answer if you
participate in an opinion poll.
An opinion poll attempts to measure public opinion. But what is public opinion? There is no consensus
about its definition. Public opinion is defined here as the aggregate of individual opinions of the
persons in a group. There is a clear public opinion if the majority of the people in the group have the
same opinion. If there are many different, opposing views, there is no clear public opinion. And if there
are many people without an opinion, there is also no clear public opinion.
People can only have an opinion if there is an issue about which they can express an opinion. Once an
issue is generally recognised, some people will begin to form an opinion about it. If a sufficient number
of people express their opinion, a public opinion on the issue begins to emerge. Not all people will
develop an opinion about a public issue. Some may not be interested, and others simply may not have
heard about it.
Bradburn, Sudman and Wansink (2004) distinguish between opinions and attitudes. In their view, opinion is most
often used to refer to views about a particular object (person, policy), whereas attitude is more often
seen as a bundle of opinions that are more or less coherent and are about a complex object. One might
have an opinion about making specific changes in a law, and an attitude about the general ideas behind
this law. An opinion can often be measured by a single question, while measuring an attitude may
require a set of questions, the answers to which are combined in some way. Kalton and Schuman
(1982) have a similar view: an opinion usually reflects a view on a specific topic, such as voting
behaviour in the next elections. An attitude is a more general concept, reflecting views about a wider,
often much more complex issue.
There is also a distinction between facts on the one hand and opinions and attitudes on the other.
Facts are verifiable. In other words: facts can be objectively proven to have occurred or are actually
the case. There is a true value. Opinions and attitudes are a judgment, a viewpoint, or a statement
about matters commonly considered to be subjective. They are the result of emotion or interpretation
of facts. For attitudes and opinions there is no such thing as a true value. People may not have an
opinion or attitude at all. The only way to measure an opinion or attitude is to ask for it.

1.2 Measuring public opinion


Public opinion can be measured with an opinion poll. An opinion poll is a special case of a survey. A
survey is an instrument that collects information about a well-defined population. Often such a
population consists of persons, for example all people in the country of age 18 years or older. A
population can also consist of other objects, like households, companies, farms, schools, etc. Typically,
a survey collects information by asking questions. To do this in a uniform and consistent way, a
questionnaire is used. If the objects in the population are not persons (e.g. companies), representatives
of these objects have to be found to answer the questions.
There are no fundamental differences between surveys and opinion polls, but there are some practical
discrepancies. A poll is often small and quick. There are only a few questions that have to be answered
by a small group of people. Analysis of the results will not require much time. The focus is on quickly
obtaining an indication of the public opinion about a current issue. Surveys are often large and
complex. More people will have to complete the questionnaire. The questionnaire may contain many
questions and can have a complex structure. Analysis of the results may be time-consuming. The focus is
more on precise and valid conclusions than on global indications.

Figure 1.1. An opinion poll

One way to measure public opinion is to ask each object in the population to fill in the questionnaire.
In case of a survey, this would be called a census or a complete enumeration. Particularly for large
populations, this is time-consuming and expensive. And as more and more people are asked to
participate, the response burden on people increases. Therefore, people will be less and less inclined
to cooperate. This is called non-response.
A survey or opinion poll collects information on only a small part of the population. This small part is
called the sample. In principle, the sample provides information on only the sampled objects in the
population. No information will become available on the non-sampled objects. Fortunately, if the
sample is selected in a ‘clever’ way, it is possible to draw conclusions about the population as a whole.
In this context, ‘clever’ means that the sample is selected using probability sampling. This means a
random selection procedure uses an element of chance to determine which objects are selected, and
which are not. If it is clear how the selection mechanism works, and if it is possible to compute the
probabilities of being selected in the sample, valid and precise estimates can be made of the
characteristics of the whole population. So, the public opinion in the sample can be used to accurately
estimate the public opinion in the whole population.
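As a small illustration (not part of the original text), the sketch below shows in Python how a simple random sample can be drawn from a numbered population so that each object has the same, known selection probability; the population and sample sizes are made up.

```python
import random

# Hypothetical numbered target population of N = 10,000 objects.
N = 10_000
population = range(1, N + 1)          # sequence numbers 1, 2, ..., N

# Draw a simple random sample of n = 500 objects without replacement.
n = 500
sample = random.sample(population, n)

# Every object has the same, known probability n/N of being selected,
# which is what allows valid estimation of population characteristics.
print(f"Selection probability per object: {n / N:.3f}")
```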
There are other ways than probability sampling to select samples. Some of them, like quota sampling
and self-selection, will be discussed. These sampling techniques do not follow the fundamental
principles of probability sampling. Therefore, there is no guarantee at all that these approaches will
result in valid and precise conclusions about the population.

1.3 Conducting an opinion poll


Setting up and carrying out an opinion poll is a process that involves a number of steps. It is important
that you take proper decisions in each step. Failing to do so may result in meaningless results. This
section gives an overview of these steps. Subsequent chapters will treat many aspects in more detail.

1.3.1 Target population
The first phase of conducting a poll is the design phase. You have to take a number of decisions in this
phase. One of these decisions is to define the target population. This is the group of people about
which you want to know more. It is the group that you want to investigate. You select the sample from
this group. Consequently, the conclusions of your analysis of the collected data also apply to this group.
1.3.2 Reference date
A poll is often supposed to measure the opinion or status of a target population at a specific moment in
time. This is called the reference date. The questionnaires should be filled in by the sample persons
around this date, and questions should refer to the situation of the respondent at this date. The
conclusions drawn from the poll also apply to the situation in the target population at this date.
1.3.3 Variables
What are you going to measure in the target population? You have to select a set of variables for this.
Variables are quantities that can assume any of a set of given values. The age of a person is a variable.
It can have any value between 0 and, say, 120. Access to internet is also a variable (with two possible
values yes and no). Likewise, opinion about an issue is a variable. The possible values of an opinion
depend on the way it is measured. If you ask if someone is in favour or opposed, there are only two
possible values. If you ask someone’s opinion by means of a five-point Likert scale, there are five
possible values.
1.3.4 Population characteristics
You will use the collected data to draw conclusions about the status (opinion and behaviour) of the
target population as a whole. This usually comes down to computing a number of indicators. These
indicators are called population characteristics. Population characteristics can be computed exactly if
the individual values of the relevant variables are known for every object in the population. For
example, if the population characteristic is the percentage of people having access to internet, it can be
computed if it is known for every person in the population whether he or she has access to internet. Since the
values of variables usually only become available for objects in the sample, population characteristics
have to be estimated.
1.3.5 The questionnaire
Whatever the objects in the target population are (persons, households, companies, etc), you collect
data by means of asking questions. The questions are answered by individuals representing the
objects. We call these people respondents. You have to collect the data in a consistent and objective
way, so that you get comparable answers. Therefore, you must ask the questions in exactly the same
way of all respondents. A questionnaire is the way to do this. Designing a questionnaire is a crucial step
in the process. Errors in the questionnaire will lead to wrong answers to the questions, and therefore
wrong conclusions will be drawn from the poll. Bishop (2005) states that variations in the form of the
questions, the wording of question texts and the context of the questions may have a substantial effect
on the answers.
1.3.6 The mode of data collection
How to get the answers to the questions? How do you collect the answers? There are several ways to
do this. We call these ways modes of data collection. Each mode has advantages and disadvantages.
You can visit the respondents at home and ask the questions face-to-face. Or you can call the
respondents and ask the questions by telephone. These are called interviewer-assisted modes, because
interviewers ask the questions. You also have self-administered modes. The respondents are on their
own. There are no interviewers to help them. An example of a self-administered mode is sending the
questionnaires by mail to respondents.

Nowadays you can use the computer to ask questions. The paper questionnaire is replaced by a digital
one in the computer. The quality of the collected data is usually better, and it takes less time to do a
poll. This approach is called computer-assisted interviewing (CAI). For face-to-face interviewing,
interviewers bring a laptop or tablet with them to the homes of the respondents. This is called
computer-assisted personal interviewing (CAPI). For telephone interviewing, respondents are called
from the call centre of the poll organisation. This is called computer-assisted telephone interviewing
(CATI).
With the emergence of the internet and the World Wide Web, a new mode of data collection quickly
became very popular: the online poll. At first sight, online data collection seems attractive. It makes it
possible to collect a lot of data in a cheap and fast way. Unfortunately, this mode of data collection also
has its problems.
1.3.7 The sampling frame
Once you have decided how to collect your data, you must find a way to select a sample. You need a
sampling frame for this. A sampling frame is a list of all people in the target population. It must be clear
for all persons in the list how you can contact them. The choice of the sampling frame depends on the
mode of data collection. For a face-to-face poll or a mail poll you need addresses, for telephone polls
you need telephone numbers, and for an online poll you preferably need e-mail addresses.
It is important that the sampling frame exactly covers the target population. If this is not the case,
specific groups may not be selected in the poll, leading to a lack of representativity. This may affect the
validity of the outcomes of the poll.
1.3.8 The sampling design
It is a fundamental principle of survey methodology that you must apply probability sampling to select
a sample from the target population. Probability sampling makes it possible to draw valid conclusions
about the target population as a whole. Only then can you quantify the precision of your estimates
of population characteristics.
There are various ways to select a probability sample. The most straightforward one is to select a
simple random sample (with equal probabilities). Other sampling designs are to select a systematic
sample or a two-stage sample. On the one hand the choice will depend on practical aspects like the
availability of sampling frames and the costs of data collection, and on the other hand it will depend on
the required precision of your estimates.
1.3.9 The sample size
How large must my sample be? There is no simple answer to this question. The precision of the
estimates of population characteristics depends on the sample size. So, if you want very precise
estimates, you have to draw a large sample. If, however, you are satisfied with less precision, a smaller
sample may suffice. You can compute the sample size once you know how precise your estimates have
to be.
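As a rough, hedged illustration (not from the original text): for estimating a percentage, a commonly used approximation for the required sample size at the 95% confidence level is n = z²·p(1−p)/e². The sketch below assumes a large population (no finite population correction); the margin of error used is made up.

```python
import math

def required_sample_size(margin_of_error, p=0.5, z=1.96):
    """Approximate sample size for estimating a population percentage.

    margin_of_error: desired margin of error as a fraction (0.03 = 3 percentage points)
    p:               anticipated population fraction; 0.5 is the most conservative choice
    z:               normal critical value (1.96 for a 95% confidence level)
    """
    return math.ceil(z ** 2 * p * (1 - p) / margin_of_error ** 2)

# Hypothetical example: a margin of error of 3 percentage points.
print(required_sample_size(0.03))   # about 1,068 persons
```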
1.3.10 Data collection
When you have chosen a mode of data collection, and you have drawn a sample from the population,
the fieldwork starts. You must try to get the answers of all selected persons to all questions. If you do
your data collection face-to-face, interviewers must visit the sample persons. If you do your poll by
telephone, interviewers must call the sample persons. For a mail poll, you must send the
questionnaires to the sample persons by mail. In case of an online poll, you must send the link to the
questionnaire by e-mail or some other means.
All persons in the sample must complete the questionnaire. Unfortunately, some questionnaires often
remain empty because people do not answer the questions. This is called non-response. There can be
various causes: people simply refuse to participate, they cannot be contacted, or they are not able to
answer the questions (for example due to language problems). Non-response affects the validity of the
poll results if it is selective. Non-response is selective if, due to non-response, specific groups are
under- or over-represented in the poll.
1.3.11 Data editing
The data collection phase will result in many completed questionnaires. For further analysis of the
data, you have to bring them into the computer. If you have used some form of computer-assisted
interviewing, the data are already in the computer. This is not the case if paper questionnaire forms
were used. Then you need a special data entry phase.
A lot can go wrong during a poll. This may lead to errors in the collected data. Therefore, the data must
be checked. This can be a separate activity, but it can also be combined with data entry. Data editing
means you check the data for errors. Detected errors must be corrected. This is not always simple,
particularly if you have to do it afterwards and without the possibility to confront the respondent with
the problem.
1.3.12 Computing estimates
After data editing, the data are ready for analysis. The first step in the analysis will be estimation of a
number of population characteristics, like totals, means and percentages. It may be informative to
compute these estimates for several subpopulations into which the population can be divided. For
example, you can compute estimates separately for males and females, for various age groups, or for
various regions in the country.
You always have to realise that you are only computing estimates of population characteristics and not
their exact values. This is because you only have sample data. The estimates have a margin of error.
You have to publish these margins of error, if only to avoid the impression that your estimates
represent the true values.
1.3.13 Non-response correction
Almost every poll is affected by non-response. As was mentioned earlier, the main causes for non-
response are refusal, no-contact and not-able. Non-response is often selective, because specific groups
are under-represented in the poll. Selective non-response leads to wrong conclusions. To avoid this, a
correction must be carried out. This correction is called adjustment weighting. This comes down to
assigning weights to respondents. Respondents in over-represented groups get smaller weights than
respondents in under-represented groups. This should reduce the lack of representativity.
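A minimal sketch of what such adjustment weights could look like, assuming the population distribution of a single auxiliary variable (here an age grouping) is known; the groups and shares are purely illustrative.

```python
# Hypothetical population and response distributions over one auxiliary variable.
population_share = {"18-34": 0.30, "35-54": 0.35, "55+": 0.35}
response_share   = {"18-34": 0.20, "35-54": 0.35, "55+": 0.45}

# Each respondent gets the ratio of population share to response share as a weight:
# under-represented groups get weights above 1, over-represented groups below 1.
weights = {group: population_share[group] / response_share[group]
           for group in population_share}

for group, weight in weights.items():
    print(f"{group}: weight {weight:.2f}")
```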
1.3.14 Publication
The final phase of the poll process is publication of the results. Usually, you will write a research
report. This report should not only contain results and conclusions. It should also describe how the
poll was designed and carried out. Everybody must be able to determine if the principles for sound
scientific survey research were followed.
Among the things to be reported are the target population, the sampling frame and sampling design,
sample size, response rates, correction for non-response, and the margins of error.

2 Some history
The idea of collecting data about people and turning these data into useful statistical information is
already very old. Factual data were collected in Babylonian times in agricultural censuses. This
occurred shortly after the art of writing was invented. Ancient China counted its people to determine
the revenues and the military strength of its provinces. There are also accounts of statistical overviews
compiled by Egyptian rulers long before Christ. Rome conducted censuses of people and of property.
The data were used to establish the political status of citizens and to assess the military strength and
tax obligations to the state. And of course, there was the numbering of the people of Israel, which
brought Joseph and Mary to the small town of Bethlehem, where Jesus was born.

Figure 2.1. Census in Bethlehem (Pieter Brueghel, 1605-1610)

Opinion polls can be traced back to ancient Greece. Around 500 BC many city-states had a form of
government based on democracy. All free (non-slave), native (non-foreign) adult males could express
their opinion on things like declaring war, dispatching diplomatic missions and ratifying treaties.
Opinions were turned into decisions by people attending popular assemblies and by voting on these
issues.
For a long period in history, data collection was based on complete enumeration of the target
population. Every object in the target population had to provide information. The important idea of
sampling emerged only at the end of the 19th century. It took many years before this idea was
accepted.
This chapter gives a global overview of historical developments. Section 2.1 is about data collection by
means of a census, section 2.2 describes the rise of survey sampling, and section 2.3 focuses on the
history of opinion polls.

2.1 Collecting statistical data


Ancient empires (China, Egypt, Rome) collected statistical information mainly to find out how much
tax could be collected and how large the army could be. Data collection was rare in the Middle Ages. A
good example was the census of England taken by the order of William the Conqueror, King of
England. The compilation of this Domesday Book started in the year 1086 AD. The book records a
wealth of information about each manor and each village in the country. There is information about
more than 13,000 places, and for each county there are more than 10,000 facts. To collect all this data,
the country was divided into a number of regions, and in each region a group of commissioners was
appointed from among the greater lords. Each county within a region was dealt with separately.
Sessions were held in each county town. The commissioners summoned all those required to appear
before them. They had prepared a standard list of questions. For example, there were questions about
the owner of the manor, the number of free men and slaves, the area of woodland, pasture and
meadow, the number of mills and fishponds, the total value, and the prospects of getting more
profit. The Domesday Book still exists and county data files are available on CD-ROM and the internet.

Figure 2.2. The Domesday Book

You can find another interesting example in the Inca Empire which existed between 1000 and 1500
AD in South America. Each Inca tribe had its own statistician, called the Quipucamayoc. This man kept
records of, for example, the number of people, the number of houses, the number of llamas, the
number of marriages and the number of young men that could be recruited for the army. All these
facts were recorded on a quipu, a system of knots in coloured ropes. A decimal system was used for
this.

Figure 2.3. The Quipucamayoc, the Inca-statistician

At regular intervals, couriers brought the quipus to Cusco, the capital of the kingdom, where all
regional statistics were compiled into national statistics. The system of Quipucamayocs and quipus
worked remarkably well. Unfortunately, the system vanished with the fall of the empire.
An early census took place in Canada in 1666. Jean Talon, the intendant of New France, ordered an
official census of the colony to measure the increase in population since the founding of Quebec in
1608. The enumeration, which recorded a total of 3,215 persons, included the name, age, sex, marital
status and occupation of every person. The first censuses in Europe were undertaken by the Nordic
countries: the first census in Sweden-Finland took place in 1746. It had already been suggested
earlier, but the initiative was rejected because “it corresponded to the attempt of King David who
wanted to count his people”.

Figure 2.4. A census-taker interviews a family (1870)

The industrial revolution in the 19th century was an important era in the history of statistics. It
brought about drastic and extensive changes in society, as well as in science and technology. Among
many other things, industrialisation led to urbanisation, and democratisation and the emerging social
movements at the end of the industrial revolution created new statistical demands.
The rise of statistical thinking originated partly from the demands of society and partly from emerging
new ideas. In this period, the foundations for many principles of modern social statistics were laid.
Several central statistical bureaus, statistical societies, conferences, and journals were established
soon after this period.

2.2 The rise of survey sampling


The development of modern sampling theory started around the year 1895. In that year, Anders Kiaer
(1895, 1997), the founder and first director of Statistics Norway, published his Representative Method.
It was an inquiry in which a large selection of persons was questioned. This selection should be a
‘miniature’ of the population. Persons were selected in such a way that specific groups were present in
the right proportions. This would now be called quota sampling. The proportions were obtained from
previous investigations. Anders Kiaer stressed the importance of representativity. His argument was
that, if a sample was representative with respect to some variables with a known population
distribution, it would also be representative with respect to the other variables in the survey.
A basic problem of the Representative Method was that there was no way of establishing the precision
of estimates. The method lacked a formal theory of inference. It was Bowley (1906, 1926), who made
the first steps in this direction. He showed that for large samples, selected at random from the
population, the estimate had an approximately normal distribution.
From this moment on, there were two methods of sample selection. The first one was Kiaer’s
Representative Method, based on purposive selection (quota sampling), in which representativity
played a crucial role, and for which no measure of the precision of the estimates could be obtained.
The second one was Bowley’s approach, based on random sampling, and for which an indication of the
precision of estimates could be computed. Both methods existed side by side for a number of years.
This situation lasted until 1934, in which year the Polish scientist Jerzy Neyman published his now
famous paper, see Neyman (1934). Neyman developed a new theory based on the concept of the
confidence interval. The confidence interval is still used nowadays to describe the margin of error of an
estimate.

The contribution of Neyman was not only that he invented the confidence interval. By making an
empirical evaluation of Italian census data, he could also prove that the Representative Method based
on purposive sampling failed to provide satisfactory estimates of population characteristics. The result
of Neyman’s evaluation of purposive sampling was that the method fell into disrepute.
The classical theory of survey sampling was more or less completed when Horvitz and Thompson
(1952) published their general theory for constructing unbiased (valid) estimates. They showed that,
whatever the selection probabilities are, as long as they are known and positive, it is always possible
to construct valid estimates. Since then, probability sampling has become the preferred sample
selection method. It is the only sampling approach allowing for valid inference about the population
the sample came from.

2.3 Opinion polls


Opinion polls can be seen as a special type of sample surveys, in which attitudes or opinions of a group
of people are measured on political, economic or social topics. The history of opinion polls in the
United States goes back to 1824. In that year, two newspapers, the Harrisburg Pennsylvanian and the
Raleigh Star, attempted to determine political preferences of voters prior to the presidential election
of that year. The early polls did not pay much attention to sampling aspects. Therefore, it was difficult
to establish the accuracy of results. Such opinion polls were often called straw polls. This expression
goes back to rural America. Farmers would throw a handful of straws into the air to see which way the
wind was blowing. In the 1820’s, newspapers began doing straw polls in the streets to see how
political winds blew.
It took until the 1920’s before more attention was paid to sampling aspects. At that time, Archibald
Crossley developed new techniques for measuring the American public's radio listening habits. And
George Gallup worked out new ways to assess reader interest in newspaper articles, see e.g. Lienhard
(2003). The sampling technique used by Gallup was quota sampling. The idea was to investigate
groups of people who were representative for the population. Gallup sent out hundreds of
interviewers across the country. Each interviewer was given quotas for different types of respondents:
so many middle-class urban women, so many lower-class rural men, etc. In total, approximately
50,000 interviews were carried out for his poll.
Gallup’s approach was in great contrast with that of the Literary Digest Magazine, which was at that
time the leading polling organisation. This magazine conducted regular “America Speaks” polls. It
based its predictions on returned questionnaire forms that were sent to addresses obtained from
telephone directories and automobile registration lists. The sample size for these polls was very
large, something like two million people.

Table 2.1. The presidential election in the U.S. in 1936


Candidate        Prediction by        Prediction by     Election
                 Literary Digest      Gallup            result
Roosevelt (D)    43%                  56%               61%
Landon (R)       57%                  44%               37%

The presidential election of 1936 turned out to be decisive for both approaches, see Utts (1999).
Gallup correctly predicted Franklin Roosevelt to be the new president, whereas Literary Digest
predicted that Alf Landon would beat Franklin Roosevelt. The results are summarised in table 2.1.
How could a prediction based on such a large sample be so wrong? The explanation was a fatal flaw in
the sampling procedure of the Literary Digest. The automobile registration lists and telephone
directories were not representative samples. In the 1930’s cars and telephones were typically owned
by the middle and upper classes. More well-to-do Americans tended to vote Republican, and the less
well-to-do were inclined to vote Democrat. Therefore, Republicans were over-represented in the
Literary Digest sample.
As a result of this historic mistake, the Literary Digest magazine ceased publication in 1937. And
opinion researchers learned that they should rely on more scientific ways of sample selection. They
also learned that the way a sample is selected, is more important than the size of the sample.
Gallup’s quota sampling approach turned out to work better than Literary Digest’s haphazard
selection approach. However, Jerzy Neyman had shown already in 1934 that quota sampling can lead
to invalid estimates. Gallup was confronted with the problems of quota sampling in the campaign for
the presidential election of 1948. Harry Truman was the Democratic candidate and Thomas Dewey
was the Republican candidate. Table 2.2 summarises Gallup’s prediction and the real election result.

Table 2.2. The presidential election in the U.S. in 1948


Candidate        Prediction by     Election
                 Gallup            result
Truman (D)       44%               50%
Dewey (R)        50%               45%

The sample size of Gallup’s poll was 3,250 persons. Gallup concluded from the poll that Thomas Dewey
would win the election. Some newspapers were so convinced of Dewey’s victory that they declared him
the winner of the election in their early editions. This prediction turned out to be completely wrong.

Figure 2.5. Truman showing a newspaper with a wrong prediction

Gallup predicted that Dewey would get 50% of the votes. That was 5% more than the election result
(45%). An analysis of the polls of 1936 and 1948 showed that the quota samples contained too many
Republicans and too few Democrats. This did not lead to a wrong prediction in 1936 because the
difference between the two candidates was much more than 5%. The difference between the two
candidates was much smaller in 1948. Therefore, the lack of representativity in Gallup’s sample led to
a wrong prediction.
These examples show the dangers of using quota sampling. Quota samples are not based on random
selection. Instead, interviewers were instructed to select groups of people in the right proportions. But
you can only do that for a limited number of variables, such as gender, age, level of education and race.
Making a sample representative with respect to these variables does not automatically guarantee
representativity with respect to other variables, like voting behaviour. The best guarantee for
representativity is to apply random sampling and to give each object in the population the same
selection probability.

3 Setting up a poll
The design of a poll starts by specifying its objectives. These objectives may initially be vague and
formulated in terms of abstract concepts. They often take the form of obtaining the answer to a general
research question. Examples are:
• Do people feel safe on the streets?
• For which political party will people vote at the next general elections?
• Are people in favour of or opposed to having windmills for producing electricity close to their
homes?
Such general questions have to be translated into a more concrete poll instrument. Several aspects
have to be addressed for this. A number of basic aspects will be discussed in this chapter:
• The exact definition of the population to be investigated (target population).
• The point in time to which the poll refers (reference date).
• The specification of what has to be measured (variables).
• The specification of what has to be estimated (population characteristics).
Other aspects, like the construction of the questionnaire, the choice of the sampling design and the
sample size, are discussed in subsequent chapters.
It is important to pay careful attention to these initial steps. Wrong decisions have their impact in all
subsequent phases of the poll. In the end it may turn out that the general questions of the poll have not
been answered.

3.1 The target population


The target population is the population you want to investigate. It is the population to which the
outcomes of the poll refer. The objects in the target population are often people, households or companies.
So, the target population does not necessarily have to consist of persons. The target population is also
the set of objects from which you select your sample.
It is important to define the target population clearly. Mistakes made during this phase will affect the
outcomes of the poll. Therefore the definition of the target population requires careful consideration.
It must be possible to determine without error whether an object encountered ‘in the field’ does or
does not belong to the target population.

Example 3.1. A radio listening poll

There are many public local radio stations in The Netherlands. Now and then these radio stations
carry out radio listening polls. The objective is to find out whether the people in the region like the
radio station, whether they listen to it, and which programmes they like most.
What could be the target population of such a poll? Most local radio stations broadcast their
programmes in just one municipality. So the target population consists in principle of all inhabitants
of the municipality. Should foreigners who temporarily live in the municipality also be included?
And what about natives who are temporarily out of town?
And what about children? Should they be included in the target population? They may also listen to
the radio station and could have an opinion. Should there be an age limit for children? In The
Netherlands, children from the age of 13 years and over are often part of the target population.

Target population

The target population is a finite set of objects U consisting of N objects: U = {1, 2, …, N}. The quantity
N is the population size. The numbers 1, 2, …, N denote the sequence numbers of the objects. If we
talk about object 3, we refer to the object with sequence number 3.

3.2 The reference date


A poll is often supposed to measure the state of a target population at a specific moment in time. This
is the so-called reference date. The sampling frame should reflect the status at this reference date.
Since the sample will be selected from the sampling frame before the reference date, this might not be
the case. The sampling frame may contain objects that do not exist anymore at the reference date.
People may have died or companies may have ceased to exist. Since such objects do not belong to the
target population at the reference date, they should be removed from the sample. This is called over-
coverage.
It may also happen that new objects have come into existence after the time of sample selection and
before the reference date. For example, a person moves into town or a new company is created. You
will never be able to include these new objects in the poll. This phenomenon is called under-coverage.

Example 3.2. Problems with the reference date

Suppose you conduct a poll in a town among people of age 18 and older. The objective is to describe
the situation at the reference date of May 1. The sample is selected in the design phase of the poll, say on
April 1. It is a large poll, so data collection cannot be completed in one day. Therefore, interviews are
conducted in a period of two weeks, starting one week before the reference date and ending one
week after the reference date.
Now suppose an interviewer attempts to contact a selected person on April 29. It turns out the
person has moved to another town. This is a case of over-coverage. What counts is the situation on
May 1, and the person no longer belonged to the target population at the reference date. So,
there is no problem. Since this is a case of over-coverage, it can be ignored.
The situation is different if an interviewer attempts to contact a person on May 5, and this person
turns out to have moved out of town on May 2. This person belonged to the target population at the
reference date, and therefore should have been interviewed. This is no coverage problem, but a case
of non-response. The person should be tracked down and interviewed.

3.3 The variables


You use the poll to measure various facts and opinions of the people selected in the sample. Measuring
comes down to asking questions and recording the answers. This will provide you with a set of data
you can use for your analysis and for drawing conclusions about the target population.
You can measure all kinds of properties of the sample persons. We call such a property a variable
(because it varies from person to person). Examples of such properties are the income of a household,
the opinion of a person about a political issue, or the profit of the company in the last year. We
distinguish three types of variables: qualitative variables, quantitative variables and indicator
variables.
A qualitative variable is sometimes also called a categorical variable. It divides the target population
into a number of groups (categories). The value of such a variable for a specific object is the label of the
group to which the object belongs. For example, if the qualitative variable is the gender of a person, the
values could be 1 (for males) and 2 (for females). Computations with these values are not meaningful.
We can only determine whether two objects belong to the same group (because they have the same
value/label) or to different groups (because they have different values/labels). Examples of qualitative
variables are the political party for which one voted (Christian-Democrat Party, Liberal Party, Social-
Democrat Party, Green Party, etc.) and marital status (never married, married, separated, divorced,
widowed).
Sometimes, qualitative variables are divided into nominal variables and ordinal variables. The groups
of a nominal variable have no natural ordering. You can label the groups in any order. The groups of an
ordinal variable have a natural order. For example, level of education is an ordinal variable. There is an
ordering from low to high.

Table 3.1. Variables


Type of variable            Description
Qualitative variables       Division in groups.
- Nominal variables         No ordering of groups.
- Ordinal variables         Natural ordering of groups.
Quantitative variables      Measure size, value, duration, etc.
- Continuous variables      Can assume any value.
- Discrete variables        Only discrete values, e.g. counts.
Indicator variables         Measures presence (1) or absence (0).

A quantitative variable measures the size, weight, value or duration of something. The values of such a
variable have a meaning, because there is always a unit of measurement. Examples of quantitative
variables are weight of a person (in kilograms), the amount of time spent on the internet (in hours) or
rating politicians on a scale from 1 to 10. Computations with the values of a quantitative variable are
meaningful. For example, it is possible to compute the mean rating of a politician in the sample.
Sometimes, quantitative variables are divided into continuous variables and discrete variables. A
continuous variable is a variable that can assume any value in a given interval. An example is the net
monthly income of a person. A discrete variable assumes only discrete values. A variable that measures
counts is a discrete variable. An example is the number of cars in a household. By adding the values of
this variable, we obtain the total number of cars in the target population.
An indicator variable measures the presence or absence of a specific property. The presence of the
property is indicated by the value 1, and the absence by the value 0. Examples of indicator variables
are voting for a specific party, having internet access, and having a paid job.
On the one hand, an indicator variable is a qualitative variable, as it divides the population into two
groups: a group of objects with the property and a group of objects without the property. On the other
hand, an indicator variable is also a quantitative variable. Computations are meaningful. If you add up
the values of this variable in the target population, you obtain the number of objects in the population
with the property. If you divide this number by the size of the population, you get the fraction of
objects with the property. And if you multiply this by 100, you get the percentage of objects with the
property.
If you set up a poll, you have to decide which variables you want to measure. Often there are two sets
of variables: target variables and auxiliary variables. Target variables measure the properties of
objects that contribute to answering the general research question of the poll. Together, these
variables provide the information needed to get insight into the behaviour or opinions of the objects in
the target population. For example, in a political poll, the target variables could be whether someone
intends to vote, party preference in the coming elections, and party preference in the previous
elections.

Target variable

An arbitrary target variable is denoted by the letter Y. The values of this variable for the N objects in
the target population are denoted by Y1, Y2, …, YN. If, for example, Y is the party preference at the last
election, then Y1 is the party preference of person 1, Y2 the party preference of person 2, etc.

Example 3.3. Variables in a radio listening poll

There are many public local radio stations in The Netherlands. Now and then these radio stations
carry out radio listening polls. The objective is to find out whether the people in the region like the radio
station, whether they listen to it, and which programmes they like most. In an attempt to standardise
the questionnaire, the following list of questions was proposed:

Variable                                                  Type of variable

Does one know the radio station exists?                   Indicator
Did one ever listen to the radio station?                 Indicator
Why does one not listen?                                  Qualitative
Did one listen in a particular week?                      Indicator
Did one listen on a particular day?                       Indicator
How long did one listen on a particular day (hours)?      Quantitative
Type of programme one prefers                             Qualitative
General assessment of the radio station                   Quantitative

The target variables measure all kinds of aspects of the phenomenon you want to investigate. Often
you also measure other variables in your poll that at first sight seem unrelated to the target variables.
We call such variables auxiliary variables. These variables measure background characteristics of the
objects. For persons, the auxiliary variables are often demographic variables like gender, age, marital
status and level of education.

Auxiliary variable

An arbitrary auxiliary variable is denoted by the letter X. The values of this variable for the N objects
in the target population are denoted by X1, X2, …, XN. If, for example, X is the person’s age, then X1 is
the age of person 1, X2 the age of person 2, etc.

Auxiliary variables give you the possibility to compare different groups. Does the voting behaviour of
young people differ from that of old people? Do females listen to different radio programs than males?
Auxiliary variables are also important for a different reason. Such variables are badly needed to do
something about the negative effects of non-response. You can read more about this in chapter 9.

3.4 Population characteristics


You will use the collected data to draw conclusions about the status (opinion and behaviour) of the
target population as a whole. This usually comes down to computing a number of quantities. These
quantities are called population characteristics. Population characteristics can be computed exactly if
the individual values of the relevant variables are known for every object in the population. Since the
values of target variables only become available for objects in the sample, population characteristics
cannot be computed exactly, but have to be estimated.
For a quantitative variable, the two most obvious population characteristics are the population total and
the population mean of its values over all objects. Suppose the target population consists of persons. If
the target variable is the number of internet devices (desktop pc’s, laptops, tablets, smartphone’s, etc.)
the person has, then the population total is equal to the total number of internet devices in the
population. If the target variable is the net monthly income of a person, then the population mean is
the mean income in the population.
If you also measure auxiliary variables like gender and age, you can estimate population
characteristics in specific groups in the population, like the mean income for males and females, and
the total number of internet devices the elderly have.
There is another population characteristic for a quantitative variable that should be mentioned, and
that is the population variance. This characteristic measures the amount of variation of the values of a
variable. The population variance is 0 if all objects have the same value. The more the values differ
from each other, the larger the population variance will be. This population characteristic is important,
because you need it to compute the margin of error of your estimates.

Population characteristics for a quantitative variable

The population total of a quantitative target variable Y is equal to

$$Y_{TOT} = Y_1 + Y_2 + \ldots + Y_N = \sum_{k=1}^{N} Y_k .$$

The population mean of a quantitative target variable Y is equal to

$$\bar{Y} = \frac{Y_1 + Y_2 + \ldots + Y_N}{N} = \frac{Y_{TOT}}{N} = \frac{1}{N} \sum_{k=1}^{N} Y_k .$$

The population variance of a quantitative target variable Y is equal to

$$S^2 = \frac{(Y_1 - \bar{Y})^2 + (Y_2 - \bar{Y})^2 + \ldots + (Y_N - \bar{Y})^2}{N-1} = \frac{1}{N-1} \sum_{k=1}^{N} (Y_k - \bar{Y})^2 .$$

The population variance can be seen as a kind of average of the squares of the distances of the
values to the mean.
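As a small illustration (not part of the original text), the sketch below computes the population total, mean and variance defined above for a tiny, made-up population of N = 5 persons.

```python
# Hypothetical values of a quantitative variable Y (e.g. number of internet devices)
# for a tiny population of N = 5 persons.
Y = [2, 0, 3, 1, 4]
N = len(Y)

Y_tot = sum(Y)                                   # population total
Y_bar = Y_tot / N                                # population mean
S2 = sum((y - Y_bar) ** 2 for y in Y) / (N - 1)  # population variance

print(Y_tot, Y_bar, S2)                          # 10, 2.0, 2.5
```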

For an indicator variable, three population characteristics are often computed: the population total,
the population fraction and the population percentage. You obtain the population total by adding up all
values of the indicator variable. Since these values can only be 0 or 1, the population total is here equal
to the number of objects with a particular property. If you compute the population mean of an
indicator variable, you get the fraction of objects with the specific property. And if you multiply this
result by 100, you get the percentage of people in the population with the property.

Example 3.4. Population characteristics of a radio listening poll

Suppose you carry out a radio listening poll in a town, and you ask whether one listened to the radio
station yesterday. This means you measure an indicator variable with values 1 (one did listen) and
0 (one did not listen).
The population total of this variable will be equal to the total number of people that listened
yesterday to the radio station. The population mean is the fraction of people that did listen. If you
multiply this mean by 100, you get the percentage of people that did listen.

Population characteristics for an indicator variable

If Y is an indicator variable (with values 0 and 1), the population total

$$Y_{TOT} = Y_1 + Y_2 + \ldots + Y_N = \sum_{k=1}^{N} Y_k$$

is equal to the number of objects with the particular property.

The population mean

$$\bar{Y} = \frac{Y_1 + Y_2 + \ldots + Y_N}{N} = \frac{Y_{TOT}}{N} = \frac{1}{N} \sum_{k=1}^{N} Y_k$$

of indicator variable Y is equal to the fraction of objects with the specific property.

If we denote the population percentage of objects with a particular property by P, then this
population characteristic is equal to

$$P = 100 \times \bar{Y} = 100 \times \frac{Y_1 + Y_2 + \ldots + Y_N}{N} = 100 \times \frac{Y_{TOT}}{N} .$$

The population variance of an indicator variable Y can be written as

$$S^2 = \frac{N}{N-1} \times \frac{P}{100} \times \frac{100 - P}{100} .$$
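The same kind of computation works for an indicator variable. The sketch below (again with made-up 0/1 values) computes the percentage P and checks that the shortcut formula for the variance agrees with the direct definition.

```python
# Hypothetical indicator variable (1 = listened to the station yesterday, 0 = did not).
Y = [1, 0, 0, 1, 1, 0, 1, 0, 0, 1]
N = len(Y)

Y_bar = sum(Y) / N                     # fraction with the property
P = 100 * Y_bar                        # percentage with the property

S2_direct  = sum((y - Y_bar) ** 2 for y in Y) / (N - 1)       # direct definition
S2_formula = (N / (N - 1)) * (P / 100) * ((100 - P) / 100)    # shortcut formula

print(P, S2_direct, S2_formula)        # 50.0, and the two variances are equal
```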

4 Asking questions
A research project often starts by formulating a general research question. Sometimes, the research
question can be answered by retrieving information from existing sources. If this is not the case,
information must be collected. A poll is one of the possibilities to do this. To conduct a poll, the
research question must be translated into a set of variables (both target variables and auxiliary
variables) to be measured. The values of these variables are obtained by asking questions. It is
important that these questions measure what they intend to measure. Together all the questions make
up the questionnaire.
The questionnaire is a measuring instrument. It is, however, not a perfect measuring instrument. You
can measure someone’s height with a measuring staff, and you can measure someone’s weight with a
weighing scale. These physical measuring devices are generally very accurate. The situation is
different for a questionnaire. It only indirectly measures someone’s behaviour or opinion. Schwarz et
al. (2008) describe the tasks involved in answering a question in a poll as follows:
1) Respondents need to understand the question to determine which information they are asked to
provide.
2) Respondents need to recall relevant information from memory. In case of an opinion question, a
ready-for-use answer may not be stored in memory. Instead, respondents need to form a judgment
on the spot, based on whatever relevant information comes to mind at the time.
3) The respondents need to report their answer. They can rarely do it in their own words. They
usually have to reformat their answer in order to fit one of the response alternatives provided by
the researcher.
4) The respondents need to decide whether to report the true answer or to report a different answer.
If the question is about a sensitive topic, they may decide to refuse to give an answer. And if an
answer is socially undesirable, they may change their answer.
From a cognitive point of view this process complicates the use of a questionnaire as a measuring
instrument. A lot can go wrong in the process of asking and answering questions. Problems in the
questionnaire will affect the collected data, and consequently, also the poll results. Therefore, it is of
the utmost importance to carefully design and test the questionnaire. It is sometimes said that
questionnaire design is an art and not a skill. Nevertheless, long years of experience have led to a
number of useful rules. Some of these rules are described in this section. They deal with question texts,
question types, and the structure of the questionnaire. Also, some attention is paid to testing a
questionnaire.

4.1 Factual and non-factual questions


Like Kalton & Schuman (1982), we distinguish factual and non-factual questions. Factual questions are
asked to obtain information about facts and behaviour. There is always an individual true value. This
true value can also be determined, at least in theory, by some other means than asking a question of
the respondent. Examples of factual questions are ‘What is your regular hourly rate of pay on this job?’,
‘Do you own or rent your place of residence?’ and ‘Do you have an internet connection in your home?’.
The fact to be measured by a factual question must be precisely defined. Experience shows that even a
small change in the question text may lead to a substantially different answer. As an example, a
question about the number of rooms in the household can cause problems if it is not clear what
constitutes a room and what not. Should a kitchen, a bathroom, a hall and a landing be included?
Non-factual questions ask about attitudes and opinions. An opinion usually reflects views on a specific
topic, like voting behaviour in the next election. An attitude is a more general concept, reflecting views
about a wider, often more complex issue. With opinions and attitudes, there is no such thing as a true
value. They measure a subjective state of the respondent that cannot be observed by another means.
The opinion or attitude only exists in the mind of the respondent.
There are various theories explaining how respondents determine their answer to an opinion
question. One such theory is the online processing model described by Lodge et al. (1995). According to
this theory, people maintain an overall impression of ideas, events and persons. Every time they are
confronted with new information, this summary view is updated spontaneously. When they have to
answer an opinion question, their response is determined by this overall impression. The online
processing model should typically be applicable to opinions about politicians and political parties.
There are situations in which people have not formed an opinion about a specific issue. They only
start to think about it when confronted with the question. According to the memory based model of
Zaller (1992), people collect all kinds of information from the media and in contacts with other people.
Much of this information is stored in memory without paying attention to it. When respondents have
to answer an opinion question, they may recall some of the relevant information stored in memory.
Due to the limitations of the human memory, only part of the information is used. This is the
information that immediately comes to mind when the question is asked. This is often information that
only recently has been stored in memory. Therefore, the memory based model is able to explain why
people seem to be unstable in their opinions. Their answer may easily be determined by the way the
issue was recently covered in the media.

4.2 The question text


The question text is probably the most important part of the question. This is what the respondents
respond to. If they do not understand the question, they will not give the correct answer, or they will
give no answer at all. Some rules of thumb are presented here that may help to avoid the most obvious
mistakes. Examples are given of question texts not following these rules.
4.2.1 Use familiar wording
The question text must use words that are familiar to those who have to answer the question.
Particularly, questionnaire designers must be careful not to use jargon that is familiar to themselves,
but not to the respondents. Economists may understand a question like

Do you think that food prices are increasing at the same rate as a year ago, at a faster rate, or at a slower rate?

This question asks about the rate at which prices rise, but a less knowledgeable person may easily
interpret the question as asking whether prices decreased, have stayed the same, or increased.
Unnecessary and possibly unfamiliar abbreviations must be avoided. Do not expect respondents to be
able to answer questions about, for example, caloric content of food, disk capacity (in gigabytes) of
their computer, or the bandwidth (in Mbps) of their internet connection.
Indefinite words like ‘usually’, ‘regularly’, ‘frequently’, ‘often’, ‘recently’ and ‘rarely’ must be avoided if
there is no additional text explaining what they mean. How regular is regularly? How frequent is
frequently? These words do not have the same meaning for every respondent. One respondent may
interpret ‘regularly’ as every day, while it could mean once a month to another respondent. Here is an
example of such a question:

Have you been to the cinema recently?

What does ‘recently’ mean? It could mean the last week or the last month. The question can be
improved by specifying the time period:

Have you been to the cinema in the last week?

Even this question text could cause some confusion. Does ‘last week’ mean the last seven days or
maybe the period since last Sunday?
4.2.2 Avoid ambiguous questions
If the question text is interpreted differently by different respondents, their answers will not be
comparable. For example, if a question asks about income, it must be clear whether it is about weekly,
monthly or annual income. Moreover, it must be clear whether the respondents must specify their
income before or after tax has been deducted.
Vague wording may also lead to interpretation problems. A respondent confronted with the question

Are you satisfied with the recreational facilities in your neighbourhood?

may wonder about what recreational facilities exactly are. Is this a question about parks and
swimming pools? Do recreational facilities also include libraries, theatres, cinemas, playgrounds,
dance studios, and community centres? What will respondents have in mind when they answer
this question? It is better to describe in the question text what is meant by recreational facilities.
4.2.3 Avoid long question texts
You should keep the question text as short as possible. A respondent attempting to comprehend a long
question may leave out part of the text, and thus change the meaning of the question. Long texts may
also cause respondent fatigue, resulting in a decreased motivation to continue. Of course, the question
text should not be so short that it becomes ambiguous. Here is an example of a question that may be
too long:

During the past seven days, were you employed for wages or other remuneration, or were you self-employed
in a household enterprise, were you engaged in both types of activities simultaneously, or were you engaged in
neither activity?

You can obtain some indication of the length and difficulty of a question text by counting the total
number of syllables, and the average number of syllables per word. Table 4.1 gives examples of these
indicators for three questions. The first question is simple and short. The second one is also short but
it is much more difficult. The third question is very long and has an intermediate complexity.

Table 4.1. Indicators for the length and complexity of a question


Question 1: Have you been to the cinema in the last week?
    Words: 9    Syllables: 12    Syllables per word: 1.3
Question 2: Are you satisfied with the recreational facilities in your neighbourhood?
    Words: 10   Syllables: 21    Syllables per word: 2.1
Question 3: During the past seven days, were you employed for wages or other remuneration, or were you
    self-employed in a household enterprise, were you engaged in both types of activities simultaneously,
    or were you engaged in neither activity?
    Words: 38   Syllables: 66    Syllables per word: 1.7
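
The indicators in Table 4.1 can also be computed automatically. The sketch below is a rough Python illustration; the syllable count is based on a crude vowel-group heuristic (an assumption made here for simplicity), so its output will only approximate the hand-counted figures in the table.

import re

def count_syllables(word):
    # Crude heuristic: count groups of consecutive vowels; every word gets at least one syllable.
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def complexity_indicators(question):
    # Words are taken to be alphabetic tokens (hyphenated words count as one word).
    words = re.findall(r"[A-Za-z]+(?:-[A-Za-z]+)*", question)
    syllables = sum(count_syllables(w) for w in words)
    return len(words), syllables, round(syllables / len(words), 1)

print(complexity_indicators("Have you been to the cinema in the last week?"))

A higher number of syllables per word signals a more difficult question text; the exact values depend on how words and syllables are counted.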

If a question text appears to be too long, you might consider splitting it into two or more shorter questions.
It should be noted that some research shows that longer question texts sometimes lead to better
answers. According to Kalton & Schuman (1982), longer texts may work better for open questions about
threatening topics.
4.2.4 Avoid recall questions as much as possible
Questions requiring recall of events that have happened in the past (recall questions) are a source of
errors. The reason is that people make memory errors. They tend to forget events, particularly when
they happened a long time ago. Recall errors become more severe as the reference period gets longer.
Important events, more interesting events and more frequently occurring events will be
remembered better than other events. For example, the question

In the last two years, how many times did you contact your family doctor?

is a simple question to ask, but for many people difficult to answer. Recall errors may even occur for
shorter time periods. In the 1981 Health Survey of Statistics Netherlands, respondents had to report
contacts with their family doctor over the last three months. Memory effects were investigated by
Sikkel (1983). It turned out that the percentage of unreported contacts increased with time. The longer
ago an event took place, the more likely it is that it will be forgotten. The percentage of unreported
events for this question increased on average by almost 4% per week. Over the total period of three
months, about one quarter of the contacts with the family doctor were not reported.
Recall questions may also suffer from telescoping. This occurs if respondents report events as having
occurred either earlier or later than they actually did. As a result an event is incorrectly reported
within the reference period, or incorrectly excluded from the reference period. Bradburn et al. (2004)
note that telescoping leads more often to overstating than to understating the number of events.
Particularly, for short reference periods, telescoping may lead to substantial errors in estimates.
4.2.5 Avoid leading questions
A leading question is a question that is not asked in a neutral way, but that leads the respondents in the
direction of a specific answer. For example, the question

Do you agree with the majority of people that the quality of the health care in the country is falling?

contains a reference to the ‘majority of people’. It suggests it is socially undesirable not to agree. A
question can also become leading by including the opinion of experts in questions text, like in

Most doctors say that cigarette smoking causes lung cancer. Do you agree?

Questionnaire designers should watch out for loaded words that have a tendency of being attached to
extreme situations:

What should be done about murderous terrorists who threaten the freedom of good citizens and the safety of
our children?

Particularly, adjectives like ‘murderous’ and ‘good’ increase a specific loading of the question.
Opinion questions may address topics about which respondents have not yet made up their minds.
They may even lack sufficient information for a balanced judgment. Questionnaire designers may
sometimes provide additional information in the question text. Such information should be objective

and neutral, and should not influence respondents in a specific direction. Saris (1997) performed an
experiment showing the dangers of changes in the question text. He measured the opinion of the Dutch
about increasing the power of the European Parliament. Respondents were randomly assigned one of
these two questions:

Question 1: An increase of the powers of the European Parliament will be at the expense of the national
parliament. Do you think the powers of the European Parliament should be increased?

Question 2: Many problems cross national borders. For example, 50% of the acid rain in The Netherlands
comes from other countries. Do you think the powers of the European Parliament should be increased?

Of the respondents offered Question 1, 33% answered ‘yes’ and 42% answered ‘no’. Of the respondents
offered Question 2, 53% answered ‘yes’ and only 23% answered ‘no’. These substantial differences are
not surprising, as the explanatory text of Question 1 stresses a negative aspect and the text of
Question 2 stresses a positive aspect.
4.2.6 Avoid asking things people don’t know
A question text can be very simple and completely unambiguous, but still impossible to
answer. This may happen if you ask the respondents for facts they do not know. Here is an example:

How many hours did you listen to your local radio station in the last six months?

Respondents do not keep a record of all kinds of simple things happening in their lives. So, they can only
make a guess. This guess need not be an accurate one. Answering this question is made even
more complicated by the relatively long reference period.
4.2.7 Avoid sensitive questions
Sensitive questions should be avoided as much as possible. Sensitive questions address topics which
respondents may see as embarrassing. Such questions may result in inaccurate answers. Respondents
may refuse to provide information on topics like income, health, criminal behaviour, or sexual
behaviour. Respondents may also avoid giving an answer that is socially undesirable. Instead, they
may provide a response that is socially more acceptable.
Sensitive questions can be asked in such a way that the likelihood of response is increased and a more
honest response is facilitated. A first option is including the question in a series of less sensitive
questions about the same topic. Another option is making it clear in the question text that the
behaviour or attitude is not so unusual. Bradburn et al. (2004) give the following example:

Even the calmest parents sometimes get angry at their children. Did your children do anything in the past seven
days to make you angry?

A similar effect can be obtained by referring in question text to experts that may find the behaviour not
so unusual:

Many doctors now believe that moderate drinking of liquor helps to reduce the likelihood of heart attacks and
strokes. Have you drunk any liquor in the past month?

A question asking about numerical quantities (like income) can be experienced as threatening if an
exact value must be supplied. This can be avoided by letting the respondent select a range of values.
For example, instead of asking for the exact income, you ask people to choose an income category.
4.2.8 Avoid double questions (or double-barrelled questions)
A question must ask one thing at a time. If more than one thing is asked in a question, it is unclear
what the answer means. For example, the question

Do you think that people should eat less and exercise more?

actually consists of two questions: ‘Do you think that people should eat less?’ and ‘Do you think that
people should exercise more?’. Suppose someone thinks that people should not eat less, but should
exercise more; what answer must be given: yes or no? The solution to this problem is simple: the
question must be split into two questions, each asking one thing at a time.

4.2.9 Avoid negative questions


You must not ask questions in the negative as this is more difficult to understand for respondents.
Respondents may be confused by a question like

Are you against a ban on smoking?

Even more problematic are double-negative questions. They are a source of serious problems. Here is
an example:

Would you rather not use a non-medicated shampoo?

Negative questions can usually be rephrased such that the negative effect is removed. For example,
‘are you against …’ can be replaced by ‘are you in favour of …’.
4.2.10 Avoid hypothetical questions
It is difficult for people to answer questions about imaginary situations, as they relate to circumstances
they have never experienced. At best the answer is guesswork, and at worst it is a total lie. Here is an
example of a hypothetical question:

If you were the president of the country, how would you stop crime?

Hypothetical questions are often asked to get more insight in attitudes and opinions about certain
issues. However, little is known about processes in the respondent’s mind that lead to an answer to
such questions. So, one may wonder whether hypothetical questions really measure what a researcher
wants to measure.

4.3 Types of questions


Only the text of the question has been discussed until now. Another important aspect of a poll question is
the way in which the question must be answered. Several types of answers are possible. We discuss
the advantages and disadvantages of a number of question types: the open question, the closed
question (with one or more answers), and the numerical question. We also pay attention to combining
questions into a grid question.

4.3.1 Open question
An open question is a simple question to ask. It allows respondents to answer the question completely
in their own words. An open question is typically used in situations where respondents should be able
to express themselves freely. Open questions often invoke spontaneous answers. Open questions also
have disadvantages. The possibility always exists that a respondent overlooks a certain answer
possibility. Consider the following question from a readership poll:

In the last two weeks, which weekly magazines have you read?
………………………………………………………………………………………………………………………………………………………………….

Research in The Netherlands has shown that if this question is offered to respondents as an open
question, typically television guides are overlooked. If a list is presented containing all weeklies,
including television guides, many more people report having read TV guides.
Asking an open question may also lead to vague answers. Consider the following question:

What do you consider the most important aspect of your job?


………………………………………………………………………………………………………………………………………………………………….

To many respondents it will be unclear what kind of answer is expected. They will probably answer
something like ‘salary’. And what do they mean when they say this? It is important to get a high salary,
or a regular salary, or maybe both?
Processing the answers to open questions is cumbersome, particularly if the answer is written down
on a paper form. Entering such answers in the computer takes effort, and even more so if the written text
is difficult to read. Answers to open questions also take more disk space than answers to other
types of questions. Furthermore, analysing answers to open questions is not very straightforward. It is
often done manually because there is no intelligent software that can do this automatically.
Considering the potential problems mentioned above, open questions should be avoided wherever
possible. However, there are situations where there is no alternative. An example is a question asking
for the occupation of the respondent. A list containing all possible occupations would be very long. It
could easily have thousands of entries. All this makes it impossible to let respondents locate their
occupation in the list. The only way out is to ask for occupation by means of an open question.
Extensive, time-consuming automatic and/or manual coding procedures must be implemented to find
the proper classification code matching the description.
4.3.2 Closed question, one answer
A closed question is used to measure a qualitative variable. There is a list of possible answers
corresponding to the groups/categories of the variable. Respondents have to pick one possibility from
the list. Of course, this requires the list to contain all possible answers:

What is your present marital status?


 Never married
 Married
 Separated
 Divorced
 Widowed

Online questionnaires use radio buttons for closed questions with one possible answer. Only one radio
button can be selected to choose an option. Choosing another option automatically de-selects the
current option.
There will be a problem if respondents cannot find their answer. One way to avoid such a problem is to
add a category ‘Other’, with possibly also the option to enter the answer. An example is the question
below for listeners to a local radio station:

Which type of programs do you listen to most on your local radio station?
 Music
 News and current affairs
 Sports
 Culture
 Other
If other, please specify: ……………………………………………………………………………………………………………….

If the list with answer options is long, it will be difficult for the respondent to find the proper answer.
The closed question below is an example of a question with many options:

Which means of transport do you use most for travelling within your town?
 Walking
 Bicycle
 Electric bicycle
 Moped
 Scooter
 Motorcycle
 Car
 Bus
 Tram
 Metro / light-rail
 Other means of transport

Particularly in face-to-face and telephone polls, where the interviewer reads out all options, this may
cause problems. By the time the interviewer reaches the end of the list, the respondent has already
forgotten the first options in the list. This causes a preference for an answer near the end of the list.
This is called the recency effect.
Use of show cards may help respondents in face-to-face interviews. A show card contains the complete
list of possible answers to a question. Such a card can be handed over to the respondents who then can
pick their answer from the list.
In case of an online poll or a mail poll, the respondents have to read the list themselves. They do not
always do this very carefully, and may quickly lose interest. This leads to a preference for answers early in
the list. This is called the primacy effect.
A special kind of closed question is the rating scale question. You use such a question to measure the
opinion or attitude of a person with respect to a certain issue. Instead of giving a choice between two
categories (for example Agree and Disagree), a so-called Likert scale can be used. This is a set of
options giving respondents the possibility to express the strength of their opinion or attitude. The
Likert scale was invented by Rensis Likert in 1932. A Likert scale question often has five categories.
Here is an example:

Taking all things together, how satisfied or dissatisfied are you with life in general?
 Very dissatisfied
 Dissatisfied
 Neither dissatisfied, nor satisfied
 Satisfied
 Very satisfied

Rating scale questions can also have three options or seven options. Three options may restrict
respondents too much in expressing their feelings. Seven or more options may be too many. It will be
hard for respondents to find the option that is closest to their feelings.
Rating scale questions should have an odd number of options. There should be one neutral option in
the middle. This option is for those without an opinion about the particular issue. A drawback of this
middle option is that more respondents tend to choose this option, so that they can avoid having to
express an opinion. It can be seen as a form of satisficing. This term was introduced by Krosnick
(1991). It is the phenomenon that respondents do not do all they can to provide the correct answer.
Instead they attempt to give a more or less acceptable answer with minimal effort.
Sometimes a question cannot be answered because respondents simply do not know the answer. Such
respondents should have the possibility to indicate this on the questionnaire form. Forcing them to
make up an answer will reduce the validity of the data. It has always been a matter of debate how to
deal with the don’t know option. One way to deal with don’t know is to offer it as one of the options in a
closed question:

Do you remember for sure whether or not you voted in the last elections for the European Parliament of June 4,
2009?
 Yes, I voted
 No, I did not vote
 Don’t know

Particularly for a self-administered questionnaire such a question may suffer from satisficing.
Respondents seek the easiest way to answer the question by simply selecting the don’t know option.
Such behaviour can be avoided in face-to-face and telephone polls. Interviewers are trained in
assisting respondents to give a ‘real’ answer, and to avoid don’t know as much as possible. The option
is not explicitly offered, but is implicitly available. Only if respondents indicate they really do not know
the answer, the interviewer records this response as don’t know.
Another way to avoid satisficing is by introducing a filter question. This question asks whether
respondents have an opinion on a certain issue. And only if they say they have an opinion, they are
asked to specify their opinion in the subsequent question.
4.3.3 Closed question, more than one answer
The closed questions discussed up until now allow exactly one answer to be given. All answer
options have to be mutually exclusive and exhaustive. So respondents can always find one and only
one option describing their situation. Sometimes, however, there are closed questions in which
respondents must have the possibility to select more than one option. Here is an example:

What are your normal modes of transport to work (check all that apply)?
 Walking
 Bicycle
 Motorcycle
 Car
 Bus, tram
 Other mode of transport

The question asks for modes of transport to work. Respondents may use more than one mode of
transport for their journey to work: First they go to the railway station by bicycle, then they take the
train into town, and finally they walk from the railway station to the office. It is clear that more
answers must be possible. Therefore, respondents can check every option applying to them.

A closed question with more than one answer is sometimes called a check-all-that-apply question.
Often square check boxes are used to indicate that more than one answer can be given. Dillman et al.
(1998) have shown that such a question may lead to problems if the list of options is very long.
Respondents tend to stop after they have checked a few answers, and do not look at the rest of the list any
more. Therefore, too few options are checked. This can be avoided by using a different format for the
question:

What are your normal modes of transport to work?


Yes No
  Walking
  Bicycle
  Motorcycle
  Car
  Bus, tram
  Other mode of transport

Each check box has been replaced by two radio buttons, one for Yes and one for No. This approach
forces respondents to do something for each option. They either have to check Yes or No. So, they have
to go down the list option by option, and give an explicit answer for each option. This approach leads
to more options being checked, but it has the disadvantage that it takes more time to answer the question.
4.3.4 Numerical question
Another frequently occurring type of question is the numerical question. The answer to such a question is
simply a number. Examples are questions about age, income or prices. In most household survey
questionnaires there is a question about the number of members in the household:

How many people are there in your household?


__ __ people

The two separate dashes give a visual clue to the respondent as to how many digits are (at most)
expected. Numerical questions in electronic questionnaires may have a lower and upper bound built
in for the answer. This ensures that entered numbers are always within the valid range.
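
As an illustration of such a built-in range check, here is a minimal console-based sketch in Python; the bounds 1 and 20 are illustrative assumptions, not values prescribed by any particular interviewing package.

def ask_number(prompt, lower, upper):
    # Repeat the question until the respondent enters a whole number within the valid range.
    while True:
        answer = input(prompt + " ")
        if answer.isdigit() and lower <= int(answer) <= upper:
            return int(answer)
        print("Please enter a number between", lower, "and", upper)

household_size = ask_number("How many people are there in your household?", 1, 20)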
It should be noted that respondents are in many situations not able to give exact answers to numerical
questions, because they simply do not know the answer. An example is the following question:

In the last seven days, how many hours did you listen to your local radio station?
__ __ __ hours

An alternative may be to ask a closed question with a number of intervals as options:

In the last seven days, how many hours did you listen to your local radio station?
 0 – 1 hours
 1 – 2 hours
 2 – 5 hours
 5 – 10 hours
 10 hours or more

A special type of numerical question is a date question. Many polls ask respondents to specify dates, for
example date of birth, date of purchase of a product, or date of retirement. Here is an example:

What is your date of birth?


__ __ __ __ __ __
day month year

Of course, a date can be asked by means of an open question, but if used in interviewing software,
dedicated date questions offer much more control, and thus fewer errors will be made in entering a date.
4.3.5 Grid question
If you have a series of questions with the same set of possible answers, you can combine these
questions into a grid question. Such a question is sometimes also called a matrix question. A grid
question is a kind of table in which each row represents a single question and each column
corresponds to one possible answer option. Here is an example:

                                                               Excellent  Very good  Good  Fair  Poor
How would you rate the overall quality of the radio station?      ○          ○         ○     ○     ○
How would you rate the quality of the news programmes?            ○          ○         ○     ○     ○
How would you rate the quality of the sport programmes?           ○          ○         ○     ○     ○
How would you rate the quality of the music programmes?           ○          ○         ○     ○     ○

At first sight, grid questions seem to have some advantages. A grid question takes less space on the
questionnaire form than a set of single questions. Moreover, it provides respondents with a better
overview. Therefore it can reduce the time it takes to answer the questions. Couper et al. (2001)
indeed found that a matrix question takes less time to answer than a set of single questions.
According to Dillman et al. (2009) answering a grid question is a complex cognitive task. It is not
always easy for respondents to link a single question in a row to the proper answer in a column.
Moreover, respondents can navigate through the grid in several ways, row-wise, column-wise, or a
mix. This increases the risk of missing answers to questions, resulting in a higher item non-response.

Dillman et al. (2009) advise limiting the use of grid questions in online polls as much as possible. And if
they are used, the grid should not be too wide or too long. Preferably, the whole grid should fit on a
single screen. This is not so easy to realise as different respondents may have set different screen
resolutions on their computer screens. If respondents have to scroll, either horizontally or
vertically, they may easily get confused, leading to wrong or missed answers.
Several authors, see for example Krosnick (1991) and Tourangeau et al. (2004), express concern about
a phenomenon that is sometimes called straight-lining. This particularly occurs for grid questions in
online polls. Respondents take the easy way out by giving the same answer to all questions in the grid.
They simply check all radio buttons in the same column. Often this is the column corresponding to the
middle (neutral) response option. Here is an example of straight-lining:

                                                               Excellent  Very good  Good  Fair  Poor
How would you rate the overall quality of the radio station?      ○          ○         ●     ○     ○
How would you rate the quality of the news programmes?            ○          ○         ●     ○     ○
How would you rate the quality of the sport programmes?           ○          ○         ●     ○     ○
How would you rate the quality of the music programmes?           ○          ○         ●     ○     ○

Straight-lining the middle response option can be seen as a form of satisficing. It is a means of quickly
answering a series of questions without thinking. It manifests itself in very short response times. So,
short response times for grid questions (when compared to a series of single questions) are not always
a positive message. It can mean that there are measurement errors caused by satisficing.
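
A simple way to flag possible straight-lining in collected data is to check whether all rows of a grid received the same answer and whether the grid was completed suspiciously fast. The sketch below is a minimal illustration; the two-seconds-per-row threshold is an arbitrary assumption, not an established standard.

def possible_straight_lining(answers, seconds, min_seconds_per_row=2.0):
    # 'answers' holds the selected option for each row of one grid question.
    identical = len(set(answers)) == 1                        # same option checked in every row
    too_fast = seconds < min_seconds_per_row * len(answers)   # suspiciously short response time
    return identical and too_fast

print(possible_straight_lining(["Good", "Good", "Good", "Good"], seconds=3.5))   # True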
If a matrix question is used, much attention should be paid to its visual layout. For example, a type of
shading as in the examples above reduces confusion, and therefore also reduces item non-response.
Dynamic shading may help if a grid question is used in an online poll. Kaczmirek (2010) distinguishes
pre-selection shading and post-selection shading. Pre-selection shading comes down to changing the
background colour of a cell or row of the matrix question if the cursor is moved over it by the
respondent. Pre-selection shading helps the respondent to locate the proper answer to the proper
question. It is active before the answer is clicked. Post-selection shading means shading of a cell or row
in the grid after the answer has been selected. This feedback informs the respondent which answer to
which question was selected. Kaczmirek (2010) concluded that pre-selection and post-selection
shading of complete rows improves the quality of the answers. However, pre-selection shading of just
the cell reduced the quality of the answers.
Galesic et al. (2007) also experimented with post-selection shading. The font colour or the background
colour was changed immediately after respondents answered a question in the matrix. This helped
respondents to navigate, and therefore improved the quality of the data.

4.4 Question order


Once you have specified all questions, you have to include them in the questionnaire in the proper
order. A first aspect is grouping of questions. It is advised to keep questions about the same topic close
together. This will make answering questions easier for respondents, and therefore will improve the
quality of the collected data.
Another aspect is the potential learning effect. An issue addressed early in the questionnaire may make
respondents think about it. This may affect answers to later questions. This phenomenon played a role

in a Dutch Housing Demand Survey. People turned out to be much more satisfied with their housing
conditions if the question about satisfaction was asked early in the questionnaire. The questionnaire contained a
number of questions about the presence of all kind of facilities in and around the house (Do you have a
bath? Do you have a garden? Do you have a central heating system?). As a consequence, several people
realised that they lacked these facilities, and therefore became less and less satisfied with their
housing conditions.
Question order can affect the results in two ways. One is that mentioning something (an idea, an issue,
a brand) in one question can make people think of it while they answer a later question, when they
might not have thought of it if it had not been previously mentioned. In some cases this problem may
be reduced by randomising the order of related questions. Separating related questions by unrelated
ones might also reduce this problem, though neither technique will completely eliminate it.
Tiemeijer (2008) describes an example where the answers to a specific question were affected by a
previous question. The Eurobarometer (www.ec.europa.eu/public_opinion) is an opinion poll in all
member states of the European Union held since 1973. The European Commission uses this poll to
monitor the evolution of public opinion in the Member States. This may help in making policy decisions.
The following question was asked in 2007:

Taking everything into consideration, would you say that the country has on balance benefited or not from
being a member of the European Union?

It turned out that 69% of the respondents were of the opinion that the country had benefited from the
EU. A similar question was included at the same time in a Dutch opinion poll (Peil.nl). However, the
question was preceded by another question that asked respondents to select the most important
disadvantages of being a member of the EU. Among the items in the list were the fast extension of the EU,
the possibility of Turkey becoming a member state, the introduction of the Euro, the waste of money by
the European Commission, the loss of identity of the member states, the lack of democratic rights of
citizens, veto rights of member states, and possible interference of the European Commission with
national issues. As a result, only 43% of the respondents considered membership of the EU beneficial.
A third aspect of the order of the questions is that a specific question order can encourage people to
complete the questionnaire. Ideally, the early questions in a poll should be easy and pleasant to
answer. Such questions encourage respondents to continue the poll. Whenever possible, difficult or
sensitive questions should be asked near the end of the questionnaire. If these questions cause
respondents to quit, at least many other questions have been answered.
Another aspect of the order of questions is routing. Usually, not every question is relevant for every
respondent. For example, an election poll may contain questions for both voters and non-voters. For
the voters, there may be questions about party preference, and for non-voters there may be questions
about the reasons for not voting. Irrelevant questions may irritate people, possibly resulting in refusal
to continue. Moreover they may not be able to answer questions not relating to their situation. Finally,
it takes more time to complete a questionnaire if also irrelevant questions must be answered. To avoid
all these problems, you should include routing instructions in your questionnaire. Figure 4.1 contains
an example of a simple questionnaire with routing instructions.
There are two types of routing instructions. The first type is that of a jump instruction attached to an
answer option of a closed question. Question 4 has such instructions. If respondents answer Yes, they
are instructed to jump to question 6 and continue from there. If the answer to question 4 is No, they
are instructed to go to question 5.
Sometimes a routing instruction does not depend on just an answer to a closed question. It may
happen that the decision to jump to another question depends on the answer to several other

questions, or on the answer to another type of question. In this case, a route instruction may take the
form of an instruction between the questions. This is a text placed between questions. Figure 4.1
contains such an instruction between questions 2 and 3. If respondents are younger than 18, the rest
of the questionnaire is skipped.

Figure 4.1. A simple election poll with routing instructions


1. Generally speaking, how much interest would you say you have in politics?
 A great deal
 A fair amount
 Only a little
 No interest at all

2. What is your age (in years)? __ __ __

Answer questions below only if you are 18 years or older.

3. Of the issues that were discussed during the election campaign, which one was most important for you?
......................................................................................................................................................................................................................

4. Did you vote in the last parliamentary elections?


 Yes Go to question 6
 No Go to question 5

5. Why did you not vote?


 Not interested
 Had no time
 Forgot about it
 Too young
 Other reason
Go to question 9

6. Which party did you vote for?


 Christian-Democrat Party
 Social-Democrat Party
 Liberal Party
 Green Party
 Other party

7. Which other party did you consider (check all that apply)?
 Christian-Democrat Party
 Social-Democrat Party
 Liberal Party
 Green Party
 Other party

8. Do you think that voting for parliamentary elections should be compulsory, or do you think that people
should only vote if they want to?
 Strongly favour compulsory voting
 Favour compulsory voting
 Favour people voting only if they want to
 Strongly favour people voting only if they want to

9. Did you follow the election campaign in the media (check all that apply)?
 Yes, on radio
 Yes, on TV
 Yes, on internet

End of questionnaire

We already mentioned that routing instructions not only see to it that only relevant questions are
asked, but they also reduce the number of questions asked, so that the interview takes less time.
However, it should be remarked that many and complex route instructions increase the burden for the
interviewer. This complexity may be an extra source of possible errors.
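
To give an impression of how the routing instructions of figure 4.1 could be encoded in interviewing software, here is a minimal Python sketch; the function, the answer codes and the data structure are illustrative assumptions and do not reflect the syntax of any particular CAPI, CATI or online survey package.

def next_question(current, answers):
    # Routing rules of the questionnaire in figure 4.1.
    if current == 2 and answers.get(2, 0) < 18:
        return None                                # younger than 18: rest of the questionnaire is skipped
    if current == 4:
        return 6 if answers[4] == "Yes" else 5     # jump instruction attached to question 4
    if current == 5:
        return 9                                   # non-voters skip the questions about voting
    if current == 9:
        return None                                # end of questionnaire
    return current + 1                             # default: continue with the next question

# Example: a 20-year-old non-voter follows the route 1, 2, 3, 4, 5, 9.
route, q, answers = [], 1, {2: 20, 4: "No"}
while q is not None:
    route.append(q)
    q = next_question(q, answers)
print(route)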

4.5 Questionnaire testing


Before you can use a questionnaire for collecting data, you must test it. Errors in the questionnaire
may cause wrong questions to be asked and right questions to be skipped. Also errors in the questions
themselves may lead to errors in answers. Every researcher will agree that testing is important, but
this does not always happen in practice. Often there is no time to carry out a proper testing procedure.
An overview of some aspects of questionnaire testing is given here. More information can be found, for
example, in Converse and Presser (1986).
Questionnaire testing usually comes down to trying it out in practice. There are two approaches to do
this. One is to imitate a normal interview situation. Interviewers make contact with respondents and
interview them, like in a real interview situation for a poll. The respondents do not know it is just a
test, and therefore they behave like they would during a normal interview. If they knew it was just a
test, they could very well behave differently.
Another way to test a questionnaire is to inform respondents that they are part of a test. This has the
advantage that the interviewers can ask the respondents whether they have understood the questions,
whether things were unclear to them, and why they gave specific answers.
A number of aspects of a questionnaire should be tested. Maybe the most important aspect is the
validity of the question. Does the question really measure what the researcher wants it to measure? It
is not simple to establish question validity in practice. A first step may be to determine the meaning of
the question. It is important that the researcher and the respondent interpret the question in exactly
the same way. There are ample examples in the questionnaire design literature about small and large
misunderstandings. Converse and Presser (1986) mention a question about ‘heavy traffic in the
neighbourhood’, where the researcher meant trucks and respondents thought the question was about
drugs. Another question asked about ‘family planning’, where the researcher meant birth control and
respondents interpreted this as saving money for vacations.
The examples above make clear how important validity testing is. Research has shown that often
respondents interpret questions differently than the researcher intended. Also, if respondents do not
understand the question, they change the meaning of the question in such a way that it can be
answered by them.
Another aspect of questionnaire testing is to check whether questions offer sufficient variation in
answer possibilities. A poll question is not very interesting for analysis purposes if all respondents
give the same answer. It must be possible to explore and compare the distribution of the answers to a
question for several sub-groups of the population.
It should be noted that there are situations where a very skewed answer distribution may be interesting.
For example, De Fuentes-Merillas et al. (1998) investigated addiction to scratch cards in The
Netherlands. It turned out that only 0.24% of the adult population was addicted. Although this was a
very small percentage, it was important to have more information about the size of the group.
The meaning of a question may be clear, and it also may allow sufficient variation in answers, but this
still does not mean it can always be answered. Some questions are easy to ask, but difficult to answer. A
question like

In your household, how many kilograms of coffee did you consume in the last year?

is clear, but very hard to answer, because respondents simply do not know the answer, or can
determine the answer only with great effort. Likewise, asking for the net yearly income is not as simple
as it looks. Researchers should realise they may get only an approximate answer.
Many people are reluctant to participate in polls. And even if they cooperate, they may not be very
enthusiastic or motivated to answer the questions. Researchers should realise this may have an effect
on the quality of the answers given. The more interested respondents are, the better their answers will
be. One aspect of questionnaire testing is to determine how interesting questions are for respondents.
The number of uninteresting questions should be as small as possible.
Another important aspect is the length of the questionnaire. The longer the questionnaire, the larger
the risk of problems. Questionnaire fatigue may cause respondents to stop answering questions before
the end of the questionnaire is reached. A rule sometimes suggested in the Netherlands is that an
interview should not last longer than a class in school (50 minutes). However, it should be noted that
this also partly depends on the mode of interviewing. For example telephone interviews should take
less time than face-to-face interviews. And completing a questionnaire online should not take more
than 15 minutes.
Up until now testing was aimed at individual questions. However, also the structure of the
questionnaire as a whole has to be tested. Each respondent follows a specific route through the
questionnaire. The topics encountered en route must have a meaningful order for all respondents. One
way the researcher can check this is to read the questions out loud (instead of reading them silently).
While listening to this story, unnatural turns will become apparent.
To keep the respondent interested, and to avoid questionnaire fatigue, it is recommended to start the
questionnaire with interesting questions. Uninteresting and sensitive questions (gender, age, income)
should come at the end of the questionnaire. This way, potential problems can be postponed until the
end.
It should be noted that sometimes the structure of the questionnaire requires uninteresting questions
like gender to be asked early in the questionnaire. This may happen if they are used as filter questions.
The answer to such a question determines the route through the rest of the questionnaire. For
example, if a questionnaire contains separate parts for male and female respondents, first gender of
the respondent must be determined.
The growing potential of computer hardware and software has made it possible to develop very large
and complex CAPI, CATI or online questionnaires. To protect respondents from having to answer all
these questions, routing structures and filter questions see to it that only relevant questions are asked,
and irrelevant questions are skipped. It is not always easy to test the routing structure of these large
and complex questionnaires. Sometimes it helps to make a flowchart of the questionnaire. Figure 4.2
contains a small example. It is the same questionnaire as in figure 4.1.
An aspect that you can take into account when developing a questionnaire is the general well-being of
the respondents. Nowadays polls are conducted about a wide range of topics, including sensitive
topics like use of alcohol and drugs, homosexual relationships, marriage and divorce, maltreatment of
children, mental problems, depression, suicide, physical and mental handicaps and religious
experiences. Although you should apply the principle of informed consent for respondents, you may
wonder whether respondents feel as happy after the poll as before the poll, if sensitive topics like the
ones mentioned are addressed.

Figure 4.2. Flowchart of the routing structure of a questionnaire

[Flowchart of the questionnaire in figure 4.1: question 1 (interest in politics) → question 2 (age). If the
respondent is younger than 18, the questionnaire ends. Otherwise → question 3 (most important campaign
issue) → question 4 (did you vote?). If the answer is No → question 5 (why did you not vote?) →
question 9 (did you follow the election campaign in the media?) → end. If the answer is Yes → question 6
(which party did you vote for?) → question 7 (which other party did you consider?) → question 8 (should
voting be compulsory?) → question 9 → end.]
Testing a questionnaire may proceed in two phases. Converse and Presser (1986) suggest the first
phase to consist of 25 to 75 interviews. Focus is on testing closed questions. The answer options must
be clear and meaningful. All respondents must be able to find the proper answer. If the answer options
do not cover all possibilities, there must be a way out by having the special option ‘Other, please
specify …’.
To collect the experiences of the interviewers in the first phase, Converse and Presser (1986) suggest
letting them complete a small questionnaire with the following questions:
 Did any of the questions seem to make the respondents uncomfortable?
 Did you have to repeat any questions?
 Did the respondents misinterpret any questions?
 Which questions were the most difficult or awkward for you to ask? Have you come to dislike any
questions? Why?
 Did any of the sections in the questionnaire seem to drag?
 Were there any sections in the questionnaire in which you felt that the respondents would have
liked the opportunity to say more?
The first phase of questionnaire testing is a thorough search for essential errors. The second phase
should be seen as a final rehearsal. The focus is no longer on repairing substantial errors or on
trying out a completely different approach. This is just the finishing touch. The questionnaire is tested
in a real interview situation with real respondents. The respondents do not know that they are
participating in a test. The number of respondents in the second phase is also 25 to 75. This is also the
phase in which external experts on questionnaires could be consulted.

5 Selecting a sample
The theory of sampling for surveys and polls has been developed over a period of more than 100
years. The paradigm of probability sampling has been shown to work well in social research, official
statistics and market research. It allows you to draw well-founded, reliable and valid conclusions
about the state of the population that is investigated.
The basics of probability sampling were laid down by Horvitz and Thompson (1952) in their seminal
paper. They state that valid estimators of population characteristics can always be constructed,
provided samples are selected by means of probability sampling, and every element in the population
has a known and strictly positive probability of being selected. Moreover, under these conditions,
standard errors of estimates, and thus margins of errors, can be computed. Therefore it is possible to
establish the precision of estimates.
It is often said that a sample is a good sample because it is representative. Unfortunately it is usually
not clear what ‘representative’ means. Kruskal & Mosteller (1979a, 1979b and 1979c) explored the
literature and found many different meanings. Therefore, they advise not to use the term unless it is
explained what is meant by it. We define a representative sample as a sample in which the distribution
of each variable is (approximately) equal to its distribution in the population. So, as an example, the
fractions of males and females in the sample should be more or less equal to the fractions of males and
females in the population. A good way of obtaining a representative sample is to draw a probability
sample, where each object in the target population has the same probability of selection. Such a
sample is called a simple random sample.
A simple random sample is by far the most frequently used type of sampling design. It is a simple and
straightforward way of sampling. There are, however, many more sampling designs that all apply
probability sampling. Examples are sampling with unequal probabilities, stratified sampling, cluster
sampling and two-stage sampling. We will restrict ourselves to simple random sampling, but in some
situations we need to use two-stage sampling.

5.1 The sampling frame


How to draw a sample from a target population? How to select a number of people that can be
considered representative? You need a sampling frame for this. A sampling frame is a list of all people
in the target population. It must be clear for all persons in this list how you can contact them.
The choice of the sampling frame depends on the mode of data collection. For a face-to-face poll or a
mail poll you need addresses, for a telephone poll you need telephone numbers, and for an online poll
you must have a list of e-mail addresses.
A sampling frame can exist on paper (a card-index box for the members of a club, or a telephone
directory), or in a computer (a database containing names and addresses, such as a population
register). If such lists are not available, detailed geographical maps are sometimes used.
The sampling frame should be an accurate representation of the population. You run the risk of
drawing wrong conclusions from a poll if you select the sample from a sampling frame that differs
from the population. Figure 5.1 shows what can go wrong.
The first problem is under-coverage. This occurs if the target population contains objects that do not
have a counterpart in the sampling frame. Such objects can never be selected in the sample. An
example of under-coverage is poll where the sample is selected from a population register. Illegal
immigrants are part of the population, but will never be encountered in the sampling frame. Another
example is an online poll, where respondents are selected via the internet. There will be under-
coverage due to people without internet access. Under-coverage can have serious consequences. If
objects outside the sampling frame systematically differ from the objects in the sampling frame,

estimates of population parameters may be seriously biased. A complicating factor is that it is often
not very easy to detect under-coverage.

Figure 5.1. Coverage problems in a sampling frame

The second sampling frame problem is over-coverage. This refers to the situation where the sampling
frame contains objects that do not belong to the target population. If such objects end up in the sample
and their data are used in the analysis, estimates of population characteristics may be affected. It
should be rather simple to detect over-coverage as these objects should experience difficulties in
answering the questions in the questionnaire.


Some countries, like The Netherlands and the Scandinavian countries, have a population register. Such
a register contains all permanent residents in the country. For each person, it contains name and
address and also the values of other variables, like date of birth, gender, marital status, and country of
birth. Such population registers are an ideal sampling frame, because there are usually few coverage
problems.
Another frequently used sampling frame is a Postal Address File (PAF). Postal service agencies in
several countries maintain databases of all delivery points in the country. Examples are the
Netherlands, the United Kingdom, Australia and New Zealand. Such databases contains postal
addresses of both private houses and companies. Typically, a Postal Address File can be used to draw a
sample of addresses, and therefore also of households.
It is sometimes not clear whether addresses in a Postal Address File belong to private houses or to
companies. If the aim is to select a sample of households, there may be over-coverage caused by
companies in the file.
You can use a telephone directory to select a sample of telephone numbers. However, telephone
directories may suffer from serious coverage problems. Under-coverage occurs because many people
have unlisted numbers, and some will have no telephone at all. Moreover, there is a rapid increase of

the use of mobile phones. In many countries, mobile phone numbers are not listed in directories. This
means that young people with only a mobile phone are missing in the sampling frame, and therefore
may be seriously under-represented in a poll. A telephone directory also suffers from over-coverage,
because it contains telephone numbers of shops, companies, etc. Hence, it may happen that people are
contacted that do not belong to the target population. Moreover, some people may have a higher than
assumed contact probability, because they can be contacted both at home and in the office.
You can avoid under-coverage problems of telephone directories by applying random digit dialling
(RDD). It means you use a computer algorithm to generate valid random telephone numbers. One way
to do this is to take an existing number from the telephone directory and to replace its final digit by a
random other digit. An RDD algorithm can produce both listed and unlisted numbers, including
numbers of mobile phones. So you have complete coverage.
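
As an illustration of the directory-seeded variant described above, here is a minimal sketch that replaces the final digit of an existing directory number by a random other digit; the seed numbers are fictitious and only serve as an example.

import random

def rdd_number(directory_number):
    # Replace the final digit by a randomly chosen different digit.
    last = directory_number[-1]
    new_last = random.choice([d for d in "0123456789" if d != last])
    return directory_number[:-1] + new_last

seeds = ["0205551234", "0205559876"]        # fictitious directory numbers
print([rdd_number(number) for number in seeds])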
Random digit dialling also has drawbacks. In some countries it is not clear what an unanswered
number means. It can mean that the number is not in use. This is a case of over-coverage. No follow-up
is needed. It can also mean that someone simply does not answer the phone. This is a case of non-
response, which has to be followed up. Another drawback of RDD is that there is no information at all
about non-respondents. This makes correction for non-response very difficult (see also chapter 9
about non-response).
If you want to conduct an online poll, you need a sampling frame with e-mail addresses. Unfortunately,
for many populations such sampling frames are not available. Exceptions are a poll among students of
a university (where each student has a university-supplied e-mail address) or a poll among employees
of a large company.

Example 5.1. How not to select your sample

It is not always easy to find a proper sampling frame. Therefore there are ample examples of polls
with bad sampling procedures. Such polls often lead to wrong conclusions. Here we give two
examples.
A poll in a shopping mall
A local radio station conducted a radio listening poll. To quickly collect a lot of data, it was decided
to send interviewers to the local shopping mall on Saturday afternoon. There were a lot of people
there. So it was not too difficult to get a large number of completed forms. Analysis of the collected
data led to a surprising conclusion: almost no one listened to the sports programme that was
broadcasted every Saturday afternoon. Of course, this conclusion was not so surprising. The only
people interviewed were those in the shopping centre at Saturday afternoon. And they were not
listening to the sports programme. In fact, this reduced the target population of the poll from all
inhabitants to only those shopping on Saturday. So, the sample was not representative.
A poll in a magazine
A publisher distributes a free door-to-door magazine each week. The editors of the magazine
wanted to know how many inhabitants read the magazine. So they decided to carry out a poll. In a
specific week they included a questionnaire form in the magazine. People were asked to complete
the questionnaire and to return it to the publisher. Of course, one of the most important questions
was whether the magazine is read. The returned forms showed that everyone read the magazine.
The publisher was happy. However, he did not realise there was something wrong. There were a lot
of people who did not read the magazine and threw it away immediately. They did not encounter
the questionnaire form, and therefore they could not answer the questions. In fact, this sampling
approach restricted the target population to only those reading the magazine.

Problems can occur if the units in the sampling frame are different from those in the target population.
Typical is the case where the target population consists of persons and the sampling frame of
addresses. This may, for example, happen if a postal address file is used as a sampling frame. Suppose
persons have to be selected with equal probabilities. A naïve way to do this would be to randomly
select a sample of addresses, and to draw one person at random at each selected address. At first sight,
this seems reasonable, but it ignores the fact that now not every person has the same selection
probability: members in large families have a smaller probability of being selected than members of
small families. So members of small families will be over-represented.
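
This can be made concrete with a small calculation. If n addresses are sampled from a frame of N addresses, and one person is then drawn at random at each selected address, a person's selection probability is (n / N) × (1 / household size). The sketch below uses made-up numbers purely for illustration.

def person_selection_probability(sampled_addresses, total_addresses, household_size):
    # Probability that the address is sampled times the probability that this person is drawn within it.
    return (sampled_addresses / total_addresses) * (1 / household_size)

print(person_selection_probability(1000, 100000, 1))   # 0.01   for a one-person household
print(person_selection_probability(1000, 100000, 4))   # 0.0025 for a member of a four-person household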
A second case is a poll in which households are to be selected with equal probabilities, and the
sampling frame consists of persons. This can happen if the sample is selected from a population
register. Now large households have a larger selection probability than smaller households, because
larger households have more people in the sampling frame. In fact, the selection probability of a
household is proportional to the size of the family.

5.2 Sample selection


There are many ways in which you can select a sample from a population, but there is only one way to
do it in a meaningful way, and that is by means of probability sampling. Every object in the population
must have a positive (non-zero) probability of selection, and you must know the values of all these
probabilities. If these conditions are satisfied, you can construct valid and consistent estimators of
population characteristics. Moreover, you can compute the margins of error of your estimates, so that
you can indicate how good or bad your estimates are. In summary: to do a poll in a scientifically
meaningful way, you have to apply probability sampling.
The obvious way to select a probability sample is to assign the same probability of selection to each
object in the target population. This is called a simple random sample. This is the most frequently used
sampling design. It is, however, possible to use other sampling designs, in which the selection
probabilities are not the same. We will come back to this.
Drawing a simple random sample requires a selection procedure that indeed gives each object in the
population the same probability of selection. Objects must be selected without prejudice. Human
beings are not able to select such a sample. They just cannot pick a number of objects giving each
element the same probability of selection. Conscious or unconscious preferences always seem to play a
role. An illustration of this phenomenon is an experiment in which a sample of 413 persons were
asked to pick an arbitrary number in the range from 1 up to and including 9. The results are
summarised in figure 5.2.

Figure 5.2. Picking a random number

If people behaved like a random number generator, each number should have been mentioned
with approximately the same frequency of 11%. This is definitely not the case. People seem to have a
high preference for the number ‘seven’. More than 40% mentioned it. Apparently ‘seven’ is more
random than any other number. The numbers ‘one’ and ‘two’ are almost never mentioned. The
conclusion is that people cannot select a random sample.
Samples have to be drawn by means of an objective probability mechanism that guarantees that every
element in the population has exactly the same probability of being selected. Such a mechanism will be
called a randomiser. A randomiser is a device (electronic or mechanical) with the following properties:
• It can be used repeatedly.
• It has N possible outcomes that are numbered 1, 2, …, N, where N is known.
• It produces one of the N possible outcomes every time it is activated.
• Each time it is activated, all possible outcomes are equally probable.
The main property of a randomiser is that its outcome is unpredictable to the highest possible degree.
All methods of prediction, with or without knowledge or use of past results, are equivalent.
The perfect randomiser does not exist in practice. There are, however, devices that come close to a
randomiser. They serve their purpose as a randomiser. The proof of the pudding is in the eating: the people living in the principality of Monaco do not pay income tax, as the randomisers in the casino of Monaco provide sufficient income for the principality.
A coin is a simple example of a randomiser. The two outcomes ‘heads’ and ‘tails’ are equally probable.
Another example of a randomiser is a dice, see figure 5.3. Each of the numbers 1 to 6 has the same
probability, provided the dice is ‘fair’.

Figure 5.3. Example of a randomiser: dice

A coin can only be used to draw a sample if the population consists of two elements. A dice can be used
only for a population of six elements. This is not very realistic. Target populations are usually much
larger than that. Suppose, you must select a sample of size 50 from a target population of size 750. You cannot use a coin or a dice. What you can use is a 20-sided dice. See figure 5.4.

Figure 5.4. A 20-sided dice

Such a dice contains the numbers from 1 to 20. If you subtract 10 from the outcomes above 10, and then interpret an outcome of 10 as 0, the dice produces each of the digits from 0 to 9 twice. Three throws of such a dice
produce a three-digit number in the range from 0 to 999. If you ignore the outcome 0 and all outcomes
over 750, you obtain a sequence number of an object. This object is selected in the sample. By
repeating this procedure 50 times, you obtain a sample of size 50.
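The dice procedure is easy to imitate in software. The following Python sketch is purely illustrative: it replaces the physical dice by the standard pseudo-random generator, and the function names are our own.

import random

def throw_digit():
    # One throw of a 20-sided dice, converted to a digit from 0 to 9:
    # subtract 10 from outcomes above 10, and interpret a (remaining) 10 as 0.
    outcome = random.randint(1, 20)
    if outcome > 10:
        outcome -= 10
    return 0 if outcome == 10 else outcome

def draw_sequence_number(population_size):
    # Three digit throws form a number from 0 to 999; ignore 0 and
    # everything above the population size (here 750).
    while True:
        number = 100 * throw_digit() + 10 * throw_digit() + throw_digit()
        if 1 <= number <= population_size:
            return number

# A sample of 50 sequence numbers from a population of 750
# (numbers drawn twice are skipped, as explained in section 5.3):
sample = set()
while len(sample) < 50:
    sample.add(draw_sequence_number(750))
print(sorted(sample))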
In practice, you will draw larger samples from much larger populations, like a sample of 1,000 people
from the target population of potential voters in a country. It is not realistic to use dice for this. A more
practical approach is to use a calculator, a spreadsheet program, or some other computer software.
Some calculators have a function to generate random values from the interval [0, 1). Every value
between 0 and 1 is possible. The value 0 may occur, but not the value 1.

Example 5.2. Generating random numbers with a calculator

The CASIO FX-82 calculator has a button RAN#. Each time you press this button,
a new random value from the interval [0, 1) appears on the display.
A session in which RAN# button was pressed 20 times, produced the following
values:

0.360 0.319 0.778 0.753 0.521 0.652 0.609 0.812 0.057 0.756
0.205 0.465 0.023 0.128 0.394 0.381 0.802 0.031 0.415 0.065

To select a random sample, you need random numbers in the range from 1 to N, where N denotes the
size of the population. At first sight, random values from the interval [0, 1) seem not useful for this.
However, there is a simple procedure to transform random values from [0, 1) into random numbers
from 1 to N:
(1) Multiply the random value from [0, 1) by the population size N. This produces a value in the interval [0, N). The value 0 may occur, but the value N cannot.
(2) Round this value down to the nearest integer value by removing the decimal part. This produces
an integer number in the range from 0 to N – 1.
(3) Add 1 to this number. This produces a number in the range from 1 to N.
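In code, these three steps take only a few lines. The Python sketch below is only an illustration; the call random.random() plays the role of the calculator's RAN# button:

import random

def random_number(N):
    ran = random.random()     # random value from the interval [0, 1)
    value = ran * N           # step (1): value in the interval [0, N)
    integer = int(value)      # step (2): remove the decimal part, giving 0 to N - 1
    return integer + 1        # step (3): add 1, giving a number from 1 to N

# For example, ten random numbers in the range from 1 to 750:
print([random_number(750) for _ in range(10)])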
You can also use a spreadsheet program to select a random sample. We will show how to do this with
MS Excel. First, fill the column A with random values from [0, 1). The number of values should be
equal to the sample size. You do this with the function RAND(). Next, you create random numbers in
column B by applying the procedure in steps (1) to (3).

Figure 5.5. Generating random numbers in a spreadsheet

Suppose, the population size is 18,000 and you want to select a sample of size 10. First, you fill the cells
A1, A2, …, A10 with random values using RAND(). Next, you compute random numbers in cells B1,
B2, …, B10 with the computations =1+INT(A1*18000), =1+INT(A2*18000), etc. Figure 5.5 shows
the result of this process.

5.3 Sampling with and without replacement


If you throw a dice a number of times, it is possible that a certain number appears more than once. The
same applies to any other randomiser: if you produce a sequence of random numbers, some numbers
may occur more than once. The consequence would be that an object is selected more than once in the
sample. This is not very meaningful. It would mean repeating the measurements for this object. If the
questions are answered a second time, the answers will not be different. Therefore, sampling without
replacement is preferred. This is a way of sampling in which each object can appear at most only once
in a sample.
A lotto machine is a good example of sampling without replacement. A selected ball is not replaced in
the population. Therefore, it cannot be selected a second time.

Figure 5.4. A lotto machine: selecting a sample without replacement

The procedure for selecting a sample without replacement is straightforward. A sequence of random
numbers in the range from 1 to N is produced using some randomiser. If a number is produced that
was already generated previously, it is simply ignored. The process is continued until the sample size
has been reached.
A roulette wheel is a good example of sampling with replacement. At every turn, each of the possible
numbers can again be produced with the same probability. For example, it is possible (but not very
likely) that in a sequence of 10 turns the number 7 is produced 10 times.

Figure 5.5. A roulette: selecting a sample with replacement

In the subsequent sections, we will look more closely at three different sampling designs:
• A simple random sample. You select a sample with equal probabilities and without replacement.
• A systematic sample. This is a quick (and sometimes dirty) way to select a simple random sample.
• A two-stage sample. This is a two-step sampling approach in which you first select a sample of
addresses, and then a random person at each selected address.

5.4 A simple random sample
A simple random sample is closest to what comes to mind for many people when they think about
random sampling. It is similar to a lottery. It is also one of the simplest ways to select a random
sample. The basic property is that each object in the target population has the same probability of
being selected in the sample.
A simple random sample can be selected with and without replacement. We consider only sampling
without replacement in this section. This is more efficient than sampling with replacement. Sampling
without replacement guarantees that objects cannot be selected more than once in the same sample.
All sample objects will be different.

Figure 5.6. Simple random sampling

We denote the size of the target population by N. To select a simple random sample from this
population, you need a set of random numbers in the range from 1 to N. All numbers must have the
same probability of selection. And all numbers must be different.
If the sample is small, you can use a calculator. Such a calculator must have a facility to generate
random values. Usually such calculators produce random values between 0 and 1 (where 0 can occur
and 1 cannot occur). You can use recipe 5.1 to select a sample with such a calculator.

Recipe 5.1. Drawing a simple random sample with a calculator

Step 1: Draw a random value RAN from the interval [0, 1).
Step 2: Multiply this value by the size of the target population N. This produces a value in the
interval [0, N).
Step 3: Remove the decimal part of the value. This produces an integer number in the range
from 0 to N – 1.
Step 4: Add 1 to this integer number. This produces an integer number in the range from 1 to N.
Step 5: If you have this number already in the sample, ignore it, go back to step 1, and make a new
attempt.
Step 6: If you did not yet reach the sample size, go back to step 1, and select another number.
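For readers who prefer code over button pressing, recipe 5.1 can also be written as a small Python function. This is only a sketch, under the assumption that the standard pseudo-random generator is an acceptable randomiser:

import random

def simple_random_sample(N, n):
    sample = []
    while len(sample) < n:            # step 6: repeat until the sample size is reached
        ran = random.random()         # step 1: random value from [0, 1)
        number = int(ran * N) + 1     # steps 2-4: integer number from 1 to N
        if number not in sample:      # step 5: ignore numbers already in the sample
            sample.append(number)
    return sample

# For example, a simple random sample of 50 objects from a population of 750:
print(simple_random_sample(750, 50))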

If you want to draw a bigger sample, the manual procedure in recipe 5.1 may be too cumbersome. An
alternative is to draw the sample with a spreadsheet program. Recipe 5.2. describes how to do this
with Excel.

Recipe 5.2. Drawing a simple random sample with a spreadsheet

Step 1: Fill column A with the sequence numbers of the objects in the target population. These are the numbers from 1 to N, where N is the size of the population. You can use the
function ROW() for this.
Step 2: Fill column B with random values from the interval [0, 1). You can use the function
RAND() for this. The spreadsheet fragment on the left below shows an example.

Step 3: Select Options in the Tools menu and open the tab Calculation. Set Calculation to Manual.
Step 4: Select columns A and B, order this block on column B. The result is something like in the
spreadsheet fragment below on the right.
Step 5: The sample consists of the set of numbers in the first part of column A. If you want a
sample of, for example, 10 objects, take the first 10 numbers.
(Spreadsheet fragments: after step 2 on the left, after step 4 on the right.)

For much larger samples of, say, a few thousand objects from very large target populations, the
spreadsheet approach is still cumbersome. It may be better to develop special software for this
situation. There are also websites that can select samples for you. An example of such a website is
www.aselector.nl. It can generate samples of up to 1,000 objects.
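If you decide to write such software yourself, the sorting idea of recipe 5.2 is easy to program. The Python sketch below is again only an illustration of the principle:

import random

def simple_random_sample_by_sorting(N, n):
    # Pair every sequence number with a random key, sort on the keys,
    # and keep the sequence numbers of the first n pairs (recipe 5.2).
    keyed = [(random.random(), i) for i in range(1, N + 1)]
    keyed.sort()
    return [i for (key, i) in keyed[:n]]

# For example, a sample of 1,000 persons from a population of 18,500:
sample = simple_random_sample_by_sorting(18500, 1000)
print(sample[:10])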

5.5 A systematic sample


Manually selecting a simple random sample from a long list can be a lot of work. Particularly if the
objects in the sampling frame are not numbered, it is not so easy to find, say, object 1341. For such
situations, you can consider using systematic sampling as an alternative for simple random sampling.

Figure 5.7. A systematic sample

Systematic sampling is convenient if you have to select a sample by hand. You could think of drawing a
sample from a file of cards (with addresses or telephone numbers). Another example is drawing a sample from a paper version of a telephone directory.

The first step in selecting a systematic sample is to determine the number of objects in the target
population (the population size N) and the number of objects in the sample (the sample size n). If you
know these numbers, you can compute the step length. The step length is the length of the jump you
use to jump through the sampling frame.

Example 5.3. Computing the step length for a systematic sample

You want to conduct a radio listening poll in the town of Harewood. There is a list with all 9590
addresses of households in the town. You want to select a systematic sample of 500 addresses. Then
the step length is equal to 9590 / 500 = 19.18.

The next step in selecting a systematic sample is to compute the starting point. The starting point is
used to determine the sequence number of the first object in your sample. To obtain the starting point,
you first draw a random value from the interval that runs from 0 to the step length. You can easily do
this with a calculator by multiplying a random value from [0, 1) by the step length. If you now round
up the starting value to the nearest integer number, you get the sequence number of the first element
in your sample.

Example 5.4. Computing the first object in a systematic sample

You want to conduct a radio listening poll in the town of Harewood. There is a list with all 9590
addresses of households in the town. You want to select a systematic sample of 500 addresses. Then
the step length is equal to 9590 / 500 = 19.18.
The starting value is obtained by taking a random value from the interval [0, 19.18). First, you use
your calculator to obtain a random value between 0 and 1. Suppose, the result is 0.261. Next, you
multiply this value by the step length 19.18. The result is 0.261 × 19.18 = 5.00598. So, the starting
value is 5.00598.
To obtain the first element in the sample, you round up the starting value to the nearest integer, and
this is 6. So the first object in your sample is element 6.

Together, the starting value and the step length fix the sample. You obtain the first object in the sample
by rounding up the starting value. You obtain the next sample object by adding the step length to the
starting value and rounding up the result to the nearest integer. You continue doing this until you
reach the end of the sampling frame.

Example 5.4. Selecting a systematic sample

You want to conduct a radio listening poll in the town of Harewood. There is a list with all 9590
addresses of households in the town. You want to select a systematic sample of 500 addresses. Then
the step length is equal to 9590 / 500 = 19.18. Suppose, the starting value is equal to 5.00598. By
continuously adding the step length, you obtain the following series of values:
5.00598, 24.18598, 43.36598, … , 9537.46598, 9556.64598, 9575.82598.
Rounding up these values to the nearest integer produces the sequence numbers of the sample
objects:
6, 25, 44, … , 9538, 9557, 9576.
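Putting the step length, the starting value, and the rounding together, the whole systematic selection can be expressed in a short Python sketch (illustrative only; the probability of drawing a starting value of exactly 0 is negligible):

import math
import random

def systematic_sample(N, n):
    step_length = N / n                             # e.g. 9590 / 500 = 19.18
    starting_value = random.random() * step_length  # random value from [0, step length)
    sample = []
    value = starting_value
    while value < N:
        sample.append(math.ceil(value))             # round up to the nearest integer
        value += step_length
    return sample

# For example, 500 addresses out of a list of 9,590:
print(systematic_sample(9590, 500))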

It may happen that dividing the population size by the sample size already results in an integer number. Then the step length is also an integer number. This simplifies all computations, since rounding is no longer necessary. For example, if you want to draw a systematic sample of size 500 from a list of 9,500 objects, the step length is equal to 9500 / 500 = 19. You can draw the starting value from the integer numbers in the range from 1 to 19. Suppose, you draw 5. Then this is the sequence number of the first element in the sample. All other sequence numbers in the sample are obtained by repeatedly adding the step length 19: 5, 24, 43, …, 9448, 9467, 9486.
A warning is in order if you decide to draw a systematic sample from a sampling frame. Systematic
sampling assumes that the order of the objects in the sampling frame is completely arbitrary. There
should be no relation between the order of the objects and the variables you are measuring in your
poll.
A simple example illustrates the danger of selecting a systematic sample. Suppose, you want to
conduct a poll about living conditions in a newly built neighbourhood. All streets have 20 houses,
numbered from 1 to 20. The sampling frame is ordered by street and house number. Suppose, you
draw a systematic sample with a step length of 20. This means you select one house in each street.
Then there are two possibilities: (1) if the starting value corresponds to a corner house, you only have corner houses in your sample, and (2) if it does not, you have no corner houses at all in your sample. So
each possible sample is far from representative (with respect to housing conditions and related
variables). Either you have too many or too few corner houses in your sample. This may affect the
validity of the conclusions drawn from your poll.
If you select a systematic sample from a list sorted by street name or postal code, or from a telephone
directory, it is often not unreasonable to assume there is no relationship between the order of objects
in the list and the target variables of the survey. Then a systematic sample is more or less representative, and you may treat it as similar to a simple random sample.

5.6 A two-stage sample


If you have to draw a sample of persons, the ideal situation would be to have a sampling frame
consisting of persons, such as a population register. Often such frames are not available or accessible.
An alternative can be to use an address list. To select a sample of persons from a list of addresses, you
have to use a two-step approach: first you draw addresses, and then you draw one or more persons at
each selected address. This is called a two-stage sample.
A question that comes up is how many persons to draw at each selected address. You can take all
persons at the address that belong to the target population, but you can also limit it to one person per
address. In many situations it is not very meaningful to interview several persons at one address.
Often behaviour, attitudes and opinions of persons at the same address are more or less similar. If this
is the case, interviewing more persons at an address will not give you more information. It is more
effective to spread your sample over more addresses and interview one person per address.
So, the advice is to draw one person at each selected address. How do you do this? The first step would
be to make a list of all persons at the selected address that belong to the target population. For an
election poll, for example, this could be all persons living at the address and having an age of 18 years
or older. From this list you have to select one person randomly. If the poll is conducted face-to-face or
by telephone, the interviewers have to do it. And if the poll is a mail poll or online poll, one of the
members of the household has to do it. To keep things simple, the selection procedure must be easy.
Often, the first birthday procedure is applied: select the person with the next birthday. This assumes
that there is no relationship between the date of someone’s birthday and the topic of the poll.
You have to realise that by drawing one person at each selected address, the selection probabilities of
persons are not equal anymore. If we assume that addresses correspond to households (disregarding that now and then more than one household may live at the same address), each household has the same probability of being selected in the sample. But persons in a large
household have a smaller probability than persons in small households. If you compute estimates of
population characteristics, you have to take this into account by correcting for the unequal selection
probabilities. If you omit this, you run the risk of drawing wrong conclusions from your poll, because
persons from large households are under-represented. Therefore, your sample is not representative.
Note that to be able to correct for the unequal selection probabilities, you have to record the number
of household members (as far as they belong to the target population) of each selected household.

Example 5.5. Selection probabilities in a radio listening poll

You want to conduct a radio listening poll in the town of Harewood. You have a list consisting of the
addresses of all 9,590 households in Harewood. You want to select 209 addresses. At each selected
address you want to interview one person. The selection probability of each household is
209 / 9590 ≈ 0.022.
The selection probability of a person at an address is determined by the number of people living
there (from the age of 12 years). If we denote the number of persons by A, the selection probability
is equal to
1 / A.
The total probability of a person to be selected in the sample, is obtained by multiplying the two
probabilities above. This results in the selection probability
209 / (9590 × A).
This expression indeed shows that not every person has the same selection probability. The
probability for someone in a single-person household is 209 / 9590 = 0.022. For someone in a 2-person household, the probability is 209 / (2 × 9590) = 209 / 19180 = 0.011, which is half as large.
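The computation in this example is easy to automate. The Python sketch below computes the selection probability for a few household sizes, together with the inverse of that probability; the inverse is the weight you need later to correct the estimates (see chapter 8). The household sizes are just illustration values:

n_addresses = 209      # sampled addresses
N_addresses = 9590     # addresses in the sampling frame

for A in (1, 2, 3, 4):                        # number of eligible persons at the address
    probability = (n_addresses / N_addresses) * (1 / A)
    weight = 1 / probability                  # correction weight for a person at this address
    print(A, round(probability, 4), round(weight, 1))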

Using the sample data, you have to compute estimates of population characteristics. How you do this,
depends on the sampling design. This is explained in chapter 8. We also show in that chapter how to
compute the corresponding margins of error. Last but not least, we will describe in that chapter how
you can determine the sample size of your poll.

5.7 Sampling in practice


You usually have several possibilities for drawing a sample from your target population. We will give a number of examples in this section. These examples describe the situation in The Netherlands. The starting point is the fictitious town of Rhinewood. This town has 18,500 inhabitants. You want to conduct an election poll. Therefore you define the target population to consist of all inhabitants of age 18 years and
older. You want to select a simple random sample of 1,000 persons.
5.7.1 Sampling from a population register
The Netherlands has a population register. It is called the Gemeentelijke Basisadministratie
Persoonsgegevens (GBA). This register is maintained by the municipalities. It contains for each
inhabitant name, address, date of birth, and some other variables like gender, marital status, place of
birth and country of birth. It is technically easy to select a sample from the population register. There
are, however, practical complications.
In the first place, there are restrictions. There is only one organisation in The Netherlands that has
access to the population register by law, and that is Statistics Netherlands, the national statistical
institute of The Netherlands. For other surveys and polls, the municipality has to give permission for
selecting a sample. This permission is only given for scientific research projects. And even in case of
scientific research, municipalities are sometimes reluctant to cooperate. Moreover, municipalities may
ask a fee for the work they have to do.
A second complication may be that municipal employees lack knowledge of and experience with
sample selection. So, it should be explained to them in simple terms how to select a random sample.
One way to do this, is to use the so-called A-number. This is a unique identification number that is
assigned to each individual in the population register. The A-number is only used by municipalities for
internal purposes. It is a 10-digit random number. You can use this number to select a sample. For
example, you can ask the municipality to extract only records for which the second digit is equal to ‘4’.
This gives you a 10% sample of the population. A next step could be to filter out only those records
that meet certain criteria. For example, for your election survey you use only records of persons with
an age of at least 18 years.
If the town is very large, and you do not want a very large sample, you could decide to extract only
records for which, for example, the 2nd digit of the A-number is equal to ‘4’ and the 9th digit is equal to
‘8’. This gives you a 1% sample from the population.

Example 5.6. Selection of a sample from the population register

Suppose, you have to select a sample of 1,000 persons from the population of the town of
Rhinewood. The town has a population of 18,500 people.
You obtain permission to use the population register. You instruct the employees in the town hall to
extract only those records for which the final digit of the A-number is equal to ‘7’. This gives you a
10% sample. The number of records in the sample will be approximately 1,850.
The next step is to remove the records of persons that are too young (less than 18 years). From the
remaining set of records, you only use the first 1,000 records.
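A sketch of the kind of extraction the municipal employees would have to run is given below in Python. The record layout (field names such as a_number and age) is hypothetical; the code only illustrates the filtering logic of the example:

def select_register_sample(records, digit="7", minimum_age=18, sample_size=1000):
    selected = []
    for record in records:                     # each record is a dict with hypothetical fields
        if record["a_number"][-1] != digit:    # keep records whose final A-number digit is '7'
            continue                           # (roughly a 10% subset of the register)
        if record["age"] < minimum_age:        # remove persons that are too young
            continue
        selected.append(record)
        if len(selected) == sample_size:       # use only the first 1,000 remaining records
            break
    return selected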

By using the population register as a sampling frame, your sample can only contain people that are registered in the town. This includes people that are temporarily out of town. It excludes people from somewhere else that temporarily live in the town. And it also excludes illegal immigrants that may be living in town.
5.7.2 Sampling from an address file
There are companies in The Netherlands that maintain address files. An example is the Postal Address File (PAF) offered by the postal service agency PostNL. Such a database contains postal addresses of both private houses and companies where post can be delivered. Typically, the file can be used to draw
a sample of addresses of households. It is possible to specify additional selection criteria, for example
with respect to the type of address (private house or company).
It is important that such an address file is up-to-date. If this is not so, there may be discrepancies between
the file and reality. There may be houses in the file that do not exist anymore, because they were
demolished. This is a case of over-coverage. And there may be newly built houses that have not yet
been included in the file. This is a case of under-coverage.

Example 5.7. Fragment of an address file

Here is an example of a short fragment of an address file. It contains some addresses in the town of
Rhinewood:

2394AP 278 Rijndijk Rhinewood


2394AP 280 Rijndijk Rhinewood
2394AP 280 A Rijndijk Rhinewood
2394AP 280 B Rijndijk Rhinewood
2394AR 1 Molenlaan Rhinewood
2394AR 2 Molenlaan Rhinewood
2394AR 3 Molenlaan Rhinewood
2394AR 4 Molenlaan Rhinewood
2394AR 5 Molenlaan Rhinewood
2394AR 5 A Molenlaan Rhinewood
2394AR 6 Molenlaan Rhinewood
2394AR 8 Molenlaan Rhinewood
2394AR 10 Molenlaan Rhinewood
2394AR 12 Molenlaan Rhinewood

Each line contains an address. The line starts with the postcode. The next field contains the house
number, followed by a possible extension (a letter). Such house numbers with extensions are the
result of building new houses between existing houses. The next field in the address record contains
the name of the street. The final field is for the name of the town.
Each record also contains a field indicating whether an address is a house, a company, or something
else. So it is possible to sample only households.

5.7.3 Sampling from a postcode file


Some countries, among which The Netherlands, have postcode files. Each record in such a file is not an
address, but a postcode. A postcode record contains information about the addresses having the
particular postcode. See figure 5.8 for an example. It is in principle possible to transform a postcode
file into an address file, but you can encounter problems. It is not always clear how many, and which,
addresses belong to a postcode.

The address file generated from a postcode file may also suffer from over-coverage. Some addresses
may not be addresses of households but addresses of companies. Moreover, people living at a specific address may not belong to the target population, because they are only temporarily staying in town.
You should take into account over-coverage when you take a sample from the file. A lot of addresses
may be unusable. So you have to start with a larger sample than you finally want to have.

Example 5.8. Fragment of a postcode file

Here is an example of a short fragment of a postcode file. It contains some postcodes and
associated addresses in the town of Rhinewood:

2394AM Rijndijk 194 – 218 Rhinewood


2394AN Rijndijk 226 – 252 Rhinewood
2394AP Rijndijk 254 – 280 Rhinewood
2394AR Molenlaan 1 – 5 Rhinewood
2394AS Molenlaan 2 – 16 Rhinewood
2394AT Groenestein 1 – 33 Rhinewood
2394AT Groenhof 2 – 44 Rhinewood
2394AV Groenestein 2 – 20 Rhinewood
2394AW Groenestein 22 – 52 Rhinewood

Each record starts with a postcode. The next field contains the name of the street having this
postcode. This is followed by the range of house numbers having the postcode. Note that in The
Netherlands, even numbered houses have a different postcode than odd numbered houses. So the
range 1 – 5 means 1, 3, 5. And the range 2 - 16 means 2, 4, 6, 8, 10, 12, 14, 16. The final field in a
record contains the name of the town.
A problem with a postcode file like this is that it is not clear whether there are houses having house
numbers with extensions. For example, the range 1 – 5 in the Molenlaan relates to four houses
instead of three: 1, 3, 5, and 5A. When transforming such a postcode file into an address file, you
overlook these house numbers, and therefore they will not be selected in the sample.

Of course, the postcode file should be up-to-date. If not, it may contain addresses of houses that do not exist anymore, or it may not contain houses that have recently been built.
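Transforming a postcode record into separate addresses mainly means expanding the house number range, keeping the odd/even rule in mind. A minimal Python sketch (which, as noted above, cannot recover house numbers with letter extensions such as 5A):

def expand_postcode_record(postcode, street, first, last, town):
    # Odd and even house numbers never share a postcode, so the step is 2:
    # the range 1-5 expands to 1, 3, 5 and the range 2-16 to 2, 4, ..., 16.
    return [(postcode, number, street, town) for number in range(first, last + 1, 2)]

# For example, postcode 2394AS covers Molenlaan 2-16:
for address in expand_postcode_record("2394AS", "Molenlaan", 2, 16, "Rhinewood"):
    print(address)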
5.7.4 Sampling from a telephone directory
If you want to conduct a poll by telephone, a telephone directory seems to be the obvious sampling
frame. Since such a directory not only contains telephone numbers, but also names and addresses, it
can also be used as a sampling frame for a face-to-face or a mail poll.
A telephone directory is not a perfect sampling frame. There are a number of coverage problems. We
mention some problems with Dutch telephone directories:
• A telephone directory mainly contains landline telephone numbers, and no mobile telephone
numbers. Not everyone has a landline telephone. Particularly young people often only have a
mobile telephone. Most mobile telephone numbers cannot be found in the directory. This causes
under-coverage.
• Even if someone has a landline telephone, there is no guarantee it is listed in the telephone
directory. The reason can be privacy or security concerns. This also causes under-coverage.
• There is a do-not-call register in The Netherlands. You include your telephone number in such a
register if you do not want to receive telemarketing calls. Although interviewing over the
telephone is not the same as selling products or services, many market researchers avoid calling
people in this register. If you do not include these people in your poll, it may increase under-coverage problems.
• Many listed numbers may not be numbers of private households, but of companies, schools, and
other organisations. If a sample of persons is to be drawn, these numbers are cases of over-
coverage.

Example 5.9. Fragment of a telephone directory

Below is an example of a short fragment of a telephone directory. It contains some telephone


numbers and associated addresses in the town of Rhinewood.
Each record starts with a telephone number, followed by a name and an address. Note that there
are at least three addresses of companies in this fragment: ‘Aannemingsbedrijf’ denotes building
contractor, and ‘Administratiekantoor’ denotes administrative office. There is no guarantee that the
other addresses relate to private households.

071-5663255 Aalders, J W Chopinlaan 35 2394GK Rhinewood
071-3419413 Aalders, V M van Beethovenlaan 8 2394HC Rhinewood
071-3415742 Aalst, L van Chopinlaan 9 2394GK Rhinewood
071-3413286 Aannemingsbedrijf Sjardijn Staringstraat 2 2394ES Rhinewood
071-3413211 Aar, A J J vd Marsstraat 28 2394ND Rhinewood
071-3413223 Aarts, R Jacob Reviusstraat 69 2394VL Rhinewood
071-3419883 Abdulrasoul, R K J P Heijestraat 63 2394XW Rhinewood
071-3413252 Abspoel, C Potgieterlaan 63 2394VC Rhinewood
071-3413361 Abswoude, G J van Marsstraat 10 2394ND Rhinewood
071-3414206 Adank, A I Bilderdijklaan 16 2394EL Rhinewood
071-3412669 Adema, H J G Steenbokstraat 5 2394PE Rhinewood
071-3416219 Administratiekantoor Fransen Rijndijk 306/A 2394CH Rhinewood
071-3416747 Administratiekantoor Schepper Rijndijk 209/A 2394CC Rhinewood
071-3415287 Administratiekantoor Visser Corellistraat 4/A 2394GZ Rhinewood
071-3412142 Adrichem Eelco & Caroline Rijndijk 39/B 2394AC Rhinewood

Because of the substantial amount of over-coverage in a telephone directory, you cannot use all
selected addresses in your poll. So, if you want to realise a specific sample size, you should start your
selection process with a larger sample size.
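As a simple worked example: suppose you want 1,000 usable addresses of private households, and you expect (say) 30% of the listed numbers to be companies or otherwise unusable. Then you should select roughly 1,000 / 0.7 ≈ 1,430 numbers. The 30% figure is only an illustration; in practice you would base it on experience with the directory at hand.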

6 Collecting data
The questionnaire form of your poll must be filled in for every object in the sample. If you have a
sample of persons, it is clear which persons must answer the questions. For a poll among households it
is less clear. You have to find one or more persons in each selected household for this. For a poll among
companies it may even be more difficult to find a proper company employee who can do the job. If the
questionnaire is large and complex, you might even need more than one employee to obtain all
requested information. This chapter focuses on polls among persons. There are various ways to let
people fill in a questionnaire form. These are called modes of data collection:
• Face-to-face: Interviewers visit the sample persons at home. The interviewers ask the questions
and the sample persons give the answers.
• By telephone: Interviewers call the sample persons by telephone. The interviewers ask the
questions (by telephone) and the sample persons give the answers.
• By mail: Questionnaires are sent to the sample persons by mail. They are asked to complete the
questionnaire forms, and to return them by mail to the researcher.
• Online: a link to the website with the (digital) questionnaire is sent to the sample persons. They are
asked to go to this website, and to fill in the questionnaire online.
So you can choose between several modes of data collection. Which one to use for your poll? Two
important aspects will probably determine your choice: quality and costs. Face-to-face and telephone
interviewing make use of interviewers. Therefore, these modes of data collection are called
interviewer-assisted. Interviewers can persuade people to participate in the poll, and they can also
assist them in formulating the correct answers to questions. Therefore, the collected data in
interviewer-assisted polls are usually of better quality. But a price has to be paid in terms of costs:
interviewers are expensive.
Mail and online polls do not employ interviewers. Therefore, these modes of data collection are called
self-administered. The respondents are completely on their own when filling in the questionnaire form.
This may lead to satisficing. This is the phenomenon that respondents do not do all they can to give the
correct answer. Instead they give a more or less acceptable answer with minimal effort. Consequently,
the quality of the answers in a self-administered poll may not be very good. Of course, the costs of a
mail poll or an online poll are much lower than that of a face-to-face or a telephone poll.
We often call all data collection activities the fieldwork. In fact, this term refers to interviewers going
into the field for a face-to-face poll. We will, however, use the term also for other modes of data
collection.
We will describe various modes of data collection in more detail in this chapter. Section 6.1 will be
about traditional modes of data collecting that use paper questionnaires. Section 6.2. discusses modes
of data collection that use digital questionnaires. This is called computer-assisted interviewing. Section
6.3. is devoted to online data collection.

6.1 Traditional data collection


Traditional data collection refers to modes of data collection that use a paper questionnaire. There are
three forms of traditional data collection: mail interviewing, telephone interviewing, and face-to-face
interviewing.
Mail interviewing is the least expensive of the three data collection modes. You send a copy of the paper
questionnaire to the objects (e.g. persons, households, or companies) selected in the sample. They are
invited to answer the questions and to return the completed questionnaire form to the researcher.
There are no interviewers. Therefore, it is a cheap mode of data collection. Data collection costs only
involve mailing costs (letters, postage, envelopes). A possible advantage of a mail poll can be that the
absence of interviewers is experienced as less threatening for potential respondents. As a
consequence, respondents are more inclined to answer sensitive questions correctly.
The absence of interviewers also has a number of disadvantages. They cannot provide additional
explanation or assist the respondents in answering the questions. This may cause respondents to
misinterpret questions, and this has a negative impact on the quality of the collected data. Also, it is
not possible to use show cards. Show cards are typically used for answering closed questions. Such a
card contains the list of all possible answers to a question. It allows respondents to read through the
list at their own pace, and select the answer that reflects their situation or opinion.
Mail polls put high demands on the design of the paper questionnaire. It should be clear to all
respondents how to navigate through the questionnaire and how to answer questions.
Since the persuasive power of the interviewers is absent, response rates of mail polls tend to be low.
Of course, reminder letters can be sent, but this makes the poll more expensive, and it is often only
partially successful. All too often, the poll documents (questionnaires, letters) end up in the pile of old newspapers.
In summary, the costs of a mail poll are relatively low, but you pay a price in terms of quality: response
rates tend to be low and also the quality of the collected data is often not very good. However, Dillman
(2007) believes that good results can be obtained by applying his Tailored Design Method. This is a set
of guidelines for designing and formatting questionnaires for mail polls. This method pays attention to
all aspects of the data collection process that may affect response rates or data quality.
Face-to-face interviewing is the most expensive of the three data collection modes. Interviewers visit
the sample persons at home. Well-trained interviewers will be successful in persuading reluctant
persons to participate in the poll. Therefore, response rates of face-to-face polls are usually higher
than those of mail polls. The interviewers can also assist respondents in giving the right answers to the
questions. This often results in data of better quality. However, the presence of interviewers can also
be a drawback. Research suggests that respondents are more inclined to answer sensitive questions
correctly if there are no interviewers in the room.
The researcher may consider sending a letter announcing the visit of the interviewer. Such an
announcement letter can also give additional information about the poll, explain why it is important to participate, and assure respondents that the collected information is treated confidentially. As a result, the
respondents are not taken by surprise by the interviewers. An announcement letter may also contain
the telephone number of the interviewer. This makes it possible for the respondent to make an
appointment for a more appropriate day and/or time. Of course, the respondent can also use this
telephone number to cancel the interview.
Response rates of face-to-face polls are higher than those of mail polls, and the quality of the
collected data is better. But a price has to be paid literally: face-to-face interviewing is much more
expensive. A team of interviewers has to be trained and paid. Moreover, they have to travel a lot to
visit the selected persons, and this costs time and money.
A third mode of data collection is telephone interviewing. Interviewers are also needed for this mode,
but not as many as for face-to-face interviewing. They do not lose time traveling from one respondent to the next. They can remain in the call centre of the research organisation and conduct more interviews in the same amount of time. Therefore, the interviewer costs are lower. An advantage
of telephone interviewing over face-to-face interviewing is that often respondents are more inclined to
answer sensitive questions, because the interviewer is not present in the room.
Telephone interviewing also has some drawbacks. Interviews cannot last too long and questions may
not be too complicated. Another complication can be the lack of a proper sampling frame. Telephone
directories may suffer from severe under-coverage because more and more people do not want their
telephone number to be listed in the directory. Another development is that increasingly people
replace their landline phone by a mobile phone. Mobile phone numbers are not listed in directories in
many countries. For example, according to Cobben & Bethlehem (2005) only between 60% and 70% of
the Dutch population can be reached through a telephone directory.
Telephone directories may also suffer from over-coverage. For example, if the target population of the
poll consists of households, only telephone numbers of private addresses are required. Telephone
numbers of companies must be ignored. It is not always clear whether a listed number refers to a
private address or a company address (or both).
A way to avoid under-coverage problems of telephone directories is applying random digit dialling (RDD) to generate random phone numbers. A computer algorithm computes valid random telephone numbers. Such an algorithm is able to generate both listed and unlisted numbers. So, there is complete coverage. Random digit dialling also has drawbacks. In some countries it is not clear what an unanswered number means. It can mean that the number is not in use. This is a case of over-coverage,
for which no follow-up is needed. It can also mean that someone simply does not answer the phone,
which is a case of non-response, which has to be followed up. Another drawback of RDD is that there is
no information at all about non-respondents. This makes correction for non-response very difficult
(see also chapter 9 about non-response).
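The core of random digit dialling is nothing more than attaching random digits to valid prefixes. The Python sketch below is purely illustrative: the area codes and the number format (a Dutch-style area code followed by seven random digits) are assumptions for the example, not a description of any real RDD system.

import random

def random_phone_number(area_codes=("071", "070", "020")):
    # A known area code followed by seven random digits, so listed and
    # unlisted numbers are generated with the same probability.
    area = random.choice(area_codes)
    subscriber = "".join(str(random.randint(0, 9)) for _ in range(7))
    return area + "-" + subscriber

# For example, ten numbers to be dialled:
print([random_phone_number() for _ in range(10)])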
The fast rise of the use of mobile phones has not made telephone interviewing easier. More and more
landline phones are replaced by mobile phones. A landline phone is a means to contact a household
whereas a mobile phone makes contact with an individual person. Therefore, the chances of contacting
any member of the household are higher for landline phones. And if persons can only be contacted
through their mobile phones, it is often in a situation not fit for conducting an interview. Also, it was
already mentioned that sampling frames in many countries do not contain mobile phone numbers.
And a final complication is that in countries such as The Netherlands people often switch from one
phone company to another. As a consequence they get a different phone number. For more
information about the use of mobile phones for interviewing, see for example Kuusela et al. (2006).
The choice of the mode of data collection is not an easy one. It is usually a compromise between
quality and costs. In a large country like the United States it is almost impossible to collect data by
means of face-to-face interviewing. It would require so many interviewers, who would have to do so much traveling, that the costs would be very high. Therefore it is not surprising that telephone interviewing emerged
there as a major data collection mode. In a very small and densely populated country like The
Netherlands face-to-face interviewing is much more attractive. Coverage problems of telephone
directories and low response rates also play a role in the choice for face-to-face interviewing. More
about data collection issues can be found in Couper et al. (1998).

6.2 Computer-assisted interviewing


Collecting data can be a complex, costly and time-consuming process, particularly if you want to have
high quality data. One of the problems of traditional data collection is that the completed paper
questionnaire forms usually contain errors. Consequently, you must spend substantial resources on ‘cleaning’ these forms. This is also called data editing. Extensive data editing is required to obtain data
of acceptable quality. Fortunately, rapid developments of information technology since the 1970’s
have made it possible to use microcomputers for data collection. Thus, computer-assisted interviewing
(CAI) was born. The paper questionnaire was replaced by a computer program containing the
questions to be asked. The computer took control of the interviewing process, and it also checked
answers to questions on the spot.

Like traditional interviewing, computer-assisted interviewing has different modes of data collection.
The first mode of data collection emerging in history was computer-assisted telephone interviewing
(CATI). Couper and Nicholls (1998) describe how it was developed in the United States in the early
1970’s. The first nationwide telephone facility for surveys was established as early as 1966. The initial idea was not the implementation of computer-assisted interviewing but simplifying sample management.
These systems evolved in subsequent years into full featured CATI systems. Particularly in the United
States, there was a rapid growth of the use of these systems. CATI systems were little used in Europe
until the early 1980’s.
Interviewers in a CATI survey operate a computer running interview software. When instructed to do so by the software, they attempt to contact a selected person by telephone. If this is successful and the
person is willing to participate in the survey, the interviewer starts the interviewing program. The first
question appears on the screen. If this is answered correctly, the software proceeds to the next
question on the route through the questionnaire, etc.
Many CATI systems have a tool for call management. Its main function is to offer the right phone
number at the right moment to the right interviewer. This is particularly important for cases in which
the interviewer has made an appointment with a respondent for a specific time and date. Such a call
management system also has facilities to deal with special situations like a busy number (try again
after a short while) or no answer (try again later). This all helps to increase the response rate as much
as possible.
The emergence of small portable computers in the 1980’s made computer-assisted personal
interviewing (CAPI) possible. It is a form of face-to-face interviewing where interviewers take their
laptop computer (or tablet) to the home of the respondents. There they start the interview program
and record the answers to the questions.
The first experiments with CAPI were carried out in The Netherlands in 1984. It turned out the
interviewers were able to handle the hardware and the software. Moreover, respondents did not
object to their personal data being entered in a computer. There were no ‘big brother’ effects. The
Labour Force Survey was the first large survey to be carried out with CAPI. There were no MS-DOS or
Windows computers yet. The laptop was an EPSON PX-4 running under CP/M. The screen could only
contain 8 lines of text, where each line could consist at most of 40 characters. At night, interviewers
connected their laptop to a telephone and modem. The batteries were recharged, and the collected
data were automatically uploaded to Statistics Netherlands. More about the early years of CAPI at
Statistics Netherlands can be found in Bethlehem & Hofman (2006).

Figure 6.1. The EPSON PX-4 laptop computer

The computer-assisted mode of mail interviewing also emerged. It is called computer-assisted self-
interviewing (CASI), or sometimes also computer-assisted self-administered questionnaires (CASAQ).

The digital questionnaire is sent to the respondents. They answer the questions, and send the answers
back to the survey agency. Early CASI applications used diskettes or a telephone and modem to send
the questionnaire, but nowadays it is common practice to download it from the internet. The answers
are returned electronically in the same fashion.
An early application of CASI was the Telepanel, see Saris (1998). The Telepanel was founded in 1986. It
was a panel of 2,000 Dutch households who agreed to regularly complete questionnaires with the
computer equipment provided to them by the research organisation. A home computer was installed
in each household. It was connected to the telephone with a modem. It was also connected to the television in the household so that it could be used as a monitor. After a diskette was inserted into the home computer, it automatically established a connection with the survey agency to exchange information (downloading a new questionnaire or uploading the answers to the current questionnaire).
Panel members had agreed to complete a questionnaire each weekend.
Application of computer-assisted interviewing for data collection has three major advantages:
• It simplifies the work of interviewers. They do not have to pay attention to choosing the correct
route through the questionnaire. This is all taken care of by the interviewing software. Therefore,
interviewers can concentrate more on asking questions, and assisting respondents in getting the
right answers.
• It improves the quality of the collected data, because answers can be checked and corrected by the
software during the interview. If an error is detected, it is immediately reported on the screen, and
interviewer and respondent together can correct the error. This is more effective than error
treatment in traditional polls, where errors in paper forms could only be detected afterwards, in
the office of the researcher. The respondent was not available any more to correct the problem.
• Data is entered in the computer already during the interview, resulting in a clean record, so no
more subsequent data entry and data editing is necessary. This considerably reduces time needed
to process the collected data, and thus improves the timeliness of the research results.
It should be said that computer-assisted interviewing also has some disadvantages. One important
disadvantage is that it requires computers. Supplying a large number of interviewers with laptops or tablets is expensive.

6.3 Online interviewing


The rapid development of the internet in the 1990’s has led to a new mode of data collection. Some call it
computer-assisted web interviewing (CAWI). The questionnaire is offered to the respondents through
the internet. Therefore, such a poll is sometimes also called a web poll or an online poll. In fact, such an
online poll is a special type of a CASI survey.
At first sight, online polls have a number of attractive properties. Now that so many people are
connected to the internet, it is a simple means of getting access to a large group of potential
respondents. Furthermore, questionnaires can be distributed at very low costs. No interviewers are
needed, and there are no mailing and printing costs. Finally, polls can be launched very quickly. Little
time is lost between the moment the questionnaire is ready and the start of the fieldwork.
Consequently, it is a cheap and fast means to get access to a large group of people.
However, online polls also have some serious drawbacks. These drawbacks are mainly caused by
under-coverage (not everyone has access to internet) and the lack of proper sampling designs (often
self-selection is applied). Because of the increasing popularity of online polls and the associated
methodological problems, a special chapter is devoted to this mode of data collection (chapter 10).

7 Checking and correcting data
After having finished the fieldwork for your poll, you have a (hopefully large) set of completed
questionnaire forms. These forms contain the data for your analysis. Unfortunately, you cannot start
right away with the analysis. Interviewers and respondents make errors when answering questions and filling in forms. You should not analyse this ‘dirty data’ as it may lead to wrong conclusions. First, you have to check the collected data for errors. And if you detect errors, you have to correct them. The process of detecting and correcting errors is also called data editing.

This chapter is about data editing. Section 7.1 describes possible sources of errors. Section 7.2 explains how you can detect errors. And section 7.3 describes some techniques for correcting errors.

7.1 Sources of errors


You approach only a sample of objects for your poll and not the complete target population. Therefore
it is impossible to compute population characteristics exactly. You have to rely on estimates of these
characteristics. Generally, these estimates differ from the true value. Each new random sample is different, and thus it will produce a different estimate. Fortunately, if your sample size is not too small, all these estimates will be close to the true value. If your sample is based on probability sampling, you can compute how large the deviation from the true value can at most be. This is the margin of error of
the estimate.

The deviation caused by sampling is also called the sampling error. You have this error under control.
For example, you can reduce the sampling error by increasing the sample size. However, there are also
other sources of error you do not have under control. Three of these sources are described here:
under-coverage, non-response and measurement errors.

7.1.1 Under-coverage
Under-coverage occurs if objects in the target population do not have a corresponding entry in the
sampling frame. These objects will never be selected in your sample. If these objects systematically
differ from objects in the sampling frame, estimates of population characteristics may be invalid
because they are biased. One example of under-coverage is using a telephone directory as a sampling
frame. People without listed telephone numbers (for example a lot of young people with only a mobile
phone) will never be in your poll. Another example is an online poll. People without internet will never
be selected in the sample. Because of this phenomenon, particularly the elderly will be under-
represented in such polls.

7.1.2 Non-response
Non-response occurs if objects in the sample (and that belong to the target population) do not provide
the requested information. The main causes of non-response are non-contact (nobody is home),
refusal, and not-able (e.g. due to illness or language problems). If the non-respondents differ in a
systematic way from the respondents, you run a serious risk of drawing the wrong conclusions from
your poll. An example is an election poll. Research has shown that respondents in a poll are more
likely to vote in an election than non-respondents. As a consequence, voters will be over-represented
in such a poll, and this leads, for example, to a too high estimate of the election turnout. Non-response
is a serious problem in polls. Therefore, we discuss it in more detail in a separate chapter, and this is
chapter 9.

7.1.3 Measurement errors
Problems can occur in the process of asking and answering questions. This may lead to incorrect
answers. If the given answers differ from the correct answers, we talk about measurement errors.
Measurement errors can have many different causes. One important cause is satisficing. Correctly
answering questions in a poll requires a substantial cognitive effort. Respondents may be initially
motivated to do so, but they are likely to become tired of it in the course of the interview. Interest in
answering questions will fade away. If the interview takes a long time to finish, these respondents may become impatient or distracted. Consequently, they will put less energy into answering questions. According to Krosnick (1991), respondents will be less thoughtful about the meaning of questions, they search their memories less thoroughly, they integrate information less carefully, and they may
select an answer option more haphazardly. Satisficing comes in many different forms:
• Response order effects. If the list of possible answers to a closed question is long, respondents in a
self-administered poll (mail or online) tend to choose an answer early in the list. This is called a
primacy effect. Face-to-face and telephone polls may suffer from a recency effect, i.e. respondents
tend to select an answer near the end of the list.
• Acquiescence. Respondents tend to agree with statements in questions, regardless of their content.
This typically may occur for opinion questions. Research seems to suggest that acquiescence is
more of a problem in self-administered polls than in interviewer-assisted polls.
• Endorsing the status quo. If respondents are asked to give their opinion about changes, they simply
(without thinking) select the answer to keep everything the same. This is easier than having to
think about change. Endorsing the status quo is more of a problem in self-administered surveys.
• Selecting the middle option. If respondents are offered a middle response option for a neutral
response in a rating scale question (e.g. a Likert scale), they tend to select this option. It is an easy
way out for those not wanting to express an opinion.
• Non-differentiation. If respondents have to answer a series of questions with the same set of
possible answers, they tend to select the same answer for all these questions irrespective of the
question content. This effect is even more pronounced for grid questions, where respondents
select all answers in the same column. This is called straight-lining. Non-differentiation is typically
a problem in self-administered surveys.
• Don’t know. There is a dilemma for handling don’t know in polls. On the one hand, this option should be explicitly offered, as persons may really not know the answer to a question. On the
other hand, if don’t know is available, many respondents will select it as an easy way out. In case of
a CAPI or CATI survey, it is also possible to offer don’t know implicitly. The interviewer only reads
out the substantive options, but if the respondent insists he does not know the answer, the
interviewer can record the answer as don’t know. This works well, but it is not possible for a mail
poll, and it may be difficult to implement for an online poll.
 Arbitrary answer. Respondents not wanting to think about the proper answer, may decide to pick
just an arbitrary answer. This type of satisficing typically occurs for check-all-that-apply questions.
It is more of a problem in self-administered polls than in interviewer-assisted polls.
Another cause of measurement errors is socially desirable answers for questions about potentially
sensitive topics. Respondents may give answers that will be viewed by others as more favourable. This
particularly happens for sensitive questions about topics like sexual behaviour and use of drugs. If the
true answer would not make the respondents look good, they will refuse to answer or give a different
answer. According to De Leeuw (1992), the effects of socially desirable answers are stronger in
interviewer-assisted polls. Respondents tend to give more truthful answers in self-administered polls.

A final cause of measurement errors to be mentioned here is memory errors. They occur in questions
requiring recall of events that happened in the past. Respondents tend to forget events, particularly
when they happened a long time ago. Recall errors become more severe as the reference period grows
longer. Moreover, important events, more interesting events and more frequently occurring events
will be remembered better than other events.
Recall questions may also suffer from telescoping. This occurs if respondents report events as having
occurred either earlier or later than they actually did. As a result an event is incorrectly reported
within the reference period, or incorrectly excluded from the reference period. Bradburn et al. (2004)
note that telescoping leads more often to overstating than to understating the number of events.
Particularly for short reference periods, telescoping may lead to substantial errors in estimates.

Example 7.1. Contacts with the family doctor.

In the 1981 Health Survey of Statistics Netherlands, respondents had to report contacts with their
family doctor over the last three months. Memory effects were investigated by Sikkel (1983).
It turned out that the percentage of unreported contacts increased with time. The longer ago an
event took place, the more likely it is that it will be forgotten. The percentage of unreported events
for this question increased on average by almost 4% per week. Over the total period of three
months, about one quarter of the contacts with the family doctor were not reported.

7.2 Checking for errors


The collected questionnaire forms are ‘dirty’ as they contain errors. These errors must be corrected. A
first step is detecting these errors. It is useful to distinguish three types of errors: domain errors,
consistency errors, and routing errors.
7.2.1 Domain errors
Each question has a domain (or range) of valid answers. An answer outside this domain is considered
an error. It is a good idea to check the data for these domain errors. For numerical questions, a domain
error is easily detected: it is any answer falling outside the allowable range.
For questions asking for values or quantities, you can sometimes specify improbable as well as
impossible answers. For example, if you ask respondents for their age, a value of 199 would certainly
be unacceptable. So this value must be corrected. A value of 110 is unlikely but not impossible. So in
the end your decision could be not to change the answer.
Domain errors occur if paper questionnaires are used. In case of computer-assisted interviewing, the
software usually does not allow entering an answer outside the valid domain. So it is not possible to
make a domain error in a CAPI, CATI, or an online poll.
For a closed question, the answer has to be chosen from a list of answer options. If you use a
paper questionnaire, problems occur if a respondent selects no options at all, or selects more than one
option. In case of computer-assisted interviewing it is usually not possible to make such an error in the
answer to a closed question.
For open questions, any text is accepted as an answer. Therefore it is impossible to make a domain
error.

7.2.2 Consistency errors
Consistency errors occur when the answers to two or more questions contradict each other. Each
question may have an answer in its valid domain, but the combination of answers may be impossible
or unacceptable. This is called a consistency error. A check for consistency errors is called a consistency
check.
Suppose you conduct a radio listening poll, and you ask respondents for their age and their marital
status. If the target population consists of all people from the age of 13 years, an answer of 14 years is
acceptable. If someone answers that he is married, that is also acceptable. But if the same person is 14
years old and married, that is impossible. People of that age cannot be married (in the Netherlands). So
this combination of answers is an inconsistency.
A relation check determines whether the answers to the questions involved constitute a valid
combination. As an example, valid combinations for the questions age and marital status are
 Age under 16 years and not married
 Age of at least 16 years and not married
 Age of at least 16 years and married.
Any invalid combination of answers is called a consistency error. When a consistency error is detected,
it is not always obvious which of the questions involved caused the problem. A correction may be necessary in
one, two, or even more questions. Moreover, resolving one inconsistency may produce another. So, it is
easier to detect consistency errors than to correct them.
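
As an illustration, the sketch below shows how the domain check of section 7.2.1 and the age/marital status consistency check could be programmed for a batch of records. It is a minimal sketch in Python; the record structure, the field names and the age limits are our own illustrative assumptions, not part of any particular interviewing package.

def domain_error(record):
    # Domain check: age must lie in the range of acceptable answers (here 13 to 110 years).
    return not (13 <= record["age"] <= 110)

def consistency_error(record):
    # Consistency check: people younger than 16 cannot be married (in the Netherlands).
    return record["age"] < 16 and record["marital_status"] == "married"

records = [
    {"age": 14, "marital_status": "married"},       # consistency error
    {"age": 199, "marital_status": "not married"},  # domain error
    {"age": 45, "marital_status": "married"},       # no error
]

for i, record in enumerate(records, start=1):
    if domain_error(record):
        print(f"record {i}: domain error (age = {record['age']})")
    elif consistency_error(record):
        print(f"record {i}: consistency error (age and marital status)")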
Carrying out manual consistency checks for paper questionnaires is not so easy, particularly if more
than two questions are involved, and these questions are on different pages of the questionnaire.
Consistency checks for digital questionnaires are usually programmed in the interviewing software.
This is one of the reasons that computer-assisted interviewing produces better data.
Consistency checks in online interviews are a point of discussion. Reporting consistency errors may
help to improve the quality of the data. Online interviews are self-administered and therefore may
contain measurement errors. However, too many and too harsh error messages may also frustrate
respondents, causing them to break off the interview.
7.2.3 Routing errors
Many questionnaires contain routing instructions. These instructions specify conditions under which
certain questions must be answered. In most cases, closed and numeric questions are used for these
instructions. In paper questionnaires, routing instructions usually take the form of skip instructions
attached to answers of questions, or of printed instructions for the interviewer. Routing instructions
ensure that all applicable questions are asked, while inapplicable questions are omitted.
A routing error occurs when an interviewer or respondent fails to follow a route instruction, and a
wrong path is taken through the questionnaire. Routing errors are also called skip pattern errors. As a
result, the wrong questions are answered or applicable questions are left unanswered.
Routing errors may occur in paper questionnaires. Nothing prevents respondents from going to the
wrong parts of the questionnaire, and answering the wrong questions. So it is important to conduct
routing checks. Most CAPI and CATI software forces the correct route through the questionnaire. The
software determines the next question to be asked, and not the respondent. Therefore it is impossible
to make routing errors.
There are online polls in which the routing is forced, but there are also polls that leave respondents
completely free to answer questions or not, and to answer the questions in any order. It will be clear
that in the latter case you run a serious risk of getting no answers to questions that should have been
answered.

Example 7.2. Part of a questionnaire with routing instructions

The questionnaire fragment below contains two types of routing instructions. There is a general
instruction for the interviewers after question 2, instructing them to ask the subsequent
questions only if the respondent is at least 18 years old. So, if an individual of 16 years has
answered question 3, this is a routing error.
Another type of routing instruction can be seen in question 4. The answer options contain skip
instructions. People who voted in the last elections should not answer question 5, but jump to
question 6. If a voter answered question 5, this is also a routing error.

...

2. What is your age (in years)? __ __ __

Interviewer: Ask questions below only of persons of age 18 and older.

3. Of the issues that were discussed during the election campaign,


which one was most important for you?
.......................................................................................................................................

4. Did you vote in the last parliamentary elections?


 Yes Go to question 6
 No Go to question 5
...

7.3 Correcting errors


When you check the completed questionnaires of your poll, you will find questions that were not
answered (while they should have been answered), and you will also find questions that have wrong
answers. In both cases, the correct answers are missing. To obtain ‘clean’ data, you have to determine
the correct answers. One obvious way to do this is to re-contact the respondents and confront them
with their errors. They may then provide the correct answers. Unfortunately, this approach is not
feasible in daily practice. Respondents already consider completing a questionnaire form a burden in
the first place. Having to reconsider their answers would in most cases lead to a refusal to do so.
Moreover, this approach is time-consuming and costly. Therefore, researchers usually rely on other
techniques to deal with errors in the collected data. A first step is to remove incorrect answers, and set
the answers to missing.
It should be noted that analysing a data set with missing answers is not without risks. In the first place,
the ‘holes’ in the data set may not occur in an arbitrary way. If data are missing in some
systematic way, the remaining data may not properly reflect the situation in the target population. In
the second place, many statistical analysis techniques are not able to properly cope with missing data,
and therefore may produce misleading results. Some techniques even require all data to be there.
One way of coping with missing answers is throwing away all questionnaire forms containing at least
one question with a missing answer. This leaves you with a set of complete questionnaire forms.
Unfortunately, problems with questions often occur in specific groups. Therefore, the set of remaining
forms may not be representative for the target population. Moreover, you will probably throw away a
lot of information if you ignore forms as soon as the answer to one question is missing.
To solve the problem of missing answers, often imputation is applied. Imputation means that the
researcher replaces a missing answer by a synthetic answer. This is not an answer the researcher

obtained from the respondent, but a prediction of the true answer made by the researcher. There are
various ways to predict a missing answer, and this leads to different imputation techniques. We will
describe some of these techniques. More techniques are, for example, discussed in Bethlehem, Cobben
& Schouten (2011).
7.3.1 Imputation of the mean
Imputation of the mean is an imputation technique in which you replace a missing answer to a
question by the mean of the available answers to that question. For example, if some respondents
refuse to give their income, you replace missing values by the mean income of those who did specify
their income.
The imputed value is the mean of a number of values. This implies that you can only use imputation of
the mean for questions measuring quantitative variables. It does not work for qualitative variables. It
is not meaningful to impute an average gender or an average marital status.
Imputation of the mean works well only if the quantitative variable does not have a lot of variation. If
there is a lot of variation, the imputed value may be very different from the true value. Another
disadvantage of imputation of the mean is that your computation of the margin of error will result in a
wrong value. The computed margins of error will be too small, thereby creating a wrong impression of
very precise estimates.
7.3.2 Imputation of the group mean
Imputation works if the imputed values are close to the true values. One way in which you can achieve
this is by dividing the target population into homogeneous groups. You can then apply imputation of the
mean within each group. This is called imputation of the group mean.
Suppose, the target population consists of all inhabitants of a town, and the town is divided into
neighbourhoods. Suppose also that within a neighbourhood, incomes of people are more or less the
same. So there are poor neighbourhoods (all people have a low income) and there are rich
neighbourhoods (all people have a high income). If the answer to a question about income is missing,
the mean of all available incomes in a neighbourhood is probably closer to the true value than the
overall mean of all available incomes.
7.3.3 Random imputation
Random imputation is a form of imputation in which you replace a missing answer by a randomly
chosen answer from the set of available answers. For example, if an answer to a question about income
is missing, you take one of the available incomes at random.
Random imputation works well if the variable to be imputed does not have a lot of variation. If it has
substantial variation, the imputed value may differ too much from the true value.
Random imputation has the advantage that you can use it for all types of variables, both quantitative
and qualitative. However, random imputation of missing answers to closed questions may
lead to unwanted situations. If someone’s gender is missing, what would happen if you impute a
random gender? This may lead to inconsistencies with respect to other questions.
Random imputation does not have the problem that the computed margins of error may be wrong.
Generally, margins of error computed after random imputation are a reasonable approximation of the
true margins of error.
7.3.4 Random imputation within groups
Random imputation works better if you do it within homogeneous groups. Within such a group, the
answers to a question are closer together, and therefore a randomly chosen answer will also be closer
to the true, but missing, value. This is called random imputation within groups.

Suppose, the target population consists of all inhabitants of a town, and the town is divided into
neighbourhoods. Suppose also that within a neighbourhood, incomes of people are more or less the
same. If the answer to a question about income is missing, a random income from the set of all
available incomes in the neighbourhood is probably closer to the true value than a random income
from the set of all available incomes.
7.3.5 Donor imputation
The idea of donor imputation is that if you have a missing answer for a respondent, you try to find
another respondent who looks as much as possible like the respondent with the missing answer. Then
you use the answer of this other respondent. So you copy the answer from a donor-respondent.
Suppose again that you have a respondent with a missing income. Then you try to find a respondent
with the same gender, the same age, the same marital status, the same level of education, etc. Once you
have found this resembling person, you copy his or her income to the person with the missing income.
Donor-imputation works better as you use more variables for finding similar persons. All these
variables should contribute to predicting the variable with missing values. If only variables are used
that are completely unrelated to the variable with missing values, donor-imputation will not be very
effective.

Example 7.3. Imputation of income

We show the effects of different imputation techniques using a fictitious example. The table below
contains data about 11 respondents of a poll. The variables are monthly income, level of education
and number of years of work experience. One of the respondents (person 6) refused to give his
income. You want to fill the ‘hole’ in the data set, but the question is which imputation technique to
apply.
A look at the data makes clear that low-educated people have a lower income than high-educated
people. A closer look shows that income increases with the number of years of work experience. An
effective imputation technique will make use of these relationships.

Person Income Education Work experience


1 € 2,041 Low 1
2 € 2,110 Low 2
3 € 2,142 Low 3
4 € 2,201 Low 4
5 € 2,247 Low 5
6 - Low 6
7 € 4,099 High 1
8 € 4,204 High 2
9 € 4,298 High 3
10 € 4,401 High 4
11 € 4,497 High 5

First, we try imputation of the mean. This comes down to filling in the mean of the 10 available
incomes. This mean is equal to € 3,224. This is clearly not an appropriate value, since the incomes of
low-educated people are only a little bit more than € 2,000, and certainly not over € 3,000.
Income varies less within the education groups than between these groups. Incomes are close to
€ 2,000 for the low-educated and between € 4,000 and € 5,000 for the high-educated. So it is more
reasonable to apply imputation of the group mean. The mean income in the low-educated group is
€ 2,148, which is a much more plausible value.

If we apply random imputation, the imputed value can be just over € 2,000, but it can also be a value
over € 4,000. The latter value would be wrong. A better approach would be to restrict random
imputation to the low-educated group. This would result in a value between € 2,041 and € 2,247.
This is better than random imputation in the whole sample.
Suppose, you apply donor-imputation. This implies you look for the respondent who is closest to
respondent 6. This is respondent 5. This respondent has the same level of education, and he or she
differs only one year in years of experience. So, you copy the income of person 5 (€ 2,247) to person
6. This imputation is not so bad.
Analysis of the available data shows that incomes of the low-educated increase by approximately € 50
for each year of work experience. You can use this conclusion to formulate a model that predicts
income from work experience. The model would be something like this:
Income = 1997 + 50 × Work experience.
If you apply this model, and predict the income of person 6, you obtain an income of
1997 + 50 × 6 = 2297.
This imputation technique uses all available information in the data set (the relation of income with
both level of education and years of work experience), and therefore probably is the best technique
in this case.
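
The sketch below applies the imputation techniques of this section to the data of example 7.3. It is a minimal sketch in Python; the variable names are our own, and the regression model is simply the relation derived above.

import random

# Data from example 7.3; None marks the missing income of person 6.
incomes    = [2041, 2110, 2142, 2201, 2247, None, 4099, 4204, 4298, 4401, 4497]
education  = ["Low"] * 6 + ["High"] * 5
experience = [1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5]

observed = [y for y in incomes if y is not None]

# Imputation of the mean: the overall mean of the 10 available incomes (about 3,224).
mean_imputation = sum(observed) / len(observed)

# Imputation of the group mean: the mean within the low-educated group (about 2,148).
low_observed = [y for y, e in zip(incomes, education) if y is not None and e == "Low"]
group_mean_imputation = sum(low_observed) / len(low_observed)

# Random imputation within the low-educated group: a randomly chosen available income.
random_within_group = random.choice(low_observed)

# Regression imputation: income = 1997 + 50 × work experience (person 6 has 6 years).
regression_imputation = 1997 + 50 * experience[5]

print(mean_imputation, group_mean_imputation, random_within_group, regression_imputation)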

8 Computing estimates

8.1 Estimators
After you have checked your data for errors, and you have corrected any detected errors, your data are
ready for analysis. The first step in the analysis will usually be estimation of a number of population
characteristics, like totals, means and percentages of target variables. It may be informative to
compute these estimates also for several subpopulations into which the target population can be
divided. For example, you can compute separate estimates for males and females, for various age
groups, or for various regions in the country.
You always have to realise that you can only compute estimates of population characteristics and not
their exact values. This is because you only have data about a sample of objects from the target
population. Estimates have a margin of error. You have to publish these margins of error, if only to
avoid the impression that your estimates represent the true values.
To compute an estimate you need an estimator. An estimator is a procedure, a recipe, describing how
you must compute an estimate. The recipe also makes clear which ingredients you need. Of course, you
need the sample values of the target variables. Sometimes you can use additional information, like
sample values of auxiliary variables and the population distribution of auxiliary variables.

Figure 8.1. Estimation

An estimator is only useful if it produces estimates that are close to the population characteristics to
be estimated. Therefore, a good estimator must satisfy two conditions:
 An estimator must be unbiased. Suppose, you repeat the process of drawing a sample and
computing an estimate a large number of times. Because you work with random samples, each
turn will produce a different sample. Therefore, you also get a different estimate. The estimator is
unbiased if the average of all the estimates is equal to the population characteristic to be
estimated. There should be no systematic over- or under-estimation.
 An estimator must be precise. All estimates obtained by repeatedly selecting a sample must be
close to each other. The variation in the outcomes of the estimator must be as small as possible.
The term unbiased is related to the term valid. A measuring instrument is called valid if it measures
what it should measure. So an unbiased estimator is a valid measuring instrument.
The term precision is related to the term reliable. A measuring instrument is called reliable if repeated
use leads to (approximately) the same value. So a precise estimator is a reliable measurement
instrument.

To indicate how precise an estimator is, two quantities are used: the variance and the standard error.
The variance of an estimator can be seen as a kind of mean of the squared differences between the
possible values of the estimator and the true population value. The more the possible values differ, the larger
the variance is, and the less precise the estimator is. In the ideal situation all samples lead to the same
estimate. Then the variance is 0.

Example 8.1. An election poll in the town of Bentwood

Elections are to be held in the town of Bentwood. The number of eligible voters is 30,000. There is a
new political party in Bentwood called the National Elderly Party (NEP). As it stands up for the
interests of the elderly, it expects many votes from elderly people. An election poll is carried out to
explore the popularity of the new party.
We created a fictitious population of 30,000 people, in which 25.4% would vote for the NEP. What
happens if we select a sample from this population? We compare various approaches.
First we simulate simple random samples of size 1,000. For each sample we compute the percentage
of voters for the NEP. The results of 800 samples are used to show what the possible outcomes can
be. This is done in the form of a histogram. Each little block represents an estimate. When estimates
are very close, they are stacked onto each other. The graph on the left in figure 8.2 shows the
distribution of the 800 estimates. The vertical black line at 25.4 represents the population value to
be estimated.
The estimates are concentrated around the value to be estimated. Some estimates are smaller than
the true value, and other estimates are larger. On average, the estimates are equal to the population
percentage of 25.4%. So we can say the estimator is unbiased. There is still variation in the
estimates. The values are spread between approximately 23% and 28%. So the precision is not very
high.
Figure 8.2. Simple random samples of size 1,000 (left) and 3,000 (right)

Now, we increase the sample size from 1,000 to 3,000. Again we draw 800 samples and we compute
the estimates. The graph on the right in figure 8.2 shows the result. The estimator remains unbiased,
but the values are now much more concentrated around the true population value. So the estimator
is more precise.
What happens if we do an online poll? Suppose we select our samples from only those people who
have access to the internet. The graph on the left in figure 8.3 shows the results if we draw a sample of
size 1,000 and repeat this 800 times. All values are systematically too low. Their average value is
20.3%. So the estimator has a bias of 25.4% - 20.3% = 5.1%.

This bias is not surprising. Typically, the elderly have less access to the internet. So they are under-
represented in an online poll. Therefore, the poll will contain too few people voting for the NEP. The
poll is not representative.
The values vary between 18% and 23%, which is approximately the same amount of variation as in
the left graph of figure 8.2.
Figure 8.3. Online poll with samples of size 1,000 (left) and 3,000 (right)

The graph on the right in figure 8.3 shows what happens if we increase the sample size of the online
poll from 1,000 to 3,000. There is clearly less variation, so the estimator is more precise. Increasing
the sample size helps to get more precise estimates. Note that increasing the sample size did not
decrease the bias. The estimates are, on average, still 5.1% too low. Apparently, increasing the
sample size does not help to reduce an existing bias.
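
The kind of simulation used in this example is easy to carry out yourself. The sketch below is a minimal version in Python: it repeatedly draws simple random samples from a fictitious population of 30,000 voters, and then from an internet-only subpopulation in which NEP voters are rarer. The population sizes and the 20.3% share of NEP voters among internet users are our own illustrative assumptions, chosen to mimic the example.

import random

random.seed(1)

# Fictitious population of 30,000 voters; 7,620 of them (25.4%) vote for the NEP.
population = [1] * 7620 + [0] * 22380

def simulate(frame, n_samples=800, size=1000):
    # Draw repeated simple random samples and return the estimated percentages.
    estimates = []
    for _ in range(n_samples):
        sample = random.sample(frame, size)
        estimates.append(100 * sum(sample) / size)
    return estimates

# Sampling from the full population: the estimates centre on 25.4% (unbiased).
full = simulate(population)
print(sum(full) / len(full))

# Sampling from an internet-only subpopulation with fewer NEP voters (a stand-in
# for the elderly having less internet access) gives estimates centring on 20.3%.
internet_only = [1] * 4060 + [0] * 15940
online = simulate(internet_only)
print(sum(online) / len(online))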

In the remaining part of this chapter we will discuss estimation procedures for a population
percentage and for a population mean. Section 8.2 describes these estimators for the case of simple
random sampling, in which each person has the same probability of being selected in the sample (a
simple random sample). This leads to simple estimators.
Section 8.3 is about choosing the proper sample size. Once you know how precise your estimates must
be, you can compute the sample size required for this precision. Section 8.3 explains how to do this.
Section 8.4 is devoted to estimators for the case in which first a sample of addresses is drawn, followed
by drawing one person at each selected address. This is an example of two-stage sampling. Since not
every person has the same probability of being selected, the estimators are somewhat more
complicated.

8.2 Estimators for a simple random sample


This section describes how to compute estimates if a simple random sample has been selected. This
means that everybody has had the same probability of being selected in the sample. We also assume
that the sample has been selected without replacement.

8.2.1 Estimating a population percentage


The percentage is a population characteristic that is estimated very often in a poll. Examples are the
percentage of voters for a specific party, the percentage of people having access to internet, and the
percentage of people in favour of a government measure. If the sample is selected with equal
probabilities, the analogy principle applies. The analogy principle means that a population
characteristic can be estimated by computing the corresponding sample characteristic. So the sample

percentage is a good estimate of the population percentage. For example, if you want to estimate the
percentage of people in the population who have trust in the European Union, you use the
corresponding percentage in the sample as an estimate.
The sample percentage is (under simple random sampling without replacement) an unbiased
estimator of the population percentage. We can prove this by applying some mathematics. There is
also another way to show it, and this is by carrying out a simulation: construct a population, select
many samples, compute the estimate for each sample, and see how the estimates behave.
We simulated a radio listening poll. A population of 15,000 people was constructed in which 8,535
people listened to the local radio station. This comes down to 56.9%. We selected a large number of
samples (with simple random sampling), and determined the percentage of listeners in each sample.
Thus, we obtained a large number of estimates. The values of these estimates were displayed in the
form of a histogram. The graph on the left in figure 8.4 shows the result of drawing 400 samples of size
50. The vertical black line represents the value to be estimated: 56.9%. The estimates are distributed
around the value to be estimated. Many estimates are close to this value. On average, the estimates are
equal to this value. Therefore, we can say the sample percentage is an unbiased estimator.

Figure 8.4. Simulation of a simple random sample (estimating a percentage): 400 samples of size 50
(left) and 400 samples of size 200 (right)

The graph on the right in figure 8.4 shows what happens if the sample size is increased from 50 to 200.
The difference with the graph on the left is that now the estimates are much closer to the true
population value. There is less variation. Consequently, you can estimate the percentage of listeners in
the population much more precisely if you increase the sample size.
In practice, you should always report the precision of your estimates. Only then will users of your poll
results be able to appreciate the full value of your poll. The common way to describe the precision of
your estimates is by providing a margin of error or a confidence interval. The margin of error indicates
how large the difference between your estimate and the true value can at most be. The confidence
interval is an interval which will contain the true value (with a high probability). Computation of these
quantities proceeds in the following steps:
(1) Compute the variance of the estimator.
(2) Compute the standard error of the estimator. You obtain the standard error by taking the square
root of the variance of the estimator.
(3) Compute the margin of error. You obtain the margin of error by multiplying the standard error by
1.96.
(4) Compute the confidence interval. You obtain the lower bound of this interval by subtracting the
margin of error from the value of the estimate. You obtain the upper bound of this interval by
adding the margin of error to the value of the estimate.

The variance of the sample percentage

The variance of an estimator measures the amount of variation of its possible values. The smaller the
variance, the more precise the estimator.
If P denotes the population percentage to be estimated, and if p denotes the sample percentage, then
the variance of p is equal to

$$\mathrm{Var}(p) = \left(\frac{1}{n} - \frac{1}{N}\right) \cdot \frac{N}{N-1} \cdot P\,(100 - P).$$

N is the size of the population, and n is the sample size. If the population is very large, and the sample
is much smaller than the population, you can use the following simple approximation:

$$\mathrm{Var}(p) \approx \frac{P\,(100 - P)}{n}.$$

Note that the denominator of this expression contains the sample size. So the variance decreases
(and the precision increases) as the sample size increases.
In practice, it is impossible to compute the variance, because you do not know the value of the
population percentage P. Indeed, you carried out the survey to estimate this unknown quantity.
What you can do to solve this problem is to estimate the variance by replacing the unknown value P
by its estimate p. This gives the expression

$$\mathrm{var}(p) = \left(\frac{1}{n} - \frac{1}{N}\right) \cdot \frac{n}{n-1} \cdot p\,(100 - p),$$

or the approximation (for large populations, and reasonably large samples)

$$\mathrm{var}(p) \approx \frac{p\,(100 - p)}{n}.$$

Let us return to our simulation of a radio listening poll. The variance of the sample percentage is equal
to 49.0 for a sample of size 50. If we increase the sample size to 200, the variance of the estimator
decreases to 12.3. This is four times as small. This confirms our rule that estimators will be more
precise as the sample increases.
The variance is an indicator of the precision of an estimator. It is not easy, however, to interpret this
indicator. What does a value of 12.3 mean? Is this good or bad? A more useful indicator is one that tells
you how far away your estimator is at most from the true value. This is the margin of error. If the
sample percentage of listeners to the local radio station is 55.1%, and the margin of error is 6.9%, you
know that the population percentage can at most be 6.9% away from 55.1%.
The confidence interval is a different way of reporting the precision of your estimates. You compute
the lower bound and upper bound in such a way that it is very likely that the true population value will
be in the interval. The probability that the interval covers the population value is called the confidence
level. Often the confidence level is set to 95%, which gives you a 95%-confidence interval. So with a
probability of 95% this confidence interval contains the population value.
The margin of error of the 95%-confidence interval is equal to the standard error of the estimator
multiplied by 1.96. You obtain the lower bound of the interval by subtracting the margin of error from
the estimate. Likewise, the upper bound of the interval is equal to the estimate plus the margin of
error.

If you want, you can choose a higher confidence level. For example, the margin of error for a 99%-
confidence interval is equal to the standard error of the estimator multiplied by 2.58. So, the value of
1.96 is replaced by the value of 2.58. This gives you a wider confidence interval. This is the price you
have to pay for a higher confidence level: a larger confidence interval.

The confidence interval for a percentage

The standard error of an estimate p for the population percentage P is equal to

$$S(p) = \sqrt{\mathrm{Var}(p)}.$$

The margin of error M of the 95%-confidence interval is equal to

$$M = 1.96 \times S(p).$$

The lower bound of the interval is equal to

$$p - M$$

and the upper bound is equal to

$$p + M.$$

Note that in practice you cannot compute this confidence interval, because you do not know the
exact value of the variance of the estimator. Fortunately, you can estimate this variance using the
data you have collected in the sample. You can use this estimated variance to estimate the standard
error, the margin of error, and the bounds of the confidence interval. So, formally you obtain an
estimated confidence interval instead of an exact confidence interval.

A different way of interpreting a 95%-confidence interval is the following: if you repeat the selection of the
sample, the computation of the estimate and corresponding confidence interval a large number of
times, then, on average, in 95 out of 100 cases the confidence interval will contain the true population
value. So in 5 out of 100 cases, you will draw the wrong conclusion. If you think this risk is too high, you
can go for a 99%-confidence interval.
We should mention again that your poll will never allow you to make exact statements about the
target population. There will always be an element of uncertainty. This uncertainty is caused by the
fact that you only investigate a small part of the population which you obtained by taking a random
sample. It is always possible (but the probability is very small) that by accident you select a sample
that deviates substantially from the target population. For example, it could happen in a radio listening
poll (but it is very unlikely) that all sample persons listen to the radio station, while in the population
only half of the people listen.

Example 8.2. Confidence interval for a percentage of listeners to a local radio station

Suppose, you draw a simple random sample of 1,200 people from a target population of 19,000
people. 720 persons in the sample say they listen to the local radio station.
The percentage of listeners in the sample is equal to 100 × (720 / 1200) = 60%. So, the estimate for
the percentage of listeners in the target population is also 60% (the analogy principle).
The estimate of the variance of the estimated percentage is equal to
(1/1200 – 1/19000) × (1200/1199) × 60 × 40 = 1.875.

The estimate of the standard error is equal to the square root of the variance. Taking the square root
of 1.875 results in a value of 1.369.
The margin of error (for the 95%-confidence interval) is equal to the standard error multiplied by
1.96. This gives a value of 2.7 (rounded).
The lower bound of the confidence interval is equal to 60 – 2.7 = 57.3.
The upper bound of the confidence interval is equal to 60 + 2.7 = 62.7.
With a probability of 95%, we can say that the percentage of listeners in the target population will
be between 57.3% and 62.7%.
Note that use of the approximate variance expression would have led to an estimated variance of
2.000. This would have resulted in a confidence interval running from 57.2 to 62.8. The difference
with the exact expression is small.
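
The four steps above are easy to program. The sketch below is a minimal version in Python that reproduces the numbers of example 8.2; the function name is our own.

from math import sqrt

def confidence_interval_percentage(n, N, p):
    # (1) Estimated variance of the sample percentage (simple random sampling
    #     without replacement), (2) standard error, (3) margin of error,
    #     (4) bounds of the 95%-confidence interval.
    var = (1 / n - 1 / N) * (n / (n - 1)) * p * (100 - p)
    se = sqrt(var)
    margin = 1.96 * se
    return p - margin, p + margin

# Example 8.2: 720 listeners in a sample of 1,200 from a population of 19,000.
p = 100 * 720 / 1200                                     # 60%
print(confidence_interval_percentage(1200, 19000, p))    # roughly (57.3, 62.7)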

8.2.2 Estimating a population mean


Another population characteristic is the mean of a quantitative variable. You will often measure such a
variable with a numerical question. Examples of such means are the average amount of time per day
spent on the internet, the mean monthly income, and the mean household size.
For estimating a population mean with a simple random sample, the analogy principle also applies:
the sample mean is a good estimator of the population mean. If you want to estimate how many hours
per week people listen on average to the local radio station, you use the average number of hours in
the sample for this.

The sample mean

The sample mean of a target variable Y is denoted by

$$\bar y = \frac{y_1 + y_2 + \ldots + y_n}{n} = \frac{1}{n}\sum_{i=1}^{n} y_i.$$

The quantity n denotes the size of the sample. The n sample values of the variable Y are denoted by
$y_1, y_2, \ldots, y_n$.

Note that all sample quantities are denoted by lowercase letters, and all population quantities by
uppercase letters.

The sample mean is an unbiased estimator of the population mean (under simple random sampling
without replacement). We can prove this mathematically, but we can also show it by carrying out a
simulation: construct a population, select many samples, compute the estimate for each sample, and
see how the estimates behave.
We simulated a radio listening poll. A population of 15,000 people was constructed in which 8,535
people listened to the local radio station. For each of the 8,535 listeners, the population file contained
the number of hours they listened last week. For the non-listeners, the number of hours was set to 0.
We selected a large number of samples (with simple random sampling), and computed the mean
number of hours listened in each sample. Thus, we obtained a large number of estimates. The values of
these estimates were displayed in the form of a histogram. The graph on the left in figure 8.5 shows
the result of drawing 500 samples of size 50. The vertical black line represents the population value to

be estimated: 2.7 hours. The estimates are distributed around the value to be estimated. Many
estimates are close to this value. On average, the estimates are equal to this value. Therefore, we can
say the sample mean is an unbiased estimator.

Figure 8.5. Simulation of a simple random sample (estimating a mean): 500 samples of size 50 (left)
and 500 samples of size 200 (right)

The graph on the right in figure 8.5 shows the effect of increasing the sample size from 50 to 200.
There is much less variation. The estimates are all much closer to each other. Again, the conclusion is
that you can get more precise estimates if you increase the sample size.
Once you have (an estimate of) the variance of the sample mean, you can compute the standard error
of this estimator. The standard error is equal to the square root of the variance of the sample mean.
Then you can compute the margin of error (for the 95%-confidence interval) of the sample mean as
an estimator of the population mean by multiplying the standard error by 1.96. The lower bound of the
95%-confidence interval is now equal to the sample mean minus the margin of error, and the upper
bound is equal to the sample mean plus the margin of error.

The variance of the sample mean

Let $\bar Y$ denote the population mean to be estimated, and $\bar y$ the sample mean. The variance of the
sample mean is equal to

$$\mathrm{Var}(\bar y) = \left(\frac{1}{n} - \frac{1}{N}\right) \cdot S^2.$$

N denotes the size of the population, and n denotes the sample size. The quantity S² was already
introduced in chapter 3 (section 3.4). It is called the population variance:

$$S^2 = \frac{1}{N-1}\sum_{k=1}^{N}\left(Y_k - \bar Y\right)^2.$$

If the population size N is large, and the sample size n is much smaller than the population size, you
can use a somewhat simpler approximation of the expression for the variance of the sample mean:

$$\mathrm{Var}(\bar y) \approx \frac{S^2}{n}.$$

Note that the denominator contains the sample size n. So increasing the sample size will lead to a
smaller variance, and therefore to a more precise estimator.
In practice, you cannot compute the variance of this estimator. You need all values of Y in the
population for this expression. You simply do not have them. Lack of information about Y was the
reason to do the poll in the first place. What you can do to solve this problem is estimating the
variance using the sample data. So you replace the population variance S² by the sample analogue

$$s^2 = \frac{1}{n-1}\sum_{i=1}^{n}\left(y_i - \bar y\right)^2.$$

This gives you an estimated variance of the estimator:

$$\mathrm{var}(\bar y) = \left(\frac{1}{n} - \frac{1}{N}\right) \cdot s^2 \approx \frac{s^2}{n}.$$

Returning to the simulation of the radio listening poll, the variance of the sample mean for a sample of
size 50 is equal to 0.160. For a sample of size 200, the variance reduces to 0.040. This is four times as
small. Increasing the sample size by a factor 4 reduces the variance by a factor 4. The corresponding
margins of error are approximately 0.8 (sample of size 50) and 0.4 (sample of size 200).

The confidence interval for a mean

The standard error of the sample mean $\bar y$ is equal to

$$S(\bar y) = \sqrt{\mathrm{Var}(\bar y)}.$$

The margin of error M of the 95%-confidence interval is equal to

$$M = 1.96 \times S(\bar y).$$

The lower bound of the interval is equal to

$$\bar y - M$$

and the upper bound is equal to

$$\bar y + M.$$

Note that in practice you cannot compute this confidence interval, because you do not know the
exact value of the variance of the estimator. Fortunately, you can estimate this variance using the
sample data. You use the estimated variance to estimate the standard error, the margin of error, and
the bounds of the confidence interval. So, you obtain an estimated confidence interval instead of an
exact confidence interval.

Example 8.3. Confidence interval for the mean hours listened to the local radio station

Suppose, you draw a simple random sample of 20 people from a target population of 15,000 people
in a town. Objective of the poll is to explore listening behaviour of the inhabitants. The respondents
of the poll are asked how many hours they listened to the local radio station last week. The
answers are in the second column of the table below:

      yi        yi - ȳ       (yi - ȳ)²
1 0.00 -2.82 7.95
2 3.40 0.58 0.34
3 4.60 1.78 3.17
4 4.10 1.28 1.64
5 3.90 1.08 1.17
6 0.00 -2.82 7.95
7 3.40 0.58 0.34
8 7.30 4.48 20.07
9 0.00 -2.82 7.95
10 0.00 -2.82 7.95
11 3.80 0.98 0.96
12 0.00 -2.82 7.95
13 0.00 -2.82 7.95
14 4.20 1.38 1.90
15 5.50 2.68 7.18
16 4.40 1.58 2.50
17 6.40 3.58 12.82
18 5.40 2.58 6.66
19 0.00 -2.82 7.95
20 0.00 -2.82 7.95

The sample mean is equal to the sum of the values in the second column (56.40) divided by 20. The
result is 2.82 (hours).
To estimate the variance of the estimate, you must first compute the sample variance s2. To do this,
you subtract the mean (2.82) from each value. This leads to column 3. Next, you take the square of
each value in column 3. The result can be found in column 4. Now, you obtain the sample variance
by summing the values in column 4 and dividing the result by n – 1 = 19. This gives s2 = 6.44.
You obtain the (estimated) variance of the estimator by multiplying the sample variance by
(1/n – 1/N) = (1/20 – 1/15000). This results in the value 0.32. The estimated standard error is now
equal to the square root of 0.32, and this is 0.57.
The (estimated) margin of error of the 95%-confidence interval is equal to the standard error
multiplied by 1.96. The (rounded) result is 1.11.
The lower bound of the 95%-confidence interval is equal to 2.82 – 1.11 = 1.71. The upper bound is
equal to 2.82 + 1.11 = 3.93.
We can say that with a high probability (95%) the mean number of hours listened to the local radio
station last week is somewhere between 1.71 and 3.93 hours.
The estimate has a substantial margin of error of 1.11 hours. This is caused by the small sample size
of 20 people.
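
The computation in example 8.3 can be programmed in a few lines. The sketch below is a minimal version in Python, using the 20 listening times from the table; the variable names are our own.

from math import sqrt

hours = [0.0, 3.4, 4.6, 4.1, 3.9, 0.0, 3.4, 7.3, 0.0, 0.0,
         3.8, 0.0, 0.0, 4.2, 5.5, 4.4, 6.4, 5.4, 0.0, 0.0]

n, N = len(hours), 15000
mean = sum(hours) / n                                # 2.82 hours

# Sample variance s2 and estimated variance of the sample mean.
s2 = sum((y - mean) ** 2 for y in hours) / (n - 1)   # about 6.44
var_mean = (1 / n - 1 / N) * s2                      # about 0.32

margin = 1.96 * sqrt(var_mean)                       # about 1.11
print(mean - margin, mean + margin)                  # roughly (1.71, 3.93)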

8.3 How large should your sample be?
A decision you have to make in the design phase of your poll is the size of the sample to be selected.
This is an important decision. If, on the one hand, the sample is larger than really necessary, you may waste
a lot of time and money. And if, on the other hand, the sample is too small, your estimates are not as
precise as planned, and this makes the results of your poll less useful.
It is not so simple to determine the sample size, since it depends on a number of different factors. We
already showed there is a relationship between the precision of estimators and the sample size: the
larger the sample, the more precise the estimators. Therefore, you can only answer the question about
the sample size if it is clear how precise your estimators must be. Once you have specified the
precision, you can compute the sample size. A very high precision nearly always requires a large
sample. However, a large poll will also be costly and time-consuming. Therefore, the sample size will in
practical situations often be a compromise between costs and precision.
We will explain in this section how you can compute the sample size for a simple random sample without
replacement. We first consider the case of estimating a population percentage. Next, we will describe
the case of estimating a population mean.
8.3.1 The sample size for estimating a percentage
The starting point is that we assume you have some indication of how large the margin of error M may
at most be. The margin of error is defined as the distance between the estimate and the lower or upper
bound of the confidence interval. So the margin of error is the maximum allowed difference between
estimate and true value.
The margin of error of a 95%-confidence interval is defined as the standard error of the estimator
multiplied by 1.96. Therefore, setting a maximum to the margin of error implies that the standard
error may not exceed a certain value.

The sample size for estimating a percentage

Suppose $M_{max}$ is the value of the margin of error that may not be exceeded. This implies that

$$1.96 \times S(p) \le M_{max},$$

where S(p) is the standard error of the sample percentage p. We can replace S(p) by the expression for
this standard error. If we convert the result into an expression for n, we get

$$n \ge \frac{1}{\dfrac{N-1}{N}\left(\dfrac{M_{max}}{1.96}\right)^{2}\dfrac{1}{P\,(100-P)} + \dfrac{1}{N}}.$$

If the size N of the target population is large, we can simplify this expression to

$$n \ge \left(\frac{1.96}{M_{max}}\right)^{2} P\,(100-P).$$
Both expressions contain the population percentage P. You do not know this value. Indeed, getting to
know this value was the reason to conduct the poll. To be able to use one of the expressions for
computing the sample size, you can substitute some indication for the value of P. This could, for
example, be a value from a previous poll. If you do not have any idea at all of the value of the
population percentage, substitute the value P = 50. This value leads to the largest sample size. If the
maximum margin of error is not exceeded for P = 50, it also will not happen for any other value of P.

Example 8.4. An election poll in the town of Bentwood

Elections are to be held in the town of Bentwood. The number of eligible voters is 30,000. There is a
new political party in Bentwood called the National Elderly Party (NEP). As it stands up for the
interests of the elderly, it expects many votes from elderly people. A previous poll predicted that
approximately 30% would vote for the NEP.
A new election poll is carried out to explore the popularity of the new party. The margin of error
should not exceed 3%. If we apply the expression for the sample size with N = 30,000, P = 30, and
Mmax = 3, the result is
$$n \ge \frac{1}{\dfrac{29999}{30000}\left(\dfrac{3}{1.96}\right)^{2}\dfrac{1}{30 \times 70} + \dfrac{1}{30000}} = 870.4.$$
Rounding up this value to the nearest integer number gives a minimal sample size of 871.
Application of the simplified expression would lead to a minimal sample size of 897.
If nothing is known about the value of P, use the value P = 50. The expression for the sample size
now becomes
$$n \ge \frac{1}{\dfrac{29999}{30000}\left(\dfrac{3}{1.96}\right)^{2}\dfrac{1}{50 \times 50} + \dfrac{1}{30000}} = 1030.5.$$
Rounding up this value to the nearest integer number gives a minimal sample size of 1031. The
simplified formula leads to a minimal sample size of 1068.
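
A minimal sketch in Python of this sample size computation (the function name is our own) is the following. It reproduces the numbers of example 8.4.

from math import ceil

def sample_size_percentage(N, P, M_max):
    # Exact expression for a simple random sample without replacement.
    denominator = (N - 1) / N * (M_max / 1.96) ** 2 / (P * (100 - P)) + 1 / N
    return ceil(1 / denominator)

print(sample_size_percentage(30000, 30, 3))   # 871
print(sample_size_percentage(30000, 50, 3))   # 1031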

Instead of computing the minimal sample size with one of the expressions, you can also use table 8.1.
You can read the required sample size from this table, given an indication of the population percentage
to be estimated and the maximal margin of error.

Table 8.1. Required sample size for estimating a percentage

Indication of the                  Maximal margin of error
population percentage         1        2        3        4        5
5 1825 457 203 115 73
10 3458 865 385 217 139
15 4899 1225 545 307 196
20 6147 1537 683 385 246
25 7204 1801 801 451 289
30 8068 2017 897 505 323
35 8740 2185 972 547 350
40 9220 2305 1025 577 369
45 9508 2377 1057 595 381
50 9605 2402 1068 601 385
55 9508 2377 1057 595 381
60 9220 2305 1025 577 369
65 8740 2185 972 547 350
70 8068 2017 897 505 323
75 7204 1801 801 451 289
80 6147 1537 683 385 246
85 4899 1225 545 307 196
90 3458 865 385 217 139
95 1825 457 203 115 73

Many market research agencies work with sample sizes of around 1,000 persons. The table shows that
for such a sample size, the margin of error will not exceed 3%. If the sample size is 1,068, the margin of
error for every percentage will at most be 3%. Or to put it differently: with a sample of 1,068 persons
the estimate will differ no more than 3% from the true population value. So, if the sample
percentage is equal to 48%, the population percentage will be between 45% and 51% (with a high
probability of 95%).
8.3.2 The sample size for estimating a mean
Also for estimating the population mean, the starting point is that we assume you have some
indication of how large the margin of error M at most may be. The margin of error is defined as the
distance between the estimate and the lower or upper bound of the confidence interval. So the margin
of error is the maximum allowed difference between estimate and true value.
The margin of error of a 95%-confidence interval is defined as the standard error of the estimator
multiplied by 1.96. Therefore, setting a maximum to the margin of error implies that the standard
error may not exceed a certain value.

The sample size for estimating a mean

Suppose $M_{max}$ is the value of the margin of error that may not be exceeded. This implies that

$$1.96 \times S(\bar y) \le M_{max},$$

where $S(\bar y)$ is the standard error of the sample mean $\bar y$. If we convert this into an expression
for n, we get

$$n \ge \frac{1}{\left(\dfrac{M_{max}}{1.96\,S}\right)^{2} + \dfrac{1}{N}},$$

in which S is the square root of the population variance S² (see section 3.4). If the size N of the target
population is large, we can simplify this expression to

$$n \ge \left(\frac{1.96\,S}{M_{max}}\right)^{2}.$$

Both expressions contain the square root S of the population variance S². In practice, this value is
usually not known. This makes it difficult to apply these expressions. Sometimes you may have some
indication of its value from a different survey, or from a small test survey.
If you know that the values of the target variable have more or less a normal distribution (a bell-
shaped, symmetric distribution) over an interval of length L, you can use L/6 as an approximation for
S.
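
Analogous to the previous case, the computation can be sketched in Python as follows. The example values (listening times roughly between 0 and 12 hours, so S is approximated by L/6 = 2, and a maximal margin of error of 0.25 hours) are our own illustrative assumptions.

from math import ceil

def sample_size_mean(N, S, M_max):
    # Exact expression; S is (an indication of) the population standard deviation.
    denominator = (M_max / (1.96 * S)) ** 2 + 1 / N
    return ceil(1 / denominator)

# Listening times roughly between 0 and 12 hours, so S is approximated by 12 / 6 = 2.
print(sample_size_mean(15000, 12 / 6, 0.25))   # about 242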

8.4 Estimators for a two-stage sample


If you have to draw a sample of persons, the ideal situation would be to have a sampling frame
containing persons, such as a population register. Often such frames are not available or accessible. An
alternative can be to use an address list. To select a sample of persons from a list of addresses, you
have to use a two-step approach: first you draw addresses, and then you draw one or more persons at
each selected address. This is called a two-stage sample. We restrict ourselves here to the case of one
person per address.

What makes this approach a little more complicated is that not every person in the target population
has the same probability of selection. The selection probability is determined by the number of
persons living at an address (and belonging to the target population): the more persons live at
an address, the smaller the selection probability will be.
Because of the unequal selection probabilities, it is not possible to apply the procedures for simple
random sampling described in section 8.2. The analogy principle does not apply. To obtain valid
estimators, you have to apply the theory for unequal probability sampling. We explain in this section
how to do this.
We assume that a sampling design was applied in which addresses were selected with simple random
sampling without replacement. So addresses have equal probabilities of selection. At each selected
address we draw one person at random from all persons living there (as far as the belong to the target
population). It is clear that, if more people are living at an address, the probability of selection of a
specific person is smaller. As a consequence, persons living at addresses with many people are under-
represented in the sample.
To obtain valid estimates for population characteristics, we have to correct for unequal selection
probabilities. We can only do this if we know the selection probabilities. Therefore, we have to count
the number of persons (as far as they are part of the target population) living at each selected address.

Example 8.5. Selection probabilities in a radio listening poll

You want to conduct a radio listening poll in the town of Harewood. You have a list consisting of the
addresses of all 9,590 households in Harewood. You want to select 209 addresses. At each selected
address you want to interview one person. The selection probability of each household is
$$\frac{209}{9590} \approx 0.022.$$

The selection probability of a person at an address is determined by the number of people living
there (from the age of 12 years). If we denote the number of persons by A, the selection probability
is equal to

$$\frac{1}{A}.$$

The total probability of a person to be selected in the sample is obtained by multiplying the two
probabilities above. This results in the selection probability

$$\frac{209}{9590 \times A}.$$

This expression indeed shows that not every person has the same selection probability. The
probability of someone at a single-person address is 209 / 9590 = 0.022. For someone at a 2-person
address, the probability is 209 / (2 × 9590) = 209 / 19180 = 0.011, which is two times as small.

If you use unequal selection probabilities, the analogy principle does not apply any more. The sample
percentage and the sample mean are not unbiased estimators for the population percentage and
population mean. We have to use different estimators that correct for the unequal probabilities.

8.4.1 Estimating a population percentage
First, we describe how to estimate population percentages using data from an address sample, where
one person per address was selected. The way to correct for unequal probabilities is to assign a weight
to each respondent. This weight is equal to

$$\text{Weight} = \frac{\text{Number of persons at address} \times \text{Number of addresses in target population}}{\text{Number of persons in target population}}.$$
The ‘number of persons at address’ denotes only those persons who belong to the target population. A
special case is the situation in which the same number of people live at each address. Then all weights
are equal to 1, which implies that no correction takes place. Correction is not required since all
selection probabilities are equal.

Example 8.6. Computation of weights for a radio listening poll

You conduct a radio listening poll in the town of Harewood. For selecting the sample you have used
a list consisting of the addresses of all 9,590 households. You have selected 209 addresses. At each
selected address you have interviewed one person. If you apply the formula for computing the
weights, you get the following table:

Number of persons       Weight of the person
at address              in the sample
1 0.415
2 0.829
3 1.244
4 1.659
5 2.073
6 2.488
… …

It is clear that single persons have a lower weight (0.415) than persons at multi-person addresses.
This is not surprising as singles are over-represented in the sample. To correct for this, their
influence in the estimation procedure has to be reduced.

If we have the weights, we can compute the value of an unbiased estimator for the percentage of
persons in the population with a specific property. This estimator is defined by

$$\text{Estimate} = 100 \times \frac{\text{Sum of the weights of persons with the property}}{\text{Sample size}}.$$

Note that if all persons in the sample had the same weight (such as in the case of simple
random sampling in section 8.2), the expression for the estimate would reduce to the simple sample
percentage.

Estimating a percentage in a two-stage sample

Let Y be an indicator variable. It assumes the value 1 if a person has a specific property, and
otherwise it assumes the value 0. Hence, the population percentage P is equal to the population mean
multiplied by 100.
Suppose the target population of N persons is divided over M addresses. The numbers of persons per address are indicated by
$$A_1, A_2, \ldots, A_M.$$
Adding these numbers results in the population size N: $A_1 + A_2 + \ldots + A_M = N$.
From the population of M addresses we draw a random sample of n addresses. At each selected address we randomly pick one person. So our sample consists of n persons. We have measured the value of Y for each of these persons. We denote these values by
$$y_1, y_2, \ldots, y_n$$
(where each value is either 0 or 1). Furthermore, we denote the numbers of persons at the selected addresses by
$$a_1, a_2, \ldots, a_n.$$
The weight $w_i$ of person i is equal to
$$w_i = a_i \, \frac{M}{N},$$
in which the index i takes values ranging from 1 to n.
The estimate for the population percentage P is now equal to
$$p_w = 100 \times \bar{y}_w = 100 \times \frac{w_1 y_1 + w_2 y_2 + \ldots + w_n y_n}{n} = \frac{100}{n} \sum_{i=1}^{n} w_i y_i.$$

Example 8.7. Estimating a percentage for a two-stage sample

A target population consists of 20,000 persons, distributed over 7,000 addresses. We have drawn a
simple random sample of 20 addresses. One person was randomly drawn at each selected address.
The selected persons were asked whether they listened to the local radio station last week.
The table below contains the ingredients for computing an estimate of the population percentage of
persons who listened last week.
The sum of the weights of the listeners is equal to 10.50. The estimate for the percentage of listeners
in the population is now:
$$100 \times \frac{10.50}{20} = 52.5\%.$$
Note that the simple sample percentage is equal to 50% (10 out of 20 people listened to the local
radio station). This value is too low. So using a wrong estimator may result in an invalid estimate.

Address   Number of persons   Selected person listens   Weight   Weight of listener
1 2 Yes 0.70 0.70
2 1 No 0.35
3 4 Yes 1.40 1.40
4 3 Yes 1.05 1.05
5 1 Yes 0.35 0.35
6 6 Yes 2.10 2.10
7 2 No 0.70
8 1 No 0.35
9 1 No 0.35
10 4 Yes 1.40 1.40
11 2 No 0.70
12 2 Yes 0.70 0.70
13 3 No 1.05
14 4 Yes 1.40 1.40
15 1 Yes 0.35 0.35
16 3 No 1.05
17 2 No 0.70
18 1 No 0.35
19 3 Yes 1.05 1.05
20 1 No 0.35
Total 10.50
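The weighted estimate of Example 8.7 can be reproduced from the table above with a few lines of Python; the data layout and variable names below are our own.

# (number of persons at the address, listens to the local radio station) for the 20 respondents
sample = [(2, True), (1, False), (4, True), (3, True), (1, True),
          (6, True), (2, False), (1, False), (1, False), (4, True),
          (2, False), (2, True), (3, False), (4, True), (1, True),
          (3, False), (2, False), (1, False), (3, True), (1, False)]
M, N = 7000, 20000                     # addresses and persons in the target population
n = len(sample)                        # sample size (20)

weights = [a * M / N for a, _ in sample]                            # w_i = a_i * M / N
p_w = 100 * sum(w for w, (_, listens) in zip(weights, sample) if listens) / n
print(round(p_w, 1))                                                # 52.5

print(100 * sum(listens for _, listens in sample) / n)              # 50.0, the unweighted percentage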

Also for two-stage samples, the principle applies that larger samples produce more precise estimates.
This precision of an estimator is determined by first computing the variance of the estimator, then the
standard error of the estimator, followed by the margin of error and the confidence interval. However,
the formulae are different from those for the simple random sample.

The variance of the estimator for a percentage in a two-stage sample

Let pw be the estimator for a population percentage as defined above. The variance of pw is equal to

$$\mathrm{Var}(p_w) = 10000 \times \mathrm{Var}(\bar{y}_w) = \frac{10000}{nN} \sum_{k=1}^{N} \frac{(w_k Y_k - \bar{Y})^2}{w_k},$$

where we sum over all objects in the target population. We cannot compute this expression in
practice, because it would require the values of the target variable for all objects in the population.
Fortunately, we can estimate the variance with the following expression:
$$\mathrm{var}(p_w) = 10000 \times \mathrm{var}(\bar{y}_w) = \frac{10000}{n(n-1)} \sum_{i=1}^{n} (w_i y_i - \bar{y}_w)^2,$$

where summation is over all objects in the sample.

To compute the margin of error and the 95%-confidence interval, you follow the same steps as in
section 8.2:
(1) Compute the variance of the estimator.
(2) Compute the standard error of the estimator. You obtain the standard error by taking the square
root of the variance of the estimator.

(3) Compute the margin of error. You obtain the margin of error by multiplying the standard error by
1.96.
(4) Compute the confidence interval. You obtain the lower bound of this interval by subtracting the
margin of error from the value of the estimate. You obtain the upper bound of this interval by
adding the margin of error to the value of the estimate.

Example 8.8. Confidence interval for a percentage in a two-stage sample

A target population consists of 20,000 persons, distributed over 7,000 addresses. We have drawn a
simple random sample of 20 addresses. One person was randomly drawn at each selected address.
The selected persons were asked whether they listened to the local radio station last week.
The table below contains the ingredients for computing an estimate of the population percentage of
persons who listened last week:

Address   a_i   y_i   w_i = a_i M/N   w_i y_i   w_i y_i - y_w   (w_i y_i - y_w)^2
1 2 1 0.70 0.70 0.175 0.030625
2 1 0 0.35 0.00 -0.525 0.275625
3 4 1 1.40 1.40 0.875 0.765625
4 3 1 1.05 1.05 0.525 0.275625
5 1 1 0.35 0.35 -0.175 0.030625
6 6 1 2.10 2.10 1.575 2.480625
7 2 0 0.70 0.00 -0.525 0.275625
8 1 0 0.35 0.00 -0.525 0.275625
9 1 0 0.35 0.00 -0.525 0.275625
10 4 1 1.40 1.40 0.875 0.765625
11 2 0 0.70 0.00 -0.525 0.275625
12 2 1 0.70 0.70 0.175 0.030625
13 3 0 1.05 0.00 -0.525 0.275625
14 4 1 1.40 1.40 0.875 0.765625
15 1 1 0.35 0.35 -0.175 0.030625
16 3 0 1.05 0.00 -0.525 0.275625
17 2 0 0.70 0.00 -0.525 0.275625
18 1 0 0.35 0.00 -0.525 0.275625
19 3 1 1.05 1.05 0.525 0.275625
20 1 0 0.35 0.00 -0.525 0.275625
Sum 10.50 0.000 8.2075
Mean 0.525

Now, we can estimate the variance by substituting the total of the last column in the expression for
the estimated variance. This results in
$$\mathrm{var}(p_w) = \frac{10000}{20 \times 19} \times 8.2075 = 215.987.$$
We can obtain the (estimated) standard error by taking the square root:

$$s(p_w) = \sqrt{\mathrm{var}(p_w)} = \sqrt{215.987} = 14.696.$$

The margin of error of the 95%-confidence interval is equal to


$$M = 1.96 \times s(p_w) = 28.805.$$

The lower bound of the confidence interval is
$$p_w - M = 52.50 - 28.805 = 23.695,$$
and the upper bound is
$$p_w + M = 52.50 + 28.805 = 81.305.$$

With a confidence of 95% we can say that the percentage of listeners in the target population lies between 23.7% and 81.3%. This is a very wide confidence interval, which makes the poll more or less meaningless. The cause is the very small sample size (20). For a more precise estimate, we have to increase
the sample size.
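As a cross-check of Example 8.8, the following Python sketch computes the weighted estimate, the estimated variance, the standard error, the margin of error and the 95%-confidence interval from the same sample data; variable names are our own.

import math

# (persons at address, listens) for the 20 respondents of Examples 8.7 and 8.8 (1 = yes)
sample = [(2, 1), (1, 0), (4, 1), (3, 1), (1, 1), (6, 1), (2, 0), (1, 0), (1, 0), (4, 1),
          (2, 0), (2, 1), (3, 0), (4, 1), (1, 1), (3, 0), (2, 0), (1, 0), (3, 1), (1, 0)]
M, N, n = 7000, 20000, len(sample)

wy = [(a * M / N) * y for a, y in sample]        # products w_i * y_i
y_w = sum(wy) / n                                # weighted mean (0.525)
p_w = 100 * y_w                                  # weighted percentage (52.5)

var_pw = 10000 * sum((v - y_w) ** 2 for v in wy) / (n * (n - 1))
se = math.sqrt(var_pw)                           # standard error, about 14.7
margin = 1.96 * se                               # margin of error, about 28.8
print(round(p_w - margin, 1), round(p_w + margin, 1))   # roughly (23.7, 81.3)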

8.4.2 Estimating a population mean


We describe how to estimate a population mean using data from an address sample, where one person
per address is selected. The procedure is similar to that for estimating a population percentage. First,
we compute weights for all persons in the sample. These weights are the same as those for estimating
a percentage. The weight of a sample person is equal to

$$\text{Weight} = \frac{\text{Number of persons at address} \times \text{Number of addresses in target population}}{\text{Number of persons in target population}}.$$
The ‘number of persons at address’ denotes only those persons who belong to the target population.
Example 8.6 shows you how you compute the weights in practice.
A special case is the situation in which the same number of people live at each address. Then all
weights are equal to 1, which implies that no correction takes place. Correction is not required since all
selection probabilities are equal.
If we have the weights, we can compute the value of an unbiased estimator for the population mean of
a target variable. This estimator is defined by

$$\text{Estimate} = \frac{\text{Weighted sum in the sample of the values of the target variable}}{\text{Sample size}}.$$

Note that if all persons in the sample had the same weight (as in the case of simple random sampling in section 8.2), the expression for the estimate would reduce to the simple sample mean.

Estimating a mean in a two-stage sample

Let Y be a quantitative variable. The objective of the poll is assumed to be estimation of the population mean $\bar{Y}$ of Y.
Suppose the target population of N persons is divided over M addresses. The numbers of persons per address are indicated by
$$A_1, A_2, \ldots, A_M.$$
Adding these numbers results in the population size N: $A_1 + A_2 + \ldots + A_M = N$.


From the population of M addresses we draw a random sample of n addresses. At each selected
address we randomly draw one person. So our sample consists of n persons. We have measured the

value of Y for each of these persons. We denote these values by
$$y_1, y_2, \ldots, y_n.$$
Furthermore, we denote the numbers of persons at the selected addresses by
$$a_1, a_2, \ldots, a_n.$$
The weight $w_i$ of person i is equal to
$$w_i = a_i \, \frac{M}{N},$$
in which the index i takes values from 1 to n.
The estimate for the population mean is now equal to
$$\bar{y}_w = \frac{w_1 y_1 + w_2 y_2 + \ldots + w_n y_n}{n} = \frac{1}{n} \sum_{i=1}^{n} w_i y_i.$$

Example 8.9. Estimating a mean for a two-stage sample

A target population consists of 20,000 persons, distributed over 7,000 addresses. We have drawn a
simple random sample of 20 addresses. One person was randomly drawn at each selected address.
The selected persons were asked how many hours they listened to the local radio station last week.
The table below contains the ingredients for computing an estimate of the population mean of the
number of hours people listened last week:

Address   Number of persons   Hours listened   Weight   Weight × Hours
1 2 0.00 0.70 0.000
2 1 3.40 0.35 1.190
3 4 4.60 1.40 6.440
4 3 4.10 1.05 4.305
5 1 3.90 0.35 1.365
6 6 0.00 2.10 0.000
7 2 3.40 0.70 2.380
8 1 7.30 0.35 2.555
9 1 0.00 0.35 0.000
10 4 0.00 1.40 0.000
11 2 3.80 0.70 2.660
12 2 0.00 0.70 0.000
13 3 0.00 1.05 0.000
14 4 4.20 1.40 5.880
15 1 5.50 0.35 1.925
16 3 4.40 1.05 4.620
17 2 6.40 0.70 4.480
18 1 5.40 0.35 1.890
19 3 0.00 1.05 0.000
20 1 0.00 0.35 0.000
Total 39.690

The weighted sum is equal to 39.690. The estimate for the mean hours listened last week is:

$$\frac{39.690}{20} = 1.985.$$
Note that the simple sample mean is here equal to 2.820 hours. This is a higher estimate than the
weighted estimate. Indeed, failure to include weights in the estimation procedure leads to a wrong
estimate.
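The weighted mean of Example 8.9 can be reproduced with a short Python sketch; the data layout and variable names are our own.

# (persons at address, hours listened) for the 20 respondents of Example 8.9
sample = [(2, 0.0), (1, 3.4), (4, 4.6), (3, 4.1), (1, 3.9), (6, 0.0), (2, 3.4), (1, 7.3),
          (1, 0.0), (4, 0.0), (2, 3.8), (2, 0.0), (3, 0.0), (4, 4.2), (1, 5.5), (3, 4.4),
          (2, 6.4), (1, 5.4), (3, 0.0), (1, 0.0)]
M, N, n = 7000, 20000, len(sample)

y_w = sum((a * M / N) * hours for a, hours in sample) / n     # weighted mean
print(y_w)                                                    # about 1.985, matching Example 8.9
print(round(sum(hours for _, hours in sample) / n, 2))        # 2.82, the unweighted mean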

Also for estimating a population mean in a two-stage sample the principle applies that larger samples
produce more precise estimates. This precision of an estimator is determined by first computing the
variance of the estimator, then the standard error of the estimator, followed by the margin of error
and the confidence interval. However, the formulae are different from those for the simple random sample.

The variance of the estimator for a mean in a two-stage sample

Let yw be the estimator for a population mean as defined above. The variance of yw is equal to

1 N w kYk  Y 
2
Var( yw )   w
nN k 1
,
k

where we sum over all objects in the target population. You cannot compute this expression in
practice, because it would require the values of the target variable for all objects in the population.
Fortunately, you can estimate the variance with the following expression:
$$\mathrm{var}(\bar{y}_w) = \frac{1}{n(n-1)} \sum_{i=1}^{n} (w_i y_i - \bar{y}_w)^2,$$

where summation is over all objects in the sample.

To compute the 95%-confidence interval, you first have to determine the (estimated) variance of the
estimator. By taking the square root of this variance, you get the standard error of the estimator. Next,
you obtain the margin of error by multiplying the standard error of the estimator by 1.96. For the
lower bound of the confidence interval you subtract the margin of error from the estimate, and for the
upper bound you add the margin of error to the estimate.

Example 8.10. Confidence interval for a mean in a two-stage sample

A target population consists of 20,000 persons, distributed over 7,000 addresses. We have drawn a
simple random sample of 20 addresses. One person was randomly drawn at each selected address.
The selected persons were asked how many hours they listened to the local radio station last week.
The table below contains the ingredients for computing a 95%-confidence interval for the
population mean of the number of hours people listened last week.
We obtain the (estimated) variance by substituting the total of the last column in the expression for
the estimated variance. The result is
$$\mathrm{var}(\bar{y}_w) = \frac{1}{20 \times 19} \times 87.056 = 0.229.$$

Address   a_i   y_i   w_i = a_i M/N   w_i y_i   w_i y_i - y_w   (w_i y_i - y_w)^2
1 2 0 0.70 0.000 -1.985 3.938
2 1 3.4 0.35 1.190 -0.795 0.631
3 4 4.6 1.40 6.440 4.456 19.851
4 3 4.1 1.05 4.305 2.321 5.385
5 1 3.9 0.35 1.365 -0.620 0.384
6 6 0 2.10 0.000 -1.985 3.938
7 2 3.4 0.70 2.380 0.396 0.156
8 1 7.3 0.35 2.555 0.571 0.325
9 1 0 0.35 0.000 -1.985 3.938
10 4 0 1.40 0.000 -1.985 3.938
11 2 3.8 0.70 2.660 0.676 0.456
12 2 0 0.70 0.000 -1.985 3.938
13 3 0 1.05 0.000 -1.985 3.938
14 4 4.2 1.40 5.880 3.896 15.175
15 1 5.5 0.35 1.925 -0.060 0.004
16 3 4.4 1.05 4.620 2.636 6.946
17 2 6.4 0.70 4.480 2.496 6.228
18 1 5.4 0.35 1.890 -0.095 0.009
19 3 0 1.05 0.000 -1.985 3.938
20 1 0 0.35 0.000 -1.985 3.938
Sum 39.690 0.000 87.056
Mean 1.985

If we take the square root of the variance, we get the standard error:

$$s(\bar{y}_w) = \sqrt{\mathrm{var}(\bar{y}_w)} = \sqrt{0.229} = 0.479.$$

The margin of error of the 95%-confidence interval is equal to


$$M = 1.96 \times s(\bar{y}_w) = 1.96 \times 0.479 = 0.938.$$

The lower bound of the confidence interval is equal to
$$\bar{y}_w - M = 1.985 - 0.938 = 1.047,$$
and the upper bound is
$$\bar{y}_w + M = 1.985 + 0.938 = 2.923.$$

We can say with 95% confidence that the mean number of hours listened to the local radio station last week lies between 1.0 and 2.9 hours. This is a wide interval. It is probably too wide for all
practical purposes. The cause of this wide interval is the small sample size of only 20 persons. For a
more precise estimate, you must draw a larger sample.
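The full computation of Example 8.10 (weights, weighted mean, estimated variance and 95%-confidence interval) can be sketched in Python as follows; variable names are our own.

import math

# (persons at address, hours listened) for the 20 respondents of Examples 8.9 and 8.10
sample = [(2, 0.0), (1, 3.4), (4, 4.6), (3, 4.1), (1, 3.9), (6, 0.0), (2, 3.4), (1, 7.3),
          (1, 0.0), (4, 0.0), (2, 3.8), (2, 0.0), (3, 0.0), (4, 4.2), (1, 5.5), (3, 4.4),
          (2, 6.4), (1, 5.4), (3, 0.0), (1, 0.0)]
M, N, n = 7000, 20000, len(sample)

wy = [(a * M / N) * hours for a, hours in sample]         # products w_i * y_i
y_w = sum(wy) / n                                         # weighted mean, about 1.985
var_yw = sum((v - y_w) ** 2 for v in wy) / (n * (n - 1))  # estimated variance, about 0.229
margin = 1.96 * math.sqrt(var_yw)                         # margin of error, about 0.94
print(round(y_w - margin, 1), round(y_w + margin, 1))     # roughly (1.0, 2.9)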

9 The non-response problem
We described in the previous chapters how you should set up a poll. If you keep to the guidelines, your
poll will produce valid and reliable results that can be generalised to the target population.
Unfortunately, there are always practical problems that may affect these results. One of the most
important problems is non-response. Non-response occurs when objects in the selected sample (that belong to the target population) do not provide the requested information, or when the provided information is unusable.
Non-response may seriously affect the outcomes of your poll. Therefore, you should always attempt to
prevent non-response. Unfortunately, whatever your efforts, you will always have non-response. If you
cannot prevent it from happening, you have to do something to remove or reduce a possible bias of
your estimates.
This chapter is devoted to the problem of non-response. We restrict ourselves to unit non-response.
Unit non-response occurs when a selected object does not provide any information at all. The
questionnaire form remains completely empty. Another type of non-response is item non-response.
This occurs when some questions have been answered, but no answer is obtained for some other,
possibly sensitive, questions. We already explained in chapter 7 how you can deal with item non-
response by applying some kind of imputation technique.
We describe the non-response problem in more detail in section 9.1. Section 9.2 shows how you can
get more insight into the effects of non-response in your survey. Section 9.3 is devoted to
techniques to correct for a bias due to non-response.
9.1 Non-response
One obvious consequence of non-response is that the realised sample is smaller than planned. If you
planned to have a sample of size 1,000, and you selected 1,000 objects from the sampling frame, but
only half of the objects responded, you are left with a sample of size 500. A lower sample size will
increase the variance of an estimator, and therefore it will also decrease its precision. Valid estimates
can still be obtained as the computed margins of errors or confidence intervals take into account the
lower sample size.
If you want to avoid the realised sample being too small, you should make the initial sample size larger. For example, if you want a response of 1,000 people in your poll, and you expect a response rate of
around 50%, your initial sample size should be around 2,000 people.
A far more serious consequence of non-response is that estimates of population characteristics may be
biased. This situation occurs if, due to non-response, some groups in the population are under- or
over-represented in the sample, and these groups behave differently with respect to the phenomena
you are investigating. This is called selective non-response.
If you have non-response in your poll, it is likely that your estimates are biased unless you have very
convincing evidence to the contrary. Here are a few examples:
 In a Dutch Victimisation Survey data collection was carried out by means of face-to-face
interviews. It turned out that people who are afraid to be at home alone at night were less inclined to participate in the survey. They did not want to open the door at night.
 People who refused to participate in a Dutch Housing Demand Survey had lower housing demands than those who responded.
 Mobile people were under-represented in a Dutch Mobility Survey. They were more often not at home when the interviewer called.

 In many election polls, voters are over-represented among the respondents, and non-voters are
over-represented among the non-respondents.

Example 9.1. Non-response in a radio listening poll

We simulate what the effect of non-response can be in a radio listening poll. A population of 15,000
people was constructed in which 8,535 (56.9%) people listened to the local radio station. We
selected a large number of samples (with simple random sampling), and computed the percentage
of listeners in each sample. Thus, we obtained a large number of estimates. The values of these
estimates were displayed in the form of a histogram.
The graph on the left in figure 9.1 shows the result of drawing 1,000 samples of size 1,000. The
vertical black line represents the population value to be estimated: 56.9%. The estimates are
distributed around the value to be estimated. Many estimates are close to this value. On average, the
estimates are equal to this value. Therefore, we can say the sample mean is an unbiased estimator.

Figure 9.1. Simple random samples without (left) and with (right) non-response

Now, we introduce non-response. We do this in such a way that non-response is high among the
high-educated, and low among the low-educated. Again, we draw 1,000 samples of size 1,000, and
compute percentages of listeners. The distribution of these estimates is shown in the graph on the
right in figure 9.1.
The distribution is not concentrated any more around the true value. The distribution has shifted to
the right. The average estimate is 64.6%. So there is a bias of 64.6 – 56.9 = 7.7%. The cause of this
bias is that low-educated people listen more to the radio than high-educated people, and the low-educated people are over-represented in this poll.
Note that the variation of the estimates is larger in the graph on the right. This is due to the smaller realised sample sizes: the response rate was approximately 62%.
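The kind of simulation described in Example 9.1 is easy to carry out yourself. The sketch below is a minimal Python version: the split of the population into education groups, their listening rates and their response probabilities are invented for illustration; only the population size, the approximate percentage of listeners and the design (1,000 samples of size 1,000) follow the example.

import random

random.seed(1)

# artificial population of 15,000 persons in two education groups; the listening
# rates (70% and 37%) are chosen so that roughly 57% of the population listens
population = ([("low", random.random() < 0.70) for _ in range(9000)] +
              [("high", random.random() < 0.37) for _ in range(6000)])

# assumed response probabilities: low-educated people respond much more often
resp_prob = {"low": 0.75, "high": 0.42}

estimates = []
for _ in range(1000):                                     # 1,000 samples of size 1,000
    sample = random.sample(population, 1000)
    respondents = [listens for edu, listens in sample if random.random() < resp_prob[edu]]
    estimates.append(100 * sum(respondents) / len(respondents))

print(sum(estimates) / len(estimates))   # well above the true value: a non-response bias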

The problems of non-response have not diminished over time. To the contrary, surveys and polls in
many countries suffer from increased non-response rates. As an example, figure 9.2 shows the
response rate over the years of one of the most important surveys of Statistics Netherlands. It is called
the Labour Force Survey (LFS). The LFS is conducted in all member states of the European Union. It
collects information on labour participation of people aged 15 and over as well as on persons outside
the labour force. The response was high in the 1970’s (almost 90%), but it declined over the years.
Nowadays the response rate is below 60%. A rate of 60% is considered a good result. Indeed, it takes a
lot of effort to reach such a response rate.

Figure 9.2. The response rate in the Dutch Labour Force Survey

The response rate of a poll depends on many factors. One important factor is the topic of the poll. If people are interested in the topic of the poll, they are more inclined to participate. The chances of success are much smaller for dull, uninteresting, or irrelevant polls.
If the target population of your poll consists of households, the questions can be answered by any
person in the household. This makes it easier to collect information about the household, and thus
increases the probability of response. If you need to talk to one specific person in the household,
establishing contact and getting cooperation can be much more cumbersome.
Stoop et al. (2010) show that response rates differ by country. See also figure 9.3. Even the period of
the year plays a role. It is wise to avoid the summer holidays period, since a lot of people will not be at
home. Also around Christmas it may be difficult to contact people. They are probably too busy with
other activities. Sending questionnaires by mail in the Christmas period is also not a good idea, as they
may get lost in all other mail (Christmas/New Year cards).

Figure 9.3. Response rates in the third round of the European Social Survey

Use of interviewers for data collection usually leads to higher response rates. They can persuade
people to participate in a poll. Therefore, response rates of face-to-face polls and telephone polls are
often much higher than those of mail and online polls.

In the United States, response rates of opinion polls have fallen dramatically over the years. Pew
Research Center (2012) reports that the response rate of a typical RDD telephone survey dropped
from 36% in 1997 to 9% in 2012. According to Vavrek & Rivers (2008), response rates have
deteriorated over time so that most media polls (telephone polls using RDD) have response rates of
around 20 percent.
Online polls are not successful in increasing response rates. Generally, response rates for this mode of
data collection are at most 40%. For example, Bethlehem & Biffignandi (2012) describe an experiment
with a survey (the Safety Monitor), where the online mode resulted in a response of 41.8%.
Non-response can have various causes. It is a good idea to classify these causes. Different causes of non-response may have different effects on estimators, and therefore require different treatment. The
three main causes of non-response are no-contact, refusal, and not-able.

Figure 9.4. Main causes of non-response

No-contact Refusal Not-able

The first step in getting response from a person in your sample, is establishing contact. If contact is not
possible, you have a case of non-response due to no-contact. There can be various reasons why the
contact attempt may fail. For example, someone is not at home because he was away for a short period
of time (shopping), for a long period of time (a long holiday in Spain), or even permanently (moved to
a different, unknown, address). Non-contact may also be caused by gatekeepers, such as guards or
doormen of secured residential complexes, or by dangerous dogs around the house. A selected person may even have died between the moment of sample selection and the contact
attempt.
Non-contact also occurs in telephone polls, if someone does not answer the telephone, or the
telephone is busy. Non-contact in mail polls may be caused by people being away for a long time, by
using a wrong address, or by people throwing away all unwanted mail immediately.
You may attempt to reduce non-response by non-contact by making several attempts. If someone is
away today, he can be at home tomorrow. It is not uncommon for large survey organisations, like
national statistical institutes, to make up to six contact attempts before they decide to close the case
and record it as a no-contact.
As soon as you have established contact with someone, you have to determine whether he or she
belongs to the target population. Those who do are called eligible. If people are not eligible, they do not
have to be interviewed. They are a case of over-coverage, which means you can ignore them. If people
are eligible, they have to be persuaded to complete the questionnaire form. If you fail to get their
cooperation, it is a case of non-response due to refusal.
It helps to distinguish between temporary refusal and permanent refusal. Temporary refusal often means that the moment is not suitable for answering the questions. The baby is crying, they are busy in the kitchen, or there is a football match on TV. Maybe it is possible to make an appointment for a later
date and time. In case of a permanent refusal, it will not be possible to get the answers at any time.

Permanent refusal may, for example, occur if people do not like the topic of the poll, or if they consider
the poll an intrusion of their privacy.
Even if it is possible to contact people, and they want to participate, it still may be impossible to obtain
their answers to the questions. Physical or mental conditions may prevent them from doing so. This is called
non-response due to not-able. People can be ill, drunk, deaf, blind or have a mental handicap. Another
condition causing people not to fill in the questionnaire is a language problem. They speak a different
language, and therefore they do not understand the questions.

9.2 Analysis of the non-response


If you have non-response in your poll, you should be careful. Non-response may cause your estimates
to be biased. Therefore, you cannot simply apply the estimation techniques of chapter 8. You have to
correct your survey data for a possible bias due to non-response.
You always have to carry out a non-response analysis. The first step is finding out whether non-
response may cause problems. This is the topic of this section. If you detect non-response problems,
you have to correct your estimates. This is called adjustment weighting. This topic will be treated in
section 9.3.

The effect of non-response

To explore the effect of non-response, often a model is used in which each object is assigned a
certain, unknown probability of response. Persons with a high response probability often
participate in surveys and polls, whereas persons with a low response probability rarely participate.
We denote the response probabilities of all objects in the target population by
ρ1, ρ2, …, ρN.
If we draw a simple random sample from this population, not everyone will participate. This is due
to non-response. Therefore, we do not obtain n observations y1, y2, …, yn, but fewer. We denote the
number of respondents by m (where m is smaller than n), and the available values of the target
variable by
y1, y2, …, ym.
Suppose we want to estimate the population mean of the target variable Y . At first sight, it seems
obvious to apply the analogy principle and use the mean
y1  y2  ...  ym 1 m
yR    yi
m m i 1
of the available values as the estimator. Unfortunately, this is a biased estimator. It can be shown
that the bias is equal to
R ,Y  S   SY

The quantity $R_{\rho,Y}$ is the correlation coefficient. This is a measure of the strength of the relationship
between the target variable and the response probabilities. The correlation coefficient is only 0 if
there is no relationship. The stronger the relationship, the larger the bias.
The quantity Sρ is the standard deviation of the response probabilities. It is the square root of the
variance of the response probabilities. The more the response probabilities vary in size (there are
small and large response probabilities), the larger the bias will be. There is no non-response bias if
all response probabilities are equal.

The quantity  is the mean of all response probabilities. We can estimate this quantity by the
fraction response in the sample. The higher the response rate, the smaller the bias. So you should be
careful if the response rate of your poll is low.
You can find a more in-depth treatment of the problem of non-response in surveys and polls in
Bethlehem, Cobben & Schouten (2011).
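A small numerical illustration of the bias formula: for an artificial population with known response probabilities, the bias computed from the correlation, the standard deviations and the mean response probability coincides with the difference between the expected response-based mean and the true mean. All numbers in the Python sketch below are invented.

import statistics as st

# artificial population: target variable Y (1 = has the property) and response probability
y   = [1, 1, 1, 0, 0, 1, 0, 1, 0, 0]
rho = [0.9, 0.8, 0.9, 0.3, 0.4, 0.8, 0.3, 0.7, 0.4, 0.3]

Y_bar, rho_bar = st.mean(y), st.mean(rho)

# approximate expected value of the response-based mean: each person contributes in
# proportion to his or her response probability
y_R = sum(r * v for r, v in zip(rho, y)) / sum(rho)

# the bias formula: R * S_rho * S_Y / rho_bar; its numerator equals the population
# covariance between the response probabilities and the target variable
cov = sum((r - rho_bar) * (v - Y_bar) for r, v in zip(rho, y)) / len(y)
bias = cov / rho_bar

print(round(y_R - Y_bar, 3), round(bias, 3))   # both about 0.207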

How can you detect whether non-response is selective? The available data with respect to the target
variables will not be of much use. You only have data for the respondents, and not for the non-
respondents. So you cannot find out whether respondents and non-respondents differ with respect to
these variables. The way out for this problem is to use auxiliary variables. Auxiliary variables are
variables that you measured in your poll and for which you know their distribution in the population.
For example, if you see that 60% is male in your poll and 40% is female, whereas there are 50% males
and 50% females in the target population, you can conclude there is something wrong. You have too
many males in your poll and too few females. To say it in other words: the response is not
representative with respect to gender. Males respond better than females. Apparently, there is a
relationship between gender and response behaviour. This leads to selective response.
It is important that you look for auxiliary variables that have a relationship with response behaviour. If
you find such variables, you know that the response of your poll is selective with respect to these
variables. If these auxiliary variables also have a strong relationship with the target variables of your
poll, your estimates for the target variables are biased.
If it turns out the response is selective, you cannot use the estimation procedures described in chapter 8. First, you have to correct your data for the lack of representativity. You need the auxiliary variables
for this correction, particularly the variables that show a relationship with response behaviour.
An important question is where to find auxiliary variables. You need variables that you not only
measure in your poll, but for which you also know the population distribution (or complete sample
distribution). Here are a few sources of auxiliary variables:
 The sampling frame. Some sampling frames contain more variables than just the contact
information. An example is the population register in The Netherlands, which contains gender, age
(derived from date of birth), marital status, and country of birth.
 The national statistical institute. National statistical institutes like Statistics Netherlands have the
population distribution of many variables. You should be careful that these variables have been
measured for the same population as the variables in your poll.
 Observations by interviewers. Interviewers can observe some properties of the persons in the
sample, like the social-economic status of the neighbourhood, the type of the house, and the age of
the house.
Figure 9.5 contains an example of a graph that gives insight into a possible relationship between the
auxiliary variable ‘degree of urbanisation’ and response behaviour. The data come from a survey of
Statistics Netherlands. It is the survey of well-being of the population in 1998. It is clear that the
response rate is very low (between 40% and 50%) in strongly urbanised areas. These are the big
cities. The less urbanised an area, the higher the response rate is. Indeed, the response rate in rural
areas is almost 70%. The low response rate in big cities is a problem in many countries.

Figure 9.5. Response rate by degree of urbanisation

Figure 9.6 shows a graph for another variable: the size of the household of a person. Also here there is
a clear pattern. The response rate increases as the size of the household increases. The main cause of
the low response rate of 1-person households is that these singles are very difficult to contact.
Apparently, they are often away from home. Moreover, some small households are not able to
participate. These are old singles and couples. Finally, singles tend to refuse more often than people in
multi-person households.

Figure 9.6. Response rate by household size

This simple analysis shows that people in big cities and people in small households may be under-
represented in your poll. If you investigate something that is related to these variables, you may expect
your estimates to be biased.
There are many more auxiliary variables that are related to response behaviour. See Bethlehem, Cobben & Schouten (2011) for a detailed treatment.

9.3 Correcting for non-response


If the analysis of non-response provides sufficient evidence for a potential lack of representativity, it is
not scientifically sound to use the estimation procedures of chapter 8 without any correction.
Estimates would be biased. The most frequently used correction method is adjustment weighting. This
comes down to assigning a correction weight to each respondent. In the computation of estimates each

respondent does not count for 1 anymore, but for the associated correction weight. Persons in under-
represented groups get a weight larger than 1, and persons in over-represented groups get a weight
smaller than 1.
To compute correction weights, you need auxiliary variables. Adjustment weighting is only effective if
the auxiliary variables satisfy the following two conditions:
 They must have a strong relationship with the target variables of your poll. If there is no
relationship, weighting will not reduce the bias.
 They must have a strong relationship with the response behaviour in your poll. If the auxiliary variables are unrelated to response behaviour, adjustment weighting will not reduce the non-response bias.
You use the auxiliary variables to make the response representative. You do that by computing
correction weights in such a way that the weighted distribution of each auxiliary variable in the
response is equal to the corresponding population distribution. The correction weights for under-
represented groups will be larger than 1, and those for over-represented groups will be smaller than 1.
If it is possible to make the response representative with respect to a set of auxiliary variables, and all
these auxiliary variables have a strong relationship with the target variables, the (weighted) response
will also become (approximately) representative with respect to the target variables. Therefore,
estimates based on the weighted response will be better than estimates based on the unweighted
response.
We use a simple example to illustrate adjustment weighting. Elections are to be held in the (fictitious)
town of Bentwood. The number of eligible voters is 30,000. You carry out an election poll. The initial
sample size is 1,000, but due to non-response you only have 500 respondents. We can use gender as
an auxiliary variable because we have recorded the gender of the respondents in our poll, and we know the
population distribution of gender in the town. Table 9.1 shows how the weights are computed.

Table 9.1. Adjustment weighting with the variable gender

           Response                Population              Correction
           Frequency   Percentage  Frequency   Percentage  weight
Male          240        48.0%       15330       51.1%     1.064583
Female        260        52.0%       14670       48.9%     0.940385
Total         500       100.0%       30000      100.0%

The percentage of males in the response (48.0%) differs from the percentage of males in the population (51.1%). There are too few males in the response. We can make the response representative with respect to gender by assigning all males a weight equal to
$$\frac{\text{Percentage of males in the population}}{\text{Percentage of males in the response}} = \frac{51.1}{48.0} = 1.064583.$$

In a similar way, all females are assigned a correction weight equal to
$$\frac{\text{Percentage of females in the population}}{\text{Percentage of females in the response}} = \frac{48.9}{52.0} = 0.940385.$$

Not surprisingly, the correction weight of males is larger than one (1.064583). Indeed, they are under-
represented in the response. After adjustment weighting, each male counts for 1.064583 males.
Females are over-represented and get a weight smaller than 1 (0.940385). Hence, each female will
count for only 0.940385 female.

Suppose that we want to use the weighted response to estimate the percentage of males in town. All 240 males have a weight of 1.064583, so the weighted percentage is
$$100 \times \frac{240 \times 1.064583}{500} = 100 \times \frac{255.5}{500} = 51.1.$$
This is exactly the percentage of males in the target population. Likewise, the estimate for the
percentage of females in the target population is exactly equal to the true population value. So we can
conclude that the weighted response is representative with respect to gender.
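A minimal Python sketch of adjustment weighting with a single auxiliary variable, reproducing the weights of Table 9.1; function and variable names are our own.

def correction_weights(response_counts, population_counts):
    """Correction weight per category: population percentage divided by response percentage."""
    n_resp, n_pop = sum(response_counts.values()), sum(population_counts.values())
    return {cat: (population_counts[cat] / n_pop) / (response_counts[cat] / n_resp)
            for cat in response_counts}

response = {"male": 240, "female": 260}           # Table 9.1, response
population = {"male": 15330, "female": 14670}     # Table 9.1, population

weights = correction_weights(response, population)
print(weights)                                    # about {'male': 1.0646, 'female': 0.9404}

# the weighted percentage of males equals the population value of 51.1%
print(round(100 * response["male"] * weights["male"] / sum(response.values()), 1))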
We have now described adjustment weighting with one auxiliary variable. It is better, however, to use
more auxiliary variables, as this is more effective to reduce non-response bias. Adjustment weighting
with several auxiliary variables is a bit more cumbersome, as we have to cross-classify these variables.
We show how this works with an example for two auxiliary variables gender and age (in three
categories young, middle-aged and old). In the case of one auxiliary variable, there are as many groups
(also called strata) as the variable has categories. So, for gender there are two groups (males and
females). In the case of two auxiliary variables gender and age, there is a group for each combination
of gender and age. So there are 2 × 3 = 6 groups (young females, middle-aged females, old females,
young males, middle-aged males, and old males). If we know the distribution in the target population
over these 6 groups, we can compute a correction weight for each group. Table 9.2 illustrates this with
a numerical example.

Table 9.2. Adjustment weighting with two auxiliary variables


            Response          Population            Correction weight
            Male    Female    Male     Female       Male        Female
Young       115       75      6780      6270        0.982609    1.393333
Middle       80       85      4560      4320        0.950000    0.847059
Old          65       80      3990      4080        1.023077    0.850000

We have computed the weights in exactly the same way as in the example with one auxiliary variable. Weights are equal to population percentages divided by corresponding response percentages. For example, the percentage of old females in the population is 100 × 4080 / 30000 = 13.6%, whereas the percentage of old females in the response is 100 × 80 / 500 = 16.0%. So the correction weight for this group is equal to 13.6 / 16.0 = 0.850000.
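The same computation extends directly to the cross-classification of gender and age. The Python sketch below reproduces the weights of Table 9.2; the dictionary keys are our own.

# response and population counts per group of Table 9.2
response = {("young", "male"): 115, ("young", "female"): 75,
            ("middle", "male"): 80, ("middle", "female"): 85,
            ("old", "male"): 65, ("old", "female"): 80}
population = {("young", "male"): 6780, ("young", "female"): 6270,
              ("middle", "male"): 4560, ("middle", "female"): 4320,
              ("old", "male"): 3990, ("old", "female"): 4080}

n_resp, n_pop = sum(response.values()), sum(population.values())   # 500 and 30,000
weights = {g: (population[g] / n_pop) / (response[g] / n_resp) for g in response}

for group, w in sorted(weights.items()):
    print(group, round(w, 6))     # e.g. ('old', 'female') -> 0.85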
As a result of weighting by gender and age, the response becomes representative with respect to
gender and age. Moreover, the response becomes representative with respect to gender within each
age group, and it becomes representative with respect to age for each gender.
As you use more and more relevant auxiliary variables, you will be more effective in reducing the non-
response bias. You have to keep in mind that adjustment weighting only works if the groups you have
obtained by crossing auxiliary variables, satisfy the two conditions we already mentioned. We have
rephrased them here:
 The groups have to be homogeneous with respect to the target variables in your poll. This means
that the persons in the groups must resemble each other with respect to the target variables. To
say it in a different way: the values of a target variable must vary between groups and not within
groups.
 The groups have to be homogeneous with respect to response behaviour. This means that the
persons in the groups must have more or less the same response probabilities. To say it in a
different way: the response probabilities must vary between groups and not within
groups.

It is not always easy in practice to find proper auxiliary variables for adjustment weighting. Often you
simply have to make do with the variables you have. As a consequence, non-response correction will be less
effective. You may have reduced the bias somewhat, but it will not vanish completely.

Exercise 9.2. Adjustment weighting in a radio listening poll

A radio listening poll was conducted in a town. The main research question was how many people are listening to the local radio station. The target population consisted of all 19,950 inhabitants of
the town with an age of 12 years and over.
A simple random sample of addresses was selected, and at each selected address one person was
drawn at random. The initial sample consisted of 499 persons. In the end, 209 persons participated
in the poll. So the response rate was 100 × 209 / 499 = 41.9%. This is a low response rate. This
means there was a serious risk of biased estimates.
For the population of age 12 and older, the distribution over gender and age (in 3 categories) was
available. Gender and age were recorded in the poll, so that it was possible to apply adjustment
weighting by gender and age. The table below contains the data

            Response (n = 209)      Population (N = 19,950)     Weight
            Male      Female        Male       Female           Male     Female
Young        9.5%      18.3%        12.5%       13.0%           1.136    0.613
Middle      17.7%      28.6%        26.3%       27.7%           1.284    0.835
Old         13.7%      12.3%         8.9%       11.6%           0.561    0.817

There are substantial differences between the response distribution and the population distribution.
Females are over-represented in all age groups. Middle-age males are clearly under-represented.
They are probably hard to contact, because they are working. Old males are heavily over-represented.
Note that an address sample was drawn. This implies that people have unequal selection
probabilities. The weights in the table above correct both for the unequal selection probabilities and for the unequal response probabilities.
We show the effect of weighting on the estimates of the percentage of people listening to the local
radio station. See the table below:

Do you ever listen to the local radio station?


Estimate based on uncorrected response: 55.0%
Estimate after correction for unequal selection probabilities: 59.1%
Estimate after correction for non-response: 57.8%

The uncorrected estimate is 55.0%. This is a wrong estimate as it is not corrected for unequal
selection probabilities and non-response. The second estimate (59.1%) only corrects for unequal
selection probabilities. This would be a correct estimate in the case of full response. The third
estimate (57.8%) corrects for both unequal selection probabilities and non-response. This is the
best estimate.

If there are many auxiliary variables, it may become difficult to compute correction weights. There
could be groups (strata) without observations, in which case it is simply not possible to compute weights (division by zero). Also, the population distribution of the cross-classification of variables may be missing. Other,

more general, weighting techniques can be applied in these situations, such as linear weighting
(generalised regression estimation) or multiplicative weighting (raking ratio estimation). See
Bethlehem, Cobben & Schouten (2012) for more information.
In the market research world, another weighting technique is gaining more and more popularity. It is
called propensity weighting. Participation in the poll is modelled by means of a logistic regression
model. This comes down to predicting the probability of participation in a survey from a set of
auxiliary variables. The estimated participation probabilities are called response propensities. A next
step could be to form groups of respondents with (approximately) the same response propensities.
Finally, you can compute a correction weight for each group. A drawback of propensity weighting is
that the individual values of the auxiliary variables for the non-participating persons are required.
Such information is often not available. Note that response propensities can also be used in other ways
to reduce the bias, see e.g. Bethlehem, Cobben & Schouten (2011) and Bethlehem (2010).
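A rough sketch of propensity weighting, under the (often unrealistic) assumption that the auxiliary variables are known for respondents and non-respondents alike. It uses scikit-learn's logistic regression; the data, the number of propensity classes and the weighting rule (sampled count divided by respondent count per class) are illustrative choices, not a prescribed method.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# invented auxiliary data (age, household size) for a sample of 1,000 persons, together
# with an indicator of whether the person responded; response depends on age here
X = np.column_stack([rng.integers(18, 80, 1000), rng.integers(1, 6, 1000)])
responded = (rng.random(1000) < 1 / (1 + np.exp(-0.03 * (X[:, 0] - 50)))).astype(int)

# model participation with logistic regression and predict a response propensity
model = LogisticRegression().fit(X, responded)
propensity = model.predict_proba(X)[:, 1]

# form five propensity classes; within each class, respondents get the weight
# (number sampled in the class) / (number responded in the class)
classes = np.digitize(propensity, np.quantile(propensity, [0.2, 0.4, 0.6, 0.8]))
weights = np.zeros(len(X))
for c in range(5):
    members = classes == c
    resp = members & (responded == 1)
    if resp.any():
        weights[resp] = members.sum() / resp.sum()

print(weights[responded == 1][:5])   # respondents in low-propensity classes get larger weights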

10 Online polls
Traditionally, polls used paper questionnaire forms to collect data. Data collection came in three
modes: face-to-face polls, telephone polls and mail polls. With the development of information technology since the 1970s, computer-assisted interviewing (CAI) emerged. The paper questionnaire
was replaced by a computer program asking the questions. The computer took control of the
interviewing process, and it also checked answers to questions on the spot. Computer-assisted
interviewing could also be carried out in three different modes: computer-assisted telephone
interviewing (CATI), computer-assisted personal interviewing (CAPI), and computer-assisted self-
interviewing (CASI).
The rapid development of the internet in the last decades has led to a new mode of data collection: the
online poll. This is sometimes also called computer-assisted web interviewing (CAWI). In the 1980s, the first experiments were already carried out with e-mail. E-mail questionnaires had a lot of restrictions (for example with respect to layout), and internet coverage was still low. Therefore, e-mail
polls were not a big success.
This all changed in 1995, when HTML 2.0 became available. HTML (HyperText Markup Language) is a
markup language for web pages. The first version of HTML was developed by Tim Berners-Lee in 1991, and thus the World Wide Web emerged. A strong point of HTML 2.0 was the possibility to have data
entry forms on the computer screen. Computer users could enter data, and these data were sent from
the computer of the user to the server of the researcher.
HTML 2.0 made it possible to design questionnaire forms, and to offer these forms to respondents. And
so online polls emerged. These online polls are almost always self-administered: respondents visit the
website, and complete the questionnaire by answering the questions.
Not surprisingly, researchers use, or consider using, online polls. Indeed, online polls seem to have
some attractive advantages:
 Now that so many people are connected to the internet, an online poll is a simple means to get
access to a large group of potential respondents.
 The questionnaires of the poll can be distributed at very low costs. No interviewers are needed, and
there are no mailing and printing costs.
 A poll can be launched very quickly. Little time is lost between the moment the questionnaire is ready and the start of the fieldwork.
 Online polls offer new, attractive possibilities, such as the use of multimedia (sound, pictures, animation and movies).
Online polls quickly became very popular. In countries with high internet coverage (like The
Netherlands), almost all polls are now online polls. For example, during the campaign for the
parliamentary election in September 2012, the four main poll organisations in The Netherlands were
Peil.nl (Maurice de Hond), Ipsos, TNS NIPO and GfK Intomart. All their polls were online polls.
Therefore, they could do polls very quickly. Some even conducted a poll every day.
Online polls seem to be a fast, cheap and attractive means of collecting large amounts of data. However,
there are methodological problems. These problems are partly caused by using the internet for
selecting respondents, and partly by using the web as a measuring instrument. If these problems are
not seriously addressed, online polls may result in low quality data for which no proper inference can
be made with respect to the target population of the poll. We will discuss the following issues in this
chapter:

 Under-coverage. Since not everyone in the target population has access to the internet, portions of
the population are excluded from the poll. This may lead to biased estimates.
 Sample selection. How to select a simple random sample of persons having access to the internet?
Sometimes, researchers rely on self-selection of respondents. Unfortunately, this approach is
known to produce biased estimates.
 Non-response. Almost every poll suffers from non-response. Non-response rates are particularly high for self-administered surveys like mail and web surveys. Unfortunately, low response rates
increase the bias of estimators.
 Measurement errors. Interviewer-assisted polls like CAPI and CATI polls produce high quality data.
However, interviewer assistance is missing for online polls, and this may lead to more
measurement errors.

10.1 Under-coverage in online polls


Under-coverage occurs in a poll if the sampling frame does not contain all objects of the target
population. Because the sampling frame does not completely cover the target population, some objects
have a zero probability of being selected in the sample. If these objects differ (on average) from those
in the sampling frame, there is a serious risk of estimators being biased.
The obvious sampling frame for an online poll would be a list of e-mail addresses. Sometimes such a
sampling frame exists. For example, all employees of a large company may have a company e-mail
address. Similarly, all students of a university usually have a university e-mail address. The situation is
more complicated for polls having the general population as the target population. In the first place,
not everyone in the population has access to the internet, and in the second place, there is no list of e-
mail addresses of those with internet.

Figure 10.1. Percentage of households with internet access at home in 2013 (Source: Eurostat)

Figure 10.1 shows internet coverage for countries in Europe in 2013. Internet access of households
varies considerably. Internet coverage is highest in the Scandinavian countries (Iceland 96%, Norway
94%, Sweden 93%, Denmark 93%), The Netherlands (95%) and Luxemburg (94%). Internet coverage
is lowest in the south-east of Europe: Romania (58%), Greece (56 %), Bulgaria (54%), and Turkey
(43%).
The problem would be less severe if there were no differences between those with and without
internet access. Then it would still be possible to draw a representative sample. However, there are
differences. Internet access is not equally spread across the population. Bethlehem & Biffignandi
(2012) show that in 2005 the elderly, the low educated and ethnic minority groups in The Netherlands
are less well represented among those having access to the internet. Although internet coverage is very high
in The Netherlands (95%), only 34% of the elderly (of age 75 years and older) use the internet
(source: Statistics Netherlands).
Scherpenzeel & Bethlehem (2011) describe the LISS panel. This is a Dutch online panel recruited by
means of probability sampling from a population register. Again, the elderly and ethnic minority
groups are under-represented. Moreover, internet access among single households is much lower.
They also conclude that voters for the general elections are over-represented.
Similar patterns can be found in other countries. For example, Couper (2000) describes coverage
problems in the United States. It turns out that Americans with higher incomes are much more likely
to have access to the internet. Black and Hispanic households have much less internet access than
White households. People with a college degree have more internet access than those without it.
Furthermore, internet coverage in urban areas is better than in rural areas.
Dillman and Bowker (2001) mention the same problems and conclude therefore that coverage
problems for online polls of the general population cannot be ignored. Certain specific groups in the target population are substantially under-represented and will not be able to fill in the online questionnaire form.
Duffy et al. (2005) conclude that in the USA and the UK web survey respondents tend to be politically
active, are more likely to be early adopters of new technology, tend to travel more, and eat out more.

The under-coverage bias of an online poll

Bethlehem & Biffignandi (2012) show that the bias due to under-coverage of the sample mean yI
as an estimator of the population mean Y of a target variable Y is equal to
B( y I )  NI YI  YNI  ,
N
N
where NNI is the number of people in the population without internet access, YI is the mean of the
target variable for people with internet access, and YNI is the mean of Y for those without internet
access. The magnitude of this bias is determined by two factors:
 The relative size NNI / N of the non-internet population. The more people have access to the
internet, the smaller the bias will be.
 The contrast YI  YNI between the internet population and the non-internet population. It is the
difference between the population means of the two sub-populations. The more the mean of the
target variable differs for these two sub-populations, the larger the bias will be.
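A small numerical illustration of the under-coverage bias formula in Python; the coverage rate and the two group means below are invented.

def undercoverage_bias(n_no_internet, n_total, mean_internet, mean_no_internet):
    """Bias of the internet-based sample mean: (N_NI / N) * (mean_I - mean_NI)."""
    return (n_no_internet / n_total) * (mean_internet - mean_no_internet)

# invented example: 20% of the population has no internet access; people with internet
# listen 2.5 hours per week on average, people without internet 4.0 hours
print(undercoverage_bias(4000, 20000, 2.5, 4.0))   # -0.3: estimates are 0.3 hours too low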

Figure 10.1 shows that the relative size of the non-internet population cannot be neglected in many
countries. Moreover, there are substantial differences between those with and without internet.
Specific groups are under-represented in the internet population, for example the elderly, those with a
low level of education, and ethnic minority groups. So, the conclusion is that generally a random
sample from an internet population will lead to biased estimates of the characteristics of the target
population.
The under-coverage bias of an online poll depends on (1) the fraction of people in the population
without internet, and (2) the difference (contrast) between the population means of those with and
those without internet access. It is to be expected that internet coverage will increase over time.
Hence, the fraction of people without internet will decrease, and this will reduce the bias. It is unclear,
however, whether the contrast will also decrease over time. It is even possible that it increases, as the
remaining group of people without internet access may differ more and more from the internet users.
So the combined effect of a smaller non-internet population and a larger contrast need not necessarily
lead to a smaller bias.
It is important to realise that the under-coverage bias of an online poll does not depend on the sample
size. Consequently, increasing the sample size will have no effect on the bias. So the problem of under-
coverage in online polls does not diminish by collecting more data.
The fundamental problem of online polls is that persons without internet are excluded from the poll.
There are a number of ways in which this problem can be reduced. The first approach is that of the
LISS panel. This is a Dutch online panel consisting of approximately 5,000 households. The panel is
based on a true probability sample of households drawn from the population register of The
Netherlands. Recruitment took place by CATI (if there was a listed telephone number) or CAPI.
Households without access to internet, or who were worried that an internet survey might be too
complicated for them, were offered a simple to operate computer with internet access that could be
installed in their homes for free for the duration of the panel. So the size of the non-internet population
was reduced by giving internet access to those without it. More details about the LISS panel can be
found in Scherpenzeel (2009).
A second approach is to turn the single-mode online poll into a mixed-mode poll. One simple way to do
this is to send an invitation letter to the persons in the sample, and give them a choice to either
complete the questionnaire on the internet or on paper. So, there are two modes here: web and mail.
Another mixed-mode approach is to do it sequentially. You start with the cheapest approach and that
is online interviewing. People are invited to fill in the questionnaire on the internet. Non-respondents
are re-approached by CATI, if a listed telephone number is available. If not, these non-respondents are
re-approached by CAPI.
It will be clear that these approaches increase the costs of a poll, and they may also increase the length of
the fieldwork period. However, this may be the price you have to pay in order to get valid and reliable
statistics.

10.2 Sample selection for an online poll


We already explained in chapter 5 how important it is to select the sample for your poll by means of
probability sampling. You can compute valid (unbiased) estimates of population characteristics only if
you select a real probability sample, every object in the target population has a non-zero probability of
selection, and you know all these probabilities. Furthermore, only under these conditions, you can
compute the precision of estimates.
The principles of probability sampling also apply to online polls. For example, if you want to conduct a
poll among students of a university, you can use a list of e-mail addresses as a sampling frame for a
simple random sample. Unfortunately, many online polls, particularly those conducted by market

research organisations, are not based on probability sampling. The questionnaire of the poll is simply
put on the web. Respondents are those people who happen to have internet, visit the website and
decide to participate in the poll. As a result, the researcher is not in control of the selection process.
Selection probabilities are unknown. Therefore, no unbiased estimates can be computed nor can the
precision of estimates be determined. These polls are called here self-selection polls. Sometimes they
are also called opt-in polls.
Self-selection polls have a high risk of not being representative. In some of these polls, people outside the target population can also participate, and sometimes it is even possible to complete the
questionnaire more than once. It is even possible that certain groups in the population attempt to
manipulate the outcomes of the survey. For example, a group of people tried to influence opinion polls
conducted during the campaign for the parliamentary elections in 2012 in The Netherlands. The group
consisted of 2,500 people. They subscribed to an online opinion panel. Their plan was first to pose as
voters for the Christian Democrats (CDA). Later on they would change their stated preference and vote for
the party for the elderly (50PLUS). Unfortunately for them, and fortunately for the researcher, the attempt
was discovered because so many people subscribed to the panel at the same time. See
Bronzwaer (2012).

Example 10.1. The 2005 Book of the Year Award

An example of the effects of self-selection could be observed in the election of the 2005 Book of the
Year Award, a high-profile literary prize in The Netherlands. The winning book was determined by
means of an online poll. People could vote for one of the nominated books or mention another book
of their choice. More than 90,000 people participated in the survey.

Figure 10.2. The winner of the 2005 Book of the Year Award

The winner turned out to be the new Bible translation published by the Netherlands and Flanders
Bible Societies. This book was not nominated, but nevertheless an overwhelming majority (72%)
voted for it. This was the result of a campaign launched by (among others) Bible societies, a
Christian broadcaster and Christian newspaper. Although this was all completely within the rules of
the poll, the group of voters was clearly not representative of the Dutch population.

Due to these selection problems, you cannot draw scientifically sound conclusions from a self-selection
online poll. If you want valid and reliable conclusions, it is imperative to draw a probability sample from
a proper sampling frame.
Indeed, the American Association for Public Opinion Research (AAPOR) warns about the risks of self-
selection (Baker et al., 2010): “Only when a web-based survey adheres to established principles of
scientific data collection can it be characterized as representing the population from which the sample
was drawn. But if it uses volunteer respondents, allows respondents to participate in the survey more
than once, or excludes portions of the population from participating, it must be characterized as
unscientific and is unrepresentative of any population.”

The self-selection bias of an online poll

To show the effects of self-selection on estimators, it is assumed that each element k in the internet
population has an unknown probability $\pi_k$ of participating in the survey. A naive researcher, assuming
that every element in the internet population has the same selection probability, will use the simple
sample mean $\bar{y}_S$. Bethlehem & Biffignandi (2012) show that the bias of this estimator can be
written as

$$B(\bar{y}_S) = \frac{R_{\pi,Y}\, S_\pi\, S_Y}{\bar{\pi}},$$

in which $R_{\pi,Y}$ is the correlation coefficient between the participation probabilities and the values of
the target variable. Furthermore, $S_\pi$ is the standard deviation of the participation probabilities, $S_Y$ is
the standard deviation of the target variable, and $\bar{\pi}$ is the mean of all participation probabilities.
The bias of the sample mean (as an estimator of the mean of the internet population) is determined
by three factors:
 The average participation probability. If people are more likely to participate in the poll, the
average participation probability will be higher, and thus the bias will be smaller.
 The variation in participation probabilities. The more these probabilities vary (some people
have low participation probabilities, and others have high participation probabilities), the
larger the bias will be.
 The relationship between the target variable and participation behaviour. A strong correlation
between the values of the target variable and the participation probabilities will lead to a large bias.
The bias vanishes if all participation probabilities are equal. In this case, the self-selection process is
comparable to a simple random sample. The bias also vanishes if the participation probabilities do
not depend on the value of the target variable.
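
As an illustration, the sketch below (in R, the package used later in this publication) computes the bias
according to this formula for an entirely hypothetical internet population, and checks it against
simulated self-selection polls. All numbers and names in the sketch are invented for illustration only.

set.seed(1)
N  <- 10000
y  <- rnorm(N, mean = 30, sd = 10)        # hypothetical target variable
pi <- plogis(-3 + 0.05 * y)               # hypothetical participation probabilities, related to y

# Bias according to the formula: R * S_pi * S_Y / mean(pi)
cor(pi, y) * sd(pi) * sd(y) / mean(pi)

# Check: simulate many self-selection polls and compare the average sample
# mean with the true mean of the internet population.
sim <- replicate(1000, {
  r <- rbinom(N, 1, pi) == 1
  mean(y[r])
})
mean(sim) - mean(y)                       # close to the analytic bias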

Selecting a probability sample provides safeguards against these manipulations. It guarantees that
sampled persons are always in the target population, and they can participate only once in the poll.
The researcher is in control of the selection process.
When considering using the internet for a poll, the ideal sampling frame would be a list of e-mail
addresses of every object in the target population. Unfortunately, such sampling frames do not exist
for general population polls. A way out is to do the recruitment in a different mode. One obvious way of
doing this is to send sample persons a letter with an invitation to complete the questionnaire on the
internet. To that end, the letter must contain a link to the poll website and a unique
identification code.
Recruitment by mail is more cumbersome than by e-mail. In case of e-mail recruitment, the
questionnaire can simply be started by clicking on the link in the e-mail. In case of mail recruitment,
more actions are required: going to the computer (e.g. upstairs in the study room), starting the
computer, connecting to the internet, and typing in the proper internet address (with the risk of
making typing errors).

Example 10.2. Election polls

On 12 September 2012 there were parliamentary elections in The Netherlands. The elections were
preceded by a short, but intense campaign. Polling organisations were very active. The four main
ones were Maurice de Hond (Peil.nl), Ipsos (Politieke Barometer), TNS NIPO, and GfK Intomart (De
Stemming). They conducted many polls. Some of them even did a poll every day in the final phase of
the campaign.
All four polling organisations used online polls. They selected samples from online panels. These
panels were constructed by means of self-selection. Consequently, the election polls only contain
people who have spontaneously subscribed to the panel. Probably these people like doing polls and
are interested in politics.
Table 10.1 compares the election results (seats in parliament) with the outcomes of the polls one
day before the election. For all four polling organisations there are significant differences between
predictions and the real result. These differences are larger than the margin of error.

Table 10.1. Predictions (seats in parliament) for the parliamentary elections in The Netherlands, 12 September 2012

Party                        Election result   Peil.nl   Politieke Barometer   TNS NIPO   De Stemming
VVD (Liberals)                      41            36              37              35           35
PvdA (Social-democrats)             38            36              36              34           34
PVV (Populists)                     15            18              17              17           17
CDA (Christian-democrats)           13            12              13              12           12
SP (Socialists)                     15            20              21              21           22
D66 (Liberal-democrats)             12            11              10              13           11
GroenLinks (Green)                   4             4               4               4            4
ChristenUnie (Christian)             5             5               5               6            7
SGP (Christian)                      3             3               2               2            3
PvdD (Animals)                       2             3               3               2            2
50PLUS (Elderly)                     2             2               2               4            3
Total difference                                  18              18              24           24
Mean difference                                  1.6             1.6             2.2          2.2

The largest difference is found for the prediction of the SP by De Stemming. The prediction was
22 seats in parliament, but this party got only 15 seats. Four times we can observe a difference of 6
seats, and two times there is a difference of 5 seats.
We can partly attribute the differences to the use of self-selection for the polls. Other possible
explanations are that people changed their opinion between the poll and the elections, and that people
may behave differently in a poll than in an election.

10.3 Non-response in online polls


Like any other type of poll, online polls suffer from non-response. Non-response occurs when
persons in the selected sample who are also eligible for the poll do not provide the requested
information. The problem of non-response is that the availability of data is determined both by the
(known) sample selection mechanism and the (unknown) response mechanism. Therefore, the resulting
selection probabilities are unknown. Consequently, it is impossible to compute unbiased estimates.
Moreover, use of naive estimators will lead to biased estimates.

Non-response can have several causes. It is important to distinguish these causes, as different causes
can have different effects on estimators, and therefore they may require different treatment. Here
three main causes of nonresponse are described: no-contact, refusal, and not-able.
No-contact occurs if it is impossible to get into contact with a selected person. Various forms of non-
contact are possible in an online poll. It depends on the way in which you recruit people for the poll. If
the sampling frame is a list of e-mail addresses, no-contact occurs if the e-mail with the invitation to
participate in the poll does not reach a selected person. The e-mail address may be wrong or the e-mail
may be blocked by a spam filter. If the sampling frame is a list of postal addresses and letters with an
internet-address are sent to selected persons, no-contact may be caused by not receiving the letter. If
recruitment for an online poll takes place by means of a face-to-face or telephone poll, no-contact can
be due to respondents being not at home or not answering the telephone.
Non-response due to refusal can occur after contact has been established with a selected person.
Refusal to cooperate can have many reasons: people may not be interested, they may consider it an
intrusion of their privacy, they may have no time, etc.
If sample persons for an online poll are contacted by an e-mail or a letter, they may postpone and
forget to complete the questionnaire form. You can see this as a weak form of refusal. Sending a
reminder may help to reduce this form of refusal.
Nonresponse due to not-able may occur if respondents are willing to respond but are not able to do so.
Reasons for this type of non-response can be, for example, illness, hearing problems or language
problems. If you send a letter with an internet address of an online questionnaire to selected persons,
and they want to participate in the online poll, but do not have access to the internet, this can also be
seen as a form of non-response due to not-able.
It should be noted that lack of internet access should sometimes be qualified as under-coverage
instead of non-response. If the target population of a poll is wider than just those with internet and the
sample is selected using the internet, people without internet have a zero selection probability. They
will never be selected in the poll. This is under-coverage. Non-response due to not-able occurs if
people have been selected in the poll but are not able to complete the questionnaire form (on the
internet).

The non-response bias of an online poll

It is usually assumed that every person k in the population has an unknown response probability ρk.
Suppose a simple random sample is selected. Just concentrating on non-response, and assuming
there are no coverage problems, Bethlehem, Cobben & Schouten (2011) show that the bias of the
response mean $\bar{y}_R$ is equal to

$$B(\bar{y}_R) = \frac{R_{\rho,Y}\, S_\rho\, S_Y}{\bar{\rho}},$$

in which $R_{\rho,Y}$ is the correlation coefficient between the response probabilities and the values of the
target variable. Furthermore, $S_\rho$ is the standard deviation of the response probabilities, $S_Y$ is the
standard deviation of the target variable, and $\bar{\rho}$ is the mean of all response probabilities. Note that this
expression is similar to the expression for the bias in a self-selection online poll. The participation
probabilities of a self-selection poll are, however, usually much smaller than the response
probabilities. Therefore, the bias of an estimator in a self-selection poll is potentially much larger.
The bias of the response mean (as an estimator of the population mean of the internet population)
is determined by three factors:

 The average response probability. The higher the response rate, the smaller the bias.
 The variation in response probabilities. The more these probabilities vary, the larger the bias
will be.
 The relationship between the target variable and response behaviour. A strong correlation
between the values of the target variable and the response probabilities will lead to a large bias.
The bias vanishes if all response probabilities are equal, in which case the response can be seen as a
simple random sample. The bias also vanishes if the response probability does not depend on the
value of the target variable.

The mode of data collection has an impact on the response rate of a survey. Typically, interviewer-
assisted polls have higher response rates than self-administered polls. Since an online poll is a self-
administered survey, you can expect the response rate to be lower than that of, for example, CAPI and CATI
polls. Indeed, the literature seems to show that this is the case.
Cook, Heath & Thompson (2000) describe a meta-analysis in which they explore response rates of 68
online polls and surveys. The average response rate is around 40%.
Kaplowitz, Hadlock & Levine (2004) compare response rates of online surveys and mail surveys. They
conclude that these rates are comparable. They observe response rates varying between 20% and 30%.
Lozar Manfreda et al. (2008) conducted a meta-analysis of 45 comparisons of online surveys with
other types of surveys. They found that, on average, the response rate of online surveys was 11% lower
than that of other surveys.

Example 10.3. The Safety Monitor

Beukenhorst and Wetzels (2009) describe an experiment with the Safety Monitor of Statistics
Netherlands. This survey measures how people feel with respect to security. Questions are asked
about feelings of security, quality of life, and level of crime experienced. For the traditional Safety
Monitor, data collection was done with CATI (if a listed telephone number was available) or CAPI.
The response rate was 63.5%.
A sequential mixed-mode survey was conducted as part of the experiment. All sample persons
received a letter in which they were asked to complete the survey questionnaire on the internet. The
letter also contained a postcard which could be used to request a paper questionnaire. After two
reminders, remaining non-respondents were approached by CATI (if a listed telephone number was
available) and by CAPI (if there was no listed number).
The response rate of this mixed-mode survey was 59.7%. The response rate of the first mode
(online) was only 25.0%. The conclusion was that a single-mode online survey leads to a lower
response rate. If the web is used as the first mode in a mixed-mode survey, almost the same
response rate could be obtained, but at a much lower cost (as an online mode case is less expensive
than a CAPI or CATI case).

10.4 Adjustment weighting


Under-coverage, self-selection and non-response can all affect the representativity of the response of
an online poll. Consequently, estimators may be biased. To reduce these problems, you should carry

out some kind of correction procedure. This usually comes down to applying an adjustment
weighting technique.
The fundamental idea of adjustment weighting is to restore the representativity of the poll response.
You do that by assigning weights to responding persons. Persons in under-represented groups are
assigned weights larger than 1, and those in over-represented groups get a weight smaller than 1.
Weights can only be computed if auxiliary variables are available. You must have measured such
variables in your poll, and you must also have their population distribution (or complete sample
distribution). By comparing the population distribution of an auxiliary variable with its response
distribution, you can assess whether the response is representative with respect to this variable. If
these distributions differ considerably, you must conclude that the response is selective.
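The sketch below, written in R, illustrates the basic computation with one auxiliary variable. The
population shares and the handful of responses are hypothetical and only meant to show how the weights
follow from comparing the two distributions.

# Known population shares of a hypothetical auxiliary variable (age group)
pop_share <- c(Young = 0.30, Middle = 0.45, Old = 0.25)

# Age group as observed for ten hypothetical respondents
resp_age <- factor(c("Young", "Middle", "Middle", "Old", "Middle",
                     "Young", "Old", "Middle", "Middle", "Old"),
                   levels = names(pop_share))
resp_share <- prop.table(table(resp_age))   # shares in the response

# Weight per respondent: population share divided by response share of the group
w <- pop_share[as.character(resp_age)] / resp_share[as.character(resp_age)]
round(w, 2)    # under-represented groups get weights > 1, over-represented < 1

# Weighted estimate of a hypothetical target variable y
y <- c(2, 5, 4, 7, 5, 3, 8, 4, 5, 6)
sum(w * y) / sum(w)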
Weighting adjustment is only effective if two conditions are satisfied: (1) the auxiliary variables must
have a strong relationship with the selection mechanism of the poll, and (2) the auxiliary variables
must be correlated with the target variables of the poll. If these conditions are not fulfilled, the bias of
estimators will not be reduced.
The availability of proper auxiliary variables is often a problem. Usually, there are not many variables
that have a known population distribution and that satisfy the two conditions above. If proper
auxiliary variables are not available, you can consider conducting a reference survey. The objective of
such a survey is only estimating the population distribution of relevant auxiliary variables in an
unbiased way. Next, you can use the (estimated) distributions to compute correction weights for an
online poll.
Such a reference survey should be based on a real (possibly small) probability sample, in which you
collect data with a mode different from the web, e.g. CAPI or CATI. If no non-response occurs, or if non-
response is unrelated to all variables, the reference survey will produce unbiased estimates of the
population distribution of auxiliary variables.
An interesting aspect of the reference survey approach is that you can use any variable for adjustment
weighting as long as it is measured in both surveys. For example, some market research organisations
use psychographic variables to divide the population into mentality groups. People in the same group are
assumed to have more or less the same level of motivation and interest to participate in polls. If this is
the case, such variables can be effectively used for adjustment weighting.
The reference survey approach also has some disadvantages. In the first place, it is expensive to
conduct an extra survey. However, it should be noted this survey need not be very large as it is just
used for estimating the population distribution of auxiliary variables. And the information can be used
for more than one online poll. In the second place, Bethlehem (2010) shows that the variance of the
weighted estimator is, for a substantial part, determined by the size of the (small) reference survey. So,
a large number of observations in an online poll does not guarantee precise estimates. The reference
survey approach reduces the bias of estimates at the cost of a higher variance.
The conclusion is that some form of weighting adjustment must certainly be applied in order to
reduce the bias of estimators. However, success is not guaranteed. The ingredients for effective bias
reduction may not always be available.

10.5 Measurement errors


Traditionally, many polls used CAPI or CATI for data collection. These are not the cheapest modes of
data collection, but they are used because response rates are high and data quality tends to be good.
What would change in this respect if a CAPI or CATI poll were to be replaced by an online poll? This
section focuses on measurement errors in online polls and the effects they have on data quality.

Answering questions is not an easy task. Schwarz et al. (2008) describe the steps the respondents have
to go through: (1) understanding the question, (2) retrieving the required information from memory,
(3) translating the information in the proper answer format, and (4) deciding whether to give the
answer or not. A lot can go wrong in this process. This may particularly be a problem for online polls,
where there are no interviewers who can motivate respondents, answer questions for clarification,
provide additional information, and remove causes of misunderstanding. Respondents are on their own.
When designing a questionnaire for an online poll, you should realise that respondents are usually not
interested in the topic of the survey. Therefore participation is not important for them. Krug (2006)
describes how people read websites. Many of his points also apply to questionnaires of online polls:
 Respondents do not read the text on the screen. They just scan it looking for words or phrases that
catch the eye.
 Respondents do not select the optimal answer to the question, but the first reasonable answer.
This is called satisficing.
 Respondents are aware of the fact that there is no penalty for giving wrong answers.
 Respondents do not read introductory texts explaining how the questionnaire works. They just
muddle through and try to reach the end.
There are several aspects in which online polls perform differently from face-to-face or telephone polls.
One important aspect was already mentioned: satisficing. It means that respondents do not do all they can
to provide a correct answer. Instead, they give a satisfactory answer with minimal effort. Satisficing
can come in many forms:
 Response order effects. Particularly if the list of possible answers to a closed question is long,
respondents tend to choose an answer early in the list. This is called the primacy effect. Note that
face-to-face and telephone surveys may suffer from a recency effect, i.e. respondents tend to select
an answer near the end of the list.
 Acquiescence. Respondents tend to agree with statements in questions, regardless of their content.
This typically may occur for opinion questions. Research seems to suggest that acquiescence is
more of a problem in self-administered polls than in interviewer-assisted polls.
 Endorsing the status quo. If asked to give their opinion about changes, respondents simply (without
thinking) select the answer to keep everything the same. Endorsing the status quo is more of a
problem in self-administered polls.
 Selecting the middle option. If respondents are offered a middle response option for a neutral
response, they tend to select this option. It is an easy way out for those not wanting to express an
opinion.
 Non-differentiation. If respondents have to answer a series of questions with the same set of
answer options, they tend to select the same answer for all these questions irrespective of the
question content. This effect is even more pronounced for grid questions, where respondents
select all answers in the same column. This is called straight-lining. This is typically a problem in
self-administered polls.
 Don’t know. There is a dilemma for handling don’t know in online polls. On the one hand, this
option should be explicitly offered, as persons may really not know the answer to a question. On
the other hand, if don’t know is available, many persons will select it as an easy way out. In case of
a CAPI or CATI survey, it is also possible to offer don’t know implicitly. The interviewer only reads
out the substantive options, but if the respondent insists he does not know the answer, the

interviewer can record the answer as don’t know. This works well, but it is difficult to implement in
an online poll.
 Arbitrary answer. Respondents not wanting to think about the proper answer, may decide to pick
just an arbitrary answer. This type of satisficing typically occurs for check-all-that-apply questions.
It is more of a problem in online polls than in interviewer-assisted polls.
There are some other aspects in which online polls differ from interviewer-assisted polls. One of these
aspects is including sensitive questions. There are indications that respondents give more ‘honest’
answers to such questions in self-administered modes. The presence of interviewers may very well
lead to socially desirable answers.
CAPI and CATI questionnaires often apply some form of routing. Routing instructions see to it that
relevant questions are answered and irrelevant questions skipped. So the computer decides the next
question to be asked and not the respondent. Many online polls do not have built-in routing.
Respondents are free to jump back and forth through the questionnaire. There is a risk that not all
relevant questions will be answered. One may wonder what will produce the best results in terms of
data quality and response rates: enforced routing or complete freedom?
Computer-assisted interviewing has the advantage that some form of error checking can be
implemented in the interviewing software, i.e. answers to questions are checked for consistency.
Errors can be detected during the interview, and therefore also corrected during the interview. It has
been shown, see e.g. Couper et al. (1998), that this improves the quality of the collected data. The
question now is whether error checking should be implemented in an online poll. What happens when
respondents are confronted with error messages? Will they just correct their mistakes, or will they
become annoyed and stop answering questions?
A last aspect to be mentioned here is the length of the questionnaire. If it is too long, respondents may
refuse to participate, or they may stop somewhere in the middle of the questionnaire. Questionnaires
of CAPI surveys can be longer than those of CATI and online polls. It is more difficult to stop a face-to-
face conversation with an interviewer than to hang up the phone or to quit somewhere in the middle
of the questionnaire of an online poll. The literature seems to suggest that CATI interviews should not last
longer than 50 minutes, and that completing the questionnaire of an online poll should not take more than 15
minutes.

10.6 Online panels


Properly setting up and carrying out a poll is often costly and time-consuming. You have to find a
sampling frame, approach sampled people, persuade them to cooperate, etc. If you regularly conduct
polls, you may consider setting up a panel. You select a sample of people only once, and invite them
to become a member of the panel. The people in the panel are asked regularly (say, once a month) to
complete a questionnaire form.
Online panels have become increasingly popular, particularly in the world of market research. This is
not surprising, as it is a simple, fast and inexpensive way to collect large amounts of data. Once an
online panel has been put into place, it is simple to conduct a poll. No complex sample selection
procedures are required. It is just a matter of sending an e-mail to the panel members. No interviewers
are involved, and there are no mail costs for sending paper questionnaires. It suffices to put the
electronic questionnaire on the internet.
Speed is another advantage of online data collection. A new poll can be launched quickly. There are
examples of online polls in which questionnaire design, data collection, analysis and publication took
no more than just one day. Online panels have become a powerful tool for opinion polls. For example,
in the last weeks of the campaign for the parliamentary elections of 2012 in The Netherlands, there
were four different major national polls each day, and they were all based on online panels.

Panels can be used in two different ways. The first one is longitudinal research, in which the same set
of variables is measured for the same group of individuals at different points in time. The focus of the
research is on measuring change. The second way to use a panel is cross-sectional research. The panel
is used as a sampling frame for specific polls that may address different topics, and thus measure
different variables. Also, samples may be selected from specific groups (for example the elderly, the
highly educated, or voters for a specific political party).
This section discusses some issues related to cross-sectional use of online panels. Also here the
principles of probability sampling apply: recruitment and sampling must be based on probability
sampling. Examples of such panels are the LISS panel in The Netherlands (Scherpenzeel, 2008) and the
KnowledgePanel in the US (Knowledge Networks, 2012).
A first issue is under-coverage. An online panel may suffer from under-coverage because the target
population is usually much wider than just persons with internet access. Internet coverage is so high
in some countries that under-coverage is hardly an issue. However, under-coverage is still substantial
in many countries. This means that estimates can be biased. One way to reduce coverage problems is
to provide internet access to those without. See example 10.4 about the LISS Panel.
Setting up a representative online panel is not simple. Usually, there is no sampling frame of e-mail
addresses. So you have to recruit people in another way. For example, Statistics Netherlands uses the
population register to select samples. Selected persons can be approached in different ways: by mail,
by telephone, or face-to-face. Face-to-face (CAPI) recruitment is known to produce the highest
response rates, but it is also the most expensive mode. CATI is somewhat less costly, but it has the
drawback that only those listed in the telephone directory can be contacted. You could avoid this
problem by applying Random Digit Dialling (RDD) to select the sample, but this has the drawback that
there will be no information at all about the non-respondents. The cheapest mode to recruit persons
for an online panel is by sending an invitation letter by ordinary mail. However, this approach is
known to produce low response rates. For example, response rates of online polls of Statistics
Netherlands do not exceed 40%, whereas response rates of CAPI and CATI are around 60%.
Setting up a representative online panel requires a lot of effort and money. This may be the reason that
many online panels are not based on the principles of probability sampling, but on self-selection. Self-
selection (also called opt-in) means that it is completely left to people to select themselves for the
panel, or not. Respondents are those who happen to have internet, encounter an invitation, visit the
appropriate website, and decide to participate. Unfortunately such panels lack representativity, and
therefore their outcomes may be invalid. For example, the four main market research organisations in
The Netherlands are Peil.nl (Maurice de Hond), Ipsos, TNS NIPO and GfK Intomart. They all use online
self-selection panels for their opinion polls.
Non-response is an important problem in online panels. Non-response occurs in two phases of an
online panel: (1) during the recruitment phase, and (2) in the specific polls taken from the panel.
Recruitment nonresponse may be high because participating in a panel requires substantial
commitment and effort of respondents. Nonresponse in a specific poll is often low as the invitation to
participate in it is a consequence of agreeing to be a panel member. Causes of non-response are not at
home, not interested in a specific topic, and not able (e.g. due to illness). Non-response need not be
permanent. After skipping one of the specific polls, a panel member may decide to participate again in
a subsequent poll.
Online panels may also suffer from attrition. This is a specific type of non-response. People get tired of
having to complete the questionnaires of the specific polls and decide to stop their cooperation. Once
they stop, they will never start again.

Example 10.4. The LISS Panel

The LISS panel is an online panel consisting of approximately 5,000 households. LISS stands for
Longitudinal Internet Studies for the Social Sciences. The panel was set up by CentERdata, a research
institute in The Netherlands. The objective of the panel was to provide a laboratory for the development
and testing of new, innovative research techniques.
The panel is based on a true probability sample of households drawn from the population register in
The Netherlands. Telephone numbers were added to selected names and addresses. This was only
possible for households with listed numbers. Such households were contacted by means of CATI.
Addresses that could not be contacted by telephone were visited by interviewers (CAPI).
Households without internet access and those who worried that filling in a questionnaire on the
internet would be too complicated for them, were offered a simple-to-operate computer with
internet access that could be installed and used for the duration of the panel. This reduced under-
coverage problems.
Table 10.2 shows the response rates during the recruitment and the use of the LISS Panel. The data
are taken from Scherpenzeel & Schouten (2011).

Table 10.2. Response rates in the LISS Panel


Phase Response
Recruitment contact 91 %
Recruitment interview 75 %
Agree to participate in panel 54 %
Active in panel in 2007 48 %
Active in panel in 2008 41 %
Active in panel in 2009 36 %
Active in panel in 2010 33 %

Of the households in the initial sample, 91% could be contacted. So the no-contact rate was 9%. In
75% of the cases it was possible to conduct a recruitment interview. Of the people in the original
sample, only 54% agreed to become a member of the panel. So the recruitment rate was 54%. Over
the years the percentage of respondents dropped to 33%. This is the effect of attrition. So, one out of
three persons was still active in 2010.

There will always be non-response in an online panel, both in the recruitment phase and in the specific
polls. To avoid drawing wrong conclusions, you have to carry out some kind of correction. Usually,
adjustment weighting is applied. A vital ingredient of weighting is the availability of a set of proper
auxiliary variables. These variables must have been measured in the panel, and moreover their
population distribution must be known.
Weighting adjustment is only effective if two conditions are satisfied: (1) the auxiliary variables must
be correlated with response behaviour, and (2) the auxiliary variables must be correlated with the
target variables.
It is wise to conduct weighting in two steps. First, you apply a weighting technique to correct for
recruitment non-response. Second, you apply another weighting technique to correct for non-response
in a specific poll taken from the panel. These weighting techniques differ because they require
different variables. Adjustment weighting for recruitment non-response is often difficult, because the
number of available auxiliary variables is limited. Adjustment weighting for a specific poll is more
promising, because there can be a lot more auxiliary variables available. Typically, all members of an

online panel completed a so-called profile poll when they were recruited. This profile information was
recorded for all panel members. Therefore, all profile variables can be used for weighting adjustment
of a specific poll. It is important to realise in the design stage of an online panel that profile
information may be needed for non-response correction. This may help to select the proper variables
for the profile poll.
It is important that you keep the composition of the panel stable over time. Only then can you attribute
changes over time to real changes in society and not to changes in the panel composition. An online
panel may become less representative due to attrition. This makes it important to carry out panel
refreshment at certain times. The question is how to do this properly. At first sight, one could think of
adding a fresh random sample from the population to the web panel. However, this does not improve
the representativity. Those with the highest attrition probabilities remain under-represented.
Ideally, the refreshment sample should be selected such that the new members resemble the members
that have disappeared due to attrition. The refreshment sample should focus on getting people from
these groups in the panel. You should also realise that due to refreshment not all members in the panel
will have the same selection probabilities. This should be taken into account when computing
unbiased estimates.
Being in a panel for a long time may have an effect on the behaviour and attitudes of the panel
members, and even be the cause of a bias. For example, persons may learn how to follow the shortest
route through a questionnaire. This effect is called panel conditioning. Panel conditioning may be
avoided by restricting panel membership to a specific time period. The maximum time period depends
on frequency of the specific polls, the length of the poll questionnaires, and also on the variation in poll
topics.

11 Analysing the data
After finishing data collection for your poll, you have a large number of completed questionnaire
forms. If you used paper questionnaire forms, you have to enter the data into a computer. If you used a
form of computer-assisted interviewing, your data are already in the computer. The next step is to
analyse the data. This is the topic of this chapter.
A first step in the analysis of your data is to carry out an exploratory analysis. Such an analysis focuses
on exploring an (often large) data set. It summarises the main characteristics of the data. It should
reveal the information that is in the data set. You need software tools to carry out an exploratory
analysis. They should help you to discover unexpected patterns and relationships in the data.
A first step in the exploratory analysis should be to check the data for errors. We call this data editing.
We already described data editing in more detail in chapter 7. You can look at one variable at a time,
and look for unusually small or large values, for example a person with an age of 140 years. These are
called outliers. An outlier can be an incorrect value that must be corrected, but it can also be a correct
value. You have to check this. You can also look for unusual combinations of values, such as a voter of
12 years old. Again, such values can point to an error. They can also be unlikely, but correct.
After you have cleaned your data set, you start the exploratory analysis. It is the search for interesting
aspects of the data. Exploratory analysis offers you a set of tools and techniques to summarise large
data sets in a few numbers, tables or graphs. If you discover an unexpected feature, you still have to
check whether this is an interesting new fact, or maybe just an artefact of this data set.
In your exploratory analysis, you explore the collected data, and nothing more. Your conclusions only
relate to this data set. So you make no attempt to generalise from the response of your poll to the
target population. If you want to use the data for drawing conclusions about the target population as a
whole, you should carry out a different type of analysis. This is called an inductive analysis. Usually,
these conclusions take the form of estimates of population characteristics. Examples are an estimate of
the percentage of people who are going to vote, and an estimate of the mean number of hours people
are online. Inductive analysis can also take the form of hypothesis testing. For example, you can test
the hypothesis that people are more satisfied with their life than five years ago.
It is typical for inductive analysis that the conclusions you draw are based on sample data and
therefore have an element of uncertainty. You always have to take margins of error into account.
Fortunately, you have some control over the margin of error. For example, an increased sample size
will reduce the uncertainty.
It is not always simple to compute population estimates. The reason is that all kinds of problems can
occur during data collection. Therefore the quality of the data is not always as good as you hope it to
be. Here are a number of issues:
 The sample has not been drawn with equal but with unequal selection probabilities. As a
consequence, the sample will not be representative. To be able to draw valid conclusions, you have
to correct your estimates for this lack of representativity.
 Some answers to your questions may be wrong or missing. To correct this, you may have replaced
the ‘holes’ in your data set by synthetic values. This is called imputation. If persons with missing
answers differ from persons who answered the questions, imputation will still lead to wrong
conclusions. For example, if you apply imputation of the mean, the computed margins of error are
too small, inadvertently creating the impression of very precise estimates.
 The data may contain measurement errors. You have no guarantee that respondents have given the
right answers to the questions. If you ask a sensitive question, they may give a socially desirable

answer. If you ask questions about the past, respondents may have forgotten events. And if
respondents do not want to think about their opinion, they may just say “I don’t know”.
 You will be faced with non-response in your poll. If respondents differ from non-respondents (and
this often happens to be the case), you run a serious risk of drawing wrong conclusions from your
poll. You have to apply some kind of adjustment weighting to correct for non-response bias.
If you discover special patterns in your data set, you will attempt to come up with an explanation. This
requires more in-depth analysis, such as regression analysis and factor analysis. Discussion of these
techniques is outside the scope of this publication.
The remaining part of this chapter is devoted to exploratory analysis. We distinguish techniques that
analyse the distribution of a single variable from techniques that analyse the relationship between two
variables. Furthermore, you should realise that techniques for the analysis of quantitative variables
cannot be used for the analysis of qualitative variables. Table 11.1 gives an overview of the techniques
we will discuss. This overview is by no means complete. There are more techniques. We restrict
ourselves here to the most important ones.

Table 11.1. Techniques for exploratory analysis

Variables      Analysis of the distribution      Analysis of the relationship
Quantitative   - One-dimensional scatterplot     - Two-dimensional scatterplot
               - Boxplot                         - Correlation coefficient
               - Histogram
               - Summary table
Qualitative    - Bar chart                       - Grouped bar chart
               - Pie chart                       - Stacked bar chart
               - Frequency table                 - Cross-tabulation
Mixed                                            - Analysis of the distribution of a
                                                   quantitative variable for each
                                                   category of a qualitative variable

There are graphical and numerical techniques for exploratory analysis. It is always a good idea to start
with graphical techniques. The well-known proverb “one picture is worth a thousand words” certainly
applies here. Charts can summarise large amounts of data in a clear and well-organised way, thus
providing insight into patterns and relationships. If you find clear and simple patterns in your data, you
can summarise them in a numerical overview.

Many of the techniques discussed here have been implemented in statistical software such as SPSS, SAS,
and Stata. These packages are generally large and expensive, as they offer many more possibilities
than just exploratory analysis. In this publication, we used a simple and cheap approach: we stored the
data in the spreadsheet program Excel, and we used the free open-source package R for analysis.

We use one small data set to illustrate all techniques in the table. This data set contains data about the
working population of the small country of Samplonia. This country consists of two provinces: Agria
and Induston. Agria is divided into three districts: Wheaton, Greenham, and Newbay. Induston is
divided into four districts: Oakdale, Smokeley, Crowdon, and Mudwater. Table 11.2 contains a
description of the contents of the data set.

Table 11.2. The data set with the working population of Samplonia
Variable Type Values
District Qualitative Wheaton, Greenham, Newbay, Oakdale, Smokeley, Crowdon, Mudwater
Province Qualitative Agria, Induston
Gender Qualitative Male, Female
Age Quantitative From 20 to 64
Employed Indicator Yes, No
Income Quantitative From 101 to 4497
Ageclass Qualitative Young, Middle, Old

To be able to analyse the data, we have to enter them in a computer. We have used the spreadsheet
program Excel for this. Figure 11.1 shows part of the spreadsheet with data about the working
population of Samplonia. Note that the first line contains the names of the variables.

Figure 11.1. The spreadsheet with the first 10 cases of the data set

To prepare these data for analysis with R, you have to save the data as a CSV-file. Use the semicolon as
a field separator. The first line of the file should contain the names of the variables.
You can download the free package R from the website www.r-project.org. After you have installed the
package on your computer and started it, you can read the CSV file with the function read.csv().
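A minimal sketch in R is shown below. The file name samplonia.csv is hypothetical; use whatever name
you gave the file when saving it from the spreadsheet program.

samplonia <- read.csv("samplonia.csv", sep = ";")   # semicolon as field separator

str(samplonia)    # check the variable names and types
head(samplonia)   # inspect the first few cases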

11.1 Analysis of the distribution of a quantitative variable


The values of a quantitative variable are measured by means of a numerical question. First we
describe three graphical techniques: the one-dimensional scatterplot, the boxplot, and the histogram.
Next, we describe a numerical technique: the summary table.
The one-dimensional scatterplot shows the distribution of a quantitative variable in its purest form. A
horizontal axis has a scale that reflects the possible values of the variable. The individual cases are
plotted as points along this axis. You can see an example of a one-dimensional scatterplot in figure 11.2. It
shows the distribution of the variable Income for the working population of Samplonia.
Note that all points have a different vertical distance to the X-axis. These distances are completely
random; this is called jitter. The jitter has been applied on purpose, to prevent points from overlapping
each other, so that all points remain visible.
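A one-dimensional scatterplot with jitter can be made in R with the function stripchart(). The sketch
below assumes the samplonia data frame read in earlier.

stripchart(samplonia$Income, method = "jitter", jitter = 0.3,
           pch = 16, xlab = "Income")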

Figure 11.2. A scatterplot for one variable

To what aspects of the one-dimensional scatterplot do you have to pay attention? It is difficult to give
general guidelines, because each variable is different. You should always be aware of unexpected
patterns. Still, here are some aspects that may be of interest:
 Outliers. Are there values that are completely different from the rest? Such values manifest
themselves as isolated points in the graph. You should look at these points carefully. Maybe you
made an error when you were processing the answers to the question. It is, however, also possible
that the value was correct, and this was just a very special object in the sample.
 Grouping. Are the values more or less equally spread over a certain interval? Or can you
distinguish various groups of values? Separate groups may be an indication that the population
consists of different sub-populations, each with their own behaviour. It may be better to analyse
these groups separately.
•  Concentration. Is there an area with a high concentration of values? Maybe all values
concentrate around a certain point. Then it may be interesting to characterise this point. You can
use other analysis techniques for this.
There are no outliers in figure 11.2. There seems to be a separate group with very low incomes.
Further analysis should show what kind of group this is. The incomes do not seem to concentrate
around a central value. The distribution is skewed, with many low incomes and a few high incomes.
A second graphical technique for displaying the distribution of a quantitative variable is the boxplot. It
is sometimes also called the box-and-whisker plot. The boxplot is a schematic presentation of the
distribution of a variable. A box represents the middle half of the distribution. The ‘whiskers’
represent the tails of the distribution. Figure 11.3 contains the boxplot for the income of the working
population of Samplonia.
The box denotes the area containing the middle half (50%) of the values. The vertical line in the box is
the median of the distribution. Lines (‘whiskers’) extend from the left and right of the box. These lines
run to the values that are just within a distance of 1.5 times the length of the box. All values further
away are drawn as separate points. These points are outliers.
A boxplot is useful for exploring the symmetry of a distribution. For a symmetric distribution, the
median is in the middle of the box, and the whiskers have the same length. The boxplot is particularly
powerful for detecting outliers. But you should be careful. For example, the outliers in figure 11.3 are

not really outliers. They only seem to be outliers because the distribution is skewed to the right, so
there is a long tail on the right.
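In R, a boxplot like the one in figure 11.3 can be made with the function boxplot(), assuming the
samplonia data frame read in earlier.

boxplot(samplonia$Income, horizontal = TRUE, xlab = "Income")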

Figure 11.3. A boxplot

The histogram also uses a horizontal line representing the range of possible values of the variable. The
line is divided into a number of intervals of equal length. Next, you count the number of values in each
interval. You draw a rectangle above each interval. The widths of the intervals are the same, but the
heights are proportional to the numbers of observations. The rectangles are drawn adjacent to each
other. Figure 11.4 contains a histogram for the incomes of the working people in Samplonia.

Figure 11.4. A histogram

A point for consideration is the number of intervals in which you divide the range of possible values. If
you only have a few intervals, you get a very coarse idea of the distribution. Many details are hidden. If
you have many intervals, there will be so many details that you do not see the global picture. An often-
used rule of thumb is to take the number of intervals approximately equal to the square root of the
number of values, with a minimum of 5 and a maximum of 20.
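The sketch below applies this rule of thumb in R, again assuming the samplonia data frame. Note that
hist() treats the requested number of intervals as a suggestion and may round it to convenient break
points.

k <- min(max(round(sqrt(nrow(samplonia))), 5), 20)   # square-root rule, bounded by 5 and 20
hist(samplonia$Income, breaks = k, xlab = "Income", main = "Histogram of income")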
You can use a histogram to determine whether a distribution is symmetric with a peak in the middle. If
this is the case, the distribution more or less resembles the normal distribution. Consequently, you can
characterise the distribution by a few numbers (mean and standard deviation).

Figure 11.4 contains an example of a histogram. It shows the distribution of the incomes of the
working people in Samplonia. The distribution is certainly not symmetric. There are many people with
a small income and only a few people with a large income. You can see this pattern often when you
display the distribution of quantities or values.
The distribution in figure 11.4 also seems to have more than one peak. This can happen when you mix
several groups, each having its own distribution. It could be a good idea to attempt to identify these
groups and study each of them separately. It will be clear that for a mix of distributions it will be
difficult to characterise the data in two numbers (mean and standard deviation).
If the distribution of the values of the quantitative variables is more or less normal (symmetric, bell-
shaped), you can summarise it in a so-called summary table. Such a summary could contain the
following quantities:
 Minimum. This is the smallest value in the data set.
 Maximum. This is the largest value in the data set.
•  Average. This is the central location of the distribution. All values are concentrated around this
location.
 Standard deviation. This is a measure of the spread of the values. The more the values vary, the
larger the standard deviation will be. The standard deviation is 0 if all values are equal.
 Rule-of-thumb interval. This is the interval containing approximately 95% of the values, provided
the distribution is more or less normal (a bell-shaped distribution). The lower bound of the
interval is obtained by subtracting two times the standard deviation from the average. The upper
bound of the interval is obtained by adding two times the standard deviation to the average.
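The sketch below computes these quantities in R for the working males in the province of Agria (the
subset used in table 11.3), assuming the samplonia data frame and its coding of Province and Gender.

inc <- subset(samplonia, Province == "Agria" & Gender == "Male")$Income

c(n       = length(inc),
  minimum = min(inc),
  maximum = max(inc),
  average = mean(inc),
  st.dev  = sd(inc),
  lower   = mean(inc) - 2 * sd(inc),   # rule-of-thumb interval
  upper   = mean(inc) + 2 * sd(inc))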
The distribution of the incomes of the working population of Samplonia is skewed and has several peaks.
It is certainly not normal. Therefore, it is not possible to summarise it in a summary table. If we
restrict ourselves to just the 58 working males in the province of Agria, we have a much more normal
distribution. The summary table of this distribution is presented in table 11.3.

Table 11.3. Summary table


Variable Income
Number of cases 58
Minimum: 353
Maximum: 841
Average: 551.2
Standard deviation: 119.3
Rule-of-thumb interval: (312.6 ; 789.8)

The incomes of the working males in Agria vary between 353 and 841. They concentrate around a
value of 551.2. A standard deviation of 119.3 leads to a rule-of-thumb interval from 312.6 to
789.8. So, approximately 95% of the incomes are between 312.6 and 789.8.

11.2 Analysis of the distribution of a qualitative variable


Qualitative variables are measured with a closed question. Only a few techniques are available for the
analysis of the distribution of a qualitative variable. The reason is that you cannot carry out
computations with the values of qualitative variables. Their values are just labels, or code numbers of
labels. These labels denote categories. They divide the population into groups. The only thing you can
do is count the number of people in a group.
We will describe two graphical techniques: the bar chart and the pie chart. We will present one
numerical technique: the frequency table.

The first graphical technique is the bar chart. It presents the groups as bars, where the lengths of the
bars reflect the numbers of persons in the groups. To avoid confusion with a histogram, it is better to
draw the bars of a bar chart horizontally and to have some space between the bars.
If the categories of a qualitative variable have no natural order (it is a nominal variable), you can draw
the bars in any order. You can take advantage of this by ordering the bars from small to large (or vice
versa). This makes it easier to interpret the graph.
Figure 11.5 contains an example of a bar chart with ordered bars. It shows the numbers of employed
in the seven districts of Samplonia. There is a bar for each district, and the length of the bar reflects the
number of employed in the district. It is clear from the bar chart that two districts (Smokeley and
Mudwater) have substantially more employed than the other districts. Also note that two districts
(Oakdale and Newbay) have a small number of employed.
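In R, a bar chart like figure 11.5 can be made with table() and barplot(), assuming the samplonia data
frame. Sorting the counts orders the bars by size.

counts <- sort(table(samplonia$District))   # number of employed per district, ordered
barplot(counts, horiz = TRUE, las = 1, xlab = "Number of employed")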

Figure 11.5. A bar chart

A second graphical technique for the analysis of the distribution of a qualitative variable is the pie
chart. This type of graph is particularly popular in the media. The pie chart is a circle (pie) that is
divided into as many sectors as the variable has categories. The area of the sector is taken proportional
to the number of people in the corresponding category. Figure 11.6 contains an example. It shows the
numbers of employed persons per district. So it contains the same information as figure 11.5.

Figure 11.6. A pie chart

Pie charts are sometimes more difficult to interpret than bar charts. If there are many sectors that all
have approximately the same size, it is hard to compare them. This is much easier with a bar chart. It
may help to order the sectors by ascending or descending size.
The numerical way to present the distribution of a qualitative variable is the frequency table. This is a
table with, for each category, the number and the percentage of people in that category. Table 11.4
contains the frequency distribution of the number of employed by district in Samplonia.

Table 11.4. A frequency table


Category Frequency Percentage
Wheaton 60 17.6%
Greenham 38 11.1%
Oakdale 26 7.6%
Newbay 23 6.7%
Smokeley 73 21.4%
Crowdon 49 14.4%
Mudwater 72 21.1%
Total 341 100.0%

As we already said, it is not possible to do calculations with a qualitative variable. Therefore, we
cannot compute an average value. If you really want to characterise the distribution in one number,
you could determine the mode. The mode is defined as the value that appears most often in a data set.
For example, for the number of employed per district, the category Smokeley is the mode. There are 73
persons in this category (see table 11.4), and this corresponds to 21.4% of all employed.
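A sketch in R of the frequency table and the mode, assuming the samplonia data frame, is shown below.

freq <- table(samplonia$District)
perc <- round(100 * prop.table(freq), 1)
data.frame(Frequency = as.vector(freq), Percentage = as.vector(perc),
           row.names = names(freq))

names(which.max(freq))    # the mode: the category that occurs most often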

11.3 Analysis of the relationship between two quantitative variables


To analyse the relationship between two quantitative variables, there is one very popular technique:
the two-dimensional scatterplot. Every object in the sample is represented by a point in a coordinate
system. The horizontal coordinate is equal to the value of one variable, and the vertical coordinate is
equal to the value of the other variable. Thus, a cloud of points is created.
If the cloud does not show a clear structure (it looks like a random snow storm), there is no relationship
between the two variables. If you can see some clear pattern, there is a relationship between the two
variables, and then it is a good idea to further explore the nature of this relationship.
The most extreme form of relationship is the one for which all points are on a straight line. In this case,
the value of one variable can be predicted exactly from the value of the other variable. Also other
aspects of a two-dimensional scatterplot could be interesting. For example, it will not be difficult to
detect outliers, or a group of points that behave differently from the rest.
Figure 11.7 shows a two-dimensional scatterplot of the variables age and income in the working
population of Samplonia. The points show a clear structure. First, the points are divided into several
different groups. For example, there is a smaller group of points representing people aged 40 and older
who have very high incomes. Income increases with age in this group. There is also a group with very
low incomes, where income does not increase with age.
If the points fall apart in separate groups, it could be interesting the characterise these groups. It may
help to include more variables in the analysis. If you have another quantitative variable, you could
indicate its value by the size of the points (by taking the size of the point proportional to the value of
this variable). If the third variable is a qualitative variable, you could give colors to the points that
correspond to the categories of this variable.
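As an illustration of this idea, here is a minimal sketch with matplotlib that colours the points of a two-dimensional scatterplot by the categories of a third, qualitative variable. The vectors age, income, and district are made-up illustration data, not the Samplonia data, and matplotlib is assumed to be available.

    import matplotlib.pyplot as plt

    # Made-up sample data: two quantitative variables and one qualitative variable.
    age = [25, 32, 47, 51, 58, 29, 44, 63]
    income = [310, 350, 1500, 1620, 1810, 120, 140, 170]
    district = ["Wheaton", "Wheaton", "Smokeley", "Smokeley",
                "Smokeley", "Newbay", "Newbay", "Newbay"]

    # Draw the points of each category with their own colour.
    for d in sorted(set(district)):
        xs = [a for a, g in zip(age, district) if g == d]
        ys = [y for y, g in zip(income, district) if g == d]
        plt.scatter(xs, ys, label=d)

    plt.xlabel("Age")
    plt.ylabel("Income")
    plt.legend()
    plt.show()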

Figure 11.7. A scatterplot for two variables

The structure of the points can be so complex that it is not possible to summarise it in a few numbers.
However, you always hope to encounter a situation in which such a summary is possible. One situation
is that in which all points are more or less on a straight line. It means there is a linear relationship. You
can summarise such a relationship by a correlation coefficient and a regression line.
The correlation coefficient attempts to express the strength of the relationship in a value between -1
and +1. You should note that the correlation coefficient only works well for linear relationships. For
example, if all points form a parabolic curve, the relationship is still very strong, but this will not show
in the value of the correlation coefficient.
The value of the correlation coefficient can vary between -1 and +1. The correlation coefficient is 0 if
there is no relationship at all between the two variables. If there is a perfect linear relationship
between the two variables, the correlation coefficient is either -1 (for a downward trend) or +1 (for an
upward trend). A correlation of -1 or +1 means that the value of one variable can be predicted exactly
from the value of the other variable.
It is not meaningful to compute the correlation coefficient before having looked at the two-
dimensional scatterplot. Figure 11.7 illustrates this. If you compute the correlation coefficient for all
points, you will find a value of 0.568. This value is not very high. So there only seems to be a weak
relationship. However, relationships within the various groups are strong. For example, the
correlation coefficient within the high income group is 0.964, which indicates an almost perfect linear
relationship in this group.

The correlation coefficient

Let

s_X^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2

be the variance of the observations on the variable X, and let

s_Y^2 = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2

be the variance of the observations on the variable Y. Furthermore, let

s_{XY} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})

be the covariance between the values of X and Y. Then the correlation coefficient is defined by

R_{XY} = \frac{s_{XY}}{\sqrt{s_X^2 \, s_Y^2}} = \frac{s_{XY}}{s_X \, s_Y}.

The value of R_{XY} is always in the interval from -1 to +1. A value of -1 or +1 means a perfect
relationship, and a value of 0 means no relationship at all.
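The definitions above translate directly into a few lines of Python. This is only a sketch; the vectors x and y are hypothetical observations, and in practice any statistical package will compute the correlation coefficient for you.

    import math

    def correlation(x, y):
        # Sample variances and covariance with denominator n - 1, as defined above.
        n = len(x)
        mx, my = sum(x) / n, sum(y) / n
        sxx = sum((xi - mx) ** 2 for xi in x) / (n - 1)
        syy = sum((yi - my) ** 2 for yi in y) / (n - 1)
        sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / (n - 1)
        return sxy / math.sqrt(sxx * syy)

    # Hypothetical example: two variables with an upward, roughly linear trend.
    x = [20, 25, 30, 40, 50, 60]
    y = [400, 460, 500, 620, 700, 820]
    print(round(correlation(x, y), 3))   # close to +1 for these data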

If there is a strong relationship between the two variables, and all points are approximately on a
straight line, you can summarise this relationship by means of a regression line. The computations for
such a regression line are outside the scope of this publication. Fortunately, most statistical software
packages can do it for you. It is not meaningful to compute the regression line for the scatterplot in
figure 11.7, because the points do not form a straight line. The situation is different in figure 11.8. This
is the two-dimensional scatterplot of the variables age and income for the working males in the
province of Agria. There is clearly a linear relationship. The correlation coefficient is a good indication
for the strength of the relationship. Its value turns out to be 0.960, which is almost equal to +1. So
there is a strong relationship. The expression for the regression line is
205.493 + 9.811 × Age.
This means you can predict someone’s income by multiplying his age by 9.811 and adding 205.493 to
the result. For example, if a male in Agria has an age of 50 years, his income should be something like
205.493 + 9.811 × 50 = 696.043.

Figure 11.8. A scatterplot for two variables with a linear relationship

The regression line

The formula for the regression line for predicting variable Y from variable X is equal to

y = a + b x.

The regression coefficient b is equal to

b = \frac{s_{XY}}{s_X^2} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2}.

The regression coefficient a is equal to

a = \bar{y} - b \, \bar{x}.

11.4 Analysis of the relationship between two qualitative variables


The possibilities for analysis of the relationship between qualitative variables are limited, as it is not
possible to do meaningful calculations with these types of variables. For graphical analysis, you can
use extensions of the bar chart: the grouped bar chart and the stacked bar chart. Furthermore, you can
make pie charts. You can compute the quantity Cramér’s V to measure the strength of the relationship
between two qualitative variables.
A grouped bar chart consists of a set of bar charts. You draw a bar chart of the one variable for each
category of the other variable. You combine all these bar charts in one graph. Figure 11.9 contains an
example of a grouped bar chart. It is the age distribution (in age categories) by district for the working
population of Samplonia. The bars have been drawn horizontally. Again, this is done to avoid confusion
with histograms.
The grouped bar chart contains a bar chart for the age distribution in each district. For example, you
see that there are no young people in Oakdale and no old people in Newbay. Of course, you can do it
the other way around: a bar chart of the distribution over the districts for each age category. This
helps to detect other aspects of the relationship between age and district.
If the shape of the bar chart for the one variable differs across the categories of the other variable, you
may assume some kind of relationship between the two variables. To further investigate the nature of
this relationship, you have to take a closer look at each bar chart separately.
You should realise that you cannot see every single aspect of the relationship between the two
variables. This type of chart will only reveal certain aspects. For example, it is difficult to see in figure
11.9 which district has the most employed persons. For this, you would have to add all the bars for each
district. It is also difficult to see whether an age category is under- or over-represented in a specific
district. It is hard to say whether the percentage of young people in Smokeley is larger or smaller than
the percentage of young people in Mudwater. It is easier to compare numbers of people in the various
age categories. For example, you can conclude that Smokeley has the largest group of elderly.

Figure 11.9. A grouped bar chart

Another way to show bar charts of one variable for each category of the other variable is the stacked
bar chart. The bars of each bar chart are not drawn below each other (like in figure 11.9) but stacked
onto each other. Figure 11.10 shows an example of a stacked bar chart. The same data are used as in
figure 11.9.
The total length of all bars is proportional to the number of employed in the various districts. Each bar
is divided into segments the lengths of which reflect the distribution of the age categories.
What is there to see in a stacked bar chart? It is clear which category of the one variable (district) is
smallest (Newbay) and which one is largest (Smokeley). Furthermore, you can get a good idea which
category of the other variable is under- or over-represented within a category of the one variable. For
example, it is clear that there are no young employed in Oakdale, and no old employed in Newbay.
Comparing the age distributions of two districts is difficult.

Figure 11.10. A stacked bar chart

Comparing distributions for the other variable within categories of the one variable is easier if you
replace the counts in the bars by percentages. This will produce a stacked bar chart in which each bar
has the same length (100%) and the various segments represent the distribution of the other variable
in percentages. See the example in figure 11.11.

Figure 11.11. A stacked bar chart with percentages

A chart like this one gives you the possibility to compare relative distributions. For example, you can
not only compare the percentages of young employed in the various districts, but also the
percentages of old employed. It is clear that relatively more older employed live in Smokeley and
Mudwater than in Wheaton or Greenham. Relatively many young employed live in Newbay and
Smokeley, whereas middle-age employed seem to be well-represented in Greenham and Oakdale.
We already mentioned the pie chart as an alternative to the bar chart. You can also use pie charts for the
analysis of the relationship between two qualitative variables. You could simply replace the
bar charts in figure 11.9 by pie charts. Such a graph becomes even more informative if you take the
size of each pie chart proportional to the number of people on which it is based (the number of people
in the corresponding category of the other variable). Figure 11.12 shows an example.

Figure 11.12. Pie charts

The areas of the circles are proportional to the numbers of employed in the corresponding districts. So
you can see which district is large (Mudwater) and which district is small (Newbay) in terms of
number of employed people. You can also see the age distribution within each district. It is hard,
however, to compare the age distributions. It is also not so easy to compare the sizes of the same age
group in different districts.
So there are various instruments to explore the relationship between two qualitative variables in a
graphical way. There is no single technique that performs best. Each technique has strong points and
weak points. The best thing to do is to try all techniques.

You can obtain a numerical overview of the combined distribution of two qualitative variables by
making a cross-tabulation. Table 11.5 is an example. It shows the cross-tabulation of the variables
district and age class of the working population in Samplonia.

Table 11.5. A cross-tabulation


District Age class
Young Middle-aged Old Total
Wheaton 30 19 11 60
Greenham 13 17 8 38
Oakdale 0 12 14 26
Newbay 18 5 0 23
Smokeley 26 25 22 73
Crowdon 29 12 8 49
Mudwater 35 16 21 72
Total 151 106 84 341

Interpretation of counts in a small table is doable, but it becomes harder and harder as the number of
rows and columns increases. What may help is to replace counts by percentages. You can compute
percentages in various ways: percentages of the table total, row percentages, and column percentages.
As an example, table 11.6 contains row percentages. This means all percentages in a row add up to
100%. In this way you obtain the age distribution within each district.

Table 11.6. A cross-tabulation with row percentages


District Age class
Young Middle-aged Old Total
Wheaton 50.0% 31.7% 18.3% 100.0%
Greenham 34.2% 44.7% 21.1% 100.0%
Oakdale 0.0% 46.2% 53.8% 100.0%
Newbay 78.3% 21.7% 0.0% 100.0%
Smokeley 35.6% 34.2% 30.1% 100.0%
Crowdon 59.2% 24.5% 16.3% 100.0%
Mudwater 48.6% 22.2% 29.2% 100.0%
Total 44.3% 31.1% 24.6% 100.0%

You can see in this table that young employed are over-represented in Newbay (78.3%), and that
relatively many old employed live in Oakdale (53.8%).
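A cross-tabulation with row percentages is easy to produce with software. Here is a minimal sketch in Python, assuming the pandas package is available; the counts are those of table 11.5.

    import pandas as pd

    # Counts from table 11.5: rows are districts, columns are age classes.
    counts = pd.DataFrame(
        {"Young":       [30, 13,  0, 18, 26, 29, 35],
         "Middle-aged": [19, 17, 12,  5, 25, 12, 16],
         "Old":         [11,  8, 14,  0, 22,  8, 21]},
        index=["Wheaton", "Greenham", "Oakdale", "Newbay",
               "Smokeley", "Crowdon", "Mudwater"])

    # Row percentages: divide each row by its row total, as in table 11.6.
    row_pct = counts.div(counts.sum(axis=1), axis=0) * 100
    print(row_pct.round(1))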
If there is no relationship between the row and column variable, the relative distribution of the column
variable will be more or less the same in each row. And vice versa, the relative distribution of the row
variable will be more or less the same in each column. The larger the differences between the relative
distributions, the stronger the relationship is.
There are numerical quantities that attempt to express the strength of the relationship between two
qualitative variables in one number. A well-known quantity is the chi-square statistic. If this statistic is
close to zero, there is (almost) no relationship. The larger its value, the stronger the relationship. But
what is large? A problem of the chi-square statistic is that its value does not only depend on the
strength of the relationship, but also on the numbers of rows and columns and on the number of
observations. So it is difficult to interpret its value. Therefore, other quantities have been defined that
are independent of these numbers. You can see them as a ‘standardised’ chi-square statistic. One such
quantity is Cramér’s V. The value of Cramér’s V is always between 0 and 1. A value of 0 means a total
lack of relationship. A value of 1 means a perfect relationship. A rule of thumb is to consider values
below 0.3 as a weak relationship, values between 0.3 and 0.7 as a moderate relationship, and values of
more than 0.7 as a strong relationship.
The value of Cramér’s V is 0.268 for the data in table 11.5. So we can conclude that there is a weak
relationship between district and age distribution in the working population of Samplonia.

Cramér’s V

Let G^2 be the value of the chi-square statistic for a cross-tabulation with R rows and C columns.
Then Cramér’s V is equal to

V = \sqrt{\frac{G^2}{n \cdot \min(R-1,\, C-1)}},

in which min(R-1, C-1) denotes the minimum of R-1 and C-1. The value of V is always in the interval
from 0 to 1, where 1 means a perfect relationship and 0 means no relationship at all.
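A minimal sketch in Python, assuming the scipy package is available, shows how Cramér’s V can be computed from the counts of table 11.5. The sketch uses the ordinary Pearson chi-square statistic; with these counts the result is approximately the value of 0.268 mentioned above.

    import numpy as np
    from scipy.stats import chi2_contingency

    # Counts from table 11.5 (rows: districts, columns: Young, Middle-aged, Old).
    table = np.array([[30, 19, 11],
                      [13, 17,  8],
                      [ 0, 12, 14],
                      [18,  5,  0],
                      [26, 25, 22],
                      [29, 12,  8],
                      [35, 16, 21]])

    g2, p_value, dof, expected = chi2_contingency(table)
    n = table.sum()
    r, c = table.shape
    cramers_v = np.sqrt(g2 / (n * min(r - 1, c - 1)))
    print(round(cramers_v, 3))   # approximately 0.268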

11.5 Analysis of the relationship between a quantitative and a qualitative variable


There are no specific techniques for the analysis of the relationship between a quantitative and a
qualitative variable. What you can do is take a technique for the analysis of the distribution of a
quantitative variable, and apply it for every category of the qualitative variable. We will describe two
graphical techniques, one based on the one-dimensional scatterplot, and one based on the boxplot. We
also show how to make a numerical overview based on the summary table.
The one-dimensional scatterplot can be used to study a possible relationship between a qualitative
variable and a quantitative variable. The idea is to make a one-dimensional scatterplot of the
quantitative variable for each category of the qualitative variable, and to combine all these scatterplots
in one graph. To prevent points from overlapping each other, vertical jitter is added in each scatterplot.
Figure 11.13 gives an example. The scatterplot shows the relationship between the variables income
and district for the working population of Samplonia.

Figure 11.13. Scatterplots of one variable

It is clear that the incomes are low in Wheaton, Newbay, and Greenham. In each of these three districts
there seem to be two groups: one group with very low incomes, and another group with somewhat
higher incomes. The incomes are clearly the highest in Oakdale. There is only one group. The other
three districts (Smokeley, Mudwater, and Crowdon) take a middle position with respect to income.

The second graphical technique for the analysis of the relationship between a quantitative and a
qualitative variable is a set of boxplots. You draw a boxplot of the quantitative variable for each
category of the qualitative variable. To be able to compare the boxplots, you must use the same scale
for all boxplots.
Figure 11.14 contains the boxplots for the analysis of the relationship between income and district for
the working population of Samplonia. The graph shows clear differences between the income
distributions. There are districts with low incomes (Wheaton, Newbay, and Greenham) and there is
also a district with very high incomes (Oakdale). The income distribution of the other three districts
(Smokeley, Mudwater, and Crowdon) are somewhere in between. The income distributions of these
three districts are more or less the same.

Figure 11.14. Box plots

You can use a similar approach for a numerical analysis of the relationship between a quantitative and
a qualitative variable: do a numerical analysis of the quantitative variable for each category of the
qualitative variable.
Table 11.7 presents an example of what you can do. It is a table that contains for each district a
number of characteristics of income: number of cases, minimum value, maximum value, average value,
and standard deviation. Other quantities, like the rule-of-thumb interval and the median, could also be
included.

Table 11.7. A numerical overview of the variable income

District Cases Minimum Maximum Average Standard


deviation
Wheaton 60 101 787 356 234
Greenham 38 102 841 324 219
Oakdale 26 2564 4497 3534 586
Newbay 23 115 648 344 167
Smokeley 73 635 2563 1607 518
Crowdon 49 612 2471 1356 505
Mudwater 72 625 2524 1440 569
Total 341 101 4497 1234 964

If you study table 11.7, it becomes clear that the income distribution in Oakdale is very different from
the income distributions in the other districts. The standard deviation of income is lower in Wheaton,
Greenham, and Newbay than in the other districts. Apparently, incomes do not differ so much in
these three districts.
Again, we advise you to first have a graphical look at the distributions of the quantitative variable
before you summarise them in a numerical overview. The numerical summary is only meaningful if the
distributions of the quantitative variable are more or less normal (bell-shaped).
For example, some income distributions within districts are rather skewed. If you were to compute a rule-
of-thumb interval for income in Wheaton, the lower bound would be negative. This is silly, as all incomes are
positive. So the rule-of-thumb interval does not work here.
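A per-category numerical overview like table 11.7 can be produced with a grouping operation. Here is a minimal sketch in Python with pandas; the data frame df with the columns district and income is a hypothetical stand-in for the Samplonia data.

    import pandas as pd

    # Hypothetical data frame with one row per employed person.
    df = pd.DataFrame({
        "district": ["Wheaton", "Wheaton", "Oakdale", "Oakdale", "Smokeley", "Smokeley"],
        "income":   [320, 410, 2900, 4100, 1200, 1900],
    })

    # Number of cases, minimum, maximum, average, and standard deviation per district.
    summary = df.groupby("district")["income"].agg(["count", "min", "max", "mean", "std"])
    print(summary.round(1))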

12 Publishing the results
In the end, the ultimate goal of your poll will be to publish interesting results. The obvious way to do
this is by making a research report. This chapter gives you some ideas about the structure and contents
of such a report.
A research report should be an account of how you designed your poll, how you collected the data,
how you processed the data, how you analysed the data, and how you reached your conclusions. The
report has to satisfy two important conditions. In the first place, the report must be readable. For
example, you should avoid technical jargon. Also readers without a statistical background must be able
to understand it. In the second place, you are accountable for the way in which you conducted your
poll. Other experts in the field of survey research must be able to check whether you followed the
methodological principles, and whether you drew scientifically sound conclusions about your target
population.
You should write your research report in a concise and professional writing style. The report should
be objective and neutral, and not advance a specific opinion. You should avoid the use of colloquial
language. Write in the passive form, and do not use the ‘you-style’. Avoid unfamiliar terms and
symbols.
There are various ways in which you can structure your research report. An often suggested
structure is to divide your report into the following parts.
 Executive summary. This is a short summary of why you conducted the poll, and what the results
are. Everybody must be able to read and understand the text.
 Methodological account. This is the methodological part of the report. This is an accurate
description of the design and fieldwork of your poll.
 Results. This is the part in which you describe the results of your analysis of the collected data. It
will be a mix of text, tables, and graphs.
 Conclusions. This is the part in which you draw conclusions from the results of your analysis. You
interpret the outcomes, and translate them to the practical problem that caused you to do the poll.
 Literature. This part is the list of the literature you consulted. The list could contain both
methodological and subject-matter publications.
 Appendices. This part could contain all kinds of technical details of your poll. The complete
questionnaire should be included. It could also contain tables that are too big to be included in the
normal text.
In the remainder of this chapter we will describe the above mentioned parts of the research report in
some more detail.

12.1 The executive summary


The executive summary is a short description of your poll in a manner that is readable and
understandable for everyone. It consists of two parts: the research question and the conclusions.
The research question is a clear, focused, and concise question about a practical issue. It is the question
around which you centre your poll. You should describe what the practical issue is in subject-matter
terms, and how your poll will provide the answers to the questions raised.
It should be clear from the research questions who commissioned the poll and possibly also who
sponsored it. You should mention possible interests of these organisations in specific outcomes of the
poll.

After having described the research question, you give an overview of the most important conclusions
of your poll. Such conclusions must only be based on the data you collected. The conclusions should
contribute to answering the research question.
It is important that everyone (particularly commissioners and sponsors, but possibly also the media
and the general public) understand the conclusions. You must also make clear how far your
conclusions reach. What is the target population to which they refer? And what is the reference date of
the survey? The executive summary should also indicate how accurate your conclusions are. What are
the margins of error? Was there non-response, and what was its effect on the outcomes?
The executive summary should be short, and should consist of no more than a few pages. Particularly,
the conclusions should be concise, but placed in the right context. There should only be conclusions,
and no arguments leading to the conclusions.
The executive summary must be readable and understandable by the commissioner of the poll. The
commissioner must be able to implement the conclusions in concrete policy decisions. There is no
place here for statistical jargon or mathematical formulae.

12.2 Methodological account


The second part of the research report could contain the methodological account of your poll. Your
description of the design of the poll and the way the data was collected, should provide sufficient
information to determine whether you have drawn your conclusions in a scientifically sound way.
Your conclusions should be supported by the data you collected.
The methodological account should at least contain the following topics:
 Target population. Give an exact definition of the target population. You should make clear which
objects do and which objects do not belong to the target population, and so to which group of
objects the conclusions refer.
 Variables. Describe the variables you measured in your poll. For each qualitative variable, you have
to give a list of the categories you distinguished. For each quantitative variable, you specify the unit
of measurement.
 Questionnaire. Give information about the questionnaire you used, such as the (approximate) time
it took to complete it, whether it was a paper questionnaire or a digital one, whether the
questionnaire contained checks on the answers (in case of a digital questionnaire). In case you
included checks in a digital questionnaire, you might describe these checks. Also describe the
procedure used to test the questionnaire.
 Population characteristics. Give an overview of the population characteristics you estimated.
Explain how you computed the estimates using the answers to the questions.
 Sampling frame. Describe the sampling frame you used to select the sample. Indicate whether the
sampling frame was up-to-date. Explain whether you encountered problems like under-coverage
or over-coverage.
 Sampling design. Describe how you selected your sample. Was it a probability sample? Was the
sample selected with equal or unequal probabilities? Was it sampling with or without
replacement? Explain what the values of the selection probabilities were.
 Fieldwork. Describe how the fieldwork was carried out. Was it a face-to-face, telephone, mail or
online poll? Was it an interviewer-assisted or a self-administered poll? Was the face-to-face or
telephone poll computer-assisted? If the fieldwork was carried out by interviewers, give more
information about them. How many were involved? Were they experienced interviewers? Did they
get special training? Did they encounter problems in the field?

 Data editing. After finishing the fieldwork, you have to check the collected data. Explain what you
did to detect errors. Explain how you corrected detected errors. If you used imputation techniques
to correct for erroneous or missing data, explain which particular techniques you used, which
variables were imputed, and how much imputation took place.
 Non-response. Your poll will suffer from non-response. What was the response rate of your poll?
Describe the composition of the non-response (with at least the categories no-contact, refusal and
not-able).
 Weighting. If there is a substantial amount of non-response, you have to correct your poll for a
possible bias by carrying out an adjustment weighting technique. Explain what kind of weighting
technique you used, and which auxiliary variables were included.
 Estimates. Finally, you have computed estimates of all kinds of population characteristics. Describe
how you computed these estimates, and whether you included (possibly unequal) selection
probabilities and correction weights in your computations. You may decide to include the
mathematical formulae here, but you can also decide to put the formulae in an appendix.
 Margins of error. Your poll is based on a sample. Therefore, your estimates have a margin of error.
Indicate how large these margins of error are. If your poll suffered from non-response, you should
mention that discrepancies between estimates and true population values can be larger than just
the margins of error. There may also be an unknown non-response bias.

12.3 Results
The third part of the research report describes the analysis of the collected data. It could start with an
exploratory analysis of the data. This exploratory analysis should provide insight into each variable
(target variable or auxiliary variable) separately.
You can choose to analyse the distribution of a variable with a graphical or with a numerical
technique. Graphs are often easier to interpret, and therefore may provide more insight (‘one picture
is worth a thousand words’). So your first choice could be for graphs. Tables with frequency
distributions and other numerical details could possibly be included in an appendix.
The exploratory analysis may be followed by a more in-depth analysis in which you try to describe and
interpret the distribution of the target variables. You also may want to explore possible relationships
between target variables, or between target variables and auxiliary variables. Again, you can choose
between graphical and numerical techniques. The graphs give an impression of the global picture,
while the numerical techniques provide the numerical details.
Do not forget to take into account that all quantities have a margin of error due to sampling. Where
possible, you have to account for this uncertainty.
It may be better not to include too many technical details in the text, as this may affect its readability.
You could consider putting technical details in an appendix.
12.4 Conclusions
The fourth part of the research report is devoted to the conclusions you have drawn from your poll. It
is a translation from your estimates back to the practical issue you were investigating. This part should
be more than just a listing of conclusions, as in the first part. There should be more room for interpretation
here, and therefore somewhat more subjective conclusions. Nevertheless, they may never contradict
your poll findings.
Your conclusion could also be a hypothesis about the target population. Such a hypothesis should be
tested in a new poll.

12.5 Literature
The fifth part of the research report should contain an overview of all literature used by you. There
could be two lists of publications:
 Subject-matter literature. These are the publications you consulted about the subject-matter
problem you investigated with your poll.
 Methodological literature. These are the methodological publications you consulted in order to
design and conduct your poll in a scientifically sound way.
12.6 Appendices
The appendices are for a number of things that are too big or too complex to be included in the
regular text, and that are not really necessary to understand what was going on in the poll. It is,
however, information that helps the reader to make a good assessment of how the poll was carried out.
Here are a number of things that could be included in an appendix:
 The questionnaire.
 An explorative analysis with for each variable a numerical summary of the distribution of the
answers.
 Formulae of estimation procedures, including formulae for computing correction weights.
 Large tables with outcomes.
 Letters (or e-mails) that have been sent to respondents, including reminders.

13 A checklist for polls

13.1 Separating the chaff from the wheat


All over the world, many polls are conducted. Particularly election periods show an increase in the
number of opinion polls. There are ample examples of countries in which election polls are carried out
almost every day of the election campaign. Also when there are no elections, we see an increase in the
number of polls. More and more, citizens are asked for their opinions about all kinds of topics.
All polls have in common that persons in a sample are asked to complete a questionnaire. The
questions can be about facts, behaviour, opinions and attitudes. The researcher uses the answers to
draw conclusions about the population as a whole. That can be done in a meaningful way, provided the
poll has been set up and conducted in a scientifically sound way.
The increase in the number of polls is mainly caused by the rapid development of the internet. The internet
makes it possible to collect a lot of data in an easy and cheap way. There are websites (for example,
www.surveymonkey.com) where everyone, even people without any knowledge of research methods,
can set up a poll very quickly. The question is, however, whether these polls collect data in a scientifically
meaningful way. If not, the validity of the outcomes is at stake.
So there are more and more polls. Some of them are good and some of them are bad. For users of the
results of polls (journalists, policy-makers, decision-makers) it is difficult to distinguish the chaff from
the wheat. To help them, a checklist has been developed. By going through the questions one by one,
you will get an impression of the quality and the usefulness of the poll. If the quality seems to be good,
you can pay more attention to the outcomes of the poll. If the checklist raises concern about the quality
of the poll (too many checklist questions were answered with ‘no’), it may be better to ignore the poll.
The checklist was kept simple, and is outspoken in its judgement of polls. A poll can only be right or
wrong. In practice, the situation might not be so simple, and therefore the checklist’s judgement may
lack some nuance.
The checklist was a combined initiative of three Dutch organisations: Statistics Netherlands, the Dutch-
speaking Platform for Survey Research (NPSO), and the Dutch-Flemish Association for Investigative
Journalism (VVOJ).
Section 13.2 contains the checklist with its nine questions. In section 13.3 we describe these questions
in more detail, and explain why these questions are relevant for our assessment of the quality of a poll
and its results.

13.2 The checklist

1. Is it clear who commissioned or sponsored the poll? If so, you can determine whether the
particular organisation has an interest in certain outcomes of the poll. This may be the case if
the poll is conducted as part of a marketing campaign for a product, service, or point of view.
 Yes: Go to 2.
 No: Beware! The objectivity of the poll is not guaranteed.
2. Is there a research report containing a detailed account of how the poll was conducted?
 Yes: Go to 3.
 No: Beware! You cannot determine the validity of the outcomes of the poll.
3. Is it clear what the target population of the poll was? This is the group of people who were
investigated, and to which the conclusions of the poll refer.
 Yes: Go to 4.
 No: Beware! You cannot interpret the outcomes in the proper context.
4. To be able to assess the quality of the questionnaire of the poll, at least the following two
conditions must be satisfied:
o A copy of the questionnaire must be included in the research report;
o The questionnaire must have been tested before the fieldwork started.
Are these conditions satisfied?
 Yes: Go to 5.
 No: Beware! The conclusions of the poll may be invalid.
5. How was the sample selected? Was it a random sample in which every person in the
target population had a non-zero probability of selection? Preferably all selection probabilities
are equal. If not, it must be possible to compute the selection probabilities.
 A random sample from the whole target population. Go to 6.
 A random sample from part of the target population. For example, only people with internet
access, or only people with a listed telephone number. Go to 6, but realise the outcomes
relate to only part of the target population.
 Self-selection through the internet. Beware! The outcomes of the poll are not valid.
 Any other type of selection with unknown selection probabilities, for example a quota
sample. Beware! The conclusions of the poll may be invalid.
6. Was the size of the realised sample reported? This is the number of respondents.
 Yes: Go to 7.
 No: Beware! It is impossible to compute the margins of error of the estimates.
7. Is the response rate sufficiently high, say higher than 50%?
 Yes: Go to 8.
 No: Beware! Non-response is often selective. The lower the response, the higher the bias.
Therefore, conclusions may be invalid.
8. Has the response been corrected (by adjustment weighting) for selective non-response?
 Yes: Go to 9.
 No: Beware! Results may be invalid due to non-response bias.
9. Were the margins of error reported? This is the discrepancy between an estimate and the true
value caused by working with a sample instead of the complete population.
 Yes. Note that these margins do not include a possible bias due to non-response or
measurement errors (for example memory errors). So the uncertainty in the outcomes can
be bigger than indicated by the margins of error.
 No. Beware! It is difficult to properly interpret the outcomes of the poll. For example, you
cannot distinguish real effects from the ‘noise’ of sample selection.

13.3 The nine questions

13.3.1 Who commissioned or sponsored the poll?

It is important to know who commissioned or sponsored a poll. Such organisations may have an
interest in certain outcomes. It is not uncommon to see press releases reporting about polls showing
that certain products or services are very good. A closer look often shows that the poll was conducted
by companies offering these products or services. So, the poll is more a marketing instrument than a
case of objective research.
You have to be very careful if the poll was conducted by the organisation that commissioned it. For
example, the BBC (2010) has editorial guidelines for this situation: “if the research has been
commissioned by an organisation which has a partial interest in the subject matter, we should show
extra caution, even when the methodology and the company carrying it out are familiar. The audience
must be told when research has been commissioned by an interested party”.

13.3.2 Is there a research report?


There should be a research report describing how the poll was set up and conducted. This report
should contain sufficient information to assess whether the poll was carried out in a scientifically
sound way. So the report should not only describe the outcomes of the poll but also the methodological
aspects. The report should at least contain the following information:
 The organisation commissioning or sponsoring the poll.
 The organisation that conducted the poll.
 The definition of the target population. This is the group from which the sample was selected, and
to which the conclusions of the poll relate.
 The complete questionnaire. It must be clear whether the questionnaire was tested, and how it
was tested.
 The sampling frame used to select the sample from. It is the list with contact information (name,
address, telephone number, or e-mail address) of every object in the target population.
 The way in which the sample was selected. It must be clear whether a random sample was drawn,
and how the random sample was drawn. Was the sample drawn with equal or unequal
probabilities?
 The initial sample size. This is the size of the sample that was drawn from the sampling frame.
 The final sample size. This is the number of respondents.
 The response rate (100 × Number of respondents / Initial sample size).
 The way in which the poll was corrected for the effects of non-response (and possible other
selection effects). If adjustment weighting was carried out, there should be a list of the auxiliary
variables used. The report should make clear how the correction weights were computed.
 The margins of error. Note that you can only compute these margins if a random sample was
selected and non-response was not selective. In case of a substantial amount of non-response, the
researcher should warn that the differences between estimates and true values can be much larger
than the margins of error.

13.3.3 What is the target population?

The target population is the group of people from which the sample was drawn and to which the
outcomes of the poll refer. There must be a clear definition of the target population. It must always be
possible to decide in practical situations whether or not a sampled person belongs to the target
population.
Problems may occur if the sampling frame does not cover the target population. In case of under-
coverage, the sample is selected from only part of the target population. Consequently, the outcomes
only relate to this sub-population, and not to the whole population.
Example: you define the target population as all Dutch people of age 18 years and older, but you select the
sample from only those with access to the internet. In fact, your conclusions only concern Dutch people of age
18 years and older with access to the internet.

13.3.4 How is the quality of the questionnaire?

A good questionnaire is of vital importance. Indeed, practice has shown that it is easy to influence the
outcomes of a poll by manipulating the texts of the questions, and the order of the questions. A good
questionnaire contains objective and comprehensible questions. The following pitfalls must at least be
avoided:
 Incomprehensible questions. Respondents may not understand questions because of the use of
jargon, unfamiliar terms, or long, vague and complex sentences. Example: “Are you satisfied with
the recreational facilities in your neighbourhood?”.
 Ambiguous questions. Example: “When did you leave school?” What kind of answer is expected? A
date, an age, or maybe some event (when I married)?
 Leading questions. Example: “Most people feel that € 5 is way too much money to pay for a simple
cup of coffee. Would you pay € 5 for a cup of coffee?”.
 Double questions / double-barrelled questions. Example: “Do you think that people should eat less
and exercise more?”.
 Negative questions or double-negative questions. Example: “Would you rather not use a non-
medicated shampoo?”.
 Recall questions. Questions requiring recall of events that have happened in the past are a source of
errors. People tend to forget these events. Example: “How many times did you contact your family
doctor in the last two years?”.
A good poll requires thorough testing of the questionnaire before it is used for data collection in ‘the
field’. It should be clear how the questionnaire was tested.

13.3.5 How was the sample selected?

To be able to draw meaningful conclusions about a target population, you must apply probability
sampling. The sample must be a random sample. Everyone in the population must have a non-zero
probability of being selected in the sample. The selection probabilities must be known.
The simplest way to select a random sample is to draw a simple random sample. This implies that
everyone in the population has the same probability of selection. The analogy principle applies, which
means that the sample mean (or percentage) is a good estimator of the population mean (or percentage).
It is possible to select a sample with unequal probabilities. As a consequence, the estimators are
somewhat more complicated, because you have to correct for the unequal probabilities. An example is
an approach in which you first draw addresses with equal probabilities, after which you draw one
random person at each selected address. Persons in large households have smaller selection
probabilities than persons in small households.
To select a sample, you need a sampling frame. This is a list (digital or on paper) of all members of the
target population. For each member of the population, the list must contain contact information.
Names and addresses are required for face-to-face or mail polls, telephone numbers for a telephone
poll, or e-mail addresses for an online poll. If the sampling frame does not cover the whole population,
the conclusions of your poll are only relevant for the sub-population that can be reached through the
sampling frame.
If you did not apply probability sampling, the selection probabilities are unknown, making it impossible to
compute valid estimates. This is the case for a quota sample, or for a self-selection sample.

13.3.6 How large is the sample?


If you selected your sample by means of probability sampling, you can compute the precision of your
estimates. There is a simple rule: The precision increases as the sample size increases.
The precision is usually indicated by the margin of error. The margin of error is the maximal difference
between the estimate and the true population value.
Table 13.1 contains the margin of error for estimating a population percentage. The margin of error is
large if the percentage is close to 50%. The margins of error decrease as the sample size increases.

Table 13.1. Margins of error


Percentage Sample size
100 200 500 1000 2000 5000
10 5.9 4.2 2.6 1.9 1.3 0.8
20 7.9 5.6 3.5 2.5 1.8 1.1
30 9.0 6.4 4.0 2.8 2.0 1.3
40 9.7 6.8 4.3 3.0 2.1 1.4
50 9.8 6.9 4.4 3.1 2.2 1.4
60 9.7 6.8 4.3 3.0 2.1 1.4
70 9.0 6.4 4.0 2.8 2.0 1.3
80 7.9 5.6 3.5 2.5 1.8 1.1
90 5.9 4.2 2.6 1.9 1.3 0.8

Suppose you conducted a poll with a simple random sample of size 500. It turns out that 40% of the
respondents are in favour of a certain policy. Table 13.1 shows that the margin of error for a sample size
of 500 and an estimate of 40% is equal to 4.3%. So, the percentage in favour of the policy in the target
population will be between 40 - 4.3 = 35.7% and 40 + 4.3 = 44.3%.
Suppose you conducted an election poll with a sample size of 1,000 people. The result is that 20% of
the respondents will vote for a certain party. One month later you conduct the poll again. This time
22% will vote for the party. Can you conclude that support for the party has increased? No, because
both percentages have a margin of error of 2.5%. The margin of error is larger than the difference
between the two percentages (22 – 20 = 2%). So the difference between the two polls may be
attributed to sampling ‘noise’.
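The margins of error in table 13.1 follow from a simple formula for a simple random sample (95% confidence). A minimal sketch in Python:

    import math

    def margin_of_error(p, n):
        # 95% margin of error for an estimated percentage p from a simple random sample of size n.
        return 1.96 * math.sqrt(p * (100 - p) / n)

    print(round(margin_of_error(40, 500), 1))    # about 4.3, as in table 13.1
    print(round(margin_of_error(20, 1000), 1))   # about 2.5, as in the election example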

13.3.7 What is the response rate?

Non-response occurs when people that have been selected in the sample (and belong to the target
population of the poll) do not provide the requested information. The questionnaire remains empty.
Non-response can have various causes: it may be impossible to make contact (because people are not
at home), they may refuse to cooperate (because they consider the poll an intrusion of their privacy),
or they are not able to answer the questions (because of illness or language problems). Because of
non-response, some groups will be over-represented, while other groups will be under-represented.

Hence, the response is selective. In other words: the realised sample lacks representativity. There are
three factors determining the magnitude of the bias of estimates:
 The percentage of people who did not participate. The higher the non-response percentage, the
more the estimates differ from the true population value. So, if your poll has a low response rate,
you run a high risk of serious problems.
 The difference between respondents and non-respondents. For example: election polls often show
a strong relationship between response behaviour and voting behaviour. Respondents tend to
vote, and non-respondents tend to refrain from voting. This causes voters to be over-represented
among respondents. Therefore, estimates for the percentage of voters will be too high.
 Whether some people participate more often in polls than other people. If some people have high
response probabilities (they participate often in this type of poll), and some people have low
response probabilities (they almost never participate), the bias of the estimates is increased. If
everyone has the same response probability (is as likely to participate), non-response will not lead
to a bias.
Usually you cannot determine the magnitude of the bias. You can only do that if you know the answers
to the questions of the non-respondents. Because they are non-respondents, they do not answer the
questions. It is, however, possible to compute the worst case: how large can the bias at most be?
Suppose you conduct an election poll, and the response rate is 40%. Of the respondents, 60% say they
will vote. If 40% responds, 60% does not respond. There are two extreme situations. The first one is
that all non-respondents will not vote. Then, the percentage of voters in the complete sample is equal
to
0.40 × 60% + 0.60 × 0% = 24%.
The second extreme situation is that all non-respondents will vote. Then, the percentage of voters in
the complete sample is
0.40 × 60% + 0.60 × 100% = 84%.
So, while 60% of the respondents say they will vote, the percentage of voters in the complete sample could
have been anywhere between 24% and 84%. The bandwidth is large. Indeed, the effects of non-response can
be substantial. In fact, you must conclude that the outcomes of this poll are not meaningful.
Table 13.2 contains the bounds of the complete sample percentages for a series of response rates. It is
clear that the bandwidth decreases as the response rate increases.

Table 13.2. Bandwidth of the estimator due to non-response


Percentage Response rate
in response 20 40 60 80
10 2 – 82 4 – 64 6 – 46 8 – 28
20 4 – 84 8 – 68 12 – 52 16 – 36
30 6 – 86 12 – 72 18 – 58 24 – 44
40 8 – 88 16 – 76 24 – 64 32 – 52
50 10 – 90 20 – 80 30 – 70 40 – 60
60 12 – 92 24 – 84 36 – 76 48 – 68
70 14 – 94 28 – 88 42 – 82 56 – 76
80 16 – 96 32 – 92 48 – 88 64 – 84
90 18 – 98 36 – 96 54 – 94 72 – 92
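The bounds in table 13.2 follow from a simple worst-case calculation: either all non-respondents score 0%, or they all score 100%. A minimal sketch in Python, where p is the percentage observed among the respondents and r is the response rate (both in %):

    def nonresponse_bounds(p, r):
        # Worst cases: all non-respondents score 0%, or all non-respondents score 100%.
        lower = (r / 100) * p
        upper = (r / 100) * p + (100 - r)
        return lower, upper

    print(nonresponse_bounds(60, 40))   # (24.0, 84.0), the election example above
    print(nonresponse_bounds(50, 80))   # (40.0, 60.0), as in table 13.2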

13.3.8 Was the poll corrected for the effects of non-response?

It is important to correct the outcomes of a poll for the negative effects of non-response. Usually some
kind of adjustment weighting is applied. This means you assign a weight to each respondent. You
compute these weights in such a way that you correct the response of your poll for over- or under-
representation of specific groups.
An example: Suppose you conduct a poll, and it turns out the response consists of 60% males
and 40% females. There is a discrepancy with the population distribution, because the
population of the Netherlands consists of 49.5% males and 50.5% females.
Apparently, males have responded better in the poll. They are over-represented. To correct for this,
every male respondent is assigned a correction weight equal to
49.5 / 60.0 = 0.825.
This means that every responding male will count for 0.825 instead of 1. The weight is smaller than 1,
because there were too many males among the respondents. Each female is assigned a correction
weight equal to
50.5 / 40.0 = 1.263.
So each female counts for 1.263 instead of 1. The correction weight is larger than 1, because there
were too few females among the respondents.
By assigning these correction weights, the response becomes representative with respect to gender.
You can compute the weights because you know the true population percentages. The idea of
weighting is making the response representative with respect to as many variables as possible.
However, you can only use variables that you measured in the poll and for which the population
distribution is available. This can be a serious restriction. Often used weighting variables are gender,
age, marital status and region of the country.
The hope is that if you make the response representative with respect to many auxiliary variables, the
response also becomes representative with respect to the target variables of your poll.
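The weight computation in the example above is straightforward. A minimal sketch in Python, with the population and response percentages per gender as the inputs:

    # Percentages in the population and in the response, per category of the weighting variable.
    population = {"male": 49.5, "female": 50.5}
    response   = {"male": 60.0, "female": 40.0}

    # Correction weight per category: population share divided by response share.
    weights = {group: population[group] / response[group] for group in population}
    print(weights)   # roughly {'male': 0.825, 'female': 1.2625}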
Not every adjustment weighting procedure is effective. Weighting is only able to reduce the non-
response bias if two conditions are satisfied:
 There is a strong relationship between the target variables of your poll and the auxiliary variables
you use for weighting.

 There is a strong relationship between response behaviour and the auxiliary variables you use for
weighting.
For example: you conduct an opinion poll, and you decide to weight the response by gender and age.
Weighting will be meaningless if there is no relationship between the variables gender and age and the
variable opinion.
You should always attempt to determine groups that are under- or over-represented, what kind of
effect this can have, and whether adjustment can improve the situation.

13.3.9 Did the researcher provide margins of error?

The outcomes of a poll are just estimates of population characteristics. Therefore, it is not realistic to
assume the estimates are exactly equal to the true population values. Even in the ideal case of a simple
random sample with full response, there still is a discrepancy between the estimate and the true, but
unknown, population value. This is the margin of error. You can calculate how large the margin of
error is. You can also use table 13.1.

Example: You select a simple random sample of 500 persons. Everyone participates, so there is no
non-response. Of the respondents, 60% say they will vote at the next elections. The corresponding
margin of error is 4.3%. Hence, the percentage of voters in the population will be between 60 – 4.3 =
55.7% and 60 + 4.3 = 64.3%.
Note again that the margins of error only indicate the uncertainty caused by sample selection. Non-
response causes extra problems in the form of a bias. This bias is not included in the margin of error.
In practice, the situation can be worse than indicated by the margin of error. The margin of error is
only a lower bound. The real margin can be larger.
If you repeat your poll, you must realise that small differences do not have to indicate ‘real’ differences
between polls. They may just have been caused by sampling ‘noise’. Only if the differences are larger than
the margins of error, can you conclude that something really changed. Note that the situation can be
even more complicated if non-response occurs in both polls.

13.4 An example: social media stress


“Recent research of the National Academy for Media & Society shows that young people between the
age of 13 and 18 years suffer from a serious form of Social Media Stress (SMS). Social media are getting
a hold on young people with their subtle stimuli, such as sounds, push messages, attention, and
rewards. Young people are unable to take an independent decision to stop, because they fear to be
excluded. The serious form of this fear is called FOMO (Fear of Missing Out).”
This was the first part of a press release published by the National Academy for Media & Society in
May, 2012. Many news media took this up as a serious item. National radio and TV channels (e.g. NOS),
national newspapers (e.g. Trouw) and many news websites had stories about the dangers of social
media.
What these media failed to do was check the press release. The National Academy for Media &
Society was an unknown organisation. What kind of an organisation was it? How did it reach the
conclusions about Social Media Stress? And how valid were these conclusions?
This section checks the facts about this research project. We will use the checklist for polls, and we will
try to answer the nine questions.

Figure 13.1. The news item about Social Media Stress on the website of the NOS.

13.4.1 Who commissioned or sponsored the poll?


The poll was an initiative of the National Academy for Media & Society. This organisation also
conducted the poll. The Academy turned out to be a small foundation, run by two people. It offers
courses like ‘Social Media Professional’ and ‘Media Coach’. So the organisation has an interest in
certain outcomes of the poll: the more problems the social media cause, the more attractive the
courses become.
As the poll was not conducted by a neutral organisation, you should be very careful with the
interpretation of the results.

13.4.2 Is there a research report?

It was possible to download a report from the internet. This report focuses on the outcomes of the
survey. There was very little about the way the poll was set up and conducted. The information was
insufficient to assess the quality of the poll.
Figure 13.2 shows how the outcomes of the poll were presented in the report. The respondents were
asked to what extent three statements applied to them (completely, somewhat, or not at all). The
statements were (1) I feel stressed when I notice I cannot keep up-to-date with everything in the social
media, (2) I have to process too much information in the social media to keep up-to-date, and (3)
Because I want to keep up-to-date with everything, and I do not have time to immerse myself, I read
information in the social media only superficially.

Figure 13.2. Some outcomes of the poll on Social Media Stress

The conclusions of the poll were not always supported by the collected data. For example, the
conclusions of the report contain the statement that young people between the age of 13 and 18 years
suffer from a serious form of Social Media Stress (SMS). However, from charts like the one in figure
13.2 you can conclude that at least half of the young people do not have a problem at all.

13.4.3 What is the target population?

There was some confusion about the target population. The press release seems to suggest that the
target population consisted of all young people between the ages of 13 and 18 years. However, the
estimates in the report assume the target population to consist of all young people between the ages of
13 and 18 with a smartphone. So the conclusions of the poll only relate to young people with a
smartphone and not to all young people.

13.4.4 How is the quality of the questionnaire?

We cannot say much about the quality of the questionnaire, because it was not included in the report.
It was also not clear whether the questionnaire was tested.
The charts in the report contain question texts. These texts raise some concern about the jargon used.
What exactly do the respondents mean when they say they ‘feel stressed’? And what is the meaning of
‘keeping up-to-date with everything’? There is no guarantee that all respondents interpret these terms
in the same way.
13.4.5 How was the sample selected?
There is a problem with sample selection. The research report does not contain any information on the
way the sample was selected. We can only read that 493 young people with an age between 13 and 18
years completed an online questionnaire.
After sending an e-mail to the Academy with the request to give more information about the sampling
design, an e-mail was returned with some more details. Apparently, the National Academy for Media &
Society has a network of 850 so-called youth professionals. These youth workers are active in
education, libraries, youth counselling, and community work. A number of these youth workers were
asked to invite approximately 20 young people to complete the questionnaire on the internet. In the
end, 240 boys and 253 girls did this.
There are serious doubts about this sample design. No random sampling has taken place. The youth
workers were not sampled at random from all youth workers. And young people were not sampled at
random by the youth workers. Moreover, young people not involved in any kind of youth work, could
not be selected at all. So the conclusion is that this is not a representative sample.

13.4.6 How large is the sample?

The final sample size was 493 young people. This is not a very large sample, so it is important to keep
in mind that there are margins of error. For example, if 50% of the young people in the sample say they
do not feel stressed, the percentage in the population is somewhere between 45.6% and 54.4%. This is
based on a margin of error of 4.4% (see table 13.1).
Note that we can only make such computations if a simple random sample has been selected, and the
non-response is not selective.
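For illustration, the 4.4% follows from the margin of error formula given in section 13.4.9, with a
sample percentage of p = 50 and a sample size of n = 493 (again assuming a simple random sample):

$$M = 1.96 \times \sqrt{\frac{50 \times (100 - 50)}{493}} \approx 4.4,$$

so the interval runs from 50 − 4.4 = 45.6% to 50 + 4.4 = 54.4%.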

13.4.7 What is the response rate?

There is no mention of non-response in the research report. We may expect that there are youth
professionals and young people who do not want to cooperate. If those people differ from the people
who do want to participate, there may be a serious bias.
It is very unfortunate that nothing has been reported about non-response and the problems it may
cause.

13.4.8 Was the poll corrected for the effects of non-response?

Although nothing was said about non-response, the researchers indicate that they applied a ‘mild’ form
of adjustment weighting. They corrected their sample by comparing the distribution of males and
females in the sample with that of males and females in the population. Unfortunately, this had no
effect: they had constructed their sample such that it already had more or less the proper numbers of
males and females, so adjustment weighting could not improve the situation. The sample was already
representative with respect to gender.
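To see why such weighting has little effect here, consider a minimal sketch in R of gender adjustment
weighting. This is only an illustration, not the researchers' actual computation; the population shares of
boys and girls are assumed to be roughly 50/50, as the report does not give them.

# Correction weight per gender: population share divided by sample share.
sample_counts <- c(boys = 240, girls = 253)           # counts from the e-mail reply
sample_share  <- sample_counts / sum(sample_counts)   # about 0.487 and 0.513
pop_share     <- c(boys = 0.5, girls = 0.5)           # assumed population shares

correction_weight <- pop_share / sample_share
round(correction_weight, 3)
# boys: 1.027, girls: 0.974

Both weights are close to 1, which is consistent with the conclusion that weighting by gender could
hardly change the estimates.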

13.4.9 Did the researcher provide margins of error?


The research report of the poll on Social Media Stress did not contain margins of error, or any
indication that figures in the report are just estimates.
Of course, we could compute margins of error ourselves. We can only do this if we assume simple
random sampling and non-selective non-response. The expression for the (estimated) margin of error
of a sample percentage p is equal to

$$M = 1.96 \times \sqrt{\frac{p\,(100 - p)}{n}}.$$
If we want to compute the margin of error for the percentage of people suffering from Social Media
Stress, we substitute p = 23 (see figure 13.2), and n = 493. The margin of error becomes

$$M = 1.96 \times \sqrt{\frac{23 \times 77}{493}} \approx 3.7.$$

So the population percentage will be (with 95% confidence) between 23 – 3.7 = 19.3% and 23 + 3.7 =
26.7%.
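As a quick check, here is a minimal sketch in R of the same computation (again assuming simple
random sampling and non-selective non-response):

# Margin of error (in percentage points) of a sample percentage p,
# for a simple random sample of size n and 95% confidence.
margin_of_error <- function(p, n) {
  1.96 * sqrt(p * (100 - p) / n)
}

margin_of_error(23, 493)   # about 3.7, as computed above
margin_of_error(50, 493)   # about 4.4, the figure used in section 13.4.6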

13.5 Conclusion
The poll on Social Media Stress was carried out by an organisation with an interest in finding
substantial social media stress problems. No proper sample was selected. Non-response was not
reported and not corrected for. The conclusions were not always supported by the data. All in all, we
should not attach too much importance to this poll. It may be better to ignore it.

References
Baker, R., Blumberg, S.J., Brick, J.M., Couper, M.P., Courtright, M., Dennis, J.M., Dillman, D., Frankel, M.R.,
Garland, P., Groves, R.M., Kennedy, C., Krosnick, J., Lavrakas, P.J., Lee, S., Link, M., Piekarski, L.,
Rao, K., Thomas, R.K. & Zahs, D. (2010), Research Synthesis: AAPOR Report on Online Panels.
Public Opinion Quarterly 74, pp. 711–781.
BBC (2010), Editorial Guidelines | Guidance | Opinion Polls, Surveys, Questionnaires, Votes and Straw
Polls. http://www.bbc.co.uk/editorialguidelines/page/guidance-polls-surveys-full
Bethlehem, J.G. (2010), Selection Bias in Web Surveys. International Statistical Review 78, pp. 161-188.
Bethlehem, J.G. & Biffignandi, S. (2012), Handbook of Web Surveys. John Wiley & Sons, Hoboken, NJ.
Bethlehem, J.G., Cobben, F. & Schouten, B. (2011), Handbook of Nonresponse in Household Surveys. John
Wiley & Sons, Hoboken, NJ, USA.
Bethlehem, J.G. & Hofman, L.P.M.B. (2006), Blaise – Alive and kicking for 20 years. Proceedings of the
10th Blaise Users Meeting, Statistics Netherlands, Voorburg / Heerlen, The Netherlands, pp. 61-
88.
Beukenhorst, D. & Wetzels, W. (2009), A Comparison of Two Mixed-mode Designs of the Dutch Safety
Monitor: Mode Effects, Costs, Logistics. Technical paper DMH 206546, Statistics Netherlands,
Methodology Department, Heerlen, The Netherlands.
Bishop, G.F. (2005), The Illusion of Public Opinion. Rowman & Littlefield Publishers, Lanham, MD, USA.
Bowley, A.L. (1906), Address to the Economic Science and Statistics Section of the British Association
for the Advancement of Science. Journal of the Royal Statistical Society 69, pp. 548-557.
Bowley, A.L. (1926), Measurement of the Precision Attained in Sampling. Bulletin of the International
Statistical Institute XII, Book 1, pp. 6-62.
Bradburn, N.M., Sudman, S. & Wansink, B. (2004), Asking Questions. The Definitive Guide to
Questionnaire Design – For Market Research, Political Polls, and Social and Health Questionnaires.
Jossey-Bass, San Francisco, USA.
Bronzwaer, S. (2012), Infiltranten probeerden de peilingen van Maurice de Hond te manipuleren. NRC,
13 September 2012.
Cobben, F. & Bethlehem, J.G. (2005), Adjusting Undercoverage and Non-response Bias in Telephone
Surveys. Discussion Paper 05006, Statistics Netherlands, Voorburg / Heerlen, The Netherlands.
Converse, J.M. & Presser, S. (1986), Survey Questions: Handcrafting the Standardized Questionnaire.
Sage University Paper Series on Quantitative Applications in the Social Sciences, 07-063, Sage
Publications, Beverly Hills, CA, USA.
Cook, C., Heath, F. & Thompson, R.L. (2000), A Meta-Analysis of Response Rates in Web- or Internet-
Based Surveys. Education and Psychological Measurement 60, pp. 821-836.
Couper, M.P. (2000), Web surveys: A review of issues and approaches. Public Opinion Quarterly 64, pp.
464-494.
Couper, M.P., Baker, R.P., Bethlehem, J.G., Clark, C.Z.F., Martin, J., Nicholls II, W.L. & O’Reilly, J.M. (eds.)
(1998), Computer Assisted Survey Information Collection. Wiley, New York, USA.
Couper, M.P. & Nicholls, W.L. (1998), The History and Development of Computer Assisted Survey
Information Collection Methods. In: Couper, M.P., Baker, R.P., Bethlehem, J., Clark, C.Z.F., Martin,
J., Nicholls, W.L. & O'Reilly, J. (eds.), Computer Assisted Survey Information Collection. Wiley, New
York, USA.
Couper, M.P., Traugott, M. & Lamias, M. (2001), Web Survey Design and Administration. Public Opinion
Quarterly 65, pp. 230-253.
De Fuentes-Merillas, L., Koeter, M.W.J., Bethlehem, J.G., Schippers, G.M. & Van den Brink, W. (1998), Are
Scratchcards Addictive? The Prevalence of Pathological Scratchcard Gambling among Adult
Scratchcard Buyers in The Netherlands. Addiction 98, pp. 725-731.
De Leeuw, E.D. (1992), Data Quality in Mail, Telephone, and Face-to-Face Surveys. TT-Publications,
Amsterdam, The Netherlands.
Dillman, D.A. (2007), Mail and Internet Surveys. The Tailored Design Method. John Wiley & Sons,
Hoboken, NJ, USA.
Dillman, D.A., Smyth, J.D. & Christian, L.M. (2009), Internet, Mail, and Mixed-mode Surveys. The Tailored
Design Method. John Wiley & Sons, Hoboken, NJ, USA.
Dillman, D.A. & Bowker, D. (2001), The Web Questionnaire Challenge to Survey Methodologists. In:
Reips, U.D. and Bosnjak, M. (eds.), Dimensions of Internet Science, Pabst Science Publishers,
Lengerich, Germany.
Dillman, D.A., Tortora, R.D. & Bowker, D. (1998), Principles for constructing web surveys. Technical
report 98-50, Social and Economic Sciences Research Center, Washington State University,
Pullman, Washington, USA.
Duffy, B., Smith, K., Terhanian, G. & Bremer, J. (2005), Comparing Data from Online and Face-to-face
Surveys. International Journal of Market Research 47, pp. 615-639.
Galesic, M., Tourangeau, R., Couper, M.P. & Conrad, F.G. (2007), Using Change to Improve Navigation in
Grid Questions. Paper presented at the General Online Research Conference (GOR ’07), Leipzig,
Germany.
Horvitz, D.G. & Thompson, D.J. (1952), A generalization of sampling without replacement from a finite
universe. Journal of the American Statistical Association 47, pp. 663-685.
Kaczmirek, L. (2010), Attention and Usability in Internet Surveys: Effects of Visual Feedback in Grid
Questions. In: Das, M., Ester, P. & Kaczmirek, L. (eds.), Social and Behavioral Research and the
Internet. Advances in Applied Methods and Research Strategies. Routledge, New York / London,
pp. 191-214.
Kalton, G. & Schuman, H. (1982), The effect of the question on survey responses: a review. Journal of
the Royal Statistical Society 145, pp. 42-57.
Kaplowitz, M.D., Hadlock, T.D. & Levine, R. (2004), A Comparison of Web and Mail Survey Response
Rates. Public Opinion Quarterly 68, pp. 94-101.
Kiaer, A.N. (1895), Observations et Expériences Concernant des Dénombrements Représentatives.
Bulletin of the International Statistical Institute IX, Book 2, pp. 176-183.
Kiaer, A.N. (1997, reprint), Den Repräsentative Undersökelsesmetode. Christiania Videnskabsselskabets
Skrifter, II, Historiskfilosofiske klasse, Nr 4 (1897). Statistisk Sentralbyrå, Oslo, Norway.
Knowledge Networks (2012), KnowledgePanel Design Summary, www.knowledgenetworks.com.
Krosnick, J.A. (1991), Response strategies for coping with the cognitive demands of attitude measures
in surveys. Applied Cognitive Psychology, pp. 213-236.
Krug, S. (2006), Don’t Make Me Think! A Common Sense Approach to Web Usability, Second Edition. New
Riders, Berkeley, California, USA.

Kruskal, W. & Mosteller, F. (1979a), Representative Sampling, I: Non-scientific Literature. International
Statistical Review 47, pp. 13-24.
Kruskal, W. & Mosteller, F. (1979b), Representative Sampling, II: Scientific Literature, Excluding
Statistics. International Statistical Review 47, pp. 111-127.
Kruskal, W. & Mosteller, F. (1979c), Representative Sampling, III: the Current Statistical Literature.
International Statistical Review 47, pp. 245-265.
Kuusela, V., Vehovar, V. & Callegaro, M. (2006), Mobile phones – Influence on Telephone Surveys. Paper
presented at the Second International Conference on Telephone Survey Methodology, Florida,
USA.
Lienhard, J.H. (2003), The Engines of Our Ingenuity. An Engineer Looks at Technology and Culture.
Oxford University Press, Oxford, UK.
Lodge, M., Steenbergen, M.R. & Brau, S. (1995), The responsive voter: campaign information and the
dynamics of candidate evaluation. American Political Science Review 89, pp. 309-326.
Lozar Manfreda, K., Bosnjak, M., Berzelak, J., Haas, I. & Vehovar, V. (2008), Web Surveys versus Other
Survey Modes – A Meta-Analysis Comparing Response Rates. International Journal of Market
Research 50, pp. 79-104.
Neyman, J. (1934), On the Two Different Aspects of the Representative Method: the Method of
Stratified Sampling and the Method of Purposive Selection. Journal of the Royal Statistical Society
97, pp. 558-606.
Pew Research Center (2012), Assessing the Representativeness of Public Opinion Surveys. Retrieved
from www.people-press.org.
Saris, W.E. (1997), The public opinion about the EU can easily be swayed in different directions. Acta
Politica 32, pp. 406-436.
Saris, W.E. (1998), Ten years of interviewing without interviewers: the Telepanel. In: Couper, M.P.,
Baker, R.P., Bethlehem, J.G., Clark, C.Z.F., Martin, J., Nicholls II, W.L. & O’Reilly, J.M. (eds.),
Computer Assisted Survey Information Collection. Wiley, New York, USA, pp. 409-430.
Scherpenzeel, A. (2008), An Online Panel as a Platform for Multi-Disciplinary Research. In: Stoop, I. &
Wittenberg, M. (Eds.), Access Panels and Online Research, Panacea or Pitfall? Aksant, Amsterdam,
The Netherlands, pp. 101-106.
Scherpenzeel, A.C. (2009), Start of the LISS Panel: Sample and Recruitment of a Probability-based
Internet Panel. CentERdata, Tilburg, The Netherlands.
Scherpenzeel, A.C. & Bethlehem, J.G. (2011), How Representative are Online Panels? Problems of
Coverage and Selection and Possible Solutions. In: Das, M., Ester, P. & Kaczmirek, L. (eds.), Social
and Behavioral Research and the Internet, Routledge, New York-London, pp. 105-132.
Scherpenzeel, A. & Schouten, B. (2011), LISS Panel R-indicator: Representativity in Different Stages of
Recruitment and Participation of an Internet Panel. Paper presented at the 22nd International
Workshop on Household Survey Nonresponse, Bilbao, Spain
Schwarz, N., Knäuper, B., Oyserman, D. & Stich, C. (2008), The psychology of asking questions. In: E.D.
de Leeuw, J.J. Hox & D.A. Dillman (eds.), International Handbook of Survey Methodology.
Lawrence Erlbaum Associates, New York, USA, pp. 18-34.
Sikkel, D. (1983), Geheugeneffecten bij het rapporteren van huisartsencontacten. Statistisch Magazine
3, nr. 4, Netherlands Central Bureau of Statistics, pp. 61-64.

Stoop, I., Billiet, J., Koch, A. & Fitzgerald, R. (2010), Improving Survey Response, Lessons Learned from the
European Social Survey. John Wiley & Sons, Chichester, UK.
Tiemeijer, W.L. (2008), Wat 93.7 Procent van de Nederlanders moet weten over Opiniepeilingen. Aksant,
Amsterdam, The Netherlands.
Tourangeau, R. (2004), Survey Research and Societal Change. Annual Review of Psychology 55, pp. 775-
801.
Utts, J.M. (1999), Seeing Through Statistics. Duxbury Press, Belmont, California, USA.
Vavreck, L. & Rivers, D. (2008), The 2006 Cooperative Congressional Election Study. Journal of Elections,
Public Opinion and Parties 18, pp. 355–366.
Zaller, J.R. (1992), The Nature and Origins of Mass Opinion. Cambridge University Press, Cambridge, UK.

Index
A-number 53
Acquiescence 64, 115
Adjustment weighting 9, 99, 100, 113, 141, 149
Ambiguous question 25
Analogy principle 73
Analysis 121
Announcement letter 58
Arbitrary answer 64, 116
Attitude 5
Attrition 117
Auxiliary variable 20, 98, 114
Average 126

Bar chart 127


Boxplot 124, 136
Box-and-whisker plot 124

Calculator 46, 48
Categorical variable 18
Census 6, 11, 12
Check-all-that-apply question 32
Checkbox 32
Checklist for polls 143, 144
Chi-square statistic 134
Closed question 29
Complete enumeration 6, 11
Computer-assisted interviewing (CAI) 8, 57, 59
Computer-assisted personal interviewing (CAPI) 8, 60
Computer-assisted self-administered questionnaires (CASAQ) 8, 60
Computer-assisted self-interviewing (CASI) 8, 60
Computer-assisted telephone interviewing (CATI) 8, 59
Computer-assisted web interviewing (CAWI) 61, 105
Concentration 124
Confidence interval 13, 74, 79, 88
Confidence level 75
Consistency error 65, 66
Continuous variable 19
Correction weight 99, 100, 149
Correlation coefficient 129
Cramér's V 134, 135
Cross-sectional research 117
Cross-tabulation 134
Data collection 8, 57

Data editing 9, 59, 63, 121, 141


Date question 33
Discrete variable 19
Domesday Book 11

Don’t know 31, 64, 115, 122
Do-not-call register 55
Domain error 65
Donor imputation 69
Double question 28
Dynamic shading 34

E-mail poll 105


Election poll 111
Eligible 96
Endorsing the status quo 64, 115
Error checking 116
Estimation 9, 14, 71
Estimate 71, 85, 89, 141
Estimator 71, 85, 89
Eurobarometer 35
Executive summary 139
Exploratory analysis 121, 122

Face-to-face poll 7, 57, 58


Fact 5
Factual question 23
Fieldwork 57, 140
Filter question 31
First birthday procedure 51
Frequency table 128

Gemeentelijke Basisadministratie Persoonsgegevens (GBA) 52


Generalised regression estimation 103
Grid question 33
Grouped bar chart 131
Grouping 124

Histogram 125
Homogeneous group 101
HTML 105
Hypothetical question 28

Imputation 67, 121


Imputation of the mean 68
Imputation of the group mean 68
Indicator variable 19, 22
Inductive analysis 121
Informed consent 38
Interviewer-assisted 7, 57
Item non-response 93

Jitter 123, 135

Leading question 26

Learning effect 34
Linear weighting 103
Longitudinal research 117

Mail poll 8, 57
Margin of error 9, 13, 63, 74, 75, 79, 81, 88, 141, 147, 149
Matrix question 33
Measurement error 64, 106, 114, 121
Memory based model 24
Memory error 26, 65
Mentality group 114
Methodological account 139, 140
Missing answer 67
Mode 128
Mode of data collection 7
Multiplicative weighting 103

Negative question 28
No-contact 96, 112
Nominal variable 19
Non-differentiation 64, 115
Non-factual question 23
Non-response 6, 9, 63, 93, 106, 111, 122, 141, 149
Non-response bias 97, 112
Normal distribution 125
Not-able 96, 112
Numerical overview 136
Numerical question 32

One-dimensional scatterplot 123, 135


Online panel 116
Online poll 8, 57, 61
Online processing model 24
Open question 28
Opinion 5
Opt-in panel 117
Opinion poll 11, 14
Opt-in poll 109
Ordinal variable 19
Outlier 121, 124
Over-coverage 42

Panel conditioning 119


Panel refreshment 119
Pie chart 127, 133
Poll 5
Population characteristic 7, 17, 20, 21, 140
Population fraction 21
Population mean 21, 77, 89
Population percentage 21, 73, 85, 86

Population register
Population size 18, 50
Population total 21
Population variance 21
Postal Address File (PAF) 42, 53
Postcode file 54
Pre-selection shading 34
Precise estimator 71
Primacy effect 30, 64
Probability sampling 6
Profile poll 119
Propensity weighting 103
Psychographic variable 114
Public opinion 5
Publication 9
Purposive selection 13, 14

Qualitative variable 18, 21


Quantitative variable 19
Question text 24
Questionnaire 6, 7, 23, 140, 146
Questionnaire fatigue 38
Questionnaire testing 37
Quipu 12
Quipucamayoc 12
Quota sampling 6, 13, 14, 15

R 123
Radio buttons 29
Raking ratio estimation 103
Random digit dialling (RDD) 43, 59
Random imputation 68
Random imputation within groups 68
Random sampling 13, 15
Randomiser 45
Rating scale question 30
Recall question 26
Recency effect 30, 64
Recruitment non-response 117
Reference date 7, 17, 18
Reference survey 114
Refusal 96, 112
Regression line 130, 131
Reliable estimator 71
Representative Method 13
Representative sample 41
Representativity 13
Research report 139, 145
Respondent 7
Response order effect 64, 115

Response rate 94, 95, 98, 147
Routing 35, 116
Routing check 66
Routing error 66
Routing instruction 35
Rule-of-thumb interval 126

Sample 6, 43
Sample selection 44, 106, 146
Sample size 8, 50, 81, 83, 147
Sampling design 8, 140
Sampling frame 8, 41, 140
Sampling with replacement 47
Sampling without replacement 47
Satisficing 31, 57, 64, 115
Self-administered 7, 31, 57
Self-selection 6, 109, 117
Self-selection bias 110
Selecting the middle option 64, 115
Selection probability 84, 121
Selective non-response 93, 98
Sensitive question 27, 116
Show card 58
Simple random sample 8, 41, 47, 48, 73
Socially desirable answer 64, 116
Spreadsheet 46, 49, 123
Stacked bar chart 132
Standard deviation 126
Standard error of the estimator 74, 87
Starting point 50
Step length 50
Straight-lining 34, 64
Straw poll 14
Summary table 126
Survey 5
Synthetic answer 67
Systematic sample 8, 47, 49

Tailored Design Method 58


Target population 7, 17, 18, 140, 146
Target variable 19, 20
Telepanel 61
Telephone directory 42, 55
Telephone poll 7, 57, 58
Telescoping 26, 65
True value 5, 23
Two-dimensional scatterplot 128
Two-stage sample 8, 47, 51, 83

Unbiased estimator 71

Under-coverage 41, 63, 106, 117
Under-coverage bias 107, 108
Unit non-response 93

Validity 37
Variable 7, 17, 18, 140
Variance of the estimator 74, 78, 87, 91

Web poll 61
Weight 85, 89

