The Royal Statistical Society 2007 Examinations Solutions Higher Certificate (Modular Format) Data Collection and Interpretation

THE ROYAL STATISTICAL SOCIETY

2007 EXAMINATIONS SOLUTIONS

HIGHER CERTIFICATE
(MODULAR FORMAT)

MODULE 1
DATA COLLECTION AND INTERPRETATION

The Society provides these solutions to assist candidates preparing for the examinations in future years and for the information of any other persons using the examinations.

The solutions should NOT be seen as "model answers". Rather, they have been written out in considerable detail and are intended as learning aids.

Users of the solutions should always be aware that in many cases there are valid alternative methods. Also, in the many cases where discussion is called for, there may be other valid points that could be made.

While every care has been taken with the preparation of these solutions, the Society will not be responsible for any errors or omissions.

The Society will not enter into any correspondence in respect of these solutions.

Note. In accordance with the convention used in the Society's examination papers, the notation log denotes logarithm to base e. Logarithms to any other base are explicitly identified, e.g. $\log_{10}$.

© RSS 2007

Higher Certificate, Module 1, 2007. Question 1

(i) Dear Member,

We are carrying out a survey of all our members. We particularly want to find out why people join our organisation and what their special interests are. The views of all our members are important to us, so please take a few moments to complete this questionnaire.

Please be assured that your answers will be treated in the strictest confidence, and will remain completely anonymous when the results of the survey are reported. No-one will contact you as a result of answering the questionnaire and no other organisation will receive any information from it.

It will only take a few minutes to complete the questionnaire. Some questions simply need a tick in the appropriate box; some need you to circle the appropriate number. A few questions ask you to write a short answer in the space provided.

When you have completed the questionnaire please return it to us in the enclosed reply-paid envelope. Thank you for your help and please reply soon!

(ii) Explanation of rating in Q1a. Sections in a sensible order, personal questions last. All types of status covered in Q8.

(iii) Instructions to circle a number (Q1a) or tick a box (several times) should be repeated in each question. Responses in Q2a and 2b should have the same response boxes, possibly including "poor" and/or "don't know". The rating score 1–10 in Q3 needs fuller explanation. By the nature of the topic, Q4 is not easy to word clearly: some will use it a few times a year, others hardly at all, and a few might use it very often (it depends largely on the information given in it); perhaps give some numbers per year, such as "more than 20", "11–20", "5–10", "fewer than 5". Q7 should say "age in years", with the first box "under 25".

It would be useful to know when members joined, so as to evaluate the effect of any changes in aims or activities that may have occurred.

Besides rating the present scheme in Q3, it would be useful to go into more detail about specific items in it.
Higher Certificate, Module 1, 2007. Question 2

(i) By the end of mailing 3, there were 2099 + 701 + 483 = 3283 responses, which is 68.1%. The three mailings were worthwhile, because the percentage response to mailing 1 was only 43.5; after mailing 2 it was 58.1.

Refusals (only 2 to mailing 1) rapidly increased in further mailings, ending with a total of 168. The percentages of actual responses to mailings 2 and 3 were respectively 25.8 and 24.6, which is reasonable.

The second and third mailings were made to all who had not responded or whose questionnaires were undelivered. Non-delivery raises questions which are discussed in (iii) below. The final percentage undelivered at the end of the mailings was 100 × 132/4822 = 2.7.

At the end of the process there were still 1239/4822, or 25.7%, who had not responded at all, even though they had apparently received three requests.

(ii) The two main reasons are (1) to increase the size of the sample that is actually collected, so reducing the estimated variance of means etc calculated from it, and (2) to avoid bias. Bias may arise because those who respond immediately are often those who are most interested in making comments (good or bad) and not a fully representative selection from the whole population. Provided, as here, the follow-up response is reasonable, the results should be improved.

(iii) There is no indication that alternative methods of locating non-responders were used, e.g. telephone. Nor is there any indication whether any reasons for refusal to respond were given or asked for; in a postal survey, the respondents would in any case have had to contact the survey organisers to indicate a refusal.

Undelivered questionnaires were presumably sent to the same address a second (or third) time, which seems rather pointless. Attempts to locate these people could be made by printing a list of "lost contacts" in any magazine or publication the Gulf War veterans are likely to read. Commanding officers could contact any still in service.

These methods would not increase the cost of the survey very much. Some other possibilities would require extra resources, such as telephoning non-responders to either (1) encourage them to reply or (2) offer a telephone interview or a face-to-face interview. Besides extra cost, this would delay completion of the survey. Also, collecting information from this group by a different method from that used for the others could lead to different responses. Whether extra effort would be worthwhile depends on the nature of the survey.

Higher Certificate, Module 1, 2007. Question 3

Part (a)

(i)
Number of visits, $x_i$   Number of students, $f_i$   $f_i x_i$   $f_i x_i^2$
0                         16                          0           0
1                         21                          21          21
2                         29                          58          116
3                         19                          57          171
Total                     85                          136         308

Population size N = 987. Sample mean $\bar{x} = \sum f_i x_i / \sum f_i = 136/85 = 1.6$.

Hence estimate of total number of visits is $N\bar{x} = 987 \times 1.6 = 1579.2$.

Sample variance $s^2 = \frac{1}{84}\left(\sum f_i x_i^2 - \frac{(\sum f_i x_i)^2}{85}\right) = \frac{1}{84}\left(308 - \frac{136^2}{85}\right) = \frac{90.4}{84} = 1.076$.

The estimated variance of the estimated total is $N^2(s^2/n)$, where $n$ (= $\sum f_i$) = 85 is the sample size, i.e. it is $987^2 \times (1.076/85) = 12331.8$.

(ii) Estimated proportion = $(85 - 16)/85 = 0.8118$.

The estimated variance of this is $0.8118(1 - 0.8118)/85 = 0.001797$.

Expressed in terms of percentages, the estimate is 81.2% with estimated variance $100^2 \times 0.001797 = 17.97$.
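The arithmetic in Question 3, Part (a) can be checked with a short script. This is only an illustrative sketch: the function and variable names are not part of the original solution, and, like the solution, it applies no finite population correction. It carries unrounded intermediate values, so the variance of the total comes out near 12334 rather than the 12331.8 obtained from the rounded $s^2 = 1.076$, and the variance of the proportion may differ in the last quoted digit.

```python
# Frequency table from Question 3(a): number of visits -> number of students.
visits = {0: 16, 1: 21, 2: 29, 3: 19}
N = 987  # population size given in the solution


def summarise(freq):
    """Return (n, mean, sample variance) for a frequency table."""
    n = sum(freq.values())
    sum_fx = sum(x * f for x, f in freq.items())
    sum_fx2 = sum(x * x * f for x, f in freq.items())
    mean = sum_fx / n
    variance = (sum_fx2 - sum_fx ** 2 / n) / (n - 1)
    return n, mean, variance


n, mean, s2 = summarise(visits)
total_estimate = N * mean          # estimated total number of visits
total_variance = N ** 2 * s2 / n   # no finite population correction, as in the solution

# Proportion of students making at least one visit, with its estimated variance.
p_hat = (n - visits[0]) / n
p_variance = p_hat * (1 - p_hat) / n

print(round(total_estimate, 1), round(s2, 3), round(total_variance, 1))
print(round(p_hat, 4), round(p_variance, 6))
```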
Part (b)

A stratified random sample selected from library records takes considerable staff time; and posting a questionnaire will not lead to 100% response (see question 2). There is also the problem that there is no guarantee that the questionnaires returned will have been completed by the selected sample members; someone else in a household may have done so. Stratification does help to obtain male and female readers in the correct proportions of those registered, and to obtain estimates for each sex. If a number of questions are to be answered, there may be different patterns of response between the sexes for different questions, and it may be important to know about this. Occasionally the records of names might not make it clear which sex a person is (e.g. "Pat" could be male or female), while it might be that records do not indicate a first name (perhaps only an initial is used) and/or do not indicate the sex. Presumably any school students, not technically "adults", would be on a different register or would have ages recorded. Dates of the "last four weeks" should be specified clearly, although memory is not likely to be good on visits four weeks ago (or five weeks or more by the time the questionnaire is completed).

A systematic sample on one day would be easier to carry out on the spot by one or two people, and thus much cheaper. Forms would be given out for immediate completion. Having a box for completed questionnaires would preserve anonymity and so possibly improve accuracy. The same "memory" problem exists, and there will also be refusals (which may be more numerous among certain sections of the population, e.g. on the way home to a meal!). "Systematic" should mean every 10th, or 20th, or some convenient gap, but this will not be easy to implement at busy times or when a group of people all leave at the same time. Conversely, it could waste time at slack periods. Choosing just one day for the survey will exclude users whose regular weekly routine does not include a library visit that day. A major problem is that not all those leaving the library will be registered users.

Higher Certificate, Module 1, 2007. Question 4

(i) D (dissolution) data in order:

4.10, 4.10, 4.52, 4.52, 4.81, 5.09, 5.37, 5.51, 5.51, 5.94, 6.22, 6.22

There are 12 items. The median M is midway between the 6th and 7th, i.e. 5.23. The lower quartile q is midway between the 3rd and 4th, i.e. 4.52, and the upper quartile Q midway between 5.51 and 5.94, i.e. 5.73. [Other conventions for the detailed locations of the quartiles also exist.]

H (human) data in order:

0.84, 0.96, 0.96, 1.55, 2.20, 2.51, 2.52, 2.84, 3.92, 5.01

Here there are 10 items. By similar arguments we have q = 0.96, M = 2.355 and Q = 2.84.

(ii) The boxplots below are shown horizontally, but vertical plots are equally good. [Note. The limits of electronic reproduction may mean that the plots do not appear exactly accurately.]

[Figure: horizontal boxplots of the D and H data on a common axis marked from 0 to 6 (%).]
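The summary values for the D and H data in Question 4 can be reproduced with a few lines of Python. This is a sketch rather than part of the original solution: the `quartiles` helper is a hypothetical name, and it implements only the convention used in part (i), taking each quartile as the median of the lower or upper half of the ordered data. The same data give the means and standard deviations (divisor n − 1) quoted in part (iii).

```python
from statistics import mean, median, stdev

# Ordered data as listed in part (i).
D = [4.10, 4.10, 4.52, 4.52, 4.81, 5.09, 5.37, 5.51, 5.51, 5.94, 6.22, 6.22]
H = [0.84, 0.96, 0.96, 1.55, 2.20, 2.51, 2.52, 2.84, 3.92, 5.01]


def quartiles(values):
    """Lower quartile, median, upper quartile using medians of the two halves.

    For an odd number of values the middle value is left out of both halves
    (an assumption; both data sets here have an even number of values).
    """
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    upper = data[-half:]
    return median(lower), median(data), median(upper)


for name, data in (("D", D), ("H", H)):
    q, m, Q = quartiles(data)
    print(name, round(q, 3), round(m, 3), round(Q, 3),
          round(mean(data), 2), round(stdev(data), 3))
```

Other quartile conventions (for example those built into statistical packages) can give slightly different values for q and Q, as the note in part (i) points out.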

(iii) For D, sum = 61.91, n = 12, sum of squares = 325.7425.
Hence mean $\bar{x} = 5.16$ and standard deviation s = 0.759.

[Note. s is calculated as $\sqrt{\frac{1}{11}\left(325.7425 - \frac{61.91^2}{12}\right)}$.]

For H, sum = 23.31, n = 10, sum of squares = 70.9739.
Hence $\bar{x} = 2.33$ and s = 1.360.

(iv) The in vitro (D) figures are mostly higher than those for in vivo (H) and are much less widely spread. For each of D and H the mean is roughly equal to the median, but it appears that H is more skewed. This is due to H having two high values some distance away from the others, so its range is much larger than D's. For the same reason, the standard deviation of H is the higher one.

THE ROYAL STATISTICAL SOCIETY

2008 EXAMINATIONS SOLUTIONS

HIGHER CERTIFICATE
(MODULAR FORMAT)

MODULE 1
© RSS 2008
Higher Certificate, Module 1, 2008. Question 1

(i) Every survey should be aimed at a target population of individuals or organisations or groups whose characteristics are being studied. The sampling frame should be a complete list of items in the target population. Sampling frames can be imperfect in several ways, such as being out of date, having duplication or omissions or other inaccuracies, or not matching the target population. These are inter-related to some extent.

Every sampling frame is out of date as soon as it is published. The longer the time since a frame was drawn up, the greater the discrepancy between it and the target population. An out of date list is likely to include elements that no longer exist and exclude elements that have joined the population since the list was compiled; for example, in a list of buildings some might have been demolished and some new ones might have been built in the intervening period. Duplication occurs when the same element is listed more than once; for example a mail order firm's list of customers might include the same customer with first name and surname and also with initials and surname. Omissions occur when an element that should have been in the frame at the time it was drawn up is not listed; for example some entries might be missed when transferring clerical records to electronic form. Inaccuracies occur when details about elements are incorrect; for example a school in a sampling frame of schools might incorrectly state that it had a sixth form. A frame might also fail to match the target population because the elements do not match; for example the electoral register for an area is not a frame of adults as it excludes prisoners and a few other categories of people.

(ii) (a) Neither frame is a sample frame of households with school age children, so neither matches the target population. Similarly both frames could be out of date unless drawn up very recently.

There is a fairly good relationship between addresses and households, in that once non-private addresses have been eliminated from a list of addresses there will mostly be one household per address. However, a sample frame of addresses could contain addresses which no longer exist, or omit addresses corresponding to new buildings or new subdivisions of buildings into separate units. There could be duplication if a building which was formerly subdivided, giving rise to more than one address, is no longer subdivided. Some addresses will obviously be non-private, for example they might include the name of a shop; local knowledge may indicate whether there is also a residential space (maybe above or behind it), and may also be useful in identifying other such residential addresses. A sample of addresses could be drawn from those remaining on the list after an initial cleaning up and pruning. When the addresses are surveyed, questions will need to be asked as to how many households live at the address and whether the households contain children of school age. If there is more than one household with school-age children then one of these might be chosen according to a previously assigned random number to ensure that every household has the same chance of being selected.

Selecting a sample of children from a sample frame of school children could be adapted to produce a sample of households with school age children by obtaining the addresses where the children lived most of the time, and the name of a responsible adult at the address. Not all children live with parents and some children spend part of the week with one family and part with another. Others live in institutions.

A sample frame of school children would lead to duplication if it included two or more children from the same household. It would lead to omissions of households which had moved into the area, and it could include households which had moved away and those where the children were no longer of school age.

(ii) (b) Either choice is acceptable if reasonable justification is given. Using a sampling frame of addresses is likely to lead to more wastage in that it will include addresses with no households in them and many households with no children of school age. Most children selected from a sample frame of school children will be in households with school age children, but a possible difficulty with this frame is obtaining the addresses of the households.
Higher Certificate, Module 1, 2008. Question 2

(i) Cluster sampling would be an obvious choice of sampling method for the first stage of sampling. As lists of hospitals are available for each region, either a simple random sample of regions could be taken at the first stage, or a sample of regions with probabilities proportional to the number of hospitals in the region. At the second stage, hospitals in the selected regions could be stratified into teaching and non-teaching hospitals and a simple random sample of hospitals taken from each stratum.

The advantage of cluster sampling is that, if members of the research organisation need to travel to the hospitals or talk face to face with patients in connection with the survey, there will be less travel involved if selected hospitals are in a few regions only. A possible disadvantage is that if the patients in the different regions vary a great deal, as regards their perceptions of post-operative care and other characteristics, the selected regions are not necessarily going to represent all of this variability. For cluster sampling to work well, clusters need to be similar to one another so that any one cluster is representative of the whole. Another disadvantage is that estimators under cluster sampling tend to have high variability.

It is likely that teaching hospitals and non-teaching hospitals vary, both as regards the conditions for which patients are admitted and in the amount of time that they are able to spend interacting with patients. Patients' perceptions of post-operative care are therefore likely to reflect these differences. Having teaching and non-teaching hospitals as strata ensures that both types of hospital are in the sample. In addition it would be easy to obtain estimates for each stratum alone as well as for all hospitals combined. Stratified sampling works well when elements within each stratum are like one another but elements in the different strata are dissimilar. A disadvantage of stratified sampling is that elements have to be sorted into strata in order to obtain the separate estimates.

[Other reasonable two-stage schemes were acceptable in candidates' answers.]

(ii) The research organisation would need to know the way in which records are kept – for example in electronic form, on paper, on record cards, etc. The organisation would also need to know the order in which records are kept (alphabetical, geographical, date of admission, ward in which treated, etc) and whether it is possible to sort out those who have had operations and those who left hospital in any specified week before taking a sample. They would also need to know the approximate numbers of patients in groups – in the group of interest, and, if this group cannot be sampled directly, in the larger group or groups containing the patients of interest.

(iii)
1. Did the nursing staff check whether you were in pain when you came round from the operation?

   [ ] Yes   [ ] No   [ ] Cannot remember

2. Do you think that the nursing staff paid you sufficient attention in the 24 hours after your return to the ward?

   [ ] Yes   [ ] No   [ ] Don't know

3. If you were in pain at any time, were you offered pain relief medicine?

   [ ] Had no pain
   [ ] Was given pain relief whenever I was in pain
   [ ] Was given pain relief some of the time I was in pain
   [ ] Was not ever given pain relief when I was in pain

4. Were you visited by a doctor in the 24 hours after your return to the ward?

   [ ] Yes   [ ] No   [ ] Don't know

5. Did a doctor agree that you were fit to go home before you left the hospital?

   [ ] Yes   [ ] No   [ ] Don't know

6. Were you given any information or advice concerning after-effects of the operation when you left the hospital?

   [ ] I was given both information and advice
   [ ] I was given information but not advice
   [ ] I was given advice but not information
   [ ] I was given no information or advice
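The two-stage scheme described in part (i) of Question 2 can be sketched in code. Everything concrete below is an assumption made for illustration: the regions, hospital names and sample sizes are invented, the function name `two_stage_sample` is hypothetical, and the first stage uses the simple random sample of regions (the probability-proportional-to-size alternative is not shown).

```python
import random

# Hypothetical frame: region -> list of (hospital name, is_teaching).
# All names and counts are invented for illustration.
frame = {
    "Region A": [("A1", True), ("A2", False), ("A3", False), ("A4", False)],
    "Region B": [("B1", True), ("B2", True), ("B3", False), ("B4", False)],
    "Region C": [("C1", False), ("C2", False), ("C3", True)],
    "Region D": [("D1", True), ("D2", False), ("D3", False)],
}


def two_stage_sample(frame, n_regions, n_per_stratum, seed=None):
    """Stage 1: simple random sample of regions (the clusters).
    Stage 2: within each selected region, a simple random sample of
    hospitals from each stratum (teaching / non-teaching)."""
    rng = random.Random(seed)
    regions = rng.sample(sorted(frame), n_regions)
    selected = {}
    for region in regions:
        strata = {True: [], False: []}
        for name, is_teaching in frame[region]:
            strata[is_teaching].append(name)
        selected[region] = {
            ("teaching" if flag else "non-teaching"):
                rng.sample(names, min(n_per_stratum, len(names)))
            for flag, names in strata.items()
        }
    return selected


sample = two_stage_sample(frame, n_regions=2, n_per_stratum=1, seed=2008)
for region, hospitals in sample.items():
    print(region, hospitals)
```

Fixing the seed makes the selection reproducible, which is useful if the sample has to be documented or audited later.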
Higher Certificate, Module 1, 2008. Question 3

(i) The increase in the number of women in the population from 1996/7 to 2003/4 was 22913 – 22289 = 624, so the percentage increase in the population of women was 100 × 624/22289 = 2.8.

The corresponding increase in the number of men was 21399 – 20591 = 808, so the percentage increase in the population of men was 100 × 808/20591 = 3.9.

In both years there were more women than men.

The distributions of incomes of women and men are different from one another in each year, but the patterns of the distributions for the same sex for each year are very similar. In the case of women in 1996/7, 28% had incomes in the bottom quintile group as defined by the distribution for all adults in that year, and 53% had incomes in the bottom two quintile groups compared with 40% of all adults and 25% of men. In contrast 56% of men had incomes in the two highest income groups in 1996/7, compared with 25% of women. The two distributions are mirror images of one another. This shows very clearly that the total incomes of women were concentrated at the lower end of the income distribution. The percentages in the 3rd (the middle) quintile group were similar for both sexes, and close to the theoretical 20% in this group. The picture is very similar in 2003/4 with 52% of women having incomes in the lowest two quintile groups compared with 27% of men, and 54% of men having incomes in the highest two groups compared with 27% of women. Thus there was virtually no change between the two years to which the tables relate.

The mean total weekly individual incomes for both women and men increased over the period. That for women increased by £50 per week and that for men by £61 per week in absolute terms, that is a percentage increase of 28.2 for women, and 17.6 for men. It is appropriate to compare the means in this way as the means for both years are quoted in terms of 2003/4 prices. Men's mean weekly total income was 1.96 times that of women's in the earlier year, and 1.80 times women's in the later year. It appears that women are catching up with men. However, a mean can hide the actual picture. It could be that women who had the highest total incomes in the earlier year had a greater increase than women who had the lowest incomes in the earlier year, that is there could be inequalities as regards the changes in the distributions of incomes between the two years.

(ii) In general people are reluctant to give details of their incomes, perhaps fearing that doing so would mean they have to pay more tax. Income in itself is complicated as it might include, for example, benefits as well as salaries, and income from investments. Different sources of income are received at different intervals of time, for example salary might be paid monthly but interest on a building society account might be paid annually. The time period for which income is required needs to be specified. Income can be quoted as gross (before tax) or net (after tax). Tax adjustments are sometimes made in arrears. Designing a questionnaire to obtain accurate information about incomes is therefore complicated, whether it is to be completed with the help of an interviewer or as self-completion. A detailed sheet of instructions will be needed, especially for the case of self-completion. Finding all the information needed is time-consuming for respondents. The response rate may be rather low.
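The percentage increases in the population figures in part (i) of Question 3 follow one pattern, which can be written as a small helper. This is an illustrative sketch only; the function name `percentage_increase` is not from the original solution, and the counts are used in whatever units the question's tables quote.

```python
def percentage_increase(earlier, later):
    """Percentage change from the earlier figure to the later one."""
    return 100 * (later - earlier) / earlier


# Population counts quoted in part (i), in the units used in the tables.
women_1996, women_2003 = 22289, 22913
men_1996, men_2003 = 20591, 21399

women_change = percentage_increase(women_1996, women_2003)
men_change = percentage_increase(men_1996, men_2003)

print(round(women_change, 1), round(men_change, 1))
```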

Higher Certificate, Module 1, 2008. Question 4

Part (i)

(a) The mean is the total of all the observations divided by the total number of observations. It is a "sharing out" average. As there were 97 boys and the mean time they spent watching TV was 1.799 hours, in total they must have spent 174.5 hours watching. 1.799 hours is the number each would have watched if they had all watched for exactly the same amount of time, found as 174.5 divided by 97. The calculation is similar for the girls, whose mean watching time was 1.473 hours.

The median is a "half-way" measure in the sense that half of the observations are less than or equal to it and half are greater than or equal to it. It is the middle observation when an odd number of observations are ranked in order of magnitude, and is taken as half-way between the two middle observations when an even number of observations are ranked in order of magnitude. For both the boys and the girls the median number of hours, as given in the table, was 1.25. There were 97 boys so the middle observation is the 49th. It is possible to count the dots in the dot-plot; we can see that the 49th observation is one of the 6 at 1.25 hours. There were 92 girls, so we take the median as half-way between the 46th and 47th observations. Counting dots on the dot-plot confirms that both of these observations are among the 11 at 1.25 hours.

The mode is the value of highest frequency. The mode for boys is 0 hours which has a frequency of 15. More boys did not watch at all than the number who watched any specified number of hours. There are three modes for girls, each with a frequency of 11, at 0, 1.25 and 1.5 hours. More girls watched these numbers of hours than any other number of hours.

A complicating factor in the data is that the answers are given to the nearest quarter-hour. This is immediately apparent from the dot-plots, which show that the observations take only discrete values at the quarter-hours. While it was no doubt intended that a response of, say, "half an hour" was meant to mean anything between 22½ and 37½ minutes, it is likely that some respondents may have interpreted it as from 30 to 45 (exclusive) minutes while others may have interpreted it as 15 (exclusive) to 30 minutes. Further, the "0" entry cannot possibly mean a negative number of hours; presumably it was meant to mean anything from 0 to 7½ minutes, but we cannot be certain that every respondent made this interpretation. It is difficult to know how the calculations might be adjusted to try to take this into account. It may be better simply to take them at their face value of exact quarter-hours, which is what has been done in the dot-plots and in the computer calculations leading to the table.

(b) As discussed above, the responses to the variable concerned are restricted to discrete values. The ranges of the values are relatively small (0 to 6.25 for boys and 0 to 5.5 for girls). The dotplots clearly show the individual values, including the tails of the distributions. They also show that quite a few boys and girls had not watched TV after school on the day for which they responded to the question. Histograms would not bring out these details. In particular it is likely that a histogram would group together values greater than 3 or thereabouts, making it difficult to discern the maximum value.

(c)
[Figure: "Boxplot of number of hours watched"; vertical axis "Data" from 0 to 7, with one box each for boys and girls.]

[Note. This diagram was produced by Minitab. In the examination, candidates were not necessarily expected to show the outliers for the girls by separate * characters, though not recognising them as outliers might prejudice the discussion below.]

The box and whisker plots show that both distributions are of positive skewness, i.e. they have longish tails of high values. This is more so in the case of the distribution for boys than that for girls. The distribution for girls is more tight in the sense that the middle 50% between the first and third quartiles falls in a smaller range than the middle 50% of the boys' responses. In the distribution for girls, there are four values that are unusually high compared with the bulk of the observations. These count as "outliers". For the boys, there are several high values but they are well scattered and do not appear as outliers in the diagram as shown above [produced by Minitab].
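The counting arguments in Part (i)(a) of Question 4 (which ranked observation gives the median, and how the total follows from the mean) can be written out as a short sketch. The helper name `median_positions` is hypothetical, and only figures quoted in the solution are used, since the individual observations are not reproduced here.

```python
def median_positions(n):
    """1-based ranks of the observation(s) that define the median of n values."""
    if n % 2 == 1:
        return ((n + 1) // 2,)
    return (n // 2, n // 2 + 1)


boys_n, boys_mean = 97, 1.799   # values quoted in part (i)(a)
girls_n = 92

print(median_positions(boys_n))      # rank of the boys' median observation
print(median_positions(girls_n))     # ranks averaged for the girls' median
print(round(boys_n * boys_mean, 1))  # total hours watched by the boys
```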

Part (ii)

As use of drugs is dangerous, and drugs are expensive and difficult to obtain legally, children are not necessarily going to answer truthfully to questions about their use of drugs. It will be necessary to get their confidence and to assure them that their answers will be confidential and will not be revealed to anyone outside the survey organisation. They should also be told that it will not be possible to identify individuals in any reports of the survey. It might be sensible, when getting the permission of the school and parents/guardians to run the survey, to make clear that questions on drug use are to be asked. The survey and questionnaire might need approval by an ethics committee. Talking to classes of children about the survey could help persuade children to give honest answers. A plan as to what action to take if a child is suspected of drug abuse as a result of his or her survey responses should be made in advance of the survey. Randomised response techniques might be considered.

THE ROYAL STATISTICAL SOCIETY

2009 EXAMINATIONS SOLUTIONS

HIGHER CERTIFICATE

MODULE 1
© RSS 2009
Higher Certificate, Module 1, 2009. Question 1

(i) In open-ended questions the researcher might obtain unexpected answers or pick up extreme values. Also, the researcher can get opinions more easily than when a similar question is asked in closed form.

(ii) Reasons for preferring closed questions include [only three were required in the examination] that they are easier to code, easier and quicker for respondents to answer, easier to analyse, easier for interviewer to record answers.

(iii) 1. Who pays for your mobile phone calls? ................

2. How often do you use a mobile phone?
   [ ] Several times every day
   [ ] Once every one to two days
   [ ] Once or twice a week
   [ ] Less than once a week

3. How long do your calls usually last? ................

4. Why do you make calls on a mobile phone? Select as many responses as apply.
   [ ] To chat with my friends
   [ ] To contact my parents
   [ ] To call a taxi
   [ ] To obtain information e.g. the times of buses

5. Approximately what proportion of calls are ones to you rather than ones that you make? Select the response that is nearest.
   [ ] None
   [ ] About a quarter
   [ ] About a half
   [ ] About three-quarters
   [ ] All

6. Approximately what percentage of calls that you make are text calls? ................

Questions 1, 3 and 6 are open; 2, 4 and 5 are closed. Qu 4 can be criticised for not including an "Other" category; it can still be regarded as closed if it has an "Other" category, but "Other (please specify)" could be regarded as making it open.

Higher Certificate, Module 1, 2009. Question 2

Part (a)

(i) If the digits had been chosen at random then we would expect each digit to occur approximately 25×228/10 = 570 times. Clearly this is not the case. 0 is of very low frequency of occurrence. Digits 1 to 4 are chosen more frequently than would be expected, especially noticeably so for digit 3. Digits 5, 6, 8 and 9 are chosen noticeably less frequently than expected. Only digit 7 is near (actually it is very near) to the expected frequency of occurrence.

(ii)
Digit ($x$)   Frequency ($f$)   $fx$     $x^2$   $fx^2$
0             375               0        0       0
1             637               637      1       637
2             676               1352     4       2704
3             709               2127     9       6381
4             607               2428     16      9712
5             522               2610     25      13050
6             534               3204     36      19224
7             573               4011     49      28077
8             515               4120     64      32960
9             552               4968     81      44712
Total         5700              25457            157457

Sample mean = 25457/5700 = 4.466.

Sample variance = $\left(157457 - \frac{25457^2}{5700}\right)/5699 = 7.6790$.

(iii) The sample mean is near to the supposed population mean of 4.5, but the sample variance is distinctly smaller than the supposed population variance of 8.25. This is largely due to the low observed frequency of occurrence of 0.

(iv) Another property one might look for is whether each digit is followed by each of the ten digits approximately 1/10 of the time.

Part (b)

3-digit random numbers are required. One method which avoids some "wastage" of the generated random numbers (which might be important in a large simulation, for example) would be to number the elements of the population from 001 through to 350, and then again from 351 through to 700, so that each element of the population is allocated two numbers, x and x + 350. 3-digit random numbers are then generated; 000 does not lead to selection of any member of the population, nor does any number from 701 through to 999.

Any allocation giving either exactly one choice or exactly two choices to the 350 members of the population is an acceptable procedure.
Higher Certificate, Module 1, 2009. Question 3

Part (a)

Preliminary calculations:

                        Men                                 Women
Age group  Class mark   Frequency  Freq density  %          Frequency  Freq density  %
19–24      22           61         10.17         8.0        78         13.00         8.1
25–34      30           160        16.00         20.9       211        21.10         22.0
35–54      45           393        19.65         51.3       486        24.30         50.7
55–64      60           152        15.20         19.8       183        18.30         19.1

Note. As the age groups vary in width, the frequency density or % density should be used in frequency polygons or histograms. As the % distributions are almost identical, the two % density polygons would almost coincide if on the same diagram. The diagram below was produced as a joined up scatter diagram in Excel.

For men: $\sum fx = 32947$, $\sum fx^2 = 1516549$.
Sample mean = 32947/766 = 43.01.
Sample variance = $[1516549 - (32947^2/766)]/765 = 129.99$,
standard deviation = 11.40.

For women: $\sum fx = 40896$, $\sum fx^2 = 1870602$.
Sample mean = 40896/958 = 42.69.
Sample variance = $[1870602 - (40896^2/958)]/957 = 130.40$,
standard deviation = 11.42.

Median of men's ages = 35 + {(383 – 221)×20/393} = 43.24.

Median of women's ages = 35 + {(479 – 289)×20/486} = 42.82.

Report:

There are more women than men in the sample: out of the total of 1724, 44.4% are men and 55.6% are women.

The numbers of men in the three regions are fairly similar to one another, and this is also true of the numbers of women. This is shown in the bar chart. The relative proportions of men and women in the three regions are close to one another and also close to the overall percentages.

Of the current smokers, 43.0% are men and 57.0% are women.

Of the men, 30.8% are current smokers; of the women, 32.7% are.

The sample sizes in the regions are fairly similar to one another, each region contributing about one third of the overall sample; this is shown in the pie chart.

The distributions of ages of the men and women are very similar, as can be shown by frequency density diagrams. On average the men are slightly older than the women, by about one third of a year; the average age of men is 43.0 years and of women is 42.7 years. The pattern for the median ages is similar, that for men being 43.2 years and that for women 42.8 years. The spreads of the ages are approximately the same, with standard deviations for both men and women being 11.4 years (to one d.p.).

[Figure: bar chart, "Numbers of men and women in the regions".]
Region Total Men % men Women % women % for region Angle for 250
No. men
in region in region in sample pie chart 200
No. women
Scotland 574 248 43.2 326 56.8 33.3 120 150
and North
100
Central, 624 274 43.9 350 56.1 36.2 130
South 50
West and 0
Wales Scotland and Central, South London and South
London 526 244 46.4 282 53.6 30.5 110 North West and Wales East
and South
East
Total 1724 766 44.4 958 55.6
Solution continued on next page Solution continued on next page
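The grouped-data summaries used in this solution (mean, variance and standard
deviation from Σfx and Σfx², and the median by linear interpolation within the
35–54 group) can be checked with a short script. This is a minimal sketch: the
function names are illustrative, and the class boundaries and widths are
inferred from the class marks and frequency densities in the table rather than
stated in the solution.

```python
from math import sqrt

# Each row: (lower boundary, class width, class mark, men, women).
# Boundaries and widths are inferred from the table, e.g. the 19-24 group
# is taken to cover ages 19 up to 25 (width 6, class mark 22).
AGE_GROUPS = [
    (19, 6, 22, 61, 78),
    (25, 10, 30, 160, 211),
    (35, 20, 45, 393, 486),
    (55, 10, 60, 152, 183),
]


def grouped_summary(freqs):
    """Return (mean, variance, sd) using class marks and divisor n - 1."""
    marks = [row[2] for row in AGE_GROUPS]
    n = sum(freqs)
    sum_fx = sum(f * x for f, x in zip(freqs, marks))
    sum_fx2 = sum(f * x * x for f, x in zip(freqs, marks))
    mean = sum_fx / n
    variance = (sum_fx2 - sum_fx ** 2 / n) / (n - 1)
    return mean, variance, sqrt(variance)


def grouped_median(freqs):
    """Interpolate linearly within the group containing the n/2-th value."""
    half = sum(freqs) / 2
    cumulative = 0
    for (lower, width, _mark, _men, _women), f in zip(AGE_GROUPS, freqs):
        if cumulative + f >= half:
            return lower + (half - cumulative) * width / f
        cumulative += f
    raise ValueError("no frequencies supplied")


men = [row[3] for row in AGE_GROUPS]
women = [row[4] for row in AGE_GROUPS]

for label, freqs in (("men", men), ("women", women)):
    mean, variance, sd = grouped_summary(freqs)
    median = grouped_median(freqs)
    print(f"{label}: mean {mean:.2f}, variance {variance:.2f}, "
          f"sd {sd:.2f}, median {median:.2f}")
```

The printed values should agree with the figures quoted in the solution to two
decimal places.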

[Pie chart: Breakdown of the overall sample by region. Sectors: Scotland and North;
Central, South West and Wales; London and South East.]

[Frequency density diagram: Age distribution of men and women in the sample.
Horizontal axis: age (0 to 80). Series: men, women.]

Part (b)
Advantages of collecting dietary information by a diary as described include the
following [only four were required in the examination].
Layout should make it easy for respondents to complete.
Should be accurate if completed at time food/drink is consumed.
Need not rely on memory.
Could be fairly easy/cheap to design.
Does not usually involve interviewing costs.

Disadvantages include the following [only four were required in the examination].
Respondents might change eating habits as a result of keeping the diary.
No guarantee that entries are made at the time.
Tedious for respondents to complete.
Diary could get lost.
Low agreement to participate.
High dropout rate.
Could be difficult to compare different respondents.
Coding and analysis could be difficult/time consuming.

Higher Certificate, Module 1, 2009. Question 4

(i) Either a random sampling scheme or a quota sampling scheme would be
suitable, or a combination of both. For example, if a list of residents on the
estate is available, then taking a simple random sample from this would be
suitable. Alternatively, a random sample of addresses could be taken and
people living at these asked to participate. If the estate is large, some kind of
stratification by type of dwelling (such as flat, terrace, end terrace,
semi-detached house, detached house) might be worthwhile before sampling
residents. Quota sampling might be done by walking round the streets and
knocking on doors, once quotas had been established.

(ii) For example, questions for use in an interview might be as follows.

Do you buy food at the local shops on the estate? Occasionally/often/never
Do you buy food in the town? Occasionally/often/never
Do you use the internet for food shopping? Yes/no

How often do you shop for perishable food items?
Every day
Once or twice a week
Once every 7 to 10 days
Where do you usually shop for these?

How often do you shop for non-perishable food items?
Every day
Once or twice a week
Once every 7 to 10 days
Where do you usually shop for these?

Do you use a private car when shopping for food? Mostly/Sometimes/Never
If you don't use a private car for food shopping, how do you usually travel
back from the shops? On foot/by bicycle/by taxi/by public transport

(iii) Would want all tabulations broken down by sex, age group and household size.

Would want tables showing type of shop (corner shop on estate, in town, out of
town, internet, etc) by type of commodity.

For different types of commodity (e.g. food, household cleaning materials,
luxury goods) would want a breakdown of geographical location where they
usually bought such items by how often they bought them.

Would want tables showing distance travelled and mode of transport used.
2010 EXAMINATIONS SOLUTIONS
HIGHER CERTIFICATE
MODULE 1
logarithm to base e. Logarithms to any other base are explicitly identified, e.g. log10.
© RSS 2010

Higher Certificate, Module 1, 2010. Question 1
Part (a)
Q1. The question asks for a year but "boxes" are also given for day and month.
One obvious way to correct this is simply to remove all the boxes and allow
respondents merely to write in the year. Alternatively, the question could be
reworded as "Please state the date on which you left full-time education"; or,
as few people are likely to remember the exact date, "Please state the month
and year when you left full-time education", giving boxes for month and year
only.
Another point is that larger boxes are needed, as numbers have to be written in
them. This is less of an issue for the tick-boxes in Q2 and Q5 as it will not
really matter if a tick is larger than its box.

Q2. There are many other possible categories, including unemployed, home-maker
(or similar, but care should be taken that this title will be generally
understood) and retired. An "other" category should certainly be included too.
The definition of "employed part-time" is incomplete (is it < 40 hours per
week, per month, or what?). Further, as it is likely to be thought to mean per
week, it is misleading vis-à-vis "full-time", as many full-time employees work,
at least formally even if not in practice, less than 40 hours per week. It is
probably better not to give any definition, as people are likely to know
whether they are full-time or part-time.
There may also be a problem with people who (legitimately) have more than
one job. If this is thought to matter for the survey, a solution might be to add
"for your sole or main employment" to the request to state employment status.

Q3. This is a leading question due to the use of "really" which gives a strong
suggestion that the job is not enjoyed. Deleting this word improves the
question, but it remains leading as there is then a suggestion that the job is
enjoyed. A preferable wording would be simply "Do you enjoy your job or
not?"

Q4. This does not make clear to what period of time it relates or whether it is a
basic income. A better wording is "What is your annual gross income
excluding any additional benefits?"
A further point is that many respondents may genuinely not know it accurately
without checking records, and many will be unwilling to give it accurately
even if they do know it. A set of "boxes" to capture ranges of income is likely
to work better.
The point about sole or main employment (see Q2) may arise here too.

Q5. There is an overlap here in that 10 employees appears both in "10 or fewer"
and in "10-50". This can be avoided by making the first category fewer than
10. However, the categories in themselves are suitable only for those working
in places where the number of employees is relatively small. An open-ended
question asking for the approximate number of employees might work better.
A separate problem is that it is not clear whether "where you are employed"
refers to an entire organisation (which may be a worldwide multinational!) or
to some division of it such as a Department. This needs to be clarified, but the
nature of the clarification will depend on what the survey intends to measure.
A phrase such as "(This refers to the unit, department or division in which you
work rather than the entire organisation)" might be appropriate. It might be
better to provide this clarification via a footnote rather than a lengthy addition
to the question itself.
Editorially, more space should be left after each of the first two boxes so that it
is immediately clear that a box refers to the category on its left. Q2 is set out
better in this regard.

Q6. The direction of scoring is not stated so there would be no indication of
whether a low score is good or bad. This cannot be assumed to be obvious.
The question could be amended by adding "where 0 denotes no satisfaction
and 10 denotes perfectly satisfied".

Part (b)
(i) Sensitive questions are likely to be troublesome as respondents might feel
embarrassed by them and might not want to give honest answers – or, indeed,
to answer them at all. This may extend to a decision not to answer any part of
the questionnaire, even those parts that are not sensitive. Topics that are often
found to be sensitive are those to do with income, with sexual habits and
attitudes, and with activities that are either illegal or widely frowned on
without necessarily being illegal (eg use of "recreational" drugs).

(ii) Hypothetical questions are of limited worth as respondents cannot be sure
what they would actually do in a particular situation that has not occurred.
Examples of such questions include "What would you do if you won a million
pounds?" and "How would you react if your house was burnt down?"

Higher Certificate, Module 1, 2010. Question 2

The gains in height in cm of daughters of short mothers are 19.5, 21.4, 22.3, 20.0,
20.5 and 23.2.

The gains for daughters of tall mothers are 20.3, 26.3, 25.1, 25.3, 23.0 and 25.2.

The sample means and standard deviations (divisor n – 1) are as follows (these can be
taken routinely from a calculator, or obtained by standard formulae via Σx and Σx²).

Daughters of short mothers, age 6: mean 112.53, standard deviation 1.81.
Daughters of short mothers, age 10: mean 133.68, standard deviation 2.23.
Daughters of short mothers, gain in height: mean 21.15, standard deviation 1.42.
Daughters of tall mothers, age 6: mean 119.52, standard deviation 1.91.
Daughters of tall mothers, age 10: mean 143.72, standard deviation 2.41.
Daughters of tall mothers, gain in height: mean 24.20, standard deviation 2.19.
________________________________________

All measurements in this report are in cm. The report is based on data from two small
samples of girls, some with short mothers and some with tall mothers; each sample
was of size only 6. The girls' heights were measured at ages 6 and 10.

Perhaps the most noticeable feature of the data is that all the girls with short mothers
were shorter than all the girls with tall mothers both at age 6 and at age 10.

For girls with short mothers, the range of heights at age 6 was 110 to 114.5, a range of
4.5. At age 10, it was 130.5 to 136, a range of 5.5. The range of increases in heights
was 19.5 to 23.2, a range of 3.7. The average height at age 6 was 112.53 and the
standard deviation of heights was 1.81. At age 10, the average was 133.68 and the
standard deviation was 2.23. The average increase in height was 21.15 and the
standard deviation was 1.42.
Unsurprisingly, heights were all greater at age 10 than at age 6, by around 20 cm.
There was no overlap of the height ranges at the two ages. In addition, for these girls
there was more variability in heights at age 10 than at age 6, as shown both by the
increased range and by the standard deviations. At age 10 there was a noticeable gap
between the heights of the two shortest girls (130.5 and 131.4) and those of the others.
Such a gap was not apparent at age 6. The two shortest girls at age 10 had also been
shortest at age 6, but overall the rank orderings of heights were not consistent over the
two ages.

For girls with tall mothers, the range of heights at age 6 was 115.9 to 121.0, a range of
5.1 – a slightly wider range than for the girls with short mothers. At age 10, it was
140.7 to 146.5, a range of 5.8 – again slightly wider than for girls with short mothers,
but only a small difference in this case. The range of increases in heights was 20.3 to
26.3, a range of 6.0 – considerably more than for the girls with short mothers. The
average height at age 6 was 119.52 and the standard deviation of heights was 1.91. At
age 10, the average was 143.72 and the standard deviation was 2.41. The average
increase in height was 24.20 and the standard deviation was 2.19.

Again, heights were unsurprisingly all greater at age 10 than at age 6. The increases
tended to be more, and were more variable, than for the girls with short mothers.
There was no overlap of the height ranges at the two ages. For these girls there was
again more variability in heights at age 10 than at age 6, as shown both by the
increased range and by the standard deviations. It was again the case that two girls at
age 10 were noticeably shorter than the rest; and it also looked as though two girls at
age 10 were beginning to pull away from the rest in growing taller. There had also
been something of a gap between the shortest girl and the rest at age 6. This shortest
girl had not been much taller than the girls with short mothers, and remained among
the shorter girls of tall mothers at age 10. Again the overall rank orderings of heights
were not consistent over the two ages.

Higher Certificate, Module 2, 2010. Question 3

(i) In stratified sampling, the population is divided into homogeneous groups
called strata which, it is thought, might show consistent differences from each
other in respect of the characteristics being investigated in the survey. A
common example in human populations is to divide into males and females.
A sample, usually (simple) random, is taken from each stratum. The
advantages of this method are that variances of estimators for the entire
population are likely to be smaller than those obtained using other methods of
sampling, and also that every group is represented and information is available
about each group separately as well as about the population as a whole. A
disadvantage is that more work is commonly required in setting up the
sampling scheme, as it is (usually) necessary to know which stratum each
element of the population is in before the sample is taken.

In cluster sampling, the population is divided into heterogeneous groups called
clusters which, it is hoped, are themselves representative of the entire
population – in particular, that they capture the variation in the entire
population. A random sample of clusters is then taken to form the sample. A
common example in educational research is to divide a country into Local
Education Authority (or equivalent) areas which, it is hoped, replicate the
characteristics of the entire country. There might be further stages of sampling
within the clusters themselves (though it is quite common for only a few
clusters, perhaps only one, to be selected and then complete enumeration
carried out within the chosen cluster(s)). The advantages of this method are
that details of the elements to be sampled are needed only for the selected
clusters, and costs are generally less than with other methods, especially if
face-to-face interviews are used. A principal disadvantage of this method is
that variances of estimators are usually greater than with other methods.

In quota sampling, the population is notionally divided into strata and
interviewers are asked to find samples from each stratum. It is thus a
non-random method of sampling and analysis of the results cannot proceed in a
straightforward way using the usual probability arguments. Further, there is
no guarantee that elements will be correctly identified in respect of which
stratum they belong to. Whereas in stratified sampling the stratum
membership is known before selection, in quota sampling membership is
identified by sight or by asking questions as part of the interview. Advantages
are that the target sample sizes can be achieved as interviewing can continue
until quotas are filled, and (often) the speed with which the survey can be
completed. Also, in practice quota sampling is often the only realistic way of
carrying out a survey, especially if the strata are not well defined in advance,
particularly in respect of their sizes. Disadvantages are the reliance
on interviewers in filling the quotas correctly, a risk of interviewer bias, and
frequently that there are problems in filling quotas (especially in rarer groups)
as the survey draws to a close.
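The stratified sampling described in part (i), in which a simple random sample
is taken from each stratum, can be sketched as follows. This is an illustration
only: the population is made up, the function name is hypothetical, and
proportional allocation of the sample to strata is an assumption (the solution
does not say how stratum sample sizes should be chosen).

```python
import random


def stratified_sample(strata, total_sample_size, rng):
    """Take a simple random sample from every stratum.

    strata maps a stratum name to the list of its elements, so stratum
    membership must be known before sampling, as noted in the text.
    Sample sizes are proportional to stratum sizes (an assumption), with
    at least one element per stratum so that every group is represented.
    Because of rounding, the total may differ slightly from the target.
    """
    population_size = sum(len(members) for members in strata.values())
    sample = {}
    for name, members in strata.items():
        n_h = max(1, round(total_sample_size * len(members) / population_size))
        sample[name] = rng.sample(members, n_h)  # without replacement
    return sample


# Hypothetical population of 1000 people split into two strata.
population = {
    "male": [f"M{i}" for i in range(400)],
    "female": [f"F{i}" for i in range(600)],
}
sample = stratified_sample(population, 50, random.Random(1))
for name, members in sample.items():
    print(name, len(members))
```

Results can then be reported for each stratum separately as well as for the
population as a whole.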

(ii) The regions of the country can be considered to be clusters. The first stage
(cluster sampling) is to select one or more of these, presumably on the basis of
simple random sampling. In the selected regions, the urban and rural areas can
be considered to constitute two strata. The second stage (stratified sampling)
is to select, again presumably at random, a sample of areas from each of these
strata. In the last stage of sampling (quota sampling), carried out within the
selected urban and rural areas, quotas of say males and females in different
age groups might be defined, and interviewers asked to obtain samples of
specified size according to the quotas set.

(iii) A systematic sample of "about 50" from 506 is asked for. 506/50 = 10.12, so
we use an interval of 10. Choose a starting point at random in the first 10
names listed and then select every 10th name after that.
If the starting point is in the first 6 names in the list, a sample of size 51 will
result. Otherwise, the sample will be of size 50.

Systematic sampling is simple to describe and easy to carry out. It has
intuitive appeal. Users of the survey may readily be satisfied that this is a
good method of sampling.

However, while a systematic sample may have the appearance of being
random, and may in practice behave as though it is random, in fact it is not.
Only the starting point is chosen at random; once it has been chosen, all the
other members of the sample are fixed. The familiar theoretical results of
simple random sampling do not strictly apply.

Further, if there is some cyclic pattern (possibly unknown) in the way the list
has been drawn up, the members chosen may coincide with it. This would
happen here if, for some reason, every 10th person in the list is similar in some
way; the selected sample would then be too homogeneous and would not
capture all the variability in the population.

Theoretical results for systematic sampling are derived by considering it as a
form of cluster sampling. Using the present case as an example, the
population is divided into ten clusters: the first cluster consists of inhabitants
numbered 1, 11, 21, … in the list, the second cluster consists of inhabitants 2,
12, 22, …, and so on. One of these clusters is then chosen (at random) to be
the sample.

Higher Certificate, Module 2, 2010. Question 4

(i) Those who refuse to answer questions and those who are not contacted might
be very different as regards characteristics and attitudes from those who take
part in the survey.

As an example, people may refuse because they are short of time. But busy
people are likely to have different characteristics and attitudes compared with
people with spare time on their hands.

Another example is that people may refuse because they are not interested in
the topic of the survey. In this case, the people who do respond will tend to be
those who are interested in the topic, and their responses commonly tend to be
more extreme (one way or the other) than those of the population as a whole.

Assuming the interviewing is done at people's homes, those who are not
contacted tend to be those who are away from home a great deal, and here too
it is likely that characteristics and attitudes will be different from those who
are at home a great deal.

In the case of an outright refusal, reasons for the refusal might be sought. For
example, the refusal might be because the person did not have time
immediately but would agree to an interview at a later date. Alternatively the
person might agree to a short telephone interview or to complete a
self-completion questionnaire. Or the person might agree to answer a few basic
questions so that at least a partial response is obtained. Substitution of results
from those with similar characteristics might then be considered and this might
help reduce the bias.

In the case of non-contact, call-backs should be made at different times. If it is
possible to obtain, information from neighbours might help determine when
the person was likely to be in. A self-completion questionnaire might be left,
or if a telephone number is known the potential respondent might be
telephoned either to arrange an appointment or to conduct a telephone
interview. Simply leaving information about the survey and asking the
potential respondent to telephone an interviewer or the survey office might
also result in a contact, but this is less likely to be successful.

Training of interviewers can also help reduce refusals and non-contacts.
Interviewers need to be advised both about how to conduct themselves in
terms of appearance (dress, etc) and about how to approach respondents. They
also need to be advised about the best times to call to minimise non-contacts
and refusals.

(ii) 1000 people are selected to take part.
At the first attempt to make contact, 725 are at home but 52 refuse to take part.
So the response rate at this first contact is 673/1000 = 67.3%.

The second attempt to make contact consists of a call-back to the 275 people
who were missed at the first attempt. 30 of these are again missed and, of the
remaining 245, 49 refuse to take part. So the response rate at this second
contact is 196/275 = 71.3%.

Overall, 673 + 196 = 869 people take part. So the overall response rate is
869/1000 = 86.9%.

THE ROYAL STATISTICAL SOCIETY
HIGHER CERTIFICATE EXAMINATION
NEW MODULAR SCHEME
introduced from the examinations in 2007
MODULE 1
SPECIMEN PAPER A
AND SOLUTIONS
The time for the examination is 1½ hours. The paper contains four questions, of
which candidates are to attempt three. Each question carries 20 marks. An indicative
mark scheme is shown within the questions, by giving an outline of the marks
available for each part-question. The pass mark for the paper as a whole is 50%.
The solutions should not be seen as "model answers". Rather, they have been written
out in considerable detail and are intended as learning aids. For this reason, they do
not carry mark schemes. Please note that in many cases there are valid alternative
methods and that, in cases where discussion is called for, there may be other valid
points that could be made.
While every care has been taken with the preparation of the questions and solutions,
the Society will not be responsible for any errors or omissions.
The Society will not enter into any correspondence in respect of the questions or
solutions.
© RSS 2006
1. In a project designed to see whether testing procedures in different laboratories
give similar results, fatigue tests at three different strain levels (level 1 lowest,
level 3 highest) were carried out on samples from the same batch of a
composite material. The tests were carried out at 9 different laboratories
across Europe. The results (cycles to fatigue) from each laboratory and the
means of the samples at each strain level are shown in the table below.

laboratory   strain level 1 results   mean (1)   strain level 2 results   mean (2)   strain level 3 results   mean (3)
A   7335, 6882, 8353   7523   1336, 1693   1515   690, 735   712.5
B   3512, 4000, 4300   3937   977, 1152   1065   428, 460, 473   453.7
C   3822, 5558, 6910   5430   1516, 1607, 1650   1591   630, 675   652.5
D   1740, 1852   1796   447, 718, 935   700
E   4382, 6239, 8930   6517   1488, 1501, 1516   1502   674, 681, 781   712
F   4030, 4138, 4202   4123   1095, 1205, 1290   1197   465, 380, 395   413.3
G   5000, 5245, 5452   5232   1031, 1156, 1238   1142   380, 408, 454   414
H   2510, 2604, 2811, 2900, 2986   2762   738, 771, 864, 883   814   303, 325, 329   319
J   2247, 3800, 5267   3771   957, 1156, 1202   1105   225, 487   356

Two questions are of interest:
(1) Are there substantial differences between the measurements obtained
at different laboratories?
(2) Can a model be found to predict the cycles to fatigue at strain levels
other than those tested?

Write a preliminary report on your general conclusions about the questions
posed in (1) and (2), based on the tabulated data, and supported by suitable
graphical evidence. (You are advised to avoid spending excessive time on
detailed statistical analysis or repetitive calculations.)
Express your report in a way that makes it accessible to non-technical readers.
(20)

2. (i) What do you understand by the term simple random sampling?
Describe conditions under which it may not be a suitable sampling
procedure or where it would be desirable to combine it with some other
sampling method.
(6)

(ii) Choose three different types of non-sampling error, and briefly
describe circumstances in surveys that give rise to these errors.
(9)

(iii) Outline the main disadvantages of telephone surveys.
(5)

3. Part of a printed questionnaire, to be sent to a sample of households in a
Shoppers' Survey, is shown on the next page. It is intended that households
should return the completed questionnaire by post in a (free) reply envelope.

Comment on the strengths and weaknesses of this questionnaire. Suggest
alternative phrasing for those questions which you think could be improved.
(20)

The questionnaire for question 3 is shown on the next page
Questionnaire for question 3

NATIONAL SHOPPERS SURVEY
Instructions
1) This questionnaire should be completed by the main shopper in your household
2) Please give the answers that apply to you by putting an X in the appropriate boxes
3) Please return your completed survey in the reply-paid envelope

1. ABOUT YOU AND YOUR FAMILY
1. What is your date of birth? __ __ __ __ __ __ __ __
                               Day Month Year
2. Are you? 1 Married 2 Single
3. How many persons are living at home? ______
4. Which group best describes your combined household income?
1 Up to £5 000            4 £20 000 – £30 000
2 £5 000 – £10 000        5 £30 000 – £50 000
3 £10 000 – £20 000       6 £50 000 plus

2. YOUR SHOPPING
When doing your main shopping:
1. Which stores do you use for food and grocery shopping? (please specify) ________________
2. Why do you buy where you do?
1 Distance                4 Parking Facilities
2 Convenience             5 Prices
3 Quality of Products     6 Food Range
3. What do you spend on groceries?
1 Under £15               4 £35 – £44.99            7 £75 – £89.99
2 £15 – £24.99            5 £45 – £59.99            8 £90 – £104.99
3 £25 – £34.99            6 £60 – £74.99            9 £105+
4. How often have you bought the following products in the past 3 months?
Environmentally Friendly Products
1 Very Often 2 Somewhat often 3 Not too often 4 Never
Recycled Products
1 Very Often 2 Somewhat often 3 Not too often 4 Never
5. How many packets of breakfast cereals do you buy?
1 2 3 4
6. What breakfast cereal brands do you buy regularly? (please tick all that apply)
1 Kellogg
2 Nabisco
3 Jordan
4 Shops Own
5 Other (please list) ____________________________

4. (i) Explain the difference which can exist between the target population
and the study population in a survey. Define both of these terms and
give an example of a situation where a difference is likely to exist and
one where it should not exist.
Comment on the importance of obtaining a good sampling frame for a
survey, and how it might be obtained in your two examples.
(9)

(ii) Using the computer output of descriptive statistics given below,
explain how these can be used to summarise a set of data and how
subsets of the data may be compared. Suggest other information,
diagrams or tables that could have been produced from the raw data at
the same time as the descriptive statistics and which would help
interpretation of the data. Illustrate your answer with appropriate
diagrams. The data are wages ($ per hour) in three different sectors (A,
B, C) of employment in a region.
(11)
SOLUTIONS

Question 1
It is useful to have a graph showing the measurements made by each laboratory on
each strain (laboratories A – J are relabelled 1 – 9). Individual responses are plotted.

The substantial difference in mean levels between the three strains makes it hard to
present the graphical results on a convenient scale. However, several points emerge.

(1) Some laboratories have consistently lower readings than others. For example,
H has by some way the lowest mean throughout; A, C, E are high; B, J tend to be
low. Variability within laboratories is also very different; this can be seen especially
for strain 1, where C, E, J give very wide ranges while F, G, H do not. The basic
material used does not appear to have been so variable, because not all
within-laboratory variation is large; technical reasons in respect of resources of
equipment or people are a more likely explanation.

(2) Level 1 is the lowest strain level, and it shows much higher means and much
more variation than levels 2 and 3. There is clearly an inverse relationship between
cycles to fatigue and strain level. However, a model assuming constant variance
would not be suitable; transformations of the y (cycles) variable could be explored,
possibly log y. Prediction will be more accurate at higher strain levels, and should
only be attempted within the range of levels already tested; extrapolation below level
1 or above level 3 would be unwise.

Overall, the laboratories lack consistency. If the aim is to have a
laboratory-independent prediction, laboratory practice needs to be more consistent.
If this cannot be achieved, the model used needs to incorporate a laboratory effect.

Question 2
(i) Simple random sampling, for samples of size n from a population of size N, is
where every sample has the same probability of selection. (This probability is, of
course, 1/C(N, n), where C(N, n) = N!/[n!(N − n)!] is the number of possible
samples.) A consequence of this is that every individual in the target
population has the same probability of being selected for the sample.
If a population is not homogeneous as a whole, but can be split into groups each of
which is homogeneous within itself, it will be better to select randomly within each
group (i.e. stratified random sampling). This allows the different groups to be studied,
as well as increasing precision of overall estimates. Also, when a very large
population is to be sampled using, for example, a list of names, a systematic sample
can be much easier to organise and may be treated as random provided any trends or
cyclical patterns in the list are avoided.

(ii) (a) Errors in recording responses, due to poor training of enumerators or
interviewers, and/or to carelessness or misunderstanding of subjects' answers.
In a postal questionnaire, poor wording of questions may lead to respondents
not answering the question intended.
(b) Transfer errors when data are taken from forms and entered into a
processing system. Illegible answers could also occur on postal survey
questionnaires.
(c) Non-response to postal surveys or refusal to co-operate/be interviewed.
This may happen because of lack of interest in the topic being studied,
objection to the wording of the questions or the approach of the interviewer,
unwillingness to give time to answering, or simply being asked too often to
take part in a survey.
(d) Failure to locate individuals/units chosen to take part in a survey. This
may for example happen because of faulty lists, non-availability at the time an
interviewer calls, premises being empty because people have moved, or
different work and/or leisure habits so that individuals would need to be
contacted at unusual times not planned for in the survey.

(iii) Telephone surveys only contact people available and willing to answer at the
time of ringing, who have some interest in the topic under study, and whose numbers
are not ex-directory (if a telephone directory is used as a sample frame). High rates of
refusal to respond are likely from people who have been contacted frequently for such
surveys. Further, in some countries by no means everyone has a telephone.
Question 3
Section 1 : Question 2 needs more boxes, for widowed, divorced/separated, living
with partner.
Question 3 could be more specific and ask how many adults and how
many children are living at home.
Question 4 needs to say annual combined household income, and it
should be made clear where (e.g.) £10000 is to be entered by having
"£5000 – £9999", "£10000 – £19999", etc.
Section 2 : Question 2 should have an instruction to put a mark in all relevant
boxes.
Question 3 should say per week, to avoid relying on memory/
guesswork for longer periods. Also, the present question 3 could
possibly be called "main shopping", with another question for "top-up
shopping" with categories (say) "Under £10", "£10 – £19.99", "£20
plus".
Question 4 might refer to the previous month only (again to avoid
relying on memory/guesswork) and should give some numbers such as
"More than 5 times", "3 to 5 times", "Once or twice" and "Never (or
hardly ever)".
Question 5 should say per week and perhaps be explicitly restricted to
the last week. It should include a box for 0 (zero) and possibly one for
"More than 4".

Question 4
(i) An essential part of planning a survey is to decide exactly what population the
results should apply to – this is the target population – and what resources will
be needed to achieve this. If some parts of the target population are
particularly difficult or expensive to cover, resources of time, money and
personnel may not be sufficient to carry out a survey of adequate size. When a
part of the target population is omitted for practical reasons, and the survey is
carried out on the remainder, this remainder is the study population and is not
the same as the target population.
If the omitted part of the target population is likely to be rather different from
the study population, this must be made clear in a report; but if there is no
clear reason to expect that there will be a difference, the results could be used
as representing the whole population.
An agricultural survey of a crop grown by large-scale farmers and also by
smallholders in a region is made more expensive by the need to travel to each
chosen smallholding. But the two types of grower are likely to give different
results and so, even if smallholders only provide a relatively small part of the
crop yield, both groups must be studied.
On the other hand, if a crop is grown in a large region over which the climate
is reasonably uniform, it may be possible to set up administration for sampling
in only a few centres in the region and concentrate work near them.
A sampling frame needs to be up-to-date and complete so far as it possibly
can. Duplications or omissions should be avoided. In the first situation above,
a good list of smallholders for the whole region is required, as well as of major
growers. This could add to the basic costs. In the second situation, the lists
are only required for limited areas through the whole region.
(ii) Given only the descriptive statistics summary, not all the useful diagrams can
be drawn. (However the computer can be asked for dot-plots and stem and
leaf diagrams as part of the output.)
The most useful diagrams, without doing any statistical tests, are boxplots.
[Histograms could be asked for as part of the output, but B and C do not have
enough observations to make histograms very useful.]

[Figure: boxplots of wages ($ per hour) for groups A, B and C; horizontal axis from 0 to 50.]
C is skew to the left, A is somewhat skew to the right but appears to have at
least one very large outlier. All groups have similar central values, and A is
more dispersed than B or C. Apart from the longer right hand whisker, B is
fairly symmetrical. C is a small group, but there seems to be some
concentration of data just above the median.

HIGHER CERTIFICATE EXAMINATION
NEW MODULAR SCHEME
introduced from the examinations in 2007
MODULE 1
SPECIMEN PAPER B
AND SOLUTIONS
The time for the examination is 1½ hours. The paper contains four questions, of
which candidates are to attempt three. Each question carries 20 marks. An indicative
mark scheme is shown within the questions, by giving an outline of the marks
available for each part-question. The pass mark for the paper as a whole is 50%.
The solutions should not be seen as "model answers". Rather, they have been written
out in considerable detail and are intended as learning aids. For this reason, they do
not carry mark schemes. Please note that in many cases there are valid alternative
methods and that, in cases where discussion is called for, there may be other valid
points that could be made.
While every care has been taken with the preparation of the questions and solutions,
the Society will not be responsible for any errors or omissions.
The Society will not enter into any correspondence in respect of the questions or
solutions.
© RSS 2006
1. The data given below are the numbers (in thousands) of farms in each of the
50 states of the USA in 1987, given in order of increasing numbers of farms.

  1   1   2   3   3   4   4   6   7   8   8   9   9
 14  14  17  21  24  24  27  28  33  36  36  37  38
 39  42  46  49  50  50  52  57  57  61  70  71  73
 78  79  82  87  88  93  96  99 109 115 160

(i) Draw a stem and leaf diagram to display these data. Briefly discuss
whether the data appear to be symmetrical or skew.
(7)
(ii) Compute the median, mean, standard deviation and inter-quartile
range.
[Note: $\sum x = 2217$, $\sum x^2 = 164441$.]
(8)
(iii) State, with reasons, which measure of central location you consider
more appropriate for summarising these data. Interpret the results for
this measure and for the corresponding measure of variability (or
spread).
(5)

2. In a survey of British adults (those aged over 16), there were questions about
alcohol consumption. One of these questions related to their average weekly
alcohol consumption. The table below shows trends in the distribution of
average weekly alcohol consumption during the period 1992–1998.

Average Weekly alcohol consumption by sex and age: 1992 to 1998
Persons aged 16 and over
Mean number of units per week

Age             1992   1994   1996   1998

Men
16–24           19.1   17.4   20.3   23.6
25–44           18.2   17.5   17.6   16.5
45–64           15.6   15.5   15.6   17.3
65 and over      9.7   10.0   11.0   10.7
Total           15.9   15.4   16.0   16.4

Women
16–24            7.3    7.7    9.5   10.6
25–44            6.3    6.2    7.2    7.1
45–64            5.3    5.3    5.9    6.4
65 and over      2.7    3.2    3.5    3.3
Total            5.4    5.4    6.3    6.4

All persons
16–24           12.9   12.3   14.7   16.6
25–44           11.8   11.4   11.9   11.4
45–64           10.2   10.2   10.5   11.6
65 and over      5.6    6.0    6.8    6.5
Total           10.2   10.0   10.7   11.0

(i) Using the statistics given in the table, draw suitable diagrams to
illustrate
(a) the overall trend in alcohol consumption amongst men and
women during the period 1992–1998,
(b) similarities and differences between the 1998 age-specific
alcohol consumption for males and females.
(10)
(ii) Write a short report, suitable for publication in a serious newspaper,
based on the diagrams you have produced in part (i) and any other
aspects of the data you consider relevant.
(10)
3. (i) (a) Explain why non-response is a problem in social surveys.
(5)
(b) Suggest how non-response might be reduced in a mail survey
by appropriate follow-up procedures.
(5)
(ii) An interview survey of heads of household is to be undertaken and it
has been estimated that 600 completed interviews are needed.
(a) Comment on the following two strategies (A) and (B) for
achieving 600 completed interviews.
(A) Give the team of interviewers contact details of a large
sample of heads of household and give instructions that
interviewing should stop once 600 interviews have been
completed.
(B) Give the team of interviewers contact details of a
sample of 600 heads of household and give instructions that
only one attempt is to be made to interview each of these heads.
If no interview is obtained from m of these 600 heads of
household then give the team contact details of a further
sample of m heads of household and give instructions that
several attempts are to be made to interview the members of
this further sample.
(10)

4. A society has two grades of membership (Grade I and Grade II) and a
worldwide membership of about 7000 individuals. About 20% of the
members fall into Grade II, and about 75% of all members are concentrated
into three geographical areas (A, B, C), the rest being spread throughout the
rest of the world.
The society publishes a journal which is sent by post to all members, and it
wishes to carry out a survey to discover if members find the journal useful,
and what aspects of their subject they most wish to see covered in the journal.
The society is particularly anxious to discover whether members of both
grades find the journal useful.
The society keeps its membership list as a computer file, with one record for
each member. The records are stored in alphabetical order of members'
names, though the secretary is assured that separate lists could be provided for
the different grades of membership.
Consider five possible methods for selecting a sample of members to receive,
by post, a questionnaire for this purpose:
simple random sampling,
stratified random sampling,
quota sampling,
cluster sampling,
systematic sampling.
For each method, discuss
(i) whether it would be possible to use this method of sampling for this
purpose,
(ii) whether the method would be a good one to choose for the purpose.
(4 marks for each method)

SOLUTIONS

Question 1
(i) Ordered diagram: (stem unit 10000)
STEM
 0 | 1123344678899
 1 | 447
 2 | 14478
 3 | 366789
 4 | 269
 5 | 00277
 6 | 1
 7 | 01389
 8 | 278
 9 | 369
10 | 9
11 | 5
 …
16 | 0
There is considerable skewness, with a large number in stem 0 and a long tail
to the right. There are also gaps.

(ii) The median is between the 25th and 26th in order: $M = \frac{37 + 38}{2} = 37.5$.
The quartiles are at the 13th and 38th in order: lower quartile q = 9, upper
quartile Q = 71. [Other conventions are also acceptable for the quartiles.]
These are in thousands; so we have q = 9000, M = 37500, Q = 71000. The
inter-quartile range is then 62000.
$\bar{x} = \frac{2217}{50} = 44.34$ (thousand).
$s^2 = \frac{1}{49}\left(164441 - \frac{2217^2}{50}\right) = \frac{66139.22}{49} = 1349.78$; so $s = 36.74$ (thousand).

(iii) For reasons given above, the mean and standard deviation will not be good
measures of location and dispersion. The median, 37500, and inter-quartile
range, 62000 (or the semi-iqr 31000), would be preferred.
The middle 50% of the observations have range 62000. The lower 50% are
37500 or less.

Question 2
Part (i)
(a) [Figure: line graph, "Average units of alcohol per week". Vertical axis: Units (5 to 20). Horizontal axis: Year (1992, 1994, 1996, 1998). Lines: Men (overall), Women (overall).]
(b) [Figure: line graph, "Average units of alcohol per week". Vertical axis: Units (5 to 25). Horizontal axis: Age (16-24, 25-44, 45-64, 65+). Lines: Men (1998), Women (1998).]
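As a numerical companion to the two diagrams for Question 2, the following minimal Python sketch tabulates the men-to-women ratios behind them. The figures are copied from the table in Question 2; the variable and function names are illustrative assumptions, not part of the original solution.

```python
# Mean weekly units of alcohol in 1998 by age group (from the Question 2 table).
AGE_GROUPS = ["16-24", "25-44", "45-64", "65 and over"]
MEN_1998 = [23.6, 16.5, 17.3, 10.7]
WOMEN_1998 = [10.6, 7.1, 6.4, 3.3]

# "Total" rows for each survey year (from the same table).
YEARS = [1992, 1994, 1996, 1998]
MEN_TOTAL = [15.9, 15.4, 16.0, 16.4]
WOMEN_TOTAL = [5.4, 5.4, 6.3, 6.4]


def ratios(men, women):
    """Men-to-women ratio for each pair of figures, rounded to 2 d.p."""
    return [round(m / w, 2) for m, w in zip(men, women)]


# Ratio by age group in 1998 (diagram (b)) and by year overall (diagram (a)).
age_ratios = dict(zip(AGE_GROUPS, ratios(MEN_1998, WOMEN_1998)))
total_ratios = dict(zip(YEARS, ratios(MEN_TOTAL, WOMEN_TOTAL)))

# Change in the overall mean between the first and last survey years.
change_men = round(MEN_TOTAL[-1] - MEN_TOTAL[0], 1)
change_women = round(WOMEN_TOTAL[-1] - WOMEN_TOTAL[0], 1)

for group, ratio in age_ratios.items():
    print(f"{group:12s} men/women ratio in 1998: {ratio}")
print("Overall ratio by year:", total_ratios)
print("Change 1992 to 1998: men", change_men, "women", change_women)
```

Ratios of this kind may help when choosing which comparisons to mention in a written report; they are a supplement to the diagrams, not a substitute for them.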
Part (ii)
For overall consumption of alcohol, there was a slight increase over the time period as
a whole, but not a very regular pattern.
Both for men and for women, the 16–24 age group showed a distinct rise between the
first two and the last two data sets (i.e. between 1992/94 and 1996/98).
Both for men and for women, the 65 and over age group showed a fall over the last
two periods (i.e. from 1996 to 1998). This was also true of the 25–44 age groups, but
the other age groups showed quite substantial increases.
Overall, there is a general decrease in consumption with age, though this is largely
explained by markedly high 16–24 and low 65+ figures.
Overall, men drink two to three times as much as women.

Question 3
(i) (a) Non-respondents tend not to be a random part of the whole population,
but instead particular types of person are more likely to fail, or refuse,
to reply. Their responses, if known, would quite likely be different
from other parts of the population. Unless they are represented, the
results of the survey will not validly apply to the whole population.
Besides introducing this bias, intended sample size is reduced by non-
response and so precision suffers.
(b) Some possible procedures are as follows.
Send the questionnaire again, once or twice more
Send reminder letters (without the questionnaire)
Telephone people who have not responded
Visit those who have not responded, perhaps only a sample of them
These all require identification of non-responders, usually by means of
a number on the questionnaire which is kept separate from the answers
to preserve anonymity.
(ii) (a) Strategy A will lead to a sample consisting of those who are easiest to
locate, so even if the list is constructed in a properly random way the
actual members used will not have been selected at random from the
list. Any who refuse at first request will be ignored rather than any
attempt being made to persuade them. Since there is a team of
interviewers, the more efficient of these may carry out a higher
proportion of the 600 (more quickly). Or, alternatively, quickly
completed interviews may not have been done so thoroughly. Why
stop at 600 instead of attempting to get as many as possible of the
originally selected list?
Strategy B will also lead to willing and easily available heads of
household being selected, so that the first sample of (600 – m) could
suffer considerable bias. The second sample of m ought to be more
representative, but the quality of the final data will be affected by how
large m is (the larger the better to avoid bias). Office work is also
increased by this method.
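The bias described for both strategies can be illustrated with a simplified expected-value model. The sketch below is an assumption-laden illustration, not part of the original solution: the two groups ("easy" and "hard" to reach), their population shares, contact probabilities and mean answers are all hypothetical numbers chosen only to show the direction of the effect.

```python
def contact_prob(p_per_attempt, attempts):
    """Probability that a head of household is interviewed within `attempts` calls,
    assuming independent attempts with the same success probability."""
    return 1 - (1 - p_per_attempt) ** attempts


def respondent_mean(groups, attempts):
    """Expected mean answer among those actually interviewed.

    groups: list of (population share, per-attempt contact probability, mean answer).
    """
    weights = [share * contact_prob(p, attempts) for share, p, _ in groups]
    total = sum(weights)
    return sum(w * mean for w, (_, _, mean) in zip(weights, groups)) / total


# Hypothetical population: 60% easy to reach (mean answer 10),
# 40% hard to reach (mean answer 20).
GROUPS = [(0.6, 0.8, 10.0), (0.4, 0.3, 20.0)]

true_mean = sum(share * mean for share, _, mean in GROUPS)
one_attempt = respondent_mean(GROUPS, attempts=1)
five_attempts = respondent_mean(GROUPS, attempts=5)

print("True population mean:       ", true_mean)
print("Respondents, one attempt:   ", round(one_attempt, 2))
print("Respondents, five attempts: ", round(five_attempts, 2))
```

Under these assumed numbers, a single attempt over-represents the easy-to-reach group, and repeated attempts move the respondent mean closer to the population mean. This is the reasoning behind preferring persistent call-backs on the originally selected list.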
Question 4
Note that there are about 5600 members in Grade I, and 1400 in Grade II; also about
5250 in areas ABC and 1750 in the rest of the world.
Simple Random Sampling from the alphabetical list of members would be easy to
organise; questionnaires could be distributed separately or perhaps by including them
in the appropriate copies of the next issue of the journal (or in any other regular
publication such as a newsletter). It may not be a very good method because the
Grade II members are a small proportion, as are "rest of the world" ones. These
groups could be in danger of not being sampled very well.
Stratified Random Sampling would be less easy but far more satisfactory because not
only these smaller groups but also the A/B/C groups could be examined satisfactorily
according to likely variability, cost, proportion satisfied, as well as having appropriate
numbers from each group. Lists subdivided more than just by grade would be useful,
and modern data storage methods should make identifiers for subgroups easy to
provide.
Quota Sampling is totally infeasible. It would be very desirable to split into several
groups as suggested above, and if it were possible quota sampling would produce the
required sample numbers. As it is not possible, reminders to non-respondents would
be the only way of achieving reasonable sample sizes in subgroups.
Cluster Sampling is not feasible because there is no obvious way of splitting into
clusters, nor could enough of them be produced to make sampling from them a
reasonable process. There do not seem to be any theoretical grounds for wanting to
sample in clusters either.
Systematic Sampling from the original alphabetic list would be very easy, and
probably just as satisfactory as simple random sampling. However, there is a distinct
risk that some surnames would be especially associated with some areas, so that
stratification would be better. If the sample method is going to involve producing lists
in different groups to sample from, systematic sampling could be used instead of
random choice because it may be quicker.
Possible Groups would be Grades I, II, each split into A, B, C, "rest". A 'good'
method must compare Grades satisfactorily. This seems to be the most important
requirement in the specification.
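One way the stratified approach might be organised is sketched below in Python: proportional allocation across the two grades, followed by a systematic draw within each grade's list. The stratum sizes (5600 and 1400) come from the note above; the total sample size of 700, the seed, the member identifiers and the function names are hypothetical choices made for illustration. Stratification by area as well as grade would follow the same pattern if the cross-classified counts were available.

```python
import random


def proportional_allocation(stratum_sizes, n):
    """Sample size per stratum, proportional to stratum size.
    Rounding means the parts may not always sum exactly to n."""
    total = sum(stratum_sizes.values())
    return {name: round(n * size / total) for name, size in stratum_sizes.items()}


def systematic_sample(units, k, rng):
    """Take k units at equal intervals through the list from a random start."""
    step = len(units) / k
    start = rng.random() * step
    return [units[int(start + i * step)] for i in range(k)]


# Stratum sizes from the note above; sample size n = 700 is an assumption.
STRATA = {"Grade I": 5600, "Grade II": 1400}
rng = random.Random(2007)  # fixed seed so the draw can be repeated

allocation = proportional_allocation(STRATA, n=700)

# Hypothetical membership lists, one per grade, standing in for the computer file.
frames = {
    name: [f"{name}-{i:04d}" for i in range(size)]
    for name, size in STRATA.items()
}
sample = {
    name: systematic_sample(frames[name], allocation[name], rng)
    for name in STRATA
}

for name in STRATA:
    print(name, "sample size:", len(sample[name]))
```

Proportional allocation gives Grade II only a fifth of the sample. Since comparing the grades is the main requirement, a larger sampling fraction for Grade II could be substituted by replacing `allocation` with chosen sizes.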
