The Society provides these solutions to assist candidates preparing for the
examinations in future years and for the information of any other persons using the
examinations.
The solutions should NOT be seen as "model answers". Rather, they have been
written out in considerable detail and are intended as learning aids.
Users of the solutions should always be aware that in many cases there are valid
alternative methods. Also, in the many cases where discussion is called for, there may
be other valid points that could be made.
While every care has been taken with the preparation of these solutions, the Society
will not be responsible for any errors or omissions.
The Society will not enter into any correspondence in respect of these solutions.
Note. In accordance with the convention used in the Society's examination papers, the notation log denotes
logarithm to base e. Logarithms to any other base are explicitly identified, e.g. log10.
© RSS 2007
Question 1
(ii) Explanation of rating in Q1a. Sections in a sensible order, personal questions
last. All types of status covered in Q8.
(iii) Instructions to circle a number (Q1a) or tick a box (several times) should be
repeated in each question. Responses in Q2a and 2b should have the same
response boxes, possibly including "poor" and/or "don't know". The rating
score 1–10 in Q3 needs fuller explanation. By the nature of the topic, Q4 is
not easy to word clearly: some will use it a few times a year, others hardly
at all, and a few might use it very often (it depends largely on the information
given in it); perhaps give some numbers per year, such as "more than 20",
"11–20", "5–10", "fewer than 5". Q7 should say "age in years", with the first
box "under 25".
It would be useful to know when members joined, so as to evaluate the effect
of any changes in aims or activities that may have occurred.
Besides rating the present scheme in Q3, it would be useful to go into more
detail about specific items in it.
Higher Certificate, Module 1, 2007. Question 2
(i) By the end of mailing 3, there were 2099 + 701 + 483 = 3283 responses,
which is 68.1%. The three mailings were worthwhile, because the percentage
response to mailing 1 was only 43.5; after mailing 2 it was 58.1.
Refusals (only 2 to mailing 1) rapidly increased in further mailings, ending
with a total of 168. The percentages of actual responses to mailings 2 and 3
were respectively 25.8 and 24.6, which is reasonable.
The second and third mailings were made to all who had not responded or
whose questionnaires were undelivered. Non-delivery raises questions which
are discussed in (iii) below. The final percentage undelivered at the end of the
mailings was 100 × 132/4822 = 2.7.
At the end of the process there were still 1239/4822, or 25.7%, who had not
responded at all, even though they had apparently received three requests.
(ii) The two main reasons are (1) to increase the size of the sample that is actually
These methods would not increase the cost of the survey very much. Some
other possibilities would require extra resources, such as telephoning non-
responders to either (1) encourage them to reply or (2) offer a telephone
interview or a face-to-face interview. Besides extra cost, this would delay
completion of the survey. Also, collecting information from this group by a
different method from that used for the others could lead to different
responses. Whether extra effort would be worthwhile depends on the nature of
the survey.
Higher Certificate, Module 1, 2007. Question 3
Part (a)
(i)
Number of visits, x_i   Number of students, f_i   f_i x_i   f_i x_i^2
0                       16                        0         0
1                       21                        21        21
2                       29                        58        116
3                       19                        57        171
Total                   85                        136       308
Population size N = 987. Sample mean $\bar{x} = \sum f_i x_i / \sum f_i = 136/85 = 1.6$.
Hence estimate of total number of visits is $N\bar{x} = 987 \times 1.6 = 1579.2$.
Sample variance $s^2 = \frac{1}{84}\left(\sum f_i x_i^2 - \frac{(\sum f_i x_i)^2}{85}\right) = \frac{1}{84}\left(308 - \frac{136^2}{85}\right) = \frac{90.4}{84}$.
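The frequency-table arithmetic in Question 3(a)(i) can be checked with a short script. This is only a sketch: the variable names are illustrative, and the variance is assumed to use the usual divisor n − 1 (here 84).

```python
# Frequency table from Question 3(a)(i): number of visits x and number of students f.
visits = [0, 1, 2, 3]
students = [16, 21, 29, 19]
N = 987  # population size quoted in the solution

n = sum(students)                                            # total number of students sampled
sum_fx = sum(f * x for x, f in zip(visits, students))        # sum of f*x
sum_fx2 = sum(f * x * x for x, f in zip(visits, students))   # sum of f*x^2

mean = sum_fx / n
# Sample variance with divisor n - 1 (an assumption about the convention used).
variance = (sum_fx2 - sum_fx ** 2 / n) / (n - 1)
# Estimated total number of visits in the population.
total_estimate = N * mean

print(n, sum_fx, sum_fx2)
print(round(mean, 4), round(variance, 4), round(total_estimate, 1))
```

Working from the column totals rather than from the raw responses mirrors the hand calculation, so each intermediate value can be compared with the table.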
Part (b)
A stratified random sample selected from library records takes considerable staff time;
and posting a questionnaire will not lead to 100% response (see question 2). There is
also the problem that there is no guarantee that the questionnaires returned will have
been completed by the selected sample members; someone else in a household may
have done so. Stratification does help to obtain male and female readers in the correct
proportions of those registered, and to obtain estimates for each sex. If a number of
questions are to be answered, there may be different patterns of response between the
sexes for different questions, and it may be important to know about this.
Occasionally the records of names might not make it clear which sex a person is (e.g.
"Pat" could be male or female), while it might be that records do not indicate a first
name (perhaps only an initial is used) and/or do not indicate the sex. Presumably any
school students, not technically "adults", would be on a different register or would
have ages recorded. Dates of the "last four weeks" should be specified clearly,
although memory is not likely to be good on visits four weeks ago (or five weeks or
more by the time the questionnaire is completed).
A systematic sample on one day would be easier to carry out on the spot by one or
two people, and thus much cheaper. Forms would be given out for immediate
completion. Having a box for completed questionnaires would preserve anonymity
and so possibly improve accuracy. The same "memory" problem exists, and there
will also be refusals (which may be more numerous among certain sections of the
population, e.g. those on the way home to a meal!). "Systematic" should mean every
10th, or 20th, or some convenient gap, but this will not be easy to implement at busy
times or when a group of people all leave at the same time. Conversely, it could
waste time at slack periods. Choosing just one day for the survey will exclude users
whose regular weekly routine does not include a library visit that day. A major
problem is that not all those leaving the library will be registered users.
Higher Certificate, Module 1, 2007. Question 4
(i) D (dissolution) data in order:
4.10, 4.10, 4.52, 4.52, 4.81, 5.09, 5.37, 5.51, 5.51, 5.94, 6.22, 6.22
There are 12 items. The median M is midway between the 6th and 7th, i.e.
5.23. The lower quartile q is midway between the 3rd and 4th, i.e. 4.52, and
the upper quartile Q midway between 5.51 and 5.94, i.e. 5.73. [Other
conventions for the detailed locations of the quartiles also exist.]
H (human) data in order:
0.84, 0.96, 0.96, 1.55, 2.20, 2.51, 2.52, 2.84, 3.92, 5.01
Here there are 10 items. By similar arguments we have q = 0.96, M = 2.355
and Q = 2.84.
(8)
(ii) The boxplots below are shown horizontally, but vertical plots are equally
good. [Note. The limits of electronic reproduction may mean that the plots do
not appear exactly accurately.]
[Boxplots of the D and H data; horizontal axis in %, from 0 to 6.]
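The quartile locations used in Question 4(i) can be reproduced with a small helper. This is a sketch under one assumed convention (quartiles taken as the medians of the lower and upper halves of the ordered data); the function name is illustrative, and other conventions give slightly different values.

```python
from statistics import median

def quartiles_by_halves(values):
    """Return (q, M, Q): lower quartile, median and upper quartile.

    Assumed convention: q and Q are the medians of the lower and upper
    halves of the ordered data, leaving out the middle item when the
    number of items is odd.
    """
    data = sorted(values)
    half = len(data) // 2
    lower_half = data[:half]
    upper_half = data[len(data) - half:]
    return median(lower_half), median(data), median(upper_half)

# D (dissolution) and H (human) data as listed in the solution.
d = [4.10, 4.10, 4.52, 4.52, 4.81, 5.09, 5.37, 5.51, 5.51, 5.94, 6.22, 6.22]
h = [0.84, 0.96, 0.96, 1.55, 2.20, 2.51, 2.52, 2.84, 3.92, 5.01]

d_q, d_m, d_Q = quartiles_by_halves(d)
h_q, h_m, h_Q = quartiles_by_halves(h)
print("D:", d_q, d_m, d_Q)
print("H:", h_q, h_m, h_Q)
```

For the 12 D values this places q between the 3rd and 4th items and Q between the 9th and 10th, matching the positions described in the solution; for the 10 H values it picks the 3rd and 8th items.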
Hence mean $\bar{x}$ = 5.16 and standard deviation s = 0.759.
[Note. s is calculated as $\sqrt{\frac{1}{11}\left(325.7425 - \frac{61.91^2}{12}\right)}$.]
For H, sum = 23.31, n = 10, sum of squares = 70.9739.
(iv) The in vitro (D) figures are mostly higher than those for in vivo (H) and are
much less widely spread. For each of D and H the mean is roughly equal to
the median, but it appears that H is more skewed. This is due to H having two
high values some distance away from the others, so its range is much larger
than D's. For the same reason, the standard deviation of H is the higher one.
THE ROYAL STATISTICAL SOCIETY
2008 EXAMINATIONS SOLUTIONS
(MODULAR FORMAT)
MODULE 1
DATA COLLECTION AND INTERPRETATION
© RSS 2008
Higher Certificate, Module 1, 2008. Question 1
(i) Every survey should be aimed at a target population of individuals or
organisations or groups whose characteristics are being studied. The sampling
frame should be a complete list of items in the target population. Sampling
frames can be imperfect in several ways, such as being out of date, having
duplication or omissions or other inaccuracies, or not matching the target
population. These are inter-related to some extent.
Every sampling frame is out of date as soon as it is published. The longer the
time since a frame was drawn up, the greater the discrepancy between it and
the target population. An out of date list is likely to include elements that no
longer exist and exclude elements that have joined the population since the list
was compiled; for example, in a list of buildings some might have been
demolished and some new ones might have been built in the intervening
period. Duplication occurs when the same element is listed more than once;
for example a mail order firm's list of customers might include the same
customer with first name and surname and also with initials and surname.
Omissions occur when an element that should have been in the frame at the
time it was drawn up is not listed; for example some entries might be missed
when transferring clerical records to electronic form. Inaccuracies occur when
details about elements are incorrect; for example a school in a sampling frame
of schools might incorrectly state that it had a sixth form. A frame might also
fail to match the target population because the elements do not match; for
example the electoral register for an area is not a frame of adults as it excludes
prisoners and a few other categories of people.
(ii) (a) Neither frame is a sample frame of households with school age children, so
neither matches the target population. Similarly both frames could be out of
date unless drawn up very recently.
When the addresses are surveyed, questions will need to be asked as to how
many households live at the address and whether the households contain
children of school age. If there is more than one household with school-age
children then one of these might be chosen according to a previously assigned
random number to ensure that every household has the same chance of being
selected.
Selecting a sample of children from a sample frame of school children could
be adapted to produce a sample of households with school age children by
obtaining the addresses where the children lived most of the time, and the
name of a responsible adult at the address. Not all children live with parents
and some children spend part of the week with one family and part with
another. Others live in institutions.
A sample frame of school children would lead to duplication if it included two
or more children from the same household. It would lead to omissions of
households which had moved into the area, and it could include households
which had moved away and those where the children were no longer of school
age.
(ii) (b) Either choice is acceptable if reasonable justification is given. Using a
sampling frame of addresses is likely to lead to more wastage in that it will
include addresses with no households in them and many households with no
children of school age. Most children selected from a sample frame of school
children will be in households with school age children, but a possible
difficulty with this frame is obtaining the addresses of the households.
Higher Certificate, Module 1, 2008. Question 2
(i) Cluster sampling would be an obvious choice of sampling method for the first
stage of sampling. As lists of hospitals are available for each region, either a
simple random sample of regions could be taken at the first stage, or a sample
of regions with probabilities proportional to the number of hospitals in the
region. At the second stage, hospitals in the selected regions could be
stratified into teaching and non-teaching hospitals and a simple random
sample of hospitals taken from each stratum.
The advantage of cluster sampling is that, if members of the research
organisation need to travel to the hospitals or talk face to face with patients in
connection with the survey, there will be less travel involved if selected
hospitals are in a few regions only. A possible disadvantage is that if the
patients in the different regions vary a great deal, as regards their perceptions
of post-operative care and other characteristics, the selected regions are not
necessarily going to represent all of this variability. For cluster sampling to
work well, clusters need to be similar to one another so that any one cluster is
representative of the whole. Another disadvantage is that estimators under
cluster sampling tend to have high variability.
It is likely that teaching hospitals and non-teaching hospitals vary, both as
regards the conditions for which patients are admitted and in the amount of
time that they are able to spend interacting with patients. Patients' perceptions
of post-operative care are therefore likely to reflect these differences. Having
teaching and non-teaching hospitals as strata ensures that both types of
hospital are in the sample. In addition it would be easy to obtain estimates for
each stratum alone as well as for all hospitals combined. Stratified sampling
works well when elements within each stratum are like one another but
elements in the different strata are dissimilar. A disadvantage of stratified
sampling is that elements have to be sorted into strata in order to obtain the
separate estimates.
[Other reasonable two-stage schemes were acceptable in candidates'
answers.]
(iii)
1. Did the nursing staff check whether you were in pain when you came
round from the operation?
Yes    No    Cannot remember
2. Do you think that the nursing staff paid you sufficient attention in the
24 hours after your return to the ward?
Yes    No    Don't know
3. If you were in pain at any time, were you offered pain relief medicine?
Had no pain
Was given pain relief whenever I was in pain
Was given pain relief some of the time I was in pain
Was not ever given pain relief when I was in pain
4. Were you visited by a doctor in the 24 hours after your return to the
ward?
Yes    No    Don't know
5. Did a doctor agree that you were fit to go home before you left the
hospital?
Yes    No    Don't know
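The two-stage scheme described in part (i) of Question 2 (a simple random sample of regions, then a stratified sample of hospitals within the chosen regions) can be sketched as follows. The frame, the region and hospital names, and the sample sizes are all hypothetical and exist only to make the example runnable; the solution itself does not specify them.

```python
import random

def two_stage_sample(hospitals, n_regions, n_per_stratum, seed=None):
    """Two-stage sample from a frame of (name, region, is_teaching) tuples."""
    rng = random.Random(seed)
    regions = sorted({region for _, region, _ in hospitals})
    # Stage 1: simple random sample of regions (the clusters).
    chosen_regions = set(rng.sample(regions, n_regions))
    # Stage 2: stratify hospitals in the chosen regions by teaching status.
    strata = {True: [], False: []}
    for name, region, is_teaching in hospitals:
        if region in chosen_regions:
            strata[is_teaching].append(name)
    # Simple random sample of hospitals within each stratum.
    sample = {}
    for is_teaching, names in strata.items():
        label = "teaching" if is_teaching else "non-teaching"
        sample[label] = rng.sample(names, min(n_per_stratum, len(names)))
    return chosen_regions, sample

# Hypothetical frame: 6 regions with 5 hospitals each, the first 2 in each region teaching.
frame = [(f"H{r}{i}", f"Region{r}", i < 2) for r in range(6) for i in range(5)]
chosen, sample = two_stage_sample(frame, n_regions=3, n_per_stratum=2, seed=1)
print(sorted(chosen))
print(sample)
```

Selecting regions with probability proportional to the number of hospitals, the alternative mentioned in the solution, would need a weighted draw at stage 1 and is not shown here.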
The picture is very similar in 2003/4 with 52% of women having incomes in
the lowest two quintile groups compared with 27% of men, and 54% of men
having incomes in the highest two groups compared with 27% of women.
Thus there was virtually no change between the two years to which the tables
relate.
The mean total weekly individual incomes for both women and men increased
over the period. That for women increased by £50 per week and that for men
by £61 per week in absolute terms, that is a percentage increase of 28.2 for
women, and 17.6 for men. It is appropriate to compare the means in this way
as the means for both years are quoted in terms of 2003/4 prices. Men's mean
weekly total income was 1.96 times that of women in the earlier year, and
1.80 times women's in the later year. It appears that women are catching up
with men. However, a mean can hide the actual picture. It could be that
women who had the highest total incomes in the earlier year had a greater
increase than women who had the lowest incomes in the earlier year, that is
there could be inequalities as regards the changes in the distributions of
incomes between the two years.
possible to count the dots in the dot-plot; we can see that the 49th observation
is one of the 6 at 1.25 hours. There were 92 girls, so we take the median as
half-way between the 46th and 47th observations. Counting dots on the dot-
plot confirms that both of these observations are among the 11 at 1.25 hours.
The mode is the value of highest frequency. The mode for boys is 0 hours
which has a frequency of 15. More boys did not watch at all than the number
who watched any specified number of hours. There are three modes for girls,
each with a frequency of 11, at 0, 1.25 and 1.5 hours. More girls watched
these numbers of hours than any other number of hours.
A complicating factor in the data is that the answers are given to the nearest
quarter-hour. This is immediately apparent from the dot-plots, which show
that the observations take only discrete values at the quarter-hours. While it
was no doubt intended that a response of, say, "half an hour" was meant to
mean anything between 22½ and 37½ minutes, it is likely that some
respondents may have interpreted it as from 30 to 45 (exclusive) minutes
while others may have interpreted it as 15 (exclusive) to 30 minutes. Further,
the "0" entry cannot possibly mean a negative number of hours; presumably it
was meant to mean anything from 0 to 7½ minutes, but we cannot be certain
that every respondent made this interpretation. It is difficult to know how the
calculations might be adjusted to try to take this into account. It may be better
simply to take them at their face value of exact quarter-hours, which is what
has been done in the dot-plots and in the computer calculations leading to the
table.
[Figure: box and whisker plots for boys and girls; vertical axis labelled "Data".]
[Note. This diagram was produced by Minitab. In the examination,
candidates were not necessarily expected to show the outliers for the
girls by separate * characters, though not recognising them as
outliers might prejudice the discussion below.]
The box and whisker plots show that both distributions are of positive
skewness, i.e. they have longish tails of high values. This is more so in the
case of the distribution for boys than that for girls. The distribution for girls is
tighter in the sense that the middle 50% between the first and third
quartiles falls in a smaller range than the middle 50% of the boys' responses.
In the distribution for girls, there are four values that are unusually high
compared with the bulk of the observations. These count as "outliers". For
the boys, there are several high values but they are well scattered and do not
appear as outliers in the diagram as shown above [produced by Minitab].
As use of drugs is dangerous, and drugs are expensive and difficult to obtain legally,
children are not necessarily going to answer truthfully to questions about their use of
drugs. It will be necessary to get their confidence and to assure them that their
answers will be confidential and will not be revealed to anyone outside the survey
organisation. They should also be told that it will not be possible to identify
individuals in any reports of the survey. It might be sensible, when getting the
permission of the school and parents/guardians to run the survey, to make clear that
questions on drug use are to be asked. The survey and questionnaire might need
approval by an ethics committee. Talking to classes of children about the survey
could help persuade children to give honest answers. A plan as to what action to take
if a child is suspected of drug abuse as a result of his or her survey responses should
be made in advance of the survey. Randomised response techniques might be
considered.
THE ROYAL STATISTICAL SOCIETY
2009 EXAMINATIONS SOLUTIONS
HIGHER CERTIFICATE
MODULE 1
© RSS 2009
Higher Certificate, Module 1, 2009. Question 1
(i) In open-ended questions the researcher might obtain unexpected answers or
pick up extreme values. Also, the researcher can get opinions more easily than
when a similar question is asked in closed form.
(ii) Reasons for preferring closed questions include [only three were required in
the examination] that they are easier to code, easier and quicker for
respondents to answer, easier to analyse, easier for interviewer to record
answers.
(iii) 1. Who pays for your mobile phone calls?
....................
2. How often do you use a mobile phone?
Several times every day
Once every one to two days
Once or twice a week
Less than once a week
3. How long do your calls usually last?
4. Why do you make calls on a mobile phone? Select as many responses
as apply.
To chat with my friends
To contact my parents
To call a taxi
To obtain information e.g. the times of buses
5. Approximately what proportion of calls are ones to you rather than
ones that you make? Select the response that is nearest.
None
About a quarter
About a half
About three-quarters
All
6. Approximately what percentage of calls that you make are text calls?
....................
Questions 1, 3 and 6 are open; 2, 4 and 5 are closed. Qu 4 can be criticised
for not including an "Other" category; it can still be regarded as closed if it
has an "Other" category, but "Other (please specify)" could be regarded as
making it open.
Higher Certificate, Module 1, 2009. Question 2
Part (a)
(i) If the digits had been chosen at random then we would expect each digit to
occur approximately 25×228/10 = 570 times. Clearly this is not the case. 0 is
of very low frequency of occurrence. Digits 1 to 4 are chosen more frequently
than would be expected, especially noticeably so for digit 3. Digits 5, 6, 8
and 9 are chosen noticeably less frequently than expected. Only digit 7 is near
(actually it is very near) to the expected frequency of occurrence.
(ii)
Digit (x)   Frequency (f)   fx      x^2   f x^2
0           375             0       0     0
1           637             637     1     637
2           676             1352    4     2704
3           709             2127    9     6381
4           607             2428    16    9712
5           522             2610    25    13050
6           534             3204    36    19224
7           573             4011    49    28077
8           515             4120    64    32960
9           552             4968    81    44712
Total       5700            25457         157457
Sample mean = 25457/5700 = 4.466.
Sample variance = $\frac{157457 - 25457^2/5700}{5699} = 7.6790$.
(iii) The sample mean is near to the supposed population mean of 4.5, but the
sample variance is distinctly smaller than the supposed population variance of
8.25. This is largely due to the low observed frequency of occurrence of 0.
(iv) Another property one might look for is whether each digit is followed by each
of the ten digits approximately 1/10 of the time.
Part (b)
3-digit random numbers are required. One method which avoids some "wastage" of
the generated random numbers (which might be important in a large simulation, for
example) would be to number the elements of the population from 001 through to
350, and then again from 351 through to 700, so that each element of the population is
allocated two numbers, x and x + 350. 3-digit random numbers are then generated;
000 does not lead to selection of any member of the population, nor does any number
from 701 through to 999.
Any allocation giving either exactly one choice or exactly two choices to the 350
members of the population is an acceptable procedure.
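The allocation described in Question 2 Part (b), where each of the 350 population members is given the two numbers x and x + 350, can be sketched as below. The function names are illustrative, Python's pseudo-random generator stands in for a table of random digits, and the choice to ignore repeat selections (sampling without replacement) is an assumption, since the solution does not say how repeats are handled.

```python
import random

POPULATION_SIZE = 350

def member_for(number):
    """Map a 3-digit random number (0..999) to a member 1..350, or None."""
    if 1 <= number <= 2 * POPULATION_SIZE:
        # 001..350 select members 1..350; 351..700 select members 1..350 again.
        return (number - 1) % POPULATION_SIZE + 1
    # 000 and 701..999 do not select anyone.
    return None

def draw_sample(sample_size, seed=None):
    """Draw distinct members, skipping unused numbers and repeats (assumed rule)."""
    rng = random.Random(seed)
    chosen = []
    while len(chosen) < sample_size:
        member = member_for(rng.randint(0, 999))
        if member is not None and member not in chosen:
            chosen.append(member)
    return chosen

sample = draw_sample(10, seed=2009)
print(sample)
```

With this allocation 700 of the 1000 possible numbers are usable, compared with 350 if each member had a single number, which is the reduction in "wastage" the solution refers to.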
Higher Certificate, Module 1, 2009. Question 3
Part (a)
Preliminary calculations:
                        Men                              Women
Age group  Class mark   Frequency  Freq density  %       Frequency  Freq density  %
19–24      22           61         10.17         8.0     78         13.00         8.1
25–34      30           160        16.00         20.9    211        21.10         22.0
35–54      45           393        19.65         51.3    486        24.30         50.7
55–64      60           152        15.20         19.8    183        18.30         19.1
Note. As the age groups vary in width, the frequency density or % density should be
used in frequency polygons or histograms. As the % distributions are almost
identical, the two % density polygons would almost coincide if on the same diagram.
The diagram below was produced as a joined up scatter diagram in Excel.
For men: $\sum fx = 32947$, $\sum fx^2 = 1516549$.
Sample mean = 32947/766 = 43.01.
Sample variance = $[1516549 - (32947^2/766)]/765 = 129.99$,
standard deviation = 11.40.
Report:
There are more women than men in the sample: out of the total of 1724, 44.4% are
men and 55.6% are women.
The numbers of men in the three regions are fairly similar to one another, and this is
also true of the numbers of women. This is shown in the bar chart. The relative
proportions of men and women in the three regions are close to one another and also
close to the overall percentages.
Of the current smokers, 43.0% are men and 57.0% are women.
Of the men, 30.8% are current smokers; of the women, 32.7% are.
The sample sizes in the regions are fairly similar to one another, each region
contributing about one third of the overall sample; this is shown in the pie chart.
The distributions of ages of the men and women are very similar, as can be shown by
frequency density diagrams. On average the men are slightly older than the women,
by about one third of a year; the average age of men is 43.0 years and of women is
42.7 years. The pattern for the median ages is similar, that for men being 43.2 years
and that for women 42.8 years. The spreads of the ages are approximately the same,
with standard deviations for both men and women being 11.4 years (to one d.p.).
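The class marks, frequency densities and percentages for men in the Question 3 preliminary table can be reproduced with the sketch below. The class limits are an assumption inferred from the class marks: "19–24" is read as ages from 19 up to (but not including) 25, giving a width of 6 and a mark of 22, and similarly for the other groups.

```python
# Age classes for men as [lower, upper) limits in years (assumed interpretation).
classes = [(19, 25), (25, 35), (35, 55), (55, 65)]
men = [61, 160, 393, 152]  # frequencies from the table

total = sum(men)
rows = []
for (lower, upper), f in zip(classes, men):
    width = upper - lower
    rows.append({
        "class_mark": (lower + upper) / 2,  # midpoint of the class
        "density": f / width,               # frequency per year of age
        "percent": 100 * f / total,         # share of all men in the sample
    })

# Grouped-data mean, treating every man in a class as being at its class mark.
mean_age = sum(r["class_mark"] * f for r, f in zip(rows, men)) / total

for r in rows:
    print(r["class_mark"], round(r["density"], 2), round(r["percent"], 1))
print(round(mean_age, 2))
```

Dividing by class width is what makes the 35–54 group comparable with the narrower groups; plotting raw frequencies would overstate it simply because it spans 20 years.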
[Figure: frequency density polygons for men and women; horizontal axis "age", from 0 to 80; vertical axis from 0 to 20.]
Part (b)
Advantages of collecting dietary information by a diary as described include the
following [only four were required in the examination].
Layout should make it easy for respondents to complete.
Should be accurate if completed at time food/drink is consumed.
Need not rely on memory.
Could be fairly easy/cheap to design.
Does not usually involve interviewing costs.
Disadvantages include the following [only four were required in the examination].
Respondents might change eating habits as a result of keeping the diary.
No guarantee that entries are made at the time.
Tedious for respondents to complete.
Diary could get lost.
Low agreement to participate.
High dropout rate.
Could be difficult to compare different respondents.
Coding and analysis could be difficult/time consuming.
How often do you shop for perishable food items?
Every day
Once or twice a week
Once every 7 to 10 days
Where do you usually shop for these?
How often do you shop for non-perishable food items?
Every day
Once or twice a week
Once every 7 to 10 days
Where do you usually shop for these?
....................
Do you use a private car when shopping for food? Mostly/Sometimes/Never
If you don't use a private car for food shopping, how do you usually travel
back from the shops? On foot/by bicycle/by taxi/by public transport
(iii) Would want all tabulations broken down by sex, age group and household size.
Would want tables showing type of shop (corner shop on estate, in town, out of
town, internet, etc) by type of commodity.
For different types of commodity (e.g. food, household cleaning materials,
luxury goods) would want a breakdown of geographical location where they
usually bought such items by how often they bought them.
Would want tables showing distance travelled and mode of transport used.
THE ROYAL STATISTICAL SOCIETY
2010 EXAMINATIONS SOLUTIONS
HIGHER CERTIFICATE
MODULE 1
DATA COLLECTION AND INTERPRETATION
© RSS 2010
Higher Certificate, Module 1, 2010. Question 1
Part (a)
Q1. The question asks for a year but "boxes" are also given for day and month.
One obvious way to correct this is simply to remove all the boxes and allow
respondents merely to write in the year. Alternatively, the question could be
reworded as "Please state the date on which you left full-time education"; or,
as few people are likely to remember the exact date, "Please state the month
and year when you left full-time education", giving boxes for month and year
only.
Another point is that larger boxes are needed, as numbers have to be written in
them. This is less of an issue for the tick-boxes in Q2 and Q5 as it will not
really matter if a tick is larger than its box.
Q2. There are many other possible categories, including unemployed, home-maker
(or similar, but care should be taken that this title will be generally
understood) and retired. An "other" category should certainly be included too.
The definition of "employed part-time" is incomplete (is it < 40 hours per
week, per month, or what?). Further, as it is likely to be thought to mean per
week, it is misleading vis-à-vis "full-time", as many full-time employees work,
at least formally even if not in practice, less than 40 hours per week. It is
probably better not to give any definition, as people are likely to know
whether they are full-time or part-time.
There may also be a problem with people who (legitimately) have more than
one job. If this is thought to matter for the survey, a solution might be to add
"for your sole or main employment" to the request to state employment status.
Q3. This is a leading question due to the use of "really" which gives a strong
suggestion that the job is not enjoyed. Deleting this word improves the
question, but it remains leading as there is then a suggestion that the job is
enjoyed. A preferable wording would be simply "Do you enjoy your job or
not?"
Q4. This does not make clear to what period of time it relates or whether it is a
basic income. A better wording is "What is your annual gross income
excluding any additional benefits?"
A further point is that many respondents may genuinely not know it accurately
without checking records, and many will be unwilling to give it accurately
even if they do know it. A set of "boxes" to capture ranges of income is likely
to work better.
The point about sole or main employment (see Q2) may arise here too.
Q5. There is an overlap here in that 10 employees appears both in "10 or fewer"
and in "10-50". This can be avoided by making the first category fewer than
10. However, the categories in themselves are suitable only for those working
in places where the number of employees is relatively small. An open-ended
question asking for the approximate number of employees might work better.

A separate problem is that it is not clear whether "where you are employed"
refers to an entire organisation (which may be a worldwide multinational!) or
to some division of it such as a Department. This needs to be clarified, but the
nature of the clarification will depend on what the survey intends to measure.
A phrase such as "(This refers to the unit, department or division in which you
work rather than the entire organisation)" might be appropriate. It might be
better to provide this clarification via a footnote rather than a lengthy addition
to the question itself.

Editorially, more space should be left after each of the first two boxes so that it
is immediately clear that a box refers to the category on its left. Q2 is set out
better in this regard.

Q6. The direction of scoring is not stated so there would be no indication of
whether a low score is good or bad. This cannot be assumed to be obvious.
The question could be amended by adding "where 0 denotes no satisfaction
and 10 denotes perfectly satisfied".

Higher Certificate, Module 1, 2010. Question 2

The gains in height in cm of daughters of short mothers are 19.5, 21.4, 22.3, 20.0,
20.5 and 23.2.

The gains for daughters of tall mothers are 20.3, 26.3, 25.1, 25.3, 23.0 and 25.2.

The sample means and standard deviations (divisor n – 1) are as follows (these can be
taken routinely from a calculator, or obtained by standard formulae via Σx and Σx²).

Daughters of short mothers, age 6: mean 112.53, standard deviation 1.81.
Daughters of short mothers, age 10: mean 133.68, standard deviation 2.23.
Daughters of short mothers, gain in height: mean 21.15, standard deviation 1.42.
Daughters of tall mothers, age 6: mean 119.52, standard deviation 1.91.
Daughters of tall mothers, age 10: mean 143.72, standard deviation 2.41.
Daughters of tall mothers, gain in height: mean 24.20, standard deviation 2.19.
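As a check on the gain-in-height figures quoted above, the following is a minimal Python sketch that recomputes the means and standard deviations (divisor n – 1) in two ways: directly, and via the sums Σx and Σx². The function names `summarise` and `summarise_from_sums` are illustrative choices, not part of the original solution.

```python
import statistics

# Gains in height (cm) between ages 6 and 10, as listed in the solution.
short_gains = [19.5, 21.4, 22.3, 20.0, 20.5, 23.2]
tall_gains = [20.3, 26.3, 25.1, 25.3, 23.0, 25.2]


def summarise(values):
    """Return (mean, standard deviation with divisor n - 1)."""
    return statistics.mean(values), statistics.stdev(values)


def summarise_from_sums(values):
    """Same summary, obtained from the sums of x and x squared."""
    n = len(values)
    sum_x = sum(values)
    sum_x2 = sum(v * v for v in values)
    mean = sum_x / n
    variance = (sum_x2 - sum_x ** 2 / n) / (n - 1)
    return mean, variance ** 0.5


for label, gains in [("short mothers", short_gains), ("tall mothers", tall_gains)]:
    mean, sd = summarise(gains)
    print(f"Daughters of {label}: mean gain {mean:.2f}, sd {sd:.2f}")
```

Both routes should agree with each other to within rounding, which is a useful safeguard when values are keyed into a calculator by hand.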
Part (b)
________________________________________
Unsurprisingly, heights were all greater at age 10 than at age 6, by around 20 cm.
There was no overlap of the height ranges at the two ages. In addition, for these girls
there was more variability in heights at age 10 than at age 6, as shown both by the
increased range and by the standard deviations. At age 10 there was a noticeable gap
between the heights of the two shortest girls (130.5 and 131.4) and those of the others.
Such a gap was not apparent at age 6. The two shortest girls at age 10 had also been
shortest at age 6, but overall the rank orderings of heights were not consistent over the
two ages.

For girls with tall mothers, the range of heights at age 6 was 115.9 to 121.0, a range of
5.1 – a slightly wider range than for the girls with short mothers. At age 10, it was
140.7 to 146.5, a range of 5.8 – again slightly wider than for girls with short mothers,
but only a small difference in this case. The range of increases in heights was 20.3 to
26.3, a range of 6.0 – considerably more than for the girls with short mothers. The
average height at age 6 was 119.52 and the standard deviation of heights was 1.91. At
age 10, the average was 143.72 and the standard deviation was 2.41. The average
increase in height was 24.20 and the standard deviation was 2.19.

Again, heights were unsurprisingly all greater at age 10 than at age 6. The increases
tended to be more, and were more variable, than for the girls with short mothers.
There was no overlap of the height ranges at the two ages. For these girls there was
again more variability in heights at age 10 than at age 6, as shown both by the
increased range and by the standard deviations. It was again the case that two girls at
age 10 were noticeably shorter than the rest; and it also looked as though two girls at
age 10 were beginning to pull away from the rest in growing taller. There had also
been something of a gap between the shortest girl and the rest at age 6. This shortest
girl had not been much taller than the girls with short mothers, and remained among
the shorter girls of tall mothers at age 10. Again the overall rank orderings of heights
were not consistent over the two ages.

Higher Certificate, Module 2, 2010. Question 3

(i) In stratified sampling, the population is divided into homogeneous groups
called strata which, it is thought, might show consistent differences from each
other in respect of the characteristics being investigated in the survey. A
common example in human populations is to divide into males and females.
A sample, usually (simple) random, is taken from each stratum. The
advantages of this method are that variances of estimators for the entire
population are likely to be smaller than those obtained using other methods of
sampling, and also that every group is represented and information is available
about each group separately as well as about the population as a whole. A
disadvantage is that more work is commonly required in setting up the
sampling scheme, as it is (usually) necessary to know which stratum each
element of the population is in before the sample is taken.

In cluster sampling, the population is divided into heterogeneous groups called
clusters which, it is hoped, are themselves representative of the entire
population – in particular, that they capture the variation in the entire
population. A random sample of clusters is then taken to form the sample. A
common example in educational research is to divide a country into Local
Education Authority (or equivalent) areas which, it is hoped, replicate the
characteristics of the entire country. There might be further stages of sampling
within the clusters themselves (though it is quite common for only a few
clusters, perhaps only one, to be selected and then complete enumeration
carried out within the chosen cluster(s)). The advantages of this method are
that details of the elements to be sampled are needed only for the selected
clusters, and costs are generally less than with other methods, especially if
face-to-face interviews are used. A principal disadvantage of this method is
that variances of estimators are usually greater than with other methods.
In quota sampling, the population is notionally divided into strata and
interviewers are asked to find samples from each stratum. It is thus a non-
random method of sampling and analysis of the results cannot proceed in a
straightforward way using the usual probability arguments. Further, there is
no guarantee that elements will be correctly identified in respect of which
stratum they belong to. Whereas in stratified sampling the stratum
membership is known before selection, in quota sampling membership is
identified by sight or by asking questions as part of the interview. Advantages
are that the target sample sizes can be achieved as interviewing can continue
until quotas are filled, and (often) the speed with which the survey can be
completed. Also, in practice quota sampling is often the only realistic way of
carrying out a survey, especially if the strata are not well defined in advance,
particularly in respect of their sizes. Disadvantages are the reliance
on interviewers in filling the quotas correctly, a risk of interviewer bias, and
frequently that there are problems in filling quotas (especially in rarer groups)
as the survey draws to a close.
(iii) A systematic sample of "about 50" from 506 is asked for. 506/50 = 10.12, so
we use an interval of 10. Choose a starting point at random in the first 10
names listed and then select every 10th name after that.

If the starting point is in the first 6 names in the list, a sample of size 51 will
result. Otherwise, the sample will be of size 50.

Systematic sampling is simple to describe and easy to carry out. It has
intuitive appeal. Users of the survey may readily be satisfied that this is a
good method of sampling.

However, while a systematic sample may have the appearance of being
random, and may in practice behave as though it is random, in fact it is not.
Only the starting point is chosen at random; once it has been chosen, all the
other members of the sample are fixed. The familiar theoretical results of
simple random sampling do not strictly apply.

Further, if there is some cyclic pattern (possibly unknown) in the way the list
has been drawn up, the members chosen may coincide with it. This would
happen here if, for some reason, every 10th person in the list is similar in some
way; the selected sample would then be too homogeneous and would not
capture all the variability in the population.

Theoretical results for systematic sampling are derived by considering it as a
form of cluster sampling. Using the present case as an example, the
population is divided into ten clusters: the first cluster consists of inhabitants
numbered 1, 11, 21, … in the list, the second cluster consists of inhabitants 2,
12, 22, …, and so on. One of these clusters is then chosen (at random) to be
the sample.

Another example is that people may refuse because they are not interested in
the topic of the survey. In this case, the people who do respond will tend to be
those who are interested in the topic, and their responses commonly tend to be
more extreme (one way or the other) than those of the population as a whole.

Assuming the interviewing is done at people's homes, those who are not
contacted tend to be those who are away from home a great deal, and here too
it is likely that characteristics and attitudes will be different from those who
are at home a great deal.

In the case of an outright refusal, reasons for the refusal might be sought. For
example, the refusal might be because the person did not have time
immediately but would agree to an interview at a later date. Alternatively the
person might agree to a short telephone interview or to complete a self-
completion questionnaire. Or the person might agree to answer a few basic
questions so that at least a partial response is obtained. Substitution of results
from those with similar characteristics might then be considered and this might
help reduce the bias.

In the case of non-contact, call-backs should be made at different times. If it is
possible to obtain, information from neighbours might help determine when
the person was likely to be in. A self-completion questionnaire might be left,
or if a telephone number is known the potential respondent might be
telephoned either to arrange an appointment or to conduct a telephone
interview. Simply leaving information about the survey and asking the
potential respondent to telephone an interviewer or the survey office might
also result in a contact, but this is less likely to be successful.

Training of interviewers can also help reduce refusals and non-contacts.
Interviewers need to be advised both about how to conduct themselves in
terms of appearance (dress, etc) and about how to approach respondents. They
also need to be advised about the best times to call to minimise non-contacts
and refusals.
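Returning to the systematic sampling scheme for the list of 506 names with an interval of 10, the sample sizes and the "cluster" view can be illustrated with a short Python sketch. The function name `systematic_sample` and its parameters are hypothetical, chosen only for this illustration.

```python
import random


def systematic_sample(population_size, interval, start=None, rng=random):
    """Return the 1-based list positions picked by a systematic sample.

    If no start is supplied, one is drawn at random from the first
    `interval` positions, as described in the solution.
    """
    if start is None:
        start = rng.randint(1, interval)
    return list(range(start, population_size + 1, interval))


# Sample size obtained for each of the ten possible starting points.
sizes = {start: len(systematic_sample(506, 10, start)) for start in range(1, 11)}

# The ten possible samples are the ten "clusters" of the cluster-sampling view.
clusters = {start: systematic_sample(506, 10, start) for start in range(1, 11)}

for start, size in sizes.items():
    print(f"start {start:2d}: sample size {size}, first members {clusters[start][:3]}")
```

Only the starting point is random here; every other member of the sample is then fixed, which is why the usual simple random sampling results do not strictly apply.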
MODULE 1
SPECIMEN PAPER A
AND SOLUTIONS
The time for the examination is 1½ hours. The paper contains four questions, of
which candidates are to attempt three. Each question carries 20 marks. An indicative
mark scheme is shown within the questions, by giving an outline of the marks
available for each part-question. The pass mark for the paper as a whole is 50%.
The solutions should not be seen as "model answers". Rather, they have been written
out in considerable detail and are intended as learning aids. For this reason, they do
not carry mark schemes. Please note that in many cases there are valid alternative
methods and that, in cases where discussion is called for, there may be other valid
points that could be made.
While every care has been taken with the preparation of the questions and solutions,
the Society will not be responsible for any errors or omissions.
The Society will not enter into any correspondence in respect of the questions or
solutions.
© RSS 2006
1. In a project designed to see whether testing procedures in different laboratories
give similar results, fatigue tests at three different strain levels (level 1 lowest,
level 3 highest) were carried out on samples from the same batch of a
composite material. The tests were carried out at 9 different laboratories
across Europe. The results (cycles to fatigue) from each laboratory and the
means of the samples at each strain level are shown in the table below.

laboratory   strain level 1 results   mean (1)   strain level 2 results   mean (2)   strain level 3 results   mean (3)
A   7335, 6882, 8353   7523   1336, 1693   1515   690, 735   712.5
B   3512, 4000, 4300   3937   977, 1152   1065   428, 460, 473   453.7
C   3822, 5558, 6910   5430   1516, 1607, 1650   1591   630, 675   652.5
D   1740, 1852   1796   447, 718, 935   700
E   4382, 6239, 8930   6517   1488, 1501, 1516   1502   674, 681, 781   712
F   4030, 4138, 4202   4123   1095, 1205, 1290   1197   465, 380, 395   413.3
G   5000, 5245, 5452   5232   1031, 1156, 1238   1142   380, 408, 454   414
H   2510, 2604, 2811, 2900, 2986   2762   738, 771, 864, 883   814   303, 325, 329   319
J   2247, 3800, 5267   3771   957, 1156, 1202   1105   225, 487   356

2. (i) What do you understand by the term simple random sampling?
Describe conditions under which it may not be a suitable sampling
procedure or where it would be desirable to combine it with some other
sampling method.
(6)

(ii) Choose three different types of non-sampling error, and briefly
describe circumstances in surveys that give rise to these errors.
(9)

(iii) Outline the main disadvantages of telephone surveys.
(5)
1. Which stores do you use for food and grocery shopping? (please specify) ________________
4. How often have you bought the following products in the past 3 months?
Environmentally Friendly Products
1 Very Often 2 Somewhat often 3 Not too often 4 Never
Recycled Products
1 Very Often 2 Somewhat often 3 Not too often 4 Never
6. What breakfast cereal brands do you buy regularly? (please tick all that apply)
1 Kellogg
2 Nabisco
3 Jordan
4 Shops Own
5 Other (please list) ____________________________
SOLUTIONS

Question 1

It is useful to have a graph showing the measurements made by each laboratory on
each strain (laboratories A – J are relabelled 1 – 9). Individual responses are plotted.

Question 2

(i) Simple random sampling, for samples of size n from a population of size N, is
where every sample has the same probability of selection. (This probability is, of
course, $1/\binom{N}{n}$.) A consequence of this is that every individual in the target
population has the same probability of being selected for the sample.
If a population is not homogeneous as a whole, but can be split into groups each of
which is homogeneous within itself, it will be better to select randomly within each
group (i.e. stratified random sampling). This allows the different groups to be studied,
as well as increasing precision of overall estimates. Also, when a very large
population is to be sampled using, for example, a list of names, a systematic sample
can be much easier to organise and may be treated as random provided any trends or
cyclical patterns in the list are avoided.
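The definition of simple random sampling given above can be illustrated by brute-force enumeration on a very small population. The sketch below assumes a hypothetical population of N = 6 units and samples of size n = 3 (these values are not from the question); it lists every possible sample, each having probability 1/C(N, n), and confirms that every unit then has the same inclusion probability n/N.

```python
from itertools import combinations
from math import comb

# Hypothetical small population and sample size, for illustration only.
N, n = 6, 3
population = range(1, N + 1)

# Every possible sample of size n; under simple random sampling each one
# is selected with the same probability 1 / C(N, n).
samples = list(combinations(population, n))
prob_each_sample = 1 / comb(N, n)

# Inclusion probability of each unit: the share of equally likely samples
# that contain it. This should equal n / N for every unit.
inclusion = {
    unit: sum(unit in s for s in samples) / len(samples)
    for unit in population
}

print(f"number of possible samples: {len(samples)}")
print(f"probability of each sample: {prob_each_sample}")
print(f"inclusion probabilities: {inclusion}")
```

Enumeration is only practical for tiny populations; it is used here purely to make the equal-probability property visible.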
(b) Transfer errors when data are taken from forms and entered into a
processing system. Illegible answers could also occur on postal survey
questionnaires.
Section 1 : Question 2 needs more boxes, for widowed, divorced/separated, living
with partner.

Question 3 could be more specific and ask how many adults and how
many children are living at home.

Question 4 needs to say annual combined household income, and it
should be made clear where (e.g.) £10000 is to be entered by having
"£5000 – £9999", "£10000 – £19999", etc.

Section 2 : Question 2 should have an instruction to put a mark in all relevant
boxes.

Question 3 should say per week, to avoid relying on memory/
guesswork for longer periods. Also, the present question 3 could
possibly be called "main shopping", with another question for "top-up
shopping" with categories (say) "Under £10", "£10 – £19.99", "£20
plus".

Question 4 might refer to the previous month only (again to avoid
relying on memory/guesswork) and should give some numbers such as
"More than 5 times", "3 to 5 times", "Once or twice" and "Never (or
hardly ever)".

Question 5 should say per week and perhaps be explicitly restricted to
the last week. It should include a box for 0 (zero) and possibly one for
"More than 4".

(i) An essential part of planning a survey is to decide exactly what population the
results should apply to – this is the target population – and what resources will
be needed to achieve this. If some parts of the target population are
particularly difficult or expensive to cover, resources of time, money and
personnel may not be sufficient to carry out a survey of adequate size. When a
part of the target population is omitted for practical reasons, and the survey is
carried out on the remainder, this remainder is the study population and is not
the same as the target population.

If the omitted part of the target population is likely to be rather different from
the study population, this must be made clear in a report; but if there is no
clear reason to expect that there will be a difference, the results could be used
as representing the whole population.

An agricultural survey of a crop grown by large-scale farmers and also by
smallholders in a region is made more expensive by the need to travel to each
chosen smallholding. But the two types of grower are likely to give different
results and so, even if smallholders only provide a relatively small part of the
crop yield, both groups must be studied.

On the other hand, if a crop is grown in a large region over which the climate
is reasonably uniform, it may be possible to set up administration for sampling
in only a few centres in the region and concentrate work near them.

A sampling frame needs to be up-to-date and complete so far as it possibly
can. Duplications or omissions should be avoided. In the first situation above,
a good list of smallholders for the whole region is required, as well as of major
growers. This could add to the basic costs. In the second situation, the lists
are only required for limited areas through the whole region.
(ii) Given only the descriptive statistics summary, not all the useful diagrams can
be drawn. (However the computer can be asked for dot-plots and stem and
leaf diagrams as part of the output.)
The most useful diagrams, without doing any statistical tests, are boxplots.
[Histograms could be asked for as part of the output, but B and C do not have
enough observations to make histograms very useful.]
C is skew to the left, A is somewhat skew to the right but appears to have at
least one very large outlier. All groups have similar central values, and A is
more disperse than B or C. Apart from the longer right hand whisker, B is
fairly symmetrical. C is a small group, but there seems to be some
concentration of data just above the median.

MODULE 1
SPECIMEN PAPER B
AND SOLUTIONS
The time for the examination is 1½ hours. The paper contains four questions, of
which candidates are to attempt three. Each question carries 20 marks. An indicative
mark scheme is shown within the questions, by giving an outline of the marks
available for each part-question. The pass mark for the paper as a whole is 50%.
© RSS 2006
1. The data given below are the numbers (in thousands) of farms in each of the
50 states of the USA in 1987, given in order of increasing numbers of farms.

1 1 2 3 3 4 4 6 7 8 8 9 9
14 14 17 21 24 24 27 28 33 36 36 37 38
39 42 46 49 50 50 52 57 57 61 70 71 73
78 79 82 87 88 93 96 99 109 115 160

2. In a survey of British adults (those aged over 16), there were questions about
alcohol consumption. One of these questions related to their average weekly
alcohol consumption. The table below shows trends in the distribution of
average weekly alcohol consumption during the period 1992–1998.

Average Weekly alcohol consumption by sex and age: 1992 to 1998
Persons aged 16 and over
(i) Using the statistics given in the table, draw suitable diagrams to
illustrate
(i) whether it would be possible to use this method of sampling for this
purpose,
(ii) whether the method would be a good one to choose for the purpose.
STEM
 0  1123344678899
 1  447
 2  14478
 3  366789
 4  269
 5  00277
 6  1
 7  01389
 8  278
 9  369
10  9
11  5
…
16  0

There is considerable skewness, with a large number in stem 0 and a long tail
to the right. There are also gaps.

(ii) The median is between the 25th and 26th in order: $M = \frac{37 + 38}{2} = 37.5$.

The quartiles are at the 13th and 38th in order: lower quartile q = 9, upper
quartile Q = 71. [Other conventions are also acceptable for the quartiles.]

$s^2 = \frac{1}{49}\left(164441 - \frac{2217^2}{50}\right) = \frac{66139.22}{49} = 1349.78$; so $s = 36.74$ (thousand).

(iii) For reasons given above, the mean and standard deviation will not be good
measures of location and dispersion. The median, 37500, and inter-quartile
range, 62000 (or the semi-iqr 31000), would be preferred.

The middle 50% of the observations have range 62000. The lower 50% are
37500 or less.

[Diagram: lines for Men (overall) and Women (overall) plotted against Year (1992, 1994, 1996, 1998).]

(The limits of electronic reproduction may make the lines in this diagram and the next
appear somewhat ragged.)

(b) [Diagram: Average units of alcohol per week; vertical axis Units, horizontal axis Age (16-24, 25-44, 45-64, 65+); includes a line labelled Women (1998).]
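The summary figures quoted above for the farm data (median, quartiles at the 13th and 38th ordered values, and the variance with divisor n – 1) can be rechecked with a short Python sketch. The variable names are illustrative; the quartile convention is the one stated in the solution, and other conventions would give slightly different values.

```python
import statistics

# Numbers of farms (thousands) in each of the 50 states, from the question.
farms = [
    1, 1, 2, 3, 3, 4, 4, 6, 7, 8, 8, 9, 9,
    14, 14, 17, 21, 24, 24, 27, 28, 33, 36, 36, 37, 38,
    39, 42, 46, 49, 50, 50, 52, 57, 57, 61, 70, 71, 73,
    78, 79, 82, 87, 88, 93, 96, 99, 109, 115, 160,
]

ordered = sorted(farms)
n = len(ordered)

# Median: average of the 25th and 26th ordered values when n = 50.
median = statistics.median(ordered)

# Quartiles by the convention used in the solution (13th and 38th in order).
lower_q = ordered[13 - 1]
upper_q = ordered[38 - 1]
iqr = upper_q - lower_q

# Sample variance (divisor n - 1) via the sums of x and x squared.
sum_x = sum(ordered)
sum_x2 = sum(x * x for x in ordered)
variance = (sum_x2 - sum_x ** 2 / n) / (n - 1)
sd = variance ** 0.5

print(f"n = {n}, sum = {sum_x}, sum of squares = {sum_x2}")
print(f"median = {median}, quartiles = {lower_q}, {upper_q}, IQR = {iqr}")
print(f"variance = {variance:.2f}, sd = {sd:.2f} (thousand)")
```

Because the distribution is strongly skewed, the median and inter-quartile range are the more informative pair, as the solution notes.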
Part (ii)

For overall consumption of alcohol, there was a slight increase over the time period as
a whole, but not a very regular pattern.

Both for men and for women, the 16–24 age group showed a distinct rise between the
first two and the last two data sets (i.e. between 1992/94 and 1996/98).

Both for men and for women, the 65 and over age group showed a fall over the last
two periods (i.e. from 1996 to 1998). This was also true of the 25–44 age groups, but
the other age groups showed quite substantial increases.

Overall, there is a general decrease in consumption with age, though this is largely
explained by markedly high 16–24 and low 65+ figures.

Overall, men drink two to three times as much as women.

Question 3

(i) (a) Non-respondents tend not to be a random part of the whole population,
but instead particular types of person are more likely to fail, or refuse,
to reply. Their responses, if known, would quite likely be different
from other parts of the population. Unless they are represented, the
results of the survey will not validly apply to the whole population.
Besides introducing this bias, intended sample size is reduced by non-
response and so precision suffers.

(b) Some possible procedures are as follows.

Send the questionnaire again, once or twice more
Send reminder letters (without the questionnaire)
Telephone people who have not responded
Visit those who have not responded, perhaps only a sample of them
(ii) (a) Strategy A will lead to a sample consisting of those who are easiest to
locate, so even if the list is constructed in a properly random way the
actual members used will not have been selected at random from the
list. Any who refuse at first request will be ignored rather than any
attempt being made to persuade them. Since there is a team of
interviewers, the more efficient of these may carry out a higher
proportion of the 600 (more quickly). Or, alternatively, quickly
completed interviews may not have been done so thoroughly. Why
stop at 600 instead of attempting to get as many as possible of the
originally selected list?
Note that there are about 5600 members in Grade I, and 1400 in Grade II; also about
5250 in areas ABC and 1750 in the rest of the world.
Simple Random Sampling from the alphabetical list of members would be easy to
organise; questionnaires could be distributed separately or perhaps by including them
in the appropriate copies of the next issue of the journal (or in any other regular
publication such as a newsletter). It may not be a very good method because the
Grade II members are a small proportion, as are "rest of the world" ones. These
groups could be in danger of not being sampled very well.
Stratified Random Sampling would be less easy but far more satisfactory because not
only these smaller groups but also the A/B/C groups could be examined satisfactorily
according to likely variability, cost, proportion satisfied, as well as having appropriate
numbers from each group. Lists subdivided more than just by grade would be useful,
and modern data storage methods should make identifiers for subgroups easy to
provide.
Quota Sampling is totally infeasible. It would be very desirable to split into several
groups as suggested above, and if it were possible quota sampling would produce the
required sample numbers. As it is not possible, reminders to non-respondents would
be the only way of achieving reasonable sample sizes in subgroups.
Cluster Sampling is not feasible because there is no obvious way of splitting into
clusters, nor could enough of them be produced to make sampling from them a
reasonable process. There do not seem to be any theoretical grounds for wanting to
sample in clusters either.
Systematic Sampling from the original alphabetic list would be very easy, and
probably just as satisfactory as simple random sampling. However, there is a distinct
risk that some surnames would be especially associated with some areas, so that
stratification would be better. If the sample method is going to involve producing lists
in different groups to sample from, systematic sampling could be used instead of
random choice because it may be quicker.
Possible Groups would be Grades I, II, each split into A, B, C, "rest". A 'good'
method must compare Grades satisfactorily. This seems to be the most important
requirement in the specification.