Development of an IQ test for New Zealand


Article in Psychological Reports · May 2008

DOI: 10.2466/PR0.102.2.389-397 · Source: PubMed


2 authors:

John Hattie, University of Melbourne

Richard Fletcher, Massey University



All content following this page was uploaded by John Hattie on 19 July 2014.


The development of an IQ test 1


The Development of an IQ Test for New Zealand Adults

John Hattie (Corresponding Author)

Faculty of Education, University of Auckland

PO Box 92019, Auckland, New Zealand


Phone: +64-9-3737599 Ext. 82496

Fax: +64-9-373 7455

Richard Fletcher

School of Psychology
Massey University
Private Bag 102-904
North Shore MSC
New Zealand
Phone: +64 9 414-0800, Ext 41213
Fax: +64 9 441-8157

Paper currently in press with Psychological Reports


In 2003 and 2004 a television network in New Zealand (TVNZ, TV1) showed a

syndicated program called Test the Nation: The New Zealand IQ Test. This paper

demonstrates the advantages of using item response theory (IRT) as the model of choice

when developing tests such as the New Zealand IQ Test.


The current interest of the broadcasting media in assessing psychological attributes via

an entertainment medium (reality TV) has seen a plethora of new psychological tests

developed and administered (IQ, health, personality, and general knowledge, to name but

a few). Little is known about the psychometric properties of these tests, which is a major

limitation of this new testing environment. This should be of concern, as a major claim by

the producers of these programs is that such “testing” can add to the credibility of tests and

measures and to public support for and understanding of them. Regardless of the medium of

delivery though, the underlying principles of testing should be maintained, and therefore

the process described in this paper offers an insight into test development for this new

testing medium using some of the most sophisticated measurement methods available.

The focus of this paper is the development of an IQ test for an internationally

syndicated program: Test the Nation. This program has been shown in many countries,

with widespread popularity, including over 10 million viewers (460,000 online

participants) in the US version of the program. It has been repeated over many years in

some countries, and in New Zealand it now has been shown over three years. The format

of the shows is that there are 6–8 groups of participants in the studio (e.g., farmers,

blonds, twins, and teachers) who complete the test at the same time as viewers at home

are encouraged to answer the questions. Home viewers can also use cell phone texting or

the internet to submit their answers. The live television program then shows the results of

the test and, using the text and internet responses, profiles the nation (male versus female

mean scores, favorite sports team and related IQ, mean scores by geographical location, and so on).

The concern of this paper is to outline the method by which such television tests can

be developed in a rigorous and dependable manner thus adding credibility to the

entertainment side of television assessment. The starting point is the notion of IQ, and

while this has been hotly contested, there is agreement that the major components of any

model of intelligence include proficiency in abstract thinking; capacity to learn from

experience; and an ability to learn from and to adapt to one’s environment, new situations

and society. Some components of intelligence are more powerful in some situations, in

some cultures, in some instances, and on some occasions. Choosing what to do when one

does not know what to do is possibly the underlying hallmark of intelligence. Guidance

as to what a TV IQ test could include can also come from relating the TV test to the gold

standards of IQ testing.

Two of the Gold Standards in the IQ testing movement are the Raven Standard

Progressive Matrices (SPM) Tests, and the Wechsler Adult Intelligence Scale (WAIS).

The Raven is a non-verbal measure of a person’s ability to form perceptual relations and

to reason by analogy (Raven, 1938; Raven, Raven, & Court, 2000). It consists of 60

problems arranged into five sets of 12 questions each. The manual provides conversion

from the raw scores to percentiles, which we then convert to IQ scores (mean = 100, sd =

15). Norms are available for over 15 countries including New Zealand adults. The WAIS

(Wechsler, 1939, 1997) is based on Wechsler’s notion that intelligence is the “capacity of an

individual to understand the world about him and his resourcefulness to cope with its

challenges” (Wechsler, 1975, p. 139). The WAIS is organized around four major

components, each having various subtests: verbal comprehension, perceptual

organization, working memory, and processing speed.
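As an illustration of the percentile-to-IQ conversion mentioned above, the following Python sketch maps a percentile to the IQ metric (mean = 100, sd = 15) through the inverse of the standard normal distribution. The paper does not say how the conversion was carried out, so the normal-curve mapping and the function name `percentile_to_iq` are assumptions made for illustration only.

```python
from statistics import NormalDist


def percentile_to_iq(percentile, mean=100.0, sd=15.0):
    """Map a percentile (strictly between 0 and 100) to an IQ score.

    Assumes IQ is normally distributed with the given mean and sd;
    this mapping is an illustrative assumption, not the paper's stated method.
    """
    if not 0.0 < percentile < 100.0:
        raise ValueError("percentile must lie strictly between 0 and 100")
    z = NormalDist().inv_cdf(percentile / 100.0)  # standard normal quantile
    return mean + sd * z


# Print the IQ equivalents of a few percentiles.
for pct in (5, 25, 50, 75, 95):
    print(pct, round(percentile_to_iq(pct), 1))
```

Under this assumption the 50th percentile corresponds to an IQ of 100, and percentiles further from 50 map to scores further from the mean.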

The 60-item New Zealand IQ Test was written to test 12 (in the first year) and 10 (in

the second year) attributes of intelligence: language (meanings, anagrams), numerical

(number problems, word problems), spatial (unfolding, rotation), reasoning (analytic,

mechanical), and memory (for pictures, for shapes). In the first year there were slight

differences in what was assessed: language (vocabulary, meanings), learning (general

knowledge, social intelligence), reasoning (number series, series completion). In light of

the results from the first year, the measures of “learning” were dropped in the second

year as these tests loaded on a separate factor to the other tests.

Criteria for Choosing the Questions

The main psychometric measurement model used for the individual analysis of each

question is item response theory (IRT; Hambleton & Swaminathan, 1985). IRT is a

mathematical approach to understanding the relationship between the responses from the

test-takers and the psychometric attributes of a question. Of the many possible IRT

models, the two-parameter logistic model was applied to the data as it provides a more

realistic account of the interactions between people and the items. The two-parameter

model allows for questions to differ in terms of their difficulty (some easier, some

harder) and their ability to discriminate between those with high and those with low IQ

for a particular question. The criteria for successfully developing a set of questions to

assess these attributes involve four steps.
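For reference, the two-parameter logistic model is usually written as follows, where $P_i(\vartheta)$ is the probability that a person of ability $\vartheta$ answers question $i$ correctly, $a_i$ is the discrimination of that question, and $b_i$ is its difficulty. The subscript notation is ours. Some programs also multiply the exponent by a scaling constant of 1.7, and the paper does not report which metric was used.

```latex
P_i(\vartheta) = \frac{1}{1 + \exp\left[-a_i(\vartheta - b_i)\right]}
```

When $\vartheta = b_i$ the probability of a correct answer is .5, and the slope of the curve at that point is $a_i/4$, so larger values of $a_i$ give steeper curves.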

First, the aim was to select questions across the difficulty range that maximized the

discrimination between people with high and people with low overall scores. Difficulty

and discrimination refer to the two major parameters of any question. Difficulty relates to

how easy or difficult the question is, and discrimination refers to whether a question can

maximally discriminate between high and low performers overall on the total set of

questions – the greater the discrimination then the more power the question has to

provide information about the quality being measured.

Second, the optimal way to choose questions is to consider the information from

each question. Information is related to the size of the standard error of measurement

across the ability range. The higher the information then the more psychometric

information a question will provide around the difficulty and the discrimination

parameters. A major reason why information (at the test and question levels) is critically

important is that it determines how well a test is performing and has an exact

relationship with a person’s standard error of measurement: $\mathrm{se}(\vartheta) = 1/\sqrt{\mathrm{Information}(\vartheta)}$.

Having calculated each question’s information index it was important to ensure that there

was sufficient information across the expected range of IQ (70 to 130 IQ). One of the

unique requests was that there should be slightly more easy items in the overall test, to

allow all viewers to more fully engage in the TV program and get a reasonable number of

questions correct before the scores are translated onto the typical IQ distribution. Thus,

questions were chosen that filled in a rectangular information distribution, while ensuring

that there were sufficient questions at each point of the difficulty scale (with a bias

towards more questions of lower than average difficulty).

Third, there should be evidence of a higher order factor that explains most of the

covariance between the five major dimensions, or ten components. Essentially this higher

order factor can be viewed as overall IQ, as it is the summation of various IQ


Fourth, it was requested that there be six questions measuring each attribute. We thus

used the above information curves to select the best 6 questions for each dimension such

that we had excellent fit to the slightly positively skewed rectangular distribution (this

would lead to more easy than hard questions in the overall test), and maximum


The design involved devising and reviewing over 400 items, and after extensive

critique these were cut to 160 items. A sample of 271 in the first year and 335 New

Zealand adults in the second year, representing the range of ages and occupations, were

invited to an all-day session to be administered the 160 questions, the Raven SPM and the

WAIS, and in the second year 24 questions from the 2003 test as an additional validation

check. For each question, the difficulty (b) and discrimination (a) parameters were

estimated using a two-parameter item response model (using BILOG, Mislevy & Bock,

1983). The final set of six questions per domain was chosen as outlined above. After

choosing the final six questions, a maximum likelihood factor analysis was used to assess

the amount of common variance, as it was expected that there would be much overlap

between the 12 subscales. The scores from the final 60 questions were then equated to

the Raven scores, and a chart converting the 60-question total to IQ was calibrated. This chart allowed viewers

completing the 60 questions to then convert their total test score to an appropriate IQ

score (adjusted for their age).
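The paper does not state which equating method was used. As one possible approach, the sketch below builds a raw-score-to-IQ chart by equipercentile equating: each raw score is given the reference IQ that has the same cumulative proportion in the calibration sample. The function names and both data lists are hypothetical.

```python
from bisect import bisect_right


def equated_score(raw, test_sorted, ref_sorted):
    """Reference score with the same cumulative proportion as `raw`."""
    count = bisect_right(test_sorted, raw)  # test-takers scoring at or below raw
    # Integer ceiling of count * len(ref) / len(test), minus one for a 0-based index.
    index = -(-count * len(ref_sorted) // len(test_sorted)) - 1
    return ref_sorted[max(0, index)]


def equating_chart(test_scores, reference_iqs, max_raw):
    """Chart mapping every raw score from 0 to max_raw onto the reference scale."""
    test_sorted = sorted(test_scores)
    ref_sorted = sorted(reference_iqs)
    return {raw: equated_score(raw, test_sorted, ref_sorted)
            for raw in range(max_raw + 1)}


# Hypothetical calibration data: raw totals on the 60 questions and Raven IQs.
test_scores = [22, 28, 31, 35, 38, 40, 43, 47, 51, 55]
raven_iqs = [78, 85, 90, 95, 99, 102, 106, 111, 118, 127]

chart = equating_chart(test_scores, raven_iqs, max_raw=60)
for raw in (20, 38, 43, 55):
    print(raw, chart[raw])
```

With a calibration sample of several hundred people the chart would normally be smoothed, and a separate chart would be built for each age group.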

Much care was also given to the sampling framework. There were equal numbers of

males and females and a distribution of ages that matched the profile of the New Zealand

population: 29% aged between 18-25, 38% between 26-40, 20% between 41-59, and

13% greater than 60. Similarly, the occupations of the participants closely represented

the New Zealand Statistics categories, and each occupation converted to an index of

socio-economic status, using the NZSEI (Davis, McLeod, Ransom, & Ongley, 2000).

Psychometric attributes of the questions

To illustrate the manner in which the IRT model was used, consider first the

attributes of an excellent question. The left-hand graph in Figure 1 shows the performance of one item

(item 29 in a trial form). It indicates that there is much

discrimination (a-value) between 85 and 115 IQ (- 1 sd to + 1 sd) and the difficulty (b-

value) of the item is close to 105 IQ. The more vertical the line of this curve, then the

more successful the item is at discriminating at the difficulty level of the item. Contrast

this with item 16 (to the right in Figure 1) where there is less discrimination (the curve is

more flat) although the difficulty is about the same. Item 29 provides more information

about IQ than does item 16 (in a trial form). It is optimal to choose a series of items with

steeper discrimination curves right across the range of difficulty desired (80 to 125), and

this is how the final set was chosen.
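The contrast between the two items can be checked numerically. The sketch below uses the two parameter pairs printed with Figure 1 (a = 0.983, b = 0.336 and a = 0.439, b = 0.145). Assigning the steeper pair to item 29 is our assumption, based on the description above, as is the linear link between the ability scale and IQ (IQ = 100 + 15 × ability).

```python
import math


def icc(theta, a, b):
    """Two-parameter logistic probability of a correct answer at ability theta."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def iq_to_theta(iq, mean=100.0, sd=15.0):
    """Assumed linear link from the IQ metric to the ability scale."""
    return (iq - mean) / sd


def theta_to_iq(theta, mean=100.0, sd=15.0):
    """Inverse of iq_to_theta under the same assumption."""
    return mean + sd * theta


def rise(a, b, low_iq=85.0, high_iq=115.0):
    """Increase in P(correct) between two IQ levels; larger means more discriminating."""
    return icc(iq_to_theta(high_iq), a, b) - icc(iq_to_theta(low_iq), a, b)


# Parameter pairs from Figure 1; the item labels are an assumption (see lead-in).
ITEMS = {"item 29": (0.983, 0.336), "item 16": (0.439, 0.145)}

for name, (a, b) in ITEMS.items():
    print(name, "difficulty as IQ:", round(theta_to_iq(b), 1),
          "rise in P(correct) from IQ 85 to 115:", round(rise(a, b), 3))
```

Under these assumptions both items have difficulties slightly above IQ 100, but the probability of a correct answer climbs about twice as much between IQ 85 and 115 for the steeper item.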

Using the four criteria outlined above, 60 questions were chosen for the final New

Zealand IQ Test (in the second year, and 72 in the first year). For each of the final 60

questions, the difficulty and discrimination are presented in Table 1. For ease of

interpretation, the percentage correct is also provided. Thus, for example, 86% of the

sample answered Question 1 correctly, it has a discrimination of .36, and its

correlation with the Raven is .30. The average percentage correct is 63%,

which is exactly the desired mean (when guessing is taken into account). The average

discrimination is .25, which is as good as can be expected from multiple choice items.

Across the total test there is sufficient information at each point of the IQ distribution

from - 3 to + 3 logits, corresponding to 55 to 135 IQ. Figure 2 below presents the

information function for the total score (this is the accumulation of the information

curves across all 60 items). It can be seen that the “information curve” is greatest

between 70 and 120, and there is less information above 120, although there is sufficient to

have confidence in the IQ scores across the range 85 to 135. In future developments

perhaps we would need to make the test slightly more difficult to allow the peak of this

information curve to be above IQ = 100.

A maximum likelihood factor analysis with oblimin rotation indicated that there was

one clear underlying factor (Table 2). The mean of the final 60 questions was 100.78 (sd

= 13.52). The mean IQ was 100 (sd = 15.25) for the Raven, and 109 (sd = 14.9) for the

WAIS subtests. The correlation between the New Zealand IQ test and the total Raven

was .67, and with the WAIS was .69.

There was a statistically significant difference in the New Zealand IQ Test means

across the various occupation groups (F = 2.04, df = 9,235, p < .05). Professionals and

Legislators, Administrators and Managers had higher mean IQ scores, whereas Service

and Sales workers and those in Elementary Occupations (e.g., laborers, packers) had

lower means. There were statistically significant differences on the test scores (prior to

adjusting the raw scores to the IQ scale) by the four age groups (F = 19.40, df = 3, 325, p

< .001). The mean (before conversion) was 42 for 18-25 year olds, 39 for 26-40 year olds, 36 for

41-59 year olds, and 32 for those over 60. Hence it was important to have different conversions for these

four groups to place them on the same IQ distribution (as also occurs in the Raven and all

IQ tests).
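To show why separate conversions matter, the sketch below standardizes a raw score against the mean of the test-taker's own age group before placing it on the IQ metric. The group means are those reported above. The common raw-score standard deviation of 9 is hypothetical (the paper does not report it), and this simple linear conversion is an illustration rather than the calibration actually used, which was equated to the Raven.

```python
# Mean raw scores by age group, as reported in the text.
AGE_GROUP_MEANS = {"18-25": 42.0, "26-40": 39.0, "41-59": 36.0, "over 60": 32.0}

# Hypothetical raw-score standard deviation; not reported in the paper.
ASSUMED_RAW_SD = 9.0


def raw_to_iq(raw, age_group, raw_sd=ASSUMED_RAW_SD):
    """Linear conversion of a raw score to IQ within one age group."""
    z = (raw - AGE_GROUP_MEANS[age_group]) / raw_sd  # distance from the group mean
    return 100.0 + 15.0 * z


# The same raw score of 39 gives a different IQ in each age group.
for group in AGE_GROUP_MEANS:
    print(group, round(raw_to_iq(39, group), 1))
```

A raw score equal to the group mean converts to 100 in every group, so the same raw score yields a higher IQ for older test-takers.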

Concluding comments

The second-year final 60 questions cover five major dimensions of intelligence – the

ability to use language, process spatially, consider numeracy, use memory, reason,

problem solve, and learn. The correlations between the New Zealand IQ Test and other

related ‘Gold Standard’ IQ tests (.67 with the Raven and .69 with the WAIS) are high, providing much confidence in the

proficiency of these questions to assess New Zealand adults’ IQs.

It is recognized that the aim of the television program is credible entertainment.

However, it is still critical that any psychometric test is well developed and this article

aims to provide evidence of such psychometric rigor. The methods presented in this

paper demonstrate the advantages of developing tests using the most stringent of

measurement models, namely IRT.

References


Davis, P., McLeod, K., Ransom, M., & Ongley, P. (2000). The New Zealand socio-

economic index of occupation status (NZSEI). Research Report #2. Wellington,

New Zealand: Statistics New Zealand.

Hambleton, R.K., & Swaminathan, H. (1985). Item response theory: Principles and

applications. Boston, MA: Kluwer-Nijhoff.

Mislevy, R. J., & Bock, R. D. (1983). BILOG: maximum likelihood item analysis and test

scoring-logistic models. Chicago: Scientific Software International.

Raven, J. (1938/1962/2004). First published as Progressive Matrices (1938). London:

H.K. Lewis. Revised and renamed as the Standard Progressive Matrices in 1962,

H.K. Lewis. Later published in Oxford England by OPP and, from 2005 by

Harcourt Assessment, San Antonio, TX.

Raven, J., Raven, J.C., & Court, J.H. (2000/2004). Manual for Raven’s Progressive

Matrices and Vocabulary Scales. Section 3: The Standard Progressive Matrices:

including the parallel and plus versions. San Antonio, TX: Harcourt Assessment.

Wechsler, D. (1939). The measurement of adult intelligence. Baltimore, MD: Williams &

Wilkins.

Wechsler, D. (1975). Intelligence defined and undefined: A relativistic appraisal.

American Psychologist, 30, 135-139.

Wechsler, D. (1997). Wechsler Adult Intelligence Scale (3rd ed.). San Antonio,

TX: Psychological Corporation.

Table 1

Percentage correct (%), discrimination, and correlation with the Ravens for each of the final 60 questions

Question    %  Discrimination  r to Ravens    Question    %  Discrimination  r to Ravens

Meanings                                      Anagrams
1          86              36          .30    7          62              21          .16
2          65              47          .22    8          59              11          .17
3          65              40          .22    9          54               8          .54
4          54              29          .10    10         19               7          .40
5          47              24          .11    11         38               9          .15
6          76              33          .22    12         71              15          .33

Analytical                                    Mechanical
13         80              10          .19    19         77              26          .33
14         72              10          .40    20         69              32          .30
15         71              19          .19    21         55              29          .19
16         39               8          .39    22         55              31          .36
17         67               2          .69    23         50              20          .20
18         77              17          .25    24         72              43          .42

Memory for Pictures                           Memory for Shapes
25         90              17          .26    31         73              22          .34
26         59               7          .23    32         66              16          .14
27         81              26          .35    33         75              13          .60
28         43               6          .60    34         74              20          .18
29         34               8          .20    35         62              23          .32
30         83              15          .24    36         75              26          .27

Number Problems                               Word Problems
37         77              37          .43    43         72              32          .24
38         74              39          .30    44         65              36          .16
39         69              41          .33    45         55              30          .31
40         69              37          .46    46         49              28          .23
41         65              39          .42    47         42              37          .18
42         58              24          .17    48         66              32          .35

Unfolding                                     Rotation
49         81              29          .22    55         84              31          .19
50         63              34          .24    56         74              38          .36
51         60              33          .25    57         54              23          .19
52         44              33          .30    58         51              24          .23
53         36              34          .22    59         47              33          .26
54         74              31          .21    60         84              24          .33

Table 2

Factor loadings on the single factor for each of the 12 dimensions of Test the Nation:

The New Zealand IQ Test

Dimensions                Factor loading

Meanings Vocabulary       .35
Meanings Anagrams         .33
Analytic Reasoning        .47
Mechanical Reasoning      .61
Memory for Pictures       .54
Memory for Shapes         .59
Number Problems           .64
Number Word Problems      .65
Spatial Unfolding         .52
Spatial Rotation          .59

Figure Captions

Figure 1. Item characteristic curves for an excellent and not so excellent item.

Figure 2. Test characteristic curve for the New Zealand IQ test.

[Figure 1 panels: Item Characteristic Curve: ITEM0029 (left) and Item Characteristic Curve: ITEM0016 (right). Horizontal axis: Ability, from -3 to 3. Parameter values printed with the panels: a = 0.439, b = 0.145; a = 0.983, b = 0.336.]