A Physics Diagnostic Test

Published by Leo Sutrisno on Aug 15, 2008
Leo Sutrisno
Dept. Math and Science Education

Faculty of Education Tanjungpura University Pontianak, Indonesia

List of contents

1 Constructing the test
1.1 The purpose of the test
1.2 The table of specification
1.3 Types of the test form
1.4 Multiple-choice form
2 Trialling the test
2.1 Item analysis
2.2 Criteria for selecting the appropriate items

List of Figures
Figure 1: A 2x2 table of response patterns of item A and item B on dichotomous events

List of Tables
Table 1: Characteristics of items of the physics diagnostic test

This chapter presents the development of a physics diagnostic test. Gronlund (1981) describes several stages in developing a diagnostic test: determining the purpose of testing, developing the table of specifications, selecting appropriate item types, and preparing relevant test items.
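The table of specifications mentioned in these stages can be pictured as a two-way chart of course content against cognitive level. A minimal sketch in Python, with entirely hypothetical sub-units and item counts (the actual blueprint is described in section 1.2):

```python
# Hypothetical two-way chart: rows = course content (sound sub-units),
# columns = cognitive levels; cells = planned number of items.
# The sub-units and counts below are illustrative, not the study's blueprint.
blueprint = {
    "sources of sound":      {"knowledge": 4, "comprehension": 2, "application": 1},
    "transmission of sound": {"knowledge": 4, "comprehension": 2, "application": 2},
    "Doppler effect":        {"knowledge": 3, "comprehension": 1, "application": 1},
}

def items_at_level(chart, level):
    # Total planned items at one cognitive level, summed over sub-units.
    return sum(row.get(level, 0) for row in chart.values())

total = sum(sum(row.values()) for row in blueprint.values())
```

A chart like this makes it easy to verify constraints such as Gronlund's suggestion that roughly half the items sit at the knowledge level.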

1 Constructing the test
1.1 The purpose of the test

As reported in Chapter 3, students hold their own conceptions about the phenomena of sound before they attend formal instruction in these topics at secondary school. These conceptions - the students' pre-conceptions - are generally different from the scientists' conceptions. In class activities, however, the students' pre-conceptions interact with the conceptions promoted by teachers, with several possible outcomes. Fisher and Lipson (1986, p.784) called students' observable outcomes which are different from an expected ("correct") model of performance "student errors".


The test is designed primarily to detect the common student errors in learning about sound. In other words, the test is used as a diagnostic test (Gronlund, 1981). The results of the test will then be used to design learning experiences to remedy these errors. The test is used again to detect whether or not these errors have been overcome through remedial activities. In this study, the expected ("correct") model of performance is the scientists' conceptions, and the items of the test are based on the students' pre-conceptions.

1.2 The table of specification

Theobald (1974) states that "the foremost requirement of a test is that it be valid for its intended purpose" (p.28). He also states that "a test ... can measure only a small proportion of the total number of specific objectives of a course of study". It is important therefore that the test has content validity. The standard method to ensure the content validity of a test is by working to a table of specifications. Gronlund outlines three steps in building the table: obtaining a list of instructional objectives, outlining the course content, and preparing a two-way chart of instructional objectives vs course content (see also Karmel, 1970; Hopkins and Stanley, 1981; Theobald, 1974).

The relationship between objectives and measurement is clearly stated by Noll, Scannell, and Craig (1979): objectives and measurement are complementary. Bloom's Taxonomy of Educational Objectives (1956) has been widely used as a framework for educational measurement, and the cognitive domain of this Taxonomy will be adopted as the basis for a list of behavioural objectives.

The course content outline is based on the Indonesian Curriculum-1984 for the General Senior High Schools (SMA). The unit of instruction about sound consists of several sub units such as the sources of sound, the transmission of sound, the medium of transmission, the velocity of sound in several media, musical instruments, the human ear and the Doppler Effect.

Gronlund (1981, p.126) suggests that a diagnostic test should have a relatively low level of difficulty and that most items should be focused on the knowledge level (51%), with few items at the level of synthesis or evaluation. At the knowledge level, items are grouped into knowledge of specific facts, knowledge of conventions and knowledge of principles. Knowledge of specific facts refers to "those facts which can only be known in a larger context" (Bloom, 1956, p.65). Knowledge of conventions refers to "particular abstractions which summarize observations of phenomena" (p.75). The items are intended to recall and recognize facts, conventions or principles in learning sound. At the comprehension level, items deal with the interpretation of a given illustration, while at the application level, items deal with selecting the appropriate principle(s) to solve a given problem. Items at the analysis level focus on analysing the elements of a given problem.

1.3 Types of the test form


As a broad classification, tests can be grouped as either essay or objective types. Karmel (1970) compares essay and objective tests on the abilities measured, scope, incentive to pupils, preparation and method of scoring. The essay test uses few questions, requires the students to use their own words to express their knowledge, generally covers only a limited field of knowledge, encourages pupils to learn how to express and organize their knowledge, and is very time consuming to score. On the other hand, the objective test requires the students to answer, at most, in a few words, can cover a broad range of student preparation, and is time consuming to construct but can be scored quickly. "Objective" tests are objective in the sense that once an intended correct response has been decided on, the test can be scored objectively, by a clerk or mechanically. "Essay" tests, by contrast, require an "expert" to decide on the worth of the response and so involve an element of subjectivity in scoring (see also Gronlund, 1981).

Hopkins and Stanley (1981) present a list of limitations of essay tests based on research findings: reader unreliability, halo effects, item-to-item carryover effects, test-to-test carryover effects, order effects, and language mechanics effects. Theobald (1974) presents similar criticisms: evaluation is difficult, and scoring is generally unreliable, costly in time and effort, and the sampling of student behaviours is usually inadequate. However, Blum and Azencot (1986), who compared the results of students on multiple-choice test items and equivalent essay questions in an Israeli examination, concluded that there were no significant differences between the mean scores.

Because of the need to comprehensively sample the field of learning in a diagnostic test (Hopkins and Stanley, 1981), objective tests are to be preferred over essay tests for this purpose.

1.4 Multiple-choice form

Gronlund (1981) presents several advantages of the multiple-choice form over other forms of objective test: it has greater reliability than other forms, it is one of the most widely applicable forms, it can measure various types of knowledge effectively, and the use of a number of plausible alternatives makes the results amenable to diagnosis. The chances of guessing correct answers in multiple-choice items can be reduced by increasing the number of options in each item. Grosse and Wright (1985) found that "more true-false items are required to achieve the same reliability expected from 5-choice items" (p.12). The multiple-choice form will be used in this study.

There is no agreement about the best number of options in multiple-choice tests. Noll, Scannell and Craig (1979) suggest that at least four options per item should be used, and that five alternatives are preferable, the chance factor being decreased by increasing the number of options. Hopkins and Stanley (1981) state that "five alternatives per item are optimal for many situations" (p.245). Lord (1977) reviewed four approaches to predicting the optimal number of options per item.

The first approach is based on maximizing the "discrimination function" A^N proposed by Tversky (1964), where A^N is the total number of possible distinct response patterns on N items with A options in each item. The function A^N is maximized by A = e if N.A is fixed. Since e is 2.718, A is approximated by 3. Costin (1970) compared the discrimination indices, the difficulty indices and the reliabilities (KR-20) of test results which were based on 3-option and 4-option multiple-choice tests on perception (N = 25), motivation (N = 30), learning (N = 30) and intelligence (N = 25). He found that the 3-option forms produced higher values than the 4-option forms.

The second approach was proposed by Grier (1975) and is based on test reliability. The expected reliability of a multiple-choice test is a function of the number of options per item. Grier concluded that for C > 54 three options per item maximizes the expected reliability coefficient (for large C the optimal value of N approaches 2.50).

The third approach examines the knowledge-or-random-guessing assumption. Lord (1977) used this approach and found it to support Grier's result (three options). The fourth approach is based on the use of the item characteristic curve (Lord, 1977). The item characteristic curve of an item gives the probability of a correct answer to the item as a function of examinee ability. Lord found that the test where the pseudo chance-score level of items is 0.33 is superior to others.

Green, Sax and Michael (1982) analysed the reliability and the validity of tests by the number of options per item. They found that four options per item produces the highest reliability when compared with three and five options, but that three options has the highest predictive validity (correlation with the course grade). The experimental evidence quoted above argues in favour of using three options per item, and this form will be adopted in this study.

A multiple-choice item has two parts: an introductory statement to pose the problem and a series of possible responses. The first part is called "a stem" and the second part is called "options", "alternatives", or "choices". In this study the term "options" is used. Options include the correct response and distracters (foils or decoys) for the others. Using the table of specifications, items were constructed for each sub unit of instruction.

Several studies have been conducted on the effects of violating item construction principles. McMorris, Brown, Snyder and Pruzek (1972) studied the relationship between providing signs about the correct response and the test results. Three types of signs used in their study were: words or phrases in the stem which provide a sign to the correct answer; grammar - where the correct answer was the only one grammatically consistent with the stem; and length - where the correct answer was longer than the distracters. They found that these factors are positively correlated with the test results (.48, .39, and .47 for words or phrases, grammar, and length respectively).

Austin and Lee (1982) studied the relationship between the readability of test items and item difficulty. The aspects of readability considered were the number of sentences in each item, the number of words in each item, and the number of "tokens" (words, digits or mathematical symbols) in each item. They found that these aspects of readability were negatively correlated with item difficulty (-.22, -.24, -.15 for the number of sentences, words and tokens respectively); increasing the number of sentences, words or tokens in an item would decrease the number of correct answers.

Schrock and Mueller (1982) studied the effect of the stem form (incomplete and complete statement), and the presence or absence of extraneous material, as cues attracting students to the correct answers. They suggest that "a complete sentence stem is to be used rather than an incomplete sentence stem" and that "extraneous material should not be present in any type of stem" (p.317).

Green (1984) studied the effects of the difficulty of language and the similarity among the options on item difficulty. The difficulty of language was varied by the length of stems, the syntactic complexity, and the substitution of an uncommon term with a familiar term in the stem. She found that similarity among the options significantly affected item difficulty (F = 72.21, p < .01) but that difficulty of language did not.

There are standard guidelines for constructing multiple-choice items (Hopkins and Stanley, 1981; Noll et al., 1979; Gronlund, 1981; Theobald, 1974). Suggestions related to the item are:

1. Items should cover an important achievement.
2. Each item should be as short as possible.
3. The reading and linguistic difficulty of items should be low.


4. Items that reveal the answer to another item should be avoided.
5. Items which use the textbook wording style should be avoided.
6. Items which use specific determiners (e.g., always, never) should be avoided.
7. Each item should have only one correct answer.
8. In a power test, there should not be too many items, otherwise the test would become a speed test.

Some suggestions related to the stem are:

1. The stem should contain the central problem.
2. A negative statement or an incomplete statement in the stem should be used with care.
3. The stem should be meaningful in itself.

Several suggestions related to the options are:

1. All options should be grammatically consistent with the stem of the item.
2. All options should be plausible.
3. All options should be as parallel in form as possible.
4. The correct response should be placed equally often in each possible position.
5. Verbal associations between the stem and the correct response should be avoided.
6. Any clue which leads to the choice of the correct response should be avoided.
7. Options which are synonymous or opposite in meaning in the item should be avoided.
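Some of these option guidelines can be checked mechanically. As an illustration only, a sketch of a check for one well-documented flaw - a correct option conspicuously longer than its distracters (the length cue studied by McMorris et al.); the 1.5 ratio is an arbitrary assumption, not a value from the literature:

```python
def has_length_cue(options, correct_index, ratio=1.5):
    # Flag an item whose correct option is much longer than every
    # distracter; such length cues correlate with test results (r = .47
    # in McMorris et al., 1972). The ratio threshold is illustrative.
    correct_len = len(options[correct_index])
    distracter_lens = [len(o) for i, o in enumerate(options) if i != correct_index]
    return correct_len > ratio * max(distracter_lens)
```

A check like this supplements, but cannot replace, qualitative review of each item.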

Summing up, items should pose the problem as clearly and briefly as possible, and they should not contain cues which attract students' attention to the correct response. Test items used in this study were written to follow the recommendations quoted above as closely as possible. Ninety-two items were constructed and divided into two tests for ease of administration: form-A (45 items) and form-B (47 items). The first drafts were written in English and the final drafts for trying out were in Bahasa Indonesia.
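Section 1.4's argument that A = e ≈ 2.718 maximizes the number of response patterns A^N when N.A is held fixed can be checked numerically; the fixed budget N.A = 60 below is an arbitrary choice for illustration:

```python
def response_patterns(options: int, budget: int = 60) -> float:
    # With the total "budget" N*A of option slots fixed, a test has
    # N = budget / A items, giving A**(budget/A) distinct response patterns
    # (Tversky, 1964). budget=60 is an arbitrary illustrative value.
    n_items = budget / options
    return options ** n_items

# Among integer option counts, 3 (the integer nearest e) should win.
best = max([2, 3, 4, 5], key=response_patterns)
```

The same ranking holds for any fixed budget, which is why the continuous optimum A = e translates into three options per item in practice.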

2 Trialling the test
The final drafts were sent to Indonesia, duplicated and distributed to schools in West Kalimantan. Neither schools nor students were randomly chosen, because of the lack of time available to negotiate with school authorities. During that time (May - June 1987) schools were very busy, as it was the end of the academic year, and only six schools were willing to take part in the trialling of the items. Physics teachers in these schools were then asked to choose classes to be given the test; 231 students participated. The answer sheets, together with teachers' responses and comments, were analysed. It should be noted that the test results were scored by counting the number of items answered correctly. In administrations of the final form of the diagnostic test, the focus for scoring was on the number of curriculum sub-units not mastered.
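Scoring by curriculum sub-units not mastered can be sketched as follows; the item-to-sub-unit map and the 50% mastery threshold are hypothetical stand-ins, not values taken from the study:

```python
# Hypothetical mapping from item numbers to curriculum sub-units
# (illustrative only; the real test maps 30+ items to its sub-units).
ITEM_SUBUNIT = {
    1: "sources of sound", 2: "sources of sound",
    3: "transmission of sound", 4: "transmission of sound",
    5: "medium of transmission",
}

def subunits_not_mastered(answers_correct, threshold=0.5):
    # answers_correct: {item_no: True/False}. A sub-unit counts as not
    # mastered when the proportion correct falls below the threshold
    # (the 0.5 cut-off is an assumption for this sketch).
    totals, correct = {}, {}
    for item, sub in ITEM_SUBUNIT.items():
        totals[sub] = totals.get(sub, 0) + 1
        if answers_correct.get(item, False):
            correct[sub] = correct.get(sub, 0) + 1
    return {s for s in totals if correct.get(s, 0) / totals[s] < threshold}
```

The returned set identifies the sub-units for which remedial activities would be designed.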


2.1 Item analysis

There are many standard item analysis techniques (Hopkins and Stanley, 1981; Anastasi, 1976; Hills, 1981; Gronlund, 1981; Theobald, 1974). The most common indices used are the facility value and the discriminating power of the item. The means and standard deviations of the total correct answers for each form of the test are 54.91 and 22.92 (form-A) and 53.26 and 26.04 (form-B).

Facility values

The facility value is defined as "the percentage of students completing an item correctly" (Theobald, 1974, p.33). Some authors use the reverse concept, that of "item difficulty" (Noll et al., 1979, p.83; Anastasi, 1976, p.199; Gronlund, 1981, p.258). In this study the term "facility value" is adopted as its use will lead to less confusion: the greater the facility value of the item, the more students are able to answer it correctly. The means and standard deviations of the facility values of items are 55 percent (SD = 10) for form-A and 48.6 percent (SD = 21.5) for form-B. The minimum values are 24 percent (item no 24 of form-A) and 19 percent (item no 25 of form-B), while the maximum values are 90 percent (item no 7 of form-A) and 100 percent (item no 5 of form-B).

The facility values of items will be considered as an aspect which should be taken into account when selecting items to produce the final test. Theobald (1974) suggests that "items should lie within the range of 20 per cent - 80 per cent difficulty" (p.34). This suggestion will be implemented, but with consideration given to Anastasi's warning that "the decisions about item difficulty cannot be made routinely, without knowing how the test scores will be used" (Anastasi, p.201). As mentioned earlier, the test is designed to be used as a diagnostic test, in an attempt to measure in detail academic strengths and weaknesses in a specific area, in contrast to the survey test which attempts to measure overall progress (Karmel, 1970, p.283). If an item does not appear to suffer from technical faults in construction, a low facility value could reflect that students' pre-conceptions have not been replaced by scientists' conceptions in most students. Such an item indicates a common student weakness and would tend to be retained.

The discrimination power of the test

Gronlund (1981) defines the discrimination power of a test item as "the degree to which it discriminates between pupils with high and low achievement" (p.259). Anastasi (1976, p.206) used "item validity" to describe this concept. The ideal item is one which is answered correctly by all students of the upper group and by no students of the lower group (Findley, 1956). In Findley's usage the upper group is a defined proportion, usually a third, of the class who scored most on the test, and the lower group is the same proportion who scored least. The assumption is that the score on the total test is the best measure we have of high and low achievement.

Several methods have been proposed to measure the discriminating power of an item: the biserial correlation, the phi coefficient, and the index of discrimination. All of these methods are based on the measurement of the relationship between the item score and the criterion score. Usually the total score on the test itself is used as the criterion score.

The biserial correlation method is based on the assumption that "the knowledge of the test item and the knowledge of the entire test are both distributed normally" (Mosier and McQuitty, 1940, p.57). The means and standard deviations of the biserial correlation coefficients are .20 (SD = 0.12) for form-A and .15 (SD = 0.13) for form-B. Item no 1 of form-A, which dealt with the generation of sound, had the lowest correlation. Item no 6 of form-B, which attempted to investigate students' understanding about waves by using a diagram of a wave, had the lowest correlation among items of form-B. Item number 43 of form-A and number 17 of form-B have the highest correlation coefficients (.46 and .38 respectively). Item 43 dealt with the Doppler Effect, while item 17 dealt with the transmission of sound in a metal bar.

The use of the phi coefficient was developed by Guilford (1941) based upon "the principle of the correlation between an item and some criterion variable" (p.11). This method is applicable if the two groups are equal in number. Several computational aids have been developed for arriving at phi coefficient values: tables (Jurgensen, 1947, for equal groups; Edgerton, 1960, for unequal groups), a nomograph (Lord, 1944), and abacs (Mosier and McQuitty, 1940; Guilford, 1941). In this study the phi coefficients of items are calculated following Theobald's procedures for using the abac method. Table 4.4.7 presents means and standard deviations of the phi coefficients of the tests form-A and form-B. There is no significant difference between the phi coefficients of the two forms (t = 0.02, not significant at α = .01).

Findley (1956, p.177) also proposed two formulas to measure the discrimination power of items. One of the potential virtues of this method is that it can be used to provide "a precise measure of the distractive power of each option" (Findley, p.179). In this regard, Findley's method can provide more information than the biserial correlation coefficient or the phi coefficient measured by the abac method, so the index of discrimination power measured using Findley's method will be used to select items. The means and standard deviations of these indices are .22 and 0.17 for form-A, and .21 and 0.18 for form-B. There is no significant difference between the indices of the two forms (t = 0.02, not significant at α = .01). The Findley index and the phi coefficient are highly correlated (.96 for form-A, and .97 for form-B).

2.2 Criteria for selecting the appropriate items

Anastasi (1976) states that items of a test should be evaluated both qualitatively and quantitatively. Qualitative evaluation deals with the content and the verbal structure of the item; the care taken in the construction of the items has been described in section 1. Quantitative evaluation deals with the difficulty and the discrimination of the items, presented in section 2.1. Results of the evaluation will be used as a basis for selection of the items in order to improve the reliability and the validity of the test.

The official class-period time in Indonesian secondary schools is 45 minutes, but the effective period is about 40 minutes. Students have wide experience of multiple-choice testing in Indonesian secondary schools, and experience suggests that for a test to be completed by nearly all students the appropriate number of items would be about 30. A final form of the physics diagnostic test was constructed from the 92 items which were trialled. The 30 items were selected mainly from test form-A; whenever needed, items from form-B were included.

In selecting items for inclusion in a test, Hopkins and Stanley (1981, p.284) believe that an item which has high difficulty and low discrimination power may, on occasion, be acceptable. On the other hand, Gronlund (1981) says that "a low D should alert us to the possible presence of technical defect in a test item" (p.262). Noll et al. (1979) stated that there is little reason to retain items which have negative discrimination indices unless other important values can be shown. Theobald (1974) suggested that "items should lie within the range of 20 per cent - 80 per cent difficulty" (p.34) and that "items should be carefully scrutinized whenever D < +.2" (p.32). Anastasi claims that items which have around 50 per cent difficulty and discrimination power are preferable. Although "item analysis is no substitute for meticulous care in planning, constructing, criticizing, and editing items" (Hopkins and Stanley, 1981, p.270), Theobald's quantitative guidelines will be adopted. He states, however, that even when test items are chosen to reflect precisely stated behavioural objectives, and the test as a whole is an adequate and representative sample of the course content, other considerations also apply (p.32). There are three additional considerations: the students' and the teachers' comments about the items, prerequisite relationships among sub units of study, and caution indices. Students' and teachers' comments on the items are available; these were considered carefully when selecting or rejecting items.

There are many methods used to test models of prerequisite relationships among sub units. The first method is the Proportion Positive Transfer method, which was pioneered by Gagne and Paradise (Barton, 1979), with a refinement suggested by Walbesser and Eisenberg (in White, 1974b). White observed that these methods do not take into account errors of measurement, so he and Clark proposed another method (White, 1974a, 1974b). Applying Guttman's coefficient (Yeany, Kuch and Padilla, 1986) and the phi coefficient (Barton, 1979; White, 1974a), the scalogram method has also been widely used. Barton also proposed a method called the Maximum Likelihood method. Proctor (1970) suggested the use of chi-square procedures (Bart and Krus, 1973, p.293). Dayton and Macready (1976) used a probabilistic method (Yeany et al., 1986). An Ordering Theory method has been used by Bart and Krus (1973), Airasian and Bart (1973), and Krus and Bart (1974). Bart and Read (1984) tried to adopt Fisher's exact probability method. Although all these methods have their own particular advantages and limitations, most of them share a similar problem of determining whether a certain model of prerequisite relationship occurred by chance or not. Bart and Read's method suggests procedures to solve this problem, so their method has been adopted in this study. This method is based on the axiom:

For dichotomous items, with a correct response scored "1" and an incorrect response scored "0", success on item i is considered a prerequisite to success on item j, if and only if the response pattern (0,1) for items i and j respectively does not occur. (Bart and Read, 1984, p.223)

Given items A and B which have been administered to N students, we produce a 2 x 2 table of response patterns as follows:

                              Item A
                        Pass (1)   Fail (0)
    Item B   Pass (1)     N11        N01        N1.
             Fail (0)     N10        N00        N0.
                          N.1        N.0        N

Figure 1: A 2x2 table of response patterns of item A and item B on dichotomous events.

If success on item A is necessary though not sufficient for success on item B (i.e. N01 = 0), then item A is a prerequisite to item B. All items which were found to be prerequisite for the development of the concepts of sound were included.

There are several methods to analyse item response patterns. These methods can be used to detect students who

need to be given remedial activities. The method adopted in this study is based on Student-Problem (S-P) curve theory. Harnisch (1983) states that this method also provides information about each item by observing the distribution of distracters above the P-curve:

An unusual item is one that has a large number of better than average students answering it incorrectly while an equal number of less than average students answer it correctly. (Harnisch, 1983, p.199)

Harnisch and Linn (1981) and Harnisch (1983) proposed the Modified Caution Index (MCI) formula. This formula was originally used to detect the characteristics of students based on their responses. Harnisch (1983) stated that the MCI can be used to detect the characteristics of items as well, by reversing the roles of students and items: "High MCI's for items indicate an unusual set of responses by students of varying ability, and thus these items should be examined closely by the test constructors and the classroom teacher" (p.199). This criterion will be adopted as one of the additional considerations in selecting items.

Thus the criteria used for the selection of appropriate test items can be restated as:

1. The facility value of the item lies between 20%-80%.
2. The discrimination index (D) is equal to or more than .20.

If these requirements are not met, additional information about the items is needed, such as:

3. The students' and teachers' comments about the item.
4. The prerequisite relationships of the item to other items.
5. The Modified Caution Index (MCI) of the item.

The final form of the test
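The first two criteria rest on the facility value and the Findley-style discrimination index. A minimal computation of both, assuming upper and lower groups of one third each (as in Findley's usage):

```python
def facility_value(item_correct):
    # Percentage of students answering the item correctly (Theobald, 1974).
    return 100.0 * sum(item_correct) / len(item_correct)

def findley_d(total_scores, item_correct, fraction=1 / 3):
    # Discrimination index D: proportion correct in the upper third minus
    # proportion correct in the lower third, grouped by total test score.
    order = sorted(range(len(total_scores)), key=lambda i: total_scores[i])
    k = max(1, int(len(order) * fraction))
    lower, upper = order[:k], order[-k:]
    p_upper = sum(item_correct[i] for i in upper) / k
    p_lower = sum(item_correct[i] for i in lower) / k
    return p_upper - p_lower
```

An ideal item, answered correctly by the whole upper group and by none of the lower group, yields D = 1.0; Theobald's guideline flags items with D < +.2 for scrutiny.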

There were 22 items of form-A which met the first two criteria. Several items which did not meet these criteria were considered for inclusion in the final form of the test after additional consideration. For example, item no 8 (facility value = 64, D = .15) received several comments that indicated confusion between a vacuum pump (pompa hampa udara) and a vacuum (ruang hampa udara); it is suggested that the description of a vacuum pump within the stem be rephrased. In addition, the MCI of this item is low (.05). Item no 15 will also be included in the final test because the MCI of the item is low (.21).

Other items were not acceptable for inclusion, although some of them could be acceptable after slight revision. For example, item no 21, which has a 50 percent facility value and a .18 index of discrimination, received some comments revealing that many students had not heard the term bulk modulus. Providing an explanation of the meaning of the bulk modulus might be expected to increase its facility and might alter the index of discrimination. However, this item is excluded because the MCI of this item is high (.77).

Six items from form-B were chosen to replace items of form-A on the basis of prerequisite relationships among sub units. Item no 1 (form-B) replaces item no 1 (form-A), item no 12 (B) replaces item no 19 (A), items no 13 (B) and 14 (B) replace items no 13 (A) and 14 (A) respectively, item no 17 (B) replaces item no 16 (A), item no 21 (B) replaces item no 21 (A), and items no 42 (B) and 44 (B) replace items no 40 (A) and 45 (A) respectively.
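The prerequisite relationships used for these replacements follow Bart and Read's (0,1)-pattern axiom, which reduces to a simple check over paired item responses:

```python
def is_prerequisite(item_i, item_j):
    # Bart & Read (1984): success on item i is a prerequisite to success on
    # item j iff the pattern (i incorrect, j correct) - cell N01 in the
    # 2x2 table - never occurs across students.
    return not any(ri == 0 and rj == 1 for ri, rj in zip(item_i, item_j))
```

Bart and Read additionally supply a statistical test so that a near-zero N01 is not over-interpreted; this sketch applies the axiom literally, which only suffices for illustration.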

Table 1 presents the number of items, their numbering in the original forms, and the facility values, discrimination indices, and MCIs of each item calculated from the initial trialling and from the second investigation. The facility values of the 32 items in the final version of the test fell between 20 and 80 percent; the increase in facility values over those in form-A and form-B was to be expected because these students had received instruction in the physics of sound. Similarly, all but 2 items had discrimination indices above .20.

There was strong pressure from physics teachers to include, in the final form of the test, items which would test students' conceptions of the transmission of sound at night and the influence of the force of gravity on sound. There were no such items in form-A or form-B, and two new items were constructed with great care and included in the final form. If these items had performed poorly in the second investigation (where the test was administered to 596 students from 19 schools) they could have been dropped for the experimental stage of the study. As it was, they performed as well as many other items and contributed to a total test reliability of .85 (Spearman-Brown; standard error of measurement = 1.77).

Table 1: Characteristics of items of the physics diagnostic test.


[The body of Table 1 was lost in extraction. Its columns were: item number, original item number, facility value, discrimination index, and MCI, with items grouped by sub unit (the generation of sound, the transmission of sound, the medium of transmission, ...).]




table of specifications test-to-test carryover effects, The discrimination power of the test the reliabilities (KR-20) the substitution of an uncommon term true-false items validity of tests List of refernces Anastasi, (1976). Psycho log i ca l tes t (4 th ed . ) . New Yo rk : i ng Mac l l an . mi Bart, W.M., & Krus, D.J., (1973). An ordering-theoretic method to determine hierarchies among items. Educa t i ona l and Psycho log i ca l Measu rement , 33, 291-300. Bart, W.M. & Read, S.A., (1984). A statistical test for prerequisite relations. Educational and Psychological Measurement, 44, 223227. Barton, A.R., (1979). A new statistical procedure for the analysis of hierarchy validation data. Research in Science Education, 9, 23-31 Bloom, B.S., Engelhart, M. D., Furst, E. J., Hill, W. H., & Krathwohl, D. R., (1956). Taxonomy of educational objectives: handbook 1: cognitive domain. London: Longman. Blum, A., (1979). The remedial effect of a biological game. Journal of Research in Science Teaching, 16(4), 333-338. Blum, A., & Azencot, M., (1986). Multiple choice versus equivalent essay questions in a national examination. European Journal of Science Education, 8(2), 225-228. Costin, F., (1970). The optimal number of alternatives in multiplechoice achievement tests: Some empirical evidence for a mathematical proof. Educational and Psychological Measurement, 30, 353-358. Dayton, C.M., & Macready, G.B., (1976). A probabilistic model for hierarchical relationships among intellectual skills and propositional logic tasks. Psvchometrica, 41, 189-204. Edgerton, H.A., (1960). A table for computing the phi coefficient. Journal of Applied Psychology, 44, 141-145.


Fisher, K.M., & Lipson, J.I., (1986). Twenty questions about student errors. J ou rna l o f Resea rch i n Sc ience Teach ing , 783 23(9 ) 803 . Gagne, R.M., & Paradise, N.E., (1961). Abilities and learning sets in knowledge acquisition. Psycho log i ca l Monographs , (14 , 75 who le , no . 518) . Green, K. (1984). Effects of item characteristics on multiple choice item difficulty. Educat i ona l and Psycho log i ca l Measurement , 44, 551-561. Green, K., Sax, G., & Michael, W.B., (1982). Validity and reliability of test having differing numbers of options for students of differing level of ability. Educa t i ona l and Psycho log i ca l Measurement , 42, 239-245. Grier, J.B., (1975). The number of alternatives for optimum test reliability. J ou rna l o f Educat i ona l Measurement 109-113. , 12, Grondlund, N.E., (1981). Measurement and eva lua t i on i n teach ing (4th ed.). New York: Collier Macmillan. Guilford, J.P., (1941). The phi coefficient and chi square as indices of item validity. Psychomet r i ka 6, 11-19. , Guilford, J.P., & Fruchter, B., (1983). Fundamenta l s ta t i s t i c s i n psycho logy and educat i on ed.). Tokyo: McGraw-Hill (7th Kogakusha. Hopkins, C.D., (1976). Educa t i ona l re sea rch : A s t ruc tu re f o r i nqu i r y. Colombus, Ohio: Merrill, Bell & Howell. Hopkins, C.D., (1980) Unders tand ing educa t i ona l resea rch : An i nau i r v approach . Ohio: Merrill. Hopkins, K.D., & Stanley, J.C., (1981). Educa t i ona l and psycho log i ca l measu rement and eva lua t. iEngiewood on Cliffs, New Jersey: Prentice-Hall. Jurgensen, C.E., (1947). Table for determining phi coefficients. Psvchomet r i ka 12(1), 17-29. , Kraemer, H. C., & Thiemann, S. (1987). How many sub jec ts? Sta t i s t i ca l power ana lys i s i n resea rch Sage. . London: Krus, D.J., & Bart, W.M., (1974). An ordering theoretic method of multidimensional scaling of items. Educat i ona l and Psycho log i ca l Measurement , 34, 525-535. Lord, F.M., (1944). 
Alignment chart for calculating the fourfold point correlation coefficient. Psychomet r i ka 9(1), 41-42. , Lord, F.M., (1977). Optimal number of choices per item - a comparison of four approaches. Journal of Educational Measurement, 14(1), 33-38.


McMorris, R.F., Brown, J.A., Snyder, G.W., & Pruzek, R.M., (1972). Effects of violating item construction principles. J ou rna l o f Educa t i ona l Measurement9(4), 287-295. , Mosier, C.I., & McQuitty, J.V., (1940). Methods of item validation and ABACS for item-test correlation and critical ratio of upper-lower difference. Psvchometrika, 5(1), 57-85. Noll, V.H., Scannell, D.P., & Craig, R.C., (1979). Introduction to educational measurement (4th ed.). Boston: Houghton Mifflin. Proctor, C.H., (1970). A probabilistic formulation and statistical analysis of Guttman scaling. Psychometrika, 35, 73-18. Theobald, J.H., (1974). Classroom testing. Principles and practice (2nd ed.). Melbourne: Longman Cheshire. Theobald, J.H., (1977). Attitudes and achievement in biology. Unpublished Ph.D. Thesis, Monash University. Wade, R.K., (1984/85). What makes a difference in inservice teacher education? A meta-analysis of research. Education Leadership, 42(4), 48-54. Walbasser, N.H., & Eisenberg, T.A., (1972). A review of research on behavioural objectives and learning hierarchies. Mathematics Education Reports. Columbus, Ohio: ERIC information analysis centre for science, mathematics and environmental education. ERIC no.ED059900. White, F.A., (1975). Our acoustic environment. New York: Wiley . White, H.E., (1968). Introduction to college physics. New York: Van Nostrand. White, M.W., Manning, K.V., & Weber, R.L., (1968). Basic physics. New York: McGraw-Hill. White, R.T., (1914a). A model for validation of learning hierarchies. Journal of Research in Science Teaching, 11(1), 1-3. White, R.T., (1974b). Indexes used in testing the validity of learning hierarchies. Journal of Research in Science Teaching, 11(1), 61-66. Yeany, R.H., Dsst, R.J., & Mathews, R.W., (1980). The effects of diagnostic-prescriptive instruction and locus of control on the achievement and attitudes of university students. Journal of Research in Science Teaching, 17(6), 537-543. 
Yeany, R.H., Kuch, Chin Yap, & Padilla, M.J., (1986). Analysing hierarchical relationships among modes of cognitive reasoning and integrated science process skills. Journal of Research in Science Teaching, 23(4), 277-291. Yeany, R.H., & Miller, P.A., (1980). The effect of diaqnostic/remediation: instruction on science learning: A 58

meta - ana lys i.sPaper presented at the annual meeting of the National Association for Research in Science Teaching. Boston, MA, April 11-13. ERIC no.ED187533. Yeany, R.H., Waugh, M.L., & Blalock, A.L., (1979). The effects of achievement diagnosis with feedback on the science achievement and attitude of university students. J ou rna l o f Re

