
CAAP

Technical Handbook
2011–2012

Collegiate Assessment of Academic Proficiency

Assessing academic achievement in Reading, Writing,
Mathematics, Science, and Critical Thinking
Collegiate Assessment of
Academic Proficiency (CAAP)
www.act.org/caap

CONTACT INFORMATION:

General Information CAAP Program Management


Postsecondary Assessment Services
ACT
500 ACT Drive, P.O. Box 168
Iowa City, IA 52243-0168
800.294.7027
E-mail: caap@act.org

Ordering Information CAAP Customer Services (70)


ACT
P.O. Box 1008
Iowa City, IA 52243-1008
319.337.1576
Fax: 319.337.1467
E-mail: caap@act.org

Research Information CAAP Research Services


ACT
500 ACT Drive, P.O. Box 168
Iowa City, IA 52243-0168
319.339.3089
Fax: 319.341.2284
E-mail: caapresearch@act.org
Table of Contents
Chapter 1 Overview of the CAAP Program . . . . . . . . . . . . . . 1
A. Purpose . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
B. Test Development . . . . . . . . . . . . . . . . . . . . . . 1
C. Uses of CAAP . . . . . . . . . . . . . . . . . . . . . . . . . 2
Chapter 2 Development Procedures . . . . . . . . . . . . . . . . . . . . 3
A. Test Specifications . . . . . . . . . . . . . . . . . . . . . . . . 3
B. Selection of Item Writers . . . . . . . . . . . . . . . . . . . 4
C. Item Construction . . . . . . . . . . . . . . . . . . . . . . . . 4
Chapter 3 Content of the CAAP Tests . . . . . . . . . . . . . . . . . . . 7
A. Writing Skills . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
B. Writing (Essay) . . . . . . . . . . . . . . . . . . . . . . . . . 8
C. Mathematics . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
D. Reading . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
E. Science . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 11
F. Critical Thinking . . . . . . . . . . . . . . . . . . . . . . . . 13
Chapter 4 Technical Characteristics . . . . . . . . . . . . . . . . . . . 14
A. Objective Tests . . . . . . . . . . . . . . . . . . . . . . . . . 14
B. Description of the Sample . . . . . . . . . . . . . . . . . 14
C. Item Characteristics . . . . . . . . . . . . . . . . . . . . . . 15
D. Objective Test Characteristics . . . . . . . . . . . . . . . 15
E. Essay Test Characteristics . . . . . . . . . . . . . . . . . 19
Chapter 5 Scaling and Equating . . . . . . . . . . . . . . . . . . . . . . 21
A. Establishing Scale Scores for the
Objective CAAP Tests . . . . . . . . . . . . . . . . . . . . 21
B. Establishing Scale Scores for the CAAP Subscores . . 23
C. Equating CAAP Scores . . . . . . . . . . . . . . . . . . . . 25
D. Reliability and Standard Error of Measurement . 25
E. User Norms . . . . . . . . . . . . . . . . . . . . . . . . . . . . 29

Chapter 6 Interpretation of Scores . . . . . . . . . . . . . . . . . . . . 30
A. Using CAAP Scores for Making Inferences
about Groups . . . . . . . . . . . . . . . . . . . . . . . . . . 30
B. Using CAAP Scores for Making Inferences
about Individuals . . . . . . . . . . . . . . . . . . . . . . . 32
C. Using CAAP Scores for Selection Decisions . . . . 33
Chapter 7 Validity Evidence for CAAP Test Scores . . . . . . . . 36
A. Content Validity of CAAP Scores . . . . . . . . . . . . 36
B. Criterion-Related Validity Evidence
for CAAP Scores . . . . . . . . . . . . . . . . . . . . . . . . 36
Appendix A CAAP Content Consulting Panel . . . . . . . . . . . . . 41
Appendix B Differential Item Functioning (DIF) Analyses
of CAAP Items . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
Appendix C CAAP Writing (Essay) Test Score-Point
Descriptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
References . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

Chapter 1

Overview of the CAAP Program


The Collegiate Assessment of Academic Proficiency (CAAP) was designed to assess academic
achievement in Reading, Writing (in both objective and essay test formats), Mathematics,
Science, and Critical Thinking. The unique modular format of CAAP offers institutions the
flexibility to select the assessment components that meet their educational goals and objectives.

A. PURPOSE
CAAP tests are used by both two- and four-year postsecondary institutions to measure the
academic progress of students and to help determine the educational development of
individual students.
1. Group Basis: CAAP is used to help institutions improve their instructional programs by
measuring student progress in acquiring core academic skills. Institutions concerned
with program evaluation use CAAP to provide evidence that general education
objectives are being met, document change in students’ performance levels from one
educational point to another, and provide differential performance comparisons in
general education instructional programs within an institution.
2. Individual Basis: CAAP is used to indicate a student’s readiness for further education,
identify interventions needed for subsequent student success, and assure some specified
level of skill mastery prior to graduation or program completion.

B. TEST DEVELOPMENT
CAAP was created in response to requests from several postsecondary institutions for a
standardized assessment that measures select academic skills typically obtained in a core
general education curriculum. In response to these requests, the ACT Board of Trustees
resolved in 1985 to develop “a new national testing program that will have as its primary
purpose a means for postsecondary institutions to verify that their students have developed
certain basic college-level academic skills....”
To help develop the test specifications for the CAAP program, ACT staff conducted an
extensive review of the relevant literature, brought together a national advisory committee,
and convened committees of college faculty. Test items were then developed by selected
faculty content specialists and thoroughly reviewed by faculty and measurement experts.
The initial test forms in Writing, Mathematics, Reading, and Critical Thinking were
introduced in the fall of 1988 and the Science Test was introduced in the fall of 1989.
CAAP was first field tested in the spring of 1988, and a two-year pilot study began in the
fall of 1988. The purpose of the two-year pilot study was to refine CAAP, pretest additional
items, develop user norms, and conduct research projects in support of the technical
aspects of the program. The pilot period ended in July 1990, and CAAP became fully
operational in August 1990. Since that time CAAP has undergone periodic reviews to
ensure that it continues to reflect relevant postsecondary curricula and to meet the needs
of its postsecondary institution users.

C. USES OF CAAP
1. Document achievement of selected general education objectives. This process
requires that institutions carefully define program objectives in measurable terms and
that they establish valid criteria to evaluate performance relevant to these objectives.
2. Indicate change from one educational level to another—“value-added.” If an
institution determines that the skills measured by CAAP are taught in its core courses,
then the difference between the mean score of a group of students who have
completed a particular core curriculum and the mean score of a group of students
who have not completed the core curriculum may be viewed as a measure of the
change that occurred during the course of the program. Note that this use of CAAP
is based on groups of students, not on individual students.
3. Compare local performance with that of other populations. CAAP user norms
may be helpful to a particular institution in determining how its students, as a group,
compare with students at the same level attending similar institutions across the
nation.
4. Establish requirements for eligibility to enter the junior year. CAAP has been
designed to provide information for institutions to make decisions regarding
academic achievement with both current and transfer students. With local research
support by an institution, CAAP may be useful in establishing eligibility thresholds
for upper-level academic coursework. CAAP may also serve as an indicator of the
need for intervention so that skills can be strengthened to help attain success in the
junior and senior years.
5. Establish other eligibility requirements. CAAP may be useful in establishing
individual student readiness for taking specific advanced courses. For example, CAAP
Writing Skills Test scores may be used to determine whether students are prepared
to take a required upper-division writing course in their area of study. Achieving a
particular minimum score on one or more CAAP tests may also be considered as one
of a variety of requirements for graduation; however, CAAP should not be the sole
determinant for course or graduation eligibility.

NOTE: Care should be taken when using CAAP results for these
purposes. Local research should be conducted on the specific
application of the CAAP program whenever possible. In
addition, CAAP results should be used in a manner that will
benefit students as well as institutions. ACT staff are willing to
work with institutions to help maximize the usefulness of
CAAP.

Chapter 2

Development Procedures
This chapter describes the policies and procedures that are used in developing CAAP tests.
The test development cycle involves several stages, beginning with the development of test
specifications.

A. TEST SPECIFICATIONS
To gain insight into the assessment concerns of the higher education community and to
assist ACT in the development of CAAP test specifications, an advisory committee was
formed in 1985, which consisted of individuals in higher education who were familiar
with assessing students at the postsecondary level. The committee encouraged ACT to
pursue development of CAAP and provided significant input regarding the types of
knowledge and skills and the relative levels of academic accomplishment that should
be the focus of the CAAP assessments.
Since that time, ACT has maintained contact with faculty and administrators at
postsecondary institutions nationwide to ensure that, while the basic structure of the
CAAP program specifications remains the same from year to year, they continue to
reflect users’ evolving curriculum and academic assessment needs. ACT maintains this
contact via periodic surveys, bulletins, and newsletters, as well as through ongoing
direct interactions with postsecondary personnel via phone, e-mail, and numerous panel
meetings, workshops, and professional conferences.
Content Specifications. The content specifications for the CAAP tests in each of five
areas (writing, mathematics, reading, critical thinking, and science) were developed
initially with the assistance of nationally known education consultants (see Appendix A).
These consultants met with ACT staff to determine both the specific topics and the
proportion of items for each topic to be covered by the tests. The consultants’ input was
based on their knowledge of the skill levels of precollege and college students and the
levels of skills necessary to perform in selected college-level courses. There was one
consultant panel for each of the subject areas, and each panel consisted of three to five
members with relevant backgrounds and expertise. Two separate panels for writing skills
were convened to focus separately on the Writing (Essay) Test and the objective Writing
Skills Test.
Statistical Specifications. The statistical specifications for the tests indicate the level of
difficulty (proportion correct) and minimum acceptable level of item discrimination
(biserial correlation) of the test items to be used.
The tests are constructed to have a mean item difficulty of about .63 for the CAAP user
population and a range of item difficulties from about .30 to .80. For the Writing (Essay)
Test, writing prompts selected for operational use were those that contained topics that
proved accessible to examinees while also producing a spread of scores across the score
scale. Writing (Essay) Test forms were constructed so that the mean difficulty was
equivalent across forms.

With respect to discrimination indices, each item is required to have a biserial
correlation of .30 or higher with the total score on the test measuring comparable content. For
example, science items should correlate .30 or higher with overall performance on the
Science Test. The distribution of biserial correlations was selected so that the tests would
effectively differentiate between students who know the material and those who do not.
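
As an illustration of these statistical screens, the short Python sketch below computes an item's difficulty (proportion correct) and its item-test biserial correlation and checks them against the limits described above. The simulated responses, function names, and screening thresholds supplied as arguments are assumptions made for illustration only; they do not represent ACT's operational item-analysis procedures.

import numpy as np
from scipy.stats import norm

def item_difficulty(item):
    # Proportion of examinees answering the item correctly (0/1 scoring).
    return item.mean()

def item_biserial(item, total):
    # Item-test biserial correlation for a dichotomously scored item.
    p = item.mean()
    q = 1.0 - p
    y = norm.pdf(norm.ppf(p))                  # N(0,1) ordinate at the p/q split
    m1 = total[item == 1].mean()               # mean total score, item answered correctly
    m0 = total[item == 0].mean()               # mean total score, item answered incorrectly
    return (m1 - m0) / total.std(ddof=0) * p * q / y

def passes_screen(item, total, diff_lo=0.30, diff_hi=0.80, min_biserial=0.30):
    # Illustrative screening limits taken from the specifications described above.
    return diff_lo <= item_difficulty(item) <= diff_hi and item_biserial(item, total) >= min_biserial

# Simulated tryout data: 500 examinees and 40 items driven by a common ability.
rng = np.random.default_rng(0)
ability = rng.normal(size=(500, 1))
location = rng.uniform(-1.5, 1.0, size=(1, 40))
data = (rng.random((500, 40)) < 1.0 / (1.0 + np.exp(-(ability - location)))).astype(int)
totals = data.sum(axis=1)
kept = [j for j in range(data.shape[1]) if passes_screen(data[:, j], totals)]
print(len(kept), "of", data.shape[1], "simulated items pass the screen")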

B. SELECTION OF ITEM WRITERS


ACT contracts with teaching faculty in the disciplines assessed by CAAP to construct the
test items and writing prompts. ACT makes every attempt to include item writers who
represent the diversity of the population of the United States with respect to ethnic
background, gender, and geographic location.
Before writing items or prompts, potential writers are required to submit a sample set
of materials for review. An Item Writer’s Guide specific to the appropriate content area
is sent to each item writer. The guides include examples of items and provide item
writers with test specifications and ACT’s requirements for content and style. Also
included are specifications for fair portrayal of all groups of individuals, avoidance of
subject matter that may be unfamiliar to members of certain groups within society, and
nonsexist use of language.
ACT test development staff evaluate each set of sample test items submitted by a
potential item writer. A decision concerning whether or not to contract with the item
writer to construct items for CAAP is made on the basis of that evaluation.

C. ITEM CONSTRUCTION
Each item writer under contract is given an assignment to produce a small number of
multiple-choice items or writing prompts. The size of the assignment ensures that a
diversity of material will be obtained for the test and that the security of the testing
program will be maintained.
The task of the item writer is to create items that are educationally important as well as
psychometrically sound. Even with good writers, many items fail to meet ACT standards;
therefore, a large number of items must be constructed. Each item writer submits a set
of items or multiple prompts called a unit, in a given content area. Item writers work
closely with ACT test specialists, who assist them in producing items of high quality that
meet all relevant test specifications.
Internal Review of Items. After a unit is accepted, it is edited to meet ACT specifications
for content, accessibility, language, item classification, item format, and word count. A
content specialist in the appropriate area checks each item to ensure that it has the
appropriate content classification, that it has one and only one correct answer, and that
the incorrect alternatives (foils) are plausible, but incorrect. ACT staff also ensure that the
test items and writing prompts are at the appropriate cognitive level for the intended
examinee group.
During the editorial process, each unit is reviewed several times by both content and
editorial staff to ensure that it meets ACT standards for content accuracy, style and
format conventions, and editorial quality.

Materials are also reviewed to ensure that they are consistent with a broad array of
professional standards, including the Standards for Educational and Psychological Testing
(APA, NCME, AERA, 1999) and the Code of Fair Testing Practices (NCME, 1995). All test
materials are also reviewed internally for fair portrayal, balanced representation of
societal groups, and use of nonsexist language.
External Review of Items. After the initial rounds of internal editing, each unit is
reviewed from at least two different perspectives by consultants commissioned by ACT:
content reviews and fairness reviews. For the content review, each unit is sent to a
consultant specializing in the appropriate content area who has been previously trained
by ACT staff. Content consultants are recruited from among college faculty who teach
in the content areas they are asked to review. The consultant verifies the accuracy of the
content and the answer keys, and ensures that the items and any stimulus materials are
of the right educational level and cognitive complexity for the intended student
population. Content consultants send written comments back to ACT for each unit.
For the fairness reviews, potential test items and prompts are sent to a special group of
reviewers who evaluate the fairness of the materials from multicultural perspectives.
Consultants are selected from five particular societal groups: African Americans, Asian
Americans, Hispanic Americans, Native Americans, and Women. At least one fairness
reviewer is selected from each group. Participants are sent all units being considered
for inclusion in the CAAP test forms and asked to review the materials for fair portrayal
of all societal groups and to ensure that the context within which items are embedded
is not differentially familiar to different societal groups. Each consultant is asked to
develop written comments that are then discussed among the consultants and with ACT
staff in a group teleconference. After the teleconference, consultants are asked to
forward their written comments to ACT.
ACT staff review all comments returned by the content and fairness consultants and
make appropriate changes to the unit as needed. Should there be disagreement with a
consultant’s suggestions concerning change in content, proposed changes are discussed
with that consultant. If significant content changes are made in the unit as a result of a
consultant’s comments, those revisions are sent back to the consultant to confirm that
the changes were correctly made.
Item Tryouts. The multiple-choice items that are judged acceptable in the internal and
external review processes are assembled into tryout units; several tryout units are then
combined into item tryout test booklets. The Writing (Essay) Test prompts are tried out
in booklets composed of pairs of prompts. CAAP tryout units are administered in special
studies under standardized conditions to a sample of college-bound high school
students and to students enrolled in two- and four-year postsecondary institutions who
are representative of the total CAAP examinee population. All booklets are spiraled
(e.g., Form 1, Form 2, Form 1, Form 2...) for administration to ensure random
assignment of items or prompts to students. Each examinee in a sample study is
administered a tryout booklet from one of the areas covered by the CAAP tests. Time
limits for the tryout units are sufficient to permit the majority of students to respond to
all test items. For the Writing (Essay) Test prompts, examinees are provided 25 minutes
to respond to each prompt.

Item Analysis of Tryouts. Item analyses are performed on the tryout data from the
multiple-choice tests. The item analyses serve to identify statistically effective multiple-
choice test questions. Items that are either too difficult or too easy and those that fail to
discriminate between students of high and low educational development, as measured
by their tryout test scores, are eliminated or revised. Revised items may be reevaluated
via internal and external reviews before being resubmitted for field testing. The review
process also provides feedback that helps to decrease the incidence of poor-quality items
in the future.
Papers from the Writing (Essay) Test tryouts are scored by two raters according to
predetermined scoring criteria. Raters receive extensive training in the application of the
scoring criteria. Frequency distributions of the tryout results are reviewed for similarity
and dispersion. Prompts that distribute equivalent groups of examinees similarly and
across all score points are considered to be successful.
Assembly of New Forms. Items judged acceptable in the review process are placed in
an item pool. Preliminary forms of the CAAP tests are then constructed by selecting
items from this pool that meet both the content and statistical requirements of the tests.
For each test in the battery, items for the new forms are selected to match the content
specifications outlined in Chapter 3. Items are also selected to comply with desired
statistical specifications. Technical characteristics of the CAAP test items are described
in Chapter 4.
The preliminary versions of the test forms are subjected to intensive review to ensure
that the items are accurate and that the overall test forms conform to good test
construction practice. Items are again checked for content accuracy and conformity to
ACT style. Items are also reviewed to ensure that they are free of clues or cues that could
allow test-wise students to answer other items on the form correctly. The assembled
forms are then reviewed by consultants who verify content and keys and prepare
written justifications for each test item in the form that indicate why the keyed response
is correct and why the other alternatives are incorrect.
ACT staff members review comments from the consultants and make any necessary
changes to the items or passage. Whenever significant changes are made, the revised
components are again reviewed by the appropriate consultants and by ACT staff. If
additional content questions are raised, other consultants are asked to review the test
form. When no further corrections are needed, the test form is prepared for printing.
Prior to formal introduction, final forms of the tests are involved in special equating,
norming, and scaling studies. These procedures are discussed in detail in Chapter 5. In
addition, some studies can only be conducted effectively when forms are actively being
used. When sufficient examinee data becomes available, Differential Item Functioning
(DIF) analyses are conducted to check that test items within a test form do not function
differently for members of some societal groups (see Appendix B).

Chapter 3

Content of the CAAP Tests


A. WRITING SKILLS
The CAAP Writing Skills Test is a 72-item, 40-minute test measuring students’ understanding
of the conventions of standard written English in punctuation, grammar, sentence structure,
strategy, organization, and style. Spelling, vocabulary, and rote recall of rules of grammar
are not tested.
The test consists of six prose passages, each of which is accompanied by a set of
12 multiple-choice test items. A range of passage types is used to provide a variety of
rhetorical situations.
Scoring. Three CAAP Writing Skills scores are reported: 1) a total test score based on all
72 items; 2) a subscore in Usage/Mechanics based on the 32 punctuation, grammar, and
sentence structure items; and 3) a subscore in Rhetorical Skills based on the 40 strategy,
organization, and style items.
Usage/Mechanics. Items that measure usage and mechanics offer alternative responses,
including “NO CHANGE,” to underlined portions of the text. The student must decide
which alternative employs the conventional practice in usage and mechanics that best fits
the context.
• Punctuation. Use and placement of commas, colons, semicolons, dashes, parentheses,
apostrophes, and quotation, question, and exclamation marks.
• Grammar. Adjectives and adverbs, conjunctions, and agreement between subject and
verb and between pronouns and their antecedents.
• Sentence Structure. Relationships between/among clauses, placement of modifiers, and
shifts in construction.
Rhetorical Skills. Items that measure rhetorical skills may refer to an underlined portion
of the text or may ask a question about a section of the passage or about the passage as a
whole. The student must decide which alternative response is most appropriate in a given
rhetorical situation.
• Strategy. Appropriateness of expression in relation to audience and purpose,
strengthening of writing with appropriate supporting material, and effective choice of
statements of theme and purpose.
• Organization. Organization of ideas and relevance of statements in context (order,
coherence, unity).
• Style. Precision and appropriateness in the choice of words and images, rhetorically
effective management of sentence elements, avoidance of ambiguous pronoun
references, and economy in writing.

Table 3.1 summarizes the primary content elements and typical proportions of the CAAP
Writing Skills Test devoted to each element.

Table 3.1 Content Specifications Summary for the CAAP Writing Skills Test

Content Category                Proportion of Test    Number of Items

Usage/Mechanics .44 32
Punctuation .08 6
Grammar .11 8
Sentence Structure .25 18
Rhetorical Skills .56 40
Strategy .21 15
Organization .14 10
Style .21 15
TOTAL 1.00 72

B. WRITING (ESSAY)
The CAAP Writing (Essay) Test is predicated on the assumption that the skills most
commonly taught in college-level writing courses and required in upper-division college
courses across the curriculum include:
• Formulating an assertion about a given issue
• Supporting that assertion with evidence appropriate to the issue, position taken, and a
given audience
• Organizing and connecting major ideas
• Expressing those ideas in clear, effective language
The model developed by ACT for the writing test is designed to elicit responses that
demonstrate a student’s ability to perform these skills.
Each of two 20-minute writing tasks is defined by a short prompt that identifies a specific
hypothetical situation and audience. The hypothetical situation involves an issue on which
the examinee must take a stand. The examinee is instructed to take a position on the issue
and explain to the audience why the position taken is the better (or best) alternative.
In order to more clearly define the audience and provide a focus for responses, each
prompt specifies the basis upon which the audience will make its decision. Situations and
audiences defined in the writing prompts are constructed so that the required background
knowledge and experience are within the command of college sophomores.
Scoring. For the CAAP Writing (Essay) Test, ACT developed a six-point, modified-holistic
scoring system. Two trained raters read each of the two essays. Raters independently score
their assigned essay on a scale from 1 to 6 (1 being the lowest, 6 the highest).
Scores from the two raters for each essay are averaged to get a reported score for that essay,
which ranges from 1 to 6 in increments of .5. The two raters’ scores for each essay must
either agree or be adjacent (i.e., differ by no more than 1 point) to be averaged. If the
raters’ scores differ by two or more points, a chief scorer adjudicates and determines the
reported score.
After a score has been assigned for each of the two essays, a composite score is calculated
by averaging the two essay scores. The composite score is reported on a scale from 1 to 6
in increments of .25 points.
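
The arithmetic of these scoring rules can be summarized in the minimal Python sketch below; the function names and example ratings are illustrative only.

def essay_score(rater1, rater2, adjudicated=None):
    # Reported score for one essay on the 1-6 scale: average the two raters'
    # scores when they agree or are adjacent; otherwise use the chief scorer's score.
    if abs(rater1 - rater2) <= 1:
        return (rater1 + rater2) / 2.0
    if adjudicated is None:
        raise ValueError("Ratings differing by two or more points require adjudication.")
    return float(adjudicated)

def composite_score(essay1, essay2):
    # Composite Writing (Essay) score: the average of the two essay scores,
    # which falls on the 1-6 scale in increments of .25.
    return (essay1 + essay2) / 2.0

e1 = essay_score(4, 5)            # adjacent ratings average to 4.5
e2 = essay_score(3, 3)            # identical ratings average to 3.0
print(composite_score(e1, e2))    # prints 3.75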
Each score point reflects a student’s ability to perform the skills identified above. Essays
are evaluated according to how well a student formulates a clear assertion on the issue
defined in the prompt, supports that assertion with reasons and evidence appropriate to
the position taken and the specified concerns of the audience, and develops the argument
in a coherent and logical manner. A student obtains lower scores for not taking a position
on the specified issue, for not developing the argument, or for not expressing those ideas
in clear, effective language. A student who does not respond to the prompt is assigned a
“not rateable” indicator rather than a score on the 1 to 6 scale.
The purpose of the scoring system is to assess a student’s ability to perform the writing
task defined in the prompt, given a timed, first draft situation. A description of the Essay
scoring system is presented in Appendix C.

C. MATHEMATICS
The CAAP Mathematics Test is a 35-item, 40-minute test designed to measure students’
proficiency in mathematical reasoning. The test assesses students’ proficiency in solving
mathematical problems encountered in many college-level mathematics courses and
required in upper-division courses in mathematics and other disciplines. It emphasizes
quantitative reasoning rather than the memorization of formulas. The content areas tested
include prealgebra; elementary, intermediate, and college algebra; coordinate geometry;
and trigonometry. Descriptions of the content areas and the approximate proportions of
items in each are provided below.
• Prealgebra. Items in this category involve operations with whole numbers, decimals,
and fractions; order concepts; percentages; averages; exponents; scientific notation; and
similar concepts.
• Elementary Algebra. Items in this category involve basic operations with polynomials,
setting up equations, and substituting values into algebraic expressions. They may also
require the solution of linear equations in one variable and other related topics.
• Intermediate Algebra. Items in this category assess students’ understanding of
exponents, rational expressions, and systems of linear equations. Other concepts such as
the quadratic formula and absolute value inequalities may also be tested.
• Coordinate Geometry. Knowledge and skills assessed in this category may include
graphing in the standard coordinate plane or the real number line, graphing conics,
linear equations in two variables, graphing systems of equations, and similar types of
skills.
• College Algebra. Items in this category are based on advanced algebra concepts
including rational exponents, exponential and logarithmic functions, complex numbers,
matrices, inverses of functions, and domains and ranges.
• Trigonometry. Items in this category include concepts such as right triangle
trigonometry, graphs of trigonometric functions, basic trigonometric identities, and
trigonometric equations and inequalities.

Scoring. Three scores are reported for the CAAP Mathematics Test: a total test score based
on all 35 items, a subscore in Basic Algebra based on 17 items, and a College Algebra
subscore based on the remaining 18 items. The Basic Algebra subscore is composed of test
questions from the Prealgebra, Elementary Algebra, Intermediate Algebra, and Coordinate
Geometry content areas. The College Algebra subscore is composed of test questions from
the College Algebra and Trigonometry content areas.
Table 3.2 summarizes the primary content elements and typical proportions of the CAAP
Mathematics Test devoted to each element.

Table 3.2 Content Specifications Summary for the CAAP Mathematics Test

Content Category                Proportion of Test    Number of Items

Basic Algebra .49 17
Prealgebra .09–.14 3–5
Elementary Algebra .09–.14 3–5
Intermediate Algebra .09–.11 3–4
Coordinate Geometry .14–.18 5–6
College Algebra .51 18
College Algebra .40 14
Trigonometry .11 4
TOTAL 1.00 35

D. READING
The CAAP Reading Test is a 36-item, 40-minute test that measures reading comprehension
as a combination of skills that can be conceptualized in two broad categories: referring and
reasoning. Students are presented with four passages, each with nine test questions that
assess students’ understanding of the information presented in the passages.
Referring Skills. Test items that focus on referring skills require the student to derive
meaning from text by identifying and interpreting specific information that is explicitly
stated. Typical items of this type require students to recognize main ideas of paragraphs
and passages, to identify important factual information, and to identify relationships among
different components of textual information.
Reasoning Skills. Test items that focus on reasoning skills require students to determine
implicit meanings and go beyond the information that is explicitly presented. Typical items
in this category assess students’ ability to determine meaning from context, infer main ideas
and relationships, generalize and apply information beyond the immediate context, draw
appropriate conclusions, and make appropriate comparisons.
Passage Formats. The CAAP Reading Test consists of four prose passages of about
900 words each that are representative of the level and kinds of writing commonly
encountered in college curricula. Each passage is accompanied by a set of nine multiple-
choice test items that focus on the set of complementary and mutually supportive skills that
readers must use in studying written materials across a range of subject areas.

The four reading passages come from the following four general content areas, one passage
from each area:
• Prose Fiction. Entire stories or excerpts from short stories or novels.
• Humanities. Art, music, philosophy, theater, architecture, dance.
• Social Studies. History, political science, economics, anthropology, psychology,
sociology.
• Natural Sciences. Biology, chemistry, physics, physical sciences.
Scoring. Three scores are reported for the CAAP Reading Test. A total test score is reported
based on all 36 items. In addition, two subscores are reported based on passage content
areas: The Arts/Literature subscore is based on the 18 items in the Prose Fiction and
Humanities sections of the test, and the Social Studies/Sciences subscore is based on the
18 items in the Social Studies and Natural Science sections of the test.
Table 3.3 summarizes the primary item-content categories and typical proportions of the
CAAP Reading Test devoted to each category.

Table 3.3 Content Specifications Summary for the CAAP Reading Test

Content Category                Proportion of Test    Number of Items

Referring .25–.33 9–12
Reasoning .67–.75 24–27
TOTAL 1.00 36

E. SCIENCE
The CAAP Science Test is a 45-item, 40-minute test designed to measure students’
knowledge and skills in science. The contents of the Science Test are drawn from biological
sciences (e.g., biology, botany, and zoology), chemistry, physics, and the physical sciences
(e.g., geology, astronomy, and meteorology). The test emphasizes scientific knowledge and
reasoning skills rather than a high level of skill in mathematics or reading. A total score is
provided for the Science Test; no subscores are provided.
Passage Formats. The Science Test consists of eight passage sets, each of which contains
scientific information and a set of multiple-choice test questions. A passage may conform
to one of the three different formats listed below.
• Data Representation. This format presents students with graphic and tabular material
similar to that found in science journals and texts. The items associated with this format
measure knowledge and skills such as graph reading, interpretation of scatterplots, and
interpretation of information presented in tables, diagrams, and figures.
• Research Summaries. This format provides students with descriptions of one
experiment or of several related experiments. The items focus on the design of
experiments and the interpretation of experimental results. The stimulus and items are
written expressly for the Science Test, and all relevant information is presented in the
text of the stimulus or in the test questions.

• Conflicting Viewpoints. This format presents students with several hypotheses or views
that are mutually inconsistent owing to differing premises, incomplete or disputed data,
or differing interpretations of data. The stimuli may include illustrative charts, graphs,
tables, diagrams, or figures. Items in this format measure students’ knowledge and skills
in understanding, analyzing, and comparing alternative viewpoints or hypotheses.
Table 3.4 lists three passage formats and the approximate proportions of the Science Test
devoted to each.

Table 3.4 Passage Format Summary for the CAAP Science Test

Passage Format                  Proportion of Test    Number of Items

Data Representation .33 15
Research Summaries .54 24
Conflicting Viewpoints .13 6
TOTAL 1.00 45

Item Classifications. The 45 test items in the Science Test can be conceptualized in three
major groups. Each group is meant to address an important major element of scientific
inquiry. The groups are listed below, along with brief descriptions of typical knowledge
and skills tested.
• Understanding. Identify and evaluate scientific concepts, assumptions, and components
of an experimental design or process; identify and evaluate data presented in graphs,
figures, or tables; translate given data into an alternate form.
• Analyzing. Process information needed to draw conclusions or to formulate hypotheses;
determine whether information provided supports a given hypothesis or conclusion;
evaluate, compare, and contrast experimental designs or viewpoints; specify alternative
ways of testing hypotheses or viewpoints.
• Generalizing. Extend information given to a broader or different context; generate a
model consistent with given information; develop new procedures to gain new
information; use given information to predict outcomes.
Table 3.5 lists the three item classification groups and the approximate proportions of the
Science Test devoted to each.

Table 3.5 Item Classification Summary for the CAAP Science Test

Item Classification             Proportion of Test    Number of Items

Understanding .18–.22 8–10
Analyzing .49–.53 22–24
Generalizing .27–.33 12–15
TOTAL 1.00 45

F. CRITICAL THINKING
The CAAP Critical Thinking Test is a 32-item, 40-minute test that measures students’ skills
in analyzing, evaluating, and extending arguments. An argument is defined as a sequence
of statements that includes a claim that one of the statements, the conclusion, follows from
the other statements. The Critical Thinking Test consists of four passages that are
representative of the kinds of issues commonly encountered in a postsecondary
curriculum.
A passage typically presents a series of subarguments in support of a more general
conclusion or conclusions. Each passage presents one or more arguments using a variety
of formats, including case studies, debates, dialogues, overlapping positions, statistical
arguments, experimental results, or editorials. Each passage is accompanied by a set of
multiple-choice test items.
Table 3.6 summarizes the primary content categories and the approximate proportions of
the test devoted to each category. A total score is provided for the Critical Thinking Test;
no subscores are provided.

Table 3.6 Content Specifications Summary for the CAAP Critical Thinking Test

Content Category                Proportion of Test    Number of Items

Analysis of elements of an argument .53–.66 17–21
Evaluation of an argument .16–.28 5–9
Extension of an argument .19 6
TOTAL 1.00 32

Chapter 4

Technical Characteristics
A. OBJECTIVE TESTS
The following sections present information about the items that make up the CAAP multiple-
choice tests. The data used for the technical analyses of the CAAP tests, except for
mathematics, were collected from examinees who tested during the four-year period from
1998 to 2001. Because the mathematics tests were revised and reintroduced nationally in
September 2002, examinee data were available only from the concordance studies conducted
on the revised mathematics forms administered from spring 2001 through spring 2002. These
data are reported for the mathematics forms in the tables throughout this chapter.
Major demographic information for the samples of students who contributed to the national
data and to the mathematics concordance data is provided in Table 4.1.

B. DESCRIPTION OF THE SAMPLE


Data for the objective test results reported in this chapter are based on students who
participated in national administrations of CAAP during the period from 1998 to 2001,
except for the CAAP mathematics test, for which data came from concordance studies
conducted in selected locations during the period from spring 2001 through spring 2002.
Major demographic data for these students are summarized in Table 4.1.

Table 4.1 Percentage of Students in Major Demographic Categories in
National Administrations and Mathematics Concordance Studies

Demographic Category    Writing Skills    Reading    Math    Science    Critical Thinking
Gender
Female 60 60 61 60 62
Male 40 40 39 40 38
Ethnicity
Caucasian 76 77 77 79 74
African American 13 12 13 10 12
Hispanic/Latino 2 2 2 2 4
Other/No Resp. 5 6 8 5 5
Grade Level
Freshman 9 9 <1 9 23
Sophomore 58 63 79 62 46
Junior 22 21 19 21 18
Senior 9 6 2 7 10
Total Sample Size 80,010 68,627 1,975 52,302 69,790

Note: Percentages may not sum to 100% for some demographic categories due to
rounding and nonresponse by some students.

C. ITEM CHARACTERISTICS
Item Difficulty. The item difficulty index reported here is the proportion of examinees
who correctly answer an item. This index is not solely the property of an item; it is a
function of both the item and the ability of the group responding to the item. The higher
the item difficulty index (proportion correct), the easier the item for the specific groups of
examinees involved.
Table 4.2 summarizes the distributions of item difficulties obtained for two forms of the tests
in the CAAP battery. The table provides the means and standard deviations of the difficulty
indices for the items in each test form and the sample sizes on which these data are based.
Item Discrimination. Item discrimination is the degree to which an item differentiates
between students who performed well on the relevant total test and those who did not.
Item-test biserial correlations are reported as indices of item discrimination; the mean item
discrimination is the average correlation between students’ responses to the items and their
total scores on the test.
Table 4.3 provides the mean item discriminations obtained for two forms of the tests in the
CAAP battery; standard deviations and sample sizes for each of the item discrimination
indices are also included in the table. With mean biserial correlations ranging from .49 to
.55, the discrimination indices indicate that the items are contributing substantially to the
measurement provided by each of the CAAP tests.

D. OBJECTIVE TEST CHARACTERISTICS


Completion Rate. Completion rate is used as an indication of the relative speededness of
the CAAP tests. The index reported is the proportion of examinees who responded to the
last five items on the test within the allotted testing time. It is a measure of the relative
length of the test for a given time limit and sample of examinees. Tests with higher
completion rates are more easily completed within the allotted time by a given group of
examinees than are tests with lower completion rates.
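
The sketch below illustrates one reading of this index, computed in Python under the assumption that an examinee has responded to the last five items if he or she marked an answer to every one of them; the data, missing-response coding, and function names are illustrative, not ACT's operational definition or software.

import numpy as np

def completion_rate(responses, last_k=5):
    # responses: examinees x items array of item scores; np.nan marks an omitted item.
    # Here an examinee counts as completing if all of the last last_k items were answered.
    last_items = responses[:, -last_k:]
    return (~np.isnan(last_items).any(axis=1)).mean()

# Four examinees on an eight-item test; the last examinee omits the final two items.
resp = np.array([
    [1, 0, 1, 1, 0, 1, 1, 0],
    [1, 1, 1, 0, 1, 1, 0, 1],
    [0, 1, 0, 1, 1, 0, 1, 1],
    [1, 1, 0, 1, 0, 1, np.nan, np.nan],
], dtype=float)
print(completion_rate(resp))      # prints 0.75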
Table 4.4 provides the completion rate for two forms of each CAAP test. As the results show,
the tests have high completion rates. The completion rates are in the 90 percent range for
all tests except Writing Skills, for which completion rates are in the mid-80 percent range.
Reliability. Reliability is an estimate of the consistency of test scores across repeated
measurements. The primary approach to this estimation is Kuder-Richardson Formula 20
(Kuder & Richardson, 1937), an internal consistency estimate. The appropriateness of the
Kuder-Richardson estimate depends upon several considerations and assumptions, of which
speededness, unidimensionality, and form-to-form comparability are primary.
Kuder-Richardson Formula 20 (K-R 20) reliability estimates are reported in Table 4.5 for two
forms of the CAAP examinations. For tests of a given length, the K-R 20 measures the extent
to which all items in a test are correlated with one another. In general, longer tests tend to
be more reliable than shorter tests.
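
For reference, the K-R 20 coefficient can be computed from dichotomously scored (0/1) item responses as in the brief Python sketch below; the simulated responses are an illustrative assumption and do not reproduce CAAP data.

import numpy as np

def kr20(responses):
    # Kuder-Richardson Formula 20 for 0/1-scored items; responses is examinees x items.
    k = responses.shape[1]
    p = responses.mean(axis=0)                       # item difficulties (proportion correct)
    total_var = responses.sum(axis=1).var(ddof=0)    # variance of the raw total scores
    return (k / (k - 1.0)) * (1.0 - (p * (1.0 - p)).sum() / total_var)

# Simulated 36-item test whose items share a common ability, so the estimate is nontrivial.
rng = np.random.default_rng(1)
ability = rng.normal(size=(1000, 1))
location = rng.uniform(-1.0, 1.0, size=(1, 36))
resp = (rng.random((1000, 36)) < 1.0 / (1.0 + np.exp(-(ability - location)))).astype(int)
print(round(kr20(resp), 2))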

Table 4.2 Item Difficulty Distributions and Summary Data for CAAP Forms 11 and 12

Writing Skills Reading Skills Mathematics Science Critical Thinking

Difficulty Range1 11A 12A 11A 12A 11G 12G 11A 12A 11A 12A
.10–.19 0 0 0 0 4 3 0 0 0 0
.20–.29 0 1 0 0 7 9 3 3 0 0
.30–.39 1 2 1 1 3 2 4 1 2 2
.40–.49 4 6 4 4 4 5 12 6 8 4
.50–.59 9 22 10 6 5 5 7 15 6 8
.60–.69 15 12 5 12 4 4 11 8 8 12
.70–.79 24 17 12 7 7 6 5 7 7 6
.80–.89 13 11 4 6 0 1 2 5 1 0
.90–.99 6 1 0 0 1 0 1 0 0 0
Total Number of Items 72 72 36 36 35 35 45 45 32 32
Mean Difficulty .72 .65 .65 .65 .48 .46 .55 .58 .59 .60
Standard Deviation .14 .14 .12 .13 .22 .21 .15 .15 .13 .11
Total Sample Size 40,954 39,056 37,095 31,532 989 986 29,677 22,625 26,451 43,339

1. Proportion of examinees correctly responding to each item


Table 4.3 Item Discrimination Distributions and Summary Data for CAAP Forms 11 and 12

Writing Skills Reading Skills Mathematics Science Critical Thinking

Discrimination Range1 11A 12A 11A 12A 11G 12G 11A 12A 11A 12A
.00–.09 0 1 0 0 0 0 0 0 0 0
.10–.19 0 0 0 1 1 3 0 0 0 0
.20–.29 2 1 0 0 1 1 2 1 0 2
.30–.39 8 9 1 2 3 3 6 7 2 2
.40–.49 18 20 11 5 8 7 13 17 10 6
.50–.59 18 23 17 18 11 10 18 14 10 10
.60–.69 16 14 7 9 8 8 6 6 7 9
.70–.79 10 4 0 1 3 3 0 0 3 3
Total Number of Items 72 72 36 36 35 35 45 45 32 32
Mean .55 .51 .53 .54 .52 .51 .49 .49 .54 .54
Standard Deviation .13 .12 .08 .10 .13 .16 .10 .10 .10 .12
Total Sample Size 40,954 39,056 37,095 31,532 989 986 29,677 22,625 26,451 43,339

1. Based on item-test biserial correlation


Table 4.4 Completion Rates for CAAP Test Forms 11 and 12

Test                 Form    Completion Rate    Sample Size

Writing Skills       11A          .86             40,954
                     12A          .84             39,056
Reading Skills       11A          .96             37,095
                     12A          .96             31,532
Mathematics          11G          .95                989
                     12G          .97                986
Science              11A          .92             29,677
                     12A          .93             22,625
Critical Thinking    11A          .96             26,451
                     12A          .95             43,339

Table 4.5 Internal Consistency Reliability Estimates (K-R 20)
for CAAP Test Forms 11 and 12

Test                 Form    K-R 20    Sample Size

Writing Skills       11A      .92        40,954
                     12A      .92        39,056
Reading Skills       11A      .85        37,095
                     12A      .86        31,532
Mathematics          11G      .84           989
                     12G      .84           986
Science              11A      .86        29,677
                     12A      .86        22,625
Critical Thinking    11A      .85        26,451
                     12A      .85        43,339

E. ESSAY TEST CHARACTERISTICS
Seven forms of the CAAP Writing (Essay) Test are available for administration. Each
form includes two different essay prompts, and two raters score each essay. The Writing
(Essay) Test is assigned one composite score on a scale of 1 to 6, which
represents the average across both prompts. Descriptions of the score points are included
in Appendix C.
Tables 4.6 and 4.7 provide descriptive statistics, estimates of inter-rater reliability, the
percent of perfect agreement, and the percent of adjacent agreement for the seven CAAP
Writing Essay forms. All figures are broken down by form and by essay. The results
provided in these tables indicate that the CAAP forms were very similar in terms of overall
performance and consistency of scoring. The inter-rater reliability estimates ranged from
.68 to .74 for the individual essay scores. The percent of perfect agreement was very
consistent across essays, ranging from .70 to .78. Both of these indices indicate that the
raters were able to apply the CAAP rubric in a consistent manner.
Each form was constructed to be equivalent in difficulty based on field test results. Table
4.8 indicates that the seven forms are similar in overall difficulty, with mean scores ranging
from 3.00 to 3.31.
Note: Each student essay was scored by two raters. If the scores were not in perfect or
adjacent agreement (within one point of each other), a third rater resolved the difference
in the assigned scores. The two scores assigned by the original raters were compared in
order to determine how closely different raters agree on the assigned scores.
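
The agreement indices in Tables 4.6 and 4.7 can be illustrated with the short Python sketch below. The ratings are hypothetical, and the use of a Pearson correlation as the inter-rater reliability index is an assumption made for illustration; the handbook does not specify the exact estimator used.

import numpy as np

def agreement(rater1, rater2):
    # Percent perfect agreement (identical scores) and percent adjacent agreement
    # (scores differing by exactly one point) between two raters on the 1-6 scale.
    d = np.abs(np.asarray(rater1) - np.asarray(rater2))
    return (d == 0).mean(), (d == 1).mean()

# Hypothetical ratings for eight essays; the 6-versus-4 pair would go to a chief scorer.
r1 = [4, 3, 5, 2, 4, 3, 6, 3]
r2 = [4, 3, 4, 2, 5, 3, 4, 3]
perfect, adjacent = agreement(r1, r2)
# Assumed inter-rater reliability index: the Pearson correlation between raters.
reliability = np.corrcoef(r1, r2)[0, 1]
print(round(perfect, 2), round(adjacent, 2), round(reliability, 2))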

Table 4.6 Descriptive Statistics for Essay 1
CAAP Writing (Essay) Test

        Mean     Standard     Inter-rater    % Perfect    % Adjacent
Form    Score    Deviation    Reliability    Agreement    Agreement

88A     3.28       .62            .72            .75          .25
88B     3.33       .63            .71            .74          .26
89A     2.91       .60            .74            .78          .21
89B     3.15       .64            .72            .74          .26
11A     3.37       .64            .74            .75          .25
12A     3.10       .65            .68            .71          .28
13A     3.07       .66            .73            .74          .26

Table 4.7 Descriptive Statistics for Essay 2
CAAP Writing (Essay) Test

        Mean     Standard     Inter-rater    % Perfect    % Adjacent
Form    Score    Deviation    Reliability    Agreement    Agreement

88A     3.34       .64            .72            .74          .26
88B     3.23       .61            .71            .75          .25
89A     3.09       .61            .71            .75          .24
89B     3.17       .63            .72            .75          .25
11A     3.18       .62            .74            .77          .23
12A     3.08       .68            .69            .70          .30
13A     3.04       .65            .73            .74          .26

Table 4.8 Descriptive Statistics for Composite Essay Score
CAAP Writing (Essay) Test

        Mean     Standard
Form    Score    Deviation    Sample Size

88A     3.31       .58          17,282
88B     3.28       .57          15,415
89A     3.00       .55          17,745
89B     3.15       .58          21,024
11A     3.27       .57          13,124
12A     3.09       .61          12,025
13A     3.05       .61           7,711

Chapter 5

Scaling and Equating


A. ESTABLISHING SCALE SCORES FOR THE OBJECTIVE CAAP TESTS
This handbook illustrates the properties and characteristics of the CAAP tests using forms
11A and 12A (11G and 12G for Mathematics). However, more than just two forms of the
CAAP tests exist, and it is this plenitude of forms that creates the need for the scaling and
equating of the CAAP tests. When constructing forms of the CAAP tests, great care is taken
to ensure that each form contains the same content and that each form has the same
psychometric and statistical properties such as difficulty, variability, and reliability. Though
it is possible to ensure that each form has the same content, it is more difficult to ensure
that each form has the same statistical distribution, and, as a result, small form-to-form
differences in the distribution of CAAP raw scores do occur. Equating is a psychometric
procedure that corrects for small variations in difficulty from form to form so that a student
should receive the same reported score regardless of the form taken.
When CAAP was first developed, a small sample was used to linearly scale each total CAAP
score to have a mean of 60, a standard deviation of approximately 5, and scale score
endpoints of 40 and 80. Similarly, each subscore was linearly scaled to have a mean of 15
and a standard deviation of 2.5. These values for the mean and standard deviation were
sample dependent. The samples used to illustrate the characteristics of the CAAP in this
handbook yield mean and standard deviation values close to, but not exactly equal to, the
initial values just given.
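
The following Python sketch illustrates a linear scaling of this kind, standardizing raw scores and transforming them to a target mean of 60 and standard deviation of 5. The rounding, the clipping to the 40 and 80 endpoints, and the simulated raw scores are assumptions made for illustration; the sketch does not reproduce the original CAAP scaling computation.

import numpy as np

def linear_scale(raw, target_mean=60.0, target_sd=5.0, lo=40, hi=80):
    # Standardize the raw scores, stretch/shift to the target mean and standard
    # deviation, then round and clip to the scale-score endpoints.
    raw = np.asarray(raw, dtype=float)
    z = (raw - raw.mean()) / raw.std(ddof=0)
    return np.clip(np.rint(target_mean + target_sd * z), lo, hi).astype(int)

rng = np.random.default_rng(2)
raw_scores = rng.binomial(n=36, p=0.63, size=1000)    # simulated 36-item raw scores
print(linear_scale(raw_scores)[:10])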
Table 5.1 presents raw score statistics for the five total test scores making up forms 11A and
12A (11G and 12G for Mathematics). The statistics are based on data collected from 1998
through 2001, except for the Mathematics Test. That test was modified in 2002 to include
two subscores, Basic Algebra and College Algebra, in place of its former single Algebra
subscore. Therefore, a smaller amount of data is available for that test. As with earlier
Mathematics forms, this data was used to linearly scale the Mathematics Test to have a mean
of 60 and a standard deviation of 5 and the two subscores, Basic Algebra and College
Algebra, to have means of 15 and standard deviations of 2.5. The creation of the two new
Mathematics subscores required a content change in the Mathematics Test. For that reason,
the Mathematics Test and its subscores are referred to as form 11G and 12G instead of 11A
and 12A.
For the column titles in Tables 5.1 through 5.4, N is the sample size, M is the number of
items comprising the test, S is the standard deviation of the test, and Rel is the reliability
of the test. Both the mean and the standard deviation have N as their divisor. The skewness
and kurtosis values for each test are not presented because their precise values usually
have little meaning. The raw and scaled scores are not precisely normally distributed, but
they are unimodal and not too asymmetric. The total tests tend to be somewhat more
symmetric than the subtests, and that is especially true of the College Algebra subtest,
which is difficult and consequently positively skewed.

As can be seen from Table 5.1, there are modest form-to-form differences between the raw
total score means and standard deviations of the same test. These differences are largest for
Writing Skills and Science, but those two tests contain the highest number of items, 72 and
45, respectively.
Table 5.2 presents scaled total score statistics for the two forms. The form-to-form
differences between the scale score statistics for the same CAAP total test scores are much
less than for the raw score statistics, though small differences are still present. This
reduction in form-to-form differences is due to the equating process that adjusts the scale
scores for any raw score form-to-form differences in difficulty.
Though the data for the two forms, 11A and 12A, were collected during the same time
period, it is not known how similar the two groups of students were who took forms 11A
and 12A. It is reasonable to expect the two student groups to be comparable, and the small
scale score differences between the two forms suggest that the two groups were very
similar. Mathematics Test forms 11G and 12G, however, were administered to comparable
student groups by design. The two forms were spiraled together within the same
administration, so that each form was taken by a separate but randomly equivalent group
of students (see section C of this chapter).

Table 5.1 Raw Score Statistics for Forms 11 and 12 (Total Scores)

Test Form N M Mean S KR20 SEM

Writing Skills 11A 40,954 72 51.23 12.05 .92 4.72
12A 39,056 72 46.55 12.71 .92 4.98
Reading 11A 37,095 36 23.22 6.63 .85 3.49
12A 31,532 36 23.46 6.76 .86 3.45
Mathematics 11G 989 35 16.86 6.12 .84 3.32
12G 986 35 16.04 6.20 .84 3.36
Science 11A 29,677 45 24.54 7.95 .86 4.06
12A 22,625 45 26.16 7.86 .86 4.01
Critical Thinking 11A 26,451 32 19.03 6.30 .85 3.32
12A 43,339 32 19.31 6.35 .85 3.35

Table 5.2 Scale Score Statistics for Forms 11 and 12 (Total Scores)

Test Form N Mean S Rel SEM

Writing Skills 11A 40,954 63.46 4.81 .91 1.48
12A 39,056 63.60 4.94 .91 1.51
Reading 11A 37,095 62.22 5.43 .84 2.18
12A 31,532 61.86 5.31 .84 2.10
Mathematics 11G 989 57.62 4.00 .78 1.88
12G 986 57.56 3.92 .78 1.83
Science 11A 29,677 60.30 4.59 .86 1.74
12A 22,625 60.60 4.55 .86 1.72
Critical Thinking 11A 26,451 61.65 5.39 .83 2.24
12A 43,339 61.33 5.48 .83 2.23

B. ESTABLISHING SCALE SCORES FOR THE CAAP SUBSCORES


Table 5.3 presents raw score statistics for the six subscores of the CAAP tests.
Usage/Mechanics and Rhetorical Skills are subscores of the Writing Skills Test. Basic
Algebra and College Algebra are subscores of the Mathematics Test. Arts/Literature and
Social Studies/Sciences are subscores of the Reading Test. The Critical Thinking and Science
Tests have no subscores. Once again, form-to-form differences between the raw score
statistics for the same subscores are present. Table 5.4 presents the scaled score statistics
for the subscores. In this table such form-to-form differences between the same subscore
scale score statistics are greatly reduced, and this is again due to the equating process.

Table 5.3 Raw Score Statistics for Forms 11 and 12 (Subscores)

Test Form N M Mean S KR20 SEM

Usage/Mechanics 11A 40,954 32 22.60 5.77 .84 3.13
12A 39,056 32 20.66 6.14 .85 3.23
Rhetorical Skills 11A 40,954 40 28.62 6.81 .86 3.48
12A 39,056 40 25.89 7.15 .86 3.65
Basic Algebra 11G 989 17 10.89 3.49 .76 2.27
12G 986 17 9.96 3.61 .77 2.30
College Algebra 11G 989 18 5.97 3.26 .71 2.30
12G 986 18 6.09 3.18 .68 2.33
Arts/Literature 11A 37,095 18 12.63 3.59 .77 2.29
12A 31,532 18 12.32 3.97 .81 2.33
Social Studies/Sciences 11A 37,095 18 10.59 3.71 .74 2.50
12A 31,532 18 11.15 3.52 .73 2.41

Table 5.4 Scale Score Statistics for Forms 11 and 12 (Subscores)

Test Form N Mean S Rel SEM

Usage/Mechanics 11A 40,954 16.85 2.37 .83 .98
12A 39,056 16.83 2.30 .83 .94
Rhetorical Skills 11A 40,954 16.74 2.54 .85 .99
12A 39,056 16.80 2.53 .85 .99
Basic Algebra 11G 989 15.02 2.48 .77 1.20
12G 986 15.03 2.52 .76 1.23
College Algebra 11G 989 14.98 2.54 .68 1.44
12G 986 15.05 2.52 .67 1.46
Arts/Literature 11A 37,095 15.57 2.56 .76 1.27
12A 31,532 15.43 2.57 .78 1.21
Social Studies/Sciences 11A 37,095 16.25 2.70 .73 1.40
12A 31,532 16.23 2.74 .72 1.44

C. EQUATING CAAP SCORES
Random group equipercentile equating is used to equate scores on the different CAAP
forms. Two or more forms are spiraled together and then administered so that in each
classroom of each school one student will receive one form and the next student will
receive a different form, and so on. This randomization of test forms allows for the
assumption that the different groups of students taking the different forms are nearly
identically distributed, and so any statistical differences observed between the forms are
due to form-to-form variation and not to differences in ability among the groups of students
taking the different forms. This assumption is crucial for the equating of CAAP forms.
One of the spiraled forms is the base form that has already been scaled and equated. The
equipercentile equating procedure examines the raw score cumulative distributions of the
new forms and adjusts for any differences in the new form(s) so that the raw score
cumulative distribution of the new form(s) closely matches that of the base form. Then the
raw scores are transformed to scale scores. Smoothed cubic splines are used to make the
adjustments in the cumulative distribution functions. Smoothed cubic splines are very
versatile and allow for a variety of adjustments in the cumulative distribution functions.
Consequently, the cumulative distribution function(s) of the new form(s) can be made to
closely match the cumulative distribution of the base form.
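The logic of the raw-to-raw equipercentile step can be illustrated with a minimal sketch. The function name and data structures below are illustrative, and the cubic-spline smoothing step that ACT applies is omitted, so this should be read as a conceptual outline rather than the operational procedure.

```python
import numpy as np

def equipercentile_convert(new_raw, base_raw, base_raw_to_scale):
    """Toy equipercentile conversion for randomly equivalent (spiraled) groups.

    new_raw, base_raw  : raw scores observed for the new and base forms
    base_raw_to_scale  : dict mapping base-form raw scores to scale scores
    Returns a dict mapping each new-form raw score to a scale score.
    """
    new_raw = np.asarray(new_raw)
    base_sorted = np.sort(np.asarray(base_raw))
    conversion = {}
    for x in np.unique(new_raw):
        # Percentile rank of raw score x on the new form.
        pr = np.mean(new_raw <= x)
        # Base-form raw score with (approximately) the same percentile rank.
        idx = min(int(np.ceil(pr * len(base_sorted))) - 1, len(base_sorted) - 1)
        y = int(base_sorted[max(idx, 0)])
        # Carry the base form's raw-to-scale conversion over to the new form.
        conversion[int(x)] = base_raw_to_scale[y]
    return conversion
```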

D. RELIABILITY AND STANDARD ERROR OF MEASUREMENT


Tables 5.1 through 5.4 also present estimates of reliability and the average standard error
of measurement (SEM). The Kuder-Richardson Formula 20 (K-R 20) reliability coefficient is used for the raw scores
because these scores are sums of dichotomous items, and the average SEM estimate for the
raw scores is computed using this coefficient. This is not the case for the scaled scores. A
complicated psychometric model must be used to obtain reliability and SEM estimates for
the scaled scores. The interested reader can consult Kolen, Hanson, and Brennan (1992)
for an explanation of the method used to estimate the reliability and SEM of the scale
scores. The standard deviations given in those eight tables contain both true score
variability and error score variability. The SEM measures variability due solely to
measurement error. When a student takes an examination, it is hypothesized that the
student’s score does not represent a totally accurate measurement of the student’s ability.
Inaccuracy can be due to many factors. The student may not feel well that day. The student
may feel anxious about the test and not perform as well as usual. The student may be
preoccupied with other matters and not give full attention to the test. Or, though the
student may know a great deal about the subject being tested, that particular test may
contain questions unfamiliar to the student and not represent a measure of the student’s
full knowledge of the subject.
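For readers who want to reproduce the raw-score reliability column, the following sketch computes K-R 20 from a matrix of scored item responses. The second function shows the textbook average-SEM formula; it may not reproduce the tabled SEM values exactly, since ACT's reported averages are based on its own estimation procedures, and the function names here are illustrative.

```python
import numpy as np

def kr20(item_matrix):
    """K-R 20 reliability for dichotomous (0/1) item scores.
    item_matrix: 2-D array, rows = examinees, columns = items."""
    x = np.asarray(item_matrix, dtype=float)
    k = x.shape[1]                          # number of items
    p = x.mean(axis=0)                      # proportion correct per item
    total_var = x.sum(axis=1).var(ddof=1)   # variance of raw total scores
    return (k / (k - 1.0)) * (1.0 - (p * (1.0 - p)).sum() / total_var)

def classical_sem(sd, reliability):
    """Textbook average SEM: SD * sqrt(1 - reliability)."""
    return sd * np.sqrt(1.0 - reliability)
```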
The SEM can be used in an approximate way to assess how accurately students’ scores on
the test reflect the students’ true knowledge of the test topic. A confidence interval can be
constructed for each student by adding one SEM to the student’s score and subtracting one
SEM from the student’s score. If the errors of measurement are normally distributed, then
such intervals will contain students’ true scores about 68 percent of the time. Errors of
measurement for the CAAP tests are not exactly normally distributed, so it is best to think
of the 68 percent as an approximate percentage. If more precise probabilistic statements about students' true scores are desired, then intervals extending 2 SEMs above and below the students' scores can be used.

In this case, if the errors are normally distributed, then such intervals will contain the students’
true scores about 95 percent of the time. Once again, 95 percent should be considered
approximate because CAAP error scores probably are not precisely normally distributed.
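As a concrete illustration, the sketch below simply reuses a Reading scale-score SEM of roughly 2.2 from Table 5.2; the function name is illustrative.

```python
def score_interval(observed_score, sem, n_sems=1):
    """Approximate confidence interval: observed score plus/minus n_sems * SEM."""
    return observed_score - n_sems * sem, observed_score + n_sems * sem

# A Reading scale score of 62 with an SEM of about 2.2 (Table 5.2):
# score_interval(62, 2.2)     -> (59.8, 64.2), an approximate 68 percent interval
# score_interval(62, 2.2, 2)  -> (57.6, 66.4), an approximate 95 percent interval
```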
The value of the SEM can vary with the value of a student’s score. The following five
figures, Figure 1 through Figure 5, present the SEM as a function of a student’s scaled score
for the five total scores. The method described in Kolen, Hanson, and Brennan (1992) was
used to compute the scale score SEMs. Technically, the SEM is the standard deviation of the
measurement error conditional on a student’s true score. Using the SEM with a student’s
observed scaled score is an approximation, but it is commonly done. SEMs for raw scores
usually have a negative quadratic shape. They are highest in the middle of the scale, but
tend to decrease at both ends of the scale. SEMs for scaled scores (those being presented
in Figures 1 through 5), do not necessarily behave the same way. The CAAP scaled score
SEMs decrease at the top of the scale, and they are usually greatest around the middle of
the scale, but they do not always decrease at the low end of the scale. At the low end of
the scale there were not always enough students to allow accurate estimates of the SEM,
so the graphs sometimes do not extend to the very lowest scores of the scales. Scaled score
SEMs are affected by the original scaling transformation and also by the equating process,
sometimes in predictable ways and sometimes in unpredictable ways. The CAAP scale
score SEMs presented below are fairly typical except for the Mathematics Test. The scale
score SEMs for that test are not highest in the middle of the scale, and this is due to the
scaling and equating process.
If it is desired to construct confidence intervals about students’ scores, then the appropriate
value of the SEM can be selected by using Figures 1 through 5 and the value of the
students’ scores on the appropriate test. Then confidence intervals for the students’ scores
can be constructed as described above. In this procedure, the student’s observed scale score
is being used in place of the student’s unobserved true scale score, and this is an
approximation. Some interpolation will also be necessary.

Figure 1. Writing Skills: standard error of measurement (vertical axis, 0.0 to 2.0) as a function of scale score (horizontal axis, 45 to 75), Forms 11A and 12A.

26
Figure 2. Mathematics: standard error of measurement (vertical axis, 0.0 to 2.5) as a function of scale score (horizontal axis, 45 to 75), Forms 11G and 12G.

Figure 3. Reading: standard error of measurement (vertical axis, 0.0 to 2.5) as a function of scale score (horizontal axis, 45 to 75), Forms 11A and 12A.

27
Figure 4. Critical Thinking: standard error of measurement (vertical axis, 0.0 to 3.0) as a function of scale score (horizontal axis, 45 to 75), Forms 11A and 12A.

Figure 5. Science Reasoning: standard error of measurement (vertical axis, 0.0 to 2.0) as a function of scale score (horizontal axis, 45 to 75), Forms 11A and 12A.

28
E. USER NORMS
The norms for the CAAP tests are based on student data collected from CAAP user
institutions. A full description of the most recent norms and the corresponding reference
tables are included in a separate document, CAAP User Norms, which may be obtained by
sending a request by e-mail to CAAP@act.org.
The reference tables provide norms for freshmen and sophomores according to
institutional control (public or private) and type (two- or four-year) for each test. Junior and
senior norms are reported by test only. The tables contain distributions of scale scores and
cumulative percentages for each of the CAAP tests; this information is provided for both
the total scores and the subscores. In addition, descriptive statistics (sample sizes, means,
and standard deviations) are reported at the bottom of each table. Norms tables are also
provided for the two individual essay scores and the composite essay score.
Establishing Local Norms. Because the CAAP user norms represent a wide variety of
institutions, results for individual institutions may differ from those of the user norm group.
Individual institutions, therefore, are encouraged to develop local normative information if
they wish to interpret a given CAAP score in the context of the performance of their
particular students. Local norms are reported in the CAAP Institutional Summaries sent to
institutions following each test administration. The report includes frequency distributions
and descriptive statistics for the scores on each CAAP test, as well as frequency distributions
for the student demographic characteristics.
Local distributions and descriptive statistics may be compared to determine how students
from an individual institution differ from other CAAP-tested students. In addition, local
distributions can be used in establishing cutoff scores or decision zones for upper-level
admissions/placement decisions. More detailed information on setting cutoff scores and the interpretation of test results is available in Chapter 6.

29
Chapter 6

Interpretation of Scores
A. USING CAAP SCORES FOR MAKING INFERENCES ABOUT GROUPS
When using CAAP scores in outcomes assessment, institutional staff will often want to make
inferences about particular groups of students. For example, an institution might want to
know the average CAAP score of students who have completed a general education core
curriculum or the average scores of students with particular majors. There are four primary
ways in which CAAP scores can be used to make inferences about groups.

1. Comparison to a Defined Group


One relatively simple way in which CAAP scores may be used to make inferences about
groups is to compare the academic performance of a group of students at a particular
institution to that of a relevant norm group. For example, a two-year public institution may
want to determine whether the academic performance of its sophomores is comparable to
that of sophomores at similar institutions. This could be accomplished by comparing the
average sophomore CAAP score at the institution to the average score reported in the CAAP
norms for all two-year public institutions that use CAAP. If the institution’s average score is
below the average score reported in the CAAP user norms, then the institution may decide
that its sophomores have not reached a level of academic proficiency comparable to that
of sophomores in other two-year public institutions. The institution should then investigate
the reasons for this result.
It is important to consider that the level of academic proficiency exhibited by a particular
group will often depend on student characteristics not included in the definition of norm
groups. For this reason, a low average test score should only be used to indicate that further
investigation is needed and not as an infallible sign of a deficiency within a particular
curriculum.

2. Comparison of Pretest and Posttest Mean Scores (Longitudinal Method)


Test scores may also be used to measure changes over time in students’ academic skills and
knowledge. Such changes typically result from exposure to particular academic programs
or general education core curricula. The average change for a given reference group can
be used as an indication of the general effectiveness of a curriculum. It is important to remember, however, that other student characteristics, such as motivation to achieve, may also affect performance, both in the classroom and on a test.
Conceptually, the most straightforward way to measure educational growth is with a
longitudinal method in which the CAAP scores of students who were tested on two
different occasions are compared. For example, an entering class of freshmen could be
tested with CAAP prior to taking any courses. This same sample of students would then be
retested after they had completed a core of general education courses. The average
difference in students’ scores from one point in time to another could be considered
evidence of the program’s educational effectiveness. The average difference for a particular
program at one institution could also be compared with the average differences observed
at similar institutions.
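A minimal sketch of the longitudinal computation, assuming matched pretest and posttest scores for the same students (the function name is illustrative):

```python
import numpy as np

def mean_gain(pretest, posttest):
    """Average pre-to-post change and its standard error for students
    tested on both occasions (paired scores, same order in both arrays)."""
    diff = np.asarray(posttest, dtype=float) - np.asarray(pretest, dtype=float)
    return diff.mean(), diff.std(ddof=1) / np.sqrt(len(diff))
```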

30
A potentially serious problem in comparing one institution’s average difference scores with
those of other institutions is that there are a number of environmental variables that could
affect students’ academic achievement, but are either beyond the control of institutions or
are not directly related to their core curricula. If the distributions of these environmental
variables are significantly different among institutions, then simple comparisons of average
difference scores could be inappropriate.
However, a variety of statistical techniques can be used to adjust for such differences. The
conceptually simplest technique is to make comparisons only among institutions that are
similar with respect to the environmental variables. The principal disadvantage of this
technique is that the amount of data required increases tremendously with the number of
comparison groups. Another analytical technique that is more flexible, because it requires
less data, is to construct “regression” models in which institutions’ average difference scores
are related to the environmental variables, as well as to variables representing the institutions' instructional effectiveness.
Longitudinal assessments may be difficult to implement practically. For example, if initial
testing is done on a sample of students, substantial effort and expense may be required to
locate the same students for retesting. In addition, testing students twice may be considered
by some institutions to be too costly or time-consuming. Therefore, alternative outcomes
assessment methods may be desirable.

3. Comparison of Means of Different Groups (Cross-Sectional Method)


An alternative to the longitudinal method is the cross-sectional method, in which the mean
CAAP scores of students in two different groups are compared. In a typical cross-sectional
study, the freshmen at an institution are tested at the beginning of the fall term, and the
sophomores are tested at the end of the spring term of the same academic year. The
effectiveness of a program is then inferred from the differences between the mean
sophomore scores and the mean freshman scores.
The principal advantage of cross-sectional studies is that they can be done in one year,
rather than in two. This characteristic may make them very attractive to some institutions.
The principal disadvantage of cross-sectional studies is that they are not as accurate as
longitudinal studies, because differences in other characteristics of students are confounded
with differences in mean scores. For example, test score differences between sophomores
and freshmen may be due to differences in their entering cognitive abilities, irrespective of
any instruction they receive. Moreover, data for sophomores are collected only for those
students who persist through the sophomore year. Therefore, the mean scores of
sophomores will tend to be higher than the mean scores of freshmen, because the latter
are based in part on the scores of potential dropouts.
The potential bias in cross-sectional results can be controlled to a certain extent by statistical
adjustments (e.g., by deleting nonpersisting students or by using an analysis of covariance
with an ACT® Test Composite score as the covariate). Such statistical adjustments can be time-consuming and relatively expensive, however, and they are not panaceas.
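One way such a covariance adjustment might be carried out is sketched below with the statsmodels formula interface. The data values and column names are invented for illustration (an operational analysis would also include institution as a main effect, as described later in this handbook), so this is a sketch of the technique, not the CAAP research procedure itself.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Illustrative records: one row per persisting student.
df = pd.DataFrame({
    "caap":     [60, 62, 64, 63, 59, 65, 66, 61],
    "level":    ["freshman", "freshman", "sophomore", "sophomore",
                 "freshman", "sophomore", "sophomore", "freshman"],
    "act_comp": [21, 23, 24, 22, 20, 25, 26, 22],
})

# ANCOVA-style model: educational level as the effect of interest,
# ACT Composite score as the covariate.
model = smf.ols("caap ~ C(level) + act_comp", data=df).fit()

# The coefficient on C(level)[T.sophomore] estimates the covariate-adjusted
# freshman-to-sophomore difference in mean CAAP scores.
print(model.params)
```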
Cross-sectional studies are also subject to the same difficulties that longitudinal studies are
in making comparisons across institutions. In principle, the statistical techniques previously
described for comparing institutions’ average difference scores in longitudinal studies could
also be applied to cross-sectional studies.

31
4. Comparison of Actual and Estimated Mean Scores (Prediction Method)
Another alternative to the longitudinal method is the prediction method, in which students’
end-of-program CAAP scores are predicted from their scores on another test (e.g., the ACT
Assessment). Students’ predicted end-of-program CAAP scores are then compared with
their actual end-of-program CAAP scores. The average difference between the two scores
could be considered an indicator of an institution’s effectiveness for a particular program.
For an illustration of how a prediction study might work, suppose that an institution wants
to evaluate its general education program, which students typically complete by the end of
the sophomore year. Before students are admitted to the institution, they are required to
take the ACT Assessment. From students’ ACT Assessment scores, predicted end-of-
program CAAP scores are calculated. Shortly before students complete the general
education core curriculum, they are tested with CAAP.
A prediction model, based on the matched ACT Assessment and CAAP data across students
at many institutions, is then developed. The dependent variable in the model is the
sophomore CAAP score. The independent variables in the model may include entering ACT
Assessment scores, other student characteristics (such as courses taken), and institutional
characteristics. An individual institution’s effectiveness can then be inferred by comparing
its students’ predicted CAAP scores with their actual CAAP scores. If the actual CAAP scores
are higher than predicted, then this is an indication that the institution’s program has been
effective in educating students.
The principal advantage of the prediction method is that an institution does not have to test
students twice with CAAP. Because many institutions will already have relevant test scores
for their entering students, the only testing needed is at or near the end of the program(s)
that the institutions want to evaluate. A disadvantage of the prediction method is that, for a
given sample size, inferences based on it are not as accurate as inferences based on the
longitudinal or cross-sectional studies. The reason is that in the prediction method,
inferences are based on tests with different contents and score scales, rather than on the
same test and score scale. For group comparisons, however, this disadvantage can be
effectively minimized simply by increasing the number of students tested.
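A sketch of the prediction-method comparison using ordinary least squares is shown below. The variable names and the single ACT predictor are assumptions for illustration; an operational model would include the additional student and institutional characteristics described above.

```python
import statsmodels.formula.api as smf

def program_effect(reference_df, institution_df):
    """Difference between an institution's actual and predicted mean CAAP scores.

    reference_df   : matched ACT and sophomore CAAP records pooled across institutions
    institution_df : the same columns for one institution's sophomores
    """
    fit = smf.ols("caap_soph ~ act_composite", data=reference_df).fit()
    predicted = fit.predict(institution_df)
    return institution_df["caap_soph"].mean() - predicted.mean()

# A positive value suggests students scored higher than expected given their
# entering ACT scores, consistent with an effective program.
```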

B. USING CAAP SCORES FOR MAKING INFERENCES ABOUT INDIVIDUALS


Tests cannot measure knowledge and abilities with absolute precision. A score obtained on
a given test may be considered the sum of a “true score” component and an “error”
component. Because of the error component, or “measurement error,” an obtained score is
only an estimate of an individual’s “true” test score.
The true score can be conceptualized as the average of scores obtained by repeatedly
testing an examinee under identical conditions, assuming that the standard error is constant
throughout the score range and that the measurement error has a normal distribution.
Using the standard error of measurement to build confidence intervals around test scores
discourages the misinterpretation of individual test scores as precisely representing a
student’s true knowledge and skills.

32
C. USING CAAP SCORES FOR SELECTION DECISIONS
It is likely that institutions will have unique placement and evaluation concerns that may
require locally developed cutoff scores. The results of the CAAP examinations can be used
in several ways to establish cutoff scores or decision zones (see Section 3 that follows) for
such purposes as admitting students to upper-division or professional-level courses. Before
establishing cutoff scores, however, institutions are encouraged to conduct validity studies
to examine specific uses of CAAP scores (see Chapter 7 on validity).

1. Methods for Setting Cutoff Scores


Once CAAP scores have been locally validated for a certain use (e.g., admitting students to
upper-division courses), cutoff scores may be set using one of several procedures.
Procedures for setting cutoff scores include the use of local normative information,
judgmental procedures based upon a content review of the items, the use of local validity
evidence, and the use of other comparison populations.
Using Local Norms. One method of setting cutoff scores uses local norming studies that
frame the decision in terms of students enrolled at an institution. The results of these
studies may consist of local score distributions related to test-score performance and
performance in college. Institutional personnel are often required to establish cutoff scores
on the basis of administrative considerations. Under these conditions, local score
distributions can be used to provide preliminary cutoffs. For example, due to a limited
number of openings available for two-year college transfer students, personnel at a four-
year institution may decide that only the top 30 percent of the applicant pool can be
accepted based on their CAAP test performance. The appropriate local CAAP score
distribution(s) can be used to determine the test score(s) corresponding to a cumulative
percentage of 70 (70 percent of the students scoring at or below the scale score). Any
applicant whose score is at or above a cumulative percentage of 70 will be admitted to the
institution.
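A minimal sketch of reading such a cutoff from a local score distribution (the function name is illustrative):

```python
import numpy as np

def cutoff_for_cumulative_pct(local_scores, cumulative_pct=70):
    """Scale score at which approximately `cumulative_pct` percent of the
    local distribution falls at or below (e.g., 70 for the top 30 percent)."""
    scores = np.sort(np.asarray(local_scores))
    idx = int(np.ceil(cumulative_pct / 100.0 * len(scores))) - 1
    return scores[max(idx, 0)]
```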
Cutoff scores based on score distributions are easy to communicate and implement in a
selection or placement system. However, cutoff scores based solely on percentile rank
considerations may lead to incorrect decisions from other perspectives. For example,
students in need of remediation may be incorrectly admitted to an upper-division
curriculum, or transfer students who have the necessary skills may be incorrectly denied
admission into a four-year institution. Score ranges or decision zones (see Section 3) are useful in that they acknowledge the measurement error inherent in the scores.
Using Judgmental Procedures. When judgmental procedures are used to establish cutoff
scores, institutional personnel will need to conduct a thorough review of the test content.
Based on this review, institutions may determine that a student correctly answering a
certain percentage or more of the test items has demonstrated sufficient knowledge of the
subject to be admitted, for example, to upper-division study.
This method also requires content judges to decide how a “borderline” test taker (e.g., one
whose knowledge and skills are in the decision zone between upper- and lower-division
work) would respond to the items on the examination. Since this method relies on
subjective judgments, further inspection of actual performance data is recommended.

33
Setting Cutoff Scores from Validity Evidence. Validity evidence can be employed in
establishing cutoff scores. A test score predicting adequate course performance, as
measured by cumulative grade-point average (GPA) or individual course grades, can be
chosen as the cutoff score.
Developing grade-experience tables is also one way of identifying cutoff scores based on
validity evidence. These tables typically describe course grades or GPAs based on test score
performance intervals. By using these tables, the distribution of course grades or GPAs
corresponding to several CAAP score intervals can be identified.
Logistic regression is another statistical method for analyzing the relationship between
CAAP scores and students’ course grades. Logistic regression can be used to estimate
directly the probability of success, given particular test scores, and the proportions of
students about whom correct selection decisions would be made.
For estimating probabilities of success, “success” can be defined in various ways (e.g., a
grade of “C” or higher). The estimated probabilities of success yield, for either actual or
hypothetical cutoff scores, the following proportions: the proportion of students scoring
above the cutoff who actually succeeded in the course (true positive), the proportion of
students scoring above the cutoff who actually failed (false positive), the proportion of
students scoring below the cutoff who would have succeeded (false negative), and the
proportion of students scoring below the cutoff who would have failed (true negative). For
each cutoff score, the proportion of correct decisions is calculated by summing proportions
of true positive and true negative decisions. In general, it is desirable to maximize the
proportion of correct decisions; the optimal cutoff score is identified as the one yielding
the largest proportion of correctly selected students.
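A sketch of this analysis using logistic regression from statsmodels is given below. The proportions are computed from observed outcomes at each candidate cutoff, which is one simple way to realize the comparison the text describes; the variable names and the definition of success are illustrative.

```python
import numpy as np
import statsmodels.api as sm

def success_probability_model(scores, success):
    """Logistic regression of success (e.g., 1 = grade of C or higher) on CAAP score."""
    X = sm.add_constant(np.asarray(scores, dtype=float))
    return sm.Logit(np.asarray(success, dtype=float), X).fit(disp=0)

def decision_proportions(cutoff, scores, success):
    """True/false positive and negative proportions, plus proportion correct."""
    scores = np.asarray(scores, dtype=float)
    success = np.asarray(success)
    above = scores >= cutoff
    tp = np.mean(above & (success == 1))    # admitted and succeeded
    fp = np.mean(above & (success == 0))    # admitted but failed
    fn = np.mean(~above & (success == 1))   # denied but would have succeeded
    tn = np.mean(~above & (success == 0))   # denied and would have failed
    return tp, fp, fn, tn, tp + tn          # last entry: proportion of correct decisions

# The optimal cutoff is the candidate value with the largest tp + tn.
```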
Using Other Comparison Populations. The scores of a comparison population may also
be used to set cutoff scores. This method is particularly helpful when local normative data
are not available.

2. Choosing a Cutoff Procedure


There are several procedures that can be used to set cutoff scores. There is no one method
that is best for all institutions and situations; a measurement expert should be consulted
for assistance in both choosing and implementing a procedure.
Further, all methods involve a degree of subjective judgment that will vary according to the
amount and type of supporting information included in the decision. The choice of an
appropriate method will depend substantially upon the supporting information and
available resources.
If an appropriate sample of test takers is available, and if standards can be expressed in
terms of performance (for example, “the mean GPA of students with CAAP Writing Skills
scores between 55 and 65”), then local validity studies can be conducted to develop local
norms and to establish cutoff scores.
If a number of qualified judges are available who are both familiar with the relevant content
areas and can competently specify a borderline performance level, then cutoff scores can be
derived subjectively.

34
Once cutoff scores have been established and used, it is essential for the institution to
continue to monitor their effectiveness. Experience may suggest adjusting initial cutoff
scores. Regardless of the method used, it is important to validate cutoff scores against
criteria of interest (e.g., sophomore GPA, junior GPA) so that institutional personnel can be
certain that cutoff scores are functioning as intended.

3. Alternatives to Cutoff Scores


Placement/admissions decision rules are not limited to single cutoff scores; decision zones
are often recommended as an alternative. Decision zones consist of a range of scores and
take into account the measurement error inherent in the test scores. For example, it might
be appropriate to identify a Critical Thinking score range of 65 to 70 as a decision zone for
admission into upper-division courses in education. Students scoring above the decision
zone would be allowed to begin the advanced curriculum. Those scoring below the decision
zone would be placed into review or remedial courses because their test performance was
similar to that of students with a lower success rate in upper-division courses. Students who
score within the decision zone would be advised that their skills appear to be bordering on
readiness for the upper-division curriculum. They might consider enrolling in remedial
courses (or participating in other appropriate skill-building services) to improve their skills,
or they might enroll in the upper-division curriculum, with full awareness that most of their classmates will have a stronger base of critical thinking skills. These students would be encouraged to consider their options carefully with the aid of an advisor.

35
Chapter 7

Validity Evidence for


CAAP Test Scores
Validity refers to the degree to which intended uses of scores on a test are justified. Two
sources of validity evidence for CAAP test scores are discussed in this section: content validity
and criterion-related validity.
Content validity refers to the extent to which the knowledge and subject matter measured by the CAAP tests adequately represent the general knowledge and skills required in the first two years of college and important for success in upper-level courses. Criterion-related validity refers to the relationship between CAAP test scores and external criteria.
One type of criterion-related validity is concurrent validity. For the CAAP tests, concurrent
validity refers to the extent to which a CAAP score can be used to explain a person’s current
level of performance, as measured by other criteria (e.g., course grades or grade-point average).
Test scores and course grades are collected at the same time for this form of validity evidence.
A second type of criterion-related validity is predictive validity. In the context of the CAAP tests,
predictive validity is the extent to which a CAAP score can be used to predict a person’s future
performance, as measured by other criteria (e.g., junior course grades or junior grade-point
average). Because this type of validity involves prediction, it requires a period of time between
collection of the test scores and of the criteria of interest.

A. CONTENT VALIDITY OF CAAP SCORES


The content validity of a test for a given purpose is based on the match between test
content and the knowledge or skill domain of interest (e.g., curricular content). To facilitate
establishment of CAAP score content validity, intended purposes and general uses of CAAP
scores are presented in the descriptions of CAAP test development processes and test
content specifications in Chapters 1, 2, and 3 of this handbook.

B. CRITERION-RELATED VALIDITY EVIDENCE FOR CAAP SCORES


Beginning in September 1988 and continuing through the spring of 1990, 58 postsecondary
institutions participated in research projects designed to provide validity evidence for
several uses of CAAP scores. The specific uses investigated were measuring students’
academic knowledge and skills in common core areas, predicting students’ academic
performance in the junior year of college, and measuring the changes in students’ academic
knowledge and skills due to their completing a general education core curriculum.
Both two-year and four-year, public and private institutions were included in the research
projects. Institutions were asked to select a random sample consisting of at least 10 percent
of their students, or 100 students, whichever was greater. As an alternative, institutions
could choose to test their entire student population. Student transcripts were collected from
many of the institutions; transcripts were required for two of the studies but were optional
in the other study.

36
The CAAP Science Test was released in the fall of 1989 after the CAAP research projects
were already in progress. Validity information about the Science Test is provided from
studies conducted outside ACT (e.g., Olsen, 1991).

1. CAAP as a Measure of Students’ Academic Knowledge and Skills


If sophomore CAAP scores and college GPAs are both considered reliable and valid
measures of academic skills acquired during the first two years of college, then there
should be a statistical relationship between these variables. To test this assumption,
sophomore CAAP scores were used to model sophomore GPAs.
Data from 787 students at 8 postsecondary institutions were analyzed in this study. The
median (across institutions) of Writing Skills mean scale scores was 65.3. The medians of
Mathematics, Reading, Critical Thinking, and Writing (Essay) mean scale scores were 57.7,
63.2, 62.9, and 2.4, respectively. Median standard deviations for the four CAAP objective
tests ranged from 3.5 (Mathematics) to 4.8 (Reading). The median standard deviation for
the Writing (Essay) Test was 0.7.
The median (across institutions) correlation between cumulative English GPA and
sophomore CAAP Writing Skills score was .37, with a range of .26 to .57. For cumulative
mathematics GPA and CAAP Mathematics score, the median correlation was .34, with a
range of .11 to .40. Sophomore cumulative overall GPA showed moderate positive relationships with CAAP Writing Skills (median r = .36), Mathematics (.35), Reading (.38), and Critical Thinking (.34) scores.
The positive median correlations found between sophomore CAAP scores and GPAs
indicate that CAAP can be used to measure some of the knowledge and skills acquired
during the freshman and sophomore years at most postsecondary institutions. In
interpreting the magnitude of the correlations, one should keep in mind that they are
limited by the reliabilities of the criterion variables (i.e., the GPAs). Etaugh, Etaugh, and
Hurd (1972) and Schoenfeldt and Brush (1975) estimated single course grade reliabilities
ranging from .30 to .76. If these reliability estimates pertain to the criterion variables in the
present study, then the correlations obtained are about what could be expected.
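A standard classical-test-theory bound (not stated in the handbook, but consistent with its reasoning) makes this limitation explicit: the observed correlation cannot exceed the square root of the product of the two reliabilities,

\[ r_{XY} \le \sqrt{\rho_{XX'}\,\rho_{YY'}} \]

With a CAAP reliability near .85 and a criterion reliability of .50, for example, the observed correlation could not exceed roughly \(\sqrt{.85 \times .50} \approx .65\), so values in the .3 to .4 range are unsurprising.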

2. CAAP as a Predictive Measure


If junior course grades and GPAs are reliable and valid measures of junior-year academic
performance, and if sophomore CAAP scores are valid measures of the skills needed to
succeed in the junior year, then there should be a statistical relationship between
sophomore CAAP scores and junior-year grades and GPAs. Regression models were
developed to examine this predictive relationship. Regression statistics were computed by
institution and summarized across institutions.
Seven postsecondary institutions with a combined sample size of 1,514 students
participated in this study. The medians (across institutions) of the sophomore CAAP objective test mean scale scores ranged from 58.6 (Mathematics) to 63.9 (Writing Skills), and median
standard deviations ranged from 3.6 (Mathematics) to 4.5 (Reading). For the Writing (Essay)
Test, these statistics were 2.5 and 0.6, respectively.
The correlations between junior GPAs and corresponding sophomore CAAP test scores
were all moderately positive: junior English GPA had a median correlation of .32 with CAAP
Critical Thinking score, .25 with Writing Skills score, and .25 with Reading score. Junior
mathematics GPA had a median correlation of .23 with CAAP Mathematics score. Junior
cumulative overall GPA, which included freshman and sophomore course grades, was
somewhat more strongly associated with CAAP objective test scores than was junior
noncumulative overall GPA (e.g., median correlations between these GPA variables and
CAAP Critical Thinking scores were .35 and .26, respectively).
These findings indicate that sophomore CAAP scores have modest predictive capability and
can be used to predict academic performance in the junior year of college. The study also
found that the predictive capability of sophomore course grades was comparable, in some
instances, to that of sophomore CAAP scores. It is recommended, therefore, that
sophomore CAAP scores be used in conjunction with sophomore course grades and GPAs
to predict junior-year academic performance.

3. CAAP as a Measure of Educational Change


If CAAP scores are valid for measuring changes over time in college students’ academic
skills, and if such skills increase from the freshman to the sophomore year, then the CAAP
scores of sophomores should be greater than the CAAP scores of freshmen. To investigate
this hypothesis, both longitudinal and cross-sectional research designs were used. In the
longitudinal design, entering freshmen were tested in the fall of 1988. After the same
students had completed nearly two academic years of college, they were posttested in the
spring of 1990 with the same CAAP tests. In the cross-sectional design, entering freshmen
were tested in the fall of 1988 and second-semester sophomores were tested in the spring
of 1989.
Twenty-six institutions participated in the longitudinal study, and the numbers of students
associated with each test were: Writing Skills (151); Mathematics (175); Reading (272);
Critical Thinking (462); and Writing Essay (256). Median freshman CAAP objective test mean
scores ranged from 56.2 (Mathematics) to 61.8 (Critical Thinking), and median standard
deviations ranged from 3.7 (Mathematics) to 4.5 (Reading and Critical Thinking). The
median Writing (Essay) mean and standard deviation were 2.8 and 0.6, respectively.
Fifty-six institutions participated in the cross-sectional study, and the number of entering
freshmen (11,917) was considerably larger than the number of second-semester
sophomores (3,847). For the unadjusted comparison (see below), median freshman CAAP
objective test mean scores ranged from 56.1 (Mathematics) to 60.7 (Writing Skills), with
median standard deviations ranging from 3.6 (Mathematics) to 4.7 (Reading). The median
Writing (Essay) mean and standard deviation were 2.4 and 0.6, respectively.
In the analysis of the longitudinal data, average differences between pre- and post-treatment
CAAP scores were calculated by institution and summarized across institutions. Estimated
reliabilities of CAAP mean difference scores and difference scores for individual students
were also computed. In the analysis of the cross-sectional data, differences in freshman and
sophomore mean CAAP scores were compared according to several methods.
Unadjusted Comparisons. The mean CAAP scores of freshman and sophomore samples
were compared without any statistical adjustment for differences in such characteristics as
academic skills or persistence.
ANCOVA. The CAAP scores of persisting freshmen and sophomores were compared using
analysis of covariance (ANCOVA). In the ANCOVA model, institution and educational level
were the main effects, and the ACT Assessment Composite score was the covariate. ANCOVA
analyses were based on data for persisting students only, pooled across institutions.

38
All median CAAP objective test difference scores, for both the longitudinal data and the
cross-sectional data analyzed with ANCOVA, were positive. For the longitudinal data,
median difference scores ranged from 0.9 (CAAP Writing Skills, Mathematics, and Critical
Thinking) to 1.1 (Reading). Median difference scores for the cross-sectional data ranged
from 0.6 (Mathematics) to 1.7 (Critical Thinking). The range of the median difference scores
for the unadjusted cross-sectional data was from 1.3 (Reading) to 2.3 (Critical Thinking).
These results indicate that the average scores on the CAAP objective tests increased from
the freshman to the sophomore year. If CAAP scores are used to measure educational
change at the group level with cross-sectional research designs, then adjustments (e.g.,
eliminating records of nonpersisting students and adjusting differences in ability through
ANCOVA) should be made for differences in the entering cognitive skills of freshmen and
sophomores. Using unadjusted cross-sectional data tends to overestimate change.
Spearman-Brown reliabilities for longitudinal mean difference scores were .92 (Writing
Skills), .56 (Mathematics), .60 (Reading), and .81 (Critical Thinking). Individual students’
difference scores for the Writing Skills Test had a comparable estimated reliability (.72).
Individual students’ difference scores for the Mathematics, Reading, and Critical Thinking
tests had considerably smaller estimated reliabilities; Spearman-Brown coefficients for these
tests were .31, .40, and .34, respectively. CAAP score differences reported to individual
students should be interpreted with these small reliabilities in mind.
The results of a study conducted outside of ACT indicate that CAAP can be used to measure
educational change. Pascarella, Bohr, Nora, and Terenzini (1995) modeled end-of-freshman
year CAAP scores as a function of entering CAAP scores and other variables, including
ethnicity, parents’ level of education and income, academic motivation, age, credit hours
taken during freshman year, and location of residence (on or off campus). The data for their
study were collected from 2,416 freshmen participating in the longitudinal National Study
of Student Learning. These students represented 18 postsecondary institutions from 16
states. The authors found that CAAP Reading gain scores, adjusted for the variables
described above, differed for male nonathletes (.72), male football and basketball players
(-.76), and male athletes participating in other sports (.42). For CAAP Mathematics, the gain
scores were .29, -.61, and .16, respectively. Female nonathletes had larger Reading gains
than did female athletes (1.13 vs. .52, respectively).

39
Appendix A

CAAP Content
Consulting Panel

41
CAAP CONTENT CONSULTING PANEL

Writing Skills Committee
Richard Lloyd-Jones, University of Iowa
Kris D. Gutierrez, University of Colorado
David Bartholamae, University of Pittsburgh
Lynn Quiman Troyka, Queensborough Community College

Reading Committee
Joan Elifson, Georgia State University
David Pearson, University of Illinois
Thomas Goodkind, University of Connecticut
Darwin T. Turner, University of Iowa
William Holliday, University of Maryland

Mathematics Committee
Larry A. Curnutt, Bellevue Community College
Robert F. Wardrop, Central Michigan University
Jerome A. Goldstein, Tulane University

Science Committee
David Cater, University of Iowa
Ervin Poduska, Kirkwood Community College
Dean Hartman, Grant Wood Area Education Agency
Robert Yager, University of Iowa
David Lyon, Cornell College

Critical Thinking Committee
Jeanette Dick, McNeese State University
Richard Mayer, University of California–Santa Barbara
Robert H. Ennis, University of Illinois
Lynne McCauley, Western Michigan University

Writing Essay Committee
Arthur Appleby, State University of New York–Albany
Paul Diehl, University of Iowa
Joan Baron, Connecticut State Department
Lester Faigley, University of Texas–Austin
42
Appendix B

Differential Item
Functioning (DIF)
Analyses of CAAP Items

43
DIFFERENTIAL ITEM FUNCTIONING (DIF) ANALYSES OF CAAP ITEMS
Overview. As part of ACT's ongoing efforts to ensure fairness in CAAP tests, analyses of
Differential Item Functioning (DIF) are conducted on active CAAP forms as sufficient data
become available. The procedure and results are discussed below for all five content-area
objective tests in Forms 11 and 12.
DIF Methodology. DIF analyses offer a way to examine each item on a test to determine
whether the item is differentially difficult for members of one or more subgroups of the
examinee population. A DIF analysis compares the performance of a focal (minority) group
against a relevant base or referent (majority) group. ACT has laid the groundwork for DIF
studies by requesting that all examinees voluntarily provide their age, gender, and ethnicity.
These demographic factors are used to identify the referent and focal examinee groups
across which item functioning is compared.
All CAAP DIF studies are based on operational test administrations rather than data
gathered under special testing conditions. This is because it is not clear how motivated
examinees are under such artificial testing situations. The use of operational data also helps
to ensure that the fairness of items and test scores is assessed under conditions identical
to those in place during actual testing.
ACT uses the Mantel-Haenszel common-odds ratio (MH, or the MH index; see Holland and
Thayer, 1988) as the statistical index. The MH index compares the odds that members of one group (referent or focal) will answer an item correctly with the odds that matched members of the other group will do so.
potentially biased against either group are reexamined by ACT content and fairness reviewers
to determine whether the language or context of the items may be a contributing factor.
Examinees in the referent and focal groups are sorted into performance strata on the basis
of a matching variable, usually their total test scores. Each examinee’s 1/0 (right/wrong)
item scores and demographic group membership (focal or referent) are then used to
construct a 2 × 2 contingency table at each of the k levels of the matching variable. The resulting set of 2 × 2 tables provides the data for the procedure. Specifically, at level k, for a given item the 2 × 2 table is of the form:

Group Correct = 1 Incorrect = 0 Totals

Referent Group Ak Bk nRk
Focal Group Ck Dk nFk
Totals m1k m0k Tk

where there are Tk examinees at stratum k, nRk in the referent group, and nFk in the focal group. The frequency of correct response is Ak for the referent group and Ck for the focal group, and Bk and Dk represent the frequencies of incorrect response for the referent and focal groups, respectively.

44
The Mantel-Haenszel common odds ratio is estimated by:

    αMH = [ Σk (Ak Dk / Tk) ] / [ Σk (Bk Ck / Tk) ]

Values of the odds ratio greater than 1.0 indicate DIF favoring the referent group. Values
less than 1.0 indicate DIF favoring the focal group. At ACT, values of the odds ratio that are
<0.5 or >2.0 are flagged for further review.
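The computation can be sketched directly from the formula above; the function name and input format below are illustrative.

```python
def mantel_haenszel_odds_ratio(strata):
    """Mantel-Haenszel common odds ratio for one item.

    strata: list of (A_k, B_k, C_k, D_k) counts, one tuple per level k of the
    matching variable, where A_k/B_k are the referent group's correct/incorrect
    counts and C_k/D_k are the focal group's correct/incorrect counts."""
    numerator = sum(a * d / float(a + b + c + d) for a, b, c, d in strata)
    denominator = sum(b * c / float(a + b + c + d) for a, b, c, d in strata)
    return numerator / denominator

# Values below 0.5 or above 2.0 would be flagged for content review.
```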
Data Used in CAAP DIF Analyses. Data for the DIF analyses reported below were
collected as part of the national administration of CAAP tests during the period from
January 1998 through December 2001. Table 1 presents summary information regarding
the number of students and institutions that provided data for the DIF analyses. The total
number of examinee records involved in the analyses ranged from 18,262 (for Form 12 of
the Mathematics Test) to 43,339 (for Form 12 of the Critical Thinking Test). The number of
CAAP user institutions for which examinee data were available ranged from 95 (for Form
12 of the Mathematics Test) to 199 (for Form 11 of the Writing Skills Test). A minimum of
100 students was required for the focal group, and there were at least 7,000 students in
each referent group (i.e., White/Caucasians and males). Six focal groups (African
American/Black, Asian/Pacific Islander, Filipino, Mexican American/Chicano, Puerto
Rican/Cuban/Hispanic, and female) were found to have sufficient sample sizes to conduct
DIF analyses on at least one or more CAAP tests.

Table 1 Numbers of Students and Institutions Involved in CAAP DIF Analyses

Module Number of Students Testing Period Number of Schools

Form 11 Form 12 Form 11 Form 12 Form 11 Form 12


Math* 35,463 18,262 4/98–11/01 4/98–12/00 190 95
Writing Skills 40,954 39,056 4/98–10/01 9/97–9/01 199 181
Reading 37,095 31,532 4/98–9/01 9/97–8/01 175 158
Science 29,677 22,625 4/98–9/01 9/97–12/00 130 112
Critical Thinking 26,451 43,339 4/98–9/01 1/98–12/01 177 192

* Analyses were based upon results from Forms 11C and 12C because the smaller n-counts from
forms 11G and 12G did not permit suitable focal group sizes.

Summary of DIF Review Process. A comparison was flagged in the DIF analysis if the
MH index was less than .5 (i.e., judged as favoring the focal group) or greater than 2.0 (i.e.,
judged as favoring the referent group). Three ACT staff members with extensive experience
with fairness issues and fairness review procedures reviewed each flagged comparison to
determine if any part of the item in question might bias the performance of any given
examinee group. Items that at least one reviewer believed might be potentially biased were
reevaluated by the group, and a final determination regarding the item’s possible bias was
reached. Any items identified as potentially biased were then reviewed to determine
appropriate action to be taken.

45
Summary of DIF Analysis Results. Tables 2 and 3 list the number of MH values that
favored either the focal or referent group for tests in Forms 11 and 12, respectively. Note
that comparisons are shown only for tests where at least one MH value was found to favor
either the focal group or the referent group.

Table 2 Results of the DIF Analyses: Form 11

Number of MH
Values Favoring:
Focal Group
Test (Sample Size) Referent Group* Focal Referent
Writing Skills Puerto Rican/ White/Caucasian 1 0
(72 Items) Cuban/Hispanic
(399)
Asian/Pacific White/Caucasian 4 1
Islander (693)
Mathematics Asian/Pacific White/Caucasian 0 2
(35 items) Islander (624)

*Sample size was at least 7,000 for each referent group.

Table 3 Results of the DIF Analyses: Form 12

Number of MH
Values Favoring:
Focal Group
Test (Sample Size) Referent Group* Focal Referent
Writing Skills Filipino (162) White/Caucasian 2 1
(72 Items)
Asian/Pacific White/Caucasian 2 3
Islander (961)
Mathematics Asian/Pacific White/Caucasian 0 2
(35 items) Islander (348)
Puerto Rican/ White/Caucasian 1 0
Cuban/Hispanic
(127)
Reading Asian/Pacific White/Caucasian 0 2
(36 items) Islander (516)

*Sample size was at least 7,000 for each referent group.

46
Discussion of DIF Results. Of the 21 comparisons flagged by the DIF analysis across the
10 tests evaluated (i.e., 2 forms in each of 5 content areas), 10 comparisons were judged
to favor the focal group and 11 comparisons were judged to favor the referent group.
Although a precise confidence interval for the Mantel-Haenszel statistic is not known, ACT
experience has been that by chance alone, roughly 5 percent of all comparisons will be
statistically flagged even when no reason for the DIF is discernible in the items. The
21 flagged comparisons represented considerably less than 5 percent of all the
comparisons made, thus suggesting that all or most of the flags might have resulted by
chance. However, the items identified by the flagged comparisons were examined in
accordance with the procedures described above.
Although the reviewers did not identify any item as biased, they did note two word
problem items on the mathematics test that contained text that might be unfamiliar to some
ESL students. The reviewers proposed minor edits to these items, and the form containing
these items was marked so that these edits would be incorporated into the next
administration of the form.
ACT continues to collect data for DIF analyses as part of its routine quality-control
procedures for developing and maintaining quality CAAP assessments. As sufficient data
become available, DIF analyses will again be conducted for CAAP forms, and the results
will be reported in this manual.

47
Appendix C

CAAP Writing (Essay) Test


Score-Point Descriptions

49
CAAP WRITING (ESSAY) TEST SCORE-POINT DESCRIPTIONS
Upper-range papers. These papers clearly engage the issue identified in the prompt and
demonstrate superior skill in organizing, developing, and conveying in standard written
English the writer’s ideas about the topic.
6. Exceptional. These papers take a position on the issue defined in the prompt and
support that position with extensive elaboration. Organization is unified and coherent.
While there may be a few errors in mechanics, usage, or sentence structure, outstanding
command of the language is apparent.
5. Superior. These papers take a position on the issue defined in the prompt and support
that position with moderate elaboration. Organization is unified and coherent. While
there may be a few errors in mechanics, usage, or sentence structure, command of the
language is apparent.

Mid-range papers. Papers in the middle range demonstrate engagement with the issue
identified in the prompt but do not demonstrate the evidence of writing skill that would
mark them as outstanding.
4. Competent. These papers take a position on the issue defined in the prompt and
support that position with some elaboration or explanation. Organization is generally
clear. A competency with language is apparent, even though there may be some errors
in mechanics, usage, or sentence structure.
3. Adequate. These papers take a position on the issue defined in the prompt and support
that position but with only a little elaboration or explanation. Organization is clear
enough to follow without difficulty. A control of the language is apparent, even though
there may be numerous errors in mechanics, usage, or sentence structure.

Lower-range papers. Papers in the lower range fail in some way to demonstrate
proficiency in language use, clarity of organization, or engagement of the issue identified
in the prompt.
2. Weak. While these papers take a position on the issue defined in the prompt, they may
show significant problems in one or more of several areas, making the writer’s ideas
often difficult to follow: support may be extremely minimal; organization may lack clear
movement or connectedness; or there may be a pattern of errors in mechanics, usage,
or sentence structure that significantly interferes with understanding the writer’s ideas.
1. Inadequate. These papers show a failed attempt to engage the issue defined in the
prompt, lack support, or have problems with organization or language so severe as to
make the writer’s ideas very difficult to follow.

Unratable papers.
0. The following categories of unratable papers are reported to the students as 0.
• Off task. These responses refuse to engage the issue identified in the prompt.
• Illegible.
• Not English.
• No response.

50
References

51
REFERENCES
American College Testing. 1992. CAAP User’s Guide. Iowa City, Iowa: Author.
Etaugh, A. F., C. F. Etaugh, and D. E. Hurd. 1972. Reliability of college grades and grade point
averages: Some implications for prediction of academic performance. Educational and
Psychological Measurement. 32, 1045-1050.
Holland, P. W. and D. T. Thayer. 1988. Differential item performance and the Mantel-
Haenszel procedure. In H. Wainer and H. I. Braun (Eds.), Test Validity (pp. 129-145).
Hillsdale, NJ: Erlbaum.
Kolen, M. J., B. A. Hanson, and R. L. Brennan. 1992. Conditional standard errors of
measurement for scale scores. Journal of Educational Measurement. 29, 285-307.
Kuder, G. F. and M. W. Richardson. 1937. The theory of the estimation of test reliability.
Psychometrika. 2, 151-160.
Olsen, S. A. 1991, April. Examining the relationship between college general education
science-oriented coursework and CAAP Science Reasoning. Paper presented at the annual
meeting of the American Educational Research Association, Chicago, IL.
Pascarella, E. T., L. Bohr, A. Nora, and P. T. Terenzini. 1995. Intercollegiate athletic
participation and freshman-year cognitive outcomes. Journal of Higher Education. 66(4),
369-387.
Schoenfeldt, L. and D. Brush. 1975. Patterns of college grades across curricular areas: Some
implications for GPA as a criterion. American Educational Research Journal. 12, 313-321.

52
ACT Offices

ACT National Office
500 ACT Drive, P.O. Box 168
Iowa City, IA 52243-0168
319.337.1000

West Region
2880 Sunrise Boulevard, Suite 214
Rancho Cordova, CA 95742-6103
916.631.9200, Fax 916.631.8263

Mountain/Plains Region
3131 S. Vaughn Way, Suite 218
Aurora, CO 80014-3507
303.337.3273, Fax 303.337.2613

Southwest Region
8701 N. MoPac Expy., Suite 200
Austin, TX 78759-8364
512.320.1850, Fax 512.320.1869

Midwest Region
300 Knightsbridge Parkway, Suite 300
Lincolnshire, IL 60069-9498
847.634.2560, Fax 847.634.1074

700 Taylor Road, Suite 210
Gahanna, OH 43230-3318
614.470.9828, Fax 614.470.9830

Northeast Region
144 Turnpike Road, Suite 370
Southborough, MA 01772-2121
508.229.0111, Fax 508.229.0166

Southeast Region
3355 Lenox Road NE, Suite 320
Atlanta, GA 30326-1332
404.231.1952, Fax 404.231.5945

1315 E. Lafayette Street, Suite A
Tallahassee, FL 32301-4757
850.878.2729, Fax 850.877.8114

Washington DC Office
One Dupont Circle NW, Suite 220
Washington, DC 20036-1170
202.223.2318, Fax 202.293.2223

Hunt Valley Office
Executive Plaza One
11350 McCormick Road, Suite 200
Hunt Valley, MD 21031-1002
410.584.8000, Fax 410.785.1714

KeyTrain Office
340 Frazier Avenue
Chattanooga, TN 37405-4050
423.266.2244, Fax 423.266.2111

National Center for Educational Achievement
8701 N. MoPac Expy., Suite 200
Austin, TX 78759-8364
512.320.1800, Fax 512.320.1877

© 2012 by ACT, Inc. All rights reserved. 19350
