
ADDIS ABABA UNIVERSITY

SCHOOL OF PSYCHOLOGY

Measurement and Statistics in the Behavioral Sciences


Module Code (Psyc 641)

Measurement component

By
Dame Abera

January 2014

1. Measurement
1.1. Definition of basic terms:
Test:

 It refers to the presentation of a standard set of questions to be answered by pupils


 It is also an instrument for appraising a sample of a person’s behavior
Measurement

 It is a process of assigning numbers to objects or events


o Ex. – hurricanes/high wind, earthquakes, time, stock market, height, weight

 It uses an instrument (such as ruler, thermometer, scale, and test) to determine the
quantity of something
 It mostly describes a given behavior in terms of test scores, amount, size, or quantity
Assessment

 It refers to the process of collecting (through paper-and-pencil tests, observational


techniques, and self-report devices), interpreting and synthesizing information to aid in
decision making
Evaluation

 It refers to the process of arriving at judgments about abstract entities such as programs,
curricula, organizations, and institutions
o For example, systemic evaluations (e.g., national assessments) are conducted to
ascertain how well an education system is functioning
 In most education contexts, assessments are a vital component of any evaluation
 It involves making a value judgment; the highest level of intellectual skill according to
Bloom; the ability to render judgments about the value of methods or materials for
specific purposes, making use of external or internal criteria
Psychometrics

 Assigning numbers to psychological characteristics


o Ex. – achievement, personality, IQ, opinion, interests

Theories of psychometrics

 Classical Test Theory (CTT)

 Item Response Theory (IRT)


Classical Test Theory


Basics of Classical Test Theory

 Classical Test Theory (CTT) aims at studying the reliability of a (real valued) test score
variable (measurement, test) that maps a crucial aspect of qualitative or quantitative
observations into the set of real numbers
 Aside from determining the reliability of a test score variable itself, CTT allows
answering questions such as:
o How do two random variables correlate once the measurement error is filtered
out (correction for attenuation)?
o How dependable is a measurement in characterizing an attribute of an individual
unit, i.e., which is the confidence interval for the true score of that individual
with respect to the measurement considered?
o How reliable is an aggregated measurement consisting of the average (or sum)
of several measurements of the same unit or object (Spearman-Brown formula
for test length)?
o How reliable is a difference, e.g., between a pretest and posttest?
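For reference, the attenuation correction and the Spearman-Brown formula named above can be stated in conventional notation (standard CTT results, not quoted from this module):

```latex
% Correction for attenuation: the correlation between true scores,
% given the observed correlation r_XY and reliabilities r_XX, r_YY
r_{\tau_X \tau_Y} = \frac{r_{XY}}{\sqrt{r_{XX}\, r_{YY}}}

% Spearman-Brown formula for test length: reliability of a test
% lengthened by a factor k, given its current reliability r
r_{kk} = \frac{k\, r}{1 + (k - 1)\, r}
```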
Basic Concepts of Classical Test Theory
Primitives
 In the framework of CTT:
o Each measurement (test score) is considered to be a value of a random variable Y
consisting of two components:
 A true score and
 an error score
o Two levels, or more precisely, two random experiments may be distinguished:
 Sampling an observational unit (e.g., a person) and
 Sampling a score within a given unit
 Within a given unit:
o The true score is a parameter, i.e., a given but unknown number characterizing
the attribute of the unit
o Whereas the error is a random variable with an unknown distribution

The core concepts: True score and error variables


 The true score variable τ_i := E(Y_i | U) is defined by the conditional expectation of the test
score Y_i given the person variable U
 The values of the true score variable τ_i are the conditional expected values E(Y_i | U = u)
of Y_i given the unit u
 They are also called the true scores of the unit u with respect to Y_i
 Hence, these true scores are the expected values of the intra-individual distributions of
the Y_i
 The measurement error variables ε_i are simply defined by the difference ε_i := Y_i − τ_i

Assumptions of CTT
Theory of true scores

 Most classical approaches assume that the raw score (X) obtained by any one individual
is made up of a true component (T) and a random error (E) component: X = T + E.
o The true score of a person can be found by taking the mean score that the person
would get on the same test if they had an infinite number of testing sessions
 Because it is not possible to obtain an infinite number of test scores, T is a
hypothetical, yet central, aspect of CTT
Assumptions about the true scores

 Domain sampling theory assumes that the items that have been selected for any one test
are just a sample of items from an infinite domain of potential items
 The parallel test theory assumes that two or more tests with different domains sampled
(i.e., each is made up of different but parallel items) will give similar true scores but have
different error scores
o Two tests Y_i and Y_j are defined to be parallel:
 if they are τ-equivalent (τ_i = τ_j)
 if their error variables are uncorrelated (Cov(ε_i, ε_j) = 0, i ≠ j), and
 if they have identical error variances (Var(ε_i) = Var(ε_j))
o Multiple forms of a test (e.g., Form A and Form B) are considered to be parallel if
their means, variances and reliabilities are equal
 True scores in the population are assumed to be:
o Measured at the interval level and
o normally distributed
 When these assumptions are not met, test developers convert scores,
combine scales, and do a variety of other things to the data to ensure that
these assumptions are met

Assumptions about the random errors


 In addition, the overriding concern of CTT is to cope effectively with the random error
portion (E) of the raw score
o The less random error in the measure, the more the raw score reflects the true
score
 It is also expected that the random errors around the true score would be normally
distributed
 In addition, those random errors are uncorrelated with each other; that is, there is no
systematic pattern to why scores would fluctuate from time to time
 Finally, those random errors are also uncorrelated to the true score, T, in that there is no
systematic relationship between a true score (T) and whether or not that person will have
positive or negative errors
o All of these assumptions about the random errors form the foundations of CTT


o The standard deviation of the distribution of random errors around the true score
is called the standard error of measurement
 The lower it is, the more tightly packed around the true score the random
errors will be
Rules of Classical Test Theory

 The first is that the standard error of measurement of a test is consistent across an entire
population
o That is, the standard error does not differ from person to person but is instead
generated by large numbers of individuals taking the test, and it is subsequently
generalized to the population of potential test takers
o In addition, regardless of the raw test score (high, medium, or low), the standard
error for each score is the same
 The second is that as tests become longer, they become increasingly reliable
o Recall that in domain sampling, the sample of test items that makes up a single
test comes from an infinite population of items
o Also recall that larger numbers of items better sample the universe of items and
statistics generated by them (such as mean test scores) are more stable if they are
based on more items
 The third is that the important statistics about test items (e.g., their difficulty) depend on
the sample of respondents being representative of the population
o The interpretation of a test score is meaningless without the context of normative
information
The major advantage of CTT
 The major advantage of CTT is its relatively weak theoretical assumptions, which make
CTT easy to apply in many testing situations
o Relatively weak theoretical assumptions not only characterize CTT but also its
extensions (e.g., generalizability theory)
 Although CTT’s major focus is on test-level information, item statistics (i.e., item
difficulty and item discrimination) are also an important part of the CTT model
o CTT does not invoke a complex theoretical model to relate an examinee’s ability
to success on a particular item
o Instead, CTT collectively considers a pool of examinees and empirically
examines their success rate on an item (assuming it is dichotomously scored)
 This success rate of a particular pool of examinees on an item, well known
as the p value of the item, is used as the index for the item difficulty
o The ability of an item to discriminate between higher ability examinees and
lower ability examinees is known as item discrimination, which is often
expressed statistically as the Pearson product-moment correlation coefficient
between the scores on the item (e.g., 0 and 1 on an item scored right-wrong) and
the scores on the total test
 When an item is dichotomously scored, this estimate is often computed as
a point-biserial correlation coefficient
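A minimal sketch of these two item statistics, assuming a small matrix of dichotomously scored responses (the data and variable names are illustrative, not from the original):

```python
import numpy as np

# Rows = examinees, columns = items; 1 = correct, 0 = incorrect (toy data)
responses = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [1, 1, 1, 1],
    [0, 1, 0, 0],
    [1, 1, 0, 1],
])

# Item difficulty: the p value is simply the proportion answering correctly
p_values = responses.mean(axis=0)

# Item discrimination: point-biserial correlation between each item
# and the total test score (here the total includes the item itself)
total = responses.sum(axis=1)
discrimination = np.array([
    np.corrcoef(responses[:, j], total)[0, 1]
    for j in range(responses.shape[1])
])

print("p values:      ", p_values)
print("discrimination:", discrimination)
```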


Limitations of CTT
 If item responses are dichotomous, CTT suggests that they should not be subjected to
factor analysis
o This poses problems in establishing the validity for many tests of cognitive
ability, where answers are coded as correct or incorrect
 Once the item stems are created and subjected to content analysis by the experts, they
often disappear from the analytical process
o Individuals may claim that a particular item stem is biased or unclear, but no
statistical procedures allow for comparisons of the item content, or stimulus, in
CTT
 CTT and its models are not really adequate for modeling answers to individual items in a
questionnaire
 CTT focuses exclusively on random measurement error, leaving systematic error unmodeled
 The major limitation of CTT can be summarized as circular dependency
o The person statistic (i.e., observed score) is (item) sample dependent, and
o the item statistics (i.e., item difficulty and item discrimination) are (examinee)
sample dependent
 This circular dependency poses some theoretical difficulties in CTT’s
application in some measurement situations (e.g., test equating,
computerized adaptive testing)
Item Response Theory

 Models of item response theory (IRT) specify how the probability of answering in a
specific category of an item depends on the attribute to be measured, i.e., on the value of
a latent variable
 IRT is more theory grounded and models the probabilistic distribution of examinees’
success at the item level
 As its name indicates, IRT primarily focuses on the item-level information in contrast to
the CTT’s primary focus on test-level information
 The IRT framework encompasses a group of models, and the applicability of each model
in a particular situation depends on:
o The nature of the test items and
o The viability of different theoretical assumptions about the test items
 For test items that are dichotomously scored, there are three IRT models, known as
three-, two-, and one-parameter IRT models
The IRT three-parameter model:
 In the IRT three-parameter model, [c.sub.i] represents the guessing factor, [a.sub.i]
represents the item discrimination parameter commonly known as item slope, [b.sub.i]
represents the item difficulty parameter commonly known as the item location
parameter, D represents an arbitrary constant (normally, D = 1.7), and [Theta] represents
the ability level of a particular examinee


 The item location parameter b_i is on the same scale as the ability θ, and equals the value
of θ at the point at which an examinee with that ability level has a 50/50
probability of answering the item correctly (assuming no guessing, i.e., c_i = 0)
 The item discrimination parameter is the slope of the tangent line of the item
characteristic curve at the point of the location parameter
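The three-parameter logistic function itself is not printed above; in its standard form it can be sketched as follows (the parameter values in the example are illustrative):

```python
import numpy as np

def p_correct_3pl(theta, a, b, c, D=1.7):
    """Probability of a correct response under the 3PL model:
    P(theta) = c + (1 - c) / (1 + exp(-D * a * (theta - b)))."""
    return c + (1.0 - c) / (1.0 + np.exp(-D * a * (theta - b)))

# At the item location (theta == b) the examinee answers correctly with
# probability c + (1 - c)/2, i.e., midway between guessing and certainty
print(p_correct_3pl(theta=0.0, a=1.2, b=0.0, c=0.2))  # 0.6
```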
The two-parameter model:
 When the guessing factor is assumed or constrained to be zero (c_i = 0), the three-
parameter model is reduced to the two-parameter model for which only:
o item location and
o item slope parameters need to be estimated

The one-parameter IRT model


 If another restriction is imposed that stipulates that all items have equal and fixed
discrimination, then a_i becomes a constant rather than a variable, and as such,
this parameter does not require estimation, and the IRT model is further reduced to the
one-parameter IRT model
 So, for the one-parameter IRT model, constraints have been imposed on two of the three
possible item parameters, and item difficulty remains the only item parameter that needs
to be estimated
 This one-parameter model is often known as the Rasch model, named after the
researcher who did pioneer work in the area
o It is clear from the discussion that the three-parameter model is the most
general model, and the other two IRT models (two- and one-parameter
models) can be considered as models nested or subsumed under the three-
parameter model
Advantages of IRT and Pattern Scoring

 Better estimates of an examinee’s ability


o The score that is most likely, given the student’s responses to the questions
on the test (maximum likelihood scoring)

 More information about students and items is used

 More reliability than number right scoring

 Less measurement error (SEM)


Disadvantages of IRT and Pattern Scoring

 Technical - Complex Mathematics –


o Difficult to understand

o Difficult to explain


 Not common – Not like my experience

 Perceived as “Hocus Pocus”


1.2. Scales of measurement

 Normally, when one hears the term measurement, one may think in terms of measuring
the length of something (the length of a piece of wood) or measuring a quantity of
something (a cup of flour)
 This represents a limited use of the term measurement
 In statistics, the term measurement is used more broadly and is more appropriately
termed scales of measurement
 Scales of measurement refer to the ways in which variables/numbers are defined and
categorized
 Each scale of measurement has certain properties which in turn determine the
appropriateness of certain statistical analyses

 The four scales of measurement are:


o Nominal
o Ordinal
o Interval, and
o Ratio

Nominal

 Categorical data and numbers that are simply used as identifiers or names represent a
nominal scale of measurement
 Numbers on the back of a football jersey/sport shirt and your social security number are
examples of nominal data
 If a researcher conducts a study that includes gender as a variable, and he/she codes
Female as 1 and Male as 2 (or vice versa) when entering data into the computer, he/she is
using the numbers 1 and 2 to represent categories of data
Ordinal

 An ordinal scale of measurement represents an ordered series of relationships or rank


order
 Individuals competing in a contest may be fortunate to achieve first, second, or third
place
 First, second, and third place represent ordinal data
 If Tadesse takes first and Melese takes second, we do not know if the competition was
close; we only know that Tadesse outperformed Melese
 Likert-type scales (such as "On a scale of 1 to 10 with one being no pain and ten being
high pain, how much pain are you in today?") also represent ordinal data


 Fundamentally, these scales do not represent a measurable quantity


 An individual may respond 8 to this question and be in less pain than someone else who
responded 5
 A person who responds 4 is not necessarily in half as much pain as a person who responds 8
 All we know from this data is that an individual who responds 6 is in less pain than if
they responded 8 and in more pain than if they responded 4
 Therefore, Likert-type scales only represent a rank ordering
Interval

 A scale which represents quantity and has equal units but for which zero represents
simply an additional point of measurement is an interval scale
 The Fahrenheit scale is a clear example of the interval scale of measurement
 Thus, 60 degrees Fahrenheit or -10 degrees Fahrenheit are interval data
 Measurement of Sea Level is another example of an interval scale
 With each of these scales there is direct, measurable quantity with equality of units
 In addition, zero does not represent the absolute lowest value
 Rather, it is a point on the scale with numbers both above and below it (for example, -10
degrees Fahrenheit)
Ratio

 The ratio scale of measurement is similar to the interval scale in that it also represents
quantity and has equality of units
 However, this scale also has an absolute zero (no numbers exist below the zero)
 Very often, physical measures will represent ratio data (for example, height and weight)
 If one is measuring the length of a piece of wood in centimeters, there is quantity, equal
units, and that measure cannot go below zero centimeters
 A negative length is not possible
Generally:

 Interval and ratio data are sometimes referred to as parametric, while nominal and ordinal
data are referred to as nonparametric
o Parametric means that it meets certain requirements with respect to parameters
of the population (for example, the data will be normal - the distribution
parallels the normal or bell curve)
o In addition, it means that numbers can be added, subtracted, multiplied, and
divided
o Parametric data are analyzed using statistical techniques identified as
Parametric Statistics
o As a rule, there are more statistical technique options for the analysis of
parametric data and parametric statistics are considered more powerful than
nonparametric statistics
 Nonparametric data lack those same parameters and cannot be added, subtracted,
multiplied, and divided

o For example, it does not make sense to add Social Security numbers to get a
third person
o Nonparametric data are analyzed by using Nonparametric Statistics
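To make the four levels concrete, a simple lookup of the statistics conventionally regarded as meaningful at each level might look like this (the mapping follows common textbook convention rather than this module):

```python
# Statistics conventionally regarded as meaningful at each scale level;
# each level also inherits everything permitted at the levels below it
SCALE_OPERATIONS = {
    "nominal":  ["counting", "mode", "chi-square"],
    "ordinal":  ["median", "percentiles", "rank-order correlation"],
    "interval": ["mean", "standard deviation", "Pearson correlation"],
    "ratio":    ["geometric mean", "coefficient of variation", "ratios"],
}

def permissible(scale):
    """Return all operations allowed at a given scale of measurement."""
    order = ["nominal", "ordinal", "interval", "ratio"]
    idx = order.index(scale)
    return [op for level in order[: idx + 1] for op in SCALE_OPERATIONS[level]]

print(permissible("interval"))
```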

1.3. Psychological Testing and its uses

1.3.1. Introduction

 Psychological Testing is measurement of some aspect of human behavior by procedures


consisting of:
o Carefully prescribed content

o methods of administration, and

o methods of interpretation

 Test content may be addressed to almost any aspect of intellectual or emotional


functioning, including:

o personality traits

o attitudes

o Intelligence or

o emotional concerns

 Tests usually are administered by a qualified clinical, school, or industrial psychologist,


according to professional and ethical principles

 Interpretation is based on a comparison of the individual's responses with those


previously obtained to establish appropriate standards for the test scores

 The usefulness of psychological tests depends on their accuracy in predicting behavior

 By providing information about the probability of a person's responses or performance,


tests aid in making a variety of decisions

1.3.2. History of Psychological testing

 The primary impetus for the development of the major tests used today was the need for
practical guidelines for solving social problems

 The first useful intelligence test was prepared in 1905 by the French psychologists
Alfred Binet and Théodore Simon

o The two developed a 30-item scale to ensure that no child could be denied
instruction in the Paris school system without formal examination

 In 1916 the American psychologist Lewis Terman produced the first Stanford Revision
of the Binet-Simon scale to provide comparison standards for Americans from age three
to adulthood

o The test was further revised in 1937 and 1960, and today the Stanford-Binet
remains one of the most widely used intelligence tests

 The need to classify soldiers during World War I resulted in the development of two
group intelligence tests:

o Army Alpha and

o Army Beta

 To help detect soldiers who might break down in combat, the American psychologist
Robert Woodworth designed the Personal Data Sheet, a forerunner of the modern
personality inventory

 During the 1930s controversies over the nature of intelligence led to the development of
the Wechsler-Bellevue Intelligence Scale, which not only provided an index of general
mental ability but also revealed patterns of intellectual strengths and weaknesses

o The Wechsler tests now extend from the preschool through the adult age range
and are at least as prominent as the Stanford-Binet

 As interest in the newly emerging field of psychoanalysis grew in the 1930s, two
important projective techniques introduced systematic ways to study unconscious
motivation:

o The Rorschach or inkblot test—developed by the Swiss psychiatrist Hermann


Rorschach—using a series of inkblots on cards, and

o a story-telling procedure called the Thematic Apperception Test—developed by


the American psychologists Henry A. Murray and C. D. Morgan

 Both of these tests are frequently included in contemporary personality


assessment.


 During World War II the need for improved methods of personnel selection led to the
expansion of large-scale programs involving multiple methods of personality assessment

o Following the war, training programs in clinical psychology were systematically


supported by U.S. government funding, to ensure availability of mental-health
services to returning war veterans

 As part of these services, psychological testing flourished, reaching an


estimated several million Americans each year

 Since the late 1960s increased awareness and criticism from both the
public and professional sectors have led to greater efforts to establish
legal controls and more explicit safeguards against misuse of testing
materials

 The first vocational tests were published during the 1940s, beginning with the General Aptitude
Test Battery

1.3.3. Assumptions of Psychological Tests


 When using psychological tests, we must make important assumptions, including:
o a test measures what it says it measures…validity
o an individual’s behavior and test scores will remain stable over time…reliability
o individuals understand test items similarly…standardization
o individuals can report accurately about themselves
o individuals will report their thoughts and feelings honestly and
o test score is equal to ability plus some error
 We can increase our confidence in many of these assumptions by following certain steps during
test development

1.3.4. Uses of Psychological testing

 In educational settings:
o Administrators
o Teachers
o school psychologists, and
o career counselors use tests to make educational decisions including:
 admissions
 grading
 improving instruction and curriculum planning and
 career decisions
 In clinical settings:
o clinical psychologists

o psychiatrists
o social workers, and
o other health care professionals use tests to make:
 diagnostic decisions (identify developmental, visual, and auditory
problems)
 determine interventions, and
 assess the outcome of treatment programs
 In organizational settings:
o human resources professionals and
o industrial/organizational psychologists use tests to make decisions such as:
 who to hire for a particular position…selection and
 what training individuals need… classification, and
 what performance rating an individual will receive

1.3.5. Types of psychological tests

Achievement Tests

 These tests are designed to assess current performance or learned abilities in an academic
area
o Because achievement is viewed as an indicator of previous learning, it is often used
to predict future academic success
o An achievement test administered in a public school setting would typically include
separate measures of:
 Vocabulary
 language skills and reading comprehension
 arithmetic computation and problem solving
 science, and
 social studies
o Individual achievement is determined by comparison of results with average scores
derived from large representative national or local samples
o Scores may be expressed in terms of “grade-level equivalents”; for example, an advanced
third-grade pupil may be reading on a level equivalent to that of the average fourth-grade
student
Aptitude Tests


 These tests predict future performance in an area in which the individual is not currently
trained

 They are intended to predict the ability of a person to learn new skills

 Schools, businesses, and government agencies often use aptitude tests when assigning
individuals to specific positions

 Vocational guidance counseling may involve aptitude testing to help clarify individual
career goals
o If a person's score is similar to scores of others already working in a given
occupation, likelihood of success in that field is predicted

 Some aptitude tests cover a broad range of skills pertinent to many different occupations
o The General Aptitude Test Battery, for example, not only measures general
reasoning ability but also includes:
 form perception
 clerical perception
 motor coordination, and
 finger and manual dexterity
o Other tests may focus on a single area, such as art, engineering, or modern
languages
Intelligence Tests

 In contrast to tests of specific proficiencies or aptitudes, intelligence tests measure the


global capacity of an individual to cope with the environment

 Test scores are generally known as intelligence quotients, or IQs, although the various
tests are constructed quite differently
o The Stanford-Binet is heavily weighted with items involving verbal abilities

o the Wechsler scales consist of two separate verbal and performance subscales,
each with its own IQ
o There are also specialized infant intelligence tests, tests that do not require the
use of language, and tests that are designed for group administration
Interest Inventories


 Self-report questionnaires on which the subject indicates personal preferences among


activities are called interest inventories

o Because interests may predict satisfaction with some area of employment or


education, these inventories are used primarily in guidance counseling

o They are not intended to predict success, but only to offer a framework for
narrowing career possibilities

 For example, one frequently used interest inventory, the Kuder
Preference Record, includes ten clusters of occupational interests:

– Outdoors

– Mechanical

– Computational

– Scientific

– Persuasive

– Artistic

– Literary

– Musical

– Social service, and

– Clerical

 For each item, the subject indicates which of three activities is best or
least liked

 The total score indicates the occupational clusters that include


preferred activities

Objective Personality Tests

 These tests measure social and emotional adjustment and are used to identify the need for
psychological counseling

 Items that briefly describe feelings, attitudes, and behaviors are grouped into subscales,
each representing a separate personality or style, such as:


o social extroversion or

o depression

 Taken together, the subscales provide a profile of the personality as a whole

o One of the most popular psychological tests is the Minnesota Multiphasic


Personality Inventory (MMPI), constructed to aid in diagnosing psychiatric
patients

 Research has shown that the MMPI may also be used to describe
differences among normal personality types

Projective Techniques

 Some personality tests are based on the phenomenon of projection, a mental process
described by Sigmund Freud as the tendency to attribute to others personal feelings or
characteristics that are too painful to acknowledge

 Because projective techniques are relatively unstructured and offer minimal cues to aid
in defining responses, they tend to elicit concerns that are highly personal and significant

o The best-known projective tests are the Rorschach test, popularly known as the
inkblot test, and the Thematic Apperception Test

o others include:

 word-association techniques

 sentence-completion tests, and

 various drawing procedures


1.3.6. Purposes of Psychological Testing
Diagnosis

 Once a test has been administered, the results may be evaluated to identify deficiencies or
weaknesses that need improvement or require special attention – to make a diagnosis
Placement

 tests may be used in circumstances where it would be beneficial to group individuals on


the basis of their skill level or ability, for the most efficient and effective use of time and
energy
Prediction

 Specialized tests (entrance exams for colleges and universities, personality inventories,
and skin fold measurements) have long been used for the prediction of future events on
the basis of past or present data
Motivation

 Specialized tests have been used to provide feedback, marks, and grades on one’s
performance for further improvement
Achievement

 Specialized tests have been used to measure the level of ability or mastery of subjects
Program Evaluation

 Some psychological tests are used for evaluating whether or not a particular program has
successfully achieved its objectives
Establishing norms

 Standardized tests often provide such information as one’s level of achievement relative
to a clearly defined subgroup (people of the same age, sex, or class)
Decision
Generally, psychological tests can be used to make:

 Individual decisions…those made by the person who takes a test


 Institutional decisions…those that others make as a result of an individual’s performance
on a test (promotion, selection, retention, ...)
 Comparative decisions…that involve comparing people’s scores with one another to see
who has the best score, and
 Absolute decisions…that involve seeing who has the minimum score to qualify

1.3.7. Steps in Test Construction

 The first step in test development involves defining:


o the test universe (contents and objectives)
o the target audience, and
o the purpose of the test
 The next step in test development is to write out a test plan, which includes:
o Defining the test constructs in terms of observable and measurable behaviors
o Choosing the test format (for example, multiple choice, true/false, open-ended,
essay)
o Specifying the administration method, and
o Determining the scoring method:

 Three models for scoring are the cumulative model, the categorical model, and the
ipsative model (using oneself as the norm against which to measure something)

 The scoring model determines the type of data (nominal, ordinal, or


interval) the test will yield

 After completing the test plan, the test developer is ready to begin writing the actual test
questions and administration instructions
 After writing the test, the developer conducts a pilot test followed by other studies that
provide the necessary data for validation and norming
1.3.8. Interpretation of results

 The most important aspect of psychological testing involves the interpretation of test
results

Scoring

 The raw score is the simple numerical count of responses, such as the number of correct
answers on an intelligence test

o The usefulness of the raw score is limited, however, because it does not convey
how well someone does in comparison with others taking the same test

o Percentile scores, standard scores, and norms are all devices for making this
comparison

 Percentile scoring expresses the rank order of the scores in percentages

– The percentile rank of a person’s score indicates the proportion of
the group that scored below that individual (a short computational
sketch follows this list)

 Standard scores are derived from a comparison of the individual raw


score with the mean and standard deviation of the group scores

 Tables of norms are included in test manuals to indicate the expected


range of raw scores

– Normative data are derived from studies in which the test has been
administered to a large, representative group of people

– The test manual should include a description of the sample of


people used to establish norms, including age, sex, geographical
location, and occupation


– Norms based on a group of people whose major characteristics are


markedly dissimilar from those of the person being tested do not
provide a fair standard of comparison
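As referenced in the scoring list above, a percentile rank and a standard (z) score can be computed from a set of raw scores as follows (the scores are illustrative):

```python
import numpy as np

scores = np.array([42, 55, 48, 61, 50, 47, 58, 53])  # toy raw scores
x = 55                                               # one examinee's score

# Percentile rank: percentage of the group scoring below this individual
percentile = (scores < x).mean() * 100

# Standard score: distance from the group mean in standard-deviation units
z = (x - scores.mean()) / scores.std()

print(f"percentile rank = {percentile:.0f}, z = {z:.2f}")
```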

Reliability

 Before validity can be demonstrated, a test must first yield consistent, reliable
measurements

 Reliability refers to the consistency or dependability or repeatability of test scores, data,


or observations

o It refers to the degree to which a test score is free from errors of measurement
 A measure of how much an individual’s observed score is likely to differ
from the individual’s true score is provided by the standard error of
measurement (SEM)
 Standard deviation (SD), on the other hand, provides a measure of how much
each individual score varies from the mean (the standard formula linking
SEM, SD, and reliability appears after this list)
 Fundamental to the proper evaluation of a test are:
o identification of the major sources of measurement error
o determining the size of the errors resulting from these sources
o indicating the degree of reliability to be expected and
o generalizability of results across:
 items
 forms
 raters
 administrations, and
 other measurement facets
 In measurement, random error arises from:
o student variables
o task sampling
o item calibration/standardization
o Scaling
o measurement error
o item and test design
o administration, and
o scoring protocols
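The formula referenced above links the SEM to the score SD and the reliability coefficient; this is the standard CTT expression, with illustrative numbers:

```latex
SEM = SD \sqrt{1 - r_{xx}}
% e.g., with SD = 15 and reliability r_xx = .91:
% SEM = 15 \sqrt{.09} = 15 \times 0.3 = 4.5
```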

Reliability of measurement scales

 From the classical test theory, we know that:


o Observed score (X) = True score (T) + Error score (E)


 Derived from this theory, reliability is then defined as the ratio of the true score’s
variance to the observed score’s variance
o R = Var(T) / Var(X)

The variance of the raw scores, true scores, and random error
 The equation for this process is as follows:
VAR(X) = VAR(T) + VAR(E)
 Given this, it can be shown that the proportion of the observed-score variance VAR(X) that
is due to true-score variance VAR(T) provides the reliability index of the test:
VAR(T) / VAR(X) = R
 When the variance of true scores is high relative to the variance of the observed scores,
the reliability (R) of the measure will be high (e.g., 50/60 = 0.83), whereas if the variance
of true scores is low relative to the variance of the observed scores, the reliability (R) of
the measure will be low (e.g., 20/60 = 0.33). Reliability values range from 0.00 to 1.00
 Rearranging the terms from the above equations, it can be shown that:
R = 1 – [VAR (E)/VAR(X)]
 That is, the reliability is equal to 1 – the ratio of random error variance to total score
variance
 Further, there are analyses that allow for an estimation of R (reliability), and, of course,
calculating the observed variance of a set of scores is a straightforward process
o Because R and VAR(X) can be calculated, VAR (T) can be solved for with the
following equation: VAR (T) = VAR (X) × R
 It is worth reiterating here that CTT is largely interested in modeling the random error
component of a raw score
o Some error is not random; it is systematic
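A small simulation can make this variance decomposition concrete. The sketch below assumes normally distributed true and error scores and reproduces the 50/60 = 0.83 example (all values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Simulate X = T + E with independent true scores and random errors
T = rng.normal(loc=50, scale=np.sqrt(50), size=n)  # Var(T) = 50
E = rng.normal(loc=0,  scale=np.sqrt(10), size=n)  # Var(E) = 10
X = T + E

# Reliability as the ratio of true-score variance to observed variance;
# expected value: 50 / 60 = 0.83
print(T.var() / X.var())
```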
Estimating reliability

 The starting point of estimating the reliability of measurement is due to Charles


Spearman at the turn of the 20th century
 The concept of correlation was known, based on the work of Bravais, Galton and
Pearson during the 19th century, but Spearman was the first to consider various hidden
underlying causes affecting the true correlation
 To estimate reliability, researchers compute the correlation coefficient between two sets of
scores from the same person on the same test
 The two sets of scores can be obtained in various ways:
o Test-retest reliability:
 High correlation between scores on the same test given on different
occasions
 Measures of stability
o Alternate-form reliability:
 Two forms of the same test can be used
 Measures of equivalence
o Internal consistency reliability
 Internal consistency is estimated via:
- The split-half reliability index

- coefficient alpha index or


- the Kuder-Richardson formula 20 index
o Inter-rater reliability
o Intra-rater reliability

The split-half reliability index

 In the split-half method, a correlation coefficient is calculated between a person’s scores


on two comparable halves of the single test
 The formula for the Spearman-Brown split-half reliability coefficient is given as:
r_sb = 2r / (1 + r), where r is the correlation between the two halves of the test
 If the correlation between the two halves is r = 0.94, then the estimated reliability of the
full-length test is: 2(.94) / (1 + .94) = 1.88 / 1.94 = .97
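A sketch of the whole split-half procedure on simulated item responses (the data-generating choices are illustrative):

```python
import numpy as np

# Rows = examinees, columns = items (simulated 0/1 responses)
rng = np.random.default_rng(1)
ability = rng.normal(size=200)
items = (ability[:, None] + rng.normal(scale=1.0, size=(200, 8)) > 0).astype(int)

# Scores on the two comparable halves (odd- vs. even-numbered items)
half1 = items[:, 0::2].sum(axis=1)
half2 = items[:, 1::2].sum(axis=1)

# Correlation between the halves, then the Spearman-Brown correction
# to full test length: r_sb = 2r / (1 + r)
r = np.corrcoef(half1, half2)[0, 1]
r_sb = 2 * r / (1 + r)
print(f"half correlation = {r:.2f}, split-half reliability = {r_sb:.2f}")
```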


 Thus, Coefficient Alpha (also named “Cronbach’s alpha”) is another measure of
internal consistency; it is a squared concept, even though there is no squared sign in the
symbol α designating it
 Theoretically, coefficient alpha is an estimate of the squared correlation expected
between two tests drawn at random from a pool of items similar to the items in the test
under construction
Cronbach’s alpha

 Specifically, coefficient alpha is typically used during scale development with items
that have several response options (i.e., 1 = strongly disagree to 5 = strongly agree)
 Cronbach alpha is appropriately applied to norm-referenced tests and norm-
referenced decisions (e.g., admissions and placement decisions), but not to criterion-
referenced tests and criterion-referenced decisions (e.g., diagnostic and achievement
decisions)

 If average inter-item correlation is > .6, then one can standardize items and add them
together as an index from which it can be seen that alpha measures true variance over
total variance
 In 1951 Cronbach introduced the symbol α for the first time
 The formula for Cronbach alpha is given as:
α = [k / (k − 1)] × [1 − Σ Var(Y_i) / Var(X)], where k is the number of items,
Var(Y_i) is the variance of item i, and Var(X) is the variance of the total scores
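A direct translation of this formula into code might look as follows (sample variances are used here, a common convention; the data are illustrative):

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for an examinee-by-item score matrix."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)        # per-item variances
    total_var = items.sum(axis=1).var(ddof=1)    # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Example: 5 respondents, 4 Likert-type items (illustrative data)
data = [[4, 5, 4, 4],
        [2, 3, 3, 2],
        [5, 5, 4, 5],
        [3, 3, 2, 3],
        [4, 4, 4, 5]]
print(round(cronbach_alpha(data), 2))
```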
Characteristics of Cronbach alpha


 The range of the alpha is from 0 to 1.0
 If the user obtains negative alphas, it means that the items are inconsistently coded
o Consistent coding means all items have to be coded so that high values on the
items correspond to high values on the total scale scores
 If the item-total correlations are negative, then the coding of the items needs to be
reviewed and corrected before computation of the alpha
 According to J. C. Nunnally (1998), the alpha of a scale should be greater than .70 for
items to be used together as a scale
 The computation of Cronbach's alpha is based on the number of items on the survey (k)
and the ratio of the average inter-item covariance to the average item variance
 All other factors held constant:
o Tests that have normally distributed scores are more likely to have high Cronbach
alpha reliability estimates than tests with positively or negatively skewed
distributions, and so alpha must be interpreted in light of the particular score
distribution involved
o Cronbach alpha will be higher for longer tests than for shorter tests, and so alpha
must be interpreted in light of the particular test length involved
 If the items in a scale cohere together, then an alpha factor analysis will yield a single
factor with loadings that maximize the alpha coefficient
 D.J. Armor (1974) derives a Theta reliability coefficient that is an alpha analog
 Theta reliability is designed to measure the internal consistency of the items (variables)
in the first factor scale derived from a factor analysis
 Theta is formulated as:
θ = [p / (p − 1)] × (1 − 1/λ₁), where p is the number of items and λ₁ is the largest
eigenvalue of the inter-item correlation matrix
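A sketch of this coefficient, computed from the largest eigenvalue of the inter-item correlation matrix (the simulated data are illustrative):

```python
import numpy as np

def armor_theta(items):
    """Armor's theta: (p / (p - 1)) * (1 - 1 / lambda1), where lambda1 is
    the largest eigenvalue of the inter-item correlation matrix."""
    items = np.asarray(items, dtype=float)
    p = items.shape[1]
    corr = np.corrcoef(items, rowvar=False)
    lambda1 = np.linalg.eigvalsh(corr).max()
    return (p / (p - 1)) * (1 - 1 / lambda1)

# Simulated item scores sharing a common factor (illustrative)
data = np.random.default_rng(2).normal(size=(100, 5))
data += data[:, [0]]  # induce correlation among the items
print(round(armor_theta(data), 2))
```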

Kuder-Richardson formula 20

 A collection of new reliability measures was introduced by G. F. Kuder and M. W.


Richardson (1937)
 KR-20 is used to estimate reliability for truly dichotomous (i.e., Yes/No; True/False)
response scales
 The principal advantages claimed for Kuder-Richardson formulas were ease of
calculation, uniqueness of estimate (compared to split-half methods), and conservatism
 The formula to compute KR-20 is:
KR-20 = [N / (N − 1)] × [1 − Σ(p_i q_i) / Var(X)]
Where

 N = the number of items
 Σ(p_i q_i) = the sum, over items, of the product of the proportions of correct (p_i)
and incorrect (q_i) responses; and
 Var(X) = the composite (total-score) variance
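A direct implementation of the KR-20 formula (the sample variance is used for the composite, an assumption; the data are illustrative):

```python
import numpy as np

def kr20(items):
    """Kuder-Richardson formula 20 for dichotomously scored items."""
    items = np.asarray(items, dtype=float)
    n = items.shape[1]                       # number of items
    p = items.mean(axis=0)                   # proportion correct per item
    q = 1 - p                                # proportion incorrect per item
    var_x = items.sum(axis=1).var(ddof=1)    # composite (total-score) variance
    return (n / (n - 1)) * (1 - (p * q).sum() / var_x)

# Example: 6 examinees, 5 right/wrong items (illustrative data)
data = [[1, 1, 0, 1, 1],
        [0, 1, 0, 0, 1],
        [1, 1, 1, 1, 1],
        [0, 0, 0, 1, 0],
        [1, 1, 0, 1, 1],
        [1, 0, 1, 1, 0]]
print(round(kr20(data), 2))
```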
Inter-rater reliability

 It refers to the level of agreement between assessments made by two or more assessors of
the same material at a time
 It is measured using a family of intraclass correlation coefficients (ICC; range 0–1; an
ICC of 0.7 or higher is generally considered acceptable)
 Another common statistic for measuring inter-rater agreement is Cohen’s kappa
 Cohen's kappa is the proportion of agreement (corrected for chance) divided
by the maximum possible agreement (corrected for chance)


Cohen's kappa formula is given by:
κ = (P_o − P_e) / (1 − P_e), where P_o is the observed proportion of agreement and
P_e is the proportion of agreement expected by chance

The nearer kappa is to 1.00, the stronger the agreement
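A minimal sketch of this computation for two raters' categorical judgments (the ratings are illustrative):

```python
import numpy as np

def cohens_kappa(r1, r2):
    """Cohen's kappa for two raters' categorical ratings:
    kappa = (p_o - p_e) / (1 - p_e)."""
    r1, r2 = np.asarray(r1), np.asarray(r2)
    categories = np.union1d(r1, r2)
    p_o = (r1 == r2).mean()                        # observed agreement
    p_e = sum((r1 == c).mean() * (r2 == c).mean()  # chance agreement
              for c in categories)
    return (p_o - p_e) / (1 - p_e)

# Two raters classifying the same 10 responses (illustrative)
rater1 = ["pass", "pass", "fail", "pass", "fail",
          "pass", "pass", "fail", "pass", "pass"]
rater2 = ["pass", "fail", "fail", "pass", "fail",
          "pass", "pass", "pass", "pass", "pass"]
print(round(cohens_kappa(rater1, rater2), 2))
```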


Intra-rater reliability

 the level of agreement between assessments made by the same assessor of the same
material presented at two or more times
Factors influencing Reliability

 Length of the test: in general, the longer the test, the higher the reliability
 level of test difficulty: norm referenced tests which are too easy or too difficult will tend
to provide scores of low reliability
 Spread of scores: other things being equal, the higher the spread of scores, the higher the
estimate of reliability
 Objectivity: the higher the objectivity of a test, the higher its reliability
o The higher the reliability, the lesser the error of measurement
 Methods of estimating reliability: in general, the size of the reliability coefficient is
related to the method of estimating reliability
Validity

 Interpretation of test scores ultimately involves predictions about a subject's behavior in


a specified situation

 If a test is an accurate predictor, then it is said to have good validity

 Validity is the degree of truthfulness of a test score, referring to the extent to which a test
measures what it proposes to measure
 Validity is not a property of the test or assessment, but rather of the meaning of the test
scores

 In addition to reliability, psychologists recognize different types of validity:


Face validity:

 Face validity simply means the validity at face value


 As a check on face validity, test/survey items are sent to teachers to obtain suggestions
for modification
Content validity

 A test has content validity if the sample of items in the test is representative of all the
relevant items that might have been used

o In short, face validity can be established by one person, but content validity
should be checked by a panel of subject-matter experts (SMEs)
o Content validity is sample-oriented rather than sign-oriented

o Words included in a spelling test, for example, should cover a wide range of
difficulty

Criterion-related validity

 Draws an inference from test scores to performance

 Criterion-related validity refers to a test's accuracy in specifying a future or concurrent


outcome

o criterion validity is about prediction rather than explanation


o Prediction is concerned with non-causal or mathematical dependence, whereas
explanation pertains to causal or logical dependence
 There are two basic methods for showing a relation between a test and independent
events or behaviors (the criterion)
o Predictive validity
- Refers to how test performance describes future performance
- Established by correlating test scores taken at one time with scores on a
criterion measure obtained at a later date, usually months later
- This method establishes that the test provides information about events in the
future
o Concurrent validity
- Refers to how test performance describes current performance
- Established by correlating test scores and criterion scores obtained at
approximately the same time, usually within a week
- This method establishes that the test can provide information about
independent events or behaviors in the present
Construct validity

 Draws an inference from test scores to a psychological construct



 Construct validity is generally determined by investigating what psychological traits or


qualities a test measures

o Because it is concerned with abstract and theoretical constructs, construct
validity is also known as theoretical construct validity
o We gather psychometric evidence of construct validity by conducting empirical
studies of the following:
 Reliability
– Test developers and researchers should provide evidence of
internal consistency or homogeneity
– Evidence of test–retest reliability is also appropriate
 Convergent validity
– A strong correlation between the test and other tests measuring the
same or similar constructs is necessary
 Discriminant validity
– Lack of correlation between the test and unrelated constructs is
also valuable
Factors influencing Validity
Factors in the test itself

 Unclear directions
 Difficult vocabulary
 Complex sentence structure
 Inappropriate level of test item difficulty
 poorly constructed test items
 Ambiguity
 Test too short
 Improper arrangement of items
 unintended clues
 Identifiable pattern of answers
 Specific determiners
 Grammatical inconsistencies
 Verbal associations
Factors in pupil’s response

 Personal factors (lack of motivation, emotional disturbances, frustration, response set)


Nature of the group and the criterion

 What a test measures is influenced by such factors as age, sex, ability level, educational
background, and cultural background of the particular group tested
 Other things being equal, the greater the similarity between the performance measured by
the test and the performance represented in the criterion, the larger the validity coefficient

Factors in test administration and scoring

 Insufficient time to complete the test


 Unfair aid to individual pupils who ask for help
 Cheating during the examination
 Unreliable scoring of essay questions
 Adverse physical and psychological conditions at the time of testing
In general, larger validity coefficients will result when the spread of scores is larger and when
the time span is shorter
Controversies

 The major psychological testing controversies stem from two interrelated issues:

o Technical shortcomings in test design and

o ethical problems in interpretation and application of results

Factor Analysis

 Researchers use factor analysis to identify underlying variables or factors that contribute
to a construct or an overall test score
o Confirmatory factor analyses that confirm predicted underlying variables
provide evidence of construct validity
o Exploratory factor analyses take a broad look at the test data to determine the
maximum number of underlying structures
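A minimal exploratory sketch, assuming scikit-learn is available (the two-factor choice and the simulated data are illustrative):

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

# Simulate item scores driven by two latent factors (illustrative)
rng = np.random.default_rng(3)
factors = rng.normal(size=(300, 2))
loadings = rng.uniform(0.5, 1.0, size=(2, 6))
items = factors @ loadings + rng.normal(scale=0.5, size=(300, 6))

# Exploratory factor analysis: estimate loadings for two factors
fa = FactorAnalysis(n_components=2).fit(items)
print(np.round(fa.components_, 2))  # rows = factors, columns = items
```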
