ASES 311 PSYCHOLOGICAL ASSESSMENT (BS PSY, SECOND SEMESTER)
CHAPTER 4: A STATISTICS REFRESHER
SCALES OF MEASUREMENT
• Continuous scales – scales for which it is theoretically possible to divide any of the values; they typically have a wide range of possible values (e.g., height or a depression scale).
• Discrete scales – scales with categorical values (e.g., male or female).
• Error – the collective influence of all of the factors on a test score beyond those specifically measured by the test.
• Nominal scales – involve classification or categorization based on one or more distinguishing characteristics; all things measured must be placed into mutually exclusive and exhaustive categories (e.g., apples and oranges, DSM-IV diagnoses).
• Ordinal scales – permit classification and rank ordering of what is measured; unlike interval scales, the distances between ranks are not necessarily equal.
• Interval scales – contain equal intervals between numbers; each unit on the scale is exactly equal to any other unit on the scale (e.g., IQ scores and most other psychological measures).
• Ratio scales – interval scales with a true zero point (e.g., height or reaction time).
• Psychological measurement – most psychological measures are truly ordinal but are treated as interval measures for statistical purposes.
DESCRIBING DATA
• Distributions – a set of test scores arrayed for recording or study.
• Raw score – a straightforward, unmodified accounting of performance that is usually numerical.
• Frequency distribution – all scores are listed alongside the number of times each score occurred. Frequency distributions may be presented in tabular form; in a simple frequency distribution, the scores have not been grouped.
• Grouped frequency distribution – has class intervals rather than actual test scores.
• Histogram – a graph with vertical lines drawn at the true limits of each test score (or class interval), forming a series of contiguous rectangles.
• Bar graph – numbers indicative of frequency appear on the Y-axis, and reference to some categorization (e.g., yes/no/maybe, male/female) appears on the X-axis.
• Frequency polygon – a graph on which test scores or class intervals (as indicated on the X-axis) meet frequencies (as indicated on the Y-axis).
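As an illustration (not from the original notes), both kinds of frequency distribution can be tabulated in a few lines of Python; the raw scores below are invented.

from collections import Counter

scores = [85, 92, 78, 92, 85, 70, 85, 78, 96, 70]  # hypothetical raw test scores

# Simple frequency distribution: each score alongside the number of times it occurred
simple = Counter(scores)
for score in sorted(simple, reverse=True):
    print(score, simple[score])

# Grouped frequency distribution: class intervals (here of width 10) instead of actual scores
width = 10
grouped = Counter((s // width) * width for s in scores)
for lower in sorted(grouped, reverse=True):
    print(f"{lower}-{lower + width - 1}: {grouped[lower]}")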

MEASURES OF CENTRAL TENDENCY
• Central tendency – a statistic that indicates the average or midmost score between the extreme scores in a distribution.
• Mean – the sum of the observations (or test scores) divided by the number of observations.
• Median – the middle score in a distribution; particularly useful when there are outliers, or extreme scores, in a distribution.
• Mode – the most frequently occurring score in a distribution. When two scores occur with the highest frequency, the distribution is said to be bimodal.

MEASURES OF VARIABILITY
• Variability is an indication of the degree to which scores are scattered or dispersed in a distribution. Two distributions may have the same mean score, yet one may have greater variability (scores more spread out).
• Measures of variability are statistics that describe the amount of variation in a distribution.
• Range – the difference between the highest and the lowest scores.
• Interquartile range – the difference between the third and first quartiles of a distribution.
• Semi-interquartile range – the interquartile range divided by 2.
• Average deviation – the average distance of scores in a distribution from the mean.
• Variance – the arithmetic mean of the squares of the differences between the scores in a distribution and their mean.
• Standard deviation – the square root of the average squared deviation about the mean; that is, the square root of the variance. It is the typical distance of scores from the mean.

TYPES OF DISTRIBUTIONS
• Skewness – the nature and extent to which symmetry is absent in a distribution.
 o Positive skew – relatively few of the scores fall at the high end of the distribution.
 o Negative skew – relatively few of the scores fall at the low end of the distribution.
• Kurtosis – the steepness of a distribution in its center.
 o Platykurtic – relatively flat.
 o Leptokurtic – relatively peaked.
 o Mesokurtic – somewhere in the middle.

THE NORMAL CURVE
• The normal curve is a bell-shaped, smooth, mathematically defined curve that is highest at its center and perfectly symmetrical.
• Area under the normal curve – the normal curve can be conveniently divided into areas defined by units of standard deviation.
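The statistics above reduce to a few standard-library calls; here is a minimal Python sketch with an invented score list (note the outlier 31 pulling the mean above the median).

import statistics

scores = [10, 12, 12, 13, 15, 17, 18, 20, 22, 31]  # hypothetical test scores

mean = statistics.mean(scores)      # sum of the scores divided by their number
median = statistics.median(scores)  # middle score; robust to the outlier 31
mode = statistics.mode(scores)      # most frequently occurring score (12)

score_range = max(scores) - min(scores)   # highest minus lowest score
variance = statistics.pvariance(scores)   # mean of squared deviations about the mean
sd = statistics.pstdev(scores)            # square root of the variance

print(mean, median, mode, score_range, variance, sd)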

STANDARD SCORES
• A standard score is a raw score that has been converted from one scale to another scale, where the latter scale has some arbitrarily set mean and standard deviation.
• Z-score – conversion of a raw score into a number indicating how many standard deviation units the raw score is below or above the mean of the distribution.
• T scores – can be called a "fifty plus or minus ten" scale; that is, a scale with a mean set at 50 and a standard deviation set at 10.
• Stanine – a standard score scale with a mean of 5 and a standard deviation of approximately 2, divided into nine units.
• Normalizing a distribution – involves "stretching" the skewed curve into the shape of a normal curve and creating a corresponding scale of standard scores.

CORRELATION AND INFERENCE
• A coefficient of correlation (or correlation coefficient) is a number that provides an index of the strength of the relationship between two things.
• Correlation coefficients vary in magnitude between -1 and +1. A correlation of 0 indicates no relationship between two variables.
• Positive correlations indicate that as one variable increases or decreases, the other variable follows suit.
• Negative correlations indicate that as one variable increases, the other decreases.
• Correlation between variables does not imply causation, but it does aid in prediction.
• Pearson r – a method of computing correlation when both variables are linearly related and continuous.
• Once a correlation coefficient is obtained, it needs to be checked for statistical significance (typically a probability level below .05).
• By squaring r, one obtains the coefficient of determination: the variance that the variables share with one another.
• Spearman rho – a method for computing correlation, used primarily when sample sizes are small or the variables are ordinal in nature.
• Scatterplot – involves simply plotting one variable on the X (horizontal) axis and the other on the Y (vertical) axis. Scatterplots of strong correlations feature points tightly clustered together in a diagonal line; for positive correlations the line runs from bottom left to top right, while strong negative correlations form a tightly clustered diagonal line from top left to bottom right.
• Outlier – an extremely atypical point (case), lying relatively far away from the other points in a scatterplot.
• Restriction of range leads to weaker correlations.
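A hedged sketch of the conversions and coefficients above; the two score lists and all names are invented for illustration.

import math
import statistics

def z_score(raw, mean, sd):
    # Number of SD units the raw score lies above (+) or below (-) the mean
    return (raw - mean) / sd

def t_score(z):
    # The "fifty plus or minus ten" scale: mean 50, standard deviation 10
    return 50 + 10 * z

def pearson_r(x, y):
    # Pearson r for two linearly related, continuous variables
    mx, my = statistics.mean(x), statistics.mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

test = [100, 105, 110, 95, 120, 90]    # hypothetical scores on one measure
criterion = [50, 55, 60, 48, 66, 45]   # hypothetical scores on a second measure

z = z_score(110, statistics.mean(test), statistics.pstdev(test))
print(round(z, 2), round(t_score(z), 1))

r = pearson_r(test, criterion)
print(round(r, 3), round(r ** 2, 3))   # r, and r squared = coefficient of determination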

META-ANALYSIS
• Meta-analysis – a family of techniques used to statistically combine information across studies to produce single estimates of the data under study.
• Meta-analysis allows researchers to look at the relationship between variables across many separate studies.
• The estimates are in the form of an effect size, which is often expressed as a correlation coefficient.
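As an illustration only (a common convention, not a method these notes prescribe), correlation-type effect sizes can be combined across studies by averaging Fisher z-transformed values weighted by sample size; the (r, N) study pairs below are invented.

import math

studies = [(0.30, 50), (0.45, 120), (0.25, 80)]  # hypothetical (correlation, sample size) pairs

def fisher_z(r):
    # The Fisher transform makes the sampling distribution of r approximately normal
    return 0.5 * math.log((1 + r) / (1 - r))

# Weight each study by N - 3, the reciprocal of the variance of its z value
weights = [n - 3 for _, n in studies]
mean_z = sum(fisher_z(r) * w for (r, n), w in zip(studies, weights)) / sum(weights)

combined_r = math.tanh(mean_z)  # back-transform to the r metric
print(round(combined_r, 3))     # single combined estimate of the effect size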
TEST AND TESTING
ASSUMPTIONS ABOUT PSYCHOLOGICAL TESTING
I. PSYCHOLOGICAL TRAITS AND STATES EXIST
• A trait has been defined as "any distinguishable, relatively enduring way in which one individual varies from another" (Guilford, 1959, p. 6).
• States also distinguish one person from another but are relatively less enduring (Chaplin et al., 1988).
• Thousands of trait terms can be found in the English language (e.g., outgoing, shy, reliable, calm).
• Psychological traits exist as constructs – informed, scientific concepts developed or constructed to describe or explain behavior.
• We can't see, hear, or touch constructs, but we can infer their existence from overt behavior, such as test scores.
• Traits are relatively stable. They may change over time, yet there are often high correlations between trait scores at different time points.
• The nature of the situation influences how traits will be manifested.
• Traits refer to ways in which one individual varies, or differs, from another.
II. TRAITS AND STATES CAN BE QUANTIFIED AND MEASURED
• Different test developers may define and measure constructs in different ways.
• Once a construct is defined, test developers turn to item content and item weighting.
• A scoring system and a way to interpret results need to be devised.
III. TEST-RELATED BEHAVIOR PREDICTS NON-TEST-RELATED BEHAVIOR
• Responses on tests are thought to predict real-world behavior. The obtained sample of behavior is expected to predict future behavior.
IV. TESTS HAVE STRENGTHS AND WEAKNESSES
• Competent test users understand and appreciate the limitations of the tests they use as well as how those limitations might be compensated for by data from other sources.
V. VARIOUS SOURCES OF ERROR ARE PART OF ASSESSMENT
• Error refers to a long-standing assumption that factors other than what a test attempts to measure will influence performance on the test.
• Error variance – the component of a test score attributable to sources other than the trait or ability measured.
• Both the assessee and the assessor are sources of error variance.
VI. TESTING AND ASSESSMENT CAN BE CONDUCTED IN A FAIR MANNER
• All major test publishers strive to develop instruments that are fair when used in strict accordance with guidelines in the test manual.
• Problems arise if the test is used with people for whom it was not intended.
• Some problems are more political than psychometric in nature.
VII. TESTING AND ASSESSMENT BENEFIT SOCIETY
• There is a great need for tests, especially good tests, considering the many areas of our lives that they benefit.

WHAT'S A "GOOD TEST"?
• Reliability: the consistency of the measuring tool; the precision with which the test measures and the extent to which error is present in measurements.
• Validity: the test measures what it purports to measure.
• Other considerations: administration, scoring, and interpretation should be straightforward for trained examiners. A good test is a useful test that will ultimately benefit individual testtakers or society at large.

NORMS
• Norm-referenced testing and assessment – a method of evaluation and a way of deriving meaning from test scores by evaluating an individual testtaker's score and comparing it to the scores of a group of testtakers.
• The meaning of an individual test score is understood relative to other scores on the same test.
• Norms are the test performance data of a particular group of testtakers that are designed for use as a reference when evaluating or interpreting individual test scores.
• A normative sample is the reference group to which testtakers are compared.

SAMPLING TO DEVELOP NORMS
• Standardization – the process of administering a test to a representative sample of testtakers for the purpose of establishing norms.
• Sampling – test developers select a population, for which the test is intended, that has at least one common, observable characteristic.
• Stratified sampling – sampling that includes different subgroups, or strata, from the population.
• Stratified-random sampling – stratified sampling in which every member of the population has an equal opportunity of being included in the sample.
• Purposive sample – arbitrarily selecting a sample that is believed to be representative of the population.
• Incidental/convenience sample – a sample that is convenient or available for use; it may not be representative of the population.
 o Generalization of findings from convenience samples must be made with caution.

DEVELOPING NORMS
Having obtained a sample, test developers:
• Administer the test with a standard set of instructions
• Recommend a setting for test administration
• Collect and analyze data
• Summarize data using descriptive statistics, including measures of central tendency and variability
• Provide a detailed description of the standardization sample itself

TYPES OF NORMS
• Percentile – the percentage of people whose score on a test or measure falls below a particular raw score.
• Percentiles are a popular method for organizing test-related data because they are easily calculated.
• One problem is that real differences between raw scores may be minimized near the ends of the distribution and exaggerated in the middle of the distribution.
• Age norms – the average performance of different samples of testtakers who were at various ages when the test was administered.
• Grade norms – the average test performance of testtakers in a given school grade.
• National norms – derived from a normative sample that was nationally representative of the population at the time the norming study was conducted.
• National anchor norms – an equivalency table for scores on two different tests; allows for a basis of comparison.
• Subgroup norms – a normative sample can be segmented by any of the criteria initially used in selecting subjects for the sample.
• Local norms – provide normative information with respect to the local population's performance on some test.

FIXED REFERENCE GROUP SCORING SYSTEMS
• The distribution of scores obtained on the test from one group of testtakers is used as the basis for the calculation of test scores for future administrations of the test.
• The SAT employs this method.

NORM-REFERENCED VERSUS CRITERION-REFERENCED INTERPRETATION
• Norm-referenced tests involve comparing individuals to the normative group. With criterion-referenced tests, testtakers are evaluated as to whether they meet a set standard (e.g., a driving exam).

CULTURE AND INFERENCE
• In selecting a test for use, responsible test users should research the test's available norms to check how appropriate they are for use with the targeted testtaker population.
• When interpreting test results, it helps to know about the culture and era of the testtaker.
• It is important to conduct culturally informed assessment.

CHAPTER 6: PROPERTIES OF A STANDARDIZED TEST
RELIABILITY
• Reliability refers to the consistency of test scores obtained by the same persons when they are reexamined with the same test on different occasions, or with different sets of equivalent items, or under varying examining conditions (Anastasi & Urbina, 1996).
• Reliability is the extent to which a score or measure is free from measurement error. Theoretically, reliability is the ratio of true score variance to observed score variance (Kaplan & Saccuzzo, 2011).
• Reliability refers to consistency in measurement: the extent to which measurements differ from occasion to occasion as a function of measurement error (Cohen & Swerdlik, 2009).

TYPES OF RELIABILITY
1. TEST-RETEST RELIABILITY
• Repeating the identical test on a second occasion: the same test is administered at two different times.
• The reliability coefficient is simply the correlation between the scores obtained by the same persons on the two administrations of the test.
• It is of value only if we are measuring characteristics that do not change over time (e.g., IQ). Tests that measure a constantly changing characteristic are not appropriate for test-retest evaluation.
• If an IQ test administered at two points in time produces different scores, we might conclude that the lack of correspondence is due to random measurement error. Usually, we do not assume that a person got smarter or less smart in the time between tests.
• Retest reliability shows the extent to which scores on a test can be generalized over different occasions. The higher the reliability, the less susceptible the scores are to random daily changes in the condition of the testtakers or the test environment.
• When retest reliability is reported in the test manual, the interval over which it was measured should always be specified. Retest correlations decrease progressively as this interval lengthens.

TWO POSSIBLE NEGATIVE EFFECTS WHEN DOING TEST-RETEST RELIABILITY
1. CARRYOVER EFFECT
• Occurs when the first testing session influences scores from the second session. For example, testtakers sometimes remember their answers from the first time they took the test.
• Carryover effects are of concern only when the changes over time are random. In cases where the changes are systematic, carryover effects do not harm the reliability.
• An example of a systematic carryover is when everyone's score improves by exactly 5 points; in this case, no new variability occurs.
• Random carryover effects occur when the changes are not predictable from earlier scores, or when something affects some but not all testtakers.
• If something affects all testtakers equally, then the results are uniformly affected, and no net error occurs.
2. PRACTICE EFFECT
• Some skills improve with practice. When a test is given a second time, testtakers score better because they have sharpened their skills by having taken the test the first time.
• The time interval between testing sessions must be selected and evaluated carefully. If the two administrations of the test are very close in time, there is a relatively great risk of carryover and practice effects.
• However, as the time interval between testing sessions increases, many other factors can intervene to affect scores.

2. ALTERNATE-FORMS RELIABILITY
• Also called "equivalent forms" or "parallel forms" reliability.
• An alternative to test-retest reliability; it makes use of alternate or parallel forms of the test.
• The same persons can thus be tested with one form on the first occasion and with another, equivalent form on the second occasion.
• The correlation between the scores obtained on the two forms represents the reliability coefficient of the test.
• In the development of alternate forms, care should be exercised to ensure that they are truly parallel:
 o Same number of items

 o Same type and content
 o Equal range and level of difficulty

LIMITATIONS
• Alternate forms can only reduce, but not totally eliminate, practice effects.
• Sometimes the two forms are administered to the same group of people on the same day. When both forms of the test are given on the same day, the only sources of variation are random error and the difference between the forms of the test.
• This type of reliability testing can be quite burdensome, considering that two forms of the same test have to be developed.

3. SPLIT-HALF RELIABILITY
• In split-half reliability, a test is given and divided into halves that are scored separately. The results of one half of the test are then compared with the results of the other.
• How to divide the test into two halves?
 o Divide the test randomly into two halves.
 o Calculate a score for the first half of the items and another score for the second half. Although convenient, this method can cause problems if the questions in the second half are more difficult.
 o Use an odd-even system.
• The correlation between the two halves is usually an underestimate, because each subset is only half as long as the full test; it is less reliable because it has fewer items.
• To correct for half length, one can apply the Spearman-Brown formula, which allows you to estimate what the correlation between the two halves would have been if each half had been the length of the whole test.
• The Spearman-Brown formula is advisable for use only when the two halves of the test have equal variances. Otherwise, Cronbach's coefficient alpha can be used; this general reliability coefficient provides the lowest estimate of reliability.

4. KR20 FORMULA
• Also known as Kuder-Richardson 20, it calculates the reliability of a test in which the items are dichotomous, scored 0 or 1 (usually for wrong or right).
• Formula: KR20 = [N / (N - 1)] x [(s² - Σpq) / s²]
 where:
 KR20 = the reliability estimate
 N = the number of items on the test
 s² = the variance of the total test score
 p = the proportion of people getting each item correct (found separately for each item)
 q = 1 - p = the proportion of people getting each item incorrect

5. COEFFICIENT ALPHA
• Developed by Cronbach to estimate the internal consistency of tests in which the items are not scored as 0 or 1 (wrong or right).
• Applicable to many personality and attitude scales.
• The SPSS software provides a convenient way of determining coefficient alpha.

HOW RELIABLE IS RELIABLE?
• Reliability estimates in the range of .70 to .80 are good enough for most purposes in basic research.
• In clinical settings, high reliability is extremely important (i.e., reliability of .90 to .95).

WHAT TO DO ABOUT LOW RELIABILITY?
• Increase the number of items.
• Perform factor and item analysis.
• Correction for attenuation – a formula used to estimate what the correlation between two variables would be if the measures were not affected by error.
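A hedged sketch of the two formulas above, implemented directly from the stated definitions; the response matrix is invented. (Coefficient alpha generalizes KR-20 by replacing Σpq with the sum of the item variances.)

def spearman_brown(r_half):
    # Estimated full-length reliability from the correlation between two half-tests
    return (2 * r_half) / (1 + r_half)

def kr20(item_matrix):
    # item_matrix: one row per person, one 0/1 entry per dichotomous item
    n_persons = len(item_matrix)
    n_items = len(item_matrix[0])
    totals = [sum(person) for person in item_matrix]
    mean_total = sum(totals) / n_persons
    s2 = sum((t - mean_total) ** 2 for t in totals) / n_persons  # total-score variance
    pq = 0.0
    for j in range(n_items):
        p = sum(person[j] for person in item_matrix) / n_persons  # proportion correct
        pq += p * (1 - p)                                         # q = 1 - p, per item
    return (n_items / (n_items - 1)) * ((s2 - pq) / s2)

print(round(spearman_brown(0.70), 3))  # 0.824: r = .70 between halves, corrected to full length

responses = [  # hypothetical right/wrong responses of five examinees to four items
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
print(round(kr20(responses), 3))  # 0.8 for these data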

FACTORS AFFECTING TEST RELIABILITY
• Test format, difficulty, objectivity, administration, scoring, and adequacy.

VALIDITY
• The validity of a test is the extent to which it measures what it claims to measure. It defines the meaning of test scores (Gregory, 2011).
• Validity concerns what the test measures and how well it does so (Anastasi, 1996).
• Validity can be defined as the agreement between a test score or measure and the quality it is believed to measure. It is sometimes defined as the answer to the question, "Does the test measure what it is supposed to measure?" (Kaplan & Saccuzzo, 2011).
• The validity of tests is NOT easily captured by neat statistical summaries but is instead characterized on a continuum ranging from weak to acceptable to strong.

TYPES OF VALIDITY
1. FACE VALIDITY
• The least stringent type of validity: whether a test looks valid to test users, examiners, and examinees.
• This type of validity doesn't involve statistics.
• Usually this type of validation is done by face validators such as registered psychometricians, psychologists, and guidance counselors.
EXAMPLES
• An IQ test containing items that measure memory, mathematical ability, verbal reasoning, and abstract reasoning has good face validity.
• An IQ test containing items that measure depression and anxiety has bad face validity.
• A self-esteem rating scale with items like "I know I can do what other people can do" and "I usually feel that I would fail on a task" has good face validity.
• Inkblot tests have low face validity because testtakers question whether the test really measures personality.

2. CONTENT VALIDITY
• Whether the test covers the behavior domain to be measured, which is built through the choice of appropriate content areas, questions, tasks, and items.
• Concerned with the extent to which the test is representative of a defined body of content consisting of topics and processes.
• Content validation is not done by statistical analysis but by the inspection of items. A panel of experts can review the test items and rate them in terms of how closely they match the objective or domain specification.
• Considers the adequacy of representation of the conceptual domain the test is designed to cover. If the test items adequately represent the domain of possible items for a variable, then the test has adequate content validity.
• Determination of content validity is often made by expert judgment.
EXAMPLES
• Educational content-valid test – the syllabus is covered in the test; usually follows the table of specification of the test. (A table of specification is a blueprint of the test in terms of the number of items per difficulty level, topic importance, or taxonomy.)
• Employment content-valid test – appropriate job-related skills are included in the test; reflects the job specification.
• Clinical content-valid test – symptoms of the disorder are all covered in the test; reflects the diagnostic criteria.

3. CRITERION VALIDITY
What is a criterion?
• A standard against which a test or a test score is evaluated.
• A criterion can be a test score, a psychiatric diagnosis, a training cost, an index of absenteeism, or an amount of time.
• Characteristics of a criterion:
 o Relevant
 o Valid and reliable
 o Uncontaminated – criterion contamination occurs if the criterion is based on predictor measures, so that the "criterion" partly reflects the predictor rather than the standard it is supposed to represent.

CRITERION-RELATED VALIDITY DEFINED
• Indicates the test's effectiveness in estimating an individual's behavior in a particular situation.
• Tells how well a test corresponds with a particular criterion.
• A judgment of how adequately a test score can be used to infer an individual's most probable standing on some measure of interest.

TYPES OF CRITERION-RELATED VALIDITY
• Concurrent validity – the extent to which test scores may be used to estimate an individual's present standing on a criterion.
• Predictive validity – the extent to which scores on a test predict future behavior or scores on another test taken in the future.
• Incremental validity – related to predictive validity; the degree to which an additional predictor explains something about the criterion measure that is not explained by predictors already in use.

4. CONSTRUCT VALIDITY
What is a construct?
• An informed scientific idea developed or hypothesized to describe or explain behavior; something built by mental synthesis.
• Unobservable, presupposed traits; something the researcher expects to have either high or low correlations with other variables.
• Construct validity is the extent to which the test may be said to measure a theoretical construct or trait.
• A test designed to measure a construct must estimate the existence of an inferred, underlying characteristic based on a limited sample of behavior.
• Established through a series of activities in which a researcher simultaneously defines some construct and develops the instrumentation to measure it.
• A judgment about the appropriateness of inferences drawn from test scores regarding individual standings on a variable called a construct.
• Required when no criterion or universe of content is accepted as entirely adequate to define the quality being measured.
• Involves assembling evidence about what a test means: a series of statistical analyses demonstrating that the test measures a distinct variable.
• A test has good construct validity if there is an existing psychological theory that can support what the test items are measuring.
• Establishing construct validity involves both logical analysis and empirical data. (Example: in measuring aggression, you have to check all past research and theories to see how researchers have measured that variable/construct.)

FACTORS INFLUENCING TEST VALIDITY
a. Appropriateness of the test
b. Directions/instructions
c. Reading comprehension level
d. Item difficulty
e. Test construction factors
f. Length of the test
g. Arrangement of items
h. Patterns of answers

CHAPTER 7: TEST DEVELOPMENT AND ADMINISTRATION
TEST CONCEPTUALIZATION → TEST CONSTRUCTION → TEST TRYOUT → ITEM ANALYSIS → TEST REVISION

TEST DEVELOPMENT
• All tests are not created equal. The creation of a good test is not a matter of chance; it is the product of the thoughtful and sound application of established principles of test development. In this context, test development is an umbrella term for all that goes into the process of creating a test.

I. TEST CONCEPTUALIZATION
• The beginnings of any published test can probably be traced to thoughts (self-talk, in behavioral terms). The test developer says to himself or herself something like: "There ought to be a test designed to measure [fill in the blank] in [such and such] way." The stimulus for such a thought could be almost anything. A review of the available literature on existing tests designed to measure a particular construct might indicate that such tests leave much to be desired in psychometric soundness. An emerging social phenomenon or pattern of behavior might serve as the stimulus for the development of a new test. If, for example, celibacy were to become a widely practiced lifestyle, then we might witness the development of a variety of tests related to celibacy.
• The development of a new test may also be in response to a need to assess mastery in an emerging occupation or profession. For example, new tests may be developed to assess mastery in fields such as high-definition electronics, environmental engineering, and wireless communications.
• Questions to ask at this stage:
 o What is the test designed to measure? What is the objective of the test?
 o Is there a need for this test?
 o Who will use this test? Who will take this test?
 o What content will the test cover?
 o How will the test be administered?
 o What is the ideal format of the test? Should more than one form of the test be developed?
 o What special training will be required of test users for administering or interpreting the test?
 o What types of responses will be required of testtakers?
 o Who benefits from an administration of this test? Is there any potential for harm as the result of an administration of this test?
 o How will meaning be attributed to scores on this test?
• This last question provides a point of departure for elaborating on issues related to development with regard to norm- versus criterion-referenced tests.

NORM-REFERENCED VERSUS CRITERION-REFERENCED TESTS: ITEM DEVELOPMENT ISSUES
• Different approaches to test development and individual item analyses are necessary, depending upon whether the finished test is designed to be norm-referenced or criterion-referenced.

PILOT WORK
• Pilot work, pilot study, and pilot research refer, in general, to the preliminary research surrounding the creation of a prototype of the test.
• In pilot work, the test developer typically attempts to determine how best to measure a targeted construct. The process may entail the creation, revision, and deletion of many test items in addition to literature reviews, experimentation, and related activities. Once pilot work has been completed, the process of test construction begins.

II. TEST CONSTRUCTION
SCALING
• Measurement was previously defined as the assignment of numbers according to rules. Scaling may be defined as the process of setting rules for assigning numbers in measurement.
• Historically, the prolific L. L. Thurstone is credited with being at the forefront of efforts to develop methodologically sound scaling methods.
 o He adapted psychophysical scaling methods to the study of psychological variables such as attitudes and values.
 o Thurstone's (1925) article entitled "A Method of Scaling Psychological and Educational Tests" introduced, among other things, the notion of absolute scaling: a procedure for obtaining a measure of item difficulty across samples of testtakers who vary in ability.

TYPES OF SCALES
• Scales are instruments used to measure something, such as weight.
• In psychometrics, scales may also be conceived of as instruments used to measure; here, however, the something being measured is likely to be a trait, a state, or an ability.
• When we think of types of scales, we think of the different ways that scales can be categorized. In Chapter 3, for example, we saw that scales can be meaningfully categorized along a continuum of level of measurement and be referred to as NOMINAL, ORDINAL, INTERVAL, or RATIO. But we might also characterize scales in other ways.

SCALING METHODS
• A testtaker is presumed to have more or less of the characteristic measured by a (valid) test as a function of the test score. The higher or lower the score, the more or less of the characteristic the testtaker presumably possesses. But how are numbers assigned to responses so that a test score can be calculated? This is done through scaling the test items, using any one of several available methods.
• RATING SCALE – a grouping of words, statements, or symbols on which judgments of the strength of a particular trait, attitude, or emotion are indicated by the testtaker. Rating scales can be used to record judgments of oneself, others, experiences, or objects, and they can take several forms. The MDBS-R (Morally Debatable Behaviors Scale–Revised) is an example of a rating scale.
• THE MANY FACES OF RATING SCALES – rating scales can take many forms. "Smiley" faces, for example, have been used in social-psychological research with young children and adults with limited language skills; the faces are used in lieu of words such as positive, neutral, and negative.
• LIKERT SCALE – a summative rating scale used extensively in psychology, usually to scale attitudes. Likert scales are relatively easy to construct. Each item presents the testtaker with five alternative responses (sometimes seven), usually on an agree–disagree or approve–disapprove continuum.
• METHOD OF PAIRED COMPARISONS – testtakers are presented with pairs of stimuli (two photographs, two objects, two statements), which they are asked to compare. They must select one of the stimuli according to some rule.
• GUTTMAN SCALE – yet another scaling method, one that yields ordinal-level measures. Items range sequentially from weaker to stronger expressions of the attitude, belief, or feeling being measured. A feature of Guttman scales is that all respondents who agree with the stronger statements of the attitude will also agree with the milder statements.
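Because a Likert scale is summative, scoring is simple addition once negatively keyed items are reverse-scored; a minimal sketch with an invented five-point, four-item scale follows.

def score_likert(responses, reverse_keyed, n_points=5):
    # responses: one respondent's answers, each an integer from 1 to n_points
    # reverse_keyed: indices of items worded against the attitude being measured
    total = 0
    for i, r in enumerate(responses):
        if i in reverse_keyed:
            r = (n_points + 1) - r  # on a 5-point item, 1 becomes 5, 2 becomes 4, ...
        total += r
    return total

# Item 2 (0-based) is negatively worded, so its raw response of 2 counts as 4
print(score_likert([4, 5, 2, 4], reverse_keyed={2}))  # 4 + 5 + 4 + 4 = 17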

WRITING ITEMS
• In the grand scheme of test construction, considerations related to the actual writing of the test's items go hand in hand with scaling considerations. The prospective test developer or item writer immediately faces three questions related to the test blueprint:
 o What range of content should the items cover?
 o Which of the many different types of item formats should be employed?
 o How many items should be written in total and for each content area covered?
• Item format – variables such as the form, plan, structure, arrangement, and layout of individual test items are collectively referred to as item format.
• There are two kinds of formats: the selected-response format and the constructed-response format.
 o Selected-response format – requires testtakers to select a response from a set of alternative responses. Three types of selected-response items are multiple-choice, matching, and true–false (binary-choice) items.
 o Constructed-response format – requires testtakers to supply or to create the correct answer, not merely to select it, as in essay and completion or short-answer (fill-in-the-blank) items.
• (A table in the original compares the advantages and disadvantages of each item format: essay, multiple-choice, binary-choice, matching, and completion or short-answer; its cell contents are not reproduced here.)

WRITING ITEMS FOR COMPUTER ADMINISTRATION
• A number of widely available computer programs are designed to facilitate the construction of tests as well as their administration, scoring, and interpretation. These programs typically make use of two advantages of digital media: the ability to store items in an item bank and the ability to individualize testing through a technique called item branching.
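A toy sketch of the two ideas just named: items stored in a bank, and a simple branching rule that routes the testtaker to a harder item after a correct answer and an easier one after an error. All items, names, and the rule itself are invented for illustration.

# Each bank entry: question and keyed answer, indexed by a difficulty label
item_bank = {
    "easy":   ("2 + 2 = ?", "4"),
    "medium": ("12 x 12 = ?", "144"),
    "hard":   ("17 x 23 = ?", "391"),
}
levels = ["easy", "medium", "hard"]

def next_level(level, was_correct):
    # Step up one difficulty level after a correct answer, down after an error
    i = levels.index(level)
    i = min(i + 1, len(levels) - 1) if was_correct else max(i - 1, 0)
    return levels[i]

level = "medium"
for answer in ["144", "144", "391"]:  # simulated testtaker responses
    question, key = item_bank[level]
    correct = (answer == key)
    print(level, question, "correct" if correct else "wrong")
    level = next_level(level, correct)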

SCORING ITEMS
CUMULATIVE MODEL
• The most commonly used model in scoring test items, owing in part to its simplicity and logic. The rule in a cumulatively scored test is that the higher the score on the test, the higher the testtaker is on the ability, trait, or other characteristic that the test purports to measure.
CLASS SCORING
• Also referred to as category scoring. Testtaker responses earn credit toward placement in a particular class or category with other testtakers whose pattern of responses is presumably similar in some way. This approach is used by some diagnostic systems wherein individuals must exhibit a certain number of symptoms to qualify for a specific diagnosis.
IPSATIVE SCORING
• Departs radically in rationale from either the cumulative or the class model. A typical objective in ipsative scoring is comparing a testtaker's score on one scale within a test to another scale within that same test.

III. TEST TRYOUT
• Having created a pool of items from which the final version of the test will be developed, the test developer will try out the test. The test should be tried out on people who are similar in critical respects to the people for whom the test was designed. Thus, for example, if a test is designed to aid in decisions regarding the selection of corporate employees with management potential at a certain level, it would be appropriate to try out the test on corporate employees at the targeted level.

WHAT IS A GOOD ITEM?
• In the same sense that a good test is reliable and valid, a good test item is reliable and valid.
• A good test item helps to discriminate testtakers: it is one that is answered correctly by high scorers on the test as a whole.
• How does a test developer identify good items? After the first draft of the test has been administered to a representative group of examinees, the test developer analyzes test scores and responses to individual items.

IV. ITEM ANALYSIS
• The statistical tools of item analysis (item difficulty, item discriminability, and related indices) are treated in detail in the ITEM ANALYSIS section below.

V. TEST REVISION
• Test revision as a stage in new test development: having conceptualized the new test, constructed it, tried it out, and item-analyzed it both quantitatively and qualitatively, what remains is to act judiciously on all the information and mold the test into its final form.

CROSS-VALIDATION AND CO-VALIDATION
CROSS-VALIDATION
• Refers to the revalidation of a test on a sample of testtakers other than those on whom test performance was originally found to be a valid predictor of some criterion.
• We expect that items selected for the final version of the test (in part because of their high correlations with a criterion measure) will have smaller item validities when administered to a second sample of testtakers. This is so because of the operation of chance.
• The decrease in item validities that inevitably occurs after cross-validation of findings is referred to as validity shrinkage.
CO-VALIDATION
• May be defined as a test validation process conducted on two or more tests using the same sample of testtakers.
• When used in conjunction with the creation of norms or the revision of existing norms, this process may also be referred to as co-norming.
• A current trend among test publishers who publish more than one test designed for use with the same population is to co-validate and/or co-norm tests.

ITEM ANALYSIS
• The examination of individual items on a test, rather than the test as a whole, for difficulty, appropriateness, relationship to the rest of the test, and so on.
• Useful in helping test designers determine which items to keep, modify, or discard on a given test, and how to finalize the score for a student.
• If you improve the quality of the items on a test, you will improve the overall quality of the test, hence improving both reliability and validity.
• Involves difficulty value and discriminating power.

METHODS OF ITEM ANALYSIS
I. QUALITATIVE
• Analyze test items based on content and form.
• Includes content validity.
• Evaluation of items based on effective item-writing techniques.
II. QUANTITATIVE
• Assessment of item difficulty and item discriminability.

ITEM DIFFICULTY
• The percent of the group that answers the question correctly.
• For maximizing validity and reliability, the optimal item difficulty level is 0.50.
• However, this does not mean every item should have a difficulty level of 0.50, simply that the average of all items should be 0.50.

COMPUTING FOR ITEM DIFFICULTY
• Item difficulty = (number of students with the correct answer) / (total number of students)
• Example: What is the difficulty index of an item if 25 students are unable to answer it correctly, while 75 answered it correctly? Difficulty = 75/100 = .75 or 75%.
• Worked example: Difficulty Index = (no. of students getting the correct response) / total = 40/100 = .40 or 40% → within the range of a good item.

Table of equivalents in interpreting the difficulty index:

RANGE OF DIFFICULTY INDEX | INTERPRETATION | ACTION
.00–.25 | Difficult (equivalently, .00–.20 = Very Difficult) | Revise/Discard
.27–.75 | Right Difficulty (equivalently, .21–.80 = Moderately Difficult) | Retain
.76 & above | Easy (equivalently, .81–1.00 = Very Easy) | Revise/Discard

ITEM DISCRIMINABILITY
• How well an item can discriminate among testtakers who differ on the construct being measured by the test.
• Discrimination index – a calculation that determines the difference between those testtakers who score well on the test and those who score poorly.

INDEX OF DISCRIMINATION (U-L INDEX METHOD)
DU = (no. of students in the upper 25% with the correct response) / (no. of students in the upper 25%) = 15/20 = .75 or 75%
DL = (no. of students in the lower 25% with the correct response) / (no. of students in the lower 25%) = 5/20 = .25 or 25%
Discrimination Index = DU - DL = .75 - .25 = .50 → the item has good discriminating power.

INDEX OF DISCRIMINATION | ITEM EVALUATION
.40 and up | Very good item
.30–.39 | Good item
.20–.29 | Reasonably good item
.10–.19 | Marginal item, usually subject to improvement
Below .10 | Poor item, to be rejected or revised

OTHER METHODS
PEARSON PRODUCT-MOMENT CORRELATION METHOD
• Employed for measures of continuous scaling with three (3) or more scale points. Bipolar scales, Likert scales, and rating scales are of this type.

• Formula: the Pearson product-moment correlation,
 r = Σ(X - X̄)(Y - Ȳ) / √[Σ(X - X̄)² x Σ(Y - Ȳ)²]
• In our case (correlating the two halves of a test), X = one person's score on the first half of the items, X̄ = the mean score on the first half of the items, Y = one person's score on the second half of the items, and Ȳ = the mean score on the second half of the items.

POINT-BISERIAL CORRELATION METHOD
• Similar to Pearson's r, except that one variable has interval/ratio data and the other is binary/dichotomous (nominal: yes/no, right/wrong, high/low).
• t-Test – mean differences on each of the items are analyzed and interpreted.
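The statistics above (difficulty index, U-L discrimination index, and the point-biserial correlation) translate directly into code; this is a sketch under the definitions given, with invented response data.

import statistics

def difficulty(item):
    # Proportion of examinees answering the item correctly (1 = correct, 0 = wrong)
    return sum(item) / len(item)

def ul_discrimination(item, totals, fraction=0.25):
    # D = proportion correct in the upper group minus proportion correct in the lower group
    order = sorted(range(len(totals)), key=lambda i: totals[i])
    k = max(1, int(len(order) * fraction))
    lower, upper = order[:k], order[-k:]
    du = sum(item[i] for i in upper) / k
    dl = sum(item[i] for i in lower) / k
    return du - dl

def point_biserial(item, totals):
    # Correlation between a dichotomous item and the continuous total score
    p = difficulty(item)
    mean1 = statistics.mean(t for r, t in zip(item, totals) if r == 1)
    mean0 = statistics.mean(t for r, t in zip(item, totals) if r == 0)
    return (mean1 - mean0) / statistics.pstdev(totals) * (p * (1 - p)) ** 0.5

item = [1, 1, 0, 1, 0, 1, 0, 1]            # hypothetical 0/1 responses to one item
totals = [40, 38, 20, 35, 22, 30, 18, 33]  # hypothetical total test scores

print(difficulty(item))                     # 0.625
print(ul_discrimination(item, totals))      # 1.0: top scorers pass, bottom scorers fail
print(round(point_biserial(item, totals), 3))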

INTELLIGENCE
• From the Latin word "intellectus," meaning perception or comprehension (Apruebo, 2009).
• Mental ability: the global capacity to act purposely, to think rationally, and to deal effectively with the environment.
• Associated with ability (to perform or act), aptitude (specific mental ability), and achievement (mastery of a subject).
• A composite of general and specific abilities characterizing a person's level of neurological functioning, based on applied experience and manifested in dynamic coping with the challenges of daily living.

THEORIES ABOUT INTELLIGENCE

SPEARMAN'S TWO-FACTOR THEORY OF INTELLIGENCE
• Spearman reported findings supporting the idea that performance on any test of mental ability was based on a single general ability factor that he termed "g".
• Spearman also believed that performance on any test of mental ability required the use of a specific ability factor that he termed "s".

RAYMOND CATTELL'S VIEW OF INTELLIGENCE: INTELLIGENCE AS A FEW BASIC ABILITIES
• FLUID INTELLIGENCE – the ability to think on the spot and solve novel problems; to perceive relationships; to gain new types of knowledge; to reason and make sense of abstract information. Examples: spatial/visual skills (space, distance), puzzles.
• CRYSTALLIZED INTELLIGENCE – factual knowledge about the world; the skills already learned and practiced; knowledge and skills accumulated over a lifetime. It tends to increase with age. Examples: arithmetic facts, knowledge of the meaning of words, state capitals, rote memory.

THREE-STRATUM THEORY OF INTELLIGENCE – JOHN CARROLL
• The three layers (strata) are defined as representing narrow, broad, and general cognitive abilities.

HOWARD GARDNER – THEORY OF MULTIPLE INTELLIGENCES
• Gardner identified 9 distinct types of intelligence: linguistic, logical-mathematical, spatial, musical, bodily-kinesthetic, interpersonal, naturalistic, intrapersonal, and existential.

9 TYPES OF INTELLIGENCE – GARDNER
• Linguistic – sensitivity to the meanings and sounds of words, mastery of syntax, appreciation of the ways language can be used.
• Logical-Mathematical – understanding of objects and symbols, of the actions that can be performed on them, and of the relations between these actions; ability to identify problems and seek explanations.
• Spatial – capacity to perceive the visual world accurately, to perform transformations upon perceptions, and to re-create aspects of visual experience in the absence of physical stimuli.

• Musical – sensitivity to individual tones and phrases of music; an understanding of ways to combine tones and phrases into larger musical rhythms and structures; awareness of the emotional aspects of music.
• Bodily-Kinesthetic – use of one's body in highly skilled ways for expressive or goal-directed purposes; capacity to handle objects skillfully.
• Interpersonal – ability to notice and make distinctions among the moods, temperaments, motivations, and intentions of other people, and potentially to act on this knowledge.
• Intrapersonal – access to one's own feelings; ability to draw on one's emotions to guide and understand one's behavior; recognition of personal strengths and weaknesses.
• Naturalistic – sensitivity to and understanding of plants, animals, and other aspects of nature.
• Existential – sensitivity to issues related to the meaning of life, death, and other aspects of the human condition.

HOW DO WE MEASURE INTELLIGENCE?
What is an IQ?
• Lewis Terman revised Binet and Simon's test and published a version known as the Stanford-Binet Test in 1916.
• Performance was described as an intelligence quotient (IQ), which was simply the ratio of mental age to chronological age multiplied by 100: IQ = MA/CA x 100.

FACTORS THAT INFLUENCE INTELLIGENCE
• The child's influence (genetics, genotype–environment interaction, gender), the immediate environment's influence, the society's influence, schooling, poverty, and race and ethnicity.

DIFFERENT TYPES OF INTELLIGENCE TESTS

BINET-SIMON INTELLIGENCE TEST
• Binet and Simon developed the first intelligence test, created to identify MR children; Binet's aim was separating MR children from mainstream children.
• Measured higher intellectual processes.
• Binet's assumptions:
 o Intelligence grows with age throughout normal childhood.
 o The best index of intelligence is verbal ability.
• Binet-Simon scale – 1905 (revised 1908):
 o 30 items in order of difficulty
 o Tapped memory, judgment, and reasoning
 o Distinguished younger from older children; scores increased with age for each child
 o Scores correlated with school grades and with teacher ratings of intelligence
 o The test distinguished MR children from normal children

DAVID WECHSLER – WECHSLER INTELLIGENCE SCALES
• Wechsler Intelligence Scale for Children – Third Edition (WISC-III): used with children 6 to 16.
• Wechsler Adult Intelligence Scale – Third Edition (WAIS-III): used with people 17 and older.

WISC-III
• Provides a profile of someone's strengths and weaknesses.
• Each test is made up of 12 parts; each part begins with the simplest questions and progresses to increasingly difficult ones.
 o PERFORMANCE SCALE (6 PARTS) – spatial and perceptual abilities; measures fluid intelligence.
 o VERBAL SCALE (6 PARTS) – general knowledge of the world and skill in using language; measures crystallized intelligence.
• VERBAL IQ IS BASED ON:
 o Information – measures a child's range of factual information. Example: What day of the year is Independence Day?
 o Similarities – measures a child's ability to categorize. Example: In what way are wool and cotton alike?
 o Arithmetic – measures the ability to solve computational math problems.

 Example: If I buy 6 cents' worth of candy and give the clerk 25 cents, I would get _________ back in change.
 o Vocabulary – measures the ability to define words. Example: What does "telephone" mean?
 o Comprehension – measures the ability to answer common-sense questions. Example: Why do people buy fire insurance?
 o Digit Span – measures short-term auditory memory.
• PERFORMANCE IQ IS BASED ON:
 o Coding – copying marks from a code; visual rote learning.
 o Picture Completion – telling what's missing in various pictures. Example: children are shown a picture, such as a car with no wheels, and are asked: What part of the picture is missing?
 o Picture Arrangement – arranging pictures to tell a story.
 o Block Design – arranging multicolored blocks to match a printed design. Example: Using the four blocks, make one just like this.
 o Object Assembly – putting puzzles together; measures nonverbal fluid reasoning. Example: If these pieces are put together correctly, they will make something. Go ahead and put them together as quickly as you can.

THE BAYLEY SCALES OF INFANT DEVELOPMENT ARE OFTEN USED FOR INFANT ASSESSMENT

FILIPINO INTELLIGENCE TEST
• Panukat ng Katalinuhang Pilipino (PKP) is the original version of the Filipino Intelligence Test (FIT).
• The items were constructed based on the Filipino setting and experiences.
• The various ways by which a Filipino manifests his intelligence were observed, and interviews were undertaken to solicit information on this matter.
• Pulling together these resources resulted in the selection of four major constructs of intelligence, which can be measured objectively.
• Developed by Aurora R. Palacio, Ed.D., in 1991.
• Purpose: to validate the mental ability of Filipinos aged 16 and above on the basis of verbal and non-verbal skills.
• Time frame of administration: 52 minutes.

GENERAL APPLICATIONS
• School: basis for screening, classifying, and identifying needs that will enhance the learning process.
• Business and industry: predictor of occupational achievement; gauge of an applicant's ability and fitness for a particular job; basis for promotion.
• Therapeutic agencies: proper planning and implementation of treatment.
• Vocational rehabilitation and counseling: determining one's capacity to handle the challenges associated with certain degree programs.

DIFFERENT SUBTESTS UNDER FIT
• Vocabulary (Talasalitaan) – a test of ability to deal with words and their meaning as used in a sentence (30 items; 6 mins).
• Analogy (Ugnayan) – a test of skill in perceiving relationships among various abstract figures (30 items; 6 mins).
• Numerical Ability (Kakayahan sa Bilang) – deals with the application of basic mathematical concepts and processes in various problem-solving situations within the sphere of Filipino experiences (25 items; 20 mins).
• Non-Verbal Ability (Isinalarawang Problema) – a test of proficiency along problem-solving tasks through identification of relationships (50 items; 29 mins).

3 TYPES OF INTELLIGENCE
1. CIS or CRYSTALLIZED INTELLIGENCE: the sum of the standard scores in Vocabulary, Analogy, and Numerical Ability (CIS = VSS + ASS + NAS).
2. FIS or FLUID INTELLIGENCE: the standard score of the Non-Verbal subtest.
3. GIS or GENERAL INTELLIGENCE: an approximation of one's intellectual ability based on verbal and non-verbal skills (GIS = CIS + FIS).
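The score arithmetic in this chapter (the ratio IQ and the FIT composites) can be sketched directly; the standard-score values below are invented.

def ratio_iq(mental_age, chronological_age):
    # Terman's ratio IQ: IQ = MA / CA x 100
    return mental_age / chronological_age * 100

def fit_composites(vss, ass, nas, nonverbal_ss):
    # CIS = VSS + ASS + NAS; FIS = non-verbal standard score; GIS = CIS + FIS
    cis = vss + ass + nas
    fis = nonverbal_ss
    return cis, fis, cis + fis

print(ratio_iq(12, 10))                 # a 10-year-old performing at a 12-year-old level: 120.0
print(fit_composites(55, 50, 48, 52))   # hypothetical standard scores -> (153, 52, 205)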
