
PSYCHOMETRIC

PROPERTIES
of a
GOOD PSYCHOLOGICAL TEST
Ferrer, J.
Hernandez, J.
Pangan, G.
Santos, P.
RELIABILITY   VALIDITY   STANDARDIZATION
RELIABILITY

- Dependability or consistency.
- Consistency in measurement.

Reliability coefficient

- Index of reliability.
- Ratio between the true score variance on a test and the total variance.
Basics of Test Score Theory
Classical test score theory- assumes that each person has a true score that would be obtained if
there were no errors in measurement. 
Error - the component of the observed test score that does not have to do with the test taker’s
ability.

- In symbolic representation, the observed score (X) has two components, a true score (T) and an error component (E):

X = T + E

T - true score
E - error component

The standard error of measurement (SEm) estimates how repeated measures of a person on
the same instrument tend to be distributed around his or her “true” score.

The true score is always an unknown because no measure can be constructed that provides a
perfect reflection of the true score.
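In classical test theory the SEM is commonly estimated from the test's standard deviation and its reliability coefficient (a standard formula, stated here for reference):

SEM = \sigma \sqrt{1 - r_{xx}}

where \sigma is the standard deviation of test scores and r_{xx} is the reliability coefficient.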
Basics of Test Score Theory
variance (σ²) - the standard deviation squared

- A statistic useful in describing sources of test score variability

true variance - variance from true differences

error variance - variance from irrelevant, random sources

Reliability - proportion of the total variance attributed to true variance.


The greater the proportion of the total variance attributed to true variance, the more
reliable the test.
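In symbols (a standard way to write the statement above), reliability is the ratio of true-score variance to total variance; for example, if the true variance is 8 and the error variance is 2, the reliability is 8 / (8 + 2) = 0.80:

r_{xx} = \frac{\sigma^2_T}{\sigma^2_X} = \frac{\sigma^2_T}{\sigma^2_T + \sigma^2_E}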

Systematic sources of error

- do not affect score consistency, because they influence every administration in the same way (they may, however, affect validity).
Sources of Error

1. Test construction - One source of variance during test construction is item sampling or content sampling, terms that refer to variation among items within a test as well as to variation among items between tests.

2. Test Administration - Sources of error variance that occur during test administration may influence the
test taker's attention or motivation. The test taker's reactions to those influences are the source of one kind of
error variance. 

3. Test Scoring/Interpretation - In many tests, the advent of computer scoring and a growing reliance on
objective, computer-scorable items have virtually eliminated error variance caused by scorer differences.
Individually administered intelligence tests, some tests of personality, tests of creativity, various behavioral
measures, essay tests, portfolio assessment, situational behavior tests, and countless other tools of assessment still
require scoring by trained personnel. 

4. Other sources of error - e.g., test takers who underreport or overreport behaviors such as abuse.
Reliability Estimates

1. Test-Retest Reliability (Coefficient of Stability)


2. Parallel Forms and Alternate forms
3. Split-Half reliability Estimates
4. The Spearman Brown Formula
5. Inter-item Consistency
6. Kuder-Richardson Formula (KR-20 & KR-21)
7. Coefficient Alpha
8. Inter-scorer (Inter-rater) Reliability
1. Test-Retest Reliability (Coefficient of Stability)

an estimate of reliability obtained by correlating pairs of scores from the same people on
two different administrations of the same test.

STEP 1 - Administer the psychological test.
STEP 2 - Get the results.
STEP 3 - Allow an interval (gap time) to pass.
STEP 4 - Re-administer the same test to the same participants/population.
STEP 5 - Get the results.
STEP 6 - Compare (correlate) the results from Step 2 and Step 5.

How many tests did you administer? 1
How many times did you administer it? 2
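A minimal sketch of Step 6, assuming two hypothetical score arrays from the same group of test takers (the numbers are invented for illustration):

import numpy as np

# Scores from the first administration and the re-administration (hypothetical data)
time1 = np.array([23, 31, 28, 35, 40, 22, 30, 27, 33, 38])
time2 = np.array([25, 30, 29, 34, 41, 24, 28, 26, 35, 37])

# The coefficient of stability is the Pearson r between the two administrations
r_test_retest = np.corrcoef(time1, time2)[0, 1]
print(f"Test-retest reliability: {r_test_retest:.2f}")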
Reliability Estimates

2. Alternate forms and Parallel Forms


- Are simply different versions of a test that have been constructed so as to be parallel.

- Alternate forms of a test are typically designed to be equivalent with respect to variables such as content and level
of difficulty.

How many populations? 1

STEP 1 - Administer the first test.
STEP 2 - Administer the alternate-form test.
STEP 3 - Score both.
STEP 4 - Correlate the two sets of scores.

How can you say that the alternate form is the same as the original form?

- Same number of items
- Same format
- Same language
- Same content
- Same level of difficulty (the hardest requirement to meet)
3. Split-Half reliability Estimates

An estimate of split-half reliability is obtained by correlating two pairs of scores obtained from equivalent halves of a single test administered once.

The computation of a coefficient of split-half reliability generally entails three steps:


Step 1. Divide the test into equivalent halves.
Step 2. Calculate a Pearson r between scores on the two halves of the test.
Step 3. Adjust the half-test reliability using the Spearman-Brown formula.

- Simply dividing the test in the middle is not recommended


- Randomly assign items to one or the other half of the test.
- Assign odd-numbered items to one half of the test and even-numbered items to the
other half.
- Divide the test by content
4. The Spearman Brown Formula

- Estimates internal consistency reliability from the correlation between two halves of a test.
- By determining the reliability of one half of a test, a test developer can use the Spearman-Brown formula to estimate the reliability of the whole test.

The general Spearman-Brown formula is:

r_{SB} = \frac{n \, r_{xy}}{1 + (n - 1)\, r_{xy}}

where r_{SB} is the adjusted (estimated whole-test) reliability, r_{xy} is the Pearson r of scores on the two half tests, and n is the factor by which the length of the test is changed (n = 2 when stepping a half test up to full length).
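A minimal sketch of the three split-half steps plus the Spearman-Brown adjustment, using a hypothetical matrix of scored item responses (rows are test takers, columns are items) and an odd-even split:

import numpy as np

# Hypothetical responses: rows = test takers, columns = items scored 1 (right) or 0 (wrong)
items = np.array([
    [1, 0, 1, 1, 0, 1, 1, 0],
    [1, 1, 1, 1, 1, 1, 0, 1],
    [0, 0, 1, 0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 1],
    [0, 1, 0, 1, 0, 0, 1, 0],
    [1, 0, 1, 1, 1, 0, 1, 1],
])

odd_half = items[:, 0::2].sum(axis=1)    # Step 1: total score on odd-numbered items
even_half = items[:, 1::2].sum(axis=1)   #         total score on even-numbered items

r_half = np.corrcoef(odd_half, even_half)[0, 1]   # Step 2: Pearson r between the halves
r_sb = (2 * r_half) / (1 + r_half)                # Step 3: Spearman-Brown with n = 2
print(f"Half-test r = {r_half:.2f}, adjusted full-test reliability = {r_sb:.2f}")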
5. Inter-item Consistency

- degree of correlation among all the items on a scale.


- calculated from a single administration of a single form of a test.
- useful in assessing the homogeneity of the test.

Tests are said to be homogeneous if they contain items that measure a single trait.

In contrast to test homogeneity, heterogeneity describes the degree to which a test measures different factors. A
heterogeneous (or nonhomogeneous) test is composed of items that measure more than one trait. A test that
assesses knowledge only of color television repair skills could be expected to be more homogeneous in content
than a test of electronic repair.

The more homogeneous a test is, the more inter-item consistency it can be expected to have. Because a
homogeneous test samples a relatively narrow content area, it is to be expected to contain more inter-item
consistency than a heterogeneous test. Test homogeneity is desirable because it allows relatively straightforward
test-score interpretation.
6. Kuder-Richardson Formula (KR-20 & KR-21)
The Kuder-Richardson formulas grew out of dissatisfaction with existing split-half methods of estimating reliability.

KR-20 is the statistic of choice for determining the inter-item consistency of dichotomous items, primarily those
items that can be scored right or wrong (such as multiple-choice items).

If test items are more heterogeneous, KR-20 will yield lower reliability estimates than the split-half method.

How is KR-20 computed? The following formula may be used:

r_{KR20} = \frac{k}{k - 1}\left(1 - \frac{\sum p_i q_i}{\sigma^2}\right)

where k is the number of items, p_i is the proportion of test takers answering item i correctly, q_i = 1 - p_i, and \sigma^2 is the variance of total test scores.
The KR-21 formula may be used if there is reason to assume that all the test items have approximately the same
degree of difficulty.
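A minimal sketch of the KR-20 computation above, using a hypothetical matrix of right/wrong (1/0) responses:

import numpy as np

# Hypothetical dichotomous responses: rows = test takers, columns = items
responses = np.array([
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0],
    [1, 1, 1, 0, 1],
])

k = responses.shape[1]                        # number of items
p = responses.mean(axis=0)                    # proportion passing each item
q = 1 - p                                     # proportion failing each item
total_var = responses.sum(axis=1).var()       # variance of total scores (population form)

kr20 = (k / (k - 1)) * (1 - (p * q).sum() / total_var)
print(f"KR-20 = {kr20:.2f}")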
7. Coefficient Alpha 

- Developed by Cronbach and subsequently elaborated on by others.

- Coefficient alpha may be thought of as the mean of all possible split-half correlations, corrected by the Spearman-Brown formula.

In contrast to KR-20, which is appropriately used only on tests with dichotomous items, coefficient alpha is
appropriate for use on tests containing nondichotomous items.

The formula for coefficient alpha is:

\alpha = \frac{k}{k - 1}\left(1 - \frac{\sum \sigma_i^2}{\sigma^2}\right)

where k is the number of items, \sigma_i^2 is the variance of item i, and \sigma^2 is the variance of total test scores.
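A minimal sketch of coefficient alpha for nondichotomous items, using hypothetical ratings on a 1-5 scale (with 0/1 items this calculation reduces to KR-20):

import numpy as np

# Hypothetical Likert-type responses: rows = respondents, columns = items (1-5 scale)
scores = np.array([
    [4, 5, 3, 4],
    [2, 3, 2, 3],
    [5, 5, 4, 5],
    [3, 2, 3, 2],
    [4, 4, 5, 4],
])

k = scores.shape[1]
item_var = scores.var(axis=0)             # variance of each item (population form)
total_var = scores.sum(axis=1).var()      # variance of total scores (population form)

alpha = (k / (k - 1)) * (1 - item_var.sum() / total_var)
print(f"Coefficient alpha = {alpha:.2f}")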
8. Inter-scorer (Inter-rater reliability)
Variously referred to as:

- Scorer reliability
- Judge reliability
- Observer reliability
- Inter-rater reliability
- Inter-scorer reliability

The degree of agreement or consistency between two or more scorers (or judges or raters) with regard to a
particular measure.

Inter-rater consistency may be promoted by providing raters with the opportunity for group discussion along with
practice exercises and information on rater accuracy.

Perhaps the simplest way of determining the degree of consistency among scorers in the scoring of a test is to calculate a coefficient of correlation. This correlation coefficient is referred to as a coefficient of inter-scorer reliability.
The Nature of the Test

1. Homogeneity Vs. Heterogeneity of Test Items


2. Dynamic Vs. Static Characteristics
3. Restriction or inflation of range
4. Speed Test Vs. Power Test
1. Homogeneity Vs. Heterogeneity of Test Items
Homogeneous in items - functionally uniform throughout. Tests designed to measure one
factor

Heterogeneous in items - estimate of internal consistency might be low

2. Dynamic Vs. Static Characteristics

Dynamic characteristic- trait, state, or ability presumed to be ever-changing as a


function of situational and cognitive experiences

Static characteristic - trait, state, or ability presumed to be relatively unchanging


3. Restriction or inflation of range
Restricted - resulting correlation coefficient tends to be lower.

Inflated - resulting correlation coefficient tends to be higher.

4. Speed Test Vs. Power Test


Power test - When a time limit is long enough to allow testtakers to attempt all items, and
if some items are so difficult that no testtaker is able to obtain a perfect score

Speed test - generally contains items of uniform level of difficulty (typically uniformly
low) so that, when given generous time limits, all test takers should be able to complete all
the test items correctly.

A reliability estimate of a speed test should be based on performance from two


independent testing periods using one of the following:

1. test-retest reliability,
2. alternate-forms reliability, or
3. split-half reliability from two separately timed half tests.
What to do about Low reliability?

 1. Increase no. of items


 2. Factor and Item Analysis
 3. Correction for attenuation (see the formula below)
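The correction for attenuation is a standard classical-test-theory formula that estimates what the correlation between a test and a criterion would be if both were perfectly reliable (r_{xy} is the observed correlation; r_{xx} and r_{yy} are the reliabilities of the two measures):

\hat{r}_{xy} = \frac{r_{xy}}{\sqrt{r_{xx}\, r_{yy}}}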
RELIABILITY VALIDITY STANDARDIZATION
Validity (Accuracy)

- refers to the degree to which an assessment tool (test) measures what it is supposed to measure

- it answers the question of whether the tool (test) meets the objective of the assessment

Validation

-is the process of gathering and evaluating evidence


about validity.
Three Categories of Validity

Content Validity

Criterion-related Validity

Construct Validity
1. Content Validity

- Describes a judgment of how adequately a test samples behavior representative of


the universe of behavior that the test was designed to sample. 

-This is a measure of validity based on an evaluation of the subjects, topics, or


content covered by the items in the test.
 2. Criterion-related validity

- Is a judgment of how adequately a test score can be used to


infer an individual’s most probable standing on some
measure of interest - the measure of interest being the
criterion.

- This is a measure of validity obtained by evaluating the


relationship of scores obtained on the test to scores on
other tests or measures
                                                                    
 Concurrent Validity - is an index of the degree to which a test score is
related to some criterion measure obtained at the same time (concurrently)

 Predictive Validity - is an index of the degree to which a test score


predicts some criterion measure
 3. Construct Validity

- is a judgment about the appropriateness of inferences drawn from test scores regarding individual standings on a variable called a construct.

-A construct is an informed, scientific idea developed or hypothesized to describe or explain behavior.


A. Convergent Evidence - evidence for the construct validity of a particular test
may converge from a number of sources, such as other tests or measures designed
to assess the same (or a similar) construct.

o Convergent Validity - different methods measuring same construct


o Discriminant Evidence - A validity coefficient showing little (that is, a statistically insignificant) relationship between test scores and/or other variables with which scores on the test being construct-validated should not theoretically be correlated provides discriminant evidence of construct validity (also known as discriminant validity).

B. Divergent Validity - same method measuring different constructs
Experimental Techniques
1. Multitrait-Multimethod Matrix - the matrix or table of correlations that results when two or more traits are each measured by two or more methods; it is used to examine both convergent and discriminant evidence.

(Matrix figure omitted. Source: Diagnostics: Multitrait Multimethod Analyses (Psychology, Research Methods))


2. Factor Analysis - a shorthand term for a class of mathematical procedures designed to identify factors or specific variables that are typically attributes, characteristics, or dimensions on which people may differ.
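A minimal sketch of an exploratory factor analysis using scikit-learn; the data matrix and the choice of two factors are assumptions made purely for illustration:

import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 6))        # stand-in data: 100 respondents x 6 test items

fa = FactorAnalysis(n_components=2)  # look for two underlying factors
fa.fit(X)

# Factor loadings: how strongly each item relates to each extracted factor
print(fa.components_)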

Face Validity - refers to what a test appears to measure to the person being tested, rather than to what the test actually measures.
Validity, Bias, and Fairness
Test Bias

  Bias - is a factor inherent in a test that systematically prevents accurate, impartial


measurement.

  Rating Error - is a judgment resulting from the intentional or unintentional misuse


  of a rating scale.

• Leniency Error 
• Severity Error 
• Central Tendency Error
• Halo Effect 
• Test Fairness
RELIABILITY VALIDITY STANDARDIZATION
OTHER CONSIDERATIONS:

A. Purpose of the Instrument

B. Administration, Scoring and Interpretation

C. Actionable results that will have benefits

D. Reliability and Validity

E. Norms
Norming, Test Standardization

Norm, in a socio-cultural context - behavior that is usual, average, standard, expected, or typical. Norms are most commonly defined as rules or expectations that are socially enforced.
 
Psychometric Norms - the test performance data of a particular group of test takers that serves as a reference point
in evaluating (interpreting) individual test scores.
 
Raw scores are untransformed scores from a measurement; on their own they provide little information and can be misleading at times.
 
Percentile is a number where a certain percentage of scores fall below that number. 

 
For example:

On a test, you scored 67 out of 90. That is 74.44% of the items correct, so you assume you did very well. But the percentage of items answered correctly says nothing about your standing relative to other test takers; if most of the norm group scored higher, your percentile rank could still be low.
Stanine - Stanine (STAndard NINE) is a method of scaling test scores on a nine-point standard scale with a mean of five and a standard deviation of two.
STEN Scores- (or “Standard Tens”) divide a scale into ten units. In simple terms, you can think of them as “scores
out of ten”:
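A minimal sketch of converting a raw score to a percentile rank, a stanine, and a sten, assuming a small hypothetical norm group and the usual linear z-score mappings (stanine: mean 5, SD 2; sten: mean 5.5, SD 2):

import numpy as np

# Hypothetical norm-group scores; real norms come from a standardization sample
norm_group = np.array([45, 52, 60, 61, 63, 65, 67, 70, 72, 75, 78, 80, 83, 85, 88])
raw_score = 67

# Percentile rank: percentage of norm-group scores falling below the raw score
percentile_rank = (norm_group < raw_score).mean() * 100

# z-score relative to the norm group, then linear conversions to the standard scales
z = (raw_score - norm_group.mean()) / norm_group.std()
stanine = int(np.clip(round(z * 2 + 5), 1, 9))
sten = int(np.clip(round(z * 2 + 5.5), 1, 10))

print(f"percentile rank = {percentile_rank:.0f}, stanine = {stanine}, sten = {sten}")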
Types of Norms (Norm-referenced Test)

a. Percentiles
b. Age Norms - use age as the reference point, indicating the average performance of test takers of a given age.

Ex. a 6-year-old of typical intelligence for that age vs. a 6-year-old performing at the level of a typical 12-year-old.

c. Grade Norms - use the average test performance of students at a given grade level.
d. National Norms - use a representative sample of a nation.
e. National Anchor Norms - provide an indication of the equivalency of scores on various tests that are administered nationally; equivalency tables between two tests (equipercentile method).
f. Subgroup Norms - reference scores against subgroups, e.g., age, sex, educational attainment.
g. Local Norms - provide normative information based on a local population, e.g., the English proficiency of Quezon City senior high school students.
Norm-referenced tests - report whether test takers
performed better or worse than a hypothetical average
student, which is determined by comparing scores against
the performance results of a statistically selected group
of test takers, typically of the same age or grade level, who
have already taken the exam.
How do we establish the norms?
Sampling - the process of getting a representative sample of a population. Two common sample-size formulas (see the sketch below):

- Slovin's formula
- Cochran's formula
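A minimal sketch of both formulas with hypothetical inputs (N = population size, e = margin of error; for Cochran's formula, z is the confidence z-value and p the expected proportion):

import math

N = 10_000   # population size (hypothetical)
e = 0.05     # margin of error

# Slovin's formula: n = N / (1 + N * e^2)
n_slovin = N / (1 + N * e**2)

# Cochran's formula (large/infinite population): n0 = z^2 * p * (1 - p) / e^2
z, p = 1.96, 0.5             # 95% confidence, maximum variability
n_cochran = (z**2 * p * (1 - p)) / e**2

print(f"Slovin: {math.ceil(n_slovin)}, Cochran: {math.ceil(n_cochran)}")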

1. Random Sampling Vs. Convenience Sampling


Random sampling is a technique in which each member of the population has an equal probability of being chosen, giving an unbiased representation of the total population.

Convenience sampling is non-probability sampling that involves drawing the part of the population that is most accessible.
2. Stratified Sampling - a method of sampling from a population which can be partitioned into subpopulations.
 
3. Purposive Sampling - a non-probability sampling technique in which researchers choose the sample based on who they think would be appropriate for the study.
 
For quick sample-size determination, you may go here:
http://www.raosoft.com/samplesize.html
D. Criterion-Referenced Test (Domain or Content-
Referenced Test) a method of evaluation and deriving
meaning from test scores with reference to a set standard.

Ex. a passing score of 180


 
E. Culture and Inference (Culture-Fair) - culture remains a major factor in test administration, scoring, and interpretation.
Thank you!
