PROPERTIES
of a
GOOD PSYCHOLOGICAL TEST
Ferrer, J.
Hernandez, J.
Pangan, G.
Santos, P.
RELIABILITY VALIDITY STANDARDIZATION
RELIABILITY
- Dependability or consistency.
- Consistency in measurement.
Reliability coefficient
- An index of reliability.
- The ratio of the true score variance on a test to the total variance.
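As a sketch of that ratio (the variance values below are assumed for illustration, not taken from the slides):

```python
def reliability_coefficient(true_variance, error_variance):
    """Reliability as the ratio of true-score variance to total
    (observed) variance: r = var_true / (var_true + var_error)."""
    total_variance = true_variance + error_variance
    return true_variance / total_variance

# Hypothetical values: 80 units of true-score variance, 20 of error.
print(reliability_coefficient(80, 20))  # 0.8
```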
Basics of Test Score Theory
Classical test score theory- assumes that each person has a true score that would be obtained if
there were no errors in measurement.
Error - the component of the observed test score that does not have to do with the test taker’s
ability.
The standard error of measurement (SEm) estimates how repeated measures of a person on
the same instrument tend to be distributed around his or her “true” score.
The true score is always an unknown because no measure can be constructed that provides a
perfect reflection of the true score.
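The SEm described above is conventionally computed as SD × √(1 − r). A minimal sketch, with an assumed IQ-style scale for illustration:

```python
import math

def standard_error_of_measurement(sd, reliability):
    """SEM = SD * sqrt(1 - r): the expected spread of observed
    scores around a person's (unknowable) true score."""
    return sd * math.sqrt(1 - reliability)

# Assumed example: a scale with SD = 15 and reliability .91.
print(round(standard_error_of_measurement(15, 0.91), 2))  # 4.5
```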
Basics of Test Score Theory
variance (σ2) - standard deviation squared
1. Test construction - One source of variance during test construction is item sampling or content sampling, terms that refer to variation among items within a test as well as to variation among items between tests.
2. Test Administration - Sources of error variance that occur during test administration may influence the
test taker's attention or motivation. The test taker's reactions to those influences are the source of one kind of
error variance.
3. Test Scoring/Interpretation - In many tests, the advent of computer scoring and a growing reliance on
objective, computer- scorable items have virtually eliminated error variance caused by scorer differences.
Individually administered intelligence tests, some tests of personality, tests of creativity, various behavioral
measures, essay tests, portfolio assessment, situational behavior tests, and countless other tools of assessment still
require scoring by trained personnel.
4. Other sources of error - Ex: test takers who underreport or overreport sensitive behaviors such as abuse.
Reliability Estimates
Test-retest reliability: an estimate of reliability obtained by correlating pairs of scores from the same people on two different administrations of the same test.
STEP 1 - ADMINISTER PSYCH TEST
STEP 2 - GET RESULTS
STEP 3 - INTERVAL (GAP TIME)
STEP 4 - RE-ADMINISTER TEST (same participants/population)
STEP 5 - GET RESULTS
STEP 6 - COMPARE STEP 2 AND STEP 5
(How many tests were administered? 1. How many times? 2.)
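Comparing the two sets of results amounts to computing a Pearson correlation between them. A minimal sketch with hypothetical scores for five test takers:

```python
from statistics import mean, pstdev

def pearson_r(x, y):
    """Pearson correlation between two score lists (population formula)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / len(x)
    return cov / (pstdev(x) * pstdev(y))

# Hypothetical scores from the first and second administrations;
# the correlation is the test-retest reliability estimate.
first = [82, 90, 75, 68, 95]
second = [80, 92, 73, 70, 96]
print(round(pearson_r(first, second), 3))  # 0.984
```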
Reliability Estimates
- Alternate forms of a test are typically designed to be equivalent with respect to variables such as content and level
of difficulty.
- By determining the reliability of one half of a test, a test developer can use the Spearman-Brown formula to estimate the reliability of the whole test. The symbol rSB stands for the adjusted reliability estimate, and rhh for the Pearson r of scores on the two half-tests: rSB = 2rhh / (1 + rhh).
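A sketch of the Spearman-Brown adjustment, with an assumed half-test correlation of .70:

```python
def spearman_brown(r_hh):
    """Adjust a half-test correlation to estimate full-test
    reliability: r_SB = 2 * r_hh / (1 + r_hh)."""
    return 2 * r_hh / (1 + r_hh)

# Assumed half-test Pearson r of .70.
print(round(spearman_brown(0.70), 3))  # 0.824
```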
5. Inter-item Consistency
Tests are said to be homogeneous if they contain items that measure a single trait.
In contrast to test homogeneity, heterogeneity describes the degree to which a test measures different factors. A
heterogeneous (or nonhomogeneous) test is composed of items that measure more than one trait. A test that
assesses knowledge only of color television repair skills could be expected to be more homogeneous in content
than a test of electronic repair.
The more homogeneous a test is, the more inter-item consistency it can be expected to have. Because a homogeneous test samples a relatively narrow content area, it can be expected to show more inter-item consistency than a heterogeneous test. Test homogeneity is desirable because it allows relatively straightforward test-score interpretation.
6. Kuder-Richardson Formula (KR-20 & KR-21)
The Kuder-Richardson formulas grew out of dissatisfaction with existing split-half methods of estimating reliability.
KR-20 is the statistic of choice for determining the inter-item consistency of dichotomous items, primarily those
items that can be scored right or wrong (such as multiple-choice items).
If test items are more heterogeneous, KR-20 will yield lower reliability estimates than the split-half method.
The KR-21 formula may be used if there is reason to assume that all the test items have approximately the same
degree of difficulty.
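A sketch of the KR-20 calculation for a hypothetical 4-item right/wrong test (the score matrix is assumed for illustration):

```python
from statistics import pvariance

def kr20(item_scores):
    """KR-20 for dichotomous (0/1) items.
    item_scores: one list of 0/1 item scores per test taker."""
    k = len(item_scores[0])
    totals = [sum(person) for person in item_scores]
    var_total = pvariance(totals)
    # Sum of p*q over items: p = proportion passing, q = 1 - p.
    pq = 0.0
    for i in range(k):
        p = sum(person[i] for person in item_scores) / len(item_scores)
        pq += p * (1 - p)
    return (k / (k - 1)) * (1 - pq / var_total)

# Hypothetical 4-item test taken by five people (1 = right, 0 = wrong).
scores = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
]
print(round(kr20(scores), 3))  # 0.8
```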
7. Coefficient Alpha
In contrast to KR-20, which is appropriately used only on tests with dichotomous items, coefficient alpha is
appropriate for use on tests containing nondichotomous items.
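Coefficient alpha follows the same pattern as KR-20 but replaces the p×q term with item variances, which is why it also works for Likert-type items. A sketch with hypothetical responses:

```python
from statistics import pvariance

def cronbach_alpha(item_scores):
    """Coefficient alpha for items on any scale (not only 0/1).
    item_scores: one list of item scores per test taker."""
    k = len(item_scores[0])
    totals = [sum(person) for person in item_scores]
    item_vars = [
        pvariance([person[i] for person in item_scores]) for i in range(k)
    ]
    return (k / (k - 1)) * (1 - sum(item_vars) / pvariance(totals))

# Hypothetical 3-item Likert-type scale answered by four respondents.
responses = [
    [4, 5, 4],
    [3, 3, 3],
    [5, 5, 4],
    [2, 2, 1],
]
print(round(cronbach_alpha(responses), 3))  # 0.975
```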
- Scorer reliability
- Judge reliability
- Observer reliability
- Inter-rater reliability
- Inter-scorer reliability
The degree of agreement or consistency between two or more scorers (or judges or raters) with regard to a
particular measure.
Inter-rater consistency may be promoted by providing raters with the opportunity for group discussion along with
practice exercises and information on rater accuracy.
Perhaps the simplest way of determining the degree of consistency among scorers in the scoring of a test is to calculate a coefficient of correlation. This correlation coefficient is referred to as a coefficient of inter-scorer reliability.
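A minimal sketch of that calculation, with hypothetical ratings from two scorers of the same ten essays:

```python
from statistics import mean, pstdev

# Hypothetical ratings of ten essays by two trained scorers.
rater_a = [4, 3, 5, 2, 4, 3, 5, 1, 2, 4]
rater_b = [4, 3, 4, 2, 5, 3, 5, 2, 2, 4]

# Pearson correlation between the two sets of ratings serves as
# the coefficient of inter-scorer reliability.
ma, mb = mean(rater_a), mean(rater_b)
cov = sum((a - ma) * (b - mb) for a, b in zip(rater_a, rater_b)) / len(rater_a)
inter_scorer_r = cov / (pstdev(rater_a) * pstdev(rater_b))
print(round(inter_scorer_r, 3))  # 0.906
```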
The Nature of the Test
Speed test - generally contains items of uniform level of difficulty (typically uniformly low) so that, when given generous time limits, all test takers should be able to complete all the test items correctly. In practice, however, the time limit is set so that few if any test takers finish, so score differences reflect speed of performance.
A reliability estimate of a speed test should be based on performance from:
1. test-retest reliability,
2. alternate-forms reliability, or
3. split-half reliability from two separately timed half tests.
What to do about Low reliability?
Validation
Content Validity
Criterion-related Validity
Construct Validity
1. Content Validity
- a judgment of how adequately a test samples behavior representative of the universe of behavior that the test was designed to sample.
• Leniency Error
• Severity Error
• Central Tendency Error
• Halo Effect
• Test Fairness
RELIABILITY VALIDITY STANDARDIZATION
OTHER CONSIDERATIONS:
E. Norms
Norming, Test Standardization
Norm in a socio-cultural context: behavior that is usual, average, standard, expected, or typical. Norms are most commonly defined as rules or expectations that are socially enforced.
Psychometric Norms - the test performance data of a particular group of test takers that serves as a reference point
in evaluating (interpreting) individual test scores.
Raw scores are untransformed scores from a measurement; a raw score by itself provides little information and can be misleading at times.
Percentile is a number where a certain percentage of scores fall below that number.
For example:
On a test, you scored 67/90. You got 74.44% of the items correct and assume you did very well. But percentage correct is not a percentile - your standing depends on how the rest of the norm group scored.
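A sketch of why a high percentage correct need not mean a high percentile; the norm-group scores below are hypothetical:

```python
# Hypothetical norm group of ten scores on the same 90-item test.
norm_group = [88, 85, 83, 81, 79, 77, 74, 70, 67, 60]
raw_score = 67

percent_correct = raw_score / 90 * 100
# Percentile rank: percentage of the norm group scoring below you.
percentile_rank = sum(s < raw_score for s in norm_group) / len(norm_group) * 100

print(round(percent_correct, 2))  # 74.44 percent of items correct...
print(percentile_rank)            # ...but only the 10th percentile here
```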
Stanine- Stanine (STAndard NINE) is a method of scaling test scores on a nine-point standard scale with a mean
of five and a standard deviation of two.
STEN Scores- (or “Standard Tens”) divide a scale into ten units. In simple terms, you can think of them as “scores
out of ten”:
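Both scales are linear rescalings of z scores using the means and SDs stated above (stanine: mean 5, SD 2; STEN: mean 5.5, SD 2). A sketch:

```python
def stanine(z):
    """Stanine: z score rescaled to mean 5, SD 2, clipped to 1-9."""
    return max(1, min(9, round(z * 2 + 5)))

def sten(z):
    """STEN: z score rescaled to mean 5.5, SD 2, clipped to 1-10."""
    return max(1, min(10, round(z * 2 + 5.5)))

# A score 1.2 standard deviations above the mean:
print(stanine(1.2), sten(1.2))  # 7 8
```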
Types of Norms (Norm-referenced Test)
a. Percentiles
b. Age Norms - uses age as reference to indicate average samples
Ex. 6 year old usual intelligence VS. 6 year old performing as a 12 year old intelligence
- Slovin's formula
- Cochran's formula
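Both are sample-size formulas for norming studies. A sketch with assumed values (population of 1,000, 5% margin of error, 95% confidence):

```python
import math

def slovin(population_size, e):
    """Slovin's formula: n = N / (1 + N * e^2)."""
    return math.ceil(population_size / (1 + population_size * e ** 2))

def cochran(z, p, e):
    """Cochran's formula for large populations: n0 = z^2 * p * (1 - p) / e^2."""
    return math.ceil(z ** 2 * p * (1 - p) / e ** 2)

# Assumed: N = 1,000, e = .05; z = 1.96 (95% confidence), p = .5.
print(slovin(1000, 0.05))        # 286
print(cochran(1.96, 0.5, 0.05))  # 385
```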