
Reliability Estimates

1. Test-retest estimate (employee performance, personality)


- An estimate of reliability obtained by correlating pairs of scores from the same people on two
different administrations of the same test
- to measure something that is relatively stable over time, such as a personality trait
- If the characteristic being measured is assumed to fluctuate over time, then there would be little
sense in assessing the reliability of the test using the test-retest method
- as the time interval between administrations of the same test increases, the correlation
between the scores obtained on each testing decreases
- When the interval between testing is greater than six months, the estimate of test-retest
reliability is often referred to as the coefficient of stability
- An estimate of test-retest reliability from a personality profile might be low if the test taker
suffered some emotional trauma or received counselling during the intervening period
- may be most appropriate in gauging the reliability of tests that employ outcome measures such
as reaction time or perceptual judgement (discriminations of brightness, loudness, or taste)
- source of error variance is test administration
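
As a rough illustration of the method described above, a test-retest coefficient is simply the Pearson correlation between scores from the two administrations. A minimal sketch in Python, using made-up scores (not real data):

    # Test-retest reliability: correlate scores from two administrations of the
    # same test to the same people (illustrative numbers only).
    from statistics import correlation  # available in Python 3.10+

    time1 = [23, 31, 28, 35, 40, 27, 33, 38]
    time2 = [25, 30, 29, 33, 41, 26, 35, 37]

    r_test_retest = correlation(time1, time2)
    print(f"test-retest reliability estimate: {r_test_retest:.2f}")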

2. Parallel-Forms and Alternate Forms Estimates (coefficient of equivalence)


- Similar in two ways to obtaining an estimate of test-retest reliability:
  o Two test administrations with the same group are required
  o Test scores may be affected by factors such as motivation, fatigue, learning, etc.
- Test takers may do better or worse on a specific form of the test not as a function of their true
ability but simply because of the particular items that were selected for inclusion in the test
- there is a reasonable degree of stability in scores on intelligence tests
a. Parallel Forms
- for each form of the test, the means and the variances of observed test scores are equal
b. Alternate Forms
- simply different versions of a test that have been constructed so as to be parallel
- equivalent with respect to variables such as content and level of difficulty
- developing them can be time-consuming and expensive
- advantageous to the test takers in many ways: minimizes the effect of memory for the content
of a previously administered form of the test.
- Sources of error variance are test administration and test construction

3. Split-Half Estimates (internal consistency estimate of reliability / estimate of inter-item
consistency; source of error variance is test construction)
- obtained by correlating two pairs of scores obtained from equivalent halves of a single test
administered once
- internal consistency of the test items
- a useful measure of reliability when it is impractical or undesirable to assess reliability with two
tests or to administer a test twice (because of factors such as time or expense)
- Simply dividing the test in the middle is not recommended because it’s likely this procedure
would spuriously raise or lower the reliability coefficient
- Randomly assign items to each half, split in terms of odd- and even-numbered items (odd-even
reliability), or split by content
- The purpose of splitting is to create two “mini parallel forms” that are as nearly equal as
humanly possible in format, stylistic, statistical, and related aspects
- Spearman-Brown formula (see the sketch after this list)
  o allows test developers to estimate internal consistency reliability from a correlation between
two halves of a test; it can also be used to estimate the reliability of the whole test
  o necessary for estimating the reliability of a test that has been shortened or lengthened
  o if test developers or users wish to shorten a test, the Spearman-Brown formula may be used
to estimate the effect of the shortening on the test’s reliability
  o could also be used to determine the number of items needed to attain a desired level of
reliability. However, if reliability is low, it may be impractical to increase the number of items
enough to reach an acceptable level. Any new items must be equivalent in content and
difficulty so that the longer test still measures what the original test measured
  o reliability could also be raised by creating new items, clarifying test instructions, or
simplifying the scoring rules
  o the formula is inappropriate for measuring the reliability of heterogeneous tests or speed tests
- Usually, but not always, reliability increases as test length increases
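
A minimal sketch of the Spearman-Brown adjustment mentioned above. The general form is r_new = n*r / (1 + (n-1)*r), where n is the ratio of the new test length to the original length and r is the obtained reliability; the half-test correlation and the helper name spearman_brown below are only for illustration:

    # Spearman-Brown: project reliability when test length changes by a factor n.
    # For a split-half estimate, n = 2 (the full test is twice as long as one half).
    def spearman_brown(r, n):
        return (n * r) / (1 + (n - 1) * r)

    r_half = 0.70                                                             # correlation between the two halves (made up)
    print(f"estimated full-test reliability: {spearman_brown(r_half, 2):.2f}")   # about .82
    print(f"reliability if cut to half length: {spearman_brown(0.82, 0.5):.2f}") # about .69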

Other methods of estimating internal consistency:


Inter-item consistency (homogeneous tests)
- degree of correlation among all the items on a scale
- Calculated from a single administration of a single form of a test
- Useful in assessing the homogeneity of the test (items that measure a single trait/items in a
scale are unifactorial)
- The more homogeneous the test, the higher the inter-item consistency, since the test covers a
relatively narrow content area
- More desirable because it allows straightforward test-score interpretation
- Test takers with the same score can be assumed to have similar ability
- An insufficient tool for measuring multifaceted psychological variables like personality and
intelligence

Heterogeneous tests
- a test that measures different factors; composed of items that measure more than one trait
- test takers with the same total score may have different patterns of ability
1. KUDER-RICHARDSON FORMULAS
a. KR 20
- Statistic of choice for determining the inter-item consistency of dichotomous items, those that
can be scored right or wrong (e.g., multiple-choice items)
- If the items are heterogeneous, KR-20 will yield a lower reliability estimate than the split-half
estimate
b. KR 21
- May be used if we assume that all the test items have approximately the same level of difficulty
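
A brief sketch of how the KR-20 mentioned above might be computed from dichotomously scored (0/1) item data; the responses below are fabricated for illustration. The formula is KR-20 = (k / (k - 1)) * (1 - sum(p*q) / variance of total scores), where k is the number of items, p is the proportion answering an item correctly, and q = 1 - p:

    # KR-20 for right/wrong items; rows = examinees, columns = items (made-up data).
    data = [
        [1, 1, 0, 1],
        [1, 0, 0, 1],
        [1, 1, 1, 1],
        [0, 0, 0, 1],
        [1, 1, 1, 0],
    ]
    n, k = len(data), len(data[0])
    totals = [sum(row) for row in data]
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / (n - 1)  # sample variance of total scores
    sum_pq = 0.0
    for i in range(k):
        p = sum(row[i] for row in data) / n                           # proportion correct on item i
        sum_pq += p * (1 - p)
    kr20 = (k / (k - 1)) * (1 - sum_pq / var_total)
    print(f"KR-20 = {kr20:.2f}")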

2. COEFFICIENT ALPHA (CRONBACH’S ALPHA) (single administration only)


- The mean of all possible split-half correlations
- Appropriate for use on tests containing non-dichotomous items (e.g., items scored on a
multi-point scale); the coefficient typically ranges from 0 to 1
- Obtaining an estimate of internal consistency reliability
- Requires only one administration of the test
- To help answer questions about how similar sets of data are
- A value of .90 or above may be too high and may indicate redundancy among the items
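
A minimal sketch of coefficient alpha, which generalizes KR-20 to items scored on any numeric scale: alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total scores). The ratings and the helper name cronbach_alpha are invented for illustration:

    # Coefficient alpha from a single administration; rows = examinees, columns = items.
    def cronbach_alpha(data):
        n, k = len(data), len(data[0])
        totals = [sum(row) for row in data]
        mean_t = sum(totals) / n
        var_total = sum((t - mean_t) ** 2 for t in totals) / (n - 1)
        item_vars = []
        for i in range(k):
            col = [row[i] for row in data]
            m = sum(col) / n
            item_vars.append(sum((x - m) ** 2 for x in col) / (n - 1))
        return (k / (k - 1)) * (1 - sum(item_vars) / var_total)

    ratings = [[4, 5, 3, 4], [2, 3, 2, 3], [5, 5, 4, 4], [3, 4, 3, 3]]  # made-up Likert-type data
    print(f"alpha = {cronbach_alpha(ratings):.2f}")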

MEASURES OF INTER-SCORER RELIABILITY (source of error variance is test scoring and interpretation)

Inter-scorer reliability - degree of agreement or consistency between two or more scorers (or judges or
raters) with regard to a particular measure

- If the reliability coefficient is high, the prospective test user knows that test scores can be
derived in a systematic, consistent way by various scorers with sufficient training.
- Inter-rater consistency may be promoted by providing raters with the opportunity for group
discussion along with practice exercises and information on rater accuracy

The simplest way of determining the degree of consistency among scorers in the scoring of a test is to
calculate a coefficient of correlation (coefficient of inter-scorer reliability)
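
As a small illustration of the coefficient of inter-scorer reliability described above, two scorers’ ratings of the same responses can simply be correlated; a percent-agreement figure (not mentioned above, but commonly reported alongside it) is included as well. The ratings are invented:

    # Inter-scorer reliability: correlation between two scorers rating the same responses.
    from statistics import correlation  # Python 3.10+

    scorer_a = [3, 5, 2, 4, 4, 1, 5, 3]
    scorer_b = [3, 4, 2, 4, 5, 1, 5, 2]

    r_interscorer = correlation(scorer_a, scorer_b)
    pct_agree = sum(a == b for a, b in zip(scorer_a, scorer_b)) / len(scorer_a)
    print(f"inter-scorer r = {r_interscorer:.2f}, exact agreement = {pct_agree:.0%}")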

USING AND INTERPRETING A COEFFICIENT OF RELIABILITY

How high should the coefficient of reliability be? We need more of it in some tests, and we will
admittedly allow for less of it in others. If a test score is routinely used in combination with many other
test scores and typically accounts for only a small part of the decision process, that test will not be held
to the highest standards of reliability.
NATURE OF TESTS (could also be helpful in determining what estimate of reliability to use)

1. Homogeneity vs heterogeneity
- Homogeneous (internal consistency)
- Heterogeneous (test-retest estimate)
2. Dynamic vs static characteristics
a. Dynamic characteristic is a trait, state, or ability presumed to be ever-changing as a
function of situational and cognitive experiences – internal consistency
b. Static – test- retest or alternate forms method
3. Restriction or inflation of range
a. If the range of scores is restricted, the correlation coefficient tends to be lower
b. If the range of scores is inflated, the correlation coefficient tends to be higher
4. Speed vs Power tests
a. Speed tests – use test-retest, alternate-forms, or a split-half based on two separately timed
half-tests (adjusted with the Spearman-Brown formula); an ordinary split-half or KR-20
estimate from a single timed administration will be spuriously high
5. Criterion-referenced tests
- how different the scores are from one another is seldom a focus of interest

ALTERNATIVES TO THE TRUE SCORE MODEL

1. Domain sampling theory


- seek to estimate the extent to which specific sources of variation under defined conditions are
contributing to the test score
- internal consistency is the most compatible
2. Generalizability theory
- An extension of the true score theory
- According to generalizability theory, given the exact same conditions of all the facets in the
universe, the exact same test score should be obtained. This test score is the universe score.

A generalizability study examines how generalizable scores from a particular test are if the test is
administered in different situations. It examines how much of an impact different facets of the universe
have on the test score. The influence of particular facets on the test score is represented by coefficients
of generalizability.

3. Item-response theory
- provide a way to model the probability that a person with X ability will be able to perform at a
level of Y.
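
One common way this probability is modeled (shown here with the two-parameter logistic model, one of several IRT models; the parameter values and the helper name p_correct are illustrative):

    import math

    # 2PL IRT model: probability that a person with ability theta answers an item
    # correctly, given item discrimination a and item difficulty b (illustrative values).
    def p_correct(theta, a, b):
        return 1.0 / (1.0 + math.exp(-a * (theta - b)))

    print(f"P(correct | theta=1.0, a=1.2, b=0.5) = {p_correct(1.0, 1.2, 0.5):.2f}")  # about .65
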
Standard Error of Measurement

- provides a measure of the precision of an observed test score


- the higher the reliability, the lower the SEM
- useful in establishing the confidence interval – range or band of test scores that is likely to
contain the true score
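
A minimal sketch of the two points above, using the standard formula SEM = SD * sqrt(1 - reliability); the standard deviation, reliability, and observed score are illustrative:

    import math

    # Standard error of measurement and a 95% confidence band around an observed score.
    sd, r_xx, observed = 15.0, 0.90, 110.0
    sem = sd * math.sqrt(1 - r_xx)                                # about 4.74
    lower, upper = observed - 1.96 * sem, observed + 1.96 * sem
    print(f"SEM = {sem:.2f}; 95% band = {lower:.1f} to {upper:.1f}")
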
Standard Error of the Difference
- aids a test user in determining how large a difference should be before it is considered
statistically significant.
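
A small sketch of the standard error of the difference for two scores on the same scale, using the standard formula SED = SD * sqrt(2 - r1 - r2), which is equivalent to sqrt(SEM1^2 + SEM2^2); the numbers are illustrative:

    import math

    # Standard error of the difference between two scores (same scale, SD = 15 here).
    sd, r1, r2 = 15.0, 0.90, 0.85
    sed = sd * math.sqrt(2 - r1 - r2)                 # 7.5
    observed_diff = 12
    significant = observed_diff > 1.96 * sed          # False here: 12 < 14.7
    print(f"SED = {sed:.2f}; difference of {observed_diff} significant at .05? {significant}")
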
How to measure utility of a test:

1. Taylor-Russell table
- Provides an estimate of the percentage of new hires who will be successful employees if a test is
adopted (organizational success)
2. Expectancy charts or Lawshe tables
- provide a probability of success for a particular applicant based on test scores (individual
success)

Evaluating employee performance:

A. Determine the reason for evaluating employee performance


B. Identify environmental and cultural limitations
C. Determine who will evaluate the performance
D. Select the best appraisal methods to accomplish your goals
a. focus of the appraisal dimensions
i. trait focused – honesty, responsibility, assertiveness, dependability
ii. competency focused – driving skills, public speaking skills, decision-making skills
iii. task-focused – crime prevention, court testimony, radio procedures
iv. goal focused – prevent crimes from occurring, minimize citizen complaints
b. Should dimensions be weighted?
c. Use of employee comparisons, objective measures, or ratings
i. Employee comparisons (rank order, paired comparison, forced distribution)
ii. Objective Measures (quantity or quality of work, attendance, safety)
iii. Ratings of Performance
Graphic rating scale – poor (1) – excellent (5)
Behaviour Checklists (more advantageous than the graphic rating scale) – ex.
Properly greets customers, thanks the customer after each transaction. Contamination
– performance affected by factors outside the employee’s control
Comparison with other employees
Frequency of Desired Behaviours
Extent to which Organizational Expectations are Met
E. Train Raters
F. Observe and Document Performance
G. Evaluate Performance
a. Obtaining and reviewing objective data
b. Reading Critical Incident Logs
i. Distribution errors – errors involving the distribution of ratings on a rating scale
1. Leniency error – rating employees at the upper end of the scale regardless of
performance
2. Central tendency error – rating employees in the middle of the scale
3. Strictness error – rating employees at the lower end of the scale regardless of
performance
ii. Halo errors – a rater allows either a single attribute or an overall impression of an
individual to affect the ratings made on each relevant job dimension
iii. Proximity errors – occur when a rating made on one dimension affects the rating
made on the dimension that immediately follows it on the rating scale
iv. Contrast errors – the performance rating one person receives can be influenced by
the performance of a previously evaluated person
v. Sampling problems
1. Recency Effect – recent behaviours are given more weight in the performance
evaluation
2. Infrequent Observation – many managers or supervisors do not have the
opportunity to observe a representative sample of the behaviour
H. Communicate Appraisal Results to Employees
I. Terminate Employees
Legal Reasons for terminating an employee
1. Probationary Period
2. Violation of Company Rules
3. Inability to Perform
4. Reduction in Force (Layoff) – the Worker Adjustment and Retraining Notification Act (WARN)
requires organizations to provide workers with 60 days’ notice

Behaviourally Anchored Rating Scales

- To reduce the rating problems associated with graphic rating scales


Uses:
a. The rater compares the incidents recorded for each employee with the incidents on each
scale
b. All of the recorded incidents are read to obtain a general impression of each employee
c. The incidents contained in the BARS are used to arrive at a rating of the employee without
recording actual incidents
Problem: leniency error

Forced-Choice Rating Scales


- Use critical incidents and relevant job behaviours but the scale points are hidden

Behavioural Observation Scales


- Measures the frequency of desired behaviours
- Provides a high level of feedback per dimension
- Scores per dimension are then summed for the overall score
- Criticized for measuring only recalled behaviours, not the actual behaviours being observed
Designing and Evaluating Training Systems

Determining Training Needs:

1. Conducting Needs Analysis


- First step in developing an employee training system (determine the types of training needed
to reach the organization’s goals)
- Use employees’ performance appraisal scores, surveys, interviews, skills and knowledge tests,
and critical incidents

a. Organization Analysis
b. Task Analysis
c. Person Analysis

Developing a Training Program

1. Establishing Goals and Objectives


- First step in developing a training program
