
Keep Rising In Zeal, Ignite Ambition’s Heart (KRIZIAH)!

CHAPTER 1: PSYCHOLOGICAL TESTING AND ASSESSMENT

TESTING AND ASSESSMENT
The roots of contemporary psychological testing and assessment can be found in early 20th-century France.
• Alfred Binet and Theodore Simon (1905) published a test designed to help place Paris schoolchildren in appropriate classes. They developed questions that would predict children's future progress in the Paris school system.
• Within a decade, an English-language version of Binet's test was prepared for use in schools in the US.
• During WWI, the military needed a way to screen large numbers of recruits quickly for intellectual and emotional problems, and psychological testing provided this method.
• During WWII, the military would depend even more on psychological tests to screen recruits for service.

PSYCHOLOGICAL TESTING AND ASSESSMENT DEFINED
Testing is a term used to refer to everything from the administration of a test to the interpretation of a test score.

By WWII, a distinction between testing and the more inclusive term "assessment" began to emerge. The US Office of Strategic Services (OSS) used a variety of procedures and measurement tools in selecting military personnel for highly specialized positions involving espionage, intelligence gathering, and the like. The OSS model would later inspire what is now referred to as the assessment center approach to personnel evaluation.

Assessment acknowledges that tests are only one type of tool used by professional assessors. A test's value is intimately linked to the knowledge, skill, and experience of the assessor.

Psychological Assessment is the gathering and integration of psychology-related data for the purpose of making a psychological evaluation, accomplished through the use of tools such as tests, interviews, case studies, behavioral observations, and specially designed apparatuses and measurement procedures.

Psychological Testing is the process of measuring psychology-related variables by means of devices or procedures designed to obtain a sample of behavior.

PSYCHOLOGICAL ASSESSMENT vs. PSYCHOLOGICAL TESTING
• Objective – Assessment: answer a referral question, solve a problem, or arrive at a decision through the use of tools of evaluation. Testing: obtain some gauge, usually numerical in nature, with regard to an ability or attribute.
• Process – Assessment: usually individualized; focuses on how an individual processes rather than simply the results of that processing. Testing: may be individual or group in nature; after test administration, the tester typically adds up the number of correct answers or the number of certain types of responses.
• Role of Evaluator – Assessment: the assessor is the key to the process of selecting tests and/or other tools of evaluation as well as drawing conclusions from the entire evaluation. Testing: the tester is not the key to the process; one tester may be substituted for another without appreciably affecting the evaluation.
• Skill of Evaluator – Assessment: requires an educated selection of tools of evaluation, skill in evaluation, and thoughtful organization and integration of data. Testing: requires technician-like skills in administering and scoring a test as well as in interpreting a test result.
• Outcome – Assessment: entails a logical problem-solving approach that brings to bear many sources of data designed to shed light on a referral question. Testing: yields a test score or series of test scores.

PROCESS OF ASSESSMENT
1. The process of assessment begins with a referral for assessment from a source (e.g., teacher, judge, counselor). One or more referral questions are put to the assessor about the assessee (e.g., can this child function in a general educational environment?). The assessor may meet with the assessee or others before the formal assessment in order to clarify the reason for referral.
2. The assessor prepares for the assessment by selecting the tools of assessment to be used. The assessor's own past experience, education, and training play a key role in the choice of the specific tests or other tools to be employed in the assessment.
3. Finally, the formal assessment begins.
4. After the assessment, the assessor writes a report of the findings that is designed to answer the referral question.
5. More feedback sessions may also be scheduled.

APPROACHES USED IN ASSESSMENT
Different assessors may approach the assessment task in different ways.
• In Collaborative Psychological Assessment, the assessor and assessee may work as partners from initial contact through final feedback. One variety of this approach is Therapeutic Psychological Assessment, where therapeutic self-discovery and new understandings are encouraged throughout the assessment process.
• Most notably used in educational settings is Dynamic Assessment, an interactive approach to psychological assessment that usually follows a model of (1) evaluation, (2) intervention of some sort, and (3) evaluation. It provides a means for evaluating how the assessee processes or benefits from some type of intervention during the course of evaluation.

TOOLS OF PSYCHOLOGICAL ASSESSMENT

TESTS
• A test is a measurement device or technique used to quantify behavior or aid in the understanding and prediction of behavior.
• A psychological test is a set of items designed to measure characteristics of human beings that pertain to behavior (intelligence, personality, etc.). It always involves analysis of a sample of behavior.
• Format pertains to the form, plan, structure, arrangement, and layout of test items, as well as to related considerations such as time limits. It can also refer to the form in which a test is administered, and to the form or structure of other evaluative tools and processes.
• Tests differ in their administration procedures. Tests administered on a one-to-one basis may require an active and knowledgeable test administrator. Tests designed for administration to groups may not even require the test administrator to be present while the test takers independently complete the required tasks.
• Tests differ in their scoring and interpretation procedures. A score is a code or summary statement, usually but not necessarily numerical in nature, that reflects an evaluation of performance on a test, task, interview, or some other sample of behavior. Scoring is the process of assigning such evaluative codes or statements to performance on tests, tasks, interviews, or other behavior samples. A cut score is a reference point, usually numerical, derived by judgment and used to divide a set of data into two or more classifications.
• Psychometrics is the science of psychological measurement. Psychometrist and psychometrician refer to professionals who use, analyze, and interpret psychological test data.
• Tests differ in their psychometric soundness or technical quality, which refers to how consistently and how accurately a psychological test measures what it purports to measure.
• Psychometric utility refers to the usefulness or practical value that a test or other tool of assessment has for a particular purpose.

INTERVIEW
• An interview is a method of gathering information through direct communication involving a reciprocal exchange. It is more than talk.
• If conducted face-to-face, the interviewer takes note of both verbal and nonverbal behavior. Nonverbal behaviors may include the interviewee's body language, such as movements and facial expressions in response to the interviewer, the extent of eye contact, the way they are dressed, and apparent willingness to cooperate.
• Interviews can also be conducted by telephone, online, by e-mail, and by text messaging.
• An interview may be used to help professionals in human resources. A panel interview, in which more than one interviewer participates in the assessment, can be employed to minimize the biases of a lone interviewer.
• Interviewing in clinical and counseling settings has as its objective not only the gathering of information but also change in the interviewee's thinking and behavior. Motivational Interviewing is a therapeutic dialogue that combines person-centered listening skills, such as openness and empathy, with the use of cognition-altering techniques designed to positively affect motivation and effect therapeutic change.

PORTFOLIO
• A portfolio is a collection of work products – whether retained on paper, canvas, film, video, audio, or some other medium. It is a sample of one's ability and accomplishment. The appeal of portfolio assessment as a tool of evaluation extends to many fields, including education.

CASE HISTORY DATA
• Refers to records, transcripts, and other accounts in written, pictorial, or other form that preserve archival information, official and informal accounts, and other data and items relevant to an assessee.
• Case history data can shed light on an individual's past and current adjustment as well as on the events and circumstances that may have contributed to any changes in adjustment.

BEHAVIORAL OBSERVATION
• Behavioral observation is the monitoring of the actions of others or oneself by visual or electronic means while recording quantitative and/or qualitative information regarding those actions.
• Naturalistic observation is a variety of this tool used to observe the behavior of humans in a natural setting – the setting in which the behavior would typically be expected to occur.

ROLE-PLAY TESTS
• A role-play test is a tool of assessment wherein assessees are directed to act as if they were in a particular situation. Assessees may then be evaluated with regard to their expressed thoughts, behaviors, abilities, and other variables.
• Used in situations where it is too time-consuming, too expensive, or simply too inconvenient to assess in the real situation.
• Role play can be used as both a tool of assessment and a measure of outcome.

COMPUTERS AS TOOLS
• Computers can serve as test administrators (online or offline) and as highly efficient test scorers. The account of performance they generate can range from a mere listing of scores to a more detailed extended scoring report, which includes statistical analyses of the test taker's performance.
• Computer Assisted Psychological Assessment (CAPA) provides assistance to the test user, not the test taker.
• In Computer Adaptive Testing (CAT), the computer has the ability to tailor the test to the test taker's ability or test-taking pattern.

WHO, WHAT, WHY, HOW, AND WHERE

WHO ARE THE PARTIES IN THE ASSESSMENT ENTERPRISE?
Test Developers and publishers create tests or other methods of assessment. The APA has estimated that more than 20,000 new psychological tests are developed each year.

Test Users are the persons responsible for the selection, administration, and scoring of tests, for the analysis, interpretation, and communication of test results, and for any decisions or actions that are based on test scores. Individuals who simply administer tests, score tests, and communicate simple or "canned" test results are not test users.

Test User Qualifications:
• Level A – There are no special qualifications to purchase the products (psychological tests).
• Level B – A master's degree and formal training in the ethical administration, scoring, and interpretation of clinical assessments; or certification by, or full active membership in, a professional organization that requires training and experience in the relevant area of assessment; or a degree or license to practice in the healthcare or allied healthcare field.
• Level C – A doctorate degree with formal training in the ethical administration, scoring, and interpretation of clinical assessments related to the intended use of the assessment; or licensure or certification to practice in your state in a field related to the purchase; or certification by, or full active membership in, a professional organization that requires training and experience in the relevant area of assessment.

The Test Taker can be anyone who is the subject of an assessment or an evaluation. The amount of test anxiety a test taker is experiencing, and the degree to which that anxiety might significantly affect the test results, are important considerations.
As society evolves and as the need to measure different psychological variables emerges, test developers respond by devising new tests. Society at large exerts its influence on various aspects of the testing and assessment enterprise.

WHAT TYPES OF SETTINGS ARE ASSESSMENTS CONDUCTED IN?

Educational Settings. As mandated by law, tests are administered early in school life to help identify children who may have special needs. Achievement tests evaluate accomplishment or the degree of learning that has taken place. Diagnostic tests are used to assess the need for educational intervention as well as to establish or rule out eligibility for special education programs.

Clinical Settings. Tests are used to help screen for or diagnose behavior problems. The hallmark of testing in clinical settings is that the test is employed with only one individual at a time. Group testing is used primarily for screening, that is, for identifying those individuals who require further diagnostic evaluation.

Counseling Settings. Regardless of the tools used, the ultimate objective of many such assessments is the improvement of the assessee in terms of adjustment, productivity, or some related variable.

Geriatric Settings. This setting relates to older adults, especially with regard to their healthcare. Such assessments are used to gauge the extent to which assessees are enjoying as good a quality of life as possible. Quality of life covers variables related to perceived stress, loneliness, sources of satisfaction, personal values, quality of living conditions, and quality of friendships and other social supports.

Business and Military Settings. Most notable in selection and decision making about the careers of personnel.

Governmental & Organizational Settings. One of the applications is in governmental licensing, certification, or general credentialing of professionals. Before one is legally entitled to practice, one must pass an examination. Members of some professions have formed organizations with requirements for membership that go beyond licensing or certification.

Academic Research and Other Settings. Many different kinds of measurement procedures find application in a wide variety of settings.

WHY IS THERE A NEED FOR PSYCHOLOGICAL ASSESSMENT?
Psychological tests and assessments allow a psychologist to understand the nature of a problem and to figure out the best way to go about addressing it. Psychologists use tests and other assessment tools to measure and observe a client's behavior to arrive at a diagnosis and guide treatment.

HOW ARE TESTS/ASSESSMENTS ADMINISTERED?

Responsible test users have obligations before, during, and after a test or any measurement procedure is administered.

Pre-Test Obligations:
• Ethical guidelines dictate that when test users have discretion with regard to the tests administered, they should select and use the tests that are most appropriate for the individual being tested.
• Tests should be stored in a way that ensures that their specific contents will not be made known in advance.
• It should be ensured that a prepared and suitably trained person administers the test properly.
• The test administrator (or examiner) must be familiar with the test materials and procedures and must have at the test site all the materials needed to properly administer the test.
• Test users have the responsibility of ensuring that the room in which the test will be conducted is suitable and conducive to testing.

During-Test Obligations:
• Rapport between the examiner and the examinee can be critically important. Rapport is the working relationship between the examiner and the examinee.

After-Test Obligations:
• Safeguarding the test protocols and conveying the test results in a clearly understandable fashion are among the obligations of the test user.
• Test users who have responsibility for interpreting scores or other test results have an obligation to do so in accordance with established procedures and ethical guidelines.

Assessment of people with disabilities can be done through accommodation and alternate assessment. Accommodation is the adaptation of a test, procedure, or situation, or the substitution of one test for another, to make the assessment more suitable for an assessee with exceptional needs. Alternate assessment is an evaluative or diagnostic procedure or process that varies from the usual, customary, or standardized way a measurement is derived, either by virtue of some special accommodation made to the assessee or by means of alternative methods designed to measure the same variable(s).

WHERE TO GO FOR AUTHORITATIVE INFORMATION: REFERENCE SOURCES
• Test catalogues are one of the most accessible sources of information, distributed by the publisher of the test. They can be tapped by a simple telephone call, e-mail, or note.
• Test manuals contain detailed information concerning the development of a particular test and technical information relating to it.
• Reference volumes
• Journal articles may contain reviews of a test, updated or independent studies of its psychometric soundness, or examples of how the instrument was used in either a research or an applied context.
• Online databases
CHAPTER 2: HISTORICAL AND CULTURAL PERSPECTIVES ON ASSESSMENT

HISTORICAL PERSPECTIVE ON ASSESSMENT

EARLY ANTECEDENTS
Tests and testing programs first came into being in China as early as 2200 B.C.E. Testing was used as a means of selecting who, of many applicants, would obtain government jobs. Every third year in China, oral examinations were given to help determine work evaluations and promotion decisions.
• Han Dynasty (206 B.C.E. to 220 C.E.). Test batteries (two or more tests used in conjunction) were used. Topics included civil law, military affairs, agriculture, revenue, and geography.
• Ming Dynasty (1368–1644 C.E.). National multistage testing programs involved local and regional testing centers equipped with special testing booths. Only those who passed the third set of tests were eligible for public office.

Reports by British missionaries and diplomats encouraged the English East India Company in 1832 to copy the Chinese system as a method of selecting employees for overseas duty. Because testing programs worked well for the company, the British government adopted a similar system of testing for its civil service in 1855. After the British endorsement of a civil testing system, the French and German governments followed. In 1883, the US government established the American Civil Service Commission, which developed and administered competitive examinations for certain government jobs.

• Charles Darwin suggested that to develop a measuring device, we must understand what we want to measure.
• Sir Francis Galton continued Darwin's work and initiated a search for knowledge concerning human individual differences, which is now one of the most important domains in scientific psychology.
• Galton's work was extended by James McKeen Cattell, who coined the term "mental test." Cattell perpetuated and stimulated the forces that ultimately led to the development of modern tests.

EXPERIMENTAL PSYCHOLOGY AND PSYCHOPHYSICAL MEASUREMENT
• J.F. Herbart used mathematical models as the basis for educational theories that strongly influenced 19th-century educational practices.
• E.H. Weber followed and attempted to demonstrate the existence of a psychological threshold: the minimum stimulus necessary to activate a sensory system.
• G.T. Fechner devised the law that the strength of a sensation grows as the logarithm of the stimulus intensity.
• Wilhelm Wundt set up a laboratory at the University of Leipzig in 1879 and is credited with founding the science of psychology.
• E.B. Titchener succeeded the work of Wundt and founded structuralism.
• G. Whipple, a student of Titchener, provided the basis for immense changes in the field of testing by conducting a seminar at the Carnegie Institute in 1919. The seminar led to the Carnegie Interest Inventory and later the Strong Vocational Interest Blank.

Psychological testing developed from at least two lines of inquiry:
• one based on the work of Darwin, Galton, and Cattell on the measurement of individual differences; and
• the other (more theoretically relevant and probably stronger) based on the work of the German psychophysicists Herbart, Weber, Fechner, and Wundt (experimental psychology developed from this line).

There are also tests that arose in response to important needs, such as classifying and identifying the mentally and emotionally handicapped.
• The Seguin Form Board Test was developed in an effort to educate and evaluate the mentally disabled.
• Kraepelin devised a series of examinations for evaluating emotionally impaired people.

THE EVOLUTION OF INTELLIGENCE AND STANDARDIZED ACHIEVEMENT TESTS

The Binet-Simon Scale had its first version published in 1905. The instrument contained 30 items of increasing difficulty and was designed to identify intellectually subnormal individuals – specifically, to help identify Paris schoolchildren with intellectual disability. The 1908 Binet-Simon Scale was an improvement and introduced the significant concept of a child's mental age.

L.M. Terman of Stanford University revised the Binet test for use in the US. The Stanford-Binet Intelligence Scale was the only American version of the Binet test that flourished.

Robert Yerkes was asked by the army for assistance. He headed a committee that developed two structured group tests of human abilities:
1. The Army Alpha was a verbal test measuring such skills as the ability to follow directions; it required reading ability.
2. The Army Beta presented nonverbal problems to illiterate subjects and to recent immigrants who were not proficient in English. It measured the intelligence of illiterate adults.

Standardized Achievement Tests provide multiple-choice questions that are standardized on a large sample to produce norms against which the results of new examinees can be compared. They became popular because of the ease of administration and scoring and the lack of subjectivity or favoritism that can occur in essay or other written tests.

Two years after the 1937 revision of the Stanford-Binet, David Wechsler published the Wechsler-Bellevue Intelligence Scale, which contained several interesting innovations in intelligence testing. It yielded several scores, permitting an analysis of an individual's pattern or combination of abilities, and it could produce a performance IQ score.

Personality Tests measure presumably stable characteristics or traits that theoretically underlie behavior.
• The first structured personality test was the Woodworth Personal Data Sheet, which was developed during WWI and published in final form just after the war.
• Projective tests were also developed; these provide an ambiguous stimulus and unclear response requirements (the Rorschach Inkblot Test and the Thematic Apperception Test are the most famous).
• The Minnesota Multiphasic Personality Inventory (MMPI), published in 1943, began a new era for structured personality tests. It is currently the most widely used and referenced personality test.
• Factor analysis is a method of finding the minimum number of dimensions, called factors, to account for a large number of variables. R.B. Cattell introduced the
Sixteen Personality Factor Questionnaire (16PF), which remains one of the most well-constructed structured personality tests and an important example of a test developed with the aid of factor analysis.

SOME ISSUES REGARDING CULTURE AND ASSESSMENT
An individualist culture is characterized by value being placed on traits such as self-reliance, autonomy, independence, uniqueness, and competitiveness. In a collectivist culture, value is placed on traits such as conformity, cooperation, interdependence, and striving toward group goals. Culture-specific tests are designed for use with people from one culture but not from another.

CHAPTER 3: A STATISTICS REFRESHER

SAMPLING AND SAMPLING TECHNIQUES

A population is the set of all individuals of interest in a particular study. A sample is a set of individuals selected from a population, usually intended to represent the population in a research study.

Sampling is the process of selecting observations to provide an adequate description of, and inferences about, the population. Probability sampling is a method of selecting a sample wherein each element in the population has a known, nonzero chance of being included in the sample; otherwise, it is nonprobability sampling.
• Simple random sampling is a method wherein all elements of the population have the same probability of inclusion in the selected sample (e.g., draw lots, fishbowl method).
• Stratified sampling is a probability method where we divide the population into nonoverlapping subpopulations or strata (groups whose members share a characteristic) and then randomly pick samples from each stratum (population – strata – random selection – sample).
• Cluster sampling divides the population into nonoverlapping groups or clusters and then randomly selects clusters (groups whose members are not all of the same characteristics).
• Systematic sampling is where researchers select members of the population at a regular interval (k) determined in advance.
• Multistage sampling draws a sample from a population using smaller and smaller groups (units) at each stage. It is often used to collect data from a large, geographically spread group of people in national surveys.
  - Multistage cluster sampling is where the researcher divides the population into groups at various stages for better data collection, management, and interpretation. The groups are called clusters.
  - Multistage random sampling is where the researcher chooses the samples randomly at each stage. The researcher does not create clusters but narrows down the sample by applying random sampling at each stage.
• In convenience sampling, units are selected for inclusion in the sample because they are the easiest for the researcher to access.
• In snowball sampling, new participants are recruited by existing participants to form part of the sample. This can be a useful way to conduct research about people with specific traits who might otherwise be difficult to identify.
• Purposive (judgment) sampling involves the researcher using their expertise to select a sample that is most useful to the purpose of the research. An effective purposive sample must have clear criteria and a rationale for inclusion.
• Quota sampling is done until a specific number of units for various subpopulations has been selected.

Descriptive statistics are methods used to provide a concise description of a collection of quantitative information. Inferential statistics are methods used to make inferences from observations of a small group of people, known as a sample, to a larger group of individuals, known as a population.

The levels of measurement are:
• Nominal scale – categorical data; numbers are simply used as identifiers or names (e.g., gender).
• Ordinal scale – represents an ordered series of relationships or rank order (e.g., Likert-type scale, rank in a contest).
• Interval scale – represents quantity and has equal units, but zero is simply an additional point of measurement. Zero does not represent the absolute lowest value; rather, it is a point on the scale with numbers both above and below it.
• Ratio scale – similar to the interval scale in that it also represents quantity and has equality of units; however, this scale also has an absolute zero (no numbers exist below zero).

A frequency table is an ordered listing of the number of individuals having each of the different values for a particular variable. A frequency distribution shows the pattern of frequencies over the various values. They can be illustrated through a histogram, bar graph, or frequency polygon.

Skewness is the nature and extent to which symmetry is absent. A skewed distribution is one in which the scores pile up on one side of the middle and are spread out on the other side. A distribution can be positively skewed or negatively skewed.

A floor effect is a situation in which many scores pile up at the low end of a distribution (creating skewness to the right; a positively skewed distribution) because it is not possible to have any lower score. A ceiling effect is a situation in which many scores pile up at the high end of a distribution (creating skewness to the left; a negatively skewed distribution) because it is not possible to have a higher score.

The term testing professionals use to refer to the steepness of a distribution in its center is kurtosis – its peakedness or flatness. A distribution can be platykurtic (relatively flat), leptokurtic (relatively peaked), or mesokurtic (somewhere in the middle).

The central tendency of a distribution refers to the middle of the group of scores.
• Mean is the sum of the scores divided by the number of scores.
• Mode is the value with the greatest frequency in a distribution.
• Median is the middle score.

Variability is an indication of how scores in a distribution are scattered or dispersed.
• Range indicates the distance between the two most extreme scores in a distribution (range = highest score – lowest score).
• Variance is the average of each score's squared difference from the mean.
• Standard deviation is simply the square root of the variance. It indicates the average deviation from the mean, the consistency in the scores, and how far scores are spread out around the mean.

A standard score is a raw score that has been converted from one scale to another scale, where the latter scale has some arbitrarily set mean and standard deviation.
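To make the refresher concrete, here is a minimal Python sketch (the raw scores and the NumPy-based approach are illustrative assumptions, not part of the original reviewer) that computes the mean, standard deviation, and standard scores for a small set of raw test scores:

```python
import numpy as np

# Hypothetical raw scores on a 10-item quiz (illustrative data only)
raw_scores = np.array([4, 6, 7, 7, 8, 9, 10])

mean = raw_scores.mean()        # sum of scores / number of scores
variance = raw_scores.var()     # average squared deviation from the mean
sd = np.sqrt(variance)          # standard deviation = square root of the variance

# A z-score is a standard score with mean 0 and standard deviation 1
z_scores = (raw_scores - mean) / sd

# Rescale to a scale with an arbitrarily set mean of 50 and SD of 10 (T-score-like)
t_scores = 50 + 10 * z_scores

print(round(mean, 2), round(sd, 2))
print(np.round(z_scores, 2))
print(np.round(t_scores, 1))
```

The z-to-T conversion at the end illustrates the point above: the mean and standard deviation of a standard-score scale are set arbitrarily by convention.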
Correlation is an expression of the degree and direction of correspondence between two things. The coefficient of correlation is the numerical index that expresses this relationship: it tells us the extent to which X and Y are correlated. Two variables can be positively or negatively correlated.

Code of ethics: Beneficence & Nonmaleficence, Fidelity & Responsibility, Integrity, Justice, and Respect for People's Rights & Dignity.

CHAPTER 5: RELIABILITY

Reliability is consistency in measurement – the proportion of the total variance attributable to true variance. The greater the proportion of the total variance attributed to true variance, the more reliable the test.

In the psychological assessment context, a reliable test gives dependable and consistent results even when it is used again and again, although error will always be present. Error implies that there will always be some inaccuracy in our measurement. One of the goals in psychological assessment is to minimize, as much as possible, the presence of error in testing.

Reliability can be measured through the reliability coefficient – an index of the ratio between the true score variance on a test and the total variance. Reliability estimates in the range of .70 to .80 are good enough for most purposes in basic research.

Variance from true differences is true variance, and variance from irrelevant, random sources is error variance.

[observed score = true score + random error]

According to Classical Test Theory, each person has a true score that would be obtained if there were no errors in measurement. The difference between the true score and the observed score results from measurement error. The true score of an individual will not change with repeated applications of the same test.

THE CONCEPT OF RELIABILITY
Measurement error comprises all of the factors associated with the process of measuring some variable other than the variable being measured; in other words, the variables that we are supposed to measure are affected by extraneous variables. There are two types of measurement error:

• Random error is a source of error in measuring a targeted variable caused by unpredictable fluctuations and inconsistencies of other variables in the measurement process. This source of error fluctuates from one testing situation to another with no discernible pattern that would systematically raise or lower scores, thus affecting reliability. The main sources of random error are limitations of instruments, environmental factors, and slight variations in procedure.

• Systematic error refers to a source of error in measuring a variable that is typically constant or proportionate to what is presumed to be the true value of the variable being measured. Once a systematic error becomes known, it becomes predictable – as well as fixable. Systematic errors primarily influence a measurement's validity. Typical causes of systematic error include observational error, imperfect instrument calibration, and sampling bias.

SOURCES OF ERROR VARIANCE
Aside from measurement error, we also have error variance (also called residual error, residual variance, or unexplained variance). It is the element of variability in a score that is produced by extraneous factors, such as measurement imprecision, and is not attributable to the independent variable or other controlled experimental manipulation. There are three sources of error variance:
1. Test construction
   An example is item sampling or content sampling, which refers to variation among items within a test as well as to variation among items between tests.
2. Test administration
   Error can be introduced by (1) the test environment (room temperature, level of lighting, and the amount of ventilation and noise); (2) testtaker variables (emotional problems, physical discomfort, lack of sleep); and (3) examiner-related variables (the examiner's physical appearance and demeanor – even the presence or absence of an examiner).
3. Test scoring and interpretation
   Despite the rigorous scoring criteria set forth in many better-known tests of intelligence, the scorer (or rater) can be a source of error variance. If subjectivity is involved in scoring, then the scorer can be a source of error variance.

RELIABILITY ESTIMATES/MEASURES

1. Test-Retest Reliability (Time-Sampling Reliability)
Uses the same instrument to measure the same thing/factor at two points in time. It is an estimate of reliability obtained by correlating pairs of scores from the same people on two different administrations of the same test. It is appropriate when evaluating a test that purports to measure something that is relatively stable over time (e.g., personality traits). Poor test-retest correlations do not always mean that a test is unreliable – they may suggest that the characteristic under study has changed. Pearson r or Spearman rho is usually used to measure test-retest reliability, where 1 indicates perfect reliability, values at or below .50 indicate poor reliability, and 0 indicates no reliability.

Purpose: To evaluate the stability of a measure
Typical use: When assessing the stability of various personality traits
Number of Testing Sessions: 2
Sources of Error Variance: Administration
Statistical Procedure: Pearson r or Spearman rho

2. Parallel-Forms and Alternate-Forms Reliability
This estimate compares two equivalent forms of a test that measure the same attribute. It is used to assess the consistency of the results of two tests constructed in the same way from the same content domain. In order to call the forms "parallel," the observed scores must have the same means and variances. If the tests are merely different versions (without this sameness of observed scores), they are called alternate forms.

Parallel-forms reliability can avoid some problems inherent in test-retest reliability, such as practice and fatigue effects, but it requires creating a large number of questions that measure the same construct. Spearman rho or Pearson r can be used.

Purpose: To evaluate the relationship between different forms of a measure
Typical use: When there is a need for different forms of a test
Number of Testing Sessions: 1 or 2
Sources of Error Variance: Test Construction or Administration
Statistical Procedure: Pearson r or Spearman rho
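As a concrete illustration of the two estimates above, the sketch below (hypothetical scores; the NumPy-based computation is an assumption, not a prescribed procedure) correlates two administrations of the same test to obtain a test-retest coefficient. Scores from a second, equivalent form could be substituted for the second administration to obtain an alternate-forms estimate instead.

```python
import numpy as np

# Hypothetical scores of 8 examinees on two administrations of the same test
time_1 = np.array([12, 15, 9, 20, 18, 14, 11, 17])
time_2 = np.array([13, 14, 10, 19, 17, 15, 10, 18])

# Pearson r between the two administrations serves as the
# test-retest reliability estimate (coefficient of stability)
r_test_retest = np.corrcoef(time_1, time_2)[0, 1]
print(round(r_test_retest, 2))  # values near 1.0 indicate stable measurement
```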
3. Internal Consistency
Assesses the correlation between multiple items in a test that are intended to measure the same construct. It measures how related the items that measure a certain construct or characteristic are to one another. You can calculate internal consistency without repeating the test or involving other researchers, so it is a good way of assessing reliability when you only have one data set. In effect, we judge the reliability of the instrument by estimating how well the items that reflect the same construct yield similar results. There are several ways to compute internal consistency:

A. Split-Half Reliability
Obtained by correlating two pairs of scores from equivalent halves of a single test administered once. This is useful when it is impractical to assess reliability with two tests or to administer a test twice. Results of one half of the test are compared with the results of the other.

Steps in performing split-half reliability:
(1) Divide the test into equivalent halves. Do not simply divide it in the middle, because that would lower the reliability; instead use the odd-even system, where one subscore is obtained for the odd-numbered items in the test and another for the even-numbered items.
(2) Calculate the Pearson r between scores on the two halves of the test.
(3) Adjust the half-test reliability using the Spearman-Brown formula. This allows a test developer or user to estimate internal consistency reliability from the correlation of two halves of a test. It can also be used to determine the number of items needed to attain a desired level of reliability.

B. Kuder-Richardson Formula 20 (KR-20)
A measure of internal consistency reliability for measures with dichotomous choices. It is a special case of Cronbach's alpha, computed for dichotomous scores. It is often claimed that a high KR-20 coefficient (e.g., > .90) indicates a homogeneous test. However, as with Cronbach's alpha, homogeneity (that is, unidimensionality) is actually an assumption, not a conclusion, of reliability coefficients.

C. Cronbach's Alpha/Coefficient Alpha
The most famous and commonly used reliability coefficient because it requires only one administration of the test. Cronbach's alpha is mathematically equivalent to the average of all possible split-half estimates. The general rule of thumb is that a Cronbach's alpha of .70 and above is good, .80 and above is better, and .90 and above is best. It is calculated to help answer questions about how similar sets of data are.

It does have some limitations: scales with a smaller number of items tend to have lower reliability, and sample size can also influence the results.

D. Average Proportional Distance (APD)
Rather than focusing on the similarity between scores on the items of a test, the APD is a measure of internal consistency that focuses on the degree of difference that exists between item scores.

One advantage of the APD over Cronbach's alpha is that the APD index is not connected to the number of items on a measure, unlike the latter method.

Purpose: To evaluate the extent to which items on a scale relate to one another
Typical use: When evaluating the homogeneity of a measure
Number of Testing Sessions: 1
Sources of Error Variance: Test Construction
Statistical Procedure: Pearson r with Spearman-Brown / KR-20 / Cronbach's alpha / APD

4. Inter-Scorer Reliability
Inter-scorer reliability is the degree of agreement or consistency between two or more scorers (or judges or raters) with regard to a particular measure. It is often used when coding nonverbal behavior.

Purpose: To evaluate the level of agreement between raters on a measure
Typical use: Interviews or coding of behavior; used when researchers need to show that there is consensus in the way different raters view a particular behavior pattern (and hence no observer bias)
Number of Testing Sessions: 1
Sources of Error Variance: Scoring and Interpretation
Statistical Procedure: Cohen's kappa / Pearson r / Spearman rho

USING AND INTERPRETING A COEFFICIENT OF RELIABILITY
As a rule of thumb, a coefficient in the .90s rates a grade of A, the .80s rates a B, and anywhere from .65 through the .70s rates a weak grade that borders on failing.

THE PURPOSE OF THE RELIABILITY COEFFICIENT. If the purpose of determining reliability is to break down the error variance into its parts, then a number of reliability coefficients would have to be calculated.
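To tie the internal-consistency estimates above together, here is a minimal Python sketch (the item responses are hypothetical, and the hand-rolled formulas are an illustrative assumption rather than a prescribed procedure) that computes an odd-even split-half coefficient corrected with the Spearman-Brown formula, plus coefficient alpha, for the same small item matrix:

```python
import numpy as np

# Hypothetical responses of 6 examinees to a 6-item scale (rows = people, columns = items)
items = np.array([
    [1, 1, 1, 0, 1, 1],
    [1, 0, 1, 1, 0, 1],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 1, 0],
    [1, 1, 0, 1, 1, 1],
])

# Split-half: odd-even split, then correct with the Spearman-Brown formula
odd_half = items[:, 0::2].sum(axis=1)
even_half = items[:, 1::2].sum(axis=1)
r_half = np.corrcoef(odd_half, even_half)[0, 1]
r_spearman_brown = (2 * r_half) / (1 + r_half)   # estimated full-length reliability

# Coefficient alpha: k/(k-1) * (1 - sum of item variances / variance of total scores)
k = items.shape[1]
item_variances = items.var(axis=0, ddof=1)
total_variance = items.sum(axis=1).var(ddof=1)
alpha = (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

print(round(r_spearman_brown, 2), round(alpha, 2))
```

Because the items here are dichotomous (0/1), the alpha computed this way is the same value KR-20 would give for these data.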
NATURE OF TESTS
1. Homogeneous vs. Heterogeneous Items
   Homogeneous items (with equal difficulties and equal intercorrelations) yield high reliability. The more homogeneous a test is, the more inter-item consistency (the degree of correlation among all the items on a scale) it can be expected to have.
2. Dynamic vs. Static Characteristics
   Dynamic characteristics are traits, states, and/or abilities presumed to be ever-changing as a function of situational and cognitive experiences. Static characteristics, on the other hand, are traits, states, and abilities that are relatively unchanging.
3. Restriction or Inflation of Range
   Restricted variance results in a correlation coefficient that tends to be lower. Inflated variance results in a correlation coefficient that tends to be higher.
4. Speed Tests vs. Power Tests
   A power test has a time limit that is long enough to allow test takers to attempt all items, and some items have a higher degree of difficulty. A speed test has items of a uniform level of difficulty, and when given generous time limits, all test takers should be able to complete all the test items correctly (scores are based on performance speed).
5. Criterion-Referenced Tests
   Designed to provide an indication of where a test taker stands with respect to some variable or criterion. Scores on these tests tend to be interpreted in pass-fail terms, and any scrutiny of performance on individual items tends to be for diagnostic and remedial purposes.

THE STANDARD ERROR OF MEASUREMENT

The standard error of measurement (SEM) provides an estimate of the amount of error inherent in an observed score or measurement. The relationship between the SEM and the reliability of a test is inverse: the higher the reliability of a test, the lower the SEM.

The SEM is the tool used to estimate or infer the extent to which an observed score deviates from a true score. It is the standard deviation of a theoretically normal distribution of test scores obtained by one person on equivalent tests – an index of the extent to which one individual's scores vary over tests presumed to be parallel.

The SEM is useful in establishing a confidence interval: a range or band of test scores that is likely to contain the true score. The standard error of the difference is the statistical measure that can aid a test user in determining how large a difference between two scores should be before it is considered statistically significant.

What to do about low reliability?
1. Increase the number of items. The larger the sample of items, the more likely the test will represent the true characteristic.
2. Use factor analysis and item analysis. Tests are more reliable if they are unidimensional, measuring a single ability, attribute, construct, or skill. Examine the correlation between each item and the total score for the test.
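As a worked illustration of the standard error of measurement and confidence interval described above (the numbers are hypothetical, and the familiar classical-test-theory formula SEM = SD × √(1 − r) is assumed to be the intended computation):

```python
import math

sd = 15.0             # hypothetical standard deviation of the test's scores
reliability = 0.91    # hypothetical reliability coefficient of the test
observed_score = 106

# Standard error of measurement: SEM = SD * sqrt(1 - reliability)
sem = sd * math.sqrt(1 - reliability)          # 15 * sqrt(0.09) = 4.5

# 95% confidence interval around the observed score (observed +/- 1.96 * SEM)
lower = observed_score - 1.96 * sem
upper = observed_score + 1.96 * sem
print(round(sem, 2), (round(lower, 1), round(upper, 1)))   # 4.5, (97.2, 114.8)
```

Note how the band shrinks as reliability rises: with a perfectly reliable test the SEM would be zero, consistent with the inverse relationship stated above.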
CHAPTER 6: VALIDITY

A measurement can be reliable but not valid; however, if a measure is valid, then it is also reliable.

Validity, as applied to a test, is a judgment or estimate of how well a test measures what it purports to measure in a particular context.
• The judgment is based on evidence about the appropriateness of inferences drawn from test scores. The validity of a test must be shown from time to time to account for culture and advancement.
• Validation is the process of gathering and evaluating evidence about validity. Test user and test taker both have roles in the validation of a test.
• Test users can conduct (1) validation studies, to yield insights regarding a particular population of test takers as compared to the norming sample, or (2) local validation studies, which are absolutely necessary when a test user plans to alter elements of the test.

One way measurement specialists have traditionally conceptualized validity is according to three categories or types of validity (the trinitarian view), which holds that evaluations of validity should focus on (1) how well a measurement's quantitative value represents the attribute being measured and (2) how well the measurement gives scores relevant to what the test is trying to measure.
1. Content Validity
   Based on an evaluation of the subjects, topics, or content covered by the items in the test.
2. Criterion-Related Validity
   Obtained by evaluating the relationship of scores obtained on the test to scores on other tests or measures.
3. Construct Validity
   The "umbrella validity," because every other variety of validity falls under it. It assesses whether a test is representative of all aspects of the construct. This is a measure of validity that is arrived at by executing a comprehensive analysis of
   a. how scores on the test relate to other test scores and measures; and
   b. how scores on the test can be understood within some theoretical framework for understanding the construct that the test was designed to measure.

ECOLOGICAL VALIDITY. Refers to a judgment regarding how well a test measures what it purports to measure at the time and place that the variable being measured (typically a behavior, cognition, or emotion) is actually emitted.

FACE VALIDITY. Relates more to what a test appears to measure to the person being tested than to what the test actually measures. It is a judgment concerning how relevant the test items appear to be, usually from the test taker's, not the test user's, perspective. A lack of face validity could contribute to a lack of confidence in the perceived effectiveness of the test, with a consequent decrease in the test taker's cooperation or motivation to do his or her best. Ultimately, face validity may be more a matter of public relations than psychometric soundness.

CONTENT VALIDITY. A judgment of how adequately a test samples behavior representative of the universe of behavior that the test was designed to sample. In the interest of ensuring content validity, test developers strive to include key components of the construct targeted for measurement and to exclude content irrelevant to that construct. One example of content validity is an achievement test in which the test items approximate the proportions of material covered in the course.

When making a test, we usually have a test blueprint – a plan regarding the types of information to be covered by the items, the number of items tapping each area of coverage, the organization of the items in the test, and so on. It is also important to note that a content-valid test may not be universally valid; validity depends on the politics, culture, and history of the country.

CRITERION-RELATED VALIDITY. A judgment of how adequately a test score can be used to infer an individual's most probable standing on some measure of interest – the measure of interest being the criterion. Whatever the criterion, ideally it is relevant, valid, and uncontaminated.
A. Concurrent validity – an index of the degree to which a test score is related to some criterion measure obtained at the same time.
B. Predictive validity – an index of the degree to which a test score predicts some criterion measure obtained later.

Judgments of criterion-related validity, whether concurrent or predictive, are based on two types of statistical evidence: the validity coefficient and expectancy data. The validity coefficient is a correlation that provides a measure of the relationship between test scores and scores on the criterion measure. There is no rule for determining the minimum acceptable size of a validity coefficient; essentially, it should be high enough to result in the identification and differentiation of test takers with respect to the target attribute.

CONSTRUCT VALIDITY. Viewed as the unifying concept for all validity evidence. It is a judgment about the appropriateness of inferences drawn from test scores regarding individual standings on a variable called a construct. Evidence that a test has construct validity includes: (1) the test is homogeneous, measuring a single construct; (2) test scores increase or decrease as a function of age, the passage of time, or an experimental manipulation, as theoretically predicted; (3) test scores obtained after some event or the mere passage of time (posttest scores) differ from pretest scores as theoretically predicted; (4) test scores obtained by people from distinct groups vary as predicted by the theory; and (5) test scores correlate with scores on other tests in accordance with what would be predicted from a theory that covers the manifestation of the construct in question.

VALIDITY, BIAS, AND FAIRNESS
• Rating error is a judgment resulting from the intentional or unintentional misuse of a rating scale.
• Leniency error (generosity error) is an error in rating that arises from the tendency on the part of the rater to be lenient in scoring, marking, and/or grading.
• Severity error lies at the opposite extreme of the leniency error: a rater pans just about everything they review.
• Central tendency error is where a rater exhibits a general and systematic reluctance to give ratings at either the positive or the negative extreme, so the ratings cluster in the middle of the rating continuum.

CHAPTER 7: UTILITY

Utility refers to how useful a test is – specifically, to the practical value of using a test to aid in decision making; it is the usefulness or practical value of testing to improve efficiency. An index of utility can tell us something about the practical value of the information derived from scores on the test.

FACTORS THAT AFFECT A TEST'S UTILITY:
A. Psychometric Soundness – a test is said to be psychometrically sound for a particular purpose if its reliability and validity coefficients are acceptably high. The higher the criterion-related validity of test scores for making a particular decision, the higher the utility of the test is likely to be. However, some tests may be psychometrically sound but have little utility, particularly for test takers who fail to follow the test's directions.
B. Costs – one of the most basic elements in any utility analysis; refers to disadvantages, losses, or expenses in both economic and noneconomic terms.
C. Benefits – the utility of a test may take into account whether the benefits (profits, gains, or advantages) of testing justify the costs of administering, scoring, and interpreting the test.

UTILITY ANALYSIS
Utility analysis is a family of techniques that entail a cost-benefit analysis designed to yield information relevant to a decision about the usefulness and/or practical value of a tool of assessment. It is an umbrella term covering various possible methods that can be employed to yield various kinds of output. It is undertaken for the purpose of evaluating whether the benefits of using a test outweigh the costs. The endpoint is typically an educated decision about which of many possible courses of action is optimal.

Utility analysis can be conducted through:
1. Expectancy data. Converting a scatterplot of test data into an expectancy table can provide an indication of the likelihood that a test taker will score within some interval of scores on a criterion measure – an interval that may be categorized as "passing," "acceptable," or "failing."

The Taylor-Russell tables provide an estimate of the percentage of employees hired by the use of a particular test who will be successful at their jobs, given different combinations of three variables: (1) the test's validity (the value assigned is the validity
coefficient), (2) the selection ratio used (the relationship between the number of people to be hired and the number of people available to be hired), and (3) the base rate (the percentage of people hired under the existing system for a particular position).

Use of the Naylor-Shine tables can compensate for the limitations of the Taylor-Russell tables. It entails obtaining the difference between the means of the selected and unselected groups to derive an index of what the test is adding to already established procedures.

2. The Brogden-Cronbach-Gleser formula. Used to calculate the dollar amount of the utility gain (an estimate of the benefit, monetary or otherwise, of using a particular test or selection method) resulting from the use of a particular selection instrument under specified conditions.

A modification of the Brogden-Cronbach-Gleser formula yields the productivity gain, for researchers who prefer to express their findings in terms of productivity gains (estimated increases in work output) rather than financial ones.

METHODS FOR SETTING CUT SCORES
ANGOFF METHOD. Setting fixed cut scores can be applied to personnel selection tasks as well as to questions regarding the presence or absence of a particular trait, attribute, or ability. The judgments of experts are averaged to yield cut scores for the test.

KNOWN GROUPS METHOD. Entails collecting data on the predictor of interest from groups known to possess, and not to possess, a trait, attribute, or ability of interest. Based on an analysis of these data, a cut score is set on the test that best discriminates between the two groups' test performance.

IRT-BASED METHODS. Each item is associated with a particular level of difficulty, and test takers must answer items that are deemed to be above some minimum level of difficulty in order to pass the test; this minimum is determined by experts and serves as the cut score. Techniques include the item-mapping method and the bookmark method.

CHAPTER 8: TEST DEVELOPMENT

The creation of a good test is the product of the thoughtful and sound application of established principles of test development – an umbrella term for all that goes into the process of creating a test. The process of developing a test occurs in five stages:
1. TEST CONCEPTUALIZATION
   Ideas for a test.
2. TEST CONSTRUCTION
   The stage in the process of test development that entails writing test items (or rewriting or revising existing items), as well as formatting items, setting scoring rules, and otherwise designing and building a test.
3. TEST TRYOUT
   Once a preliminary form of the test has been developed, it is administered to a representative sample of test takers under conditions that simulate the conditions under which the final version of the test will be administered.
4. ITEM ANALYSIS
   Statistical procedures are employed to assist in making judgments about which items are good as they are, which items need to be revised, and which items should be discarded. The analyses may include item reliability, item validity, and item discrimination.
5. TEST REVISION
   Action taken to modify a test's content or format for the purpose of improving the test's effectiveness as a tool of measurement.

TEST CONCEPTUALIZATION
PILOT WORK. The preliminary research surrounding the creation of a prototype of the test. Test items may be pilot studied to evaluate whether they should be included in the final form of the instrument. The test developer typically attempts to determine how best to measure a targeted construct. Pilot work is a necessity when constructing tests or other measuring instruments for publication and wide distribution.

TEST CONSTRUCTION
SCALING. The process by which a measuring device is designed and calibrated and by which numbers (or other indices) – scale values – are assigned to different amounts of the trait, attribute, or characteristic being measured.
L.L. Thurstone introduced the notion of absolute scaling – a procedure for obtaining a measure of item difficulty across samples of test takers who vary in ability.

Scaling methods are as follows:
• Rating Scale. A grouping of words, statements, or symbols on which judgments of the strength of a particular trait, attitude, or emotion are indicated by the test taker. The use of rating scales of any type results in ordinal-level data.
• Method of Paired Comparisons. Test takers are presented with pairs of stimuli which they are asked to compare, and they must select one.
• Guttman Scale. Items range sequentially from weaker to stronger expressions of the attitude, belief, or feeling being measured. All respondents who agree with the stronger statements of the attitude will also agree with the milder statements. The resulting data are then analyzed by means of scalogram analysis.

WRITING ITEMS. It is usually advisable that the first draft contain approximately twice the number of items that the final version of the test will contain. An item pool is the reservoir or well from which items will or will not be drawn for the final version of the test.

Items presented in a selected-response format require test takers to select a response from a set of alternative responses.
• Multiple-choice items have three elements: (1) a stem, (2) a correct alternative or option, and (3) several distractors or foils.
• Matching items have two columns: premises on the left and responses on the right.
• True-false items require the test taker to indicate whether a statement is or is not a fact.

Items presented in a constructed-response format require test takers to supply or create the correct answer, not merely to select it.
• Completion items
• Short-answer items
• Essay items

Computerized Adaptive Testing (CAT) refers to an interactive, computer-administered test-taking process wherein the items presented to the test taker are based in part on the test taker's performance on previous items. It tends to reduce floor effects and ceiling effects. It also has the ability to tailor the content and order of presentation of test items on the basis of responses to previous items, referred to as item branching.
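The item-branching idea can be pictured with a minimal sketch (an entirely hypothetical item bank and rule; operational CAT systems typically select items with IRT-based methods rather than this simple step-up/step-down rule):

```python
# Hypothetical item bank ordered from easiest to hardest (difficulty 1-5)
item_bank = {1: "easy item", 2: "fairly easy item", 3: "medium item",
             4: "hard item", 5: "very hard item"}

def next_difficulty(current, answered_correctly):
    """Simple branching rule: step up after a correct answer, step down after an error."""
    step = 1 if answered_correctly else -1
    return min(max(current + step, 1), 5)   # stay within the bank's range

# Example: start at medium difficulty and branch on simulated responses
difficulty = 3
for correct in [True, True, False, True]:
    difficulty = next_difficulty(difficulty, correct)
    print(difficulty, item_bank[difficulty])
```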
SCORING ITEMS. Many different test scoring models have been devised; some of them are as follows:
• Cumulative model – the higher the score on the test, the higher the test taker is on the ability, trait, or other characteristic that the test purports to measure.
• Class/category scoring – test taker responses earn credit toward placement in a particular class or category with other test takers whose pattern of responses is presumably similar in some way.
• Ipsative scoring – comparing a test taker's score on one scale within a test to another scale within that same test.

TEST TRYOUT
The test should be tried out on people who are similar in critical respects to the people for whom the test was designed. A good test item is reliable and valid, and it helps to discriminate among test takers.

ITEM ANALYSIS

ITEM-DIFFICULTY INDEX. Obtained by calculating the proportion of the total number of test takers who answered the item correctly. The result can range from 0 to 1, and the larger the item-difficulty index, the easier the item. For example, if 50 out of 100 students got an item correct, then 50 divided by 100 = .5 (p).

ITEM-RELIABILITY INDEX. Provides an indication of the internal consistency of a test. The higher the index, the greater the test's internal consistency. One statistical tool that can be used is factor analysis, in which items are factored and items loading on several factors can be revised or eliminated.

ITEM-VALIDITY INDEX. Provides an indication of the degree to which a test is measuring what it purports to measure. The higher the item-validity index, the greater the test's criterion-related validity.

ITEM-DISCRIMINATION INDEX. Indicates how adequately an item separates or discriminates between high scorers and low scorers on the entire test.
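As a closing illustration of the item-analysis statistics above, this short sketch (hypothetical response data; the upper-group/lower-group approach to the discrimination index is one common convention, assumed here) computes the difficulty index p and a discrimination index d for a single item:

```python
import numpy as np

# Hypothetical data: 1 = correct, 0 = incorrect on one item, plus each examinee's total test score
item_correct = np.array([1, 1, 0, 1, 0, 1, 1, 0, 1, 0])
total_scores = np.array([48, 45, 22, 40, 25, 44, 39, 20, 42, 28])

# Item-difficulty index: proportion of all test takers answering the item correctly
p = item_correct.mean()     # here 6 of 10 correct -> p = .6 (larger p = easier item)

# Item-discrimination index: p in the upper scoring group minus p in the lower scoring group
order = np.argsort(total_scores)
lower_group = item_correct[order[:5]]    # bottom half on the total score
upper_group = item_correct[order[-5:]]   # top half on the total score
d = upper_group.mean() - lower_group.mean()

print(p, d)   # a positive d means high scorers got the item right more often than low scorers
```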