
ARANDA, Juliana M.

BPSY 198
BS PSYCH 4C Ma’am Mical

The Science of Psychological Measurement

Self-Assessment: Glossary
Test your understanding of elements of this chapter by seeing if you can explain each of the following terms, expressions, and abbreviations:

A Statistics Refresher

arithmetic mean – Also known as the mean; a measure of central tendency generated by taking the average of all scores in a distribution.
average deviation – A measure of variability calculated by adding the absolute values of the deviations of all scores from the mean of a distribution and dividing by the total number of scores.
bar graph – A graphic illustration in which data representing frequency are set on the vertical axis, categories are set on the horizontal axis, and the rectangular bars describing the data are often noncontiguous.
bimodal distribution – A distribution in which the measure of central tendency consists of two scores that occur an equal (and greatest) number of times.
Distribution – A collection of test scores organized for recording or analysis.
Dynamometer – A device used for measuring the strength of a person's grip.
Error – All factors other than what a test is intended to measure that contribute to test scores; error is a variable in all tests and assessments.
frequency distribution – A table of scores and the number of times each score has been obtained.
frequency polygon – A graphic data visualization in which numbers representing frequency are set on the vertical axis, test scores or categories are set on the horizontal axis, and the data are described by a continuous line connecting the points where test scores or categories meet their frequencies.
Graph – A diagram or chart that consists of lines, points, bars, or other symbols describing and illustrating data.
grouped frequency distribution – A tabular summary of test scores in which the scores are grouped by intervals, often known as class intervals.
histogram – A graph with vertical lines drawn at the true limits of each test score (or class interval), forming a series of contiguous rectangles.
interquartile range – A measure of variability equal to the difference between the third quartile point and the first quartile point in a distribution that has been divided into quartiles.
interval scale – A measurement system in which items can be ordered, with equal intervals between units so that each unit on the scale is of the same magnitude as any other, but with no absolute zero point; because of this, ratios between scale values are not meaningful.
Kurtosis – An indication of the steepness of a distribution in its center: whether its shape is peaked or flat.
Leptokurtic – A description of the kurtosis of a distribution that exhibits a pronounced peak in its central portion.
linear transformation – A transformation of a score such that the new score differs from the other scores on the new scale in the same way that the original score differed from the other scores on the scale used to derive it; contrast with a nonlinear transformation.
Mean – Commonly referred to as the arithmetic mean; a measure of central tendency obtained by computing the average of all the scores within a distribution.
Measurement – Assigning numerical values or symbols to the attributes of individuals or objects based on specific guidelines or rules.
measure of central tendency – One of three statistical measures that convey information about the central or average score within a distribution. The mean (arithmetic mean) is a measure of central tendency appropriate for ratio-level measurement; the median takes the order of scores into account and falls under the ordinal level of measurement; the mode is nominal in nature.
measure of variability – A statistic that reveals the extent to which scores within a distribution are spread or dispersed. Common measures of variability include the range, standard deviation, and variance.
Median – The middlemost score in a distribution.
Mesokurtic – A description of the kurtosis of a distribution whose central portion is neither excessively peaked nor excessively flat.
Mode – The most frequently occurring score in a distribution.
negative skew – A distribution is negatively skewed when there are relatively few scores at the low end of the distribution.
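To make a few of the entries above concrete (arithmetic mean, Median, Mode, average deviation), here is a minimal worked sketch in Python. The scores are invented and the variable names are my own, not terms from the chapter.

from statistics import mean, median, multimode

scores = [85, 90, 78, 90, 95, 88, 90, 78]   # hypothetical test scores

m = mean(scores)                             # arithmetic mean: sum of scores / number of scores
md = median(scores)                          # median: middlemost score when scores are ordered
mo = multimode(scores)                       # mode(s): most frequently occurring score(s)
avg_dev = mean(abs(x - m) for x in scores)   # average deviation: mean absolute deviation from the mean

print(m, md, mo, avg_dev)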

nominal scale – A measurement system in which all measured items are categorized and classified according to one or more distinguishing characteristics and assigned to distinct, mutually exclusive categories.
nonlinear transformation – In psychometrics, a method of altering a score such that the new score may not have a direct numerical link to the original score, and the differences between the new score and the other scores on the new scale may not mirror the differences between the original score and the other scores on the original scale. This differs from a linear transformation.
normal curve – A mathematically defined, smooth, bell-shaped curve with its peak at the center, gradually diminishing on both sides yet never quite touching the horizontal axis; the concept is often discussed in terms of the area under the curve.
normalized standard score scale – The end result of reshaping a skewed distribution so that it resembles a normal curve, often achieved by employing nonlinear transformation techniques.
normalizing a distribution – A statistical correction applied to distributions that meet specific criteria in order to approximate a normal distribution, making the data easier to understand and work with.
ordinal scale – A measurement system in which items can be ranked in order, but the ranking provides no information about the precise differences between ranks, and there is no absolute zero point. In psychology and education, most scales are considered ordinal.
Platykurtic – A description of the kurtosis of a distribution that has a relatively flat shape in its central portion.
positive skew – A distribution is positively skewed when there are relatively few scores at the upper end of the distribution.
quartile – One of the three demarcation points that separate the four quarters of a distribution, each typically labeled Q1, Q2, or Q3.
Range – A measure of variability obtained by determining the difference between the highest and lowest scores in a distribution.
ratio scale – A measurement system that permits rank ordering, has equal intervals, and supports meaningful mathematical operations, thanks to an absolute zero point. Ratio scales are infrequently encountered in psychology and education.
raw score – A direct, unaltered report of performance, typically expressed in numerical form.
scale
semi-interquartile range – A measure of variability obtained by dividing the interquartile range by 2.
skewness – An indicator of the lack of symmetry in a distribution; positive skewness occurs when there are relatively few scores at the positive (high) end, and negative skewness occurs when there are relatively few scores at the negative (low) end.
standard deviation – A measure of variability calculated as the square root of the average of the squared deviations from the mean; in other words, the square root of the variance.
standard score – A raw score transformed from one scale to another, where the new scale has an arbitrarily set mean and standard deviation and is more widely used and easily interpreted. Examples of standard scores include z scores and T scores.
stanine – A standard score derived from a scale with a mean of 5 and a standard deviation of approximately 2.
T score – A standard score, named in honor of Thorndike, computed on a scale with a mean fixed at 50 and a standard deviation set at 10; this scoring method was employed by the developers of the MMPI.
tail – The area under the normal curve between 2 and 3 standard deviations above the mean, and the area between 2 and 3 standard deviations below the mean; a normal curve has two tails.
variability – An indication of how scores in a distribution are spread or scattered.
variance – A measure of variability calculated by finding the average of the squared differences between the scores in a distribution and their mean.
z score – A standard score obtained by subtracting the mean from a raw score and dividing the difference by the standard deviation; it expresses a score in standard deviation units, signifying how far the raw score lies from the mean of the distribution.


Of Tests and Testing

age-equivalent scores – Indices of how well an individual is performing (the individual's ability) relative to what is typically expected at a given age.
age norms – Norms created to serve as a reference point based on the age of the individuals who obtained particular scores; they differ from grade norms.
bivariate distribution – Also called a scatterplot, scatter diagram, or scattergram; a visual representation of correlation achieved by plotting the coordinate points for values of the X-variable and the Y-variable on a graph.
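The standard score entries above (z score, T score, stanine) can be illustrated with a small sketch. The raw scores are invented, the helper functions are my own names, and the stanine conversion is a rough rounding approximation rather than an official conversion table.

from statistics import mean, pstdev

raw_scores = [42, 50, 55, 61, 48, 57, 44, 59]   # hypothetical raw scores

mu = mean(raw_scores)
sigma = pstdev(raw_scores)        # population standard deviation (square root of the variance)

def to_z(x):
    # z score: (raw score - mean) / standard deviation
    return (x - mu) / sigma

def to_t(x):
    # T score: a standard score with mean 50 and standard deviation 10
    return 50 + 10 * to_z(x)

def to_stanine(x):
    # approximate stanine: mean 5, standard deviation about 2, kept within the 1-9 range
    return max(1, min(9, round(5 + 2 * to_z(x))))

for x in raw_scores:
    print(x, round(to_z(x), 2), round(to_t(x), 1), to_stanine(x))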

classical theory – Assumes that every person has a "true" score (T) that he or she would obtain on a test if there were no measurement error.
coefficient of correlation – Typically symbolized by r; an index of the strength of the linear relationship between two continuous variables, expressed as a number that can range from −1 to +1. Various methods can be employed to calculate a correlation coefficient; the most commonly used is Pearson's r.
coefficient of determination – A value that shows the extent to which two variables share variance. It is determined by squaring the correlation coefficient and multiplying the result by 100; the resulting percentage signifies the amount of variance accounted for by the correlation coefficient.
construct – An informed, scientific concept created to describe or explain behavior. Constructs include notions such as "intelligence," "personality," "anxiety," and "job satisfaction."
content-referenced testing and assessment – Testing and assessment designed to measure students' mastery of specific learning objectives.
convenience sample – A sample made up of whoever is readily available for use.
correlation – An expression of the extent and direction of agreement between two continuous variables.
criterion – A benchmark used to evaluate a test or a test score; it can take various forms, including a specific behavior or a defined set of behaviors.
criterion-referenced testing and assessment – Also called domain-referenced or content-referenced testing and assessment; a method of evaluating test scores by comparing them to a predefined standard, in contrast to norm-referenced testing.
cumulative scoring – A scoring method in which points or scores from individual items or subtests are added together, with a higher total score suggesting a higher level of the measured ability or trait. This differs from class scoring and ipsative scoring.
curvilinearity – In the context of graphs or correlation scatterplots, the extent to which the plot or graph displays a curved shape or pattern.
developmental norms – Norms established on the basis of any trait, ability, skill, or characteristic that is presumed to change, decline, or otherwise be influenced by age, grade level, or stage of life.
domain sampling – In the context of construct measurement, a subset of behaviors or test items that represents a particular construct; these selections serve as indicative samples drawn from the broader universe of potential behaviors or items designed to assess the same construct.
domain-referenced testing and assessment – Testing and assessment concerned with how accurately a person's performance within a specific domain or subject can be gauged.
equipercentile method – A method used to compare scores on multiple tests, as in the development of national anchor norms; it involves establishing percentile norms for each test and determining the specific score on each test that corresponds to a particular percentile.
error variance – In the true score model, the portion of score variation that arises from unrelated, random factors such as test design, administration, and scoring, as distinct from the trait or ability being measured.
fixed reference group scoring system – A scoring system that uses the score distribution of a fixed reference group as the basis for calculating scores in future test administrations; this method is employed, for example, in the scoring of exams such as the SAT and GRE.
grade norms – Norms tailored as a reference based on the grade or educational level of the test takers who attain particular scores. This differs from age norms.
incidental sample – Commonly known as a convenience sample; a sample made up of individuals selected on the basis of their easy accessibility rather than their representativeness of the population under study.
intercept – In the equation for a regression line, Y = a + bX, the letter a: a constant indicating where the line crosses the vertical (Y) axis.
local norms – Normative data or information related to a particular population, often of specific interest to the test user, gathered for the purpose of making meaningful comparisons or evaluations.
meta-analysis – A set of statistical methods employed to combine data from multiple studies and generate unified estimates of the statistics under investigation.
multiple regression – The examination of the relationships between multiple independent variables and a single dependent variable in order to understand how each independent variable influences or predicts the dependent variable.
national anchor norms – An equivalency table for scores on two nationally standardized tests designed to measure the same thing.
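As a rough illustration of the coefficient of correlation and coefficient of determination entries above, here is a short sketch that computes Pearson's r by hand and then squares it; the paired values are invented.

from statistics import mean
from math import sqrt

x = [2, 4, 5, 7, 9]        # e.g., hours of review (hypothetical)
y = [65, 70, 74, 80, 88]   # e.g., test scores (hypothetical)

mx, my = mean(x), mean(y)
cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
r = cov / sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))   # Pearson's r
determination = (r ** 2) * 100    # coefficient of determination, as a percentage of shared variance

print(round(r, 3), round(determination, 1))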

national norms – Norms derived from a standardization sample that was nationally representative of the population.
norm – Behavior or performance that is considered customary, average, standard, anticipated, or typical within a given context. The term "norm" is the singular form of "norms," referring to an individual standard or benchmark that serves as a point of reference for measuring and comparing behavior or performance.
normative sample – A group of individuals assumed to be a representative subset of the larger population that could potentially take a specific test; the performance data of this group on the test serve as a reference or context for evaluating individual test scores.
Norming – The process of deriving or creating norms.
norm-referenced testing and assessment – A method of evaluating and interpreting test scores by comparing an individual test taker's score to the scores of a group of test takers who took the same test. This method differs from criterion-referenced testing and assessment, where the focus is on specific performance standards or criteria rather than group comparisons.
Outlier – (1) An extremely atypical point in a scatterplot; (2) any extremely atypical finding in research.
overt behavior – An observable action, or the product of an observable action; something that can be seen or measured, such as a person's behavior, test results, or assessment responses. These are tangible outcomes or behaviors that can be observed or quantified in a research or assessment context.
Pearson r – The statistic commonly referred to as the Pearson correlation coefficient, or Pearson's coefficient of product-moment correlation; it is widely used to calculate a measure of the relationship between two continuous variables when that relationship is linear in nature.
percentage correct – On a test with responses scored as either correct or incorrect, an expression of the percentage of items answered correctly. It is calculated by taking the number of correct answers, multiplying it by 100, and then dividing by the total number of items on the test. This is distinct from a percentile, which ranks a person's performance relative to others.
Percentile – An indication of where a person's score ranks among others; by contrast, "percentage correct" shows the proportion of correct answers on a test.
program norms – Descriptive statistics based on the scores of a specific group of test takers during a certain period, gathered without the usual formal norming process; also called user norms.
purposive sampling – The deliberate (nonrandom) selection of individuals for a sample based on the belief that they represent the population under study.
race norming – The controversial practice of norming on the basis of race or ethnic background.
rank-order/rank-difference correlation coefficient – A correlation coefficient based on ranks: individuals, groups, or objects are positioned on an ordinal scale in relation to one another, and the coefficient expresses how strongly the two sets of rankings are related.
regression – The analysis of relationships between variables for the purpose of predicting one variable from another.
regression coefficient – In the formula Y = a + bX, both a (the ordinate intercept) and b (the slope of the line) are regression coefficients. In practice, the specific values of a and b are calculated using algebraic methods.
regression line – The result of simple regression analysis; the graphic "line of best fit" that comes closest to the greatest number of points on the scatterplot of the two variables.
sample – A group of people (or objects) selected because they are believed to be representative of the entire population or universe under study.
Sampling – A general reference to the process of developing a sample.
scatter diagram – A picture or graph that shows the relationship between two variables.
scattergram – Another name for a scatterplot; it conveys the strength and direction of the relationship between two variables.
scatterplot – Also known as a scatter diagram; a graphical representation of correlation created by plotting the coordinate points for two variables.
simple regression – The analysis of the relationship between a single independent variable and a single dependent variable.
Spearman's rho – Also called the rank-order correlation coefficient and the rank-difference correlation coefficient; it is often preferred when dealing with small sample sizes and when both sets of measurements are ordinal in nature.
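The regression entries above (intercept, regression coefficient, regression line) and Spearman's rho can be sketched as follows; the data are invented, and the rank helper is a simplified version that ignores ties.

from statistics import mean

x = [2, 4, 5, 7, 9]        # predictor values (hypothetical)
y = [65, 70, 74, 72, 88]   # criterion values (hypothetical)

mx, my = mean(x), mean(y)
b = sum((a - mx) * (c - my) for a, c in zip(x, y)) / sum((a - mx) ** 2 for a in x)   # slope (b)
a_intercept = my - b * mx                                                            # intercept (a)

def predict(value):
    # regression line: Y = a + bX
    return a_intercept + b * value

def ranks(values):
    # ordinal ranks, 1 = smallest (no tie handling in this sketch)
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = rank
    return r

n = len(x)
d2 = sum((rx - ry) ** 2 for rx, ry in zip(ranks(x), ranks(y)))
rho = 1 - (6 * d2) / (n * (n ** 2 - 1))   # Spearman's rho (rank-difference formula)

print(round(a_intercept, 2), round(b, 2), round(predict(6), 1), round(rho, 2))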

standard error of the estimate – In regression, a measure of the precision of prediction; the weaker the correlation between the variables, the larger the standard error of the estimate.
Standardization – The process of administering a test to a representative group of testtakers in order to establish reference points, or norms, for interpreting test scores.
standardized test – A test or measure that has undergone standardization and is administered and scored according to established guidelines or norms, consistent with documents such as the Standards for Educational and Psychological Testing.
state – In personality, a temporary expression of a trait: a short-term inclination toward certain behavior, as opposed to a lasting trait. In psychoanalytic theory, the term conveys the dynamic conflicts among the id, ego, and superego.
stratified-random sampling – A form of stratified sampling in which a sample is created by dividing a population into specific subgroups and every member within those subgroups has an equal chance of being included in the sample.
stratified sampling – The process of creating a sample by dividing a population into specific subgroups and sampling from each of them.
subgroup norms – Norms established for a particular subgroup within a larger group.
test standardization – The process in which a test is administered to a representative sample of test-takers under specified conditions; the data are then scored and interpreted so as to set a benchmark (norms) for future administrations of the test with different test-takers.
Trait – A distinctive and relatively lasting characteristic that differentiates one individual from another, in contrast to a "state," which is more temporary and changeable.
true score theory – Also called the true score model or classical test theory; a system that assumes a test score (or a response to an individual item) comprises a stable component reflecting what the test or item is intended to measure and a random component representing measurement error.
user norms – Often called program norms; descriptive statistics based on a group of test-takers during a specific period, rather than formal norms obtained through systematic sampling methods.
Y = a + bX – The equation of linear regression.


Reliability

alternate forms – Different versions of the same test or measure, often created as "accommodations" or "alternative methods" to meet the needs of assessees while still measuring the same variable.
Alternate-forms reliability – Also called equivalent-forms reliability; an estimate of the extent to which item sampling and other sources of error affect scores on two different versions of the same test. It is distinguished from parallel-forms reliability, which assumes that the two forms have equal means and variances.
Assumption of local independence
assumption of monotonicity
assumption of unidimensionality
coefficient alpha – A statistic, also known as Cronbach's alpha or simply alpha, commonly used in test construction to estimate internal-consistency reliability. More precisely, it equals the mean of all possible split-half reliabilities.
Coefficient of equivalence – An estimate of alternate-forms reliability or parallel-forms reliability; it assesses the consistency between scores on two different versions or forms of a test.
Coefficient of generalizability – In generalizability theory, an index of the influence that particular facets have on a test score.
coefficient of inter-scorer reliability – An index of the consistency of scores across different scorers; it reflects the extent to which scores vary across scorers as a result of measurement error.
Coefficient of stability – An estimate of long-term test-retest reliability, determined by assessing the consistency of test scores obtained over time intervals of six months or more.
Confidence interval – A range or band of test scores that is likely to include an individual's "true score," indicating the degree of confidence in the measurement's precision.
Content sampling – The variety of subject matter found within test items; it is often discussed with regard to the differences among individual items within a test or between the items of two or more tests. It can also be called item sampling.
Criterion-referenced test – A test that assesses an individual's score in relation to a predetermined standard, in contrast to norm-referenced testing, which compares scores to those of a group.
Decision study – Research, conducted after a generalizability study, that aims to assess the practicality and usefulness of test scores in decision-making.
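A minimal sketch of the coefficient alpha entry above, using the common computational form alpha = (k / (k − 1)) × (1 − sum of item variances / variance of total scores); the item ratings are invented.

from statistics import pvariance

# rows = testtakers, columns = items (hypothetical 0/1/2 ratings)
data = [
    [2, 1, 2, 2],
    [1, 1, 1, 2],
    [2, 2, 2, 2],
    [0, 1, 1, 1],
    [1, 0, 1, 2],
]

k = len(data[0])                                   # number of items
item_vars = [pvariance([row[i] for row in data]) for i in range(k)]
total_var = pvariance([sum(row) for row in data])  # variance of total (summed) scores

alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)   # Cronbach's alpha
print(round(alpha, 3))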

Dichotomous test item – A test item or question with a binary response option, such as true-false or yes-no.
discrimination – In IRT, the extent to which an item distinguishes between individuals with varying levels of the measured trait or ability.
Domain sampling theory – A view of measurement in which the behaviors or test items actually used are treated as a subset representing a specific construct, drawn from the complete set (domain) of possible behaviors or items.
Dynamic characteristic – A trait, state, or ability that is subject to change based on situational and cognitive experiences, as opposed to a static characteristic.
Error variance – In the true score model, the part of the variability in observed scores that is not associated with the trait or ability being measured, often stemming from factors such as test design, administration, and scoring.
Estimate of inter-item consistency
facet
generalizability study – In generalizability theory, a study that examines how facets of the universe of interest, such as the number of items on the test, the amount of training the scorers received, and the purpose of the test administration, influence a test score.
Generalizability theory – Also known as domain sampling theory; a set of assumptions about measurement in which a test score (or even a response to an individual item) comprises a stable component measuring what the test or item is designed to measure, along with less stable components that can be considered error.
Heterogeneity – The degree to which the factors a test measures are diverse.
Homogeneity – The degree to which a test measures a single factor.
inflation of range/variance – Also called variance inflation; a term related to reliability estimates in which the variance of one or both variables in a correlation analysis is artificially increased because of the sampling method, leading to higher correlation coefficients. This is in contrast to restriction of range.
Information function – In IRT, a function that tells us which test items are most informative for testtakers at a particular ability level compared to others; it helps identify the items that work best at different levels of the trait or skill.
inter-item consistency – The degree of relatedness, or homogeneity, of the items on a test, typically assessed using methods such as the split-half technique.
Internal consistency estimate of reliability – A reliability estimate for a test derived from measures of inter-item consistency.
Inter-scorer reliability – Also referred to as inter-rater reliability, observer reliability, judge reliability, and scorer reliability; an estimate of the degree of agreement or consistency between two or more scorers.
item characteristic curve (ICC) – A visual depiction of the probabilistic link between an individual's trait or ability level and the likelihood of responding to an item in the expected direction. This is also called a category response curve, an item response curve, or an item trace line.
Item response theory (IRT) – Also known as latent-trait theory or the latent-trait model; a set of assumptions about measurement, including the assumption that a test measures a unidimensional trait, together with methods for quantifying the extent to which each test item measures that trait.
Item sampling – Also known as content sampling; the diversity of subject matter within test items, often discussed in the context of the differences between individual items in a test or between items in multiple tests.
Kuder-Richardson formula 20 – One of a set of equations developed by G. F. Kuder and M. W. Richardson used to estimate the inter-item consistency of tests with dichotomously scored items.
Latent-trait theory – Also known as the latent-trait model; a framework of measurement assumptions, including the idea that a test measures a unidimensional trait and that the extent to which each test item measures that trait can be quantified.
Odd-even reliability – An estimate of split-half reliability obtained by dividing a test into two halves, with the odd-numbered items in one half and the even-numbered items in the other.
Parallel forms – Two or more equivalent versions of the same test in which the means and variances of observed test scores are identical. This is in contrast to alternate forms.
Parallel-forms reliability – An estimate of the impact of error on test scores across two versions of the same test, under the assumption of equal means and variances for each form, in contrast to alternate-forms reliability.
Polytomous test item – A test item or question offering three or more response choices, of which only one is considered correct or aligned with the intended trait or construct being measured.
Power test – A test, typically of achievement or ability, with no time limit (or an extremely generous one), enabling everyone to attempt all items; some items are intentionally very difficult, preventing perfect scores. Contrast with speed test.
Rasch model – An item response theory model employed to estimate an individual's skill or trait level from his or her answers to test items.
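A small sketch of the Kuder-Richardson formula 20 entry above, assuming the usual form KR-20 = (k / (k − 1)) × (1 − sum of pq / variance of total scores) for dichotomously scored items, where p is the proportion passing each item; the response data are invented.

from statistics import pvariance

# rows = testtakers, columns = dichotomously scored items (1 = correct, 0 = incorrect)
data = [
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0],
    [1, 1, 0, 0, 1],
    [1, 1, 1, 1, 0],
]

k = len(data[0])
n = len(data)
p = [sum(row[j] for row in data) / n for j in range(k)]   # proportion passing each item
pq_sum = sum(pi * (1 - pi) for pi in p)                   # sum of item variances for 0/1 items
total_var = pvariance([sum(row) for row in data])         # variance of total scores

kr20 = (k / (k - 1)) * (1 - pq_sum / total_var)
print(round(kr20, 3))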

Reliability – The consistency or repeatability of measurements; specifically, the extent to which measurements are consistent rather than differing because of measurement error.
Reliability coefficient – A general term for an index of reliability; specifically, the ratio of true score variance on a test to the total variance.
Restriction of range/variance – A term associated with reliability estimates in which the sampling procedure narrows the variance of one or both variables, causing lower correlation coefficients. This is the opposite of inflation of range.
Spearman-Brown formula – An equation used to estimate the internal-consistency reliability of a test that has been lengthened or shortened; it is commonly used to adjust the correlation between two halves of a test (split-half reliability) into an estimate of full-test reliability. The method is not suitable for heterogeneous tests or speed tests.
Speed test – A time-limited test, typically assessing achievement or ability, in which the items are of uniform difficulty.
Split-half reliability – An estimate of the internal consistency of a test obtained by correlating scores on two equivalent halves of a single test administered once.
Standard error of a score – A gauge of the extent to which a test score could fluctuate because of factors such as measurement error; another name for the standard error of measurement.
standard error of measurement – An estimate of the amount by which an observed score may differ from the corresponding true score; also known as the standard error of a score. The concept is closely tied to reliability.
Standard error of the difference – A statistic used to determine how large the difference between two scores must be before it is regarded as statistically significant.
Static characteristic – A trait, state, or ability that remains relatively constant over time; contrast with a dynamic characteristic.
Test battery – A collection of tests and assessments serving a common objective; for instance, intelligence, personality, and neuropsychological tests may be combined to create a comprehensive psychological profile of an individual.
Test-retest reliability – An estimate of reliability obtained by correlating the scores of the same individuals on two different administrations of the same test.
Theta – In item response theory, the symbol (θ) for the level of the underlying ability or trait that a testtaker is presumed to bring to the test.
Transient error – A source of error in testing attributable to fluctuations in the test-taker's emotions, moods, or mental state over time.
True score theory – Also called classical test theory or the true score model; a framework of measurement assumptions stating that test scores consist of a stable component representing what is intended to be measured and a random component considered error.
True variance – In the true score model, the variance component of an observed score or distribution of scores that is associated with genuine differences in the ability or trait being measured.
Universe – In generalizability theory, the total context of a particular test situation, including all the factors that lead to an individual testtaker's score.
universe score – In generalizability theory, a test score that pertains to the specific universe being evaluated or assessed.
Variance – A measure of variability calculated as the average of the squared differences between the scores in a distribution and their mean.


Validity

base rate – An index, often represented as a proportion, that quantifies the prevalence of a specific trait, behavior, characteristic, or attribute within a population.
Bias – A factor inherent in a test that consistently prevents accurate, impartial measurement.
central tendency error – A rating error characterized by a rater's tendency to avoid extreme ratings, causing most ratings to cluster in the middle of the rating scale.
concurrent validity – A type of criterion-related validity that gauges the extent to which a test score correlates with a criterion measure obtained at the same time.
confirmatory factor analysis – A class of factor analysis techniques used to test how well a hypothesized factor structure fits the observed relationships among variables.
construct – An educated, scientific concept created to elucidate or define behavior; examples of constructs include "intelligence," "personality," "anxiety," and "job satisfaction."
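Several of the reliability entries above reduce to short formulas. The sketch below applies the Spearman-Brown formula to a split-half correlation and then computes a standard error of measurement and a standard error of the difference; all numbers are invented.

from math import sqrt

r_half = 0.70    # hypothetical correlation between two halves of a test
# Spearman-Brown correction for a full-length test (general form: n*r / (1 + (n - 1)*r); here n = 2)
r_full = (2 * r_half) / (1 + r_half)

sd = 15          # hypothetical standard deviation of the test scores
r_xx = 0.90      # hypothetical reliability coefficient
sem = sd * sqrt(1 - r_xx)    # standard error of measurement
sed = sqrt(2) * sem          # standard error of the difference between two scores on the same test

print(round(r_full, 3), round(sem, 2), round(sed, 2))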

construct validity – An assessment of the soundness of conclusions made about an individual's standing on a construct on the basis of test scores.
content validity – An evaluation of how effectively a test (or any other measurement tool) samples behavior that is representative of the entire range of behavior it was intended to measure, taking into account considerations such as the design and context of the test, cultural factors, and measurement precision.
content validity ratio (CVR) – A formula, devised by C. H. Lawshe, used to gauge the consensus among evaluators (judges) concerning how essential an individual test item is for inclusion in a test, particularly in the context of content-referenced testing and assessment.
convergent evidence – In the context of construct validity, data obtained from other measuring instruments designed to assess the same or a related construct as the one being validated; such data consistently support the same conclusion regarding the test or other measuring tool, in contrast to discriminant evidence.
convergent validity – Evidence that a test measures the same construct as another test claiming to measure that construct.
criterion – The benchmark against which a test or a test score is evaluated; the benchmark can take various forms, such as a particular behavior or a defined set of behaviors.
criterion contamination – A situation in which a criterion measure is, either wholly or partially, based on a predictor measure.
criterion-related validity – An assessment of the degree to which a score or index obtained from a test or another measuring instrument can be used to predict an individual's most likely standing on a specific measure of interest, the criterion.
decision theory – A set of techniques used to evaluate, in quantitative terms, the effectiveness of selection processes, diagnostic classifications, therapeutic interventions, or other assessment- or intervention-related procedures, often with a focus on their optimality in terms of costs and benefits.
discriminant evidence – In the context of construct validity, data from a test or another measuring tool indicating a weak or negligible relationship between test scores and other variables that, in theory, should not correlate substantially with the test being assessed for construct validity. This is in contrast to convergent evidence.
expectancy chart – A graphic representation of an expectancy table, displaying expected outcomes.
expectancy data – Data, typically presented in the form of an expectancy table, demonstrating the probability that an individual taking a test will achieve a score within a specific score range on a criterion measure; of particular use in utility analysis.
expectancy table – Data displayed in a table showing the probability that a test-taker who scores within a particular range on a test will achieve a score within a particular range on a criterion measure.
exploratory factor analysis – A category of mathematical methods used to estimate factors, extract factors, or decide how many factors to retain.
face validity – A judgment of the extent to which a test or another measuring instrument appears to measure what it claims to measure, based solely on visual or superficial cues, such as the content of the test items.
factor analysis – A class of mathematical techniques, often used for data reduction, designed to identify the variables (factors) on which people may differ and to examine the correlations among measured variables and their factor loadings.
factor loading – In factor analysis, a metaphor implying that a test (or a specific test item) carries or is associated with a certain amount of one or more abilities or factors, which in turn influence the test score (or the response to that particular test item).
fairness – In the context of tests, the extent to which a test is used in a fair, unbiased, and equitable manner, with consideration of factors such as group membership, the functioning of individual test items, potential misunderstandings about fairness, and implications for validity.
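The content validity ratio (CVR) entry above has a simple computational form, CVR = (ne − N/2) / (N/2), where ne is the number of panelists rating an item essential and N is the panel size; the figures below are invented.

def content_validity_ratio(n_essential, n_panelists):
    # Lawshe's CVR: (ne - N/2) / (N/2); values range from -1 to +1
    half = n_panelists / 2
    return (n_essential - half) / half

# hypothetical panel of 10 experts, 8 of whom rate the item "essential"
print(content_validity_ratio(8, 10))   # 0.6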

false negative – A type of miss in which an assessment tool falsely suggests that the test-taker lacks a specific trait, skill, behavior, or quality when, in reality, the test-taker does possess or exhibit it.
false positive – A type of miss in which an assessment tool incorrectly indicates that the test-taker possesses a specific trait, skill, behavior, or attribute when, in reality, the test-taker does not.
generosity error – Commonly known as leniency error; an inaccurate rating or evaluation by a rater stemming from the rater's inclination to be overly generous or insufficiently critical. This is in contrast to severity error.
halo effect – A rating error in which the rater holds an overwhelmingly positive view of the subject being rated and therefore tends to assign ratings that are excessively favorable; it arises when the rater is inclined to be overly positive and insufficiently critical.
hit rate – The accuracy with which a test identifies individuals who possess (or lack) a specific trait on the basis of their test scores.
homogeneity
incremental validity – Used alongside predictive validity; a measure of the extent to which a new predictor provides additional explanatory or predictive value beyond the predictors already in the model.
inference – A logical conclusion or deduction reached through a process of reasoning.
intercept bias – A form of test bias in which the regression line generated by a test or measurement process consistently underestimates or overestimates the performance of individuals within a particular group; the term derives from the intercept of the regression line. This is different from slope bias.
leniency error – Also known as a generosity error; a type of rating error that arises from a rater's inclination to be overly lenient and inadequately critical.
local validation study – The process of collecting evidence about how well a test measures its intended construct, undertaken to evaluate the validity of a test or another measurement tool, typically conducted with a population different from the one for which the test was originally validated.
method of contrasted groups – A way of comparing the test performance of two different groups to see whether the test differentiates well between them.
miss rate – The proportion of individuals for whom a test or another measurement method does not correctly identify the presence or absence of a specific trait, behavior, characteristic, or attribute. In this context, a "miss" signifies an incorrect classification or prediction, and misses can be further categorized as false positives and false negatives.
multitrait-multimethod matrix – An approach to assessing construct validity that involves examining both convergent and discriminant evidence by means of a correlation table depicting the relationships between traits and methods.
Naylor-Shine tables – Statistical tables that were once commonly used to help evaluate the usefulness of a particular test in selection decisions.
predictive validity – A type of criterion-related validity that measures how well a test score predicts a criterion measure obtained at a later time.
ranking – The arrangement of individuals, scores, or variables into ranked positions or levels, typically on an ordinal scale.
rating – A numerical or verbal judgment that places an individual or characteristic on a scale defined by a set of numerical or descriptive terms known as a rating scale.
rating error – A judgment that results from the intentional or unintentional misuse of a rating scale; this type of error includes leniency error (or generosity error) and severity error.
rating scale – A structured set of numerical or verbal labels used by assessors to gauge the presence or extent of a particular trait, attitude, or variable.
severity error – An inaccurate rating or assessment resulting from the rater's inclination to be excessively critical, as opposed to a generosity error.
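The hit rate, miss rate, false positive, false negative, and base rate entries above can be tied together with a small tally of hypothetical screening outcomes; the counts are invented, and this is only one common way of computing the rates.

# Hypothetical screening outcomes for 100 people: test decision versus actual status
true_positives = 30    # test says "has the trait" and the person has it
false_positives = 10   # test says "has the trait" but the person does not (one kind of miss)
true_negatives = 50    # test says "no trait" and the person does not have it
false_negatives = 10   # test says "no trait" but the person has it (the other kind of miss)

total = true_positives + false_positives + true_negatives + false_negatives
hit_rate = (true_positives + true_negatives) / total    # proportion of correct identifications
miss_rate = (false_positives + false_negatives) / total
base_rate = (true_positives + false_negatives) / total  # proportion who actually have the trait

print(hit_rate, miss_rate, base_rate)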

slope bias – A form of test bias, referring to a difference in the slope of the regression line between groups, in which a test or measurement procedure systematically yields different validity coefficients for members of different groups. This is in contrast to intercept bias.
Taylor-Russell tables – Statistical tables once widely used to provide test users with an estimate of how much adding a specific test to the selection process would improve selection decisions.
test blueprint – A detailed plan outlining the content, structure, and number of items to be included in a test.
test utility theory – In psychological testing, an account of how useful a test is for a particular purpose, considering factors such as clinical use, costs, decision-making, diagnostic capability, and reliability and validity.
utility – How effective a test or assessment method is for a particular purpose, considering factors such as benefits, costs, reliability, and validity.
validation – The process of gathering and evaluating evidence related to validity.
validation study – Research that involves collecting evidence about how accurately a test measures its intended construct, with the aim of evaluating the validity of the test or of another measurement tool.
validity – A broad concept encompassing a judgment of how well a test or another measuring tool measures what it claims to measure; this judgment carries significant implications for the appropriateness of inferences and decisions based on the measurements.
validity coefficient – A correlation coefficient that provides an indication of the relationship between test scores and scores on a criterion measure.


Utility

absolute cut score – A reference point in a distribution of test scores used to divide the scores into two or more categories, established on the basis of a judgment about the minimum level of proficiency needed to belong to a particular category. This is distinct from a relative cut score.
Angoff method – A method in which expert judges set a passing (cut) score for an exam; one advantage is that the resulting score is based on the content of the test itself rather than on how a group of test-takers performs.
base rate – An indicator, typically expressed as a proportion, of the extent to which a particular trait or behavior exists in a population.
benefits (as related to test utility) – The positive outcomes or advantages that come from using a test or assessment, such as making better decisions or achieving specific goals.
bookmark method – An IRT-based method for setting cut scores in which items are arranged by difficulty and expert judges mark the items that represent the desired level of difficulty.
Brogden-Cronbach-Gleser formula – A formula, arising from the combined work of Hubert E. Brogden and the decision theorists Cronbach and Gleser, used to calculate the benefits, whether monetary or otherwise, of using a specific selection instrument under specific conditions; in essence, it estimates the utility gain from using a particular test or selection method.
compensatory model of selection – A selection model for job applicants built on the assumption that strong performance in one area can compensate for weaker performance in another area.
costs (as related to test utility)
cut score – Often known as a cutoff score; a specific (typically numerical) reference point, arrived at by judgment, used to divide a set of data into two or more classifications, with subsequent actions or inferences based on those classifications.
discriminant analysis – A family of statistical techniques used to explore the relationship between a set of variables and two or more naturally occurring groups.
fixed cut score – A reference point in a distribution of test scores used to divide the data into classifications, typically set on the basis of a judgment about minimum proficiency; fixed cut scores are also called absolute cut scores.
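The glossary describes what the Brogden-Cronbach-Gleser formula does but does not state it; one commonly cited form is utility gain = (number selected) × (tenure) × (validity coefficient) × (SD of performance in dollars) × (mean standard score of those selected) − (number tested) × (cost per applicant). The sketch below uses that assumed form with invented figures.

# Assumed form of the Brogden-Cronbach-Gleser utility-gain estimate; all values are hypothetical
n_selected = 10
tenure_years = 2.0
validity = 0.40           # validity coefficient of the selection test
sd_dollars = 12000        # standard deviation of job performance, expressed in dollars
mean_z_selected = 1.0     # mean standard (z) score on the test among those selected
n_tested = 60
cost_per_applicant = 100

utility_gain = (n_selected * tenure_years * validity * sd_dollars * mean_z_selected
                - n_tested * cost_per_applicant)
print(utility_gain)       # estimated gain, in dollars, from using the test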

item-mapping method – A method of setting cut scores that involves arranging ("mapping") test items so that their difficulty can be compared and judged relative to one another.
known groups method – A way of supporting construct validity: a valid test should be able to differentiate between people known to have a specific trait and those known not to have it.
method of contrasting groups – A way of studying two groups, one with a specific trait and one without, in order to compare their test performance.
method of predictive yield – A method for setting cut scores that takes into account the number of positions available; it factors in predictions about the probability of candidates accepting offers, the number of job openings, and the distribution of applicant scores.
multiple cut scores – The use of more than one cutoff score on a single predictor to classify test-takers into more than two groups, or the use of a separate cutoff score for each predictor when multiple predictors are used for selection.
multiple hurdle (selection process) – A multi-stage decision process in which reaching a specified cutoff score on one test is a prerequisite for progressing to the next evaluation stage within the selection process.
norm-referenced cut score – A reference point within a distribution of test scores used to divide the data into two groups, set primarily on the basis of norms or typical values rather than on the relationship of test scores to a specific criterion.
productivity gain – An actual increase in productivity or work output; in utility analyses, the gain that can be expected from using a specific test or evaluation method.
relative cut score – Commonly known as a norm-referenced cut score; a designated point within a test score distribution that divides the data into two groups primarily on the basis of norm-related considerations rather than the relationship between test scores and a specific criterion.
return on investment – The ratio of the economic and/or noneconomic gains achieved from spending on the initiation or improvement of a specific testing program, training program, or intervention to the total costs incurred for those improvements.
selection ratio – A numerical value that reflects the relationship between the number of people to be hired and the number of people available to be hired (for example, the number of openings divided by the number of applicants).
top-down selection – The process of filling open positions with candidates on the basis of their scores, with the highest scorer receiving the first position, the second-highest scorer the next position, and so on until all positions are filled.
Utility (test utility) – In psychological testing and assessment, the usefulness of a test or assessment method for a specific purpose; it takes into account factors such as clinical use, cost-effectiveness, decision theory, and diagnostic capability, as well as questions of reliability and validity.
utility analysis – A family of methods for weighing the costs and benefits of testing versus not testing (or of using one test rather than another) in terms of likely outcomes; related concepts include the Brogden-Cronbach-Gleser formula, cut scores, expectancy data, and productivity gain.
utility gain – An estimate of the benefit, whether in monetary terms or in other respects, gained from using a specific test or selection technique.


Test Development

anchor protocol – A test response sheet, prepared by a test publisher, that is used to check the accuracy of examiners' scoring.
biased test item – An item that disadvantages one group of test-takers relative to another when differences in group proficiency are taken into account (Camilli & Shepard, 1985).
binary-choice item – A multiple-choice item offering only two response options, such as the familiar true-false question. While these items usually present a statement for the test-taker to classify as true or false, they can also use other response pairs such as agree-disagree, yes-no, or right-wrong.
categorical scaling – A scaling system in which stimuli are sorted into one of several alternative categories that vary quantitatively along a continuum.
category scoring – A method of evaluation in which test responses earn credit toward placement in a particular category; sometimes a set number of qualifying responses is needed for placement in a specific group. This is distinct from cumulative and ipsative scoring.
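The top-down selection and selection ratio entries above can be illustrated with a short sketch; the applicant names and scores are invented.

# Top-down selection and the selection ratio, using hypothetical applicant scores
applicants = {"Ana": 88, "Ben": 92, "Cara": 75, "Dan": 81, "Eve": 95, "Finn": 69}
openings = 2

# top-down selection: rank applicants by score and fill the openings from the top
ranked = sorted(applicants, key=applicants.get, reverse=True)
selected = ranked[:openings]

selection_ratio = openings / len(applicants)   # people to be hired relative to people available

print(selected, round(selection_ratio, 2))     # ['Eve', 'Ben'] 0.33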

ceiling effect – The diminished ability of an assessment tool to differentiate among test-takers at the high end of the ability or trait being measured.
class scoring – A method of evaluation in which test responses earn credit toward placement in a particular class or category; sometimes a fixed number of qualifying responses is required for placement in a particular group. This differs from cumulative and ipsative scoring.
comparative scaling – In test development, a technique for creating ordinal scales by means of a sorting task in which each stimulus is judged in relation to every other stimulus used in the test.
completion item – A test item that requires the test-taker to supply a word or phrase to fill the blank and complete a sentence, as in a sentence completion test.
computerized adaptive testing (CAT) – An interactive, computer-based test-taking process in which the items presented to the test-taker are determined in part by the test-taker's performance on previous items.
co-norming – The process of norming two or more tests by administering them to the same sample of test-takers; when this method is used to establish the validity of all the tests being normed, it may also be termed co-validation.
constructed-response format – A type of test item that requires the test-taker to construct a response, as in essay questions or fill-in-the-blank items, as opposed to selecting from predefined options, as in selected-response formats.
co-validation – The process of validating two or more tests using the same sample of test-takers; when conducted in conjunction with the creation or revision of norms, it may also be called co-norming.
cross-validation – The revalidation of a test on a sample of test-takers other than the sample on which the test's predictive validity was originally established.
DIF analysis – In item response theory (IRT), a method of analyzing item response curves for distinct groups of test-takers in order to assess whether the items measure equivalently across those groups.
differential item functioning (DIF) – In IRT, the phenomenon in which the same test item functions differently for different groups, often because of group differences that are unrelated to the construct being measured.
DIF items – In IRT, test items on which individuals from different groups who are assumed to have the same underlying ability show different probabilities of endorsement depending on their group membership.
essay item – A test item that requires the test-taker to respond in writing; it is well suited to assessing deep understanding of a single topic and allows creative expression in the test-taker's own words.
expert panel – A group of subject-matter experts and experts on relevant populations who contribute to improving a test's content, fairness, and other qualities by providing their insights during test development.
floor effect – The diminished ability of an assessment tool to differentiate among test-takers at the low end of the skill or ability being measured.
giveaway item – A test item, typically positioned at the start of an ability or achievement test, that is intentionally made very easy, often with the goal of boosting the test-taker's confidence or reducing test-related anxiety.
guessing – On personality tests, guessing is generally not a major concern, as test-takers are assumed to aim to make the choice that best describes them, even when the choice is difficult.
Guttman scale – A scale whose items range sequentially from weaker to stronger expressions of the attitude, belief, or feeling being measured, much like a thermometer running from cold to hot.
ipsative scoring – An approach to scoring in which a test-taker's responses, and the presumed strength of a measured trait, are interpreted relative to the strength of the test-taker's other measured traits, rather than by comparison with other people's scores.
item analysis – A set of procedures for examining how the individual items of a test perform, for example how difficult they are and how reliably and how well they discriminate; the results help explain how the test as a whole works.
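Finally, the item analysis entry above can be illustrated with a small sketch that computes an item difficulty index (proportion passing) and a simple upper-versus-lower discrimination index; the response matrix is invented.

# A small item-analysis sketch using invented 0/1 item responses
responses = [            # rows = testtakers, columns = items (1 = correct)
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 1, 1],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]

totals = [sum(row) for row in responses]
order = sorted(range(len(responses)), key=lambda i: totals[i], reverse=True)
upper = order[:3]        # highest-scoring half of the group
lower = order[3:]        # lowest-scoring half of the group

for item in range(len(responses[0])):
    passed = [row[item] for row in responses]
    difficulty = sum(passed) / len(passed)                          # item difficulty index (p)
    disc = (sum(responses[i][item] for i in upper) / len(upper)
            - sum(responses[i][item] for i in lower) / len(lower))  # discrimination index (d)
    print(item + 1, round(difficulty, 2), round(disc, 2))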