
BASIC STATS FOR TESTING [ refresher ]

Why do we need stats?
Statistical methods serve two important purposes in the quest for scientific understanding:
1. Descriptive stats – provide a concise summary of collected information
2. Inferential stats – logical deductions about events that cannot be observed directly

Scales of Measurement
Measurement: the application of rules for assigning numbers to objects
The rules are the specific procedures used to transform qualities of attributes into numbers.

Properties of Scales
Three important properties make scales of measurement different from one another:
Magnitude – the property of "moreness"; what is measured can have more or less of the attribute
ex: level of anxiety
Equal Intervals – the difference between two points at any place on the scale has the same meaning as the difference between two other points that differ by the same number of scale units [ there is always an equal interval between adjacent points ]
ex: ruler
Absolute 0 – none of the property being measured exists
Extremely difficult, though not impossible, to define an absolute 0
Ex: in psych tests, since they are not perfectly reliable

Types of Scales
Nominal scales – "categorization"; classification based on one or more distinguishing characteristics
Ex: DSM, yes-or-no scale, sex, etc.
Ordinal scales – "ranking"; ordering characteristics
ex: board exam topnotchers, Likert scale, level of agreement
Interval scale – has the properties of moreness and equal intervals but no absolute 0
Ex: Fahrenheit or Celsius
Ratio scale – has a true 0 point; easiest to manipulate
• has all three properties
- Moreness
- Equal intervals
- Absolute 0

Scales of Measurement and their Properties
Type of scale | Magnitude | Equal intervals | Absolute 0
Nominal      | No        | No              | No
Ordinal      | Yes       | No              | No
Interval     | Yes       | Yes             | No
Ratio        | Yes       | Yes             | Yes

Frequency Distributions
• Summarize the scores for a group of individuals
• Display scores on a variable or a measure to reflect how frequently each value was obtained
• One defines all the possible scores and determines how many people obtained each of those scores
[ figure: normal curve with segment areas of .13%, 2.14%, 13.59%, and 34.13% on each side of the mean ]
Mode – the most common number in a data set
• Bimodal – a distribution in which two values share the highest frequency
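As an illustrative sketch (the data set is hypothetical, not from the notes), the three measures of central tendency can be computed with Python's standard library:

```python
import statistics

scores = [85, 90, 75, 90, 80]          # hypothetical test scores

mean = statistics.mean(scores)          # arithmetic average
median = statistics.median(scores)      # middle value of the sorted set
mode = statistics.mode(scores)          # most frequent value

print(mean, median, mode)               # → 84 85 90
```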

Standard Deviation – indicates how far above or below the mean a score lies
- an approximation of the average deviation around the mean
- tells you how spread out the numbers are
Variance – the square of the standard deviation
Z-score – transforms data into standardized units that are easier to interpret [ can have a negative value ]

Skewness – how the measurements are distributed
Positively skewed – test is difficult; scores are clustered at the lower, left end [ tail extends toward the positive side ]
Negatively skewed – test is easy; scores are clustered at the higher, right end [ tail extends toward the negative side ]
Kurtosis – steepness of the distribution [ describes the tails ]
• Leptokurtic – pointy; positive kurtosis
• Mesokurtic – normal distribution
• Platykurtic – flat; negative kurtosis
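A minimal sketch of the spread measures above, on hypothetical data: variance is the square of the standard deviation, and a z-score re-expresses a raw score in standard-deviation units from the mean.

```python
import statistics

scores = [70, 80, 90, 100, 110]              # hypothetical raw scores

mean = statistics.mean(scores)               # 90
sd = statistics.pstdev(scores)               # population standard deviation
variance = statistics.pvariance(scores)      # equals sd squared

# z-score: how many SDs a raw score sits above (+) or below (-) the mean
z = (110 - mean) / sd

print(variance, round(sd, 2), round(z, 2))   # → 200 14.14 1.41
```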

Percentiles – the percentage of scores that fall below a particular score [ always counted from the bottom ]
Formula:
Percentile Rank = ( no. of beaten examinees / total no. of scores ) x 100
Example: CEE – college entrance exam
Percentile Ranks – replace simple ranks to adjust for the number of scores in a group

T-score ( McCall's T ) – a standardized score system that can be obtained from a simple linear transformation of z-scores [ has no negative values, always positive ]
Stanine score – scale is divided into 9 units
Sten score – scale is divided into 10 units

Standard Scores
• Z-scores [ mean = 0, SD = 1 ]
• T-scores [ mean = 50, SD = 10 ]
• Stanine scores [ mean = 5, SD = 2 ]
• Sten scores [ mean = 5.5, SD = 2 ]
• IQ scores [ mean = 100, SD = 15 ]
• CEE [ mean = 500, SD = 100 ]

Measures of Central Tendency
Mean – the average of a data set / the arithmetic average score in a distribution
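Because every standard-score system in the list above is a linear transformation of the z-score, one hypothetical helper covers all of them (the target means and SDs are taken from the list):

```python
def to_scale(z: float, mean: float, sd: float) -> float:
    """Linear transformation of a z-score onto another standard-score scale."""
    return mean + sd * z

z = 1.0                                   # one SD above the mean
print(to_scale(z, 50, 10))                # T-score        → 60.0
print(to_scale(z, 5, 2))                  # stanine        → 7.0
print(to_scale(z, 5.5, 2))                # sten           → 7.5
print(to_scale(z, 100, 15))               # deviation IQ   → 115.0
```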

Median – the middle value of a set of numbers

Variability
Range – high score minus low score
Quartile – division of a frequency distribution [ normal distribution divided into 4 parts ]
Interquartile range – the distance from Q3 to Q1
Semi-interquartile range – the interquartile range divided by 2

Norms and a Good Test
Norms – standards against which results will be compared; the target sample
Ex: students from NU
• Age-related norms
- normative groups for particular age groups
• Norm-referenced tests
- determine how a test taker compares with others, or compare each person with a norm
[ comparing against another norm group, ex: students from UST ]
• Criterion-referenced tests
- measure specific types of skills against a standard
[ ex: a student from NU who is also a working student; depends on the variable of the norms ]

Norming
Standardization – the process of administering a test to a representative sample of test takers for the purpose of establishing norms
Ex: pilot testing
Sample – a representative portion of the whole population [ a selected portion ]
Sampling – the process of selecting a sample
Probability Sampling
• Simple Random Sampling – every member has an equal chance of being selected
• Systematic Sampling – every nth item or person is picked
• Stratified Sampling – random selection within predefined groups [ dividing the population into subgroups ]
• Cluster Sampling – groups of the target population are selected randomly

Statistical Methods
Comparative [ difference, effect ]
Test                       | No. of times DV measured | No. of groups
T-test, independent means  | 1                        | 2
T-test, dependent means    | 2                        | 1
ANOVA, one-way             | 1                        | >2
ANOVA, repeated measures   | >2                       | 1
Mann-Whitney U test        | 1                        | 2
Wilcoxon signed-rank test  | 2                        | 1
Kruskal-Wallis test        | 1                        | >2
Friedman test              | >2                       | 1
ANOVA, two-way             | 2 IVs, 2 levels
MANOVA                     | many measures of the DV
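The comparative-methods table can be encoded as a small chooser: given how many times the DV was measured, how many groups there are, and whether parametric assumptions hold, it returns the matching test. This is an illustrative helper (the function name and structure are assumptions, not from the notes):

```python
def choose_test(times_dv_measured: int, groups: int, parametric: bool) -> str:
    """Map (DV measurements, groups, parametric?) to a test from the table."""
    if times_dv_measured == 1 and groups == 2:
        return "independent-means t-test" if parametric else "Mann-Whitney U test"
    if times_dv_measured == 2 and groups == 1:
        return "dependent-means t-test" if parametric else "Wilcoxon signed-rank test"
    if times_dv_measured == 1 and groups > 2:
        return "one-way ANOVA" if parametric else "Kruskal-Wallis test"
    if times_dv_measured > 2 and groups == 1:
        return "repeated-measures ANOVA" if parametric else "Friedman test"
    return "no match in the table"

print(choose_test(1, 3, True))    # → one-way ANOVA
print(choose_test(2, 1, False))   # → Wilcoxon signed-rank test
```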
Parametric -> Non-Parametric Equivalents
Dependent t-test -> Wilcoxon signed-rank test
Independent t-test -> Mann-Whitney U test
Repeated-measures ANOVA -> Friedman test
One-way / two-way ANOVA -> Kruskal-Wallis test

Non-probability Sampling
• Convenience Sampling – selected based on their convenience/availability
• Quota Sampling – specifying who should be selected according to certain groups/criteria
• Purposive Sampling – chosen consciously based on their knowledge and understanding of the research question
• Snowball or Referral Sampling – people recruited to be part of a sample are asked to invite those they know to take part [ ex: prostitutes, battered wives, etc. ]

Standard error of measurement ( SEM ) – measures how close the observed score is to the "true" score
- the smaller the number, the better; the closer the score, the better
[ the true score is always unknown, since no measure can be constructed that provides a perfect reflection of the true score ]
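The notes define SEM conceptually; the standard classical-test-theory formula (not written out in the notes) is SEM = SD x sqrt(1 - reliability). A sketch on hypothetical numbers:

```python
import math

def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """Standard CTT formula: SEM = SD * sqrt(1 - reliability coefficient)."""
    return sd * math.sqrt(1 - reliability)

# Hypothetical scale: SD = 10, reliability = .91 → SEM of about 3 points
print(round(standard_error_of_measurement(10, 0.91), 2))   # → 3.0
```

Consistent with the note that "the smaller the number, the better": as reliability rises toward 1, the SEM shrinks toward 0.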
Reliability
- freedom from error
- error = inaccuracy in measurement
- consistency of test measurement
[ a test may be reliable in one context and unreliable in another ]

History & Theory of Reliability
Charles Spearman – proponent of reliability
Abraham De Moivre – introduced the basic notion of sampling error
Karl Pearson – developed the product-moment correlation
Testers use "rubber yardsticks" to estimate measurements.

Basics of Test Score Theory
Classical Test Score Theory ( CTT ) – each person has a true score that would be obtained if there were no errors in measurement.
[ the difference between the true score and the observed score results from measurement error ]

Domain Sampling Theory ( a central concept of CTT ) – the more items in the test, the higher the chance of reliability
- put direct-to-the-point items in the test, making sure the domain and factor are measured
- a limited number of items can measure the domain

Item Response Theory ( IRT ) – focuses on the range of item difficulty [ the individual's ability level ]
Item difficulty = item easiness

Sources of Error Variance
1. Test Construction: item sampling or content sampling [ ex: how the test is made, its content ]
2. Test Administration: may influence the test taker's attention or motivation [ ex: how the test user gives instructions ]
3. Test Scoring and Interpretation: individually administered tests still require scoring by trained personnel [ it is important that the person is skilled in the field so they can interpret the test well ]

[ the difference between the score we obtained and the score we are really interested in is the error of measurement ]
X – T = E

Different Estimates of Reliability

Test-Retest Reliability [ 1 group, 1 test, 2 administrations ] ( concerned with time and administration )
- used to evaluate the error associated with administering a test at two different times
[ ex: measuring stable traits or personality ]
duration: 2 weeks to 6 months
Carryover Effect – occurs when the first testing session influences the scores of the following session [ short interval between sessions ]
Practice Effect – test-retest correlations usually overestimate the true reliability [ a test-taker concern ]
PROCEDURE:
• Sample population = (Test A) [1] + (Test A) [2]
• Administer the [1st] psychological test
• Get the test result
• Wait for the interval/duration [ 2 wks – 6 mos ]; note: when the interval between testings is greater than six months, the estimate of test-retest reliability is often referred to as the coefficient of stability
• Re-administer the psychological test
• Get the test result
• Correlate the results

Split-Half Reliability [ 1 group, 2 tests, 2 administrations ]
- the test is given and divided into halves that are scored separately
duration: the next day or after a week
To split the items, the odd-even system or the top-bottom method can be used.
• Top-bottom method: 1st half [ 25 items ], 2nd half [ 25 items ]
• Odd-even system: 1st half [ odd items ], 2nd half [ even items ]
To adjust the half-test reliability, use the Spearman-Brown formula: r = 2r / ( 1 + r )
For example: As a psychometrician in a clinic, the psychologist instructed you to develop a test that measures the emotional stability of suicidal patients. You decided to use split-half to establish its reliability. The correlation between the two halves of the test is .78; according to the Spearman-Brown formula, the estimated reliability would be .876.
Computation: r = 2 (.78) / ( 1 + .78 )
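The split-half worked example above, as code: the Spearman-Brown formula r' = 2r / (1 + r) steps a half-test correlation up to the estimated full-test reliability.

```python
def spearman_brown(half_test_r: float) -> float:
    """Adjust a half-test correlation to estimate full-test reliability."""
    return (2 * half_test_r) / (1 + half_test_r)

# The example from the notes: a half-test correlation of .78
print(round(spearman_brown(0.78), 3))   # → 0.876
```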
Parallel-Forms & Alternate-Forms Reliability [ 1 group, 2 tests, 1 administration; administered on the same day ] ( concerned with items )
- compares two equivalent forms of a test that measure the same attribute
Alternate forms are designed to be equivalent with respect to variables such as content and level of difficulty.
Coefficient of equivalence – measures the same attributes
PROCEDURE:
• Administer the first test
• Administer the alternate-form test
• Score both tests
• Correlate
Disadvantages:
• Hard to develop or construct
• Time consuming

Kuder-Richardson Formula ( KR20 )
Kuder & Richardson (1937) developed this formula to measure a reliability estimate similar to split-half.
KR20: used when the items vary in level of difficulty
KR21: used when the items have the same level of difficulty
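KR-20 itself is not written out in the notes; the published formula (Kuder & Richardson, 1937) is KR20 = (k / (k - 1)) * (1 - sum(p*q) / var), where p is the proportion passing each item, q = 1 - p, and var is the variance of total scores. A hypothetical sketch on a tiny 0/1 item matrix:

```python
import statistics

def kr20(item_responses):
    """KR-20 reliability; rows = examinees, columns = dichotomous (0/1) items."""
    k = len(item_responses[0])                 # number of items
    n = len(item_responses)                    # number of examinees
    totals = [sum(row) for row in item_responses]
    var = statistics.pvariance(totals)         # variance of total scores
    pq = 0.0
    for j in range(k):
        p = sum(row[j] for row in item_responses) / n   # proportion passing item j
        pq += p * (1 - p)
    return (k / (k - 1)) * (1 - pq / var)

data = [[1, 1, 0], [1, 0, 0], [1, 1, 1], [0, 0, 0]]    # hypothetical responses
print(round(kr20(data), 3))                            # → 0.75
```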
Coefficient Alpha
Cronbach's Alpha – commonly used when you have multiple Likert questions in a survey/questionnaire that form a scale and you wish to determine if the scale is reliable.
- applicable for personality and attitude scales

How reliable is reliability?
• In research settings, it has been suggested that reliability estimates in the range of .70 to .80 are good enough for most purposes in basic research.
• In clinical settings, high reliability is extremely important; a reliability of .90 to .95 might be good enough.
• If the test is unreliable, information obtained with it is of little or no value.
What to do with low reliability?
1. Increase the number of items.
2. Factor and item analysis.
3. Correction for attenuation.

VALIDITY
- a judgement or estimate of how well a test measures what it purports to measure in a particular context
- indicates what the test aims or purports to measure
It answers the question, "Does the test measure what it is supposed to measure?"

Three Categories of Validity:
1. Content Validity
2. Criterion-related Validity
- Predictive Validity
- Concurrent Validity
3. Construct Validity
- Convergent Validity
- Discriminant Validity

Face Validity
- not really validity at all, because it does not offer evidence to support conclusions drawn from test scores
- establishes the presentation and physical appearance of the psychological test [ is the test presentable to the test-takers? ]

Content Validity
- explores the appropriateness of the items of a psychological test
- means that the test covers the content it is supposed to cover
- how adequately a test samples behavior representative of what the test was designed to sample
Test blueprint – the "structure" of the test evaluation; a plan regarding the types of information to be covered by the items

Criterion-related Validity
Criterion: a standard on which a judgement or decision may be based; the standard against which a test or a test score is evaluated.
Concurrent Validity – statements of concurrent validity indicate the extent to which test scores may be used to estimate an individual's present standing on a criterion.
Predictive Validity – how well a certain measure can predict future behavior.

Construct Validity
A judgement about the appropriateness of inferences drawn from test scores regarding individuals' standings on a variable called a construct.
It is arrived at by executing a comprehensive analysis of:
a. how scores on the test relate to other test scores and measures
b. how scores on the test can be understood within some theoretical framework for understanding the construct that the test was designed to measure

CONVERGENT VALIDITY – when a measure correlates well with other tests believed to measure the same construct.
[ ex: correlate assessment scores for a math ability test with scores obtained from another math ability test ]
DISCRIMINANT VALIDITY – a construct measure diverges from other measures that should be measuring different things.
[ ex: correlate assessment scores for a math ability test with scores obtained from a verbal ability test ]

Practicality of a Test
• A test must be usable
• Selection of the test should also be based
on:
- Effort
- Affordability
- Time frame
• Test requires simple directions
• Easy administration and scoring
INTELLIGENCE

BINET-SIMON SCALE – IQ test
MENTAL AGE – the average age of individuals who achieve a particular level of performance on a test

SPEARMAN'S TWO-FACTOR THEORY
The g factor is measured by every task on an intelligence test. A highly g-saturated factor can safely predict a similar level of performance on other g-saturated tasks, while predictions based on the s factor are less accurate.

TERMAN'S STANFORD-BINET INDIVIDUAL INTELLIGENCE TEST
The classic formula for the IQ is: IQ = mental age divided by chronological age x 100.

THORNDIKE'S STIMULUS-RESPONSE THEORY
Three broad classes of intellectual functioning:
Abstract intelligence – measured by standard intelligence tests
Mechanical intelligence – the ability to visualize relationships among objects and understand how the physical world works
Social intelligence – the ability to function successfully in interpersonal situations
He considered the two most basic forms of intelligence to be trial-and-error and stimulus-response association.
He said that stimulus-response connections that are repeated are strengthened, while those that are not used are weakened.

L.L. THURSTONE'S MULTIPLE FACTORS THEORY OF INTELLIGENCE
Factor Analysis – a statistical procedure that identifies clusters of related items (called factors) on a test; used to identify different dimensions of performance that underlie one's total score
His Multiple Factors Theory of Intelligence identified 7 primary mental abilities: verbal comprehension, word fluency, number facility, spatial visualization, associative memory, perceptual speed, and reasoning

RAYMOND CATTELL'S THEORY OF FLUID AND CRYSTALLIZED INTELLIGENCE
FLUID INTELLIGENCE (Gf) – adaptive and new learning capabilities; non-verbal, relatively culture-free, and independent of specific instruction
CRYSTALLIZED INTELLIGENCE (Gc) – learned through experience; acquired skills and knowledge that are dependent on exposure

STERNBERG'S TRIARCHIC THEORY OF INTELLIGENCE
CONTEXTUAL INTELLIGENCE – emphasizes intelligence in its sociocultural context
EXPERIENTIAL INTELLIGENCE – emphasizes insight and the ability to formulate new ideas and combine seemingly unrelated facts or information
COMPONENTIAL INTELLIGENCE – emphasizes the effectiveness of information processing
Component – a cognitive mechanism that carries out adaptive behavior in novel situations
Two kinds of components:
Performance components – used in the actual execution of tasks; include encoding, comparing, etc.; administer the instructions of the metacomponents
Metacomponents – higher-order executive processes used in planning, monitoring, and evaluating one's working memory program
Theory of Multiple Intelligences
Gardner's intelligence theory proposes that there are eight distinct spheres of intelligence.
1. Linguistic Intelligence
WORD SMART
• The ability to use language to excite, please, convince, stimulate, or convey information
2. Logical-Mathematical Intelligence
LOGIC SMART
• People who display an aptitude for numbers, reasoning, and problem solving
3. Bodily-Kinesthetic Intelligence
BODY SMART
• The ability to use fine and gross motor skills in sports, the performing arts, and crafts production
4. Spatial Intelligence
PICTURE SMART
• The ability to perceive and mentally manipulate a form or object, and to perceive and create tension, balance, and composition in a visual or spatial display
5. Musical Intelligence
MUSIC SMART
• The ability to enjoy, perform, or compose a musical piece
6. Interpersonal Intelligence
PEOPLE SMART
• The ability to understand and get along with others
7. Intrapersonal Intelligence
SELF SMART
• The ability to gain access to and understand one's inner feelings, dreams, and ideas
8. Naturalist Intelligence
NATURE SMART
• The ability to identify and classify patterns in nature

Origins of Intelligence
Stanford-Binet – the widely used American revision of Binet's original intelligence test, revised by Terman at Stanford University

The Dynamics of Intelligence
Mental Retardation – a condition of limited mental ability [ intelligence score below 70 ]
Down Syndrome – retardation and associated physical disorders caused by an extra chromosome in one's genetic makeup

Genetic Influences
The most genetically similar people have the most similar scores.
Heritability – the proportion of variation among individuals that we can attribute to genes

Environmental Influences
The Schooling Effect

Group Differences
Group differences and environmental impact

David Wechsler believed that all tests of intelligence measure traits of personality, such as drive, energy level, impulsiveness, persistence, and goal awareness.
- some personality factors are associated with gains in measured intelligence over time [ aggressiveness with peers, initiative, high need for achievement, competitive striving, curiosity, self-confidence, and emotional stability ]
- some factors are present in children whose measured intellectual ability has not increased over time [ passivity, dependence, and maladjustment ]

Gender
• Males may have the edge when it comes to the g factor in intelligence (Jackson & Rushton, 2006; Lynn & Irwing, 2004).
• Males also tend to outperform females on tasks requiring visual spatialization; there is suggestive evidence indicating that more experience in spatialization might be all that is required to bridge this gender gap (Chan, 2007).
• Girls may generally outperform on language skill–related tasks, although these differences may be minimized when the assessment is conducted by computer (Horne, 2007).

Stereotype Threat
- a self-confirming concern that one will be evaluated based on a negative stereotype

Culture
• There is a relationship between culture and psychological assessment (intelligence tests).
• A culture provides specific models for thinking, acting, and feeling (Chinoy, 1967).
• Values may differ radically between cultural and subcultural groups; people from different cultural groups can have radically different views about what constitutes intelligence (Super, 1983; Wober, 1974).

Other Issues of Intelligence Tests
- Measured intelligence may vary as a result of factors related to the measurement process.
- Many factors can affect measured intelligence.
- Another possible factor in measured intelligence is the FLYNN EFFECT: James R. Flynn found that measured intelligence seems to rise on average, year by year, starting with the year that the test is normed.

Personality
Alfred Binet had conceived of the study of intelligence as being synonymous with the study of personality.
• The desire to create a culture-free intelligence test has resurfaced with various degrees of dedication throughout history.
• Nonverbal items were thought to represent the best available means for determining the cognitive ability of minority-group children and adults.
• They have not been found to have the same high level of predictive validity as more verbally loaded tests.
• This may be due to the fact that nonverbal items do not sample the same psychological processes as the more verbally loaded, conventional tests of intelligence.
• Nonverbal tests tend not to be very good at predicting success in various academic and business settings.
• Culture loading may be defined as the extent to which a test incorporates the vocabulary, concepts, traditions, knowledge, and feelings associated with a particular culture.

I. INDIVIDUAL TESTS OF INTELLIGENCE
A. The Wechsler Intelligence Scale for Children (WISC)
B. The Wechsler Adult Intelligence Scale (WAIS)
C. Raven's Standard Progressive Matrices Test
D.

II. GROUP TESTS
Group Ability Tests: Testing in Education, Civil Service, and the Military
A. Verbal Test: MD5 Mental Ability Test
B. Non-Verbal: Culture Fair Intelligence Test

Structured Personality Tests
A. Minnesota Multiphasic Personality Inventory 2
B. Basic Personality Inventory
C. California Psychological Inventory
D. 16 Personality Factors
E. Dimensions of Self Concept – Form H: College
F. Tennessee Self Concept Scale
G. Self Esteem Index
H. Differential Aptitude Test

Other Special Tests
G. Bender Gestalt Visual Motor Test
E. Panukat ng Ugaling at Pagkatao ng Pilipino
