LECTURE NOTES
PSYCHOLOGICAL ASSESSMENT
Prepared and Screened by:
Prof. Jose J. Pangngay, MS Psych, RPm
CHAPTER I: BRIEF HISTORY OF PSYCHOLOGICAL TESTING AND PROMINENT INDIVIDUALS IN PSYCHOLOGICAL ASSESSMENT
A. Ancient Roots
• Chinese Civilization – testing was instituted as a means of selecting who, of the many applicants, would obtain government jobs
• Greek Civilization – tests were used to measure intelligence and physical skills
• European Universities – these universities relied on formal exams in conferring degrees and honors
B. Individual Differences
• Charles Darwin – believed that despite our similarities, no two humans are exactly alike. Some of these individual differences are more adaptive than others, and these differences lead to more complex, intelligent organisms over time.
• Francis Galton – established the testing movement; introduced the anthropometric records of students; pioneered the application of the rating-scale and questionnaire methods and the free-association technique; also pioneered the use of statistical methods for the analysis of psychological tests. He used the Galton bar (visual discrimination of length) and the Galton whistle (determining the highest audible pitch). Moreover, he noted that persons with mental retardation tend to have diminished ability to discriminate among heat, cold, and pain.
E. World War I
• Robert Yerkes – pioneered the first group intelligence tests, known as the Army Alpha (for literate recruits) and Army Beta (for functionally illiterate recruits)
• Arthur S. Otis – introduced multiple-choice and other “objective” item types
• Robert S. Woodworth – devised the Personal Data Sheet (known as the first personality test), which aimed to identify soldiers at risk for shell shock
F. Personality Testers
• Herman Rorschach – slow rise of projective testing; Rorschach Inkblot Test
• Henry Murray & Christina Morgan – Thematic Apperception Test
• Early 1940’s – structured tests were being developed on the strength of their better psychometric properties
• Raymond B. Cattell – 16 Personality Factors
• McCrae & Costa – Big 5 Personality Factors
A. Objectives of Psychometrics
1. To measure behavior (overt and covert)
2. To describe and predict behavior and personality (traits, states, personality types, attitudes, interests, values, etc.)
3. To determine signs and symptoms of dysfunctionality (for case formulation, diagnosis, and basis for intervention/plan for action)
4. Personality Inventory
– has three construction strategies, namely: theory-guided inventories, factor-analytically derived inventories, and criterion-keyed inventories
– examples: NEO-PI, 16PF, MBTI, MMPI
5. Interest Inventory
– Measures an individual’s preference for certain activities or topics and thereby helps determine occupational choice or career decisions
– Measure the direction and strength of interest
– Assumption: interests, though changeable, have a certain stability; otherwise they could not be measured
– Stability is said to start at 17 years old
– Broad lines of interest are more stable, while specific lines of interest are more changeable
– Example: CII
6. Attitude Inventory
– Direct observation of how a person behaves in relation to certain things
– Attitude questionnaires or scales (Bogardus Social Distance Scale, 1925)
– Reliabilities are good but not as high as those of tests of ability
– Attitude measures have not generally correlated very highly with actual behavior
– Specific behaviors, however, can be predicted from measures of attitude toward the specific behavior
7. Values Inventory
– Purports to measure generalized and dominant interests
– Validity is extremely difficult to determine by statistical methods
– The only observable criterion is overt behavior
– Employed less frequently than interest in vocational counseling and career decision-making
8. Diagnostic Test
– This test can uncover and focus attention on weaknesses of individuals for remedial purposes
9. Power Test
– Requires an examinee to exhibit the extent or depth of his understanding or skill
– Test with varying level of difficulty
10. Speed Test
– Requires the examinee to complete as many items as possible
– Contains items of uniform and generally simple level of difficulty
11. Creativity Test
– A test which assesses an individual’s ability to produce new/original ideas, insights, or artistic creations that are accepted as being of social, aesthetic, or scientific value
– Can assess the person’s capacity to find unusual or unexpected solutions for vaguely defined problems
12. Neuropsychological Test
– Measures cognitive, sensory, perceptual and motor performance to determine the extent, locus and behavioral consequences of brain damage,
given to persons with known or suspected brain dysfunction
– Example: Bender-Gestalt II
13. Objective Test
– Standardized test
– Administered individually or in groups
– Objectively scored
– There is a limited number of responses
– Uses norms
– There is a high level of reliability and validity
– Examples: Personality Inventories, Group Intelligence Test
14. Projective Test
– Test with ambiguous stimuli which measures wishes, intrapsychic conflicts, dreams and unconscious motives
– Projective tests allow the examinee to respond to vague stimuli with their own impressions
– Assumption is that the examinee will project his unconscious needs, motives, and conflicts onto the neutral stimulus
– Administered individually and scored subjectively
– Have 5 types/techniques: Completion Technique, Expressive Technique, Association Technique, Construction Technique, Choice or Ordering
Technique
– With low levels of reliability and validity
– Examples: Rorschach Inkblot Test, TAT, HTP, SSCT, DAP
15. Norm-Referenced Test – raw scores are converted to standard scores
16. Criterion-Referenced Test – raw scores are referenced to specific cut-off scores
O. Cross-Cultural Testing
1. Parameters where cultures vary
– Language – Education
– Test Content – Speed (Tempo of Life)
2. Culture Free Tests
– An attempt to eliminate culture so nature can be isolated
– Impossible to develop, because culture exerts its influence from the birth of an individual
– The interaction between nature and nurture is cumulative, not relative
3. Culture Fair Tests
– These tests were developed because of the non-success of culture-free tests
– Nurture is not removed, but the parameters are common and fair to all
– Can be done using three approaches such as follows:
✓ Fair to all cultures
✓ Fair to some cultures
✓ Fair only to one culture
4. Culture Loadings
– The extent to which a test incorporates the vocabulary, concepts, traditions, knowledge, and feelings associated with a particular culture
B. Steps in Research
1. Identify the problem
2. Conduct literature review
3. Identify theoretical/conceptual framework
4. Formulate hypothesis
5. Operationalize variables
6. Select research design
7. Ascertain and select sample
8. Conduct a pilot study
9. Collect data
10. Analyze data
11. Interpret results
12. Disseminate information
C. Research Problems
– Research problem is a situation in need of description or quantification, solution, improvement or alteration. You can evaluate these problems by using
the following criteria:
✓ Significance of the problem
✓ Researchability of the problem
✓ Feasibility
✓ Interest of the researcher
– Sources of Problems
✓ Experiences
✓ Review of related literature
✓ Issues and popular concern
✓ Replication studies
✓ Intellectual curiosity
D. Hypotheses - statements of the anticipated or expected relationship between the independent and dependent variables.
– Types
✓ Null hypothesis – states no relationship between variables
✓ Alternative hypothesis – gives the predicted relationship
– Complexity
✓ Simple- one independent and one dependent variable
✓ Complex or Multivariate – 2 or more independent or dependent variables
E. Research Design
Purpose
• Qualitative Research Design
– To gain an understanding of underlying reasons and motivations
– To provide insights into the setting of a problem, generating ideas and/or hypotheses for later quantitative research
– To uncover prevalent trends in thought and opinion
– To explore causality
• Quantitative Research Design
– To quantify data and generalize results from a sample to the population of interest
– To measure the incidence of various views and opinions in a chosen sample
– Sometimes followed by qualitative research, which is used to explore some findings further
– To suggest causality
Philosophical Assumptions
• Qualitative Research Design – post-positivist perspective
• Quantitative Research Design – positivist perspective
F. Research Methods
Research Method – Salient Features
▪ Descriptive-Qualitative (Case Study/Ethnography): detailed descriptions of specific situation(s) using interviews, observations, and document review; the researcher’s task is to describe things as they are.
▪ Descriptive-Quantitative: numerical descriptions (frequency, average) of specific situations; the researcher’s task is to measure things as they are.
▪ Correlational Analysis: quantitative analyses of the strength of relationships between two or more variables.
▪ Regression Analysis: quantitative analyses of causal or predictive links between two or more variables.
▪ Quasi-Experimental Research: comparing a group that gets a particular intervention with another group that is similar in characteristics but did not receive the intervention; there is no random assignment used.
▪ Experimental Research: using random assignment to assign participants to an experimental or treatment group and a control or comparison group.
▪ Meta-analysis: synthesis of results from multiple studies to determine the average impact of a similar intervention across the studies.
G. Experiment Validity
– Experimental validity refers to the manner in which variables influence both the results of the research and the generalizability of those results to the population at large
1. Internal Validity of an Experiment
– It refers to a study’s ability to determine if a causal relationship exists between one or more independent variables and one or more dependent
variables
– Threatened by the following:
• History and Confounding Variables
• Maturation
• Testing
• Statistical Regression
• Instrumentation
• Selection
• Experimenter Bias
• Mortality
2. External Validity of an Experiment
– It refers to a study’s generalizability to the general population
• Demand Characteristics (subjects become wise to anticipated results)
• Hawthorne Effects
• Order Effects (Carry-Over Effects)
• Treatment Interaction Effects (treatment + selection/history/testing)
H. Sampling Techniques
1. In non-probability sampling, not every element of the population has an opportunity to be included.
Examples: accidental/convenience, quota, purposive and network/snowball.
2. In probability sampling, every member of the population has a probability of being included in the sample.
Examples: simple random sampling, stratified random sampling, cluster sampling and systematic sampling.
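The probability sampling techniques above can be sketched with the standard library alone; the population of 100 member IDs and the two strata names are invented for illustration.

```python
import random

random.seed(0)                            # reproducible illustration
population = list(range(1, 101))          # member IDs 1..100

# Simple random sampling: every member has an equal chance of selection.
srs = random.sample(population, 10)

# Systematic sampling: every k-th member after a random start.
k = len(population) // 10
start = random.randrange(k)
systematic = population[start::k]

# Stratified random sampling: sample proportionally from each stratum.
strata = {"freshmen": population[:40], "seniors": population[40:]}
stratified = [m for group in strata.values()
              for m in random.sample(group, len(group) // 10)]

print(len(srs), len(systematic), len(stratified))
```

Each technique here returns 10 of the 100 members; the stratified sample keeps the 40/60 split of the strata (4 freshmen, 6 seniors).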
I. Research Variables
1. An independent variable is the presumed “cause”
2. The dependent variable is the presumed “effect”.
3. Extraneous variables are other factors that affect the measurement of the IV or DV.
4. Intervening variables are factors that are not directly observable in the research situation but which may be affecting the behavior of the subject.
1. Primary Scales of Measurement
a. Nominal: a non-parametric scale wherein numbers or labels serve only to classify cases into categories; order and distance carry no meaning.
Example: sex, religion, blood type
b. Ordinal: a non-parametric scale wherein cases are ranked or ordered; the values represent position in a group where the order matters but not the difference between the values.
Example: 1st, 2nd, 3rd, 4th and 5th; Pain threshold in a scale of 1 – 10, 10 being the highest
c. Interval: a parametric scale which uses equal intervals of measurement, where the difference between two values is meaningful. The values have a fixed unit and magnitude but no true zero point.
Example: Speed of a car (70KpH); Temperature (Fahrenheit and Celsius only)
d. Ratio: a parametric scale similar to interval but with a true zero point, so relative proportions on the scale make sense.
Example: Height and Weight
2. Comparative Scales of Measurement
a. Paired Comparison: a comparative technique in which a respondent is presented with two objects at a time and asked to select one object according
to some criterion. The data obtained are in ordinal nature.
Example: Pairing the different brands of cold drink with one another please put a check mark in the box corresponding to your preference.
Brand                    Coke   Pepsi   Sprite   Limca
Coke                     —
Pepsi                    ✓      —       ✓
Sprite                   ✓              —
Limca                    ✓      ✓       ✓        —
No. of Times Preferred   3      1       2        0
(A check mark means the column brand was preferred over the row brand.)
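Tallying paired-comparison data like the cold-drink example above is a simple count of wins; the list of pairs below (chosen brand listed first) is constructed to match the table.

```python
from collections import Counter

# Each tuple is one presented pair, with the preferred brand first.
pairs = [("Coke", "Pepsi"), ("Coke", "Sprite"), ("Coke", "Limca"),
         ("Sprite", "Pepsi"), ("Pepsi", "Limca"), ("Sprite", "Limca")]

# "No. of times preferred" = how often each brand was the chosen one.
times_preferred = Counter(winner for winner, _ in pairs)
for brand in ("Coke", "Pepsi", "Sprite", "Limca"):
    print(brand, times_preferred[brand])
```

The resulting counts (Coke 3, Pepsi 1, Sprite 2, Limca 0) give only an ordinal ranking of the brands, which is exactly the limitation the text notes.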
b. Rank Order: respondents are presented with several items simultaneously and asked to rank them in order of priority. This is an ordinal scale that
describes the favoured and unfavoured objects, but does not reveal the distance between the objects. The resultant data in rank order is ordinal
data. This yields a better result when comparisons are required between the given objects. The major disadvantage of this technique is that only
ordinal data can be generated.
Example: Rank the following brands of cold drinks you like most and assign it a number 1. Then find the second most preferred brand and assign
it a number 2. Continue this procedure until you have ranked all the brands of cold drinks in order of preference. Also remember that no two
brands should receive the same rank order.
Brand Rank
Coke 1
Pepsi 3
Sprite 2
Limca 4
c. Constant Sum: respondents are asked to allocate a constant sum of units such as points, rupees or chips among a set of stimulus objects with
respect to some criterion. For example, you may wish to determine how important the attributes of price, fragrance, packaging, cleaning power and
lather of a detergent are to consumers. Respondents might be asked to divide a constant sum to indicate the relative importance of the attributes.
The advantage of this technique is that it saves time. However, the main disadvantage is that respondents may allocate more or fewer points than those specified. A second problem is that respondents might be confused.
Example: For the attributes of a detergent listed below, please allocate 100 points among the attributes so that your allocation reflects the relative importance
you attach to each attribute. The more points an attribute receives, the more important the attribute is. If an attribute is not at all important, assign
it zero points. If an attribute is twice as important as some other attribute, it should receive twice as many points.
Attribute Number of Points
Price 50
Fragrance 05
Packaging 10
Cleaning power 30
Lather 05
Total Points 100
d. Q-Sort Technique: This is a comparative scale that uses a rank order procedure to sort objects based on similarity with respect to some criterion.
The important characteristic of this methodology is that it is more important to make comparisons among different responses of a respondent than
the responses between different respondents. Therefore, it is a comparative method of scaling rather than an absolute rating scale. In this method
the respondent is given statements in a large number for describing the characteristics of a product or a large number of brands of products.
Example: The bag given to you contains pictures of 90 magazines. Please choose the 10 magazines you prefer most, 20 magazines you like, 30 magazines toward which you are neutral (neither like nor dislike), 20 magazines you dislike, and 10 magazines you prefer least.
Prefer Most (10)   Like (20)   Neutral (30)   Dislike (20)   Prefer Least (10)
3. Non-Comparative Scales of Measurement
a. Continuous Rating Scales: the respondents rate the objects by placing a mark at the appropriate position on a continuous line that runs from one extreme of the criterion variable to the other.
Example: How would you rate the TV advertisement as a guide for buying?
Strongly Agree Strongly Disagree
10 9 8 7 6 5 4 3 2 1
b. Itemized Rating Scale: itemized rating scale is a scale having numbers or brief descriptions associated with each category. The categories are
ordered in terms of scale position and the respondents are required to select one of the limited numbers of categories that best describes the
product, brand, company or product attribute being rated. Itemized rating scales are widely used in marketing research. This can take the graphic,
verbal or numerical form.
c. Likert Scale: the respondents indicate their own attitudes by checking how strongly they agree or disagree with carefully worded statements that
range from very positive to very negative toward the attitudinal object. Respondents generally choose from five alternatives (say strongly agree, agree, neither agree nor disagree, disagree, strongly disagree). A Likert scale may include a number of items or statements. A disadvantage of the Likert scale is that it takes longer to complete than other itemized rating scales because respondents have to read each statement. Despite this disadvantage, the scale has several advantages: it is easy to construct, administer, and use.
Example: I believe that ecological questions are the most important issues facing human beings today.
1 2 3 4 5
Strongly Disagree Disagree Neutral Agree Strongly Agree
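Scoring a Likert scale is a matter of summing item values, with negatively worded items reverse-scored first. A minimal sketch, with invented item texts and responses (the reverse-scoring rule maps 1↔5 and 2↔4 on a 5-point scale):

```python
# Hypothetical 5-point Likert responses; one item is negatively worded.
responses = {"I recycle regularly": 5,
             "Ecological issues are exaggerated": 2,   # negatively worded
             "I support environmental taxes": 4}
reverse_keyed = {"Ecological issues are exaggerated"}

def score(item, value, points=5):
    # Reverse-scoring flips the scale: value v becomes (points + 1 - v).
    return (points + 1 - value) if item in reverse_keyed else value

total = sum(score(item, v) for item, v in responses.items())
print(total)   # 5 + 4 + 4 = 13
```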
d. Semantic Differential Scale: This is a seven-point rating scale with end points associated with bipolar labels (such as good and bad, complex and
simple) that have semantic meaning. It can be used to find whether a respondent has a positive or negative attitude towards an object. It has been
widely used in comparing brands and company images. It has also been used to develop advertising and promotion strategies and in a new product
development study.
Example: Please indicate your attitude towards work using the scale below:
Attitude towards work
Boring : : : : : : : Interesting
Unnecessary : : : : : : : Necessary
e. Stapel Scale: The Stapel scale was originally developed to measure the direction and intensity of an attitude simultaneously. Modern versions of the Stapel scale place a single adjective as a substitute for the semantic differential when it is difficult to create pairs of bipolar adjectives. The modified Stapel scale places a single adjective in the center of an even number of numerical values.
Example: Select a plus number for words that you think describe personnel banking of a bank accurately. The more accurately you think the
word describes the bank, the larger the plus number you should choose. Select a minus number for words you think do not describe the bank
accurately. The less accurate you think the word describes the bank, the larger the minus number you should choose.
+3 +3
+2 +2
+1 +1
Friendly Personnel Competitive Loan Rates
-1 -1
-2 -2
-3 -3
B. Descriptive Statistics
1. Frequency Distributions – distribution of scores by frequency with which they occur
2. Measures of Central Tendency – a statistic that indicates the average or midmost score between the extreme scores in a distribution
a. Mean – formula: X̄ = ΣX / N (for ungrouped distribution); X̄ = Σ(fX) / N (for grouped distribution)
b. Median – the middle score in a distribution
c. Mode – the most frequently occurring score in a distribution
***Appropriate use of central tendency measure according to type of data being used:
Type of Data Measure
Nominal Data Mode
Ordinal Data Median
Interval / Ratio Data (Normal) Mean
Interval / Ratio Data (Skewed) Median
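The table above can be illustrated with the standard library: for skewed interval/ratio data the median is the better summary, because a single extreme score drags the mean toward it. The salary figures below are invented.

```python
import statistics

# Hypothetical salaries in thousands; one outlier skews the distribution.
salaries = [18, 20, 21, 22, 24, 25, 200]

print(statistics.mean(salaries))    # pulled far above most scores
print(statistics.median(salaries))  # 22, a better "typical" value here
print(statistics.mode([1, 2, 2, 3]))  # mode: appropriate for nominal data
```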
3. Measures of Variability – a statistic that describe the amount of variation in a distribution
a. Range – the difference between the highest and the lowest scores
b. Interquartile range – the difference between Q3 and Q1
c. Semi-Interquartile range – interquartile range divided by 2
d. Standard Deviation – the square root of the averaged squared deviations about the mean
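The four variability measures above can be sketched with the standard library. One assumption to flag: `statistics.quantiles` defaults to the "exclusive" interpolation convention for Q1 and Q3, and textbooks differ on this, so other conventions give slightly different quartiles.

```python
import statistics

scores = [4, 6, 7, 8, 10, 12, 15, 18]          # invented score distribution

rng = max(scores) - min(scores)                 # range
q1, q2, q3 = statistics.quantiles(scores, n=4)  # quartile cut points
iqr = q3 - q1                                   # interquartile range
semi_iqr = iqr / 2                              # semi-interquartile range
sd = statistics.pstdev(scores)                  # population standard deviation

print(rng, iqr, semi_iqr, round(sd, 2))
```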
4. Measures of Location
a. Percentiles – an expression of the percentage of people whose score on a test or measure falls below a particular raw score
Formula for Percentile = (Number of students beaten / Total number of students) × 100
b. Quartiles – one of the three dividing points between the four quarters of a distribution, each typically labelled Q1, Q2 and Q3
c. Deciles – the nine points that divide a distribution into 10 equal parts
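The percentile formula above translates directly into code; the class scores below are invented for illustration.

```python
def percentile_rank(score, scores):
    # "Students beaten" = scores strictly below the given raw score.
    below = sum(1 for s in scores if s < score)
    return below / len(scores) * 100

class_scores = [55, 60, 62, 70, 75, 80, 85, 90, 92, 98]
print(percentile_rank(85, class_scores))  # 6 of 10 scores fall below 85
```

A raw score of 85 beats 6 of the 10 students, so it sits at the 60th percentile.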
5. Skewness - a measure of the asymmetry of the probability distribution of a real-valued random variable about its mean
D. Inferential Statistics
1. Parametric vs. Non-Parametric Tests
Parametric Test
• Requirements: normal distribution; homogeneous variance; interval or ratio data
• Common statistical tools: Pearson’s Correlation; independent-measures t-test; one-way, independent-measures ANOVA; paired t-test; one-way, repeated-measures ANOVA
Non-Parametric Test
• Requirements: normal distribution not required; homogeneous variance not required; nominal or ordinal data
• Common statistical tools (counterparts, in the same order): Spearman’s Correlation; Mann-Whitney U test; Kruskal-Wallis H test; Wilcoxon Signed-Rank test; Friedman’s test
2. Measures of Correlation
a. Pearson’s Product Moment Correlation – parametric test for interval data
b. Spearman Rho’s Correlation – non-parametric test for ordinal data
c. Kendall’s Coefficient of Concordance – non-parametric test for ordinal data
d. Phi Coefficient – non-parametric test for dichotomous nominal data
e. Lambda – non-parametric test for 2 groups (dependent and independent variable) of nominal data
***Correlation Ranges:
1.00 : Perfect relationship
0.75 – 0.99 : Very strong relationship
0.50 – 0.74 : Strong relationship
0.25 – 0.49 : Weak relationship
0.01 – 0.24 : Very weak relationship
0.00 : No relationship
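Both correlation measures can be sketched in pure Python: Pearson’s r on the raw interval data, and Spearman’s rho as Pearson applied to the ranks. The study-hours/grades data are invented, and the rank helper deliberately ignores ties to stay minimal.

```python
def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def ranks(v):
    # Rank 1 = smallest value; ties are not handled in this sketch.
    order = sorted(v)
    return [order.index(a) + 1 for a in v]

def spearman(x, y):
    # Spearman's rho = Pearson's r computed on the ranks.
    return pearson(ranks(x), ranks(y))

hours = [1, 2, 3, 4, 5]
grades = [52, 60, 65, 74, 80]
print(round(pearson(hours, grades), 3))
print(round(spearman(hours, grades), 3))   # 1.0: the orderings agree exactly
```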
3. Measures of Prediction
a. Biserial Correlation – predictive test for artificially dichotomized and categorical data as criterion with continuous data as predictors
b. Point-Biserial Correlation – predictive test for genuinely dichotomized and categorical data as criterion with continuous data as predictors
c. Tetrachoric Correlation – test for the relationship between two variables that are both artificially dichotomized
d. Simple Linear Regression – a predictive test which involves one criterion that is continuous in nature with only one predictor that is continuous
e. Multiple Linear Regression – a predictive test which involves one criterion that is continuous in nature with more than one continuous predictor
f. Ordinal Regression – a predictive test which involves a criterion that is ordinal in nature with one or more predictors that are continuous in nature
4. Chi-Square Test
a. Goodness of Fit – used to measure differences and involves nominal data and only one variable with 2 or more categories
b. Test of Independence – used to measure correlation and involves nominal data and two variables with two or more categories
5. Comparison of Two Groups
a. Paired t-test – a parametric test for paired groups with normal distribution
b. Unpaired t-test – a parametric test for unpaired groups with normal distribution
c. Wilcoxon Signed-Rank Test – a non-parametric test for paired groups with non-normal distribution
d. Mann-Whitney U test – a non-parametric test for unpaired groups with non-normal distribution
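The paired t-test above reduces to a short formula: t is the mean of the difference scores divided by their standard error. A sketch with invented pre/post scores; in practice the resulting t would be checked against a t table with n − 1 degrees of freedom.

```python
import statistics

pre  = [10, 12, 9, 14, 11, 13]     # hypothetical pretest scores
post = [13, 14, 10, 17, 12, 16]    # hypothetical posttest scores
diffs = [b - a for a, b in zip(pre, post)]

n = len(diffs)
# t = mean difference / standard error of the differences
t = statistics.mean(diffs) / (statistics.stdev(diffs) / n ** 0.5)
print(round(t, 2), "df =", n - 1)
```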
6. Comparison of Three or More Groups
a. Repeated measures ANOVA – a parametric test for matched groups with normal distribution
b. One-way/Two-Way ANOVA – a parametric test for unmatched groups with normal distribution
c. Friedman F test – a non-parametric test for matched groups with non-normal distribution
d. Kruskal-Wallis H test – a non-parametric test for unmatched groups with non-normal distribution
b. Alternate-Forms/Parallel-Forms Reliability
– the two forms should be truly parallel: independently constructed tests designed to meet the same specifications, containing the same number of items, with items expressed in the same form, covering the same type of content, with the same range of difficulty, and with the same instructions, time limits, illustrative examples, format, and all other aspects of the test
– has the most universal applicability
– for immediate alternate forms, the source of error variance is content sampling
– for delayed alternate forms, the source of error variance is time sampling and content sampling
– utilizes Pearson r or Spearman rho
c. Split-Half Reliability
– Two scores are obtained for each person by dividing the test into equivalent halves (odd-even split or top-bottom split)
– The reliability of the test is directly related to the length of the test
– The source of error variance is content sampling
– Utilizes the Spearman-Brown Formula
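The Spearman-Brown formula steps the half-test correlation up to an estimate for the full-length test: r_full = k·r / (1 + (k − 1)·r), where k is the factor by which the test is lengthened (k = 2 for split halves). The 0.70 half-test correlation below is an invented example value.

```python
def spearman_brown(r_half, length_factor=2):
    # Projects reliability for a test length_factor times as long.
    return (length_factor * r_half) / (1 + (length_factor - 1) * r_half)

r_halves = 0.70                             # correlation between the halves
print(round(spearman_brown(r_halves), 3))   # 0.824 for the full test
```

The correction always raises the estimate (for positive r), which is why an uncorrected split-half correlation understates a test’s reliability.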
d. Other Measures of Internal Consistency/Inter-Item Reliability – source of error variance is content sampling and content heterogeneity
– KR-20 – for dichotomous items with varying level of difficulty
– KR-21 – for dichotomous items with uniform level of difficulty
– Cronbach Alpha/Coefficient Alpha – for non-dichotomous items (Likert or other multiple-choice formats)
– Average Proportional Distance – focuses on the degree of difference that exists between item scores.
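Coefficient alpha can be computed by hand from the item variances and the variance of the total scores: α = (k / (k − 1)) · (1 − Σ item variances / total-score variance). The 5-person, 3-item Likert data below are invented; population variances are used, which is one common convention.

```python
import statistics

# Rows = examinees, columns = Likert items (hypothetical data).
data = [[4, 5, 4], [3, 4, 3], [5, 5, 4], [2, 3, 2], [4, 4, 5]]
k = len(data[0])                            # number of items

item_vars = [statistics.pvariance([row[i] for row in data])
             for i in range(k)]
total_var = statistics.pvariance([sum(row) for row in data])

alpha = (k / (k - 1)) * (1 - sum(item_vars) / total_var)
print(round(alpha, 3))
```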
e. Inter-Rater/Inter-Observer Reliability
– Degree of agreement between raters on a measure
– Source of error variance is inter-scorer differences
– Often utilizes Cohen’s Kappa statistic
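Cohen’s kappa corrects the raters’ observed agreement for the agreement expected by chance: κ = (p_o − p_e) / (1 − p_e). A sketch with invented yes/no judgments from two raters on the same 10 cases:

```python
from collections import Counter

rater_a = ["yes", "yes", "no", "yes", "no", "no", "yes", "no", "yes", "yes"]
rater_b = ["yes", "no",  "no", "yes", "no", "yes", "yes", "no", "yes", "yes"]
n = len(rater_a)

# Observed agreement: proportion of cases where the raters match.
observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n

# Chance agreement: from each rater's marginal category proportions.
ca, cb = Counter(rater_a), Counter(rater_b)
expected = sum(ca[c] * cb[c] for c in ca) / n ** 2

kappa = (observed - expected) / (1 - expected)
print(round(kappa, 3))
```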
4. Reliability Ranges
– 1 : perfect reliability (may indicate redundancy and homogeneity)
– ≥ 0.9 : excellent reliability (minimum acceptability for tests used for clinical diagnoses)
– ≥ 0.8 < 0.9 : good reliability
– ≥ 0.7 < 0.8 : acceptable reliability (minimum acceptability for psychometric tests)
– ≥ 0.6 < 0.7 : questionable reliability (but still acceptable for research purposes)
– ≥ 0.5 < 0.6 : poor reliability
– < 0.5 : unacceptable reliability
– 0 : no reliability.
5. Standard Error of Measurement
– an index of the amount of inconsistency or the amount of expected error in an individual’s score
– the higher the reliability of the test, the lower the SEM
• Error – the long-standing assumption that factors other than what a test attempts to measure will influence performance on the test
• Error Variance – the component of a test score attributable to sources other than the trait or ability being measured
• Trait Error – sources of error that reside within the individual taking the test (such as “I didn’t study enough,” “I felt bad about missing the blind date,” “I forgot to set the alarm,” and other excuses)
• Method Error – sources of error that reside in the testing situation (such as lousy test instructions, a too-warm room, or missing pages)
• Confidence Interval – a range or band of test scores that is likely to contain the true score
• Standard error of the difference – a statistical measure that can aid a test user in determining how large a difference should be before it
is considered statistically significant
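The link between SEM and reliability stated above is SEM = SD · √(1 − reliability), and a confidence interval is typically built as the obtained score ± z · SEM (z ≈ 1.96 for 95%). The IQ-style numbers below are illustrative assumptions.

```python
# Hypothetical scale: SD of 15, reliability .91, obtained score 110.
sd, reliability, obtained = 15, 0.91, 110

sem = sd * (1 - reliability) ** 0.5          # higher reliability -> lower SEM
low = obtained - 1.96 * sem                  # 95% confidence band around
high = obtained + 1.96 * sem                 # the obtained score
print(round(sem, 2), (round(low, 1), round(high, 1)))
```

With reliability .91 the SEM is 4.5, so the true score plausibly lies roughly between 101 and 119; raising the reliability shrinks this band.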
6. Factors Affecting Test Reliability
a. Test Format e. Test Scoring
b. Test Difficulty f. Test Economy
c. Test Objectivity g. Test Adequacy
d. Test Administration
7. What to do about low reliability?
– Increase the number of items
– Use factor analysis and item analysis
– Use the correction for attenuation formula – a formula used to estimate what the correlation between two variables would be if the measures were not affected by error
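The correction for attenuation divides the observed correlation by the geometric mean of the two reliabilities: r_corrected = r_xy / √(r_xx · r_yy). The observed correlation and reliabilities below are invented example values.

```python
def correct_for_attenuation(r_xy, r_xx, r_yy):
    # r_xy: observed correlation; r_xx, r_yy: reliabilities of the measures.
    return r_xy / (r_xx * r_yy) ** 0.5

# An observed r of .42 between tests with reliabilities .80 and .70:
print(round(correct_for_attenuation(0.42, 0.80, 0.70), 3))
```

Unreliability attenuates observed correlations, so the corrected estimate (≈ .56 here) is always at least as large as the observed one.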
B. Validity – a judgment or estimate of how well a test measures what it purports to measure in a particular context
1. Types of Validity
a. Face Validity
– the least stringent type of validity, whether a test looks valid to test users, examiners and examinees
– Examples:
✓ An IQ test containing items which measure memory, mathematical ability, verbal reasoning, and abstract reasoning has good face validity.
✓ An IQ test containing items which measure depression and anxiety has poor face validity.
✓ A self-esteem rating scale with items like “I know I can do what other people can do.” and “I usually feel that I would fail on a task.” has good face validity.
✓ Inkblot tests have low face validity because test takers question whether the test really measures personality.
b. Content Validity
– Definitions and concepts
✓ whether the test covers the behavior domain to be measured which is built through the choice of appropriate content areas, questions,
tasks and items
✓ It is concerned with the extent to which the test is representative of a defined body of content consisting of topics and processes.
✓ Content validation is not done by statistical analysis but by the inspection of items. A panel of experts can review the test items and
rate them in terms of how closely they match the objective or domain specification.
✓ This considers the adequacy of representation of the conceptual domain the test is designed to cover.
✓ If the test items adequately represent the domain of possible items for a variable, then the test has adequate content validity.
✓ Determination of content validity is often made by expert judgment.
– Examples:
✓ Educational Content Valid Test – syllabus is covered in the test; usually follows the table of specification of the test. (Table of
specification – a blueprint of the test in terms of number of items per difficulty, topic importance, or taxonomy)
✓ Employment Content Valid Test – appropriate job-related skills are included in the test. Reflects the job specification of the test.
✓ Clinical Content Valid Test – symptoms of the disorder are all covered in the test. Reflects the diagnostic criteria for a test.
– Issues arising from lack of content validity:
✓ Construct underrepresentation – failure to capture important components of a construct (e.g., an English test which contains only vocabulary items but no grammar items will have poor content validity)
✓ Construct-irrelevant variance – happens when scores are influenced by factors irrelevant to the construct (e.g., test anxiety, reading speed, reading comprehension, illness)
c. Criterion-Related Validity
– What is a criterion?
✓ standard against which a test or a test score is evaluated.
✓ A criterion can be a test score, psychiatric diagnosis, training cost, index of absenteeism, amount of time.
✓ Characteristics of a criterion:
• Relevant
• Valid and Reliable
• Uncontaminated: criterion contamination occurs when the criterion is itself based on the predictor measures, or when the measure used is only a stand-in for what is supposed to be the criterion
– Criterion-Related Validity Defined:
✓ indicates the test effectiveness in estimating an individual’s behavior in a particular situation
✓ Tells how well a test corresponds with a particular criterion.
✓ A judgment of how adequately a test score can be used to infer an individual’s most probable standing on some measure of interest.
– Types of Criterion-Related Validity:
✓ Concurrent Validity – the extent to which test scores may be used to estimate an individual’s present standing on a criterion
✓ Predictive Validity – the scores on a test can predict future behavior or scores on another test taken in the future
✓ Incremental Validity – this type of validity is related to predictive validity wherein it is defined as the degree to which an additional
predictor explains something about the criterion measure that is not explained by predictors already in use
d. Construct Validity
– What is a construct?
✓ An informed scientific idea developed or hypothesized to describe or explain a behavior; something built by mental synthesis.
✓ Unobservable, presupposed traits; something that the researcher thought to have either high or low correlation with other variables
– Construct Validity defined
✓ A test designed to measure a construct must estimate the existence of an inferred, underlying characteristic based on a limited sample
of behavior
✓ Established through a series of activities in which a researcher simultaneously defines some construct and develops instrumentation to
measure it.
✓ A judgment about the appropriateness of inferences drawn from test scores regarding individual standings on a variable called
construct.
✓ Required when no criterion or universe of content is accepted as entirely adequate to define the quality being measured.
✓ Assembling evidence about what a test means.
✓ A series of statistical analyses demonstrating that the variable being measured is a separate, distinct variable.
✓ A test has a good construct validity if there is an existing psychological theory which can support what the test items are measuring.
✓ Establishing construct validity involves both logical analysis and empirical data. (Example: In measuring aggression, you have to check
all past research and theories to see how the researchers measure that variable/construct)
✓ Construct validity is like proving a theory through evidence and statistical analysis.
– Evidences of Construct Validity
✓ Test is homogenous, measuring a single construct.
• Subtest scores are correlated to the total test score.
• Coefficient alpha may be used as homogeneity evidence.
• Spearman Rho can be used to correlate an item to another item.
• Pearson or point biserial can be used to correlate an item to the total test score. (item-total correlation)
✓ Test score increases or decreases as a function of age, passage of time, or experimental manipulation.
• Some variables/constructs are expected to change with age.
✓ Pretest, posttest differences
• A difference in scores from pretest to posttest of a defined construct, after careful manipulation, would provide validity evidence
✓ Test scores differ between groups.
• Also called a method of contrasted group
• T-test can be used to test the difference of groups.
✓ Test scores correlate with scores on other tests in accordance with what is predicted.
• Discriminant Validation
o Convergent Validity – a test correlates highly with other variables with which it should correlate (example: Extraversion
is highly correlated with Sociability)
o Divergent Validity – a test does not correlate significantly with variables from which it should differ (example: Optimism
which is negatively correlated with Pessimism)
• Factor Analysis – a statistical technique for analyzing the interrelationships of behavior data
o Principal Components Analysis – a method of data reduction
o Common Factor Analysis – the items do not define the factor; rather, the factor is presumed to predict scores on the items. It is
classified into two types (Exploratory Factor Analysis for summarizing data and Confirmatory Factor Analysis for testing whether factors generalize)
• Cross-Validation – revalidation of the test against a criterion using a group different from the original group on which the
test was validated
o Validity Shrinkage – decrease in validity after cross validation.
o Co-validation – validation of more than one test from the same group.
o Co-norming – norming more than one test from the same group
2. Test Bias
– This is a factor inherent in a test that systematically prevents accurate, impartial measurement
✓ Rating Error – a judgment resulting from the intentional or unintentional misuse of rating scales
• Severity Error/Strictness Error – less than accurate rating or error in evaluation due to the rater’s tendency to be overly critical
• Leniency Error/Generosity Error – a rating error that occurs as a result of a rater’s tendency to be too forgiving and insufficiently
critical
• Central Tendency Error – a type of rating error wherein the rater exhibits a general reluctance to issue ratings at either a positive
or negative extreme and so all or most ratings cluster in the middle of the rating continuum
• Proximity Error – rating error committed due to proximity/similarity of the traits being rated
• Primacy Effect – “first impression” affects the rating
• Contrast Effect – the prior subject of assessment affects the latter subject of assessment
• Recency Effect – tendency to rate a person based from recent recollections about that person
• Halo Effect – a type of rating error wherein the rater views the object of the rating with extreme favour and tends to bestow ratings
inflated in a positive direction
• Impression Management
• Acquiescence
• Non-acquiescence
• Faking-Good
• Faking-Bad
3. Test Fairness
– This is the extent to which a test is used in an impartial, just and equitable way
4. Factors Influencing Test Validity
a. Appropriateness of the test
b. Directions/Instructions
c. Reading Comprehension Level
d. Item Difficulty
e. Test Construction Factors
f. Length of Test
g. Arrangement of Items
h. Patterns of Answers
A. Standardization
1. When to decide to standardize a test?
a. No test exists for a particular purpose
b. The existing tests for a certain purpose are not adequate for one reason or another
2. Basic Premises of standardization
– The independent variable is the individual being tested
– The dependent variable is his behavior
– Behavior = person x situation
– In psychological testing, we make sure that it is the person factor that will ‘stand out’ and the situation factor is controlled
– Control of extraneous variables = standardization
3. What should be standardized?
a. Test Conditions
– There should be uniformity in the testing conditions
– Physical condition
– Motivational condition
b. Test Administration Procedure
– There should be uniformity in the instructions and administration proper. Test administration includes carefully following standard procedures
so that the test is used in the manner specified by the test developers. The test administrator should ensure that test takers work within
conditions that maximize opportunity for optimum performance. As appropriate, test takers, parents, and organizations should be involved in
the various aspects of the testing process
– Sensitivity to Disabilities: try to help a subject with a disability overcome his disadvantage, such as by increasing voice volume, or refer to
other available tests
– Desirable Procedures of Group Testing: take care with timing, clarity of instructions, physical conditions (illumination, temperature, humidity,
writing surface and noise), and guessing.
c. Scoring
– There should be a consistent mechanism and procedure in scoring. Accurate measurement necessitates adequate procedures for scoring
the responses of test takers. Scoring procedures should be audited as necessary to ensure consistency and accuracy of application.
d. Interpretation
– There should be common interpretations among similar results. Many factors can impact the valid and useful interpretations of test scores.
These can be grouped into several categories including psychometric, test taker, and contextual, as well as others.
a. Psychometric Factors: Factors such as the reliability, norms, standard error of measurement, and validity of the instrument are important
when interpreting test results. Responsible test use considers these basic concepts and how each impacts the scores and hence the
interpretation of the test results.
b. Test Taker Factors: Factors such as the test taker’s group membership, and how that membership may impact the results of the test, are
critical in the interpretation of test results. Specifically, the test user should evaluate how the test taker’s gender, age, ethnicity, race,
socioeconomic status, marital status, and so forth impact the individual’s results.
c. Contextual Factors: The relationship of the test to the instructional program, opportunity to learn, quality of the educational program, work
and home environment, and other factors that would assist in understanding the test results are useful in interpreting test results. For
example, if the test does not align to curriculum standards and how those standards are taught in the classroom, the test results may not
provide useful information.
4. Tasks of test developers to ensure uniformity of procedures in test administration:
– Prepare a test manual containing the following:
i. Materials needed (test booklets & answer sheets)
ii. Time limits
iii. Oral instructions
iv. Demonstrations/examples
v. Ways of handling queries of examinees
5. Tasks of examiners/test users/psychometricians
– Ensure that test user qualifications are strictly met (training in selection, administration, scoring and interpretation of tests as well as the required
license)
– Advance preparations
i. Familiarity with the test/s
ii. Familiarity with the testing procedure
iii. Familiarity with the instructions
iv. Preparation of test materials
v. Orient proctors (for group testing)
6. Standardization sample
– A random sample of test takers whose performance is used as a basis for evaluating the performance of others
– Considered a representative sample if the sample consists of individuals that are similar to the group to be tested
B. Objectivity
1. Time-Limit Tasks – every examinee gets the same amount of time for a given task
2. Work-Limit Tasks – every examinee has to perform the same amount of work
3. Issue of Guessing
✓ True or False
• Ideally a true/false question should be constructed so that an incorrect response indicates something about the student's
misunderstanding of the learning objective.
• This may be a difficult task, especially when constructing a true statement
2. Test Construction – be mindful of the following test construction guidelines:
– Deal with only one central thought in each item.
– Be precise.
– Be brief.
– Avoid awkward wording or dangling constructions.
– Avoid irrelevant information.
– Present items in positive language.
– Avoid double negatives.
– Avoid terms like “all” and “none”.
3. Test Tryout
4. Item Analysis (Factor Analysis for Typical-Performance Tests)
5. Test Revision
D. Item Analysis
– Measures and evaluates the quality and appropriateness of test questions
– How well the items could measure ability/trait
1. Classical Test Theory
– Analyses are the easiest and the most widely used form of analyses
– Often called the “true-score model” which involves the true score formula:
𝑋𝑡𝑒 = 𝑟𝑥𝑥 (𝑋 − 𝑋̅) + 𝑋̅
Where:
𝑋𝑡𝑒 = Estimated True Score 𝑋 = Raw Score
𝑟𝑥𝑥 = Reliability Coefficient 𝑋̅ = Mean Score
– Assumes that a person’s test score is comprised of their “true score” plus some measurement error (X = T + e)
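The estimated-true-score formula can be checked with a short calculation; the reliability coefficient, mean, and raw score below are illustrative values, not figures from the notes:

```python
# Estimated true score under the classical true-score model:
#   X_te = r_xx * (X - X_bar) + X_bar
# All values below are hypothetical, for illustration only.
r_xx = 0.90    # reliability coefficient of the test
x_bar = 50.0   # mean score of the norm group
x = 70.0       # an examinee's raw score

# The estimate regresses the observed score toward the group mean;
# the lower the reliability, the stronger the pull toward the mean.
x_te = r_xx * (x - x_bar) + x_bar
print(x_te)  # 68.0
```

Note that with a perfectly reliable test (r_xx = 1.0) the estimated true score equals the raw score, while with r_xx = 0 it collapses to the mean.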
– Employs the following statistics
a. Item difficulty
– The proportion of examinees who got the item correctly
– The higher the item mean, the easier the item is for the group; the lower the item mean, the more difficult the item is for the group
– Formula: p = (Nu + Nl) / N
where: Nu = number of students from the upper group who answered the item correctly
Nl = number of students from the lower group who answered the item correctly
N = total number of examinees
– 0.00-0.20 : Very Difficult : Unacceptable
– 0.21-0.40 : Difficult : Acceptable
– 0.41-0.60 : Moderate : Highly Acceptable
– 0.61-0.80 : Easy : Acceptable
– 0.81-1.00 : Very Easy : Unacceptable
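The difficulty index can be computed directly from the formula above; the counts in the example call are hypothetical:

```python
def item_difficulty(nu, nl, n):
    """p = (Nu + Nl) / N: proportion of all examinees answering correctly."""
    return (nu + nl) / n

# Hypothetical item: 18 of the upper group and 6 of the lower group,
# out of 40 examinees in total, answered the item correctly.
p = item_difficulty(nu=18, nl=6, n=40)
print(p)  # 0.6 -> "Moderate", highly acceptable
```

Note that a higher p means an easier item, so the index is sometimes called an "easiness" index.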
b. Item discrimination
– measure of how well an item is able to distinguish between examinees who are knowledgeable and those who are not
– how well is each item related to the trait
– The discrimination index ranges from -1.00 to +1.00
– The closer the index to +1, the more effectively the item distinguishes between the two groups of examinees
– The acceptable index is 0.30 and above
– Formula: D = (Nu − Nl) / (N/2)
where: Nu = number of students from the upper group who answered the item correctly
Nl = number of students from the lower group who answered the item correctly
N = total number of examinees
– 0.40-above : Very Good Item : Highly Acceptable
– 0.30-0.39 : Good Item : Acceptable
– 0.20-0.29 : Reasonably Good Item: For Revision
– 0.10-0.19 : Poor Item : Unacceptable
– Below 0.10 : Very Poor Item : Unacceptable
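The discrimination index follows the same pattern; again, the counts in the example call are hypothetical:

```python
def discrimination_index(nu, nl, n):
    """D = (Nu - Nl) / (N/2), where the upper and lower groups
    are each half of the N examinees."""
    return (nu - nl) / (n / 2)

# Same hypothetical item as before: 18 upper-group and 6 lower-group
# examinees (of 40 in total) answered correctly.
d = discrimination_index(nu=18, nl=6, n=40)
print(d)  # 0.6 -> very good item, highly acceptable
```

A negative D would mean more low scorers than high scorers got the item right, which flags a miskeyed or defective item.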
c. Item reliability index - the higher the index, the greater the test’s internal consistency
d. Item validity index - the higher the index, the greater the test’s criterion-related validity
e. Distracter Analysis
– All of the incorrect options, or distractors, should be equally distracting
– preferably, each distractor should be selected by a greater proportion of the lower-scoring group than of the top group
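A distractor analysis reduces to counting how often each option is chosen in the upper and lower groups. The responses below are hypothetical, for one multiple-choice item whose keyed answer is "B":

```python
# Hypothetical responses of 10 upper-group and 10 lower-group examinees
# to one multiple-choice item; the keyed (correct) answer is "B".
upper = ["B", "B", "A", "B", "C", "B", "B", "D", "B", "B"]
lower = ["A", "B", "C", "D", "A", "C", "B", "D", "C", "A"]

for option in "ABCD":
    u, l = upper.count(option), lower.count(option)
    if option == "B":
        note = "keyed answer"
    else:
        # a working distractor draws more low scorers than high scorers
        note = "working distractor" if l > u else "needs review"
    print(f"{option}: upper={u}, lower={l} ({note})")
```

A distractor chosen as often by the top group as by the bottom group (or chosen by almost no one) is not doing its job and should be revised.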
f. Overall Evaluation of Test Items
– Each item is given an overall evaluation by jointly considering its difficulty level and discriminative power
✓ In the case of individual testing, where each question is given orally, unintended help can be given by facial expression or words of
encouragement. The person taking the test is always concerned to know how well he is doing and watches the examiner for indications of his
success. The examiner must maintain a completely unrevealing expression, while at the same time silently assuring the subject of his
interest in what he says or does.
✓ In individual testing, the tester observes the subject’s performance with care. He notes the time to complete each task and any errors, he
watches for any unusual method of approaching the task. Observation and note taking must be done in a subtle and unobtrusive manner so
as not to indirectly or directly affect the subject’s performance of the task
– General Procedures/Guidelines
✓ Conditions of testing
• Physical Condition. The physical condition where the test is given may affect the test scores. If the ventilation and lighting are poor,
the subject will be handicapped.
• Condition of the Person. The state of the person affects the results; if the test is given when he is fatigued, when his mind is preoccupied
with other problems, or when he is emotionally disturbed, the results will not be a fair sample of his behavior.
• Test Condition. The testing condition can often be improved by spacing the tests to avoid cumulative fatigue. Test questionnaires,
answer sheets and other testing materials needed must always be in good condition so as not to hinder good performance.
• Condition of the Day. Time of the day may influence scores, but is rarely important. Alert subjects are more likely to give their best
than subjects who are tired and dispirited. Equally good results can be produced at any hour, however, if the subjects want to do
well.
✓ Control of the group
• Group tests are given only to reasonably cooperative subjects who are expected to do as the tester requests. Group testing,
then, presents a problem of control.
• Directions should be given simply, clearly and singly. The subjects must have a chance to ask questions whenever they are
necessary but the examiner attempts to anticipate all reasonable questions by full directions.
• Effective control may be combined with good rapport if the examiner is friendly and avoids an antagonistic, overbearing or fault-finding attitude.
• The goal of the tester is to obtain useful information about people; that is, to elicit good information from the results of the test. There
is no value in adhering rigidly to a testing schedule if the schedule will not give true information. Common sense is the only safe guide
in exceptional situations.
✓ Directions of the subject
• The most important responsibility of the test administrator is giving directions.
• It is imperative that the tester gives the directions exactly as provided in the manual. If the tester understands the importance of this
responsibility, it is simple to follow the printed directions, reading them word for word, adding nothing and changing nothing.
✓ Judgments left to the examiner
• The competent examiner must possess a high degree of judgment, intelligence, sensitivity to the reactions of others, and
professionalism, as well as knowledge with regards to scientific methods and experience in the use of psychometric techniques.
• No degree of mechanical perfection of the test themselves can ever take the place of good judgment and psychological insight of
the examiner.
✓ Guessing
• It is against the rules for the tester to give supplementary advice; he must retreat to such formulas as “Use your judgment.”
• The person taking the test is usually wise to guess freely. (But the tester is not to give his group an advantage by telling them this
trade secret.)
• From the point of view of the tester, the tendency to guess is an unstandardized aspect of the testing situation which interferes with
accurate measurement.
• The systematic advantage of the guesser is eliminated if the test manual directs everyone to guess, but guessing introduces larger
chance errors. Statistical comparisons of “do not guess” and “do guess” instructions show that with the latter, the test
has slightly less predictive value.
• The most widely accepted practice now is to educate students that wild guessing is to their disadvantage, but to encourage them to
respond when they can make an informed judgment as to the most reasonable answer even if they are uncertain.
• The motivation most helpful to valid testing is a desire on the part of the subject that the score be valid. Ideally the subject becomes
a partner in testing himself. The subject must place himself on a scale, and unless he cares about the result he cannot be
measured accurately.
• The desirability of preparing the subject for the test with appropriate advance information is increasingly recognized. This information
increases the person’s confidence and reduces the test anxiety they might otherwise have.
– Scoring
✓ Hand scoring ✓ Machine scoring
7. Responsible Report Writing and Communication of Test Results
– What is a psychological report?
✓ an abstract of a sample of behavior of a patient or a client derived from results of psychological tests.
✓ A very brief sample of one’s behavior
– Criteria for a good psychological report
✓ Individualized – written specifically for the client
✓ Directly and adequately answers a referral question
✓ Clear – written in a language that can be easily understood
✓ Meaningful – perceived by the reader as clear and is understood by the reader
✓ Synthesized – details are formed into broader concepts about the specific person
✓ Delivered on time
– Principles of value in writing individualized psychological report
✓ Avoid mentioning general characteristics, which could describe almost anyone, unless the particular importance in the given case is made
clear.
✓ Describe the particular attributes of the individual fully, using as distinctive terms as possible.
✓ Simple listing of characteristics is not helpful; tell how they are related and organized in the personality.
✓ Information should be organized developmentally, with respect to the timeline of the individual’s life.
✓ Many of the problems of poor reports, such as vague generalizations, overqualification, clinging to the immediate data, stating the obvious
and describing stereotypes are understandable but undesirable reactions to uncertainty.
✓ Validate statements with actual behavioral responses.
✓ Avoid, if possible, the use of qualities such as “It appears”, “tends to”, etc. for these convey the psychologist’s uncertainties or indecisions.
✓ Avoid using technical terms. Present them using layman’s language
– Levels of Psychological Interpretation
✓ Level I
• There is a minimal amount of any sort of interpretation
• There is minimal concern with intervening processes
• Data are primarily treated in a sampling or correlational way
• There is no concern with underlying constructs
• Found in large-scale selection testing
• For psychometric approaches
✓ Level II
• Descriptive generalizations - From the particular behaviors observed, we generalize to more inclusive, although still largely behavioral
and descriptive categories. Thus, they note, a clinician might observe instances of slow bodily movements and excessive delays in
answering questions and from this infer that the patient is “retarded motorically.” With the further discovery that the patient eats and
sleeps poorly, cries easily, reports a constant sense of futility and discouragement and shows characteristic test behaviors, the
generalization is now broadened as “depressed.”
• Hypothetical constructs - Assumption of an inner state which goes logically beyond description of visible behavior. Such constructs
imply causal conditions, related personality traits and behaviors and allow prediction of future events. It is the movement from
description to construction which is the essence of clinical interpretation
✓ Level III
• The effort is to develop a coherent and inclusive theory of the individual life or a “working image” of the patient. In terms of a general
theoretical orientation, the clinician attempts a full-scale exploration of the individual’s personality, psychosocial situation, and
developmental history
– Sources of Error in Psychological Interpretation
✓ Information Overload
• Too much material, making the clinician overwhelmed
• Studies have shown that clinical judges typically use less information than is available to them
• The need is to gather optimal, rather than maximal, amount of information of a sort digestible by the particular clinician
• Obviously, familiarity with the tests involved, the type of patient, the referral questions and the like figure in deciding how much of what kind of
material is collected and how extensively it can be interpreted
✓ Schematization
• All humans have a limited capacity to process information and to form concepts
• Consequently, the resulting picture of the individual is schematized and simplified, perhaps centering on one or a few salient,
dramatic and often pathological characteristics
• The resulting interpretations are too organized and consistent and the person emerges as a two-dimensional creature
• The clinical interpreter has to be able to tolerate complexity and deal at one time with more data than he can comfortably handle
✓ Insufficient internal evidence for interpretation
• Ideally, interpretations should emerge as evidence converges from many sources, such as different responses and scores of the same
tests, responses of different tests, self-report, observation, etc.
• Particularly for interpretations at higher levels, supportive evidence is required
• Results from lack of tests, lack of responses
• Information between you and the client
✓ Insufficient external verification of interpretation
• Too often clinicians interpret assessment material and report on the patients without further checking on the accuracy of their
statements
• Information between you and the relevant others
• Verify statements made by patients
✓ Overinterpretation
• “Wild analysis”
• Temptation to over-interpret assessment material in pursuit of a dramatic or encompassing formulation
• Deep interpretations, seeking for unconscious motives and nuclear conflicts or those which attempt genetic reconstruction of the
personality are always to be made cautiously and only on the basis of convincing evidence
• Interpreting symbols in terms of fixed meanings is a cheap and usually inaccurate attempt at psychoanalytic interpretation
• At all times, the skillful clinician should be able to indicate the relationship between the interpreted hypothetical variable and its
referents in overt behavior
✓ Lack of Individualization
• It is perfectly possible to make correct statements which are entirely worthless because they could as well apply to anyone under most
conditions
• “Aunt Fanny syndrome”/”PT Barnum Effect”
• What makes the person unique (e.g., both patients are anxious – how does one patient manifest his anxiety)
✓ Lack of Integration
• Human personality is organized and integrated, usually in a hierarchical system
• It is of central importance to understand which facets of the personality are most central and which are peripheral, which needs
subserve others, and how defensive, coping and ego functions are organized, if understanding of the personality is to be achieved
• Over-cautiousness, insufficient knowledge or a lack of a theoretical framework are sometimes revealed in contradictory interpretations
made side by side
• On the face of it, someone cannot be called both domineering and submissive
✓ Overpathologizing
• Always highlights the negative not the positive aspect of behavior
• Emphasizes the weakness rather than the strengths of a person
• A balance between the positive and negative must be the goal
• Sandwich method (positive-negative-positive) is a recommended approach
✓ Over-“psychologizing”
• Giving of interpretation when there is none (e.g., scratching of hands – anxious, itchy)
• Avoid generalized interpretations of overt behaviors
• Must probe into the meaning/motivations behind observed behaviors
– Essential Parts of a Psychological Report
✓ Industrial setting
• Identifying Information
• Test Administered
• Test Results
• Skills and Abilities
• Personality Profile
• Summary/Recommendation
✓ Clinical setting
• Personal Information
• Referral Question
• Test Administered
• Behavioral Observation (Test and Interview)
• Test Results and Interpretation
• Summary Formulation
• Diagnostic Impression
• Recommendation
F. Rights of Test Takers
1. Be treated with courtesy, respect, and impartiality, regardless of your age, disability, ethnicity, gender, national origin, religion, sexual orientation or
other personal characteristics
2. Be tested with measures that meet professional standards and that are appropriate, given the manner in which the test results will be used
3. Receive information regarding their test results
4. Least stigmatizing label
5. Informed Consent
6. Privacy and Confidentiality