Edition 5 Notes

Chapter 1 of 'Scale Development: Theory and Applications' emphasizes the critical role of measurement in educational research, highlighting the historical and theoretical foundations of social measurement. It distinguishes between scales, indices, and emergent variables, while stressing the importance of valid and reliable measurement tools to avoid invalid conclusions and policy missteps. The chapter also discusses the relationship between theory and measurement, outlining common pitfalls in scale development and the ethical implications of poor measurement practices.

Based on Chapter 1 of Scale Development: Theory and Applications by Robert F. DeVellis & Carolyn T. Thorpe (5th Edition), here are comprehensive, scholarly PhD-level class notes tailored for students in education, particularly those engaged in scale construction and educational research:

🧠 PhD Class Notes — Chapter 1: Overview of Scale Development

🎯 Learning Outcomes

By the end of this session, students will:

 Understand the historical, theoretical, and scientific foundations of measurement in the social sciences.
 Recognize the distinction between scales, indices, and emergent variables.
 Evaluate the implications of poor measurement for educational research.
 Connect theory with operationalization in developing valid, reliable measurement instruments.

1️⃣ The Centrality of Measurement in Educational Research


Measurement is not just a procedural step; it is foundational to scientific inquiry. In
educational research, constructs like motivation, engagement, teacher efficacy, or school
climate are latent—they cannot be directly observed, only inferred. Measurement tools like
scales act as bridges between abstract theoretical ideas and observable data.

“Measurement is a fundamental activity of science.” (DeVellis & Thorpe, 2021)

2️⃣ Historical and Theoretical Origins of Social Measurement

Period | Key Developments
Ancient to Pre-Scientific Era | Measurement emerged from social needs (e.g., land allocation, taxation), not purely scientific motivations.
17th–18th Century | Measurement begins to adopt error-reduction and averaging techniques (e.g., Newton's averaging of astronomical observations).
19th Century | Emergence of statistics, psychometrics, and mental testing (e.g., Galton, Binet, Spearman, Pearson).
20th Century Onward | Expansion into psychophysics, factor analysis, and latent variable modeling. Measurement matures into a scientific and ethical practice.

3️⃣ Key Definitions and Constructs


Term | Definition
Scale | A set of items measuring a latent variable (e.g., depression, motivation) that is presumed to cause variation in item responses.
Index | A collection of cause indicators that contribute to a composite outcome (e.g., socioeconomic status from income, education, occupation).
Emergent Variable | Groupings of items that share similarity but have no causal connection (e.g., a checklist of unrelated facts).
Latent Variable | A theoretical, unobserved trait that influences observed responses (e.g., "engagement").
True Score | The value of the latent variable in the absence of measurement error.
Measurement Error | The difference between the observed score and the true score; includes random and systematic error.

4️⃣ Relationship Between Theory and Measurement


Educational constructs are theory-driven. The process of measurement requires clarifying the
construct before writing a single item.

Poorly conceptualized constructs → poor measures → invalid research conclusions.

 Theoretical models shape what and how we measure.


 Measurement models (e.g., classical test theory, IRT) are used to translate constructs
into scorable data.
 Understanding how theory informs items is key to creating construct-valid instruments.

5️⃣ Composite Measures in Educational Research


Tool | Purpose | Example
Scales | Measure latent variables (effect indicators) | Motivation scale
Indices | Combine components that influence an outcome (cause indicators) | Dropout risk index
Emergent Variables | Group similar but non-causal items | Demographic profiles

Scales are most relevant for educational research involving beliefs, attitudes, self-efficacy, and
other psychological constructs.

6️⃣ Not All Scales Are Created Equal


Common mistakes include:

 Assembling items without theoretical grounding.


 Equating scales with indices or emergent variables.
 Using single-item measures for complex constructs.
 Mistaking convenience for validity.

“Assembly may be a more appropriate term than development.” (DeVellis & Thorpe, 2021)

7️⃣ Costs of Poor Measurement


 Invalid conclusions: E.g., concluding that teacher stress has no effect on performance
due to poor measurement.
 Policy missteps: Misguided interventions based on unreliable scales.
 Ethical concerns: Participant burden, misrepresentation, and harm.

Key Insight: Better no measurement than bad measurement—because bad data can mislead
more than no data.

8️⃣ Relevance to Educational Practice


Application | Measurement Need
Curriculum Evaluation | Scales to measure cognitive engagement
Teacher Training | Tools to assess beliefs about inclusive practices
Student Services | Indices to assess dropout risk
Educational Policy | Psychometrically sound measures of school climate
9️⃣ Critical Reflection Questions
1. How does theory guide the choice between developing a scale vs. an index?
2. What are the dangers of using atheoretical or borrowed instruments in education?
3. How might a poorly defined construct (e.g., “critical thinking”) affect an entire research
program?

🔟 Class Activity Suggestions


 Scale Critique: Review an existing published educational scale and analyze whether it
meets theoretical, reliability, and validity standards.
 Construct Mapping: Choose an educational concept (e.g., student agency) and define its
theoretical boundaries before writing items.
 Role Play: Debate the consequences of poor measurement from the perspective of a
teacher, policymaker, and researcher.

📌 Summary
 Measurement is essential in translating educational theory into actionable data.
 Not all multi-item tools are valid scales; precision in terminology and logic matters.
 Scale development requires rigorous conceptual, methodological, and ethical
consideration.
 In educational research, quality measurement leads to better science, better practice,
and more equitable outcomes.


Here are PhD-level class notes for Chapter 1: Overview of Scale Development: Theory and
Applications (5th ed.) by Robert F. DeVellis & Carolyn T. Thorpe. These notes are crafted for
advanced educational research students with emphasis on psychometrics, scale construction, and
educational measurement applications.

PhD Class Notes: Chapter 1 — Overview


1. Introduction: Why Measurement Matters in Social Science
 Measurement is foundational in both physical and social sciences.
 In social sciences, researchers often cannot directly observe the constructs of interest
(e.g., beliefs, attitudes, aspirations).
 Scale development becomes essential when existing instruments are inadequate,
unavailable, or unsuitable.
 Key Insight: Even in highly theoretical research, progress hinges on the ability to
quantify constructs reliably and validly.

2. Theoretical Foundations and Evolution of Measurement

2.1 Historical Antecedents

 Measurement predates science; social needs (e.g., taxation, voting) necessitated quantification.
 In ancient China and Greece, measurement appeared in early civil services and
philosophical methods.
 Scientific measurement advanced with Isaac Newton's use of averaging and
developments in astronomy.
 Recognition of random vs. systematic error catalyzed statistical methods.

2.2 Psychometrics and Statistical Evolution

 Francis Galton and Karl Pearson formalized human variability and correlation analysis.
 Development of factor analysis (Spearman, Thurstone) and mental testing (Binet, Lord
& Novick).
 Psychophysics influenced the application of mathematical structure to subjective
phenomena (e.g., Stevens' scale types).

2.3 Contribution to Educational and Psychological Testing

 Emergence of Item Response Theory (IRT) and classical test theory broadened
psychometrics.
 Shift from descriptive to etiologically informed diagnostic classifications (e.g., DSM-5 debates, Insel’s RDoC project).
 Measurement theory matured through its application in education, health, marketing, and
psychiatry.

3. Key Concepts and Terminology

Concept | Definition | Relevance
Latent Variable | Unobservable construct inferred from item responses | Central to psychometric modeling
Scale | Composite of items sharing a common cause (effect indicators) | Used to measure theoretical constructs (e.g., depression)
Index | Composite of items causing a latent variable (cause indicators) | Used to represent constructs like socioeconomic status
Emergent Variable | Aggregation without causal relationship | Used for descriptive categories
Theoretical vs. Atheoretical Measurement | Measures built upon vs. independent of theory | Guides instrument design and interpretation

4. Relationship Between Theory and Measurement

 The constructs measured must reflect a clear theoretical foundation.


 Errors in scale development often stem from weak construct clarity or over-reliance on
atheoretical tools.
 Measures should map onto theoretical definitions to ensure interpretive validity.

"Poor measurement imposes an absolute limit on the validity of the conclusions one can reach."
— DeVellis & Thorpe

5. Common Measurement Challenges

 Overuse of Off-the-Shelf Tools: Misaligned with study purpose.


 Construct Underrepresentation: Measures fail to capture full domain.
 Construct-Irrelevant Variance: Unwanted sources influencing responses.
 Poorly Constructed Scales: Use of unvalidated, face-valid items.

6. Scales vs. Indices: A Conceptual Distinction

Aspect | Scale | Index
Causality | Items reflect a latent construct (effect indicators) | Items cause the latent variable (cause indicators)
Example | Depression symptoms | Presidential electability factors
Analysis | Factor analysis appropriate | Often uses regression, SEM, or formative modeling

7. Ethical and Practical Considerations


 Convenience vs. Quality: Short scales may reduce respondent fatigue but compromise
reliability.
 Cost of Poor Measurement: Invalid conclusions, misdirected interventions, reduced
scientific impact.
 Ethical Responsibility: Researchers must ensure the meaningfulness of their data
collection efforts.

8. Implications for Educational Research and Scale Construction

 Educational variables (e.g., motivation, teacher efficacy, self-concept) are typically latent
and abstract.
 Scale development requires:
o Strong construct clarity
o Integration of relevant theory
o Consideration of measurement model (reflective vs. formative)
 Use of valid, reliable instruments is critical in policy analysis, curriculum evaluation,
and student assessments.

9. Preview of Coming Chapters

 Ch. 2: Understanding latent variables and the measurement model.


 Ch. 3: Classical reliability theory and reliability estimation.
 Ch. 4: Forms of validity—content, construct, criterion.
 Ch. 5: Stepwise guide to scale construction.
 Ch. 6–9: Factor analysis, indices, item response theory (IRT), and broader research
integration.

Discussion Questions for PhD Seminar

1. How does the misuse of atheoretical measurement compromise the integrity of empirical
research?
2. In what ways can scale construction contribute to theory refinement in educational
contexts?
3. Discuss how poor measurement tools might misinform educational policy decisions.
4. Compare the use of scales and indices in assessing teacher performance in Pakistani
classrooms.
5. Reflect on how psychometrics has shifted the paradigm of educational assessment.

Here are PhD-level class notes for Chapter 2: Understanding the Latent Variable from Scale
Development: Theory and Applications (5th ed.) by Robert F. DeVellis & Carolyn T. Thorpe.
These notes are developed for doctoral students in education, with an emphasis on psychometrics
and educational measurement.

PhD Class Notes: Chapter 2 – Understanding the Latent Variable

1. Conceptual Overview

 This chapter presents a conceptual model for understanding how scale items relate to the
constructs they aim to measure.
 Central to this model is the latent variable—an unobservable entity assumed to underlie
observed behaviors or responses.
 The focus is on the classical measurement model, which dominates scale development
due to its conceptual clarity and broad applicability in social sciences.

2. Constructs vs. Measures

 Researchers are typically interested in constructs (e.g., anxiety, motivation), not in the
items or scales themselves.
 Measures (e.g., questionnaires) are proxies for these constructs.
 Latent variables are theoretical; measures are empirical.
o Example: A questionnaire about parental aspirations captures a latent sentiment,
not the observed responses themselves.

3. Defining the Latent Variable

A latent variable has two defining features:

 Latent: Not directly observable.


 Variable: Has differing intensities or levels across individuals.
E.g., "Parental aspiration for child achievement" is latent (unseen) and variable (it differs among
individuals and across contexts).

 These are individual-level characteristics, so data is usually gathered from the individual (not proxies) unless explicitly justified.

4. Latent Variables as Causal Constructs

 In classical test theory, the latent variable is assumed to cause the responses to scale
items.
 Therefore, item values should:
o Correlate with the latent variable, and
o Correlate with each other (internal consistency).

If five items are caused by a latent trait (e.g., self-efficacy), they should be statistically related
due to their shared source.

5. Visualizing Relationships: Path Diagrams

 Path diagrams offer a visual representation of how latent variables influence item
responses.
 Key diagram elements:
o Arrows represent causal paths (e.g., latent variable → item).
o Circles or ellipses represent unmeasured variables (e.g., error terms).
o Rectangles represent measured variables (items).

Figure Insight:

 Correlation between items (e.g., X1 and X5) is a product of the path coefficients
between each item and the latent variable.

6. Classical Measurement Assumptions

The Classical Measurement Model assumes:

1. Observed score (X) = True score (T) + Error (e)


2. Error is random, uncorrelated across items, and has zero mean.
3. Error is not correlated with true score.

These assumptions enable inference about the latent variable from item correlations.
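
A short simulation can make the decomposition concrete. The sketch below is not from the text; the variance values are arbitrary. It generates true scores and random error, then recovers reliability as the share of observed variance attributable to true scores.

```python
# Minimal simulation (hypothetical values): X = T + e under the classical model.
import numpy as np

rng = np.random.default_rng(0)
n = 10_000                       # simulated respondents
true = rng.normal(0.0, 1.0, n)   # latent true scores T (variance 1.0)
error = rng.normal(0.0, 0.6, n)  # random error e (variance 0.36), uncorrelated with T
observed = true + error          # observed scores X = T + e

# Reliability = true-score variance / observed-score variance
print(round(true.var() / observed.var(), 2))   # ≈ 1 / (1 + 0.36) ≈ 0.74
```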
7. Measurement Models: Types and Flexibility

Model | Assumptions | Flexibility
Parallel Test Model | Equal true scores and equal error variances | Rigid
Tau-Equivalent Model | Equal true scores; allows unequal error variances | Moderate
Essential Tau-Equivalence | True scores may differ by a constant; still assumes equal latent influence | Flexible
Congeneric Model | Items share a common latent variable, but differ in strength and error variance | Most flexible

 These models help guide scale construction and selection of appropriate statistical
procedures (e.g., reliability estimates).

8. Choosing a Causal Model in Research

 Measurement design should match construct definition.


 Misalignment leads to misleading results.
o E.g., Measuring the work environment: Instead of asking about features (lighting,
computer), assess perceptions of adequacy, which are psychological and easier
to model causally.
 Clarifying the latent structure early in the research process can make scale development
more coherent and valid.

9. Educational Applications

 Latent constructs such as self-regulation, metacognition, and teaching efficacy are central to educational research.
 Understanding how these constructs are reflected in observed behaviors (e.g., responses
to Likert-scale items) is critical for:
o Developing valid assessments
o Conducting structural equation modeling (SEM)
o Improving teacher training and student learning evaluations

10. Discussion Questions for PhD Seminar


1. How does the assumption that a latent variable “causes” item responses affect item
construction and scale interpretation?
2. In educational contexts, when might a congeneric model be more appropriate than a
parallel test model?
3. What risks arise if latent variables are not clearly defined or mismatched with the
measurement model?
4. How would you represent teacher self-efficacy as a latent variable in a path diagram?
What observable items would you include?
5. How can classical measurement theory support or limit educational policy decisions
based on survey data?


Here are PhD-level class notes for Chapter 3: Scale Reliability from Scale Development:
Theory and Applications (5th Edition) by Robert F. DeVellis & Carolyn T. Thorpe. These notes
are contextualized for doctoral-level students in education and psychometrics.

PhD Class Notes: Chapter 3 – Scale Reliability

1. Conceptual Foundation of Reliability

 Reliability is the proportion of variance in observed scores that reflects the true score
variance rather than measurement error.
 Essential in educational research where instruments measure latent constructs (e.g.,
student motivation, teaching efficacy).
 A perfectly reliable scale has a reliability of 1.0, meaning no measurement error.

Formulaic expression:
Reliability = True Score Variance / Observed Score Variance

2. Classical Test Theory Model (Recap)

 Observed Score (X) = True Score (T) + Error (E)


 The greater the error variance, the lower the reliability.
 Reliability estimation centers on how much variation is attributable to true differences
vs. error.
3. Internal Consistency and Coefficient Alpha (α)

Internal Consistency

 Reflects the homogeneity of scale items—whether they measure the same underlying
construct.
 High inter-item correlations suggest items are caused by the same latent variable.

Cronbach’s Alpha (α)

 Most widely used index of internal consistency.


 Measures the average inter-item correlation, adjusted by the number of items.
 Assumes essential tau-equivalence: items are equally good indicators of the construct.

Formula:

\alpha = \frac{k}{k - 1} \left(1 - \frac{\sum_{i} \sigma^2_i}{\sigma^2_{\text{total}}} \right)

where
 k = number of items
 \sigma^2_i = variance of item i
 \sigma^2_{\text{total}} = variance of the sum of all items
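
The formula translates directly into a few lines of code. The sketch below is a minimal illustration with hypothetical Likert-type responses (not code from the text); it computes alpha from a respondents-by-items matrix exactly as defined above.

```python
# A minimal sketch (hypothetical data): Cronbach's alpha from the variance formula.
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """items: respondents x items matrix of numeric responses."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)       # sigma^2_i for each item
    total_variance = items.sum(axis=1).var(ddof=1)   # variance of the summed scale
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Hypothetical example: 5 respondents answering 4 Likert-type items.
responses = np.array([
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 5, 5, 4],
    [3, 3, 2, 3],
    [1, 2, 1, 2],
])
print(round(cronbach_alpha(responses), 3))
```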

4. Critiques and Limitations of Alpha

 Assumes unidimensionality and equal item-construct associations.


 Inflated by item quantity, even if the added items are poor indicators.
 Alpha if item deleted can mislead when applied at the sample level, not the population
level.
 Violation of tau-equivalence can reduce the validity of α as a reliability estimate.

5. Remedies and Alternatives

Coefficient Omega (ω)

 Overcomes some limitations of α.


 Allows for different item loadings, making it more accurate under congeneric
measurement models.
 Increasingly recommended as a default reliability indicator, especially in SEM.
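
For readers who want to see the mechanics, the sketch below computes omega under an assumed one-factor congeneric model using hypothetical standardized loadings; in practice these loadings would come from a CFA or EFA solution rather than being typed in by hand.

```python
# A minimal sketch (hypothetical standardized loadings from a one-factor model).
import numpy as np

loadings = np.array([0.75, 0.68, 0.80, 0.55, 0.62])  # assumed standardized loadings
error_variances = 1 - loadings**2                     # uniqueness under standardization

# Omega: squared sum of loadings over total variance (common + unique)
omega = loadings.sum()**2 / (loadings.sum()**2 + error_variances.sum())
print(round(omega, 2))
```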
6. Covariance Matrix: Basis for Alpha

 The covariance matrix of items provides unstandardized information about item relationships.
 Standardized form = correlation matrix.
 Strong covariances among items → strong latent-variable relationships.

7. Reliability Estimation Methods

a. Alternate-Forms Reliability

 Two equivalent versions of the scale administered to the same group.


 High correlation between forms indicates temporal and content consistency.

b. Split-Half Reliability

 Divides items into two subsets (e.g., odd–even) and correlates scores.
 Vulnerable to fatigue or order effects if not randomized properly.
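
A minimal sketch of an odd/even split with the Spearman-Brown correction, using simulated data (not from the text):

```python
# Split-half reliability with Spearman-Brown correction (simulated responses).
import numpy as np

rng = np.random.default_rng(1)
common = rng.normal(size=(200, 1))                       # shared latent factor
items = common + rng.normal(scale=1.0, size=(200, 10))   # 10 items = factor + noise

half_a = items[:, 0::2].sum(axis=1)   # "odd" items
half_b = items[:, 1::2].sum(axis=1)   # "even" items
r = np.corrcoef(half_a, half_b)[0, 1]

# Spearman-Brown projects the half-test correlation to full scale length
print(round(2 * r / (1 + r), 2))
```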

c. Test–Retest Reliability

 Same instrument used at two points in time.


 High correlation implies temporal stability.
 Can be compromised by learning effects or maturation.

d. Inter-Rater Reliability

 Used when judges/observers are measurement instruments.


 High agreement indicates the ratings reflect the observed phenomenon, not idiosyncrasies
of raters.

e. Reliability of Change Scores

 Computed as the difference between two time points.


 Often less reliable due to increased error from both measurements.

8. Statistical Power and Reliability

 Higher reliability increases statistical power, the ability to detect true relationships.
 Low-reliability instruments reduce effect sizes, increase Type II error.
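
The standard CTT attenuation formula (not quoted from the chapter, but consistent with it) shows why:

r_{xy}^{\text{observed}} = r_{xy}^{\text{true}} \sqrt{r_{xx}\, r_{yy}}

For example, a true correlation of .50 measured with two instruments whose reliabilities are both .70 is observed as roughly .50 × √(.70 × .70) = .35, which in turn requires a larger sample to detect.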
9. Generalizability Theory

 Extends classical reliability by recognizing multiple sources of measurement error (e.g., raters, occasions, items).
 Emphasizes variance decomposition using ANOVA-like models.

10. Application in Educational Research

 Ensuring reliable scales is critical for:


o Assessing teacher beliefs, student attitudes, policy impact
o Longitudinal studies tracking educational growth
 Trade-offs:
o Longer scales improve reliability but increase respondent burden.
o Shorter scales are efficient but risk reduced internal consistency.

11. Guidelines for Alpha Interpretation (General Research)

Alpha (α) | Interpretation
≥ 0.90 | Excellent (e.g., clinical diagnosis)
0.80–0.89 | Good (suitable for research use)
0.70–0.79 | Acceptable (caution needed)
< 0.70 | Poor (not acceptable for most uses)

12. Discussion Questions for PhD Seminar

1. Under what conditions might coefficient omega be preferable to Cronbach’s alpha?


2. How does internal consistency relate to unidimensionality of a scale?
3. How does reliability impact the validity and statistical power of a study in education?
4. When using change scores in an intervention study, what statistical precautions are
necessary?
5. What are the implications of using split-half reliability for long, fatigue-prone scales?

Here are the PhD-level class notes for Chapter 4: Scale Validity from Scale Development:
Theory and Applications (5th ed.) by Robert F. DeVellis & Carolyn T. Thorpe. These notes are
tailored for doctoral coursework in educational research, measurement theory, and
psychometrics.

PhD Class Notes: Chapter 4 – Scale Validity

1. Distinction Between Reliability and Validity

 Reliability: Concerns how well a latent variable influences a set of items.


 Validity: Concerns whether the latent variable that causes item covariation is the
construct of interest.
 A scale can be reliable but invalid—that is, consistently measuring the wrong construct.

2. What Is Validity?

 Validity is not a property of a test, but of the inference made from a test score in a
specific context.
 Traditional framework focuses on three core types:
1. Content Validity
2. Criterion-Related Validity
3. Construct Validity
 Messick’s unified view introduces broader validity, including consequential validity,
but is not widely adopted in practice.

3. Content Validity

Definition

 The extent to which scale items adequately sample the domain of the construct.
 Strongest when:
o The domain is well-defined (e.g., vocabulary lists).
o Items are sampled systematically from that domain.

Challenges

 For abstract constructs (e.g., stress, attitudes), boundaries are less clear.
 Risk of:
o Overly narrow item sampling (misses key aspects).
o Overly broad items (introduces construct-irrelevant variance).

Best Practices

 Use experts, literature reviews, and item mapping.


 Consider population and context (e.g., cost-consciousness might vary between legal,
health, or retail decisions).

4. Criterion-Related Validity

Definition

 The degree to which a scale correlates with an external criterion that it is intended to
predict or explain.
 Includes:
o Predictive validity (e.g., scale forecasts future GPA).
o Concurrent validity (e.g., scale correlates with current SAT score).
o Postdictive validity (e.g., childhood health predicts birth weight).

Cautions

 Criterion-related validity does not imply causality.


 High correlation ≠ accuracy in terms of predicting actual score values (e.g., perfect
correlation but different score distributions).
 May require score transformation to ensure interpretive accuracy.
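
A tiny numerical illustration (hypothetical values, not from the text) of the point that a high validity coefficient does not guarantee accurate point predictions:

```python
# Two score sets can correlate perfectly yet disagree about actual values.
import numpy as np

scale_scores = np.array([10, 20, 30, 40, 50])
criterion = scale_scores * 2 + 15                    # perfectly correlated, but shifted and rescaled
print(np.corrcoef(scale_scores, criterion)[0, 1])    # ≈ 1.0
print(np.mean(np.abs(scale_scores - criterion)))     # yet large absolute disagreement
```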

5. Construct Validity

Definition

 The extent to which a scale behaves as expected based on theoretical relationships with
other constructs.
 Coined by Cronbach & Meehl (1955).
 Involves testing hypothesized patterns of relationships between the target construct and
others.

Subtypes

 Convergent Validity: High correlation with related constructs.


 Discriminant (Divergent) Validity: Low/no correlation with unrelated constructs.
Multitrait-Multimethod Matrix (MTMM)

 Method to assess both convergent and discriminant validity.


 Ideal pattern:
o Highest correlations for same trait, same method.
o Moderate for same trait, different method.
o Low for different traits, same method.
o Lowest for different traits, different methods.

6. Face Validity

 Refers to whether a scale “looks like” it measures what it claims to.


 Often confused with content validity.
 Not scientific, based on subjective judgments.
 Can influence respondent behavior (e.g., social desirability) but is not sufficient for
validation.

7. Validity Is Context-Bound

 A scale may be valid in one context and invalid in another.


 Validity depends on:
o Population
o Setting
o Purpose
 Example: A parental aspiration scale valid in affluent contexts may not be valid in low-
resource settings.

8. Validity vs. Reliability

Property | Reliability | Validity
Focus | Consistency | Accuracy and appropriateness
Type of Error | Random error | Systematic error (e.g., construct misalignment)
Analogy | A broken clock that's always wrong | A well-functioning clock set to the wrong time zone

9. Application in Educational Research


 Assessing teacher beliefs, student motivation, or policy attitudes requires not just
reliable, but valid scales.
 Poor validity can misguide:
o Interventions
o Curriculum design
o Policy decisions

10. Discussion Questions for PhD Seminar

1. How can a scale be reliable but not valid? Provide examples from education.
2. What strategies ensure content validity in abstract constructs like “teaching efficacy”?
3. Discuss the practical use of MTMM matrices in scale development.
4. When is face validity helpful, and when might it be misleading?
5. What are the implications of criterion-related validity for high-stakes assessments in
Pakistani schools?


Here are the PhD-level class notes for Chapter 5: Guidelines in Scale Development from
Scale Development: Theory and Applications (5th Edition) by Robert F. DeVellis & Carolyn T.
Thorpe. This chapter is essential for doctoral students involved in empirical instrument
construction, especially in education and social science research.

PhD Class Notes: Chapter 5 – Guidelines in Scale Development

Overview

This chapter transitions from theory to application by providing a step-by-step framework for
building effective measurement instruments. It operationalizes the theoretical principles of
earlier chapters into nine practical steps for scale developers.

Step-by-Step Framework
Step 1: Define the Construct Clearly

 Clarity is foundational: many researchers proceed with vague constructs.


 Use theory and specificity to define the boundaries of the construct.
 Example: Instead of “barriers to adherence,” define whether the focus is fear, cost,
logistics, or misinformation.

Step 2: Generate an Item Pool

 Item quantity: Start large; a broad and redundant pool is useful.


 Item relevance: Ensure items tap the construct (not just a category).
 Redundancy is a strength if it supports internal consistency, not if it repeats phrasing or
format unnecessarily.

Writing Good Items

 Avoid double-barreled statements and ambiguous pronouns.


 Use clear grammar and neutral language.
 Balance item valence (positive/negative) but beware of mixing that compromises
reliability.

Step 3: Choose the Measurement Format

Common Response Formats:

 Likert Scale: Best for attitudes/opinions.


 Semantic Differential: Captures meaning through bipolar adjectives.
 Visual Analog: Used in pain, mood, or perception scales.
 Binary/Checklist: For behaviors or factual reporting.
 Decide early so items and response formats align with latent variable assumptions.

Step 4: Expert Review of Item Pool

 Solicit expert judgments on:


o Item clarity
o Content relevance
o Redundancy
o Theoretical alignment
 May involve rating items or sorting into categories.

Step 5: Cognitive Interviewing

 Test comprehension with potential respondents.


 Helps detect misunderstandings, ambiguous wording, and unintended meanings.
 Useful for refining instructions and response formats.

Step 6: Include Validation Items

 Add items to detect:


o Response biases (e.g., social desirability, acquiescence)
o Construct-related relationships (e.g., convergent/divergent scales)
 Embedding these items supports construct validity without additional data collection
phases.

Step 7: Administer to Development Sample

 Ideal sample size: ~300 recommended; depends on:


o Number of items
o Complexity of construct
o Planned factor analysis
 Small samples risk unstable correlations and misleading internal consistency.

Step 8: Evaluate Item Performance

 Use statistics to refine the scale:


o Item means and variances: Avoid floor/ceiling effects.
o Item-total correlations: Indicates alignment with latent trait.
o Reverse scoring: Ensure consistent directionality.
o Dimensionality: Use factor analysis to assess unidimensionality.
o Reliability: Calculate alpha or omega to assess internal consistency.
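
A minimal sketch of one of these statistics, the corrected item-total correlation, computed on simulated responses (hypothetical data; item 4 is deliberately noisy so its weakness shows up):

```python
# Corrected item-total correlations: each item vs. the sum of the remaining items.
import numpy as np

def corrected_item_total(items: np.ndarray) -> np.ndarray:
    total = items.sum(axis=1)
    return np.array([
        np.corrcoef(items[:, j], total - items[:, j])[0, 1]
        for j in range(items.shape[1])
    ])

rng = np.random.default_rng(2)
trait = rng.normal(size=(300, 1))
responses = trait + rng.normal(scale=[0.5, 0.7, 0.9, 2.5], size=(300, 4))  # item 4 is mostly noise
print(np.round(corrected_item_total(responses), 2))   # the weak item shows a much lower value
```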

Step 9: Optimize Scale Length


 Tradeoff between brevity and reliability:
o Shorter scales reduce burden but risk loss of measurement precision.
o Drop weak items cautiously—improve interpretability and alpha.
o Consider split samples to test reliability and stability.

Additional Considerations

 Item valence: Mixing positively and negatively worded items can reduce method bias
but may confuse respondents.
 Response scale range:
o Even vs. odd number of response options affects neutrality.
o Too many categories may overwhelm; too few limit precision.

Educational Research Application


In contexts like teacher education, curriculum evaluation, or policy impact studies, this chapter’s
framework guides:

 Creating contextually appropriate items


 Ensuring reliability and validity
 Using cognitive interviews to enhance scale interpretability
 Optimizing length without sacrificing psychometric quality

Discussion Questions for PhD Seminar


1. Why is redundancy both a strength and a potential weakness in scale development?
2. How can cognitive interviewing improve the validity of a scale in multilingual contexts?
3. When developing a scale for student self-regulation, how would you select the
appropriate response format?
4. What factors would influence your decision to split a sample during scale development?
5. How does expert review differ from cognitive interviewing in terms of identifying item
flaws?

Here are the PhD-level class notes for Chapter 6: Factor Analysis from Scale Development:
Theory and Applications (5th Edition) by Robert F. DeVellis & Carolyn T. Thorpe. This chapter
is central to psychometric scale construction and crucial for advanced students conducting
educational measurement and instrument validation.

PhD Class Notes: Chapter 6 – Factor Analysis

1. Introduction: Purpose of Factor Analysis

Factor analysis is a data reduction technique that helps uncover the underlying latent
structure of a set of observed variables (e.g., survey items). It serves multiple purposes:

 Identifies number of latent constructs (factors)


 Groups items measuring the same latent trait
 Condenses item sets into factor scores
 Helps evaluate the construct validity of a scale

2. Theoretical Underpinnings

From General Factor Models

 Assumes multiple latent variables may cause covariation among observed items.
 Items are modeled as effect indicators, causally influenced by latent factors.
 Useful in evaluating unidimensionality and multidimensionality.

Illustration

 Example: Items about "affect" might reflect one general or several specific constructs
(e.g., anxiety, depression).
 Factor analysis helps determine the optimal structure empirically.

3. Key Concepts and Terminology

Term | Definition
Factor | A latent variable accounting for shared variance among observed variables
Loading | Correlation between item and factor (ranges from -1 to +1)
Eigenvalue | Total variance in all items explained by a given factor
Communality | Proportion of each item's variance explained by all retained factors
Rotation | Mathematical technique to clarify factor structure by simplifying item–factor relationships

4. Types of Factor Analysis

Exploratory Factor Analysis (EFA)

 Used when the factor structure is unknown.


 Helps discover the number and nature of latent variables.
 Frequently used during scale development.

Confirmatory Factor Analysis (CFA)

 Used to test a predefined factor structure.


 Involves model specification, fit indices, and hypothesis testing.
 Part of Structural Equation Modeling (SEM).

5. Factor Extraction Methods

Principal Components Analysis (PCA)

 A data reduction method, not truly factor analysis.


 Components are combinations of items, not latent variables.

Common Factor Analysis

 Assumes that shared variance is due to latent variables.


 Suitable for scale development.

6. Determining the Number of Factors

Method | Description
Eigenvalue Rule (Kaiser's Criterion) | Retain factors with eigenvalues > 1.0.
Scree Plot | Visual plot where the "elbow" indicates the optimal number of factors.
Parallel Analysis | Compares eigenvalues from real data with those from random data; retain factors above the random line—statistically grounded and recommended by the authors.
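
A bare-bones parallel analysis can be written with nothing more than NumPy. The sketch below illustrates the logic only (it is not the authors' procedure); `survey` is a hypothetical respondents-by-items matrix.

```python
# Minimal parallel analysis: observed eigenvalues vs. mean eigenvalues of random data.
import numpy as np

def parallel_analysis(data: np.ndarray, n_iter: int = 100, seed: int = 0) -> int:
    rng = np.random.default_rng(seed)
    n, k = data.shape
    observed = np.linalg.eigvalsh(np.corrcoef(data, rowvar=False))[::-1]   # descending
    random_eigs = np.zeros((n_iter, k))
    for i in range(n_iter):
        random_eigs[i] = np.linalg.eigvalsh(
            np.corrcoef(rng.normal(size=(n, k)), rowvar=False)
        )[::-1]
    threshold = random_eigs.mean(axis=0)
    return int((observed > threshold).sum())   # number of factors to retain

# Hypothetical usage with a respondents x items matrix `survey`:
# n_factors = parallel_analysis(survey)
```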
7. Factor Rotation

 Helps achieve simple structure: each item loads strongly on one factor only.
 Types:
o Orthogonal (e.g., Varimax): Assumes factors are uncorrelated.
o Oblique (e.g., Promax): Allows factors to correlate, often more realistic in social
sciences.

8. Interpreting Factors

 Examine factor loadings: items with loadings ≥ .50 typically considered strong
indicators.
 Content of high-loading items reveals latent construct meaning.
 Naming factors is interpretive; validity depends on theoretical and empirical coherence.

9. Advanced Models

Bifactor Models

 Includes one general factor and multiple specific factors.


 Helps identify essential unidimensionality when general and specific traits coexist.
 Requires specialized software (e.g., psych package in R) and expert judgment.

Hierarchical Models

 Reflect nested constructs (e.g., "general affect" → "anxiety", "depression").


 Useful for building multilevel scales.

10. Practical Application: Example from DeVellis et al.

 12 items about parental beliefs regarding children’s health analyzed using EFA.
 Scree plot and eigenvalues indicated a two-factor solution.
 Varimax rotation yielded clear item clusters:
o Factor 1: Items about parental influence.
o Factor 2: Items about luck/fate.
 Loadings > .50 highlighted strong indicators for each factor.
11. Implications for Educational Research

 Identifying construct structure in student motivation, teacher identity, learning strategies.
 Enhancing scale precision and theoretical alignment.
 Supporting validation through empirical factor structures.
 Informing item selection and scale refinement.

12. Discussion Questions for PhD Seminar

1. What are the limitations of relying solely on eigenvalue > 1.0 to determine factor
retention?
2. How does factor rotation enhance the interpretability of multidimensional constructs?
3. Compare and contrast principal components analysis and common factor analysis in the
context of educational research.
4. How would you apply bifactor analysis to a scale measuring both general teaching self-
efficacy and classroom-specific factors?
5. What theoretical and practical risks arise if one retains too many or too few factors?


Here are the PhD-level class notes for Chapter 7: The Index from Scale Development: Theory
and Applications (5th ed.) by Robert F. DeVellis & Carolyn T. Thorpe. This chapter is
particularly relevant for advanced education researchers seeking to understand how indices differ
from scales and how to construct and validate them appropriately.

PhD Class Notes: Chapter 7 – The Index

1. Core Distinction: Scale vs. Index

Type | Cause/Effect Relationship | Indicator Type | Example
Scale | Latent variable causes item responses | Reflective or effect indicators | "I feel sad" reflects depression
Index | Items cause or define the construct | Formative or causal indicators | Candidate electability based on multiple traits
 Scales: Items are effects of an underlying latent variable.
 Indices: Items collectively form or define a variable—there may be no common
underlying cause.

2. Two Types of Indices

a. Causal Formative Index

 Based on theory.
 Items conceptually unified and together define the variable (e.g., socioeconomic status:
income, education, occupation).

b. Composite Formative Index

 Driven by empiricism.
 Items chosen for predictive utility, not conceptual unity (e.g., sales hiring index based
on resume data).

A scale without internal consistency might still function well as an index.

3. Theoretical and Empirical Differences

Conceptual Criteria

Scale | Index
Shared underlying cause plausible | Items are distinct and not interchangeable
Items "about the same thing" | Items tap different traits
One response predicts another | No inter-item dependency required
Traits of the respondent | Traits of environment or situation

Empirical Criteria

Scale | Index
High inter-item correlations | Weak inter-item correlations
Factor analysis reveals coherent clusters | Items fail to form clusters
Alpha is relevant | Internal consistency is not appropriate

(Table 7.1 summarizes these criteria in the text).


4. Formal Methods to Distinguish Index from Scale

1. Correlation Matrix: Scales yield strong correlations.


2. Factor Analysis: Scale items load on common factors; index items do not.
3. Vanishing Tetrads: Advanced SEM-based test to differentiate causal vs. effect
indicators.

5. Steps in Index Development

a. Clarify the Concept

 Must decide whether your index is causal or composite formative.


 Theory is critical for causal indices; data-driven utility governs composite ones.

b. Item Generation

 Index items may be sourced from:


o Administrative data
o Survey instruments
o New item writing (use scale-writing guidelines but avoid redundancy)

c. Avoid Redundancy

 Unlike scales, redundant items do not enhance reliability.


 Each item should represent a unique dimension of the construct.

d. Evaluation and Selection

 Use regression analysis to:


o Select predictive indicators
o Assign weights if appropriate
o Validate index against external criteria

6. Validity and Reliability of Indices

Type | Considerations
Content Validity | Same principles as scales; expert review for conceptual indices.
Construct Validity | Only applies if there is conceptual unity.
Criterion Validity | Most appropriate when conceptual unity is lacking; predict an external criterion.
Reliability | Internal consistency is not applicable; prefer test–retest reliability.
7. Regression-Based Development and Validation

 Commonly used for composite indices.


 Steps:
o Identify predictors of outcome (e.g., risk, performance)
o Run regression on a large dataset
o Retain items based on strength, stability, and predictive contribution
 Avoid tautology: Do not validate on same data used to build the index.
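
The sketch below illustrates this workflow with simulated data and hypothetical indicators (scikit-learn is assumed to be available); the key point is that index weights are estimated on a development split and the index is validated against the criterion on held-out cases.

```python
# Composite formative index via regression, validated on a held-out split.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.normal(size=(500, 4))   # hypothetical indicators, e.g., attendance, GPA, behavior flags, mobility
y = X @ np.array([0.6, 0.3, 0.1, 0.0]) + rng.normal(scale=0.8, size=500)  # simulated risk outcome

X_dev, X_val, y_dev, y_val = train_test_split(X, y, test_size=0.3, random_state=0)
model = LinearRegression().fit(X_dev, y_dev)     # index weights come from the development sample

index_scores = model.predict(X_val)              # index applied to the validation sample
print(round(np.corrcoef(index_scores, y_val)[0, 1], 2))   # criterion-related evidence
```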

8. Hybrid Models

 Some instruments combine reflective (scale-like) and formative (index-like) components.
 Examples:
o Hierarchical hybrids: Specific subscales combine into a general factor.
o MIMIC models: Combine latent and observed predictors in SEM.

9. Educational Research Application

Indices are useful in contexts where multiple causes define an outcome:

 Student risk indices (attendance, behavior, background)


 Teacher readiness indices (credentials, self-efficacy, classroom resources)
 School climate indices (leadership, parent involvement, safety)

10. Discussion Questions for PhD Seminar

1. How can one empirically test whether a measurement instrument is a scale or an index?
2. Why is internal consistency inappropriate for evaluating indices?
3. How would you design a composite index to measure “school readiness” using
administrative data?
4. Discuss the potential limitations of causal formative indices in education policy
research.
5. Under what circumstances might a hybrid instrument be preferable to either a scale or
an index alone?

Here are the PhD-level class notes for Chapter 8: An Overview of Item Response Theory
(IRT) from Scale Development: Theory and Applications (5th ed.) by Robert F. DeVellis &
Carolyn T. Thorpe. This chapter is critical for advanced researchers and doctoral students
engaged in psychometric scale design and evaluation.

PhD Class Notes: Chapter 8 – An Overview of Item Response Theory

1. Introduction to IRT

 Item Response Theory (IRT) is a modern psychometric framework that models the
relationship between:
o A person's latent trait level (e.g., ability, depression, motivation),
o The probability of a specific item response.
 Unlike Classical Test Theory (CTT), which focuses on test-level properties, IRT focuses
on item-level performance.
 In IRT, items are evaluated individually, and their properties are designed to be
independent of the sample used for calibration.

2. Core Assumptions of IRT

 Unidimensionality: All items on a test measure a single latent trait.


 Local independence: Item responses are independent after conditioning on the latent
trait.
 Monotonicity: The probability of a correct response increases with higher levels of the
trait.

3. Comparison: CTT vs. IRT

Feature | Classical Test Theory (CTT) | Item Response Theory (IRT)
Focus | Entire test | Individual items
Sample dependence | High | Low (assumes model fit)
Reliability | Scale-based (e.g., alpha) | Item and test information functions
Score interpretation | Relative | On a continuous latent trait scale (θ)
Measurement precision | Uniform | Varies by trait level (θ)

4. IRT Models

1-Parameter Model (Rasch Model)

 Item difficulty (b) only.


 Assumes equal discrimination.

2-Parameter Model

 Item difficulty (b) and discrimination (a).

3-Parameter Model

 Adds guessing (c) parameter—important in multiple-choice tests.

Each model fits different measurement needs and complexities.

5. Item Parameters

 Difficulty (b): Location on the trait continuum where the item has a 50% chance of being
passed.
 Discrimination (a): How sharply the item differentiates between different levels of the
trait.
 Guessing (c): The chance of answering correctly due to guessing, especially in binary-
choice formats.

6. Item Characteristic Curve (ICC)

 Graph showing the relationship between trait level (θ) and probability of item success.
 S-shaped curve for dichotomous items.
 Key insights from ICCs:
o Slope = discrimination
o Inflection point = difficulty
o Left Y-intercept = guessing
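
The sketch below writes out the standard three-parameter logistic item response function with hypothetical parameter values; the printed probabilities trace the S-shaped ICC described above.

```python
# Standard 3PL item response function (parameter values are hypothetical).
import numpy as np

def p_correct(theta, a=1.2, b=0.0, c=0.2):
    """a: discrimination, b: difficulty, c: guessing (lower asymptote)."""
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

theta = np.linspace(-3, 3, 7)
print(np.round(p_correct(theta), 2))
# Setting c = 0 gives the 2PL model; additionally fixing a across items gives the Rasch (1PL) model.
```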
7. Category Response Curves

 For Likert or polytomous items, IRT uses models like:


o Graded Response Model (GRM) (Samejima)
o Partial Credit Model (PCM)
 These models create Category Response Curves (CRCs) that show how likely each
response category is chosen across the θ continuum.

8. Computerized Adaptive Testing (CAT)

 IRT enables dynamic item selection:


o Test adapts to the respondent's estimated ability.
o Items are chosen to maximize information and minimize respondent fatigue.
 Requires a large, calibrated item pool.

9. Advantages of IRT

 Precision: Estimates ability more precisely across different levels.


 Sample independence: Item parameters don't change across populations.
 Test efficiency: Shorter, more tailored tests possible.
 Score equivalence: Comparability across test forms.

10. Challenges and Limitations

 Requires large, diverse samples for accurate item calibration.


 High computational complexity and need for specialized software.
 Strong assumptions (e.g., unidimensionality) may be hard to meet in practice.
 Not always superior: CTT remains effective and simpler in many contexts.

11. Practical Application in Educational Research

IRT is ideal for:

 Developing assessments of student achievement, teacher knowledge, or mental health constructs.
 Creating adaptive testing platforms in e-learning.
 Enhancing diagnostic precision for educational interventions.
 Evaluating cross-group equivalence (e.g., gender DIF—differential item functioning).

12. Discussion Questions for PhD Seminar

1. How does item-level analysis in IRT improve scale development compared to classical
methods?
2. What are the advantages and trade-offs of using the three-parameter model?
3. When should a researcher opt for IRT over CTT in educational research?
4. How can Category Response Curves inform improvements to Likert-scale items?
5. What are the implications of IRT’s assumptions for cross-cultural educational
measurement?


Here are the PhD-level class notes for Chapter 9: Measurement in the Broader Research
Context from Scale Development: Theory and Applications (5th Edition) by Robert F. DeVellis
& Carolyn T. Thorpe. This final chapter re-situates scale development within the larger logic of
social science research, offering a meta-perspective essential for doctoral students conducting or
supervising empirical investigations.

PhD Class Notes: Chapter 9 – Measurement in the Broader Research Context

1. Reconnecting with the Big Picture

This chapter circles back to key themes introduced in Chapter 1, emphasizing that:

 Measurement is embedded in the broader scientific inquiry.


 Scale development is not an isolated task but one that interacts deeply with theory,
method, and context.
 Measurement choices affect the credibility and utility of findings.

2. Measurement Before and After Scale Construction


Before Scale Development

 Search for existing measures before designing new ones. Use databases like:
o HaPI (Health and Psychosocial Instruments)
o Mental Measurements Yearbook
o PROMIS repositories
 Assess construct–population fit early using:
o Focus groups
o Qualitative interviews
o Cognitive probing
 Consider how the construct is perceived by participants, not just how it's theorized.

After Scale Use

 Validate the interpretation of scores within the study’s context.


 Analyze the generalizability of findings across:
o Populations
o Settings
o Time points
o Cultural contexts.

3. Measurement and Scientific Reasoning

 Both inductive and deductive logic are involved:


o Deductive: Theory guides item generation.
o Inductive: Factor analysis reveals latent structures.
 Both qualitative and quantitative methods contribute:
o Qual: Interviews to generate item content.
o Quant: Psychometric validation.

4. Analytic Considerations

 Scales developed using Likert formats are often treated as interval-level data, although they are technically ordinal.
 Use interval methods (e.g., regression, factor analysis) cautiously and justify their use.
 Choose methods consistent with scale construction assumptions (e.g., tau-equivalence
for alpha).

Validity is cumulative: Usefulness of a scale evolves with its use, interpretation, and context
adaptation.
5. Interpretation and Contextual Validity

 A scale is valid only in relation to the context in which it is used.


 Results that are surprising or counterintuitive might signal:
o Poor scale-context fit
o Unexamined response biases
o Contingent variables (e.g., mood, fatigue, social desirability)
 Think critically: Are scores truly reflecting the latent trait or something else?

6. Generalizability and Differential Item Functioning (DIF)

 Beware of group comparisons using scales that might function differently across
cultures or subpopulations.
 DIF (discussed in Ch. 8) can cause misleading conclusions.
 Ensure equivalence before interpreting group differences as “true” differences in traits.

7. Small vs. Big Measurement (The “Canoes vs. Cruise Ships” Analogy)

Small Measurement

 Researcher-developed, limited-scope instruments.


 Targeted, rapid, efficient, adaptable.
 Easier to control assumptions and refine design.
 Examples: custom teacher efficacy scale for a local district.

Big Measurement

 Large-scale, generalizable, high-cost.


 Example: SAT, GRE, PROMIS health metrics.
 Complex psychometric design; often multi-center, multi-year efforts.

Insight: Big measurements often evolve from successful small measurements. The latter are
valuable seedbeds of innovation.

8. Practical Implications for Educational Research

 Educational measurement must consider:


o Who is being measured?
o What construct is measured?
o Where (context) and how?
 Understand the trade-off between precision and scope.
 Decide whether your research needs a canoe or a cruise ship.

9. Final Reflections

 Measurement is not a neutral act; it shapes:


o What we see
o How we interpret it
o What policy or instructional action follows
 Good measurement clarifies theory, reveals constructs, and advances science.
 Poor measurement obfuscates findings, wastes resources, and misleads conclusions.

10. Discussion Questions for PhD Seminar

1. How do "small" and "big" measurement approaches differ in terms of validity, reliability,
and generalizability?
2. Why is it important to re-assess the use of a scale after data collection?
3. How does generalizability interact with differential item functioning?
4. Discuss an example from your own research where contextual factors may have affected
measurement.
5. When is it more beneficial to develop a new instrument rather than adapt an existing one?

