Research Design and Statistical Analysis in Christian Ministry

Table of Contents

Preface

Unit I: Research Fundamentals

Chapter 1: Scientific Knowing
    Descriptive Research; Qualitative Research; Suspicion of Science by the Faithful; Suspicion of Religion by the Scientific; There Need Be No Conflict; Summary; Vocabulary; Study Questions; Sample Test Questions

Chapter 2
    Introduction; Preliminaries; The Statement of the Problem; Purpose of the Study; Synthesis of Related Literature; Significance of the Study; The Hypothesis; Method: Population, Sampling, Instrument, Limitations, Assumptions, Definitions, Design, Procedure for Collecting Data, Procedure for Analyzing Data; Analysis: Testing the Hypotheses, Reporting the Data; Appendices; Bibliography, or Cited Sources; Personal Anxiety; Professionalism in Writing: Clear Thinking, Unified Flow, Quality Library Research, Efficient Design, Accepted Format; Reference Material; Practical Suggestions; Summary

Chapter 3
    Operationalization; Summary

Chapter 4
    Revision Examples (Examples 1-5, each with Comments and a Suggested Revision): Association Between Two Variables; Association of Several Variables; Difference Between Two Groups; Differences Between More Than Two Groups. Dissertation Examples: Regression Analysis; Correlation of Competency Rankings; Factorial Analysis of Variance; Chi-Square Analysis of Independence

Chapter 5
    Summary

Chapter 6
    Preliminaries; Summary

Chapter 7
    Steps in Sampling: Identify the Target Population; Identify the Accessible Population; Determine the Size of the Sample (Accuracy, Cost, The Homogeneity of the Population, Other Considerations, Sample Size Rule of Thumb); Select the Sample. Types of Sampling; Summary; Study Questions; Sample Test Questions

Chapter 8
    Reliability

Chapter 9
    Summary

Chapter 10
    Preliminaries; Disadvantages: Rate of return, Inflexibility, Subject motivation, Verbal behavior only, Loss of control. The Interview: Advantages (Flexibility, Motivation, Observation, Broader Application, Freedom from mailings); Disadvantages (Time, Cost, Interviewer effect, Interviewer variables); Summary

Chapter 11
    Objective Tests; Writing Multiple Choice Items: Pose a singular problem, Avoid repeating phrases in responses, Minimize negative stems, Make responses similar, Make responses mutually exclusive, Make responses equally plausible, Randomly order responses, Avoid sources of irrelevant difficulty, Eliminate extraneous material, Avoid "None of the Above". Supply Items; Matching Items; Essay Tests; Item Analysis: Rank Order Subjects by Grade, Categorize Subjects into Top and Bottom Groups, Compute Discrimination Index, Revise Test Items, Examples. Summary; Vocabulary; Study Questions; Sample Test Questions; Sample Test

Chapter 12
    Formatting the Scale; Write instructions; Scoring the Likert Scale; Develop item pool; Compute item weights; Rank the items by weight; Choose Equidistant Items; Formatting the Scale; Administering the Scale; Scoring; Vocabulary; Study Questions; Sample Test Questions; Sample Thurstone Scale; Sample Thurstone Scale (with weights)

Chapter 13
    External Invalidity; Types of Designs; Quasi-experimental Designs; Pre-experimental Designs; Summary

Chapter 14
    Mathematical Concepts; Summary

Chapter 16
    Measures of Variability; Sample Statistics; Estimated Parameters; Sampling Distributions; Summary

Chapter 21
    Procedures Computed; Summary

Chapter 22
    Summary; Vocabulary; Study Questions; Sample Test Question

Chapter 23
    Summary

Chapter 26
    Summary; Focus on the Significant Predictors; Multiple Regression Equations; Example; Vocabulary; Study Questions; Sample Test Questions

Appendices
    Answer Key to Sample Test Questions; Word List; Critical Value Tables; Dissertations and a Thesis; Bibliography
Chapter 1
Scientific Knowing

Ways of Knowing
Science as a Way of Knowing
The Scientific Method
Types of Research
Have you considered how you know what you know? As you sit in classes or talk with friends, have you noticed that people differ in the way they know things? Look at six students who are discussing the issue of "modern translations" of the Bible.
Student 1: "I use the King James Version because that's the translation I grew up using. Everybody in our church back home uses it."
Student 2: "I use the New King James because my pastor says it offers the best of beauty and modern scholarship."
Student 3: "I've prayed about what version to use. I like the Amplified Version because it is so clear in its language. It just feels right."
Student 4: "I've tried five or six different translations for devotional reading and for preparation for teaching in Sunday School. After evaluating each one, I've come back again and again to the New International Version. It's the best translation for me."
Student 5: "The essence of Bible study is understanding the message, whatever translation we may use. Therefore, I use different translations depending on my study goals."
Student 6: "I use the New King James because most of my congregation is familiar with it. In a recent survey, I found that 84% of our members use the KJV or NKJV."
Each of these students reflects a different basis for knowing which translation to use. Which student most closely reflects your view? How did you come to know what you know?
Ways of Knowing
As we begin our study of research design and statistical analysis, we need to understand the characteristics of scientific knowing, and how this kind of knowing differs from other ways we learn about our world. We will first look at five non-scientific ways of knowing: common sense, authority, intuition/revelation, experience, and deductive reasoning. Then we'll analyze the scientific method, which is based on inductive reasoning.
Common Sense
Common sense refers to knowledge we take for granted. We learn by absorbing the customs and traditions that surround us: from family, church, community, and nation. We assume this knowledge is correct because it is familiar to us. We seldom question, or even think to question, its correctness, because it just is. Unless we move to another region, or go to school and study the views of others, we have nothing to challenge our way of thinking. It's just common sense! But common sense told us that the earth is flat until Columbus discovered otherwise. Common sense told us that dunce caps and caning were effective student motivators until educational research discovered the negative aspects of punishment. Common sense may well be wrong.
Authority
Authoritative knowledge is an uncritical acceptance of another's knowledge. When we are sick, we go to the doctor to find out what to do. When we need legal help, we go to a lawyer and follow his advice. Since we cannot verify the knowledge on our own, we must simply choose to accept or reject the expert's advice. It would be foolish to argue with a doctor's diagnosis, or a lawyer's perception of a case. This is the meaning of "uncritical acceptance" in the definition above. The only recourse to accepting the expert's knowledge is to get a second opinion from another expert.

As Christians, we believe that God's Word is the authority for our life and work. The Living Word, the Lord Himself, within us confirms the Truth of the Written Word. The Written Word confirms our experiences with the Living Word. Scripture is a valid source of authoritative knowledge. However, we spend a lot of time discussing Scriptural interpretations. Our discussions often deteriorate into conflicts about "my pastor's interpretations." We use our own pastor's interpretation as authoritative because of the influence he has had in our own life. (We can substitute any authoritative person here, such as a father or mother, Sunday School teacher, or respected colleague.) But is the authority correct? Authoritative knowing does not question the source of knowledge. Yet differing authorities cannot be correct simultaneously. How do we test the validity of an authority's testimony?
Intuition/Revelation
Intuitive knowledge refers to truths which the mind grasps immediately, without need for proof or testing or experimentation. The properly trained mind intuits the truth naturally. The field of geometry provides a good example of this kind of knowing. Let's say I know that line segment A is the same length as line segment B. I also know that line segment B is the same length as line segment C. From these two truths, I immediately recognize that line segments A and C are equal. Or, in shorthand,

IF A = B and B = C, THEN A = C

I do not need to draw the three lines and measure them. My mind immediately grasps the truth of the statement. Revelation is knowledge that God reveals about Himself. I do not need to test this knowledge, or subject it to experimentation. When Christ reveals Himself to us, we know Him in a personal way. We did not achieve this knowledge by our own efforts, but merely received the revelation of the Lord. We cannot prove this knowledge to others, but it is bedrock truth to those who've experienced it. Problems arise, however,
when we apply intuitive knowing to ministry programs. "Well, it's obvious that regular attendance in Sunday School helps people grow in the Lord." Is it? We work hard at promoting Sunday School attendance. Does it actually change the lives of the attenders? Is it enough for people to think it does, whether or not real change takes place? Answers to these questions come from clear-headed analysis, not from intuition.
Experience
Experiential knowledge comes from trial-and-error learning. We develop it when we try something and analyze the consequences. You've probably heard comments like these: "We've already tried that and it failed." Or another: "We've found that holding Vacation Bible School during the third week of August, in the evening, is best for our church." The first is negative. The speaker is saying there's no need to try that ministry or program again, because it was already tried. The second is positive. This church has tried several approaches to offering Vacation Bible School and found the best time for them. Their truth may not apply to any other church in the association, but it is true for them. They've tried it and it worked . . . or it didn't. Much of the promotion of new church programs comes out of this framework. We say, "This program is being used in other churches with great success" (which means our church can have the same experience if we use this program). How do we evaluate program effectiveness? What is "success"? How do we measure it?
Deductive Reasoning
Deductive reasoning moves thinking from stated general principles to specific elements. We develop general, over-arching statements of intent and purpose. Then we deduce from these principles specific actions we should take. Determine world view first. Then make daily decisions which logically derive from this perspective. When we take the Great Commission as our primary mandate, we have framed a world view for ministry. That is, "Whatever we do, we will connect it to reaching out and baptizing (missions and evangelism) and teaching (discipleship and ministry)." Now, how do we do it? We deduce specific programs, plans, and procedures for carrying out the mandate. We eliminate programs that conflict with this mandate. How do we arrive at this world view? Are our over-arching principles correct? Have we interpreted them correctly? Correct action rises or falls on the basis of two things. First, correct action depends on the correctness of our world view. Second, correct action depends on our ability to translate that view into practical ministry steps.
Inductive Reasoning
Inductive reasoning moves thinking from specific elements to general principles. Inductive Bible study analyzes several passages and then synthesizes key concepts into the central truth. Science is inductive in its study of a number of specifics and its use of these results to formulate a theory. The truths derived in this way are temporary and open to adjustment when new elements are discovered. Knowledge gained in this way is usually related to probabilities of happenings. We have a high degree of confidence that combining X and Y will produce effect Z. Or, we learn that B and C are seldom found in combination with D. I can demonstrate probability by using matches. Picture yourself at the kitchen table with 100 matches. You pick up the first one. What is the probability it will light when you strike it? Well, you have two possibilities: either it will or it won't. So the probability is 50% (1 event out of 2 possibilities). You strike it and it lights. Pick up the second match. The probability is 0.50 that it will light (1 event out of two possibilities: Yes or No). But cumulatively, out of two matches (first and second), one lit. One out of two is 50%. So the probability of the second match lighting is 50%, because 1 of 2 have already lit. You strike it and it lights. Pick up the third match. Again, the third match taken alone has p = 0.50 of lighting (read "probability equals point-five-oh"). However, taking all three matches together, two of the three have lit and the probability is 2/3 (p ≈ 0.67) that the third match will light. It does. Now, pick up the fourth match. The probability is 3/4 (p = 0.75) that it will light, taking all four matches together. What about the 100th match, given that the 99 previous matches have all lit? The probability is 0.50 for this particular match (yes, no), but p = 0.99 taking all matches together. The probability is very high! Yet we cannot absolutely guarantee it will light.

This is the nature of inductive logic, and inductive logic is the basis of scientific knowledge. By definition, science does not deal with absolute Truth. Science seeks knowledge about processes in our world. Researchers gather information through observation. They then mold this information into theories. The scientific community tests these theories under differing conditions to establish the degree to which they can be generalized. The result is temporary, open-ended truth (I call it "little-t truth" to distinguish it from absolute Truth). This kind of truth is open for inquiry, further testing, and probable modification. While this kind of knowing can add nothing to our faith, it is very helpful in solving ministry problems.

4th ed. 2006 Dr. Rick Yount
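The running tally in the match illustration is simply a cumulative relative frequency. As a minimal sketch (the function name and data here are ours, not the author's), it can be computed like this:

```python
def cumulative_relative_frequency(outcomes):
    """Running proportion of successes (1 = lit, 0 = failed) after each strike."""
    lit = 0
    freqs = []
    for i, outcome in enumerate(outcomes, start=1):
        lit += outcome
        freqs.append(lit / i)
    return freqs

# Two of the first three matches lit: the cumulative frequency is 2/3.
freqs = cumulative_relative_frequency([1, 1, 0])
```

Taken alone, each strike remains uncertain; it is the pooled record that stabilizes as trials accumulate, which is exactly the probabilistic, open-ended character of inductive knowledge described above.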
Science as a Way of Knowing

Scientific knowing is based on precise data gathered from the natural world we live in. It builds a knowledge base in a neutral, unbiased manner. It seeks to measure the world precisely. It reports findings clearly so that others can duplicate the studies. It forms its conclusions on empirical data. Let's look at these ideals more closely.
Objectivity
Human beings are complex. Personal experiences, values, backgrounds, and beliefs make objective analysis difficult unless effort is made to remain neutral. Optimists tend to see the positive in situations. Pessimists see the negative. But scientists look for objective reality: the world as it is, uncolored by personal opinion or feelings. Scientific knowing attempts to eliminate personal bias in data collection and analysis. Honest researchers take a neutral position in their studies. That is, they do not try to prove their own beliefs. They are willing to accept empirical results contrary to their own opinions or values.
Precision
Reliable scientific knowing requires precise measurement. Researchers carry out experiments under controlled, narrow conditions. They carefully design instruments to be as accurate as possible. They evaluate tests for reliability and validity. They use pilot projects (trial runs of procedures) to identify sources of extraneous error in measurements. Why? Because inaccurate measurement and undefined conditions and unreliable instruments and extraneous errors produce data that is worthless. Every score has two parts: the true measure of the subject, and an unknown amount of error. We can represent this as
observed score = true measure + error
Think of two students who are equally prepared for an exam. When they arrive in class, one is completely healthy and the other has the flu. They will likely score differently on the exam. In this case, illness introduces an error term into the second student's score. When we gather data in a haphazard, disorderly way, error interferes with the true measure of the variable. Like static on a television screen, the error masks the true picture of the data. Analysis of this noisy data will provide a numerical answer which is suspect. Accurate measurement is a vital ingredient in the research process.
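The two-part score model can be simulated to show why precision matters. This is only an illustrative sketch; the scores, error sizes, and names are hypothetical:

```python
import random

def observe(true_score, error_sd, rng):
    """One measurement: the subject's true score plus random error."""
    return true_score + rng.gauss(0, error_sd)

rng = random.Random(42)
true_score = 85.0

# A precise instrument adds little error; a sloppy one adds a lot,
# masking the true picture like static on a television screen.
precise = [observe(true_score, 1.0, rng) for _ in range(1000)]
sloppy = [observe(true_score, 15.0, rng) for _ in range(1000)]
```

The errors average out over many measurements, but any single "sloppy" observation may land far from 85, which is why careful instrument design and pilot testing matter.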
Verification
Science analyzes world processes which are systematic and recurring. Researchers report their findings in a way that allows others to replicate their studies to check the facts in the real world. These replications either confirm or refute the original findings. When researchers confirm earlier results, they verify the earlier findings. Research reports provide readers the background, specific problem(s), and hypotheses of studies. Also included are the populations, definitions, limitations, and assumptions, as well as procedures for collecting and analyzing data. Writers do this intentionally so others can evaluate the degree to which findings can be generalized and, perhaps, replicate the study.
Empiricism
The root of empiricism (Greek empeirikos) refers to "the employment of empirical methods, as in science," or "derived from observation or experiment; verifiable or provable by means of observation or experiment."1 Science uses the term to underscore the fact that it bases its knowledge on observations of specific events, not on abstract philosophizing or theologizing. These carefully devised observations of the real world form the basis of scientific knowledge. Therefore, the kinds of problems which science can deal with are testable problems. Empirical data is gathered by observation. Basic observations can be done with the naked eye and an objective checklist (see Chapter 9). But observations are also made with instruments such as an interview or questionnaire (Chapter 10), a test (Chapter 11), an attitude scale (Chapter 12), or a controlled experiment (Chapter 13). Scientific knowing cares less about philosophical reasoning than it does about the rational collection and analysis of factual data relevant to the problem to be solved.
Goal: Theories
The goal of scientific research is theory construction, the development of theories which explain the phenomena under study, not the mere cataloging of empirical data. The inductive process of scientific knowing begins with the specifics (collected data) and leads to the general (theories). What causes cancer? What makes it rain? How does man learn? What is the best way to relieve anxiety? What effect do children have on marital satisfaction? Most ministerial students want pragmatic answers to pragmatic problems in the ministry. In the past ten years [during the 1980s] there has been a rash of studies relating some variable to church growth. The pragmatic question is "How do I make my church grow?" But Christian research goes deeper. It looks beyond the surface of ministry programming to the social, educational, psychological, and administrative dynamics of church life and work. Each of these areas has many theories and theorists giving advice and explanation. Are these views valid for Christian ministry? Can you modify these theories for effective use in church ministry? Seek a solid theoretical base for your proposal.

1 "Empiricism," "empirical." The American Heritage Dictionary, 3rd ed., Version 3.0A, WordStar International, 1993.
The Scientific Method

The scientific method is a step-by-step procedure for solving problems on the basis of empirical observations. Here are the major elements:
1. Begin with a felt difficulty. What is your interest? What questions do you want answered? How might a theory be applied in a specific ministry situation? What conflicting theories have you found? The felt difficulty is the beginning point for any study (but it has no place in the proposal).
2. Write a formal Problem Statement. The Problem establishes the focus of the study by stating the necessary variables in the study and what you plan to do with them (see Chapter 4).
3. Gather literature information. What is known? Before you plan to do a study of your own, you must learn all you can about what is already known. This is done through a literature search and results in a synthesis of recent findings on the topic (see Chapter 6).
4. State the hypothesis. On the basis of the literature search, write a hypothesis statement that reflects your best tentative solution to the Problem (see Chapter 4).
5. Select a target group (population). Who will provide your data? How will you find subjects for your study? Are they accessible to you? (see Chapter 7)
6. Draw one or more samples, as needed. How many samples will you need? What kind of sampling will you use? (see Chapter 7)
7. Collect data. What procedure will you use to actually collect data from the subjects? Develop a step-by-step plan to obtain all the data you need to answer your questions (see Chapters 9-13).
8. Analyze data. What statistics will you use to analyze the data? Develop a step-by-step plan to analyze the data and interpret the results (see Chapters 14-25).
9. Test the null, or statistical, hypothesis. On the basis of the statistical results, what decision do you make concerning your hypothesis? (see Chapters 16-26)
10. Interpret the results. What does the statistical decision mean in terms of your study? Translate the findings from statistics to English (see Chapters 16-26).
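The data-handling steps (7-10) can be sketched in miniature. The quiz scores and the choice of a permutation test are our own illustration under stated assumptions, not a procedure taken from this book:

```python
import random

def permutation_test(group_a, group_b, trials=5000, seed=1):
    """Estimate how often a mean difference this large would occur if the
    null hypothesis were true (both groups drawn from one population)."""
    rng = random.Random(seed)
    mean = lambda g: sum(g) / len(g)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    extreme = 0
    for _ in range(trials):
        rng.shuffle(pooled)  # re-deal scores as if group labels didn't matter
        a, b = pooled[:len(group_a)], pooled[len(group_a):]
        if abs(mean(a) - mean(b)) >= observed:
            extreme += 1
    return extreme / trials  # the p-value

# Step 7 (collect data): hypothetical quiz scores under two teaching approaches.
lecture = [72, 75, 70, 68, 74, 71, 69, 73]
discussion = [80, 83, 78, 85, 79, 82, 81, 84]
# Steps 8-9 (analyze data, test the null hypothesis):
p = permutation_test(lecture, discussion)
# Step 10 (interpret): a p-value below 0.05 would lead us to reject the null.
```

The point is the pipeline, not the particular statistic: data are collected, analyzed, the null hypothesis is tested, and the numerical result is translated back into a statement about the study.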
The scientific method provides a clear procedure for empirically solving problems. In chapter 2 we introduce you to the structure of a research proposal. As you read the chapter, notice how the elements of the proposal follow the steps of the scientific method. Refer back to this outline in order to understand the links between the scientific method and the research proposal.
Types of Research
Under the umbrella of scientific research, there are several types of studies you can do. These types differ in procedure (what they entail) and outcome (what they accomplish). Here are four major and three minor types of research from which you may choose.
Historical Research
Historical research analyzes the question "what was?" It studies documents and relics in order to determine the relationship of historic events and trends to present-day practice.
Primary sources
A source of information is primary when it is produced by the researcher. Reports written by researchers who conduct studies are eyewitness accounts, and are primary sources of information on the results. Other examples of primary sources are autobiographies and textbooks written by authors who conduct their own research. Use primary sources as the major source of information in the Related Literature section of your proposal. Primary sources take two forms: documents and relics.

Documents. Society creates documents expressly to record events. They are objective and direct. Documents provide straightforward information. Average Bible Study attendance listed on the Annual Church Letters on file in the state convention office is more likely to be accurate than numbers given from memory by ministers of education in local churches. However, information contained in documents may be incorrect. The documents may have been falsified, or word meanings in the documents may have changed.

Relics. Society creates relics simply by living. Relics are artifacts left by communities and cultures in the past. People did not create these objects to record information, as is the case with documents. Therefore, information conveyed by relics requires interpretation. The historical researcher reconstructs the meaning of relics in the context of their time and place.
Secondary sources
A source of information is secondary when it is a second-hand account of research. Secondary sources may take the form of summaries, news stories, encyclopedias, or textbooks written by synthesizers of research reports. While secondary sources provide the bulk of materials used in term papers, you should use them only to provide a broad view of your chosen topic. As already stated, emphasize the use of primary sources in your Synthesis of Related Literature.
Criticism
The term criticism has a decidedly negative connotation to most of us. A critical person is one who finds fault, depreciates, or puts down someone or something. The term comes from the Greek krino, "to judge." Webster defines criticism as "the art, skill, or profession of making discriminating judgments and evaluations, especially of literary or other artistic works."2 Criticism can therefore refer to praise as well as depreciation. A Christian may cringe when he hears someone speak of using "higher criticism" to study Scripture. It sounds as if the scholar is criticizing -- berating, slandering, putting down -- the Bible. The term actually means that scholars objectively analyze language, culture, and comparative writings to determine the authenticity of the work. Who wrote Hebrews? Paul? Apollos? Peter? Scholars apply the systematic tools of content analysis and literary criticism to determine the answer. Criticism takes two major forms: external criticism and internal criticism.

External criticism. External criticism answers the question of genuineness of the object. Is the document or relic actually what it seems to be? What evidence can we gather to affirm the authenticity of the object itself? For example, is this painting really a Rembrandt? Was this letter really written by Thomas Jefferson? External criticism focuses on the object itself.

Internal criticism. Internal criticism answers the question of trustworthiness of the object. Can we believe what the document says? What ideas are being conveyed? What does the writer mean by his words, given the culture and time period in which he wrote? Internal criticism focuses on the object's meaning.

2 "Criticism," The American Heritage Dictionary, 3rd ed., Version 3.0A, WordStar International, 1993.
Examples
Historical research is not merely the collection of facts from secondary sources about an historic event or process. It is the objective interpretation of facts, in line with parallel events in history. The goal of historical research is to explain the underlying causes of present practices. Most of the historical dissertations written by our students have focused on former deans and faculty members. Dr. Phillip H. Briggs studied the contributions of Dr. J. M. Price, Founder and Dean of the School of Religious Education.3 Dr. Robert Mathis analyzed the contributions of Dr. Joe Davis Heacock, Dean of the School of Religious Education, 1950-1973.4 Dr. Carl Burns evaluated the contributions of Dr. Leon Marsh, Professor of Foundations of Education, School of Religious Education, Southwestern Seminary, 1956-1987.5 Dr. Sophia Steibel analyzed the life and contributions of Dr. Leroy Ford, Professor of Foundations of Education, 1956-1984.6 Dr. Douglas Bryan evaluated the contributions of Dr. John W. Drakeford, Professor of Psychology and Counseling.7
Descriptive Research
Descriptive research analyzes the question "what is?" A descriptive study collects data from one or more groups, and then analyzes it in order to describe present conditions. Much of this textbook underscores the tools of descriptive research: survey by questionnaire or interview, attitude measurement, and testing. A popular use of descriptive research is to determine whether two or more groups differ on some variable of interest.
3 Phillip H. Briggs, The Religious Education Philosophy of J. M. Price (D.R.E. diss., Southwestern Baptist Theological Seminary, 1964).
4 Robert Mathis, A Descriptive Study of Joe Davis Heacock: Educator, Administrator, Churchman (Ed.D. diss., Southwestern Baptist Theological Seminary, 1984).
5 Carl Burns, A Descriptive Study of the Life and Work of James Leon Marsh (Ed.D. diss., Southwestern Baptist Theological Seminary, 1991).
6 Sophia Steibel, An Analysis of the Works and Contributions of Leroy Ford to Current Practice in Southern Baptist Curriculum Design and in Higher Education of Selected Schools in Mexico (Ed.D. diss., Southwestern Baptist Theological Seminary, 1988).
7 Douglas Bryan, A Descriptive Study of the Life and Work of John William Drakeford (Ed.D. diss., Southwestern Baptist Theological Seminary, 1986).
Another application of descriptive research is determining whether two or more variables are related within a group. This latter type of study, while descriptive in nature, is often referred to specifically as correlational research (see the next section).
An Example
The goal of descriptive research is to describe, accurately and empirically, differences between selected groups on one or more variables. Dr. Dan Southerland studied differences in ministerial roles and allocation of time between growing and plateaued or declining Southern Baptist churches in Florida.6 Specified roles were pastor, worship leader, organizer, administrator, preacher, and teacher.7 The only role which showed a significant difference between growing and non-growing churches was the amount of time spent serving as organizer, which included vision casting, setting goals, leading and supervising change, motivating others to work toward a vision, and building "groupness."8
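Describing group differences like these begins with simple summaries of each group. A minimal sketch, using entirely hypothetical numbers (not Southerland's data), might look like this:

```python
from math import sqrt

def describe(scores):
    """n, mean, and sample standard deviation for one group's scores."""
    n = len(scores)
    mean = sum(scores) / n
    sd = sqrt(sum((x - mean) ** 2 for x in scores) / (n - 1))
    return {"n": n, "mean": mean, "sd": sd}

# Hypothetical hours per week spent in the "organizer" role.
growing = [9, 11, 10, 12, 8]
plateaued = [4, 5, 6, 4, 6]
growing_summary = describe(growing)
plateaued_summary = describe(plateaued)
```

Comparing the two summaries (means and spreads) is the descriptive step; later chapters take up whether such a difference is statistically significant.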
Correlational Research
Correlational research is often presented as part of the descriptive family of methods. This makes sense, since correlational research describes association between variables of interest in the study. It answers the question "what is?" in terms of the relationship among two or more variables. What is the relationship between learning style and gender? What is the relationship between counseling approach and client anxiety level? What is the relationship between social skill level and job satisfaction and effectiveness for pastors? In each of these questions we have asked about an association between two or more variables. Correlational research also includes the topics of linear and multiple regression, which use the strengths of associations to make predictions. Finally, correlational analysis includes advanced procedures like Factor Analysis, Canonical Analysis, Discriminant Analysis, and Path Analysis, all of which are beyond the scope of this course.
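The strength of such an association is commonly summarized with the Pearson correlation coefficient. As a sketch (the variables and data here are hypothetical, not drawn from any study in this book):

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical data: hours of weekly study vs. Bible-knowledge test score.
hours = [1, 2, 3, 4, 5, 6]
scores = [55, 60, 64, 70, 73, 80]
r = pearson_r(hours, scores)  # near +1: a strong positive association
```

A coefficient near +1 or -1 indicates a strong association; one near 0 indicates little linear relationship. Correlation alone, of course, does not establish cause and effect.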
An Example
The goal of correlational research is to establish whether relationships exist between selected variables. Dr. Robert Welch studied selected factors relating to job satisfaction in staff organizations in large Southern Baptist Churches.9 He found the most important intrinsic factors affecting job satisfaction were praise and recognition for work, performing creative work and growth in skill. The most important extrinsic factors were salary, job security, relationship with supervisor, and meeting family needs.10 Findings were drawn from 579 Southern Baptist ministers in 153 churches.11
Experimental Research
Experimental research analyzes the question what if? Experimental studies use carefully controlled procedures to manipulate one (independent) variable, such as
6 Dan Southerland, A Study of the Priorities in Ministerial Roles of Pastors in Growing Florida Baptist Churches and Pastors in Plateaued or Declining Florida Baptist Churches (Ed.D. diss., Southwestern Baptist Theological Seminary, 1993).
7 Ibid., 1.
8 Ibid., 2.
9 Robert Horton Welch, A Study of Selected Factors Related to Job Satisfaction in the Staff Organizations of Large Southern Baptist Churches (Ed.D. diss., Southwestern Baptist Theological Seminary, 1990).
10 Ibid., 2.
11 Ibid., 61.
I: Research Fundamentals
Teaching Approach, and measure its effect on other (dependent) variables, such as Student Attitude and Achievement. Manipulation is the distinguishing element in experimental research. Experimental researchers don't simply observe what is. They manipulate variables and set conditions in order to design the framework for their observations. What would be the difference in test anxiety across three different types of tests? Which of three language training programs is most effective in teaching foreign languages to mission volunteers? What is the difference between Counseling Approach I and Counseling Approach II in reducing marital conflict? In each of these questions we find a researcher introducing a treatment (type of test, training program, counseling approach) and measuring an effect. Experimental research is the only type which can establish cause-and-effect relationships between independent and dependent variables. See Chapter 13 for examples of experimental designs.
An Example
The goal of experimental research is to establish cause-effect relationships between independent and dependent variables. Dr. Daryl Eldridge analyzed the effect of knowledge of course objectives on student achievement in, and attitude toward, the course.12 He found that knowledge of instructional objectives produced significantly higher scores on the Unit I exam (mid-range cognitive outcomes) but not on the Unit III exam (knowledge outcomes). Knowledge of objectives did produce significantly higher scores on the post-course attitude inventory.13
Ex Post Facto Research
An Example
The goal of ex post facto research is to establish cause-and-effect relationships between independent and dependent variables after the fact, when the treatment has already been applied outside the researcher's control. An example of ex post facto research would be An Analysis of the Difference in Social Skills and Interpersonal Relationships Between Congenitally Deaf and Hearing College Students. Congenital deafness in this case is the treatment, already applied by nature.
Evaluation
Evaluation is the systematic appraisal of a program or product to determine if it is accomplishing what it proposes to do. It is the application of the scientific method to the
12 Daryl Roger Eldridge, The Effect of Student Knowledge of Behavioral Objectives on Achievement and Attitude Toward the Course (Ed.D. diss., Southwestern Baptist Theological Seminary, 1985).
13 Ibid., 2.
practical worlds of educational and administrative programming. Specialists commend to us a variety of programs designed to solve problems. Depending upon the degree of personal involvement of these specialists with the programs, these commendations may contain more word magic than substance. Does a program do what it's supposed to do?

The danger in choosing an evaluation-type study for dissertation research is the political ramifications that arise if the evaluation proves embarrassing to the church or agency conducting the program. Program leaders may not appreciate negative evaluations and may apply pressure to modify results. This distorts the research process. Suppose you choose to evaluate a new counselor orientation program at a highly visible counseling network and you find the program substandard. Will this impact your ability to work with this agency as a counselor? Or suppose you want to compare Continuous Witness Training (CWT) with Evangelism Explosion (EE) as a witness training program. What are the implications of your finding one program much better than the other?
An Example
The goal of evaluation research is to objectively measure the performance of an existing program in accordance with its stated purpose. An example of this type of study would be A Critical Analysis of Spiritual Formation Groups of First Year Students at Southwestern Baptist Theological Seminary. Program outcomes are measured against program objectives to determine if Spiritual Formation Groups accomplish their purpose.
Research and Development
An Example
The goal of research and development is the production of a new product which performs according to specified standards. Dr. Brad Waggoner developed an instrument to measure the degree to which a given church member manifests the functional characteristics of a disciple.14 Two pilot tests of the instrument produced Cronbach's alpha reliability coefficients of 0.9745 and 0.9618, demonstrating that it measures these characteristics with a high degree of reliability.15 In 1998, the instrument was incorporated into MasterLife materials produced by LifeWay Christian Resources (SBC).16
14 Brad J. Waggoner, The Development of an Instrument for Measuring and Evaluating the Discipleship Base of Southern Baptist Churches (Ed.D. diss., Southwestern Baptist Theological Seminary, 1991).
15 Ibid., 118.
16 Report of joint development between LifeWay and the International Mission Board (SBC) at the 1998 Meeting of the Southern Baptist Research Fellowship.
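Cronbach's alpha, the reliability statistic Waggoner reported, estimates the internal consistency of an instrument from the variances of its items and the variance of respondents' total scores: alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total scores), where k is the number of items. A minimal sketch of the computation follows; the Likert-scale responses are invented for illustration, not Waggoner's data.

```python
def cronbach_alpha(items):
    """Cronbach's alpha. items: one list per item, each holding all respondents' scores."""
    k = len(items)                 # number of items
    n = len(items[0])              # number of respondents

    def var(xs):
        """Population variance of a list of scores."""
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    sum_item_vars = sum(var(col) for col in items)
    # each respondent's total score across all items
    totals = [sum(col[i] for col in items) for i in range(n)]
    return (k / (k - 1)) * (1 - sum_item_vars / var(totals))

# Invented responses: 4 items, 5 respondents, 1-5 Likert scale
items = [
    [4, 5, 3, 4, 2],
    [4, 4, 3, 5, 2],
    [5, 5, 2, 4, 1],
    [4, 5, 3, 4, 2],
]
print(round(cronbach_alpha(items), 3))
```

Because these invented items rank respondents almost identically, the coefficient comes out above 0.90, the same neighborhood as the 0.9745 and 0.9618 Waggoner obtained in his pilot tests.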
Qualitative Research
In 1979 the faculty of the School of Religious Education at Southwestern created a teaching position for research and statistics. Their desire was for this position to give emphasis to helping students understand research methods and procedures for statistical analysis. They further desired that doctoral research become more objective and scientific, less philosophical and historical. In 1981, after two years of interviews and discussions, the Religious Education faculty voted, and the president approved, my election to their faculty to provide this emphasis. This textbook, and the dissertation examples it contains, are products of 25 years of emphasis on descriptive, correlational and experimental research -- most of which is quantitative or statistical in nature.

In recent years interest has grown in research methods which focus more on the issue of quality than quantity. A qualitative study is an inquiry process of understanding a social or human problem, based on building a complex, holistic picture, formed with words, reporting detailed views of informants, and conducted in a natural setting.17 Dr. Don Ratcliff, in a 1999 seminar for Southwestern doctoral students, suggested the following as the most common qualitative research designs: ethnography, field study, community study, biographical study, historical study, case study, survey study, observation study, grounded theory, and any combination of the above.18

Grounded theory is a popular choice of qualitative researchers. It originated in the field of sociology and calls for the researcher to live in and interact with the culture or people being studied. The researcher attempts to derive a theory by using multiple stages of data collection, along with the process of refining and inter-relating categories of information.19 Qualitative research is subjective, open-ended, and evolving, and relies on the ability of the researcher to reason and logically explain relationships and differences. Dr. Marcia McQuitty, Professor of Childhood Education in our school, has become our resident expert in qualitative designs. I continue to focus on quantitative research, which is, in comparison, objective, close-ended (once problem and hypothesis are established), and structured, and relies on the ability of the researcher to gather and statistically analyze valid and reliable data to explain relationships and differences.
His words reflect Jesus' teaching that He gives understanding to those who follow Him (Mt. 11:29; 16:24). Blaise Pascal wrote in the 17th century, The heart has reasons which are unknown to reason.... It is the heart which is aware of God and not reason. That is what faith is: God perceived intuitively by the heart, not by reason. The truth of Christ comes by living it out, by risking our lives on Him, by doing the Word. We grow in our knowledge of God through personal experience as we follow Him and work with Him. We believe in order to understand spiritual realities. This approach to knowing is private and subjective. Such belief-knowing resents the anti-supernatural skepticism of open-minded inquiry. More than that, some scientists consider the scientific method to be their religion. Their belief in evolution may be a justification for their unbelief in God. Science is helpful in learning about our world, but it makes a poor religion. So the faithful view science and its adherents with suspicion.

Sometimes, however, the suspicion of science by the religious has less to do with faith than with political power. In the Middle Ages, the accepted view of the universe was geocentric (earth-centered). The moon, the planets, the sun (located between Venus and Mars) and the stars were believed to rotate about the earth in perfect circles. This view had three foundations: science, philosophy and the Church. Greek science (Ptolemy) and Greek philosophy (Aristotle) supported a geocentric view of the universe. The logic was rock solid for centuries: Man is the pinnacle of creation. Therefore, the earth must be the center of the universe. The Roman Catholic Church taught that the geocentric view was Scriptural, based on Joshua 10:12-13. Joshua said to the LORD in the presence of Israel: 'O sun, stand still over Gibeon, O moon, over the Valley of Aijalon.'
So the sun stood still, and the moon stopped, till the nation avenged itself on its enemies, as it is written in the Book of Jashar. The sun stopped in the middle of the sky and delayed going down about a full day. For the sun and moon to stand still, the Church fathers reasoned, they would have to be circling the earth.

Then several scientists began their skeptical work of actually observing the movements of the planets and stars. Copernicus, a Polish astronomer, created a 16th-century revolution in astronomy when he published his heliocentric (sun-centered) theory of the solar system. He theorized, on the basis of his observations and calculations, that the earth and its sister planets revolved around the sun in perfect (Aristotelian) circles. Kepler later demonstrated that the solar system was indeed heliocentric, but that the planets, including earth, orbited the sun in elliptical, not circular, paths. The Roman Catholic Church attacked their views because they displaced earth from its position of privilege, and opened the door to doubt in other areas. But Poland is a long way from Rome (it was especially so in the 16th century!), and so Copernicus and Kepler remained outside the Church's reach.

Galileo, the father of modern physics, did his work in Italy in the 16th and 17th centuries. He studied the work of Copernicus and Kepler, and built a telescope in order to observe the planets more closely. In 1632, he published the book Dialogue Concerning the Two Chief World Systems: Ptolemaic and Copernican, in which he supported a heliocentric view of the solar system. He was immediately attacked by Church authorities who continued to espouse a geocentric world view. Professors at the University of Florence refused to look through Galileo's telescope: they did not believe his theory, so they refused to observe. Very unscientific! Galileo, under threat of being burned at the stake, recanted his findings. It was not until October 1992 that the Roman
Catholic Church officially overturned the decision against Galileo's book and agreed that he had indeed been right. Science questions, observes, and seeks to learn how the world works. Sometimes this process collides with the vested interests of dogmatic religious leaders.
ena, the machinery, so we may better understand how the world works. Science focuses on the creation. There need be no conflict between giving your heart to the Lord and giving your mind to the logical pursuit of natural truth.
Summary
In this chapter we looked at six ways of knowing. We discussed specifically how scientific knowing differs from the other five. We introduced you to the scientific method, as well as seven types of research. Finally, we made a brief comparison of faith-knowing and science-knowing.
Vocabulary
authority: knowledge based on expert testimony
common sense: cultural or familial knowledge, local
control of bias: maintaining neutrality in gaining knowledge
correlational research: analyzing relationships among variables
deductive reasoning: from principle (general) to particulars (specifics)
descriptive research: analyzing specified variables in select populations
empiricism: basing knowledge on observations
evaluation: analyzing existing programs according to set criteria
ex post facto research: analyzing effects of independent variables after the fact
experience: knowledge gained by trial and error
experimental research: determining cause and effect relationships between treatment and outcome
external criticism: determining the authenticity of a document or relic
historical research: analyzing variables and trends from the past
inductive reasoning: from particulars (specific) to principles (general)
internal criticism: determining the meaning of a document or relic
intuition/revelation: knowledge discovered from within
precision: striving for accurate measurement
primary sources: materials written by researchers themselves (e.g. journal articles)
research and development: creating new materials according to set criteria
scientific method: objective procedure for gaining knowledge about the world
secondary sources: materials written by analysts of research (e.g. books about)
theory construction: converting research data into usable principles
verification: replicating (re-doing) studies under varying conditions to test findings
Study Questions
1. Define in your own words six ways we gain knowledge. Give an original example of each.
2. Define science as a way of knowing.
3. Compare and contrast faith and science as ways of knowing for the Christian.
4. Define in your own words five characteristics of the scientific method.
5. Define in your own words eight types of research.
3. Match the type of research with the project by writing the letter below in the appropriate numbered blank line.

Historical
Experimental
Research & Development
Descriptive
Ex Post Facto
Qualitative
Correlational
Evaluation
____ An Analysis of Church Staff Job Satisfaction by Selected Pastors and Staff Ministers
____ Differentiating Between the Effects of Testing and Review on Retention
____ The Effect of Seminary Training on Specified Attitudes of Ministers
____ An Analysis of the Differences in Cognitive Achievement Between Two Specified Teaching Approaches
____ Determining the Relationship Between Hours Wives Work Outside the Home and the Couples' Marital Satisfaction Scores
____ The Church's Role in Faith Development in Children as Perceived by Pastors and Teachers of Preschoolers
____ The Relationship Between Study Habits and Self-Concept in Baptist College Freshmen
____ The Life and Ministry of Joe Davis Heacock, Dean of the School of Religious Education, 1953-1970
____ Church Life Around the Conference Table: An Observational Analysis of Interpersonal Relationships, Communication, and Power in the Staff Meetings of a Large Church
____ An Analysis of the Relationship Between Personality Trait and Level of Group Member Conflict...
____ The Role of Woman's Missionary Union in Shaping Southern Baptists' View of Missions
____ The Effectiveness of the CWT Training Program in Developing Witnessing Skills
____ Determining the Effect of Divorce on Men's Attitudes Toward Church
____ A Learning System for Training Church Council Members in Planning Skills
____ A Multiple Regression Model of Marital Satisfaction of Southwestern Students
____ The Effect of Student Knowledge of Objectives on Academic Achievement
____ A Study of Parent Education Levels as They Relate to Academic Achievement Among Home Schooled Children
____ A Critical Comparison of Three Specified Approaches to Teaching the Cognitive Content of the Doctrine of the Trinity to Volunteer Adult Learners in a Local Church
____ Curriculum Preferences of Selected Latin American Baptist Pastors
____ A Study of Reading Comprehension of Older Children Using Selected Bible Translations
2
Proposal Organization
Front Matter
The Introduction
The Method
The Analysis
Reference Material
The research proposal is a concise, clearly organized plan of attack for analyzing formal research problems. The beginning point in developing a proposal (itself not a part of the final product) is the felt difficulty. Hopefully, as you have read textbooks and journal articles, as you have listened to lectures and participated in discussion, you have been attracted to specific issues and concerns in your field. Perhaps there have been questions that remain unanswered, problems which remain unsolved, or conflicts which remain unresolved. These issues, your felt difficulties, hold the beginning point for your research proposal.
The first step toward an objective study of your felt difficulty is the choice of a topic. Consider a topic which has the potential to make a contribution to theory or practice in your chosen field. After all, a dissertation will consume large quantities of your time, your money, and your very self. Worthwhile topics can be discovered by browsing the indexes of information databases such as the Educational Resources Information Center (E.R.I.C.) or Psychological Abstracts (for detailed suggestions, see Chapter 6, Synthesis of Related Literature). This search, whether done manually or by computer, can provide useful information for confirming or abandoning a research topic. Once a topic has been determined, it must be translated, step by step, into a clear statement of a solvable problem and a systematic procedure for collecting and analyzing data. We begin that translation process in this chapter by providing a structural blueprint, as well as definitions of each proposal element, for the proposal you will eventually develop. The following structural overview gives you a framework for organizing your own proposal. Each element listed in the structural overview is defined. Study these elements until you can see the structure of the whole.
4th ed. 2006 Dr. Rick Yount
Front Matter
Title Page
Contents
Tables
Illustrations
Proposal Overview
Front Matter
  Title Page
  Table of Contents
  List of Tables
  List of Illustrations
INTRODUCTION
  Introductory Statement
  Statement of the Problem
  Purpose of the Study
  Synthesis of Related Literature
  Significance of the Study
  Statement of the Hypothesis
METHOD
  Population
  Sampling
  Instrument
  Limitations
  Assumptions
  Definitions
  Design
  Procedure for Collecting Data
ANALYSIS
  Procedure for Analyzing Data
  Testing the Hypotheses
  Reporting the Data
Reference Materials
  Appendices
  Bibliography
Title Page
The coversheet for the proposal contains basic information for the reader. You will list on this page your school name, the proposal title, your major department, your name and the date the proposal is submitted. The title of your proposal should provide sufficient information to permit your readers to make an intelligent judgment about the topic and type of study you're proposing to do. Your doctoral dissertation will be cataloged in Dissertation Abstracts upon graduation, so a clear title will attract more readers to your work.
Table of Contents
The Table of Contents lists the major headings and subheadings and their respective page numbers within the proposal. Suggestion: organize your proposal (and simplify the writing of the Table of Contents) using a three-ring binder with dividers for each section and element of the proposal. As you work on each section, file your materials in proper order in the binder.
List of Tables
As you write your dissertation, you will want to augment your written explanations with visual representations of the data. One form of presentation is the table, which displays the data in tabular form (rows and columns of figures) and enhances, clarifies, and reinforces the verbal narrative. The List of Tables lists each table by name and page number. Let me suggest that you consider carefully the tables you will need to use to display your data and include a sample of each planned table in your proposal. Doing this shows that you have given adequate consideration to the forms your data will take.
List of Illustrations
An illustration is a graph, chart, or picture that enhances visually the meaning of what you write. The List of Illustrations lists each illustration by caption and page number.
Introduction
The introduction section includes the introductory statement, the statement of the problem, the purpose of the study, the synthesis of related literature, the significance of the study, and the hypothesis. The purpose of the introduction is to demonstrate the thoroughness of your preparation for doing the study. This section explains to others, such as the Advanced Studies Committee, why you want to do this study. It further demonstrates how well you understand your specific field.
Introductory Statement
Problem
Purpose
Synthesis
Significance
Hypothesis
Problem Statement. Just as an instructional objective provides the framework for lesson planning, so the Problem reflects the very heart of the study. For example, look at the following Problem Statements from the dissertations of Drs. Marcia McQuitty and Norma Hedin:
The problem of this study [will be] to determine the relationship between the dominant management style and selected variables of full-time ministers of preschool and childhood education in Southern Baptist churches in Texas. The selected variables [are] level of education, years of service on church staffs, task preference, gender, and age.1 The problem of this study [will be] to determine the differences in measured self-concept of children in selected Texas churches across three variables: school type (home school, Christian school, and public school), grade (fourth, fifth, and sixth), and gender.2
Notice that the list of Purpose statements comes directly out of the Problem Statement, and yet expands each component of it.
study. It details what others are doing in the field, what methods are being used, and what results have been obtained in recent years. A synthesis is different from a summary. In a summary, articles relating to a subject are outlined and then written up one after another. Let's say we have three articles. Article 1 contains discoveries A, B, and D. Article 2 contains discoveries A, B, and C. Article 3 contains discoveries A and C. A summary would look like this:
Article 1 found A, B, and D. Article 2 found A, B, and C. Article 3 found A and C.
This makes for lifeless writing and boring reading. It also fails to uncover the groupings of discoveries across all the articles. A synthesis, however, focuses on key words and discoveries across many articles and combines the various research articles' findings. The focus is on the research discovery-clusters, not on individual articles. Look at the following rewrite:
Three researchers found A (1,2,3). Two researchers found B (1,2), and two researchers found C (2,3).
This approach helps you discover linkages among researchers and makes for much more interesting reading. I've used three articles as an example, but a dissertation study will involve scores of them!

When I was doing library research on my last doctorate, I found over a hundred research reports relating to my subject. In these reports, statisticians argued about proper procedures on the basis of a particular kind of error rate. As I analyzed the articles, I found that the researchers could be put into three camps. These camps, and the comparison of their views of various statistical issues, formed the organizational structure for my Related Literature section. I condensed ninety-two journal articles into fifteen pages of synthesis using over 30 key words.

I remember my grandfather gathering the sap from maple trees to boil down into syrup. It frequently required over 100 gallons of sap to produce a gallon of syrup. This same process applies to the preparation of the Synthesis of Related Literature. Dr. Rollie Gill provides an example of synthetic writing in his dissertation on leadership styles.4
Outside research on Situational Leadership has questioned the validity and reliability of the "theory."127 See Chapter Six for more information on synthesizing literature.
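The discovery-cluster idea can be mechanized once you have listed each article's findings: invert the article-by-findings list so that each finding points to the articles that support it. A small sketch, using the hypothetical articles 1-3 and findings A-D from the example above:

```python
from collections import defaultdict

# Findings reported by each article (hypothetical, from the example above)
articles = {
    1: ["A", "B", "D"],
    2: ["A", "B", "C"],
    3: ["A", "C"],
}

# Invert the mapping: for each finding, which articles support it?
clusters = defaultdict(list)
for article, findings in articles.items():
    for finding in findings:
        clusters[finding].append(article)

for finding in sorted(clusters):
    print(f"Finding {finding}: articles {clusters[finding]}")
```

The printed clusters (A supported by articles 1, 2 and 3; B by 1 and 2; and so on) are exactly the discovery-clusters around which a synthesis, rather than a summary, is organized.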
What tangible contribution will it make? In short, it answers the so-what question. You want to study something. You find what you expect. So what?! The personal interest of the student or his/her major professor is not sufficient rationale for approving a proposal. The best rationale is a reference to one or more research studies stating the need for what you propose to do. Dr. Dean Paret wrote an effective statement of significance for his study on healthy family functioning:5
This study [will be] significant in that: 1. It provides empirical data for the relationship between family of origin in terms of autonomy and intimacy roles that were adapted and the current family healthy functioning patterns. Empirical validation has been called for by Hoverstadt et al.118 to support the theoretical assumptions upon which family therapy techniques are based. 2. It provides empirical data for breaking the recurrent cycle perpetuating the adult child syndrome.119 3. It provides a basis for the development of specific parenting training for the ministry of the church. 4. It provides helpful information for the seminary to aide [sic] the students who are having a difficult time juggling married life and student life, by providing indicators of stress areas related to autonomy and intimacy. According to Dr. David McQuitty, Director of Student Aid, the seminary through his office sees an increase in problems encountered by students as their seminary journey increases, both in financial stress, and student stresses, that could possibly be related to issues brought forward from the family of origin.120 It is therefore necessary to provide empirical data to help in breaking down the dysfunctional patterns of interaction.
118 Hoverstadt, et al., 287 and 296.
119 Fine and Jennings, 14.
120 Conversation with Dr. McQuitty on August 18, 1990.
Just before my Proposal Defense, I made one last trip to the North Texas Science library. On that trip, I found a reference to a speech made two years earlier. Looking up the speech, I found a gold mine! The writers had analyzed many of the procedures I was studying. Their conclusion was to call for a computer analysis of several of the most popular procedures. It was the focus of my study! I added this recommendation to my significance section. It provided a solid rationale for my study when I defended it before my Proposal Committee.
The Hypothesis
The Statement of the Problem describes the heart of your study in one or two succinct sentences. The Statement of the (research) Hypothesis describes the expected outcome of your study. Base the thrust of your hypothesis on the synthesis of literature. Use the Problem Statement as the basis for the format of the hypothesis. Look at this Problem-Hypothesis pair from the dissertation of Dr. Joan Havens:
The problem of this study [is] to determine the difference in level of academic achievement across four populations of Christian home schooled children in Texas: those whose parents possessed (1) teacher certification, (2) a college degree, but no certification, (3) two or more years of college, or (4) a high school diploma or less.6
5 Dean Kevin Paret, A Study of the Perceived Family of Origin Health as It Relates to the Current Nuclear Family in Selected Married Couples (Ed.D. diss., Southwestern Baptist Theological Seminary, 1991), 36-37.
[One of the hypotheses of this study is that there will] be no significant difference in levels of academic achievement in home schooled children across the four populations surveyed.7
Or another, from the dissertation of Dr. Don Clark, who did an analysis of the statistical power levels of dissertations hypothesizing differences written here in the School of Educational Ministries at Southwestern since 1981.8
The problem of this study [will be] to determine the difference in power of the statistical test between selected dissertations' hypotheses proven statistically significant and those selected dissertations' hypotheses not proven statistically significant in the School of Religious Education at Southwestern Baptist Theological Seminary.9 The hypothesis of this study [is] that power of the statistical test will be significantly higher in those dissertations' hypotheses finding statistically significant results than those. . .not finding statistically significant results.10
The Problem poses the question to be answered; the hypothesis presents the expected answer. The research hypothesis must be stated in measurable terms and should indicate, at least generally, the kind of statistic you'll use to test it. See Chapter Four for more information on writing the Hypothesis Statement.
Method
The METHOD section contains a detailed blueprint of your planned procedures. It specifically explains how you will collect the necessary data to analyze the variables you've chosen in a clear step-by-step fashion. This section includes the following components: population, sampling, instrument, limitations, assumptions, definitions, design, and collecting data.
Population
Sampling
Instrument
Limitations
Assumptions
Definitions
Design
Collecting Data
Population
The Population section of the proposal specifies the largest group to which your study's results can be applied. Any samples used in the study (see below) must be drawn from one or more defined populations. Here is Dr. Da Silva's population:
The population for this study [will consist] of social work administrators in Texas who [are] members of the National Association of Social Workers. According to the mailing list of May 21, 1992, there [are] five hundred and seventy-eight administrators from the state of Texas.11
The population of this study [will consist] of all hypotheses from Ed.D. and Ph.D. dissertations completed within the School of Religious Education at Southwestern Baptist Theological Seminary which met four criteria:

1. The hypothesis was included within a dissertation completed between May 1978 and May 1996.
2. The hypothesis tested differences between groups as opposed to relationships between variables.
3. The hypothesis was tested statistically by means of t-Test for Difference Between Means, One-way ANOVA, Two Factor ANOVA, or Three Factor ANOVA.
4. Statistical significance was determined solely upon meeting a singular criteria, that being a single statistical test.12
Sampling
The Sampling section describes how you will draw one or more samples from the population or populations defined above. It also explains how many subjects you intend to study in these samples. Here are examples of sampling statements based on the populations we defined above.
A twenty-five percent random sample [will be] obtained from the mailing list of the National Association of Social Workers in the State of Texas. The sample [is] estimated to consist of 144 subjects.13 A simple random sample of hypotheses [will be] conducted to produce two equal groups of fifty hypotheses: hypotheses proven statistically significant (Group X) and hypotheses not proven significant (Group Y). . . .14
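The sampling statements above can be sketched in code. The following Python fragment is an illustration only: the function name and the synthetic mailing list are ours, not part of either study. It shows how a twenty-five percent simple random sample might be drawn so that every member of the population has an equal chance of selection.

```python
import random

def simple_random_sample(population, fraction, seed=None):
    """Draw a simple random sample of `fraction` of the population without
    replacement, so every member has an equal chance of selection."""
    rng = random.Random(seed)
    size = round(len(population) * fraction)
    return rng.sample(population, size)

# A stand-in for the 578-name mailing list in Dr. Da Silva's study:
mailing_list = [f"administrator_{i}" for i in range(1, 579)]

sample = simple_random_sample(mailing_list, 0.25, seed=1992)
print(len(sample))  # 144 subjects, matching the estimate in the excerpt
```

Because the draw is made without replacement from the whole list, no administrator can appear twice, and the expected sample size (25% of 578, about 144) matches Dr. Da Silva's estimate.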
Instrument
The Instrument section describes the tools you plan to use in measuring subjects. Instruments include tests, scales, questionnaires, interview guides, observation checklists, and the like. If you choose an existing instrument appropriate for your study, then describe its development, use, reliability, and validity. If you cannot find a suitable instrument, you will need to develop your own. Provide a step-by-step explanation of the procedure you will use to develop, evaluate, and validate the instrument. Here is a portion of Dr. Hedin's instrument section:
The instrument selected for this study [is] the Piers-Harris Children's Self-Concept Scale (The Way I Feel About Myself), developed by Ellen V. Piers and Dale B. Harris in 1969. . . . Answers are keyed to high self-concept; thus, a higher total score [indicates] a positive concept of self. . . . Reliability coefficients ranging from .88 to .93, based on Kuder-Richardson and Spearman-Brown formulas, were reported for various samples29 . . . Content validity was built into the scale by using children's statements about themselves as the universe to be measured as self-concept. By writing items pertaining to that universe of statements, the authors defined self-concept for their scale31 . . . An attempt was made to establish construct validity during the initial standardization study. The PHCSCS scale was administered to eighty-eight adolescent institutionalized retarded females. As predicted by Piers and Harris, these girls scored significantly lower than normals of the same chronological or mental age. This was interpreted as meaning that the PHCSCS did measure self-concept and discriminated between high and low self-concept.32

12 Clark, 30-31
13 Da Silva, 7
14 Clark, 31
See Chapters Nine, Ten, and Eleven for more information on developing instruments.
Limitations
The Limitations section describes external restrictions that reduce your ability to generalize your findings. An external restriction is one that is beyond your control. Let's say you plan to randomly assign students in a local high school to one of three experimental teaching groups. When you check with the principal, he allows you to do the experiment, but only if you use the regular classes of students; he does not want you disrupting classes through random assignment. Since random assignment is an important part of experimental design, this is a limitation to your study and must be stated in this section. Limitations differ from delimitations. Delimitations are restrictions you set on your study. The fact that you decide to study single adults ages 20-50 is a delimitation of your study, not a limitation. Choosing to study only 6 of the 16 scales of the 16PF Test is a delimitation, because you make that decision on your own. Limitations are external restrictions and belong in this section. Delimitations are personal restrictions and belong in the Procedures for Collecting Data section of the proposal -- there is no Delimitations section. One of Dr. Matt Crain's limitations was:
Due to the lack of a central organizational headquarters, no directory of Churches of Christ exists whereby a true random sample of all congregations may be obtained.16
15 Wesley Black, A Comparison of Responses to Learning Objectives for Youth Discipleship Training from Minister of Youth in Southern Baptist Churches and Students Enrolled in Youth Education Courses at Southwestern Baptist Theological Seminary (Ed.D. diss., Southwestern Baptist Theological Seminary, 1985), 30-31
16 Matthew Kent Crain, Transfer of Training and Self-Directed Learning in Adult Sunday School Classes in Six Churches of Christ (Ed.D. diss., Southwestern Baptist Theological Seminary, 1987), 8
This study [will be] subject to the limitations recognized in collecting data by mail, such as difficulty in assessing respondent motivation, inability to control the number of responses, and bias of sample if a 100 percent response is not secured.17
Assumptions
Every study is built on assumptions. The purpose of this section is to insure that the researcher has considered his assumptions in doing the study. In doing a mailed questionnaire, the researcher must assume that the subjects will complete the questionnaire honestly. In testing which of two counseling approaches is best, one assumes that the approaches are appropriate for the subjects involved. Provide a rationale for the assumptions you state. It is not enough to copy assumptions out of previous dissertations. Explain the why of your assumptions. Here are several assumptions made by Dr. Darlene Perez:
1. All [112 Puerto Rican Southern and American Baptist] churches will have a youth Sunday School enrollment.
2. The pastors and youth leaders will cooperate with the study and will insure completion of the questionnaires.
3. Since [all] 112 Southern Baptist and American Baptist churches were used in the study, it is assumed that the findings are important in that they represent the general opinion of Baptist youth groups in Puerto Rico. . . .18
Definitions
If you are using words in your study that are operationally defined -- that is, defined by how they are measured -- or have an unusual or restricted meaning in your study, you must define them for the reader. You do not need to define obvious or commonly used terms. For example, Dr. Kaywin LaNoue studied differences in spiritual maturity in high school seniors across two variables: active versus non-active in Sunday School, and Christian school versus public school. But what did she mean by active in Sunday School? What is spiritual maturity and how did she measure it? Here are her definitions for these two terms:
17 Charles S. Bass, A Study to Determine the Difference in Professional Competencies of Ministers of Education as Ranked by Southern Baptist Pastors and Ministers of Education (Ph.D. diss., Southwestern Baptist Theological Seminary, 1998), 45
18 Darlene J. Perez, A Correlational Study of Baptist Youth Groups in Puerto Rico and Youth Curriculum Variables (Ed.D. diss., Southwestern Baptist Theological Seminary, 1991), 12
19 Gail Linam, A Study of the Reading Comprehension of Older Children Using Selected Bible Translations (Ed.D. diss., Southwestern Baptist Theological Seminary, 1993), 85
Active. Active means those students attending their Sunday School at least three Sundays a month.20
Spiritual maturity. Peter gives the steps in a Christian's growth toward maturity when he lists the attributes of the Christian life in the order by which they should be sought. He does this in 2 Peter 1:5-8. . . . In this study, spiritual maturity [is] the extent to which the students have assimilated (internalized) the virtues of goodness, knowledge, self-control, perseverance, godliness, brotherly kindness, and love.21
Dr. LaNoue used an adaptation of the Spiritual Maturity Test, developed and published by Dr. James Mahoney, to convert the virtues listed above into a test score.22 Sometimes special terms are used to communicate complex concepts quickly. These terms need to be defined. For example, the term "k, J combination" makes no sense until it is clearly defined:
k,J combination. -- This term refers to two major variables in this study: the number of groups in an experiment, k, and the sample size category, J. There [will be] four levels of k representing three, four, five, and six groups. There [will be] seven levels of J. J(1) through J(5) [will represent] equal n sample sizes of 5, 10, 15, 20, and 25 respectively. J(6) [will represent] an unequal set of nj's in the ratio of 1:2:3:4:5:6 with n1= 10. That is, when k=3, the sample n's [will be] 10, 20, and 30. J(7) [will represent] a set of nj's in the ratio of 4:1:1:1:1:1 with n1=80. That is, when k=3, the sample n's [will be] 80, 20, and 20. This provides twenty-eight combinations of k,J.23
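The k,J scheme in this definition can be verified with a short script. This Python sketch (the function name is ours, for illustration) enumerates the twenty-eight combinations and reproduces the sample sizes given in the definition:

```python
from itertools import product

def sample_sizes(k, j):
    """Group sample sizes for k groups under sample-size category J(j),
    following the definitions quoted in the k,J example above."""
    if 1 <= j <= 5:                # J(1)-J(5): equal n's of 5, 10, 15, 20, 25
        return [5 * j] * k
    if j == 6:                     # J(6): unequal n's in ratio 1:2:3:..., n1 = 10
        return [10 * g for g in range(1, k + 1)]
    if j == 7:                     # J(7): n's in ratio 4:1:1:..., n1 = 80
        return [80] + [20] * (k - 1)
    raise ValueError("j must be between 1 and 7")

combinations = list(product([3, 4, 5, 6], range(1, 8)))  # levels of k x levels of J
print(len(combinations))   # 28, as the definition states
print(sample_sizes(3, 6))  # [10, 20, 30], matching the k=3, J(6) example
print(sample_sizes(3, 7))  # [80, 20, 20], matching the k=3, J(7) example
```

Four levels of k crossed with seven levels of J yields the twenty-eight combinations the definition claims.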
Design
The Design section describes the research type of your study. It is here you declare your research to be correlational, or historical, or experimental. See the overview of Research Types in Chapter One for a description of eight major design types. Describe key factors that make your study of the stated type. If you are using an experimental design, explain which you are using and why. Dr. Brad Waggoner explained his design this way:24
The method of research [which will be] employed in this study [is] Research and Development. . . . This type of research [is] accomplished in two phases. The first phase [will involve] the development of the product. The second phase [will consist] of evaluating the use or effects of the product.xx Although the exact number of specific stages of Research and Development vary from author to author, the following five steps [will be] applied:xy
1. The identification of a need, interest, or problem
2. The gathering of information and resources concerning the problem or need
3. The preliminary product or process [is] developed
4. The product or process [is] field-tested
5. The product or process [is] refined based on the information obtained from the field-testing.

20 Kaywin Baldwin LaNoue, A Comparative Study of the Spiritual Maturity Levels of the Christian School Senior and the Public School Senior in Texas Southern Baptist Churches With a Christian School (Ed.D. diss., Southwestern Baptist Theological Seminary, 1987), 25
21 Ibid., 26
22 Ibid., 93-97
23 William R. Yount, A Monte Carlo Analysis of Experimentwise and Comparisonwise Type I Error Rate of Six Specified Multiple Comparison Procedures When Applied to Small k's and Equal and Unequal Sample Sizes (Ph.D. diss., University of North Texas, 1985), 8
24 Waggoner, 7-8
xx xy Ibid., 13-14
Dr. Martha Bergen described the design of her study this way:25
The design of this study [is] descriptive in nature. [A] questionnaire [will be] designed to determine the attitudes of Southwestern Seminary's full-time faculty toward computers for seminary education. Further, certain variables [will be] examined to determine their possible predictions of these attitudes.
Analysis
The third and final major section of the proposal is the analysis section. The ANALYSIS section describes how you plan to process the numbers on the data sheets. This section moves step by step through the application of selected statistical procedures, the testing of hypotheses, and the reporting of the data in a systematic, coherent way.
with this? "What do you want to find out?" I asked. "I dunno ... uh, I'm not sure." He had paid $300 for advice from a statistician across town, and had been led down a dead-end alley. The student left too much for others to decide. He did not own his own research. I gave him some suggestions, and, with a great deal of effort on his part and some additional help from his statistician, he was able to produce an acceptable dissertation. But he paid for it in many sleepless nights! The truth of the matter is that, as shown in the diagram at right, we really cannot correctly collect data until we know how we're going to analyze it. The two parts, design and analysis, work together.
Reference Material
The Reference Material section contains supporting materials for the proposal. These materials include appendices and bibliography.
Appendices
An appendix contains supporting materials which relate directly to your study. Most proposals require several appendices to include cover letters, a sample of the instrument, results of a pilot study, the data summary sheets, complex tables, illustrations of statistical analysis, and so forth. Dr. Daryl Eldridge developed twenty-three appendices to house all the supplemental materials generated by his 188-page dissertation. What could possibly take up twenty-three appendices? Here's the list:27
1 - Course Objectives for Building a Church Curriculum Plan 332-435 [3 pages]
2 - Sample of Class / Session Objectives [1]
3 - First Draft of Unit 1 Exam [5]
4 - Cornell Inventory for Student Appraisal of Teaching and Courses [7]
5 - Letter to Research Associates for Validation of Cognitive Tests [2]
6 - Test Item Analysis - Unit 1 Exam [2]
7 - Letter to Research Associates for Validation of Precourse Attitude Inventory [2]
8 - Report Form For Student Test Scores [1]
9 - Session Goals and Indicators [4]
10 - Unit 1 Exam, Final Form [8]
11 - Unit 3 Exam, Final Form [5]
12 - Cognitive PreTest, Final Form [4]
13 - Postcourse Student Inventory [8]
14 - Precourse Student Inventory [3]
15 - Tentative Class Schedule [4]
16 - Course Syllabus, Fall Semester [3]
17 - Course Syllabus, Spring Semester [5]
18 - Quizzes Over SBC Curriculum [6]
19 - Letter to Cornell University [1]
20 - Selected Comments From the Postcourse Inventory and Student Evaluations [3]
21 - Raw Scores For All of the Instruments [4]
22 - A Comparison of Scores Across Semesters for the Various Instruments [2]
23 - Statistical Analysis for Each of the Instruments Across Semesters [5]

26 Eldridge, 79
27 Ibid., 96-183
You provide a clear, categorized filing system for supportive information by packaging materials in appendices. Small parcels of this information can be drawn from these appendices for explanation and illustration in the body of the dissertation. Such a design permits you to provide complete information, through references to the appendices, without bogging down the flow of thought in the dissertation itself. In the proposal development stage, think ahead concerning what appendices you will need and include an empty copy of each as an appendix to the proposal. This demonstrates to the Committee forethought and critical thinking.
Practical Suggestions
Here are some practical suggestions to help you write a solid proposal.
Personal Anxiety
This assignment is complex. Some students experience a frightening sense of anxiety as they consider the daunting task of writing a research proposal. A research proposal taxes the thinking skills of the best students. You are confronted with learning new definitions (knowledge), understanding new concepts (comprehension), discovering conceptual links among numerous articles (analysis), writing an integrative narrative (synthesis), choosing the correct design and statistical procedures (evaluation), and putting all of this together in a single-focused, comprehensive document. Your educational experiences in high school and college may have emphasized rote memory, recall, and simple concepts rather than clear thinking. Therefore, writing an original research proposal is a strange new thing for some. Many paths to choose. Many decisions to make. What topic will I choose? What kind of research will I select? Where do I begin? For some, too many neat ideas compete for attention. For others, neat ideas are nowhere to be found. Don't panic. Take each section, each step of the process, one at a time.
Professionalism in Writing
A research proposal should be written in a clear, professional manner or it will not be understood. Here are some suggestions.
Clear Thinking
Your proposal should show clear thinking. Write and revise. Squeeze out fuzzy phrases, word magic,28 and awkward grammar. Write simply and clearly. Use professional jargon only when simple English can't convey the thought.
Unified Flow
There should be a unified flow through the proposal. Take care not to ramble or lose focus in the details. March step by step in a single direction from the first page to the last.
Efficient Design
Your proposal should demonstrate your understanding of research design and statistical analysis, and how they work together. The proposal should present a narrative that is all-of-one-piece rather than a disjointed collection of pieces. Problem, Hypothesis, and Statistic should form its backbone.
Accepted Format
Finally, write in the accepted professional format of your school. Content is more important than format, but a professional format is required.
Summary
This chapter lays out the complete skeletal organization, with examples from actual dissertations, for the proposal you are developing. Study each component individually, as well as its relationship to the whole. Refer to this chapter and to the Evaluation Guidelines in Chapter 27 throughout the writing process to insure that you are on course. You will add to your understanding of each of these components as the semester progresses. Use this overview to anchor the big picture in your mind.

28 I use the term word magic to refer to high-sounding, emotive words that have little substantive meaning. "The majestic purpose of the American school is to instill in the hearts and minds of our youth the requisite essentials which will allow them to take their rightful place in society and fulfill their destiny." Huh? We hear word magic in sermons and classrooms as well. It gets the amens but communicates little.
Vocabulary
Analysis -- describes step-by-step the analysis of collected data
Appendix -- an addendum to a proposal which contains supporting examples
Assumptions -- stated presuppositions upon which a proposed study is based
Bibliography -- a list of references used in developing the proposal
Definitions -- a list of meanings of terms which are unique to the study, operationalized
Delimitations -- restrictions placed on a study by the researcher
Design -- an explanation of the specific experimental approach to be used
felt difficulty -- the beginning point of a study but not included in proposal
Front Matter -- preliminary materials such as Table of Contents and Lists
Hypothesis -- the anticipated outcome of the study or solution to the Problem
Instrument -- the means by which data is gathered
Introduction -- the first major section of the proposal (includes the Problem)
Introductory Statement -- the opening statement of the proposal which leads to the problem
Limitations -- restrictions placed on a study outside the researcher's control
List of Tables -- a listing of tables used in the proposal (Front Matter)
List of Illustrations -- a listing of illustrations used in the proposal (Front Matter)
Method -- the second major section of a proposal (includes sampling and instrument)
Population -- the largest group to which the proposed study can be generalized
Procedure for Collecting Data -- step-by-step procedure for sampling, instrumentation, and gathering data
Procedure for Analyzing Data -- step-by-step procedure for statistically reducing data to meaningful results
Purpose of the Study -- explanation of the rationale for doing the study
Reporting the Data -- explanation of how data analysis will be presented (charts, tables)
research proposal -- a step-by-step blueprint for conducting scientific inquiry
Sampling -- the process of identifying a representative group from a population
Significance of the Study -- stated reasons why a study is necessary (answers "so what?")
Statement of the Problem -- simple focused statement of the relationship among variables in the study
Synthesis of Related Literature -- a clear narrative which fuses research materials related to the study
Table of Contents -- an outline of proposal organization (Front Matter)
Testing the Hypotheses -- an explanation of how stated hypotheses will be tested statistically
Title Page -- the cover page of the proposal
Study Questions
1. Differentiate between the Introduction and the introductory statement.
2. Differentiate between a synthesis and a summary of related literature.
3. Differentiate between a limitation and a delimitation.
4. What are the three essential elements that make up the backbone of a proposal?
Sample Test Questions

2. The introductory statement should
a. move from a broad focus of the field to the narrow focus of the study
b. express the subjective interest and intent of the researcher
c. take care not to use information from research articles
d. lead directly to the statement of the hypothesis

3. Which of the following is not recommended as a way to organize the synthesis of literature?
a. research article publication dates
b. research article author names
c. concepts addressed by research articles
d. hypotheses of the study

4. Which of the following sections may be omitted from a proposal with appropriate caution?
a. The Problem
b. The Hypothesis
c. The Significance of the Study
d. The Limitations
3
Empirical Measurement
Scientific knowing stands or falls on the precision of its empirical observations. Whether these observations are made with a microscope, a telescope, a stopwatch, or a pencil-and-paper test, the scientist strives for an accurate, numerical representation of the phenomena he is studying. The first step is to define the phenomenon under study in terms of the way you intend to measure it. This process is called operationalization. In order to understand the process, you will need to understand the terms variable and measurement. If you do not determine a clear way to measure what you intend to study, you will eventually bog down in the confusion of instrument design and statistical procedures. Now, not sometime later in your studies, is the time to decide specifically how you will measure the variables you intend to study.
Independent Variables
An independent variable is one that you control or manipulate. You decide to study three different teaching methods. Teaching Method is an independent variable. Or you want to compare four approaches to counseling abused children. Counseling Approach is the independent variable.
Dependent Variables
A dependent variable is the variable you measure to demonstrate the effects of the independent variable. If you are studying Teaching Method you might measure achievement or attitude toward the class. If you are studying counseling approach you might measure anxiety level or overt aggression.
Measurement Types
Before a dependent variable can be analyzed statistically, it must be measured or classified in some manner. There are four major ways we measure variables. These measurement types are called nominal, ordinal, interval and ratio.
Nominal Measurement
Nominal data refers to variables which are categorized into discrete groups. Subjects are grouped or classified into categories on the basis of some particular characteristic. Examples of nominal variables include all of the following: gender, college major, religious denomination, hair color, residence in a certain geographic region, staff position.
Ordinal Measurement
Ordinal data refers to variables which are rank ordered. Notice that nominal variables have no order to them: Males and Females imply nothing more than two different groups of subjects. But ordinal data orders subjects from high to low on some variable. An example of this data type would be the rank ordering of ten priorities for Christian education in the local church.
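The nominal/ordinal distinction can be shown in a few lines of Python. The data values below are invented for illustration: nominal categories support only classification and counting, while ordinal ranks additionally support ordering.

```python
from collections import Counter

# Nominal: discrete categories with no inherent order -- only equality
# and counting are meaningful.
majors = ["Religious Ed.", "Theology", "Music", "Theology", "Religious Ed."]
print(Counter(majors))  # frequencies per category; no category outranks another

# Ordinal: ranks carry order but not distance. Sorting by rank is meaningful;
# subtracting ranks is not.
priorities = {"evangelism": 1, "discipleship": 2, "fellowship": 3}
ranked = sorted(priorities, key=priorities.get)
print(ranked)  # ['evangelism', 'discipleship', 'fellowship']
```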
Interval Measurement
An ordinal scale only reports 1st, 2nd, 3rd places in a set of data. It cannot tell us whether the distance between 1st and 2nd is greater than or less than the distance between 2nd and 3rd. In order to measure distances between data points, we need a scale of equal, fixed gradations. This is precisely what an interval scale is. Numbers are associated with these fixed gradations, or intervals. One of the most common examples of an interval scale is temperature. The difference between 50 and 60 degrees F. is the same as the difference between 100 and 110 degrees F. Another example is an attitude scale which has 20 items. Each item can have a value of 1, 2, 3, or 4. That means a subject can make a score between 20 and 80. The scores on this scale fall at regular one-point intervals from 20 to 80.
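The attitude-scale arithmetic above is easy to verify. In this Python sketch the helper name and the sample responses are ours: twenty items scored 1 to 4 yield totals falling at one-point intervals between 20 and 80.

```python
def scale_bounds(n_items, min_per_item, max_per_item):
    """Lowest and highest possible totals on a summated attitude scale."""
    return n_items * min_per_item, n_items * max_per_item

low, high = scale_bounds(20, 1, 4)
print(low, high)  # 20 80: scores fall at one-point intervals from 20 to 80

# A subject's total is the sum of equal-interval item scores (values invented):
responses = [3, 4, 2, 4, 3, 1, 4, 2, 3, 3, 4, 2, 1, 3, 4, 4, 2, 3, 3, 4]
print(sum(responses))  # 59, inside the 20-80 range
```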
Ratio Measurement
Interval data does not, however, lend itself to ratios. We cannot say, for example, that 100 degrees is twice as hot as 50 degrees. The zero point on an interval scale is arbitrary; that is, it does not represent the total absence of the measured characteristic. A temperature reading of 0 degrees F. does not mean there is no heat. (The Kelvin scale was invented for this. A temperature of 0 degrees Kelvin, about -460 degrees F., is absolute zero temperature.) Ratio measurement differs from interval measurement only in the fact that the ratio scale contains a meaningful zero point. Zero weight means that the object weighs nothing. Zero elapsed time means that no time has passed since the beginning of the experiment (it has yet to begin!). A true zero point means that observations can be compared as ratios or percentages. It is meaningful to say that a 60-year-old is twice the age of a 30-year-old. Or that a 90-pound weakling weighs half as much as a 180-pound bully. In most types of studies, interval and ratio data are treated the same for purposes of selecting the proper statistical procedure.
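The interval-versus-ratio distinction can be demonstrated numerically. This Python sketch converts Fahrenheit (interval scale) readings to Kelvin (ratio scale) to show why "100 degrees is twice as hot as 50 degrees" fails, while a ratio statement about weight holds.

```python
def f_to_kelvin(f):
    """Convert degrees Fahrenheit (interval scale, arbitrary zero) to
    Kelvin (ratio scale, true zero at absolute zero)."""
    return (f - 32) * 5 / 9 + 273.15

# The interval-scale claim "100 F is twice as hot as 50 F" fails once the
# readings sit on a scale with a true zero:
print(f_to_kelvin(100) / f_to_kelvin(50))  # about 1.10, nowhere near 2.0

# Ratio comparisons are meaningful on a true ratio scale such as weight:
print(180 / 90)  # 2.0: the 180-pound bully weighs twice the 90-pound weakling
```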
Operationalization
Our research design describes how we plan to measure selected variables. Statistical analysis describes how we plan to reduce these measurements to a meaningful (numerical) form. In both cases, the variables in the study must be defined in terms of measurement.
4th ed. 2006 Dr. Rick Yount
Definitions
An operational definition indicates the operations1 or activities that are performed to measure or manipulate a variable.2 The purpose of an operational definition is to help scientists speak the same language when reporting research. Since one of the primary characteristics of science is precision, we must begin with precise definitions of the variables we plan to study. Operational definitions force us to think concretely and specifically about the terms we use. Some of my students struggle with this. In one of my Principles of Teaching classes, a student was attempting to describe the fruit of the Spirit (Gal. 5:22-23). He defined "love" as "God's kind of love." But what kind of love is that? "Joy" was defined as "joy that you feel deeply, the joy we'll experience in heaven." But what is joy? These are non-definitions. They are empty. They are useless in teaching because they convey nothing but semantic fluff. I call this kind of definition "word magic," for it deceives teachers into thinking they are explaining words and phrases when in fact the definitions are little more than puffs of smoke in the air. Defining terms in precise terms of measurement avoids this kind of imprecision in research. Secondly, operational definitions provide a common base for communication of terms with others. When terms are operationally defined, readers know exactly how we are using our terms. For example, what does hunger mean? In one research study the operational definition for hunger was "the state of animals kept at 90% of their normal body weight." This is certainly not the definition people use when they reach for their third chocolate-covered doughnut, saying, "I'm really hungry!" The goal is to precisely understand the terms we use in research, and to convey that meaning clearly to others.
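The hunger study's operational definition translates directly into a testable rule. In this Python sketch the 90% cutoff comes from the quoted study; the function name, and our reading of "kept at 90%" as "at or below 90%," are our own interpretation.

```python
def is_hungry(current_weight, normal_weight):
    """Operational definition of hunger from the study quoted above: the
    animal is kept at 90% of its normal body weight. We read 'at 90%' as
    'at or below 90%'; that reading, and this function's name, are ours."""
    return current_weight <= 0.90 * normal_weight

print(is_hungry(270, 300))  # True: 270 is exactly 90% of 300
print(is_hungry(295, 300))  # False: 295 is above the 90% cutoff
```

Notice how the definition leaves no room for argument about what "hungry" means in the study: any reader can apply the same rule to the same data and reach the same classification.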
An Example
Years ago, General Motors used the slogan "We Build Excitement -- PONTIAC!" Suppose we wanted to study that. What does General Motors mean by excitement? We need to operationalize the term. There are several ways to do it. Have trained raters follow selected owners of Pontiacs, Fords, and Chryslers and count the number of times they behave in an excited, agitated, or exuberant manner. Excitement means the number of such behaviors per day. Is there a significant difference among the owners of these three makes of cars? Or, tally the number of dates selected car owners have per week. Excitement means the number of dates per week. This definition assumes that dates are exciting. Or, ask the owners: How excited does your car make you? Have them respond by marking a scale from 0 (no excitement) to 10 (excited all the time because of the car). Here excitement is a self-reported feeling, measured by a number on a scale. Or, ask two acquaintances of each selected subject to rate them on a car excitement scale. With this definition, excitement is the average scale score of impressions of the two acquaintances. Each of these definitions provides a different measure of the general term excitement. In fact, we actually have four concepts of the term. But each definition is clear in its meaning.3

1 Meriam Lewin, Understanding Psychological Research (New York: John Wiley & Sons, 1979), 75
2 Walter R. Borg and Meredith D. Gall, Educational Research: An Introduction, 4th ed. (New York: Longman, 1983), 22
Another Example
Let's illustrate the operationalization process with a practical example. Read this example carefully, noting each step in the process. John is considering several topics for his research proposal. He is drawn toward the problem of adolescent bail-out of church attendance when they leave home. Putting his first thoughts down on paper, he writes: "Church attendance decreases when young people leave home." Writing out your thoughts is important! Almost anything can sound logical as you play with ideas in your mind. Putting these thoughts down on paper is a first real step toward constructing a workable topic. I've heard students complain, "I know what I want to study, but I just can't put it down on paper!" Well, they feel like they know what they want to study, but their ideas are only wisps of fantasy. To put your idea down on paper is to grasp it, refine it, put shape to it, and bring it into the real world where the rest of us live. Do you have an idea for your study? Write it down. Then work on it, as a sculptor on granite, and bring out the essence of your creation. Nothing of value comes easy. As John reflects on his statement, he asks as many questions about it as he can. He steps away from his idea and objectively critiques it. You must separate your ego from your statement. Otherwise you will find yourself defending your work rather than refining it. Here are some of his questions:
Whose church attendance decreases? This statement could refer to parents or friends. It is not specific on this point.
The statement seeks to measure a change in behavior. This requires before-and-after measurements. Is this possible to do?
What is church attendance? What does this term imply? Worship? Bible Study? Church softball league?
What is home? What does it mean to leave home?
After writing down these questions and considering alternative ways to express what he wants to study, he rewrites his statement like this: Young people living away from home will have a lower rate of attendance at worship services than young people living at home. First, this statement is better because it clarifies attendance as the young peoples attendance at worship services. Second, this statement is better because it indicates measuring attendance of
See Earl Babbie, The Practice of Social Research, 3rd ed. (Belmont, CA: Wadsworth Publishing Company, 1983), 130-131
3
3-5
I: Research Fundamentals
two groups and comparing them, rather than a before-and-after measurement of a selected group of subjects. The term "young people" is still fuzzy, however. How young is young? What does "living away from home" mean? Or "living at home"? Answers to these questions would be placed in the Definitions section of the proposal. In John's case, he defined these terms as follows: "Young people" is defined as persons aged 18-25. "Home" is defined as the residence of the subject's parents and where he or she lived as a child. "Living at home" is defined as the continued full-time residence of the subject at home. "Leaving home" is defined as the subject taking up residence away from home for at least three months.

In order to do this study, John needs to define two populations: young people living at home and young people living away from home. He will need to sample two study groups from these populations. He will need to gather four pieces of data from each subject: (1) age, (2) residence, (3) attendance at worship services, and (4) how long away from home. You have just walked through a process of operationalization. It is a process essential for clear problem-solving. Begin now to operationalize the variables you are considering for your study.
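John's operationalized design can also be sketched in code. This is an illustration only -- the Subject record, the field names, and the numbers below are hypothetical, not drawn from any actual study:

```python
# A sketch of John's operationalized variables as a data record,
# with a simple comparison of attendance between the two groups.
# All values here are invented for illustration.
from dataclasses import dataclass

@dataclass
class Subject:
    age: int                  # operational: 18-25 qualifies as a "young person"
    lives_at_home: bool       # operational: full-time residence at parents' home
    months_away: int          # operational: 3+ months away = "left home"
    services_attended: int    # worship services attended in the last 12 weeks

def attendance_rate(subjects):
    """Mean number of worship services attended per subject in a group."""
    return sum(s.services_attended for s in subjects) / len(subjects)

at_home = [Subject(19, True, 0, 10), Subject(22, True, 0, 8)]
away    = [Subject(20, False, 6, 4), Subject(24, False, 12, 2)]

print(attendance_rate(at_home))  # 9.0
print(attendance_rate(away))     # 3.0
```

Notice that each fuzzy term ("young person", "leaving home") has become a concrete, measurable field -- exactly what an operational definition does.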
Operationalization Questions
As you consider the measurement of variables for your study, there are two basic questions you must answer. The first is "Are my variables measurable?" If they are not, you cannot study them -- not statistically, that is. Some students have difficulty answering this question because they have too limited an understanding of what measurement entails. We will be looking at several approaches to measurement in the chapters ahead: direct observation, survey, testing, attitude measurement, and experimentation. Once you have settled on what kind of data you need for your study, begin looking in research texts and journal articles for ways to gather that data. Don't overlook the guidelines in later chapters of this text!

The second question is "How will I measure these variables?" Define each of your variables in terms of how you will measure them [operational definitions]. I suggest you work on the statement for a while and then put it aside for several hours. When you come back to it, you'll be able to look at it more objectively. It is difficult to avoid rationalization and self-defense of your work. But you will excel in writing your proposal only if you can critique yourself clearly and objectively. It is better if you find the weaknesses before others do! Once you have operationalized your draft statement, you will be ready to write the Statement of the Problem and the Research Hypothesis. We will get into these two sections of the proposal in the next chapter.
Summary
This chapter has introduced you to the concept of variables, four data types (nominal, ordinal, interval, and ratio), as well as the process of operationalization: defining selected variables in terms of measurement.
Chapter 3
Empirical Measurement
Vocabulary
arbitrary zero -- an arbitrary "zero value"; does not mean absence of the variable (e.g., 0°F)
category -- a class or group of subjects (e.g., male/female on variable GENDER)
constant -- a numerical value which does not change (e.g., the freezing point of water: 32°F)
dependent variable -- a variable which is MEASURED by the researcher
independent variable -- a variable which is MANIPULATED by the researcher
interval -- equidistant markings on a scale (e.g., degrees on a thermometer)
interval data -- a measurement which reflects a position on an interval scale (e.g., 54°F)
measurement type -- a specific kind of measurement (nominal, ordinal, interval, ratio)
measurement -- the process of assigning a number value to a variable
nominal data -- a measurement which reflects counts in a group (e.g., 15 males in Research class)
operational definition -- describing a variable by its measurement (e.g., "adult" means 18+ years old)
operationalization -- the process of defining variables by their measurement
ordinal data -- a measurement which reflects rank order within a group
rank -- the relative position in a group (e.g., 1st, 2nd, 3rd)
ratio data -- a measurement which reflects a position on a ratio scale (e.g., 93 on Exam 1)
true zero -- the complete absence of a variable (e.g., 0 pounds = no weight)
variable -- an element that can have many values (e.g., weight can be 120 or 210 or 5)
Study Questions
1. List and define four kinds of measurement. Give an example of each kind.
2. Define constant and variable. Give two examples of each.
3. Operationalize the "fuzzies" below.
A. Staff members who work with autocratic pastors are less happy than those who work with democratic pastors.
B. Teaching Sunday School with discussion will result in better feelings than teaching with lecture.
C. Group counseling is better than individual counseling.
2. Which of the following is not a characteristic of an operational definition?
A. helps researchers communicate clearly
B. uses global, abstract terminology
C. specifies activities used to measure a variable
D. addresses science's desire for precision
4th ed. 2006 Dr. Rick Yount
3. Which of the following is the best operational definition?
A. An attitude of forgiveness
B. Aggressive facial expressions
C. Immoral behavior
D. Anxiety test score
4. Identify the type of data expressed in the statements below by writing the appropriate letter in the blank provided: N = Nominal, O = Ordinal, I = Interval, R = Ratio.
____ Statistical Aptitude will be measured by scores obtained on the STAT2 (0-20)1
____ My current feelings toward my father could be characterized as:2 Very Warm and Tender (1), Good (2), Unsure (3), Unfavorable (4), Very Distant and Cold (5)
____ Employment Status: Full-Time, Part-Time, Not Employed3
____ Study Habits: Sum of Delay Avoidance (DA) and Work Methods (WM) Scores on the Survey of Study Habits and Attitudes Inventory (Max: 100)4
____ Critical Thinking Ability: score on the Watson-Glaser Critical Thinking Appraisal5
____ Leadership Style: 9,9 / 5,5 / 9,1 / 1,9 / 1,16
____ Reasons for Dropping Out of a Christian College: Ranking of 50 Attrition Factors7
____ Child Density: Computed by dividing the number of children in a family by the number of years married8
"0" means no aptitude for statistics. Ibid., 86 3 James Scott Floyd, The Interaction Between Employment Status and Life Stage on Marital Adjustment of Southern Baptist Women in Tarrant County, Texas, (Ed.D. diss., Southwestern Baptist Theological Seminary, 1990), 45 4 Steven Keith Mullen, A Study of the Difference in Study Habits and Study Attitudes Between College Students Participating in an Experiential Learning Program Using the Portfolio Assessment Method of Evaluation and Students Not Participating in Experiential Learning, (Ph.D. diss., Southwestern Baptist Theological Seminary, 1995), 51 5 Bradley Dale Williamson, An Examination of the Critical Thinking Abilities of Students Enrolled in a Masters Degree Program at Selected Theological Seminaries, (Ph.D. diss., Southwestern Baptist Theological Seminary, 1995), 23 6 Helen C. Ang, An Analytical Study of the Leadership Style of Selected Academic Administrators in Christian Colleges and Universities as Related to Their Educational Philosophy Profile, (Ed.D. diss., Southwestern Baptist Theological Seminary, 1984), 28-29 7 Judith N. Doyle, A Critical Analysis of Factors Influencing Student Attrition at Four Selected Christian Colleges, (Ed.D. diss., Southwestern Baptist Theological Seminary, 1984), 98 8 Martha Sue Bessac, The Relationship of Marital Satisfaction to Selected Individual, Relational, and Institutional Variables of Student Couples at Southwestern Baptist Theological Seminary, (Ed.D. diss., Southwestern Baptist Theological Seminary, 1986), 23
1 2
Chapter 4
Getting On Target
The Problem of the Study The Hypothesis of the Study From Raw to Refined
I lay in the cold, damp sand of Fort Dix, New Jersey, with my M-16 pointing down range. Getting my weapon on target was not as easy as my instructors had made it sound in class. I felt as if I were all thumbs as I wrestled with sight alignment, breathing, placement of the front sight on the target, correction for wind, and correction for distance. I had one thing going for me, however, despite my awkward confusion. My problem was clear: put the round in the center of the target standing 100 yards away. The anticipated result was clear as well: put it all together and the round will hit the bull's-eye. Practice translated the problem into the anticipated result. I qualified for the Sharpshooter's Badge.

Writing a proposal is more complex than target practice. The need to get on target with your proposal, however, is just as important. The Problem and Hypothesis statements focus every other element of the proposal. They form the proposal's heart -- its bull's-eye. Confusion here will generate confusion throughout the proposal.
Characteristics of a Problem
The following characteristics are important to keep in mind as you develop the formal statement of the problem of your study.
data from your field and discover if you are proposing a redundant study.
Meaningfulness
Is your problem statement meaningful? Is it important to your field? The problem may focus on something you personally want to know, but this is not enough to establish the need for the study. The inexperienced tend to focus on the obvious, surface issues related to ministry. The problem statement should have a theoretical basis beyond the pragmatic concern of "what works." Research seeks to know the whys as well as the hows of the way the world works.
Clearly written
The problem statement is usually a single sentence which isolates the variables of the study and indicates how these variables will be studied. The statement is terse, brief, concise. It is objectively written so that another can read the statement and understand the focus of the study.
This study proposes to measure the administrative leadership style and the particular philosophy of education of selected Christian college administrators and determine whether there is any relationship between these two variables. Since style and philosophy are nominal variables, this problem statement infers the use of the chi-square Test of Independence -- a test of relationship between two nominal variables. (See Chapters 5 and 23 for further information on chi-square.)
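As a sketch of how such a test is computed in practice, here is a chi-square Test of Independence run in Python with SciPy. The contingency table is invented for illustration -- these are not Dr. Ang's data:

```python
# Hypothetical counts: administrators cross-classified by two leadership
# styles (rows) and two educational philosophy profiles (columns).
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[20,  5],
                     [10, 15]])

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi-square = {chi2:.2f}, df = {dof}, p = {p:.4f}")
```

A small p-value would lead the researcher to reject the null hypothesis that style and philosophy are independent.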
The problem of this study [is] to determine the relationship between ministerial job satisfaction and a specific set of predictor variables. These variables [are] Principle Ministry Classification, Gender, Age, Marital Status, Education, Tenure, and presence in the workplace of a Performance Evaluation.2
This statement identifies variables which the researcher believed influence the degree of job satisfaction in ministerial staff members of Southern Baptist churches. Problem statements of this type refer to multiple regression analysis. (See Chapter 26 for further information on multiple regression.)
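The idea behind multiple regression -- predicting one variable from several others -- can be sketched with ordinary least squares in NumPy. The predictors and values below are invented for illustration, not Dr. Welch's data:

```python
# A minimal sketch: predict a job-satisfaction score from two
# hypothetical predictors, age and tenure. Data are invented.
import numpy as np

age          = np.array([25, 32, 40, 48, 55], dtype=float)
tenure       = np.array([ 1,  4,  8, 15, 20], dtype=float)
satisfaction = np.array([60, 65, 72, 80, 88], dtype=float)

# design matrix with an intercept column
X = np.column_stack([np.ones_like(age), age, tenure])
coef, *_ = np.linalg.lstsq(X, satisfaction, rcond=None)
intercept, b_age, b_tenure = coef
print(intercept, b_age, b_tenure)
```

The fitted coefficients show how much each predictor contributes to the predicted satisfaction score, which is the question a regression-type problem statement asks.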
This study will measure the variable learning outcomes -- defined later as the achievement score of the student on the multiple-choice post-test measuring the lesson objectives at three cognitive levels: knowledge, comprehension, and application4 -- in two groups of adult Sunday School members. One group experienced a Bible study which intentionally integrated active participation methods. The second group experienced the same Bible study without active participation. Would intentional active participation make a difference in their learning? The statistic inferred by this statement is the t-Test for Independent Samples. (See Chapter 20 for further information on the two-sample independent t-test.)
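A sketch of the t-Test for Independent Samples in Python with SciPy. The post-test scores are invented for illustration, not Dr. Cook's data:

```python
# Hypothetical post-test scores for the active-participation group
# and the same-lesson, no-participation group.
from scipy.stats import ttest_ind

active  = [85, 78, 92, 88, 81, 90]
lecture = [72, 75, 80, 70, 78, 74]

t_stat, p_value = ttest_ind(active, lecture)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

A significant t-value would indicate that mean learning differed between the two groups.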
Dr. Scott Floyd wrote his second problem statement this way:
It [is] also the problem of this study to determine the difference in marital adjustment of Southern Baptist women. . . who were not employed outside the home, employed part-time, and employed on a fulltime basis.5
This study will measure marital adjustment, a ratio score, in Southern Baptist
2 Robert Horton Welch, "A Study of Selected Factors Related to Job Satisfaction in the Staff Organizations of Large Southern Baptist Churches" (Ed.D. diss., Southwestern Baptist Theological Seminary, 1990), 4
3 Marcus Weldon Cook, "A Study of the Relationship Between Active Participation as a Teaching Strategy and Student Learning in a Southern Baptist Church" (Ph.D. diss., Southwestern Baptist Theological Seminary, 1994), 3
4 Ibid., 24
5 Floyd, 5
women divided into three employment groups. Do the mean scores of these three groups differ significantly? The Problem Statement infers the use of one-way Analysis of Variance (ANOVA). (See Chapter 21 for further information on ANOVA.) Dr. Floyd tested one independent variable above. His primary problem, however, involved two. In addition to employment status, he also divided women into three levels of life cycle -- ages 18 to 31, 32 to 46, and 47 to 65. The Problem statement for this design read this way:
The problem of this study [is] to determine the interaction between life cycle stage and employment status of Southern Baptist women in Tarrant County, Texas, on a measure of marital adjustment.
This problem statement infers the use of two-way ANOVA, because it identifies two independent variables, employment and life cycle, and one dependent variable, marital adjustment. (See Chapter 25 for information on Factorial ANOVA.) The Problem statement delineates the question of the study. It is the climax of the Introductory Statement and opens the door to the Synthesis of Related Literature. In doing your literature search, you will learn a great deal from others who have studied the variables you are interested in studying. At the end of the Related Literature section (see Chapter 6) you will be ready to write a confident statement of your expected findings. This statement of expectation is called a hypothesis.
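Dr. Floyd's one-way comparison of three employment groups can be sketched with SciPy's one-way ANOVA. The marital-adjustment scores below are invented for illustration, not his data:

```python
# Hypothetical marital-adjustment scores for three employment groups.
from scipy.stats import f_oneway

not_employed = [110, 115, 108, 120, 112]
part_time    = [105, 100, 112,  98, 104]
full_time    = [ 95, 102,  99,  90,  97]

f_stat, p_value = f_oneway(not_employed, part_time, full_time)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```

A significant F-value says only that at least one group mean differs; follow-up comparisons locate which groups differ.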
As explained in Chapter 2, a hypothesis states the anticipated answer to the problem you've stated. The two major types of hypotheses are the research, or alternative, hypothesis and the null, or statistical, hypothesis. The research hypothesis can be either directional or non-directional.
Another way this relationship between nominal variables could be stated is this: It is the hypothesis of this study that leadership style of the academic administrator
Ang, 3
Ibid., 19
and his/her educational philosophy profile are not independent. The phrase "not independent" indicates more clearly that the study will use the chi-square statistic. Categories of leadership style and educational philosophy are the nominal measurements.
The above is a multiple regression example where one variable is being predicted by two others. Association among several variables can also involve several pairings of variables. Dr. Maria Bernadete Da Silva wrote her problem statement to analyze the relationships among several pairs of variables.
The problem of this study [is] to determine the relationship between leadership style and the levels of agreement on selected social work values of social work administrators in social service agencies in Texas.10
The four social work values were respect for basic rights, social responsibility, individual freedom, and self-determination. Level of agreement with these values consisted of the number of social workers selecting one of four options: strongly agree, agree, disagree, or strongly disagree. This design required four chi-square tests of independence, matching leadership style with each of the four values.
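A design like this -- one chi-square test per value -- can be sketched as a loop over four contingency tables. The counts are invented for illustration, not Dr. Da Silva's data:

```python
# Hypothetical counts: rows are two leadership styles, columns are the
# four agreement options (strongly agree / agree / disagree / strongly disagree).
from scipy.stats import chi2_contingency

tables = {
    "basic rights":          [[30, 20,  5,  5], [15, 25, 10, 10]],
    "social responsibility": [[25, 25,  5,  5], [20, 20, 10, 10]],
    "individual freedom":    [[20, 20, 10, 10], [10, 20, 15, 15]],
    "self-determination":    [[35, 15,  5,  5], [25, 20, 10,  5]],
}

for value, table in tables.items():
    chi2, p, dof, _ = chi2_contingency(table)
    print(f"{value}: chi-square = {chi2:.2f}, df = {dof}, p = {p:.4f}")
```

Each test asks the same question independently: is agreement with this value independent of leadership style?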
Paret, 5
Ibid., 10
Ibid., 37
10 Da Silva, 4
11 Ibid., 7
12 Havens, 7
Scores were divided into two groups for purposes of testing this hypothesis: one group of children had parent-teachers with teacher certification and the second group did not. Did academic achievement -- defined as improved grade level scores in vocabulary, reading, writing, spelling, mathematics, science and social studies skills, as measured by the subtests of the Stanford Achievement Test14 -- significantly differ between these two groups? This hypothesis suggests the use of the t-Test for Independent Samples (Chapter 20). Dr. Daryl Eldridge, conducting an experimental study, wrote his problem statement this way:
The problem of this study will be to investigate the effect of student knowledge of behaviorally stated course objectives upon the performance and attitudes of seminary students in a church curriculum planning course.15
The instrument adapted for his study produced interval data. The hypothesis infers use of the one-way Analysis of Variance statistic. (Chapter 21) Research hypotheses can be directional or non-directional. The distinction between these two types of research hypotheses lies in whether the hypothesis simply states a difference or states a difference in a specific direction.
Ibid., 21
15 Eldridge, 3
16 Ibid., 29
17 Babler, 7
18 Ibid., 32
It [is] the hypothesis of this study that autonomy and intimacy as perceived in the couple's family of origin are significant positive predictors of current nuclear family health. (Paret)

It is the hypothesis of this study that the test scores of students who have knowledge of course objectives will be significantly greater than the test scores of students who have no knowledge of objectives. (Eldridge)
When you state your research hypothesis in a directional form, you show more confidence in the anticipated result of your study. This confidence grows out of your literature review and expertise in the field. You should state your research hypotheses in a directional format if possible.
The first hypothesizes prediction, but does not specify direction, positive or negative. The second hypothesizes difference, but does not specify greater than or smaller than. These non-directional statements are weaker than the directional statements actually stated by the researchers. Use a non-directional research hypothesis in your proposal only if you cannot develop a reasonable basis for stating a direction for your anticipated results.
Notice that the null form of the hypothesis declares no relationship among variables, and no difference between groups.
NOTE: There are times, though rare, when the "null hypothesis" is the "research hypothesis" of the study. For example, suppose you are creating a new treatment that you believe will require half the time, but will produce the same results, as a more costly, time-intensive procedure. Your intent is to show "no difference" between the approaches. On these rare occasions, the null is the research hypothesis as well as the statistical hypothesis. The point: the null is not always the opposite of the research hypothesis.
Revision Examples
It is relatively easy to read a statement of problem or hypothesis and agree that it is focused and meaningful. It is quite another to write such statements. The following examples are problem and hypothesis statements written by students in class. I will comment on the statement as written, and then suggest a revised version.
Example 1
The problem of this study is to determine the effect of adequate premarital counseling on the success rate of teenage marriages.
Comments
The term "effect" calls for an experimental or ex post facto approach to the study. If you are thinking in this direction, move to Chapter 13 soon. I encourage you to pursue an experimental design, but students sometimes use the term "effect" when they are actually thinking of correlation. You cannot infer a cause-and-effect relationship from a correlation. There are other questions raised by this Problem. What is "adequate" counseling? What kind of premarital counseling? How will you measure "success rate"? Success over what period of time? How do you define "teenage marriage"? Is this study focusing only on teenagers who are married, or on all marriages which began in the teenage years?
Suggested revision
The problem of this study is to determine the difference in attitude toward married life between married teenagers who undergo a specified course of premarital counseling and those who do not. Here you are studying teenagers who are married. You will have two groups: one group undergoes a specified counseling treatment (which you will define under Procedure for Collecting Data) and the other doesn't. You measure differences in attitude toward married life between the two groups.
Example 2
The problem of this study is to determine whether those who complete MasterLife Discipleship Training will have a more positive attitude toward discipleship and will become actively involved in discipleship.
Comments
More positive than what? There is nothing to compare MasterLife against. What is meant by "actively involved"? "Discipleship" is a global term. What does it mean in the framework of this study? What is the theoretical basis for this study? How will it contribute to the field of Christian education? Is this really an evaluation of the MasterLife program?
Suggested revision
The problem of this study is to determine the difference in discipleship skills and attitudes developed in median adults between the MasterLife Discipleship Training program and the (Alternative) Discipleship Training program. This study will evaluate MasterLife against another discipleship training program. The basis for comparison will be measured skills and attitudes in the area of discipleship.
Example 3
It is the hypothesis of this study that the level of social extroversion expressed by a child will differ significantly in relationship to the type of before and after school care environment he or she receives.
Comments
This statement targets the variables rather well. "Level of social extroversion" and "type of care environment" are clearly stated. But the wording is awkward. How many types of before and after school care will be studied? Two? Three? What does "type of care" mean? How will it be measured?
Suggested revision
It is the hypothesis of this study that children receiving Type I care will score significantly higher on the social extroversion scale than children receiving Type II care. Two types of child care are specified. These two types are directly compared on the basis of a social extroversion measurement of the children. If one were interested in comparing several types of child care, the hypothesis could read: It is the hypothesis of this study that children's scores on the social extroversion scale will significantly differ across (number) specified types of before and after school care.
Example 4
It is the hypothesis of this study that staff longevity of ministers is significantly increased in churches using a salary administration plan than churches who do not use such a plan.
Comments
The term "increased" indicates a before-and-after study. This may be difficult to do in churches. How do you get churches to agree to install a different plan for purposes of a research study? It is easier to focus on difference. What is "staff longevity"? How long a staff member stays in a position? How is it measured? Months? Years? What is a "salary plan"? This is a fuzzy concept. How will you determine whether a church qualifies as having a plan or not having a plan? Is a bad plan better than no plan? Is salary the major factor in staff longevity? Are there other variables that need to be considered in studying why staff members remain in a given church? How will the researcher deal with ineffective staff members who are not invited to consider other churches -- those who remain because they have nowhere else to go?
Suggested revision
It is the hypothesis of this study that the length of service of ministers is significantly higher in churches that qualify as having a specified salary administration plan than in churches that do not. The researcher maintains his focus on salary. However, there is a procedure which will be used to categorize churches on the basis of their salary plans. Rather than measure increase, the researcher will look at the difference between length of service of ministers in two categories of churches.
Example 5
The hypothesis of this study is that men who remain in the pastorate are significantly different than those who leave the pastorate to enter denominational work.
Comments
This statement uses some of the words we've discussed, but misses the mark as a hypothesis statement. It is an excellent example of a hypothesis written by someone who knows the words but does not understand their meaning ("But I used the words significantly different!"). What is the variable being studied? These two groups of men will be different on what variable(s)? What is the theoretical foundation of this? Is there justification for considering either pastoral ministry or denominational ministry better than the other? Besides, what is being measured? How will the researcher obtain his data? There is really no study here. We need to head back to the drawing board on this one.
Dissertation Examples
The Problem-Hypothesis-Statistic set forms the backbone, the framework, for both the proposal and the dissertation itself. While you are certainly not expected to understand the statistical procedures referenced here, I include them for future reference and for a sense of completeness. We will introduce you to these and other statistical procedures in Chapter 5, and focus on them in chapters 16 to 26. The following statement-sets are drawn from dissertations of our graduates. They are written in the past tense since they are taken from the dissertations.
Regression Analysis
The problem of this study was to determine the relationship between attitudes concerning computer-enhanced learning and selected individual and institutional variables of full-time
faculty members at Southwestern Baptist Theological Seminary. [The hypothesis] of this study was that the following variables would prove to be significant predictors of attitudes toward computer-enhanced learning for theological education among the full-time faculty of Southwestern Baptist Theological Seminary: age, gender, school division where teaching, discipline teaching, degree(s) held, number of years teaching at Southwestern, last enrolled in a course, whether or not own a computer, believe students should own a computer, and taken any computer courses/instruction.19
The statistic for this study was Multiple Regression (see Chapter 26). There were two significant predictors found in this study: whether the professor owned a computer or not, and whether they believed students should own a computer. A positive attitude toward computer-enhanced learning in theological education was predicted by "yes" answers to these two questions.
The statistic for this study was the Spearman rho correlation coefficient (see Chapter 22). Competencies for ministers of education were divided into five areas: minister, administrator, educator, growth agent, and personal [relational skills]. Higher coefficients reflect higher agreement between pastors and educators on ranked competencies. Lower coefficients reflect lesser agreement. The coefficients were minister (0.94), administrator (0.64), educator (0.83), growth agent (0.54), and personal (0.70).
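Agreement between two sets of rankings like these can be sketched with SciPy's Spearman rho. The rankings below are invented -- ten hypothetical competencies, not Dr. Bass's items:

```python
# Hypothetical rankings of ten competencies by pastors and by
# ministers of education. Nearly identical orderings yield a high rho.
from scipy.stats import spearmanr

pastor_ranks   = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
educator_ranks = [2, 1, 4, 3, 6, 5, 8, 7, 10, 9]

rho, p_value = spearmanr(pastor_ranks, educator_ranks)
print(f"rho = {rho:.2f}")
```

Because each pair of ranks differs by only one position, rho comes out high, reflecting strong agreement between the two groups.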
The statistic for this study was Factorial ANOVA (see Chapter 25). There was no interaction between the two variables, so the two "main effects" (school, activity) could be interpreted directly. There was no significant difference in spiritual maturity between seniors in Christian vs. public schools, but spiritual maturity in active Sunday School attenders was significantly higher than in inactive attenders.
19 Bergen, 7, 46
20 Bass, 3, 37
21 LaNoue, 2, 22
22 Marcia McQuitty, 5, 27
of service on church staffs, task preference, gender, and age. The hypothesis of this study was that dominant management style and selected variables were not independent.22
The statistic for this study was the chi-square test of independence (see Chapter 23). Dr. McQuitty queried all full-time preschool and children's ministers serving in Texas Baptist churches (N=132), and actually gathered data from eighty-one (81). Only nineteen (19) ministers produced a dominant management style, and thirteen (13) of these were categorized as comforter. This discovery required a change in the hypothesis: rather than one of five management styles, Dr. McQuitty tested her specified variables against dominant vs. multiple management styles. None of the specified variables produced a significant chi-square value.23 Still, the data collection yielded important insights into the strengths and needs of preschool and childhood education ministers -- insights which Dr. McQuitty uses in her seminary classes.
Analysis of Variance
The problem of this study was to determine the difference in achievement, both cognitive and affective, among students who learned through interactive instruction, simulation games, and presentational instruction in the Hong Kong Baptist Theological Seminary, Hong Kong.24 The following were the hypotheses of the study:
1. H1: there was significant difference among the means across [testing] occasions. . .
2. H2: there was significant difference among the means across all groups. . .
3. [interaction]
4. [post-test 1: cognitive]
5. [post-test 1: affective]
6. [post-test 2: cognitive]
7. [post-test 2: affective]25
The statistic for this study was one-way analysis of variance (see Chapter 21). The analysis revealed no significant differences in cognitive learning across teaching methods used in the three groups. All three groups learned. The greatest change in attitude toward learning and interpersonal relationships occurred in the "Simulation Games" group.26
Summary
The material of this chapter is crucial to your research proposal. It is important that you understand the concepts discussed here and be able to use them with your own topic. Read the examples of good statements several times until the pattern of each kind of study begins to become clear. Work step-by-step through the evaluations of the real-life examples.
23 Ibid., 43
24 Stephen Tam, "A Comparative Study of Three Teaching Methods in the Hong Kong Baptist Theological Seminary" (Ed.D. diss., Southwestern Baptist Theological Seminary, 1989), 2
25 Ibid., 14-17
26 Ibid., 76-77
Vocabulary
research hypothesis -- anticipated outcome of study, stated in terms of difference (groups) or relationship (variables)
null hypothesis -- anticipated outcome of study, stated in terms of NO difference or NO relationship
statistical hypothesis -- same as null hypothesis
directional hypothesis -- states a direction of difference (larger, smaller) or relationship (positive, negative)
non-directional hypothesis -- states no direction; simply states "difference" or "relationship"
Study Questions
1. Explain the purpose of the problem and hypothesis statements. 2. Describe the four characteristics of a good problem statement. 3. Describe four types of hypothesis statements.
Therapy A will result in significantly less marital anxiety than Therapy B.
There will be no significant difference between Teaching Approaches 1 and 2.
There will be a relationship between Number of Hours Studied and GPA.
Number of Hours Worked Outside the Home and Marital Satisfaction are independent.
Bible Knowledge Score will be significantly different across the three groups.
Senior Adults' Preference Score toward the King James Version will be significantly higher than for Young Adults.
There will be no difference in ministerial commitment scores across three staff categories.
Men and women will score differently on the nurturing scale of the BA12 Test.
Chapter 5
Introduction to Statistics
Statistical Analysis
Statistics, Mathematics, and Measurement A Statistical Flow Chart
In the first four chapters of the text, we have focused on concerns of research design: the scientific method, types of research, proposal elements, measurement types, defining variables, and problem and hypothesis statements. But designing a plan to gather research data is only half the picture. When we complete the gathering portion of a study, we have nothing more than a group of numbers. The information is meaningless until the numbers are reduced, condensed, summarized, analyzed and interpreted. Statistical analysis converts numbers into meaningful conclusions in accordance with the purposes of a study.

We will spend chapters 15-26 mastering the most popular statistical tools. But you must understand something of statistics now in order to properly plan how you should collect your data. That is, the proper development of a research proposal depends on what kind of data you will collect and what statistical procedures exist to analyze that data.

The fields of research design and statistical analysis are distinct and separate disciplines. In fact, in most graduate schools, you would take one or more courses in research design and other courses in statistics. My experience with four different graduate programs has been that little effort is made to bridge the two disciplines. Yet the fields of research and statistics have a symbiotic relationship. They depend on each other. One cannot have a good research design with a bad plan for analysis. And the best statistical computer program is powerless to derive real meaning from badly collected data. So before we get too far into the proposal writing process, some time must be given to establishing a sense of direction in the far-ranging field of statistics.
I: Research Fundamentals
Descriptive Statistics
Descriptive statistical procedures are used to describe a group of numbers. These tools reduce raw data to a more meaningful form. You've used descriptive statistics when averaging test grades during the semester to determine what grade you'll get. The single average, say, a 94, represents all the grades you've earned in the course throughout the entire semester. (Whether this 94 translates to an A or a C depends on factors outside of statistics!). Descriptive statistics are covered in chapters 15 (mean and standard deviation) and 22 (correlation).
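The grade-averaging example can be sketched with Python's standard statistics module. The grades below are invented for illustration:

```python
import statistics

# Hypothetical semester grades for one student
grades = [96, 91, 88, 97, 98]

# A single mean "describes" the whole set of grades
mean = statistics.mean(grades)     # 94
spread = statistics.stdev(grades)  # sample standard deviation

print(f"Semester average: {mean}")
print(f"Standard deviation: {spread:.2f}")
```

The mean condenses five numbers into one; the standard deviation describes how widely the grades scatter around it.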
Inferential statistics
Inferential statistics are used to infer findings from a smaller group to a larger one. You will recall the brief discussion of population and sample in chapter 2. When the group we want to study is too large to study as a whole, we can draw a sample of subjects from the group and study them. Descriptive statistics about the sample are not our interest. We want to develop conclusions about the large group as a whole. Procedures that allow us to make inferences from samples to populations are called inferential statistics. For example, there are over 36,000 pastors in the Southern Baptist Convention. It is impossible to interview or survey or test all 36,000 subjects. Round-trip postage alone would cost over $21,000. But we could randomly select, say, one percent (1%), or 360 pastors, for the study, analyze the data of the 360, and infer the characteristics of the 36,000. Inferential procedures are covered in chapters 16, 17, 18, 19, 20, and 21.
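The pastor example can be simulated in a few lines of Python. The ages below are randomly generated stand-ins, not real data; the point is only that a 1% simple random sample yields a close estimate of the population mean:

```python
import random

random.seed(1)

# Hypothetical population: ages of 36,000 pastors (values invented for illustration)
population = [random.gauss(52, 10) for _ in range(36_000)]

# Draw a 1% simple random sample (360 subjects) and infer the population mean
sample = random.sample(population, 360)
estimate = sum(sample) / len(sample)
actual = sum(population) / len(population)

print(f"Sample estimate: {estimate:.1f}")
print(f"Population mean: {actual:.1f}")  # the estimate should land close to this
```

Studying 360 subjects instead of 36,000 trades a small amount of accuracy for an enormous saving in cost, which is the whole rationale for inference.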
Statistical Flowchart
Accompanies explanation on pages 5-4 to 5-7 in text
[Flowchart summary. Two main branches:

-1- ASSOCIATION: relationships between variables. Is your data interval/ratio, ordinal, or nominal?
    -3- Interval/ratio: 2 variables, or 3+ variables.
    -4- Ordinal: 2 ranks -- -4a- Spearman rho (ρ) or Kendall's tau (τ); 3+ ranks -- -4b- Kendall's W.
    -5- Nominal: 1 variable, 2 variables, or 2 dichotomous* variables; Point Biserial pairs an interval/ratio variable with a dichotomous variable.

-2- DIFFERENCE: differences between groups. Is your data interval/ratio or ordinal?
    -6- Interval/ratio: 1 group (sample mean vs. population mean -- σ known or n > 30; σ unknown), 2 groups, or 2+ groups (-6c- One-Way ANOVA).
    -7- Ordinal: 2 groups, or 3+ groups (-7c- Kruskal-Wallis H test).

*Dichotomous -- two and only two categories]
You have chosen a similarity study. Statistical procedures that compute coefficients of similarity or association or correlation (synonymous terms) come in four basic types. The first type computes correlation coefficients between interval or ratio variables. The second type computes correlation coefficients between ordinal variables. The third type computes correlation coefficients between nominal variables (or between variables of which at least one is nominal). The fourth type is a special category which computes a coefficient of independence between nominal variables. If your data is interval or ratio, go to -3- below. If your data is ordinal, go to -4- below. If your data is nominal, go to -5- below.
-3b- Interval/Ratio Correlation with 3+ Variables
The procedure we will study which analyzes three or more interval/ratio variables simultaneously is multiple linear regression. This procedure is quickly becoming the dominant statistical procedure in the social sciences. With this procedure, you develop models which relate two or more predictor variables to a single predicted variable. We will confine our study to understanding the printouts of a statistical computer program called SYSTAT. See Chapter 26.
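As an illustration only (the text teaches SYSTAT printouts, not hand computation), a least-squares fit for two predictors can be computed by solving the normal equations. The data below are contrived so that known coefficients can be recovered exactly:

```python
# Two hypothetical predictor variables (x1, x2) and one predicted variable (y).
# Solve the normal equations (X'X)b = X'y by Gaussian elimination -- the same
# least-squares coefficients a package like SYSTAT would report.

def fit_multiple_regression(rows, y):
    # rows: list of predictor tuples; a column of 1s is added for the intercept
    X = [[1.0, *r] for r in rows]
    k = len(X[0])
    n = len(X)
    # Build X'X and X'y
    xtx = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(k)] for a in range(k)]
    xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(k)]
    # Gaussian elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(xtx[r][col]))
        xtx[col], xtx[pivot] = xtx[pivot], xtx[col]
        xty[col], xty[pivot] = xty[pivot], xty[col]
        for r in range(col + 1, k):
            f = xtx[r][col] / xtx[col][col]
            for c in range(col, k):
                xtx[r][c] -= f * xtx[col][c]
            xty[r] -= f * xty[col]
    # Back substitution
    b = [0.0] * k
    for r in range(k - 1, -1, -1):
        b[r] = (xty[r] - sum(xtx[r][c] * b[c] for c in range(r + 1, k))) / xtx[r][r]
    return b  # [intercept, slope for x1, slope for x2, ...]

# Data generated from y = 5 + 2*x1 + 3*x2, so the fit should recover 5, 2, 3
predictors = [(1, 1), (2, 1), (3, 2), (4, 3), (5, 5), (6, 4)]
outcomes = [5 + 2 * x1 + 3 * x2 for x1, x2 in predictors]
print(fit_multiple_regression(predictors, outcomes))
```

With real data the outcomes would contain error, and the fitted coefficients would be the best-fitting model rather than an exact recovery.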
Just like the interval/ratio procedures above, ordinal correlation procedures come in two types.
-4a- Ordinal Correlation with 2 Variables
The two procedures which compute a correlation coefficient between two ordinal variables are Spearman's rho (rs) and Kendall's tau (τ). Spearman's rho should be used when you have ten or more pairs of rankings; Kendall's tau when you have fewer than ten. Both measures give you the same information. If you had pastors and ministers of education rank order seven statements of characteristics of Christian leadership, you would compute the degree of agreement between the rankings of the two groups with Kendall's tau. See Chapter 22.

-4b- Ordinal Correlation with 3+ Variables
Kendall's Coefficient of Concordance (W) measures the degree of agreement in rankings from more than two groups. Using our example above, you could compute the degree of agreement in rankings of pastors, ministers of education, and seminary professors using Kendall's W. See Chapter 22.
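Spearman's rho has a simple hand formula based on the differences between paired ranks. The sketch below uses hypothetical rankings of seven statements; note that with fewer than ten pairs the chapter would actually recommend Kendall's tau, so this only illustrates how rho is computed:

```python
def spearman_rho(ranks_a, ranks_b):
    # rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), where d is the difference
    # between paired ranks (this shortcut formula assumes no tied ranks)
    n = len(ranks_a)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - (6 * d2) / (n * (n ** 2 - 1))

# Hypothetical rankings of seven leadership statements by two groups
pastors   = [1, 2, 3, 4, 5, 6, 7]
ministers = [2, 1, 4, 3, 5, 7, 6]
print(round(spearman_rho(pastors, ministers), 3))  # about 0.893: strong agreement
```

Identical rankings give rho = 1; perfectly reversed rankings give rho = -1.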
male, 15% female] to determine if class enrollment fits well the expected enrollment. The Chi-square Test of Independence compares two nominal variables to determine if they are independent. Are educational philosophy (5 categories) and leadership style (5 categories) independent of each other? When you want to determine the strength of the relationship between the two nominal variables, use Cramér's Phi (φc). This procedure computes a Pearson's r type coefficient from the computed chi-square value. See Chapter 23.
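A minimal sketch of the chi-square computation and Cramér's phi, using an invented 2x2 table of philosophy by leadership style:

```python
import math

# Hypothetical 2x2 table: educational philosophy (rows) by leadership style (columns)
observed = [[30, 10],
            [20, 40]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand = sum(row_totals)

# Chi-square: sum of (observed - expected)^2 / expected over every cell,
# where expected = row total * column total / grand total under independence
chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand
        chi2 += (o - expected) ** 2 / expected

# Cramer's phi turns the chi-square value into a Pearson-r-style strength coefficient
k = min(len(observed), len(observed[0]))  # the smaller table dimension
phi_c = math.sqrt(chi2 / (grand * (k - 1)))
print(f"chi-square = {chi2:.2f}, Cramer's phi = {phi_c:.2f}")
```

A large chi-square says the two variables are probably not independent; phi says how strong the association is on a 0-to-1 scale.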
interaction among the independent variables. See Chapter 25. If the groups are related, use the Repeated Measures Analysis of Variance. (Not discussed in this text.)
Summary
In this chapter we introduced you to statistical analysis. We linked statistics to the process of research design. We looked at the two major divisions of statistics. We separated the practical application of statistical procedures from the need for higher-level mathematics skills. We differentiated statistical procedures by measurement type. And finally, we laid out a mental map of the statistical procedures we will be studying so that you can determine which procedures might be of use to you in your own proposal.
Vocabulary
correlation coefficient -- a number which reflects the degree of association between two variables
Cramér's Phi -- measures strength of correlation between two nominal variables
descriptive statistics -- measures population or sample variables
Factorial ANOVA -- two-way, three-way ANOVA
Goodness of Fit -- compares observed counts with expected counts on 1 nominal variable
Indep't Samples t-test -- tests whether the average scores of two groups are statistically different
Inferential statistics -- INFERS population measures from the analysis of samples
Kendall's tau -- correlation coefficient between two sets of ranks (n < 10)
Kendall's W -- correlation coefficient among three or more sets of ranks
Kruskal-Wallis H Test -- non-parametric equivalent of ANOVA
Linear Regression -- establishes the relationship between one variable and one predictor variable
Mann-Whitney U Test -- non-parametric equivalent of the independent t-test
Matched Samples t-test -- tests whether the paired scores of two groups are statistically different
Multiple Regression -- establishes the relationship between one variable and multiple predictor variables
one-sample z-test -- tests whether a sample mean is different from its population mean (n > 30)
one-sample t-test -- tests whether a sample average is different from its population average
One-Way ANOVA -- tests whether average scores of three or more groups are statistically different
Pearson's r -- correlation coefficient between two interval/ratio variables
Phi Coefficient -- correlation coefficient between two dichotomous variables
Point Biserial -- correlation coefficient between interval/ratio variable and dichotomous variable
Rank Biserial -- correlation coefficient between ordinal variable and dichotomous variable
Spearman's rho -- correlation coefficient between two sets of ranks (n > 10)
Test of Independence -- chi-square test of association between two nominal variables
Two Sample Wilcoxon -- non-parametric equivalent of independent t-test
Wilcoxon Matched Pairs -- non-parametric equivalent of matched samples t-test
Study Questions
1. Differentiate between descriptive and inferential statistics.
2. Consider your own proposal. Review the types of data (Chapter 3). List several statistical procedures you might consider for your proposal. Scan the chapters in this text which deal with the procedures you've selected.
3. Give one example of each data type (review Chapter 3). Identify one statistical procedure for each example you give.
____ 1. Difference between fathers and their adult sons on a Business Ethics test.
____ 2. Whether learning style and gender are independent.
____ 3. Analysis of six predictor variables for job satisfaction in the ministry.
____ 4. Difference in Bible Knowledge test scores across three groups of youth ministers.
____ 5. Prediction of marital satisfaction by self-esteem of husband.
____ 6. Relationship between number of years in the ministry and job satisfaction score.
____ 7. Difference in anxiety reduction between treatment group I and treatment group II.
____ 8. Correlation between rankings of objectives of the School of Religious Education by pastors and ministers of education.
Chapter 6
Synthesis of Literature
Synthesis of Related Literature
A Definition
The Procedure
In this chapter we look at the process of finding, collecting, analyzing and synthesizing research articles which relate to the topic of our study. Before we can add to the knowledge base of our field of study, we must learn what is already known. The literature search provides a factual base for the proposed study.
A Definition
The related literature section of your proposal, entitled the Synthesis of Related Literature, is a synthetic narrative of recent research which is related to your study.
Synthetic Narrative
The related literature section is a synthetic narrative. It is a narrative in the sense that it should flow from the beginning to the end with a single, coordinated theme. It should not contain a series of disjointed summaries of research articles. Such unrelated and disconnected summaries generate confusion rather than understanding. It is synthetic in that it has been born out of the synthesis of many research studies. You will analyze research reports by key words. There may be twenty articles that provide information for a given key word. As you write your findings for each of your key words, you will draw from all of the articles addressing that key word simultaneously. The final product will be a synthesis: a smooth blending of selected articles built around the key words of your study. This is the reason for the name of this section: The Synthesis of Related Literature. Not a summary, but a synthesis.
Recent Research
The synthesis of related literature focuses on recent research. The rule of thumb in defining "recent" is ten years. You will want to select and include research articles which are less than 10 years old. Major emphasis should be placed on research conducted in the past 5 years. Articles older than this are out of date and misleading. Consider an opinion survey conducted in 1955 on the attitudes of Americans on family. Such information has little relevance to family attitudes today. Its only value would be to show the change in attitude since 1955.

Gather your information from research journal articles rather than books. Books are, by necessity, more out of date than the research they're based upon. Research reports are primary sources of information, because they are written by those who conducted the study. Books are usually secondary sources; that is, sources written by authors not directly associated with the reported research: they merely compile research results from many sources. Focus your synthesis on primary sources of information.

4th ed. 2006 Dr. Rick Yount
E.R.I.C.
The Educational Resources Information Center (ERIC) was initiated in 1965 by the U.S. Office of Education to transmit findings of current educational research to researchers, teachers, administrators and graduate students.1 Information is housed in 16 clearinghouses around the nation.2
RIE
The ERIC system consists of two major parts. The first is the Resources in Education (RIE) which provides abstracts of unpublished papers presented at educational conferences, speeches, progress reports of on-going research studies, and final reports of projects conducted by local agencies such as school districts.3
CIJE
The second major part of the ERIC system is the Current Index of Journals in Education (CIJE). The CIJE indexes articles published in over 300 educational journals and articles about educational concerns in other professional journals.4 In general, ERIC listings have less lag time than the Education Index or Psychological Abstracts. This means it will provide you with more recent research findings. Altogether, the ERIC system indexes and abstracts research projects, theses, conference proceedings, project reports, speeches, bibliographies, curriculum-related materials, books and more than 750 educational journals.1
1 Walter R. Borg and Meredith D. Gall, Educational Research: An Introduction, 4th ed. (New York: Longman Publishing Co., 1983), 153.
2 See Borg and Gall, pp. 901-2 for addresses of clearinghouses.
3 Ibid., p. 153.
4 Charles D. Hopkins, Educational Research: A Structure for Inquiry (Columbus, Ohio: Charles E. Merrill Publishing Company, 1976), 221.
Psychological Abstracts
Published by the American Psychological Association, this publication lists articles from over 850 journals and other sources in psychology and related fields.2 It gives summaries of studies, books, and articles on all fields of psychology and many educational articles.3
Dissertation Abstracts
The Dissertation Abstracts database contains all dissertations written and registered since 1860. This is a rich resource not only of graduate level research findings, but also of research design and statistical analysis methods.
Education Index
The Education Index provides an up-to-date listing of articles published in hundreds of education journals, books about education and publications in related fields since 1929. For an index to educational articles for the years 1900 to 1929, check the Readers' Guide to Periodical Literature.5
Citation Indexes
The Citation Indexes list published articles which reference (cite) a given article. My statistics professor at University of North Texas gave me a copy of a 1973 article on multiple comparisons one evening before class. He thought the questionable findings in the article would make a good dissertation study. By using citation indexes, I was able to quickly track down references to over fifty articles published since 1973 which cited the article he'd given me. The Science Citation Index (SCI) provides citations in the fields of science, medicine, agriculture, technology, and the behavioral sciences.
1 Sharan B. Merriam and Edwin L. Simpson, A Guide to Research for Educators and Trainers of Adults (Malabar, FL: Robert E. Krieger Publishing Company, 1984), 35.
2 Borg and Gall, p. 150.
3 Hopkins, p. 224.
4 See Borg and Gall, pp. 148-166 for detailed information on these and many other sources.
5 Hopkins, p. 221.
The Social Science Citation Index (SSCI) does the same for the social, behavioral and related sciences.1
1 Hopkins, 225.
The bridge that connects your study to the documents in the databases you've selected is made up of the descriptors, or key words, that grow out of your Problem Statement and operationalized variables. Only key words that are known by the database will work. In the example above, we found that the descriptor "academic self-concept" does not exist in the ERIC system. Other key words had to be substituted. When I wrote a research proposal on Research Priorities in Religious Education, the descriptor "Religious Education" led me to over thirty research articles. But none of the articles used the term the way Southern Baptists use it. If your study has a solid theoretical base, you will find it easier to find descriptors. Ultimately, you will secure reports that provide a good foundation for your study. If your study is theoretically shallow, you will have difficulty finding descriptors. You will be barred from the world of scientific knowledge.
Searching manually
To do a manual search for the key words listed above in the ERIC system, follow these steps:
1. Look in the ERIC index published in the most recent month of the current year. (Indexes for ERIC documents are published monthly; semi-annual volumes are published twice each year.)
2. Look up each of your descriptors in the Subject Index section.
3. You will notice that descriptors are organized in hierarchies. The higher up the hierarchy you find a descriptor, the broader it is (that is, the greater the number of articles it references). The farther down the hierarchy you find a descriptor, the narrower it is (the fewer articles it references). Articles are referenced under the descriptors by ED numbers, such as ED 654 321.
4. Look up the ED number in the Document Resumes section of the ERIC index. Here you will find a brief description (an abstract) of the referenced article. You can usually tell from the abstract whether the article will be of help to you in your own study.
5. When you have found all the abstracts for all your descriptors in this index, move to the next earlier month and repeat the process.
6. When you have completed the current year, use the semi-annual volumes to search back through previous years.
7. Continue the process until you have located every ERIC document related to every descriptor back as far as you want the search to extend.
Searching by Computer
A manual search requires a great deal of time because you must manually thumb through multiple volumes of database indexes. Just think about looking up each of four descriptors, along with their associated articles, in monthly and then semi-annual indexes for up to ten years! How much time do you have to sit in the Reference Section of your university library? But more important than wasted time is the limitation of doing only simple searches. This rules out searches such as "self-esteem AND elementary school children." Such a search would select only those articles which
relate to BOTH descriptors. With a computerized database, you can search through literally millions of articles in seconds, and combine key words in complex ways. We can combine all our selected descriptors into a single search command for the computer. With one pass through all the ERIC documents, every article meeting the specifications of the command line will be selected from that database. Let's use our example to illustrate the process.
1. The library assistant responsible for doing computer searches dials up the database.
2. Descriptors are entered one at a time.
3. With each entry, there is a pause for a few seconds while the computer scans all of its material. It responds with the number of articles relating to that descriptor. The following numbers of articles were found by Borg and Gall for the example problem:
   1. handicapped children -- 277
   2. handicapped students -- 450
   3. self-concept -- 4,433
   4. self-esteem -- 894
   5. elementary school students -- 5,031
   Total: 11,085
4. Descriptors can be combined to select only those articles that fit a specific combination. Borg and Gall's example is interested in (1) self-concept OR (2) self-esteem AND (3) handicapped children OR (4) handicapped students AND (5) elementary school students. This combination is entered with the command (1 or 2) and (3 or 4) and (5). The OR increases the number of selected articles by including additional descriptors: any article relating to either self-esteem OR self-concept, and any article relating to either handicapped children OR handicapped students, will be selected. The AND narrows the number of selected articles by requiring articles to match all the descriptors connected by it. All articles must have either (1) or (2), AND either (3) or (4), AND (5) elementary school students to be selected in this search. The search above produced only one article reference out of the 11,085 articles identified by single descriptors. The Related Literature section requires more than a single article! The researchers broadened the search by dropping (5) elementary school students. Entering the command (1 or 2) and (3 or 4) produced 41 articles in ERIC documents.
5. Print out abstracts. You can have the computer print out the selected abstracts immediately (on-line) or you can have them printed out later (off-line). The difference is COST! Printing out abstracts while on-line means paying the connect fee between the computer and the database while the printer cranks out the abstracts. Printing off-line gives you the abstracts in a few days, but costs only a few cents each. This lower cost is possible because the database computer can call the library in the evening when phone rates are low, download all of the articles to the library's computer, and hang up. The library computer then prints out the listing. On-line printing is expensive, but quick. You get your listing of articles immediately.
Off-line printing is much cheaper, but you may have to wait 3-4 days before you can get your printouts.
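The OR/AND logic of a combined search can be pictured with simple set operations. The article ID numbers below are invented; unions widen the result and intersections narrow it, just as in the Borg and Gall example:

```python
# Hypothetical mini-database: article IDs indexed under each descriptor
index = {
    "self-concept":               {101, 102, 103, 104},
    "self-esteem":                {103, 105},
    "handicapped children":       {102, 106},
    "handicapped students":       {104, 107},
    "elementary school students": {104, 108},
}

# OR widens a search (set union); AND narrows it (set intersection).
# Command: (1 or 2) and (3 or 4) and (5)
selected = (
    (index["self-concept"] | index["self-esteem"])
    & (index["handicapped children"] | index["handicapped students"])
    & index["elementary school students"]
)
print(selected)  # {104}: the one article matching every AND-connected group
```

Dropping the last AND term, as the researchers did, would widen the result to every article matching the first two groups.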
Borg and Gall suggest the most productive results for educational topics would be to search RIE and CIJE from 1969 to date, RIE and Education Index from 1966 to 1968, and Education Index from 1965 back as far as the student plans to extend his review.1 Note: This provides a good historical context. Use sources less than 10 years old for the bulk of your study.
Select Articles
You now have either citations or abstracts of the selected articles. Citations give the author, title, and date of selected articles; an abstract gives a 50-100 word summary of the study. You want to get abstracts if the database provides them.

You now must find the article. Your library can help you do a computer search and provide you with citations. However, the articles cited may not be on your campus. You may need to go to a larger university or state school to find the original article. In our area, for example, North Texas State University has over 5 million journal articles on microfiche and adds thousands of articles each year.

Make a list of the publications cited in your search. The first step is to find out which libraries in the area carry these publications. The reference desk at area university libraries can provide you with a catalog of publications collected by a particular library. Locate the publications on your list in the directory. Some libraries have articles bound in annual volumes and stored on shelves. Others record articles on microfilm or microfiche and store them in filing cabinets. Using the library's indexing system, you can locate the full article selected by your key word search.

There are two major ways to process the articles when you find them. The first is to read through the article in the library and take notes on it immediately. Copy down what you think is relevant on 5x8 cards. Be sure to get all the bibliographical information you need for footnotes and references. The second way is to merely scan the article to determine whether it really pertains to your study or not. If it does, make a xerox copy of it. Both bound journals and microfilm/fiche materials can be xeroxed. The cost is about ten cents per page. You may spend twenty or thirty dollars in dimes this way, but you have a real advantage over the first approach. You have the articles.
You can analyze them at home: write on them, categorize them, cut and paste them; the copies belong to you. I heartily recommend this approach -- especially if you have a family who would like to see you from time to time. Check the bibliographies of the research articles for further references to related literature. This provides you another path to important studies done in your area of interest. Now you must analyze and organize all of this material.
An Organizational Notebook
In my last dissertation, I organized my literature conceptually. I began by scanning the 167 selected articles, looking for key concepts and terms used by the authors that related to the key words of my study. I then placed each term at the top of a blank sheet of paper in a notebook. I began with about thirty concepts which were organized alphabetically.
Prioritizing Articles
While I scanned the articles, I categorized them into three levels of importance: high, medium, and low. High priority articles were identified as those which dealt directly either with my subject or methods. Medium priority articles were identified as those which provided either relevant background information or important implications of my subject. Low priority articles were identified as those which only tangentially referred to my subject or methodology. After my key word notebook was organized, I began reading the high priority articles in detail. New concepts were added to the organizational notebook.
for a natural timeline of development, a chronological ordering is best. In this case, clusters will be time-sensitive, showing a change in thinking over time.
Conceptually: If your study is anchored in clear, inter-related concepts, a conceptual ordering is suggested. My last dissertation had sections on the development of ANOVA and multiple comparisons tests, Type I error rate, Type II error rate, power, and research design.
Stated hypotheses: If you have several hypotheses in your study, these form a natural way to order key word clusters.
Summary
As you can see, the process of developing the Related Literature section of your paper involves a great deal more than checking ten or twelve books out of the library and writing a term paper. The process takes time. You have most of the semester to complete this, but don't wait! Searching the literature will provide you necessary insight into how to mold your entire proposal. Begin now to search the literature. You should do at least one computer search, just for the practice of it; additionally, it will save you weeks of library time.
Vocabulary
CIJE -- abbreviation for Current Index of Journals in Education (published articles)
Citation Indexes -- resources that list articles which cite a given research article
computer search -- locating research articles by computer
databases -- collections of research information by subject matter (e.g. ERIC)
descriptors -- key words by which research articles are indexed (e.g. cognitive or children)
Dissertation Abstracts -- a resource that catalogs abstracts of all dissertations back to 1860
ERIC -- abbreviation for Educational Resources Information Center (CIJE and RIE)
Education Index -- a resource that catalogs education information back to 1929
manual search -- locating research articles using printed indexes
Measures for Psy. Measurement -- catalogs psychological tests used in research
Mental Measurements Yearbook -- catalogs published educational, psychological and vocational tests
organizational notebook -- tool to aid in dissecting articles and synthesizing related ideas
preliminary sources -- resources used to locate articles (e.g. indexes)
primary sources -- materials produced by those who conduct research (e.g. journal articles)
Psychological Abstracts -- index to over 850 psychological journals
RIE -- abbreviation for Resources in Education: index to unpublished materials
secondary sources -- materials produced by writers who study research reports (e.g. books)
SSIE -- abbr for Smithsonian Science Information Exchange: best for ongoing research
synthesis -- multiple articles broken down and reordered by concept in clear, concise writing
AND -- both X and Y must be true (1) for Z to be true (1); otherwise Z = 0 (false)
OR -- either X or Y must be true (1) for Z to be true (1); both 0? Z = 0 (false)
Study Questions
1. Differentiate among preliminary, primary and secondary sources of information.
2. Define the following terms: ERIC, SSIE, RIE, CIJE, descriptor, SCI, SSCI, database, synthesis.
3. Differentiate between a summary of literature and a synthesis of literature.
4. What is the major difference between printing abstracts on-line and off-line?
5. Discuss the importance of revision in writing your proposal. How are you planning to incorporate revision into your proposal development schedule?
Chapter 7
Populations and Sampling
The Rationale of Sampling
Steps in Sampling
Types of Sampling
Inferential Statistics: A Look Ahead
The Case Study Approach

The Rationale of Sampling
In Chapter One, we established the fact that inductive reasoning is an essential part of the scientific process. Recall that inductive reasoning moves from individual observations to general principles. If a researcher can observe a characteristic of interest in all members of a population, he can with confidence base conclusions about the population on these observations. This is perfect induction. If he, on the other hand, observes the characteristic of interest in some members of the population, he can do no more than infer that these observations will be true of the whole. This is imperfect induction, and is the basis for sampling.1 The population of interest is usually too large or too scattered geographically to study directly. By correctly drawing a sample from a specific population, a researcher can analyze the sample and make inferences about population characteristics.
Population
Sampling
Biased Samples
Randomization
The Population
A population consists of all the subjects you want to study. "Southern Baptist missionaries" is a population. So is "ministers of youth in SBC churches in Texas." So is "Christian school children in grades 3 and 4." A population comprises all the possible cases (persons, objects, events) that constitute a known whole.2
Sampling
Sampling is the process of selecting a group of subjects for a study in such a way that the individuals represent the larger group from which they were selected.3 This representative portion of a population is called a sample.4
1 Donald Ary, Lucy Cheser Jacobs, and Asghar Razavieh, Introduction to Research in Education (New York: Holt, Rinehart and Winston, Inc., 1972), 160.
2 Ibid., p. 125.
3 L. R. Gay, Educational Research: Competencies for Analysis and Application, 3rd ed. (Columbus, Ohio: Merrill Publishing Company, 1987), 101.
4 Ary et al., 125.
I: Research Fundamentals
Biased Samples
It is important that samples provide a representative cross-section of the population they supposedly represent. The sample should be a microcosm, a miniature model, of the population from which it was drawn. Otherwise, the results from the sample will be misleading when applied to the population as a whole. If I select Southern Baptist ministers as the population for my study and select Southern Baptist pastors in Fort Worth as my sample, I will have a biased sample. Fort Worth pastors may not reflect the same characteristics as ministers (including staff members) across the nation. Selecting people for a study because they are within convenient reach (members of my church, students in a nearby school, co-workers in the surrounding region) yields biased samples. Biased samples do not represent the populations from which they are drawn.
Randomization
The key to building representative samples is randomization. Randomization is the process of randomly selecting population members for a given sample, or randomly assigning subjects to one of several experimental groups, or randomly assigning experimental treatments to groups. In the context of this chapter, it is selecting subjects for a sample in such a way that every member of the population has an equal chance of being selected. By randomly selecting subjects from a population, you statistically equalize all variables simultaneously.
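Randomization of this kind is easy to express in code. A minimal sketch in Python (the population of 5,000 numbered teachers anticipates the superintendent example later in this chapter and is invented here for illustration):

```python
import random

# A hypothetical population of 5,000 numbered teachers.
population = list(range(5000))

# random.sample draws without replacement: every member of the
# population has an equal chance of appearing in the sample.
sample = random.sample(population, k=500)

assert len(sample) == 500
assert len(set(sample)) == 500   # no member selected twice
```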
Steps in Sampling
Target Population Accessible Population Size of Sample Select
Regardless of the specific type of sampling used, the steps in sampling are essentially the same: identify the target population, identify the accessible population, determine the size of the sample, and select the sample.
Accuracy
In every measurement, there are two components: the true measure of the variable and error. The error comes from incidental extraneous sources within each subject: degree of motivation, interest, mood, recent events, future expectations. All of these cause variations in test results. In all statistical analysis, the objective is to minimize error and maximize the true measure. As the sample size increases, the random extraneous errors tend to cancel each other out, leaving a better picture of the true measure of the population.
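The claim that random errors tend to cancel as the sample grows can be illustrated with a short simulation. This sketch is mine, not the author's; the "true measure" is set to 0 and each observation adds a random extraneous error:

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

def mean_error(n, trials=200):
    """Average distance between a sample mean and the true measure (0)
    when each observation is the true measure plus random error."""
    errors = []
    for _ in range(trials):
        sample = [random.gauss(0, 10) for _ in range(n)]  # error sd = 10
        errors.append(abs(statistics.mean(sample)))
    return statistics.mean(errors)

# Larger samples leave a smaller average error: the extraneous
# errors cancel each other out, exposing the true measure.
assert mean_error(1000) < mean_error(10)
```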
Cost
An increasing sample size translates directly into increasing costs, not only of money but of time as well. Just think of the difference in printing, mailing, receiving, processing, tabulating, and analyzing questionnaires for 100 subjects, and then for 1,000 subjects. The dilemma of realistically balancing accuracy (increase sample size) with cost (decrease sample size) confronts every researcher. Inaccurate data is useless, but a study which cannot be completed for lack of funds is no better. Cost per subject is directly related to the kind of study being done. Interviews are expensive in time, effort and money. Mailing out questionnaires is much less expensive per subject. Therefore, one can plan a larger sample with questionnaires than with interviews for the same cost.
Gay, 114
Other Considerations
Borg and Gall list several additional factors which influence the decision to increase the sample size (see pp. 257-261). These are:

1. When uncontrolled variables are present.
2. When you plan to break samples into subgroups.
3. When you expect high attrition of subjects.
4. When you require a high level of statistical power (see Chapter 17).
So, what is a good rule of thumb for setting sample size in a research proposal? Here are two suggestions:
Size of Population    Sampling Percent
0-100                 100%
101-1,000             10%
1,001-5,000           5%
5,001-10,000          3%
10,000+               1%
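The table can be expressed as a small lookup function. A sketch in Python (the function name and the handling of band edges are my own choices; the percentages come straight from the table):

```python
import math

def rule_of_thumb_sample_size(population_size):
    """Suggested sample size from the sampling-percent table."""
    if population_size <= 100:
        percent = 1.00
    elif population_size <= 1000:
        percent = 0.10
    elif population_size <= 5000:
        percent = 0.05
    elif population_size <= 10000:
        percent = 0.03
    else:
        percent = 0.01
    return math.ceil(population_size * percent)

# The 4,573 youth ministers in the study questions fall in the
# 1,001-5,000 band, so 5% is suggested:
assert rule_of_thumb_sample_size(4573) == 229
assert rule_of_thumb_sample_size(80) == 80   # small populations: take all
```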
Types of Sampling
Simple Systematic Stratified Cluster
There are several ways of selecting a sample. We will look at four major types here: simple random, systematic, stratified, and cluster sampling. The basic characteristic of random sampling is that all members of the population have an equal and independent chance of being included in the sample.8
Gay, 114-115
Ary, 162
Gay, 105-7
Simple Random Sampling

The superintendent in our example would draw a simple random sample as follows:

1. The population is 5,000 teachers.
2. The sample size is 10%, or 500 teachers.
3. A number from 0000 to 4999 is assigned to each of the teachers.
4. A table of random numbers is entered at an arbitrarily selected number, such as 53634 in the list below:
59058 11859 53634 48708 71710
5. Since his population has only 5,000 members, he is interested only in the last 4 digits of the number, 3634.
6. The teacher assigned #3634 is selected for the sample.
7. The next number in the column is 48708. The last four digits are 8708. No teacher is assigned #8708, since there are only 5,000. Skip this number.
8. Applying these steps to the remaining numbers shown in the column, teachers 1710, 3942, and 3278 would be added to the sample.
9. This procedure continues down this column and succeeding columns until 500 teachers have been selected.
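The table-of-random-numbers procedure can be mimicked in code: take the last four digits of each entry, skip values with no matching teacher, and skip duplicates. A sketch (the seed and helper name are invented; Python's generator stands in for the printed table):

```python
import random

random.seed(42)

def table_sample(population_size=5000, sample_size=500):
    """Simulate walking a random-number table four digits at a time."""
    chosen = []
    seen = set()
    while len(chosen) < sample_size:
        entry = random.randint(0, 99999)    # a 5-digit table entry
        last_four = entry % 10000           # keep only the last 4 digits
        if last_four >= population_size:    # e.g. 8708: no such teacher
            continue                        # skip this number
        if last_four in seen:               # already selected
            continue
        seen.add(last_four)
        chosen.append(last_four)
    return chosen

sample = table_sample()
assert len(sample) == 500
assert all(0 <= teacher < 5000 for teacher in sample)
```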
This random sample could well be expected to represent the population from which it was drawn. But it is not guaranteed. The probable does not always happen. For example, if 55% of the 5000 teachers were female and 45% male, we would expect about the same percentages in our random sample of 500. Just by chance, however, the sample might contain 30% females and 70% males. If the superintendent believed teaching level (elementary, junior high, senior high) might be a significant variable in attitude toward unions, he would not want to leave representation of these three sub-groups to chance. He would probably choose to do a stratified random sample.
Systematic Sampling
A systematic sample is one in which every Kth subject on a list is selected for inclusion in the sample.10 The K refers to the sampling interval, and may be every 3rd (K=3) or 10th (K=10) subject. The value of K is determined by dividing the population size by the sample size. Let's say that you have a list of 10,000 persons. You decide to use a sample of size 1,000. K = 10000/1000 = 10. If you choose every 10th name, you will get a sample of size 1,000. The superintendent in our example would employ systematic sampling as follows:
1. The population is 5,000 teachers.
2. The sample size is 10%, or 500 teachers.
3. The superintendent has a directory which lists all 5,000 teachers in alphabetical order.
4. The sampling interval (K) is determined by dividing the population (5,000) by the desired sample size (500). K = 5000/500 = 10.
5. A random number between 0 and 9 is selected as a starting point. Suppose the number selected is 3.
6. Beginning with the 3rd name, every 10th name is selected throughout the population of 5,000 names. Thus, teachers 3, 13, 23, 33 ... 4,993 would be chosen for the sample (Gay, pp. 113-114).
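The six steps reduce to a start point and a fixed interval. A sketch in Python (the directory of names is invented to match the example):

```python
def systematic_sample(names, k, start):
    """Every kth name, beginning at position `start`
    (positions counted from 1, as in the directory)."""
    return names[start - 1::k]

teachers = [f"teacher {i}" for i in range(1, 5001)]   # 5,000 names
sample = systematic_sample(teachers, k=10, start=3)

assert len(sample) == 500
assert sample[0] == "teacher 3"
assert sample[1] == "teacher 13"
assert sample[-1] == "teacher 4993"
```

Notice that once `start` is fixed, every later choice is determined, which is exactly the objection some writers raise against this method.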
Writers disagree on the usefulness of systematic sampling. Ary and Gay discount systematic sampling as not as good as random sampling because each selection is not independent of the others.11 Once the beginning point is established, all other choices are determined. Both writers give as an example a population which includes various nationalities. Since certain nationalities have distinctive last names that tend to group together under certain letters of the alphabet, systematic sampling can skip over
10. Gay, 112.
whole nationalities at a time. Babbie, on the other hand, states that systematic sampling is virtually identical to simple random sampling when one chooses a random starting point.12 Sax reports that systematic sampling usually leads to the same results as simple random sampling.13 There is a module on your tutorial disk that directly compares systematic sampling with simple random sampling. Use it to compare the results of the two procedures for yourself. There is one major danger with systematic sampling on which all authors agree: if there is some natural periodicity (a repeating pattern) within the list, the systematic sample will produce estimates which are seriously in error.14 If this condition exists, the researcher can do one of two things. He can use simple random sampling on the list as it exists, or he can randomly order the list and then use systematic sampling.
Stratified Sampling
Stratified sampling permits the researcher to identify sub-groups within a population and create a sample which mirrors these sub-groups by randomly choosing subjects from each stratum. Such a sample is more representative of the population across these sub-groups than a simple random sample would be.15 Subgroups in the sample can either be of equal size or proportional to the population in size. Equal size sample subgroups are formed by randomly selecting the same number of subjects from each population subgroup. Proportional subgroups are formed by selecting subjects so that the subgroup percentages in the population are reflected in the sample. The following example is a proportionally stratified sample. The superintendent would follow these steps to create a stratified sample of his 5,000 teachers.16
1. The population is 5,000 teachers.
2. The desired sample size is 10%, or 500 teachers.
3. The variable of interest is teaching level. There are three subgroups: elementary, junior high, and senior high.
4. Classify the 5,000 teachers into the subgroups. In this case, 65% or 3,250 are elementary teachers, 20% or 1,000 are junior high teachers, and 15% or 750 are senior high teachers.
5. The superintendent wants 500 teachers in the sample. So 65% of the sample (325 teachers) should be elementary, 20% (100) should be junior high teachers, and 15% (75) should be senior high teachers. This is a proportionally stratified sample. (A non-proportionally stratified sample would randomly select 167 subjects from each of the three groups.)
6. The superintendent now has a sample of 500 (325+100+75) teachers, which is representative of the 5,000 and which reflects proportionally each teaching level.
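The proportional allocation in steps 4 and 5 can be computed directly: multiply each subgroup's share of the population by the desired sample size, then sample randomly within each stratum. A sketch (the rosters are invented to match the example's 3,250 / 1,000 / 750 split):

```python
import random

random.seed(7)

strata = {
    "elementary":  [f"elem {i}" for i in range(3250)],   # 65%
    "junior high": [f"jr {i}" for i in range(1000)],     # 20%
    "senior high": [f"sr {i}" for i in range(750)],      # 15%
}
population_size = sum(len(roster) for roster in strata.values())  # 5,000
sample_size = 500

sample = []
for level, roster in strata.items():
    share = round(sample_size * len(roster) / population_size)
    sample.extend(random.sample(roster, share))  # random within the stratum

assert len(sample) == 500   # 325 + 100 + 75
```

For these round numbers the shares come out exact; with messier populations the rounded shares may need a one- or two-subject adjustment to hit the target sample size.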
Cluster Sampling
Cluster sampling involves randomly selecting groups, not individuals. It is often impossible to obtain a list of individuals which make up a target population. Suppose
12. Earl Babbie, The Practice of Social Research, 3rd ed. (Belmont, CA: Wadsworth Publishing Company, 1983), 163.
13. Gilbert Sax, Foundations of Educational Research (Englewood Cliffs, NJ: Prentice-Hall, 1979), 191.
14. Gilbert Churchill, Marketing Research: Methodological Foundations, 2nd ed. (Hinsdale, IL: The Dryden Press, 1979), 328.
15. Ary and others, 164; Babbie, 164-165; Borg and Gall, 248-249; Sax, 185-190.
16. Gay, 107-109.
a researcher is interested in surveying the residents of Fort Worth. Through cluster sampling, he would randomly select a number of city blocks and then survey every person in the selected blocks. Or suppose another researcher wants to study the social skills of Southern Baptist church staff members. No list exists which contains the names of all church staff members. But he could randomly select churches in the Convention and use all the staff members of the selected churches. Any intact group with similar characteristics is a cluster. Other examples of clusters include classrooms, schools, hospitals, and counseling centers. Let's apply this approach to the superintendent's study.
1. The population is 5,000 teachers.
2. The sample size is 10%, or 500 teachers.
3. The logical cluster is the school.
4. The superintendent has a list of 100 schools in the district.
5. Although the clusters vary in size, there are an average of 50 teachers per school.
6. The required number of clusters is obtained by dividing the sample size (500) by the average size of cluster (50). Thus, the number of clusters needed is 500/50 = 10 schools.
7. The superintendent randomly selects 10 schools out of the 100.
8. Every teacher in the selected schools is included in the sample.
In this way, the interviewer can conduct interviews with all the teachers in ten locations, and save traveling to as many as 100 schools in the district.17 There are drawbacks to cluster sampling. First, a sample made up of clusters may be less representative than one selected through random sampling.18 Only ten schools out of 100 are used in our example. These ten may well be different from the other ninety. Using a larger sample size, say, 25 schools rather than 10, reduces this problem. A second drawback is that commonly used inferential statistics are not appropriate for analyzing data from a study using cluster sampling.19 The statistical procedures we will be studying require random sampling.20
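The cluster procedure in steps 6 through 8 might be sketched like this (the school rosters are invented, each with exactly 50 teachers so the arithmetic matches the example):

```python
import random

random.seed(3)

# 100 schools of 50 teachers each (the example's average).
schools = {s: [f"school {s} teacher {t}" for t in range(50)]
           for s in range(100)}

sample_size = 500
average_cluster_size = 50
clusters_needed = sample_size // average_cluster_size     # 500 / 50 = 10

selected = random.sample(list(schools), clusters_needed)  # pick 10 schools
sample = [t for school in selected for t in schools[school]]  # every teacher

assert clusters_needed == 10
assert len(sample) == 500
```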
Oral Histories
Oral histories involve extensive first-person interviewing of a single individual. Dissertations have been written on the lives of J. M. Price and Joe Davis Heacock, former deans of the School of Religious Education, using this approach.
Situational Analysis
An event is studied from the perspective of the participants involved. For example, a staff member is summarily fired from a church staff by the pastor. Interviews with the staff member and family, staff colleagues, the pastor, church leaders, and selected church members would be conducted. When all the views are synthesized, an in-depth understanding of the event can be produced.
suffering from the problem. "Depression in the Ministry: A Case Study of Twenty Ministers of Education."
Summary
In this chapter you have learned about sampling techniques that allow you to select and study a small representative group of subjects (the sample) and infer findings to the larger group (the population). You have been given a rationale for sampling, the place of randomization in sampling, the steps of sampling, four types of sampling, and a look at the case study approach.
Vocabulary
accessible population: subjects available for sampling (e.g. mailing list)
attrition: loss of subjects during a study
biased sample: subjects selected in non-random manner (e.g. 3rd grade classes at school)
case study approach: in-depth study of individual subject or institution
cluster sampling: selecting subjects by randomly choosing groups (e.g. city blocks or churches)
error: difference between the measurement of a variable and its true value
estimated parameters: mean and standard deviation of population computed from sample statistics
population parameters: mean and standard deviation of population measured directly
randomization: selecting subjects so that each population member has equal chance of being selected
sample: a (smaller) group of subjects which represents a (larger) population
sample size: the number of subjects in a sample (symbolized by N or n)
sample statistics: mean and standard deviation of a sample (not useful in themselves)
sampling error: source of the discrepancy between sample statistics and population parameters
sampling: process of selecting a representative sample from a population
simple random sampling: drawing subjects by random number (e.g. names out of a hat)
statistical power: the probability that a statistic will declare a difference "significant"
stratified sampling: selecting subjects at random from population strata (e.g. male, female)
systematic sampling: selecting every kth subject from a list (e.g. every 10th person in 1000 = 100 subjects)
target population: population of interest to your study (e.g. single adults)
true measure: the true value of a variable (no error)
Study Questions
1. Define target population, accessible population, and sample.
2. Explain why sampling is an important part of research.
3. List and describe four types of sampling.
4. Explain why randomization is important in sampling.
5. You want to study youth ministers' attitudes toward small group Bible study. You have identified 4,573 youth ministers. Using the rule-of-thumb estimate for sampling, how many youth ministers should you select for your study?
Chapter 8

Collecting Dependable Data
Validity Reliability Objectivity
We have discussed variables and problems, hypotheses and purposes, populations and samples. The theoretical foundation of your study must sooner or later yield to concrete action: the collection of real pieces of data. The tools used to collect data are called instruments. An instrument may be an observation checklist, a questionnaire, an interview guide, a test or attitude scale. It may be a video camera or cassette recorder. An instrument is any device used to observe and record the characteristics of a variable. Before you can accurately measure the stated variables of your study, you must translate those variables into measurable forms. This is done by operationally defining the variables of your study (Chapter 3). Data collection is meaningless without a clearly operationalized set of variables. The second step is to ensure that the selected instrument accurately measures the variables you've selected. The naive researcher rushes past the instrument selection or development phase in order to collect data. The result is faulty, error-filled data -- which yields faulty conclusions. The accuracy of the instrument used in your study is an important factor in the usefulness of your results. If the data is incomplete or inadequate, the study is destined for failure. A wonderful design and precise analysis yields useless results if the data quality is poor. So carefully design or select the instrument you will use to collect data. Three characteristics -- "the Great Triad" -- determine the precision with which an instrument collects data. The Great Triad consists of (1) validity: Does the instrument measure what it says it measures? (2) reliability: Does the instrument measure accurately and consistently? and (3) objectivity: Is the instrument immune to the personal attitudes and opinions of the researcher?
Validity

4th ed. 2006 Dr. Rick Yount

The term validity refers to the ability of research instruments to measure what they say they measure. A valid instrument measures what it purports to measure. A 12-inch ruler is a valid instrument for measuring length. It is not a valid instrument for measuring I.Q., or a quantity of a liquid, or an amount of steam pressure. These require an I.Q. test, a measuring cup, and a pressure gauge. Let's say a student wants to measure the variable "spiritual maturity," and operationally defines it as the number of times a subject attended Sunday School out of the past 52 Sundays. The question we should ask is whether attendance count in Sunday School is a valid measure of spiritual maturity: does attendance really measure spiritual maturity? Can one attend Sunday School and be spiritually immature? (Yes: for coffee, fellowship and business contacts.) Can one be spiritually mature and not attend Sunday School? (Yes: pastors usually use this time for pastoral work.) If either of these questions can be answered yes (and they can), then the measure is not a valid one. There are four kinds of instrument validity: content, concurrent, predictive, and construct. Each of these has a specific meaning and helps establish the nature of valid instruments.
Content Validity
The content validity of a research instrument represents the extent to which the items in the instrument match the behavior, skill, or effect the researcher intends them to measure.1 In other words, a test has content validity if the items actually measure mastery of the content for which the test was developed. Tests which ask questions over material not covered by objectives or study guidelines, or draw from other fields besides the one being tested, violate this kind of validity. Content validity is different from face validity, which is a subjective judgement that a test appears to be valid. Researchers establish content validity for their instruments by submitting a long list of items (such as statements or questions) to a validation panel. Such a validation panel consists of six to ten persons who are considered experts in the field of study for which the instrument is being developed. The panel judges the clarity and meaningfulness of each of the items by means of a 4- or 6-point rating scale. Compute the means and standard deviations (see Chapter 16) for each of the items. Select the items with the highest mean and lowest standard deviation on "meaningfulness" and "clarity" to be included in your instrument. In summary, content validity asks the question, How closely does the instrument reflect the material over which it gathers data? Content validity is especially important in achievement testing.
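The panel procedure, computing a mean and standard deviation for each item and keeping the items the panel rates high and rates consistently, might be sketched like this (the ratings and the cutoffs are invented):

```python
import statistics

# Hypothetical panel ratings (1-6 scale), one list per candidate item,
# one rating per panel member.
ratings = {
    "item A": [6, 5, 6, 6, 5, 6],   # high mean, low spread: keep
    "item B": [2, 3, 2, 4, 3, 2],   # low mean: drop
    "item C": [6, 1, 6, 2, 5, 1],   # panel disagrees (high spread): drop
}

# Keep items with a high mean and a low standard deviation.
kept = [item for item, scores in ratings.items()
        if statistics.mean(scores) >= 5.0 and statistics.stdev(scores) <= 1.0]

assert kept == ["item A"]
```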
Predictive Validity
The predictive validity of a research instrument represents the extent to which the test's results predict such things as later achievement or job success. It is the degree to which the predictions made by a test are confirmed by the later success of the subjects. Suppose I developed a Research and Statistics Aptitude Test to be given to students at the beginning of the semester. If I correlated these test scores of incoming students with their final grade in the course, I could use the test as a predictor of success in the course. In this example, the Research and Statistics Test provides the predictor measures and the final course grade is the criterion by which the aptitude test is analyzed for validity. In predictive validity, the criterion scores are gathered some time after the predictor scores. The Graduate Record Examination (GRE) is taken
1. Merriam, 140.
by college students and supposedly predicts which of its users will succeed in (future) doctoral level studies. Predictive validity asks the question, How closely does the instrument reflect the later performance it seeks to predict?
Concurrent Validity
Concurrent validity represents the extent to which a (usually smaller, easier, newer) test reflects the same results as a (usually larger, more difficult, established) test. The established test is the criterion, the benchmark, for the newer, more efficient test. Strong concurrent validity means that the smaller, easier test provides data as well as the larger, more difficult one. A popular personality test, the Minnesota Multiphasic Personality Inventory (MMPI), once had only one form, consisting of about 550 questions. The test required several hours to administer. In order to reduce client frustration, a newer short-form version was developed which contained about 350 questions. Analysis revealed that the shorter form had high concurrent validity with the longer form. That is, psychologists found the same results with the shorter form as with the long form, while also reducing patient frustration and administration time. A researcher wanted to determine whether anxious college students showed more preference for female role behaviors than less anxious students. To identify contrasting groups of anxious and non-anxious students, she could have had a large number of students evaluated for clinical signs of anxiety by experienced clinical psychologists. However, she was able to locate a quick, objective test, the Taylor Manifest Anxiety Scale, which has been demonstrated to have high concurrent validity with clinical ratings of anxiety in a college population. She saved considerable time conducting the research project by substituting this quick, objective measure for a procedure that is time-consuming and subject to personal error.1 Concurrent validity asks the question, How closely does this instrument reflect the criterion established by another (usually more complex or costly) validated instrument?
Construct Validity
Construct validity reflects the extent to which a research instrument measures some abstract or hypothetical construct.2 Psychological concepts, such as intelligence, anxiety, and creativity are considered hypothetical constructs because they are not directly observable -- they are inferred on the basis of their observable effects on behavior.3 In order to gather evidence on construct validity, the test developer often starts by setting up hypotheses about the differentiating characteristics of persons who obtain high and low scores on the measure. Suppose, for example, that a test developer publishes a test that he claims is a measure of anxiety. How can one determine whether the test does in fact measure the construct of anxiety? One approach might be to determine whether the test differentiates between psychiatric and normal groups, since theorists have hypothesized that anxiety plays a substantial role in psychopathology. If the test does in fact differentiate the two groups, then we have some evidence that it measures the construct of anxiety.4 Construct validity asks the question, How closely does this instrument reflect the hypothetical construct it claims to measure?
1. Borg and Gall, 279.
2. A construct is a theoretical explanation of an attribute or characteristic created by scholars for purposes of study. Merriam, 141.
3. Borg and Gall, 280.
4. Ibid.
5. Sax, 206.
6. Ary et al., 200.
7. David Payne, The Assessment of Learning: Cognitive and Affective (Lexington, Mass.: D.C. Heath and Company, 1974), 259.
Reliability
Stability Consistency Equivalence
Reliability is the extent to which measurements reflect true individual differences among examinees.5 It is the degree of consistency with which [an instrument] measures what it is measuring.6 The higher the reliability of an instrument, the less influenced it is by random, unsystematic factors.7 In other words, is an instrument confounded by the smoke and noise of human characteristics, or can it measure the true substance of those variables? Does the instrument measure accurately, or is there extraneous error in the measurements? Do the scores produced by a test remain stable over time, or do we get a different score every time we administer the test to the same sample? There are three important measures of reliability. These are the coefficients of stability, internal consistency, and equivalence. All three use a correlation coefficient to express the strength of the measure. We will study the correlation coefficient in detail when we get to Chapter 22. For the time being, we will merely state that a reliability coefficient can vary from 0.00 (no reliability) to +1.00 (perfect reliability, which is never attained). A coefficient of 0.80 or higher is considered very good.
Coefficient of Stability
The coefficient of stability, also called test-retest reliability,9 measures how consistent scores remain over time. The test is given once, and then given to the same group at a later time, usually several weeks. A correlation coefficient is computed between the two sets of scores to produce the stability coefficient. The greatest problem with this measure of reliability is determining how much delay to use between the tests. If the delay is too short, then subjects will remember their previous answers and the reliability coefficient will be higher than it should be. If the delay is too long, then subjects may actually change in the interval. They will answer differently, but the difference is due to a change in the subject, not in the test. This will yield a coefficient lower than it should be.10 Still, science does best with consistent, stable, repeatable phenomena, and the stability of responses to a test is a good indicator of the stability of the variable being measured.
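The coefficient of stability is simply the correlation between the two administrations. A sketch with a hand-rolled Pearson r (the scores are invented; Chapter 22 develops the coefficient itself):

```python
import math
import statistics

def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# The same five subjects tested twice, several weeks apart.
first  = [85, 72, 90, 64, 78]
second = [83, 70, 92, 66, 75]

stability = pearson_r(first, second)
assert stability > 0.80   # a very good coefficient of stability
```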
10. Ibid., 284.
11. Ibid., 284-5.
12. Ibid., 285.
The Spearman-Brown prophecy formula is applied to the computed correlation between the two halves. If r = 0.60, then the formula yields r' = 0.75. Another measure of internal consistency can be obtained by the use of the Kuder-Richardson formulas. The most popular of the formulas are known as K-R 20 and K-R 21. The K-R 20 formula is considered by many specialists in education and psychology to be the most satisfactory method for determining test reliability. The K-R 21 formula is a simplified approximation of the K-R 20, and provides an easy method for determining a reliability coefficient. It requires much less time to apply than K-R 20 and is appropriate for the analysis of teacher-made tests and experimental tests written by a researcher which are scored dichotomously.13 (A dichotomous variable is one which has two and only two responses: yes-no, true-false, on-off.) Cronbach's Coefficient Alpha is a general form of the K-R 20 and can be applied to multiple choice and essay exams. Coefficient Alpha compares the sum of the variances for each item with the total variance for all items taken together. If there is high internal consistency, coefficient alpha produces a strong positive correlation coefficient.
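The worked figure above comes from the Spearman-Brown prophecy formula, r' = 2r / (1 + r), and coefficient alpha can be computed directly from the item variances, as just described. A sketch (the item-score matrix is invented):

```python
import statistics

def spearman_brown(half_r):
    """Step a split-half correlation up to whole-test reliability."""
    return 2 * half_r / (1 + half_r)

assert abs(spearman_brown(0.60) - 0.75) < 1e-9   # the example in the text

def cronbach_alpha(item_scores):
    """item_scores: one list of scores per item, aligned by subject."""
    k = len(item_scores)
    item_vars = sum(statistics.variance(item) for item in item_scores)
    totals = [sum(subject) for subject in zip(*item_scores)]
    return (k / (k - 1)) * (1 - item_vars / statistics.variance(totals))

# Three hypothetical items answered by four subjects.
items = [[4, 3, 5, 2],
         [5, 3, 4, 2],
         [4, 2, 5, 1]]
alpha = cronbach_alpha(items)
assert 0.0 < alpha <= 1.0   # here the items hang together well
```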
Coefficient of Equivalence
A third type of reliability is the coefficient of equivalence, sometimes called parallel forms, or alternate-form reliability. It can be applied any time one has two or more parallel forms (different versions) of the same test.14 One can administer both forms to the same group at one sitting, or with a short delay between sittings. A correlation coefficient is then computed on the two sets of parallel scores. A common use of this type of reliability is in a pretest-posttest research setting. If the researcher uses the same test on both testing occasions, he cannot know how much of the gain in scores is due to the treatment and how much is due to subjects remembering their answers from the first test. If one has two parallel forms of the same exam, and the coefficient of equivalence is high, one can use one form as the pretest and the other as the posttest.
The maximum validity of a test is equal to the square root of its reliability.18 Therefore, test validity is dependent upon test reliability.
Payne and Babbie would hold that an instrument can be unreliable and still be valid. A yardstick made out of rubber or a measuring tape made out of yarn is a valid instrument for measuring length, even though its measurements would not be accurate. Bell, Sax and Nunnally would say a tape measure made of yarn is not valid if it cannot produce reliable measurements. McCallon demonstrates the boundary condition of Vmax = √R. In the final analysis, whether we are aiming a rifle or designing a research instrument, our goal should be to get a tight cluster in the bull's-eye. Use instruments which demonstrate the ability to collect data with high validity and high reliability.
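The boundary condition Payne, Sax, Nunnally, and McCallon are debating, that a test's maximum validity is the square root of its reliability, is easy to compute (a small sketch; the reliability values are invented):

```python
import math

def max_validity(reliability):
    """Upper bound on validity implied by a given reliability."""
    return math.sqrt(reliability)

# Even a very good reliability of 0.81 caps validity at 0.90;
# an unreliable instrument (0.25) cannot be valid beyond 0.50.
assert max_validity(0.25) == 0.5
assert abs(max_validity(0.81) - 0.90) < 1e-9
```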
Objectivity
The third characteristic of good instruments is objectivity. Objectivity is the extent to which equally competent scorers get the same results. If interviewers A and B interview the same subject and produce different data sets for him, then it is clear that
17 Jum Nunnally, Educational Measurement and Evaluation, 2nd ed. (New York: McGraw-Hill Book Company, 1972), 98-99 18 Class notes, Research Seminar, Spring 1983 19 Payne, 259 20 Ibid., 254
the measurement is subjective.22 Something about the subject is hooking the interviewers differently. The difference is not in the subject, but in the interviewers. A pilot study which uses the researcher's instrument with subjects similar to those targeted for the study will demonstrate whether it is objective or not. This is particularly important in interview or observation type studies in which human subjectivity can distort the data being gathered. The validation panel described under validity also helps the researcher create an objective test. All items in an item bank should be as clear and meaningful as the researcher can make them. But after the validation panel has evaluated and rated them, the best of the items can be selected for the instrument. This will filter out much of the researcher's own biases. An illustration of the objective-subjective tension in instruments is the difference between essay and objective tests. The difference in grades produced on essay tests can be more related to the mood of the grader than the knowledge of the student. A well-written objective test avoids this problem because the answer to every question is definitively right or wrong. Whether you are planning to use an interview guide, an observation checklist, an attitude scale, or a test, you must work carefully to ensure that the data you gather reflects the real world as it is, and not as you want it to be.
Summary
The first element of the Great Triad is validity. The four types of validity (content, predictive, concurrent, and construct) focus on how well an instrument measures what it purports to measure. The fifth type, face validity, is nothing more than a subjective judgment on the part of the researcher and should not be used as a basis for validating instruments. The second element of the Great Triad is reliability. The three approaches to reliability (stability, internal consistency, and equivalence) focus on how accurate the gathered data is. The third element of the Great Triad is objectivity, which concerns the extent to which data is free from the subjective characteristics of the researchers.
RELIABLE: it says what it says accurately and consistently, and OBJECTIVE: it says what it says without subjective distortion or personal bias
Vocabulary
coefficient of stability: measure of steadiness, or sameness, of scores over time
concurrent validity: degree a new (easier?) test produces same results as older (harder?) test
construct validity: degree to which test actually measures specified variable (e.g. intelligence)
21 Babbie, 118
22 Sax, 238
content validity: degree to which test measures course content
Cronbach's coefficient: measure of internal consistency of a test
coefficient of equivalence: measure of sameness of two forms of a test
face validity: degree a test looks as if it measures stated content
coefficient of internal consistency: degree each item in a test contributes to the total score
Kuder-Richardson formulas: measures of internal consistency
objectivity: the degree that data is not influenced by subjective factors in researchers
parallel forms: tests used to establish equivalence
predictive validity: degree test measures some future behavior
reliability: degree a test measures variables accurately and consistently
Spearman-Brown prophecy formula: used to adjust the r value computed in split-half test
split half test: procedure used to establish internal consistency
test-retest: test given twice over time to establish stability of measures
validity: degree a test measures what it purports to measure
Study Questions
1. Define the terms instrument, validity, reliability, and objectivity.
2. Discuss the relationship between an operational definition and the procedures for collecting data.
3. Of these three essentials of research, which is most important: clear research design, accurate measurement, or precise statistical analysis? Why?
Chapter 9
Observation
9
Observation
The Problem The Obstacles Practical Suggestions
In a sense, all scientific research involves observation of one kind or another. This is what empiricism means (review Chapter One if needed). But in this chapter we focus on observation as one specific research technique among many. In this sense, the term observation means looking at something without influencing it and simultaneously recording it for later analysis.1

In observational research, we do not deal with what people want us to know (self-report measures) or with what some test writer believes he knows (tests and scales). Rather, we deal with actual people in real situations. People are seen in action. As such, observation is the most basic of techniques. The researcher, pad in hand, carefully observes selected subjects in order to quantify the variables he is interested in.

Deciding what to observe and whom to observe has been discussed in more general ways. Here we will look at how to record what is seen, and what mode of observation to use. Before we move to practical steps in doing observational research, we must first consider the biggest problem in observational research. That problem is, quite simply, the human being who does the observing.
1 June True, Finding Out: Conducting and Evaluating Social Research (Belmont, CA: Wadsworth, 1983), 159
Two people watch a prominent television evangelist preach for ten minutes. One responds, "What courageous leadership! What a man of God!" The other responds, "What a con man! He sure can manipulate people!" The difference in the data is in the observers, not in the evangelist. More data is needed to determine which of these two pictures is more correct.
These two examples illustrate inference, an enemy of valid and reliable data. When an observer infers motive to observed action, he adds something of himself to the data. Such data is distorted, invalid, and unreliable.

A second enemy is interference. The very presence of the observer can affect the behavior of the people being observed. Tell a Sunday School teacher you'll be visiting his class next Sunday, and you can expect a marked improvement in the preparation of the lesson. This factor is also the rationale for using undercover agents to infiltrate and observe criminal behavior as it really is. The presence of a uniformed police officer would certainly interfere with the criminal behavior.
Obstacles to objectivity in collecting data in observation research include personal interest, early decision, and personal characteristics.
Personal Interest
I see what I want to see. I once had a lady church member who insisted that we never elect a divorced person as a Sunday School teacher. She quoted scripture and produced one reason after another why divorced persons would be the ruin of the church -- until her own daughter got a divorce. Not three weeks later, this same lady was in my office, quoting scripture and complaining of how the church does not care about divorced people -- that we needed to give them opportunities for service; after all, they're people too! The scripture had not changed, but she certainly had, because of her personal experience.

We always have a personal interest in any study we conduct. If we did not, the process of giving birth to a research plan might be unbearable. But our personal interest should be directed toward collecting objective facts, not proving preconceived notions. If the study is intended from its inception to substantiate what you already believe, you will have difficulty seeing anything that contradicts this perspective. This is called selective observation, or, as we have noted, "I see what I want to see."
Early decision
It is part of the reality of human perception that we naturally and automatically fill in the gaps of what we know to be true. We add elements from our own imagination to make situations reasonable. The problem with this is that we can be deceived by our own imagination into creating a situation that does not exist in reality. When we have too few factual observations, we tend to fill in too much. This is the psychological basis of gossip: filling in the gaps between known data points with what we subjectively feel. The researcher needs a large number of objective data points from which to develop a theoretical pattern. By ending the observation phase prematurely, the researcher may interpret the data incorrectly. "I've seen enough. I can see the trend." But the trend may be an incorrect extrapolation from the facts.
Personal characteristics
Many of the things that characterize us as human pose difficulties in the observation process: emotions, prejudices, values, physical condition. We can unknowingly make a faulty inference because of the subjective influence of one or more of these personal characteristics. They may be difficult to identify.2 Whatever we study, we must make every effort to ensure that our data reflects that which we study and not ourselves. Objective observation checklists can help remove our personal biases and lack of neutrality concerning the chosen subject.
Definition
Observation is the act of looking at something without influencing it and recording the scene or action for later analysis.
Familiar Groups
Positively, studying a familiar group permits the use of previous experience with the group and established understanding of the subjects. Negatively, this very previous experience reduces the objectivity of the study. Further, revelation of discoveries within a familiar group can be perceived by group members as a betrayal of trust. For example, a minister on a large church staff decides to study "interpersonal conflict in local church ministry," using his position as a platform for observation of staff meeting discussions. While his existing relationship with the staff (and further, the level of trust he enjoys with staff colleagues) will encourage more realistic behaviors, revelation of those behaviors through his study may well end his relationships!
Unfamiliar Groups
Positively, studying an unfamiliar group reduces the effects of group identification and bias. In addition, observers notice things that insiders overlook. Unfamiliarity with the group improves objectivity in the data. Negatively, observers face problems in gaining access to unfamiliar groups, and, once involved, may have difficulty in understanding member actions within the group.
Observational Limits
Observation is an intensively human process. It is a fact that observers simply cannot study some people. Factors such as gender, age, race, appearance, religious denomination, or political affiliation of observers may prevent access to some groups of subjects. These are just six of many possible barriers to observation.
2 Hopkins, 81
3 True, 175-176
Written recording can be simplified by using shorthand or tallies on observation checklists. Mechanical recording makes an exact record of all the data, but does nothing to simplify or reduce the bulk of the observations. Observational episodes must be analyzed at a later time.
Interviewer Effect
Observation is an intensely human process! If subjects see observers taking notes, they may well change their behavior (interviewer effect is increased). Recording data surreptitiously decreases interviewer effect, but can be an invasion of privacy!
Debrief Immediately
Write-ups of observation sessions have to be made promptly because observers -- being human! -- may selectively forget details or unintentionally distort observations. Waiting until after the observational session is over to record responses greatly increases the likelihood that observer subjectivity will influence the data.
Participant Observation
(Compare "Familiar Groups.") Positively, participant observers (i.e., observers who are members of the groups they observe) have easier access and gain a truer picture of group behavior. Negatively, participant observers are restricted to one role within the group and are more partial in their observations than non-participant observers.
Non-participant Observation
(Compare "Unfamiliar Groups"). Positively, non-participant observers have a clearer, less biased perspective on group behavior. Negatively, the presence of a known (non-member) observer alters the behavior of subjects, especially at the beginning of the study. Failure to announce the purpose for an observer being present in the group may be unethical.
Observational Checklist
An observational checklist is a structure for observation, and allows observers to record behaviors during sessions quickly, accurately, and with minimal interviewer effect on behaviors. Dr. Mark Cook developed an observer consistency checklist for use in his study on active participation as a teaching strategy in adult Sunday School classes.4 He described his instrument this way:
The observer consistency checklist was developed to be used by trained observers in examining each teaching situation for consistency across treatments. It was imperative in this study that all other elements in the lesson plan and teaching environment be held constant while allowing active participation to be the independent variable. This evaluation form included (a) a checklist of teacher factors (such as any unusual enthusiasm or behaviors), student factors (such as unusual interruptions or group behaviors), and unusual external factors (outside interruptions, weather, or equipment problems); (b) frequency counts of the number of external interruptions, disruptions by students, departures from the lesson, and active participation; (c) a five-point rating of teacher enthusiasm; and (d) a record of the time span of the lesson.5
4 Cook, 21
5 Ibid., 22
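Cook's checklist reduces a session to checkmark tallies, frequency counts, a five-point rating, and a time span. As a loose sketch (the field names below are our own invention, not Cook's actual coding scheme), one session's data might be recorded like this:

```python
# Invented field names, not Cook's actual coding scheme: one observed
# session reduced to checkmark tallies, counts, a rating, and a time span.
session = {
    "external_factors": {"outside_interruptions": 2, "equipment_problems": 0},
    "student_factors": {"student_interruptions": 1},
    "teacher_enthusiasm": 4,   # five-point rating, 1-5
    "lesson_minutes": 47,      # time span of the lesson
}

# Frequency counts turn raw observation into quantifiable data points.
total_interruptions = (session["external_factors"]["outside_interruptions"]
                       + session["student_factors"]["student_interruptions"])
print(total_interruptions)  # 3
```

Structuring each session the same way is what lets the researcher compare sessions across treatments instead of relying on impressions.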
Summary
The fundamental data gathering technique in science is observation. In this chapter we looked at the obstacles facing one who plans to do an observational study, as well as practical suggestions to help you plan an effective study.
Vocabulary
inference: researcher infers motivation behind observed behavior
interference: researcher changes observed behavior by his/her presence
interviewer effect: potential bias in data due to subjective factors in interviewers
observation: gathering data by way of objective observation of behavior
Study Questions
1. Define observation research.
2. Define in your own words the terms inference and interference as they relate to enemies of valid data. Give an original example of each term.
3. Explain how our humanness is a liability in observational research.
APPENDIX A6

OBSERVER CONSISTENCY CHECKLIST

Date: _______________________   Observer: ___________________
Time: _______________________   Teacher: ____________________
Observer Instructions: Place a checkmark for each episode of the following factors. Memo the significant events or factors under the comment section at the bottom of the form.

OBSERVED FACTORS                             ACTIVE LESSON   NONACTIVE LESSON
EXTERNAL FACTORS
  Interruptions from outside class               _____            _____
  Unusual weather                                _____            _____
  Equipment problems                             _____            _____
  Any other external factors                     _____            _____
STUDENT FACTORS
  Students' experiences affect lesson            _____            _____
  Student interruptions                          _____            _____
  Hostile environment                            _____            _____
  Unusual group behavior                         _____            _____
TEACHER FACTORS
  Teacher experience affects lesson              _____            _____
  Unusual teacher enthusiasm                     _____            _____
  Unusual teacher behavior                       _____            _____
  Different teaching style                       _____            _____
  Variation from lesson plan                     _____            _____
  Gave test answers                              _____            _____
  Use of active participation                    _____            _____
  Level of teacher enthusiasm (Scale: 1-5)       _____            _____
Time of lesson (record in minutes)               _____            _____
Attendance in the class                          _____            _____
6 Cook, 61
Chapter 10
Survey Research
10
Survey Research
The Questionnaire The Interview Developing a Survey Instrument
Survey research uses questioning as a strategy to elicit information from subjects in order to determine the characteristics of selected populations on one or more variables.1 A written survey is called a questionnaire; an oral survey is called an interview. Although they serve similar purposes in gaining information, each provides unique advantages and disadvantages to the researcher.
The Questionnaire
The mailed questionnaire has been heavily criticized in recent times and has fallen into disfavor as a device for gathering data. But it has been the abuse and misuse of this technique that has drawn the criticism, not the nature of the questionnaire itself.2 Hastily constructed questionnaires, consisting of poorly worded questions, produce unreliable information at best and invalid results at worst. A planned, well-constructed questionnaire can obtain information that is obtainable in no other way.
Advantages
A questionnaire provides researchers several advantages over the interview.
Remote Subjects Influence Cost Reliability Convenience
Remote subjects
A questionnaire allows researchers to gather data from any part of the world. Through the use of existing postal systems, or, more recently, the internet, contact can be made with almost any literate population of interest. As a result, subjects can be randomly selected from wide-ranging populations, such as Southern Baptists in America.
Researcher influence
The standardized wording of a printed questionnaire reduces researcher interference in subject responses. The researcher's gender, appearance, mannerisms, social skills, and the like have no effect on how subjects respond to the questions.
Cost
Even with the high cost of postage, the mailed questionnaire is still the most
1 Gay, 191
2 Hopkins, 145
economical means, per subject, for gathering data. The economy of process allows researchers to increase the number of subjects in the study. Increased sample size provides more accurate estimates of population characteristics. Not only does the questionnaire save money directly, it also saves time. Consider the difference in processing time between mailing out 1,000 questionnaires and interviewing 1,000 subjects.

Dr. Jay Sedgwick of Dallas Theological Seminary (Southwestern Ph.D. graduate, 2003) analyzed differences in costs and data quality among three data collection techniques. He investigated direct collection from conference participants, e-mail responses to a website, and a traditional mailed survey. Conventional wisdom suggested that e-mail would provide quality data at greatly reduced cost. He found this not to be the case. Direct collection can be frustrated by restrictions imposed by conference leaders. The return rate was lowest among e-mail recipients -- and their responses provided the least reliability. The mailed survey was the most expensive, but provided the best return rate and quality of data.
Reliability
The standardized wording and structured questions of the questionnaire provide higher reliability in the data than can practically be obtained by interview.
Subjects' convenience
The questionnaire is completed at the subject's convenience. Subjects can consider each question, check necessary records, and reflect on their answers. Data is more valid under these conditions than when answers are given "on the spot" in an interview.
Disadvantages
Rate of Return Inflexibility Motivation Limited data Loss of control
There are disadvantages in using a mailed questionnaire that are overcome by the interview. These include the questionnaire's rate of return, its inflexible structure, the level of subject motivation, the limitation of not observing the subject as questions are answered, and the loss of control over the questioning process.
Rate of return
The biggest drawback in using questionnaires is the rate of return of the completed forms. Let me illustrate. You have drawn a representative sample from which to collect data. But when the questionnaires stop coming in, you find that only 35% of the sample responded. Why did 65% not respond? Are they different in some systematic way from the 35% who did? Does this have a bearing on your variables? You have no way of knowing. And this is a confounding variable (a source of error) in your study. Therefore, valid mail surveys have extensive follow-up procedures to produce the largest possible rate of return. How large? Some texts say 50%, some 60%. We suggest that doctoral students gathering data for their dissertation aim to get a 70% response rate or better. The return rate is computed as a percentage as follows:4
4 Hopkins, 148
rate = (NR / (NS - ND)) x 100

where NR is the number of completed forms returned, NS the number sent out, and ND the number unable to be delivered (returned to sender). For example, if you send out 180 questionnaires, and have 10 undelivered and 150 returned, your return rate is
rate = (150 / (180 - 10)) x 100
     = (150 / 170) x 100
     = (.882353) x 100
     = 88.24%
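The return-rate computation above can be expressed in a few lines of code. This is a minimal sketch (the function name is ours, not the text's):

```python
def return_rate(returned, sent, undeliverable):
    """Questionnaire return rate as a percentage of deliverable forms."""
    deliverable = sent - undeliverable
    return 100 * returned / deliverable

# The chapter's example: 180 sent, 10 undeliverable, 150 returned.
rate = return_rate(returned=150, sent=180, undeliverable=10)
print(round(rate, 2))  # 88.24
```

Note that undeliverable forms are subtracted from the denominator: subjects the postal system never reached cannot count against your response rate.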
The major problem with a low rate of return is that the data may not reflect the true measure of the sample you chose to study. Part of the sample volunteered to comply with the research request, and returned the completed form. Others ignored the questionnaires. The difference in willingness to comply may relate to some aspect of your study. So, a low return rate (i.e., less than 50%) of survey forms may well give a distorted view of the target population. Higher return rates (60% - 80%) increase confidence that the returned data correctly reflects the sample, which, in turn, reflects characteristics in the population from which the sample was drawn.
Inflexibility
The structure of a written questionnaire (which increases the reliability of subject responses) also limits the researcher's ability to probe subject responses or clarify misunderstandings. Writing a questionnaire which directs subjects through a series of probes (follow-on questions which move the subjects deeper) and branches (skips to following sections) usually results in a complex, perhaps confusing, instrument. The written questionnaire is much more inflexible than the interview as a device for gathering data.
Subject motivation
There is no way to determine the motivation level of the subjects when they fill out the form. What is the subject's mental state: overworked, busy, contemplative, focused? The questionnaire cannot measure this as an interviewer would.
Loss of control
Researchers give up control over the administration of the questions on the survey form. There is no control over the subjects' environment, time, or attention to the task. There is no control over the order in which the questions are answered. There is no control over leaving answers blank. This loss of control creates missing or distorted data, which can pose problems in statistical analysis.
Types of questionnaires
Questionnaires consist of questions of two basic types: structured and unstructured. A structured question, sometimes called close-ended, provides a predetermined set of answers from which the subject chooses. Here is an example of a structured, or close-ended, question:
What kind of college did you attend?
____ Evangelical college
____ Private secular college
____ Catholic college
____ State college
The advantage of this type of question over the unstructured (open-ended, see below) question is its greater reliability. It is a more reliable (consistent, stable) question because subjects are given specific responses from which to choose. The data from this type of question are more easily analyzed than data from open-ended items.

The second type of question is the unstructured, or open-ended, question. This question asks the subject for information without providing choices. Here's how the structured question above might be restated as an unstructured item.
Describe the kind of college you attended.
This type of question allows subjects to respond in their own way, using their own terms and language. It is less restrictive, so it might uncover subject characteristics that would be missed by the close-ended type. The open-ended item, however, increases the likelihood that subjects will respond incorrectly (that is, in a way not planned by the researcher). One subject might answer the above question like this: "It was an expensive nightmare!" This tells the researcher how he felt about his college, but it does not answer the question he had in mind.

Close-ended questions may miss important data points because they are restrictive. Open-ended questions may provide so many data points that the researcher cannot reduce them meaningfully. The answer? Use a survey form of open-ended questions in a pilot project to gather as many answers as possible. Then design a close-ended questionnaire for the actual study. This provides a valid base for the structured items, yet yields a reliable set of data for the study.
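The pilot-then-structure strategy just described can be sketched in code. Assuming a hypothetical list of pilot responses (the data below is invented), tallying the open-ended answers suggests the fixed choices for the close-ended form:

```python
from collections import Counter

# Hypothetical pilot responses to the open-ended item
# "Describe the kind of college you attended." (invented data)
pilot_answers = [
    "state college", "evangelical college", "state college",
    "private secular college", "catholic college", "state college",
    "evangelical college",
]

# The answers observed in the pilot become the fixed choices
# on the close-ended questionnaire used in the actual study.
tally = Counter(pilot_answers)
choices = [answer for answer, count in tally.most_common()]
print(choices)
```

In practice the researcher would also collapse near-duplicate wordings into single categories and add an "Other" option for answers the pilot did not surface.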
Guidelines
Here are some specific guidelines for developing a questionnaire.
Asking questions
The key to designing an effective questionnaire is asking good questions. A good question is specific, clearly presented, and generates an answer that is definite and
quantifiable. Asking unambiguous, meaningful questions is difficult. Researchers write questions according to standard guidelines (see Chapter 11). They then evaluate and revise questions as needed. Finally, questions are validated for clarity and meaningfulness by objective judges. The quality of the questionnaire is built directly on the quality of each question in it.
Clear instructions
Questionnaire designers know how to fill out their questionnaires because they created them. It is easy to assume that anyone would know how to complete the form. Such assumptions can doom a survey study. Subjects need clear instructions for completing the survey. If there are several sections in the form, specific instructions should be given for each section.
Understandable format
The order of questions in the questionnaire should not confuse subjects. Answers should be easy to select. Eliminate complex structures as much as possible (i.e., avoid probes into telescoped questions, or jumps to different sections in the form). A simple structure will produce more reliable data.
The Interview
In its most basic form, the interview is an oral questionnaire where subjects answer questions live, in the presence of researchers or their assistants.
Advantages
There are several key advantages to using an interview approach over the mailed questionnaire.
Flexibility Motivation Observation Broad Application No Mailing
Flexibility
A face-to-face interview affords greater flexibility than the more rigid written questionnaire. Interviewers can branch from one set of questions to another without confusing the subject. The interviewer can clarify misunderstandings of questions or instructions. If a subject makes an unexpected comment, the interviewer can investigate with follow-up questions. The survey instrument can be more complex, because a trained interviewer is better able to handle branching and probing than the untrained subject.
Motivation
When interviewers and subjects are facing each other, the motivation level of subjects can be directly observed and noted. Rapport between the interviewer and subject can create a more cooperative atmosphere, which increases the validity of the subjects' responses.
Observation
Researchers can record the manner, as well as the content, of subjects' answers. Mood, attitude, bias, emotional state, body language, facial expression: these are excellent clues to the quality of answers being received.
Broader Application
Interviewers can gather information from people who cannot read. Young children, senior adults with poor eyesight, and groups for whom English is a second language can give better information through an interview than they can with a written questionnaire.
Disadvantages
Likewise, there are some major disadvantages with the interview.
Time
Questioning scores of subjects one by one, in person, requires far more time than sending out survey forms by mail. In order to acquire a sufficient sample size, researchers may need to enlist and train a group of assistants to help in the interviewing. The training of interviewers is a monumental task and requires a great deal of time to ensure that all the interviewers administer the survey the same way.
Cost
While the cost of postage is avoided by interviewing subjects, interviewing involves other expenses. Payment of assistants is more expensive than stamps, but is necessary if you plan to do a professional study. The printed interview guide will cost about the same to print as a comparable questionnaire. Additionally, interviewing may require travel costs or long distance phone costs. This means that, given a set research budget, the number of subjects you can interview will be less than the number you can survey by mail. This results in a loss of statistical power in your study.
Interviewer effect
Do you remember the problems of inference and interference associated with observational research (Chapter 9)? All of the human problems we discussed regarding observational research apply to interviewers as well. Personal characteristics, social skills, competence, gender, appearance: all of these factors will produce variance in subject responses to questions unless they can be controlled by homogeneous enlistment and adequate training.
Interviewer variables
Differences among interviewers (their values, beliefs, and biases) may introduce distortion in the way interviewers interpret and record responses by subjects.
Types of Interviews
Earlier in the chapter we defined questions which are structured (close-ended) and unstructured (open-ended). A structured interview is simply an oral questionnaire. Researchers ask the questions in the order they appear on the form. An unstructured, or free response, interview presents the subject with open-ended questions. Researchers can follow up answers with probes and skips without confusing subjects. Just as the structured question increases reliability and decreases the range of answers, so does the structured interview. Just as the unstructured question increases answer variance and decreases the ability to quantify research data, so does the unstructured interview.
Guidelines
Here are some specific guidelines to consider if you plan to use the interview.
Recording responses
Subject responses need to be accurately recorded during the interview. Recording the responses after the interview invites problems with subjective interpretation, selective memory, or unconscious bias.
Interviewer skills
Before the study begins, interviewers should be given adequate practice in asking questions, fielding responses, probing, clarifying instructions, and recording answers. If skill levels differ among the interviewers, extraneous variability will be introduced into the data, making findings ambiguous.
Demographics first
Ask demographic questions first in the interview. By asking non-threatening demographic questions at the beginning of an interview session, researchers establish rapport between themselves and subjects. Such rapport improves the level of trust between researchers and subjects, which, in turn, increases the validity of answers received. Demographics come FIRST in the interview, LAST in the questionnaire.
Alternative modes
The face-to-face interview is only one mode of interviewing. Researchers can conduct interviews by telephone. This extends the range of the interview far beyond that possible with face-to-face meetings. Researchers can also mail cassette tapes to subjects. The subject listens to the question on tape and records his answer. This is less expensive than interviewing by phone, and extends the interview beyond that possible with face-to-face meetings. These modes provide more subject information than the written questionnaire does. Voice characteristics, subject hesitation, and tone of voice provide clues to subject motivation. Still, none of these alternatives permits direct observation of the subject as in the face-to-face meeting.

4th ed. 2006 Dr. Rick Yount
Pilot Study
Select a group of people similar to those who will be involved in the actual study. Use the instrument to gather data from them. Check for any problems the pilot group encountered while completing the form. Ask the group for suggestions. Revise the instrument as needed.
Summary
Survey research gathers specific data from a large group of people who possess that data. We have discussed advantages, disadvantages, and guidelines for using the mailed questionnaire and the personal interview.
Examples
Dr. Margaret Lawson designed her own questionnaire to gather data for her study of selected variables and their relationship to whether or not Life Launch pilot churches (1987-88, n=120) continued offering LIFE courses (MasterLife, Experiencing God, Parenting by Grace, and the like, 1992-93).5 She collected data on which courses were offered, who led the courses (pastor, staff, or lay), how the materials were paid for (participants paid full, part, or none), as well as attendance in Sunday School and Discipleship Training, church membership, number of baptisms, gifts, and initiated ministries. Her survey instrument is located at the end of the chapter. Her procedure for developing the survey form was as follows:
The steps in developing the survey instrument were as follows:6
1. Questions were designed for subjects' responses to reflect information on the factors present in those churches that did, and those that did not, continue to offer LIFE courses. The same two-page questionnaire was sent to all the churches. Drew and Hardman suggest that respondents are more likely to complete a one- or two-page questionnaire.
2. A validation panel of experts drawn from the areas of adult discipleship training, research design, and the field of religious education was asked to rate the relevance and clarity of each question. . . . Following the panel's critique and evaluation, eight surveys were returned. Suggestions were offered by Avery Willis and Clifford Tharpe, and the appropriate revisions and modifications were incorporated.
Dr. Darlene Perez developed her Spanish-language survey to gather information from youth and youth leaders in Puerto Rico concerning Youth Curriculum materials. Here was her procedure:7
The Youth Sunday School Curriculum Questionnaire was designed to obtain data related to the youth curriculum variables identified in the problem statement. The procedures for designing the instrument followed guidelines in . . . Research Design and Statistical Analysis for Christian Ministry.2 . . . The first step . . . consisted of stating the purpose of the study with clear instructions on how to complete the questionnaire. Second, an item pool of questions was developed. The questions were written in an objective, structured and close-ended form. They were designed to obtain information about the curriculum being used by participants, the degree of curriculum satisfaction, the disposition to change curriculum, the preference for a Bible study approach, and the preference for a teaching/learning method. Third, the questionnaire included a section at the end for demographic information. A copy of this questionnaire is provided as appendix H. . . . The questionnaire was submitted to a validation panel of seven experts in the areas of education or curriculum development or youth knowledge. Each panel member considered points of clarification and the validity of each item. The best, most clear, and most valid questions were selected for the survey. . . . A proposed pilot study with youth and youth leaders not included in the research was to
5 Margaret P. Lawson, A Study of the Relationship Between Continuance of LIFE Courses in the LIFE Launch Pilot Churches and Selected Descriptive Factors (Ph.D. dissertation, Southwestern Baptist Theological Seminary, 1994).
6 Ibid., 25-26.
7 Perez, 55-58.
be completed in Puerto Rico. The validation procedure with a pilot group included the following steps:
1. The Sunday School Board provided a list of Baptist and non-Baptist churches in Puerto Rico currently using the Spanish Convention Uniform Series. A non-Baptist, evangelical church (Alianza Cristiana y Misionera, Río Piedras, Puerto Rico) was selected for the pilot study. The questionnaire was submitted during a youth Sunday School class to a group of thirteen youth and three youth leaders. Corrections were made to clarify the instructions on how to complete the questionnaire. Also, the term "youth" (joven) was changed to Intermedios y Pre-jóvenes along with a parenthesis stating the ages twelve to seventeen.
2. After making corrections, it was felt that the instrument needed further validation. A second validation pilot study was performed with a group of thirty youth and youth leaders from the Baptist Convention of Puerto Rico who were meeting at a youth camp during July, 1990. After this validation process, the following changes were made. . . [six changes listed].
3. In order to make the validation process more consistent, a third pilot study was performed with a group of thirty youth and youth leaders from the Puerto Rico Southern Baptist Association, at a youth camp in July 1990. Only a few corrections were made in the section of demographics. . . [two changes listed]. A copy of the validated questionnaire appears as appendix I. [The English-language version is included at the end of the chapter.]
Vocabulary
close-ended question: type of question which provides a set of answers to choose from (a b c d)
demographics: personal data on subjects (gender, ed level, years in ministry)
item pool: a collection of test items from which a subset is drawn for creating an instrument
open-ended question: question which allows subject to answer in his/her own words
rate of return: percentage of mailed questionnaires which are completed and returned
structured question: synonym for close-ended question
unstructured question: synonym for open-ended question
validation panel: judges who analyze the clarity and relevance of questions in an item pool
Study Questions
1. Compare and contrast the advantages and disadvantages of the interview and questionnaire.
2. Define structured or close-ended questions. Give an example.
3. Define unstructured or open-ended questions. Give an example.
4. Discuss the pros and cons of using structured or unstructured questions.
5. Differentiate the handling of demographic questions in the questionnaire and interview.
Please complete the information requested concerning LIFE courses in your church at the time of the LIFE Launch project and the present time. FIRST YEAR refers to the reporting year following the LIFE LAUNCH, October 1987 to September 1988. LAST YEAR refers to the latest reporting year, October 1992 to September 1993.
1. What LIFE courses did you offer in the first year of the LIFE Launch?
MasterLife   MasterBuilder   MasterDesign   Parenting by Grace   None
Other (please specify) __________
2. What LIFE courses have you offered during the last year?
MasterLife   MasterBuilder   MasterDesign   DecisionTime
Parenting by Grace I   Parenting by Grace II   Covenant Marriage   WiseCounsel
Disciple's Prayer Life   Experiencing God
Step by Step Through the Old Testament   Step by Step Through the New Testament
LifeGuide to Discipleship and Doctrine   None
Other (please specify) ______________________________
3. Which staff member began the initial LIFE courses?
Pastor   Associate Pastor   Minister of Education   Other (please specify) __________
4. Did any lay person have a leadership position from the beginning?   Yes   No
5. Has a staff person led LIFE courses in the past year?   Yes   No
6. Has a lay person led LIFE courses in the past year?   Yes   No
(OVER)
7 Lawson, 65-66.
7. Indicate how participants paid for their study materials in the first year:
Participants paid full price   Participants paid some of the cost
Materials were provided free of charge   Other (please specify) __________
8. Indicate how participants paid for their study materials in the past year:
Participants paid full price   Participants paid some of the cost
Materials were provided free of charge   Other (please specify) __________
9. Indicate the total number of participants in all LIFE groups:
_______ FIRST YEAR   ________ LAST YEAR
10. Indicate the average number of participants in individual LIFE groups: _______ FIRST YEAR ________ LAST YEAR
11. Complete the following information about your church during the LIFE Launch year:
_____ Resident Church Membership   _____ Total Baptisms
_____ Average Sunday School Attendance   _____ Average Discipleship Training Attendance
_____ Total Gifts
12. Complete the following information about your church during the past year:
_____ Resident Church Membership   _____ Total Baptisms
_____ Average Sunday School Attendance   _____ Average Discipleship Training Attendance
_____ Total Gifts
13. What specific ministries have been initiated by LIFE course participants?
Please return the completed survey to: Margaret Lawson address address city, state Would you like to receive a summary of the results of the survey? ______________
APPENDIX I8
VALIDATED YOUTH SUNDAY SCHOOL MATERIALS QUESTIONNAIRE (ENGLISH TRANSLATION)
The purpose of this questionnaire is to obtain basic information about the Sunday School youth materials being used in your church and to identify the curriculum preferences of youth and youth leaders.
Instructions: Select the best alternative with a check mark (✓). Choose only one response for each question.
1) Which Sunday School materials are currently being used in your church?
__ 1. El Intérprete (Convention Uniform Series of the Sunday School Board)
__ 2. Enseñanza Bíblica Para Jóvenes (Diálogo y Acción Program of The Spanish Publishing House)
__ 3. Materials designed in your own church.
__ 4. Exploradores y Embajadores (Editorial Vida, Miami, Florida)
__ 5. Other, specify: ____________________________________________
2)
How satisfied are you with the Youth Sunday School materials used in your church?
__ 3. Dissatisfied (I do not like it)
__ 4. Very dissatisfied (I do not like it at all)
3)
__ 1. Yes
4)
If you were going to change Youth Sunday School materials, which Bible study approach would you prefer?
__ 1. I would like to study the Bible systematically, book by book, covering the whole Bible within a certain period of time.
__ 2. I would like to study the Bible by themes that relate to daily life, such as the family, friendships, the community, and others.
__ 3. I would like to study the Bible by doctrinal themes, such as the doctrine of God, Jesus, the Holy Spirit, Church, Bible, prayer, and others.
__ 4. I would like to have Bible studies about discipleship, Christian growth and formation.
8 Perez, 108-109.
5. If you were going to change the Youth Sunday School materials, which teaching/learning methods would you prefer?
__ 1. Conference -- The teacher would present and explain the Bible passage.
__ 2. Questions and answers -- The teacher would use questions to promote group participation.
__ 3. Small group work -- The class would be divided into small groups. Each group is assigned to work on a task and will report its findings to the whole class.
__ 4. Individual tasks -- The teacher would assign questions or tasks to each student and he/she would work independently.
__ 5. Other, specify: _________________________________________________
Please complete the following information:
Position: ___ Youth Pastor ___ Youth Minister ___ Minister of Christian Education
___ Sunday School Director ___ Youth teacher ___ Other
Sex: ___ Male ___ Female
Age: ___
Denomination: ___ Southern Baptist ___ American Baptist
___ Other, specify: ___________________________________
Church name: ____________________________________________
Have you completed this questionnaire before? ____ yes _____ no
Comments/suggestions:
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
Chapter 11
Writing Tests
11
Developing Tests
Preliminary Considerations Objective Test Items Essay Test Items Item Analysis
A test is an instrument which measures a subject's knowledge, understanding, or skill in a given content area, and produces a ratio score reflecting that measure. If the focus of a study is "testing subjects" on some variable (Bible knowledge, comprehension of various translations, current events), an appropriate test must be found, or one must be developed. This chapter introduces you to principles of developing tests.
Preliminary considerations
You may be able to use existing tests for your study. Let's say the nature of your study is to identify a relationship between job satisfaction and interpersonal dynamics among staff members. You may be able to find an existing test which will measure job satisfaction. Check the Mental Measurements Yearbook, or Tests in Print, or other such resources for published tests in your area of interest. Tests can also be found in research articles being gathered for the Related Literature section of your proposal. Study the validity and reliability scores on the test, the population(s) the test was designed for, and the conditions of test administration. If these factors fit your study, you're in business! Describe these characteristics in the Instrument section of your proposal.

You may need, however, to develop your own test, since there are many areas in the field of Christian Education that do not yet have tests. This chapter focuses on the procedure to use in developing such a test for use in a larger dissertation context. Good tests gather good data. Good tests build good attitudes. Good tests can even produce a positive learning experience. The principles discussed here will help you in this task.
. . . you use, the length of the test and other such variables depend a great deal on who your subjects are.
Writing items
Avoid ambiguous or meaningless test items. Use good grammar. Avoid rambling or confusing sentence structure. Use items that have a definitely correct answer. Avoid obscure language and big words, unless you are specifically testing for language usage. Be careful not to give the subject irrelevant clues to the right response. Using "a(n)" rather than "a" or "an" is an example of this. In short, a test should not present any barrier to subjects apart from demonstrating mastery of the test content. Otherwise, scores reflect more noise than true measure.
Objective Tests
True-False Multiple Choice Matching
An objective test is a test made up of close-ended questions. Objective tests have several advantages over essay tests. Asking 100 objective questions over a given content field provides a much better sampling of examinee knowledge and understanding than asking three or four essay questions. With objective tests, grading is easier and the scores are a more reliable measure of what the examinee knows. There are four common types of objective questions. These are the constant alternative (true-false) question; the changing alternative (multiple choice) question; the supply (or fill-in-the-blank) question; and the matching question.1
Advantages
The advantages of the true-false test item are efficiency and potency. It is efficient in that a large number of items can be answered in a short period of time. Scoring is fast and easy. It is potent because it can, in a direct way, reveal common misconceptions and fallacies.
1 The material in this chapter is a synthesis of principles gleaned from Nunnally, Chapter 6: Test Items, 153-196; and Payne, Chapter 5: Constructing Short Answer Achievement Test Items, 95-136. These are excellent resources for those wanting to improve their test-writing ability. Another excellent (more recent) source is Tom Kubiszyn and Gary Borich, Educational Testing and Measurement: Classroom Application and Practice, 2nd ed. (Glenview, IL: Scott, Foresman and Company, 1987). Also more recent material can be found in my own Created to Learn (1996), Chapter 14 and Called to Teach (1999), Chapter 9, both from Broadman and Holman.
Disadvantages
Good true-false items are hard to write. An item that makes sense to the writer may confuse even well-informed subjects. Statements require careful wording, evaluation and revision.

Secondly, true-false items encourage guessing. With only two alternatives, an examinee who knows absolutely nothing about the subject still earns around 50% of the test score over the long run by pure chance. This is a lot of noise in the test scores.

Thirdly, constant alternative items tend toward response sets. A response set is a repetitious pattern of answers, like the following 18-item test:

T T F   T T F   T T F   T T F   T T F   T T F

Notice that the pattern T T F repeats itself through the test. Test writers can produce these response sets without being aware of it. Subjects pick up these irrelevant clues, and score higher than their knowledge allows. The objective is not to ensure high scores, but to actually measure what subjects know and understand.
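A response set like the one above can also be caught mechanically when reviewing an answer key. The helper below is a hypothetical illustration (not from the text): it checks whether a key is simply a short block of answers repeated, assuming the key is given as a list of "T"/"F" strings.

```python
def repeating_period(key, max_period=5):
    """Return the shortest block length p (up to max_period) such that the
    key is an exact repetition of its first p answers, or None if the key
    shows no such response set."""
    n = len(key)
    for p in range(1, min(max_period, n // 2) + 1):
        if all(key[i] == key[i % p] for i in range(n)):
            return p
    return None

# The 18-item key from the text, built from the repeating block T T F:
key = list("TTF" * 6)
print(repeating_period(key))   # 3
```

A key that returns a small period should be reshuffled before the test is administered.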
Determiners Answers Negatives Language Quotes Item length Sentences False Items
Absolute answer
Base true-false items on statements that are absolutely true or absolutely false. Avoid statements that are true under some conditions, but not others, unless the conditions are specifically stated. Well-informed subjects have greater difficulty answering ambiguous questions correctly, because they have more information to process in trying to understand the item.
Advantages
The multiple choice question, with its multiple responses, can be written with less ambiguity and greater structure than the true-false question. Guessing is reduced since the probability of guessing the correct answer is 1 in 4 (25%) instead of 1 in 2 (50%) for true-false items. Multiple choice items can demand more subtle discrimination than other forms of objective questions. Lastly, one can write multiple choice items which test at higher levels of learning, such as application and analysis, than other question types.
Disadvantages
Good multiple choice questions are difficult to write. Effective distractors -- plausible wrong answers -- are hard to create, particularly if you are providing a 5th or 6th alternative response. Secondly, multiple choice tests are less efficient because a subject can process fewer multiple choice items in a given time than other types.
Chapter 11
Writing Tests
Single Problem Repeats Negative Stems Responses Similar Responses Exclusive Responses Plausible Responses Random Irrelevance Extraneous None of the Above
Supply Items
Supply items, sometimes called recall or fill in the blank items, present a statement with one or more blanks. The task of the subject is to fill in the blank(s) with the most appropriate terms in order to correctly complete the statement.
Advantages
Supply items are relatively easy to construct. Second, they are efficient in that a large number of statements can be processed in a given length of time. Third, remembering a term or phrase is more difficult than recognizing it in a list or response set. Therefore, supply items discriminate better between subjects' knowledge of important definitions and concepts.
Disadvantages
Supply items are notorious for being ambiguous. It is difficult to write a supply item that is clear and plainly stated. Supply items are also unclear in the way they're graded, because usually more than one word will adequately fill the blank. Grading can be arbitrary and unfair, depending on how synonyms are handled.
Limit blanks
Use only one or two blanks in a supply item. The greater the number of blanks, the greater the item ambiguity and the more difficult grading is.
Matching Items
A matching item presents subjects with two or three columns of entries which relate to each other. An example of a matching question is one which provides a numbered list of authors with a parallel lettered list of the books they wrote. Subjects match the books to their authors by writing the letter of each book in the space next to the numbered author. The list of authors is the item list and the list of books is the response option list.
Advantages
The matching item can test a large amount of material simply and efficiently. Response pairs can be drawn from various texts, class notes, and additional readings to form a summary of facts. Grading is easy.
Disadvantages
A good matching item is difficult to construct. As the number of response pairs in a given item increases, more mental gymnastics are required to answer it. Matching items can present little more than a confusing array of trivial terms and sentence fragments.
Specific instructions
Be sure to clearly instruct subjects on how the matching is to be done. Show an example, if necessary. This eliminates test-wiseness as an extraneous variable in the scoring.
Essay Tests
Essay tests are constructed from unstructured or open-ended questions which require subjects to write out a response.
Advantages
Essay test items allow much greater flexibility and freedom in answering. Grammar, structure, and content of the answer are left to subjects. Essay items permit testing at higher levels of learning than most types of objective questions. Finally, essay questions permit a greater range of answers than objective items.
Disadvantages
The greatest disadvantage of essay items is that they are difficult to score consistently. The answers are more ambiguous and subjective than objective responses. The reliability of scores is lower than those produced by objective tests over the same content because of the variability of response. Essay items test a smaller sample of material because of the amount of time required to analyze and understand the question, develop the answer, and write it out in complete sentences. They are less efficient than objective types.
Item analysis
Item analysis is a procedure for determining which items in an objective test discriminate between informed and uninformed subjects. If a test's purpose is to separate subjects along a scale of content mastery (and most tests have this purpose!), then it is important that this separation be done fairly. Every item in a test should contribute to this separation process. Those that do not should be revised or eliminated.
4th ed. 2006 Dr. Rick Yount
A popular method of item analysis is a procedure called the Discrimination Index. After administering and grading the exam, the procedure is applied as follows:
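A common formulation of the Discrimination Index (not necessarily the author's exact steps) works like this: rank examinees by total test score, take the upper and lower groups (often 27% each), and compare the proportion in each group who answered a given item correctly. A sketch, with illustrative names and data:

```python
def discrimination_index(item_correct, total_scores, fraction=0.27):
    """D = p(upper group correct) - p(lower group correct) for one item.

    item_correct: 1/0 per examinee for this item
    total_scores: total test score per examinee, in the same order
    fraction: share of examinees in each extreme group (27% is common)
    D near +1 means the item separates high and low scorers well;
    D near 0 means it does not discriminate; negative D flags a faulty item."""
    n = len(total_scores)
    k = max(1, int(n * fraction))                       # size of each group
    order = sorted(range(n), key=lambda i: total_scores[i])
    lower, upper = order[:k], order[-k:]
    p_upper = sum(item_correct[i] for i in upper) / k
    p_lower = sum(item_correct[i] for i in lower) / k
    return p_upper - p_lower

# Ten examinees: this item is answered correctly by the top scorers
# and missed by the bottom scorers.
scores = [90, 85, 80, 75, 70, 65, 60, 55, 50, 45]
item = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]
print(discrimination_index(item, scores))   # 1.0
```

Items with a low or negative D are the ones the text says should be revised or eliminated.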
Summary
In this chapter we have looked at procedures for developing various types of tests. We have considered four kinds of objective items: true-false, multiple choice, supply and matching. We have discussed the use of essay questions. Finally, we described item analysis, which allows test developers to determine whether objective test items properly discriminate between informed and uninformed subjects.
Examples
In addition to the checklist in Chapter nine, Dr. Mark Cook also developed an objective test
. . .to measure the lesson objectives at three cognitive levels: knowledge, comprehension, and application. The process of development began by creating a thirty-item multiple-choice test to be used in the field test of the study (appendix D). The test was examined by three selected specialists. The specialists that were asked for validation of the test were as follows: [specialists listed]. These professors were provided complete lesson plans to use in evaluation.3
A copy of the test is located at the end of the chapter.

Dr. Brad Waggoner focused his entire 1991 dissertation on developing a standardized test to measure the discipleship base -- defined as 'that portion of a given church's membership that meets the criteria of a disciple'4 -- of local Southern Baptist churches. He worked in conjunction with the International Mission Board of the Southern Baptist Convention to produce a valid and reliable instrument. A final instrument of 136 items5 produced a Cronbach's alpha reliability coefficient of 0.9618.6 While we certainly cannot replicate the fifty-eight pages7 of his development procedure here, we will outline the procedure and focus on key aspects of test development.
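The Cronbach's alpha coefficient reported for Waggoner's instrument is computed from item-level variances. A minimal sketch of the standard formula, alpha = (k / (k - 1)) * (1 - sum of item variances / variance of totals), on invented data (the numbers below are illustrative, not Waggoner's):

```python
def cronbach_alpha(items):
    """Cronbach's alpha for internal-consistency reliability.

    items: one list of scores per item, aligned across the same respondents."""
    k = len(items)                          # number of items
    n = len(items[0])                       # number of respondents

    def pvar(xs):                           # population variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    totals = [sum(item[j] for item in items) for j in range(n)]
    return (k / (k - 1)) * (1 - sum(pvar(item) for item in items) / pvar(totals))

# Two perfectly parallel items yield alpha = 1.0.
print(cronbach_alpha([[1, 2, 3, 4], [1, 2, 3, 4]]))
```

Values near 1.0, like Waggoner's 0.9618, indicate that the items consistently measure the same underlying trait.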
Phase One: Identification of Functional Characteristics8
Attitudes: A disciple is one who:
- Possesses a desire and willingness to learn
- Has conviction regarding the necessity of living in accordance with biblical principles and guidelines
- Evidences a repentant attitude when a violation of Scripture occurs
- Possesses a willingness to forfeit personal desires and conveniences, if necessary, in order to seek the interests of others
- Possesses and demonstrates the character trait of humility
- Possesses and demonstrates the character trait of integrity
- Is willing to be accountable to others
Conduct/Behavior: A disciple is one who:
- Manifests a lifestyle of utilizing time and talents for God's purposes
- Possesses a lifestyle depicted by intentional compliance with the moral teachings of the Bible. . .
- Maintains appropriate behavior toward those of the opposite sex
- Actively seeks to promote social justice and righteousness in society as well as toward individuals
Relational/Social: A disciple is one who:
- Values and accepts himself as created in the image of God
- Has an awareness of the reality and presence of God through the ministry of the Holy Spirit
- Experiences trust in God in times of adversity as well as in times of prosperity
- Seeks to commune with and learn about God through the means of meditation upon Scripture and prayer
- Is consistently involved in fellowship with other believers in the context of a local church
- Applies oneself to building meaningful relationships with other believers
- Maintains a forgiving spirit when wronged
- Confesses or seeks forgiveness when guilty of an offense
3 Ibid., 209. 8 Ibid., 118.
Ministry/Skills: A disciple is one who:
- Publicly identifies with Christ and the Church when provided an opportunity
- Seeks and takes advantage of opportunities to share the Gospel with others
- Is involved in ministering to other believers
- Seeks the good of all men with a willingness to meet practical social needs such as food, clothing, and the like
Doctrine/Beliefs:
- Eternal security
- Salvation
- The Holy Spirit (the nature and role of)
- The Eternal State (the literal existence of heaven and hell)
- Scripture (the authority and reliability of)
Phase Two: Testing of Content Validity9
The functional characteristics, categorized according to the five domains described above, were placed on a 9-point Likert rating scale, a value of "1" being "not valid," and a value of "9" being "very valid," with gradations of validity in between (appendix B).57 The purpose of the rating scale was for a panel of experts to determine the degree to which each characteristic was a valid and measurable function of a disciple.58 A list of names was compiled. . . the panel was to consist of five experts and two alternates representing the academic, denominational, and local church levels (appendix C).59 A letter was constructed that explained the nature and purpose of the research and requested their participation on the panel (appendix D). . . . When the rating scales were returned, the mean scores were calculated for the characteristics (appendix F).
Phase Three: Revision of Characteristics10
Revisions to the list of characteristics were made based on the panel's scores, comments, and additions. It was predetermined that any item receiving a mean score of less than 7.0 would be considered for deletion.
Phase Four: Item Writing11
- Review Related Measures
- Construction of Questions
- The Size of the Item Pool
- The Issue of Relevance
- The Issue of Clarity
- The Issue of Simplicity
- The Issue of Single Meaning
- The Issue of Double Negatives
- The Issue of Question Length
- The Issue of Question Variety
- The Issue of Response Categories
- The Issue of Assuming
- The Issue of "Leading" or "Loaded" Questions
- The Issue of Grammar and Tone
Phase Five: Testing Content Validity of Questions12
- Selection of a Panel of Experts
- Development of a Validation Instrument
- Follow-Up of Validation Panel
9 Ibid., 81-82. 10 Ibid., 82. 11 Ibid., 83-91.
Calculation of Validity: ". . . questions receiving mean scores of less than 6.0 would be considered for deletion."
Phase Six: Questionnaire Design13
- Question Order and Flow
- Questionnaire Length
- Questionnaire Design and Layout
- Size and Color of Paper
- Layout
- Instructions
- Expression of Gratitude
- Expression of Confidentiality
- Identification of the Sponsor
Phase Seven: Refining the Pilot Test14
The process of refining the pilot test consisted of a small number of individuals evaluating the clarity of questions, word meanings, instructions, and procedure for completing the instrument. . . . Revisions were made to the instrument based upon the results. Subsequently, over 100 questionnaires were printed and put into booklet form (appendix M).
Phase Eight: Pilot Test #115
- Selection of Sample Group [n=50 church members in two groups]
- Establish Time and Place of Pilot Test
- Letter of Invitation Constructed and Mailed
- Administering the Instrument
Phase Nine: Data Analysis16 [This is part of Chapter Four of the dissertation.]
Phase Ten: Revision of the Instrument17
Phase Eleven: Second Pilot Test18
- Selection of [Three] Churches
- Procedure for Administering the Pilot Test
- Follow-Up Procedure
Phase Twelve: Data Analysis of the Second Pilot Test19 [This is part of Chapter Four of the dissertation.]
As mentioned in Chapter One, this instrument -- with further revisions by Dr. Waggoner in conjunction with the IMB and LifeWay Christian Resources (SBC) -- is being integrated into revised MasterLife materials produced by LifeWay.
Vocabulary
changing alternative: synonym for a multiple choice test item
constant alternative: synonym for a true-false test item
discrimination index: procedure used to determine quality of test items
distractors: multiple choice options which appear plausible but are incorrect
multiple choice question: test item with one stem and 4 or 5 plausible options
response set: predictable pattern in objective answers (e.g. TTTF TTTF TTTF)
specific determiners: terms like "never" or "sometimes" that give clues to the correct answer
supply item: synonym for fill-in-the-blank questions

15 Ibid., 99-103. 16 Ibid., 103.
Study Questions
1. Explain the four preliminary guidelines given for writing tests in your own words.
2. Explain why objective test items produce more reliable scores than essay test items.
3. Write out 3 TF, 3 MC, 2 supply and 2 essay questions relating to this material. Set them aside for a few days. Then go back and evaluate each of your questions according to the criteria given for each kind of question.
Sample Test
APPENDIX B3
PRE-SESSION TEST
Student Number _________ (see your name tag)
Circle the letter of the phrase that best completes the sentence.
1. The phrase "priesthood of believers" is found in the Bible (a) in the New Testament, (b) in the Old Testament, (c) in both testaments, (d) in neither testament.
2. The doctrine of the priesthood of the believer teaches that priests should (a) be representative of all people, (b) represent God to other persons, (c) be ordained by a church, (d) remain completely separated from the world.
3. During the Reformation, the priesthood of all believers particularly emphasized (a) infant baptism, (b) personal witnessing, (c) direct access to God, (d) wrongs of the Catholic church.
4. The concept of priest in the Old Testament is most often associated with the priesthood of (a) all Israelites, (b) some Israelites, (c) no Israelites, (d) the special prophets of Israel.
5. The Old Testament covenant was designed by God (a) to bless Israel as His people only, (b) to assure that Israel worshipped only God, (c) to help Israel conquer their world, (d) to make Israel a blessing to all other nations.
6. Christians are referred to as a holy priesthood. This holiness is best reflected by Christians when they are (a) motivated by love, (b) pure in their thoughts, (c) serving God at church, (d) separated from the world. . . .
Chapter 12
Developing Scales
12
Developing Scales
The Likert Scale The Thurstone Scale The Q-Sort Scale The Semantic Differential
Our emphasis from the beginning of the text has been on the objective measurement of research variables. Sometimes we are most interested in studying subjective variables: attitudes, feelings, personal opinions, or word usage. How can we measure subjective variables objectively? The answer is an instrument called a scale.1 Dr. Martha Bergen used an adaptation of an existing scale2 to measure the attitude of seminary professors toward using computers in seminary education.
Respondents [110 seminary professors serving at Southwestern Baptist Theological Seminary in 1988] were asked to read each question and decide to what extent they agreed or disagreed with each question. They were instructed to circle the appropriate number after each of the items. The rating scale was set up in a logical pattern using the numbers "1," "2," "3," and "4" to correspond with "strongly disagree," "disagree," "agree," and "strongly agree," respectively. Responses [from the 53 items] were totaled and evaluated to reveal which attitude/s was/were most prominent. . . . A validation panel consisting of five experts in the areas of education, religious education, and computers was asked to rate the relevance and clarity of each question. Proper revisions and modifications were made as deemed necessary from the panel's critique and evaluation. For the purpose of establishing reliability, a stratified random sample of ten seminary professors -- representative of the intended population -- was selected to respond to the questionnaire. The method of split-half correlation was used to determine the coefficient of internal consistency. . . .3
The result of the modifications was an instrument which measured the strength of support (an attitude) of seminary professors for the use of computers in seminary education in 1988. The internal consistency coefficient, after applying the Spearman-Brown Prophecy Formula, was +0.75, a strong positive value (see Chapter 22).

A scale is an instrument which measures subjective variables. In this chapter we look at four major types of scales: the Likert (LIE-kurt), the Thurstone, the Q-sort, and the Semantic Differential. Each of these important scale types provides the means to gather subjective data objectively.

1 See Babbie, "Chapter 15: Indexes, Scales and Typologies," pp. 366-389; Nunnally, "Chapter 15: Attitudes and Interests," pp. 441-467; and Payne, "Chapter 8: The Development of Self-Report Affective Items and Inventories," pp. 164-200. An excellent paperback dealing with this subject is Daniel J. Mueller, Measuring Social Attitudes: A Handbook for Researchers and Practitioners (New York: Teachers College Press, 1986).
2 Bergen describes her instrument as an adaptation of a 1986 dissertation [instrument] from North Texas State University. See Mitchell Drake Weir, "Attitudes and Perceptions of Community College Educators toward the Implementation of Computers for Administrative and Instructional Purposes" (Ph.D. dissertation, North Texas State University, 1987), pp. 129-35. In May 1988 North Texas State University became the University of North Texas. Bergen, 48.
3 Ibid., 48-49. See also 57-62 for more detail.
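The split-half reliability check described above can be sketched in a few lines of Python. The respondent data here are invented for illustration; only the arithmetic follows the procedure in the text: correlate the two half-scores, then apply the Spearman-Brown prophecy formula to estimate full-test reliability.

```python
from math import sqrt
from statistics import mean

# Hypothetical data: 10 respondents x 6 Likert items, each scored 1-4.
responses = [
    [4, 3, 4, 4, 3, 4],
    [2, 2, 1, 2, 2, 1],
    [3, 3, 3, 4, 3, 3],
    [1, 2, 1, 1, 2, 2],
    [4, 4, 3, 4, 4, 3],
    [2, 1, 2, 2, 1, 2],
    [3, 4, 4, 3, 4, 4],
    [1, 1, 2, 1, 1, 1],
    [3, 2, 3, 3, 2, 3],
    [4, 3, 4, 4, 4, 4],
]

def pearson(x, y):
    """Pearson product-moment correlation of two score lists."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

# Split-half: odd-numbered items vs. even-numbered items.
half1 = [sum(row[0::2]) for row in responses]
half2 = [sum(row[1::2]) for row in responses]
r_half = pearson(half1, half2)

# Spearman-Brown prophecy formula steps the half-test r up to full length.
r_full = 2 * r_half / (1 + r_half)
print(round(r_half, 3), round(r_full, 3))
```

Note that the prophecy formula always raises a positive split-half correlation, which is why it is applied after splitting the test in two.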
Write statements
Next, we will write statements that reflect positive and negative aspects of these areas. We've defined "positive" to mean that which agrees with my position, and "negative" to mean that which disagrees with my position. The statements, even though reflecting subjective variables, should be objective. That is, statements must not be systematically biased toward one position or the other. Students who really want merely to get a degree should have no trouble scoring low on the scale. They should tend to agree with statements reflecting "degree" and tend to disagree with statements reflecting "learning." In the same way, students who really want to learn should tend to agree with "learning" statements, and tend to disagree with "degree" statements.

2 Mueller, "Chapter 2: Likert Attitude Scaling" and "Chapter 3: Likert Scale Construction: A Case Study," 8-33.
Positive examples
Positive statements should be objective statements which are acceptable to those having the attitude, and just as unacceptable to those not having it. The following reflect these characteristics in regard to our attitude scale:
I generally enjoy homework assignments and sometimes do more than the assignment requires.
I frequently use library resources to go beyond the required reading.
I believe a degree is empty unless it reflects my best efforts of scholarship.
A late assignment, thoughtfully done, is more important than the loss in grade average.
Negative examples
Negative statements should be objective statements which are acceptable to those not having the attitude, and just as unacceptable to those having it. These statements coincide with the positive examples above.
Homework assignments are designed to meet course requirements. It is impractical in time and energy to do more than is required.
It is better to master the required reading than to dilute one's thinking with other authors.
A degree is a credential for ministry and reflects, in itself, none of the extremes of scholarship some try to ascribe to it.
It is better to turn in an assignment on time than to be docked for lateness to make it better.
Rank
Rank order the evaluated items on clarity and potency. Choose an equal number of positive and negative statements.
3 Mueller states, "Five categories are fairly standard.... Some scale constructors use seven categories, and some prefer four or six response categories (with no middle category). All of these options seem to work satisfactorily. It should be noted in this regard that reducing the number of response categories reduces the spreading out of scores (reduces variance) and thus tends to reduce reliability. Increasing the number of response categories adds variance. As the number of categories is increased, a point is reached at which respondents can no longer reliably distinguish psychologically between adjacent categories [i.e., what's the difference between a 10 and an 11 on a 12-point scale? WRY]. Increasing the number of categories beyond this point simply adds random (error) variance to the score distribution" (pp. 12-13).
Write instructions
Write instructions which clearly explain how to select responses on the form. (See the finished example at the end of the chapter.) There are other ways to indicate the intensity of response. Dr. Don Mattingly (Ed.D., 1984) developed a scale for his dissertation which used the categories "Yes!  Yes  No  No!" to indicate how strongly his subjects agreed or disagreed with statements concerning recreation ministry.
[Figure: a completed sample Likert form. Red notations beside each item (shown here in parentheses) mark the response the subject circled and the points it earns, e.g., (A) = 3 pts on a positive item, (SD) = 4 pts on a negative item, (D) = 2 pts, (A) = 2 pts on a negative item.]

Red notations are not included on the form, but are included here to demonstrate the scoring of a completed form. This subject selects items as marked, which are scored according to statement type. This subject scored 23 points on this scale (32 possible). Very positive attitude!
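The scoring rule the form illustrates, add the points for each response, reverse-scoring negative statements, can be sketched as a short Python function. The answers and item keys below are hypothetical; the point values (SA=4 down to SD=1, flipped for negative items) follow the chapter's scheme.

```python
# Points for a positive statement; negative statements are reverse-scored.
RESPONSE_POINTS = {"SA": 4, "A": 3, "D": 2, "SD": 1}

def likert_score(answers, item_types):
    """Sum item points, reverse-scoring items keyed as negative."""
    total = 0
    for answer, item_type in zip(answers, item_types):
        points = RESPONSE_POINTS[answer]
        if item_type == "neg":
            points = 5 - points   # SA=1, A=2, D=3, SD=4 on negative items
        total += points
    return total

# Hypothetical completed 8-item form (32 points possible).
answers    = ["SA", "SD", "D",  "A",  "SA", "SD", "D",  "SA"]
item_types = ["pos", "neg", "neg", "pos", "pos", "neg", "neg", "pos"]
print(likert_score(answers, item_types))  # 29 of 32: a very positive attitude
```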
Scoring
Compute the median (or mean) of the weights of the statements marked by the subject. This is the subject's score, which reflects attitude on the theme.

5 Mueller, p. 37.
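A minimal sketch of Thurstone scoring in Python. The statement weights below are invented for illustration; the scoring rule, take the median (or mean) of the weights of the statements the subject marked, is the one described above.

```python
from statistics import mean, median

# Hypothetical Thurstone items: statement id -> scale value (weight).
weights = {1: 0.5, 2: 2.2, 3: 4.5, 4: 6.9, 5: 9.3, 6: 10.9}

def thurstone_score(endorsed, use_mean=False):
    """Score = median (or mean) of the weights of the endorsed statements."""
    vals = [weights[i] for i in endorsed]
    return mean(vals) if use_mean else median(vals)

# A subject who agrees with statements 1, 2, and 3:
print(thurstone_score([1, 2, 3]))  # median of 0.5, 2.2, 4.5 -> 2.2
```

The median is usually preferred because one stray endorsement of an extreme statement pulls a mean far more than a median.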
Q-Methodology
It is difficult to rank order more than ten statements. But rank ordering attitudinal statements is a good way to gather subjective data on a given sample. The Q-sort is a procedure for rank ordering a large number of statements. Rankings of statements by two or more groups can then be compared.

One version of the Q-sort uses a physical set of boxes, numbered 1 through 11. (This is the same arrangement as that described for weighting Thurstone items.) The procedure is usually applied when the number of statements to be ranked is greater than 40. The subject looks through a number of statements written on cards. Each card contains one statement. The first time through, the subject selects the statement he agrees with the most. That item goes into box 1. The subject then goes through the cards a second time and selects the statement he agrees with the least. This card is placed in box 11. The next time through the cards, the subject selects the two cards he agrees with the most, and places them into box 2. Then he chooses the two cards he agrees with least and places them in box 10. Then 4 cards go into box 3 and 4 cards into box 9, and so forth, until he is left with the middle box (#6). All the remaining statements are placed in it. The researcher then assigns point values for each statement, 1-11, based upon the box into which it was placed. After all subjects have placed the statements, averages are computed. Rank order statements for the group on the basis of their average values.
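The averaging-and-ranking step at the end of the Q-sort can be sketched as follows. The subjects, statement labels, and box placements are hypothetical; the logic, average the box number each statement received, then rank statements from lowest (most agreed-with) to highest average, follows the procedure above.

```python
from collections import defaultdict

# Hypothetical Q-sort results: subject -> {statement label: box number (1-11)}.
placements = {
    "subject1": {"A": 1, "B": 2, "C": 6, "D": 10, "E": 11},
    "subject2": {"A": 2, "B": 1, "C": 6, "D": 11, "E": 10},
    "subject3": {"A": 1, "B": 3, "C": 7, "D": 9,  "E": 11},
}

# Collect every box value each statement received across subjects.
boxes_by_statement = defaultdict(list)
for boxes in placements.values():
    for statement, box in boxes.items():
        boxes_by_statement[statement].append(box)

# Average box value per statement.
averages = {s: sum(v) / len(v) for s, v in boxes_by_statement.items()}

# Rank from most agreed-with (lowest average box) to least agreed-with.
ranking = sorted(averages, key=averages.get)
print(ranking)
```

The same averages computed separately for two groups of subjects would let the researcher compare how the groups rank the statements.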
Semantic Differential
The semantic differential provides information on differences (differential) in word usage (semantics) in subjects. Osgood, Suci, and Tannenbaum wrote the classic work on using the semantic differential, entitled The Measurement of Meaning.1 The book is a detailed analysis of this powerful technique. We simply introduce the procedure here.

Osgood and his colleagues isolated three major dimensions of word meanings through the use of factor analysis. These dimensions are evaluative (good or bad), potency (strong or weak) and activity (fast or slow). Their book contains hundreds of adjective pairs relating to these three dimensions.

A subject is presented a sheet of paper with a single word or term at the top. Below this word are a number of adjectival pairs, separated by seven blanks. For example, the meanings associated with the term "my church" might be formatted like this:

My Church
valuable  __ : __ : __ : __ : __ : __ : __  worthless
clean     __ : __ : __ : __ : __ : __ : __  dirty
bad       __ : __ : __ : __ : __ : __ : __  good
unfair    __ : __ : __ : __ : __ : __ : __  fair
large     __ : __ : __ : __ : __ : __ : __  small
strong    __ : __ : __ : __ : __ : __ : __  weak
deep      __ : __ : __ : __ : __ : __ : __  shallow
fast      __ : __ : __ : __ : __ : __ : __  slow
active    __ : __ : __ : __ : __ : __ : __  passive
hot       __ : __ : __ : __ : __ : __ : __  cold
          (1)  (2)  (3)  (4)  (5)  (6)  (7)

The first four adjective pairs measure the evaluative dimension; the next three measure potency; and the last three measure activity. The numbers shown above are not printed on the instrument, but are shown here to help clarify the scoring procedure. Pairs which are reversed should be scored in reverse, so that positive is always (1) and negative (7) regardless of which side of the scale they appear. Subjects check one blank between each pair indicating their opinion of the term on this scale. Blanks are scored 1-7, providing a numerical score for the meaning of the term in each dimension. Groups of subjects can then be compared on the three dimensions of meaning for any commonly used word. (Note: the numbering scale 1-7 is true only if the positive term is on the left; otherwise the scale is labelled 7-1.) Results can be plotted in three dimensions to provide a picture of semantic differences between two or more groups of subjects.

1 Charles E. Osgood, George J. Suci, and Percy H. Tannenbaum, The Measurement of Meaning (Urbana: University of Illinois Press, 1957).
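Scoring the "My Church" form, including the reverse-scored pairs, can be sketched in Python. The marks below are a hypothetical completed form; the scoring rule (blanks counted 1-7 from the left, reversed pairs rescored so 1 is always the positive pole, then averaged within each dimension) follows the description above.

```python
# The ten pairs of the sample instrument:
# (left adjective, right adjective, dimension, reversed?)
# A pair is "reversed" when its negative pole appears on the left.
pairs = [
    ("valuable", "worthless", "evaluative", False),
    ("clean",    "dirty",     "evaluative", False),
    ("bad",      "good",      "evaluative", True),
    ("unfair",   "fair",      "evaluative", True),
    ("large",    "small",     "potency",    False),
    ("strong",   "weak",      "potency",    False),
    ("deep",     "shallow",   "potency",    False),
    ("fast",     "slow",      "activity",   False),
    ("active",   "passive",   "activity",   False),
    ("hot",      "cold",      "activity",   False),
]

def dimension_scores(marks):
    """marks[i] is the blank checked (1-7, counted from the left).
    Reversed pairs are rescored (8 - mark) so 1 is always positive."""
    sums, counts = {}, {}
    for (_, _, dim, is_reversed), mark in zip(pairs, marks):
        score = 8 - mark if is_reversed else mark
        sums[dim] = sums.get(dim, 0) + score
        counts[dim] = counts.get(dim, 0) + 1
    return {dim: sums[dim] / counts[dim] for dim in sums}

# Hypothetical subject: one mark per pair, in the order listed above.
marks = [1, 2, 7, 6, 2, 1, 3, 4, 2, 3]
print(dimension_scores(marks))
```

Averaging the dimension scores across a whole group gives the three coordinates for plotting that group's meaning of the term.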
Delphi Technique

Pairs of statements are created for each major concern. Randomly select an equal number of positive and negative statements for inclusion in the Delphi instrument. Construct an instrument in which statements are randomly listed. Associate each with a Likert-type response: "Strongly Agree" . . . "Strongly Disagree." Duplicate the instrument and send it to all youth teachers in Tarrant Association. Each teacher will read the statements and mark his or her degree of agreement (or disagreement) with each statement. Completed forms will be returned to the researcher by means of self-addressed and stamped envelopes.

Score forms just like a Likert scale. Scores for each statement produce a mean for the entire group. Means (and their associated statements) will then be ranked. From this ranking, the researcher can determine how the group responded to the "major concerns" submitted by individuals earlier. These will either be reinforced by agreement by the entire group (major concerns, indeed!), or they will be identified as isolated concerns not shared by the group. The Delphi Technique is a powerful way to allow a group of subjects to create their own attitude statements, and then measure the strength (or lack) of support by the whole group for the statements generated by the process.1

1 Procedure described by Dr. John Curry, University of North Texas, EDER 601, Fall 1983.
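The ranking step of the Delphi round can be sketched as follows. The statement labels and responses are hypothetical; the logic, a mean agreement score per statement across the whole group, then a ranking of the means, follows the procedure above.

```python
# Hypothetical Delphi round: rows are teachers, columns are statements,
# cells are Likert points (1 = strongly disagree ... 4 = strongly agree).
statements = ["discipline", "curriculum", "outreach", "parent support"]
responses = [
    [4, 2, 3, 1],
    [3, 2, 4, 2],
    [4, 1, 3, 1],
    [4, 2, 4, 2],
]

# Mean agreement for each statement across the whole group.
means = {s: sum(row[i] for row in responses) / len(responses)
         for i, s in enumerate(statements)}

# Rank from most to least supported; high means mark shared "major concerns."
ranked = sorted(means, key=means.get, reverse=True)
print(ranked)
```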
Summary
In this chapter we have introduced ways researchers measure attitudes. We have emphasized the Likert and Thurstone scales, the Q-Sort, and the Semantic Differential. These are but a sampling of procedures available to you to measure the subjective characteristics of groups.
Vocabulary
Evaluative: A scale in the semantic differential which measures good-bad
Likert scale: Attitude scale which uses + and - equally weighted statements
Potency: A scale in the semantic differential which measures strong-weak
Q-sort: Method for rank ordering a large number of attitudinal statements
Activity: A scale in the semantic differential which measures fast-slow
Semantic Differential: An attitude scale which measures differences in word meanings
Thurstone scale: Attitude scale which uses weighted statements
Study Questions
1. Define attitude scale.
2. Compare and contrast the Likert and Thurstone attitude scales.
3. What applications would be appropriate for the semantic differential in Christian research? Likert scale? Thurstone scale? Delphi Technique?
Scale Value  Statement
1.0   I am intensely interested in education.
10.0  I go to school only because I am compelled to do so.
4.2   I am interested in education but one shouldn't get too concerned about it.
6.4   I like reading thrillers and playing games better than studying.
0.5   Education is of first rate importance in the life of man.
5.4   Sometimes I feel education is necessary and sometimes I doubt it.
6.9   I wouldn't work at studying so hard if I didn't have to pass exams.
8.4   Education tends to make people snobs.
10.1  I think time spent studying is wasted.
7.9   It is better to start a career at age 18 than to go to college.
5.7   It is doubtful that education has helped the world.
10.9  I have no desire to have anything to do with education.
1.3   We cannot become good citizens unless we are educated.
2.2   More money should be spent on education.
3.7   I think my education will be of use to me after I leave school.
3.0   I always read newspaper articles on education.
9.3   Education does more harm than good.
11.4  I see no value in education.
3.3   Education allows us to live a less monotonous life.
7.4   I dislike education because time has to be spent on homework.
4.5   I like the subjects taught in school but do not like attending school.
10.5  Education is doing more harm than good.
2.3   Lack of education is the source of all evil.
0.3   Education enables us to make the best possible use of our lives.
1.2   Only educated people can enjoy life to the full.
2.7   Education does more good than harm.
7.1   I do not like school teachers so I somewhat dislike education.
4.9   Education is alright in moderation.
5.8   It is enough that we should be taught to read, write and do sums.
8.9   I do not care about education so long as I can live comfortably.
9.9   Education makes people forget God and despise Christianity.
1.8   Education is an excellent character builder.
8.6   Too much money is spent on education.
6.7   If anything, I must admit to a slight dislike of education.
Chapter 13
Experimental Designs
What is Experimental Research?
Internal Invalidity
External Invalidity
Types of Designs
We've previously discussed aspects of three dissertations which embraced an experimental design. My 1978 Southwestern dissertation compared three approaches to teaching adults in a local Southern Baptist church: Skinnerian behaviorism, Brunerian cognitivism, and an eclectic blend of the two. In 1989 Dr. Stephen Tam compared three approaches to teaching Chinese students in a Hong Kong seminary: interactivity, gaming, and lecture. In 1994 Dr. Mark Cook studied the role of active participation in adult learning in a local church.1
A good experiment is one that confines the variation of measurement scores to variation caused by the treatment itself. The hindrances to good research design are called sources of experimental invalidity. These sources fall under two major subdivisions: internal invalidity and external invalidity. Let's define further these sources of experimental invalidity.2
History
Maturation
Testing
Instrumentation
Regression
Selection
Mortality
Interaction
John Henry
Diffusion
Internal Invalidity
Internal invalidity asks the question, "Are the measurements I make on my dependent variable (i.e., the variable I measure) influenced only by the treatment, or are there other influences which change it?" An experimental design suffers from internal invalidity when the other influences, called extraneous sources of variation, have not been controlled by the researcher. When extraneous variables have been controlled, researchers can be reasonably sure that post-treatment measurements are influenced by the experimental treatment, and not by extraneous variables. Donald Campbell and Julian Stanley wrote a chapter of a text on research designs that has become a classic in the field.3 In this chapter they list eight extraneous variables: history, maturation, testing, instrumentation, statistical regression, differential selection, experimental mortality, and selection-maturation interaction. Borg and Gall list two more: the John Henry effect and experimental treatment diffusion.4
History
History refers to events other than the treatment that occur during the course of an experiment which may influence the post-treatment measure of treatment effect. If the explosion of the nuclear reactor in Chernobyl, Ukraine had occurred in the middle of a six-month treatment to help people reduce their anxiety about nuclear power, it is likely that post-test anxiety scores would be higher than they would have been without the disaster. History does not refer to the background of the subject. Since history is an internal source of invalidity, its influence must occur during the experiment. If you study two groups, one which receives the treatment and a similar one which does not, you control for history (which is why this second group is called a control group) since both groups are statistically5 affected the same way by events outside the experiment. Any differences between the two groups at the end of the experiment could reasonably be linked to the treatment.
Maturation
Subjects change over the course of an experiment. These changes can be physical, mental, emotional, or spiritual. Perspective can change. The natural process of human growth can result in changes in post-test scores quite apart from the treatment. Question: How would a control group control this source of internal invalidity?6
2 I use the term invalidity to differentiate this concept from test validity discussed in Chapter 8. Be careful, however. Many texts use the terms experimental validity and test validity.
3 Donald T. Campbell and Julian C. Stanley, "Experimental and Quasi-experimental Designs for Research on Teaching," in Handbook of Research on Teaching, ed. N. L. Gage (Chicago: Rand McNally, 1963).
4 Borg and Gall, 635-637.
5 Individuals might be affected, but the groups will not significantly differ from each other.
6 Subjects in both groups will mature, on average, the same.
Testing
A common research design is to give a group a pre-test, a treatment, and then a post-test (see p. 13-6). If you use the same test both times, the group may show an improvement simply because of their experience with the test. This is especially true when the treatment period is short and the tests are given within a short time. Unless you must specifically measure changes during the experiment -- requiring testing before and after the treatment -- it is better to give only a post-test. Randomly assign subjects to groups to render the dependent variable (as well as all others!) statistically equal at the beginning of the study.
Instrumentation
In the previous section we discussed the problem of using the same test twice in pre- and post-measurements. But if you use different tests for pre- and post-measurements, then the change in pre- and post-scores may be due to differences between the tests rather than the treatment. The best remedy, as we have already discussed, is to use randomization and a post-test only design. But if you must have pre-test scores (because you are using intact groups and need to know whether the groups are equivalent, or because you want to study changes over time), then you must develop equivalent tests using the parallel forms techniques discussed in Chapter Eight. How does use of a control group relate to instrumentation?7
Statistical regression
Set a glass of cold milk and a hot cup of coffee on a table. Over time, the cold milk will get warmer and the hot coffee colder. They both regress toward the room temperature. Statistical regression refers to the tendency of extreme scores, whether low or high, to move toward the average on a second testing. Subjects who score very high or very low on one test will probably score less high or low when they take the test again. That is, they regress toward the mean. Let's say you are analyzing how much a particular reading enrichment program enhances the reading skills of 3rd grade children. You give a reading skills test and select for your experiment every child who scores in the bottom third of the group. You provide a three-month treatment of reading enrichment, and then measure the reading ability of the group. On the basis of the scores on the children's first and second tests, you find that reading skills improved significantly. What, in your opinion, is wrong with this study?8 Do not study groups formed from extreme scores. Study the full range of scores. The question we need to answer is: Does the reading enrichment program significantly improve reading skills of randomly selected subjects over a control group?
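Regression toward the mean is easy to demonstrate by simulation. The sketch below (entirely hypothetical numbers) models each child's test score as true skill plus random error, selects the bottom third on the first testing, and retests with no treatment at all; the selected group's average still "improves."

```python
import random

random.seed(1)  # fixed seed so the run is repeatable

# Simulate a reading test: true skill plus random measurement error.
true_skill = [random.gauss(50, 10) for _ in range(3000)]
test1 = [t + random.gauss(0, 10) for t in true_skill]
test2 = [t + random.gauss(0, 10) for t in true_skill]  # no treatment given

# Select the bottom third on the first testing, as in the flawed study.
cutoff = sorted(test1)[len(test1) // 3]
bottom = [i for i, score in enumerate(test1) if score < cutoff]

mean1 = sum(test1[i] for i in bottom) / len(bottom)
mean2 = sum(test2[i] for i in bottom) / len(bottom)
print(round(mean1, 1), round(mean2, 1))  # the group "improves" untreated
```

The apparent gain is pure statistical regression: the extreme group's first scores were partly extreme error, which does not repeat on the second testing.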
Differential selection
If we select groups for treatment and control differently, then the results may be due to the differences between groups before treatment. Say you select high school seniors who volunteer for a special Bible study program as your treatment group, and compare their scores with a control group of high school seniors who did not volunteer. Do your post-test scores measure the effect of the Bible study treatment, or the differences between volunteers and non-volunteers? You cannot say. Randomization solves this problem by statistically equating groups.

7 Even if tests are not equivalent, both experimental and control groups answer the same test. This controls for the effects of instrumentation on the treatment group. It isolates treatment group changes to the given treatment.
8 The group would have scored, on average, better on the second testing regardless of the treatment, simply due to statistical regression. In addition, there is no control by which to measure the treatment.
Experimental mortality
Experimental mortality, also called attrition, refers to the loss of subjects from the experiment. If there is a systematic bias in the subjects who drop out, then post-test scores will be biased. For example, if subjects drop out because they are aware that they're not improving as they should, then the post-test scores of all those who complete the treatment will be positively biased. Your results will appear more favorable than they really are. How does use of a control group solve the problem of attrition?9
Treatment diffusion
Similar to the John Henry effect is treatment diffusion. If subjects in the control group perceive the treatment as very desirable, they may try to find out what's being done. For example, a sample of church members is selected to use an innovative program of discipleship training, while the control group uses a traditional approach. Over the course of the experiment, some of the materials of the treatment group may be borrowed by the control group members. Over time, the treatment diffuses to the control group, minimizing the treatment effect. This often happens when the groups are in close proximity (members of the same church, for example). Both the John Henry Effect and Treatment Diffusion can be controlled if experimental and control groups are isolated.
9 Subjects will tend to drop out of both treatment and control groups equally. Those who remain in both groups provide a better picture of "difference" than before-and-after type designs.
External Invalidity
External invalidity asks, How confidently can I generalize my experimental findings to the world? Sources of external invalidity cause changes in the experimental groups so that they no longer reflect the population from which they were drawn. The whole point of inferential research is to secure representative samples to study so that inferences can be made back to the population from which the samples were drawn (Chapter Seven). External invalidity hinders the ability to infer back. Campbell and Stanley list four sources of external invalidity: the reactive effects of testing, the interaction of treatment and subject, the interaction of testing and subject, and multiple treatment interference.
Effects of Testing
Treatment & Subject
Testing & Subject
Multiple Treatments
Summary
Designing an experiment that produces reliable, valid, and objective data is not easy. But experimental research is the only direct way to measure cause-and-effect relationships among variables. What a help it would be to Kingdom service if we could develop effective experimental researchers who are also committed ministers of the Gospel -- learning from direct research how to teach and counsel and manage and serve in ways that directly enhance our ministry.
Types of Designs
The following is a summary of some of the more important designs of Campbell and Stanley. I will briefly describe the design, give an example of how the design would be used in a research study, and indicate possible sources of internal and external invalidity. In the design diagrams which follow, a test is designated by O, a treatment by X, and randomization by an R.
Pre-Test Post-Test Control Group Design

R   O1   X   O2
R   O3        O4
Example. Third graders are randomly assigned to two groups and tested for knowledge of Paul. Then one group gets a special Bible study on Paul. Both are then tested again.
Analysis. The t-test for independent samples (Chapter 20) can be used to determine if there is a significant difference between the average scores of the groups (O2 and O4). You can also compute gain scores (O2 - O1 and O4 - O3) and test the significance of the average gain scores with the matched samples t-test.
Comments. This design's only weakness is pre-test sensitization and the possible interaction between pretest and treatment.
13-6
Post-Test Only Control Group Design

R   X   O1
R        O2

Example. Third graders are randomly assigned to two groups. Then one group receives a special study on the life of Paul (no pre-test). Both are tested on their knowledge of Paul at the conclusion of the study.
Analysis. The difference between group means (O1 and O2) can be computed by an independent groups t-test. [Other procedures that can be used include one-way ANOVA (though usually used with three or more groups - see Chapter 24), and the ordinal procedures Wilcoxon Rank Sum test or Mann-Whitney U (see Chapter 21). We'll discuss these later.]
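The independent-groups t-test named in the analysis can be sketched with a pooled-variance implementation in Python. The post-test scores below are invented; the formula (difference in means over the pooled standard error) is the standard one this text develops in Chapter 20.

```python
from math import sqrt
from statistics import mean, variance

def independent_t(group1, group2):
    """Pooled-variance t statistic for two independent groups."""
    n1, n2 = len(group1), len(group2)
    m1, m2 = mean(group1), mean(group2)
    # Pool the two sample variances, weighted by degrees of freedom.
    sp2 = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / (n1 + n2 - 2)
    t = (m1 - m2) / sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2   # t statistic and its degrees of freedom

# Hypothetical post-test scores on knowledge of Paul.
o1 = [78, 85, 90, 72, 88, 81, 79, 86]   # treatment group
o2 = [70, 75, 68, 74, 72, 71, 77, 69]   # control group
t, df = independent_t(o1, o2)
print(round(t, 2), df)  # compare t against the critical value for df = 14
```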
Solomon Four-Group
Subjects are randomly selected and assigned to one of four groups. Group 1 is tested before and after receiving the treatment; Group 2 is tested before and after receiving no treatment; Group 3 is tested only after receiving the treatment; and Group 4 is tested after receiving no treatment.
Group 1:   R   O1   X   O2
Group 2:   R   O3        O4
Group 3:   R        X   O5
Group 4:   R             O6
The Solomon design is actually a combination of the Pre-Test Post-Test Design (groups 1 and 2) and the Post-Test Only design (groups 3 and 4). Look!
Group 1:   R   O1   X   O2
Group 2:   R   O3        O4
Group 3:   R        X   O5
Group 4:   R             O6
Example. Third graders are randomly assigned to 1 of 4 groups. The knowledge of Paul is measured in groups 1 and 2. Groups 1 and 3 are given a special study on the life of Paul. When the special study is over, all four groups are tested.
Analysis. One-way ANOVA can be used to test the differences in the four post-test mean scores (O2, O4, O5, O6). The effects of the pretest can be analyzed by applying a t-test to the means of O4 (pretest but no treatment) and O6 (neither pretest nor treatment). The effects of the treatment can be analyzed by applying a t-test to the means of O5 (treatment but no pretest) and O6 (neither pretest nor treatment). Subject maturation can be analyzed by comparing the combined means of O1 and O3 against O6.
Comments. The Solomon Four-Group design provides several ways to analyze data and control sources of extraneous variability. Its major drawback is the large number of subjects required. Since each group needs to contain at least 30 subjects, one experiment would require 120 subjects.
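The one-way ANOVA step can be sketched as a short function computing the F ratio from the between-groups and within-groups sums of squares. The four groups of post-test scores below are invented to stand for O2, O4, O5, and O6.

```python
from statistics import mean

def one_way_anova(*groups):
    """F ratio: between-groups mean square over within-groups mean square."""
    all_scores = [x for g in groups for x in g]
    grand = mean(all_scores)
    k, n = len(groups), len(all_scores)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_b, df_w = k - 1, n - k
    f = (ss_between / df_b) / (ss_within / df_w)
    return f, df_b, df_w

# Hypothetical post-test scores for the four Solomon groups.
g1 = [85, 88, 90, 84]   # pretest + treatment (O2)
g2 = [72, 70, 74, 73]   # pretest only (O4)
g3 = [86, 89, 83, 87]   # treatment only (O5)
g4 = [71, 73, 69, 72]   # neither (O6)
f, df_b, df_w = one_way_anova(g1, g2, g3, g4)
print(round(f, 1), df_b, df_w)
```

A large F (here the treated groups clearly outscore the untreated ones) is then compared against the critical F value for (df_b, df_w).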
4th ed. 2006 Dr. Rick Yount
Quasi-experimental Designs
The term quasi- (pronounced kwahz-eye) means almost, near, partial, pseudo, or somewhat. Quasi-experimental designs are used when true experiments cannot be done. A common problem in educational research is the unwillingness of educational administrators to allow the random selection of students out of classes for experimental samples. Without randomization, there are no true experiments. So, several designs have been developed for these situations that are almost true experiments, or quasi-experimental designs. We'll look at three: the time series, the nonequivalent control group design, and the counterbalanced design.
Time Series
Establish a baseline measure of subjects by administering a series of tests over time (O1 through O4 in this case). Expose the group to the treatment and then measure the subjects with another series of tests (e.g., O5 through O8).
O1   O2   O3   O4   X   O5   O6   O7   O8
Example. A class of third graders is given several tests on Paul before having a special study on him. Several tests are given after the special study is finished.
Analysis. I could say something like "data is analyzed by trend analysis for correlated data on n subjects under k conditions (linear and polynomial), or the monotonic trend test for correlated samples," but let me simply say that data analysis is much more complex with a time series design. An effective visual analysis can be made by graphing the group's mean scores on each test over time. Important changes in the group can easily be attributed to the treatment by the shape of the line. One could also average the pre-treatment scores and the post-treatment scores, and apply a t-test for matched samples to the averages!
Comments. Since there is no control group, one cannot determine the effects of history on the test scores. Instrumentation may also be a problem (Are the tests equivalent?). Beyond these internal validity problems, the reactive effects of repeated testing of subjects is a source of external invalidity.
Nonequivalent Control Group Design

O1   X   O2
---------------
O3        O4
Example. Two intact third grade classes (no random selection) are tested on their knowledge of Paul before and after one of them receives a special study on the life of Paul.
Analysis. One approach to measuring the significance of difference between the two groups is to compute gain scores. This is done by subtracting the pre-test score from the post-test score for each subject. Use gain scores to compute average gain for each group. Test whether the average gains are significantly different by the t-test for independent samples. Another approach is to use the pre-test scores as a covariate measure to adjust the post-test means. Analysis of covariance (see Chapter 25) is the procedure to use.
Comments. This design should be used only when random assignment is impossible. It does not control for selection-maturation interaction and may present problems with statistical regression. Beyond these internal sources of invalidity, this design suffers from pretest sensitization.
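The gain-score computation is simple enough to sketch directly. The pre/post score pairs below are hypothetical; each subject's gain is post-test minus pre-test, and the two groups' average gains are what would then go into the independent-samples t-test.

```python
# Hypothetical (pre-test, post-test) pairs for two intact classes.
class1 = [(60, 78), (55, 74), (62, 80), (58, 77)]   # received the special study
class2 = [(59, 63), (61, 64), (57, 60), (60, 66)]   # no treatment

# Gain score = post-test minus pre-test, computed per subject.
gains1 = [post - pre for pre, post in class1]
gains2 = [post - pre for pre, post in class2]

avg_gain1 = sum(gains1) / len(gains1)
avg_gain2 = sum(gains2) / len(gains2)
print(gains1, gains2, avg_gain1 - avg_gain2)
```

The two lists of gains, not the raw scores, are the samples submitted to the independent-samples t-test.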
Counterbalanced Design
Subjects are not randomly selected, but are used in intact groups. Group 1 receives treatment 1 and test 1. Then at a later time, they receive treatment 2 and test 2. Group 2 receives treatment 2 first and then treatment one.
           Time 1      Time 2
Group 1:   X1   O      X2   O
Group 2:   X2   O      X1   O
Example. Two third grade classes receive two special studies on Paul: one in the classroom and the other on a computer. Class 1 does the classroom work first, followed by the computer; class 2 does the computer work first. Both groups are tested after both treatments. Analysis. Use the Latin Squares analysis (beyond the scope of this text). Comments. Since randomization is not used in this design, selection-maturation interaction may be a problem. Multiple treatment effect is a possible source of external invalidity.
Pre-experimental Designs
Pre-experimental designs should not be considered true experiments, and are not appropriate for formal research. I include them so that you can contrast them with the better designs. Data collected with these designs is highly suspect. We will consider the One Shot Case Study design, the One Group Pretest Posttest design, and the Static Group comparison design.
One-Shot Case Study

X   O

Example. A third grade class is provided a special Bible study course on Paul, after which their knowledge of Paul is tested.
Analysis. Very little analysis can be done because there is nothing to compare the posttest against and no basis to determine what influence the treatment had. Comments. None of the sources of internal or external invalidity are controlled by this design. It suffers most in the areas of history, maturation, regression, and differential selection. It also suffers from the external source of treatment and subject. The design is useless for most practical purposes because of numerous uncontrolled sources of difference.
One-Group Pretest/Posttest
A single intact group is tested before and after a treatment. O1 X O2
Example. A group of third graders is tested on knowledge of Paul before and after a special study on the life of Paul. Analysis. Test the difference between the pre-test and post-test means using the matched sample t-test (see Chapter 20) or the Wilcoxon matched-pairs signed-rank test (see Chapter 21). Comments. Problems abound with history, maturation, testing, instrumentation, and selection-maturation interaction. The reactive effects of pre- and post-tests and of treatment and subject are external sources of invalidity.
Static-Group Comparison
Two intact groups are tested after one has received the treatment.

X   O
-----------
    O

Example. Two classes of third graders are tested on their knowledge of Paul after one of them has had the special Bible study. Analysis. Determine whether there is a significant difference between post-test means by using the t-test for independent samples (Chapter 20) or the Mann-Whitney U nonparametric test (Chapter 21). While these statistics will work, their results are meaningless, since there is no assurance that the groups were the same at the beginning of the treatment. Comments. This design suffers most from selection, attrition, and selection-maturation interaction problems. It also fails to control the external invalidity source of treatment and subject.
Chapter 13
Experimental Designs
Summary
This chapter introduced you to the world of experimental research design. The concepts of internal and external validity, randomization, and control are essential to constructing experiments which provide valid data. Experimental research is the only type which can establish cause-and-effect relationships between variables.
Vocabulary
control group -- representative sample which does not receive the treatment
differential selection -- subjects selected for samples in a non-random manner, i.e., in "different ways"
experimental mortality -- loss of subjects from the study
external invalidity -- flaw which prevents experimental results from being generalized to the original population
history -- events during the experiment which influence scores on the post-test
instrumentation -- differences in subject scores due to differences in tests used
interaction of testing/subject -- subjects may react to tests unpredictably (generalization?)
interaction of treatment/subject -- subjects may react to treatment unpredictably (generalization?)
internal invalidity -- condition which alters measurements within the experiment
John Henry effect -- control group tries harder (distorting the results)
maturation -- change in subjects over the course of the experiment
posttest sensitization -- posttest changes subjects: they put it all together and score higher than they normally would
pretest sensitization -- pretest changes subjects: an advance organizer that prepares subjects for the treatment
selection-maturation interaction -- samples of subjects may mature differently
statistical regression -- top- and bottom-scoring subjects move toward the average on a second test
testing -- source of internal invalidity: improvement due to (different) tests, not treatment
treatment diffusion -- source of internal invalidity: treatment leaked to the control group
true experimental research -- design which involves random selection and random assignment
Study Questions
1. Define internal and external invalidity.
2. Explain the ten sources of internal invalidity and the four sources of external invalidity.
3. What is required for a research design to be true experimental? Why?
Chapter 15

Distributions and Graphs
Creating an Ungrouped Frequency Distribution
Creating a Grouped Frequency Distribution
Visualizing the Distribution: the Histogram
Visualizing the Distribution: the Frequency Polygon
Common Distribution Shapes
Distribution-Free Data
The end of the research part of a study comes after the data has been collected through tests, attitude scales, questionnaires, or other instruments. Raw data presents us with an incomprehensible mass of numbers. The first step in statistical analysis is to reduce this incomprehensible mass into meaningful forms. This is done by using frequency distributions and associated graphs. In this chapter we'll look at several ways to organize data so that you can see its meaning. We will look at both ungrouped and grouped frequency distributions.
As you can see, this collection of numbers makes little sense as it is. But we can organize and summarize the data in such a way as to make it meaningful. Let's start by rank-ordering the numbers from high (109) to low (44).
109  105  104  100   99   98   97   97   95   95
 93   91   90   89   84   84   83   82   81   80
 78   75   75   75   74   72   71   70   69   68
 66   62   59   59   58   51   47   44
This ranking helps us to see where any given score falls along the whole range of scores. But the list is still rather long and difficult to manage. Let's now go through the list and count the number of times each score occurs. This is the score's frequency, represented by the letter f.
Score   f      Score   f      Score   f
109     1      89      1      70      1
105     1      84      2      69      1
104     1      83      1      68      1
100     1      82      1      66      1
99      1      81      1      62      1
98      1      80      1      59      2
97      2      78      1      58      1
95      2      75      3      51      1
93      1      74      1      47      1
91      1      72      1      44      1
90      1      71      1
The ungrouped frequency distribution above removes the redundancy of repeating scores. But the large number of single scores (f=1) still confuses the picture. If we were to group ranges of scores together in classes, we would get a better picture of the data. Grouping scores into classes produces a grouped frequency distribution.
66 / 10 = 6.6

We need to round up or down to a whole number. Odd class widths are better than even ones because the midpoint of an odd-width class is a whole number. So let's round up to 7. (In this context, we would even round a number like 6.1 up to 7.) The distribution will have a class interval (i) of 7.
This grouped frequency distribution reveals much more about the Bible knowledge of these high school seniors than we could discern in the previous listings. On the down side, by grouping our scores into classes, we actually lose some detail. But "losing detail" is necessary when the aim is to derive meaning from the numbers. We can condense our scores even more by increasing the class width i. Let's look at a frequency distribution of the same data with i = 14.
Class     Tally               f
98-111    ///// /             6
84-97     ///// /////        10
70-83     ///// ///// //     12
56-69     ///// //            7
42-55     ///                 3
                          n = 38
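If you would like to verify this grouped distribution by computer, here is a minimal Python sketch that rebuilds the i = 14 table from the 38 raw scores:

```python
# Build the i = 14 grouped frequency distribution from the raw scores.
scores = [109, 105, 104, 100, 99, 98, 97, 97, 95, 95, 93, 91, 90, 89,
          84, 84, 83, 82, 81, 80, 78, 75, 75, 75, 74, 72, 71, 70, 69,
          68, 66, 62, 59, 59, 58, 51, 47, 44]

i = 14                       # class width
lowest = 42                  # lower limit of the bottom class
dist = []                    # (lower limit, upper limit, frequency), low to high
for lo in range(lowest, max(scores) + 1, i):
    hi = lo + i - 1          # upper limit of this class
    f = sum(lo <= s <= hi for s in scores)
    dist.append((lo, hi, f))
```

The frequencies come out 3, 7, 12, 10, and 6 for the classes from 42-55 up to 98-111, matching the table.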
This last table gives a smoother picture of the data set, though we notice the loss of more detail because we reduced the number of classes. Frequency distributions certainly simplify data sets, but we can present the data even more clearly by graphing the frequency distributions.
X- and Y-axes
A graph is composed of a vertical line, called the ordinate or Y-axis, and a horizontal line, called the abscissa or X-axis. These two lines intersect to form a right angle. By convention, the Y-axis should be three-fourths the length of the X-axis. Axis is pronounced AX-is; axes is pronounced AX-ees.
Scaled Axes
Numbers are placed on the X- and Y-axes at equal intervals to represent the scale values of the variable being graphed. In a graph of a grouped frequency distribution, the X-axis is scaled by the range and class intervals, the Y-axis is scaled by frequency. There are two major graph types used to display information from a grouped frequency distribution. The first is the histogram and the other is the frequency polygon.
Histogram
A histogram (HISS-ta-gram) is a special type of bar graph. The widths of the bars equal the class interval and the heights of the bars equal the class frequencies. Let's use the example data to build a histogram with a range of 44-111 and class width (i) of 7, using the frequencies from the ten-class grouped distribution described earlier. Look at the graph at left. Class limits are listed along the X-axis. The widths of all classes equal 7. The height of each bar equals the frequency of scores contained in each category. The shape of the graph provides us a clear and meaningful picture of the entire data set. Then we reduced the number of categories from ten to five (increased i from 7 to 14). The graph at
left shows the effect of reducing the number of classes. Irregularities have been smoothed out, but some of the more specific (irregular) data has been glossed over. Choosing class width and the number of classes is a trial and error process. Our goal is to reflect the shape of the data as clearly as possible while attaining as much precision as possible.
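A crude text version of the five-class histogram can even be printed directly from the frequency table, one mark per score:

```python
# Print a rough character histogram of the five-class distribution
# (one '#' per score in each class).
dist = [(42, 55, 3), (56, 69, 7), (70, 83, 12), (84, 97, 10), (98, 111, 6)]
lines = [f"{lo:3d}-{hi:3d} | " + "#" * f for lo, hi, f in dist]
print("\n".join(lines))
```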
Frequency Polygon
By connecting the midpoints of the bars with lines, we produce a frequency polygon. The frequency polygon displays the same information as the histogram, but in a different form. The frequency polygon at right is based on the ten-class histogram on the previous page. If we remove the bars of the histogram, we obtain a frequency polygon graph, below right.
Distribution Shapes
The graphic image of a histogram or frequency polygon tells us at a glance the group profile of the data. The incomprehensibility of a set of numbers is transformed into a meaningful visual portrait. This visual portrait displays two special characteristics: kurtosis and skewness. The kurtosis of a curve describes how flat or peaked it is. The three basic profiles of kurtosis are platykurtic (flat), leptokurtic (peaked), and mesokurtic (balanced). A flat curve is called platykurtic. Think of the flatness of a plate and you'll remember platy-kurtic. Notice that there are low frequencies for all the categories. A peaked curve is called leptokurtic. Think of the central frequencies leaping away from the others and you'll remember leap-to-kurtic. Notice that the outer categories have lower frequencies while the central categories have high frequencies. A curve that falls between platykurtic and leptokurtic is called mesokurtic. Think of medium (meso-) and you'll remember meso-kurtic. The familiar bell-shaped curve is mesokurtic. The skewness of a curve describes how horizontally distorted a curve is from the familiar bell-shaped curve. A curve with negative skew has its left tail pulled outward to the left, toward the negative end of the scale. A curve with positive skew has its right tail pulled outward to the right, toward the positive end of the scale. A common mistake is to focus on the mound of scores rather than the distorted tail. Remember: the direction the tail is pulled is the direction of the skew. A distribution where all categories of scores have equal frequencies is called a rectangular distribution.
4th ed. 2006 Dr. Rick Yount
[Figures: platykurtic, leptokurtic, mesokurtic, negative skew, positive skew, and rectangular distribution shapes]
Distribution-Free Measures
Our discussion of distributions applies only to interval or ratio data, called parametric data. Two other types of statistics deal with non-parametric measures: ordinal (ranks) or nominal (counts) data. Non-parametric data is often called distribution-free. We will spend the next few chapters on parametric statistics, then take up the non-parametric types in Chapters 22, 23, and 24.
Summary
This chapter carried you through the first step in data analysis: reducing a series of chaotic numbers to orderly distributions and graphs. Before engaging in more sophisticated statistical procedures, you should initially analyze your data with these data reduction techniques. All good introductory statistics texts have chapters on data reduction techniques.
Vocabulary
Abscissa -- number along the horizontal (x-) axis of a graph
Class -- a subset of scores defined by upper and lower limits in a frequency distribution
Class width (i) -- distance between the upper and lower limits in a given class
Exponential curve -- line on a graph produced by an exponential equation such as y = e^x
Frequency (f) -- the number of scores in a given class
Frequency polygon -- graph that depicts class frequencies: uses class midpoints
Histogram -- graph that depicts class frequencies: uses class limits
Kurtosis -- amount of flatness (or peakedness) in a distribution of scores
Leptokurtic -- highly peaked distribution ("leaps up" in the middle)
Mesokurtic -- moderately peaked distribution (normal curve)
Midpoint -- halfway point between class limits in a given class: x'
Negative skew -- tail pulled left, toward the negative end of the scale
Non-parametric measures -- ranks or counts; ordinal or nominal; distribution-free
Ordinate -- number along the vertical (y-) axis
Parametric measures -- scales or tests; interval or ratio; normal distribution
Platykurtic -- flat distribution ("like a plate")
Positive skew -- tail pulled right, toward the positive end of the scale
Rectangular distribution -- all classes have the same frequency
Skew -- the degree a tail in a frequency distribution is pulled away from the mean
X-axis -- the horizontal axis in a graph
Y-axis -- the vertical axis in a graph
Study Question
Using the following data and the guidelines provided in this chapter:

89, 92, 83, 98, 98, 80, 89, 97, 83, 87, 86, 84, 97, 97, 99, 90, 95, 90, 91, 96, 95, 91, 91, 92, 94, 93, 94, 100

a) construct a grouped frequency distribution with i = 3.
b) construct a histogram of this distribution.
c) construct a frequency polygon of this distribution.
d) How would you describe this distribution? (What type?)
Chapter 16

Central Tendency and Variation
Measuring the Central Tendency of Data
Measuring the Variability of Data
Statistics and Parameters
The Standard (z-) Score
In the last chapter we considered a way to reduce a mass of numbers by creating a grouped frequency distribution and graphing it. The graph is a visual image of the data, and is an important first step in data analysis. In this chapter we develop basic concepts in reducing data numerically. A group of numbers has two primary numerical characteristics. The first is a central point about which they cluster, called the central tendency. The second is how tightly they cluster about that point, called variability.
The Mode
The mode is the most frequently occurring score in a set of scores.

82 82 83 83 84 85 86 87 87 87 88 90 95 99 99

The mode of the above set is 87 because it appears three times, more often than any other number in the set.

82 83 84 86 87 88 88 89 90 91 91 92 94 97 98

There are two modes above. The numbers 88 and 91 both appear twice. This is a bi-modal (two-mode) data set.

82 83 84 86 87 88 89 90 91 92 93 94 95 96 97

There is no mode for this distribution. No score occurs more frequently than any other. The mode is the most frequent score in a set of data.
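A frequency count makes the mode easy to find by computer. Here is a short Python sketch that handles all three cases above (one mode, two modes, no mode):

```python
# Find the mode(s) of a data set with a frequency count.
from collections import Counter

def modes(scores):
    counts = Counter(scores)
    top = max(counts.values())
    if top == 1:                 # every score occurs once: no mode
        return []
    return sorted(s for s, f in counts.items() if f == top)

unimodal = [82, 82, 83, 83, 84, 85, 86, 87, 87, 87, 88, 90, 95, 99, 99]
bimodal  = [82, 83, 84, 86, 87, 88, 88, 89, 90, 91, 91, 92, 94, 97, 98]
no_mode  = [82, 83, 84, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97]
```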
The Median
The median is the middlemost score. That is, it is the score that represents the
exact halfway point through the data set. The median score divides the set of data into two equal halves. Half of the scores fall below the median, and half fall above it.

2   3   4   5   8   9   34   56   67   100   356
                    ^
                 median

In the above set, the median is the 6th score, or the number 9. Five scores fall below 9, and five scores fall above 9. We can locate this score with the simple formula (N + 1) / 2, where N is the number of scores in the set. There are 11 scores (N = 11) in the data set above. Using the formula, we compute (11 + 1) / 2 = 6. The 6th score is the median, which is the number 9. Here is another example, this time with scores out of order:
34 23 67 4 8 17 2 78 99 5 178 3 1678
First, we rank order the numbers from low to high.
2 3 4 5 8 17 23 34 67 78 99 178 1678
Applying the formula, we compute (13+1)/2 = 7. We are looking for the 7th score. The 7th score is the number 23. 23 is the middle number, the median. Six scores fall above and six below this number. Here's an example with an even number of scores:
1 2 3 4 5 6 7 8 9 10 11 12 13 14
In this case, there are two middlemost values. The median for this data set is the average of the two middlemost values. Add the two middle values together and divide by 2. In our case, (7+8)/2 = 7.5. Notice that seven numbers fall below 7.5 (1-7) and seven numbers fall above it (8-14).
1   2   3   4   5   6   7   |   8   9   10   11   12   13   14
                     median = 7.5

The median is the middlemost score.
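The position rule (N + 1) / 2 translates directly into a short Python function covering both the odd and even cases:

```python
# Median as the middlemost score, using the (N + 1) / 2 position rule.
def median(scores):
    s = sorted(scores)           # rank order from low to high
    n = len(s)
    mid = (n + 1) // 2           # position of the middle score (1-based)
    if n % 2 == 1:
        return s[mid - 1]
    # Even N: average the two middlemost values.
    return (s[mid - 1] + s[mid]) / 2
```

For the 13-score example above, `median([34, 23, 67, 4, 8, 17, 2, 78, 99, 5, 178, 3, 1678])` returns 23, and for the scores 1 through 14 it returns 7.5.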
The Mean

1 2 3 4 5 6 7 8 9 10
The mean is found by adding these ten numbers together and dividing by the number of scores (N). We can represent the procedure for computing a mean in a shorter form by using symbols. You were introduced to the symbol Σ (capital sigma) in Chapter 14. The
symbol X (capital "X" or "Y" or "L" or any English letter) refers to scores. The letter N refers to the number of scores. And finally, the Greek letter μ (pronounced "myoo") represents the arithmetic mean of the scores. Using these letters to define the formula for the mean, we have the following:
μ = ΣX / N

Read the above formula like this: "mu equals the sum of X divided by N." Or, in English, the average value of a group of scores is the sum of those scores divided by the number of scores in the group. Let's use this formula on the following data set: 10 23 17 5 64 28 3

μ = ΣX / N = 150 / 7 = 21.43
The mean score of 21.43 represents the average value of all the individual scores in the group, and is the most important measure of central tendency due to its use in statistical analysis.
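As a quick check, the same computation in Python:

```python
# The mean: sum of scores divided by the number of scores (N).
scores = [10, 23, 17, 5, 64, 28, 3]
mu = sum(scores) / len(scores)    # 150 / 7
```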
Measures of Variability
The second essential characteristic of a group of scores is variability. Variability is a measure of how tightly a group of scores clusters about the mean. Scores can be tightly clustered or loosely clustered about the mean. Scores that tightly cluster about the mean have lower variability. Scores that loosely cluster, that are more spread out from the mean, have higher variability. There are three measures of variability. These are range, average deviation, and standard deviation.
Range
As we learned in the last chapter, the range of a group of scores is equal to the highest score minus the lowest score plus 1, or Range = Xmax - Xmin + 1. It is a crude
measure of variability, but is a useful first step in understanding a distribution. Let's look at an example. Class A took a midterm examination in research. The highest score in the class was 103 and the lowest was 48. The range was 103 - 48 + 1, or 56 points. Class B is the same size and took the same exam. Their highest and lowest scores were 95 and 67 respectively. Their range was 95 - 67 + 1, or 29 points. Therefore, the scores of Class B have lower variability (are more tightly clustered) than the scores of Class A. The problem with range is that it tells us nothing of the dispersion of scores between the high and low points. Classes C and D have the same ranges, but have different dispersions of scores. One way of getting at the dispersion of scores throughout the whole distribution is to measure the deviation of each score from the mean and then compute the average of all the deviations.
Average Deviation
A deviation score, symbolized by a lower-case x, is the difference between a score (X) and the mean (μ) of the distribution. When you subtract the mean of a group of scores from a specific score, you compute the deviation of that score from the mean. We can write this relationship simply as x = X - μ. The average deviation of a group of scores is computed by summing all the deviations in the group and dividing by N. Look at the following scores:
10 20 30 40 50
First, compute the mean of these scores: (150/5=30). Then compute the deviation scores (x) by subtracting the mean (30) from each score (X) like this: deviation scores (x)
raw score (X)    mean    deviation (x)
10               30      -20
20               30      -10
30               30        0
40               30       10
50               30       20
                 sum of deviations: Σx = 0
Notice that when we sum the deviations, we get 0 (Σx = 0). Why is the sum of deviations equal to zero? The mean is the balance point in a distribution. When two children of equal weight use a teeter-totter, the balance point is placed halfway between them, as in diagram A below left.
But when children of unequal weight use it, the board must be shifted so that the balance point, or fulcrum, is closer to the heavier child. This is shown in B below right. The heavier weight at a shorter distance on one side of the board balances the lighter weight at a longer distance on the other. Another way of saying this is that, for perfect balance, the moment of force (weight x distance) on one side equals the moment of force on the other. Subtract one from the other and the result is zero. This is what is meant in statistics when we say the mean is the fulcrum of a group of scores. Large deviations are like large distances from the fulcrum, and small deviations like small distances. (All scores weigh the same in this example.) The sum of deviations on one side of the mean will always cancel out, or balance, the sum of deviations on the other side of the mean. Therefore, Σx = 0. In order to compute the average deviation, we must take the absolute values of the deviations. An absolute value, symbolized |x|, equals the value of a number regardless of sign. So, the absolute value of -4 equals 4 (|-4| = 4). By taking the absolute values of the deviations, we make them all positive distances from the mean. Summing them, we produce a meaningful measure of "spreadedness" from the mean:
Average deviation = Σ|x| / N = 60 / 5 = 12

The average deviation equals 12. But average deviation has some mathematical limitations that cause problems in more advanced procedures. A better measure of variability, which also reflects the dispersion of scores throughout a distribution, is the standard deviation.
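The average deviation computation can be verified with a short Python sketch:

```python
# Average deviation: mean of the absolute deviations from the mean.
scores = [10, 20, 30, 40, 50]
mu = sum(scores) / len(scores)                        # 30.0
avg_dev = sum(abs(x - mu) for x in scores) / len(scores)
```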
Standard deviation
The standard deviation has mathematical properties which make it, like the mean, much more useful in higher-order statistics. The procedure for standard deviation involves summing squared deviations (producing a value variously called the sum of squared deviations, the sum of squares, or, statistically, Σx², a fundamental component of many statistical procedures) in order to eliminate negative values. The pathway to standard deviation moves from deviations to the sum of squares to variance to standard deviation. We'll look at two ways to compute the sum of squares. The first, called the deviation method, clearly illustrates what standard deviation means. The second, called the raw score method, is easier to use. Both procedures result in the same value for the sum of squares.
Deviation Method
Compute the deviations of all scores from the mean. Square all deviations (x²) and sum them (Σx²) as follows:
score    mean    deviation (x)    x²
10       30      -20             400
20       30      -10             100
30       30        0               0
40       30       10             100
50       30       20             400
                  Σx = 0     Σx² = 1000
Large groups will have a larger sum of squares than small groups, simply because there are more deviations in a large group. Dividing by N eliminates group size from the result. This gives a truer picture of the spread in a group of numbers no matter how many are in the group. Divide the sum of squares by N in order to factor out the variable of group size. The resulting value is called the variance of the scores, and is symbolized by the squared lower-case Greek letter sigma (σ²).
Variance (σ²) = Σx² / N = 1000 / 5 = 200.0
Since we squared the deviations before adding them, variance measures variability in squared units. It would be better if score variability were in the same unit of measure as the scores themselves. We can "undo" the squaring by taking the square root (√) of the variance, like this:
Standard Deviation (σ) = √(σ²) = √200 = 14.14
The number 14.14 represents the standardized measure of variability for our example. This number represents, in the same unit of measure as our scores, the degree of spread-out-ness of the scores from the mean. The larger the number, the greater the spread. It is useful in comparing the variabilities in different groups of scores, but will become more meaningful in future statistical procedures. This deviation method shows you exactly what a "standard deviation" is, and is fine to use when you have a few scores and a whole number mean. But if you have a large data set, and the mean is a fraction, like 73.031, computing individual deviation scores, squaring them, and then summing them can be painfully tedious. A simpler way to compute the sum of squares -- and get the very same result -- is to use the raw score formula.
Σx² = ΣX² - (ΣX)² / N
where ΣX² refers to the sum of the squared raw scores (square all the scores and sum them) and (ΣX)² refers to the sum of all the scores, squared (sum the scores and then square the sum). Let's apply this formula to the same data that we used under the deviation method. We should get the same answer: Σx² = 1000.
X            X²
10           100
20           400
30           900
40          1600
50          2500
ΣX = 150    ΣX² = 5500
(ΣX)² = 22500
Σx² = ΣX² - (ΣX)²/N
    = 5500 - 22500/5
    = 5500 - 4500
Σx² = 1000
As you can see, both methods give sum of squares values of 1000. The raw score method is easier to do and less prone to arithmetic errors.
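Both methods are easy to verify by computer. The sketch below computes the sum of squares both ways, then variance and standard deviation, for the five example scores:

```python
import math

# Sum of squares two ways, then variance and standard deviation.
scores = [10, 20, 30, 40, 50]
n = len(scores)
mu = sum(scores) / n

# Deviation method: square each deviation from the mean, then sum.
ss_dev = sum((x - mu) ** 2 for x in scores)

# Raw score method: sum of squared scores minus (sum of scores)^2 / N.
ss_raw = sum(x ** 2 for x in scores) - sum(scores) ** 2 / n

variance = ss_dev / n          # population variance (divide by N)
sd = math.sqrt(variance)       # population standard deviation
```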
Notice that the means of the two groups are equal, but the degree of scatter (variability) among the scores is not. Let's compute the standard deviations of both groups to compare them. Which group should have the larger standard deviation?² Using the deviation method, we calculate the sum of squares of X as follows:
i        Xi        xi       xi²
1         1        -6        36
2         3        -4        16
3         5        -2         4
4         9         2         4
5        11         4        16
6        13         6        36
N = 6    ΣX = 42   Σx = 0   Σx² = 112
The variance of Group X equals Σx²/N = 112/6 = 18.66. The standard deviation is the square root of the variance, or √18.66 = 4.32. Using the raw score method, we calculate the sum of squares for Group X as follows:
i        Xi        Xi²
1         1          1
2         3          9
3         5         25
4         9         81
5        11        121
6        13        169
n = 6    ΣX = 42   ΣX² = 406
Σx² = ΣX² - (ΣX)²/N = 406 - (42)²/6 = 406 - 294 = 112

We get the same result, 112, with either method.
² Did you say the X's? Good. You can see from the graph that the X's are spread out more than the Y's (another way of saying this is that the range of X is greater than the range of Y). We would expect the X's to have more variability than the Y's, and, in turn, the standard deviation of the X's will be greater.
Now let's compute variance and standard deviation for Group Y, which should produce a smaller sum of squares, variance, and standard deviation than Group X did. Here's the deviation method:
i        Yi        yi       yi²
1         5        -2         4
2         6        -1         1
3         8         1         1
4         9         2         4
N = 4    ΣY = 28   Σy = 0   Σy² = 10
The variance of Group Y equals Σy²/N = 10/4 = 2.5. The standard deviation is the square root of the variance, or √2.5 = 1.58. Using the raw score method, we calculate the sum of squares for Group Y as follows:
i        Yi        Yi²
1         5        25
2         6        36
3         8        64
4         9        81
n = 4    ΣY = 28   ΣY² = 206
Σy² = ΣY² - (ΣY)²/N = 206 - (28)²/4 = 206 - 196 = 10

Again, we get the same result, 10, with either method. Since the sum of squares equals 10, variance equals 2.5 and standard deviation 1.58, as calculated above. The deviation method illustrates the meaning of standard deviation; the raw score method gives the same result more simply. We have now computed the standard deviation for both groups of scores. The groups have identical means, but different spreads. We expected the scores of Group X to have a larger standard deviation than those of Group Y because of their larger spread. The calculations yielded a standard deviation of 4.32 for Group X and 1.58 for Group Y, confirming our expectation.
Population Parameters
Suppose we have a population of 10,000 ministers, and we want to compute the mean and standard deviation of their IQ. In order to compute these population parameters directly, you give all 10,000 ministers an IQ test. Sum the 10,000 IQ scores (ΣX) and divide by 10,000 (N). The result is the population mean, symbolized by μ (pronounced "myoo"). Subtract μ from the 10,000 IQs (x = X - μ), square the 10,000 deviations (x²), sum them (Σx²), divide by 10,000 (N), and finally take the square root (√). This yields the population standard deviation, symbolized by σ.
Sample Statistics
The cost in time and materials to test 10,000 subjects and compute the parameters directly is not practical. Instead, draw a random sample of 100 ministers (1%) and measure their IQs. Sum the 100 IQs (ΣX) and divide by 100 (N) to produce the sample mean, symbolized by X̄ and pronounced "X-bar." Subtract X̄ from the 100 IQs (x = X - X̄), square the 100 deviations (x²), sum them (Σx²), divide by 100 (N), and finally take the square root (√). This sample standard deviation is symbolized by a sigma with a hat on top (σ̂), pronounced "sigma-hat."³
Estimated Parameters
When we cannot compute population parameters directly, we must estimate them from sample statistics. This is not a problem for the estimate of the mean (μ): the sample mean (X̄) is the best estimate. But because the sample is a subset of the population, with a smaller number of scores, the sample standard deviation (σ̂) always underestimates σ. This underestimation requires a small correction factor in the equation for the estimated standard deviation (s). While the equations for σ and σ̂ have the sum of squares divided by N or n,⁴ the equation for the estimated standard deviation (s) has the sum of squares divided by n - 1. Why n - 1? It has to do with the Central Limit Theorem, and you really don't want to know. (Okay, for those who do: the selection of a sample of n scores from the population reduces by one the number of n-sized samples that can be drawn from the
³ Some textbooks refer to the sample standard deviation as "sigma-tilde" (σ̃).
⁴ Often, N and n are used interchangeably to refer to the number of scores in a set. Other times, N refers to the number of scores in a population, and n to the number of scores in a sample.
population. This reduces the number of degrees of freedom of the population by one. We'll talk more about degrees of freedom in a few chapters.) So, we have three sets of formulas:

Population parameters:   μ = ΣX / N      σ = √(Σx² / N)
Sample statistics:       X̄ = ΣX / n      σ̂ = √(Σx² / n)
Estimated parameters:    X̄               s = √(Σx² / (n - 1))

Mean and standard deviation are common concepts across the three versions, but there are important differences to note. Notice the use of N for parameters and n for samples. Match the formulas with the diagram above.
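The N versus n - 1 distinction can be seen directly in Python, whose standard library provides both versions (statistics.pstdev divides by n; statistics.stdev divides by n - 1):

```python
import math
import statistics

# Population vs. estimated standard deviation (n vs. n - 1 divisor).
sample = [10, 20, 30, 40, 50]
n = len(sample)
xbar = sum(sample) / n
ss = sum((x - xbar) ** 2 for x in sample)   # sum of squares = 1000

sigma_hat = math.sqrt(ss / n)        # divides by n     (sample statistic)
s = math.sqrt(ss / (n - 1))          # divides by n - 1 (estimated parameter)

# The standard library agrees:
assert math.isclose(sigma_hat, statistics.pstdev(sample))
assert math.isclose(s, statistics.stdev(sample))
```

Note that s comes out larger than sigma_hat, reflecting the correction for underestimation.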
A raw score (X) from a population with mean μ and standard deviation σ is transformed into a standard score (z) with the formula

z = (X - μ) / σ

The equation is pronounced "z equals X minus mu over sigma." In English, the formula means that a standardized score is equal to a raw score minus the population mean, divided by the population standard deviation. A raw score (X) from a sample that has a mean of X̄ and an estimated standard deviation of s is transformed into a standardized scale score (z) with the following formula:

z = (X - X̄) / s
The equation is pronounced "z equals X minus X-bar over s." In English, the formula means that a standardized score is equal to a raw score minus the sample mean, divided by the estimated standard deviation. Both formulas reflect the same relationship between a raw score and a standardized score⁵ in a distribution of numbers. The distinction is whether the distribution is a sample or a population. Notice that the values for mean and standard deviation are both part of the transformation formula. No matter what these parameters are, the standardized scores are plotted on a z-scale which looks like this:

⁵ Upper left: Two distributions have the same mean and standard deviation. Upper right: Two distributions have the same standard deviation, but different means. Lower left: Two distributions have the same mean, but different standard deviations. Lower right: Two distributions have different means and standard deviations.
For a standardized scale, the mean is always zero and the standard deviation is always one. The z-score equations transform any group of scores into these standardized values. Let's look at an example of how z-scores facilitate comparison between scores. John is taking Hebrew and Research. On his midterm exams, he made an 85 in Hebrew and an 80 in Research. On which exam did he do better? It seems obvious that he did better in Hebrew than in Research. But the real answer is not so easy. To compare his performance on the two tests, we must take into consideration how well his classmates as a whole did. That is, we need to know the means and standard deviations for the two exams. Here's the information we need:

                          Hebrew    Research
mean (μ)                  80        70
standard deviation (σ)    10        5

Now compute z-scores for Hebrew (zh) and Research (zr):

zh = (85 - 80) / 10 = 0.50        zr = (80 - 70) / 5 = 2.00
Notice several things about the diagram above. First, since the z-scores from Hebrew and Research now fall on the same standardized scale, we can directly
4th ed. 2006 Dr. Rick Yount
compare them. It is clear from the scale that John did much better in Research, scoring two standard deviations above the mean, than he did in Hebrew, where he scored only one-half standard deviation above the mean. Second, notice that the means of both classes line up on a z-score of 0. In standardized scores, the mean is always 0. Third, notice that a distance of 1 on the z-scale is equivalent to 10 points in Hebrew and 5 points in Research, one standard deviation in each class. Fourth, notice that John's score of 85 in Hebrew falls directly below 0.50 on the z-scale. His score of 80 in Research falls directly below 2.00 on the z-scale. Standardized scores lie at the heart of inferential statistics. These basic building blocks provide the foundation for procedures we'll study soon.
Summary
The three measures of central tendency are the mode, the median, and the mean. These refer, respectively, to the most frequent score, the middlemost score, and the arithmetic average of a group of scores. In terms of statistical analysis, the mean is by far the most important of the three, and the most affected by skewed distributions. Three measures of variability are the range, average deviation, and standard deviation. The standard deviation (and its squared cousin, variance) is the most important of the three. The two characteristics of mean and standard deviation can be combined to transform a raw score (X) into a standard score (z). Z-scores can be directly compared across groups, regardless of differing parameters.
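All of these summary measures are available in Python's standard statistics module. Here is a sketch using an illustrative set of scores:

```python
import statistics

scores = [65, 70, 70, 75, 85, 90, 95]     # illustrative data

mode = statistics.mode(scores)            # most frequent score
median = statistics.median(scores)        # middlemost score
mean = statistics.mean(scores)            # arithmetic average
s = statistics.stdev(scores)              # sample standard deviation
variance = statistics.variance(scores)    # s squared

# Combine mean and s to transform each raw score into a z-score
z_scores = [(x - mean) / s for x in scores]
```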
Example
In my Ed.D. dissertation, I analyzed how much learning of the doctrine of the Trinity occurred in Southern Baptist adults over a seven-week course. Cognitive tests were given at the beginning (Test 1), end (Test 2), end plus three months (Test 3), and end plus six months (Test 4).6 I was also interested in whether the mental abilities of the three groups were balanced. Here is one of my tables showing the means and standard deviations of these groups.7 You can notice several things immediately from the numbers below. The three groups' average mental ability, measured by the Otis-Lennon Mental Ability Test (maximum score: 80), fell within 0.90 points of each other. All three groups learned a great deal about the doctrine of the Trinity -- the Total N mean jumped 50.69 points over the seven weeks (Test #2 mean minus Test #1 mean). All three groups forgot some of what they learned, dropping an average of 11.48 points over three months and 17.98 points over six months. Are these means significantly different? We will learn how to answer this question in Chapter 20.
6 William R. Yount, "A Critical Comparison of Three Specified Approaches to Teaching Based on the Principles of B. F. Skinner's Operant Conditioning and Jerome Bruner's Discovery Approach in Teaching the Cognitive Content of a Selected Theological Concept to Volunteer Adult Learners in the Local Church" (Fort Worth: Southwestern Baptist Theological Seminary, 1978), 41-42.
7 Ibid., 168.
APPENDIX XI
Means and Standard Deviation Scores

                 Total N          X              Y              Z
MENTAL ABILITY   59.96* 15.58+   59.71 16.55    59.67 19.13    60.57 11.30
TEST #1          24.70   8.01    25.57  4.79    23.44  8.80    25.43 10.26
TEST #2          75.39  15.40    81.43  8.02    78.44 14.85    65.43 18.41
TEST #3          63.91  13.91    66.00 12.36    67.78 15.97    56.86 11.44
TEST #4          57.41  11.56    61.00  9.81    59.22 11.58    52.29 11.86

*Mean   +Standard Deviation
Vocabulary
X̄: "X-bar," the average or mean of a group of scores (sample)
μ: "mu," the average or mean of a group of scores (population)
σ²: "sigma-squared," the population variance
σ: "sigma," the population standard deviation
σ̂²: "sigma-hat squared," sample variance
σ̂: "sigma-hat," sample standard deviation
average deviation: Σ|x|/n, the sum of the absolute values of deviation scores, divided by n
average: sum of scores divided by the number of scores
central tendency: focal point of scores: mean, median, mode
estimated parameter: X̄ and s, computed from a sample, used to infer population parameters
mean: average score
median: middlemost score
mode: most frequent score
n: number of scores (sometimes used to refer to one group within an experiment)
N: number of scores (sometimes used to mean the entire experiment)
parameter: population measurements (μ, σ)
range: distance between highest and lowest scores in a group
standard deviation: standardized measure of variation in scores: s
statistics: sample measurements (X̄ and s)
sum of squares: sum of squared deviation scores
variability: measure of spreadedness in a group of scores
variance: measure of spreadedness in squared units
x: deviation score, the difference between a score (X) and the mean (X̄ or μ)
X: raw score, e.g., a test score
z-score: standardized score which reflects both μ and σ (or X̄ and s)
Ibid., 169
Study Questions
1. What are the modes for the sets of scores below?
a. 1 2 3 4 5 6 6 7 8 9   Mode: ____
b. 1 2 3 4 5 6 6 7 8 8   Mode: ____
c. 1 1 2 2 3 3 4 4 5 5   Mode: ____
2. What are the medians for the following data sets?
a. 10 15 20 22 27 29 33   Md: ____
b. 3 7 78 45 2 56 4 7   Md: ____
3. Compute the mean, sum of squares (use deviation method), variance and standard deviation for the following scores: 65 70 70 75 85 90 95
4. Using the scores in #3, compute the sum of squares with the raw score method.
5. You have taken midterm exams. Your score in New Testament Survey was 75. Your score in Principles of Teaching was 90.

         NTS     PT
n        100     25
ΣX      7020   2175
Σx²     2500    225
a. Compute means for both classes. b. Compute standard deviations (s) for both classes. c. Transform your midterm scores into z-scores. d. Plot your standard scores on a z-scale. Include the appropriate raw score scale values for the two classes. e. In which class did you do better? Explain how you know this.
Chapter 17
17
The Normal Curve and Hypothesis Testing
The Normal Curve Defined Level of Significance Sampling Distributions Hypothesis Testing
In the last chapter we explained the elementary relationship of means, standard deviations, and z-scores. In this chapter we extend this relationship to include the Normal Curve, which allows us to convert z-score differences into probabilities. On the basis of the laws of probability, we can make inferences from sample statistics to population parameters and make decisions about differences in scores. The chapter is divided into the following sections: The Normal Curve Defined. What is the nature of the Normal Curve? How do the Normal Curve and its associated distribution table link z-scores with area under the curve? How does area under the curve relate to the concept of probability? Level of Significance. What do the terms level of significance and region of rejection mean? What is alpha (α)? What is a critical value? The Sampling Distribution. What is a sampling distribution? How does it differ from a frequency distribution? Hypothesis Testing. How do we statistically test a hypothesis?
Recall that the mean of the z-scale equals zero and that the scale extends, practically speaking, 3 points in either direction. Each point on the z-scale equals one standard deviation away from the mean. A score of 100 in John's Hebrew class equals 2 standard deviations above the mean (mean=80, s=10, z=+2.0). A score of 55 in John's Research class equals 3 standard deviations below the mean (mean=70, s=5, z=-3.0). The z-scale assumes that the distribution of standardized scores forms a bell-shaped curve, called a Normal Curve. The normal curve is plotted on a set of X-Y axes, where the X-axis represents, in this case, z-scores and the Y-axis the frequency of z-scores. It looks like the diagram at left. The area between the bell and the baseline is a fixed area, which equals 100 percent of the scores in the distribution. We will use this area to determine the probabilities associated with statistical tests. There is an exact and unchanging relationship between the z-scores along the x-axis and the area under the curve. The area under the curve between z = ±1 (read "z equals plus or minus 1") standard deviation is 68% of the scores (p=0.68). The area between ±2 standard deviations is about 95%, or 0.95 of the curve.
The tails of the distribution theoretically extend to infinity, but 99.7% of the scores fall between z = ±3.00. Now, let's use the normal curve in a practical way with John's classes. We can use the information in the diagram above to answer questions about John's classes. Example 1: How many Hebrew students scored between 70 and 90?
For the Hebrew class, a score of 70 equals a z-score of -1 and a 90 equals a z-score of +1. The area under the normal curve between -1 and +1 is 68%. Therefore, the proportion of students in Hebrew scoring between 70 and 90 is 0.68. How many students is that? Multiply the proportion (p=0.68) times the number of students in the class (60). The answer is 40.8. Rounding to the nearest whole student, we would say that 41 Hebrew students fall between 70 and 90 on this test. Example 2: How many research students scored between 60 and 80?
For the research class, a score of 60 equals a z-score of -2; an 80 equals a z-score of +2. The area under the curve between -2 and +2 is about 95%. Therefore, the proportion of the students in Research scoring between 60 and 80 is 0.95. How many students is that? (0.95)(40) = 38. We would say that 38 research students fall between 60 and 80 on this test.
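Both examples can be checked with NormalDist from Python's standard statistics module (the class sizes of 60 and 40 are the chapter's; the exact area between ±2 standard deviations is 95.45%):

```python
from statistics import NormalDist

std = NormalDist()   # standard normal curve: mean 0, sd 1

def area_between(z_low, z_high):
    """Proportion of the normal curve between two z-scores."""
    return std.cdf(z_high) - std.cdf(z_low)

# Example 1: Hebrew scores between 70 and 90, i.e. z = -1 to +1
p1 = area_between(-1, 1)
print(round(p1, 2), round(p1 * 60))   # 0.68 41

# Example 2: Research scores between 60 and 80, i.e. z = -2 to +2
p2 = area_between(-2, 2)
print(round(p2, 2), round(p2 * 40))   # 0.95 38
```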
What is the area under the curve between the mean and z = -1.65? The normal curve is symmetrical, which means that the negative half mirrors the positive half. We can find the area under the curve for negative z-scores as easily as we can for positive ones. Look down the column for the row labelled 1.6 and then across to the column labelled .05. Where these cross you will find the answer: 0.4505. Forty-five percent (45%) of the scores of a group fall between the mean and -1.65 standard deviations from the mean. An excerpt from the Normal Curve table shows the lookup:

  z     ...    .05
  1.6   ...   .4505
How many Hebrew students scored higher than John? Our first step is to compute the z-score for the raw score of 85, which we have already done. We know that the standard score for John's Hebrew score of 85 is zh = +0.50 (diagram on 16-11 and 17-1). The second step is to draw a picture of a normal curve with the area we're interested in. Notice that I've lightly shaded the area to the right of the line labelled z = 0.5. This is because we want to determine how many students scored higher than John. Since higher scores move to the right, the shaded area, which is equal to the proportion of students, is what I need. But just how much area is this?
Look at the Normal Curve Table for the proportion linked to a z-score of 0.5. Down the left column to "0.5." Over to the first column headed ".00." The area related to z=0.5 is 0.1915. I have shaded this area darker in the diagram below. Our lightly shaded area is on the other side of z=0.5! The area under the entire Normal Curve represents 100% of the scores. Therefore, the area under half the curve, from the mean outward, represents 50% (0.5000) of the scores. So, the lightly shaded area in the diagram is equal to 0.5000 minus 0.1915, or 0.3085.
So we know that 30.85% of the students in John's Hebrew class scored higher than he did. How many students is that? Multiplying 0.3085 (proportion) times 60 (students in class) gives us 18.51, or 19 students. Nineteen of 60 students scored higher than John on the Hebrew exam. Here's another. John scored 80 in Research. How many students scored lower than this? We've already computed John's z-score in Research as +2.00. The area under the curve between the mean and z = 2.00 is 0.4772. Find 0.4772 in the Table.
Since John also scored higher than all the students in the lower half of the curve, we must add the 0.5000 from the negative half of the curve to the 0.4772 value of the positive half to get our answer. So, 97.72% of the students in John's research class scored lower than he did. How many students is this? It is (40 × 0.9772 = 39.09) 39 students. Here's an example which takes another perspective. We've used the Normal Curve table to translate z-scores into proportions. We can also translate proportions into z-scores. Take this question: What score did a student in John's Hebrew class have to make in order to be in the top 10% of the class? We start with an area (0.10) and work back to a z-score, then compute the raw score (X) using the mean and standard deviation for the group. Draw a picture of the problem, like the one below.
We have cut off the top 10% of the curve. What proportion do I use in the Normal Curve table? We know we want the upper 10%. We also know that the table works from the mean out. So, the z-score that cuts off the upper 10% must be the same z-score that cuts off 40% of the scores between itself and the mean (50% - 10% = 40%). The proportion we look for in the table is 0.4000. Search the proportion values in the table and find the one closest to .4000. The closest one in our table is 0.3997. Look along this row to the left. The z-score value for this row is 1.2. Look up the column from 0.3997 to the top. The z-score hundredth value is .08. The z-score which cuts off the upper 10% of the distribution is 1.28.

  z     ...    .08
  1.2   ...   .3997
The z-score formula introduced in Chapter 16, z = (X - X̄) / s, yields a z-score from a raw score when we know the mean and standard deviation of a group of scores. This z-score formula can be transformed into a formula that computes X from z. Multiply both sides of the z-score formula by s and add X̄. This produces X = z·s + X̄. Do you see how the two equations are the same? One solves for z and the other for X.
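Both directions of the transformation can be sketched with the standard library; the mean of 80 and s of 10 are the Hebrew class values from the example:

```python
from statistics import NormalDist

std = NormalDist()

# Proportion to z: the z-score cutting off the upper 10% of the curve
z_cut = std.inv_cdf(1 - 0.10)    # about 1.28, matching the table lookup

# z to raw score: X = z*s + mean (the transformed z-score formula)
hebrew_mean, hebrew_s = 80, 10
cutoff = z_cut * hebrew_s + hebrew_mean
print(round(cutoff, 1))          # 92.8
```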
A student had to make 92.8 or higher to be in the upper 10% of the Hebrew class (X = 1.28 × 10 + 80 = 92.8). These examples may seem contrived, but they demonstrate basic skills and concepts you'll need whenever you use parametric inferential statistics. Learn them well and become fluent in their use, because you'll soon be using them in more complex, but more meaningful, procedures.
Level of Significance
John's Hebrew score was different from the class mean, but was the difference greater than we might expect by chance? Or, as a statistician would ask it, was the score significantly different? John's research score was different from the class mean, but was it significantly different?
Critical Values
We determine whether a difference is significant by using a criterion, or critical value, for testing z-scores. The critical value cuts off a portion of the area under the normal curve, called the region of rejection. The proportion of the normal curve in the region of rejection is called the level of significance. Level of significance is symbolized by the Greek letter alpha (α).
In this example, the critical value of 1.65 cuts off 5% of the normal curve. The level of significance shown above is α = 0.05. Any z-score greater than 1.65 falls into the region of rejection and is declared "significantly different" from the mean. Convention calls for the level of significance to be set at either 0.05 or 0.01.
So now we return to our question at the beginning of this section. Did John score significantly higher than his class averages in Research and Hebrew? Since this is a directional hypothesis, we'll use a 1-tail test, with α = 0.05. Under these conditions, John had to score 1.65 standard deviations above the mean in order for his score to be considered "significantly different." In Research, John scored 2.00 standard deviations above the mean. Since 2.00 is greater than 1.65, we can say with 95% confidence that John scored significantly higher in Research than the class average. In Hebrew, John scored 0.5 standard deviations above the mean. Since 0.5 is less than 1.65, we conclude that John did not score significantly higher in Hebrew than the class average. Our discussion to this point has focused on single scores (e.g., John's exam grades) within a frequency distribution of scores. While this has provided an elementary scenario for building statistical concepts, we seldom have interest in comparing single scores with means. We have much more interest in testing differences between a sample of scores and a given population, or between two or more samples of scores. Among the example Problem Statements in Chapter 4, you saw Group 1 versus Group 2 types of problems. This requires an emphasis on group means rather than subject scores, on sampling distributions rather than frequency distributions.
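The decision rule just applied can be sketched as a simple comparison against the critical value (a 1-tail test at α = 0.05, as above):

```python
# One-tail test: a z-score is "significantly higher" when it exceeds
# the critical value of 1.65 (alpha = 0.05).
CRITICAL_Z = 1.65

def significantly_higher(z):
    return z > CRITICAL_Z

print(significantly_higher(2.00))   # True: the Research score is significant
print(significantly_higher(0.50))   # False: the Hebrew score is not
```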
--- --- Warning! This transition from scores to means is the most confusing element of the course --- ---
Sampling Distributions
A distribution of means is called a sampling distribution, which is necessary in making decisions about differences between group means. Just as naturally occurring scores fall into a normal curve distribution, so do the means of samples of scores drawn from a population. The normal curve of scores forms a frequency distribution; the normal curve of means forms a sampling distribution. Look at the diagram at right. Here we see three samples drawn from a population. All three sample means are different, since each group of ten scores is a distinct subset of the whole. The variability among these sample means is called sampling error. Even though we are drawing equal-sized groups from the same population, the means differ from one
another and from the population mean. Differences between means must be large enough to overcome this "natural" variation to be declared significant. If we were to draw 100 samples of 10 scores each from a population of 1000 scores, we would have 100 different mean scores. These 100 sample means would cluster around the population mean in a sampling distribution, just as scores cluster around the sample mean in a frequency distribution. If we were to compute the "mean of the means," we would find it would equal the population mean. The two characteristics which define a normal frequency distribution are the mean and standard deviation. These same characteristics define a sampling distribution. The mean of a sampling distribution is the population mean (μ), if it is known. If it is unknown, then the best estimate of the mean is one of the sample means (X̄). The standard deviation of the sampling distribution, called the standard error of the mean, is equal to the standard deviation of the population (σ) divided by the square root of the number of subjects in the sample (n): σx̄ = σ / √n.
If the population standard deviation (σ) is unknown (which is usually the case), we must estimate it. In this case, the formula for the standard error of the mean is based on the estimated standard deviation (s): sx̄ = s / √n.
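As a sketch, the standard error formula in Python; the values s = 0.7 and n = 100 match the renovation example used in this chapter:

```python
import math

def standard_error(s, n):
    """Standard error of the mean: s divided by the square root of n."""
    return s / math.sqrt(n)

print(round(standard_error(0.7, 100), 2))   # 0.07
```

Notice that quadrupling n only halves the standard error.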
Research hypotheses are also called "alternative" hypotheses -- hence the reference Ha
The sampling distribution diagram is centered on the hypothesized mean of 4.00 at z = 0, with σx̄ = 0.07, and shows the computed raw mean-labels for each of the z-scores (3.79 to 4.21). The x-axis of the sampling distribution reflects means, not scores. Notice also in the diagram at left that the differences between the mean-labels are much smaller than those between the score-labels. This is because the standard error of the mean (σx̄ = 0.07) is much smaller than the standard deviation of the sample (s = 0.7). Using a 1-tail test with α = 0.05, the critical value needed to reject H0: μ = 4.00 is -1.65. The area cut off by this critical value is shown shaded gray at left. Since the sample mean (3.8) falls into this area (beyond the dotted line), we declare that 3.80 is significantly lower than 4.00. Translating into English, we can say that the church at large does have a negative attitude toward building renovation. Let's now look at a slightly modified form of the z-score formula to compute the exact z-score for X̄, as well as the exact probability of getting such a mean, all using the same procedure and Normal Curve table that we used before.
The formula z = (X̄ - μ) / σx̄ converts a mean into a z-score, given μ and σx̄. Sampling distribution z-scores are tested for significance just as we did with frequency distribution z-scores. Substituting 4.0 for μ, 3.8 for X̄, and 0.07 for σx̄, we have z = (3.8 - 4.0) / 0.07 = -2.85.
The mean of 3.80 is 2.85 standard errors below the hypothesized mean of 4.00. In order to be significant (1-tail, α = 0.05), the z-score must be 1.65 standard errors or more from the mean. Since -2.85 is farther from the mean than -1.65, we reject the null hypothesis and accept the alternative: "The congregation has a negative attitude toward renovation." In hypothesis testing, the null hypothesis is either retained (no significant difference) or it is rejected (significant difference). There are no partial decisions. Moving back to the sampling distribution diagram, make a note that the dotted
line representing X̄ = 3.8 is 2.85 standard errors below μ = 4.0. Our finding is the same (3.8 falls into the shaded area), but we obtain a specific z-score, a more accurate measurement, by using the formula.
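The whole test can be sketched in a few lines, using the renovation example's values; note the exact quotient is -2.86, which the chapter rounds to -2.85:

```python
# z-test for a sample mean: z = (sample mean - mu) / standard error
def z_for_mean(sample_mean, mu, sem):
    return (sample_mean - mu) / sem

z = z_for_mean(3.8, mu=4.0, sem=0.07)
print(round(z, 2))    # -2.86
print(z < -1.65)      # True: reject the null (1-tail, alpha = 0.05)
```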
Summary
In this chapter we have introduced you to the process of testing hypotheses of parametric differences by way of the Normal Curve. We have differentiated between frequency and sampling distributions, and introduced the formula for computing the standardized score z. Because our hypothesis decisions (reject or retain H0) are based on probabilities (necessary since we work with sampling error and inferences), our results are always subject to errors. Such is science: hypotheses, data gathering, speculations, probabilities of findings. Our goal through proper research design and statistical analysis is to minimize errors and maximize "true findings." We will take up the topic of error rates and power in the next chapter.
Example
Dr. Robert DeVargas studied the change in moral judgment in students (3% sample, N=360) who used the Lessons in Character curriculum adopted by the Fort Worth I. S. D. for the 1996-1997 school year.2 While much of his statistical analysis is far beyond the scope of this chapter, notice in his writing below the use of level of significance as a benchmark for his findings.
Analysis of the fifth grade test data proceeded with the following steps: 1. An Analysis of Covariance (ANCOVA) was performed upon the post-test means of the treatment and control groups using the pre-test scores as a covariate variable. The mean of the control group post-test was 2.19 (n=30). The mean of the treatment group post-test was 2.18 (n=31). The ANCOVA procedure produced an F value of 0.163, giving a significance of p=0.688. The critical value Fcv(1, 60, α=.05) = 4.00. ... [3b] The treatment group's mean pre-test score equaled 2.0535 and the mean post-test score was 2.1835; the mean difference was 0.13 (n=31, SD=0.316). The standard error of the difference was 0.057, giving t = 2.29. The critical value tcv(df=30, 1-tail, α=.05) = 1.697.3 ... The step 3b analysis performed on the pre- and post-test means of the treatment group calculated the t value to be 2.29. Comparison to the critical value. . .reveals that. . .there exists a significant difference between the pre- and post-test scores. . . .it can be stated that the treatment [of moral judgement curriculum] made a significant difference in the level of moral judgement between the pre- and post-test scores of the treatment group.4
2 Robert DeVargas, "A Study of Lessons in Character: The Effect of Moral Judgement Curriculum Upon Moral Judgement" (Ph.D. diss., Southwestern Baptist Theological Seminary, 1998).
3 Ibid., 75.
4 Ibid., 78.
Vocabulary
Alpha (α): probability of rejecting a true null hypothesis; the level of significance (1% or 5%)
Critical value: value beyond which the null hypothesis is rejected
Frequency distribution: categorization of scores into classes and class counts
Level of significance: probability of rejecting a true null; symbolized by α
Normal curve: symmetrical mesokurtic (bell-shaped) distribution of scores
One-tail test: hypothesis test which places α in only one tail of the distribution
Region of rejection: area under the normal curve beyond the critical value
Reject null: the decision of "statistically significant difference"
Retain null: the decision of "no statistically significant difference"
Sampling error: random differences among the means of randomly selected groups
Sampling distribution: normal curve distribution of sample means within a population
Standard error of the mean: the standard deviation of a sampling distribution of means
Two-tail test: hypothesis test which places α/2 in both tails of the distribution
Study Questions
1. Define the following terms. Think of how you would explain them to someone in the class: inferential statistics; level of significance; sampling distribution; research hypothesis (Ha); Normal Curve table; p; null or statistical hypothesis (H0); 1- and 2-tail tests; standard error of the mean; directional hypothesis; non-directional hypothesis.
2. Determine the area under the normal curve between the following z-scores. Draw out the problems with the following diagrams. A. z = 0 and z = 2.3 B. z = 0 and z = -1.7
3. You go to an associational picnic. The men decide to have a contest to see who can throw a softball the farthest. You record the distances the ball is thrown, and compute the mean (X̄ = 164 feet) and the standard deviation (s = 16 feet). There were 100 men who threw the ball. Answer the following questions: A. With this information, sketch a normal curve and label it with z-scores and raw scores, mean and standard deviation. B. How many men threw the ball 180 feet or more? 120 feet or less? C. How far did one have to throw the ball to be in the top 10%? D. What distances are so extreme that only 1% of the men threw this far? E. Sketch a sampling distribution based on the mean, s, and n given above. F. A sister association joins the fellowship and challenges your association with a ball-toss of their own. Their average distance was 170 feet. Did they throw significantly better than your association? The following diagrams will help you work through problem 3.
Chapter 18
18
The Normal Curve Error Rates and Power
Type I and Type II Error Rates Increasing Statistical Power Statistical versus Practical Significance
Because decisions to reject or retain null hypotheses are based on probabilities (necessary since we work with sampling error and inferences), our results are always subject to errors. Such is science: problem, hypothesis, data gathering, analysis, and probabilities of findings. Our goal through proper research design and statistical analysis is to minimize errors and maximize "true findings." We complete our journey through the Normal Curve and hypothesis testing with considerations of error rates and power. The chapter is divided into the following sections: Error Rates. What are Type I and Type II Error Rates? How can we reduce the likelihood of committing Type I and Type II errors? Power. What is statistical power? How do we increase the power of a statistical test? Statistical versus Practical Significance. What is the difference between statistical significance and practical significance?
The four possible outcomes of a hypothesis test can be summarized in a decision table. The columns reflect the real-world condition of the null hypothesis; the rows reflect our statistical decision; each cell holds the probability of that outcome:

                Null actually true         Null actually false
Retain null     A: Correct (p = 1 - α)     C: Type 2 error (p = β)
Reject null     B: Type 1 error (p = α)    D: Correct (p = 1 - β)
Consider two overlapping sampling distributions: Curve 1, drawn around our own population mean (μ1), and Curve 2, drawn around a population mean higher than our own. The critical value line of Curve 1 cuts off the region of rejection (light gray area). This area is equal to α, the probability of committing a Type 1 error: rejecting a true null. Any mean falling into this region is declared significantly higher than μ1. The dark gray area to the left of the critical value is the region of non-rejection and is equal to 1-α, the probability of retaining a true null. If we set α at 0.05, then 1-α equals 0.95. This means that when we declare a difference significant, we are 95% sure of our decision. But notice that the critical value line also cuts off the lower part of Curve 2 (light gray). This area is symbolized by β, the probability of committing a Type 2 error: retaining a false null. Any mean from distribution 2 (which should be declared different) falling in this area will be declared not significantly higher than μ1. The dark gray area to the right of the critical value line equals 1-β, the probability of rejecting a false null (power). Now let's put the curves and boxes together with the labels A, B, C, and D. The diagram at right shows two sample means (A, B) which are true nulls* (arrows show they belong with population 1). Mean A falls to the left of the critical value and is declared not significantly different from μ1. This is a correct decision, and reflects box A (p = 1-α). Mean B falls beyond the critical value and is declared significantly different from μ1. This is a Type 1 error and reflects box B (p = α). In the second diagram at right, we have means C and D which are both false nulls* (arrows show they belong with population 2). Mean C falls to the left of the critical value and is declared not significantly different from μ1. This is a Type 2 error, and reflects box C (p = β). Mean D falls to the right of the critical value and is declared significantly different from μ1. This is a correct decision and reflects box D (p = 1-β).
Review the decision table and these diagrams until you can see the correspondence between the two. *Of course, we can never really know what the "real world" conditions are, whether the nulls are actually true or false. In the previous examples, I was giving you hidden information in order to establish the four possibilities in hypothesis testing. We are left with the tasks of gathering reliable and valid samples of data, applying statistical procedures, and making decisions of outcome based on probability. But this is still a wonderful mechanism for solving problems. Understanding the
dynamics of hypothesis testing -- error rates, power, z-scores -- is like any other kind of under-the-hood, behind-the-scenes knowledge. It provides insight into how things work, sophistication about what doesn't, and calm assurance that a research design -- whether our own or one found in the literature -- is what it purports to be. Such understanding elevates us from blind users to savvy consumers of research findings. Further, it matures us as competent research designers. At the heart of this competence lies the ability to improve our chances of making a correct decision even before we send out instruments or determine the samples we'll study. A statistician would ask it this way: How can I increase the power of my study? Let's take a look at some possibilities.
Increase α
The level of significance (α) directly controls the size of the critical value used to declare a difference significant. As we increase α, we reduce the critical value. As the critical value is reduced, the likelihood that a mean will be declared significantly different from the mean increases. In other words, the probability of declaring a null hypothesis false (power) increases simply because we reduced the critical value. The diagrams at left show how this happens. Notice the position of the critical value line in the upper diagram at left. It cuts off the curve at z = 1.65 (α = 0.05). If we increase α to 0.10, the cut-off line moves to the left, to z = 1.28. As the critical value line moves to the left, the area labelled power increases by the lighter segment shown in the lower diagram. However, we have increased power (1-β) by increasing α, the Type 1 error rate. This is simply robbing Peter to pay Paul. It does not improve the overall research design to increase the probability of Type 1 errors as we decrease the probability of Type 2 errors. It is better to remain with the conventional values of 0.05 or 0.01 for α.
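A quick sketch with the standard library shows how raising α lowers the one-tail critical value (the exact value for α = 0.05 is 1.645, which the printed table rounds to 1.65):

```python
from statistics import NormalDist

std = NormalDist()

# One-tail critical values for two levels of significance
crit = {alpha: round(std.inv_cdf(1 - alpha), 2) for alpha in (0.05, 0.10)}
print(crit)   # {0.05: 1.64, 0.1: 1.28}
```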
The power of a statistical test means is the probability of declaring a difference significant. Greater power means nothing more or less than a greater probability of declaring a value significant. It stands to reason that the probability of declaring a larger difference significant is greater than for declaring a smaller difference significant. The difference between population means in the upper diagram at left is smaller than the difference in the lower diagram. Look at the dramatic difference in power, reflected by the shaded areas in the diagrams.
Increase μ1 - μ2
Chapter 18
Several years ago a Ford Motor Company commercial featured a road test comparing a Lincoln and a Cadillac. They used 100 drivers. The Lincoln won the test (remember, it was a Ford commercial). But interesting to me was that researchers needed 100 persons to show the difference. Why so many? Because the difference between a Cadillac and a Lincoln is very small. Had one of the cars been a '67 Chevy taxi cab, distinguishing between the cars would have been easier, and the difference could have been firmly established with fewer subjects. It is reasonable to assume that as μ1 - μ2 increases, detecting the difference becomes easier. We can see this very fact in the formula for z itself. The equation for z has the term μ1 - μ2 in the numerator, so that as this difference increases, z increases, falling farther out from the mean and becoming more likely to be declared significant. The problem, of course, is that this discussion is purely theoretical, since we have no control over the size of the difference (μ1 - μ2). So let's turn our attention to elements we do have some control over.
Decrease s
The standard error of the mean (sx̄ = s/√n) is decreased by decreasing s. We do this by improving the precision and accuracy of the measurements of our sample(s). By designing better experiments, writing better tests, and using more reliable methods for collecting data, we squeeze some of the noise (extraneous, unsystematic variability) out of our data. By gathering more precise data, we can detect targeted differences more easily because we remove unwanted static from the process. Sloppy designs, poor instruments, and awkward data gathering should be replaced by clear designs, accurate and valid instruments, and precise data-gathering procedures. Decreasing s increases power without increasing the Type I error rate.
Increase n
The second way to decrease the standard error of the mean is to increase the sample size, n.
4th ed. 2006 Dr. Rick Yount
As the number of subjects increases, their individual differences (random noise) increasingly cancel each other out, allowing true differences to show through. The size of your sample(s) has a direct influence on the outcome of your study. If you study three approaches to counseling using groups of 10 subjects each, you may not have sufficient statistical power to declare the differences significant, even if they really exist! The same study, done with three groups of 30, might declare these real differences significant. If you use three groups of 1000, you may find significant differences that are, in a practical sense, trivial (see "practical importance" below). Because n is so potent an influence on power, you must use caution in selecting your sample size. You may want to consult an advanced statistics text to determine the size of sample(s) you need for your statistic, but for now, Dr. Curry's rule of thumb for sample size (Chapter 7) is a good place to start.
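The influence of n (and of s) on power can be made concrete with a short sketch. The numbers below are illustrative, echoing the renovation-attitude example of these chapters (a true difference of 0.2 scale points, s = 0.7, one-tail α = 0.05):

```python
from math import sqrt
from statistics import NormalDist

std_normal = NormalDist()

def z_test_power(alpha, diff, s, n):
    """Power of a one-tail z-test to detect a true difference `diff`,
    given score standard deviation `s` and sample size `n`."""
    se = s / sqrt(n)                          # standard error shrinks as n grows
    z_crit = std_normal.inv_cdf(1 - alpha)
    return 1 - std_normal.cdf(z_crit - diff / se)

base = z_test_power(0.05, 0.2, 0.7, 25)
print(round(base, 2))                                # modest power with n=25
print(round(z_test_power(0.05, 0.2, 0.7, 100), 2))   # same design, n=100: much stronger
print(round(z_test_power(0.05, 0.2, 0.35, 25), 2))   # or halve the noise s instead
```

Note that quadrupling n and halving s produce exactly the same gain, since both cut the standard error in half.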
Summary
We've come a long way in the last two chapters! We began in Chapter 17 with the standardized z-scale and linked it to the Normal Curve distribution table. We introduced the characteristics of the normal curve. We linked the concepts of z-score and area under the curve. We explained the concept of level of significance (α). We differentiated one- and two-tail statistical tests. We made the leap from frequency distributions (of scores) to sampling distributions (of means). We related the z-score equation to hypothesis testing with sampling distributions. In the present chapter we explained and illustrated the concepts of Type I and Type II error rates, as well as power. We tied these concepts to pictorial representations of where these error rates come from, as well as what they mean. Finally, we described practical ways to improve the statistical design of our studies. These two chapters lay the foundation for understanding and using the statistical procedures we'll discuss in the remainder of the book.
Vocabulary
Alpha (α): probability of rejecting a true null hypothesis
Beta (β, Type II): probability of retaining a false null hypothesis
Power: probability of rejecting a false null
Reject null: the decision of statistically significant difference
Retain null: the decision of no statistically significant difference
Type I error rate: probability of rejecting a true null
Type II error rate: probability of retaining a false null
Study Questions
1. Explain the following terms in English:
Type I error rate
Type II error rate
Power
Practical significance
Statistical significance
Level of significance
2. Draw from memory the 2x2 decision table, labelling the four headings and filling in the boxes with level of significance, error rates and power. Then draw four sets of paired normal curves and identify the areas under the curves which relate to each of the four cells in the decision table.
2. Greater power in a statistic means
A. more precision
B. less precision
C. more differences declared "significant"
D. fewer differences declared "significant"

3. The best way to increase power in your statistical design is to
A. lower the critical value of the test
B. increase α
C. increase the standard deviation of scores
D. use a larger sample

4. You want to be 99% confident of your decision that Sample mean A is different from Sample mean B. Which of the following will allow you to do this?
A. 1-tail test at α=0.01
B. 2-tail test at α=0.99
C. 1-tail test at α=0.99
D. 2-tail test at α=0.01

5. T F Results can have statistical significance without having practical significance.
6. T F The best way to decrease the probability of committing a Type II error in your study is to increase n and decrease "noise" in the scores.

7. T F The critical value for a 2-tail test, α=0.01, is 2.33.
Chapter 19

One-Sample Parametric Tests
The One-Sample z-Test The One-Sample t-Test The Confidence Interval
Dr. Roberta Damon developed a marital profile of Southern Baptist missionaries in eastern South America in 1985. Part of her study involved measuring whether SBC missionary husband-wife pairs differed from American society at large on the variable Couple Agreement on Religious Orientation, as measured by the Religious Orientation Test. She set α = 0.01 and decided to use a two-tail test. The American mean (μ) at the time was 56. The mean score (x̄) and estimated standard deviation (s) of her sample of 330 missionaries were 86.3 and 20.847 respectively. Applying the sampling distribution z-formula we introduced on page 17-9, she computed z as follows: z = (x̄ - μ)/(s/√n) = (86.3 - 56)/(20.847/√330) = 26.403
Remember that a two-tail test (α=0.01) requires only z=2.58 for declaring a difference significant. Here we see z=26.403. Southern Baptist missionaries serving in eastern South America in 1984, as a group, scored over 26 standard errors above the national mean in Couple Agreement1 on the Religious Orientation Scale! In Chapters 17 and 18 we explored the theoretical basis for hypothesis testing with sample means. Here in Chapter 19 we will use these principles to establish our first practical use for hypothesis testing. One-sample tests compare a population mean (μ) with a single sample mean (x̄). We have already seen the one-sample z-test in action. We will also discuss the one-sample t-test.
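Dr. Damon's computation is easy to reproduce from the summary statistics quoted above; a minimal Python sketch:

```python
from math import sqrt

mu = 56.0        # American population mean on the Religious Orientation Test
x_bar = 86.3     # missionary sample mean
s = 20.847       # estimated standard deviation
n = 330          # sample size

se = s / sqrt(n)          # standard error of the mean
z = (x_bar - mu) / se     # sampling-distribution z-formula
print(round(z, 3))        # ~26.403, far beyond the 2.58 critical value
```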
(which is usually the case), we use s to estimate σ, and so use the equation on the right (the more popular of the two).
The limitation to using s to estimate σ is whether the sample is large enough to approximate a normal curve. "Large enough" means at least n=30 subjects. The normal curve (z) table requires a normal distribution of scores in order to give accurate proportions under the curve. When a sample contains fewer than 30 scores, the requirement of normality is not met, and we cannot use the normal curve table or the z-test. In this case, use the one-sample t-test.
The logic behind the one-sample t is the same as we've used for the z-test, with one exception. A smaller n means lower power (see 18-5). Since the t-test is designed to be used with smaller n samples, the t-distribution critical value table assigns a slightly larger critical value to reflect the loss of power.
If you have 26 subjects in a study and use the one-sample t-test, then the degrees of freedom for the analysis is (n-1 = 26-1 =) 25. Recall that in the Normal Curve table we can find proportion values for any z-score. We focused on the four primary critical values of 1.65, 1.96, 2.33, and 2.58, but could calculate proportions for any z from 0 to 3. Look at the t-table under the column labelled 0.05 and move down to the bottom of the column, next to df = ∞. The value of the t-test critical value (1.645) is exactly the same as the value in the Normal Curve table (0.05, 1-tail test). To the right of 1.645 is the .025 column with the value "1.96." This is exactly the same as the 2-tail 0.05 Normal Curve value of z. The heading ".025" in the t-table refers to the α/2 area of the two-tail test: 0.05/2 = 0.025. The next value to the right is "2.326," the one-tail 0.01 value of z. And the next, "2.576." These are the same four essential critical values we studied for the normal curve: 1.65, 1.96, 2.33, and 2.58. The t-Distribution table differs from the Normal Curve table in that it contains nothing but critical values for a given level of significance and df. Each df row provides the critical value for a unique t-distribution. As we have just seen, for df = ∞, the t-distribution is the same as the Normal Curve distribution. As n decreases, the t-distribution becomes increasingly platykurtic. That is, the smaller the sample size, the flatter and wider the distribution. This pushes the critical values out farther on the tails, making significance harder to establish, as you can see in the diagram at right. Now choose the column in the t-table ending with "1.645." Move up the column and watch how the critical values increase. As df decreases, critical values increase. This is just another way of saying that as n decreases, power decreases. Thus, the t-table accounts for the lower power which derives from smaller n.
Computing t
Let's revisit the congregation study of attitude toward building renovation we did with the z-test in Chapter 17. Suppose our random sample is a small one, made up of only 25 members rather than 100. Let's use the same hypothesized population value of 4.0 and the same mean and standard deviation of 3.8 and 0.7 respectively. Computing t, we have t = (x̄ - μ)/(s/√n) = (3.8 - 4.0)/(0.7/√25) = -0.2/0.14 = -1.43
The critical value for α=.05, 1-tail, and df=24 is 1.711. Notice that this critical value for t, symbolized as tcv, is a larger value than the comparable z-test value of 1.65. But since our value of -1.43 is smaller (not as far out on the left tail) than -1.711, we retain the null hypothesis. While the difference between sample mean and hypothesized population mean is the same as before (-0.2), the standard error of the mean (sx̄) was twice as large -- 0.14 in the t-test as opposed to 0.07 in the z-test. This is due to the smaller number of subjects (n=25) in this sample as compared to the former one (n=100). Additionally, the critical value bar for the t-test (1.711) is higher than for the z-test (1.645). So the smaller sample size yields less power. The same hypothesis (H0: μ=4.0)
and the same difference (-0.2) yielded two different results. The z-test yielded a significant difference. The t-test did not. Why? Because in the second case we lacked sufficient power to declare the difference significant. The t-test does not correct for lack of power. It simply allows us to test samples too small for the z-test. Up to now we have used the z- and t- formulas to test single hypotheses. Another, less common, use for these formulas is in the creation of a confidence interval.
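The contrast between the two tests can be verified in a few lines (a sketch using the chapter's figures and the critical values quoted from the tables; a one-tail test is assumed, as in Chapter 17):

```python
from math import sqrt

mu_0, x_bar, s = 4.0, 3.8, 0.7   # hypothesized neutral point and sample statistics

for n, crit, label in [(100, 1.645, "z-test"), (25, 1.711, "t-test, df=24")]:
    se = s / sqrt(n)                      # standard error of the mean
    stat = (x_bar - mu_0) / se
    decision = "reject null" if abs(stat) > crit else "retain null"
    print(f"{label}: se = {se:.2f}, statistic = {stat:.2f} -> {decision}")
```

The same 0.2-point difference is significant with n=100 but not with n=25: the loss of power, in code.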
Confidence Intervals
The confidence interval offers another approach to making statistical decisions. In this approach, we set an interval around the population mean, bordered by confidence limits. We can then state, with a given degree of confidence, that the null hypothesis holds for any sample mean which falls within this interval, and fails for any which falls outside it. The benefit of using a confidence interval is that any number of sample means can be tested with only one computation.
The interval is given by CI95 = μ ± (z)(σx̄), where CI95 is the 95% confidence interval consisting of two endpoint values (x.x, x.x), μ is the population mean, z is the z-score for the given level of significance (in this case, α = 0.05, so z = 1.96), and σx̄ is the standard error of the mean. Now let's revisit the church attitude study again using n=100 subjects and a confidence interval based on the z-score. Using the formula above we have the following computation: CI95 = 4.0 ± (1.96)(0.07) = 4.0 ± 0.137, giving the interval (3.863, 4.137).
In the diagram at left you can see the church's mean score (n=100) of 3.8, and the confidence interval values of 3.863 and 4.137. The mean (3.8) falls outside the confidence interval. We therefore reject the null hypothesis (just as we did with the hypothesis test in Chapter 17). The church has a negative attitude toward renovation. But let's assume we tested twenty churches in our association.2 All we need do is calculate the mean score of each church and compare that to the interval "3.863 - 4.137." Any church mean falling below this interval
2 I've oversimplified this to focus your attention on the meaning of confidence interval. But for this to actually work as I've described it, we have to assume that all twenty churches produced the same standard deviation (s) on their attitude scores -- and this is an unreasonable assumption. If we were to actually do this study, we would compute the overall standard deviation from all twenty churches, and use that value to construct the confidence interval. Then, any church falling outside the interval would be significantly different from 4.00, the hypothesized neutral point. But even so, we make one computation, 20 comparisons.
reflects a significant negative attitude. Any church mean falling above the interval reflects a significant positive attitude. Any mean falling within the interval reflects no attitude (neutral). Confidence intervals are always based on two-tailed tests. The two values, 3.863 and 4.137, are called the confidence limits. The range of scores between the limits, shown at right by the two-headed arrow, is the confidence interval.
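Here is how the "one computation, twenty comparisons" screening might look in code (a sketch; the church means in the loop are invented for illustration):

```python
from math import sqrt

mu_0 = 4.0            # hypothesized neutral attitude score
s, n = 0.7, 100       # standard deviation and per-church sample size (assumed equal)
z = 1.96              # two-tail critical value, alpha = 0.05
se = s / sqrt(n)      # 0.07

lower, upper = mu_0 - z * se, mu_0 + z * se
print(round(lower, 3), round(upper, 3))       # the confidence limits

# One computation above, any number of comparisons below
for church_mean in (3.80, 4.05, 4.21):        # hypothetical church means
    if church_mean < lower:
        verdict = "significant negative attitude"
    elif church_mean > upper:
        verdict = "significant positive attitude"
    else:
        verdict = "neutral (retain the null)"
    print(church_mean, verdict)
```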
The flatter curve of the t-distribution forces the interval to be wider than the one we computed with z. Because of the larger standard error of the mean (sx̄), the t-score cut-off values are farther apart (0.14 vs. 0.07). Notice also that the sample mean of 3.8, which fell outside the z-score interval, falls inside the t-score confidence interval. This agrees with the hypothesis test on page 19-3. Since the mean of 3.8 falls within the confidence interval, it is declared not significantly different. The different result (z- vs. t-) is due directly to the different n's (100 subjects vs. 25 subjects). The loss of power with n=25 changed our significant finding to a non-significant finding.
Summary
This chapter is built on Chapters 17 and 18. The principal extensions we made in this chapter include the use of the t-distribution table, the concept of degrees of freedom, and the confidence interval. The next chapter extends these concepts still further: to testing mean differences between two samples.
Vocabulary
Confidence interval: the distance between the 2-tail critical values centered on the mean; the "region of acceptance (of the null)"
Degrees of freedom: the number of values free to vary given a fixed sum (n-1 for one group)
One-sample z-test: tests the difference between μ and x̄ when σ is known, or when σ is unknown and n > 30
One-sample t-test: tests the difference between μ and x̄ when σ is unknown and estimated by s (must be used when n < 30)
Remember: The one-sample t-test can be used with any sample size, which makes it one of the most popular statistics of difference.
Study Questions
1. A sample mean on an attitude scale equals 3.3 with a standard deviation of 0.5. There were 16 people tested in the group. Test the hypothesis: The group will score significantly higher than 3.0 on attitude X. (Use α=0.01)
A. State the statistical hypothesis.
B. Compute the proper statistic to answer the question.
C. Test the statistic with the appropriate table.
D. State your conclusion.
E. Establish a 99% confidence interval about the sample mean. (Careful here...)
F. Draw the sampling distribution and the confidence interval.

2. Repeat #1 above, but with a sample size of 49.

3. A study in 1980 revealed that the average salary of Southern Baptist ministers of education in Fort Worth was $29,000 (fictitious data). You randomly sample 28 ministers of education (1995) and find their average salary is $37,000 with a standard deviation of $3,000. Have salaries improved significantly?
A. Draw and label an appropriate sampling distribution.
B. State the research hypothesis.
C. State the statistical hypothesis.
D. Select the proper test and compute the statistic.
E. Test the statistic with the appropriate table.
F. Establish a 95% confidence interval about the sample mean.
G. Does this confidence interval agree with your hypothesis test? Explain how.
Chapter 20

Two-Sample t-Tests
t-Test for Two Independent Samples t-Test for Two Matched Samples Confidence Interval for Two-Sample Test
The next logical step in the analysis of differences between parametric variables (that is, interval/ratio data) is testing differences between two sample means. In one-sample tests, we compared a single sample mean (x̄) to a known or hypothesized population mean (μ). In two-sample tests, we compare two sample means (x̄ and ȳ). The goal is to infer whether the populations from which the samples were drawn are significantly different. Dr. Mark Cook studied the impact of active learner participation on adult learning in Sunday School in 1994.1 Using twelve adult classes at First Baptist Church, Palestine, he randomly assigned six classes to a six-session Bible study using active participation methods and six to the same study without them.2 Because he used intact Sunday School classes, he applied an independent-samples t-test to ensure that the two experimental groups did not differ significantly in Bible knowledge prior to the course. Treatment and control groups averaged 6.889 and 7.436 respectively. They did not differ significantly (t = -0.292, tcv = 2.009, df = 10, α = .05).3 At the conclusion of the six-week course, both groups again took a test of knowledge. Treatment and control groups averaged 10.111 and 8.188 respectively. Applying the independent-samples t-test, Dr. Cook discovered that the active participation group scored significantly higher than the control (t = 2.048, tcv = 1.819, df = 10, α = .05).4 The average class achievement of those taught with active learner participation was significantly higher than that of those taught without it.
Descriptive or Experimental?
Two-sample tests can be used in either a descriptive or experimental design. In a descriptive design, two samples are randomly drawn from two different populations. If the analysis yields a significant difference, we conclude that the populations from which the samples were drawn are different. For example, a sample of pastors and a sample of ministers of education are tested on their leadership skills. The results of this study will describe the difference in leadership skill between pastors and ministers of education. In an experimental design (like Dr. Cook's above), two samples are drawn from the same population. One sample receives a treatment and the other (control) sample
1 Marcus Weldon Cook, "A Study of the Relationship between Active Participation as a Teaching Strategy and Student Learning in a Southern Baptist Church" (Fort Worth, Texas: Southwestern Baptist Theological Seminary, 1994). 2 Ibid., pp. 1-2. 3 Ibid., p. 45. 4 Ibid., p. 50.
either receives no treatment or a different treatment. If the analysis of the post-test scores yields a significant difference, we conclude that the experimental treatment will produce the same results in the population from which the sample was drawn. The results from Dr. Cook's study can be directly generalized to all adult Sunday School classes at First Baptist Church, Palestine -- since this was his population -- and further, by implication, to all similar adult Sunday School classes: active learner participation in Sunday School improves learning achievement.
The independent-samples t is computed as t = (x̄ - ȳ) / s(x̄-ȳ), where the numerator equals the difference between the two sample means (x̄ - ȳ), and the denominator, called the standard error of difference, equals the combined standard deviation of both samples. Review: We have studied a normal curve of scores (the frequency distribution) where the standard score (z, t) equals the ratio of the difference between score and mean (X - x̄) to the distribution's difference within -- the standard deviation (s). See diagram 1 at left. We have studied a normal curve of means (the sampling distribution) where the standard score (z, t) equals the ratio of the difference between sample mean (x̄) and population mean (μ) to the distribution's difference within -- the standard error of the mean (σx̄). See diagram 2. In the independent-samples t-test, we discover a third normal curve -- a normal curve of differences between samples drawn from two populations. Given two populations with means μx and μy, we can draw samples at random from each population and compute the difference between them (x̄ - ȳ). The mean of this distribution of differences equals μx - μy (usually assumed to be zero) and the standard deviation of this distribution is the standard error of difference, s(x̄-ȳ). See diagram 3. Any score falling into the shaded area of diagram 1 is declared significantly different from the mean. Any mean falling into the shaded area of diagram 2 is significantly
5 When subjects are drawn from a population of pairs, as in married couples, and assigned to husband and wife groups, the groups are not independent, but matched or correlated. We will discuss the correlated-samples t-test later in the chapter.
different from μ. Any difference between means falling into the shaded area of diagram 3 is significantly different from zero, meaning that x̄ and ȳ are significantly different from each other. You will find the importance of this discussion in the logic which flows through all of these tests.
The purpose of this formula is to combine the standard deviations of the two samples (sx, sy) into a single "standard deviation" value: the standard error of difference. Written out in full, it is

s(x̄-ȳ) = √{ [ (Σx² - (Σx)²/nx + Σy² - (Σy)²/ny) / (nx + ny - 2) ] × [ (nx + ny) / (nx·ny) ] }

The first terms inside the radical follow the formula we established for the estimated variance (s²) in Chapter 16: one element computes the sum of squares for x, a similar element computes it for y, and these are added together. The [bracketed] term combines the n's of both samples into a harmonic mean and has the same mathematical function in the formula as "dividing by n," producing a combined variance. Mathematically, we add the sums of squares, divide by degrees of freedom, and multiply by the [harmonic mean]. The square root radical produces the standard-deviation-like standard error of difference. This equation spells out what the standard error of difference does, but it is tedious to calculate. If you have a statistical calculator that computes the mean (x̄) and variance (s²) from sets of data, you can use an equivalent equation much more easily. The equivalent equation substitutes the term (nx-1)sx² for the x sum of squares, and (ny-1)sy² for the y sum of squares. A quick review of the equation for s² shows these terms to be equivalent. The altered equation looks like this:

s(x̄-ȳ) = √{ [ ((nx-1)sx² + (ny-1)sy²) / (nx + ny - 2) ] × [ (nx + ny) / (nx·ny) ] }

Substituting the (n-1), s² (from the calculator), and n terms from both samples quickly simplifies the equation. Let's illustrate the use of these equations with an example.
Example Problem
The following problem will illustrate the computational process for the independent-samples t-test. Suppose you are interested in developing a counseling technique to reduce stress within marriages. You randomly select two samples of married individuals out of ten churches in the association. You provide Group 1 with group counseling and study materials. You provide Group 2 with individual counseling and study materials. At the conclusion of the treatment period, you measure the level of marital stress in the group members. Here are the scores:
Group 1: 25 17 29 29 26 24 27 33 23 14 21 26 20 27 26 32 20 32 17 23 20 30 26 12
Group 2: 21 26 28 31 14 27 29 23 18 25 32 23 16 21 17 20 26 23 7 18 29 32 24 19
Using a statistical calculator, we enter the 24 scores for each group and compute means and standard deviations. Here are the results:
Group 1: x̄ = 24.13, s = 5.64
Group 2: x̄ = 22.88, s = 6.14
Are these groups significantly different in marital stress? Or, asking it a little differently: Are the two means, 24.13 and 22.88, significantly different? Use the independent-samples t-test to find out. Here is the procedure. Step 1: State the research hypothesis. Step 2: State the null hypothesis. Step 3: Compute x̄ - ȳ, the numerator of the t-ratio: 24.13 - 22.88 = 1.25
Step 4: Compute s(x̄-ȳ), the denominator of the t-ratio (for these data, s(x̄-ȳ) ≈ 1.70).
Step 5: Compute t
Step 6: Determine tcv (α=.05, df=46, 2-tail). Degrees of freedom (df) for the independent-samples t-test is given by nx + ny - 2, or 24 + 24 - 2 = 46 in this case. Since the t-distribution table does not have a row for df=46, we must choose either df=40 or df=50. The value for df=40 is higher, and convention says it is better to choose the higher value (to guard against a Type I error). So, tcv (df=40, 2-tail, 0.05) = 2.021. Step 7: Compare: Is the computed t greater than the critical value? t = 0.736, tcv = 2.021. Step 8: Decide. The computed value for t is less than the critical value of t. It is not significant. We retain the null hypothesis. Step 9: Translate into English. There was no difference between the two approaches to reducing marital stress. The t-Test for Independent Samples allows researchers to determine whether two randomly selected groups differ significantly on a ratio (test) or interval (scale) variable. At the end of the chapter you will find several more examples of the use of independent-samples t in dissertations written by our doctoral students. You will often see journal articles that read like this: The experimental group scored twelve points higher on post-test measurements of subject mastery (t=2.751*, α=0.05, df=26, tcv=1.706) and eight points higher in attitude toward the learning experience (t=1.801*) than the control group. (The asterisk [*] indicates "significant difference.") If the two samples are formed from pairs of subjects rather than two independently selected groups, then the t-test for correlated samples should be used.
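The nine steps can be collapsed into a short script using the raw stress scores listed above and the pooled-variance formula developed earlier in the chapter. (Computed from the raw data, t comes out near 0.73; the text's 0.736 reflects rounding in the summary statistics.)

```python
from math import sqrt

# Marital-stress scores from the example problem (24 subjects per group)
group1 = [25, 17, 29, 29, 26, 24, 27, 33, 23, 14, 21, 26,
          20, 27, 26, 32, 20, 32, 17, 23, 20, 30, 26, 12]
group2 = [21, 26, 28, 31, 14, 27, 29, 23, 18, 25, 32, 23,
          16, 21, 17, 20, 26, 23, 7, 18, 29, 32, 24, 19]

def mean(xs):
    return sum(xs) / len(xs)

def sum_of_squares(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs)

nx, ny = len(group1), len(group2)
df = nx + ny - 2                                      # 24 + 24 - 2 = 46
pooled_var = (sum_of_squares(group1) + sum_of_squares(group2)) / df
se_diff = sqrt(pooled_var * (nx + ny) / (nx * ny))    # standard error of difference
t = (mean(group1) - mean(group2)) / se_diff

print(round(t, 2))      # ~0.73
print(abs(t) < 2.021)   # True -> retain the null (tcv for df=40, 2-tail, .05)
```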
The correlation between scores in matched samples controls some of the extraneous variability. In husband-wife pairs, for example, the differences in attitudes, socioeconomic status, and experiences are less for the pair than for husbands and wives in general. Or, when two scores are obtained from a single subject, the variability due to individual differences in personality, physical, social, or intellectual variables is less in the correlated samples design than with independent samples. Or, when a study pairs subjects on a given variable, it generates a relationship through that extraneous variable. In each case, more of the extraneous error (uncontrolled variability) in the study is accounted for. This reduction in overall error reduces the standard error of difference. This, in turn, increases the size of t making the test more likely to reject the null hypothesis when it is false (i.e., more powerful test). On the other hand, the degrees of freedom are reduced when using the correlated pairs formula. For independent samples, df is equal to nx + ny - 2. In the correlated samples, n is the number of pairs of scores and df is equal to n (pairs) - 1. Two groups of 30 subjects in an independent-samples design would yield df=30+30-2=58, but in a correlated samples design, df=30 pairs-1=29. The effect of reducing df is to increase the critical value of t, and therefore we have a less powerful test. The decision whether to use independent or correlated samples comes down to this: Will the loss of power resulting from a smaller df and larger critical value for t be offset by the gain in power resulting from the larger computed t? The answer is found in the degree of relationship between the paired samples. The larger the correlation between the paired scores, the greater the benefit from using the correlated samples test.
I provide here a two-step process for computing the standard error of difference to reduce the complexity of the formula. First, compute sd² using the formula

sd² = (Σd² - (Σd)²/n) / (n - 1)

Note the distinction between Σd² (differences between paired scores which are squared and then summed) and (Σd)² (differences between paired scores which are summed and then squared). Second, use the sd² term to compute the standard error of difference:

s(d̄) = √(sd²/n)

Substitute d̄ and s(d̄) in the matched-t formula and compute t. Compare the computed t-value with the critical value (tcv) to determine whether the difference between the correlated-samples means is significant. Let's look at the procedure with an example.

6 "Correlation" is another category of statistical analysis which measures the strength of association between variables (rather than the difference between groups). We'll study the concept in Chapter 22.
Example Problem
A professor wants to empirically measure the impact of stated course objectives and learning outcomes in one of her classes. The course is divided into four major units, each with an exam. For units I and III, she instructs the class to study lecture notes and reading assignments in preparation for the exams. For units II and IV, she provides clearly written instructional objectives. These objectives form the basis for the exams in units II and IV. Test scores from units I and III are combined, as are scores from units II and IV, giving a total possible score of 200 for the "with" and "without" objectives conditions. Do students achieve significantly better when learning and testing are tied together with objectives? (α=0.01) Here is a random sample of 10 scores from her class:
Instructional Objectives

Subject   Without (I,III)   With (II,IV)      d      d²
  1            165              182          -17     289
  2            178              189          -11     121
  3            143              179          -36    1296
  4            187              196           -9      81
  5            186              188           -2       4
  6            127              153          -26     676
  7            138              154          -16     256
  8            155              178          -23     529
  9            157              169          -12     144
 10            171              191          -20     400
          ΣX = 1607         ΣY = 1779    Σd = -172   Σd² = 3796
means:       160.7             177.9      d̄ = -17.2

Insight: Compare the difference between the means with the mean of differences. What do you find?*
Subtract the "with" scores from the "without" scores (e.g., 165 - 182), producing the difference score d (-17). Adding all the d-scores together yields Σd = -172. Squaring each d-score yields the d² value (-17 × -17 = 289) in the right-most column. Adding all the d²-scores together yields Σd² = 3796. These values will be used in the equations to compute t. Here are the steps for analysis: Step 1: State the directional research hypothesis. Step 2: State the directional null hypothesis. Step 3: Compute the mean difference (d̄): d̄ = Σd/n = -172/10 = -17.2
*Answer: The difference between the means equals the mean of score differences.
Step 5: Compute the standard error of the mean difference: sd² = (Σd² - (Σd)²/n)/(n - 1) = (3796 - 2958.4)/9 = 93.07, so s(d̄) = √(sd²/n) = √(93.07/10) = 3.05
Step 6: Compute t: t = d̄ / s(d̄) = -17.2/3.05 = -5.64
Step 7: Determine tcv from the t-distribution table. In this study, α = 0.01. tcv (df=9, 1-tail test, α=.01) = 2.821
Step 8: Compare: Is the computed t greater (in absolute value) than the critical value? t = -5.64, tcv = 2.821
Step 9: Decide: The calculated t-value is greater in absolute value than the critical value. It is significant. Therefore, we reject the null hypothesis and accept the research hypothesis. The "with objectives" test scores (x̄ = 177.9) were significantly higher than the "without objectives" test scores (x̄ = 160.7). Step 10: Translate into English: Providing students learning objectives which are tied to examinations enhances student achievement of course material.
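Recomputing the analysis directly from the raw scores in the table is a good check on the hand calculation. (Summing the d² column gives Σd² = 3796 and t near -5.64; |t| far exceeds tcv = 2.821 either way, so the decision is unchanged.)

```python
from math import sqrt

without = [165, 178, 143, 187, 186, 127, 138, 155, 157, 171]   # units I, III
with_obj = [182, 189, 179, 196, 188, 153, 154, 178, 169, 191]  # units II, IV

d = [a - b for a, b in zip(without, with_obj)]   # e.g. 165 - 182 = -17
n = len(d)
sum_d = sum(d)                                   # -172
sum_d2 = sum(x * x for x in d)                   # 3796
var_d = (sum_d2 - sum_d ** 2 / n) / (n - 1)      # sd squared
se_dbar = sqrt(var_d / n)                        # standard error of the mean difference
t = (sum_d / n) / se_dbar

print(round(t, 2))       # ~ -5.64: |t| far exceeds tcv = 2.821, so reject the null
```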
The 95% confidence interval about the difference between two independent means follows the same form as the one-sample interval. The two-sample confidence interval is given by

CI95 = (X̄ − Ȳ) ± tcv (sX̄−Ȳ)

where sX̄−Ȳ is the standard error of difference.
Chapter 20
Two-Sample t-Tests
Confidence intervals for two-sample tests are rarely reported in the literature. This section is included more for completeness of the subject than for practical considerations.
Summary
Two-sample t-tests are among the most popular of statistical tests. The t-test for independent samples assumes that the group scores are randomly selected individually (no correlation between groups) while the t-test for correlated samples assumes that the group scores are randomly selected as pairs.
Examples
Dr. Roberta Damon, cited in Chapter 19 for her use of the one-sample z-Test, also used the independent-samples t-Test to determine differences between missionary husbands and wives on three variables: communication, conflict resolution, and marital satisfaction. Her findings were summarized in the following table:7
TABLE 4
t SCORES AND CRITICAL VALUES FOR THE SCALES OF COMMUNICATION, CONFLICT RESOLUTION AND MARITAL SATISFACTION ACCORDING TO GENDER
(N = 165, df = 163)

Scale                       t         Critical Value   Alpha Level
Communication (8)           2.7578    2.617            .005
Conflict Resolution (9)     0.4709    1.645            .05
Marital Satisfaction (10)  -0.7141    1.645            .05
Dr. Don Clark studied the statistical power generated in dissertation research by doctoral students in the School of Educational Ministries, Southwestern Seminary, between 1978 and 1996.11 Clark's interest in this subject grew out of a statement I made in research class: Many of the dissertations written by our doctoral students fail to show significance because the designs they use lack sufficient statistical power to declare real differences significant. He wanted to test this assertion statistically. One hundred hypotheses were randomly selected from two populations: hypotheses proven significant (Group X) and those not (Group Y). Seven of these were eliminated because they did not fit Clark's study parameters, leaving ninety-three.12
7 Damon, 41.
8 Husbands were significantly more satisfied than wives with the way they communicate. "This indicates a felt dissatisfaction among the wives which their husbands do not share" (p. 41).
9 No difference in satisfaction with how couples resolve conflicts.
10 No difference in marital satisfaction.
11 Clark, 30.
12 Ibid., 39.
13 Ibid., 44.
The scores for this study were the computed power values for all 93 hypotheses of difference. The mean power score (n = 47) for "significant hypotheses" was 0.856. The mean power score (n = 46) for "non-significant hypotheses" was 0.3397. The standard error of difference was 0.052, resulting in a t-value of 9.904,13 which was tested against a tcv of 1.660. The power of the statistical test is significantly higher in those dissertations' hypotheses proven statistically significant than those . . . not proven statistically significant.14 At first glance, the findings seem trivial. Wouldn't one expect to find higher statistical power in dissertations producing significant findings? The simple answer is no. A research design and statistic produce a level of power unrelated to the findings. A fine-mesh net will catch fish if they are present, but a broad-mesh net will not. We seek fine-mesh nets in our research. The average power level of 0.34 for non-significant hypotheses shows that these studies were fishing with broad-mesh nets. As I had said, these studies were doomed from the beginning, before the data was collected. Dr. Clark found this to be true, at least for the 93 hypotheses he analyzed. We have been growing in our research design sophistication over the years. Computing power levels of research designs as part of the dissertation proposal has not been required in the past. Based on Dr. Clark's findings, we need to seriously consider doing this in the future, if for no other reason than to spare our students from dissertation research which is futile in its very design from the beginning.

Dr. John Babler conducted his dissertation research on spiritual care provided to hospice patients and families by hospice social workers, nurses, and spiritual care professionals.15 In two of his minor hypotheses, he used the independent-samples t-Test to determine whether there were differences in provision of spiritual care16 between male and female providers, and between hospice agencies related or unrelated to a religious organization. Males and females (n = 21, 174) showed no significant difference in the provision of spiritual care (Meanm = 49.6, Meanf = 51.2, t = -0.83, p = 0.409). Agencies related to religious organizations (n = 26) and those which were not (n = 162) showed no difference in provision of spiritual care (Meanr = 51.5, Meannr = 51.1, t = 0.22, p = 0.828).
Vocabulary
Correlated samples -- samples of paired subjects; tests differences between pairs
Independent samples -- samples randomly drawn independently of each other
Mean difference (d̄) -- average difference between two matched groups
Pooled standard deviation -- combined variability (standard deviation) of two groups of scores
Standard error of difference (independent samples) -- denominator of the t-Test for Independent Samples
Standard error of difference (matched samples) -- denominator of the t-Test for Correlated Samples
t-Test for Correlated Samples -- tests difference between means of two matched samples
t-Test for Independent Samples -- tests difference between means of two independent samples
Study Questions
1. Compare and contrast standard deviation, standard error of the mean, and standard error of difference.
14 Ibid., 45.
15 Babler, 7.
16 Scores were generated by an instrument which Babler adapted by permission from the Use of Spirituality by Hospice Professionals Questionnaire developed by Millison and Dudley (1992), p. 34.
2. Differentiate between descriptive and experimental use of the t-test.
3. List three ways we can design a correlated-samples study. Give an example of each type of pairing.
4. Discuss the two factors involved in choosing to use one test over the other. What is the major factor to consider?
5. A professor asked his research students to rate (anonymously) how well they liked statistics. The results were:

         Males   Females
mean     5.25    4.37
s²       6.57    7.55
n        12      31

a. State the research hypothesis
b. State the null hypothesis
c. Compute the standard error of difference
d. Compute t
e. Determine the critical value (α = 0.01)
f. Compare
g. Decide
h. Translate into English
i. Develop a CI99 for the differences.
7. Test the hypothesis that there is no difference between the average scores of husbands and their wives on an attitude-toward-smoking scale. Use α = 0.05 and the process listed in #6. (NOTE: Develop a CI95 confidence interval.)

Husband   16    8   20   19   15   12   18   16
Wife      10   14   15   15   13   11   10   12
Chapter 21
One-Way Analysis of Variance
Why Not Multiple t-Tests?
F-Ratio Fundamentals
The F-Distribution Table
Computing the F-Ratio
Multiple Comparison Procedures
In Chapter 20 we presented two minor hypotheses from Dr. John Babler's dissertation on spiritual care provision through hospice agencies. His primary hypothesis was that there would be significant differences in [provision of spiritual care scores] . . . between social workers, nurses, and spiritual care professionals.1 The mean scores on provision of spiritual care for these three groups were 47.23, 50.75, and 55.94 respectively. Applying the Analysis of Variance procedure, Babler found these three groups differed significantly in their provision of spiritual care (F = 10.547, p = 0.000). Application of the (F)LSD multiple comparison test revealed that the three groups differed significantly from each other.2 Social workers scored lowest, professional spiritual care providers highest, and nurses in between.3 We established the fundamentals for parametric testing in Chapters 17 and 18. We learned how to apply one-sample z- and t-tests in Chapter 19. We extended these principles to two-sample tests in Chapter 20. The next logical step is testing the differences on a single dependent variable among three or more group means. The procedure to use is one-way analysis of variance, more popularly known as one-way ANOVA.
1 Babler, p. 32.
2 Ibid., p. 47. Note: use of the Least Significant Difference test is valid when the F-ratio is significant. This test was designed by Sir Ronald Fisher, developer of the ANOVA procedure (hence the name F-Ratio). Carmer and Swanson call this the Fisher-Protected LSD (FLSD).
3 Ibid., p. 48.
The multiple application of t-tests was used earlier in this century until Englishman R. A. Fisher showed that the Type I error rate expands from α to some larger value as the number of tests between paired means increases. The error rate expansion is constant and predictable, given by the following equation:

p = 1 − (1 − α)^k

where p is the actual Type I error rate of all tests together, α is the stated level of significance, and k is the number of tests performed. In the A-B-C example above, the true probability (p) of committing a Type I error using three t-tests (α = 0.05) is given by

p = 1 − (1 − α)^k = 1 − (1 − .05)³ = 1 − .95³ = 1 − 0.857 = 0.143

In other words, we run a 14.3% chance of wrongly declaring two means significantly different, even though we set the error rate (α) at 5%. The problem grows with the number of means in the experiment. Suppose an experiment consists of ten groups. The researcher decides to apply the independent t-test (α = 0.05) to all pairs of means. The number of required t-tests equals (k)(k−1)/2, where k is the number of means in the experiment. With k = 10 means, there are 10(9)/2 or 45 t-tests to compute. The Type I error rate across these 45 tests (p) is

p = 1 − (1 − .05)⁴⁵ = 1 − .95⁴⁵ = 1 − .0994 = .9006

This means there is a 90% chance of committing a Type I error, with α set at 5%! Since we want to lock the Type I error rate to α when testing multiple means, multiple t-tests should not be used. Sir Ronald Fisher proposed a solution, however, and he named his procedure the Analysis of Variance, or ANOVA. The F in F-ratio comes from his name. We have been walking down the "parametric road of differences" since chapter 16. From the simple z-score formula in chapter 16, through one- and two-sample parametric tests, there has been a common thread tying all these procedures together. That thread (perhaps you've already seen it) is that every procedure involves a ratio of difference between to difference within.
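The error-rate expansion is easy to verify computationally. This sketch (ours, not from the text) reproduces both worked examples:

```python
# Fisher's point: the familywise Type I error rate p = 1 - (1 - alpha)**k grows
# with the number k of pairwise t-tests performed at significance level alpha.
def familywise_error(alpha, k):
    return 1 - (1 - alpha) ** k

def num_pairwise_tests(means):
    # number of t-tests needed to compare all pairs of means: k(k-1)/2
    return means * (means - 1) // 2

print(round(familywise_error(0.05, 3), 3))    # 0.143 for the A-B-C example
print(num_pairwise_tests(10))                 # 45 tests for ten groups
print(round(familywise_error(0.05, 45), 4))   # 0.9006 across those 45 tests
```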
This z-equation for a score is a ratio of "difference between" an individual score and the population mean in the numerator, and "difference within" the group (the population standard deviation σ) in the denominator. If n > 30, s estimates σ and X̄ estimates μ, giving the second z-equation. We can use the t-equation (3rd), especially when n < 30. It uses the same ratio of "difference between" score and sample mean and "difference within" (estimated standard deviation). The next z-equation (4th) is a ratio of "difference between" a sample mean (X̄) and population or hypothesized mean (μ) in the numerator, and difference within the sampling distribution (standard error of the mean) in the denominator. The t-equation (5th) uses estimated values in the same ratio. The next t-equation (6th) is a ratio of difference between two sample means in the numerator, and difference within both samples (standard error of difference) in the denominator. While the form of the last t-equation (7th) changes, conceptually it maintains the ratio of difference between paired scores to difference within all scores together. In all these cases, the between-to-within ratio remains constant. ANOVA continues this relationship. No matter how many groups are involved in an experiment, the ANOVA procedure breaks down the sum of squares of all experimental scores into two parts -- a difference between part and a difference within part. The ratio of between to within differences forms the F-ratio, just as we have done all along.
Sums of Squares
We illustrate the process of dividing the Total Sum of Squares (SSt) into two parts with the diagram at right. Here we find three groups of scores with grand mean X̄g (the mean of all scores in the study) and sample means X̄1, X̄2, and X̄3. Let's focus on one score in Sample 3, indicated in the diagram by X3,1. The distance between X3,1 and X̄g (the "T" line in the diagram) equals this one score's part of the total deviation between the scores and the grand mean. When we subtract X̄g from every score in the experiment and add these deviations together, we'll get 0 (Σx = 0). Square all the deviations and sum them to produce SSt for the experiment. The full equation for SSt is

SSt = Σ(j=1..k) Σ(i=1..nj) (Xij − X̄g)²
where k is the number of groups, j is the group counter which increments from 1 to k, nj is the number of scores in the jth group, and i is the score counter which increments from 1 to nj. The equation says to subtract the grand mean from each score in each group, square the deviation, and sum them all up across groups to produce SSt. Next, notice that the T line in the diagram is equal to the sum of the other two parts. The first part is labelled B -- for Between Sum of Squares (SSb) -- and is the deviation of X̄3 from X̄g. If we square these deviations, adjust for the number of subjects in each group, and sum them, we'll have SSb for the experiment. The full equation for SSb is

SSb = Σ(j=1..k) nj (X̄j − X̄g)²
The equation says to subtract the grand mean from each sample mean, square the difference, and multiply by the number of subjects in the sample. Add the k elements together to produce SSb. The second part is labelled W -- for Within Sum of Squares (SSw) -- and is the deviation of each score from its own sample mean. If we square all of these deviations and sum them for Sample Three, and do the same for Samples One and Two, we'll have SSw for the experiment. The equation for SSw is

SSw = Σ(j=1..k) Σ(i=1..nj) (Xij − X̄j)²
4th ed. 2006 Dr. Rick Yount
The equation says to subtract each group's mean from each of the scores within that group, square the differences, and add them up. Add all these elements across all groups to produce SSw. This gives us the combined "within" sum of squares for all the groups in the experiment. Notice that the Total line in the diagram equals the sum of the Between and Within segments, illustrating that the total sum of squares in any experiment of three or more groups can be divided into two parts, SSb and SSw, such that SSt = SSb + SSw.
Degrees of Freedom
Each sum of squares term (SSb, SSw, SSt) in ANOVA has an associated df term (dfb, dfw, dft). The between df term is k groups minus 1 (dfb = k-1). The within df term is the total number of scores in the experiment (N) minus the number of groups (k) in the experiment (dfw = N-k).
The total df term equals number of subjects minus 1 (dft = N-1). SSb + SSw = SSt. In the same way, dfb + dfw = dft.
(k − 1) + (N − k) = N − 1
In the one-sample test we lost one degree of freedom (df = n-1). In the two-sample test we lost two degrees of freedom (df = n + n - 2). It follows that when k groups are studied, we lose k degrees of freedom (dfw = N-k.)
Variance Estimates
Review: Recall from Chapter 16 that variance (s²) equals the sum of squares (Σx²) divided by degrees of freedom (n − 1). ANOVA computes a between variance estimate4 (MSb) and a within variance estimate (MSw) from the SS and df terms defined above:

MSb = SSb / dfb        MSw = SSw / dfw
The MS terms stand for mean-square. Variance equals the average sum of squares. So, MSb (mean-square-between) stands for the mean of the squared deviations between groups. And MSw (mean-square-within) stands for the mean of the squared deviations within all groups. We do not take the square root of variance, as we did in the previous procedures. The F-ratio is built from these two variance estimates, hence the name, Analysis of Variance.5
The F-Ratio
The F-ratio of ANOVA is the ratio of MSb to MSw:

F = MSb / MSw

This value is compared to a critical F value drawn from the F-distribution table to determine whether it is significant or not.
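The sum-of-squares partition and the F-ratio can be illustrated with a short sketch. The three groups of scores below are hypothetical, invented only to show that SSt = SSb + SSw and F = MSb/MSw:

```python
# Hypothetical data (not from the text): three groups of four scores each,
# used only to demonstrate the one-way ANOVA computations described above.
groups = [
    [4, 5, 6, 5],
    [7, 8, 6, 7],
    [9, 8, 10, 9],
]

all_scores = [x for g in groups for x in g]
N, k = len(all_scores), len(groups)
grand_mean = sum(all_scores) / N

ss_t = sum((x - grand_mean) ** 2 for x in all_scores)
ss_b = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
ss_w = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
assert abs(ss_t - (ss_b + ss_w)) < 1e-9   # the partition SSt = SSb + SSw holds

df_b, df_w = k - 1, N - k                 # and dfb + dfw = dft = N - 1
ms_b, ms_w = ss_b / df_b, ss_w / df_w
F = ms_b / ms_w

print(ss_b, ss_w, ss_t, round(F, 1))  # 32.0 6.0 38.0 24.0
```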
4 Variance (s²) equals sum of squares (Σx²) divided by degrees of freedom (n − 1). The two MS terms equal sum of squares divided by degrees of freedom.
5 If you were to apply the independent-samples t-Test and ANOVA to the same two groups of scores, the resulting t and F values would have the relationship F = t², or t = √F.
An Example
At the beginning of the chapter we highlighted the findings of Dr. John Babler's dissertation on spiritual care provision. Here are two ANOVA tables from his dissertation.
TABLE 2
ANALYSIS OF VARIANCE OF SCORE AND HOSPICE PROFESSION

Source    DF    Mean Squares    F Ratio    F Probability
Between   2     680.3283        10.5465    .0000
Within    193   64.5074
Total     195
The F-ratio of 10.5465 is significant because p = .0000; that is, p(F) is very small, much less than α = 0.05. This p(F) tells us that the spiritual care provision means of the three professions (47.23, 50.75, and 55.94) were significantly different.
6 Babler, p. 47.
21-5
TABLE 8
ANALYSIS OF VARIANCE OF SCORE AND AGE

Source    DF    Mean Squares    F Ratio    F Probability
Between   4     219.2222        3.1952     .0247
Within    191   68.6090
Total     194
In Table 8 we see an F of 3.1952 (p = .0247). This p(F) tells us that the spiritual care provision means of the four age groups of professionals were significantly different. These means were 52.08 (26-35), 48.39 (36-45), 52.46 (46-55) and 52.12 (over 55). In these examples, you can see that the computer printout includes a p value for the computed F-ratio. This p is the exact probability of obtaining the computed F-ratio. It is easier to compare the computed p with α than it is to look up an F critical value in a table. If p < α, then reject the null hypothesis.
Procedures Defined
To determine which means differ sufficiently to produce the significant F-ratio, researchers study differences between pairs of means by using statistical techniques called multiple comparison procedures. These procedures vary in definition and computation. We will briefly discuss four procedures here: the Least Significant Difference (LSD), the Tukey Honestly Significant Difference (HSD), the Student-Newman-Keuls (SNK), and the Fisher-Protected Least Significant Difference (FLSD).
The answer, of course, lies in the demands on educator-researchers to conduct and publish research. Since refereed journals demand articles reporting significant findings, many professors sought ways to produce significant findings more often. One way this was done was to use the LSD without a prior significant F-ratio, violating Fisher's guidelines. The result was an explosion of Type I errors and false positives in the literature. The problem was not the LSD test per se, but its misuse. Other theorists sought ways to reduce the problem of excessive Type I error rate. The LSD generates the lowest critical value (and the highest level of power) of those discussed here, when used as Fisher directed.
Procedures Computed
In each of these procedures, a value is computed and compared to the differences between paired means. If the difference between two means is greater than the multiple comparison value, the two means are declared significantly different. We will forego the specific formulas for each of these procedures since this computational work will most likely be done by computer. However, given specific elements of an ANOVA example, multiple comparison critical values can be compared with each other. The following table displays the values:

r    (F)LSD     SNK       HSD
5    12.522*    17.531    17.531**
4    12.522     16.502    17.531
3    12.522     15.026    17.531
2    12.522     12.522    17.531
Notice that the (F)LSD procedure produces the smallest critical value (12.522), producing the greatest power. The HSD procedure produces the largest critical value (17.531), producing the least power. The SNK procedure produces a range between the two. There is a great deal of confusion in the literature concerning multiple comparisons. My Ph.D. dissertation* focused on six multiple comparison tests and analyzed their error rates by a computerized Monte Carlo technique. I generated 28,000 sets of random data, 1000 tests for each of 28 n- and k-combinations related to educational research. My findings indicated the best multiple comparison procedure, under all conditions, was the (F)LSD. It consistently provided the greatest power and an error rate closest to α. HSD was too conservative (consistently produced low power). The Scheffe method (not discussed in the chapter, but included in the study) consistently decreased the level of significance below α, reducing the power of the test more than any other. Scheffe consistently reduces the likelihood of finding "true differences" and should be avoided except under very narrow conditions. If you need a multiple comparison test, my unqualified recommendation is the (F)LSD. If using SPSS, check the box marked LSD under multiple comparison tests, but only use the results if the F-ratio is significant.
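Although the chapter foregoes the multiple comparison formulas, the standard form of Fisher's LSD critical difference can be sketched as follows. The inputs (t_cv, MSw, group sizes) are illustrative assumptions of ours, not values from the text:

```python
import math

# Standard form of Fisher's LSD critical difference for one pair of means:
#   LSD = t_cv * sqrt(MSw * (1/n_i + 1/n_j))
# t_cv = 2.052 is the two-tailed t critical value for alpha = .05, dfw = 27.
# All numbers below are hypothetical, chosen only to illustrate the computation.
def lsd(t_cv, ms_w, n_i, n_j):
    return t_cv * math.sqrt(ms_w * (1 / n_i + 1 / n_j))

critical = lsd(t_cv=2.052, ms_w=64.5, n_i=10, n_j=10)
print(round(critical, 2))  # 7.37
```

Two means would be declared significantly different when they differ by more than this critical value, and only after a significant F-ratio, as Fisher directed.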
Summary
In this chapter we established the general procedure for use of the one-way Analysis of Variance (ANOVA) test. We explained the problem of using multiple ttests. We illustrated the breakdown of total sum of squares into between and within parts. We showed how each element in an ANOVA table is computed. We discussed the ANOVA table and the relationships among the various table elements. Finally, we introduced the concept of multiple comparison procedures and illustrated their use.
Examples
Dr. John Babler's Table 2, p. 21-5, displayed a significant F between the means of the three professions. But which professional group differed significantly from the others? Babler used a computerized LSD test (with a significant F) to determine that
*Yount. A Monte Carlo Analysis... Ph.D. diss., University of North Texas, 1985, 45-46
each of the three means differed significantly from the others.10 We can put them in a difference table to see the paired differences more clearly. Group 1 = spiritual care professionals, 2 = nurses, 3 = social workers.

                          Nurses (50.75)           Spiritual Care (55.94)
Social workers (47.23)    3.52 <-- smallest        8.71 <-- largest
Nurses (50.75)                                     5.19
The largest difference (8.71) is between the highest mean (55.94) and the lowest (47.23). You can see the differences between all three paired means above. All of these differences exceeded the critical value, and were declared significant. Dr. Gail Linam's dissertation (see Chapter 24-1 for full reference and Chapter 25 for the two-way ANOVA findings) compared reading comprehension in children grades 4-6 across three translations of Scripture, the King James, New International and New Century versions. She found that the KJV produced significantly lower comprehension scores than either NIV or NCV. She applied the FLSD procedure to determine exact differences between versions. For the Old Testament Retelling (OTR) scores, mission children scored significantly lower with KJV than NCV (7.00 < 23.11) and main campus children significantly lower with KJV than either NCV or NIV (18.81 < 34.41, 18.81 < 32.95). For the New Testament Retelling (NTR) scores, mission children scored significantly lower with KJV than either NCV or NIV (7.25<29.44, 7.25<21.56) and main campus children the same (25.55<37.55, 25.55<34.60).11 For Old Testament Cloze (OTC) scores, mission children scored significantly lower with KJV than either NCV or NIV (4.22<17.56, 4.22<13.33), and main campus children the same (6.55<23.50, 6.55<23.27).12 For New Testament Cloze (NTC) scores, mission children scored significantly lower with the KJV than with either NCV or NIV (0.38<16.11, 0.38<10.78) and main campus children scored the same (11.05<23.50, 11.05<22.82).13 In every case and under every condition, the KJV produced significantly lower reading comprehension scores using two different types of testing procedures and stories from both Old and New Testaments. Older children (4th-6th grades) simply do not understand the King James translation as well as the NIV or NCV versions.
Vocabulary
ANOVA -- Analysis of Variance: tests differences among 3 or more independent sample means
dfb -- degrees of freedom between the means: = k − 1
dft -- degrees of freedom total: df for the whole experiment: = N − 1
dfw -- degrees of freedom within: (n − 1 per group)(k groups) = N − k
FLSD -- Fisher-protected LSD: LSD test protected by a prior significant F-ratio
HSD -- Tukey Honestly Significant Difference mcp: very conservative
LSD -- Least Significant Difference: high Type I error without a significant F-ratio
MSw -- mean square within: variance within-all-groups-combined
MSb -- mean square between: variance between means and grand mean
Multiple t-tests -- applying t-tests to multiple pairs of means in an experiment with three+ groups
SNK -- Student-Newman-Keuls mcp which uses a range of critical values
SSb -- between sum of squares: SS term between grand mean and group means
SSt -- total sum of squares: SS term between grand mean and all scores
SSw -- within sum of squares: SS between scores and their respective group means

10 Babler, 49.
11 Linam, 109.
12 Ibid., 111.
13 Ibid.
Study Questions
Dr. Martha Bergen studied attitudes toward computer-enhanced learning for seminary education among full-time professors at Southwestern Baptist Theological Seminary in 1989.14 One of her hypotheses was that there would be a significant difference [in attitude toward computer-enhanced learning] between the professors in the religious education, theology, and church music schools. Scores were generated from an attitude scale Dr. Bergen developed for the study. The mean attitude scores for the three schools were 118 (highest) in the Religious Education faculty, 117 in the church music faculty, and 114 (lowest) in the theology faculty. But were these differences in attitude significant? Here is the ANOVA table she generated:15
SOURCE OF VARIATION    SUM OF SQUARES    df    MS         F       p
Between                323.387           2     161.694    .472    .626
Within                 25018.652         73    342.721
Total                  25342.039         75
I. Using the problem and printout above, answer these questions:
1. Is the F-ratio significant? Explain why you say this.
2. Explain this F-ratio in terms of the three group means: 114, 117, 118.
3. How do you explain the differences in the school mean scores?
4. Dr. Bergen did not apply multiple comparison tests to see if any single mean was significantly different from the others. Why? Was she correct in doing so?
II. General Chapter Questions:
1. Explain the problem of using several t-tests to determine significant differences among pairs of means.
2. Since the FLSD is a modified multiple t-test, explain how it overcomes the problem explained in #1.
3. Explain in your own words how ANOVA divides the total sum-of-squares into between and within parts. (Use the deviation explanation.)
4. Fill in the ANOVA table below. You are testing the means of 4 groups of 10 subjects each. The SSb = 480.0 and SSt = 1440.0. Compute the F-ratio. Determine the critical value (α = 0.05). Is the F-ratio significant?
14
15
Ibid., 87
Chapter 22
Correlation Coefficients
The Meaning of Correlation
Correlation and Data Types
Pearson's r
Spearman rho
Other Coefficients of Note
Coefficient of Determination r²
The concept of correlation was introduced in Chapters 1 and 5. Our focus since Chapter 16 has been basic statistical procedures that measure differences between groups -- one-sample, two-sample, and k-sample tests. Now we turn our attention to basic statistical procedures that measure the degree of association between variables. Dr. Wesley Black studied the relationship between rankings of selected learning objectives in a youth discipleship taxonomy between full-time church staff youth ministers and seminary students enrolled in youth education courses at Southwestern Seminary.1 Questionnaires were returned by 318 students and 184 youth ministers.2 Ten objectives in each of five categories (Personal Ministry, Christian Theology and Baptist Doctrine, Christian Ethics, Baptist Heritage, and Church Polity and Organization) were ranked by these two groups. The basic question raised by Black in this study was whether students prioritized discipleship training objectives for youth in the same way as full-time ministers in the field. Using the Spearman rho correlation coefficient, Black found the correlations of rankings generated by students and ministers of the ten items for each of five categories were as follows: Personal Ministry, 0.915; Christian Theology and Baptist Doctrine, 0.867; Christian Ethics, 0.939; Baptist Heritage, 0.939; and Church Polity and Organization, 0.927. Each of these is a strong positive correlation3 between the rankings of objectives by students and ministers.
1 Wesley Black, A Comparison of Responses to Learning Objectives for Youth Discipleship Training from Ministers of Youth in Southern Baptist Churches and Students Enrolled in Youth Education Courses at Southwestern Baptist Theological Seminary, (Fort Worth, Texas: Southwestern Baptist Theological Seminary, 1985). 2 Black received 356 responses from students, but 38 of these were full-time youth ministers, and so were excluded from the study, leaving 318 student responses. Of the 307 responses from youth ministers, 197 indicated they were full-time church staff youth ministers. Thirteen additional responses were eliminated for incompleteness, leaving 184 youth minister responses. pp. 71-72 3 Any coefficient over 0.80 indicates a strong positive correlation.
[Table: correlation coefficients matched to types of paired data -- Pearson's rxy, Spearman's rs (*requires interval/ratio data to be ranked), the contingency coefficient C (**requires a χ² value), the Point Biserial and Rank Biserial coefficients, and phi.]
Finally, Kendall's Coefficient of Concordance (W) computes the correlation of three or more rankings of items. Now we'll look at how each of these correlation coefficients is computed.
distribution produces a smaller r than a normal distribution. Use a large scale for variables in correlational analysis, since the larger the variability, the stronger the coefficient will be. A common mistake in research design is to use age categories rather than actual ages, or salary categories rather than actual dollar values. The range of categories will always be much smaller than the range of actual data, reducing the value of r. Pearson's r is computed with the following formula:

r = [nΣXY − (ΣX)(ΣY)] / √{[nΣX² − (ΣX)²][nΣY² − (ΣY)²]}
where n equals the number of score-pairs, and X and Y equal paired scores. While this formula is somewhat foreboding, it consists of the following simple components:
ΣXY -- multiply X by Y and sum
ΣX -- sum all the X scores
ΣY -- sum all the Y scores
ΣX² -- square all Xs and sum
ΣY² -- square all Ys and sum
(ΣX)² -- square the sum of X
(ΣY)² -- square the sum of Y
Let's say we have a set of 5 paired scores: (3,6), (5,9), (2,5), (3,7), and (4,8). A scatter-plot of this data is shown at left. From what you can see in this graph, do you predict a strong or weak correlation coefficient? We've put the paired X-Y values in the chart below to facilitate computing the various elements of the Pearson's r formula. The letters in the chart (A-G) refer to the steps below (A-G).
X          X²          Y          Y²           XY
3          9           6          36           18
5          25          9          81           45
2          4           5          25           10
3          9           7          49           21
4          16          8          64           32
ΣX = 17    ΣX² = 63    ΣY = 35    ΣY² = 255    ΣXY = 126
(ΣX)² = 289            (ΣY)² = 1225

A. Add up the Xs. This is ΣX, and equals 17.
B. (ΣX)² = 17 × 17 = 289
C. Add up the Ys. This is ΣY, and equals 35.
D. (ΣY)² = 35 × 35 = 1225
E. Multiply each X-Y pair together and add. This is ΣXY, and equals 126.
F. Square each X and add up the squared values. ΣX² = 63
G. Square each Y and add up the squared values. ΣY² = 255
Now substituting into the raw score equation we have:

r = [5(126) − (17)(35)] / √{[5(63) − 289][5(255) − 1225]} = (630 − 595) / √[(26)(50)] = 35 / √1300 = +0.971

Before going on, be sure to identify each term in the equation above with the chart above and the equation on the previous page.
The Pearson r value of +0.971 indicates a very strong -- nearly perfect -- positive correlation between these two variables.
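The raw-score computation can be sketched in Python (our illustration, using the five paired scores from the chart):

```python
import math

# Raw-score Pearson r for the five paired scores in the example
pairs = [(3, 6), (5, 9), (2, 5), (3, 7), (4, 8)]
n = len(pairs)

sum_x  = sum(x for x, _ in pairs)        # ΣX  = 17
sum_y  = sum(y for _, y in pairs)        # ΣY  = 35
sum_xy = sum(x * y for x, y in pairs)    # ΣXY = 126
sum_x2 = sum(x * x for x, _ in pairs)    # ΣX² = 63
sum_y2 = sum(y * y for _, y in pairs)    # ΣY² = 255

r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)
print(round(r, 3))  # 0.971
```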
Spearman's rho is computed with the following formula:

rs = 1 − (6ΣD²) / [n(n² − 1)]

where D is the difference between paired ranks and n is the number of pairs. The number 6 is a constant. Suppose a pastor asked two staff members to rank ten church objectives according to how well they were being accomplished by the church. Here are the rankings of the ministers.
Objective    Min Ed Rank    Min Youth Rank
1            1              2
2            2              1
3            3              5
4            4              3
5            5              7
6            6              6
7            7              4
8            8              10
9            9              9
10           10             8
Question: Do these two staff members agree in their evaluation of the objectives?
The two ranked variables in Dr. Black's dissertation (p. 1) were the ranking of objectives by students and the ranking of objectives by youth ministers. This was accomplished by assigning a score value to each individual subject's set of ranks, computing means of these scores, and rank-ordering the objectives by the means. The result was a separate high-to-low ranking of objectives for each of the two groups. Spearman's rho was then applied to compute the degree of agreement between the two rankings.
What is the strength of their agreement? First we compute the differences (D) between ranks, then square the differences (D²), sum the squares (ΣD²), and substitute into the formula. The table below summarizes the process:
Objective   Min Ed Rank   Min Youth Rank     D      D²
    1            1               2          -1       1
    2            2               1          +1       1
    3            3               5          -2       4
    4            4               3          +1       1
    5            5               7          -2       4
    6            6               6           0       0
    7            7               4          +3       9
    8            8              10          -2       4
    9            9               9           0       0
   10           10               8          +2       4
  n = 10                                 ΣD = 0   ΣD² = 28
Objective 1 was ranked highest (1) by the minister of education and second (2) by the minister of youth. Subtracting 2 from 1 yields a difference (D) of -1. Squaring D yields a D² of 1. Notice that the sum of differences (ΣD) equals 0. Summing the D² values, we get 28. Substituting the values of ΣD² and n into the Spearman formula, we have

rho = 1 − (6 · 28) / (10 · (10² − 1)) = 1 − 168/990 = +0.83
The coefficient of +0.83 indicates a strong agreement between the two staff ministers with respect to the rankings of church objectives.
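The same substitution can be checked with a short Python sketch (our own illustration; the function name `spearman_rho` is an assumption):

```python
def spearman_rho(ranks1, ranks2):
    """Spearman's rho: 1 - 6*sum(D^2) / (n(n^2 - 1))."""
    n = len(ranks1)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks1, ranks2))  # sum of squared rank differences
    return 1 - (6 * d2) / (n * (n ** 2 - 1))

min_ed = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
min_youth = [2, 1, 5, 3, 7, 6, 4, 10, 9, 8]
print(round(spearman_rho(min_ed, min_youth), 2))  # 0.83
```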
The Phi Coefficient measures the strength of relationship between two dichotomous variables. A study of marital status and attrition rate in college might arbitrarily assign a 1 to married and 0 to not married; a 1 to dropped out and a 0 to remaining in school. Any type of variable that can be classified 1 and 0 can use the phi coefficient. A positive correlation indicates those who score "1" on one variable tend to score "1" on the other. Using the example above, a positive correlation would mean that married students (1) tend to drop out of school (1) more than unmarried students.
Here we again see the distinction between statistical significance and practical importance. A different, more meaningful approach to determining the importance of a correlation coefficient is the coefficient of determination (r²). By squaring the correlation coefficient, one obtains a measure of the common variance between two variables: the proportion of variance in one of the variables accounted for, or explained by, the other. If the correlation between marital satisfaction and number of months married is -0.40, then 16% of the variance (-.40 × -.40 = .16) of one variable is accounted for by the variance of the other (the shaded area at right). We could say that 16% of the variability in marital satisfaction and number-of-months-married overlaps. It follows that 84% of the variability is unaccounted for. In the Catholic sex education study mentioned above, the r² value of r = 0.07 is 0.0049, or 0.49%: one-half of one percent of variance accounted for. Ninety-nine and a half percent (99.51%) of the variance was unaccounted for. This was a meaningless significant finding, to be sure. We will use the concept of r² much more when we discuss regression analysis.
Summary
In this chapter you have been introduced to the concept of correlation. You have learned how to compute the two most popular correlation coefficients, Pearson's r and Spearman rho, as well as learned of several other helpful correlational tools. You have been introduced to the coefficient of determination (r2) which will be of central importance in Chapter 26, Regression Analysis.
Vocabulary
Coefficient of Determination: proportion of variance in one variable accounted for by another (r²)
Contingency Coefficient: measure of association between two nominal variables (max < 1)
Correlation: degree of association between two variables
Correlation coefficient: numerical measure of degree of association between two variables
Cramer's Phi: measure of association between two nominal variables (max = 1)
Kendall's tau: measure of association between two sets of ranks (n < 10 pairs)
Kendall's W: measure of association among three or more sets of ranks
Negative correlation: one element of paired scores increases while the other decreases
Pearson's r: measure of association between two interval/ratio variables
Phi Coefficient: measure of association between two dichotomous variables
Point biserial: measure of association between an interval/ratio variable and a dichotomous variable
Positive correlation: both elements in paired scores increase (or decrease) together
Rank biserial: measure of association between an ordinal variable and a dichotomous variable
Scattergram: graphical representation of the correlation of two variables
Spearman's rho: measure of association between two sets of ranks (n > 10 pairs)
Study Questions
1. Is age related to the length of stay of surgical patients in a hospital? The following data were collected in a recent study.
   Age:  40  36  30  27  24  22  20
   Days: 11   9  10   5  12   4   7
   a. Draw a scatterplot diagram of the data, with AGE on the x-axis and DAYS on the y-axis.
   b. By appearance alone, do AGE and DAYS appear to be related?
   c. Compute the appropriate correlation coefficient.
   d. Interpret the results.
   e. Compute the coefficient of determination. What does it tell you?
2. A professional person and a blue-collar worker were asked to rank 12 occupations according to the social status they attached to each. A ranking of 1 was assigned to the occupation with the highest status down to a ranking of 12 for the lowest. Here are their rankings:

Occupation                   Professional Person   Blue-Collar Worker
Physician                             1                    1
Dentist                               4                    2
Attorney                              2                    4
Pharmacist                            6                    5
Optometrist                          12                    9
School Teacher                        8                   12
Veterinarian                         10                    6
College Professor                     3                    3
Engineer                              5                    7
Accountant                            7                    8
Health Care Administrator             9                   11
Government Administrator             11                   10
____ 1. Preacher popularity by rank and whether he graduated from seminary or not. ____ 2. Reading score in 6-year-olds and whether they participated in HEADSTART preschool program. ____ 3. Seminary GPA and marital satisfaction scores of graduating students. ____ 4. Smoking/not smoking and death by (1) cancer or (2) other causes. ____ 5. Staff position and leadership style category. ____ 6. Ten objectives in Christian Education ranked by pastors and ministers of education.
Chapter 23
Chi-Square Tests
Chi-Square Procedures
The Chi-Square Formula
The Chi-Square Critical Value
Chi-Square Goodness of Fit Test
Chi-Square Test of Independence
Cautions in Using Chi-Square
Dr. Helen Ang studied the relationship between predominant leadership style and educational philosophy of administrators in Christian colleges and universities for her Ed.D. dissertation in 1984.1 Leadership Style was a categorical variable with the following five levels (with percentages of the 113 administrators studied): team administrator (high people/high task: 23%), constituency-centered (moderate people/moderate task: 16%), authority-obedience (low people/high task: 4%), comfortable-pleasant (high people/low task: 38%), and caretaker (low people/low task: 19%).2 Educational Philosophy Profile was a categorical variable with the following six levels (with percentages): idealism (7%), realism (4%), neo-thomism (15%), pragmatism (58%), existentialism (1%), and eclectic (16%).3 Applying the Chi-Square Test of Independence, Dr. Ang found that the variables Leadership Style and Educational Philosophy were independent (χ² = 21.676, χ²cv = 31.410, α = 0.05, df = 20).4
The chi in chi-square is the Greek letter χ, pronounced "ki" as in kite. Chi-square (χ²) procedures measure the differences between observed (O) and expected (E) frequencies of nominal variables, in which subjects are grouped in categories or cells. There are two basic types of chi-square analysis, the Goodness of Fit Test, used with a single nominal variable, and the Test of Independence, used with two nominal variables. Both types of chi-square use the same formula.
1 Helen C. Ang, "An Analytical Study of the Leadership Style of Selected Academic Administrators in Christian Colleges and Universities as Related to their Educational Philosophy Profile" (Fort Worth, Texas: Southwestern Baptist Theological Seminary, 1984).
2 Ibid., 28-29, 46.
3 Ibid., 45.
4 Ibid., 47.
χ² = Σ [(O − E)² / E]

where the letter O represents the Observed frequency -- the actual count -- in a given cell. The letter E represents the Expected frequency -- a theoretical count -- for that cell. Its value must be computed. The formula reads as follows: the value of chi-square equals the sum of the O-E differences, each squared and divided by E. The more O differs from E, the larger χ² is. When χ² exceeds the appropriate critical value, it is declared significant.
The chart above shows the step-by-step procedure in computing the chi-square formula. Notice that both O and E columns add to the same value (N=120).
from each category. For example, the largest contributor to the chi-square is the high tally in category 6. It yields 1.25 of the 2.80 total. The fourth step is to sum the values in the last column to produce the final chi-square value, in this case 2.80.
(600/12 = 50). The first category, Republicans, has 5 parts of the 5:4:3 ratio, or 5 × 50 = 250 Expected voters. The second, Democrats, has 4 parts, or 4 × 50 = 200 Expected voters. The third, Independents, has 3 parts, or 3 × 50 = 150 Expected voters. Putting this in a table as before, we have the following:
       O     E     O-E    (O-E)²   (O-E)²/E
Rep   322   250    72     5184      20.74
Dem   184   200   -16      256       1.28
Ind    94   150   -56     3136      20.91
      600   600  Σ(O-E)=0          χ² = 42.93
Notice that both O and E columns add to 600 (N). Notice that the O-E column adds to zero. Notice that the E values are unequal, reflecting the 5:4:3 ratio derived from the earlier poll. The resulting χ² value equals 42.93.
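The Goodness of Fit computation reduces to a one-line sum, sketched below as an illustration (the function name `chi_square` is ours). Note that summing with full precision gives 42.92; the chapter's 42.93 comes from adding the rounded column entries.

```python
def chi_square(observed, expected):
    """Goodness-of-fit chi-square: sum of (O - E)^2 / E over categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [322, 184, 94]   # Rep, Dem, Ind
expected = [250, 200, 150]  # proportional (5:4:3) expectations, N = 600
print(round(chi_square(observed, expected), 2))  # 42.92
```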
Looking at the O-E column, we see that we observed more Republicans than we expected (322 > 250), and fewer Independents than expected (94 < 150), based on data from four years before. It is this twisting effect that causes the large chi-square value. In summary, the Goodness of Fit procedure tests one variable across k categories. The computed value is tested for significance at α and df = k − 1. The expected frequencies for each category can be equal (EE) or proportional (PE).
independent to mean related. The two nominal variables form a contingency table of cells.
Each of the 47 schools in the study was categorized by both variables and placed into one of 6 cells. How many deaf schools identify giftedness in their students (II) and use total communication as their primary approach? [15]. How many schools use aural/oral methods and do not identify giftedness in their students (I)? [3]. The table also includes margin totals, labelled Total. The total number of aural/oral schools, regardless of school type, for example, was 3. The total number of Type I schools, regardless of language preference, was 27. The margin totals for the row variable are called row totals (3, 35, 9). The margin totals for the column variable are called column totals (27, 20). The sum of column totals (47) equals the sum of row totals (47), a good check on math accuracy. Margin totals are the means by which expected values are computed.
E = (row 1 total × column 1 total) / table total = (3 × 27) / 47 = 1.723
Putting this in more general terms, we can show the computation of the Expected values for all cells in a 3x4 contingency table.
The above table shows three levels of a column variable (1, 2, 3) and four levels of a row variable (I, II, III, IV). Once the observed frequencies are placed in the table and margin totals computed, expected values for each cell can be computed. The Expected value for any given cell is found by multiplying the cell's row total by its column total and dividing by the table total. Once the expected cell frequencies are computed, the remainder of the computation is the same as demonstrated before: O-E, (O-E)², and (O-E)²/E for each cell.
Degrees of Freedom
We determine df for the Test of Independence by the formula df = (r-1)(c-1), where r = the number of rows and c = the number of columns in the contingency table. For a contingency table of 5 rows and 6 columns, the degrees of freedom would be (5-1)(6-1) or 20. (Each variable loses one degree of freedom).
Application to a Problem
Lets apply this to our example on deaf schools. The expected frequencies are shown bold-faced in parentheses () below. It is suggested that you compute several of these to insure your understanding of the procedure.
Cell 3,2 refers to the cell at row 3, column 2, shown in the table as the shaded cell.
Putting the O and E values into a chart, we have the following computations:
  O      E      O-E    (O-E)²   (O-E)²/E
  3     1.72    1.28   1.638     0.953
 20    20.11   -0.11    .012      .001
  4     5.17   -1.17   1.369      .265
  0     1.28   -1.28   1.638     1.280
 15    14.90     .10    .010      .001
  5     3.83    1.17   1.369      .357
                                 χ² = 2.857
The computed value of 2.857 is smaller than the critical value of 5.991. Therefore, the value is declared not significant. The statistical decision is to retain the null hypothesis. In terms of this study, language preference and school category are not related. It appears that educational approach is unrelated to identifying giftedness in deaf students in these 47 deaf schools.
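The whole Test of Independence, from margin totals to the final statistic, can be sketched as follows. This is our illustration (the function name is an assumption); note that computing with unrounded expected values gives χ² ≈ 2.846 rather than the 2.857 obtained above from rounded Es, with the same non-significant conclusion.

```python
def chi_square_independence(table):
    """Chi-square test of independence; table is a list of rows of observed counts.
    Each expected cell value = (row total * column total) / table total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n
            chi2 += (o - e) ** 2 / e
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df

# Deaf-schools example: 3 language-preference rows x 2 school-type columns
table = [[3, 0], [20, 15], [4, 5]]
chi2, df = chi_square_independence(table)
print(round(chi2, 3), df)  # 2.846 2
```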
            Republican   Democrat   Independent   Total
Male            170         112          68        350
Female          152          72          26        250
Total           322         184          94        600
Here's our chart. Identify the Os and Es above in the chart below.
       O      E       O-E     (O-E)²   (O-E)²/E
RM    170   187.83   -17.83   317.91     1.69
DM    112   107.33     4.67    21.81      .20
IM     68    54.83    13.17   173.45     3.16
RF    152   134.17    17.83   317.91     2.37
DF     72    76.67    -4.67    21.81      .28
IF     26    39.17   -13.17   173.45     4.43
                                       χ² = 12.13
The computed value of 12.13 is larger than the critical value of 5.991 (α = 0.05, df = 2). Therefore, the value is declared significant. The statistical decision is to reject the null hypothesis. In terms of this study, this result means that gender and political party preference are related. One's political preference is influenced by his or her gender. How are these two variables related? We can answer this by eyeballing the data in the table. The greatest part of the chi-square comes from the FEMALE-INDEPENDENT (IF) cell. We observe fewer women independents than we expect by chance (26 vs. 39.17). The second highest value comes from the MALE-INDEPENDENT (IM) cell. We observe more male independents than we expect by chance (68 vs. 54.83). Notice that men outnumber women among independents. The third highest value comes from the FEMALE-REPUBLICAN (RF) cell. We observe more women Republicans than we expect by chance (152 vs. 134.17). The fourth highest value comes from the MALE-REPUBLICAN (RM) cell. We observe fewer male Republicans than we expect by chance (170 vs. 187.83). Notice that women outnumber men among Republicans.
This pattern of over- and under-representation twists across the table: men are overrepresented and women underrepresented among Independents, while women are overrepresented and men underrepresented among Republicans. It is this twisting motion in the table that indicates that the two variables are related.
Strength of Association
The chi-square test of independence tells you whether two nominal variables are related or not. It does not tell you how strong that relationship is. When you produce a significant chi-square (two variables are related), it is natural to wonder how strong the relationship is. Two procedures can provide such measures: the Contingency Coefficient (C) and Cramer's phi (φc).
Contingency Coefficient
The contingency coefficient (C) computes a Pearson r-type correlation coefficient from a computed χ² value. The formula is

C = √(χ² / (χ² + N))
If you get, say, a chi-square value of 63.383 (significant at α = 0.001) with a sample size of 390, then you can compute the degree of association by substituting χ² = 63.383 and N = 390 into the formula.
If we were to compare this to a maximum value of 1.00, we would conclude that 0.398 is a weak correlation. But the maximum value for C is not 1.00. It is estimated by another formula:
where k is the number of categories in the variable with the fewer categories. Let's say in our case that one of our variables has 6 categories and the other has 3. Then k = 3. The maximum value C can take is then computed as Cmax = √((3 − 1)/3), or 0.817. Comparing 0.398 to 0.817, we would say that we have a moderately strong correlation.8
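Both quantities are one-liners, sketched below as an illustration (function names are ours). Note that substituting the chapter's χ² = 63.383 and N = 390 into the standard C formula gives about 0.374, slightly below the figure quoted above; the comparison against Cmax works the same way either way.

```python
from math import sqrt

def contingency_coefficient(chi2, n):
    """Contingency coefficient: C = sqrt(chi2 / (chi2 + N))."""
    return sqrt(chi2 / (chi2 + n))

def c_max(k):
    """Estimated maximum of C when the smaller variable has k categories."""
    return sqrt((k - 1) / k)

print(round(contingency_coefficient(63.383, 390), 3))  # 0.374
print(round(c_max(3), 3))                              # 0.816
```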
Cramer's Phi
While the contingency coefficient is popular, a better alternative for measuring association in a contingency table is Cramer's phi. The advantage of this procedure is that it ranges from 0.00 to +1.00 and is independent of the size of the table. Cramer's phi is defined as

φc = √(χ² / (N(k − 1)))

where k is the smaller of the number of rows and the number of columns.
Hinkle, p. 320
Howell, p. 105
gested a dissertation using a chi-square table of 16 rows and 16 columns (256 cells) and considered 200 subjects more than enough.10 The reason is power. Fewer subjects than 5 per cell will not allow the chi-square procedures to detect relationships that may exist. If you plan correctly, but lose subjects during the study, or find some category tallies to be much smaller than anticipated, remember that your significance tests are suspect.
Assumption of Independence
We noted in Barb's study of deaf schools that each one of the 47 schools was placed in one and only one cell in the contingency table. Each school was independent of every other school. The assumption of independence means that each subject is located in one and only one cell in the contingency table. This mistake is easy to make, usually by having subjects respond more than once. A student came into my office with a contingency table of tally marks in the fall of 1981 -- my first semester on faculty. His table was the result of $200 in mailings, $300 to a statistician across town, and the prior 10 months of his life. He had listed various educational programs down one side of the contingency table, and five levels of ratings across the top. Each subject checked off a rating for each program. He had 60 subjects and 300 tallies! The observations were not independent (each subject made five responses in the table). He had produced a chi-square value, but the value was meaningless. I encouraged him strongly to go back to his statistician and have him work out another approach to analyzing his data. Proper Planning Prevents Poor Performance -- and sleepless nights, as well.
Inclusion of Non-Occurrences
There is one final warning I would make about the use of chi-square, and this involves the handling of non-occurrences. Let's say you ask 20 men and 20 women whether or not they favor a given proposal. Seventeen men and eleven women say "Yes." With 28 yes responses, we can compute equal Es as 28/2 = 14. The analysis would be set up as follows:
         O     E    O-E   (O-E)²   (O-E)²/E
Male    17    14     3      9       0.643
Female  11    14    -3      9       0.643
                     0             χ² = 1.286
This faulty design produces a chi-square of 1.286, which is not significant. The fault lies in the fact that the number of "no" responses for males and females is excluded. The correct approach is to build a contingency table which includes both yes and no responses, as follows:

        Male   Female
Yes      17      11      28
No        3       9      12
         20      20      40
         O     E    O-E   (O-E)²   (O-E)²/E
Yes-M   17    14     3      9       0.643
No-M     3     6    -3      9       1.500
Yes-F   11    14    -3      9       0.643
No-F     9     6     3      9       1.500
                     0             χ² = 4.286
Now χ² = 4.286 and is significant (χ²cv = 3.84, df = 1, α = 0.05). Looking only at "yes" responses (excluding the "no"s) invalidated the test. Further, it lowered the value of chi-square, leaving us with a non-significant finding -- incorrectly.
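The corrected 2×2 analysis can be verified with a short sketch (an illustration of ours, using the margin-total method for expected values):

```python
def chi_square_2x2(table):
    """Chi-square test of independence for a 2x2 table of observed counts."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    # expected cell = row total * column total / grand total
    return sum((o - rows[i] * cols[j] / n) ** 2 / (rows[i] * cols[j] / n)
               for i, r in enumerate(table) for j, o in enumerate(r))

# rows = yes/no, columns = male/female -- non-occurrences included
table = [[17, 11], [3, 9]]
print(round(chi_square_2x2(table), 3))  # 4.286
```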
Summary
In this chapter we've introduced the concept of non-parametric, or distribution-free, statistics. We've looked at the chi-square Goodness of Fit tests with both equal and proportional expected frequencies. We've studied the chi-square Test of Independence. The concept of degrees of freedom was discussed. We've illustrated how the chi-square statistic is computed, how the critical value is obtained, and what significance means in plain English.
Example
Dr. Roberta Damon's dissertation was cited (p. 19-1) for her use of the one-sample z-Test. She also used the chi-square Test of Independence to analyze relationships among several other variables. First, she found that level of marital satisfaction and age category were not independent among missionary wives of her sample (χ² = 7.525, χ²cv = 5.99, df = 2, α = 0.05). The younger wives expressed higher marital satisfaction than older wives. Second, she found that conflict resolution and age category were not independent among missionary wives of her sample (χ² = 6.4513, χ²cv = 5.99, df = 2, α = 0.05). The younger wives were more satisfied with the way conflict is resolved in their marriage than older wives.12

11 This student is now a professor at a prominent Christian university, author of many books, and a prominent leader in his professional organization, proving that non-significant research findings need not impair one's career!
12 Roberta Damon, "A Marital Profile," p. 70.
Vocabulary
contingency coefficient: measures strength of association between two nominal variables (C)
contingency table: table of rows and columns in the chi-square test of independence
Cramer's phi: measures strength of association between two nominal variables (φc)
distribution-free tests: statistics which do not assume a normal distribution of data
equal expected frequencies: E-values computed by dividing N by k
expected frequencies: theoretical values against which observed frequencies are tested (E)
margin totals: sums of counts used to compute Es in the chi-square test of independence
observed frequencies: actual counts of subjects in chi-square categories (O)
proportional expected frequencies: E-values computed from known percentages in the population
Study Questions
1. What are the critical values for the following conditions:
   a. 3 rows, 1 column, p = 0.05
   b. 5 rows, 3 columns, p = 0.01
   c. 4 rows, 9 columns, p = 0.005
2. Define df for both the Goodness of Fit test and the Test of Independence. Demonstrate that k − 1 and (r − 1)(c − 1) are the proper terms for the two dfs.
3. You've done your analysis and your computed chi-square is less than the critical value. What does this mean, given you are testing one variable?
4. If you have a table with 5 rows (margin totals A..E) and 6 columns (margin totals U..Z), what would the expected value of the cell at row 4, column 2 be?
2. The term (O-E) in chi-square is most closely related to ___ in the z-test.
   A. X²   B. X   C. x   D. x²
3. A significant chi-square for the Test of Independence means that
   A. two nominal variables are related.
   B. two independent groups are different.
   C. two dependent groups are different.
   D. two categorical variables are independent.
4. A contingency table has 2 columns and 8 rows. The proper df is
   A. 16   B. 14   C. 7   D. 1
Chapter 24
Non-Parametric Statistics for Ordinal Differences
The Rationale of Testing Ordinal Differences
Wilcoxon Rank-Sum Test
Mann-Whitney U Test
Wilcoxon Matched-Pairs Test
Kruskal-Wallis H Test
Chapters 16-21 covered the major parametric procedures for testing hypotheses of difference (z, t, F). Here we will look at several non-parametric procedures used to test hypotheses of difference when the data is ordinal -- rankings. The most common application of these tests is with small-group testing, where interval/ratio data is converted to ranks. These non-parametric tests are not constrained by the same mathematical restrictions as parametric tests, and so give better results for small n. These procedures include the Wilcoxon Rank-Sum test, the Mann-Whitney U test, the Wilcoxon Matched-Pairs test, and the Kruskal-Wallis H test.
Dr. Gail Linam studied the Bible reading comprehension of children, grades 4-6, across three translations of Scripture: the King James (KJV), the New International (NIV), and the New Century (NCV).1 The children's reading comprehension was measured by two different instruments on a story from the Old and New Testaments. The first was the retelling method (OTR, NTR), and the second was the Cloze method (OTC, NTC). She also averaged the two stories into a single Bible comprehension score (BIBR, BIBC). Ninety-two (92) children were tested. Scores were ranked without regard to group membership of the child, and then sums of ranks were computed for each group (KJV, NIV, NCV). The results are shown in the following computer printout:2
KRUSKAL-WALLIS ONE-WAY ANALYSIS OF VARIANCE FOR 92 CASES
DEPENDENT VARIABLE IS OTR
GROUPING VARIABLE IS VER
1 Gail Linam, "A Study of the Reading Comprehension of Older Children Using Selected Bible Translations" (Ed.D. diss., Southwestern Baptist Theological Seminary, 1993).
2 Ibid., 204.
GROUP   COUNT
1.000    30   (KJV)
2.000    31   (NIV)
3.000    31   (NCV)
KRUSKAL-WALLIS TEST STATISTIC = 18.649 PROBABILITY IS 0.000 ASSUMING CHI-SQUARE DISTRIBUTION WITH 2 DF
The Kruskal-Wallis test shows a significant difference (p = 0.000) in Old Testament Retelling scores (OTR) across the three translations (VER). Notice that the sum of ranks for Group 1 (KJV) is much smaller than for Groups 2 and 3. This reveals much lower reading comprehension among children in grades 4-6 for the King James. She found the same results with the NTR, OTC, NTC, BIBR, and BIBC tests. In every case, children understood much less of the King James English than either the New International or the New Century versions.3 Each of the ordinal tests has a parametric counterpart with which you are already familiar.
Ordinal test                    Parametric counterpart
Wilcoxon Rank-Sum (Ws)          t-Test for independent samples
Mann-Whitney U                  t-Test for independent samples
Wilcoxon Matched-Pairs (T)      Correlated-samples t-Test
Kruskal-Wallis H                One-way ANOVA (both procedures test three or more independent samples for significant difference)
3 Ibid., 205-206.
sum of ranks; groups that score systematically higher produce a larger sum of ranks. If the difference between the ΣR terms is large enough, it will be declared significant.
Orthopedic Patients
Score:  1    2    2    3    6
Rank:   2   3.5  3.5   5    7
                        ΣR = 21
The lowest score (0) receives the rank of 1, and the highest score (12) receives the rank of 11. Two scores have the same count of 2. They are assigned the tied ranks of 3.5 and 3.5 in place of 3 and 4 (there is no rank 4). These rankings are then summed by group, yielding sums of 45 and 21. Since Group 2 (n = 5) is smaller than Group 1 (n = 6), use the ΣR of Group 2: 21.
David C. Howell, Statistical Methods for Psychology, (Boston: Duxbury Press, 1982), 500
U1 = n1·n2 + n1(n1 + 1)/2 − R1        U2 = n1·n2 + n2(n2 + 1)/2 − R2

where n1 = the number of observations in group 1, n2 = the number of observations in group 2, R1 = the sum of ranks assigned to group 1, and R2 = the sum of ranks assigned to group 2.
The U1 term (6) is smaller than the U2 term (24), so the U statistic is 6.
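The U computation from group sizes and rank sums can be sketched as follows (our illustration; the function name is an assumption):

```python
def mann_whitney_u(n1, n2, r1, r2):
    """U from group sizes and rank sums; the U statistic is the smaller of U1, U2."""
    u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1
    u2 = n1 * n2 + n2 * (n2 + 1) / 2 - r2
    return min(u1, u2)

# Howell's example above: group 1 (n=6, sum of ranks 45), group 2 (n=5, sum of ranks 21)
print(mann_whitney_u(6, 5, 45, 21))  # 6.0
```

A handy check: U1 + U2 always equals n1·n2 (here 6 + 24 = 30).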
Howell, p. 505
Change:       10   7   5   25  13  -6   1  40
Rank:          5   4   2    7   6   3   1   8
Signed Rank:  +5  +4  +2   +7  +6  -3  +1  +8

T+ = Σ(+ranks) = +33      T- = Σ(-ranks) = -3
The Rank column shows the Change values ranked low to high without regard to sign (score 1 = rank 1; score 40 = rank 8). The Signed Rank row applies the sign (-, +) of each Change to its Rank. Add together all positive ranks for T+ and all negative ranks for T-. The T statistic equals the smaller of the two T values in absolute terms. Since |T-| (3) is smaller than T+ (33), T = 3.
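The ranking-and-signing procedure just described can be sketched in Python (an illustration of ours; ties on |change| share an average rank, as in the rank-sum example earlier):

```python
def wilcoxon_t(changes):
    """Wilcoxon matched-pairs T: rank |change| low to high (ties share an
    average rank), restore the signs, and take the smaller rank sum."""
    order = sorted(range(len(changes)), key=lambda i: abs(changes[i]))
    ranks = [0.0] * len(changes)
    i = 0
    while i < len(order):
        j = i
        while j < len(order) and abs(changes[order[j]]) == abs(changes[order[i]]):
            j += 1
        avg = (i + 1 + j) / 2          # average of ranks i+1 .. j
        for k in range(i, j):
            ranks[order[k]] = avg
        i = j
    t_plus = sum(r for r, c in zip(ranks, changes) if c > 0)
    t_minus = sum(r for r, c in zip(ranks, changes) if c < 0)
    return min(t_plus, t_minus)

print(wilcoxon_t([10, 7, 5, 25, 13, -6, 1, 40]))  # 3.0
```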
Kruskal-Wallis H Test
The Kruskal-Wallis H Test is a generalization of the Wilcoxon Rank-Sum test to the case where we have three or more independent groups. As such it is the distribution-free counterpart to the one-way analysis of variance test. Using the following equation, we test whether the ΣRs for all groups are equal:
H = [12 / (N(N + 1))] × Σ(Ri² / ni) − 3(N + 1),  summed over the k groups,

where k = the number of groups, ni = the number of observations in group i, Ri = the sum of ranks in group i, and N = the total sample size.
Ibid., p. 507
Depressant           Stimulant            Placebo
Score   Rank         Score   Rank         Score   Rank
  55      9            73     15            61     11
  23      2            82     18            54      8
  40      3            51      7            80     17
  17      1            63     12            47      5
  50      6            74     16
  60     10            85     19
  44      4            66     13
                       69     14
R1 = 35              R2 = 114             R3 = 41
Substituting the R values into the H Test formula, we have the following:
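The substitution can be sketched in Python (our illustration; the function name is an assumption, and no tie-correction is applied since these scores have no ties). The rank sums it produces (35, 114, 41 for N = 19) match the table above, and the formula yields H ≈ 10.097, well beyond the df = 2 critical value of 5.991.

```python
def kruskal_wallis_h(groups):
    """H = 12/(N(N+1)) * sum(Ri^2 / ni) - 3(N+1), ranking all scores together.
    (No tie handling: this example's scores are all distinct.)"""
    all_scores = sorted(s for g in groups for s in g)
    rank = {s: i + 1 for i, s in enumerate(all_scores)}  # global ranks, low to high
    n = len(all_scores)
    total = sum(sum(rank[s] for s in g) ** 2 / len(g) for g in groups)
    return 12 / (n * (n + 1)) * total - 3 * (n + 1)

depressant = [55, 23, 40, 17, 50, 60, 44]     # sum of ranks = 35
stimulant = [73, 82, 51, 63, 74, 85, 66, 69]  # sum of ranks = 114
placebo = [61, 54, 80, 47]                    # sum of ranks = 41
print(round(kruskal_wallis_h([depressant, stimulant, placebo]), 3))  # 10.097
```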
Summary
In this chapter we have investigated the more popular and powerful of the distribution-free ordinal tests of difference. We analyzed the Wilcoxon Rank-Sum test, the Mann-Whitney U test, the Wilcoxon Matched-Pairs Signed-Ranks test, and the Kruskal-Wallis H test. The value of these tests is their ability to handle smaller groups of subjects than their comparable parametric counterparts. This is particularly helpful in the kinds of studies designed in the context of Christian education, administration, counseling, and social work.
Example
Here are the remaining Kruskal-Wallis H test results from Dr. Linam's study:
KRUSKAL-WALLIS ONE-WAY ANALYSIS OF VARIANCE FOR 92 CASES
DEPENDENT VARIABLE IS NTR (New Testament Retelling Test)
GROUPING VARIABLE IS VER
GROUP   COUNT   RANK SUM
1.000    30      888.000  (KJV)
2.000    31     1679.500  (NIV)
3.000    31     1710.000  (NCV)
KRUSKAL-WALLIS TEST STATISTIC = 17.884
PROBABILITY IS 0.000 ASSUMING CHI-SQUARE DISTRIBUTION WITH 2 DF

KRUSKAL-WALLIS ONE-WAY ANALYSIS OF VARIANCE FOR 92 CASES
DEPENDENT VARIABLE IS OTC (Old Testament Cloze Test)
GROUPING VARIABLE IS VER
GROUP   COUNT   RANK SUM
1.000    30      808.500
2.000    31     1546.500
3.000    31     1923.000
KRUSKAL-WALLIS TEST STATISTIC = 27.115
PROBABILITY IS 0.000 ASSUMING CHI-SQUARE DISTRIBUTION WITH 2 DF

KRUSKAL-WALLIS ONE-WAY ANALYSIS OF VARIANCE FOR 92 CASES
DEPENDENT VARIABLE IS NTC (New Testament Cloze Test)
GROUPING VARIABLE IS VER
GROUP   COUNT   RANK SUM
1.000    30      705.000
2.000    31     1742.500
3.000    31     1830.500
KRUSKAL-WALLIS TEST STATISTIC = 33.342
PROBABILITY IS 0.000 ASSUMING CHI-SQUARE DISTRIBUTION WITH 2 DF

KRUSKAL-WALLIS ONE-WAY ANALYSIS OF VARIANCE FOR 92 CASES
DEPENDENT VARIABLE IS BIBR (Average Bible Retelling Test)
GROUPING VARIABLE IS VER
GROUP   COUNT   RANK SUM
1.000    30      851.500
2.000    31     1654.000
3.000    31     1772.500
KRUSKAL-WALLIS TEST STATISTIC = 20.822
PROBABILITY IS 0.000 ASSUMING CHI-SQUARE DISTRIBUTION WITH 2 DF

KRUSKAL-WALLIS ONE-WAY ANALYSIS OF VARIANCE FOR 92 CASES
DEPENDENT VARIABLE IS BIBC (Average Bible Cloze Test)
GROUPING VARIABLE IS VER
GROUP   COUNT   RANK SUM
1.000    30      711.000
2.000    31     1664.500
3.000    31     1902.500
KRUSKAL-WALLIS TEST STATISTIC = 33.765
PROBABILITY IS 0.000 ASSUMING CHI-SQUARE DISTRIBUTION WITH 2 DF
In every case, comprehension of the KJV was significantly lower than of the NIV or NCV.
Vocabulary
Kruskal-Wallis H test: ordinal alternative to one-way ANOVA
Mann-Whitney U test: ordinal alternative to the t-test for independent samples
sum of ranks: key concept in ordinal statistics (ΣR), used to differentiate groups
Wilcoxon T test: ordinal alternative to the matched-samples t-test
Wilcoxon Ws test: ordinal alternative to the t-test for independent samples
Study Questions
1. Describe the rationale for using non-parametric tests.
2. Describe the appropriate way to handle tied ranks in the procedures discussed in this chapter.
3. Explain in your own words when to use the Kruskal-Wallis H, the Wilcoxon T, the Mann-Whitney U, and the Wilcoxon Ws tests.
Chapter 25
1 Kaywin Baldwin LaNoue, "A Comparative Study of the Spiritual Maturity Levels of the Christian School Senior and the Public School Senior in Texas Southern Baptist Churches with a Christian School" (Fort Worth, Texas: Southwestern Baptist Theological Seminary, 1987), 45.
2 Ibid., 46.
3 Ibid., 47.
Two-Way ANOVA
Let's look at a study of the effect of reinforcement on developing vocabulary. One variable is reinforcement, with two levels: immediate and delayed. The second variable is subject socioeconomic class, with two levels: low and middle. The dependent (measured) variable is vocabulary test score. The two-factor table for this experiment is shown below. The means in the table are identified as X̄r,c, where r is the row and c the column. The mean of row 1, column 1 is designated X̄1,1. The other three cell means are designated X̄1,2, X̄2,1, and X̄2,2. The margin mean for row 1 is designated X̄1. The dot (.) replaces the column number, indicating all columns. The other margin means are designated X̄2. (second row), X̄.1 (first column), and X̄.2 (second column).
The 2x2 design above produces three F-ratios. It yields an Fr-ratio (row F-ratio) which compares X̄1. with X̄2. (Row mean 1 with Row mean 2). This F-ratio is the same as if we had computed a one-way ANOVA across reinforcement type alone. The 2x2 design also yields an Fc-ratio (column F-ratio) which compares X̄.1 with X̄.2 (Column mean 1 with Column mean 2). This F-ratio is the same as if we had computed a one-way ANOVA across socioeconomic status alone. These row and column F-ratios are called main effects. The 2x2 design also yields an Frc-ratio, which tests whether there is an interaction between the two independent variables, in this case reinforcement and socioeconomic status. This interaction cannot be tested in one-way ANOVA designs.
The lower left graph illustrates this. In this (fictitious) graph we see that the differences in vocabulary score are parallel across both reinforcement types and socioeconomic groups. Let's look at some illustrations which show three types of interaction.
Types of Interaction
There are three basic kinds of interaction: no interaction, ordinal interaction, and disordinal interaction. Match the illustrations below with each description. For these examples we are using a 2x3 experimental design: Two levels in variable one and three levels in variable two.
No Interaction
In data with no interaction between variables, the lines are parallel. Treatment effects are constant across variable levels. Notice that the difference between means (black circles and gray squares) is the same for the three levels of variable 2.
(Note: As with all statistical concepts, data can have some interaction -- and therefore slightly non-parallel lines -- and still not have significant interaction.)
Ordinal Interaction
In ordinal interaction, the rank order of the cell means of one variable is the same within each level of the second variable. While the lines are not parallel, they do not cross. Notice that the difference between means (black circles and gray squares) varies, but remains in the same order, for the three levels of variable 2.
Disordinal Interaction
In disordinal interaction, the rank order of cell means is not consistent within each level of the second variable. The lines representing each variable cross. Effects of treatments are radically different across the two variables. Notice that the difference between means (black circles and gray squares) not only varies, but changes order across the three levels of variable 2.

A significant ordinal interaction shows that one treatment is superior to another at every level of the second variable. But when there is a significant disordinal interaction, one treatment is superior at one level of the second variable, but not at another. In both cases, interpretation of treatment effect must be made separately for each level of the second variable. Such an analysis is called simple effects. Whenever there is a significant interaction, main effects (Fr, Fc) are meaningless and simple effects must be computed. In the diagram at right, simple effects computations would test the two means at level 1 for significance, then the two means at level 2, and then the two means at level 3, as indicated by the rectangles. There are special formulas for these computations, but we'll not address them here.
between variance in variable A and variable B, variance within cells, and variance due to interaction between A and B. We can summarize these terms as:

SSt = total
SScells = within cells
SSa = variable A
SSb = variable B
SSab = interaction
SSe = error
The Fr-ratio tests the difference between row means 153 and 149.5 for significance, and the Fc-ratio tests the difference between column means 166.25 and 136 for significance. Analyzing this data by computer produces the following ANOVA table:
TABLE 35
SUMMARY TABLE FOR THE TWO-WAY ANOVA

SOURCE      SUM-OF-SQUARES   df   MEAN-SQUARE   F-RATIO       P
C/P              198.220      1       198.220     0.217     0.642
A/I            12730.850      1     12730.850    13.918     0.000
C/P-A/I         2745.269      2745.269? 
dft = N - 1 = 112 - 1 = 111
dfr = r - 1 = 2 - 1 = 1
dfc = c - 1 = 2 - 1 = 1
dfrc = (r - 1)(c - 1) = 1
dfe = k(n - 1), where k is the number of cells (equal cell n's), or dft - (dfr + dfc + dfrc) = 111 - 3 = 108 (unequal cell n's)

MS terms are given by the respective SS/df terms. F-ratios are given by MSx/MSe:

Fr  = MSr/MSe  =   198.22  / 914.701 =  0.217   (main effect)
Fc  = MSc/MSe  = 12730.85  / 914.701 = 13.918   (main effect)
Frc = MSrc/MSe =  2745.269 / 914.701 =  3.001   (interaction effect)
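As a quick check, the MS and F arithmetic can be reproduced from the SS and df values in the summary table:

```python
# Sums of squares and df from the two-way ANOVA summary table (Table 35)
ss = {"C/P": 198.220, "A/I": 12730.850, "C/P-A/I": 2745.269, "Error": 98787.740}
df = {"C/P": 1, "A/I": 1, "C/P-A/I": 1, "Error": 108}

ms = {k: ss[k] / df[k] for k in ss}                              # MS = SS / df
f = {k: ms[k] / ms["Error"] for k in ("C/P", "A/I", "C/P-A/I")}  # F = MS / MSe

print({k: round(v, 3) for k, v in f.items()})
```

The three ratios come out to 0.217, 13.918, and 3.001, matching the table.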
The first F-ratio to consider is the interaction F-ratio. This is Frc (3.001) in the table. The computed value of 3.001 is not significant (p=0.086). Therefore, the interaction between School Type and Participation is not significant. Had the interaction been
4 Data drawn from LaNoue, pp. 107-109.
5 Ibid., p. 46.
7 Active or Inactive in Sunday School.
significant, the main effects values would have been meaningless, and we would have had to compute simple effects values. Because the interaction F is not significant, we can interpret main effect F-ratios directly.

The first main effect is School Type. This F-ratio tests whether the two row means (153, 149.5) are significantly different. Its value is 0.217 (p=0.642). Because the p value is greater than 0.05, these row means are declared not significantly different. The spiritual maturity scores did not differ between seniors in Christian and public high schools.

The second main effect is Active Participation in Sunday School. This F-ratio tests whether the column means (166.25, 136) are significantly different. Its value is 13.918 (p=0.000). Because the p value is less than 0.05, the column means are declared significantly different. Seniors who were active in Sunday School were significantly more mature spiritually than seniors who were inactive, regardless of school type attended. Two graphs of these means are shown below.
The graph at left above orders the means in the same way as the two-way table. We can see some disordinal interaction, but since the interaction F is not significant (p=0.086), these differences are explained by sampling error. We can also see small (non-significant) differences between the Christian school and public school seniors (squares and circles close together). By re-ordering the means in the graph above right, we can focus on the activity variable. The difference between the Active seniors and Inactive seniors is clearly seen here. We also see the slight (non-significant) interaction.

Notice also that the highest mean of spiritual maturity is found in Christian-School-Active seniors (175, n=46). The lowest mean of spiritual maturity is found in Christian-School-Inactive seniors (131, n=13). These findings are strengthened by the fact that they are based on eleven Christian schools and their sponsoring churches in the state of Texas.8

Dr. LaNoue raised the following questions in her proposal: "Does the Christian school accomplish the administrative goal of growth in Christ-likeness or spiritual maturity, or does that public school Christian grow as much or more in spiritual maturity as the Christian school student? Is the Christian school accomplishing something that is not being accomplished in another way?"9 What she found should focus our attention on how active our teenagers are in Sunday School, not on whether they attend a private Christian school. It should also send a wake-up call to administrators of private Christian schools that spiritual growth may be more related to the school's publicity than to its students.
Ibid., p. 19
Three-way ANOVA
Let's extend these ideas to three independent variables. Suppose you wish to measure the level of test anxiety in seminary students (dependent variable). One independent variable is school; the categories are theology, educational ministries, and music. Another independent variable is gender; the categories are male and female. The third independent variable is year in seminary; the levels are 1st, 2nd, 3rd, and 4th+ years. With one analysis we can test the following:

1. Is there a significant difference in test anxiety (F1) across schools, (F2) between genders, or (F3) across length of study? (3-way main effects)
2. Is there an interaction between school and gender (F12), school and years of study (F13), or gender and years of study (F23)? (2-way interactions)
3. Is there an interaction among all three variables (F123)? (3-way interaction)

A three-way ANOVA table looks like the table below. (These table values are not related to the seminary problem above, but are given merely as an example.)
Source   df   SS   MS   F
1                       F1 = 48.79*
2                       F2 = 19.05*
3                       F3 = 34.42*
12                      F12 = 2.18
13                      F13 = 17.56*
23                      F23 < 1
123                     F123 < 1
Error
Total
*p < 0.05

(The df, SS, and MS values of the original table are not reproduced here.)
The table shows that all three main effects and the AC interaction term are significant. Only F2 can be directly interpreted, because the A-C interaction renders F1 and F3 meaningless. In graphing a three-way or higher order ANOVA, you must graph two variables at a time. For example, in graphing the means from the ANOVA table above, you might consider the two levels of A separately: graph B and C for A1, and then B and C for A2. To show the significant interaction, graph A and C for each level of B. Graphing the ABC term is much more difficult, because it forms a plane in 3-dimensional space. With each additional independent variable, the complexity of analysis and interpretation increases. Science likes simple solutions. Avoid overly complex designs, even if your computer software allows you to do them!
Analysis of Covariance
When intact groups must be used for a study, differences may exist between two groups before the treatment begins. Results of the experiment cannot be attributed confidently to the treatment. It would be helpful to have a way to statistically level the groups, adjusting for pre-treatment differences in the post-treatment tests. Fortunately, such a procedure exists. The Analysis of Covariance (ANCOVA) procedure gets its name from the fact that it uses a known variable, called a covariate, to adjust the means of the dependent (measured) variable before applying an ANOVA test. The adjustment to the means is done through the coefficient of determination (r2) and variance accounted for (see the end of Chapter 22).
Uses of ANCOVA
ANCOVA is employed where random assignment of subjects is not possible or permitted. This is frequently the case in schools, where classes must be studied as they are, intact. A simple approach is to give the intact groups a pretest, and then use the pretest as a covariate for posttest scores. But there are many situations which lend themselves to ANCOVA: differences among religious, cultural, community, political, social, economic, or medical diagnostic groups; differences between alternative attitude, aptitude, or achievement groups; differences between vegetarians and non-vegetarians, smokers and non-smokers, users and non-users of a given product, criminals and non-criminals. Measure the differing groups of interest on a large number of variables, and then analyze these variables to discover which ones best distinguish between the groups. This is done through a procedure called Discriminant Analysis. These differentiating
10 Gene V. Glass, Statistical Methods in Education and Psychology, 2nd ed. (Englewood Cliffs, NJ: Prentice-Hall, Inc., 1984), 493-497.
variables can then be used as covariates.

The strongest warning I've heard about ANCOVA came from my statistics professor at the University of North Texas. In defining what ANCOVA does, Dr. William Brookshire said, with a dry smile and somewhat sarcastically, "ANCOVA estimates what the experimental means would be if they weren't what they are." Be sure you understand both the research design and the statistical limitations of your study before you make strong statements about your findings. There lurk numerous pitfalls for researchers who fail to consider their findings carefully.
Example Problem
Gene Glass10 provides this example of the benefits of ANCOVA. An experiment was performed in twenty elementary schools of a large school district. Ten of the schools were randomly designated to be sites for adoption of an innovative science curriculum, Science: A Process Approach (SAPA). The SAPA materials were bought and placed in the ten schools; teachers were trained to use them. The other ten elementary schools continued to use the district's traditional science curriculum. After two years of study in the respective programs, sixth-grade pupils in all twenty schools were given the Science Test (a 45-item measure of scientific methods, reasoning, and knowledge) of the Sequential Tests of Educational Progress (STEP). Each student's score was expressed as a percentage. There were 50 to 120 6th graders in each school, but since the school itself (along with its teachers, administrators, surrounding neighborhoods, and the like) was randomly designated as either SAPA or Traditional (Experimental or Control), the school was the experimental unit. The twenty schools' means of sixth-grade pupils' STEP-Science scores were used as the observational unit in the statistical analysis. The collected data follows:
SAPA (n = 10)   Traditional (n = 10)
   77.63%           64.10%
   74.13            43.67
   67.20            50.40
   78.23            84.33
   57.93            44.93
   57.65            71.43
   83.30            71.10
   73.90            44.57
   45.90            68.23
   64.83            68.47
Mean 68.07%      Mean 61.12%
s²  134.60       s²  201.50
Applying one-way ANOVA to this data produced the following table. Was there a significant difference in the two curricula?
Source     SS         df     MS       F
Between      241.30    1.00  241.30   1.44
Within     3,024.94   18.00  168.05
Total      3,266.24   19.00

Fcv(0.10, 1, 18) = 3.01
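The one-way result can be reproduced from the twenty school means above. Here is a short numpy sketch of the same computation:

```python
import numpy as np

# Mean STEP-Science scores for the twenty schools (from the table above)
sapa = np.array([77.63, 74.13, 67.20, 78.23, 57.93,
                 57.65, 83.30, 73.90, 45.90, 64.83])
trad = np.array([64.10, 43.67, 50.40, 84.33, 44.93,
                 71.43, 71.10, 44.57, 68.23, 68.47])

grand = np.concatenate([sapa, trad]).mean()
ss_between = 10 * (sapa.mean() - grand) ** 2 + 10 * (trad.mean() - grand) ** 2
ss_within = ((sapa - sapa.mean()) ** 2).sum() + ((trad - trad.mean()) ** 2).sum()

F = (ss_between / 1) / (ss_within / 18)   # df between = 1, df within = 18
print(round(F, 2))  # ≈ 1.44, matching the ANOVA table
```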
The answer is no. The F-ratio is smaller than the critical value even at α = 0.10, an inappropriately lenient level of significance. But what if we know something more about the schools that could explain some of the error variance, and in so doing, reduce the error term of the analysis? IQ differences between the schools might affect the results. It is reasonable to assume that schools with high scholastic aptitude (IQ) will tend to have higher means on the achievement test than schools with lower IQ means. IQ means for each of the twenty schools are shown with the achievement means below:
        SAPA                 Traditional
   IQ       ACH            IQ       ACH
  105.7    77.63%         101.2    64.10%
  100.3    74.13           97.6    43.67
   94.3    67.20           96.4    50.40
  108.7    78.23          109.6    84.33
   93.1    57.93           94.0    44.93
   96.7    57.65          105.4    71.43
  106.9    83.30          102.4    71.10
  100.3    73.90          100.6    44.57
   86.5    45.90          104.2    68.23
   96.1    64.83          112.6    68.47
Mean 98.86  68.07%   Mean 102.40   61.12%
s²   47.94 134.60    s²    33.60  201.50
Using adjusted SS terms, an ANCOVA table can be built from this data which looks like this:
Source     SS         df     MS       F
Between      786.71    1.00  786.71   16.10
Within       830.88   17.00   48.88
Total      1,617.59   18.00

Fcv(0.01, 1, 17) = 15.7
Now we find the groups significantly different (p < 0.01). This adjustment was possible because of the high correlation between mean IQ and mean science achievement scores (r = +.931, +.805 for the experimental and control groups respectively). ANCOVA used these correlations to reduce the error variance and provide a more powerful analysis than was possible through ANOVA alone.
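ANCOVA can be framed as a comparison of two regression models: a full model predicting ACH from IQ plus group membership, and a reduced model using IQ alone. The sketch below is my own formulation, not the text's adjusted-SS computation, though for a one-covariate, two-group design the two are algebraically equivalent; it reproduces the adjusted F from the data above:

```python
import numpy as np

iq = np.array([105.7, 100.3, 94.3, 108.7, 93.1, 96.7, 106.9, 100.3, 86.5, 96.1,
               101.2, 97.6, 96.4, 109.6, 94.0, 105.4, 102.4, 100.6, 104.2, 112.6])
ach = np.array([77.63, 74.13, 67.20, 78.23, 57.93, 57.65, 83.30, 73.90, 45.90, 64.83,
                64.10, 43.67, 50.40, 84.33, 44.93, 71.43, 71.10, 44.57, 68.23, 68.47])
group = np.array([1.0] * 10 + [0.0] * 10)   # 1 = SAPA, 0 = Traditional

def rss(X, y):
    """Residual sum of squares from a least-squares fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return ((y - X @ beta) ** 2).sum()

ones = np.ones_like(iq)
rss_full = rss(np.column_stack([ones, iq, group]), ach)   # covariate + treatment
rss_reduced = rss(np.column_stack([ones, iq]), ach)       # covariate only

# F for the treatment effect, adjusted for IQ: df = 1 and 17
F = (rss_reduced - rss_full) / (rss_full / 17)
print(round(F, 2))
```

The treatment F works out to roughly 16.1, exceeding the 0.01 critical value of 15.7, in agreement with the ANCOVA table.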
way ANOVA design, but the multiple learner measurements -- achievement, anxiety level, attitude toward the course, attitude toward the instructor -- make the design a MANOVA. A counseling researcher might be interested in the impact of group vs. individual counseling and level of social competence on four counselee variables. These two independent variables form a 2-way ANOVA design, but the four dependent variables make the design a multi-factor MANOVA. It is enough at this point to be aware of the existence of these procedures, and to know what has been done when you discover a MANOVA analysis in your reading.
Summary
In this chapter, you've been introduced to the concepts of factorial ANOVA, interaction, ANCOVA, and MANOVA. A basic understanding of these advanced techniques will help you understand the research articles you'll read as part of your literature analysis. The following table gives a summary of the key elements in these procedures.
Name of Analysis      Number of Independent Variables                                            Number of Dependent Variables
One-factor ANOVA      ONE (Questioning strategy)                                                 ONE (Achievement)
Multi-factor ANOVA    MANY (Questioning strategy, Structure, Variety, and Attitude of Teacher)   ONE (Achievement)
One-factor ANCOVA     ONE plus COVARIATE (Questioning strategy, IQ)                              ONE (Achievement)
One-factor MANOVA     ONE (Questioning strategy)                                                 MANY (Achievement, attitude toward class, anxiety level)
Multi-factor MANOVA   MANY                                                                       MANY (Achievement, attitude toward class, anxiety level)
Example
Dr. Gail Linam's dissertation11 was cited earlier for her use of the Kruskal-Wallis H Test to measure differences between three groups of ranks (see Chapter 24). Her use of the H Test was secondary to her primary statistic of two-way ANOVA. Her dependent variable was reading comprehension score. She used the Retelling Method and the Cloze Test to produce reading comprehension scores for an Old Testament story (OTR, OTC), a New Testament story (NTR, NTC), and a Bible score, the average of the two stories (BIBR, BIBC). Her two independent variables were CAMP (church campus or mission campus) and VER (Bible version: KJV, NIV, NCV), as shown
11 Information for these tables from Linam, pp. 174, 196, and 198-200.
below:
          CAMP
VERSION   Church Campus   Mission Campus
KJV       xx.xxx          xx.xxx
NIV       xx.xxx          xx.xxx
NCV       xx.xxx          xx.xxx
ANALYSIS OF VARIANCE
SOURCE      SUM-OF-SQUARES   DF   MEAN-SQUARE   F-RATIO       P
CAMP            3591.258      1      3591.258    26.490    0.000
VER             3377.792      2      1688.896    12.458    0.000
CAMP*VER         175.082      2        87.541     0.646    0.527
ERROR          11794.434     87       135.568
There is no interaction between CAMP and VERsion (p=.527), so we can test the two independent variables separately. There was a significant difference in OTR reading comprehension scores between the church campus and mission campus children (p<.001). Looking at the scores below, we can see the church campus children scored higher than mission campus children. This was true in every case. There was a significant difference across translation (p<.001). The scores below show that the KJV produced the lowest comprehension scores. This was true in every case.
          CAMP
VERSION   Church Campus   Mission Campus
KJV       18.81            7.00
NIV       32.96           15.00
NCV       34.41           23.11
ANALYSIS OF VARIANCE
SOURCE      SUM-OF-SQUARES   DF   MEAN-SQUARE   F-RATIO       P
CAMP            3215.289      1      3215.289    29.485    0.000
VER             3406.045      2      1703.022    15.617    0.000
CAMP*VER         615.553      2       307.777     2.822    0.065
ERROR           9378.172     86       109.048
There is no interaction (p=.065). Both CAMP and VER show significant differences. Here are the group means for NTR:
ANALYSIS OF VARIANCE
SOURCE      SUM-OF-SQUARES   DF   MEAN-SQUARE   F-RATIO       P
CAMP            1995.647      1      1995.647    39.754    0.000
VER             2239.649      2      1119.824    22.307    0.000
CAMP*VER           2.229      2         1.115     0.022    0.978
ERROR           4367.414     87        50.200
There is no interaction (p=.978). Both CAMP and VER show significant differences.
          CAMP
VERSION   Church Campus   Mission Campus
KJV       14.91            4.22
NIV       23.27           13.33
NCV       27.55           17.56
ANALYSIS OF VARIANCE
SOURCE      SUM-OF-SQUARES   DF   MEAN-SQUARE   F-RATIO       P
CAMP            1873.493      1      1873.493    47.556    0.000
VER             2618.725      2      1309.362    33.236    0.000
CAMP*VER         119.222      2        59.611     1.513    0.226
ERROR           3388.047     86        39.396
There is no interaction (p=.226). Both CAMP and VER show significant differences.
          CAMP
VERSION   Church Campus   Mission Campus
KJV       11.05            0.38
NIV       23.50           10.78
NCV       22.82           16.11
In each case we see that church campus children scored significantly higher in reading comprehension than mission children, and readers of the KJV scored significantly lower than readers of either the NIV or NCV versions -- in every condition. Specific computations of pair-wise differences were made using the FLSD procedure. See the Example on page 21-8 for these findings. Dr. Linam's findings indicate that teachers and curriculum writers need to avoid use of the King James Version of the Scriptures for older children (grades 4-6). Children of this age simply cannot understand the text as well as the New International or New Century versions.
Vocabulary
ANCOVA: Analysis of Covariance: uses pretest differences to adjust posttest means
disordinal interaction: factorial ANOVA: ranks of means differ across treatment levels
factorial ANOVA: designs which have 2 or more independent variables (2-way, 3-way, k-way)
interaction effect: effects of one treatment not constant across levels of a second
main effects: Row and Column F-ratios in a factorial design
MANOVA: Multivariate Analysis of Variance: more than one dependent variable
multi-factor MANOVA: factorial design (e.g., 2-way) with more than one dependent variable
ordinal interaction: factorial ANOVA: ranks of means constant across treatment levels
simple effects: testing differences of means of one treatment across all levels of the second
Study Questions
1. Define factorial ANOVA.
2. What is the advantage of using factorial ANOVA over multiple one-way ANOVAs?
3. Explain the term interaction.
4. Compare and contrast ordinal and disordinal interaction.
5. If you discover a significant interaction in your data,
   A. What implications does this have for main effects interpretation?
   B. What further procedure must you employ?
6. Answer the questions below using this computer printout:

Source          df      SS       MS       F       p
Parity           1.00   11.24    11.24    2.70    0.11
Size             2.00   90.61    45.31   10.83    0.00
Parity x Size    2.00   16.32     8.16    1.95    0.15
Error           49.00  205.00     4.18
Total           54.00  323.17
A. How many subjects were involved in this study?
B. How many levels of PARITY were used?
C. How many levels of SIZE were used?
D. How many groups were tested?
E. Which term was used in the denominator of all the F-ratios?
F. Was the interaction between PARITY and SIZE significant?
G. Was PARITY a significant treatment variable? How do you know?
H. Was SIZE a significant treatment variable? How do you know?
I. In this case, would you (a) interpret the main effects of PARITY and SIZE, or would you (b) apply simple effects tests? Explain why.

7. Design a study in your field of specialty using the following research designs: factorial ANOVA, ANCOVA, or MANOVA.
Chapter 26

Regression Analysis
Linear Regression
The Equation of a Line
The Linear Regression Equation
Standard Error of Estimate
The Multiple Regression Equation
A Walk Through a Computer Printout
A Multiple Regression Example
In Chapter 22, we discussed the use of correlation to measure the strength of association between two variables. In this chapter we extend this concept to regression analysis, which allows us to predict the value of a variable from one or more others. Linear regression analyzes two variables -- one predicted variable (called the criterion) and one predictor variable. Multiple regression analyzes three or more variables -- one criterion and two or more predictor variables. The mathematical computations for regression analysis are complex, but with the advent of the personal computer and the development of statistical packages, regression analysis is rapidly becoming the most popular statistical procedure -- particularly in the fields of psychology, sociology and education.

Dr. Martha Bessac studied predictor variables of marital satisfaction of 375 student couples at Southwestern Baptist Theological Seminary in 1986.1 Aware of the increased stress on seminary marriages, including a rise in the number of divorces -- averaging twenty-four per year at that time2 -- Dr. Bessac, as part of the Registrar's staff, wanted to determine what factors might be contributing to this. She hypothesized, based on her literature search, the following variables as significant positive predictors of student marital satisfaction: sex (gender), age of husband, age of wife, seminary program of husband, number of semesters husband has been enrolled, number of hours husband enrolled in this semester, number of hours husband has completed towards degree, education level of husband, education level of wife, number of months married, number of children, child density, child spacing ratio, number of hours per week husband is employed, number of hours a week wife is employed, total income, number of hours per week husband engaged in church activities, and number of hours per week wife engaged in church activities.3
1 Martha Sue Bessac, "The Relationship of Marital Satisfaction to Selected Individual, Relational, and Institutional Variables of Student Couples at Southwestern Baptist Theological Seminary" (Fort Worth, Texas: Southwestern Baptist Theological Seminary, 1986).
2 Ibid., p. 19.
3 Ibid., pp. 20-21.
4 Ibid., pp. 41, 44.
She found four significant predictors accounting for 9.6% of marital satisfaction variability. These were Months Married (t=-5.428, b= -0.054), Number of Hours Wife Works (t=-2.637, b= -0.183), Number of Hours Husband Works (t=-2.605, b= -0.094), and Income (t=-2.089, b= -0.158). Further, the regression equation produced by the analysis was shown to be a viable model (F=12.925, Fcv=2.39).4

Notice that all the regression coefficients (b's) are negative. As Months Married increased, marital satisfaction decreased. This is perhaps explained by considering two extreme groups of student couples: one group of newly-weds, in seminary-as-honeymoon mode, compared to older couples with teenage children, leaving behind "home, friends and family" for cramped quarters and hectic schedules. Increased hours of work for both husband and wife meant decreased marital satisfaction. Higher incomes, lower satisfaction. Number of children, degree plan, age, number of credit hours in the semester, hours engaged in church activities -- these and the other specified variables proved not significant.

Since only 9.6% of the variability of marital satisfaction (Adj. R²=0.096) is accounted for by the four predictor variables, 90.4% of the variability of marital satisfaction was not accounted for. This variability was either accounted for by unnamed variables, or by the unsystematic variation among the 375 couples. Still, the multiple regression procedure declared the model viable by posting a significant F-ratio of 12.925 (Fcv=2.39).5

Before introducing the concepts of regression, however, we need to review the fundamentals of linear equations upon which regression is built.
And if X=100?
If X = 100, then Y = 2(100) + 4 = 204, giving the coordinate pair (100, 204).
The two elements of slope (m) and y-intercept (b) define a line. These concepts of slope and y-intercept are used in computing a regression equation by which one variable can be predicted by the other.
Linear Regression
Several scattergrams are displayed on page 22-2. The first two scattergrams illustrate perfect correlations of +1.00 and -1.00 respectively. Remember that in both cases, the points fell along a straight line. The term linear (lin-ee-er) derives from the line represented by points in a scatterplot. We can compute an equation for a line which fits any scatterplot. Using this equation, we can predict one variable from another: the stronger the relationship, the more closely the points cluster around a line, and the better the accuracy of the prediction.

Using a process called the least squares method, regression analysis produces a best fit linear equation. "Best fit?" In Chapter 16, we learned that the sum of deviations about the mean equals zero (Σx = 0). It is also true that the sum of squared deviations about the mean (Σx²) is a minimum value. That is, the sum of squares about the mean is smaller than it would be if computed about any other value. Looking at the mean and Σx² another way, the mean of a group of scores produces the smallest sum of squares. It is a least squares measure of central tendency. Just as the mean is the "best fit point" of a single group of scores, the linear regression equation is the "best fit line" through a scatterplot of two groups of scores. It is a least squares fit because -- just as Σx = 0 and Σx² = a minimum -- so deviations of scatterplot points about the computed line, called residuals (e), produce the values Σe = 0 and Σe² = a minimum. More on this a little later. Let's look at the regression equation.
Y′ = a + bX

where Y′ (pronounced "Y prime") is the predicted value of Y, a refers to the y-intercept point, and b refers to the slope of the regression line. Regression analysis produces values for a and b such that we can develop the best fit line through a scatterplot.
Computing a and b
Given a set of scores, how do we calculate the values of a and b? Here are the formulas we use:

b = (nΣXY - (ΣX)(ΣY)) / (nΣX² - (ΣX)²)

a = Ȳ - bX̄

First, compute b. The elements of the formula bear a close resemblance to part of the Pearson's r correlation coefficient.
Second, use b and the means of X and Y to compute a. Earlier, we computed values of Y from values of X using the equation Y = 2X + 4. Let's use those same values, and compute the equation components a and b. If we do this right, we should get a = 4 and b = 2. The X- and Y-values below come from the computed coordinates at the bottom of page 26-2.
 X      Y      X²      Y²      XY
 0      4       0      16       0
 1      6       1      36       6
 2      8       4      64      16
 5     14      25     196      70
 6     16      36     256      96
Σ: 14  48      66     568     188

(ΣX)² = 196     Means: X̄ = 14/5 = 2.8;  Ȳ = 48/5 = 9.6
First, compute b:

b = (nΣXY - (ΣX)(ΣY)) / (nΣX² - (ΣX)²) = (5(188) - (14)(48)) / (5(66) - 196) = (940 - 672) / (330 - 196) = 268/134 = 2

Second, compute a:

a = Ȳ - bX̄ = 9.6 - (2)(2.8) = 9.6 - 5.6 = 4

Third, substitute the values of a and b into the equation Y′ = a + bX, which results in

Y′ = 4 + 2X
This is the same equation we started with on page 26-2. While we may seem to be going around in circles, we have established the fundamentals of conducting regression analysis -- computing a linear equation from a set of matched scores.
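The whole computation can be condensed into a few lines of code (a sketch; the names are mine):

```python
# X and Y pairs generated from Y = 2X + 4 (the worked example above)
X = [0, 1, 2, 5, 6]
Y = [4, 6, 8, 14, 16]
n = len(X)

sum_x, sum_y = sum(X), sum(Y)
sum_xy = sum(x * y for x, y in zip(X, Y))
sum_x2 = sum(x * x for x in X)

b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)  # slope
a = sum_y / n - b * (sum_x / n)                               # intercept: a = Ȳ - bX̄

print(a, b)
```

Running this recovers a = 4 and b = 2, the components of the line we started with.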
se = √( Σ(Y - Y′)² / (n - 2) )

Compare this equation to the one for estimated population standard deviation (s) on page 16-9. The concepts are the same. The term n - 2 is used because two degrees of freedom are lost -- due to having two groups of scores. Another way to compute the standard error of estimate is to use the correlation coefficient (r) as follows:
se = sY √(1 - r²)

where sY is the standard deviation of the Y scores and r is the correlation between X and Y. The larger the correlation (r) between X and Y, the smaller the term under the radical, and the smaller the standard error of estimate. As r approaches 1.00, se approaches 0, which reflects a greater accuracy in prediction.

In this section we have reviewed the fundamentals of linear equations, the formulas for computing a and b for the equation Y′ = a + bX, and the concepts of residuals and the standard error of estimate. But the real power of regression analysis for the complex studies of the social sciences is found in multiple regression analyses.
actual analysis.6 The data analysis was done by the author using SYSTAT.
The Data
The chart below displays 6 of the data sets collected from 50 courses. Each row is a single course. Scores are mean scores representing the entire class.
Course  OVERALL  TEACH  EXAM  KNOW  GRADE  ENROLL
1         3.4     3.8    3.8   4.5   3.5      21
2         2.9     2.8    3.2   3.8   3.2      50
3         2.6     2.2    1.9   3.9   2.8     800
4         3.8     3.5    3.5   4.1   3.3     221
...       ...     ...    ...   ...   ...     ...
49        4.0     4.2    4.0   4.4   4.1      18
50        3.5     3.4    3.9   4.4   3.3      90
[The full correlation matrix for the six variables is not reproduced here.]
With all coefficients shown, any coefficient between any two variables can be quickly found. Note the strong correlations between TEACH-OVERALL (0.804) and ENROLL-EXAM (-0.558) above. Notice also that ENROLL is negatively correlated with every other variable (as classes get bigger, student attitudes get more negative).
one of them (OVERALL) reflects student attitude toward the course as a whole, it is the most appropriate variable to serve as criterion. TEACH, EXAM, KNOW, GRADE, and ENROLL are appropriate predictor variables. How well do the predictor variables account for the variance in the criterion OVERALL? And what are the values of a (constant) and the b coefficients based on the data? Here is our raw score regression model equation:

OVERALL′ = a + b1(TEACH) + b2(EXAM) + b3(KNOW) + b4(GRADE) + b5(ENROLL)
ANALYSIS OF VARIANCE
SOURCE       SUM-OF-SQUARES   DF   MEAN-SQUARE
REGRESSION        13.934       5        2.787
RESIDUAL           4.511      44        0.103
The above SYSTAT multiple regression printout of the data has three distinct sections which relate to the three questions stated above. These are delineated by dotted lines which are not normally seen in a printout. We will now take each section in turn.
Section One
DEP VAR: OVERALL N: 50 MULTIPLE R: .869 SQUARED MULTIPLE R: .755 ADJUSTED SQUARED MULTIPLE R: .728 STANDARD ERROR OF ESTIMATE: 0.320
The first section of the regression printout, shown above, includes the elements defined below. The specific values for this example are displayed in brackets [ ].
DEP VAR: The dependent variable (criterion). [OVERALL]
N: Number of cases or subjects in the study. [50]

MULTIPLE R: Correlation between OVERALL and the predictors. [0.869]

SQUARED MULTIPLE R: Proportion of variance in OVERALL accounted for by the predictors. [0.755]

ADJUSTED SQUARED MULTIPLE R: If you were to use a multiple regression equation with another set of data, the R² value from the second data set would be smaller than the R² produced by the original data set. This reduction in R² is called shrinkage. The adjustment depends on the number of subjects (N) and the number of variables (k) in the study. This is the true value of R². [0.728]

STANDARD ERROR OF ESTIMATE: The standard deviation of the residuals. [0.320]
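The shrinkage adjustment can be checked with the standard formula, 1 - (1 - R²)(N - 1)/(N - k - 1). It is my assumption that SYSTAT uses this common form; the small discrepancy in the result comes from starting with the rounded R²:

```python
# Adjusted R² via the standard shrinkage formula: 1 - (1 - R²)(N - 1)/(N - k - 1)
r2 = 0.755   # squared multiple R from the printout (rounded)
N = 50       # cases
k = 5        # predictor variables

adj_r2 = 1 - (1 - r2) * (N - 1) / (N - k - 1)
print(round(adj_r2, 3))  # ≈ 0.727; the printout shows 0.728 (it uses the unrounded R²)
```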
The answer to our first question is that the five predictor variables account for 72.8 percent of the variability of OVERALL (Adj. R² = 0.728).
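The shrinkage adjustment described above can be reproduced numerically. The sketch below assumes the standard adjustment formula, Adj. R² = 1 − (1 − R²)(N − 1)/(N − k − 1), which the text's values are consistent with (the small discrepancy comes from R² being rounded to three places in the printout):

```python
def adjusted_r_squared(r2, n, k):
    """Shrinkage-adjusted R^2: penalizes R^2 for the number of
    predictors (k) relative to the number of subjects (n)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Values from the printout: R^2 = 0.755, N = 50, k = 5 predictors
adj = adjusted_r_squared(0.755, 50, 5)
print(f"{adj:.3f}")  # close to the printout's 0.728
```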
Section Two
VARIABLE   COEFFICIENT   STD ERROR   STD COEF   TOLERANCE        T   P(2 TAIL)
CONSTANT        -1.195       0.631      0.000   1.0000000   -1.893       0.065
TEACH            0.763       0.133      0.662    .4181886    5.742       0.000
EXAM             0.132       0.163      0.106    .3245736    0.811       0.422
KNOW             0.489       0.137      0.325    .6746330    3.581       0.001
GRADE           -0.184       0.165     -0.105    .6196885   -1.114       0.271
ENROLL           0.001       0.000      0.124    .6534450    1.347       0.185

(COEFFICIENT = the b's; STD ERROR = s_b; STD COEF = the β's; TOLERANCE reflects multicollinearity; T = b/s_b; P(2 TAIL) = p(t))
The second section of the printout, shown above, details the analysis of each predictor individually. It is this part of the printout that provides the regression coefficients (the b's and β's) as well as their significance tests.
VARIABLE: Heading for the variable names in the regression model. CONSTANT is the value of OVERALL when all predictors equal zero.
COEFFICIENT: Heading for the values of the respective regression coefficients (the b's) of the regression equation. Using these values, you can write the regression equation for OVERALL and the five predictors as follows:

OVERALL' = -1.195 + 0.76(TEACH) + 0.13(EXAM) + 0.49(KNOW) - 0.18(GRADE) + 0.001(ENROLL)

Here OVERALL' is the predicted OVERALL score; -1.195 is the constant, the value of OVERALL when all predictors equal 0; and each regression coefficient is multiplied by the raw score of its variable.
Given the mean scores of the five predictors for any class, we can predict what that class's OVERALL score will be.
STD ERROR: Standard deviation of the regression coefficient (b). It is used in a t-test to determine whether the b is significant.
STD COEF: Standardized regression coefficients, or beta weights (β). Betas are to b's what z-scores are to X's. While the b's are used with raw scores in regression equations, as in 0.76(TEACH) above, the betas are used with z-scores. The beta for TEACH equals 0.662. The proper term for TEACH in a standardized regression equation is 0.662(zTEACH). Because betas are standardized, they can be directly compared according to relative strength. The b's cannot be compared directly because they usually represent different score ranges: ENROLL ranges from a low of 7 to a high of 800, while the other scales range from 1 to 5. The standardization of the betas eliminates this problem of differing ranges, just as z-scores eliminate the problem of comparing raw scores with differing variabilities. In our example, we see that TEACH (β = 0.662) is more than six times as influential as EXAM (β = 0.106), and twice as influential as KNOW (β = 0.325).
TOLERANCE: The ideal condition in multiple regression analysis is for each predictor variable to be related to the criterion, but not to the other predictor variables. Predictor variables are supposed to be independent of each other, but they rarely are. Tolerance values near zero (0) in this printout indicate that some
26-9
of the predictors are highly intercorrelated. This undesirable situation is called multicollinearity. Look for tolerance values near 1.0.
T: If you divide the value of each regression COEFFICIENT by its respective STD ERROR, you will get the values in this column. For example, the t-value for the b on the variable TEACH is equal to 0.763/0.133 = 5.742. The t-test values are used to answer the question, "Is this predictor significant?"

VARIABLE      COEFFICIENT   STD ERROR   STD COEF   TOLERANCE        T   P(2 TAIL)
CONSTANT           -1.195       0.631      0.000   1.0000000   -1.893       0.065
>>>> TEACH          0.763   /   0.133   =                       5.742       0.000
EXAM                0.132       0.163      0.106    .3245736    0.811       0.422
KNOW                0.489       0.137      0.325    .6746330    3.581       0.001
GRADE              -0.184       0.165     -0.105    .6196885   -1.114       0.271
ENROLL              0.001       0.000      0.124    .6534450    1.347       0.185

P(2 TAIL): The probability of obtaining the computed t-value if the true coefficient were zero. For TEACH, p is very small (less than 0.001): we would almost never get a t-value of 5.742 if b for TEACH were 0. Therefore, we say that TEACH is a significant predictor. There is a 42.2% (0.422) chance of getting the t-value of 0.811 for EXAM by chance. EXAM is not a significant predictor, since p > 0.05.
The answer to our second question is that TEACH and KNOW are significant predictors of OVERALL. EXAM, GRADE, and ENROLL are not.
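The T column can be checked directly from the printed coefficients and standard errors. A quick sketch (values transcribed from the printout; small discrepancies arise because b and s_b are rounded to three places there, and ENROLL is omitted because its printed standard error rounds to 0.000):

```python
# (b, s_b) pairs transcribed from the printout
predictors = {
    "TEACH": (0.763, 0.133),
    "EXAM":  (0.132, 0.163),
    "KNOW":  (0.489, 0.137),
    "GRADE": (-0.184, 0.165),
}

# t = b / s_b, as in the T column
t_values = {name: b / sb for name, (b, sb) in predictors.items()}
for name, t in t_values.items():
    print(f"{name}: t = {t:.2f}")
```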
Section Three
ANALYSIS OF VARIANCE
SOURCE       SUM-OF-SQUARES   DF   MEAN-SQUARE   F-RATIO       P
REGRESSION          13.934     5         2.787    27.184   0.000
RESIDUAL             4.511    44         0.103
The third section of the printout details the analysis of the model as a whole. Is the model, as represented by the regression equation being tested, a viable one?
SOURCE: There are two sources of variance in regression analysis. One is the regression itself; the other is the variance left unaccounted for after the regression analysis. The predicted score (Ŷ) seldom equals the criterion score (Y). There is always some error of estimate (e), such that Y = Ŷ + e. We can therefore divide the criterion scores into two parts: Ŷ (regression) and e (residual).
SUM-OF-SQUARES: The sum of squared deviations about the regression line. The total sum of squares is divided between the accounted-for REGRESSION (Ŷ) and the unaccounted-for RESIDUAL (e). The regression sum of squares is the sum of squared deviations of Ŷ about the mean of Y: Σ(Ŷ − Ȳ)². The residual sum of squares is Σe², which equals Σ(Y − Ŷ)².
DF: Degrees of freedom. DFreg equals the number of variables minus 1 [dfreg = 6 − 1 = 5]. DFres equals the number of subjects minus the number of variables [dfres = 50 − 6 = 44].
MEAN-SQUARE: The mean-square terms are variances. MSreg equals SSreg divided by dfreg. MSres equals SSres divided by dfres.
F-RATIO: The F-ratio equals MSreg/MSres and is used to determine whether the variance due to regression is sufficiently greater than the variance due to residual noise to render the model significant (viable).
P: The probability of the computed F-RATIO being this large by chance. Any time p < 0.05, a significant model is indicated. In our case, a P of 0.000 means p is very small (less than 0.001) and indicates a significant model.
The answer to our third question is that we do have a viable model. The F-ratio is significant.
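The ANOVA arithmetic above can be verified step by step from the printed sums of squares and degrees of freedom:

```python
# ANOVA quantities from the printout (five-predictor model)
ss_reg, df_reg = 13.934, 5
ss_res, df_res = 4.511, 44

ms_reg = ss_reg / df_reg   # mean-square regression: about 2.787
ms_res = ss_res / df_res   # mean-square residual: about 0.103
f = ms_reg / ms_res        # F-ratio: about 27.18

print(f"MSreg = {ms_reg:.3f}, MSres = {ms_res:.3f}, F = {f:.2f}")
```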
26-10
Since the other variables are not significant, let's analyze another model which includes only the two significant predictors, TEACH and KNOW.
ANALYSIS OF VARIANCE
SOURCE       SUM-OF-SQUARES   DF   MEAN-SQUARE   F-RATIO       P
REGRESSION          13.627     2         6.814    66.467   0.000
RESIDUAL             4.818    47         0.103
Study the printout above and the analysis below carefully. First, did reducing the number of predictors from five to two reduce the amount of variance accounted for in OVERALL? The Adjusted R-Square is 0.728, exactly the same as we found with five predictors. We lost nothing here, which is good. Second, did we increase the standard error of estimate? No, it is 0.320, the same as before. This too is good. Third, are the two predictors significant? Yes, both TEACH and KNOW show large t-test values and very low probabilities (0.000 means p < 0.001, very small). This is good. Fourth, is our model more sound? The F-ratio is larger, showing a better ratio of regression to noise. Notice that the sum-of-squares values are not much different from before, but the change in regression df from 5 to 2 produces a larger MEAN-SQUARE value. In conclusion, this second model is better. For these 50 courses, we can account for nearly 73% of the students' ratings of courses by knowing their ratings of the instructors' TEACHing skills and their instructors' perceived KNOWledge of the subject. ENROLLment in the class, quality of EXAMs, and the students' anticipated GRADEs are not significant predictors of OVERALL quality.
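The fourth point, the improved F-ratio, can be checked by computing both models' F-ratios from their ANOVA tables:

```python
def f_ratio(ss_reg, df_reg, ss_res, df_res):
    """F = (SSreg / dfreg) / (SSres / dfres)."""
    return (ss_reg / df_reg) / (ss_res / df_res)

f_full    = f_ratio(13.934, 5, 4.511, 44)   # five predictors: about 27.18
f_reduced = f_ratio(13.627, 2, 4.818, 47)   # TEACH and KNOW only: about 66.47

print(f"{f_full:.2f} vs {f_reduced:.2f}")
```

The reduced model's larger F confirms that dropping the three nonsignificant predictors cost almost no regression sum of squares while freeing up degrees of freedom.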
26-11
OVERALL' = -1.298 + 0.71(TEACH) + 0.538(KNOW)
zOVERALL' = 0.616(zTEACH) + 0.358(zKNOW)

Summary
In this chapter you have been introduced to the world of regression analysis. You have seen how scattergrams of data can be reduced to a single predictor equation. You have learned how to compute the two constants of a linear regression line, a and b, and how to read a computer printout from a multiple regression analysis.
Example
Dr. Dean Paret studied the relationship between nuclear family health and selected family-of-origin variables among 302 married subjects in 1991.8 The criterion (predicted) variable was overall perceived nuclear family health {FUNC}, as measured by the Family Adaptability and Cohesion Evaluation Scale (FACES-R).9 The predictor variables related to family of origin. Autonomy {AUTON} measures an individual's sense of independence and self-reliance; it includes free expression, responsibility, mutual respect, openness, and experiences of separation or loss.10 The second predictor was Intimacy {INTIM}, which reflects close, familiar, and usually affectionate or loving personal relationships without feeling threatened or overwhelmed; it includes expression of feelings, sensitivity and warmth, mutual trust, and the lack of undue stress in conflict situations.11 Both AUTON and INTIM were measured by the Family-of-Origin Scale (FOS).12 Additional demographic variables were gathered by means of a questionnaire: educational level {EDUC}, degree program {DEGREE}, number of years in graduate school {YRS}, income level {SALARY}, sex of participant {SEX}, and whether or not the couple was a dual-career family {DUAL}.13 Here is Dr. Paret's final printout:
MULTIPLE REGRESSION PRINTOUT14
DEP VAR: FUNC   N: 302   MULTIPLE R: .898   SQUARED MULTIPLE R: .806
ADJUSTED SQUARED MULTIPLE R: .804   STANDARD ERROR OF ESTIMATE: 34.973

VARIABLE   COEFFICIENT   STD ERROR   STD COEF   TOLERANCE        T   P(2 TAIL)
CONSTANT        61.373      11.220      0.000        .        5.470       0.000
AUTON            1.256       0.122      0.647   0.1656319   10.320       0.000
EDUC            11.206       4.642      0.063   0.9462106    2.414       0.016
INTIM            0.406       0.126      0.200   0.1700070    3.224       0.001
YRS              7.588       1.973      0.107   0.8370002    3.847       0.000
ANALYSIS OF VARIANCE
SOURCE       SUM-OF-SQUARES   DF   MEAN-SQUARE   F-RATIO       P
REGRESSION      1513896.178    4    378474.045   309.434   0.000
RESIDUAL         363265.557  297      1223.116
8. Dean Kevin Paret, A Study of the Perceived Family of Origin Health as It Relates to the Current Nuclear Family in Selected Married Couples (Fort Worth, Texas: Southwestern Baptist Theological Seminary, 1991).
9. Ibid., p. 39.
10. Ibid., p. 53.
11. Ibid., pp. 53-54.
12. Ibid., p. 38.
13. Ibid., p. 39.
14. Ibid., Table 15, p. 149.
26-12
Question One: How much variability of the subjects' family health (FUNC) was accounted for by family-of-origin autonomy and intimacy, and the demographic variables of education level (EDUC: high school, college, graduate school) and years in seminary (YRS: 1, 2, 3, 4+)? The Adjusted R² value (0.804) answers this question: 80.4%. Less than 20% of the variability in current family health is unaccounted for. This is a very strong finding.
Question Two: Which predictor variables are significant? What is the order of influence of these variables on family health? Since this is the fourth and final printout in a series, all nonsignificant predictors have already been eliminated (DEGREE, DUAL, SALARY, SEX). All of the variables listed above show p(t) values less than 0.05. The rank order of influence is given by the beta values under the heading STD COEF. Autonomy (AUTON) has by far the greatest influence on family health (FUNC) with β = 0.647. Intimacy (INTIM) is next with β = 0.200, followed by years enrolled in seminary (YRS) with β = 0.107, and finally educational level (EDUC) with β = 0.063. The raw and standardized regression equations for Dr. Paret's study are

FUNC' = 61.373 + 1.256(AUTON) + 11.206(EDUC) + 0.406(INTIM) + 7.588(YRS)
zFUNC' = 0.647(zAUTON) + 0.063(zEDUC) + 0.200(zINTIM) + 0.107(zYRS)

Question Three: Is this a viable model? Does it adequately predict family health among the 302 subjects? The answer is found in the ANOVA table and F-ratio. The F-ratio of 309.434 (p < .001) tells us this is a very strong model.
How important are the variables EDUC and YRS to the FUNC model? Dr. Paret dropped these out of his full model and produced the following:
MULTIPLE REGRESSION PRINTOUT15
DEP VAR: FUNC   N: 302   MULTIPLE R: .890   SQUARED MULTIPLE R: .792
ADJUSTED SQUARED MULTIPLE R: .791   STANDARD ERROR OF ESTIMATE: 36.121

VARIABLE   COEFFICIENT   STD ERROR   STD COEF   TOLERANCE        T   P(2 TAIL)
CONSTANT        90.925       8.139      0.000        .       11.171       0.000
AUTON            1.350       0.124      0.696   0.1702486   10.887       0.000
INTIM            0.425       0.130      0.209   0.1702486    3.269       0.001
ANALYSIS OF VARIANCE
SOURCE       SUM-OF-SQUARES   DF   MEAN-SQUARE   F-RATIO       P
REGRESSION      1487058.544    2    743529.272   569.888   0.000
RESIDUAL         390103.191  299      1304.693
How much did Adj. R2 change? The amount of variance-accounted-for dropped from 0.804 to 0.791, a change of -0.013, or a little over one percent. This is good. We did not lose much R2 by dropping two of the four predictor variables. Are AUTON and INTIM still significant predictors? Yes (p<0.001, p=0.001). Did the model suffer from dropping EDUC and YRS? No. The F-ratio is larger than before, showing a smaller, stronger model.
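The Adjusted R² comparison above can be verified with the same shrinkage formula used earlier in the chapter (the standard adjustment is assumed here):

```python
def adjusted_r_squared(r2, n, k):
    """Shrinkage-adjusted R^2 (standard formula, assumed here)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

adj_full    = adjusted_r_squared(0.806, 302, 4)  # four predictors
adj_reduced = adjusted_r_squared(0.792, 302, 2)  # AUTON and INTIM only

print(f"{adj_full:.3f} -> {adj_reduced:.3f} (change {adj_reduced - adj_full:+.3f})")
```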
15. Ibid., p. 150.
16. Ibid., p. 101.
26-13
Family patterns of relationship transfer from generation to generation. Healthy family relationships are rooted in the degree of autonomy and intimacy experienced in the family of origin. Likewise, dysfunctional family relationships are rooted in family-of-origin dysfunction. These same patterns show up in seminary couples. Some enter the ministry to help others out of the need to help themselves because of a dysfunctional family background. Inability to establish healthy relationships in the family has been found to transfer to the ministry: such ministers have difficulty forming ministerial relationships in the pastorate.16 The challenge to seminaries is to go far beyond teaching students how to minister: it is also to help dysfunctional students break with past patterns and learn anew how to establish autonomous and appropriately intimate relationships with others. The health of our churches is at stake.
Vocabulary
adjusted squared multiple R: multiple correlation coefficient after adjustment for shrinkage
correlation matrix: representation of multiple variables and their intercorrelations
criterion variable: predicted or dependent variable in multiple regression (Y)
linear equation: mathematical formula which describes a straight line (Y = 2X + 3)
linear regression: predicting one variable by another by the best-fit line through a scattergram
multicollinearity: degree of inter-correlation among predictor variables
multiple correlation coefficient: correlation between the criterion variable and all predictors together (R)
multiple linear regression: prediction of one variable by two or more others
predictor variable: variable(s) used to estimate a criterion variable in regression analysis
regression sum of squares: sum of squared deviations between Ŷ and the mean of Y
regression coefficient: raw score correlate of criterion variable in regression (b)
residual sum of squares: sum of squared deviations between Y and Ŷ (Σe²)
residual: difference between the true Y value and the predicted value Ŷ (e = Y − Ŷ)
shrinkage: reduction in R² value when an equation is applied to new data
slope: one of two determiners of a regression line, m = (ΔY/ΔX)
squared multiple R: proportion of variance of Y accounted for by all predictors (R²)
standard error of estimate: standard deviation of the residuals
standardized regression coefficient: standardized score correlate of criterion variable in regression (β)
tolerance: reflects the degree of multicollinearity among predictors
y-intercept: one of two determiners of a regression line, the value of Y when X = 0
Study Questions
1. Draw a set of axes. Label the X-axis (horizontal) from 0 to 10 and the Y-axis (vertical) from +4 to -1. Compute the 10 values of Y for X = 1, 2, ..., 10 with the equation Y = -0.5X + 4. Plot the 10 points on your axes.
2. Define e. Show how it is calculated.
3. Work through the explanation of the first regression printout using the second printout on page 26-10. Identify and define each of the following elements:
a. Dep var
b. N
c. Squared multiple R
d. Adjusted squared multiple R
i. Tolerance
j. Multiple R
k. P(2 tail)
l. Regression sum-of-squares
26-14
4. A regression analysis was done on the data given below. Draw a scatterplot of the data. Compute a and b, then draw the proper regression line on the scatterplot. Study the regression printout below and describe your findings. Include R, Adj. R-squared, coefficients, t-test values and probabilities, and the F-ratio. The following data are scores from 15 students: Bible knowledge test scores (Y) and the number of semester hours of Bible in college (X).

X:  15  18  18  12   9   9   6  12  15  12  12  12  15  12  18
Y:  23  27  30  19  18  21  17  21  27  29  25  22  26  25  24
MULTIPLE REGRESSION PRINTOUT
DEP VAR: KNOW   N: 15   MULTIPLE R: .728   SQUARED MULTIPLE R: .530
ADJUSTED SQUARED MULTIPLE R: .494   STANDARD ERROR OF ESTIMATE: 2.792

VARIABLE   COEFFICIENT   STD ERROR   STD COEF   TOLERANCE       T   P(2 TAIL)
CONSTANT        13.066       2.845      0.000   1.0000000   4.593       0.001
HOURS            0.810       0.212      0.728   1.0000000   3.828       0.002

F-RATIO: 14.657   P: 0.002
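Readers who want to check their hand computations for question 4 can do so with a few lines of Python (the X and Y values are read from the data table above; the results agree with the printout's constant of 13.066 and coefficient of 0.810):

```python
# Semester hours of Bible (X) and Bible knowledge test scores (Y), 15 students
X = [15, 18, 18, 12, 9, 9, 6, 12, 15, 12, 12, 12, 15, 12, 18]
Y = [23, 27, 30, 19, 18, 21, 17, 21, 27, 29, 25, 22, 26, 25, 24]

n = len(X)
mean_x = sum(X) / n   # 13.0
mean_y = sum(Y) / n   # 23.6

# b = sum of cross-product deviations / sum of squared X deviations
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))
sxx = sum((x - mean_x) ** 2 for x in X)
b = sxy / sxx             # slope
a = mean_y - b * mean_x   # y-intercept

print(f"b = {b:.3f}, a = {a:.3f}")  # b = 0.810, a = 13.066
```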
26-15
Chapter 27
Evaluation Checklist
27
Guidelines for Evaluating Research Proposals
This chapter is included in the text for two reasons. The first is to give you guidance as you write your own proposal. If you merely mimic another research proposal, you will miss the most important learning goal of the thesis or dissertation process: the creation of a plan to solve a real problem through research and analysis. By choosing a subject of interest and systematically applying the guidelines in this checklist, you will master, from the ground up, skills that will help you in all your problem-solving situations. The second is to provide a checklist to help you evaluate the research proposals of other students. Use this checklist along with the descriptions in Chapter 2 to master the essential elements of writing an effective research proposal.
Introduction
Rate each item: Yes!  Yes?  No?  No!

Does the introductory statement move you, like a funnel, from a general to a specific view of the problem of the study?
Does the introductory statement avoid personal pronouns, subjective language, and awkward grammar?
Is the Problem stated clearly, tersely, and objectively?
Is the Problem stated in the proper format (relationship between variables or difference between groups)?
Does the Purpose clearly state the intention of the study?
Does the Purpose break the Problem down into subsections for analysis?
27-1
Is the Related Literature a true synthesis of researched material, rather than a review, summary, or report?
Are most of the materials footnoted in the Related Literature section drawn from primary, rather than secondary, sources?
Is there an obvious organizational scheme to the Related Literature section: historical, topical, or related to the hypotheses?
Does the Related Literature section give you the impression that the writer is thoroughly familiar with what is known in the field?
Does the Significance of the Study section answer the question "So what?" (Does it explain why this particular study is important to the field? Does it include referenced support for the study?)
Does the Hypothesis state an expected answer to the Problem which has been stated?
Is the Hypothesis written in testable form?
Is the Hypothesis stated appropriately (usually as a research, rather than a null, hypothesis)?
The Method
Rate each item: Yes!  Yes?  No?  No!

Is the study's population clearly defined?
Is the procedure for sampling (if used) clearly explained?
Is the size of the sample(s) stated?
Is there a clear description of the instrument(s) that will be used to gather data?
Are the stated limitations actual limitations to the study or merely delimitations?
Are the stated assumptions legitimate in the context of the proposal, rather than cop-outs for shallow thinking?
Are the stated definitions legitimate in the context of the study (operational, unusual connotation, or restricted meaning) rather than obvious or commonly used words?
Is the research design (if needed) clearly explained?
Are the procedures for collecting data clearly stated step-by-step?
Do the procedures avoid fuzzy language and word magic?
Is there evidence that the researcher has considered potential
27-2
The Analysis
Rate each item: Yes!  Yes?  No?  No!

Are the procedures for analyzing data clearly stated step-by-step?
Does the researcher give evidence of understanding the statistical procedures he/she has chosen?
Are the research hypotheses restated in null form for testing?
Is each of the hypotheses tested with the appropriate statistic?
Is there agreement among the Problem, Hypothesis, and statistical procedures used (all deal with relationship, or difference, or congruence, etc.)?
Are there model charts, graphs, or tables which show how the data will be organized and reported in the final paper?
Are all references cited in the body of the paper (and only those cited) referenced in the bibliography?
General
Rate each item: Yes!  Yes?  No?  No!

Does the paper exhibit theological reflection and application to the Christian ministry context?
Is Southwestern style used correctly throughout the paper?
Does the paper generally exhibit good writing skills: spelling, grammar, syntax, clarity of thought?
Does the paper exhibit good organizational skills: flow of thought, effective transitions from section to section, the impression that the paper is all of one piece?
Does the paper present a professional appearance?
On the basis of what you've learned in class, what grade would you consider proper for this paper?
Scoring: Count the number of marks in each category (the counts should add to 40). Multiply each count by the appropriate factor to obtain the evaluation points for that category, then add the four subtotals together for the total points (0-200).
27-3
Appendix 1
Answer Key
Sample Test Questions
The following answer key is provided to reinforce your study of key concepts in the course. Mastering a language requires more than memorizing correct answers, however. You will be tested by questions which will require you to use the languages of research and statistics. Use these sample questions as a beginning point in testing your understanding.
Chapter One
1. D 2. B 2. A 3. D, E, P, E, C, D, C, H, Q, C, H, V, P, R, C, E, C, E, D, D 3. B 4. D
Chapter Two
1. C
Chapter Three
1. Birth year is interval: year is the unit of interval, but 0 A.D. is not the beginning of time.
C° or F° is interval: degree is the unit of interval, but 0 degrees is not the absence of all heat.
Class rank is ordinal: 1st, 2nd, 3rd...
Test score is ratio: point is the unit of interval, and 0 points means absence of mastery.
Nationality is nominal: categories of Caucasian, Black, Hispanic, Oriental, etc.
Body weight is ratio: pound is the unit of interval, and 0 means absence of weight.
2. B
3. D
4. FIRO-B scores are ratio, though it is difficult to tell from this statement that it isn't interval.
A single attitude scale item produces ordinal data (a group of items produces interval data).
Employment status is nominal: subjects are categorized into one of three options.
Study Habits is ratio: the assumption, given max = 100, is that min = 0.
W-GCTA score is ratio.
Leadership Style is nominal: subjects are categorized into one of five styles.
Attrition Ranking is ordinal.
Child Density is ratio data (number/number = ratio: 0.00-1.00).
Chapter Four
1. D 2. D 2. Ind 3. D, S, N, S, N, D, S, N 3. Mult 4. O 5. L 6. P 7. I
Chapter Five
1. M 8. S
Chapter Six
1. C 2. D 2. A 2. D 2. A 3. D 3. B 3. B 3. C 4. A 4. A 4. C
Chapter Seven
1. C
Chapter Eight
1. D
Chapter Nine
1. C
Chapter Ten
1. D 2. C 2. D 2. C 2. I/E 9. I/A 2. 0.375 3. A 3. B 3. D 3. I/I 10. I/M 3. 125 4. I/B 11. I/G 4. -0.358 5. 1/C 12. E/K 5. 128.6% 6. 66.7% 7. 0.867 6. E/L 7. I/D 4. A 4. D 5. C 6. C
Chapter Eleven
1. F
Chapter Twelve
1. A 1. I/J 8. E/H 1. -12 8.
Chapter Thirteen
Chapter Fourteen
8. Solving for X:
Z(X − W) = (1 − C)        divide both sides by Z, leaving...
X − W = (1 − C)/Z         add W to both sides, leaving...
X = ((1 − C)/Z) + W

9. Solving for A:
Place the term containing A on the left, then multiply both sides by 3, leaving...
AB − 1 = 3B(C − 1)        add 1 to both sides, leaving...
AB = 3B(C − 1) + 1        divide both sides by B, leaving...
A = (3B(C − 1) + 1)/B     separating the terms over B, we have...
A = (3B(C − 1)/B) + (1/B) simplifying the first term, we have...
A = 3(C − 1) + (1/B)
The purpose of the algebraic exercises 8-9 above is to accustom you to thinking in terms of relationships between numerical variables apart from actual data. If you can become comfortable thinking in terms of symbols linked together in equations (rather than words linked in sentences), then the statistical formulas you'll encounter will be far less threatening.
Chapter Fifteen
1. A 2. C 2. A 2. C 2. C 3. C 3. D 3. D 3. D 4. A 4. D 4. C 4. D 5. T 6. T 7. F (2.58) 5. B
Chapter Sixteen
1. B 1. C 1. D
Chapter Nineteen
1. A 2. B 2. C 2. B 2. D 2. C 2. D 2. B 2. B 3. B 3. A 3. B 3. A 3. A 3. D 3. D 3. B 4. T 5. F (excludes)
Chapter Twenty
1. C 1. B 1. E 1. D 1. D 1. D 1. C 4. F (matched, correlated) 5. F (a descriptive) 4. B 4. C 4. C 4. A 4. C 4. D 5. C 5. A 5. F 6. B
Chapter Twenty-one Chapter Twenty-two Chapter Twenty-three Chapter Twenty-four Chapter Twenty-five Chapter Twenty-six
Appendix 2
References
Clement, Dan Earl. The Relationship Between Recalled Parental Contact and Adult Personality Adjustment. Ed.D. diss., Southwestern Baptist Theological Seminary, 1987.
Cook, Marcus Weldon. A Study of the Relationship Between Active Participation as a Teaching Strategy and Student Learning in a Southern Baptist Church. Ph.D. diss., Southwestern Baptist Theological Seminary, 1994.
Covington, Randy. An Investigation into the Administrative Structure and Polity Practiced by the Union of Evangelical Christians - Baptists of Russia. Ph.D. proposal, Southwestern Baptist Theological Seminary, 1999.
Crain, Matthew Kent. Transfer of Training and Self-Directed Learning in Adult Sunday School Classes in Six Churches of Christ. Ed.D. diss., Southwestern Baptist Theological Seminary, 1987.
Damon, Roberta McBride. A Marital Profile of Southern Baptist Missionaries in Eastern South America. Ed.D. diss., Southwestern Baptist Theological Seminary, 1985.
Da Silva, Maria Bernadete. A Study of the Relationship Between Leadership Styles and Selected Social Work Values of Social Work Administrators in Texas. Ed.D. diss., Southwestern Baptist Theological Seminary, 1993.
DeVargas, Robert. A Study of Lessons in Character: The Effect of Moral Judgement Curriculum Upon Moral Judgement. Ph.D. diss., Southwestern Baptist Theological Seminary, 1998.
Doyle, Judith N. A Critical Analysis of Factors Influencing Student Attrition at Four Selected Christian Colleges. Ed.D. diss., Southwestern Baptist Theological Seminary, 1984.
Eldridge, Daryl Roger. The Effect of Student Knowledge of Behavioral Objectives on Achievement and Attitude Toward the Course. Ed.D. diss., Southwestern Baptist Theological Seminary, 1985.
Floyd, James Scott. The Interaction Between Employment Status and Life Stage on Marital Adjustment of Southern Baptist Women in Tarrant County, Texas. Ed.D. diss., Southwestern Baptist Theological Seminary, 1990.
Gill, Rollie. A Study of Leadership Styles of Pastors and Ministers of Education in Large Southern Baptist Churches. Ph.D. diss., Southwestern Baptist Theological Seminary, 1997.
Havens, Joan Ellen. A Study of Parent Education Levels as They Relate to Academic Achievement Among Home Schooled Children. Ed.D. diss., Southwestern Baptist Theological Seminary, 1991.
Hedin, Norma Sanders. A Study of the Self-Concept of Older Children in Selected Texas Churches Who Attend Home Schools as Compared to Older Children Who Attend Christian Schools and Public Schools. Ed.D. diss., Southwestern
Baptist Theological Seminary, 1990.
LaNoue, Kaywin Baldwin. A Comparative Study of the Spiritual Maturity Levels of the Christian School Senior and the Public School Senior in Texas Southern Baptist Churches With a Christian School. Ed.D. diss., Southwestern Baptist Theological Seminary, 1987.
Lawson, Margaret P. A Study of the Relationship Between Continuance of LIFE Courses in the LIFE Launch Pilot Churches and Selected Descriptive Factors. Ph.D. diss., Southwestern Baptist Theological Seminary, 1994.
Linam, Gail. A Study of the Reading Comprehension of Older Children Using Selected Bible Translations. Ed.D. diss., Southwestern Baptist Theological Seminary, 1993.
Mathis, Robert. A Descriptive Study of Joe Davis Heacock: Educator, Administrator, Churchman. Ed.D. diss., Southwestern Baptist Theological Seminary, 1984.
McQuitty, Marcia G. A Study of the Relationship Between Dominant Management Style and Selected Variables of Preschool and Children's Ministers in Texas Southern Baptist Churches. Ed.D. diss., Southwestern Baptist Theological Seminary, 1992.
Mullen, Steven Keith. A Study of the Difference in Study Habits and Study Attitudes Between College Students Participating in an Experiential Learning Program Using the Portfolio Assessment Method of Evaluation and Students Not Participating in Experiential Learning. Ph.D. diss., Southwestern Baptist Theological Seminary, 1995.
Paret, Dean Kevin. A Study of the Perceived Family of Origin Health as It Relates to the Current Nuclear Family in Selected Married Couples. Ed.D. diss., Southwestern Baptist Theological Seminary, 1991.
Perez, Darlene J. A Correlational Study of Baptist Youth Groups in Puerto Rico and Youth Curriculum Variables. Ed.D. diss., Southwestern Baptist Theological Seminary, 1991.
Southerland, Dan. A Study of the Priorities in Ministerial Roles of Pastors in Growing Florida Baptist Churches and Pastors in Plateaued or Declining Florida Baptist Churches. Ed.D. diss., Southwestern Baptist Theological Seminary, 1993.
Steibel, Sophia. An Analysis of the Works and Contributions of Leroy Ford to Current Practice in Southern Baptist Curriculum Design and in Higher Education of Selected Schools in Mexico. Ed.D. diss., Southwestern Baptist Theological Seminary, 1988.
Tam, Stephen. A Comparative Study of Three Teaching Methods in the Hong Kong Baptist Theological Seminary. Ed.D. diss., Southwestern Baptist Theological Seminary, 1989.
Waggoner, Brad J. The Development of an Instrument for Measuring and Evaluating the Discipleship Base of Southern Baptist Churches. Ed.D. diss., Southwestern Baptist Theological Seminary, 1991.
Welch, Robert Horton. A Study of Selected Factors Related to Job Satisfaction in the Staff Organizations of Large Southern Baptist Churches. Ed.D. diss., Southwestern Baptist Theological Seminary, 1990.
Williamson, Bradley Dale. An Examination of the Critical Thinking Abilities of Students Enrolled in a Master's Degree Program at Selected Theological Seminaries. Ph.D. diss., Southwestern Baptist Theological Seminary, 1995.
The following studies were also cited in the text:
Yount, Barbara Parish. An Analytical Study of the Procedures for Identifying Gifted Students in Programs for the Hearing-Impaired. Master of Arts thesis, Texas Woman's University, 1986.
Yount, William R. A Critical Comparison of Three Specified Approaches to Teaching Based on the Principles of B. F. Skinner's Operant Conditioning and Jerome Bruner's Discovery Approach in Teaching the Cognitive Content of a Selected Theological Concept to Volunteer Adult Learners in the Local Church. Ed.D. diss., Southwestern Baptist Theological Seminary, 1978.
________. A Monte Carlo Analysis of Experimentwise and Comparisonwise Type I Error Rate of Six Specified Multiple Comparison Procedures When Applied to Small k's and Equal and Unequal Sample Sizes. Ph.D. diss., University of North Texas, 1985.
Appendix 4

Bibliography
Cited Works
The single largest regret that I have with this most recent edition of the text is that I was unable to update the following sources. Through my doctoral program in the 1970s and my preparation to teach research design and statistical analysis in the 1980s, I gathered these texts and used them extensively for illustrations, examples, and explanations in my classes. The books listed below and quoted in the text are excellent resources, even if they are not the most recent.
Ary, Donald, Lucy Chesar Jacobs, and Asghar Razavieh. Introduction to Research in Education. New York: Holt, Rinehart and Winston, 1972.
Babbie, Earl. The Practice of Social Research, 3rd ed. Belmont, CA: Wadsworth Publishing Company, 1983.
Bell, Judith. Doing Your Research Project. Philadelphia: Open University Press, 1987.
Borg, Walter R. Applying Educational Research: A Practical Guide for Teachers. New York: Longman Publishing Company, 1981.
____________ and Meredith D. Gall. Educational Research: An Introduction, 4th ed. New York: Longman Publishing Company, 1983.
Churchill, Gilbert A. Marketing Research: Methodological Foundations, 2nd ed. Hinsdale, IL: The Dryden Press, 1979.
Drew, Clifford J., and Michael L. Hardman. "Designing Experimental Research." Chapter 5 in Designing and Conducting Behavioral Research. New York: Pergamon Press, 1985.
Glass, Gene V. Statistical Methods in Education and Psychology, 2nd ed. Englewood Cliffs, NJ: Prentice-Hall, Inc., 1984.
Hinkle, Dennis E., William Wiersma, and Stephen G. Jurs. Basic Behavioral Statistics. Boston: Houghton Mifflin Company, 1982.
Hopkins, Charles D. Educational Research: A Structure for Inquiry. Columbus, Ohio: Charles E. Merrill Publishing Company, 1976.
Howell, David C. Statistical Methods for Psychology. Boston: Duxbury Press, 1982.
Kubiszyn, Tom, and Gary Borich. Educational Testing and Measurement: Classroom Application and Practice, 2nd ed. Glenview, IL: Scott, Foresman and Company, 1987.
Lewin, Miriam. Understanding Psychological Research. New York: John Wiley & Sons, 1979.
References
Mueller, Daniel J. Measuring Social Attitudes: A Handbook for Researchers and Practitioners. New York: Teachers College Press, 1986.
Nunnally, Jum. Educational Measurement and Evaluation, 2nd ed. New York: McGraw-Hill Book Company, 1972.
Payne, David. The Assessment of Learning: Cognitive and Affective. Lexington, Mass.: D. C. Heath and Company, 1974.
Sax, Gilbert. Foundations of Educational Research. Englewood Cliffs, N.J.: Prentice-Hall, 1979.
True, June. Finding Out: Conducting and Evaluating Social Research. Belmont, CA: Wadsworth Publishing Company, 1983.
SYSTAT Computer Statistical Package
Wilkinson, Leland. A System for Statistics, Version 4. Evanston, IL: SYSTAT, Inc., 1988.