Research Design and Statistical Analysis in Christian Ministry

Table of Contents

Preface

Unit I: Research Fundamentals

Chapter 1: Scientific Knowing
    Descriptive Research; Qualitative Research; Suspicion of Science by the Faithful; Suspicion of Religion by the Scientific; There Need Be No Conflict; Summary; Vocabulary; Study Questions; Sample Test Questions

Chapter 2
    Introduction; Preliminaries; The Statement of the Problem; Purpose of the Study; Synthesis of Related Literature; Significance of the Study; The Hypothesis; Method: Population, Sampling, Instrument, Limitations, Assumptions, Definitions, Design, Procedure for Collecting Data, Procedure for Analyzing Data; Analysis: Testing the Hypotheses, Reporting the Data; Appendices; Bibliography, or Cited Sources; Personal Anxiety; Professionalism in Writing: Clear Thinking, Unified Flow, Quality Library Research, Efficient Design, Accepted Format; Reference Material; Practical Suggestions; Summary

Chapter 3
    Operationalization; Summary

Chapter 4
    Revision Examples (Examples 1-5, each with Comments and a Suggested Revision): Association Between Two Variables; Association of Several Variables; Difference Between Two Groups; Differences Between More Than Two Groups. Dissertation Examples: Regression Analysis; Correlation of Competency Rankings; Factorial Analysis of Variance; Chi-Square Analysis of Independence

Chapter 5
    Summary

Chapter 6
    Preliminaries; Summary

Chapter 7
    Steps in Sampling: Identify the Target Population; Identify the Accessible Population; Determine the Size of the Sample (Accuracy, Cost, The Homogeneity of the Population, Other Considerations, Sample Size Rule of Thumb); Select the Sample. Types of Sampling; Summary; Study Questions; Sample Test Questions

Chapter 8
    Reliability

Chapter 9
    Summary

Chapter 10
    Preliminaries; Disadvantages: Rate of return, Inflexibility, Subject motivation, Verbal behavior only, Loss of control. The Interview: Advantages (Flexibility, Motivation, Observation, Broader Application, Freedom from mailings); Disadvantages (Time, Cost, Interviewer effect, Interviewer variables); Summary

Chapter 11
    Objective Tests; Writing Multiple Choice Items: Pose a singular problem, Avoid repeating phrases in responses, Minimize negative stems, Make responses similar, Make responses mutually exclusive, Make responses equally plausible, Randomly order responses, Avoid sources of irrelevant difficulty, Eliminate extraneous material, Avoid "None of the Above". Supply Items; Matching Items; Essay Tests; Item Analysis: Rank Order Subjects by Grade, Categorize Subjects into Top and Bottom Groups, Compute Discrimination Index, Revise Test Items, Examples. Summary; Vocabulary; Study Questions; Sample Test Questions; Sample Test

Chapter 12
    Formatting the Scale; Write instructions; Scoring the Likert Scale; Develop item pool; Compute item weights; Rank the items by weight; Choose Equidistant Items; Formatting the Scale; Administering the Scale; Scoring; Vocabulary; Study Questions; Sample Test Questions; Sample Thurstone Scale; Sample Thurstone Scale (with weights)

Chapter 13
    External Invalidity; Types of Designs; Quasi-experimental Designs; Pre-experimental Designs; Summary

Chapter 14
    Mathematical Concepts; Summary

Chapter 16
    Measures of Variability; Sample Statistics; Estimated Parameters; Sampling Distributions; Summary

Chapter 21
    Procedures Computed; Summary

Chapter 22
    Summary; Vocabulary; Study Questions; Sample Test Question

Chapter 23
    Summary

Chapter 26
    Summary; Focus on the Significant Predictors; Multiple Regression Equations; Example; Vocabulary; Study Questions; Sample Test Questions

Appendices
    Answer Key to Sample Test Questions; Word List; Critical Value Tables; Dissertations and a Thesis; Bibliography
Chapter 1
Scientific Knowing

Ways of Knowing
Science as a Way of Knowing
The Scientific Method
Types of Research
Have you considered how you know what you know? As you sit in classes or talk with friends, have you noticed that people differ in the way they know things? Look at six students who are discussing the issue of "modern translations" of the Bible.
Student 1: "I use the King James Version because that's the translation I grew up using. Everybody in our church back home uses it."
Student 2: "I use the New King James because my pastor says it offers the best of beauty and modern scholarship."
Student 3: "I've prayed about what version to use. I like the Amplified Version because it is so clear in its language. It just feels right."
Student 4: "I've tried five or six different translations for devotional reading and for preparation for teaching in Sunday School. After evaluating each one, I've come back again and again to the New International Version. It's the best translation for me."
Student 5: "The essence of Bible study is understanding the message, whatever translation we may use. Therefore, I use different translations depending on my study goals."
Student 6: "I use the New King James because most of my congregation is familiar with it. In a recent survey, I found that 84% of our members use the KJV or NKJV."
Each of these students reflects a different basis for knowing which translation to use. Which student most closely reflects your view? How did you come to know what you know?
Ways of Knowing
As we begin our study of research design and statistical analysis, we need to understand the characteristics of scientific knowing, and how this kind of knowing differs from other ways we learn about our world. We will first look at five non-scientific ways of knowing: common sense, authority, intuition/revelation, experience, and deductive reasoning. Then we'll analyze the scientific method, which is based on inductive reasoning.
Common Sense
Common sense refers to knowledge we take for granted. We learn by absorbing the customs and traditions that surround us: from family, church, community, and nation. We assume this knowledge is correct because it is familiar to us. We seldom question, or even think to question, its correctness, because it just is. Unless we move to another region, or go to school and study the views of others, we have nothing to challenge our way of thinking. It's just common sense! But common sense told us that the earth is flat until Columbus discovered otherwise. Common sense told us that dunce caps and caning were effective student motivators until educational research discovered the negative aspects of punishment. Common sense may well be wrong.
Authority
Authoritative knowledge is an uncritical acceptance of another's knowledge. When we are sick, we go to the doctor to find out what to do. When we need legal help, we go to a lawyer and follow his advice. Since we cannot verify the knowledge on our own, we must simply choose to accept or reject the expert's advice. It would be foolish to argue with a doctor's diagnosis, or a lawyer's perception of a case. This is the meaning of "uncritical acceptance" in the definition above. The only recourse to accepting the expert's knowledge is to get a second opinion from another expert.

As Christians, we believe that God's Word is the authority for our life and work. The Living Word, the Lord Himself, within us confirms the Truth of the Written Word. The Written Word confirms our experiences with the Living Word. Scripture is a valid source of authoritative knowledge. However, we spend a lot of time discussing Scriptural interpretations. Our discussions often deteriorate into conflicts about "my pastor's interpretations." We use our own pastor's interpretation as authoritative because of the influence he has had in our own life. (We can substitute any authoritative person here, such as a father or mother, Sunday School teacher, or respected colleague.) But is the authority correct? Authoritative knowing does not question the source of knowledge. Yet differing authorities cannot be correct simultaneously. How do we test the validity of an authority's testimony?
Intuition/Revelation
Intuitive knowledge refers to truths which the mind grasps immediately, without need for proof or testing or experimentation. The properly trained mind intuits the truth naturally. The field of geometry provides a good example of this kind of knowing. Let's say I know that line segment A is the same length as line segment B. I also know that line segment B is the same length as line segment C. From these two truths, I immediately recognize that line segments A and C are equal. Or, in shorthand,

IF A = B and B = C, THEN A = C

I do not need to draw the three lines and measure them. My mind immediately grasps the truth of the statement. Revelation is knowledge that God reveals about Himself. I do not need to test this knowledge, or subject it to experimentation. When Christ reveals Himself to us, we know Him in a personal way. We did not achieve this knowledge by our own efforts, but merely received the revelation of the Lord. We cannot prove this knowledge to others, but it is bedrock truth to those who've experienced it. Problems arise, however,
when we apply intuitive knowing to ministry programs. "Well, it's obvious that regular attendance in Sunday School helps people grow in the Lord." Is it? We work hard at promoting Sunday School attendance. Does it actually change the lives of the attenders? Is it enough for people to think it does, whether or not real change takes place? Answers to these questions come from clear-headed analysis, not from intuition.
Experience
Experiential knowledge comes from trial-and-error learning. We develop it when we try something and analyze the consequences. You've probably heard comments like these: "We've already tried that and it failed." Or another: "We've found that holding Vacation Bible School during the third week of August, in the evening, is best for our church." The first is negative. The speaker is saying there's no need to try that ministry or program again, because it was already tried. The second is positive. This church has tried several approaches to offering Vacation Bible School and found the best time for them. Their truth may not apply to any other church in the association, but it is true for them. They've tried it and it worked . . . or it didn't. Much of the promotion of new church programs comes out of this framework. We say, "This program is being used in other churches with great success" (which means our church can have the same experience if we use this program). How do we evaluate program effectiveness? What is "success"? How do we measure it?
Deductive Reasoning
Deductive reasoning moves thinking from stated general principles to specific elements. We develop general, over-arching statements of intent and purpose. Then we deduce from these principles specific actions we should take. Determine world view first. Then make daily decisions which logically derive from this perspective. When we take the Great Commission as our primary mandate, we have framed a world view for ministry. That is, "Whatever we do, we will connect it to reaching out and baptizing (missions and evangelism) and teaching (discipleship and ministry)." Now, how do we do it? We deduce specific programs, plans, and procedures for carrying out the mandate. We eliminate programs that conflict with this mandate. How do we arrive at this world view? Are our over-arching principles correct? Have we interpreted them correctly? Correct action rises or falls on the basis of two things. First, correct action depends on the correctness of our world view. Second, correct action depends on our ability to translate that view into practical ministry steps.
Inductive Reasoning
Inductive reasoning moves thinking from specific elements to general principles. Inductive Bible study analyzes several passages and then synthesizes key concepts into the central truth. Science is inductive in its study of a number of specifics and its use of these results to formulate a theory. The truths derived in this way are temporary and open to adjustment when new elements are discovered. Knowledge gained in this way is usually related to probabilities of happenings. We have a high degree of confidence that combining X and Y will produce effect Z. Or, we learn that B and C are seldom found in combination with D. I can demonstrate probability by using matches. Picture yourself at the kitchen table with 100 matches. You pick up the first one. What is the probability it will light when you strike it? Well, you have two possibilities: either it will or it won't. So the probability is 50% (1 event out of 2 possibilities). You strike it and it lights. Pick up the second match. The probability is 0.50 that it will light (1 event out of two possibilities: Yes or No). But cumulatively, out of two matches (first and second), one lit. One out of two is 50%. So the probability of the second match lighting is 50%, because 1 of 2 have already lit. You strike it and it lights. Pick up the third match. Again, the third match taken alone has p = 0.50 of lighting (read "probability equals point-five-oh"). However, taking all three matches together, two of the three have lit and the probability is 2/3 (p ≈ 0.67) that the third match will light. It does. Now, pick up the fourth match. The probability is 3/4 (p = 0.75) that it will light, taking all four matches together. What about the 100th match, given that the 99 previous matches have all lit? The probability is 0.50 for this particular match (yes, no), but p = 0.99 taking all matches together. The probability is very high! Yet we cannot absolutely guarantee it will light.

This is the nature of inductive logic, and inductive logic is the basis of scientific knowledge. By definition, science does not deal with absolute Truth. Science seeks knowledge about processes in our world. Researchers gather information through observation. They then mold this information into theories. The scientific community tests these theories under differing conditions to establish the degree to which they can be generalized. The result is temporary, open-ended truth (I call it "little-t truth" to distinguish it from absolute Truth). This kind of truth is open for inquiry, further testing, and probable modification. While this kind of knowing can add nothing to our faith, it is very helpful in solving ministry problems.

4th ed. 2006 Dr. Rick Yount
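The running tally in the match illustration is simply a cumulative relative frequency. As a minimal sketch (the function name and data here are ours, not the author's), it can be computed like this:

```python
def cumulative_relative_frequency(outcomes):
    """Running proportion of successes (1 = lit, 0 = failed) after each strike."""
    lit = 0
    freqs = []
    for i, outcome in enumerate(outcomes, start=1):
        lit += outcome
        freqs.append(lit / i)
    return freqs

# Two of the first three matches lit: the cumulative frequency is 2/3.
freqs = cumulative_relative_frequency([1, 1, 0])
```

Taken alone, each strike remains uncertain; it is the pooled record that stabilizes as trials accumulate, which is exactly the probabilistic, open-ended character of inductive knowledge described above.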
Science as a Way of Knowing

Scientific knowing is based on precise data gathered from the natural world we live in. It builds a knowledge base in a neutral, unbiased manner. It seeks to measure the world precisely. It reports findings clearly so that others can duplicate the studies. It forms its conclusions on empirical data. Let's look at these ideals more closely.
Objectivity
Human beings are complex. Personal experiences, values, backgrounds, and beliefs make objective analysis difficult unless effort is made to remain neutral. Optimists tend to see the positive in situations. Pessimists see the negative. But scientists look for objective reality: the world as it is, uncolored by personal opinion or feelings. Scientific knowing attempts to eliminate personal bias in data collection and analysis. Honest researchers take a neutral position in their studies. That is, they do not try to prove their own beliefs. They are willing to accept empirical results contrary to their own opinions or values.
Precision
Reliable scientific knowing requires precise measurement. Researchers carry out experiments under controlled, narrow conditions. They carefully design instruments to be as accurate as possible. They evaluate tests for reliability and validity. They use pilot projects (trial runs of procedures) to identify sources of extraneous error in measurements. Why? Because inaccurate measurement and undefined conditions and unreliable instruments and extraneous errors produce data that is worthless. Every score has two parts: the true measure of the subject, and an unknown amount of error. We can represent this as
observed score = true measure + error
Think of two students who are equally prepared for an exam. When they arrive in class, one is completely healthy and the other has the flu. They will likely score differently on the exam. In this case, illness introduces an error term into the second student's score. When we gather data in a haphazard, disorderly way, error interferes with the true measure of the variable. Like static on a television screen, the error masks the true picture of the data. Analysis of this noisy data will provide a numerical answer which is suspect. Accurate measurement is a vital ingredient in the research process.
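The two-part score model can be simulated to show why precision matters. This is only an illustrative sketch; the scores, error sizes, and names are hypothetical:

```python
import random

def observe(true_score, error_sd, rng):
    """One measurement: the subject's true score plus random error."""
    return true_score + rng.gauss(0, error_sd)

rng = random.Random(42)
true_score = 85.0

# A precise instrument adds little error; a sloppy one adds a lot,
# masking the true picture like static on a television screen.
precise = [observe(true_score, 1.0, rng) for _ in range(1000)]
sloppy = [observe(true_score, 15.0, rng) for _ in range(1000)]
```

The errors average out over many measurements, but any single "sloppy" observation may land far from 85, which is why careful instrument design and pilot testing matter.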
Verification
Science analyzes world processes which are systematic and recurring. Researchers report their findings in a way that allows others to replicate their studies to check the facts in the real world. These replications either confirm or refute the original findings. When researchers confirm earlier results, they verify the earlier findings. Research reports provide readers the background, specific problem(s), and hypotheses of studies. Also included are the populations, definitions, limitations, and assumptions, as well as procedures for collecting and analyzing data. Writers do this intentionally so others can evaluate the degree to which findings can be generalized and, perhaps, replicate the study.
Empiricism
The root of empiricism (Greek empeirikos) refers to "the employment of empirical methods, as in science," or "derived from observation or experiment; verifiable or provable by means of observation or experiment."1 Science uses the term to underscore the fact that it bases its knowledge on observations of specific events, not on abstract philosophizing or theologizing. These carefully devised observations of the real world form the basis of scientific knowledge. Therefore, the kinds of problems which science can deal with are testable problems. Empirical data is gathered by observation. Basic observations can be done with the naked eye and an objective checklist (see Chapter 9). But observations are also made with instruments such as an interview or questionnaire (Chapter 10), a test (Chapter 11), an attitude scale (Chapter 12), or a controlled experiment (Chapter 13). Scientific knowing cares less about philosophical reasoning than it does about the rational collection and analysis of factual data relevant to the problem to be solved.
Goal: Theories
The goal of scientific research is theory construction, the development of theories which explain the phenomena under study, not the mere cataloging of empirical data. The inductive process of scientific knowing begins with the specifics (collected data) and leads to the general (theories). What causes cancer? What makes it rain? How does man learn? What is the best way to relieve anxiety? What effect do children have on marital satisfaction? Most ministerial students want pragmatic answers to pragmatic problems in the ministry. In the past ten years [during the 1980s] there has been a rash of studies relating some variable to church growth. The pragmatic question is "How do I make my church grow?" But Christian research goes deeper. It looks beyond the surface of ministry programming to the social, educational, psychological, and administrative dynamics of church life and work. Each of these areas has many theories and theorists giving advice and explanation. Are these views valid for Christian ministry? Can you modify these theories for effective use in church ministry? Seek a solid theoretical base for your proposal.

1 "Empiricism," "empirical." The American Heritage Dictionary, 3rd ed., Version 3.0A, WordStar International, 1993.
The Scientific Method

The scientific method is a step-by-step procedure for solving problems on the basis of empirical observations. Here are the major elements:
1. Begin with a felt difficulty. What is your interest? What questions do you want answered? How might a theory be applied in a specific ministry situation? What conflicting theories have you found? The felt difficulty is the beginning point for any study (but it has no place in the proposal).
2. Write a formal Problem Statement. The Problem establishes the focus of the study by stating the necessary variables in the study and what you plan to do with them (see Chapter 4).
3. Gather literature information. What is known? Before you plan to do a study of your own, you must learn all you can about what is already known. This is done through a literature search and results in a synthesis of recent findings on the topic (see Chapter 6).
4. State the hypothesis. On the basis of the literature search, write a hypothesis statement that reflects your best tentative solution to the Problem (see Chapter 4).
5. Select a target group (population). Who will provide your data? How will you find subjects for your study? Are they accessible to you? (see Chapter 7)
6. Draw one or more samples, as needed. How many samples will you need? What kind of sampling will you use? (see Chapter 7)
7. Collect data. What procedure will you use to actually collect data from the subjects? Develop a step-by-step plan to obtain all the data you need to answer your questions (see Chapters 9-13).
8. Analyze data. What statistics will you use to analyze the data? Develop a step-by-step plan to analyze the data and interpret the results (see Chapters 14-25).
9. Test the null, or statistical, hypothesis. On the basis of the statistical results, what decision do you make concerning your hypothesis? (see Chapters 16-26)
10. Interpret the results. What does the statistical decision mean in terms of your study? Translate the findings from statistics to English (see Chapters 16-26).
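The data-handling steps (7-10) can be sketched in miniature. The quiz scores and the choice of a permutation test are our own illustration under stated assumptions, not a procedure taken from this book:

```python
import random

def permutation_test(group_a, group_b, trials=5000, seed=1):
    """Estimate how often a mean difference this large would occur if the
    null hypothesis were true (both groups drawn from one population)."""
    rng = random.Random(seed)
    mean = lambda g: sum(g) / len(g)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    extreme = 0
    for _ in range(trials):
        rng.shuffle(pooled)  # re-deal scores as if group labels didn't matter
        a, b = pooled[:len(group_a)], pooled[len(group_a):]
        if abs(mean(a) - mean(b)) >= observed:
            extreme += 1
    return extreme / trials  # the p-value

# Step 7 (collect data): hypothetical quiz scores under two teaching approaches.
lecture = [72, 75, 70, 68, 74, 71, 69, 73]
discussion = [80, 83, 78, 85, 79, 82, 81, 84]
# Steps 8-9 (analyze data, test the null hypothesis):
p = permutation_test(lecture, discussion)
# Step 10 (interpret): a p-value below 0.05 would lead us to reject the null.
```

The point is the pipeline, not the particular statistic: data are collected, analyzed, the null hypothesis is tested, and the numerical result is translated back into a statement about the study.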
The scientific method provides a clear procedure for empirically solving problems. In chapter 2 we introduce you to the structure of a research proposal. As you read the chapter, notice how the elements of the proposal follow the steps of the scientific method. Refer back to this outline in order to understand the links between the scientific method and the research proposal.
Types of Research
Under the umbrella of scientific research, there are several types of studies you can do. These types differ in procedure (what they entail) and outcome (what they accomplish). Here are four major and three minor types of research from which you may choose.
Historical Research
Historical research analyzes the question "what was?" It studies documents and relics in order to determine the relationship of historic events and trends to present-day practice.
Primary sources
A source of information is primary when it is produced by the researcher. Reports written by researchers who conduct studies are eyewitness accounts, and are primary sources of information on the results. Other examples of primary sources are autobiographies and textbooks written by authors who conduct their own research. Use primary sources as the major source of information in the Related Literature section of your proposal. Primary sources take two forms: documents and relics.

Documents. Society creates documents expressly to record events. They are objective and direct. Documents provide straightforward information. Average Bible Study attendance listed on the Annual Church Letters on file in the state convention office is more likely to be accurate than numbers given from memory by ministers of education in local churches. However, information contained in documents may be incorrect. The documents may have been falsified, or word meanings in the documents may have changed.

Relics. Society creates relics simply by living. Relics are artifacts left by communities and cultures in the past. People did not create these objects to record information, as is the case with documents. Therefore, information conveyed by relics requires interpretation. The historical researcher reconstructs the meaning of relics in the context of their time and place.
Secondary sources
A source of information is secondary when it is a second-hand account of research. Secondary sources may take the form of summaries, news stories, encyclopedias, or textbooks written by synthesizers of research reports. While secondary sources provide the bulk of materials used in term papers, you should use them only to provide a broad view of your chosen topic. As already stated, emphasize the use of primary sources in your Synthesis of Related Literature.
Criticism
The term criticism has a decidedly negative connotation to most of us. A critical person is one who finds fault, depreciates, or puts down someone or something. The term comes from the Greek krino, "to judge." Webster defines criticism as "the art, skill, or profession of making discriminating judgments and evaluations, especially of literary or other artistic works."2 Criticism can therefore refer to praise as well as depreciation. A Christian may cringe when he hears someone speak of using "higher criticism" to study Scripture. It sounds as if the scholar is criticizing -- berating, slandering, putting down -- the Bible. The term actually means that scholars objectively analyze language, culture, and comparative writings to determine the authenticity of the work. Who wrote Hebrews? Paul? Apollos? Peter? Scholars apply the systematic tools of content analysis and literary criticism to determine the answer. Criticism takes two major forms: external criticism and internal criticism.

External criticism. External criticism answers the question of genuineness of the object. Is the document or relic actually what it seems to be? What evidence can we gather to affirm the authenticity of the object itself? For example, is this painting really a Rembrandt? Was this letter really written by Thomas Jefferson? External criticism focuses on the object itself.

Internal criticism. Internal criticism answers the question of trustworthiness of the object. Can we believe what the document says? What ideas are being conveyed? What does the writer mean by his words, given the culture and time period in which he wrote? Internal criticism focuses on the object's meaning.

2 "Criticism," The American Heritage Dictionary, 3rd ed., Version 3.0A, WordStar International, 1993.
Examples
Historical research is not merely the collection of facts from secondary sources about an historic event or process. It is the objective interpretation of facts, in line with parallel events in history. The goal of historical research is to explain the underlying causes of present practices. Most of the historical dissertations written by our students have focused on former deans and faculty members. Dr. Phillip H. Briggs studied the contributions of Dr. J. M. Price, Founder and Dean of the School of Religious Education.3 Dr. Robert Mathis analyzed the contributions of Dr. Joe Davis Heacock, Dean of the School of Religious Education, 1950-1973.4 Dr. Carl Burns evaluated the contributions of Dr. Leon Marsh, Professor of Foundations of Education, School of Religious Education, Southwestern Seminary, 1956-1987.5 Dr. Sophia Steibel analyzed the life and contributions of Dr. Leroy Ford, Professor of Foundations of Education, 1956-1984.6 Dr. Douglas Bryan evaluated the contributions of Dr. John W. Drakeford, Professor of Psychology and Counseling.7
Descriptive Research
Descriptive research analyzes the question "what is?" A descriptive study collects data from one or more groups, and then analyzes it in order to describe present conditions. Much of this textbook underscores the tools of descriptive research: survey by questionnaire or interview, attitude measurement, and testing. A popular use of descriptive research is to determine whether two or more groups differ on some variable of interest.
3 Phillip H. Briggs, The Religious Education Philosophy of J. M. Price (D.R.E. diss., Southwestern Baptist Theological Seminary, 1964).
4 Robert Mathis, A Descriptive Study of Joe Davis Heacock: Educator, Administrator, Churchman (Ed.D. diss., Southwestern Baptist Theological Seminary, 1984).
5 Carl Burns, A Descriptive Study of the Life and Work of James Leon Marsh (Ed.D. diss., Southwestern Baptist Theological Seminary, 1991).
6 Sophia Steibel, An Analysis of the Works and Contributions of Leroy Ford to Current Practice in Southern Baptist Curriculum Design and in Higher Education of Selected Schools in Mexico (Ed.D. diss., Southwestern Baptist Theological Seminary, 1988).
7 Douglas Bryan, A Descriptive Study of the Life and Work of John William Drakeford (Ed.D. diss., Southwestern Baptist Theological Seminary, 1986).
Another application of descriptive research is determining whether two or more variables are related within a group. This latter type of study, while descriptive in nature, is often referred to specifically as correlational research (see the next section).
An Example
The goal of descriptive research is to describe, accurately and empirically, differences between selected groups on one or more variables. Dr. Dan Southerland studied differences in ministerial roles and allocation of time between growing and plateaued or declining Southern Baptist churches in Florida.6 Specified roles were pastor, worship leader, organizer, administrator, preacher, and teacher.7 The only role which showed a significant difference between growing and non-growing churches was the amount of time spent serving as organizer, which included vision casting, setting goals, leading and supervising change, motivating others to work toward a vision, and building "groupness."8
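Describing group differences like these begins with simple summaries of each group. A minimal sketch, using entirely hypothetical numbers (not Southerland's data), might look like this:

```python
from math import sqrt

def describe(scores):
    """n, mean, and sample standard deviation for one group's scores."""
    n = len(scores)
    mean = sum(scores) / n
    sd = sqrt(sum((x - mean) ** 2 for x in scores) / (n - 1))
    return {"n": n, "mean": mean, "sd": sd}

# Hypothetical hours per week spent in the "organizer" role.
growing = [9, 11, 10, 12, 8]
plateaued = [4, 5, 6, 4, 6]
growing_summary = describe(growing)
plateaued_summary = describe(plateaued)
```

Comparing the two summaries (means and spreads) is the descriptive step; later chapters take up whether such a difference is statistically significant.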
Correlational Research
Correlational research is often presented as part of the descriptive family of methods. This makes sense, since correlational research describes association between variables of interest in the study. It answers the question "what is?" in terms of the relationship among two or more variables. What is the relationship between learning style and gender? What is the relationship between counseling approach and client anxiety level? What is the relationship between social skill level and job satisfaction and effectiveness for pastors? In each of these questions we have asked about an association between two or more variables. Correlational research also includes the topics of linear and multiple regression, which use the strengths of associations to make predictions. Finally, correlational analysis includes advanced procedures like Factor Analysis, Canonical Analysis, Discriminant Analysis, and Path Analysis, all of which are beyond the scope of this course.
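The strength of such an association is commonly summarized with the Pearson correlation coefficient. As a sketch (the variables and data here are hypothetical, not drawn from any study in this book):

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Hypothetical data: hours of weekly study vs. Bible-knowledge test score.
hours = [1, 2, 3, 4, 5, 6]
scores = [55, 60, 64, 70, 73, 80]
r = pearson_r(hours, scores)  # near +1: a strong positive association
```

A coefficient near +1 or -1 indicates a strong association; one near 0 indicates little linear relationship. Correlation alone, of course, does not establish cause and effect.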
An Example
The goal of correlational research is to establish whether relationships exist between selected variables. Dr. Robert Welch studied selected factors relating to job satisfaction in staff organizations in large Southern Baptist Churches.9 He found the most important intrinsic factors affecting job satisfaction were praise and recognition for work, performing creative work and growth in skill. The most important extrinsic factors were salary, job security, relationship with supervisor, and meeting family needs.10 Findings were drawn from 579 Southern Baptist ministers in 153 churches.11
Experimental Research
Experimental research analyzes the question what if? Experimental studies use carefully controlled procedures to manipulate one (independent) variable, such as
6 Dan Southerland, A Study of the Priorities in Ministerial Roles of Pastors in Growing Florida Baptist Churches and Pastors in Plateaued or Declining Florida Baptist Churches (Ed.D. diss., Southwestern Baptist Theological Seminary, 1993).
7 Ibid., 1.
8 Ibid., 2.
9 Robert Horton Welch, A Study of Selected Factors Related to Job Satisfaction in the Staff Organizations of Large Southern Baptist Churches (Ed.D. diss., Southwestern Baptist Theological Seminary, 1990).
10 Ibid., 2.
11 Ibid., 61.
I: Research Fundamentals
Teaching Approach, and measure its effect on other (dependent) variables, such as Student Attitude and Achievement. Manipulation is the distinguishing element in experimental research. Experimental researchers don't simply observe what is. They manipulate variables and set conditions in order to design the framework for their observations. What would be the difference in test anxiety across three different types of tests? Which of three language training programs is most effective in teaching foreign languages to mission volunteers? What is the difference between Counseling Approach I and Counseling Approach II in reducing marital conflict? In each of these questions we find a researcher introducing a treatment (type of test, training program, counseling approach) and measuring an effect. Experimental research is the only type which can establish cause-and-effect relationships between independent and dependent variables. See Chapter 13 for examples of experimental designs.
An Example
The goal of experimental research is to establish cause-effect relationships between independent and dependent variables. Dr. Daryl Eldridge analyzed the effect of knowledge of course objectives on student achievement in, and attitude toward, the course.12 He found that knowledge of instructional objectives produced significantly higher scores on the Unit I exam (mid-range cognitive outcomes) but not on the Unit III exam (knowledge outcomes). Knowledge of objectives did produce significantly higher scores on the post-course attitude inventory.13
Ex Post Facto Research
An Example
The goal of ex post facto research is to establish cause-and-effect relationships between independent and dependent variables after the fact, when the treatment has already been applied outside the researcher's control. An example of ex post facto research would be An Analysis of the Difference in Social Skills and Interpersonal Relationships Between Congenitally Deaf and Hearing College Students. Congenital deafness in this case is the treatment, already applied by nature.
Evaluation
Evaluation is the systematic appraisal of a program or product to determine if it is accomplishing what it proposes to do. It is the application of the scientific method to the
12 Daryl Roger Eldridge, The Effect of Student Knowledge of Behavioral Objectives on Achievement and Attitude Toward the Course (Ed.D. diss., Southwestern Baptist Theological Seminary, 1985).
13 Ibid., 2.
practical worlds of educational and administrative programming. Specialists commend to us a variety of programs designed to solve problems. Depending upon the degree of personal involvement of these specialists with the programs, these commendations may contain more word magic than substance. Does a program do what it's supposed to do?

The danger in choosing an evaluation-type study for dissertation research is the political ramifications that arise if the evaluation proves embarrassing to the church or agency conducting the program. Program leaders may not appreciate negative evaluations and may apply pressure to modify results. This distorts the research process. Suppose you choose to evaluate a new counselor orientation program at a highly visible counseling network and you find the program substandard. Will this impact your ability to work with this agency as a counselor? Or suppose you want to compare Continuous Witness Training (CWT) with Evangelism Explosion (EE) as a witness training program. What are the implications of your finding one program much better than the other?
An Example
The goal of evaluation research is to objectively measure the performance of an existing program in accordance with its stated purpose. An example of this type of study would be A Critical Analysis of Spiritual Formation Groups of First Year Students at Southwestern Baptist Theological Seminary. Program outcomes are measured against program objectives to determine if Spiritual Formation Groups accomplish their purpose.
Research and Development
An Example
The goal of research and development is the production of a new product which performs according to specified standards. Dr. Brad Waggoner developed an instrument to measure the degree to which a given church member manifests the functional characteristics of a disciple.14 Two pilot tests of the instrument produced Cronbach's alpha reliability coefficients of 0.9745 and 0.9618, demonstrating that it measures these characteristics with a high degree of reliability.15 In 1998, the instrument was incorporated into MasterLife materials produced by LifeWay Christian Resources (SBC).16
14 Brad J. Waggoner, The Development of an Instrument for Measuring and Evaluating the Discipleship Base of Southern Baptist Churches (Ed.D. diss., Southwestern Baptist Theological Seminary, 1991).
15 Ibid., 118.
16 Report of joint development between LifeWay and the International Mission Board (SBC) at the 1998 Meeting of the Southern Baptist Research Fellowship.
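Cronbach's alpha, the reliability statistic Waggoner reported, estimates the internal consistency of an instrument from the variances of its items and the variance of respondents' total scores: alpha = (k / (k - 1)) * (1 - sum of item variances / variance of total scores), where k is the number of items. A minimal sketch of the computation follows; the Likert-scale responses are invented for illustration, not Waggoner's data.

```python
def cronbach_alpha(items):
    """Cronbach's alpha. items: one list per item, each holding all respondents' scores."""
    k = len(items)                 # number of items
    n = len(items[0])              # number of respondents

    def var(xs):
        """Population variance of a list of scores."""
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    sum_item_vars = sum(var(col) for col in items)
    # each respondent's total score across all items
    totals = [sum(col[i] for col in items) for i in range(n)]
    return (k / (k - 1)) * (1 - sum_item_vars / var(totals))

# Invented responses: 4 items, 5 respondents, 1-5 Likert scale
items = [
    [4, 5, 3, 4, 2],
    [4, 4, 3, 5, 2],
    [5, 5, 2, 4, 1],
    [4, 5, 3, 4, 2],
]
print(round(cronbach_alpha(items), 3))
```

Because these invented items rank respondents almost identically, the coefficient comes out above 0.90, the same neighborhood as the 0.9745 and 0.9618 Waggoner obtained in his pilot tests.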
Qualitative Research
In 1979 the faculty of the School of Religious Education at Southwestern created a teaching position for research and statistics. Their desire was for this position to give emphasis to helping students understand research methods and procedures for statistical analysis. They further desired that doctoral research become more objective and scientific, less philosophical and historical. In 1981, after two years of interviews and discussions, the Religious Education faculty voted, and the president approved, my election to their faculty to provide this emphasis. This textbook, and the dissertation examples it contains, are products of 25 years of emphasis on descriptive, correlational and experimental research -- most of which is quantitative or statistical in nature.

In recent years interest has grown in research methods which focus more on the issue of quality than quantity. A qualitative study is an inquiry process of understanding a social or human problem, based on building a complex, holistic picture, formed with words, reporting detailed views of informants, and conducted in a natural setting.17 Dr. Don Ratcliff, in a 1999 seminar for Southwestern doctoral students, suggested the following as the most common qualitative research designs: ethnography, field study, community study, biographical study, historical study, case study, survey study, observation study, grounded theory, and any combination of the above.18

Grounded theory is a popular choice of qualitative researchers. It originated in the field of sociology and calls for the researcher to live in and interact with the culture or people being studied. The researcher attempts to derive a theory by using multiple stages of data collection, along with the process of refining and inter-relating categories of information.19 Qualitative research is subjective, open-ended, and evolving, and relies on the ability of the researcher to reason and logically explain relationships and differences. Dr. Marcia McQuitty, Professor of Childhood Education in our school, has become our resident expert in qualitative designs. I continue to focus on quantitative research, which is, in comparison, objective, close-ended (once problem and hypothesis are established), and structured, and relies on the ability of the researcher to gather and statistically analyze valid and reliable data to explain relationships and differences.
His words reflect Jesus' teaching that He gives understanding to those who follow Him (Mt. 11:29; 16:24). Blaise Pascal wrote in the 17th century, The heart has reasons which are unknown to reason.... It is the heart which is aware of God and not reason. That is what faith is: God perceived intuitively by the heart, not by reason. The truth of Christ comes by living it out, by risking our lives on Him, by doing the Word. We grow in our knowledge of God through personal experience as we follow Him and work with Him. We believe in order to understand spiritual realities. This approach to knowing is private and subjective. Such belief-knowing resents the anti-supernatural skepticism of open-minded inquiry. More than that, some scientists consider the scientific method to be their religion. Their belief in evolution may be a justification for their unbelief in God. Science is helpful in learning about our world, but it makes a poor religion. So the faithful view science and its adherents with suspicion.

Sometimes, however, the suspicion of science by the religious has less to do with faith than with political power. In the Middle Ages, the accepted view of the universe was geocentric (earth-centered). The moon, the planets, the sun (located between Venus and Mars) and the stars were believed to rotate about the earth in perfect circles. This view had three foundations: science, philosophy and the Church. Greek science (Ptolemy) and Greek philosophy (Aristotle) supported a geocentric view of the universe. The logic was rock solid for centuries: Man is the pinnacle of creation. Therefore, the earth must be the center of the universe. The Roman Catholic Church taught that the geocentric view was Scriptural, based on Joshua 10:12-13. Joshua said to the LORD in the presence of Israel: 'O sun, stand still over Gibeon, O moon, over the Valley of Aijalon.'
So the sun stood still, and the moon stopped, till the nation avenged itself on its enemies, as it is written in the Book of Jashar. The sun stopped in the middle of the sky and delayed going down about a full day. For the sun and moon to stand still, the Church fathers reasoned, they would have to be circling the earth.

Then several scientists began their skeptical work of actually observing the movements of the planets and stars. Copernicus, a Polish astronomer, created a 16th-century revolution in astronomy when he published his heliocentric (sun-centered) theory of the solar system. He theorized, on the basis of his observations and calculations, that the earth and its sister planets revolved around the sun in perfect (Aristotelian) circles. Kepler later demonstrated that the solar system was indeed heliocentric, but that the planets, including earth, orbited the sun in elliptical, not circular, paths. The Roman Catholic Church attacked their views because they displaced earth from its position of privilege, and opened the door to doubt in other areas. But Poland is a long way from Rome (it was especially so in the 16th century!), and so Copernicus and Kepler remained outside the Church's reach.

Galileo, the father of modern physics, did his work in Italy in the 16th and 17th centuries. He studied the work of Copernicus and Kepler, and built a telescope in order to observe the planets more closely. In 1632, he published the book Dialogue Concerning the Two Chief World Systems: Ptolemaic and Copernican, in which he supported a heliocentric view of the solar system. He was immediately attacked by Church authorities who continued to espouse a geocentric world view. Professors at the University of Florence refused to look through Galileo's telescope: they did not believe his theory, so they refused to observe. Very unscientific! Galileo, under threat of being burned at the stake, recanted his findings. It was not until October 1992 that the Roman
Catholic Church officially overturned the decision against Galileo's book and agreed that he had indeed been right. Science questions, observes, and seeks to learn how the world works. Sometimes this process collides with the vested interests of dogmatic religious leaders.
ena, the machinery, so we may better understand how the world works. Science focuses on the creation. There need be no conflict between giving your heart to the Lord and giving your mind to the logical pursuit of natural truth.
Summary
In this chapter we looked at six ways of knowing. We discussed specifically how scientific knowing differs from the other five. We introduced you to the scientific method, as well as seven types of research. Finally, we made a brief comparison of faith-knowing and science-knowing.
Vocabulary
authority: knowledge based on expert testimony
common sense: cultural or familial knowledge, local
control of bias: maintaining neutrality in gaining knowledge
correlational research: analyzing relationships among variables
deductive reasoning: from principle (general) to particulars (specifics)
descriptive research: analyzing specified variables in select populations
empiricism: basing knowledge on observations
evaluation: analyzing existing programs according to set criteria
ex post facto research: analyzing effects of independent variables after the fact
experience: knowledge gained by trial and error
experimental research: determining cause and effect relationships between treatment and outcome
external criticism: determining the authenticity of a document or relic
historical research: analyzing variables and trends from the past
inductive reasoning: from particulars (specific) to principles (general)
internal criticism: determining the meaning of a document or relic
intuition/revelation: knowledge discovered from within
precision: striving for accurate measurement
primary sources: materials written by researchers themselves (e.g. journal articles)
research and development: creating new materials according to set criteria
scientific method: objective procedure for gaining knowledge about the world
secondary sources: materials written by analysts of research (e.g. books about)
theory construction: converting research data into usable principles
verification: replicating (re-doing) studies under varying conditions to test findings
Study Questions
1. Define in your own words six ways we gain knowledge. Give an original example of each.
2. Define science as a way of knowing.
3. Compare and contrast faith and science as ways of knowing for the Christian.
4. Define in your own words five characteristics of the scientific method.
5. Define in your own words eight types of research.
3. Match the type of research with the project by writing the letter below in the appropriate numbered blank line.

Historical
Experimental
Research & Development
Descriptive
Ex Post Facto
Qualitative
Correlational
Evaluation
____ An Analysis of Church Staff Job Satisfaction by Selected Pastors and Staff Ministers
____ Differentiating Between the Effects of Testing and Review on Retention
____ The Effect of Seminary Training on Specified Attitudes of Ministers
____ An Analysis of the Differences in Cognitive Achievement Between Two Specified Teaching Approaches
____ Determining the Relationship Between Hours Wives Work Outside the Home and the Couples' Marital Satisfaction Scores
____ The Church's Role in Faith Development in Children as Perceived by Pastors and Teachers of Preschoolers
____ The Relationship Between Study Habits and Self-Concept in Baptist College Freshmen
____ The Life and Ministry of Joe Davis Heacock, Dean of the School of Religious Education, 1953-1970
____ Church Life Around the Conference Table: An Observational Analysis of Interpersonal Relationships, Communication, and Power in the Staff Meetings of a Large Church
____ An Analysis of the Relationship Between Personality Trait and Level of Group Member Conflict...
____ The Role of Woman's Missionary Union in Shaping Southern Baptists' View of Missions
____ The Effectiveness of the CWT Training Program in Developing Witnessing Skills
____ Determining the Effect of Divorce on Men's Attitudes Toward Church
____ A Learning System for Training Church Council Members in Planning Skills
____ A Multiple Regression Model of Marital Satisfaction of Southwestern Students
____ The Effect of Student Knowledge of Objectives on Academic Achievement
____ A Study of Parent Education Levels as They Relate to Academic Achievement Among Home Schooled Children
____ A Critical Comparison of Three Specified Approaches to Teaching the Cognitive Content of the Doctrine of the Trinity to Volunteer Adult Learners in a Local Church
____ Curriculum Preferences of Selected Latin American Baptist Pastors
____ A Study of Reading Comprehension of Older Children Using Selected Bible Translations
2
Proposal Organization
Front Matter
The Introduction
The Method
The Analysis
Reference Material
The research proposal is a concise, clearly organized plan of attack for analyzing formal research problems. The beginning point in developing a proposal (itself not a part of the final product) is the felt difficulty. Hopefully, as you have read textbooks and journal articles, as you have listened to lectures and participated in discussion, you have been attracted to specific issues and concerns in your field. Perhaps there have been questions that remain unanswered, problems which remain unsolved, or conflicts which remain unresolved. These issues, your felt difficulties, hold the beginning point for your research proposal.
The first step toward an objective study of your felt difficulty is the choice of a topic. Consider a topic which has the potential to make a contribution to theory or practice in your chosen field. After all, a dissertation will consume large quantities of your time, your money, and your very self. Worthwhile topics can be discovered by browsing the indexes of information databases such as the Educational Resources Information Center (E.R.I.C.) or Psychological Abstracts (for detailed suggestions, see Chapter 6, Synthesis of Related Literature). This search, whether done manually or by computer, can provide useful information for confirming or abandoning a research topic. Once a topic has been determined, it must be translated, step by step, into a clear statement of a solvable problem and a systematic procedure for collecting and analyzing data. We begin that translation process in this chapter by providing a structural blueprint, as well as definitions of each proposal element, for the proposal you will eventually develop. The following structural overview gives you a framework for organizing your own proposal. Each element listed in the structural overview is defined. Study these elements until you can see the structure of the whole.
4th ed. 2006 Dr. Rick Yount
Front Matter
Title Page
Contents
Tables
Illustrations
Proposal Overview
Front Matter
  Title Page
  Table of Contents
  List of Tables
  List of Illustrations
INTRODUCTION
  Introductory Statement
  Statement of the Problem
  Purpose of the Study
  Synthesis of Related Literature
  Significance of the Study
  Statement of the Hypothesis
METHOD
  Population
  Sampling
  Instrument
  Limitations
  Assumptions
  Definitions
  Design
  Procedure for Collecting Data
ANALYSIS
  Procedure for Analyzing Data
  Testing the Hypotheses
  Reporting the Data
Reference Materials
  Appendices
  Bibliography
Title Page
The coversheet for the proposal contains basic information for the reader. You will list on this page your school name, the proposal title, your major department, your name and the date the proposal is submitted. The title of your proposal should provide sufficient information to permit your readers to make an intelligent judgment about the topic and type of study you're proposing to do. Your doctoral dissertation will be cataloged in Dissertation Abstracts upon graduation, so a clear title will attract more readers to your work.
Table of Contents
The Table of Contents lists the major headings and subheadings and their respective page numbers within the proposal. Suggestion: organize your proposal (and simplify the writing of the Table of Contents) using a three-ring binder with dividers for each section and element of the proposal. As you work on each section, file your materials in proper order in the binder.
List of Tables
As you write your dissertation, you will want to augment your written explanations with visual representations of the data. One form of presentation is the table, which displays the data in tabular form (rows and columns of figures) and enhances, clarifies, and reinforces the verbal narrative. The List of Tables lists each table by name and page number. Let me suggest that you consider carefully the tables you will need to use to display your data and include a sample of each planned table in your proposal. Doing this shows that you have given adequate consideration to the forms your data will take.
List of Illustrations
An illustration is a graph, chart, or picture that enhances visually the meaning of what you write. The List of Illustrations lists each illustration by caption and page number.
Introduction
The introduction section includes the introductory statement, the statement of the problem, the purpose of the study, the synthesis of related literature, the significance of the study, and the hypothesis. The purpose of the introduction is to demonstrate the thoroughness of your preparation for doing the study. This section explains to others, such as the Advanced Studies Committee, why you want to do this study. It further demonstrates how well you understand your specific field.
Introductory Statement
Problem
Purpose
Synthesis
Significance
Hypothesis
Problem Statement. Just as an instructional objective provides the framework for lesson planning, so the Problem reflects the very heart of the study. For example, look at the following Problem Statements from the dissertations of Drs. Marcia McQuitty and Norma Hedin:
The problem of this study [will be] to determine the relationship between the dominant management style and selected variables of full-time ministers of preschool and childhood education in Southern Baptist churches in Texas. The selected variables [are] level of education, years of service on church staffs, task preference, gender, and age.1 The problem of this study [will be] to determine the differences in measured self-concept of children in selected Texas churches across three variables: school type (home school, Christian school, and public school), grade (fourth, fifth, and sixth), and gender.2
Notice that the list of Purpose statements comes directly out of the Problem Statement, and yet expands each component of it.
study. It details what others are doing in the field, what methods are being used, and what results have been obtained in recent years. A synthesis is different from a summary. In a summary, articles relating to a subject are outlined and then written up one after another. Let's say we have three articles. Article 1 contains discoveries A, B, and D. Article 2 contains discoveries A, B, and C. Article 3 contains discoveries A and C. A summary would look like this:
Article 1 found A, B, and D. Article 2 found A, B, and C. Article 3 found A and C.
This makes for lifeless writing and boring reading. It also fails to uncover the groupings of discoveries across all the articles. A synthesis, however, focuses on key words and discoveries across many articles and combines the various research articles' findings. The focus is on the research discovery-clusters, not on individual articles. Look at the following rewrite:
Three researchers found A (1,2,3). Two researchers found B (1,2), and two researchers found C (2,3).
This approach helps you discover linkages among researchers and makes for much more interesting reading. I've used three articles as an example, but a dissertation study will involve scores of them!

When I was doing library research on my last doctorate, I found over a hundred research reports relating to my subject. In these reports, statisticians argued about proper procedures on the basis of a particular kind of error rate. As I analyzed the articles, I found that the researchers could be put into three camps. These camps, and the comparison of their views of various statistical issues, formed the organizational structure for my Related Literature section. I condensed ninety-two journal articles into fifteen pages of synthesis using over 30 key words.

I remember my grandfather gathering the sap from maple trees to boil down into syrup. It frequently required over 100 gallons of sap to produce a gallon of syrup. This same process applies to the preparation of the Synthesis of Related Literature. Dr. Rollie Gill provides an example of synthetic writing in his dissertation on leadership styles.4
Outside research on Situational Leadership has questioned the validity and reliability of the "theory."127 See Chapter Six for more information on synthesizing literature.
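The discovery-cluster idea can be mechanized once you have listed each article's findings: invert the article-by-findings list so that each finding points to the articles that support it. A small sketch, using the hypothetical articles 1-3 and findings A-D from the example above:

```python
from collections import defaultdict

# Findings reported by each article (hypothetical, from the example above)
articles = {
    1: ["A", "B", "D"],
    2: ["A", "B", "C"],
    3: ["A", "C"],
}

# Invert the mapping: for each finding, which articles support it?
clusters = defaultdict(list)
for article, findings in articles.items():
    for finding in findings:
        clusters[finding].append(article)

for finding in sorted(clusters):
    print(f"Finding {finding}: articles {clusters[finding]}")
```

The printed clusters (A supported by articles 1, 2 and 3; B by 1 and 2; and so on) are exactly the discovery-clusters around which a synthesis, rather than a summary, is organized.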
What tangible contribution will it make? In short, it answers the so-what question. You want to study something. You find what you expect. So what?! The personal interest of the student or his/her major professor is not sufficient rationale for approving a proposal. The best rationale is a reference to one or more research studies stating the need for what you propose to do. Dr. Dean Paret wrote an effective statement of significance for his study on healthy family functioning:5
This study [will be] significant in that: 1. It provides empirical data for the relationship between family of origin in terms of autonomy and intimacy roles that were adapted and the current family healthy functioning patterns. Empirical validation has been called for by Hoverstadt et al.118 to support the theoretical assumptions upon which family therapy techniques are based. 2. It provides empirical data for breaking the recurrent cycle perpetuating the adult child syndrome.119 3. It provides a basis for the development of specific parenting training for the ministry of the church. 4. It provides helpful information for the seminary to aide [sic] the students who are having a difficult time juggling married life and student life, by providing indicators of stress areas related to autonomy and intimacy. According to Dr. David McQuitty, Director of Student Aid, the seminary through his office sees an increase in problems encountered by students as their seminary journey increases, both in financial stress, and student stresses, that could possibly be related to issues brought forward from the family of origin.120 It is therefore necessary to provide empirical data to help in breaking down the dysfunctional patterns of interaction.
118 Hoverstadt, et al., 287 and 296.
119 Fine and Jennings, 14.
120 Conversation with Dr. McQuitty on August 18, 1990.
Just before my Proposal Defense, I made one last trip to the North Texas Science library. On that trip, I found a reference to a speech made two years earlier. Looking up the speech, I found a gold mine! The writers had analyzed many of the procedures I was studying. Their conclusion was to call for a computer analysis of several of the most popular procedures. It was the focus of my study! I added this recommendation to my significance section. It provided a solid rationale for my study when I defended it before my Proposal Committee.
The Hypothesis
The Statement of the Problem describes the heart of your study in one or two succinct sentences. The Statement of the (research) Hypothesis describes the expected outcome of your study. Base the thrust of your hypothesis on the synthesis of literature. Use the Problem Statement as the basis for the format of the hypothesis. Look at this Problem-Hypothesis pair from the dissertation of Dr. Joan Havens:
The problem of this study [is] to determine the difference in level of academic achievement across four populations of Christian home schooled children in Texas: those whose parents possessed (1) teacher certification, (2) a college degree, but no certification, (3) two or more years of college, or (4) a high school diploma or less.6
5 Dean Kevin Paret, A Study of the Perceived Family of Origin Health as It Relates to the Current Nuclear Family in Selected Married Couples (Ed.D. diss., Southwestern Baptist Theological Seminary, 1991), 36-37.
[One of the hypotheses of this study is that there will] be no significant difference in levels of academic achievement in home schooled children across the four populations surveyed.7
Or another, from the dissertation of Dr. Don Clark, who did an analysis of the statistical power levels of dissertations hypothesizing differences written here in the School of Educational Ministries at Southwestern since 1981.8
The problem of this study [will be] to determine the difference in power of the statistical test between selected dissertations' hypotheses proven statistically significant and those selected dissertations' hypotheses not proven statistically significant in the School of Religious Education at Southwestern Baptist Theological Seminary.9 The hypothesis of this study [is] that power of the statistical test will be significantly higher in those dissertations' hypotheses finding statistically significant results than those. . .not finding statistically significant results.10
The Problem poses the question to be answered; the hypothesis presents the expected answer. The research hypothesis must be stated in measurable terms and should indicate, at least generally, the kind of statistic you'll use to test it. See Chapter Four for more information on writing the Hypothesis Statement.
Method
The METHOD section contains a detailed blueprint of your planned procedures. It specifically explains how you will collect the necessary data to analyze the variables you've chosen in a clear step-by-step fashion. This section includes the following components: population, sampling, instrument, limitations, assumptions, definitions, design, and collecting data.
Population
Sampling
Instrument
Limitations
Assumptions
Definitions
Design
Collecting Data
Population
The Population section of the proposal specifies the largest group to which your study's results can be applied. Any samples used in the study (see below) must be drawn from one or more defined populations. Here is Dr. Da Silva's population:
The population for this study [will consist] of social work administrators in Texas who [are] members of the National Association of Social Workers. According to the mailing list of May 21, 1992, there [are] five hundred and seventy-eight administrators from the state of Texas.11
The population of this study [will consist] of all hypotheses from Ed.D. and Ph.D. dissertations completed within the School of Religious Education at Southwestern Baptist Theological Seminary which met four criteria:

1. The hypothesis was included within a dissertation completed between May 1978 and May 1996.
2. The hypothesis tested differences between groups as opposed to relationships between variables.
3. The hypothesis was tested statistically by means of t-Test for Difference Between Means, One-way ANOVA, Two Factor ANOVA, or Three Factor ANOVA.
4. Statistical significance was determined solely upon meeting a singular criteria, that being a single statistical test.12
Sampling
The Sampling section describes how you will draw one or more samples from the population or populations defined above. It also explains how many subjects you intend to study in these samples. Here are examples of sampling statements based on the populations we defined above.
A twenty-five percent random sample [will be] obtained from the mailing list of the National Association of Social Workers in the State of Texas. The sample [is] estimated to consist of 144 subjects.13 A simple random sample of hypotheses [will be] conducted to produce two equal groups of fifty hypotheses: hypotheses proven statistically significant (Group X) and hypotheses not proven significant (Group Y). . . .14
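The sampling statements above can be sketched in code. The following Python fragment is an illustration only: the function name and the synthetic mailing list are ours, not part of either study. It shows how a twenty-five percent simple random sample might be drawn so that every member of the population has an equal chance of selection.

```python
import random

def simple_random_sample(population, fraction, seed=None):
    """Draw a simple random sample of `fraction` of the population without
    replacement, so every member has an equal chance of selection."""
    rng = random.Random(seed)
    size = round(len(population) * fraction)
    return rng.sample(population, size)

# A stand-in for the 578-name mailing list in Dr. Da Silva's study:
mailing_list = [f"administrator_{i}" for i in range(1, 579)]

sample = simple_random_sample(mailing_list, 0.25, seed=1992)
print(len(sample))  # 144 subjects, matching the estimate in the excerpt
```

Because the draw is made without replacement from the whole list, no administrator can appear twice, and the expected sample size (25% of 578, about 144) matches Dr. Da Silva's estimate.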
Instrument
The Instrument section describes the tools you plan to use in measuring subjects. Instruments include tests, scales, questionnaires, interview guides, observation checklists, and the like. If you choose an existing instrument appropriate for your study, then describe its development, use, reliability, and validity. If you cannot find a suitable instrument, you will need to develop your own. Provide a step-by-step explanation of the procedure you will use to develop, evaluate, and validate the instrument. Here is a portion of Dr. Hedin's instrument section:
The instrument selected for this study [is] the Piers-Harris Children's Self-Concept Scale (The Way I Feel About Myself), developed by Ellen V. Piers and Dale B. Harris in 1969. . . . Answers are keyed to high self-concept; thus, a higher total score [indicates] a positive concept of self. . . . Reliability coefficients ranging from .88 to .93, based on Kuder-Richardson and Spearman-Brown formulas, were reported for various samples29 . . . Content validity was built into the scale by using children's statements about themselves as the universe to be measured as self-concept. By writing items pertaining to that universe of statements, the authors defined self-concept for their scale31 . . . An attempt was made to establish construct validity during the initial standardization study. The PHCSCS scale was administered to eighty-eight adolescent institutionalized retarded females. As predicted by Piers and Harris, these girls scored significantly lower than normals of the same chronological or mental age. This was interpreted as meaning that the PHCSCS did measure self-concept and discriminated between high and low self-concept.32

12 Clark, 30-31
13 Da Silva, 7
14 Clark, 31
See Chapters Nine, Ten, and Eleven for more information on developing instruments.
Limitations
The Limitations section describes external restrictions that reduce your ability to generalize your findings. An external restriction is one that is beyond your control. Let's say you plan to randomly assign students in a local high school to one of three experimental teaching groups. When you check with the principal, he allows you to do the experiment, but only if you use the regular classes of students; he does not want you disrupting classes through random assignment. Since random assignment is an important part of experimental design, this is a limitation to your study and must be stated in this section. Limitations differ from delimitations. Delimitations are restrictions you set on your study. The fact that you decide to study single adults ages 20-50 is a delimitation of your study, not a limitation. Choosing to study only 6 of the 16 scales of the 16PF Test is a delimitation, because you make that decision on your own. Limitations are external restrictions and belong in this section. Delimitations are personal restrictions and belong in the Procedures for Collecting Data section of the proposal -- there is no Delimitations section. One of Dr. Matt Crain's limitations was:
Due to the lack of a central organizational headquarters, no directory of Churches of Christ exists whereby a true random sample of all congregations may be obtained.16
15 Wesley Black, A Comparison of Responses to Learning Objectives for Youth Discipleship Training from Minister of Youth in Southern Baptist Churches and Students Enrolled in Youth Education Courses at Southwestern Baptist Theological Seminary (Ed.D. diss., Southwestern Baptist Theological Seminary, 1985), 30-31
16 Matthew Kent Crain, Transfer of Training and Self-Directed Learning in Adult Sunday School Classes in Six Churches of Christ (Ed.D. diss., Southwestern Baptist Theological Seminary, 1987), 8
This study [will be] subject to the limitations recognized in collecting data by mail, such as difficulty in assessing respondent motivation, inability to control the number of responses, and bias of sample if a 100 percent response is not secured.17
Assumptions
Every study is built on assumptions. The purpose of this section is to insure that the researcher has considered his assumptions in doing the study. In doing a mailed questionnaire, the researcher must assume that the subjects will complete the questionnaire honestly. In testing which of two counseling approaches is best, one assumes that the approaches are appropriate for the subjects involved. Provide a rationale for the assumptions you state. It is not enough to copy assumptions out of previous dissertations. Explain the why of your assumptions. Here are several assumptions made by Dr. Darlene Perez:
1. All [112 Puerto Rican Southern and American Baptist] churches will have a youth Sunday School enrollment.
2. The pastors and youth leaders will cooperate with the study and will insure completion of the questionnaires.
3. Since [all] 112 Southern Baptist and American Baptist churches were used in the study, it is assumed that the findings are important in that they represent the general opinion of Baptist youth groups in Puerto Rico. . . .18
Definitions
If you are using words in your study that are operationally defined -- that is, defined by how they are measured -- or have an unusual or restricted meaning in your study, you must define them for the reader. You do not need to define obvious or commonly used terms. For example, Dr. Kaywin LaNoue studied differences in spiritual maturity in high school seniors across two variables: active versus non-active in Sunday School, and Christian school versus public school. But what did she mean by active in Sunday School? What is spiritual maturity and how did she measure it? Here are her definitions for these two terms:
17 Charles S. Bass, A Study to Determine the Difference in Professional Competencies of Ministers of Education as Ranked by Southern Baptist Pastors and Ministers of Education (Ph.D. diss., Southwestern Baptist Theological Seminary, 1998), 45
18 Darlene J. Perez, A Correlational Study of Baptist Youth Groups in Puerto Rico and Youth Curriculum Variables (Ed.D. diss., Southwestern Baptist Theological Seminary, 1991), 12
19 Gail Linam, A Study of the Reading Comprehension of Older Children Using Selected Bible Translations (Ed.D. diss., Southwestern Baptist Theological Seminary, 1993), 85
Active. Active means those students attending their Sunday School at least three Sundays a month.20
Spiritual maturity. Peter gives the steps in a Christian's growth toward maturity when he lists the attributes of the Christian life in the order by which they should be sought. He does this in 2 Peter 1:5-8. . . . In this study, spiritual maturity [is] the extent to which the students have assimilated (internalized) the virtues of goodness, knowledge, self-control, perseverance, godliness, brotherly kindness, and love.21
Dr. LaNoue used an adaptation of the Spiritual Maturity Test, developed and published by Dr. James Mahoney, to convert the virtues listed above into a test score.22 Sometimes special terms are used to communicate complex concepts quickly. These terms need to be defined. For example, the term "k, J combination" makes no sense until it is clearly defined:
k,J combination. -- This term refers to two major variables in this study: the number of groups in an experiment, k, and the sample size category, J. There [will be] four levels of k representing three, four, five, and six groups. There [will be] seven levels of J. J(1) through J(5) [will represent] equal n sample sizes of 5, 10, 15, 20, and 25 respectively. J(6) [will represent] an unequal set of nj's in the ratio of 1:2:3:4:5:6 with n1= 10. That is, when k=3, the sample n's [will be] 10, 20, and 30. J(7) [will represent] a set of nj's in the ratio of 4:1:1:1:1:1 with n1=80. That is, when k=3, the sample n's [will be] 80, 20, and 20. This provides twenty-eight combinations of k,J.23
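The k,J scheme in this definition can be verified with a short script. This Python sketch (the function name is ours, for illustration) enumerates the twenty-eight combinations and reproduces the sample sizes given in the definition:

```python
from itertools import product

def sample_sizes(k, j):
    """Group sample sizes for k groups under sample-size category J(j),
    following the definitions quoted in the k,J example above."""
    if 1 <= j <= 5:                # J(1)-J(5): equal n's of 5, 10, 15, 20, 25
        return [5 * j] * k
    if j == 6:                     # J(6): unequal n's in ratio 1:2:3:..., n1 = 10
        return [10 * g for g in range(1, k + 1)]
    if j == 7:                     # J(7): n's in ratio 4:1:1:..., n1 = 80
        return [80] + [20] * (k - 1)
    raise ValueError("j must be between 1 and 7")

combinations = list(product([3, 4, 5, 6], range(1, 8)))  # levels of k x levels of J
print(len(combinations))   # 28, as the definition states
print(sample_sizes(3, 6))  # [10, 20, 30], matching the k=3, J(6) example
print(sample_sizes(3, 7))  # [80, 20, 20], matching the k=3, J(7) example
```

Four levels of k crossed with seven levels of J yields the twenty-eight combinations the definition claims.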
Design
The Design section describes the research type of your study. It is here you declare your research to be correlational, or historical, or experimental. See the overview of Research Types in Chapter One for a description of eight major design types. Describe key factors that make your study of the stated type. If you are using an experimental design, explain which you are using and why. Dr. Brad Waggoner explained his design this way:24
The method of research [which will be] employed in this study [is] Research and Development. . . . This type of research [is] accomplished in two phases. The first phase [will involve] the development of the product. The second phase [will consist] of evaluating the use or effects of the product.xx Although the exact number of specific stages of Research and Development vary from author to author, the following five steps [will be] applied:xy
1. The identification of a need, interest, or problem
2. The gathering of information and resources concerning the problem or need
3. The preliminary product or process [is] developed
4. The product or process [is] field-tested
5. The product or process [is] refined based on the information obtained from the field-testing.

20 Kaywin Baldwin LaNoue, A Comparative Study of the Spiritual Maturity Levels of the Christian School Senior and the Public School Senior in Texas Southern Baptist Churches With a Christian School (Ed.D. diss., Southwestern Baptist Theological Seminary, 1987), 25
21 Ibid., 26
22 Ibid., 93-97
23 William R. Yount, A Monte Carlo Analysis of Experimentwise and Comparisonwise Type I Error Rate of Six Specified Multiple Comparison Procedures When Applied to Small k's and Equal and Unequal Sample Sizes (Ph.D. diss., University of North Texas, 1985), 8
24 Waggoner, 7-8
xx xy Ibid., 13-14
Dr. Martha Bergen described the design of her study this way:25
The design of this study [is] descriptive in nature. [A] questionnaire [will be] designed to determine the attitudes of Southwestern Seminary's full-time faculty toward computers for seminary education. Further, certain variables [will be] examined to determine their possible predictions of these attitudes.
Analysis
The third and final major section of the proposal is the analysis section. The ANALYSIS section describes how you plan to process the numbers on the data sheets. This section moves step by step through the application of selected statistical procedures, the testing of hypotheses, and the reporting of the data in a systematic, coherent way.
with this? "What do you want to find out?" I asked. "I dunno ... uh, I'm not sure." He had paid $300 for advice from a statistician across town, and had been led down a dead-end alley. The student left too much for others to decide. He did not own his own research. I gave him some suggestions, and, with a great deal of effort on his part and some additional help from his statistician, he was able to produce an acceptable dissertation. But he paid for it in many sleepless nights! The truth of the matter is that, as shown in the diagram at right, we really cannot correctly collect data until we know how we're going to analyze it. The two parts, design and analysis, work together.
Reference Material
The Reference Material section contains supporting materials for the proposal. These materials include appendices and bibliography.
Appendices
An appendix contains supporting materials which relate directly to your study. Most proposals require several appendices to include cover letters, a sample of the instrument, results of a pilot study, the data summary sheets, complex tables, illustrations of statistical analysis, and so forth. Dr. Daryl Eldridge developed twenty-three appendices to house all the supplemental materials generated by his 188-page dissertation. What could possibly take up twenty-three appendices? Here's the list:27
1 - Course Objectives for Building a Church Curriculum Plan 332-435 [3 pages]
2 - Sample of Class / Session Objectives [1]
3 - First Draft of Unit 1 Exam [5]
4 - Cornell Inventory for Student Appraisal of Teaching and Courses [7]
5 - Letter to Research Associates for Validation of Cognitive Tests [2]
6 - Test Item Analysis - Unit 1 Exam [2]
7 - Letter to Research Associates for Validation of Precourse Attitude Inventory [2]
8 - Report Form For Student Test Scores [1]
9 - Session Goals and Indicators [4]
10 - Unit 1 Exam, Final Form [8]
11 - Unit 3 Exam, Final Form [5]
12 - Cognitive PreTest, Final Form [4]
13 - Postcourse Student Inventory [8]
14 - Precourse Student Inventory [3]
15 - Tentative Class Schedule [4]
16 - Course Syllabus, Fall Semester [3]
17 - Course Syllabus, Spring Semester [5]
18 - Quizzes Over SBC Curriculum [6]
19 - Letter to Cornell University [1]
20 - Selected Comments From the Postcourse Inventory and Student Evaluations [3]
21 - Raw Scores For All of the Instruments [4]
22 - A Comparison of Scores Across Semesters for the Various Instruments [2]
23 - Statistical Analysis for Each of the Instruments Across Semesters [5]

26 Eldridge, 79
27 Ibid., 96-183
You provide a clear, categorized filing system for supportive information by packaging materials in appendices. Small parcels of this information can be drawn from these appendices for explanation and illustration in the body of the dissertation. Such a design permits you to provide complete information, through references to the appendices, without bogging down the flow of thought in the dissertation itself. In the proposal development stage, think ahead concerning what appendices you will need and include an empty copy of each as an appendix to the proposal. This demonstrates to the Committee forethought and critical thinking.
Practical Suggestions
Here are some practical suggestions to help you write a solid proposal.
Personal Anxiety
This assignment is complex. Some students experience a frightening sense of anxiety as they consider the daunting task of writing a research proposal. A research proposal taxes the thinking skills of the best students. You are confronted with learning new definitions (knowledge), understanding new concepts (comprehension), discovering conceptual links among numerous articles (analysis), writing an integrative narrative (synthesis), choosing the correct design and statistical procedures (evaluation), and putting all of this together in a single-focused, comprehensive document. Your educational experiences in high school and college may have emphasized rote memory, recall, and simple concepts rather than clear thinking. Therefore, writing an original research proposal is a strange new thing for some. Many paths to choose. Many decisions to make. What topic will I choose? What kind of research will I select? Where do I begin? For some, too many neat ideas compete for attention. For others, neat ideas are nowhere to be found. Don't panic. Take each section, each step of the process, one at a time.
Professionalism in Writing
A research proposal should be written in a clear, professional manner or it will not be understood. Here are some suggestions.
Clear Thinking
Your proposal should show clear thinking. Write and revise. Squeeze out fuzzy phrases, word magic,28 and awkward grammar. Write simply and clearly. Use professional jargon only when simple English can't convey the thought.
Unified Flow
There should be a unified flow through the proposal. Take care not to ramble or lose focus in the details. March step by step in a single direction from the first page to the last.
Efficient Design
Your proposal should demonstrate your understanding of research design and statistical analysis, and how they work together. The proposal should present a narrative that is all-of-one-piece rather than a disjointed collection of pieces. Problem, Hypothesis, and Statistic should form its backbone.
Accepted Format
Finally, write in the accepted professional format of your school. Content is more important than format, but a professional format is required.
Summary
This chapter lays out the complete skeletal organization, with examples from actual dissertations, for the proposal you are developing. Study each component individually, as well as its relationship to the whole. Refer to this chapter and to the Evaluation Guidelines in Chapter 27 throughout the writing process to insure that you are on course. You will add to your understanding of each of these components as the semester progresses. Use this overview to anchor the big picture in your mind.

28 I use the term word magic to refer to high-sounding, emotive words that have little substantive meaning. "The majestic purpose of the American school is to instill in the hearts and minds of our youth the requisite essentials which will allow them to take their rightful place in society and fulfill their destiny." Huh? We hear word magic in sermons and classrooms as well. It gets the amens but communicates little.
Vocabulary
Analysis -- describes step-by-step the analysis of collected data
Appendix -- an addendum to a proposal which contains supporting examples
Assumptions -- stated presuppositions upon which a proposed study is based
Bibliography -- a list of references used in developing the proposal
Definitions -- a list of meanings of terms which are unique to the study, operationalized
Delimitations -- restrictions placed on a study by the researcher
Design -- an explanation of the specific experimental approach to be used
felt difficulty -- the beginning point of a study but not included in proposal
Front Matter -- preliminary materials such as Table of Contents and Lists
Hypothesis -- the anticipated outcome of the study or solution to the Problem
Instrument -- the means by which data is gathered
Introduction -- the first major section of the proposal (includes the Problem)
Introductory Statement -- the opening statement of the proposal which leads to the problem
Limitations -- restrictions placed on a study outside the researcher's control
List of Tables -- a listing of tables used in the proposal (Front Matter)
List of Illustrations -- a listing of illustrations used in the proposal (Front Matter)
Method -- the second major section of a proposal (includes sampling and instrument)
Population -- the largest group to which the proposed study can be generalized
Procedure for Collecting Data -- step-by-step procedure for sampling, instrumentation, and gathering data
Procedure for Analyzing Data -- step-by-step procedure for statistically reducing data to meaningful results
Purpose of the Study -- explanation of the rationale for doing the study
Reporting the Data -- explanation of how data analysis will be presented (charts, tables)
research proposal -- a step-by-step blueprint for conducting scientific inquiry
Sampling -- the process of identifying a representative group from a population
Significance of the Study -- stated reasons why a study is necessary (answers "so what?")
Statement of the Problem -- simple focused statement of the relationship among variables in the study
Synthesis of Related Literature -- a clear narrative which fuses research materials related to the study
Table of Contents -- an outline of proposal organization (Front Matter)
Testing the Hypotheses -- an explanation of how stated hypotheses will be tested statistically
Title Page -- the cover page of the proposal
Study Questions
1. Differentiate between the Introduction and the introductory statement.
2. Differentiate between a synthesis and a summary of related literature.
3. Differentiate between a limitation and a delimitation.
4. What are the three essential elements that make up the backbone of a proposal?
Sample Test Questions

2. The introductory statement should
a. move from a broad focus of the field to the narrow focus of the study
b. express the subjective interest and intent of the researcher
c. take care not to use information from research articles
d. lead directly to the statement of the hypothesis

3. Which of the following is not recommended as a way to organize the synthesis of literature?
a. research article publication dates
b. research article author names
c. concepts addressed by research articles
d. hypotheses of the study

4. Which of the following sections may be omitted from a proposal with appropriate caution?
a. The Problem
b. The Hypothesis
c. The Significance of the Study
d. The Limitations
3
Empirical Measurement
Scientific knowing stands or falls on the precision of its empirical observations. Whether these observations are made with a microscope, a telescope, a stopwatch, or a pencil-and-paper test, the scientist strives for an accurate, numerical representation of the phenomena he is studying. The first step is to define the phenomenon under study in terms of the way you intend to measure it. This process is called operationalization. In order to understand the process, you will need to understand the terms variable and measurement. If you do not determine a clear way to measure what you intend to study, you will eventually bog down in the confusion of instrument design and statistical procedures. Now, not sometime later in your studies, is the time to decide specifically how you will measure the variables you intend to study.
Independent Variables
An independent variable is one that you control or manipulate. You decide to study three different teaching methods. Teaching Method is an independent variable. Or you want to compare four approaches to counseling abused children. Counseling Approach is the independent variable.
Dependent Variables
A dependent variable is the variable you measure to demonstrate the effects of the independent variable. If you are studying Teaching Method you might measure achievement or attitude toward the class. If you are studying counseling approach you might measure anxiety level or overt aggression.
Measurement Types
Before a dependent variable can be analyzed statistically, it must be measured or classified in some manner. There are four major ways we measure variables. These measurement types are called nominal, ordinal, interval and ratio.
Nominal Measurement
Nominal data refers to variables which are categorized into discrete groups. Subjects are grouped or classified into categories on the basis of some particular characteristic. Examples of nominal variables include all of the following: gender, college major, religious denomination, hair color, residence in a certain geographic region, staff position.
Ordinal Measurement
Ordinal data refers to variables which are rank ordered. Notice that nominal variables have no order to them: Males and Females imply nothing more than two different groups of subjects. But ordinal data orders subjects from high to low on some variable. An example of this data type would be the rank ordering of ten priorities for Christian education in the local church.
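The nominal/ordinal distinction can be shown in a few lines of Python. The data values below are invented for illustration: nominal categories support only classification and counting, while ordinal ranks additionally support ordering.

```python
from collections import Counter

# Nominal: discrete categories with no inherent order -- only equality
# and counting are meaningful.
majors = ["Religious Ed.", "Theology", "Music", "Theology", "Religious Ed."]
print(Counter(majors))  # frequencies per category; no category outranks another

# Ordinal: ranks carry order but not distance. Sorting by rank is meaningful;
# subtracting ranks is not.
priorities = {"evangelism": 1, "discipleship": 2, "fellowship": 3}
ranked = sorted(priorities, key=priorities.get)
print(ranked)  # ['evangelism', 'discipleship', 'fellowship']
```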
Interval Measurement
An ordinal scale only reports 1st, 2nd, 3rd places in a set of data. It cannot tell us whether the distance between 1st and 2nd is greater than or less than the distance between 2nd and 3rd. In order to measure distances between data points, we need a scale of equal, fixed gradations. This is precisely what an interval scale is. Numbers are associated with these fixed gradations, or intervals. One of the most common examples of an interval scale is temperature. The difference between 50 and 60 degrees F. is the same as the difference between 100 and 110 degrees F. Another example is an attitude scale which has 20 items. Each item can have a value of 1, 2, 3, or 4. That means a subject can make a score between 20 and 80. The scores on this scale fall at regular one-point intervals from 20 to 80.
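The attitude-scale arithmetic above is easy to verify. In this Python sketch the helper name and the sample responses are ours: twenty items scored 1 to 4 yield totals falling at one-point intervals between 20 and 80.

```python
def scale_bounds(n_items, min_per_item, max_per_item):
    """Lowest and highest possible totals on a summated attitude scale."""
    return n_items * min_per_item, n_items * max_per_item

low, high = scale_bounds(20, 1, 4)
print(low, high)  # 20 80: scores fall at one-point intervals from 20 to 80

# A subject's total is the sum of equal-interval item scores (values invented):
responses = [3, 4, 2, 4, 3, 1, 4, 2, 3, 3, 4, 2, 1, 3, 4, 4, 2, 3, 3, 4]
print(sum(responses))  # 59, inside the 20-80 range
```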
Ratio Measurement
Interval data does not, however, lend itself to ratios. We cannot say, for example, that 100 degrees is twice as hot as 50 degrees. The zero point on an interval scale is arbitrary; that is, it does not represent the total absence of the measured characteristic. A temperature reading of 0 degrees F. does not mean there is no heat. (The Kelvin scale was invented for this. A temperature of 0 degrees Kelvin, about -460 degrees F., is absolute zero temperature.) Ratio measurement differs from interval measurement only in the fact that the ratio scale contains a meaningful zero point. Zero weight means that the object weighs nothing. Zero elapsed time means that no time has passed since the beginning of the experiment (it has yet to begin!). A true zero point means that observations can be compared as ratios or percentages. It is meaningful to say that a 60-year-old is twice the age of a 30-year-old. Or that a 90-pound weakling weighs half as much as a 180-pound bully. In most types of studies, interval and ratio data are treated the same for purposes of selecting the proper statistical procedure.
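The interval-versus-ratio distinction can be demonstrated numerically. This Python sketch converts Fahrenheit (interval scale) readings to Kelvin (ratio scale) to show why "100 degrees is twice as hot as 50 degrees" fails, while a ratio statement about weight holds.

```python
def f_to_kelvin(f):
    """Convert degrees Fahrenheit (interval scale, arbitrary zero) to
    Kelvin (ratio scale, true zero at absolute zero)."""
    return (f - 32) * 5 / 9 + 273.15

# The interval-scale claim "100 F is twice as hot as 50 F" fails once the
# readings sit on a scale with a true zero:
print(f_to_kelvin(100) / f_to_kelvin(50))  # about 1.10, nowhere near 2.0

# Ratio comparisons are meaningful on a true ratio scale such as weight:
print(180 / 90)  # 2.0: the 180-pound bully weighs twice the 90-pound weakling
```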
Operationalization
Our research design describes how we plan to measure selected variables. Statistical analysis describes how we plan to reduce these measurements to a meaningful (numerical) form. In both cases, the variables in the study must be defined in terms of measurement.
4th ed. 2006 Dr. Rick Yount
Definitions
An operational definition indicates the operations1 or activities that are performed to measure or manipulate a variable.2 The purpose of an operational definition is to help scientists speak the same language when reporting research. Since one of the primary characteristics of science is precision, we must begin with precise definitions of the variables we plan to study. Operational definitions force us to think concretely and specifically about the terms we use. Some of my students struggle with this. In one of my Principles of Teaching classes, a student was attempting to describe the fruit of the Spirit (Gal. 5:22-23). He defined "love" as "God's kind of love." But what kind of love is that? "Joy" was defined as "joy that you feel deeply, the joy we'll experience in heaven." But what is joy? These are non-definitions. They are empty. They are useless in teaching because they convey nothing but semantic fluff. I call this kind of definition "word magic," for it deceives teachers into thinking they are explaining words and phrases when in fact the definitions are little more than puffs of smoke in the air. Defining terms in precise terms of measurement avoids this kind of imprecision in research. Secondly, operational definitions provide a common base for communication of terms with others. When terms are operationally defined, readers know exactly how we are using our terms. For example, what does hunger mean? In one research study the operational definition for hunger was "the state of animals kept at 90% of their normal body weight." This is certainly not the definition people use when they reach for their third chocolate-covered doughnut, saying, "I'm really hungry!" The goal is to precisely understand the terms we use in research, and to convey that meaning clearly to others.
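The hunger study's operational definition translates directly into a testable rule. In this Python sketch the 90% cutoff comes from the quoted study; the function name, and our reading of "kept at 90%" as "at or below 90%," are our own interpretation.

```python
def is_hungry(current_weight, normal_weight):
    """Operational definition of hunger from the study quoted above: the
    animal is kept at 90% of its normal body weight. We read 'at 90%' as
    'at or below 90%'; that reading, and this function's name, are ours."""
    return current_weight <= 0.90 * normal_weight

print(is_hungry(270, 300))  # True: 270 is exactly 90% of 300
print(is_hungry(295, 300))  # False: 295 is above the 90% cutoff
```

Notice how the definition leaves no room for argument about what "hungry" means in the study: any reader can apply the same rule to the same data and reach the same classification.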
An Example
Years ago, General Motors used the slogan "We Build Excitement -- PONTIAC!" Suppose we wanted to study that. What does General Motors mean by excitement? We need to operationalize the term. There are several ways to do it. Have trained raters follow selected owners of Pontiacs, Fords, and Chryslers and count the number of times they behave in an excited, agitated, or exuberant manner. Excitement means the number of such behaviors per day. Is there a significant difference among the owners of these three makes of cars? Or, tally the number of dates selected car owners have per week. Excitement means the number of dates per week. This definition assumes that dates are exciting. Or, ask the owners: How excited does your car make you? Have them respond by marking a scale from 0 (no excitement) to 10 (excited all the time because of the car). Here excitement is a self-reported feeling, measured by a number on a scale. Or, ask two acquaintances of each selected subject to rate them on a car excitement scale. With this definition, excitement is the average scale score of impressions of the two acquaintances. Each of these definitions provides a different measure of the general term excitement. In fact, we actually have four concepts of the term. But each definition is clear in its meaning.3

1 Meriam Lewin, Understanding Psychological Research (New York: John Wiley & Sons, 1979), 75
2 Walter R. Borg and Meredith D. Gall, Educational Research: An Introduction, 4th ed. (New York: Longman, 1983), 22
Another Example
Let's illustrate the operationalization process with a practical example. Read this example carefully, noting each step in the process. John is considering several topics for his research proposal. He is drawn toward the problem of adolescent bail-out of church attendance when they leave home. Putting his first thoughts down on paper, he writes: "Church attendance decreases when young people leave home." Writing out your thoughts is important! Almost anything can sound logical as you play with ideas in your mind. Putting these thoughts down on paper is a first real step toward constructing a workable topic. I've heard students complain, "I know what I want to study, but I just can't put it down on paper!" Well, they feel like they know what they want to study, but their ideas are only wisps of fantasy. To put your idea down on paper is to grasp it, refine it, put shape to it, and bring it into the real world where the rest of us live. Do you have an idea for your study? Write it down. Then work on it, as a sculptor on granite, and bring out the essence of your creation. Nothing of value comes easy. As John reflects on his statement, he asks as many questions about it as he can. He steps away from his idea and objectively critiques it. You must separate your ego from your statement. Otherwise you will find yourself defending your work rather than refining it. Here are some of his questions:
Whose church attendance decreases? This statement could refer to parents or friends. It is not specific on this point.
The statement seeks to measure a change in behavior. This requires before-and-after measurements. Is this possible to do?
What is church attendance? What does this term imply? Worship? Bible Study? Church softball league?
What is home? What does it mean to leave home?
After writing down these questions and considering alternative ways to express what he wants to study, he rewrites his statement like this: Young people living away from home will have a lower rate of attendance at worship services than young people living at home. First, this statement is better because it clarifies attendance as the young peoples attendance at worship services. Second, this statement is better because it indicates measuring attendance of
See Earl Babbie, The Practice of Social Research, 3rd ed. (Belmont, CA: Wadsworth Publishing Company, 1983), 130-131
3
3-5
I: Research Fundamentals
two groups and comparing them, rather than a before-and-after measurement of a selected group of subjects. The term "young people" is still fuzzy, however. How young is young? What does "living away from home" mean? Or "living at home"? Answers to these questions would be placed in the Definitions section of the proposal. In John's case, he defined these terms as follows: "Young people" is defined as persons aged 18-25. "Home" is defined as the residence of the subject's parents and where he or she lived as a child. "Living at home" is defined as the continued full-time residence of the subject at home. "Leaving home" is defined as the subject taking up residence away from home for at least three months.

In order to do this study, John needs to define two populations: young people living at home and young people living away from home. He will need to sample two study groups from these populations. He will need to gather four pieces of data from each subject: (1) age, (2) residence, (3) attendance at worship services, and (4) how long away from home. You have just walked through a process of operationalization. It is a process essential for clear problem-solving. Begin now to operationalize the variables you are considering for your study.
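John's operationalized design can also be sketched in code. This is an illustration only -- the Subject record, the field names, and the numbers below are hypothetical, not drawn from any actual study:

```python
# A sketch of John's operationalized variables as a data record,
# with a simple comparison of attendance between the two groups.
# All values here are invented for illustration.
from dataclasses import dataclass

@dataclass
class Subject:
    age: int                  # operational: 18-25 qualifies as a "young person"
    lives_at_home: bool       # operational: full-time residence at parents' home
    months_away: int          # operational: 3+ months away = "left home"
    services_attended: int    # worship services attended in the last 12 weeks

def attendance_rate(subjects):
    """Mean number of worship services attended per subject in a group."""
    return sum(s.services_attended for s in subjects) / len(subjects)

at_home = [Subject(19, True, 0, 10), Subject(22, True, 0, 8)]
away    = [Subject(20, False, 6, 4), Subject(24, False, 12, 2)]

print(attendance_rate(at_home))  # 9.0
print(attendance_rate(away))     # 3.0
```

Notice that each fuzzy term ("young person", "leaving home") has become a concrete, measurable field -- exactly what an operational definition does.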
Operationalization Questions
As you consider the measurement of variables for your study, there are two basic questions you must answer. The first is "Are my variables measurable?" If they are not, you cannot study them -- not statistically, that is. Some students have difficulty answering this question because they have too limited an understanding of what measurement entails. We will be looking at several approaches to measurement in the chapters ahead: direct observation, survey, testing, attitude measurement, and experimentation. Once you have settled on what kind of data you need for your study, begin looking in research texts and journal articles for ways to gather that data. Don't overlook the guidelines in later chapters of this text!

The second question is "How will I measure these variables?" Define each of your variables in terms of how you will measure them [operational definitions]. I suggest you work on the statement for a while and then put it aside for several hours. When you come back to it, you'll be able to look at it more objectively. It is difficult to avoid rationalization and self-defense of your work. But you will excel in writing your proposal only if you can critique yourself clearly and objectively. It is better if you find the weaknesses before others do! Once you have operationalized your draft statement, you will be ready to write the Statement of the Problem and the Research Hypothesis. We will get into these two sections of the proposal in the next chapter.
Summary
This chapter has introduced you to the concept of variables, four data types (nominal, ordinal, interval, and ratio), as well as the process of operationalization: defining selected variables in terms of measurement.
Chapter 3
Empirical Measurement
Vocabulary
arbitrary zero -- an arbitrary "zero value"; does not mean absence of the variable (e.g., 0°F)
category -- a class or group of subjects (e.g., male/female on variable GENDER)
constant -- a numerical value which does not change (e.g., the freezing point of water: 32°F)
dependent variable -- a variable which is MEASURED by the researcher
independent variable -- a variable which is MANIPULATED by the researcher
interval -- equidistant markings on a scale (e.g., degrees on a thermometer)
interval data -- a measurement which reflects a position on an interval scale (e.g., 54°F)
measurement type -- a specific kind of measurement (nominal, ordinal, interval, ratio)
measurement -- the process of assigning a number value to a variable
nominal data -- a measurement which reflects counts in a group (e.g., 15 males in Research class)
operational definition -- describing a variable by its measurement (e.g., "adult" means 18+ years old)
operationalization -- the process of defining variables by their measurement
ordinal data -- a measurement which reflects rank order within a group
rank -- the relative position in a group (e.g., 1st, 2nd, 3rd)
ratio data -- a measurement which reflects a position on a ratio scale (e.g., 93 on Exam 1)
true zero -- the complete absence of a variable (e.g., 0 pounds = no weight)
variable -- an element that can have many values (e.g., weight can be 120 or 210 or 5)
Study Questions
1. List and define four kinds of measurement. Give an example of each kind.
2. Define constant and variable. Give two examples of each.
3. Operationalize the "fuzzies" below.
A. Staff members who work with autocratic pastors are less happy than those who work with democratic pastors.
B. Teaching Sunday School with discussion will result in better feelings than teaching with lecture.
C. Group counseling is better than individual counseling.
2. Which of the following is not a characteristic of an operational definition?
A. helps researchers communicate clearly
B. uses global, abstract terminology
C. specifies activities used to measure a variable
D. addresses science's desire for precision
4th ed. 2006 Dr. Rick Yount
3. Which of the following is the best operational definition?
A. An attitude of forgiveness
B. Aggressive facial expressions
C. Immoral behavior
D. Anxiety test score
4. Identify the type of data expressed in the statements below by writing the appropriate letter in the blank provided: N = Nominal, O = Ordinal, I = Interval, R = Ratio.
____ Statistical Aptitude will be measured by scores obtained on the STAT2 (0-20)1
____ My current feelings toward my father could be characterized as:2 Very Warm and Tender (1), Good (2), Unsure (3), Unfavorable (4), Very Distant and Cold (5)
____ Employment Status: Full-Time, Part-Time, Not Employed3
____ Study Habits: Sum of Delay Avoidance (DA) and Work Methods (WM) Scores on the Survey of Study Habits and Attitudes Inventory (Max: 100)4
____ Critical Thinking Ability: score on the Watson-Glaser Critical Thinking Appraisal5
____ Leadership Style: 9,9 / 5,5 / 9,1 / 1,9 / 1,16
____ Reasons for Dropping Out of a Christian College: Ranking of 50 Attrition Factors7
____ Child Density: Computed by dividing the number of children in a family by the number of years married8
"0" means no aptitude for statistics. Ibid., 86 3 James Scott Floyd, The Interaction Between Employment Status and Life Stage on Marital Adjustment of Southern Baptist Women in Tarrant County, Texas, (Ed.D. diss., Southwestern Baptist Theological Seminary, 1990), 45 4 Steven Keith Mullen, A Study of the Difference in Study Habits and Study Attitudes Between College Students Participating in an Experiential Learning Program Using the Portfolio Assessment Method of Evaluation and Students Not Participating in Experiential Learning, (Ph.D. diss., Southwestern Baptist Theological Seminary, 1995), 51 5 Bradley Dale Williamson, An Examination of the Critical Thinking Abilities of Students Enrolled in a Masters Degree Program at Selected Theological Seminaries, (Ph.D. diss., Southwestern Baptist Theological Seminary, 1995), 23 6 Helen C. Ang, An Analytical Study of the Leadership Style of Selected Academic Administrators in Christian Colleges and Universities as Related to Their Educational Philosophy Profile, (Ed.D. diss., Southwestern Baptist Theological Seminary, 1984), 28-29 7 Judith N. Doyle, A Critical Analysis of Factors Influencing Student Attrition at Four Selected Christian Colleges, (Ed.D. diss., Southwestern Baptist Theological Seminary, 1984), 98 8 Martha Sue Bessac, The Relationship of Marital Satisfaction to Selected Individual, Relational, and Institutional Variables of Student Couples at Southwestern Baptist Theological Seminary, (Ed.D. diss., Southwestern Baptist Theological Seminary, 1986), 23
1 2
Chapter 4
Getting On Target
The Problem of the Study The Hypothesis of the Study From Raw to Refined
I lay in the cold, damp sand of Fort Dix, New Jersey, with my M-16 pointing down range. Getting my weapon on target was not as easy as my instructors had made it sound in class. I felt as if I were all thumbs as I wrestled with sight alignment, breathing, placement of the front sight on the target, correction for wind, and correction for distance. I had one thing going for me, however, despite my awkward confusion. My problem was clear: put the round in the center of the target standing 100 yards away. The anticipated result was clear as well: put it all together and the round will hit the bull's-eye. Practice translated the problem into the anticipated result. I qualified for the Sharpshooter's Badge.

Writing a proposal is more complex than target practice. The need to get on target with your proposal, however, is just as important. The Problem and Hypothesis statements focus every other element of the proposal. They form the proposal's heart -- its bull's-eye. Confusion here will generate confusion throughout the proposal.
Characteristics of a Problem
The following characteristics are important to keep in mind as you develop the formal statement of the problem of your study.
data from your field and discover if you are proposing a redundant study.
Meaningfulness
Is your problem statement meaningful? Is it important to your field? The problem may focus on something you personally want to know, but this is not enough to establish the need for the study. The inexperienced tend to focus on the obvious, surface issues related to ministry. The problem statement should have a theoretical basis beyond the pragmatic concern of "what works." Research seeks to know the whys as well as the hows of the way the world works.
Clearly written
The problem statement is usually a single sentence which isolates the variables of the study and indicates how these variables will be studied. The statement is terse, brief, concise. It is objectively written so that another can read the statement and understand the focus of the study.
This study proposes to measure the administrative leadership style and the particular philosophy of education of selected Christian college administrators and determine whether there is any relationship between these two variables. Since style and philosophy are nominal variables, this problem statement infers the use of the chi-square Test of Independence -- a test of relationship between two nominal variables. (See Chapters 5 and 23 for further information on chi-square.)
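As a sketch of how such a test is computed in practice, here is a chi-square Test of Independence run in Python with SciPy. The contingency table is invented for illustration -- these are not Dr. Ang's data:

```python
# Hypothetical counts: administrators cross-classified by two leadership
# styles (rows) and two educational philosophy profiles (columns).
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[20,  5],
                     [10, 15]])

chi2, p, dof, expected = chi2_contingency(observed)
print(f"chi-square = {chi2:.2f}, df = {dof}, p = {p:.4f}")
```

A small p-value would lead the researcher to reject the null hypothesis that style and philosophy are independent.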
The problem of this study [is] to determine the relationship between ministerial job satisfaction and a specific set of predictor variables. These variables [are] Principle Ministry Classification, Gender, Age, Marital Status, Education, Tenure, and presence in the workplace of a Performance Evaluation.2
This statement identifies variables which the researcher believed influence the degree of job satisfaction in ministerial staff members of Southern Baptist churches. Problem statements of this type refer to multiple regression analysis. (See Chapter 26 for further information on multiple regression.)
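The idea behind multiple regression -- predicting one variable from several others -- can be sketched with ordinary least squares in NumPy. The predictors and values below are invented for illustration, not Dr. Welch's data:

```python
# A minimal sketch: predict a job-satisfaction score from two
# hypothetical predictors, age and tenure. Data are invented.
import numpy as np

age          = np.array([25, 32, 40, 48, 55], dtype=float)
tenure       = np.array([ 1,  4,  8, 15, 20], dtype=float)
satisfaction = np.array([60, 65, 72, 80, 88], dtype=float)

# design matrix with an intercept column
X = np.column_stack([np.ones_like(age), age, tenure])
coef, *_ = np.linalg.lstsq(X, satisfaction, rcond=None)
intercept, b_age, b_tenure = coef
print(intercept, b_age, b_tenure)
```

The fitted coefficients show how much each predictor contributes to the predicted satisfaction score, which is the question a regression-type problem statement asks.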
This study will measure the variable learning outcomes -- defined later as the achievement score of the student on the multiple-choice post-test measuring the lesson objectives at three cognitive levels: knowledge, comprehension, and application4 -- in two groups of adult Sunday School members. One group experienced a Bible study which intentionally integrated active participation methods. The second group experienced the same Bible study without active participation. Would intentional active participation make a difference in their learning? The statistic inferred by this statement is the t-Test for Independent Samples. (See Chapter 20 for further information on the two-sample independent t-test.)
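A sketch of the t-Test for Independent Samples in Python with SciPy. The post-test scores are invented for illustration, not Dr. Cook's data:

```python
# Hypothetical post-test scores for the active-participation group
# and the same-lesson, no-participation group.
from scipy.stats import ttest_ind

active  = [85, 78, 92, 88, 81, 90]
lecture = [72, 75, 80, 70, 78, 74]

t_stat, p_value = ttest_ind(active, lecture)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

A significant t-value would indicate that mean learning differed between the two groups.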
Dr. Scott Floyd wrote his second problem statement this way:
It [is] also the problem of this study to determine the difference in marital adjustment of Southern Baptist women. . . who were not employed outside the home, employed part-time, and employed on a fulltime basis.5
This study will measure marital adjustment, a ratio score, in Southern Baptist
2 Robert Horton Welch, "A Study of Selected Factors Related to Job Satisfaction in the Staff Organizations of Large Southern Baptist Churches" (Ed.D. diss., Southwestern Baptist Theological Seminary, 1990), 4
3 Marcus Weldon Cook, "A Study of the Relationship Between Active Participation as a Teaching Strategy and Student Learning in a Southern Baptist Church" (Ph.D. diss., Southwestern Baptist Theological Seminary, 1994), 3
4 Ibid., 24
5 Floyd, 5
women divided into three employment groups. Do the mean scores of these three groups differ significantly? The Problem Statement infers the use of one-way Analysis of Variance (ANOVA). (See Chapter 21 for further information on ANOVA.) Dr. Floyd tested one independent variable above. His primary problem, however, involved two. In addition to employment status, he also divided women into three levels of life cycle -- ages 18 to 31, 32 to 46, and 47 to 65. The Problem statement for this design read this way:
The problem of this study [is] to determine the interaction between life cycle stage and employment status of Southern Baptist women in Tarrant County, Texas, on a measure of marital adjustment.
This problem statement infers the use of two-way ANOVA, because it identifies two independent variables, employment and life cycle, and one dependent variable, marital adjustment. (See Chapter 25 for information on Factorial ANOVA.) The Problem statement delineates the question of the study. It is the climax of the Introductory Statement and opens the door to the Synthesis of Related Literature. In doing your literature search, you will learn a great deal from others who have studied the variables you are interested in studying. At the end of the Related Literature section (see Chapter 6) you will be ready to write a confident statement of your expected findings. This statement of expectation is called a hypothesis.
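Dr. Floyd's one-way comparison of three employment groups can be sketched with SciPy's one-way ANOVA. The marital-adjustment scores below are invented for illustration, not his data:

```python
# Hypothetical marital-adjustment scores for three employment groups.
from scipy.stats import f_oneway

not_employed = [110, 115, 108, 120, 112]
part_time    = [105, 100, 112,  98, 104]
full_time    = [ 95, 102,  99,  90,  97]

f_stat, p_value = f_oneway(not_employed, part_time, full_time)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```

A significant F-value says only that at least one group mean differs; follow-up comparisons locate which groups differ.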
As explained in Chapter 2, a hypothesis states the anticipated answer to the problem you've stated. The two major types of hypotheses are the research, or alternative, hypothesis and the null, or statistical, hypothesis. The research hypothesis can be either directional or non-directional.
Another way this relationship between nominal variables could be stated is this: It is the hypothesis of this study that leadership style of the academic administrator
Ang, 3
Ibid., 19
and his/her educational philosophy profile are not independent. The phrase "not independent" indicates more clearly that the study will use the chi-square statistic. Categories of leadership style and educational philosophy are the nominal measurements.
The above is a multiple regression example where one variable is being predicted by two others. Association among several variables can also involve several pairings of variables. Dr. Maria Bernadete Da Silva wrote her problem statement to analyze the relationships among several pairs of variables.
The problem of this study [is] to determine the relationship between leadership style and the levels of agreement on selected social work values of social work administrators in social service agencies in Texas.10
The four social work values were respect for basic rights, social responsibility, individual freedom, and self-determination. Level of agreement with these values consisted of the number of social workers selecting one of four options: strongly agree, agree, disagree, or strongly disagree. This design required four chi-square tests of independence, matching leadership style with each of the four values.
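A design like this -- one chi-square test per value -- can be sketched as a loop over four contingency tables. The counts are invented for illustration, not Dr. Da Silva's data:

```python
# Hypothetical counts: rows are two leadership styles, columns are the
# four agreement options (strongly agree / agree / disagree / strongly disagree).
from scipy.stats import chi2_contingency

tables = {
    "basic rights":          [[30, 20,  5,  5], [15, 25, 10, 10]],
    "social responsibility": [[25, 25,  5,  5], [20, 20, 10, 10]],
    "individual freedom":    [[20, 20, 10, 10], [10, 20, 15, 15]],
    "self-determination":    [[35, 15,  5,  5], [25, 20, 10,  5]],
}

for value, table in tables.items():
    chi2, p, dof, _ = chi2_contingency(table)
    print(f"{value}: chi-square = {chi2:.2f}, df = {dof}, p = {p:.4f}")
```

Each test asks the same question independently: is agreement with this value independent of leadership style?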
Paret, 5
Ibid., 10
Ibid., 37
10 Da Silva, 4
11 Ibid., 7
12 Havens, 7
Scores were divided into two groups for purposes of testing this hypothesis: one group of children had parent-teachers with teacher certification and the second group did not. Did academic achievement -- defined as improved grade level scores in vocabulary, reading, writing, spelling, mathematics, science and social studies skills, as measured by the subtests of the Stanford Achievement Test14 -- significantly differ between these two groups? This hypothesis suggests the use of the t-Test for Independent Samples (Chapter 20). Dr. Daryl Eldridge, conducting an experimental study, wrote his problem statement this way:
The problem of this study will be to investigate the effect of student knowledge of behaviorally stated course objectives upon the performance and attitudes of seminary students in a church curriculum planning course.15
The instrument adapted for his study produced interval data. The hypothesis infers use of the one-way Analysis of Variance statistic. (Chapter 21) Research hypotheses can be directional or non-directional. The distinction between these two types of research hypotheses lies in whether the hypothesis simply states a difference or states a difference in a specific direction.
Ibid., 21
15 Eldridge, 3
16 Ibid., 29
17 Babler, 7
18 Ibid., 32
It [is] the hypothesis of this study that autonomy and intimacy as perceived in the couple's family of origin are significant positive predictors of current nuclear family health. (Paret)

It is the hypothesis of this study that the test scores of students who have knowledge of course objectives will be significantly greater than the test scores of students who have no knowledge of objectives. (Eldridge)
When you state your research hypothesis in a directional form, you show more confidence in the anticipated result of your study. This confidence grows out of your literature review and expertise in the field. You should state your research hypotheses in a directional format if possible.
The first hypothesizes prediction, but does not specify direction, positive or negative. The second hypothesizes difference, but does not specify greater than or smaller than. These non-directional statements are weaker than the directional statements actually stated by the researchers. Use a non-directional research hypothesis in your proposal only if you cannot develop a reasonable basis for stating a direction for your anticipated results.
Notice that the null form of the hypothesis declares no relationship among variables, and no difference between groups.
NOTE: There are times, though rare, when the "null hypothesis" is the "research hypothesis" of the study. For example, suppose you are creating a new treatment that you believe will require half the time, but will produce the same results, as a more costly, time-intensive procedure. Your intent is to show "no difference" between the approaches. On these rare occasions, the null is the research hypothesis as well as the statistical hypothesis. The point: the null is not always the opposite of the research hypothesis.
Revision Examples
It is relatively easy to read a statement of problem or hypothesis and agree that it is focused and meaningful. It is quite another to write such statements. The following examples are problem and hypothesis statements written by students in class. I will comment on the statement as written, and then suggest a revised version.
Example 1
The problem of this study is to determine the effect of adequate premarital counseling on the success rate of teenage marriages.
Comments
The term "effect" calls for an experimental or ex post facto approach to the study. If you are thinking in this direction, move to Chapter 13 soon. I encourage you to pursue an experimental design, but students sometimes use the term "effect" when they are actually thinking of correlation. You cannot infer a cause-and-effect relationship from a correlation. There are other questions raised by this Problem. What is "adequate" counseling? What kind of premarital counseling? How will you measure "success rate"? Success over what period of time? How do you define "teenage marriage"? Is this study focusing only on teenagers who are married, or on all marriages which began in the teenage years?
Suggested revision
The problem of this study is to determine the difference in attitude toward married life between married teenagers who undergo a specified course of premarital counseling and those who do not. Here you are studying teenagers who are married. You will have two groups: one group undergoes a specified counseling treatment (which you will define under Procedure for Collecting Data) and the other doesn't. You measure differences in attitude toward married life between the two groups.
Example 2
The problem of this study is to determine whether those who complete MasterLife Discipleship Training will have a more positive attitude toward discipleship and will become actively involved in discipleship.
Comments
More positive than what? There is nothing to compare MasterLife against. What is meant by "actively involved"? "Discipleship" is a global term. What does it mean in the framework of this study? What is the theoretical basis for this study? How will it contribute to the field of Christian education? Is this really an evaluation of the MasterLife program?
Suggested revision
The problem of this study is to determine the difference in discipleship skills and attitudes developed in median adults between the MasterLife Discipleship Training program and the (Alternative) Discipleship Training program. This study will evaluate MasterLife against another discipleship training program. The basis for comparison will be measured skills and attitudes in the area of discipleship.
Example 3
It is the hypothesis of this study that the level of social extroversion expressed by a child will differ significantly in relationship to the type of before and after school care environment he or she receives.
Comments
This statement targets the variables rather well. "Level of social extroversion" and "type of care environment" are clearly stated. But the wording is awkward. How many types of before and after school care will be studied? Two? Three? What does "type of care" mean? How will it be measured?
Suggested revision
It is the hypothesis of this study that children receiving Type I care will score significantly higher on the social extroversion scale than children receiving Type II care. Two types of child care are specified. These two types are directly compared on the basis of a social extroversion measurement of the children. If one were interested in comparing several types of child care, the hypothesis could read: It is the hypothesis of this study that children's scores on the social extroversion scale will significantly differ across (number) specified types of before and after school care.
Example 4
It is the hypothesis of this study that staff longevity of ministers is significantly increased in churches using a salary administration plan than churches who do not use such a plan.
Comments
The term "increased" indicates a before-and-after study. This may be difficult to do in churches. How do you get churches to agree to install a different plan for purposes of a research study? It is easier to focus on difference. What is "staff longevity"? How long a staff member stays in a position? How is it measured? Months? Years? What is a "salary plan"? This is a fuzzy concept. How will you determine whether a church qualifies as having a plan or not having a plan? Is a bad plan better than no plan? Is salary the major factor in staff longevity? Are there other variables that need to be considered in studying why staff members remain in a given church? How will the researcher deal with ineffective staff members who are not invited to consider other churches -- those who remain because they have nowhere else to go?
Suggested revision
It is the hypothesis of this study that the length of service of ministers is significantly higher in churches that qualify as having a specified salary administration plan than in churches that do not. The researcher maintains his focus on salary. However, there is a procedure which will be used to categorize churches on the basis of their salary plans. Rather than measure increase, the researcher will look at the difference between length of service of ministers in two categories of churches.
Example 5
The hypothesis of this study is that men who remain in the pastorate are significantly different than those who leave the pastorate to enter denominational work.
Comments
This statement uses some of the words we've discussed, but misses the mark as a hypothesis statement. It is an excellent example of a hypothesis written by someone who knows the words but does not understand their meaning ("But I used the words significantly different!"). What is the variable being studied? These two groups of men will be different on what variable(s)? What is the theoretical foundation of this? Is there justification for considering either pastoral ministry or denominational ministry better than the other? Besides, what is being measured? How will the researcher obtain his data? There is really no study here. We need to head back to the drawing board on this one.
Dissertation Examples
The Problem-Hypothesis-Statistic set forms the backbone, the framework, for both the proposal and the dissertation itself. While you are certainly not expected to understand the statistical procedures referenced here, I include them for future reference and for a sense of completeness. We will introduce you to these and other statistical procedures in Chapter 5, and focus on them in chapters 16 to 26. The following statement-sets are drawn from dissertations of our graduates. They are written in the past tense since they are taken from the dissertations.
Regression Analysis
The problem of this study was to determine the relationship between attitudes concerning computer-enhanced learning and selected individual and institutional variables of full-time
faculty members at Southwestern Baptist Theological Seminary. [The hypothesis] of this study was that the following variables would prove to be significant predictors of attitudes toward computer-enhanced learning for theological education among the full-time faculty of Southwestern Baptist Theological Seminary: age, gender, school division where teaching, discipline teaching, degree(s) held, number of years teaching at Southwestern, last enrolled in a course, whether or not own a computer, believe students should own a computer, and taken any computer courses/instruction.19
The statistic for this study was Multiple Regression (see Chapter 26). There were two significant predictors found in this study: whether the professor owned a computer or not, and whether they believed students should own a computer. A positive attitude toward computer-enhanced learning in theological education was predicted by "yes" answers to these two questions.
The statistic for this study was the Spearman rho correlation coefficient (see Chapter 22). Competencies for ministers of education were divided into five areas: minister, administrator, educator, growth agent, and personal [relational skills]. Higher coefficients reflect higher agreement between pastors and educators on ranked competencies. Lower coefficients reflect lesser agreement. The coefficients were minister (0.94), administrator (0.64), educator (0.83), growth agent (0.54), and personal (0.70).
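Agreement between two sets of rankings like these can be sketched with SciPy's Spearman rho. The rankings below are invented -- ten hypothetical competencies, not Dr. Bass's items:

```python
# Hypothetical rankings of ten competencies by pastors and by
# ministers of education. Nearly identical orderings yield a high rho.
from scipy.stats import spearmanr

pastor_ranks   = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
educator_ranks = [2, 1, 4, 3, 6, 5, 8, 7, 10, 9]

rho, p_value = spearmanr(pastor_ranks, educator_ranks)
print(f"rho = {rho:.2f}")
```

Because each pair of ranks differs by only one position, rho comes out high, reflecting strong agreement between the two groups.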
The statistic for this study was Factorial ANOVA (see Chapter 25). There was no interaction between the two variables, so the two "main effects" (school, activity) could be interpreted directly. There was no significant difference in spiritual maturity between seniors in Christian vs. public schools, but spiritual maturity in active Sunday School attenders was significantly higher than in inactive attenders.
19 Bergen, 7, 46
20 Bass, 3, 37
21 LaNoue, 2, 22
22 Marcia McQuitty, 5, 27
of service on church staffs, task preference, gender, and age. The hypothesis of this study was that dominant management style and selected variables were not independent.22
The statistic for this study was the chi-square test of independence (see Chapter 23). Dr. McQuitty queried all full-time preschool and children's ministers serving in Texas Baptist churches (N=132), and actually gathered data from eighty-one (81). Only nineteen (19) ministers produced a dominant management style, and thirteen (13) of these were categorized as comforter. This discovery required a change in the hypothesis: rather than one of five management styles, Dr. McQuitty tested her specified variables against dominant vs. multiple management styles. None of the specified variables produced a significant chi-square value.23 Still, the data collection yielded important insights into the strengths and needs of preschool and childhood education ministers -- insights which Dr. McQuitty uses in her seminary classes.
Analysis of Variance
The problem of this study was to determine the difference in achievement, both cognitive and affective, among students who learned through interactive instruction, simulation games, and presentational instruction in the Hong Kong Baptist Theological Seminary, Hong Kong.24 The following were the hypotheses of the study:
1. H1: there was significant difference among the means across [testing] occasions. . .
2. H2: there was significant difference among the means across all groups. . .
3. [interaction]
4. [post-test 1: cognitive]
5. [post-test 1: affective]
6. [post-test 2: cognitive]
7. [post-test 2: affective]25
The statistic for this study was one-way analysis of variance (see Chapter 21). The analysis revealed no significant differences in cognitive learning across teaching methods used in the three groups. All three groups learned. The greatest change in attitude toward learning and interpersonal relationships occurred in the "Simulation Games" group.26
Summary
The material of this chapter is crucial to your research proposal. It is important that you understand the concepts discussed here and be able to use them with your own topic. Read the examples of good statements several times until the pattern of each kind of study begins to become clear. Work step-by-step through the evaluations of the real-life examples.
23 Ibid., 43
24 Stephen Tam, "A Comparative Study of Three Teaching Methods in the Hong Kong Baptist Theological Seminary" (Ed.D. diss., Southwestern Baptist Theological Seminary, 1989), 2
25 Ibid., 14-17
26 Ibid., 76-77
Vocabulary
research hypothesis -- anticipated outcome of study, stated in terms of difference (groups) or relationship (variables)
null hypothesis -- anticipated outcome of study, stated in terms of NO difference or NO relationship
statistical hypothesis -- same as null hypothesis
directional hypothesis -- states a direction of difference (larger, smaller) or relationship (positive, negative)
non-directional hypothesis -- states no direction; simply states "difference" or "relationship"
Study Questions
1. Explain the purpose of the problem and hypothesis statements. 2. Describe the four characteristics of a good problem statement. 3. Describe four types of hypothesis statements.
Therapy A will result in significantly less marital anxiety than Therapy B.
There will be no significant difference between Teaching Approaches 1 and 2.
There will be a relationship between Number of Hours Studied and GPA.
Number of Hours Worked Outside the Home and Marital Satisfaction are independent.
Bible Knowledge Score will be significantly different across the three groups.
Senior Adults' Preference Score toward the King James Version will be significantly higher than for Young Adults.
There will be no difference in ministerial commitment scores across three staff categories.
Men and women will score differently on the nurturing scale of the BA12 Test.
Chapter 5
Introduction to Statistics
Statistical Analysis
Statistics, Mathematics, and Measurement A Statistical Flow Chart
In the first four chapters of the text, we have focused on concerns of research design: the scientific method, types of research, proposal elements, measurement types, defining variables, and problem and hypothesis statements. But designing a plan to gather research data is only half the picture. When we complete the gathering portion of a study, we have nothing more than a group of numbers. The information is meaningless until the numbers are reduced, condensed, summarized, analyzed and interpreted. Statistical analysis converts numbers into meaningful conclusions in accordance with the purposes of a study.

We will spend chapters 15-26 mastering the most popular statistical tools. But you must understand something of statistics now in order to properly plan how you should collect your data. That is, the proper development of a research proposal depends on what kind of data you will collect and what statistical procedures exist to analyze that data.

The fields of research design and statistical analysis are distinct and separate disciplines. In fact, in most graduate schools, you would take one or more courses in research design and other courses in statistics. My experience with four different graduate programs has been that little effort is made to bridge the two disciplines. Yet the fields of research and statistics have a symbiotic relationship. They depend on each other. One cannot have a good research design with a bad plan for analysis. And the best statistical computer program is powerless to derive real meaning from badly collected data. So before we get too far into the proposal writing process, some time must be given to establishing a sense of direction in the far-ranging field of statistics.
I: Research Fundamentals
Descriptive Statistics
Descriptive statistical procedures are used to describe a group of numbers. These tools reduce raw data to a more meaningful form. You've used descriptive statistics when averaging test grades during the semester to determine what grade you'll get. The single average, say, a 94, represents all the grades you've earned in the course throughout the entire semester. (Whether this 94 translates to an A or a C depends on factors outside of statistics!). Descriptive statistics are covered in chapters 15 (mean and standard deviation) and 22 (correlation).
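The grade-averaging example can be sketched with Python's standard statistics module. The grades below are invented for illustration:

```python
import statistics

# Hypothetical semester grades for one student
grades = [96, 91, 88, 97, 98]

# A single mean "describes" the whole set of grades
mean = statistics.mean(grades)     # 94
spread = statistics.stdev(grades)  # sample standard deviation

print(f"Semester average: {mean}")
print(f"Standard deviation: {spread:.2f}")
```

The mean condenses five numbers into one; the standard deviation describes how widely the grades scatter around it.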
Inferential statistics
Inferential statistics are used to infer findings from a smaller group to a larger one. You will recall the brief discussion of population and sample in chapter 2. When the group we want to study is too large to study as a whole, we can draw a sample of subjects from the group and study them. Descriptive statistics about the sample are not our interest. We want to develop conclusions about the large group as a whole. Procedures that allow us to make inferences from samples to populations are called inferential statistics. For example, there are over 36,000 pastors in the Southern Baptist Convention. It is impossible to interview or survey or test all 36,000 subjects. Round-trip postage alone would cost over $21,000. But we could randomly select, say, one percent (1%), or 360 pastors, for the study, analyze the data of the 360, and infer the characteristics of the 36,000. Inferential procedures are covered in chapters 16, 17, 18, 19, 20, and 21.
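The pastor example can be simulated in a few lines of Python. The ages below are randomly generated stand-ins, not real data; the point is only that a 1% simple random sample yields a close estimate of the population mean:

```python
import random

random.seed(1)

# Hypothetical population: ages of 36,000 pastors (values invented for illustration)
population = [random.gauss(52, 10) for _ in range(36_000)]

# Draw a 1% simple random sample (360 subjects) and infer the population mean
sample = random.sample(population, 360)
estimate = sum(sample) / len(sample)
actual = sum(population) / len(population)

print(f"Sample estimate: {estimate:.1f}")
print(f"Population mean: {actual:.1f}")  # the estimate should land close to this
```

Studying 360 subjects instead of 36,000 trades a small amount of accuracy for an enormous saving in cost, which is the whole rationale for inference.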
Statistical Flowchart
Accompanies explanation on pages 5-4 to 5-7 in text
[Flowchart summary. Two main branches:

-1- ASSOCIATION: relationships between variables. Is your data interval/ratio, ordinal, or nominal?
    -3- Interval/ratio: 2 variables, or 3+ variables.
    -4- Ordinal: 2 ranks -- -4a- Spearman rho (ρ) or Kendall's tau (τ); 3+ ranks -- -4b- Kendall's W.
    -5- Nominal: 1 variable, 2 variables, or 2 dichotomous* variables; Point Biserial pairs an interval/ratio variable with a dichotomous variable.

-2- DIFFERENCE: differences between groups. Is your data interval/ratio or ordinal?
    -6- Interval/ratio: 1 group (sample mean vs. population mean -- σ known or n > 30; σ unknown), 2 groups, or 2+ groups (-6c- One-Way ANOVA).
    -7- Ordinal: 2 groups, or 3+ groups (-7c- Kruskal-Wallis H test).

*Dichotomous -- two and only two categories]
You have chosen a similarity study. Statistical procedures that compute coefficients of similarity or association or correlation (synonymous terms) come in four basic types. The first type computes correlation coefficients between interval or ratio variables. The second type computes correlation coefficients between ordinal variables. The third type computes correlation coefficients between nominal variables (or between variables of which at least one is nominal). The fourth type is a special category which computes a coefficient of independence between nominal variables. If your data is interval or ratio, go to -3- below. If your data is ordinal, go to -4- below. If your data is nominal, go to -5- below.
-3b- Interval/Ratio Correlation with 3+ Variables
The procedure we will study which analyzes three or more interval/ratio variables simultaneously is multiple linear regression. This procedure is quickly becoming the dominant statistical procedure in the social sciences. With this procedure, you develop models which relate two or more predictor variables to a single predicted variable. We will confine our study to understanding the printouts of a statistical computer program called SYSTAT. See Chapter 26.
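As an illustration only (the text teaches SYSTAT printouts, not hand computation), a least-squares fit for two predictors can be computed by solving the normal equations. The data below are contrived so that known coefficients can be recovered exactly:

```python
# Two hypothetical predictor variables (x1, x2) and one predicted variable (y).
# Solve the normal equations (X'X)b = X'y by Gaussian elimination -- the same
# least-squares coefficients a package like SYSTAT would report.

def fit_multiple_regression(rows, y):
    # rows: list of predictor tuples; a column of 1s is added for the intercept
    X = [[1.0, *r] for r in rows]
    k = len(X[0])
    n = len(X)
    # Build X'X and X'y
    xtx = [[sum(X[i][a] * X[i][b] for i in range(n)) for b in range(k)] for a in range(k)]
    xty = [sum(X[i][a] * y[i] for i in range(n)) for a in range(k)]
    # Gaussian elimination with partial pivoting
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(xtx[r][col]))
        xtx[col], xtx[pivot] = xtx[pivot], xtx[col]
        xty[col], xty[pivot] = xty[pivot], xty[col]
        for r in range(col + 1, k):
            f = xtx[r][col] / xtx[col][col]
            for c in range(col, k):
                xtx[r][c] -= f * xtx[col][c]
            xty[r] -= f * xty[col]
    # Back substitution
    b = [0.0] * k
    for r in range(k - 1, -1, -1):
        b[r] = (xty[r] - sum(xtx[r][c] * b[c] for c in range(r + 1, k))) / xtx[r][r]
    return b  # [intercept, slope for x1, slope for x2, ...]

# Data generated from y = 5 + 2*x1 + 3*x2, so the fit should recover 5, 2, 3
predictors = [(1, 1), (2, 1), (3, 2), (4, 3), (5, 5), (6, 4)]
outcomes = [5 + 2 * x1 + 3 * x2 for x1, x2 in predictors]
print(fit_multiple_regression(predictors, outcomes))
```

With real data the outcomes would contain error, and the fitted coefficients would be the best-fitting model rather than an exact recovery.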
Just like the interval/ratio procedures above, ordinal correlation procedures come in two types.
-4a- Ordinal Correlation with 2 Variables
The two procedures which compute a correlation coefficient between two ordinal variables are Spearman's rho (rs) and Kendall's tau (τ). Spearman's rho should be used when you have ten or more pairs of rankings; Kendall's tau when you have fewer than ten. Both measures give you the same information. If you had pastors and ministers of education rank order seven statements of characteristics of Christian leadership, you would compute the degree of agreement between the rankings of the two groups with Kendall's tau. See Chapter 22.

-4b- Ordinal Correlation with 3+ Variables
Kendall's Coefficient of Concordance (W) measures the degree of agreement in rankings from more than two groups. Using our example above, you could compute the degree of agreement in rankings of pastors, ministers of education, and seminary professors using Kendall's W. See Chapter 22.
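Spearman's rho has a simple hand formula based on the differences between paired ranks. The sketch below uses hypothetical rankings of seven statements; note that with fewer than ten pairs the chapter would actually recommend Kendall's tau, so this only illustrates how rho is computed:

```python
def spearman_rho(ranks_a, ranks_b):
    # rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1)), where d is the difference
    # between paired ranks (this shortcut formula assumes no tied ranks)
    n = len(ranks_a)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - (6 * d2) / (n * (n ** 2 - 1))

# Hypothetical rankings of seven leadership statements by two groups
pastors   = [1, 2, 3, 4, 5, 6, 7]
ministers = [2, 1, 4, 3, 5, 7, 6]
print(round(spearman_rho(pastors, ministers), 3))  # about 0.893: strong agreement
```

Identical rankings give rho = 1; perfectly reversed rankings give rho = -1.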
male, 15% female] to determine if class enrollment fits well the expected enrollment. The Chi-square Test of Independence compares two nominal variables to determine if they are independent. Are educational philosophy (5 categories) and leadership style (5 categories) independent of each other? When you want to determine the strength of the relationship between the two nominal variables, use Cramér's Phi (φc). This procedure computes a Pearson's r type coefficient from the computed chi-square value. See Chapter 23.
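A minimal sketch of the chi-square computation and Cramér's phi, using an invented 2x2 table of philosophy by leadership style:

```python
import math

# Hypothetical 2x2 table: educational philosophy (rows) by leadership style (columns)
observed = [[30, 10],
            [20, 40]]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand = sum(row_totals)

# Chi-square: sum of (observed - expected)^2 / expected over every cell,
# where expected = row total * column total / grand total under independence
chi2 = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand
        chi2 += (o - expected) ** 2 / expected

# Cramer's phi turns the chi-square value into a Pearson-r-style strength coefficient
k = min(len(observed), len(observed[0]))  # the smaller table dimension
phi_c = math.sqrt(chi2 / (grand * (k - 1)))
print(f"chi-square = {chi2:.2f}, Cramer's phi = {phi_c:.2f}")
```

A large chi-square says the two variables are probably not independent; phi says how strong the association is on a 0-to-1 scale.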
interaction among the independent variables. See Chapter 25. If the groups are related, use the Repeated Measures Analysis of Variance. (Not discussed in this text.)
Summary
In this chapter we introduced you to statistical analysis. We linked statistics to the process of research design. We looked at the two major divisions of statistics. We separated the practical application of statistical procedures from the need for higher-level mathematics skills. We differentiated statistical procedures by measurement type. And finally, we laid out a mental map of the statistical procedures we will be studying so that you can determine which procedures might be of use to you in your own proposal.
Vocabulary
correlation coefficient -- a number which reflects the degree of association between two variables
Cramér's Phi -- measures strength of correlation between two nominal variables
descriptive statistics -- measures population or sample variables
Factorial ANOVA -- two-way, three-way ANOVA
Goodness of Fit -- compares observed counts with expected counts on 1 nominal variable
Indep't Samples t-test -- tests whether the average scores of two groups are statistically different
Inferential statistics -- INFERS population measures from the analysis of samples
Kendall's tau -- correlation coefficient between two sets of ranks (n < 10)
Kendall's W -- correlation coefficient among three or more sets of ranks
Kruskal-Wallis H Test -- non-parametric equivalent of ANOVA
Linear Regression -- establishes the relationship between one variable and one predictor variable
Mann-Whitney U Test -- non-parametric equivalent of the independent t-test
Matched Samples t-test -- tests whether the paired scores of two groups are statistically different
Multiple Regression -- establishes the relationship between one variable and multiple predictor variables
one-sample z-test -- tests whether a sample mean is different from its population mean (n > 30)
one-sample t-test -- tests whether a sample average is different from its population average
One-Way ANOVA -- tests whether average scores of three or more groups are statistically different
Pearson's r -- correlation coefficient between two interval/ratio variables
Phi Coefficient -- correlation coefficient between two dichotomous variables
Point Biserial -- correlation coefficient between interval/ratio variable and dichotomous variable
Rank Biserial -- correlation coefficient between ordinal variable and dichotomous variable
Spearman's rho -- correlation coefficient between two sets of ranks (n > 10)
Test of Independence -- chi-square test of association between two nominal variables
Two Sample Wilcoxon -- non-parametric equivalent of independent t-test
Wilcoxon Matched Pairs -- non-parametric equivalent of matched samples t-test
Study Questions
1. Differentiate between descriptive and inferential statistics.
2. Consider your own proposal. Review the types of data (Chapter 3). List several statistical procedures you might consider for your proposal. Scan the chapters in this text which deal with the procedures you've selected.
3. Give one example of each data type (review Chapter 3). Identify one statistical procedure for each example you give.
____ 1. Difference between fathers and their adult sons on a Business Ethics test.
____ 2. Whether learning style and gender are independent.
____ 3. Analysis of six predictor variables for job satisfaction in the ministry.
____ 4. Difference in Bible Knowledge test scores across three groups of youth ministers.
____ 5. Prediction of marital satisfaction by self-esteem of husband.
____ 6. Relationship between number of years in the ministry and job satisfaction score.
____ 7. Difference in anxiety reduction between treatment group I and treatment group II.
____ 8. Correlation between rankings of objectives of the School of Religious Education by pastors and ministers of education.
Chapter 6
Synthesis of Literature
Synthesis of Related Literature
A Definition
The Procedure
In this chapter we look at the process of finding, collecting, analyzing and synthesizing research articles which relate to the topic of our study. Before we can add to the knowledge base of our field of study, we must learn what is already known. The literature search provides a factual base for the proposed study.
A Definition
The related literature section of your proposal, entitled the Synthesis of Related Literature, is a synthetic narrative of recent research which is related to your study.
Synthetic Narrative
The related literature section is a synthetic narrative. It is a narrative in the sense that it should flow from the beginning to the end with a single, coordinated theme. It should not contain a series of disjointed summaries of research articles. Such unrelated and disconnected summaries generate confusion rather than understanding. It is synthetic in that it has been born out of the synthesis of many research studies. You will analyze research reports by key words. There may be twenty articles that provide information for a given key word. As you write your findings for each of your key words, you will draw from all of the articles addressing that key word simultaneously. The final product will be a synthesis: a smooth blending of selected articles built around the key words of your study. This is the reason for the name of this section: The Synthesis of Related Literature. Not a summary, but a synthesis.
Recent Research
The synthesis of related literature focuses on recent research. The rule of thumb in defining "recent" is ten years. You will want to select and include research articles which are less than 10 years old. Major emphasis should be placed on research conducted in the past 5 years. Articles older than this are out of date and misleading. Consider an opinion survey conducted in 1955 on the attitudes of Americans on family. Such information has little relevance to family attitudes today. Its only value would be to show the change in attitude since 1955.

Gather your information from research journal articles rather than books. Books are, by necessity, more out of date than the research they're based upon. Research reports are primary sources of information, because they are written by those who conducted the study. Books are usually secondary sources; that is, sources written by authors not directly associated with the reported research: they merely compile research results from many sources. Focus your synthesis on primary sources of information.

4th ed. 2006 Dr. Rick Yount
E.R.I.C.
The Educational Resources Information Center (ERIC) was initiated in 1965 by the U.S. Office of Education to transmit findings of current educational research to researchers, teachers, administrators and graduate students.1 Information is housed in 16 clearinghouses around the nation.2
RIE
The ERIC system consists of two major parts. The first is the Resources in Education (RIE) which provides abstracts of unpublished papers presented at educational conferences, speeches, progress reports of on-going research studies, and final reports of projects conducted by local agencies such as school districts.3
CIJE
The second major part of the ERIC system is the Current Index of Journals in Education (CIJE). The CIJE indexes articles published in over 300 educational journals and articles about educational concerns in other professional journals.4 In general, ERIC listings have less lag time than the Education Index or Psychological Abstracts. This means it will provide you with more recent research findings. Altogether, the ERIC system indexes and abstracts research projects, theses, conference proceedings, project reports, speeches, bibliographies, curriculum-related materials, books and more than 750 educational journals.1
1 Walter R. Borg and Meredith D. Gall, Educational Research: An Introduction, 4th ed. (New York: Longman Publishing Co., 1983), 153.
2 See Borg and Gall, pp. 901-2 for addresses of clearinghouses.
3 Ibid., p. 153.
4 Charles D. Hopkins, Educational Research: A Structure for Inquiry (Columbus, Ohio: Charles E. Merrill Publishing Company, 1976), 221.
Psychological Abstracts
Published by the American Psychological Association, this publication lists articles from over 850 journals and other sources in psychology and related fields.2 It gives summaries of studies, books, and articles on all fields of psychology and many educational articles.3
Dissertation Abstracts
The Dissertation Abstracts database contains all dissertations written and registered since 1860. This is a rich resource not only of graduate level research findings, but also of research design and statistical analysis methods.
Education Index
The Education Index provides an up-to-date listing of articles published in hundreds of education journals, books about education and publications in related fields since 1929. For an index to educational articles for the years 1900 to 1929, check the Readers' Guide to Periodical Literature.5
Citation Indexes
The Citation Indexes list published articles which reference (cite) a given article. My statistics professor at University of North Texas gave me a copy of a 1973 article on multiple comparisons one evening before class. He thought the questionable findings in the article would make a good dissertation study. By using citation indexes, I was able to quickly track down references to over fifty articles published since 1973 which cited the article he'd given me. The Science Citation Index (SCI) provides citations in the fields of science, medicine, agriculture, technology, and the behavioral sciences.
1 Sharan B. Merriam and Edwin L. Simpson, A Guide to Research for Educators and Trainers of Adults (Malabar, FL: Robert E. Krieger Publishing Company, 1984), 35.
2 Borg and Gall, p. 150.
3 Hopkins, p. 224.
4 See Borg and Gall, pp. 148-166 for detailed information on these and many other sources.
5 Hopkins, p. 221.
The Social Science Citation Index (SSCI) does the same for the social, behavioral and related sciences.1
1 Hopkins, 225.
The bridge that connects your study to the documents in the databases you've selected is made up of the descriptors, or key words, that grow out of your Problem Statement and operationalized variables. Only key words that are known by the database will work. In the example above, we found that the descriptor "academic self-concept" does not exist in the ERIC system. Other key words had to be substituted. When I wrote a research proposal on Research Priorities in Religious Education, the descriptor "Religious Education" led me to over thirty research articles. But none of the articles used the term the way Southern Baptists use it. If your study has a solid theoretical base, you will find it easier to find descriptors. Ultimately, you will secure reports that provide a good foundation for your study. If your study is theoretically shallow, you will have difficulty finding descriptors. You will be barred from the world of scientific knowledge.
Searching manually
To do a manual search for the key words listed above in the ERIC system, follow these steps:
1. Look in the ERIC index published in the most recent month of the current year. (Indexes for ERIC documents are published monthly; semi-annual volumes are published twice each year.)
2. Look up each of your descriptors in the Subject Index section.
3. You will notice that descriptors are organized in hierarchies. The higher up the hierarchy you find a descriptor, the broader it is (that is, the greater the number of articles it references). The farther down the hierarchy you find a descriptor, the narrower it is (the fewer articles it references). Articles are referenced under the descriptors by ED numbers, such as ED 654 321.
4. Look up the ED number in the Document Resumes section of the ERIC index. Here you will find a brief description (an abstract) of the referenced article. You can usually tell from the abstract whether the article will be of help to you in your own study.
5. When you have found all the abstracts for all your descriptors in this index, move to the next earlier month and repeat the process.
6. When you have completed the current year, use the semi-annual volumes to search back through previous years.
7. Continue the process until you have located every ERIC document related to every descriptor back as far as you want the search to extend.
Searching by Computer
A manual search requires a great deal of time because you must manually thumb through multiple volumes of database indexes. Just think about looking up each of four descriptors, along with their associated articles, in monthly and then semi-annual indexes for up to ten years! How much time do you have to sit in the Reference Section of your university library? But more important than wasted time is the limitation of doing only simple searches. This rules out searches such as "self-esteem AND elementary school children." Such a search would select only those articles which
relate to BOTH descriptors. With a computerized database, you can search through literally millions of articles in seconds, and combine key words in complex ways. We can combine all our selected descriptors into a single search command for the computer. With one pass through all the ERIC documents, every article meeting the specifications of the command line will be selected from that database. Let's use our example to illustrate the process.
1. The library assistant responsible for doing computer searches dials up the database.
2. Descriptors are entered one at a time.
3. With each entry, there is a pause for a few seconds while the computer scans all of its material. It responds with the number of articles relating to that descriptor. The following numbers of articles were found by Borg and Gall for the example problem:
   1. handicapped children -- 277
   2. handicapped students -- 450
   3. self-concept -- 4,433
   4. self-esteem -- 894
   5. elementary school students -- 5,031
   Total: 11,085
4. Descriptors can be combined to select only those articles that fit a specific combination. Borg and Gall's example is interested in (1) self-concept OR (2) self-esteem AND (3) handicapped children OR (4) handicapped students AND (5) elementary school students. This combination is entered with the command (1 or 2) and (3 or 4) and (5). The OR increases the number of selected articles by including additional descriptors: any article relating to either self-esteem OR self-concept, and any article relating to either handicapped children OR handicapped students, will be selected. The AND narrows the number of selected articles by requiring articles to match all the descriptors connected by it. All articles must have either (1) or (2), AND either (3) or (4), AND (5) elementary school students to be selected in this search. The search above produced only one article reference out of the 11,085 articles identified by single descriptors. The Related Literature section requires more than a single article! The researchers broadened the search by dropping (5) elementary school students. Entering the command (1 or 2) and (3 or 4) produced 41 articles in ERIC documents.
5. Print out abstracts. You can have the computer print out the selected abstracts immediately (on-line) or you can have them printed out later (off-line). The difference is COST! Printing out abstracts while on-line means paying the connect fee between the computer and the database while the printer cranks out the abstracts. Printing off-line gives you the abstracts in a few days, but costs only a few cents each. This lower cost is possible because the database computer can call the library in the evening when phone rates are low, download all of the articles to the library's computer, and hang up. The library computer then prints out the listing. On-line printing is expensive, but quick. You get your listing of articles immediately.
Off-line printing is much cheaper, but you may have to wait 3-4 days before you can get your printouts.
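The OR/AND logic of a combined search can be pictured with simple set operations. The article ID numbers below are invented; unions widen the result and intersections narrow it, just as in the Borg and Gall example:

```python
# Hypothetical mini-database: article IDs indexed under each descriptor
index = {
    "self-concept":               {101, 102, 103, 104},
    "self-esteem":                {103, 105},
    "handicapped children":       {102, 106},
    "handicapped students":       {104, 107},
    "elementary school students": {104, 108},
}

# OR widens a search (set union); AND narrows it (set intersection).
# Command: (1 or 2) and (3 or 4) and (5)
selected = (
    (index["self-concept"] | index["self-esteem"])
    & (index["handicapped children"] | index["handicapped students"])
    & index["elementary school students"]
)
print(selected)  # {104}: the one article matching every AND-connected group
```

Dropping the last AND term, as the researchers did, would widen the result to every article matching the first two groups.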
Borg and Gall suggest the most productive results for educational topics would be to search RIE and CIJE from 1969 to date, RIE and Education Index from 1966 to 1968, and Education Index from 1965 back as far as the student plans to extend his review.1 Note: This provides a good historical context. Use sources less than 10 years old for the bulk of your study.
Select Articles
You now have either citations or abstracts of the selected articles. Citations give the author, title, and date of selected articles; an abstract gives a 50-100 word summary of the study. You want to get abstracts if the database provides them.

You now must find the article. Your library can help you do a computer search and provide you with citations. However, the articles cited may not be on your campus. You may need to go to a larger university or state school to find the original article. In our area, for example, North Texas State University has over 5 million journal articles on microfiche and adds thousands of articles each year.

Make a list of the publications cited in your search. The first step is to find out which libraries in the area carry these publications. The reference desk at area university libraries can provide you with a catalog of publications collected by a particular library. Locate the publications on your list in the directory. Some libraries have articles bound in annual volumes and stored on shelves. Others record articles on microfilm or microfiche and store them in filing cabinets. Using the library's indexing system, you can locate the full article selected by your key word search.

There are two major ways to process the articles when you find them. The first is to read through the article in the library and take notes on it immediately. Copy down what you think is relevant on 5x8 cards. Be sure to get all the bibliographical information you need for footnotes and references. The second way is to merely scan the article to determine whether it really pertains to your study or not. If it does, make a xerox copy of it. Both bound journals and microfilm/fiche materials can be xeroxed. The cost is about ten cents per page. You may spend twenty or thirty dollars in dimes this way, but you have a real advantage over the first approach. You have the articles.
You can analyze them at home: write on them, categorize them, cut and paste them; the copies belong to you. I heartily recommend this approach -- especially if you have a family who would like to see you from time to time. Check the bibliographies of the research articles for further references to related literature. This provides you another path to important studies done in your area of interest. Now you must analyze and organize all of this material.
An Organizational Notebook
In my last dissertation, I organized my literature conceptually. I began by scanning the 167 selected articles, looking for key concepts and terms used by the authors that related to the key words of my study. I then placed each term at the top of a blank sheet of paper in a notebook. I began with about thirty concepts which were organized alphabetically.
Prioritizing Articles
While I scanned the articles, I categorized them into three levels of importance: high, medium, and low. High priority articles were identified as those which dealt directly either with my subject or methods. Medium priority articles were identified as those which provided either relevant background information or important implications of my subject. Low priority articles were identified as those which only tangentially referred to my subject or methodology. After my key word notebook was organized, I began reading the high priority articles in detail. New concepts were added to the organizational notebook.
for a natural timeline of development, a chronological ordering is best. In this case, clusters will be time-sensitive, showing a change in thinking over time.
Conceptually: If your study is anchored in clear, inter-related concepts, a conceptual ordering is suggested. My last dissertation had sections on the development of ANOVA and multiple comparisons tests, Type I error rate, Type II error rate, power, and research design.
Stated hypotheses: If you have several hypotheses in your study, these form a natural way to order key word clusters.
Summary
As you can see, the process of developing the Related Literature section of your paper involves a great deal more than checking ten or twelve books out of the library and writing a term paper. The process takes time. You have most of the semester to complete this, but don't wait! Searching the literature will provide you necessary insight into how to mold your entire proposal. Begin now to search the literature. You should do at least one computer search, just for the practice of it; additionally, it will save you weeks of library time.
Vocabulary
CIJE -- abbreviation for Current Index of Journals in Education (published articles)
Citation Indexes -- resources that list articles which cite a given research article
computer search -- locating research articles by computer
databases -- collections of research information by subject matter (e.g. ERIC)
descriptors -- key words by which research articles are indexed (e.g. cognitive or children)
Dissertation Abstracts -- a resource that catalogs abstracts of all dissertations back to 1860
ERIC -- abbreviation for Educational Resources Information Center (CIJE and RIE)
Education Index -- a resource that catalogs education information back to 1929
manual search -- locating research articles using printed indexes
Measures for Psy. Measurement -- catalogs psychological tests used in research
Mental Measurements Yearbook -- catalogs published educational, psychological and vocational tests
organizational notebook -- tool to aid in dissecting articles and synthesizing related ideas
preliminary sources -- resources used to locate articles (e.g. indexes)
primary sources -- materials produced by those who conduct research (e.g. journal articles)
Psychological Abstracts -- index to over 850 psychological journals
RIE -- abbreviation for Resources in Education: index to unpublished materials
secondary sources -- materials produced by writers who study research reports (e.g. books)
SSIE -- abbr for Smithsonian Science Information Exchange: best for ongoing research
synthesis -- multiple articles broken down and reordered by concept in clear, concise writing
AND -- both X and Y must be true (1) for Z to be true (1); otherwise Z = 0 (false)
OR -- either X or Y must be true (1) for Z to be true (1); both 0? Z = 0 (false)
Study Questions
1. Differentiate among preliminary, primary and secondary sources of information.
2. Define the following terms: ERIC, SSIE, RIE, CIJE, descriptor, SCI, SSCI, database, synthesis.
3. Differentiate between a summary of literature and a synthesis of literature.
4. What is the major difference between printing abstracts on-line and off-line?
5. Discuss the importance of revision in writing your proposal. How are you planning to incorporate revision into your proposal development schedule?
Chapter 7
Populations and Sampling
The Rationale of Sampling
Steps in Sampling
Types of Sampling
Inferential Statistics: A Look Ahead
The Case Study Approach

The Rationale of Sampling
In Chapter One, we established the fact that inductive reasoning is an essential part of the scientific process. Recall that inductive reasoning moves from individual observations to general principles. If a researcher can observe a characteristic of interest in all members of a population, he can with confidence base conclusions about the population on these observations. This is perfect induction. If he, on the other hand, observes the characteristic of interest in some members of the population, he can do no more than infer that these observations will be true of the whole. This is imperfect induction, and is the basis for sampling.1 The population of interest is usually too large or too scattered geographically to study directly. By correctly drawing a sample from a specific population, a researcher can analyze the sample and make inferences about population characteristics.
Population
Sampling
Biased Samples
Randomization
The Population
A population consists of all the subjects you want to study. "Southern Baptist missionaries" is a population. So is "ministers of youth in SBC churches in Texas." So is "Christian school children in grades 3 and 4." A population comprises all the possible cases (persons, objects, events) that constitute a known whole.2
Sampling
Sampling is the process of selecting a group of subjects for a study in such a way that the individuals represent the larger group from which they were selected.3 This representative portion of a population is called a sample.4
1 Donald Ary, Lucy Cheser Jacobs, and Asghar Razavieh, Introduction to Research in Education (New York: Holt, Rinehart and Winston, Inc., 1972), 160.
2 Ibid., p. 125.
3 L. R. Gay, Educational Research: Competencies for Analysis and Application, 3rd ed. (Columbus, Ohio: Merrill Publishing Company, 1987), 101.
4 Ary et al., 125.
I: Research Fundamentals
Biased Samples
It is important that samples provide a representative cross-section of the population they supposedly represent. The sample should be a microcosm, a miniature model, of the population from which it was drawn. Otherwise, the results from the sample will be misleading when applied to the population as a whole. If I select Southern Baptist ministers as the population for my study and select Southern Baptist pastors in Fort Worth as my sample, I will have a biased sample. Fort Worth pastors may not reflect the same characteristics as ministers (including staff members) across the nation. Selecting people for a study because they are within convenient reach (members of my church, students in a nearby school, co-workers in the surrounding region) yields biased samples. Biased samples do not represent the populations from which they are drawn.
Randomization
The key to building representative samples is randomization. Randomization is the process of randomly selecting population members for a given sample, or randomly assigning subjects to one of several experimental groups, or randomly assigning experimental treatments to groups. In the context of this chapter, it is selecting subjects for a sample in such a way that every member of the population has an equal chance of being selected. By randomly selecting subjects from a population, you statistically equalize all variables simultaneously.
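Randomization of this kind is easy to express in code. A minimal sketch in Python (the population of 5,000 numbered teachers anticipates the superintendent example later in this chapter and is invented here for illustration):

```python
import random

# A hypothetical population of 5,000 numbered teachers.
population = list(range(5000))

# random.sample draws without replacement: every member of the
# population has an equal chance of appearing in the sample.
sample = random.sample(population, k=500)

assert len(sample) == 500
assert len(set(sample)) == 500   # no member selected twice
```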
Steps in Sampling
Target Population Accessible Population Size of Sample Select
Regardless of the specific type of sampling used, the steps in sampling are essentially the same: identify the target population, identify the accessible population, determine the size of the sample, and select the sample.
Accuracy
In every measurement, there are two components: the true measure of the variable and error. The error comes from incidental extraneous sources within each subject: degree of motivation, interest, mood, recent events, future expectations. All of these cause variations in test results. In all statistical analysis, the objective is to minimize error and maximize the true measure. As the sample size increases, the random extraneous errors tend to cancel each other out, leaving a better picture of the true measure of the population.
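The claim that random errors tend to cancel as the sample grows can be illustrated with a short simulation. This sketch is mine, not the author's; the "true measure" is set to 0 and each observation adds a random extraneous error:

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

def mean_error(n, trials=200):
    """Average distance between a sample mean and the true measure (0)
    when each observation is the true measure plus random error."""
    errors = []
    for _ in range(trials):
        sample = [random.gauss(0, 10) for _ in range(n)]  # error sd = 10
        errors.append(abs(statistics.mean(sample)))
    return statistics.mean(errors)

# Larger samples leave a smaller average error: the extraneous
# errors cancel each other out, exposing the true measure.
assert mean_error(1000) < mean_error(10)
```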
Cost
An increasing sample size translates directly into increasing costs, not only of money but of time as well. Just think of the difference in printing, mailing, receiving, processing, tabulating, and analyzing questionnaires for 100 subjects, and then for 1,000 subjects. The dilemma of realistically balancing accuracy (increase sample size) with cost (decrease sample size) confronts every researcher. Inaccurate data is useless, but a study which cannot be completed for lack of funds is no better. Cost per subject is directly related to the kind of study being done. Interviews are expensive in time, effort and money. Mailing out questionnaires is much less expensive per subject. Therefore, one can plan a larger sample with questionnaires than with interviews for the same cost.
Gay, 114
Other Considerations
Borg and Gall list several additional factors which influence the decision to increase the sample size (see pp. 257-261). These are:

1. When uncontrolled variables are present.
2. When you plan to break samples into subgroups.
3. When you expect high attrition of subjects.
4. When you require a high level of statistical power (see Chapter 17).
So, what is a good rule of thumb for setting sample size in a research proposal? Here are two suggestions:
Size of Population    Sampling Percent
0-100                 100%
101-1,000             10%
1,001-5,000           5%
5,001-10,000          3%
10,000+               1%
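The table can be expressed as a small lookup function. A sketch in Python (the function name and the handling of band edges are my own choices; the percentages come straight from the table):

```python
import math

def rule_of_thumb_sample_size(population_size):
    """Suggested sample size from the sampling-percent table."""
    if population_size <= 100:
        percent = 1.00
    elif population_size <= 1000:
        percent = 0.10
    elif population_size <= 5000:
        percent = 0.05
    elif population_size <= 10000:
        percent = 0.03
    else:
        percent = 0.01
    return math.ceil(population_size * percent)

# The 4,573 youth ministers in the study questions fall in the
# 1,001-5,000 band, so 5% is suggested:
assert rule_of_thumb_sample_size(4573) == 229
assert rule_of_thumb_sample_size(80) == 80   # small populations: take all
```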
Types of Sampling
Simple Systematic Stratified Cluster
There are several ways of selecting a sample. We will look at four major types here: simple random, systematic, stratified, and cluster sampling. The basic characteristic of random sampling is that all members of the population have an equal and independent chance of being included in the sample.8
Gay, 114-115
Ary, 162
Gay, 105-7
Simple Random Sampling

The superintendent in our example would draw a simple random sample as follows:

1. The population is 5,000 teachers.
2. The sample size is 10%, or 500 teachers.
3. A number from 0000 to 4999 is assigned to each of the teachers.
4. A table of random numbers is entered at an arbitrarily selected number, such as 53634 in the list below:
59058 11859 53634 48708 71710
5. Since his population has only 5,000 members, he is interested only in the last 4 digits of the number, 3634.
6. The teacher assigned #3634 is selected for the sample.
7. The next number in the column is 48708. The last four digits are 8708. No teacher is assigned #8708, since there are only 5,000. Skip this number.
8. Applying these steps to the remaining numbers shown in the column, teachers 1710, 3942, and 3278 would be added to the sample.
9. This procedure continues down this column and succeeding columns until 500 teachers have been selected.
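The table-of-random-numbers procedure can be mimicked in code: take the last four digits of each entry, skip values with no matching teacher, and skip duplicates. A sketch (the seed and helper name are invented; Python's generator stands in for the printed table):

```python
import random

random.seed(42)

def table_sample(population_size=5000, sample_size=500):
    """Simulate walking a random-number table four digits at a time."""
    chosen = []
    seen = set()
    while len(chosen) < sample_size:
        entry = random.randint(0, 99999)    # a 5-digit table entry
        last_four = entry % 10000           # keep only the last 4 digits
        if last_four >= population_size:    # e.g. 8708: no such teacher
            continue                        # skip this number
        if last_four in seen:               # already selected
            continue
        seen.add(last_four)
        chosen.append(last_four)
    return chosen

sample = table_sample()
assert len(sample) == 500
assert all(0 <= teacher < 5000 for teacher in sample)
```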
This random sample could well be expected to represent the population from which it was drawn. But it is not guaranteed. The probable does not always happen. For example, if 55% of the 5000 teachers were female and 45% male, we would expect about the same percentages in our random sample of 500. Just by chance, however, the sample might contain 30% females and 70% males. If the superintendent believed teaching level (elementary, junior high, senior high) might be a significant variable in attitude toward unions, he would not want to leave representation of these three sub-groups to chance. He would probably choose to do a stratified random sample.
Systematic Sampling
A systematic sample is one in which every Kth subject on a list is selected for inclusion in the sample.10 The K refers to the sampling interval, and may be every 3rd (K=3) or 10th (K=10) subject. The value of K is determined by dividing the population size by the sample size. Let's say that you have a list of 10,000 persons. You decide to use a sample of size 1,000. K = 10000/1000 = 10. If you choose every 10th name, you will get a sample of size 1,000. The superintendent in our example would employ systematic sampling as follows:
1. The population is 5,000 teachers.
2. The sample size is 10%, or 500 teachers.
3. The superintendent has a directory which lists all 5,000 teachers in alphabetical order.
4. The sampling interval (K) is determined by dividing the population (5,000) by the desired sample size (500). K = 5000/500 = 10.
5. A random number between 0 and 9 is selected as a starting point. Suppose the number selected is 3.
6. Beginning with the 3rd name, every 10th name is selected throughout the population of 5,000 names. Thus, teachers 3, 13, 23, 33 ... 4,993 would be chosen for the sample (Gay, pp. 113-114).
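The six steps reduce to a start point and a fixed interval. A sketch in Python (the directory of names is invented to match the example):

```python
def systematic_sample(names, k, start):
    """Every kth name, beginning at position `start`
    (positions counted from 1, as in the directory)."""
    return names[start - 1::k]

teachers = [f"teacher {i}" for i in range(1, 5001)]   # 5,000 names
sample = systematic_sample(teachers, k=10, start=3)

assert len(sample) == 500
assert sample[0] == "teacher 3"
assert sample[1] == "teacher 13"
assert sample[-1] == "teacher 4993"
```

Notice that once `start` is fixed, every later choice is determined, which is exactly the objection some writers raise against this method.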
Writers disagree on the usefulness of systematic sampling. Ary and Gay discount systematic sampling as not as good as random sampling because each selection is not independent of the others.11 Once the beginning point is established, all other choices are determined. Both writers give as an example a population which includes various nationalities. Since certain nationalities have distinctive last names that tend to group together under certain letters of the alphabet, systematic sampling can skip over
10. Gay, 112.
whole nationalities at a time. Babbie, on the other hand, states that systematic sampling is virtually identical to simple random sampling when one chooses a random starting point.12 Sax reports that systematic sampling usually leads to the same results as simple random sampling.13 There is a module on your tutorial disk that directly compares systematic sampling with simple random sampling. Use it to compare the results of the two procedures for yourself. There is one major danger with systematic sampling on which all authors agree: if there is some natural periodicity (a repeating pattern) within the list, the systematic sample will produce estimates which are seriously in error.14 If this condition exists, the researcher can do one of two things. He can use simple random sampling on the list as it exists, or he can randomly order the list and then use systematic sampling.
Stratified Sampling
Stratified sampling permits the researcher to identify sub-groups within a population and create a sample which mirrors these sub-groups by randomly choosing subjects from each stratum. Such a sample is more representative of the population across these sub-groups than a simple random sample would be.15 Subgroups in the sample can either be of equal size or proportional to the population in size. Equal size sample subgroups are formed by randomly selecting the same number of subjects from each population subgroup. Proportional subgroups are formed by selecting subjects so that the subgroup percentages in the population are reflected in the sample. The following example is a proportionally stratified sample. The superintendent would follow these steps to create a stratified sample of his 5,000 teachers.16
1. The population is 5,000 teachers.
2. The desired sample size is 10%, or 500 teachers.
3. The variable of interest is teaching level. There are three subgroups: elementary, junior high, and senior high.
4. Classify the 5,000 teachers into the subgroups. In this case, 65% or 3,250 are elementary teachers, 20% or 1,000 are junior high teachers, and 15% or 750 are senior high teachers.
5. The superintendent wants 500 teachers in the sample. So 65% of the sample (325 teachers) should be elementary, 20% (100) should be junior high teachers, and 15% (75) should be senior high teachers. This is a proportionally stratified sample. (A non-proportionally stratified sample would randomly select 167 subjects from each of the three groups.)
6. The superintendent now has a sample of 500 (325+100+75) teachers, which is representative of the 5,000 and which reflects proportionally each teaching level.
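The proportional allocation in steps 4 and 5 can be computed directly: multiply each subgroup's share of the population by the desired sample size, then sample randomly within each stratum. A sketch (the rosters are invented to match the example's 3,250 / 1,000 / 750 split):

```python
import random

random.seed(7)

strata = {
    "elementary":  [f"elem {i}" for i in range(3250)],   # 65%
    "junior high": [f"jr {i}" for i in range(1000)],     # 20%
    "senior high": [f"sr {i}" for i in range(750)],      # 15%
}
population_size = sum(len(roster) for roster in strata.values())  # 5,000
sample_size = 500

sample = []
for level, roster in strata.items():
    share = round(sample_size * len(roster) / population_size)
    sample.extend(random.sample(roster, share))  # random within the stratum

assert len(sample) == 500   # 325 + 100 + 75
```

For these round numbers the shares come out exact; with messier populations the rounded shares may need a one- or two-subject adjustment to hit the target sample size.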
Cluster Sampling
Cluster sampling involves randomly selecting groups, not individuals. It is often impossible to obtain a list of individuals which make up a target population. Suppose
12. Earl Babbie, The Practice of Social Research, 3rd ed. (Belmont, CA: Wadsworth Publishing Company, 1983), 163.
13. Gilbert Sax, Foundations of Educational Research (Englewood Cliffs, NJ: Prentice-Hall, 1979), 191.
14. Gilbert Churchill, Marketing Research: Methodological Foundations, 2nd ed. (Hinsdale, IL: The Dryden Press, 1979), 328.
15. Ary and others, 164; Babbie, 164-165; Borg and Gall, 248-249; Sax, 185-190.
16. Gay, 107-109.
a researcher is interested in surveying the residents of Fort Worth. Through cluster sampling, he would randomly select a number of city blocks and then survey every person in the selected blocks. Or suppose another researcher wants to study the social skills of Southern Baptist church staff members. No list exists which contains the names of all church staff members. But he could randomly select churches in the Convention and use all the staff members of the selected churches. Any intact group with similar characteristics is a cluster. Other examples of clusters include classrooms, schools, hospitals, and counseling centers. Let's apply this approach to the superintendent's study.
1. The population is 5,000 teachers.
2. The sample size is 10%, or 500 teachers.
3. The logical cluster is the school.
4. The superintendent has a list of 100 schools in the district.
5. Although the clusters vary in size, there are an average of 50 teachers per school.
6. The required number of clusters is obtained by dividing the sample size (500) by the average size of cluster (50). Thus, the number of clusters needed is 500/50 = 10 schools.
7. The superintendent randomly selects 10 schools out of the 100.
8. Every teacher in the selected schools is included in the sample.
In this way, the interviewer can conduct interviews with all the teachers in ten locations, and save traveling to as many as 100 schools in the district.17 There are drawbacks to cluster sampling. First, a sample made up of clusters may be less representative than one selected through random sampling.18 Only ten schools out of 100 are used in our example. These ten may well be different from the other ninety. Using a larger sample size, say, 25 schools rather than 10, reduces this problem. A second drawback is that commonly used inferential statistics are not appropriate for analyzing data from a study using cluster sampling.19 The statistical procedures we will be studying require random sampling.20
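The cluster procedure in steps 6 through 8 might be sketched like this (the school rosters are invented, each with exactly 50 teachers so the arithmetic matches the example):

```python
import random

random.seed(3)

# 100 schools of 50 teachers each (the example's average).
schools = {s: [f"school {s} teacher {t}" for t in range(50)]
           for s in range(100)}

sample_size = 500
average_cluster_size = 50
clusters_needed = sample_size // average_cluster_size     # 500 / 50 = 10

selected = random.sample(list(schools), clusters_needed)  # pick 10 schools
sample = [t for school in selected for t in schools[school]]  # every teacher

assert clusters_needed == 10
assert len(sample) == 500
```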
Oral Histories
Oral histories involve extensive first-person interviewing of a single individual. Dissertations have been written on the lives of J. M. Price and Joe Davis Heacock, former deans of the School of Religious Education, using this approach.
Situational Analysis
An event is studied from the perspective of the participants involved. For example, a staff member is summarily fired from a church staff by the pastor. Interviews with the staff member and family, staff colleagues, the pastor, church leaders, and selected church members would be conducted. When all the views are synthesized, an in-depth understanding of the event can be produced.
suffering from the problem. "Depression in the Ministry: A Case Study of Twenty Ministers of Education."
Summary
In this chapter you have learned about sampling techniques that allow you to select and study a small representative group of subjects (the sample) and infer findings to the larger group (the population). You have been given a rationale for sampling, the place of randomization in sampling, the steps of sampling, four types of sampling, and a look at the case study approach.
Vocabulary
accessible population: subjects available for sampling (e.g. mailing list)
attrition: loss of subjects during a study
biased sample: subjects selected in non-random manner (e.g. 3rd grade classes at school)
case study approach: in-depth study of individual subject or institution
cluster sampling: selecting subjects by randomly choosing groups (e.g. city blocks or churches)
error: difference between the measurement of a variable and its true value
estimated parameters: mean and standard deviation of population computed from sample statistics
population parameters: mean and standard deviation of population measured directly
randomization: selecting subjects so that each population member has equal chance of being selected
sample: a (smaller) group of subjects which represents a (larger) population
sample size: the number of subjects in a sample (symbolized by N or n)
sample statistics: mean and standard deviation of a sample (not useful in themselves)
sampling error: source of the discrepancy between sample statistics and population parameters
sampling: process of selecting a representative sample from a population
simple random sampling: drawing subjects by random number (e.g. names out of a hat)
statistical power: the probability that a statistic will declare a difference "significant"
stratified sampling: selecting subjects at random from population strata (e.g. male, female)
systematic sampling: selecting every kth subject from a list (e.g. every 10th person in 1000 = 100 subjects)
target population: population of interest to your study (e.g. single adults)
true measure: the true value of a variable (no error)
Study Questions
1. Define target population, accessible population, and sample.
2. Explain why sampling is an important part of research.
3. List and describe four types of sampling.
4. Explain why randomization is important in sampling.
5. You want to study youth ministers' attitudes toward small group Bible study. You have identified 4,573 youth ministers. Using the rule-of-thumb estimate for sampling, how many youth ministers should you select for your study?
Chapter 8

Collecting Dependable Data
Validity Reliability Objectivity
We have discussed variables and problems, hypotheses and purposes, populations and samples. The theoretical foundation of your study must sooner or later yield to concrete action: the collection of real pieces of data. The tools used to collect data are called instruments. An instrument may be an observation checklist, a questionnaire, an interview guide, a test or attitude scale. It may be a video camera or cassette recorder. An instrument is any device used to observe and record the characteristics of a variable. Before you can accurately measure the stated variables of your study, you must translate those variables into measurable forms. This is done by operationally defining the variables of your study (Chapter 3). Data collection is meaningless without a clearly operationalized set of variables. The second step is to ensure that the selected instrument accurately measures the variables you've selected. The naive researcher rushes past the instrument selection or development phase in order to collect data. The result is faulty, error-filled data -- which yields faulty conclusions. The accuracy of the instrument used in your study is an important factor in the usefulness of your results. If the data is incomplete or inadequate, the study is destined for failure. A wonderful design and precise analysis yields useless results if the data quality is poor. So carefully design or select the instrument you will use to collect data. Three characteristics -- "the Great Triad" -- determine the precision with which an instrument collects data. The Great Triad consists of (1) validity: Does the instrument measure what it says it measures? (2) reliability: Does the instrument measure accurately and consistently? and (3) objectivity: Is the instrument immune to the personal attitudes and opinions of the researcher?
Validity

4th ed. 2006 Dr. Rick Yount

The term validity refers to the ability of research instruments to measure what they say they measure. A valid instrument measures what it purports to measure. A 12-inch ruler is a valid instrument for measuring length. It is not a valid instrument for measuring I.Q., or a quantity of a liquid, or an amount of steam pressure. These require an I.Q. test, a measuring cup, and a pressure gauge. Let's say a student wants to measure the variable "spiritual maturity," and operationally defines it as the number of times a subject attended Sunday School out of the past 52 Sundays. The question we should ask is whether attendance count in Sunday School is a valid measure of spiritual maturity: does attendance really measure spiritual maturity? Can one attend Sunday School and be spiritually immature? (Yes: for coffee, fellowship and business contacts.) Can one be spiritually mature and not attend Sunday School? (Yes: pastors usually use this time for pastoral work.) If either of these questions can be answered yes (and they can), then the measure is not a valid one. There are four kinds of instrument validity: content, concurrent, predictive, and construct. Each of these has a specific meaning and helps establish the nature of valid instruments.
Content Validity
The content validity of a research instrument represents the extent to which the items in the instrument match the behavior, skill, or effect the researcher intends them to measure.1 In other words, a test has content validity if the items actually measure mastery of the content for which the test was developed. Tests which ask questions over material not covered by objectives or study guidelines, or draw from other fields besides the one being tested, violate this kind of validity. Content validity is different from face validity, which is a subjective judgement that a test appears to be valid. Researchers establish content validity for their instruments by submitting a long list of items (such as statements or questions) to a validation panel. Such a validation panel consists of six to ten persons who are considered experts in the field of study for which the instrument is being developed. The panel judges the clarity and meaningfulness of each of the items by means of a 4- or 6-point rating scale. Compute the means and standard deviations (see Chapter 16) for each of the items. Select the items with the highest mean and lowest standard deviation on "meaningfulness" and "clarity" to be included in your instrument. In summary, content validity asks the question, How closely does the instrument reflect the material over which it gathers data? Content validity is especially important in achievement testing.
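The panel procedure, computing a mean and standard deviation for each item and keeping the items the panel rates high and rates consistently, might be sketched like this (the ratings and the cutoffs are invented):

```python
import statistics

# Hypothetical panel ratings (1-6 scale), one list per candidate item,
# one rating per panel member.
ratings = {
    "item A": [6, 5, 6, 6, 5, 6],   # high mean, low spread: keep
    "item B": [2, 3, 2, 4, 3, 2],   # low mean: drop
    "item C": [6, 1, 6, 2, 5, 1],   # panel disagrees (high spread): drop
}

# Keep items with a high mean and a low standard deviation.
kept = [item for item, scores in ratings.items()
        if statistics.mean(scores) >= 5.0 and statistics.stdev(scores) <= 1.0]

assert kept == ["item A"]
```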
Predictive Validity
The predictive validity of a research instrument represents the extent to which the test's results predict such things as later achievement or job success. It is the degree to which the predictions made by a test are confirmed by the later success of the subjects. Suppose I developed a Research and Statistics Aptitude Test to be given to students at the beginning of the semester. If I correlated these test scores of incoming students with their final grade in the course, I could use the test as a predictor of success in the course. In this example, the Research and Statistics Test provides the predictor measures and the final course grade is the criterion by which the aptitude test is analyzed for validity. In predictive validity, the criterion scores are gathered some time after the predictor scores. The Graduate Record Examination (GRE) is taken
1. Merriam, 140.
by college students and supposedly predicts which of its users will succeed in (future) doctoral level studies. Predictive validity asks the question, How closely does the instrument reflect the later performance it seeks to predict?
Concurrent Validity
Concurrent validity represents the extent to which a (usually smaller, easier, newer) test reflects the same results as a (usually larger, more difficult, established) test. The established test is the criterion, the benchmark, for the newer, more efficient test. Strong concurrent validity means that the smaller, easier test provides data as well as the larger, more difficult one. A popular personality test, the Minnesota Multiphasic Personality Inventory (MMPI), once had only one form, consisting of about 550 questions. The test required several hours to administer. In order to reduce client frustration, a newer short-form version was developed which contained about 350 questions. Analysis revealed that the shorter form had high concurrent validity with the longer form. That is, psychologists found the same results with the shorter form as with the long form, while also reducing patient frustration and administration time. A researcher wanted to determine whether anxious college students showed more preference for female role behaviors than less anxious students. To identify contrasting groups of anxious and non-anxious students, she could have had a large number of students evaluated for clinical signs of anxiety by experienced clinical psychologists. However, she was able to locate a quick, objective test, the Taylor Manifest Anxiety Scale, which has been demonstrated to have high concurrent validity with clinical ratings of anxiety in a college population. She saved considerable time conducting the research project by substituting this quick, objective measure for a procedure that is time-consuming and subject to personal error.1 Concurrent validity asks the question, How closely does this instrument reflect the criterion established by another (usually more complex or costly) validated instrument?
Construct Validity
Construct validity reflects the extent to which a research instrument measures some abstract or hypothetical construct.2 Psychological concepts, such as intelligence, anxiety, and creativity are considered hypothetical constructs because they are not directly observable -- they are inferred on the basis of their observable effects on behavior.3 In order to gather evidence on construct validity, the test developer often starts by setting up hypotheses about the differentiating characteristics of persons who obtain high and low scores on the measure. Suppose, for example, that a test developer publishes a test that he claims is a measure of anxiety. How can one determine whether the test does in fact measure the construct of anxiety? One approach might be to determine whether the test differentiates between psychiatric and normal groups, since theorists have hypothesized that anxiety plays a substantial role in psychopathology. If the test does in fact differentiate the two groups, then we have some evidence that it measures the construct of anxiety.4 Construct validity asks the question, How closely does this instrument reflect the hypothetical construct it claims to measure?
1. Borg and Gall, 279.
2. A construct is a theoretical explanation of an attribute or characteristic created by scholars for purposes of study. Merriam, 141.
3. Borg and Gall, 280.
4. Ibid.
5. Sax, 206.
6. Ary et al., 200.
7. David Payne, The Assessment of Learning: Cognitive and Affective (Lexington, Mass.: D.C. Heath and Company, 1974), 259.
Reliability
Stability Consistency Equivalence
Reliability is the extent to which measurements reflect true individual differences among examinees.5 It is the degree of consistency with which [an instrument] measures what it is measuring.6 The higher the reliability of an instrument, the less influenced it is by random, unsystematic factors.7 In other words, is an instrument confounded by the smoke and noise of human characteristics, or can it measure the true substance of those variables? Does the instrument measure accurately, or is there extraneous error in the measurements? Do the scores produced by a test remain stable over time, or do we get a different score every time we administer the test to the same sample? There are three important measures of reliability. These are the coefficients of stability, internal consistency, and equivalence. All three use a correlation coefficient to express the strength of the measure. We will study the correlation coefficient in detail when we get to Chapter 22. For the time being, we will merely state that a reliability coefficient can vary from 0.00 (no reliability) to +1.00 (perfect reliability, which is never attained). A coefficient of 0.80 or higher is considered very good.
Coefficient of Stability
The coefficient of stability, also called test-retest reliability,9 measures how consistent scores remain over time. The test is given once, and then given to the same group at a later time, usually several weeks. A correlation coefficient is computed between the two sets of scores to produce the stability coefficient. The greatest problem with this measure of reliability is determining how much delay to use between the tests. If the delay is too short, then subjects will remember their previous answers and the reliability coefficient will be higher than it should be. If the delay is too long, then subjects may actually change in the interval. They will answer differently, but the difference is due to a change in the subject, not in the test. This will yield a coefficient lower than it should be.10 Still, science does best with consistent, stable, repeatable phenomena, and the stability of responses to a test is a good indicator of the stability of the variable being measured.
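The coefficient of stability is simply the correlation between the two administrations. A sketch with a hand-rolled Pearson r (the scores are invented; Chapter 22 develops the coefficient itself):

```python
import math
import statistics

def pearson_r(xs, ys):
    """Pearson product-moment correlation coefficient."""
    mx, my = statistics.mean(xs), statistics.mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# The same five subjects tested twice, several weeks apart.
first  = [85, 72, 90, 64, 78]
second = [83, 70, 92, 66, 75]

stability = pearson_r(first, second)
assert stability > 0.80   # a very good coefficient of stability
```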
10. Ibid., 284.
11. Ibid., 284-5.
12. Ibid., 285.
The Spearman-Brown prophecy formula is applied to the computed correlation between the two halves. If r = 0.60, then the formula yields r' = 0.75. Another measure of internal consistency can be obtained by the use of the Kuder-Richardson formulas. The most popular of the formulas are known as K-R 20 and K-R 21. The K-R 20 formula is considered by many specialists in education and psychology to be the most satisfactory method for determining test reliability. The K-R 21 formula is a simplified approximation of the K-R 20, and provides an easy method for determining a reliability coefficient. It requires much less time to apply than K-R 20 and is appropriate for the analysis of teacher-made tests and experimental tests written by a researcher which are scored dichotomously.13 (A dichotomous variable is one which has two and only two responses: yes-no, true-false, on-off.) Cronbach's Coefficient Alpha is a general form of the K-R 20 and can be applied to multiple choice and essay exams. Coefficient Alpha compares the sum of the variances for each item with the total variance for all items taken together. If there is high internal consistency, coefficient alpha produces a strong positive correlation coefficient.
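The worked figure above comes from the Spearman-Brown prophecy formula, r' = 2r / (1 + r), and coefficient alpha can be computed directly from the item variances, as just described. A sketch (the item-score matrix is invented):

```python
import statistics

def spearman_brown(half_r):
    """Step a split-half correlation up to whole-test reliability."""
    return 2 * half_r / (1 + half_r)

assert abs(spearman_brown(0.60) - 0.75) < 1e-9   # the example in the text

def cronbach_alpha(item_scores):
    """item_scores: one list of scores per item, aligned by subject."""
    k = len(item_scores)
    item_vars = sum(statistics.variance(item) for item in item_scores)
    totals = [sum(subject) for subject in zip(*item_scores)]
    return (k / (k - 1)) * (1 - item_vars / statistics.variance(totals))

# Three hypothetical items answered by four subjects.
items = [[4, 3, 5, 2],
         [5, 3, 4, 2],
         [4, 2, 5, 1]]
alpha = cronbach_alpha(items)
assert 0.0 < alpha <= 1.0   # here the items hang together well
```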
Coefficient of Equivalence
A third type of reliability is the coefficient of equivalence, sometimes called parallel forms, or alternate-form reliability. It can be applied any time one has two or more parallel forms (different versions) of the same test.14 One can administer both forms to the same group at one sitting, or with a short delay between sittings. A correlation coefficient is then computed on the two sets of parallel scores. A common use of this type of reliability is in a pretest-posttest research setting. If the researcher uses the same test on both testing occasions, he cannot know how much of the gain in scores is due to the treatment and how much is due to subjects remembering their answers from the first test. If one has two parallel forms of the same exam, and the coefficient of equivalence is high, one can use one form as the pretest and the other as the posttest.
The maximum validity of a test is equal to the square root of its reliability.18 Therefore, test validity is dependent upon test reliability.
Payne and Babbie would hold that an instrument can be unreliable and still be valid. A yardstick made out of rubber or a measuring tape made out of yarn is a valid instrument for measuring length, even though its measurements would not be accurate. Bell, Sax and Nunnally would say a tape measure made of yarn is not valid if it cannot produce reliable measurements. McCallon demonstrates the boundary condition of Vmax = √R. In the final analysis, whether we are aiming a rifle or designing a research instrument, our goal should be to get a tight cluster in the bull's-eye. Use instruments which demonstrate the ability to collect data with high validity and high reliability.
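The boundary condition Payne, Sax, Nunnally, and McCallon are debating, that a test's maximum validity is the square root of its reliability, is easy to compute (a small sketch; the reliability values are invented):

```python
import math

def max_validity(reliability):
    """Upper bound on validity implied by a given reliability."""
    return math.sqrt(reliability)

# Even a very good reliability of 0.81 caps validity at 0.90;
# an unreliable instrument (0.25) cannot be valid beyond 0.50.
assert max_validity(0.25) == 0.5
assert abs(max_validity(0.81) - 0.90) < 1e-9
```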
Objectivity
The third characteristic of good instruments is objectivity. Objectivity is the extent to which equally competent scorers get the same results. If interviewers A and B interview the same subject and produce different data sets for him, then it is clear that
17 Jum Nunnally, Educational Measurement and Evaluation, 2nd ed. (New York: McGraw-Hill Book Company, 1972), 98-99 18 Class notes, Research Seminar, Spring 1983 19 Payne, 259 20 Ibid., 254
the measurement is subjective.22 Something about the subject is hooking the interviewers differently. The difference is not in the subject, but in the interviewers. A pilot study which uses the researcher's instrument with subjects similar to those targeted for the study will demonstrate whether it is objective or not. This is particularly important in interview or observation type studies in which human subjectivity can distort the data being gathered. The validation panel described under validity also helps the researcher create an objective test. All items in an item bank should be as clear and meaningful as the researcher can make them. But after the validation panel has evaluated and rated them, the best of the items can be selected for the instrument. This will filter out much of the researcher's own biases. An illustration of the objective-subjective tension in instruments is the difference between essay and objective tests. The difference in grades produced on essay tests can be more related to the mood of the grader than the knowledge of the student. A well-written objective test avoids this problem because the answer to every question is definitively right or wrong. Whether you are planning to use an interview guide, an observation checklist, an attitude scale, or a test, you must work carefully to ensure that the data you gather reflects the real world as it is, and not as you want it to be.
Summary
The first element of the Great Triad is validity. The four types of validity (content, predictive, concurrent, and construct) focus on how well an instrument measures what it purports to measure. The fifth type, face validity, is nothing more than a subjective judgment on the part of the researcher and should not be used as a basis for validating instruments. The second element of the Great Triad is reliability. The three approaches to reliability (stability, internal consistency, and equivalence) focus on how accurate the gathered data is. The third element of the Great Triad is objectivity, which concerns the extent to which data is free from the subjective characteristics of the researchers.
RELIABLE: it says what it says accurately and consistently, and OBJECTIVE: it says what it says without subjective distortion or personal bias
Vocabulary
coefficient of stability: measure of steadiness, or sameness, of scores over time
concurrent validity: degree a new (easier?) test produces same results as older (harder?) test
construct validity: degree to which test actually measures specified variable (e.g. intelligence)
21 Babbie, 118
22 Sax, 238
content validity: degree to which test measures course content
Cronbach's coefficient: measure of internal consistency of a test
coefficient of equivalence: measure of sameness of two forms of a test
face validity: degree a test looks as if it measures stated content
coefficient of internal consistency: degree each item in a test contributes to the total score
Kuder-Richardson formulas: measures of internal consistency
objectivity: the degree that data is not influenced by subjective factors in researchers
parallel forms: tests used to establish equivalence
predictive validity: degree test measures some future behavior
reliability: degree a test measures variables accurately and consistently
Spearman-Brown prophecy formula: used to adjust the r value computed in split-half test
split half test: procedure used to establish internal consistency
test-retest: test given twice over time to establish stability of measures
validity: degree a test measures what it purports to measure
Study Questions
1. Define the terms instrument, validity, reliability, and objectivity.
2. Discuss the relationship between an operational definition and the procedures for collecting data.
3. Of these three essentials of research, which is most important: clear research design, accurate measurement, or precise statistical analysis? Why?
Chapter 9
Observation
9
Observation
The Problem The Obstacles Practical Suggestions
In a sense, all scientific research involves observation of one kind or another. This is what empiricism means (review Chapter One if needed). But in this chapter we focus on observation as one specific research technique among many. In this sense, the term observation means looking at something without influencing it and simultaneously recording it for later analysis.1

In observational research, we do not deal with what people want us to know (self-report measures) or with what some test writer believes he knows (tests and scales). Rather, we deal with actual people in real situations. People are seen in action. As such, observation is the most basic of techniques. The researcher, pad in hand, carefully observes selected subjects in order to quantify the variables he is interested in.

Deciding what to observe and whom to observe has been discussed in more general ways. Here we will look at how to record what is seen, and what mode of observation to use. Before we move to practical steps in doing observational research, we must first consider the biggest problem in observational research. That problem is, quite simply, the human being who does the observing.
1 June True, Finding Out: Conducting and Evaluating Social Research (Belmont, CA: Wadsworth, 1983), 159
Two people watch a prominent television evangelist preach for ten minutes. One responds, "What courageous leadership! What a man of God!" The other responds, "What a con man! He sure can manipulate people!" The difference in the data is in the observers, not in the evangelist. More data is needed to determine which of these two pictures is more correct.
These two examples illustrate inference, an enemy of valid and reliable data. When an observer infers motive to observed action, he adds something of himself to the data. Such data is distorted, invalid, and unreliable.

A second enemy is interference. The very presence of the observer can affect the behavior of the people being observed. Tell a Sunday School teacher you'll be visiting his class next Sunday, and you can expect a marked improvement in the preparation of the lesson. This factor is also the rationale for using undercover agents to infiltrate and observe criminal behavior as it really is. The presence of a uniformed police officer would certainly interfere with the criminal behavior.
Obstacles to objectivity in collecting data in observation research include personal interest, early decision, and personal characteristics.
Personal Interest
I see what I want to see. I once had a lady church member who insisted that we never elect a divorced person as a Sunday School teacher. She quoted scripture and produced one reason after another why divorced persons would be the ruin of the church -- until her own daughter got a divorce. Not three weeks later, this same lady was in my office, quoting scripture and complaining of how the church does not care about divorced people -- that we needed to give them opportunities for service; after all, they're people too! The scripture had not changed, but she certainly had, because of her personal experience.

We always have a personal interest in any study we conduct. If we did not, the process of giving birth to a research plan might be unbearable. But our personal interest should be directed toward collecting objective facts, not proving preconceived notions. If the study is intended from its inception to substantiate what you already believe, you will have difficulty seeing anything that contradicts this perspective. This is called selective observation, or, as we have noted, "I see what I want to see."
Early decision
It is part of the reality of human perception that we naturally and automatically fill in the gaps of what we know to be true. We add elements from our own imagination to make situations reasonable. The problem with this is that we can be deceived by our own imagination into creating a situation that does not exist in reality. When we have too few factual observations, we tend to fill in too much. This is the psychological basis of gossip: filling in the gaps between known data points with what we subjectively feel. The researcher needs a large number of objective data points from which to develop a theoretical pattern. By ending the observation phase prematurely, the researcher may interpret the data incorrectly. "I've seen enough. I can see the trend." But the trend may be an incorrect extrapolation from the facts.
Personal characteristics
Many of the things that characterize us as human pose difficulties in the observation process: emotions, prejudices, values, physical condition. We can unknowingly make a faulty inference because of the subjective influence of one or more of these personal characteristics. They may be difficult to identify.2 Whatever we study, we must make every effort to ensure that our data reflects that which we study and not ourselves. Objective observation checklists can help remove our personal biases and lack of neutrality concerning the chosen subject.
Definition
Observation is the act of looking at something without influencing it and recording the scene or action for later analysis.
Familiar Groups
Positively, studying a familiar group permits the use of previous experience with the group and established understanding of the subjects. Negatively, this very previous experience reduces the objectivity of the study. Further, revelation of discoveries within a familiar group can be perceived by group members as a betrayal of trust. For example, a minister on a large church staff decides to study "interpersonal conflict in local church ministry," using his position as a platform for observation of staff meeting discussions. While his existing relationship with the staff (and further, the level of trust he enjoys with staff colleagues) will encourage more realistic behaviors, revelation of those behaviors through his study may well end his relationships!
Unfamiliar Groups
Positively, studying an unfamiliar group reduces the effects of group identification and bias. In addition, observers notice things that insiders overlook. Unfamiliarity with the group improves objectivity in the data. Negatively, observers face problems in gaining access to unfamiliar groups, and, once involved, may have difficulty in understanding member actions within the group.
Observational Limits
Observation is an intensively human process. It is a fact that observers simply cannot study some people. Factors such as gender, age, race, appearance, religious denomination, or political affiliation of observers may prevent access to some groups of subjects. These are just six of many possible barriers to observation.
2 Hopkins, 81
3 True, 175-176
Written recording can be simplified by using shorthand or tallies on observation checklists. Mechanical recording makes an exact record of all the data, but does nothing to simplify or reduce the bulk of the observations. Observational episodes must be analyzed at a later time.
Interviewer Effect
Observation is an intensely human process! If subjects see observers taking notes, they may well change their behavior (interviewer effect is increased). Recording data surreptitiously decreases interviewer effect, but can be an invasion of privacy!
Debrief Immediately
Write-ups of observation sessions have to be made promptly because observers -- being human! -- may selectively forget details or unintentionally distort observations. Waiting until after the observational session is over to record responses greatly increases the likelihood that observer subjectivity will influence the data.
Participant Observation
(Compare "Familiar Groups.") Positively, participant observers (i.e., observers who are members of the groups they observe) have easier access and gain a truer picture of group behavior. Negatively, participant observers are restricted to one role within the group and are more partial in their observations than non-participant observers.
Non-participant Observation
(Compare "Unfamiliar Groups"). Positively, non-participant observers have a clearer, less biased perspective on group behavior. Negatively, the presence of a known (non-member) observer alters the behavior of subjects, especially at the beginning of the study. Failure to announce the purpose for an observer being present in the group may be unethical.
Observational Checklist
An observational checklist is a structure for observation, and allows observers to record behaviors during sessions quickly, accurately, and with minimal interviewer effect on behaviors. Dr. Mark Cook developed an observer consistency checklist for use in his study on active participation as a teaching strategy in adult Sunday School classes.4 He described his instrument this way:
The observer consistency checklist was developed to be used by trained observers in examining each teaching situation for consistency across treatments. It was imperative in this study that all other elements in the lesson plan and teaching environment be held constant while allowing active participation to be the independent variable. This evaluation form included (a) a checklist of teacher factors (such as any unusual enthusiasm or behaviors), student factors (such as unusual interruptions or group behaviors), and unusual external factors (outside interruptions, weather, or equipment problems); (b) frequency counts of the number of external interruptions, disruptions by students, departures from the lesson, and active participation; (c) a five-point rating of teacher enthusiasm; and (d) a record of the time span of the lesson.5
4 Cook, 21
5 Ibid., 22
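Cook's checklist reduces a session to checkmark tallies, frequency counts, a five-point rating, and a time span. As a loose sketch (the field names below are our own invention, not Cook's actual coding scheme), one session's data might be recorded like this:

```python
# Invented field names, not Cook's actual coding scheme: one observed
# session reduced to checkmark tallies, counts, a rating, and a time span.
session = {
    "external_factors": {"outside_interruptions": 2, "equipment_problems": 0},
    "student_factors": {"student_interruptions": 1},
    "teacher_enthusiasm": 4,   # five-point rating, 1-5
    "lesson_minutes": 47,      # time span of the lesson
}

# Frequency counts turn raw observation into quantifiable data points.
total_interruptions = (session["external_factors"]["outside_interruptions"]
                       + session["student_factors"]["student_interruptions"])
print(total_interruptions)  # 3
```

Structuring each session the same way is what lets the researcher compare sessions across treatments instead of relying on impressions.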
Summary
The fundamental data gathering technique in science is observation. In this chapter we looked at the obstacles facing one who plans to do an observational study, as well as practical suggestions to help you plan an effective study.
Vocabulary
inference: researcher infers motivation behind observed behavior
interference: researcher changes observed behavior by his/her presence
interviewer effect: potential bias in data due to subjective factors in interviewers
observation: gathering data by way of objective observation of behavior
Study Questions
1. Define observation research.
2. Define in your own words the terms inference and interference as they relate to enemies of valid data. Give an original example of each term.
3. Explain how our humanness is a liability in observational research.
APPENDIX A6

OBSERVER CONSISTENCY CHECKLIST

Date: _______________________   Observer: ___________________
Time: _______________________   Teacher: ____________________
Observer Instructions: Place a checkmark for each episode of the following factors. Memo the significant events or factors under the comment section at the bottom of the form.

OBSERVED FACTORS                             ACTIVE LESSON   NONACTIVE LESSON
EXTERNAL FACTORS
  Interruptions from outside class               _____            _____
  Unusual weather                                _____            _____
  Equipment problems                             _____            _____
  Any other external factors                     _____            _____
STUDENT FACTORS
  Students' experiences affect lesson            _____            _____
  Student interruptions                          _____            _____
  Hostile environment                            _____            _____
  Unusual group behavior                         _____            _____
TEACHER FACTORS
  Teacher experience affects lesson              _____            _____
  Unusual teacher enthusiasm                     _____            _____
  Unusual teacher behavior                       _____            _____
  Different teaching style                       _____            _____
  Variation from lesson plan                     _____            _____
  Gave test answers                              _____            _____
  Use of active participation                    _____            _____
  Level of teacher enthusiasm (Scale: 1-5)       _____            _____
Time of lesson (record in minutes)               _____            _____
Attendance in the class                          _____            _____
6 Cook, 61
Chapter 10
Survey Research
10
Survey Research
The Questionnaire The Interview Developing a Survey Instrument
Survey research uses questioning as a strategy to elicit information from subjects in order to determine the characteristics of selected populations on one or more variables.1 A written survey is called a questionnaire; an oral survey is called an interview. Although they serve similar purposes in gaining information, each provides unique advantages and disadvantages to the researcher.
The Questionnaire
The mailed questionnaire has been heavily criticized in recent times and has fallen into disfavor as a device for gathering data. But it has been the abuse and misuse of this technique that has drawn the criticism, not the nature of the questionnaire itself.2 Hastily constructed questionnaires, consisting of poorly worded questions, produce unreliable information at best and invalid results at worst. A planned, well-constructed questionnaire can obtain information that is obtainable in no other way.
Advantages
A questionnaire provides researchers several advantages over the interview.
Remote Subjects Influence Cost Reliability Convenience
Remote subjects
A questionnaire allows researchers to gather data from any part of the world. Through the use of existing postal systems, or, more recently, the internet, contact can be made with almost any literate population of interest. As a result, subjects can be randomly selected from wide-ranging populations, such as Southern Baptists in America.
Researcher influence
The standardized wording of a printed questionnaire reduces researcher interference in subject responses. The researcher's gender, appearance, mannerisms, social skills, and the like have no effect on how subjects respond to the questions.
Cost
Even with the high cost of postage, the mailed questionnaire is still the most
1 Gay, 191
2 Hopkins, 145
economical means, per subject, for gathering data. The economy of process allows researchers to increase the number of subjects in the study. Increased sample size provides more accurate estimates of population characteristics. Not only does the questionnaire save money directly, it also saves time. Consider the difference in processing time between mailing out 1,000 questionnaires and interviewing 1,000 subjects.

Dr. Jay Sedgwick of Dallas Theological Seminary (Southwestern Ph.D. graduate, 2003) analyzed differences in costs and data quality among three data collection techniques. He investigated direct collection from conference participants, e-mail responses to a website, and a traditional mailed survey. Conventional wisdom suggested that e-mail would provide quality data at greatly reduced cost. He found this not to be the case. Direct collection can be frustrated by restrictions imposed by conference leaders. The return rate was lowest among e-mail recipients -- and their responses provided the least reliability. The mailed survey was the most expensive, but provided the best return rate and quality of data.
Reliability
The standardized wording and structured questions of the questionnaire provide higher reliability in the data than can practically be obtained by interview.
Subjects' convenience
The questionnaire is completed at the subject's convenience. Subjects can consider each question, check necessary records, and reflect on their answers. Data is more valid under these conditions than when answers are given "on the spot" in an interview.
Disadvantages
Rate of Return Inflexibility Motivation Limited data Loss of control
There are disadvantages in using a mailed questionnaire that are overcome by the interview. These include the questionnaire's rate of return, its inflexible structure, the level of subject motivation, the limitation of not observing the subject as questions are answered, and the loss of control over the questioning process.
Rate of return
The biggest drawback in using questionnaires is the rate of return of the completed forms. Let me illustrate. You have drawn a representative sample from which to collect data. But when the questionnaires stop coming in, you find that only 35% of the sample responded. Why did 65% not respond? Are they different in some systematic way from the 35% who did? Does this have a bearing on your variables? You have no way of knowing. And this is a confounding variable (a source of error) in your study. Therefore, valid mail surveys have extensive follow-up procedures to produce the largest possible rate of return. How large? Some texts say 50%, some 60%. We suggest that doctoral students gathering data for their dissertation aim to get a 70% response rate or better. The return rate is computed as a percentage as follows:4
4 Hopkins, 148
rate = (NR / (NS - ND)) x 100

where NR is the number of completed forms returned, NS the number sent out, and ND the number unable to be delivered (returned to sender). For example, if you send out 180 questionnaires, and have 10 undelivered and 150 returned, your return rate is
rate = (150 / (180 - 10)) x 100
     = (150 / 170) x 100
     = (.882353) x 100
     = 88.24%
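The return-rate computation above can be expressed in a few lines of code. This is a minimal sketch (the function name is ours, not the text's):

```python
def return_rate(returned, sent, undeliverable):
    """Questionnaire return rate as a percentage of deliverable forms."""
    deliverable = sent - undeliverable
    return 100 * returned / deliverable

# The chapter's example: 180 sent, 10 undeliverable, 150 returned.
rate = return_rate(returned=150, sent=180, undeliverable=10)
print(round(rate, 2))  # 88.24
```

Note that undeliverable forms are subtracted from the denominator: subjects the postal system never reached cannot count against your response rate.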
The major problem with a low rate of return is that the data may not reflect the true measure of the sample you chose to study. Part of the sample volunteered to comply with the research request, and returned the completed form. Others ignored the questionnaires. The difference in willingness to comply may relate to some aspect of your study. So, a low return rate (i.e., less than 50%) of survey forms may well give a distorted view of the target population. Higher return rates (60% - 80%) increase confidence that the returned data correctly reflects the sample, which, in turn, reflects characteristics in the population from which the sample was drawn.
Inflexibility
The structure of a written questionnaire (which increases the reliability of subject responses) also limits the researcher's ability to probe subject responses or clarify misunderstandings. Writing a questionnaire which directs subjects through a series of probes (follow-on questions which move the subjects deeper) and branches (skips to following sections) usually results in a complex, perhaps confusing, instrument. The written questionnaire is much more inflexible than the interview as a device for gathering data.
Subject motivation
There is no way to determine the motivation level of the subjects when they fill out the form. What is the subject's mental state: overworked, busy, contemplative, focused? The questionnaire cannot measure this as an interviewer would.
Loss of control
Researchers give up control over the administration of the questions on the survey form. There is no control over the subjects' environment, time, or attention to the task. There is no control over the order in which the questions are answered. There is no control over leaving answers blank. This loss of control creates missing or distorted data, which can pose problems in statistical analysis.
Types of questionnaires
Questionnaires consist of questions of two basic types: structured and unstructured. A structured question, sometimes called close-ended, provides a predetermined set of answers from which the subject chooses. Here is an example of a structured, or close-ended, question:
What kind of college did you attend?
____ Evangelical college
____ Private secular college
____ Catholic college
____ State college
The advantage of this type of question over the unstructured (open-ended, see below) question is its greater reliability. It is a more reliable (consistent, stable) question because subjects are given specific responses from which to choose. The data from this type of question are more easily analyzed than data from open-ended items.

The second type of question is the unstructured, or open-ended, question. This question asks the subject for information without providing choices. Here's how the structured question above might be restated as an unstructured item.
Describe the kind of college you attended.
This type of question allows subjects to respond in their own way, using their own terms and language. It is less restrictive, so it might uncover subject characteristics that would be missed by the close-ended type. The open-ended item, however, increases the likelihood that subjects will respond incorrectly (that is, in a way not planned by the researcher). One subject might answer the above question like this: "It was an expensive nightmare!" This tells the researcher how he felt about his college, but it does not answer the question he had in mind.

Close-ended questions may miss important data points because they are restrictive. Open-ended questions may provide so many data points that the researcher cannot reduce them meaningfully. The answer? Use a survey form of open-ended questions in a pilot project to gather as many answers as possible. Then design a close-ended questionnaire for the actual study. This provides a valid base for the structured items, yet yields a reliable set of data for the study.
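The pilot-then-structure strategy just described can be sketched in code. Assuming a hypothetical list of pilot responses (the data below is invented), tallying the open-ended answers suggests the fixed choices for the close-ended form:

```python
from collections import Counter

# Hypothetical pilot responses to the open-ended item
# "Describe the kind of college you attended." (invented data)
pilot_answers = [
    "state college", "evangelical college", "state college",
    "private secular college", "catholic college", "state college",
    "evangelical college",
]

# The answers observed in the pilot become the fixed choices
# on the close-ended questionnaire used in the actual study.
tally = Counter(pilot_answers)
choices = [answer for answer, count in tally.most_common()]
print(choices)
```

In practice the researcher would also collapse near-duplicate wordings into single categories and add an "Other" option for answers the pilot did not surface.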
Guidelines
Here are some specific guidelines for developing a questionnaire.
Asking questions
The key to designing an effective questionnaire is asking good questions. A good question is specific, clearly presented, and generates an answer that is definite and
quantifiable. Asking unambiguous, meaningful questions is difficult. Researchers write questions according to standard guidelines (see Chapter 11). They then evaluate and revise questions as needed. Finally, questions are validated for clarity and meaningfulness by objective judges. The quality of the questionnaire is built directly on the quality of each question in it.
Clear instructions
Questionnaire designers know how to fill out their questionnaires because they created them. It is easy to assume that anyone would know how to complete the form. Such assumptions can doom a survey study. Subjects need clear instructions for completing the survey. If there are several sections in the form, specific instructions should be given for each section.
Understandable format
The order of questions in the questionnaire should not confuse subjects. Answers should be easy to select. Eliminate complex structures as much as possible (i.e., avoid probes into telescoped questions, or jumps to different sections in the form). A simple structure will produce more reliable data.
The Interview
In its most basic form, the interview is an oral questionnaire where subjects answer questions live, in the presence of researchers or their assistants.
Advantages
There are several key advantages to using an interview approach over the mailed questionnaire.
Flexibility Motivation Observation Broad Application No Mailing
Flexibility
A face-to-face interview affords greater flexibility than the more rigid written questionnaire. Interviewers can branch from one set of questions to another without confusing the subject. The interviewer can clarify misunderstandings of questions or instructions. If a subject makes an unexpected comment, the interviewer can investigate with follow-up questions. The survey instrument can be more complex, because a trained interviewer is better able to handle branching and probing than the untrained subject.
Motivation
When interviewers and subjects are facing each other, the motivation level of subjects can be directly observed and noted. Rapport between the interviewer and subject can create a more cooperative atmosphere, which increases the validity of the subjects' responses.
Observation
Researchers can record the manner, as well as the content, of subjects' answers. Mood, attitude, bias, emotional state, body language, facial expression: these are excellent clues to the quality of answers being received.
Broader Application
Interviewers can gather information from people who cannot read. Young children, senior adults with poor eyesight, and groups for whom English is a second language can give better information through an interview than they can with a written questionnaire.
Disadvantages
Likewise, there are some major disadvantages with the interview.
Time
Questioning scores of subjects one by one, in person, requires far more time than sending out survey forms by mail. In order to acquire a sufficient sample size, researchers may need to enlist and train a group of assistants to help in the interviewing. The training of interviewers is a monumental task and requires a great deal of time to ensure that all the interviewers administer the survey the same way.
Cost
While the cost of postage is avoided by interviewing subjects, interviewing involves other expenses. Payment of assistants is more expensive than stamps, but is necessary if you plan to do a professional study. The printed interview guide will cost about the same to print as a comparable questionnaire. Additionally, interviewing may require travel costs or long distance phone costs. This means that, given a set research budget, the number of subjects you can interview will be less than the number you can survey by mail. This results in a loss of statistical power in your study.
Interviewer effect
Do you remember the problems of inference and interference associated with observational research (Chapter 9)? All of the human problems we discussed regarding observational research apply to interviewers as well. Personal characteristics, social skills, competence, gender, appearance: all of these factors will produce variance in subject responses to questions unless they can be controlled by homogeneous enlistment and adequate training.
Interviewer variables
Differences among interviewers (their values, beliefs, and biases) may introduce distortion in the way interviewers interpret and record responses by subjects.
Types of Interviews
Earlier in the chapter we defined questions which are structured (close-ended) and unstructured (open-ended). A structured interview is simply an oral questionnaire. Researchers ask the questions in the order they appear on the form. An unstructured, or free response, interview presents the subject with open-ended questions. Researchers can follow up answers with probes and skips without confusing subjects. Just as the structured question increases reliability and decreases the range of answers, so does the structured interview. Just as the unstructured question increases answer variance and decreases the ability to quantify research data, so does the unstructured interview.
Guidelines
Here are some specific guidelines to consider if you plan to use the interview.
Recording responses
Subject responses need to be accurately recorded during the interview. Recording the responses after the interview invites problems with subjective interpretation, selective memory, or unconscious bias.
Interviewer skills
Before the study begins, interviewers should be given adequate practice in asking questions, fielding responses, probing, clarifying instructions, and recording answers. If skill levels differ among the interviewers, extraneous variability will be introduced into the data, making findings ambiguous.
Demographics first
Ask demographic questions first in the interview. By asking non-threatening demographic questions at the beginning of an interview session, researchers establish rapport between themselves and subjects. Such rapport improves the level of trust between researchers and subjects, which, in turn, increases the validity of answers received. Demographics come FIRST in the interview, LAST in the questionnaire.
Alternative modes
The face-to-face interview is only one mode of interviewing. Researchers can conduct interviews by telephone. This extends the range of the interview far beyond that possible with face-to-face meetings. Researchers can also mail cassette tapes to subjects. The subject listens to the question on tape and records his answer. This is less expensive than interviewing by phone, and extends the interview beyond that possible with face-to-face meetings. These modes provide more subject information than the written questionnaire does. Voice characteristics, subject hesitation, and tone of voice provide clues to subject motivation. Still, none of these alternatives permits direct observation of the subject as in the face-to-face meeting.

4th ed. 2006 Dr. Rick Yount
Pilot Study
Select a group of people similar to those who will be involved in the actual study. Use the instrument to gather data from them. Check for any problems the pilot group encountered while completing the form. Ask the group for suggestions. Revise the instrument as needed.
Summary
Survey research gathers specific data from a large group of people who possess that data. We have discussed advantages, disadvantages, and guidelines for using the mailed questionnaire and the personal interview.
Examples
Dr. Margaret Lawson designed her own questionnaire to gather data for her study of selected variables and their relationship to whether or not Life Launch pilot churches (1987-88, n=120) continued offering LIFE courses (MasterLife, Experiencing God, Parenting by Grace, and the like, 1992-93).5 She collected data on which courses were offered, who led the courses (pastor, staff, or lay), how the materials were paid for (participants paid full, part, or none), as well as attendance in Sunday School and Discipleship Training, church membership, number of baptisms, gifts, and initiated ministries. Her survey instrument is located at the end of the chapter. Her procedure for developing the survey form was as follows:
The steps in developing the survey instrument were as follows:6
1. Questions were designed for subjects' responses to reflect information on the factors present in those churches that did, and those that did not, continue to offer LIFE courses. The same two-page questionnaire was sent to all the churches. Drew and Hardman suggest that respondents are more likely to complete a one- or two-page questionnaire.
2. A validation panel of experts drawn from the areas of adult discipleship training, research design, and the field of religious education was asked to rate the relevance and clarity of each question. . . . Following the panel's critique and evaluation, eight surveys were returned. Suggestions were offered by Avery Willis and Clifford Tharpe, and the appropriate revisions and modifications were incorporated.
Dr. Darlene Perez developed her Spanish-language survey to gather information from youth and youth leaders in Puerto Rico concerning Youth Curriculum materials. Here was her procedure:7
The Youth Sunday School Curriculum Questionnaire was designed to obtain data related to the youth curriculum variables identified in the problem statement. The procedures for designing the instrument followed guidelines in . . . Research Design and Statistical Analysis for Christian Ministry.2 . . . The first step . . . consisted of stating the purpose of the study with clear instructions on how to complete the questionnaire. Second, an item pool of questions was developed. The questions were written in an objective, structured and close-ended form. They were designed to obtain information about the curriculum being used by participants, the degree of curriculum satisfaction, the disposition to change curriculum, the preference for a Bible study approach, and the preference for a teaching/learning method. Third, the questionnaire included a section at the end for demographic information. A copy of this questionnaire is provided as appendix H. . . . The questionnaire was submitted to a validation panel of seven experts in the areas of education or curriculum development or youth knowledge. Each panel member considered points of clarification and the validity of each item. The best, most clear, and most valid questions were selected for the survey. . . . A proposed pilot study with youth and youth leaders not included in the research was to
5 Margaret P. Lawson, A Study of the Relationship Between Continuance of LIFE Courses in the LIFE Launch Pilot Churches and Selected Descriptive Factors (Ph.D. dissertation, Southwestern Baptist Theological Seminary, 1994).
6 Ibid., 25-26.
7 Perez, 55-58.
be completed in Puerto Rico. The validation procedure with a pilot group included the following steps:
1. The Sunday School Board provided a list of Baptist and non-Baptist churches in Puerto Rico currently using the Spanish Convention Uniform Series. A non-Baptist, evangelical church (Alianza Cristiana y Misionera, Río Piedras, Puerto Rico) was selected for the pilot study. The questionnaire was submitted during a youth Sunday School class to a group of thirteen youth and three youth leaders. Corrections were made to clarify the instructions on how to complete the questionnaire. Also, the term "youth" (joven) was changed to Intermedios y Pre-jóvenes along with a parenthesis stating the ages twelve to seventeen.
2. After making corrections, it was felt that the instrument needed further validation. A second validation pilot study was performed with a group of thirty youth and youth leaders from the Baptist Convention of Puerto Rico who were meeting at a youth camp during July, 1990. After this validation process, the following changes were made. . . [six changes listed].
3. In order to make the validation process more consistent, a third pilot study was performed with a group of thirty youth and youth leaders from the Puerto Rico Southern Baptist Association, at a youth camp in July 1990. Only a few corrections were made in the section of demographics. . . [two changes listed]. A copy of the validated questionnaire appears as appendix I. [The English-language version is included at the end of the chapter.]
Vocabulary
close-ended question: type of question which provides a set of answers to choose from (a b c d)
demographics: personal data on subjects (gender, ed level, years in ministry)
item pool: a collection of test items from which a subset is drawn for creating an instrument
open-ended question: question which allows subject to answer in his/her own words
rate of return: percentage of mailed questionnaires which are completed and returned
structured question: synonym for close-ended question
unstructured question: synonym for open-ended question
validation panel: judges who analyze the clarity and relevance of questions in an item pool
Study Questions
1. Compare and contrast the advantages and disadvantages of the interview and questionnaire.
2. Define structured or close-ended questions. Give an example.
3. Define unstructured or open-ended questions. Give an example.
4. Discuss the pros and cons of using structured or unstructured questions.
5. Differentiate the handling of demographic questions in the questionnaire and interview.
Please complete the information requested concerning LIFE courses in your church at the time of the LIFE Launch project and the present time. FIRST YEAR refers to the reporting year following the LIFE LAUNCH, October 1987 to September 1988. LAST YEAR refers to the latest reporting year, October 1992 to September 1993.
1. What LIFE courses did you offer in the first year of the LIFE Launch?
MasterLife   MasterBuilder   MasterDesign   Parenting by Grace   None
Other (please specify) __________
2. What LIFE courses have you offered during the last year?
MasterLife   MasterBuilder   MasterDesign   DecisionTime
Parenting by Grace I   Parenting by Grace II   Covenant Marriage   WiseCounsel
Disciple's Prayer Life   Experiencing God
Step by Step Through the Old Testament   Step by Step Through the New Testament
LifeGuide to Discipleship and Doctrine   None
Other (please specify) ______________________________
3. Which staff member began the initial LIFE courses?
Pastor   Associate Pastor   Minister of Education   Other (please specify) __________
4. Did any lay person have a leadership position from the beginning?   Yes   No
5. Has a staff person led LIFE courses in the past year?   Yes   No
6. Has a lay person led LIFE courses in the past year?   Yes   No
(OVER)
7 Lawson, 65-66.
7. Indicate how participants paid for their study materials in the first year:
Participants paid full price   Participants paid some of the cost
Materials were provided free of charge   Other (please specify) __________
8. Indicate how participants paid for their study materials in the past year:
Participants paid full price   Participants paid some of the cost
Materials were provided free of charge   Other (please specify) __________
9. Indicate the total number of participants in all LIFE groups:
_______ FIRST YEAR   ________ LAST YEAR
10. Indicate the average number of participants in individual LIFE groups: _______ FIRST YEAR ________ LAST YEAR
11. Complete the following information about your church during the LIFE Launch year:
_____ Resident Church Membership   _____ Total Baptisms
_____ Average Sunday School Attendance   _____ Average Discipleship Training Attendance
_____ Total Gifts
12. Complete the following information about your church during the past year:
_____ Resident Church Membership   _____ Total Baptisms
_____ Average Sunday School Attendance   _____ Average Discipleship Training Attendance
_____ Total Gifts
13. What specific ministries have been initiated by LIFE course participants?
Please return the completed survey to: Margaret Lawson address address city, state Would you like to receive a summary of the results of the survey? ______________
APPENDIX I8
VALIDATED YOUTH SUNDAY SCHOOL MATERIALS QUESTIONNAIRE (ENGLISH TRANSLATION)
The purpose of this questionnaire is to obtain basic information about the Sunday School youth materials being used in your church and to identify the curriculum preferences of youth and youth leaders.
Instructions: Select the best alternative with a check mark (✓). Choose only one response for each question.
1) Which Sunday School materials are currently being used in your church?
__ 1. El Intérprete (Convention Uniform Series of the Sunday School Board)
__ 2. Enseñanza Bíblica Para Jóvenes (Diálogo y Acción Program of The Spanish Publishing House)
__ 3. Materials designed in your own church.
__ 4. Exploradores y Embajadores (Editorial Vida, Miami, Florida)
__ 5. Other, specify: ____________________________________________
2)
How satisfied are you with the Youth Sunday School materials used in your church?
__ 3. Dissatisfied (I do not like it)
__ 4. Very dissatisfied (I do not like it at all)
3)
__ 1. Yes
4)
If you were going to change Youth Sunday School materials, which Bible study approach would you prefer?
__ 1. I would like to study the Bible systematically, book by book, covering the whole Bible within a certain period of time.
__ 2. I would like to study the Bible by themes that relate to daily life, such as the family, friendships, the community, and others.
__ 3. I would like to study the Bible by doctrinal themes, such as the doctrine of God, Jesus, the Holy Spirit, Church, Bible, prayer, and others.
__ 4. I would like to have Bible studies about discipleship, Christian growth and formation.
8 Perez, 108-109.
5. If you were going to change the Youth Sunday School materials, which teaching/learning methods would you prefer?
__ 1. Conference -- The teacher would present and explain the Bible passage.
__ 2. Questions and answers -- The teacher would use questions to promote group participation.
__ 3. Small group work -- The class would be divided into small groups. Each group is assigned to work on a task and will report its findings to the whole class.
__ 4. Individual tasks -- The teacher would assign questions or tasks to each student and he/she would work independently.
__ 5. Other, specify: _________________________________________________
Please complete the following information:
Position: ___ Youth Pastor ___ Youth Minister ___ Minister of Christian Education
___ Sunday School Director ___ Youth teacher ___ Other
Sex: ___ Male ___ Female
Age: ___
Denomination: ___ Southern Baptist ___ American Baptist
___ Other, specify: ___________________________________
Church name: ____________________________________________
Have you completed this questionnaire before? ____ yes _____ no
Comments/suggestions:
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
________________________________________________________________________
Chapter 11
Writing Tests
11
Developing Tests
Preliminary Considerations Objective Test Items Essay Test Items Item Analysis
A test is an instrument which measures a subject's knowledge, understanding, or skill in a given content area, and produces a ratio score reflecting that measure. If the focus of a study is "testing subjects" on some variable (Bible knowledge, comprehension of various translations, current events), an appropriate test must be found, or one must be developed. This chapter introduces you to principles of developing tests.
Preliminary considerations
You may be able to use existing tests for your study. Let's say the nature of your study is to identify a relationship between job satisfaction and interpersonal dynamics among staff members. You may be able to find an existing test which will measure job satisfaction. Check the Mental Measurements Yearbook, or Tests in Print, or other such resources for published tests in your area of interest. Tests can also be found in research articles being gathered for the Related Literature section of your proposal. Study the validity and reliability scores on the test, the population(s) the test was designed for, and the conditions of test administration. If these factors fit your study, you're in business! Describe these characteristics in the Instrument section of your proposal.

You may need, however, to develop your own test, since there are many areas in the field of Christian Education that do not yet have tests. This chapter focuses on the procedure to use in developing such a test for use in a larger dissertation context. Good tests gather good data. Good tests build good attitudes. Good tests can even produce a positive learning experience. The principles discussed here will help you in this task.
. . . you use, the length of the test and other such variables depend a great deal on who your subjects are.
Writing items
Avoid ambiguous or meaningless test items. Use good grammar. Avoid rambling or confusing sentence structure. Use items that have a definitely correct answer. Avoid obscure language and big words, unless you are specifically testing for language usage. Be careful not to give the subject irrelevant clues to the right response. Using "a(n)" rather than "a" or "an" is an example of this. In short, a test should not present any barrier to subjects apart from demonstrating mastery of the test content. Otherwise, scores reflect more noise than true measure.
Objective Tests
True-False Multiple Choice Matching
An objective test is a test made up of close-ended questions. Objective tests have several advantages over essay tests. Asking 100 objective questions over a given content field provides a much better sampling of examinee knowledge and understanding than asking three or four essay questions. With objective tests, grading is easier and the scores are a more reliable measure of what the examinee knows. There are four common types of objective questions. These are the constant alternative (true-false) question; the changing alternative (multiple choice) question; the supply (or fill-in-the-blank) question; and the matching question.1
Advantages
The advantages of the true-false test item are efficiency and potency. It is efficient in that a large number of items can be answered in a short period of time. Scoring is fast and easy. It is potent because it can, in a direct way, reveal common misconceptions and fallacies.
1 The material in this chapter is a synthesis of principles gleaned from Nunnally, Chapter 6: Test Items, 153-196; and Payne, Chapter 5: Constructing Short Answer Achievement Test Items, 95-136. These are excellent resources for those wanting to improve their test-writing ability. Another excellent (more recent) source is Tom Kubiszyn and Gary Borich, Educational Testing and Measurement: Classroom Application and Practice, 2nd ed. (Glenview, IL: Scott, Foresman and Company, 1987). Also more recent material can be found in my own Created to Learn (1996), Chapter 14 and Called to Teach (1999), Chapter 9, both from Broadman and Holman.
Disadvantages
Good true-false items are hard to write. An item that makes sense to the writer may confuse even well-informed subjects. Statements require careful wording, evaluation and revision.

Secondly, true-false items encourage guessing. With only two alternatives, an examinee who knows absolutely nothing about the subject still earns around 50% of the test score over the long run by pure chance. This is a lot of noise in the test scores.

Thirdly, constant alternative items tend toward response sets. A response set is a repetitious pattern of answers, like the following 18-item test:

T T F   T T F   T T F   T T F   T T F   T T F

Notice that the pattern T T F repeats itself through the test. Test writers can produce these response sets without being aware of it. Subjects pick up these irrelevant clues, and score higher than their knowledge allows. The objective is not to ensure high scores, but to actually measure what subjects know and understand.
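A response set like the one above can also be caught mechanically when reviewing an answer key. The helper below is a hypothetical illustration (not from the text): it checks whether a key is simply a short block of answers repeated, assuming the key is given as a list of "T"/"F" strings.

```python
def repeating_period(key, max_period=5):
    """Return the shortest block length p (up to max_period) such that the
    key is an exact repetition of its first p answers, or None if the key
    shows no such response set."""
    n = len(key)
    for p in range(1, min(max_period, n // 2) + 1):
        if all(key[i] == key[i % p] for i in range(n)):
            return p
    return None

# The 18-item key from the text, built from the repeating block T T F:
key = list("TTF" * 6)
print(repeating_period(key))   # 3
```

A key that returns a small period should be reshuffled before the test is administered.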
Determiners Answers Negatives Language Quotes Item length Sentences False Items
Absolute answer
Base true-false items on statements that are absolutely true or absolutely false. Avoid statements that are true under some conditions, but not others, unless the conditions are specifically stated. Well-informed subjects have greater difficulty answering ambiguous questions correctly, because they have more information to process in trying to understand the item.
Advantages
The multiple choice question, with its multiple responses, can be written with less ambiguity and greater structure than the true-false question. Guessing is reduced since the probability of guessing the correct answer is 1 in 4 (25%) instead of 1 in 2 (50%) for true-false items. Multiple choice items can demand more subtle discrimination than other forms of objective questions. Lastly, one can write multiple choice items which test at higher levels of learning, such as application and analysis, than other question types.
Disadvantages
Good multiple choice questions are difficult to write. Effective distractors -- plausible wrong answers -- are hard to create, particularly if you are providing a 5th or 6th alternative response. Secondly, multiple choice tests are less efficient because a subject can process fewer multiple choice items in a given time than other types.
Chapter 11
Writing Tests
Single Problem Repeats Negative Stems Responses Similar Responses Exclusive Responses Plausible Responses Random Irrelevance Extraneous None of the Above
Supply Items
Supply items, sometimes called recall or fill in the blank items, present a statement with one or more blanks. The task of the subject is to fill in the blank(s) with the most appropriate terms in order to correctly complete the statement.
Advantages
Supply items are relatively easy to construct. Second, they are efficient in that a large number of statements can be processed in a given length of time. Third, remembering a term or phrase is more difficult than recognizing it in a list or response set. Therefore, supply items discriminate better between subjects' knowledge of important definitions and concepts.
Disadvantages
Supply items are notorious for being ambiguous. It is difficult to write a supply item that is clear and plainly stated. Supply items are also unclear in the way they're graded, because usually more than one word will adequately fill the blank. Grading can be arbitrary and unfair, depending on how synonyms are handled.
Limit blanks
Use only one or two blanks in a supply item. The greater the number of blanks, the greater the item ambiguity and the more difficult grading is.
Matching Items
A matching item presents subjects with two or three columns of entries which relate to each other. An example of a matching question is one which provides a numbered list of authors with a parallel lettered list of the books they wrote. Subjects match the books to their authors by writing the letter of each book in the space next to the numbered author. The list of authors is the item list and the list of books is the response option list.
Advantages
The matching item can test a large amount of material simply and efficiently. Response pairs can be drawn from various texts, class notes, and additional readings to form a summary of facts. Grading is easy.
Disadvantages
A good matching item is difficult to construct. As the number of response pairs in a given item increases, more mental gymnastics are required to answer it. Matching items can present little more than a confusing array of trivial terms and sentence fragments.
Specific instructions
Be sure to clearly instruct subjects on how the matching is to be done. Show an example, if necessary. This eliminates test-wiseness as an extraneous variable in the scoring.
Essay Tests
Essay tests are constructed from unstructured or open-ended questions which require subjects to write out a response.
Advantages
Essay test items allow much greater flexibility and freedom in answering. Grammar, structure, and content of the answer are left to subjects. Essay items permit testing at higher levels of learning than most types of objective questions. Finally, essay questions permit a greater range of answers than objective items.
Disadvantages
The greatest disadvantage of essay items is that they are difficult to score consistently. The answers are more ambiguous and subjective than objective responses. The reliability of scores is lower than those produced by objective tests over the same content because of the variability of response. Essay items test a smaller sample of material because of the amount of time required to analyze and understand the question, develop the answer, and write it out in complete sentences. They are less efficient than objective types.
Item analysis
Item analysis is a procedure for determining which items in an objective test discriminate between informed and uninformed subjects. If a test's purpose is to separate subjects along a scale of content mastery (and most tests have this purpose!), then it is important that this separation be done fairly. Every item in a test should contribute to this separation process. Those that do not should be revised or eliminated.
4th ed. 2006 Dr. Rick Yount
A popular method of item analysis is a procedure called the Discrimination Index. After administering and grading the exam, the procedure is applied as follows:
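A common formulation of the Discrimination Index (not necessarily the author's exact steps) works like this: rank examinees by total test score, take the upper and lower groups (often 27% each), and compare the proportion in each group who answered a given item correctly. A sketch, with illustrative names and data:

```python
def discrimination_index(item_correct, total_scores, fraction=0.27):
    """D = p(upper group correct) - p(lower group correct) for one item.

    item_correct: 1/0 per examinee for this item
    total_scores: total test score per examinee, in the same order
    fraction: share of examinees in each extreme group (27% is common)
    D near +1 means the item separates high and low scorers well;
    D near 0 means it does not discriminate; negative D flags a faulty item."""
    n = len(total_scores)
    k = max(1, int(n * fraction))                       # size of each group
    order = sorted(range(n), key=lambda i: total_scores[i])
    lower, upper = order[:k], order[-k:]
    p_upper = sum(item_correct[i] for i in upper) / k
    p_lower = sum(item_correct[i] for i in lower) / k
    return p_upper - p_lower

# Ten examinees: this item is answered correctly by the top scorers
# and missed by the bottom scorers.
scores = [90, 85, 80, 75, 70, 65, 60, 55, 50, 45]
item = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]
print(discrimination_index(item, scores))   # 1.0
```

Items with a low or negative D are the ones the text says should be revised or eliminated.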
Summary
In this chapter we have looked at procedures for developing various types of tests. We have considered four kinds of objective items: true-false, multiple choice, supply and matching. We have discussed the use of essay questions. Finally, we described item analysis, which allows test developers to determine whether objective test items properly discriminate between informed and uninformed subjects.
Examples
In addition to the checklist in Chapter nine, Dr. Mark Cook also developed an objective test
. . .to measure the lesson objectives at three cognitive levels: knowledge, comprehension, and application. The process of development began by creating a thirty-item multiple-choice test to be used in the field test of the study (appendix D). The test was examined by three selected specialists. The specialists that were asked for validation of the test were as follows: [specialists listed]. These professors were provided complete lesson plans to use in evaluation.3
A copy of the test is located at the end of the chapter.

Dr. Brad Waggoner focused his entire 1991 dissertation on developing a standardized test to measure the discipleship base -- defined as 'that portion of a given church's membership that meets the criteria of a disciple'4 -- of local Southern Baptist churches. He worked in conjunction with the International Mission Board of the Southern Baptist Convention to produce a valid and reliable instrument. A final instrument of 136 items5 produced a Cronbach's alpha reliability coefficient of 0.9618.6 While we certainly cannot replicate the fifty-eight pages7 of his development procedure here, we will outline the procedure and focus on key aspects of test development.
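The Cronbach's alpha coefficient reported for Waggoner's instrument is computed from item-level variances. A minimal sketch of the standard formula, alpha = (k / (k - 1)) * (1 - sum of item variances / variance of totals), on invented data (the numbers below are illustrative, not Waggoner's):

```python
def cronbach_alpha(items):
    """Cronbach's alpha for internal-consistency reliability.

    items: one list of scores per item, aligned across the same respondents."""
    k = len(items)                          # number of items
    n = len(items[0])                       # number of respondents

    def pvar(xs):                           # population variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    totals = [sum(item[j] for item in items) for j in range(n)]
    return (k / (k - 1)) * (1 - sum(pvar(item) for item in items) / pvar(totals))

# Two perfectly parallel items yield alpha = 1.0.
print(cronbach_alpha([[1, 2, 3, 4], [1, 2, 3, 4]]))
```

Values near 1.0, like Waggoner's 0.9618, indicate that the items consistently measure the same underlying trait.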
Phase One: Identification of Functional Characteristics8
Attitudes: A disciple is one who:
- Possesses a desire and willingness to learn
- Has conviction regarding the necessity of living in accordance with biblical principles and guidelines
- Evidences a repentant attitude when a violation of Scripture occurs
- Possesses a willingness to forfeit personal desires and conveniences, if necessary, in order to seek the interests of others
- Possesses and demonstrates the character trait of humility
- Possesses and demonstrates the character trait of integrity
- Is willing to be accountable to others
Conduct/Behavior: A disciple is one who:
- Manifests a lifestyle of utilizing time and talents for God's purposes
- Possesses a lifestyle depicted by intentional compliance with the moral teachings of the Bible. . .
- Maintains appropriate behavior toward those of the opposite sex
- Actively seeks to promote social justice and righteousness in society as well as toward individuals
Relational/Social: A disciple is one who:
- Values and accepts himself as created in the image of God
- Has an awareness of the reality and presence of God through the ministry of the Holy Spirit
- Experiences trust in God in times of adversity as well as in times of prosperity
- Seeks to commune with and learn about God through the means of meditation upon Scripture and prayer
- Is consistently involved in fellowship with other believers in the context of a local church
- Applies oneself to building meaningful relationships with other believers
- Maintains a forgiving spirit when wronged
- Confesses or seeks forgiveness when guilty of an offense
3 Ibid., 209. 8 Ibid., 118.
Ministry/Skills: A disciple is one who:
- Publicly identifies with Christ and the Church when provided an opportunity
- Seeks and takes advantage of opportunities to share the Gospel with others
- Is involved in ministering to other believers
- Seeks the good of all men with a willingness to meet practical social needs such as food, clothing, and the like
Doctrine/Beliefs:
- Eternal security
- Salvation
- The Holy Spirit (the nature and role of)
- The Eternal State (the literal existence of heaven and hell)
- Scripture (the authority and reliability of)
Phase Two: Testing of Content Validity9
The functional characteristics, categorized according to the five domains described above, were placed on a 9-point Likert rating scale, a value of "1" being "not valid," and a value of "9" being "very valid," with gradations of validity in between (appendix B).57 The purpose of the rating scale was for a panel of experts to determine the degree to which each characteristic was a valid and measurable function of a disciple.58 A list of names was compiled. . . the panel was to consist of five experts and two alternates representing the academic, denominational, and local church levels (appendix C).59 A letter was constructed that explained the nature and purpose of the research and requested their participation on the panel (appendix D). . . . When the rating scales were returned, the mean scores were calculated for the characteristics (appendix F).
Phase Three: Revision of Characteristics10
Revisions to the list of characteristics were made based on the panel's scores, comments, and additions. It was predetermined that any item receiving a mean score of less than 7.0 would be considered for deletion.
Phase Four: Item Writing11
- Review Related Measures
- Construction of Questions
- The Size of the Item Pool
- The Issue of Relevance
- The Issue of Clarity
- The Issue of Simplicity
- The Issue of Single Meaning
- The Issue of Double Negatives
- The Issue of Question Length
- The Issue of Question Variety
- The Issue of Response Categories
- The Issue of Assuming
- The Issue of "Leading" or "Loaded" Questions
- The Issue of Grammar and Tone
Phase Five: Testing Content Validity of Questions12
- Selection of a Panel of Experts
- Development of a Validation Instrument
- Follow-Up of Validation Panel
9 Ibid., 81-82. 10 Ibid., 82. 11 Ibid., 83-91.
Calculation of Validity: ". . . questions receiving mean scores of less than 6.0 would be considered for deletion."
Phase Six: Questionnaire Design13
- Question Order and Flow
- Questionnaire Length
- Questionnaire Design and Layout
- Size and Color of Paper
- Layout
- Instructions
- Expression of Gratitude
- Expression of Confidentiality
- Identification of the Sponsor
Phase Seven: Refining the Pilot Test14
The process of refining the pilot test consisted of a small number of individuals evaluating the clarity of questions, word meanings, instructions, and procedure for completing the instrument. . . . Revisions were made to the instrument based upon the results. Subsequently, over 100 questionnaires were printed and put into booklet form (appendix M).
Phase Eight: Pilot Test #115
- Selection of Sample Group [n=50 church members in two groups]
- Establish Time and Place of Pilot Test
- Letter of Invitation Constructed and Mailed
- Administering the Instrument
Phase Nine: Data Analysis16 [This is part of Chapter Four of the dissertation.]
Phase Ten: Revision of the Instrument17
Phase Eleven: Second Pilot Test18
- Selection of [Three] Churches
- Procedure for Administering the Pilot Test
- Follow-Up Procedure
Phase Twelve: Data Analysis of the Second Pilot Test19 [This is part of Chapter Four of the dissertation.]
As mentioned in Chapter One, this instrument -- with further revisions by Dr. Waggoner in conjunction with the IMB and LifeWay Christian Resources (SBC) -- is being integrated into revised MasterLife materials produced by LifeWay.
Vocabulary
changing alternative: synonym for a multiple choice test item
constant alternative: synonym for a true-false test item
discrimination index: procedure used to determine quality of test items
distractors: multiple choice options which appear plausible but are incorrect
multiple choice question: test item with one stem and 4 or 5 plausible options
response set: predictable pattern in objective answers (e.g. TTTF TTTF TTTF)
specific determiners: terms like "never" or "sometimes" that give clues to the correct answer
supply item: synonym for fill-in-the-blank questions

15 Ibid., 99-103. 16 Ibid., 103.
Study Questions
1. Explain the four preliminary guidelines given for writing tests in your own words.
2. Explain why objective test items produce more reliable scores than essay test items.
3. Write out 3 TF, 3 MC, 2 supply and 2 essay questions relating to this material. Set them aside for a few days. Then go back and evaluate each of your questions according to the criteria given for each kind of question.
Sample Test
APPENDIX B3
PRE-SESSION TEST
Student Number _________ (see your name tag)
Circle the letter of the phrase that best completes the sentence.
1. The phrase "priesthood of believers" is found in the Bible (a) in the New Testament, (b) in the Old Testament, (c) in both testaments, (d) in neither testament.
2. The doctrine of the priesthood of the believer teaches that priests should (a) be representative of all people, (b) represent God to other persons, (c) be ordained by a church, (d) remain completely separated from the world.
3. During the Reformation, the priesthood of all believers particularly emphasized (a) infant baptism, (b) personal witnessing, (c) direct access to God, (d) wrongs of the Catholic church.
4. The concept of priest in the Old Testament is most often associated with the priesthood of (a) all Israelites, (b) some Israelites, (c) no Israelites, (d) the special prophets of Israel.
5. The Old Testament covenant was designed by God (a) to bless Israel as His people only, (b) to assure that Israel worshipped only God, (c) to help Israel conquer their world, (d) to make Israel a blessing to all other nations.
6. Christians are referred to as a holy priesthood. This holiness is best reflected by Christians when they are (a) motivated by love, (b) pure in their thoughts, (c) serving God at church, (d) separated from the world. . . .
Chapter 12
Developing Scales
12
Developing Scales
The Likert Scale The Thurstone Scale The Q-Sort Scale The Semantic Differential
Our emphasis from the beginning of the text has been on the objective measurement of research variables. Sometimes we are most interested in studying subjective variables: attitudes, feelings, personal opinions, or word usage. How can we measure subjective variables objectively? The answer is an instrument called a scale.1 Dr. Martha Bergen used an adaptation of an existing scale2 to measure the attitude of seminary professors toward using computers in seminary education.
Respondents [110 seminary professors serving at Southwestern Baptist Theological Seminary in 1988] were asked to read each question and decide to what extent they agreed or disagreed with each question. They were instructed to circle the appropriate number after each of the items. The rating scale was set up in a logical pattern using the numbers "1," "2," "3," and "4" to correspond with "strongly disagree," "disagree," "agree," and "strongly agree," respectively. Responses [from the 53 items] were totaled and evaluated to reveal which attitude/s was/were most prominent. . . . A validation panel consisting of five experts in the areas of education, religious education, and computers was asked to rate the relevance and clarity of each question. Proper revisions and modifications were made as deemed necessary from the panel's critique and evaluation. For the purpose of establishing reliability, a stratified random sample of ten seminary professors -- representative of the intended population -- was selected to respond to the questionnaire. The method of split-half correlation was used to determine the coefficient of internal consistency. . . .3
The result of the modifications was an instrument which measured the strength of support (an attitude) of seminary professors for the use of computers in seminary education in 1988. The internal consistency coefficient, after applying the Spearman-Brown Prophecy Formula, was +0.75, a strong positive value (see Chapter 22).

A scale is an instrument which measures subjective variables. In this chapter we look at four major types of scales: the Likert (LIE-kurt), the Thurstone, the Q-sort, and the Semantic Differential. Each of these important scale types provides the means to gather subjective data objectively.

1 See Babbie, "Chapter 15: Indexes, Scales and Typologies," pp. 366-389; Nunnally, "Chapter 15: Attitudes and Interests," pp. 441-467; and Payne, "Chapter 8: The Development of Self-Report Affective Items and Inventories," pp. 164-200. An excellent paperback dealing with this subject is Daniel J. Mueller, Measuring Social Attitudes: A Handbook for Researchers and Practitioners (New York: Teachers College Press, 1986).
2 Bergen describes her instrument as an adaptation of a 1986 dissertation [instrument] from North Texas State University. See Mitchell Drake Weir, "Attitudes and Perceptions of Community College Educators toward the Implementation of Computers for Administrative and Instructional Purposes" (Ph.D. dissertation, North Texas State University, 1987), pp. 129-35. In May 1988 North Texas State University became the University of North Texas. Bergen, 48.
3 Ibid., 48-49. See also 57-62 for more detail.
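The split-half reliability check described above can be sketched in a few lines of Python. The respondent data here are invented for illustration; only the arithmetic follows the procedure in the text: correlate the two half-scores, then apply the Spearman-Brown prophecy formula to estimate full-test reliability.

```python
from math import sqrt
from statistics import mean

# Hypothetical data: 10 respondents x 6 Likert items, each scored 1-4.
responses = [
    [4, 3, 4, 4, 3, 4],
    [2, 2, 1, 2, 2, 1],
    [3, 3, 3, 4, 3, 3],
    [1, 2, 1, 1, 2, 2],
    [4, 4, 3, 4, 4, 3],
    [2, 1, 2, 2, 1, 2],
    [3, 4, 4, 3, 4, 4],
    [1, 1, 2, 1, 1, 1],
    [3, 2, 3, 3, 2, 3],
    [4, 3, 4, 4, 4, 4],
]

def pearson(x, y):
    """Pearson product-moment correlation of two score lists."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = sqrt(sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y))
    return num / den

# Split-half: odd-numbered items vs. even-numbered items.
half1 = [sum(row[0::2]) for row in responses]
half2 = [sum(row[1::2]) for row in responses]
r_half = pearson(half1, half2)

# Spearman-Brown prophecy formula steps the half-test r up to full length.
r_full = 2 * r_half / (1 + r_half)
print(round(r_half, 3), round(r_full, 3))
```

Note that the prophecy formula always raises a positive split-half correlation, which is why it is applied after splitting the test in two.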
Write statements
Next, we will write statements that reflect positive and negative aspects of these areas. We've defined "positive" to mean that which agrees with my position, and "negative" to mean that which disagrees with my position. The statements, even though reflecting subjective variables, should be objective. That is, statements must not be systematically biased toward one position or the other. Students who really want merely to get a degree should have no trouble scoring low on the scale. They should tend to agree with statements reflecting "degree" and tend to disagree with statements reflecting "learning." In the same way, students who really want to learn should tend to agree with "learning" statements, and tend to disagree with "degree" statements.

2 Mueller, "Chapter 2: Likert Attitude Scaling" and "Chapter 3: Likert Scale Construction: A Case Study," 8-33.
Positive examples
Positive statements should be objective statements which are acceptable to those having the attitude, and just as unacceptable to those not having it. The following reflect these characteristics in regard to our attitude scale:
I generally enjoy homework assignments and sometimes do more than the assignment requires.
I frequently use library resources to go beyond the required reading.
I believe a degree is empty unless it reflects my best efforts of scholarship.
A late assignment, thoughtfully done, is more important than the loss in grade average.
Negative examples
Negative statements should be objective statements which are acceptable to those not having the attitude, and just as unacceptable to those having it. These statements coincide with the positive examples above.
Homework assignments are designed to meet course requirements. It is impractical in time and energy to do more than is required.
It is better to master the required reading than to dilute one's thinking with other authors.
A degree is a credential for ministry and reflects, in itself, none of the extremes of scholarship some try to ascribe to it.
It is better to turn in an assignment on time than to be docked for lateness to make it better.
Rank
Rank order the evaluated items on clarity and potency. Choose an equal number of positive and negative statements.
3 Mueller states, "Five categories are fairly standard.... Some scale constructors use seven categories, and some prefer four or six response categories (with no middle category). All of these options seem to work satisfactorily. It should be noted in this regard that reducing the number of response categories reduces the spreading out of scores (reduces variance) and thus tends to reduce reliability. Increasing the number of response categories adds variance. As the number of categories is increased, a point is reached at which respondents can no longer reliably distinguish psychologically between adjacent categories [i.e., what's the difference between a 10 and an 11 on a 12-point scale? WRY]. Increasing the number of categories beyond this point simply adds random (error) variance to the score distribution" (pp. 12-13).
Write instructions
Write instructions which clearly explain how to select responses on the form. (See the finished example at the end of the chapter.) There are other ways to indicate the intensity of response. Dr. Don Mattingly (Ed.D., 1984) developed a scale for his dissertation which used the categories "Yes!  Yes  No  No!" to indicate how strongly his subjects agreed or disagreed with statements concerning recreation ministry.
[Figure: a completed sample Likert form. Red notations beside each item (shown here in parentheses) mark the response the subject circled and the points it earns, e.g., (A) = 3 pts on a positive item, (SD) = 4 pts on a negative item, (D) = 2 pts, (A) = 2 pts on a negative item.]

Red notations are not included on the form, but are included here to demonstrate the scoring of a completed form. This subject selects items as marked, which are scored according to statement type. This subject scored 23 points on this scale (32 possible). Very positive attitude!
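The scoring rule the form illustrates, add the points for each response, reverse-scoring negative statements, can be sketched as a short Python function. The answers and item keys below are hypothetical; the point values (SA=4 down to SD=1, flipped for negative items) follow the chapter's scheme.

```python
# Points for a positive statement; negative statements are reverse-scored.
RESPONSE_POINTS = {"SA": 4, "A": 3, "D": 2, "SD": 1}

def likert_score(answers, item_types):
    """Sum item points, reverse-scoring items keyed as negative."""
    total = 0
    for answer, item_type in zip(answers, item_types):
        points = RESPONSE_POINTS[answer]
        if item_type == "neg":
            points = 5 - points   # SA=1, A=2, D=3, SD=4 on negative items
        total += points
    return total

# Hypothetical completed 8-item form (32 points possible).
answers    = ["SA", "SD", "D",  "A",  "SA", "SD", "D",  "SA"]
item_types = ["pos", "neg", "neg", "pos", "pos", "neg", "neg", "pos"]
print(likert_score(answers, item_types))  # 29 of 32: a very positive attitude
```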
Scoring
Compute the median (or mean) of the weights of the statements marked by the subject. This is the subject's score, which reflects attitude on the theme.

5 Mueller, p. 37.
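A minimal sketch of Thurstone scoring in Python. The statement weights below are invented for illustration; the scoring rule, take the median (or mean) of the weights of the statements the subject marked, is the one described above.

```python
from statistics import mean, median

# Hypothetical Thurstone items: statement id -> scale value (weight).
weights = {1: 0.5, 2: 2.2, 3: 4.5, 4: 6.9, 5: 9.3, 6: 10.9}

def thurstone_score(endorsed, use_mean=False):
    """Score = median (or mean) of the weights of the endorsed statements."""
    vals = [weights[i] for i in endorsed]
    return mean(vals) if use_mean else median(vals)

# A subject who agrees with statements 1, 2, and 3:
print(thurstone_score([1, 2, 3]))  # median of 0.5, 2.2, 4.5 -> 2.2
```

The median is usually preferred because one stray endorsement of an extreme statement pulls a mean far more than a median.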
Q-Methodology
It is difficult to rank order more than ten statements. But rank ordering attitudinal statements is a good way to gather subjective data on a given sample. The Q-sort is a procedure for rank ordering a large number of statements. Rankings of statements by two or more groups can then be compared.

One version of the Q-sort uses a physical set of boxes, numbered 1 through 11. (This is the same arrangement as that described for weighting Thurstone items.) The procedure is usually applied when the number of statements to be ranked is greater than 40. The subject looks through a number of statements written on cards. Each card contains one statement. The first time through, the subject selects the statement he agrees with the most. That item goes into box 1. The subject then goes through the cards a second time and selects the statement he agrees with the least. This card is placed in box 11. The next time through the cards, the subject selects the two cards he agrees with the most, and places them into box 2. Then he chooses the two cards he agrees with least and places them in box 10. Then 4 cards go into box 3 and 4 cards into box 9, and so forth, until he is left with the middle box (#6). All the remaining statements are placed in it. The researcher then assigns point values for each statement, 1-11, based upon the box into which it was placed. After all subjects have placed the statements, averages are computed. Rank order statements for the group on the basis of their average values.
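The averaging-and-ranking step at the end of the Q-sort can be sketched as follows. The subjects, statement labels, and box placements are hypothetical; the logic, average the box number each statement received, then rank statements from lowest (most agreed-with) to highest average, follows the procedure above.

```python
from collections import defaultdict

# Hypothetical Q-sort results: subject -> {statement label: box number (1-11)}.
placements = {
    "subject1": {"A": 1, "B": 2, "C": 6, "D": 10, "E": 11},
    "subject2": {"A": 2, "B": 1, "C": 6, "D": 11, "E": 10},
    "subject3": {"A": 1, "B": 3, "C": 7, "D": 9,  "E": 11},
}

# Collect every box value each statement received across subjects.
boxes_by_statement = defaultdict(list)
for boxes in placements.values():
    for statement, box in boxes.items():
        boxes_by_statement[statement].append(box)

# Average box value per statement.
averages = {s: sum(v) / len(v) for s, v in boxes_by_statement.items()}

# Rank from most agreed-with (lowest average box) to least agreed-with.
ranking = sorted(averages, key=averages.get)
print(ranking)
```

The same averages computed separately for two groups of subjects would let the researcher compare how the groups rank the statements.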
Semantic Differential
The semantic differential provides information on differences (differential) in word usage (semantics) in subjects. Osgood, Suci, and Tannenbaum wrote the classic work on using the semantic differential, entitled The Measurement of Meaning.1 The book is a detailed analysis of this powerful technique. We simply introduce the procedure here.

Osgood and his colleagues isolated three major dimensions of word meanings through the use of factor analysis. These dimensions are evaluative (good or bad), potency (strong or weak) and activity (fast or slow). Their book contains hundreds of adjective pairs relating to these three dimensions.

A subject is presented a sheet of paper with a single word or term at the top. Below this word are a number of adjectival pairs, separated by seven blanks. For example, the meanings associated with the term "my church" might be formatted like this:

My Church
valuable  __ : __ : __ : __ : __ : __ : __  worthless
clean     __ : __ : __ : __ : __ : __ : __  dirty
bad       __ : __ : __ : __ : __ : __ : __  good
unfair    __ : __ : __ : __ : __ : __ : __  fair
large     __ : __ : __ : __ : __ : __ : __  small
strong    __ : __ : __ : __ : __ : __ : __  weak
deep      __ : __ : __ : __ : __ : __ : __  shallow
fast      __ : __ : __ : __ : __ : __ : __  slow
active    __ : __ : __ : __ : __ : __ : __  passive
hot       __ : __ : __ : __ : __ : __ : __  cold
          (1)  (2)  (3)  (4)  (5)  (6)  (7)

The first four adjective pairs measure the evaluative dimension; the next three measure potency; and the last three measure activity. The numbers shown above are not printed on the instrument, but are shown here to help clarify the scoring procedure. Pairs which are reversed should be scored in reverse, so that positive is always (1) and negative (7) regardless of which side of the scale they appear. Subjects check one blank between each pair indicating their opinion of the term on this scale. Blanks are scored 1-7, providing a numerical score for the meaning of the term in each dimension. Groups of subjects can then be compared on the three dimensions of meaning for any commonly used word. (Note: the numbering scale 1-7 is true only if the positive term is on the left; otherwise the scale is labelled 7-1.) Results can be plotted in three dimensions to provide a picture of semantic differences between two or more groups of subjects.

1 Charles E. Osgood, George J. Suci, and Percy H. Tannenbaum, The Measurement of Meaning (Urbana: University of Illinois Press, 1957).
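Scoring the "My Church" form, including the reverse-scored pairs, can be sketched in Python. The marks below are a hypothetical completed form; the scoring rule (blanks counted 1-7 from the left, reversed pairs rescored so 1 is always the positive pole, then averaged within each dimension) follows the description above.

```python
# The ten pairs of the sample instrument:
# (left adjective, right adjective, dimension, reversed?)
# A pair is "reversed" when its negative pole appears on the left.
pairs = [
    ("valuable", "worthless", "evaluative", False),
    ("clean",    "dirty",     "evaluative", False),
    ("bad",      "good",      "evaluative", True),
    ("unfair",   "fair",      "evaluative", True),
    ("large",    "small",     "potency",    False),
    ("strong",   "weak",      "potency",    False),
    ("deep",     "shallow",   "potency",    False),
    ("fast",     "slow",      "activity",   False),
    ("active",   "passive",   "activity",   False),
    ("hot",      "cold",      "activity",   False),
]

def dimension_scores(marks):
    """marks[i] is the blank checked (1-7, counted from the left).
    Reversed pairs are rescored (8 - mark) so 1 is always positive."""
    sums, counts = {}, {}
    for (_, _, dim, is_reversed), mark in zip(pairs, marks):
        score = 8 - mark if is_reversed else mark
        sums[dim] = sums.get(dim, 0) + score
        counts[dim] = counts.get(dim, 0) + 1
    return {dim: sums[dim] / counts[dim] for dim in sums}

# Hypothetical subject: one mark per pair, in the order listed above.
marks = [1, 2, 7, 6, 2, 1, 3, 4, 2, 3]
print(dimension_scores(marks))
```

Averaging the dimension scores across a whole group gives the three coordinates for plotting that group's meaning of the term.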
Delphi Technique

Pairs of statements are created for each major concern. Randomly select an equal number of positive and negative statements for inclusion in the Delphi instrument. Construct an instrument in which statements are randomly listed. Associate each with a Likert-type response: "Strongly Agree" . . . "Strongly Disagree." Duplicate the instrument and send it to all youth teachers in Tarrant Association. Each teacher will read the statements and mark his or her degree of agreement (or disagreement) with each statement. Completed forms will be returned to the researcher by means of self-addressed and stamped envelopes.

Score forms just like a Likert scale. Scores for each statement produce a mean for the entire group. Means (and their associated statements) will then be ranked. From this ranking, the researcher can determine how the group responded to the "major concerns" submitted by individuals earlier. These will either be reinforced by agreement by the entire group (major concerns, indeed!), or they will be identified as isolated concerns not shared by the group. The Delphi Technique is a powerful way to allow a group of subjects to create their own attitude statements, and then measure the strength (or lack) of support by the whole group for the statements generated by the process.1

1 Procedure described by Dr. John Curry, University of North Texas, EDER 601, Fall 1983.
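The ranking step of the Delphi round can be sketched as follows. The statement labels and responses are hypothetical; the logic, a mean agreement score per statement across the whole group, then a ranking of the means, follows the procedure above.

```python
# Hypothetical Delphi round: rows are teachers, columns are statements,
# cells are Likert points (1 = strongly disagree ... 4 = strongly agree).
statements = ["discipline", "curriculum", "outreach", "parent support"]
responses = [
    [4, 2, 3, 1],
    [3, 2, 4, 2],
    [4, 1, 3, 1],
    [4, 2, 4, 2],
]

# Mean agreement for each statement across the whole group.
means = {s: sum(row[i] for row in responses) / len(responses)
         for i, s in enumerate(statements)}

# Rank from most to least supported; high means mark shared "major concerns."
ranked = sorted(means, key=means.get, reverse=True)
print(ranked)
```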
Summary
In this chapter we have introduced ways researchers measure attitudes. We have emphasized the Likert and Thurstone scales, the Q-Sort, and the Semantic Differential. These are but a sampling of procedures available to you to measure the subjective characteristics of groups.
Vocabulary
Evaluative: A scale in the semantic differential which measures good-bad
Likert scale: Attitude scale which uses + and - equally weighted statements
Potency: A scale in the semantic differential which measures strong-weak
Q-sort: Method for rank ordering a large number of attitudinal statements
Activity: A scale in the semantic differential which measures fast-slow
Semantic Differential: An attitude scale which measures differences in word meanings
Thurstone scale: Attitude scale which uses weighted statements
Study Questions
1. Define attitude scale.
2. Compare and contrast the Likert and Thurstone attitude scales.
3. What applications would be appropriate for the semantic differential in Christian research? Likert scale? Thurstone scale? Delphi Technique?
Scale Value  Statement
1.0   I am intensely interested in education.
10.0  I go to school only because I am compelled to do so.
4.2   I am interested in education but one shouldn't get too concerned about it.
6.4   I like reading thrillers and playing games better than studying.
0.5   Education is of first rate importance in the life of man.
5.4   Sometimes I feel education is necessary and sometimes I doubt it.
6.9   I wouldn't work at studying so hard if I didn't have to pass exams.
8.4   Education tends to make people snobs.
10.1  I think time spent studying is wasted.
7.9   It is better to start a career at age 18 than to go to college.
5.7   It is doubtful that education has helped the world.
10.9  I have no desire to have anything to do with education.
1.3   We cannot become good citizens unless we are educated.
2.2   More money should be spent on education.
3.7   I think my education will be of use to me after I leave school.
3.0   I always read newspaper articles on education.
9.3   Education does more harm than good.
11.4  I see no value in education.
3.3   Education allows us to live a less monotonous life.
7.4   I dislike education because time has to be spent on homework.
4.5   I like the subjects taught in school but do not like attending school.
10.5  Education is doing more harm than good.
2.3   Lack of education is the source of all evil.
0.3   Education enables us to make the best possible use of our lives.
1.2   Only educated people can enjoy life to the full.
2.7   Education does more good than harm.
7.1   I do not like school teachers so I somewhat dislike education.
4.9   Education is alright in moderation.
5.8   It is enough that we should be taught to read, write and do sums.
8.9   I do not care about education so long as I can live comfortably.
9.9   Education makes people forget God and despise Christianity.
1.8   Education is an excellent character builder.
8.6   Too much money is spent on education.
6.7   If anything, I must admit to a slight dislike of education.
Chapter 13
Experimental Designs
What is Experimental Research?
Internal Invalidity
External Invalidity
Types of Designs
We've previously discussed aspects of three dissertations which embraced an experimental design. My 1978 Southwestern dissertation compared three approaches to teaching adults in a local Southern Baptist church: Skinnerian behaviorism, Brunerian cognitivism, and an eclectic blend of the two. In 1989 Dr. Stephen Tam compared three approaches to teaching Chinese students in a Hong Kong seminary: interactivity, gaming, and lecture. In 1994 Dr. Mark Cook studied the role of active participation in adult learning in a local church.1
A good experiment is one that confines the variation of measurement scores to variation caused by the treatment itself. The hindrances to good research design are called sources of experimental invalidity. These sources fall under two major subdivisions: internal invalidity and external invalidity. Let's define further these sources of experimental invalidity.2
History
Maturation
Testing
Instrumentation
Regression
Selection
Mortality
Interaction
John Henry
Diffusion
Internal Invalidity
Internal invalidity asks the question, "Are the measurements I make on my dependent variable (i.e., the variable I measure) influenced only by the treatment, or are there other influences which change it?" An experimental design suffers from internal invalidity when the other influences, called extraneous sources of variation, have not been controlled by the researcher. When extraneous variables have been controlled, researchers can be reasonably sure that post-treatment measurements are influenced by the experimental treatment, and not by extraneous variables. Donald Campbell and Julian Stanley wrote a chapter of a text on research designs that has become a classic in the field.3 In this chapter they list eight extraneous variables: history, maturation, testing, instrumentation, statistical regression, differential selection, experimental mortality, and selection-maturation interaction. Borg and Gall list two more: the John Henry effect and experimental treatment diffusion.4
History
History refers to events other than the treatment that occur during the course of an experiment which may influence the post-treatment measure of treatment effect. If the explosion of the nuclear reactor in Chernobyl, Ukraine had occurred in the middle of a six-month treatment to help people reduce their anxiety about nuclear power, it is likely that post-test anxiety scores would be higher than they would have been without the disaster. History does not refer to the background of the subject. Since history is an internal source of invalidity, its influence must occur during the experiment. If you study two groups, one which receives the treatment and a similar one which does not, you control for history (which is why this second group is called a control group) since both groups are statistically5 affected the same way by events outside the experiment. Any differences between the two groups at the end of the experiment could reasonably be linked to the treatment.
Maturation
Subjects change over the course of an experiment. These changes can be physical, mental, emotional, or spiritual. Perspective can change. The natural process of human growth can result in changes in post-test scores quite apart from the treatment. Question: How would a control group control this source of internal invalidity?6
2 I use the term invalidity to differentiate this concept from test validity discussed in Chapter 8. Be careful, however. Many texts use the terms experimental validity and test validity.
3 Donald T. Campbell and Julian C. Stanley, "Experimental and Quasi-experimental Designs for Research on Teaching," in Handbook of Research on Teaching, ed. N. L. Gage (Chicago: Rand McNally, 1963).
4 Borg and Gall, 635-637.
5 Individuals might be affected, but the groups will not significantly differ from each other.
6 Subjects in both groups will mature, on average, the same.
Testing
A common research design is to give a group a pre-test, a treatment, and then a post-test (see p. 13-6). If you use the same test both times, the group may show an improvement simply because of their experience with the test. This is especially true when the treatment period is short and the tests are given within a short time. Unless you must specifically measure changes during the experiment -- requiring testing before and after the treatment -- it is better to give only a post-test. Randomly assign subjects to groups to render the dependent variable (as well as all others!) statistically equal at the beginning of the study.
Instrumentation
In the previous section we discussed the problem of using the same test twice in pre- and post-measurements. But if you use different tests for pre- and post-measurements, then the change in pre- and post-scores may be due to differences between the tests rather than the treatment. The best remedy, as we have already discussed, is to use randomization and a post-test only design. But if you must have pre-test scores (because you are using intact groups and need to know whether the groups are equivalent, or because you want to study changes over time), then you must develop equivalent tests using the parallel forms techniques discussed in Chapter Eight. How does use of a control group relate to instrumentation?7
Statistical regression
Set a glass of cold milk and a hot cup of coffee on a table. Over time, the cold milk will get warmer and the hot coffee colder. They both regress toward the room temperature. Statistical regression refers to the tendency of extreme scores, whether low or high, to move toward the average on a second testing. Subjects who score very high or very low on one test will probably score less high or low when they take the test again. That is, they regress toward the mean. Let's say you are analyzing how much a particular reading enrichment program enhances the reading skills of 3rd grade children. You give a reading skills test and select for your experiment every child who scores in the bottom third of the group. You provide a three-month treatment of reading enrichment, and then measure the reading ability of the group. On the basis of the scores on the children's first and second tests, you find that reading skills improved significantly. What, in your opinion, is wrong with this study?8 Do not study groups formed from extreme scores. Study the full range of scores. The question we need to answer is: Does the reading enrichment program significantly improve reading skills of randomly selected subjects over a control group?
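Regression toward the mean is easy to demonstrate by simulation. The sketch below (entirely hypothetical numbers) models each child's test score as true skill plus random error, selects the bottom third on the first testing, and retests with no treatment at all; the selected group's average still "improves."

```python
import random

random.seed(1)  # fixed seed so the run is repeatable

# Simulate a reading test: true skill plus random measurement error.
true_skill = [random.gauss(50, 10) for _ in range(3000)]
test1 = [t + random.gauss(0, 10) for t in true_skill]
test2 = [t + random.gauss(0, 10) for t in true_skill]  # no treatment given

# Select the bottom third on the first testing, as in the flawed study.
cutoff = sorted(test1)[len(test1) // 3]
bottom = [i for i, score in enumerate(test1) if score < cutoff]

mean1 = sum(test1[i] for i in bottom) / len(bottom)
mean2 = sum(test2[i] for i in bottom) / len(bottom)
print(round(mean1, 1), round(mean2, 1))  # the group "improves" untreated
```

The apparent gain is pure statistical regression: the extreme group's first scores were partly extreme error, which does not repeat on the second testing.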
Differential selection
If we select groups for treatment and control differently, then the results may be due to the differences between groups before treatment. Say you select high school seniors who volunteer for a special Bible study program as your treatment group, and compare their scores with a control group of high school seniors who did not volunteer. Do your post-test scores measure the effect of the Bible study treatment, or the differences between volunteers and non-volunteers? You cannot say. Randomization solves this problem by statistically equating groups.

7 Even if tests are not equivalent, both experimental and control groups answer the same test. This controls for the effects of instrumentation on the treatment group. It isolates treatment group changes to the given treatment.
8 The group would have scored, on average, better on the second testing regardless of the treatment, simply due to statistical regression. In addition, there is no control by which to measure the treatment.
Experimental mortality
Experimental mortality, also called attrition, refers to the loss of subjects from the experiment. If there is a systematic bias in the subjects who drop out, then post-test scores will be biased. For example, if subjects drop out because they are aware that they're not improving as they should, then the post-test scores of all those who complete the treatment will be positively biased. Your results will appear more favorable than they really are. How does use of a control group solve the problem of attrition?9
Treatment diffusion
Similar to the John Henry effect is treatment diffusion. If subjects in the control group perceive the treatment as very desirable, they may try to find out what's being done. For example, a sample of church members is selected to use an innovative program of discipleship training, while the control group uses a traditional approach. Over the course of the experiment, some of the materials of the treatment group may be borrowed by the control group members. Over time, the treatment diffuses to the control group, minimizing the treatment effect. This often happens when the groups are in close proximity (members of the same church, for example). Both the John Henry Effect and Treatment Diffusion can be controlled if experimental and control groups are isolated.
9 Subjects will tend to drop out of both treatment and control groups equally. Those who remain in both groups provide a better picture of "difference" than before-and-after type designs.
External Invalidity
External invalidity asks, How confidently can I generalize my experimental findings to the world? Sources of external invalidity cause changes in the experimental groups so that they no longer reflect the population from which they were drawn. The whole point of inferential research is to secure representative samples to study so that inferences can be made back to the population from which the samples were drawn (Chapter Seven). External invalidity hinders the ability to infer back. Campbell and Stanley list four sources of external invalidity: the reactive effects of testing, the interaction of treatment and subject, the interaction of testing and subject, and multiple treatment interference.
Effects of Testing
Treatment & Subject
Testing & Subject
Multiple Treatments
Summary
Designing an experiment that produces reliable, valid, and objective data is not easy. But experimental research is the only direct way to measure cause-and-effect relationships among variables. What a help it would be to Kingdom service if we could develop effective experimental researchers who are also committed ministers of the Gospel -- learning from direct research how to teach and counsel and manage and serve in ways that directly enhance our ministry.
Types of Designs
The following is a summary of some of the more important designs of Campbell and Stanley. I will briefly describe the design, give an example of how the design would be used in a research study, and indicate possible sources of internal and external invalidity. In the design diagrams which follow, a test is designated by O, a treatment by X, and randomization by an R.
Pre-Test Post-Test Control Group Design

R   O1   X   O2
R   O3        O4
Example. Third graders are randomly assigned to two groups and tested for knowledge of Paul. Then one group gets a special Bible study on Paul. Both are then tested again.
Analysis. The t-test for independent samples (Chapter 20) can be used to determine if there is a significant difference between the average scores of the groups (O2 and O4). You can also compute gain scores (O2 - O1 and O4 - O3) and test the significance of the average gain scores with the matched samples t-test.
Comments. This design's only weakness is pre-test sensitization and the possible interaction between pretest and treatment.
13-6
Post-Test Only Control Group Design

R   X   O1
R        O2

Example. Third graders are randomly assigned to two groups. Then one group receives a special study on the life of Paul (no pre-test). Both are tested on their knowledge of Paul at the conclusion of the study.
Analysis. The difference between group means (O1 and O2) can be computed by an independent groups t-test. [Other procedures that can be used include one-way ANOVA (though usually used with three or more groups - see Chapter 24), and the ordinal procedures Wilcoxon Rank Sum test or Mann-Whitney U (see Chapter 21). We'll discuss these later.]
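The independent-groups t-test named in the analysis can be sketched with a pooled-variance implementation in Python. The post-test scores below are invented; the formula (difference in means over the pooled standard error) is the standard one this text develops in Chapter 20.

```python
from math import sqrt
from statistics import mean, variance

def independent_t(group1, group2):
    """Pooled-variance t statistic for two independent groups."""
    n1, n2 = len(group1), len(group2)
    m1, m2 = mean(group1), mean(group2)
    # Pool the two sample variances, weighted by degrees of freedom.
    sp2 = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / (n1 + n2 - 2)
    t = (m1 - m2) / sqrt(sp2 * (1 / n1 + 1 / n2))
    return t, n1 + n2 - 2   # t statistic and its degrees of freedom

# Hypothetical post-test scores on knowledge of Paul.
o1 = [78, 85, 90, 72, 88, 81, 79, 86]   # treatment group
o2 = [70, 75, 68, 74, 72, 71, 77, 69]   # control group
t, df = independent_t(o1, o2)
print(round(t, 2), df)  # compare t against the critical value for df = 14
```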
Solomon Four-Group
Subjects are randomly selected and assigned to one of four groups. Group 1 is tested before and after receiving the treatment; Group 2 is tested before and after receiving no treatment; Group 3 is tested only after receiving the treatment; and Group 4 is tested after receiving no treatment.
Group 1:   R   O1   X   O2
Group 2:   R   O3        O4
Group 3:   R        X   O5
Group 4:   R             O6
The Solomon design is actually a combination of the Pre-Test Post-Test Design (groups 1 and 2) and the Post-Test Only design (groups 3 and 4). Look!
Group 1:   R   O1   X   O2
Group 2:   R   O3        O4
Group 3:   R        X   O5
Group 4:   R             O6
Example. Third graders are randomly assigned to 1 of 4 groups. The knowledge of Paul is measured in groups 1 and 2. Groups 1 and 3 are given a special study on the life of Paul. When the special study is over, all four groups are tested.
Analysis. One-way ANOVA can be used to test the differences in the four post-test mean scores (O2, O4, O5, O6). The effects of the pretest can be analyzed by applying a t-test to the means of O4 (pretest but no treatment) and O6 (neither pretest nor treatment). The effects of the treatment can be analyzed by applying a t-test to the means of O5 (treatment but no pretest) and O6 (neither pretest nor treatment). Subject maturation can be analyzed by comparing the combined means of O1 and O3 against O6.
Comments. The Solomon Four-Group design provides several ways to analyze data and control sources of extraneous variability. Its major drawback is the large number of subjects required. Since each group needs to contain at least 30 subjects, one experiment would require 120 subjects.
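The one-way ANOVA step can be sketched as a short function computing the F ratio from the between-groups and within-groups sums of squares. The four groups of post-test scores below are invented to stand for O2, O4, O5, and O6.

```python
from statistics import mean

def one_way_anova(*groups):
    """F ratio: between-groups mean square over within-groups mean square."""
    all_scores = [x for g in groups for x in g]
    grand = mean(all_scores)
    k, n = len(groups), len(all_scores)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_b, df_w = k - 1, n - k
    f = (ss_between / df_b) / (ss_within / df_w)
    return f, df_b, df_w

# Hypothetical post-test scores for the four Solomon groups.
g1 = [85, 88, 90, 84]   # pretest + treatment (O2)
g2 = [72, 70, 74, 73]   # pretest only (O4)
g3 = [86, 89, 83, 87]   # treatment only (O5)
g4 = [71, 73, 69, 72]   # neither (O6)
f, df_b, df_w = one_way_anova(g1, g2, g3, g4)
print(round(f, 1), df_b, df_w)
```

A large F (here the treated groups clearly outscore the untreated ones) is then compared against the critical F value for (df_b, df_w).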
4th ed. 2006 Dr. Rick Yount
Quasi-experimental Designs
The term quasi- (pronounced kwahz-eye) means almost, near, partial, pseudo, or somewhat. Quasi-experimental designs are used when true experiments cannot be done. A common problem in educational research is the unwillingness of educational administrators to allow the random selection of students out of classes for experimental samples. Without randomization, there are no true experiments. So, several designs have been developed for these situations that are almost true experiments, or quasi-experimental designs. We'll look at three: the time series, the nonequivalent control group design, and the counterbalanced design.
Time Series
Establish a baseline measure of subjects by administering a series of tests over time (O1 through O4 in this case). Expose the group to the treatment and then measure the subjects with another series of tests (e.g., O5 through O8).
O1   O2   O3   O4   X   O5   O6   O7   O8
Example. A class of third graders is given several tests on Paul before having a special study on him. Several tests are given after the special study is finished.
Analysis. I could say something like "data is analyzed by trend analysis for correlated data on n subjects under k conditions (linear and polynomial), or the monotonic trend test for correlated samples," but let me simply say that data analysis is much more complex with a time series design. An effective visual analysis can be made by graphing the group's mean scores on each test over time. Important changes in the group can easily be attributed to the treatment by the shape of the line. One could also average the pre-treatment scores and the post-treatment scores, and apply a t-test for matched samples to the averages!
Comments. Since there is no control group, one cannot determine the effects of history on the test scores. Instrumentation may also be a problem (Are the tests equivalent?). Beyond these internal validity problems, the reactive effects of repeated testing of subjects is a source of external invalidity.
Nonequivalent Control Group Design

O1   X   O2
---------------
O3        O4
Example. Two intact third grade classes (no random selection) are tested on their knowledge of Paul before and after one of them receives a special study on the life of Paul.
Analysis. One approach to measuring the significance of difference between the two groups is to compute gain scores. This is done by subtracting the pre-test score from the post-test score for each subject. Use gain scores to compute average gain for each group. Test whether the average gains are significantly different by the t-test for independent samples. Another approach is to use the pre-test scores as a covariate measure to adjust the post-test means. Analysis of covariance (see Chapter 25) is the procedure to use.
Comments. This design should be used only when random assignment is impossible. It does not control for selection-maturation interaction and may present problems with statistical regression. Beyond these internal sources of invalidity, this design suffers from pretest sensitization.
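The gain-score computation is simple enough to sketch directly. The pre/post score pairs below are hypothetical; each subject's gain is post-test minus pre-test, and the two groups' average gains are what would then go into the independent-samples t-test.

```python
# Hypothetical (pre-test, post-test) pairs for two intact classes.
class1 = [(60, 78), (55, 74), (62, 80), (58, 77)]   # received the special study
class2 = [(59, 63), (61, 64), (57, 60), (60, 66)]   # no treatment

# Gain score = post-test minus pre-test, computed per subject.
gains1 = [post - pre for pre, post in class1]
gains2 = [post - pre for pre, post in class2]

avg_gain1 = sum(gains1) / len(gains1)
avg_gain2 = sum(gains2) / len(gains2)
print(gains1, gains2, avg_gain1 - avg_gain2)
```

The two lists of gains, not the raw scores, are the samples submitted to the independent-samples t-test.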
Counterbalanced Design
Subjects are not randomly selected, but are used in intact groups. Group 1 receives treatment 1 and test 1. Then at a later time, they receive treatment 2 and test 2. Group 2 receives treatment 2 first and then treatment one.
           Time 1      Time 2
Group 1:   X1   O      X2   O
Group 2:   X2   O      X1   O
Example. Two third grade classes receive two special studies on Paul: one in the classroom and the other on a computer. Class 1 does the classroom work first, followed by the computer; class 2 does the computer work first. Both groups are tested after both treatments. Analysis. Use the Latin Squares analysis (beyond the scope of this text). Comments. Since randomization is not used in this design, selection-maturation interaction may be a problem. Multiple treatment effect is a possible source of external invalidity.
Pre-experimental Designs
Pre-experimental designs should not be considered true experiments, and are not appropriate for formal research. I include them so that you can contrast them with the better designs. Data collected with these designs is highly suspect. We will consider the One Shot Case Study design, the One Group Pretest Posttest design, and the Static Group comparison design.
One-Shot Case Study

X   O

Example. A third grade class is provided a special Bible study course on Paul, after which their knowledge of Paul is tested.
Analysis. Very little analysis can be done because there is nothing to compare the posttest against and no basis to determine what influence the treatment had. Comments. None of the sources of internal or external invalidity are controlled by this design. It suffers most in the areas of history, maturation, regression, and differential selection. It also suffers from the external source of treatment and subject. The design is useless for most practical purposes because of numerous uncontrolled sources of difference.
One-Group Pretest/Posttest
A single intact group is tested before and after a treatment. O1 X O2
Example. A group of third graders is tested on knowledge of Paul before and after a special study on the life of Paul. Analysis. Test the difference between the pre-test and post-test means using the matched sample t-test (see Chapter 20) or the Wilcoxon matched-pairs signed-rank test (see Chapter 21). Comments. Problems abound with history, maturation, testing, instrumentation, and selection-maturation interaction. The reactive effects of pre- and post-tests and of treatment and subject are external sources of invalidity.
Static-Group Comparison
Two intact groups are tested after one has received the treatment.

X   O
-----------
    O

Example. Two classes of third graders are tested on their knowledge of Paul after one of them has had the special Bible study. Analysis. Determine whether there is a significant difference between post-test means by using the t-test for independent samples (Chapter 20) or the Mann-Whitney U nonparametric test (Chapter 21). While these statistics will work, their results are meaningless, since there is no assurance that the groups were the same at the beginning of the treatment. Comments. This design suffers most from selection, attrition, and selection-maturation interaction problems. It also fails to control the external invalidity source of treatment and subject.
Chapter 13
Experimental Designs
Summary
This chapter introduced you to the world of experimental research design. The concepts of internal and external validity, randomization, and control are essential to constructing experiments which provide valid data. Experimental research is the only type which can establish cause-and-effect relationships between variables.
Vocabulary
control group -- representative sample which does not receive the treatment
differential selection -- subjects selected for samples in a non-random manner, i.e., in "different ways"
experimental mortality -- loss of subjects from the study
external invalidity -- flaw which prevents experimental results from being generalized to the original population
history -- events during the experiment which influence scores on the post-test
instrumentation -- differences in subject scores due to differences in tests used
interaction of testing/subject -- subjects may react to tests unpredictably (generalization?)
interaction of treatment/subject -- subjects may react to treatment unpredictably (generalization?)
internal invalidity -- condition which alters measurements within the experiment
John Henry effect -- control group tries harder (distorting the results)
maturation -- change in subjects over the course of the experiment
posttest sensitization -- posttest changes subjects: they put it all together and score higher than they normally would
pretest sensitization -- pretest changes subjects: an advance organizer that prepares subjects for the treatment
selection-maturation interaction -- samples of subjects may mature differently
statistical regression -- top- and bottom-scoring subjects move toward the average on a second test
testing -- source of internal invalidity: improvement due to (different) tests, not treatment
treatment diffusion -- source of internal invalidity: treatment leaked to the control group
true experimental research -- design which involves random selection and random assignment
Study Questions
1. Define internal and external invalidity.
2. Explain the ten sources of internal invalidity and the four sources of external invalidity.
3. What is required for a research design to be true experimental? Why?
Chapter 15

Distributions and Graphs
Creating an Ungrouped Frequency Distribution
Creating a Grouped Frequency Distribution
Visualizing the Distribution: the Histogram
Visualizing the Distribution: the Frequency Polygon
Common Distribution Shapes
Distribution-Free Data
The end of the research part of a study comes after the data has been collected through tests, attitude scales, questionnaires, or other instruments. Raw data presents us with an incomprehensible mass of numbers. The first step in statistical analysis is to reduce this incomprehensible mass into meaningful forms. This is done by using frequency distributions and associated graphs. In this chapter we'll look at several ways to organize data so that you can see its meaning. We will look at both ungrouped and grouped frequency distributions.
As you can see, this collection of numbers makes little sense as it is. But we can organize and summarize the data in such a way as to make it meaningful. Let's start by rank-ordering the numbers from high (109) to low (44).
109  105  104  100   99   98   97   97   95   95
 93   91   90   89   84   84   83   82   81   80
 78   75   75   75   74   72   71   70   69   68
 66   62   59   59   58   51   47   44
This ranking helps us to see where any given score falls along the whole range of scores. But the list is still rather long and difficult to manage. Let's now go through the list and count the number of times each score occurs. This is the score's frequency, represented by the letter f.
Score   f      Score   f      Score   f
109     1      89      1      70      1
105     1      84      2      69      1
104     1      83      1      68      1
100     1      82      1      66      1
99      1      81      1      62      1
98      1      80      1      59      2
97      2      78      1      58      1
95      2      75      3      51      1
93      1      74      1      47      1
91      1      72      1      44      1
90      1      71      1
The ungrouped frequency distribution above removes the redundancy of repeating scores. But the large number of single scores (f=1) still confuses the picture. If we were to group ranges of scores together in classes, we would get a better picture of the data. Grouping scores into classes produces a grouped frequency distribution.
66 / 10 = 6.6

We need to round up or down to a whole number. Odd class widths are better than even ones because the midpoint of an odd-width class is a whole number. So let's round up to 7. (In this context, we would even round a number like 6.1 up to 7.) The distribution will have a class interval (i) of 7.
This grouped frequency distribution reveals much more about the Bible knowledge of these high school seniors than we could discern in the previous listings. On the down side, by grouping our scores into classes, we actually lose some detail. But "losing detail" is necessary when the aim is to derive meaning from the numbers. We can condense our scores even more by increasing the class width i. Let's look at a frequency distribution of the same data with i = 14.
Class     Tally               f
98-111    ///// /             6
84-97     ///// /////        10
70-83     ///// ///// //     12
56-69     ///// //            7
42-55     ///                 3
                          n = 38
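If you would like to verify this grouped distribution by computer, here is a minimal Python sketch that rebuilds the i = 14 table from the 38 raw scores:

```python
# Build the i = 14 grouped frequency distribution from the raw scores.
scores = [109, 105, 104, 100, 99, 98, 97, 97, 95, 95, 93, 91, 90, 89,
          84, 84, 83, 82, 81, 80, 78, 75, 75, 75, 74, 72, 71, 70, 69,
          68, 66, 62, 59, 59, 58, 51, 47, 44]

i = 14                       # class width
lowest = 42                  # lower limit of the bottom class
dist = []                    # (lower limit, upper limit, frequency), low to high
for lo in range(lowest, max(scores) + 1, i):
    hi = lo + i - 1          # upper limit of this class
    f = sum(lo <= s <= hi for s in scores)
    dist.append((lo, hi, f))
```

The frequencies come out 3, 7, 12, 10, and 6 for the classes from 42-55 up to 98-111, matching the table.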
This last table gives a smoother picture of the data set, though we notice the loss of more detail because we reduced the number of classes. Frequency distributions certainly simplify data sets, but we can present the data even more clearly by graphing the frequency distributions.
X- and Y-axes
A graph is composed of a vertical line, called the ordinate or Y-axis, and a horizontal line, called the abscissa or X-axis. These two lines intersect to form a right angle. By convention, the Y-axis should be three-fourths the length of the X-axis. Axis is pronounced AX-is; axes is pronounced AX-ees.
Scaled Axes
Numbers are placed on the X- and Y-axes at equal intervals to represent the scale values of the variable being graphed. In a graph of a grouped frequency distribution, the X-axis is scaled by the range and class intervals, the Y-axis is scaled by frequency. There are two major graph types used to display information from a grouped frequency distribution. The first is the histogram and the other is the frequency polygon.
Histogram
A histogram (HISS-ta-gram) is a special type of bar graph. The widths of the bars equal the class interval and the heights of the bars equal the class frequencies. Let's use the example data to build a histogram with a range of 44-111 and class width (i) of 7, using the frequencies from the ten-class grouped distribution described earlier. Look at the graph at left. Class limits are listed along the X-axis. The widths of all classes equal 7. The height of each bar equals the frequency of scores contained in each category. The shape of the graph provides us a clear and meaningful picture of the entire data set. Then we reduced the number of categories from ten to five (increased i from 7 to 14). The graph at
left shows the effect of reducing the number of classes. Irregularities have been smoothed out, but some of the more specific (irregular) data has been glossed over. Choosing class width and the number of classes is a trial and error process. Our goal is to reflect the shape of the data as clearly as possible while attaining as much precision as possible.
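A crude text version of the five-class histogram can even be printed directly from the frequency table, one mark per score:

```python
# Print a rough character histogram of the five-class distribution
# (one '#' per score in each class).
dist = [(42, 55, 3), (56, 69, 7), (70, 83, 12), (84, 97, 10), (98, 111, 6)]
lines = [f"{lo:3d}-{hi:3d} | " + "#" * f for lo, hi, f in dist]
print("\n".join(lines))
```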
Frequency Polygon
By connecting the midpoints of the bars with lines, we produce a frequency polygon. The frequency polygon displays the same information as the histogram, but in a different form. The frequency polygon at right is based on the ten-class histogram on the previous page. If we remove the bars of the histogram, we obtain a frequency polygon graph, below right.
Distribution Shapes
The graphic image of a histogram or frequency polygon tells us at a glance the group profile of the data. The incomprehensibility of a set of numbers is transformed into a meaningful visual portrait. This visual portrait displays two special characteristics: kurtosis and skewness. The kurtosis of a curve describes how flat or peaked it is. The three basic profiles of kurtosis are platykurtic (flat), leptokurtic (peaked), and mesokurtic (balanced). A flat curve is called platykurtic. Think of the flatness of a plate and you'll remember platy-kurtic. Notice that there are low frequencies for all the categories. A peaked curve is called leptokurtic. Think of the central frequencies leaping away from the others and you'll remember leap-to-kurtic. Notice that the outer categories have lower frequencies while the central categories have high frequencies. A curve that falls between platykurtic and leptokurtic is called mesokurtic. Think of medium (meso-) and you'll remember meso-kurtic. The familiar bell-shaped curve is mesokurtic. The skewness of a curve describes how horizontally distorted a curve is from the familiar bell-shaped curve. A curve with negative skew has its left tail pulled outward to the left, toward the negative end of the scale. A curve with positive skew has its right tail pulled outward to the right, toward the positive end of the scale. A common mistake is to focus on the mound of scores rather than the distorted tail. Remember: the direction the tail is pulled is the direction of the skew. A distribution where all categories of scores have equal frequencies is called a rectangular distribution.
4th ed. 2006 Dr. Rick Yount
[Figures: platykurtic, leptokurtic, mesokurtic, negative skew, positive skew, and rectangular distribution shapes]
Distribution-Free Measures
Our discussion of distributions applies only to interval or ratio data, called parametric data. Two other types of statistics deal with non-parametric measures: ordinal (ranks) or nominal (counts) data. Non-parametric data is often called distribution-free. We will spend the next few chapters on parametric statistics, then take up the non-parametric types in Chapters 22, 23, and 24.
Summary
This chapter carried you through the first step in data analysis: reducing a series of chaotic numbers to orderly distributions and graphs. Before engaging in more sophisticated statistical procedures, you should initially analyze your data with these data reduction techniques. All good introductory statistics texts have chapters on data reduction techniques.
Vocabulary
Abscissa -- number along the horizontal (x-) axis of a graph
Class -- a subset of scores defined by upper and lower limits in a frequency distribution
Class width (i) -- distance between the upper and lower limits in a given class
Exponential curve -- line on a graph produced by an exponential equation such as y = e^x
Frequency (f) -- the number of scores in a given class
Frequency polygon -- graph that depicts class frequencies: uses class midpoints
Histogram -- graph that depicts class frequencies: uses class limits
Kurtosis -- amount of flatness (or peakedness) in a distribution of scores
Leptokurtic -- highly peaked distribution ("leaps up" in the middle)
Mesokurtic -- moderately peaked distribution (normal curve)
Midpoint -- halfway point between class limits in a given class: x'
Negative skew -- tail pulled left, toward the negative end of the scale
Non-parametric measures -- ranks or counts; ordinal or nominal; distribution-free
Ordinate -- number along the vertical (y-) axis
Parametric measures -- scales or tests; interval or ratio; normal distribution
Platykurtic -- flat distribution ("like a plate")
Positive skew -- tail pulled right, toward the positive end of the scale
Rectangular distribution -- all classes have the same frequency
Skew -- the degree a tail in a frequency distribution is pulled away from the mean
X-axis -- the horizontal axis in a graph
Y-axis -- the vertical axis in a graph
Study Question
Using the following data and the guidelines provided in this chapter:

89, 92, 83, 98, 98, 80, 89, 97, 83, 87, 86, 84, 97, 97, 99, 90, 95, 90, 91, 96, 95, 91, 91, 92, 94, 93, 94, 100

a) construct a grouped frequency distribution with i = 3.
b) construct a histogram of this distribution.
c) construct a frequency polygon of this distribution.
d) How would you describe this distribution? (What type?)
Chapter 16

Central Tendency and Variation
Measuring the Central Tendency of Data
Measuring the Variability of Data
Statistics and Parameters
The Standard (z-) Score
In the last chapter we considered a way to reduce a mass of numbers by creating a grouped frequency distribution and graphing it. The graph is a visual image of the data, and is an important first step in data analysis. In this chapter we develop basic concepts in reducing data numerically. A group of numbers has two primary numerical characteristics. The first is a central point about which they cluster, called the central tendency. The second is how tightly they cluster about that point, called variability.
The Mode
The mode is the most frequently occurring score in a set of scores.

82 82 83 83 84 85 86 87 87 87 88 90 95 99 99

The mode of the above set is 87 because it appears three times, more often than any other number in the set.

82 83 84 86 87 88 88 89 90 91 91 92 94 97 98

There are two modes above. The numbers 88 and 91 both appear twice. This is a bi-modal (two-mode) data set.

82 83 84 86 87 88 89 90 91 92 93 94 95 96 97

There is no mode for this distribution. No score occurs more frequently than any other. The mode is the most frequent score in a set of data.
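A frequency count makes the mode easy to find by computer. Here is a short Python sketch that handles all three cases above (one mode, two modes, no mode):

```python
# Find the mode(s) of a data set with a frequency count.
from collections import Counter

def modes(scores):
    counts = Counter(scores)
    top = max(counts.values())
    if top == 1:                 # every score occurs once: no mode
        return []
    return sorted(s for s, f in counts.items() if f == top)

unimodal = [82, 82, 83, 83, 84, 85, 86, 87, 87, 87, 88, 90, 95, 99, 99]
bimodal  = [82, 83, 84, 86, 87, 88, 88, 89, 90, 91, 91, 92, 94, 97, 98]
no_mode  = [82, 83, 84, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97]
```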
The Median
The median is the middlemost score. That is, it is the score that represents the
exact halfway point through the data set. The median score divides the set of data into two equal halves. Half of the scores fall below the median, and half fall above it.

2   3   4   5   8   9   34   56   67   100   356
                    ^
                 median

In the above set, the median is the 6th score, or the number 9. Five scores fall below 9, and five scores fall above 9. We can locate this score with the simple formula (N + 1) / 2, where N is the number of scores in the set. There are 11 scores (N = 11) in the data set above. Using the formula, we compute (11 + 1) / 2 = 6. The 6th score is the median, which is the number 9. Here is another example, this time with scores out of order:
34 23 67 4 8 17 2 78 99 5 178 3 1678
First, we rank order the numbers from low to high.
2 3 4 5 8 17 23 34 67 78 99 178 1678
Applying the formula, we compute (13+1)/2 = 7. We are looking for the 7th score. The 7th score is the number 23. 23 is the middle number, the median. Six scores fall above and six below this number. Here's an example with an even number of scores:
1 2 3 4 5 6 7 8 9 10 11 12 13 14
In this case, there are two middlemost values. The median for this data set is the average of the two middlemost values. Add the two middle values together and divide by 2. In our case, (7+8)/2 = 7.5. Notice that seven numbers fall below 7.5 (1-7) and seven numbers fall above it (8-14).
1   2   3   4   5   6   7   |   8   9   10   11   12   13   14
                     median = 7.5

The median is the middlemost score.
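The position rule (N + 1) / 2 translates directly into a short Python function covering both the odd and even cases:

```python
# Median as the middlemost score, using the (N + 1) / 2 position rule.
def median(scores):
    s = sorted(scores)           # rank order from low to high
    n = len(s)
    mid = (n + 1) // 2           # position of the middle score (1-based)
    if n % 2 == 1:
        return s[mid - 1]
    # Even N: average the two middlemost values.
    return (s[mid - 1] + s[mid]) / 2
```

For the 13-score example above, `median([34, 23, 67, 4, 8, 17, 2, 78, 99, 5, 178, 3, 1678])` returns 23, and for the scores 1 through 14 it returns 7.5.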
The Mean

1 2 3 4 5 6 7 8 9 10
The mean is found by adding these ten numbers together and dividing by the number of scores (N). We can represent the procedure for computing a mean in a shorter form by using symbols. You were introduced to the symbol Σ (capital sigma) in Chapter 14. The
symbol X (capital "X" or "Y" or "L" or any English letter) refers to scores. The letter N refers to the number of scores. And finally, the Greek letter μ (pronounced "myoo") represents the arithmetic mean of the scores. Using these letters to define the formula for the mean, we have the following:
μ = ΣX / N

Read the above formula like this: "mu equals the sum of X divided by N." Or, in English, the average value of a group of scores is the sum of those scores divided by the number of scores in the group. Let's use this formula on the following data set: 10 23 17 5 64 28 3

μ = ΣX / N = 150 / 7 = 21.43
The mean score of 21.43 represents the average value of all the individual scores in the group, and is the most important measure of central tendency due to its use in statistical analysis.
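As a quick check, the same computation in Python:

```python
# The mean: sum of scores divided by the number of scores (N).
scores = [10, 23, 17, 5, 64, 28, 3]
mu = sum(scores) / len(scores)    # 150 / 7
```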
Measures of Variability
The second essential characteristic of a group of scores is variability. Variability is a measure of how tightly a group of scores clusters about the mean. Scores can be tightly clustered or loosely clustered about the mean. Scores that tightly cluster about the mean have lower variability. Scores that loosely cluster, that are more spread out from the mean, have higher variability. There are three measures of variability. These are range, average deviation, and standard deviation.
Range
As we learned in the last chapter, the range of a group of scores is equal to the highest score minus the lowest score plus 1, or Range = Xmax - Xmin + 1. It is a crude
measure of variability, but is a useful first step in understanding a distribution. Let's look at an example. Class A took a midterm examination in research. The highest score in the class was 103 and the lowest was 48. The range was 103 - 48 + 1, or 56 points. Class B is the same size and took the same exam. Their highest and lowest scores were 95 and 67 respectively. Their range was 95 - 67 + 1, or 29 points. Therefore, the scores of Class B have lower variability (are more tightly clustered) than the scores of Class A. The problem with range is that it tells us nothing of the dispersion of scores between the high and low points. Classes C and D have the same ranges, but have different dispersions of scores. One way of getting at the dispersion of scores throughout the whole distribution is to measure the deviation of each score from the mean and then compute the average of all the deviations.
Average Deviation
A deviation score, symbolized by a lower-case x, is the difference between a score (X) and the mean (μ) of the distribution. When you subtract the mean of a group of scores from a specific score, you compute the deviation of that score from the mean. We can write this relationship simply as x = X - μ. The average deviation of a group of scores is computed by summing all the deviations in the group and dividing by N. Look at the following scores:
10 20 30 40 50
First, compute the mean of these scores: (150/5=30). Then compute the deviation scores (x) by subtracting the mean (30) from each score (X) like this: deviation scores (x)
raw score (X)    mean    deviation (x)
10               30      -20
20               30      -10
30               30        0
40               30       10
50               30       20
                 sum of deviations: Σx = 0
Notice that when we sum the deviations, we get 0 (Σx = 0). Why is the sum of deviations equal to zero? The mean is the balance point in a distribution. When two children of equal weight use a teeter-totter, the balance point is placed halfway between them, as in diagram A below left.
But when children of unequal weight use it, the board must be shifted so that the balance point, or fulcrum, is closer to the heavier child. This is shown in B below right. The heavier weight at a shorter distance on one side of the board balances the lighter weight at a longer distance on the other. Another way of saying this is that, for perfect balance, the moment of force (weight x distance) on one side equals the moment of force on the other. Subtract one from the other and the result is zero. This is what is meant in statistics when we say the mean is the fulcrum of a group of scores. Large deviations are like large distances from the fulcrum, and small deviations like small distances. (All scores weigh the same in this example.) The sum of deviations on one side of the mean will always cancel out, or balance, the sum of deviations on the other side of the mean. Therefore, Σx = 0. In order to compute the average deviation, we must take the absolute values of the deviations. An absolute value, symbolized |x|, equals the value of a number regardless of sign. So, the absolute value of -4 equals 4 (|-4| = 4). By taking the absolute values of the deviations, we make them all positive distances from the mean. Summing them, we produce a meaningful measure of "spreadedness" from the mean:
Average deviation = Σ|x| / N = 60 / 5 = 12

The average deviation equals 12. But average deviation has some mathematical limitations that cause problems in more advanced procedures. A better measure of variability, which also reflects the dispersion of scores throughout a distribution, is the standard deviation.
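The average deviation computation can be verified with a short Python sketch:

```python
# Average deviation: mean of the absolute deviations from the mean.
scores = [10, 20, 30, 40, 50]
mu = sum(scores) / len(scores)                        # 30.0
avg_dev = sum(abs(x - mu) for x in scores) / len(scores)
```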
Standard deviation
The standard deviation has mathematical properties which make it, like the mean, much more useful in higher-order statistics. The procedure for standard deviation involves summing squared deviations (producing a value variously called the sum of squared deviations, the sum of squares, or, statistically, Σx², a fundamental component of many statistical procedures) in order to eliminate negative values. The pathway to standard deviation moves from deviations to the sum of squares to variance to standard deviation. We'll look at two ways to compute the sum of squares. The first, called the deviation method, clearly illustrates what standard deviation means. The second, called the raw score method, is easier to use. Both procedures result in the same value for the sum of squares.
Deviation Method
Compute the deviations of all scores from the mean. Square all deviations (x²) and sum them (Σx²) as follows:
score    mean    deviation (x)    x²
10       30      -20             400
20       30      -10             100
30       30        0               0
40       30       10             100
50       30       20             400
                  Σx = 0     Σx² = 1000
Large groups will have a larger sum of squares than small groups, simply because there are more deviations in a large group. Dividing by N eliminates group size from the result. This gives a truer picture of the spread in a group of numbers no matter how many are in the group. Divide the sum of squares by N in order to factor out the variable of group size. The resulting value is called the variance of the scores, and is symbolized by the squared lower-case Greek letter sigma (σ²).
Variance (σ²) = Σx² / N = 1000 / 5 = 200.0
Since we squared the deviations before adding them, variance measures variability in squared units. It would be better if score variability were in the same unit of measure as the scores themselves. We can "undo" the squaring by taking the square root (√) of the variance, like this:
Standard Deviation (σ) = √(σ²) = √200 = 14.14
The number 14.14 represents the standardized measure of variability for our example. This number represents, in the same unit of measure as our scores, the degree of spread-out-ness of the scores from the mean. The larger the number, the greater the spread. It is useful in comparing the variabilities in different groups of scores, but will become more meaningful in future statistical procedures. This deviation method shows you exactly what a "standard deviation" is, and is fine to use when you have a few scores and a whole number mean. But if you have a large data set, and the mean is a fraction, like 73.031, computing individual deviation scores, squaring them, and then summing them can be painfully tedious. A simpler way to compute the sum of squares -- and get the very same result -- is to use the raw score formula.
Σx² = ΣX² - (ΣX)² / N
where ΣX² refers to the sum of the squared raw scores (square all the scores and sum them) and (ΣX)² refers to the sum of all the scores, squared (sum the scores and then square the sum). Let's apply this formula to the same data that we used under the deviation method. We should get the same answer: Σx² = 1000.
X            X²
10           100
20           400
30           900
40          1600
50          2500
ΣX = 150    ΣX² = 5500
(ΣX)² = 22500
Σx² = ΣX² - (ΣX)²/N
    = 5500 - 22500/5
    = 5500 - 4500
Σx² = 1000
As you can see, both methods give sum of squares values of 1000. The raw score method is easier to do and less prone to arithmetic errors.
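Both methods are easy to verify by computer. The sketch below computes the sum of squares both ways, then variance and standard deviation, for the five example scores:

```python
import math

# Sum of squares two ways, then variance and standard deviation.
scores = [10, 20, 30, 40, 50]
n = len(scores)
mu = sum(scores) / n

# Deviation method: square each deviation from the mean, then sum.
ss_dev = sum((x - mu) ** 2 for x in scores)

# Raw score method: sum of squared scores minus (sum of scores)^2 / N.
ss_raw = sum(x ** 2 for x in scores) - sum(scores) ** 2 / n

variance = ss_dev / n          # population variance (divide by N)
sd = math.sqrt(variance)       # population standard deviation
```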
Notice that the means of the two groups are equal, but the degree of scatter (variability) among the scores is not. Let's compute the standard deviations of both groups to compare them. Which group should have the larger standard deviation?² Using the deviation method, we calculate the sum of squares of X as follows:
i        Xi        xi       xi²
1         1        -6        36
2         3        -4        16
3         5        -2         4
4         9         2         4
5        11         4        16
6        13         6        36
N = 6    ΣX = 42   Σx = 0   Σx² = 112
The variance of Group X equals Σx²/N = 112/6 = 18.66. The standard deviation is the square root of the variance, or √18.66 = 4.32. Using the raw score method, we calculate the sum of squares for Group X as follows:
i        Xi        Xi²
1         1          1
2         3          9
3         5         25
4         9         81
5        11        121
6        13        169
n = 6    ΣX = 42   ΣX² = 406
Σx² = ΣX² - (ΣX)²/N = 406 - (42)²/6 = 406 - 294 = 112

We get the same result, 112, with either method.
² Did you say the X's? Good. You can see from the graph that the X's are spread out more than the Y's (another way of saying this is that the range of X is greater than the range of Y). We would expect the X's to have more variability than the Y's, and, in turn, the standard deviation of the X's will be greater.
Now let's compute variance and standard deviation for Group Y, which should produce a smaller sum of squares, variance, and standard deviation than Group X did. Here's the deviation method:
i        Yi        yi       yi²
1         5        -2         4
2         6        -1         1
3         8         1         1
4         9         2         4
N = 4    ΣY = 28   Σy = 0   Σy² = 10
The variance of Group Y equals Σy²/N = 10/4 = 2.5. The standard deviation is the square root of the variance, or √2.5 = 1.58. Using the raw score method, we calculate the sum of squares for Group Y as follows:
i        Yi        Yi²
1         5        25
2         6        36
3         8        64
4         9        81
n = 4    ΣY = 28   ΣY² = 206
Σy² = ΣY² - (ΣY)²/N = 206 - (28)²/4 = 206 - 196 = 10

Again, we get the same result, 10, with either method. Since the sum of squares equals 10, variance equals 2.5 and standard deviation 1.58, as calculated above. The deviation method illustrates the meaning of standard deviation; the raw score method gives the same result more simply. We have now computed the standard deviation for both groups of scores. The groups have identical means, but different spreads. We expected the scores of Group X to have a larger standard deviation than those of Group Y because of their larger spread. The calculations yielded a standard deviation of 4.32 for Group X and 1.58 for Group Y, confirming our expectation.
Population Parameters
Suppose we have a population of 10,000 ministers, and we want to compute the mean and standard deviation of their IQ. In order to compute these population parameters directly, you give all 10,000 ministers an IQ test. Sum the 10,000 IQ scores (ΣX) and divide by 10,000 (N). The result is the population mean, symbolized by μ (pronounced "myoo"). Subtract μ from the 10,000 IQs (x = X - μ), square the 10,000 deviations (x²), sum them (Σx²), divide by 10,000 (N), and finally take the square root (√). This yields the population standard deviation, symbolized by σ.
Sample Statistics
The cost in time and materials to test 10,000 subjects and compute the parameters directly is not practical. Instead, draw a random sample of 100 ministers (1%) and measure their IQs. Sum the 100 IQs (ΣX) and divide by 100 (N) to produce the sample mean, symbolized by X̄ and pronounced "X-bar." Subtract X̄ from the 100 IQs (x = X - X̄), square the 100 deviations (x²), sum them (Σx²), divide by 100 (N), and finally take the square root (√). This sample standard deviation is symbolized by a sigma with a hat on top (σ̂), pronounced "sigma-hat."³
Estimated Parameters
When we cannot compute population parameters directly, we must estimate them from sample statistics. This is not a problem for the estimate of the mean (μ): the sample mean (X̄) is the best estimate. But because the sample is a subset of the population, with a smaller number of scores, the sample standard deviation (σ̂) always underestimates σ. This underestimation requires a small correction factor in the equation for the estimated standard deviation (s). While the equations for σ and σ̂ have the sum of squares divided by N or n,⁴ the equation for the estimated standard deviation (s) has the sum of squares divided by n - 1. Why n - 1? It has to do with the Central Limit Theorem, and you really don't want to know. (Okay, for those who do: the selection of a sample of n scores from the population reduces by one the number of n-sized samples that can be drawn from the
³ Some textbooks refer to the sample standard deviation as "sigma-tilde" (σ̃).
⁴ Often, N and n are used interchangeably to refer to the number of scores in a set. Other times, N refers to the number of scores in a population, and n to the number of scores in a sample.
population. This reduces the number of degrees of freedom of the population by one. We'll talk more about degrees of freedom in a few chapters.) So, we have three sets of formulas:

Population parameters:   μ = ΣX / N      σ = √(Σx² / N)
Sample statistics:       X̄ = ΣX / n      σ̂ = √(Σx² / n)
Estimated parameters:    X̄               s = √(Σx² / (n - 1))

Mean and standard deviation are common concepts across the three versions, but there are important differences to note. Notice the use of N for parameters and n for samples. Match the formulas with the diagram above.
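The N versus n - 1 distinction can be seen directly in Python, whose standard library provides both versions (statistics.pstdev divides by n; statistics.stdev divides by n - 1):

```python
import math
import statistics

# Population vs. estimated standard deviation (n vs. n - 1 divisor).
sample = [10, 20, 30, 40, 50]
n = len(sample)
xbar = sum(sample) / n
ss = sum((x - xbar) ** 2 for x in sample)   # sum of squares = 1000

sigma_hat = math.sqrt(ss / n)        # divides by n     (sample statistic)
s = math.sqrt(ss / (n - 1))          # divides by n - 1 (estimated parameter)

# The standard library agrees:
assert math.isclose(sigma_hat, statistics.pstdev(sample))
assert math.isclose(s, statistics.stdev(sample))
```

Note that s comes out larger than sigma_hat, reflecting the correction for underestimation.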
A raw score (X) from a population with mean μ and standard deviation σ is transformed into a standard score (z) with the formula

z = (X - μ) / σ

The equation is pronounced "z equals X minus mu over sigma." In English, the formula means that a standardized score is equal to a raw score minus the population mean, divided by the population standard deviation. A raw score (X) from a sample that has a mean of X̄ and an estimated standard deviation of s is transformed into a standardized scale score (z) with the following formula:

z = (X - X̄) / s
The equation is pronounced "z equals X minus X-bar over s." In English, the formula means that a standardized score is equal to a raw score minus the sample mean, divided by the estimated standard deviation. Both formulas reflect the same relationship between a raw score and a standardized score⁵ in a distribution of numbers. The distinction is whether the distribution is a sample or a population. Notice that the values for mean and standard deviation are both part of the transformation formula. No matter what these parameters are, the standardized scores are plotted on a z-scale which looks like this:

⁵ Upper left: Two distributions have the same mean and standard deviation. Upper right: Two distributions have the same standard deviation, but different means. Lower left: Two distributions have the same mean, but different standard deviations. Lower right: Two distributions have different means and standard deviations.
For a standardized scale, the mean is always zero and the standard deviation is always one. The z-score equations transform any group of scores into these standardized values. Let's look at an example of how z-scores facilitate comparison between scores. John is taking Hebrew and Research. On his midterm exams, he made an 85 in Hebrew and an 80 in Research. On which exam did he do better? It seems obvious that he did better in Hebrew than in Research. But the real answer is not so easy. To compare his performance on the two tests, we must take into consideration how well his classmates as a whole did. That is, we need to know the means and standard deviations for the two exams. Here's the information we need:

                          Hebrew    Research
mean (μ)                  80        70
standard deviation (σ)    10        5

Now compute z-scores for Hebrew (zh) and Research (zr):

zh = (85 - 80) / 10 = 0.50        zr = (80 - 70) / 5 = 2.00
Notice several things about the diagram above. First, since the z-scores from Hebrew and Research now fall on the same standardized scale, we can directly
4th ed. 2006 Dr. Rick Yount
compare them. It is clear from the scale that John did much better in Research, scoring two standard deviations above the mean, than he did in Hebrew, where he scored only one-half standard deviation above the mean. Second, notice that the means of both classes line up on a z-score of 0. In standardized scores, the mean is always 0. Third, notice that a distance of 1 on the z-scale is equivalent to 10 points in Hebrew and 5 points in Research, one standard deviation in each class. Fourth, notice that John's score of 85 in Hebrew falls directly below 0.50 on the z-scale. His score of 80 in Research falls directly below 2.00 on the z-scale. Standardized scores lie at the heart of inferential statistics. These basic building blocks provide the foundation for procedures we'll study soon.
Summary
The three measures of central tendency are the mode, the median, and the mean. These refer, respectively, to the most frequent score, the middlemost score, and the arithmetic average of a group of scores. In terms of statistical analysis, the mean is by far the most important of the three, and the most affected by skewed distributions. Three measures of variability are the range, average deviation, and standard deviation. The standard deviation (and its squared cousin, variance) is the most important of the three. The two characteristics of mean and standard deviation can be combined to transform a raw score (X) into a standard score (z). Z-scores can be directly compared across groups, regardless of differing parameters.
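All of these summary measures are available in Python's standard statistics module. Here is a sketch using an illustrative set of scores:

```python
import statistics

scores = [65, 70, 70, 75, 85, 90, 95]     # illustrative data

mode = statistics.mode(scores)            # most frequent score
median = statistics.median(scores)        # middlemost score
mean = statistics.mean(scores)            # arithmetic average
s = statistics.stdev(scores)              # sample standard deviation
variance = statistics.variance(scores)    # s squared

# Combine mean and s to transform each raw score into a z-score
z_scores = [(x - mean) / s for x in scores]
```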
Example
In my Ed.D. dissertation, I analyzed how much learning of the doctrine of the Trinity occurred in Southern Baptist adults over a seven-week course. Cognitive tests were given at the beginning (Test 1), end (Test 2), end plus three months (Test 3), and end plus six months (Test 4).6 I was also interested in whether the mental abilities of the three groups were balanced. Here is one of my tables showing the means and standard deviations of these groups.7 You can notice several things immediately from the numbers below. The three groups' average mental ability, measured by the Otis-Lennon Mental Ability Test (maximum score: 80), fell within 0.90 points of each other. All three groups learned a great deal about the doctrine of the Trinity -- the Total N mean jumped 50.69 points over the seven weeks (Test #2 mean minus Test #1 mean). All three groups forgot some of what they learned, dropping an average of 11.48 points over three months and 17.98 points over six months. Are these means significantly different? We will learn how to answer this question in Chapter 20.
6 William R. Yount, "A Critical Comparison of Three Specified Approaches to Teaching Based on the Principles of B. F. Skinner's Operant Conditioning and Jerome Bruner's Discovery Approach in Teaching the Cognitive Content of a Selected Theological Concept to Volunteer Adult Learners in the Local Church" (Fort Worth: Southwestern Baptist Theological Seminary, 1978), 41-42.
7 Ibid., 168.
APPENDIX XI
Means and Standard Deviation Scores

                 Total N          X              Y              Z
MENTAL ABILITY   59.96* 15.58+   59.71 16.55    59.67 19.13    60.57 11.30
TEST #1          24.70   8.01    25.57  4.79    23.44  8.80    25.43 10.26
TEST #2          75.39  15.40    81.43  8.02    78.44 14.85    65.43 18.41
TEST #3          63.91  13.91    66.00 12.36    67.78 15.97    56.86 11.44
TEST #4          57.41  11.56    61.00  9.81    59.22 11.58    52.29 11.86

*Mean   +Standard Deviation
Vocabulary
X̄: "X-bar," the average or mean of a group of scores (sample)
μ: "mu," the average or mean of a group of scores (population)
σ²: "sigma-squared," the population variance
σ: "sigma," the population standard deviation
σ̂²: "sigma-hat squared," sample variance
σ̂: "sigma-hat," sample standard deviation
average deviation: Σ|x|/n, the sum of the absolute values of deviation scores, divided by n
average: sum of scores divided by the number of scores
central tendency: focal point of scores: mean, median, mode
estimated parameter: X̄ and s, computed from a sample, used to infer population parameters
mean: average score
median: middlemost score
mode: most frequent score
n: number of scores (sometimes used to refer to one group within an experiment)
N: number of scores (sometimes used to mean the entire experiment)
parameter: population measurements (μ, σ)
range: distance between highest and lowest scores in a group
standard deviation: standardized measure of variation in scores: s
statistics: sample measurements (X̄ and s)
sum of squares: sum of squared deviation scores
variability: measure of spreadedness in a group of scores
variance: measure of spreadedness in squared units
x: deviation score, the difference between a score (X) and the mean (X̄ or μ)
X: raw score, e.g., a test score
z-score: standardized score which reflects both μ and σ (or X̄ and s)
Ibid., 169
Study Questions
1. What are the modes for the sets of scores below?
a. 1 2 3 4 5 6 6 7 8 9   Mode: ____
b. 1 2 3 4 5 6 6 7 8 8   Mode: ____
c. 1 1 2 2 3 3 4 4 5 5   Mode: ____
2. What are the medians for the following data sets?
a. 10 15 20 22 27 29 33   Md: ____
b. 3 7 78 45 2 56 4 7   Md: ____
3. Compute the mean, sum of squares (use deviation method), variance and standard deviation for the following scores: 65 70 70 75 85 90 95
4. Using the scores in #3, compute the sum of squares with the raw score method.
5. You have taken midterm exams. Your score in New Testament Survey was 75. Your score in Principles of Teaching was 90.

         NTS     PT
n        100     25
ΣX      7020   2175
Σx²     2500    225
a. Compute means for both classes. b. Compute standard deviations (s) for both classes. c. Transform your midterm scores into z-scores. d. Plot your standard scores on a z-scale. Include the appropriate raw score scale values for the two classes. e. In which class did you do better? Explain how you know this.
Chapter 17
17
The Normal Curve and Hypothesis Testing
The Normal Curve Defined Level of Significance Sampling Distributions Hypothesis Testing
In the last chapter we explained the elementary relationship of means, standard deviations, and z-scores. In this chapter we extend this relationship to include the Normal Curve, which allows us to convert z-score differences into probabilities. On the basis of the laws of probability, we can make inferences from sample statistics to population parameters and make decisions about differences in scores. The chapter is divided into the following sections: The Normal Curve Defined. What is the nature of the Normal Curve? How do the Normal Curve and its associated distribution table link z-scores with area under the curve? How does area under the curve relate to the concept of probability? Level of Significance. What do the terms level of significance and region of rejection mean? What is alpha (α)? What is a critical value? The Sampling Distribution. What is a sampling distribution? How does it differ from a frequency distribution? Hypothesis Testing. How do we statistically test a hypothesis?
Recall that the mean of the z-scale equals zero and that the scale extends, practically speaking, 3 points in either direction. Each point on the z-scale equals one standard deviation away from the mean. A score of 100 in John's Hebrew class equals 2 standard deviations above the mean (mean=80, s=10, z=+2.0). A score of 55 in John's Research class equals 3 standard deviations below the mean (mean=70, s=5, z=-3.0). The z-scale assumes that the distribution of standardized scores forms a bell-shaped curve, called a Normal Curve. The normal curve is plotted on a set of X-Y axes, where the X-axis represents, in this case, z-scores and the Y-axis the frequency of z-scores. It looks like the diagram at left. The area between the bell and the baseline is a fixed area, which equals 100 percent of the scores in the distribution. We will use this area to determine the probabilities associated with statistical tests. There is an exact and unchanging relationship between the z-scores along the x-axis and the area under the curve. The area under the curve between z = ±1 (read "z equals plus or minus 1") standard deviation is 68% of the scores (p=0.68). The area between ±2 standard deviations is about 95%, or 0.95 of the curve.
The tails of the distribution theoretically extend to infinity, but 99.7% of the scores fall between z = ±3.00. Now, let's use the normal curve in a practical way with John's classes. We can use the information in the diagram above to answer questions about John's classes. Example 1: How many Hebrew students scored between 70 and 90?
For the Hebrew class, a score of 70 equals a z-score of -1 and a 90 equals a z-score of +1. The area under the normal curve between -1 and +1 is 68%. Therefore, the proportion of students in Hebrew scoring between 70 and 90 is 0.68. How many students is that? Multiply the proportion (p=0.68) times the number of students in the class (60). The answer is 40.8. Rounding to the nearest whole student, we would say that 41 Hebrew students fall between 70 and 90 on this test. Example 2: How many research students scored between 60 and 80?
For the research class, a score of 60 equals a z-score of -2; an 80 equals a z-score of +2. The area under the curve between -2 and +2 is about 95%. Therefore, the proportion of the students in Research scoring between 60 and 80 is 0.95. How many students is that? (0.95)(40) = 38. We would say that 38 research students fall between 60 and 80 on this test.
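Both examples can be checked with NormalDist from Python's standard statistics module (the class sizes of 60 and 40 are the chapter's; the exact area between ±2 standard deviations is 95.45%):

```python
from statistics import NormalDist

std = NormalDist()   # standard normal curve: mean 0, sd 1

def area_between(z_low, z_high):
    """Proportion of the normal curve between two z-scores."""
    return std.cdf(z_high) - std.cdf(z_low)

# Example 1: Hebrew scores between 70 and 90, i.e. z = -1 to +1
p1 = area_between(-1, 1)
print(round(p1, 2), round(p1 * 60))   # 0.68 41

# Example 2: Research scores between 60 and 80, i.e. z = -2 to +2
p2 = area_between(-2, 2)
print(round(p2, 2), round(p2 * 40))   # 0.95 38
```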
What is the area under the curve between the mean and z = -1.65? The normal curve is symmetrical, which means that the negative half mirrors the positive half. We can find the area under the curve for negative z-scores as easily as we can for positive ones. Look down the column for the row labelled 1.6 and then across to the column labelled .05. Where these cross you will find the answer: 0.4505. Forty-five percent (45%) of the scores of a group fall between the mean and -1.65 standard deviations from the mean. An excerpt from the Normal Curve table shows the lookup:

  z     ...    .05
  1.6   ...   .4505
How many Hebrew students scored higher than John? Our first step is to compute the z-score for the raw score of 85, which we have already done. We know that the standard score for John's Hebrew score of 85 is zh = +0.50 (diagram on 16-11 and 17-1). The second step is to draw a picture of a normal curve with the area we're interested in. Notice that I've lightly shaded the area to the right of the line labelled z = 0.5. This is because we want to determine how many students scored higher than John. Since higher scores move to the right, the shaded area, which is equal to the proportion of students, is what I need. But just how much area is this?
Look at the Normal Curve Table for the proportion linked to a z-score of 0.5. Down the left column to "0.5." Over to the first column headed ".00." The area related to z=0.5 is 0.1915. I have shaded this area darker in the diagram below. Our lightly shaded area is on the other side of z=0.5! The area under the entire Normal Curve represents 100% of the scores. Therefore, the area under half the curve, from the mean outward, represents 50% (0.5000) of the scores. So, the lightly shaded area in the diagram is equal to 0.5000 minus 0.1915, or 0.3085.
So we know that 30.85% of the students in John's Hebrew class scored higher than he did. How many students is that? Multiplying 0.3085 (proportion) times 60 (students in class) gives us 18.51, or 19 students. Nineteen of 60 students scored higher than John on the Hebrew exam. Here's another. John scored 80 in Research. How many students scored lower than this? We've already computed John's z-score in Research as +2.00. The area under the curve between the mean and z = 2.00 is 0.4772. Find 0.4772 in the Table.
Since John also scored higher than all the students in the lower half of the curve, we must add the 0.5000 from the negative half of the curve to the 0.4772 value of the positive half to get our answer. So, 97.72% of the students in John's research class scored lower than he did. How many students is this? It is (40 × 0.9772 = 39.09) 39 students. Here's an example which takes another perspective. We've used the Normal Curve table to translate z-scores into proportions. We can also translate proportions into z-scores. Take this question: What score did a student in John's Hebrew class have to make in order to be in the top 10% of the class? We start with an area (0.10) and work back to a z-score, then compute the raw score (X) using the mean and standard deviation for the group. Draw a picture of the problem, like the one below.
We have cut off the top 10% of the curve. What proportion do I use in the Normal Curve table? We know we want the upper 10%. We also know that the table works from the mean out. So, the z-score that cuts off the upper 10% must be the same z-score that cuts off 40% of the scores between itself and the mean (50% - 10% = 40%). The proportion we look for in the table is 0.4000. Search the proportion values in the table and find the one closest to .4000. The closest one in our table is 0.3997. Look along this row to the left. The z-score value for this row is 1.2. Look up the column from 0.3997 to the top. The z-score hundredth value is .08. The z-score which cuts off the upper 10% of the distribution is 1.28.

  z     ...    .08
  1.2   ...   .3997
The z-score formula introduced in Chapter 16, z = (X - X̄) / s, yields a z-score from a raw score when we know the mean and standard deviation of a group of scores. This z-score formula can be transformed into a formula that computes X from z. Multiply both sides of the z-score formula by s and add X̄. This produces X = z·s + X̄. Do you see how the two equations are the same? One solves for z and the other for X.
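Both directions of the transformation can be sketched with the standard library; the mean of 80 and s of 10 are the Hebrew class values from the example:

```python
from statistics import NormalDist

std = NormalDist()

# Proportion to z: the z-score cutting off the upper 10% of the curve
z_cut = std.inv_cdf(1 - 0.10)    # about 1.28, matching the table lookup

# z to raw score: X = z*s + mean (the transformed z-score formula)
hebrew_mean, hebrew_s = 80, 10
cutoff = z_cut * hebrew_s + hebrew_mean
print(round(cutoff, 1))          # 92.8
```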
A student had to make 92.8 or higher to be in the upper 10% of the Hebrew class (X = 1.28 × 10 + 80 = 92.8). These examples may seem contrived, but they demonstrate basic skills and concepts you'll need whenever you use parametric inferential statistics. Learn them well and become fluent in their use, because you'll soon be using them in more complex, but more meaningful, procedures.
Level of Significance
John's Hebrew score was different from the class mean, but was the difference greater than we might expect by chance? Or, as a statistician would ask it, was the score significantly different? John's research score was different from the class mean, but was it significantly different?
Critical Values
We determine whether a difference is significant by using a criterion, or critical value, for testing z-scores. The critical value cuts off a portion of the area under the normal curve, called the region of rejection. The proportion of the normal curve in the region of rejection is called the level of significance. Level of significance is symbolized by the Greek letter alpha (α).
In this example, the critical value of 1.65 cuts off 5% of the normal curve. The level of significance shown above is α = 0.05. Any z-score greater than 1.65 falls into the region of rejection and is declared "significantly different" from the mean. Convention calls for the level of significance to be set at either 0.05 or 0.01.
So now we return to our question at the beginning of this section. Did John score significantly higher than his class averages in Research and Hebrew? Since this is a directional hypothesis, we'll use a 1-tail test, with α = 0.05. Under these conditions, John had to score 1.65 standard deviations above the mean in order for his score to be considered "significantly different." In Research, John scored 2.00 standard deviations above the mean. Since 2.00 is greater than 1.65, we can say with 95% confidence that John scored significantly higher in Research than the class average. In Hebrew, John scored 0.5 standard deviations above the mean. Since 0.5 is less than 1.65, we conclude that John did not score significantly higher in Hebrew than the class average. Our discussion to this point has focused on single scores (e.g., John's exam grades) within a frequency distribution of scores. While this has provided an elementary scenario for building statistical concepts, we seldom have interest in comparing single scores with means. We have much more interest in testing differences between a sample of scores and a given population, or between two or more samples of scores. Among the example Problem Statements in Chapter 4, you saw Group 1 versus Group 2 types of problems. This requires an emphasis on group means rather than subject scores, on sampling distributions rather than frequency distributions.
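The decision rule just applied can be sketched as a simple comparison against the critical value (a 1-tail test at α = 0.05, as above):

```python
# One-tail test: a z-score is "significantly higher" when it exceeds
# the critical value of 1.65 (alpha = 0.05).
CRITICAL_Z = 1.65

def significantly_higher(z):
    return z > CRITICAL_Z

print(significantly_higher(2.00))   # True: the Research score is significant
print(significantly_higher(0.50))   # False: the Hebrew score is not
```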
--- --- Warning! This transition from scores to means is the most confusing element of the course --- ---
Sampling Distributions
A distribution of means is called a sampling distribution, which is necessary in making decisions about differences between group means. Just as naturally occurring scores fall into a normal curve distribution, so do the means of samples of scores drawn from a population. The normal curve of scores forms a frequency distribution; the normal curve of means forms a sampling distribution. Look at the diagram at right. Here we see three samples drawn from a population. All three sample means are different, since each group of ten scores is a distinct subset of the whole. The variability among these sample means is called sampling error. Even though we are drawing equal-sized groups from the same population, the means differ from one
another and from the population mean. Differences between means must be large enough to overcome this "natural" variation to be declared significant. If we were to draw 100 samples of 10 scores each from a population of 1000 scores, we would have 100 different mean scores. These 100 sample means would cluster around the population mean in a sampling distribution, just as scores cluster around the sample mean in a frequency distribution. If we were to compute the "mean of the means," we would find it would equal the population mean. The two characteristics which define a normal frequency distribution are the mean and standard deviation. These same characteristics define a sampling distribution. The mean of a sampling distribution is the population mean (μ), if it is known. If it is unknown, then the best estimate of the mean is one of the sample means (X̄). The standard deviation of the sampling distribution, called the standard error of the mean, is equal to the standard deviation of the population (σ) divided by the square root of the number of subjects in the sample (n): σx̄ = σ / √n.
If the population standard deviation (σ) is unknown (which is usually the case), we must estimate it. In this case, the formula for the standard error of the mean is based on the estimated standard deviation (s): sx̄ = s / √n.
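As a sketch, the standard error formula in Python; the values s = 0.7 and n = 100 match the renovation example used in this chapter:

```python
import math

def standard_error(s, n):
    """Standard error of the mean: s divided by the square root of n."""
    return s / math.sqrt(n)

print(round(standard_error(0.7, 100), 2))   # 0.07
```

Notice that quadrupling n only halves the standard error.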
Research hypotheses are also called "alternative" hypotheses -- hence the reference Ha
The sampling distribution diagram is centered on the hypothesized mean of 4.00 at z = 0, with σx̄ = 0.07, and shows the computed raw mean-labels for each of the z-scores (3.79 to 4.21). The x-axis of the sampling distribution reflects means, not scores. Notice also in the diagram at left that the differences between the mean-labels are much smaller than those between the score-labels. This is because the standard error of the mean (σx̄ = 0.07) is much smaller than the standard deviation of the sample (s = 0.7). Using a 1-tail test with α = 0.05, the critical value needed to reject H0: μ = 4.00 is -1.65. The area cut off by this critical value is shown shaded gray at left. Since the sample mean (3.8) falls into this area (beyond the dotted line), we declare that 3.80 is significantly lower than 4.00. Translating into English, we can say that the church at large does have a negative attitude toward building renovation. Let's now look at a slightly modified form of the z-score formula to compute the exact z-score for X̄, as well as the exact probability of getting such a mean, all using the same procedure and Normal Curve table that we used before.
The formula z = (X̄ - μ) / σx̄ converts a mean into a z-score, given μ and σx̄. Sampling distribution z-scores are tested for significance just as we did with frequency distribution z-scores. Substituting 4.0 for μ, 3.8 for X̄, and 0.07 for σx̄, we have z = (3.8 - 4.0) / 0.07 = -2.85.
The mean of 3.80 is 2.85 standard errors below the hypothesized mean of 4.00. In order to be significant (1-tail, α = 0.05), the z-score must be 1.65 standard errors or more from the mean. Since -2.85 is farther from the mean than -1.65, we reject the null hypothesis and accept the alternative: "The congregation has a negative attitude toward renovation." In hypothesis testing, the null hypothesis is either retained (no significant difference) or it is rejected (significant difference). There are no partial decisions. Moving back to the sampling distribution diagram, make a note that the dotted
line representing X̄ = 3.8 is 2.85 standard errors below μ = 4.0. Our finding is the same (3.8 falls into the shaded area), but we obtain a specific z-score, a more accurate measurement, by using the formula.
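The whole test can be sketched in a few lines, using the renovation example's values; note the exact quotient is -2.86, which the chapter rounds to -2.85:

```python
# z-test for a sample mean: z = (sample mean - mu) / standard error
def z_for_mean(sample_mean, mu, sem):
    return (sample_mean - mu) / sem

z = z_for_mean(3.8, mu=4.0, sem=0.07)
print(round(z, 2))    # -2.86
print(z < -1.65)      # True: reject the null (1-tail, alpha = 0.05)
```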
Summary
In this chapter we have introduced you to the process of testing hypotheses of parametric differences by way of the Normal Curve. We have differentiated between frequency and sampling distributions, and introduced the formula for computing the standardized score z. Because our hypothesis decisions (reject or retain H0) are based on probabilities (necessary since we work with sampling error and inferences), our results are always subject to errors. Such is science: hypotheses, data gathering, speculations, probabilities of findings. Our goal through proper research design and statistical analysis is to minimize errors and maximize "true findings." We will take up the topic of error rates and power in the next chapter.
Example
Dr. Robert DeVargas studied the change in moral judgment in students (3% sample, N=360) who used the Lessons in Character curriculum adopted by the Fort Worth I. S. D. for the 1996-1997 school year.2 While much of his statistical analysis is far beyond the scope of this chapter, notice in his writing below the use of level of significance as a benchmark for his findings.
Analysis of the fifth grade test data proceeded with the following steps: 1. An Analysis of Covariance (ANCOVA) was performed upon the post-test means of the treatment and control groups using the pre-test scores as a covariate variable. The mean of the control group post-test was 2.19 (n=30). The mean of the treatment group post-test was 2.18 (n=31). The ANCOVA procedure produced an F value of 0.163, giving a significance of p=0.688. The critical value Fcv(1, 60, α=.05) = 4.00. ... [3b] The treatment group's mean pre-test score equaled 2.0535 and the mean post-test score was 2.1835; the mean difference was 0.13 (n=31, SD=0.316). The standard error of the difference was 0.057, giving t = 2.29. The critical value tcv(df=30, 1-tail, α=.05) = 1.697.3 ... The step 3b analysis performed on the pre- and post-test means of the treatment group calculated the t value to be 2.29. Comparison to the critical value. . .reveals that. . .there exists a significant difference between the pre- and post-test scores. . . .it can be stated that the treatment [of moral judgement curriculum] made a significant difference in the level of moral judgement between the pre- and post-test scores of the treatment group.4
2 Robert DeVargas, "A Study of Lessons in Character: The Effect of Moral Judgement Curriculum Upon Moral Judgement" (Ph.D. diss., Southwestern Baptist Theological Seminary, 1998).
3 Ibid., 75.
4 Ibid., 78.
Vocabulary
Alpha (α): probability of rejecting a true null hypothesis; the level of significance (1% or 5%)
Critical value: value beyond which the null hypothesis is rejected
Frequency distribution: categorization of scores into classes and class counts
Level of significance: probability of rejecting a true null; symbolized by α
Normal curve: symmetrical mesokurtic (bell-shaped) distribution of scores
One-tail test: hypothesis test which places α in only one tail of the distribution
Region of rejection: area under the normal curve beyond the critical value
Reject null: the decision of "statistically significant difference"
Retain null: the decision of "no statistically significant difference"
Sampling error: random differences among the means of randomly selected groups
Sampling distribution: normal curve distribution of sample means within a population
Standard error of the mean: the standard deviation of a sampling distribution of means
Two-tail test: hypothesis test which places α/2 in both tails of the distribution
Study Questions
1. Define the following terms. Think of how you would explain them to someone in the class: inferential statistics; level of significance; sampling distribution; research hypothesis (Ha); Normal Curve table; p; null or statistical hypothesis (H0); 1- and 2-tail tests; standard error of the mean; directional hypothesis; non-directional hypothesis.
2. Determine the area under the normal curve between the following z-scores. Draw out the problems with the following diagrams. A. z = 0 and z = 2.3 B. z = 0 and z = -1.7
3. You go to an associational picnic. The men decide to have a contest to see who can throw a softball the farthest. You record the distances the ball is thrown, and compute the mean (X̄ = 164 feet) and the standard deviation (s = 16 feet). There were 100 men who threw the ball. Answer the following questions: A. With this information, sketch a normal curve and label it with z-scores and raw scores, mean and standard deviation. B. How many men threw the ball 180 feet or more? 120 feet or less? C. How far did one have to throw the ball to be in the top 10%? D. What distances are so extreme that only 1% of the men threw this far? E. Sketch a sampling distribution based on the mean, s, and n given above. F. A sister association joins the fellowship and challenges your association with a ball-toss of their own. Their average distance was 170 feet. Did they throw significantly better than your association? The following diagrams will help you work through problem 3.
Chapter 18
18
The Normal Curve Error Rates and Power
Type I and Type II Error Rates Increasing Statistical Power Statistical versus Practical Significance
Because decisions to reject or retain null hypotheses are based on probabilities (necessary since we work with sampling error and inferences), our results are always subject to errors. Such is science: problem, hypothesis, data gathering, analysis, and probabilities of findings. Our goal through proper research design and statistical analysis is to minimize errors and maximize "true findings." We complete our journey through the Normal Curve and hypothesis testing with considerations of error rates and power. The chapter is divided into the following sections: Error Rates. What are Type I and Type II Error Rates? How can we reduce the likelihood of committing Type I and Type II errors? Power. What is statistical power? How do we increase the power of a statistical test? Statistical versus Practical Significance. What is the difference between statistical significance and practical significance?
The four possible outcomes of a hypothesis test can be summarized in a decision table. The columns reflect the real-world condition of the null hypothesis; the rows reflect our statistical decision; each cell holds the probability of that outcome:

                Null actually true         Null actually false
Retain null     A: Correct (p = 1 - α)     C: Type 2 error (p = β)
Reject null     B: Type 1 error (p = α)    D: Correct (p = 1 - β)
Consider two overlapping sampling distributions: Curve 1, drawn around our own population mean (μ1), and Curve 2, drawn around a population mean higher than our own. The critical value line of Curve 1 cuts off the region of rejection (light gray area). This area is equal to α, the probability of committing a Type 1 error: rejecting a true null. Any mean falling into this region is declared significantly higher than μ1. The dark gray area to the left of the critical value is the region of non-rejection and is equal to 1-α, the probability of retaining a true null. If we set α at 0.05, then 1-α equals 0.95. This means that when we declare a difference significant, we are 95% sure of our decision. But notice that the critical value line also cuts off the lower part of Curve 2 (light gray). This area is symbolized by β, the probability of committing a Type 2 error: retaining a false null. Any mean from distribution 2 (which should be declared different) falling in this area will be declared not significantly higher than μ1. The dark gray area to the right of the critical value line equals 1-β, the probability of rejecting a false null (power). Now let's put the curves and boxes together with the labels A, B, C, and D. The diagram at right shows two sample means (A, B) which are true nulls* (arrows show they belong with population 1). Mean A falls to the left of the critical value and is declared not significantly different from μ1. This is a correct decision, and reflects box A (p = 1-α). Mean B falls beyond the critical value and is declared significantly different from μ1. This is a Type 1 error and reflects box B (p = α). In the second diagram at right, we have means C and D which are both false nulls* (arrows show they belong with population 2). Mean C falls to the left of the critical value and is declared not significantly different from μ1. This is a Type 2 error, and reflects box C (p = β). Mean D falls to the right of the critical value and is declared significantly different from μ1. This is a correct decision and reflects box D (p = 1-β).
Review the decision table and these diagrams until you can see the correspondence between the two. *Of course, we can never really know what the "real world" conditions are, whether the nulls are actually true or false. In the previous examples, I was giving you hidden information in order to establish the four possibilities in hypothesis testing. We are left with the tasks of gathering reliable and valid samples of data, applying statistical procedures, and making decisions of outcome based on probability. But this is still a wonderful mechanism for solving problems. Understanding the
dynamics of hypothesis testing -- error rates, power, z-scores -- is like any other kind of under-the-hood, behind-the-scenes knowledge. It provides insight into how things work, sophistication about what doesn't, and calm assurance that a research design -- whether our own or one found in the literature -- is what it purports to be. Such understanding elevates us from blind users to savvy consumers of research findings. Further, it matures us as competent research designers. At the heart of this competence lies the ability to improve our chances of making a correct decision even before we send out instruments or determine the samples we'll study. A statistician would ask it this way: How can I increase the power of my study? Let's take a look at some possibilities.
Increase α
The level of significance (α) directly controls the size of the critical value used to declare a difference significant. As we increase α, we reduce the critical value. As the critical value is reduced, the likelihood that a mean will be declared significantly different from the mean increases. In other words, the probability of declaring a null hypothesis false (power) increases simply because we reduced the critical value. The diagrams at left show how this happens. Notice the position of the critical value line in the upper diagram at left. It cuts off the curve at z = 1.65 (α = 0.05). If we increase α to 0.10, the cut-off line moves to the left, to z = 1.28. As the critical value line moves to the left, the area labelled power increases by the lighter segment shown in the lower diagram. However, we have increased power (1-β) by increasing α, the Type 1 error rate. This is simply robbing Peter to pay Paul. It does not improve the overall research design to increase the probability of Type 1 errors as we decrease the probability of Type 2 errors. It is better to remain with the conventional values of 0.05 or 0.01 for α.
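A quick sketch with the standard library shows how raising α lowers the one-tail critical value (the exact value for α = 0.05 is 1.645, which the printed table rounds to 1.65):

```python
from statistics import NormalDist

std = NormalDist()

# One-tail critical values for two levels of significance
crit = {alpha: round(std.inv_cdf(1 - alpha), 2) for alpha in (0.05, 0.10)}
print(crit)   # {0.05: 1.64, 0.1: 1.28}
```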
The power of a statistical test means is the probability of declaring a difference significant. Greater power means nothing more or less than a greater probability of declaring a value significant. It stands to reason that the probability of declaring a larger difference significant is greater than for declaring a smaller difference significant. The difference between population means in the upper diagram at left is smaller than the difference in the lower diagram. Look at the dramatic difference in power, reflected by the shaded areas in the diagrams.
Increase μ1 - μ2
Chapter 18
Several years ago a Ford Motor Company commercial featured a road test comparing a Lincoln and a Cadillac. They used 100 drivers. The Lincoln won the test (remember, it was a Ford commercial). But interesting to me was that researchers needed 100 persons to show the difference. Why so many? Because the difference between a Cadillac and a Lincoln is very small. Had one of the cars been a '67 Chevy taxi cab, distinguishing between the cars would have been easier, and the difference could have been firmly established with fewer subjects. It is reasonable to assume that as μ1 - μ2 increases, detecting the difference becomes easier. We can see this very fact in the formula for z itself. The equation for z has the term μ1 - μ2 in the numerator, so that as this difference increases, z increases, falling farther out from the mean and becoming more likely to be declared significant. The problem, of course, is that this discussion is purely theoretical, since we have no control over the size of the difference (μ1 - μ2). So let's turn our attention to elements we do have some control over.
Decrease s
The standard error of the mean (sx̄ = s/√n) is decreased by decreasing s. We do this by improving the precision and accuracy of the measurements of our sample(s). By designing better experiments, writing better tests, and using more reliable methods for collecting data, we squeeze some of the noise (extraneous, unsystematic variability) out of our data. By gathering more precise data, we can detect targeted differences more easily because we remove unwanted static from the process. Sloppy designs, poor instruments, and awkward data gathering should be replaced by clear designs, accurate and valid instruments, and precise data-gathering procedures. Decreasing s increases power without increasing the Type I error rate.
Increase n
The second way to decrease the standard error of the mean is to increase the sample size, n.
4th ed. 2006 Dr. Rick Yount
As the number of subjects increases, their individual differences (random noise) increasingly cancel each other out, allowing true differences to show through. The size of your sample(s) has a direct influence on the outcome of your study. If you study three approaches to counseling using groups of 10 subjects each, you may not have sufficient statistical power to declare the differences significant, even if they really exist! The same study, done with three groups of 30, might declare these real differences significant. If you use three groups of 1000, you may find significant differences that are, in a practical sense, trivial (see "practical importance" below). Because n is so potent an influence on power, you must use caution in selecting your sample size. You may want to consult an advanced statistics text to determine the size of sample(s) you need for your statistic, but for now, Dr. Curry's rule of thumb for sample size (Chapter 7) is a good place to start.
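The influence of n (and of s) on power can be made concrete with a short sketch. The numbers below are illustrative, echoing the renovation-attitude example of these chapters (a true difference of 0.2 scale points, s = 0.7, one-tail α = 0.05):

```python
from math import sqrt
from statistics import NormalDist

std_normal = NormalDist()

def z_test_power(alpha, diff, s, n):
    """Power of a one-tail z-test to detect a true difference `diff`,
    given score standard deviation `s` and sample size `n`."""
    se = s / sqrt(n)                          # standard error shrinks as n grows
    z_crit = std_normal.inv_cdf(1 - alpha)
    return 1 - std_normal.cdf(z_crit - diff / se)

base = z_test_power(0.05, 0.2, 0.7, 25)
print(round(base, 2))                                # modest power with n=25
print(round(z_test_power(0.05, 0.2, 0.7, 100), 2))   # same design, n=100: much stronger
print(round(z_test_power(0.05, 0.2, 0.35, 25), 2))   # or halve the noise s instead
```

Note that quadrupling n and halving s produce exactly the same gain, since both cut the standard error in half.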
Summary
We've come a long way in the last two chapters! We began in Chapter 17 with the standardized z-scale and linked it to the Normal Curve distribution table. We introduced the characteristics of the normal curve. We linked the concepts of z-score and area under the curve. We explained the concept of level of significance (α). We differentiated one- and two-tail statistical tests. We made the leap from frequency distributions (of scores) to sampling distributions (of means). We related the z-score equation to hypothesis testing with sampling distributions. In the present chapter we explained and illustrated the concepts of Type I and Type II error rates, as well as power. We tied these concepts to pictorial representations of where these error rates come from, as well as what they mean. Finally, we described practical ways to improve the statistical design of our studies. These two chapters lay the foundation for understanding and using the statistical procedures we'll discuss in the remainder of the book.
Vocabulary
Alpha (α): probability of rejecting a true null hypothesis
Beta (β, Type II): probability of retaining a false null hypothesis
Power: probability of rejecting a false null
Reject null: the decision of statistically significant difference
Retain null: the decision of no statistically significant difference
Type I error rate: probability of rejecting a true null
Type II error rate: probability of retaining a false null
Study Questions
1. Explain the following terms in English:
Type I error rate
Type II error rate
Power
Practical significance
Statistical significance
Level of significance
2. Draw from memory the 2x2 decision table, labelling the four headings and filling in the boxes with level of significance, error rates and power. Then draw four sets of paired normal curves and identify the areas under the curves which relate to each of the four cells in the decision table.
2. Greater power in a statistic means
A. more precision
B. less precision
C. more differences declared "significant"
D. fewer differences declared "significant"

3. The best way to increase power in your statistical design is to
A. lower the critical value of the test
B. increase α
C. increase the standard deviation of scores
D. use a larger sample

4. You want to be 99% confident of your decision that Sample mean A is different from Sample mean B. Which of the following will allow you to do this?
A. 1-tail test at α=0.01
B. 2-tail test at α=0.99
C. 1-tail test at α=0.99
D. 2-tail test at α=0.01

5. T F Results can have statistical significance without having practical significance.
6. T F The best way to decrease the probability of committing a Type II error in your study is to increase n and decrease "noise" in the scores.

7. T F The critical value for a 2-tail test, α=0.01, is 2.33.
Chapter 19

One-Sample Parametric Tests
The One-Sample z-Test The One-Sample t-Test The Confidence Interval
Dr. Roberta Damon developed a marital profile of Southern Baptist missionaries in eastern South America in 1985. Part of her study involved measuring whether SBC missionary husband-wife pairs differed from American society at large on the variable Couple Agreement on Religious Orientation, as measured by the Religious Orientation Test. She set α = 0.01 and decided to use a two-tail test. The American mean (μ) at the time was 56. The mean score (x̄) and estimated standard deviation (s) of her sample of 330 missionaries were 86.3 and 20.847 respectively. Applying the sampling distribution z-formula we introduced on page 17-9, she computed z as follows: z = (x̄ - μ)/(s/√n) = (86.3 - 56)/(20.847/√330) = 26.403
Remember that a two-tail test (α=0.01) requires only z=2.58 for declaring a difference significant. Here we see z=26.403. Southern Baptist missionaries serving in eastern South America in 1984, as a group, scored over 26 standard errors above the national mean in Couple Agreement1 on the Religious Orientation Scale! In Chapters 17 and 18 we explored the theoretical basis for hypothesis testing with sample means. Here in Chapter 19 we will use these principles to establish our first practical use for hypothesis testing. One-sample tests compare a population mean (μ) with a single sample mean (x̄). We have already seen the one-sample z-test in action. We will also discuss the one-sample t-test.
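Dr. Damon's computation is easy to reproduce from the summary statistics quoted above; a minimal Python sketch:

```python
from math import sqrt

mu = 56.0        # American population mean on the Religious Orientation Test
x_bar = 86.3     # missionary sample mean
s = 20.847       # estimated standard deviation
n = 330          # sample size

se = s / sqrt(n)          # standard error of the mean
z = (x_bar - mu) / se     # sampling-distribution z-formula
print(round(z, 3))        # ~26.403, far beyond the 2.58 critical value
```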
(which is usually the case), we use s to estimate σ, and so use the equation on the right (the more popular of the two).
The limitation to using s to estimate σ is whether the sample is large enough to approximate a normal curve. "Large enough" means at least n=30 subjects. The normal curve (z) table requires a normal distribution of scores in order to give accurate proportions under the curve. When a sample contains fewer than 30 scores, the requirement of normality is not met, and we cannot use the normal curve table or the z-test. In this case, use the one-sample t-test.
The logic behind the one-sample t is the same as we've used for the z-test, with one exception. A smaller n means lower power (see 18-5). Since the t-test is designed to be used with smaller n samples, the t-distribution critical value table assigns a slightly larger critical value to reflect the loss of power.
If you have 26 subjects in a study and use the one-sample t-test, then the degrees of freedom for the analysis is (n-1 = 26-1 =) 25. Recall that in the Normal Curve table we can find proportion values for any z-score. We focused on the four primary critical values of 1.65, 1.96, 2.33, and 2.58, but could calculate proportions for any z from 0 to 3. Look at the t-table under the column labelled 0.05 and move down to the bottom of the column, next to df = ∞. The value of the t-test critical value (1.645) is exactly the same as the value in the Normal Curve table (0.05, 1-tail test). To the right of 1.645 is the .025 column with the value "1.96." This is exactly the same as the 2-tail 0.05 Normal Curve value of z. The heading ".025" in the t-table refers to the α/2 area of the two-tail test: 0.05/2 = 0.025. The next value to the right is "2.326," the one-tail 0.01 value of z. And the next, "2.576." These are the same four essential critical values we studied for the normal curve: 1.65, 1.96, 2.33, and 2.58. The t-Distribution table differs from the Normal Curve table in that it contains nothing but critical values for a given level of significance and df. Each df row provides the critical value for a unique t-distribution. As we have just seen, for df = ∞, the t-distribution is the same as the Normal Curve distribution. As n decreases, the t-distribution becomes increasingly platykurtic. That is, the smaller the sample size, the flatter and wider the distribution. This pushes the critical values out farther on the tails, making significance harder to establish, as you can see in the diagram at right. Now choose the column in the t-table ending with "1.645." Move up the column and watch how the critical values increase. As df decreases, critical values increase. This is just another way of saying that as n decreases, power decreases. Thus, the t-table accounts for the lower power which derives from smaller n.
Computing t
Let's revisit the congregation study of attitude toward building renovation we did with the z-test in Chapter 17. Suppose our random sample is a small one, made up of only 25 members rather than 100. Let's use the same hypothesized population value of 4.0 and the same mean and standard deviation of 3.8 and 0.7 respectively. Computing t, we have t = (x̄ - μ)/(s/√n) = (3.8 - 4.0)/(0.7/√25) = -0.2/0.14 = -1.43
The critical value for α=.05, 1-tail, and df=24 is 1.711. Notice that this critical value for t, symbolized as tcv, is a larger value than the comparable z-test value of 1.65. But since our value of -1.43 is smaller (not as far out on the left tail) than -1.711, we retain the null hypothesis. While the difference between sample mean and hypothesized population mean is the same as before (-0.2), the standard error of the mean (sx̄) was twice as large -- 0.14 in the t-test as opposed to 0.07 in the z-test. This is due to the smaller number of subjects (n=25) in this sample as compared to the former one (n=100). Additionally, the critical value bar for the t-test (1.711) is higher than for the z-test (1.645). So the smaller sample size yields less power. The same hypothesis (H0: μ=4.0)
and the same difference (-0.2) yielded two different results. The z-test yielded a significant difference. The t-test did not. Why? Because in the second case we lacked sufficient power to declare the difference significant. The t-test does not correct for lack of power. It simply allows us to test samples too small for the z-test. Up to now we have used the z- and t- formulas to test single hypotheses. Another, less common, use for these formulas is in the creation of a confidence interval.
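The contrast between the two tests can be verified in a few lines (a sketch using the chapter's figures and the critical values quoted from the tables; a one-tail test is assumed, as in Chapter 17):

```python
from math import sqrt

mu_0, x_bar, s = 4.0, 3.8, 0.7   # hypothesized neutral point and sample statistics

for n, crit, label in [(100, 1.645, "z-test"), (25, 1.711, "t-test, df=24")]:
    se = s / sqrt(n)                      # standard error of the mean
    stat = (x_bar - mu_0) / se
    decision = "reject null" if abs(stat) > crit else "retain null"
    print(f"{label}: se = {se:.2f}, statistic = {stat:.2f} -> {decision}")
```

The same 0.2-point difference is significant with n=100 but not with n=25: the loss of power, in code.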
Confidence Intervals
The confidence interval offers another approach to making statistical decisions. In this approach, we set an interval around the population mean, bordered by confidence limits. We can then state, with a given degree of confidence, that the null hypothesis holds for any sample mean which falls within this interval, and fails for any which falls outside it. The benefit of using a confidence interval is that any number of sample means can be tested with only one computation.
The interval is given by CI95 = μ ± (z)(σx̄), where CI95 is the 95% confidence interval consisting of two endpoint values (x.x, x.x), μ is the population mean, z is the z-score for the given level of significance (in this case, α = 0.05, so z = 1.96), and σx̄ is the standard error of the mean. Now let's revisit the church attitude study again using n=100 subjects and a confidence interval based on the z-score. Using the formula above we have the following computation: CI95 = 4.0 ± (1.96)(0.07) = 4.0 ± 0.137, giving the interval (3.863, 4.137).
In the diagram at left you can see the church's mean score (n=100) of 3.8, and the confidence interval values of 3.863 and 4.137. The mean (3.8) falls outside the confidence interval. We therefore reject the null hypothesis (just as we did with the hypothesis test in Chapter 17). The church has a negative attitude toward renovation. But let's assume we tested twenty churches in our association.2 All we need do is calculate the mean score of each church and compare that to the interval "3.863 - 4.137." Any church mean falling below this interval
2 I've oversimplified this to focus your attention on the meaning of confidence interval. But for this to actually work as I've described it, we have to assume that all twenty churches produced the same standard deviation (s) on their attitude scores -- and this is an unreasonable assumption. If we were to actually do this study, we would compute the overall standard deviation from all twenty churches, and use that value to construct the confidence interval. Then, any church falling outside the interval would be significantly different from 4.00, the hypothesized neutral point. But even so, we make one computation, 20 comparisons.
reflects a significant negative attitude. Any church mean falling above the interval reflects a significant positive attitude. Any mean falling within the interval reflects no attitude (neutral). Confidence intervals are always based on two-tailed tests. The two values, 3.863 and 4.137, are called the confidence limits. The range of scores between the limits, shown at right by the two-headed arrow, is the confidence interval.
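Here is how the "one computation, twenty comparisons" screening might look in code (a sketch; the church means in the loop are invented for illustration):

```python
from math import sqrt

mu_0 = 4.0            # hypothesized neutral attitude score
s, n = 0.7, 100       # standard deviation and per-church sample size (assumed equal)
z = 1.96              # two-tail critical value, alpha = 0.05
se = s / sqrt(n)      # 0.07

lower, upper = mu_0 - z * se, mu_0 + z * se
print(round(lower, 3), round(upper, 3))       # the confidence limits

# One computation above, any number of comparisons below
for church_mean in (3.80, 4.05, 4.21):        # hypothetical church means
    if church_mean < lower:
        verdict = "significant negative attitude"
    elif church_mean > upper:
        verdict = "significant positive attitude"
    else:
        verdict = "neutral (retain the null)"
    print(church_mean, verdict)
```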
The flatter curve of the t-distribution forces the interval to be wider than the one we computed with z. Because of the larger standard error of the mean (sx̄), the t-score cut-off values are farther apart (0.14 vs. 0.07). Notice also that the sample mean of 3.8, which fell outside the z-score interval, falls inside the t-score confidence interval. This agrees with the hypothesis test on page 19-3. Since the mean of 3.8 falls within the confidence interval, it is declared not significantly different. The different result (z- vs. t-) is due directly to the different n's (100 subjects vs. 25 subjects). The loss of power with n=25 changed our significant finding to a non-significant finding.
Summary
This chapter is built on Chapters 17 and 18. The principal extensions we made in this chapter include the use of the t-distribution table, the concept of degrees of freedom, and the confidence interval. The next chapter extends these concepts still further: to testing mean differences between two samples.
Vocabulary
Confidence interval: the distance between the 2-tail critical values centered on the mean; the "region of acceptance (of the null)"
Degrees of freedom: the number of values free to vary given a fixed sum (n-1 for one group)
One-sample z-test: tests the difference between μ and x̄ when σ is known, or when σ is unknown and n > 30
One-sample t-test: tests the difference between μ and x̄ when σ is unknown and estimated by s (must be used when n < 30)
Remember: The one-sample t-test can be used with any sample size, which makes it one of the most popular statistics of difference.
Study Questions
1. A sample mean on an attitude scale equals 3.3 with a standard deviation of 0.5. There were 16 people tested in the group. Test the hypothesis: The group will score significantly higher than 3.0 on attitude X. (Use α=0.01)
A. State the statistical hypothesis.
B. Compute the proper statistic to answer the question.
C. Test the statistic with the appropriate table.
D. State your conclusion.
E. Establish a 99% confidence interval about the sample mean. (Careful here...)
F. Draw the sampling distribution and the confidence interval.

2. Repeat #1 above, but with a sample size of 49.

3. A study in 1980 revealed that the average salary of Southern Baptist ministers of education in Fort Worth was $29,000 (fictitious data). You randomly sample 28 ministers of education (1995) and find their average salary is $37,000 with a standard deviation of $3,000. Have salaries improved significantly?
A. Draw and label an appropriate sampling distribution.
B. State the research hypothesis.
C. State the statistical hypothesis.
D. Select the proper test and compute the statistic.
E. Test the statistic with the appropriate table.
F. Establish a 95% confidence interval about the sample mean.
G. Does this confidence interval agree with your hypothesis test? Explain how.
Chapter 20

Two-Sample t-Tests
t-Test for Two Independent Samples t-Test for Two Matched Samples Confidence Interval for Two-Sample Test
The next logical step in the analysis of differences between parametric variables (that is, interval/ratio data) is testing differences between two sample means. In one-sample tests, we compared a single sample mean (x̄) to a known or hypothesized population mean (μ). In two-sample tests, we compare two sample means (x̄ and ȳ). The goal is to infer whether the populations from which the samples were drawn are significantly different. Dr. Mark Cook studied the impact of active learner participation on adult learning in Sunday School in 1994.1 Using twelve adult classes at First Baptist Church, Palestine, he randomly assigned six classes to a six-session Bible study using active participation methods and six to the same study without them.2 Because he used intact Sunday School classes, he applied an independent-samples t-test to ensure that the two experimental groups did not differ significantly in Bible knowledge prior to the course. Treatment and control groups averaged 6.889 and 7.436 respectively. They did not differ significantly (t = -0.292, tcv = 2.009, df = 10, α = .05).3 At the conclusion of the six-week course, both groups again took a test of knowledge. Treatment and control groups averaged 10.111 and 8.188 respectively. Applying the independent-samples t-test, Dr. Cook discovered that the active participation group scored significantly higher than the control (t = 2.048, tcv = 1.819, df = 10, α = .05).4 The average class achievement of those taught with active learner participation was significantly higher than that of those taught without it.
Descriptive or Experimental?
Two-sample tests can be used in either a descriptive or experimental design. In a descriptive design, two samples are randomly drawn from two different populations. If the analysis yields a significant difference, we conclude that the populations from which the samples were drawn are different. For example, a sample of pastors and a sample of ministers of education are tested on their leadership skills. The results of this study will describe the difference in leadership skill between pastors and ministers of education. In an experimental design (like Dr. Cook's above), two samples are drawn from the same population. One sample receives a treatment and the other (control) sample
1 Marcus Weldon Cook, "A Study of the Relationship between Active Participation as a Teaching Strategy and Student Learning in a Southern Baptist Church" (Fort Worth, Texas: Southwestern Baptist Theological Seminary, 1994). 2 Ibid., pp. 1-2. 3 Ibid., p. 45. 4 Ibid., p. 50.
either receives no treatment or a different treatment. If the analysis of the post-test scores yields a significant difference, we conclude that the experimental treatment will produce the same results in the population from which the sample was drawn. The results from Dr. Cook's study can be directly generalized to all adult Sunday School classes at First Baptist Church, Palestine -- since this was his population -- and further, by implication, to all similar adult Sunday School classes: active learner participation in Sunday School improves learning achievement.
The independent-samples t is computed as t = (x̄ - ȳ) / s(x̄-ȳ), where the numerator equals the difference between the two sample means (x̄ - ȳ), and the denominator, called the standard error of difference, equals the combined standard deviation of both samples. Review: We have studied a normal curve of scores (the frequency distribution) where the standard score (z, t) equals the ratio of the difference between score and mean (X - x̄) to the distribution's difference within -- the standard deviation (s). See diagram 1 at left. We have studied a normal curve of means (the sampling distribution) where the standard score (z, t) equals the ratio of the difference between sample mean (x̄) and population mean (μ) to the distribution's difference within -- the standard error of the mean (σx̄). See diagram 2. In the independent-samples t-test, we discover a third normal curve -- a normal curve of differences between samples drawn from two populations. Given two populations with means μx and μy, we can draw samples at random from each population and compute the difference between them (x̄ - ȳ). The mean of this distribution of differences equals μx - μy (usually assumed to be zero) and the standard deviation of this distribution is the standard error of difference, s(x̄-ȳ). See diagram 3. Any score falling into the shaded area of diagram 1 is declared significantly different from the mean. Any mean falling into the shaded area of diagram 2 is significantly
5 When subjects are drawn from a population of pairs, as in married couples, and assigned to husband and wife groups, the groups are not independent, but matched or correlated. We will discuss the correlated-samples t-test later in the chapter.
different from μ. Any difference between means falling into the shaded area of diagram 3 is significantly different from zero, meaning that x̄ and ȳ are significantly different from each other. You will find the importance of this discussion in the logic which flows through all of these tests.
The purpose of this formula is to combine the standard deviations of the two samples (sx, sy) into a single "standard deviation" value: the standard error of difference. Written out in full, it is

s(x̄-ȳ) = √{ [ (Σx² - (Σx)²/nx + Σy² - (Σy)²/ny) / (nx + ny - 2) ] × [ (nx + ny) / (nx·ny) ] }

The first terms inside the radical follow the formula we established for the estimated variance (s²) in Chapter 16: one element computes the sum of squares for x, a similar element computes it for y, and these are added together. The [bracketed] term combines the n's of both samples into a harmonic mean and has the same mathematical function in the formula as "dividing by n," producing a combined variance. Mathematically, we add the sums of squares, divide by degrees of freedom, and multiply by the [harmonic mean]. The square root radical produces the standard-deviation-like standard error of difference. This equation spells out what the standard error of difference does, but it is tedious to calculate. If you have a statistical calculator that computes the mean (x̄) and variance (s²) from sets of data, you can use an equivalent equation much more easily. The equivalent equation substitutes the term (nx-1)sx² for the x sum of squares, and (ny-1)sy² for the y sum of squares. A quick review of the equation for s² shows these terms to be equivalent. The altered equation looks like this:

s(x̄-ȳ) = √{ [ ((nx-1)sx² + (ny-1)sy²) / (nx + ny - 2) ] × [ (nx + ny) / (nx·ny) ] }

Substituting the (n-1), s² (from the calculator), and n terms from both samples quickly simplifies the equation. Let's illustrate the use of these equations with an example.
Example Problem
The following problem will illustrate the computational process for the independent-samples t-test. Suppose you are interested in developing a counseling technique to reduce stress within marriages. You randomly select two samples of married individuals out of ten churches in the association. You provide Group 1 with group counseling and study materials. You provide Group 2 with individual counseling and study materials. At the conclusion of the treatment period, you measure the level of marital stress in the group members. Here are the scores:
Group 1: 25 17 29 29 26 24 27 33 23 14 21 26 20 27 26 32 20 32 17 23 20 30 26 12
Group 2: 21 26 28 31 14 27 29 23 18 25 32 23 16 21 17 20 26 23 7 18 29 32 24 19
Using a statistical calculator, we enter the 24 scores for each group and compute means and standard deviations. Here are the results:
Group 1: x̄ = 24.13, s = 5.64
Group 2: x̄ = 22.88, s = 6.14
Are these groups significantly different in marital stress? Or, asking it a little differently: Are the two means, 24.13 and 22.88, significantly different? Use the independent-samples t-test to find out. Here is the procedure. Step 1: State the research hypothesis. Step 2: State the null hypothesis. Step 3: Compute x̄ - ȳ, the numerator of the t-ratio: 24.13 - 22.88 = 1.25
Step 4: Compute s(x̄-ȳ), the denominator of the t-ratio (for these data, s(x̄-ȳ) ≈ 1.70).
Step 5: Compute t
Step 6: Determine tcv (α=.05, df=46, 2-tail). Degrees of freedom (df) for the independent-samples t-test is given by nx + ny - 2, or 24 + 24 - 2 = 46 in this case. Since the t-distribution table does not have a row for df=46, we must choose either df=40 or df=50. The value for df=40 is higher, and convention says it is better to choose the higher value (to guard against a Type I error). So, tcv (df=40, 2-tail, 0.05) = 2.021. Step 7: Compare: Is the computed t greater than the critical value? t = 0.736, tcv = 2.021. Step 8: Decide. The computed value for t is less than the critical value of t. It is not significant. We retain the null hypothesis. Step 9: Translate into English. There was no difference between the two approaches to reducing marital stress. The t-Test for Independent Samples allows researchers to determine whether two randomly selected groups differ significantly on a ratio (test) or interval (scale) variable. At the end of the chapter you will find several more examples of the use of independent-samples t in dissertations written by our doctoral students. You will often see journal articles that read like this: The experimental group scored twelve points higher on post-test measurements of subject mastery (t=2.751*, α=0.05, df=26, tcv=1.706) and eight points higher in attitude toward the learning experience (t=1.801*) than the control group. (The asterisk [*] indicates "significant difference.") If the two samples are formed from pairs of subjects rather than two independently selected groups, then the t-test for correlated samples should be used.
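The nine steps can be collapsed into a short script using the raw stress scores listed above and the pooled-variance formula developed earlier in the chapter. (Computed from the raw data, t comes out near 0.73; the text's 0.736 reflects rounding in the summary statistics.)

```python
from math import sqrt

# Marital-stress scores from the example problem (24 subjects per group)
group1 = [25, 17, 29, 29, 26, 24, 27, 33, 23, 14, 21, 26,
          20, 27, 26, 32, 20, 32, 17, 23, 20, 30, 26, 12]
group2 = [21, 26, 28, 31, 14, 27, 29, 23, 18, 25, 32, 23,
          16, 21, 17, 20, 26, 23, 7, 18, 29, 32, 24, 19]

def mean(xs):
    return sum(xs) / len(xs)

def sum_of_squares(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs)

nx, ny = len(group1), len(group2)
df = nx + ny - 2                                      # 24 + 24 - 2 = 46
pooled_var = (sum_of_squares(group1) + sum_of_squares(group2)) / df
se_diff = sqrt(pooled_var * (nx + ny) / (nx * ny))    # standard error of difference
t = (mean(group1) - mean(group2)) / se_diff

print(round(t, 2))      # ~0.73
print(abs(t) < 2.021)   # True -> retain the null (tcv for df=40, 2-tail, .05)
```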
The correlation between scores in matched samples controls some of the extraneous variability. In husband-wife pairs, for example, the differences in attitudes, socioeconomic status, and experiences are less for the pair than for husbands and wives in general. Or, when two scores are obtained from a single subject, the variability due to individual differences in personality, physical, social, or intellectual variables is less in the correlated samples design than with independent samples. Or, when a study pairs subjects on a given variable, it generates a relationship through that extraneous variable. In each case, more of the extraneous error (uncontrolled variability) in the study is accounted for. This reduction in overall error reduces the standard error of difference. This, in turn, increases the size of t making the test more likely to reject the null hypothesis when it is false (i.e., more powerful test). On the other hand, the degrees of freedom are reduced when using the correlated pairs formula. For independent samples, df is equal to nx + ny - 2. In the correlated samples, n is the number of pairs of scores and df is equal to n (pairs) - 1. Two groups of 30 subjects in an independent-samples design would yield df=30+30-2=58, but in a correlated samples design, df=30 pairs-1=29. The effect of reducing df is to increase the critical value of t, and therefore we have a less powerful test. The decision whether to use independent or correlated samples comes down to this: Will the loss of power resulting from a smaller df and larger critical value for t be offset by the gain in power resulting from the larger computed t? The answer is found in the degree of relationship between the paired samples. The larger the correlation between the paired scores, the greater the benefit from using the correlated samples test.
I provide here a two-step process for computing the standard error of difference to reduce the complexity of the formula. First, compute sd² using the formula

sd² = (Σd² - (Σd)²/n) / (n - 1)

Note the distinction between Σd² (differences between paired scores which are squared and then summed) and (Σd)² (differences between paired scores which are summed and then squared). Second, use the sd² term to compute the standard error of difference:

s(d̄) = √(sd²/n)

Substitute d̄ and s(d̄) in the matched-t formula and compute t. Compare the computed t-value with the critical value (tcv) to determine whether the difference between the correlated-samples means is significant. Let's look at the procedure with an example.

6 "Correlation" is another category of statistical analysis which measures the strength of association between variables (rather than the difference between groups). We'll study the concept in Chapter 22.
Example Problem
A professor wants to empirically measure the impact of stated course objectives and learning outcomes in one of her classes. The course is divided into four major units, each with an exam. For units I and III, she instructs the class to study lecture notes and reading assignments in preparation for the exams. For units II and IV, she provides clearly written instructional objectives. These objectives form the basis for the exams in units II and IV. Test scores from units I and III are combined, as are scores from units II and IV, giving a total possible score of 200 for the "with" and "without" objectives conditions. Do students achieve significantly better when learning and testing are tied together with objectives? (α=0.01) Here is a random sample of 10 scores from her class:
Instructional Objectives

Subject   Without (I,III)   With (II,IV)      d      d²
  1            165              182          -17     289
  2            178              189          -11     121
  3            143              179          -36    1296
  4            187              196           -9      81
  5            186              188           -2       4
  6            127              153          -26     676
  7            138              154          -16     256
  8            155              178          -23     529
  9            157              169          -12     144
 10            171              191          -20     400
          ΣX = 1607         ΣY = 1779    Σd = -172   Σd² = 3796
means:       160.7             177.9      d̄ = -17.2

Insight: Compare the difference between the means with the mean of differences. What do you find?*
Subtract the "with" scores from the "without" scores (e.g., 165 - 182), producing the difference score d (-17). Adding all the d-scores together yields Σd = -172. Squaring each d-score yields the d² value (-17 × -17 = 289) in the right-most column. Adding all the d²-scores together yields Σd² = 3796. These values will be used in the equations to compute t. Here are the steps for analysis: Step 1: State the directional research hypothesis. Step 2: State the directional null hypothesis. Step 3: Compute the mean difference (d̄): d̄ = Σd/n = -172/10 = -17.2
*Answer: The difference between the means equals the mean of score differences.
Step 5: Compute the standard error of the mean difference: sd² = (Σd² - (Σd)²/n)/(n - 1) = (3796 - 2958.4)/9 = 93.07, so s(d̄) = √(sd²/n) = √(93.07/10) = 3.05
Step 6: Compute t: t = d̄ / s(d̄) = -17.2/3.05 = -5.64
Step 7: Determine tcv from the t-distribution table. In this study, α = 0.01. tcv (df=9, 1-tail test, α=.01) = 2.821
Step 8: Compare: Is the computed t greater (in absolute value) than the critical value? t = -5.64, tcv = 2.821
Step 9: Decide: The calculated t-value is greater in absolute value than the critical value. It is significant. Therefore, we reject the null hypothesis and accept the research hypothesis. The "with objectives" test scores (x̄ = 177.9) were significantly higher than the "without objectives" test scores (x̄ = 160.7). Step 10: Translate into English: Providing students learning objectives which are tied to examinations enhances student achievement of course material.
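Recomputing the analysis directly from the raw scores in the table is a good check on the hand calculation. (Summing the d² column gives Σd² = 3796 and t near -5.64; |t| far exceeds tcv = 2.821 either way, so the decision is unchanged.)

```python
from math import sqrt

without = [165, 178, 143, 187, 186, 127, 138, 155, 157, 171]   # units I, III
with_obj = [182, 189, 179, 196, 188, 153, 154, 178, 169, 191]  # units II, IV

d = [a - b for a, b in zip(without, with_obj)]   # e.g. 165 - 182 = -17
n = len(d)
sum_d = sum(d)                                   # -172
sum_d2 = sum(x * x for x in d)                   # 3796
var_d = (sum_d2 - sum_d ** 2 / n) / (n - 1)      # sd squared
se_dbar = sqrt(var_d / n)                        # standard error of the mean difference
t = (sum_d / n) / se_dbar

print(round(t, 2))       # ~ -5.64: |t| far exceeds tcv = 2.821, so reject the null
```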
The 95% confidence interval about the difference between two independent means follows the same form as the one-sample interval. The two-sample confidence interval is given by

CI95 = (X̄ − Ȳ) ± tcv (sX̄−Ȳ)

where sX̄−Ȳ is the standard error of difference.
Chapter 20
Two-Sample t-Tests
Confidence intervals for two-sample tests are rarely reported in the literature. This section is included more for completeness of the subject than for practical considerations.
Summary
Two-sample t-tests are among the most popular of statistical tests. The t-test for independent samples assumes that the group scores are randomly selected individually (no correlation between groups) while the t-test for correlated samples assumes that the group scores are randomly selected as pairs.
Examples
Dr. Roberta Damon, cited in Chapter 19 for her use of the one-sample z-Test, also used the independent-samples t-Test to determine differences between missionary husbands and wives on three variables: communication, conflict resolution, and marital satisfaction. Her findings were summarized in the following table:7
TABLE 4
t SCORES AND CRITICAL VALUES FOR THE SCALES OF COMMUNICATION, CONFLICT RESOLUTION AND MARITAL SATISFACTION ACCORDING TO GENDER
(N = 165, df = 163)

Scale                       t         Critical Value   Alpha Level
Communication (8)           2.7578    2.617            .005
Conflict Resolution (9)     0.4709    1.645            .05
Marital Satisfaction (10)  -0.7141    1.645            .05
Dr. Don Clark studied the statistical power generated in dissertation research by doctoral students in the School of Educational Ministries, Southwestern Seminary, between 1978 and 1996.11 Clark's interest in this subject grew out of a statement I made in research class: Many of the dissertations written by our doctoral students fail to show significance because the designs they use lack sufficient statistical power to declare real differences significant. He wanted to test this assertion statistically. One hundred hypotheses were randomly selected from two populations: hypotheses proven significant (Group X) and those not (Group Y). Seven of these were eliminated because they did not fit Clark's study parameters, leaving ninety-three.12
7 Damon, 41.
8 Husbands were significantly more satisfied than wives with the way they communicate. "This indicates a felt dissatisfaction among the wives which their husbands do not share" (p. 41).
9 No difference in satisfaction with how couples resolve conflicts.
10 No difference in marital satisfaction.
11 Clark, 30.
12 Ibid., 39.
13 Ibid., 44.
The scores for this study were the computed power values for all 93 hypotheses of difference. The mean power score (n = 47) for "significant hypotheses" was 0.856. The mean power score (n = 46) for "non-significant hypotheses" was 0.3397. The standard error of difference was 0.052, resulting in a t-value of 9.904,13 which was tested against a tcv of 1.660. The power of the statistical test is significantly higher in those dissertations' hypotheses proven statistically significant than those . . . not proven statistically significant.14 At first glance, the findings seem trivial. Wouldn't one expect to find higher statistical power in dissertations producing significant findings? The simple answer is no. A research design and statistic produce a level of power unrelated to the findings. A fine-mesh net will catch fish if they are present, but a broad-mesh net will not. We seek fine-mesh nets in our research. The average power level of 0.34 for non-significant hypotheses shows that these studies were fishing with broad-mesh nets. As I had said, these studies were doomed from the beginning, before the data was collected. Dr. Clark found this to be true, at least for the 93 hypotheses he analyzed. We have been growing in our research design sophistication over the years. Computing power levels of research designs as part of the dissertation proposal has not been required in the past. Based on Dr. Clark's findings, we need to seriously consider doing this in the future, if for no other reason than to spare our students from dissertation research which is futile in its very design from the beginning.

Dr. John Babler conducted his dissertation research on spiritual care provided to hospice patients and families by hospice social workers, nurses, and spiritual care professionals.15 In two of his minor hypotheses, he used the independent-samples t-Test to determine whether there were differences in provision of spiritual care16 between male and female providers, and between hospice agencies related or unrelated to a religious organization. Males and females (n = 21, 174) showed no significant difference in the provision of spiritual care (Meanm = 49.6, Meanf = 51.2, t = -0.83, p = 0.409). Agencies related to religious organizations (n = 26) and those which were not (n = 162) showed no difference in provision of spiritual care (Meanr = 51.5, Meannr = 51.1, t = 0.22, p = 0.828).
Vocabulary
Correlated samples -- samples of paired subjects; tests differences between pairs
Independent samples -- samples randomly drawn independently of each other
Mean difference (d̄) -- average difference between two matched groups
Pooled standard deviation -- combined variability (standard deviation) of two groups of scores
Standard error of difference (independent samples) -- denominator of the t-Test for Independent Samples
Standard error of difference (matched samples) -- denominator of the t-Test for Correlated Samples
t-Test for Correlated Samples -- tests difference between means of two matched samples
t-Test for Independent Samples -- tests difference between means of two independent samples
Study Questions
1. Compare and contrast standard deviation, standard error of the mean, and standard error of difference.
14 Ibid., 45.
15 Babler, 7.
16 Scores were generated by an instrument which Babler adapted by permission from the Use of Spirituality by Hospice Professionals Questionnaire developed by Millison and Dudley (1992), p. 34.
2. Differentiate between descriptive and experimental use of the t-test.
3. List three ways we can design a correlated-samples study. Give an example of each type of pairing.
4. Discuss the two factors involved in choosing to use one test over the other. What is the major factor to consider?
5. A professor asked his research students to rate (anonymously) how well they liked statistics. The results were:

         Males   Females
mean     5.25    4.37
s²       6.57    7.55
n        12      31

a. State the research hypothesis
b. State the null hypothesis
c. Compute the standard error of difference
d. Compute t
e. Determine the critical value (α = 0.01)
f. Compare
g. Decide
h. Translate into English
i. Develop a CI99 for the differences.
7. Test the hypothesis that there is no difference between the average scores of husbands and their wives on an attitude-toward-smoking scale. Use α = 0.05 and the process listed in #6. (NOTE: Develop a CI95 confidence interval.)

Husband   16    8   20   19   15   12   18   16
Wife      10   14   15   15   13   11   10   12
Chapter 21
One-Way Analysis of Variance
Why Not Multiple t-Tests?
F-Ratio Fundamentals
The F-Distribution Table
Computing the F-Ratio
Multiple Comparison Procedures
In Chapter 20 we presented two minor hypotheses from Dr. John Babler's dissertation on spiritual care provision through hospice agencies. His primary hypothesis was that there would be significant differences in [provision of spiritual care scores] . . . between social workers, nurses, and spiritual care professionals.1 The mean scores on provision of spiritual care for these three groups were 47.23, 50.75, and 55.94 respectively. Applying the Analysis of Variance procedure, Babler found these three groups differed significantly in their provision of spiritual care (F = 10.547, p = 0.000). Application of the (F)LSD multiple comparison test revealed that the three groups differed significantly from each other.2 Social workers scored lowest, professional spiritual care providers highest, and nurses in between.3 We established the fundamentals for parametric testing in Chapters 17 and 18. We learned how to apply one-sample z- and t-tests in Chapter 19. We extended these principles to two-sample tests in Chapter 20. The next logical step is testing the differences on a single dependent variable among three or more group means. The procedure to use is one-way analysis of variance, more popularly known as one-way ANOVA.
1 Babler, p. 32.
2 Ibid., p. 47. Note: use of the Least Significant Difference test is valid when the F-ratio is significant. This test was designed by Sir Ronald Fisher, developer of the ANOVA procedure (hence the name F-Ratio). Carmer and Swanson call this the Fisher-Protected LSD (FLSD).
3 Ibid., p. 48.
The multiple application of t-tests was used earlier in this century until Englishman R. A. Fisher showed that the Type I error rate expands from α to some larger value as the number of tests between paired means increases. The error rate expansion is constant and predictable, given by the following equation:

p = 1 − (1 − α)^k

where p is the actual Type I error rate of all tests together, α is the stated level of significance, and k is the number of tests performed. In the A-B-C example above, the true probability (p) of committing a Type I error using three t-tests (α = 0.05) is given by

p = 1 − (1 − α)^k = 1 − (1 − .05)³ = 1 − .95³ = 1 − 0.857 = 0.143

In other words, we run a 14.3% chance of wrongly declaring two means significantly different, even though we set the error rate (α) at 5%. The problem grows with the number of means in the experiment. Suppose an experiment consists of ten groups. The researcher decides to apply the independent t-test (α = 0.05) to all pairs of means. The number of required t-tests equals (k)(k−1)/2, where k is the number of means in the experiment. With k = 10 means, there are 10(9)/2 or 45 t-tests to compute. The Type I error rate across these 45 tests (p) is

p = 1 − (1 − .05)⁴⁵ = 1 − .95⁴⁵ = 1 − .0994 = .9006

This means there is a 90% chance of committing a Type I error, with α set at 5%! Since we want to lock the Type I error rate to α when testing multiple means, multiple t-tests should not be used. Sir Ronald Fisher proposed a solution, however, and he named his procedure the Analysis of Variance, or ANOVA. The F in F-ratio comes from his name. We have been walking down the "parametric road of differences" since chapter 16. From the simple z-score formula in chapter 16, through one- and two-sample parametric tests, there has been a common thread tying all these procedures together. That thread (perhaps you've already seen it) is that every procedure involves a ratio of difference between to difference within.
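The error-rate expansion is easy to verify computationally. This sketch (ours, not from the text) reproduces both worked examples:

```python
# Fisher's point: the familywise Type I error rate p = 1 - (1 - alpha)**k grows
# with the number k of pairwise t-tests performed at significance level alpha.
def familywise_error(alpha, k):
    return 1 - (1 - alpha) ** k

def num_pairwise_tests(means):
    # number of t-tests needed to compare all pairs of means: k(k-1)/2
    return means * (means - 1) // 2

print(round(familywise_error(0.05, 3), 3))    # 0.143 for the A-B-C example
print(num_pairwise_tests(10))                 # 45 tests for ten groups
print(round(familywise_error(0.05, 45), 4))   # 0.9006 across those 45 tests
```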
This z-equation for a score is a ratio of "difference between" an individual score and the population mean in the numerator, and "difference within" the group (the population standard deviation σ) in the denominator. If n > 30, s estimates σ and X̄ estimates μ, giving the second z-equation. We can use the t-equation (3rd), especially when n < 30. It uses the same ratio of "difference between" score and sample mean and "difference within" (estimated standard deviation). The next z-equation (4th) is a ratio of "difference between" a sample mean (X̄) and population or hypothesized mean (μ) in the numerator, and difference within the sampling distribution (standard error of the mean) in the denominator. The t-equation (5th) uses estimated values in the same ratio. The next t-equation (6th) is a ratio of difference between two sample means in the numerator, and difference within both samples (standard error of difference) in the denominator. While the form of the last t-equation (7th) changes, conceptually it maintains the ratio of difference between paired scores to difference within all scores together. In all these cases, the between-to-within ratio remains constant. ANOVA continues this relationship. No matter how many groups are involved in an experiment, the ANOVA procedure breaks down the sum of squares of all experimental scores into two parts -- a difference between part and a difference within part. The ratio of between to within differences forms the F-ratio, just as we have done all along.
Sums of Squares
We illustrate the process of dividing the Total Sum of Squares (SSt) into two parts with the diagram at right. Here we find three groups of scores with grand mean X̄g (the mean of all scores in the study) and sample means X̄1, X̄2, and X̄3. Let's focus on one score in Sample 3, indicated in the diagram by X3,1. The distance between X3,1 and X̄g (the "T" line in the diagram) equals this one score's part of the total deviation between the scores and the grand mean. When we subtract X̄g from every score in the experiment and add these deviations together, we'll get 0 (Σx = 0). Square all the deviations and sum them to produce SSt for the experiment. The full equation for SSt is

SSt = Σ(j=1..k) Σ(i=1..nj) (Xij − X̄g)²
where k is the number of groups, j is the group counter which increments from 1 to k, nj is the number of scores in the jth group, and i is the score counter which increments from 1 to nj. The equation says to subtract the grand mean from each score in each group, square the deviation, and sum them all up across groups to produce SSt. Next, notice that the T line in the diagram is equal to the sum of the other two parts. The first part is labelled B -- for Between Sum of Squares (SSb) -- and is the deviation of X̄3 from X̄g. If we square these deviations, adjust for the number of subjects in each group, and sum them, we'll have SSb for the experiment. The full equation for SSb is

SSb = Σ(j=1..k) nj (X̄j − X̄g)²
The equation says to subtract the grand mean from each sample mean, square the difference, and multiply by the number of subjects in the sample. Add the k elements together to produce SSb. The second part is labelled W -- for Within Sum of Squares (SSw) -- and is the deviation of each score from its own sample mean. If we square all of these deviations and sum them for Sample Three, and do the same for Samples One and Two, we'll have SSw for the experiment. The equation for SSw is

SSw = Σ(j=1..k) Σ(i=1..nj) (Xij − X̄j)²
4th ed. 2006 Dr. Rick Yount
The equation says to subtract each group's mean from each of the scores within that group, square the differences, and add them up. Add all these elements across all groups to produce SSw. This gives us the combined "within" sum of squares for all the groups in the experiment. Notice that the Total line in the diagram equals the sum of the Between and Within segments, illustrating that the total sum of squares in any experiment of three or more groups can be divided into two parts, SSb and SSw, such that SSt = SSb + SSw.
Degrees of Freedom
Each sum of squares term (SSb, SSw, SSt) in ANOVA has an associated df term (dfb, dfw, dft). The between df term is k groups minus 1 (dfb = k-1). The within df term is the total number of scores in the experiment (N) minus the number of groups (k) in the experiment (dfw = N-k).
The total df term equals number of subjects minus 1 (dft = N-1). SSb + SSw = SSt. In the same way, dfb + dfw = dft.
(k − 1) + (N − k) = N − 1
In the one-sample test we lost one degree of freedom (df = n-1). In the two-sample test we lost two degrees of freedom (df = n + n - 2). It follows that when k groups are studied, we lose k degrees of freedom (dfw = N-k.)
Variance Estimates
Review: Recall from Chapter 16 that variance (s²) equals the sum of squares (Σx²) divided by degrees of freedom (n − 1). ANOVA computes a between variance estimate4 (MSb) and a within variance estimate (MSw) from the SS and df terms defined above:

MSb = SSb / dfb        MSw = SSw / dfw
The MS terms stand for mean-square. Variance equals the average sum of squares. So, MSb (mean-square-between) stands for the mean of the squared deviations between groups. And MSw (mean-square-within) stands for the mean of the squared deviations within all groups. We do not take the square root of variance, as we did in the previous procedures. The F-ratio is built from these two variance estimates, hence the name, Analysis of Variance.5
The F-Ratio
The F-ratio of ANOVA is the ratio of MSb to MSw:

F = MSb / MSw

This value is compared to a critical F value drawn from the F-distribution table to determine whether it is significant or not.
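The sum-of-squares partition and the F-ratio can be illustrated with a short sketch. The three groups of scores below are hypothetical, invented only to show that SSt = SSb + SSw and F = MSb/MSw:

```python
# Hypothetical data (not from the text): three groups of four scores each,
# used only to demonstrate the one-way ANOVA computations described above.
groups = [
    [4, 5, 6, 5],
    [7, 8, 6, 7],
    [9, 8, 10, 9],
]

all_scores = [x for g in groups for x in g]
N, k = len(all_scores), len(groups)
grand_mean = sum(all_scores) / N

ss_t = sum((x - grand_mean) ** 2 for x in all_scores)
ss_b = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
ss_w = sum((x - sum(g) / len(g)) ** 2 for g in groups for x in g)
assert abs(ss_t - (ss_b + ss_w)) < 1e-9   # the partition SSt = SSb + SSw holds

df_b, df_w = k - 1, N - k                 # and dfb + dfw = dft = N - 1
ms_b, ms_w = ss_b / df_b, ss_w / df_w
F = ms_b / ms_w

print(ss_b, ss_w, ss_t, round(F, 1))  # 32.0 6.0 38.0 24.0
```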
4 Variance (s²) equals sum of squares (Σx²) divided by degrees of freedom (n − 1). The two MS terms equal sum of squares divided by degrees of freedom.
5 If you were to apply the independent-samples t-Test and ANOVA to the same two groups of scores, the resulting t and F values would have the relationship F = t², or t = √F.
An Example
At the beginning of the chapter we highlighted the findings of Dr. John Babler's dissertation on spiritual care provision. Here are two ANOVA tables from his dissertation.
TABLE 2
ANALYSIS OF VARIANCE OF SCORE AND HOSPICE PROFESSION

Source    DF    Mean Squares    F Ratio    F Probability
Between   2     680.3283        10.5465    .0000
Within    193   64.5074
Total     195
The F-ratio of 10.5465 is significant because p = .0000; that is, p(F) is very small, much less than α = 0.05. This p(F) tells us that the spiritual care provision means of the three professions (47.23, 50.75, and 55.94) were significantly different.
6 Babler, p. 47.
21-5
TABLE 8
ANALYSIS OF VARIANCE OF SCORE AND AGE

Source    DF    Mean Squares    F Ratio    F Probability
Between   4     219.2222        3.1952     .0247
Within    191   68.6090
Total     194
In Table 8 we see an F of 3.1952 (p = .0247). This p(F) tells us that the spiritual care provision means of the four age groups of professionals were significantly different. These means were 52.08 (26-35), 48.39 (36-45), 52.46 (46-55) and 52.12 (over 55). In these examples, you can see that the computer printout includes a p value for the computed F-ratio. This p is the exact probability of obtaining the computed F-ratio. It is easier to compare the computed p with α than it is to look up an F critical value in a table. If p < α, then reject the null hypothesis.
Procedures Defined
To determine which means differ sufficiently to produce the significant F-ratio, researchers study differences between pairs of means by using statistical techniques called multiple comparison procedures. These procedures vary in definition and computation. We will briefly discuss four procedures here: the Least Significant Difference (LSD), the Tukey Honestly Significant Difference (HSD), the Student-Newman-Keuls (SNK), and the Fisher-Protected Least Significant Difference (FLSD).
The answer, of course, lies in the demands on educator-researchers to conduct and publish research. Since refereed journals demand articles reporting significant findings, many professors sought ways to produce significant findings more often. One way this was done was to use the LSD without a prior significant F-ratio, violating Fisher's guidelines. The result was an explosion of Type I errors and false positives in the literature. The problem was not the LSD test per se, but its misuse. Other theorists sought ways to reduce the problem of excessive Type I error rate. The LSD generates the lowest critical value (and the highest level of power) of those discussed here, when used as Fisher directed.
Procedures Computed
In each of these procedures, a value is computed and compared to the differences between paired means. If the difference between two means is greater than the multiple comparison value, the two means are declared significantly different. We will forego the specific formulas for each of these procedures since this computational work will most likely be done by computer. However, given specific elements of an ANOVA example, multiple comparison critical values can be compared with each other. The following table displays the values:

r    (F)LSD     SNK       HSD
5    12.522*    17.531    17.531**
4    12.522     16.502    17.531
3    12.522     15.026    17.531
2    12.522     12.522    17.531
Notice that the (F)LSD procedure produces the smallest critical value (12.522), producing the greatest power. The HSD procedure produces the largest critical value (17.531), producing the least power. The SNK procedure produces a range between the two. There is a great deal of confusion in the literature concerning multiple comparisons. My Ph.D. dissertation* focused on six multiple comparison tests and analyzed their error rates by a computerized Monte Carlo technique. I generated 28,000 sets of random data, 1000 tests for each of 28 n- and k-combinations related to educational research. My findings indicated the best multiple comparison procedure, under all conditions, was the (F)LSD. It consistently provided the greatest power and an error rate closest to α. HSD was too conservative (consistently produced low power). The Scheffe method (not discussed in the chapter, but included in the study) consistently decreased the level of significance below α, reducing the power of the test more than any other. Scheffe consistently reduces the likelihood of finding "true differences" and should be avoided except under very narrow conditions. If you need a multiple comparison test, my unqualified recommendation is the (F)LSD. If using SPSS, check the box marked LSD under multiple comparison tests, but only use the results if the F-ratio is significant.
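Although the chapter foregoes the multiple comparison formulas, the standard form of Fisher's LSD critical difference can be sketched as follows. The inputs (t_cv, MSw, group sizes) are illustrative assumptions of ours, not values from the text:

```python
import math

# Standard form of Fisher's LSD critical difference for one pair of means:
#   LSD = t_cv * sqrt(MSw * (1/n_i + 1/n_j))
# t_cv = 2.052 is the two-tailed t critical value for alpha = .05, dfw = 27.
# All numbers below are hypothetical, chosen only to illustrate the computation.
def lsd(t_cv, ms_w, n_i, n_j):
    return t_cv * math.sqrt(ms_w * (1 / n_i + 1 / n_j))

critical = lsd(t_cv=2.052, ms_w=64.5, n_i=10, n_j=10)
print(round(critical, 2))  # 7.37
```

Two means would be declared significantly different when they differ by more than this critical value, and only after a significant F-ratio, as Fisher directed.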
Summary
In this chapter we established the general procedure for use of the one-way Analysis of Variance (ANOVA) test. We explained the problem of using multiple ttests. We illustrated the breakdown of total sum of squares into between and within parts. We showed how each element in an ANOVA table is computed. We discussed the ANOVA table and the relationships among the various table elements. Finally, we introduced the concept of multiple comparison procedures and illustrated their use.
Examples
Dr. John Babler's Table 2, p. 21-5, displayed a significant F between the means of the three professions. But which professional group differed significantly from the others? Babler used a computerized LSD test (with a significant F) to determine that
*Yount. A Monte Carlo Analysis... Ph.D. diss., University of North Texas, 1985, 45-46
each of the three means differed significantly from the others.10 We can put them in a difference table to see the paired differences more clearly. Group 1 = spiritual care professionals, 2 = nurses, 3 = social workers.

                          Nurses (50.75)           Spiritual Care (55.94)
Social workers (47.23)    3.52 <-- smallest        8.71 <-- largest
Nurses (50.75)                                     5.19
The largest difference (8.71) is between the highest mean (55.94) and the lowest (47.23). You can see the differences between all three paired means above. All of these differences exceeded the critical value, and were declared significant. Dr. Gail Linam's dissertation (see Chapter 24-1 for full reference and Chapter 25 for the two-way ANOVA findings) compared reading comprehension in children grades 4-6 across three translations of Scripture, the King James, New International and New Century versions. She found that the KJV produced significantly lower comprehension scores than either NIV or NCV. She applied the FLSD procedure to determine exact differences between versions. For the Old Testament Retelling (OTR) scores, mission children scored significantly lower with KJV than NCV (7.00 < 23.11) and main campus children significantly lower with KJV than either NCV or NIV (18.81 < 34.41, 18.81 < 32.95). For the New Testament Retelling (NTR) scores, mission children scored significantly lower with KJV than either NCV or NIV (7.25<29.44, 7.25<21.56) and main campus children the same (25.55<37.55, 25.55<34.60).11 For Old Testament Cloze (OTC) scores, mission children scored significantly lower with KJV than either NCV or NIV (4.22<17.56, 4.22<13.33), and main campus children the same (6.55<23.50, 6.55<23.27).12 For New Testament Cloze (NTC) scores, mission children scored significantly lower with the KJV than with either NCV or NIV (0.38<16.11, 0.38<10.78) and main campus children scored the same (11.05<23.50, 11.05<22.82).13 In every case and under every condition, the KJV produced significantly lower reading comprehension scores using two different types of testing procedures and stories from both Old and New Testaments. Older children (4th-6th grades) simply do not understand the King James translation as well as the NIV or NCV versions.
Vocabulary
ANOVA -- Analysis of Variance: tests differences among 3 or more independent sample means
dfb -- degrees of freedom between the means: = k − 1
dft -- degrees of freedom total: df for the whole experiment: = N − 1
dfw -- degrees of freedom within: (n − 1 per group)(k groups) = N − k
FLSD -- Fisher-protected LSD: LSD test protected by a prior significant F-ratio
HSD -- Tukey Honestly Significant Difference mcp: very conservative
LSD -- Least Significant Difference: high Type I error without a significant F-ratio
MSw -- mean square within: variance within-all-groups-combined
MSb -- mean square between: variance between means and grand mean
Multiple t-tests -- applying t-tests to multiple pairs of means in an experiment with three+ groups
SNK -- Student-Newman-Keuls mcp which uses a range of critical values
SSb -- between sum of squares: SS term between grand mean and group means
SSt -- total sum of squares: SS term between grand mean and all scores
SSw -- within sum of squares: SS between scores and their respective group means

10 Babler, 49.
11 Linam, 109.
12 Ibid., 111.
13 Ibid.
Study Questions
Dr. Martha Bergen studied attitudes toward computer-enhanced learning for seminary education among full-time professors at Southwestern Baptist Theological Seminary in 1989.14 One of her hypotheses was that there would be a significant difference [in attitude toward computer-enhanced learning] between the professors in the religious education, theology, and church music schools. Scores were generated from an attitude scale Dr. Bergen developed for the study. The mean attitude scores for the three schools were 118 (highest) in the Religious Education faculty, 117 in the church music faculty, and 114 (lowest) in the theology faculty. But were these differences in attitude significant? Here is the ANOVA table she generated:15
SOURCE OF VARIATION    SUM OF SQUARES    df    MS         F       p
Between                323.387           2     161.694    .472    .626
Within                 25018.652         73    342.721
Total                  25342.039         75
I. Using the problem and printout above, answer these questions:
1. Is the F-ratio significant? Explain why you say this.
2. Explain this F-ratio in terms of the three group means: 114, 117, 118.
3. How do you explain the differences in the school mean scores?
4. Dr. Bergen did not apply multiple comparison tests to see if any single mean was significantly different from the others. Why? Was she correct in doing so?
II. General Chapter Questions:
1. Explain the problem of using several t-tests to determine significant differences among pairs of means.
2. Since the FLSD is a modified multiple t-test, explain how it overcomes the problem explained in #1.
3. Explain in your own words how ANOVA divides the total sum-of-squares into between and within parts. (Use the deviation explanation.)
4. Fill in the ANOVA table below. You are testing the means of 4 groups of 10 subjects each. The SSb = 480.0 and SSt = 1440.0. Compute the F-ratio. Determine the critical value (α = 0.05). Is the F-ratio significant?
14
15
Ibid., 87
Chapter 22
Correlation Coefficients
The Meaning of Correlation
Correlation and Data Types
Pearson's r
Spearman rho
Other Coefficients of Note
Coefficient of Determination r²
The concept of correlation was introduced in Chapters 1 and 5. Our focus since Chapter 16 has been basic statistical procedures that measure differences between groups -- one-sample, two-sample, and k-sample tests. Now we turn our attention to basic statistical procedures that measure the degree of association between variables. Dr. Wesley Black studied the relationship between rankings of selected learning objectives in a youth discipleship taxonomy between full-time church staff youth ministers and seminary students enrolled in youth education courses at Southwestern Seminary.1 Questionnaires were returned by 318 students and 184 youth ministers.2 Ten objectives in each of five categories (Personal Ministry, Christian Theology and Baptist Doctrine, Christian Ethics, Baptist Heritage, and Church Polity and Organization) were ranked by these two groups. The basic question raised by Black in this study was whether students prioritized discipleship training objectives for youth in the same way as full-time ministers in the field. Using the Spearman rho correlation coefficient, Black found the correlations of rankings generated by students and ministers of the ten items for each of five categories were as follows: Personal Ministry, 0.915; Christian Theology and Baptist Doctrine, 0.867; Christian Ethics, 0.939; Baptist Heritage, 0.939; and Church Polity and Organization, 0.927. Each of these is a strong positive correlation3 between the rankings of objectives by students and ministers.
1 Wesley Black, A Comparison of Responses to Learning Objectives for Youth Discipleship Training from Ministers of Youth in Southern Baptist Churches and Students Enrolled in Youth Education Courses at Southwestern Baptist Theological Seminary, (Fort Worth, Texas: Southwestern Baptist Theological Seminary, 1985). 2 Black received 356 responses from students, but 38 of these were full-time youth ministers, and so were excluded from the study, leaving 318 student responses. Of the 307 responses from youth ministers, 197 indicated they were full-time church staff youth ministers. Thirteen additional responses were eliminated for incompleteness, leaving 184 youth minister responses. pp. 71-72 3 Any coefficient over 0.80 indicates a strong positive correlation.
[Table: correlation coefficients matched to types of paired data -- Pearson's rxy, Spearman's rs (*requires interval/ratio data to be ranked), the contingency coefficient C (**requires a χ² value), the Point Biserial and Rank Biserial coefficients, and phi.]
Finally, Kendall's Coefficient of Concordance (W) computes the correlation of three or more rankings of items. Now we'll look at how each of these correlation coefficients is computed.
distribution produces a smaller r than a normal distribution. Use a large scale for variables in correlational analysis, since the larger the variability, the stronger the coefficient will be. A common mistake in research design is to use age categories rather than actual ages, or salary categories rather than actual dollar values. The range of categories will always be much smaller than the range of actual data, reducing the value of r. Pearson's r is computed with the following formula:

r = [nΣXY − (ΣX)(ΣY)] / √{[nΣX² − (ΣX)²][nΣY² − (ΣY)²]}
where n equals the number of score-pairs, and X and Y equal paired scores. While this formula is somewhat foreboding, it consists of the following simple components:
ΣXY -- multiply X by Y and sum
ΣX -- sum all the X scores
ΣY -- sum all the Y scores
ΣX² -- square all Xs and sum
ΣY² -- square all Ys and sum
(ΣX)² -- square the sum of X
(ΣY)² -- square the sum of Y
Let's say we have a set of 5 paired scores: (3,6), (5,9), (2,5), (3,7), and (4,8). A scatter-plot of this data is shown at left. From what you can see in this graph, do you predict a strong or weak correlation coefficient? We've put the paired X-Y values in the chart below to facilitate computing the various elements of the Pearson's r formula. The letters in the chart (A-G) refer to the steps below (A-G).
X          X²          Y          Y²           XY
3          9           6          36           18
5          25          9          81           45
2          4           5          25           10
3          9           7          49           21
4          16          8          64           32
ΣX = 17    ΣX² = 63    ΣY = 35    ΣY² = 255    ΣXY = 126
(ΣX)² = 289            (ΣY)² = 1225

A. Add up the Xs. This is ΣX, and equals 17.
B. (ΣX)² = 17 × 17 = 289
C. Add up the Ys. This is ΣY, and equals 35.
D. (ΣY)² = 35 × 35 = 1225
E. Multiply each X-Y pair together and add. This is ΣXY, and equals 126.
F. Square each X and add up the squared values. ΣX² = 63
G. Square each Y and add up the squared values. ΣY² = 255
Now substituting into the raw score equation we have:

r = [5(126) − (17)(35)] / √{[5(63) − 289][5(255) − 1225]} = (630 − 595) / √[(26)(50)] = 35 / √1300 = +0.971

Before going on, be sure to identify each term in the equation above with the chart above and the equation on the previous page.
The Pearson r value of +0.971 indicates a very strong -- nearly perfect -- positive correlation between these two variables.
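The raw-score computation can be sketched in Python (our illustration, using the five paired scores from the chart):

```python
import math

# Raw-score Pearson r for the five paired scores in the example
pairs = [(3, 6), (5, 9), (2, 5), (3, 7), (4, 8)]
n = len(pairs)

sum_x  = sum(x for x, _ in pairs)        # ΣX  = 17
sum_y  = sum(y for _, y in pairs)        # ΣY  = 35
sum_xy = sum(x * y for x, y in pairs)    # ΣXY = 126
sum_x2 = sum(x * x for x, _ in pairs)    # ΣX² = 63
sum_y2 = sum(y * y for _, y in pairs)    # ΣY² = 255

r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
    (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
)
print(round(r, 3))  # 0.971
```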
Spearman's rho is computed with the following formula:

rs = 1 − (6ΣD²) / [n(n² − 1)]

where D is the difference between paired ranks and n is the number of pairs. The number 6 is a constant. Suppose a pastor asked two staff members to rank ten church objectives according to how well they were being accomplished by the church. Here are the rankings of the ministers.
Objective    Min Ed Rank    Min Youth Rank
1            1              2
2            2              1
3            3              5
4            4              3
5            5              7
6            6              6
7            7              4
8            8              10
9            9              9
10           10             8
Question: Do these two staff members agree in their evaluation of the objectives?
The two ranked variables in Dr. Black's dissertation (p. 1) were the ranking of objectives by students and the ranking of objectives by youth ministers. This was accomplished by assigning a score value to each individual subject's set of ranks, computing means of these scores, and rank-ordering the objectives by the means. The result was a separate high-to-low ranking of objectives for each of the two groups. Spearman's rho was then applied to compute the degree of agreement between the two rankings.
What is the strength of their agreement? First we compute the differences (D) between ranks, then square the differences (D²), sum the squares (ΣD²), and substitute into the formula. The table below summarizes the process:
Objective   Min Ed Rank   Min Youth Rank     D      D²
    1            1               2          -1       1
    2            2               1          +1       1
    3            3               5          -2       4
    4            4               3          +1       1
    5            5               7          -2       4
    6            6               6           0       0
    7            7               4          +3       9
    8            8              10          -2       4
    9            9               9           0       0
   10           10               8          +2       4
  n = 10                                 ΣD = 0   ΣD² = 28
Objective 1 was ranked highest (1) by the minister of education and second (2) by the minister of youth. Subtracting 2 from 1 yields a difference (D) of -1. Squaring D yields a D² of 1. Notice that the sum of differences (ΣD) equals 0. Summing the D² values, we get 28. Substituting the values of ΣD² and n into the Spearman formula, we have

rho = 1 − (6 · 28) / (10 · (10² − 1)) = 1 − 168/990 = +0.83
The coefficient of +0.83 indicates a strong agreement between the two staff ministers with respect to the rankings of church objectives.
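The same substitution can be checked with a short Python sketch (our own illustration; the function name `spearman_rho` is an assumption):

```python
def spearman_rho(ranks1, ranks2):
    """Spearman's rho: 1 - 6*sum(D^2) / (n(n^2 - 1))."""
    n = len(ranks1)
    d2 = sum((a - b) ** 2 for a, b in zip(ranks1, ranks2))  # sum of squared rank differences
    return 1 - (6 * d2) / (n * (n ** 2 - 1))

min_ed = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
min_youth = [2, 1, 5, 3, 7, 6, 4, 10, 9, 8]
print(round(spearman_rho(min_ed, min_youth), 2))  # 0.83
```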
The Phi Coefficient measures the strength of relationship between two dichotomous variables. A study of marital status and attrition rate in college might arbitrarily assign a 1 to married and 0 to not married; a 1 to dropped out and a 0 to remaining in school. Any type of variable that can be classified 1 and 0 can use the phi coefficient. A positive correlation indicates those who score "1" on one variable tend to score "1" on the other. Using the example above, a positive correlation would mean that married students (1) tend to drop out of school (1) more than unmarried students.
Here we again see the distinction between statistical significance and practical importance. A different, more meaningful approach to determining the importance of a correlation coefficient is the coefficient of determination (r²). By squaring the correlation coefficient, one obtains a measure of the common variance between two variables: the proportion of variance in one of the variables accounted for, or explained by, the other. If the correlation between marital satisfaction and number of months married is -0.40, then 16% of the variance (-.40 × -.40 = .16) of one variable is accounted for by the variance of the other (the shaded area at right). We could say that 16% of the variability in marital satisfaction and number-of-months-married overlaps. It follows that 84% of the variability is unaccounted for. In the Catholic sex education study mentioned above, the r² value of r = 0.07 is 0.0049, or 0.49%: one-half of one percent of variance accounted for. Ninety-nine and a half percent (99.51%) of the variance was unaccounted for. This was a meaningless significant finding, to be sure. We will use the concept of r² much more when we discuss regression analysis.
Summary
In this chapter you have been introduced to the concept of correlation. You have learned how to compute the two most popular correlation coefficients, Pearson's r and Spearman rho, as well as learned of several other helpful correlational tools. You have been introduced to the coefficient of determination (r2) which will be of central importance in Chapter 26, Regression Analysis.
Vocabulary
Coefficient of Determination: proportion of variance in one variable accounted for by another (r²)
Contingency Coefficient: measure of association between two nominal variables (max < 1)
Correlation: degree of association between two variables
Correlation coefficient: numerical measure of degree of association between two variables
Cramer's Phi: measure of association between two nominal variables (max = 1)
Kendall's tau: measure of association between two sets of ranks (n < 10 pairs)
Kendall's W: measure of association among three or more sets of ranks
Negative correlation: one element of paired scores increases while the other decreases
Pearson's r: measure of association between two interval/ratio variables
Phi Coefficient: measure of association between two dichotomous variables
Point biserial: measure of association between an interval/ratio variable and a dichotomous variable
Positive correlation: both elements in paired scores increase (or decrease) together
Rank biserial: measure of association between an ordinal variable and a dichotomous variable
Scattergram: graphical representation of the correlation of two variables
Spearman's rho: measure of association between two sets of ranks (n > 10 pairs)
Study Questions
1. Is age related to the length of stay of surgical patients in a hospital? The following data were collected in a recent study.
   Age:  40  36  30  27  24  22  20
   Days: 11   9  10   5  12   4   7
   a. Draw a scatterplot diagram of the data, with AGE on the x-axis and DAYS on the y-axis.
   b. By appearance alone, do AGE and DAYS appear to be related?
   c. Compute the appropriate correlation coefficient.
   d. Interpret the results.
   e. Compute the coefficient of determination. What does it tell you?
2. A professional person and a blue-collar worker were asked to rank 12 occupations according to the social status they attached to each. A ranking of 1 was assigned to the occupation with the highest status down to a ranking of 12 for the lowest. Here are their rankings:

Occupation                   Professional Person   Blue-Collar Worker
Physician                             1                    1
Dentist                               4                    2
Attorney                              2                    4
Pharmacist                            6                    5
Optometrist                          12                    9
School Teacher                        8                   12
Veterinarian                         10                    6
College Professor                     3                    3
Engineer                              5                    7
Accountant                            7                    8
Health Care Administrator             9                   11
Government Administrator             11                   10
____ 1. Preacher popularity by rank and whether he graduated from seminary or not. ____ 2. Reading score in 6-year-olds and whether they participated in HEADSTART preschool program. ____ 3. Seminary GPA and marital satisfaction scores of graduating students. ____ 4. Smoking/not smoking and death by (1) cancer or (2) other causes. ____ 5. Staff position and leadership style category. ____ 6. Ten objectives in Christian Education ranked by pastors and ministers of education.
Chapter 23
Chi-Square Tests
Chi-Square Procedures
The Chi-Square Formula
The Chi-Square Critical Value
Chi-Square Goodness of Fit Test
Chi-Square Test of Independence
Cautions in Using Chi-Square
Dr. Helen Ang studied the relationship between predominant leadership style and educational philosophy of administrators in Christian colleges and universities for her Ed.D. dissertation in 1984.1 Leadership Style was a categorical variable with the following five levels (with percentages of the 113 administrators studied): team administrator (high people/high task: 23%), constituency-centered (moderate people/moderate task: 16%), authority-obedience (low people/high task: 4%), comfortable-pleasant (high people/low task: 38%), and caretaker (low people/low task: 19%).2 Educational Philosophy Profile was a categorical variable with the following six levels (with percentages): idealism (7%), realism (4%), neo-thomism (15%), pragmatism (58%), existentialism (1%), and eclectic (16%).3 Applying the Chi-Square Test of Independence, Dr. Ang found that the variables Leadership Style and Educational Philosophy were independent (χ² = 21.676, χ²cv = 31.410, α = 0.05, df = 20).4
The chi in chi-square is the Greek letter χ, pronounced "ki" as in kite. Chi-square (χ²) procedures measure the differences between observed (O) and expected (E) frequencies of nominal variables, in which subjects are grouped in categories or cells. There are two basic types of chi-square analysis, the Goodness of Fit Test, used with a single nominal variable, and the Test of Independence, used with two nominal variables. Both types of chi-square use the same formula.
1 Helen C. Ang, "An Analytical Study of the Leadership Style of Selected Academic Administrators in Christian Colleges and Universities as Related to their Educational Philosophy Profile" (Fort Worth, Texas: Southwestern Baptist Theological Seminary, 1984).
2 Ibid., 28-29, 46.
3 Ibid., 45.
4 Ibid., 47.
χ² = Σ [(O − E)² / E]

where the letter O represents the Observed frequency -- the actual count -- in a given cell. The letter E represents the Expected frequency -- a theoretical count -- for that cell. Its value must be computed. The formula reads as follows: the value of chi-square equals the sum of the O-E differences, each squared and divided by E. The more O differs from E, the larger χ² is. When χ² exceeds the appropriate critical value, it is declared significant.
The chart above shows the step-by-step procedure in computing the chi-square formula. Notice that both O and E columns add to the same value (N=120).
from each category. For example, the largest contributor to the chi-square is the high tally in category 6. It yields 1.25 of the 2.80 total. The fourth step is to sum the values in the last column to produce the final chi-square value, in this case 2.80.
(600/12 = 50). The first category, Republicans, has 5 parts of the 5:4:3 ratio, or 5 × 50 = 250 Expected voters. The second, Democrats, has 4 parts, or 4 × 50 = 200 Expected voters. The third, Independents, has 3 parts, or 3 × 50 = 150 Expected voters. Putting this in a table as before, we have the following:
       O     E     O-E    (O-E)²   (O-E)²/E
Rep   322   250    72     5184      20.74
Dem   184   200   -16      256       1.28
Ind    94   150   -56     3136      20.91
      600   600  Σ(O-E)=0          χ² = 42.93
Notice that both O and E columns add to 600 (N). Notice that the O-E column adds to zero. Notice that the E values are unequal, reflecting the 5:4:3 ratio derived from the earlier poll. The resulting χ² value equals 42.93.
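The Goodness of Fit computation reduces to a one-line sum, sketched below as an illustration (the function name `chi_square` is ours). Note that summing with full precision gives 42.92; the chapter's 42.93 comes from adding the rounded column entries.

```python
def chi_square(observed, expected):
    """Goodness-of-fit chi-square: sum of (O - E)^2 / E over categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

observed = [322, 184, 94]   # Rep, Dem, Ind
expected = [250, 200, 150]  # proportional (5:4:3) expectations, N = 600
print(round(chi_square(observed, expected), 2))  # 42.92
```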
Looking at the O-E column, we see that we observed more Republicans than we expected (322 > 250), and fewer Independents than expected (94 < 150), based on data from four years before. It is this twisting effect that causes the large chi-square value. In summary, the Goodness of Fit procedure tests one variable across k categories. The computed value is tested for significance at α and df = k − 1. The expected frequencies for each category can be equal (EE) or proportional (PE).
independent to mean related. The two nominal variables form a contingency table of cells.
Each of the 47 schools in the study was categorized by both variables and placed into one of 6 cells. How many deaf schools identify giftedness in their students (II) and use total communication as their primary approach? [15]. How many schools use aural/oral methods and do not identify giftedness in their students (I)? [3]. The table also includes margin totals, labelled Total. The total number of aural/oral schools, regardless of school type, for example, was 3. The total number of Type I schools, regardless of language preference, was 27. The margin totals for the row variable are called row totals (3, 35, 9). The margin totals for the column variable are called column totals (27, 20). The sum of column totals (47) equals the sum of row totals (47), a good check on math accuracy. Margin totals are the means by which expected values are computed.
E = (row 1 total × column 1 total) / table total = (3 × 27) / 47 = 1.723
Putting this in more general terms, we can show the computation of the Expected values for all cells in a 3x4 contingency table.
The above table shows three levels of a column variable (1, 2, 3) and four levels of a row variable (I, II, III, IV). Once the observed frequencies are placed in the table and margin totals computed, expected values for each cell can be computed. The Expected value for any given cell is found by multiplying the cell's row total by its column total and dividing by the table total. Once the expected cell frequencies are computed, the remainder of the computation is the same as demonstrated before: O-E, (O-E)², and (O-E)²/E for each cell.
Degrees of Freedom
We determine df for the Test of Independence by the formula df = (r-1)(c-1), where r = the number of rows and c = the number of columns in the contingency table. For a contingency table of 5 rows and 6 columns, the degrees of freedom would be (5-1)(6-1) or 20. (Each variable loses one degree of freedom).
Application to a Problem
Lets apply this to our example on deaf schools. The expected frequencies are shown bold-faced in parentheses () below. It is suggested that you compute several of these to insure your understanding of the procedure.
Cell 3,2 refers to the cell at row 3, column 2, shown in the table as the shaded cell.
Putting the O and E values into a chart, we have the following computations:
  O      E      O-E    (O-E)²   (O-E)²/E
  3     1.72    1.28   1.638     0.953
 20    20.11   -0.11    .012      .001
  4     5.17   -1.17   1.369      .265
  0     1.28   -1.28   1.638     1.280
 15    14.90     .10    .010      .001
  5     3.83    1.17   1.369      .357
                                 χ² = 2.857
The computed value of 2.857 is smaller than the critical value of 5.991. Therefore, the value is declared not significant. The statistical decision is to retain the null hypothesis. In terms of this study, language preference and school category are not related. It appears that educational approach is unrelated to identifying giftedness in deaf students in these 47 deaf schools.
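The whole Test of Independence, from margin totals to the final statistic, can be sketched as follows. This is our illustration (the function name is an assumption); note that computing with unrounded expected values gives χ² ≈ 2.846 rather than the 2.857 obtained above from rounded Es, with the same non-significant conclusion.

```python
def chi_square_independence(table):
    """Chi-square test of independence; table is a list of rows of observed counts.
    Each expected cell value = (row total * column total) / table total."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n
            chi2 += (o - e) ** 2 / e
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df

# Deaf-schools example: 3 language-preference rows x 2 school-type columns
table = [[3, 0], [20, 15], [4, 5]]
chi2, df = chi_square_independence(table)
print(round(chi2, 3), df)  # 2.846 2
```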
            Republican   Democrat   Independent   Total
Male            170         112          68        350
Female          152          72          26        250
Total           322         184          94        600
Here's our chart. Identify the Os and Es above in the chart below.
       O      E       O-E     (O-E)²   (O-E)²/E
RM    170   187.83   -17.83   317.91     1.69
DM    112   107.33     4.67    21.81      .20
IM     68    54.83    13.17   173.45     3.16
RF    152   134.17    17.83   317.91     2.37
DF     72    76.67    -4.67    21.81      .28
IF     26    39.17   -13.17   173.45     4.43
                                       χ² = 12.13
The computed value of 12.13 is larger than the critical value of 5.991 (α = 0.05, df = 2). Therefore, the value is declared significant. The statistical decision is to reject the null hypothesis. In terms of this study, this result means that gender and political party preference are related. One's political preference is influenced by his or her gender. How are these two variables related? We can answer this by eyeballing the data in the table. The greatest part of the chi-square comes from the FEMALE-INDEPENDENT (IF) cell. We observe fewer women independents than we expect by chance (26 vs. 39.17). The second highest value comes from the MALE-INDEPENDENT (IM) cell. We observe more male independents than we expect by chance (68 vs. 54.83). Notice that men outnumber women among independents. The third highest value comes from the FEMALE-REPUBLICAN (RF) cell. We observe more women Republicans than we expect by chance (152 vs. 134.17). The fourth highest value comes from the MALE-REPUBLICAN (RM) cell. We observe fewer male Republicans than we expect by chance (170 vs. 187.83). Notice that women outnumber men among Republicans.
This pattern of over- and under-representation twists across the table: men are overrepresented and women underrepresented among Independents, while women are overrepresented and men underrepresented among Republicans. It is this twisting motion in the table that indicates that the two variables are related.
Strength of Association
The chi-square test of independence tells you whether two nominal variables are related or not. It does not tell you how strong that relationship is. When you produce a significant chi-square (two variables are related), it is natural to wonder how strong the relationship is. Two procedures can provide such measures: the Contingency Coefficient (C) and Cramer's phi (φc).
Contingency Coefficient
The contingency coefficient (C) computes a Pearson r-type correlation coefficient from a computed χ² value. The formula is

C = √(χ² / (χ² + N))
If you get, say, a chi-square value of 63.383 (significant at α = 0.001) with a sample size of 390, then you can compute the degree of association by substituting χ² = 63.383 and N = 390 into the formula.
If we were to compare this to a maximum value of 1.00, we would conclude that 0.398 is a weak correlation. But the maximum value for C is not 1.00. It is estimated by another formula:
where k is the number of categories in the variable with the fewer categories. Let's say in our case that one of our variables has 6 categories and the other has 3. Then k = 3. The maximum value C can take is then computed as Cmax = √((3 − 1)/3), or 0.817. Comparing 0.398 to 0.817, we would say that we have a moderately strong correlation.8
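Both quantities are one-liners, sketched below as an illustration (function names are ours). Note that substituting the chapter's χ² = 63.383 and N = 390 into the standard C formula gives about 0.374, slightly below the figure quoted above; the comparison against Cmax works the same way either way.

```python
from math import sqrt

def contingency_coefficient(chi2, n):
    """Contingency coefficient: C = sqrt(chi2 / (chi2 + N))."""
    return sqrt(chi2 / (chi2 + n))

def c_max(k):
    """Estimated maximum of C when the smaller variable has k categories."""
    return sqrt((k - 1) / k)

print(round(contingency_coefficient(63.383, 390), 3))  # 0.374
print(round(c_max(3), 3))                              # 0.816
```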
Cramer's Phi
While the contingency coefficient is popular, a better alternative for measuring association in a contingency table is Cramer's phi. The advantage of this procedure is that it ranges from 0.00 to +1.00 and is independent of the size of the table. Cramer's phi is defined as

φc = √(χ² / (N(k − 1)))

where k is the smaller of the number of rows and the number of columns.
Hinkle, p. 320
Howell, p. 105
gested a dissertation using a chi-square table of 16 rows and 16 columns (256 cells) and considered 200 subjects more than enough.10 The reason is power. Fewer subjects than 5 per cell will not allow the chi-square procedures to detect relationships that may exist. If you plan correctly, but lose subjects during the study, or find some category tallies to be much smaller than anticipated, remember that your significance tests are suspect.
Assumption of Independence
We noted in Barb's study of deaf schools that each one of the 47 schools was placed in one and only one cell in the contingency table. Each school was independent of every other school. The assumption of independence means that each subject is located in one and only one cell in the contingency table. This mistake is easy to make, usually by having subjects respond more than once. A student came into my office with a contingency table of tally marks in the fall of 1981 -- my first semester on faculty. His table was the result of $200 in mailings, $300 to a statistician across town, and the prior 10 months of his life. He had listed various educational programs down one side of the contingency table, and five levels of ratings across the top. Each subject checked off a rating for each program. He had 60 subjects and 300 tallies! The observations were not independent (each subject made five responses in the table). He had produced a chi-square value, but the value was meaningless. I encouraged him strongly to go back to his statistician and have him work out another approach to analyzing his data. Proper Planning Prevents Poor Performance -- and sleepless nights, as well.
Inclusion of Non-Occurrences
There is one final warning I would make about the use of chi-square, and this involves the handling of non-occurrences. Let's say you ask 20 men and 20 women whether or not they favor a given proposal. Seventeen men and eleven women say "Yes." With 28 yes responses, we can compute equal Es as 28/2 = 14. The analysis would be set up as follows:
         O     E    O-E   (O-E)²   (O-E)²/E
Male    17    14     3      9       0.643
Female  11    14    -3      9       0.643
                     0             χ² = 1.286
This faulty design produces a chi-square of 1.286, which is not significant. The fault lies in the fact that the number of "no" responses for males and females is excluded. The correct approach is to build a contingency table which includes both yes and no responses, as follows:

        Male   Female
Yes      17      11      28
No        3       9      12
         20      20      40
         O     E    O-E   (O-E)²   (O-E)²/E
Yes-M   17    14     3      9       0.643
No-M     3     6    -3      9       1.500
Yes-F   11    14    -3      9       0.643
No-F     9     6     3      9       1.500
                     0             χ² = 4.286
Now χ² = 4.286 and is significant (χ²cv = 3.84, df = 1, α = 0.05). Looking only at "yes" responses (excluding the "no"s) invalidated the test. Further, it lowered the value of chi-square, leaving us with a non-significant finding -- incorrectly.
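The corrected 2×2 analysis can be verified with a short sketch (an illustration of ours, using the margin-total method for expected values):

```python
def chi_square_2x2(table):
    """Chi-square test of independence for a 2x2 table of observed counts."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    n = sum(rows)
    # expected cell = row total * column total / grand total
    return sum((o - rows[i] * cols[j] / n) ** 2 / (rows[i] * cols[j] / n)
               for i, r in enumerate(table) for j, o in enumerate(r))

# rows = yes/no, columns = male/female -- non-occurrences included
table = [[17, 11], [3, 9]]
print(round(chi_square_2x2(table), 3))  # 4.286
```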
Summary
In this chapter we've introduced the concept of non-parametric, or distribution-free, statistics. We've looked at the chi-square Goodness of Fit tests with both equal and proportional expected frequencies. We've studied the chi-square Test of Independence. The concept of degrees of freedom was discussed. We've illustrated how the chi-square statistic is computed, how the critical value is obtained, and what significance means in plain English.
Example
Dr. Roberta Damon's dissertation was cited (p. 19-1) for her use of the one-sample z-Test. She also used the chi-square Test of Independence to analyze relationships among several other variables. First, she found that level of marital satisfaction and age category were not independent among missionary wives of her sample (χ² = 7.525, χ²cv = 5.99, df = 2, α = 0.05). The younger wives expressed higher marital satisfaction than older wives. Second, she found that conflict resolution and age category were not independent among missionary wives of her sample (χ² = 6.4513, χ²cv = 5.99, df = 2, α = 0.05). The younger wives were more satisfied with the way conflict is resolved in their marriage than older wives.12

11 This student is now a professor at a prominent Christian university, author of many books, and a prominent leader in his professional organization, proving that non-significant research findings need not impair one's career!
12 Roberta Damon, "A Marital Profile," p. 70.
Vocabulary
contingency coefficient: measures strength of association between two nominal variables (C)
contingency table: table of rows and columns in the chi-square test of independence
Cramer's phi: measures strength of association between two nominal variables (φc)
distribution-free tests: statistics which do not assume a normal distribution of data
equal expected frequencies: E-values computed by dividing N by k
expected frequencies: theoretical values against which observed frequencies are tested (E)
margin totals: sums of counts used to compute Es in the chi-square test of independence
observed frequencies: actual counts of subjects in chi-square categories (O)
proportional expected frequencies: E-values computed from known percentages in the population
Study Questions
1. What are the critical values for the following conditions:
   a. 3 rows, 1 column, p = 0.05
   b. 5 rows, 3 columns, p = 0.01
   c. 4 rows, 9 columns, p = 0.005
2. Define df for both the Goodness of Fit test and the Test of Independence. Demonstrate that k − 1 and (r − 1)(c − 1) are the proper terms for the two dfs.
3. You've done your analysis and your computed chi-square is less than the critical value. What does this mean, given you are testing one variable?
4. If you have a table with 5 rows (margin totals A..E) and 6 columns (margin totals U..Z), what would the expected value of the cell at row 4, column 2 be?
2. The term (O-E) in chi-square is most closely related to ___ in the z-test.
   A. X²   B. X   C. x   D. x²
3. A significant chi-square for the Test of Independence means that
   A. two nominal variables are related.
   B. two independent groups are different.
   C. two dependent groups are different.
   D. two categorical variables are independent.
4. A contingency table has 2 columns and 8 rows. The proper df is
   A. 16   B. 14   C. 7   D. 1
Chapter 24
Non-Parametric Statistics for Ordinal Differences
The Rationale of Testing Ordinal Differences
Wilcoxon Rank-Sum Test
Mann-Whitney U Test
Wilcoxon Matched-Pairs Test
Kruskal-Wallis H Test
Chapters 16-21 covered the major parametric procedures for testing hypotheses of difference (z, t, F). Here we will look at several non-parametric procedures used to test hypotheses of difference when the data is ordinal -- rankings. The most common application of these tests is with small-group testing, where interval/ratio data is converted to ranks. These non-parametric tests are not constrained by the same mathematical restrictions as parametric tests, and so give better results for small n. These procedures include the Wilcoxon Rank-Sum test, the Mann-Whitney U test, the Wilcoxon Matched-Pairs test, and the Kruskal-Wallis H test.
Dr. Gail Linam studied the Bible reading comprehension of children, grades 4-6, across three translations of Scripture: the King James (KJV), the New International (NIV), and the New Century (NCV).1 The children's reading comprehension was measured by two different instruments on a story from the Old and New Testaments. The first was the retelling method (OTR, NTR), and the second was the Cloze method (OTC, NTC). She also averaged the two stories into a single Bible comprehension score (BIBR, BIBC). Ninety-two (92) children were tested. Scores were ranked without regard to group membership of the child, and then sums of ranks were computed for each group (KJV, NIV, NCV). The results are shown in the following computer printout:2
KRUSKAL-WALLIS ONE-WAY ANALYSIS OF VARIANCE FOR 92 CASES
DEPENDENT VARIABLE IS OTR
GROUPING VARIABLE IS VER
1 Gail Linam, "A Study of the Reading Comprehension of Older Children Using Selected Bible Translations" (Ed.D. diss., Southwestern Baptist Theological Seminary, 1993).
2 Ibid., 204.
GROUP   COUNT
1.000    30   (KJV)
2.000    31   (NIV)
3.000    31   (NCV)
KRUSKAL-WALLIS TEST STATISTIC = 18.649 PROBABILITY IS 0.000 ASSUMING CHI-SQUARE DISTRIBUTION WITH 2 DF
The Kruskal-Wallis test shows a significant difference (p = 0.000) in Old Testament Retelling scores (OTR) across the three translations (VER). Notice that the sum of ranks for Group 1 (KJV) is much smaller than for Groups 2 and 3. This reveals much lower reading comprehension among children in grades 4-6 for the King James. She found the same results with the NTR, OTC, NTC, BIBR, and BIBC tests. In every case, children understood much less of the King James English than either the New International or the New Century versions.3 Each of the ordinal tests has a parametric counterpart with which you are already familiar.
Ordinal test                    Parametric counterpart
Wilcoxon Rank-Sum (Ws)          t-Test for independent samples
Mann-Whitney U                  t-Test for independent samples
Wilcoxon Matched-Pairs (T)      Correlated-samples t-Test
Kruskal-Wallis H                One-way ANOVA (both procedures test three or more independent samples for significant difference)
3 Ibid., 205-206.
sum of ranks; groups that score systematically higher produce a larger sum of ranks. If the difference between the ΣR terms is large enough, it will be declared significant.
Orthopedic Patients
Score:  1    2    2    3    6
Rank:   2   3.5  3.5   5    7
                        ΣR = 21
The lowest score (0) receives the rank of 1, and the highest score (12) receives the rank of 11. Two scores have the same count of 2. They are assigned the tied ranks of 3.5 and 3.5 in place of 3 and 4 (there is no rank 4). These rankings are then summed by group, yielding sums of 45 and 21. Since Group 2 (n = 5) is smaller than Group 1 (n = 6), use the ΣR of Group 2: 21.
David C. Howell, Statistical Methods for Psychology, (Boston: Duxbury Press, 1982), 500
U1 = n1·n2 + n1(n1 + 1)/2 − R1        U2 = n1·n2 + n2(n2 + 1)/2 − R2

where n1 = the number of observations in group 1, n2 = the number of observations in group 2, R1 = the sum of ranks assigned to group 1, and R2 = the sum of ranks assigned to group 2.
The U1 term (6) is smaller than the U2 term (24), so the U statistic is 6.
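The U computation from group sizes and rank sums can be sketched as follows (our illustration; the function name is an assumption):

```python
def mann_whitney_u(n1, n2, r1, r2):
    """U from group sizes and rank sums; the U statistic is the smaller of U1, U2."""
    u1 = n1 * n2 + n1 * (n1 + 1) / 2 - r1
    u2 = n1 * n2 + n2 * (n2 + 1) / 2 - r2
    return min(u1, u2)

# Howell's example above: group 1 (n=6, sum of ranks 45), group 2 (n=5, sum of ranks 21)
print(mann_whitney_u(6, 5, 45, 21))  # 6.0
```

A handy check: U1 + U2 always equals n1·n2 (here 6 + 24 = 30).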
Howell, p. 505
Change:       10   7   5   25  13  -6   1  40
Rank:          5   4   2    7   6   3   1   8
Signed Rank:  +5  +4  +2   +7  +6  -3  +1  +8

T+ = Σ(+ranks) = +33      T- = Σ(-ranks) = -3
The Rank column shows the Change values ranked low to high without regard to sign (score 1 = rank 1; score 40 = rank 8). The Signed Rank row applies the sign (-, +) of each Change to its Rank. Add together all positive ranks for T+ and all negative ranks for T-. The T statistic equals the smaller of the two T values in absolute terms. Since |T-| (3) is smaller than T+ (33), T = 3.
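The ranking-and-signing procedure just described can be sketched in Python (an illustration of ours; ties on |change| share an average rank, as in the rank-sum example earlier):

```python
def wilcoxon_t(changes):
    """Wilcoxon matched-pairs T: rank |change| low to high (ties share an
    average rank), restore the signs, and take the smaller rank sum."""
    order = sorted(range(len(changes)), key=lambda i: abs(changes[i]))
    ranks = [0.0] * len(changes)
    i = 0
    while i < len(order):
        j = i
        while j < len(order) and abs(changes[order[j]]) == abs(changes[order[i]]):
            j += 1
        avg = (i + 1 + j) / 2          # average of ranks i+1 .. j
        for k in range(i, j):
            ranks[order[k]] = avg
        i = j
    t_plus = sum(r for r, c in zip(ranks, changes) if c > 0)
    t_minus = sum(r for r, c in zip(ranks, changes) if c < 0)
    return min(t_plus, t_minus)

print(wilcoxon_t([10, 7, 5, 25, 13, -6, 1, 40]))  # 3.0
```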
Kruskal-Wallis H Test
The Kruskal-Wallis H Test is a generalization of the Wilcoxon Rank-Sum test to the case where we have three or more independent groups. As such it is the distribution-free counterpart to the one-way analysis of variance test. Using the following equation, we test whether the ΣRs for all groups are equal:
H = [12 / (N(N + 1))] × Σ(Ri² / ni) − 3(N + 1),  summed over the k groups,

where k = the number of groups, ni = the number of observations in group i, Ri = the sum of ranks in group i, and N = the total sample size.
Ibid., p. 507
Depressant           Stimulant            Placebo
Score   Rank         Score   Rank         Score   Rank
  55      9            73     15            61     11
  23      2            82     18            54      8
  40      3            51      7            80     17
  17      1            63     12            47      5
  50      6            74     16
  60     10            85     19
  44      4            66     13
                       69     14
R1 = 35              R2 = 114             R3 = 41
Substituting the R values into the H Test formula, we have the following:
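The substitution can be sketched in Python (our illustration; the function name is an assumption, and no tie-correction is applied since these scores have no ties). The rank sums it produces (35, 114, 41 for N = 19) match the table above, and the formula yields H ≈ 10.097, well beyond the df = 2 critical value of 5.991.

```python
def kruskal_wallis_h(groups):
    """H = 12/(N(N+1)) * sum(Ri^2 / ni) - 3(N+1), ranking all scores together.
    (No tie handling: this example's scores are all distinct.)"""
    all_scores = sorted(s for g in groups for s in g)
    rank = {s: i + 1 for i, s in enumerate(all_scores)}  # global ranks, low to high
    n = len(all_scores)
    total = sum(sum(rank[s] for s in g) ** 2 / len(g) for g in groups)
    return 12 / (n * (n + 1)) * total - 3 * (n + 1)

depressant = [55, 23, 40, 17, 50, 60, 44]     # sum of ranks = 35
stimulant = [73, 82, 51, 63, 74, 85, 66, 69]  # sum of ranks = 114
placebo = [61, 54, 80, 47]                    # sum of ranks = 41
print(round(kruskal_wallis_h([depressant, stimulant, placebo]), 3))  # 10.097
```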
Summary
In this chapter we have investigated the more popular and powerful of the distribution-free ordinal tests of difference. We analyzed the Wilcoxon Rank-Sum test, the Mann-Whitney U test, the Wilcoxon Matched-Pairs Signed-Ranks test, and the Kruskal-Wallis H test. The value of these tests is their ability to handle smaller groups of subjects than their comparable parametric counterparts. This is particularly helpful in the kinds of studies designed in the context of Christian education, administration, counseling, and social work.
Example
Here are the remaining Kruskal-Wallis H test results from Dr. Linam's study:
KRUSKAL-WALLIS ONE-WAY ANALYSIS OF VARIANCE FOR 92 CASES
DEPENDENT VARIABLE IS NTR (New Testament Retelling Test)
GROUPING VARIABLE IS VER
GROUP   COUNT   RANK SUM
1.000    30      888.000  (KJV)
2.000    31     1679.500  (NIV)
3.000    31     1710.000  (NCV)
KRUSKAL-WALLIS TEST STATISTIC = 17.884
PROBABILITY IS 0.000 ASSUMING CHI-SQUARE DISTRIBUTION WITH 2 DF

KRUSKAL-WALLIS ONE-WAY ANALYSIS OF VARIANCE FOR 92 CASES
DEPENDENT VARIABLE IS OTC (Old Testament Cloze Test)
GROUPING VARIABLE IS VER
GROUP   COUNT   RANK SUM
1.000    30      808.500
2.000    31     1546.500
3.000    31     1923.000
KRUSKAL-WALLIS TEST STATISTIC = 27.115
PROBABILITY IS 0.000 ASSUMING CHI-SQUARE DISTRIBUTION WITH 2 DF

KRUSKAL-WALLIS ONE-WAY ANALYSIS OF VARIANCE FOR 92 CASES
DEPENDENT VARIABLE IS NTC (New Testament Cloze Test)
GROUPING VARIABLE IS VER
GROUP   COUNT   RANK SUM
1.000    30      705.000
2.000    31     1742.500
3.000    31     1830.500
KRUSKAL-WALLIS TEST STATISTIC = 33.342
PROBABILITY IS 0.000 ASSUMING CHI-SQUARE DISTRIBUTION WITH 2 DF

KRUSKAL-WALLIS ONE-WAY ANALYSIS OF VARIANCE FOR 92 CASES
DEPENDENT VARIABLE IS BIBR (Average Bible Retelling Test)
GROUPING VARIABLE IS VER
GROUP   COUNT   RANK SUM
1.000    30      851.500
2.000    31     1654.000
3.000    31     1772.500
KRUSKAL-WALLIS TEST STATISTIC = 20.822
PROBABILITY IS 0.000 ASSUMING CHI-SQUARE DISTRIBUTION WITH 2 DF

KRUSKAL-WALLIS ONE-WAY ANALYSIS OF VARIANCE FOR 92 CASES
DEPENDENT VARIABLE IS BIBC (Average Bible Cloze Test)
GROUPING VARIABLE IS VER
GROUP   COUNT   RANK SUM
1.000    30      711.000
2.000    31     1664.500
3.000    31     1902.500
KRUSKAL-WALLIS TEST STATISTIC = 33.765
PROBABILITY IS 0.000 ASSUMING CHI-SQUARE DISTRIBUTION WITH 2 DF
In every case, comprehension of the KJV was significantly lower than of the NIV or NCV.
Vocabulary
Kruskal-Wallis H test: ordinal alternative to one-way ANOVA
Mann-Whitney U test: ordinal alternative to the t-test for independent samples
sum of ranks: key concept in ordinal statistics (ΣR), used to differentiate groups
Wilcoxon T test: ordinal alternative to the matched-samples t-test
Wilcoxon Ws test: ordinal alternative to the t-test for independent samples
Study Questions
1. Describe the rationale for using non-parametric tests.
2. Describe the appropriate way to handle tied ranks in the procedures discussed in this chapter.
3. Explain in your own words when to use the Kruskal-Wallis H, the Wilcoxon T, the Mann-Whitney U, and the Wilcoxon Ws tests.
Chapter 25
1 Kaywin Baldwin LaNoue, "A Comparative Study of the Spiritual Maturity Levels of the Christian School Senior and the Public School Senior in Texas Southern Baptist Churches with a Christian School" (Fort Worth, Texas: Southwestern Baptist Theological Seminary, 1987), 45.
2 Ibid., 46.
3 Ibid., 47.
Two-Way ANOVA
Let's look at a study of the effect of reinforcement on developing vocabulary. One variable is reinforcement, with two levels: immediate and delayed. The second variable is subject socioeconomic class, with two levels: low and middle. The dependent (measured) variable is vocabulary test score. The two-factor table for this experiment is shown below. The means in the table are identified as X̄r,c, where r is the row and c the column. The mean of row 1, column 1 is designated X̄1,1. The other three cell means are designated X̄1,2, X̄2,1, and X̄2,2. The margin mean for row 1 is designated X̄1. The dot (.) replaces the column number, indicating all columns. The other margin means are designated X̄2. (second row), X̄.1 (first column), and X̄.2 (second column).
The 2x2 design above produces three F-ratios. It yields an Fr-ratio (row F-ratio) which compares X̄1. with X̄2. (Row mean 1 with Row mean 2). This F-ratio is the same as if we had computed a one-way ANOVA across reinforcement type alone. The 2x2 design also yields an Fc-ratio (column F-ratio) which compares X̄.1 with X̄.2 (Column mean 1 with Column mean 2). This F-ratio is the same as if we had computed a one-way ANOVA across socioeconomic status alone. These row and column F-ratios are called main effects. The 2x2 design also yields an Frc-ratio, which tests whether there is an interaction between the two independent variables, in this case reinforcement and socioeconomic status. This interaction cannot be tested in one-way ANOVA designs.
The lower left graph illustrates this. In this (fictitious) graph we see that the differences in vocabulary score are parallel across both reinforcement types and socioeconomic groups. Let's look at some illustrations which show three types of interaction.
Types of Interaction
There are three basic kinds of interaction: no interaction, ordinal interaction, and disordinal interaction. Match the illustrations below with each description. For these examples we are using a 2x3 experimental design: Two levels in variable one and three levels in variable two.
No Interaction
In data with no interaction between variables, the lines are parallel. Treatment effects are constant across variable levels. Notice that the difference between means (black circles and gray squares) is the same for the three levels of variable 2.
(Note: As with all statistical concepts, data can have some interaction -- and therefore slightly non-parallel lines -- and still not have significant interaction.)
Ordinal Interaction
In ordinal interaction, the rank order of the cell means of one variable is the same within each level of the second variable. While the lines are not parallel, they do not cross. Notice that the difference between means (black circles and gray squares) varies, but remains in the same order, for the three levels of variable 2.
Disordinal Interaction
In disordinal interaction, the rank order of cell means is not consistent within each level of the second variable. The lines representing each variable cross. Effects of treatments are radically different across the two variables. Notice that the difference between means (black circles and gray squares) not only varies, but changes order across the three levels of variable 2.

A significant ordinal interaction shows that one treatment is superior to another at every level of the second variable. But when there is a significant disordinal interaction, one treatment is superior at one level of the second variable, but not at another. In both cases, interpretation of treatment effect must be made separately for each level of the second variable. Such an analysis is called simple effects. Whenever there is a significant interaction, main effects (Fr, Fc) are meaningless and simple effects must be computed. In the diagram at right, simple effects computations would test the two means at level 1 for significance, then the two means at level 2, and then the two means at level 3, as indicated by the rectangles. There are special formulas for these computations, but we'll not address them here.
between variance in variable A and variable B, variance within cells, and variance due to interaction between A and B. We can summarize these terms as:

SSt = total
SScells = within cells
SSa = variable A
SSb = variable B
SSab = interaction
SSe = error
The Fr-ratio tests the difference between row means 153 and 149.5 for significance, and the Fc-ratio tests the difference between column means 166.25 and 136 for significance. Analyzing this data by computer produces the following ANOVA table:
TABLE 35
SUMMARY TABLE FOR THE TWO-WAY ANOVA

SOURCE      SUM-OF-SQUARES   df   MEAN-SQUARE   F-RATIO       P
C/P              198.220      1       198.220     0.217     0.642
A/I            12730.850      1     12730.850    13.918     0.000
C/P-A/I         2745.269      2745.269? 
dft = N - 1 = 112 - 1 = 111
dfr = r - 1 = 2 - 1 = 1
dfc = c - 1 = 2 - 1 = 1
dfrc = (r - 1)(c - 1) = 1
dfe = k(n - 1), where k is the number of cells (equal cell n's), or dft - (dfr + dfc + dfrc) = 111 - 3 = 108 (unequal cell n's)

MS terms are given by the respective SS/df terms. F-ratios are given by MSx/MSe:

Fr  = MSr/MSe  =   198.22  / 914.701 =  0.217   (main effect)
Fc  = MSc/MSe  = 12730.85  / 914.701 = 13.918   (main effect)
Frc = MSrc/MSe =  2745.269 / 914.701 =  3.001   (interaction effect)
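As a quick check, the MS and F arithmetic can be reproduced from the SS and df values in the summary table:

```python
# Sums of squares and df from the two-way ANOVA summary table (Table 35)
ss = {"C/P": 198.220, "A/I": 12730.850, "C/P-A/I": 2745.269, "Error": 98787.740}
df = {"C/P": 1, "A/I": 1, "C/P-A/I": 1, "Error": 108}

ms = {k: ss[k] / df[k] for k in ss}                              # MS = SS / df
f = {k: ms[k] / ms["Error"] for k in ("C/P", "A/I", "C/P-A/I")}  # F = MS / MSe

print({k: round(v, 3) for k, v in f.items()})
```

The three ratios come out to 0.217, 13.918, and 3.001, matching the table.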
The first F-ratio to consider is the interaction F-ratio. This is Frc (3.001) in the table. The computed value of 3.001 is not significant (p=0.086). Therefore, the interaction between School Type and Participation is not significant. Had the interaction been
4 Data drawn from LaNoue, pp. 107-109.
5 Ibid., p. 46.
7 Active or Inactive in Sunday School.
significant, the main effects values would have been meaningless, and we would have had to compute simple effects values. Because the interaction F is not significant, we can interpret main effect F-ratios directly.

The first main effect is School Type. This F-ratio tests whether the two row means (153, 149.5) are significantly different. Its value is 0.217 (p=0.642). Because the p value is greater than 0.05, these row means are declared not significantly different. The spiritual maturity scores did not differ between seniors in Christian and public high schools.

The second main effect is Active Participation in Sunday School. This F-ratio tests whether the column means (166.25, 136) are significantly different. Its value is 13.918 (p=0.000). Because the p value is less than 0.05, the column means are declared significantly different. Seniors who were active in Sunday School were significantly more mature spiritually than seniors who were inactive, regardless of school type attended. Two graphs of these means are shown below.
The graph at left above orders the means in the same way as the two-way table. We can see some disordinal interaction, but since the interaction F is not significant (p=0.086), these differences are explained by sampling error. We can also see small (non-significant) differences between the Christian school and public school seniors (squares and circles close together). By re-ordering the means in the graph above right, we can focus on the activity variable. The difference between the Active seniors and Inactive seniors is clearly seen here. We also see the slight (non-significant) interaction.

Notice also that the highest mean of spiritual maturity is found in Christian-School-Active seniors (175, n=46). The lowest mean of spiritual maturity is found in Christian-School-Inactive seniors (131, n=13). These findings are strengthened by the fact that they are based on eleven Christian schools and their sponsoring churches in the state of Texas.8

Dr. LaNoue raised the following questions in her proposal: "Does the Christian school accomplish the administrative goal of growth in Christ-likeness or spiritual maturity, or does that public school Christian grow as much or more in spiritual maturity as the Christian school student? Is the Christian school accomplishing something that is not being accomplished in another way?"9 What she found should focus our attention on how active our teenagers are in Sunday School, not on whether they attend a private Christian school. It should also send a wake-up call to administrators of private Christian schools that spiritual growth may be more related to the school's publicity than to its students.
Ibid., p. 19
Three-way ANOVA
Let's extend these ideas to three independent variables. Suppose you wish to measure the level of test anxiety in seminary students (dependent variable). One independent variable is school; the categories are theology, educational ministries, and music. Another independent variable is gender; the categories are male and female. The third independent variable is year in seminary; the levels are 1st, 2nd, 3rd, and 4th+ years. With one analysis we can test the following:

1. Is there a significant difference in test anxiety (F1) across schools, (F2) between genders, or (F3) across length of study? (3-way main effects)
2. Is there an interaction between school and gender (F12), school and years of study (F13), or gender and years of study (F23)? (2-way interactions)
3. Is there an interaction among all three variables (F123)? (3-way interaction)

A three-way ANOVA table looks like the table below. (These table values are not related to the seminary problem above, but are given merely as an example.)
Source   df   SS   MS   F
1                       F1 = 48.79*
2                       F2 = 19.05*
3                       F3 = 34.42*
12                      F12 = 2.18
13                      F13 = 17.56*
23                      F23 < 1
123                     F123 < 1
Error
Total
*p < 0.05

(The df, SS, and MS values of the original table are not reproduced here.)
The table shows that all three main effects and the AC interaction term are significant. Only F2 can be directly interpreted, because the A-C interaction renders F1 and F3 meaningless. In graphing a three-way or higher order ANOVA, you must graph two variables at a time. For example, in graphing the means from the ANOVA table above, you might consider the two levels of A separately: graph B and C for A1, and then B and C for A2. To show the significant interaction, graph A and C for each level of B. Graphing the ABC term is much more difficult, because it forms a plane in 3-dimensional space. With each additional independent variable, the complexity of analysis and interpretation increases. Science likes simple solutions. Avoid overly complex designs, even if your computer software allows you to do them!
Analysis of Covariance
When intact groups must be used for a study, differences may exist between two groups before the treatment begins. Results of the experiment cannot be attributed confidently to the treatment. It would be helpful to have a way to statistically level the groups, adjusting for pre-treatment differences in the post-treatment tests. Fortunately, such a procedure exists. The Analysis of Covariance (ANCOVA) procedure gets its name from the fact that it uses a known variable, called a covariate, to adjust the means of the dependent (measured) variable before applying an ANOVA test. The adjustment to the means is done through the coefficient of determination (r2) and variance accounted for (see the end of Chapter 22).
Uses of ANCOVA
ANCOVA is employed where random assignment of subjects is not possible or permitted. This is frequently the case in schools, where classes must be studied as they are, intact. A simple approach is to give the intact groups a pretest, and then use the pretest as a covariate for posttest scores. But there are many situations which lend themselves to ANCOVA: differences among religious, cultural, community, political, social, economic, or medical diagnostic groups; differences between alternative attitude, aptitude, or achievement groups; differences between vegetarians and non-vegetarians, smokers and non-smokers, users and non-users of a given product, criminals and non-criminals. Measure the differing groups of interest on a large number of variables, and then analyze these variables to discover which ones best distinguish between the groups. This is done through a procedure called Discriminant Analysis. These differentiating
10 Gene V. Glass, Statistical Methods in Education and Psychology, 2nd ed. (Englewood Cliffs, NJ: Prentice-Hall, Inc., 1984), 493-497.
variables can then be used as covariates.

The strongest warning I've heard about ANCOVA came from my statistics professor at the University of North Texas. In defining what ANCOVA does, Dr. William Brookshire said, with a dry smile and somewhat sarcastically, "ANCOVA estimates what the experimental means would be if they weren't what they are." Be sure you understand both the research design and the statistical limitations of your study before you make strong statements about your findings. There lurk numerous pitfalls for researchers who fail to consider their findings carefully.
Example Problem
Gene Glass10 provides this example of the benefits of ANCOVA. An experiment was performed in twenty elementary schools of a large school district. Ten of the schools were randomly designated to be sites for adoption of an innovative science curriculum, Science: A Process Approach (SAPA). The SAPA materials were bought and placed in the ten schools; teachers were trained to use them. The other ten elementary schools continued to use the district's traditional science curriculum. After two years of study in the respective programs, sixth-grade pupils in all twenty schools were given the Science Test (a 45-item measure of scientific methods, reasoning, and knowledge) of the Sequential Tests of Educational Progress (STEP). Each student's score was expressed as a percentage. There were 50 to 120 6th graders in each school, but since the school itself (along with its teachers, administrators, surrounding neighborhoods, and the like) was randomly designated as either SAPA or Traditional (Experimental or Control), the school was the experimental unit. The twenty schools' means of sixth-grade pupils' STEP-Science scores were used as the observational unit in the statistical analysis. The collected data follows:
SAPA (n = 10)   Traditional (n = 10)
   77.63%           64.10%
   74.13            43.67
   67.20            50.40
   78.23            84.33
   57.93            44.93
   57.65            71.43
   83.30            71.10
   73.90            44.57
   45.90            68.23
   64.83            68.47
Mean 68.07%      Mean 61.12%
s²  134.60       s²  201.50
Applying one-way ANOVA to this data produced the following table. Was there a significant difference in the two curricula?
Source     SS         df     MS       F
Between      241.30    1.00  241.30   1.44
Within     3,024.94   18.00  168.05
Total      3,266.24   19.00

Fcv(0.10, 1, 18) = 3.01
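The one-way result can be reproduced from the twenty school means above. Here is a short numpy sketch of the same computation:

```python
import numpy as np

# Mean STEP-Science scores for the twenty schools (from the table above)
sapa = np.array([77.63, 74.13, 67.20, 78.23, 57.93,
                 57.65, 83.30, 73.90, 45.90, 64.83])
trad = np.array([64.10, 43.67, 50.40, 84.33, 44.93,
                 71.43, 71.10, 44.57, 68.23, 68.47])

grand = np.concatenate([sapa, trad]).mean()
ss_between = 10 * (sapa.mean() - grand) ** 2 + 10 * (trad.mean() - grand) ** 2
ss_within = ((sapa - sapa.mean()) ** 2).sum() + ((trad - trad.mean()) ** 2).sum()

F = (ss_between / 1) / (ss_within / 18)   # df between = 1, df within = 18
print(round(F, 2))  # ≈ 1.44, matching the ANOVA table
```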
The answer is no. The F-ratio is smaller than the critical value even at α = 0.10, an inappropriately lenient level of significance. But what if we know something more about the schools that could explain some of the error variance, and in so doing, reduce the error term of the analysis? IQ differences between the schools might affect the results. It is reasonable to assume that schools with high scholastic aptitude (IQ) will tend to have higher means on the achievement test than schools with lower IQ means. IQ means for each of the twenty schools are shown with the achievement means below:
        SAPA                 Traditional
   IQ       ACH            IQ       ACH
  105.7    77.63%         101.2    64.10%
  100.3    74.13           97.6    43.67
   94.3    67.20           96.4    50.40
  108.7    78.23          109.6    84.33
   93.1    57.93           94.0    44.93
   96.7    57.65          105.4    71.43
  106.9    83.30          102.4    71.10
  100.3    73.90          100.6    44.57
   86.5    45.90          104.2    68.23
   96.1    64.83          112.6    68.47
Mean 98.86  68.07%   Mean 102.40   61.12%
s²   47.94 134.60    s²    33.60  201.50
Using adjusted SS terms, an ANCOVA table can be built from this data which looks like this:
Source     SS         df     MS       F
Between      786.71    1.00  786.71   16.10
Within       830.88   17.00   48.88
Total      1,617.59   18.00

Fcv(0.01, 1, 17) = 15.7
Now we find the groups significantly different (p < 0.01). This adjustment was possible because of the high correlation between mean IQ and mean science achievement scores (r = +.931, +.805 for the experimental and control groups respectively). ANCOVA used these correlations to reduce the error variance and provide a more powerful analysis than was possible through ANOVA alone.
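ANCOVA can be framed as a comparison of two regression models: a full model predicting ACH from IQ plus group membership, and a reduced model using IQ alone. The sketch below is my own formulation, not the text's adjusted-SS computation, though for a one-covariate, two-group design the two are algebraically equivalent; it reproduces the adjusted F from the data above:

```python
import numpy as np

iq = np.array([105.7, 100.3, 94.3, 108.7, 93.1, 96.7, 106.9, 100.3, 86.5, 96.1,
               101.2, 97.6, 96.4, 109.6, 94.0, 105.4, 102.4, 100.6, 104.2, 112.6])
ach = np.array([77.63, 74.13, 67.20, 78.23, 57.93, 57.65, 83.30, 73.90, 45.90, 64.83,
                64.10, 43.67, 50.40, 84.33, 44.93, 71.43, 71.10, 44.57, 68.23, 68.47])
group = np.array([1.0] * 10 + [0.0] * 10)   # 1 = SAPA, 0 = Traditional

def rss(X, y):
    """Residual sum of squares from a least-squares fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return ((y - X @ beta) ** 2).sum()

ones = np.ones_like(iq)
rss_full = rss(np.column_stack([ones, iq, group]), ach)   # covariate + treatment
rss_reduced = rss(np.column_stack([ones, iq]), ach)       # covariate only

# F for the treatment effect, adjusted for IQ: df = 1 and 17
F = (rss_reduced - rss_full) / (rss_full / 17)
print(round(F, 2))
```

The treatment F works out to roughly 16.1, exceeding the 0.01 critical value of 15.7, in agreement with the ANCOVA table.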
way ANOVA design, but the multiple learner measurements -- achievement, anxiety level, attitude toward the course, attitude toward the instructor -- make the design a MANOVA. A counseling researcher might be interested in the impact of group vs. individual counseling and level of social competence on four counselee variables. These two independent variables form a 2-way ANOVA design, but the four dependent variables make the design a multi-factor MANOVA. It is enough at this point to be aware of the existence of these procedures, and to know what has been done when you discover a MANOVA analysis in your reading.
Summary
In this chapter, you've been introduced to the concepts of factorial ANOVA, interaction, ANCOVA, and MANOVA. A basic understanding of these advanced techniques will help you understand the research articles you'll read as part of your literature analysis. The following table gives a summary of the key elements in these procedures.
Name of Analysis      Number of Independent Variables                                            Number of Dependent Variables
One-factor ANOVA      ONE (Questioning strategy)                                                 ONE (Achievement)
Multi-factor ANOVA    MANY (Questioning strategy, Structure, Variety, and Attitude of Teacher)   ONE (Achievement)
One-factor ANCOVA     ONE plus COVARIATE (Questioning strategy, IQ)                              ONE (Achievement)
One-factor MANOVA     ONE (Questioning strategy)                                                 MANY (Achievement, attitude toward class, anxiety level)
Multi-factor MANOVA   MANY                                                                       MANY (Achievement, attitude toward class, anxiety level)
Example
Dr. Gail Linam's dissertation11 was cited earlier for her use of the Kruskal-Wallis H Test to measure differences between three groups of ranks (see Chapter 24). Her use of the H Test was secondary to her primary statistic of two-way ANOVA. Her dependent variable was reading comprehension score. She used the Retelling Method and the Cloze Test to produce reading comprehension scores for an Old Testament story (OTR, OTC), a New Testament story (NTR, NTC), and a Bible score, the average of the two stories (BIBR, BIBC). Her two independent variables were CAMP (church campus or mission campus) and VER (Bible version: KJV, NIV, NCV), as shown
11 Information for these tables from Linam, pp. 174, 196, and 198-200.
below:
          CAMP
VERSION   Church Campus   Mission Campus
KJV       xx.xxx          xx.xxx
NIV       xx.xxx          xx.xxx
NCV       xx.xxx          xx.xxx
ANALYSIS OF VARIANCE
SOURCE      SUM-OF-SQUARES   DF   MEAN-SQUARE   F-RATIO       P
CAMP            3591.258      1      3591.258    26.490    0.000
VER             3377.792      2      1688.896    12.458    0.000
CAMP*VER         175.082      2        87.541     0.646    0.527
ERROR          11794.434     87       135.568
There is no interaction between CAMP and VERsion (p=.527), so we can test the two independent variables separately. There was a significant difference in OTR reading comprehension scores between the church campus and mission campus children (p<.001). Looking at the scores below, we can see the church campus children scored higher than mission campus children. This was true in every case. There was a significant difference across translation (p<.001). The scores below show that the KJV produced the lowest comprehension scores. This was true in every case.
          CAMP
VERSION   Church Campus   Mission Campus
KJV       18.81            7.00
NIV       32.96           15.00
NCV       34.41           23.11
ANALYSIS OF VARIANCE
SOURCE      SUM-OF-SQUARES   DF   MEAN-SQUARE   F-RATIO       P
CAMP            3215.289      1      3215.289    29.485    0.000
VER             3406.045      2      1703.022    15.617    0.000
CAMP*VER         615.553      2       307.777     2.822    0.065
ERROR           9378.172     86       109.048
There is no interaction (p=.065). Both CAMP and VER show significant differences. Here are the group means for NTR:
ANALYSIS OF VARIANCE
SOURCE      SUM-OF-SQUARES   DF   MEAN-SQUARE   F-RATIO       P
CAMP            1995.647      1      1995.647    39.754    0.000
VER             2239.649      2      1119.824    22.307    0.000
CAMP*VER           2.229      2         1.115     0.022    0.978
ERROR           4367.414     87        50.200
There is no interaction (p=.978). Both CAMP and VER show significant differences.
          CAMP
VERSION   Church Campus   Mission Campus
KJV       14.91            4.22
NIV       23.27           13.33
NCV       27.55           17.56
ANALYSIS OF VARIANCE
SOURCE      SUM-OF-SQUARES   DF   MEAN-SQUARE   F-RATIO       P
CAMP            1873.493      1      1873.493    47.556    0.000
VER             2618.725      2      1309.362    33.236    0.000
CAMP*VER         119.222      2        59.611     1.513    0.226
ERROR           3388.047     86        39.396
There is no interaction (p=.226). Both CAMP and VER show significant differences.
          CAMP
VERSION   Church Campus   Mission Campus
KJV       11.05            0.38
NIV       23.50           10.78
NCV       22.82           16.11
In each case we see that church campus children scored significantly higher in reading comprehension than mission children, and readers of the KJV scored significantly lower than readers of either the NIV or NCV versions -- in every condition. Specific computations of pair-wise differences were made using the FLSD procedure. See the Example on page 21-8 for these findings. Dr. Linam's findings indicate that teachers and curriculum writers need to avoid use of the King James Version of the Scriptures for older children (grades 4-6). Children of this age simply cannot understand the text as well as the New International or New Century versions.
Vocabulary
ANCOVA: Analysis of Covariance: uses pretest differences to adjust posttest means
disordinal interaction: factorial ANOVA: ranks of means differ across treatment levels
factorial ANOVA: designs which have 2 or more independent variables (2-way, 3-way, k-way)
interaction effect: effects of one treatment not constant across levels of a second
main effects: Row and Column F-ratios in a factorial design
MANOVA: Multivariate Analysis of Variance: more than one dependent variable
multi-factor MANOVA: factorial design (e.g., 2-way) with more than one dependent variable
ordinal interaction: factorial ANOVA: ranks of means constant across treatment levels
simple effects: testing differences of means of one treatment across all levels of the second
Study Questions
1. Define factorial ANOVA.
2. What is the advantage of using factorial ANOVA over multiple one-way ANOVAs?
3. Explain the term interaction.
4. Compare and contrast ordinal and disordinal interaction.
5. If you discover a significant interaction in your data,
   A. What implications does this have for main effects interpretation?
   B. What further procedure must you employ?
6. Answer the questions below using this computer printout:

Source          df      SS       MS       F       p
Parity           1.00   11.24    11.24    2.70    0.11
Size             2.00   90.61    45.31   10.83    0.00
Parity x Size    2.00   16.32     8.16    1.95    0.15
Error           49.00  205.00     4.18
Total           54.00  323.17
A. How many subjects were involved in this study?
B. How many levels of PARITY were used?
C. How many levels of SIZE were used?
D. How many groups were tested?
E. Which term was used in the denominator of all the F-ratios?
F. Was the interaction between PARITY and SIZE significant?
G. Was PARITY a significant treatment variable? How do you know?
H. Was SIZE a significant treatment variable? How do you know?
I. In this case, would you (a) interpret the main effects of PARITY and SIZE, or would you (b) apply simple effects tests? Explain why.

7. Design a study in your field of specialty using the following research designs: factorial ANOVA, ANCOVA, or MANOVA.
Chapter 26

Regression Analysis
Linear Regression
The Equation of a Line
The Linear Regression Equation
Standard Error of Estimate
The Multiple Regression Equation
A Walk Through a Computer Printout
A Multiple Regression Example
In Chapter 22, we discussed the use of correlation to measure the strength of association between two variables. In this chapter we extend this concept to regression analysis, which allows us to predict the value of a variable from one or more others. Linear regression analyzes two variables -- one predicted variable (called the criterion) and one predictor variable. Multiple regression analyzes three or more variables -- one criterion and two or more predictor variables. The mathematical computations for regression analysis are complex, but with the advent of the personal computer and the development of statistical packages, regression analysis is rapidly becoming the most popular statistical procedure -- particularly in the fields of psychology, sociology and education.

Dr. Martha Bessac studied predictor variables of marital satisfaction of 375 student couples at Southwestern Baptist Theological Seminary in 1986.1 Aware of the increased stress on seminary marriages, including a rise in the number of divorces -- averaging twenty-four per year at that time2 -- Dr. Bessac, as part of the Registrar's staff, wanted to determine what factors might be contributing to this. She hypothesized, based on her literature search, the following variables as significant positive predictors of student marital satisfaction: sex (gender), age of husband, age of wife, seminary program of husband, number of semesters husband has been enrolled, number of hours husband enrolled in this semester, number of hours husband has completed towards degree, education level of husband, education level of wife, number of months married, number of children, child density, child spacing ratio, number of hours per week husband is employed, number of hours a week wife is employed, total income, number of hours per week husband engaged in church activities, and number of hours per week wife engaged in church activities.3
1 Martha Sue Bessac, "The Relationship of Marital Satisfaction to Selected Individual, Relational, and Institutional Variables of Student Couples at Southwestern Baptist Theological Seminary" (Fort Worth, Texas: Southwestern Baptist Theological Seminary, 1986).
2 Ibid., p. 19.
3 Ibid., pp. 20-21.
4 Ibid., pp. 41, 44.
She found four significant predictors accounting for 9.6% of marital satisfaction variability. These were Months Married (t=-5.428, b= -0.054), Number of Hours Wife Works (t=-2.637, b= -0.183), Number of Hours Husband Works (t=-2.605, b= -0.094), and Income (t=-2.089, b= -0.158). Further, the regression equation produced by the analysis was shown to be a viable model (F=12.925, Fcv=2.39).4

Notice that all the regression coefficients (b's) are negative. As Months Married increased, marital satisfaction decreased. This is perhaps explained by considering two extreme groups of student couples: one group of newly-weds, in seminary-as-honeymoon mode, compared to older couples with teenage children, leaving behind "home, friends and family" for cramped quarters and hectic schedules. Increased hours of work for both husband and wife meant decreased marital satisfaction. Higher incomes, lower satisfaction. Number of children, degree plan, age, number of credit hours in the semester, hours engaged in church activities -- these and the other specified variables proved not significant.

Since only 9.6% of the variability of marital satisfaction (Adj. R²=0.096) is accounted for by the four predictor variables, 90.4% of the variability of marital satisfaction was not accounted for. This variability was either accounted for by unnamed variables, or by the unsystematic variation among the 375 couples. Still, the multiple regression procedure declared the model viable by posting a significant F-ratio of 12.925 (Fcv=2.39).5

Before introducing the concepts of regression, however, we need to review the fundamentals of linear equations upon which regression is built.
And if X=100?
If X = 100, then Y = 2(100) + 4 = 204, giving the coordinate pair (100, 204).
The two elements of slope (m) and y-intercept (b) define a line. These concepts of slope and y-intercept are used in computing a regression equation by which one variable can be predicted by the other.
Linear Regression
Several scattergrams are displayed on page 22-2. The first two scattergrams illustrate perfect correlations of +1.00 and -1.00 respectively. Remember that in both cases, the points fell along a straight line. The term linear (lin-ee-er) derives from the line represented by points in a scatterplot. We can compute an equation for a line which fits any scatterplot. Using this equation, we can predict one variable from another: the stronger the relationship, the more closely the points cluster around a line, and the better the accuracy of the prediction.

Using a process called the least squares method, regression analysis produces a best fit linear equation. "Best fit?" In Chapter 16, we learned that the sum of deviations about the mean equals zero (Σx = 0). It is also true that the sum of squared deviations about the mean (Σx²) is a minimum value. That is, the sum of squares about the mean is smaller than it would be if computed about any other value. Looking at the mean and Σx² another way, the mean of a group of scores produces the smallest sum of squares. It is a least squares measure of central tendency. Just as the mean is the "best fit point" of a single group of scores, the linear regression equation is the "best fit line" through a scatterplot of two groups of scores. It is a least squares fit because -- just as Σx = 0 and Σx² = a minimum -- so deviations of scatterplot points about the computed line, called residuals (e), produce the values Σe = 0 and Σe² = a minimum. More on this a little later. Let's look at the regression equation.
Y′ = a + bX

where Y′ (pronounced "Y prime") is the predicted value of Y, a refers to the y-intercept point, and b refers to the slope of the regression line. Regression analysis produces values for a and b such that we can develop the best fit line through a scatterplot.
Computing a and b
Given a set of scores, how do we calculate the values of a and b? Here are the formulas we use:

b = (nΣXY - (ΣX)(ΣY)) / (nΣX² - (ΣX)²)

a = Ȳ - bX̄

First, compute b. The elements of the formula bear a close resemblance to part of the Pearson's r correlation coefficient.
Second, use b and the means of X and Y to compute a. Earlier, we computed values of Y from values of X using the equation Y = 2X + 4. Let's use those same values, and compute the equation components a and b. If we do this right, we should get a = 4 and b = 2. The X- and Y-values below come from the computed coordinates at the bottom of page 26-2.
 X      Y      X²      Y²      XY
 0      4       0      16       0
 1      6       1      36       6
 2      8       4      64      16
 5     14      25     196      70
 6     16      36     256      96
Σ: 14  48      66     568     188

(ΣX)² = 196     Means: X̄ = 14/5 = 2.8;  Ȳ = 48/5 = 9.6
First, compute b:

b = (nΣXY - (ΣX)(ΣY)) / (nΣX² - (ΣX)²) = (5(188) - (14)(48)) / (5(66) - 196) = (940 - 672) / (330 - 196) = 268/134 = 2

Second, compute a:

a = Ȳ - bX̄ = 9.6 - (2)(2.8) = 9.6 - 5.6 = 4

Third, substitute the values of a and b into the equation Y′ = a + bX, which results in

Y′ = 4 + 2X
This is the same equation we started with on page 26-2. While we may seem to be going around in circles, we have established the fundamentals of conducting regression analysis -- computing a linear equation from a set of matched scores.
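The whole computation can be condensed into a few lines of code (a sketch; the names are mine):

```python
# X and Y pairs generated from Y = 2X + 4 (the worked example above)
X = [0, 1, 2, 5, 6]
Y = [4, 6, 8, 14, 16]
n = len(X)

sum_x, sum_y = sum(X), sum(Y)
sum_xy = sum(x * y for x, y in zip(X, Y))
sum_x2 = sum(x * x for x in X)

b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)  # slope
a = sum_y / n - b * (sum_x / n)                               # intercept: a = Ȳ - bX̄

print(a, b)
```

Running this recovers a = 4 and b = 2, the components of the line we started with.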
se = √( Σ(Y - Y′)² / (n - 2) )

Compare this equation to the one for estimated population standard deviation (s) on page 16-9. The concepts are the same. The term n - 2 is used because two degrees of freedom are lost -- due to having two groups of scores. Another way to compute the standard error of estimate is to use the correlation coefficient (r) as follows:
se = sY √(1 - r²)

where sY is the standard deviation of the Y scores and r is the correlation between X and Y. The larger the correlation (r) between X and Y, the smaller the term under the radical, and the smaller the standard error of estimate. As r approaches 1.00, se approaches 0, which reflects a greater accuracy in prediction.

In this section we have reviewed the fundamentals of linear equations, the formulas for computing a and b for the equation Y′ = a + bX, and the concepts of residuals and the standard error of estimate. But the real power of regression analysis for the complex studies of the social sciences is found in multiple regression analyses.
actual analysis.6 The data analysis was done by the author using SYSTAT.
The Data
The chart below displays 6 of the data sets collected from 50 courses. Each row is a single course. Scores are mean scores representing the entire class.
Course  OVERALL  TEACH  EXAM  KNOW  GRADE  ENROLL
1         3.4     3.8    3.8   4.5   3.5      21
2         2.9     2.8    3.2   3.8   3.2      50
3         2.6     2.2    1.9   3.9   2.8     800
4         3.8     3.5    3.5   4.1   3.3     221
...       ...     ...    ...   ...   ...     ...
49        4.0     4.2    4.0   4.4   4.1      18
50        3.5     3.4    3.9   4.4   3.3      90
[The full correlation matrix for the six variables is not reproduced here.]
With all coefficients shown, any coefficient between any two variables can be quickly found. Note the strong correlations between TEACH-OVERALL (0.804) and ENROLL-EXAM (-0.558) above. Notice also that ENROLL is negatively correlated with every other variable (as classes get bigger, student attitudes get more negative).
one of them (OVERALL) reflects student attitude toward the course as a whole, it is the most appropriate variable to serve as criterion. TEACH, EXAM, KNOW, GRADE, and ENROLL are appropriate predictor variables. How well do the predictor variables account for the variance in the criterion OVERALL? And what are the values of a (constant) and the b coefficients based on the data? Here is our raw score regression model equation:

OVERALL′ = a + b1(TEACH) + b2(EXAM) + b3(KNOW) + b4(GRADE) + b5(ENROLL)
ANALYSIS OF VARIANCE
SOURCE       SUM-OF-SQUARES   DF   MEAN-SQUARE
REGRESSION        13.934       5        2.787
RESIDUAL           4.511      44        0.103
The above SYSTAT multiple regression printout of the data has three distinct sections which relate to the three questions stated above. These are delineated by dotted lines which are not normally seen in a printout. We will now take each section in turn.
Section One
DEP VAR: OVERALL N: 50 MULTIPLE R: .869 SQUARED MULTIPLE R: .755 ADJUSTED SQUARED MULTIPLE R: .728 STANDARD ERROR OF ESTIMATE: 0.320
The first section of the regression printout, shown above, includes the elements defined below. The specific values for this example are displayed in brackets [ ].
DEP VAR: The dependent variable (criterion). [OVERALL]
N: Number of cases or subjects in the study. [50]

MULTIPLE R: Correlation between OVERALL and the predictors. [0.869]

SQUARED MULTIPLE R: Proportion of variance in OVERALL accounted for by the predictors. [0.755]

ADJUSTED SQUARED MULTIPLE R: If you were to use a multiple regression equation with another set of data, the R² value from the second data set would be smaller than the R² produced by the original data set. This reduction in R² is called shrinkage. The adjustment depends on the number of subjects (N) and the number of variables (k) in the study. This is the true value of R². [0.728]

STANDARD ERROR OF ESTIMATE: The standard deviation of the residuals. [0.320]
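The shrinkage adjustment can be checked with the standard formula, 1 - (1 - R²)(N - 1)/(N - k - 1). It is my assumption that SYSTAT uses this common form; the small discrepancy in the result comes from starting with the rounded R²:

```python
# Adjusted R² via the standard shrinkage formula: 1 - (1 - R²)(N - 1)/(N - k - 1)
r2 = 0.755   # squared multiple R from the printout (rounded)
N = 50       # cases
k = 5        # predictor variables

adj_r2 = 1 - (1 - r2) * (N - 1) / (N - k - 1)
print(round(adj_r2, 3))  # ≈ 0.727; the printout shows 0.728 (it uses the unrounded R²)
```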
The answer to our first question is that the five predictor variables account for 72.8 percent of the variability of OVERALL (Adj. R² = 0.728).
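The shrinkage adjustment described above can be reproduced numerically. The sketch below assumes the standard adjustment formula, Adj. R² = 1 − (1 − R²)(N − 1)/(N − k − 1), which the text's values are consistent with (the small discrepancy comes from R² being rounded to three places in the printout):

```python
def adjusted_r_squared(r2, n, k):
    """Shrinkage-adjusted R^2: penalizes R^2 for the number of
    predictors (k) relative to the number of subjects (n)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Values from the printout: R^2 = 0.755, N = 50, k = 5 predictors
adj = adjusted_r_squared(0.755, 50, 5)
print(f"{adj:.3f}")  # close to the printout's 0.728
```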
Section Two
VARIABLE   COEFFICIENT   STD ERROR   STD COEF   TOLERANCE        T   P(2 TAIL)
CONSTANT        -1.195       0.631      0.000   1.0000000   -1.893       0.065
TEACH            0.763       0.133      0.662    .4181886    5.742       0.000
EXAM             0.132       0.163      0.106    .3245736    0.811       0.422
KNOW             0.489       0.137      0.325    .6746330    3.581       0.001
GRADE           -0.184       0.165     -0.105    .6196885   -1.114       0.271
ENROLL           0.001       0.000      0.124    .6534450    1.347       0.185

(COEFFICIENT = the b's; STD ERROR = s_b; STD COEF = the β's; TOLERANCE reflects multicollinearity; T = b/s_b; P(2 TAIL) = p(t))
The second section of the printout, shown above, details the analysis of each predictor individually. It is this part of the printout that provides the regression coefficients (the b's and β's) as well as their significance tests.
VARIABLE: Heading for the variable names in the regression model. CONSTANT is the value of OVERALL when all predictors equal zero.
COEFFICIENT: Heading for the values of the respective regression coefficients (the b's) of the regression equation. Using these values, you can write the regression equation for OVERALL and the five predictors as follows:

OVERALL' = -1.195 + 0.76(TEACH) + 0.13(EXAM) + 0.49(KNOW) - 0.18(GRADE) + 0.001(ENROLL)

Here OVERALL' is the predicted OVERALL score; -1.195 is the constant, the value of OVERALL when all predictors equal 0; and each regression coefficient is multiplied by the raw score of its variable.
Given the mean scores of the five predictors for any class, we can predict what that class's OVERALL score will be.
STD ERROR: Standard deviation of the regression coefficient (b). It is used in a t-test to determine whether the b is significant.
STD COEF: Standardized regression coefficients, or beta weights (β). Betas are to b's what z-scores are to X's. While the b's are used with raw scores in regression equations, as in 0.76(TEACH) above, the betas are used with z-scores. The beta for TEACH equals 0.662. The proper term for TEACH in a standardized regression equation is 0.662(zTEACH). Because betas are standardized, they can be directly compared according to relative strength. The b's cannot be compared directly because they usually represent different score ranges: ENROLL ranges from a low of 7 to a high of 800, while the other scales range from 1 to 5. The standardization of the betas eliminates this problem of differing ranges, just as z-scores eliminate the problem of comparing raw scores with differing variabilities. In our example, we see that TEACH (β = 0.662) is more than six times as influential as EXAM (β = 0.106), and twice as influential as KNOW (β = 0.325).
TOLERANCE: The ideal condition in multiple regression analysis is for each predictor variable to be related to the criterion, but not to the other predictor variables. Predictor variables are supposed to be independent of each other, but they rarely are. Tolerance values near zero (0) in this printout indicate that some
26-9
of the predictors are highly intercorrelated. This undesirable situation is called multicollinearity. Look for tolerance values near 1.0.
T: If you divide the value of each regression COEFFICIENT by its respective STD ERROR, you will get the values in this column. For example, the t-value for the b on the variable TEACH is equal to 0.763/0.133 = 5.742. The t-test values are used to answer the question, "Is this predictor significant?"

VARIABLE      COEFFICIENT   STD ERROR   STD COEF   TOLERANCE        T   P(2 TAIL)
CONSTANT           -1.195       0.631      0.000   1.0000000   -1.893       0.065
>>>> TEACH          0.763   /   0.133   =                       5.742       0.000
EXAM                0.132       0.163      0.106    .3245736    0.811       0.422
KNOW                0.489       0.137      0.325    .6746330    3.581       0.001
GRADE              -0.184       0.165     -0.105    .6196885   -1.114       0.271
ENROLL              0.001       0.000      0.124    .6534450    1.347       0.185

P(2 TAIL): The probability of obtaining the computed t-value if the true coefficient were zero. For TEACH, p is very small (less than 0.001): we would almost never get a t-value of 5.742 if b for TEACH were 0. Therefore, we say that TEACH is a significant predictor. There is a 42.2% (0.422) chance of getting the t-value of 0.811 for EXAM by chance. EXAM is not a significant predictor, since p > 0.05.
The answer to our second question is that TEACH and KNOW are significant predictors of OVERALL. EXAM, GRADE, and ENROLL are not.
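The T column can be checked directly from the printed coefficients and standard errors. A quick sketch (values transcribed from the printout; small discrepancies arise because b and s_b are rounded to three places there, and ENROLL is omitted because its printed standard error rounds to 0.000):

```python
# (b, s_b) pairs transcribed from the printout
predictors = {
    "TEACH": (0.763, 0.133),
    "EXAM":  (0.132, 0.163),
    "KNOW":  (0.489, 0.137),
    "GRADE": (-0.184, 0.165),
}

# t = b / s_b, as in the T column
t_values = {name: b / sb for name, (b, sb) in predictors.items()}
for name, t in t_values.items():
    print(f"{name}: t = {t:.2f}")
```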
Section Three
ANALYSIS OF VARIANCE
SOURCE       SUM-OF-SQUARES   DF   MEAN-SQUARE   F-RATIO       P
REGRESSION          13.934     5         2.787    27.184   0.000
RESIDUAL             4.511    44         0.103
The third section of the printout details the analysis of the model as a whole. Is the model, as represented by the regression equation being tested, a viable one?
SOURCE: There are two sources of variance in regression analysis. One is the regression itself; the other is the variance left unaccounted for after the regression analysis. The predicted score (Ŷ) seldom equals the criterion score (Y). There is always some error of estimate (e), such that Y = Ŷ + e. We can therefore divide the criterion scores into two parts: Ŷ (regression) and e (residual).
SUM-OF-SQUARES: The sum of squared deviations about the regression line. The total sum of squares is divided between the accounted-for REGRESSION (Ŷ) and the unaccounted-for RESIDUAL (e). The regression sum of squares is the sum of squared deviations of Ŷ about the mean of Y: Σ(Ŷ − Ȳ)². The residual sum of squares is Σe², which equals Σ(Y − Ŷ)².
DF: Degrees of freedom. DFreg equals the number of variables minus 1 [dfreg = 6 − 1 = 5]. DFres equals the number of subjects minus the number of variables [dfres = 50 − 6 = 44].
MEAN-SQUARE: The mean-square terms are variances. MSreg equals SSreg divided by dfreg. MSres equals SSres divided by dfres.
F-RATIO: The F-ratio equals MSreg/MSres and is used to determine whether the variance due to regression is sufficiently greater than the variance due to residual noise to render the model significant (viable).
P: The probability of the computed F-RATIO being this large by chance. Any time p < 0.05, a significant model is indicated. In our case, a P of 0.000 means p is very small (less than 0.001) and indicates a significant model.
The answer to our third question is that we do have a viable model. The F-ratio is significant.
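The ANOVA arithmetic above can be verified step by step from the printed sums of squares and degrees of freedom:

```python
# ANOVA quantities from the printout (five-predictor model)
ss_reg, df_reg = 13.934, 5
ss_res, df_res = 4.511, 44

ms_reg = ss_reg / df_reg   # mean-square regression: about 2.787
ms_res = ss_res / df_res   # mean-square residual: about 0.103
f = ms_reg / ms_res        # F-ratio: about 27.18

print(f"MSreg = {ms_reg:.3f}, MSres = {ms_res:.3f}, F = {f:.2f}")
```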
26-10
Since the other variables are not significant, let's analyze another model which includes only the two significant predictors, TEACH and KNOW.
ANALYSIS OF VARIANCE
SOURCE       SUM-OF-SQUARES   DF   MEAN-SQUARE   F-RATIO       P
REGRESSION          13.627     2         6.814    66.467   0.000
RESIDUAL             4.818    47         0.103
Study the printout above and the analysis below carefully. First, did reducing the number of predictors from five to two reduce the amount of variance accounted for in OVERALL? The Adjusted R-Square is 0.728, exactly the same as we found with five predictors. We lost nothing here, which is good. Second, did we increase the standard error of estimate? No, it is 0.320, the same as before. This too is good. Third, are the two predictors significant? Yes, both TEACH and KNOW show large t-test values and very low probabilities (0.000 means p < 0.001, very small). This is good. Fourth, is our model more sound? The F-ratio is larger, showing a better ratio of regression to noise. Notice that the sum-of-squares values are not much different from before, but the change in regression df from 5 to 2 produces a larger MEAN-SQUARE value. In conclusion, this second model is better. For these 50 courses, we can account for nearly 73% of the students' ratings of courses by knowing their ratings of the instructors' TEACHing skills and their instructors' perceived KNOWledge of the subject. ENROLLment in the class, quality of EXAMs, and the students' anticipated GRADEs are not significant predictors of OVERALL quality.
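The fourth point, the improved F-ratio, can be checked by computing both models' F-ratios from their ANOVA tables:

```python
def f_ratio(ss_reg, df_reg, ss_res, df_res):
    """F = (SSreg / dfreg) / (SSres / dfres)."""
    return (ss_reg / df_reg) / (ss_res / df_res)

f_full    = f_ratio(13.934, 5, 4.511, 44)   # five predictors: about 27.18
f_reduced = f_ratio(13.627, 2, 4.818, 47)   # TEACH and KNOW only: about 66.47

print(f"{f_full:.2f} vs {f_reduced:.2f}")
```

The reduced model's larger F confirms that dropping the three nonsignificant predictors cost almost no regression sum of squares while freeing up degrees of freedom.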
26-11
OVERALL' = -1.298 + 0.71(TEACH) + 0.538(KNOW)
zOVERALL' = 0.616(zTEACH) + 0.358(zKNOW)

Summary
In this chapter you have been introduced to the world of regression analysis. You have seen how scattergrams of data can be reduced to a single predictor equation. You have learned how to compute the two constants of a linear regression line, a and b, and how to read a computer printout from a multiple regression analysis.
Example
Dr. Dean Paret studied the relationship between nuclear family health and selected family-of-origin variables among 302 married subjects in 1991.8 The criterion (predicted) variable was overall perceived nuclear family health {FUNC}, as measured by the Family Adaptability and Cohesion Evaluation Scale (FACES-R).9 The predictor variables related to family of origin. Autonomy {AUTON} measures an individual's sense of independence and self-reliance; it includes free expression, responsibility, mutual respect, openness, and experiences of separation or loss.10 The second predictor was Intimacy {INTIM}, which reflects close, familiar, and usually affectionate or loving personal relationships without feeling threatened or overwhelmed; it includes expression of feelings, sensitivity and warmth, mutual trust, and the lack of undue stress in conflict situations.11 Both AUTON and INTIM were measured by the Family-of-Origin Scale (FOS).12 Additional demographic variables were gathered by means of a questionnaire: educational level {EDUC}, degree program {DEGREE}, number of years in graduate school {YRS}, income level {SALARY}, sex of participant {SEX}, and whether or not the couple was a dual-career family {DUAL}.13 Here is Dr. Paret's final printout:
MULTIPLE REGRESSION PRINTOUT14
DEP VAR: FUNC   N: 302   MULTIPLE R: .898   SQUARED MULTIPLE R: .806
ADJUSTED SQUARED MULTIPLE R: .804   STANDARD ERROR OF ESTIMATE: 34.973

VARIABLE   COEFFICIENT   STD ERROR   STD COEF   TOLERANCE        T   P(2 TAIL)
CONSTANT        61.373      11.220      0.000        .        5.470       0.000
AUTON            1.256       0.122      0.647   0.1656319   10.320       0.000
EDUC            11.206       4.642      0.063   0.9462106    2.414       0.016
INTIM            0.406       0.126      0.200   0.1700070    3.224       0.001
YRS              7.588       1.973      0.107   0.8370002    3.847       0.000
ANALYSIS OF VARIANCE
SOURCE       SUM-OF-SQUARES   DF   MEAN-SQUARE   F-RATIO       P
REGRESSION      1513896.178    4    378474.045   309.434   0.000
RESIDUAL         363265.557  297      1223.116
8. Dean Kevin Paret, A Study of the Perceived Family of Origin Health as It Relates to the Current Nuclear Family in Selected Married Couples (Fort Worth, Texas: Southwestern Baptist Theological Seminary, 1991).
9. Ibid., p. 39.
10. Ibid., p. 53.
11. Ibid., pp. 53-54.
12. Ibid., p. 38.
13. Ibid., p. 39.
14. Ibid., Table 15, p. 149.
26-12
Question One: How much variability of the subjects' family health (FUNC) was accounted for by family-of-origin autonomy and intimacy, and the demographic variables of education level (EDUC: high school, college, graduate school) and years in seminary (YRS: 1, 2, 3, 4+)? The Adjusted R² value (0.804) answers this question: 80.4%. Less than 20% of the variability in current family health is unaccounted for. This is a very strong finding.
Question Two: Which predictor variables are significant? What is the order of influence of these variables on family health? Since this is the fourth and final printout in a series, all nonsignificant predictors have already been eliminated (DEGREE, DUAL, SALARY, SEX). All of the variables listed above show p(t) values less than 0.05. The rank order of influence is given by the beta values under the heading STD COEF. Autonomy (AUTON) has by far the greatest influence on family health (FUNC) with β = 0.647. Intimacy (INTIM) is next with β = 0.200, followed by years enrolled in seminary (YRS) with β = 0.107, and finally educational level (EDUC) with β = 0.063. The raw and standardized regression equations for Dr. Paret's study are

FUNC' = 61.373 + 1.256(AUTON) + 11.206(EDUC) + 0.406(INTIM) + 7.588(YRS)
zFUNC' = 0.647(zAUTON) + 0.063(zEDUC) + 0.200(zINTIM) + 0.107(zYRS)

Question Three: Is this a viable model? Does it adequately predict family health among the 302 subjects? The answer is found in the ANOVA table and F-ratio. The F-ratio of 309.434 (p < .001) tells us this is a very strong model.
How important are the variables EDUC and YRS to the FUNC model? Dr. Paret dropped these out of his full model and produced the following:
MULTIPLE REGRESSION PRINTOUT15
DEP VAR: FUNC   N: 302   MULTIPLE R: .890   SQUARED MULTIPLE R: .792
ADJUSTED SQUARED MULTIPLE R: .791   STANDARD ERROR OF ESTIMATE: 36.121

VARIABLE   COEFFICIENT   STD ERROR   STD COEF   TOLERANCE        T   P(2 TAIL)
CONSTANT        90.925       8.139      0.000        .       11.171       0.000
AUTON            1.350       0.124      0.696   0.1702486   10.887       0.000
INTIM            0.425       0.130      0.209   0.1702486    3.269       0.001
ANALYSIS OF VARIANCE
SOURCE       SUM-OF-SQUARES   DF   MEAN-SQUARE   F-RATIO       P
REGRESSION      1487058.544    2    743529.272   569.888   0.000
RESIDUAL         390103.191  299      1304.693
How much did Adj. R2 change? The amount of variance-accounted-for dropped from 0.804 to 0.791, a change of -0.013, or a little over one percent. This is good. We did not lose much R2 by dropping two of the four predictor variables. Are AUTON and INTIM still significant predictors? Yes (p<0.001, p=0.001). Did the model suffer from dropping EDUC and YRS? No. The F-ratio is larger than before, showing a smaller, stronger model.
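The Adjusted R² comparison above can be verified with the same shrinkage formula used earlier in the chapter (the standard adjustment is assumed here):

```python
def adjusted_r_squared(r2, n, k):
    """Shrinkage-adjusted R^2 (standard formula, assumed here)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

adj_full    = adjusted_r_squared(0.806, 302, 4)  # four predictors
adj_reduced = adjusted_r_squared(0.792, 302, 2)  # AUTON and INTIM only

print(f"{adj_full:.3f} -> {adj_reduced:.3f} (change {adj_reduced - adj_full:+.3f})")
```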
15. Ibid., p. 150.
16. Ibid., p. 101.
26-13
Family patterns of relationship transfer from generation to generation. Healthy family relationships are rooted in the degree of autonomy and intimacy experienced in the family of origin. Likewise, dysfunctional family relationships are rooted in family-of-origin dysfunction. These same patterns show up in seminary couples. Some enter the ministry to help others out of the need to help themselves because of a dysfunctional family background. Inability to establish healthy relationships in the family has been found to transfer to the ministry: such ministers have difficulty forming ministerial relationships in the pastorate.16 The challenge to seminaries is to go far beyond teaching students how to minister: it is also to help dysfunctional students break with past patterns and learn anew how to establish autonomous and appropriately intimate relationships with others. The health of our churches is at stake.
Vocabulary
adjusted squared multiple R: multiple correlation coefficient after adjustment for shrinkage
correlation matrix: representation of multiple variables and their intercorrelations
criterion variable: predicted or dependent variable in multiple regression (Y)
linear equation: mathematical formula which describes a straight line (Y = 2X + 3)
linear regression: predicting one variable by another by the best-fit line through a scattergram
multicollinearity: degree of inter-correlation among predictor variables
multiple correlation coefficient: correlation between the criterion variable and all predictors together (R)
multiple linear regression: prediction of one variable by two or more others
predictor variable: variable(s) used to estimate a criterion variable in regression analysis
regression sum of squares: sum of squared deviations between Ŷ and the mean of Y
regression coefficient: raw score correlate of criterion variable in regression (b)
residual sum of squares: sum of squared deviations between Y and Ŷ (Σe²)
residual: difference between the true Y value and the predicted value Ŷ (e = Y − Ŷ)
shrinkage: reduction in R² value when an equation is applied to new data
slope: one of two determiners of a regression line, m = (ΔY/ΔX)
squared multiple R: proportion of variance of Y accounted for by all predictors (R²)
standard error of estimate: standard deviation of the residuals
standardized regression coefficient: standardized score correlate of criterion variable in regression (β)
tolerance: reflects the degree of multicollinearity among predictors
y-intercept: one of two determiners of a regression line, the value of Y when X = 0
Study Questions
1. Draw a set of axes. Label the X-axis (horizontal) from 0 to 10 and the Y-axis (vertical) from +4 to -1. Compute the 10 values of Y for X = 1, 2, ..., 10 with the equation Y = -0.5X + 4. Plot the 10 points on your axes.
2. Define e. Show how it is calculated.
3. Work through the explanation of the first regression printout using the second printout on page 26-10. Identify and define each of the following elements:
a. Dep var
b. N
c. Squared multiple R
d. Adjusted squared multiple R
i. Tolerance
j. Multiple R
k. P(2 tail)
l. Regression sum-of-squares
26-14
4. A regression analysis was done on the data given below. Draw a scatterplot of the data. Compute a and b, then draw the proper regression line on the scatterplot. Study the regression printout below and describe your findings. Include R, Adj. R-squared, coefficients, t-test values and probabilities, and the F-ratio. The following data are scores from 15 students: Bible knowledge test scores (Y) and the number of semester hours of Bible in college (X).

X:  15  18  18  12   9   9   6  12  15  12  12  12  15  12  18
Y:  23  27  30  19  18  21  17  21  27  29  25  22  26  25  24
MULTIPLE REGRESSION PRINTOUT
DEP VAR: KNOW   N: 15   MULTIPLE R: .728   SQUARED MULTIPLE R: .530
ADJUSTED SQUARED MULTIPLE R: .494   STANDARD ERROR OF ESTIMATE: 2.792

VARIABLE   COEFFICIENT   STD ERROR   STD COEF   TOLERANCE       T   P(2 TAIL)
CONSTANT        13.066       2.845      0.000   1.0000000   4.593       0.001
HOURS            0.810       0.212      0.728   1.0000000   3.828       0.002

F-RATIO: 14.657   P: 0.002
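Readers who want to check their hand computations for question 4 can do so with a few lines of Python (the X and Y values are read from the data table above; the results agree with the printout's constant of 13.066 and coefficient of 0.810):

```python
# Semester hours of Bible (X) and Bible knowledge test scores (Y), 15 students
X = [15, 18, 18, 12, 9, 9, 6, 12, 15, 12, 12, 12, 15, 12, 18]
Y = [23, 27, 30, 19, 18, 21, 17, 21, 27, 29, 25, 22, 26, 25, 24]

n = len(X)
mean_x = sum(X) / n   # 13.0
mean_y = sum(Y) / n   # 23.6

# b = sum of cross-product deviations / sum of squared X deviations
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(X, Y))
sxx = sum((x - mean_x) ** 2 for x in X)
b = sxy / sxx             # slope
a = mean_y - b * mean_x   # y-intercept

print(f"b = {b:.3f}, a = {a:.3f}")  # b = 0.810, a = 13.066
```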
26-15
Chapter 27
Evaluation Checklist
27
Guidelines for Evaluating Research Proposals
This chapter is included in the text for two reasons. The first is to give you guidance as you write your own proposal. If you merely mimic another research proposal, you will miss the most important learning goal of the thesis or dissertation process: the creation of a plan to solve a real problem through research and analysis. By choosing a subject of interest and systematically applying the guidelines in this checklist, you will master, from the ground up, skills that will help you in all your problem-solving situations. The second is to provide a checklist to help you evaluate the research proposals of other students. Use this checklist along with the descriptions in Chapter 2 to master the essential elements of writing an effective research proposal.
Introduction
Rate each item: Yes!  Yes?  No?  No!

Does the introductory statement move you, like a funnel, from a general to a specific view of the problem of the study?
Does the introductory statement avoid personal pronouns, subjective language, and awkward grammar?
Is the Problem stated clearly, tersely, and objectively?
Is the Problem stated in the proper format (relationship between variables or difference between groups)?
Does the Purpose clearly state the intention of the study?
Does the Purpose break the Problem down into subsections for analysis?
27-1
Is the Related Literature a true synthesis of researched material, rather than a review, summary, or report?
Are most of the materials footnoted in the Related Literature section drawn from primary, rather than secondary, sources?
Is there an obvious organizational scheme to the Related Literature section: historical, topical, or related to the hypotheses?
Does the Related Literature section give you the impression that the writer is thoroughly familiar with what is known in the field?
Does the Significance of the Study section answer the question "So what?" (Does it explain why this particular study is important to the field? Does it include referenced support for the study?)
Does the Hypothesis state an expected answer to the Problem which has been stated?
Is the Hypothesis written in testable form?
Is the Hypothesis stated appropriately (usually as a research, rather than a null, hypothesis)?
The Method
Rate each item: Yes!  Yes?  No?  No!

Is the study's population clearly defined?
Is the procedure for sampling (if used) clearly explained?
Is the size of the sample(s) stated?
Is there a clear description of the instrument(s) that will be used to gather data?
Are the stated limitations actual limitations to the study or merely delimitations?
Are the stated assumptions legitimate in the context of the proposal, rather than cop-outs for shallow thinking?
Are the stated definitions legitimate in the context of the study (operational, unusual connotation, or restricted meaning) rather than obvious or commonly used words?
Is the research design (if needed) clearly explained?
Are the procedures for collecting data clearly stated step-by-step?
Do the procedures avoid fuzzy language and word magic?
Is there evidence that the researcher has considered potential
27-2
The Analysis
Rate each item: Yes!  Yes?  No?  No!

Are the procedures for analyzing data clearly stated step-by-step?
Does the researcher give evidence of understanding the statistical procedures he/she has chosen?
Are the research hypotheses restated in null form for testing?
Is each of the hypotheses tested with the appropriate statistic?
Is there agreement among the Problem, Hypothesis, and statistical procedures used (all deal with relationship, or difference, or congruence, etc.)?
Are there model charts, graphs, or tables which show how the data will be organized and reported in the final paper?
Are all references cited in the body of the paper (and only those cited) referenced in the bibliography?
General
Rate each item: Yes!  Yes?  No?  No!

Does the paper exhibit theological reflection and application to the Christian ministry context?
Is Southwestern style used correctly throughout the paper?
Does the paper generally exhibit good writing skills: spelling, grammar, syntax, clarity of thought?
Does the paper exhibit good organizational skills: flow of thought, effective transitions from section to section, the impression that the paper is all of one piece?
Does the paper present a professional appearance?
On the basis of what you've learned in class, what grade would you consider proper for this paper?
Scoring: Count the number of marks in each category (the counts should add to 40). Multiply each count by the appropriate factor to obtain the evaluation points for that category, then add the four subtotals together for the total points (0-200).
27-3
Appendix 1
Answer Key
Sample Test Questions
The following answer key is provided to reinforce your study of key concepts in the course. Mastering a language requires more than memorizing correct answers, however. You will be tested by questions which will require you to use the languages of research and statistics. Use these sample questions as a beginning point in testing your understanding.
Chapter One
1. D 2. B 2. A 3. D, E, P, E, C, D, C, H, Q, C, H, V, P, R, C, E, C, E, D, D 3. B 4. D
Chapter Two
1. C
Chapter Three
1. Birth year is interval: year is the unit of interval, but 0 A.D. is not the beginning of time.
C° or F° is interval: degree is the unit of interval, but 0 degrees is not the absence of all heat.
Class rank is ordinal: 1st, 2nd, 3rd...
Test score is ratio: point is the unit of interval, and 0 points means absence of mastery.
Nationality is nominal: categories of Caucasian, Black, Hispanic, Oriental, etc.
Body weight is ratio: pound is the unit of interval, and 0 means absence of weight.
2. B
3. D
4. FIRO-B scores are ratio, though it is difficult to tell from this statement that it isn't interval.
A single attitude scale item produces ordinal data (a group of items produces interval data).
Employment status is nominal: subjects are categorized into one of three options.
Study Habits is ratio: the assumption, given max = 100, is that min = 0.
W-GCTA score is ratio.
Leadership Style is nominal: subjects are categorized into one of five styles.
Attrition Ranking is ordinal.
Child Density is ratio data (number/number = ratio: 0.00-1.00).
Chapter Four
1. D 2. D 2. Ind 3. D, S, N, S, N, D, S, N 3. Mult 4. O 5. L 6. P 7. I
Chapter Five
1. M 8. S
Chapter Six
1. C 2. D 2. A 2. D 2. A 3. D 3. B 3. B 3. C 4. A 4. A 4. C
Chapter Seven
1. C
Chapter Eight
1. D
Chapter Nine
1. C
Chapter Ten
1. D 2. C 2. D 2. C 2. I/E 9. I/A 2. 0.375 3. A 3. B 3. D 3. I/I 10. I/M 3. 125 4. I/B 11. I/G 4. -0.358 5. 1/C 12. E/K 5. 128.6% 6. 66.7% 7. 0.867 6. E/L 7. I/D 4. A 4. D 5. C 6. C
Chapter Eleven
1. F
Chapter Twelve
1. A 1. I/J 8. E/H 1. -12 8.
Chapter Thirteen
Chapter Fourteen
8. Solving for X:
Z(X − W) = (1 − C)        divide both sides by Z, leaving...
X − W = (1 − C)/Z         add W to both sides, leaving...
X = ((1 − C)/Z) + W

9. Solving for A:
Place the term containing A on the left, then multiply both sides by 3, leaving...
AB − 1 = 3B(C − 1)        add 1 to both sides, leaving...
AB = 3B(C − 1) + 1        divide both sides by B, leaving...
A = (3B(C − 1) + 1)/B     separating the terms over B, we have...
A = (3B(C − 1)/B) + (1/B) simplifying the first term, we have...
A = 3(C − 1) + (1/B)
The purpose of the algebraic exercises 8-9 above is to accustom you to thinking in terms of relationships between numerical variables apart from actual data. If you can become comfortable thinking in terms of symbols linked together in equations (rather than words linked in sentences), then the statistical formulas you'll encounter will be far less threatening.
Chapter Fifteen
1. A 2. C 2. A 2. C 2. C 3. C 3. D 3. D 3. D 4. A 4. D 4. C 4. D 5. T 6. T 7. F (2.58) 5. B
Chapter Sixteen
1. B 1. C 1. D
Chapter Nineteen
1. A 2. B 2. C 2. B 2. D 2. C 2. D 2. B 2. B 3. B 3. A 3. B 3. A 3. A 3. D 3. D 3. B 4. T 5. F (excludes)
Chapter Twenty
1. C 1. B 1. E 1. D 1. D 1. D 1. C 4. F (matched, correlated) 5. F (a descriptive) 4. B 4. C 4. C 4. A 4. C 4. D 5. C 5. A 5. F 6. B
Chapter Twenty-one Chapter Twenty-two Chapter Twenty-three Chapter Twenty-four Chapter Twenty-five Chapter Twenty-six
Appendix 2
References
Clement, Dan Earl. The Relationship Between Recalled Parental Contact and Adult Personality Adjustment. Ed.D. diss., Southwestern Baptist Theological Seminary, 1987.
Cook, Marcus Weldon. A Study of the Relationship Between Active Participation as a Teaching Strategy and Student Learning in a Southern Baptist Church. Ph.D. diss., Southwestern Baptist Theological Seminary, 1994.
Covington, Randy. An Investigation into the Administrative Structure and Polity Practiced by the Union of Evangelical Christians - Baptists of Russia. Ph.D. proposal, Southwestern Baptist Theological Seminary, 1999.
Crain, Matthew Kent. Transfer of Training and Self-Directed Learning in Adult Sunday School Classes in Six Churches of Christ. Ed.D. diss., Southwestern Baptist Theological Seminary, 1987.
Damon, Roberta McBride. A Marital Profile of Southern Baptist Missionaries in Eastern South America. Ed.D. diss., Southwestern Baptist Theological Seminary, 1985.
Da Silva, Maria Bernadete. A Study of the Relationship Between Leadership Styles and Selected Social Work Values of Social Work Administrators in Texas. Ed.D. diss., Southwestern Baptist Theological Seminary, 1993.
DeVargas, Robert. A Study of Lessons in Character: The Effect of Moral Judgement Curriculum Upon Moral Judgement. Ph.D. diss., Southwestern Baptist Theological Seminary, 1998.
Doyle, Judith N. A Critical Analysis of Factors Influencing Student Attrition at Four Selected Christian Colleges. Ed.D. diss., Southwestern Baptist Theological Seminary, 1984.
Eldridge, Daryl Roger. The Effect of Student Knowledge of Behavioral Objectives on Achievement and Attitude Toward the Course. Ed.D. diss., Southwestern Baptist Theological Seminary, 1985.
Floyd, James Scott. The Interaction Between Employment Status and Life Stage on Marital Adjustment of Southern Baptist Women in Tarrant County, Texas. Ed.D. diss., Southwestern Baptist Theological Seminary, 1990.
Gill, Rollie. A Study of Leadership Styles of Pastors and Ministers of Education in Large Southern Baptist Churches. Ph.D. diss., Southwestern Baptist Theological Seminary, 1997.
Havens, Joan Ellen. A Study of Parent Education Levels as They Relate to Academic Achievement Among Home Schooled Children. Ed.D. diss., Southwestern Baptist Theological Seminary, 1991.
Hedin, Norma Sanders. A Study of the Self-Concept of Older Children in Selected Texas Churches Who Attend Home Schools as Compared to Older Children Who Attend Christian Schools and Public Schools. Ed.D. diss., Southwestern
Baptist Theological Seminary, 1990.
LaNoue, Kaywin Baldwin. A Comparative Study of the Spiritual Maturity Levels of the Christian School Senior and the Public School Senior in Texas Southern Baptist Churches With a Christian School. Ed.D. diss., Southwestern Baptist Theological Seminary, 1987.
Lawson, Margaret P. A Study of the Relationship Between Continuance of LIFE Courses in the LIFE Launch Pilot Churches and Selected Descriptive Factors. Ph.D. diss., Southwestern Baptist Theological Seminary, 1994.
Linam, Gail. A Study of the Reading Comprehension of Older Children Using Selected Bible Translations. Ed.D. diss., Southwestern Baptist Theological Seminary, 1993.
Mathis, Robert. A Descriptive Study of Joe Davis Heacock: Educator, Administrator, Churchman. Ed.D. diss., Southwestern Baptist Theological Seminary, 1984.
McQuitty, Marcia G. A Study of the Relationship Between Dominant Management Style and Selected Variables of Preschool and Children's Ministers in Texas Southern Baptist Churches. Ed.D. diss., Southwestern Baptist Theological Seminary, 1992.
Mullen, Steven Keith. A Study of the Difference in Study Habits and Study Attitudes Between College Students Participating in an Experiential Learning Program Using the Portfolio Assessment Method of Evaluation and Students Not Participating in Experiential Learning. Ph.D. diss., Southwestern Baptist Theological Seminary, 1995.
Paret, Dean Kevin. A Study of the Perceived Family of Origin Health as It Relates to the Current Nuclear Family in Selected Married Couples. Ed.D. diss., Southwestern Baptist Theological Seminary, 1991.
Perez, Darlene J. A Correlational Study of Baptist Youth Groups in Puerto Rico and Youth Curriculum Variables. Ed.D. diss., Southwestern Baptist Theological Seminary, 1991.
Southerland, Dan. A Study of the Priorities in Ministerial Roles of Pastors in Growing Florida Baptist Churches and Pastors in Plateaued or Declining Florida Baptist Churches. Ed.D. diss., Southwestern Baptist Theological Seminary, 1993.
Steibel, Sophia. An Analysis of the Works and Contributions of Leroy Ford to Current Practice in Southern Baptist Curriculum Design and in Higher Education of Selected Schools in Mexico. Ed.D. diss., Southwestern Baptist Theological Seminary, 1988.
Tam, Stephen. A Comparative Study of Three Teaching Methods in the Hong Kong Baptist Theological Seminary. Ed.D. diss., Southwestern Baptist Theological Seminary, 1989.
Waggoner, Brad J. The Development of an Instrument for Measuring and Evaluating the Discipleship Base of Southern Baptist Churches. Ed.D. diss., Southwestern Baptist Theological Seminary, 1991.
Welch, Robert Horton. A Study of Selected Factors Related to Job Satisfaction in the Staff Organizations of Large Southern Baptist Churches. Ed.D. diss., Southwestern Baptist Theological Seminary, 1990.
Williamson, Bradley Dale. An Examination of the Critical Thinking Abilities of Students Enrolled in a Master's Degree Program at Selected Theological Seminaries. Ph.D. diss., Southwestern Baptist Theological Seminary, 1995.
The following studies were also cited in the text:
Yount, Barbara Parish. An Analytical Study of the Procedures for Identifying Gifted Students in Programs for the Hearing-Impaired. Master of Arts thesis, Texas Woman's University, 1986.
Yount, William R. A Critical Comparison of Three Specified Approaches to Teaching Based on the Principles of B. F. Skinner's Operant Conditioning and Jerome Bruner's Discovery Approach in Teaching the Cognitive Content of a Selected Theological Concept to Volunteer Adult Learners in the Local Church. Ed.D. diss., Southwestern Baptist Theological Seminary, 1978.
________. A Monte Carlo Analysis of Experimentwise and Comparisonwise Type I Error Rate of Six Specified Multiple Comparison Procedures When Applied to Small k's and Equal and Unequal Sample Sizes. Ph.D. diss., University of North Texas, 1985.
Appendix 4

Bibliography
Cited Works
The single largest regret that I have with this most recent edition of the text is that I was unable to update the following sources. Through my doctoral program in the 1970s and my preparation to teach research design and statistical analysis in the 1980s, I gathered these texts and used them extensively for illustrations, examples, and explanations in my classes. The books listed below and quoted in the text are excellent resources, even if they are not the most recent.
Ary, Donald, Lucy Chesar Jacobs, and Asghar Razavieh. Introduction to Research in Education. New York: Holt, Rinehart and Winston, 1972.
Babbie, Earl. The Practice of Social Research, 3rd ed. Belmont, CA: Wadsworth Publishing Company, 1983.
Bell, Judith. Doing Your Research Project. Philadelphia: Open University Press, 1987.
Borg, Walter R. Applying Educational Research: A Practical Guide for Teachers. New York: Longman Publishing Company, 1981.
____________ and Meredith D. Gall. Educational Research: An Introduction, 4th ed. New York: Longman Publishing Company, 1983.
Churchill, Gilbert A. Marketing Research: Methodological Foundations, 2nd ed. Hinsdale, IL: The Dryden Press, 1979.
Drew, Clifford J., and Michael L. Hardman. "Designing Experimental Research." Chapter 5 in Designing and Conducting Behavioral Research. New York: Pergamon Press, 1985.
Glass, Gene V. Statistical Methods in Education and Psychology, 2nd ed. Englewood Cliffs, NJ: Prentice-Hall, Inc., 1984.
Hinkle, Dennis E., William Wiersma, and Stephen G. Jurs. Basic Behavioral Statistics. Boston: Houghton Mifflin Company, 1982.
Hopkins, Charles D. Educational Research: A Structure for Inquiry. Columbus, Ohio: Charles E. Merrill Publishing Company, 1976.
Howell, David C. Statistical Methods for Psychology. Boston: Duxbury Press, 1982.
Kubiszyn, Tom, and Gary Borich. Educational Testing and Measurement: Classroom Application and Practice, 2nd ed. Glenview, IL: Scott, Foresman and Company, 1987.
Lewin, Miriam. Understanding Psychological Research. New York: John Wiley & Sons, 1979.
References
Mueller, Daniel J. Measuring Social Attitudes: A Handbook for Researchers and Practitioners. New York: Teachers College Press, 1986.
Nunnally, Jum. Educational Measurement and Evaluation, 2nd ed. New York: McGraw-Hill Book Company, 1972.
Payne, David. The Assessment of Learning: Cognitive and Affective. Lexington, Mass.: D. C. Heath and Company, 1974.
Sax, Gilbert. Foundations of Educational Research. Englewood Cliffs, N.J.: Prentice-Hall, 1979.
True, June. Finding Out: Conducting and Evaluating Social Research. Belmont, CA: Wadsworth Publishing Company, 1983.
SYSTAT Computer Statistical Package
Wilkinson, Leland. A System for Statistics, Version 4. Evanston, IL: SYSTAT, Inc., 1988.