
Quantitative Methods in Social Sciences

Unit One Background Issues
Unit objectives
Having studied this material, you should be able to:
- Understand the general concept, approaches and steps of quantitative methods in social sciences
- Describe the role of quantitative methods in social science research
- Explain the logical structure in quantitative techniques
- Understand how concepts, models, theories and laws develop in quantitative methods
- Explain the merits and demerits of quantitative methods
- Differentiate descriptive and inferential statistics
- Formulate and test hypotheses
- Apply basic quantitative techniques in research

1.1. Research Methods in Social Sciences
Research is a scientific or critical investigation aimed at discovering facts, interpreting data, or applying established techniques to solve problems and answer research questions. It is an organized and systematic way of finding solutions to problems. It is systematic because there is a definite set of procedures and a series of steps that a researcher must follow in order to get the most accurate results. It is organized because it follows a planned procedure so that it gives the most appropriate answers to problems or questions. Questions and problems are central to any research: if there is no problem or question, there is nothing to solve or answer. Hence, the end result of almost all research is finding answers to problems, research questions and hypotheses.

Research is a means of expanding the frontiers of human knowledge. In geography and environmental studies, for instance, it should aim at maintaining the environment so as to attain the well-being and betterment of the human and physical environment. The aim of any good research is to add valid information to what is already known.
Mesay Mulugeta, 2009 1


Therefore, research can be carried out in any field where there is a need to expand the horizon of human knowledge and bring about significant improvement in human affairs.

At this juncture one may raise at least two questions. What is science? What is the relationship between science and research? Nachmias, C. and David Nachmias (1996: 3) answer these questions as follows: The word science is derived from the Latin word scire, meaning to know. Science is difficult to define primarily because people often confuse the content of science with its methodology. Science has no particular subject matter of its own, but a distinct methodology.

Science, in this sense, refers to any systematic and highly skilled means of acquiring knowledge. Research is a scientific method or a technique for investigating phenomena and acquiring new knowledge. To be termed scientific, a method of inquiry must be based on gathering observable, empirical, and measurable data subject to specific principles of reasoning.

If so, when did humans begin to search for truth? In other words, when did research begin? One may answer that humans started searching for truth when they began to look for fruits and animals for survival (hunting and gathering) millions of years ago. Others may respond that research began when humans started to select the most appropriate animals and plants for domestication. Still others may say it was when humans began to explore the world, or when they started to identify the location of the most important minerals for industrial purposes. In this manner, different individuals may offer different views on these questions. None of them is necessarily wrong, because throughout history humans have tried to grasp knowledge in various ways. One of these ways, which is also the most recent, is scientific inquiry. Though the scientific method is not the only means of knowing, science has been helping humans to understand their environment and themselves. Science is, therefore, the best instrument for grasping knowledge through research, which involves observation, identification, description, experimental investigation and theoretical explanation of phenomena that occur in nature.


The above paragraph briefly explains that research is one of the oldest human activities. One should also bear in mind that any systematic and justifiable method of inquiry into the human and physical environment is a science. Research in the social sciences is therefore scientific inquiry to the extent that these fields are founded on scientific methodology, rigorous data analysis and systematic observation. The main difficulty in the human sciences is that the uniformity of nature is not a reasonable assumption in the world of human beings and their characteristics. This is mainly because of the complex nature of human beings, which makes developing sound theories much more difficult than in the physical world. Such research also involves the environment, which is equally dynamic, and any investigation pertaining to humans or any other living being cannot be treated in isolation.

Regarding this, Best and Kahn (2005) explain that research on human subjects is difficult mainly because of the following aspects of human nature:
1. No two persons are alike in feelings, drives and emotions. For instance, an event that extremely delights one individual may irritate others, or a method that we employ to approach one interviewee may not work for handling other respondents in a research.
2. No one person is completely consistent from one moment to another. Human behavior is influenced by the interactions of the individual with every changing element in his or her environment.
3. Human beings are influenced by the research process itself. Unlike other animals such as mice, they are influenced by the attention that is focused on them when under investigation.
Because of these factors, some scholars in the applied sciences are less confident about scientific inquiry into the nonphysical aspects of our world. Hence, they recommend the application of scientific methods with greater vigor and imagination in the social sciences. It is believed that the development of scientific inquiry in the social sciences and its application to human affairs may offer the best solutions to some of our greatest present and future challenges, such as peace and security, human rights violation, global warming, food insecurity, polar ice-melting and global economic recession.


1.2. Research Approaches in Social Sciences
Cresswell (2003) clearly identified three approaches to social sciences research: Quantitative, Qualitative and Mixed methods approaches. In fact, other writers such as Best and Kahn (2005), Spratt, C., Waker, R. and Robinson, B. (2004) and Frechtling, J. et al. (eds.) (1997) have also tried to distinguish and explain the three research approaches, though not as clearly as Cresswell. The objective of this course is therefore to discuss in detail quantitative research techniques in social sciences such as geography, sociology and development studies.

1.3. Essential Steps in Research
Research, as discussed hereinbefore, is a systematic way of finding solutions to problems. Here the term systematic implies that in scientific research there is always a definite set of procedures that a researcher must follow in order to arrive at justifiable, testable and the most accurate research outputs. Pertaining to this, Gurumani (2007) clearly indicates that a creditable research should follow the following general steps:
1. Selection of the topic and specific problem of the research
2. General or pilot survey of the field to understand the problem of the research
3. Specific survey of the pertinent literature and development of bibliography
4. Definition of the problem, including differentiating, defining and classifying the concepts
5. Determining the parameters required to be studied towards the solution of the problem
6. Choosing the methodology to study the parameters
7. Standardizing the methodology and testing its suitability for the specific problem
8. Designing field survey with appropriate statistical techniques and tools
9. Collecting data or information
10. Systematic classification, tabulation, presentation, analysis and interpretation of data
11. Reporting in the form of essay, thesis, dissertation or any kind of research output

Quantitative Research Approaches in Social Sciences
One of the most notable social sciences that employ quantitative techniques in its research is geography. In the geography field of study, the quantitative revolution was one of the major turning points in its history of development. It occurred during the 1950s and 1960s in the universities of Europe and the USA and marked a rapid change in geographical research. The main claim for the quantitative revolution is that it led to a shift from descriptive (idiographic) geography to an empirical law-making (nomothetic) geography. In the early 1950s there was a growing sense that the existing paradigm for geographical research was not adequate in explaining how physical, economic, social and political processes are spatially organized or ecologically related. A more abstract, theoretical approach to geographical research has emerged, evolving the analytical method of inquiry. This is mainly because modern geography is an all-encompassing discipline that foremost seeks to understand the earth and all of its human and natural complexities in a more scientific manner. It studies not merely where objects are, but how they are interrelated, why they are there and their socio-economic values of being there. Look at the logical structure of the quantitative research approach in the figure below.

Figure 1.1: The logical structure of quantitative research process
Theory or research problem
   (Deduction)
Hypothesis
   (Operationalization)
Observations/Data Collection
   (Data Processing)
Data Analysis
   (Interpretation)
Findings
   (Induction)
Source: Bryman, A. (2000: p. 20)

Quantitative research in social sciences is, therefore, a set of quantitative techniques that allow researchers to answer research questions in the discipline. These methods and techniques tend to specialize in quantities in the sense that numbers come to represent variables like altitude, temperature, rainfall, income, dietary energy, body weight and age. The interpretation of the numbers is viewed as strong scientific evidence of how a phenomenon works. The presence of quantities is so predominant in the quantitative research approach that statistical tools and packages are an essential element in the researcher's toolkit. Sources of data are of less concern in identifying an approach as quantitative than the fact that empirically derived numbers lie at the core of the scientific evidence assembled. A quantitative researcher may use archival data or gather it through different tools such as interview, questionnaire, measurements, and personal observations. In all cases, the researcher is motivated by the numerical outputs and how to derive meanings from them.

As indicated in the figure above (Fig. 1.1), the quantitative research process consists of at least five stages: theory, hypothesis formulation, observation or data collection, data analysis and finding or generalization. The figure illustrates that a research process is cyclic in nature. It starts with a theory or research problem and ends with research findings or empirical generalization. The ending of one research cycle will be the beginning of the other one. This cyclic process continues indefinitely, reflecting the process of a scientific investigation, and it opens the door for self-correction in such a way that scientific investigators test the generalizations, hypotheses and findings of the research problems logically and empirically.

1.4. Concepts, Models, Theories and Laws in Quantitative Research
Concept, theory and model are three important words that occur very regularly in research texts. It is often assumed that everyone knows what these words mean and what the differences between them are. Here we have to bear in mind that there are a number of possible definitions for each word. The definitions of these important terms are collected from different sources like websites and books.

Concept
Several writers define the term concept simply as an abstract notion or idea, something that isn't concrete. It is an abstract summary of characteristics that we see as having something in common.

Concepts are created by people for the purpose of communication and efficiency. Therefore, as an educator or researcher you would be expected to review the existing range of definitions of the term concept and decide on which you are going to use.

Theory
As is the case for the term concept stated above, there are definitions of theory in the literature and electronic media. However, a more substantial definition of a theory seems to be the one stated in ENCARTA World English Dictionary, which defines the term theory as a set of facts, propositions, or principles analyzed in their relation to one another and used, especially in science, to explain phenomena. It is a set of hypotheses or principles linked by logical or mathematical arguments which is advanced to explain an area of empirical reality or a type of phenomenon. A theory, therefore, includes a set of basic assumptions and axioms as the foundation, and the body of the theory is composed of logically interrelated and empirically verifiable propositions. Theories exist in different social science disciplines such as economics, psychology and sociology.

Cresswell (2003) cited Kerliger (1979) defining theory as a set of interrelated constructs (variables), definitions, and propositions that presents a systematic view of phenomena by specifying relations among variables. In this definition theory is an interrelated set of constructs (variables) formed into propositions or hypotheses that specify the relationship of variables, typically in terms of magnitudes or directions. The systematic view may be an argument, a discussion or a rationale, and it helps to explain or predict phenomena that occur in the world.

As stated in Cresswell (2003), in quantitative research, hypotheses and research questions are often based on theories that the researcher seeks to test. Researchers use theory in a quantitative study to provide an explanation or prediction about the relationship among variables in the study. A theory explains how the variables are related, acting as a bridge between or among the variables. Why would an independent variable, X, influence or affect a dependent variable, Y? The theory would provide the explanation for this expectation or prediction. Theories develop when researchers test a prediction many times. When investigators test hypotheses over and over in different settings and with different populations, a theory emerges and someone gives it a name. Thus, theory develops as an explanation to advance knowledge in a particular field.

Another aspect of a theory is that it varies in its breadth of coverage. Theories can be classified into micro-level, meso-level and macro-level. Micro-level theories provide explanations limited to small slices of time, space and variables, while meso-level theories integrate some micro-level theories of organizations, social movements or communities. Macro-level theories explain larger aggregates such as social institutions, cultural systems and whole societies.

As stated by Cresswell (2003), the following procedures should be used to present a model for writing a quantitative theoretical perspective section into a research plan. These are: review the related literature, find what theories were used by other investigators or researchers in your area of study, ask why the independent variable(s) affect the dependent variable(s), and script out the theory section. Hence, a researcher or investigator must intensively read different related research works and books before s/he goes on to building a theory or hypotheses for his or her research.

Model
ENCARTA World English Dictionary defines model as an interpretation of a theory arrived at by assigning referents in such a way as to make the theory true. It is a simplified version of something complex used in analyzing and solving problems or making predictions. A model can be an analogous model, for instance a globe as a model of the earth, or a symbolic model. Symbolic models are concerned with quantification; they are based on logic and inter-relationships between concepts and are usually expressed mathematically or algebraically. There are many such symbolic models in the fields of human and natural sciences. For instance, the regression model below is used to quantify an estimated or predicted value of a data set, and the correlation model below helps us to analyze the strength and direction of a linear relationship between an independent and a dependent variable.

Examples of symbolic models:

Regression model:
$$Y = A + BX_1 + CX_2 + DX_3 + \dots + NX_n$$

Correlation analysis model:
$$r = \frac{N\sum XY - (\sum X)(\sum Y)}{\sqrt{\left[N\sum X^2 - (\sum X)^2\right]\left[N\sum Y^2 - (\sum Y)^2\right]}}$$
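The two symbolic models named above can be tried out on a small data set. The following is a minimal sketch, not a definitive implementation: it assumes a single predictor (the special case Y = A + BX of the regression model), and the rainfall and yield figures, the variable names and the function names are all hypothetical, invented only for illustration.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient, computed term by term from the sums
    N*sum(XY), sum(X), sum(Y), sum(X^2) and sum(Y^2)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt(
        (n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2)
    )
    return numerator / denominator


def fit_line(xs, ys):
    """Least-squares estimates (A, B) for the one-predictor model Y = A + B*X."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    a = (sum_y - b * sum_x) / n
    return a, b


# Hypothetical data, invented for illustration only.
rainfall_mm = [600, 750, 800, 900, 1000]
yield_per_hectare = [12, 15, 17, 18, 22]

a, b = fit_line(rainfall_mm, yield_per_hectare)
r = pearson_r(rainfall_mm, yield_per_hectare)
print(f"Y = {a:.2f} + {b:.4f} * X, r = {r:.3f}")
```

A value of r close to +1 or -1 indicates a strong linear relationship, and the sign gives its direction. With more than one predictor, a statistical package would normally be used to estimate the coefficients.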

There is also what we call a conceptual model. A conceptual model is composed of a pattern of interrelated concepts that is not expressed in mathematical form and is not primarily concerned with quantification. Maps, graphs, charts, balance sheets, circuit diagrams, and flowcharts are often used to represent such models.

Law
Another word associated with concepts, theories and models is law. A law in research is a precise statement of a relationship among facts that has been repeatedly corroborated by scientific investigation and is generally accepted as accurate by experts in the field. A law is frequently referred to as a universal and predictive statement. It is universal in the sense that the stated relationship is held always to occur under the specified conditions, although the conditions may be predicted to follow. Laws are generally derived from a theory.

1.5. Hypothesis Formulation in Quantitative Methods
In quantitative research, investigators use research questions or hypotheses to shape and specifically focus the purpose of the study. Research questions are interrogative statements or questions that the investigator seeks to answer. Hypotheses, on the other hand, are propositions or predictions, or sets of propositions, set forth as an explanation for the occurrence of some specified group of phenomena, either asserted merely as a provisional conjecture to guide investigation or accepted as highly probable in the light of established facts. They are predictions that the researcher holds about the relationships among variables. They are numeric estimates of population values based on the data collected from samples. A hypothesis requires more work by the researcher in order to either confirm or disprove it. In due course, a confirmed hypothesis may become part of a theory or occasionally may grow to become a theory itself. Testing of hypotheses requires statistical procedures in which the investigator draws inferences about the population from a study sample. Techniques of hypothesis testing will be discussed in Unit 8 of this teaching material.

Cresswell (2003) forwards the following guidelines for writing good quantitative research questions and hypotheses:
1. One of the three basic approaches should be followed: (a) comparing the variables, that is, the impact of an independent variable on a dependent variable; (b) relating the variables; and (c) describing the variables.

2. Specify the hypothesis.
3. Use only research questions or hypotheses, not both, to eliminate redundancy.
4. You can use a hypothesis either in null or alternative form. The null hypothesis (H0) is a statistical hypothesis that states there are no differences between observed and expected data. The alternative hypothesis (H1) makes a prediction about the outcome for the population of the study. It can be directional (making use of words like higher, better, less, etc.) or non-directional, the latter being formulated when a researcher does not know what can be predicted from past literature.

1.6. Merits and Demerits of Quantitative Methods
Merits:
- Examines the relationships between and among variables critically
- Answers research questions through surveys and experiments
- Provides measures or observations to test theories and hypotheses
- Leads to meaningful interpretation of quantitative data
- Provides more empirical data analysis techniques than the qualitative ones
- Seems a more valid and reliable method
- Relatively more free of the motivations, opinions, feelings, and attitudes of individuals who are carrying out the research and also of those individuals participating in the research

Demerits:
- Collects a much narrower and sometimes superficial dataset
- Results are limited as they provide numerical descriptions rather than detailed narrative and generally provide less elaborate accounts of human perception
- Often carried out in an unnatural, artificial environment so that a level of control can be applied to the exercise
- Preset answers will not necessarily reflect how people really feel about a subject and in some cases might just be the closest match
- Overlooks the motivations, opinions, feelings, and attitudes of individuals who are carrying out the research and also of those individuals participating in the research

1.7. Quantitative vs Qualitative Methods
In most social sciences the use of quantitative or qualitative methods has become a matter of controversy and even ideology, with particular schools of thought within each discipline favoring one type of method and pouring scorn on the other. Advocates of quantitative methods argue that only by using such methods can the social sciences become truly scientific. Advocates of qualitative methods, on the other hand, argue that quantitative methods tend to obscure the reality of the social phenomena under study because they underestimate or neglect the non-measurable factors, which may even be the most important ones. The modern tendency (and in reality the majority tendency throughout the history of social science) is to use eclectic approaches. Quantitative methods might be used with a global qualitative frame. Qualitative methods might be used to understand the meaning of the numbers produced by quantitative methods. Using quantitative methods, it is possible to give precise and testable expression to qualitative ideas. This combination of quantitative and qualitative data gathering is often referred to as mixed-methods research. Quantitative methods have been available to social and human scientists for years, while qualitative methods have emerged primarily during the last three or four decades. The mixed method is new and still developing in form and substance.

1.8. Descriptive vs Inferential Statistics
Statistics as a subject is broken into two branches: Descriptive Statistics and Inferential Statistics. Descriptive statistics includes collecting, organizing, summarizing, and presenting data. Inferential statistics involves making inferences through hypothesis testing, determining relationships, and making predictions from the analyzed data.

Descriptive statistics are used to describe the basic features of the data in a study. They provide simple summaries about the sample and the measures. Together with simple graphic analysis, they form the basis of virtually every quantitative analysis of data. Descriptive statistics are typically distinguished from inferential statistics. With descriptive statistics you are simply describing what is or what the data shows. With inferential statistics, you are trying to reach conclusions that extend beyond the immediate data alone. For instance, we use inferential statistics to try to infer from the sample data what the population might think. Or, we use inferential statistics to make judgments of the probability that an observed difference between groups is a dependable one or one that might have happened by chance in this study. Thus, we use inferential statistics to make inferences from our data to more general conditions; we use descriptive statistics simply to describe what's going on in our data.

For example, let's say we have data on the incomes of 20 instructors at Adama University. This data can be summarized by finding the average income of those 20 instructors, and we could describe how far each income is above or below the average. We could also go into software such as Excel or SPSS and construct a table with this data in it, maybe a frequency distribution of the number or proportion of the instructors in each class or range, or make a pie chart or bar chart. This is descriptive statistics! Now, if this group is representative of the whole university, we could then estimate and test various hypotheses that extend these 20 instructors' average income to the university as a whole. We are now inferring, so this would be inferential statistics. These conclusions will be subject to some error, and we could even quantify this probability of error.
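The instructors' income example can be sketched in a few lines of code. This is only an illustration under stated assumptions: the 20 income figures are hypothetical (the text gives no actual data), the class width of 1000 is an arbitrary choice, and the inferential step uses a 95% confidence interval based on the t distribution, with the critical value 2.093 for 19 degrees of freedom hard-coded for a sample of exactly 20.

```python
import math
import statistics
from collections import Counter

# Hypothetical monthly incomes of 20 instructors, invented for illustration.
incomes = [3200, 3500, 4100, 2900, 3800, 4500, 3300, 3600, 4000, 3700,
           5200, 3100, 3400, 4300, 3900, 4800, 3000, 3550, 4150, 3650]


def describe(values, class_width=1000):
    """Descriptive step: mean, deviation of each value from the mean,
    and a frequency distribution keyed by the lower bound of each class."""
    mean = statistics.mean(values)
    deviations = [v - mean for v in values]
    classes = Counter((v // class_width) * class_width for v in values)
    return mean, deviations, dict(sorted(classes.items()))


def confidence_interval_95(values, t_critical=2.093):
    """Inferential step: approximate 95% interval for the population mean.
    The default t_critical is only valid for n = 20 (19 degrees of freedom)."""
    mean = statistics.mean(values)
    standard_error = statistics.stdev(values) / math.sqrt(len(values))
    return mean - t_critical * standard_error, mean + t_critical * standard_error


mean, deviations, classes = describe(incomes)
low, high = confidence_interval_95(incomes)

print("Sample mean:", mean)
print("Frequency distribution:", classes)
print(f"95% interval for the population mean: {low:.0f} to {high:.0f}")
```

The first function only describes the 20 observations. The second makes a statement about all instructors at the university, which is justified only if the 20 are a representative sample; the width of the interval is one way of quantifying the error mentioned above.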

Unit Two Methods of Quantitative Data Collection
Unit objectives
Having studied this unit, you should be able to:
- Understand the rationale for sampling
- Explain merits and demerits of sampling
- Define key terms in sampling such as samples, population/universe, sampling error and sampling frame
- Determine appropriate sample size for any project work such as senior essay
- Describe different sampling processes and techniques
- Appreciate the use of sampling in geographic research

2.1. Introduction
Quantitative research usually employs surveying or measuring to collect data. It is research that depends upon quantities or quantifying variables. In order to collect the quantified data, this research approach usually carries out at least one of the sampling techniques which will be discussed in detail in this unit. Sampling is a statistical practice concerned with the selection of some representatives from a population or universe, intended to yield some generalized characteristics of the concerned population while minimizing the computational work. Socioeconomic variables such as yield/hectare, daily income, family size, landholding, number of oxen per rural household, origin of immigrants and reason of migration, rainfall and any other related data of a very large entity can be obtained through sampling and be used for certain statistical inferences. Let's begin by covering some of the key terms in sampling like population, samples, sampling frame, sampling error, non-sampling error, parameter and statistic.

2.2. Population/Universe
In statistics, the term population/universe is used in a different sense from its literary sense. A population is any entire collection of people, animals, plants or things from which we may select sample data. It is the entire group we are interested in, which we wish to describe or draw conclusions about. In order to make any generalizations about a population, a sample that is meant to be representative of the population is often studied. Before selecting the samples, it is important that the researcher carefully and completely defines the population, including a description of the members to be included.

2.3. Samples
A sample is a group of items selected from a larger group (the population/universe) for any statistical analysis. By studying the sample it is hoped to draw valid conclusions about the larger group. A sample is generally selected for study because the population may be too large to study in its entirety, or it may be too costly and time consuming to deal with each and every member of the population under consideration. It is assumed that the sample is as close to a perfect representative of the general population as possible. For example, if you want to study the food security status of farm households in a woreda having 1200 farm households, you may contact only 8% of the total, which is only 96. Here the 1200 farm households in the woreda are said to be the population and the 96 are your sample. A sample statistic gives information about the corresponding population parameter. For instance, it is assumed that a sample mean for a set of data would give information about the overall population mean.

At this juncture, you may raise questions like: What is the advantage of sampling? Why shouldn't we study the whole population rather than limiting ourselves to the data that we obtain from a certain proportion of the overall population? You will get answers to these questions after you complete discussing sampling with your instructor in this unit.
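The woreda example above (1200 farm households, 8% sampled) can be sketched as follows. This is a minimal illustration that assumes simple random sampling, which is only one of the techniques a researcher might choose; the household identifiers 1 to 1200, the function name and the seed value are hypothetical.

```python
import random


def draw_simple_random_sample(frame, percent, seed=None):
    """Select `percent` percent of the units in `frame` without replacement.
    Integer arithmetic is used so that 8% of 1200 is exactly 96."""
    sample_size = len(frame) * percent // 100
    rng = random.Random(seed)  # a fixed seed makes the draw reproducible
    return rng.sample(frame, sample_size)


# Hypothetical list of household identifiers for the 1200 farm households.
households = list(range(1, 1201))

sample = draw_simple_random_sample(households, percent=8, seed=2009)
print(len(sample))  # 96
```

Every household in the list has the same chance of being selected, and no household is selected twice.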

2.4. Parameter and Statistic
A parameter is a value, usually unknown (and which therefore has to be estimated), used to represent a certain population characteristic. For example, the population mean is a parameter that is often used to indicate the average conditions of the population pertaining to different attributes. For a population, the specific parameter is a fixed value which does not vary. But it is normal that each sample set drawn from the population will have its own value, which typically differs slightly from the values of other sample sets.

A statistic is a figure that is calculated based on sample data. It is used to give information about unknown values in the corresponding population. For example, the mean of the data in a sample is used to give information about the overall average in the population from which that sample was drawn. The mean, median, variance, standard deviation and skewness of a sample set of data are termed statistics. Parameters are often denoted by Greek letters (e.g. µ and σ), whereas statistics are denoted by Roman letters (e.g. m and s).

2.5. Sampling Frame
A sampling frame is the actual set of units from which a sample has been drawn. In other words, it is the listing or display of all the accessible population from which the researcher draws his/her samples. Consider, for example, a survey aimed at studying the position of assets of rural households in a woreda. In this example, the population of interest is all rural households of the woreda, and a possible sampling frame may be a list or display of all rural households in the woreda under investigation. In defining the frame, practical, economic, ethical, spatial and technical issues need to be addressed. The need to obtain timely results may prevent extending the frame far into the future. The difficulties can be extreme when the population and frame are disjoint. This may cause particular problems in forecasting, where inferences are made about the future.
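The ideas of parameter, statistic and sampling frame can be illustrated together with a small simulation. This is a hedged sketch, not a prescribed procedure: the population is 5000 artificial values generated by the program (assumed normally distributed with mean 50 and standard deviation 12), the list itself plays the role of the sampling frame, and all names and numbers are hypothetical.

```python
import random
import statistics


def mean_absolute_error_of_sample_mean(population, sample_size, repetitions, rng):
    """Draw `repetitions` samples of `sample_size` units without replacement
    and return the average absolute gap between the sample mean (a statistic)
    and the population mean (the parameter)."""
    mu = statistics.mean(population)
    gaps = []
    for _ in range(repetitions):
        sample = rng.sample(population, sample_size)
        gaps.append(abs(statistics.mean(sample) - mu))
    return statistics.mean(gaps)


rng = random.Random(2009)

# Hypothetical population; the list doubles as the sampling frame.
population = [rng.gauss(50, 12) for _ in range(5000)]
mu = statistics.mean(population)  # the parameter: one fixed value

print(f"Population mean (parameter): {mu:.2f}")
for n in (10, 100, 400):
    gap = mean_absolute_error_of_sample_mean(population, n, 200, rng)
    print(f"n = {n:3d}: average gap between sample mean and parameter = {gap:.2f}")
```

The parameter is computed once and never changes, whereas each sample gives its own mean. Running the sketch shows the gap between the two shrinking as the sample size grows.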

2.6. Errors in Sampling
A researcher may commit at least two types of errors during sampling. S/he should, therefore, take great care at the sampling stage of her/his research so as to avoid sampling errors and wrongly drawn samples, which may lead to wrong conclusions. The two major errors in sampling are known as sampling error and non-sampling (measurement) error.

2.6.1. Sampling Error
Sampling errors may happen simply because of sampling itself or because of a bias towards certain units. Sampling error causes the discrepancy between population parameters and sample statistics. The discrepancy generally decreases as the sample size increases and becomes negligible when the sample is large enough. There are two basic causes of sampling error.

One is the error that occurs just because of chance. Some literature calls this bad chance. It may result in untypical choices. Unusual units (extremely small or large units) do exist in a population, and there is always a possibility that an abnormally large or small number of them will be chosen. For example, when collecting data at woreda level for your BA thesis, you may unluckily select all the well-to-do farm households in the woreda, which have the highest crop yield per year. You may select your samples randomly and yet all the rich households in the whole population may be selected, making the sample average far higher than it should be. Here you may raise a question: how can I protect my research against such bad chance during field work? The main protection against this kind of error is to use a sample that is large enough. Hence a sample of optimum size must be obtained for a study. What is the optimum sample size or proportion? You will see this in the discussions hereinafter.

The second cause of sampling error is sampling bias. Sampling bias is a tendency to favor the selection of units/items that have particular characteristics. It is usually the result of a poor sampling plan. The most notable is the bias of non-response, when for some reason some units have no chance of appearing in the sample. A means of selecting the units of analysis must be designed to avoid the most obvious forms of bias. For example, when you would like to know the average income of the residents of a town, you may decide to use mobile telephone numbers to select a sample from the total population in a locality where only the well-to-do households (in the Ethiopian case) own mobile telephones. You will then end up with a high average income, which will lead to wrong findings, conclusions and recommendations. Therefore, you must be very careful to select research samples that are free of any bias.

2.6.2. Non-sampling error (Measurement error)
The other main cause of unrepresentative samples is what is known as non-sampling error. This type of error can occur whether a census or a sample is being used. A non-sampling error is an error that results solely from the manner in which the observations are made. Like sampling error, non-sampling error may either be produced by participants in the statistical study or be an innocent by-product of the sampling plans and procedures.

The simplest example of non-sampling error is an inaccurate measurement due to malfunctioning instruments or poor procedures. Errors due to inaccurate measurement may happen innocently but can be very devastating, leading to extremely wrong research findings. Related to this, a story is told of a French astronomer who once proposed a new theory based on spectroscopic measurements of light emitted by a particular star. When his colleagues discovered that the measuring instrument had been contaminated by cigarette smoke, they rejected his findings. Therefore, we have to take care of the accuracy and proper functioning of our instruments, such as altimeter, weighing machine and GPS, while collecting any primary socioeconomic data for our research or any project work.

For example, in the case of crop yield data of farm households in Ethiopia, if persons are asked to state their annual production, they may answer in terms of traditional measuring tools like qunna or enqib, which may vary in size from household to household or from place to place. As a result, two answers will not be of equal reliability. Responses, therefore, will not be of comparable validity unless the information from all persons is measured with the same measuring tools and under the same circumstances.
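The claim in section 2.6.1 that chance error shrinks as the sample grows, while bias does not, can be checked with a small simulation. The sketch below is only an illustration under stated assumptions: the income figures are synthetic (drawn from a log-normal distribution), the "mobile phone" frame is assumed to be the richest 30 percent of households, and the function name `mean_abs_error` is hypothetical.

```python
import random
import statistics


def mean_abs_error(population, n, draws, rng, frame=None):
    """Average |sample mean - population mean| over repeated random samples of size n.

    If `frame` is given, samples are drawn from that (possibly biased) frame
    instead of the whole population.
    """
    true_mean = statistics.fmean(population)
    source = population if frame is None else frame
    errors = []
    for _ in range(draws):
        sample = rng.sample(source, n)  # sampling without replacement
        errors.append(abs(statistics.fmean(sample) - true_mean))
    return statistics.fmean(errors)


rng = random.Random(42)  # fixed seed so the run is repeatable

# Assumption: 5000 synthetic household incomes, not real survey data.
population = [rng.lognormvariate(7.0, 0.8) for _ in range(5000)]

# Assumption: only the richest 30% of households own a mobile phone.
phone_owner_frame = sorted(population)[-1500:]

chance_small = mean_abs_error(population, 25, 300, rng)
chance_large = mean_abs_error(population, 400, 300, rng)
bias_large = mean_abs_error(population, 400, 300, rng, frame=phone_owner_frame)

print(f"chance error, n=25 : {chance_small:.1f}")
print(f"chance error, n=400: {chance_large:.1f}")
print(f"biased frame, n=400: {bias_large:.1f}")
```

In this sketch the error from the unbiased frame falls as n rises from 25 to 400, whereas the error from the biased frame stays large even at n = 400, because a larger sample from the wrong frame only estimates the wrong average more precisely.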

There are also other significant factors which result in errors and reduce the quality of data. These are:

1. Interviewer's effect: No two interviewers are alike, and the same person may provide different answers to different interviewers. The manner in which a question is formulated can also result in inaccurate responses. Individuals tend to provide false answers to particular questions. For example, some people want to appear younger or older for reasons known to them. If you ask such a person her/his age in years, it is easy for the individual to lie to you by overstating or understating her/his age by some years. But if you ask in which year s/he was born, s/he may well give a more accurate figure, since it requires a bit of quick arithmetic to give a false date.

2. Respondent's effect: Respondents might also give incorrect answers to impress the interviewer. This type of error is the most difficult to prevent because it results from outright deceit (dishonesty) on the part of the respondents or interviewees. An example of this is what I witnessed during my MA thesis study (2001), in which I was asking farmers how much crop they had harvested in the previous year. In most cases, the household heads tended to lie by reporting only a very small amount of yield per year. I then tried my best to convince them so that they would tell me somewhat correct figures.

3. Knowing the study purpose: This is the case when the respondents give wrong data solely because they are aware of why a study is being conducted. A good example is a question on income. If a government agency is asking, a different figure may be provided than the respondents would give to some neutral or purely academic researcher. One way to guard against such bias is to make the questions very specific, allowing no room for personal interpretation. For example, the question "Are you employed?" could be followed by "What is your salary?", "Do you have any extra income?", "If yes, how much is it?" and so on. A sequence of such questions may produce more accurate information than directly asking a question like "What is your monthly income?"

Error and cost seem to compete in sampling. This is because reducing error often requires an increased expenditure of resources such as time, finance and human power. Of the two types of statistical errors, only sampling error can be controlled by exercising care in determining the appropriate method for choosing the sample.

The above discussion has shown that sampling error may be due to either bias or chance. The chance component (sometimes called random error) exists no matter how carefully the selection procedures are implemented, and the only way to minimize chance sampling errors is to select an adequately large sample. Sampling bias, on the other hand, may be minimized by the wise choice of a sampling procedure.

2.7. Stages in Sampling Process
Most literature recommends the following stages when sampling to collect primary data:

1. Defining the concerned population
2. Specifying a sampling frame, i.e. a set of items or events possible to measure
3. Specifying a sampling method for selecting items or events from the frame
4. Determining the sample size
5. Pilot testing the questionnaire
6. Administering the questionnaire or collecting the data

2.8. Sampling Techniques
Sampling methods are classified into probability and non-probability sampling techniques. In probability sampling, each member of the population has a known non-zero probability of being selected. Probability methods include random sampling, systematic sampling and stratified sampling. The advantage of probability sampling is that all the items in the population have a chance of being selected. In non-probability sampling, on the other hand, members are selected from the population in some non-random manner. These methods include convenience sampling, judgment sampling, quota sampling and snowball sampling.

2.8.1. Simple Random Sampling
In simple random sampling, every member of the population has an equal and independent chance of being selected. The selection of one observation does not affect the opportunity of other observations to be selected. Each element of the sampling frame has an equal probability of selection if the frame is not subdivided or partitioned. One disadvantage of this method is that all members of the population have to be available for selection.
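As a minimal sketch of simple random sampling, the snippet below draws households from a numbered sampling frame with Python's standard `random` module. The frame of 2000 serial numbers, the sample size of 100 and the helper name `simple_random_sample` are assumptions made for illustration only.

```python
import random


def simple_random_sample(frame, n, seed=None):
    """Draw n distinct units from the sampling frame, each with equal chance."""
    rng = random.Random(seed)    # a seed makes the draw repeatable
    return rng.sample(frame, n)  # selection without replacement


# Assumption: a frame of 2000 households identified by serial numbers 1..2000.
households = list(range(1, 2001))
sample = simple_random_sample(households, 100, seed=7)

# Every household has the same inclusion probability n / N.
inclusion_probability = len(sample) / len(households)
print(sorted(sample)[:10], inclusion_probability)
```

Note that the whole frame must be listed before the draw, which is the practical side of the disadvantage mentioned above: every member of the population has to be available for selection.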

Figure 2.1: Classification of Sampling Methods
Probability sampling (usually for quantitative methods): simple random sampling, systematic sampling, stratified random sampling, cluster sampling, multistage sampling.
Non-probability sampling (usually for qualitative methods): convenience sampling, judgment/purposive sampling, quota sampling, snowball sampling, volunteer sampling.

2.8.2. Systematic Sampling
Systematic sampling is also called an ith item selection technique. After the required sample size has been calculated, every ith record is selected from a list of population members, where the interval i is the population size divided by the sample size:

$$ i = \frac{\text{population size } (N)}{\text{sample size } (n)} $$

Selecting, say, every 10th name from the telephone directory is called an every 10th sample, which is an example of systematic sampling. It is a type of probability sampling provided that the starting point is chosen at random and the ordering of the directory is unrelated to the characteristic under study. It is easy to implement and the stratification induced can make it efficient, but it is especially vulnerable to periodicities in the list. If periodicity is present and the period is a multiple of 10, then bias will result. It is important that the first name chosen is not simply the first in the list, but is chosen to be (say) the 7th, where 7 is a random integer.

Example: Let us assume that you want to study certain characteristics of rural households in Y woreda. Let the total number of farm households in the woreda be 2000, serially numbered from 1 to 2000 in alphabetical order. Firstly, you have to decide your sample size based on the sample size criteria discussed in this unit. Let it be 100. Divide the population (2000) by the sample size (100): 2000 divided by 100 gives 20, which fixes the positions of the sample items, i.e. every 20th item counted from a first item identified randomly. Now draw a random number from 1 to 20. Let the number 13 be selected randomly. Then select your samples as follows:

Calculation          Serial No. of the sample household in the list
13                   13th
13 + 20 = 33         33rd
13 + 2*20 = 53       53rd
13 + 3*20 = 73       73rd
13 + 4*20 = 93       93rd
13 + 5*20 = 113      113th
13 + 6*20 = 133      133rd
Continue till you select the predetermined 100 samples.

2.8.3. Stratified Random Sampling
Stratified sampling is one of the most commonly used probability sampling methods and is often superior to simple random sampling because it can reduce sampling error. A stratum is a subset of the population that shares at least one common characteristic. Examples of strata might be males and females, educated and uneducated, rural and urban and so on. Where the population embraces a number of distinct categories, the frame can be organized by these categories into separate strata. A sample is then selected from each stratum separately, producing a stratified sample.

The researcher first identifies the relevant strata and their actual representation in the population. Random sampling is then used to select a sufficient number of subjects/items from each stratum. Sufficient here refers to a sample size large enough for the researcher to be reasonably confident that the sample from the stratum represents that stratum in the population. Stratified sampling is often used when one or more of the strata in the population have a low incidence relative to the other strata. At this stage attention should be given to the appropriate proportional representation of the items in each stratum based on certain characteristics, such as total size of population and area.
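The two selection rules just described, the every-ith-item rule of section 2.8.2 and proportional representation of strata in section 2.8.3, can be sketched in a few lines. This is a hedged illustration rather than a prescribed procedure: the function names are hypothetical, the rural/urban stratum sizes are invented for the example, and the rounding rule (largest remainder) is one reasonable choice among several.

```python
import random


def systematic_sample(population_size, sample_size, start=None, seed=None):
    """Return the serial numbers picked by every-ith-item selection."""
    interval = population_size // sample_size  # i = N / n
    if start is None:
        start = random.Random(seed).randint(1, interval)  # random start in 1..i
    if not 1 <= start <= interval:
        raise ValueError("start must lie between 1 and the sampling interval")
    return [start + k * interval for k in range(sample_size)]


def proportional_allocation(stratum_sizes, sample_size):
    """Split the sample across strata in proportion to stratum size."""
    total = sum(stratum_sizes.values())
    exact = {name: sample_size * size / total for name, size in stratum_sizes.items()}
    allocation = {name: int(share) for name, share in exact.items()}
    # Hand out any units lost to rounding, largest fractional part first.
    shortfall = sample_size - sum(allocation.values())
    by_remainder = sorted(exact, key=lambda s: exact[s] - allocation[s], reverse=True)
    for name in by_remainder[:shortfall]:
        allocation[name] += 1
    return allocation


# The worked example from the text: N = 2000, n = 100, random start 13.
positions = systematic_sample(2000, 100, start=13)
print(positions[:7])

# Assumption: hypothetical strata of 1400 rural and 600 urban households.
allocation = proportional_allocation({"rural": 1400, "urban": 600}, 100)
print(allocation)
```

With a start of 13 the first selected positions follow the calculation table of the systematic sampling example, and the last position (13 + 99*20 = 1993) still falls inside the list of 2000 households.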

The two main reasons for using a stratified sampling design are (1) to ensure that every group within a population is adequately represented in the sample, and (2) to improve efficiency by gaining greater control over the composition of the sample, since the sample size is usually proportional to the relative size of the strata.

In multistage stratification, a first attribute, say the gender of the household heads, divides the universe/population into two groups. At the second stage, monthly family income can be introduced to form sub-classes. The population can be further sub-divided by introducing other characteristics one by one. In other cases, the study area can be divided into agro-climatic zones, which can be further divided by woredas, then kebeles, villages and so on.

Example: Let us assume that you want to study certain characteristics of urban households in one of the towns in Ethiopia. Firstly, you can divide all the urban households into a number of homogeneous groups, called strata, on the basis of certain characteristics of the households. For example, they may be divided into homogeneous groups according to sex of household head, age, family income or any other available information. In fact, at this stage we need pre-documented information about each household. Finally, from each homogeneous group (stratum) you can randomly select the required number of sample households and distribute your questionnaire or administer an interview.

2.8.4. Cluster or Area Sampling
Cluster sampling is an example of two-stage or multistage sampling: in the first stage a sample of clusters is chosen, while in the second stage a sample of respondents within those clusters is selected. This can reduce travel and other administrative costs. It also means that one does not need a sampling frame for the entire population, but only for the selected clusters. Cluster sampling generally increases the variability of sample estimates above that of simple random sampling, depending on how much the clusters differ between themselves as compared with the within-cluster variation.

Example: Let us assume that we want to know the amount of crop production in a woreda during a specific crop year. It may be very costly to communicate with each and every farming household in the woreda. Then we can divide the woreda into smaller segments (e.g. kebeles), known as primary units in statistics. Now, we can randomly or purposely take a reasonable and representative number of the segments (primary units) and communicate with each household in the selected segments.

2.8.5. Multistage Sampling
Multistage sampling is a complex form of cluster sampling. Using all the sample elements in all the selected clusters may be prohibitively expensive or unnecessary. Under these circumstances, multistage cluster sampling becomes useful. Instead of using all the items contained in the selected clusters, the researcher may randomly select items from each cluster. Constructing the clusters is the first stage. Deciding what items within the cluster to use is the second stage. The technique is used frequently when a complete list of all members of the population does not exist or is inappropriate.

Example: In the example given for cluster sampling above, if the selected clusters (kebeles in the woreda) are homogeneous, or if all the elements of a selected cluster give the same response, it is advisable not to study the whole cluster or kebele, as this increases the cost and time of the study. Hence, if the cluster (kebele) is homogeneous and each unit in the population is divisible into a number of smaller units (e.g. sub-kebeles and then villages), multistage sampling is recommended. For example, firstly we can randomly/purposely take a reasonable number of kebeles from the woreda; the kebeles are now called 1st stage units. Then, we can randomly select any reasonable number of sub-kebeles (2nd stage units), then villages (3rd stage units), and finally we take an appropriate number of farm households from the selected villages. The sample households are now termed 4th stage units.

2.8.6. Quota sampling
Quota sampling is one of the non-probability sampling techniques. In quota sampling, the population is first segmented into mutually exclusive sub-groups, just as in stratified sampling. Then judgment is used to select the subjects or units from each segment based on a specified proportion. For example, an interviewer may be told to sample 200 females and 300 males between the ages of 45 and 60. In quota sampling the selection of the sample is non-random. For example, interviewers might be tempted to interview those individuals who look most helpful to them. The problem is that these samples may be biased because not everyone gets a chance of being selected. This non-random element is its greatest weakness, and quota versus probability sampling has been a matter of controversy for many years.

2.8.7. Convenience sampling
Convenience sampling, sometimes called opportunity sampling, is used in exploratory research where the researcher is interested in getting an inexpensive approximation of the truth. As the name implies, the samples are selected because they are convenient to the researcher. This non-probability method is often used during preliminary research efforts to get a general estimate of the results without incurring the cost or time required to select a random sample.

2.8.8. Judgment/Purposive sampling
Judgment sampling is one of the commonest non-probability sampling techniques. The researcher selects the sample based on his/her own judgment. This is usually an extension of the convenience sampling technique. For instance, a researcher who wants to study the status of urban poverty in Ethiopia may decide to draw the entire sample from one representative city in the country. For example, Addis Ababa may be purposively selected for such a study even though the population includes all cities/towns in Ethiopia such as Adama, Bahr Dar, Mekele, Dessie, Gonder, Nekemte and Awasa. Note that when using this method, the researcher must be confident that the chosen sample is truly representative of the entire population.

2.8.9. Snowball sampling
Snowball sampling is a special non-probability sampling technique used when the desired sample characteristic is rare or little known to the researcher. It may be extremely difficult or cost prohibitive to locate respondents in these situations. Snowball sampling relies on referrals from initial subjects to generate additional subjects. While this technique can dramatically lower search costs, it comes at the expense of introducing bias, because the technique itself reduces the likelihood that the sample will represent a good cross-section of the population. For instance, if you want to select chat-addicted individuals from the total residents of a kebele, your initial contact person should be a person who can effectively help you to reach one or more such individuals (other required chat-addicted individuals or samples) in the kebele. As you begin to reach other such samples, the importance of the initially contacted person becomes negligible or melts like snow, which is the reason for the term snowball.
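The referral mechanism behind snowball sampling can be pictured as following a chain of contacts outward from the initial subjects. The sketch below is a simplified illustration under assumptions: the referral network is a hypothetical dictionary of made-up labels, and the breadth-first order in which referrals are followed is a modelling choice, not part of the technique's definition.

```python
from collections import deque


def snowball_sample(referrals, seeds, target_size):
    """Follow referrals from the initial contacts until target_size subjects are found.

    `referrals` maps each subject to the people that subject can introduce.
    """
    selected = []
    seen = set()
    queue = deque(seeds)
    while queue and len(selected) < target_size:
        person = queue.popleft()
        if person in seen:
            continue  # already recruited through another referral
        seen.add(person)
        selected.append(person)
        queue.extend(referrals.get(person, []))  # ask this subject for new contacts
    return selected


# Assumption: a hypothetical referral network; "A" is the initial contact person.
referrals = {
    "A": ["B", "C"],
    "B": ["D"],
    "C": ["D", "E"],
    "D": [],
    "E": ["F"],
}
sample = snowball_sample(referrals, seeds=["A"], target_size=5)
print(sample)
```

Everyone in the resulting sample is reachable from the initial contact, which is exactly why the method lowers search costs and also why it tends to miss people outside that contact's network.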

2.8.10. Volunteer Sampling
Volunteer sampling is a sampling technique in which only volunteers are included in the sample. For instance, this may be the method through which Ethiopian television and radio stations contact and interview experts on current political, economic and social affairs of the country.

2.8.11. Spatial or Grid Sampling
This is a form of cluster sampling, a cluster being an individual area of a grid and hence consisting of groups of basic grid cells arranged in some standard geometric pattern. In grid sampling we divide the field into squares or rectangles of equal size, usually referred to as grid cells. The location of each grid cell is usually geo-referenced using global positioning system technology. Samples, such as plant, soil and rock samples, are then collected from each of the grid cells.

2.9. Sample Size
One of the most challenging and most frequently asked questions concerning sampling in research is the question of sample size: how large a sample is required to infer research findings back to a population? The answer to this question is influenced by a number of factors, including the purpose of the study, population size, budget, time and the allowable sampling error. Moreover, the three most important criteria that need to be specified to determine the appropriate sample size are the level of precision, the level of confidence or risk, and the degree of variability in the attributes being measured.

Finally, regarding sample size you should note that an adequate sample size with high quality data collection efforts may result in more reliable, valid and generalizable results than studies conducted with the entire population or census data. In case of inadequate sample size (if any), the researchers should report both the statistically appropriate sample size and the sample size actually used in the study. This allows the reader to make her/his own judgment as to whether s/he accepts the researcher's assumptions and procedures.

Level of precision
The level of precision, sometimes called sampling error, is the range in which the true value of the population is estimated to be. This range is often expressed in percentage points (e.g. ±5 percent). For instance, if a researcher finds that 60% of the farmers in the sample have adopted a recommended practice with a precision rate of ±5%, then s/he can conclude that between 55% and 65% of the farmers in the population have adopted the practice.

Confidence level
The confidence or risk level is based on the ideas encompassed under the Central Limit Theorem. The key idea encompassed in the Central Limit Theorem is that when a population is repeatedly sampled, the average value of the attribute obtained by those samples is equal to the true population value. Furthermore, the values obtained by these samples are distributed normally about the true value, with some samples having a higher value and some obtaining a lower value than the true population value. In a normal distribution, approximately 95% of the sample values are within two standard deviations of the true population value (e.g. mean). In other words, if a 95% confidence level is selected, 95 out of 100 samples will have the true population value within the specified range of precision. There is always a chance that the sample you obtain does not represent the true population value.

Degree of Variability
The third criterion, the degree of variability in the attributes being measured, refers to the distribution of attributes in the population. The more heterogeneous a population, the larger the sample size required to obtain a given level of precision.

2.10. Strategies for Determining Size of Respondents
Determination of the sample size to be selected from a population is one of the pivotal issues that arise during research. Here we should bear in mind that determining sample size is a very important issue, because samples that are too large may waste time, resources and money, while samples that are too small may lead to inaccurate results. In fact, the sample size depends on several factors: the purpose of the study, the type of population (varied or identical), the resources allotted to the study in terms of time and money, the availability of equipment and technical people, and the level of precision required.

There are several approaches to determining the number of respondents. These include using a census for small populations, imitating the sample size of similar studies, using published tables, and applying formulas to calculate a sample size. Each strategy is discussed briefly below.

Using a Census for Small Populations
One approach is a census survey, in which we consider the entire population as respondents. A census survey also eliminates sampling error and provides data on all the individuals in the population. It is very difficult, and sometimes even impossible, to carry out a census survey for a large population, as it consumes time and money. However, for a small population (e.g. 200 or less), a census yields more precise research output than sampling does, as it touches each and every member of the population under study.

Using a Sample Size of a Similar Study
Another approach is to use the same sample size as other methodologically sound studies similar to the one you plan to carry out. In this case you must be very careful: you run the risk of repeating errors that were made in determining the sample size for the other study if you fail to review deeply the procedures employed in the earlier studies. However, a review of the literature in your discipline can provide guidance about the typical sample sizes which are used.

Sample Size Determination Table
This is a table created by scholars in this field of study to determine the appropriate sample size for a research based on three alpha levels (among them α = 0.05 and α = 0.01) and margins of error.

Using Formulas to Calculate a Sample Size
Although tables can provide a useful guide for determining the sample size, you may need to calculate the necessary sample size for a different combination of levels of precision, confidence and variability. The fourth approach to determining sample size is the application of one of several formulas, such as:

$$ n = \frac{N}{1 + N e^{2}} $$

Where,
n = sample size
N = population size
e = level of precision
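As a worked illustration of the formula n = N / (1 + N e²), the sketch below computes the sample size for a population of 2000 households at a precision of ±5%. The population figure is borrowed from the systematic sampling example, the function name is hypothetical, and rounding the result up to the next whole unit is an assumption made here so that the required precision is not undershot.

```python
import math


def sample_size_finite(population_size, precision):
    """Sample size from n = N / (1 + N * e**2), rounded up to a whole unit."""
    n = population_size / (1 + population_size * precision ** 2)
    return math.ceil(n)


# N = 2000, e = 0.05: 2000 / (1 + 2000 * 0.0025) = 2000 / 6, i.e. about 333.3
print(sample_size_finite(2000, 0.05))

# A small population of 200 at the same precision: 200 / 1.5, i.e. about 133.3
print(sample_size_finite(200, 0.05))
```

The second call shows why a census is often preferred for small populations: at N = 200 the formula already asks for about two thirds of the population.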

Another formula to determine sample size may be

$$ n_0 = \frac{Z^{2}\sigma^{2}}{e^{2}} $$

Where,
n_0 = sample size
Z = the abscissa of the normal curve that cuts off an area α at the tails
e = the desired level of precision, in the same unit of measure as the variance
σ² = the variance of an attribute in the population

The disadvantage of a sample size based on the mean is that a good estimate of the population variance is necessary. Often, an estimate is not available. Furthermore, the sample size can vary widely from one attribute to another because each is likely to have a different variance.

The above mathematical formula can also be rewritten as follows to determine the required sample size with a specific confidence and margin of error:

$$ n = \left( \frac{z_{\alpha/2}\,\sigma}{E} \right)^{2} $$

Where,
n = sample size
E = the maximum difference between the sample and population means (margin of error)
z_{α/2} = critical value, obtained from the table of the probability distribution which the data follow
σ = population standard deviation

It is obvious that this formula can be used, if and only if the value of σ is known, to determine the sample size necessary to establish the population mean value within E with a confidence of 1 − α. It is still possible to use this formula if we are able to determine σ from a pilot test, similar studies or other related documents. The value of z_{α/2} for a 95% (α = 0.05) level of confidence is 1.96 in the table of the standard normal distribution. With the margin of error set at E = 1 and the standard deviation σ calculated from a pilot test or related studies/documents, it is then possible to determine the number of samples the study needs in order to be 95% sure that the sample mean is within 1 unit of the population mean.
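The calculation described above can be sketched as follows. The margin of error E = 1 and the 95% confidence level come from the text; the standard deviation σ = 5 is purely an assumed pilot-test value, the function name is hypothetical, and the critical value is taken from Python's standard `statistics.NormalDist` rather than from a printed table.

```python
import math
from statistics import NormalDist


def sample_size_for_mean(sigma, margin_of_error, confidence=0.95):
    """Sample size from n = (z_{alpha/2} * sigma / E) ** 2, rounded up."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # critical value, about 1.96 for 95%
    return math.ceil((z * sigma / margin_of_error) ** 2)


# Assumption: a pilot test suggested a standard deviation of 5 units.
z_95 = NormalDist().inv_cdf(0.975)
n_required = sample_size_for_mean(sigma=5, margin_of_error=1, confidence=0.95)
print(round(z_95, 2), n_required)
```

Because n grows with the square of σ / E, doubling the assumed standard deviation or halving the margin of error roughly quadruples the required sample, which is why a poor estimate of the variance is so costly.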

Project work
Let us assume that you are going to study the ethnic composition of the residents of your town based only on primary data. Here you can imagine that it is very expensive and time consuming to carry out a census survey for this specific study. As a result, you may be required to select samples for this study. Then, try to explain the type and method of sampling technique you are going to apply to answer the research questions and come up with the most appropriate research outcomes. What will be your appropriate sample size? What method will you use to decide the sample size, and what do you say about the adequate representation of the samples?

Unit Three Techniques of Quantitative Data Presentation
Unit objectives
Having studied this chapter, you should be able to:
- Explain briefly statistical data presentation tools
- Present any quantitative raw data in a condensed and easily communicable way

3.1. Introduction
The data collected from the field or obtained through measurements are usually presented in condensed form in tables, charts and graphs. The purpose of putting the data into tables, charts and graphs is threefold. Firstly, it is usually the best way to show the data to the audience, rather than immersing them in lots of numbers in the text, which may put people to sleep so that they grasp little information. Secondly, it enables the researcher to look at the data visually, see what happened and make interpretations. Thirdly, it allows the readers to easily understand the research findings and what the writer wants to communicate. Hence, it is highly recommended to present any quantitative data in condensed form in tables, charts and graphs.

All the data presentation tools, such as tables, charts and graphs, can be drawn by hand or on a computer. Software such as Microsoft Excel produces graphs and performs some statistical calculations. Statistical programs like SPSS, SYSTAT and SAS are higher-powered programs that perform many statistical analyses as well as producing high quality graphs. However, graphs may mislead unless carefully and precisely drawn. As a result, you should bear in mind the following points while producing tables, charts and graphs:
- Scales must be in regular intervals
- Graphs/charts that are compared must have the same scale
- You must be clear about what you are going to communicate
- Graphs, charts and tables must be easy to read
- You should know who your audience is
- Be sure whether or not the display tells the entire story of the specific issue

3.2. Nature of Geographic Data
Numerical data is the essence of quantitative methods. In order to understand the phenomena under study, the researcher has to find a means of expressing the variables to be measured using some form of numerical technique. For most practical purposes, data can be measured at four different levels. These four levels of measurement are known as nominal, ordinal, interval and ratio. Each level has a specific purpose and also has important implications for the type of analysis to be undertaken.

I. Nominal data: a set of data is said to be nominal if the values (observations) belonging to it can be assigned a code in the form of a number, where the numbers are simply labels. You can count but not order or measure nominal data. For example, in your research you can code males as 0 and females as 1. Similarly, the marital status of an individual could be coded as 2 if married and 3 if single.

II. Ordinal (rank) data is where the numerical value indicates something about relative rather than absolute position in a series, such as ranks. Individuals or responses belonging to a sub-category have a common characteristic, and the sub-categories are arranged in ascending or descending order. For instance, let four farmers, namely Chalchisa, Lamesa, Sardessa and Arfase, be the 1st, 2nd, 3rd and 4th crop producers in a rural kebele called Halelu-Chari during crop year 2007/8. Then, the numerical values 1, 2, 3 and 4 indicate not the amount of yield the four farmers produced in 2007/8 but the relative positions of the four farmers among all the farmers in Halelu-Chari Kebele, i.e. 1, 2, 3 and 4 here are ordinal data. Here you can see that the data are ordered but the differences between values are not important. Many statistical tests make use of this type of ranked data.

III. Interval data is measured on a scale where the distance between any two adjacent units of measurement (or intervals) is the same but the zero point is arbitrary. An interval scale has all the characteristics of an ordinal scale. In addition, an interval scale uses a unit of measurement that enables the individuals or responses to be placed at equally spaced intervals in relation to the spread of the variable. This scale has starting and terminating points, and the number of units/intervals between them is arbitrary and varies from scale to scale. Scores on an interval scale can be added and subtracted but cannot be meaningfully multiplied or divided.

Centigrade and Fahrenheit scales are examples of the interval scale. In the centigrade system the starting point (considered as freezing point) is 0°C while the end point (considered as boiling point) is 100°C. The gap between freezing and boiling points is divided into 100 equally spaced intervals known as degrees. In the Fahrenheit system the freezing point is 32°F and the boiling point is 212°F. The gap between the two points is divided into 180 equally spaced intervals. Each degree or interval is a measurement of temperature. As the starting and terminating points are arbitrary, they are not absolute. For example, you cannot say that 60°C is twice as hot as 30°C or that 30°F is three times hotter than 10°F. This means that while multiplication and division cannot meaningfully be performed on the readings themselves, they can be performed on the differences between readings.

IV. Ratio data is a set of data if the values (observations) belonging to it may take on any value within a finite or infinite interval. You can count, order and measure continuous data. It has real negative, zero or positive values. It has all the properties of nominal, ordinal and interval scales plus its own property, i.e. the zero point of a ratio scale is fixed, which means it has a fixed starting point. The zero value on a ratio scale is non-arbitrary. Most physical quantities, such as mass, length or energy are measured on ratio scales; so is temperature when it is measured in Kelvins, i.e. relative to absolute zero. In this scale all mathematical operations such as multiplication and division are therefore meaningful. For example, during your field work (data collection) for your research you may get 80 quintals/year, 15 quintals/year or no (zero) production for certain farmers. Another example can be the height and weight of sample newly born babies, where a 3 kg baby is twice as heavy as a 1.5 kg one and a 90 cm baby is 1.5 times longer than a 60 cm long baby.

In real quantitative data collection process, it is also common to identify two sets of data, namely continuous and discrete data. Continuous variables are those variables that have theoretically an infinite number of gradations between two measurements. Various variables in geography and environmental studies are of continuous type. For example, body weight of individuals, milk yield of cows, crop yield per unit of land, etc. are continuous variables. Discrete variables, on the other hand, do not have continuous gradations but there is a definite gap between two measurements, i.e. they cannot be measured in fractions. For example, family size, number of people in a country, and number of eggs per chicken in poultry are discrete data.
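The difference between the interval and ratio scales described above can be checked numerically. The following is a minimal sketch, not part of the original module; the helper name `celsius_to_kelvin` is an assumption introduced only for illustration. It shows that the apparent 2:1 ratio between 60°C and 30°C disappears once the same temperatures are expressed on the Kelvin (ratio) scale, while the difference between the two readings is unchanged.

```python
def celsius_to_kelvin(temp_c):
    """Convert an interval-scale Celsius reading to the ratio-scale Kelvin value."""
    return temp_c + 273.15


hot_c, cool_c = 60.0, 30.0

# Ratio of the raw Celsius readings (not physically meaningful)
ratio_celsius = hot_c / cool_c
# Ratio of the same temperatures measured from absolute zero
ratio_kelvin = celsius_to_kelvin(hot_c) / celsius_to_kelvin(cool_c)

# Differences are meaningful on both scales and agree with each other
diff_celsius = hot_c - cool_c
diff_kelvin = celsius_to_kelvin(hot_c) - celsius_to_kelvin(cool_c)

print(f"Ratio of Celsius readings: {ratio_celsius:.2f}")
print(f"Ratio of Kelvin values:    {ratio_kelvin:.2f}")
print(f"Difference in Celsius: {diff_celsius:.1f}; in Kelvin: {diff_kelvin:.1f}")
```

The Kelvin ratio is only about 1.1, which is why "twice as hot" cannot be read off the Celsius figures.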

Note that the distinction between nominal, ordinal, interval and ratio scale is of importance mainly because it helps us to choose the appropriate statistical tool for the analysis of the data.

3.3. Statistical Variables
A variable is a factor or character that can have more than one or different values such as yield, height, weight and family size. In research, particularly in quantitative researches, where we are using experiments to try to establish cause and effect, certain variables are especially important: (a) Independent Variable (IV), (b) Dependent Variable (DV) and (c) Extraneous Variable (EV).

Independent Variable: Climatic Conditions (Assumed Causes)
Dependent Variable: Crop production per unit area (Assumed effect)
Extraneous Variables: Type of seed, Fertilizers, Soil type, Farmers level of education, Farming technology, etc

Consider a study aimed at finding factors affecting crop productivity in a certain area in Ethiopia. We may hypothesize that climate and soil will have beneficial effects on crop productivity. In this situation climate and soil are the Independent Variables (IV) and crop productivity is the Dependent Variable (DV). However, it is very likely that the DV (the amount of crop production per unit area) may be suffering from other factors such as type of seed and fertilizers. All these other potential sources of influence are known as extraneous variables. Usually, experimental design is done in most scientific researches to control these extraneous factors as far as possible.

3.4. Data Classification
The collected data, also known as raw data or ungrouped data, are always in an unorganized form and need to be organized and presented in meaningful and readily comprehensible form in order to facilitate further statistical analysis. It is, therefore, essential for an investigator to condense a mass

of data into more and more comprehensible form. The process of grouping into different classes or sub classes according to some characteristics is known as classification. Thus classification is the first step in tabulation. For example, population data can be classified as male or female, married or unmarried, educated or uneducated, etc. Similarly, crop products in the crop year 2007/8 can be classified according to their types such as teff, wheat, barley, sorghum and oat.

Importance of Classification
The following are main objectives of classifying geographical data:
It condenses the mass of data in an easily manageable form
It eliminates unnecessary details
It facilitates comparison and highlights the significant aspect of data
It enables one to get a mental picture of the information and helps in drawing inferences
It helps in the statistical treatment of the information collected

Table 3.1: Classification of farm households' cash sources in Kuyu woreda
Columns: S/N; Source of Income; % of total income
Rows: 1 Livestock and livestock products sale; 2 Poultry; 3 Bee production; 4 Grain sale; 5 Vegetables sale; 6 Firewood sale; 7 Charcoal sale; 8 Transfer/gift; 9 Rural credit; 10 Local trades
Source: Mesay M. (2001)

Look at the table above: the farmers' sources of cash income are clearly indicated with their respective percentages, and the data is now easier to read and understand and more suitable for further statistical analysis.

Types of classification
Statistical data are classified in respect of their characteristics. Broadly there are four basic types of classification, namely
a) Chronological classification
b) Geographical or Locational classification
c) Qualitative classification
d) Quantitative classification

a) Chronological classification
In chronological classification the collected data are arranged according to the order of time expressed in years, months, weeks, etc. The data is generally classified in ascending order of time.

b) Geographical or locational classification
In this type of classification the data are classified according to geographical region or place. For instance, the production of coffee in different zonal administrations of Ethiopia, the production of wheat in different woredas in Oromiya, etc.

c) Qualitative classification
In this type of classification data are classified on the basis of some attribute or quality like sex, literacy, religion, employment, etc. Such attributes cannot be measured along a scale. For example, if the population is to be classified in respect to one attribute, say sex, then we can classify them into two, namely males and females. Similarly, they can also be classified into employed or unemployed on the basis of another attribute, employment.

d) Quantitative Classification
Quantitative classification refers to the classification of data according to some characteristics that can be measured such as amount of rainfall, temperature, altitude, height, weight, crop production, etc. For example, farmers in a kebele may be classified according to their amount of crop production within a year as given below.

Table 3.2: Classification of hypothetical farmers by quantity of production

Production in Quintals (Classes/Groups)
Discontinuous    Continuous      Number of Farmers (Frequency)
6 - 10           5.5 - 10.5      8
11 - 15          10.5 - 15.5     12
16 - 20          15.5 - 20.5     17
21 - 25          20.5 - 25.5     21
26 - 30          25.5 - 30.5     14
31 - 35          30.5 - 35.5     11
36 - 40          35.5 - 40.5     6
41 - 45          40.5 - 45.5     11
Total                            100

In this type of classification there are two elements, namely (i) the variable, i.e. the production in the above example, and (ii) the frequency, i.e. the number of farmers in each class. There are 12 farmers having production ranging from 11 to 15, 14 farmers having production ranging from 26 to 30, and so on.

Dear Student! Do you know the advantages of data classification? Please, explain it.

3.5. Data Presentation Tools and Analysis
Raw statistical data must be presented in a suitable and summarized form without any loss of relevant information so that it can be efficiently used for decision making. Several types of statistical/data presentation tools (graphic aids) exist. These include bar graphs, histograms, scattered diagrams, pie-charts, line graphs and tables. Whenever there is a need to present statistical data, graphic aids can help communicate this information to your audience more quickly. Besides making the report easier to read and understand, graphic aids make the physical appearance of your research attractive. The two graphic aids mostly used in research reports are tables and graphs.

3.5.1. Tabulation
Tabulation is the process of summarizing classified or grouped data in the form of a table so that it is easily understood and an investigator is quickly able to locate the desired information. A table is a

systematic arrangement of classified data in columns and rows. Thus, a statistical table makes it possible for the investigator to present a huge mass of data in a detailed and orderly form. It facilitates comparison and often reveals certain patterns in data which are otherwise not obvious. Classification and Tabulation, as a matter of fact, are not two distinct processes. Actually they go together. Before tabulation data are classified and then displayed under different columns and rows of a table.

Advantages of Tabulation
Statistical data arranged in a tabular form serve the following objectives:
1. It simplifies complex data and the data presented are easily understood.
2. It facilitates comparison of related facts.
3. It facilitates computation of various statistical measures like measures of central tendency, dispersion and correlation.
4. It presents facts in minimum possible space; unnecessary repetitions and explanations are avoided. Moreover, the needed information can be easily located.
5. Tabulated data are good for references and they make it easier to present the information in the form of graphs and diagrams.

Preparing a Table
The making of a compact table itself is an art. This should contain all the information needed within the smallest possible space. What the purpose of tabulation is and how the tabulated information is to be used are the main points to be kept in mind while preparing a statistical table. An ideal table should consist of main parts such as table number, title of the table, body of the table and sources of data.

Dear Student! Look at the table below and observe how table number, title of the table, body of the table and sources of data should be presented.

Table 3.3: Number of livestock per household by types and agro-climatic zones: Kuyu Woreda
Columns: Types; Dega; Woina Dega; Qolla; All zones
Rows: Oxen; Cows; Young cattle; Sheep; Goats; Equine; Total
Source: MA Thesis (Mesay M., 2001)

Frequency Distribution
In statistics, a frequency distribution is a list of values that a variable takes in a sample. It is usually a list, ordered by quantity, showing the number of times each value appears. For example, in the table below 22 appears three times while others such as 8, 12 and 17 appear two times each.

Table 3.4: Frequency table

Laborers Daily Income    Number of Laborers    Cumulative Frequency    Percentage Cumulative
(Eth. Birr)              (Frequency)           (<UL)                   Frequency
3.00                     2                     2                       12.5
8.00                     2                     4                       25.0
12.00                    2                     6                       37.5
17.00                    2                     8                       50.0
18.00                    2                     10                      62.5
22.00                    3                     13                      81.3
23.00                    1                     14                      87.5
24.00                    1                     15                      93.8
27.00                    1                     16                      100.0
Total                    16                    ----                    ----

This simple tabulation has two drawbacks. When a variable can take continuous values instead of discrete values or when the number of possible values is too large, the table construction is cumbersome, if not impossible. A slightly different tabulation scheme based on the range of values is used in such cases. So we group the data into class intervals (or groups) to help us organize,

interpret and analyze the data. The frequency of a group (or class interval) is the number of data values that fall in the range specified by that group (or class interval). There are two ways in which observations in the data set are classified on the basis of class intervals, namely exclusive method and inclusive method. Exclusive method is when the data is classified in such a way that the upper limit of a class interval is the lower limit of the succeeding class interval. This method is illustrated in Table 3.5.

Table 3.5: Exclusive method of data classification

Class Interval    Frequency    CF < UL
0 - 5             2            2
5 - 10            2            4
10 - 15           2            6
15 - 20           4            10
20 - 25           5            15
25 - 30           1            16
Total             16           ----

Inclusive method, on the other hand, is when the data are classified in such a way that both the lower and upper limits of a class are included in the interval itself, as illustrated below.

Table 3.6: Inclusive method of data classification

Class Interval    Frequency
0 - 4             2
5 - 9             2
10 - 14           2
15 - 19           4
20 - 24           5
25 - 30           1
Total             16
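The exclusive method illustrated in Table 3.5 can be reproduced from the raw incomes of Table 3.4. The sketch below is only an illustration: the function name `exclusive_frequencies` is hypothetical, and it assumes the usual convention that a value equal to a class's upper limit is counted in the next class.

```python
from collections import Counter

# Daily incomes (Eth. Birr) of the 16 laborers, expanded from Table 3.4
incomes = [3, 3, 8, 8, 12, 12, 17, 17, 18, 18, 22, 22, 22, 23, 24, 27]


def exclusive_frequencies(values, start, width, n_classes):
    """Count values per class [lower, upper).

    Assumption: the upper limit is excluded, so a value equal to the
    upper limit of one class is counted in the succeeding class.
    """
    counts = Counter((v - start) // width for v in values)
    table = []
    for k in range(n_classes):
        lower = start + k * width
        table.append((lower, lower + width, counts.get(k, 0)))
    return table


table_3_5 = exclusive_frequencies(incomes, start=0, width=5, n_classes=6)

running_total = 0
for lower, upper, freq in table_3_5:
    running_total += freq  # cumulative frequency below the upper limit
    print(f"{lower:>2} - {upper:<2}  frequency={freq}  cumulative={running_total}")
```

The printed frequencies and cumulative frequencies should agree with the columns of Table 3.5.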

An exclusive method is used to classify a set of data involving continuous variables while inclusive method should be used to classify a set of data involving discrete variables. If a continuous variable is classified according to the inclusive method, then certain adjustment in the class interval is needed to obtain continuity as shown in Table 3.7. (Table 3.6 above can be adjusted as Table 3.7 below for continuous variables.)

Table 3.7: Adjusted inclusive method for continuous variables

Class Interval    Frequency
-0.5 - 4.5        2
4.5 - 9.5         2
9.5 - 14.5        2
14.5 - 19.5       4
19.5 - 24.5       5
24.5 - 30.5       1
Total             16

The method is that first calculate the correction factor as:

$$X = \frac{\text{Lower Limit of the Next Higher Class} - \text{Upper Limit of a Class}}{2}$$

where X is the correction factor. And then subtract the value of X from the lower limits of all classes and add it to the upper limits of all the classes.

3.5.2. Graphical Presentation of Data
It has already been discussed that one of the most important functions of statistics is to present complex and unorganized (raw) data in such a manner that they would easily be understandable. In the same way graphic presentation of data helps us to easily understand the overall nature of the data and facilitates further analysis or interpretation. The shape of the graph offers easy and immediate answers to several questions such as the variations of the data distribution and trends. Graphic presentation, therefore, serves as an easy technique for quick and effective comparison between two or more variables.
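Returning to the class-limit adjustment behind Table 3.7, the correction factor can be computed directly from the inclusive classes of Table 3.6. This is a minimal sketch under the assumption that the gap between neighbouring classes is the same throughout; the function name `make_continuous` is hypothetical.

```python
def make_continuous(classes):
    """Turn inclusive class limits into continuous class boundaries.

    classes: list of (lower, upper) pairs in ascending order.
    Assumption: every pair of neighbouring classes has the same gap,
    so one correction factor serves all classes.
    """
    # Correction factor: half the gap between the upper limit of a class
    # and the lower limit of the next higher class
    correction = (classes[1][0] - classes[0][1]) / 2
    return [(lower - correction, upper + correction) for lower, upper in classes]


inclusive = [(0, 4), (5, 9), (10, 14), (15, 19), (20, 24), (25, 30)]
adjusted = make_continuous(inclusive)

for lower, upper in adjusted:
    print(f"{lower} - {upper}")
```

With a gap of 1 between classes the correction factor is 0.5, so each upper boundary coincides with the next lower boundary, as in Table 3.7.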

Graphic representations of data can be categorized into five types based on the dimensions of the graphs in use. These are:
1. Dimensionless diagrams also known as point graphs or dot graphs
2. Unidimensional graphs or line graphs such as frequency curve/polygon and cumulative frequency curve (Ogive)
3. Bidimensional diagrams such as histograms, bar diagrams and pie-charts
4. Tridimensional diagrams such as cubes, blocks and spheres
5. Pictorial representation of data by using pictures like human beings, animals, cups, houses and fruits

Generally, presentation of data in diagrams has the following advantages:
1. Diagrams give an attractive and elegant presentation of data
2. Diagrams leave good visual impact and facilitate comparisons
3. Interpretations from diagrams save time
4. Diagrams simplify complexity and easily depict the characteristics of the data

Point Graph or Dot Graph
Point graph provides a visual representation of a relationship between two variables, x and y, of a set of data. A graph consists of two axes called the x-axis (horizontal line) and y-axis (vertical line). The axes correspond to the variables we are analyzing. Each point in the graph is defined by a pair of numbers containing two coordinates. Points are identified by stating their coordinates in the form of (x, y). Note that the x-coordinate always comes first followed by the y-coordinate. The given point graph (Fig. 3.1), for example, charts the relationship between fertilizer input per hectare and crop yield per hectare in quintals. The horizontal coordinate (x-coordinate) of each point corresponds to the amount of fertilizer input while the vertical coordinate (y-coordinate) corresponds to the yield per hectare in quintals as demarcated on the left hand side of the chart. The coordinate values of a point are identified by drawing lines extending out from the specific values at each axis. For example, in the figure we have been using, the x and y coordinates for point P can be identified as 40 kg per hectare and 25 quintals per hectare, respectively. Hence, the coordinate value for Point P is written as (40, 25).

Figure 3.1: Point graph

Linechart or Linegraph
Linechart or linegraph is a type of graph created by connecting a series of the coordinates of data points that represent individual measurements with line segments. A linechart is a basic type of chart common in many fields and it gives the reader a fairly good idea of the nature of the data. For instance, a linechart is often used to visualize a trend in data over intervals of time.

Table 3.8: A hypothetical farmer's crop output over years

Crop Year                 1985    1990    1995    2000    2005
Production in Quintals    10      12      15      20      28

For instance, the data depicted in Table 3.8 above can easily be converted to a linechart as shown in the figure below.

Figure 3.2: Line graph (horizontal axis: Crop Year; vertical axis: Yield per Hectare)

Frequency polygon is also a type of linechart formed by marking the midpoint of the top of bars in a histogram and joining these dots by a series of straight lines. The frequency polygons are formed as a closed figure with the horizontal axis. A series of straight lines are drawn from the midpoint of the top base of the first and the last rectangles to the midpoint falling on the horizontal axis of the next outlying interval with zero frequency. Drawing a frequency polygon does not necessarily require constructing a histogram first. It can be obtained directly on plotting points above each class midpoint at heights equal to the corresponding class frequency. The points so drawn are then joined by a series of straight lines and the polygon is closed as explained earlier. In this case, the horizontal x-axis measures the successive class midpoints and not the lower class limits.

Exercise
Use the data indicated in Table 3.4 hereinbefore and draw a frequency polygon
A) By constructing a histogram first
B) Without constructing histogram

Ogive (Cumulative frequency curve) is another type of linechart. It is a cumulative frequency polygon often presented in cumulative frequencies or in percentage cumulative terms. Cumulative frequencies show the running total, that is, the frequency below each class boundary, as shown in the example below. The curve is usually of S shape. One way of constructing an Ogive (pronounced as O-jive) or cumulative frequency curve is indicated below.
(i) Plot the points with coordinates having abscissa (x-axis) as actual limits and ordinates (y-axis) as the cumulative frequencies.
(ii) Join the points plotted by a smooth curve.
(iii) An Ogive is connected to a point on the x-axis representing the actual lower limit of the first class.

Example
Plot (construct) an Ogive based on the data (assumed students mark) in the table below.

Table 3.9: Frequency distribution

Students Mark    Upper Limit    Frequency (No of Students    Cumulative    % Frequency    % Cumulative
                                in the Range)                Frequency                    Frequency
0-19             19.5           8                            8             16             16
20-39            39.5           12                           20            24             40
40-59            59.5           14                           34            28             68
60-79            79.5           10                           44            20             88
80-99            99.5           6                            50            12             100
Total                           50                           ------        100            -----
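The cumulative and percentage columns of Table 3.9 are running totals of the class frequencies, so they can be checked with a short computation. The sketch below is illustrative only; the variable names are not from the module, and the starting point at -0.5 is an assumption taken as the actual lower limit of the first class (0-19), following step (iii) above.

```python
from itertools import accumulate

upper_boundaries = [19.5, 39.5, 59.5, 79.5, 99.5]  # actual upper limits
frequencies = [8, 12, 14, 10, 6]                   # students per class

total = sum(frequencies)
cumulative = list(accumulate(frequencies))          # running totals
percent = [100 * f / total for f in frequencies]
cumulative_percent = [100 * cf / total for cf in cumulative]

# Points to plot for the Ogive: (actual limit, cumulative frequency).
# Assumed starting point: the actual lower limit of the first class, -0.5.
ogive_points = [(-0.5, 0)] + list(zip(upper_boundaries, cumulative))

for x, y in ogive_points:
    print(f"mark below {x}: {y} students")
```

Plotting `ogive_points` and joining them gives the cumulative frequency curve for the marks data.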

Figure 3.3: Cumulative frequency curve (Ogive) based on the data in Table 3.9 (horizontal axis: Crop Production in Quintals; vertical axis: Cumulative Frequency)

The plotted points are then connected by straight lines to enhance the shape of the distribution.

Histogram and Bardiagram: These are one dimensional diagrams used to represent both ungrouped (raw) and grouped data. Histogram is simply a bar graph where the bar lengths are determined by the frequencies in each class of a grouped frequency distribution. Values of the variables (the characteristics to be measured) are scaled along the horizontal axis and the number of observations (frequencies) along the vertical axis of the graph. The height of such boxes (rectangles) measures the number of observations in each of the classes. See Figures 3.4 and 3.5. Notice how the bar diagram (Figure 3.4) is represented by the histogram (Figure 3.5) having eight interconnected bars that represent the numbers of farmers in each of the quantities of crop production distribution.

Figure 3.4: Bardiagram Presenting Number of Farmers by Crop Production: Hypothetical Data

Figure 3.5: Histogram Indicating Number of Farmers by Crop production: Hypothetical Data (horizontal axis: Crop production in Quintals; vertical axis: Number of Farmers (Frequency))

Piediagram or piechart or a circlegraph is a circular chart divided into sectors, illustrating relative magnitudes or frequencies or percents. It is named for its resemblance to a pie which has been sliced. In a pie chart, the arc length of each sector, and consequently its central angle and area, is proportional to the quantity it represents. Together, the sectors create a full disk.

Observe that the data indicated in the table below (Table 3.10) has been presented by the pie-diagram in Figure 3.6.

Table 3.10: Percentage distribution of hypothetical population by marital status

Marital Status    % Distribution
Single            20
Married           62
Widowed           14
Divorced          4

Figure 3.6: Piechart presenting percentage distribution of marital status: Hypothetical data

To divide the pie-chart proportionally use the following method. Observe that the summation of the percentages (14 + 4 + 20 + 62) is 100. You may also know that the total angular measurement of any circle is 360°. Then you can now establish a relationship between the percentages and angular measurements as follows.

For instance if 100% = 360°, what will be the equivalent angular measurement of the remaining percentages?

14%  = 50.40°
4%   = 14.40°
20%  = 72.00°
62%  = 223.20°
100% = 360.00°

Now, by using your protractor you can construct a piechart as indicated in Fig 3.6.

Three dimensional representation of dataset
This is the method of depicting a data set by using the picture of objects that have three dimensions (length, width and height) such as prism, cube, cylinder and cone. For instance, Table 3.11 below can be shown by one of the three dimensional graphs as follows.

Table 3.11: A farmer's Crop Output over Years: Hypothetical Data

Year                   1985    1990    1995    2000    2005
Crop Output in Qtls    10      12      15      20      28

Important formulas in this unit are:
1. Class Interval: $h = \text{Upper Limit} - \text{Lower Limit}$
2. Midpoint of a class: $m = \dfrac{\text{Lower Limit} + \text{Upper Limit}}{2}$
3. Approximate Interval Size to be used in constructing a frequency distribution: $h = \dfrac{\text{Largest data value} - \text{Smallest data value}}{\text{Number of classes}}$
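The percentage-to-angle conversion used for the pie chart of Table 3.10, together with the midpoint and interval-size formulas of this unit, can be written as small helper functions. This is a sketch for checking the arithmetic; the function names are hypothetical, and the choice of six classes in the last line is an assumption made only to have an example.

```python
def pie_angle(percent):
    """Central angle (in degrees) of a sector representing `percent` of the whole."""
    return percent * 360 / 100


def class_midpoint(lower, upper):
    """Midpoint of a class from its lower and upper limits."""
    return (lower + upper) / 2


def approximate_interval_size(largest, smallest, n_classes):
    """Approximate class interval size for a chosen number of classes."""
    return (largest - smallest) / n_classes


# Percentages from Table 3.10
marital_status = {"Single": 20, "Married": 62, "Widowed": 14, "Divorced": 4}
angles = {status: pie_angle(p) for status, p in marital_status.items()}

for status, angle in angles.items():
    print(f"{status}: {angle:.2f} degrees")

# Midpoint of the class 10-14, and interval size for the incomes of
# Table 3.4 (smallest 3, largest 27) split into six classes (assumed)
print(class_midpoint(10, 14))
print(approximate_interval_size(27, 3, 6))
```

The four angles add up to 360 degrees, which is a quick check that no sector has been miscalculated.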

Figure 3.7: A Farmer's Crop Output over Years

Pictorial representation of data
This, also known as Pictogram or Ideograph, refers to the method of presenting a set of quantitative data by using pictures of different objects like tree, cattle, sack, cup, car, airplane, ball and man. What is very crucial and worth mentioning here is that a picture should be given a certain scale that it represents. For example, in Fig. 3.8, a single picture represents 1,000,000 people.

Figure 3.8: Pictorial representation of national regional states of Ethiopia by population
Regions: Oromiya, Amhara, SNNPRS, Somali, Tigray, Addis Ababa, Afar, B/Gumuz
Scale: one picture = 1,000,000 people
Source: CSA (2008)

693.888 1.752 4. 980.270. 780.448. 550.377. Extraneous variable D. Ogive F. 900.617 150. Define the following statistical terms A. 550. The following table indicates the population size of the top 8 most populous countries in the world.866. 550. 2008 Country Population China India USA Indonesia Brazil Pakistan Bangladesh Russia 1. 480.Quantitative Methods in Social Sciences Exercise 1.947 234.12: The world's 8 most populous countries. 740. Qualitative data B. 760. 750.139. 680. Then construct a cumulative frequency table and Ogive for the data. Cumulative frequency G.647 169. Present the data in pie diagram and bar chart Table 3. 720.129. 740.010. 500 3. 610. Histogram E. 760. 1100. Let the following data represent the gross income (in Eth. 600. 1000. 600. Quantitative data C.321. Birr) of forty (40) urban households in Adama town. 600. 800. 750. 840. 520. 950. 2009 50 . 670. Mesay Mulugeta. 880. 700. 550. 500.851.997 190. 600.154 301. Draw a suitable diagram (chart) to present the data. Frequency distribution H. 750. 550. Midpoint 2. 570. 550. 920. 820.339 141. The following data is percentage distribution of sources of annual cash income of sample rural households in Kuyu woreda during a particular year (2001). 620. 900.

72 59.13: Rural households' major sources of cash income S/N 1 2 3 4 5 6 7 8 9 10 11 Source of cash income Livestock and livestock products sale Poultry Bee product Grain sale vegetables sale Firewood sale charcoal sale Transfer/gift Rural credit local trades Other non-farm activities Total Cash income per household (Birr/household) 712. The following data is the mean maximum temperature of Debre Birhan town in oC (1997 2006).38 5.85 19.85 19.80 19.85 19.Quantitative Methods in Social Sciences Table 3.64 118.84 48. 2009 51 .85 19.87 22. Present the data by using bar-diagram.93 70.85 19.30 945.14 Months Mean Max To 19.85 Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec Source: Computed from NMSA s data Mesay Mulugeta.85 19.12 799.71 112.46 56.85 19.85 19.20 44.85 19.85 19. Table 3.26 32.

Unit Four Measures of Central Tendency
Unit objectives
Having studied this chapter, you should be able to:
Understand the general concept of measures of central tendency or measures of location in geographic data
Identify the specific types, methods and applications of measures of central tendency
Describe the role of measures of central tendency in geography and environmental studies
Understand the merits and demerits of different measures of central tendency

4.1. Introduction
In the previous sections, we discussed how raw data can be collected, organized and presented in terms of tables, charts and frequency distribution in order to be easily understood and analyzed. Although frequency distribution and corresponding graphical presentations make raw data more meaningful, yet they fail to identify three major properties that describe a set of quantitative data. These are:
1. Central value of a set of data, called central tendency
2. The extent to which numerical values are dispersed around the central value, called variation or dispersion in terms of single distances of individual observation from central values.
3. The extent of departure of numerical values from symmetry of distribution around the central value, called Skewness or Shape of frequency distribution or measure of symmetry.

In this unit we will discuss in depth Central Tendencies, also known as Measures of Location (First Order Analysis). The term central tendency was coined because observations (numerical values) in most data sets show a distinct tendency to group or cluster around a value of an observation located somewhere in the middle of the observations. It is necessary to identify or calculate these typical central values to describe or project the characteristic of the entire data set in one figure. This descriptive value is known as a measure of central tendency. It is very important in social science applications such as planning. For instance, it becomes easy to plan for the annual total need of potable water supply for Adama town residents firstly by studying the average quantity of water needed per household per head in the town.

The most widely used measures of central tendency are the mean, the median and the mode. We will be calculating these values for populations (i.e. the collection of all elements we are describing) and for samples drawn from populations, as well as for grouped and ungrouped data sets.

4.2. Mean
Mean is a central value which is computed by taking into consideration all the observations or all recorded values. It has four sub types known as Arithmetic, Geometric, Weighted and Harmonic means. But unless and until specified, the term mean invariably refers to the Arithmetic Mean or Average. It is this measure which is most frequently used because it is easier to compute as well as it is used in further rigorous statistical analysis where Geometric and Harmonic means are not useful. Moreover, Geometric and Harmonic Means have very limited applications as well.

a) Calculating Arithmetic Mean for Raw Data
It is the most widely used and widely reported measure of central tendency. Mathematically, the mean of a list of numerical observations is the sum of all the observations divided by the number of items in the list. The Greek letter µ is used to denote the mean of an entire population while the sample mean is typically denoted by $\bar{x}$ (enunciated x bar). Some related literatures also use $\bar{y}$ and $\bar{d}$ to denote sample mean. There are at least two methods to calculate arithmetic mean for ungrouped (raw) data.

Direct Method
In this method it is calculated by adding the values of all observations and dividing the total by the number (N) of observations.

Population Mean: $\mu = \dfrac{X_1 + X_2 + X_3 + \dots + X_N}{N} = \dfrac{\sum X_i}{N}$

Sample Mean: $\bar{x} = \dfrac{x_1 + x_2 + x_3 + \dots + x_n}{n} = \dfrac{\sum x_i}{n}$

Example: Calculate the arithmetic mean of the data given below.
25kg, 30kg, 18kg, 22kg, 28kg, 24kg, 33kg, 27kg, 25kg, 21kg

Arithmetic Mean = (25kg + 30kg + 18kg + 22kg + 28kg + 24kg + 33kg + 27kg + 25kg + 21kg) / 10 = 253kg / 10 = 25.3kg
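The direct method in the example above is simply the sum of the observations divided by their count, which is easy to verify in a few lines. This is a minimal sketch; the variable names are illustrative, and `statistics.mean` from the Python standard library is shown only as a cross-check.

```python
from statistics import mean

# The ten observations from the example, in kg
weights_kg = [25, 30, 18, 22, 28, 24, 33, 27, 25, 21]

# Direct method: add all observations and divide by their number
total_kg = sum(weights_kg)
direct_mean = total_kg / len(weights_kg)

print(f"Sum = {total_kg} kg, n = {len(weights_kg)}")
print(f"Arithmetic mean = {direct_mean} kg")

# Cross-check with the standard library
library_mean = mean(weights_kg)
```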

Then, the arithmetic mean of the above observations (data) is calculated to be 25.3kg.

Short-Cut Method of Calculating Arithmetic Mean
In this method an arbitrary assumed mean (am) is used as a base for calculating the deviations of the individual values in the data set. The correct mean is calculated by adding the assumed mean to the mean of differences or deviations from the assumed mean. Look at the example below. Let the following table represent the daily income of 23 hypothetical employees in a firm. Let 24 birr be taken as the arbitrary assumed mean (am) of the employees' daily earnings.

Table 4.1: Calculation of Mean by Short-cut Method for Ungrouped Data

Daily Earnings in Birr (x_i)    Number of Employees (f_i)    d_i = x_i - am = x_i - 24    f_i d_i
34                              2                            10                           20
23                              3                            -1                           -3
18                              2                            -6                           -12
22                              4                            -2                           -8
36                              3                            12                           36
38                              4                            14                           56
12                              5                            -12                          -60
Total                           23                           -------                      29

$$\bar{d} = \frac{\sum f_i d_i}{N} = \frac{29}{23} = 1.26$$

Then, the real arithmetic mean is

$$am + \frac{\sum f_i d_i}{N} = 24 + \frac{29}{23} = 24 + 1.26 = 25.26$$

Calculating Arithmetic Mean for Grouped (Classified) Data
It can be applied in grouped data in two ways. (Case 1) By changing the origin to the mid-value or class-mark assumed as mean, calculating $\bar{d}$ as usual and getting the correct mean by adding it to the assumed mean, and (Case 2) by changing the origin as well as the scale by dividing the deviation from assumed mean by class interval (C) and then calculating the assumed mean of the new variate and compute the corrected mean by adding the multiplication of mean of new variate

with C and then adding to the assumed mean. This method requires grouping the raw data into class intervals, calculation of class midpoint and identification of the number of observations (data) in each class. For this it is also advisable to change the class interval into continuous form of grouping.

Example
Let the following table represent the daily earnings of 18 hypothetical employees in a firm in a classified form.

Case 1
Table 4.2: Calculation of Arithmetic Mean for Grouped Data

Daily Earnings in Birr    Class mid-value    Number of Employees    f_i m_i
(Discrete Grouping)       m_i                f_i
10 - 14                   12                 2                      24
15 - 19                   17                 3                      51
20 - 24                   22                 2                      44
25 - 29                   27                 4                      108
30 - 34                   32                 3                      96
35 - 39                   37                 4                      148
Total                     -------            18                     471

$$\text{Arithmetic Mean} = \frac{\sum f_i m_i}{N} \quad \text{or} \quad \frac{\sum f_i m_i}{n}$$

where N = population size and n = sample size. Then, the arithmetic mean for the above example = 471 / 18 = 26.167

Case 2
Look at the table below.

Table 4.3: Calculation of Arithmetic Mean for Grouped Data

Daily Earning    Class Mark         No. of Employees    d_i = m_i - am    f_i d_i    u_i = (m_i - am) / C    f_i u_i
in Birr          (Mid-Point) m_i    (Frequency, f_i)
9.5 - 14.5       12                 2                   -15               -30        -3                      -6
14.5 - 19.5      17                 3                   -10               -30        -2                      -6
19.5 - 24.5      22                 2                   -5                -10        -1                      -2
24.5 - 29.5      27                 4                   0                 0          0                       0
29.5 - 34.5      32                 3                   5                 15         1                       3
34.5 - 39.5      37                 4                   10                40         2                       8
Total                               18                  ---               -15        ---                     -3
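The calculations set up in Tables 4.1 to 4.3 can be cross-checked with one small function. The sketch below is an illustration under stated assumptions: the function names are hypothetical, and a single `class_width` argument covers both the plain short-cut method (width 1) and Case 2, where deviations are also divided by the class interval C.

```python
import math


def direct_mean(values, frequencies):
    """Sum of f*x divided by the total frequency (Case 1, Table 4.2)."""
    return sum(f * x for x, f in zip(values, frequencies)) / sum(frequencies)


def shortcut_mean(values, frequencies, assumed_mean, class_width=1):
    """Assumed-mean method: am + C * (sum of f*u) / N, with u = (x - am) / C."""
    n = sum(frequencies)
    sum_fu = sum(f * (x - assumed_mean) / class_width
                 for x, f in zip(values, frequencies))
    return assumed_mean + class_width * sum_fu / n


# Table 4.1: ungrouped daily earnings, assumed mean 24 birr
earnings = [34, 23, 18, 22, 36, 38, 12]
employees = [2, 3, 2, 4, 3, 4, 5]
mean_4_1 = shortcut_mean(earnings, employees, assumed_mean=24)

# Tables 4.2 and 4.3: class midpoints, assumed mean 27, class interval C = 5
midpoints = [12, 17, 22, 27, 32, 37]
class_freq = [2, 3, 2, 4, 3, 4]
mean_case_1 = direct_mean(midpoints, class_freq)
mean_case_2 = shortcut_mean(midpoints, class_freq, assumed_mean=27, class_width=5)

print(round(mean_4_1, 2), round(mean_case_1, 3), round(mean_case_2, 3))
```

Both grouped-data routes give the same mean, because changing the origin and the scale only rearranges the arithmetic.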

From Table 4.3:

\[ \bar{d} = \frac{\sum f_i d_i}{\sum f_i} = \frac{-15}{18} = -0.833, \qquad \text{Correct mean} = am + \bar{d} = 27 + (-0.833) = 26.167 \]

OR

\[ \bar{u} = \frac{\sum f_i u_i}{\sum f_i} = \frac{-3}{18} = -0.167, \qquad \text{Correct mean} = am + C(\bar{u}) = 27 + 5(-0.167) = 26.167 \]

Merits and Demerits of Arithmetic Mean
Sharma, J.K (2004) writes the merits and demerits of arithmetic mean as follows:

Merits
1. The calculation of arithmetic mean is simple and it is unique.
2. It is clear and unambiguous since every data set has one and only one mean value.
3. The calculation of arithmetic mean is based on all values given in the data set.
4. The arithmetic mean is a reliable single value that reflects all values in the data set.
5. The arithmetic mean is least affected by fluctuations in sample size. Its value determined from various samples drawn from a population varies by the least possible amount.
6. It is used for more rigorous further statistical analysis.
7. Arithmetic mean is a stable average.

Demerits
1. The value of arithmetic mean cannot be calculated accurately for open-ended class intervals.
2. It is affected by the extreme values, which are not the exact representative of the data set.
3. For large data sets the calculation of arithmetic mean may sometimes be difficult and tedious as every element is used in the calculation.
4. It cannot be calculated for qualitative characteristics such as intelligence, beauty and loyalty.
5. Arithmetic mean cannot be determined by inspection.

Exercise
The following data is monthly minimum temperature data taken from Guder Weather Station. Calculate the arithmetic mean of the minimum temperature for each month and write your answers in the space provided at the bottom of the table.

Table 4.4: Monthly Minimum Temperature in °C (1998-2006): Guder Station
(Rows: year of record, 1998 to 2006. Columns: months, January to December. Last row: Mean Min T°, to be filled in for each month.)

Exercise
Form a frequency table for the raw data given below and calculate the arithmetic mean.
22 38 12 17 17 18 22 23 22 27 18 28 12 12 24 18 18 22 33 33 38 24 24 17 22 22 28 12 28 24 18 33 27 22 24

Weighted Arithmetic Mean
The arithmetic mean, as discussed earlier, gives equal importance (weight or value) to each observation in the data set. However, there are situations in which the values of observations in the data are not of equal importance. In such cases a simple arithmetic mean may not be truly representative and may even be misleading. Under these circumstances, we may attach to each observation a value or weight as an indicator of its importance and compute a weighted mean.

Example
Let us assume that East Shewa Zone administration wants to award only one top crop producing farmer in the zone based on his/her annual crop production. In the process of selecting the nominee, let three top farmers (Farmers X, Y and Z) be found, each of whom harvested 80 quintals of crops during the crop year. If each farmer produced four types of crops, namely wheat, teff, barley and sorghum, as depicted in Table 4.5, who do you think should be awarded? It is somewhat difficult to decide, as the three farmers produced the same quantity and the average production per crop is 20 quintals for each of them. Do not forget that only one farmer is required to be awarded. In order to decide whom to award, we may attach to each crop type a value (weight) w1, w2, w3 and w4 as an indicator of its importance and compute a weighted mean. For instance, based on their current market price, we can attach 1200 to teff, 800 to barley, 700 to wheat and 400 to sorghum. Observe Table 4.5 below.

Table 4.5: Calculation of Weighted Arithmetic Mean

Crop type   Current market price    Farmer X              Farmer Y              Farmer Z
            per quintal (weight w)  Prodn in qntls (x)  wx   Prodn in qntls (x)  wx   Prodn in qntls (x)  wx
Teff        1200                    18       21600        12       14400        25       30000
Barley      800                     22       17600        18       14400        25       20000
Wheat       700                     30       21000        45       31500        15       10500
Sorghum     400                     10       4000         5        2000         15       6000
Total       3,100                   80       64200        80       62300        80       66500

In order to decide who should be awarded, we should now calculate the weighted mean (WM) for each farmer from the table above:

\[ WM = \frac{\sum w x}{\sum w} = \frac{w_1 x_1 + w_2 x_2 + w_3 x_3 + \dots + w_n x_n}{w_1 + w_2 + w_3 + \dots + w_n} \]

WM for farmer X = 64200 / 3100 = 20.7097
WM for farmer Y = 62300 / 3100 = 20.0968
WM for farmer Z = 66500 / 3100 = 21.4516

As per the calculation above, it can be noted that Farmer Z should be awarded.

Remark: As noted by Sharma, J. K (2004), among others, the weighted arithmetic mean should be used where the importance of all the numerical values in the given data set is not equal and when the frequencies of various classes are widely varying. The term weighted mean usually refers to a weighted arithmetic mean, but weighted versions of other means can also be calculated, such as the weighted geometric mean and the weighted harmonic mean.
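The farmers' example can be reproduced with a short script. It is a sketch under the assumptions of the example itself: the weights are the market prices quoted in the text (1200, 800, 700 and 400) and the quantities are those of Table 4.5. The names `weighted_mean`, `prices` and `production` are illustrative only.

```python
def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(w * x) / sum(w)."""
    total_weight = sum(weights)
    if total_weight == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * x for x, w in zip(values, weights)) / total_weight


# Weights: current market price per quintal (from the example)
prices = {"teff": 1200, "barley": 800, "wheat": 700, "sorghum": 400}

# Production in quintals for each farmer (Table 4.5)
production = {
    "X": {"teff": 18, "barley": 22, "wheat": 30, "sorghum": 10},
    "Y": {"teff": 12, "barley": 18, "wheat": 45, "sorghum": 5},
    "Z": {"teff": 25, "barley": 25, "wheat": 15, "sorghum": 15},
}

crops = list(prices)
scores = {
    farmer: weighted_mean([qty[c] for c in crops], [prices[c] for c in crops])
    for farmer, qty in production.items()
}
winner = max(scores, key=scores.get)

for farmer, score in scores.items():
    print(farmer, round(score, 4))
print("Awarded farmer:", winner)  # Awarded farmer: Z
```

The simple mean is 20 quintals for all three farmers, so only the weighting by price separates them.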

Exercise
Let an examination be held to decide the award of a scholarship among three selected students, namely Gebre, Mandefro and Waktole. The weights of the various subjects were different (4 for mathematics, 3 for GeES, 2 for civics and 1 for language). The marks obtained by the three candidates (out of 100 in each subject) are given below. Then, decide who should be awarded the scholarship by calculating the weighted arithmetic mean.

Table 4.6

Subjects      Gebre   Mandefro   Waktole
Mathematics   60      57         62
GeES          62      61         67
Civics        55      53         60
Language      67      77         49

Geometric Mean
It is the nth root of the product of n numbers (observations), or the average of the logarithmic values of a data set converted back to a base 10 number. It is given by the formula

\[ GM = \sqrt[n]{(x_1)(x_2)(x_3)(x_4)\dots(x_n)} \]

The logarithmic transformation will be

\[ \log GM = \frac{1}{n}\left(\log x_1 + \log x_2 + \dots + \log x_n\right) \]

Hence GM = Anti Log of the Average of Log Values.

Then, how do you calculate a geometric mean? The easiest way to think of the geometric mean is that it is the average of the logarithmic values, converted back to a base 10 number. However, the actual formula and definition of the geometric mean is that it is the nth root of the product of n numbers, or Geometric Mean = nth root of (X1)(X2)...(Xn), where X1, X2, ..., Xn represent the individual data points used in the calculation and n is the total number of data points.

\[ GM = \sqrt[n]{(x_1)(x_2)(x_3)(x_4)\dots(x_n)} \]

Consider this example. Suppose you want to calculate the geometric mean of the numbers 2 and 32. This simple example can be done as follows. First, take the product: 2 x 32 = 64. Then, since there are only two numbers (that means 2 and 32), the nth root is the square root, and the square root of 64 is 8. Therefore the geometric mean of 2 and 32 is 8.

Now, let's solve the problem using logs. In this case, we will convert to base-2 logs so that we can solve the problem as follows. Converting our numbers, we have:

\[ 2 = 2^1, \qquad 32 = 2^5, \qquad 2^1 \times 2^5 = 2^6 = 64 \]

The square root of 2^6 is 2^3, which is equal to 8. Of course, the short cut to solve the problem is to take the average of the two exponents (1 and 5), which is 3, and 2^3 is 8.

Merits and Demerits of Geometric Mean
Sharma, J.K (2004) writes the merits and demerits of geometric mean as follows:

Merits
1. The value of GM is not much affected by extreme observations.
2. GM is calculated by taking all the observations into account.
3. It is useful in determining the rate of increase or decrease.

Demerits
1. The calculation of GM, as compared to AM, is more difficult and intricate.
2. GM cannot be calculated when any of the observations in the data set is either negative or zero.

Exercise
Calculate the geometric mean of 8, 24, 25, 26, 28 and 32.
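The two routes to the geometric mean described above (the nth root of the product, and the antilog of the mean of the logs) can be sketched in Python as follows. The function names are illustrative; the data are the numbers 2 and 32 from the worked example, with base-2 logs as in the text.

```python
import math


def geometric_mean_root(values):
    """Geometric mean as the nth root of the product of n positive numbers."""
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean needs strictly positive values")
    return math.prod(values) ** (1 / len(values))


def geometric_mean_logs(values, base=10):
    """Geometric mean as the antilog of the average of the log values."""
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean needs strictly positive values")
    mean_log = sum(math.log(v, base) for v in values) / len(values)
    return base ** mean_log


numbers = [2, 32]
by_root = geometric_mean_root(numbers)          # square root of 64
by_logs = geometric_mean_logs(numbers, base=2)  # 2 ** ((1 + 5) / 2)

print(by_root, by_logs)  # both approximately 8
```

The log route is usually preferable for long data sets, because the running product can become very large or very small before the root is taken.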

Harmonic Mean
In mathematics, the harmonic mean is one of several kinds of average. Typically, it is appropriate for situations when the average of rates is desired. The harmonic mean H of the positive real numbers x1, x2, ..., xn is defined to be

\[ H = \frac{n}{\dfrac{1}{x_1} + \dfrac{1}{x_2} + \dfrac{1}{x_3} + \dots + \dfrac{1}{x_n}} \]

Note: Equivalently, the harmonic mean is the reciprocal of the arithmetic mean of the reciprocals.

Merits and Demerits of Harmonic Mean

Merits
1. The HM of a data set is calculated based on every one of its elements.
2. More weightage is given to smaller values in calculating the harmonic mean.
3. HM can be extended to further statistical analysis.

Demerits
1. It is not easy to calculate and understand compared to AM.
2. It is impossible to determine the harmonic mean if any of the values is zero and/or negative.
3. It is not a representative value of the distribution or the data set unless the analysis requires greater weight to be given to smaller items.

Exercise
Calculate the harmonic mean of 10, 14, 20, 22, 26 and 30.

Note that for a given set of observations the following inequality holds:

\[ \text{Arithmetic Mean} \ge \text{Geometric Mean} \ge \text{Harmonic Mean} \]

4.3. Median
Median is defined as the middle value in the data set when its elements are arranged in a sequential order, that is, in either ascending or descending order of magnitude. It is called a middle value in an ordered sequence of data in the sense that half of the observations are smaller and half are larger than this value.

Median can be calculated for both ungrouped and classified data. In order to calculate it for ungrouped data, first the data should be arranged in an ascending or descending order. If the number of observations (n) is an odd number, then the median (Med) is represented by the numerical value corresponding to the position of the ((n + 1)/2)th ordered observation.

Example
Find the median value of the following data.
50 55 60 63 66 69 72 75 78
n = 9, so we have to find the ((n + 1)/2)th = ((9 + 1)/2)th = 5th value in the data, which is 66. Then, the median value of the data above is 66.

If the number of observations (n) is an even number, then the median is defined as the arithmetic mean of the numerical values of the (n/2)th and (n/2 + 1)th observations in the data array.

Example
Find the median value of the following data.
50 55 60 63 66 69 72 75 78 84
n = 10, which is even.
(n/2)th value = (10/2)th value = 5th value = 66
(n/2 + 1)th value = (10/2 + 1)th value = 6th value = 69
Then, the arithmetic mean of 66 and 69 is 67.5, which is the median of the data above.

Note that the median is only moderately useful for further statistical analysis, unlike the arithmetic mean, which is the most important measure of central tendency (measure of location) for further statistical analysis of a given data set.

Median for Grouped Data: To find the median for grouped data, first identify the class interval which contains the median value, that is, the ((n + 1)/2)th observation of the data set. To find this class interval, work out the cumulative frequency of each class and take the first class for which the cumulative frequency is equal to or greater than (n + 1)/2. The median is then

\[ Med = L + \frac{\dfrac{n + 1}{2} - cf}{f} \times h \]

Where,
L = lower class limit of the median class interval
cf = cumulative frequency of the class prior to the median class interval, that is, the sum of all the class frequencies up to, but not including, the median class interval
f = frequency of the median class
h = width of the median class interval
n = total number of observations in the distribution

Example
Table 4.7 represents the dietary energy intake per person per day (kcal/day/person) of 100 rural households in one of the woredas in Ethiopia. Calculate the median value of the dietary energy intake in kilocalorie based on the discussion above.

Table 4.7: Calculation for Median Value

Dietary Energy (kcal/day/person)   No of Households (f)   Cumulative Frequency (<UL)
0-500                              20                     20
500-1000                           38                     58
1000-1500                          28                     86
1500-2000                          4                      90
2000-2500                          3                      93
2500-3000                          3                      96
3000-3500                          4                      100
Total                              100                    ---

Steps
Total number of observations (n) = 100
The median is the ((n + 1)/2)th = ((100 + 1)/2)th = (50.5)th observation in the data set.
This observation lies in the (500-1000) class interval.
Applying the formula above we have:

\[ Med = 500 + \frac{50.5 - 20}{38} \times 500 = 500 + 401.32 = 901.32 \]

Merits and Demerits of Median

Merits
1. It is easy to calculate and understand, even for professionals with a low level of mathematics and statistics.
2. It is not affected by the extreme values in a data set.
3. Median can be computed while dealing with a distribution with open-ended classes.
4. Median can sometimes be located by simple inspection.

Demerits
1. In case of an even number of observations for ungrouped data, the median cannot be determined exactly.
2. Median, being a positional value, is not based on each item in a data set.
3. Median is not suitable for further mathematical treatment.
4. Median is more affected by fluctuations of sampling as compared to arithmetic mean.

Exercise
Let us assume that you have collected data from a factory employing 80 workers during your project work for this course. Let your data indicate that the daily wage of 20 workers is less than 10 Eth. Birr, of 30 workers is 10 to 20 Eth. Birr, of 14 workers is from 20 to 30 Eth. Birr, of 7 workers is 30 to 40 Eth. Birr and of the remaining 9 workers ranges from 40 to 50 Eth. Birr. Then, calculate the median wage of the workers.
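The median rules for ungrouped and grouped data can be sketched as below. The function names are illustrative; the grouped version assumes equal class widths and uses the (n + 1)/2 position adopted in this section, with the dietary energy data of Table 4.7 as input.

```python
def median_ungrouped(values):
    """Middle value for odd n, mean of the two middle values for even n."""
    ordered = sorted(values)
    n = len(ordered)
    if n % 2 == 1:
        return ordered[(n + 1) // 2 - 1]           # ((n + 1) / 2)th value
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


def median_grouped(lower_limits, width, freqs):
    """Med = L + (((n + 1) / 2 - cf) / f) * h for equal-width classes."""
    n = sum(freqs)
    position = (n + 1) / 2
    cumulative = 0                                 # cf of the classes passed so far
    for lower, f in zip(lower_limits, freqs):
        if cumulative + f >= position:             # first class containing the position
            return lower + (position - cumulative) / f * width
        cumulative += f
    raise ValueError("empty frequency table")


odd_data = [50, 55, 60, 63, 66, 69, 72, 75, 78]
even_data = odd_data + [84]
print(median_ungrouped(odd_data))    # 66
print(median_ungrouped(even_data))   # 67.5

# Table 4.7: dietary energy intake of 100 households
lower_limits = [0, 500, 1000, 1500, 2000, 2500, 3000]
households = [20, 38, 28, 4, 3, 3, 4]
energy_median = median_grouped(lower_limits, 500, households)
print(round(energy_median, 2))       # 901.32
```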

4.4. Mode
In statistics, the mode is the value that occurs most frequently in a data set or a probability distribution. Like the statistical mean and the median, the mode is a way of capturing important information about a random variable or a population in a single quantity. For instance, the mode of the sample [1, 3, 6, 6, 6, 6, 7, 7, 12, 12, 17] is 6.

Calculation of Mode in Grouped Data: In case of grouped data the following formula is used to calculate the mode.

\[ Mode = L + \frac{f - f_1}{2f - f_1 - f_2} \times h \]

Where,
L = lower limit of the modal class interval
f = frequency of the modal class
f1 = frequency of the class preceding the modal class interval
f2 = frequency of the class following the modal class interval
h = width of the modal class interval

Example
Calculate the mode of the dietary energy intake in kilocalorie indicated in Table 4.7.
Steps: The largest frequency (38) corresponds to the class (500-1000). Then, we have L = 500, f = 38, f1 = 20, f2 = 28 and h = 500.

\[ Mode = 500 + \frac{38 - 20}{2(38) - 20 - 28} \times 500 = 500 + 321.4 = 821.4 \]

Though a poor measure of central tendency, the mode has some advantages. It can be found merely by inspection from a simple frequency distribution. It is also unaffected by the presence of extreme values in a data set and can be calculated from a frequency distribution with open-ended classes. However, the mode has no significance unless a large number of observations is available, and it has little or no use for further statistical analysis. You should also note that a data set may have one mode value (unimodal distribution), two mode values (bimodal distribution), three mode values (trimodal distribution) or many mode values (multimodal distribution).
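As a minimal sketch, the mode of a raw sample and the grouped-data formula can be computed as follows. The function name `mode_grouped` is illustrative, and treating a missing neighbouring class as having frequency 0 is an assumption of this sketch that the text does not discuss.

```python
from collections import Counter


def mode_grouped(lower_limits, width, freqs):
    """Mode = L + ((f - f1) / (2f - f1 - f2)) * h for equal-width classes."""
    i = max(range(len(freqs)), key=lambda k: freqs[k])   # index of the modal class
    f = freqs[i]
    f1 = freqs[i - 1] if i > 0 else 0                    # assumption: no class before
    f2 = freqs[i + 1] if i + 1 < len(freqs) else 0       # assumption: no class after
    denominator = 2 * f - f1 - f2
    if denominator == 0:
        raise ValueError("modal class is not distinct from its neighbours")
    return lower_limits[i] + (f - f1) / denominator * width


# Mode of the raw sample quoted in the text
sample = [1, 3, 6, 6, 6, 6, 7, 7, 12, 12, 17]
sample_mode = Counter(sample).most_common(1)[0][0]
print(sample_mode)                 # 6

# Table 4.7: dietary energy intake of 100 households
lower_limits = [0, 500, 1000, 1500, 2000, 2500, 3000]
households = [20, 38, 28, 4, 3, 3, 4]
energy_mode = mode_grouped(lower_limits, 500, households)
print(round(energy_mode, 1))       # 821.4
```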

Merits and Demerits of Mode

Merits
1. Mode value is easy to understand and to calculate.
2. The modal class can be located by inspection.
3. Mode is not affected by extreme values in the distribution.
4. Mode value can be calculated for open-end frequency distributions.

Demerits
1. A data set may have more than one mode value, which makes comparison and interpretation more difficult.
2. It is difficult to locate the modal class in the case of multi-modal frequency distributions.
3. Mode is not used for further rigorous statistical analysis.

4.5. Partition Values: Quartiles, Deciles and Percentiles
The measures which are used for dividing the data into several equal parts are called partition values. The term partition value is here used to comprise all such measures as the median, quartile, pentile, hexile, heptile, octile, decile and percentile, referring to partitioning a set of data into two, four, five, six, seven, eight, ten and hundred parts of equal size, respectively. However, the commonest partition values are quartiles, deciles and percentiles. The basic purpose of all these data partitioning activities is to know more about the characteristics of a data set. In order to split a set of data into a certain number of partitions, first the values should be arranged in either ascending or descending order of magnitude, and then this ordered series is divided into the required number of equal parts. In this section, we shall discuss data analysis by dividing the data into four, ten and hundred equal parts. The corresponding partition values are called quartiles, deciles and percentiles. All these values can be determined in the same way as the median. The only difference is in their location or position.

Quartiles: The values of observations in a data set, when arranged in an ordered sequence, can be divided into four equal parts or quarters by the first quartile (Q1), the second quartile (Q2) and the third quartile (Q3). The first quartile (Q1) divides a distribution in such a way that 25% (= 1/4) of observations have a value less than or equal to Q1 and 75% (= 3/4) have a value more than or equal to Q1. Similarly, Q2 has 50% of the items with values less than or equal to Q2.

For discrete data it is simple to locate the partition values. Arrange the data in order, if they are not, and work out the cumulative frequencies. To find Q1, calculate (N + 1)/4, where N is the total number of observations. Search for the minimum cumulative frequency in which (N + 1)/4 is contained. The variate value against this cumulative frequency is the value of Q1. For Q2, find (N + 1)/2 and search for the minimum cumulative frequency in which (N + 1)/2 is contained. The variate value corresponding to this cumulative frequency is Q2. Calculate 3(N + 1)/4 and locate Q3 in the same manner as Q1 and Q2.

If the population or sample size is relatively small, (N + 1) is used instead of N, as indicated in the formula above. This is because for a very large sample or population the difference between the ratios of (N + 1) and N to the same denominator is negligible, which is not the case for a small sample or population. Hence, we use (N + 1) in geographic researches, since we usually make use of smaller sample sizes as compared to other fields of study such as psychology.

Example
Calculate Q1, Q2 and Q3 for the hypothetical data given below.

Table 4.8

Daily income   Number of workers   Cumulative frequency
4              2                   2
5              3                   5
7              2                   7
9              4                   11
11             5                   16
14             3                   19
17             4                   23
24             2                   25
28             4                   29
Total          29                  ---

Then, find where the value of 1(N + 1)/4, which means the 1(29 + 1)/4 = (7.5)th value, is contained or where it lies. This value lies at the cumulative frequency 11. Then, the Q1 of the data is 9. If you continue calculating, the Q2 and Q3 will be 11 and 17, respectively.
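The search for "the minimum cumulative frequency in which i(N + 1)/k is contained" can be sketched as below, using the data of Table 4.8. The function name and its `parts` argument (4 for quartiles, 10 for deciles, 100 for percentiles) are illustrative, and returning the largest value when the position exceeds N is an assumption of this sketch.

```python
from itertools import accumulate


def partition_value_discrete(values, freqs, i, parts):
    """ith partition value of a discrete frequency table.

    Returns the first value whose cumulative frequency is at least
    i * (N + 1) / parts.
    """
    n = sum(freqs)
    position = i * (n + 1) / parts
    for value, cum in zip(values, accumulate(freqs)):
        if cum >= position:
            return value
    return values[-1]  # assumption: fall back to the largest value


# Table 4.8: daily income and number of workers
incomes = [4, 5, 7, 9, 11, 14, 17, 24, 28]
workers = [2, 3, 2, 4, 5, 3, 4, 2, 4]

q1, q2, q3 = (partition_value_discrete(incomes, workers, i, 4) for i in (1, 2, 3))
print(q1, q2, q3)  # 9 11 17
```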

For a grouped set of data, to locate the ith quartile value, first calculate the i(N + 1)/4 th value and proceed as we do for ungrouped data above. This means searching for the minimum cumulative frequency in which the i(N + 1)/4 th value is contained. The class corresponding to this cumulative frequency is called the ith quartile class (1st Quartile, 2nd Quartile or 3rd Quartile class). The general formula for calculating quartiles in case of grouped data is:

\[ Q_i = L + \frac{\dfrac{i(n + 1)}{4} - cf}{f} \times h, \qquad i = 1, 2, 3 \]

Where,
cf = cumulative frequency prior to the ith quartile class
L = lower limit of the ith quartile class
f = frequency of the ith quartile class interval
h = width of the class interval
Qi = ith quartile value which is to be worked out

Deciles: In descriptive statistics, a decile is any of the 9 values that divide the sorted data into 10 equal parts, so that each part represents 1/10th of the sample or population. To find the deciles Di (i = 1, 2, 3, ..., 9) we calculate the value i(N + 1)/10 and search for the minimum cumulative frequency which contains the value i(N + 1)/10. The variate value corresponding to this cumulative frequency is the ith decile, Di. For grouped data, the class corresponding to this cumulative frequency is the ith decile class. The general formula for calculating deciles in case of grouped data is:

\[ D_i = L + \frac{\dfrac{i(n + 1)}{10} - cf}{f} \times h, \qquad i = 1, 2, 3, \dots, 9 \]

Similarly, calculate i(N + 1)/100 (i = 1, 2, 3, ..., 99) for percentiles and, proceeding as for quartiles and deciles, locate the percentiles.

Percentile: Percentiles are the ninety-nine values, Pi (i = 1, 2, 3, ..., 99), that divide the observations in a data set, when arranged in an ordered sequence, into hundred equal parts. So the 20th percentile is the value (or score) below which 20 percent of the observations may be found. The term percentile and the related term percentile rank are often used in descriptive statistics as well as in the reporting of scores from norm-referenced tests. The 25th percentile is also known as the first quartile (Q1), the 50th percentile as the median or second quartile (Q2), and the 75th percentile as the third quartile (Q3). The general formula for calculating percentiles in case of grouped data is:

\[ P_i = L + \frac{\dfrac{i(n + 1)}{100} - cf}{f} \times h, \qquad i = 1, 2, 3, \dots, 99 \]

Example
Let us assume that the following distribution (Table 4.9) gives the pattern of livestock population per household for 100 rural households in Woreda W in Ethiopia. Calculate the median, first quartile, 8th decile and 70th percentile of the grouped data.

Table 4.9: Calculation for Partition Values

Livestock population per household   No of Households (f)   Cumulative Frequency
10-15                                15                     15
15-20                                8                      23
20-25                                25                     48
25-30                                4                      52
30-35                                33                     85
35-40                                8                      93
40-45                                7                      100
Total                                100                    ---

Since the number of observations in the data set is 100, the median value is taken here as the (n/2)th = (100/2)th = 50th observation. This observation lies in the class interval 25-30. Applying the median formula with this position, the median number of livestock can be calculated as:

\[ Med = 25 + \frac{(100/2) - 48}{4} \times 5 = 25 + 2.5 = 27.50 \]

To calculate Q1, first find the class in which the i(N + 1)/4 th observation is contained, where N = 100. Here i(N + 1)/4 = (101)/4 = (25.25)th value, which is contained in the 20-25 class interval. Then, 20-25 is the Q1 or Quartile 1 class. You can now apply the formula above.

\[ Q_1 = L + \frac{\dfrac{i(n + 1)}{4} - cf}{f} \times h = 20 + \frac{25.25 - 23}{25} \times 5 = 20.45 \]

Similarly, to calculate D8, first find the class in which the i(N + 1)/10 th observation is contained, where N = 100. Here i(N + 1)/10 = (8 x 101)/10 = (80.8)th value, which is contained in the 30-35 class interval. Then, 30-35 is the D8 or decile 8 class, and the cumulative frequency prior to it is 52. Apply the formula above.

\[ D_8 = L + \frac{\dfrac{i(n + 1)}{10} - cf}{f} \times h = 30 + \frac{80.8 - 52}{33} \times 5 = 34.36 \]

You can also calculate P70 by using the same method as for Q1 and D8 above. First, find the class in which the i(N + 1)/100 th observation is contained, where N = 100. Here i(N + 1)/100 = (70 x 101)/100 = (70.7)th value, which is contained in the 30-35 class interval. Then, 30-35 is the P70 class.

\[ P_{70} = L + \frac{\dfrac{i(n + 1)}{100} - cf}{f} \times h = 30 + \frac{70.7 - 52}{33} \times 5 = 32.83 \]

Exercise
Calculate Q3, D7 and P72 for the grouped data below.

Table 4.10

Class Interval   Frequency
0-4              12
5-9              34
10-14            67
15-19            73
20-24            82
25-29            74
30-34            66
35-39            54
40-44            48
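Quartiles, deciles and percentiles of grouped data share one formula and differ only in the divisor, so a single helper covers all three. The sketch below assumes equal class widths and the i(n + 1)/k position used in this section; the function name and the `parts` argument are illustrative. The data are those of Table 4.9, where the cumulative frequency before the 30-35 class is 52.

```python
def partition_value_grouped(lower_limits, width, freqs, i, parts):
    """L + ((i * (n + 1) / parts - cf) / f) * h for equal-width classes.

    parts = 4 gives quartiles, 10 deciles and 100 percentiles.
    """
    n = sum(freqs)
    position = i * (n + 1) / parts
    cumulative = 0                      # cf of the classes before the current one
    for lower, f in zip(lower_limits, freqs):
        if f > 0 and cumulative + f >= position:
            return lower + (position - cumulative) / f * width
        cumulative += f
    raise ValueError("position lies beyond the last class")


# Table 4.9: livestock population per household
lower_limits = [10, 15, 20, 25, 30, 35, 40]
households = [15, 8, 25, 4, 33, 8, 7]

q1 = partition_value_grouped(lower_limits, 5, households, 1, 4)
d8 = partition_value_grouped(lower_limits, 5, households, 8, 10)
p70 = partition_value_grouped(lower_limits, 5, households, 70, 100)

print(round(q1, 2), round(d8, 2), round(p70, 2))  # 20.45 34.36 32.83
```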

Remark
Partitioning values are used in the classification or grouping of certain data sets where the intervals, invariably, differ but the frequencies of the classes remain constant. Partitioning (locational) values are not confined to quartiles, deciles and percentiles. One can also go for any other number of groups, say 5, 15, 30, 60 or 90, with the divisor in the position formula changed accordingly.

Relationships between Mean, Median and Mode: In a unimodal and symmetrical distribution, the values of mean, median and mode are equal. In other words, when these three values are not all equal to each other (as in Figure 4.1 below), the distribution is not symmetrical.

Figure 4.1: Comparison of Mean, Median and Mode (vertical axis: frequency)

Relationships between mode, median and mean           Conditions
Mode = Median = Mean                                  Unimodal and symmetrical distribution
Mean > Median > Mode; Mean - Mode = 3(Mean - Median)  The tail of the distribution extends to the right: skewed to the right or positively skewed (see Figure 4.1 above)
Mean < Median < Mode; Mode = 3 Median - 2 Mean        The tail of the distribution extends to the left: skewed to the left or negatively skewed (see Figure 4.1 above)
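The ordering rules in the table above can be turned into a small check. This is a sketch only: the function names are illustrative, the median (901.32) and mode (821.4) are the values obtained earlier for the dietary energy data of Table 4.7, and the mean is computed here from the class midpoints of that table rather than quoted from the text.

```python
import math


def empirical_mode(mean, median):
    """Empirical relationship: Mode = 3 * Median - 2 * Mean."""
    return 3 * median - 2 * mean


def skew_direction(mean, median, mode):
    """Classify a distribution from the ordering of mean, median and mode."""
    if math.isclose(mean, median) and math.isclose(median, mode):
        return "symmetrical"
    if mean > median > mode:
        return "positively skewed"
    if mean < median < mode:
        return "negatively skewed"
    return "undetermined"


# Table 4.7: class midpoints and frequencies
midpoints = [250, 750, 1250, 1750, 2250, 2750, 3250]
households = [20, 38, 28, 4, 3, 3, 4]
energy_mean = sum(f * m for m, f in zip(midpoints, households)) / sum(households)

direction = skew_direction(energy_mean, 901.32, 821.4)
print(energy_mean, direction)  # 1035.0 positively skewed

# The empirical relationship gives only a rough estimate of the mode
print(round(empirical_mode(energy_mean, 901.32), 2))  # 633.96
```

The gap between the empirical estimate and the mode found from the grouped formula shows that the relationship is an approximation, best suited to moderately skewed distributions.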

median.0 Total 520. pie-chart and bar graph for the data given below. insert the variables and percentages into the chart/graph. mode. Give your judgment whether it is an asymmetrical or symmetrical distribution based on Karl Pearson s principle. insert title.6 Maize 6. 2002 Mesay Mulugeta.1 Oats 1.e.0 Teff Sorghum 96. Table 4. Birr) Households 0 50 122 50 100 95 100 150 50 150 200 68 200 250 18 250 300 11 300 350 7 350 400 14 400 450 15 Total 400 Source: Mesay M.0 Barley 14.12: Farmland covered by major crops during 1999 crop year in Kuyu Woreda Crops Area in hectare 254. Remark also whether it is negatively or positively skewed set of data.3 Wheat 64.7 Oilseeds 54. Q3. Edit the chart/graph (i.Quantitative Methods in Social Sciences Exercise Find the mean.5 Source: Mesay M. D4 and P65 of the grouped data set below and comment on the results. 2009 72 . (2002) SPSS Practice By using SPSS software create a frequency table.11: Rural households by value of tradable material ownership in Kuyu Woreda (January 2001) Value of materials No.8 Pulses 29. and select the suitable color by using SPSS Data Editor and Chart Editor Windows) Table 4. of (Eth.

respectively. 2009 73 . Mesay Mulugeta. Find the mode value of this data set and comment whether it is positively or negatively skewed.Quantitative Methods in Social Sciences Exercise Let us assume that the mean and median of a certain skewed (asymmetrical) set of data are 117 and 86 units.

Unit Five Absolute and Relative Measures of Dispersion
Unit Objective
Having studied this chapter, you should be able to:
Explain the terms and mathematical formulas used to measure the level of data dispersion
Understand the techniques and methods of measures of dispersion
Determine the nature of variability or dispersion in any set of quantitative data
Know the purposes of measures of dispersion
Differentiate absolute and relative measures of dispersion
Identify the merits and demerits of each measure of dispersion

5.1. Introduction
The term dispersion in statistical methods refers to the variability or spread in a data set. Measures of dispersion, spread or variability provide information about the spread of the scores in quantitative techniques. They help us to know whether the scores are clustered around certain measures of central tendency or spread out over a large segment of the scale. Measures of dispersion, also known as second order analysis, are best employed when measures of central tendency fail to distinguish explicitly between two or more sets of distributions. For example, two sets of observations may provide the same arithmetic mean. See the cases of the two sets of values below.

Data Set A: 90 95 100 105 110
Sum = 500 units, Mean = 100 units, SD = 7.07 units, Range = 20 units

Data Set B: 80 90 100 110 120
Sum = 500 units, Mean = 100 units, SD = 14.14 units, Range = 40 units
Data Set B is more dispersed than Data Set A (SDB > SDA).

Look at the characteristics of the two data sets (Data Set A and Data Set B) above: in both cases the arithmetic mean is 100 units. Here, invariably, it becomes difficult to identify or distinguish their specific differences, which may lead to wrong interpretations indicating similar distributions. The actual inference can be drawn only by using a certain measure of dispersion such as the range, standard deviation or coefficient of variation.

5.2. Purposes of Measures of Dispersion
Various measures of dispersion are calculated with the following purposes, as pointed out by Agarwal, B.L. (2006):

1. To compare two or more sets of values with regard to their variability: Two or more sets of values can be compared by calculating the same or similar measures of dispersion. A set with a smaller value possesses lesser variability. For instance, compare the means and the measure of variability (i.e. standard deviation) of the two data sets below.

               Values                 Standard deviation   Mean
Data set I     7  6  9  8  10         1.6                  8
Data set II    2  120  50  300  750   304.4                244.4

Data set I is characterized by a far smaller measure of variability (SD), and its mean represents the data better than is the case for data set II.

2. To have an idea about the reliability of central values: If the scatter or variability is large, an average is less representative and less reliable; if the value of dispersion is small, it indicates that a central value is a good representative of all the values in the set. Look at the example in serial number 1 above.

3. To provide information about the structure of a series: The value of a measure of dispersion gives an idea about the spread of the observations.


4. To pave the way to the use of other statistical measures: Measures of dispersion, especially standard deviation and variance, lead to many other statistical techniques such as correlation, coefficient of variation, regression and analysis of variance (ANOVA).

A measure of dispersion, therefore, is defined as a numerical value explaining the extent to which individual observations vary among themselves. Measures of dispersion can be broadly categorized into two: absolute and relative measures of dispersion. The main difference between the two is that absolute measures of dispersion express the heterogeneity of data in the original unit of measurement, whereas relative measures are free from the unit of measurement. Absolute measures of dispersion, unlike the relative ones, therefore cannot be used to compare the degree of heterogeneity between two data sets that do not have the same unit of measurement. Figure 5.1 gives an overview of measures of dispersion.

5.1. Classification of Measures of Dispersion
5.1.1. Absolute Measures of Dispersion
Range
The range (the difference between the maximum and minimum values) is the simplest measure of data spread or variability. But if there is an outlier in the data, it will be the minimum or maximum value. Thus, the range is not robust to outliers. The range gives us information on only the extreme values and is highly unstable as a result. In other words, range as a measure of dispersion does not consider any data in the set other than the two extreme values, the maximum and the minimum.

Example
The following data is the mean minimum temperature in °C for the month of January for 10 years (1994-2004) taken from Tullubolo weather station. Find the range of the temperature records.
12.5 11.0 11.4 10.0 8.8 7.8 8.8 9.8 8.8 10.5

Solution:
Max value = 12.5 °C
Min value = 7.8 °C
Range = Max value - Min value = 12.5 °C - 7.8 °C = 4.7 °C
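The range calculation above, and the comparison of Data Sets A and B from the introduction to this unit, can be reproduced with the sketch below. The function name `data_range` is illustrative. `statistics.pstdev` is the population standard deviation (divisor N), which is the version that reproduces the values 7.07 and 14.14 quoted for the two data sets.

```python
import statistics


def data_range(values):
    """Range = maximum value - minimum value."""
    return max(values) - min(values)


# January mean minimum temperatures, Tullubolo station (°C)
january_min_temps = [12.5, 11.0, 11.4, 10.0, 8.8, 7.8, 8.8, 9.8, 8.8, 10.5]
print(round(data_range(january_min_temps), 1))  # 4.7

# Same mean, different spread
set_a = [90, 95, 100, 105, 110]
set_b = [80, 90, 100, 110, 120]
for name, data in (("A", set_a), ("B", set_b)):
    print(name,
          statistics.mean(data),                 # 100 for both sets
          data_range(data),                      # 20 for A, 40 for B
          round(statistics.pstdev(data), 2))     # 7.07 for A, 14.14 for B
```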


Figure 5.1: Classification of Measures of Dispersion
(Measures of Dispersion/Deviation. Absolute Dispersion: Range, Quartile Deviation, Mean Deviation, Standard Deviation, Variance. Relative Dispersion: Coefficient of Variation, Coefficient of Quartile Deviation, Coefficient of Mean Deviation.)

Interquartile Range (IR)
It is the difference between the third and first quartiles.

$$IR = Q_3 - Q_1$$

Quartile Deviation (QD)
It is half of the interquartile range.

$$QD = \frac{Q_3 - Q_1}{2}$$
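As a rough illustration of the two formulas, the sketch below computes Q1, Q3, IR and QD with Python's standard `statistics` module. Several conventions exist for locating quartiles in a small data set; the `"inclusive"` interpolation method used here is an assumption, so hand calculations with another convention may differ slightly. The data are the Tullubolo temperatures from the range example.

```python
import statistics


def quartile_measures(values):
    """Return (Q1, Q3, IR, QD) for a list of raw values."""
    # quantiles() sorts the data itself; n=4 yields the three quartile cut points
    q1, _q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
    ir = q3 - q1      # interquartile range
    qd = ir / 2       # quartile deviation
    return q1, q3, ir, qd


temps = [12.5, 11.0, 11.4, 10.0, 8.8, 7.8, 8.8, 9.8, 8.8, 10.5]
q1, q3, ir, qd = quartile_measures(temps)
print(f"Q1={q1:.3f}  Q3={q3:.3f}  IR={ir:.3f}  QD={qd:.4f}")
```

Because both measures are built from Q1 and Q3 only, the lowest and highest quarters of the data do not influence them.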

Exercise
The following data are the mean maximum temperatures in °C for the month of January for 10 years (1998-2008) taken from Entoto weather station. Find the Interquartile Range (IR) and Quartile Deviation (QD) for the temperature records and statistically explain the result you have calculated.

19.1  18.9  19.1  19.5  19.2  19.9  20.3  20.2  21.1  21.2


Mean Deviation (MD)
This is also an absolute measure of dispersion. It is the average of the absolute deviations taken from a central value, usually the mean or the median, ignoring the signs. Although the standard deviation is a more accurate way of finding the error margin, the average deviation is sometimes used because it is relatively easy to calculate. The average deviation is an estimate of how far the actual values are from the average value, assuming that our measuring device is accurate, and it can be used as the estimated error. It may be given as a number or as a percentage. This measure is not used in more advanced statistical techniques.

Follow the steps below to find the mean deviation:
1. Find the average value of your measurements.
2. Find the difference between your first value and the average value. This is called the deviation.
3. Take the absolute value of this deviation.
4. Repeat steps 2 and 3 for your other values.
5. Find the average of the absolute deviations. This is the average deviation.

The formula for MD is

$$MD = \frac{\sum |x_i - \bar{x}|}{N} \quad \text{for raw data}$$

$$MD = \frac{\sum f\,|x_i - \bar{x}|}{N} \quad \text{for grouped data}$$

Where the symbols have their usual meanings: $\bar{x}$ is the mean, $f$ is the class frequency and $N$ is the total number of observations. The two vertical bars | | indicate absolute values, that is, deviations taken with their negative signs ignored. In the case of grouped data, the mid-point of each class interval is treated as $x_i$.
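The five steps and the two formulas can be sketched in Python as follows. The function names and both small data sets are hypothetical and serve only to illustrate the calculation.

```python
def mean_deviation(values):
    """Mean deviation of raw data: average absolute deviation from the mean."""
    n = len(values)
    mean = sum(values) / n                         # step 1
    return sum(abs(x - mean) for x in values) / n  # steps 2 to 5


def grouped_mean_deviation(midpoints, freqs):
    """Mean deviation of grouped data, using class mid-points as x_i."""
    n = sum(freqs)
    mean = sum(f * x for x, f in zip(midpoints, freqs)) / n
    return sum(f * abs(x - mean) for x, f in zip(midpoints, freqs)) / n


# Hypothetical raw data: mean 6, absolute deviations 4, 2, 0, 2, 4
raw_md = mean_deviation([2, 4, 6, 8, 10])

# Hypothetical grouped data: three classes with mid-points 5, 15 and 25
grouped_md = grouped_mean_deviation([5, 15, 25], [2, 5, 3])

print(raw_md, grouped_md)
```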

Exercise
Calculate mean deviation for the data given below.

Monthly minimum temperature of Bushoftu town (Feb. 1994-2001) in °C
[Most of the table values are garbled in the source; the legible entries are 28.2, 27.5 and 27.1.]
Source: NMSA

Variance
It is simply the mean of the squared deviations from a central value, generally the arithmetic mean (i.e., the average of the squared deviations). The symbol for variance is $s^2$ or $\sigma^2$, accompanied by a subscript for the corresponding variable. Here is the formula for the variance of variable x for ungrouped or raw data:

$$\sigma^2 = \frac{\sum (X_i - \mu)^2}{N} \quad \text{and} \quad s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$$

Where,
$\sigma^2$ = (sigma square) population variance
$s^2$ = sample variance
$x_i$ = population or sample values
$\bar{x}$ = sample mean
$\mu$ = population mean
$N$ = number of observations in the population
$n$ = number of observations in the sample

Variance is an important measure in statistics, particularly in assessing variation between two or more samples of a population. One very powerful statistical technique known as analysis of variance (ANOVA) uses variance to help decide whether two or more sets of samples differ significantly from each other. ANOVA will be discussed later in this course. The main demerit of variance is that its unit is the square of the unit of measurement of the values, and its value is often large, making it difficult to judge the magnitude of variation.

In the case of grouped data, the mid-values of the classes are taken as $x_i$ and consequently we can make use of the formulas below:

$$\sigma^2 = \frac{\sum f\,(X_i - \mu)^2}{N} \quad \text{and} \quad s^2 = \frac{\sum f\,(x_i - \bar{x})^2}{n - 1}, \quad n = \sum f$$

In the formulas above, the numerator, $\sum (x_i - \bar{x})^2$, is called the total sum of squares (TSS). TSS measures the total variation among the values in a data set, while variance measures the average variation. The larger the value of TSS or variance, the greater the variation among the values of the data set.

Properties of Variance
Values are squared only to get rid of negative signs. If they are not squared, no measure can be obtained because $\sum (X_i - \bar{X})$ is always zero.
The main drawback of variance is that its unit is the square of the unit of measurement of the observations. This results in a larger value of the variance, making it more difficult to interpret.
The variance gives more weight to extreme values than to those near the mean. This is because the differences are squared.

Standard Deviation
Standard deviation, also known as root mean squared deviation, explains the average amount of variation on either side of the mean. It is another way to calculate dispersion. It is considered to be the best measure of dispersion and is the most widely used, even in advanced techniques. The population ($\sigma$) and the sample ($s$) standard deviations are the positive square roots of their respective variances and have the desirable property of being in the same units as the data. That is, if the data are in hectares, the standard deviation is also in hectares. The standard deviation formula for ungrouped (raw) sample data is

$$s = \sqrt{\frac{\sum (x_i - \bar{x})^2}{n - 1}}$$

Notice the difference between the sample and population standard deviations. The sample standard deviation uses n - 1 in the denominator and hence is slightly larger than the population standard deviation, which uses N. The SD formula for ungrouped population data is

$$\sigma = \sqrt{\frac{\sum (x_i - \mu)^2}{N}}$$

Where,
$\sigma$ = (sigma) population SD
$\mu$ = population mean
$N$ = number of observations in the population
$n$ = number of observations in the sample

We have already discussed the use of Greek letters for population parameters and Latin letters for sample statistics. This is why $s$ is used for the sample standard deviation and $\sigma$ (sigma) is used for the population standard deviation. However, another sigma, the capital one ($\Sigma$), appears inside the formula. It serves to indicate that we are adding things up. What is added up are the deviations from the mean, $(x_i - \bar{x})$. But the average deviation from the mean is always zero. Occasionally the mean deviation, which averages the absolute values $|x_i - \bar{x}|$, is used instead. However, a better measure of variation comes from squaring each deviation, summing those squares, dividing by the number of data elements (or one less than that), and then taking the square root. If you compare this with the formula for the quadratic mean you will realize that we are doing the same thing, except for what we are dividing by.

The SD formulae for grouped population and sample data are:

$$\sigma = \sqrt{\frac{\sum f\,(x_i - \mu)^2}{N}} \quad \text{and} \quad s = \sqrt{\frac{\sum f\,(x_i - \bar{x})^2}{n - 1}}$$

Properties of Standard Deviation
The drawbacks arising from the squared units of the variance are overcome in this measure of dispersion.
One simple advantage of SD over variance is that, whenever the SD exceeds 1, it is smaller than the variance and therefore easier to relate to the individual values. The variance of a set of data can even be larger than any individual value in the set.
The demerit of standard deviation, however, is that the variability of two or more sets of data cannot be compared if the unit of measurement of the variables is not the same.
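The difference between the population formulas (divisor N) and the sample formulas (divisor n - 1) can be seen in a short Python sketch. The function names and the eight data values are hypothetical and chosen so that the arithmetic is easy to follow by hand.

```python
import math


def variance(values, sample=True):
    """Variance of raw data: divisor n - 1 for a sample, N for a population."""
    n = len(values)
    mean = sum(values) / n
    tss = sum((x - mean) ** 2 for x in values)  # total sum of squares (TSS)
    return tss / (n - 1) if sample else tss / n


def std_dev(values, sample=True):
    """Standard deviation: positive square root of the variance."""
    return math.sqrt(variance(values, sample))


data = [2, 4, 4, 4, 5, 5, 7, 9]  # hypothetical values, mean = 5, TSS = 32

pop_var, pop_sd = variance(data, sample=False), std_dev(data, sample=False)
samp_var, samp_sd = variance(data), std_dev(data)

print("population:", pop_var, pop_sd)
print("sample:    ", round(samp_var, 4), round(samp_sd, 4))
```

The sample values are slightly larger than the population values because the same TSS is divided by 7 rather than 8.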

5.1.2. Relative Measures of Dispersion
Relative measures of dispersion, also known as coefficients, are a third-order analysis and are used when the absolute dispersion measures are the same or when they are not able to distinguish the characteristics of the distributions. Also, the characteristics of distributions having different units, say rainfall and temperature distributions, whether temporal or spatial, cannot be compared by measures of absolute dispersion or location. Here, it is the coefficients which can tell whether rainfall or temperature is more dispersed.

Coefficient of Quartile Deviation (CQD)
This is an absolute quantity (unitless) and is useful to compare the variability among the middle 50% of observations. It excludes the lowest and highest values, as a result of which it is not affected by extreme values.

$$CQD = \frac{Q_3 - Q_1}{Q_3 + Q_1}$$

Coefficient of Range or Range Coefficient (CR)

$$CR = \frac{\text{Maximum} - \text{Minimum}}{\text{Maximum} + \text{Minimum}}$$

Because it depends only on the two extreme values, it does not show the scattering of the central values. It is, therefore, not considered a good measure of dispersion and is not commonly used.

Coefficient of Variation (CV)
Also known as relative variability, it is a normalized measure of dispersion of a distribution. It is defined as the ratio of the standard deviation to the mean:

$$CV = \frac{s}{\bar{x}}$$

This is only defined for a non-zero mean, and is most useful for variables that are always positive. It is often reported as a percentage (%) by multiplying the above formula by 100. It is a dimensionless number. The coefficient of variation describes the magnitude of the values of the data and the variation within them. It is useful because the standard deviation of a data set must always be understood in the context of the mean of the data. So when comparing data sets with different units or wildly different means, one should use the coefficient of variation for comparison instead of the standard deviation. When the mean value is near zero, the coefficient of variation is sensitive to small changes in the mean, limiting its usefulness.

Example
Let the following table indicate 65 sample farmers by their annual crop yield in Adama district. Find the average deviation, variance, standard deviation and coefficient of variation (CV) of the data.

Table 5.1: Number of Sample Farmers by their Annual Crop Output: Hypothetical Data

Crop Output in Quintals   Number of Farmers
10 - 19                    3
20 - 29                    5
30 - 39                    8
40 - 49                   14
50 - 59                   12
60 - 69                   10
70 - 79                    7
80 - 89                    4
90 - 99                    2

In order to calculate the requested values, we must first find the class midpoint ($x$), $fx$, the mean ($\bar{x}$), $|x - \bar{x}|$, $f|x - \bar{x}|$, $(x - \bar{x})^2$ and $f(x - \bar{x})^2$. Look at the table below.

Table 5.2:

Output   Class mid-point (x)    f      fx    |x - x̄|   f|x - x̄|   x - x̄   (x - x̄)²   f(x - x̄)²
10 - 19        14.5             3     43.5      38        114       -38      1444       4332
20 - 29        24.5             5    122.5      28        140       -28       784       3920
30 - 39        34.5             8    276.0      18        144       -18       324       2592
40 - 49        44.5            14    623.0       8        112        -8        64        896
50 - 59        54.5            12    654.0       2         24         2         4         48
60 - 69        64.5            10    645.0      12        120        12       144       1440
70 - 79        74.5             7    521.5      22        154        22       484       3388
80 - 89        84.5             4    338.0      32        128        32      1024       4096
90 - 99        94.5             2    189.0      42         84        42      1764       3528
Total           ---            65   3412.5     202       1020       ---       ---      24240
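The working columns of Table 5.2 can also be generated programmatically, which is a convenient way to check hand calculations. The sketch below uses the class mid-points and frequencies of Table 5.1; the variable names are illustrative. It computes the variance with both divisors, n and n - 1, because the two choices give noticeably different values (about 372.92 and 378.75 respectively).

```python
import math

# Class mid-points and frequencies from Table 5.1
midpoints = [14.5, 24.5, 34.5, 44.5, 54.5, 64.5, 74.5, 84.5, 94.5]
freqs = [3, 5, 8, 14, 12, 10, 7, 4, 2]

n = sum(freqs)                                                # number of farmers
mean = sum(f * x for x, f in zip(midpoints, freqs)) / n       # sum(fx) / n

sum_f_abs = sum(f * abs(x - mean) for x, f in zip(midpoints, freqs))
sum_f_sq = sum(f * (x - mean) ** 2 for x, f in zip(midpoints, freqs))

md = sum_f_abs / n                 # mean deviation
var_n = sum_f_sq / n               # variance with divisor n
var_n1 = sum_f_sq / (n - 1)        # variance with divisor n - 1 (sample)
sd = math.sqrt(var_n1)             # sample standard deviation
cv = sd / mean                     # coefficient of variation

print(f"n={n}  mean={mean}  MD={md:.2f}")
print(f"variance (n)={var_n:.2f}  variance (n-1)={var_n1:.2f}")
print(f"SD={sd:.2f}  CV={cv:.2%}")
```

With either divisor the coefficient of variation rounds to about 37%, so the interpretation of the example does not depend on that choice.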

Solution:

$$\text{Mean: } \bar{x} = \frac{\sum fx}{n} = \frac{3412.5}{65} = 52.5$$

$$MD = \frac{\sum f\,|x - \bar{x}|}{n} = \frac{1020}{65} = 15.69$$

$$s^2 = \frac{\sum f\,(x - \bar{x})^2}{n - 1} = \frac{24240}{64} = 378.75$$

$$s = \sqrt{\frac{\sum f\,(x - \bar{x})^2}{n - 1}} = \sqrt{378.75} = 19.46$$

$$CV = \frac{s}{\bar{x}} = \frac{19.46}{52.50} = 0.37 = 37\%$$

Though it is not within the scope of this resource material to deal with them, there are also statistical techniques to measure inequalities, concentrations and diversifications of a variable. These include, for instance, the Lorenz Curve, Gini Coefficient, Theil Index of Inequality, Herfindahl-Hirschman Index and Tideman-Hall Index.

Exercises
1. The following table is a hypothetical data containing the number of urban households by their monthly income in Kebele K.
a) Find the mean deviation, variance, standard deviation and coefficient of variation (CV) of the data based on the example given above.

Table 5.3:

Yield in Quintals   Number of Farmers
 501 -  750          16
 751 - 1000          24
1001 - 1250          19
1251 - 1500          28
1501 - 1750          10
1751 - 2000          12
2001 - 2250           8
2251 - 2500           6
2501 - 2750           6
2751 - 3000           2
Total               131

b) Try to answer the question above by using SPSS software and compare your answers with the ones you have found in question (a).

9 25.2 29.5: Monthly maximum temperature of Adama in oC (2000 Year Recorded 2000 2001 2002 2003 2004 2005 Mean Max SD CV (%) Jan 27.3 24.6 32. Which set of data is more dispersed? Table 5.5 Dec 25.7 26.4: Data Set x: 450 465 Data Set y: 45 54 87 56 112 76 232 45 132 34 233 54 546 345 435 67 342 10 23 27 460 765 500 496 440 392 389 871 986 567 457 987 876 234 345 567 987 432 3.1 29.8 29.8 26.8 28.6 27.9 Nov 25.6 29.2 27 27.2 30.2 May 30.7 29. Calculate the mean maximum values.7 25.6 25. 2009 85 .6 Oct 25.8 27.7 28. standard deviation and coefficient of variation for the two sets of data below and comment on the results you have calculated.7 2005) Sep 26.4 * 29.3 29.8 26. Calculate the variance.4 28.3 27.8 28.7 27.3 25.5 26.8 28.Quantitative Methods in Social Sciences 2.6 Aug 25.4 26. Comment on the results you have found. Table 5.6 Feb 28.9 30.3 26.6 Jul 25.5 28.8 29.9 26 26.5 26. standard deviation and coefficient of variation for the temperature data given below.9 29.7 30.9 Apr 30.2 26.3 29.8 27.8 31 29.1 28.3 28.6 Mar 30.5 26.5 25.8 29.8 29.2 29.3 27.2 28.6 25.1 Jun 29.5 29.3 26.2 27 27.3 Source: NMSA Mesay Mulugeta.8 32.

Unit Six Skewness, Moments and Kurtosis
Unit objectives
Having studied this chapter, you should be able to:
Explain the terms and mathematical formulas used to determine Skewness, moments and kurtosis
Understand the techniques and methods of measures of Skewness, moments and kurtosis
Determine the nature of Skewness and kurtosis in any set of quantitative data
Know the purposes of measures of Skewness, moments and kurtosis
Appreciate the use of Skewness, moments and kurtosis in quantitative data analysis

6.1. Introduction
In the previous chapters we have discussed measures of location and variation of a data set to describe the nature of individual values in the data set. However, the analysis of a data set remains incomplete until we measure the degree to which the individual values deviate from symmetry on both sides of the central value and the direction in which they are distributed.

6.2. Measure of Skewness
A frequency distribution that is not symmetrical is called asymmetrical or skewed. In a skewed distribution, extreme values in the data set incline towards one side or tail of the distribution, thereby producing a longer tail in one direction. If extreme values incline towards the upper or right tail, the distribution is known as positively skewed. Where such values incline towards the lower or left tail, the distribution is said to be negatively skewed. The relationship between the three measures of central tendency (mean, median and mode) tells us the nature of the Skewness of the data set. In the case of a symmetrical distribution, Mean = Median = Mode. For a positively skewed distribution Mean > Median > Mode, while Mean < Median < Mode for a negatively skewed distribution.

Graphically, in the case of a symmetrical distribution the lengths of the two segments of the curve on either side of the peak are equal, while in an asymmetrical curve one of the segments or tails of the curve is longer than the other. Look at Figure 6.1.

Figure 6.1: Symmetrical and Skewed Distribution

1. Symmetric distribution: If there is no Skewness, or the distribution is symmetrical (and unimodal) like the bell-shaped normal curve, then mean = median = mode.
2. Positive skew: The right tail is longer; the mass of the distribution is concentrated on the left of the figure. It has a few relatively high values. The distribution is said to be right-skewed. In such a distribution, the mean is greater than the median, which in turn is greater than the mode (i.e. mean > median > mode), in which case the Skewness coefficient is greater than zero.
3. Negative skew: The left tail is longer; the mass of the distribution is concentrated on the right of the figure. It has a few relatively low values. The distribution is said to be left-skewed. In such a distribution, the mean is lower than the median, which in turn is lower than the mode (i.e. mean < median < mode), in which case the Skewness coefficient is lower than zero.

The degree of Skewness can be measured both as Absolute Skewness and as Relative Skewness or Coefficient of Skewness. For an asymmetrical distribution, the distance between the mean and the mode may be used to measure the degree of Skewness, because the mean is equal to the mode in a symmetrical distribution.

Absolute Skewness (Sk) = Mean - Mode

For a positively skewed distribution, Mean > Mode and therefore Sk is positive; otherwise it is negative. Other than the absolute method of measuring Skewness stated above, there are at least three important relative measures of Skewness. These are Karl Pearson's Coefficient of Skewness, Bowley's Coefficient of Skewness and Kelly's Coefficient of Skewness. For the purpose of this course, you will observe the procedures used by Karl Pearson to measure the Skewness of a data set. The measure suggested by Karl Pearson for the coefficient of Skewness is given by:

$$Sk_P = \frac{\text{Mean} - \text{Mode}}{SD}$$

Where, $Sk_P$ = Karl Pearson's Coefficient of Skewness.

Since a mode does not always exist uniquely in a distribution, it is convenient to define the coefficient using the median. From our previous discussion in this course, you know the relationship between mode, median and mean:

$$\text{Mean} - \text{Mode} = 3(\text{Mean} - \text{Median}), \quad \text{i.e.} \quad \text{Mode} = 3\,\text{Median} - 2\,\text{Mean}$$

By substituting this value of the mode in the formula above, we get:

$$Sk_P = \frac{3(\text{Mean} - \text{Median})}{SD}$$

Note: The value of $Sk_P$ theoretically varies between +3 and -3.

Example
Calculate the mode, mean, median and Pearson's Coefficient of Skewness for the set of data indicated below. Comment on the results!

Table 6.1:

Class Interval (x)   Frequency (f)
21 - 25                5
26 - 30               15
31 - 35               41   (f0, median class)
36 - 40               42   (f1, modal class)
41 - 45                2   (f2)
46 - 50               12
51 - 55                3
                   N = 120

Solution
By inspection of the table above, the mode lies in the class 36 - 40. Then, by applying the formula we have discussed earlier, the mode can be calculated as:

$$\text{Mode} = L + \frac{f_1 - f_0}{2f_1 - f_0 - f_2} \times h = 36 + \frac{42 - 41}{2 \times 42 - 41 - 2} \times 4 \approx 36.098$$

To calculate the mean, we can make use of any method (formula) we have seen in our earlier discussions.

Table 6.2:

Class Interval (x)   Midpoints (x)   Frequency (f)   fx
21 - 25                  23               5           115
26 - 30                  28              15           420
31 - 35                  33              41          1353
36 - 40                  38              42          1596
41 - 45                  43               2            86
46 - 50                  48              12           576
51 - 55                  53               3           159
------                 ------          ------    Σfx = 4305

$$\text{Mean} = \frac{4305}{120} = 35.875$$

Now we can calculate the remaining measure of central tendency. What is it? It is the median, which follows from the relationship between the three measures:

$$\text{Mode} = 3\,\text{Median} - 2\,\text{Mean}$$
$$3\,\text{Median} = \text{Mode} + 2\,\text{Mean}$$
$$\text{Median} = \frac{\text{Mode} + 2\,\text{Mean}}{3} = \frac{36.098 + 2 \times 35.875}{3} = 35.95$$

Now we only need the standard deviation in order to calculate Pearson's Coefficient of Skewness. The standard deviation can be calculated based on the formula below (the mean is rounded to 36 in Table 6.3):

$$\sigma = \sqrt{\frac{\sum f\,(x_i - \mu)^2}{N}}$$

Table 6.3:

Class Interval (x)   Midpoints (x)   Mean (µ)   x - µ   (x - µ)²     f    f(x - µ)²
21 - 25                  23             36       -13      169        5       845
26 - 30                  28             36        -8       64       15       960
31 - 35                  33             36        -3        9       41       369
36 - 40                  38             36         2        4       42       168
41 - 45                  43             36         7       49        2        98
46 - 50                  48             36        12      144       12      1728
51 - 55                  53             36        17      289        3       867
Total                                                               120      5035

$$\sigma = \sqrt{\frac{\sum f\,(x_i - \mu)^2}{N}} = \sqrt{\frac{5035}{120}} = 6.5$$

Then, Pearson's Coefficient of Skewness is

$$Sk_P = \frac{3(\text{Mean} - \text{Median})}{SD}$$

$$Sk_P = \frac{3(35.875 - 35.95)}{6.5} = -0.035$$

Since the coefficient of Skewness is -0.035 ($Sk_P$ = -0.035), the distribution is slightly skewed to the left (slightly negatively skewed). Thus, the tail of the distribution extends slightly towards the lower values, to the extent of 3.5%.

Besides the measures discussed above, there are various other statistical measures of Skewness. The most common one, properly known as Momental Skewness, is calculated using the following equation:

$$\text{Skewness} = \frac{\sum (x - \bar{x})^3}{n\,\sigma^3}$$

Where, $\sum (x - \bar{x})^3$ denotes the sum of the cubes of the deviations of the values from their mean, $\sigma$ is the standard deviation and $n$ is the number of values. The value of Skewness for a symmetrical distribution is zero. Logically, positive values of the index indicate positive Skewness and negative values indicate negative Skewness.

Remark: Skewness is an important concept in geographical statistics because very many of the variables measured in geographical studies show highly skewed distributions. This fact has two important consequences. First, other descriptive measures, particularly the mean, may be misleading if used in isolation. For highly skewed data the mean on its own is not a very informative measure. Secondly, it casts doubt upon the validity of applying parametric statistical tests to these data. At this juncture you have to remember that a normal distribution is symmetrical and has a Skewness of zero. A high degree of Skewness is one sign that sample data are not normally distributed. They are therefore unlikely to come from a population which is normally distributed.

6.3. Moments
Measures of moments include the mean, average deviation and standard deviation. As in physics, the term moments is affected by (i) the size of the class interval, representing the force, and (ii) the deviation of the mid-value of each class from an observation, representing the distance. The value of these measures is obtained by taking the deviation of individual observations from a given origin. Moments can be calculated about the mean, about an arbitrary mean, about zero (the origin) and about any arbitrary point in a set of data. For example, let $x_1, x_2, \ldots, x_n$ be the n observations in a data set with the mean $\bar{x}$. Then the rth moment about the actual mean of a variable, for both ungrouped and grouped data, is given by:

For ungrouped data

$$m_r = \frac{\sum (x - \bar{x})^r}{n}, \quad r = 1, 2, 3, 4$$

For grouped data

$$m_r = \frac{\sum f\,(x - \bar{x})^r}{n}, \quad r = 1, 2, 3, 4, \quad n = \sum f$$

Exercise
Calculate the first four moments about the mean for the following grouped data and explain the results you have found. Explain whether the distribution is symmetric or asymmetric, and whether it is positively skewed or negatively skewed. Comment also whether the distribution is Leptokurtic, Platykurtic or Mesokurtic.

Table 6.4:

Class Interval   Frequency
22 - 26            2
27 - 31            5
32 - 36            4
37 - 41            8
42 - 46           10
47 - 51            9
52 - 56            3

6.4. Kurtosis
Kurtosis (from the Greek word kyrtos or kurtos, meaning bulging) describes the degree of concentration of frequencies in a given distribution, that is, whether the observed values are concentrated more around the mode (a peaked curve) or away from the mode towards both tails of the frequency curve. In other words, the term kurtosis in statistics refers to the degree of flatness or peakedness in the region about the mode of a frequency curve.

Note that two or more distributions may have identical average, variation and Skewness, but they may show different degrees of concentration of values around the mode and hence different degrees of peakedness. Like Skewness, Kurtosis gives valuable information about the distribution of a set of data values, in addition to that provided by the mean and standard deviation. The usual measure of Kurtosis is calculated with the following equation:

$$\text{Kurtosis} = \frac{\sum (x - \bar{x})^4}{n\,\sigma^4} = \frac{\dfrac{\sum (x - \bar{x})^4}{n}}{\left(\dfrac{\sum (x - \bar{x})^2}{n}\right)^2}$$

Where, $\sum (x - \bar{x})^4$ denotes the sum of the fourth powers of the deviations of the values from the mean, $\sum (x - \bar{x})^4 / n$ denotes the fourth moment around the mean, $\sigma$ is the standard deviation and $n$ is the number of values.

A normal distribution has a kurtosis of 3.0, while a very peaked (leptokurtic) distribution and a very flat (platykurtic) distribution have a kurtosis greater than 3.0 and less than 3.0, respectively. Kurtosis is sometimes defined instead as the value above minus 3. This second definition is used so that the standard normal distribution has a kurtosis of 0. In addition, with the second definition positive kurtosis indicates a peaked distribution and negative kurtosis indicates a flat distribution.

Kurtosis is based on the size of a distribution's tails. Distributions with relatively large tails are called leptokurtic; those with small tails are called platykurtic. A distribution with the same kurtosis as the normal distribution is called mesokurtic.

Exercises
A) Find the Variance, Skewness and Kurtosis of the following frequency distribution by the method of moments. Explain the results you have calculated.

Height in Inches   No of Soldiers (f)
58 - 61              11
62 - 65              14
66 - 69              11
70 - 73               6
74 - 77               8

B) Compute the first four moments about the mean from the following data. Comment on the result you have found.

Mid-value of a grouped data   Frequency
17                              8
22                             12
27                             13
32                             16
37                              5
42                              7
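The moment-based measures of this unit (moments about the mean, Momental Skewness and Kurtosis) and Pearson's coefficient can be sketched in Python as follows. The function names are illustrative, and the symmetric grouped data set is hypothetical; it is not one of the exercise tables. The last line re-uses the rounded values quoted in the worked example for Table 6.1 (mean 35.875, median 35.95, SD 6.5).

```python
def central_moment(midpoints, freqs, r):
    """r-th moment about the mean for grouped data: sum(f * (x - mean)^r) / n."""
    n = sum(freqs)
    mean = sum(f * x for x, f in zip(midpoints, freqs)) / n
    return sum(f * (x - mean) ** r for x, f in zip(midpoints, freqs)) / n


def moment_skewness(midpoints, freqs):
    """Momental skewness: m3 / sigma^3, where sigma^2 = m2."""
    m2 = central_moment(midpoints, freqs, 2)
    m3 = central_moment(midpoints, freqs, 3)
    return m3 / m2 ** 1.5


def moment_kurtosis(midpoints, freqs):
    """Kurtosis: m4 / m2^2 (3.0 for a normal distribution)."""
    m2 = central_moment(midpoints, freqs, 2)
    m4 = central_moment(midpoints, freqs, 4)
    return m4 / m2 ** 2


def pearson_skewness(mean, median, sd):
    """Karl Pearson's coefficient of skewness: 3 (mean - median) / SD."""
    return 3 * (mean - median) / sd


# Hypothetical symmetric grouped data (class mid-points and frequencies)
mids = [10, 20, 30, 40, 50]
fs = [1, 4, 6, 4, 1]

m1, m2 = central_moment(mids, fs, 1), central_moment(mids, fs, 2)
skew = moment_skewness(mids, fs)
kurt = moment_kurtosis(mids, fs)
print(m1, m2, skew, kurt, kurt - 3)   # kurt - 3 is the second definition

sk_p = pearson_skewness(35.875, 35.95, 6.5)
print(round(sk_p, 3))
```

For this symmetric data set the first and third moments are zero, so the Skewness is zero, and the Kurtosis of 2.5 is below 3.0, which makes the distribution slightly platykurtic.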

Unit Seven Elementary Spatial Analysis
Unit objectives
At the end of this chapter, you should be able to:
Explain the basic concept of spatial statistics
Differentiate several techniques of spatial statistics
Apply the techniques of spatial statistics in relevant aspects of geographic studies
Appreciate the use of spatial statistics in quantitative analysis of geographic data

7.1. Introduction
Although there may be disagreement in detail about the nature and purpose of geography, it can reasonably be argued that one of the central themes of geographical enquiry is the appreciation of the distribution of phenomena on the earth's surface. Although a wide variety of statistical techniques have been applied by geographers to data which have been collected on an areal basis, only a handful of techniques are in common use for analyzing the spatial distribution of these data. In view of the enthusiasm with which geographers embraced the "quantitative revolution", it is, perhaps, surprising that remarkably little progress has been made in the development and application of spatial statistics.

There are several possible explanations for this state of affairs. First, there is a lack of pre-existing theory and methods of statistical analysis applicable to spatial distributions. Neither geographers nor statisticians have shown a great deal of interest in spatial statistics in the past, and geographers are in a minority in demanding them now. Secondly, many of the techniques which do exist are, at least at first sight, difficult to understand and tedious to use. It is certainly true that some of the most advanced spatial techniques cannot be applied to solve real problems and analyze data without the aid of a computer. Currently, very essential computer software, such as ArcGIS, has been introduced into this specific area of geography. ArcGIS and other related software have remarkably eased spatial geographic data analysis at present.


Despite these reservations, there are several simple spatial techniques which can be applied using manual methods of calculation, and which are intuitively easy to understand. It may well be that the present underuse of spatial techniques is merely due to lack of publicity. Various types of phenomena can be studied using these techniques: points, lines and areas, and a number of different characteristics can be measured. These include central tendency, dispersion, shape, pattern and spatial relationships.

7.2. Central Tendency in Point Patterns
In our previous discussions various measures of central tendency were applied to sets of non-spatial tabular data. Each of these measures was described as giving some indication of the average value in a set of data, or the centre of a frequency distribution. When dealing with spatial distributions the concept of a centre is intuitively reasonable. But there are several ways by which the spatial position of such a centre can be calculated, each of which will give a different result. It is important to realize that there is no one correct answer to the problem of finding the centre of a spatial distribution. Each measure has a different interpretation and the choice should be determined by the nature of the problem.

7.2.1. Mean Centre of Point Distribution
The mean centre is the simplest measure of the centre of a spatial distribution. It is analogous to the mean of a set of data, and is calculated in a very similar way. Figure 7.1 shows a hypothetical spatial distribution of points. It could be the distribution of towns or of any other geographical phenomenon. As a first step in calculating the mean centre it is necessary to devise some way of quantifying the locations of the points by a co-ordinate system. This can be done by laying a set of rectangular axes over the map showing the locations and reading off the co-ordinates of each point. With reference to these X and Y axes, the co-ordinates can be measured either in centimeters/inches on the map or as ground distances by using the map scale. For the calculation of most spatial statistics the position of points needs to be measured in relation to some such co-ordinate system. The orientation of the co-ordinate grid is quite arbitrary, however. Geographers are used to measuring location in terms of eastings and northings, but there is no reason why they should not use, say, south-eastings and north-eastings. Similarly, the origin of the grid, the point from which the co-ordinates are measured, is arbitrary. For example, the national grid origin of Ethiopia is found at 0° N latitude and 34°30' E longitude. The only prerequisites of a co-ordinate system which is to be used in the calculation of spatial statistics are:
1. The co-ordinate axes must be at right angles to each other; in other words, they must be orthogonal axes.
2. Measurements along the two axes must be made in the same units.

In Figure 7.1 below, an arbitrary co-ordinate system has been superimposed, with its origin at the bottom left hand corner. For simplicity the horizontal axis, measuring eastings, has been labeled x, and the vertical axis, measuring northings, has been labeled y. The axes have been marked off in arbitrary distance units. The co-ordinates of all points are given in Table 7.1.

Figure 7.1: Identification of Spatial Mean Center

[Figure 7.1 shows the eight points T1 to T8 plotted on the co-ordinate grid. Map scale 1 : 50,000]

The mean centre can now be found simply by calculating the mean of the x co-ordinates (eastings) and the mean of the y co-ordinates (northings). These two mean co-ordinates mark the location of the mean centre. The equation for the mean centre is thus:

$$\bar{x} = \frac{\sum x}{n}, \qquad \bar{y} = \frac{\sum y}{n}$$


Where x and y are the co-ordinates of the points, $\bar{x}$ and $\bar{y}$ are the means of the x and y co-ordinates respectively, and n is the number of points. The calculation of the mean centre for Figure 7.1, and hence its position, is given in Table 7.1.

Table 7.1: Finding the spatial mean centre

Points or Towns   Co-ordinate Values in cm
                     X        Y
T1                  1.0      6.7
T2                  6.5      6.6
T3                 11.6      6.7
T4                  1.9      4.4
T5                 10.9      4.4
T6                  7.0      3.1
T7                  4.1      0.9
T8                 12.0      1.0
Total              55.0     33.8

$$n = 8, \quad \sum x = 55.0, \quad \sum y = 33.8$$

$$\bar{x} = \frac{55.0}{8} = 6.875, \qquad \bar{y} = \frac{33.8}{8} = 4.225$$

The co-ordinates of the spatial mean center are, therefore, (6.875, 4.225). With these we can easily locate the spatial mean center of the towns in Figure 7.1 above. The spatial mean center of population can also be calculated, provided that the population size of each town (P1, P2, ...) is given.

The simple formula to calculate the spatial mean of population is:

$$X \text{ co-ordinate} = \frac{P_1X_1 + P_2X_2 + P_3X_3 + \cdots + P_nX_n}{P_1 + P_2 + P_3 + \cdots + P_n}$$

$$Y \text{ co-ordinate} = \frac{P_1Y_1 + P_2Y_2 + P_3Y_3 + \cdots + P_nY_n}{P_1 + P_2 + P_3 + \cdots + P_n}$$
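Both the simple and the population-weighted mean centre can be sketched in Python. The co-ordinates are those listed in Table 7.1, whose columns sum to 55.0 and 33.8, giving a mean centre of (6.875, 4.225). The population figures attached to the eight towns are hypothetical and are only there to show how weighting pulls the centre towards the larger towns.

```python
# Town co-ordinates in cm, as listed in Table 7.1
towns = {
    "T1": (1.0, 6.7), "T2": (6.5, 6.6), "T3": (11.6, 6.7), "T4": (1.9, 4.4),
    "T5": (10.9, 4.4), "T6": (7.0, 3.1), "T7": (4.1, 0.9), "T8": (12.0, 1.0),
}


def mean_centre(points):
    """Unweighted mean centre: mean of the x values and mean of the y values."""
    n = len(points)
    return (sum(x for x, _ in points) / n, sum(y for _, y in points) / n)


def weighted_mean_centre(points, weights):
    """Population-weighted mean centre: sum(P_i * X_i) / sum(P_i), same for Y."""
    total = sum(weights)
    wx = sum(p * x for (x, _), p in zip(points, weights)) / total
    wy = sum(p * y for (_, y), p in zip(points, weights)) / total
    return (wx, wy)


points = list(towns.values())
centre = mean_centre(points)
print("Mean centre:", centre)

# Hypothetical populations of T1 to T8 (not taken from the text)
populations = [5000, 5200, 2800, 7800, 3200, 1900, 4300, 6000]
pop_centre = weighted_mean_centre(points, populations)
print("Population mean centre:", tuple(round(c, 3) for c in pop_centre))
```

When every town is given the same weight, the weighted formula reduces to the simple mean centre.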


7.2.2. Spatial Median Centre
The definition of the spatial median center in this course is that it is the intersection of two orthogonal axes, each of which has an equal number of points on either side. The spatial median centre, following this definition, is analogous to the median of a set of data. The median centre is located in such a way that it has as many points to the south as to the north of it, and as many to the west as to the east, provided that all the points have equal weights and no matter whether or not the lines are orthogonal. This is illustrated in Figure 7.2.

Figure 7.2: Spatial median center of towns/points
[The figure shows the points T1 to T8 and the spatial median center (SMC).]

The advantage of the median centre is that its location can be found very quickly, without resorting to any mathematics other than counting points. The disadvantage is that its location depends on the orientation of the two lines used to divide up the point distribution. This results in the fact that the location of the median centre cannot be uniquely found, and its use should be restricted to preliminary geographical investigations, where speed may be more important than accuracy.
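The counting procedure can be imitated numerically by taking the median of the co-ordinates along each of the two dividing axes. The sketch below is one possible reading of the definition, not the only one: it assumes equally weighted points, uses the co-ordinates of Table 7.1, and adds a hypothetical `angle_deg` parameter so that the dependence on the orientation of the axes can be explored.

```python
import math
import statistics

# Town co-ordinates in cm, as listed in Table 7.1
points = [(1.0, 6.7), (6.5, 6.6), (11.6, 6.7), (1.9, 4.4),
          (10.9, 4.4), (7.0, 3.1), (4.1, 0.9), (12.0, 1.0)]


def median_centre(points, angle_deg=0.0):
    """Intersection of two orthogonal lines that each split the points in half.

    angle_deg rotates the pair of dividing axes; 0 means the lines run
    north-south and east-west.
    """
    a = math.radians(angle_deg)
    cos_a, sin_a = math.cos(a), math.sin(a)
    # Co-ordinates of every point measured along the rotated axes
    u = [x * cos_a + y * sin_a for x, y in points]
    v = [-x * sin_a + y * cos_a for x, y in points]
    mu, mv = statistics.median(u), statistics.median(v)
    # Convert the intersection back to the original x, y system
    return (mu * cos_a - mv * sin_a, mu * sin_a + mv * cos_a)


smc = median_centre(points)
print("Median centre, grid-aligned axes:", smc)
print("Median centre, axes rotated 45 degrees:", median_centre(points, 45))
```

Comparing the two printed locations shows directly whether, and by how much, the choice of orientation moves the median centre for a given set of points.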

Example
Locate the spatial median center of population for towns (T1, T2, ..., T10) in hypothetical region R by using the population data/size (P1, P2, ..., P10) given below.

(Map of region R showing towns T1 to T10)

P1 = 5000    P6 = 1900
P2 = 5200    P7 = 4300
P3 = 2800    P8 = 6000
P4 = 7800    P9 = 3450
P5 = 3200    P10 = 4280

Solution
First you have to calculate the median value of the population by using the formula

$$\text{Median center} = \frac{p_1 + p_2 + \dots + p_{10}}{2} = \frac{5000 + 5200 + 2800 + 7800 + 3200 + 1900 + 4300 + 6000 + 3450 + 4280}{2} = 21{,}965$$

Exercise
Let us assume that the following figure indicates the distribution of hypothetical towns T1, T2, ..., T11. Then answer the following questions based on the data given about the assumed map.

Figure 7.3: Identification of Population Spatial Mean Center (map showing towns T1 to T11; map scale 1 : 100,000)

A. Find the spatial mean center and median center of the towns.
B. Calculate the spatial mean and median centers of population for the given two periods if the population of each town is as given in Table 7.2. Try to comment on the direction of shift of the spatial population mean during the given periods.

Table 7.2: The towns population data (hypothetical)

Town    Total population in 1995    Total population in 2005
T1      200                         700
T2      250                         650
T3      400                         600
T4      120                         220
T5      80                          180
T6      456                         756
T7      250                         550
T8      550                         850
T9      350                         850
T10     680                         780
T11     450                         550

7.2.3. Centre of Minimum Travel
The centre of minimum travel, referred to in many texts as the median centre, is the location from which the sum of the distances to all the points in a distribution is a minimum. It is based on the principle that the sum of the deviations around the median (ignoring the signs) is a minimum. The position of this centre can be found manually by a process of trial and error, that is, by choosing a number of alternative trial locations and calculating the sum of the distances from each trial centre to all the points. The true centre of minimum travel can eventually be found in this way. Although this iterative procedure (one which involves a long series of repeated steps) would be extremely time consuming to do manually, it can easily be handled by computer software such as ArcGIS. In most cases the mean centre, median centre and centre of minimum travel are likely to be fairly close to each other. Either of the first two could therefore be used as a starting point in the search for the centre of minimum travel.

Exercise
Let us assume that the following figure indicates the distribution of towns (T1, T2, ...) in a certain region R. Then which one of the towns can serve as the center of minimum travel for the whole eight towns indicated in the map?

(Map showing towns T1 to T8; map scale 1 : 50,000)
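The repeated trial-centre search described above can be automated. The sketch below uses Weiszfeld's iteration, a standard scheme for this problem that is not named in this unit, and starts from the mean centre as the text suggests. The coordinates are borrowed from Table 7.1 for illustration, and the function names are assumptions.

```python
import math

# Coordinates (cm) borrowed from Table 7.1 for illustration only.
points = [(1.0, 6.7), (6.5, 6.6), (11.6, 6.7), (1.9, 4.4),
          (10.9, 4.4), (7.0, 3.1), (4.1, 0.9), (12.0, 1.0)]


def total_distance(centre, pts):
    """Sum of straight-line distances from a trial centre to every point."""
    return sum(math.dist(centre, p) for p in pts)


def centre_of_minimum_travel(pts, start, iterations=200, eps=1e-12):
    """Weiszfeld iteration: move the trial centre to the 1/distance-weighted mean of the points."""
    cx, cy = start
    for _ in range(iterations):
        num_x = num_y = denom = 0.0
        for x, y in pts:
            d = math.dist((cx, cy), (x, y))
            if d < eps:              # trial centre coincides with a point: stop (simplification)
                return cx, cy
            num_x += x / d
            num_y += y / d
            denom += 1.0 / d
        cx, cy = num_x / denom, num_y / denom
    return cx, cy


mean = (sum(x for x, _ in points) / len(points),
        sum(y for _, y in points) / len(points))
centre = centre_of_minimum_travel(points, start=mean)

print("mean centre:", mean, "total distance:", total_distance(mean, points))
print("minimum-travel centre:", centre, "total distance:", total_distance(centre, points))
```

Each iteration never increases the total distance, so the result is at least as good as the starting mean centre.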

7.3. Spatial Dispersion
Just like the measures of dispersion discussed in the earlier sections of this course, spatial dispersion describes the spread of values around some form of average. Measures of spatial dispersion give information about the areal spread of points or places around a centre. In this section techniques for measuring the spread of points around the mean centre will be discussed.

7.3.1. Standard Distance
Standard distance, otherwise known as standard distance deviation or root mean square distance deviation, is the spatial equivalent of the standard deviation and measures the spread of points around the mean centre. The simplest equation for standard distance is:

$$\text{Standard distance} = \sqrt{\frac{\sum d^2}{n}}$$

Where,
d is the distance of each point from the mean centre
n is the number of points.

Having located the mean centre, it is possible to measure all distances directly from the map, square them, add up all the squares, divide by the number of points and then take the square root. Though there are various methods, this will be the simplest and the quickest way of calculating the standard distance for many map distributions.

Exercise
Calculate the standard distance for the point pattern shown below and comment on the result you have found if the points represent the distribution of towns.

(Map showing towns T1 to T7; map scale 1 : 100,000)

7.4. Analysis of Shape
The measure of shape is an obvious area in which a statistician can help a geographer. However, shape is extremely difficult to quantify, and only certain specific characteristics of shape can be quantified. The most commonly measured characteristic of shape is compactness. The measure of compactness of shapes of geographical areas such as countries, regions, drainage basins, administrative units and urban areas is known as the Index of Shape or Index of Compactness. This is effectively a measure of how far a shape deviates from the most compact possible shape, a circle. A circle is the most compact shape in the sense that it has the smallest possible perimeter relative to the area contained within it. The circle is taken as a standard because the area of a circle is the maximum among areal units with the same perimeter. It is thus possible to say how circular the shape of an areal unit is. The commonest measures of compactness are discussed below. The following symbols are used throughout the formulas:

A = Actual area
A1 = Area of the smallest circle circumscribing the actual area
L = Longest axis or major axis
L1 = Minor axis
P = Perimeter

1. Elongation ratio

$$E_r = \frac{L_1}{L}$$

where the value ranges between 0 and 1. For the most elongated shapes, such as a straight line, Er = 0; for a compact shape Er = 1. As Er approaches 1, the shape is more and more compact or circular.

2. Form ratio

$$F_r = \frac{A}{L^2}$$

where the value ranges between 0 and π/4 (about 0.785).

Example
Find the form ratio for a square which has an area of one square kilometer.
Area = 1 km²
L = √((1 km)² + (1 km)²) = √2 km
L² = 2 km²
Fr = 1/2 = 0.5

3. Circularity ratio

$$C_r = \frac{4A}{P^2}$$

where the value ranges between 0 and 1/π (about 0.318).

Example
Calculate the Cr value for the square given in the example above.

$$C_r = \frac{4A}{P^2} = \frac{4 \times 1\ \text{km}^2}{(4\ \text{km})^2} = \frac{4}{16} = 0.25$$

4. Compactness ratio
a. Richardson compactness ratio = $\dfrac{2\sqrt{\pi A}}{P}$, the value ranges between 0 and 1
b. Gibbs compactness ratio = $\dfrac{1.273\,A}{L^2}$, the value ranges between 0 and 1
c. Cole compactness ratio = $\dfrac{A}{A_1}$

Example
Calculate Cole's compactness ratio for a square shaped area with each side measuring 1 kilometer.
Area of a square = 1 km x 1 km = 1 km²
The smallest circumscribing circle has the diagonal of the square as its diameter, d = √2 km, so r² = 0.5 km² and
A1 = πr² = 3.14 x 0.5 km² = 1.57 km²
Cole compactness ratio = 1 km² / 1.57 km² = 0.6369

5. Circularity Index (CI)

$$CI = \frac{AA}{ACSP}$$

Where,
AA = actual area
ACSP = area of a circle with the same perimeter
The value of CI ranges between 0 and 1, indicating the most elongated and perfectly compact shapes, respectively.

Exercise
Calculate the circularity index (CI) and compactness ratios of each of the countries indicated below. Explain the results you have found.

Table 7.3:

Countries    Area (km²)    Boundary (km)
Ethiopia     1,106,000     5,290
Djibouti     22,000        820
Eritrea      117,400       2,420
Kenya        582,644       3,600
Somalia      637,657       5,100
Sudan        2,502,813     7,192
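The shape measures of this section can be collected in one small helper and checked against the worked examples for the 1 km square. This is a minimal sketch: the function name `shape_indices` is an assumption, and the Cole ratio is computed on the assumption that the smallest circumscribing circle has the longest axis L as its diameter, which holds for a square or rectangle but not for every shape.

```python
import math


def shape_indices(area, perimeter, longest_axis):
    """Compactness measures from area A, perimeter P and longest axis L."""
    # Assumption: the smallest circumscribing circle has diameter L.
    circumscribing_area = math.pi * (longest_axis / 2) ** 2
    # Area of a circle whose circumference equals the perimeter P.
    same_perimeter_circle = perimeter ** 2 / (4 * math.pi)
    return {
        "form_ratio": area / longest_axis ** 2,
        "circularity_ratio": 4 * area / perimeter ** 2,
        "richardson": 2 * math.sqrt(math.pi * area) / perimeter,
        "gibbs": 1.273 * area / longest_axis ** 2,
        "cole": area / circumscribing_area,
        "circularity_index": area / same_perimeter_circle,
    }


# The 1 km x 1 km square used in the worked examples.
square = shape_indices(area=1.0, perimeter=4.0, longest_axis=math.sqrt(2))
for name, value in square.items():
    print(f"{name}: {value:.4f}")
```

For the countries in the exercise only area and boundary length are given, so only the measures that use A and P (circularity ratio, Richardson ratio and circularity index) can be computed without measuring the longest axis from a map.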

Project Work
You know that Ethiopia is currently divided into 9 regional states and 2 city administrations. Try to visit any nearby governmental or non-governmental offices, libraries or websites and find both the area and boundary length of all the regions/city administrations, and analyze their shape based on what you have learnt in this unit. You are expected to show all the necessary steps and formulas you have used for the analysis.

Quantitative Methods in Social Sciences

Unit Eight Correlation and Regression
Unit objectives
Having studied this chapter, you should be able to: Explain the basic concept of correlation and regression analyses Apply the techniques of regression analysis in related geographic data analysis Appreciate the use of regression analysis in quantitative geographic data analysis Understand the basic theoretical concept of regression analysis in SPSS Apply SPSS software in the analysis of correlation and regression

8.1. Introduction
In this chapter we are going to analyze the degree and nature of the relationships between variables, known as Correlation and Regression Analysis. For both regression and correlation studies, the number of variables may be two (bivariate analysis) or more (multivariate analysis). In this course we shall discuss in some detail both the bivariate and multivariate regression relationships. We shall see correlation first and then regression in this chapter.

8.2. Correlation Analysis
In statistics, correlation indicates the strength and direction of the mutual interdependence of two or more variables. The relationship can be either linear or non-linear. The relationship between two variables is termed as bivariate correlation while that between more than two variables is known as multivariate correlation. The relative strengths of relationships are identified by a measure referred to as Correlation Coefficient or Coefficient of Correlation. Thus, one can have a) Bivariate linear coefficients of correlation b) Bivariate nonlinear coefficients of correlation c) Multivariate linear coefficients of correlation d) Multivariate nonlinear coefficients of correlation


The relationship between variables (coefficient of correlation) can be direct or inverse. Its value ranges between -1 and +1, where +1 indicates a perfect positive (direct) correlation and -1 a perfect negative (inverse) correlation between the variables. The analysis of the coefficient of correlation has been attempted by different scholars. The most widely known is Karl Pearson's product-moment correlation coefficient, or simply Pearson's Coefficient of Correlation, which is obtained by dividing the covariance of the two variables by the product of their standard deviations. Pearson's Coefficient of Correlation runs as follows for bivariate linear correlation:
$$r_{XY} = \frac{\sum (X_i - \bar{X})(Y_i - \bar{Y})}{N\,\sigma_X \sigma_Y}$$

Where,
r_XY = Coefficient of correlation
X & Y = Interdependent variables

It can be written in other forms as:

$$
\begin{aligned}
r_{XY} &= \frac{\dfrac{\sum X_i Y_i}{N} - \bar{X}\bar{Y}}{\sigma_X \sigma_Y} \\
&= \frac{\sum X_i Y_i - N\bar{X}\bar{Y}}{\sqrt{\left(\sum X_i^2 - N\bar{X}^2\right)\left(\sum Y_i^2 - N\bar{Y}^2\right)}} \\
&= \frac{N\sum X_i Y_i - \sum X_i \sum Y_i}{\sqrt{\left[N\sum X_i^2 - \left(\sum X_i\right)^2\right]\left[N\sum Y_i^2 - \left(\sum Y_i\right)^2\right]}}
\end{aligned}
$$


Where,
N = Number of pairs of scores
ΣXY = Sum of the products of paired scores
ΣX = Sum of X scores
ΣY = Sum of Y scores
ΣX² = Sum of squared X scores
ΣY² = Sum of squared Y scores

Assumption of Correlation Coefficient
There are three assumptions made in giving the correlation coefficient by using the above formula. They are: 1. The random variables X and Y are distributed normally 2. The variables X and Y are related or interdependent 3. There is a cause and effect relationship between X and Y variables

Example
Let's assume that we want to look at the relationship between two variables, food grain available per head (in quintals) and family size. Perhaps we can have a hypothesis that family size affects the daily calorie supply per head in a family. Let's say we collect some information on 15 households and record it as indicated in Table 8.1 below.

Table 8.1:

Food Grain Available (Quintal per Head)    Family Size
12                                         3
8                                          5
8                                          4
9                                          3
3                                          6
5                                          5
21                                         2
2                                          9
2                                          10
7                                          4
21                                         3
25                                         2
18                                         2
2                                          9
3                                          7

You should immediately see in the bivariate plot that the relationship between the variables is a negative or inverse one, because if you were to fit a single straight line through the dots it would have a negative slope, or move down from left to right. Since the correlation is nothing more than a quantitative estimate of the relationship, we would expect a negative correlation. What does a negative relationship mean in this context? It means that, if one increases, the other decreases in value. You should confirm visually that this is generally true in the plot below (Figure 8.1).

Figure 8.1: Scatter plot of the data in Table 8.1 (Food Grain per Head plotted against Family Size, showing the two regression lines Y = -2.3917X + 21.5325 and X = -0.2841Y + 7.6986 and their intersection point)

From Figure 8.1 above we can see that there are two lines, indicating that the two variables are mutually regressed against each other. The two regression lines give two regression coefficients: for the regression of Y on X, the regression coefficient is $b_{YX} = r\dfrac{\sigma_Y}{\sigma_X}$, and for the regression of X on Y, the regression coefficient is $b_{XY} = r\dfrac{\sigma_X}{\sigma_Y}$. The multiplication of the two regression coefficients, $(b_{YX})(b_{XY})$, gives the coefficient of determination, r².

Regression equation by using means, standard deviations and correlation coefficients
Here X̄ = 4.933, Ȳ = 9.733, σX = 2.62, σY = 7.602 and r = -0.8243.

Case 1: Y regressed on X

$$
\begin{aligned}
(Y - \bar{Y}) &= r\frac{\sigma_Y}{\sigma_X}(X - \bar{X}) \\
Y - 9.733 &= -0.8243\left(\frac{7.602}{2.62}\right)(X - 4.933) \\
Y - 9.733 &= -2.3917\,(X - 4.933) \\
Y &= -2.3917X + 11.7992 + 9.733 \\
Y &= -2.3917X + 21.5325 \qquad \text{Equation No. 1}
\end{aligned}
$$

Case 2: X regressed on Y

$$
\begin{aligned}
(X - \bar{X}) &= r\frac{\sigma_X}{\sigma_Y}(Y - \bar{Y}) \\
X - 4.933 &= -0.8243\left(\frac{2.62}{7.602}\right)(Y - 9.733) \\
X - 4.933 &= -0.2841\,(Y - 9.733) \\
X &= -0.2841Y + 2.7652 + 4.933 \\
X &= -0.2841Y + 7.6986 \qquad \text{Equation No. 2}
\end{aligned}
$$

The angle between the two lines will be zero when there is a perfect relationship between the variables, i.e. when the coefficient of correlation is +1 or -1. The greater the angle, the smaller the value of the correlation coefficient. Another check is that if the calculations and the drawings have gone smoothly, the coordinates of the intersection of the two lines will be the averages of the two variables.

By using the previously explained formula, it is now easy to compute the coefficient of correlation for the variables in Table 8.1.

$$r_{XY} = \frac{N\sum X_i Y_i - \sum X_i \sum Y_i}{\sqrt{\left[N\sum X_i^2 - \left(\sum X_i\right)^2\right]\left[N\sum Y_i^2 - \left(\sum Y_i\right)^2\right]}}$$

The symbol r stands for the coefficient of correlation. It is always between -1.0 and +1.0. If the coefficient of correlation is negative, we have an inverse relationship; if it is positive, the relationship is direct. You don't need to know how we came up with this formula unless you want to be a statistician. But you probably will need to know how the formula relates to real data and how you can use the formula to compute the correlation coefficient.

Let's look at the data we need for the formula. Here's the original data with the other necessary columns (see Table 8.2).

Table 8.2:

Grain available (Y)    Family Size (X)    XY     X²     Y²
12                     3                  36     9      144
8                      5                  40     25     64
8                      4                  32     16     64
9                      3                  27     9      81
3                      6                  18     36     9
5                      5                  25     25     25
21                     2                  42     4      441
2                      9                  18     81     4
2                      10                 20     100    4
7                      4                  28     16     49
21                     3                  63     9      441
25                     2                  50     4      625
18                     2                  36     4      324
2                      9                  18     81     4
3                      7                  21     49     9
Total 146              74                 474    468    2288

The first two columns are the same as in Table 8.1 above. The next three columns are simple computations based on the food grain and family size data. The bottom row consists of the sum of each column. This is all the information we need to compute the coefficient of correlation. Here are the values from the bottom row of the table (where N is 15) as they are related to the symbols in the formula:

N = 15
ΣX = 74
ΣY = 146
ΣXY = 474
ΣX² = 468
ΣY² = 2288

Now, when we plug these values into the given formula, we get the following:

$$r = \frac{15 \times 474 - 74 \times 146}{\sqrt{(15 \times 468 - 5476)(15 \times 2288 - 21316)}} = \frac{-3694}{\sqrt{1544 \times 13004}} = -0.8244$$
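The column sums and the final substitution can be reproduced with a short script. This is a sketch using only the data of Table 8.2; the function name `pearson_r` is an assumption made for the example.

```python
import math

# Data from Table 8.2.
grain = [12, 8, 8, 9, 3, 5, 21, 2, 2, 7, 21, 25, 18, 2, 3]    # Y
family = [3, 5, 4, 3, 6, 5, 2, 9, 10, 4, 3, 2, 2, 9, 7]       # X


def pearson_r(xs, ys):
    """Pearson's r from the raw-score sums (N, sum X, sum Y, sum XY, sum X^2, sum Y^2)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x ** 2) * (n * sum_y2 - sum_y ** 2))
    return numerator / denominator


r = pearson_r(family, grain)
print("r =", round(r, 4))
print("r squared =", round(r ** 2, 4))
```

Swapping the two arguments gives the same value, since the correlation of X with Y equals the correlation of Y with X.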

Here we can determine the Probable Error of the correlation coefficient and its confidence interval as:

$$P_e = 0.6745\,\frac{1 - r^2}{\sqrt{n}}$$

Where,
Pe = Probable error
r = correlation coefficient
n = number of pairs of observations

This probable error sets a range for the coefficients of correlation of other sets of samples selected randomly from the same population. The range is put as r - Pe to r + Pe. Another property associated with Pe is that if the correlation coefficient is r < Pe, it is not significant at all, and if r > 6Pe, it is definitely significant.

Example
Calculate the coefficients of correlation and determination for the following data and comment on the results you have found.

Table 8.3:

Crop yield/ha (in quintals) Y    Fertilizer/ha (in Kg) X
18                               50
8                                35
28                               15
20                               45
14                               100
22                               0
24                               38
16                               27
6                                43
12                               55

Then, you have to find the summations as follows:

Table 8.4:

S/N      X      Y      X²       Y²      XY
1        50     18     2500     324     900
2        35     8      1225     64      280
3        15     28     225      784     420
4        45     20     2025     400     900
5        100    14     10000    196     1400
6        0      22     0        484     0
7        38     24     1444     576     912
8        27     16     729      256     432
9        43     6      1849     36      258
10       55     12     3025     144     660
Total    408    168    23022    3264    6162

You can now substitute the values above in the formula below:

$$r_{XY} = \frac{N\sum XY - \sum X \sum Y}{\sqrt{\left[N\sum X^2 - \left(\sum X\right)^2\right]\left[N\sum Y^2 - \left(\sum Y\right)^2\right]}}$$

$$r = \frac{10 \times 6162 - 408 \times 168}{\sqrt{[10 \times 23022 - (408)^2][10 \times 3264 - (168)^2]}} = -0.412654$$

r² = 0.17028 or 17.03%

So, the correlation for our ten cases is r = -0.413 between crop yield/hectare and use of fertilizer per hectare, which is a moderately negative relationship. The coefficient of determination (r² = 0.1703) indicates that about 17% of the variation in the dependent variable (Y) is explained by the investigated independent variable (X).

Testing the Significance of a Correlation
Once you have computed a correlation coefficient, you can determine the probability that the observed correlation occurred by chance. That is, you can conduct a significance test. Most often you are interested in determining the probability that the correlation is a real one and not a chance occurrence. In this case, you are testing the mutually exclusive hypotheses:

Null Hypothesis: r = 0
Alternative Hypothesis: r ≠ 0

The easiest way to test this hypothesis is to find a statistics book that has a table of critical values of r. Most introductory statistics texts would have a table like this. As in all hypothesis testing, you need to first determine the significance level. Here, we can use the common significance level of α = .05. This means that we are conducting a test where the odds that the correlation is a chance occurrence are no more than 5 out of 100. Before we look up the critical value in a table we also have to compute the degrees of freedom (df). The df is simply equal to N-2 or, in this example of Table 8.2, 15-2 = 13. Finally, we have to decide whether we are doing a one-tailed or two-tailed test. In this example, since we have no strong prior theory to suggest whether the relationship between food grain available and family size would be positive or negative, we will opt for the two-tailed test.

With these three pieces of information, i.e. the significance level (α = .05), degrees of freedom (df = 13), and type of test (two-tailed), we can now test the significance of the coefficient of correlation we have found. When we look up this value in a table of critical values of r at the back of a statistics book we find that the critical value is 0.514. This means that if the absolute value of the calculated correlation is greater than 0.514 (remember, this is a two-tailed test) we should conclude that the odds are less than 5 out of 100 that this is a chance occurrence. Since our calculated correlation of r = -0.824 is, in absolute value, quite a bit higher than the tabulated value, we can conclude that it is not a chance finding and that the correlation is "statistically significant". We can reject the null hypothesis and accept the alternative.

We can also compute the correlation coefficient and statistically confirm (test) the relationship between the variables by using SPSS software as indicated in the table below.

Bivariate Correlation SPSS Output

                                            Food Grain per Head
Family Size    Pearson Correlation          -0.824(**)
               Sig. (2-tailed)              .000
** Correlation is significant at the 0.05 level (2-tailed).
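The same decision can be checked numerically. The sketch below does not use the table of critical values of r; it converts r to a t statistic with N-2 degrees of freedom, an equivalent test that is swapped in here for illustration, and compares it with the two-tailed critical t of 2.160 for df = 13 taken from a standard t table. It also evaluates the probable error defined earlier. All variable names are assumptions.

```python
import math

r = -0.824          # correlation between family size and grain per head
n = 15              # number of households
df = n - 2          # degrees of freedom

# t statistic equivalent to testing r = 0 (a swapped-in but standard formulation).
t_stat = r * math.sqrt(df) / math.sqrt(1 - r ** 2)

T_CRITICAL = 2.160  # two-tailed, alpha = .05, df = 13 (value read from a t table)
significant = abs(t_stat) > T_CRITICAL

# Probable error of r and the rule of thumb |r| > 6 * Pe.
probable_error = 0.6745 * (1 - r ** 2) / math.sqrt(n)
definitely_significant = abs(r) > 6 * probable_error

print("t =", round(t_stat, 3), "significant:", significant)
print("Pe =", round(probable_error, 4), "|r| > 6Pe:", definitely_significant)
```

Both checks point the same way as the SPSS output: the negative relationship is very unlikely to be a chance occurrence.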

The Correlation Matrix
All we have discussed so far is how to compute a correlation between two variables. In most studies we have considerably more than two variables. Let's say we have a study with 9 interval-level variables and we want to estimate the relationships among all of them (i.e., between all possible pairs of variables). In this instance, we have 36 unique correlations to estimate. How do we know that there are 36 unique correlations when we have 9 variables? There's a simple formula that tells how many pairs there are. That is

$$\frac{N(N-1)}{2}$$

We could do the above computations 36 times to obtain the correlations. Or we could just use SPSS or any statistics program to automatically compute all 36 with a simple click of the mouse. Here I used SPSS to calculate the correlations among the variables and create the correlation matrix for the 9 variables. I told the program to compute the correlations among these variables. Here's the result.

Table 8.5: Correlation Matrix (Pearson Correlation; lower triangle only). The variables listed down the first column and across the first row are Grain per Head, Family Size, Farm Land Size, Number of Oxen, No. of Livestock, Sex of H/Head, Off-farm Income, Fertilizer per ha and Dung per ha. The entry for Family Size and Grain per Head is -0.824.

This type of table is called a correlation matrix. It lists the variable names down the first column and across the first row. In every correlation matrix there are two triangles: the values below and to the left of the diagonal (lower triangle) and the values above and to the right of the diagonal (upper triangle). The table shows only the lower triangle of the correlation matrix. There is no reason to print both triangles because the two triangles of a correlation matrix are always mirror images of each other (the correlation of variable x with variable y is always equal to the correlation of variable y with variable x). When a matrix has this mirror-image quality above and below the diagonal we refer to it as a symmetric matrix. A correlation matrix is always a symmetric matrix. To locate the correlation for any pair of variables, find the value in the table at the row and column intersection for those two variables. For instance, to find the correlation between the variables Family Size and Grain per Head, we should look for where their row and column intersect. Then, we find that the correlation is -0.824.

Other Correlations
The specific type of correlation we have seen above is known as Pearson's Product Moment Correlation coefficient. It is appropriate when both variables are measured at an interval level. However, there is a wide variety of other types of correlations for other circumstances. For instance, if you have two ordinal variables, you could use Spearman's Rank Order Correlation or the Kendall Rank Order Correlation. When one measure is a continuous interval-level one and the other is dichotomous (i.e., two-category) you can use the Point-Biserial Correlation.

Bivariate Coefficient of Determination
The coefficient of determination, which is given as r², explains to what extent the variation of the dependent variable Y is explained (expressed) by the independent variable X.

$$r^2 = \frac{\text{Explained variation}}{\text{Total variation}}$$

Explained variation = $A\sum Y + B\sum(XY) - N\bar{Y}^2$
Unexplained variation = $\sum (Y - Y_C)^2$
Total variation = $\sum (Y - \bar{Y})^2 = \sum Y^2 - N\bar{Y}^2$
Total variation = variation unexplained + variation explained

Here A and B are the intercept and slope of the fitted regression line, and C, D, ... are the coefficients of the further independent variables.

Bivariate:

$$r^2 = \frac{A\sum Y + B\sum(XY) - N\bar{Y}^2}{\sum Y^2 - N\bar{Y}^2}$$

Multivariate:

$$r^2 = \frac{A\sum Y + B\sum(X_1Y) + C\sum(X_2Y) + D\sum(X_3Y) + \dots - N\bar{Y}^2}{\sum Y^2 - N\bar{Y}^2}$$
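The bivariate formula can be checked against the fertilizer and crop-yield data of Table 8.4. The sketch below assumes that A and B are the least-squares intercept and slope of Y on X; the function name `determination` is hypothetical.

```python
# Data from Table 8.4.
fertilizer = [50, 35, 15, 45, 100, 0, 38, 27, 43, 55]    # X
crop_yield = [18, 8, 28, 20, 14, 22, 24, 16, 6, 12]      # Y


def determination(xs, ys):
    """r squared as explained variation divided by total variation."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    sum_y2 = sum(y * y for y in ys)
    b = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)   # slope B
    a = (sum_y - b * sum_x) / n                                    # intercept A
    mean_y = sum_y / n
    explained = a * sum_y + b * sum_xy - n * mean_y ** 2
    total = sum_y2 - n * mean_y ** 2
    return explained / total


r_squared = determination(fertilizer, crop_yield)
print("r squared =", round(r_squared, 5))
```

The result agrees with squaring the correlation coefficient obtained from the same table, which is a useful cross-check on hand calculations.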

Exercise
Calculate the coefficients of correlation and determination for the following data and comment on the results you have found. Use SPSS software to do so.

Crop yield/ha (in quintals)    Farm oxen/ha
18                             2
8                              1
28                             4
20                             0
14                             3
22                             1
24                             2
16                             1
6                              4
12                             2

8.3. Regression Analysis
In statistics, regression analysis is a collective name for techniques for the modeling and analysis of numerical data consisting of values of a dependent variable (Y) and of one or more independent variables, X, also known as explanatory variables or predictors. The dependent variable in the regression equation is modeled as a function of the independent variables, corresponding parameters (constants), and an error term. The error term is treated as a random variable. It represents unexplained variation in the dependent variable. The parameters are estimated so as to give a best fit of the data. Most commonly the best fit is evaluated by using the least squares method, but other criteria have also been used.

Regression can be used for prediction (including forecasting of time-series data), inference, hypothesis testing, and modeling of causal relationships. These uses of regression rely heavily on the underlying assumptions being satisfied. Regression analysis has been criticized as being misused for these purposes in many cases where the appropriate assumptions cannot be verified to hold. One factor contributing to the misuse of regression is that it can take considerably more skill to critique a model than to fit a model.

In the study of the linear relationship between two variables Y and X, suppose the variable Y is such that it depends on X; then we call the line the regression line of Y on X. A linear relation between two variables is represented by a straight line, which is known as the regression line. To find the regression line, the observations (Xi, Yi) on the variables X and Y, for example holding size and crop yield, are necessarily taken in pairs on the units, which may be people, animals, plots or any other entity. Generally, the studies are based on samples of size n, and hence the n pairs of sample observations can be written as (X1, Y1), (X2, Y2), (X3, Y3), ..., (Xn, Yn).

Regression models are the mathematical/algebraic expressions, while regression lines are the graphical representations based on the models, in a two dimensional space in the case of a bivariate distribution and in multidimensional spaces in the case of multivariate analysis. The number of curves/lines depends on the number of variables. For bivariate distributions, for instance, there can be two regression lines. The equation for a regression line of Y on X for the population is:

Regression Equation          Remark
Y = α + βX + e               First degree or straight line equation/linear model
Y = a + bX + cX²             Nonlinear second degree or curvilinear regression equation or Parabolic model
Y = a + bX + cX² + dX³       Nonlinear third degree regression model
Y = ab^X                     a) Growth model (if b > 1)  b) Decay model (if b < 1)

Least Square Method of Fitting a Regression Line
The general assumption of the least square method of fitting a regression model to a data set is that

Σ(Y - Yc)² should be a minimum

Where,
Yc = estimated value
Y = actual or observed value
(Y - Yc) = deviation of the estimated value from the actual value (residual)

Regression analysis chooses among all possible models by selecting the one for which the sum of the squares of the residuals is at a minimum. The intercept of the line provides the estimate of α, and its slope provides the estimate of β, which is the tangent of the slope angle. It is hardly obvious why we should choose our line using the minimum sum of squared errors criterion. One virtue of the sum of squared errors criterion is that it is very easy to employ computationally. When one expresses the sum of squared errors/deviations mathematically and applies calculus techniques to ascertain the values of α and β that minimize it, one obtains expressions for α and β that are easy to evaluate.
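For the straight-line model these expressions can be written out explicitly. The derivation below is the standard one, sketched here for reference; S is a symbol introduced only for this illustration to denote the sum of squared residuals.

```latex
\begin{aligned}
S(\alpha,\beta) &= \sum (Y - \alpha - \beta X)^2 \\
\frac{\partial S}{\partial \alpha} = 0 \;&\Rightarrow\; \sum Y = n\alpha + \beta \sum X \\
\frac{\partial S}{\partial \beta} = 0 \;&\Rightarrow\; \sum XY = \alpha \sum X + \beta \sum X^2 \\
\beta &= \frac{n\sum XY - \sum X \sum Y}{n\sum X^2 - \left(\sum X\right)^2},
\qquad \alpha = \bar{Y} - \beta\,\bar{X}
\end{aligned}
```

The two conditions in the middle are the normal equations; solving them simultaneously gives the slope and intercept in the last line.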

Note that in the equation Y = α + βX (first degree or straight line equation) α and β are constants, where α is the intercept at which the line cuts the Y axis (i.e. the value of Y when X = 0), and β is the tangent of the angle θ that the line makes with the X axis. β is also called the regression coefficient and is defined as the measure of change in the dependent variable (Y) corresponding to a unit change in the independent variable (X). The symbol e is the noise, which is usually omitted in calculations. The noise component e is comprised of factors that are unobservable or at least unobserved, while α and β are the constants.

Graphical straight line fitting
The first degree (straight line) estimating equation is Y = α + βX

Figure 8.2: Graph of the line Y = α + βX, where α is the Y intercept and β is the slope of the line, or the tangent of the angle θ. β is known as the regression coefficient.

Dear student! Do you know the importance of regression analysis? Where do you use it?
Applications of regression analysis exist in almost every field such as geography, economics, political science, sociology, psychology and education. The common aspect of the applications of regression analysis in these fields is that the dependent variable is a quantitative measure of some condition or behavior. When the dependent variable is qualitative or categorical, other methods might be more appropriate. You will know more about the application of regression analysis after you observe the examples below.

Calculation of 1st Degree Curve
Calculate the estimated value (Yc) of the production data given below by using the first degree (straight line) prediction equation.

Table 8.6: A hypothetical farmer's crop output over years

Crop Year    Production in Quintals
1985         10
1990         12
1995         15
2000         20
2005         28

Note: In this example production is being regressed on time. In order to make the calculation more convenient you can rename the production years as 1, 2, 3, 4, and 5 as follows.

Table 8.7:

Crop Year (X)    Production in Quintals (Y)
1                10
2                12
3                15
4                20
5                28

For the prediction equation Y = α + βX, the so-called normal or standard equations are given as follows:

ΣY = nα + βΣX          ... 1
ΣXY = αΣX + βΣX²       ... 2

The raw data provide the values of ΣY, ΣX, ΣXY and ΣX².

Table 8.8:

Crop Year (X)    Production (Y)    XY     X²
1                10                10     1
2                12                24     4
3                15                45     9
4                20                80     16
5                28                140    25
Total 15         85                299    55

Those numerical values replace the symbols in the above two equations to provide two simultaneous equations in terms of α and β:

85 = 5α + 15β          ... 1
299 = 15α + 55β        ... 2

Multiply equation No. 1 by 3 so as to eliminate α:

255 = 15α + 45β        ... 3

To eliminate α, subtract equation No. 3 from No. 2 and find β:

44 = 10β
Hence, β = 4.4

Substitute 4.4 for β in equation No. 1 to find α:

85 = 5α + 15β
85 = 5α + 15(4.4)
19 = 5α
α = 3.8

The estimating (prediction) equation or model is, therefore,

Yc = 3.8 + 4.4X

Once the constants α and β are calculated, there remain two unknown variables in the regression equation, Y and X. We know Y depends on X in the case of the regression equation of Y on X. Under the presumption that the trend of change in Y corresponding to X remains the same, the value of Y can be estimated for any value of X. It is simply a matter of substituting the values of X (1, 2, 3, 4 and 5) into the estimating (prediction) equation and calculating Yc turn by turn. For instance, the value of Yc (the estimated value) is calculated as follows when the value of X is 3:

Yc = 3.8 + 4.4(3) = 17 Quintals

The other estimated values are indicated in the table below:

Table 8.9:

Crop Year (X)    Production (Y)    Yc      Y - Yc    (Y - Yc)²
1                10                8.2     1.8       3.24
2                12                12.6    -0.6      0.36
3                15                17.0    -2.0      4.00
4                20                21.4    -1.4      1.96
5                28                25.8    2.2       4.84
Total 15         85                85.0    0.0       14.40

Note that ΣY is always equal to ΣYc except in the case of exponential relations.

Curvilinear (Parabolic) Regression Equation
The relationship between the dependent variable Y and the independent variable X can be curvilinear in many cases. The shape of the curve depends on the rate of change in Y corresponding to the change in the value of X. Some of the most commonly used curves are given here along with their mathematical equations. These curves may be fitted to the data and used.

(Figure: sketches of curves plotted with X on the horizontal axis and Y on the vertical axis)
Y = α + βX + cX²: 2nd degree curve
Y = αβ^X: when β > 1, exponential growth curve; when β < 1, exponential decay curve
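The normal equations used for the straight line generalize directly to the second degree curve introduced here. The sketch below builds the normal equations for a polynomial of any degree and solves them by Gauss-Jordan elimination in plain Python. The function name `fit_polynomial` is an assumption, and the data are the renamed crop years and production figures of Table 8.7.

```python
def fit_polynomial(xs, ys, degree):
    """Least-squares coefficients [a0, a1, ...] of y = a0 + a1*x + ... via the normal equations."""
    size = degree + 1
    # Row i is the normal equation: sum(x^i * y) = sum over j of a_j * sum(x^(i+j)).
    rows = [
        [float(sum(x ** (i + j) for x in xs)) for j in range(size)]
        + [float(sum((x ** i) * y for x, y in zip(xs, ys)))]
        for i in range(size)
    ]
    # Gauss-Jordan elimination with partial pivoting on the augmented matrix.
    for col in range(size):
        pivot = max(range(col, size), key=lambda k: abs(rows[k][col]))
        rows[col], rows[pivot] = rows[pivot], rows[col]
        for k in range(size):
            if k != col:
                factor = rows[k][col] / rows[col][col]
                rows[k] = [a - factor * b for a, b in zip(rows[k], rows[col])]
    return [rows[i][size] / rows[i][i] for i in range(size)]


years = [1, 2, 3, 4, 5]                 # crop years renamed 1..5 (Table 8.7)
production = [10, 12, 15, 20, 28]       # production in quintals

line = fit_polynomial(years, production, 1)        # [alpha, beta]
parabola = fit_polynomial(years, production, 2)    # [alpha, beta, c]

print("straight line:", [round(v, 4) for v in line])
print("second degree:", [round(v, 4) for v in parabola])
```

For degree 1 the rows of the matrix are exactly equations No. 1 and No. 2 above; for degree 2 a third row involving ΣX³, ΣX⁴ and Σ(X²Y) is added.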

Calculation of 2nd Degree Curve
Calculate the estimated value (Yc) of the production data in the immediate example above by using the second degree curvilinear (parabolic) prediction equation, $Y_c = \alpha + \beta X + cX^2$.

The normal equations to calculate α, β and c (the constants) for $Y = \alpha + \beta X + cX^2$ (2nd degree equation) are:

$$\sum Y = n\alpha + \beta\sum X + c\sum X^2 \qquad (1)$$
$$\sum XY = \alpha\sum X + \beta\sum X^2 + c\sum X^3 \qquad (2)$$
$$\sum (X^2Y) = \alpha\sum X^2 + \beta\sum X^3 + c\sum X^4 \qquad (3)$$

Then, firstly you have to find ΣY, ΣX, ΣXY, ΣX², ΣX³, Σ(X²Y) and ΣX⁴.

Table 8.10:
Crop Year (X)    Production (Y)    XY     X²    X³     X²Y     X⁴
1                10                10     1     1      10      1
2                12                24     4     8      48      16
3                15                45     9     27     135     81
4                20                80     16    64     320     256
5                28                140    25    125    700     625
Total 15         85                299    55    225    1213    979

Now you are expected to substitute in the three equations above as follows:

$$85 = 5\alpha + 15\beta + 55c \qquad (1)$$
$$299 = 15\alpha + 55\beta + 225c \qquad (2)$$
$$1213 = 55\alpha + 225\beta + 979c \qquad (3)$$

To eliminate α, multiply equation No. 1 by 3:

$$255 = 15\alpha + 45\beta + 165c \qquad (4)$$

Multiply again equation No. 1 by 11:

$$935 = 55\alpha + 165\beta + 605c \qquad (5)$$

Equation No. 2 minus Equation No. 4 gives you

$$44 = 10\beta + 60c \qquad (6)$$

Equation No. 3 minus Equation No. 5 gives you

$$278 = 60\beta + 374c \qquad (7)$$

Multiplying equation No. 6 by 6 to eliminate β:

$$264 = 60\beta + 360c \qquad (8)$$

Equation No. 7 minus Equation No. 8 gives you

$$14 = 14c, \qquad c = 1$$

Substituting c into equation No. 6 you can get:

$$44 = 10\beta + 60(1), \qquad \beta = -1.6$$

Substituting c and β into equation No. 1 to get α:

$$85 = 5\alpha + 15(-1.6) + 55(1)$$
$$85 = 5\alpha - 24 + 55$$
$$85 = 5\alpha + 31$$
$$85 - 31 = 5\alpha, \qquad \alpha = 10.8$$

Then, α = 10.8, β = −1.6 and c = 1.
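The elimination above can be cross-checked numerically. This is a minimal sketch, assuming NumPy is available; it builds the coefficient matrix of the three normal equations directly from the sums ΣX, ΣX², ΣX³, ΣX⁴ and solves for the constants in one step. The array names are illustrative.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)       # crop year (X)
y = np.array([10, 12, 15, 20, 28], dtype=float)  # production (Y)

# Left-hand side: sums of powers of X, one row per normal equation
A = np.array([
    [len(x),          x.sum(),         (x ** 2).sum()],
    [x.sum(),         (x ** 2).sum(),  (x ** 3).sum()],
    [(x ** 2).sum(),  (x ** 3).sum(),  (x ** 4).sum()],
])
# Right-hand side: sum of Y, sum of XY, sum of X^2 * Y
b = np.array([y.sum(), (x * y).sum(), (x ** 2 * y).sum()])

alpha, beta, c = np.linalg.solve(A, b)
print(f"alpha = {alpha:.1f}, beta = {beta:.1f}, c = {c:.1f}")
```

Solving the system as a matrix equation avoids the repeated multiply-and-subtract steps, which become tedious and error-prone as the degree of the curve increases.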

You can confirm the results you have found by substituting into one of the equations above. For instance, if you substitute in equation No. 2 you find 299 = 299, which means confirmed!

The estimating or prediction equation is, therefore,

$$Y_c = 10.8 - 1.6X + 1X^2$$

Now, for every value of X you can estimate Yc. Look at the estimated or predicted values below.

Table 8.11:
Independent variable X    Actual Value Y    Estimated value Yc    Y − Yc    (Y − Yc)²
1                         10                10.2                  −0.2      0.04
2                         12                11.6                  0.4       0.16
3                         15                15.0                  0.0       0.00
4                         20                20.4                  −0.4      0.16
5                         28                27.8                  0.2       0.04
Total                     85                85.0                  0.0       0.40

Note again that ΣY is always equal to ΣYc except for minor differences because of rounding of fractions.

Calculation of Exponential Curve
Calculate the estimated value (Yc) of the production data in Table 8.7 by using the exponential prediction equation. The exponential prediction equation will be the following, based on the linear equation:

$$Y = \alpha\beta^X \quad \text{or} \quad \log Y = \log\alpha + X\log\beta$$

(Recall that $\log_X Y = m$ is the same as $Y = X^m$.)

The normal equations will be:

$$\sum \log Y = N\log\alpha + \log\beta\sum X \qquad (1)$$
$$\sum (X\log Y) = \log\alpha\sum X + \log\beta\sum X^2 \qquad (2)$$

Now, you have to find ΣLogY, ΣX, Σ(XLogY) and ΣX² as follows:

Table 8.12:
No.      X     Y     LogY      XLogY      X²
1        1     10    1         1.00       1
2        2     12    1.08      2.16       4
3        3     15    1.18      3.54       9
4        4     20    1.30      5.20       16
5        5     28    1.45      7.25       25
Total    15    85    6.0035    19.1265    55

Summing the rounded column entries gives ΣLogY = 6.01 and Σ(XLogY) = 19.15, and these rounded sums are the ones used below. Now you are expected to substitute the values above in the exponential equations as follows:

$$6.01 = 5\log\alpha + 15\log\beta \qquad (1)$$
$$19.15 = 15\log\alpha + 55\log\beta \qquad (2)$$

Multiply equation No. 1 by 3 to eliminate Log α:

$$18.03 = 15\log\alpha + 45\log\beta \qquad (3)$$

Subtract equation No. 3 from No. 2 so that you can get

$$1.12 = 10\log\beta, \qquad \log\beta = 0.112, \qquad \beta = \text{antilog}(0.112) = 1.294$$

Now substitute Log β = 0.112 in equation No. 1 to find α:

$$6.01 = 5\log\alpha + 15(0.112)$$
$$6.01 - 1.68 = 5\log\alpha$$
$$\log\alpha = 0.866, \qquad \alpha = \text{antilog}(0.866) = 7.345$$
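Because the exponential curve becomes a straight line after taking logarithms, the same two normal equations used for the linear case can be applied to LogY. The sketch below is a minimal illustration using only the Python standard library; it works with unrounded logarithms, so its results differ slightly in the third decimal place from the hand calculation, which uses two-decimal logs.

```python
import math

xs = [1, 2, 3, 4, 5]          # crop year (X)
ys = [10, 12, 15, 20, 28]     # production (Y)

# Transform Y to base-10 logarithms so that log Y = log(alpha) + X * log(beta)
log_ys = [math.log10(y) for y in ys]

n = len(xs)
sum_x = sum(xs)
sum_x2 = sum(x * x for x in xs)
sum_log = sum(log_ys)
sum_xlog = sum(x * ly for x, ly in zip(xs, log_ys))

# Solve the two normal equations for log(beta) and log(alpha)
log_beta = (n * sum_xlog - sum_x * sum_log) / (n * sum_x2 - sum_x ** 2)
log_alpha = (sum_log - log_beta * sum_x) / n

# Antilogs give the constants of Y = alpha * beta**X
alpha = 10 ** log_alpha
beta = 10 ** log_beta

print(f"log(alpha) = {log_alpha:.3f}, log(beta) = {log_beta:.3f}")
print(f"Yc = {alpha:.2f} * ({beta:.2f})**X")
```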

Then the estimating or prediction equation will be:

$$Y = \alpha\beta^X \quad \text{or} \quad \log Y = \log\alpha + X\log\beta$$
$$\log Y_c = 0.866 + 0.112X \quad \text{or} \quad Y_c = 7.34(1.29)^X$$

Now for every value of X you can calculate Yc. Look at the estimated or predicted values below.

Table 8.13:
Independent variable X    Actual Value Y    LogYc    Estimated value Yc
1                         10                0.978    9.506
2                         12                1.090    12.303
3                         15                1.202    15.922
4                         20                1.314    20.606
5                         28                1.426    26.668
Total                     85                6.01     85.005

Remark: ΣLogY = ΣLogYc, but ΣY ≠ ΣYc unless arbitrarily or accidentally.

Note again that ΣLogY is always equal to ΣLogYc except for minor differences because of rounding off fractions. An exponential curve will never be zero whatever the value of X is. It never crosses the X-axis but approaches it. This is why an exponential prediction equation is used in distance decay analysis.

Calculation of 3rd Degree Curve
Calculate the estimated value (Yc) of the production data in Table 8.7 by using the 3rd degree prediction equation. The normal equations in this case, based on $Y = \alpha + \beta X + cX^2 + dX^3$, will be:

$$\sum Y = N\alpha + \beta\sum X + c\sum X^2 + d\sum X^3 \qquad (1)$$
$$\sum (XY) = \alpha\sum X + \beta\sum X^2 + c\sum X^3 + d\sum X^4 \qquad (2)$$
$$\sum (X^2Y) = \alpha\sum X^2 + \beta\sum X^3 + c\sum X^4 + d\sum X^5 \qquad (3)$$
$$\sum (X^3Y) = \alpha\sum X^3 + \beta\sum X^4 + c\sum X^5 + d\sum X^6 \qquad (4)$$

Additional items needed here are: ΣX⁵ = 4425, ΣX⁶ = 20515 and Σ(X³Y) = 5291.

Then, substituting the sums, we can now solve the simultaneous equations as usual as follows:

$$
\begin{aligned}
85 &= 5\alpha + 15\beta + 55c + 225d && (1)\\
299 &= 15\alpha + 55\beta + 225c + 979d && (2)\\
1213 &= 55\alpha + 225\beta + 979c + 4425d && (3)\\
5291 &= 225\alpha + 979\beta + 4425c + 20515d && (4)\\
255 &= 15\alpha + 45\beta + 165c + 675d && (5)\ \text{Equation No. 1 multiplied by 3}\\
935 &= 55\alpha + 165\beta + 605c + 2475d && (6)\ \text{Equation No. 1 multiplied by 11}\\
3825 &= 225\alpha + 675\beta + 2475c + 10125d && (7)\ \text{Equation No. 1 multiplied by 45}\\
44 &= 10\beta + 60c + 304d && (8)\ \text{Equation No. 2 minus No. 5}\\
278 &= 60\beta + 374c + 1950d && (9)\ \text{Equation No. 3 minus No. 6}\\
1466 &= 304\beta + 1950c + 10390d && (10)\ \text{Equation No. 4 minus No. 7}\\
264 &= 60\beta + 360c + 1824d && (11)\ \text{Equation No. 8 multiplied by 6}\\
1337.6 &= 304\beta + 1824c + 9241.6d && (12)\ \text{Equation No. 8 multiplied by 30.4}\\
14 &= 14c + 126d && (13)\ \text{Equation No. 9 minus No. 11}\\
128.4 &= 126c + 1148.4d && (14)\ \text{Equation No. 10 minus No. 12}\\
126 &= 126c + 1134d && (15)\ \text{Equation No. 13 multiplied by 9}
\end{aligned}
$$

Equation No. 14 minus Equation No. 15 gives 2.4 = 14.4d, so d = 0.166.
Substituting the value of d into equation No. 13 we can get −7 = 14c, so c = −0.50.
Substituting the values of c and d into equation No. 8 we can get 23.33 = 10β, so β = 2.33.
Substituting the values of β, c and d into equation No. 1 we can get α = 8.00.

The 3rd degree estimating equation is:

$$Y_c = 8.00 + 2.33X - 0.5X^2 + 0.166X^3$$
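The cubic constants, and the quality of each curve fitted to the same production data, can be checked in a few lines. This is a sketch under stated assumptions: it uses NumPy's `polyfit` (an ordinary least-squares polynomial fit) in place of the hand elimination, and fits the exponential curve by regressing log Y on X. The dictionary and variable names are illustrative.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5], dtype=float)       # crop year (X)
y = np.array([10, 12, 15, 20, 28], dtype=float)  # production (Y)

sse = {}  # sum of squared deviations (Y - Yc)^2 for each curve

# Polynomial curves of degree 1, 2 and 3
for name, degree in [("linear", 1), ("quadratic", 2), ("cubic", 3)]:
    coeffs = np.polyfit(x, y, degree)     # coefficients, highest power first
    fitted = np.polyval(coeffs, x)
    sse[name] = float(np.sum((y - fitted) ** 2))

# Exponential curve: straight-line fit to log10(Y), then antilog
slope, intercept = np.polyfit(x, np.log10(y), 1)
fitted_exp = 10 ** (intercept + slope * x)
sse["exponential"] = float(np.sum((y - fitted_exp) ** 2))

# Cubic constants reordered as alpha, beta, c, d
cubic = np.polyfit(x, y, 3)[::-1]
print("alpha, beta, c, d =", np.round(cubic, 3))

best = min(sse, key=sse.get)
for name, value in sse.items():
    print(f"{name:12s} sum of squared deviations = {value:.4f}")
print("Smallest sum:", best)
```

For these five observations the cubic passes through every point, so its sum of squared deviations is zero apart from floating-point error; the small non-zero values in a hand calculation come from rounding d to 0.166.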

Now for every value of X you can calculate Yc. Look at the estimated or predicted values below.

Table 8.14:
Independent variable X    Actual Value Y    Estimated value Yc    (Y − Yc)    (Y − Yc)²
1                         10                9.99                  0.01        0.0001
2                         12                11.99                 0.01        0.0001
3                         15                14.99                 0.01        0.0001
4                         20                19.99                 0.01        0.0001
5                         28                27.99                 0.01        0.0001
Total                     85                84.95                 0.05        0.0005

Finally, by comparing the values of Σ(Y − Yc)² for the linear, 2nd degree, 3rd degree and exponential curves we can determine which fits best to the given data based on the minimum value of the sum. The smallest sum value indicates the best fit.

Exercise
Let us assume that a department has trained salesmen and has given them a test. The department measured the performance of the salesmen after certain months and found the following data.

Test scores (10%)    Sales in 000 Birr
2                    12
3                    23
5                    14
5.5                  30
6                    28
8                    34

Then identify (a) the best fitting line, (b) the prediction equation and (c) the predicted sales due to a salesman having the test score of 9.5.

Multiple Regression Model
The general purpose of multiple regression (the term was first used by Pearson, 1908) is to learn more about the relationship between several independent or predictor variables and a dependent or criterion variable. The general computational problem that needs to be solved in multiple regression analysis is to fit a straight line to a number of points. A line in a two dimensional or two variable space is defined by

the equation Y = A + BX; that is, the Y variable can be expressed in terms of a constant (α) and a tangent slope (ß) times the X variable. The constant is also referred to as the intercept on the Y-axis when X is 0, and the slope as the regression coefficient or ß. In the multivariate case, when there is more than one independent variable, the regression line cannot be visualized in the two dimensional space, but can be computed just as easily. We can, however, construct a linear equation containing all those variables. In general, multiple regression procedures will estimate a linear equation of the form:

$$Y = \alpha + \beta_1X_1 + \beta_2X_2 + \beta_3X_3 + \dots + \beta_nX_n$$

Where,
Y = dependent variable
X1, X2, X3, ..., Xn = independent variables
α, ß1, ß2, ß3, ..., ßn = constants

The estimating equation is

$$Y_c = \alpha + \beta_1X_1 + \beta_2X_2 + \beta_3X_3 + \dots + \beta_nX_n$$

In this equation, the values of each regression coefficient (or ß coefficients) indicate the contributions of each independent variable to the observed change within the dependent variable.

The smaller the variability of the residual values around the regression line relative to the overall variability, the better is our prediction. For example, if there is no relationship between the X and Y variables, then the ratio of the residual variability of the Y variable to the original variance is equal to 1.0. If X and Y are perfectly related then there is no residual variance and the ratio of variance would be 0.0. In most cases, the ratio would fall somewhere between these extremes, i.e. between 0.0 and 1.0. 1.0 minus this ratio is referred to as R square or the coefficient of determination. This value is immediately interpretable in the following manner. If we have an R square of 0.4 then we know that the variability of the Y values around the regression line is 1 − 0.4 = 0.6 times the original variance; in other words we have explained 40% of the original variability, and are left with 60% residual variability.
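The ratio of residual to original variability described above can be made concrete with a small calculation. The sketch below is illustrative only: it reuses the five production values and the straight-line estimates Yc = 3.8 + 4.4X from the simple regression example earlier in this unit, and computes R square as 1.0 minus the ratio of the two sums of squares.

```python
xs = [1, 2, 3, 4, 5]          # crop year (X)
ys = [10, 12, 15, 20, 28]     # production (Y)

# Estimated values from the straight line fitted earlier in the unit
estimates = [3.8 + 4.4 * x for x in xs]

mean_y = sum(ys) / len(ys)

# Residual variability: squared deviations of Y around the regression line
ss_residual = sum((y - yc) ** 2 for y, yc in zip(ys, estimates))
# Original variability: squared deviations of Y around its own mean
ss_total = sum((y - mean_y) ** 2 for y in ys)

ratio = ss_residual / ss_total
r_squared = 1.0 - ratio

print(f"Residual / original variability = {ratio:.3f}")
print(f"R square = {r_squared:.3f}")
```

A ratio close to 0 (R square close to 1) means that very little of the variation in Y is left unexplained by the line.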


Ideally, we would like to explain most if not all of the original variability. The R square value is an indicator of how well the model fits the data. For instance, an R square close to 1.0 indicates that we have accounted for almost all of the variability with the variables specified in the model. The degree to which two or more predictors (independent or X variables) are related to the dependent (Y) variable is expressed in the correlation coefficient R, which is the square root of R square. In multiple regression, R can assume values between 0 and 1. To interpret the direction of the relationship between variables, one looks at the signs (plus or minus) of the regression or ß coefficients. If a ß coefficient is positive, then the relationship of this variable with the dependent variable is positive; if it is negative, the relationship is negative. Of course, if the ß coefficient is equal to 0 then there is no relationship between the variables.

Example
Analyze the relationship between the dependent variable (Y) and independent variables (Xi) of the hypothetical data given below.


Table 8.15:
Crop yield/ha (Y)    Farm Oxen/ha (X1)    Fertilizer (Kg/ha) (X2)    Per capita farmland (ha) (X3)    Per capita irrigated land (X4)
18                   2                    50                         2                                1
8                    1                    35                         1                                0.5
28                   4                    15                         1.5                              1
20                   0                    45                         0.5                              0
14                   3                    100                        3                                2
22                   1                    0                          2                                1
24                   2                    38                         4                                2
16                   1                    27                         2                                1
6                    4                    43                         1                                0.5
12                   2                    55                         3                                2
Total 168            20                   408                        20                               11

In order to find the solution for the question, we have to know the values of the following summations:

ΣY, ΣX1, ΣX2, ΣX3, ΣX4: calculated above
Σ(YX1), Σ(YX2), Σ(YX3), Σ(YX4), Σ(X1²), Σ(X2²), Σ(X3²), Σ(X4²), Σ(X1X2), Σ(X1X3), Σ(X2X3), Σ(X1X4), Σ(X2X4), Σ(X3X4): to be calculated

The normal equations are:

$$\sum Y = NA + B\sum X_1 + C\sum X_2 + D\sum X_3 + E\sum X_4 \qquad (1)$$
$$\sum (YX_1) = A\sum X_1 + B\sum (X_1^2) + C\sum (X_1X_2) + D\sum (X_1X_3) + E\sum (X_1X_4) \qquad (2)$$
$$\sum (YX_2) = A\sum X_2 + B\sum (X_1X_2) + C\sum (X_2^2) + D\sum (X_2X_3) + E\sum (X_2X_4) \qquad (3)$$
$$\sum (YX_3) = A\sum X_3 + B\sum (X_1X_3) + C\sum (X_2X_3) + D\sum (X_3^2) + E\sum (X_3X_4) \qquad (4)$$
$$\sum (YX_4) = A\sum X_4 + B\sum (X_1X_4) + C\sum (X_2X_4) + D\sum (X_3X_4) + E\sum (X_4^2) \qquad (5)$$


Table 8.16:
YX1      YX2     YX3    YX4    X1²    X2²      X3²
36       900     36     18     4      2500     4
8        280     8      4      1      1225     1
112      420     42     28     16     225      2.25
0        900     10     0      0      2025     0.25
42       1400    42     28     9      10000    9
22       0       44     22     1      0        4
48       912     96     48     4      1444     16
16       432     32     16     1      729      4
24       258     6      3      16     1849     1
24       660     36     24     4      3025     9
Total 332    6162    352    191    56     23022    50.50

Table 8.16 (continued):
X4²      X1X2    X1X3    X1X4    X2X3    X2X4    X3X4
1        100     4       2       100     50      2
0.25     35      1       0.5     35      17.5    0.5
1        60      6       4       22.5    15      1.5
0        0       0       0       22.5    0       0
4        300     9       6       300     200     6
1        0       2       1       0       0       2
4        76      8       4       152     76      8
1        27      2       1       54      27      2
0.25     172     4       2       43      21.5    0.5
4        110     6       4       165     110     6
Total 16.50    880     42      24.5    894     517     28.5

Substituting these sums into the normal equations gives:

$$168 = 10A + 20B + 408C + 20D + 11E \qquad (1)$$
$$332 = 20A + 56B + 880C + 42D + 24.5E \qquad (2)$$
$$6162 = 408A + 880B + 23022C + 894D + 517E \qquad (3)$$
$$352 = 20A + 42B + 894C + 50.5D + 28.5E \qquad (4)$$
$$191 = 11A + 24.5B + 517C + 28.5D + 16.5E \qquad (5)$$

You can now find all the constants to arrive at an estimating equation. The values of the constants are calculated as follows:
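Solving five simultaneous equations by hand is laborious, so a numerical check is useful. The following is a minimal sketch, assuming NumPy is available and using the ten observations of Table 8.15; the array names are illustrative. It builds the sums of squares and cross-products from the raw data (which reproduces the coefficients of the five equations), solves for A, B, C, D and E, and reports R square for the fitted equation.

```python
import numpy as np

# Data from Table 8.15
Y = np.array([18, 8, 28, 20, 14, 22, 24, 16, 6, 12], dtype=float)    # crop yield/ha
X1 = np.array([2, 1, 4, 0, 3, 1, 2, 1, 4, 2], dtype=float)           # farm oxen/ha
X2 = np.array([50, 35, 15, 45, 100, 0, 38, 27, 43, 55], dtype=float) # fertilizer kg/ha
X3 = np.array([2, 1, 1.5, 0.5, 3, 2, 4, 2, 1, 3], dtype=float)       # farmland per capita
X4 = np.array([1, 0.5, 1, 0, 2, 1, 2, 1, 0.5, 2], dtype=float)       # irrigated land per capita

# A column of ones stands for the constant A
design = np.column_stack([np.ones_like(Y), X1, X2, X3, X4])

# Sums of squares and cross-products: the coefficients of the normal equations
lhs = design.T @ design
rhs = design.T @ Y

# Constants A, B, C, D, E of the estimating equation
coef = np.linalg.solve(lhs, rhs)

fitted = design @ coef
r_squared = 1 - np.sum((Y - fitted) ** 2) / np.sum((Y - Y.mean()) ** 2)

for name, value in zip(["A", "B", "C", "D", "E"], coef):
    print(f"{name} = {value:.3f}")
print(f"R square = {r_squared:.3f}")
```

Printing `lhs` and `rhs` is a quick way to confirm that the column totals of Table 8.16 were added up correctly before the equations are solved.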


Variables    Direct or ß Coefficients    Beta Coefficients (Standardized coefficients)
X1           0.273                       0.052
X2           −0.129                      −0.489
X3           4.873                       0.751
X4           −3.950                      −0.394
Constant     16.104

Then, the estimating equation is

$$Y_c = 16.104 + 0.273X_1 - 0.129X_2 + 4.873X_3 - 3.950X_4$$

Note: The independent variables with negative Beta coefficients (X2 and X4 in the above example) affect the dependent variable (Y) negatively; the greater the absolute value, the greater the effect. Likewise, the independent variables with positive Beta coefficients (X1 and X3 in the above example) affect the dependent variable positively. The Beta coefficient of X3, for instance, is +0.751. Because Beta coefficients are standardized, this means that an increase of one standard deviation in X3 changes the dependent variable positively by 0.751 standard deviations, the other variables being held constant.

Let us calculate the multiple coefficients of correlation and determination for the example above. The formula for the multiple coefficient of determination is:

$$r^2 = \frac{A\sum Y + B\sum (X_1Y) + C\sum (X_2Y) + D\sum (X_3Y) + \dots - N\bar{Y}^2}{\sum Y^2 - N\bar{Y}^2}$$

By using this formula, the coefficients of correlation and determination are calculated to be:
Multiple coefficient of correlation (R) = 0.566
Multiple coefficient of determination (R²) = 0.320 or 32.0 %

What do the values above indicate? The value of the multiple coefficient of determination (R²) = 0.320 or 32.0 % indicates that all the independent variables together explain some 32% of the changes or variance in the dependent variable. This can easily be done by using SPSS software, the output table of which looks like the one indicated below.

873 -3.5 2 X7 Fertilizer Input per Hectare (Kg/ha) 50 40 40 33 0 4 76 0 0 4 76 89 27 5 0 X8 Dung Input per Farmland (Kg/ha) 33 28 30 26 10 14 54 6 5 14 33 40 25 7 10 12 8 8 9 3 5 21 2 2 7 21 25 18 2 3 3 5 4 3 6 5 2 9 10 4 3 2 2 9 7 1.3 0.498 -.5 2.2 1.902 0. Error 8.17: Linear Regression SPSS Output Unstandardized Coefficients Dependent (Constant) Variable: Crop yield per Farm oxen per hectare hectare Fertilizer per hectare Farmland Irrigated Land ß 16.113 9.183 Standardized Coefficients Beta 0.8 1.244 Sig.052 -0.5 2 1 0.6 1.303 0.751 -0.5 7 3 1 1.468 2.394 t 1. 2009 137 .3 7 8 4 1.950 Std.8 1. 0.787 16.5 0 6 1.5 1 0.141 0.4 1.910 0.4 0.306 0.5 6 6 5 0.104 0.119 -1.5 1.489 0.817 Exercise: Answer the following questions based on the following hypothetical data Table 8.640 0.273 -0.6 5 7 6 1.116 0.4 a) Calculate the coefficients of each independent variables b) What percent of the dependent variable is explained by the identified independent variables? c) Find the multivariate prediction equation d) Calculate the predicted values by using the prediction equation Mesay Mulugeta.4 4 0.5 M M M M F F M F F M M M M M F 322 323 123 212 76 45 380 23 76 150 456 444 267 80 35 X2 X1 Family Size Farmland Size (Hectare per Head) X3 Number of Oxen per Hectare 4 3 1 4 0.5 0.18: Hypothetical Data Dependent Variable Food Grain Available per Head (Qntls/Head) Independent Variables X4 X5 X6 Number of Sex of Off-farm other Household Income per Livestock Head Head Per Hectare (Birr) 3 2.129 4.Quantitative Methods in Social Sciences Table 8.

e) Screen out the most significant independent variables by using the stepwise regression analysis model
f) Construct a correlation matrix
g) Confirm the correlation between the dependent variable and each independent variable by using the statistical testing techniques you have learnt in this unit

Unit Nine Tests of Significance
Unit objectives
Having studied this unit, you should be able to:
Understand the basic concept of hypothesis testing
Explain the need for tests of significance in quantitative methods
Distinguish different tools of testing
Select an appropriate testing technique for specific analytical results
Appreciate the use of hypothesis testing in quantitative research methods

9.1. Introduction
For any statistical/quantitative analysis, it is always required to establish the validity or acceptance and rejection level of the results. For this purpose, a set of techniques has been established by statisticians pertaining to various statistical parameters or measures. The outcome of the test is to be compared with some already established values appearing in relevant tables, and the interpretation concerning acceptance or rejection is based on instructions accompanying the tables. Thus, generally, the whole operation is based on (1) establishing a hypothesis or assumption concerning the computed results in relation to the values with which they are compared, (2) a level of significance telling the fractional or percentage level of the comparisons and (3) the degree of freedom, which varies with the techniques and the number of observations.

There is a fact to be considered in hypothesizing or assuming our notion about the results computed. These are whether (1) the result to be tested differs significantly or (2) not, in relation to the already established values. Thus, while hypothesizing we may presume that the value/s does/do not differ significantly from the already established norm or the value with which the comparison is to be made. In this case it is referred to as the Null Hypothesis, generally denoted by Ho. It is also possible to presume that the result differs significantly from the value to be compared with. It is referred to as the Alternative Hypothesis, denoted by H1. It is also true that if Ho is rejected, automatically the alternative hypothesis is accepted, i.e. the computed value differs significantly. Thus, the hypotheses can be considered as follows:

If Ho is       H1 will be
µ = 0          µ ≠ 0, µ > 0, µ < 0
µ1 = µ2        µ1 ≠ µ2, µ1 > µ2, µ1 < µ2
σ² = σ0²       σ² ≠ σ0², σ² > σ0², σ² < σ0²

Two broad classifications can be made among tests: parametric and non-parametric tests. A parametric test is a statistical test that depends on an assumption about the distribution, unlike a non-parametric test. Parametric tests use statistical parameters like the mean, standard deviation, variance and correlation coefficients. For example, in Analysis of Variance (ANOVA), a typical parametric test, there are three assumptions:
Observations are independent
The sample data have a normal distribution
Scores in different groups have homogeneous variances

Non-parametric methods, on the other hand, are widely used for studying populations that take on a ranked order ignoring the actual values. The use of non-parametric methods is, therefore, necessary when data have no actual numerical interpretations/values but rather take into consideration information like frequencies and ranks. Included in the first broad category (parametric tests), among others, are the z-test and ANOVA, while among non-parametric tests are the Chi-square (X²) test, the Mann-Whitney U-Test or Wilcoxon-Mann-Whitney Rank-Sum Test and the Kruskal-Wallis Test (H-test).

It is very difficult to have a comparative assessment of the two groups, parametric and nonparametric tests. This is because both suffer from some limitations. Invariably the assumption with the parametric tests is that the distribution is normal, while nonparametric tests can be used with all types of distributions. In other words, a nonparametric test is more flexible in its conditions than its counterpart. But it is also remarked that nonparametric tests are less reliable than the parametric tests.

9.2. Level of Significance
It is the quantity of risk which we are ready to tolerate in making a decision about acceptance or rejection. In other words, it is the probability which is tolerable, or the probability level at which the decision-maker concludes that the observed difference between the value of the test statistic and the hypothesized parameter value cannot be due to chance. The level of significance is denoted by α (alpha) and is conventionally chosen as 0.05 or 0.01, corresponding to confidence levels of 95% and 99%, respectively. Level α = 0.01 is used for high precision and α = 0.05 for moderate precision. Often other levels of significance like 0.1, 0.2 and 0.3 may also appear in tables. The level of significance is specified before the samples are drawn so that the results obtained should not influence the choice of the decision-maker.

9.3. Degree of Freedom
Degree of freedom is the number of independent observations in a set. In a test of hypothesis, a sample is drawn from the population of which the parameter is under test. The size of the sample varies since it depends either on the experimenter or on the resources available or also on the nature of the phenomena being investigated. Moreover, the test statistic involves the estimated value of the parameter, which depends on the number of observations. Hence, the sample size plays an important role in testing of hypotheses and is taken care of by degrees of freedom.

On the basis of observational data, a test is performed to decide whether a postulated hypothesis is accepted or not. This involves a certain amount of risk. This amount of risk is called the level of significance. When the hypothesis is accepted (that is, we fail to reject it and it is retained), we consider it a non-significant result, and when the hypothesis is rejected, it is called a significant result. The table values for the distribution of test statistics are provided in separate pages of statistics books. The tabulated values make us decide about the rejection or acceptance of the hypothesis.

Statistical tests of hypothesis play an important role in geographical studies. For example, a researcher in economic geography may be interested to know whether there is a significant difference in landholdings among rural households in four woredas. Then, the researcher can collect the landholding size of certain sample households from each woreda and perform a statistical test based on the observations. The statistical test tells the researcher whether the landholding sizes differ significantly among the woredas or not. It is also possible to find the degree of relationship (correlation) between two geographical data, say crop yield and fertilizer input per unit area, and perform a statistical test which enables us to decide whether the correlation is statistically significant or not. Note that when the correlation between two variables X and Y is statistically not significant, it means that the relationship between the variables is only because of chance, not because one variable affects the other.

9.4. Procedure for Hypothesis Testing
This refers to the steps required to test the validity of the claim or assumption about the sample statistic. The results of the analysis are used to decide whether the claim is valid or not. Hence, the general procedure for any hypothesis testing is summarized below.

Step 1: State the null hypothesis (Ho) and alternative hypothesis (H1)
Theoretically, hypothesis testing requires that the null hypothesis be considered true (no difference) until it is proved false on the basis of results observed from the sample data. The null hypothesis is always expressed in the form of an equation making a claim regarding the specific value of the population parameter. The general way of expressing the null hypothesis is that the sample parameters do not differ significantly from the population parameters that they represent. That is:

Ho: µ = µo

Where,
µ = population mean
µo = hypothesized parameter value or the parameter obtained from some similar studies

An alternative hypothesis is the logical opposite of the null hypothesis; that is, an alternative hypothesis must be true when the null hypothesis is found to be false. It is stated as:

H1: µ ≠ µo
H1: µ < µo or µ > µo

Step 2: State the Level of Significance or α (alpha) for the Test
The level of significance is specified before the samples are drawn, so that the results obtained should not influence the choice of the decision-maker. It is specified in terms of the probability of rejecting the null hypothesis when it is in fact true.

Step 3: Establish Critical or Rejection Region
As can be seen from the figure below, the sample space of the experiment is divided into two mutually exclusive regions. These are called the acceptance region and the rejection or critical regions in normally distributed data.

Figure 9.1: Two-tailed Test Region (a rejection region of α/2 in each tail, where Ho is rejected; the acceptance region lies between the two critical values, where Ho is accepted)

If the value of the test statistic falls into the acceptance region, the null hypothesis is accepted, otherwise it is rejected. At this stage we should bear in mind that research hypotheses can be of two types, one-tailed and two-tailed. A one-tailed hypothesis makes predictions regarding both the presence of a significant effect and the direction of this difference or association. In contrast, a two-tailed hypothesis predicts only the presence of a statistically significant effect, not its direction. For instance, (1) we can test whether there will be a difference in performance of young and old participants on a memory test (two-tailed), and (2) whether young participants perform better on a memory test than the elderly (one-tailed).

Step 4: Calculate the Suitable Test Statistic
The value of the test statistic is calculated from the distribution of the sample statistic by using the following formula:

$$\text{Test statistic} = \frac{\text{Value of sample statistic} - \text{Value of hypothesized population parameter}}{\text{Standard error of the sample statistic}}$$

Step 5: Reach a Conclusion
Compare the calculated value of the test statistic with the critical value (also called standard table value or tabulated value). The decision rules for the null hypothesis are as follows:

|Value|Cal ≥ |Value|Table: Reject the Ho
|Value|Cal < |Value|Table: Accept the Ho

9.4.1. One-tailed and Two-tailed Tests
There are two types of tests referred to as the one-tailed and two-tailed tests. The type of test depends on the way the hypotheses are formulated.

a. A two-tailed test is when the null and alternative hypotheses are stated as:
Ho: µ = µo and H1: µ ≠ µo
This implies that any deviation (either on the lower or higher side) of the calculated value of the test statistic from the hypothesized value leads to rejection of the null hypothesis. Then, if the significance level for the test is α percent, a rejection region equal to α/2 percent is kept in each tail of the sampling distribution. The rejection region is kept in both tails as indicated in Fig. 9.1.

b. A one-tailed test is when the null and alternative hypotheses are stated as:
Ho: µ ≤ µo and H1: µ > µo (Right-tailed test), or
Ho: µ ≥ µo and H1: µ < µo (Left-tailed test).
This implies that the value of the sample statistic is either higher or lower than the hypothesized parameter value. This leads to the rejection of the null hypothesis for a significant deviation from the specified value in one direction or tail of the curve of the sampling distribution. Look at Fig. 9.2 below.

Figure 9.2: One-tailed test (Right-tailed) (acceptance region, where Ho is accepted, to the left of the critical value; rejection region of α, where Ho is rejected, in the right tail)

9.5. Errors in Hypothesis Testing
Ideally the hypothesis testing procedure should lead to the acceptance of Ho when it is true and the rejection of Ho when it is not. However, the correct decision is not always possible. Since the decision to accept or reject a hypothesis is based on sample data, there is a possibility of an incorrect decision or error. A decision-maker may commit two types of errors while testing a null hypothesis. These are known as Type I Error (α) and Type II Error (ß). A type I error is made when a true Ho is rejected, that is, we conclude that H1 is true when it is wrong. On the other hand, a type II error is made when a false Ho is accepted, that is, we conclude that H1 is wrong when it is true.

9.6. Common Types of Hypothesis Testing
1. Student's t-test
It is the deviation of the estimated mean from its population mean expressed in terms of the standard error. Student's t-test was named after William Sealy Gosset, who under his pen name "Student" developed a method for hypothesis testing popularly known as the t-test. It is said that Gosset was employed by the Guinness Brewery in Dublin, which did not permit him to publish his research findings under his own name, so he published his research findings in 1908 under the pen name "Student". A t-test is used, among others, for:
Hypothesis testing for a single population mean
Hypothesis testing for the difference b/n two populations with independent samples
Hypothesis testing for the difference b/n two populations with dependent samples
Hypothesis testing for an observed coefficient of correlation including partial and rank correlations
Hypothesis testing for an observed regression coefficient

When testing a hypothesis with small samples (<30), we must assume that the samples come from a normally or nearly normally distributed population.

I. Hypothesis testing for Single Population Mean:
The test statistic for determining the difference b/n the sample mean $\bar{x}$ and the population mean µ is given by:

$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$$

Where,
$\bar{x}$ = sample mean
µ = population mean
s = sample standard deviation
n = sample size

Note: This test statistic has a t distribution with n − 1 degrees of freedom. The tabulated t-value gives the critical value of t. More clearly, if |tcal| ≥ tα (one-tailed) or tα/2 (two-tailed), reject the Ho, otherwise accept it. Note also that the sample size should be small (<30) and at least five observations (taken from a normally distributed population) are desirable to apply the t-test.

Example
The average rural household's cereal production per year is specified to be 18.5 quintals. From that, a sample of 14 households was selected. The mean and standard deviation of the sample were calculated as 17.85 quintals and 1.955 quintals, respectively. Test the significance of the deviation.

Solution:
Let us take the null hypothesis that there is no significant deviation in the amount of production among households.
Ho: µ = 18.50 and H1: µ ≠ 18.50
Given, n = 14, $\bar{x}$ = 17.85, s = 1.955, df = 13, α = 0.05, α/2 = 0.025

$$t = \frac{17.85 - 18.50}{1.955/\sqrt{14}} = -1.24$$

The critical value of t at df = 13 and α/2 = 0.025 is 2.160. Since the absolute value of tcal (1.24) is less than its critical value (ttab = 2.160), the null hypothesis Ho is accepted. Hence, we conclude that there is no significant deviation of the sample mean from the population mean.

Exercise
Let a herbicide spray machine be set to give 20 kilograms of herbicide per hectare of land. Seven plots of land (each one hectare in area) are examined and the amounts of herbicide in the plots are found to be 19, 22, 20, 18, 21, 17 and 19 kilograms. Is there reason to accept that the machine is defective?

Hint:
Table 9.1:
Variables (x)    Deviation from Mean (x − x̄)
19
22
20
18
21
17
19
x̄ = ?
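The worked example on cereal production above can be reproduced in a few lines. This is a minimal sketch using only the Python standard library; the function name `one_sample_t` is illustrative, and the critical value 2.160 is the tabulated value quoted in the example (df = 13, α/2 = 0.025) rather than something computed by the code.

```python
import math


def one_sample_t(sample_mean, population_mean, sample_sd, n):
    """Return t = (sample mean - population mean) / (s / sqrt(n))."""
    standard_error = sample_sd / math.sqrt(n)
    return (sample_mean - population_mean) / standard_error


# Values from the worked example (cereal production, in quintals)
t_cal = one_sample_t(sample_mean=17.85, population_mean=18.50,
                     sample_sd=1.955, n=14)
degrees_of_freedom = 14 - 1
t_tab = 2.160  # tabulated critical value for df = 13 and alpha/2 = 0.025

# Two-tailed decision rule: reject Ho only if |t_cal| >= t_tab
reject_null = abs(t_cal) >= t_tab

print(f"t = {t_cal:.2f}, df = {degrees_of_freedom}")
print("Reject Ho" if reject_null else "Accept Ho")
```

The same function can be applied to the herbicide exercise once the sample mean and standard deviation of the seven plots have been calculated.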

II. Hypothesis testing for Difference of Two Means:
For comparing two mean values of two normally distributed populations, we can draw independent random samples of sizes n1 and n2 from the two populations. If µ1 and µ2 are the mean values of the two populations, then our aim is to estimate the value of the difference µ1 − µ2 b/n the mean values of the two populations. Let the sample values be denoted by (x1a, x1b, x1c, ...) and (x2a, x2b, x2c, ...). Then, the expression for t is:

$$t = \frac{\bar{x}_1 - \bar{x}_2}{S_p\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} \quad \text{or} \quad t = \frac{\bar{x}_1 - \bar{x}_2}{S_p}\sqrt{\frac{n_1 n_2}{n_1 + n_2}}$$

Where, $\bar{x}_1$ and $\bar{x}_2$ are the means of samples I and II respectively. $S_p$ is the pooled standard deviation, which is equal to $\sqrt{S_p^2}$. $S_p^2$ can be calculated by using the formula below:

$$S_p^2 = \frac{\sum (x_{1i} - \bar{x}_1)^2 + \sum (x_{2i} - \bar{x}_2)^2}{n_1 + n_2 - 2}$$

Note: In hypothesis testing for the difference of two means, the statistic t has (n1 + n2 − 2) degrees of freedom. The calculated value of the t-test statistic here represents the number of standard errors by which the difference $\bar{x}_1 - \bar{x}_2$ is away from the value of µ1 − µ2 specified in Ho. Thus the rule to either accept or reject a null hypothesis is as follows:

Ho: µ1 − µ2 = µ0 and H1: µ1 − µ2 ≠ µ0

Accept Ho if the absolute calculated value of t is less than its critical value at the specified level of significance and degrees of freedom (n1 + n2 − 2). Otherwise reject Ho.

Exercise
Let us assume that the following table shows the test scores (out of 20) of two groups of students in a class.

Group I Group II 11 20 18 8 12 13 12 18 17 16 14 19 14 15 16 17 11 17 7 10 13 15 16 13 11 14 19 10

Then, examine the significance of the difference between the means of the marks secured by the students of the two groups.
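The pooled-variance calculation described above can be sketched as follows. The two small samples in this example are hypothetical values chosen only to illustrate the arithmetic (they are not the exercise data), and the function name `pooled_t` is likewise illustrative. Only the Python standard library is used.

```python
import math
from statistics import mean


def pooled_t(sample1, sample2):
    """Return (t, degrees of freedom) for two independent samples."""
    n1, n2 = len(sample1), len(sample2)
    mean1, mean2 = mean(sample1), mean(sample2)

    # Sums of squared deviations of each sample around its own mean
    ss1 = sum((value - mean1) ** 2 for value in sample1)
    ss2 = sum((value - mean2) ** 2 for value in sample2)

    # Pooled variance and pooled standard deviation
    pooled_variance = (ss1 + ss2) / (n1 + n2 - 2)
    pooled_sd = math.sqrt(pooled_variance)

    t = (mean1 - mean2) / (pooled_sd * math.sqrt(1 / n1 + 1 / n2))
    return t, n1 + n2 - 2


# Hypothetical scores for two independent groups (illustration only)
group_a = [12, 15, 11, 14, 13]
group_b = [16, 18, 14, 17, 15]

t_cal, df = pooled_t(group_a, group_b)
print(f"t = {t_cal:.2f} with {df} degrees of freedom")
```

The absolute value of the result is then compared with the tabulated t for the chosen level of significance and n1 + n2 − 2 degrees of freedom.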

2. Z-test
It is one of the commonest types of hypothesis testing. The Z-test (not the same as the Z-score, though closely related) compares sample and population means to determine whether there is a statistically significant difference. The Z-test, also known as the normal test, is used in cases when the population variance(s) is/are known and the sample size is large (>30). Theoretically, when the sample size is large, the sample variance approaches the population variance and is deemed to be almost equal to it. In this way, the population variance is treated as known even if we have only sample data, and hence the normal test is applicable. The distribution of Z is always normal with a mean of zero (0) and a variance of one (1).

For testing Ho: µ = µo against H1: µ ≠ µo, the test statistic is

$$Z = \frac{\bar{x} - \mu_0}{\sigma/\sqrt{n}}$$

Where, $\bar{x}$ is the sample mean and σ is the standard deviation based on the large sample of size n. In the formula above, $\sigma/\sqrt{n}$ represents the standard error, SE. Then, the z-test can also be stated as

$$Z = \frac{\bar{x} - \mu_0}{SE}$$

Exercise
Let the table below give the daily income (in Eth. Birr) of 40 randomly selected laborers in a manufacturing plant.

Table 9.2:
12   7   8   9  12  10  16  23  21  17
 6   8   9  12  32  16   4  21  20  22
 8   9   4  23  21  20  12  16  18  19
 5  18   9  12  12   6  14  21  22  18

Analyze whether or not it can be concluded that the average (mean) income of a person in this manufacturing plant is 15 Eth. Birr.
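As a rough illustration of the Z formula, the sketch below applies it to the 40 incomes of Table 9.2. Two things here are assumptions rather than part of the exercise: the sample standard deviation is used in place of σ (the large-sample argument made above), and the 5% two-tailed critical value 1.96 is used because the exercise does not state a level of significance. The function name `z_statistic` is illustrative.

```python
import math
from statistics import mean, stdev


def z_statistic(sample, mu0):
    """Return Z = (sample mean - mu0) / SE, with SE = s / sqrt(n)."""
    standard_error = stdev(sample) / math.sqrt(len(sample))
    return (mean(sample) - mu0) / standard_error


# Daily incomes (Eth. Birr) of the 40 laborers in Table 9.2.
incomes = [
    12, 7, 8, 9, 12, 10, 16, 23, 21, 17,
    6, 8, 9, 12, 32, 16, 4, 21, 20, 22,
    8, 9, 4, 23, 21, 20, 12, 16, 18, 19,
    5, 18, 9, 12, 12, 6, 14, 21, 22, 18,
]

z_cal = z_statistic(incomes, mu0=15)

# Assumed critical value: two-tailed test at the 5% level.
Z_TAB = 1.96
reject_h0 = abs(z_cal) >= Z_TAB
print(f"mean = {mean(incomes):.1f}, Z = {z_cal:.3f}, reject Ho: {reject_h0}")
```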

Hint:
1. State the hypothesis in such a way that Ho: µ = 15 against H1: µ ≠ 15.
2. Since the sample size is 40 (i.e. large), you should use a normal test (Z-test).
3. Then, first you should calculate the sample mean, x̄. Calculate also the standard deviation, σ.
4. Then, by using the formula for the Z-test with µ = 15, calculate Z, compare it against the tabulated value and decide whether to accept or reject the Ho.

3. Hypothesis Testing Based on F-distribution (F-test)
The F-test is one of the parametric tests. It goes back to Sir Ronald Fisher, who initially developed the statistic as the variance ratio in the 1920s. The F-distribution, also called the variance ratio distribution, or F-test is used either for testing the hypothesis about the equality of two population variances or the equality of two or more population means. The equality of two population means has been dealt with under the t-test. Here, therefore, more emphasis is given to how to test hypotheses by comparing the standard deviations or variances of populations.

The assumptions for the F-distribution (F-test) are:
a. The F-values are non-negative
b. The distribution is non-symmetric, and samples should be drawn from normal populations
c. There are two independent degrees of freedom, one for the numerator and the other for the denominator
d. The larger variance should always be placed in the numerator

Let there be two normal populations $N(\mu_1, \sigma_1^2)$ and $N(\mu_2, \sigma_2^2)$. The hypothesis indicated below can be tested by the F-test:

Ho: $\sigma_1^2 = \sigma_2^2$ against H1: $\sigma_1^2 \ne \sigma_2^2$

Whenever independent random samples of size n1 and n2 are drawn from two normal populations, the F-ratio will be calculated by the formula below:

$$F = \frac{S_1^2}{S_2^2}$$
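A minimal sketch of the variance ratio is given below. The function name `f_ratio` and the two small samples are hypothetical and serve only to show the mechanics; the function places the larger sample variance in the numerator, in line with assumption (d) above, and returns the matching numerator and denominator degrees of freedom.

```python
from statistics import variance


def f_ratio(sample_a, sample_b):
    """Return (F, df_numerator, df_denominator) with the larger variance on top."""
    var_a, var_b = variance(sample_a), variance(sample_b)
    if var_a >= var_b:
        return var_a / var_b, len(sample_a) - 1, len(sample_b) - 1
    return var_b / var_a, len(sample_b) - 1, len(sample_a) - 1


# Hypothetical samples, not taken from the text.
sample_a = [2, 4, 4, 6, 9]      # sample variance 7
sample_b = [3, 4, 5, 6, 7, 5]   # sample variance 2

f_cal, df_num, df_den = f_ratio(sample_a, sample_b)
print(f"F = {f_cal:.2f} with ({df_num}, {df_den}) degrees of freedom")
```

Because the function orders the variances itself, swapping the two arguments gives the same F and the same pair of degrees of freedom.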

As a norm, the larger variance is taken in the numerator of the formula ($S_1^2 > S_2^2$ in the formula above), with n1 − 1 degrees of freedom for the numerator and n2 − 1 degrees of freedom for the denominator. Note that we keep the larger variance in the numerator so that the ratio is always equal to or greater than one.

For H1: $\sigma_1^2 > \sigma_2^2$, reject Ho if $F_{cal} > F_{\alpha}$.
For H1: $\sigma_1^2 < \sigma_2^2$, reject Ho if $F_{cal} < F_{1-\alpha}$.
In the reverse situation Ho is not rejected.

Exercise
Let us assume that the following table represents the life expectancy of 7 and 9 regional states of Ethiopia in 1991 and 2007, respectively.

Table 9.3: Life expectancy in years
Regions 1 2 3 4 5 6 7
1991 43.5 47.
8 58.3 58.2 38.2 49.9 60.2 41.2 50.0 39.5 47.9 48.0 56.6 54.5 41.1
2007 54.5

Hint:
a. State the hypothesis as Ho: $\sigma_1^2 = \sigma_2^2$ vs H1: $\sigma_1^2 \ne \sigma_2^2$
b. First calculate $S_1^2$ and $S_2^2$
c. Then, confirm whether the variation in life expectancy in the various regions in 1991 and in 2007 is the same or not. The degrees of freedom are 6 and 8 for the data sets of 1991 and 2007, respectively.

4. Mann-Whitney U-Test (Wilcoxon-Mann-Whitney Rank-Sum Test)
Initially proposed by Wilcoxon (1945), for equal sample sizes, and later by Mann and Whitney (1947), for arbitrary sample sizes, it is a non-parametric test for assessing whether two samples of

observations come from the same distribution or not. It is one of the best known non-parametric significance tests.

5. Kruskal Wallis Rank-Sum Test (H-test)
In statistics, the Kruskal-Wallis one-way analysis of variance by ranks (named after William Kruskal and W. Allen Wallis) is a non-parametric method of testing the equality of population medians among groups. It is an extension of the Mann-Whitney U-test to 3 or more groups. The formula for the H-test is

$$H = \frac{12}{N(N+1)} \sum \frac{R_i^2}{n_i} - 3(N+1)$$

Where,
N = total number of observations
$R_i$ = sum of the ranks of the ith set
$n_i$ = frequency of the ith set

Exercise
Find the H-value for the four sets of data given below and compare it against the tabulated value.

Table 9.4:
A     B     C     D
4     6     12    10
6     8     16    10
2     6     14    10
6     10    8     6
2     0     20    4

Solution
Rank the data as follows. Start from the lowest data. Tied values share the average of the ranks they occupy.

Table 9.5: Data Ranking (rank in brackets)
     A          B          C          D          Grand Total of Rank
     4(4.5)     6(8)       12(17)     10(14.5)
     6(8)       8(11.5)    16(19)     10(14.5)
     2(2.5)     6(8)       14(18)     10(14.5)
     6(8)       10(14.5)   8(11.5)    6(8)
     2(2.5)     0(1)       20(20)     4(4.5)
R    25.5       43.0       85.5       56.0       210.0

$$H = \frac{12}{20(21)}\left(\frac{650.25}{5} + \frac{1849}{5} + \frac{7310.25}{5} + \frac{3136}{5}\right) - 3 \times 21$$

H-value = 10.789
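The ranking and the H formula can be sketched in Python as below. The helper names `average_ranks` and `kruskal_wallis_h` are illustrative, no tie correction is applied (the text does not use one), and the fifth value of set C is taken as 20, which is an assumption based on the rank sum 85.5 and the grand total 210 used in the solution.

```python
def average_ranks(values):
    """Rank values from lowest to highest; tied values get the mean of their ranks."""
    ordered = sorted(values)
    rank_of = {}
    for value in set(ordered):
        first_position = ordered.index(value) + 1
        count = ordered.count(value)
        rank_of[value] = first_position + (count - 1) / 2
    return [rank_of[value] for value in values]


def kruskal_wallis_h(groups):
    """Return (H, rank sums) using H = 12/(N(N+1)) * sum(R_i^2 / n_i) - 3(N+1)."""
    pooled = [x for group in groups for x in group]
    ranks = average_ranks(pooled)
    n_total = len(pooled)
    rank_sums, start = [], 0
    for group in groups:
        rank_sums.append(sum(ranks[start:start + len(group)]))
        start += len(group)
    weighted = sum(r ** 2 / len(g) for r, g in zip(rank_sums, groups))
    h = 12 / (n_total * (n_total + 1)) * weighted - 3 * (n_total + 1)
    return h, rank_sums


# Sets A to D; the last value of C is assumed to be 20 (see the note above).
set_a = [4, 6, 2, 6, 2]
set_b = [6, 8, 6, 10, 0]
set_c = [12, 16, 14, 8, 20]
set_d = [10, 10, 10, 6, 4]

h_cal, rank_sums = kruskal_wallis_h([set_a, set_b, set_c, set_d])
print(f"rank sums = {rank_sums}, H = {h_cal:.3f}")
```

With these rank sums the formula evaluates to roughly 10.97, a little above the 10.789 quoted in the solution. Either figure exceeds the tabulated value of 7.82, so the decision on Ho is the same.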

By using the chi-square table, compare H-calculated with H-tabulated and decide whether the Ho is to be rejected or retained. Note that it is the Chi-square ($\chi^2$) table that must be used for the H-test also.

H-calculated = 10.789
H-tabulated = 7.820
Hcal > Htab, then reject Ho.

6. Chi-square ($\chi^2$) Test
This is also one of the non-parametric or distribution-free statistical tests, which goes back to 1900, when Karl Pearson used it for frequency data classified into k mutually exclusive categories. It is usually represented by $\chi^2$, from the Greek letter Chi. The sampling distribution of $\chi^2$ is called the $\chi^2$ distribution. Like other hypothesis testing procedures, the calculated $\chi^2$-test statistic is compared with its critical (or table) value to know whether the Ho hypothesis is true or not. The decision of accepting the Ho is based on how close the sample results are to the expected results. The data should be expressed in original units, rather than in percentage or ratio form.

Let us assume that we have a perceived value of variance ($\sigma_0^2$) of a normal population on the basis of previous knowledge. Draw a random sample of size n (<30) from this population. On the basis of the n sample observations, the postulated value $\sigma_0^2$ of the population variance ($\sigma^2$) is to be either substantiated or rejected with the help of a statistical test. The hypothesis here is:

Ho: $\sigma^2 = \sigma_0^2$ vs H1: $\sigma^2 \ne \sigma_0^2$

This can be tested by:

$$\chi^2 = \frac{\sum (x_i - \bar{x})^2}{\sigma_0^2} \quad \text{or} \quad \chi^2 = \frac{(n-1)S^2}{\sigma_0^2}$$

Where, $S^2$ = sample variance, and the $\chi^2$ statistic has (n − 1) degrees of freedom.

To accept or reject Ho:
1. Reject Ho at the pre-decided level of significance α if $\chi^2_{cal} \ge \chi^2_{\alpha/2}$ at n − 1 degrees of freedom, or if $\chi^2_{cal} \le \chi^2_{(1-\alpha/2)}$ at n − 1 degrees of freedom.
2. In case of a one-tailed test (Ho: $\sigma^2 = \sigma_0^2$ vs H1: $\sigma^2 > \sigma_0^2$), reject Ho if $\chi^2_{cal} \ge \chi^2_{\alpha}$ at n − 1 degrees of freedom; for H1: $\sigma^2 < \sigma_0^2$, reject Ho if $\chi^2_{cal} \le \chi^2_{(1-\alpha)}$ at n − 1 degrees of freedom.
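The statistic for testing a postulated variance can be sketched as follows. The function name `chi_square_variance_stat` is illustrative. The eight weights are those of the worked example that follows this section, 5.13 is the value that example divides by, and 14.0671 is the tabulated value it quotes; none of these figures is computed here from first principles.

```python
from statistics import mean


def chi_square_variance_stat(sample, sigma0_sq):
    """Return (chi-square, df) where chi-square = sum((x - mean)^2) / sigma0_sq."""
    m = mean(sample)
    sum_of_squares = sum((x - m) ** 2 for x in sample)
    return sum_of_squares / sigma0_sq, len(sample) - 1


# Weights (kg) of the eight sample items in the worked example.
weights = [5, 7, 10, 4, 9, 4, 8, 6]

# Denominator used in the worked example's computation.
SIGMA0_SQ = 5.13

chi2_cal, df = chi_square_variance_stat(weights, SIGMA0_SQ)

# Tabulated value quoted in the example for alpha = 0.05 and 7 degrees of freedom.
CHI2_TAB = 14.0671
reject_h0 = chi2_cal >= CHI2_TAB
print(f"chi-square = {chi2_cal:.2f}, df = {df}, reject Ho: {reject_h0}")
```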

Example
Let us assume that a factory owner wants to purchase a commodity if it does not have a variance of more than 0.4 kilograms in weight. To make sure of the specifications, the buyer selects 8 sample items of the commodity. The weight of each sample item was measured to be as follows.

Table 9.6:
Weight in kilogram    (xi − x̄)      (xi − x̄)²
5                     5 − 6.625     2.64
7                     7 − 6.625     0.14
10                    10 − 6.625    11.39
4                     4 − 6.625     6.89
9                     9 − 6.625     5.64
4                     4 − 6.625     6.89
8                     8 − 6.625     1.89
6                     6 − 6.625     0.39
x̄ = 6.625                           Σ(xi − x̄)² = 35.87

$$\chi^2 = \frac{\sum (x_i - \bar{x})^2}{\sigma_0^2} = \frac{35.87}{5.13} = 6.9922$$

For α = 0.05 and 7 degrees of freedom (n − 1 = 7), the tabulated value from the Chi-square table ($\chi^2$-distribution) is 14.0671. Then, since the calculated value (6.9922) is less than the tabulated value (14.0671), we should accept the null hypothesis. It means that the factory owner should purchase the commodity.

7. Analysis of Variance (ANOVA)
Analysis of variance (ANOVA), developed by Sir Ronald Fisher, helps to test the differences between three or more sample means drawn from two or more sets of populations. Here we test the null hypothesis (Ho) that the three or more populations from which the samples are drawn have equal (homogeneous) means against the alternative hypothesis (H1) that the population means are not all equal.

H0: µ1 = µ2 = ... = µk
H1: Not all µj are equal (j = 1, 2, 3, ..., k)

Hence, the null and alternative hypotheses about the population means imply that the null hypothesis should be rejected if any of the k sample means is different from the others. The assumptions for the analysis of variance are:
a. Each population has a normal distribution
b. The populations from which the samples are drawn have equal variances
c. Each sample is drawn randomly and is independent of the other samples

The first step in the analysis of variance is to partition the total variation in the sample data into the following two component variations. These are:
a. The amount of variation among (variation between) the sample means, or the variation attributable to the difference among the sample means.
b. The amount of variation within the sample observations. This difference is considered to be due to chance causes or random errors.

In analysis of variance, a table known as the ANOVA table is required and established as follows:

Table 9.7:
Source of variation    Degrees of freedom    Sum of squares    Variance    F-value

Total

Example
Let us assume that 4 enumerators are sent to a market to collect data related to the price of a commodity. Then, we can apply ANOVA to check (test) the differences of the prices collected by the enumerators.

H0: µ1 = µ2 = µ3 = µ4
H1: µ1, µ2, µ3 and µ4 are not all equal

Enumerators    Prices in Eth. Birr     Total    x̄
A              4   6   2   6   2       20       4
B              6   8   6   10  0       30       6
C              12  16  14  8   20      70       14
D              10  10  10  6   4       40       8
                                       20 + 30 + 70 + 40 = 160

Grand mean = (4 + 6 + 14 + 8) / 4 = 8

Then, find the variation within as follows:
Variation within the data set collected by A: Σ(xi − x̄)² = 16
Variation within the data set collected by B: Σ(xi − x̄)² = 56
Variation within the data set collected by C: Σ(xi − x̄)² = 80
Variation within the data set collected by D: Σ(xi − x̄)² = 32
Total variation within: 16 + 56 + 80 + 32 = 184

We also have to find the variation between as follows. Here the mean of each data set represents the whole data set, so each squared deviation from the grand mean is counted once for each of the 5 observations:
Variation b/n for the data set collected by A: Σ(x̄ − grand mean)² = 16 × 5 = 80
Variation b/n for the data set collected by B: Σ(x̄ − grand mean)² = 4 × 5 = 20
Variation b/n for the data set collected by C: Σ(x̄ − grand mean)² = 36 × 5 = 180
Variation b/n for the data set collected by D: Σ(x̄ − grand mean)² = 0 × 5 = 0
Total variation b/n: 80 + 20 + 180 + 0 = 280
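The two sums of squares worked out above can be reproduced with a short Python sketch. The function name `one_way_anova` is illustrative, and the fifth price recorded by enumerator B is taken as 0, an assumption that matches B's stated total of 30, mean of 6 and within variation of 56. The sketch also forms the variance ratio F from the two sums of squares and their degrees of freedom (k − 1 between, N − k within).

```python
from statistics import mean


def one_way_anova(groups):
    """Return (SS between, SS within, df between, df within, F) for one-way ANOVA."""
    all_values = [x for group in groups for x in group]
    grand_mean = mean(all_values)
    # Variation within: squared deviations of each value from its own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    # Variation between: squared deviation of each group mean from the grand
    # mean, counted once per observation in the group.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_value = (ss_between / df_between) / (ss_within / df_within)
    return ss_between, ss_within, df_between, df_within, f_value


# Prices (Eth. Birr) collected by the four enumerators; B's last value assumed 0.
prices = {
    "A": [4, 6, 2, 6, 2],
    "B": [6, 8, 6, 10, 0],
    "C": [12, 16, 14, 8, 20],
    "D": [10, 10, 10, 6, 4],
}

ss_b, ss_w, df_b, df_w, f_cal = one_way_anova(list(prices.values()))
print(f"SS between = {ss_b}, SS within = {ss_w}, "
      f"df = ({df_b}, {df_w}), F = {f_cal:.3f}")
```

Carrying full precision gives F of about 8.116; rounding the between-groups variance to 93.3 before dividing gives a slightly smaller figure.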

Grand variation = total variation within (184) + total variation b/n (280) = 464

Finally, create an ANOVA table as follows:

Source of variation    Degrees of freedom    Sum of squares    Variance          F-value
Variation b/n          3                     280               280/3 = 93.3      93.3/11.5 = 8.113
Variation within       16                    184               184/16 = 11.5
Grand variation        19                    464

The conclusion of the ANOVA above is that since Fcal (8.113) > Ftab (3.24), the difference is significant (µ1, µ2, µ3 and µ4 are not all equal) and we reject the null hypothesis.

9.7. Test of Significance of Correlation Coefficients
Whatever conclusions are drawn from the sample under consideration are meant to draw inferences about the parent population. The estimates are not unique, and hence a sort of confirmation is sought by way of a test of significance for the validity of inferences drawn from the sample about the population.

The test of significance for the existence of a linear relationship between two variables x and y involves the determination of the sample correlation coefficient r. This test of a linear relationship between x and y is the same as determining whether there is any significant correlation between them. For determining the correlation, we start by hypothesizing that the population correlation coefficient ρ is equal to zero. The test of significance of the correlation coefficient means to test the hypothesis of whether or not the correlation coefficient is zero in the population. This means we test:

Ho: ρ = 0 (There is no correlation) vs H1: ρ ≠ 0 (There exists a correlation)

Thus, the t-test statistic for testing the null hypothesis is:

$$t_{n-2} = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$

Where,
r = sample coefficient of correlation of n pairs of observations
ρ = population coefficient of correlation
(n − 2) = degrees of freedom
$t_{n-2}$ = t-test with n − 2 degrees of freedom
n = sample size

If the calculated value of t is greater than the tabulated value of t for the α level of significance and (n − 2) degrees of freedom, reject Ho. This means that the correlation between them is worth considering. Rejection of Ho leads to the conclusion that the two variables are not independent. On the other hand, if Ho is accepted, it means that the value of r is due to sampling error, whereas in reality the two variables are uncorrelated in the population.

Example
Calculate the correlation coefficient of the following paired hypothetical data and confirm or test its validity.

Table 9.8:
Farm Households    Crop Yield per Unit Area (in Quintals)    Fertilizer Application Per Unit Area (in Kg)
1                  25                                        70
2                  14                                        30
3                  18                                        30
4                  22                                        60
5                  15                                        30
6                  18                                        35

Solution
1. Firstly, we should calculate the coefficient of correlation (r) for the paired data. By the methods (formula) you have learnt earlier in this course, the Pearson's correlation coefficient for the data is calculated to be +0.940.
2. State the hypothesis, i.e. Ho: ρ = 0 (There is no correlation) vs H1: ρ ≠ 0 (There exists a correlation).
3. Apply the method above to confirm or test the correlation, with r = 0.940, r² = 0.8836 and n = 6:

$$t_{n-2} = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$

4. Substituting the values:

$$t_{4} = \frac{0.940\sqrt{6-2}}{\sqrt{1-0.8836}} = \frac{1.88}{\sqrt{0.1164}} \approx 5.510$$

5. Then, the tabulated value of t at α = 0.01 and (n − 2 = 6 − 2 = 4) degrees of freedom is 3.747, which is less than the calculated value of t (5.510). Hence, reject Ho.

This means that the correlation between them is worth considering. Rejection of Ho leads to the confirmation of the conclusion that the two variables are really highly correlated.
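The example can be checked end to end with the sketch below, which computes Pearson's r from the paired data of Table 9.8 and then the test statistic. It assumes the usual form of the statistic, with n − 2 under the square root in the numerator and n − 2 degrees of freedom; the function names are illustrative, and 3.747 is the tabulated value quoted in the text.

```python
import math
from statistics import mean


def pearson_r(x, y):
    """Return Pearson's product-moment correlation coefficient of paired data."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def correlation_t(r, n):
    """Return t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)


# Paired data from Table 9.8.
crop_yield = [25, 14, 18, 22, 15, 18]   # quintals per unit area
fertilizer = [70, 30, 30, 60, 30, 35]   # kg per unit area

r = pearson_r(fertilizer, crop_yield)
t_cal = correlation_t(r, len(crop_yield))

# Tabulated value quoted in the text for alpha = 0.01 and 4 degrees of freedom.
T_TAB = 3.747
reject_h0 = t_cal > T_TAB
print(f"r = {r:.3f}, t = {t_cal:.2f}, reject Ho: {reject_h0}")
```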

