
CHAPTER ONE

Definition of statistics

Statistics has been defined differently by different scholars from time to time.

However, the definition of statistics given by Croxton and Cowden is the most scientific and realistic one, and it forms the subject matter of this study: statistics is the science that deals with the methods of collecting, organizing, and analyzing data and interpreting the results.

Classification of statistics

Statistics can be classified into two categories: descriptive statistics and inferential statistics. All statistical tests are part of inferential analysis; no tests are conducted in descriptive analysis.

Descriptive Statistics

Descriptive statistics is the term given to the analysis of data that helps describe, show, or summarize data in a meaningful way so that, for example, patterns might emerge from the data. Descriptive statistics do not, however, allow us to draw conclusions beyond the data we have analyzed or to reach conclusions regarding any hypotheses we might have made. They are simply a way to describe our data.

Descriptive statistics are very important: if we simply presented our raw data, it would be hard to see what the data were showing, especially if there were a lot of them. Descriptive statistics therefore allow us to present the data in a more meaningful way, which makes them simpler to interpret. For example, if we had the results of 100 pieces of students' coursework, we might be interested in the overall performance of those students. We would also be interested in the distribution or spread of the marks. Descriptive statistics allow us to do this. The frequency distribution, measures of central tendency (such as the mean, median, and mode), and measures of variation (such as the variance and standard deviation) belong to this category of statistics.
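As a minimal sketch of the descriptive measures named above, the following Python snippet summarizes a small set of coursework marks. The marks are hypothetical values chosen for illustration (they do not come from the text), and the population forms of the variance and standard deviation are used on the assumption that these marks are the whole group of interest.

```python
import statistics
from collections import Counter

# Hypothetical coursework marks (illustrative values, not from the text)
marks = [55, 62, 62, 70, 48, 81, 62, 70, 90, 66]

# Frequency distribution: how often each mark occurs
frequency = Counter(marks)

# Measures of central tendency
mean = statistics.mean(marks)
median = statistics.median(marks)
mode = statistics.mode(marks)

# Measures of variation, treating the marks as the whole population
variance = statistics.pvariance(marks)
std_dev = statistics.pstdev(marks)

print(f"mean={mean:.1f} median={median} mode={mode}")
print(f"variance={variance:.2f} std_dev={std_dev:.2f}")
print("most common marks:", frequency.most_common(3))
```

If the marks were only a sample from a larger group, `statistics.variance` and `statistics.stdev` (which divide by n - 1) would be the more usual choice.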

Inferential Statistics

We have seen that descriptive statistics provide information about our immediate group of data. For example, we could calculate the mean and standard deviation of the exam marks for the 100 students, and this could provide valuable information about this group of 100 students. Any group of data like this that includes all the data you are interested in is called a population. A population can be small or large, as long as it includes all the data you are interested in. For example, if you were only interested in the exam marks of 100 students, then the 100 students would represent your population. Descriptive statistics are applied to populations, and the properties of populations, such as the mean or standard deviation, are called parameters because they represent the whole population.

Often, however, you do not have access to the whole population you are interested in investigating, but only to a limited amount of data. For example, you might be interested in the exam marks of all students in Addis. It is not feasible to measure the exam marks of all students in the whole of Addis, so you have to measure a smaller sample of students, for example 100 students, who are used to represent the larger population of all students in Addis. Properties of samples, such as the mean or standard deviation, are not called parameters but statistics.

Inferential statistics are techniques that allow us to use these samples to make generalizations about the populations from which the samples were drawn. It is therefore important that the sample accurately represents the population. The process of achieving this is called sampling. Inferential statistics arise out of the fact that sampling naturally incurs sampling error, and thus a sample is not expected to represent the population perfectly. The methods of inferential statistics are the estimation of parameter(s) and the testing of statistical hypotheses.
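The distinction between a parameter and a statistic can be illustrated with a small simulation. This is only a sketch under stated assumptions: the population of 10,000 exam marks is simulated (the text gives no such data), and the seed, mean, and spread used to generate it are arbitrary choices.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical population: simulated exam marks of 10,000 students, clipped to 0-100
population = [min(100, max(0, round(rng.gauss(60, 12)))) for _ in range(10_000)]
mu = statistics.mean(population)        # parameter: population mean

# A simple random sample of 100 students stands in for the population
sample = rng.sample(population, 100)
x_bar = statistics.mean(sample)         # statistic: sample mean
s = statistics.stdev(sample)            # sample standard deviation (n - 1)

# Estimated standard error of the sample mean: a measure of sampling error
standard_error = s / len(sample) ** 0.5

print(f"population mean (parameter): {mu:.2f}")
print(f"sample mean (statistic):     {x_bar:.2f}")
print(f"estimated standard error:    {standard_error:.2f}")
```

The sample mean will generally differ a little from the population mean; that gap is the sampling error that inferential methods are designed to quantify.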
Definition of some terms

Population: A collection of all possible individuals, objects, or measurements of interest.

Sample: A portion of the population of interest, selected systematically and used to infer something about the population.

Variable: Any factor whose values can change from individual to individual or from time to time.

Types of Variables: A variable can be quantitative or qualitative. Quantitative variables are variables measured on a numeric scale. Height, weight, response time, subjective rating of pain, temperature, and score on an exam are all examples of quantitative variables. Quantitative variables are distinguished from categorical (sometimes called qualitative) variables such as favorite color, religion, city of birth, favorite sport in which there is no ordering or measuring involved.

Qualitative variables, also known as categorical variables, are variables with no natural sense of ordering and whose values cannot be represented numerically. They are therefore measured on a nominal scale. For instance, hair color (black, brown, gray, red, and yellow), sex, nationality, and religion are qualitative variables. Qualitative variables can be coded to appear numeric, but the numbers are meaningless as quantities, as in male = 1, female = 2.

Quantitative variables are further divided into two types: discrete and continuous.

A discrete variable is one that can take on a countable number of different values. For example, the number of people attending a seminar is a discrete variable because we can count the people (e.g. 10, 11, 12, or 13 people) in the seminar. There is no such thing as a fractional count (e.g. 10.2 or 12.5 people).

A continuous variable is one that can take any value within an interval, and so has infinitely many possible values. For example, the distance between two places is a continuous variable because the distance could be 5, 5.8, or 5.83445 meters. The age of an object, a person's height, air pressure, water temperature, and the time to complete a task are also continuous variables.

Application Area / Use of statistics

The science of statistics is essential for research and decision processes in all aspects of human life. The following are some uses of statistics.

To help summarize the findings of an inquiry

To obtain a better understanding of the phenomenon under study, as an aid in generalization or theory validation

To forecast some variable, for example the deforestation rate over the coming ten years in a given area

To help select a course of action among a number of alternatives

Limitations of statistics

Statistics is indispensable to almost all sciences: social, physical, and natural. It is used in most spheres of human activity. In spite of the wide scope of the subject, it has certain limitations. Some important limitations of statistics are the following:

1. Statistics does not study qualitative phenomena: Statistics deals with facts and figures, so the qualitative aspect of a variable, or a subjective phenomenon, falls outside the scope of statistics. For example, qualities like beauty, honesty, and intelligence cannot be expressed numerically as they stand, so they cannot be examined statistically in that form. This limits the scope of the subject. However, statistical techniques may be applied indirectly by first reducing the qualitative expressions to precise quantitative terms. For example, the intelligence of a group of candidates can be studied on the basis of their scores in a certain test.

2. Statistical laws are not exact: These laws are true only on average and hold only under certain conditions, so they cannot be applied universally. This reduces the practical utility of statistics.

3. Statistics does not study individuals: Statistics deals with aggregates of facts. Single or isolated figures are not statistics. This is considered to be a major handicap of statistics.

4. Statistics can be misused: Statistics is mostly a tool of analysis. Statistical techniques are used to analyze and interpret the information collected in an enquiry. Perhaps the most important limitation of statistics is that it must be used by experts; as the saying goes, "Statistical methods are the most dangerous tools in the hands of the inexpert." The use of statistics by inexperienced and untrained persons may lead to very fallacious conclusions.

CHAPTER TWO

Data Collection and presentation

Data collection is the most important stage in statistical data analysis. Before we begin data collection, there are four important points to consider:

The purpose of collecting the data

The data to be considered

The source from which we can get the data

The method(s) used for data collection

2.1. Types of Statistical Data and sources of Data

When we decide what data are to be collected for a given purpose, we must think about the source of the data. Based on the source, statistical data are of two types: primary data and secondary data.

Primary sources

A primary source is a source that you cite in your writing that is as close as possible to the original information. It is the most direct place you can find the information you want to write about. For example, the body that produced a population estimate for your city would be a primary source for that estimate, whereas a newspaper article reporting the number would not be considered primary. Some other examples of primary sources are

research publications

autobiographies, diaries

legal documents, original maps

laws, letters, novels, oral histories, photographs, speeches

Data obtained from primary data sources, survey methods, and experimental methods are called primary data.

Importance of Primary Data: The importance of primary data cannot be neglected. Research can be conducted without secondary data, but research based only on secondary data is less reliable and may be biased, because secondary data have already been handled and processed by other people. In statistical surveys it is necessary to get information from primary sources and work on primary data. For example, statistical records of the female population of a country cannot be based on newspapers, magazines, and other printed sources: such sources may be out of date, contain limited information, and can be misleading or biased.

Secondary Sources

Secondary sources are informational sources that analyze the event. They often draw on several primary sources and compile the information. Examples of secondary sources:

bulletins, books, journals/periodicals

magazines/newspapers

websites, e-journals

Data obtained from secondary data sources are called secondary data.

Importance of Secondary Data: Secondary data can be less valid, but they are still important. Sometimes it is difficult to obtain primary data; in such cases, getting information from secondary sources is easier and feasible. Sometimes primary data do not exist, and the research has to be confined to secondary data. Sometimes primary data exist but the respondents are not willing to reveal them; in such cases, too, secondary data can suffice. For example, if the research is on the psychology of transsexual people, it may be difficult to find respondents, and they may not be willing to give the information you want for your research, so you can collect data from books or other published sources.

Survey Methods

Survey methods of data collection are the personal interview, telephone interview, mailed questionnaire, and personal observation.

Personal Interview: A trained interviewer asks a series of questions and records the responses on a specially designed form.

Advantages

As the enumerator is with the respondent, he/she can explain points that are not clear to the respondent.

Disadvantages

The quality of the data may be affected both by how the questionnaire was designed and by the quality of the interviewer.

The respondent may not be free from the bias of the enumerator.

The respondent may not have enough time to give well-thought-out answers.

Mailed questionnaire: In this approach the researcher sends a questionnaire to respondents by mail.

Advantages

Costs are low.

The respondents are free from the bias of the enumerator.

Respondents have more time to give well-thought-out answers.

Disadvantages

Non-response, partial response, and low return rates

Only applicable to educated respondents

Telephone interview: This involves contacting the respondents by telephone.

Advantage

It is a faster way to collect data.

Disadvantages

The absence of telephone lines makes this approach less usable.

It cannot be used for rural surveys.

Respondents do not have much time to give well-thought-out answers.

Poor line quality may create a communication barrier.

Observational Methods: This is the monitoring of ongoing activities and the direct recording of data.

Advantage

It avoids incompleteness of data.

Disadvantage

It is rarely used, as it is not possible to plan when the event will happen.

Question Design

It is important to design questions very carefully. A poorly designed questionnaire renders the results meaningless. There are many factors to consider.

Make items clear (don't assume the person you are questioning knows the terms you are using).

Avoid double-barreled questions (make sure the question asks only one clear thing).

Respondent must be competent to answer (don't ask questions that the respondent won't accurately be able to answer).

Questions should be relevant (don't ask questions on topics that respondents don't care about or haven't thought about).

Short items are best (so that they may be read, understood, and answered quickly).

Avoid negative items (if you ask whether librarians should not be paid more, it will confuse respondents).

Avoid biased items and terms (be sensitive to the effect of your wording on respondents).

Unless the nature of a survey definitely warrants their usage, avoid slang, jargon, and technical terms.

Whenever possible, develop consistent response methods.

Make questions as impersonal as possible.

As an ordinary rule, sequence questions from the general to the specific.

If closed questions are employed, try to develop exhaustive and mutually exclusive response alternatives.

Use an attractive questionnaire format that conveys a professional image.

As may be seen, designing good questions is much more difficult than it seems. One effective way of making sure that questions measure what they are supposed to measure is to test them out first, using small focus groups.

2.2 Data Collection Methods

In general, there are two methods of data collection: census and sampling.
Census refers to periodic collection of information about the populace from the entire population.

Sampling is a scientific method of collecting information from a sample that is representative of the entire population.

It is important for the investigator to decide whether to use the sample or the census method for collecting data. The choice depends primarily on the nature and extent of the enquiry and the degree of accuracy desired. The scope of the enquiry, its cost, the time available, and the selection and training of the enumerators must also be taken into account. Ultimately, accuracy and precision depend on the human element. If the human element is perfectly impartial and unbiased, the best results can be expected from either method; otherwise, the results may be distorted. Both methods of enquiry have their individual merits and demerits.

Merits and demerits of census

Merits

Data from a census are reliable and accurate.

The census method is preferred for a small population.

Sampling experts are not required.

Demerits

It is very time consuming.

It is costly.

It requires more labor.

Merits and demerits of sampling

Merits

The sampling method is suitable for a large population.

Sampling is quick, less costly, and requires relatively less manpower.

By taking samples we can reduce the damage caused by some tests in quality control (e.g. the quality of factory products is checked by testing only a few of the products).

Demerits

There is a margin of error in sampling.

Sampling experts are required.

Both systems of enumeration are useful and advantageous on different occasions, so either or both may be followed according to the need or environment.

Types of Sampling: In the sampling method, a representative group of items is selected from the population. Such a group of items is called a sample. The sample represents the whole universe, so the selection of the sample is crucial in this method. It is rightly said that samples are like medicines: they can be harmful when taken carelessly or without knowledge of their effects. Every good sample should have a proper label with instructions about its use.

When selecting samples, a number of factors should be considered. Some of the important factors are:
i) The nature of the population
ii) The distribution of items in the population
iii) The characteristics to be studied
iv) Availability of data
v) Availability of both financial and human resources, etc.

Considering these factors, a choice is made regarding the type of sample to be used. Generally, there are two methods of selecting samples:
1. Probability sampling method or random sampling method
2. Non-probability (purposive) sampling method

Probability sampling method: Probability sampling is also known as chance sampling. Here each individual item of the population has a nonzero chance of being included in the sample, so the selection of the units in the sample depends entirely on chance. One does not know beforehand which units will actually constitute the sample. Probability sampling is of different types. Some of the important types are:
a) Simple random sampling
b) Systematic sampling
c) Cluster sampling
d) Stratified sampling

Let us discuss these types of sampling in brief.

a) Simple Random Sampling: The most common type of probability sampling is simple random sampling. Here each individual population unit has an equal chance of being included in the sample. The technique is suitable where the population is more or less homogeneous. Random selection implies a strict process of selection, like that of drawing lots. The selection of elements may be done using a lottery system or a random number table.

Advantages

ideal for statistical purposes

Disadvantages

hard to achieve in practice

requires an accurate list of the whole population

expensive to conduct, as those sampled may be scattered over a wide area
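The lottery idea behind simple random sampling can be sketched in a few lines of Python. This is an illustrative sketch, not a prescribed procedure: the function name `simple_random_sample`, the population of 120 numbered units, and the seed are all assumptions made for the example. The standard library's `random.Random.sample` draws without replacement, so every unit has the same chance of selection.

```python
import random

def simple_random_sample(units, n, seed=None):
    """Draw n units without replacement, each with equal chance (a lottery)."""
    rng = random.Random(seed)   # seed is optional; it only makes the draw repeatable
    return rng.sample(units, n)

# Hypothetical sampling frame: a complete list of 120 numbered units
frame = list(range(1, 121))
chosen = simple_random_sample(frame, 8, seed=1)
print(sorted(chosen))
```

Note that the function needs the complete list of units up front, which mirrors the requirement for an accurate list of the whole population.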

b) Systematic Sampling

This sampling procedure assumes the population is arranged in some order, such as house number or seat number. If the N units in the population are numbered 1 to N in some order, the procedure is as follows. To select n units, we take a starting unit at random from the first k units and then every kth unit thereafter at regular intervals. The constant k is usually approximated by N/n. For example, suppose you want to sample 8 houses from a street of 120 houses. 120/8 = 15, so every 15th house is chosen after a random starting point between 1 and 15. If the random starting point is 11, then the houses selected are 11, 26, 41, 56, 71, 86, 101, and 116.


If there were 125 houses, 125/8 = 15.625, so should you take every 15th house or every 16th house? If you take every 16th house, 8*16 = 128, so there is a risk that the last house chosen does not exist. To avoid this, the random starting point should be between 1 and 13, since 13 + 7*16 = 125. On the other hand, if you take every 15th house, 8*15 = 120, so with a starting point between 1 and 15 the last five houses would never be selected. The random starting point should now be between 1 and 20 to ensure that every house has some chance of being selected. In a random sample every member of the population has an equal chance of being chosen, which is clearly not the case here, but in practice a systematic sample is almost always acceptable as being random.

Advantages

spreads the sample more evenly over the population

easier to conduct than a simple random sample

Disadvantages

the system may interact with some hidden pattern in the population, e.g. every third house along the street might always be the middle one of a terrace of three
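The house example can be reproduced with a short sketch. The helper names `max_start` and `systematic_sample` are hypothetical, introduced only for illustration; the rule they encode is simply that the last selected unit, start + (n - 1)*k, must not exceed N.

```python
import random

def max_start(N, n, k):
    """Largest starting unit for which all n selected units exist."""
    return N - (n - 1) * k

def systematic_sample(N, n, k, start=None, seed=None):
    """Select every k-th unit from units numbered 1..N, beginning at start."""
    limit = max_start(N, n, k)
    if limit < 1:
        raise ValueError("interval k is too large for this N and n")
    if start is None:
        # Random starting point chosen so that the last unit still exists
        start = random.Random(seed).randint(1, limit)
    if not 1 <= start <= limit:
        raise ValueError(f"start must be between 1 and {limit}")
    return [start + i * k for i in range(n)]

# 8 houses from a street of 120, every 15th house, starting at house 11
print(systematic_sample(120, 8, 15, start=11))
# [11, 26, 41, 56, 71, 86, 101, 116]

# With 125 houses and every 15th house, starts up to 20 are valid
print(max_start(125, 8, 15))  # 20
```

Calling `max_start` with a different interval shows how the range of valid starting points shrinks as k grows.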

c) Cluster Sampling

In cluster sampling the units sampled are chosen in clusters, close to each other. Examples are households in the same street, or successive items off a production line. The population is divided into clusters, some of these are chosen at random, and then every element in each selected cluster is enumerated. Units within each cluster are assumed to be heterogeneous, while the clusters themselves are assumed to be similar to one another.

Advantages

saving of travelling time, and consequent reduction in cost

useful for surveying employees in a particular industry, where individual companies can form the clusters

Disadvantages

units close to each other may be very similar and so less likely to represent the whole population

larger sampling error than simple random sampling
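A minimal sketch of one-stage cluster sampling follows: whole clusters are drawn at random and every unit in a drawn cluster is enumerated. The function name `cluster_sample`, the street names, and the household labels are hypothetical and serve only to illustrate the procedure.

```python
import random

def cluster_sample(clusters, m, seed=None):
    """Choose m whole clusters at random and enumerate every unit in them."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), m)   # sample cluster names, not units
    return {name: list(clusters[name]) for name in chosen}

# Hypothetical population: households grouped by street
streets = {
    "Street A": ["A1", "A2", "A3"],
    "Street B": ["B1", "B2"],
    "Street C": ["C1", "C2", "C3", "C4"],
    "Street D": ["D1", "D2", "D3"],
}

selected = cluster_sample(streets, 2, seed=7)
for street, households in selected.items():
    print(street, households)
```

Because randomness enters only at the cluster level, the final sample size depends on which clusters happen to be drawn.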

d) Stratified Sampling

In a stratified sample the population is divided into non-overlapping subgroups called strata, e.g. geographical areas, age groups, or genders. The elements in a stratum are supposed to be homogeneous with respect to a given characteristic, but to differ in that characteristic from the elements in other strata. Simple random samples are then taken from each stratum, treating the strata as subpopulations in their own right.

After the population has been divided into strata, either a proportional or a non-proportional sample can be selected. As the name implies, a proportional sampling procedure requires that the number of items sampled from each stratum be in the same proportion as found in the population. In a non-proportional stratified sample, the number of items studied in each stratum is disproportionate to the respective numbers in the population. We then weight the sample results according to each stratum's proportion of the total population.

Suppose stratified random sampling with proportional allocation is used for data collection. Let
$N$ = the total number of elements in the population (all the strata taken together)
$N_i$ = the population size in stratum $i$
$n$ = the total sample size required for the study
Then the sample size in stratum $i$, denoted $n_i$, is given by

$$n_i = \frac{N_i}{N}\, n$$
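Proportional allocation gives each stratum a share of the sample equal to its share of the population. The sketch below illustrates this under stated assumptions: the function names, the age-group strata, and their sizes are hypothetical, and rounding to whole units is a choice made here, so the allocated sizes may not always add up exactly to the requested total.

```python
import random

def proportional_allocation(stratum_sizes, n):
    """Sample size per stratum: (stratum size / population size) * n, rounded."""
    N = sum(stratum_sizes.values())
    return {name: round(n * N_i / N) for name, N_i in stratum_sizes.items()}

def stratified_sample(strata, n, seed=None):
    """Draw a simple random sample of the allocated size from each stratum."""
    rng = random.Random(seed)
    sizes = proportional_allocation({k: len(v) for k, v in strata.items()}, n)
    return {k: rng.sample(strata[k], sizes[k]) for k in strata}

# Hypothetical population of 1,000 people split into three age-group strata
strata = {
    "under 30": list(range(0, 500)),
    "30 to 49": list(range(500, 800)),
    "50 and over": list(range(800, 1000)),
}

print(proportional_allocation({k: len(v) for k, v in strata.items()}, 100))
# {'under 30': 50, '30 to 49': 30, '50 and over': 20}

sample = stratified_sample(strata, 100, seed=5)
print({k: len(v) for k, v in sample.items()})
```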


Advantages

Stratification will always achieve greater precision, provided that the strata have been chosen so that members of the same stratum are as similar as possible in respect of the characteristic of interest. The bigger the differences between the strata, the greater the gain in precision. For example, if you were interested in Internet usage you might stratify by age, whereas if you were interested in smoking you might stratify by gender or social class.

It is often administratively convenient to stratify a sample. Interviewers can be specifically trained to deal with a particular age group, ethnic group, or employees in a particular industry.

The results from each stratum may be of intrinsic interest and can be analyzed separately.

It ensures better coverage of the population than simple random sampling.

Disadvantages

Difficulty in identifying appropriate strata.

More complex to organize and analyze results.

2. Non-probability (non-random) sampling method

In a non-random sample, the selected units have an unknown probability of being selected, and some units of the target population may have no chance at all of being in the sample. The sampling procedure is based on the researcher's knowledge or judgment. It is also known as purposive sampling or judgmental sampling. An example is quota sampling.


Quota Sampling
In quota sampling the selection of the sample is made by the interviewer, who has been given quotas to fill from specified subgroups of the population. For example, an interviewer may be told to sample 50 females between the ages of 45 and 60. There are similarities with stratified sampling, but in quota sampling the selection of the sample is non-random. Anyone who has had the experience of trying to interview people in the street knows how tempting it is to ask those who look most helpful; hence it is not the most representative of samples, but it is useful.

Advantages

quick and cheap to organize

Disadvantages

not as representative of the population as a whole as random sampling methods are

because the sample is non-random, it is impossible to assess the possible sampling error
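The non-random character of quota sampling can be seen in a small sketch: the interviewer simply takes the first people encountered who fit a quota, and no random draw is involved. The function names, the list of passers-by, and the quota of 2 (instead of the 50 in the example above, to keep the output short) are all hypothetical.

```python
def quota_sample(stream, quotas, group_of):
    """Fill each quota with the first matching people met, in order of arrival."""
    remaining = dict(quotas)
    selected = []
    for person in stream:
        group = group_of(person)
        if remaining.get(group, 0) > 0:
            selected.append(person)
            remaining[group] -= 1
        if all(count == 0 for count in remaining.values()):
            break   # every quota is full, so the interviewer stops
    return selected

def group_of(person):
    """Classify a (sex, age) pair into the quota group used in this sketch."""
    sex, age = person
    if sex == "F" and 45 <= age <= 60:
        return "female 45-60"
    return "other"

# Hypothetical passers-by as (sex, age), in the order the interviewer meets them
passers_by = [("M", 30), ("F", 50), ("F", 22), ("F", 47), ("F", 58), ("M", 52)]

selected = quota_sample(passers_by, {"female 45-60": 2}, group_of)
print(selected)  # [('F', 50), ('F', 47)]
```

The result depends entirely on who happens to come first, which is why no sampling error can be attached to it.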