
ANALYSIS OF DATA
Analysis of data is a process of inspecting, cleaning, transforming, and modelling data with the goal of highlighting useful information, suggesting conclusions, and supporting decision making. Data analysis has multiple facets and approaches, encompassing diverse techniques under a variety of names, in different business, science, and social science domains.

CODING OF DATA
Data is often coded before storage. Coding means changing the original data into a shortened version by assigning a code. For example, consider these data codes:
Gender: instead of Male and Female, the values could be shortened to M and F.
Dates: the full date of 18th January 2000 AD could be shortened to JAN.
Colours: the colours of houses could be said to be 'lilac', 'light blue', 'black', 'sage' and so on, but data coding could change these to Pink, Blue, Black, Blue.
Advantages of coding:
- Less storage space is required.
- Comparisons are shortened and can therefore be made more quickly, speeding up searches.
- A limited number of codes makes data input faster and simplifies validation.
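A minimal sketch of data coding in Python; the code tables below are invented to mirror the examples above, not taken from any real system:

```python
# Code tables mapping original values to shortened codes (illustrative only).
gender_codes = {"Male": "M", "Female": "F"}
colour_codes = {"lilac": "Pink", "light blue": "Blue",
                "black": "Black", "sage": "Blue"}

def encode(value, code_table):
    """Replace an original value with its shortened code."""
    return code_table[value]

def decode(code, code_table):
    """Recover an original value for a code. Ambiguous codes (e.g. 'Blue')
    return only the first match, illustrating how coding coarsens data."""
    for original, c in code_table.items():
        if c == code:
            return original
    raise KeyError(code)

print(encode("Female", gender_codes))   # F
print(encode("sage", colour_codes))     # Blue
print(decode("Blue", colour_codes))     # light blue (the 'sage' detail is lost)
```

Note how decoding 'Blue' cannot tell 'light blue' and 'sage' apart: once coded, that distinction is gone, which is exactly the coarsening problem discussed in the text.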

DATA EDITING
Data should be edited before being presented as information. This ensures that the information provided is accurate, complete and consistent. No matter what type of data you are working with, certain edits are performed on all surveys. Data editing can be performed manually, with the assistance of computer programs, or with a combination of both techniques, depending on the medium (electronic, paper) by which the data are submitted. There are two levels of data editing: micro-editing and macro-editing. Micro-editing corrects the data at the record level. This process detects errors through checks of the individual data records; the intent is to determine the consistency of the data and correct the individual records. Macro-editing also detects errors, but does so through the analysis of aggregate data (totals): the data are compared with data from other surveys, administrative files, or earlier versions of the same data to determine their compatibility. We might ask, "Why are there errors in our files?" There are several situations in which errors can be introduced into the data, including the following:
- A respondent could have misunderstood a question.
- A respondent or an interviewer could have checked the wrong response.
- An interviewer could have miscoded or misunderstood a written response.
- An interviewer could have forgotten to ask a question or record the answer.
- A respondent could have provided inaccurate responses.
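Record-level (micro-editing) checks of this kind can be sketched in code. The field names, limits and rules below are invented purely for illustration:

```python
# Illustrative micro-edits applied to one record at a time.
def validity_edit(record):
    """Essential fields must be present and non-blank."""
    errors = []
    for field in ("id", "age", "status"):
        if record.get(field) in (None, ""):
            errors.append(f"{field} is missing")
    return errors

def range_edit(record):
    """Values must fall within pre-established limits."""
    errors = []
    if not (0 <= record.get("age", -1) <= 120):
        errors.append("age out of range")
    return errors

def consistency_edit(record):
    """Different answers on the same record must be coherent."""
    errors = []
    if record.get("age", 99) < 15 and record.get("status") == "retired":
        errors.append("age group inconsistent with retired status")
    return errors

record = {"id": 7, "age": 12, "status": "retired"}
problems = validity_edit(record) + range_edit(record) + consistency_edit(record)
print(problems)  # ['age group inconsistent with retired status']
```

A macro-edit would instead aggregate many such records (totals, ratios) and compare them against other surveys or earlier versions of the same data.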

Problems of Coding
Coding obscures the meaning of data. A reader seeing the 'gender' data as M/F is fairly likely to know that it means Male/Female, but with more obscure codes, such as Switzerland being coded as CHE, the reader must be given the complete list of possibilities in order to understand the meaning of the data.
Coding of value judgements. For example, was that curry too spicy? Suppose it is to be coded as a judgement on a scale of 1-4. It will be coded differently by different people, which makes comparison difficult. The problem with a value judgement is that there is no single correct value; the value depends on someone's opinion. Coding of value judgements will inevitably lead to a coarsening of the data, since there is a wide range of opinions that could be held and only a limited number of codes available.

Always keep in mind the objectives of data editing: to ensure the accuracy of data; to establish the consistency of data; to determine whether or not the data are complete; to ensure the coherence of aggregated data; and to obtain the best possible data available.

Applying editing rules
So, how do we edit? The first step is to apply 'rules' (or factors to be taken into consideration) to the data. These rules are determined by the expert knowledge of a subject-matter specialist, the structure of the questionnaire, the history of the data, and any other related surveys or data. Expert knowledge can come from a variety of sources. The specialist could be an analyst who has extensive experience with the type of data being edited. An expert could also be

Prof Amit Kumar FIT Group of Institutions

one of the survey sponsors who is familiar with the relationships between the data. The layout and structure of the questionnaire will also affect the rules for editing data. For example, respondents are sometimes instructed to skip certain questions if those questions do not apply to them or their situation; this must be respected and incorporated into the editing rules. Lastly, other surveys relating to the same sort of variables or characteristics are used to establish some of the rules for editing data.
Data editing types
There are several types of data edits available:
Validity edits look at one question field or cell at a time. They check that record identifiers, invalid characters and values have been accounted for; that essential fields have been completed (e.g., no quantity field is left blank where a number is required); that specified units of measure have been properly used; and that the reporting time is within the specified limits.
Range edits are similar to validity edits in that they look at one field at a time. The purpose of this type of edit is to ensure that the values, ratios and calculations fall within pre-established limits.
Duplication edits examine one full record at a time. These edits check for duplicated records, making certain that a respondent or a survey item has been recorded only once. A duplication edit also checks that the respondent does not appear in the survey universe more than once, especially if there has been a name change, and that the data have been entered into the system only once.
Consistency edits compare different answers from the same record to ensure that they are coherent with one another. For example, if a person is declared to be in the 0 to 14 age group but also claims to be retired, there is a consistency problem between the two answers. Inter-field edits are another form of consistency edit: they verify that if a figure is reported in one section, a corresponding figure is reported in another.
Historical edits are used to compare survey answers in current and previous surveys. For example, any dramatic changes since the last survey will be flagged. The ratios and calculations are also compared, and any percentage variance that falls outside the established limits will be noted and questioned.
Statistical edits look at the entire set of data. This type of edit is performed only after all other edits have been applied and the data have been corrected. The data are compiled and all extreme values, suspicious data and outliers are rejected.
Miscellaneous edits cover special reporting arrangements; dynamic edits particular to the survey; correct classification checks; changes to physical addresses, locations and/or contacts; and legibility edits (i.e., making sure the figures or symbols are recognizable and easy to read).
Data editing is influenced by the complexity of the questionnaire. Complexity refers to the length of the questionnaire as well as the number of questions asked; it also includes the detail of the questions and the range of subject matter the questionnaire may cover. In some cases the terminology of a question can be very technical. For these types of surveys, special reporting arrangements and industry-specific edits may occur.
TABULATION OF DATA

The process of placing classified data into tabular form is known as tabulation. A table is a systematic arrangement of statistical data in rows and columns. Rows are horizontal arrangements, whereas columns are vertical arrangements. A table may be simple, double or complex depending upon the type of classification.
Types of Tabulation:
(1) Simple Tabulation or One-way Tabulation: When the data are tabulated according to one characteristic, it is said to be simple or one-way tabulation. For example, tabulation of data on the population of the world classified by one characteristic, such as religion, is an example of simple tabulation.
(2) Double Tabulation or Two-way Tabulation: When the data are tabulated according to two characteristics at a time, it is said to be double or two-way tabulation. For example, tabulation of data on the population of the world classified by two characteristics, such as religion and sex, is an example of double tabulation.
(3) Complex Tabulation: When the data are tabulated according to many characteristics, it is said to be complex tabulation. For example, tabulation of data on the population of the world classified by several characteristics, such as religion, sex and literacy, is an example of complex tabulation.
PIE CHART FOR REPRESENTATION OF DATA
A pie chart (or circle graph) is a circular chart divided into sectors, illustrating proportion. In a pie chart, the arc length of each sector (and consequently its central angle and area) is proportional to the quantity it represents. (When angles are measured with one turn as the unit, a number of percent corresponds to the same number of centiturns.) Together, the sectors create a full disk. It is named for its resemblance to a pie which has been sliced. The pie chart is perhaps the most widely used statistical chart in the business world and the mass media. However, it has been criticized, and some recommend avoiding it,


pointing out in particular that it is difficult to compare different sections of a given pie chart, or to compare data across different pie charts. Pie charts can be an effective way of displaying information in some cases, in particular if the intent is to compare the size of a slice with the whole pie, rather than comparing the slices among themselves. Pie charts work particularly well when the slices represent 25 to 50% of the data, but in general other plots, such as the bar chart or the dot plot, or non-graphical methods such as tables, may be better adapted for representing certain information. A pie chart also shows the frequency within certain groups of information.
BAR DIAGRAMS
A bar chart or bar graph is a chart with rectangular bars whose lengths are proportional to the values that they represent. The bars can be plotted vertically or horizontally. Bar charts are used for plotting discrete data, which take only certain values. Some examples of discrete (discontinuous) data are 'shoe size' or 'eye colour', for which you would use a bar chart; in contrast, examples of continuous data are 'height' or 'weight'. A bar chart is very useful if you are trying to record certain information, whether it is continuous or discontinuous data. Bar charts also look a lot like histograms, and the two are often mistaken for each other.
SPSS (STATISTICAL PACKAGE FOR SOCIAL SCIENCE)
INTRODUCTION
SPSS is a computer program used for conducting statistical analysis, manipulating data, and generating tables and graphs that summarize data. SPSS is the most popular computer software for data analysis. It provides a comprehensive set of flexible tools that can be used to accomplish a wide variety of data analysis tasks. SPSS is especially useful for social scientists and social science students, including scholars performing quantitative research and undergraduates working on their theses.
USES OF SPSS
SPSS is the statistical package most widely used by social scientists.
The main uses and advantages of SPSS are as follows:
1. One can use it either through a windowed point-and-click approach or through syntax (i.e., writing out SPSS commands). Each approach has its own advantages, and users can switch between them.
2. Of the major packages, it seems to be the easiest to use for the most widely used statistical techniques.
ANOVA
The Analysis of Variance, popularly known as the ANOVA test, can be used in cases where there are more than two groups.

When we have only two samples we can use the t-test to compare the means of the samples, but the t-test can become unreliable when there are more than two samples. If we compare only two means, then the t-test (independent samples) will give the same results as the ANOVA. ANOVA is used to compare the means of more than two samples. This can be understood better with the help of an example.
ONE-WAY ANOVA EXAMPLE: Suppose we want to test the effect of five different exercises. For this, we recruit 20 men and assign one type of exercise to each group of 4 men (5 groups). Their weights are recorded after a few weeks. We may find out whether the effect of these exercises on them is significantly different or not, and this may be done by comparing the weights of the 5 groups of 4 men each. This is a case of one-way balanced ANOVA: one-way because only one factor (the type of exercise) is studied, and balanced because the same number of men is assigned to each exercise. Thus the basic idea is to test whether the samples are all alike or not.
WHY NOT MULTIPLE T-TESTS? As mentioned above, the t-test can only be used to test differences between two means. When there are more than two means, it is possible to compare each mean with each other mean using many t-tests. But conducting such multiple t-tests can lead to severe complications, since the chance of falsely finding a difference grows with every extra test, and in such circumstances we use ANOVA. Thus, this technique is used whenever an alternative procedure is needed for testing hypotheses concerning means when there are several populations.
ONE-WAY AND TWO-WAY ANOVA
Now some questions may arise: what are the means we are talking about, and why are variances analysed in order to draw conclusions about means? The whole procedure can be made clear with the help of an experiment. Let us study the effect of fertilizers on the yield of wheat. We apply five fertilizers, each of a different quality, to four plots of land each. The yield from each plot is recorded and the difference in yield among the plots is observed. Here, fertilizer is a factor and the different qualities of fertilizer are called levels. This is a case of one-way or one-factor ANOVA, since there is only one factor, fertilizer. We may also be interested in studying the effect of the fertility of the plots of land. In that case we would have two factors, fertilizer and fertility, a case of two-way or two-factor ANOVA. Similarly, a third factor may be incorporated to give a case of three-way or three-factor ANOVA.


CHANCE CAUSES AND ASSIGNABLE CAUSES
In the above experiment the yields obtained from the plots may differ, and we may be tempted to conclude that the differences exist because of differences in the quality of the fertilizers. But the differences may also be the result of other factors which are attributed to chance and which are beyond human control; this factor is termed error. Thus, the variations that exist within a plot of land may be attributed to error. Estimates of the amount of variation due to assignable causes (the variance between the samples) and due to chance causes (the variance within the samples) are obtained separately and compared using an F-test, and conclusions are drawn from the value of F.
ASSUMPTIONS
There are four basic assumptions used in ANOVA:
- the expected values of the errors are zero;
- the variances of all errors are equal to each other;
- the errors are independent;
- the errors are normally distributed.
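The comparison of between-sample and within-sample variance can be sketched in plain Python. The three groups of data below are made up for illustration; in a real analysis the resulting F would be compared against a critical value from the F-distribution:

```python
# A pure-Python sketch of one-way ANOVA (illustrative data, no SciPy).
def one_way_anova(groups):
    k = len(groups)                        # number of groups
    n = sum(len(g) for g in groups)        # total number of observations
    grand_mean = sum(sum(g) for g in groups) / n

    # Between-group sum of squares: variation due to assignable causes
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2
                     for g in groups)
    # Within-group sum of squares: variation due to chance causes (error)
    ss_within = sum((x - sum(g) / len(g)) ** 2
                    for g in groups for x in g)

    ms_between = ss_between / (k - 1)      # mean square between
    ms_within = ss_within / (n - k)        # mean square within
    return ms_between / ms_within          # the F statistic

groups = [[6, 8, 4, 5], [8, 12, 9, 11], [13, 9, 11, 8]]
F = one_way_anova(groups)
print(round(F, 2))  # 6.87
```

A large F (here about 6.87 on 2 and 9 degrees of freedom) suggests the between-group variation is too large to be explained by chance alone.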

Measures of Central Tendency
Introduction
A measure of central tendency is a single value that attempts to describe a set of data by identifying the central position within that set of data. As such, measures of central tendency are sometimes called measures of central location; they are also classed as summary statistics. The mean (often called the average) is most likely the measure of central tendency that you are most familiar with, but there are others, such as the median and the mode. The mean, median and mode are all valid measures of central tendency, but under different conditions some measures become more appropriate to use than others. In the following sections we will look at the mean, median and mode, learn how to calculate them, and see under what conditions each is most appropriate.
Mean (Arithmetic)
The mean (or average) is the most popular and well-known measure of central tendency. It can be used with both discrete and continuous data, although it is used most often with continuous data. The mean is equal to the sum of all the values in the data set divided by the number of values in the data set. So, if we have n values in a data set with values x1, x2, ..., xn, then the sample mean, usually denoted by x̄ (pronounced "x bar"), is:

x̄ = (x1 + x2 + ... + xn) / n

This formula is usually written in a slightly different manner using the Greek capital letter Σ, pronounced "sigma", which means "sum of...":

x̄ = (Σx) / n

You may have noticed that the above formula refers to the sample mean. So, why have we called it a sample mean? This is because, in statistics, samples and populations have very different meanings, and these differences are very important, even if, in the case of the mean, they are calculated in the same way. To acknowledge that we are calculating the population mean and not the sample mean, we use the Greek lower-case letter μ ("mu"):

μ = (Σx) / N

where N is the number of values in the population.

The mean is essentially a model of your data set: a single value that is most typical of the data. You will notice, however, that the mean is often not one of the actual values that you have observed in your data set. However, one of its important properties is that it


minimises error in the prediction of any one value in your data set; that is, it is the value that produces the lowest amount of error from all other values in the data set. An important property of the mean is that it includes every value in your data set as part of the calculation. In addition, the mean is the only measure of central tendency for which the sum of the deviations of each value from the mean is always zero.
When not to use the mean
The mean has one main disadvantage: it is particularly susceptible to the influence of outliers. These are values that are unusual compared to the rest of the data set, being especially small or large in numerical value. For example, consider the wages of staff at a factory below:

Staff   1    2    3    4    5    6    7    8    9    10
Salary  15k  18k  16k  14k  15k  15k  12k  17k  90k  95k
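A quick check of these salary figures with Python's statistics module shows how strongly the two outliers pull the mean away from the typical value (the median, discussed later in this guide):

```python
from statistics import mean, median

# The ten salaries from the table above, in thousands of dollars.
salaries = [15, 18, 16, 14, 15, 15, 12, 17, 90, 95]

print(mean(salaries))    # 30.7  - dragged upwards by the 90k and 95k outliers
print(median(salaries))  # 15.5  - much closer to the typical salary
```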

The mean salary for these ten staff is $30.7k. However, inspecting the raw data suggests that this mean value might not be the best way to reflect the typical salary of a worker, as most workers have salaries in the $12k to 18k range. The mean is being skewed by the two large salaries. Therefore, in this situation we would like a better measure of central tendency. As we will find out later, taking the median would be a better measure of central tendency in this situation.
Another time when we usually prefer the median over the mean (or mode) is when our data is skewed (i.e., the frequency distribution for our data is skewed). Consider the normal distribution, as this is the distribution most frequently assumed in statistics: when the data is perfectly normal, the mean, median and mode are identical, and they all represent the most typical value in the data set. However, as the data becomes skewed the mean loses its ability to provide the best central location for the data, because the skewed data drags it away from the typical value. The median, however, best retains this position and is not as strongly influenced by the skewed values. This is explained in more detail in the skewed-distribution section later in this guide.
Median
The median is the middle score for a set of data that has been arranged in order of magnitude. The median is less affected by outliers and skewed data. In order to calculate the median, suppose we have the data below:
65 55 89 56 35 14 56 55 87 45 92

We first need to rearrange that data into order of magnitude (smallest first): 14 35 45 55 55 56 56 65 87 89 92

Our median mark is the middle mark - in this case 56. It is the middle mark because there are 5 scores before it and 5 scores after it. This works fine when you have an odd number of scores, but what happens when you have an even number of scores? What if you had only 10 scores? Well, you simply take the middle two scores and average the result. So, if we look at the example below:
65 55 89 56 35 14 56 55 87 45

We again rearrange that data into order of magnitude (smallest first):
14 35 45 55 55 56 56 65 87 89

Only now we have to take the 5th and 6th scores in our data set and average them to get a median of 55.5.
Mode


The mode is the most frequent score in our data set. On a bar chart or histogram it is represented by the highest bar. You can therefore sometimes consider the mode as being the most popular option.

Normally, the mode is used for categorical data, where we wish to know which is the most common category - for example, which form of transport is most often used to travel to work.

In such a data set we might see that the most common form of transport is the bus. However, one of the problems with the mode is that it is not unique, which leaves us with problems when two or more values share the highest frequency.

We would then be stuck as to which mode best describes the central tendency of the data. This is particularly problematic when we have continuous data, as we are likely not to have any one value that is more frequent than the others. For example, consider measuring the weights of 30 people (to the nearest 0.1 kg). How likely is it that we will find two or more people with exactly the same weight, e.g. 67.4 kg? The answer is: probably very unlikely. Many people might be close, but with such a small sample (30 people) and a large range of possible weights, you are unlikely to find two people with exactly the same weight, that is, to the nearest 0.1 kg. This is why the mode is very rarely used with continuous data.
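These behaviours of the mode are easy to demonstrate with Python's statistics module; the small data sets below are made up for illustration:

```python
from statistics import mode, multimode

# Categorical data: the mode is the most common category.
transport = ["bus", "car", "bus", "walk", "bus", "train"]
print(mode(transport))      # bus

# When two values share the highest frequency the mode is not unique;
# multimode (Python 3.8+) returns every tied value.
marks = [4, 7, 4, 7, 9]
print(multimode(marks))     # [4, 7]
```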


Another problem with the mode is that it will not provide a very good measure of central tendency when the most common value is far away from the rest of the data in the data set.

Imagine, for instance, a data set in which the mode has a value of 2 while the data are mostly concentrated in the 20 to 30 value range. The mode is clearly not representative of such data, and using it to describe the central tendency of this data set would be misleading.
Skewed Distributions and the Mean and Median
We often test whether our data is normally distributed, as this is a common assumption underlying many statistical tests.

When you have a normally distributed sample you can legitimately use either the mean or the median as your measure of central tendency. In fact, in any symmetrical distribution the mean, median and mode are equal. However, in this situation the mean is widely preferred as the best measure of central tendency, because it is the measure that includes all the values in the data set in its calculation, and any change in any of the scores will affect the value of the mean. This is not the case with the median or mode. However, when our data is skewed, for example with a right-skewed distribution,


we find that the mean is dragged in the direction of the skew. In these situations the median is generally considered to be the best representative of the central location of the data. The more skewed the distribution, the greater the difference between the median and mean, and the greater the emphasis that should be placed on using the median as opposed to the mean. A classic example of a right-skewed distribution is income (salary), where higher earners give a false impression of the typical income if it is expressed as a mean and not a median. If data assumed to follow a normal distribution turn out, under tests of normality, to be non-normal, then it is customary to use the median instead of the mean; however, this is more a rule of thumb than a strict guideline. Sometimes researchers wish to report the mean of a skewed distribution if the median and mean are not appreciably different (a subjective assessment) and if it allows easier comparison with previous research.
Summary of when to use the mean, median and mode
Use the following summary table to identify the best measure of central tendency for each type of variable:

Type of variable               Best measure of central tendency
Nominal                        Mode
Ordinal                        Median
Interval/Ratio (not skewed)    Mean
Interval/Ratio (skewed)        Median

CONCEPT OF DISPERSION
The word dispersion is used to denote the degree of heterogeneity in the data. It is an important characteristic indicating the extent to which observations vary among themselves. The dispersion of a given set of observations will be zero when all of them are equal (as in Set B given above). The wider the discrepancy from one observation to another, the larger the dispersion. (Thus the dispersion in Set A should be larger than that in Set C.) A measure of dispersion is designed to state numerically the extent to which individual observations vary on average. There are quite a few measures of dispersion; we discuss them below.
Range
Of all measures of dispersion, the range is the simplest. It is defined as the difference between the largest and the smallest observations. Thus for the data given in Set A the range is 44 - 2 = 42. Similarly, for Set B the range is 17 - 17 = 0


and for Set C it is 11. Now let us look at some grouped data. For the data of Table 4.2 (look back to the previous Unit), the range is Rs. 406.5 - Rs. 262.5 = Rs. 144. Notice that, for grouped data, the largest and smallest observations are not identifiable; hence we take the difference between the two extreme boundaries of the classes. It is intuitive that, because of central tendency, if one selects a small sample, the observations are more likely to be around its mode than away from it. Less likely or extreme values will be included in the sample when its size is large. This, in other words, implies that the range will increase as the sample size increases. Also, it is known that in repeated sampling with the same sample size the range varies considerably, making it a less suitable measure for comparisons. However, the range is a measure which is easy to understand and can be computed quickly.
Inter-quartile Range
The range as a measure of dispersion does not reflect a frequency distribution well, as it depends on the two extreme values. Even one very large or small observation, away from the general pattern of the other observations in the data set, makes the range very large. For example, in Set A the range is found to be excessively large (44 - 2 = 42) because of the presence of one very large observation, 44. To avoid such extreme observations, particularly when there is a strong central tendency, the inter-quartile range is useful as a measure of dispersion. It is defined as
Inter-quartile Range = Q3 - Q1 = P75 - P25.
The inter-quartile range is the range of the middle-most 50% of the observations; it is small if the observations are compact around the median, i.e., if there is a strong mode close to the median.
Mean Deviation
While the range depends on the two extreme observations, the inter-quartile range depends on the two extreme observations among the middle-most 50 percent of the observations. Thus one talks only about the percentage of observations between the minimum, P25, and the maximum, P75. Hence both the range and the inter-quartile range do not depend upon all the observations in the sample, and while computing them we say nothing about the distribution of observations within the group. Among the many possibilities for quantifying the spread or dispersion of observations, one is to use the deviations of the observations from some central value. Since the mean is the most commonly used measure of central tendency, it is often taken as the central value with reference to which the deviations are computed. These deviations are then suitably combined to get a measure of dispersion.
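The three measures discussed so far can be sketched in a few lines of Python. The data below are made up (chosen only to share Set A's extremes of 2 and 44), and the percentile function uses one of several common interpolation conventions:

```python
from statistics import mean

data = [2, 5, 8, 11, 17, 44]   # illustrative observations

# Range: difference between the largest and smallest observation
data_range = max(data) - min(data)

def percentile(sorted_values, p):
    """Linear-interpolation percentile (one of several conventions)."""
    idx = (len(sorted_values) - 1) * p / 100
    lo = int(idx)
    hi = min(lo + 1, len(sorted_values) - 1)
    return sorted_values[lo] + (sorted_values[hi] - sorted_values[lo]) * (idx - lo)

# Inter-quartile range: spread of the middle-most 50% of observations
s = sorted(data)
iqr = percentile(s, 75) - percentile(s, 25)

# Mean deviation: average absolute deviation from the mean
m = mean(data)
mean_dev = mean(abs(x - m) for x in data)

print(data_range)  # 42 - inflated by the single extreme value 44
print(iqr)         # much smaller, since it ignores the extremes
```

Note how the single extreme value 44 inflates the range to 42, while the inter-quartile range, which ignores the tails, stays small - exactly the behaviour described in the text.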
