CHAPTER SIX
DATA PROCESSING AND ANALYSIS
Data are classified on the basis of common characteristics, which can be either descriptive (such as literacy, sex, honesty, etc.) or numerical (such as weight, age, height, income, expenditure, etc.). Descriptive characteristics refer to qualitative phenomena, which cannot be measured quantitatively: only their presence or absence in an individual item can be noticed. Data obtained this way on the basis of certain attributes are known as statistics of attributes, and their classification is said to be classification according to attributes.
6.2 Data Analysis
Data analysis is the further transformation of the processed data to look for patterns and relationships among data groups. By analysis we mean the computation of certain indices or measures, along with searching for patterns or relationships that exist among the data groups. Analysis, particularly in the case of survey or experimental data, involves estimating the values of unknown population parameters and testing hypotheses in order to draw inferences. Analysis can be categorized as:
Descriptive Analysis
Inferential (Statistical) Analysis
6.2.1 Descriptive Analysis
Descriptive analysis is largely the study of the distribution of one variable. For most projects, analysis begins with some form of descriptive analysis to reduce the data into a summary format. Descriptive analysis refers to the transformation of raw data into a form that makes them easy to understand and interpret.
The most common forms of describing the processed data are:
1. Tabulation
2. Percentages
3. Measures of central tendency
4. Measures of dispersion
5. Measures of asymmetry
1. Tabulation
Tabulation refers to the orderly arrangement of data in a table or other summary format. It presents responses or observations on a question-by-question or item-by-item basis and provides the most basic form of information. It tells the researcher how frequently each response occurs.
Need for Tabulation
It conserves space and reduces explanatory and descriptive statements to a minimum.
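A frequency tabulation of this kind can be sketched in a few lines of Python; the survey responses below are hypothetical:

```python
from collections import Counter

# Hypothetical responses to a single survey question
responses = ["agree", "disagree", "agree", "neutral", "agree", "disagree"]

# Tabulate: count how frequently each response occurs
table = Counter(responses)
for response, count in sorted(table.items()):
    print(f"{response:10s} {count}")
```

Each printed row reports one response category and its frequency, which is exactly the item-by-item summary described above.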
3. Measures of Central Tendency
The most commonly used measure of central tendency is the mean, or arithmetic average. To compute the mean, you add up all the numbers and divide by how many numbers there are. It is not a halfway point, but a kind of center that balances high numbers with low numbers. For this reason, it is most often reported along with some simple measure of dispersion, such as the range, which is expressed as the lowest and highest numbers.
The median is the number that falls in the middle of a range of numbers. It is not the average; it is the halfway point: there are always just as many numbers above the median as below it. In cases where there is an even count of numbers, you average the two middle numbers. The median is best suited to data that are ordinal, or ranked. It is also useful when you have extremely low or high scores, because such outliers do not pull it away from the center.
The mode is the most frequently occurring number in a list of numbers. It is the closest thing to what people mean when they say something is average or typical. The mode does not even have to be a number: it will be a category when the data are nominal or qualitative. The mode is useful when you have a highly skewed set of numbers, mostly low or mostly high. You can also have two modes (a bimodal distribution) when one group of scores is mostly low and the other group is mostly high, with few in the middle.
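The three measures above can be compared directly with Python's standard statistics module; the scores below are hypothetical and deliberately skewed by one high value:

```python
import statistics

# Hypothetical, right-skewed scores: one extreme value (20)
scores = [2, 3, 3, 4, 5, 5, 5, 7, 20]

mean = statistics.mean(scores)      # balances high and low values -> 6
median = statistics.median(scores)  # halfway point -> 5
mode = statistics.mode(scores)      # most frequent value -> 5

print(mean, median, mode)
```

Note how the single extreme score pulls the mean up to 6 while the median and mode stay at 5, illustrating why the median suits data with extremely low or high scores.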
4. Measures of Dispersion
A measure of dispersion shows how the values of an item are scattered around the true value of the average. The average alone fails to give any idea of this spread. After identifying the typical value of a variable, the researcher can therefore measure how far the individual values fall from the mean; dispersion measures the variation in the values of an item.
Important measures of dispersion are:
Range: the difference between the maximum and the minimum value of the observed variable.
Mean deviation: the average absolute deviation of the observations from the mean value, Σ|Xi − X̄|/n.
Variance: the mean squared deviation from the mean, Σ(Xi − X̄)²/n; it measures the sample variability.
Standard deviation: the square root of the variance.
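These four measures can be sketched with Python's statistics module; the values are hypothetical, and the population formulas (with n in the denominator) are used to match the mean-deviation formula above:

```python
import statistics

values = [4, 8, 6, 5, 3, 7]  # hypothetical observations

rng = max(values) - min(values)          # range: maximum minus minimum
mean = statistics.mean(values)
# Mean deviation: average absolute distance of the values from the mean
mean_dev = sum(abs(x - mean) for x in values) / len(values)
variance = statistics.pvariance(values)  # mean squared deviation
std_dev = statistics.pstdev(values)      # square root of the variance

print(rng, mean_dev, variance, std_dev)
```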
1. Correlation
The most commonly used relational statistic is correlation, which is a measure of the strength of some relationship between two variables, not of causality. Interpretation of a correlation coefficient does not allow even the slightest hint of causality. The most a researcher can say is that the variables share something in common; that is, they are related in some way. The more two things have in common, the more strongly they are related. There can also be negative relationships, but the important quality of a correlation coefficient is not its sign but its absolute value. A correlation of −.58 is stronger than a correlation of .43, even though the former relationship is negative. The following table lists the interpretations for various correlation coefficients:
.8 to 1.0 Very strong
.6 to .8 Strong
.4 to .6 Moderate
.2 to .4 Weak
.0 to .2 Very weak
Pearson's correlation coefficient, or small r, represents the degree of linear association between any two variables. Unlike regression, correlation does not distinguish the independent variable from the dependent one; therefore, you cannot infer causality. Correlations are also dimension-free, but they require a good deal of variability in your outcome measures. A correlation coefficient always ranges from negative one (−1) to one (1). A negative coefficient such as −0.65 indicates a fairly strong inverse relationship: when one variable is low, the other tends to be high, and it is up to you, the researcher, to interpret which variable drives which. A positive coefficient such as 0.65 indicates that high values of one variable tend to go together with high values of the other. Researchers usually report the names of the variables in such sentences, rather than just saying "one variable". A correlation coefficient at, or close to, zero indicates no linear relationship.
The most frequently used correlation coefficient in data analysis is the Pearson product-moment correlation. It is symbolized by the small letter r, and is fairly easy to compute from raw scores using the following formula:
r = Σ(Xi − X̄)(Yi − Ȳ) / √[Σ(Xi − X̄)² · Σ(Yi − Ȳ)²]
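A direct translation of the product-moment computation into Python might look like this (the two score lists are hypothetical):

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation from raw scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # Sum of cross-products of deviations from the two means
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    # Square roots of the sums of squared deviations
    sx = math.sqrt(sum((xi - mx) ** 2 for xi in x))
    sy = math.sqrt(sum((yi - my) ** 2 for yi in y))
    return sxy / (sx * sy)

x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
print(round(pearson_r(x, y), 3))  # 0.775: a strong positive correlation
```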
II. Is there any cause-and-effect (causal) relationship between two variables, or between one variable on one side and two or more variables on the other side?
This question can be answered by the use of regression analysis. In regression analysis the researcher tries to estimate or predict the average value of one variable on the basis of the values of the other variable(s).
2. Regression
Regression is the closest thing to estimating causality in data analysis, because it predicts how well the data points "fit" a projected straight line. The most common form, linear regression, uses the least-squares method to find the equation that best fits a line representing what is called the regression of y on x. Instead of finding a single perfect number, one is interested in finding the one line (represented by an equation) that best represents, or fits, the data, regardless of how scattered the data points are. The slope of the line provides information about the predicted direction of the relationship, and the estimated coefficient (or beta weight) of the independent variable x indicates the strength of its effect on the dependent variable y.
Yi = B0 + B1Xi
Yi = outcome score for the ith unit (dependent variable)
B0 = coefficient for the intercept
B1 = coefficient for the slope
Xi = independent variable
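The least-squares estimates of the intercept and slope for this simple model can be computed by hand in Python; the data points below are hypothetical:

```python
def least_squares(x, y):
    """Fit Yi = B0 + B1*Xi by ordinary least squares."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # Slope: sum of cross-deviations over sum of squared x-deviations
    b1 = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
          / sum((xi - mx) ** 2 for xi in x))
    # Intercept: the fitted line passes through the point of means
    b0 = my - b1 * mx
    return b0, b1

x = [1, 2, 3, 4, 5]   # independent variable
y = [2, 4, 5, 4, 5]   # dependent variable
b0, b1 = least_squares(x, y)
print(b0, b1)  # intercept 2.2, slope 0.6
```

The positive slope (0.6) gives the predicted direction of the relationship, and its size reflects how strongly x is estimated to influence y.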