
Chapter 5

Data Processing, Statistical Analysis and Interpretation

Outline

 Data Analysis
 Data preparation/processing
 Logging
 Editing
 Coding
 Tabulation
 Graphs & diagrams

 Statistical Methods in Research
 Descriptive Statistics
 Correlation
 Regression
 Chi-square

 Interpretation

I. Data Analysis

By the time you get to the analysis of your data, most of the really difficult work has been done.
It's much more difficult to: define the research problem; develop and implement a sampling
plan; conceptualize, operationalize and test your measures; and develop a design structure. If
you have done this work well, the analysis of the data is usually a fairly straightforward affair.

Analysis of data involves a number of closely related operations that are performed with the
purpose of summarizing the collected data and organizing these in such a manner that they will
yield answers to the research questions and research hypotheses that initiated the study.

Analysis of data includes comparison of the outcomes of the various treatments upon the
several groups, and making decisions about whether the goals of the research have been achieved.

Analysis of data means to make the raw data meaningful or to draw some results from the data
after the proper treatment.

Process involved in Data Analysis

Some authors differentiate data analysis and data preparation, treating data preparation as
one step in the research process and data analysis as a separate, subsequent step.

Others put the steps involved in data analysis, in general, as being classified as;

1. classification, or the establishment of categories for the data

2. application of the categories to the raw data through coding

3. tabulation of the data

4. statistical analysis of the data

5. drawing inferences about causal relationships among variables.

The other approach used to classify the data analysis in social research involves three major
steps, done in roughly this order:

1. Cleaning and organizing the data for analysis (Data Preparation/processing)


2. Describing the data (Descriptive Statistics)
3. Testing Hypotheses and Models (Inferential Statistics)

In most research studies, the analysis section follows these three phases of analysis.
Descriptions of how the data were prepared tend to be brief and to focus only on aspects
unique to your study, such as specific data transformations that were performed. The
descriptive statistics that you actually look at can be voluminous. In most write-ups, these are
carefully selected and organized into summary tables and graphs that only show the most
relevant or important information. Usually, the researcher links each of the inferential analyses
to specific research questions or hypotheses that were raised in the introduction, or notes any
models that were tested that emerged as part of the analysis. In most analysis write-ups it's
especially critical to not "miss the forest for the trees." If you present too much detail, the reader
may not be able to follow the central line of the results. Often extensive analysis details are
appropriately relegated to appendices, reserving only the most critical analysis summaries for
the body of the report itself.

Data Preparation/processing

Data Preparation or processing includes:

 checking or logging the data in;

 checking the data for accuracy (editing);

 coding the data;

 tabulation.

a) Logging the Data

 In any research project you may have data coming from a number of different sources at
different times: for example, mail survey returns, coded interview data, pretest or
posttest data, and observational data.

 In all but the simplest of studies, you need to set up a procedure for logging the information
and keeping track of it until you are ready to do a comprehensive data analysis.

 Different researchers differ in how they prefer to keep track of incoming data. In most
cases, you will want to set up a database that enables you to assess at any time what data is
already in and what is still outstanding.
 You could do this with any standard computerized database program (e.g., Microsoft
Access, Claris FileMaker), although this requires familiarity with such programs. Or you
can accomplish this using standard statistical programs (e.g., SPSS, SAS, Minitab, Data
Desk) by running simple descriptive analyses to get reports on data status.

 It is also critical that the data analyst retain the original data records for a reasonable period
of time -- returned surveys, field notes, test protocols, and so on.

 A database for logging incoming data is a critical component in good research record-
keeping.
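
As an illustration (not part of the original text), such a tracking database can be sketched in a few lines of Python with pandas; the respondent IDs and source names here are hypothetical:

import pandas as pd

# Hypothetical tracking log: one row per respondent, one column per data source.
log = pd.DataFrame({
    "respondent_id": [101, 102, 103, 104],
    "survey_returned":   [True, True, False, True],
    "interview_coded":   [True, False, False, True],
    "posttest_received": [False, False, False, True],
})

sources = ["survey_returned", "interview_coded", "posttest_received"]

# Simple status report: how much data is already in, per source...
print(log[sources].sum())

# ...and which respondents still have something outstanding.
outstanding = log.loc[~log[sources].all(axis=1), "respondent_id"]
print(outstanding.tolist())   # -> [101, 102, 103]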

b) Checking the Data for Accuracy/ Editing

As soon as data is received you should screen it for accuracy. In some circumstances doing this
right away will allow you to go back to the sample to clarify any problems or errors.

Editing of data is a process of examining the raw collected data to detect errors and omissions
and to correct these when possible.

Editing is done to assure that the data are accurate, consistent with other facts gathered,
uniformly entered, as complete as possible, and well arranged to facilitate coding
and tabulation.

There are several questions you should ask as part of this initial data screening:

 Are the responses legible/ readable?


 Are all important questions answered?
 Are the responses complete?
 Is all relevant contextual information included (e.g., date, time, place, researcher)?

Editing could be:

 Field editing and Central editing

Field editing:

 Consists of the review of the reporting forms by the investigator, who completes
(translates or rewrites) what was written in abbreviated and/or illegible form at
the time of recording the respondents’ responses.

 Done to check whether the handwriting is readable or not.

 Must not include correcting errors of omission by guessing.

Central Editing:

 Takes place when all forms or schedules have been completed and returned to the
office.

 Implies that all forms should get a thorough editing by a single editor in a small
study, and by a team of editors in the case of a large inquiry.

 Obvious errors, such as an entry in the wrong place, may be corrected.

 In case of omission of responses, the editor can sometimes supply the answer by
considering other information in the form.

Editors must keep in view several points while performing their work:

 They should be familiar with instructions given to the interviewers and coders as
well as with the editing instructions supplied to them for the purpose.
 While crossing out an original entry for one reason or another, they should draw
just a single line through it so that the entry remains legible.

 They must make entries in distinctive colors and in standard forms

 They should initial all answers which they change or supply.

 Editor’s initials and the date of editing should be placed on each completed form or
schedule.

c) Coding

 Refers to the process of assigning numbers or other symbols to answers so that
responses can be put into a limited number of categories or classes.

 Such classes are appropriate to the research problem under consideration.

 The classes must be exhaustive, mutually exclusive and unidimensional.

 Coding is necessary for efficient analysis; through it the many replies are
reduced to a small number of classes which contain the critical information required
for analysis.
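
To make the idea concrete, here is a minimal Python sketch (the codebook is hypothetical, not from the text) of assigning numeric codes to answers; note that the classes are exhaustive (an "other" class catches everything else) and mutually exclusive:

# Hypothetical coding scheme for a question on employment status.
CODEBOOK = {
    "employed full-time": 1,
    "employed part-time": 2,
    "unemployed": 3,
    "student": 4,
    "other": 9,
}

responses = ["student", "employed full-time", "retired", "unemployed"]

# Map each raw answer to its numeric code; unlisted answers fall into "other".
coded = [CODEBOOK.get(r, CODEBOOK["other"]) for r in responses]
print(coded)  # -> [4, 1, 9, 3]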

d) Classification

 Is the process of arranging data into sequences and groups according to their
common characteristics, or separating them into different but related parts.

 Is the scheme of breaking a category into a set of parts, called classes, according to
some precisely defined differing characteristics possessed by all the elements of the
category.

 Is the arrangement of data into different classes, which are to be determined by the
nature, objective and scope of the enquiry.

 Reduces a large volume of data into homogeneous groups of manageable size.

Characteristics of Classification

 When we make a classification, we break up the subject matter into a number of
classes. It is important that the classification possess the following
characteristics:

Exhaustive: the classification system must be exhaustive; there must be a class
for each item of data, with no item that cannot find a class. If the
classification is exhaustive, there will be no place for ambiguity.

Mutually exclusive: the classes must not overlap; each item of data
must find its place in one class and one class only.

Stability: classification must proceed at every stage in accordance with one
principle, and that principle should be maintained throughout. If a
classification is not stable and is changed for every inquiry, the data will
not be fit for comparison.

Flexibility: a good classification should be flexible and should have the
capacity to adjust to new situations and circumstances.

Homogeneity: the items included in one class should be homogeneous.

Suitability: the classification should conform to the objects of the enquiry. If an
investigation is carried out to enquire into the economic conditions of
laborers, it is useless to classify them on the basis of their religion.

Arithmetic accuracy: the totals of the items included in the different classes
should tally with the total of the universe.

Types of classification

 Classification can be done in one of the following ways, depending on the nature of
the phenomenon involved:

1. Classification According to Attributes

 Is sometimes called classification based on differences in kind, or qualitative
classification.

 Data can be classified on the basis of common characteristics which can be either
descriptive or numerical.

 Descriptive characteristics refer to qualitative phenomena which cannot be
measured quantitatively. Such data are called statistics of attributes, and their
classification is called classification according to attributes.

 Such classification can be:

o Simple classification, or

o Manifold classification.

 In simple classification, we consider only one attribute and divide the universe into
two classes: one class possessing the attribute and the other not possessing it.

 In manifold classification, we consider two or more attributes simultaneously and
divide the data into a number of classes.

2. Classification according to class-intervals

 Sometimes known as classification based on differences of degree of a given
characteristic, or quantitative classification.

 Numerical characteristics refer to quantitative phenomena which can be measured
in terms of some statistical unit.

 Such data are known as statistics of variables and are classified on the basis of class
intervals.

 Each class interval has an upper limit and a lower limit, known as the class
limits. The difference between the two class limits is called the class magnitude.

 Class magnitudes may be equal or unequal.

 The number of items which fall in a given class is known as the frequency of that
class.

 All the classes or groups, taken together with their respective frequencies and put in
the form of a table, are described as a grouped frequency distribution.
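
As an illustration (with hypothetical scores and an arbitrarily chosen class magnitude of 5), a grouped frequency distribution can be built in a few lines of Python:

# Build a grouped frequency distribution with equal class magnitudes.
scores = [15, 20, 21, 20, 36, 15, 25, 15, 31, 28, 22, 19]

lower, width, n_classes = 15, 5, 5   # classes 15-19, 20-24, ..., 35-39
freq = {i: 0 for i in range(n_classes)}

for x in scores:
    freq[(x - lower) // width] += 1  # index of the class the item falls into

for i, count in freq.items():
    print(f"{lower + i*width}-{lower + (i+1)*width - 1}: {count}")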

3. Geographical classification

 The data are classified according to the geographical location such as continents,
countries, states, districts, or other subdivisions.
4. Chronological Classification

 When the given data are classified on the basis of time, it is named chronological
classification.

 In this type of classification, the data may be classified on the basis of time, i.e.,
years, months, weeks, days or hours.

5. Alphabetical Classification

 When the data are arranged in alphabetical order, it is called alphabetical
classification.

 This type of classification is mostly adopted for data of general use because it aids
in locating the items easily.

Objectives of Classification

 The chief objectives of classification are:

 To present the facts in a simple form: classification eliminates unnecessary
details and makes a mass of complex data simple, brief, logical and
understandable.

 To bring out clearly points of similarity and dissimilarity: classification brings
out clearly the points of similarity and dissimilarity of the data so that they can be
easily grasped. Facts having similar characteristics are placed in one class, such as
educated, uneducated, employed, unemployed, etc.

 To facilitate comparison: classification of data enables one to make comparisons,
draw inferences and locate facts; this is not possible with unclassified data. If the
marks obtained by B.com students in two colleges are given, no comparison can
be made of their intelligence level. Classifying the students into first, second,
third and failure classes on the basis of the marks obtained makes such
comparison easy.

 To bring out relationship: classification helps in finding out a cause-and-effect
relationship, if there is any in the data.

 To present a mental picture: the process of classification enables one to form a
mental picture of objects of perception and conception. Summarized data can be
easily understood and remembered.

 To prepare the basis for tabulation: classification prepares the basis for tabulation
and statistical analysis of the data; unclassified data cannot be presented in tables.

Limitations:
 Determining the number of groups and their magnitudes is challenging.

 Choosing the class limits and their type (inclusive or exclusive) is likewise a difficult decision.

 Determining the frequency of each class is another challenge.

e) Tabulation

 Is the process of summarizing raw data and displaying it in compact form
(i.e., in the form of statistical tables) for further analysis.

 Is an orderly arrangement of data in columns and rows.

 Is the orderly and systematic presentation of numerical data in a form designed to
elucidate the problem under consideration.

 A statistical table is the logical listing of related quantitative data in vertical columns
and horizontal rows of numbers, with sufficient explanatory and qualifying words,
phrases and statements in the form of titles, headings and notes to make clear the
full meaning of the data and their origin.

Objectives of tabulation

 Tabulation is a process which helps in understanding complex numerical facts.


 The purpose of table is to summarize a mass of numerical information and to
present it in the simplest possible form consistent with the purpose for which it is to
be used.

 In general tabulation has the following objectives

To clarify the object of investigation

 The function of tabulation in the general scheme of statistical investigation is to arrange
in an easily accessible form the answers with which the investigation is concerned.

To clarify the characteristics of data

 A table presents facts clearly and concisely, eliminating the need for wordy
explanation. It brings out the chief characteristics of data.

To present facts in the minimum space

 A table presents facts in a minimum of space and communicates information in a far
better way than textual material.

To facilitate statistical process


 It simplifies references to data and facilitates comparative analysis and interpretation
of the facts.

Advantages of Tabulation

 Tabulation is essential for the following reasons:

 It conserves space and reduces explanatory and descriptive statements to a
minimum.

 It facilitates the process of comparison.

 It facilitates the summation of items and the detection of errors and
omissions.

 It provides a basis for various statistical computations.

Limitations of Tabulation

A table contains only figures, not their description; it is not easy to understand for
persons who are not adept at assimilating facts from tables. It requires specialized
knowledge to understand tables, and a layman cannot derive any conclusion from a
table.

A table does not lay emphasis on any section of particular importance.

Main Parts of Statistical tables

1. Table number: every table should be numbered so that it can be identified. The
number is normally indicated at the top of the table.
2. Title: Each table must bear a title indicating the type of data contained. The title
should not be very lengthy so as to run in several lines. It should be clear and
unambiguous.

3. Captions and Stubs: A table consists of rows and columns. The headings or
subheadings given in columns are known as captions while those given in rows
are stubs. It is necessary that a table should have captions and stubs to indicate
what columns and rows stand for. It is also desirable to provide for an extra
column and row in the table for the column and row totals.

4. Main body of the table: as this part of the table contains the data, it is the most
important part. Its size and shape should be suitable to accommodate the data.
The data are entered from top to bottom in columns and from left to right in
rows.

5. Ruling and spacing: the lines and the spacing that separate and set off the parts of the table.

6. Head note: a brief explanatory statement placed below the title, e.g., stating the units of measurement.

7. Footnote: explains or qualifies particular entries or parts of the table.

8. Source note: indicates the source from which the data were obtained.

 Tabulation can be classified as simple or complex.

 Simple tabulation gives information about one or more groups of
independent questions. It results in one-way tables which supply answers to
questions about one characteristic of the data only.

 Complex tabulation shows the division of data into two or more categories
and as such is designed to give information concerning one or more sets of
inter-related questions. It usually results in two-way tables (which give
information about two inter-related characteristics of the data), three-way
tables, or still higher-order (manifold) tables.

f) Graphic presentation of data

In the previous topic we have seen that tabulation is one method of presenting data. Another
way of presenting data is in the form of diagrams and graphs. However, this method of data
presentation is also not without limitations.

Importance of graphic and diagrammatic presentation of Data

1. On account of their visual impact, the data presented through graphic and diagrammatic
presentation are better grasped and remembered than the tabulated ones.
2. These forms of presentation transform data in a simple, clear and effective manner.

3. They are able to attract the attention of the reader particularly when several colors and
pictures are used in preparation.

4. A major advantage of these presentations is that they have better appeal even to a
layman. For the layman, simple charts, maps and pictures facilitate a much better
understanding of the data on which these are based.

5. Since they lead to a better understanding, they save considerable time.

6. Even when data show highly complex relations among variables, these devices make
them much clearer. They thus greatly facilitate the interpretation and analysis of data.

7. These devices are extremely helpful in depicting mode, median, skewness, correlation
and regression, normal distribution, time series analysis and so on.

Limitations of Graphs and Diagrams

1. In presenting data by these devices, it is not possible to maintain 100% precision. As


such these devices are not suitable where precision is needed.
2. They cannot be a complete substitute for tabulation. They serve the purpose better
when they are accompanied by suitable tables.

3. When too many details are to be presented, these devices fail to present them without
loss of clarity.

4. In those cases, where mathematical treatment is required, these devices turn out to be
extremely unsuitable.

5. Small differences in large measurements can not be properly brought out by means of
graphs and diagrams.

6. While graphs and diagrams are generally simple to understand, one should know that
not all graphic devices are simple. Particularly when ratio graphs and multidimensional
figures are used, these may be beyond the comprehension of the common man. A
proper understanding of such figures needs some expertise on the part of the reader.

g) Graphic devices

There are two major categories of graphs:

The natural scale graph and Ratio scale graph

The natural scale graph is more frequently used.

Within the natural scale graph, again there are two types of graphs:

Time series graph

Frequency graph

A time series graph shows the data against time, which may be measured in hours, days,
weeks, months or years.

In frequency graphs, the horizontal axis measures not time but some other variable, such as
the income of employees, plotted against the frequency with which each value occurs (for
example, the number of employees earning each income). Within the frequency graph
category, the histogram, the frequency polygon and the ogive are the most popular.

Time series Graphs:

1. Line Graph

Time period is measured along X-axis and the corresponding values are on the Y-axis.

2. Silhouette or Net Balance Graph


In such a graph the two related series are plotted in such a manner as to highlight the
difference or gap between them.

3. Component or Band Graph

Under this device, phenomena which form part of a whole are shown by successive
bands or components, to give an overall picture along with the successive contributions of
the components.

4. Range Graph

This graph shows the range, that is, the highest and the lowest values of a certain product
or item under reference.

Frequency Graphs:

The following are the types of frequency graphs:

1. Line graph
2. Polygon
3. Ogive
4. Histogram
5. Frequency curve
6. Lorenz curve
7. Z-chart

Line Graph

Line graph is also used to present a discrete frequency distribution.

On the axis of X is measured the size of the items while on the axis of Y is measured the
corresponding frequency.

Histogram

In a histogram, we measure the size of the item in question, given in terms of class intervals,
on the axis of X, while the corresponding frequencies are shown on the axis of Y. Unlike the
line graph, here the frequencies are shown in the form of rectangles whose bases are the
class intervals. Furthermore, the rectangles are adjacent to each other, without any gap
between them. A histogram generally represents a continuous frequency distribution, in
contrast to the line graph, which represents either a discrete frequency distribution or a time
series.

Advantages

Each rectangle shows a distinctly separate class in the distribution.


The area of each rectangle in relation to all other rectangles shows the proportion of the
total number of observations pertaining to that class.
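
A short Python sketch (assuming the matplotlib library is available; the scores are hypothetical) shows how a histogram with class intervals of width 5 might be drawn:

import matplotlib.pyplot as plt

scores = [15, 20, 21, 20, 36, 15, 25, 15]

# Class intervals form the bases of adjacent rectangles; frequencies give their heights.
plt.hist(scores, bins=[10, 15, 20, 25, 30, 35, 40], edgecolor="black")
plt.xlabel("Size of item (class intervals)")
plt.ylabel("Frequency")
plt.title("Histogram")
plt.show()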

Frequency Polygon

A frequency polygon, like any polygon, consists of many angles. A histogram can be easily
transformed into a frequency polygon by joining the mid-points of the tops of the rectangles by
straight lines. A frequency polygon can also be drawn by taking the mid-point of each class
interval and joining the mid-points by straight lines. This can be done only when we have a
continuous series.

Advantages

The frequency polygon is simpler than the histogram.

It shows more vividly an outline of the data pattern.

As the number of classes and the number of observations increases, the frequency
polygon becomes increasingly smooth.

Frequency Curve

When a frequency polygon is smoothed and rounded at the top, it is known as a
frequency curve.

Cumulative Frequency Curve or Ogive

Cumulative frequency curve enables us to know how many observations are above or below a
certain value. It is also known as ogive.

Z-curve

It is commonly used in business. The name of this device is derived from its shape. It is the
combination of three curves, namely

I. The curve based on the original data

II. The curve based on the cumulative frequencies

III. The curve based on the moving totals (obtained by adding the past X values).

II. Statistical Analysis in Research

Analysis means the computation of certain indices or measures along with searching for
patterns of relationship that exist among the data groups.

Analysis involves estimating the values of unknown parameters of the population and testing
of hypothesis for drawing inferences.
Analysis, therefore, may be classified as descriptive analysis and inferential analysis.

 In descriptive statistics we are simply describing what is, or what the data show.
Descriptive statistics are used to describe the basic features of the data in a study: they
provide simple summaries about the sample and the measures and, together with simple
graphics analysis, form the basis of virtually every quantitative analysis of data.

 With inferential statistics, we are trying to reach conclusions that extend beyond the
immediate data alone: we investigate questions, models and hypotheses. For instance, we
use inferential statistics to try to infer from the sample data what the population might
think, or to make judgments of the probability that an observed difference between groups
is a dependable one rather than one that might have happened by chance in this study.
Thus, we use inferential statistics to make inferences from our data to more general
conditions; we use descriptive statistics simply to describe what's going on in our data.

Descriptive Statistics

Descriptive statistics are used to describe the basic features of the data in a study. They provide
simple summaries about the sample and the measures. Together with simple graphics analysis,
they form the basis of virtually every quantitative analysis of data.

Descriptive Statistics are used to present quantitative descriptions in a manageable form.


Descriptive statistics help us to simplify large amounts of data in a sensible way. Each
descriptive statistic reduces lots of data into a simpler summary. However, every time you try
to describe a large set of observations with a single indicator you run the risk of distorting the
original data or losing important detail. For instance, a grade point average (GPA) summarizes
a student's performance in a single number, but it doesn't tell you whether the student was in
difficult courses or easy ones, or whether they were courses in their major field or in other
disciplines. Even given these limitations, descriptive statistics provide a powerful summary
that may enable comparisons across people or other units.

a. Univariate Analysis

Univariate analysis involves the examination across cases of one variable at a time. There are
three major characteristics of a single variable that we tend to look at:

 the distribution

 the central tendency

 the dispersion

In most situations, we would describe all three of these characteristics for each of the variables
in our study.

The Distribution

The distribution is a summary of the frequency of individual values or ranges of values for a
variable. The simplest distribution would list every value of a variable and the number of
persons who had each value. For instance, a typical way to describe the distribution of college
students is by year in college, listing the number or percent of students at each of the four/three
years. Or, we describe gender by listing the number or percent of males and females.

Table 1. Frequency distribution table.

One of the most common ways to describe a single variable is with a frequency distribution.

 Depending on the particular variable, all of the data values may be represented, or you may
group the values into categories first (e.g., with age, price, or temperature variables, it
would usually not be sensible to determine the frequencies for each value. Rather, the
values are grouped into ranges and the frequencies determined.).
 Frequency distributions can be depicted in two ways, as a table or as a graph. Table 1 shows
an age frequency distribution with five categories of age ranges defined. The same
frequency distribution can be depicted in a graph as shown in Figure 2. This type of graph is
often referred to as a histogram or bar chart.
Figure 2. Frequency distribution bar chart.
 Distributions may also be displayed using percentages. For example, you could use
percentages to describe the:

 percentage of people in different income levels


 percentage of people in different age ranges
 percentage of people in different ranges of standardized test scores

Central Tendency/statistical average

The central tendency of a distribution is an estimate of the "center" of a distribution of values.

Tells us the point about which items have a tendency to cluster.

There are three major types of estimates of central tendency:

 The mean

 The median

 The mode

The Mean or average or arithmetic mean is probably the most commonly used method of
describing central tendency. To compute the mean all you do is add up all the values and divide
by the number of values. For example, the mean or average quiz score is determined by
summing all the scores and dividing by the number of students taking the exam. For example,
consider the test score values:

15, 20, 21, 20, 36, 15, 25, 15

The sum of these 8 values is 167, so the mean is 167/8 = 20.875.

The Median is the score found at the exact middle of the set of values. One way to compute the
median is to list all scores in numerical order, and then locate the score in the center of the
sample. For example, if there are 499 scores in the list, score #250 would be the median. If we
order the 8 scores shown above, we would get:
15,15,15,20,20,21,25,36

There are 8 scores and score #4 and #5 represent the halfway point. Since both of these scores
are 20, the median is 20. If the two middle scores had different values, you would have to
interpolate to determine the median.

The mode is the most frequently occurring value in the set of scores. To determine the mode,
you might again order the scores as shown above, and then count each one. The most
frequently occurring value is the mode. In our example, the value 15 occurs three times and is
the mode. In some distributions there is more than one modal value. For instance, in a bimodal
distribution there are two values that occur most frequently.

Notice that for the same set of 8 scores we got three different values -- 20.875, 20, and 15 -- for
the mean, median and mode respectively. If the distribution is truly normal (i.e., bell-shaped),
the mean, median and mode are all equal to each other.
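
The three measures can be verified with Python's standard statistics module, using the same eight test scores:

import statistics

scores = [15, 20, 21, 20, 36, 15, 25, 15]

print(statistics.mean(scores))    # 20.875
print(statistics.median(scores))  # 20.0 (average of the two middle scores, both 20)
print(statistics.mode(scores))    # 15 (occurs three times)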

Dispersion

Dispersion refers to the spread of the values around the central tendency. There are two
common measures of dispersion, the range and the standard deviation.

The range is simply the highest value minus the lowest value. In our example distribution, the
high value is 36 and the low is 15, so the range is 36 - 15 = 21.

 There are two problems with the range as a measure of spread. When calculating the range
you are looking at the two most extreme points in the data, and hence the value of the
range can be unduly influenced by one particularly large or small value, known as an
outlier. The second problem is that the range is only really suitable for comparing (roughly)
equally sized samples as it is more likely that large samples contain the extreme values of a
population.

The Inter-Quartile Range


The inter-quartile range describes the range of the middle half of the data and so is less prone to
the influence of the extreme values.
 To calculate the inter-quartile range (IQR) we simply divide the ordered data into four
quarters.
 The three values that split the data into these quarters are called the quartiles. The first
quartile (lower quartile, Q1) has 25% of the data below it; the second quartile (median, Q2)
has 50% of the data below it; and the third quartile (upper quartile, Q3) has 75% of the data
below it.
 Quartiles are calculated as follows:

Q1 = the ((n + 1)/4)th ordered observation

Q3 = the (3(n + 1)/4)th ordered observation
Just as with the median, these quartiles might not correspond to actual observations.
The inter-quartile range is simply the difference between the upper and lower quartiles, that is
IQR = Q3 − Q1
The inter-quartile range is useful as it allows us to start to make comparisons between the
ranges of two data sets, without the problems caused by outliers or uneven sample sizes.
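
A small Python sketch of the (n + 1)/4 rule given above, applied to the eight test scores used earlier; positions falling between observations are handled by linear interpolation, as with the median:

def quartile(data, q):
    """q-th quartile (q = 1, 2 or 3) by the (n + 1)/4 positional rule."""
    xs = sorted(data)
    pos = q * (len(xs) + 1) / 4 - 1      # zero-based position
    lo = int(pos)
    if lo + 1 >= len(xs):                # position beyond the last observation
        return xs[-1]
    frac = pos - lo
    return xs[lo] + frac * (xs[lo + 1] - xs[lo])

scores = [15, 20, 21, 20, 36, 15, 25, 15]
q1, q3 = quartile(scores, 1), quartile(scores, 3)
print(q1, q3, q3 - q1)   # 15.0 24.0 9.0 -> IQR = 9.0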

Variance

The sample variance is the standard measure of spread used in statistics. It is usually denoted by
s² and is simply the “average” of the squared distances of the observations from the sample
mean.
Strictly speaking, the sample variance measures deviation about a value calculated from the
data (the sample mean), and so we use an n − 1 divisor rather than n.

Mathematical notation:

s² = (1 / (n − 1)) · Σ (xᵢ − x̄)², where the sum runs over i = 1, …, n.

The standard deviation is a more accurate and detailed estimate of dispersion, because an
outlier can greatly exaggerate the range (as was true in this example, where the single outlier
value of 36 stands apart from the rest of the values). The standard deviation shows the relation
that a set of scores has to the mean of the sample. It is the square root of the variance:

s = √( Σ (xᵢ − x̄)² / (n − 1) )

In the top part of the ratio, the numerator, each score has the mean subtracted from it, the
difference is squared, and the squares are summed. In the bottom part, we take the number of
scores minus 1. The ratio is the variance, and its square root is the standard deviation.

In words: the standard deviation is the square root of the sum of the squared deviations from
the mean, divided by the number of scores minus one.

The standard deviation allows us to reach some conclusions about specific scores in our
distribution. Assuming that the distribution of scores is normal or bell-shaped, the following
conclusions can be reached:

 approximately 68% of the scores in the sample fall within one standard deviation of the
mean
 approximately 95% of the scores in the sample fall within two standard deviations of the
mean
 approximately 99% of the scores in the sample fall within three standard deviations of
the mean

For instance, if the mean for a given data set is 20.875 and the standard deviation is 7.0799, we can
estimate that approximately 95% of the scores will fall in the range of 20.875 − (2 × 7.0799) to
20.875 + (2 × 7.0799), i.e., between 6.7152 and 35.0348. This kind of information is a critical stepping
stone to enabling us to compare the performance of an individual on one variable with their
performance on another, even when the variables are measured on entirely different scales.
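
These figures can be reproduced with the statistics module, using the eight test scores from this section:

import statistics

scores = [15, 20, 21, 20, 36, 15, 25, 15]

mean = statistics.mean(scores)   # 20.875
sd = statistics.stdev(scores)    # sample standard deviation (n - 1 divisor), ~7.0799

# The range that should contain roughly 95% of scores if the distribution were normal.
print(mean - 2 * sd, mean + 2 * sd)   # ~6.7152 to ~35.0348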

Measures of skewness (asymmetry)

When the distribution of items in a series happens to be perfectly symmetrical, we have the
following type of curve for the distribution: a perfectly bell-shaped curve in which the values
of the mean (X̄), the median (M) and the mode (Z) are just the same, i.e., X̄ = M = Z, and
skewness is altogether absent. Such a curve is technically described as a normal curve, and the
corresponding distribution as a normal distribution.

But if the curve is distorted (whether on the right side or on the left side), we have an
asymmetrical distribution, which indicates that there is skewness. If the curve is distorted on
the right side we have positive skewness, but when the curve is distorted towards the left we
have negative skewness.

Skewness is thus a measure of asymmetry, and shows the manner in which the items are
clustered around the average.

In a symmetrical distribution, the items show a perfect balance on either side of the mode,
but in a skewed distribution the balance is thrown to one side. The amount by which the
balance exceeds on one side measures the skewness of the series.

The differences among the mean, median and mode provide an easy way of expressing
skewness in a series:

 In case of positive skewness, we have Z < M < X̄.

 In case of negative skewness, we have X̄ < M < Z.

[Figure: positively skewed curve and negatively skewed curve]

Usually we measure skewness as:

Skewness = X̄ − Z, and its coefficient (j) is worked out as j = (X̄ − Z) / σ, where σ is the
standard deviation.

In case Z is not well defined, then we work out skewness as under:

Skewness = 3(X̄ − M), and its coefficient (j) is worked out as j = 3(X̄ − M) / σ.

The significance of skewness lies in the fact that through it one can study the formation of a
series and can get an idea about the shape of the curve, whether normal or otherwise, when the
items of a given series are plotted on a graph.

Kurtosis is the measure of the flat-toppedness of a curve; that is, the humpedness of the curve,
and it points to the nature of the distribution of items in the middle of a series.
middle of a series.

 A bell shaped curve or the normal curve is Mesokurtic because it is kurtic in the center
 If the curve is relatively more peaked than the normal curve, it is called Leptokurtic

 If a curve is more flat than the normal curve, it is called Platykurtic.

Knowing the shape of the distribution curve is crucial to the use of statistical methods in research
analysis, since most methods make specific assumptions about the nature of the distribution
curve.

b) Bivariate and Multivariate Analysis

Whenever we deal with data on two or more variables, we are said to have a bivariate or
multivariate population.

Such situations usually happen when we wish to know the relation of the two and/or more
variables in the data with one another.

There are different methods of determining the relationship between variables, but no method
can tell us for certain that a correlation is indicative of a causal relationship.

Thus we have to answer two types of questions in a bivariate or multivariate population, viz.,

 Does there exist association or correlation between the two (or more) variables? If yes, of
what degree?
 Is there any cause-and-effect relationship between the two variables (in the case of a
bivariate population), or between one variable on one side and two or more variables on
the other side (in the case of a multivariate population)? If yes, of what degree and in
which direction?

The first question can be answered by the use of correlation technique and the second question
by the technique of regression.

There are several methods of applying the two techniques, but the important ones are as
under:

In case of bivariate population:

 Correlation can be studied through:


 Cross tabulation

 Scattergram

 Charles Spearman’s Coefficient of correlation

 Karl Pearson’s Coefficient of correlation

 Cause and effect relationship can be studied through;

 Simple regression analysis

In case of multivariate population

 Correlation can be studied through;


 Coefficient of multiple correlation

 Coefficient of partial correlation

 Cause and effect relationship can be studied through:

 Multiple regression

Cross tabulation

 Is useful when the data is in nominal form.


 We classify each variable into two or more categories and then cross-classify the
variables in these subcategories.

 Begins with a two-way table which indicates whether there is or is not an
interrelationship between the variables.

 Then we look for interactions between the variables, which may be symmetrical,
reciprocal or asymmetrical.

 A symmetrical relationship is one in which the two variables vary together, but we
assume that neither variable is due to the other.

 A reciprocal relationship exists when the two variables mutually influence or
reinforce each other.

 An asymmetrical relationship is said to exist if one variable (the independent variable)
is responsible for another variable (the dependent variable).

 The strength of a relationship is determined by the pattern of differences between the
values of the variables.

 If there are marked percentage differences between the different categories of the
variables, the relationship between them is strong.

 If the percentage differences are slight, the relationship is weak.

 The statistical significance of a relationship is determined by using an appropriate test
of significance.

Scattergram

 Is a graph on which a researcher plots each case or observation, where each axis
represents the value of one variable.
 Is used for variables measured at the interval or ratio level, rarely for ordinal variables,
and never if either of the variables is nominal.

 The independent variable is usually put on the X-axis and the dependent variable on the Y-axis.

 Can show three aspects of the bivariate relationship for the researcher.

 Form

 Direction

 Precision

Form

 Relationship can take three forms

 Independence

 Linear

 Curvilinear

 Independence

No relationship

Is the easiest to see

Looks like a random scatter with no pattern, or a straight line that is exactly parallel to the
horizontal or vertical axis.

 Linear Relationship
Means that a straight line can be visualized in the middle of a maze of cases running
from one corner to another.

 Curvilinear Relationship

Means that the center of a maze of cases would form a U curve, right side up or upside
down or an S curve.

Direction

 Linear relationships can have a positive or negative direction.

The plot of a positive relationship looks like a diagonal line from the lower left
to the upper right. Higher values on X-axis tend to go with higher values on Y,
and vice versa.

A negative relationship looks like a line from the upper left to the lower right.
It means that higher values on one variable go with lower values on the other.

Precision

 Bivariate relationships differ in their degree of precision.

 Precision is the amount of spread in the points on the graph.

 A higher level of precision occurs when the points hug the line that summarizes the
relationship.

Spearman’s Rank Order Correlation coefficient (Rank Correlation) Or Rho

 Is the oldest of the frequently used measures of ordinal association.

 Rho (ρ) is the measure of the extent of agreement or disagreement between two sets of
ranks.

 Is a non-parametric measure, and so it does not require the assumption of a bivariate
normal distribution.

 Its value ranges between −1.0 (perfect negative association) and +1.0 (perfect positive
association).

 Its underlying logic centers on the difference between ranks.

 Mathematically:

ρ = 1 − (6 ΣD²) / (n(n² − 1))

where
D = the difference between the X and Y ranks assigned to an object
n = the number of observations

Requirements for using Rho

The following conditions should be satisfied for using Rho

 A straight line correlation should exist between the variables.


 Both X and Y variables must be ranked or ordered.

 Sample members must have been taken at random from a larger population.

 Research example: a researcher in a study of the two-factor theory of job satisfaction
used Rho.

 Ranks were given to the perceived needs of supervisors and clerks on each job
factor according to the magnitude of the mean scores, and Rho was calculated. The
calculated value was significant (ρ = 0.86, p < 0.01), indicating
similarity between the two groups in their perceived need importance.

Karl Pearson’s Coefficient of Correlation ( or Simple Correlation)

 Is the most widely used method of measuring the degree of relationship between two
variables.
 Expresses both the strength and direction of linear correlation.

 Also known as the product moment correlation coefficient.

 Denoted by “r”, which lies between −1 and +1.

 A positive value of “r” indicates positive correlation between the two
variables (i.e., changes in both variables take place in the same direction).

 A negative value of “r” indicates negative correlation (i.e., changes in the two
variables take place in opposite directions).

 A zero value of “r” indicates that there is no association between the two
variables.

 Assumes the following:

 That there is a linear relationship between the two variables.

 That the two variables are causally related, which means that one of the
variables is independent and the other one is dependent.

 That a large number of independent causes are operating in both variables so as
to produce a normal distribution.

 Mathematically it can be expressed as:

r = Σ (xᵢ − x̄)(yᵢ − ȳ) / √( Σ (xᵢ − x̄)² · Σ (yᵢ − ȳ)² )
Testing the Significance of a Correlation

Once you've computed a correlation, you can determine the probability that the observed
correlation occurred by chance. That is, you can conduct a significance test. Most often you are
interested in determining the probability that the correlation is a real one and not a chance
occurrence. In this case, you are testing the mutually exclusive hypotheses:

Null hypothesis: r = 0

Alternative hypothesis: r ≠ 0

As in all hypotheses testing,

 We need to first determine the significance level. For example, we use the common
significance level of alpha = .05. This means that we are conducting a test where the odds
that the correlation is a chance occurrence are no more than 5 out of 100.
 The degrees of freedom or df is equal to N-2.

 Finally, the type of test to be applied (two-tailed test or one-tailed test) is to be
decided.

 Accept or reject the null hypothesis.
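
In practice these steps are usually delegated to software. A sketch using SciPy (assuming it is installed; the data are hypothetical): scipy.stats.pearsonr returns both r and the two-tailed p-value for the null hypothesis r = 0.

from scipy.stats import pearsonr

x = [2, 4, 5, 7, 9, 11, 14]
y = [10, 13, 17, 20, 25, 27, 32]

r, p = pearsonr(x, y)      # df = n - 2 is handled internally
if p < 0.05:               # alpha = .05, two-tailed
    print(f"r = {r:.3f} is significant (p = {p:.4f}); reject the null hypothesis")
else:
    print(f"r = {r:.3f} is not significant (p = {p:.4f}); retain the null hypothesis")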

Other Correlations

There are a wide variety of other types of correlations for other circumstances. For instance,
 if you have two ordinal variables, you could use the Kendall rank order Correlation
(tau).
 When one measure is a continuous interval level one and the other is dichotomous (i.e.,
two-category) you can use the Point-Biserial Correlation.

Partial Coefficient of correlation

Partial correlation measures separately the relationship between two variables in such a way
that the effects of other related variables are eliminated.

In partial correlation analysis, we aim at measuring the relations between a dependent variable
and a particular independent variable by holding all other variables constant.

Each partial coefficient of correlation measures the effect of its independent variable on the
dependent variable.

The partial correlation shows the relationship between two variables, excluding the effect of
other variables. In a way, the partial correlation is a special case of multiple correlation.

The difference between simple correlation and partial correlation is that simple correlation
does not exclude the effect of other variables; they are completely ignored, with an almost
implicit assumption that the variables not included do not have any impact on the dependent
variable. Such is not the case in partial correlation, where the impact of the other
independent variables is held constant.

N.B. In multiple correlation, three or more variables are studied simultaneously. But in partial
correlation we consider only two variables influencing each other while the effect of other
variables is held constant.

For example, suppose we have a problem comprising three variables X1, X2 and Y, where X1 is
the number of hours studied, X2 is I.Q. and Y is the number of marks obtained in the examination.
In multiple correlation, we study the relationship between the marks obtained (Y) and the
two variables, number of hours studied (X1) and I.Q. (X2). In contrast, when we study the
relationship between X1 and Y keeping I.Q. constant (at its average), it is said to be a study
involving partial correlation.

If we denote by r12.3 the coefficient of partial correlation between X1 and X2, holding X3 constant,
then

r12.3 = (r12 − r13·r23) / √( (1 − r13²)(1 − r23²) )
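
The formula translates directly into code; a minimal sketch with hypothetical pairwise correlations:

from math import sqrt

def partial_r(r12, r13, r23):
    """Partial correlation r12.3: correlation of X1 and X2, holding X3 constant."""
    return (r12 - r13 * r23) / sqrt((1 - r13 ** 2) * (1 - r23 ** 2))

print(round(partial_r(0.8, 0.5, 0.4), 3))  # 0.756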

Multiple Correlation Analysis

Unlike partial correlation, multiple correlation is based on three or more variables without
excluding the effect of any one. It is denoted by R.
In the case of three variables X1, X2 and X3, the multiple correlation coefficients are:

R1.23 = multiple correlation coefficient with X1 as the dependent variable and X2 and X3 as
independent variables.

R2.13 = multiple correlation coefficient with X2 as the dependent variable and X1 and X3 as
independent variables.

R3.12 = multiple correlation coefficient with X3 as the dependent variable and X1 and X2 as
independent variables.

It may be recalled that the concepts of dependent and independent variables were non-existent
in case of simple bivariate correlation. In contrast, the concepts of dependent and independent
variables are introduced here in multiple correlation.

Symbolically, the multiple correlation coefficient can be shown as follows:

R1.23 = √( (r12² + r13² − 2·r12·r13·r23) / (1 − r23²) )
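
A corresponding sketch for R1.23, using the same hypothetical pairwise correlations as above:

from math import sqrt

def multiple_R(r12, r13, r23):
    """Multiple correlation R1.23: X1 dependent, X2 and X3 independent."""
    return sqrt((r12 ** 2 + r13 ** 2 - 2 * r12 * r13 * r23) / (1 - r23 ** 2))

print(round(multiple_R(0.8, 0.5, 0.4), 3))  # 0.824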

Simple Regression Analysis

Regression is the determination of a statistical relationship between two or more variables.

Regression analysis is a mathematical measure of the average relationship between two or more
variables in terms of the original units of the data.

Regression analysis is a statistical method for formulating a mathematical model depicting
the relationship amongst variables, which can then be used to predict the value of the
dependent variable, given the value of the independent variable.

In simple regression we have only two variables, one variable (defined as independent) being
the cause of the behavior of the other (defined as the dependent variable).

Regression can only interpret what exists physically i.e., there must be a physical way in which
independent variable X can affect dependent variable Y.

The basic relationship between X and Y is given by

Y = a + bX

This equation is known as the regression equation of Y on X (it also represents the regression
line of Y on X when drawn on a graph), and means that each unit change in X produces a change
of b in Y, where b is positive for a direct relationship and negative for an inverse relationship.
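
The constants a and b are usually estimated by least squares (not spelled out in the text above); a minimal Python sketch with hypothetical data:

def fit_line(xs, ys):
    """Least-squares estimates of a and b in Y = a + bX."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    return a, b

xs = [1, 2, 3, 4, 5]
ys = [2.1, 4.0, 6.2, 7.9, 10.1]
a, b = fit_line(xs, ys)
print(a, b)        # intercept and slope
print(a + b * 6)   # predicted Y for X = 6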

Regression coefficients
Just as we obtained the regression of Y on X, we may also think of X as the dependent variable
and Y as the independent one.

In that case, we use X = a + bY as the estimating equation.

Regression coefficient and Correlation coefficient

If all the points on the scatter diagram fall on the regression line, the correlation between the
two variables involved is perfect.

This is as true of the regression line of Y on X as of the line of X on Y.

This means that if the correlation is perfect, the two regression lines coincide, since one and
only one straight line can pass through the same set of points.

If, however, the two lines diverge and intersect each other, the correlation is not perfect.

Properties of Regression Coefficient

1. The coefficient of correlation is the geometric mean of the two regression coefficients.
2. As the coefficient of correlation can not exceed 1, in case one of the regression
coefficients is greater than 1, then the other must be less than 1.

3. Both regression coefficients will have the same sign, either positive or negative. If one
regression coefficient is positive, then the other will also be positive.

4. The coefficient of correlation and the regression coefficient will have the same sign. If the
regression coefficients are positive, then the correlation will also be positive and vice
versa.

5. The average of two regression coefficients will always be greater than the correlation
coefficient.

6. Regression coefficients are not affected by change of origin

Multiple regression coefficient

When there are two or more independent variables, the analysis concerning the cause-and-effect
relationship is known as multiple correlation, and the equation describing such a relationship
is known as the multiple regression equation.

The multiple regression equation assumes the form

Ŷ = a + b1X1 + b2X2

where X1 and X2 are the two independent variables, Y is the dependent variable, and a, b1 and
b2 are constants.

In multiple regression analysis, the regression coefficients (viz., b1, b2) become less reliable as the
degree of correlation between the independent variables (viz., X1, X2) increases.

If there is a high degree of correlation between the independent variables, we have a problem of
what is commonly described as multicollinearity.

In such a situation we should use only one of the correlated independent variables to make our
estimate. In fact, adding a second variable, say X2, that is correlated with the first variable, say X1,
distorts the values of the regression coefficients.

Measures of Association in Case of Attributes

When data is collected on the basis of some attribute or attributes, we have a statistics
commonly termed as statistics of attributes.

In such a situation our interest may remain in knowing whether the attributes are associated
with each other or not.

The (two) attributes are associated if they appear together in a greater number of cases than is to
be expected if they were independent, and not simply because they appear together in a number
of cases, as is judged in ordinary life.

The association may be positive or negative (negative association is also known as
disassociation).

If class frequency of AB, symbolically written as (AB), is greater than the expectation of AB
being together if they are independent, then we say the two attributes are positively associated;
but if the class frequency of AB is less than this expectation, the two attributes are said to be
negatively associated.

In case the class frequency of AB is equal to expectation, the two attributes are considered as
independent i.e., are said to have no association.

Symbolically:

 If (AB) > (A)/N × (B)/N × N, then A and B are positively associated.

 If (AB) < (A)/N × (B)/N × N, then A and B are negatively associated.

 If (AB) = (A)/N × (B)/N × N, then A and B are independent, i.e., have no association.

where (AB) = the frequency of class AB, (A)/N × (B)/N × N = the expectation of AB if A and B
are independent, and N = the number of items.

In order to find out the degree or intensity of association between two or more sets of attributes,
we should work out the coefficient of association. Yule’s coefficient of association is most
popular and is often used for this purpose.

It can be mentioned as under:

QAB = [ (AB)(ab) − (Ab)(aB) ] / [ (AB)(ab) + (Ab)(aB) ]

Where,

QAB=Yule’s coefficient of association between attributes A and B

(AB)=Frequency of class AB in which A and B are present.

(Ab)=Frequency of class Ab in which A is present and B is absent

(aB)=Frequency of class aB in which A is absent and B is present.

(ab) = Frequency of class ab in which both A and B are absent.

The value of this coefficient will be in between +1 and -1.

 If the attributes are completely associated (perfect positive association) with each other,
the coefficient will be +1.
 If the attributes are completely disassociated (perfect negative association) with each
other, the coefficient will be -1.

 If the attributes are completely independent of each other, the coefficient will be 0.
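
A direct translation of Yule's Q into Python, with a hypothetical 2×2 table of class frequencies:

def yules_q(AB, Ab, aB, ab):
    """Yule's coefficient of association for a 2x2 table of attribute frequencies."""
    return (AB * ab - Ab * aB) / (AB * ab + Ab * aB)

# Hypothetical frequencies: (AB)=40, (Ab)=10, (aB)=20, (ab)=30.
print(round(yules_q(AB=40, Ab=10, aB=20, ab=30), 3))  # 0.714 -> positive association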

In order to judge the significance of association between two attributes, we make use of the
chi-square test, finding the value of chi-square (χ²) and using the χ² distribution. The value of
χ² is worked out as under:

χ² = Σ (Oij − Eij)² / Eij

where Oij = observed frequencies and Eij = expected frequencies.
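
In practice this computation is available in SciPy (assuming it is installed): chi2_contingency computes the expected frequencies Eij under independence and the χ² statistic for a contingency table; correction=False matches the plain formula above (no Yates correction).

from scipy.stats import chi2_contingency

# The same hypothetical 2x2 table used for Yule's Q: (AB), (Ab) / (aB), (ab).
table = [[40, 10],
         [20, 30]]

chi2, p, df, expected = chi2_contingency(table, correction=False)
print(chi2, p, df)   # statistic, p-value, degrees of freedom
print(expected)      # the Eij implied by independence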

Chi Square Test


 Is an important test amongst the several tests of significance.

 Used in the context of sampling analysis for comparing a sample variance to a theoretical
variance.

 Can be used as a non-parametric test to determine whether categorical data show
dependency or whether the two classifications are independent.

 Can be used to make a comparison between theoretical populations and actual data
when categories are used.

 Is a technique through the use of which it is possible for researchers to

 Test the goodness of fit

 Test the significance of association between two attributes

 Test the homogeneity of population variance.

Chi-square as a Test for Comparing Variance

The chi-square test is often used to judge the significance of population variance, i.e., we can use
the test to judge whether a random sample has been drawn from a normal population with mean μ
and a specified variance σp².

The test is based on the χ² distribution.

The χ² distribution is not symmetrical, and all its values are positive.

For making use of this distribution, one is required to know the degrees of freedom, since for
different degrees of freedom we have different curves.

The smaller the number of degrees of freedom, the more skewed the distribution is.

In brief, when we have to use chi-square as a test of population variance, we work out the
value of χ² to test the null hypothesis (viz., H0: σs² = σp²) as under:

χ² = (n − 1) · σs² / σp²

Then by comparing the calculated value with the table value of $\chi^2$ for (n-1) degrees of freedom at a given level of significance, we may either accept or reject the null hypothesis.

 If the calculated value of $\chi^2$ is less than the table value, the null hypothesis is accepted.
 If the calculated value of $\chi^2$ is equal to or greater than the table value, the null hypothesis is rejected.
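
For illustration, this decision rule can be applied numerically. Below is a minimal Python sketch, assuming a hypothetical sample size, sample variance and hypothesized population variance; scipy.stats.chi2 supplies the table (critical) value:

```python
# Chi-square test for a population variance -- a sketch with made-up figures.
from scipy.stats import chi2

n = 25             # sample size (hypothetical)
sample_var = 9.0   # sigma_s^2 estimated from the sample (hypothetical)
pop_var = 6.25     # sigma_p^2 under the null hypothesis (hypothetical)
alpha = 0.05       # chosen level of significance

chi_sq = (n - 1) * sample_var / pop_var      # chi^2 = (sigma_s^2 / sigma_p^2) * (n - 1)
table_value = chi2.ppf(1 - alpha, df=n - 1)  # upper-tail table (critical) value

print(f"calculated = {chi_sq:.2f}, table value = {table_value:.2f}")
if chi_sq >= table_value:
    print("Reject the null hypothesis.")
else:
    print("Accept the null hypothesis.")
```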
Chi-Square as a Non-Parametric Test

Chi-square is an important non-parametric test, and as such no rigid assumptions are necessary in respect of the type of population.

We require the degrees of freedom (implicitly, of course, the size of the sample) for using this test.

As a non-parametric test, chi-square can be used:

 As a test of goodness of fit


 As a test of independence.

As a test of goodness of fit, the $\chi^2$ test enables us to see how well an assumed theoretical distribution (such as the Binomial, Poisson or Normal distribution) fits the observed data.

When some theoretical distribution is fitted to the given data, we are always interested in knowing how well this distribution fits the observed data.

 If the calculated value of $\chi^2$ is less than the table value at a certain level of significance, the fit is considered to be a good one, which means that the divergence between the observed and expected frequencies is attributable to fluctuations of sampling.
 If the calculated value of $\chi^2$ is greater than its table value, the fit is not considered to be a good one.
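
As a rough illustration of the goodness-of-fit use, suppose we wish to test whether a die is fair. A minimal Python sketch, with invented counts, using scipy.stats.chisquare:

```python
# Goodness of fit: do 120 hypothetical die rolls follow a uniform distribution?
from scipy.stats import chisquare

observed = [18, 22, 16, 25, 19, 20]  # invented counts for faces 1..6
expected = [20] * 6                  # 120 / 6 under the "fair die" hypothesis

stat, p_value = chisquare(f_obs=observed, f_exp=expected)
print(f"chi-square = {stat:.2f}, p-value = {p_value:.3f}")
# A large p-value: the divergence is attributable to sampling fluctuation,
# i.e., the fit is considered a good one.
```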

As a test of independence, the $\chi^2$ test enables us to explain whether or not two attributes are associated. To do so, we first calculate the expected frequencies and then work out the value of $\chi^2$.

 If the calculated value of $\chi^2$ is less than the table value at a certain level of significance for the given degrees of freedom, we conclude that the null hypothesis stands, which means that the two attributes are independent, i.e., not associated.
 If the calculated value of $\chi^2$ is greater than its table value, the inference would be that the null hypothesis does not hold good, which means that the two attributes are associated, and the association is not because of some chance factor but exists in reality.
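
For illustration, the test of independence can be run on a hypothetical 2x2 contingency table with scipy.stats.chi2_contingency (a sketch, not an example from the text):

```python
# Test of independence between two attributes on a hypothetical 2x2 table.
import numpy as np
from scipy.stats import chi2_contingency

table = np.array([[40, 10],   # (AB), (Ab)
                  [20, 30]])  # (aB), (ab)

# correction=False disables Yates' continuity correction (applied by default
# for 2x2 tables) so the statistic matches the plain formula given above.
stat, p_value, dof, expected = chi2_contingency(table, correction=False)
print(f"chi-square = {stat:.2f}, df = {dof}, p-value = {p_value:.4f}")
# A p-value below the chosen level of significance: the null hypothesis does
# not hold, i.e., the two attributes are associated.
```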

N.B. 1) $\chi^2$ is not a measure of the degree or form of relationship between two attributes; it is simply a technique for judging the significance of such association or relationship between two attributes.

2) In order that we may apply the chi-square test, either as a test of goodness of fit or as a test to judge the significance of association between attributes, it is necessary that the observed as well as the theoretical or expected frequencies be grouped in the same way, and that the theoretical distribution be adjusted to give the same total frequency as we find in the observed distribution.

Conditions for the Application of the $\chi^2$ Test


1. Observation recorded and used are collected on a random basis.
2. All the items in the sample must be independent.

3. No group should contain very few items, say less than 10. In case where the frequencies
are less than 10, regrouping is done by combining the frequencies of adjoining groups so
that the new frequencies become greater than 10.

4. The overall number of items must also be reasonably large. It should normally be at least
50, howsoever small the number of groups may be.

5. The constraints must be linear. Constraints which involve linear equations in the cell frequencies of a contingency table are known as linear constraints.

Important Characteristics of the $\chi^2$ Test

1. The test (as a non-parametric test) is based on frequencies and not on the parameters
like mean and standard deviation.
2. The test is used for testing the hypothesis and is not useful for estimation.

3. This test possesses the additive property.

4. This test can also be applied to a complex contingency table with several classes and
as such is very useful in research work.

5. This test is an important non-parametric test, as no rigid assumptions are necessary in regard to the type of population, no parameter values are needed, and relatively few mathematical details are involved.

Steps involved in Applying Chi-Square Test

The various steps involved are as follows:

1. Calculate the expected frequencies on the basis of the given hypothesis or on the basis of the null hypothesis. Usually, in case of a 2x2 or any contingency table, the expected frequency for any given cell is worked out as under:

$$\text{Expected frequency of any cell} = \frac{\text{row total for the row of that cell} \times \text{column total for the column of that cell}}{\text{grand total}}$$

2. Obtain the difference between the observed and expected frequencies and find the squares of such differences, i.e., calculate $(O_{ij} - E_{ij})^2$.
3. Divide each quantity $(O_{ij} - E_{ij})^2$ obtained as stated above by the corresponding expected frequency to get $(O_{ij} - E_{ij})^2 / E_{ij}$, and do this for all the cell frequencies or the group frequencies.
4. Find the summation of all the $(O_{ij} - E_{ij})^2 / E_{ij}$ values, i.e., $\sum \frac{(O_{ij} - E_{ij})^2}{E_{ij}}$. This is the required $\chi^2$ value.

5. Compare the calculated value of $\chi^2$ with its table value at the appropriate degrees of freedom (n-1 for a goodness-of-fit test with n classes, or (r-1)(c-1) for an r x c contingency table) and draw the inference.
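
For illustration, these five steps can be carried out "by hand" with NumPy on the same hypothetical 2x2 table used earlier; only the table value comes from scipy:

```python
# Applying the chi-square test step by step on a hypothetical 2x2 table.
import numpy as np
from scipy.stats import chi2

observed = np.array([[40, 10],
                     [20, 30]])

# Step 1: expected frequency of each cell = row total x column total / grand total
row_totals = observed.sum(axis=1, keepdims=True)
col_totals = observed.sum(axis=0, keepdims=True)
expected = row_totals * col_totals / observed.sum()

# Steps 2-4: summation of (O - E)^2 / E over all cells
chi_sq = ((observed - expected) ** 2 / expected).sum()

# Step 5: compare with the table value at (r-1)(c-1) degrees of freedom
dof = (observed.shape[0] - 1) * (observed.shape[1] - 1)
table_value = chi2.ppf(0.95, df=dof)  # 5% level of significance
print(f"calculated = {chi_sq:.2f}, table value = {table_value:.2f}")
```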

Interpretation of the Research Findings


Meaning of Interpretation
Interpretation refers to the task of drawing inferences from the collected facts after an analytical
and/or experimental study.
Interpretation is concerned with relationships within the collected data, partially overlapping
analysis. It also extends beyond the data of the study to include the results of other research,
theory and hypothesis.
Interpretation is the device through which the factors that seem to explain what has been observed by the researcher in the course of the study can be better understood; it also provides a theoretical conception which can serve as a guide for further research.
In general, interpretation has two major aspects:
1. The effort to establish continuity in research through linking the results of a given study with those of another
2. The establishment of some explanatory concepts.
Why interpretation?
Interpretation is essential for the simple reason that the usefulness and utility of research findings lie in proper interpretation. It is considered a basic component of the research process for the following reasons.
1. It is through interpretation that the researcher can well understand the abstract principle that works beneath his findings. Through this he can link up his findings with those of other studies having the same abstract principle, and thereby can make predictions about the concrete world of events. Fresh inquiries can test these predictions later on. In this way the continuity in research can be maintained.
2. Interpretation leads to the establishment of explanatory concepts that can serve as a guide for future research studies; it opens new avenues of intellectual adventure and stimulates the quest for more knowledge.
3. Only through interpretation can the researcher appreciate why his findings are what they are, and only through it can he make others understand the real significance of his research findings.
4. The interpretation of the findings of an exploratory research study often results in hypotheses for experimental research, and as such interpretation is involved in the transition from exploratory to experimental research.
Techniques of Interpretation
The task of interpretation is not an easy job; rather, it requires great skill and dexterity on the part of the researcher.
Interpretation is an art that one learns through practice and experience.
The techniques of interpretation often involve the following steps:
1. The researcher must give reasonable explanations of the relations which he has found; he must interpret the lines of relationship in terms of the underlying processes and must try to find out the thread of uniformity that lies under the surface layer of his diversified research findings. In fact, this is the technique through which generalizations are made and concepts formulated.
2. Extraneous information, if collected during the study, must be considered while interpreting the final results of the research study, for it may be a key factor in understanding the problem under consideration.
