Chapter 4. Data Processing and Statistical Analysis
Outline
I. Data Analysis
By the time you get to the analysis of your data, most of the really difficult work has been done.
It's much more difficult to: define the research problem; develop and implement a sampling
plan; conceptualize, operationalize and test your measures; and develop a design structure. If
you have done this work well, the analysis of the data is usually a fairly straightforward affair.
Analysis of data involves a number of closely related operations performed with the
purpose of summarizing the collected data and organizing them in such a manner that they
yield answers to the research questions and hypotheses that initiated the study.
Analysis of data includes comparison of the outcomes of the various treatments upon the
several groups and the making of the decision as to the achievement of the goals of research.
Analysis of data means to make the raw data meaningful or to draw some results from the data
after the proper treatment.
Some authors differentiate data preparation from data analysis, treating data preparation as
one step in the research activity and data analysis as another.
Others classify the steps involved in data analysis in general terms.
The other approach used to classify data analysis in social research involves three major
steps, done in roughly this order:
1. data preparation;
2. descriptive statistics; and
3. inferential statistics.
In most research studies, the analysis section follows these three phases of analysis.
Descriptions of how the data were prepared tend to be brief and to focus only on the more
unique aspects of your study, such as specific data transformations that were performed. The
descriptive statistics that you actually look at can be voluminous. In most write-ups, these are
carefully selected and organized into summary tables and graphs that only show the most
relevant or important information. Usually, the researcher links each of the inferential analyses
to specific research questions or hypotheses that were raised in the introduction, or notes any
models that were tested that emerged as part of the analysis. In most analysis write-ups it's
especially critical to not "miss the forest for the trees." If you present too much detail, the reader
may not be able to follow the central line of the results. Often extensive analysis details are
appropriately relegated to appendices, reserving only the most critical analysis summaries for
the body of the report itself.
Data Preparation/processing
a) Logging the Data
In any research project you may have data coming from a number of different sources at
different times: for example, from mail survey returns, coded interview data, pretest or
posttest data, and observational data.
In all but the simplest of studies, you need to set up a procedure for logging the information
and keeping track of it until you are ready to do a comprehensive data analysis.
Researchers differ in how they prefer to keep track of incoming data. In most
cases, you will want to set up a database that enables you to assess at any time what data is
already in and what is still outstanding.
You could do this with any standard computerized database program (e.g., Microsoft
Access, Claris FileMaker), although this requires familiarity with such programs. Or you
can accomplish this using standard statistical programs (e.g., SPSS, SAS, Minitab, Data
Desk) and running simple descriptive analyses to get reports on data status.
It is also critical that the data analyst retain the original data records for a reasonable period
of time -- returned surveys, field notes, test protocols, and so on.
A database for logging incoming data is a critical component in good research record-
keeping.
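As a minimal sketch, such a tracking log can be kept even without a dedicated database program. The record IDs, source names and status fields below are hypothetical, purely to illustrate the idea of knowing at any time what is in and what is outstanding:

```python
# Hypothetical log of incoming data records: one entry per expected return.
log = [
    {"id": "R001", "source": "mail survey", "received": True},
    {"id": "R002", "source": "interview",   "received": True},
    {"id": "R003", "source": "mail survey", "received": False},  # still outstanding
]

# Split the log into what has arrived and what is still outstanding.
received    = [r["id"] for r in log if r["received"]]
outstanding = [r["id"] for r in log if not r["received"]]

print("in:", len(received), "outstanding:", outstanding)
```

In practice the same report comes from a simple frequency run in SPSS, SAS, or any database program; the point is only that the status of every expected record is queryable at any time.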
b) Editing
As soon as data are received you should screen them for accuracy. In some circumstances doing this
right away will allow you to go back to the sample to clarify any problems or errors.
Editing of data is a process of examining the raw collected data to detect errors and omissions
and to correct these when possible.
Editing is done to assure that the data are accurate, consistent with other facts gathered,
uniformly entered, as complete as possible, and well arranged to facilitate coding
and tabulation.
There are several questions you should ask as part of this initial data screening:
Field editing:
Consists of the review of the reporting forms for completing
(translating or rewriting) what the interviewer has written in abbreviated and/or
illegible form at the time of recording the respondents’ responses.
Done to check whether the handwriting is legible.
Central Editing:
Takes place when all forms or schedules have been completed and returned to the
office.
Implies that all forms should get a thorough editing by a single editor in a small
study and by a team of editors in the case of a large inquiry.
In case of omission of responses, sometimes the editor can enter the answer by
considering other information.
Editors must keep in view several points while performing their work:
They should be familiar with instructions given to the interviewers and coders as
well as with the editing instructions supplied to them for the purpose.
While crossing out an original entry for one reason or another, they should just draw
a single line on it so that the same may remain legible.
Editor’s initials and the date of editing should be placed on each completed form or
schedule.
c) Coding
Coding is necessary for efficient analysis; through it the several replies may be
reduced to a small number of classes which contain the critical information required for
analysis.
d) Classification
Is the process of arranging data into sequences and groups according to their
common characteristics, or separating them into different but related parts.
Is the scheme of breaking a category into a set of parts, called classes, according to
some precisely defined differing characteristics possessed by all the elements of the
category.
Characteristics of Classification
Mutually exclusive: there must not be overlap. That is, each item of data
must find its place in one class and one class only. There must be no item
which can find its way into more than one class.
Types of classification
Classification can be done in one of the following ways, depending on the nature of
the phenomenon involved:
o Simple classification or
o Manifold classification.
In simple classification, we consider only one attribute and divide the universe into
two classes: one class possessing the attribute and the other not possessing it.
2. Classification according to class intervals
Such data are known as statistics of variables and are classified on the basis of class
intervals.
Each class interval has an upper limit and a lower limit, which are known as class
limits. The difference between the two class limits is called the class magnitude.
The number of items which fall in a given class is known as the frequency of the
given class.
All the classes or groups, taken together with their respective frequencies and put in
the form of a table, are described as a grouped frequency distribution.
3. Geographical classification
The data are classified according to the geographical location such as continents,
countries, states, districts, or other subdivisions.
4. Chronological Classification
When the given data are classified on the basis of time, it is called chronological
classification.
In this type of classification, the data may be classified on the basis of time, i.e.,
years, months, weeks, days or hours.
5. Alphabetical Classification
This type of classification is mostly adopted for data of general use because it aids
in locating the items easily.
Objectives of Classification
To prepare the basis for tabulation: classification prepares the basis for tabulation
and statistical analysis of the data. Unclassified data cannot be presented.
Limitations:
Determining the number of groups and the class magnitudes is challenging.
Choosing the class limits and the class type (inclusive or exclusive) is likewise a difficult decision.
e) Tabulation
Is the process of summarizing raw data and displaying the same in compact form
(i.e., in the form of statistical tables) for further analysis.
A statistical table is the logical listing of related quantitative data in vertical columns
and horizontal rows of numbers, with sufficient explanatory and qualifying words,
phrases and statements in the form of titles, headings and notes to make clear the
full meaning of the data and their origin.
Objectives of tabulation
A table presents facts clearly and concisely, eliminating the need for wordy
explanation. It brings out the chief characteristics of data.
Advantages of Tabulation
Limitations of Tabulation
A table contains only figures and not their description. It is not easily
understood by persons who are not adept at assimilating facts from tables.
It requires specialized knowledge to understand tables. A layman cannot derive
any conclusion from a table.
1. Table number: every table should be numbered so that it can be identified. The
number is normally indicated at the top of the table.
2. Title: Each table must bear a title indicating the type of data contained. The title
should not be so lengthy as to run to several lines. It should be clear and
unambiguous.
3. Captions and Stubs: A table consists of rows and columns. The headings or
subheadings given in columns are known as captions while those given in rows
are stubs. It is necessary that a table should have captions and stubs to indicate
what columns and rows stand for. It is also desirable to provide for an extra
column and row in the table for the column and row totals.
4. Main body of the table: As this part of the table contains the data, it is the most
important part. Its size and shape should be suitable to accommodate the data.
The data are entered from top to bottom in columns and from left to
right in rows.
6. Head note: a brief explanatory statement (for example, the unit of measurement)
placed below the title and applying to the whole table.
7. Footnote: placed below the table to explain or qualify specific entries in it.
8. Source note: placed below any footnotes, indicating the source from which the
data were obtained.
In the previous topic we saw that tabulation is one method of presenting data. Another
way of presenting data is in the form of diagrams and graphs. However, this method of data
presentation is also not without limitations.
1. On account of their visual impact, the data presented through graphic and diagrammatic
presentation are better grasped and remembered than the tabulated ones.
2. These forms of presentation present data in a simple, clear and effective manner.
3. They are able to attract the attention of the reader particularly when several colors and
pictures are used in preparation.
4. A major advantage of these presentations is that they have better appeal even to a
layman. For the layman, simple charts, maps and pictures facilitate a much better
understanding of the data on which these are based.
6. Even when data show highly complex relations among variables, these devices make
them much clearer. They thus greatly facilitate the interpretation and analysis of data.
7. These devices are extremely helpful in depicting mode, median, skewness, correlation
and regression, normal distribution, time series analysis and so on.
3. When too many details are to be presented, these devices fail to present them without
loss of clarity.
4. In those cases, where mathematical treatment is required, these devices turn out to be
extremely unsuitable.
5. Small differences in large measurements cannot be properly brought out by means of
graphs and diagrams.
6. While graphs and diagrams are generally simple to understand, one should know that
not all graphic devices are simple. Particularly when ratio graphs and multidimensional
figures are used, these may be beyond the comprehension of the common man. A
proper understanding of these figures needs some expertise on the part of the reader.
g) Graphic devices
Within the natural scale graph, again there are two types of graphs:
Time series graph
Frequency graph
A time series graph shows the data against time, which could be any measure such as hours, days,
weeks, months or years.
In frequency graphs, time is not a measure; instead some other variable is plotted, such as
the income of employees against the number of employees earning that income. Within the
frequency graph category, the histogram, the frequency polygon and the ogive curve are the
popular ones.
1. Line Graph
Time period is measured along X-axis and the corresponding values are on the Y-axis.
Under this device, phenomena which form part of a whole are shown by successive
bands or components, to give an overall picture along with the successive contribution of
each component.
4. Range Graph
This graph shows the range, that is, the highest and the lowest values of a certain product or item
under reference.
Frequency Graphs:
Line Graph
On the axis of X is measured the size of the items while on the axis of Y is measured the
corresponding frequency.
Histogram
In histogram, we measure the size of the item in question, given in terms of class intervals,
on the axis of X while the corresponding frequencies are shown on the axis of Y. Unlike the
line graph, here the frequencies are shown in the form of rectangles the basis of which is the
class interval. Furthermore, the rectangles are adjacent to each other without having any gap
amongst them. A histogram generally represents a continuous frequency distribution in
contrast to line graph, which represents either a discrete frequency distribution or a time
series.
Advantages
Frequency Polygon
A frequency polygon, like any polygon, consists of many angles. A histogram can be easily
transformed into a frequency polygon by joining the mid-points of the tops of the rectangles by
straight lines. A frequency polygon can also be drawn by taking the mid-point of each class
interval and joining the mid-points by straight lines. This can be done only when we have a
continuous series.
Advantages
As the number of classes and the number of observations increases, the frequency
polygon becomes increasingly smooth.
Frequency Curve
When a frequency polygon is smoothed and rounded at the top, it is known as a
frequency curve.
A cumulative frequency curve enables us to know how many observations lie above or below a
certain value. It is also known as an ogive.
Z-curve
It is commonly used in business. The name of this device is derived from its shape. It is the
combination of three curves, namely:
I. the curve based on the original data;
II. the curve based on the cumulative totals; and
III. the curve based on the moving totals (which can be obtained by
adding the past X number of data points).
Analysis means the computation of certain indices or measures along with searching for
patterns of relationship that exist among the data groups.
Analysis involves estimating the values of unknown parameters of the population and testing
of hypothesis for drawing inferences.
Analysis, therefore, may be classified as descriptive analysis and inferential analysis.
Descriptive Statistics are used to describe the basic features of the data in a study. They
provide simple summaries about the sample and the measures. Together with simple
graphics analysis, they form the basis of virtually every quantitative analysis of data. With
descriptive statistics you are simply describing what is, what the data shows.
Inferential Statistics investigate questions, models and hypotheses. In many cases, the
conclusions from inferential statistics extend beyond the immediate data alone. For
instance, we use inferential statistics to try to infer from the sample data what the
population thinks. Or, we use inferential statistics to make judgments of the probability that
an observed difference between groups is a dependable one or one that might have
happened by chance in this study. Thus, we use inferential statistics to make inferences
from our data to more general conditions; we use descriptive statistics simply to describe
what's going on in our data.
Descriptive Statistics
a. Univariate Analysis
Univariate analysis involves the examination across cases of one variable at a time. There are
three major characteristics of a single variable that we tend to look at:
the distribution
the central tendency
the dispersion
In most situations, we would describe all three of these characteristics for each of the variables
in our study.
The Distribution
The distribution is a summary of the frequency of individual values or ranges of values for a
variable. The simplest distribution would list every value of a variable and the number of
persons who had each value. For instance, a typical way to describe the distribution of college
students is by year in college, listing the number or percent of students at each
year. Or, we describe gender by listing the number or percent of males and females.
One of the most common ways to describe a single variable is with a frequency distribution.
Depending on the particular variable, all of the data values may be represented, or you may
group the values into categories first (e.g., with age, price, or temperature variables, it
would usually not be sensible to determine the frequencies for each value. Rather, the
values are grouped into ranges and the frequencies determined.).
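The grouping step described above can be sketched in a few lines of Python. The ages and the ten-year bands below are illustrative assumptions, not data from the text:

```python
from collections import Counter

# Hypothetical ages from a sample; the values are illustrative only.
ages = [21, 22, 22, 25, 31, 34, 38, 41, 47, 52, 58, 63]

def age_band(age):
    """Group an age into a ten-year range, e.g. 38 -> '30-39'."""
    lo = (age // 10) * 10
    return f"{lo}-{lo + 9}"

# Count how many cases fall into each range: the frequency distribution.
freq = Counter(age_band(a) for a in ages)
for band in sorted(freq):
    print(band, freq[band])
```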
Frequency distributions can be depicted in two ways, as a table or as a graph. Table 1 shows
an age frequency distribution with five categories of age ranges defined. The same
frequency distribution can be depicted in a graph as shown in Figure 2. This type of graph is
often referred to as a histogram or bar chart.
Figure 2. Frequency distribution bar chart.
Distributions may also be displayed using percentages. For example, you could use
percentages to describe the:
percentage of people in different income levels
percentage of people in different age ranges
percentage of people in different ranges of standardized test scores
Central Tendency
The Mean or average or arithmetic mean is probably the most commonly used method of
describing central tendency. To compute the mean, all you do is add up all the values and divide
by the number of values. For example, the mean or average quiz score is determined by
summing all the scores and dividing by the number of students taking the exam. For example,
consider the test score values:
15, 20, 21, 20, 36, 15, 25, 15
The sum of these 8 values is 167, so the mean is 167/8 = 20.875.
The Median is the score found at the exact middle of the set of values. One way to compute the
median is to list all scores in numerical order, and then locate the score in the center of the
sample. For example, if there are 500 scores in the list, score #250 would be the median. If we
order the 8 scores shown above, we would get:
15,15,15,20,20,21,25,36
There are 8 scores and score #4 and #5 represent the halfway point. Since both of these scores
are 20, the median is 20. If the two middle scores had different values, you would have to
interpolate to determine the median.
The mode is the most frequently occurring value in the set of scores. To determine the mode,
you might again order the scores as shown above, and then count each one. The most
frequently occurring value is the mode. In our example, the value 15 occurs three times and is
the mode. In some distributions there is more than one modal value. For instance, in a bimodal
distribution there are two values that occur most frequently.
Notice that for the same set of 8 scores we got three different values -- 20.875, 20, and 15 -- for
the mean, median and mode respectively. If the distribution is truly normal (i.e., bell-shaped),
the mean, median and mode are all equal to each other.
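The three measures for the eight scores in the text can be checked directly with Python's standard statistics module:

```python
import statistics

scores = [15, 20, 21, 20, 36, 15, 25, 15]   # the eight test scores from the text

mean   = statistics.mean(scores)    # 167 / 8 = 20.875
median = statistics.median(scores)  # average of scores #4 and #5 once sorted
mode   = statistics.mode(scores)    # the most frequently occurring value

print(mean, median, mode)
```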
Dispersion
Dispersion refers to the spread of the values around the central tendency. There are two
common measures of dispersion, the range and the standard deviation.
The range is simply the highest value minus the lowest value. In our example distribution, the
high value is 36 and the low is 15, so the range is 36 - 15 = 21.
There are two problems with the range as a measure of spread. When calculating the range
you are looking at the two most extreme points in the data, and hence the value of the
range can be unduly influenced by one particularly large or small value, known as an
outlier. The second problem is that the range is only really suitable for comparing (roughly)
equally sized samples as it is more likely that large samples contain the extreme values of a
population.
The lower quartile Q1 is the value at position (n + 1)/4 in the ordered data, and the upper
quartile Q3 is the value at position 3(n + 1)/4.
Just as with the median, these quartiles might not correspond to actual observations.
The inter-quartile range is simply the difference between the upper and lower quartiles, that is
IQR = Q3 − Q1
The inter-quartile range is useful as it allows us to start to make comparisons between the
ranges of two data sets, without the problems caused by outliers or uneven sample sizes.
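A quick sketch of the range and inter-quartile range for the same eight scores. Note that statistics.quantiles' default method uses the (n + 1) positioning rule described above; other quartile conventions give slightly different values:

```python
import statistics

scores = [15, 15, 15, 20, 20, 21, 25, 36]   # the eight scores, in order

data_range = max(scores) - min(scores)       # 36 - 15 = 21

# The default ("exclusive") method places Q1 at position (n + 1)/4 and
# Q3 at position 3(n + 1)/4, interpolating between observations if needed.
q1, _, q3 = statistics.quantiles(scores, n=4)
iqr = q3 - q1

print(data_range, q1, q3, iqr)
```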
Variance
The sample variance is the standard measure of spread used in statistics. It is usually denoted by
s2 and is simply the “average” of the squared distances of the observations from the sample
mean.
Strictly speaking, the sample variance measures deviation about a value calculated from the
data (the sample mean) and so we use an n − 1 divisor rather than n.
In mathematical notation:
s² = [1 / (n − 1)] Σ (xᵢ − x̄)², the sum running over i = 1, …, n.
The Standard Deviation is a more accurate and detailed estimate of dispersion because an
outlier can greatly exaggerate the range (as was true in this example, where the single outlier
value of 36 stands apart from the rest of the values). The Standard Deviation shows the relation
that the set of scores has to the mean of the sample.
In the top part of the ratio, the numerator, we see that each score has the mean subtracted from
it, the difference is squared, and the squares are summed. In the bottom part, we take the
number of scores minus 1. The ratio is the variance and the square root is the standard
deviation.
The standard deviation is the square root of the sum of the squared deviations from the mean
divided by the number of scores minus one:
s = √[ Σ (xᵢ − x̄)² / (n − 1) ]
The standard deviation allows us to reach some conclusions about specific scores in our
distribution. Assuming that the distribution of scores is normal or bell-shaped, the following
conclusions can be reached:
approximately 68% of the scores in the sample fall within one standard deviation of the
mean
approximately 95% of the scores in the sample fall within two standard deviations of the
mean
approximately 99% of the scores in the sample fall within three standard deviations of
the mean
For instance, if the mean for a given data set is 20.875 and the standard deviation is 7.0799, we can
estimate that approximately 95% of the scores will fall in the range of 20.875-(2*7.0799) to
20.875+(2*7.0799) or between 6.7152 and 35.0348. This kind of information is a critical stepping
stone to enabling us to compare the performance of an individual on one variable with their
performance on another, even when the variables are measured on entirely different scales.
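The figures quoted above can be reproduced from the eight scores using the sample standard deviation (n − 1 divisor):

```python
import statistics

scores = [15, 15, 15, 20, 20, 21, 25, 36]

mean = statistics.mean(scores)   # 20.875
sd   = statistics.stdev(scores)  # sample SD (n - 1 divisor), about 7.0799

# Under a normal distribution, roughly 95% of scores fall within 2 SDs.
low, high = mean - 2 * sd, mean + 2 * sd
print(round(sd, 4), round(low, 4), round(high, 4))
```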
When the distribution of items in a series happens to be perfectly symmetrical, we then have
the following type of curve for the distribution, in which
X̄ = M = Z
Such a curve is technically described as a normal curve and the relating distribution as a normal
distribution.
Such a curve is a perfectly bell-shaped curve, in which case the values of X̄ (mean), M (median)
and Z (mode) are just the same and skewness is altogether absent.
But if the curve is distorted (whether on the right side or on the left side), we have asymmetrical
distribution which indicates that there is skewness.
If the curve is distorted on the right side, we have positive skewness but when the curve is
distorted to the left, we have negative skewness.
Skewness is thus a measure of asymmetry and shows the manner in which the items are
clustered around the average.
In a symmetrical distribution, the items show a perfect balance on either side of the mode,
but in a skewed distribution the balance is thrown to one side.
The amount by which the balance exceeds on one side measures the skewness of the series.
The differences among the mean, the median and the mode provide an easy way of expressing
skewness in a series.
[Figures: a positively skewed curve and a negatively skewed curve.]
Skewness may be measured as the difference between the mean and the mode:
Skewness = X̄ − Z
or by Pearson's coefficient of skewness:
j = 3(X̄ − M) / σ
where X̄ is the mean, M the median, Z the mode and σ the standard deviation.
The significance of skewness lies in the fact that through it one can study the formation of the
series and get an idea about the shape of the curve, whether normal or otherwise, when the
items of a given series are plotted on a graph.
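As an illustration, Pearson's second coefficient of skewness, j = 3(X̄ − M)/σ, can be computed for the eight scores used earlier; the positive result agrees with the pull of the outlier 36 toward the right tail:

```python
import statistics

scores = [15, 15, 15, 20, 20, 21, 25, 36]   # the scores used throughout

mean   = statistics.mean(scores)    # 20.875
median = statistics.median(scores)  # 20
sd     = statistics.stdev(scores)   # about 7.0799

# Pearson's second coefficient: j = 3(mean - median) / sd.
# A positive j indicates positive (right) skew.
j = 3 * (mean - median) / sd
print(round(j, 4))
```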
Kurtosis is the humpedness of the curve and points to the nature of the distribution of items in
the middle of a series.
A bell-shaped or normal curve is mesokurtic because it is kurtic in the centre.
If the curve is relatively more peaked than the normal curve, it is called leptokurtic, whereas a
curve flatter than the normal curve is called platykurtic.
Knowing the shape of the distribution curve is crucial to the use of statistical method in research
analysis since most methods make specific assumptions about the nature of the distribution
curve.
Whenever we deal with data on two or more variables, we are said to have a bivariate or
multivariate population.
Such situations usually arise when we wish to know how the two (or more) variables in the
data relate to one another.
There are different methods of determining the relationship between variables, but no method
can tell us for certain that a correlation is indicative of causal relationship.
Does there exist association or correlation between the two (or more) variables? If yes, of
what degree?
Is there any cause-and-effect relationship between the two variables in the case of a bivariate
population, or between one variable on one side and two or more variables on the other
side in the case of a multivariate population? If yes, of what degree and in which direction?
The first question can be answered by the use of correlation technique and the second question
by the technique of regression.
There are several methods of applying the two techniques, but the important ones are as
under:
Scattergram
Multiple regression
Cross tabulation
Begins with the two-way table, which indicates whether there is or is not an
interrelationship between the variables.
Then we look for relationships between them, which may be symmetrical, reciprocal or
asymmetrical.
A symmetrical relationship is one in which the two variables vary together, but we
assume that neither variable is due to the other.
Scattergram
Is a graph on which a researcher plots each case or observation, where each axis
represents the value of one variable.
Is used for variables measured at the interval or ratio level, rarely for ordinal variables,
and never if either of the variables is nominal.
Can show three aspects of the bivariate relationship for the researcher.
Form
Direction
Precision
Form
Independence
Linear
Curvilinear
Independence
No relationship
Looks like a random scatter with no pattern or straight line that is exactly parallel to the
horizontal or vertical axis.
Linear Relationship
Means that a straight line can be visualized in the middle of a maze of cases running
from one corner to another.
Curvilinear Relationship
Means that the center of a maze of cases would form a U curve, right side up or upside
down or an S curve.
Direction
The plot of a positive relationship looks like a diagonal line from the lower left
to the upper right. Higher values on X-axis tend to go with higher values on Y,
and vice versa.
A negative relationship looks like a line from the upper left to the lower right.
It means that higher values on one variable go with lower values on the other.
Precision
A higher level of precision occurs when the points hug the line that summarizes the
relationship.
Spearman's Rank-Order Correlation (Rho)
Its value ranges between −1.0 (perfect negative association) and +1.0 (perfect positive
association).
Mathematically:
ρ = 1 − [6 Σ D²] / [n(n² − 1)]
where
D = the difference between the X and Y ranks assigned to an object
n = the number of observations
Sample members must have been taken at random from a larger population.
Research Example: A researcher in a study of the two-factor theory of job satisfaction used
Rho.
Ranks were given to perceived needs of supervisors and clerks on each job
factor according to the magnitude of the mean scores, and Rho was calculated. The
calculated value was significant (Rho = 0.86, p < 0.01), indicating
similarity between the two groups in their perceived need importance.
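A small worked sketch of Spearman's formula, ρ = 1 − 6ΣD²/[n(n² − 1)]. The two sets of ranks are hypothetical, not the data from the study cited:

```python
# Hypothetical ranks of five job factors as ordered by two groups;
# the rank values are illustrative only.
supervisor_ranks = [1, 2, 3, 4, 5]
clerk_ranks      = [2, 1, 3, 5, 4]

n = len(supervisor_ranks)
sum_d2 = sum((x - y) ** 2 for x, y in zip(supervisor_ranks, clerk_ranks))

# Spearman's rho: 1 - 6*sum(D^2) / (n(n^2 - 1))
rho = 1 - (6 * sum_d2) / (n * (n ** 2 - 1))
print(rho)
```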
Karl Pearson's Coefficient of Correlation (r)
Is the most widely used method of measuring the degree of relationship between two
variables.
Expresses both the strength and direction of linear correlation.
A zero value of “r” indicates that there is no association between the two
variables.
Once you've computed a correlation, you can determine the probability that the observed
correlation occurred by chance. That is, you can conduct a significance test. Most often you are
interested in determining the probability that the correlation is a real one and not a chance
occurrence. In this case, you are testing the mutually exclusive hypotheses:
Null hypothesis: r = 0
Alternative hypothesis: r ≠ 0
We need to first determine the significance level. For example, we use the common
significance level of alpha = .05. This means that we are conducting a test where the odds
that the correlation is a chance occurrence is no more than 5 out of 100.
The degrees of freedom or df is equal to N-2.
Finally, the type of test to be applied (two-tailed or one-tailed) must be decided.
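The steps above can be sketched as follows. The usual test statistic is t = r·√[(n − 2)/(1 − r²)], compared against a t table at df = N − 2 for the chosen alpha; the paired data here are illustrative only:

```python
import math

# Hypothetical paired observations; the values are illustrative only.
x = [1, 2, 3, 4, 5]
y = [1, 3, 2, 5, 4]

n = len(x)
mx, my = sum(x) / n, sum(y) / n
sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
sxx = sum((a - mx) ** 2 for a in x)
syy = sum((b - my) ** 2 for b in y)

r = sxy / math.sqrt(sxx * syy)        # Pearson's r

# t statistic for H0: r = 0; compare with a t table at df = n - 2
# and the chosen alpha (e.g. .05, two-tailed).
df = n - 2
t = r * math.sqrt(df / (1 - r ** 2))
print(r, df, round(t, 4))
```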
Other Correlations
There are a wide variety of other types of correlations for other circumstances. For instance,
if you have two ordinal variables, you could use the Kendall rank order Correlation
(tau).
When one measure is a continuous interval level one and the other is dichotomous (i.e.,
two-category) you can use the Point-Biserial Correlation.
Partial correlation measures separately the relationship between two variables in such a way
that the effects of other related variables are eliminated.
In partial correlation analysis, we aim at measuring the relations between a dependent variable
and a particular independent variable by holding all other variables constant.
Each partial coefficient of correlation measures the effect of its independent variable on the
dependent variable.
The partial correlation shows the relationship between two variables, excluding the effect of
other variables. In a way, the partial correlation is a special case of multiple correlation.
The difference between simple correlation and partial correlation is that simple correlation
does not include the effect of other variables, as they are completely ignored. There is almost an
implicit assumption that the variables not included do not have any impact on the dependent
variable. But such is not the case in partial correlation, where the impact of other
independent variables is held constant.
N.B. In multiple correlation, three or more variables are studied simultaneously. But in partial
correlation we consider only two variables influencing each other while the effect of other
variables is held constant.
For example, suppose we have a problem comprising three variables X1, X2, and Y. X1 is the
number of hours studied, X2 is I.Q. and Y is the number of marks obtained in the examination.
In a multiple correlation, we will study the relationship between the marks obtained (Y) and the
two variables, number of hours studied (X1) and I.Q. (X2). In contrast, when we study the
relationship between X1 and Y keeping an average I.Q. constant, it is said to be a study
involving partial correlation.
If we denote by r12.3 the coefficient of partial correlation between X1 and X2, holding X3 constant, then

r12.3 = (r12 − r13·r23) / √[(1 − r13²)(1 − r23²)]

where r12, r13, and r23 are the simple (zero-order) correlation coefficients between the respective pairs of variables.
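The formula for r12.3 can be sketched in pure Python from the three pairwise (zero-order) correlations; the function names here are illustrative:

```python
import math

def pearson(x, y):
    """Simple (zero-order) correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def partial_r(x1, x2, x3):
    """r12.3: correlation between x1 and x2 with x3 held constant."""
    r12, r13, r23 = pearson(x1, x2), pearson(x1, x3), pearson(x2, x3)
    return (r12 - r13 * r23) / math.sqrt((1 - r13 ** 2) * (1 - r23 ** 2))
```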
Unlike partial correlation, multiple correlation is based on three or more variables without excluding the effect of any of them. It is denoted by R.
In the case of three variables X1, X2, and X3, the multiple correlation coefficient of X1 on X2 and X3 will be:

R1.23 = √[(r12² + r13² − 2·r12·r13·r23) / (1 − r23²)]
It may be recalled that the concepts of dependent and independent variables were non-existent in the case of simple bivariate correlation. In contrast, these concepts are introduced here in multiple correlation.
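As a sketch of the multiple correlation coefficient, assuming the standard three-variable formula above (function names illustrative):

```python
import math

def pearson(x, y):
    """Simple (zero-order) correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def multiple_r(x1, x2, x3):
    """R1.23: multiple correlation of x1 on x2 and x3."""
    r12, r13, r23 = pearson(x1, x2), pearson(x1, x3), pearson(x2, x3)
    return math.sqrt((r12 ** 2 + r13 ** 2 - 2 * r12 * r13 * r23)
                     / (1 - r23 ** 2))
```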
Regression analysis is a mathematical measure of the average relationship between two or more variables in terms of the original units of the data.
Regression analysis is a statistical method for formulating a mathematical model depicting the relationship among variables, which can be used to predict the values of the dependent variable from given values of the independent variable.
In simple regression, we have only two variables: one variable (defined as independent) is the cause of the behaviour of the other (defined as the dependent variable).
Regression can only interpret what exists physically, i.e., there must be a physical way in which the independent variable X can affect the dependent variable Y.
Y = a + bX

This equation is known as the regression equation of Y on X (it also represents the regression line of Y on X when drawn on a graph). It means that each unit change in X produces a change of b in Y; b is positive for a direct relationship and negative for an inverse relationship.
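The constants a and b are obtained by least squares; a minimal pure-Python sketch (the function name is illustrative):

```python
def fit_line(x, y):
    """Least-squares estimates of a and b in Y = a + bX."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    # b = sum of products of deviations / sum of squared X deviations
    b = (sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
         / sum((xi - mx) ** 2 for xi in x))
    a = my - b * mx          # the line passes through the means
    return a, b
```

For data lying exactly on a line, such as x = [1, 2, 3, 4] and y = [3, 5, 7, 9], the fit recovers a = 1 and b = 2.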
Regression coefficients
Just as we saw the regression of Y on X, we may equally think of X as the dependent variable and Y as the independent one.
If all the points on the scatter diagram fall on the regression line, the correlation between the two variables involved is perfect.
This is as much true of the regression line of Y on X as of the line of X on Y.
This means that if the correlation is perfect, the two regression lines coincide as a single line, since one and only one straight line can pass through the same set of points.
If, however, the two lines diverge and intersect each other, the correlation is not perfect.
1. The coefficient of correlation is the geometric mean of the two regression coefficients.
2. As the coefficient of correlation can not exceed 1, in case one of the regression
coefficients is greater than 1, then the other must be less than 1.
3. Both regression coefficients will have the same sign, either positive or negative. If one regression coefficient is positive, then the other will also be positive.
4. The coefficient of correlation and the regression coefficient will have the same sign. If the
regression coefficients are positive, then the correlation will also be positive and vice
versa.
5. The arithmetic mean of the two regression coefficients will always be greater than or equal to the correlation coefficient (in absolute value), since the correlation coefficient is their geometric mean.
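Properties 1 and 5 can be checked numerically. A sketch that computes both regression coefficients and r (function name illustrative):

```python
def regression_coefficients(x, y):
    """Return (byx, bxy, r): the regression coefficient of Y on X,
    the regression coefficient of X on Y, and the correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    byx, bxy = sxy / sxx, sxy / syy
    r = sxy / (sxx * syy) ** 0.5
    return byx, bxy, r
```

For x = [1, 2, 3, 4] and y = [2, 3, 5, 6] this gives byx = 1.4 and bxy = 0.7, so |r| = √(1.4 × 0.7) ≈ 0.99 (property 1) and (1.4 + 0.7)/2 = 1.05 ≥ |r| (property 5).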
When there are two or more independent variables, the analysis concerning the cause-and-effect relationship is known as multiple regression analysis, and the equation describing such a relationship is called the multiple regression equation.
Ŷ = a + b1X1 + b2X2

where X1 and X2 are the two independent variables, Y is the dependent variable, and a, b1, and b2 are constants.
In multiple regression analysis, the regression coefficients (viz., b1, b2) become less reliable as the degree of correlation between the independent variables (viz., X1, X2) increases.
In such a situation we should use only one of the correlated independent variables to make our estimate. In fact, adding a second variable, say X2, that is correlated with the first variable, say X1, distorts the values of the regression coefficients.
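The constants a, b1, and b2 of the multiple regression equation can be found by solving the two normal equations in deviation form. A pure-Python sketch for the two-predictor case (function name illustrative):

```python
def fit_multiple(y, x1, x2):
    """Least-squares fit of Y-hat = a + b1*X1 + b2*X2 via the normal
    equations, solved by Cramer's rule."""
    n = len(y)
    m1, m2, my = sum(x1) / n, sum(x2) / n, sum(y) / n
    d1 = [v - m1 for v in x1]          # deviations from means
    d2 = [v - m2 for v in x2]
    dy = [v - my for v in y]
    s11 = sum(a * a for a in d1)
    s22 = sum(a * a for a in d2)
    s12 = sum(a * b for a, b in zip(d1, d2))
    s1y = sum(a * b for a, b in zip(d1, dy))
    s2y = sum(a * b for a, b in zip(d2, dy))
    det = s11 * s22 - s12 * s12        # near zero when X1, X2 are highly correlated
    b1 = (s1y * s22 - s2y * s12) / det
    b2 = (s2y * s11 - s1y * s12) / det
    a = my - b1 * m1 - b2 * m2
    return a, b1, b2
```

Note that `det` shrinks toward zero as the correlation between X1 and X2 grows, which is exactly why the coefficients become unreliable in that situation.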
When data are collected on the basis of some attribute or attributes, we have what is commonly termed statistics of attributes.
In such a situation our interest may lie in knowing whether the attributes are associated with each other or not.
Two attributes are associated if they appear together in a greater number of cases than would be expected if they were independent, and not simply because they appear together in a number of cases, as the term is used in ordinary life.
If the class frequency of AB, symbolically written as (AB), is greater than the expected frequency of A and B occurring together under independence, then we say the two attributes are positively associated; but if the class frequency of AB is less than this expectation, the two attributes are said to be negatively associated.
In case the class frequency of AB is equal to this expectation, the two attributes are considered independent, i.e., are said to have no association.
Symbolically:
If (AB) > (A)(B)/N, then A and B are positively associated.
If (AB) < (A)(B)/N, then A and B are negatively associated.
If (AB) = (A)(B)/N, then A and B are independent, i.e., have no association.
In order to find out the degree or intensity of association between two or more sets of attributes,
we should work out the coefficient of association. Yule’s coefficient of association is most
popular and is often used for this purpose.
QAB = [(AB)(ab) − (Ab)(aB)] / [(AB)(ab) + (Ab)(aB)]

where (AB), (Ab), (aB), and (ab) are the frequencies of the four classes of the 2×2 table.
If the attributes are completely associated (perfect positive association) with each other,
the coefficient will be +1.
If the attributes are completely disassociated (perfect negative association) with each
other, the coefficient will be -1.
If the attributes are completely independent of each other, the coefficient will be 0.
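Yule's Q can be sketched directly from the four class frequencies, and the three limiting cases above follow immediately:

```python
def yules_q(AB, Ab, aB, ab):
    """Yule's coefficient of association for a 2x2 table of attribute
    class frequencies (AB), (Ab), (aB), (ab)."""
    return (AB * ab - Ab * aB) / (AB * ab + Ab * aB)
```

With complete association `yules_q(10, 0, 0, 10)` gives +1; with complete disassociation `yules_q(0, 10, 10, 0)` gives -1; with independence `yules_q(5, 5, 5, 5)` gives 0.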
In order to judge the significance of association between two attributes, we make use of the chi-square test by finding the value of chi-square (χ²) and using the chi-square distribution. The value of χ² can be worked out as under:

χ² = Σ (Oij − Eij)² / Eij

where Oij = observed frequencies and Eij = expected frequencies.
It can be used as a non-parametric test to determine whether categorical data shows dependency or whether the two classifications are independent.
It can be used to make a comparison between theoretical populations and actual data when categories are used.
The chi-square test is often used to judge the significance of population variance, i.e., we can use the test to judge if a random sample has been drawn from a normal population with mean (μ) and a specified variance (σp²).
The χ² distribution is not symmetrical and all its values are positive.
To make use of this distribution, one is required to know the degrees of freedom, since for different degrees of freedom we have different curves.
The smaller the number of degrees of freedom, the more skewed the distribution is.
In brief, when we have to use chi-square as a test of population variance, we work out the value of χ² to test the null hypothesis (viz., H0: σs² = σp²) as under:

χ² = (n − 1)·S² / σp²

where S² is the sample variance and σp² is the hypothesized population variance.
Then, by comparing the calculated value with the table value of χ² for (n − 1) degrees of freedom at a given level of significance, we may either accept or reject the null hypothesis.
If the calculated value of χ² is less than the table value, the null hypothesis is accepted.
If the calculated value of χ² is equal to or greater than the table value, the null hypothesis is rejected.
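A minimal sketch of the variance test statistic; the critical value must still be looked up in a χ² table (for instance, 9.488 at the 5% level with 4 degrees of freedom):

```python
def chi_square_variance(sample, pop_var):
    """Chi-square statistic for H0: the population variance equals pop_var."""
    n = len(sample)
    mean = sum(sample) / n
    s2 = sum((x - mean) ** 2 for x in sample) / (n - 1)  # sample variance
    return (n - 1) * s2 / pop_var
```

For the sample [1, 2, 3, 4, 5] and a hypothesized variance of 2, the statistic is 4 × 2.5 / 2 = 5.0, well below 9.488, so the null hypothesis would stand at the 5% level.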
Chi-Square as a Non-Parametric Test
Chi-square is an important non-parametric test, and as such no rigid assumptions are necessary in respect of the type of population.
We require only the degrees of freedom (implicitly, of course, the size of the sample) for using this test.
As a test of goodness of fit, the χ² test enables us to see how well the assumed theoretical distribution (such as the Binomial, Poisson, or Normal distribution) fits the observed data.
When some theoretical distribution is fitted to the given data, we are always interested in knowing how well this distribution fits the observed data.
If the calculated value of χ² is less than the table value at a certain level of significance, the fit is considered a good one, which means that the divergence between the observed and expected frequencies is attributable to fluctuations of sampling.
If the calculated value of χ² is greater than its table value, the fit is not considered a good one.
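The goodness-of-fit statistic can be sketched as a one-line sum over the observed and expected frequencies (function name illustrative):

```python
def chi_square_gof(observed, expected):
    """Chi-square goodness-of-fit statistic: sum of (O - E)^2 / E
    over all categories; observed and expected totals should match."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))
```

For observed [8, 12, 10, 10] against a uniform expectation of [10, 10, 10, 10], the statistic is (4 + 4 + 0 + 0)/10 = 0.8.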
As a test of independence, the χ² test enables us to determine whether or not two attributes are associated. To do so we first calculate the expected frequencies and then work out the value of χ².
If the calculated value of χ² is less than the table value at a certain level of significance for the given degrees of freedom, we conclude that the null hypothesis stands, which means that the two attributes are independent, i.e., not associated.
If the calculated value of χ² is greater than its table value, the inference is that the null hypothesis does not hold good, which means that the two attributes are associated, and the association is not due to some chance factor but exists in reality.
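A sketch of the test of independence for a contingency table, computing each expected frequency from the row and column totals (function name illustrative):

```python
def chi_square_independence(table):
    """Chi-square statistic for independence in an r x c contingency
    table, given as a list of rows of observed frequencies."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    N = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / N   # expected frequency
            chi2 += (o - e) ** 2 / e
    return chi2
```

For a perfectly balanced table such as [[10, 10], [10, 10]] the statistic is 0, as expected under independence.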
N.B. 1) χ² is not a measure of the degree or form of the relationship between two attributes; it is simply a technique for judging the significance of such association or relationship between two attributes.
2) In order that we may apply the chi-square test, either as a test of goodness of fit or as a test to judge the significance of association between attributes, it is necessary that the observed as well as the theoretical (expected) frequencies be grouped in the same way, and that the theoretical distribution be adjusted to give the same total frequency as that of the observed distribution.
3) No group should contain very few items, say fewer than 10. In cases where the frequencies are less than 10, regrouping is done by combining the frequencies of adjoining groups so that the new frequencies become greater than 10.
4) The overall number of items must also be reasonably large. It should normally be at least 50, howsoever small the number of groups may be.
5) The constraints must be linear. Constraints which involve linear equations in the cell frequencies of a contingency table are known as linear constraints.
1. The test (as a non-parametric test) is based on frequencies and not on the parameters
like mean and standard deviation.
2. The test is used for testing the hypothesis and is not useful for estimation.
4. This test can also be applied to a complex contingency table with several classes and
as such is very useful in research work.
1. Calculate the expected frequencies on the basis of the given hypothesis or on the basis of the null hypothesis. Usually, in the case of a 2×2 or any other contingency table, the expected frequency for any given cell is worked out as under:

Eij = (total for the row of that cell × total for the column of that cell) / grand total
2. Obtain the difference between observed and expected frequencies and find the squares of such differences, i.e., calculate (Oij − Eij)².
3. Divide the quantity (Oij − Eij)² obtained as stated above by the corresponding expected frequency to get (Oij − Eij)²/Eij, and do this for all the cell frequencies or group frequencies.
4. Find the summation of the (Oij − Eij)²/Eij values, i.e., Σ (Oij − Eij)²/Eij. This is the required χ² value.
5. Compare the calculated value of χ² with its table value at the appropriate degrees of freedom, which for an r × c contingency table is (r − 1)(c − 1), and draw the inference.