
UNIT-3

PROCESSING
&
ANALYSIS OF DATA
RESEARCH METHODOLOGY
(F010401TB)

By
SHAILENDRA KUMAR

MAHARANA PRATAP GROUP OF INSTITUTIONS
KANPUR
UNIT – 3
Processing & Analysis of Data

1. Data Analysis: Editing, Coding, Tabular Representation of Data:


Data Analysis is a process of inspecting, cleaning, transforming, and modelling data
with the goal of discovering useful information, informing conclusions, and supporting
decision-making. Data analysis has multiple facets and approaches, encompassing
diverse techniques under a variety of names, while being used in different business,
science, and social science domains. In today’s business environment, data analysis plays a role in
making decisions more scientific and helping organisations operate more effectively.
Data processing is done in three distinct stages:
1. Editing
2. Coding
3. Tabulation

1. Editing:
Editing is the process of detecting and correcting errors and omissions in the
questionnaire. The purpose of editing is to maintain consistency between and among
responses, to ensure completeness of responses (and so reduce the effect of item
non-response), to make better use of questions answered out of order, and to facilitate
the coding process. Editing of data may be accomplished in two ways: (i) field editing
and (ii) in-house editing, also called central editing.
Essentials of editing:
1. Completeness: For effective editing there should be no omissions, so it is essential
that every question has been asked and the corresponding response recorded. If any
data are missing, the researcher can either deduce the missing value from other data in
the questionnaire or fill it in from recall.
2. Accuracy: The accuracy of responses can be assessed with the help of check
questions included in the questionnaire, especially for important data. The researcher
can also infer a response from other related questions in the questionnaire, or at times
contact the respondent again to obtain the correct response.
3. Consistency: Editing looks for inconsistencies between responses so that they can be
corrected. If the level of inconsistency is too high, it indicates carelessness in
administering the instrument or ambiguity in the instrument itself.
4. Regularity: Regularity refers to uniformity in asking the questions and recording the
answers; it ensures that no bias enters the data collected. Editing checks that such
irregularities are kept to a minimum, while recognising that in certain situations
respondents may not be able to give factually correct answers.
Field Editing: Field editing is done by the enumerator. The schedule filled up by the
enumerator or the respondent might have some abbreviated writings, illegible writings
and the like. These are rectified by the enumerator. This should be done soon after the
enumeration or interview before the loss of memory. The field editing should not
extend to giving some guess data to fill up omissions.
Central editing: It should be carried out when all the forms or schedules have been
completed and returned to the headquarters. This type of editing requires that all the
forms are thoroughly edited by a single person (editor) in a small field study, or by a
small group of persons in the case of a large field study. The editor may correct the
obvious errors, such as an entry in the wrong place, or an entry recorded in daily terms
when it should have been recorded in weeks or months. Sometimes, inappropriate or
missing replies can also be determined by the editor by reviewing the other information
recorded in the schedule. If necessary, the respondent may be contacted for clarification.
All obviously incorrect replies must be deleted from the schedules.

2. Coding of data:
Coding refers to the process of assigning numerals or other symbols to answers so that
responses can be put into a limited number of categories or classes. Such classes should
be appropriate to the research problem under consideration. They must also possess
the characteristic of exhaustiveness (i.e., there must be a class for every data item) and
also that of mutual exclusivity, which means that a specific answer can be placed in one
and only one cell in a given category set. Another rule to be observed is that of
unidimensionality by which is meant that every class is defined in terms of only one
concept.
Coding is necessary for efficient analysis and through it the several replies may be
reduced to a small number of classes which contain the critical information required for
analysis. Coding decisions should usually be taken at the designing stage of the
questionnaire. This makes it possible to precode the questionnaire choices and which in
turn is helpful for computer tabulation as one can straight forward key punch from the
original questionnaires. Coding is done using a code book.
I. Code Book: A code book contains each variable in the study and specifies the
application of coding rules to the variables. The code book guides the researcher in
assigning numerical codes to different response categories.
Coding for an open-ended question is more tedious than the closed ended question. For
a closed ended or structured question, the coding scheme is very simple and designed
prior to the field work. For example, consider the following question.
What is your sex?
Male Female
We may assign a code of `0' to male and `1' to female respondent. These codes may be
specified prior to the field work and if the codes are written on all questions of a
questionnaire, it is said to be wholly precoded. The same approach could also be used
for coding numeric data that either are not to be coded into categories or have had their
relevant categories specified.
For example, what is your monthly income? Here the respondent would indicate his
monthly income, which may be entered in the relevant column. The same question may
also be asked like this: What is your monthly income? Less than Rs. 5000; Rs. 5000 – 8999;
Rs. 9000 – 12999; Rs. 13000 or above. We may code the class ‘less than Rs. 5000’ as ‘1’,
‘Rs. 5000 – 8999’ as ‘2’, ‘Rs. 9000 – 12999’ as ‘3’ and ‘Rs. 13000 or above’ as ‘4’.
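As an illustration, the coding scheme above can be written down as a small code book and applied to raw responses programmatically. The following is only a sketch; the variable names and sample responses are hypothetical.

```python
# A minimal code book: each variable maps response categories to numeric codes.
# Variable names and the sample responses below are hypothetical.
code_book = {
    "sex": {"Male": 0, "Female": 1},
    "monthly_income": {
        "Less than Rs. 5000": 1,
        "Rs. 5000 - 8999": 2,
        "Rs. 9000 - 12999": 3,
        "Rs. 13000 or above": 4,
    },
}

raw_responses = [
    {"sex": "Female", "monthly_income": "Rs. 5000 - 8999"},
    {"sex": "Male", "monthly_income": "Rs. 13000 or above"},
]

# Replace each verbal answer with its numeric code so the data can be tabulated.
coded = [
    {variable: code_book[variable][answer] for variable, answer in response.items()}
    for response in raw_responses
]
print(coded)  # [{'sex': 1, 'monthly_income': 2}, {'sex': 0, 'monthly_income': 4}]
```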
II. Classification: Coding is closely followed by classification. Classification refers to
the grouping of related facts into classes, wherein each class is homogenous with
respect to some common characteristic. Classification can be done on three bases:
 Geographical basis
 Chronological basis
 Qualitative basis
1. Class limit: It refers to the lowest and the highest limit of the values that can be
included in a class. E.g. In class 100-200, the lower limit is 100, below which there can
be no item in the class and the upper limit is 200, above which there can be no item in
the class.
2. Types of class: The class can be created in two ways: Exclusive class method and
Inclusive class method.
(i) Exclusive class method: In this method the data are classified in such a manner that
the upper limit of one class interval is repeated as the lower limit of the next class
interval. Each class includes the value of its lower limit but excludes the value of its
upper limit. For example: 0 – 10, 10 – 20, 20 – 30, and so on.
Considering the Exclusive method of distribution we can draw the table as shown below

Marks      No. of students (frequency)
0 – 10     3
10 – 20    3
20 – 30    5
30 – 40    3
40 – 50    1

(ii) Inclusive class method: Under this method of classification of data, the classes are
formed in such a manner that the upper limit of a class interval does not repeat itself as
the lower limit of the next class interval. In such a series, both the upper limit and the
lower limit are included in the particular class interval, for example, 1–5, 6–10, 11–15
and so on. The interval 1–5 includes both the limits i.e. 1 and 5.
Considering the Inclusive method of distribution we can draw the table as shown below:
-
Marks      No. of students (frequency)
0 – 10     4
11 – 20    3
21 – 30    5
31 – 40    3
3. Class interval: A class interval represents the difference between the upper class
limit and the lower class limit. In other words, a class interval represents the width of
each class in a frequency distribution.
Class interval = upper class limit − lower class limit
Midpoint = (lower class limit + upper class limit) / 2
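As a brief illustration, the exclusive-method table above can be reproduced programmatically, along with the class interval and midpoint of each class. This is a sketch only; the marks below are hypothetical values chosen to match the frequencies shown.

```python
# Hypothetical marks chosen so the frequencies match the exclusive-method table above.
marks = [2, 5, 9, 11, 14, 18, 21, 23, 25, 27, 29, 31, 35, 39, 44]

classes = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50)]

for lower, upper in classes:
    # Exclusive method: the lower limit is included, the upper limit is not.
    frequency = sum(lower <= mark < upper for mark in marks)
    interval = upper - lower            # class interval = upper limit - lower limit
    midpoint = (lower + upper) / 2      # midpoint = (lower limit + upper limit) / 2
    print(f"{lower} - {upper}: frequency = {frequency}, "
          f"interval = {interval}, midpoint = {midpoint}")
```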

3. Tabulation:
Tabulation is a systematic & logical presentation of numeric data in rows and columns
to facilitate comparison and statistical analysis. It facilitates comparison by bringing
related information close to each other and helps in further statistical analysis and
interpretation. In other words, the method of placing organised data into a tabular form
is called as tabulation. It may be complex, double or simple depending upon the nature
of categorisation.
Objectives of Tabulation:
(1) To Simplify the Complex Data- It reduces the bulk of information, i.e. raw data, to a
simplified and meaningful form so that it can be easily understood by a common reader in less time.
(2) To Bring Out Essential Features of the Data- It brings out the chief/main
characteristics of data. It presents facts clearly and precisely without textual
explanation.
(3) To Facilitate Comparison- Presentation of data in row & column is helpful in
simultaneous detailed comparison on the basis of several parameters.
(4) To Facilitate Statistical Analysis- Tables serve as the best source of organised
data for further statistical analysis. The task of computing average, dispersion,
correlation, etc. becomes easier if data is presented in the form of a table.
(5) Saving of Space- A table presents facts in a better way than the textual form. It
saves space without sacrificing the quality and quantity of data.
Types of Tabulation
1. Simple Tabulation or One-way Tabulation: When the data in the table are
tabulated according to one characteristic, it is termed a simple tabulation or one-way
tabulation. For example, Data tabulation of all the people of the World is classified
according to one single characteristic like religion.
2. Double Tabulation or Two-way Tabulation: When the data in the table are
tabulated considering two different characteristics at a time, then it is defined as a
double tabulation or two-way tabulation. For example, Data tabulation of all the people
of the World is classified by two different characteristics like religion and sex.
3. Complex Tabulation: When the data in the table are tabulated according to many
characteristics, it is referred to as a complex tabulation. For example, Data tabulation
of all the people of the World is classified by three or more characteristics like religion,
sex, and literacy, etc.

2. Hypothesis:
Hypothesis is usually considered as an important mechanism in Research. Hypothesis is
a tentative assumption made in order to test its logical or empirical consequences. If we
go by the origin of the word, it is derived from the Greek word- ‘hypotithenai’ meaning
‘to put under’ or ‘to suppose’.
Etymologically hypothesis is made up of two words, “hypo” and “thesis” which means
less than or less certain than a thesis. It is a presumptive statement of a proposition or a
reasonable guess, based upon the available evidence, which the researcher seeks to
prove through his study.
According to Lundberg-“A hypothesis is a tentative generalisation, the validity of
which remains to be tested. In its most elementary stage, the hypothesis may be any
hunch, guess, imaginative idea, which becomes the basis for action or investigation”.

“A hypothesis can be defined as a tentative explanation of the research problem, a
possible outcome of the research, or an educated guess about the research outcome.”
Goode and Hatt have defined it as “a proposition which can be put to test to determine
its validity”.

Nature of Hypothesis: The hypothesis is a clear statement of what is intended to be
investigated. It should be specified before research is conducted and openly stated in
reporting the results.
Qualities of a good hypothesis:
 Identify the research objectives.
 Identify the key abstract concepts involved in the research.
 Identify its relationship to both the problem statement and the literature review.
 A problem cannot be scientifically solved unless it is reduced to hypothesis form.
 It is a powerful tool of advancement of knowledge, consistent with existing
knowledge and conducive to further enquiry.
 It can be tested – verifiable or falsifiable.
 Hypotheses are not moral or ethical questions.
 It is neither too specific nor too general.
 It is a prediction of consequences.
 It is considered valuable even if proven false.

Types of Hypothesis:
1. Null Hypothesis: Null hypothesis is represented by Ho and states that there is no
difference between the population parameter and the sample statistic being
compared. The null hypothesis, denoted by H0, is usually the hypothesis that sample
observations result purely from chance. The null hypothesis always states that the
treatment has no effect (no change, no difference). According to the null hypothesis, the
population mean after treatment is the same as it was before treatment. The α-level
establishes a criterion, or "cut-off", for making a decision about the null hypothesis. The
alpha level also determines the risk of a Type I error.
2. The Alternative Hypothesis: The alternative hypothesis is represented by Ha and
states that there is a significant difference between the population parameter and the
sample statistic. The alternative hypothesis is a statement of what a hypothesis test is
set up to establish. Designated by H1 or Ha, it is the opposite of the null hypothesis and
is only reached if H0 is rejected. Frequently the alternative hypothesis is the actual
desired conclusion of the researcher.
4. Various concepts of hypothesis testing:

(i) Null hypothesis: Null hypothesis is a statistical hypothesis that assumes that the
observation is due to a chance factor. Null hypothesis is denoted by; H0: μ1 = μ2, which
shows that there is no difference between the two population means.
(ii) Alternative hypothesis: Contrary to the null hypothesis, the alternative hypothesis
shows that observations are the result of a real effect.

(iii) Level of significance: Refers to the degree of significance with which we accept or
reject the null hypothesis. Since 100% accuracy is not possible when accepting or
rejecting a hypothesis, we select a level of significance, usually 5%.
(iv)Type I error: When we reject the null hypothesis, although that hypothesis was
true. Type I error is denoted by alpha. In hypothesis testing, the normal curve that
shows the critical region is called the alpha region.
(v) Type II errors: When we accept the null hypothesis but it is false. Type II errors are
denoted by beta. In Hypothesis testing, the normal curve that shows the acceptance
region is called the beta region.
(vi) Power: The probability of correctly rejecting a false null hypothesis. 1 − beta is
called the power of the test.
(vii) One-tailed test: When the alternative hypothesis specifies a direction, i.e. a
greater-than or less-than value (for example H1: μ1 > μ2), the test is called a one-tailed test.
(viii) Two-tailed test: When the alternative hypothesis does not specify a direction but
simply states that the values differ (for example H1: μ1 ≠ μ2), the test is called a two-tailed test.
Importance of Hypothesis Testing:
Hypothesis testing is one of the most important concepts in statistics because it is how
you decide if something really happened, or if certain treatments have positive effects,
or if groups differ from each other, or if one variable predicts another. In short, you want
to show that your results are statistically significant and unlikely to have occurred by
chance alone. In essence, then, a hypothesis test is a test of significance.

Hypothesis testing Procedure:
The usual sequence is to: (1) state the null and alternative hypotheses; (2) choose the
level of significance (alpha) before the data are collected; (3) select the appropriate test
and compute the test statistic from the sample data; (4) compare the computed statistic
with the critical value from the relevant table (or compare the p-value with alpha); and
(5) reject, or fail to reject, the null hypothesis and interpret the result.
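A minimal sketch of this procedure for a one-sample test of a mean, assuming the population standard deviation is known so that a z statistic can be used; all numbers are hypothetical.

```python
from math import sqrt
from scipy.stats import norm

# Step 1: state the hypotheses (hypothetical values): H0: mu = 100, H1: mu != 100.
mu0 = 100
# Step 2: choose the level of significance before collecting the data.
alpha = 0.05
# Step 3: compute the test statistic from the sample summary (hypothetical numbers).
sample_mean, sigma, n = 103.2, 10.0, 40
z = (sample_mean - mu0) / (sigma / sqrt(n))
# Step 4: find the critical value for a two-tailed test (about 1.96 for alpha = 0.05).
critical = norm.ppf(1 - alpha / 2)
# Step 5: decide and interpret.
if abs(z) > critical:
    print(f"z = {z:.2f} > {critical:.2f}: reject H0")
else:
    print(f"z = {z:.2f} <= {critical:.2f}: fail to reject H0")
```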

5. Graphical Representation of Data: Appropriate Usages of Bar Chart,
Pie Charts, Histogram:
Graphic representation is another way of analyzing numerical data. A graph is a sort
of chart through which statistical data are represented in the form of lines or curves
drawn across the coordinated points plotted on its surface.
Graphs enable us to study the cause and effect relationship between two variables.
They help to measure the extent of change in one variable when another variable
changes by a certain amount.
Graphs also enable us to study both time series and frequency distributions, as they
give a clear account and precise picture of the problem. Graphs are also easy to
understand and eye-catching.

6. Bar chart:
A bar chart or bar graph is a chart or graph that presents categorical data with
rectangular bars with heights or lengths proportional to the values that they represent.
The bars can be plotted vertically or horizontally. A vertical bar chart is sometimes
called a column chart.

A bar graph shows comparisons among discrete categories. One axis of the chart shows
the specific categories being compared, and the other axis represents a measured value.
Some bar graphs present bars clustered in groups of more than one, showing the values
of more than one measured variable.
Fig. – A vertical bar graph: number of students who went to different states for study
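Since the original figure is not reproduced here, the following is a sketch of how such a vertical bar chart could be drawn with matplotlib; the states and counts are hypothetical.

```python
import matplotlib.pyplot as plt

# Hypothetical data: number of students who went to different states for study.
states = ["Uttar Pradesh", "Delhi", "Maharashtra", "Karnataka", "Tamil Nadu"]
students = [45, 30, 25, 20, 10]

plt.bar(states, students)   # one vertical bar per category
plt.xlabel("State")
plt.ylabel("Number of students")
plt.title("Number of students who went to different states for study")
plt.tight_layout()
plt.show()
```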
7. Pie chart:
A pie chart (or a circle chart) is a circular statistical graphic, which is divided into slices
to illustrate numerical proportion. In a pie chart, the arc length of each slice (and
consequently its central angle and area), is proportional to the quantity it represents.
While it is named for its resemblance to a pie which has been sliced, there are variations
on the way it can be presented. The earliest known pie chart is generally credited to
William Playfair’s Statistical Breviary of 1801.
Pie charts are very widely used in the business world and the mass media. However,
they have been criticized, and many experts recommend avoiding them, pointing out
that research has shown it is difficult to compare different sections of a given pie chart,
or to compare data across different pie charts. Pie charts can be replaced in most cases
by other plots such as the bar chart, box plot or dot plots.

Fig. – Pie chart of populations of English native speakers
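A sketch of a comparable pie chart in matplotlib; the shares below are rough illustrative figures, not the values from the original figure.

```python
import matplotlib.pyplot as plt

# Illustrative (approximate) shares of English native speakers by country.
countries = ["USA", "UK", "Canada", "Australia", "Other"]
shares = [64, 17, 5, 5, 9]   # hypothetical percentages summing to 100

# Each slice's angle and area are proportional to the quantity it represents.
plt.pie(shares, labels=countries, autopct="%1.1f%%", startangle=90)
plt.title("Populations of English native speakers (illustrative)")
plt.axis("equal")            # keep the pie circular
plt.show()
```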

8. Histogram:
A histogram is a non-cumulative frequency graph. It is drawn on a natural scale in which
the frequencies of the different classes of values are represented by vertical rectangles
drawn close to each other. The mode, a measure of central tendency, can be easily
determined with the help of this graph.
How to draw a Histogram?
Step—1 Represent the class intervals of the variables along the X axis and their frequencies
along the Y-axis on natural scale.
Step—2 Start the X axis with the lower limit of the lowest class interval. When the lower limit
happens to be a distant score from the origin, give a break in the X-axis to indicate that the
vertical axis has been moved in for convenience.
Step—3 Now draw rectangular bars parallel to the Y axis above each of the class intervals,
with the class units as base. The areas of the rectangles must be proportional to the frequencies
of the corresponding classes.
Uncle Bruno owns a garden with 30 black cherry trees. Each tree is of a different height.
The height of the trees (in inches): 61, 63, 64, 66, 68, 69, 71, 71.5, 72, 72.5, 73, 73.5, 74,
74.5, 76, 76.2, 76.5, 77, 77.5, 78, 78.5, 79, 79.2, 80, 81, 82, 83, 84, 85, 87. We can group
the data as follows in a frequency distribution table by setting a range:

Height Range (in inches)    Number of Trees (Frequency)
61 – 65                     3
66 – 70                     3
71 – 75                     8
76 – 80                     10
81 – 85                     5
86 – 90                     1
This data can be now shown using a histogram. We need to make sure that while
plotting a histogram, there shouldn’t be any gaps between the bars.
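The tree-height data above can be plotted directly as a histogram. A sketch with matplotlib follows; the bin edges are placed at .5 so that each inclusive class (61–65, 66–70, and so on) falls entirely inside one bar.

```python
import matplotlib.pyplot as plt

# Heights (in inches) of the 30 black cherry trees from the example above.
heights = [61, 63, 64, 66, 68, 69, 71, 71.5, 72, 72.5, 73, 73.5, 74,
           74.5, 76, 76.2, 76.5, 77, 77.5, 78, 78.5, 79, 79.2, 80,
           81, 82, 83, 84, 85, 87]

# Bin edges at .5 reproduce the frequencies 3, 3, 8, 10, 5, 1 from the table.
bins = [60.5, 65.5, 70.5, 75.5, 80.5, 85.5, 90.5]
plt.hist(heights, bins=bins, edgecolor="black")   # adjacent bars, no gaps
plt.xlabel("Height (inches)")
plt.ylabel("Number of trees")
plt.title("Heights of black cherry trees")
plt.show()
```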
10. T-Test (Mean, Proportion):

The t test is one type of inferential statistics. It is used to determine whether there is a
significant difference between the means of two groups. With all inferential statistics,
we assume the dependent variable fits a normal distribution. When we assume a normal
distribution exists, we can identify the probability of a particular outcome.
We specify the level of probability (alpha level, level of significance, p) we are willing to
accept before we collect data (p < .05 is a common value that is used). After we collect
data we calculate a test statistic with a formula. We compare our test statistic with a
critical value found on a table to see if our results fall within the acceptable level of
probability.

When the difference between two population averages is being investigated, a t test is
used. In other words, a t test is used when we wish to compare two means (the scores
must be measured on an interval or ratio measurement scale).
We would use a t test if we wished to compare the reading achievement of boys and
girls. With a t test, we have one independent variable and one dependent variable. The
independent variable (gender in this case) can only have two levels (male and female).
The dependent variable would be reading achievement. If the independent variable had more
than two levels, then we would use a one-way analysis of variance (ANOVA).

Assumptions underlying the t test:


 The samples have been randomly drawn from their respective populations
 The scores in the population are normally distributed
 The scores in the populations have the same variance (s1 = s2). Note: We use a
different calculation for the standard error if they do not.
Three types of t tests:

1. Pair-difference t test (a.k.a. t-test for dependent groups, correlated t test) df= n
(number of pairs) -1

This is concerned with the difference between the average scores of a single sample of
individuals who are assessed at two different times (such as before treatment and after
treatment). It can also compare average scores of samples of individuals who are paired
in some way (such as siblings, mothers, daughters, persons who are matched in terms of
a particular characteristics).

2. t test for Independent Samples (with two options):

This is concerned with the difference between the averages of two populations.
Basically, the procedure compares the averages of two samples that were selected
independently of each other, and asks whether those sample averages differ enough to
believe that the populations from which they were selected also have different averages.
An example would be comparing math achievement scores of an experimental group
with a control group.
 Equal Variance (Pooled-variance t test): df = n (total of both groups) − 2. Note: Used
when both samples have the same number of subjects or when s1 = s2 (Levene or F-
max tests have p > .05).
 Unequal Variance (Separate-variance t test): df depends on a formula, but a rough
estimate is one less than the smaller group. Note: Used when the samples have
different numbers of subjects and they have different variances, s1 ≠ s2 (Levene
or F-max tests have p < .05).
How do I decide which type of t test to use? The choice depends on whether the two sets of
scores come from the same (or matched) individuals, which calls for the pair-difference t test,
or from independently selected groups, which calls for the independent-samples t test, with the
equal- or unequal-variance form chosen according to whether the group variances are equal.
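A short sketch of both situations using scipy.stats, with hypothetical scores: an independent-samples t test for two separate groups and a pair-difference t test for before/after scores from the same individuals.

```python
from scipy import stats

# Hypothetical reading-achievement scores for two independent groups.
boys = [72, 75, 68, 80, 77, 74, 69, 71]
girls = [78, 82, 75, 85, 80, 79, 77, 81]

# Independent-samples t test; equal_var=False would give the
# separate-variance (Welch) version when the variances differ.
t_ind, p_ind = stats.ttest_ind(boys, girls, equal_var=True)

# Hypothetical before/after treatment scores for the same six individuals.
before = [10, 12, 9, 14, 11, 13]
after = [12, 14, 10, 17, 13, 15]
t_pair, p_pair = stats.ttest_rel(before, after)   # pair-difference (dependent) t test

print(f"independent samples: t = {t_ind:.2f}, p = {p_ind:.3f}")
print(f"paired samples:      t = {t_pair:.2f}, p = {p_pair:.3f}")
```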
11. F- Test, Z – Test:

F- TEST:
An F-test is any statistical test in which the test statistic has an F-distribution under the
null hypothesis. It is most often used when comparing statistical models that have been
fitted to a data set, in order to identify the model that best fits the population from
which the data were sampled. Exact “F-tests” mainly arise when the models have been
fitted to the data using least squares. The name was coined by George W. Snedecor, in
honour of Sir Ronald A. Fisher. Fisher initially developed the statistic as the variance
ratio in the 1920s.
Assumptions of F- Test
Several assumptions are made for the test. Your population must be approximately
normally distributed (i.e. fit the shape of a bell curve) in order to use the test. Plus, the
samples must be independent. In addition, you’ll want to bear in mind a few
important points:
 The larger variance should always go in the numerator (the top number) to force the
test into a right-tailed test. Right-tailed tests are easier to calculate.
 For two-tailed tests, divide alpha by 2 before finding the right critical value.
 If you are given standard deviations, they must be squared to get the variances.
 If your degrees of freedom aren’t listed in the F Table, use the larger critical value.
This helps to avoid the possibility of Type I errors.

Common examples: Common examples of the use of F-tests include the study of the
following cases:
 The hypothesis that the means of a given set of normally distributed populations, all
having the same standard deviation, are equal. This is perhaps the best-known F-
test, and plays an important role in the analysis of variance (ANOVA).
 The hypothesis that a proposed regression model fits the data well. See Lack-of-fit
sum of squares.
 The hypothesis that a data set in a regression analysis follows the simpler of two
proposed linear models that are nested within each other.
F Test to compare two variances by hand: Steps
 Warning: F tests can get really tedious to calculate by hand, especially if you
have to calculate the variances. You’re much better off using technology, such as
Excel or a statistical package.
 These are the general steps to follow; worked examples appear within the steps.
Step 1: If you are given standard deviations, go to Step 2. If you are given variances to
compare, go to Step 3.
Step 2: Square both standard deviations to get the variances. For example, if σ1 = 9.6
and σ2 = 10.9, then the variances (s1 and s2) would be 9.6² = 92.16 and 10.9² = 118.81.
Step 3: Take the largest variance, and divide it by the smallest variance to get the f-
value. For example, if your two variances were s1 = 2.5 and s2 = 9.4, divide 9.4 / 2.5 =
3.76. Why? Placing the largest variance on top will force the F-test into a right tailed
test, which is much easier to calculate than a left-tailed test.
Step 4: Find your degrees of freedom. Degrees of freedom is your sample size minus 1.
As you have two samples (variance 1 and variance 2), you’ll have two degrees of
freedom: one for the numerator and one for the denominator.
Step 5: Look up the f-value you calculated in Step 3 in the f-table. Note that there are
several tables, so you’ll need to locate the right table for your alpha level.
Step 6: Compare your calculated value (Step 3) with the table f-value in Step 5. If the f-
table value is smaller than the calculated value, you can reject the null hypothesis.
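Following the steps above, a small sketch that compares two sample variances and looks up the critical value with scipy instead of an F table; the samples are hypothetical.

```python
from statistics import variance
from scipy.stats import f as f_dist

# Hypothetical samples whose variances are to be compared.
sample1 = [22.1, 19.8, 25.3, 21.0, 23.4, 20.7, 24.2, 22.8]
sample2 = [18.5, 18.9, 19.2, 18.7, 19.5, 18.4, 19.1]

var1, var2 = variance(sample1), variance(sample2)   # sample variances (n - 1)

# Step 3: put the larger variance in the numerator to force a right-tailed test.
if var1 >= var2:
    f_value, df_num, df_den = var1 / var2, len(sample1) - 1, len(sample2) - 1
else:
    f_value, df_num, df_den = var2 / var1, len(sample2) - 1, len(sample1) - 1

# Steps 5-6: for a two-tailed test, compare with the critical value at alpha / 2.
alpha = 0.05
critical = f_dist.ppf(1 - alpha / 2, df_num, df_den)
print(f"F = {f_value:.2f}, critical value = {critical:.2f}")
print("Reject H0: the variances differ." if f_value > critical else "Fail to reject H0.")
```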
Z-TEST:
A Z-test is any statistical test for which the distribution of the test statistic under the
null hypothesis can be approximated by a normal distribution. Because of the central
limit theorem, many test statistics are approximately normally distributed for large
samples. For each significance level, the Z-test has a single critical value (for example,
1.96 for 5% two tailed) which makes it more convenient than the Student’s t-test which
has separate critical values for each sample size.
Therefore, many statistical tests can be conveniently performed as approximate Z-tests
if the sample size is large or the population variance is known. If the population
variance is unknown (and therefore has to be estimated from the sample itself) and the
sample size is not large (n < 30), the Student’s t-test may be more appropriate.
A one-sample location test, two-sample location test, paired difference test and
maximum likelihood estimate are examples of tests that can be conducted as z-tests. Z-
tests are closely related to t-tests, but t-tests are best performed when an experiment
has a small sample size. Also, t-tests assume the standard deviation is unknown, while z-
tests assume it is known. If the standard deviation of the population is unknown, the
assumption of the sample variance equalling the population variance is made.

One-Sample Z-Test Example:


For example, assume an investor wishes to test whether the average daily return of a stock is
greater than 1%. A simple random sample of 50 returns is calculated and has an average of
2%. Assume the standard deviation of the returns is 2.50%. The null hypothesis is that the
average, or mean, return is equal to 1%; the alternative hypothesis is that the mean return
differs from 1%. Assume an alpha of 0.05 is selected with a two-tailed test. Consequently,
2.5% of the samples lie in each tail, and the critical values are 1.96 and −1.96. If the value of
z is greater than 1.96 or less than −1.96, the null hypothesis is rejected.
The value for z is calculated by subtracting the value of the average daily return selected for
the test, or 1% in this case, from the observed average of the samples. Next, divide the
resulting value by the standard deviation divided by the square root of the number of
observed values. Therefore, the test statistic is calculated to be 2.83, or (0.02 – 0.01) / (0.025 /
(50)^(1/2)). The investor rejects the null hypothesis since z is greater than 1.96, and
concludes that the average daily return is greater than 1%.
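The same calculation written out as a short Python sketch, using the figures from the example above.

```python
from math import sqrt

mu0 = 0.01           # hypothesised mean daily return (1%)
sample_mean = 0.02   # observed average return of the 50 sampled days (2%)
sigma = 0.025        # assumed (known) standard deviation of the returns (2.5%)
n = 50

z = (sample_mean - mu0) / (sigma / sqrt(n))
print(f"z = {z:.2f}")                      # about 2.83

# Two-tailed test at alpha = 0.05: critical values are +/-1.96.
if abs(z) > 1.96:
    print("Reject the null hypothesis: the mean return differs from 1%.")
else:
    print("Fail to reject the null hypothesis.")
```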

Cross Tabulation, Chi-Squared Test:


Cross tabulation is a widely used statistical tool that helps you take informed decisions
with regard to your research by identifying patterns, trends and correlations between
parameters within your study. When conducting a study, the raw data can be daunting
and point to several possible outcomes; in such situations a cross-tab helps you zero in
on a single theory by drawing out trends, comparisons and correlations between factors
within your study.
Cross tabulation, also known as a cross-tab or contingency table, is a statistical tool used
for categorical data. Categorical data involves values that are mutually exclusive of each
other. Data are always collected as numbers, but numbers have no value unless they
mean something: 4, 7 and 9 are simply numerals until specified as, for example, 4
apples, 7 bananas, and 9 kiwis.
Cross tabulation is usually used to examine relationships within the data that are not
readily evident. It is quite useful in market research studies and in surveys. A cross-tab
report shows the connection between two or more questions asked in the survey.
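A sketch of a cross-tab built with pandas; the survey responses below are hypothetical.

```python
import pandas as pd

# Hypothetical answers to two categorical survey questions.
data = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Male", "Female", "Male", "Female", "Male"],
    "preferred_product": ["A", "B", "B", "A", "A", "B", "B", "A"],
})

# Two-way cross tabulation (contingency table) of gender against product preference.
table = pd.crosstab(data["gender"], data["preferred_product"], margins=True)
print(table)
```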
Chi-Squared Test
A chi-squared test, also written as χ2 test, is any statistical hypothesis test where the
sampling distribution of the test statistic is a chi-squared distribution when the null
hypothesis is true. Without other qualification, ‘chi-squared test’ often is used as short
for Pearson’s chi-squared test. The chi-squared test is used to determine whether there
is a significant difference between the expected frequencies and the observed
frequencies in one or more categories.
In the standard applications of the test, the observations are classified into mutually
exclusive classes, and there is some theory, or say null hypothesis, which gives the
probability that any observation falls into the corresponding class. The purpose of the
test is to evaluate how likely the observations that are made would be, assuming the
null hypothesis is true.
Chi-squared tests are often constructed from a sum of squared errors, or through the
sample variance. Test statistics that follow a chi-squared distribution arise from an
assumption of independent normally distributed data, which is valid in many cases due
to the central limit theorem. A chi-squared test can be used to attempt rejection of the
null hypothesis that the data are independent.
How to Calculate a Chi-square Statistic?
The formula for calculating a chi-square statistic is:
χ² = Σ (O − E)² / E
Where,
O stands for the observed frequency,
E stands for the expected frequency.
The expected count is subtracted from the observed count to find the difference between
the two. The square of the difference is then calculated to get rid of negative values (as
the squares of 2 and −2 are, of course, both 4). The squared difference is then divided by
the expected count to normalise bigger and smaller values (because we don’t want to get
bigger chi-square values just because we are working on large data sets). The sigma sign
in front denotes that we have to sum up these values calculated for each cell.
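A sketch of the calculation with scipy, using a hypothetical contingency table; chi2_contingency works out the expected frequencies under independence and sums (O − E)²/E over every cell.

```python
from scipy.stats import chi2_contingency

# Hypothetical observed counts: rows are gender, columns are preferred product.
observed = [[20, 30, 25],
            [35, 15, 25]]

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"chi-square = {chi2:.2f}, degrees of freedom = {dof}, p-value = {p_value:.4f}")
# A small p-value (e.g. below 0.05) suggests the two variables are not independent.
```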
