
Unit IV: Analysis

• Coding, Editing and Tabulation of Data.


• Application through software:
• Methods of Descriptive Analysis: Concepts of Mean, Median, Mode, Standard Deviation, Variance; Various Kinds of Charts and Diagrams Used in Data Analysis
• Methods of Inferential Statistics: T-test, ANOVA, Correlation and
Regression

Dr. Indira Sharma


Processing (Organizing) of Data
• The data collected in research are processed and analysed to draw conclusions or to verify the hypotheses made.
• Processing of data is important because it makes further analysis of the data easier and more efficient. Processing of data technically means:
• Editing of the data
• Coding of data
• Classification of data
• Tabulation of data.



Editing of Data
• Data editing is the process of examining the collected data to detect errors and omissions and correcting them, as far as possible, before proceeding further.

• Editing is of two types:


• Field Editing
• Central Editing.



Editing of Data
• FIELD EDITING: This type of editing deals with abbreviated or illegible entries in the gathered data. Such editing is most effective when done on the same day as the interview or the very next day. The investigator must not jump to conclusions while doing field editing.
• CENTRAL EDITING: This type of editing is carried out after the entire data collection process has been completed. Here a single or common editor corrects errors such as an entry in the wrong place or an entry in the wrong unit. As a rule, all clearly wrong answers should be dropped from the final results.



Editing of Data - Considerations
• The editor must be familiar with the interviewer's mindset, the objectives and everything else related to the study.
• Editors should use a distinctive colour when making entries in the collected data.
• The editor's name and the date of editing should be recorded on the data sheet.



Coding of Data
• In coding, a particular numeral or symbol is assigned to each answer in order to put the responses into definite categories or classes.
• The classes of responses determined by the researcher should be appropriate and suitable to the study.
• Coding enables efficient and effective analysis, as the responses are grouped into meaningful classes.
• Coding decisions are usually taken while developing or designing the questionnaire or any other data collection tool.
• Coding can be done manually or through a computer.
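As an illustration of coding through software, here is a minimal sketch, assuming the pandas library is available; the column name and the response categories are made up for the example:

```python
import pandas as pd

# Hypothetical questionnaire responses to be coded
df = pd.DataFrame({"satisfaction": ["High", "Medium", "Low", "High", "Medium"]})

# Codebook: assign a numeral to each response category
codebook = {"Low": 1, "Medium": 2, "High": 3}
df["satisfaction_code"] = df["satisfaction"].map(codebook)

print(df)
```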
Classification of Data
• Classification is the process of arranging data into sequences and groups according to their common characteristics, or separating them into different but related parts. (Secrist)
• Classification of data implies that the collected raw data are categorized into groups having common features.
• Data having common characteristics are placed in a common group.
• The entire data set is categorized into various groups or classes that convey a meaning to the researcher.
Functions of Classification
• It condenses data
• It facilitates comparisons
• It helps to study the relationships
• It facilitates statistical treatment of data.
Rules for Classification:
• It should be unambiguous.
• It should be exhaustive and mutually exclusive.
• It should be stable.
Bases of Classification
• Geographical – Area wise or regional
• Chronological – Occurrence of time
• Qualitative – Some character or attributes
• Quantitative – Numerical values or magnitude



Tabulation of Data
• The mass of data collected has to be arranged in some kind of concise and logical
order.
• Tabulation summarizes the raw data and displays them in the form of statistical tables.
• Tabulation is an orderly arrangement of data in rows and columns.

OBJECTIVES OF TABULATION:
• Conserves space and minimizes explanatory and descriptive statements.
• Facilitates the process of comparison and summarization.
• Facilitates detection of errors and omissions.
• Establishes the basis for various statistical computations.
Tabulation of Data - Principles
• Tables should be clear, concise & adequately titled.
• Every table should be distinctly numbered for easy reference.
• Column headings & row headings of the table should be clear & brief.
• Units of measurement should be specified at appropriate places.
• Explanatory footnotes concerning the table should be placed at appropriate
places.
• Source of information of data should be clearly indicated.



Tabulation of Data - Principles
• The columns & rows should be clearly separated with dark lines
• Demarcation should also be made between data of one class and that of another.
• Comparable data should be put side by side.
• The figures in percentage should be approximated before tabulation.
• Figures, symbols, etc. should be properly aligned and adequately spaced to enhance readability.
• Abbreviations should be avoided.



Analysis of Data
• The important statistical measures that are used to analyze the research or the
survey are:
• Measures of central tendency (mean, median and mode)
• Measures of dispersion (standard deviation, range, mean deviation)
• Measures of asymmetry (skewness)
• Measures of relationship (correlation and regression)
• Association in the case of attributes
• Time series analysis
Statistical Techniques of Analysis of Data : Descriptive Statistics
• The transformation of raw data into a form that will make them easy to understand and
interpret; rearranging, ordering, and manipulating data to generate descriptive
information. Descriptive statistics summarizes the features of the data set.
• Descriptive statistics summarize and organize characteristics of a data set. A data set is
a collection of responses or observations from a sample or entire population.
• In quantitative research, after collecting data, the first step of statistical analysis is to
describe characteristics of the responses, such as the average of one variable (e.g., age),
or the relation between two variables (e.g., age and creativity).
• The next step is inferential statistics, which help you decide whether your data
confirms or disproves your hypothesis and whether it is generalizable to a larger
population.
Statistical Techniques of Analysis of Data : Descriptive Statistics



Descriptive Statistics: Types
There are 3 main types of descriptive statistics:

• The distribution concerns the frequency of each value.


• The central tendency concerns the averages of the values.
• The variability or dispersion concerns how spread out the values are.



Descriptive Statistics: Frequency Distribution
A frequency distribution shows the frequency of repeated items in a graphical
form or tabular form. It gives a visual display of the frequency of items or shows
the number of times they occurred. Frequency distribution is used to organize the
collected data in table form.
Example: Marks scored by students, temperatures of different towns, points scored
in a volleyball match, etc.
After data collection, we have to show data in a meaningful manner for better
understanding. Organize the data in such a way that all its features are summarized
in a table. This is known as frequency distribution.



Descriptive Statistics: Frequency Distribution
Marks Obtained in Class Test (Class Intervals)    No. of Students (Frequency)
0 - 5                                             3
6 - 10                                            4
11 - 15                                           5
16 - 20                                           8
Total                                             20
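A grouped frequency table like the one above can be produced through software. A minimal sketch, assuming pandas is available; the marks are illustrative values chosen to reproduce the table:

```python
import pandas as pd

# Hypothetical marks of 20 students in a class test
marks = pd.Series([2, 3, 5, 7, 8, 9, 10, 11, 12, 13,
                   14, 15, 16, 16, 17, 18, 18, 19, 20, 20])

# Group the marks into class intervals and count the frequency of each class
bins = [0, 5, 10, 15, 20]
labels = ["0-5", "6-10", "11-15", "16-20"]
freq = pd.cut(marks, bins=bins, labels=labels, include_lowest=True).value_counts().sort_index()

print(freq)        # 0-5: 3, 6-10: 4, 11-15: 5, 16-20: 8
print(freq.sum())  # total = 20
```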
Descriptive Statistics: Frequency Distribution Consideration
Figures or numbers collected for some definite purpose are called data.
Frequency is the number that shows how often a particular item occurs in the given data set.
There are two types of frequency table: grouped frequency distribution and ungrouped frequency distribution.
Data can be shown using graphs such as histograms, bar graphs, frequency polygons, and so on.



Descriptive Statistics: Frequency Distribution Graphs
Data can also be described in the form of graphs; graphs help us to understand the collected data easily. The graphical representation of a frequency distribution can be shown using the following:
Line Graphs: A line graph is a type of chart used to show information that changes over time. We plot line graphs using several points connected by straight lines.
Bar Graphs: Bar graphs represent data using rectangular bars of uniform width with equal spacing between the bars.
Histograms: A histogram is a graphical presentation of data using rectangular bars of different heights. In a histogram, there is no space between the rectangular bars.
Pie Chart: A pie chart displays data visually in a circular chart. The circle is divided into sectors, each showing a particular part of the data out of the whole.
Frequency Polygon: A frequency polygon is drawn by joining the mid-points of the tops of the bars in a histogram.



Descriptive Statistics: Frequency Distribution Line Graph



Descriptive Statistics: Frequency Distribution - Bar Graphs
A bar graph, or bar chart, is used to represent values in relation to other values. Bar charts are often used to compare data collected over long periods of time, but they work best on fairly small sets of data.
These graphs can be horizontal or vertical. In a vertical bar chart the categories run along the bottom and the numbers representing the actual data run up the side; in a horizontal bar chart the arrangement is reversed.



Descriptive Statistics: Frequency Distribution - Bar Graphs
When to use Bar Graphs:
• A bar chart is particularly useful when one or two categories 'dominate' results.
• It can be very clear and easy to read.
• Most people understand what is presented without having to have detailed statistical
knowledge.
• It can represent data expressed as actual numbers, percentages and frequencies.
• A bar chart can represent either discrete or continuous data.
• If the data is discrete there should be a gap between the bars.
• If the data is continuous there should be no gap between the bars.
Descriptive Statistics: Frequency Distribution Bar Graph

[Figure: bar graph comparing quarterly values (1st Qtr - 4th Qtr) for the East, West and North regions]
Descriptive Statistics: Frequency Distribution Bar Graph
[Figure: horizontal bar graph - "How did you find your last job?"]
Networking: 643 (55.2%)
Print ad: 213 (18.3%)
Online recruitment site: 179 (15.4%)
Placement firm: 112 (9.6%)
Temporary agency: 18 (1.5%)



Descriptive Statistics: Frequency Distribution Histogram



Descriptive Statistics: Frequency Distribution – Pie Charts
• A pie graph, also known as a pie chart, is a type of graph commonly used in
conjunction with percentages.
• A large circle is divided into sections depending on those percentages and each
section represents part of the whole.
• In a pie chart, the arc length of each separate sector is meant to be proportional to
the percentage it’s supposed to represent.
• The first pie chart was created in 1801 by William Playfair.



Descriptive Statistics: Frequency Distribution – Pie Charts
When to use Pie Charts:
• It is best used to present the proportions of a sample.
• It is most useful where one or two results dominate the findings.
• It can represent data expressed as actual numbers or percentages.
• Do not use where there are a large number of categories, or where each has a
small, fairly equal share, as this can be unclear.



Descriptive Statistics: Frequency Distribution Pie Charts



Descriptive Statistics: Frequency Distribution Frequency Polygon



Descriptive Statistics: Measures of Central Tendency
• Measures of central tendency are statistical measures that describe the central position of a distribution.
• They are also called statistics of location, and are the complement of statistics of dispersion, which provide information about the variability or spread of the observations.
• In the univariate context, the mean, median and mode are the most commonly used measures of central tendency.
• They are computable values on a distribution that describe the behaviour of the centre of the distribution.
Descriptive Statistics: Measures of Central Tendency
• The value or figure which represents the whole series is neither the lowest value in the series nor the highest; it lies somewhere between these two extremes.
• The average represents all the measurements made on a group, and gives a concise description of the group as a whole.
• When two or more groups are measured, the central tendency provides the basis of comparison between them.
• Clark has expressed it thus: "Average is an attempt to find one single figure to describe a whole group of figures."



Descriptive Statistics: Measures of Central Tendency



Descriptive Statistics: Measures of Central Tendency - Mean
• The arithmetic mean is a mathematical average and the most popular measure of central tendency. It is frequently referred to simply as the 'mean'. It is obtained by dividing the sum of the values of all observations in a series (ΣX) by the number of items (N) constituting the series.
• Thus, the mean of a set of numbers X1, X2, X3, ..., Xn, denoted by x̅, is defined as

x̅ = (X1 + X2 + ... + Xn)/N = ΣX/N



Descriptive Statistics: Measures of Central Tendency - Mean
Methods of Calculating the Arithmetic Mean:
• Direct Method: x̅ = ΣX/N (for a frequency distribution, x̅ = ΣfX/N)
• Short-Cut Method: x̅ = A + Σfd/N, where A is an assumed mean and d = X - A
• Step Deviation Method: x̅ = A + (Σfd'/N) × h, where d' = (X - A)/h and h is the common class width



Descriptive Statistics: Measures of Central Tendency - Mean
Example: Calculate the arithmetic mean of the number of users per working day in the University Library
Month       No. of Working Days    Total Users    Average Users per Working Day
Sep-2011    24                     11618          484.08
Oct-2011    21                     8857           421.76
Nov-2011    23                     11459          498.22
Dec-2011    25                     8841           353.64
Jan-2012    24                     5478           228.25
Feb-2012    23                     10811          470.04
Total       140                    57064
Descriptive Statistics: Measures of Central Tendency - Mean

x̅ = ΣX/N = Total Users / Total Working Days = 57064 / 140 = 407.6
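The same calculation can be checked through software. A minimal sketch using only the Python standard library, with the figures taken from the table above:

```python
# Working days and total users for each month (from the table above)
working_days = [24, 21, 23, 25, 24, 23]
total_users = [11618, 8857, 11459, 8841, 5478, 10811]

# Arithmetic mean number of users per working day
mean_users = sum(total_users) / sum(working_days)
print(round(mean_users, 1))  # 407.6
```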



Descriptive Statistics: Measures of Central Tendency - Mean
Advantages of Mean:

• It is based on all the values.


• It is rigidly defined.
• It is easy to understand the arithmetic average even if some of the details of the
data are lacking.
• It is not based on the position in the series.



Descriptive Statistics: Measures of Central Tendency - Mean
Disadvantages of Mean:

• It is affected by extreme values.


• It cannot be calculated for open end classes.
• It cannot be located graphically
• It may give misleading conclusions.
• It has upward bias.



Descriptive Statistics: Measures of Central Tendency - Median
• The median is the central value of a distribution, or the value which divides the distribution into two equal parts, each part containing an equal number of items. Thus it is the central value of the variable when the values are arranged in order of magnitude.
• Connor has defined it as: "The median is that value of the variable which divides the group into two equal parts, one part comprising all values greater, and the other all values less than the median."



Descriptive Statistics: Measures of Central Tendency - Median
Calculation of Median - Discrete series:
• Arrange the data in ascending or descending order.
• Calculate the cumulative frequencies.
• Apply the formula: Median = size of the ((N + 1)/2)th item, where N is the total frequency.



Descriptive Statistics: Measures of Central Tendency - Median
Calculation of Median - Continuous series:
• For calculation of the median in a continuous frequency distribution, the following formula is employed. Algebraically,

M = L + ((N/2 - c.f.) / f) × h

where L = lower limit of the median class, N = total frequency, c.f. = cumulative frequency of the class preceding the median class, f = frequency of the median class, and h = class width.



Descriptive Statistics: Measures of Central Tendency - Median
Example: Median of grouped data in a distribution of respondents by age
Age Group    Frequency (f)    Cumulative Frequency (c.f.)
0-20         15               15
20-40        32               47
40-60        54               101
60-80        30               131
80-100       19               150
Total        150
Descriptive Statistics: Measures of Central Tendency - Median

Here N = 150, so N/2 = 75. The median class is 40-60, with L = 40, c.f. = 47, f = 54 and h = 20.
Median (M) = 40 + ((75 - 47)/54) × 20
           = 40 + (28/54) × 20
           = 40 + 10.37
           = 50.37
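The same grouped-median formula can be applied through software. A minimal sketch in plain Python, with the class limits and frequencies from the table above (the helper function is our own, written for this example):

```python
def grouped_median(lower_limits, frequencies, class_width):
    """Median of a grouped distribution: M = L + ((N/2 - c.f.)/f) * h."""
    n = sum(frequencies)
    cumulative = 0
    for lower, f in zip(lower_limits, frequencies):
        if cumulative + f >= n / 2:  # first class whose cumulative frequency reaches N/2
            return lower + ((n / 2 - cumulative) / f) * class_width
        cumulative += f

lower_limits = [0, 20, 40, 60, 80]
frequencies = [15, 32, 54, 30, 19]
print(round(grouped_median(lower_limits, frequencies, 20), 2))  # 50.37
```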



Descriptive Statistics: Measures of Central Tendency - Median
Advantages of Median:

• Median can be calculated in all distributions.

• Median can be understood even by common people.

• Median can be ascertained even when the extreme items are not known.

• It can be located graphically.

• It is most useful when dealing with qualitative data.


Descriptive Statistics: Measures of Central Tendency - Median
Disadvantages of Median:

• It is not based on all the values.


• It is not capable of further mathematical treatment.
• It is affected by fluctuations of sampling.
• In the case of an even number of values, the median may not be an actual value from the data.



Descriptive Statistics: Measures of Central Tendency - Mode
• Mode is the most frequent value or score in the distribution.
• It is defined as the value of the item in a series that occurs most frequently.
• It is denoted by the capital letter Z.
• It is the highest point of the frequency curve of the distribution.
• Croxton and Cowden defined it as: "The mode of a distribution is the value at the point around which the items tend to be most heavily concentrated. It may be regarded as the most typical of a series of values."
• The exact value of the mode can be obtained by the following formula:

Z = L1 + ((f1 - f0) / (2f1 - f0 - f2)) × h
where L1 = lower limit of the modal class, f1 = frequency of the modal class, f0 = frequency of the class preceding it, f2 = frequency of the class following it, and h = class width.
Descriptive Statistics: Measures of Central Tendency - Mode
Example: Calculate the mode for the distribution of monthly rent paid by libraries in Karnataka
Monthly Rent (Rs)    Number of Libraries (f)
500-1000             5
1000-1500            10
1500-2000            8
2000-2500            16
2500-3000            14
3000 & above         12
Total                65
Descriptive Statistics: Measures of Central Tendency - Mode
The modal class is 2000-2500 (highest frequency, f1 = 16), with L1 = 2000, f0 = 8, f2 = 14 and h = 500.
Z = 2000 + ((16 - 8) / (2 × 16 - 8 - 14)) × 500
Z = 2000 + (8/10) × 500
Z = 2000 + 0.8 × 500
Z = 2400
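The same mode formula as a small Python sketch (plain standard library; the helper function is our own, written for this example):

```python
def grouped_mode(lower_limits, frequencies, class_width):
    """Mode of a grouped distribution: Z = L1 + (f1 - f0)/(2*f1 - f0 - f2) * h."""
    i = frequencies.index(max(frequencies))                      # modal class = highest frequency
    f1 = frequencies[i]
    f0 = frequencies[i - 1] if i > 0 else 0                      # frequency of the preceding class
    f2 = frequencies[i + 1] if i < len(frequencies) - 1 else 0   # frequency of the following class
    return lower_limits[i] + (f1 - f0) / (2 * f1 - f0 - f2) * class_width

lower_limits = [500, 1000, 1500, 2000, 2500, 3000]
frequencies = [5, 10, 8, 16, 14, 12]
print(grouped_mode(lower_limits, frequencies, 500))  # 2400.0
```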
Descriptive Statistics: Measures of Central Tendency - Mode
Advantages of Mode:

• Mode is readily comprehensible and easily calculated


• It is the best representative of data
• It is not at all affected by extreme values.
• The value of mode can also be determined graphically.
• It is usually an actual value of an important part of the series.



Descriptive Statistics: Measures of Central Tendency - Mode
Disadvantages of Mode:

• It is not based on all observations.


• It is not capable of further mathematical manipulation.
• Mode is affected to a great extent by sampling fluctuations.
• Choice of grouping has great influence on the value of mode.



Descriptive Statistics: Measures of Central Tendency - Conclusion
• A measure of central tendency is a measure that tells us where the middle of a
bunch of data lies.
• Mean is the most common measure of central tendency. It is simply the sum of
the numbers divided by the number of numbers in a set of data. This is also
known as average.
• Median is the number present in the middle when the numbers in a set of data
are arranged in ascending or descending order. If the number of numbers in a
data set is even, then the median is the mean of the two middle numbers.
• Mode is the value that occurs most frequently in a set of data.
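For ungrouped data, all three measures can be obtained directly through software. A minimal sketch using Python's built-in statistics module with illustrative scores:

```python
import statistics

data = [4, 8, 8, 5, 7, 8, 6, 9]  # hypothetical scores

print(statistics.mean(data))    # 6.875 -> sum of the values divided by their number
print(statistics.median(data))  # 7.5   -> average of the two middle values (7 and 8)
print(statistics.mode(data))    # 8     -> the most frequent value
```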
Descriptive Statistics: Measures of Dispersion
• The mean, median or mode by itself is usually not a sufficient measure to reveal the shape of the distribution of a data set. We also need a measure that can provide some information about the variation among the data set values.
• Two data sets with the same mean may have completely different spreads. The variation among the values of one data set may be much larger or smaller than for the other data set.
• The measures that help us to know the spread of the data set are called measures of dispersion.
• The measures of central tendency and dispersion taken together give a better picture of a data set: dispersion looks at how 'spread out' the data are.
Descriptive Statistics: Measures of Dispersion - Importance
• Two data sets may have the same mean and the same median, yet very different dispersion.



Descriptive Statistics: Measures of Dispersion - Objectives
• To determine the reliability of average.
• To compare the variability of two or more series.
• For facilitating the use of other statistical measures.
• Basis of statistical quality control



Descriptive Statistics: Measures of Dispersion



Descriptive Statistics: Measures of Dispersion - Range
• It is the largest score minus the smallest score.
• It is a quick and direct measure of variability.
• Because the range is greatly affected by extreme scores, it may give a distorted
picture of the scores.



Descriptive Statistics: Measures of Dispersion - Range



Descriptive Statistics: Measures of Dispersion - Range
The range is the difference between the highest and lowest numbers. What is the range
of …

3, 5, 8, 8, 9, 10, 12, 12, 13, 15


Mean = 9.5 range = 12 (3 to 15)

1, 5, 8, 8, 9, 10, 12, 12, 13, 17


Mean = 9.5 range = 16 (1 to 17)
Example from Cara Flanagan, Research Methods for AQA A Psychology (2005) Nelson
Thornes p 15



Descriptive Statistics: Measures of Dispersion – Inter Quartile Range
• The interquartile range (IQR) is the difference between the upper and lower quartiles of a given data set and is also called the midspread. It is a measure of statistical dispersion, equal to the difference between the upper and lower quartiles. It describes the variation obtained when a data set is divided into quartiles. If Q1 is the first quartile and Q3 is the third quartile, then the IQR formula is given by:

• IQR = Q3 – Q1



Descriptive Statistics: Measures of Dispersion – Inter Quartile Range



Descriptive Statistics: Measures of Dispersion – Inter Quartile Range
Question: Find the quartiles of the following data: 4, 6, 7, 8, 10, 23, 34.
Solution: The numbers are already arranged in ascending order and the number of items is n = 7.
Lower quartile, Q1 = [(n+1)/4]th item = [(7+1)/4]th = 2nd item = 6
Median, Q2 = [(n+1)/2]th item = [(7+1)/2]th = 4th item = 8
Upper quartile, Q3 = [3(n+1)/4]th item = [3(7+1)/4]th = 6th item = 23
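The same quartiles can be checked through software. A minimal sketch using Python's statistics module (Python 3.8+); the 'exclusive' method corresponds to the (n + 1)-based positions used above:

```python
import statistics

data = [4, 6, 7, 8, 10, 23, 34]

# Quartiles from (n + 1)-th positions; returns [Q1, Q2, Q3]
q1, q2, q3 = statistics.quantiles(data, n=4, method="exclusive")

print(q1, q2, q3)  # 6.0 8.0 23.0
print(q3 - q1)     # IQR = 17.0
```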



Descriptive Statistics: Measures of Dispersion – Variance
• Variance is the expected value of the squared deviation of a random variable from its mean value, in probability and statistics. Informally, variance estimates how far a set of (random) numbers is spread out from its mean value.
• The value of the variance (σ²) is equal to the square of the standard deviation.
• Variance is a measure of how data points differ from the mean. In layman's terms, variance is a measure of how far a set of data (numbers) is spread out from its mean (average) value.
• The larger the variance, the more scattered the data are from the mean; if the variance is low, the data are less scattered from the mean. Therefore, it is called a measure of the spread of the data about the mean.



Descriptive Statistics: Measures of Dispersion – Variance & Standard Deviation



Descriptive Statistics: Measures of Dispersion –Standard Deviation
• The standard deviation tells us the average distance of each score from the mean.
• About 68% of normally distributed data lie within 1 SD on either side of the mean.
• About 95% lie within 2 SD.
• Almost all lie within 3 SD.
• The square root of the variance is known as the standard deviation, i.e. S.D. (σ) = √variance = √σ².



Descriptive Statistics: Measures of Dispersion –Standard Deviation
Question:
Mean IQ = 100, SD = 15.
Within what range do the IQs of 68% of the population lie?
Between what IQ scores would 95% of people be?
Dan says he has done an online IQ test and has an IQ of 170. Should you believe him? Why/why not?



Descriptive Statistics: Measures of Dispersion –Standard Deviation
John scores 61% in the test. His mum says that's rubbish.
But the mean score in the class was 50%, with an SD of 5. Did he do well?
What if the SD was only 2?
What if the SD was 15?



Descriptive Statistics: Advantages and Disadvantages of Range and Standard Deviation

Range
• Advantages: quick and easy to calculate.
• Disadvantages: affected by extreme values (outliers); does not take all the values into account.

Standard deviation
• Advantages: a more precise measure of dispersion, because all values are taken into account.
• Disadvantages: much harder to calculate than the range.



Descriptive Statistics: Measures of Dispersion – Variance & Standard Deviation
Find the Variance and Standard Deviation of the Following Numbers: 1, 3, 5, 5, 6, 7, 9, 10.

The mean = (1+ 3+ 5+ 5+ 6+ 7+ 9+ 10)/8 = 46/ 8 = 5.75


Step 1: Subtract the mean value from individual value
(1 – 5.75), (3 – 5.75), (5 – 5.75), (5 – 5.75), (6 – 5.75), (7 – 5.75), (9 – 5.75), (10 – 5.75)
= -4.75, -2.75, -0.75, -0.75, 0.25, 1.25, 3.25, 4.25
Step 2: Squaring the above values we get, 22.563, 7.563, 0.563, 0.563, 0.063, 1.563, 10.563,
18.063
Step 3: 22.563 + 7.563 + 0.563 + 0.563 + 0.063 + 1.563 + 10.563 + 18.063 = 61.504
Step 4: N = 8, therefore Variance (σ²) = Σ(X − μ)²/N = 61.504/8 ≈ 7.69
Now, Standard deviation (σ) = √σ² = √7.69 ≈ 2.77
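The same result can be verified through software. A minimal sketch using Python's statistics module (population variance, i.e. dividing by N as in the steps above):

```python
import statistics

data = [1, 3, 5, 5, 6, 7, 9, 10]

variance = statistics.pvariance(data)  # population variance: divides by N
std_dev = statistics.pstdev(data)      # population standard deviation

print(round(variance, 2))  # 7.69
print(round(std_dev, 2))   # 2.77
```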
Descriptive Analysis
Type of measurement: Nominal (two categories)
Type of descriptive analysis: Frequency table; proportion (percentage)

Type of measurement: Nominal (more than two categories)
Type of descriptive analysis: Frequency table; category proportions (percentages); mode
Descriptive Analysis

Type of measurement: Ordinal
Type of descriptive analysis: Rank order; median





Descriptive Analysis

Type of measurement: Interval
Type of descriptive analysis: Arithmetic mean



Descriptive Analysis

Type of measurement: Ratio
Type of descriptive analysis: Index numbers; geometric mean; harmonic mean



Inferential Statistics:
• Inferential statistics is a branch of statistics that makes the use of various
analytical tools to draw inferences about the population data from sample data.
• Inferential statistics help to draw conclusions about the population while
descriptive statistics summarizes the features of the data set.
• There are two main types of inferential statistics - hypothesis testing and
regression analysis.



Inferential Statistics: Objectives
• Inferential statistics enables one to make descriptions of data and draw
inferences and conclusions from the respective data.
• Inferential statistics uses sample data because it is more cost-effective and less
tedious than collecting data from an entire population.
• It allows one to come to reasonable assumptions about the larger population
based on a sample’s characteristics.



Inferential Statistics: Types
• Hypothesis Testing: Z test, F test, t test, ANOVA test
• Regression Analysis: linear regression, nominal regression, logistic regression, ordinal regression



Inferential Statistics: Hypothesis Testing
• Hypothesis testing can be defined as a statistical tool that is used to identify if the
results of an experiment are meaningful or not.
• It involves setting up a null hypothesis and an alternative hypothesis. These two
hypotheses will always be mutually exclusive.
• This means that if the null hypothesis is true then the alternative hypothesis is
false and vice versa.
• An example of hypothesis testing is setting up a test to check if a new medicine
works on a disease in a more efficient manner.



Inferential Statistics: Hypothesis Testing
• Hypothesis testing is a type of inferential statistics that is used to test
assumptions and draw conclusions about the population from the available
sample data.
• It involves setting up a null hypothesis and an alternative hypothesis followed by
conducting a statistical test of significance.
• A conclusion is drawn based on the value of the test statistic, the critical value,
and the confidence intervals.
• A hypothesis test can be left-tailed, right-tailed, or two-tailed.



Inferential Statistics: Null Hypothesis
• The null hypothesis is a concise mathematical statement that is used to indicate
that there is no difference between two possibilities.
• In other words, there is no difference between certain characteristics of data. This
hypothesis assumes that the outcomes of an experiment are based on chance
alone. It is denoted as H0
• Hypothesis testing is used to conclude if the null hypothesis can be rejected or
not.
• Suppose an experiment is conducted to check if girls are shorter than boys at the
age of 5. The null hypothesis will say that they are the same height.
Inferential Statistics: Alternate Hypothesis
• The alternative hypothesis is an alternative to the null hypothesis.
• It is used to show that the observations of an experiment are due to some real
effect.
• It indicates that there is a statistical significance between two possible outcomes
and can be denoted as H1 or Ha
• For the above-mentioned example, the alternative hypothesis would be that girls
are shorter than boys at the age of 5.



Inferential Statistics: Hypothesis Testing P Value
• In hypothesis testing, the p-value is used to indicate whether the results obtained after conducting a test are statistically significant or not.
• It also indicates the probability of making an error in rejecting or not rejecting the null hypothesis. This value is always a number between 0 and 1.
• The p-value is compared to an alpha level (α), or significance level. The alpha level can be defined as the acceptable risk of incorrectly rejecting the null hypothesis. The alpha level is usually chosen between 1% and 5%.
• The level of significance (α) is a predefined threshold that should be set by the researcher. It is generally fixed at 0.05.
Inferential Statistics: Hypothesis Testing P Value Formula
For an observed test statistic t, the p-value is P(T ≥ t | H0) for a right-tailed test, P(T ≤ t | H0) for a left-tailed test, and 2·P(T ≥ |t| | H0) for a two-tailed test.


Inferential Statistics: Hypothesis Testing - P Value Table
• P-value ≤ 0.05: indicates the null hypothesis is very unlikely; the null hypothesis is rejected.
• P-value > 0.05: indicates the null hypothesis is plausible; we fail to reject the null hypothesis.
• P-value ≈ 0.05: the p-value is near the cut-off and is considered marginal; the result needs closer attention.



Inferential Statistics: Hypothesis Testing: P Value - Example
Example: The p-value is 0.3105. If the level of significance is 5%, find whether we can reject the null hypothesis.
Solution: Since the p-value of 0.3105 is greater than the level of significance of 0.05 (5%), we fail to reject the null hypothesis.
Example: The p-value is 0.0219. If the level of significance is 5%, find whether we can reject the null hypothesis.
Solution: Since the p-value of 0.0219 is less than the level of significance of 0.05, we reject the null hypothesis.



Inferential Statistics: Hypothesis Testing: t Test & Z test
A t-test is a type of inferential statistic used to determine if there is a significant
difference between the means of two groups, which may be related in certain
features.
The t-test is one of many tests used for the purpose of hypothesis testing in
statistics.
Calculating a t-test requires three key data values: the difference between the mean values of the two groups, the standard deviation of each group, and the number of data values in each group.





Inferential Statistics: Hypothesis Testing: t Test Types
An Independent Samples t-test compares the means for two groups.
A Paired sample t-test compares means from the same group at different times
(say, one year apart).
A One sample t-test tests the mean of a single group against a known mean.



Inferential Statistics: Hypothesis Testing: t Test Types
When to use t Test
A t-test can only be used when comparing the means of two groups or pairwise
comparison. If you want to compare more than two groups, or if you want to do
multiple pairwise comparisons, use an ANOVA test or a post-hoc test.
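As an illustration of running a t-test through software, here is a minimal sketch using SciPy (assuming scipy is installed); the two sets of scores are made-up values for two independent groups:

```python
from scipy import stats

# Hypothetical test scores of two independent groups
group_a = [72, 75, 78, 80, 69, 74, 77, 73]
group_b = [68, 70, 71, 74, 66, 69, 72, 67]

# Independent-samples t-test (two-tailed)
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(t_stat, p_value)

if p_value <= 0.05:
    print("Reject the null hypothesis: the two group means differ significantly.")
else:
    print("Fail to reject the null hypothesis.")
```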



Inferential Statistics: Hypothesis Testing: ANOVA
ANOVA is a statistical technique specially designed to test whether the means of
more than two quantitative populations are equal.





Inferential Statistics: Hypothesis Testing: ANOVA - Assumptions
1) Normality: The values in each group are normally distributed.
2) Homogeneity of variances: The variance within each group should be equal
for all groups.
3) Independence of error: The error (the variation of each value around its own group mean) should be independent for each value.
4) Normality can be checked using skewness and kurtosis, the Kolmogorov-Smirnov test, the Shapiro-Wilk test, box-and-whisker plots, or histograms.



Inferential Statistics: Hypothesis Testing: ANOVA - Steps
1) State Null & Alternate Hypothesis
2) State Alpha
3) Calculate degrees of Freedom
4) State decision rule
5) Calculate the test statistic: calculate the variance between samples and the variance within samples
6) Calculate the F statistic; if F is significant, perform a post hoc test
7) State results and conclusion
Inferential Statistics: Hypothesis Testing: ANOVA - Steps
• State the null and alternative hypotheses:
H0: all sample means are equal; H1: at least one sample mean is different.
• State alpha, i.e. 0.05.
• Calculate the degrees of freedom: k − 1 (between samples) and n − k (within samples),
where k = number of samples and n = total number of observations.
• State the decision rule: if the calculated value of F > the table value of F, reject H0.
• Calculate the test statistic.
Inferential Statistics: Hypothesis Testing: ANOVA - Steps
Calculating variance between samples
1. Calculate the mean of each sample.
2. Calculate the Grand average
3. Take the difference between means of various samples & grand
average.
4. Square these deviations & obtain total which will give sum of
squares between samples (SSC)
5. Divide the total obtained in step 4 by the degrees of freedom to
calculate the mean sum of square between samples (MSC).
Inferential Statistics: Hypothesis Testing: ANOVA - Steps
Calculating Variance within Samples
1. Calculate mean value of each sample
2. Take the deviations of the various items in a sample from the mean values of
the respective samples.
3. Square these deviations & obtain total which gives the sum of square within
the samples (SSE)
4. Divide the total obtained in 3rd step by the degrees of freedom to calculate
the mean sum of squares within samples (MSE).





Inferential Statistics: Hypothesis Testing: ANOVA - Steps
1. Null hypothesis: there is no significant difference in the means of the 3 samples.
2. State alpha, i.e. 0.05.
3. Calculate the degrees of freedom: k − 1 = 2 and n − k = 12 (here k = 3 samples and n = 15 observations).
4. State the decision rule: the table value of F at the 5% level of significance for d.f. (2, 12) is 3.88; if the calculated value of F > 3.88, H0 will be rejected.
5. Calculate the test statistic (F = MSC/MSE) and compare it with the table value.
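A one-way ANOVA can be carried out through software in a single call. A minimal sketch using SciPy (assuming scipy is installed); the three samples are illustrative, with k = 3 and n = 15 as in the steps above:

```python
from scipy import stats

# Three hypothetical samples (k = 3 groups, n = 15 observations in total)
sample1 = [25, 30, 36, 38, 31]
sample2 = [31, 39, 38, 42, 35]
sample3 = [24, 30, 28, 25, 28]

# One-way ANOVA: tests H0 that all three population means are equal
f_stat, p_value = stats.f_oneway(sample1, sample2, sample3)
print(f_stat, p_value)

if p_value <= 0.05:
    print("Reject H0: at least one sample mean is different.")
else:
    print("Fail to reject H0.")
```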



Inferential Statistics: Hypothesis Testing: Correlation
• Correlation is a LINEAR association between two random variables.
• Correlation analysis shows us how to determine both the nature and the strength of the relationship between two variables.
• Correlation is applied when two variables vary together (for example, variables observed over the same period of time).
• The correlation coefficient lies between −1 and +1.
• A zero correlation indicates that there is no relationship between the variables.
• A correlation of −1 indicates a perfect negative correlation.
• A correlation of +1 indicates a perfect positive correlation.
Inferential Statistics: Hypothesis Testing: Correlation - Types

Types of Correlation
• Type 1: Positive, Negative, No correlation, Perfect
• Type 2: Linear, Non-linear
• Type 3: Simple, Multiple, Partial
Inferential Statistics: Hypothesis Testing: Correlation
• Type 1:
Positive correlation: when one variable increases (decreases), the other also increases (decreases).
Negative correlation: when one variable increases (decreases), the other decreases (increases).
No correlation: the two variables are independent.
• Type 2:
Linear: when plotted on a graph the points tend to form a straight line.
Non-linear: when plotted on a graph the points do not form a straight line.
• Examples: (a) heights and weights, (b) amount of rainfall and yield of crops, (c) price and supply of a commodity, (d) income and expenditure on luxury goods, (e) blood pressure and age.
Inferential Statistics: Hypothesis Testing: Correlation
• Type 3:
Simple: only two variables, one independent and one dependent, are studied.
Multiple: one dependent variable and more than one independent variable are studied.
Partial: one dependent variable and more than one independent variable are involved, but the relationship with only one independent variable is studied while the other independent variables are held constant.



Inferential Statistics: Hypothesis Testing: Correlation



Inferential Statistics: Hypothesis Testing: Correlation - Methods
• Scatter Diagram Method
• Karl Pearson's Coefficient of Correlation Method
• Spearman's Rank Correlation Method



Inferential Statistics: Hypothesis Testing: Correlation – Linear Relationship



Inferential Statistics: Hypothesis Testing: Coefficient of Correlation
• A measure of the strength of the linear relationship between two variables, defined as the (sample) covariance of the variables divided by the product of their (sample) standard deviations.
• Represented by "r".
• r lies between −1 and +1, i.e. −1 ≤ r ≤ +1.
• It describes both magnitude and direction.
• The + and − signs are used for positive linear correlations and negative linear correlations, respectively.
Inferential Statistics: Hypothesis Testing: Interpreting Coefficient of Correlation
• Strong correlation: r > .70 or r < –.70
• Moderate correlation: r is between .30 & .70 or r is between –.30 and –.70
• Weak correlation: r is between 0 and .30 or r is between 0 and –.30 .
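The correlation coefficient r can be computed through software. A minimal sketch using SciPy's pearsonr (assuming scipy is installed); the paired values, e.g. hours studied and marks obtained, are made up:

```python
from scipy import stats

hours = [2, 4, 6, 8, 10, 12]      # hypothetical independent variable
marks = [30, 45, 55, 60, 72, 80]  # hypothetical dependent variable

r, p_value = stats.pearsonr(hours, marks)

print(round(r, 3))     # correlation coefficient (magnitude and direction)
print(round(r**2, 3))  # coefficient of determination, r squared
print(p_value)         # significance of the correlation
```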



Inferential Statistics: Hypothesis Testing: Coefficient of Determination
• The coefficient of determination lies between 0 and 1 and is represented by r².
• The coefficient of determination is a measure of how well the regression line represents the data. If the regression line passes exactly through every point on the scatter plot, it explains all of the variation; the further the line is from the points, the less it is able to explain.
• r² is useful because it gives the proportion of the variance (fluctuation) of one variable that is predictable from the other variable. It is a measure that allows us to determine how certain one can be in making predictions from a given model/graph.
• The coefficient of determination is the ratio of the explained variation to the total variation.
• The coefficient of determination is such that 0 ≤ r² ≤ 1, and it denotes the strength of the linear association between x and y.
• The coefficient of determination represents the percentage of the data that is closest to the line of best fit.
• For example, if r = 0.922, then r² = 0.850, which means that 85% of the total variation in y can be explained by the linear relationship between x and y (as described by the regression equation). The other 15% of the total variation in y remains unexplained.



Inferential Statistics: Hypothesis Testing: Spearman's Rank Coefficient
• A method to determine correlation when the data are not available in numerical form; as an alternative, the method of rank correlation is used. When the values of the two variables are converted to their ranks, and the correlation is obtained from the ranks, it is known as rank correlation.



Inferential Statistics: Hypothesis Testing: Spearman's Rank Coefficient
• Computation of rank correlation: Spearman's rank correlation coefficient is ρ = 1 − (6Σd²)/(n(n² − 1)), where d is the difference between the two ranks of each observation and n is the number of observations.
• ρ can be calculated when:
• Actual ranks are given
• Ranks are not given but grades are given and not repeated
• Ranks are not given and grades are given and repeated (tied ranks)
• Testing the significance of the correlation coefficient
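Rank correlation can be computed through software. A minimal sketch using SciPy's spearmanr (assuming scipy is installed); the scores awarded by two hypothetical judges are made up:

```python
from scipy import stats

# Hypothetical scores awarded by two judges to the same eight candidates
judge1 = [8, 6, 7, 3, 5, 4, 9, 2]
judge2 = [7, 5, 8, 2, 6, 3, 9, 1]

rho, p_value = stats.spearmanr(judge1, judge2)
print(round(rho, 3), p_value)  # rank correlation coefficient and its significance
```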



Inferential Statistics: Hypothesis Testing: Utility of Correlation
• It helps to find whether the two variables are correlated or not.
• It helps to study the nature of relationship between the variables-expressed as
positive or negative correlation.
• It helps to study the degree of correlation between the given variables.



Inferential Statistics: Hypothesis Testing: Regression
• Regression analysis is a mathematical measure of the average relationship between two or more variables, expressed in terms of the original units of the data.
• Thus the term regression is used to denote the estimation or prediction of the average value of one variable for a specified value of another variable.
• The estimation is done by means of a suitable equation derived on the basis of the available bivariate data. Such an equation and its geometrical representation are called the regression equation and the regression curve.



Inferential Statistics: Hypothesis Testing: Regression
• In regression analysis there are two types of variables and they are: i.
Independent ii. Dependent.
• Dependent variable(Y): The variable whose value is influenced or is to be
predicted is called dependent variable.
• Independent variable(X): The variable which influences the values or is used for
prediction is called independent variable.
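A simple linear regression of Y on X can be fitted through software. A minimal sketch using SciPy's linregress (assuming scipy is installed); the advertising and sales figures are made up:

```python
from scipy import stats

advertising = [10, 12, 15, 18, 20, 25]   # independent variable (X)
sales = [110, 118, 135, 150, 158, 182]   # dependent variable (Y)

result = stats.linregress(advertising, sales)
print(result.slope, result.intercept)    # coefficients of the fitted line Y = a + bX

# Predict the average value of Y for a specified value of X, e.g. X = 22
predicted = result.intercept + result.slope * 22
print(round(predicted, 1))
```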



Inferential Statistics: Hypothesis Testing: Utility of Regression
• It helps in making predictions or estimates of dependent variable for given
values of independent variable.
• It establishes cause and effect relationship between the variables.
• It explains the nature of relationship between the variables, i.e. positive or
negative relationship.
• It helps to determine the rate of change in one variable in terms of change in the
other variable.



Inferential Statistics: Hypothesis Testing: Limitations of Regression
• It is based on the assumption of linear relationship between the variables- it is
not always true.
• It assumes constant relationship between the variables.
• Relationship between dependent and independent variables is true within the
limits of experiment only.



Difference between Correlation & Regression
• Correlation determines the interconnection or co-relationship between the variables; regression explains how an independent variable is numerically associated with the dependent variable.
• In correlation, no distinction is made between independent and dependent variables; in regression, the dependent and independent variables are distinct.
• The primary objective of correlation is to find a quantitative/numerical value expressing the association between the values; the primary intent of regression is to estimate the values of the random (dependent) variable based on the values of the fixed (independent) variable.
• Correlation indicates the degree to which the two variables move together; regression specifies the effect of a unit change in the known variable (p) on the estimated variable (q).
• Correlation helps to establish the connection between the two variables; regression helps in estimating a variable's value based on another given value.

