UNIT 4: HANDLING DATA IN RELEVANT STATISTICAL SOFTWARE
Topics to be covered
• Identifying Variables: Nominal, Ordinal, Interval, Ratio, entering data,
labelling and sorting of data, computing new variable, recoding existing
variable into new variable.
• Steps to be followed for Computing Variable, and Recoding
• Conditions when to apply different tests while comparing means with
assumptions.
• One Sample t-test, Independent Sample t-test, Paired Sample t-test,
(Assumptions Testing and Inferential Analysis): Interpretation of results,
Identify type of test
• One-way ANOVA, Repeated Measure ANOVA
• Non Parametric Tests: Kolmogorov-Smirnov, Kruskal-Wallis and Wilcoxon
tests.
– Focus on interpretation and not on calculations
What is SPSS
• Software package used for statistical analysis of data.
• First released in 1968; SPSS Inc. was formed later to develop and sell it.
• SPSS used to stand for “Statistical Package for the Social Sciences”
• Later changed to “Statistical Product and Service Solutions”
• Acquired by IBM in 2009. Now known as IBM SPSS Statistics

Opening SPSS
• The default window will have the data
editor
• There are two sheets in the window:
1. Data view 2. Variable view
Enter data in SPSS directly
Example: Hospital-stay data
• Under Data View, columns are variables and rows are cases.
Enter Variables
1. Click the Variable View window.
2. Type the variable name under the Name column (e.g. AGE).
NOTE: A variable name can be up to 64 bytes long, and the first character must be a letter or one of the characters @, #, or $.
3. Type: numeric, string, etc.
4. Label: description of the variable.
Enter variables
• Enter variables based on your code book.
Enter cases
Import data from Excel
• Select File > Open > Data
• Choose Excel as file type
• Select the file you want to import
• Then click Open
Open Excel files in SPSS
• Click Continue
• Save this file as SPSS data
Types of Data
• Variable- any characteristic that is recorded for
subjects in a study
– Categorical- if each observation belongs to one of a set of
categories
– Quantitative- if observations on it take numerical values
that represent different magnitudes of the variable
• Discrete- if its possible values form a set of separate numbers,
such as 0, 1, 2, …
• Continuous- if its possible values form an interval
Other Valuable Terminology
• Parameter- a numerical summary of the population
• Statistic- a numerical summary of a sample taken from the population
• Frequency table- a listing of possible values for a variable, together with the number of observations for each value
– Relative frequency- proportions and percentages
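A frequency table with relative frequencies can be sketched in a few lines of Python. This is only an illustration: the `responses` list is hypothetical data, and `frequency_table` is a helper name chosen here, not an SPSS feature.

```python
from collections import Counter

# Hypothetical categorical responses (not taken from the slides).
responses = ["low", "high", "average", "low", "average",
             "average", "high", "low", "average", "low"]

def frequency_table(values):
    """Map each value to (count, proportion, percentage)."""
    counts = Counter(values)
    n = len(values)
    return {v: (c, c / n, 100 * c / n) for v, c in counts.items()}

table = frequency_table(responses)
for value, (count, proportion, percent) in sorted(table.items()):
    print(f"{value:8s} {count:3d} {proportion:.2f} {percent:5.1f}%")
```

In SPSS the same table comes from Analyze > Descriptive Statistics > Frequencies.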
Scales of Measurement
•Nominal Scale - groups or classes
✓Gender
•Ordinal Scale - order matters
✓Ranks (top ten videos)
•Interval Scale - difference or distance matters –
has arbitrary zero value.
✓Temperatures (°F, °C), Likert Scale
•Ratio Scale - Ratio matters – has a natural zero
value.
✓Salaries
Data management
• Defining variables
• Coding values
• Entering and editing data
• Creating new variables
• Recoding variables
• Selecting cases
Data analysis
• Univariate statistics
• Bivariate statistics
• Multivariate statistics
The data entry process
• Define your variables in Variable View
• Enter the data, the values of the variables,
in Data View
Definition of variables
10 characteristics are used to define a variable:
Name, Type, Width, Decimals, Label, Values, Missing, Columns, Align, Measure
Name
• Each variable must have a unique name of not more than 64 bytes (8 characters in older versions of SPSS), starting with a letter
• Try to give meaningful variable names:
– Describing the characteristic: for example, age
– Linking to the questionnaire: for example, A1Q3
• Keep the names consistent across files
Type
• Internal formats:
– Numeric
– String (alphanumeric)
– Date
• Output formats:
– Comma
– Dot
– Scientific notation
– Dollar
– Custom currency
Numeric
• Numeric variables:
– Numeric measurements
– Codes
• Definition of the size of the variable
String (alphanumeric)
• String variables contain words or characters;
strings can include numbers but, taken here as
characters, mathematical operations cannot
be applied to them
• The maximum size of a string variable is 255
characters
Date
• The input format for date variables must be
defined, such as DD/MM/YYYY, MM/DD/YYYY
or MM/DD/YY
• Computers store dates as numbers from a
base date; in SPSS, dates are stored as the
number of seconds from 14 October 1582
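The idea of storing a date as seconds from a base date can be illustrated with Python's `datetime` module. This is a minimal sketch, assuming the base date is midnight at the start of 14 October 1582; the function names are made up for this example.

```python
from datetime import datetime, timedelta

# Base date used by SPSS for date values (assumed here to be midnight).
SPSS_EPOCH = datetime(1582, 10, 14)

def to_spss_seconds(dt):
    """Seconds elapsed between the base date and dt."""
    return (dt - SPSS_EPOCH).total_seconds()

def from_spss_seconds(seconds):
    """Inverse conversion: seconds since the base date back to a datetime."""
    return SPSS_EPOCH + timedelta(seconds=seconds)

one_day_later = to_spss_seconds(datetime(1582, 10, 15))  # exactly one day of seconds
print(one_day_later)
```

Because dates are plain numbers internally, differences between two dates can be computed by ordinary subtraction.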
Example
• Create two variables:

– ID: the unique identifier, which will be
alphanumeric with a maximum of 8 characters
– Age: the age of the respondent measured in
years, a discrete variable ranging between 10
and 100
Labels
• Descriptors for the variables
• Maximum 255 characters
• Used in the output
Values
• Value labels are descriptors of the categories
of a variable
• Coding
Missing
• Defines missing values
• The values are excluded from some analysis
• Options:
– Up to 3 discrete missing values
– A range of missing values plus one discrete
missing value
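The effect of declaring missing values can be sketched as follows. The codes 99 and 999 and the `ages` list are hypothetical; the point is only that declared missing codes are left out before a statistic is computed.

```python
from statistics import mean

# Hypothetical user-defined missing-value codes (discrete values).
MISSING_CODES = {99, 999}

# Hypothetical ages, with two entries coded as missing.
ages = [23, 35, 99, 41, 999, 28]

# Exclude the declared missing codes before analysis.
valid_ages = [a for a in ages if a not in MISSING_CODES]
valid_mean = mean(valid_ages)

print(len(valid_ages), valid_mean)
```

Without the exclusion step, the codes 99 and 999 would be treated as real ages and distort the mean.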
Columns and Align
• Columns sets the amount of space reserved to
display the contents of the variable in Data View;
generally the default value is adequate
• Align sets whether the contents of the variable
appear on the left, centre or right of the cell in
Data View
• Numeric variables are right-hand justified by
default and string variables left-hand justified by
default; the defaults are generally adequate
Measure
• Levels of measurement:
– Nominal
– Ordinal
– Interval
– Ratio
• In SPSS, interval and ratio are designated
together as Scale
• The default for string variables is Nominal
• The default for numeric variables is Scale
Data Management
• Entering data
• Defining variables
• Importing data
• Sorting
• Filtering
Data Analysis
• Descriptive Statistics
• Frequency table
• Charts
• Crosstabs-
– Chi-square Test
Chi-Square Test: Applications &
Procedure in SPSS
• The chi-square test is used for two purposes:
– first, to test the goodness of fit and,
– second, to test the independence of two
attributes.
• In both the situations, we intend to determine
whether the observed frequencies
significantly differ from the theoretical
(expected) frequencies.
Assumptions of Chi-Square Test
• Sample must be random.
• Frequencies of each attribute must be numeric and should not be in
percentages or ratios.
• Sample size must be sufficiently large. The chi-square test shall yield
inaccurate findings if the sample size is small. In that case, the researcher
might end up committing a type II error.
• The observations must be independent of each other. In other words, the
chi-square test cannot be used to test correlated data. In that situation,
McNemar’s test is used.
• Normally, all cell frequencies must be 5 or more. In large contingency
tables, 80% of cell frequencies must be 5 or more. If this assumption is not
met, the Yates’ correction is applied.
• The expected frequencies should not be too low. Generally, it is acceptable
if 20% of the events have expected frequencies less than 5, but in case of
chi-square with one degree of freedom, the conclusions may not be
reliable if expected frequencies are less than 10. In all such cases, Yates’
correction must be applied.
To Test the Goodness of Fit
• To know whether the pattern of
frequencies that are observed fits well
with the expected ones or not.
– a chi-square test for goodness of fit is used
to verify whether an observed frequency
distribution differs from a theoretical
distribution or not
To Test the Goodness of Fit
• The chi-square test for goodness of fit can
also be used to test an equal occurrence
hypothesis.
• Example: By using this test, one can test
whether all brands are equally popular, or
whether all the car models are equally
preferred.
• In using the chi-square test for goodness of
fit, only one categorical variable is involved
Example
• A beverages company produces a cold drink in three different colors. One hundred and twenty college students were asked about their preferences. The responses are shown in the table. Do these data show that all the colors were equally liked by the students? Test your hypothesis at .05 level of significance.
Color        White  Orange  Brown
Frequencies  50     40      30
Solution
• Here it is required to test the null hypothesis of
equal occurrence; hence, expected frequencies
corresponding to each of the three observed
frequencies shall be obtained by dividing the total
of all the observed frequencies by the number of
categories.
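• Hence, expected frequency (fe) for each category shall be 120/3 = 40
• Here, number of categories or rows (r) = 3 and number of columns (c) = 2.
Solution (contd.)
• Chi-square statistic: $\chi^2 = \sum (f_o - f_e)^2 / f_e = (50-40)^2/40 + (40-40)^2/40 + (30-40)^2/40 = 5.0$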
Solution: Testing the Significance of Chi-Square
• The degrees of freedom = (r-1)*(c-1) = 2
• From the chi-square table, the critical value (5% level and two degrees of freedom) is 5.991
• Since the calculated value of the chi-square statistic < critical value, the null hypothesis may not be rejected at .05 level of significance.
• Thus, the data do not provide evidence that the three colors of cold drinks are liked unequally by the college students.
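The goodness-of-fit calculation for the cold drink example can be checked with a short Python sketch. The observed frequencies and the critical value 5.991 come from the example; the function name is chosen here for illustration.

```python
# Observed frequencies from the cold drink example.
observed = {"White": 50, "Orange": 40, "Brown": 30}

def chi_square_equal_occurrence(obs):
    """Chi-square for the hypothesis that all categories are equally likely."""
    total = sum(obs.values())
    expected = total / len(obs)          # same expected frequency per category
    chi2 = sum((fo - expected) ** 2 / expected for fo in obs.values())
    return chi2, expected, len(obs) - 1  # statistic, fe, degrees of freedom

chi2, fe, df = chi_square_equal_occurrence(observed)

CRITICAL_5PCT_DF2 = 5.991                # tabulated value, 5% level, 2 df
reject_h0 = chi2 > CRITICAL_5PCT_DF2

print(chi2, fe, df, reject_h0)
```

The statistic stays below the critical value, so the hypothesis of equal occurrence is not rejected.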
To Test the Independence of
Attributes
• The chi-square test of independence is used to know whether paired observations on two attributes, expressed in a contingency table, are independent of each other.
• Chi-square test may be used to test the significance of an association between any two attributes:
– For instance one may test the significance of association between
• income level & brand preference,
• family size & television size purchased
• educational background & the type of job one does.
Example of Independence of
Attributes
• Consider a situation where it is required to test the significance of
association between Gender (male and female) and Response
(“prefer day shift” and “prefer night shift”). In this situation,
following hypotheses may be tested:
– H0: Gender and Response toward shift preferences are independent.
– H1: There is an association between the Gender and Response toward
shift preferences.
• The calculated value of chi-square is compared with its tabulated value for testing the null hypothesis.
• Thus, if the calculated chi-square is less than the tabulated chi-square with df = (r − 1)(c − 1) at some level of significance, then H0 may not be rejected; otherwise H0 may be rejected.
• Remark: If H0 is rejected, we may interpret that there is a significant association between gender and preferences toward shifts. Here, significant association simply means that the response patterns of males and females are different.
Example: Independence of Attributes
• Five hundred families were investigated to test
the belief that high income people usually prefer
to visit private hospitals and low-income people
often go to government hospitals whenever they
fall sick.
• The results so obtained are shown in Table
• Test whether income and hospital preferences
are independent. Compute the contingency
coefficient to find the strength of association.
Test your hypothesis at 5% level.
Solution
• The null hypothesis to be tested is
– H0: Income and hospital preferences are
independent.
• Before computing the value of chi-square, the
expected frequencies for each cell need to be
computed with the marginal totals and grand
totals given in the observed frequency (fo)
table
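Each expected frequency is (row total × column total) / grand total. The sketch below applies this rule to a hypothetical 2×2 table for 500 families; the counts are invented for illustration and are not the table from the example.

```python
# Hypothetical observed frequencies (NOT the table from the slides).
# Rows: high income, low income. Columns: private, government hospital.
observed = [[120, 80],
            [90, 210]]

def expected_frequencies(table):
    """fe = row total * column total / grand total for every cell."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in col_totals] for r in row_totals]

def chi_square(table):
    """Sum of (fo - fe)^2 / fe over all cells."""
    expected = expected_frequencies(table)
    return sum((fo - fe) ** 2 / fe
               for obs_row, exp_row in zip(table, expected)
               for fo, fe in zip(obs_row, exp_row))

expected = expected_frequencies(observed)
chi2 = chi_square(observed)
df = (len(observed) - 1) * (len(observed[0]) - 1)

print(expected, round(chi2, 3), df)
```

With these invented counts the statistic is far above 3.841, the 5% critical value for one degree of freedom.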
Solution: Observed Frequencies and Expected Frequencies
Solution: Calculation of Chi-square Statistic
Solution: Test of Significance
Here, r = 2 and c = 2, and therefore the degrees of freedom are (r − 1)(c − 1) = 1.
From the chi-square table, the critical value at the 5% level = 3.841.
Since the calculated value > table value, the null hypothesis may be rejected at .05 level of significance. It may therefore be concluded that there is an association between the income level and the type of hospital preferred by the people.
Testing the Significance of Chi-Square
in SPSS
• In SPSS, the null hypothesis is not tested on the
basis of the comparison between calculated and
tabulated chi-square; rather, it uses the concept
of p value
• The p value is the probability, computed assuming the null hypothesis is true, of obtaining a test statistic at least as extreme as the one observed.
• Thus, the chi-square is said to be significant at 5% level if the p value is less than .05 and is insignificant if it is more than .05.
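The link between the critical-value approach and the p value can be sketched in Python. For 1 and 2 degrees of freedom the upper-tail chi-square probability has a simple closed form, so no statistical library is needed; the function below is an illustration limited to those two cases and is not how SPSS itself computes p values.

```python
import math

def chi2_p_value(chi2, df):
    """Upper-tail probability P(X >= chi2) for df = 1 or df = 2 only."""
    if df == 1:
        return math.erfc(math.sqrt(chi2 / 2))
    if df == 2:
        return math.exp(-chi2 / 2)
    raise ValueError("this sketch only handles df = 1 or df = 2")

# At the tabulated critical values the p value is about .05.
print(round(chi2_p_value(3.841, 1), 3))
print(round(chi2_p_value(5.991, 2), 3))

# A chi-square of 5.0 with 2 df (below the critical value) gives p > .05.
print(round(chi2_p_value(5.0, 2), 3))
```

Comparing the p value with .05 therefore leads to the same decision as comparing the calculated chi-square with the tabulated value.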
Contingency Coefficient
• Contingency coefficient (C) provides the magnitude of association
between the attributes in the cross tabulation. Its value can range
from 0 (no association) to 1 (the theoretical maximum possible
association).
• Chi-square simply tests the significance of an association between
any two attributes but does not provide the magnitude of the
association. Thus, if the chi-square value becomes significant, one
must compute the contingency coefficient (C) to know the extent of
association between the attributes. The contingency coefficient C is
computed by the following formula:
$C = \sqrt{\chi^2 / (\chi^2 + N)}$
– where N is the sum of all frequencies in the contingency table.
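The contingency coefficient, C = sqrt(chi-square / (chi-square + N)), is a one-line computation. The sketch below uses a hypothetical chi-square value of 44.335 with N = 500; these numbers are for illustration only.

```python
import math

def contingency_coefficient(chi2, n):
    """C = sqrt(chi2 / (chi2 + N)), where N is the total frequency."""
    return math.sqrt(chi2 / (chi2 + n))

# Hypothetical values, not results from the slides.
c = contingency_coefficient(44.335, 500)
print(round(c, 3))
```

A chi-square of zero gives C = 0, and C grows toward (but does not reach) 1 as chi-square becomes large relative to N.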

Example
• Out of 200 MBA students, 40 were given an academic
counseling throughout the semester, whereas other 40
did not receive this counseling. On the basis of their
marks in the final examination, their performance was
categorized as improved, unchanged, and deteriorated.
Based on the results shown in Table, can it be
concluded that the academic counseling is effective at
5% level?
Solution
• In order to check whether academic counseling is effective, we shall test the significance of association between treatment and performance.
• If the association between these two attributes is significant, then it may be interpreted that the pattern of performance in the counseling and control groups is not the same.
– In that case, it might be concluded that the counseling is effective since the number of improved cases is higher in the counseling group than in the control group.
• Thus, it is important to compute the chi-square first in order to test the null hypothesis.
– H0: There is no association between treatment and performance.
– H1 (alternative hypothesis): There is an association between treatment and performance
Chi square in SPSS: Testing the
hypothesis of Equal Occurrence
• In a study, 90 workers were tested for their job satisfaction.
Their job satisfaction level was obtained on the basis of the
questionnaire, and the respondents were classified into one
of the three categories, namely, low, average, and high. The
observed frequencies are shown in Table. Compute chi-square
in testing whether there is any specific trend in their job
satisfaction.
Correlation & Partial Correlation
Exercise
Steps in Regression
• Compute descriptive statistics like mean, standard deviation, skewness, kurtosis, frequency
distribution, etc., and check the distribution of each variable by testing the significance of
skewness and kurtosis.
• Assess the linearity of each independent variable with the dependent variable by plotting the
scatter diagram.
• Check for multicollinearity among the independent variables by computing the correlation
matrix among the independent variables. If multicollinearity exists between the independent
variables then one of the independent variables must be dropped as it does not explain
additional variability in the dependent variable.
• Develop a regression equation by using the unstandardized regression coefficients (B
coefficients).
• Test the significance of the regression coefficients by using the t-test. As a rule of thumb, an absolute t-value greater than 2.0 is usually statistically significant, but one must consult a t-table to be sure.
• Test the significance of the regression model by using the F-test. The F-value is computed by dividing the explained variance by the unexplained variance. In general, an F-value greater than 4.0 is usually statistically significant, but one must consult an F-table to be sure.
• Compute R² and adjusted R² to know the percentage of variance of the dependent variable explained by all the independent variables together in the regression model.
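The quantities named in these steps (B coefficient, t-value, F-value, R² and adjusted R²) can be illustrated for the simplest case of one independent variable. This is a sketch under that simplifying assumption with hypothetical data; SPSS handles several independent variables with matrix methods not shown here.

```python
import math

# Hypothetical data: x is one independent variable, y the dependent variable.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

def simple_regression(x, y):
    """Least-squares fit of y = a + b*x with the usual summary statistics."""
    n, k = len(x), 1                      # k = number of independent variables
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b = sxy / sxx                         # unstandardized slope (B coefficient)
    a = y_bar - b * x_bar                 # intercept
    sst = sum((yi - y_bar) ** 2 for yi in y)
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ssr = sst - sse                       # explained sum of squares
    r2 = ssr / sst
    adj_r2 = 1 - (1 - r2) * (n - 1) / (n - k - 1)
    mse = sse / (n - k - 1)               # unexplained variance
    t_b = b / math.sqrt(mse / sxx)        # t-value of the slope
    f = (ssr / k) / mse                   # F-value of the model
    return {"a": a, "b": b, "r2": r2, "adj_r2": adj_r2, "t": t_b, "F": f}

result = simple_regression(x, y)
print({name: round(value, 4) for name, value in result.items()})
```

With a single independent variable the F-value equals the square of the slope's t-value, which the output confirms.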
Regression Exercise
• In order to assess the feasibility of a guaranteed annual wage,
the Rand Corporation conducted a study to assess the
response of labor supply in terms of average hours of work(Y)
based on different independent parameters. The data were
drawn from a national sample of 6,000 households with head
earnings less than $15,000 annually. These data are given in
(Excel). Apply regression analysis by using SPSS to suggest a
regression model for estimating the average hours worked
during the year based on identified independent parameters.
Regression Exercise
Data on average yearly hour and other socioeconomic variables
• Hours(X1): average hours worked during the year
• Rate(X2): average hourly wage (dollars)
• ERSP(X3): average yearly earnings of spouse (dollars)
• ERNO(X4): average yearly earnings of other family members
(dollars)
• NEIN(X5): average yearly non-earned income
• Assets(X6): average family asset holdings (bank account)
(dollars)
• Age(X7): average age of respondent
• Dep(X8): average number of dependents
• School(X9): average highest grade of school completed
Regression Exercise: Solution
To develop the regression model for estimating the average hours of working
during the year for guaranteed wages on the basis of socioeconomic
variables, do the following steps:
1. Choose the “stepwise regression” method in SPSS to get the regression coefficients of the independent variables identified in the model for developing the regression equation.
2. Test the regression coefficients for their significance through the t-test by using the significance values (p values) in the output.
3. Test the regression model for its significance through the F-value by looking at its significance value (p value) in the output.
4. Use the value of R² in the output to know the amount of variance explained in the dependent variable by the identified independent variables together in the model.
Regression Exercise 2
The data on copper industry and its determinants in the US market during
1951–1980 are shown in the following table. Construct a regression model
and develop the regression equation by using the SPSS. Test the significance
of regression coefficients and explain the robustness of the regression model
to predict the price of the copper in the US market.
• DPC = 12-month average US domestic price of copper (cents per pound)
• GNP = annual gross national product ($, billions)
• IIP = 12-month average index of industrial production
• MEPC = 12-month average London Metal Exchange price of copper (pounds sterling)
• NOH = number of housing starts per year (thousands of units)
• PA = 12-month average price of aluminum (cents per pound)
Note: The data are from the sources such as American Metal Market, Metals
Week, and US Department of Commerce publications
Limitations of Multiple Regression
• Like simple regression, multiple regression will not be efficient if the independent variables are not linearly related with the dependent variable.
• Multiple regression can be used only if the variables are measured on either interval or ratio scale. In case the data is measured on some other scale, other methods should be used for estimation.
• Simple regression having one dependent and one independent variable usually requires a minimum of 30 observations. In general, add at least 10 observations for each additional independent variable added in the study.
Hypothesis Testing
• A hypothesis is any assertion or statement about certain characteristics of the population
• A hypothesis is said to be a statistical hypothesis if the following three conditions prevail:
– The population may be defined.
– Sample may be drawn.
– The sample may be evaluated to test the hypothesis.
• Statistical hypotheses are based on the concept of proof by contradiction. For example, consider that a hypothesis concerning population mean (μ) is tested to see if an experiment has caused an increase or decrease in μ. This is done by proof by contradiction, by formulating a null hypothesis
Null Hypothesis
• Null hypothesis is a hypothesis of no difference.
• It is denoted by H0. It is formulated to test an alternative hypothesis.
• Null hypothesis is assumed to be true.
– By assuming the null hypothesis to be true, the distribution of the test statistic can be well defined.
– Further, null signifies the unbiased approach of the researcher in testing the research hypothesis.
• The researcher verifies the null hypothesis by assuming that it is true and rejects it in favor of research hypothesis if any contradiction is observed.
Alternative Hypothesis
• Alternative hypothesis is also known as research hypothesis. In any research study, the researcher first develops a research hypothesis for testing some parameter of the population, and accordingly null hypothesis is formulated to verify it.
• The alternative hypothesis is denoted by H1.
• Alternative hypothesis means that there is a difference between the population parameter and the hypothesized value.
• In testing of hypothesis, the whole focus is to test whether research hypothesis can be accepted or not, and this is done by contradicting the null hypothesis.
Other Concepts
• Critical Value & Critical region
• Rejection region
• Level of significance
• Type I and Type II Errors
• One tail and two tail tests
Testing Hypothesis
• In case of a large sample (n > 30), for testing the hypothesis concerning mean, z-test is used.
• However, in case of a small sample (n < 30), the distribution of the sample mean follows the t-distribution if the population variance is not known.
– In such a situation, t-test is used.
• In case the population standard deviation (σ) is unknown, it is estimated by the sample standard deviation (S).
• For different sample sizes, the t-curve is different, and it approaches the normal curve for sample size n > 30.
– All these curves are symmetrical and bell shaped and distributed around t = 0. The exact shape of the t-curve depends on the degrees of freedom.
• In one-way ANOVA, the comparison of between-group variance and within-group variance is done by using the F-statistic.
One Sample Test
• A t-test can be defined as a statistical test used for testing of hypothesis in which the test statistic follows a Student’s t-distribution under the assumption that the null hypothesis is true.
• Used if the population standard deviation is not known and the population from which the sample has been drawn is normally distributed.
– Small sample size (n < 30) where population standard deviation is not known.
– Large sample (n > 30) and the population standard deviation is not known
• Used to test whether the population mean is equal to a predefined value or not.
• An example of a one-sample t-test may be to see whether population average sleep time is equal to 5 h or not.
One Sample Test
• t-statistic is computed by the following formula:
$t = (\bar{x} - \mu_0) / (S / \sqrt{n})$
– where x̄ is the sample mean, μ0 the hypothesized population mean, S the sample standard deviation, and n the sample size.
• Calculated t is compared with tabulated t at 0.05 level of significance and n − 1 degrees of freedom if the hypothesis is to be tested at 5% level.
• If the p value is less than .05, the t-statistic is significant, and we reject the null hypothesis in favor of the alternative hypothesis.
– On the other hand, if the p value is more than 0.05, we fail to reject the null hypothesis.
Exercise
• A professor wishes to know if his statistics class has a good
background of basic math. Ten students were randomly
chosen from the class and were given a math proficiency
test. Based on the previous experience, it was hypothesized
that the average class performance on such math
proficiency test is 75. The professor wishes to know
whether this hypothesis may be accepted or not. Test your
hypothesis at 5% level assuming that the distribution of the
population is normal. The scores obtained by the students
are as follows:
Math proficiency score: 71, 60, 80, 73, 82, 65, 90, 87, 74, and
72
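The exercise can be worked through with the one-sample t formula, t = (sample mean − hypothesized mean) / (S / √n). The sketch below uses the ten scores given above; the critical value 2.262 (two-tailed, 5% level, 9 degrees of freedom) is taken from a standard t-table, and the function name is chosen for illustration.

```python
import math
from statistics import mean, stdev

# Math proficiency scores from the exercise.
scores = [71, 60, 80, 73, 82, 65, 90, 87, 74, 72]
MU_0 = 75                       # hypothesized population mean

def one_sample_t(sample, mu0):
    """Return the t-statistic and its degrees of freedom."""
    n = len(sample)
    s = stdev(sample)           # sample standard deviation (n - 1 in denominator)
    t = (mean(sample) - mu0) / (s / math.sqrt(n))
    return t, n - 1

t_stat, df = one_sample_t(scores, MU_0)

T_CRITICAL = 2.262              # tabulated two-tailed 5% value for 9 df
reject_h0 = abs(t_stat) > T_CRITICAL

print(round(mean(scores), 1), round(t_stat, 4), df, reject_h0)
```

The sample mean (75.4) is very close to 75 and the t-value is well below the critical value, so the hypothesis that the class average is 75 is not rejected at the 5% level.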
Two-Sample t-Test
for Unrelated Groups
• Used for testing the hypothesis of equality of
means of two normally distributed
populations
• We often want to compare the means of two
different populations
• For example, comparing the effect of two
different diets on weights, the effect of two
teaching methodologies on the performance,
or the IQ of boys and girls.
Two-Sample t-Test
for Unrelated Groups: Assumptions
• The distributions of both the populations from
which the samples have been drawn are
normally distributed.
• The variances of the two populations are
nearly equal
• Population variances are unknown.
• The samples are independent of each other.
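Under these assumptions the two-sample t-statistic uses a pooled variance: t = (mean A − mean B) / sqrt(Sp² (1/nA + 1/nB)), with nA + nB − 2 degrees of freedom. The sketch below applies it to two small hypothetical groups of scores; the data and names are invented for illustration.

```python
import math
from statistics import mean, variance

# Hypothetical scores for two independent groups (not from the slides).
group_a = [21, 22, 23, 24, 25]
group_b = [23, 24, 25, 26, 27]

def pooled_t(a, b):
    """Equal-variance (pooled) two-sample t-statistic and degrees of freedom."""
    na, nb = len(a), len(b)
    df = na + nb - 2
    # Pooled variance: weighted average of the two sample variances.
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / df
    t = (mean(a) - mean(b)) / math.sqrt(sp2 * (1 / na + 1 / nb))
    return t, df

t_stat, df = pooled_t(group_a, group_b)
print(round(t_stat, 4), df)
```

In SPSS the equality-of-variances assumption is checked with Levene's test in the Independent-Samples T Test output.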
Two-Sample t-Test
for Unrelated Groups: Exercise
• Counseling cell of a college keeps conducting sessions
with the problematic students by using different
methods. Since the number of visitors keeps increasing
every day in the center, they have decided to test
whether audio-visual-based counseling and personal
counseling are equally effective in reducing the stress
level. Eighteen women students were randomly chosen
among those who visited the center. Nine of them
were given the personal counseling, whereas the other
nine were given the sessions with the audiovisual
presentation. After the session, the students were
tested for their stress level. The data so obtained are
shown in Table
Two Sample t-test Exercise---- contd.
• Test your hypothesis at 1% level, whether any one method of counseling is better than the other. It is assumed that population variances are equal and both the populations are normally distributed
Paired t-Test for Related Groups
• Paired t-test is used to test the null hypothesis that the difference between the two responses measured on the same experimental units has a mean value of zero.
• Used to test the research hypothesis as to whether the post-treatment response is different from the pre-treatment response.
• Paired t-test is used in all those situations where there is only one experimental group and no control group.
– Question which is tested here is to know whether the treatment is effective or not.
– Done by measuring the responses of the subjects in the experimental group before and after the treatment
Paired t-Test for Related Groups
• Also known as “repeated measures” t-test.
• In using the paired t-test, the data must be obtained in pair on
the same set of subjects before and after the experiment.
• While applying the paired t-test for two related groups, the
pairwise differences, di, is computed for all n paired data. The
mean, d-bar and standard deviation, Sd, of the differences di
are calculated.
• Paired t-statistic is computed as:
$t = \bar{d} / (S_d / \sqrt{n})$
• where “t” follows the Student’s t-distribution with n − 1 degrees of freedom
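The paired t-statistic is the mean of the pairwise differences divided by its standard error, t = d-bar / (Sd / √n). The sketch below computes it for hypothetical before/after scores on four subjects; the data are invented for illustration.

```python
import math
from statistics import mean, stdev

# Hypothetical before/after measurements on the same subjects.
before = [10, 12, 14, 16]
after = [9, 10, 13, 14]

def paired_t(first, second):
    """Paired t-statistic on the differences first - second, with its df."""
    d = [x - y for x, y in zip(first, second)]   # pairwise differences d_i
    n = len(d)
    t = mean(d) / (stdev(d) / math.sqrt(n))      # d-bar / (Sd / sqrt(n))
    return t, n - 1

t_stat, df = paired_t(before, after)
print(round(t_stat, 3), df)
```

Only the differences enter the calculation, which is why the normality assumption concerns the differences rather than the raw scores.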
Paired t-Test for Related Groups:
Assumptions
• The distribution of the population is normal.
• The distribution of scores obtained by pairwise
difference is normal, and the differences are a
random sample.
• Cases must be independent of each other.
Remark: If the normality assumption is not fulfilled, you may use the non-parametric Wilcoxon signed-rank test for paired difference designs
Paired t-Test for Related Groups:
Exercise
• Twelve women participated in a nutritional educative program. Their calorie intake, before and after the program, was measured and is shown in Table. Can you draw the conclusion that the nutritional educative program was successful in reducing the participants’ calorie intake? Test your hypothesis at 5% level assuming that the differences of the scores are normally distributed.
T-tests: Practice Exercises
• An experiment was conducted to assess delivery performance of the two
pizza companies. Customers were asked to reveal the delivery time of the
pizza they have ordered from these two companies. Following are the
delivery time in minutes of the two pizza companies as reported by their
customers. Can it be concluded that the delivery time of the two
companies is different? Test your hypothesis at 5% level.
• An experiment was conducted to know the impact of new
advertisement campaign on sale of television of a particular
brand. The number of television units sold on 12 consecutive
working days before and after launching the advertisement
campaign in a city was recorded. The data obtained are shown
in Table
• The ages of 15 randomly chosen employees of an organization are shown in Table. Can it be concluded that the average age of the employees in the organization is 28 years? Test your hypothesis at 5% level and interpret your findings.
ANOVA
• One-way ANOVA
• Repeated Measure ANOVA
One-Way ANOVA
• Statistical technique used for comparing means of more than two groups
• Tests the null hypothesis that samples in different groups have been drawn from the same population
• Used in a situation where the data is measured either on interval or ratio scale.
One-Way ANOVA
• In one-way ANOVA, group means are compared by comparing the
variability between groups with that of variability within the groups.
– Done by computing an F-statistic
– F-value is computed by dividing the mean sum of squares between the
groups by the mean sum of squares within the groups
• As per the central limit theorem, if the groups are drawn from the same population, the variance of the group means should be about 1/n of the variance within the groups (for groups of size n), so the F-value should be close to 1
• Thus, a higher ratio (F-value) indicates that the samples have been drawn from different populations
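The F-value is the mean sum of squares between the groups divided by the mean sum of squares within the groups. The sketch below computes it for three small hypothetical groups; the scores and the function name are invented for illustration.

```python
# Hypothetical scores for three independent groups (not from the slides).
groups = [
    [11, 12, 13],
    [12, 13, 14],
    [15, 16, 17],
]

def one_way_anova_f(groups):
    """F = mean square between groups / mean square within groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]
    # Between-groups sum of squares: group means around the grand mean.
    ssb = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means))
    # Within-groups sum of squares: scores around their own group mean.
    ssw = sum((x - m) ** 2 for g, m in zip(groups, group_means) for x in g)
    msb = ssb / (k - 1)
    msw = ssw / (n_total - k)
    return msb / msw, k - 1, n_total - k

f_value, df_between, df_within = one_way_anova_f(groups)
print(round(f_value, 3), df_between, df_within)
```

Here the third group sits well above the other two, so the between-groups variability dominates and the F-value is large.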
One-Way ANOVA: Example
• A human resource manager may wish to determine
whether the achievement motivation differs among the
employees in three different age categories (<25, 26–35,
and >35 years) after attending a training program.
• Here, the independent variable is the employee’s age

category, whereas the achievement motivation is the
dependent variable.
• To test whether the data provide sufficient evidence to

indicate that the mean achievement motivation of any age
category differs from other – ANOVA can be used
Principles of ANOVA Experiment
• Three basic principles of design of experiments:
– Randomization,
– Replication
– Local control
• Only randomization and replication need to be satisfied by the one-way ANOVA experiments.
• Randomization refers to the random allocation of the treatment to experimental units.
• Replication refers to the application of each individual level of the factor to multiple subjects.
– Experiment must be replicated in more than one subject
T-test vs. One-way ANOVA
• One-Way ANOVA is used to compare the means
of more than two independent groups.
– the effect of different levels of only one factor on the
dependent variable is investigated.
• One-way ANOVA is used for more than two groups as two groups may be compared using t-test.
• In comparing two group means, the t and F are related as F = t²
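The relation F = t² for two groups can be checked numerically. The sketch below computes the pooled two-sample t and the one-way ANOVA F on the same two hypothetical groups; data and function names are invented for illustration.

```python
import math
from statistics import mean, variance

# Hypothetical scores for two independent groups.
group_a = [21, 22, 23, 24, 25]
group_b = [23, 24, 25, 26, 27]

def pooled_t(a, b):
    """Equal-variance two-sample t-statistic."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(sp2 * (1 / na + 1 / nb))

def one_way_anova_f(groups):
    """F = mean square between groups / mean square within groups."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    ssb = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    ssw = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ssb / (k - 1)) / (ssw / (n_total - k))

t_stat = pooled_t(group_a, group_b)
f_value = one_way_anova_f([group_a, group_b])
print(round(t_stat ** 2, 6), round(f_value, 6))
```

Both numbers agree, so with two groups the t-test and one-way ANOVA lead to the same decision.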
Repeated Measures ANOVA
• Used when the same subjects are given different treatments at different time intervals.
• In this design, the same criterion variable is measured many times on each subject.
• Repeated measures are taken at different times in order to see the impact of time on changes in the criterion variable.
• In some studies of repeated measure design, the same criterion variable is compared under two or more different conditions.
– the carryover effect should not exist in administering different treatments on the same subjects
• For example, in order to see the impact of temperature on memory retention, a subject’s memory might be tested once in an air-conditioned atmosphere and then without it.
Assumptions
in Using One-Way ANOVA
1. The data must be measured either on interval or
ratio* scale.
2. The samples must be independent.
3. The dependent variable must be normally
distributed.
4. The population from which the samples have
been drawn must be normally distributed.
5. The variances of the population must be equal.
6. The errors are independent and normally
distributed.
* In case the data is ordinal, a nonparametric alternative such as the Kruskal-Wallis one-way analysis of variance should be used instead of the parametric one-way ANOVA
Practical Exercise
• The data in the following table indicates the psychological health ratings of corporate executives in banking, insurance, and retail sectors. Apply one-way ANOVA to test whether the executives of any particular sector are healthier in their psychological health in comparison to other sectors. Test your hypothesis at 5% as well as 1% level
REPEATED MEASURES ANOVA
Non Parametric Tests
• Abraham Fischler; South eastern University-
Module 9
Parametric vs. Non-Parametric Tests
Two Sample Wilcoxon Rank Sum Test
(Mann-Whitney U Test)
• https://openpress.usask.ca/introtoappliedstatsforpsych/chapter/16-4-two-sample-wilcoxon-rank-sum-test-mann-whitney-u-test/
• This test is an alternative to the two-sample t-test.
• The test assumes that the population of differences has a
symmetric distribution and tests the following hypothesis pair:
– H0: The means of the two populations are the same.
– H1: The means of the two populations are different.
• This is exactly the hypothesis pair tested by the t-test.
• The samples are independent (no pairs). Although the test, as framed
here, compares means (parameters) and not medians, it does not use the
values of the means to do the comparison; it uses the ranks of the
pooled observations, which is why it is a non-parametric test. The rank
sum is evaluated against exact tables for small samples and a normal
approximation for large samples.
• The Mann-Whitney / Wilcoxon rank sum test is a
non-parametric alternative to the independent
samples t-test.
• So the data file is organized the same way in SPSS:
– one independent (grouping) variable with two qualitative levels
and one dependent variable
• Choose Analyze > Nonparametric Tests > Legacy
Dialogs > 2 Independent Samples
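To make the interpretation of the SPSS output more concrete, here is a minimal sketch of how the U statistic and its large-sample P value are obtained from ranks. It uses only the Python standard library, hypothetical scores, and a normal approximation without tie or continuity correction, so the P value can differ slightly from the exact and asymptotic values SPSS reports. The function names are illustrative.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def mann_whitney_u(x, y):
    """Return (U, z, two-sided p) using the normal approximation."""
    n1, n2 = len(x), len(y)
    ranks = average_ranks(list(x) + list(y))  # rank the pooled observations
    r1 = sum(ranks[:n1])                      # rank sum of the first group
    u1 = r1 - n1 * (n1 + 1) / 2
    u = min(u1, n1 * n2 - u1)
    mu = n1 * n2 / 2
    sigma = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mu) / sigma
    p = math.erfc(abs(z) / math.sqrt(2))      # two-sided normal tail area
    return u, z, p

# Hypothetical scores for two independent groups
group_1 = [12, 15, 11, 14, 13, 18]
group_2 = [16, 19, 17, 21, 20, 22]

u, z, p = mann_whitney_u(group_1, group_2)
print(f"U = {u}, z = {z:.3f}, p = {p:.4f}")
```

A small U means the two groups barely overlap in rank order. If p is below the chosen significance level (for example .05), the null hypothesis of no difference between the groups is rejected.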
Paired Wilcoxon Signed Rank Test
• This test is an alternative to the paired sample
t-test; as framed here, it is a hypothesis test about means.
• It is based on the ranks of the absolute paired
differences, and there are two cases:
– one for small samples (exact critical values) and one for large
samples (normal approximation).
• For an understanding of the calculations:
– https://openpress.usask.ca/introtoappliedstatsforpsych/chapter/16-5-paired-wilcoxon-signed-rank-test/
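As a rough illustration of what the signed rank statistic measures, the sketch below ranks the absolute paired differences and sums the ranks of the positive and negative differences separately. It uses only the Python standard library, hypothetical before/after scores, and the large-sample normal approximation without tie or continuity correction; with a sample this small, exact tables would normally be used. The function names are illustrative.

```python
import math

def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def wilcoxon_signed_rank(before, after):
    """Return (W+, W-, W, two-sided p) for paired samples."""
    # Zero differences carry no information about direction and are dropped
    diffs = [a - b for b, a in zip(before, after) if a != b]
    n = len(diffs)
    ranks = average_ranks([abs(d) for d in diffs])
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    w_minus = sum(r for r, d in zip(ranks, diffs) if d < 0)
    w = min(w_plus, w_minus)
    mu = n * (n + 1) / 4
    sigma = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    p = math.erfc(abs(w - mu) / sigma / math.sqrt(2))
    return w_plus, w_minus, w, p

# Hypothetical scores for the same eight subjects before and after a treatment
before = [72, 65, 80, 77, 69, 74, 83, 70]
after = [75, 64, 86, 82, 67, 81, 92, 74]

w_plus, w_minus, w, p = wilcoxon_signed_rank(before, after)
print(f"W+ = {w_plus}, W- = {w_minus}, W = {w}, p = {p:.4f}")
```

If the treatment had no effect, W+ and W- would be roughly equal. A very small W, as here, indicates that almost all of the large differences point in the same direction.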
Kruskal-Wallis Test (H Test)
• Rank-based nonparametric test that can be used
to determine if there are statistically significant
differences between two or more groups of an
independent variable on a continuous or ordinal
dependent variable.
• It is considered the nonparametric alternative to
the one-way ANOVA, and
– an extension of the Mann-Whitney U test to allow the
comparison of more than two independent groups
Examples
• To understand whether exam performance, measured on a
continuous scale from 0-100, differed based on test anxiety levels
(i.e., your dependent variable would be "exam performance" and
your independent variable would be "test anxiety level", which has
three independent groups: students with "low", "medium" and
"high" test anxiety levels).
• To understand whether attitudes towards pay discrimination, where
attitudes are measured on an ordinal scale, differed based on job
position (i.e., your dependent variable would be "attitudes towards
pay discrimination", measured on a 5-point scale from "strongly
agree" to "strongly disagree", and your independent variable would
be "job position", which has three independent groups: "shop
floor", "middle management" and "boardroom").
H test
• It is an omnibus test statistic and cannot tell you
which specific groups of your independent
variable are statistically significantly different
from each other; it only tells you that at least two
groups were different.
• Since you may have three, four, five or more
groups in your study design, determining which
of these groups differ from each other can be
done using a post hoc test.
H test: Assumptions
1. Dependent variable should be measured at the ordinal or continuous level (i.e.,
interval or ratio).
2. Independent variable should consist of two or more categorical, independent
groups.
– Typically, a Kruskal-Wallis H test is used when you have three or more categorical,
independent groups, but it can be used for just two groups (although a Mann-Whitney U test is
more commonly used for two groups).
3. Independence of observations, which means that there is no relationship
between the observations in each group or between the groups themselves. For
example, there must be different participants in each group with no participant
being in more than one group.
4. In order to know how to interpret the results from a Kruskal-Wallis H test, you
have to determine whether the distributions in each group (i.e., the
distribution of scores for each group of the independent variable) have the
same shape (which also means the same variability).
H test vs. ANOVA
• The Kruskal-Wallis H test does not assume
normality in the data and is much less
sensitive to outliers, so it can be used when
the normality assumption has been violated or
outliers are present and the use of a one-way
ANOVA is inappropriate.
• In addition, if your data are ordinal, a one-way
ANOVA is inappropriate, but the Kruskal-Wallis
H test is appropriate.
H test: Example
• A medical researcher has heard anecdotal evidence that certain anti-depressive drugs can have the
positive side-effect of lowering neurological pain in those individuals with chronic, neurological
back pain, when administered in doses lower than those prescribed for depression.
• The medical researcher would like to investigate this anecdotal evidence with a study. The
researcher identifies 3 well-known, anti-depressive drugs which might have this positive side effect,
and labels them Drug A, Drug B and Drug C.
• The researcher then recruits a group of 60 individuals with a similar level of back pain and
randomly assigns them to one of three groups – Drug A, Drug B or Drug C treatment groups – and
prescribes the relevant drug for a 4 week period.
• At the end of the 4 week period, the researcher asks the participants to rate their back pain on a
scale of 1 to 10, with 10 indicating the greatest level of pain.
• The researcher wants to compare the levels of pain experienced by the different groups at the end
of the drug treatment period. The researcher runs a Kruskal-Wallis H test to compare this ordinal,
dependent measure (Pain_Score) between the three drug treatments (i.e., the independent
variable, Drug_Treatment_Group, is the type of drug with more than two groups).
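The study above can be mimicked on a small scale to show where the H statistic comes from. This is a minimal sketch with the Python standard library, assuming hypothetical pain scores for five participants per drug (the study described has 60 participants, and no actual data are given). The variable names `drug_a`, `drug_b`, `drug_c` and the function names are illustrative. With three groups there are 2 degrees of freedom, for which the chi-square tail probability is exactly exp(-H/2).

```python
import math
from collections import Counter

def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def kruskal_wallis_h(*groups):
    """Tie-corrected Kruskal-Wallis H statistic."""
    pooled = [x for g in groups for x in g]
    n = len(pooled)
    ranks = average_ranks(pooled)
    total, start = 0.0, 0
    for g in groups:
        rank_sum = sum(ranks[start:start + len(g)])
        total += rank_sum ** 2 / len(g)
        start += len(g)
    h = 12 / (n * (n + 1)) * total - 3 * (n + 1)
    # Correction factor for tied scores, common with 1-10 rating scales
    ties = sum(t ** 3 - t for t in Counter(pooled).values())
    return h / (1 - ties / (n ** 3 - n))

# Hypothetical Pain_Score values (1 = least pain, 10 = greatest pain)
drug_a = [7, 8, 6, 9, 7]
drug_b = [5, 4, 6, 5, 3]
drug_c = [2, 3, 1, 4, 2]

h = kruskal_wallis_h(drug_a, drug_b, drug_c)
p = math.exp(-h / 2)  # chi-square upper tail, valid only for df = 2 (three groups)
print(f"H = {h:.3f}, df = 2, p = {p:.4f}")
```

A significant H only says that at least one drug group differs from another; identifying which pairs differ still requires a post hoc test.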
Kolmogorov-Smirnov Test
• The two-sample Kolmogorov-Smirnov (KS) test is a nonparametric test that
compares the cumulative distributions of two data sets (1,2).
• Because it is nonparametric, it does not assume that data are sampled from
Gaussian distributions (or any other defined distributions).
• The results will not change if you transform all the values to logarithms or
reciprocals or apply any other monotonic transformation. The KS test reports the
maximum difference between the two cumulative distributions, and calculates a P
value from that and the sample sizes.
• Converting all values to their ranks also would not change the maximum
difference between the cumulative frequency distributions.
– Thus, although the test analyzes the actual data, it is equivalent to an analysis
of ranks.
– Thus the test is fairly robust to outliers (like the Mann-Whitney test).
Kolmogorov-Smirnov Test
• The null hypothesis is that both groups were sampled
from populations with identical distributions.
– It tests for any violation of that null hypothesis -- different
medians, different variances, or different distributions.
• Because it tests for more deviations from the null
hypothesis than does the Mann-Whitney test, it has
less power to detect a shift in the median but more
power to detect changes in the shape of the
distributions.
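The points above about the maximum difference and about transformations can be illustrated with a short sketch. It uses only the Python standard library and two hypothetical samples of positive values. The P value shown uses the large-sample Kolmogorov series, which is only a rough approximation for samples this small; statistical software typically reports an exact or corrected value. The function names are illustrative.

```python
import math

def ecdf(sample, x):
    """Empirical cumulative distribution: share of sample values <= x."""
    return sum(v <= x for v in sample) / len(sample)

def ks_statistic(a, b):
    """Maximum vertical distance D between the two empirical CDFs."""
    return max(abs(ecdf(a, x) - ecdf(b, x)) for x in list(a) + list(b))

def ks_asymptotic_p(d, n1, n2, terms=100):
    """Large-sample approximation to the two-sided P value."""
    lam = d * math.sqrt(n1 * n2 / (n1 + n2))
    if lam == 0:
        return 1.0
    series = sum((-1) ** (j - 1) * math.exp(-2 * j * j * lam * lam)
                 for j in range(1, terms + 1))
    return min(1.0, max(0.0, 2 * series))

# Hypothetical positive measurements from two independent groups
sample_1 = [1.2, 2.4, 3.1, 4.8, 5.5, 6.9]
sample_2 = [4.1, 5.0, 7.3, 8.8, 9.6, 12.0]

d_raw = ks_statistic(sample_1, sample_2)
# Monotonic transformations leave the rank order, and therefore D, unchanged
d_log = ks_statistic([math.log(v) for v in sample_1],
                     [math.log(v) for v in sample_2])
d_recip = ks_statistic([1 / v for v in sample_1],
                       [1 / v for v in sample_2])
p = ks_asymptotic_p(d_raw, len(sample_1), len(sample_2))

print(f"D = {d_raw:.4f}, D(log) = {d_log:.4f}, D(reciprocal) = {d_recip:.4f}")
print(f"approximate p = {p:.4f}")
```

The statistic D is the same for the raw, log-transformed, and reciprocal values, which is why the test is described as equivalent to an analysis of ranks.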
