
UNIT 4

HANDLING DATA IN RELEVANT STATISTICAL SOFTWARE
Topics to be covered
• Identifying Variables: Nominal, Ordinal, Interval, Ratio, entering data,
labelling and sorting of data, computing new variable, recoding existing
variable into new variable.
• Steps to be followed for Computing Variable, and Recoding
• Conditions when to apply different tests while comparing means with
assumptions.
• One Sample t-test, Independent Sample t-test, Paired Sample t-test,
(Assumptions Testing and Inferential Analysis): Interpretation of results,
Identify type of test
• One-way ANOVA, Repeated Measure ANOVA
• Non-Parametric Tests: Kolmogorov-Smirnov, Kruskal-Wallis and Wilcoxon tests.
– Focus on interpretation and not on calculations
What is SPSS
• Software package used for statistical analysis of data.
• Produced by SPSS Inc. in 1968.
• SPSS originally stood for “Statistical Package for the Social Sciences”.
• The name was later changed to “Statistical Product and Service Solutions”.
• Acquired by IBM in 2009; now known as IBM SPSS Statistics.


Opening SPSS
• The default window will have the Data Editor.
• There are two sheets in the window:
  1. Data View  2. Variable View

Enter data in SPSS directly
Example: Hospital-stay data
Table 2.11 Hospital-stay data
(Stay = duration of hospital stay; Sex: 1 = M, 2 = F; Temp = first temperature
following admission; WBC = first WBC (× 10³) following admission;
Antibiotic = received antibiotic, Culture = received bacterial culture:
1 = yes, 2 = no; Service: 1 = med., 2 = surg.)

ID no.  Stay  Age  Sex  Temp   WBC  Antibiotic  Culture  Service
 1        5   30   2    99.0    8       2          2        1
 2       10   73   2    98.0    5       2          1        1
 3        6   40   2    99.0   12       2          2        2
 4       11   47   2    98.2    4       2          2        2
 5        5   25   2    98.5   11       2          2        2
 6       14   82   1    96.8    6       1          2        2
 7       30   60   1    99.5    8       1          1        1
 8       11   56   2    98.6    7       2          2        1
 9       17   43   2    98.0    7       2          2        1
10        3   50   1    98.0   12       2          1        2
11        9   59   2    97.6    7       2          1        1
12        3    4   1    97.8    3       2          2        2
13        8   22   2    99.5   11       1          2        2
14        8   33   2    98.4   14       1          1        2
15        5   20   2    98.4   11       2          1        2
16        5   32   1    99.0    9       2          2        2
17        7   36   1    99.2    6       1          2        2
18        4   69   1    98.0    6       2          2        2
19        3   47   1    97.0    5       1          2        1
20        7   22   1    98.2    6       2          2        2
21        9   11   1    98.2   10       2          2        2
22       11   19   1    98.6   14       1          2        2
23       11   67   2    97.6    4       2          2        1
24        9   43   2    98.6    5       2          2        2
25        4   41   2    98.0    5       2          2        1
• Columns: variables
• Rows: cases
(under Data View)
Enter Variables

1. Click Variable View.
2. Type the variable name under the Name column (e.g., AGE).
   NOTE: A variable name can be up to 64 bytes long, and the first
   character must be a letter or one of the characters @, #, or $.
3. Type: numeric, string, etc.
4. Label: description of the variable.
Enter variables

• Based on your code book!

Enter cases
Import data from Excel
• Select File > Open > Data
• Choose Excel as file type
• Select the file you want to import
• Then click Open

Open Excel files in SPSS

Continue

• Save this file as SPSS data
Types of Data
• Variable- any characteristic that is recorded for subjects in a
study
– Categorical- if each observation belongs to one of a set of categories
– Quantitative- if observations on it take numerical values that represent
different magnitudes of the variable
• Discrete- if its possible values form a set of separate numbers, such as 0, 1,
2, …
• Continuous- if its possible values form an interval
Other Valuable Terminology
• Parameter- a numerical summary of the population
• Statistic- a numerical summary of a sample taken from the population
• Frequency table- a listing of possible values for a variable, together with
  the number of observations for each value
  – Relative frequency- proportions and percentages
Scales of Measurement
• Nominal Scale - groups or classes
  Example: gender
• Ordinal Scale - order matters
  Example: ranks (top ten videos)
• Interval Scale - difference or distance matters; has an arbitrary zero value.
  Examples: temperatures (°F, °C), Likert scales
• Ratio Scale - ratio matters; has a natural zero value.
  Example: salaries
Data management
• Defining variables
• Coding values
• Entering and editing data
• Creating new variables
• Recoding variables
• Selecting cases
Data analysis
• Univariate statistics
• Bivariate statistics
• Multivariate statistics
The data entry process
• Define your variables in Variable View
• Enter the data, the values of the variables,
in Data View
Definition of variables
10 characteristics are used to define a variable:

Name, Type, Width, Decimals, Label, Values, Missing, Columns, Align, Measure
Name
• Each variable must have a unique name starting with a letter; names could
  be at most 8 characters in older SPSS versions (up to 64 bytes in current
  versions)
• Try to give meaningful variable names:
– Describing the characteristic: for example, age
– Linking to the questionnaire: for example, A1Q3
• Keep the names consistent across files
Type
• Internal formats:
  – Numeric
  – String (alphanumeric)
  – Date
• Output formats:
  – Comma
  – Dot
  – Scientific notation
  – Dollar
  – Custom currency
Numeric
• Numeric variables:
– Numeric measurements
– Codes
• Definition of the size of the variable
String (alphanumeric)
• String variables contain words or characters; strings can include digits,
  but these are treated as characters, so mathematical operations cannot
  be applied to them
• The maximum size of a string variable is 255
characters
Date
• The input format for date variables must be
defined, such as DD/MM/YYYY, MM/DD/YYYY
or MM/DD/YY
• Computers store dates as numbers from a
base date; in SPSS, dates are stored as the
number of seconds from 14 October 1582
Example

• Create two variables:
  – ID: the unique identifier, which will be alphanumeric with a maximum
    of 8 characters
  – Age: the age of the respondent measured in years, a discrete variable
    ranging between 10 and 100
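
For comparison outside SPSS, the same setup can be sketched in Python with pandas. This is a minimal illustration only; the ID values, cut-points, and column names are hypothetical and not part of the SPSS example. It also covers two topics from the unit outline: computing a new variable and recoding an existing variable into a new one.

```python
import pandas as pd

# Define the two variables: ID (alphanumeric, max 8 characters) and Age (10-100).
df = pd.DataFrame({
    "ID":  ["R0001", "R0002", "R0003"],   # unique identifier (string)
    "Age": [23, 47, 68],                  # age in years (numeric, discrete)
})

# Computing a new variable (cf. SPSS: Transform > Compute Variable).
df["AgeMonths"] = df["Age"] * 12

# Recoding an existing variable into a new variable
# (cf. SPSS: Transform > Recode into Different Variables).
df["AgeGroup"] = pd.cut(df["Age"], bins=[9, 30, 50, 100],
                        labels=["10-30", "31-50", "51-100"])
print(df)
```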
Labels
• Descriptors for the variables
• Maximum 255 characters
• Used in the output
Values
• Value labels are descriptors of the categories
of a variable
• Coding
Missing
• Defines missing values
• The values are excluded from some analysis
• Options:
– Up to 3 discrete missing values
– A range of missing values plus one discrete
missing value
Columns and Align
• Columns sets the amount of space reserved to
display the contents of the variable in Data
View; generally the default value is adequate
• Align sets whether the contents of the variable
appear on the left, centre or right of the cell in
Data View
• Numeric variables are right-hand justified by
default and string variables left-hand justified
by default; the defaults are generally adequate
Measure
• Levels of measurement:
– Nominal
– Ordinal
– Interval
– Ratio
• In SPSS, interval and ratio are designated
together as Scale
• The default for string variables is Nominal
• The default for numeric variables is Scale
Data Management
• Entering data
• Defining variables
• Importing data
• Sorting
• Filtering
Data Analysis
• Descriptive Statistics
• Frequency table
• Charts
• Crosstabs-
– Chi-square Test
Chi-Square Test: Applications & Procedure in
SPSS
• The chi-square test is used for two purposes:
– first, to test the goodness of fit and,
– second, to test the independence of two
attributes.
• In both situations, we intend to determine whether the observed
frequencies significantly differ from the theoretical (expected)
frequencies.
Assumptions of Chi-Square Test
• Sample must be random.
• Frequencies of each attribute must be numeric and should not be in
percentages or ratios.
• Sample size must be sufficiently large. The chi-square test shall yield
inaccurate findings if the sample size is small. In that case, the researcher
might end up committing a type II error.
• The observations must be independent of each other. In other words, the
chi-square test cannot be used to test correlated data. In that situation,
McNemar’s test is used.
• Normally, all cell frequencies must be 5 or more. In large contingency
tables, 80% of cell frequencies must be 5 or more. If this assumption is not
met, the Yates’ correction is applied.
• The expected frequencies should not be too low. Generally, it is
acceptable if 20% of the events have expected frequencies less than 5, but
in case of chi-square with one degree of freedom, the conclusions may not
be reliable if expected frequencies are less than 10. In all such cases,
Yates’ correction must be applied.
To Test the Goodness of Fit
• To know whether the pattern of
frequencies that are observed fits well
with the expected ones or not.
– a chi-square test for goodness of fit is used
to verify whether an observed frequency
distribution differs from a theoretical
distribution or not
To Test the Goodness of Fit
• The chi-square test for goodness of fit can also
be used to test an equal occurrence hypothesis.
• Example: By using this test, one can test
whether all brands are equally popular, or
whether all the car models are equally
preferred.
• In using the chi-square test for goodness of fit,
only one categorical variable is involved
Example
• A beverages company produces a cold drink in three different colors.
One hundred and twenty college students were asked about their
preferences. The responses are shown in the table. Do these data show
that all three colors were equally liked by the students? Test your
hypothesis at the .05 level of significance.

Color         White   Orange   Brown
Frequencies    50      40       30
Solution
• Here it is required to test the null hypothesis of equal
occurrence; hence, expected frequencies
corresponding to each of the three observed
frequencies shall be obtained by dividing the total of
all the observed frequencies by the number of
categories.
• Hence, the expected frequency (fe) for each category shall be
120/3 = 40
• Here, the number of categories or rows is r = 3 and the number of
columns is c = 2.
Solution contd.
Chi-square = Σ(fo − fe)²/fe = (50 − 40)²/40 + (40 − 40)²/40 + (30 − 40)²/40 = 5.0
Solution: Testing the Significance of Chi-Square

• The degrees of freedom = (r − 1)(c − 1) = 2
• From the chi-square table, the critical value (5% level and two degrees
of freedom) is 5.991
• Since the calculated value of the chi-square statistic (5.0) is less than
the critical value, the null hypothesis may not be rejected at the .05
level of significance.
• Thus, it may be concluded that all the three colors of
cold drinks are equally liked by the college students.
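
The worked example can be checked with a short scipy computation; this is an illustrative sketch alongside the SPSS procedure, not a replacement for it. With no expected frequencies supplied, scipy.stats.chisquare tests the equal-occurrence hypothesis:

```python
from scipy.stats import chisquare

observed = [50, 40, 30]        # white, orange, brown preferences
stat, p = chisquare(observed)  # expected defaults to 120/3 = 40 per category
print(stat, p)                 # chi-square = 5.0, p ≈ 0.082
# Since p > .05 (equivalently, 5.0 < 5.991), H0 of equal liking is not rejected.
```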
To Test the Independence of Attributes
• The chi-square test of independence is used to know whether
paired observations on two attributes, expressed in a contingency
table, are independent of each other.

• The chi-square test may be used to test the significance of an
association between any two attributes:
  – For instance, one may test the significance of association between:
    • income level & brand preference,
    • family size & television size purchased,
    • educational background & the type of job one does.
Example of Independence of Attributes
• Consider a situation where it is required to test the significance of association
between Gender (male and female) and Response (“prefer day shift” and
“prefer night shift”). In this situation, following hypotheses may be tested:
– H0: Gender and Response toward shift preferences are independent.
– H1: There is an association between the Gender and Response toward shift preferences.
• The calculated value of chi-square is compared with that of its tabulated value
for testing the null hypothesis.
• Thus, if the calculated chi-square is less than the tabulated chi-square with
df = (r − 1)(c − 1) at some level of significance, then H0 may not be rejected;
otherwise H0 may be rejected.
• Remark: If H0 is rejected, we may interpret that there is a significant
association between gender and shift preference. Here, significant
association simply means that the response pattern of males and females is
different.
Example: Independence of Attributes
• Five hundred families were investigated to test the
belief that high income people usually prefer to visit
private hospitals and low-income people often go to
government hospitals whenever they fall sick.
• The results so obtained are shown in Table
• Test whether income and hospital preferences are
independent. Compute the contingency coefficient
to find the strength of association. Test your
hypothesis at 5% level.
Solution
• The null hypothesis to be tested is
– H0: Income and hospital preferences are
independent.
• Before computing the value of chi-square, the
expected frequencies for each cell need to be
computed with the marginal totals and grand
totals given in the observed frequency (fo)
table
Solution
• Observed and expected frequency tables (not reproduced here)

Solution
• Calculation of the chi-square statistic (not reproduced here)
Solution
Test of Significance
Here, r = 2 and c = 2, and therefore the degrees of freedom
= (r − 1)(c − 1) = 1.
From the chi-square table, the critical value = 3.841.
Since the calculated value > the table value, the null hypothesis may be
rejected at the .05 level of significance. It may therefore be concluded
that there is an association between income level and the type of
hospital preferred by the people.
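
Since the observed-frequency table is not reproduced above, the sketch below uses hypothetical 2×2 counts (totaling 500 families) purely to show the mechanics in scipy; it also computes the contingency coefficient C that the exercise asks for and that a later slide defines:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Hypothetical counts (income level x hospital preference); the real values
# are in the textbook table.
observed = np.array([[180, 70],    # high income: private, government
                     [90, 160]])   # low income:  private, government

# Yates' correction is applied by default for 2x2 tables (dof = 1).
chi2, p, dof, expected = chi2_contingency(observed)
print(chi2, p, dof)

# Contingency coefficient: C = sqrt(chi2 / (chi2 + N)).
N = observed.sum()
C = np.sqrt(chi2 / (chi2 + N))
print(C)
```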
Testing the Significance of Chi-Square in
SPSS
• In SPSS, the null hypothesis is not tested on the basis of the comparison
between the calculated and tabulated chi-square; rather, SPSS uses the
concept of the p value.
• The p value is the probability of obtaining a result at least as extreme
as the one observed, assuming that the null hypothesis is true.
• Thus, the chi-square is said to be significant at the 5% level if the
p value is less than .05 and insignificant if it is more than .05.
Contingency Coefficient
• Contingency coefficient (C) provides the magnitude of association between
the attributes in the cross tabulation. Its value can range from 0 (no
association) to 1 (the theoretical maximum possible association).
• Chi-square simply tests the significance of an association between any two
attributes but does not provide the magnitude of the association. Thus, if the
chi-square value becomes significant, one must compute the contingency
coefficient (C) to know the extent of association between the attributes. The
contingency coefficient C is computed by the following formula:

C = √(χ² / (χ² + N))

– where N is the sum of all frequencies in the contingency table.


Example
• Out of 200 MBA students, 40 were given academic counseling throughout
the semester, whereas another 40 did not receive this counseling. On the
basis of their marks in the final examination, their performance was
categorized as improved, unchanged, or deteriorated. Based on the results
shown in the table, can it be concluded that the academic counseling is
effective at the 5% level?
Solution
• In order to check whether academic counseling is effective, we shall test
the significance of association between treatment and performance.
• If the association between these two attributes is significant, then it may
be interpreted that the pattern of performance in the counseling and
control groups is not the same.
  – In that case, it might be concluded that the counseling is effective,
    since the number of improved cases is higher in the counseling group
    than in the control group.
• Thus, it is important to compute the chi-square first in order to test the
null hypothesis:
  – H0: There is no association between treatment and performance, against
    the alternative hypothesis:
  – H1: There is an association between treatment and performance.
Chi square in SPSS: Testing the hypothesis of
Equal Occurrence
• In a study, 90 workers were tested for their job satisfaction.
Their job satisfaction level was obtained on the basis of the
questionnaire, and the respondents were classified into one
of the three categories, namely, low, average, and high. The
observed frequencies are shown in Table. Compute chi-square
in testing whether there is any specific trend in their job
satisfaction.
Correlation & Partial Correlation Exercise
Steps in Regression
• Compute descriptive statistics like mean, standard deviation, skewness, kurtosis, frequency
distribution, etc., and check the distribution of each variable by testing the significance of
skewness and kurtosis.
• Assess the linearity of each independent variable with the dependent variable by plotting the
scatter diagram.
• Check for multicollinearity among the independent variables by computing the correlation
matrix among the independent variables. If multicollinearity exists between the independent
variables then one of the independent variables must be dropped as it does not explain
additional variability in the dependent variable.
• Develop a regression equation by using the unstandardized regression coefficients (B
coefficients).
• Test the significance of the regression coefficients by using the t-test. As a rule of thumb, a t-
value greater than 2.0 is usually statistically significant but one must consult a t-table to be
sure.
• Test the significance of the regression model by using the F-test. The F-value is computed by
dividing the explained variance by the unexplained variance. In general, an F-value greater
than 4.0 is usually statistically significant, but one must consult an F-table to be sure.
• Compute R2 and adjusted R2 to know the percentage variance of the dependent variable as
explained by all the independent variables together in the regression model.
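
As a rough sketch of these steps outside SPSS, ordinary least squares in statsmodels reports the B coefficients, their t-tests, the model F-test, and R²/adjusted R² together. The data below are simulated purely for illustration:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))                      # two hypothetical predictors
y = 3.0 + 1.5 * X[:, 0] - 0.8 * X[:, 1] + rng.normal(size=100)

model = sm.OLS(y, sm.add_constant(X)).fit()        # add_constant = intercept term
print(model.params)                                # unstandardized B coefficients
print(model.tvalues, model.pvalues)                # t-test of each coefficient
print(model.fvalue, model.f_pvalue)                # F-test of the whole model
print(model.rsquared, model.rsquared_adj)          # R^2 and adjusted R^2
```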
Regression Exercise
• In order to assess the feasibility of a guaranteed annual wage,
the Rand Corporation conducted a study to assess the
response of labor supply in terms of average hours of work (Y)
based on different independent parameters. The data were
drawn from a national sample of 6,000 households whose heads
earned less than $15,000 annually. These data are given in
(Excel). Apply regression analysis by using SPSS to suggest a
regression model for estimating the average hours worked
during the year based on identified independent parameters.
Regression Exercise
Data on average yearly hours and other socioeconomic variables
• Hours(X1): average hours worked during the year
• Rate(X2): average hourly wage (dollars)
• ERSP(X3): average yearly earnings of spouse (dollars)
• ERNO(X4): average yearly earnings of other family members
(dollars)
• NEIN(X5): average yearly non-earned income
• Assets(X6): average family asset holdings (bank account)
(dollars)
• Age(X7): average age of respondent
• Dep(X8): average number of dependents
• School(X9): average highest grade of school completed
Regression Exercise: Solution
To develop the regression model for estimating the average hours of working
during the year for guaranteed wages on the basis of socioeconomic
variables, do the following steps:

1. Choose the “stepwise regression” method in SPSS to get the regression
coefficients of the independent variables identified in the model for
developing the regression equation.
2. Test the regression coefficients for their significance through the t-test
by using the significance value (p value) in the output.
3. Test the regression model for its significance through the F-value by
looking at its significance value (p value) in the output.
4. Use the value of R2 in the output to know the amount of variance
explained in the dependent variable by the identified independent
variables together in the model.
Regression Exercise 2
The data on copper industry and its determinants in the US market during
1951–1980 are shown in the following table. Construct a regression model
and develop the regression equation by using SPSS. Test the significance
of regression coefficients and explain the robustness of the regression model
to predict the price of the copper in the US market.
• DPC = 12-month average US domestic price of copper (cents per pound)
• GNP = annual gross national product ($, billions)
• IIP = 12-month average index of industrial production
• MEPC = 12-month average London Metal Exchange price of copper
  (pounds sterling)
• NOH = number of housing starts per year (thousands of units)
• PA = 12-month average price of aluminum (cents per pound)
Note: The data are from sources such as American Metal Market, Metals
Week, and US Department of Commerce publications
Limitations of Multiple Regression
• Like simple regression, multiple regression will not be efficient if the
independent variables are not linearly related to the dependent variable.
• Multiple regression can be used only if the variables are measured on an
interval or ratio scale. If the data are measured on some other scale,
other methods should be used for estimation.
• Simple regression, having one dependent and one independent variable,
usually requires a minimum of 30 observations. In general, add at least
10 observations for each additional independent variable added to the
study.
Hypothesis Testing
• Hypotheses are assertions or statements about certain characteristics
of the population.
• A hypothesis is said to be a statistical hypothesis if the following three
conditions prevail:
  – The population may be defined.
  – A sample may be drawn.
  – The sample may be evaluated to test the hypothesis.
• Statistical hypotheses are based on the concept of proof by contradiction.
For example, consider that a hypothesis concerning the population mean (μ)
is tested to see if an experiment has caused an increase or decrease in μ.
This is done by proof of contradiction, by formulating a null hypothesis.
Null Hypothesis
• The null hypothesis is a hypothesis of no difference.
• It is denoted by H0. It is formulated to test an alternative hypothesis.
• The null hypothesis is assumed to be true.
  – By assuming the null hypothesis to be true, the distribution of the test
    statistic can be well defined.
  – Further, “null” signifies the unbiased approach of the researcher in
    testing the research hypothesis.
• The researcher verifies the null hypothesis by assuming that it is true and
rejects it in favor of the research hypothesis if any contradiction is
observed.
Alternative Hypothesis
• The alternative hypothesis is also known as the research hypothesis. In any
research study, the researcher first develops a research hypothesis for
testing some parameter of the population, and accordingly a null
hypothesis is formulated to verify it.
• The alternative hypothesis is denoted by H1.
• The alternative hypothesis means that there is a difference between the
population parameter and the sample value.
• In testing of hypothesis, the whole focus is to test whether the research
hypothesis can be accepted or not, and this is done by contradicting the
null hypothesis.
Other Concepts
• Critical Value & Critical region
• Rejection region
• Level of significance
• Type I and Type II Errors
• One tail and two tail tests
Testing Hypothesis
• In case of a large sample (n > 30), for testing a hypothesis concerning the
mean, the z-test is used.
• However, in case of a small sample (n < 30), the distribution of the sample
mean follows the t-distribution if the population variance is not known.
  – In such a situation, the t-test is used.
• In case the population standard deviation (σ) is unknown, it is estimated
by the sample standard deviation (S).
• For different sample sizes, the t-curve is different, and it approaches the
normal curve for sample size n > 30.
  – All these curves are symmetrical and bell shaped and distributed around
    t = 0. The exact shape of the t-curve depends on the degrees of freedom.
• In one-way ANOVA, the comparison between between-group variance and
within-group variance is done by using the F-statistic.
One Sample Test
• A t-test can be defined as a statistical test used for testing a hypothesis
in which the test statistic follows Student’s t-distribution under the
assumption that the null hypothesis is true.
• Used if the population standard deviation is not known and the
distribution of the population from which the sample has been drawn is
normal.
  – Small sample size (n < 30) where the population standard deviation is
    not known.
  – Large sample (n > 30) where the population standard deviation is not
    known.
• Used to test whether the population mean is equal to a predefined value
or not.
• An example of a one-sample t-test may be to see whether the population
average sleep time is equal to 5 h or not.
One Sample Test
• The t-statistic is computed by the following formula:

  t = (x̄ − μ0) / (S / √n)

  where x̄ is the sample mean, μ0 the hypothesized population mean, S the
  sample standard deviation, and n the sample size.

• The calculated t is compared with the tabulated t at the 0.05 level of
significance and n − 1 degrees of freedom if the hypothesis is to be
tested at the 5% level.
• If the p value is less than .05, the t-statistic becomes significant, and
we reject the null hypothesis in favor of the alternative hypothesis.
  – On the other hand, if the p value is more than 0.05, the null
    hypothesis fails to be rejected.
Exercise
• A professor wishes to know if his statistics class has a good
background of basic math. Ten students were randomly
chosen from the class and were given a math proficiency test.
Based on the previous experience, it was hypothesized that
the average class performance on such math proficiency test
is 75. The professor wishes to know whether this hypothesis
may be accepted or not. Test your hypothesis at 5% level
assuming that the distribution of the population is normal.
The scores obtained by the students are as follows:

Math proficiency score: 71, 60, 80, 73, 82, 65, 90, 87, 74, and 72
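
A quick scipy check of this exercise, using the ten scores given above (in SPSS the same result comes from a one-sample t-test with test value 75):

```python
from scipy.stats import ttest_1samp

scores = [71, 60, 80, 73, 82, 65, 90, 87, 74, 72]
t, p = ttest_1samp(scores, popmean=75)
print(t, p)  # t ≈ 0.13, p ≈ 0.90: fail to reject H0 (mean = 75) at the 5% level
```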
Two-Sample t-Test
for Unrelated Groups
• Used for testing the hypothesis of equality of
means of two normally distributed populations
• We often want to compare the means of two
different populations
• For example, comparing the effect of two
different diets on weights, the effect of two
teaching methodologies on the performance,
or the IQ of boys and girls.
Two-Sample t-Test
for Unrelated Groups: Assumptions
• The distributions of both the populations from
which the samples have been drawn are
normally distributed.
• The variances of the two populations are
nearly equal
• Population variances are unknown.
• The samples are independent of each other.
Two-Sample t-Test
for Unrelated Groups: Exercise
• The counseling cell of a college conducts sessions with students facing
problems, using different methods. Since the number of visitors to the
center keeps increasing every day, they have decided to test whether
audio-visual-based counseling and personal counseling are equally
effective in reducing stress levels. Eighteen women students were
randomly chosen among those who visited the center. Nine of them were
given personal counseling, whereas the other nine were given sessions
with an audiovisual presentation. After the session, the students were
tested for their stress level. The data so obtained are shown in the table.
Two Sample t-test Exercise---- contd.

• Test your hypothesis at the 1% level, whether any one method of
counseling is better than the other. It is assumed that the population
variances are equal and both populations are normally distributed.
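
A scipy sketch of the corresponding computation; the stress scores below are hypothetical stand-ins, since the textbook table is not reproduced here:

```python
from scipy.stats import ttest_ind

# Hypothetical post-session stress scores for the two groups of nine students.
personal     = [22, 25, 19, 24, 21, 23, 20, 26, 22]
audio_visual = [27, 24, 29, 26, 28, 25, 27, 30, 26]

# equal_var=True matches the stated assumption of equal population variances.
t, p = ttest_ind(personal, audio_visual, equal_var=True)
print(t, p)  # reject H0 of equal means at the 1% level only if p < .01
```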
Paired t-Test for Related Groups
• The paired t-test is used to test the null hypothesis that the difference
between two responses measured on the same experimental units has a
mean value of zero.
• Used to test the research hypothesis as to whether the post-treatment
response is different from the pre-treatment response.
• The paired t-test is used in all those situations where there is only one
experimental group and no control group.
  – The question tested here is whether the treatment is effective or not.
  – This is done by measuring the responses of the subjects in the
    experimental group before and after the treatment.
Paired t-Test for Related Groups
• Also known as “repeated measures” t-test.
• In using the paired t-test, the data must be obtained in pair on the
same set of subjects before and after the experiment.
• While applying the paired t-test for two related groups, the pairwise
  differences di are computed for all n paired data. The mean (d̄) and
  standard deviation (Sd) of the differences di are calculated.
• The paired t-statistic is computed as:

  t = d̄ / (Sd / √n)

• where t follows Student’s t-distribution with n − 1 degrees of freedom.
Paired t-Test for Related Groups:
Assumptions
• The distribution of the population is normal.
• The distribution of scores obtained by pairwise
difference is normal, and the differences are a
random sample.
• Cases must be independent of each other.

Remark: If the normality assumption is not fulfilled, you may use the
non-parametric Wilcoxon signed-rank test for paired difference designs.
Paired t-Test for Related Groups: Exercise
• Twelve women participated in a nutritional education program. Their
calorie intake before and after the program was measured, as shown in
the table. Can you conclude that the nutritional education program was
successful in reducing the participants’ calorie requirements? Test your
hypothesis at the 5% level, assuming that the differences of the scores
are normally distributed.
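
A scipy sketch of the paired comparison; the calorie values below are hypothetical stand-ins for the textbook table:

```python
from scipy.stats import ttest_rel

# Hypothetical calorie intakes for the same 12 women, before and after.
before = [2450, 2310, 2600, 2190, 2500, 2350, 2420, 2550, 2280, 2400, 2330, 2480]
after  = [2300, 2250, 2480, 2200, 2410, 2300, 2350, 2430, 2250, 2320, 2280, 2390]

t, p = ttest_rel(before, after)  # paired (repeated measures) t-test
print(t, p)  # reject H0 of zero mean difference at the 5% level if p < .05
```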
T-tests: Practice Exercises
• An experiment was conducted to assess the delivery performance of two
pizza companies. Customers were asked to reveal the delivery time of the
pizza they ordered from these two companies. Following are the delivery
times in minutes of the two pizza companies as reported by their
customers. Can it be concluded that the delivery time of the two
companies is different? Test your hypothesis at the 5% level.
T-tests: Practice Exercises
• An experiment was conducted to know the impact of a new
advertisement campaign on the sale of televisions of a particular
brand. The number of television units sold on 12 consecutive
working days before and after launching the advertisement
campaign in a city was recorded. The data obtained are shown
in the table.
T-tests: Practice Exercises
• The age of the 15 randomly
chosen employees of an
organization is shown in
Table. Can it be concluded
that the average age of the
employees in the
organization is 28 years? Test
your hypothesis at 5% level
and interpret your findings.
ANOVA

• One-way ANOVA
• Repeated Measure ANOVA
One-Way ANOVA
• A statistical technique used for comparing the means of more than two
groups
• Tests the null hypothesis that samples in different groups have been
drawn from the same population
• Used in a situation where the data are measured either on an interval
or ratio scale.
One-Way ANOVA
• In one-way ANOVA, group means are compared by comparing the
variability between groups with that of variability within the groups.
– Done by computing an F-statistic
– F-value is computed by dividing the mean sum of squares between the
groups by the mean sum of squares within the groups

• If the groups are drawn from the same population, the variability
between the group means should be comparable to the variability
within the groups, and the F-ratio should be close to 1
• Thus, a high F-value indicates that the samples have been drawn
from different populations
One-Way ANOVA: Example
• A human resource manager may wish to determine whether achievement
motivation differs among employees in three different age categories
(<25, 26–35, and >35 years) after attending a training program.
• Here, the independent variable is the employee’s age category, whereas
achievement motivation is the dependent variable.
• To test whether the data provide sufficient evidence to indicate that
the mean achievement motivation of any age category differs from the
others, ANOVA can be used.
Principles of ANOVA Experiment
• Three basic principles of design of experiments:
  – Randomization
  – Replication
  – Local control
• Only randomization and replication need to be satisfied by one-way
ANOVA experiments.
• Randomization refers to the random allocation of the treatment to
experimental units.
• Replication refers to the application of each individual level of the
factor to multiple subjects.
  – The experiment must be replicated in more than one subject.
T-test vs. One-way ANOVA
• One-way ANOVA is used to compare the means of more than two
independent groups.
  – The effect of different levels of only one factor on the dependent
    variable is investigated.
• One-way ANOVA is used for more than two groups, as two groups may be
compared using the t-test.
• In comparing two group means, the t and F statistics are related as
F = t², as the sketch below illustrates.
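
A quick numerical check of the F = t² relationship (the two samples are arbitrary illustrations):

```python
from scipy.stats import ttest_ind, f_oneway

a = [12, 15, 14, 10, 13, 16]
b = [18, 17, 20, 15, 19, 16]

t, p_t = ttest_ind(a, b)   # pooled-variance two-sample t-test
F, p_F = f_oneway(a, b)    # one-way ANOVA on the same two groups
print(t**2, F)             # identical up to rounding: F = t^2
print(p_t, p_F)            # and the two p values agree
```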
Repeated Measures ANOVA
• Used when the same subjects are given different treatments at different
time intervals.
• In this design, the same criterion variable is measured many times on each
subject.
• Repeated measures are taken at different times in order to see the impact
of time on changes in the criterion variable.
• In some studies of repeated measures design, the same criterion variable is
compared under two or more different conditions.
  – The carryover effect should not exist in administering different
    treatments to the same subjects.
• For example, in order to see the impact of temperature on memory
retention, a subject’s memory might be tested once in an air-conditioned
atmosphere and then without it.
Assumptions
in Using One-Way ANOVA
1. The data must be measured either on interval or
ratio* scale.
2. The samples must be independent.
3. The dependent variable must be normally
distributed.
4. The population from which the samples have been
drawn must be normally distributed.
5. The variances of the population must be equal.
6. The errors are independent and normally distributed.
* In case the data are ordinal, a nonparametric alternative such as the Kruskal-Wallis
one-way analysis of variance should be used instead of the parametric one-way ANOVA
Practical Exercise
• The data in the following table indicate the psychological health
ratings of corporate executives in the banking, insurance, and
retail sectors. Apply one-way ANOVA to test whether the executives
of any particular sector are healthier in their psychological health
in comparison to the other sectors. Test your hypothesis at the 5%
as well as the 1% level.
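
A scipy sketch of the required computation; the ratings below are hypothetical stand-ins, since the table is not reproduced here:

```python
from scipy.stats import f_oneway

# Hypothetical psychological health ratings by sector.
banking   = [62, 58, 65, 60, 57, 63]
insurance = [55, 52, 58, 54, 51, 56]
retail    = [60, 64, 59, 61, 63, 58]

F, p = f_oneway(banking, insurance, retail)
print(F, p)  # reject H0 of equal means at 5% if p < .05, at 1% if p < .01
```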
REPEATED MEASURES ANOVA
Non-Parametric Tests
• Abraham Fischler; Southeastern University, Module 9
Parametric vs. Non-Parametric Tests
Two Sample Wilcoxon Rank Sum Test
(Mann-Whitney U Test)
• https://openpress.usask.ca/introtoappliedstatsforpsych/chapter/16-4-two-sample-wilcoxon-rank-sum-test-mann-whitney-u-test/
Two Sample Wilcoxon Rank Sum Test
(Mann-Whitney U Test)
• This test is an alternative to the two-sample t-test.
• The test assumes that the population of differences has a symmetric
distribution and tests the following hypothesis pair:
  – H0: The means of the two populations are the same.
  – H1: The means of the two populations are different.
  which is exactly the hypothesis tested by the t-test.
• The samples are independent (no pairs) and, although this test compares
means (parameters) and not medians, it does not use the values of the
means to do the comparison; therefore this is a non-parametric test. It is
based on a binomial distribution.
Two Sample Wilcoxon Rank Sum Test
(Mann-Whitney U Test)
• The Mann-Whitney/Wilcoxon rank sum test is a non-parametric alternative
to the independent sample t-test.
• So the data file will be organized the same way in SPSS:
  – one independent variable with two qualitative levels and one
    dependent variable
• Choose Analyze > Nonparametric Tests > Legacy Dialogs >
Independent Samples
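
A minimal scipy sketch with hypothetical data (the values are illustrative only):

```python
from scipy.stats import mannwhitneyu

# Hypothetical scores for two independent groups.
group1 = [34, 28, 41, 25, 37, 30]
group2 = [45, 39, 48, 42, 36, 44]

U, p = mannwhitneyu(group1, group2, alternative="two-sided")
print(U, p)  # small p suggests the two populations differ
```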
Paired Wilcoxon Signed Rank Test
• This test is an alternative to the paired sample t-test; it is a
hypothesis test about means.
• It is based on a binomial distribution, and there are two cases:
  – one for small samples and one for large samples.
• For the details of the calculations, see:
  – https://openpress.usask.ca/introtoappliedstatsforpsych/chapter/16-5-paired-wilcoxon-signed-rank-test/
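
A minimal scipy sketch with hypothetical paired data:

```python
from scipy.stats import wilcoxon

# Hypothetical before/after measurements on the same eight subjects.
before = [85, 78, 92, 88, 75, 80, 83, 90]
after  = [80, 76, 89, 85, 76, 77, 79, 86]

W, p = wilcoxon(before, after)  # signed-rank test on the paired differences
print(W, p)
```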
Kruskal-Wallis Test (H Test)
• A rank-based nonparametric test that can be used to determine if there
are statistically significant differences between two or more groups of
an independent variable on a continuous or ordinal dependent variable.
• It is considered the nonparametric alternative to the one-way ANOVA, and
  – an extension of the Mann-Whitney U test to allow the comparison of
    more than two independent groups.
Examples
• To understand whether exam performance, measured on a continuous
scale from 0-100, differed based on test anxiety levels (i.e., your
dependent variable would be "exam performance" and your independent
variable would be "test anxiety level", which has three independent
groups: students with "low", "medium" and "high" test anxiety levels).

• To understand whether attitudes towards pay discrimination, where
attitudes are measured on an ordinal scale, differed based on job position
(i.e., your dependent variable would be "attitudes towards pay
discrimination", measured on a 5-point scale from "strongly agree" to
"strongly disagree", and your independent variable would be "job
position", which has three independent groups: "shop floor", "middle
management" and "boardroom").
H test
• It is an omnibus test statistic and cannot tell you which specific groups
of your independent variable are statistically significantly different
from each other; it only tells you that at least two groups were
different.
• Since you may have three, four, five or more groups in your study design,
determining which of these groups differ from each other can be done
using a post hoc test.
H test: Assumptions
1. The dependent variable should be measured at the ordinal or continuous
level (i.e., interval or ratio).
2. The independent variable should consist of two or more categorical,
independent groups.
   – Typically, a Kruskal-Wallis H test is used when you have three or more
     categorical, independent groups, but it can be used for just two groups
     (a Mann-Whitney U test is more commonly used for two groups).
3. Independence of observations, which means that there is no relationship
between the observations in each group or between the groups themselves.
For example, there must be different participants in each group, with no
participant being in more than one group.
4. In order to know how to interpret the results from a Kruskal-Wallis
H test, you have to determine whether the distributions in each group
(i.e., the distribution of scores for each group of the independent
variable) have the same shape (which also means the same variability).
H test vs. ANOVA
• Because the Kruskal-Wallis H test does not assume normality in the data
and is much less sensitive to outliers, it can be used when these
assumptions have been violated and the use of a one-way ANOVA is
inappropriate.
• In addition, if your data are ordinal, a one-way ANOVA is inappropriate,
but the Kruskal-Wallis H test is not.
H test: Example
• A medical researcher has heard anecdotal evidence that certain anti-depressive drugs can have the positive
side-effect of lowering neurological pain in those individuals with chronic, neurological back pain, when
administered in doses lower than those prescribed for depression.

• The medical researcher would like to investigate this anecdotal evidence with a study. The researcher
identifies 3 well-known, anti-depressive drugs which might have this positive side effect, and labels them
Drug A, Drug B and Drug C.

• The researcher then recruits a group of 60 individuals with a similar level of back pain and randomly assigns
them to one of three groups – Drug A, Drug B or Drug C treatment groups – and prescribes the relevant drug
for a 4 week period.

• At the end of the 4 week period, the researcher asks the participants to rate their back pain on a scale of 1 to
10, with 10 indicating the greatest level of pain.

• The researcher wants to compare the levels of pain experienced by the different groups at the end of the
drug treatment period. The researcher runs a Kruskal-Wallis H test to compare this ordinal, dependent
measure (Pain_Score) between the three drug treatments (i.e., the independent variable,
Drug_Treatment_Group, is the type of drug with more than two groups).
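
A scipy sketch of the researcher's analysis; the pain scores below are hypothetical, since the study data are not reproduced here:

```python
from scipy.stats import kruskal

# Hypothetical 1-10 pain scores at the end of the 4-week treatment period.
drug_a = [4, 5, 3, 6, 4, 5, 2, 4]
drug_b = [6, 7, 5, 8, 6, 7, 6, 5]
drug_c = [3, 2, 4, 3, 5, 2, 3, 4]

H, p = kruskal(drug_a, drug_b, drug_c)
print(H, p)  # if p < .05, at least two groups differ (a post hoc test follows)
```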
Kolmogorov-Smirnov Test
• The two-sample Kolmogorov-Smirnov test is a nonparametric test that
compares the cumulative distributions of two data sets.
• The test is nonparametric: it does not assume that data are sampled from
Gaussian distributions (or any other defined distributions).
• The results will not change if you transform all the values to logarithms
or reciprocals or apply any other monotonic transformation. The KS test
reports the maximum difference between the two cumulative distributions
and calculates a P value from that and the sample sizes.
• Converting all values to their ranks also would not change the maximum
difference between the cumulative frequencies.
  – Thus, although the test analyzes the actual data, it is equivalent to
    an analysis of ranks.
  – Thus the test is fairly robust to outliers (like the Mann-Whitney test).
Kolmogorov-Smirnov Test
• The null hypothesis is that both groups were sampled from populations
with identical distributions.
  – It tests for any violation of that null hypothesis: different medians,
    different variances, or different distribution shapes.
• Because it tests for more deviations from the null hypothesis than the
Mann-Whitney test does, it has less power to detect a shift in the median
but more power to detect changes in the shape of the distributions.
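
A minimal scipy sketch with hypothetical samples:

```python
from scipy.stats import ks_2samp

# Hypothetical measurements from two independent groups.
group1 = [1.2, 2.4, 1.9, 3.1, 2.2, 2.8, 1.5, 2.0]
group2 = [2.9, 3.5, 3.1, 4.2, 3.8, 2.7, 3.3, 4.0]

D, p = ks_2samp(group1, group2)  # D = maximum gap between the two ECDFs
print(D, p)  # small p: the two distributions differ in some respect
```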
