HANDLING DATA IN
RELEVANT STATISTICAL
SOFTWARE
Topics to be covered
• Identifying Variables: Nominal, Ordinal, Interval, Ratio, entering data,
labelling and sorting of data, computing new variable, recoding existing
variable into new variable.
• Steps to be followed for computing and recoding variables
• Conditions under which to apply different tests when comparing means,
with their assumptions.
• One Sample t-test, Independent Sample t-test, Paired Sample t-test,
(Assumptions Testing and Inferential Analysis): Interpretation of results,
Identify type of test
• One-way ANOVA, Repeated Measure ANOVA
• Non Parametric Tests: Kolmogorov-Smirnov, Kruskal-Wallis and Wilcoxon
tests.
– Focus on interpretation and not on calculations
What is SPSS?
• Software package used for statistical analysis of data.
Example: Hospital-stay data
Under Data View, columns are variables and rows are cases.
Enter Variables
Enter variables based on your code book!
Enter cases
Import data from Excel
• Select File > Open > Data
• Choose Excel as file type
• Select the file you want to import
• Then click Open
Open Excel files in SPSS
Click Continue, then save this file as SPSS data
Types of Data
• Variable- any characteristic that is recorded for
subjects in a study
– Categorical- if each observation belongs to one of a set of
categories
– Quantitative- if observations on it take numerical values
that represent different magnitudes of the variable
• Discrete- if its possible values form a set of separate numbers,
such as 0, 1, 2, …
• Continuous- if its possible values form an interval
Other Valuable Terminology
• Parameter- a numerical summary of the population
Name Values
Type Missing
Width Column
Decimals Align
Label Measure
Name
• Each variable must have a unique name starting with a
letter; older SPSS versions limit names to 8 characters,
while version 12 and later allow up to 64 bytes
• Try to give meaningful variable names:
– Describing the characteristic: for example, age
– Linking to the questionnaire: for example, A1Q3
• Keep the names consistent across files
Type
• Internal formats:
– Numeric
– String (alphanumeric)
– Date
• Output formats:
– Comma
– Dot
– Scientific notation
– Dollar
– Custom currency
Numeric
• Numeric variables:
– Numeric measurements
– Codes
• Definition of the size of the variable
String (alphanumeric)
• String variables contain words or characters;
strings can include numbers but, taken here as
characters, mathematical operations cannot
be applied to them
• The maximum size of a string variable is 255
characters in older SPSS versions; later versions
allow up to 32,767 bytes
Date
• The input format for date variables must be
defined, such as DD/MM/YYYY, MM/DD/YYYY
or MM/DD/YY
• Computers store dates as numbers from a
base date; in SPSS, dates are stored as the
number of seconds from 14 October 1582
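The storage rule above can be sketched in Python. This is only an illustration of the arithmetic, not SPSS code; the function names are hypothetical, and the base date is taken as midnight at the start of 14 October 1582.

```python
from datetime import datetime, timedelta

# Base date described above: dates are counted in seconds from 14 October 1582.
# Assumption: the count starts at midnight on that day.
SPSS_EPOCH = datetime(1582, 10, 14)


def spss_seconds_to_datetime(seconds):
    """Convert an internal date value (seconds since the base date) to a datetime."""
    return SPSS_EPOCH + timedelta(seconds=seconds)


def datetime_to_spss_seconds(moment):
    """Convert a datetime back to the number of seconds since the base date."""
    return (moment - SPSS_EPOCH).total_seconds()


one_day = 24 * 60 * 60
print(spss_seconds_to_datetime(one_day))              # one day after the base date
print(datetime_to_spss_seconds(datetime(2020, 1, 1)))  # a modern date as seconds
```

Because the stored value is just a number, date arithmetic (for example, length of hospital stay) reduces to subtracting two such values and dividing by 86,400 to get days.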
Example
Expected Frequencies
Solution
Calculation of Chi–square Statistic
Solution
Test of Significance
Here, r = 2 and c = 2, and therefore the degrees of freedom are
(r − 1)(c − 1) = 1.
From the table, the critical value = 3.841.
Since the calculated value > the table value, the null hypothesis may be
rejected at the .05 level of significance. It may therefore be
concluded that there is an association between
income level and the type of hospital preferred by
people.
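The calculation for a 2 × 2 table can be sketched as follows. The observed counts are hypothetical (the original table is not reproduced here), and the critical value is obtained with a shortcut that is valid only for 1 degree of freedom, where the chi-square variable is the square of a standard normal variable.

```python
from statistics import NormalDist

# Hypothetical observed counts: rows = income level, columns = hospital type.
observed = [[30, 20],
            [15, 35]]


def chi_square_statistic(table):
    """Sum of (observed - expected)^2 / expected over all cells."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            # Expected frequency under independence of rows and columns.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (obs - expected) ** 2 / expected
    return statistic


rows, cols = len(observed), len(observed[0])
df = (rows - 1) * (cols - 1)

# Valid for df = 1 only: the 5% critical value is the squared 97.5th
# percentile of the standard normal distribution (about 3.841).
critical_value = NormalDist().inv_cdf(1 - 0.05 / 2) ** 2

chi_sq = chi_square_statistic(observed)
print(f"chi-square = {chi_sq:.3f}, df = {df}, critical value = {critical_value:.3f}")
print("Reject H0" if chi_sq > critical_value else "Fail to reject H0")
```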
Testing the Significance of Chi-Square
in SPSS
• In SPSS, the null hypothesis is not tested on the
basis of the comparison between calculated and
tabulated chi-square; rather, it uses the concept
of p value
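The p-value decision rule can be illustrated in Python for the 1-degree-of-freedom case. This is a sketch, not what SPSS runs internally: it relies on the identity that a chi-square variable with 1 degree of freedom is a squared standard normal variable, and the calculated value of 5.0 is hypothetical.

```python
import math


def chi_square_p_value_df1(statistic):
    """Upper-tail probability of a chi-square statistic with 1 degree of freedom.

    For df = 1, P(chi-square > x) = 2 * (1 - Phi(sqrt(x))) = erfc(sqrt(x / 2)).
    """
    return math.erfc(math.sqrt(statistic / 2))


alpha = 0.05
calculated = 5.0  # hypothetical calculated chi-square value
p_value = chi_square_p_value_df1(calculated)

print(f"p = {p_value:.4f}")
# Same decision as comparing with the tabulated value, expressed through p.
print("Reject H0" if p_value < alpha else "Fail to reject H0")
```

The two approaches agree: the p value falls below .05 exactly when the calculated chi-square exceeds the tabulated value of 3.841.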
Math proficiency score: 71, 60, 80, 73, 82, 65, 90, 87, 74, and
72
Two-Sample t-Test
for Unrelated Groups
• Used for testing the hypothesis of equality of
means of two normally distributed
populations
• We often want to compare the means of two
different populations
• For example, comparing the effect of two
different diets on weights, the effect of two
teaching methodologies on the performance,
or the IQ of boys and girls.
Two-Sample t-Test
for Unrelated Groups: Assumptions
• Both populations from which the samples
have been drawn are normally distributed.
• The variances of the two populations are
nearly equal.
• Population variances are unknown.
• The samples are independent of each other.
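Under these assumptions, the test statistic uses a pooled variance estimate. A minimal Python sketch is shown below; the data are hypothetical and the function name is illustrative.

```python
from math import sqrt
from statistics import mean, variance


def pooled_t_statistic(sample_a, sample_b):
    """Two-sample t statistic assuming equal population variances."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Pooled variance: weighted average of the two sample variances.
    pooled_var = ((n_a - 1) * variance(sample_a) +
                  (n_b - 1) * variance(sample_b)) / (n_a + n_b - 2)
    standard_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    t = (mean(sample_a) - mean(sample_b)) / standard_error
    degrees_of_freedom = n_a + n_b - 2
    return t, degrees_of_freedom


# Hypothetical weight changes under two diets.
diet_a = [2, 4, 6, 8]
diet_b = [1, 3, 5, 7]
t_value, df = pooled_t_statistic(diet_a, diet_b)
print(f"t = {t_value:.4f}, df = {df}")
```

The computed t is then compared with the tabulated t for the chosen significance level and degrees of freedom, or judged through the p value that SPSS reports.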
Two-Sample t-Test
for Unrelated Groups: Exercise
• The counseling cell of a college regularly conducts sessions
with students who have problems, using different
methods. Since the number of visitors to the center keeps
increasing every day, the cell has decided to test
whether audio-visual-based counseling and personal
counseling are equally effective in reducing stress
levels. Eighteen women students were randomly chosen
from among those who visited the center. Nine of them
were given personal counseling, whereas the other
nine were given sessions with the audio-visual
presentation. After the session, the students were
tested for their stress level. The data so obtained are
shown in Table
Two-Sample t-Test
for Unrelated Groups: Exercise (contd.)
• Test your hypothesis at the 1% level as to whether
one method of counseling is better than the other.
It is assumed that the population variances are
equal and that both populations are normally
distributed
Paired t-Test for Related Groups
• Paired t-test is used to test the null hypothesis that the difference
between the two responses measured on the same experimental
units has a mean value of zero.
• The paired t-test is used in situations where there is only one
experimental group and no control group.
– The question tested here is whether the treatment is
effective or not.
– This is done by measuring the responses of the subjects in the
experimental group before and after the treatment
Paired t-Test for Related Groups
• Also known as “repeated measures” t-test.
• In using the paired t-test, the data must be obtained in pairs on
the same set of subjects before and after the experiment.
• While applying the paired t-test for two related groups, the
pairwise differences, di, are computed for all n paired data. The
mean, d-bar, and standard deviation, Sd, of the differences di
are calculated.
• Paired t-statistic is computed as:
$$t = \frac{\bar{d}}{S_d / \sqrt{n}}$$
with n − 1 degrees of freedom.
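The steps above (differences, their mean and standard deviation, then the ratio of the mean to its standard error) can be sketched in Python. The before/after scores are hypothetical, and taking each difference as after minus before is an assumption.

```python
from math import sqrt
from statistics import mean, stdev


def paired_t_statistic(before, after):
    """Paired t statistic: mean difference divided by its standard error."""
    # Assumption: each difference is taken as after minus before.
    diffs = [a - b for b, a in zip(before, after)]
    n = len(diffs)
    d_bar = mean(diffs)   # mean of the differences
    s_d = stdev(diffs)    # sample standard deviation of the differences
    t = d_bar / (s_d / sqrt(n))
    return t, n - 1       # statistic and degrees of freedom


# Hypothetical scores for the same five subjects before and after a treatment.
before = [10, 12, 9, 11, 13]
after = [12, 13, 12, 12, 15]
t_value, df = paired_t_statistic(before, after)
print(f"t = {t_value:.4f}, df = {df}")
```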
• One-way ANOVA
• Repeated Measure ANOVA
One-Way ANOVA
• Statistical technique used for comparing
means of more than two groups
• A higher ratio of between-group variance to within-group variance
(the F-value) indicates that the samples have been
drawn from different populations
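The F-value can be illustrated with a short Python sketch that divides the between-group mean square by the within-group mean square. The three groups are hypothetical and the function name is illustrative.

```python
from statistics import mean


def one_way_anova_f(groups):
    """F ratio: between-group mean square over within-group mean square."""
    all_values = [x for group in groups for x in group]
    grand_mean = mean(all_values)
    k = len(groups)        # number of groups
    n = len(all_values)    # total number of observations

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of observations around their own group mean.
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)

    ms_between = ss_between / (k - 1)
    ms_within = ss_within / (n - k)
    return ms_between / ms_within


# Hypothetical scores for three groups.
groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
f_value = one_way_anova_f(groups)
print(f"F = {f_value:.2f}")
```

Here the group means (2, 3 and 6) are far apart relative to the spread inside each group, so the ratio is large.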
One-Way ANOVA: Example
• A human resource manager may wish to determine
whether the achievement motivation differs among the
employees in three different age categories (<25, 26–35,
and >35 years) after attending a training program.
• In a repeated measures design, the same criterion variable is measured many times on each subject.
• Repeated measures are taken at different times in order to see the impact of time
on changes in the criterion variable.
• The samples are independent (no pairs). Although this test
compares means (parameters) and not medians, it does not use the
values of the means to do the comparison; it is therefore a
non-parametric test. It is based on a binomial distribution.
Two Sample Wilcoxon Rank Sum Test
(Mann-Whitney U Test)
• The Mann-Whitney/Wilcoxon rank sum test is a
non-parametric alternative to the independent
samples t-test
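To show how the test works on ranks rather than raw values, here is a minimal Python sketch of the U statistic. The data and function names are hypothetical; tied values receive the average of the ranks they occupy.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value is tied with this one.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        avg_rank = (pos + end) / 2 + 1  # ranks are 1-based
        for k in range(pos, end + 1):
            ranks[order[k]] = avg_rank
        pos = end + 1
    return ranks


def mann_whitney_u(sample_a, sample_b):
    """Return (U, rank sum of sample_a) from the pooled ranking."""
    ranks = average_ranks(list(sample_a) + list(sample_b))
    n_a, n_b = len(sample_a), len(sample_b)
    rank_sum_a = sum(ranks[:n_a])
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    u_b = n_a * n_b - u_a
    return min(u_a, u_b), rank_sum_a


# Hypothetical measurements from two independent groups.
group_a = [1.1, 2.3, 3.5]
group_b = [2.0, 4.2, 5.7, 6.1]
u_value, rank_sum = mann_whitney_u(group_a, group_b)
print(f"U = {u_value}, rank sum of group A = {rank_sum}")
```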
4. In order to know how to interpret the results from a Kruskal-Wallis H test, you
have to determine whether the distributions in each group (i.e., the
distribution of scores for each group of the independent variable) have the
same shape (which also means the same variability).
H test vs. ANOVA
• The Kruskal-Wallis H test does not assume
normality in the data and is much less
sensitive to outliers than a one-way ANOVA, so it
can be used when these assumptions have been
violated and the use of a one-way ANOVA is inappropriate.
• The medical researcher would like to investigate this anecdotal evidence with a study. The
researcher identifies 3 well-known, anti-depressive drugs which might have this positive side effect,
and labels them Drug A, Drug B and Drug C.
• The researcher then recruits a group of 60 individuals with a similar level of back pain,
randomly assigns each of them to one of three treatment groups (Drug A, Drug B or Drug C), and
prescribes the relevant drug for a 4-week period.
• At the end of the 4-week period, the researcher asks the participants to rate their back pain on a
scale of 1 to 10, with 10 indicating the greatest level of pain.
• The researcher wants to compare the levels of pain experienced by the different groups at the end
of the drug treatment period. The researcher runs a Kruskal-Wallis H test to compare this ordinal,
dependent measure (Pain_Score) between the three drug treatments (i.e., the independent
variable, Drug_Treatment_Group, is the type of drug with more than two groups).
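A Kruskal-Wallis H statistic for a design like this can be sketched in Python. The pain scores below are hypothetical (three per group rather than the 20 per group in the study), the function names are illustrative, and the usual tie correction is included because ratings on a 1 to 10 scale will often tie.

```python
from collections import Counter


def average_ranks(values):
    """Rank values from 1..n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        for k in range(pos, end + 1):
            ranks[order[k]] = (pos + end) / 2 + 1
        pos = end + 1
    return ranks


def kruskal_wallis_h(groups):
    """H statistic with tie correction (undefined if every value is identical)."""
    pooled = [x for group in groups for x in group]
    n = len(pooled)
    ranks = average_ranks(pooled)

    # Sum of squared rank totals, each divided by its group size.
    total, start = 0.0, 0
    for group in groups:
        rank_sum = sum(ranks[start:start + len(group)])
        total += rank_sum ** 2 / len(group)
        start += len(group)
    h = 12 / (n * (n + 1)) * total - 3 * (n + 1)

    # Tie correction: shrinks the denominator when values repeat.
    tie_term = sum(t ** 3 - t for t in Counter(pooled).values())
    return h / (1 - tie_term / (n ** 3 - n))


# Hypothetical Pain_Score ratings for Drug A, Drug B and Drug C.
drug_a, drug_b, drug_c = [2, 3, 4], [5, 6, 7], [8, 9, 10]
h_value = kruskal_wallis_h([drug_a, drug_b, drug_c])
print(f"H = {h_value:.2f}")
```

H is compared with a chi-square distribution with (number of groups − 1) degrees of freedom, which is how SPSS reports its significance.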
Kolmogorov-Smirnov Test
• The two-sample Kolmogorov-Smirnov test is a nonparametric test that
compares the cumulative distributions of two data sets (1,2).
• The test is nonparametric. It does not assume that data are sampled from
Gaussian distributions (or any other defined distributions).
• The results will not change if you transform all the values to logarithms or
reciprocals or apply any other monotonic transformation. The KS test reports the
maximum difference between the two cumulative distributions, and calculates a P
value from that and the sample sizes
• Converting all values to their ranks also would not change the maximum
difference between the cumulative frequency distributions
– Thus, although the test analyzes the actual data, it is equivalent to an analysis
of ranks.
– Thus the test is fairly robust to outliers (like the Mann-Whitney test).
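The maximum difference between the two cumulative distributions can be computed directly. The Python sketch below uses hypothetical data and an illustrative function name, and also checks that a logarithmic transformation leaves the statistic unchanged.

```python
import math
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Maximum absolute difference between two empirical cumulative distributions."""
    a, b = sorted(sample_a), sorted(sample_b)
    largest_gap = 0.0
    # The gap can only change at an observed value, so check each one.
    for x in sorted(set(a) | set(b)):
        cdf_a = bisect_right(a, x) / len(a)  # share of sample A at or below x
        cdf_b = bisect_right(b, x) / len(b)  # share of sample B at or below x
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


# Hypothetical data sets.
sample_1 = [1.0, 2.0, 3.0, 4.0]
sample_2 = [2.5, 3.5, 4.5, 5.5]

d_raw = ks_statistic(sample_1, sample_2)
d_log = ks_statistic([math.log(x) for x in sample_1],
                     [math.log(x) for x in sample_2])
print(f"D (raw) = {d_raw}, D (log-transformed) = {d_log}")
```

Only the ordering of the pooled values enters the calculation, which is why the transformed data give the same result.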
Kolmogorov-Smirnov Test
• The null hypothesis is that both groups were sampled
from populations with identical distributions.
– It tests for any violation of that null hypothesis: different
medians, different variances, or different distributions.