A Chillibreeze publication
SPSS Tutorial for Beginners
So you are going to be working on SPSS. Welcome to a whole new world of figures, data and statistics. We understand that it is natural to be a little apprehensive before you start using this program to analyze your data or for your class. You must have realized that it is not child’s play to use the multiple features that SPSS has to offer. Don’t worry. Our SPSS tutorial will help you navigate through the whole process and take you through a journey that covers the basic features of SPSS that everyone uses. You might not turn into a pro overnight, but you will definitely be much more comfortable using SPSS. Our examples and exercise problems will help you understand the features even better. This tutorial will be especially helpful for students of biology.
Contents
SECTION I Back to Basic
A Review of the Statistics that you learnt (or did not learn?) in college
i. Research design
  a. Animal research
    a) Selection of species
    b) Selection of controls
    c) Feeding of controls
    d) Treatment of controls
    e) Ethical guidelines
  b. Human research
    a) Case histories
    b) Descriptive studies
    c) Prospective studies (cohort studies)
    d) Retrospective studies (case-control studies)
    e) Retrospective-prospective study: a combination
    f) Open trials
    g) Crossover trials
    h) Blind trials
    i) Double-blind crossover interventional trial
    j) Metabolic studies
    k) In vitro studies
ii) Evaluation of measuring instruments
  a) Measuring sensitivity and specificity of a test
  b) Sampling and sample size
    a) Sampling procedures
    b) Sample size
(iii) Some useful terms to know
(iv) Types of data and appropriate statistical tests
    a) The t-test
    b) Analysis of Variance (AoV)
    c) Correlation
    d) Regression
    e) Chi-Square test
    f) McNemar test
    g) Sign test
    h) Mann-Whitney U test
    i) Statistical Abuses
    j) Quiz to test yourself
SECTION II SPSS at Last
Creating and editing a data file
i) Typical SPSS session
(ii) Creating a new data file with the data editor
(iii) Loading an existing data file into the data editor
(iv) Creating and executing SPSS commands
    k) The EXPLORE command
    l) The FREQUENCIES command
    m) The DESCRIPTIVES command
    n) The IF and COMPUTE commands
    o) The MEANS command
    p) The T-TEST command
    q) The ONEWAY ANALYSIS OF VARIANCE command
    r) Scattergrams and Regression
    s) Multiple Regression
    t) CHI-SQUARE test using CROSSTABS
    u) Selection of a Subset of Cases for Analysis
    v) The Nonparametric Tests
    w) Mann-Whitney U test
    x) Bivariate Correlations
    y) Survival
SPSS syntax windows
SECTION I
Back to Basic
A Review of the Statistics that you learnt (or did not learn?) in college
Before we start to work on the actual software, we think it is necessary to go through some basics of the statistics involved. We included this section because it is not enough to know how to use SPSS. You also need to know which test to use in which situation and how to plan your research. If you think you already know this stuff, that’s fine. Just skip it, or consider it a review before you plunge into actually using the software. Here are a few things that will be included in “The basics of Statistics” section:
Research Design
Evaluation of Measuring Instruments
Sampling and Sample Size
Mean, Variance, Standard Deviation, Degrees of Freedom
t-test
Analysis of variance
Correlation and Regression
Chi-square test, Sign test, Mann-Whitney U test, etc.
Statistical abuses
CHILLIBREEZE PUBLICATIONS  http://www.chillibreeze.com
i. Research design
In order to get good data, you need good research design skills. Even if you are only reading someone else’s research, it helps to know something about correct and incorrect research design in order to understand it better. There are many ways to design a research project, and not all of them are documented. Here, we describe some of the more common methods and give you a few tips for your own research design.
a. ANIMAL RESEARCH
Most biomedical research is first conducted using animal models. Much of this initial animal research cannot be performed in humans due to ethical, cost and/or time considerations.
a) Selection of species
The particular species to be used as a model is chosen for one or more of the following reasons:
1. Similarity of the organ system, disease, metabolic pathways, etc. to the equivalent in human beings.
2. Small size for ease and economy in housing, feeding, manipulation, etc.
3. Relatively short life span to allow for lifetime studies, studies over more than one generation, etc.
4. Comparisons to work of other investigators in the same or similar model.
b) Selection of controls
This is one of the most important considerations in the design of an animal experiment. Let us say the investigator is performing a study to determine the effect of a deficiency of a particular nutrient on some measurable parameter. Animals in the experimental group would be fed a diet containing the required amounts of all known nutrients except the nutrient under study (or the diet would be low in that nutrient). The control group would receive a normal diet (the same in all respects but with the nutrient present). The controls need to be similar to the experimental animals in all respects, such as weight, age, genetic strain, etc. In fact, a single group of animals needs to be randomly divided into experimental animals and controls. To avoid bias, someone other than the principal investigator can perform the actual separation.
c) Feeding of controls
Some animals will consume less of an incomplete diet than they will of a nutritionally complete diet. If the experiment is run as designed above, the investigator may get false results that show that deficiency of the nutrient influences the variable being measured. In reality, the variable may have changed because of the low quantity of food consumed by the experimental animals. To avoid this, it may be best to have a second control
group that is “pair-fed” to the experimental group. The amount of food that is consumed by the experimental group is measured and then that amount of the control diet is fed to the pair-fed controls. In another approach, the mean intake of food eaten by the experimental animals is calculated each day and all of the control animals are fed this amount the following day.
d) Treatment of controls
Do unto your controls as you would do unto your experimental animals. Basically, try to eliminate other factors which might differ between experimental and control animals. If some procedure like an injection or a surgery is performed on the experimental animals, the controls too should receive a sham procedure. Other variables to consider in animal experiments are: water consumption, non-dietary sources of nutrients (e.g. zinc from nibbling on cage bars), time of day procedures are done, amount of space allocated to the animals, amount of exercise, location within the cage, number of animals per cage, nature of animals in neighboring cages, etc. Also, there should be concern for diseases that the animals might transmit to each other or to the investigator and for diseases that the investigator might transmit to the animals.
e) Ethical guidelines
Ethical guidelines should be followed meticulously to ensure that research is done in a humane fashion. Animals and their cages should be kept clean, the temperature should be correct, undue and unnecessary distress should be avoided, etc. Usually, there are committees on campuses that oversee the treatment of animals. Also, your university is likely to have courses that you need to take before you take on an animal experiment.
b. HUMAN RESEARCH
In human research, there are many other guidelines and considerations. One can divide the types of research design into:
1. OBSERVATIONAL STUDIES
Here the investigator does not alter the natural occurrence of events but records them and formulates hypotheses and/or conclusions about what he/she observes. Observational studies are of several types, including:
Case histories
Descriptive studies
Prospective studies (cohort studies)
Retrospective studies (case-control studies)
2. INTERVENTIONAL STUDIES
As opposed to the passive role of the investigator in observational studies, the researcher takes an active part in these studies. In interventional studies, the subjects are exposed to (or denied exposure to) a factor or method of treatment and followed over time to determine the outcome. Individuals may serve as their own controls, or separate groups of control individuals may be used. These kinds of studies have several research design methods:
Open trials
Crossover trials
Blind trials
Double-blind crossover trials
Metabolic studies
In vitro studies
Let us have a closer look at these different kinds of research designs and what information each can provide. This is important for statistical analysis because, in order to interpret your data, you need to know the usefulness and limitations of each kind of study and how to classify any particular study. For example, if you have the results of a retrospective study and ask the statistical program to compute absolute risk, the output would be meaningless, because a retrospective study can only provide an estimate of relative risk.
a) Case histories
These studies are often referred to as anecdotal evidence. They are widely used as testimonials in advertising. In science, they may not be of much value as data, but they do provide an insight into areas of possible further research. Case histories cannot give definitive evidence that a certain factor is causal for a certain disease or that a certain treatment is effective. Many journals separate these articles into a different section of each issue. These studies serve as a method of rapid communication of clinical findings and hypotheses to the scientific community and help generate new leads for future research.
Example: In a recent report, a physician described a case history in which a patient with the common cold experienced complete remission of symptoms following one week of supplementation with vitamin C. Let us examine this closely.
Does this report demonstrate that the common cold is caused by vitamin C deficiency? No. This evidence is not sufficient for a conclusion of this magnitude. Drawing it would be equivalent to saying that an infection is caused by an antibiotic deficiency.
Does this report demonstrate that most common cold patients can be treated successfully with vitamin C? No. The study was performed on a single patient and the results cannot be extrapolated to a population.
Does this report show that this particular patient was cured by vitamin C? No/Maybe. The patient may have been cured by other factors, other drugs or remedies, his own immunity, etc. There was no control for this case. There is no way of knowing whether the patient would have been cured without intervention.
b) Descriptive studies
These are often large population studies in which data on many different variables are collected. They are somewhat like a census. Statistical analyses of the data collected may show various relationships that lead to hypotheses for further study. They may also provide estimates of the magnitude of a particular problem and the frequency of certain behaviors among the population. Sometimes they may also generate meaningless correlations (example: most alcoholics in a particular area send their children to private schools) that should be disregarded. Descriptive studies are also referred to as epidemiological studies or surveys.
c) Prospective studies (cohort studies)
A prospective study consists of two samples, one of which has been exposed to the suspected risk factor (a + b) and one not exposed (c + d). The two samples are followed through time to determine which group has the higher incidence (or cause-specific mortality). In other words, a prospective study compares the absolute risk (of illness or
death) of those exposed with the absolute risk of those not exposed. An example of a table of data from this type of study is shown below.

              Disease    No Disease    Total
Exposed       a          b             a+b
Not Exposed   c          d             c+d
Information provided by this prospective study:
1. Absolute Risk (Incidence) of the exposed, a / (a + b), as compared to that of those not exposed, c / (c + d).
2. Relative Risk: the ratio of the absolute risk in the exposed to the absolute risk in those not exposed, [a / (a + b)] / [c / (c + d)].
Relative Risk is the best measure of the strength of association between the risk factor and the disease in question. A value of 1 means there is no association between the factors being studied. A value of less than 1 means the factor is protective. A value of more than 1 means the factor under consideration is associated with an increased risk of the disease. Usually, a 2-fold difference is considered to be significant (only when a large population is studied).
These studies usually give some of the strongest evidence of disease causation available from observational studies. Human prospective studies tend to be very expensive to undertake and maintain. Some take decades and tend to be undertaken only by federal research agencies. Even those few are often subject to criticism on the basis of the controls used, the ability to stick to a particular scientific protocol over long periods of time, the ethics of leaving individuals exposed to a factor that may be a risk factor, etc. The Framingham heart study is an example of a cohort study.
Example of a prospective (cohort) study:

             Lung Cancer deaths   No Lung Cancer deaths   Total
Smoker*      227                  99,773                  100,000
Nonsmoker    7                    99,993                  100,000
* A smoker is defined as someone who smokes 25 or more cigarettes a day.
Absolute Risk for Smokers = 227/100,000
Absolute Risk for Nonsmokers = 7/100,000
Relative Risk = (227/100,000) / (7/100,000) = 227/7 = 32.4 (This means that a smoker is about 32 times more likely to die of lung cancer than a nonsmoker.)
Attributable Risk = 227/100,000 - 7/100,000 = 220/100,000 (This means that 220 deaths per 100,000 smokers could have been prevented by eliminating smoking.)
d) Retrospective studies (case-control studies):
A retrospective study consists of two groups, one with the disease (a + c) and the other without the disease (b + d) under study. The group with the disease consists of
“cases” and the group without the disease is the group of “controls”. The typical way of expressing data from such a study is shown below.

              Disease (cases)   No Disease (controls)
Exposed       a                 b
Not Exposed   c                 d
Total         a+c               b+d
Information provided by such a study design:
1. Cases and controls are compared with respect to the proportions that have been exposed to the risk factor. The proportion of cases exposed, a / (a + c), is compared to the proportion of controls exposed, b / (b + d), i.e. the rate of exposure in the diseased is compared with the rate of exposure in the nondiseased.
2. If, in the population at large, the number of persons who have the disease is quite small compared to the nondiseased population, Relative Risk may be estimated by using the “odds ratio”, also called the “cross-product ratio” or “relative odds”:
Odds ratio = ad / bc
Neither absolute risk (incidence) nor attributable risk can be inferred from a retrospective study without reference to outside information. Then, you may wonder, why are these studies done at all? The reasons may be:
1. They consume less time.
2. Consequently, they cost less money too.
3. Epidemiologists can first try out such calculations before undertaking a cohort study.
4. Subjects need not be controlled because we already have the outcomes.
5. These studies raise fewer ethical concerns because we are not denying or causing exposure to any suspected curative or causative agent.
6. Epidemiologists seldom bother about accuracy, unlike statisticians (just kidding!).
Example of a retrospective (case-control) study:

             Lung cancer deaths   Deaths with no lung cancer   Totals
Smoker       464                  167                          631
Nonsmoker    36                   333                          369
Totals       500                  500                          1000
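The arithmetic in the two worked examples of this section (the smoking cohort and the case-control table above) can be checked with a few lines of Python. This is only an illustrative sketch: Python is not part of SPSS, and the function names cohort_measures and odds_ratio are made up for this example. The arguments a, b, c and d follow the cell labels used in the tables.

```python
def cohort_measures(a, b, c, d):
    """Risk measures for a prospective (cohort) 2x2 table.

    a, b = exposed with / without disease
    c, d = not exposed with / without disease
    """
    risk_exposed = a / (a + b)      # absolute risk (incidence) in the exposed
    risk_unexposed = c / (c + d)    # absolute risk in those not exposed
    return {
        "absolute_risk_exposed": risk_exposed,
        "absolute_risk_unexposed": risk_unexposed,
        "relative_risk": risk_exposed / risk_unexposed,
        "attributable_risk": risk_exposed - risk_unexposed,
    }


def odds_ratio(a, b, c, d):
    """Cross-product (odds) ratio ad / bc for a case-control 2x2 table."""
    return (a * d) / (b * c)


# Cohort example: 227 and 7 lung cancer deaths per 100,000 people
cohort = cohort_measures(a=227, b=99_773, c=7, d=99_993)
print(round(cohort["relative_risk"], 1))              # 32.4
print(round(cohort["attributable_risk"] * 100_000))   # 220 per 100,000

# Case-control example: a = 464, b = 167, c = 36, d = 333
print(round(odds_ratio(464, 167, 36, 333), 1))        # 25.7
```

Note that odds_ratio never divides by a row total: in a case-control table the totals are fixed by the investigator, which is why incidence cannot be read from it.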
Relative odds = (464 x 333) / (167 x 36) = 25.7
We cannot determine incidence from these data. Information from case-control studies does not “prove” causation but gives stronger evidence than that from case histories. If similar relationships are found by different members of the scientific community in different experiments, then the findings are gradually accepted as strongly suggestive of causation.
e) Retrospective-prospective study: a combination
Some countries, such as Finland, have had their entire population registered in a national health program for a very long time. This provides investigators with solid records and accurate information about any section of the population going back over long periods. The investigators can then start with the point at which exposure took place and track individuals until the present. This type of study combines the advantages of both prospective and retrospective studies.
f) Open trials:
This is the simplest kind of interventional study. In this design, both the researcher and the subjects are aware of the nature of the treatment and the intended/expected results.
Example: A group of diabetic women is used to study the effect of a particular herb extract on fasting blood sugar levels. The fasting blood sugar levels would be determined at the beginning of the experiment and again after a 10-week period of receiving the extract. In this instance, each subject serves as her own control. Alternatively, there could be two groups, one receiving the extract (experimental group) and one receiving a placebo (controls). In both cases, the herb extract use, the placebo use and the hypothesis being studied would be known to all parties. These trials have the benefit of being very easy to conduct and also avoid the ethical issue of withholding a particular treatment from the control group. The biggest disadvantage of this research design is that knowledge of the treatment or placebo may influence the outcome through the actions and psychology of the subjects and/or the researchers.
g) Crossover trials:
In crossover trials, two groups of subjects are studied. One group receives the active treatment for an initial period of time while the second receives a placebo. At the end of the designated period, the first group is switched to placebo and the second to the active treatment. This design is used to help eliminate effects caused merely by participation in an experiment and those due to seasonal variations in the variables being measured.
h) Blind trials:
Here, some or all of the participants and/or researchers are prevented from knowing the identity of the group receiving the treatment until the conclusion of the experiment. In a “single-blind” trial, the subjects are denied knowledge of whether they are receiving the active treatment or placebo, but the researchers are aware of the identities of the experimental groups and controls. In a “double-blind” trial, neither the subjects nor the researchers know which individuals receive the active treatment or placebo until the experiment is over and the data have been collected. The latter experimental design helps
to control for the “placebo effect” as well as to control for experimental bias. You may ask: why should the investigator be “blinded”? The answer remains the same: to avoid bias. Let’s say the investigator gets a good result from the diabetes experiment (the herb reduces fasting blood sugar); he will not repeat the experiment, because the “good result” is in the experimental group, which strengthens his hypothesis. If the same result is obtained in the control group (the placebo reduces fasting blood sugar), he will definitely repeat the experiment, thinking that it is an error. So the treatment of experimental groups and controls may not be exactly the same. To avoid this, an unbiased colleague of the investigator does the labeling, feeding and keeping of records.
Example: Let us say the investigator wants to perform the same experiment, to test whether the herb extract influences fasting blood sugar in diabetic women. A group of diabetic women is randomly divided into two groups of equal size. All relevant characteristics are similar in both groups (weight, height, age, blood sugar levels, diet, etc.). Two capsules are developed for the experiment, one with the extract and one with the placebo. The two capsules are indistinguishable in appearance. One group receives the extract capsule for 10 weeks and the other receives the placebo capsule for 10 weeks. The distribution is done by an individual not directly related to the project. After the final blood samples have been analyzed and decisions made as to who improved and who did not, the code showing which group received the extract is revealed to the investigator.
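The random division and the coded capsules in this example can be sketched in Python as follows. This is a hypothetical illustration, not a procedure from SPSS: the function name allocate_blind, the subject IDs and the capsule labels "A" and "B" are all invented here. The point is that the investigator only ever sees the coded labels, while the key stays with the uninvolved colleague.

```python
import random


def allocate_blind(subject_ids, seed=None):
    """Randomly split subjects into two equal groups labelled only 'A' and 'B'.

    Returns (allocation, key). `allocation` maps each subject to a coded
    capsule label. `key` says which label holds the extract and is kept by
    someone outside the project until the analysis is finished.
    """
    rng = random.Random(seed)
    shuffled = list(subject_ids)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    allocation = {s: "A" for s in shuffled[:half]}
    allocation.update({s: "B" for s in shuffled[half:]})
    # Decide at random which coded capsule actually contains the extract.
    labels = ["A", "B"]
    rng.shuffle(labels)
    key = {labels[0]: "extract", labels[1]: "placebo"}
    return allocation, key


subjects = [f"S{i:02d}" for i in range(1, 21)]   # 20 hypothetical subject IDs
allocation, key = allocate_blind(subjects, seed=1)
print(sorted(allocation.items())[:3])            # what the investigator sees
```

Passing a seed only makes the sketch reproducible; in a real trial the key would be generated and stored by the third party.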
i) Double-blind crossover interventional trial:
This is a blending of the double-blind and the crossover design. It is one of the most respected forms of trial design for human studies.
j) Metabolic studies:
These are usually carried out in a relatively small number of subjects who are placed in a metabolic ward for intensive study. Individuals generally serve as their own controls. Mineral balance studies would be an example of this kind of experimental study. Here, the subjects are fed chemically defined diets containing varying and known amounts of the mineral being studied. Then the losses from hair, skin, sweat, urine, feces, etc. are carefully determined. These studies need a lot of time and effort, and even a minute error while performing the study may lead to totally erroneous results.
k) In vitro studies:
In vitro means the study is conducted outside the living body. It may be performed on cells, tissues, organs, etc. in laboratories under specific conditions. This type of work is very common and generates a lot of data for statistical analysis, e.g. cell counts, isotope counts, optical density, area under the curve, etc. Sometimes the nature of the laboratory work requires that the data be converted into percent of control for statistical analysis.
ii) Evaluation of measuring instruments
Measuring instruments, whatever their nature, have two properties that must be evaluated before they can be used to perform meaningful measurements.
Precision (or repeatability) is the ability of a measuring instrument to give consistent results on repeated trials. For studying precision, we do not need extraneous information.
Validity (or accuracy) is the ability of a measuring instrument to give a true measure. Validity can be evaluated only if there exists an accepted independent method for confirming the test measurement.
Validity has two measures: Sensitivity and Specificity.
a) MEASURING SENSITIVITY AND SPECIFICITY OF A TEST
Sensitivity of a test measurement is defined as the percentage of true positives identified by the test. A high sensitivity means that the test produces few false negatives. Specificity of a test measurement is defined as the percentage of true negatives identified by the test. A high specificity means that the test produces few false positives. An effective measurement tool would have an acceptable value for all these factors.
Example: Suppose a new test was developed to determine HIV positivity. We call our test “Test A”. We run our test on a large group of people and then run the PCR reaction followed by Western immunoblotting for HIV on the same group of people and record the true outcome.
                          True outcome
                          +ve            -ve            total
Test A results   +ve      530 (a)        165 (b)        695
                 -ve      5 (c)          300 (d)        305
                 total    535 (a + c)    465 (b + d)    1000

a = true positives
b = false positives
c = false negatives
d = true negatives
Sensitivity = a / (a + c) x 100 = 530/535 x 100 = 99.07 %
Specificity = d / (b + d) x 100 = 300/465 x 100 = 64.52 %
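The same calculation can be written as a small Python function, which is handy when several tests have to be compared. This is an illustrative sketch only; the function name sensitivity_specificity is an assumption of this example, and the cell labels a to d are those of the table above.

```python
def sensitivity_specificity(a, b, c, d):
    """Return (sensitivity, specificity) in percent.

    a = true positives, b = false positives,
    c = false negatives, d = true negatives.
    """
    sensitivity = a / (a + c) * 100   # share of true cases the test detects
    specificity = d / (b + d) * 100   # share of non-cases the test clears
    return sensitivity, specificity


sens, spec = sensitivity_specificity(a=530, b=165, c=5, d=300)
print(f"Sensitivity = {sens:.2f} %")   # Sensitivity = 99.07 %
print(f"Specificity = {spec:.2f} %")   # Specificity = 64.52 %
```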
b) SAMPLING AND SAMPLE SIZE
a) Sampling procedures
Most statistical procedures require that our sample be drawn randomly and that our experimental units (subjects, animals, etc.) be allocated independently (randomly) to experimental and control groups. If we fail to do this, the validity of our research in drawing conclusions about the general population is limited. For example, if we were interested in determining the average expenditure of people living in New York and drew our sample from a list of people living in Brooklyn, we would get an incorrect idea of the expenditures. Similarly, if we were trying to solve a public health problem and only used the group of people listed in the telephone directory, our conclusion would still be limited, because not everyone has a listed phone number. Thus sampling techniques become very important. The rule we must meet if we are to have a “fair” sample is that every individual within the population must have an equal chance of being selected. There are many sampling designs possible, but the three that are most commonly used are:
1. Simple random sample: This is something like pulling out names from a hat. It may involve the use of a computer and a random number generator. A complete randomized list is necessary.
2. Systematic sample: This is a system where every “nth” individual is chosen, e.g. every 15th diabetic is chosen from a randomized list of diabetics.
3. Stratified sample: This is a system in which the population is divided into distinct subpopulations (strata) and samples are selected from each subpopulation. Each stratum must be weighted in the final calculation of results unless that was done by proportionate sampling from each stratum.
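The three designs can be sketched with Python's standard random module. Everything in the sketch is hypothetical: the population is just a list of 300 ID numbers, the two strata are arbitrary slices of that list, and the function names are invented for illustration. The stratified version uses proportionate sampling (the same fraction from every stratum), so no extra weighting would be needed.

```python
import random


def simple_random_sample(population, n, rng):
    """Draw n members without replacement, like names from a hat."""
    return rng.sample(population, n)


def systematic_sample(population, step, rng):
    """Take every `step`-th member, starting from a random offset."""
    start = rng.randrange(step)
    return population[start::step]


def stratified_sample(strata, fraction, rng):
    """Proportionate sampling: the same fraction from every stratum."""
    return {
        name: rng.sample(members, round(len(members) * fraction))
        for name, members in strata.items()
    }


rng = random.Random(42)                      # fixed seed for repeatability
population = list(range(1, 301))             # 300 hypothetical subject IDs

srs = simple_random_sample(population, 20, rng)
every_15th = systematic_sample(population, 15, rng)
strata = {"stratum_1": population[:200], "stratum_2": population[200:]}
stratified = stratified_sample(strata, 0.10, rng)

print(len(srs), len(every_15th), {k: len(v) for k, v in stratified.items()})
```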
b) Sample Size
The sample size in an experiment is an important feature in its design and in the interpretation of its results. If the sample size is too small, it may be tough or impossible to find statistically significant results. But that does not automatically imply that the larger the sample size, the better our experiment. If the sample size is too large, trivial differences will become statistically significant and we will be too easily impressed with the findings. For example, if we want to determine the differences in mathematical skills between boys and girls, and we use 100,000 boys and 100,000 girls for this study, we might end up with a statistically significant result even if the actual difference is unimportant. Hence, the number of subjects used for a study needs to be chosen very carefully. There are several methods and formulae for selection of an appropriate sample size, and these must be used before we begin an experiment.
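To see how sample size alone can turn a trivial difference into a “significant” one, here is a rough Python sketch. The numbers are invented for illustration (a difference of 0.5 points between group means on a test with a standard deviation of 15), and the p value comes from a simple large-sample normal approximation rather than from the t-test described later, so treat it as a sketch of the idea and not as the method SPSS uses.

```python
import math


def two_sided_p(mean_diff, sd, n_per_group):
    """Approximate two-sided p value for a difference between two group means.

    Assumes equal group sizes, a common standard deviation `sd`, and a
    normal (z) approximation.
    """
    standard_error = sd * math.sqrt(2 / n_per_group)
    z = abs(mean_diff) / standard_error
    return math.erfc(z / math.sqrt(2))   # equals 2 * (1 - Phi(z))


# Same hypothetical difference, two very different sample sizes
for n in (100, 100_000):
    p = two_sided_p(mean_diff=0.5, sd=15, n_per_group=n)
    print(f"n per group = {n:>7}: p = {p:.3g}")
```

With 100 subjects per group the p value is far above 0.05, while with 100,000 per group it is far below it, although the difference between the groups is exactly the same.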
(iii) Some useful terms to know
Level of significance: the probability of committing a Type I error.
Mean: the arithmetic average.
Null hypothesis: the hypothesis being tested in statistics (that there is no difference between the groups under study).
p or p value: the level of significance attained by a test, e.g. p < 0.05 means that there is a less than 5 % chance of having committed a Type I error.
Regression: a type of analysis used to establish formulae describing the relationships between variables.
Standard Deviation: a measure of variability. It is equal to the square root of the variance.
Standard Error: the standard deviation divided by the square root of the sample size. It is inappropriately used by many researchers in place of the standard deviation to make their results look better, because it is the smaller number.
Type I Error: rejection of a true null hypothesis.
Variance: a measure of the average squared deviation of a data set from the mean.
(iv) Types of data and appropriate statistical tests
In research, there are two types of data collected. One type is called continuous (measurement) data. These are data that have an infinite number of possible data points between each whole number (e.g. an infinite number of points between 1.0 and 2.0 like 1.1, 1.2, 1.23, 1.2455, and so on). Examples of such data would be weight, height, blood sugar, hemoglobin, etc. As long as it is possible to measure at smaller and smaller increments, we have a continuous variable. The other type of data is called discrete (frequency/categorical) data. These data can only be integers (1, 2, 3 and so on). An example would be the number of deaths due to cancer (we cannot have 1.5 deaths). Since the underlying distributions that are possible with these two types of data are different, we must use different types of statistical tests depending on which kind of data we are collecting.
Statistical procedures for continuous data:
t-test
Analysis of variance
Correlation
Regression

Statistical procedures for discrete data:
Chi-square test
McNemar test
Sign test
Mann-Whitney U test
a) The t-test
Many experiments seek to compare data collected from 2 groups of subjects or animals (an experimental group and a control group). If the data collected are for a continuous variable, the best test to use would be the “t-test” or “Student’s t-test” (Student was a pseudonym used by W. S. Gosset in his statistical writings). The “t” distribution was developed to overcome a major problem with using the normal distribution for hypothesis testing. The normal distribution, and tests of hypothesis related to it, require that we have data from every member of the population to calculate the mean (µ) and the variance (σ²). We rarely have this information. Usually we must rely on data from samples of individuals drawn from a population. The “t” distribution is used in describing samples and testing hypotheses related to them. (Actually there are an infinite number of “t” distributions, each one determined by the degrees of freedom of the sample variance s². As the degrees of freedom (n – 1) approach infinity, the t distribution becomes the same as the normal distribution.) Use of a “t-test” requires that the following assumptions are true:
1. The sample is randomly selected.
2. The sample is drawn from an underlying normal population.
We must have the following data:
1. A mean for each sample (the t-test allows comparison of only 2 means)
2. The variance for each sample
3. The number of individuals in each group
The hypothesis being tested by a t-test is that the two means are equal:

$$H_0: \mu_1 - \mu_2 = 0$$

The alternate hypothesis is that the means are not equal. Different research situations call for slightly different versions of the formula for calculation of the t-test. If the variance of the two samples is equal, a “pooled-variance” t-test is conducted. E.g. if we are studying blood sugar of males and females, we expect the variance to be similar for the two groups. The formula for the pooled-variance t-test is:

$$t = \frac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} \qquad \text{where} \qquad s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}}$$

The degrees of freedom (d.f.) = n1 + n2 – 2.

If the variance of the two samples is not equal, we use the “separate variance” t-test. This is also known as Welch's t-test. It is calculated with the following formula:

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}$$

$$\text{d.f.} = \frac{\left[(s_1^2/n_1) + (s_2^2/n_2)\right]^2}{\left[(s_1^2/n_1)^2/(n_1 - 1)\right] + \left[(s_2^2/n_2)^2/(n_2 - 1)\right]}$$
After selecting the appropriate formula, we calculate the t-value using the data from our experiment and then look up the critical value of “t” in a table. If our t-value exceeds the critical value of “t” at our degrees of freedom, then we reject the null hypothesis and infer that the alternate hypothesis is true (the means are significantly different).
Example:
Suppose we conducted an experiment to determine the effect of a protein supplement on the body weight of low birth weight babies. The results of the experiment are as follows:

                      Experimental group    Control group
Mean weight gain      100 g                 88 g
Standard deviation    3                     3
Sample size           25                    25

Our variances are equal (3 squared = 9), therefore we will use the pooled-variance t-test formula.

$$t = \frac{100 - 88}{3\sqrt{1/25 + 1/25}} = \frac{12}{3\sqrt{0.08}} = 14.142$$

From a table of t-values we find that with 48 degrees of freedom, our t-value exceeds the critical value of “t” at the 0.001 level. Therefore we report that the protein supplement had a significant effect on weight gain.
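The arithmetic of this example can be checked with a short Python sketch. The function name pooled_t is our own label for illustration; it simply applies the pooled-variance formula to the summary figures from the table above.

```python
import math

def pooled_t(mean1, sd1, n1, mean2, sd2, n2):
    # Pooled-variance t-test computed from summary statistics.
    df = n1 + n2 - 2
    pooled_sd = math.sqrt(((n1 - 1) * sd1 ** 2 + (n2 - 1) * sd2 ** 2) / df)
    t = (mean1 - mean2) / (pooled_sd * math.sqrt(1 / n1 + 1 / n2))
    return t, df

# Protein supplement example: experimental group vs control group.
t_value, df = pooled_t(100, 3, 25, 88, 3, 25)
print(round(t_value, 3), df)  # 14.142 48
```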
Paired t-test: This is a special t-test to use for research designs in which the data are from paired samples, e.g. if we collect data from the same individual before and after a particular treatment. Here, everyone is his own control. We can calculate a mean change or mean difference from our experiment and similarly a variance of the differences. Then we can use the paired t-test formula:

$$t = \frac{\bar{x}_d}{s_d / \sqrt{n_d}}$$

where $\bar{x}_d$ is the mean of the differences, $s_d$ is their standard deviation and $n_d$ is the number of pairs [the degrees of freedom is (nd – 1)].
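As a minimal sketch of the paired calculation, the Python lines below use five made-up before and after readings. These numbers are purely hypothetical and exist only to show the steps; they are not data from the tutorial.

```python
import math
import statistics

# Hypothetical readings for the same five individuals.
before = [120, 115, 130, 125, 118]
after = [112, 110, 128, 117, 113]

# Each individual is his own control, so we work with the differences.
differences = [a - b for a, b in zip(after, before)]

n_d = len(differences)
mean_d = statistics.mean(differences)
s_d = statistics.stdev(differences)  # sample SD of the differences

t_value = mean_d / (s_d / math.sqrt(n_d))
df = n_d - 1
print(round(t_value, 3), df)
```

The sign of t only tells us the direction of the change; the absolute value is what gets compared with the critical value at n_d – 1 degrees of freedom.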
Note about statistical tables: The tables for looking up p-values are readily available online, and so are online calculators (the simplest method according to us). Just look for p-value calculators. And one more thing: once you start using SPSS for your analysis, SPSS will look up the p-values for you. So do not worry about the tables. This is true for all the tests that follow, including Analysis of variance, Sign test, etc.

b) Analysis of Variance (AoV)
It would be erroneous to use the t-test when we have more than 2 groups. In this case, we need to use another test. The most common test for analyzing more than 2 groups is called “Analysis of variance”. Analysis of variance uses the F-distribution which, like the t-distribution, is derived from the normal distribution. If we are only interested in the differences between group means, then we would use a one-way analysis of variance. If the observations are also classified by a second factor whose effect we want to test, we would need to use a two-way analysis of variance (not very commonly used in biotechnological research). If we are to calculate a one-way AoV, we need the following information:
GT = the sum of all observations (to get this, just add all data points)
Ti = the sum of the observations in the ith sample (the total for a given group)
ni = the number of observations in the ith sample
N = total number of observations
∑ (Ti² / ni) = the sum, over every group, of the square of each sample’s total divided by its sample size
∑i ∑j (xij)² = the sum of the squares of all observations
k = the number of groups

The formula for analysis of variance is:

$$F = \frac{\left[\sum_i (T_i^2 / n_i) - GT^2 / N\right] / (k - 1)}{\left[\sum_i \sum_j x_{ij}^2 - \sum_i (T_i^2 / n_i)\right] / (N - k)}$$

Once we have calculated the F value, we can compare it to the critical value of “F” in the table. We can then determine whether to accept or reject the null hypothesis that no difference exists between group means. This will only tell us that at least 2 means are different (they would be the highest and lowest values). If we want to know which means are different from each other, we need to perform further tests. Examples include the Scheffe test (most stringent), the Newman-Keuls test, Duncan’s multiple range test and others. Our research situation and our needs will determine which of the above is appropriate. One interesting thing is that in AoV calculations, if k = 2, we will get the same result as a t-test: the F value is simply the square of the pooled-variance t-value.

c) Correlation
Many a time, it is necessary to examine a relationship between two (or more) quantitative variables, e.g. height and weight. A simple way to examine the data would be to create a scatter plot. This is a graph in which the horizontal axis (X-axis) represents one of the variables and the vertical axis (Y-axis) represents the other. If the points are totally scattered, we would say that there is no relationship between the 2 variables. If there exists some sort of linear or curvilinear relationship, then the variables are related. If our two variables (x and y) are perfectly related to each other they will form a straight line. Then we would say they are perfectly correlated and they have a Pearson’s correlation coefficient (r) with a magnitude of 1.0. If the variables increase or decrease in unison, then r is positive (+1.0).
If one increases and the other decreases, r is negative (–1.0). Of course, in real life it is tough to see such perfectly correlated relationships. If there is no linear relationship, then r = 0.0. The correlation coefficient can be calculated as follows:
n = the number of data points
x = values for the x-axis, $\bar{x}$ = mean of x
y = values for the y-axis, $\bar{y}$ = mean of y

$$r = \frac{\sum xy - (\sum x)(\sum y)/n}{\sqrt{\left(\sum x^2 - (\sum x)^2/n\right)\left(\sum y^2 - (\sum y)^2/n\right)}}$$

Let us look at an example:

Person         1    2    3    4    5    6    7    8    9    10
Height (in)    64   65   71   67   63   62   67   64   65   64
Weight (lb)    130  148  180  175  120  127  141  118  120  119

Using the above data we would get

$$r = \frac{90294 - (652)(1378)/10}{\sqrt{(42570 - 652^2/10)(194724 - 1378^2/10)}} = +0.8353 \qquad (\text{d.f.} = n - 2 = 8)$$

The correlation coefficient can be tested for statistical significance using a table of r values or by using a special t-test. Either way, it is necessary to calculate the d.f. (degrees of freedom), which for the correlation coefficient is equal to n – 2. The special t-test is calculated using the following formula. Here we ignore the sign of r.

$$t = r\sqrt{\frac{n - 2}{1 - r^2}}$$

Using the above data we would get:

$$t = 0.8353\sqrt{\frac{8}{0.3023}} = 4.297$$

with 8 degrees of freedom. From a t-table, we would find that p < 0.01. This means that there is a significant correlation between the variables. The correlation coefficient is a measure of the strength of the association between 2 variables, with a magnitude on a scale of 0 to 1.0. A better way to interpret the association is to square the coefficient. Then r² will measure the proportion of the variation that is common to both variables, i.e. what percentage of the variation in one variable can be predicted by the other. In this case r² = 0.8353² = 0.6977, or in other words: 69.77 % of the variation in weight can be explained by height.

d) Regression
Another related way of looking at the relationship between two quantitative variables is regression analysis or regression. In regression analysis, one of the variables is logically dependent on (or influenced by) the other variable. For example, heart rate might be influenced by age. In the jargon of regression analysis, heart rate would be the dependent variable and age would be the independent variable, since it is not logically influenced by heart rate. By convention, in regression analysis, the y-axis of a plot is used for the dependent variable and the x-axis for the independent variable. As in correlation analysis, we can use a scatter plot, but in addition, regression analysis will provide the best fitting straight line through our plot. To describe the line, we must have two pieces of information: the slope of the line, b, and the y-intercept of the line, a. The slope measures how much the y variable changes for each unit of change in the x variable (+ means the line rises and – means the line falls). The y-intercept tells us where the line starts (i.e. the value of y when x = 0). Once we have this information we can use the formula for a straight line:

$$y = a + bx$$

The slope of the line can be calculated by the following formula:

$$b = \frac{\sum xy - (\sum x)(\sum y)/n}{\sum x^2 - (\sum x)^2/n}$$

The table with data (that we used for the correlation coefficient) can be used here, but is not a real example for regression analysis. Using the table gives us b = 448.4 / 59.6 = 7.52 (this means that for every unit increase in x, y will increase by 7.52 units). The y-intercept can be calculated since logic dictates that the best fitting line must pass through the point at the mean of x, mean of y. Since y = a + bx, then:

$$\bar{y} = a + b\bar{x} \qquad \text{or equivalently} \qquad a = \bar{y} - b\bar{x}$$

All that is required is that we calculate the means of x and y to be able to obtain a. Using the data table, the means of y and x are 137.8 and 65.2 respectively. So, using the unrounded slope:

a = 137.8 – (7.5235 x 65.2) = –352.73

This type of calculation can be very useful in the laboratory. For example, if we know the concentrations of a set of protein standards and we measure the absorbance of unknown samples, using regression, we can calculate the unknown concentrations of the protein samples. This is a very commonly used calculation.
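The correlation and regression arithmetic for the ten height and weight pairs can be reproduced with the Python sketch below. It follows the sum-based formulas given in the text; the variable names (sxy, sxx, syy) are our own shorthand and not SPSS terminology.

```python
import math

heights = [64, 65, 71, 67, 63, 62, 67, 64, 65, 64]            # x, inches
weights = [130, 148, 180, 175, 120, 127, 141, 118, 120, 119]  # y, pounds
n = len(heights)

sum_x, sum_y = sum(heights), sum(weights)
sum_xy = sum(x * y for x, y in zip(heights, weights))
sum_x2 = sum(x * x for x in heights)
sum_y2 = sum(y * y for y in weights)

# Building blocks shared by the correlation and the regression formulas.
sxy = sum_xy - sum_x * sum_y / n
sxx = sum_x2 - sum_x ** 2 / n
syy = sum_y2 - sum_y ** 2 / n

r = sxy / math.sqrt(sxx * syy)                        # correlation coefficient
t_value = abs(r) * math.sqrt((n - 2) / (1 - r ** 2))  # special t-test, d.f. = n - 2

b = sxy / sxx                       # slope
a = sum_y / n - b * (sum_x / n)     # intercept: line passes through the two means

print(round(r, 4), round(t_value, 2), round(b, 2), round(a, 2))
```

Small differences in the last decimal place compared with the hand calculation come from rounding r before it is used in the t formula.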
e) Chi-square test
For research situations in which frequency and/or categorical data are collected, a different set of statistical procedures must be used. One of the most common of these procedures is called the Chi-square test. This test can handle any number of groups and any number of possible outcomes. For simplicity, we will look at a test with a 2 x 2 situation (two groups and two possible outcomes). The formula for calculating Chi-square is as follows:

$$\chi^2 = \sum \frac{(O - E)^2}{E}$$

Where:
O = the observed frequency
E = the expected frequency = (Tr x Tc) / N
Tr = row total
Tc = column total
N = total number of observations
d.f. = (r – 1)(c – 1)
r = number of rows and c = number of columns
Example:
Suppose we conducted an experiment to determine the possible detrimental effect of a new drug on the birth weight of rats. Pregnant rats were given either the drug or a placebo and the results were observed in their offspring.

Data collected:

                      # of low birth weight rats    # of normal birth weight rats    Total
Control group         10                            90                               100
Experimental group    40                            80                               120
Total                 50                            170                              220

Calculations:

O      E        O – E     (O – E)²    (O – E)² / E
10     22.73    –12.73    162.0529    7.1295
90     77.27    +12.73    162.0529    2.0972
40     27.27    +12.73    162.0529    5.9425
80     92.73    –12.73    162.0529    1.7476
Chi-square value                      16.9168
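The calculation table above can be reproduced with the following Python sketch. It keeps full precision for the expected frequencies, so the total comes out at about 16.91 rather than 16.9168; the small gap is only the rounding of E to two decimals in the hand calculation.

```python
# Observed 2 x 2 frequencies from the rat example.
observed = [
    [10, 90],
    [40, 80],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

chi_square = 0.0
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * col_totals[j] / n  # expected frequency (Tr x Tc) / N
        chi_square += (o - e) ** 2 / e

df = (len(observed) - 1) * (len(observed[0]) - 1)
print(round(chi_square, 2), df)
```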
We then compare this value (16.92) to the critical value of Chi-square in the Chi-square table and make our decision about our hypothesis. Here p < 0.005. So the new drug causes low birth weight in rats compared to the placebo.

One problem with the Chi-square test: if any expected frequency E is less than 5.0, the result is distorted, because with such small numbers the Chi-square distribution no longer describes (O – E) well. This can be corrected by using Yates’ correction, i.e. (|O – E| – 0.5)² is used as the numerator in the formula in such cases. (Do not worry about this at all. SPSS does this automatically. We put in this information just to show off a little bit.)

f) McNemar test
In the case of the Chi-square test, one prerequisite is that the groups are independent and not related, matched or paired. For experiments that use paired samples or a “before and after” design, the more appropriate test would be the McNemar test. It is a modified type of Chi-square test. The data are arranged as follows:

                BEFORE
                +       –
AFTER    –      A       B
         +      C       D
In this test, we are only concerned with the observations that changed during the experiment, i.e. cells A and D of the table. Since A + D represents the total number of observations that changed, ½ (A + D) would be expected to change in one direction and ½ (A + D) in the other direction if our experiment had no effect on the outcome (if the null hypothesis is true, A and D would be the same).

$$\chi^2 = \frac{(A - D)^2}{A + D}$$

Example:
Suppose we want to test whether or not the presence of a particular object in the room influenced the biting behavior of dogs caged in the room. To conduct our experiment, we would need a room full of caged dogs, the suspected object and an artificial hand for the dogs to bite. The hand could be placed in the cage of each dog in the absence and presence of the suspected object. If the hand is bitten, we would record a +ve response and if not bitten, a –ve response. Our data can be shown as follows:

                BEFORE
                +       –
AFTER    –      10      50
         +      20      120

χ² = (10 – 120)² / (10 + 120) = (110)² / 130 = 93.08
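As a quick check of this arithmetic, the sketch below applies the same formula in Python. The function name mcnemar_chi_square is our own, and no continuity correction is applied because the formula in the text does not use one.

```python
def mcnemar_chi_square(changed_one_way, changed_other_way):
    # Only the two "changed" cells (A and D in the text) enter the formula.
    return (changed_one_way - changed_other_way) ** 2 / (
        changed_one_way + changed_other_way
    )

# Dog-biting example: 10 dogs changed from + to -, 120 changed from - to +.
chi_square = mcnemar_chi_square(10, 120)
print(round(chi_square, 2))  # 93.08
```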
The interpretation is similar to the Chi-square test. So the null hypothesis can be rejected.

g) Sign test
Occasionally, in some research designs, quantitative data are impossible or impractical to collect, but it may be possible to rank the two members of a pair with respect to each other. The sign test is applicable to a research design of 2 related samples when the experimenter wishes to establish that two conditions are different. E.g. a skin rash can only be classified as mild, moderate or severe. Or the degree of pain can be classified by the patient as same, improved or worsened, and so on. The only underlying assumption of this test is that the variable under consideration has a continuous distribution. In this test, we assign a plus (+) or a minus (–) sign to each pair for the variable of interest. If the experimental conditions have made no difference, then we would expect an equal number of pluses and minuses. (If the members of a pair are not different they can be dropped from the analysis.) We only need to determine N (the number of pairs that differ) and x (the count of the less frequent sign), and compare them to the table for the sign test.
Example:
Suppose we were attempting to determine the effect of low iron intake on taste acuity for sourness. After eight weeks on a low iron diet, a group of 17 volunteers were asked to taste two sour solutions (one 1 % citric acid and the other 0.5 % citric acid) and report whether the first was sourer (+), the same (0) or less sour (–) than the second.

Result          Number of subjects
+               3
–               11
0               3
Total           17
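For this example (3 plus signs among the 14 pairs that differ), the one-tailed probability found in a sign test table can also be computed directly from the binomial distribution. The Python sketch below assumes, as the sign test does, that plus and minus are equally likely under the null hypothesis; the function name is our own.

```python
from math import comb

def sign_test_one_tailed_p(n_pairs, fewer_signs):
    # Probability of seeing "fewer_signs" or fewer of the rarer sign
    # when + and - are equally likely (p = 0.5 for each pair).
    favourable = sum(comb(n_pairs, i) for i in range(fewer_signs + 1))
    return favourable / 2 ** n_pairs

p_value = sign_test_one_tailed_p(14, 3)
print(round(p_value, 3))  # 0.029
```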
From the results, N = 14 and x = 3. The table for the sign test shows that the one-tailed probability for such an occurrence is p = 0.029. Since this is less than our level of significance (0.05), we can infer that low iron intake decreases the ability to distinguish sourness.

h) Mann-Whitney U test
When at least an ordinal level of measurement has been achieved, the Mann-Whitney U test can be used to test whether two independent groups have been drawn from the same population. This is one of the most powerful of the nonparametric tests. The null hypothesis is that both groups have the same distribution. The alternate hypothesis is that one group ranks higher (better) than the other. If the experiment has no effect then we expect the groups to be equal. To conduct the test, we must first assign a score to each subject, combine the data from the two groups, and rank the scores in order of increasing size. We let n1 = the number of cases in the smaller of the groups and n2 = the number of cases in the larger. To determine U, we focus on the smaller group (if the groups are of equal size, just pick either one). The value of U is equal to the number of times that a score in the larger group precedes a score in the smaller group. We then compare our U to the table for the Mann-Whitney U test.
Example:
The effect of zinc deficiency on learning was studied in rats. Ten control rats were fed a nutritionally complete diet while eight experimental rats were fed a low zinc diet for 10 weeks. Initially, all rats had been trained to imitate a leader rat in a T maze. The experimenter recorded the number of trials each rat required to reach a criterion of 10 correct imitations in a row. The more trials required, the more the memory is affected. The data are presented below:

Control rats         78   64   75   45   82   77   62   76   48   90
Experimental rats    110  70   53   51   93   68   57   54
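The counting rule for U described above can be sketched in Python with the rat data. The helper name count_preceding is our own, and the sketch assumes there are no tied scores between the groups (true for this data set).

```python
control = [78, 64, 75, 45, 82, 77, 62, 76, 48, 90]   # larger group (n2 = 10)
experimental = [110, 70, 53, 51, 93, 68, 57, 54]     # smaller group (n1 = 8)

def count_preceding(larger_group, smaller_group):
    # For every score in the smaller group, count how many scores
    # in the larger group come before it in the combined ranking.
    return sum(
        1
        for small_score in smaller_group
        for large_score in larger_group
        if large_score < small_score
    )

u = count_preceding(control, experimental)
u_prime = len(control) * len(experimental) - u  # the complementary count

print(u, u_prime)  # 36 44
```

The smaller of the two counts is the one compared with the tabulated critical value.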
Score    45  48  51  53  54  57  62  64  68  70  75  76  77  78  82  90  93  110
Group    C   C   E   E   E   E   C   C   E   E   C   C   C   C   C   C   E   E

We determine U by counting the number of C scores preceding each E score.

U = 2 + 2 + 2 + 2 + 4 + 4 + 10 + 10 = 36

From the table for the Mann-Whitney U test, we see that to achieve the 0.05 level of significance, the U value must be equal to or less than 17. Our U value is 36. So we infer that zinc deficiency did not affect learning in this experiment.

i) Statistical Abuses
Scientific research abounds with misapplication of statistical methods. Many a time, reviewers detect these errors and investigators have to repeat their statistical analysis, at a great cost of time and effort. But sometimes, errors of application go unnoticed and end up being published. This is rarely due to an intention to mislead; rather, it is often the result of insufficient deliberation and study of the particular experimental problem. It is probably right to say that the majority of problems and difficulties in handling one’s experimental data are caused by haste and consequent superficiality, if not errors, on the part of the investigator. The researcher who rushes into an experiment, hurries the data collection and rushes the publication of his results runs the risk of wasting his entire effort just to save a little time in the beginning. It has been said that if you really want to mess things up, use a computer. Indeed, the widespread use of computerized statistical software packages has led to misapplications and misinterpretations of results. The major areas of error include:
1. Sample size selection
2. Inappropriate statistical tests
3. Inappropriate display of results
Sample size selection
Very small samples are unlikely to give good estimates of true population values. One unusual animal or subject can have a big effect on the outcome. The findings of such studies are viewed with skepticism by most scientists. Would you use a new drug if it were launched after testing it in 2 rats? Very small samples make it tough to find statistical significance even if there may be biological significance. Very large samples also have their own problems. If the sample is sufficiently large, any numerical difference between groups can be shown to be statistically significant, whether there is any actual biological significance or not.
Inappropriate statistical tests
These are difficult to spot without a certain amount of statistical expertise. The most common of these would be the use of t-tests when there are more than two groups under study, the use of tests meant for continuous data on discrete data, and false assumptions about the independence of samples and the homogeneity of variances.
Examples:
Example 1:
t-tests are designed for no more than 2 samples. If we have an experiment in which there are five groups of subjects (Groups A to E) we might wish to make the following comparisons: A vs B, A vs C, A vs D, A vs E, B vs C, B vs D, B vs E, C vs D, C vs E and D vs E. In all, ten different comparisons. If we consider 0.05 to be an acceptable level of significance, we might conduct 10 t-tests and never realize that we actually have about a 40 % probability (instead of a 5 % probability) of getting at least one statistically significant finding by chance alone, since 1 – 0.95^10 = 0.40 if the comparisons are treated as independent.
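The inflation of the error rate can be worked out in a couple of lines of Python. The sketch assumes the comparisons are independent, which is a simplification (pairwise comparisons that share groups are not strictly independent), so the result should be read as an approximation.

```python
from math import comb

groups = 5
alpha = 0.05

comparisons = comb(groups, 2)  # number of pairwise t-tests among 5 groups
# Chance of at least one "significant" result when no real difference exists.
familywise_error = 1 - (1 - alpha) ** comparisons

print(comparisons, round(familywise_error, 2))  # 10 0.4
```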
Example 2:
Sometimes, the literature contains findings in which t-tests were used in the analysis of categorical or frequency data. These findings are meaningless because they are based on the outcome of the wrong statistical procedure. If we use 1’s to code for males and 2’s to code for females, what does it mean if we get a mean of 1.37? Absolutely nothing.
Example 3:
One of the most common errors occurs when the researcher uses a statistical procedure that wastes his data. Conversion of measurement data to ranks, categories or percentages may allow us to use a statistical procedure that is arithmetically easier to compute but makes a much weaker statement about our findings.
Inappropriate display of results obtained
Scientific literature and scientific meetings (more so) are filled with examples of the misuse of data representation techniques. These are also extremely common on television and in presentations before government committees. Many of them are conscious attempts to mislead. There are specific ways to present data and misrepresentation may make data seem more significant than it really is.
In the above graph, A is the mean heart rate of 30 females and B is the mean heart rate of 30 males. A and B look significantly different from each other in the graph. This is because the graph is plotted from 60 to 76. If the graph had been plotted from 0 to 100, the difference between A and B would not seem so large. In reality, the difference between A and B is not significant. But if we were unethical and wanted to present these data to an audience to convince them that the heart rate of males was much higher than that of females, we could easily represent the data in this manner and achieve our end. This is an extremely misleading and incorrect way of representing statistical results.

j) Quiz to test yourself
Have a look at the questions/problems below. You will be able to solve them yourself. Our tutorial has not provided solutions. If you find it impossible to get the solution for any particular problem, feel free to contact us anytime.
I) Name the type of association between the risk factor and disease in each of the examples below:
A. It has been observed that people who have low cholesterol are more likely to be interested in travel than individuals who do not. What is the most likely type of association to explain this finding?
B. Vitamin C deficient diets, especially on ships, are responsible for a lot of people suffering from bleeding disorders, the disease being called scurvy. What is the association between Vitamin C and scurvy?
C. Studies have discovered a high incidence of rickets in the Scandinavian countries. Rickets is a Vitamin D deficiency disease and Vitamin D is made in our skin from exposure to sunlight. What is the association between sunlight exposure and rickets in these people?
II) Give the appropriate statistical test for each of the following research designs
A. Five groups of mice (n = 40) were fed either a complete normal diet (Group I or controls) or diets restricted in protein as follows: Group II = 10 % less protein, Group III = 25 % less protein, Group IV = 20 % less protein, and Group V = 30 % less protein. Mean weight gains were compared after feeding these diets for 12 weeks. Investigators compared each mean to the control group.
B. In the same research project described in “A”, the numbers of mice that gained less than 90 % of the weight gained by the control mice were compared.
C. Mean fasting plasma cholesterol concentrations were determined in a group of individuals before and after 10 weeks of taking Lipitor.
D. Hypertensive men were placed on a diet which included a daily dose of 10 g of corn oil or fish oil for 36 weeks. The number of men whose diastolic blood pressure dropped by 5 mm Hg or more was compared for the two groups.
E. A group of rats was split into a control group and an experimental group. The experimental group received three weeks of a zinc deficient diet while the control group received an adequate diet. At the end of the experiment, the rats were scored as to their ability to solve a maze problem. The investigators wished to determine if the experimental group had lower ranking scores than the controls.
F. An experiment was conducted to determine the effect an iron supplemented diet had on memory in a group of low income children. The children were assessed on their ability to remember a nine digit number two minutes after having been shown the number. The children were tested before and after six months of supplementation. Individuals were scored as correct or incorrect before and after the supplementation period.
III) An investigation was performed to determine the relationship between breast cancer and coffee consumption. Breast cancer cases and healthy controls were interviewed to determine their history of coffee consumption. The results are presented below:

                       Less than 2 cups a day    More than 2 cups a day    Total
Breast cancer cases    350                       150                       500
Healthy controls       375                       125                       500
Total                  725                       275                       1000

What is the absolute risk of having breast cancer among women drinking more than 2 cups of coffee per day?

IV) Your project for your grant proposal will examine the effect of various diets on the prevention of protein cross linkage in rats treated with the pro-oxidant doxorubicin. You intend to compare three different antioxidants as dietary supplements. You feed four groups of rats a semi-purified diet which contains either no antioxidant (control group) or one of the antioxidants under study. You now need to tell your professor how many animals to buy. How do you determine the sample size for each group of rats?
V) For each of the following, state the type of study design:
A. A national survey indicated that high fiber diets were associated with a low incidence of stroke.
B. An obstetrician reported that four pregnant patients with severe morning sickness were in good health and free of symptoms after one week of supplementation with pyridoxine.
C. A study was conducted to determine if fish consumption might be related to future coronary heart disease (CHD) incidence. A group of patients with CHD was matched to a group of controls shown to be free of CHD. Medical records were searched and interviews were conducted to determine the level of past fish intake. It was determined that low fish intake was positively associated with CHD.
D. The association between high fiber diets and colon cancer was studied in a group of 5000 vegetarians and a group of 4000 individuals who ate mostly beef and did not have high fiber diets. The two groups were followed for a period of 10 years. At the end of the period, incidences of colon cancer were compared.
VI) An investigator developed a new assay for determination of sickle cell anemia. The table below displays the results of an experiment. (A positive test indicates a sickle cell patient)
                   Traditional Methodology
New test           Sickle cell    Normal    Total
Sickle cell        91             12        103
Normal             9              38        47
Total              100            50        150

Choose the correct answers:

The sensitivity of this test is:
a. 100/150 = 66.7 %
b. 91/100 = 91 %
c. 12/50 = 24 %
d. 38/50 = 76 %

The specificity of this test is:
a. 91/103 = 88.3 %
b. 103/150 = 68.7 %
c. 38/50 = 76 %
d. 47/150 = 31.3 %
As we have said before, if you get stuck on any of the above problems, do ask us for the solution.
CHILLIBREEZE PUBLICATIONS  http://www.chillibreeze.com
SECTION II
SPSS at Last
Now that we have given you a brief (was it?) introduction to the basics of statistics, we think that you will be able to understand SPSS much better and faster, and will not struggle with every tiny step of the tutorial that SPSS offers with its package. In this SPSS tutorial, we shall cover the basics of how to use SPSS. Students usually use SPSS for their classes and, more importantly, for their research analysis. Let other people talk about how tough it is to work with SPSS. We will show you how simple it is.
The versions of SPSS may keep changing. Do not panic and rush to the store to buy the latest version every time one is launched. The basic features have stayed the same for a very long time, just as in any other software program. (Think about it: do you see any change in MS Word in the past five years?)
Now straight to the tutorial. When you open SPSS (obviously by double-clicking on the icon), you will see this window.
If you want to run the tutorial, you know what to do. You can access the tutorial anytime via the Help pull down menu on the data editor.
We suggest you take the tutorial at some point, especially if you need information about every teeny-weeny aspect of every icon. For now, you can take this tutorial and start actually using SPSS. After all, that is what you bought this for, right? In order to use SPSS for statistical analysis, you must first have a file containing the data to be analyzed.
Creating and Editing a Data File
There are several ways of doing this.
• You can create your data file using the Data Editor of SPSS. This may be the easiest way to work with SPSS. (We will tell you how to go about this.)
• You can use a spreadsheet software package such as MS Excel. You might want to do this if, say, you already have a lot of data in Excel and want to transfer it to SPSS without typing everything all over again. You would then need to follow the instructions associated with that software package. WARNING: This is not a simple copy-and-paste procedure. Be sure to save your data file as a tab-delimited text file. Otherwise, this process itself will become your greatest headache. (You can specify your variable names on the first line of your Excel spreadsheet and load them directly into SPSS when you open the file. You can get more instructions regarding this in the SPSS Help section.)
• You can enter your data in a word-processing software package such as MS Word. Again, you need to separate your variables with tabs and not spaces, and you must save the file as Text only.
• You can enter your data into a command file between the “BEGIN DATA” and “END DATA” commands. NOTE: This should be done only for very small data sets.
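If you are curious what a tab-delimited text file with the variable names on the first line actually looks like, here is a small sketch in Python. It is only an illustration and has nothing to do with SPSS itself: the function name to_tab_delimited, the variable names and the sample values are all made up for this example.

```python
import csv
import io


def to_tab_delimited(variable_names, cases):
    """Return the data as tab-delimited text, variable names on the first line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(variable_names)   # first line: the variable names
    writer.writerows(cases)           # one line per case, values separated by tabs
    return buffer.getvalue()


# Hypothetical sample data: subject id, weight and height
text = to_tab_delimited(["id", "weight", "height"],
                        [[1, 65.3, 91], [2, 141.3, 171]])
print(text)
```

Every value on a line is separated from the next by exactly one tab, which is what lets a program tell the variables apart.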
Regardless of how you create your data file, the first step is to determine what the data are and how you plan to organize them. We recommend that you use fixed-field format for your data until you become an SPSS pro. You don’t have to worry about this: SPSS itself uses a fixed-format system for your data by default. Free-field format may seem tempting, but before you step into that arena, remember that it is difficult to edit if your files are large, and a lot of extra steps and additional typing will be required if you have missing data. (Any real experiment is bound to have some missing data.) Once you have created a data file, you can start performing statistical analysis on your data using SPSS.
(i) Typical SPSS Session
A typical SPSS session has three steps:
Step 1: Open SPSS. The program will load and the Data Editor window will open.
Step 2: Get your data. How you do this depends on the method used to create your data file.
Step 3: Generate SPSS commands. Select your commands from the pull down menus of the toolbar at the top of the screen or load them into SPSS as an SPSS syntax file.
Getting Started
We suggest that you begin each SPSS session in the following way (especially if you plan to print out your results for your professor or for whatever reason). This will minimize the amount of paper output you generate. (Not all universities offer free printing. At 10 cents a page, you can get a burger by saving 25 pages. Just don’t blame us when you get fat.)
• Select Options from the Edit pull down menu on the toolbar at the top of the screen.
• From the Options dialog box, select the Viewer tab at the top of the dialog box.
• Click on Infinite in the Text Output Page Size section of the dialog box.
• Also, click Display commands in the log to cause your commands to appear on your output.
• Then click on OK.
This will cause SPSS to use the minimum amount of paper possible when you print your output at the end of your session. Left to itself, SPSS issues a lot of “end of page” commands that are pretty much gibberish to you and your professor and use up too much paper when printing.
(ii) Creating a New Data File with the Data Editor
Click on the “Variable View” tab at the bottom of the Data Editor. Note that the spreadsheet labels have changed. Instead of “var” at the top of each column, you now see “Name”, “Type”, “Width”, etc. Each row corresponds to a variable (i.e. each row describes one column of the data and provides all the information about that particular column).
Let us say you type “Age” under Name. The other values then appear as they are set by default. You can change all the values as per your requirements.
First you need to decide on names for each of your variables.
Rules involved in creating variable names
• Variable names must begin with a letter or the @ symbol.
• They cannot contain the following symbols (&, !, ?, ‘, /) or blanks.
• They cannot exceed 8 characters in length.
• The following have other specific meanings for SPSS and cannot be used as variable names by themselves: ALL, AND, BY, EQ, GE, GT, LE, LT, NE, NOT, OR, TO, WITH. (It is, however, possible to use ALL1, ALL2 and so on.)
Our tip : Avoid using symbols in variable names. That way, you won’t need to remember which ones are okay and which ones are not.
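If you like to see rules written out as a checklist, the naming rules above can be sketched as a small Python function. This is our own illustration, not something SPSS provides: the name is_valid_name is hypothetical, and the check covers only the rules listed in this tutorial (with a plain apostrophe standing in for the quotation mark).

```python
RESERVED = {"ALL", "AND", "BY", "EQ", "GE", "GT", "LE", "LT",
            "NE", "NOT", "OR", "TO", "WITH"}
FORBIDDEN = set("&!?'/ ")   # symbols listed in the tutorial, plus the blank


def is_valid_name(name):
    """Check a proposed variable name against the rules listed in this tutorial."""
    if not name or len(name) > 8:
        return False                      # empty or longer than 8 characters
    if not (name[0].isalpha() or name[0] == "@"):
        return False                      # must begin with a letter or @
    if any(ch in FORBIDDEN for ch in name):
        return False                      # contains a forbidden symbol or blank
    return name.upper() not in RESERVED   # reserved words cannot stand alone


for candidate in ["Age", "ALL", "ALL1", "2ndvisit", "body wt", "@temp"]:
    print(candidate, is_valid_name(candidate))
```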
Enter a name for your variable in the first box. By default, the variable type is numeric with a w.d (width.decimal) format of 8.2. If this is incorrect, click on the … button in the Type box to specify the correct type in the Variable Type dialog box. If the column width and number of decimal places are incorrect, fix them in their boxes.
Variable LABELS and VALUE labels can be used to make your output easier to understand.
A variable LABEL is used to descriptively label a variable. Its use makes the output easier to read and can be very useful if the output is used over a long time period. Each label cannot exceed 60 characters (most procedures will only print 40 at a time and some will print even fewer).
Examples of a variable LABEL:
“Liver weight in Grams”
“Plasma Cholesterol mg/dL”
A VALUE label is used to descriptively label the values of a variable. Its use makes the output simpler to read and can be very useful if the output is used over a long time period. As with variable labels, each label must not exceed 60 characters (most procedures will only print 20 of them).
Examples of a VALUE label:
For a variable named COMMUNITY, a value of 1 could have the label Asian, a value of 2 could have the label African American, a value of 3 could have the label Hispanic, and so on.
For a variable named AGE:
1   0-10 years
2   11-20 years
3   20-45 years
4   45-65 years
5   65 years+
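The idea behind value labels is simply a lookup from a code to a description. A minimal Python sketch, using the COMMUNITY example from this tutorial (the names VALUE_LABELS and label_for are our own and purely illustrative):

```python
# Codes and labels taken from the COMMUNITY example in the text
VALUE_LABELS = {
    "COMMUNITY": {1: "Asian", 2: "African American", 3: "Hispanic"},
}


def label_for(variable, value):
    """Return the descriptive label for a coded value, or the code itself if unlabeled."""
    return VALUE_LABELS.get(variable, {}).get(value, str(value))


print(label_for("COMMUNITY", 2))   # African American
print(label_for("COMMUNITY", 9))   # 9 (no label defined for this code)
```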
Missing Data
The presence of missing data is very common in any kind of research. You will always come across dead mice, sick children, non-compliant adults, unfilled forms, lost samples and so on. You can’t really do anything about it other than plan for it while creating your data file. You can enter a special code to indicate missing data or you can leave the item blank. The good thing about SPSS is that any blank or period (“.”) is considered missing data, unlike some other software programs which treat a blank as zero (we can’t even imagine the amount of distortion that would cause).
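To get a feel for that distortion, here is a tiny Python sketch with made-up cholesterol readings in which one sample was lost. The numbers are hypothetical and the code is only an illustration of the arithmetic, not of how SPSS works internally.

```python
from statistics import mean

# Hypothetical cholesterol readings (mg/dL); None marks a lost sample
readings = [181, 177, None, 171, 159]

observed = [x for x in readings if x is not None]          # blank treated as missing
zero_filled = [0 if x is None else x for x in readings]    # blank treated as zero

mean_missing = mean(observed)
mean_zero = mean(zero_filled)
print(mean_missing)   # 172
print(mean_zero)      # 137.6
```

One lost sample treated as a zero drags the average down by more than 34 mg/dL.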
Saving your information
Now proceed to name the remaining variables (columns) of your spreadsheet. Then click on the Data View tab at the bottom of the Data Editor to enter your data. (This is very easy for us to list as a step, but it will take a long, long time. Beware of errors while entering your data. Slow and steady is the best way to go.) Finally, save your data to your disk by using the File pull down menu from the toolbar and selecting Save As. Name your file and select the format of the file with the Save as type option. And you have saved your file.
Exercise
The following data are total cholesterol levels in mg/dL for 8 groups of subjects. Create a data file and use SPSS to list it back onto paper.
Group 1: 181 177 200 171 159 173
Group 2: 249 325 425 272 217
Group 3: 260 261 263 276 235
Group 4: 296 245 306 309
Group 5: 334 340 356 350 374
Group 6: 250 232 220 235 243
Group 7: 223 305 210 309 254 236
Group 8: 320 345 340 325 341 333
(iii) Loading an Existing Data File into the Data Editor
So you are ready with some data to analyze? No? Oops, then what are we going to do? Learning SPSS needs data and the more the better. OK, let’s learn it without data. But again, that would be like fishing without going to the water at all. OK. OK. Relax. Doesn’t matter if you do not have any data at this stage. We will provide you with all the data required to learn SPSS from our tutorial. After that, of course, you will need your own set of data (Isn’t it the other way round? You need SPSS to analyze data and not data to…..). Tired of our meaningless chatter? Sorry. Let’s go ahead.
So now you have a few data files on your hard disk or diskette. They can be loaded into the SPSS Data Editor by using the File pull down menu, selecting Open and then clicking on Data. If the file is not in SPSS format (e.g. a text file or a file from a spreadsheet program like MS Excel), use the Files of type part of the Open dialog box to display the appropriate file type, then highlight your file and click on Open. Your file will be loaded into the SPSS Data Editor. If you have not yet defined your variables (set their names, types, etc.), you need to do that right now, and then save your data file to your disk using Save As from the File pull down menu, as described above. Unless you specify otherwise, it will be saved as an SPSS data file and will include your variable names, type specifications, etc. A file in this format is only readable by SPSS.
Fixed Format: Data files are most often in fixed format. This is simpler to read, edit and understand. Fixed format means that the data for a given variable occupy the same columns and line position for each case. (Note: If there is more than one line of data per case, you must use the RECORDS subcommand to specify the number of lines per case.)
Example:
ID #: Columns 1-3
Weight: Columns 5-9
Height: Columns 11-13
001 065.3 091 …….
002 141.3 171 …….
003 122.4 154 …….
Free-field format: This is another form of data organization. The variables are in the same order for each case but do not necessarily occupy the same columns for each case. The data for each variable must be separated by one or more spaces or a comma in the data file. In our opinion, it is better to stick to the fixed format while preparing data files for SPSS, especially since the free-field format always runs into the problem of missing data (part and parcel of any experiment).
Example:
ID WEIGHT HEIGHT
001 65.3 91
002 141.3 171
003 122.4 154
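Why does missing data hurt more in free-field format? The following Python sketch tries to show it. It assumes the column layout of the fixed-format example (ID in columns 1-3, weight in 5-9, height in 11-13); the function names parse_fixed and parse_free are our own, and this is an illustration of the idea rather than of how SPSS reads files.

```python
def parse_fixed(line):
    """Fixed format: ID in columns 1-3, weight in 5-9, height in 11-13."""
    weight_text = line[4:9].strip()   # columns 5-9 (Python counts from 0)
    return {
        "id": line[0:3],
        "weight": float(weight_text) if weight_text else None,  # blank = missing
        "height": int(line[10:13]),
    }


def parse_free(line):
    """Free-field format: same order, values separated by spaces or commas."""
    return line.replace(",", " ").split()


complete = parse_fixed("001 065.3 091")
# The same layout with the weight left blank: height is still in columns 11-13
blank_weight = parse_fixed("002" + " " * 7 + "171")
# In free-field format the blank vanishes and the height slides into second place
shifted = parse_free("002 171")

print(complete)
print(blank_weight)
print(shifted)
```

In the fixed layout the blank weight is recognised as missing and the height stays put. In the free-field line there is nothing to tell a reader that 171 is a height and not a weight, which is why a missing value there needs an explicit code.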
(iv) Creating and Executing SPSS Commands
Now that you have a working data set, you can begin to create and execute some SPSS commands from the pull down menus (or, if you want to get technical, by typing them directly into an SPSS Syntax window).
k) The EXPLORE command
The first thing to do when you get data from any research project is to examine the data in detail. It does not matter whether the procedure used to do this is simple or complicated. The basic point is this: if the data are incorrect, the conclusions will be faulty and the whole experiment will come to naught. To avoid this, it is better to eyeball your data using SPSS first to look for any errors.
Errors can enter data at multiple steps:
• Measurement can be in error due to poor technique, faulty instruments, or careless observers.
• Recording of data into a notebook can be done incorrectly or illegibly.
• Transcription to an electronic format can be in error.
• Data can be corrupted by defective disks, faulty hardware or software, computer viruses, user errors, etc.
Some errors are relatively easy to spot, e.g. an LDL cholesterol of 0.0 or 5000.00 mg/dL. Others may be difficult, if not impossible, to spot, e.g. an LDL cholesterol value of 312 mg/dL recorded as 123 mg/dL. Unless you begin your data analysis with a careful check of your collected data, errors can distort your findings. Careful editing of data files by comparing them to the original raw data is a good first step. Careful examination of the data displayed on the screen or on a printout can help with this task too.
The EXPLORE command can help us a lot in this process. EXPLORE can help us look at our data in different ways to determine if the data seem reasonable. EXPLORE can also help us to decide, in some cases, on the appropriate statistical tests to use. This command allows the examination of summary statistics such as range, median, mean, mode, measures of variability, measures of skewness, etc., as well as displaying the data in a variety of graphic output formats.
If you need extra information about this command, refer to the base manual. Use the Analyze pull down menu from the toolbar and select Descriptive Statistics and then Explore. The Explore dialog box will appear.
(To open this box, you need to have some data in the SPSS Data Editor. This is just an example file.) You can then select the variables of interest (one or more “dependent” variables) and choose one or more grouping variables (factors). By default, you will get box plots, stem-and-leaf plots and basic descriptive statistics for each dependent variable (either for the whole group or separated by a grouping variable). You can suppress the display of plots or descriptive statistics. You can choose additional statistics and plots as described below.
In the Explore dialog box, you can click on the Statistics button and choose one or more of the following to be displayed:
Descriptives: This is the default. It includes the mean and its confidence interval, median, 5 % trimmed mean, standard error, variance, standard deviation, minimum, maximum, range, interquartile range (IQR), skewness and its standard error, and kurtosis and its standard error. IQRs are computed by the HAVERAGE method. This is a weighted average method and is described in detail in the base manual. By default, the 95 % confidence interval is displayed. The dialog box offers the choice to set this to any value between 1 % and 99.99 %.
M-estimators: This produces M-estimators, which are robust maximum-likelihood estimators of location. This kind of statistical analysis is not commonly used in biological research. Do refer to the manual for more information.
• HUBER(c) produces Huber’s M-estimator with a default of c = 1.339.
• ANDREW(c) produces Andrews’ wave estimator with a default of c = 1.34π.
• TUKEY(c) produces Tukey’s biweight estimator with a default of c = 4.685.
Outliers: This displays the cases with the five highest and five lowest values for each variable. It is labeled “Extreme values” on the output sheet.
Percentiles: This displays the 5th, 10th, 25th, 50th, 75th, 90th and 95th percentiles, using the HAVERAGE (weighted average at X(W+1)p) method to calculate the percentiles. It also displays Tukey’s hinges (25th, 50th and 75th percentiles). Refer to the manual for more details.
Grouped frequency tables: This displays tables for the total sample and broken down by any factor variables.
In the Explore dialog box, you can click on the Plots button and choose one or more of the following to be displayed:
Box plots: You can choose one of the following alternatives.
• Factor levels together: This is the default. It displays side-by-side box plots for a given dependent variable for each group defined by a factor variable. If no factors have been defined, a box plot for the total sample is displayed.
• Dependents together: This displays side-by-side box plots for a given group for each dependent variable.
• None: This suppresses the display of box plots.
Descriptive: You can choose one or both of the following alternatives.
• Stem-and-leaf: This is the default. A stem-and-leaf plot is produced in which each observed value is divided into two components: leading digits (stem) and trailing digits (leaf). See the manual for more details.
• Histogram: A histogram is printed. The range of observed values for the variable is divided into intervals and the number of cases in each interval is displayed.
Spread vs. level with Levene’s test: A spread-versus-level plot is produced with the slope of the regression line and Levene’s test for homogeneity of variances. If no factor variable has been defined, this plot is not produced. You can choose one of the following:
• None: No plots or tests are produced.
• Power estimation: For each group, the natural log of the median is plotted against the log of the interquartile range. Estimated power is also displayed. Use this to determine an appropriate transformation of your data.
• Transformed: The data are transformed according to a user-specified power. One of the alternatives below needs to be selected.
  • Natural log: This is the default. A natural log transformation.
  • 1/square root: The reciprocal of the square root is calculated for each data value.
  • Reciprocal: A reciprocal transformation.
  • Square root: A square root transformation.
  • Square: Data values are squared.
  • Cube: Data values are cubed.
• Untransformed: No data transformation is performed, i.e. the power value is 1.
Normality plots with tests: Normal probability and detrended probability plots are produced with a number of test statistics.
In the Explore dialog box, you can click on the Options button and choose one of the following as a method to handle missing values:
• Exclude cases listwise: This is the default. Cases with missing data for any dependent or factor variable are excluded from all analyses.
• Exclude cases pairwise: Only cases with no missing values for variables in a cell are included in the analysis of that cell. A case may have missing data for variables not used in that cell.
• Report values: Missing values for factor variables are treated as an additional category and reported as such in all output.
The EXPLORE command is a very useful and powerful tool for understanding any data generated by our research. Those of you who wish to use it more will need to explore it in much more detail (for the rest of you, on with the tutorial). Do use the base manual for this.
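To make the idea of eyeballing your data concrete, here is a rough Python sketch of the kind of checks EXPLORE makes easy: a few summary statistics, the five lowest and highest values, and a flag for anything outside a plausible range. The function name explore, the sample LDL values and the limits of 20 and 400 mg/dL are all assumptions made up for this illustration; none of this is SPSS output.

```python
from statistics import mean, median


def explore(values, low, high):
    """Summarize values and flag any outside the plausible range [low, high]."""
    ordered = sorted(values)
    return {
        "mean": mean(ordered),
        "median": median(ordered),
        "range": ordered[-1] - ordered[0],
        "lowest": ordered[:5],              # five lowest values
        "highest": ordered[-5:][::-1],      # five highest values, largest first
        "suspect": [v for v in values if not low <= v <= high],
    }


# Hypothetical LDL cholesterol values (mg/dL); 0.0 and 5000.0 are entry errors
ldl = [112.0, 98.5, 0.0, 131.0, 5000.0, 120.5, 104.0]
report = explore(ldl, low=20.0, high=400.0)
print(report["suspect"])   # [0.0, 5000.0]
print(report["median"])    # 112.0
```

Notice that a check like this catches only the obvious errors. A value of 312 mistyped as 123 sits comfortably inside the plausible range, so nothing replaces comparing the file against the original raw data.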
l) The FREQUENCIES command
This command produces tables of frequency counts and percentages for the values of individual variables. By default, a table is created that displays the counts for each value of a variable, the counts as percentages of all cases and of all cases with non-missing values, and the cumulative percentage over all cases with non-missing values. The values are ordered from lowest to highest. All variable labels and value labels are printed if they have been defined. These defaults can be altered with a number of subcommands. Also, bar charts, histograms and statistics can be chosen.
Select the FREQUENCIES command by clicking on the Analyze pull down menu on the toolbar, then select Descriptive Statistics and then Frequencies. The Frequencies dialog box will open and allow you to select variables for generating output. If all you want is the default, just click the OK button. You can also select optional output from the dialog box.
Display frequency tables: This is the default. To suppress frequency tables, click on the little box to remove the check mark.
Statistics: This controls the display of statistics. By default, no statistics are displayed. One or more of the following choices may be used.
Under the Percentile Values box, you can select one or more of the following:
• Quartiles: Displays the 25th, 50th and 75th percentiles.
• Cut points for n equal groups: Displays percentile values that divide the sample into equal-sized groups. 10 is the default. You can enter any positive integer between 2 and 100. The number of percentiles displayed is one fewer than the number of groups specified.
• Percentile(s): Displays user-specified percentiles. You can enter any percentile value between 0 and 100, and then click on Add. You can continue to add percentile values to build a list which will be displayed. You can remove or change your entered percentiles before execution of the command by highlighting them and clicking the appropriate button.
Under the Dispersion box, you can select one or more of the following:
• Standard deviation
• Variance
• Range
• Minimum
• Maximum
• S.E. mean (standard error of the mean)
Under the Central Tendency box, you can select one or more of the following:
• Mean
• Median
• Mode
• Sum
Under the Distribution box, you can select one or more of the following:
• Skewness: This is a measure of the asymmetry of a distribution compared with a normal distribution. Positive values indicate a longer tail to the right and negative values a longer tail to the left.
• Kurtosis: This is a measure of the shape of a distribution for a given standard deviation. Positive values indicate that the distribution is more peaked than a normal distribution and negative values indicate a flatter distribution.
Charts: This allows selection of bar charts and histograms and some control over how bar charts are labeled. You can choose only one of the following:
• None: This is the default. No charts are displayed.
• Bar chart(s): Scale is determined by the frequency of the largest category plotted.
• Pie chart(s): It is possible to plot the graph as a pie chart showing percentages.
• Histogram(s): Available for numerical variables only. The maximum number of intervals plotted is 21. The With normal curve option superimposes a normal curve over the histogram.
The Chart Values box allows control of the vertical axis label for bar charts and pie charts using one of the following:
• Frequencies: This is the default. The label is frequencies.
• Percentages: The label is percentages.
Format: This controls the output of frequency tables. Under the Order by box, you can select one of the following:
• Ascending values: This is the default. Sorts categories by ascending order of values.
• Descending values: Sorts categories by descending order of values.
• Ascending counts: Sorts categories by ascending order of frequency counts.
• Descending counts: Sorts categories by descending order of frequency counts.
If you have selected percentiles or a histogram, you get the output for Ascending values, regardless of your selection here.
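For those who want to see what is inside a frequency table, here is a small Python sketch that builds the count, the percentage of non-missing cases and the cumulative percentage for each value, lowest value first. It also shows why cut points for n groups give n - 1 values. The names and data are hypothetical, and Python's quantiles function uses its own default calculation method, which is not necessarily the one SPSS uses, so the actual cut-point values may differ.

```python
from collections import Counter
from statistics import quantiles


def frequency_table(values):
    """Rows of (value, count, percent, cumulative percent), lowest value first."""
    valid = [v for v in values if v is not None]   # drop missing values
    counts = Counter(valid)
    rows, running = [], 0
    for value in sorted(counts):
        running += counts[value]
        rows.append((value, counts[value],
                     100 * counts[value] / len(valid),
                     100 * running / len(valid)))
    return rows


codes = [1, 2, 2, 3, 1, 2, None, 3, 2, 1, 2]   # hypothetical group codes
table = frequency_table(codes)
for row in table:
    print(row)

ages = [23, 35, 41, 29, 52, 47, 38, 31, 60, 44, 27, 55]   # hypothetical ages
quartiles = quantiles(ages, n=4)    # 3 cut points for 4 groups
deciles = quantiles(ages, n=10)     # 9 cut points for 10 groups
print(len(quartiles), len(deciles))   # 3 9
```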
You can also control the printing of large tables and the printing of multiple variables.
m) The DESCRIPTIVES command
This command generates a listing containing the variable name, variable label, mean, standard deviation, minimum and maximum. Additionally, the standard error of the mean, variance, kurtosis, skewness, range and sum may be requested under the Options subcommand. To use the DESCRIPTIVES command, select Analyze from the toolbar, then Descriptive Statistics and then Descriptives. A dialog box will appear which will allow you to select variables, select additional descriptive statistics and set the order in which the variables are displayed. This is a very useful tool for a quick summary of your data.
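As a rough idea of what these numbers are, here is a Python sketch that computes the same kind of summary for the Group 1 cholesterol values from the earlier exercise. The function name descriptives is our own, and we assume the usual sample standard deviation (n - 1 in the denominator) and the standard error of the mean as the standard deviation divided by the square root of n.

```python
from math import sqrt
from statistics import mean, stdev


def descriptives(values):
    """Basic summary: n, mean, sample SD, SE of the mean, minimum, maximum, range, sum."""
    n = len(values)
    sd = stdev(values)   # sample standard deviation (n - 1 in the denominator)
    return {
        "n": n,
        "mean": mean(values),
        "sd": sd,
        "se_mean": sd / sqrt(n),
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
        "sum": sum(values),
    }


group1 = [181, 177, 200, 171, 159, 173]   # Group 1 from the cholesterol exercise
stats = descriptives(group1)
print(stats)
```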
Exercise
Using our data file diet.sav and the commands Frequencies and Descriptives, answer the following:
How many and what percent of individuals fall into the following categories?
                        #              %
Males                ___________    __________
Females              ___________    __________
Asians               ___________    __________
Orientals            ___________    __________
African Americans    ___________    __________

For each of the following, give the mean, standard deviation, minimum and maximum values without using the Frequencies command:
                               Mean        SD         Min-Max
Age                            ________    _______    _______
Kilocalories                   ________    _______    _______
Fat                            ________    _______    _______
Protein                        ________    _______    _______
Carbohydrate                   ________    _______    _______
Total cholesterol              ________    _______    _______
Polyunsaturated fatty acids    ________    _______    _______
Saturated fatty acids          ________    _______    _______
Calcium                        ________    _______    _______
n) The IF and COMPUTE commands
These commands are extremely useful when preparing data and creating new variables from the raw data file.
The COMPUTE command allows us to create a new variable or modify an existing variable. Choose the command from the toolbar using the Transform and Compute pull down menus. The Compute Variable dialog box will appear. You will be able to modify an existing variable or create a new one. Type in the name of the new variable (or an existing one) in the Target Variable box. Click on the Numeric Expression box and select variables and operators from the calculator pad or type them in from the keyboard.
Arithmetic operators are:
+     Addition
-     Subtraction
*     Multiplication
**    Exponentiation
/     Division
Numeric functions are:
ABS      Absolute Value
TRUNC    Truncate
SQRT     Square Root
LG10     Base 10 Logarithm
RND      Round
MOD10    Modulus
EXP      Exponential
LN       Natural Logarithm
SIN      Sine
COS      Cosine
ATAN     ArcTangent
There are a number of other functions available.
Since it is possible to define some fairly complex expressions, it is important to understand the order in which operations are performed. Numeric functions are performed first, exponentiation next, then multiplication and division and, finally, addition and subtraction. Apart from this hierarchy, expressions are evaluated left to right. The order of operations can be controlled by using parentheses.
Examples:
X = A + B * 2
X = (A + B) * 2
If A = 4 and B = 5, then the first expression would equal 14 while the second would equal 18.
Examples:
RATIO = (A + B) * (C + D) / (E - F) ** 2
STANDARD = 142.675
X = ABS(Y)
TEST2 = TEST1 + 3
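Python happens to follow the same order of operations for these operators, so you can check the two examples (and see the effect of parentheses) with a few lines. This is only a way to convince yourself of the arithmetic; the values given to C, D, E and F are made up for the illustration.

```python
A, B = 4, 5
first = A + B * 2        # multiplication happens before addition
second = (A + B) * 2     # parentheses force the addition first

# Hypothetical values for the RATIO example
C, D, E, F = 1, 2, 5, 3
ratio = (A + B) * (C + D) / (E - F) ** 2   # exponentiation first: 9 * 3 / 4

print(first, second, ratio)   # 14 18 6.75
```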
The IF command causes a variable to be created or modified only when certain conditions are met. In other words, the Compute command is executed only for those cases for which the IF condition is true. A logical expression is an expression that can be evaluated as true, false or missing, based on conditions found in the data. Logical expressions can be simple logical relations among variables, or they can be complex logical tests involving variables, constants, functions, relational operators and logical operators.

Relational operators are:
=     Equal to
<     Less than
>     Greater than
~=    Not equal to
<=    Less than or equal to
>=    Greater than or equal to

Logical operators are:
&     And
|     Or
~     Not

Example:
IF (x < 12)       COMPUTE A = B + 2
IF (SEX = 'F')    COMPUTE IRONRDA = 18 + Y
IF (AGE < 22)     COMPUTE AGEGROUP = 1
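The true / false / missing behaviour of a conditional compute can be sketched in Python as follows. This is a rough analogue and not SPSS code: the cases, the helper conditional_compute and the variable name YOUNGF are all hypothetical, and None stands in for a missing value.

```python
# Hypothetical cases; None stands in for a missing value.
cases = [
    {"SEX": "F", "AGE": 19},
    {"SEX": "M", "AGE": 30},
    {"SEX": "F", "AGE": None},
]

def conditional_compute(rows, condition, target, value):
    """Set row[target] = value only for rows where condition(row) is True."""
    for row in rows:
        try:
            outcome = condition(row)
        except TypeError:          # comparison against a missing value
            outcome = None         # treated as "missing", so nothing is computed
        if outcome is True:
            row[target] = value
        else:
            row.setdefault(target, None)   # new variable stays missing
    return rows

# Analogue of the pair: IF (AGE < 22) / COMPUTE AGEGROUP = 1
conditional_compute(cases, lambda r: r["AGE"] < 22, "AGEGROUP", 1)

# Two conditions joined with "and" (the & operator above); YOUNGF is made up.
conditional_compute(cases, lambda r: r["SEX"] == "F" and r["AGE"] < 22, "YOUNGF", 1)

for row in cases:
    print(row)
```

Only the first case satisfies either condition; the case with a missing AGE is left missing rather than being counted as false.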
o) The MEANS command This command produces tables of means, standard deviations and group counts for a dependent variable within groups defined by one or more independent variables. Several other SPSS procedures are capable of displaying similar output. Use the Analyze pull down menu and choose Compare Means and then select the Means command. The Means dialog box will allow you to select those dependent variables for which you want means calculated and allow you the option of selecting grouping (independent) variables. You can also “layer” your independent variables to further subdivide your sample. By default you will get the mean, standard deviation and count.
You can open the Options dialog box to alter this list or to add a number of other descriptive statistics. You can also control the use of variable and value labels and request an analysis of variance and/or a test for linearity.
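To make clear what the default Means output contains, here is a minimal Python sketch that builds the same three figures (mean, standard deviation and count) for each group. The records and the helper name means_table are made up for illustration and are not taken from the tutorial's data file.

```python
from collections import defaultdict
from statistics import mean, stdev

# Made-up records: (SEX code, KCAL).
records = [(1, 2500), (1, 2700), (1, 2900), (2, 2100), (2, 2300), (2, 2200)]

def means_table(rows):
    """Return {group: (mean, sample SD, count)} for one dependent variable."""
    groups = defaultdict(list)
    for group, value in rows:
        groups[group].append(value)
    return {g: (mean(v), stdev(v), len(v)) for g, v in sorted(groups.items())}

table = means_table(records)
for group, (m, sd, n) in table.items():
    print(f"SEX={group}  mean={m:.2f}  sd={sd:.2f}  n={n}")
```

Layering a second independent variable would correspond to grouping on a pair of codes instead of a single code.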
p) The T-TEST command
This command is for performing t-tests. To use this command, select the Analyze pull-down menu and choose Compare Means, followed by the appropriate t-test, i.e. independent or paired, depending on your data. For non-paired data, select Independent-Samples T Test and the Independent-Samples T Test dialog box will open. Select the variable(s) for which you want a t-test performed and identify the grouping variable. The group codes within the grouping variable must also be defined. Use the Define Groups dialog box to do this.

Example: To conduct a t-test on the variable KCAL between two groups coded as 1 and 2 in a variable called SEX, you would select KCAL as the test variable, SEX as the grouping variable and then define group 1 as 1 and group 2 as 2 in the dialog boxes (meaning group 1 = 1 = male and group 2 = 2 = female). You can also select the Options dialog box to change how missing data are handled, alter label displays and change the confidence interval.
The output will be similar to the following:

T-Test

Group Statistics
        SEX    N     Mean      Std. Deviation    Std. Error Mean
CHOL    1      16    350.00    29.781            7.445
        2      16    225.00    67.382            16.845

Independent Samples Test (CHOL)
                                             Equal variances    Equal variances
                                             assumed            not assumed
Levene's Test for Equality of Variances
  F                                          7.258
  Sig.                                       .011
t-test for Equality of Means
  t                                          6.787              6.787
  df                                         30                 20.645
  Sig. (2-tailed)                            .000               .000
  Mean Difference                            125.000            125.000
  Std. Error Difference                      18.417             18.417
  95% Confidence Interval of the Difference
    Lower                                    87.387             86.659
    Upper                                    162.613            163.341
Levene's test is performed to decide whether we should use the equal-variances t value or the unequal-variances t value. If p > 0.05 for Levene's test, we use the equal-variances t value. Here p for Levene's test is 0.011, which means we must use the unequal-variances t-test. Its p value is printed as .000 (that is, p < 0.001, well below 0.05), which means that our test is significant. There is a significant difference between males and females in the variable tested (CHOL in the output shown).

For paired data, we select Paired-Samples T Test and the respective dialog box is opened. Then we select the two variables containing the paired data (click on the first one, hold down the Shift key and click on the second one). The Options dialog box is identical for both t-tests. Here we do not have paired data.
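The "equal variances not assumed" row of the output can be checked by hand from the Group Statistics alone. The sketch below uses the usual Welch formulas for the standard error, the t value and the degrees of freedom; the function name is our own, and the results should land close to the printed values (small differences can come from rounding in the printed means and standard deviations).

```python
from math import sqrt

def welch_from_summary(m1, s1, n1, m2, s2, n2):
    """Unequal-variances t statistic, SE of the difference and Welch df."""
    v1, v2 = s1 ** 2 / n1, s2 ** 2 / n2          # squared standard errors
    se_diff = sqrt(v1 + v2)
    t = (m1 - m2) / se_diff
    df = (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))
    return t, se_diff, df

# Mean, standard deviation and N for SEX = 1 and SEX = 2 (Group Statistics).
t, se_diff, df = welch_from_summary(350.00, 29.781, 16, 225.00, 67.382, 16)
print(f"t = {t:.3f}, SE of difference = {se_diff:.3f}, df = {df:.3f}")
```

Note how the degrees of freedom drop from 30 to roughly 20.6 once the two variances are no longer pooled.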
Exercise
A few problems to test your t-test quotient

I) The data below were collected during an animal feeding experiment. One group of rats was provided a complete diet (controls) and the other group was fed a diet low in protein (deficient diet). Determine if the deficient diet produced a different weight gain (in g) as compared to the control diet.

Deficient Diet
Rat    Initial Wt.    Final Wt.
1      222            352
2      224            370
3      225            360
4      224            381
5      227            352

Control Diet
Rat    Initial Wt.    Final Wt.
1      223            417
2      219            416
3      224            415
4      225            417
5      226            419

Answer the following:
                           Control Group    Deficient Group
Mean Initial Wt.           _________        _________
Standard Deviation         _________        _________
Statistical test used      _________        _________
Level of significance      _________        _________
Mean Final Wt.             _________        _________
Standard Deviation         _________        _________
Statistical test used      _________        _________
Level of significance      _________        _________

Decision on the effect of the deficient diet ________________________________.

II) An experiment was conducted to determine the effect of a high-salt meal on the systolic blood pressure (SBP) of subjects. Blood pressure was determined in 12 subjects before and after ingestion of a test meal containing 10.0 g of salt. The data obtained were:

Subject    SBP before meal    SBP after meal
1          120                148
2          130                144
3          133                148
4          120                115
5          123                122
6          140                157
7          131                144
8          120                134
9          125                140
10         130                169
11         131                133
12         140                153

What is the mean SBP for each time period? __________ __________
What is the standard deviation for each time period? __________ __________
Are the means statistically different? ___________
Which statistical test did you use? _______________________
What was the level of significance? ______________

III) Compare mean hematocrit levels (%) from the two groups below:

Mice from Group X: 40.5 39.3 41.5 40.5 40.4 41.8 39.3
Mice from Group Y: 43.9 47.4 46.7 47.9 43.7 54.3 48.7 46.4 47.8 49.2 56.5

Note: hematocrit means % red blood cells in peripheral blood. Each data point above represents the arithmetic mean of 3+ determinations. (Hint: Do you know the normal mouse hematocrit? It is about 39-49 %.)

Mean hematocrit for group X mice ____________
Standard deviation ____________
Mean hematocrit for group Y mice ____________
Standard deviation ____________
Which statistical test did you use? ____________
What is the level of significance? __________
Was there a biological significance? (Note: this is different from statistical significance. Look at the normal values and decide the biological significance.)

IV) Compare means for Coenzyme Q levels in whole blood from cardiomyopathy patients. Group 1 is placebo treated and Group 2 received 100 mg/day of CoQ. The data are listed below. (A clinical improvement is seen in patients with CoQ concentrations at or above 2.5 microg/ml.)

CoQ concentrations (microg/ml)
Group 1: 0.7 0.9 1.1 0.5 0.4
Group 2: 2.2 3.0 2.3 1.4 2.5

Group 1 mean CoQ level _______ Standard deviation _______
Group 2 mean CoQ level _______ Standard deviation _______
Which statistical test did you use? _________
Level of significance was __________
Was there a biological difference between groups in this experiment? (Note: Look at the value needed to see a biological improvement.)

V) The data below are from a six-week animal study in which rats in the experimental group received 35 % of their energy as ethanol. Control rats received the same diet with dextrin substituted for the ethanol calories. All animals were fed ad libitum and both diets were nutritionally complete.

Experimental                    Control
Initial wt.    Final wt.        Initial wt.    Final wt.
89             283              89             335
93             287              87             342
87             292              87             344
90             285              90             336
91             267              91             321
88             295              88             348
86             280              86             332
95             282              95             331
87             284              93             336
89             296              89             349

                               Experimental    Control
Mean initial weight            ___________     __________
Standard Deviation             ___________     __________
Statistical test used was?     ___________     __________
Level of significance was?     ___________     __________
Mean weight gain               ___________     __________
Standard Deviation             ___________     __________
Statistical test used was?     ___________     __________
Level of significance was?     ___________     __________
Was there a biological difference between the groups in this experiment?

q) The ONEWAY ANALYSIS OF VARIANCE command
One-way Analysis of Variance is a technique used to compare group means when there are more than two groups of subjects. The One-Way ANOVA dialog box can be opened by selecting the Analyze pull-down menu followed by Compare Means and then One-Way ANOVA. The dialog box will allow you to select those (dependent) variables for which you would like to conduct one-way analysis of variance. You must also select and define the grouping variable (Factor).

You can open the Contrasts dialog box to test for trends or to define a priori contrasts. If you wish to test for trends, select Polynomial and define the Degree. You can choose to test for linear, quadratic, cubic, 4th degree or 5th degree polynomials.

You can open the Post Hoc dialog box to conduct post hoc multiple comparison tests of means at the 0.05 p-level. You can choose one or more of the following: Least significant difference, Bonferroni, Duncan's multiple range test, Student-Newman-Keuls, Tukey's honestly significant difference, Tukey's b and/or Scheffe. You can also select the way the means are calculated.

You can open the Options dialog box to control the way missing data are handled, to print out descriptive statistics for each group, to print out the Levene statistic and to control the display of labels.

EXAMPLE: Suppose an experiment were conducted to determine the dietary status of the three racial groups in our data. A one-way AoV test can be performed on the variables KCAL (kcal consumption), PRO (protein consumption), VITA (Vitamin A) and FE (iron). The variable which defines the groups to be compared is RACE. The minimum code used to define groups is 1 and the maximum is 3. A Scheffe multiple range test can be performed to compare the group means to one another if the AoV is significant. The Scheffe procedure will be performed at the 0.05 level of significance.

A table containing group sample sizes, means, standard deviations, standard errors, minimums, maximums and 95 % confidence intervals for each variable tested can be displayed.

The output obtained will be similar to this.

ONEWAY KCAL PRO VITA FE BY RACE
  /STATISTICS DESCRIPTIVES
  /MISSING ANALYSIS
  /POSTHOC = SCHEFFE ALPHA(.05).
ONEWAY

Descriptives
                                                                           95% Confidence Interval for Mean
Variable  Group              N     Mean       Std. Deviation  Std. Error   Lower Bound   Upper Bound   Minimum   Maximum
KCAL      Asian              11    2711.36    355.507         107.189      2472.53       2950.20       2195      3241
          African American   12    2865.17    421.962         121.810      2597.06       3133.27       2195      3241
          Oriental           9     2641.00    375.573         125.191      2352.31       2929.69       2195      3241
          Total              32    2749.25    386.605         68.343       2609.86       2888.64       2195      3241
PRO       Asian              11    96.64      11.084          3.342        89.19         104.08        80        120
          African American   12    99.83      13.750          3.969        91.10         108.57        80        120
          Oriental           9     96.33      12.114          4.038        87.02         105.65        80        120
          Total              32    97.75      12.136          2.145        93.37         102.13        80        120
VITA      Asian              11    4983.45    947.277         285.615      4347.07       5619.84       2973      6570
          African American   12    5228.00    1573.407        454.203      4228.31       6227.69       973       6570
          Oriental           9     4298.44    1700.459        566.820      2991.36       5605.53       977       5966
          Total              32    4882.50    1436.305        253.905      4364.66       5400.34       973       6570
FE        Asian              11    10.336     3.7583          1.1332       7.812         12.861        4.8       16.5
          African American   12    11.017     4.3626          1.2594       8.245         13.789        4.8       16.5
          Oriental           9     9.567      4.1914          1.3971       6.345         12.788        4.8       16.5
          Total              32    10.375     4.0240          .7114        8.924         11.826        4.8       16.5

ANOVA
Variable  Source            Sum of Squares   df    Mean Square    F        Sig.
KCAL      Between Groups    282491.8         2     141245.894     .941     .402
          Within Groups     4350866          29    150029.869
          Total             4633358          31
PRO       Between Groups    83.788           2     41.894         .271     .764
          Within Groups     4482.212         29    154.559
          Total             4566.000         31
VITA      Between Groups    4614641          2     2307320.525    1.128    .338
          Within Groups     59337495         29    2046120.515
          Total             63952136         31
FE        Between Groups    10.838           2     5.419          .320     .729
          Within Groups     491.142          29    16.936
          Total             501.980          31
Since the p values are not significant, Scheffe’s test has no meaning in this case.
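To see where the F values in the ANOVA table come from, each one can be recomputed from the sums of squares and degrees of freedom printed beside it. The short Python sketch below does this outside SPSS; the helper name f_ratio is our own.

```python
# (between-groups SS, within-groups SS) copied from the ANOVA table.
sums_of_squares = {
    "KCAL": (282491.8, 4350866),
    "PRO": (83.788, 4482.212),
    "VITA": (4614641, 59337495),
    "FE": (10.838, 491.142),
}
DF_BETWEEN, DF_WITHIN = 2, 29   # 3 groups, 32 cases

def f_ratio(ss_between, ss_within, df_between, df_within):
    """F = (SS_between / df_between) / (SS_within / df_within)."""
    return (ss_between / df_between) / (ss_within / df_within)

f_values = {name: f_ratio(ssb, ssw, DF_BETWEEN, DF_WITHIN)
            for name, (ssb, ssw) in sums_of_squares.items()}

for name, f in f_values.items():
    print(f"{name}: F = {f:.3f}")
```

An F value near 1 or below, as seen for all four variables, says that the spread between the group means is no larger than the spread within the groups.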
POST HOC TESTS

Multiple Comparisons (Scheffe)
                                                                                                  95% Confidence Interval
Dependent Variable  (I) RACE           (J) RACE           Mean Difference (I-J)  Std. Error  Sig.   Lower Bound   Upper Bound
KCAL                Asian              African American   -153.803               161.684     .640   -570.91       263.31
                    Asian              Oriental           70.364                 174.095     .922   -378.76       519.49
                    African American   Asian              153.803                161.684     .640   -263.31       570.91
                    African American   Oriental           224.167                170.800     .433   -216.46       664.79
                    Oriental           Asian              -70.364                174.095     .922   -519.49       378.76
                    Oriental           African American   -224.167               170.800     .433   -664.79       216.46
PRO                 Asian              African American   -3.197                 5.189       .828   -16.58        10.19
                    Asian              Oriental           .303                   5.588       .999   -14.11        14.72
                    African American   Asian              3.197                  5.189       .828   -10.19        16.58
                    African American   Oriental           3.500                  5.482       .817   -10.64        17.64
                    Oriental           Asian              -.303                  5.588       .999   -14.72        14.11
                    Oriental           African American   -3.500                 5.482       .817   -17.64        10.64
VITA                Asian              African American   -244.545               597.094     .920   -1784.92      1295.83
                    Asian              Oriental           685.010                642.929     .573   -973.61       2343.63
                    African American   Asian              244.545                597.094     .920   -1295.83      1784.92
                    African American   Oriental           929.556                630.759     .351   -697.67       2556.78
                    Oriental           Asian              -685.010               642.929     .573   -2343.63      973.61
                    Oriental           African American   -929.556               630.759     .351   -2556.78      697.67
FE                  Asian              African American   -.6803                 1.7178      .925   -5.112        3.751
                    Asian              Oriental           .7697                  1.8497      .917   -4.002        5.542
                    African American   Asian              .6803                  1.7178      .925   -3.751        5.112
                    African American   Oriental           1.4500                 1.8147      .729   -3.232        6.132
                    Oriental           Asian              -.7697                 1.8497      .917   -5.542        4.002
                    Oriental           African American   -1.4500                1.8147      .729   -6.132        3.232
HOMOGENEOUS SUBSETS

KCAL (Scheffe a,b)
                             Subset for alpha = .05
RACE                 N       1
Oriental             9       2641.00
Asian                11      2711.36
African American     12      2865.17
Sig.                         .425

PRO (Scheffe a,b)
                             Subset for alpha = .05
RACE                 N       1
Oriental             9       96.33
Asian                11      96.64
African American     12      99.83
Sig.                         .813

VITA (Scheffe a,b)
                             Subset for alpha = .05
RACE                 N       1
Oriental             9       4298.44
Asian                11      4983.45
African American     12      5228.00
Sig.                         .343

FE (Scheffe a,b)
                             Subset for alpha = .05
RACE                 N       1
Oriental             9       9.567
Asian                11      10.336
African American     12      11.017
Sig.                         .724

For each table: Means for groups in homogeneous subsets are displayed.
a. Uses Harmonic Mean Sample Size = 10.513.
b. The group sizes are unequal. The harmonic mean of the group sizes is used. Type I error levels are not guaranteed.
The "Subset for alpha = .05" box is not divided for any variable analyzed. If the group means had been significantly different, this box would have been divided into columns 1, 2 and so on. Groups that appear in separate columns are different from each other.

Exercise
I) The data below were collected from an animal feeding experiment. Analyze the data using SPSS and answer our questions.

Group    Rat number    Initial wt.    Final wt.    Brain wt.
A        1             80             335          6.7
A        2             79             342          6.6
A        3             89             353          7.1
A        4             85             337          6.5
A        5             83             334          6.8
B        1             82             335          4.3
B        2             85             345          4.7
B        3             87             357          4.3
B        4             82             335          4.5
B        5             83             343          4.6
C        1             80             398          8.1
C        2             86             405          8.2
C        3             87             398          7.3
C        4             84             398          7.6
C        5             86             395          8.2

                                                  Group A    Group B    Group C
Initial weight (means +/- SD)                     _______    _______    _______
Statistical significance ________
Statistical test used _________
Your interpretation of the results _______________

                                                  Group A    Group B    Group C
Weight gain (means +/- SD)                        _______    _______    _______
Statistical significance ________
Statistical test used _________
Which groups were different from each other? _______________
Your interpretation of the results ____________________

                                                  Group A    Group B    Group C
Brain wt as a % of body weight (means +/- SD)     _______    _______    _______
Statistical significance ________
Statistical test used _________
Which groups were different from each other? _______________
Your interpretation of the results ____________________

II) An experiment was conducted to determine the effect of various diets containing different levels of protein on weight gain in rats. The data are presented below:

Diet A    Diet B    Diet C    Diet D
89        112       125       159
97        110       128       159
90        120       134       166
90        113       126       167
93        106       127       161
86        106       122       165
99        108       128       159
82        116       140       159
92        105       130       158
92        105       130       158
90        106       123       157

Weight Gains (g) (Means +/- s.d.)
Group A ______________________
Group B ______________________
Group C ______________________
Group D ______________________

Which groups are significantly different from each other? _____________
Which test did you use? ______________
Level of significance? ________________
Was there a linear trend to the results? _____________
Which test did you use? _______________
Level of significance? _____________

III) Use SPSS to test for polynomial trends in the following data:

Group A    Group B    Group C
13         169        2197
15         225        3375
16         256        4096
17         289        4913
20         400        8000

(Mean +/- SD)
Group A ________________________
Group B ________________________
Group C ________________________
Which groups are significantly different from each other? _______
What test did you use? _______________
Level of significance? ________________
What test did you use? ________________
Level of significance? _________________
What type of trend existed? _______________

r) Scattergrams and Regression
SPSS can be used to generate a scatterplot of data as well as a variety of statistics describing and testing the plot. This command can be useful for eyeballing your data. It can also be used for a variety of purposes, including the generation of data for a standard curve for lab assays. This is actually a number of different but related procedures that can be found under the Analyze pull-down menu with the selection of Regression followed by the selection of the desired procedure. We will only try a few of them.

A scattergram can be a simple way to view your data. It is an excellent way to detect outliers (which may even be errors). To get a quick scattergram, select the Graphs pull-down menu and choose Scatter. The Scattergram dialog box will appear. You would normally choose Simple Scattergram and click the Define button to choose the variables to be used in your graph.

You can then identify the Y-axis variable and the X-axis variable and control labeling of the graph. This is an example of a simple scatterplot generated with SPSS.
[Figure: simple scatterplot of KCAL (Y-axis, 2000 to 3250) against PRO (X-axis, 80 to 120)]
You can also use SPSS to generate the information needed to define a standard curve and calculate unknowns from laboratory work. Open the absorbance.sav data file. Use the Analyze pull down menu, select Regression and then Curve Estimation. The following dialog box will open:
Select the dependent (Y-axis) variable and the independent (X-axis) variable. You will use the default (linear) model. You can also control labeling. SPSS will display a plot of the data. This is not the important output. Close the Chart Carousel window and look at the results in the output window.

EXAMPLE: Suppose a set of standards were measured spectroscopically to determine the absorbances associated with the concentrations of Vitamin X listed below.

Vitamin X    Absorbance
10.6         0.110
15.4         0.165
20.2         0.220
25.5         0.271
30.2         0.318
35.8         0.370
The Output window has a chart and the data we need to plot a straight line.

Y = intercept + slope * (X)
or
Y = B0 + B1 * (X)

Curve Fit
MODEL: MOD_1.

Independent: Absorbance

Dependent    Mth    Rsq     d.f.    F          Sigf    b0        b1
VitaminX     LIN    .998    4       2052.68    .000    -.5490    96.9697

Where b0 is the Y-intercept and b1 is the slope.
[Figure: VitaminX (Y-axis, 10.00 to 40.00) plotted against Absorbance (X-axis, 0.100 to 0.400), showing the Observed points and the fitted Linear curve]
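The slope and intercept of the standard curve can also be checked outside SPSS with an ordinary least-squares fit. The Python sketch below refits the six Vitamin X standards (X = absorbance, Y = concentration); with these data the slope comes out near 96.97 and the intercept near -0.549. The helper names and the absorbance of 0.250 for an "unknown" sample are made up for illustration.

```python
def fit_line(x, y):
    """Ordinary least-squares fit of y = b0 + b1 * x; returns (b0, b1)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    b1 = sxy / sxx
    b0 = mean_y - b1 * mean_x
    return b0, b1

# Standards from the Vitamin X example.
absorbance = [0.110, 0.165, 0.220, 0.271, 0.318, 0.370]
vitamin_x = [10.6, 15.4, 20.2, 25.5, 30.2, 35.8]

b0, b1 = fit_line(absorbance, vitamin_x)

def concentration(abs_reading):
    """Convert an absorbance reading to a concentration using the fitted line."""
    return b0 + b1 * abs_reading

# 0.250 is a made-up absorbance for an unknown sample.
print(f"b0 = {b0:.4f}, b1 = {b1:.4f}, unknown = {concentration(0.250):.2f}")
```

The concentration function plays the role of the Compute expression that converts a column of absorbances into concentrations.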
Therefore our straight line (standard curve) would be:
Y = -0.5490 + 96.9697 * (X)
The intercept is negative: with a mean absorbance of about 0.2423 and a mean concentration of 22.95, b0 = 22.95 - 96.9697 * 0.2423 = -0.549. We could now use this information to create a Compute command to convert a set of absorbances into concentrations of Vitamin X and use List Cases to print them out.

Exercise
I) Use SPSS to create a scatterplot of Kcal and Total cholesterol from the diet.sav data file. Is there any pattern or is it just random?
II) Use SPSS to determine a standard curve for an assay called the Bradford assay, which is used to measure protein concentrations. In this assay, we first determine the absorbances of some protein samples of known concentration, which we call standards. Then we plot a curve and determine our correlation coefficient, Y-intercept and slope. With that information we can calculate the protein concentrations of unknown samples from their absorbance values. Draw the curve on a sheet of graph paper and compare it to the curve drawn by the computer. Use SPSS' formula function for a straight line (Y = A + BX) to determine protein concentrations in the unknowns (Note: Let X = absorbance and Y = protein concentration). Plot absorbance on the X-axis and protein concentrations on the Y-axis.

Standard curve data
Protein (microg)    Absorbance at 595 nm
20                  0.160
30                  0.192
40                  0.255
50                  0.296
60                  0.352
70                  0.357
80                  0.390
90                  0.451
100                 0.500

Correlation coefficient _____________
Y-intercept _____________
Slope ____________
Therefore straight line formula = Y = A + BX = ________________
So X (unknown) = _______________

Data from unknown samples
Protein (microg)    Absorbance at 595 nm
________            0.37
________            0.43
________            0.31
________            0.37
________            0.24
________            0.58
Calculate the protein concentrations using the formula function/Compute of SPSS.

s) Multiple Regression
This command is used to produce multiple regression equations and associated statistics and plots. Use the Analyze pull-down menu and select Regression and then Linear. The following dialog box will appear:

Select your variable of interest and place it in the dependent variable box. Then select each of the predictors that you want used in your equation and place them in the independent(s) box. Select the appropriate method. Stepwise is the suggested method. This will identify the best single predictor and generate output, then the predictor that adds the most to the first, and so on.

REGRESSION
Descriptive Statistics
Variable    Mean       Std. Deviation    N
KCAL        2749.25    386.605           32
CHO         334.75     51.285            32
FAT         113.25     23.267            32
PRO         97.75      12.136            32
FIB         11.75      1.951             32
CA          818.50     173.161           32
CHOL        287.50     81.599            32
Variables Entered/Removed (a)
Model    Variables Entered    Variables Removed    Method
1        CA                   .                    Stepwise
2        FAT                  .                    Stepwise
3        CHO                  .                    Stepwise
4        PRO                  .                    Stepwise
5        FIB                  .                    Stepwise
6        CHOL                 .                    Stepwise

Stepwise criteria at every step: Probability-of-F-to-enter <= .050, Probability-of-F-to-remove >= .100.
a. Dependent Variable: KCAL
Model Summary
Model    R            R Square    Adjusted R Square    Std. Error of the Estimate
1        .895 (a)     .801        .795                 175.198
2        .983 (b)     .966        .964                 73.443
3        .999 (c)     .997        .997                 21.887
4        1.000 (d)    1.000       1.000                .000
5        1.000 (e)    1.000       1.000                .000
6        1.000 (f)    1.000       1.000                .000

a. Predictors: (Constant), CA
b. Predictors: (Constant), CA, FAT
c. Predictors: (Constant), CA, FAT, CHO
d. Predictors: (Constant), CA, FAT, CHO, PRO
e. Predictors: (Constant), CA, FAT, CHO, PRO, FIB
f. Predictors: (Constant), CA, FAT, CHO, PRO, FIB, CHOL
For instance, in the Model Summary table, the adjusted R Square of 0.795 for the first model means that about 80 percent of the variation in Kcal can be predicted if we have the Calcium intake. The value of 0.964 below it means that if we have both the Calcium and the Fat intake, we can predict about 96 percent of the variation in Kcal, and so on.
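The adjusted values in the Model Summary can be approximated from the R Square column with the usual textbook adjustment, which penalises R Square for the number of predictors. The Python sketch below is an illustration only (the function name is our own), and because it starts from R Square values already rounded to three decimals, its results may differ from the printed ones in the last digit.

```python
def adjusted_r_square(r_square, n_cases, n_predictors):
    """Adjusted R^2 = 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r_square) * (n_cases - 1) / (n_cases - n_predictors - 1)

# R Square for models 1 to 3 of the Model Summary table; N = 32 cases.
models = [(1, 0.801), (2, 0.966), (3, 0.997)]
adjusted = [adjusted_r_square(r2, 32, p) for p, r2 in models]

for (p, r2), adj in zip(models, adjusted):
    print(f"{p} predictor(s): R Square = {r2:.3f}, adjusted = {adj:.3f}")
```

With 32 cases and only a few predictors the adjustment is small, which is why the two columns of the table are so close.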
Coefficients (a)
                      Unstandardized Coefficients    Standardized Coefficients
Model                 B           Std. Error         Beta                         t         Sig.
1    (Constant)       1113.479    151.926                                         7.329     .000
     CA               1.998       .182               .895                         10.998    .000
2    (Constant)       683.316     73.224                                          9.332     .000
     CA               1.412       .091               .632                         15.563    .000
     FAT              8.038       .675               .484                         11.905    .000
3    (Constant)       218.675     34.632                                          6.314     .000
     CA               .334        .068               .150                         4.916     .000
     FAT              9.189       .212               .553                         43.352    .000
     CHO              3.634       .210               .482                         17.278    .000
4    (Constant)       .000        .000                                            .         .
     CA               .000        .000               .000                         .         .
     FAT              9.000       .000               .542                         .         .
     CHO              4.000       .000               .531                         .         .
     PRO              4.000       .000               .126                         .         .
5    (Constant)       .000        .000                                            .         .
     CA               .000        .000               .000                         .         .
     FAT              9.000       .000               .542                         .         .
     CHO              4.000       .000               .531                         .         .
     PRO              4.000       .000               .126                         .         .
     FIB              .000        .000               .000                         .         .
6    (Constant)       .000        .000                                            .         .
     CA               .000        .000               .000                         .         .
     FAT              9.000       .000               .542                         .         .
     CHO              4.000       .000               .531                         .         .
     PRO              4.000       .000               .126                         .         .
     FIB              .000        .000               .000                         .         .
     CHOL             .000        .000               .000                         .         .

a. Dependent Variable: KCAL
Excluded Variables (f)
                                                                      Collinearity Statistics
Model           Beta In     t         Sig.    Partial Correlation    Tolerance
1    CHO        .102 (a)    .474      .639    .088                   .147
     FAT        .484 (a)    11.905    .000    .911                   .705
     PRO        .220 (a)    1.280     .211    .231                   .219
     FIB        .146 (a)    1.214     .235    .220                   .450
     CHOL       .013 (a)    .104      .918    .019                   .423
2    CHO        .482 (b)    17.278    .000    .956                   .133
     PRO        .009 (b)    .111      .913    .021                   .204
     FIB        .124 (b)    2.697     .012    .454                   .450
     CHOL       .143 (b)    3.028     .005    .497                   .406
3    PRO        .126 (c)    .         .       1.000                  .184
     FIB        .063 (c)    5.991     .000    .755                   .422
     CHOL       .021 (c)    1.205     .239    .226                   .321
4    FIB        .000 (d)    .         .       .                      .181
     CHOL       .000 (d)    .         .       .                      .305
5    CHOL       .000 (e)    .         .       .                      .223

a. Predictors in the Model: (Constant), CA
b. Predictors in the Model: (Constant), CA, FAT
c. Predictors in the Model: (Constant), CA, FAT, CHO
d. Predictors in the Model: (Constant), CA, FAT, CHO, PRO
e. Predictors in the Model: (Constant), CA, FAT, CHO, PRO, FIB
f. Dependent Variable: KCAL
Additional subcommands exist if the input is a matrix or if the user wishes to write the matrix to an external file; if the user wishes to examine residuals; or if plots of various types are desired. These can be used under the "syntax window system" of creating commands. The Regression procedure can be very complex. The order of subcommands is very important in determining the output. It is therefore imperative that the manual be studied carefully when using this procedure.

t) CHI-SQUARE test using CROSSTABS
When discrete data have been collected, it is often desirable to use the Chi-square test. One way to have SPSS calculate the Chi-square for us is to use the Crosstabs procedure. The Crosstabs command has a variety of parts, many of which are optional. The discussion below is intended to clarify some of the information provided in the manual. Use the Analyze pull-down menu and select Descriptive Statistics and then Crosstabs. The following dialog box will appear:
Select your row and column variables. Select the Statistics button and choose Chi-square.

You can also control what is printed into the cells of the table by selecting the Cells option.

By default, you will get only Counts. You can also control the Format of the output by selecting the Format option. The output would look similar to the following:

CROSSTABS

Case Processing Summary
                                   Cases
                Valid              Missing            Total
                N     Percent      N     Percent      N     Percent
RACE * SEX      32    100.0%       0     .0%          32    100.0%
RACE * SEX Crosstabulation
Count
              SEX
RACE          1      2      Total
1             5      6      11
2             7      5      12
3             4      5      9
Total         16     16     32

Chi-Square Tests
                                  Value       df    Asymp. Sig. (2-sided)
Pearson Chi-Square                .535 (a)    2     .765
Likelihood Ratio                  .537        2     .764
Linear-by-Linear Association      .000        1     1.000
N of Valid Cases                  32

a. 2 cells (33.3%) have expected count less than 5. The minimum expected count is 4.50.
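The Pearson Chi-square in this output can be reproduced from the cell counts alone: each expected count is (row total * column total) / N, and the statistic sums (observed - expected)^2 / expected over the cells. The Python sketch below is a hand check outside SPSS with a helper name of our own; the p value line relies on a shortcut that holds only for 2 degrees of freedom, so it should not be reused for tables of other sizes.

```python
from math import exp

def pearson_chi_square(table):
    """Return (chi-square, df, expected counts) for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    expected = [[r * c / total for c in col_totals] for r in row_totals]
    chi2 = sum((o - e) ** 2 / e
               for o_row, e_row in zip(table, expected)
               for o, e in zip(o_row, e_row))
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df, expected

# RACE (rows 1 to 3) by SEX (columns 1 and 2), counts from the crosstabulation.
counts = [[5, 6], [7, 5], [4, 5]]
chi2, df, expected = pearson_chi_square(counts)

# Shortcut valid only when df == 2: upper-tail probability = exp(-chi2 / 2).
p_value = exp(-chi2 / 2) if df == 2 else None
smallest_expected = min(min(row) for row in expected)

print(f"chi-square = {chi2:.3f}, df = {df}, p = {p_value:.3f}")
print(f"minimum expected count = {smallest_expected:.2f}")
```

The expected counts also explain footnote a: the two cells in the smallest RACE group have expected counts below 5.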
Looking at the p values, we can infer that there is no significant difference in the distribution.

Exercise
I) The following data were collected from an experiment to determine the outcome of a zinc supplement program on the performance of children on a standardized intelligence test. Determine statistical significance by Chi-square.

                 Improved    Did not improve    Total
Control          6           17                 23
Supplemented     19          8                  27
Total            25          25                 50

Chi-square value = __________ p < _____________

II) The following data were collected from an experiment on the effects of a daily 30-minute jogging schedule on weight loss. The control group spent 30 minutes watching a TV commercial of a tooth whitening product.

Control                         Experimental
Initial wt.    Final wt.        Initial wt.    Final wt.
251            252              244            236
229            226              231            208
240            241              241            241
229            225              257            258
243            245              253            234
308            309              299            302
196            202              196            184
222            221              232            225
207            200              243            196
274            268              264            268
239            220              220            225
251            251              221            211
209            196              209            197
220            217              229            216
294            300              298            285

A successful weight loss is defined by the experimenters as being at least five pounds. Determine if the percentages of successful weight loss were different in the two groups. (Hint: In your calculations, divide the groups into successful weight loss and no successful weight loss, controls and experimentals.)

                                      Controls    Experimentals
Percentage of successful wt. loss     _______     _______
Chi-square value _______ p < _______

III) We wish to evaluate the presence of breast cancer as a risk factor for subsequent cancer in the other breast. From a group of 55-65 yr old women, you select a group of breast cancer cases and a group of controls (not currently suffering from breast cancer). You use a questionnaire and a thorough search of cancer registry records to determine past histories of breast cancer. Your data are presented below.

                      Cancer (cases)    No Cancer (controls)    Total
Previous cancer       12                6                       18
No previous cancer    38                44                      82
Total                 50                50                      100
Chisquare value = _____________ p < ___________ As you can see, this is a retrospective study, which means you can calculate relative risk. Relative risk (cross product ratio) = ______________ That means that a person with cancer in one breast is _________ times more likely to have breast cancer in the other breast than a person who does not suffer from breast
CHILLIBREEZE PUBLICATIONS  http://www.chillibreeze.com
SPSS Tutorial for Beginners
Creating and Executing SPSS Commands
cancer at all. Do you think this value is significant biologically? u) Selection of a Subset of Cases for Analysis Many a time, you might like to run a statistical test on selected subsets of your data. For example you may have a huge data file and decide that you want to look only at the males in your data. SPSS allows you to create a permanent data file to analyze only males or do so temporarily. Use the Data pull down menu and select Select cases. The following dialog box will appear.
Click the If condition is satisfied button and click on If. The following box will then appear:
Set your conditions for case selection and your data file will now look like this:
You can now choose whatever statistical procedure(s) you want to perform on this subset of cases.
NOTE: The Filtered button (the default) allows you to use Select cases again to undo or alter your selection. The Deleted button makes this a permanent case selection. You can then save the new data as another file and work on it.

v) The Nonparametric Tests
SPSS allows the user to perform a number of nonparametric statistical tests. The tests available can be grouped into broad categories depending on the type of experimental data you have, e.g. one-sample tests, related-samples tests and independent-samples tests. These tests are found in the Analyze pull down menu on selecting Nonparametric Tests and then choosing the appropriate category. You may make one of the following choices:

Chi-square…               Gives a one-sample Chi-square test.
Binomial…                 Gives a binomial test for a dichotomous variable.
Runs…                     Gives a "runs" test to determine if the order of occurrence of two values of a variable is random.
1-Sample K-S…             Gives a one-sample Kolmogorov-Smirnov test.
2 Independent Samples…    Allows a choice of tests comparing 2 independent groups of cases. By default, the Mann-Whitney U test is performed. Other tests that can be chosen include: K-S Z, Moses extreme reactions or Wald-Wolfowitz runs.
K Independent Samples…    By default you get a Kruskal-Wallis H statistic computed. You can opt for a Median test.
2 Related Samples…        By default you get a Wilcoxon test (for ranked data). You can choose a Sign test or a McNemar test.
K Related Samples…        By default you get a Friedman test. You can choose a Kendall's W or a Cochran's Q.

For each of these, you have choices as to which statistics are displayed and how labels and missing data are handled. We will examine two of these as examples.

w) Mann-Whitney U test
A Mann-Whitney U test can be obtained by selecting the 2 Independent Samples option under the Nonparametric Tests menu. The following dialog box will appear.
Since the Mann-Whitney U test is the default, we will select our variables to test and our grouping variable. We must of course define our groups as in previous tests. The output looks similar to the following:

NPAR TESTS
Mann-Whitney Test

Ranks
       SEX     N    Mean Rank   Sum of Ranks
AGE    1       16   17.34       277.50
       2       16   15.66       250.50
       Total   32

Test Statistics(b)
                                   AGE
Mann-Whitney U                     114.500
Wilcoxon W                         250.500
Z                                  -.510
Asymp. Sig. (2-tailed)             .610
Exact Sig. [2*(1-tailed Sig.)]     .616(a)
a. Not corrected for ties.
b. Grouping Variable: SEX
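The U statistic in this output can be recovered by hand from the rank sums, which may help when checking a printout. The following is a minimal Python sketch, not part of SPSS; the function name is made up for illustration, and the Z value uses the plain normal approximation without a correction for ties, so it can differ slightly from the figure SPSS prints.

```python
import math

def mann_whitney_from_rank_sums(rank_sum_1, n1, rank_sum_2, n2):
    """Return (U, Z, two-tailed p) computed from the two rank sums.

    Assumption: normal approximation with no tie correction.
    """
    # U for each group is its rank sum minus the smallest possible rank sum.
    u1 = rank_sum_1 - n1 * (n1 + 1) / 2
    u2 = rank_sum_2 - n2 * (n2 + 1) / 2
    u = min(u1, u2)  # the smaller of the two is the one reported
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    # Two-tailed probability of a standard normal value beyond |z|.
    p_two_tailed = math.erfc(abs(z) / math.sqrt(2))
    return u, z, p_two_tailed

# Rank sums and group sizes taken from the Ranks table in the output.
u, z, p = mann_whitney_from_rank_sums(277.50, 16, 250.50, 16)
print(f"U = {u:.3f}, Z = {z:.3f}, p = {p:.3f}")
```

With the rank sums from the table, U comes out as 114.5, and Z and p land close to the -.510 and .610 that SPSS reports.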
Here a significance of 0.616 means there is no significant difference between the ages of males and females in the data. (Since we do not have ranking data here, we used ages. But ideally this test would be used to compare ranks.)

Median test
A Median test can be obtained by selecting K Independent Samples under the Nonparametric Tests menu. The following dialog box will appear:
Since we need to perform a median test, we must deselect Kruskal-Wallis H (the default) and choose Median test. We can then select the variable(s) to be tested and the grouping variable. We must define the range of values of our grouping variable. The output will be similar to the following:

NPAR TESTS
Median Test

Frequencies
                      SEX
                      1     2
AGE   > Median        8     8
      <= Median       8     8
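For two groups, the median test boils down to a Chi-square test on a 2 x 2 frequency table like the one above. The Python sketch below shows that arithmetic under stated assumptions: the function name is hypothetical, no continuity correction is applied, and the p value uses the one-degree-of-freedom Chi-square distribution.

```python
import math

def chi_square_2x2(table):
    """Return (chi-square, p) for a 2 x 2 table of counts.

    Assumption: plain Pearson chi-square, 1 degree of freedom,
    no continuity correction.
    """
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    # Upper-tail probability of chi-square with 1 degree of freedom.
    p = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p

# Counts above and at-or-below the median, by SEX, from the table.
chi2, p = chi_square_2x2([[8, 8], [8, 8]])
print(chi2, p)
```

Because every observed count equals its expected count here, the statistic is zero and the probability is 1.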
In this data, the cases are split equally above and below the median in both groups; therefore our significance level is 1.0, which means that there is no significant difference in the median ages of males and females.

x) Bivariate Correlations
There are multiple ways to use SPSS to get correlation coefficients. We can select Correlate from the Analyze pull down menu and then select Bivariate. The following dialog box will appear.
We can define variables to be used to create a correlation coefficient matrix. By default, we get Pearson’s correlation coefficients. Our output will be similar to the following: Correlations
We do not get r2 by using this procedure. If we want the r2 value, we can calculate it with a calculator or choose Regression and then Linear under the Analyze pull down menu. Here we would choose Enter under the Method box. This would give us the r and r2.
Model Summary
Model   R         R Square   Adjusted R Square   Std. Error of the Estimate
1       .583(a)   .340       .318                19.211
a. Predictors: (Constant), PRO
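The "calculate it with a calculator" route is just squaring r. A small Python sketch of that arithmetic follows; the adjusted value uses the usual formula for adjusted R Square, and the case count of 32 is an assumption (it is the N shown in the earlier outputs, not something stated for this regression).

```python
r = 0.583          # R from the Model Summary
n_cases = 32       # assumption: same N as in the earlier examples
n_predictors = 1   # PRO is the only predictor

r_squared = r ** 2
# Adjusted R Square shrinks r_squared to allow for the number of predictors.
adjusted_r_squared = 1 - (1 - r_squared) * (n_cases - 1) / (n_cases - n_predictors - 1)

print(round(r_squared, 3), round(adjusted_r_squared, 3))
```

Under that assumption both figures agree with the R Square and Adjusted R Square columns of the table.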
The r2 value means that 34 percent of the variation in fat intake can be predicted from protein intake. (Since this is not the kind of data you would use for regression analysis, do not try to make sense out of this statement.)

y) Survival
The SURVIVAL command produces life tables, plots and related statistics for examining the length of time between two events (let's say exposure and development of disease). Cases can be classified into groups, and separate analyses and comparisons obtained for the groups. The time interval between two dates can be calculated with the SPSS date and time conversion functions (e.g. YRMODA).
Example: If the data file contains dates of important events such as diagnosis or outcome, you can use the Compute command and the YRMODA function to calculate elapsed time.
Use the survival.sav data file provided to you to work on this function. The outcomes are either 1 (died) or 2 (survived). The treatments are either 1 (Vitamin A), 2 (Beta Blocker), 3 (ACE inhibitor) or 4 (Aspirin). Use the Compute command to calculate a variable xyz (time elapsed between exposure and the event i.e. death)
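The idea behind the elapsed-time calculation is to turn each date into a day count, subtract, and convert days to years. The Python sketch below illustrates that idea only; it is not SPSS syntax, the two dates are hypothetical, and dividing by 365.25 is one common convention for converting days to years.

```python
from datetime import date

def years_elapsed(start, end):
    """Years between two dates, counting 365.25 days per year."""
    return (end - start).days / 365.25

# Hypothetical dates for a single case (not taken from survival.sav).
exposure_date = date(1990, 1, 1)
death_date = date(2000, 1, 1)

xyz = years_elapsed(exposure_date, death_date)
print(round(xyz, 2))
```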
xyz will now be created as a new variable on your data file and will provide you the number of years elapsed between the two dates. You can then use the survival analysis command to compute median “survival time” for various groups within your experiment. The Survival command can be found under the Analyze pull down menu. Then select Life Tables. The following dialog box will appear:
Place your variable defining the time elapsed in the Time box. The Display Time Intervals boxes are for defining the number of time units that will be displayed on your output. The Status box is for defining your outcome and the Factor box is for defining your grouping variable. Look at the examples below.
You can also select the Options button to generate statistical tests comparing your groups and to control the output.
The output will be similar to the following:
This subfile contains: 40 observations
Available workspace allows for exact comparisons of 174762 observations

Life Table
Survival Variable xyz
for treatmen = 1 Vitamin A

Intrvl    Number   Number   Number   Number   Propn    Propn    Cumul    Proba-   Hazard
Start     Entrng   Wdrawn   Exposd   of       Termi-   Sur-     Propn    bility   Rate
Time      this     During   to       Termnl   nating   viving   Surv     Densty
          Intrvl   Intrvl   Risk     Events                     at End
.0        19.0     .0       19.0     .0       .0000    1.0000   1.0000   .0000    .0000
1.0       19.0     .0       19.0     .0       .0000    1.0000   1.0000   .0000    .0000
2.0       19.0     .0       19.0     .0       .0000    1.0000   1.0000   .0000    .0000
3.0       19.0     .0       19.0     .0       .0000    1.0000   1.0000   .0000    .0000
4.0       19.0     .0       19.0     .0       .0000    1.0000   1.0000   .0000    .0000
5.0       19.0     .0       19.0     .0       .0000    1.0000   1.0000   .0000    .0000
6.0       19.0     .0       19.0     .0       .0000    1.0000   1.0000   .0000    .0000
7.0       19.0     .0       19.0     .0       .0000    1.0000   1.0000   .0000    .0000
8.0       19.0     .0       19.0     .0       .0000    1.0000   1.0000   .0000    .0000
9.0       19.0     .0       19.0     .0       .0000    1.0000   1.0000   .0000    .0000
10.0+     19.0     4.0      17.0     15.0     .8824    .1176    .1176    **       **
** These calculations for the last interval are meaningless.

The median survival time for these data is 10.00+

Intrvl    SE of    SE of    SE of
Start     Cumul    Proba-   Hazard
Time      Sur-     bility   Rate
          viving   Densty
.0        .0000    .0000    .0000
1.0       .0000    .0000    .0000
2.0       .0000    .0000    .0000
3.0       .0000    .0000    .0000
4.0       .0000    .0000    .0000
5.0       .0000    .0000    .0000
6.0       .0000    .0000    .0000
7.0       .0000    .0000    .0000
8.0       .0000    .0000    .0000
9.0       .0000    .0000    .0000
10.0+     .0781    **       **
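The columns of a life table are linked by simple arithmetic, which is easiest to see in the last interval of the Vitamin A group. The Python sketch below recomputes those columns from the three counts SPSS starts with. It is an illustration under assumptions: the function and key names are made up, withdrawn cases are counted as exposed for half the interval, and the standard error uses a Greenwood-style formula, which reproduces the printed value here.

```python
import math

def life_table(rows):
    """rows: one (entering, withdrawn, terminal_events) tuple per interval."""
    cumulative = 1.0
    greenwood_sum = 0.0
    table = []
    for entering, withdrawn, events in rows:
        # Withdrawn (censored) cases count as exposed for half the interval.
        exposed = entering - withdrawn / 2
        prop_terminating = events / exposed
        prop_surviving = 1 - prop_terminating
        cumulative *= prop_surviving
        if prop_surviving > 0:
            greenwood_sum += prop_terminating / (exposed * prop_surviving)
        se_cumulative = cumulative * math.sqrt(greenwood_sum)
        table.append({
            "exposed": exposed,
            "prop_terminating": prop_terminating,
            "prop_surviving": prop_surviving,
            "cumulative_surviving": cumulative,
            "se_cumulative": se_cumulative,
        })
    return table

# Vitamin A group: ten intervals with no events, then the 10.0+ interval.
vitamin_a = [(19, 0, 0)] * 10 + [(19, 4, 15)]
last = life_table(vitamin_a)[-1]
print({name: round(value, 4) for name, value in last.items()})
```

For the 10.0+ interval this gives 17 cases exposed to risk, a proportion terminating of .8824, a cumulative proportion surviving of .1176 and a standard error of .0781, matching the table.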
Life Table
Survival Variable xyz
for treatmen = 2 Beta Blocker

Intrvl    Number   Number   Number   Number   Propn    Propn    Cumul    Proba-   Hazard
Start     Entrng   Wdrawn   Exposd   of       Termi-   Sur-     Propn    bility   Rate
Time      this     During   to       Termnl   nating   viving   Surv     Densty
          Intrvl   Intrvl   Risk     Events                     at End
.0        10.0     .0       10.0     .0       .0000    1.0000   1.0000   .0000    .0000
1.0       10.0     .0       10.0     .0       .0000    1.0000   1.0000   .0000    .0000
2.0       10.0     .0       10.0     .0       .0000    1.0000   1.0000   .0000    .0000
3.0       10.0     .0       10.0     .0       .0000    1.0000   1.0000   .0000    .0000
4.0       10.0     .0       10.0     .0       .0000    1.0000   1.0000   .0000    .0000
5.0       10.0     .0       10.0     .0       .0000    1.0000   1.0000   .0000    .0000
6.0       10.0     .0       10.0     .0       .0000    1.0000   1.0000   .0000    .0000
7.0       10.0     .0       10.0     .0       .0000    1.0000   1.0000   .0000    .0000
8.0       10.0     .0       10.0     .0       .0000    1.0000   1.0000   .0000    .0000
9.0       10.0     .0       10.0     .0       .0000    1.0000   1.0000   .0000    .0000
10.0+     10.0     .0       10.0     10.0     1.0000   .0000    .0000    **       **
** These calculations for the last interval are meaningless.

The median survival time for these data is 10.00+

Intrvl    SE of    SE of    SE of
Start     Cumul    Proba-   Hazard
Time      Sur-     bility   Rate
          viving   Densty
.0        .0000    .0000    .0000
1.0       .0000    .0000    .0000
2.0       .0000    .0000    .0000
3.0       .0000    .0000    .0000
4.0       .0000    .0000    .0000
5.0       .0000    .0000    .0000
6.0       .0000    .0000    .0000
7.0       .0000    .0000    .0000
8.0       .0000    .0000    .0000
9.0       .0000    .0000    .0000
10.0+     .0000    **       **
Life Table
Survival Variable xyz
for treatmen = 3 ACE inhibitor

Intrvl    Number   Number   Number   Number   Propn    Propn    Cumul    Proba-   Hazard
Start     Entrng   Wdrawn   Exposd   of       Termi-   Sur-     Propn    bility   Rate
Time      this     During   to       Termnl   nating   viving   Surv     Densty
          Intrvl   Intrvl   Risk     Events                     at End
.0        7.0      .0       7.0      .0       .0000    1.0000   1.0000   .0000    .0000
1.0       7.0      .0       7.0      .0       .0000    1.0000   1.0000   .0000    .0000
2.0       7.0      .0       7.0      .0       .0000    1.0000   1.0000   .0000    .0000
3.0       7.0      .0       7.0      .0       .0000    1.0000   1.0000   .0000    .0000
4.0       7.0      .0       7.0      .0       .0000    1.0000   1.0000   .0000    .0000
5.0       7.0      .0       7.0      .0       .0000    1.0000   1.0000   .0000    .0000
6.0       7.0      .0       7.0      .0       .0000    1.0000   1.0000   .0000    .0000
7.0       7.0      .0       7.0      .0       .0000    1.0000   1.0000   .0000    .0000
8.0       7.0      .0       7.0      .0       .0000    1.0000   1.0000   .0000    .0000
9.0       7.0      .0       7.0      .0       .0000    1.0000   1.0000   .0000    .0000
10.0+     7.0      3.0      5.5      4.0      .7273    .2727    .2727    **       **
** These calculations for the last interval are meaningless.

The median survival time for these data is 10.00+

Intrvl    SE of    SE of    SE of
Start     Cumul    Proba-   Hazard
Time      Sur-     bility   Rate
          viving   Densty
.0        .0000    .0000    .0000
1.0       .0000    .0000    .0000
2.0       .0000    .0000    .0000
3.0       .0000    .0000    .0000
4.0       .0000    .0000    .0000
5.0       .0000    .0000    .0000
6.0       .0000    .0000    .0000
7.0       .0000    .0000    .0000
8.0       .0000    .0000    .0000
9.0       .0000    .0000    .0000
10.0+     .1899    **       **
Life Table
Survival Variable xyz
for treatmen = 4 Aspirin

Intrvl    Number   Number   Number   Number   Propn    Propn    Cumul    Proba-   Hazard
Start     Entrng   Wdrawn   Exposd   of       Termi-   Sur-     Propn    bility   Rate
Time      this     During   to       Termnl   nating   viving   Surv     Densty
          Intrvl   Intrvl   Risk     Events                     at End
.0        4.0      .0       4.0      .0       .0000    1.0000   1.0000   .0000    .0000
1.0       4.0      .0       4.0      .0       .0000    1.0000   1.0000   .0000    .0000
2.0       4.0      .0       4.0      .0       .0000    1.0000   1.0000   .0000    .0000
3.0       4.0      .0       4.0      .0       .0000    1.0000   1.0000   .0000    .0000
4.0       4.0      .0       4.0      .0       .0000    1.0000   1.0000   .0000    .0000
5.0       4.0      .0       4.0      .0       .0000    1.0000   1.0000   .0000    .0000
6.0       4.0      .0       4.0      .0       .0000    1.0000   1.0000   .0000    .0000
7.0       4.0      .0       4.0      .0       .0000    1.0000   1.0000   .0000    .0000
8.0       4.0      .0       4.0      .0       .0000    1.0000   1.0000   .0000    .0000
9.0       4.0      .0       4.0      .0       .0000    1.0000   1.0000   .0000    .0000
10.0+     4.0      1.0      3.5      3.0      .8571    .1429    .1429    **       **
** These calculations for the last interval are meaningless.

The median survival time for these data is 10.00+

Intrvl    SE of    SE of    SE of
Start     Cumul    Proba-   Hazard
Time      Sur-     bility   Rate
          viving   Densty
.0        .0000    .0000    .0000
1.0       .0000    .0000    .0000
2.0       .0000    .0000    .0000
3.0       .0000    .0000    .0000
4.0       .0000    .0000    .0000
5.0       .0000    .0000    .0000
6.0       .0000    .0000    .0000
7.0       .0000    .0000    .0000
8.0       .0000    .0000    .0000
9.0       .0000    .0000    .0000
10.0+     .1870    **       **
Comparison of survival experience using the Wilcoxon (Gehan) statistic
Survival Variable xyz
grouped by treatmen

Overall comparison statistic   11.350   D.F.   3   Prob.   .0100

Group   label           Total N   Uncen   Cen   Pct Cen   Mean Score
1       Vitamin A       19        15      4     21.05     9.0526
2       Beta Blocker    10        10      0     .00       19.1000
3       ACE inhibitor   7         4       3     42.86     3.5714
4       Aspirin         4         3       1     25.00     11.0000

Abbreviated Name   Extended Name
treatmen           treatment
The main values to look for are the median survival times and, finally, the probability value in the overall comparison table. Here the probability value is 0.01, which is below the conventional 0.05 level and means that there is a significant difference in the survival of the differently treated groups.
SPSS Syntax Windows
It is possible to bypass the pull down menus. (If you are like me, you would probably ask, "Why would anyone want to do that?" But believe me, some people do, and this is for their benefit.) To do this, one has to type SPSS commands directly into an SPSS syntax window or load commands from a file into an SPSS syntax window. You can then modify the commands or execute them.

To type directly into a syntax window, use the File pull down menu and select New and then select SPSS Syntax. A blank syntax window will appear. You can now type SPSS commands directly into the window and execute them by clicking the execute button. Remember to end each command with a period. You can also use the File/Open pull down menu to load a file containing SPSS commands into the syntax window.

In addition, SPSS has trizillions (just kidding!) of commands available. Here is a brief outline of many of them (arranged in alphabetical order).

THE ADD VALUE LABELS COMMAND
The Add Value Labels command adds or alters value labels without affecting the value labels that have already been assigned for that variable. In contrast, Value Labels adds or alters value labels but deletes all existing value labels for that variable.

THE AGGREGATE COMMAND
The Aggregate command creates a new data file from your old data file by aggregating groups of cases into single cases. The values of one or more variables are used to define the groups. The grouping variables are called break variables. All cases in the old file with identical values for the break variable become a break group. Each break group is assigned a single value for each newly created variable. There are nineteen aggregate functions for creating new variables. The functions include:

SUM      The sum across cases
MEAN     The mean across cases
SD       The standard deviation across cases
MAX      The maximum value across cases
MIN      The minimum value across cases
PGT      Percentage of cases greater than some value
PLT      Percentage of cases less than some value
PIN      Percentage of cases between two values inclusive
POUT     Percentage of cases not between two values
FGT      Fraction of cases greater than some value
FLT      Fraction of cases less than some value
FIN      Fraction of cases between two values inclusive
FOUT     Fraction of cases not between two values
N        Weighted number of cases in break group
NU       Unweighted number of cases in break group
NMISS    Weighted number of missing cases
NUMISS   Unweighted number of missing cases
FIRST    First nonmissing observed value in break group
LAST     Last nonmissing observed value in break group
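To make the idea of break groups concrete, here is a small Python sketch of what aggregating with the MEAN function amounts to. It is an illustration rather than SPSS syntax: the function name is made up, the five cases are hypothetical, and the variable names FEEDING and KCAL are borrowed from the example that follows.

```python
from collections import defaultdict
from statistics import mean

def aggregate_mean(cases, break_var, source_var):
    """Collapse cases into one mean per distinct value of the break variable."""
    groups = defaultdict(list)
    for case in cases:
        # Cases with the same break value fall into the same break group.
        groups[case[break_var]].append(case[source_var])
    return {key: mean(values) for key, values in sorted(groups.items())}

# Hypothetical data: one record per mouse per feeding.
cases = [
    {"FEEDING": 1, "KCAL": 10.0},
    {"FEEDING": 1, "KCAL": 12.0},
    {"FEEDING": 2, "KCAL": 14.0},
    {"FEEDING": 2, "KCAL": 16.0},
    {"FEEDING": 2, "KCAL": 18.0},
]

avgkcal = aggregate_mean(cases, "FEEDING", "KCAL")
print(avgkcal)
```

Five cases go in and two come out, one per break group, each carrying the group mean.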
Example: Suppose we had a large data file named EXPERIMENT.DAT and wished to construct a new data file named FEEDING.DAT containing mean kilocalorie consumption for each day rather than individual data for each mouse. We can use Aggregate to do this. Our command might look something like this:
AGGREGATE OUTFILE = 'FEEDING.DAT'
  /BREAK = FEEDING
  /AVGKCAL = MEAN (KCAL).

THE AUTORECODE COMMAND
The Autorecode command recodes the values of both string and numeric variables to consecutive integers and puts the new values into a different variable called the target variable.
Example: If Race is stored as a string variable with the values White, Hispanic, African American and Asian, Autorecode creates a numeric target variable in which the values, taken in alphabetical order, become 1 = African American, 2 = Asian, 3 = Hispanic and 4 = White. (Collapsing categories, for instance into 1 = White and 2 = NonWhite, is a job for the Recode command.)

THE ANOVA COMMAND
This command does not exist in the toolbar. We need to create it in the Syntax window. It performs analysis of variance for data from experiments with factorial designs. A factorial design is used when the researcher wishes to study the effects of several factors simultaneously. The ANOVA command also allows the user to perform analysis of covariance procedures. Other SPSS commands which also perform analysis of variance are ONEWAY and MEANS.
Example: Suppose a researcher wants to study the relationships between total kilocalorie consumption, sex and race. The following command may be used:
ANOVA VARIABLES = KCAL BY SEX (1,2) RACE (1,4)
  /STATISTICS = 3.

THE CORRELATION COMMAND
This command produces Pearson product-moment correlations with one-tailed probabilities. We can also opt for additional output including univariate statistics, covariances and cross-product deviations.
Example:
CORRELATION VARIABLES = HEIGHT WEIGHT AGE
  /VARIABLES = HEIGHT WEIGHT WITH AGE
  /OPTIONS = 2 3
  /STATISTICS = 1.
The first VARIABLES subcommand causes a square matrix of correlation coefficients to be created among the variables HEIGHT, WEIGHT and AGE. The second VARIABLES subcommand requests that a rectangular correlation matrix be created in which HEIGHT and WEIGHT are the rows and AGE is the column. OPTION 2 requests pairwise deletion of missing values. OPTION 3 requests two-tailed probabilities. STATISTICS 1 requests means, standard deviations and counts for each variable.

THE COUNT COMMAND
This command creates a new variable for each case that contains a count of the number of occurrences of a particular value or range of values across a list of variables. In social science, for instance, we can have responses to several questions coded as 1 = strongly agree, 2 = agree, 3 = neutral, 4 = disagree, 5 = strongly disagree, and then count how many 1s, 2s… each case has.
Example:
COUNT X = Y,Z,W (2).
A new variable X is created for each case. It will contain a count of the number of times a value of 2 was found across the variables Y, Z and W. Therefore the value of X will be 0, 1, 2 or 3.

THE DISPLAY COMMAND
This command gives detailed information about the variables in the active file. It gives the variable name and label, value labels, missing value flags, the variable type and the variable width. Not a very useful command if you ask us.

THE EXPORT AND IMPORT COMMANDS
The Export command is used to produce a portable ASCII data file and dictionary that can be read with the Import command in SPSS on another computer. (ASCII is the character table that codes each letter or number as plain text.) These commands too are hardly used. You might use Export to move your output to Word. (But why do that when you can Copy and Paste anything from your output into MS Word?)

THE GET AND SAVE COMMANDS
The Save command produces an SPSS system file which includes all data and a data dictionary with variable and value labels (if specified), missing value flags and print formats for each variable. The Get command retrieves a system file created by a Save command. We can see these commands in the output window when we open files.
THE INCLUDE COMMAND
This command allows you to execute SPSS commands from a file. This allows you to use your favorite text editor (such as MS Word) to create an SPSS command file rather than using Review. The @ character can be used in place of Include.
Example: If you had created your commands in a file named XYZ.CMD, you could execute it directly with the following:
Include 'XYZ.CMD'.
This command too is not used much anymore.

THE MISSING VALUE COMMAND
This command is used to declare user-missing values for specified variables. System-generated missing values are assigned when the raw data for a given variable is blank or when an illegal calculation is requested.
Example:
Missing Value list of variables (n).
where "n" is the value to be used to identify missing data.

THE N OF CASES COMMAND
This command limits the number of cases in the active file to the first "n" cases. You may use it to check, on a small number of cases, that every command runs correctly.
Example:
N of Cases n.

THE RECODE COMMAND
This command changes the coding of an existing variable or list of variables on a value-by-value basis. When it is possible to use it, the Recode command is much more efficient than a series of If commands used to produce the same transformation. This command does not require us to know much about our data. There are a number of keywords which can be used with this command. They are:

LO or LOWEST    specifies the lowest value found (including user-missing values, but not system-missing values)
HI or HIGHEST   specifies the highest value found (including user-missing values, but not system-missing values)
THRU            specifies a value range, inclusive of end values
MISSING         specifies user- and system-missing values for recoding
SYSMIS          specifies system-missing values only
ELSE            includes all values not already specified, including the system-missing value
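How range keywords such as LO, THRU and HI behave can be sketched in a few lines of Python. This is an illustration only: the function name and rule layout are made up, missing values are ignored, and the three ranges mirror the INCOME example that follows. As in Recode, the first matching range wins and values that match no range are left as they are.

```python
def recode(value, rules):
    """Apply the first matching (low, high, new_value) rule to value.

    None in place of low stands for LO; None in place of high stands for HI.
    Ranges include both end values, like THRU.
    """
    for low, high, new_value in rules:
        at_or_above_low = low is None or value >= low
        at_or_below_high = high is None or value <= high
        if at_or_above_low and at_or_below_high:
            return new_value
    return value  # no range matched, so the value is unchanged

# (LO THRU 500 = 1) (501 THRU 1000 = 2) (1001 THRU HI = 3)
income_rules = [(None, 500, 1), (501, 1000, 2), (1001, None, 3)]

incomes = [250, 500, 501, 1000, 1001, 500.5]
recoded = [recode(v, income_rules) for v in incomes]
print(recoded)
```

Note the last value: 500.5 falls in the gap between 500 and 501, so it is not recoded. Gaps like this are worth watching for when the variable is not a whole number.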
Example:
RECODE INCOME (LO THRU 500 = 1) (501 THRU 1000 = 2) (1001 THRU HI = 3)
  /RACE (2 THRU HI = 2).

THE REPORT COMMAND
The Report command produces case listings and summary statistics in a format specified by the user. The user has considerable control over the appearance of the output. There are dozens of subcommands and specifications for this procedure. It is usually used for business reports. No p values, significance levels etc. are generated.

THE SAMPLE COMMAND
This transformation temporarily draws a random sample of cases for use in the next procedure. This may help us when we have a really huge data file (tens of thousands of cases).
Examples:
Sample .25.
This would build a random sample of approximately 25% of the cases in the active file.
Sample 50 From 2600.
This would build a random sample of 50 cases from the first 2600 cases in the active file. If there are fewer than 2600 cases in the active file, a proportionally smaller sample (50/2600 of the cases) will be built.

THE SAVE COMMAND
See the "Get" command.

THE SORT CASES COMMAND
This command reorders the sequence of cases in the active file based on the values of one or more variables. This command is again used for printing out business reports. Its syntax is as follows:
Sort Cases By variable (A or D) variable (A or D) etc.
The cases are sorted for each variable listed. The default is ascending order (A). To sort in descending order, specify (D). After the initial sort with the first variable, additional variables cause successive sorts to be performed within the categories determined by the preceding sort(s). Up to 10 variables may be used as sort keys. This command uses a large amount of disk space for scratch files.

THE SUBTITLE COMMAND
This command is now rarely used. It inserts a left-justified subtitle on the second line of each page of output. The default is a blank line. The subtitle can be up to 64 characters long. If the subtitle is enclosed in quotes or apostrophes, it will be printed exactly as typed. If they are omitted, it will be printed in upper case. The syntax is:
Subtitle any string of 64 or fewer characters

THE TITLE COMMAND
This prints a centered title on the first line of each page. The date and page number are also printed.

THE TRANSLATE COMMAND
This command either creates an active file from a file produced by a spreadsheet program (e.g. an Excel file) or translates the active file into such a file.

THE BEGIN DATA AND END DATA COMMANDS (Under data definition)
These commands are used when we do not wish to use a separate data file. Normally this is done only when we have a small amount of data. This allows us to include our data in our command file. This used to be useful when punch cards were used to feed commands.
Example:
BEGIN DATA.
1 3.4 158
2 2.9 166
3 3.0 178
END DATA.

THE SET COMMAND (Under information and settings)
This command changes how SPSS runs. When SPSS is started, the defaults of the Set command are in force. Subcommands include:
NULLINE       controls whether a blank line functions as a command terminator. Default is YES.
DIRECTORY     sets the folder from which and to which files are read or written. Default is the SPSS folder. NOTE: You can also set the default directory (folder) from the File pull down menu.
UNDEFINED     controls whether a warning is displayed for each occurrence of nonnumeric data where numeric data is expected. Default is YES.
BLANKS        controls interpretation of blanks in the data file. Allows the conversion of blanks to a number. Default is to assign blanks as System Missing.
FORMAT        controls format of output. Default is F8.2.
LENGTH        sets the number of lines per page of output. Default is 60. Specifying NONE suppresses page ejects issued by SPSS procedures.
WIDTH         sets page width for output. Default is 80 characters.
PRINTBACK     controls display of SPSS commands in output files. Default is YES.
HEADER        controls display of the generic SPSS header at the top of each 'page' of output. Default is YES.
CASE          allows conversion of variable names and value labels to upper case on output. Default is UPLOW, i.e. variable names and value labels are displayed in upper or lower case as entered.
COMPRESSION   controls the use of data compression for work files during the SPSS session. Compression saves disk space but slows operation. For most modern machines the slowing is unnoticeable. Default is YES.
SEED          sets the 'seed' for the random number generator. Default is 2000000.
MXLOOPS       sets the maximum number of iterations of the LOOP command for each case. Default is 40.
BOX           sets the characters used in CROSSTABS and other procedures which draw boxes. Defaults are the screen graphics characters, which may print incorrectly on some printers.
BLOCK         specifies the character to be used in icicle plots. Default is a solid block.
HISTOGRAM     specifies the character to be used in histograms. Default is a solid block.
TB1           sets the characters used in TABLES commands to draw boxes. Defaults are screen graphics characters, which may print incorrectly on some printers.
CCA, CCB, CCC, CCD, CCE   allow specification of up to 5 different custom formats for displaying currency.
CP1, LP1      these apply to the SPSS Categories product.

THE SHOW COMMAND
This command displays a table of the current specifications on the Set command.

THE WEIGHT COMMAND
This command is used to weight cases according to sampling weights which have been provided for each case. This is usually used in research designs having complex sampling plans or in situations when one or more groups have been over- or undersampled.

THE WRITE COMMAND
This command is used to write cases from the active file to an ASCII file on the disk.

We believe in learning by doing. SPSS comes with a great Help section. We compiled this tutorial with a lot of quizzes and exercise problems so that we could help you learn and understand SPSS better. Test yourself each time you learn an SPSS command and soon you will master all the essential commands.

Here we come to the end of our tutorial. We hope you gained as much from it as we intended. Our 100-page-plus association with you has been nice, and we bid farewell to you with a heavy heart and wish you luck in your ventures using SPSS.