1. INTRODUCTION TO SPSS
SPSS (Statistical Package for the Social Sciences), also known as IBM SPSS Statistics, is a software package
used for the analysis of statistical data. It is a package of programs for manipulating, analyzing, and
presenting data, and it is widely used in the social and behavioral sciences, healthcare, marketing, and
education research. There are several forms of SPSS. The core program is called SPSS Base, and several
add-on modules extend the range of data entry, statistical, or reporting capabilities.
SPSS is a widely used program for statistical analysis in the social sciences. It is also used by market
researchers, survey companies, government agencies, data miners, and others.
In addition to statistical analysis, SPSS provides data management (case selection, file reshaping, creating
derived data) and data documentation (a metadata dictionary stored in the data file).
SPSS data sets have a two-dimensional table structure in which rows are called cases and columns represent
variables.
Data is raw, unorganized, and unprocessed. It is first collected and then analyzed; data only becomes
information suitable for making decisions once it has been analyzed in some fashion. Knowledge is
derived from extensive amounts of information on a subject.
Wisdom is the state of a person who possesses knowledge and also knows under
which circumstances it is good to use it.
Data is the least abstract concept, information is the next least, and knowledge is the most
abstract. Data becomes information through interpretation. Data is the plural form, whereas datum is the
singular form.
DATABASE
A database is an organized collection of data. It is a collection of schemas, tables, queries, reports,
views, and other objects. Formally, a database refers to a set of related data and the way it is organized.
Databases are used to hold and support the internal operations of an organization. They are also used to
hold administrative information and more specialized data such as engineering data or economic
models.
A database management system (DBMS) is a computer software application that interacts with the user,
other applications, and the database itself to capture and analyze the data. Examples of DBMSs are
MySQL, Oracle, Microsoft SQL Server, IBM DB2, etc.
DATA ANALYSIS
Analysis of data is a process of inspecting, cleansing, transforming, and modeling data with the goal of
discovering useful information, suggesting conclusions, and supporting decision-making.
Data mining is a particular data analysis technique that focuses on modeling and knowledge discovery
for predictive rather than purely descriptive purposes.
VARIABLE
A variable is a value that can change depending on the conditions or on information passed to the
program. Data is collected on variables.
Example: The heights of the students who attended a particular class are the data, and height is the variable.
TYPES OF VARIABLES
1. Metric variable
2. Non-Metric variable
Metric variables are those whose attributes can be measured, whereas non-metric variables are those
that indicate the presence or absence of a characteristic or property.
1. Discrete variable
2. Continuous variable
Continuous variables are a measurable form of data, for example the height and weight of the students.
Nominal scale: Numbers serve only as labels for groups or classes; they are simple codes assigned to objects. We
use the nominal scale for qualitative data, e.g. professional classification, geographic classification.
E.g. blonde: 1, brown: 2, red: 3, black: 4. A person with red hair does not possess more ‘hairiness’ than a
person with blonde hair.
Ordinal scale: Data elements may be ordered according to their relative size or quality. The numbers
assigned to objects or events represent the rank order (1st, 2nd, 3rd, etc.), e.g. top lists of companies.
Interval scale: Distances between any two observations are meaningful. The "zero point" is
arbitrary, and negative values can be used. Ratios between numbers on the scale are not meaningful, so
operations such as multiplication and division cannot be carried out directly, e.g. temperature on the
Celsius scale.
Ratio scale (Scale): This is the strongest scale of measurement. Distances between observations and also
the ratios of distances have a meaning. It contains a meaningful zero. E.g. mass, length.
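The difference between an interval scale and a ratio scale can be sketched with a small calculation. The two temperatures below are illustrative values chosen for this sketch, not data from this manual; 273.15 is the standard offset between the Celsius and Kelvin scales.

```python
# Illustrative temperatures (assumed values, not taken from the manual).
t1_celsius = 10.0
t2_celsius = 20.0

# Differences are meaningful on an interval scale.
difference = t2_celsius - t1_celsius

# The ratio of two Celsius readings is not meaningful,
# because 0 degrees Celsius is an arbitrary zero point.
naive_ratio = t2_celsius / t1_celsius

# Kelvin has a true zero (ratio scale), so this ratio is meaningful.
kelvin_ratio = (t2_celsius + 273.15) / (t1_celsius + 273.15)

print(difference)
print(naive_ratio)
print(round(kelvin_ratio, 3))
```

Although 20 is twice 10, the Kelvin ratio shows that the second temperature is only about 3.5% higher in absolute terms, which is why ratios should be taken only on a ratio scale.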
Once the SPSS software is opened, a dialogue box appears; click Cancel and proceed. The SPSS window
then opens in the data view format by default. Before starting data entry, always check that the status
message on the extreme right shows that the processor is ready.
DATA EDITOR
The data editor allows you to create your own data set and perform statistical operations interactively
using pull-down menus. The data editor window has two sheets.
1. Data view
2. Variable view
1. Data view
By default, the data view opens whenever you open the data editor. It contains your actual data set.
Here the variables are represented as columns, and each row is called a case.
2. Variable View
The variable view allows you to name your variables and to identify their missing values. Spaces should not be
used in a variable name; an underscore ( _ ) can be used instead.
Once a variable is entered, a series of columns describing it is associated with the variable.
1. Name
Name is the name of the variable. It appears in the column headers in the data view. In SPSS,
variable names cannot contain spaces; an underscore ( _ ) is used in place of a space.
2. Type
Type is the type of data in the variable. A string type stores the data as text, whereas a numeric type
stores the data as numbers. Other useful options are Date and Dollar. Numeric data is preferred for
statistical functions, as SPSS cannot perform statistical functions on string data.
3. Width
Width tells the computer how much space each case needs to take up. It is measured in characters.
4. Label
The label explains what the variable is measuring. It gives the complete description of the variable
when the variable name is restricted in length (normally in older versions).
5. Values
This allows you to display certain labels depending on the data in each case. Values are generally used
for non-metric variables, where the data is entered only as numbers; punctuation and letters are not taken
into consideration.
To assign values to a non-metric variable, left-click the Values cell; a dialogue box appears.
Enter each value and its label for the variable.
For example, gender is a non-metric variable, and its values will be
1. male
2. female
6. Alignment
Alignment is the manner in which the values are placed one below the other in a column. By default,
numbers are right-aligned and text is left-aligned.
In the menu bar, there are some important pull-down menus. The most important are the following.
File        Creates new data files, opens existing ones, and saves and prints data files.
Edit        Provides editing functions such as copy, cut, and paste.
View        Controls display settings such as fonts, grid lines, and value labels.
Data        Covers entering data and defining variable properties. Helps in identifying duplicate
            cases, sorting cases, transposing, restructuring, and aggregating data.
Transform   Makes data compatible for analysis using recode, replace missing values, etc.
Analyse     All inferential statistics are available here. It contains the statistical tools and techniques.
Graphs      Builds different types of charts and graphs using the chart builder.
Utilities   Commands that are used for more complex statistical computations.
Add-ons     Lists extra features available for advanced use.
Window      Allows you to arrange, select, and control the attributes of windows.
Help        Helps in gaining insight into the procedures of SPSS. Contains tutorials, coach, etc.
4. OUTPUT VIEWER
The output window collects the results of the work done in SPSS; it displays the
results of the analyses that have been run.
In the output window, the right side shows the output from the SPSS procedures that were run, and the
left side shows the outline of that output. The SPSS output is composed of a series of output objects
such as titles, tables of numbers (frequencies, descriptives, crosstabs), and charts. Each of
these objects is listed in the outline view. The outline view makes navigating the output easier.
There are 4 types of files in SPSS
1. Data file
2. Output file
3. Syntax file
4. Script file.
The data file and the output file are the two files most frequently used in SPSS.
The data file is the file in which the data is stored. It is saved with the extension .sav.
Once the data file is saved, the output file is generated with a message confirming that the data file
was saved. This file also contains the various outputs generated by statistical operations. The output
file is saved with the extension .spv.
5. INTRODUCTION TO STATISTICS
The word statistics means different things to different people. In day-to-day life, knowledge of this
subject is used in different ways, for personal as well as professional purposes. In personal life,
for example, statistics is used for the general calculation of a household
budget. Generally, there are two types of information: quantitative and qualitative. People use
statistics to take appropriate decisions about problems or budgets on the basis
of both types of information.
MEANING OF STATISTICS
The word statistics is derived from the Latin word ‘status’ or the Italian word ‘statista’.
It was first used in the 18th century by Prof. G. F. Achenwall. For a common man, ‘Statistics’
means numerical information expressed in quantitative terms, which may relate to objects, subjects,
activities, information, phenomena, or regions of space. The word statistics can be defined in two broadly
different ways, because it conveys different meanings in the singular and the plural sense.
DEFINITION OF STATISTICS
Different statisticians have defined statistics in different ways. Some important
definitions of statistics are given below:
A. L. Bowley stated that “Statistics may be called the science of counting”. He also said that
“Statistics may rightly be called the science of average”.
C. According to Seligman, “Statistics is the science which deals with the methods of collecting,
classifying, tabulation, comparing and interpreting numerical data collected to throw some light
on any sphere of enquiry”.
D. Croxton and Cowden defined statistics as “the collection, tabulation, presentation, analysis
and interpretation of numerical data”.
TYPES OF STATISTICS
1. Descriptive statistics
Descriptive statistics is the branch that deals with the description of obtained data. It summarizes
the features or characteristics of a collection of information and includes
classification, tabulation, and the measurement of central tendency as well as variability. Researchers use
these measures to understand the tendency of the data or scores, which makes the phenomena
easier to describe.
2. Inferential statistics
Statistical inference (SI) is the process of using data analysis to deduce properties of an underlying
probability distribution. Inferential statistical analysis infers properties of a population by testing
hypotheses and deriving estimates, on the primary assumption that the observed data set is sampled
from a larger population. It therefore deals with drawing conclusions about the population.
Moreover, it provides a technique to compute the probabilities of future behavior of the subjects or areas.
6.DESCRIPTIVE STATISTICS
MEASURES OF CENTRAL TENDENCY
A measure of central tendency is a single value that attempts to describe a set of data by identifying a
central position within that set of data.
The mean, median, and mode are all valid measures of central tendency, but under different conditions
some measures of central tendency become more appropriate to use than others.
Mean
The most commonly used measure of central tendency is the mean (or the average). Here the main
interest is in calculating the mean when the data set is ungrouped (raw data). The
mean for ungrouped data is obtained by dividing the sum of all values by the number of values in that
data set.
Median
The median is the value of the middle term in a data set that has been ranked in increasing or decreasing
order.
Mode
The mode is the value that occurs most often in a data set.
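As a minimal sketch, the three measures can be computed with Python's standard `statistics` module. The parental income data are not listed in this manual, so the sketch reuses the ten bag weights from the one-sample t test example; the variable names are chosen here for illustration.

```python
import statistics

# Ten bag weights taken from the one-sample t test example in this manual.
weights = [50, 49, 52, 44, 45, 48, 45, 46, 49, 45]

mean_value = statistics.mean(weights)      # sum of values / number of values
median_value = statistics.median(weights)  # middle of the ranked values
mode_value = statistics.mode(weights)      # most frequent value

print(mean_value, median_value, mode_value)
```

With an even number of values, the median is the average of the two middle values of the ranked data (46 and 48 here).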
MEASURES OF VARIATION
1.Range
In simple words, the range of a data set depends on two values (the smallest and the largest values)
among all values in the data set.
Range = Largest value – Smallest value
2.Mean Deviation
Another measure of variation is the mean deviation; it is the mean of the absolute distances between each
value and the mean.
3.Variance and Standard Deviation
The most used measure of variation is the standard deviation, denoted by σ for the population and S for
the sample. Its numerical value tells us how closely the values of the data set are clustered
around the mean. The variance is the square of the standard deviation.
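The measures of variation above can be sketched in the same way. The data are again the ten bag weights from the one-sample t test example; `statistics.variance` and `statistics.stdev` use the sample (n - 1) formulas, which is what the SPSS output tables in this manual report.

```python
import statistics

# Ten bag weights taken from the one-sample t test example in this manual.
weights = [50, 49, 52, 44, 45, 48, 45, 46, 49, 45]

# Range = largest value - smallest value
data_range = max(weights) - min(weights)

# Mean deviation = mean of the absolute distances from the mean
mean_value = statistics.mean(weights)
mean_deviation = sum(abs(x - mean_value) for x in weights) / len(weights)

# Sample variance and sample standard deviation (divisor n - 1)
sample_variance = statistics.variance(weights)
sample_sd = statistics.stdev(weights)

print(data_range, round(mean_deviation, 3))
print(round(sample_variance, 3), round(sample_sd, 3))
```

The standard deviation obtained this way agrees with the 2.669 shown in the One-Sample Statistics table for the same data.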
DESCRIPTIVE STATISTICS
1. Given below are the combined parental incomes of 30 students. Using SPSS:
i. Calculate the mean, median, and mode of the parental income
1) Open SPSS
1- Serial-No
2- Parental-Income
Both the above variables are metric variables.
5) The variables given in the variable view become the columns in the data view.
6) Enter the data in the data view.
7) To find the measures of central tendency and the measures of dispersion, we run a
descriptive analysis.
Output:
Frequencies
Statistics
Income
N Valid 30
Missing 0
Mean 151066.67
Median 59500.00
Mode 59000
Std. Deviation 263103.092
Variance 69223236781.609
Range 1174000
Minimum 26000
Maximum 1200000
SKEWNESS
Skewness means lack of symmetry. In mathematics, a figure is called symmetric if a perpendicular
to the X-axis can be drawn through some point of the figure so that it divides the figure into two
congruent parts, i.e. parts that are identical in all respects and are mirror images of each other.
In statistics, a distribution is called symmetric if the mean, median, and mode coincide. Otherwise, the
distribution is asymmetric. If the right tail is longer, we get a positively skewed distribution for
which mean > median > mode, while if the left tail is longer, we get a negatively skewed distribution for
which mean < median < mode.
Symmetrical Curve
KURTOSIS
A complete idea of the shape of a distribution also requires the study of
kurtosis. Prof. Karl Pearson called it the “Convexity of a Curve”. Kurtosis gives a measure of the flatness
of a distribution. The degree of kurtosis of a distribution is measured relative to that of a normal curve.
Curves with greater peakedness than the normal curve are called “Leptokurtic”. Curves that are
flatter than the normal curve are called “Platykurtic”. The normal curve is called “Mesokurtic.”
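Skewness and kurtosis can be sketched with moment-based coefficients. This is one common textbook definition and an assumption of this sketch: SPSS reports bias-adjusted versions, so its values for small samples will differ somewhat. The function names are chosen here for illustration, and the data are the ten bag weights from the one-sample t test example.

```python
def central_moment(values, order):
    """Average of (x - mean) raised to the given power."""
    mean_value = sum(values) / len(values)
    return sum((x - mean_value) ** order for x in values) / len(values)

def skewness(values):
    """Moment coefficient of skewness: positive when the right tail is longer."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def excess_kurtosis(values):
    """Kurtosis relative to the normal curve (whose value is 0 on this scale)."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2 - 3.0

def kurtosis_type(value):
    if value > 0:
        return "Leptokurtic"
    if value < 0:
        return "Platykurtic"
    return "Mesokurtic"

weights = [50, 49, 52, 44, 45, 48, 45, 46, 49, 45]
print(round(skewness(weights), 3))
print(kurtosis_type(excess_kurtosis(weights)))
```

For these weights the mean (47.3) exceeds the median (47), which exceeds the mode (45), and the skewness coefficient is positive, in line with the rule for positively skewed distributions.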
7.INFERENTIAL STATISTICS
In inferential statistics, data from a sample is used to draw conclusions or inferences about the
larger population from which the sample is drawn. The goal of inferential statistics is to draw
conclusions from a sample and generalize them to the population. It determines the probability of the
characteristics of the sample using probability theory. The most common methodologies used are
hypothesis tests, analysis of variance, etc.
Inferential statistics is divided into 2 types.
1. Parametric Tests
Parametric statistics are described as “statistics used for the inference from a sample to a population that
assume the variances of each group are similar and that the sample is large enough to represent
the population”.
2. Non-Parametric Tests
Non-parametric statistics can be described as tests that do not involve testing of hypotheses
related to population parameters. Salkind (2014, page 46) described non-parametric statistics as
“distribution-free statistics that do not require the same assumptions as do parametric
statistics”.
Most parametric tests are based on the normal distribution and have four basic assumptions
that must be met for the test to be accurate. The assumptions of parametric tests are:
1. Normally distributed data
The rationale behind hypothesis testing relies on data being normally distributed, so if
this assumption is not met, the logic behind hypothesis testing is flawed.
2. Homogeneity of variance
This assumption means that the variances should be the same throughout the data.
3. Interval data
The data should be measured at least in an interval scale.
4. Independence
8.CORRELATION
Correlation is a bivariate analysis that measures the strength of the association between two variables
and the direction of the relationship. In terms of the strength of the relationship, the value of the
correlation coefficient varies between +1 and -1. A value of ± 1 indicates a perfect degree of association
between the two variables. As the correlation coefficient value goes towards 0, the relationship
between the two variables will be weaker. The direction of the relationship is indicated by the sign of
the coefficient; a + sign indicates a positive relationship and a – sign indicates a negative relationship.
Calculate Karl Pearson’s coefficient of correlation for the advertisement cost and sales of a company.
Is there a correlation between the advertisement cost and the sales of the product?
Step 1. Open IBM SPSS Statistics
Step 2. Check for the statement “IBM SPSS is ready”
Step 3. Enter the variables in the variable view
Variable 1 Serial No
Variable 2 Cost
Variable 3 Sales
Step 4. Click on Data view. The variables in the variable view are displayed as columns in the data view.
Step 5. Enter the data in the data view
S.No Cost Sales
1 39 47
2 65 53
3 62 58
4 90 86
5 82 62
6 75 68
7 25 60
8 98 91
9 36 51
10 78 84
Step 6. After entering the data, click on Analyze -> Correlate -> Bivariate. A dialog box appears.
Step 7. In the dialog box, move Cost and Sales to the right side. Under Correlation Coefficients, select
Pearson. In Options, select Means and standard deviations, click Continue, and then click OK.
Step 8. An output screen appears with the correlation results.
SPSS Output:
Correlations
Cost Sales
Cost Pearson Correlation 1 .780**
Sig. (2-tailed) .008
N 10 10
Sales Pearson Correlation .780** 1
Sig. (2-tailed) .008
N 10 10
**. Correlation is significant at the 0.01 level (2-tailed).
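As a cross-check of the SPSS output above, the Pearson coefficient can be sketched from its definition (the sum of cross-products of deviations divided by the square root of the product of the sums of squared deviations). The function name `pearson_r` is chosen here for illustration; the data are the Cost and Sales columns entered in Step 5.

```python
import math

# Cost and Sales columns from the data table above.
cost = [39, 65, 62, 90, 82, 75, 25, 98, 36, 78]
sales = [47, 53, 58, 86, 62, 68, 60, 91, 51, 84]

def pearson_r(x, y):
    """Pearson correlation coefficient computed from deviations about the means."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

r = pearson_r(cost, sales)
print(round(r, 3))  # agrees with the .780 reported by SPSS
```

The positive sign indicates that higher advertisement cost goes with higher sales in this sample.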
S.No Plant Capacity Plant Utilization
1 2.6 2
2 2.8 2
3 3 2.6
4 3 2.5
5 3 2.4
6 3.2 2.8
7 3.8 3
8 4.9 3.9
9 5.4 4.8
10 6 5
Correlations
Plant Capacity Plant Utilization
Kendall's tau_b Plant Capacity Correlation Coefficient 1.000 .954**
Sig. (2-tailed) . .000
N 10 10
Plant Utilization Correlation Coefficient .954** 1.000
Sig. (2-tailed) .000 .
N 10 10
**. Correlation is significant at the 0.01 level (2-tailed).
Conclusion:
Kendall's correlation coefficient, τb, is 0.954, which is statistically significant (p < 0.001). This
indicates a strong positive association between plant capacity and plant utilization.
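Kendall's tau-b can be sketched by counting concordant and discordant pairs and correcting for ties. The function name `kendall_tau_b` is chosen here for illustration, and the pair-counting loop is a simple approach that suits small data sets like this one; the data are the Plant Capacity and Plant Utilization columns above.

```python
import math

# Plant Capacity and Plant Utilization columns from the data table above.
capacity = [2.6, 2.8, 3, 3, 3, 3.2, 3.8, 4.9, 5.4, 6]
utilization = [2, 2, 2.6, 2.5, 2.4, 2.8, 3, 3.9, 4.8, 5]

def kendall_tau_b(x, y):
    """Kendall's tau-b: (C - D) / sqrt((n0 - ties in x) * (n0 - ties in y))."""
    concordant = discordant = ties_x = ties_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx = x[i] - x[j]
            dy = y[i] - y[j]
            if dx == 0:
                ties_x += 1
            if dy == 0:
                ties_y += 1
            if dx != 0 and dy != 0:
                if dx * dy > 0:
                    concordant += 1
                else:
                    discordant += 1
    n0 = n * (n - 1) // 2  # total number of pairs
    return (concordant - discordant) / math.sqrt((n0 - ties_x) * (n0 - ties_y))

tau_b = kendall_tau_b(capacity, utilization)
print(round(tau_b, 3))  # agrees with the .954 reported by SPSS
```

No pair is discordant in this data set; the coefficient falls short of 1 only because of the tied capacity and utilization values.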
9. REGRESSION ANALYSIS
Regression is the measure of the average relationship between two or more variables in terms of the original
units of the data. Regression analysis is one of the most frequently used techniques in economics and
business research for finding the relationship between two or more causally related variables.
Example:
A personnel manager of a company wants to find a measure which he can use to fix the monthly
income of the persons applying for a job in production department. As an experimental project, he
collected the data on 7 persons from that department referring to years of service and their monthly
income.
Find regression equation of Income (X) on years of service.
S.No Years of service Income(000)
1 11 10
2 7 8
3 9 6
4 5 5
5 8 9
6 6 7
7 10 11
Output
Model Summary
Model   R       R Square   Adjusted R Square   Std. Error of the Estimate
1       .750a   .563       .475                1.565
a. Predictors: (Constant), Income(000)
ANOVAa
Model          Sum of Squares   df   Mean Square   F       Sig.
1 Regression   15.750           1    15.750        6.429   .052b
  Residual     12.250           5    2.450
  Total        28.000           6
a. Dependent Variable: Years of service
b. Predictors: (Constant), Income(000)
Coefficientsa
                Unstandardized Coefficients   Standardized Coefficients
Model           B        Std. Error           Beta                        t       Sig.
1 (Constant)    2.000    2.439                                            .820    .450
  Income(000)   .750     .296                 .750                        2.535   .052
a. Dependent Variable: Years of service
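The coefficients in the output above can be reproduced with a minimal least-squares sketch. Following the SPSS tables, Years of service is treated as the dependent variable and Income(000) as the predictor; the function name `simple_regression` is chosen here for illustration.

```python
# Columns from the data table above.
years = [11, 7, 9, 5, 8, 6, 10]
income = [10, 8, 6, 5, 9, 7, 11]

def simple_regression(x, y):
    """Least-squares fit of y = intercept + slope * x, with R squared."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return intercept, slope, r_squared

# Predictor: Income(000); dependent variable: Years of service (as in the output).
intercept, slope, r_squared = simple_regression(income, years)
print(intercept, slope, round(r_squared, 3))
```

In this particular data set both variables have the same mean (8) and the same sum of squared deviations (28), so regressing Income on Years of service gives the same intercept (2.000) and slope (.750).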
10.T-TEST
The t test can be used to test whether the means of two different groups differ. If we take a
very large number of samples from a population, calculate the mean of each sample, and then plot a frequency
distribution of the means, the resulting sampling distribution follows the Student's t distribution.
DEGREE OF FREEDOM
It may be noted that for the Student's t test the number of degrees of freedom is n-1.
ASSUMPTIONS OF T TEST
1) The parent population from which the sample is drawn is normal
2) The sample observations are random and independent
3) The population standard deviation is not known
TYPES OF T TESTS IN SPSS
There are three types of t tests in SPSS
1) Independent sample t test
2) Dependent sample t test or Paired t test
3) One sample t test
Example: Tests were made at short intervals on spark plugs from two manufacturers. The following
tabulation gives the number of hours of service of plugs from the two sources.
Source A Source B
200 190
210 200
190 190
200 180
190 190
200 210
180 200
200 192
200
210
Do these results indicate a statistically significant difference between the spark plugs as far as the mean
length of service is concerned?
Step2. Click on variable view and enter the variable names given in the problem, such as Source and Hours
Variable 1 S. No
Variable 2 Source
Variable 3 Hours
Step3. In variable view, enter the names as Source and Hours, then click on Values in the Source row.
Type value 1 with label “Source A” and value 2 with label “Source B”, then click on OK
Step4. Go to data view and enter the data as follows
S.No Source Hours
1 1 200
2 1 210
3 1 190
4 1 200
5 1 190
6 1 200
7 1 180
8 1 200
9 1 200
10 1 210
11 2 190
12 2 200
13 2 190
14 2 180
15 2 190
16 2 210
17 2 200
18 2 192
Step6. Move the variable Hours to the Test Variable box and the variable Source to the Grouping Variable box.
Define the groups as 1 and 2, click Continue, and then click OK in the main dialog box.
Output:
Group Statistics
Source N Mean Std. Deviation Std. Error Mean
Hours Source A 10 198.00 9.189 2.906
Source B 8 194.00 9.071 3.207
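The Group Statistics above, and the t statistic that the independent samples test is built on, can be sketched as follows. The sketch assumes equal variances in the two groups (the pooled-variance form of the test); the function name `pooled_t` is chosen here for illustration. The manual shows only the Group Statistics table, so there is no SPSS t value here to compare against.

```python
import math
import statistics

# Hours of service from the tabulation above.
source_a = [200, 210, 190, 200, 190, 200, 180, 200, 200, 210]
source_b = [190, 200, 190, 180, 190, 210, 200, 192]

def pooled_t(a, b):
    """Independent samples t statistic, assuming equal group variances."""
    n1, n2 = len(a), len(b)
    var1 = statistics.variance(a)  # sample variance, divisor n - 1
    var2 = statistics.variance(b)
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
    std_error = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t_value = (statistics.mean(a) - statistics.mean(b)) / std_error
    return t_value, n1 + n2 - 2  # t statistic and degrees of freedom

t_value, df = pooled_t(source_a, source_b)
print(statistics.mean(source_a), round(statistics.stdev(source_a), 3))
print(statistics.mean(source_b), round(statistics.stdev(source_b), 3))
print(round(t_value, 3), df)
```

The group means and standard deviations printed by the sketch match the Group Statistics table; the t value is then compared with the t distribution on 16 degrees of freedom.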
Apply the t test to determine whether the re-organization had any effect on the sales.
Step2. Click on variable view and enter the variable names given in the problem
Variable 1 S. No
Variable 2 Week No
Variable 3 Sales-Before
Variable 4 Sales-After
Step3. Go to data view and enter the data as follows
Step4. After entering the data in the data view, click on Analyze -> Compare Means -> Paired-Samples
T Test
Step5. Move Sales-Before and Sales-After to the Paired Variables box and click on OK.
The SPSS output will be as follows
Step2. Click on variable view and enter the variable names given in the problem
Variable 1 S. No
Variable 2 Weights
Step3. Go to data view and enter the data as follows
S.No weights
1 50
2 49
3 52
4 44
5 45
6 48
7 45
8 46
9 49
10 45
Step4. After entering the data in the data view, click on Analyze -> Compare Means -> One-Sample T Test
Step5. Move the weights variable to the Test Variable box, set the Test Value to 50, and click on OK.
The output will be as follows.
Output:
One-Sample Statistics
N Mean Std. Deviation Std. Error Mean
weight 10 47.30 2.669 .844
One-Sample Test
Test Value = 50
                                                          95% Confidence Interval of the Difference
         t        df   Sig. (2-tailed)   Mean Difference  Lower    Upper
weight   -3.199   9    .011              -2.700           -4.61    -.79
Conclusion: The mean weight of the bags differs significantly from 50 kg (p = .011); on average the
bags weigh 2.7 kg less than 50 kg.
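The t value in the One-Sample Test table can be reproduced with a short sketch of the formula t = (sample mean - test value) / (standard deviation / sqrt(n)). The variable names are chosen here for illustration; the data and the test value of 50 come from the example above.

```python
import math
import statistics

# Bag weights from the data table above; 50 is the test value used in the output.
weights = [50, 49, 52, 44, 45, 48, 45, 46, 49, 45]
test_value = 50

n = len(weights)
sample_mean = statistics.mean(weights)
sample_sd = statistics.stdev(weights)   # sample standard deviation (n - 1)
std_error = sample_sd / math.sqrt(n)    # "Std. Error Mean" in the SPSS table

mean_difference = sample_mean - test_value
t_value = mean_difference / std_error
df = n - 1

print(round(sample_mean, 2), round(sample_sd, 3), round(std_error, 3))
print(round(t_value, 3), df, round(mean_difference, 3))
```

The printed values agree with the One-Sample Statistics and One-Sample Test tables (mean 47.30, standard deviation 2.669, standard error .844, t = -3.199 with 9 degrees of freedom).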
11.CHI-SQUARE TEST
The Chi-Square Test of Independence determines whether there is an association between categorical
variables (i.e., whether the variables are independent or related). It is a nonparametric test. This test is
also known as the Chi-Square Test of Association.
CROSS TABS
Cross tabulation is used to aggregate and jointly display the distribution of two or more variables in a
table. It is widely used to find interrelationships and interactions between variables.
The following data show the level of awareness of online banking services (recharge, shopping,
ticket booking, fund transfer, and bill payment) among males and females in a particular area.
We want to find out whether there is any association between gender and level of awareness.
The SPSS window then opens; it contains the data view and the variable view
Step2. Click on variable view and enter the variable names given in the problem
Variable 1 S. No
Variable 2 gender
Variable 3 level of awareness
Step3. In the Values column, code gender as 1 = male and 2 = female. In the same way, give level of
awareness 5 values: 1 = recharge, 2 = shopping, 3 = ticket booking, 4 = fund transfer, and 5 = bill payment.
Step4. Go to data view and enter the data as follows
S.No Gender Level of Awareness
1 2 5
2 1 3
3 2 4
4 2 1
5 2 2
6 1 4
7 1 1
8 2 5
9 1 3
10 1 2
Output:
Chi-Square Tests
                               Value    df   Asymptotic Significance (2-sided)
Pearson Chi-Square             4.000a   4    .406
Likelihood Ratio               5.545    4    .236
Linear-by-Linear Association   .720     1    .396
N of Valid Cases               10
a. 10 cells (100.0%) have expected count less than 5. The minimum expected count is 1.00.
Conclusion: Since the p-value is greater than our chosen significance level (α = 0.05), we do not
reject the null hypothesis. Rather, we conclude that there is not enough evidence to suggest an
association between gender and level of awareness.
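The Pearson chi-square value above can be sketched from the cross tabulation: for every cell, the expected count is (row total × column total) / N, and the statistic is the sum of (observed - expected)² / expected. The function names are chosen here for illustration. The p-value helper uses a closed-form expression for the chi-square upper tail that holds only when the degrees of freedom are even, which is the case here (df = 4).

```python
import math
from collections import Counter

# Gender and Level of Awareness columns from the data table above.
gender = [2, 1, 2, 2, 2, 1, 1, 2, 1, 1]
awareness = [5, 3, 4, 1, 2, 4, 1, 5, 3, 2]

def chi_square_independence(rows, cols):
    """Pearson chi-square statistic and degrees of freedom for a two-way table."""
    n = len(rows)
    observed = Counter(zip(rows, cols))  # missing cells count as 0
    row_totals = Counter(rows)
    col_totals = Counter(cols)
    statistic = 0.0
    for r in row_totals:
        for c in col_totals:
            expected = row_totals[r] * col_totals[c] / n
            statistic += (observed[(r, c)] - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return statistic, df

def chi_square_p_even_df(statistic, df):
    """Upper-tail probability; this closed form is valid only for even df."""
    half = statistic / 2
    return math.exp(-half) * sum(half ** k / math.factorial(k) for k in range(df // 2))

statistic, df = chi_square_independence(gender, awareness)
p_value = chi_square_p_even_df(statistic, df)
print(round(statistic, 3), df, round(p_value, 3))
```

Every expected count in this table is 1.0, which is why the SPSS footnote warns that all cells have an expected count below 5; with so few cases the result should be read with caution.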
In one area, a survey was conducted to find out whether people have a passport or not. The results are
as follows.
Respondents: 1 2 3 4 5 6 7 8
Response: 1 1 1 1 1 1 2 1
Apply the chi-square test of goodness of fit.
The SPSS window then opens; it contains the data view and the variable view
Step2. Click on variable view and enter the variable names given in the problem
Variable 1 Respondents
Variable 2 Having Passport
Step3. In the Values column, enter 1 = Yes (having a passport) and 2 = No (not having a passport).
Step4. Go to data view and enter the data as follows
Respondents Having Passport
1 1
2 1
3 1
4 1
5 1
6 1
7 2
8 1
Step5. Go to Analyze -> Nonparametric Tests -> Legacy Dialogs -> Chi-square, select the variable, and click on OK.
Output:
Having Passport
Observed N Expected N Residual
yes 7 4.0 3.0
No 1 4.0 -3.0
Total 8
Test Statistics
Having Passport
Chi-Square 4.500a
df 1
Asymp. Sig. .034
a. 2 cells (100.0%) have expected frequencies less than 5. The minimum expected
cell frequency is 4.0.
Conclusion: The above table, Test Statistics, provides the actual result of the chi-square goodness-of-fit
test. We can see from this table that our test statistic is statistically significant: χ2(1) = 4.500, p = .034.
Therefore, we can reject the null hypothesis.
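The goodness-of-fit result can be reproduced with a short sketch. Equal expected frequencies are assumed for the two categories, as in the SPSS output (4.0 each). The p-value line uses the complementary error function, an identity for the chi-square upper tail that holds only for 1 degree of freedom.

```python
import math

# Responses from the survey above: 1 = Yes (has a passport), 2 = No.
responses = [1, 1, 1, 1, 1, 1, 2, 1]

observed = [responses.count(1), responses.count(2)]
# Equal expected frequencies across the categories (assumption matching the output).
expected = [len(responses) / len(observed)] * len(observed)

statistic = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
df = len(observed) - 1

# Upper-tail probability of the chi-square distribution; valid only for df = 1.
p_value = math.erfc(math.sqrt(statistic / 2))

print(observed, expected)
print(statistic, df, round(p_value, 3))
```

The sketch gives the same observed and expected counts, chi-square value (4.5), degrees of freedom (1), and significance (.034) as the SPSS tables.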
12.ANOVA
The following are the salaries of employees working in different departments: Finance,
Human Resources, and Marketing. The details are as follows.
Output:
ANOVA
Salaries
Sum of
Squares df Mean Square F Sig.
Between Groups 240166.667 2 120083.333 .913 .444
Within Groups 920833.333 7 131547.619
Total 1161000.000 9
Conclusion: The difference in mean salaries among the departments is not statistically significant
(F(2, 7) = .913, p = .444), so we fail to reject the null hypothesis that the means are equal among groups.
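The salary data are not listed in this manual, but the way the F ratio follows from the ANOVA table can still be sketched: each mean square is a sum of squares divided by its degrees of freedom, and F is the between-groups mean square divided by the within-groups mean square. The numbers below are copied from the table above.

```python
# Sums of squares and degrees of freedom copied from the ANOVA table above.
ss_between = 240166.667
ss_within = 920833.333
df_between = 2   # number of groups - 1
df_within = 7    # number of cases - number of groups

ms_between = ss_between / df_between   # Mean Square, Between Groups
ms_within = ss_within / df_within      # Mean Square, Within Groups
f_ratio = ms_between / ms_within

ss_total = ss_between + ss_within
df_total = df_between + df_within

print(round(ms_between, 3), round(ms_within, 3))
print(round(f_ratio, 3), df_total, round(ss_total, 3))
```

The degrees of freedom reported with the F ratio are therefore 2 and 7, the values in the df column of the table.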
13.GRAPHS IN SPSS
From the below data prepare Pie chart, Box plot and Histogram
Move the variable for which you are creating a pie chart into the “Define slices by” box
Select your desired option under “Slices Represent”
Click “OK”
Pie Chart of the variable times parting per week
BOXPLOT
A boxplot (also known as a box and whisker plot) is a way of graphically illustrating the
distribution of numeric data using the “five number summary” of the data set – namely the
minimum, first quartile, median, third quartile, and maximum values. It also identifies any outliers
that may exist in the data set.
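The five number summary behind a boxplot can be sketched with Python's `statistics` module. The study time data are not listed in this manual, so the sketch reuses the ten bag weights from the one-sample t test example. Quartile conventions differ between programs, so the first and third quartiles from `statistics.quantiles` are an assumption of this sketch and may not match the box edges drawn by SPSS exactly.

```python
import statistics

# Ten bag weights taken from the one-sample t test example in this manual.
weights = [50, 49, 52, 44, 45, 48, 45, 46, 49, 45]

# Three cut points dividing the ranked data into four equal parts.
q1, median_value, q3 = statistics.quantiles(weights, n=4)

five_number_summary = {
    "minimum": min(weights),
    "first quartile": q1,
    "median": median_value,
    "third quartile": q3,
    "maximum": max(weights),
}

for name, value in five_number_summary.items():
    print(name, value)
```

The box of the plot spans the first to the third quartile, the line inside it marks the median, and the whiskers extend toward the minimum and maximum.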
Steps: Click Graphs -> Legacy Dialogs -> Boxplots in earlier versions of SPSS
Select Simple and Summaries of separate variables
Click Define
Click Reset (recommended)
Select the variable for which you wish to create a boxplot, and move it into the Boxes Represent
box
Click OK
Box plot of the variable study time per week
HISTOGRAM
Quick Steps
Click Graphs -> Legacy Dialogs -> Histogram
Drag variable you want to plot as a histogram from the left into the Variable text box
Select “Display normal curve” (recommended)
Click OK