
Introduction to SPSS and Data Analysis
• Statistics is a field of study concerned with the collection, organization and analysis of data to make more effective decisions


DESCRIPTIVE STATISTICS

Methods of organizing, summarizing and presenting data in an informative way.

INFERENTIAL STATISTICS

Drawing conclusions about a population based on sample results.

Population?
For example, based on sample survey results reported in USA Today, only
46% of high school students can solve problems involving fractions,
decimals and percentages.

This is an inference about the population (all high school students) based on
sample data.

Basic Concepts
Population
A population is a collection of all individuals or objects of interest.

Sample

A subset or portion of a population is called a sample.


• Often there is limited time, staff and money to gather data from the
entire population.

• So researchers use samples


Basic Concepts
Parameter
• Any value (such as the mean or SD) calculated from the population.

Examples
• Average height of all students of DUHS
• Average price of all statistics books in a book fair
Basic Concepts
Statistic
• Any value (such as the mean or SD) calculated from the sample.

Examples
• Average height of any 30 students selected from DUHS
• Average price of any 20 statistics books selected from a book fair
• A variable is something whose value can vary

• For example, age, gender and blood type are variables.

Variables (column headings) and data (rows):

Age   Gender   Blood Type
32    Male     A
24    Male     B
40    Female   A
Types of Variable
Categorical Variables :

• They can only be classified into categories. For example:

• Gender
- Male
- Female

• Socioeconomic status
- Upper class
- Middle class
- Lower class

• Type of disease
- Diarrhea
- Fever
Nominal Categorical Variables
• The ordering of categories is completely arbitrary (No ordering)

Examples
- Gender: Male, Female

- Religion: Islam, Christianity

- Hair Colour : Black, Brown etc


Ordinal Categorical Variables
• There is a natural order among the categories, for example:

• Personal Health Status


- Very Good
- Good
- Fair
- Poor

• Faculty Position
- Professor
- Associate Professor
- Assistant Professor
Continuous variables

• Anything that can be measured

• Values come from a continuous range of possible values

Examples

- Blood pressure (mmHg) of the subject

- Body mass index (kg/m²) of the subject

- Length of stay of a patient in hospital


Discrete variables
• Anything that can be counted

• Can assume only whole (integer) numbers

Examples

- Number of clinical visits

- Number of adverse events

- Number of angina attacks

- Number of households in a community, etc.


Distinguish Between Qualitative and Quantitative Variables
• Colour of eyes

• The number of leaves on a neem tree

• The length of time on a phone call.

• Religion of people of a country

• Colour of hair of 100 children


Distinguish Between Discrete and Continuous Variables
• The number of light bulbs that burn out in a room

• The age of a child

• The number of flowers on a tree

• The daily temperature recorded at the weather bureau

• Number of houses in a city

• Lifetime of an energy saver


Data Collection

• Survey/Questionnaires
• Records
• Experimentation
• The first step in describing a set of data is to arrange it in a table called a
frequency distribution

• A frequency distribution is a grouping of data into non-overlapping classes
showing the number of observations in each class.
Frequency Distribution for Qualitative Variables
• The following table shows the frequency distribution of education
status for a group of 25 people surveyed

Degrees Frequency
None 2
Bachelor 11
Master 7
Doctorate 5
Construct a frequency table for
• Gender
• Minority

ID Gender Minority
1 m No
2 m No
3 f No
4 f No
5 m No
6 m No
7 m No
8 f No
9 f No
10 f No
11 f No
12 m Yes
13 m Yes
14 f Yes
15 m No
16 m No
17 m No
18 m No
19 m No
20 f No
Frequency table for Gender

Gender frequency
Female 8
Male 12
Total 20

Frequency table for Minority


Minority frequency
No 17
Yes 3
Total 20
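
The two frequency tables above can be reproduced by simple counting. The following is a minimal Python sketch, not part of the original slides; the lists are typed in from the 20 cases listed earlier, and the helper name `frequency_table` is only illustrative.

```python
from collections import Counter

# Gender and Minority columns for cases 1 to 20, in ID order
gender = ["m", "m", "f", "f", "m", "m", "m", "f", "f", "f",
          "f", "m", "m", "f", "m", "m", "m", "m", "m", "f"]
minority = ["No"] * 11 + ["Yes"] * 3 + ["No"] * 6


def frequency_table(values):
    """Return a dict mapping each category to the number of cases in it."""
    return dict(Counter(values))


gender_freq = frequency_table(gender)
minority_freq = frequency_table(minority)

# Print each table with its total, in the same layout as the slides
for name, table in (("Gender", gender_freq), ("Minority", minority_freq)):
    print(name)
    for category, count in sorted(table.items()):
        print(f"  {category}: {count}")
    print(f"  Total: {sum(table.values())}")
```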
Relative frequency
• It is the proportion of cases in any category/class to the total number of cases

Degree      Frequency   Relative Frequency (%)
None        2           2/25 × 100 = 8
Bachelor    11          11/25 × 100 = 44
Master      7           7/25 × 100 = 28
Doctorate   5           5/25 × 100 = 20
Total       25          25/25 × 100 = 100
Cumulative Relative frequency
• It is the proportion of cases in a particular category and all preceding
categories

Degree      Frequency   Relative frequency   Cumulative relative frequency
None        2           8%                   8%
Bachelor    11          44%                  52%
Master      7           28%                  80%
Doctorate   5           20%                  100%
Total       25          100%
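
As a rough illustration of the two definitions, the sketch below recomputes the relative and cumulative relative frequencies for the education example in Python. The variable names are illustrative and are not part of the original slides.

```python
from itertools import accumulate

degrees = ["None", "Bachelor", "Master", "Doctorate"]
frequencies = [2, 11, 7, 5]

total = sum(frequencies)

# Relative frequency (%): cases in the category divided by total cases
relative = [100 * f / total for f in frequencies]

# Cumulative relative frequency (%): running sum of the relative frequencies
cumulative = list(accumulate(relative))

print(f"{'Degree':<10} {'Freq':>5} {'Rel %':>7} {'Cum %':>7}")
for degree, f, r, c in zip(degrees, frequencies, relative, cumulative):
    print(f"{degree:<10} {f:>5} {r:>7.0f} {c:>7.0f}")
```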
The commonly used graphic forms are:
• Bar Graph
• Pie Chart
Bar Graph: A graph in which classes are reported on the horizontal
axis and class frequencies on the vertical axis. The longer or taller
the bar, the greater the value it represents.

• y-axis: Frequency or frequency percentage

• x-axis: Class/Category
Gender frequency
Female 8
Male 12
Total 20
• Single bar graph
• Used to convey the discrete value of each category
shown on the opposite axis
• When more than one discrete value for each
category is to be represented
• It is a preliminary data analysis tool. It is used
to show segments of a total.
• Usually not used because the results cannot be read
properly and can lead to misleading conclusions
Pie Chart: A graph in which a circle is divided into sectors. Each
sector represents a category of data.

Degree      Frequency   Percent (%)
None        2           8
Bachelor    11          44
Master      7           28
Doctorate   5           20
Total       25          100
• A researcher wishes to prepare a report showing the number of hours
per week students spend studying. He selects a random sample of 30
students and determines the number of hours each student studied last
week.

15.0, 23.7, 19.7, 15.4, 18.3, 23.0, 14.2, 20.8, 13.5,


20.7, 17.4, 18.6, 12.9, 20.3, 13.7, 21.4, 18.3, 29.8,
17.1, 18.9, 10.3, 26.1, 15.7, 14.0, 17.8, 33.8, 23.2,
12.9, 27.1, 16.6.
• Organize the data into a frequency distribution.
Frequency Distribution for Quantitative Variables

Study Hours   Frequency   Percent   CP
10-15         8           27        27
15-20         11          37        63
20-25         7           23        87
25-30         3           10        97
30-35         1           3         100
Total         30          100
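
The grouping step can be sketched in Python as follows. This is not from the original slides. It assumes that each class includes its upper limit but not its lower limit (so 15.0 is counted in 10-15), because that is the convention under which the counts 8, 11, 7, 3 and 1 are obtained from the 30 values listed above; the slides do not state the convention explicitly.

```python
# Study hours for the 30 sampled students, copied from the list above
hours = [
    15.0, 23.7, 19.7, 15.4, 18.3, 23.0, 14.2, 20.8, 13.5,
    20.7, 17.4, 18.6, 12.9, 20.3, 13.7, 21.4, 18.3, 29.8,
    17.1, 18.9, 10.3, 26.1, 15.7, 14.0, 17.8, 33.8, 23.2,
    12.9, 27.1, 16.6,
]

# Class limits from the table. Assumption: low < value <= high.
classes = [(10, 15), (15, 20), (20, 25), (25, 30), (30, 35)]

n = len(hours)

# Frequency: number of observations falling in each class
counts = [sum(1 for h in hours if low < h <= high) for low, high in classes]

# Percent of all observations in each class, rounded to whole numbers
percents = [round(100 * c / n) for c in counts]

# Cumulative percent (CP), computed from the running count
running = 0
cumulative_percents = []
for c in counts:
    running += c
    cumulative_percents.append(round(100 * running / n))

for (low, high), c, p, cp in zip(classes, counts, percents, cumulative_percents):
    print(f"{low}-{high}: frequency={c}, percent={p}, CP={cp}")
```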
The commonly used graphic form is:
• Histogram
A histogram shows the shape of a distribution. In a histogram, the classes
are marked on the horizontal axis and the class frequencies on the vertical
axis.
Common Shapes of frequency distribution
Measures of Central Tendency
• A measure of central tendency is a measure which indicates where the
middle of the data is.

• The three most commonly used measures of central tendency are:

- Mean

- Median

- Mode
Mean:
For a given set of n observations, x1, x2, x3, …, xn, the mean is the sum of
these numbers divided by n, and is denoted by

\[ \bar{x} = \frac{\sum_{i=1}^{n} x_i}{n} \]
• The following are the weight losses of 10 individuals who entered a 5-week
weight-control program:
• 9, 7, 10, 11, 10, 11, 4, 8, 10, 9

\[ \bar{x} = \frac{\sum_{i=1}^{10} x_i}{10} = \frac{9 + 7 + 10 + \dots + 9}{10} = \frac{89}{10} = 8.9 \]

Median:
• It is the middlemost value of the data set.
• It divides the data in such a way that half the observations
are less than that number and half the observations are
greater than that number.
• It is denoted by \(\tilde{x}\)

Example:
• Find the median of the following three numbers: 2, 8, 5

• To find the median, first arrange the data in ascending or descending order

• The arranged data is 2, 5, 8

• Now, median = middlemost observation = 5

• If n (number of observations) is odd, median = the \(\left(\frac{n+1}{2}\right)\)th ordered
observation.

Example:

Data: 1, 7, 6, 2, 5    n = 5

Ordered: 1, 2, 5, 6, 7

\[ \text{median} = \left(\frac{n+1}{2}\right)\text{th observation} = \left(\frac{5+1}{2}\right)\text{th observation} = 3\text{rd observation} \]

So, median = 5

• If n (number of observations) is even, median = mean of the \(\left(\frac{n}{2}\right)\)th
observation and the \(\left(\frac{n}{2}+1\right)\)th observation.

EX2.
Data: 4, 6, 2, 7, 5, 8 n=6

Ordered: 2, 4, 5, 6, 7, 8
\[ \frac{n}{2} = \frac{6}{2} = 3\text{rd observation} \]

and

\[ \frac{n}{2} + 1 = (3 + 1)\text{th observation} = 4\text{th observation} \]

\[ \text{So, median} = \frac{5 + 6}{2} = 5.5 \]
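
The odd and even rules can be combined in one small function. The following Python sketch is an illustration only (the function name `median` is our own choice); it is checked against the two worked examples above.

```python
def median(values):
    """Median using the odd/even rules described in the slides."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty data set is undefined")
    middle = n // 2
    if n % 2 == 1:
        # Odd n: the ((n + 1) / 2)th ordered observation.
        # Python lists are zero-based, so that observation is at index n // 2.
        return ordered[middle]
    # Even n: mean of the (n / 2)th and (n / 2 + 1)th ordered observations
    return (ordered[middle - 1] + ordered[middle]) / 2


odd_example = median([1, 7, 6, 2, 5])
even_example = median([4, 6, 2, 7, 5, 8])
print(odd_example, even_example)
```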
Determine the median of the following two datasets:

1. {1, 2, 3, 4, 5}

2. {2, 3, 4, 5, 6, 7, 8, 9}
The Mode:
• The value which occurs most frequently in the data set.
• If all values are different, there is no mode.
• Sometimes there is more than one mode.

EX.
Data: 4, 5, 2, 2, 6, 8    n = 6

So mode = 2. Here mean = 4.5 and median = 4.5


• For example, we have the following monthly incomes:
30000, 45000, 45000, 45000, 200000

Here median = 45,000, mode = 45,000 and

\[ \text{mean} = \frac{30000 + 45000 + 45000 + 45000 + 200000}{5} = 73000 \]

• We can see that the mean is inflated to 73000 because of one extreme observation.
• So, the median is a better choice
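
The effect of the extreme income can be checked with Python's standard `statistics` module. This is a small sketch added for illustration, using the five incomes from the example; the comparison without the largest value is our own addition.

```python
import statistics

incomes = [30000, 45000, 45000, 45000, 200000]

mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)
mode_income = statistics.mode(incomes)

# Dropping the extreme value shows how strongly it pulls the mean,
# while the median stays where it was.
without_extreme = incomes[:-1]
mean_without_extreme = statistics.mean(without_extreme)
median_without_extreme = statistics.median(without_extreme)

print("mean:", mean_income, "median:", median_income, "mode:", mode_income)
print("mean without 200000:", mean_without_extreme)
print("median without 200000:", median_without_extreme)
```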
Mean
• Uses all the values in the data set, so it is the most sensitive to variations in
the data.
• The mean is affected by extremely high or low values, called outliers,
and may not be the appropriate average to use in these
situations.

Median
• The median is affected less than the mean by extremely high or
extremely low values and is therefore a valuable measure of central
tendency when such values occur.
• It is less sensitive to variations in the data

Measure of Dispersion
[Figure: three data sets with different amounts of spread, each with Mean = 7]
Measure of Dispersion

Example:

A study was conducted in a pharmaceutical company to determine
whether the use of omeperazole and rabeperazole (anti inflammatory
enzymes) has an influence on the density of the drug. The
densities of the drugs, in g/cm³, are as follows:

Omeperazole 0.55 0.32 0.36 0.37 0.39 0.43 0.43 0.47 0.52 0.53
Rabeperazole 0.26 0.26 0.43 0.23 0.47 0.51 0.52 0.55 0.59 0.55
• The measures of central tendency obtained for the previous example are as
follows:

               Mean    Median   Mode
Omeperazole    0.437   0.430    0.430
Rabeperazole   0.437   0.490    0.260

• Both drugs have the same mean, i.e. 0.437, but the two drugs still differ.
There is more variation in the values of Rabeperazole.

• Therefore, measures of central tendency do not give a complete
description of the data
• A measure of dispersion indicates the amount of variability in the data
set. Common measures are:

- Range

- Variance

- Standard Deviation
• Range = Maximum Value – Minimum Value
• Let’s calculate the range for the previous example

               Minimum   Maximum   Range
Omeperazole    0.32      0.55      0.23
Rabeperazole   0.23      0.59      0.36
• The variation in the densities of drugs using two contents is given below
Obs.    Xi      (Xi − x̄)   (Xi − x̄)²
1       0.28    -0.129     0.016641
2       0.32    -0.089     0.007921
3       0.36    -0.049     0.002401
4       0.37    -0.039     0.001521
5       0.38    -0.029     0.000841
6       0.43    0.021      0.000441
7       0.43    0.021      0.000441
8       0.47    0.061      0.003721
9       0.52    0.111      0.012321
10      0.53    0.121      0.014641
Total   4.09    0          0.06089
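
The columns of this table can be recomputed step by step. The Python sketch below is an illustration, not part of the original slides. The variance formula itself does not survive in the text, so the sketch assumes the sample variance, i.e. the sum of squared deviations divided by n − 1, which is the divisor used in the standard deviation formula of this section.

```python
# The ten Xi values from the table
densities = [0.28, 0.32, 0.36, 0.37, 0.38, 0.43, 0.43, 0.47, 0.52, 0.53]

n = len(densities)
mean = sum(densities) / n

# Third column of the table: deviation of each value from the mean
deviations = [x - mean for x in densities]

# Fourth column of the table: squared deviations
squared_deviations = [d ** 2 for d in deviations]
sum_of_squares = sum(squared_deviations)

# Assumption: sample variance with divisor n - 1
variance = sum_of_squares / (n - 1)

print(f"mean = {mean:.3f}")
print(f"sum of squared deviations = {sum_of_squares:.5f}")
print(f"variance = {variance:.6f}")
```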

• Because the variance is expressed in squared units, it is not considered a
good measure of dispersion. To avoid this, the square root of the variance is
taken, which is called the standard deviation

\[ s = \sqrt{s^2} = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2} \]

• The standard deviation of the densities of drugs using two contents
is given below
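
The values themselves are not preserved in this text, so the following Python sketch shows one way to obtain them from the Omeperazole and Rabeperazole densities listed in the example. It relies on `statistics.variance` and `statistics.stdev`, which use the n − 1 divisor; the helper name `describe` is illustrative.

```python
import statistics

# Densities (g/cm3) as listed in the example
omeperazole = [0.55, 0.32, 0.36, 0.37, 0.39, 0.43, 0.43, 0.47, 0.52, 0.53]
rabeperazole = [0.26, 0.26, 0.43, 0.23, 0.47, 0.51, 0.52, 0.55, 0.59, 0.55]


def describe(values):
    """Return the range, sample variance and sample standard deviation."""
    return {
        "range": max(values) - min(values),
        "variance": statistics.variance(values),  # divisor n - 1
        "sd": statistics.stdev(values),           # square root of the variance
    }


summary = {
    "Omeperazole": describe(omeperazole),
    "Rabeperazole": describe(rabeperazole),
}

for drug, stats in summary.items():
    print(f"{drug}: range={stats['range']:.2f}, "
          f"variance={stats['variance']:.4f}, sd={stats['sd']:.4f}")
```

Both drugs have the same mean, but the larger standard deviation for Rabeperazole reflects the greater variation in its values.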
Measure of Shape: Skewness
• Frequency distributions can assume many shapes. The three most commonly
observed shapes are:

- symmetrical

- positively skewed and

- negatively skewed.
• A measure of skewness describes the shape of the data
For a symmetrical distribution, the mean will equal the median, and the
skewness coefficient will be zero.

mean = median = mode


• If the distribution is skewed to the right, the mean will be greater
than the median, and the coefficient will be positive.

mean > median > mode

• If the distribution is skewed to the left, the mean will be less than
the median, and the coefficient will be negative.

mode > median > mean
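
The slides do not say which skewness coefficient is meant. As an assumption, the Python sketch below uses the moment coefficient of skewness (third central moment divided by the 1.5 power of the second central moment); statistical packages may report a sample-size-adjusted version, so the magnitude can differ while the sign carries the same meaning. The right-skewed data are the monthly incomes used earlier; the other two data sets are hypothetical.

```python
def skewness(values):
    """Moment coefficient of skewness: m3 / m2 ** 1.5 (an assumed definition)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((x - mean) ** 2 for x in values) / n  # second central moment
    m3 = sum((x - mean) ** 3 for x in values) / n  # third central moment
    if m2 == 0:
        raise ValueError("skewness is undefined when all values are equal")
    return m3 / m2 ** 1.5


right_skewed = [30000, 45000, 45000, 45000, 200000]  # incomes from the mean example
left_skewed = [20, 85, 88, 90, 92]                    # hypothetical scores
symmetric = [1, 2, 3, 4, 5]                           # hypothetical values

for name, data in (("right skewed", right_skewed),
                   ("left skewed", left_skewed),
                   ("symmetric", symmetric)):
    print(f"{name}: skewness = {skewness(data):.3f}")
```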


• SPSS is one of the most popular statistical packages; it can perform highly
complex data manipulation and analysis with simple instructions
• Start → All Programs → SPSS Inc → SPSS 19.0 → SPSS 19.0
• The default window will have the data editor
• There are two sheets in the window:
1. Data view 2. Variable view
• Data Editor
- For defining, entering, editing, and displaying data. The extension of
the saved file is “.sav”.

• Output Viewer
- Displays output. The extension of the saved file is “.spv”.
• How would you put the following information into SPSS?

Patient ID   Gender   Age   Height   Smoking Status
1 2 50 5.4 2
2 2 45 5.1 2
3 1 43 5.6 1
4 2 35 6 1
5 1 29 5.9 2
6 2 32 5.6 2
7 2 36 5.8 2
8 2 55 5 2
9 1 49 5.4 1
10 1 43 5.11 1

Gender: 1 = Male, 2 = Female
Smoking Status: 1 = Yes, 2 = No
Practice 1
• To save the data file you created, simply click ‘File’ and then click ‘Save As.’
• Sort the data by the ‘Height’ of students in descending order.

• Click ‘Data’ and then click Sort Cases
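
For readers who want to check the expected result outside SPSS, the same sort can be sketched in Python. This is an illustration only and not an SPSS feature; the records are typed in from the patient table above and the dictionary keys are our own naming.

```python
# Patient table from the slides: ID, gender code, age, height, smoking code
patients = [
    {"id": 1, "gender": 2, "age": 50, "height": 5.4, "smoking": 2},
    {"id": 2, "gender": 2, "age": 45, "height": 5.1, "smoking": 2},
    {"id": 3, "gender": 1, "age": 43, "height": 5.6, "smoking": 1},
    {"id": 4, "gender": 2, "age": 35, "height": 6, "smoking": 1},
    {"id": 5, "gender": 1, "age": 29, "height": 5.9, "smoking": 2},
    {"id": 6, "gender": 2, "age": 32, "height": 5.6, "smoking": 2},
    {"id": 7, "gender": 2, "age": 36, "height": 5.8, "smoking": 2},
    {"id": 8, "gender": 2, "age": 55, "height": 5, "smoking": 2},
    {"id": 9, "gender": 1, "age": 49, "height": 5.4, "smoking": 1},
    {"id": 10, "gender": 1, "age": 43, "height": 5.11, "smoking": 1},
]

# Sort by height in descending order. Python's sort is stable, so patients
# with equal heights keep their original relative order.
by_height_desc = sorted(patients, key=lambda p: p["height"], reverse=True)

sorted_ids = [p["id"] for p in by_height_desc]
for p in by_height_desc:
    print(p["id"], p["height"])
```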


Opening the sample data
• Open ‘Employee data.sav’ from the SPSS samples folder
• Go to “File,” “Open,” and click “Data”
• Go to the “Program Files,” “SPSSInc,” “SPSS16,” and “Samples” folder
• Open the “Employee data.sav” file
Recoding variables
• Recoding into the same variable
• Recoding into different variables
• It is always recommended to recode into different variables and not to alter
the original variable
• Click on Transform > Recode > Into different variables.

• Select the variable you want to recode: Educational Level

• Start by giving the new variable a new name (educat)
• Click on Change
• Click on Old and New Values
• Use “Range” (fourth option down) to recode as follows. Remember
to click on “Add” after entering each recode.
- 8 to 12 = 1
- 13 to 16 = 2
- 17 to 21 = 3

• Click Continue
• And then OK.
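
The recoding rule can be written out explicitly. The Python sketch below is an illustration, not SPSS syntax; it maps years of education to the three codes and stores the result in a new list, leaving the original values untouched, in line with the advice to recode into a different variable. The sample values are hypothetical, and treating out-of-range values as missing is our assumption.

```python
def recode_education(years):
    """Map years of education to the codes used in the slides."""
    if 8 <= years <= 12:
        return 1
    if 13 <= years <= 16:
        return 2
    if 17 <= years <= 21:
        return 3
    # Assumption: values outside all three ranges are left as missing
    return None


# Hypothetical values of the original variable (years of education)
educ = [8, 12, 15, 16, 19, 21]

# New variable, named educat as in the slides; educ itself is not modified
educat = [recode_education(years) for years in educ]

print(educ)
print(educat)
```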
Basic Analysis with SPSS
• Frequencies
- This analysis produces frequency tables showing frequency counts
and percentages of the values of individual variables.

• Descriptives
- This analysis shows the maximum, minimum, mean, range, standard
deviation, etc. of the variables
Frequencies
• Click ‘Analyze,’ ‘Descriptive statistics,’ then click ‘Frequencies’
Frequencies
• Click gender and put it into the variable box.
• Click ‘Charts.’
• Then click ‘Bar charts’ and click ‘Continue.’

• Finally, click OK in the Frequencies box.
Descriptives
• Click ‘Analyze,’ ‘Descriptive statistics,’ then click ‘Descriptives…’
• Click ‘Current Salary’ and ‘Beginning Salary,’ and put them into the
variable box.
• Click ‘Options’
• The options allow you to request other descriptive statistics besides the mean and standard deviation.
• Click ‘Variance’, ‘Minimum’, ‘Maximum’ and ‘Range’
• Then click ‘Continue’
• Finally, click OK in the Descriptives box. You will be able to see the
result of the analysis.
• Select File → Open → Data
• Choose Excel as the file type
• Select the file you want to import
• Then click Open

• Key in values and labels for each variable
• Run frequencies for each variable
• Check the outputs to see if any variables have
wrong values.
• Check missing values against the physical surveys if you
use paper surveys, and make sure they are really
missing.
• Sometimes you need to recode string variables
into numeric variables
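
The idea behind running frequencies to spot wrong values can be sketched in Python. This is an illustration, not an SPSS procedure. The allowed codes follow the coding scheme used earlier in these slides (Gender: 1 = Male, 2 = Female; Smoking Status: 1 = Yes, 2 = No); the records and the helper name `check_variable` are hypothetical.

```python
from collections import Counter

# Allowed codes for each variable, following the coding scheme in the slides
VALID_CODES = {"gender": {1, 2}, "smoking": {1, 2}}

# Hypothetical records with one wrong entry (gender = 3) and one missing value
records = [
    {"gender": 1, "smoking": 2},
    {"gender": 2, "smoking": 1},
    {"gender": 3, "smoking": 2},
    {"gender": 2, "smoking": None},
]


def check_variable(name):
    """Return (frequency table, wrong values, number of missing values)."""
    frequencies = Counter(r[name] for r in records)
    missing = frequencies.pop(None, 0)  # None marks a missing value here
    wrong = sorted(v for v in frequencies if v not in VALID_CODES[name])
    return dict(frequencies), wrong, missing


report = {name: check_variable(name) for name in VALID_CODES}

for name, (frequencies, wrong, missing) in report.items():
    print(f"{name}: frequencies={frequencies}, wrong={wrong}, missing={missing}")
```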
• Recode variables
1. Select Transform → Recode into Different Variables
2. Select the variable that you want to transform (e.g. Q20): we want
1 = Yes and 0 = No
3. Click the arrow button to put your variable into the right window
4. Under Output Variable: type a name and label for the new variable,
then click Change
5. Click Old and New Values
