
Exploratory Data Analysis

Outline
 Data Description
 Simple Inference for Continuous Data
 Simple Inference for Categorical Data
 Graphical Presentation of a Data Set

Sumon Kanti Das, Lecturer, Dept. of Statistics, SUST, Sylhet 5/2/2017


Analysis in SPSS


Data: Simple Inference for Continuous Data

The lifespans (in days) of two groups of rats, one given a restricted
diet (106 rats) and the other allowed to eat freely (89 rats),
are shown in the next slide.


Sample of a Data Set


Descriptive Statistics of a Continuous variable

 Measures of Central Tendency
   Mean: Arithmetic Mean, Geometric Mean, Harmonic Mean
   Median
   Mode
 Measures of Location
   Quantiles: Quartiles, Deciles, Percentiles
 Measures of Dispersion
   Absolute Measures
     Range
     Quartile Deviation
     Mean Deviation
     Standard Deviation
   Relative Measures
     Coefficient of Variation
 Shape Characteristics: Skewness & Kurtosis


Descriptive Statistics of a continuous variable

How can we obtain descriptive statistics of a continuous variable?
Manually:
Analyze > Descriptive Statistics > Frequencies > Select Variable >
Click on Statistics Button > Choose the options > Continue > OK
Syntax:
FREQUENCIES VARIABLES=LIFESPAN
  /NTILES=4
  /NTILES=10
  /PERCENTILES=55.0 89.0
  /STATISTICS=STDDEV VARIANCE RANGE MINIMUM MAXIMUM
    SEMEAN MEAN MEDIAN MODE SKEWNESS SESKEW
    KURTOSIS SEKURT
  /ORDER=ANALYSIS.
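
For readers without SPSS at hand, most of the statistics requested by the FREQUENCIES command above can be approximated in Python. This is only a minimal sketch using the standard library. The `lifespans` list is made-up illustrative data (not the rat data), `describe` is a hypothetical helper name, and the skewness and kurtosis here are the simple moment-based versions, so they will differ slightly from the adjusted estimates SPSS reports.

```python
import math
import statistics as st

# Hypothetical lifespans in days (illustrative only, not the rat data).
lifespans = [630, 630, 678, 720, 791, 850, 905, 960, 1010, 1100]

def describe(values):
    """Return a dict of basic descriptive statistics for a list of numbers."""
    n = len(values)
    mean = st.mean(values)
    sd = st.stdev(values)  # sample standard deviation (divides by n - 1)
    # Central moments used by the simple moment-based shape measures.
    m2 = sum((x - mean) ** 2 for x in values) / n
    m3 = sum((x - mean) ** 3 for x in values) / n
    m4 = sum((x - mean) ** 4 for x in values) / n
    return {
        "mean": mean,
        "median": st.median(values),
        "mode": st.mode(values),
        "std_dev": sd,
        "variance": st.variance(values),
        "range": max(values) - min(values),
        "se_mean": sd / math.sqrt(n),
        "quartiles": st.quantiles(values, n=4),  # three cut points
        "skewness": m3 / m2 ** 1.5,
        "excess_kurtosis": m4 / m2 ** 2 - 3,
    }

summary = describe(lifespans)
for name, value in summary.items():
    print(name, value)
```

A negative `skewness` indicates a longer left tail, and a positive `excess_kurtosis` corresponds to a leptokurtic shape.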


Descriptive Statistics of a continuous variable

Mean lifetime in the data is 836.3718 days, with a standard deviation
of 274.32259 days.

Median and mode of lifespan are 791 and 630 days.

The distribution is negatively skewed and leptokurtic.

678 days is the 25th percentile, indicating that 25% of the
observations lie below 678 days.


HISTOGRAM

The histogram shows that most of the observations lie on the right side.
Compared to the normal curve, it is negatively skewed.


Descriptive Statistics: Categorical Variable

Conversion of a continuous variable to a categorical variable
Manually:
Transform > Recode into Different Variables
Syntax:
RECODE LIFESPAN (0 thru 500=1) (501 thru 1000=2) (1001 thru 1500=3)
  INTO category.
VARIABLE LABELS category "Categorical Lifespan".
VALUE LABELS category 1 "<=500" 2 "501-1000" 3 "1001-1500".
EXECUTE.
Manually, for the frequency distribution:
Analyze > Descriptive Statistics > Frequencies > Select Variable > OK
Syntax:
FREQUENCIES VARIABLES=category.
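
The recoding rule above can also be sketched in Python to make the cut points explicit. This is an illustrative sketch only: `recode_lifespan` and `frequency_table` are hypothetical helper names, and the sample values are invented rather than taken from the rat data. Values outside 0 to 1500 are returned as `None`, on the assumption that they would be left missing in the new variable.

```python
from collections import Counter

# Labels mirror the VALUE LABELS command in the slide.
LABELS = {1: "<=500", 2: "501-1000", 3: "1001-1500"}

def recode_lifespan(days):
    """Map a lifespan in whole days to category 1, 2 or 3 (None if out of range)."""
    if 0 <= days <= 500:
        return 1
    if 501 <= days <= 1000:
        return 2
    if 1001 <= days <= 1500:
        return 3
    return None

def frequency_table(values):
    """Count how many values fall in each labelled category."""
    counts = Counter(recode_lifespan(v) for v in values)
    return {LABELS[code]: counts.get(code, 0) for code in LABELS}

# Hypothetical lifespans in days (illustrative only).
sample = [450, 500, 501, 630, 791, 1000, 1001, 1200]
print(frequency_table(sample))
```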


Frequency Distribution


Bar Chart


PIE DIAGRAM


Descriptive Statistics

The “Descriptives” option is used to find descriptive statistics
for continuous variables only.

Manually:
Analyze > Descriptive Statistics > Descriptives > Select Variable >
Click on Options Button > Choose the options > Continue > OK

Syntax:
DESCRIPTIVES VARIABLES=LIFESPAN
  /STATISTICS=MEAN STDDEV VARIANCE RANGE MIN MAX
    SEMEAN KURTOSIS SKEWNESS.


Output

The standard error of the mean is the standard deviation of the
sampling distribution of the mean.
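
As a rough check on this idea, the standard error of the mean can be computed as the sample standard deviation divided by the square root of the sample size. The sketch below uses the standard deviation reported earlier in the slides (274.32259 days) and assumes the sample size is 106 + 89 = 195 rats with no missing values; that sample size is an assumption, not something stated in the output.

```python
import math

# Standard deviation reported in the slides (days).
std_dev = 274.32259
# Assumed sample size: 106 restricted-diet rats plus 89 free-eating rats.
n = 106 + 89

# Standard error of the mean = s / sqrt(n).
standard_error = std_dev / math.sqrt(n)
print(round(standard_error, 3))
```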


Descriptive Statistics

How can you calculate the Geometric Mean and Harmonic Mean for
a continuous variable?
Ans.: They are available in the Compare Means option, which requires a
categorical independent variable for the comparison. Because the output
reports the total as well as the sub-totals, we can easily get the GM
and HM for the dependent variable of interest.
Let's see.


Descriptive Statistics

For the given data, let us run the Compare Means command using
LIFESPAN as the dependent variable and DIET as the independent
variable.
Manually:
Analyze > Compare Means > Means > Select Dependent Variable >
Select Independent Variable > Click on Options Button >
Choose the options > Continue > OK / Paste
Syntax:
MEANS TABLES=LIFESPAN BY DIET
  /CELLS MEAN COUNT STDDEV MEDIAN GMEDIAN SEMEAN SUM
    MIN MAX RANGE FIRST LAST VAR KURT SEKURT SKEW
    SESKEW HARMONIC GEOMETRIC SPCT NPCT.


Output

These are the required HM and GM for the total.
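
The same group and total figures can be sketched in Python with the standard library's `statistics.geometric_mean` and `statistics.harmonic_mean` (Python 3.8 or later). The data below are invented for illustration, and `gm_hm` is a hypothetical helper name. As in the Compare Means output, the "Total" entry pools all groups.

```python
import statistics as st

# Hypothetical lifespans in days by diet group (illustrative only).
lifespan_by_diet = {
    "restricted": [900, 1000, 1100],
    "free": [600, 700, 800],
}

def gm_hm(values):
    """Return (geometric mean, harmonic mean) of positive values."""
    return st.geometric_mean(values), st.harmonic_mean(values)

# One entry per group, plus a pooled "Total" entry.
report = {group: gm_hm(values) for group, values in lifespan_by_diet.items()}
pooled = [x for values in lifespan_by_diet.values() for x in values]
report["Total"] = gm_hm(pooled)

for group, (gm, hm) in report.items():
    print(group, round(gm, 2), round(hm, 2))
```

For positive data the harmonic mean never exceeds the geometric mean, which in turn never exceeds the arithmetic mean, so the printed values give a quick sanity check.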


Descriptive Statistics for a continuous variable according to categorical Variable

We want to see the descriptive statistics for different groups. In such a case we
use the Explore command of Descriptive Statistics.
 It needs one or more dependent variables
 It needs one categorical variable in the Factor List
Manually:
Analyze > Descriptive Statistics > Explore > Select Dependent Variable > Select
Categorical Variable in Factor List > Click on Statistics & Plots Buttons >
Choose the options > Continue > OK

Syntax:
EXAMINE VARIABLES=LIFESPAN BY DIET
  /PLOT BOXPLOT STEMLEAF HISTOGRAM NPPLOT
  /COMPARE GROUP
  /PERCENTILES(5,10,25,50,75,90,95) HAVERAGE
  /STATISTICS DESCRIPTIVES EXTREME
  /CINTERVAL 95
  /MISSING LISTWISE
  /NOTOTAL.


Output


Output

The 5% trimmed mean is the mean of the observations after excluding
the lowest 5% and the highest 5% of the observations.
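
A simple version of this calculation can be sketched as follows. It is a simplification: it drops a whole number of observations from each end (rounding down), whereas SPSS may handle the fractional part differently, so results can differ slightly on small samples. `trimmed_mean` is a hypothetical helper name and the data are invented.

```python
import math

def trimmed_mean(values, proportion=0.05):
    """Mean after dropping floor(n * proportion) values from each end."""
    data = sorted(values)
    k = math.floor(len(data) * proportion)  # observations cut per tail
    kept = data[k:len(data) - k]
    return sum(kept) / len(kept)

# Hypothetical data: 1..19 plus one extreme value (illustrative only).
sample = list(range(1, 20)) + [1000]
print(sum(sample) / len(sample))  # ordinary mean, pulled up by 1000
print(trimmed_mean(sample))       # 5% trimmed mean, less affected
```

With 20 observations, one value is removed from each tail, so the extreme value no longer influences the result.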


Outlier


Normality Test

H0: The lifespan data of rats on the restricted diet follow a
normal distribution.
The test results are significant, so H0 is rejected.


Histogram


Histogram


Stem & Leaf Plot


Box Plot

Box plot annotations: lower quartile, median, upper quartile.


Simple Inference for Categorical Data

Basics on Cross-tabulation
 Cross-tabulation analysis is the basic technique for examining
the relationship between two or more categorical (nominal or
ordinal) variables (attributes), possibly controlling for additional
layering variables.
 The Crosstabs procedure offers tests of independence and
measures of association for nominal and ordinal data.
 Additionally, you can obtain estimates of the relative risk of an
event given the presence or absence of a particular characteristic.



Cross-tabulation

Manually:
Analyze > Descriptive Statistics > Crosstabs... > Select Column Variable >
Select Row Variable > Choose other options > Continue > OK

Syntax:
CROSSTABS
  /TABLES=X3 BY X9
  /FORMAT=AVALUE TABLES
  /STATISTICS=CHISQ
  /CELLS=COUNT ROW
  /COUNT ROUND CELL.

Cross-tabulation

The data set consists of some dependent and independent variables.
We want to see the association between these dependent and
independent variables.

Cross-tabulation

Dependent variables, by name:
1. undernut: “0” & “1”
2. underwt_ord: 0, 1, 2

Here we measure the association of each independent variable
with “undernut” and “underwt_ord”.

Select all the independent variables. Click on the arrow (>) to
transfer them to the Row box. Click on “undernut” (dependent
variable) and transfer it to the Columns box.

Cross-tabulation

All we need to do now is select some options and run the procedure.

Click on the Statistics button and its dialog box will open on
the screen.

Select Chi-square to test the independence of the row and column
variables, along with any other options you need.

Cross-tabulation

From the Cell Display dialog box, select Observed under Counts.
You can select Expected as well. Under Percentages, select any of
Row, Column or Total. You can also request other statistics as required.
Here we select Observed counts and Row percentages.

Undernutrition vs. Children Age

Crosstab: Age of children (for ordinal regression) by Undernutrition status

Age of children                 Nourish   Underweight    Total
12-23    Count                      482           688     1170
         % within Age             41.2%         58.8%   100.0%
24+      Count                     1764          1931     3695
         % within Age             47.7%         52.3%   100.0%
0-11     Count                      917           222     1139
         % within Age             80.5%         19.5%   100.0%
Total    Count                     3163          2841     6004
         % within Age             52.7%         47.3%   100.0%
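
The row percentages in this table can be reproduced from the counts alone, which is a useful check when reading Crosstabs output. The sketch below uses the counts reported above; `row_percentages` is a hypothetical helper name.

```python
# Observed counts from the crosstab: (Nourish, Underweight) per age group.
counts = {
    "12-23": (482, 688),
    "24+": (1764, 1931),
    "0-11": (917, 222),
}

def row_percentages(table):
    """Return each row's cells as percentages of that row's total."""
    result = {}
    for label, cells in table.items():
        total = sum(cells)
        result[label] = tuple(round(100 * c / total, 1) for c in cells)
    return result

# Add the Total row by summing each column over the age groups.
with_total = dict(counts)
with_total["Total"] = tuple(sum(col) for col in zip(*counts.values()))

for label, pcts in row_percentages(with_total).items():
    print(label, pcts)
```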

Test of Independence between Undernutrition Status & Children Age

Chi-Square Tests
                                Value       df   Asymp. Sig. (2-sided)
Pearson Chi-Square              451.927a     2   .000
Likelihood Ratio                482.072      2   .000
Linear-by-Linear Association    353.863      1   .000
N of Valid Cases                6004
a. 0 cells (.0%) have expected count less than 5. The minimum expected count
is 538.96.
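
To see where the Pearson chi-square value comes from, the statistic can be recomputed from the observed counts in the undernutrition by age crosstab. The sketch below is illustrative: `pearson_chi_square` is a hypothetical helper name, and the p-value line relies on the fact that for 2 degrees of freedom the chi-square upper tail probability is exp(-x/2), so it does not generalise to other tables.

```python
import math

# Observed counts: (Nourish, Underweight) for each age group.
observed = {
    "12-23": (482, 688),
    "24+": (1764, 1931),
    "0-11": (917, 222),
}

def pearson_chi_square(table):
    """Return (chi-square, degrees of freedom, minimum expected count)."""
    rows = list(table.values())
    row_totals = [sum(r) for r in rows]
    col_totals = [sum(c) for c in zip(*rows)]
    n = sum(row_totals)
    chi2 = 0.0
    min_expected = float("inf")
    for row, row_total in zip(rows, row_totals):
        for obs, col_total in zip(row, col_totals):
            # Expected count under independence of rows and columns.
            expected = row_total * col_total / n
            chi2 += (obs - expected) ** 2 / expected
            min_expected = min(min_expected, expected)
    df = (len(rows) - 1) * (len(col_totals) - 1)
    return chi2, df, min_expected

chi2, df, min_expected = pearson_chi_square(observed)
p_value = math.exp(-chi2 / 2)  # upper tail probability, valid only for df = 2
print(round(chi2, 3), df, round(min_expected, 2), p_value)
```

The very small p-value is why SPSS prints the significance as .000, and it leads to rejecting independence between undernutrition status and age group.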

In syntax mode:
After selecting all the options from the dialog boxes, click “Paste”
instead of “OK”. This opens a new window called the syntax window,
which contains syntax such as the following for the required analysis.

CROSSTABS
  /TABLES=ord_age interval mot_edu wealth icfi
    care_ind ca_bmi ari_ord fever_ord diar_ord ord_1 wt
    BY undernut
  /FORMAT=AVALUE TABLES
  /STATISTICS=CHISQ
  /CELLS=COUNT ROW
  /COUNT ROUND CELL.

Graphical Presentation

 Bar Diagram
 Histogram
 Pie Diagram
 Stem and Leaf Plot
 Box Plot
 Scatter Plot
 Population Pyramid
