Explorative Data Analysis

Outline
• Data Description
• Simple Inference for Continuous Data
• Simple Inference for Categorical Data
• Graphical Presentation of a Data Set
Sumon Kanti Das, Lecturer, Dept. of Statistics, SUST, Sylhet 5/2/2017

Analysis in SPSS

Data: Simple Inference for Continuous Data
• The lifespans of two groups of rats, one given a restricted diet (n = 106) and the other allowed to eat freely (n = 89), are shown on the next slide.

Sample of a Data Set

Descriptive Statistics of a Continuous variable
• Measures of Central Tendency
  • Mean: Arithmetic Mean, Geometric Mean, Harmonic Mean
  • Median
  • Mode
• Measures of Location
  • Quantile: Quartile, Decile, Percentile
• Measures of Dispersion
  • Absolute Measures
    • Range
    • Quartile Deviation
    • Mean Deviation
    • Standard Deviation
  • Relative Measures
    • Coefficient of Variation
• Shape Characteristics: Skewness & Kurtosis

Descriptive Statistics of a continuous variable
How can we obtain descriptive statistics of a continuous variable?
Manually:
Analyze > Descriptive Statistics > Frequencies > Select Variable > Click on Statistics Button > Choose the options > Continue > OK
Syntax:
FREQUENCIES VARIABLES=LIFESPAN
  /NTILES=4
  /NTILES=10
  /PERCENTILES=55.0 89.0
  /STATISTICS=STDDEV VARIANCE RANGE MINIMUM MAXIMUM
    SEMEAN MEAN MEDIAN MODE SKEWNESS SESKEW
    KURTOSIS SEKURT
  /ORDER=ANALYSIS.
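
To make the requested statistics concrete, here is a minimal Python sketch that computes several of them by hand. The values in `lifespans` are hypothetical and are not the rat data; the quartile cut points come from Python's default interpolation rule, which may differ slightly from the rule SPSS applies.

```python
import statistics

# Hypothetical lifespans in days (illustrative only, not the slide data).
lifespans = [420, 510, 630, 630, 700, 790, 850, 910, 1020, 1150]

mean = statistics.mean(lifespans)              # arithmetic mean
median = statistics.median(lifespans)          # middle value
mode = statistics.mode(lifespans)              # most frequent value
std_dev = statistics.stdev(lifespans)          # sample SD (n - 1 denominator)
variance = statistics.variance(lifespans)      # sample variance
value_range = max(lifespans) - min(lifespans)  # maximum minus minimum

# Three cut points that split the data into four equal parts (like /NTILES=4).
quartiles = statistics.quantiles(lifespans, n=4)

print(mean, median, mode, value_range, quartiles)
```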

Descriptive Statistics of a continuous variable
• The mean lifespan is 836.3718 days, with a standard deviation of 274.32259 days.
• The median and mode of lifespan are 791 and 630 days.
• The distribution is negatively skewed and leptokurtic.
• 678 days is the 25th percentile: 25% of the observations lie below 678 days.
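
The signs of skewness and kurtosis can be checked with a short calculation. The sketch below uses the bias-adjusted sample formulas that statistical packages commonly report; that SPSS uses exactly these formulas is an assumption here, and the toy data sets are hypothetical.

```python
import math


def sample_skewness(values):
    """Bias-adjusted sample skewness: negative means a longer left tail."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    z_cubed = sum(((x - mean) / s) ** 3 for x in values)
    return n / ((n - 1) * (n - 2)) * z_cubed


def sample_excess_kurtosis(values):
    """Bias-adjusted excess kurtosis: positive suggests leptokurtic, negative platykurtic."""
    n = len(values)
    mean = sum(values) / n
    s = math.sqrt(sum((x - mean) ** 2 for x in values) / (n - 1))
    z_fourth = sum(((x - mean) / s) ** 4 for x in values)
    first = n * (n + 1) / ((n - 1) * (n - 2) * (n - 3)) * z_fourth
    return first - 3 * (n - 1) ** 2 / ((n - 2) * (n - 3))


# Hypothetical toy data: one symmetric set and two mirror-image skewed sets.
symmetric = [1, 2, 3, 4, 5]
left_tailed = [1, 7, 8, 9, 10]
right_tailed = [1, 2, 3, 4, 10]

print(sample_skewness(left_tailed), sample_skewness(right_tailed))
print(sample_excess_kurtosis(symmetric))
```

A negative skewness value corresponds to the "negatively skewed" reading, and a positive excess kurtosis to the "leptokurtic" reading.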

HISTOGRAM
• Compared to the normal curve, it is negatively skewed.
• The histogram shows that most of the observations lie on the right side.

Descriptive Statistics: Categorical Variable
Conversion of a continuous variable to a categorical variable

Manually:
Transform > Recode into Different Variables
Syntax:
RECODE LIFESPAN (0 thru 500=1) (501 thru 1000=2) (1001 thru 1500=3)
  INTO category.
VARIABLE LABELS category "Categorical Lifespan".
VALUE LABELS category 1 "<=500" 2 "501-1000" 3 "1001-1500".
EXECUTE.
Manually, for the frequency distribution:
Analyze > Descriptive Statistics > Frequencies > Select Variable > OK
Syntax:
FREQUENCIES VARIABLES=category.
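
The recoding and the resulting frequency distribution can be sketched in Python as follows. The function name `recode_lifespan` and the sample values are hypothetical; the ranges mirror the RECODE command above.

```python
from collections import Counter


def recode_lifespan(days):
    """Map a lifespan in days to category 1, 2, or 3 using the RECODE ranges."""
    if 0 <= days <= 500:
        return 1
    if 501 <= days <= 1000:
        return 2
    if 1001 <= days <= 1500:
        return 3
    return None  # value outside every range: left uncategorized


# Hypothetical lifespans in days (illustrative only).
lifespans = [320, 480, 501, 760, 999, 1000, 1001, 1210, 1600]

categories = [recode_lifespan(d) for d in lifespans]
frequency = Counter(categories)  # category -> number of observations

for category, count in sorted(frequency.items(), key=lambda kv: str(kv[0])):
    print(category, count)
```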

Frequency Distribution

Bar Chart

Pie Diagram

Descriptive Statistics
The "Descriptives" option computes descriptive statistics for continuous variables only.
Manually:
Analyze > Descriptive Statistics > Descriptives > Select Variable > Click on Options Button > Choose the options > Continue > OK
Syntax:
DESCRIPTIVES VARIABLES=LIFESPAN
  /STATISTICS=MEAN STDDEV VARIANCE RANGE MIN MAX
    SEMEAN KURTOSIS SKEWNESS.

Output
The standard error is the standard deviation of the sampling distribution of the mean.
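
As a small worked illustration, the standard error of the mean can be estimated as the sample standard deviation divided by the square root of the sample size. The data below are hypothetical.

```python
import math
import statistics

# Hypothetical sample (illustrative only).
sample = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(sample)
std_dev = statistics.stdev(sample)     # sample SD (n - 1 denominator)
standard_error = std_dev / math.sqrt(n)

print(round(standard_error, 4))
```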

How can you calculate the Geometric Mean and Harmonic Mean for a continuous variable?
Ans.: They are available under the Compare Means options. That procedure asks for a categorical independent variable to compare across, but its output reports the total as well as the sub-total for each group, so we can easily read off the GM and HM of the dependent variable of interest.
Let's see.

For the given data, let us run the Compare Means command using LIFESPAN as the dependent variable and DIET as the independent variable.
Manually:
Analyze > Compare Means > Means > Select Dependent Variable > Select Independent Variable > Click on Options Button > Choose the options > Continue > OK / Paste
Syntax:
MEANS TABLES=LIFESPAN BY DIET
  /CELLS MEAN COUNT STDDEV MEDIAN GMEDIAN SEMEAN SUM
    MIN MAX RANGE FIRST LAST VAR KURT SEKURT SKEW
    SESKEW HARMONIC GEOMETRIC SPCT NPCT.

Output
These are the required HM and GM for the Total.
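
For comparison, the geometric and harmonic means can be computed per group and for the total in a few lines of Python. The group names and values below are hypothetical, not the rat data.

```python
import statistics

# Hypothetical lifespans in days by diet group (illustrative only).
groups = {
    "restricted": [400, 800],
    "free": [100, 200],
}

# Sub-totals: one (GM, HM) pair per group.
by_group = {
    name: (statistics.geometric_mean(values), statistics.harmonic_mean(values))
    for name, values in groups.items()
}

# Total: pool every observation regardless of group.
pooled = [x for values in groups.values() for x in values]
total_gm = statistics.geometric_mean(pooled)
total_hm = statistics.harmonic_mean(pooled)
total_am = statistics.mean(pooled)

print(by_group)
print(total_am, total_gm, total_hm)
```

For positive data the three means always satisfy HM ≤ GM ≤ AM, which gives a quick sanity check on the output.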

Descriptive Statistics for a continuous variable according to categorical Variable
To see the descriptive statistics separately for different groups, we use the Explore command of Descriptive Statistics.
• Needs one or more dependent variables
• Needs one categorical variable in the Factor List
Manually:
Analyze > Descriptive Statistics > Explore > Select Dependent Variable > Select Categorical Variable in Factor List > Click on Statistics & Plots Buttons > Choose the options > Continue > OK
Syntax:
EXAMINE VARIABLES=LIFESPAN BY DIET
  /PLOT BOXPLOT STEMLEAF HISTOGRAM NPPLOT
  /COMPARE GROUP
  /PERCENTILES(5,10,25,50,75,90,95) HAVERAGE
  /STATISTICS DESCRIPTIVES EXTREME
  /CINTERVAL 95
  /MISSING LISTWISE
  /NOTOTAL.

Output

Output
The 5% trimmed mean is the mean of the observations after excluding the lowest 5% and the highest 5% of the observations.
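
A minimal sketch of a trimmed mean in Python follows. It drops a whole number of observations from each end, which is a simplification; SPSS may handle fractional counts differently. The function name `trimmed_mean` and the data are hypothetical.

```python
def trimmed_mean(values, proportion=0.05):
    """Mean after dropping the lowest and highest `proportion` of the values."""
    ordered = sorted(values)
    k = int(len(ordered) * proportion)  # observations to drop at each end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


# Hypothetical data: 19 ordinary values and one extreme value.
data = list(range(1, 20)) + [1000]

plain_mean = sum(data) / len(data)
mean_5pct_trimmed = trimmed_mean(data, 0.05)

print(plain_mean, mean_5pct_trimmed)
```

With 20 values, 5% trimming removes one observation from each end, so the extreme value no longer dominates the mean.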

Outlier

Normality Test
H0: The lifespans of rats on the restricted diet follow a normal distribution.
The test result is significant, so H0 is rejected.

Histogram

Histogram

Stem-and-Leaf Plot

Box Plot
Labels in the figure: lower quartile, median, upper quartile.
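
The quantities a box plot displays can be sketched in Python as follows. The 1.5 × IQR fences used to flag outliers are a common convention and an assumption here, since the slides do not state which rule SPSS applied; the quartile interpolation may also differ from SPSS, and the data are hypothetical.

```python
import statistics

# Hypothetical observations (illustrative only), with one extreme value.
data = [10, 12, 13, 14, 15, 16, 17, 18, 19, 60]

lower_q, median, upper_q = statistics.quantiles(data, n=4)
iqr = upper_q - lower_q  # interquartile range: the length of the box

# Assumed convention: points beyond 1.5 * IQR from the box are outliers.
lower_fence = lower_q - 1.5 * iqr
upper_fence = upper_q + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(lower_q, median, upper_q, outliers)
```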

Simple Inference for Categorical Data
Basics on Cross-tabulation
• Cross-tabulation is the basic technique for examining the relationship between two or more categorical (nominal or ordinal) variables (attributes), possibly controlling for additional layering variables.
• The Crosstabs procedure offers tests of independence and measures of association for nominal and ordinal data.
• Additionally, you can obtain estimates of the relative risk of an event given the presence or absence of a particular characteristic.

Cross-tabulation
Manually:
Analyze > Descriptive Statistics > Crosstabs... > Select Column Variable > Select Row Variable > Choose other options > Continue > OK
Syntax:
CROSSTABS
  /TABLES=X3 BY X9
  /FORMAT=AVALUE TABLES
  /STATISTICS=CHISQ
  /CELLS=COUNT ROW
  /COUNT ROUND CELL.
Cross-tabulation
The data set consists of some dependent and independent variables. We want to see the association between these dependent and independent variables.
Cross-tabulation
Dependent variables, by name:
1. undernut: "0" & "1"
2. underwt_ord: 0, 1, 2
Here we measure the association of each independent variable with "undernut" and "underwt_ord".
• Select all the independent variables.
• Click on the arrow (>) to transfer them to the Row box.
• Click on "undernut" (dependent variable) to transfer it to the Columns box.
Cross-tabulation
All we need to do now is select some options and run the procedure.
• Click on the Statistics button; its dialog box will open.
• Select Chi-square to test independence, and other options as you need.
Cross-tabulation
• In the Cell Display dialog box, select Observed under Counts. You can select Expected as well.
• Under Percentages, select any of Row, Column, or Total.
• You can also request other statistics as required.
Here we select Observed as the count and Row as the percentage.
Undernutrition vs. Children Age
Crosstab: Age of children for ordinal regression by Undernutrition status
(percentages are "% within Age of children for ordinal regression", i.e. row percentages)

Age of children             Nourish    Underweight    Total
12-23    Count                  482            688     1170
         % within Age         41.2%          58.8%   100.0%
24+      Count                 1764           1931     3695
         % within Age         47.7%          52.3%   100.0%
0-11     Count                  917            222     1139
         % within Age         80.5%          19.5%   100.0%
Total    Count                 3163           2841     6004
         % within Age         52.7%          47.3%   100.0%
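
The row percentages in this table can be reproduced from the counts alone. The sketch below uses the counts reported in the crosstab; the function name `row_percentages` is hypothetical.

```python
# Observed counts from the crosstab: (Nourish, Underweight) per age group.
counts = {
    "12-23": (482, 688),
    "24+": (1764, 1931),
    "0-11": (917, 222),
}


def row_percentages(row):
    """Each cell as a percentage of its row total, rounded to one decimal."""
    total = sum(row)
    return tuple(round(100 * cell / total, 1) for cell in row)


row_pct = {age: row_percentages(row) for age, row in counts.items()}

# The Total row uses the column sums across all age groups.
column_totals = tuple(sum(col) for col in zip(*counts.values()))
row_pct["Total"] = row_percentages(column_totals)

for age, pct in row_pct.items():
    print(age, pct)
```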
Test of Independence between Undernutrition Status & Children Age
Chi-Square Tests

                                  Value    df    Asymp. Sig. (2-sided)
Pearson Chi-Square            451.927 a     2    .000
Likelihood Ratio                482.072     2    .000
Linear-by-Linear Association    353.863     1    .000
N of Valid Cases                   6004

a. 0 cells (.0%) have expected count less than 5. The minimum expected count is 538.96.
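
The Pearson chi-square value and the minimum expected count can be checked by hand from the observed counts. This is a minimal Python sketch of the standard calculation, not SPSS's own implementation; the closed-form p-value used at the end holds only for 2 degrees of freedom.

```python
import math

# Observed counts: rows are age groups 12-23, 24+, 0-11;
# columns are Nourish, Underweight.
observed = [
    [482, 688],
    [1764, 1931],
    [917, 222],
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count under independence: row total * column total / n.
expected = [[r * c / n for c in col_totals] for r in row_totals]

chi_square = sum(
    (o - e) ** 2 / e
    for o_row, e_row in zip(observed, expected)
    for o, e in zip(o_row, e_row)
)
df = (len(observed) - 1) * (len(observed[0]) - 1)
min_expected = min(min(row) for row in expected)

# For df = 2 only, the upper-tail chi-square probability is exp(-x / 2).
p_value = math.exp(-chi_square / 2)

print(round(chi_square, 3), df, round(min_expected, 2), p_value)
```

The p-value is far below .001, which SPSS displays as .000.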
In syntax mode:
After selecting all the options in the dialog boxes, click "Paste" instead of "OK". This opens a new window, called the Syntax window, containing syntax such as the following for the required analysis.
CROSSTABS
  /TABLES=ord_age interval mot_edu wealth icfi
    care_ind ca_bmi ari_ord fever_ord diar_ord ord_1 wt
    BY undernut
  /FORMAT=AVALUE TABLES
  /STATISTICS=CHISQ
  /CELLS=COUNT ROW
  /COUNT ROUND CELL.
Graphical Presentation
• Bar Diagram
• Histogram
• Pie Diagram
• Stem and Leaf Plot
• Box Plot
• Scatter Plot
• Population Pyramid
