
3/19/2020

Distributions of
Data
Understanding the importance of
location, spread, and shape

Reading materials

Chapter 2, Myatt and Johnson (2014)

Procedure FREQ in SAS documentation


http://support.sas.com/documentation/cdl/en/procstat/63104/HTML/default/viewer.htm#procstat_freq_sect001.htm

Work through the examples of the FREQ procedure given in the SAS documentation for that procedure


Frequency distribution and bar charts


Variable “Origin” measured on nominal scale (sashelp.cars)

Indicates the frequencies in each nominal category
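A minimal sketch of how the frequency table and bar chart for Origin could be produced. The choice of PROC FREQ and PROC SGPLOT is an assumption; the slides do not show the code behind the figure.

```sas
/* One-way frequency table of the nominal variable Origin */
proc freq data=sashelp.cars;
   tables Origin;
run;

/* Bar chart of the same frequencies (assumed plotting procedure) */
proc sgplot data=sashelp.cars;
   vbar Origin;
run;
```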

Variable “PLT” measured on ordinal scale

PLT: Number of mother’s previous premature labors

Note: Most observations fall in category 1, and the frequency decreases
as values increase. So the cases where the mothers had previously
had several premature labors are few.


A frequency histogram is useful for variables with an ordered scale (ordinal,
interval, or ratio) that contain a large number of values

Frequency distribution of the variable “Acceleration”

The histogram makes it easy to conclude that extremely low and extremely high values are less likely.
This would be harder to deduce from a frequency table listing a large number of individual values.
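A histogram of this kind could be drawn as sketched below. The data set behind the Acceleration example is not named in the slides, so this sketch uses Horsepower from sashelp.cars purely as a stand-in.

```sas
/* Frequency histogram of a continuous variable.
   Horsepower in sashelp.cars is a stand-in for Acceleration. */
proc sgplot data=sashelp.cars;
   histogram Horsepower / scale=count;   /* bar height = number of observations */
run;
```

SCALE=COUNT puts frequencies rather than percentages on the vertical axis, which matches the idea of a frequency histogram.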

Shapes of frequency distribution


Frequency distributions also give you an idea of the shape of the distribution and
help you decide whether there are outliers

Are there two groups?

Outlier?


Box plots – another way to represent distributions
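As an illustration only, a box plot could be requested as follows. The variable and data set are assumptions; the slides show the plot but not the code or data used.

```sas
/* Box plots of Horsepower, one box per level of Origin (illustrative variables) */
proc sgplot data=sashelp.cars;
   vbox Horsepower / category=Origin;
run;
```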

Shapes of distributions
Skewness

Skewness = 0 implies a symmetric distribution
positive values imply a longer right tail
negative values imply a longer left tail


Kurtosis = 0 suggests same kurtosis as a normal distribution


higher (positive) values suggest a higher peak near the mean
lower (negative) values suggest a flatter peak

It is of interest to the analyst to identify variables that have unusual distributions.

For example, these distributions may be associated with outliers, or may represent
variables that are non-normally distributed, sometimes requiring transformation.

The measures of Skewness and Kurtosis may be quickly obtained (for example
through PROC MEANS) to identify variables needing more detailed distributional
analysis.
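A quick screen of this kind might look like the sketch below. The data set and variables are illustrative assumptions, not the ones used in the lecture.

```sas
/* Screen several variables for skewness and kurtosis in one pass */
proc means data=sashelp.cars n mean std skewness kurtosis maxdec=3;
   var MSRP Horsepower Weight;
run;
```

Variables whose skewness or kurtosis is far from zero are candidates for a closer look with PROC UNIVARIATE.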


SAS Graph tasks to draw a distribution

Graph tasks: Bar chart, Histogram

FREQ PROCEDURE
proc freq data=new;
   tables a / missprint;   /* MISSPRINT: display missing values in the table */
   title '1-WAY FREQUENCY TABLE WITH MISSPRINT OPTION';
run;
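The data set NEW is not shown in the slides. The sketch below builds a small hypothetical version with two missing values, so the effect of MISSPRINT can be compared with the MISSING option.

```sas
/* Hypothetical data: variable a with two missing values */
data new;
   input a @@;
   datalines;
1 2 2 . 3 3 3 .
;
run;

/* MISSPRINT: missing values are listed but excluded from percentages */
proc freq data=new;
   tables a / missprint;
run;

/* MISSING: missing is treated as a valid level and included in percentages */
proc freq data=new;
   tables a / missing;
run;
```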


Analysis of distributions of
variables

PROC UNIVARIATE

Data Gains


PROC UNIVARIATE: Vanilla Flavor

proc univariate data=gains;
   var height;
run;

Part of SAS output is shown below:

(More output on the next slide.)

PROC UNIVARIATE: OUTPUT


PROC UNIVARIATE: OUTPUT (Cont.)

PROC UNIVARIATE: NOPRINT OPTION


proc univariate noprint data=gains;
   var height weight;
   output out=unigain                    /* OUTPUT statement */
          mean=hmean wmean
          pctlpts=1 5 10 25 75 90 95 99  /* percentile points: pctl + pts */
          pctlpre=h w;                   /* prefixes for percentile variables: pctl + pre */
run;

Note: Use proc print to print data UNIGAIN
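A minimal sketch of that step, assuming the PROC UNIVARIATE step that creates UNIGAIN has already run:

```sas
/* Print the one-row summary data set created by the OUTPUT statement */
proc print data=unigain noobs;
run;
```

With the prefixes h and w, the percentile variables are named by appending the percentile point to the prefix, for example h1, h5, and h99 for height, and w1, w5, and w99 for weight.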

SAS output is shown below:


ODS Tables Produced with the PROC UNIVARIATE Statement

ODS Table Name      Description                                      Option
------------------  -----------------------------------------------  ---------------------------------
BasicIntervals      Confidence intervals for mean, standard          CIBASIC
                    deviation, variance
BasicMeasures       Measures of location and variability             Default
ExtremeObs          Extreme observations                             Default
ExtremeValues       Extreme values                                   NEXTRVAL=
Frequencies         Frequencies                                      FREQ
LocationCounts      Counts used for sign test and signed rank test   LOCCOUNT
MissingValues       Missing values                                   Default, if missing values exist
Modes               Modes                                            MODES
Moments             Sample moments                                   Default
Plots               Line printer plots                               PLOTS
Quantiles           Quantiles                                        Default
RobustScale         Robust measures of scale                         ROBUSTSCALE
SSPlots             Line printer side-by-side box plots              PLOTS (with BY statement)
TestsForLocation    Tests for location                               Default
TestsForNormality   Tests for normality                              NORMALTEST
TrimmedMeans        Trimmed means                                    TRIMMED=
WinsorizedMeans     Winsorized means                                 WINSORIZED=

SAS data set used in examples


Getting estimates of basic measures and quantiles of a distribution

Requesting tables of interest - ODS SELECT statement

title 'Systolic and Diastolic Blood Pressure';
ods select BasicMeasures Quantiles;   /* requested ODS tables */
proc univariate data=BPressure;
   var Systolic Diastolic;
run;

ODS Output: Basic Measures & Quantiles


Systolic and Diastolic Blood Pressure

Basic Statistical Measures

Location                Variability
Mean     121.2727       Std Deviation          14.28346
Median   120.0000       Variance              204.01732
Mode     120.0000       Range                  69.00000
                        Interquartile Range    13.00000

Quantiles (Definition 5)

Quantile       Estimate
100% Max            165
99%                 165
95%                 140
90%                 134
75% Q3              125
50% Median          120
25% Q1              112
10%                 108
5%                  100
1%                   96
0% Min               96


Proc Univariate: Robust measures of location and scale

Data BPressure;
Set BPressure;
Run;
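The PROC UNIVARIATE call that produces the robust estimates is not reproduced in the slides. The sketch below is one way to request them, using the TRIMMED=, WINSORIZED=, and ROBUSTSCALE options from the ODS table list. The option values and the analysis variable Systolic are assumptions inferred from the output that follows.

```sas
/* Keep only the robust-estimate tables */
ods select TrimmedMeans WinsorizedMeans RobustScale;

/* TRIMMED=1 .1  : trim 1 observation, then a proportion of 0.1, from each tail
   WINSORIZED=.1 : Winsorize a proportion of 0.1 in each tail
   ROBUSTSCALE   : robust measures of scale
   (values and variable are assumed, not taken from the slides) */
proc univariate data=BPressure trimmed=1 .1 winsorized=.1 robustscale;
   var Systolic;
run;
```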


Robust measures of location

Winsorized Means

Percent      Number       Winsorized   Std Error    95% Confidence        DF   t for H0:   Pr > |t|
Winsorized   Winsorized   Mean         Winsorized   Limits                     Mu0=0.00
in Tail      in Tail                   Mean
13.64        3            120.64       2.42         115.48     125.78     15   49.9102     <.0001

Trimmed Means

Percent      Number       Trimmed      Std Error    95% Confidence        DF   t for H0:   Pr > |t|
Trimmed      Trimmed      Mean         Trimmed      Limits                     Mu0=0.00
in Tail      in Tail                   Mean
 4.55        1            120.3500     2.573536     114.9635   125.7365   19   46.76446    <.0001
13.64        3            120.3125     2.395387     115.2069   125.4181   15   50.22675    <.0001

First row: trimmed mean with 1 observation trimmed in each tail
Second row: 0.1 trimmed mean (10% of n = 22 is 2.2 observations, rounded up to 3 observations trimmed in each tail)
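The count and percentage in the 0.1 row can be checked with a small DATA step. This is a worked-arithmetic sketch and the variable names are illustrative. It assumes the number trimmed per tail is the proportion times n, rounded up to the next integer, which is consistent with the 3 observations reported.

```sas
data _null_;
   n   = 22;            /* sample size stated in the slide */
   p   = 0.1;           /* requested trimming proportion */
   k   = ceil(n * p);   /* observations trimmed in each tail: 2.2 rounds up to 3 */
   pct = 100 * k / n;   /* percent trimmed in each tail: about 13.64 */
   put k= pct= 6.2;
run;
```

The same arithmetic gives 100 * 1 / 22, or about 4.55 percent, for the row with one observation trimmed.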

Robust measures of scale

Robust Measures of Scale

Measure                   Value      Estimate of Sigma
Interquartile Range       13.00000    9.63691            (Q3-Q1)
Gini's Mean Difference    15.03030   13.32026
MAD                        6.50000    9.63690
Sn                         9.54080    9.54080
Qn                        13.33140   11.36786
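The "Estimate of Sigma" column rescales each measure so that it estimates the standard deviation when the data are normal. The sketch below reproduces three of the rows using the usual normal-consistency constants, taking the measure values from the table as inputs. Variable names are illustrative.

```sas
data _null_;
   iqr  = 13.0;        /* interquartile range from the table */
   mad  = 6.5;         /* median absolute deviation from the table */
   gini = 15.03030;    /* Gini's mean difference from the table */

   sigma_iqr  = iqr / 1.34898;                     /* about 9.6369  */
   sigma_mad  = mad * 1.4826;                      /* about 9.6369  */
   sigma_gini = gini * sqrt(constant('pi')) / 2;   /* about 13.3203 */

   put sigma_iqr= 8.4 sigma_mad= 8.4 sigma_gini= 8.4;
run;
```

All of these robust estimates are lower than the ordinary standard deviation of 14.28346 reported earlier, which is what one would expect if a few extreme values inflate the classical estimate.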
