
Visualizing Relationships in Data Analysis

This document discusses visualizing relationships in data and creating visualizations. It provides examples of creating scatter plots, bar charts, and box plots using SAS procedures like PROC SGPLOT to visualize relationships between variables and classify observations. It demonstrates labeling outliers, layering graphs, and adding reference lines. The goal is to gain insight into relationships and patterns in data through visual representation.

Uploaded by

Sandy Khan

Understanding Relationships
“The purpose of computing is insight, not numbers.”
Computer Scientist Richard Hamming (1915-1998)
Correlation matrix: Iris data
* Calculating correlation coefficients;
proc corr data=sashelp.iris;
run;
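As a hedged variation on the step above, the sketch below names the four iris measurements explicitly and asks PROC CORR for a scatter plot matrix alongside the correlation table. The VAR list and the PLOTS= request are additions for illustration and are not part of the original slide.

```sas
* Sketch: correlation table plus scatter plot matrix for the iris measurements;
ods graphics on;   /* PLOTS= output requires ODS Graphics */

proc corr data=sashelp.iris plots=matrix(histogram);
   var SepalLength SepalWidth PetalLength PetalWidth;   /* variables to correlate */
run;
```

Reading the matrix next to the coefficients makes it easier to see whether a strong coefficient reflects a roughly linear pattern.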
Strong correlation
Visualizing relationships
Negative non-linear relationship
Note that the relationship is not linear.
R = -0.7
Fitting a linear regression line would result in over-prediction of the fuel economy of mid-sized cars, and under-prediction of the fuel economy of small and large engine sized cars.
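One way to see this effect is to overlay a linear fit and a quadratic fit on the same scatter plot. The sketch below assumes that sashelp.cars, with EngineSize and MPG_City, is a reasonable stand-in for the data on the slide; the slide itself does not name the data set or the variables.

```sas
* Sketch, assuming sashelp.cars (EngineSize, MPG_City) stands in for the slide data;
proc sgplot data=sashelp.cars;
   title 'Fuel economy vs engine size: linear and quadratic fits';
   scatter x=EngineSize y=MPG_City;
   /* straight-line fit */
   reg x=EngineSize y=MPG_City / nomarkers lineattrs=(color=red)
       legendlabel='Linear fit';
   /* degree=2 requests a quadratic polynomial fit */
   reg x=EngineSize y=MPG_City / degree=2 nomarkers lineattrs=(color=blue)
       legendlabel='Quadratic fit';
run;
title;
```

Where the two curves diverge is where a straight line would systematically over- or under-predict.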
Visualizing relationships

In a regression of Y on X, the coefficient of X will likely be statistically insignificant.

The correlation coefficient is not statistically significantly different from zero.
Visualizing relationships
Watch for outliers in the data. An outlier is an observation that is unlike the other observations in the data set.

Outliers should be investigated and their causes understood.

Automatically deleting these observations is not a good strategy.
NEXT:
Creating visualizations to represent relationships
Creating visualizations

Now let's look at how to create the graphics that represent relationships.

There are two ways:

1. You can create graphs using the menus in the graph tasks.

2. You can create them by writing code that calls graphics procedures such as PROC SGPLOT and PROC SGSCATTER.

For finer control, and to tailor the visualizations to your needs, use the second option.

We start with option 1 before moving on to option 2.


How to Create Scatter Plots – a menu driven approach
Scatter Plots – a menu driven approach

Summary: Start in “Tasks and Utilities”

1. From graph tasks select Scatter Plot
2. From DATA select SASHELP.IRIS
3. From ROLES select SepalLength as X-variable, and SepalWidth as Y-variable
4. Click on the Run icon
Output: Scatter plot created from graph tasks menu
Tasks and Utilities > Graph > Scatter plot > Roles x= SepalLength y=SepalWidth
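For comparison, the menu selections above correspond roughly to the short PROC SGPLOT step below. This is a minimal sketch; the code that the task itself generates may include additional options.

```sas
* Sketch: hand-written equivalent of the menu-driven scatter plot;
proc sgplot data=sashelp.iris;
   scatter x=SepalLength y=SepalWidth;   /* same roles as chosen in the task */
run;
```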
Creating scatter plots by group

proc sgplot data=sashelp.iris;


title 'Scatter plot of petalLength vs sepalWidth by species';
scatter x= petalLength y= sepalWidth / group =species ;
run;
title;

Try this with other pairs of measurements in the IRIS dataset. What insight do you get?
What cars are these?
Which vehicles are these?
Label outliers
Creating a variable containing text labels for outliers

data cars;
set sashelp.cars;
if mpg_city gt 45 then POI_label = model ;
else POI_label = "";
run;

The value of the POI_label variable equals the model name for outliers and is blank for all other observations.
Label outliers
proc sgplot data = cars;
title height=2 color= BIB "Highly fuel efficient cars";
scatter x= engineSize y= mpg_city / datalabel = POI_label;
run;
title;
Classification with bar charts
Horizontal bars representing frequencies

* Bar charts: vehicle TYPE frequencies;


proc sgplot data= sashelp.cars;
title 'Horizontal bar chart of vehicle TYPE (frequencies)';
hbar type ;
run;
title;
Classification with bar charts: Bars clustered by origin
* Bar charts: vehicle TYPE bars clustered by origin;

proc sgplot data= sashelp.cars;


title 'vertical bars vehicle TYPE clustered by origin';
title2 'Using groupdisplay = cluster option ';
vbar type / group= origin groupdisplay = cluster ;
run;
title;
title2;
Classification with bar charts: bars stacked by origin

* Bar charts: Vehicle type bars stacked by ORIGIN;

proc sgplot data= sashelp.cars;


title 'Vehicle TYPE stacked by origin';
vbar type / group= origin ;
run;
title;
Classification with response variable: INVOICE (price)

* Bar charts: Bars representing Mean INVOICE (price) for each vehicle TYPE ;

proc sgplot data= sashelp.cars;


title 'Mean INVOICE (price) for vehicle TYPEs';
title2 'Using response = INVOICE and stat = mean options' ;
hbar type / response=invoice stat=mean;
run;
title;
title2;

Sports cars are expensive


Hybrid vehicles don’t cost that much

Why?
Classification with response variable: HORSEPOWER
* Bar charts: Bars representing Mean HORSEPOWER for each vehicle TYPE ;

proc sgplot data= sashelp.cars;
title 'Mean HORSEPOWER for vehicle TYPEs';
title2 'Using response = HORSEPOWER and stat = mean options' ;
hbar type / response=horsepower stat=mean;
run;
title;
title2;

Porsche 911. Engine 443 HP. Gets you from 0 to 62 mph (100 km /h) in 3.6 seconds. Price $ 113,200
Layered bar charts

proc sgplot data=sashelp.cars;


vbar type /response = mpg_city stat= mean;
vbar type /response = mpg_highway stat= mean barwidth = 0.5;
run;

What is the most remarkable feature of this graph?


Create this graph
Bar and line chart

proc sgplot data= sashelp.stocks;
/* WHERE statement to select IBM stock for 2005 */
where stock='IBM' and year(date) =2005;
vbarbasic date/ response =volume y2axis;
series x=date y=close / markers;
title 'IBM stock price and volume for 2005';
run;
title;

When was the most significant movement in the IBM stock price?

Homework: Why did this happen?

Homework: Create the graph for another stock


Box plots by category

Homework: create POI_label

data cars;
set sashelp.cars;
/* --- create POI_labels --- */
run;

proc sgplot data=cars;


vbox invoice / category= type datalabel =POI_label;
run;
Sashelp.stocks data
* Series plots of closing price for each stock;

proc sgplot data=sashelp.stocks;
title height = 2 color =BIB 'Stock Market after 9/11';
title2 height = 1.3 color= deepPink 'what happened in the next 12 months?';
series x=date y=close / group= stock;
band y= close lower='11SEP01'd upper='11SEP02'd /
fillattrs=(color=lightgreen transparency=0.5);
run;
title;
Twinkle twinkle little star
How I wonder what age you are
proc sgplot data= sashelp.class;
title height =2.0 color=DAG 'Weight is related to height';
title2 height =1.5 color=BIG 'Some children are taller because they are older';
title3 height =1.25 color=royalBlue 'or perhaps because they are male';

bubble x=height y=weight size= Age/ group= sex;


run;
proc sgplot data=sashelp.heart;
title height = 2 color= Gray 'Heatmap for cholesterol and weight';
title2 height = 1.3 color= BIG 'with reference lines for borderline and high cholesterol';

heatmap x=weight y=cholesterol ;


refline 200 240 / axis=y
lineattrs=(thickness=3 color=orange pattern=dash)
label= ("borderline" "high") ;
run;
You can tinker with the colors to show more clearly where the observations are densest.
