Understanding Relationships
“The purpose of computing is insight, not numbers.”
Computer Scientist Richard Hamming (1915-1998)
Correlation matrix: Iris data
* Calculating correlation coefficient;
proc corr data=sashelp.iris;
run;
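A possible variation on the step above, offered as a sketch rather than the course's own code: the VAR statement limits the matrix to named variables, and PLOTS=MATRIX asks PROC CORR for a scatter plot matrix alongside the table. The variable list and the HISTOGRAM sub-option are choices made here for illustration.

```sas
* Sketch: variable list and plot options are illustrative choices;
ods graphics on;
proc corr data=sashelp.iris plots=matrix(histogram);
   var SepalLength SepalWidth PetalLength PetalWidth;
run;
```

Leaving out the VAR statement goes back to the default of correlating every numeric variable in the data set.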
Strong correlation
Visualizing relationships
Negative non-linear relationship (R = -0.7)
Note that the relationship is not linear.
Fitting a linear regression line would result in:
1. over-prediction of fuel economy of mid-sized cars
2. under-prediction of fuel economy of small and large engine sized cars
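To see the over- and under-prediction described above, one option is to overlay a straight-line fit and a quadratic fit on the same scatter plot. This is only a sketch: it assumes the slide's plot is based on engineSize and mpg_city in SASHELP.CARS, and the quadratic fit (DEGREE=2) is one illustrative alternative, not necessarily the best model.

```sas
* Assumption: the relationship shown is mpg_city vs engineSize in sashelp.cars;
proc sgplot data=sashelp.cars;
   title 'Linear vs quadratic fit: fuel economy by engine size';
   scatter x=engineSize y=mpg_city;
   * Straight-line fit (default degree is 1);
   reg x=engineSize y=mpg_city / nomarkers
       lineattrs=(color=red pattern=dash) legendlabel='Linear fit';
   * Quadratic fit for comparison;
   reg x=engineSize y=mpg_city / degree=2 nomarkers
       lineattrs=(color=blue) legendlabel='Quadratic fit';
run;
title;
```

Where the dashed line sits above or below the bulk of the points is where a linear model would over- or under-predict.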
Visualizing relationships
In a regression of Y on X, the coefficient of X will likely be statistically insignificant.
The correlation coefficient is not statistically significantly different from zero.
Visualizing relationships
Watch for outliers in the data. An outlier is an observation that is unlike the other
observations in the data set.
Outliers should be investigated and their causes understood.
Automatically deleting these observations is not a good strategy.
NEXT:
Creating visualizations to represent relationships
Creating visualizations
Now let’s talk about how to create the graphics that represent relationships.
There are two ways:
1. You can create graphs using menus in the graph tasks
2. You can create them by writing code that uses graphics procedures such as PROC SGPLOT and PROC SGSCATTER
For better control, and to tailor the visualizations to your needs, you need the second option.
We start with option 1 before moving on to option 2.
How to Create Scatter Plots – a menu driven approach
Scatter Plots – a menu driven approach
Summary: Be in “Tasks and Utilities”
1. From graph tasks select Scatter Plot
2. From DATA select SASHELP.IRIS
3. From ROLES select SepalLength as X-variable, and SepalWidth as Y-variable
4. Click on the Run icon
Output: Scatter plot created from graph tasks menu
Tasks and Utilities > Graph > Scatter plot > Roles x= SepalLength y=SepalWidth
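For comparison, the same plot can be sketched in code with PROC SGPLOT. This is a minimal hand-written equivalent of the menu steps above, not the code the task itself generates; the title text is made up for illustration.

```sas
* Sketch: code counterpart of the menu-driven scatter plot;
proc sgplot data=sashelp.iris;
   title 'Scatter plot of SepalLength vs SepalWidth';
   scatter x=SepalLength y=SepalWidth;
run;
title;
```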
Creating scatter plots by group
proc sgplot data=sashelp.iris;
title 'Scatter plot of petalLength vs sepalWidth by species';
scatter x= petalLength y= sepalWidth / group =species ;
run;
title;
Try this with other pairs of measurements in the IRIS data set. What insight do you get?
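One way to look at all pairs at once is a scatter plot matrix from PROC SGSCATTER. The sketch below is an illustration only; the choice of all four measurements and the title are assumptions made here.

```sas
* Sketch: every pair of IRIS measurements in one matrix, colored by species;
proc sgscatter data=sashelp.iris;
   title 'Scatter plot matrix of IRIS measurements by species';
   matrix SepalLength SepalWidth PetalLength PetalWidth / group=species;
run;
title;
```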
What cars are these?
Which vehicles are these?
Label outliers
Creating a variable containing text labels for outliers
data cars;
set sashelp.cars;
if mpg_city gt 45 then POI_label = model ;
else POI_label = "";
run;
POI_label holds the model name for outliers and is blank for all other observations.
Label outliers
proc sgplot data = cars;
title height=2 color= BIB "Highly fuel efficient cars";
scatter x= engineSize y= mpg_city / datalabel = POI_label;
run;
title;
Classification with bar charts
Horizontal bars representing frequencies
* Bar charts: vehicle TYPE frequencies;
proc sgplot data= sashelp.cars;
title 'Horizontal bar chart of vehicle TYPE (frequencies)';
hbar type ;
run;
title;
Classification with bar charts: Bars clustered by origin
* Bar charts: vehicle TYPE bars clustered by origin;
proc sgplot data= sashelp.cars;
title 'vertical bars vehicle TYPE clustered by origin';
title2 'Using groupdisplay = cluster option ';
vbar type / group= origin groupdisplay = cluster ;
run;
title;
title2;
Classification with bar charts: Bars stacked by origin
* Bar charts: Vehicle type bars stacked by ORIGIN;
proc sgplot data= sashelp.cars;
title 'Vehicle TYPE stacked by origin';
vbar type / group= origin ;
run;
title;
Classification with response variable: INVOICE (price)
* Bar charts: Bars representing Mean INVOICE (price) for each vehicle TYPE ;
proc sgplot data= sashelp.cars;
title 'Mean INVOICE (price) for vehicle TYPEs';
title2 'Using response = INVOICE and stat = mean options' ;
hbar type / response=invoice stat=mean;
run;
title;
title2;
Sports cars are expensive
Hybrid vehicles don’t cost that much
Why?
Classification with response variable: HORSEPOWER
* Bar charts: Bars representing Mean HORSEPOWER for each vehicle TYPE ;
proc sgplot data= sashelp.cars;
title 'Mean HORSEPOWER for vehicle TYPEs';
title2 'Using response = HORSEPOWER and stat = mean options' ;
hbar type / response=horsepower stat=mean;
run;
title;
title2;
Porsche 911. Engine 443 HP. Gets you from 0 to 62 mph (100 km /h) in 3.6 seconds. Price $ 113,200
Layered bar charts
proc sgplot data=sashelp.cars;
vbar type /response = mpg_city stat= mean;
vbar type /response = mpg_highway stat= mean barwidth = 0.5;
run;
What is the most remarkable feature of this graph?
Create this graph
Bar and line chart
proc sgplot data= sashelp.stocks;
* Where statement to select IBM stock for 2005;
where stock='IBM' and year(date) =2005;
vbarbasic date/ response =volume y2axis;
series x=date y=close / markers;
title 'IBM stock price and volume for 2005';
run;
title;
When was the most significant
movement in the IBM stock price?
Homework: Why did this happen?
Homework: Create the graph for another stock
Box plots by category
data cars;
set sashelp.cars;
* Homework: create POI_label;
* --- create POI_labels ---;
run;
proc sgplot data=cars;
vbox invoice / category= type datalabel =POI_label;
run;
Sashelp.stocks data
* Series plots of closing price for each stock;
proc sgplot data=sashelp.stocks;
title height = 2 color =BIB 'Stock Market after 9/11';
title2 height = 1.3 color= deepPink 'what happened in the next 12 months?';
series x=date y=close / group= stock;
band y= close lower='11SEP2001'd upper='11SEP2002'd /
fillattrs=(color=lightgreen transparency=0.5);
run;
title;
Twinkle twinkle little star
How I wonder what age you are
proc sgplot data= sashelp.class;
title height =2.0 color=DAG 'Weight is related to height';
title2 height =1.5 color=BIG 'Some children are taller because they are older';
title3 height =1.25 color=royalBlue 'or perhaps because they are male';
bubble x=height y=weight size= Age/ group= sex;
run;
proc sgplot data=sashelp.heart;
title height = 2 color= Gray 'Heatmap for cholesterol and weight';
title2 height = 1.3 color= BIG 'with reference lines for borderline and high cholesterol';
heatmap x=weight y=cholesterol ;
refline 200 240 / axis=y
lineattrs=(thickness=3 color=orange pattern=dash)
label= ("borderline" "high") ;
run;
You can tinker with the colors to bring out more clearly where the density is.
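As one possible way to do that, the HEATMAP statement accepts a COLORMODEL= color ramp and bin-count options. The sketch below is illustrative: the specific colors and the 40 x 40 bin counts are arbitrary assumptions, so adjust them until the dense region stands out.

```sas
* Sketch: color ramp and bin counts are illustrative assumptions;
proc sgplot data=sashelp.heart;
   title 'Heatmap for cholesterol and weight (custom color ramp)';
   heatmap x=weight y=cholesterol /
       colormodel=(white gold red)   /* low to high density */
       nxbins=40 nybins=40;          /* finer bins than the default */
run;
title;
```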