Attribution Non-Commercial (BY-NC)


BINF 5210

Spring 2011


Correlation Analysis

• Correlation analysis measures the linear association (the degree to which they are related) between two quantitative variables measured on the same subjects.

• For example, to investigate physical growth by examining the relationship between the height and weight of a group of children aged 8 to 10, correlation analysis is an appropriate choice.

• Plotting the variables of interest in a scatter plot and examining the relationship visually is a recommended first step.

• Pearson’s product-moment correlation (Pearson’s correlation) is the most commonly used measure of correlation between 2 quantitative variables.

Pearson’s Correlation

• Pearson’s product-moment correlation measured on a population is denoted ρ (the Greek letter rho); it measures the degree to which the 2 quantitative variables of interest are linearly related. When estimated from a sample, it is denoted r (Pearson’s r).

• It measures the extent to which the points in a scatter plot of the variables of interest fall on a straight line (a linear relationship).

• Pearson’s correlation ranges from -1 to +1: +1 is a perfect positive correlation, -1 is a perfect negative correlation, and 0 means no linear correlation (zero correlation).

Calculating Pearson’s Correlation

• Suppose we want to measure the correlation between variable X and variable Y (both quantitative) in a sample dataset. Pearson’s correlation is calculated as:

$$r = \frac{\sum XY - \dfrac{(\sum X)(\sum Y)}{N}}{\sqrt{\left(\sum X^2 - \dfrac{(\sum X)^2}{N}\right)\left(\sum Y^2 - \dfrac{(\sum Y)^2}{N}\right)}}$$

where N is the number of elements (observations or subjects).
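As a rough check of the computational formula, the sums can be accumulated in a DATA step and the result compared with PROC CORR. This is only a sketch: the dataset `toy`, its five made-up observations, and the variable names are assumptions for illustration, not part of the course data.

```sas
/* Hypothetical five-observation dataset, for illustration only */
data toy;
   input x y;
   datalines;
1 2
2 4
3 5
4 4
5 5
;
run;

/* Accumulate the sums that appear in the computational formula */
data pearson_by_hand (keep=n r);
   set toy end=last;
   n      + 1;        /* sum statements retain their running totals */
   sum_x  + x;
   sum_y  + y;
   sum_xy + x*y;
   sum_x2 + x**2;
   sum_y2 + y**2;
   if last then do;
      r = (sum_xy - sum_x*sum_y/n) /
          sqrt((sum_x2 - sum_x**2/n) * (sum_y2 - sum_y**2/n));
      output;
   end;
run;

proc print data=pearson_by_hand;
run;

/* The same coefficient from PROC CORR, for comparison */
proc corr data=toy;
   var x y;
run;
```

For these made-up values the sums are ∑X = 15, ∑Y = 20, ∑XY = 66, ∑X² = 55 and ∑Y² = 86, so r = 6 / √60 ≈ 0.775.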

Correlation in SAS

• SAS provides a procedure called PROC CORR for computing the correlation coefficient between two quantitative variables.

• It tests the hypotheses:

H0 (Null Hypothesis): There is no linear relationship between the two variables of interest (ρ = 0)

Ha (Alternative Hypothesis): There is a linear relationship between the two variables of interest (ρ ≠ 0)

and determines whether the estimated correlation coefficient r differs significantly from 0.

PROC CORR Assumptions

• The data are a random sample drawn from a bivariate normally distributed population.

• If the population is not normal, use a non-parametric correlation measure (the most common is Spearman’s rho).

• PROC CORR also provides Spearman’s rho, but you have to request it with the SPEARMAN option on the PROC CORR statement.

• Spearman’s correlation can be calculated by ranking the values of each variable of interest and then applying Pearson’s correlation formula to the ranks.
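The rank-then-Pearson description of Spearman’s correlation can be tried out along these lines. This is a sketch under assumptions: the dataset `toy`, its values, and the names `rank_x` and `rank_y` are made up for illustration.

```sas
/* Hypothetical data, for illustration only */
data toy;
   input x y;
   datalines;
1 2
2 4
3 5
4 4
5 5
;
run;

/* Step 1: replace each value by its rank (tied values share the mean rank) */
proc rank data=toy out=ranked ties=mean;
   var x y;
   ranks rank_x rank_y;
run;

/* Step 2: Pearson's correlation computed on the ranks */
proc corr data=ranked pearson nosimple;
   var rank_x rank_y;
run;

/* Direct request: Spearman's rho on the original values */
proc corr data=toy spearman nosimple;
   var x y;
run;
```

The coefficient from step 2 and the one from the SPEARMAN request should agree, which is a quick way to convince yourself that the two descriptions are the same calculation.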

PROC CORR Structure

• PROC CORR <<option(s)>>;
     <<statement(s)>>;

Options:

DATA=your_dataset_name

SPEARMAN (requests the non-parametric correlation, for a non-normal population)

NOSIMPLE (do not display simple statistics), NOPROB (do not display the probability value, p-value)

Statements:

VAR variables of interest;

BY variable; /* Optional (for a categorical variable, produces output separately for each category level) */

WITH variable(s); /* Optional (when you want the correlations between the variables in the VAR list and the variables in the WITH list) */
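Put together, the options and statements above might be used as in the following sketch. The dataset `toy_groups`, its values, and the variable names are hypothetical; the PROC SORT step is included because BY-group processing expects the data to be sorted by the BY variable.

```sas
/* Hypothetical data: group is categorical, x and y are quantitative */
data toy_groups;
   input group x y;
   datalines;
1 1 2
1 2 4
1 3 5
1 4 4
2 1 6
2 2 5
2 3 3
2 4 2
;
run;

/* BY-group processing expects the data to be sorted by the BY variable */
proc sort data=toy_groups;
   by group;
run;

/* Pearson and Spearman correlations, without simple statistics, per group */
proc corr data=toy_groups pearson spearman nosimple;
   var x y;
   by group;
run;
```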

PROC CORR Example

• Consider the dataset for this assignment (external text file smoke_drug in my document). All column values are tab delimited and all data are numeric.

First column is Gender (1=male, 2=female)

Second column is Age

Third column is Race of subjects (1=white, 2=black, 3=Hispanic, 4=other)

Fourth column is Smoker? (1=yes, 2=no)

Fifth column is Systolic blood pressure

Sixth column is Diastolic blood pressure

• As an investigator, you want to examine the relationship between age and blood pressure (SYSTOLIC and DIASTOLIC) of randomly selected subjects as part of a clinical trial.

PROC CORR in SAS

• First we read the data into SAS:

DATA MYDATA;
   INFILE "C:\smoke_drug.txt" DLM='09'x;
   INPUT GENDER AGE RACE SMOKER SYSTOLIC DIASTOLIC;
RUN;

• Then we run PROC CORR on the variables of interest:

ODS HTML;
PROC CORR DATA=MYDATA;
   /* List of variables we are interested in; this generates the
      correlations pairwise (3 pairs) */
   VAR AGE SYSTOLIC DIASTOLIC;
   TITLE 'CORRELATION OF AGE SYSTOLIC, AGE DIASTOLIC AND SYSTOLIC AND DIASTOLIC BLOOD PRESSURE';
RUN;
ODS HTML CLOSE;

PROC CORR OUTPUT

[Screenshot of the PROC CORR output, annotated with two notes:]

• You can tell SAS not to display the table of basic statistics by using the NOSIMPLE option in PROC CORR.

• The correlation matrix contains the 3 pairwise Pearson correlations between each of the 3 variables.

PROC CORR Output Interpretation

• Table 3 (the correlation matrix) is the one of interest in this example.

• The correlation between AGE and SYSTOLIC pressure is 0.511150 (a positive relationship, but not a perfect one) and the p-value is <.0001, so we reject the null hypothesis that there is no linear relationship between AGE and SYSTOLIC.

• Read the correlations for the other pairs of variables the same way (i.e., AGE and DIASTOLIC, and so on).

PROC CORR

• If your population is not normal, use Spearman’s correlation by specifying the SPEARMAN option on the PROC CORR statement, either together with PEARSON or by itself.

• Using the WITH statement: sometimes we want to examine the correlations of one or more variables with a set of other variables. The WITH statement is handy in such cases.

• Suppose in our example you want the correlations of AGE with multiple measures of systolic blood pressure (say 4 measures: Sys1, Sys2, Sys3, Sys4). In this case you include a WITH statement, for example:

PROC CORR DATA=data_set_name;
   VAR Sys1-Sys4;
   WITH AGE;
RUN;

This will produce the correlations between AGE and each of Sys1, Sys2, Sys3, and Sys4.

PROC CORR -Plot

• You should always produce a scatter plot of the variables of interest to check the correlation between them visually.

• One option is to enable ODS GRAPHICS when running PROC CORR. This generates the graphs and plots associated with the PROC CORR output.
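A minimal sketch of requesting plots from PROC CORR through ODS Graphics is shown here. The dataset `bp_demo` and its six rows are made-up values for illustration, not the smoke_drug data; the PLOTS= option shown is one possible way to ask for scatter plots.

```sas
/* Made-up values for illustration only (not the smoke_drug data) */
data bp_demo;
   input age systolic diastolic;
   datalines;
25 118 76
32 121 80
41 127 79
48 133 84
56 138 85
63 145 88
;
run;

ods graphics on;

/* PLOTS=SCATTER draws one scatter plot per pair of VAR variables;
   PLOTS=MATRIX would draw a scatter plot matrix instead */
proc corr data=bp_demo plots=scatter;
   var age systolic diastolic;
run;

ods graphics off;
```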

Linear Regression

• Correlation measures the linear relationship between two variables; regression analysis uses this relationship to predict the dependent variable from the independent variable.

• Simple linear regression is appropriate for predicting the value of a dependent variable from a given value of a single independent variable.

• For example, as part of an investigation of the effects of physical exercise (amount of time spent exercising daily) on BMI, a simple linear regression can be used to predict BMI from the amount of time spent exercising daily.

Simple Linear Regression Model Basics

• The following equation of a (theoretical) line describes the relationship between an independent variable X and a dependent variable Y:

Y = α + βx + ε

(α is the Y intercept, β is the slope of the line, and ε is the error, whose mean is 0 and whose variance is constant. If the slope β = 0, there is no predictive relationship between the variables.)

When we perform a regression analysis on data, we calculate a regression line that describes the relationship between the variables of interest. This regression line is an estimate of the theoretical line above.

Simple Linear Regression Model Basics

Y’ = a + bx

where a and b are the least squares estimates of the parameters α and β respectively, x is the given value of the independent variable, and Y’ is the value of the dependent variable we are trying to predict.

Note: They are called least squares estimates because the regression line minimizes the sum of the squared prediction errors (the squared differences between the actual and predicted values of the outcome variable). See the Lane textbook, chapter 15, for details.
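For reference, the standard closed-form least squares estimates can be written in the same summation notation used for Pearson’s r. The slides themselves do not state these formulas, so treat this as a supplementary sketch; the five points in the worked line are made up for illustration.

```latex
% Quantity minimized by the regression line
\min_{a,\,b} \; \sum \left( Y - Y' \right)^2 , \qquad Y' = a + b x

% Least squares estimates of the slope and intercept
b = \frac{\sum XY - \dfrac{(\sum X)(\sum Y)}{N}}{\sum X^2 - \dfrac{(\sum X)^2}{N}} ,
\qquad
a = \bar{Y} - b\,\bar{X}

% Worked example with the hypothetical points (1,2), (2,4), (3,5), (4,4), (5,5):
% \sum X = 15,\ \sum Y = 20,\ \sum XY = 66,\ \sum X^2 = 55,\ N = 5
b = \frac{66 - 60}{55 - 45} = 0.6 , \qquad a = 4 - 0.6 \times 3 = 2.2
```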

Simple Linear Regression in SAS

• SAS provides a procedure called PROC REG for regression analysis of data.

• When we specify the regression model in SAS by naming the dependent and independent variables, SAS fits a regression line (the same equation as on the previous slide) to the given dataset and uses it to predict the value of the dependent variable.

• The first step is to check whether any relationship exists between the variables specified in SAS.

This is done by testing the null hypothesis that there is no predictive linear relationship between the variables (that is, the slope of the equation, β, is 0).

Ha (Alternative Hypothesis): There is a predictive linear relationship between the two variables of interest (the slope of the equation is not 0).

Simple Linear Regression in SAS

Therefore, H0: β = 0 and

Ha: β ≠ 0

If the p-value is small (usually < .05), we reject the null hypothesis and conclude that β ≠ 0 (there is a predictive relationship between the variables).

In other words, knowing the value of the independent variable helps to predict the value of the dependent variable.

• The next step is to use this linear relationship to predict the value of the dependent variable.

Simple Linear Regression Using PROC REG

PROC REG <<option(s)>>;

(options include DATA= and SIMPLE, which adds basic statistics to the output)

• The MODEL statement has the structure:

MODEL dependent_var = independent_var / options;

• Some of the MODEL statement options are (check the SAS manual and know their functions):

P (requests a table of predicted values), R (residual analysis), CLM (confidence limits for the expected value), CLI (prediction limits for individual values of the dependent variable), INCLUDE, SELECTION, SLSTAY, SLENTRY
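The structure above could be exercised with a call like the following sketch. The dataset `bp_demo` and its values are made up for illustration; SIMPLE, P, R, CLM and CLI are the options named on the slide.

```sas
/* Made-up values for illustration only (not the smoke_drug data) */
data bp_demo;
   input age systolic diastolic;
   datalines;
25 118 76
32 121 80
41 127 79
48 133 84
56 138 85
63 145 88
;
run;

/* SIMPLE adds basic statistics; the MODEL options request predicted
   values (P), residual analysis (R), confidence limits for the expected
   value (CLM) and prediction limits for individual values (CLI) */
proc reg data=bp_demo simple;
   model diastolic = systolic / p r clm cli;
run;
quit;
```

PROC REG is an interactive procedure, so the QUIT statement is used here to end it explicitly.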

Simple linear regression using PROC REG

Example

• Consider the data we used for the correlation analysis example. Here we want to see whether systolic blood pressure can be used to predict diastolic blood pressure. After reading the dataset into SAS, we run the following PROC REG:

ODS HTML;
TITLE 'SIMPLE LINEAR REGRESSION EXAMPLE';
PROC REG DATA=MYDATA;
   /* Specify the outcome (dependent) variable and the predictor (independent)
      variable for the regression model: what you want to predict, from what */
   MODEL DIASTOLIC = SYSTOLIC;
RUN;
ODS HTML CLOSE;

PROC REG Output

[Screenshot of the PROC REG output for MODEL DIASTOLIC = SYSTOLIC, annotated as follows:]

• We cannot predict DIASTOLIC from SYSTOLIC because there is no significant relationship between them, so we do not go any further. Only if we could reject the null hypothesis would we have continued.

• R-square measures how strong the relationship between the variables is. The closer it is to 1, the stronger the relationship. In this example the value is very small, 0.0141 (about 0.01, so there is barely a relationship). This is how to interpret the value: only 1% of the variability in DIASTOLIC can be explained by SYSTOLIC.

• Table 3 tells you about the strength of the relationship.

• Table 4 is associated with the regression model Y’ = a + bx. It gives the least squares estimate of a (Intercept row) and the least squares estimate of b (SYSTOLIC row). The statistical test in the SYSTOLIC row is for β = 0. Here we cannot reject the null hypothesis, so there is no relationship (the slope is not significantly different from 0).

PROC REG Output

• When we read PROC REG output, 3 things are usually of interest for understanding the results (also marked in the output on the previous slide):

1- R-square (tells you the strength of the relationship)

2- Slope (in the regression table, check the p-value in the row for the independent variable to see whether it is significant. This is the test of whether the slope = 0 or not)

3- Parameter estimates: Intercept and independent variable (the estimates of a and b for the prediction equation)

• From this example, we conclude that there is no significant predictive linear relationship between Diastolic and Systolic blood pressure in our dataset, since the t-test of slope = 0 is not significant. (The slope relates the dependent and the independent variables; the intercept is not involved. So we check the p-value in the regression table row for the independent variable.)

• Therefore, we cannot predict Diastolic from Systolic.

• So we stop the analysis here; we do not need to examine the parameter estimates or the regression equation for prediction.
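The R-square value read from the output can be tied back to the earlier correlation material. The relations below are standard results for regression with a single predictor; they are added here as a supplementary sketch and are not stated on the slides.

```latex
% R-square as the proportion of variability explained by the model
R^2 = 1 - \frac{\sum (Y - Y')^2}{\sum (Y - \bar{Y})^2}

% With one independent variable, R-square is the square of Pearson's r
R^2 = r^2

% For the value reported in this example
R^2 = 0.0141 \;\Rightarrow\; |r| = \sqrt{0.0141} \approx 0.119
```

A correlation of about 0.12 in absolute value is consistent with the conclusion that SYSTOLIC explains only about 1% of the variability in DIASTOLIC in this dataset.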

How to Interpret PROC REG Output

(When slope is not zero)

• Now consider the following output. Think of it as the same table but for different data values, where this time the slope test produced a smaller p-value (say .04).

• This is made-up output in which only the p-value of SYSTOLIC has been changed to make it significant, so that the null hypothesis is rejected. Its purpose is to show how to read and interpret regression output when the slope is not 0, and how to predict the dependent variable from a value of the independent variable using the regression line equation.

(Again, this is for explanation only; it is not real output from a dataset.)

PROC REG Output (Slope≠0)

[Screenshot of the made-up PROC REG output, annotated as follows:]

• Check R-square to see the strength of the relationship. Then check the slope test p-value (Pr > |t|) in the last column of the regression table (4), in the row for the independent variable, in this case SYSTOLIC. Then report the parameter estimates (a and b) from the third column of the regression table (4). The test for the Intercept is not of interest, but its value is.

• R-square measures how strong the relationship between the variables is. The closer it is to 1, the stronger the relationship. In this example the value is 0.0141. This is how to interpret the value: only 1% of the variability in DIASTOLIC can be explained by SYSTOLIC.

• Table 3 tells you about the strength of the relationship.

• Table 4 is associated with the regression model Y’ = a + bx (this is the predictive equation). It gives the least squares estimates of a and b. The statistical test in the SYSTOLIC row is for β = 0. Here we can reject the null hypothesis, so there is a relationship (the slope is significantly different from 0).

PROC REG Output (Slope≠0)

• In this case, the parameter estimates are 1.47628 for the Intercept and 0.00110 for SYSTOLIC (remember, these are the least squares estimates of a and b in the regression line equation).

• So the equation of the regression line is:

Y’ = a + b × (value of x) = 1.47628 + 0.00110 × (value of SYSTOLIC)

where Y’ is the outcome variable and x is the predictor.

In this situation, we would use this regression equation to predict the dependent variable from values of the independent variable.
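Applying the regression equation is plain arithmetic, which a short DATA step can carry out. In this sketch the estimates are the ones quoted from the made-up output, while the SYSTOLIC values 100, 120 and 140 and the names `predicted` and `diastolic_hat` are assumptions chosen only for illustration.

```sas
data predicted;
   a = 1.47628;   /* intercept estimate quoted from the output */
   b = 0.00110;   /* slope estimate quoted from the output */
   /* Hypothetical SYSTOLIC values, chosen only for illustration */
   do systolic = 100 to 140 by 20;
      diastolic_hat = a + b * systolic;   /* Y' = a + b x */
      output;
   end;
run;

proc print data=predicted noobs;
run;
```

For example, a SYSTOLIC value of 120 gives Y’ = 1.47628 + 0.00110 × 120 = 1.60828. Because the output is made up, the numbers only demonstrate the mechanics of prediction.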

Simple Linear Regression Plot

• It is always recommended to plot the variables of interest to inspect the linear relationship visually. The regression line shows the predicted change in the dependent variable for each unit change in the independent variable.

• You can plot the variables by adding a PLOT statement after the MODEL statement in PROC REG (PLOT DEPENDENTVARIABLE * INDEPENDENTVARIABLE;) or by running PROC GPLOT as a separate step after PROC REG.
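As one more way to get a scatter plot with a fitted line, the sketch below uses PROC SGPLOT rather than the PLOT statement or PROC GPLOT named on the slide. That substitution is an assumption about what is available in your SAS installation, and the dataset `bp_demo` contains made-up values for illustration only.

```sas
/* Made-up values for illustration only (not the smoke_drug data) */
data bp_demo;
   input age systolic diastolic;
   datalines;
25 118 76
32 121 80
41 127 79
48 133 84
56 138 85
63 145 88
;
run;

/* The REG statement draws the data points together with the fitted
   least squares line (dependent variable on Y, independent on X) */
proc sgplot data=bp_demo;
   reg x=systolic y=diastolic;
run;
```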

Assignment

• Read The Little SAS Book, chapter 8 (section 8.4 for correlation; sections 8.5 and 8.6 for regression).

• Learn how to create a plot for regression, how to read the plot elements, and how to read regression output.

• An assignment on these topics is coming soon.
