# Correlation and Regression Analysis

Nayyar Raza Kazmi, M.B., B.S., D.H.P.M., M.P.H., M.Sc.

Objectives of the Lecture
• To understand the concept of Correlation and Regression Analysis.
• To understand the areas in which Correlation and Regression models can be applied.
• To understand how to interpret Correlation and Regression parameters.

• Most studies done by postgraduate trainees are cross-sectional in nature.
• Analysis of such studies is mostly confined to descriptive univariate statistics.
• The quality of such studies can be enhanced by further data mining with Correlation and Regression Analysis.

Correlation
– Measures the strength of association between two variables.
– Tells us how much the two variables are associated with one another.
– However, it does not imply CAUSATION.
– Simply tells us whether the two variables are positively or negatively correlated.

Regression
• If there is a strong correlation between two variables, Regression is used to predict the value of the dependent variable (Y) from the value of the independent variable (X).
• Types
– Simple Linear Regression
– Multiple Linear Regression
– Logistic Regression

Correlation Analysis
• Correlation Analysis is a group of statistical techniques to measure the association between two variables.
• A Scatter Diagram is a chart that portrays the relationship between two variables.
• The Dependent Variable is the variable being predicted or estimated.
• The Independent Variable provides the basis for estimation. It is the predictor variable.

[Figure: scatter diagram of Sales (\$thousands) against Advertising Minutes]

The Coefficient of Correlation, r
• The Coefficient of Correlation (r) is a measure of the strength of the relationship between two variables.
• Also called Pearson’s r and Pearson’s product moment correlation coefficient.
• It requires interval or ratio-scaled data.
• It can range from -1.00 to 1.00.
• Values of -1.00 or 1.00 indicate perfect and strong correlation.
• Values close to 0.0 indicate weak correlation.
• Negative values indicate an inverse relationship and positive values indicate a direct relationship.

[Figure: scatter plot of Y against X, both axes 0 to 10: Perfect Negative Correlation]

[Figure: scatter plot of Y against X, both axes 0 to 10: Perfect Positive Correlation]

[Figure: scatter plot of Y against X, both axes 0 to 10: Zero Correlation]
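
The three cases named above (perfect negative, perfect positive and zero correlation) can be reproduced with a few lines of Python. This is a minimal sketch of Pearson's r using the standard deviation-from-the-mean formula; the function name `pearson_r` and the small data sets are illustrative assumptions, not data from the lecture.

```python
import math

def pearson_r(xs, ys):
    """Pearson's product moment correlation coefficient for paired data."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of cross-products and sums of squares of deviations from the means
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("r is undefined when a variable has no variation")
    return sxy / math.sqrt(sxx * syy)

# Illustrative (made-up) data sets
x = [1, 2, 3, 4, 5]
print(pearson_r(x, [10, 8, 6, 4, 2]))   # inverse relationship, r close to -1
print(pearson_r(x, [2, 4, 6, 8, 10]))   # direct relationship, r close to +1
print(pearson_r(x, [2, 1, 3, 1, 2]))    # no linear relationship, r close to 0
```

The sign of the result gives the direction of the relationship and its absolute value gives the strength.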

Phi Co-efficient
• Used to measure the association between two categorical (dichotomous) variables.

Ф =
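
For a 2×2 table, the phi coefficient is commonly computed as (ad - bc) divided by the square root of the product of the four row and column totals. The sketch below assumes that standard definition; the cell labels a, b, c, d, the function name `phi_coefficient` and the counts are illustrative and do not come from the lecture.

```python
import math

def phi_coefficient(a, b, c, d):
    """Phi for a 2x2 table laid out as:
                 outcome present   outcome absent
    exposed            a                 b
    not exposed        c                 d
    """
    # Product of the two row totals and the two column totals
    denom = math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    if denom == 0:
        raise ValueError("phi is undefined when a row or column total is zero")
    return (a * d - b * c) / denom

# Illustrative counts only (not from the lecture)
print(round(phi_coefficient(20, 10, 5, 25), 3))
```

Like Pearson's r, the result lies between -1 and 1, with values near 0 indicating little association.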

Regression Equation and Regression Line

Yc = a + bX

where
• Yc = computed value of the dependent variable
• a = Y-intercept, the value of Y where X equals zero
• b = slope of the regression line, which is the increase or decrease in Y for each change of one unit of X
• X = a given value of the independent variable

Simple Linear Regression
• Determines the value of a Dependent Variable based on a single Independent Variable.
• Simplest form of Regression Analysis.
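
As a rough illustration of how the intercept a and slope b of a regression line Yc = a + bX can be estimated, the sketch below uses ordinary least squares, the usual fitting method (the lecture does not name one). The function name `fit_line` and the data are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares estimates of a (intercept) and b (slope) in Yc = a + bX."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("X must vary to estimate a slope")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx                # change in Y per one-unit change in X
    a = mean_y - b * mean_x      # value of Y where X equals zero
    return a, b

# Hypothetical data: independent variable X and dependent variable Y
x = [1, 2, 3, 4, 5]
y = [5.1, 7.9, 11.2, 13.8, 17.0]
a, b = fit_line(x, y)
print(f"Yc = {a:.2f} + {b:.2f}X")
print("Predicted Y at X = 6:", round(a + b * 6, 2))
```

Once a and b are known, any given value of X can be plugged into the equation to compute Yc.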

Multiple Linear Regression
• Used when the Dependent Variable is a continuous variable and independent variables are continuous or categorical.

Y = a + b1x1 + b2x2 + … + bkxk

Putting MLR in Practice
• A descriptive study on normal healthy adults aged 14-25 years gathers data about their weight, systolic blood pressure and serum cholesterol levels.

• Is serum cholesterol level associated with weight and systolic blood pressure?
• Can we predict serum cholesterol levels if we know a person’s weight and systolic blood pressure?

Y = a + b1x1 + b2x2 + … + bkxk
Y = 18.52 + 3.20(BP) + [-4.06(Weight)]
So what could be the Serum Cholesterol level for a person who weighs 75 kg and has a systolic Blood Pressure of 145 mm Hg?

Y = 18.52 + 3.20(145) + [-4.06(75)]
Y = 18.52 + 464 + [-304.5]
Y = 18.52 + 464 - 304.5
Y = 178.02
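
The same substitution can be checked in code. The sketch below plugs the coefficients quoted in this example (intercept 18.52, 3.20 for systolic BP, -4.06 for weight) into the general form Y = a + b1x1 + … + bkxk; the helper name `predict` is an illustrative assumption.

```python
def predict(intercept, coefficients, values):
    """Evaluate Y = a + b1*x1 + ... + bk*xk for one subject."""
    if len(coefficients) != len(values):
        raise ValueError("one value is needed for each coefficient")
    return intercept + sum(b * x for b, x in zip(coefficients, values))

# Coefficients as quoted in the example: systolic BP, then weight
intercept = 18.52
coefficients = [3.20, -4.06]

# Person weighing 75 kg with a systolic blood pressure of 145 mm Hg
cholesterol = predict(intercept, coefficients, [145, 75])
print(round(cholesterol, 2))  # 178.02
```

Each coefficient is the expected change in Y for a one-unit change in that variable while the other variable is held constant.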

Logistic Regression
• Logistic Regression is used when the outcome variable is categorical.
• The independent variables can be either categorical or continuous.
• Logistic Regression estimates the Odds Ratio of each independent variable for the dichotomous dependent variable.

• The dichotomous dependent variable could be the presence or absence of a complication, disease, etc.
• Data for dichotomous variables must be binary coded, for example 1 for presence of the complication or disease and 0 for its absence.
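
To make the binary coding and the Odds Ratio concrete, here is a minimal sketch. It relies on two standard facts about logistic regression: the model predicts the log-odds of the outcome as a + b1x1 + … + bkxk, and the Odds Ratio for a variable is the exponential of its coefficient. The coefficient values, variable names and helper functions are hypothetical and are not results from any study.

```python
import math

def binary_code(labels, present="present"):
    """Code a dichotomous variable as 1 (present) or 0 (absent)."""
    return [1 if label == present else 0 for label in labels]

def odds_ratio(coefficient):
    """Odds Ratio for a one-unit increase in the independent variable."""
    return math.exp(coefficient)

def predicted_probability(intercept, coefficients, values):
    """Probability that the outcome is coded 1, from the fitted log-odds."""
    log_odds = intercept + sum(b * x for b, x in zip(coefficients, values))
    return 1.0 / (1.0 + math.exp(-log_odds))

# Outcome coded as required for logistic regression
print(binary_code(["present", "absent", "present"]))

# Hypothetical fitted coefficients for two binary-coded risk factors
intercept = -1.5
coefficients = [0.9, 0.4]
for name, b in zip(["risk factor 1", "risk factor 2"], coefficients):
    print(name, "Odds Ratio:", round(odds_ratio(b), 2))

# Predicted probability for a patient with both risk factors present
print(round(predicted_probability(intercept, coefficients, [1, 1]), 3))
```

An Odds Ratio above 1 suggests the factor raises the odds of the outcome, and one below 1 suggests it lowers them.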

Putting Logistic Regression in Practice
• Risk Factors for Complications of Diabetes Mellitus in patients admitted to a Tertiary Care Hospital

What can I derive from this data?

| Risk Factors for Retinopathy (n=32) | No. of patients | %age |
| --- | --- | --- |
| BMI > 30 | 13 | 40.63 |
| Smoking | 28 | 87.5 |
| Level of prior awareness | 14 | 43.75 |
| HbA1C > 7 | 10 | 31.25 |
| Duration of Diabetes > 10 Years | 20 | 62.5 |

Where Correlation and Regression Models can be applied
• Cross-sectional studies.
• K.A.P. (Knowledge, Attitude and Practice) studies.
• Studies aiming to determine relationships between certain factors of interest and their outcomes.

Software to use
• MS Excel with the Data Analysis add-in installed
• SPSS
• Epi Info 2002
• MedCalc (recommended because of its ease of use and its power to perform all types of statistical calculations)

• Thank you for your patience. (There is a strong negative correlation between the length of a Biostats lecture and your moods, evident from the 11 o’clock sign on your foreheads.)
• Questions, queries and suggestions are welcome.