You are on page 1of 5

48 Journal of The Association of Physicians of India ■ Vol.

65 ■ April 2017

STATISTICS FOR RESEARCHERS

Principles of Regression Analysis


NJ Gogtay, SP Deshpande, UM Thatte

Introduction relationship between one or more the “dependent” variable. In the


independent variables and the second example, since CAD is

W hile correlation analysis helps


in identifying associations
o r r e l a t i o n s h i p s b e t we e n t w o
dependent variable. A correlation
analysis with a scatter plot and
a regression line 1 is however a
associated with both diabetes and
hypertension, CAD would become
the dependent variable and diabetes
variables, 1 the regression technique prerequisite to regression and and hypertension would be the two
or regression analysis is used to both analyses are often carried out independent variables. If you will
“model” this relationship so as together. notice, diabetes in one example is
to be able to predict what will a dependent variable and in the
happen in a real-world setting. Let Dependent and other, an independent variable.
us understand this with something Independent Variables Thus, what is a dependent and
that happens in daily life. As what is an independent variable
children, we are often told by our An important first step before needs to be defined à priori by the
parents that education is a vital carrying out a regression analysis researcher before carrying out the
part of our lives and a “good” is to understand the concept regression analysis. The choice
education will help us get “good of independent and dependent of the variables would in turn be
jobs” and thus “good wages”! If variables. An independent variable, defined by the research question
we were to explore the relationship as the name suggests is a “stand and the hypothesis being explored.
between education and wages, a l o n e ” va r i a b l e , a n d o n e t h a t Independent variables are often
we could think up two questions remains unaffected by other c a l l e d p r e d i c t o r va r i a b l e s o r
– a) Does a relationship exist variables that are measured in a exogenous variables and dependent
between education and wages? And study. The “dependent” variable is variables are called prognostic or
a related but more interesting and the one that is usually of interest to endogenous variables.
important question b) for every extra the researcher and alters in response
year spent in education, do wages to change/s in the independent variable. Types of Regression
commensurately increase [and if Let us understand this concept
Essentially in medical research,
yes, by how much?]? The latter with the following two research
there are three common types of
question is moving into the realm of questions – 1) “As age increases,
regression analyses that are used
prediction. Regression attempts to does the risk of developing diabetes
viz., linear, logistic regression and
answer these and similar questions as measured by HbA1C or blood
Cox regression. These are chosen
regarding relationships between sugar levels increase? Or b) “Given
depending on the type of variables
variables. that a patient has both diabetes and
that we are dealing with (Table 1).
hypertension, what is his risk of
Correlation Versus developing coronary artery disease
Cox regression is a special type of
regression analysis that is applied
Regression- [CAD]”? In the former example,
to survival or “time to event “data
Understanding the we are exploring the relationship
and will be discussed in detail in
between age and diabetes and in
Distinction the latter, the relationship between
the next article in the series.
two variables with a third variable– Linear regression can be simple
The difference between linear or multiple linear regression
correlation analysis and regression i.e., diabetes and hypertension
with CAD. In the first example, while Logistic regression could be
lies in the fact that the former focuses Polynomial in certain cases (Table
on the strength and direction of age would be the independent
variable while diabetes, which is 1).
the relationship between two or
more variables without making any dependent on age would become The type of regression analysis
assumptions about one variable being
independent and the other dependent
Department of Clinical Pharmacology, Seth GS Medical College and KEM Hospital, Mumbai, Maharashtra
[see below], but regression analysis
Received: 07.01.2017; Accepted: 15.01.2017
assumes a dependence or causal
Journal of The Association of Physicians of India ■ Vol. 65 ■ April 2017 49

Table 1: Types of regression duration of surgery, extent of


Type of regression Dependent variable Independent variable Relationship between blood loss and Hb levels at
and its nature and its nature variables day 7, it is intuitive that all
Simple linear One, continuous, One, continuous, Linear are correlated amongst each
normally distributed normally distributed other and therefore will lead to
Multiple linear One, continuous Two or more, may Linear erroneous results. What should
be continuous or
be taken as the independent
categorical
variable is only the duration of
Logistic One, binary Two or more, may Need not be linear
be continuous or surgery.
categorical • The sample size sh oul d be
Polynomial (logistic) Non-binary Two or more, may Need not be linear adequate
[multinomial] be continuous or
categorical If the dependent variable is
Cox or proportional Time to an event Two or more, may Is rarely linear non-binary and has more than
hazards regression be continuous or two possibilities, we use the
categorical
multinomial or polynomial
to be used in a given situation is be quantitative or qualitative. logistic regression.
primarily driven by the following Both the independent variables
three metrics here could be expressed either Step Wise Regression
• N u m b e r a n d n a t u r e o f as continuous data (blood Stepwise regression is an
independent variable/s pressure or HbA1C values) automated tool that can be used
or qualitative data (presence in the exploratory stages of model
• N u m b e r a n d n a t u r e o f t h e
or absence of diabetes as building to identify a useful subset
dependent variable/s
defined by the ADA 2016 or of predictor variables. The process
• Shape of the regression line hypertension as defined by JNC systematically adds the most
A. L i n e a r r e g r e s s i o n : L i n e a r VIII). 3,4 A linear relationship significant variable or removes
regression is the most basic should exist between the the least significant variable during
and commonly used regression dependent and independent each step. The idea behind using
technique and is of two types variables. such automated tools is to maximize
viz. simple and multiple B. Logistic regression: This type the power of prediction with a
regression. You can use Simple of regression analysis is used minimum number of independent
linear regression when there when the dependent variable is variables.
is a single dependent and a binary in nature. For example, if
single independent variable the outcome of interest is death Steps in Conducting a
(e.g. for the research question in a cancer study, any patient Regression Analysis
d e s c r i b e d a b o v e “A s a g e in the study can have only one
increases, does the risk of of two possible outcomes- dead Regression analysis is done in
developing diabetes increase or alive. The impact of one or 3 steps:
as measured by HbA1C or more predictor variables on this 1. A n a l y z i n g t h e c o r r e l a t i o n
blood sugar levels?”. Both the binary variable is assessed. The [strength and directionality of
variables must be continuous predictor variables can be either the data]
(quantitative data2) and the line quantitative or qualitative.
describing the relationship is a 2. Fitting the regression or least
Unlike linear regression, this
straight line (linear ). squares line, and
type of regression does not
Multiple linear regression on require a linear relationship 3. Evaluating the validity and
the other hand can be used b e t we e n t h e p r e d i c t o r a n d usefulness of the model.  
when we have one continuous dependent variables. Step 1: This has been described in
dependent variable and two or For logistic regression to be the article on correlation analysis 1
more independent variables, meaningful, the following Step 2: Fitting the regression line
for example when we want to criteria must be met/satisfied Conventionally, in mathematics,
answer the second question
• T h e i n d e p e n d e n t va r i a b l e s the equation of a line is given by
mentioned above, “Given that
must not be correlated the formula “y = mx+c”, where,
a patient has both diabetes and
amongst each other e.g. if m is the “slope” or gradient of
hypertension, what is his risk
t h e d e p e n d e n t va r i a b l e i s the line and “c” is where the line
of developing coronary artery
presence (or absence) of post- “cuts” the y-axis, also called as
disease (CAD)”. Importantly,
operative wound infection and the “intercept”, (Figure 1). In
the independent variables could
the independent variables are regression, this equation is given by
50 Journal of The Association of Physicians of India ■ Vol. 65 ■ April 2017

Regression Equa�on Table 2: Age and Blood pressure in


women (n=30)
Sr. No. Age (years) Systolic blood
HbA1C (%) HbA1C (Y) = B0 + B1*age (X). pressure (mm Hg)
B1 1 22 131
Y-axis
2 23 128
B0 3 24 116
4 27 106
5 28 114
6 29 123
X - axis Age (years) 7 30 117
8 32 122
9 35 99
Fig. 1: Regression analysis exploring the relationship between Age and HbA1C
10 35 121

A scatter plot and regression line of age 11


12
40
41
147
139
versus systolic blood pressure y = 0.889x + 94.61 13 41 171
R2 = 0.292 14 46 137
200 15 47 111
16 48 115
150 17 49 133
18 49 128
SBP

100 19 50 183
20 51 130
21 51 133
50
22 51 144
23 52 128
0
24 54 105
0 10 20 30 40 50 60 70 80 25 56 145
26 57 141
Age
27 58 153
Fig. 2: Scatter plot and regression Line 28 59 157
29 63 155
y=B 0 +B 1x which are the notations of the model 30 67 176
commonly used by statisticians. Va l i d a t i o n t e c h n i q u e s c a n is affected by the choice of the
Here, B 0 represents the intercept, be broadly divided into two – independent variables and the
while B1 represents the slope. In our numerical and graphical. An presence of outliers 1 and hence
example of relationship between easy numerical technique is to researchers need to be careful
age and HbA1C levels, the equation look at the value of R 2 . You will while using R 2 alone for validation.
would be given by HbA1C levels recollect that the coefficient of Graphical analysis of residuals is a
(y) = B 0 + B 1*age (x) (Figure 2). correlation (r ) 1 is squared to obtain graphical technique for validation
The intercept B0 is that value of Y the Coefficient of Determination or that uses graphs to visually inspect
or the dependent variable (HbA1C, R 2. This is the value that is used for the data to assess its robustness.
in this case), when the value of the predicting the extent of variability
predictor variable is zero (age). in the dependent variable that can Utility of Regression
In reality age can never be zero, be explained by the independent Analysis
so this is the value of HbA1C that variable. In our example (Figure
is present regardless of age. The 1), the R 2 is 0.29 or 29% indicating The example given below
equation y = B 0 + B 1 x classically that only 29% of the variability in highlights the utility of regression
describes simple linear regression. SBP can be explained by age. If the analysis. Table 2 has values of
For Multiple linear regression, the value of R 2 is high, then we could systolic blood pressure [SBP] for 30
equation would be y = B 0 + B 1X 1 + assume that the model has a high women with age being presented
B 2X 2 + B 3X 3+……………………+B kX k predictive value. The value of 29% in an ascending order. Age here
depending upon the number of indicates that 71% of the variability would be the independent variables
predictor variables [X 1 , X 2 and still remains unaccounted for and and SBP, the dependent variable.
so on] and k is the kth predictor thus model may not have a great We first make a scatter plot
variable. predictive value. It is also useful and eye ball the data and then
Step 3: Evaluating the validity to remember that the value of R 2
Journal of The Association of Physicians of India ■ Vol. 65 ■ April 2017 51

subsequently generate the predictor [X] variables [when there variables rather than temperature
regression line and the regression are multiple predictor variables] were better at forecasting. They
equation (Figure 1). and all regression coefficients [B 1, concluded that this data could
The regression equation here is B 2 …..B k ] are tested together. (b) be used for better allocation of
y = 0.889x + 94.61 Testing a few X variables- here we personnel for the management of
can choose a few select X variables ED services.
OR
[these are the ones the researcher Prediction
SBP = 94.61 + 0.889 x age may think are maximally relevant]
The given data set if you will alone to be tested for their impact Brazer et al7 used logistic
notice does not contain the age on the Y variable and finally (c) regression to predict risk factors for
60 or the corresponding SBP. Say A test for individual significance colorectal cancer in a community
we as asking the question, what where a single X variable is tested practice where they studied 461
would be the SBP of a 60-year old for its impact on the Y variable. consecutive patients undergoing
woman? We would “fit” 60 into the colonoscopy. Of these, 129
regression equation and calculate Applications of Regression had adenomatous polyps (pre-
cancerous) and 34 had colorectal
it as Analysis
cancer. They randomly chose 292
SBP = 94.61 + 0.889 x 60 or 148 patients and evaluated the impact
There are three major uses of
mm Hg. of several independent variables in
regression analysis – attributing
It is important to note here causality [cause and effect a model that looked at prediction
that multiple lines can be drawn relationship], forecasting and of occurrence of colorectal cancer.
through the data set and hence prediction. These are explained F i ve va r i a b l e s we r e i d e n t i f i e d
we need to define the criterion for with three examples. t o b e p r e d i c t i ve - t h e p a t i e n t ’ s
drawing the line. The method that age, sex, hematocrit, fecal occult
Causality
is most commonly used is called the blood test result and indication
“least squares methods”. Pa l m e r K T a n d c o l l e a g u e s
for colonoscopy. When this model
assessed working aged people
When we apply this equation was applied to the remainder of
from the general population in
to the population for making a the 169 patients, it was found to
the United Kingdom to estimate
prediction, we would really not be be a reliable indicator of risk of
the risks of occupational exposure
able to predict either the systolic colorectal neoplasia.
to noise on self reported hearing
blood pressure perfectly. Hence,
we need to taken into account an
difficulties and tinnitus using a Understanding what
validated questionnaire. The study
“error” or “deviation” that is likely
showed that, in both sexes, after
Confounders are and the
to occur when this equation is used. Concept of “Adjustment”
adjustment [see below] for age, the
Thus, the equation would read as
risk of severe hearing difficulty and
below In study by Palmer and
persistent tinnitus rose with years
noise colleagues presented earlier,6
spent in a noisy job indicative of a
the relationship between being
y = B0 + B1x + e cause -effect relationship. 5
in a noisy job and the extent of
Forecasting
hearing difficulty and tinnitus
linear component Efficient management of patient was assessed. Intuitively, we do
Simply put, this could be flow in emergency departments know that noise is not the only
interpreted as (EDs) is a very important issue for reason why hearing difficulty can
Response = signal + noise hospital administrators. Marcilio occur. This may be related to stress,
I and colleagues 6 studied diverse gender, tiredness, older age and
Or in medical research this models in an attempt to forecast many more factors. A confounder
would be the daily number of patients is defined as that variable that is
Outcome = prediction + deviation seeking emergency department “hidden” or “lurking” and was
[ED] services in a general hospital not accounted for or thought of
Testing for Significance in Sao Paolo, Brazil, using calendar initially and impacts the outcome
Once, a regression equation is variables and ambient temperature being studied. Confounders [to
generated, the next logical step reading as the independent the extent possible] need to be
is to “test for significance” and variables. They found that the identified before the start of the
generate a p value. There are three mean number of ED visits was study and addressed during
different ways in which this can be 389 [166-613] with a seasonal analysis by a process called as
done (a) Carrying out an overall distribution with the highest “adjustment”. This is a statistical
test of significance where all the patient volumes seen on Mondays technique to eliminate the influence
and lowest on weekends.  Calendar of one or more confounders on
52 Journal of The Association of Physicians of India ■ Vol. 65 ■ April 2017

the treatment effect. Simply put, Conclusion 2. Deshpande S, Gogtay NJ, Thatte UM. Data
it can be understood as a process types. J Assoc Phy Ind 2016; 64:64-65.
In summary, regression analysis
of “statistical correction” that is is a statistical tool that helps 3. Chamberlain JJ, Rhinehart AS, Shaefer CE,
done once the data is gathered. Neuman A. Diagnosis and management
evaluate relationships between of diabetes:synopsis of the 2016 American
In regression analysis, once a a dependent variable and one or Diabetes Assoication Standards of Medical
significant association between more independent or predictor Care in Diabetes. Ann Intern Med 2016;
the independent variable/s and variables. More specifically, it helps 164:542-552.
d e p e n d e n t va r i a b l e h a s b e e n us understand how the dependent 4. Hernandez-Vila E. A Review of the JNC
found, it is important to see if variable changes with changes in 8 Blood Pressure Guideline.  Texas Heart
this significance still persists the independent variable and thus Institute Journal 2015; 42:226-228.
after potential confounders have finds its application in forecasting 5. Palmer KT, Griffin MJ, Syddall HE, Davis
been adjusted for. In the study by and predicting. The technique
A, Pannett B, Coggon D. Occupational
exposure to noise and the attributable
Palmer, age was identified as a must however be used with clear burden of hearing difficulties in Great
potential confounder a priori [since understanding of the assumptions Britain. Occup Environ Med 2002; 59:634-9.
we know the hearing worsens with in each type of regression analysis, 6. Marcilio I, Hajat S, Gouveia N. Forecasting
age] and adjusted for in the final their limitations and the potential daily emergency department visits
analysis. Post adjustment, in both error that can occur when models using calendar variables and Ambient
sexes, the relationship between are applied to a larger population. temperature readings. Academic Emergency
years spent in a noisy job and Medicine 2013; 20:769–777.
severe hearing difficulty continued References 7. Brazer SR, Pancotto FS, Long TT 3rd, Harrell
to remain significant [relative to FE Jr, Lee KL, Tyor MP, Pryor DB. Using
1. Gogtay NJ, Deshpande S, Thatte UM. ordinal logistic regression to estimate the
those in non-noisy jobs].
Principles of Correlation analysis. J Assoc likelihood of colorectal neoplasia. J Clin
Phy Ind 2017; 65:78-81. Epidemiol 1991; 44:1263-70.

You might also like