Professional Documents
Culture Documents
65 ■ April 2017
100 19 50 183
20 51 130
21 51 133
50
22 51 144
23 52 128
0
24 54 105
0 10 20 30 40 50 60 70 80 25 56 145
26 57 141
Age
27 58 153
Fig. 2: Scatter plot and regression Line 28 59 157
29 63 155
y=B 0 +B 1x which are the notations of the model 30 67 176
commonly used by statisticians. Va l i d a t i o n t e c h n i q u e s c a n is affected by the choice of the
Here, B 0 represents the intercept, be broadly divided into two – independent variables and the
while B1 represents the slope. In our numerical and graphical. An presence of outliers 1 and hence
example of relationship between easy numerical technique is to researchers need to be careful
age and HbA1C levels, the equation look at the value of R 2 . You will while using R 2 alone for validation.
would be given by HbA1C levels recollect that the coefficient of Graphical analysis of residuals is a
(y) = B 0 + B 1*age (x) (Figure 2). correlation (r ) 1 is squared to obtain graphical technique for validation
The intercept B0 is that value of Y the Coefficient of Determination or that uses graphs to visually inspect
or the dependent variable (HbA1C, R 2. This is the value that is used for the data to assess its robustness.
in this case), when the value of the predicting the extent of variability
predictor variable is zero (age). in the dependent variable that can Utility of Regression
In reality age can never be zero, be explained by the independent Analysis
so this is the value of HbA1C that variable. In our example (Figure
is present regardless of age. The 1), the R 2 is 0.29 or 29% indicating The example given below
equation y = B 0 + B 1 x classically that only 29% of the variability in highlights the utility of regression
describes simple linear regression. SBP can be explained by age. If the analysis. Table 2 has values of
For Multiple linear regression, the value of R 2 is high, then we could systolic blood pressure [SBP] for 30
equation would be y = B 0 + B 1X 1 + assume that the model has a high women with age being presented
B 2X 2 + B 3X 3+……………………+B kX k predictive value. The value of 29% in an ascending order. Age here
depending upon the number of indicates that 71% of the variability would be the independent variables
predictor variables [X 1 , X 2 and still remains unaccounted for and and SBP, the dependent variable.
so on] and k is the kth predictor thus model may not have a great We first make a scatter plot
variable. predictive value. It is also useful and eye ball the data and then
Step 3: Evaluating the validity to remember that the value of R 2
Journal of The Association of Physicians of India ■ Vol. 65 ■ April 2017 51
subsequently generate the predictor [X] variables [when there variables rather than temperature
regression line and the regression are multiple predictor variables] were better at forecasting. They
equation (Figure 1). and all regression coefficients [B 1, concluded that this data could
The regression equation here is B 2 …..B k ] are tested together. (b) be used for better allocation of
y = 0.889x + 94.61 Testing a few X variables- here we personnel for the management of
can choose a few select X variables ED services.
OR
[these are the ones the researcher Prediction
SBP = 94.61 + 0.889 x age may think are maximally relevant]
The given data set if you will alone to be tested for their impact Brazer et al7 used logistic
notice does not contain the age on the Y variable and finally (c) regression to predict risk factors for
60 or the corresponding SBP. Say A test for individual significance colorectal cancer in a community
we as asking the question, what where a single X variable is tested practice where they studied 461
would be the SBP of a 60-year old for its impact on the Y variable. consecutive patients undergoing
woman? We would “fit” 60 into the colonoscopy. Of these, 129
regression equation and calculate Applications of Regression had adenomatous polyps (pre-
cancerous) and 34 had colorectal
it as Analysis
cancer. They randomly chose 292
SBP = 94.61 + 0.889 x 60 or 148 patients and evaluated the impact
There are three major uses of
mm Hg. of several independent variables in
regression analysis – attributing
It is important to note here causality [cause and effect a model that looked at prediction
that multiple lines can be drawn relationship], forecasting and of occurrence of colorectal cancer.
through the data set and hence prediction. These are explained F i ve va r i a b l e s we r e i d e n t i f i e d
we need to define the criterion for with three examples. t o b e p r e d i c t i ve - t h e p a t i e n t ’ s
drawing the line. The method that age, sex, hematocrit, fecal occult
Causality
is most commonly used is called the blood test result and indication
“least squares methods”. Pa l m e r K T a n d c o l l e a g u e s
for colonoscopy. When this model
assessed working aged people
When we apply this equation was applied to the remainder of
from the general population in
to the population for making a the 169 patients, it was found to
the United Kingdom to estimate
prediction, we would really not be be a reliable indicator of risk of
the risks of occupational exposure
able to predict either the systolic colorectal neoplasia.
to noise on self reported hearing
blood pressure perfectly. Hence,
we need to taken into account an
difficulties and tinnitus using a Understanding what
validated questionnaire. The study
“error” or “deviation” that is likely
showed that, in both sexes, after
Confounders are and the
to occur when this equation is used. Concept of “Adjustment”
adjustment [see below] for age, the
Thus, the equation would read as
risk of severe hearing difficulty and
below In study by Palmer and
persistent tinnitus rose with years
noise colleagues presented earlier,6
spent in a noisy job indicative of a
the relationship between being
y = B0 + B1x + e cause -effect relationship. 5
in a noisy job and the extent of
Forecasting
hearing difficulty and tinnitus
linear component Efficient management of patient was assessed. Intuitively, we do
Simply put, this could be flow in emergency departments know that noise is not the only
interpreted as (EDs) is a very important issue for reason why hearing difficulty can
Response = signal + noise hospital administrators. Marcilio occur. This may be related to stress,
I and colleagues 6 studied diverse gender, tiredness, older age and
Or in medical research this models in an attempt to forecast many more factors. A confounder
would be the daily number of patients is defined as that variable that is
Outcome = prediction + deviation seeking emergency department “hidden” or “lurking” and was
[ED] services in a general hospital not accounted for or thought of
Testing for Significance in Sao Paolo, Brazil, using calendar initially and impacts the outcome
Once, a regression equation is variables and ambient temperature being studied. Confounders [to
generated, the next logical step reading as the independent the extent possible] need to be
is to “test for significance” and variables. They found that the identified before the start of the
generate a p value. There are three mean number of ED visits was study and addressed during
different ways in which this can be 389 [166-613] with a seasonal analysis by a process called as
done (a) Carrying out an overall distribution with the highest “adjustment”. This is a statistical
test of significance where all the patient volumes seen on Mondays technique to eliminate the influence
and lowest on weekends. Calendar of one or more confounders on
52 Journal of The Association of Physicians of India ■ Vol. 65 ■ April 2017
the treatment effect. Simply put, Conclusion 2. Deshpande S, Gogtay NJ, Thatte UM. Data
it can be understood as a process types. J Assoc Phy Ind 2016; 64:64-65.
In summary, regression analysis
of “statistical correction” that is is a statistical tool that helps 3. Chamberlain JJ, Rhinehart AS, Shaefer CE,
done once the data is gathered. Neuman A. Diagnosis and management
evaluate relationships between of diabetes:synopsis of the 2016 American
In regression analysis, once a a dependent variable and one or Diabetes Assoication Standards of Medical
significant association between more independent or predictor Care in Diabetes. Ann Intern Med 2016;
the independent variable/s and variables. More specifically, it helps 164:542-552.
d e p e n d e n t va r i a b l e h a s b e e n us understand how the dependent 4. Hernandez-Vila E. A Review of the JNC
found, it is important to see if variable changes with changes in 8 Blood Pressure Guideline. Texas Heart
this significance still persists the independent variable and thus Institute Journal 2015; 42:226-228.
after potential confounders have finds its application in forecasting 5. Palmer KT, Griffin MJ, Syddall HE, Davis
been adjusted for. In the study by and predicting. The technique
A, Pannett B, Coggon D. Occupational
exposure to noise and the attributable
Palmer, age was identified as a must however be used with clear burden of hearing difficulties in Great
potential confounder a priori [since understanding of the assumptions Britain. Occup Environ Med 2002; 59:634-9.
we know the hearing worsens with in each type of regression analysis, 6. Marcilio I, Hajat S, Gouveia N. Forecasting
age] and adjusted for in the final their limitations and the potential daily emergency department visits
analysis. Post adjustment, in both error that can occur when models using calendar variables and Ambient
sexes, the relationship between are applied to a larger population. temperature readings. Academic Emergency
years spent in a noisy job and Medicine 2013; 20:769–777.
severe hearing difficulty continued References 7. Brazer SR, Pancotto FS, Long TT 3rd, Harrell
to remain significant [relative to FE Jr, Lee KL, Tyor MP, Pryor DB. Using
1. Gogtay NJ, Deshpande S, Thatte UM. ordinal logistic regression to estimate the
those in non-noisy jobs].
Principles of Correlation analysis. J Assoc likelihood of colorectal neoplasia. J Clin
Phy Ind 2017; 65:78-81. Epidemiol 1991; 44:1263-70.