
# 1.

Develop a suitable simple linear regression model to check if
there is any relationship between “Total Cost to Hospital” and
“AGE”. For the fitted model, interpret the regression coefficient
corresponding to “AGE”.
> library("ISLR", lib.loc="~/R/win-library/3.3")
> d<-read.csv('E:/KOZHI official/4. Term 4/DA-R/Assignment 2 mission hospital/Mission_2.csv',header=T)
> names(d)
 [1] "SL."                  "AGE"                  "GENDER"
 [4] "MALE"                 "MARITAL.STATUS"       "UNMARRIED"
 [7] "KEY.COMPLAINTS..CODE" "ACHD"                 "CAD.DVD"
[10] "CAD.SVD"              "CAD.TVD"              "CAD.VSD"
.
.
.
> attach(d)
> mod_1<-lm(TOTAL.COST.TO.HOSPITAL~AGE)
> summary(mod_1)
Call:
lm(formula = TOTAL.COST.TO.HOSPITAL ~ AGE)
Residuals:
    Min      1Q  Median      3Q     Max
-232683  -61888  -19440   28238  600773
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 141216.6    10610.7  13.309  < 2e-16 ***
AGE           1991.2      273.8   7.273 4.67e-12 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 111400 on 246 degrees of freedom
Multiple R-squared: 0.177, Adjusted R-squared: 0.1736
F-statistic: 52.9 on 1 and 246 DF, p-value: 4.672e-12
> plot(mod_1,which=c(1,2))
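The two diagnostic plots requested here are judged by eye. As a rough numeric complement, the sketch below runs two base-R checks on simulated data (not the hospital data): a Shapiro-Wilk test on the residuals and a Spearman correlation between the fitted values and the absolute residuals. The data frame `sim`, its columns and the choice of these two checks are illustrative assumptions, not part of the original analysis.

```r
# Simulated stand-in for the hospital data (hypothetical values).
set.seed(1)
sim <- data.frame(age = runif(248, 0, 80))
# Error spread grows with age, mimicking a funnel-shaped residual plot.
sim$cost <- 140000 + 2000 * sim$age + rnorm(248, sd = 500 * (sim$age + 5))

fit <- lm(cost ~ age, data = sim)

# Normality of residuals: a small p-value argues against normal errors.
norm_check <- shapiro.test(residuals(fit))

# Constant variance: a monotone trend of |residual| in the fitted values
# suggests heteroscedasticity.
var_check <- cor.test(fitted(fit), abs(residuals(fit)), method = "spearman")

print(norm_check$p.value)
print(var_check$estimate)
print(var_check$p.value)
```

With the real data, the same two calls could be applied to `mod_1` in place of `fit`; the plots remain the primary evidence.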

The graphs above show that the assumptions of normality and homoscedasticity are not satisfied. In the residuals vs fitted graph we can see a pattern: the values are clustered at lower fitted values and far apart at higher fitted values. This shows that the variances are not the same; they depend on the fitted values. Similarly, the normal Q-Q plot shows that the values deviate from the normal line. Hence the underlying assumptions for a linear model are not satisfied, so we try the log-linear model.

Log model
> mod_2<-lm(log(TOTAL.COST.TO.HOSPITAL)~AGE)
> summary(mod_2)
Call:
lm(formula = log(TOTAL.COST.TO.HOSPITAL) ~ AGE)
Residuals:
     Min       1Q   Median       3Q      Max
-1.51748 -0.24402 -0.00536  0.25388  1.39912
Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.814724   0.043326 272.693  < 2e-16 ***
AGE          0.008565   0.001118   7.662 4.21e-13 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.455 on 246 degrees of freedom
Multiple R-squared: 0.1927, Adjusted R-squared: 0.1894
F-statistic: 58.7 on 1 and 246 DF, p-value: 4.212e-13
> plot(mod_2,which=c(1,2))

In the residuals vs fitted graph the spread now looks random, and the pattern that was visible in the previous graph is gone. The normal Q-Q plot also shows a better fit to normality than the previous plot. The coefficient beta 1 shows that a one-unit (one-year) increase in age multiplies the total cost to hospital by a factor of exp(0.008565), about 1.0086, i.e. an increase of roughly 0.86%.
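Because the response is on the log scale, the AGE slope is a multiplier rather than an amount in rupees. The short sketch below turns the reported slope into a per-year and a per-decade factor; the variable names are illustrative, and only the value 0.008565 comes from the output above.

```r
# Slope for AGE as reported by summary(mod_2).
b_age <- 0.008565

# On the log scale the slope is additive; on the rupee scale it is a multiplier.
per_year   <- exp(b_age)        # factor applied to cost per extra year of age
per_decade <- exp(10 * b_age)   # factor applied per extra ten years of age

# The same per-year effect expressed as a percentage change.
pct_per_year <- 100 * (per_year - 1)

print(round(c(per_year = per_year, per_decade = per_decade), 4))
print(round(pct_per_year, 2))
```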

# 2.

Based on the fitted model in (1), suppose a patient’s age is 50 years. At the time of admission, what will be the minimum cost of treatment for this patient at 95% confidence level?
> p<-predict(mod_2,data.frame(AGE=50),interval="prediction")
> p
       fit      lwr      upr
1 12.24298 11.34373 13.14223
> exp(p[2])
[1] 84434.41
The lower prediction limit, back-transformed from the log scale, gives a minimum cost of about INR 84,434.

# 3.

Suppose Mission Hospital is planning to introduce a package price for the treatment and has decided to charge INR 250,000 for patients of age 50 years. What is the probability that the treatment cost will exceed the package price? Do you think that the Mission Hospital should revise the package price?
The residual standard error acts as a proxy for sigma.
Residual standard error: 0.455 on 246 degrees of freedom
> 1-pnorm((log(250000)-p[1])/.455)
[1] 0.3411574
The probability that the treatment cost exceeds the package price is about 34%. The hospital need not revise the package price, as it is greater than the mean predicted cost for a 50-year-old patient.

# 4.

Build a simple linear regression model between “Total Cost to Hospital” and “GENDER”. Interpret the results.
> mod_3<-lm(TOTAL.COST.TO.HOSPITAL~GENDER)
> plot(mod_2,which=c(1,2))
> mod_4<-lm(log(TOTAL.COST.TO.HOSPITAL)~GENDER)
> plot(mod_4,which=c(1,2))
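For the package-price question (3) above, the calculation can be retraced from the printed numbers alone. The sketch below is a minimal reproduction under the same assumption the answer makes, namely that log cost is normal around the fitted value with the residual standard error as its standard deviation. The variable names are hypothetical; the four constants are the values reported in the text.

```r
# Values reported in the text for a 50-year-old patient (log scale).
fit_log   <- 12.24298   # predicted log cost from mod_2
lwr_log   <- 11.34373   # lower 95% prediction limit
sigma_hat <- 0.455      # residual standard error of mod_2
package   <- 250000     # proposed package price in INR

# Back-transform the lower prediction limit to rupees.
min_cost <- exp(lwr_log)

# P(cost > package), assuming log cost is normal around the fitted value.
p_exceed <- 1 - pnorm((log(package) - fit_log) / sigma_hat)

# Typical predicted cost on the rupee scale, for comparison with the package.
typical_cost <- exp(fit_log)

print(round(min_cost))
print(round(p_exceed, 4))
print(round(typical_cost))
```

Note that this treats the fitted value as known; the prediction interval from `predict()` additionally accounts for uncertainty in the estimated coefficients.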

> summary(mod_4)
Call:
lm(formula = log(TOTAL.COST.TO.HOSPITAL) ~ GENDER)
.
.
.
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 11.93436    0.05503 216.865  < 2e-16 ***
GENDERM      0.19082    0.06726   2.837  0.00493 **
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4983 on 246 degrees of freedom
Multiple R-squared: 0.03168, Adjusted R-squared: 0.02774
F-statistic: 8.048 on 1 and 246 DF, p-value: 0.004934
> contrasts(GENDER)
  M
F 0
M 1
> exp(0.19082)
[1] 1.210242

Gender, being a qualitative variable, enters the regression model as a dummy variable. The dummy variable formed is GENDERM. The contrasts command shows that it is coded as 1 for male and 0 for female. The model shows that for males the total cost to hospital is higher by a factor of 1.210242 (about 21% more than for females), and the p-value shows the effect to be significant.

# 5.

Build a simple linear regression model between “Total Cost to Hospital” and “MARITAL STATUS”. Interpret the results.
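Both this question and the previous one hinge on how R turns a two-level factor into a 0/1 dummy and on how a log-scale coefficient is read as a ratio. The sketch below illustrates this with a toy factor (hypothetical data) and the GENDERM coefficient reported above; the variable names are assumptions made for illustration.

```r
# Toy factor with the same two levels as GENDER (hypothetical data).
gender <- factor(c("F", "M", "M", "F"))

# Treatment contrasts: the first level (F) is the baseline and M gets a
# 0/1 dummy column, which is why the coefficient is labelled GENDERM.
dummy <- contrasts(gender)
print(dummy)

# Log-scale coefficient for GENDERM as reported by summary(mod_4).
b_male <- 0.19082
ratio  <- exp(b_male)          # male-to-female ratio of typical cost
pct    <- 100 * (ratio - 1)    # the same effect as a percentage

print(round(c(ratio = ratio, pct = pct), 2))
```

A negative coefficient works the same way: the exponentiated value falls below 1 and is read as a proportional reduction relative to the baseline level.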

> contrasts(MARITAL.STATUS)
          UNMARRIED
MARRIED           0
UNMARRIED         1
> summary(mod_6)
Call:
lm(formula = log(TOTAL.COST.TO.HOSPITAL) ~ MARITAL.STATUS)
Residuals:
    Min      1Q  Median      3Q     Max
-1.3608 -0.2360 -0.0334  0.2396  1.4042
Coefficients:
                        Estimate Std. Error t value Pr(>|t|)
(Intercept)             12.29182    0.04466 275.229   <2e-16 ***
MARITAL.STATUSUNMARRIED -0.40697    0.05944  -6.847    6e-11 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4641 on 246 degrees of freedom
Multiple R-squared: 0.1601, Adjusted R-squared: 0.1566
F-statistic: 46.88 on 1 and 246 DF, p-value: 5.998e-11
> exp(-0.40697)
[1] 0.6656642

Marital status, being a qualitative variable, enters the regression model as a dummy variable. The dummy variable formed is MARITAL.STATUSUNMARRIED. The contrasts command shows that it is coded as 1 for unmarried and 0 for married. The model shows that for unmarried people the total cost to hospital is lower: the total cost is multiplied by a factor of 0.6656642, and the p-value shows that the effect is significant.

# 6.

Build a multiple linear regression model with “Total Cost to Hospital” as dependent variable, and “AGE”, “GENDER” and “MARITAL STATUS” as predictors. Compare the results with that of (4) and (5).
> mod_11<-lm(log(TOTAL.COST.TO.HOSPITAL)~AGE+GENDER+MARITAL.STATUS)
> summary(mod_11)
Call:
lm(formula = log(TOTAL.COST.TO.HOSPITAL) ~ AGE + GENDER + MARITAL.STATUS)
Residuals:
    Min      1Q  Median      3Q     Max
-1.5285 -0.2603 -0.0104  0.2470  1.3529
Coefficients:
                         Estimate Std. Error t value Pr(>|t|)
(Intercept)             11.790187   0.151136  78.011  < 2e-16 ***
AGE                      0.007637   0.002555   2.989  0.00308 **
GENDERM                  0.104211   0.062490   1.668  0.09667 .
MARITAL.STATUSUNMARRIED -0.032630   0.132570  -0.246  0.80578
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4543 on 244 degrees of freedom
Multiple R-squared: 0.2019, Adjusted R-squared: 0.1921
F-statistic: 20.58 on 3 and 244 DF, p-value: 6.394e-12

Only AGE is significant. GENDER and MARITAL STATUS are insignificant as seen from their p-values, although in (4) and (5) these variables came out as significant. This shows that, considered independently, gender and marital status appear to have a significant impact on the total cost to hospital; in the combined model, however, their effect is not significant.
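One common reason a predictor is significant alone but not in a combined model is overlap with another predictor. The sketch below illustrates this on simulated data, under the assumption (not verified against the hospital data) that marital status is largely determined by age. All names and parameter values here are hypothetical.

```r
# Simulated data (hypothetical): cost depends on age only, but marital
# status is almost a function of age, as it could be in a sample that
# mixes children and adults.
set.seed(42)
n   <- 248
age <- round(runif(n, 0, 80))
unmarried <- as.integer(age < 25 | runif(n) < 0.1)
log_cost  <- 11.8 + 0.0086 * age + rnorm(n, sd = 0.45)

# The two predictors overlap heavily.
r_age_unmarried <- cor(age, unmarried)
print(r_age_unmarried)

# Alone, marital status borrows part of the age effect ...
alone <- summary(lm(log_cost ~ unmarried))$coefficients
# ... while in the joint model its standard error is inflated by the overlap.
joint <- summary(lm(log_cost ~ age + unmarried))$coefficients

print(round(alone, 4))
print(round(joint, 4))
```

On the real data, `cor(AGE, UNMARRIED)` or a cross-tabulation of age bands against marital status would show whether the same overlap is present.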

# 7.

Build a multiple linear regression model with appropriate set of predictors. Identify the statistically significant predictors that the Mission Hospital can use in predicting “Total Cost to Hospital”. Comment on the performance of the fitted model.
> mod_9<-lm(log(TOTAL.COST.TO.HOSPITAL)~AGE+MALE+UNMARRIED+ACHD+CAD.DVD+CAD.SVD+CAD.TVD+CAD.VSD+OS.ASD+other.heart+other.respiratory+other.general+other.nervous+other.tertalogy+PM.VSD+RHD+BODY.WEIGHT+BODY.HEIGHT+HR.PULSE+BP.HIGH+BP.LOW+RR+Diabetes1+Diabetes2+hypertension1+hypertension2+hypertension3+other+HB+UREA+CREATININE+AMBULANCE+TRANSFERRED+ELECTIVE)
> summary(mod_9)
Call:
lm(formula = log(TOTAL.COST.TO.HOSPITAL) ~ AGE + MALE + UNMARRIED + ACHD + CAD.DVD + CAD.SVD + CAD.TVD + CAD.VSD + OS.ASD + other.heart + other.respiratory + other.general + other.nervous + other.tertalogy + PM.VSD + RHD + BODY.WEIGHT + BODY.HEIGHT + HR.PULSE + BP.HIGH + BP.LOW + RR + Diabetes1 + Diabetes2 + hypertension1 + hypertension2 + hypertension3 + other + HB + UREA + CREATININE + AMBULANCE + TRANSFERRED + ELECTIVE)
.
.
.
Residual standard error: 0.3965 on 156 degrees of freedom
(57 observations deleted due to missingness)
Multiple R-squared: 0.5307, Adjusted R-squared: 0.4285
F-statistic: 5.19 on 34 and 156 DF, p-value: 5.174e-13

The significant predictors, highlighted in yellow in the original coefficient table, are the nine used in the reduced model: AGE, CAD.DVD, CAD.TVD, other.heart, other.general, other.tertalogy, RHD, HR.PULSE and CREATININE.
> mod_10<-lm(log(TOTAL.COST.TO.HOSPITAL)~AGE+CAD.DVD+CAD.TVD+other.heart+other.general+other.tertalogy+RHD+HR.PULSE+CREATININE)
> summary(mod_10)
Call:
lm(formula = log(TOTAL.COST.TO.HOSPITAL) ~ AGE + CAD.DVD + CAD.TVD + other.heart + other.general + other.tertalogy + RHD + HR.PULSE + CREATININE)
.
.
.
Residual standard error: 0.4061 on 205 degrees of freedom
(33 observations deleted due to missingness)
Multiple R-squared: 0.4232, Adjusted R-squared: 0.3979
F-statistic: 16.71 on 9 and 205 DF, p-value: < 2.2e-16

The fitted model with all the significant predictors has a multiple R-squared of 0.4232 and an adjusted R-squared of 0.3979, showing that the model explains 42.32% of the variation in log total cost.
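The adjusted R-squared quoted for the model with only the significant predictors can be cross-checked from the multiple R-squared, the residual degrees of freedom (205) and the number of predictors (9). The helper `adj_r2` below is a hypothetical name introduced for illustration; the formula is the standard adjustment, and the three inputs are the values reported for that model.

```r
# Adjusted R-squared from R-squared, residual degrees of freedom and the
# number of predictors.
adj_r2 <- function(r2, df_resid, n_pred) {
  n <- df_resid + n_pred + 1            # observations actually used in the fit
  1 - (1 - r2) * (n - 1) / df_resid
}

# Reduced model: R-squared 0.4232, 205 residual df, 9 predictors.
reduced <- adj_r2(r2 = 0.4232, df_resid = 205, n_pred = 9)
print(round(reduced, 4))
```

The penalty grows with the number of predictors, which is why adjusted R-squared is the fairer figure when comparing a model with many predictors against a smaller one.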