
Q1)

1. The yield y of a chemical process is a random variable whose value is
considered to be a linear function of the temperature x. The following data of
corresponding values of x and y is found:
Temperature in °C (x) Yield in grams (y)

First we have to create a data frame:-


> my_data<-data.frame(
+ x=c(0,10,15,20,25,30,35,40,45,50,55,60,65,70,75,80,85,90,95,100),
+ y=c(14,38,54,76,98,102,114,128,137,145,152,175,189,221,245,289,310,354,378,405),
+ stringsAsFactors=FALSE
+ )
> View(my_data)

a). Find the average and standard deviation for both x and y and also find the
correlation between these two variables.
> mean(my_data$x)
[1] 52.25
> mean(my_data$y)
[1] 181.2
> sd(my_data$x)
[1] 30.02083
> sd(my_data$y)
[1] 115.3957
> cor(my_data$x,my_data$y)
[1] 0.9741692
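
As a cross-check on cor(), the Pearson correlation can be rebuilt from the covariance and the two standard deviations. The sketch below assumes the same 20 data points as my_data; the names r_manual and r_builtin are only illustrative.

```r
# Q1 data: the same 20 (x, y) pairs as in my_data
x <- c(0, seq(10, 100, by = 5))
y <- c(14, 38, 54, 76, 98, 102, 114, 128, 137, 145,
       152, 175, 189, 221, 245, 289, 310, 354, 378, 405)

# Pearson correlation = covariance / (sd of x * sd of y)
r_manual  <- cov(x, y) / (sd(x) * sd(y))
r_builtin <- cor(x, y)

print(c(manual = r_manual, builtin = r_builtin))
```

Both values should agree with the cor() result shown above.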

b). Build the linear model that can be used to predict y if a corresponding x is
known.
> model<-lm(y~x,data=my_data)
> model

Call:
lm(formula = y ~ x, data = my_data)

Coefficients:
(Intercept) x
-14.454 3.745
So we can get the formula to be y= (-14.454) + (3.745 * x)
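
The coefficients reported by lm() can also be reproduced from the closed-form least-squares formulas, slope = Sxy / Sxx and intercept = mean(y) - slope * mean(x). This is a minimal sketch on the same data; b0, b1 and fit are illustrative names.

```r
# Q1 data: the same 20 (x, y) pairs as in my_data
x <- c(0, seq(10, 100, by = 5))
y <- c(14, 38, 54, 76, 98, 102, 114, 128, 137, 145,
       152, 175, 189, 221, 245, 289, 310, 354, 378, 405)

# Closed-form least-squares estimates for simple linear regression
b1 <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
b0 <- mean(y) - b1 * mean(x)                                     # intercept

# Same model fitted with lm() for comparison
fit <- lm(y ~ x)

print(c(intercept = b0, slope = b1))
print(coef(fit))
```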
c. Using an R function, show whether there is a significant relationship between
yield and temperature. Also write the null hypothesis and alternative hypothesis
and interpret your results.
> cor.test(my_data$x,my_data$y)

Pearson's product-moment correlation

data: my_data$x and my_data$y
t = 18.302, df = 18, p-value = 4.428e-13
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.9345038 0.9899377
sample estimates:
cor
0.9741692
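Null hypothesis:- the true correlation between temperature and yield is equal to zero
Alternative hypothesis:- the true correlation is not equal to zero
Since the p-value (4.428e-13) is far below 0.05, we reject the null hypothesis:
there is a significant positive linear relationship between yield and
temperature (correlation 0.974).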

d. What is the t-statistic and p-value?


t-statistic = 18.302
p-value = 4.428e-13 (that is, 0.0000000000004428)
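
The t-statistic and p-value reported by cor.test() follow from the correlation and the sample size: t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 degrees of freedom. The sketch below plugs in the rounded correlation from the output above, so the last digits may differ slightly; t_stat and p_value are illustrative names.

```r
r <- 0.9741692   # correlation reported by cor.test()
n <- 20          # number of observations

# Test statistic for H0: true correlation = 0
t_stat <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided p-value from the t distribution with n - 2 degrees of freedom
p_value <- 2 * pt(-abs(t_stat), df = n - 2)

print(c(t = t_stat, df = n - 2, p = p_value))
```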

e. If the relationship is significant, predict the value of y for an unknown
dataset defined by you.

> new.temp<-data.frame(x=c(32,48,77))
> predict(model,newdata=new.temp)
1 2 3
105.3726 165.2856 273.8780

f. Draw the scatterplot showing regression line using ggplot.

> library(ggplot2)
> ggplot(my_data,aes(x,y))+
+ geom_point()+
+ stat_smooth(method="lm")

g. Give the 95% confidence interval of the expected yield at a temperature of
xnew = 80 °C.
> new.temp<-data.frame(x=c(80))
> predict(model,newdata = new.temp,interval='confidence')
fit lwr upr
1 285.1117 267.7778 302.4455
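
The interval returned by predict() can be reproduced by hand as fit ± t * se, where se = s * sqrt(1/n + (x0 - mean(x))^2 / Sxx) is the standard error of the expected yield at x0. The following is a sketch on the same data; the names x0, se_mean, ci_manual and ci_builtin are illustrative.

```r
# Q1 data: the same 20 (x, y) pairs as in my_data
x <- c(0, seq(10, 100, by = 5))
y <- c(14, 38, 54, 76, 98, 102, 114, 128, 137, 145,
       152, 175, 189, 221, 245, 289, 310, 354, 378, 405)
fit <- lm(y ~ x)

x0 <- 80                      # temperature of interest
n  <- length(x)
s  <- summary(fit)$sigma      # residual standard error

# Standard error of the expected (mean) yield at x0
se_mean <- s * sqrt(1 / n + (x0 - mean(x))^2 / sum((x - mean(x))^2))
t_crit  <- qt(0.975, df = n - 2)          # two-sided 95% critical value
y0      <- sum(coef(fit) * c(1, x0))      # fitted value at x0

ci_manual  <- c(fit = y0, lwr = y0 - t_crit * se_mean, upr = y0 + t_crit * se_mean)
ci_builtin <- predict(fit, newdata = data.frame(x = x0), interval = "confidence")

print(ci_manual)
print(ci_builtin)
```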
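h. What is the upper quartile of the residuals?
> quantile(my_data$y)
0% 25% 50% 75% 100%
14.0 101.0 148.5 256.0 405.0
This command gives the quartiles of the yield y itself (upper quartile 256
grams), not of the residuals. The upper quartile of the residuals is the 75%
value of quantile(residuals(model)), which is also reported as 3Q in
summary(model).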
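
Since the question asks about the residuals, a minimal sketch of that calculation is given here. It refits the same model on the same data; res_q and upper_q are illustrative names.

```r
# Q1 data: the same 20 (x, y) pairs as in my_data
x <- c(0, seq(10, 100, by = 5))
y <- c(14, 38, 54, 76, 98, 102, 114, 128, 137, 145,
       152, 175, 189, 221, 245, 289, 310, 354, 378, 405)
fit <- lm(y ~ x)

# Quartiles of the residuals (observed minus fitted yield)
res_q   <- quantile(residuals(fit))
upper_q <- res_q[["75%"]]   # upper quartile, the "3Q" of summary(fit)

print(res_q)
print(upper_q)
```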

Q2)
In a study of pollution in a water stream, the concentration of pollution is
measured at different locations. The locations are at different distances to the
pollution source. In the table below, these distances and the average pollution
are given:
First we have to create a data frame:-

a. Build the linear model that can be used to predict y if a corresponding x is
known. What are the parameter estimates for the three unknown parameters
in the usual linear regression model:
1) The intercept (β0),
2) the slope (β1) and
3) error standard deviation (σ)?
From this we get that :-
β0 :- 11.40000
β1 :- -0.06364
σ :- 3.138
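
The three estimates can be read from a fitted lm object as sketched below. The table for this question is not reproduced here, so the data frame pollution_demo and its values are purely hypothetical placeholders and will not reproduce the estimates above.

```r
# HYPOTHETICAL placeholder data: NOT the values from the assignment table
pollution_demo <- data.frame(
  distance      = c(2, 4, 6, 8, 10, 12),
  concentration = c(11.5, 10.2, 10.3, 9.7, 9.0, 8.6)
)

fit <- lm(concentration ~ distance, data = pollution_demo)

beta0     <- coef(fit)[["(Intercept)"]]  # intercept estimate
beta1     <- coef(fit)[["distance"]]     # slope estimate
sigma_hat <- summary(fit)$sigma          # residual standard error (estimate of sigma)

print(c(beta0 = beta0, beta1 = beta1, sigma = sigma_hat))
```

In the printed summary(fit), the same numbers appear in the Estimate column and as "Residual standard error".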

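b. How large a part of the variation in concentration can be explained by the
distance?

The part of the variation in concentration that is explained by the distance is
the coefficient of determination R², reported as "Multiple R-squared" in the
summary() output of the model. The fitted line itself only gives the predicted
concentration at a given distance:-
Concentration = 11.40000 + (-0.06364 * Distance)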
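
To illustrate how the explained share of variation is obtained, the sketch below computes R² as 1 - SSE/SST and compares it with the value stored in summary(). The data frame pollution_demo is again a hypothetical placeholder, not the assignment data.

```r
# HYPOTHETICAL placeholder data: NOT the values from the assignment table
pollution_demo <- data.frame(
  distance      = c(2, 4, 6, 8, 10, 12),
  concentration = c(11.5, 10.2, 10.3, 9.7, 9.0, 8.6)
)
fit <- lm(concentration ~ distance, data = pollution_demo)

# R^2 = 1 - (residual sum of squares) / (total sum of squares)
sse <- sum(residuals(fit)^2)
sst <- sum((pollution_demo$concentration - mean(pollution_demo$concentration))^2)
r2_manual  <- 1 - sse / sst
r2_builtin <- summary(fit)$r.squared

print(c(manual = r2_manual, builtin = r2_builtin))
```

For a model with a single predictor, R² also equals the squared correlation between the two variables.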

c. What is a 90% confidence interval for the expected pollution concentration
7 km from the pollution source?
d. Draw the scatter plot showing prediction interval as well as regression line.
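
One possible way to answer parts c and d is sketched below with predict() and base graphics (ggplot would work as well). The data frame pollution_demo is a hypothetical placeholder, so the printed interval is not the answer for the assignment data; the 90% level for the prediction band is also an assumption.

```r
# HYPOTHETICAL placeholder data: NOT the values from the assignment table
pollution_demo <- data.frame(
  distance      = c(2, 4, 6, 8, 10, 12),
  concentration = c(11.5, 10.2, 10.3, 9.7, 9.0, 8.6)
)
fit <- lm(concentration ~ distance, data = pollution_demo)

# c. 90% confidence interval for the expected concentration at 7 km
ci_90 <- predict(fit, newdata = data.frame(distance = 7),
                 interval = "confidence", level = 0.90)
print(ci_90)

# d. Scatter plot with regression line and prediction band
grid <- data.frame(distance = seq(min(pollution_demo$distance),
                                  max(pollution_demo$distance),
                                  length.out = 50))
pred_band <- predict(fit, newdata = grid, interval = "prediction", level = 0.90)

plot(concentration ~ distance, data = pollution_demo,
     ylim = range(pred_band, pollution_demo$concentration))
abline(fit)                                        # regression line
lines(grid$distance, pred_band[, "lwr"], lty = 2)  # lower prediction limit
lines(grid$distance, pred_band[, "upr"], lty = 2)  # upper prediction limit
```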

Q3)
Use the Boston dataset from MASS library in R which records medv (median
house value) for 506 neighborhoods around Boston using 13 predictors such as
rm (average number of rooms per house), age (average age of houses), and lstat
(percent of households with low socioeconomic status).

a). Fit the regression model, with medv as the response and lstat as the
predictor.
b). Find all the detailed information like coefficients, p-values and standard
errors for the coefficients, as well as the R2 statistic and F-statistic for the
model.
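
A minimal sketch for parts a and b is given here, assuming the MASS package is installed (it ships with standard R distributions). The names fit and fit_summary are illustrative.

```r
library(MASS)   # provides the Boston data set

# a) Simple linear regression of medv on lstat
fit <- lm(medv ~ lstat, data = Boston)

# b) Coefficients with standard errors, t-values and p-values
fit_summary <- summary(fit)
print(coef(fit_summary))

# R-squared and F-statistic of the model
r_squared <- fit_summary$r.squared
f_stat    <- fit_summary$fstatistic   # value, numerator df, denominator df
print(r_squared)
print(f_stat)
```

Calling summary(fit) on its own prints all of these quantities in one table.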

c. Use the predict() function to produce confidence intervals and prediction
interval for a defined unknown value of lstat and plot it.

Prediction Plot:-
Confidence interval plot:-
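
The intervals and plots for part c could be produced along the following lines. This is a sketch, not the original answer: the lstat values 5, 10 and 15 are chosen here only as an example, and base graphics are used to keep the code dependency-free.

```r
library(MASS)   # provides the Boston data set
fit <- lm(medv ~ lstat, data = Boston)

# Example "unknown" lstat values (chosen for illustration only)
new_lstat <- data.frame(lstat = c(5, 10, 15))
conf_int <- predict(fit, newdata = new_lstat, interval = "confidence")
pred_int <- predict(fit, newdata = new_lstat, interval = "prediction")
print(conf_int)
print(pred_int)

# Bands over the observed range of lstat for plotting
grid <- data.frame(lstat = seq(min(Boston$lstat), max(Boston$lstat),
                               length.out = 100))
conf_band <- predict(fit, newdata = grid, interval = "confidence")
pred_band <- predict(fit, newdata = grid, interval = "prediction")

plot(medv ~ lstat, data = Boston, col = "grey")
abline(fit)                                               # regression line
matlines(grid$lstat, conf_band[, c("lwr", "upr")],
         lty = 2, col = "blue")                           # confidence band
matlines(grid$lstat, pred_band[, c("lwr", "upr")],
         lty = 3, col = "red")                            # prediction band
```

The prediction band is wider than the confidence band because it also accounts for the scatter of individual neighborhoods around the regression line.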
