
Homework

Gianluigi De Rubertis

18/06/2023

knitr::opts_chunk$set(echo = TRUE)

SERIES A (EX. A.01)


a) In this section a contingency table analogous to the one shown in the text is created.
Contable <- matrix(c(4268, 1702, 495, 2640, 1000, 760, 1585, 1460, 507, 1469, 990, 454, 140, 482, 180),
                   nrow = 5, ncol = 3, byrow = TRUE)
colnames(Contable) <- c("Primary", "Secondary", "Tertiary")
rownames(Contable) <- c( "North-West", "North-East", "Centre", "South", "Islands")
Contable

##            Primary Secondary Tertiary
## North-West    4268      1702      495
## North-East    2640      1000      760
## Centre        1585      1460      507
## South         1469       990      454
## Islands        140       482      180
b) This second piece of code adds the marginal frequencies.
c1 <- addmargins(Contable)
colnames(c1) <- c("Primary", "Secondary", "Tertiary", "Total")
rownames(c1) <- c( "North-West", "North-East", "Centre", "South", "Islands", "Total")
c1

##            Primary Secondary Tertiary Total
## North-West    4268      1702      495  6465
## North-East    2640      1000      760  4400
## Centre        1585      1460      507  3552
## South         1469       990      454  2913
## Islands        140       482      180   802
## Total        10102      5634     2396 18132
In this third section, a table with row-conditional frequencies (and marginal sums) is created.
addmargins(prop.table(Contable, margin = 1), margin = 2)

##              Primary Secondary   Tertiary Sum
## North-West 0.6601701 0.2632637 0.07656613   1
## North-East 0.6000000 0.2272727 0.17272727   1
## Centre     0.4462275 0.4110360 0.14273649   1
## South      0.5042911 0.3398558 0.15585307   1
## Islands    0.1745636 0.6009975 0.22443890   1
Pearson_Contable<-chisq.test(Contable, correct = FALSE)
Pearson_Contable

##
## Pearson's Chi-squared test
##
## data: Contable
## X-squared = 1200.4, df = 8, p-value < 2.2e-16
c) In conclusion, given that the p-value of the Pearson's test is very low, we can reject the null hypothesis (recall that H0: "the two variables are independent"). The low value of Cramer's V (the closer to zero, the weaker the association) indicates that the association, although statistically significant, is weak.
d) Calculating Cramer's association index is possible as follows:
CramerV <- function(Contable) {
  # Pearson's chi-squared statistic, without continuity correction
  depo <- chisq.test(Contable, correct = FALSE)
  # V = sqrt(X^2 / (n * (min(rows, cols) - 1)))
  V <- sqrt(depo$statistic / (sum(Contable) * (min(dim(Contable)) - 1)))
  return(V)
}
CramerV(Contable)

## X-squared
## 0.1819409
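As a cross-check, the same index can also be obtained from the vcd package, if it is available (a sketch; this package is an assumption and is not used elsewhere in the homework):
library(vcd)
assocstats(Contable)  # reports Cramer's V alongside other association measures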

SERIES B (EX. B.02)


Considering the definition of population covariance:
$$\sigma_{xy} = \frac{1}{n}\sum_{i=1}^{n}(x_i - \mu_x)(y_i - \mu_y)$$

Prove that:
$$\sigma_{xy} = \mu_{xy} - \mu_x \mu_y$$

Given that:

$$\mu_{xy} = \frac{1}{n}\sum_{i=1}^{n} x_i y_i$$
The main equation can be rearranged such that:
$$\sigma_{xy} = \frac{1}{n}\sum_{i=1}^{n}\left(x_i y_i - x_i \mu_y - \mu_x y_i + \mu_x \mu_y\right)$$

which can also be written as


$$\sigma_{xy} = \frac{1}{n}\sum_{i=1}^{n} x_i y_i + \frac{1}{n}\sum_{i=1}^{n}\left(-x_i \mu_y - \mu_x y_i + \mu_x \mu_y\right).$$
Thus, it is now possible to substitute
$$\mu_{xy} = \frac{1}{n}\sum_{i=1}^{n} x_i y_i.$$
Now, by substituting $\mu_{xy}$ and factoring $\mu_x \mu_y$ out of the second summation, the following is obtained:

$$\sigma_{xy} = \mu_{xy} + \mu_x \mu_y \left( -\frac{\frac{1}{n}\sum_{i=1}^{n} x_i}{\mu_x} - \frac{\frac{1}{n}\sum_{i=1}^{n} y_i}{\mu_y} + 1 \right)$$

Since $\frac{1}{n}\sum_{i=1}^{n} x_i = \mu_x$ and $\frac{1}{n}\sum_{i=1}^{n} y_i = \mu_y$, each ratio equals 1, and therefore

$$\sigma_{xy} = \mu_{xy} + \mu_x \mu_y \left(-1 - 1 + 1\right) = \mu_{xy} - \mu_x \mu_y$$
Q.E.D.
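A quick numeric sanity check of this identity in R, using simulated data (purely illustrative):
set.seed(1)
x <- rnorm(1000); y <- rnorm(1000)
def_cov <- mean((x - mean(x)) * (y - mean(y)))  # population covariance, by definition
id_cov  <- mean(x * y) - mean(x) * mean(y)      # the identity mu_xy - mu_x * mu_y
all.equal(def_cov, id_cov)                      # TRUE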

SERIES C (EX. C.05)


We perform a significance test for the difference between two population means. First, we import the dataset.
library(readr)
pub_employees <- read_csv("pub_employees.csv")

## Rows: 80 Columns: 3
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (1): id
## dbl (2): empl2019, empl2022
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
Then the sample mean for each of the two years is calculated.
x1 <- mean(pub_employees$empl2019)
x2 <- mean(pub_employees$empl2022)

Next, the standard error (se) is calculated, and from it the test statistic.
l1 <- length(pub_employees$empl2019)
l2 <- length(pub_employees$empl2022)

s1<-sd(pub_employees$empl2019)
s2<-sd(pub_employees$empl2022)

se <- sqrt((s1^2)/l1 + (s2^2)/l2)

tstat <- (x1-x2)/se

Thus we have:
$$t = \frac{(\bar{x}_1 - \bar{x}_2) - 0}{se} = 0.1209.$$
We now calculate the degrees of freedom through the Welch-Satterthwaite formula:
$$df = \frac{\left(\frac{s_1^2}{l_1} + \frac{s_2^2}{l_2}\right)^2}{\frac{1}{l_1 - 1}\left(\frac{s_1^2}{l_1}\right)^2 + \frac{1}{l_2 - 1}\left(\frac{s_2^2}{l_2}\right)^2}$$

df <- ((s1^2/l1 + s2^2/l2)^2) / ((1/(l1-1)*(s1^2/l1)^2) + (1/(l2-1)*(s2^2/l2)^2))

Given the large number of degrees of freedom, the t distribution can be approximated by the standard normal distribution, whose CDF evaluated at the test statistic is roughly 0.5478. The two-sided p-value is then:
2*(1-0.5478)

## [1] 0.9044
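The same p-value can be computed directly from the quantities above, without hard-coding the CDF value (a sketch using base R functions):
2 * (1 - pnorm(abs(tstat)))   # standard normal approximation
2 * (1 - pt(abs(tstat), df))  # Welch t reference distribution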

The p-value of 0.9044 is well above any conventional significance threshold, so it is not possible to reject H0. This result is consistent with the t-test computed by R itself:
t.test(pub_employees$empl2019, pub_employees$empl2022, alternative = "two.sided",conf.level = .99)

##
## Welch Two Sample t-test
##
## data: pub_employees$empl2019 and pub_employees$empl2022
## t = 0.12091, df = 157.54, p-value = 0.9039
## alternative hypothesis: true difference in means is not equal to 0
## 99 percent confidence interval:
## -0.8483232 0.9308232
## sample estimates:
## mean of x mean of y
## 3.86250 3.82125

SERIES E (EX. E.02)


The following code generates a regression model for house prices, including all the regressors available in "windsor.csv".
library(readr)
windsor <- read_csv("C:/Users/derub/OneDrive/Desktop/Stat 2/windsor.csv")

## Rows: 546 Columns: 12
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (6): driveway, recroom, fullbase, gashw, airco, prefarea
## dbl (6): price, lotsize, bedrooms, bathrms, stories, garagepl
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
str(windsor)

## spec_tbl_df [546 x 12] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
## $ price : num [1:546] 42000 38500 49500 60500 61000 66000 66000 69000 83800 88500 ...
## $ lotsize : num [1:546] 5850 4000 3060 6650 6360 4160 3880 4160 4800 5500 ...
## $ bedrooms: num [1:546] 3 2 3 3 2 3 3 3 3 3 ...
## $ bathrms : num [1:546] 1 1 1 1 1 1 2 1 1 2 ...
## $ stories : num [1:546] 2 1 1 2 1 1 2 3 1 4 ...
## $ driveway: chr [1:546] "yes" "yes" "yes" "yes" ...
## $ recroom : chr [1:546] "no" "no" "no" "yes" ...
## $ fullbase: chr [1:546] "yes" "no" "no" "no" ...
## $ gashw : chr [1:546] "no" "no" "no" "no" ...
## $ airco : chr [1:546] "no" "no" "no" "no" ...
## $ garagepl: num [1:546] 1 0 0 0 0 0 2 0 0 1 ...
## $ prefarea: chr [1:546] "no" "no" "no" "no" ...
## - attr(*, "spec")=
## .. cols(
## .. price = col_double(),
## .. lotsize = col_double(),
## .. bedrooms = col_double(),
## .. bathrms = col_double(),
## .. stories = col_double(),

## .. driveway = col_character(),
## .. recroom = col_character(),
## .. fullbase = col_character(),
## .. gashw = col_character(),
## .. airco = col_character(),
## .. garagepl = col_double(),
## .. prefarea = col_character()
## .. )
## - attr(*, "problems")=<externalptr>
modSLR1 <- lm(price~., windsor)
summary(modSLR1)

##
## Call:
## lm(formula = price ~ ., data = windsor)
##
## Residuals:
## Min 1Q Median 3Q Max
## -41389 -9307 -591 7353 74875
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -4038.3504 3409.4713 -1.184 0.236762
## lotsize 3.5463 0.3503 10.124 < 2e-16 ***
## bedrooms 1832.0035 1047.0002 1.750 0.080733 .
## bathrms 14335.5585 1489.9209 9.622 < 2e-16 ***
## stories 6556.9457 925.2899 7.086 4.37e-12 ***
## drivewayyes 6687.7789 2045.2458 3.270 0.001145 **
## recroomyes 4511.2838 1899.9577 2.374 0.017929 *
## fullbaseyes 5452.3855 1588.0239 3.433 0.000642 ***
## gashwyes 12831.4063 3217.5971 3.988 7.60e-05 ***
## aircoyes 12632.8904 1555.0211 8.124 3.15e-15 ***
## garagepl 4244.8290 840.5442 5.050 6.07e-07 ***
## prefareayes 9369.5132 1669.0907 5.614 3.19e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 15420 on 534 degrees of freedom
## Multiple R-squared: 0.6731, Adjusted R-squared: 0.6664
## F-statistic: 99.97 on 11 and 534 DF, p-value: < 2.2e-16
a) According to the model, we cannot conclude that the number of bedrooms has a significant effect on the price of the house, since the related p-value is larger than 0.05.
By contrast, given that the p-value for the number of bathrooms is below the significance threshold of 0.05, we can conclude that there is a significant difference in price depending on this variable. More specifically, holding the other regressors fixed, a unit increase in "bathrms" is associated with an estimated price increase of $14335.5585.
b) Analogously to point (a), it is possible to state that the size of the property (variable "lotsize") affects the price of the house: the regressor's p-value is below the significance threshold of 0.05. More specifically, a unit increase in lot size is associated with an estimated price increase of $3.5463.
A 99% confidence interval can be computed as follows:

t.test(windsor$price, windsor$lotsize, conf.level=.99)

##
## Welch Two Sample t-test
##
## data: windsor$price and windsor$lotsize
## t = 54.923, df = 552.19, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 99 percent confidence interval:
## 60007.82 65934.84
## sample estimates:
## mean of x mean of y
## 68121.597 5150.266
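Note that t.test() above compares the mean of price with the mean of lotsize, two variables on very different scales. If the exercise instead asks for a 99% confidence interval for the regression coefficients (an assumption on my part), confint() could be applied to the fitted model:
confint(modSLR1, level = 0.99)  # 99% CIs for all coefficients, including lotsize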
c) Given that the regressor's p-value is small enough, it is possible to state that, according to our model, the presence of a recreational room is associated with a price increase of $4511.2838.

SERIES F (EX. F.04)


library(readr)
transport <- read_csv("transport.csv")

## Rows: 418 Columns: 6
## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (4): slot, weekday, rain, fog
## dbl (2): traffic, time
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
View(transport)  # opens the data viewer (interactive sessions only)

SLR1 <- lm(time~., transport)


summary(SLR1)

##
## Call:
## lm(formula = time ~ ., data = transport)
##
## Residuals:
## Min 1Q Median 3Q Max
## -26.826 -11.766 -5.181 7.328 92.456
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.75553 2.98864 2.930 0.003585 **
## traffic 1.42829 0.04486 31.837 < 2e-16 ***
## slot0701-1100 28.47282 2.48988 11.435 < 2e-16 ***
## slot1101-1630 -3.45872 2.48952 -1.389 0.165498
## slot1631-1900 29.90356 2.55230 11.716 < 2e-16 ***
## weekdayMon 3.72713 2.81705 1.323 0.186557
## weekdayThu -10.83878 2.74687 -3.946 9.36e-05 ***
## weekdayTue -9.87156 2.87802 -3.430 0.000665 ***
## weekdayWed -13.06827 2.74204 -4.766 2.62e-06 ***

## rainyes 2.57764 2.23251 1.155 0.248934
## fogyes 2.08659 2.17968 0.957 0.338987
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.03 on 407 degrees of freedom
## Multiple R-squared: 0.7817, Adjusted R-squared: 0.7763
## F-statistic: 145.7 on 10 and 407 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(SLR1)

[Figure: four diagnostic plots for SLR1 — Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage with Cook's distance contours; observations 47, 141, and 260 are flagged in each panel.]

SLR2 <- lm(time ~ traffic + slot + weekday, transport)


summary(SLR2)

##
## Call:
## lm(formula = time ~ traffic + slot + weekday, data = transport)
##
## Residuals:
## Min 1Q Median 3Q Max
## -27.712 -11.870 -5.614 7.699 94.439
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.76979 2.90711 3.361 0.000851 ***
## traffic 1.42626 0.04483 31.812 < 2e-16 ***

## slot0701-1100 28.54097 2.48394 11.490 < 2e-16 ***
## slot1101-1630 -3.26866 2.48655 -1.315 0.189403
## slot1631-1900 30.29515 2.53756 11.939 < 2e-16 ***
## weekdayMon 3.40333 2.80763 1.212 0.226146
## weekdayThu -10.93632 2.74061 -3.990 7.81e-05 ***
## weekdayTue -10.06135 2.87567 -3.499 0.000519 ***
## weekdayWed -13.12877 2.73721 -4.796 2.27e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 18.03 on 409 degrees of freedom
## Multiple R-squared: 0.7805, Adjusted R-squared: 0.7762
## F-statistic: 181.8 on 8 and 409 DF, p-value: < 2.2e-16
par(mfrow = c(2, 2))
plot(SLR2)

[Figure: four diagnostic plots for SLR2 — Residuals vs Fitted, Normal Q-Q, Scale-Location, and Residuals vs Leverage with Cook's distance contours; observations 47, 141, and 260 are flagged again.]

a, b) The first graph (Residuals vs Fitted) checks assumption MLR4: if the assumption holds, the residuals' average should be approximately zero across the range of fitted values. However, as can be deduced from the graph, this is not the case. Analogous reasoning applies to the first graph of the second model as well.
The second graph (Normal Q-Q) compares the distribution of the residuals to the normal distribution (assumption MLR6). In both cases, the marked departures from the reference line suggest that this assumption is violated.
The third graph (Scale-Location) allows us to detect possible violations of the homoskedasticity assumption. To exclude such violations, the plot's smoothing line should be approximately horizontal. However, this is not the case for either of the two graphs analysed: the line shows clear variability, and it is not possible to state with confidence that the assumption is not violated.

The fourth graph (Residuals vs Leverage) does not check an assumption; it screens for influential outliers. None of the observations falls beyond the contour lines of the Cook's distance, so no influential outliers are detected from the graphs.
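The visual impressions from the second and third graphs could also be complemented with formal tests, for example (a sketch; the lmtest package is an assumption, not something used elsewhere in this homework):
library(lmtest)
bptest(SLR1)               # Breusch-Pagan test for heteroskedasticity
shapiro.test(resid(SLR1))  # Shapiro-Wilk test for normality of the residuals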
c) A possible way to improve the model could be to introduce a squared term in the model's equation, making it fit the data better; see the sketch below. In that case, more than one assumption might be "recovered".
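A minimal sketch of this idea, assuming the curvature comes from the traffic regressor (which term to square is an assumption on my part):
SLR3 <- lm(time ~ traffic + I(traffic^2) + slot + weekday, transport)
summary(SLR3)  # check whether the squared term is significant and the diagnostics improve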
