
Statistics Module 2 – Applied Statistics

ASSIGNMENT
Valerio Langè
(valerio.lange@unibocconi.it)

Report in the table here below Name, Surname and ID Number for each member of your group.

Name Surname ID Number

Alessandro Campo Lobato 3164258

Giuseppe Lista 3164613


Context and description of the dataset:

The German demographic evolution is a phenomenon that is raising a strong political debate in the
country. A new National Family Plan will be soon designed. The “German Observatory on Demography”
has commissioned your research group to conduct a survey in order to explore the influential factors
regarding cultural values in German men. Part of these results will be used to design a new “National
Family Plan” that is intended to support young couples to create a new family.
After a consultation phase with the competent Officers of the “German Observatory on Demography” you
decide to implement a survey collecting data on a sample of 150 men aged 35-50 years old with a
specifically designed questionnaire.
You agree upon a shared version of the questionnaire with the above mentioned Policy Officers. You can
find the questionnaire attached to the case study.
The “German Observatory on Demography” delivers the questionnaire to 150 German men aged 35-50
and, once the data are collected, the dataset is released to your group.
The objective is now to prepare a Research Report for the “German Observatory on Demography”
following the instructions reported here below.

Instructions:

In the assignment folder you can find


· this file 30457_Assignment_Text_2022.docx
· the dataset 30457_Assignment_Data_2022.csv
· the questionnaire 30457_Assignment_Questionnaire_2022.pdf

Always report in the text the R-Studio output generated and all procedures and commands used. In case
this information is missing a score will not be assigned.

When you have completed the assignment send the final document in .doc or .docx format and the
database used to the following email address: valerio.lange@unibocconi.it

The assignment must be delivered by 11:59pm on 16th of December 2022. Assignments delivered later
will not be considered.

For any question related to the assignment please refer to the following email address:
valerio.lange@unibocconi.it

A. Descriptive statistics of the sample

A.1. Report the main information for each variable (e.g.: frequency distribution table, average and/or other
summary measures, etc.). Comment on results.

B. Confidence intervals

B.1. Compute the 95% confidence interval for the mean age of men. Also comment on it.

B.2. Compute the 95% confidence interval for the mean size of the apartment. Also comment on it.

B.3. Compute the 95% confidence interval for the difference in the mean age between men who have
children and men who don’t have children. Also comment on it.

C. Hypothesis testing

C.1. Is the mean age of men significantly different from 42 years? Use α=0.01. Comment on the results.

C.2. Is the mean size of the apartment significantly greater than 120 squared feet? Comment on the
results using different levels of α (0.1; 0.05; 0.01).

C.3. Is the difference in the mean value of the size of the apartment significantly different between men
with children and men without children? Use α=0.01. Comment on the results.
C.4. Is the difference in the mean value of age significantly different between men with children and men
without children? Use α=0.01. Comment on the results.

D. Linear Regression

D.1. There is now interest to explore the importance that men give to having children with some socio-
demographic and living characteristics. Run a linear regression model with v7_9 as dependent variable
and v1 [Age] as independent variable. Comment on the output results and evaluate the goodness of fit of
the model. Finally check with appropriate techniques whether the simple linear regression model meets
the linearity, normality and homoscedasticity assumptions and comment on the results.

D.2. Run a linear regression model with v7_9 as dependent variable and v6 [Size of the apartment] as
independent variable. Comment on the output results and evaluate the goodness of fit of the model.
Finally check with appropriate techniques whether this second simple linear regression model meets the
linearity, normality and homoscedasticity assumptions and comment on the results.

D.3. Finally, run a linear regression model with v7_9 as dependent variable and v1 [Age] and v6 [Size of
the apartment] as independent variables. Comment on the output results and evaluate the goodness of fit
of the model.

E. Prediction

E.1. By using the model in D.3, predict the value of v7_9 for a person aged 40 and living in a house sized
100 squared feet. Comment on it.

E.2. By using the model in D.3, predict the mean value of v7_9 for a person aged 40 and living in a house
sized 100 squared feet and also determine the 95% confidence interval for it. Comment on it.

A: Descriptive statistics of the sample

1. AGE: v1

table(Assignment_Data_2022$v1)
summary(Assignment_Data_2022$v1)

Min. 1st Qu. Median Mean 3rd Qu. Max. NA's


36.00 39.00 43.50 42.97 47.00 49.00 2

var(Assignment_Data_2022$v1, na.rm=TRUE)
19.18294

standard_deviation_age <- sd(Assignment_Data_2022$v1, na.rm=TRUE)

[1] 4.379833

boxplot(Assignment_Data_2022$v1, col="lightblue", main="Age Boxplot",
        ylab="Age (years)")

hist(Assignment_Data_2022$v1, col="brown1", main="Age and frequency",
     xlab="Age (years)")

plot(ecdf(Assignment_Data_2022$v1), col="brown1", main="CDF age",
     xlab="Age (years)", ylab="Cumulative Relative Frequencies")

IQR(Assignment_Data_2022$v1, na.rm=TRUE)

cv1 <- sd(Assignment_Data_2022$v1, na.rm=TRUE)/mean(Assignment_Data_2022$v1, na.rm=TRUE)
show(cv1)

Coefficient of variation

0.1019206
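The same summary measures are computed for every variable in this section. A minimal sketch of a helper that bundles them is given here; the function name describe_var and the toy vector are illustrative assumptions, not part of the original analysis.

```r
# Hypothetical helper: collects the summary measures used in section A.
describe_var <- function(x) {
  m <- mean(x, na.rm = TRUE)
  s <- sd(x, na.rm = TRUE)
  c(mean      = m,
    variance  = var(x, na.rm = TRUE),
    sd        = s,
    IQR       = IQR(x, na.rm = TRUE),
    cv        = s / m,              # coefficient of variation
    n_missing = sum(is.na(x)))      # number of NA values
}

# Toy data (made up for illustration), including one missing value.
toy_age <- c(36, 39, 43, 44, 47, 49, NA)
res <- describe_var(toy_age)
print(round(res, 4))
```

Applied to the real data, the call would be describe_var(Assignment_Data_2022$v1), and likewise for the other numeric variables.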

2. Do you have a partner/wife?

1 = Relationship and living together


2 = Relationship and not living together
3 = Single

table(Assignment_Data_2022$v2)
summary(Assignment_Data_2022$v2)

Min. 1st Qu. Median Mean 3rd Qu. Max.


1.00 1.00 2.00 1.98 3.00 3.00

var(Assignment_Data_2022$v2, na.rm=TRUE)

0.6908725

standard_deviation_v2 <- sd(Assignment_Data_2022$v2, na.rm=TRUE)


0.8311874

IQR(Assignment_Data_2022$v2, na.rm=TRUE)

2
cv2 <- sd(Assignment_Data_2022$v2, na.rm=TRUE)/mean(Assignment_Data_2022$v2,
na.rm=TRUE)
show(cv2)

0.4197916

3. Present professional career

1 = Employed
2 = Leave from work
3 = Unemployed
4 = Seeking a job
5 = Other

table(Assignment_Data_2022$v3)

38 25 30 26 31

summary(Assignment_Data_2022$v3)

Min. 1st Qu. Median Mean 3rd Qu. Max.


1.000 1.250 3.000 2.913 4.000 5.000
standard_deviation_v3 <- sd(Assignment_Data_2022$v3, na.rm=TRUE)

1.478874

hist(Assignment_Data_2022$v3, col="lightblue", main="Present professional career",
     xlab="Career status (1-5)")
#Author's note (translated): I could not change the x axis. Passing xlab inside
the hist() call, as above, sets the label.

IQR(Assignment_Data_2022$v3, na.rm=TRUE)

2.75

cv3 <- sd(Assignment_Data_2022$v3, na.rm=TRUE)/mean(Assignment_Data_2022$v3, na.rm=TRUE)
show(cv3)

0.5076228
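Because v3 is a nominal code, its mean, standard deviation and coefficient of variation have no direct interpretation; relative frequencies and the mode are more informative. The sketch below uses the five counts printed by table() above, assuming they correspond to codes 1 to 5 in order (the header row of the table is missing in the output).

```r
# Counts copied from the table(Assignment_Data_2022$v3) output above.
# Assumption: the counts are listed in the order of codes 1..5.
counts <- c(Employed = 38, `Leave from work` = 25, Unemployed = 30,
            `Seeking a job` = 26, Other = 31)

rel_freq <- counts / sum(counts)            # relative frequencies
freq_table <- data.frame(count   = counts,
                         percent = round(100 * rel_freq, 1))
print(freq_table)

# The mode is the only location measure meaningful for a nominal variable.
modal_category <- names(counts)[which.max(counts)]
print(modal_category)
```

On the raw data the same table could be obtained with prop.table(table(Assignment_Data_2022$v3)).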

4. Family Economic Status

1: We manage to save money at the end of the month
2: Only money to make a living
3: Not enough money to make a living
4: I don’t know

table(Assignment_Data_2022$v4)

1 2 3 4
36 28 51 31

summary(Assignment_Data_2022$v4)

Min. 1st Qu. Median Mean 3rd Qu. Max. NA's


1.000 2.000 3.000 2.527 3.000 4.000 4

var(Assignment_Data_2022$v4, na.rm=TRUE)
1.175106

standard_deviation_v4 <- sd(Assignment_Data_2022$v4, na.rm=TRUE)


standard_deviation_v4

1.084023

boxplot(Assignment_Data_2022$v4, col="brown1", main="Family Economic Status",
        ylab="Situation")

hist(Assignment_Data_2022$v4, col="lightblue", main="Family economic status",
     xlab="Levels")

IQR(Assignment_Data_2022$v4, na.rm=TRUE)

cv4 <- sd(Assignment_Data_2022$v4, na.rm=TRUE)/mean(Assignment_Data_2022$v4, na.rm=TRUE)
show(cv4)

0.4289089

5. Do you have any children?

1 = Yes
0 = No

table(Assignment_Data_2022$v5)

0 1
74 68

var(Assignment_Data_2022$v5, na.rm=TRUE)

0.2513235

standard_deviation_v5 <- sd(Assignment_Data_2022$v5, na.rm=TRUE)


standard_deviation_v5

0.5013218

cv5 <- sd(Assignment_Data_2022$v5, na.rm=TRUE)/mean(Assignment_Data_2022$v5, na.rm=TRUE)
show(cv5)

1.046878

6. Size of apartment

summary(Assignment_Data_2022$v6)

Min. 1st Qu. Median Mean 3rd Qu. Max. NA's


50.00 86.25 117.50 121.14 158.00 198.00 4

var(Assignment_Data_2022$v6, na.rm=TRUE)
1772.436

standard_deviation_v6 <- sd(Assignment_Data_2022$v6, na.rm=TRUE)


standard_deviation_v6

42.10031

boxplot(Assignment_Data_2022$v6, col="lightblue",
        main="Size of the apartment in squared feet",
        ylab="Square feet")

hist(Assignment_Data_2022$v6, col="brown1", main="Size of apartment in squared feet",
     xlab="Square feet")

plot(ecdf(Assignment_Data_2022$v6), col="brown1", main="CDF squared feet",
     xlab="Square feet", ylab="Cumulative Relative Frequencies")

IQR(Assignment_Data_2022$v6, na.rm=TRUE)

71.75

cv6 <- sd(Assignment_Data_2022$v6, na.rm=TRUE)/mean(Assignment_Data_2022$v6, na.rm=TRUE)
show(cv6)

0.347543

v7_1: Loyalty
how important is it:
1 = Not important at all
10 = Extremely important

table(Assignment_Data_2022$v7_1)

1 2 3 4 5 6 7 8 9 10
19 11 15 14 20 14 16 8 17 15

summary(Assignment_Data_2022$v7_1)

Min. 1st Qu. Median Mean 3rd Qu. Max. NA's


1.000 3.000 5.000 5.403 8.000 10.000 1
var(Assignment_Data_2022$v7_1, na.rm=TRUE)

8.444858

standard_deviation_7_1 <- sd(Assignment_Data_2022$v7_1, na.rm=TRUE)


standard_deviation_7_1

2.906004

boxplot(Assignment_Data_2022$v7_1, col="brown1",
        main="How much do you value loyalty in a relationship from 1 to 10",
        ylab="Scale")

hist(Assignment_Data_2022$v7_1, col="lightblue", main="Importance of loyalty",
     xlab="Values")

plot(ecdf(Assignment_Data_2022$v7_1), col="brown1", main="CDF loyalty",
     xlab="Importance of loyalty", ylab="Cumulative Relative Frequencies")

IQR(Assignment_Data_2022$v7_1, na.rm=TRUE)

cv7_1 <- sd(Assignment_Data_2022$v7_1, na.rm=TRUE)/mean(Assignment_Data_2022$v7_1, na.rm=TRUE)
show(cv7_1)

0.5378814

V7_2: Having the same social background

table(Assignment_Data_2022$v7_2)

1 2 3 4 5 6 7 8 9 10
15 18 18 12 17 18 9 13 12 16

summary(Assignment_Data_2022$v7_2)

Min. 1st Qu. Median Mean 3rd Qu. Max. NA's


1.000 3.000 5.000 5.277 8.000 10.000 2

var(Assignment_Data_2022$v7_2, na.rm=TRUE)

8.419333

standard_deviation_v7_2 <- sd(Assignment_Data_2022$v7_2, na.rm=TRUE)


standard_deviation_v7_2
2.901609

boxplot(Assignment_Data_2022$v7_2, col="lightblue",
        main="Same social background importance Boxplot",
        ylab="Importance")

hist(Assignment_Data_2022$v7_2, col="lightblue",
     main="Importance of having the same social background",
     xlab="Importance (scale)")

plot(ecdf(Assignment_Data_2022$v7_2), col="lightblue",
     main="CDF having the same social background importance",
     xlab="Importance", ylab="Cumulative Relative Frequencies")

IQR(Assignment_Data_2022$v7_2, na.rm=TRUE)

cv7_2 <- sd(Assignment_Data_2022$v7_2, na.rm=TRUE)/mean(Assignment_Data_2022$v7_2, na.rm=TRUE)
show(cv7_2)

0.5498567

V7_3 = Giving importance to the family nest

table(Assignment_Data_2022$v7_3)

1 2 3 4 5 6 7 8 9 10
11 19 7 18 18 12 20 16 11 15

summary(Assignment_Data_2022$v7_3)

Min. 1st Qu. Median Mean 3rd Qu. Max. NA's


1.000 3.500 6.000 5.585 8.000 10.000 3

var(Assignment_Data_2022$v7_3, na.rm=TRUE)

7.737583

standard_deviation_v7_3 <- sd(Assignment_Data_2022$v7_3, na.rm=TRUE)


standard_deviation_v7_3

2.781651

boxplot(Assignment_Data_2022$v7_3, col="lightblue",
        main="Importance given to the family nest",
        ylab="Scale")

hist(Assignment_Data_2022$v7_3, col="lightblue",
     main="Importance given to the family nest",
     xlab="Scale")

plot(ecdf(Assignment_Data_2022$v7_3), col="red",
     main="CDF importance given to the family nest",
     xlab="Scale", ylab="Cumulative Relative Frequencies")

IQR(Assignment_Data_2022$v7_3, na.rm=TRUE)

4.5

cv7_3 <- sd(Assignment_Data_2022$v7_3, na.rm=TRUE)/mean(Assignment_Data_2022$v7_3, na.rm=TRUE)
show(cv7_3)

0.4980545

V7_4 = How important is having appropriate living conditions?

table(Assignment_Data_2022$v7_4)

1 2 3 4 5 6 7 8 9 10
17 14 11 10 14 14 15 15 12 21

summary(Assignment_Data_2022$v7_4)

Min. 1st Qu. Median Mean 3rd Qu. Max. NA's


1.000 3.000 6.000 5.699 8.000 10.000 7

var(Assignment_Data_2022$v7_4, na.rm=TRUE)

9.21176

standard_deviation_v7_4 <- sd(Assignment_Data_2022$v7_4, na.rm=TRUE)


standard_deviation_v7_4

3.035088
boxplot(Assignment_Data_2022$v7_4, col="lightblue",
        main="Importance of having appropriate living conditions for a relationship",
        ylab="Scale")

hist(Assignment_Data_2022$v7_4, col="lightblue",
     main="Importance of having appropriate living conditions for a relationship",
     xlab="Scale")

IQR(Assignment_Data_2022$v7_4, na.rm=TRUE)

cv7_4 <- sd(Assignment_Data_2022$v7_4, na.rm=TRUE)/mean(Assignment_Data_2022$v7_4, na.rm=TRUE)
show(cv7_4)

0.5325369
V7_5: How important is sharing the same cultural values?

table(Assignment_Data_2022$v7_5)

1 2 3 4 5 6 7 8 9 10
12 19 11 19 10 8 18 13 15 22

summary(Assignment_Data_2022$v7_5)

Min. 1st Qu. Median Mean 3rd Qu. Max. NA's


1.000 3.000 6.000 5.728 8.500 10.000 3

var(Assignment_Data_2022$v7_5, na.rm=TRUE)

9.144628

standard_deviation_v7_5 <- sd(Assignment_Data_2022$v7_5, na.rm=TRUE)


standard_deviation_v7_5

3.024009

boxplot(Assignment_Data_2022$v7_5, col="blue",
        main="Importance of sharing the same cultural values",
        ylab="Scale")

hist(Assignment_Data_2022$v7_5, col="orange",
     main="Importance of sharing the same cultural values",
     xlab="Scale")

IQR(Assignment_Data_2022$v7_5, na.rm=TRUE)

5.5

cv7_5 <- sd(Assignment_Data_2022$v7_5, na.rm=TRUE)/mean(Assignment_Data_2022$v7_5, na.rm=TRUE)
show(cv7_5)

0.5279445

V7_6: Importance of living far from the families of origin

table(Assignment_Data_2022$v7_6)

1 2 3 4 5 6 7 8 9 10
13 12 16 14 17 14 13 14 16 19

summary(Assignment_Data_2022$v7_6)

Min. 1st Qu. Median Mean 3rd Qu. Max. NA's


1.000 3.000 6.000 5.723 8.000 10.000 2

var(Assignment_Data_2022$v7_6, na.rm=TRUE)
8.432938

standard_deviation_v7_6 <- sd(Assignment_Data_2022$v7_6, na.rm=TRUE)


standard_deviation_v7_6

2.903952

boxplot(Assignment_Data_2022$v7_6, col="lightgreen",
        main="Living far from families of origin Boxplot",
        ylab="Scale")

hist(Assignment_Data_2022$v7_6, col="lightgreen",
     main="Living far from families of origin Histogram",
     xlab="Scale")

plot(ecdf(Assignment_Data_2022$v7_6), col="red",
     main="CDF living far from families of origin",
     xlab="Scale", ylab="Cumulative Relative Frequencies")

IQR(Assignment_Data_2022$v7_6, na.rm=TRUE)

cv7_6 <- sd(Assignment_Data_2022$v7_6, na.rm=TRUE)/mean(Assignment_Data_2022$v7_6, na.rm=TRUE)
show(cv7_6)

0.5074202

V7_7: Importance of living in a safe place

table(Assignment_Data_2022$v7_7)

1 2 3 4 5 6 7 8 9 10
17 14 12 22 21 12 9 14 17 12

summary(Assignment_Data_2022$v7_7)

Min. 1st Qu. Median Mean 3rd Qu. Max.


1.000 3.000 5.000 5.293 8.000 10.000

var(Assignment_Data_2022$v7_7, na.rm=TRUE)

8.061029

standard_deviation_v7_7 <- sd(Assignment_Data_2022$v7_7, na.rm=TRUE)


standard_deviation_v7_7
2.839195
boxplot(Assignment_Data_2022$v7_7, col="lightgreen",
        main="Importance of living in a safe place Boxplot",
        ylab="Scale")

hist(Assignment_Data_2022$v7_7, col="blue",
     main="Importance of living in a safe place and frequencies",
     xlab="Scale")

IQR(Assignment_Data_2022$v7_7, na.rm=TRUE)

cv7_7 <- sd(Assignment_Data_2022$v7_7, na.rm=TRUE)/mean(Assignment_Data_2022$v7_7, na.rm=TRUE)
show(cv7_7)

0.5363719
V7_8: Sharing housing responsibilities

table(Assignment_Data_2022$v7_8)

1 2 3 4 5 6 7 8 9 10
12 13 7 18 16 15 14 12 19 18

summary(Assignment_Data_2022$v7_8)

Min. 1st Qu. Median Mean 3rd Qu. Max. NA's


1.000 4.000 6.000 5.875 9.000 10.000 6

var(Assignment_Data_2022$v7_8, na.rm=TRUE)

8.236014

standard_deviation_7_8 <- sd(Assignment_Data_2022$v7_8, na.rm=TRUE)


standard_deviation_7_8

2.869846

boxplot(Assignment_Data_2022$v7_8, col="lightgreen",
        main="Importance of sharing housing responsibilities Boxplot",
        ylab="Scale")

hist(Assignment_Data_2022$v7_8, col="orange",
     main="Importance of sharing housing responsibilities",
     xlab="Scale")

IQR(Assignment_Data_2022$v7_8, na.rm=TRUE)

cv7_8 <- sd(Assignment_Data_2022$v7_8, na.rm=TRUE)/mean(Assignment_Data_2022$v7_8, na.rm=TRUE)
show(cv7_8)

0.4884844

V7_9: Importance of having children

table(Assignment_Data_2022$v7_9)

1 2 3 4 5 6 7 8 9 10
23 17 13 13 9 17 14 14 12 17

summary(Assignment_Data_2022$v7_9)

Min. 1st Qu. Median Mean 3rd Qu. Max. NA's


1.000 2.000 5.000 5.255 8.000 10.000 1

var(Assignment_Data_2022$v7_9, na.rm=TRUE)

9.407491

standard_deviation_v7_9 <- sd(Assignment_Data_2022$v7_9, na.rm=TRUE)


standard_deviation_v7_9

3.067163

boxplot(Assignment_Data_2022$v7_9, col="orange",
        main="Importance of having children Boxplot",
        ylab="Level of importance")

hist(Assignment_Data_2022$v7_9, col="orange", main="Importance of having children",
     xlab="Scale")

IQR(Assignment_Data_2022$v7_9, na.rm=TRUE)

cv7_9 <- sd(Assignment_Data_2022$v7_9, na.rm=TRUE)/mean(Assignment_Data_2022$v7_9, na.rm=TRUE)
show(cv7_9)

0.583662
V7_10: Being open to face problems

table(Assignment_Data_2022$v7_10)

1 2 3 4 5 6 7 8 9 10
18 17 9 25 13 9 19 14 7 18

summary(Assignment_Data_2022$v7_10)

Min. 1st Qu. Median Mean 3rd Qu. Max. NA's


1.000 3.000 5.000 5.275 8.000 10.000 1

var(Assignment_Data_2022$v7_10, na.rm=TRUE)

8.538636

standard_deviation_v7_10 <- sd(Assignment_Data_2022$v7_10, na.rm=TRUE)


standard_deviation_v7_10

2.922094

boxplot(Assignment_Data_2022$v7_10, col="orange",
        main="Importance of being open to face problems",
        ylab="Scale")

hist(Assignment_Data_2022$v7_10, col="orange",
     main="Importance of being open to face problems",
     xlab="Scale")

IQR(Assignment_Data_2022$v7_10, na.rm=TRUE)

cv7_10 <- sd(Assignment_Data_2022$v7_10, na.rm=TRUE)/mean(Assignment_Data_2022$v7_10, na.rm=TRUE)
show(cv7_10)

0.5539339

V7_11: Importance of having time to meet friends

table(Assignment_Data_2022$v7_11)

1 2 3 4 5 6 7 8 9 10
12 19 12 18 10 12 19 9 20 19

summary(Assignment_Data_2022$v7_11)

Min. 1st Qu. Median Mean 3rd Qu. Max.


1.0 3.0 6.0 5.7 9.0 10.0

var(Assignment_Data_2022$v7_11, na.rm=TRUE)

8.855705
standard_deviation_7_11 <- sd(Assignment_Data_2022$v7_11, na.rm=TRUE)
standard_deviation_7_11

2.975854

boxplot(Assignment_Data_2022$v7_11, col="orange",
        main="Having time to meet friends Boxplot",
        ylab="Scale")

plot(ecdf(Assignment_Data_2022$v7_11), col="red",
     main="CDF importance of having time to meet friends",
     xlab="Scale", ylab="Cumulative Relative Frequencies")

IQR(Assignment_Data_2022$v7_11, na.rm=TRUE)

cv7_11 <- sd(Assignment_Data_2022$v7_11, na.rm=TRUE)/mean(Assignment_Data_2022$v7_11, na.rm=TRUE)
show(cv7_11)

0.5220796

B: Confidence Intervals

#Exercise 1
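> age <- Assignment_Data_2022$v1
> apt_size <- Assignment_Data_2022$v6
> children_wanting <- Assignment_Data_2022$v7_9
> children_yesorno <- Assignment_Data_2022$v5 #grouping variable used in the two-sample tests below (0 = no children, 1 = children)
> t.test(age, conf.level = 0.95)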

One Sample t-test

data: age
t = 119.36, df = 147, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
42.26149 43.68446
sample estimates:
mean of x
42.97297
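As a check on the interval above, the same bounds can be rebuilt by hand from the sample mean, the standard deviation reported in section A, and n = 148 (since df = n - 1 = 147). This is only a sketch of the arithmetic behind t.test(), using the rounded figures printed in this report.

```r
# Values copied from the output above (rounded as printed).
x_bar <- 42.97297    # sample mean of age
s     <- 4.379833    # sample standard deviation of age
n     <- 148         # non-missing observations (df + 1)

se     <- s / sqrt(n)                  # standard error of the mean
t_crit <- qt(0.975, df = n - 1)        # two-sided 95% critical value
ci     <- x_bar + c(-1, 1) * t_crit * se
print(ci)
```

The result agrees with the interval printed by t.test() up to the rounding of the inputs.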
#Exercise 2
> t.test(apt_size, conf.level = 0.95)

One Sample t-test

data: apt_size
t = 34.767, df = 145, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
114.2505 128.0235
sample estimates:
mean of x
121.137

#Exercise 3
> t.test(age ~ children_yesorno, conf.level = 0.95)

Welch Two Sample t-test

data: age by children_yesorno


t = -0.51023, df = 137.25, p-value = 0.6107
alternative hypothesis: true difference in means between group 0 and group 1
is not equal to 0
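95 percent confidence interval:
-1.857274 1.095389
sample estimates:
mean in group 0 mean in group 1
42.78082 43.16176
#The 95% confidence interval for the difference in mean age (men without
children minus men with children) contains 0, so the data do not show a
difference in mean age between the two groups at this confidence level.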
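The Welch interval for a difference in means can be reproduced step by step. The sketch below does so on simulated data, because the group standard deviations are not reported here; the vectors g0 and g1 and their parameters are made up for illustration, and only the group sizes echo the table of v5.

```r
# Simulated groups (illustrative only; not the assignment data).
set.seed(42)
g0 <- rnorm(74, mean = 43, sd = 4.4)   # stand-in for men without children
g1 <- rnorm(68, mean = 43, sd = 4.4)   # stand-in for men with children

v0 <- var(g0) / length(g0)             # squared standard error, group 0
v1 <- var(g1) / length(g1)             # squared standard error, group 1
se <- sqrt(v0 + v1)

# Welch-Satterthwaite degrees of freedom
df_welch <- (v0 + v1)^2 /
  (v0^2 / (length(g0) - 1) + v1^2 / (length(g1) - 1))

diff_means <- mean(g0) - mean(g1)
ci_manual  <- diff_means + c(-1, 1) * qt(0.975, df_welch) * se

ref <- t.test(g0, g1, conf.level = 0.95)   # Welch test is the default
print(ci_manual)
print(ref$conf.int)
```

The non-integer degrees of freedom in the t.test() output (for example 137.25) come from this Welch-Satterthwaite approximation.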

C: Hypothesis testing

C1

> t.test(age, mu=42, conf.level=0.99)

One Sample t-test

data: age
t = 2.7026, df = 147, p-value = 0.007691
alternative hypothesis: true mean is not equal to 42
99 percent confidence interval:
42.03343 43.91251
sample estimates:
mean of x
42.97297
#We reject the null hypothesis at α = 0.01 because the p-value (0.007691) is
smaller than 0.01; consistently, the 99% confidence interval does not contain
42. The mean age is significantly different from 42 years.
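The test statistic and the two-sided p-value of this test can be recomputed from the summary figures already reported (mean, standard deviation, n = 148). The sketch below shows the arithmetic; it uses rounded inputs, so the last digits may differ slightly from the t.test() output.

```r
x_bar <- 42.97297    # sample mean of age
s     <- 4.379833    # sample standard deviation of age
n     <- 148         # non-missing observations
mu0   <- 42          # value under the null hypothesis
alpha <- 0.01

t_stat  <- (x_bar - mu0) / (s / sqrt(n))
p_value <- 2 * pt(abs(t_stat), df = n - 1, lower.tail = FALSE)  # two-sided
reject  <- p_value < alpha

print(c(t = t_stat, p = p_value))
print(reject)
```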

C2

> t.test(apt_size, mu = 120, alternative = "greater", conf.level = 0.9)

One Sample t-test

data: apt_size
t = 0.32632, df = 145, p-value = 0.3723
alternative hypothesis: true mean is greater than 120
90 percent confidence interval:
116.6513 Inf
sample estimates:
mean of x
121.137

> t.test(apt_size, mu = 120, alternative = "greater", conf.level = 0.95)

One Sample t-test

data: apt_size
t = 0.32632, df = 145, p-value = 0.3723
alternative hypothesis: true mean is greater than 120
95 percent confidence interval:
115.3691 Inf
sample estimates:
mean of x
121.137
> t.test(apt_size, mu = 120, alternative = "greater", conf.level = 0.99)

One Sample t-test

data: apt_size
t = 0.32632, df = 145, p-value = 0.3723
alternative hypothesis: true mean is greater than 120
99 percent confidence interval:
112.9409 Inf
sample estimates:
mean of x
121.137

#Since the p-value (0.3723) is larger than every α considered (0.1, 0.05,
0.01), we do not have enough empirical evidence to reject the null hypothesis
at any of these levels. The test statistic and the p-value do not depend on α;
only the one-sided confidence interval changes: as α decreases (that is, as
the confidence level increases) its lower bound moves further below the sample
mean (116.6513, 115.3691, 112.9409), and in every case it lies below 120.
We therefore cannot conclude that the mean size of the apartments is greater
than 120 squared feet.
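The comparison of one p-value against several significance levels can be laid out in a small table. The sketch below recomputes the one-sided p-value from the reported mean, standard deviation and n = 146 (df = 145), then applies the decision rule for each α; it relies on rounded inputs.

```r
x_bar <- 121.137     # sample mean of apartment size
s     <- 42.10031    # sample standard deviation of apartment size
n     <- 146         # non-missing observations
mu0   <- 120         # value under the null hypothesis

t_stat  <- (x_bar - mu0) / (s / sqrt(n))
p_value <- pt(t_stat, df = n - 1, lower.tail = FALSE)  # H1: mean > 120

alphas   <- c(0.10, 0.05, 0.01)
decision <- ifelse(p_value < alphas, "reject H0", "do not reject H0")
print(data.frame(alpha = alphas, p_value = round(p_value, 4), decision))
```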

C.3. Is the difference in the mean value of the size of the apartment significantly different between
men with children and men without children? Use α=0.01. Comment on the results.

> t.test( apt_size ~ children_yesorno, conf.level = 0.99)

Welch Two Sample t-test

data: apt_size by children_yesorno


t = 1.5034, df = 133.78, p-value = 0.1351
alternative hypothesis: true difference in means between group 0 and group 1
is not equal to 0
99 percent confidence interval:
-7.992637 29.650434
sample estimates:
mean in group 0 mean in group 1
127.0135 116.1846
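#Since the p-value (0.1351) is larger than α = 0.01, we do not have enough
empirical evidence to reject the null hypothesis: the difference in mean
apartment size between men without children (group 0) and men with children
(group 1) is not statistically significant. Consistently, the 99% confidence
interval for the difference contains 0.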

C.4. Is the difference in the mean value of age significantly different between men with children
and men without children? Use α=0.01. Comment on the results.

> t.test(age ~ children_yesorno, conf.level = 0.99)

Welch Two Sample t-test

data: age by children_yesorno


t = -0.51023, df = 137.25, p-value = 0.6107
alternative hypothesis: true difference in means between group 0 and group 1
is not equal to 0
99 percent confidence interval:
-2.331162 1.569276
sample estimates:
mean in group 0 mean in group 1
42.78082 43.16176
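#Since the p-value (0.6107) is larger than α = 0.01, we do not have enough
empirical evidence to reject the null hypothesis: the difference in mean age
between men without children (group 0) and men with children (group 1) is not
statistically significant. Consistently, the 99% confidence interval for the
difference contains 0.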

D: Linear Regression exercise


> #Exercise 1
> m1<- lm(formula = children_wanting ~ age, data = Assignment_Data_2022)
> summary(m1)

Call:
lm(formula = children_wanting ~ age, data = Assignment_Data_2022)

Residuals:
Min 1Q Median 3Q Max
-4.2887 -3.1470 -0.1341 2.7886 4.8787

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.65759 2.50136 1.862 0.0646
age 0.01288 0.05793 0.222 0.8244
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.074 on 145 degrees of freedom


(3 observations deleted due to missingness)
Multiple R-squared: 0.0003408, Adjusted R-squared: -0.006553
F-statistic: 0.04944 on 1 and 145 DF, p-value: 0.8244

# The residuals are roughly centred on zero (median -0.1341) and range from -4.2887 to 4.8787. This
spread is wide relative to a 1-10 scale, so individual predictions can fall far from the observed scores.
The intercept 4.65759 is the predicted importance score for a man of age 0. Since the sample only
covers ages 36 to 49, it is an extrapolation and has no practical interpretation on its own.
The slope 0.01288 is the estimated change in the “children importance score” for each additional
year of age. Its standard error, 0.05793, measures the uncertainty of this estimate. The t-value (0.222)
is the estimate divided by its standard error, that is, how many standard errors the estimate lies from
0. A t-value far from zero would let us reject the null hypothesis that the slope is 0 and declare that a
relationship between children importance and age exists; here it is very close to 0. The p-value is the
probability, under the null hypothesis, of observing a t-value at least as extreme in absolute value as
the one obtained. For the slope it is 0.8244, far above 0.05, so we cannot reject the null hypothesis:
there is no evidence of a linear relationship between age and the importance given to having children.
The Residual Standard Error, 3.074, is the typical distance between the observed response (children)
and the fitted regression line. The R-squared statistic measures the share of the variability of the
response explained by the model; the closer it is to 1 the better. Here it is 0.0003408, so the model
explains about 0.03% of the variability and the goodness of fit is very poor.
The F-statistic (0.04944, p-value 0.8244) tests whether the model explains more than a model with
the intercept only. It is well below 1 and not significant, which confirms that age is not a useful
predictor of the score.
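In a simple linear regression the overall F-statistic is the square of the slope's t-value, and R-squared can be recovered as F / (F + residual df). The sketch below checks both identities with the figures printed in the summary(m1) output; the inputs are rounded, so the agreement is approximate.

```r
t_slope <- 0.222      # t value of the age coefficient (as printed)
f_stat  <- 0.04944    # F-statistic (as printed)
df_res  <- 145        # residual degrees of freedom

f_from_t  <- t_slope^2                  # F = t^2 with one predictor
r_squared <- f_stat / (f_stat + df_res) # R^2 = F / (F + df) with one predictor

print(f_from_t)
print(r_squared)
```

This is why the p-value of the F-test (0.8244) coincides with the p-value of the slope in a model with a single predictor.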
> plot(m1)
[Figure: Residuals vs Fitted, lm(children_wanting ~ age); x-axis: Fitted values, y-axis: Residuals]
#The plot doesn’t follow a horizontal line, so it looks like the linearity
assumption is not verified.
[Figure: Normal Q-Q, lm(children_wanting ~ age); x-axis: Theoretical Quantiles, y-axis: Standardized residuals]

#The normal probability plot of residuals approximately follows a straight
line, so we can assume Normality.
[Figure: Scale-Location, lm(children_wanting ~ age); x-axis: Fitted values, y-axis: Standardized residuals]

#It can be seen that the variability (variances) of the residual points
doesn’t increase with the value of the fitted outcome variable, suggesting
constant variances in the residuals errors (or homoscedasticity).
> #Exercise 2
> m2 <- lm(formula = children_wanting ~ apt_size, data = Assignment_Data_2022)
> summary(m2)

Call:
lm(formula = children_wanting ~ apt_size, data = Assignment_Data_2022)

Residuals:
Min 1Q Median 3Q Max
-4.7791 -2.4263 -0.1017 2.2673 4.8885

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.672100 0.742584 6.292 3.66e-09 ***
apt_size 0.008451 0.005766 1.466 0.145
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.895 on 142 degrees of freedom


(6 observations deleted due to missingness)
Multiple R-squared: 0.0149, Adjusted R-squared: 0.007964
F-statistic: 2.148 on 1 and 142 DF, p-value: 0.145

# The residuals are roughly centred on zero (median -0.1017) and fairly symmetric, ranging from
-4.7791 to 4.8885. This spread is still wide relative to a 1-10 scale.
The intercept 4.6721 is the predicted importance score for an apartment of size 0. Since the observed
sizes range from 50 to 198, it is an extrapolation and has no practical interpretation on its own; its
small p-value only says that it differs from 0.
The slope 0.008451 is the estimated change in the “children importance score” for each additional
squared foot of apartment size. Its standard error, 0.005766, measures the uncertainty of this
estimate. The t-value (1.466) is the estimate divided by its standard error, that is, how many
standard errors the estimate lies from 0. A t-value far from zero would let us reject the null
hypothesis that the slope is 0 and declare a relationship between children importance and apartment
size. The p-value is the probability, under the null hypothesis, of observing a t-value at least as
extreme in absolute value as the one obtained. For the slope it is 0.145, above 0.05 (and above 0.1),
so we cannot reject the null hypothesis: there is no significant evidence of a linear relationship.
The Residual Standard Error, 2.895, is the typical distance between the observed response (children)
and the fitted regression line. The R-squared statistic measures the share of the variability of the
response explained by the model; the closer it is to 1 the better. Here it is 0.0149, so the model
explains about 1.5% of the variability and the goodness of fit is poor.
The F-statistic (2.148, p-value 0.145) tests whether the model explains more than a model with the
intercept only. It is above 1 but not significant, so apartment size alone is not a useful predictor.
plot(m2)

[Figure: Residuals vs Fitted, lm(children_wanting ~ apt_size); x-axis: Fitted values, y-axis: Residuals]
#The plot stays around 0 but doesn’t really follow a straight line. Linearity
may not be satisfied.
[Figure: Normal Q-Q, lm(children_wanting ~ apt_size); x-axis: Theoretical Quantiles, y-axis: Standardized residuals]
#The normal probability plot of residuals approximately follows a straight
line, so we can assume Normality.
[Figure: Scale-Location, lm(children_wanting ~ apt_size); x-axis: Fitted values, y-axis: Standardized residuals]
#The variability of the residuals appears to increase with the value of the
fitted outcome variable, suggesting non-constant variance of the errors
(heteroscedasticity). A possible remedy is a log or square root transformation
of the outcome variable (y).
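A visual reading of the Scale-Location plot can be complemented by a formal check. The sketch below computes a studentized (Koenker) form of the Breusch-Pagan statistic with base R only, on simulated data whose error spread grows with the predictor; the data and coefficients are made up for illustration and are not the assignment data. The lmtest package also provides bptest() for the same purpose.

```r
# Simulated data (illustrative only): error sd increases with x.
set.seed(1)
n <- 144
x <- runif(n, 50, 198)                            # stand-in for apartment size
y <- 4.7 + 0.008 * x + rnorm(n, sd = 0.02 * x)    # heteroscedastic errors

fit <- lm(y ~ x)

# Auxiliary regression of squared residuals on the predictor.
aux     <- lm(residuals(fit)^2 ~ x)
bp_stat <- n * summary(aux)$r.squared             # n * R^2 of the auxiliary fit
p_value <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

print(c(BP = bp_stat, p = p_value))
```

A small p-value would point to heteroscedasticity; on the real data the auxiliary regression would use residuals(m2) and apt_size restricted to the rows kept by lm().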
> #Exercise 3
> m3 <-lm(formula= children_wanting ~ age + apt_size, data = Assignment_Data_2022)
> summary(m3)

Call:
lm(formula = children_wanting ~ age + apt_size, data = Assignment_Data_2022)

Residuals:
Min 1Q Median 3Q Max
-5.0579 -2.4534 0.0841 2.3927 5.3139

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.972912 2.521399 0.782 0.435
age 0.063309 0.055714 1.136 0.258
apt_size 0.008190 0.005827 1.406 0.162

Residual standard error: 2.908 on 139 degrees of freedom


(8 observations deleted due to missingness)
Multiple R-squared: 0.02255, Adjusted R-squared: 0.00849
F-statistic: 1.604 on 2 and 139 DF, p-value: 0.2049

# The residuals are roughly centred on zero (median 0.0841) and fairly symmetric, ranging from
-5.0579 to 5.3139. This spread is wide relative to a 1-10 scale.
The intercept 1.972912 is the predicted importance score for a man of age 0 living in an
apartment of size 0; it is an extrapolation far outside the observed data and has no practical
interpretation on its own.
Each slope is the estimated change in the “children importance score” for a one-unit increase
in that predictor, holding the other predictor constant: 0.063309 per additional year of age and
0.008190 per additional squared foot. The standard errors (0.055714 and 0.005827) measure the
uncertainty of these estimates. Each t-value is the estimate divided by its standard error, that
is, how many standard errors the estimate lies from 0; values far from zero would let us reject
the null hypothesis that the coefficient is 0. The p-value is the probability, under the null
hypothesis, of observing a t-value at least as extreme in absolute value as the one obtained.
Here the p-values are 0.258 for age and 0.162 for apartment size, both above 0.05, so we cannot
reject the null hypothesis for either predictor. The Residual Standard Error, 2.908, is the
typical distance between the observed response (children) and the fitted values. The R-squared
statistic measures the share of the variability of the response explained by the model; the
closer it is to 1 the better. Here it is 0.02255 (adjusted 0.00849), so the model explains
about 2% of the variability and the goodness of fit is poor.
The F-statistic (1.604 on 2 and 139 DF, p-value 0.2049) tests whether the two predictors jointly
explain more than a model with the intercept only. It is not significant, so age and apartment
size together are not useful predictors of the score.
> plot(m3)

[Figure: Residuals vs Fitted, lm(children_wanting ~ age + apt_size); x-axis: Fitted values, y-axis: Residuals]
#even if the plot varies a bit with the fitted values, it looks stable and around 0. Linearity seems
to be satisfied.
[Figure: Normal Q-Q, lm(children_wanting ~ age + apt_size); x-axis: Theoretical Quantiles, y-axis: Standardized residuals]
#The normal probability plot of residuals approximately follows a straight
line, so we can assume Normality.
[Figure: Scale-Location, lm(children_wanting ~ age + apt_size); x-axis: Fitted values, y-axis: Standardized residuals]
#It can be seen that the variability (variances) of the residual points doesn’t increase with the value of the
fitted outcome variable, suggesting constant variances in the residuals errors (or homoscedasticity).

E. Prediction

E.1. By using the model in D.3, predict the value of v7_9 for a person aged 40 and living in a house sized
100 squared feet. Comment on it.

E.2. By using the model in D.3, predict the mean value of v7_9 for a person aged 40 and living in a house
sized 100 squared feet and also determine the 95% confidence interval for it. Comment on it.
E1:
> predict(m3, newdata = data.frame(age = 40, apt_size = 100))
1
5.071264

#The predicted score for a man aged 40 living in a 100 squared feet house is about 5.07, close to the
midpoint of the 1-10 scale: such a man is predicted to be roughly indifferent about having children.
E2:
> predict(m3, newdata = data.frame(age = 40, apt_size = 100), interval = "confidence", level = 0.95)
fit lwr upr
1 5.071264 4.400601 5.741926

#With 95% confidence, the mean score assigned to children importance by men aged 40 living in a
100 squared feet house lies between 4.400601 and 5.741926. This interval refers to the mean of
that group, not to a single individual, and it is still centred around indifference.
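E.1 concerns a single individual while E.2 concerns the mean of all individuals with the same characteristics, and predict() distinguishes the two through its interval argument. The sketch below contrasts interval = "confidence" with interval = "prediction" on simulated data; the data frame sim and its coefficients are made up for illustration and are not the assignment data.

```r
# Simulated data (illustrative only).
set.seed(123)
n   <- 142
sim <- data.frame(age      = sample(36:49, n, replace = TRUE),
                  apt_size = runif(n, 50, 198))
sim$score <- 2 + 0.06 * sim$age + 0.008 * sim$apt_size + rnorm(n, sd = 2.9)

fit     <- lm(score ~ age + apt_size, data = sim)
new_obs <- data.frame(age = 40, apt_size = 100)

# Interval for the MEAN response at these predictor values (as in E.2).
ci_mean <- predict(fit, newdata = new_obs, interval = "confidence", level = 0.95)
# Interval for a SINGLE new individual (relevant to E.1).
pi_ind  <- predict(fit, newdata = new_obs, interval = "prediction", level = 0.95)

print(ci_mean)
print(pi_ind)
```

Both calls return the same point estimate; the prediction interval is wider because it adds the variability of an individual observation around the mean.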
