ASSIGNMENT
Valerio Langè
(valerio.lange@unibocconi.it)
Report in the table here below Name, Surname and ID Number for each member of your group.
Germany's demographic evolution is fuelling a strong political debate in the
country, and a new National Family Plan will soon be designed. The “German Observatory on Demography”
has commissioned your research group to conduct a survey exploring the factors that influence
cultural values among German men. Part of the results will be used to design the new “National
Family Plan”, which is intended to support young couples in starting a family.
After a consultation phase with the competent Officers of the “German Observatory on Demography” you
decide to implement a survey collecting data on a sample of 150 men aged 35-50 years old with a
specifically designed questionnaire.
You agree upon a shared version of the questionnaire with the above mentioned Policy Officers. You can
find the questionnaire attached to the case study.
The “German Observatory on Demography” delivers the questionnaire to 150 German men aged 35-50,
and once the data are collected the dataset is released to your group.
The objective is now to prepare a Research Report for the “German Observatory on Demography”
following the instructions reported here below.
Instructions:
Always report in the text the R-Studio output generated and all procedures and commands used. In case
this information is missing a score will not be assigned.
When you have completed the assignment send the final document in .doc or .docx format and the
database used to the following email address: valerio.lange@unibocconi.it
The assignment must be delivered by 11:59pm on 16th of December 2022. Assignments delivered later
will not be considered.
For any question related to the assignment please refer to the following email address:
valerio.lange@unibocconi.it
A.1. Report the main information for each variable (e.g.: frequency distribution table, average and/or other
summary measures, etc.). Comment on results.
B. Confidence intervals
B.1. Compute the 95% confidence interval for the mean age of men. Also comment on it.
B.2. Compute the 95% confidence interval for the mean size of the apartment. Also comment on it.
B.3. Compute the 95% confidence interval for the difference in the mean age between men who have
children and men who don’t have children. Also comment on it.
C. Hypothesis testing
C.1. Is the mean age of men significantly different from 42 years? Use α=0.01. Comment on the results.
C.2. Is the mean size of the apartment significantly greater than 120 square feet? Comment on the
results using different levels of α (0.1; 0.05; 0.01).
C.3. Is the difference in the mean value of the size of the apartment significantly different between men
with children and men without children? Use α=0.01. Comment on the results.
C.4. Is the difference in the mean value of age significantly different between men with children and men
without children? Use α=0.01. Comment on the results.
D. Linear Regression
D.1. There is now interest to explore the importance that men give to having children with some socio-
demographic and living characteristics. Run a linear regression model with v7_9 as dependent variable
and v1 [Age] as independent variable. Comment on the output results and evaluate the goodness of fit of
the model. Finally check with appropriate techniques whether the simple linear regression model meets
the linearity, normality and homoscedasticity assumptions and comment on the results.
D.2. Run a linear regression model with v7_9 as dependent variable and v6 [Size of the apartment] as
independent variable. Comment on the output results and evaluate the goodness of fit of the model.
Finally check with appropriate techniques whether this second simple linear regression model meets the
linearity, normality and homoscedasticity assumptions and comment on the results.
D.3. Finally, run a linear regression model with v7_9 as dependent variable and v1 [Age] and v6 [Size of
the apartment] as independent variables. Comment on the output results and evaluate the goodness of fit
of the model.
E. Prediction
E.1. By using the model in D.3, predict the value of v7_9 for a person aged 40 and living in a house sized
100 square feet. Comment on it.
E.2. By using the model in D.3, predict the mean value of v7_9 for a person aged 40 and living in a house
sized 100 square feet and also determine the 95% confidence interval for it. Comment on it.
1. AGE: v1
table(Assignment_Data_2022$v1)
summary(Assignment_Data_2022$v1)
var(Assignment_Data_2022$v1, na.rm=TRUE)
19.18294
sd(Assignment_Data_2022$v1, na.rm=TRUE)
4.379833
Coefficient of variation
0.1019206
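The coefficient of variation reported for each variable is the standard deviation divided by the mean. A minimal sketch follows, assuming a hypothetical helper named `cv` and a toy vector in place of the survey data; the last lines reproduce the age figure from the summary values reported in this document (standard deviation 4.379833, mean 42.97297).

```r
# Hypothetical helper (not part of the original script):
# coefficient of variation with missing values removed.
cv <- function(x) sd(x, na.rm = TRUE) / mean(x, na.rm = TRUE)

# Toy vector standing in for a survey column (not the real data).
x <- c(2, 4, 4, 4, 5, 5, 7, 9, NA)
cv_x <- cv(x)
print(cv_x)

# Age: ratio of the reported standard deviation to the reported mean.
cv_age <- 4.379833 / 42.97297
print(round(cv_age, 7))
```

A coefficient of variation only has a natural reading for ratio-scale variables such as age or apartment size.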
table(Assignment_Data_2022$v2)
summary(Assignment_Data_2022$v2)
var(Assignment_Data_2022$v2, na.rm=TRUE)
0.6908725
IQR(Assignment_Data_2022$v2, na.rm=TRUE)
2
cv2 <- sd(Assignment_Data_2022$v2, na.rm=TRUE)/mean(Assignment_Data_2022$v2,
na.rm=TRUE)
show(cv2)
0.4197916
1 = Employed
2 = Leave from work
3 = Unemployed
4 = Seeking a job
5 = Other
table(Assignment_Data_2022$v3)
1 2 3 4 5
38 25 30 26 31
summary(Assignment_Data_2022$v3)
sd(Assignment_Data_2022$v3, na.rm=TRUE)
1.478874
IQR(Assignment_Data_2022$v3, na.rm=TRUE)
2.75
Coefficient of variation
0.5076228
table(Assignment_Data_2022$v4)
1 2 3 4
36 28 51 31
summary(Assignment_Data_2022$v4)
var(Assignment_Data_2022$v4, na.rm=TRUE)
1.175106
sd(Assignment_Data_2022$v4, na.rm=TRUE)
1.084023
IQR(Assignment_Data_2022$v4, na.rm=TRUE)
Coefficient of variation
0.4289089
1 = Yes
0 = No
table(Assignment_Data_2022$v5)
0 1
74 68
var(Assignment_Data_2022$v5, na.rm=TRUE)
0.2513235
sd(Assignment_Data_2022$v5, na.rm=TRUE)
0.5013218
Coefficient of variation
1.046878
6. Size of apartment
summary(Assignment_Data_2022$v6)
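var(Assignment_Data_2022$v6, na.rm=TRUE)
1772.436
sd(Assignment_Data_2022$v6, na.rm=TRUE)
42.10031
IQR(Assignment_Data_2022$v6, na.rm=TRUE)
71.75
Coefficient of variation
0.347543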
v7_1: Loyalty
how important is it:
1 = Not important at all
10 = Extremely important
table(Assignment_Data_2022$v7_1)
1 2 3 4 5 6 7 8 9 10
19 11 15 14 20 14 16 8 17 15
summary(Assignment_Data_2022$v7_1)
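var(Assignment_Data_2022$v7_1, na.rm=TRUE)
8.444858
sd(Assignment_Data_2022$v7_1, na.rm=TRUE)
2.906004
Coefficient of variation
0.5378814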
table(Assignment_Data_2022$v7_2)
1 2 3 4 5 6 7 8 9 10
15 18 18 12 17 18 9 13 12 16
summary(Assignment_Data_2022$v7_2)
var(Assignment_Data_2022$v7_2, na.rm=TRUE)
8.419333
Coefficient of variation
0.5498567
table(Assignment_Data_2022$v7_3)
1 2 3 4 5 6 7 8 9 10
11 19 7 18 18 12 20 16 11 15
summary(Assignment_Data_2022$v7_3)
var(Assignment_Data_2022$v7_3, na.rm=TRUE)
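7.737583
sd(Assignment_Data_2022$v7_3, na.rm=TRUE)
2.781651
IQR(Assignment_Data_2022$v7_3, na.rm=TRUE)
4.5
Coefficient of variation
0.4980545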
table(Assignment_Data_2022$v7_4)
1 2 3 4 5 6 7 8 9 10
17 14 11 10 14 14 15 15 12 21
summary(Assignment_Data_2022$v7_4)
var(Assignment_Data_2022$v7_4, na.rm=TRUE)
9.21176
sd(Assignment_Data_2022$v7_4, na.rm=TRUE)
3.035088
boxplot(Assignment_Data_2022$v7_4, col="lightblue", main = "Importance of
having appropriate living conditions for a relationship",
ylab= "Scale")
IQR(Assignment_Data_2022$v7_4, na.rm=TRUE)
Coefficient of variation
0.5325369
V7_5: How important is sharing the same cultural values?
table(Assignment_Data_2022$v7_5)
1 2 3 4 5 6 7 8 9 10
12 19 11 19 10 8 18 13 15 22
summary(Assignment_Data_2022$v7_5)
var(Assignment_Data_2022$v7_5, na.rm=TRUE)
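9.144628
sd(Assignment_Data_2022$v7_5, na.rm=TRUE)
3.024009
IQR(Assignment_Data_2022$v7_5, na.rm=TRUE)
5.5
Coefficient of variation
0.5279445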
table(Assignment_Data_2022$v7_6)
1 2 3 4 5 6 7 8 9 10
13 12 16 14 17 14 13 14 16 19
summary(Assignment_Data_2022$v7_6)
var(Assignment_Data_2022$v7_6, na.rm=TRUE)
8.432938
sd(Assignment_Data_2022$v7_6, na.rm=TRUE)
2.903952
Coefficient of variation
0.5074202
table(Assignment_Data_2022$v7_7)
1 2 3 4 5 6 7 8 9 10
17 14 12 22 21 12 9 14 17 12
summary(Assignment_Data_2022$v7_7)
var(Assignment_Data_2022$v7_7, na.rm=TRUE)
8.061029
IQR(Assignment_Data_2022$v7_7, na.rm=TRUE)
Coefficient of variation
0.5363719
V7_8: Sharing housing responsibilities
table(Assignment_Data_2022$v7_8)
1 2 3 4 5 6 7 8 9 10
12 13 7 18 16 15 14 12 19 18
summary(Assignment_Data_2022$v7_8)
var(Assignment_Data_2022$v7_8, na.rm=TRUE)
8.236014
sd(Assignment_Data_2022$v7_8, na.rm=TRUE)
2.869846
Coefficient of variation
0.4884844
table(Assignment_Data_2022$v7_9)
1 2 3 4 5 6 7 8 9 10
23 17 13 13 9 17 14 14 12 17
summary(Assignment_Data_2022$v7_9)
var(Assignment_Data_2022$v7_9, na.rm=TRUE)
9.407491
sd(Assignment_Data_2022$v7_9, na.rm=TRUE)
3.067163
IQR(Assignment_Data_2022$v7_9, na.rm=TRUE)
Coefficient of variation
0.583662
V7_10: Being open to face problems
table(Assignment_Data_2022$v7_10)
1 2 3 4 5 6 7 8 9 10
18 17 9 25 13 9 19 14 7 18
summary(Assignment_Data_2022$v7_10)
var(Assignment_Data_2022$v7_10, na.rm=TRUE)
8.538636
sd(Assignment_Data_2022$v7_10, na.rm=TRUE)
2.922094
Coefficient of variation
0.5539339
table(Assignment_Data_2022$v7_11)
1 2 3 4 5 6 7 8 9 10
12 19 12 18 10 12 19 9 20 19
summary(Assignment_Data_2022$v7_11)
var(Assignment_Data_2022$v7_11, na.rm=TRUE)
8.855705
standard_deviation_7_11 <- sd(Assignment_Data_2022$v7_11, na.rm=TRUE)
standard_deviation_7_11
2.975854
IQR(Assignment_Data_2022$v7_11, na.rm=TRUE)
Coefficient of variation
0.5220796
B: Confidence Intervals
#Exercise 1
> age <- Assignment_Data_2022$v1
> apt_size <- Assignment_Data_2022$v6
> children_wanting <- Assignment_Data_2022$v7_9
> t.test(age, conf.level = 0.95)
data: age
t = 119.36, df = 147, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
42.26149 43.68446
sample estimates:
mean of x
42.97297
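The interval above can be cross-checked by hand from the summary values. The sketch below assumes n = 148 non-missing ages (implied by df = 147) and uses the reported mean and standard deviation; it is a consistency check, not a replacement for `t.test`.

```r
# Summary values taken from the output reported in this document.
n    <- 148        # assumed from df = 147
xbar <- 42.97297   # sample mean of age
s    <- 4.379833   # sample standard deviation of age

se     <- s / sqrt(n)               # standard error of the mean
t_crit <- qt(0.975, df = n - 1)     # two-sided 95% critical value
ci     <- c(lower = xbar - t_crit * se,
            upper = xbar + t_crit * se)
print(round(ci, 4))
```

The result agrees with the `t.test` interval (42.26149, 43.68446) up to rounding of the inputs.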
#Exercise 2
> t.test(apt_size, conf.level = 0.95)
data: apt_size
t = 34.767, df = 145, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
114.2505 128.0235
sample estimates:
mean of x
121.137
#Exercise 3
> t.test(age ~ children_yesorno, conf.level = 0.95)
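The grouping variable `children_yesorno` is not defined in the commands shown, and the output of this test is not reported. A minimal sketch of how such an interval could be obtained is given below, under the assumption that v5 (1 = Yes, 0 = No) is the "has children" indicator; the data frame `toy` is simulated and only stands in for `Assignment_Data_2022`.

```r
# Simulated stand-in for the survey data (NOT the real dataset).
set.seed(1)
toy <- data.frame(
  v1 = round(runif(60, 35, 50)),   # age
  v5 = rep(c(0, 1), each = 30)     # assumed: 1 = has children, 0 = no children
)

# Assumed definition of the grouping factor used in the t.test call.
children_yesorno <- factor(toy$v5, levels = c(0, 1), labels = c("No", "Yes"))

# Welch two-sample t-test; conf.int is the 95% CI for mean(No) - mean(Yes).
res <- t.test(toy$v1 ~ children_yesorno, conf.level = 0.95)
print(res$conf.int)
```

If the interval contains 0, the difference in mean age between the two groups is not significant at the 5% level.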
C: Hypothesis testing
C1
> t.test(age, mu = 42, conf.level = 0.99)
data: age
t = 2.7026, df = 147, p-value = 0.007691
alternative hypothesis: true mean is not equal to 42
99 percent confidence interval:
42.03343 43.91251
sample estimates:
mean of x
42.97297
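#We reject the null hypothesis at α = 0.01 because the p-value (0.007691) is
smaller than 0.01: the mean age is significantly different from 42. Consistently,
the 99% confidence interval (42.03343, 43.91251) does not contain 42.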
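The test statistic and p-value can be reproduced from the summary values as a check. The sketch assumes n = 148 (from df = 147) and uses the reported mean and standard deviation of age.

```r
# Summary values taken from the output reported in this document.
n    <- 148        # assumed from df = 147
xbar <- 42.97297
s    <- 4.379833
mu0  <- 42         # value under the null hypothesis
alpha <- 0.01

t_stat  <- (xbar - mu0) / (s / sqrt(n))
p_value <- 2 * pt(-abs(t_stat), df = n - 1)   # two-sided p-value
reject  <- p_value < alpha

print(c(t = t_stat, p = p_value))
print(reject)
```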
C2
> t.test(apt_size, mu = 120, alternative = "greater", conf.level = 0.90)
data: apt_size
t = 0.32632, df = 145, p-value = 0.3723
alternative hypothesis: true mean is greater than 120
90 percent confidence interval:
116.6513 Inf
sample estimates:
mean of x
121.137
> t.test(apt_size, mu = 120, alternative = "greater", conf.level = 0.95)
data: apt_size
t = 0.32632, df = 145, p-value = 0.3723
alternative hypothesis: true mean is greater than 120
95 percent confidence interval:
115.3691 Inf
sample estimates:
mean of x
121.137
> t.test(apt_size, mu = 120, alternative = "greater", conf.level = 0.99)
data: apt_size
t = 0.32632, df = 145, p-value = 0.3723
alternative hypothesis: true mean is greater than 120
99 percent confidence interval:
112.9409 Inf
sample estimates:
mean of x
121.137
#Since the p-value (0.3723) is larger than every α considered (0.1, 0.05, 0.01),
we do not have enough empirical evidence to reject the null hypothesis at any
of these levels. As α decreases (the confidence level rises from 90% to 99%),
the lower bound of the one-sided confidence interval moves further below the
sample mean (116.6513, 115.3691, 112.9409), and in every case the interval
contains 120.
We therefore cannot conclude that the mean size of the apartments is greater
than 120. This does not prove that the mean is at most 120; it only means that
the data are compatible with that value.
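The three one-sided lower bounds and the corresponding decisions can be reproduced from the summary values. The sketch assumes n = 146 (from df = 145) and uses the reported mean, standard deviation and p-value for apartment size.

```r
# Summary values taken from the output reported in this document.
n       <- 146       # assumed from df = 145
xbar    <- 121.137
s       <- 42.10031
p_value <- 0.3723

se     <- s / sqrt(n)
alphas <- c(0.10, 0.05, 0.01)

# Lower bound of the one-sided (1 - alpha) confidence interval.
lower  <- xbar - qt(1 - alphas, df = n - 1) * se
reject <- p_value < alphas

print(data.frame(alpha = alphas,
                 lower_bound = round(lower, 4),
                 reject_H0 = reject))
```

The p-value is the same in all three runs; only the confidence bound changes with α.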
C.3. Is the difference in the mean value of the size of the apartment significantly different between
men with children and men without children? Use α=0.01. Comment on the results.
C.4. Is the difference in the mean value of age significantly different between men with children
and men without children? Use α=0.01. Comment on the results.
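D: Linear Regression
> #Exercise 1
> m1 <- lm(formula = children_wanting ~ age, data = Assignment_Data_2022)
> summary(m1)
Call:
lm(formula = children_wanting ~ age, data = Assignment_Data_2022)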
Residuals:
Min 1Q Median 3Q Max
-4.2887 -3.1470 -0.1341 2.7886 4.8787
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.65759 2.50136 1.862 0.0646
age 0.01288 0.05793 0.222 0.8244
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# The residuals are only roughly symmetrical around zero (1Q = -3.1470 against 3Q = 2.7886) and they
are large relative to the 1-10 scale of the response, so fitted values can fall far from the observed scores.
The intercept 4.65759 is the expected “children importance score” at age 0. Since the sample only
covers men aged 35-50, it has no practical interpretation on its own.
The slope 0.01288 is the expected change in the score for each additional year of age. Its standard
error, 0.05793, measures the sampling variability of this estimate. The t-value is the estimate divided
by its standard error, i.e. how many standard errors the estimate lies away from 0. The further it is
from zero, the stronger the evidence against the null hypothesis of no relationship between children
importance and age. Here the t-value of the slope is only 0.222. The p-value is the probability, under
the null hypothesis, of observing a t-value at least as extreme in absolute value as the one obtained.
A small p-value (usually < 0.05) would allow us to reject the null hypothesis, but for the slope it is
0.8244, so we cannot reject it: the data show no evidence of a linear relationship between age and
the score. The Residual Standard Error, 3.074, is the typical amount by which the observed response
(children) deviates from the fitted regression line. The R-squared statistic measures the share of the
variance of the response explained by the model; the closer it is to 1 the better. In this case it is small,
and although an acceptable level of R-squared depends on the context, a non-significant slope means
the model explains very little.
The F-statistic (0.04944) tests whether the model fits better than a model with the intercept only.
Values well above 1 indicate a relationship; here it is far below 1, which confirms that age does not
help to predict the score.
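In a simple linear regression the F-statistic is the square of the slope's t-value, and the two tests share the same p-value (here 0.222² ≈ 0.049, in line with the reported F of 0.04944). The sketch below illustrates this on simulated data; the variables `x` and `y` are hypothetical and unrelated by construction.

```r
# Simulated data (NOT the survey data): y does not depend on x.
set.seed(42)
x <- runif(100, 35, 50)
y <- 5 + rnorm(100, sd = 3)

fit <- lm(y ~ x)
s   <- summary(fit)

slope_t <- s$coefficients["x", "t value"]
slope_p <- s$coefficients["x", "Pr(>|t|)"]
f_stat  <- unname(s$fstatistic["value"])
f_p     <- unname(pf(f_stat, s$fstatistic["numdf"], s$fstatistic["dendf"],
                     lower.tail = FALSE))

# Quantities discussed in the commentary above.
print(c(t = slope_t, F = f_stat, p_slope = slope_p, p_F = f_p))
print(c(RSE = s$sigma, R2 = s$r.squared))
```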
> plot(m1)
[Figure: Residuals vs Fitted, lm(children_wanting ~ age). x-axis: Fitted values; y-axis: Residuals; labelled observations: 10, 81, 91.]
#The residuals do not scatter around a horizontal line at zero, so the
linearity assumption does not appear to be satisfied.
[Figure: Normal Q-Q, lm(children_wanting ~ age). x-axis: Theoretical Quantiles; y-axis: Standardized residuals; labelled observations: 10, 81, 91.]
[Figure: Scale-Location, lm(children_wanting ~ age). x-axis: Fitted values; y-axis: Standardized residuals; labelled observations: 10, 81, 91.]
#The spread of the residuals does not increase with the fitted values,
suggesting constant variance of the residual errors (homoscedasticity).
> #Exercise 2
> m2 <- lm(formula = children_wanting ~ apt_size, data = Assignment_Data_2022)
> summary(m2)
Call:
lm(formula = children_wanting ~ apt_size, data = Assignment_Data_2022)
Residuals:
Min 1Q Median 3Q Max
-4.7791 -2.4263 -0.1017 2.2673 4.8885
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.672100 0.742584 6.292 3.66e-09 ***
apt_size 0.008451 0.005766 1.466 0.145
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
# The residuals appear quite symmetrical around zero (1Q = -2.4263 against 3Q = 2.2673), but they are
still large relative to the 1-10 scale of the response.
The intercept 4.6721 is the expected “children importance score” for an apartment of size 0, so it
is an extrapolation outside the observed range rather than a value for the smallest apartments.
The slope 0.008451 is the expected change in the score for each additional unit of apartment size.
Its standard error, 0.005766, measures the sampling variability of this estimate. The t-value is the
estimate divided by its standard error, i.e. how many standard errors the estimate lies away from 0.
The further it is from zero, the stronger the evidence against the null hypothesis of no relationship
between children importance and apartment size. The intercept has a large t-value (6.292, p-value
3.66e-09), but the slope has a t-value of only 1.466. The p-value is the probability, under the null
hypothesis, of observing a t-value at least as extreme in absolute value as the one obtained. A small
p-value (usually < 0.05) would allow us to reject the null hypothesis, but for the slope it is 0.145, so
we cannot reject it at any of the usual levels: there is no significant linear relationship.
The Residual Standard Error, 2.895, is the typical amount by which the observed response (children)
deviates from the fitted regression line. The R-squared statistic measures the share of the variance of
the response explained by the model; the closer it is to 1 the better. In this case it is small, keeping
in mind that an acceptable level of R-squared depends on the context.
The F-statistic (2.148) tests whether the model fits better than a model with the intercept only. It is
above 1, but in a simple regression it has the same p-value as the slope (0.145), so it is not significant.
plot(m2)
[Figure: Residuals vs Fitted, lm(children_wanting ~ apt_size). x-axis: Fitted values; y-axis: Residuals; labelled observations: 2, 98, 113.]
#The plot stays around 0 but doesn’t really follow a straight line. Linearity
may not be satisfied.
[Figure: Normal Q-Q, lm(children_wanting ~ apt_size). x-axis: Theoretical Quantiles; y-axis: Standardized residuals; labelled observations: 2, 98, 113.]
#The normal probability plot of residuals approximately follows a straight
line, so we can assume Normality.
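The visual reading of a Q-Q plot can be complemented by a formal test. The sketch below applies the Shapiro-Wilk test to the residuals of a model fitted on simulated data; `toy_fit` is a hypothetical stand-in for the fitted model and the data are not the survey data.

```r
# Simulated data (NOT the survey data).
set.seed(11)
apt_size <- runif(120, 40, 220)
score    <- 5 + 0.01 * apt_size + rnorm(120, sd = 3)
toy_fit  <- lm(score ~ apt_size)

# Shapiro-Wilk test: H0 = the residuals come from a normal distribution.
sw <- shapiro.test(residuals(toy_fit))
print(sw$statistic)
print(sw$p.value)
```

A p-value above the chosen α means normality of the residuals is not rejected. With a response restricted to integers from 1 to 10, some departure from normality is to be expected.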
[Figure: Scale-Location, lm(children_wanting ~ apt_size). x-axis: Fitted values; y-axis: Standardized residuals; labelled observations: 2, 98, 113.]
#The spread of the residuals increases with the fitted values, suggesting
non-constant variance of the residual errors (heteroscedasticity). A possible
way to reduce heteroscedasticity is to apply a log or square root
transformation to the outcome variable (y).
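Heteroscedasticity can also be checked with a test. The sketch below computes a Breusch-Pagan statistic by hand in its studentized (Koenker) form: the squared residuals are regressed on the predictor and n·R² is compared with a chi-squared distribution. The data are simulated, with an error spread that grows with `x` by construction; this is an illustration, not the procedure used in this report.

```r
# Simulated data (NOT the survey data): the error spread grows with x.
set.seed(7)
x <- runif(200, 50, 200)
y <- 2 + 0.02 * x + rnorm(200, sd = 0.01 * x)

fit <- lm(y ~ x)

# Auxiliary regression of squared residuals on the predictor.
aux     <- lm(residuals(fit)^2 ~ x)
bp_stat <- length(x) * summary(aux)$r.squared          # n * R^2
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)  # df = number of predictors

print(c(BP = bp_stat, p_value = bp_p))
```

A small p-value points to non-constant variance; the contributed `lmtest` package offers the same test as `bptest()`.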
> #Exercise 3
> m3 <-lm(formula= children_wanting ~ age + apt_size, data = Assignment_Data_2022)
> summary(m3)
Call:
lm(formula = children_wanting ~ age + apt_size, data = Assignment_Data_2022)
Residuals:
Min 1Q Median 3Q Max
-5.0579 -2.4534 0.0841 2.3927 5.3139
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.972912 2.521399 0.782 0.435
age 0.063309 0.055714 1.136 0.258
apt_size 0.008190 0.005827 1.406 0.162
# The residuals appear quite symmetrical around zero (1Q = -2.4534 against 3Q = 2.3927), but they
are still large relative to the 1-10 scale of the response.
The intercept 1.972912 is the expected “children importance score” when both age and apartment
size are 0, an extrapolation far outside the sample, so it has no practical interpretation.
Each slope is the expected change in the score for a one-unit increase in that predictor, holding
the other predictor constant: 0.063309 per additional year of age and 0.008190 per additional
unit of apartment size. The standard errors measure the sampling variability of these estimates.
The t-value is the estimate divided by its standard error, i.e. how many standard errors the
estimate lies away from 0; the further from zero, the stronger the evidence against the null
hypothesis of no relationship. Here the t-values are all small (0.782, 1.136 and 1.406). The
p-value is the probability, under the null hypothesis, of observing a t-value at least as extreme
in absolute value as the one obtained. A small p-value (usually < 0.05) would allow us to reject
the null hypothesis, but this is not the case: the p-values are 0.258 for age and 0.162 for
apartment size, so neither predictor is significant. The Residual Standard Error, 2.908, is the
typical amount by which the observed response (children) deviates from the fitted regression
surface. The R-squared statistic measures the share of the variance of the response explained by
the model; the closer it is to 1 the better. In this case it is small, keeping in mind that an
acceptable level of R-squared depends on the context.
The F-statistic (1.604) tests whether the two predictors jointly fit better than a model with the
intercept only. It is only slightly above 1 and below the 5% critical value (about 3), so the model as
a whole is not significant.
> plot(m3)
[Figure: Residuals vs Fitted, lm(children_wanting ~ age + apt_size). x-axis: Fitted values; y-axis: Residuals; labelled observations: 2, 90, 113.]
#Even if the residuals vary a little with the fitted values, they stay stable around 0. Linearity
seems to be satisfied.
[Figure: Normal Q-Q, lm(children_wanting ~ age + apt_size). x-axis: Theoretical Quantiles; y-axis: Standardized residuals; labelled observations: 2, 90, 113.]
#The normal probability plot of residuals approximately follows a straight
line, so we can assume Normality.
[Figure: Scale-Location, lm(children_wanting ~ age + apt_size). x-axis: Fitted values; y-axis: Standardized residuals; labelled observations: 2, 90, 113.]
#The spread of the residuals does not increase with the fitted values, suggesting constant
variance of the residual errors (homoscedasticity).
E. Prediction
E.1. By using the model in D.3, predict the value of v7_9 for a person aged 40 and living in a house sized
100 square feet. Comment on it.
E.2. By using the model in D.3, predict the mean value of v7_9 for a person aged 40 and living in a house
sized 100 square feet and also determine the 95% confidence interval for it. Comment on it.
E1:
> predict(m3, newdata = data.frame(age = 40, apt_size = 100))
1
5.071264
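#The predicted score for a man aged 40 living in a 100 sq feet house is about 5.07 out of 10, close
to the midpoint of the scale, i.e. roughly indifferent about having children. Since neither predictor
is significant in the D.3 model, this prediction is close to what the sample average alone would give.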
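A point prediction from a linear model is the intercept plus each coefficient times the corresponding predictor value. The sketch below plugs the coefficients printed in the D.3 summary into that formula. Note that the result, about 5.32, does not match the `predict()` output of 5.071264 reported here; the reason cannot be determined from this document (for example, the model object may have been refitted before prediction), so the hand computation is shown only as a check to run against the actual `m3`.

```r
# Coefficients as printed in the D.3 summary output.
b0     <- 1.972912   # intercept
b_age  <- 0.063309   # slope for age
b_size <- 0.008190   # slope for apartment size

# Values of the predictors for the person of interest.
age_new  <- 40
size_new <- 100

pred_by_hand <- b0 + b_age * age_new + b_size * size_new
print(pred_by_hand)
```

With the real data, `coef(m3)` returns the coefficients that `predict()` actually uses, which makes the comparison direct.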
E2:
> predict(m3, newdata = data.frame(age = 40, apt_size = 100), interval = "confidence", level = 0.95)
fit lwr upr
1 5.071264 4.400601 5.741926
#The interval refers to the mean score, not to an individual: with 95% confidence, the mean
importance given to having children by men aged 40 living in a 100 sq feet house lies between
4.400601 and 5.741926, so still around the midpoint of the scale (indifferent).
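The difference between E.1 and E.2 corresponds to the difference between a prediction interval (for one individual) and a confidence interval (for the mean response). The sketch below shows both on simulated data; `toy` and `toy_fit` are hypothetical stand-ins for the dataset and for the D.3 model.

```r
# Simulated data (NOT the survey data).
set.seed(3)
toy <- data.frame(age = runif(120, 35, 50),
                  apt_size = runif(120, 40, 220))
toy$score <- 5 + rnorm(120, sd = 3)

toy_fit <- lm(score ~ age + apt_size, data = toy)
new_man <- data.frame(age = 40, apt_size = 100)

# Interval for the MEAN score of all such men.
conf_int <- predict(toy_fit, newdata = new_man,
                    interval = "confidence", level = 0.95)
# Interval for the score of ONE such man.
pred_int <- predict(toy_fit, newdata = new_man,
                    interval = "prediction", level = 0.95)

print(conf_int)
print(pred_int)
```

Both intervals share the same centre, but the prediction interval is always wider because it also includes the residual variability of a single observation.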