Statistics Assignment (Lista-Campo)

Statistics Module 2 – Applied Statistics
ASSIGNMENT
Valerio Langè
(valerio.lange@unibocconi.it)
Report in the table here below Name, Surname and ID Number for each member of your group.
Name Surname ID Number
Alessandro Campo Lobato 3164258
Giuseppe Lista 3164613

Context and description of the dataset:
The German demographic evolution is a phenomenon that is raising a strong political debate in the
country. A new National Family Plan will be soon designed. The “German Observatory on Demography”
has commissioned your research group to conduct a survey in order to explore the influential factors
regarding cultural values in German men. Part of these results will be used to design a new “National
Family Plan” that is intended to support young couples to create a new family.
After a consultation phase with the competent Officers of the “German Observatory on Demography” you
decide to implement a survey collecting data on a sample of 150 men aged 35-50 years old with a
specifically designed questionnaire.
You agree upon a shared version of the questionnaire with the above mentioned Policy Officers. You can
find the questionnaire attached to the case study.
The “German Observatory on Demography” delivers the questionnaire to 150 German men aged 35-50,
and once the data are collected the dataset is released to your group.
The objective is now to prepare a Research Report for the “German Observatory on Demography”
following the instructions reported here below.
Instructions:
In the assignment folder you can find

· this file 30457_Assignment_Text_2022.docx
· the dataset 30457_Assignment_Data_2022.csv
· the questionnaire 30457_Assignment_Questionnaire_2022.pdf
Always report in the text the R-Studio output generated and all procedures and commands used. In case
this information is missing a score will not be assigned.
When you have completed the assignment send the final document in .doc or .docx format and the
database used to the following email address: valerio.lange@unibocconi.it
The assignment must be delivered by 11:59pm on 16th of December 2022. Assignments delivered later
will not be considered.
For any question related to the assignment please refer to the following email address:
valerio.lange@unibocconi.it
A. Descriptive statistics of the sample
A.1. Report the main information for each variable (e.g.: frequency distribution table, average and/or other
summary measures, etc.). Comment on results.
B. Confidence intervals
B.1. Compute the 95% confidence interval for the mean age of men. Also comment on it.
B.2. Compute the 95% confidence interval for the mean size of the apartment. Also comment on it.
B.3. Compute the 95% confidence interval for the difference in the mean age between men who have
children and men who don’t have children. Also comment on it.
C. Hypothesis testing
C.1. Is the mean age of men significantly different from 42 years? Use α=0.01. Comment on the results.
C.2. Is the mean size of the apartment significantly greater than 120 squared feet? Comment on the
results using different levels of α (0.1; 0.05; 0.01).
C.3. Is the difference in the mean value of the size of the apartment significantly different between men
with children and men without children? Use α=0.01. Comment on the results.
C.4. Is the difference in the mean value of age significantly different between men with children and men
without children? Use α=0.01. Comment on the results.
D. Linear Regression
D.1. There is now interest to explore the importance that men give to having children with some socio-
demographic and living characteristics. Run a linear regression model with v7_9 as dependent variable
and v1 [Age] as independent variable. Comment on the output results and evaluate the goodness of fit of
the model. Finally check with appropriate techniques whether the simple linear regression model meets
the linearity, normality and homoscedasticity assumptions and comment on the results.
D.2. Run a linear regression model with v7_9 as dependent variable and v6 [Size of the apartment] as
independent variable. Comment on the output results and evaluate the goodness of fit of the model.
Finally check with appropriate techniques whether this second simple linear regression model meets the
linearity, normality and homoscedasticity assumptions and comment on the results.
D.3. Finally, run a linear regression model with v7_9 as dependent variable and v1 [Age] and v6 [Size of
the apartment] as independent variables. Comment on the output results and evaluate the goodness of fit
of the model.
E. Prediction
E.1. By using the model in D.3, predict the value of v7_9 for a person aged 40 and living in a house sized
100 squared feet. Comment on it.
E.2. By using the model in D.3, predict the mean value of v7_9 for a person aged 40 and living in a house
sized 100 squared feet and also determine the 95% confidence interval for it. Comment on it.
A: Descriptive statistics of the sample
1. AGE: v1
table(Assignment_Data_2022$v1)
summary(Assignment_Data_2022$v1)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
36.00 39.00 43.50 42.97 47.00 49.00 2
var(Assignment_Data_2022$v1, na.rm=TRUE)
19.18294
standard_deviation_age <- sd(Assignment_Data_2022$v1, na.rm=TRUE)
standard_deviation_age
[1] 4.379833
boxplot(Assignment_Data_2022$v1, col="lightblue", main="Age Boxplot", ylab="Age (years)")
hist(Assignment_Data_2022$v1, col="brown1", main="Age and frequency", xlab="Age (years)")
plot(ecdf(Assignment_Data_2022$v1), col="brown1", main="CDF age", xlab="Age (years)", ylab="Cumulative Relative Frequencies")
IQR(Assignment_Data_2022$v1, na.rm=TRUE)
cv1 <- sd(Assignment_Data_2022$v1, na.rm=TRUE)/mean(Assignment_Data_2022$v1, na.rm=TRUE)
show(cv1)
Coefficient of variation
0.1019206
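The same summary measures are computed for every variable in this section, so they could be wrapped in a small helper. The following is only a minimal sketch: the function name `describe_var` is hypothetical (it is not part of the original script) and the vector `ages` is made-up illustrative data, not the survey column.

```r
# Hypothetical helper: the summary measures used throughout section A.
describe_var <- function(x) {
  x <- x[!is.na(x)]                  # drop missing values, like na.rm=TRUE
  c(n        = length(x),
    mean     = mean(x),
    variance = var(x),
    sd       = sd(x),
    iqr      = IQR(x),
    cv       = sd(x) / mean(x))      # coefficient of variation
}

# Illustrative data only (not taken from the dataset)
ages <- c(36, 39, 41, 44, 47, 49, NA)
res <- describe_var(ages)
print(round(res, 4))
```

With the dataset loaded, the same call would be `describe_var(Assignment_Data_2022$v1)`.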
2. Do you have a partner/wife?
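1 = Relationship and living together
2 = Relationship and not living together
3 = Single
summary(Assignment_Data_2022$v2)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.00 1.00 2.00 1.98 3.00 3.00
var(Assignment_Data_2022$v2, na.rm=TRUE)
0.6908725
standard_deviation_v2 <- sd(Assignment_Data_2022$v2, na.rm=TRUE)
standard_deviation_v2
0.8311874
IQR(Assignment_Data_2022$v2, na.rm=TRUE)
2
cv2 <- sd(Assignment_Data_2022$v2, na.rm=TRUE)/mean(Assignment_Data_2022$v2, na.rm=TRUE)
show(cv2)
0.4197916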
3. Present professional career
1 = Employed
2 = Leave from work
3 = Unemployed
4 = Seeking a job
5 = Other
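table(Assignment_Data_2022$v3)
1 2 3 4 5
38 25 30 26 31
summary(Assignment_Data_2022$v3)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 1.250 3.000 2.913 4.000 5.000
sd(Assignment_Data_2022$v3, na.rm=TRUE)
1.478874
hist(Assignment_Data_2022$v3, col="lightblue", main="Present professional career", xlab="Career status (1 to 5)")
#Note: xlab has to be passed inside the hist() call; written after the closing parenthesis it does not change the x-axis label.
IQR(Assignment_Data_2022$v3, na.rm=TRUE)
2.75
cv3 <- sd(Assignment_Data_2022$v3, na.rm=TRUE)/mean(Assignment_Data_2022$v3, na.rm=TRUE)
show(cv3)
0.5076228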
4. Family Economic Status
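1: We manage to save money at the end of the month
2: Only money to make a living
3: Not enough money to make a living
4: I don’t know
table(Assignment_Data_2022$v4)
1 2 3 4
36 28 51 31
summary(Assignment_Data_2022$v4)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
1.000 2.000 3.000 2.527 3.000 4.000 4
var(Assignment_Data_2022$v4, na.rm=TRUE)
1.175106
standard_deviation_v4 <- sd(Assignment_Data_2022$v4, na.rm=TRUE)
standard_deviation_v4
1.084023
boxplot(Assignment_Data_2022$v4, col="brown1", main="Family Economic Status", ylab="Situation")
hist(Assignment_Data_2022$v4, col="lightblue", main="Family economic status", xlab="Levels")
cv4 <- sd(Assignment_Data_2022$v4, na.rm=TRUE)/mean(Assignment_Data_2022$v4, na.rm=TRUE)
show(cv4)
0.4289089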
5. Do you have any children?
1 = Yes
0 = No
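table(Assignment_Data_2022$v5)
0 1
74 68
var(Assignment_Data_2022$v5, na.rm=TRUE)
0.2513235
standard_deviation_v5 <- sd(Assignment_Data_2022$v5, na.rm=TRUE)
standard_deviation_v5
0.5013218
cv5 <- sd(Assignment_Data_2022$v5, na.rm=TRUE)/mean(Assignment_Data_2022$v5, na.rm=TRUE)
show(cv5)
1.046878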
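For a 0/1 variable such as v5, the mean is simply the share of "Yes" answers, and the variance follows from that share. As a cross-check, the sketch below rebuilds the reported variance, standard deviation and coefficient of variation from the two counts in the frequency table (74 and 68). The object names are illustrative and not part of the original script.

```r
# Counts reported in the frequency table for v5
n_no  <- 74
n_yes <- 68
n     <- n_no + n_yes                            # 142 non-missing answers

p_hat   <- n_yes / n                             # sample mean of a 0/1 variable
var_hat <- n / (n - 1) * p_hat * (1 - p_hat)     # sample variance (n - 1 denominator)
sd_hat  <- sqrt(var_hat)
cv_hat  <- sd_hat / p_hat                        # coefficient of variation

print(c(mean = p_hat, variance = var_hat, sd = sd_hat, cv = cv_hat))
```

The results agree with the values printed by `var()`, `sd()` and `show(cv5)`, which suggests those outputs refer to the 142 non-missing answers.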
6. Size of apartment

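summary(Assignment_Data_2022$v6)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
50.00 86.25 117.50 121.14 158.00 198.00 4
var(Assignment_Data_2022$v6, na.rm=TRUE)
1772.436
standard_deviation_v6 <- sd(Assignment_Data_2022$v6, na.rm=TRUE)
standard_deviation_v6
42.10031
boxplot(Assignment_Data_2022$v6, col="lightblue", main="Size of the apartment in square feet", ylab="Square feet")
hist(Assignment_Data_2022$v6, col="brown1", main="Size of apartment in square feet", xlab="Square feet")
plot(ecdf(Assignment_Data_2022$v6), col="brown1", main="CDF square feet", xlab="Square feet", ylab="Cumulative Relative Frequencies")
IQR(Assignment_Data_2022$v6, na.rm=TRUE)
71.75
cv6 <- sd(Assignment_Data_2022$v6, na.rm=TRUE)/mean(Assignment_Data_2022$v6, na.rm=TRUE)
show(cv6)
0.347543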
v7_1: Loyalty
how important is it:
1 = Not important at all
10 = Extremely important
table(Assignment_Data_2022$v7_1)
1 2 3 4 5 6 7 8 9 10
19 11 15 14 20 14 16 8 17 15
summary(Assignment_Data_2022$v7_1)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
1.000 3.000 5.000 5.403 8.000 10.000 1
var(Assignment_Data_2022$v7_1, na.rm=TRUE)
8.444858
standard_deviation_7_1 <- sd(Assignment_Data_2022$v7_1, na.rm=TRUE)
standard_deviation_7_1
2.906004
boxplot(Assignment_Data_2022$v7_1, col="brown1", main="How much do you value loyalty in a relationship from 1 to 10", ylab="Scale")
hist(Assignment_Data_2022$v7_1, col="lightblue", main="Importance of loyalty", xlab="Values")
plot(ecdf(Assignment_Data_2022$v7_1), col="brown1", main="CDF loyalty", xlab="Importance of loyalty", ylab="Cumulative Relative Frequencies")
IQR(Assignment_Data_2022$v7_1, na.rm=TRUE)
cv7_1 <- sd(Assignment_Data_2022$v7_1, na.rm=TRUE)/mean(Assignment_Data_2022$v7_1, na.rm=TRUE)
show(cv7_1)
0.5378814
v7_2: Having the same social background
table(Assignment_Data_2022$v7_2)
1 2 3 4 5 6 7 8 9 10
15 18 18 12 17 18 9 13 12 16
summary(Assignment_Data_2022$v7_2)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
1.000 3.000 5.000 5.277 8.000 10.000 2
var(Assignment_Data_2022$v7_2, na.rm=TRUE)
8.419333
standard_deviation_v7_2 <- sd(Assignment_Data_2022$v7_2, na.rm=TRUE)
standard_deviation_v7_2
2.901609
boxplot(Assignment_Data_2022$v7_2, col="lightblue", main="Same social background importance Boxplot", ylab="Importance")
hist(Assignment_Data_2022$v7_2, col="lightblue", main="Importance of having the same social background", xlab="Importance (scale)")
plot(ecdf(Assignment_Data_2022$v7_2), col="lightblue", main="CDF having the same social background importance", xlab="Importance", ylab="Cumulative Relative Frequencies")
cv7_2 <- sd(Assignment_Data_2022$v7_2, na.rm=TRUE)/mean(Assignment_Data_2022$v7_2, na.rm=TRUE)
show(cv7_2)
0.5498567
V7_3 = Giving importance to the family nest
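table(Assignment_Data_2022$v7_3)
1 2 3 4 5 6 7 8 9 10
11 19 7 18 18 12 20 16 11 15
summary(Assignment_Data_2022$v7_3)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
1.000 3.500 6.000 5.585 8.000 10.000 3
var(Assignment_Data_2022$v7_3, na.rm=TRUE)
7.737583
standard_deviation_v7_3 <- sd(Assignment_Data_2022$v7_3, na.rm=TRUE)
standard_deviation_v7_3
2.781651
boxplot(Assignment_Data_2022$v7_3, col="lightblue", main="Importance given to the family nest", ylab="Scale")
hist(Assignment_Data_2022$v7_3, col="lightblue", main="Importance given to the family nest", xlab="Scale")
plot(ecdf(Assignment_Data_2022$v7_3), col="red", main="CDF importance given to the family nest", xlab="Scale", ylab="Cumulative Relative Frequencies")
IQR(Assignment_Data_2022$v7_3, na.rm=TRUE)
4.5
cv7_3 <- sd(Assignment_Data_2022$v7_3, na.rm=TRUE)/mean(Assignment_Data_2022$v7_3, na.rm=TRUE)
show(cv7_3)
0.4980545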
V7_4 = How important is having appropriate living conditions?
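table(Assignment_Data_2022$v7_4)
1 2 3 4 5 6 7 8 9 10
17 14 11 10 14 14 15 15 12 21
summary(Assignment_Data_2022$v7_4)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
1.000 3.000 6.000 5.699 8.000 10.000 7
var(Assignment_Data_2022$v7_4, na.rm=TRUE)
9.21176
standard_deviation_v7_4 <- sd(Assignment_Data_2022$v7_4, na.rm=TRUE)
standard_deviation_v7_4
3.035088
boxplot(Assignment_Data_2022$v7_4, col="lightblue", main="Importance of having appropriate living conditions for a relationship", ylab="Scale")
hist(Assignment_Data_2022$v7_4, col="lightblue", main="Importance of having appropriate living conditions for a relationship", xlab="Scale")
cv7_4 <- sd(Assignment_Data_2022$v7_4, na.rm=TRUE)/mean(Assignment_Data_2022$v7_4, na.rm=TRUE)
show(cv7_4)
0.5325369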
V7_5: How important is sharing the same cultural values?
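table(Assignment_Data_2022$v7_5)
1 2 3 4 5 6 7 8 9 10
12 19 11 19 10 8 18 13 15 22
summary(Assignment_Data_2022$v7_5)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
1.000 3.000 6.000 5.728 8.500 10.000 3
var(Assignment_Data_2022$v7_5, na.rm=TRUE)
9.144628
standard_deviation_v7_5 <- sd(Assignment_Data_2022$v7_5, na.rm=TRUE)
standard_deviation_v7_5
3.024009
boxplot(Assignment_Data_2022$v7_5, col="blue", main="Importance of sharing the same cultural values", ylab="Scale")
hist(Assignment_Data_2022$v7_5, col="orange", main="Importance of sharing the same cultural values", xlab="Scale")
IQR(Assignment_Data_2022$v7_5, na.rm=TRUE)
5.5
cv7_5 <- sd(Assignment_Data_2022$v7_5, na.rm=TRUE)/mean(Assignment_Data_2022$v7_5, na.rm=TRUE)
show(cv7_5)
0.5279445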
V7_6: Importance of living far from the families of origin
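table(Assignment_Data_2022$v7_6)
1 2 3 4 5 6 7 8 9 10
13 12 16 14 17 14 13 14 16 19
summary(Assignment_Data_2022$v7_6)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
1.000 3.000 6.000 5.723 8.000 10.000 2
var(Assignment_Data_2022$v7_6, na.rm=TRUE)
8.432938
sd(Assignment_Data_2022$v7_6, na.rm=TRUE)
2.903952
boxplot(Assignment_Data_2022$v7_6, col="lightgreen", main="Living far from families of origin Boxplot", ylab="Scale")
hist(Assignment_Data_2022$v7_6, col="lightgreen", main="Living far from families of origin Histogram", xlab="Scale")
plot(ecdf(Assignment_Data_2022$v7_6), col="red", main="CDF living far from families of origin", xlab="Scale", ylab="Cumulative Relative Frequencies")
cv7_6 <- sd(Assignment_Data_2022$v7_6, na.rm=TRUE)/mean(Assignment_Data_2022$v7_6, na.rm=TRUE)
show(cv7_6)
0.5074202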
V7_7: Importance of living in a safe place
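table(Assignment_Data_2022$v7_7)
1 2 3 4 5 6 7 8 9 10
17 14 12 22 21 12 9 14 17 12
summary(Assignment_Data_2022$v7_7)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.000 3.000 5.000 5.293 8.000 10.000
var(Assignment_Data_2022$v7_7, na.rm=TRUE)
8.061029
sd(Assignment_Data_2022$v7_7, na.rm=TRUE)
2.839195
boxplot(Assignment_Data_2022$v7_7, col="lightgreen", main="Importance of living in a safe place Boxplot", ylab="Scale")
hist(Assignment_Data_2022$v7_7, col="blue", main="Importance of living in a safe place and frequencies", xlab="Scale")
cv7_7 <- sd(Assignment_Data_2022$v7_7, na.rm=TRUE)/mean(Assignment_Data_2022$v7_7, na.rm=TRUE)
show(cv7_7)
0.5363719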
V7_8: Sharing housing responsibilities
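table(Assignment_Data_2022$v7_8)
1 2 3 4 5 6 7 8 9 10
12 13 7 18 16 15 14 12 19 18
summary(Assignment_Data_2022$v7_8)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
1.000 4.000 6.000 5.875 9.000 10.000 6
var(Assignment_Data_2022$v7_8, na.rm=TRUE)
8.236014
sd(Assignment_Data_2022$v7_8, na.rm=TRUE)
2.869846
boxplot(Assignment_Data_2022$v7_8, col="lightgreen", main="Importance of sharing housing responsibilities Boxplot", ylab="Scale")
hist(Assignment_Data_2022$v7_8, col="orange", main="Importance of sharing housing responsibilities", xlab="Scale")
cv7_8 <- sd(Assignment_Data_2022$v7_8, na.rm=TRUE)/mean(Assignment_Data_2022$v7_8, na.rm=TRUE)
show(cv7_8)
0.4884844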
V7_9: Importance of having children
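table(Assignment_Data_2022$v7_9)
1 2 3 4 5 6 7 8 9 10
23 17 13 13 9 17 14 14 12 17
summary(Assignment_Data_2022$v7_9)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
1.000 2.000 5.000 5.255 8.000 10.000 1
var(Assignment_Data_2022$v7_9, na.rm=TRUE)
9.407491
sd(Assignment_Data_2022$v7_9, na.rm=TRUE)
3.067163
boxplot(Assignment_Data_2022$v7_9, col="orange", main="Importance of having children Boxplot", ylab="Level of importance")
hist(Assignment_Data_2022$v7_9, col="orange", main="Importance of having children", xlab="Scale")
cv7_9 <- sd(Assignment_Data_2022$v7_9, na.rm=TRUE)/mean(Assignment_Data_2022$v7_9, na.rm=TRUE)
show(cv7_9)
0.583662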
V7_10: Being open to face problems
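table(Assignment_Data_2022$v7_10)
1 2 3 4 5 6 7 8 9 10
18 17 9 25 13 9 19 14 7 18
summary(Assignment_Data_2022$v7_10)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
1.000 3.000 5.000 5.275 8.000 10.000 1
var(Assignment_Data_2022$v7_10, na.rm=TRUE)
8.538636
sd(Assignment_Data_2022$v7_10, na.rm=TRUE)
2.922094
boxplot(Assignment_Data_2022$v7_10, col="orange", main="Importance of being open to face problems", ylab="Scale")
hist(Assignment_Data_2022$v7_10, col="orange", main="Importance of being open to face problems", xlab="Scale")
cv7_10 <- sd(Assignment_Data_2022$v7_10, na.rm=TRUE)/mean(Assignment_Data_2022$v7_10, na.rm=TRUE)
show(cv7_10)
0.5539339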
V7_11: Importance of having time to meet friends
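table(Assignment_Data_2022$v7_11)
1 2 3 4 5 6 7 8 9 10
12 19 12 18 10 12 19 9 20 19
summary(Assignment_Data_2022$v7_11)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1.0 3.0 6.0 5.7 9.0 10.0
var(Assignment_Data_2022$v7_11, na.rm=TRUE)
8.855705
sd(Assignment_Data_2022$v7_11, na.rm=TRUE)
2.975854
boxplot(Assignment_Data_2022$v7_11, col="orange", main="Having time to meet friends Boxplot", ylab="Scale")
plot(ecdf(Assignment_Data_2022$v7_11), col="red", main="CDF importance of having time to meet friends", xlab="Scale", ylab="Cumulative Relative Frequencies")
cv7_11 <- sd(Assignment_Data_2022$v7_11, na.rm=TRUE)/mean(Assignment_Data_2022$v7_11, na.rm=TRUE)
show(cv7_11)
0.5220796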
B: Confidence Intervals
#Exercise 1
> age <- Assignment_Data_2022$v1
> apt_size <- Assignment_Data_2022$v6
> children_wanting <- Assignment_Data_2022$v7_9
> t.test(age, conf.level = 0.95)
One Sample t-test
data: age
t = 119.36, df = 147, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
42.26149 43.68446
sample estimates:
mean of x
42.97297
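The interval printed by `t.test()` can be reproduced by hand from the summary statistics in section A (mean 42.97297, standard deviation 4.379833, 148 non-missing ages). The sketch below is only a cross-check of the formula mean ± t × s/√n; the object names are illustrative.

```r
# Summary statistics reported for age (v1)
x_bar <- 42.97297      # sample mean
s     <- 4.379833      # sample standard deviation
n     <- 148           # 150 respondents minus 2 missing values

se     <- s / sqrt(n)                  # standard error of the mean
t_crit <- qt(0.975, df = n - 1)        # two-sided 95% critical value, 147 df
ci     <- c(lower = x_bar - t_crit * se,
            upper = x_bar + t_crit * se)
print(round(ci, 4))
```

Up to rounding of the inputs, the bounds agree with the interval 42.26149 to 43.68446 reported above.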
#Exercise 2
> t.test(apt_size, conf.level = 0.95)
One Sample t-test
data: apt_size
t = 34.767, df = 145, p-value < 2.2e-16
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
114.2505 128.0235
sample estimates:
mean of x
121.137
#Exercise 3
> children_yesorno <- Assignment_Data_2022$v5
> t.test(age ~ children_yesorno, conf.level = 0.95)
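Welch Two Sample t-test
data: age by children_yesorno
t = -0.51023, df = 137.25, p-value = 0.6107
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
95 percent confidence interval:
-1.857274 1.095389
sample estimates:
mean in group 0 mean in group 1
42.78082 43.16176
#The 95% confidence interval for the difference in mean age (men without children minus men with children) runs from -1.86 to 1.10 years. It contains 0, so the sample gives no evidence of a difference in mean age between the two groups.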
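To show what the Welch interval involves, the sketch below computes it by hand and compares it with `t.test()`. The two groups are simulated (the group sizes, means and standard deviations are assumptions chosen only for illustration, not values from the survey), so the numbers will not match the output above.

```r
set.seed(1)
# Simulated ages for two groups (illustrative, not the survey data)
g0 <- rnorm(70, mean = 43, sd = 4.4)
g1 <- rnorm(65, mean = 43, sd = 4.4)

v0 <- var(g0) / length(g0)
v1 <- var(g1) / length(g1)
se_diff <- sqrt(v0 + v1)                          # standard error of the difference
df_w <- (v0 + v1)^2 /
  (v0^2 / (length(g0) - 1) + v1^2 / (length(g1) - 1))   # Welch-Satterthwaite df
t_crit <- qt(0.975, df_w)
ci_manual <- (mean(g0) - mean(g1)) + c(-1, 1) * t_crit * se_diff

# t.test() uses the Welch correction by default (var.equal = FALSE)
ci_r <- t.test(g0, g1, conf.level = 0.95)$conf.int
print(rbind(manual = ci_manual, t.test = as.numeric(ci_r)))
```

The fractional degrees of freedom in the real output (137.25) come from the same Welch-Satterthwaite formula.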
C: Hypothesis testing
C1
> t.test(age, mu=42, conf.level=0.99)
One Sample t-test
data: age
t = 2.7026, df = 147, p-value = 0.007691
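alternative hypothesis: true mean is not equal to 42
99 percent confidence interval:
42.03343 43.91251
sample estimates:
mean of x
42.97297
#The p-value (0.007691) is smaller than α = 0.01, so we reject the null hypothesis that the mean age is 42 at the 1% significance level. Consistently, the 99% confidence interval (42.03, 43.91) does not contain 42.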
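The test statistic and p-value of this two-sided test can be reproduced from the summary statistics of age. The sketch below is a cross-check under the assumption that the reported mean, standard deviation and sample size (148) are the ones used by `t.test()`; the object names are illustrative.

```r
x_bar <- 42.97297      # sample mean of age
s     <- 4.379833      # sample standard deviation of age
n     <- 148           # non-missing observations
mu0   <- 42            # value under the null hypothesis
alpha <- 0.01

t_stat <- (x_bar - mu0) / (s / sqrt(n))
p_val  <- 2 * pt(abs(t_stat), df = n - 1, lower.tail = FALSE)   # two-sided p-value
reject <- p_val < alpha

print(c(t = t_stat, p_value = p_val))
print(reject)
```

The decision rule is simply `p_val < alpha`: the null hypothesis is rejected only when the p-value falls below the chosen significance level.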
C2
> t.test(apt_size, mu = 120, alternative = "greater", conf.level = 0.9)
One Sample t-test
data: apt_size
t = 0.32632, df = 145, p-value = 0.3723
alternative hypothesis: true mean is greater than 120
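90 percent confidence interval:
116.6513 Inf
sample estimates:
mean of x
121.137
> t.test(apt_size, mu = 120, alternative = "greater", conf.level = 0.95)
One Sample t-test
data: apt_size
t = 0.32632, df = 145, p-value = 0.3723
alternative hypothesis: true mean is greater than 120
95 percent confidence interval:
115.3691 Inf
sample estimates:
mean of x
121.137
> t.test(apt_size, mu = 120, alternative = "greater", conf.level = 0.99)
One Sample t-test
data: apt_size
t = 0.32632, df = 145, p-value = 0.3723
alternative hypothesis: true mean is greater than 120
99 percent confidence interval:
112.9409 Inf
sample estimates:
mean of x
121.137
#The p-value (0.3723) is larger than every α considered (0.1, 0.05, 0.01), so in no case do we have enough empirical evidence to reject the null hypothesis that the mean size is at most 120. As α decreases (the confidence level rises from 90% to 99%), the lower bound of the one-sided confidence interval moves further below the sample mean (116.65, 115.37, 112.94), and 120 always lies inside the interval.
We conclude that the mean size of the apartments is not significantly greater than 120 square feet, even at the 10% significance level.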
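The one-sided test can be checked from the summary statistics of the apartment size (mean 121.137, standard deviation 42.10031, 146 non-missing values). The sketch below recomputes the statistic, the p-value and the three lower bounds, and applies the decision rule at each α; the object names are illustrative.

```r
x_bar <- 121.137       # sample mean of apartment size
s     <- 42.10031      # sample standard deviation
n     <- 146           # 150 respondents minus 4 missing values
mu0   <- 120

se     <- s / sqrt(n)
t_stat <- (x_bar - mu0) / se
p_val  <- pt(t_stat, df = n - 1, lower.tail = FALSE)   # H1: mean > 120

alphas       <- c(0.10, 0.05, 0.01)
lower_bounds <- x_bar - qt(1 - alphas, df = n - 1) * se   # one-sided lower bounds
decision     <- ifelse(p_val < alphas, "reject H0", "do not reject H0")

print(data.frame(alpha = alphas,
                 lower_bound = round(lower_bounds, 4),
                 decision = decision))
```

The p-value does not depend on α; only the critical value, and therefore the lower bound, changes across the three runs.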
C.3. Is the difference in the mean value of the size of the apartment significantly different between
men with children and men without children? Use α=0.01. Comment on the results.
> t.test( apt_size ~ children_yesorno, conf.level = 0.99)
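Welch Two Sample t-test
data: apt_size by children_yesorno
t = 1.5034, df = 133.78, p-value = 0.1351
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
99 percent confidence interval:
-7.992637 29.650434
sample estimates:
mean in group 0 mean in group 1
127.0135 116.1846
#Since the p-value (0.1351) is larger than α = 0.01, we do not have empirical evidence to reject the null hypothesis: the difference between the mean apartment size of group 0 (no children) and group 1 (children) is not significant. The 99% confidence interval contains 0, which leads to the same conclusion.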
C.4. Is the difference in the mean value of age significantly different between men with children
and men without children? Use α=0.01. Comment on the results.
> t.test(age ~ children_yesorno, conf.level = 0.99)
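Welch Two Sample t-test
data: age by children_yesorno
t = -0.51023, df = 137.25, p-value = 0.6107
alternative hypothesis: true difference in means between group 0 and group 1 is not equal to 0
99 percent confidence interval:
-2.331162 1.569276
sample estimates:
mean in group 0 mean in group 1
42.78082 43.16176
#Since the p-value (0.6107) is larger than α = 0.01, we do not have empirical evidence to reject the null hypothesis: the difference between the mean age of group 0 (no children) and group 1 (children) is not significant. The 99% confidence interval contains 0, which leads to the same conclusion.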
D: Linear Regression exercise

> #Exercise 1
> m1<- lm(formula = children_wanting ~ age, data = Assignment_Data_2022)
> summary(m1)
Call:
lm(formula = children_wanting ~ age, data = Assignment_Data_2022)
Residuals:
Min 1Q Median 3Q Max
-4.2887 -3.1470 -0.1341 2.7886 4.8787
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.65759 2.50136 1.862 0.0646
age 0.01288 0.05793 0.222 0.8244
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 3.074 on 145 degrees of freedom
(3 observations deleted due to missingness)
Multiple R-squared: 0.0003408, Adjusted R-squared: -0.006553
F-statistic: 0.04944 on 1 and 145 DF, p-value: 0.8244
#The residuals range from -4.29 to 4.88 with a median of -0.13, so their distribution is roughly, though not perfectly, symmetric around zero; individual predictions can still fall several points away from the observed scores.
The intercept (4.65759) is the expected importance score at age 0. Since the sample only covers men aged 36 to 49, it has no practical interpretation on its own.
The slope (0.01288) is the estimated change in the “children importance score” for each additional year of age; its standard error is 0.05793. The t-value measures how many standard errors the estimate lies from 0: the further from zero, the stronger the evidence against the null hypothesis of no relationship between children importance and age. Here the slope has t = 0.222. The p-value is the probability of observing a t-value at least as extreme, in absolute value, if the null hypothesis were true. A small p-value (usually < 0.05) would allow us to reject the null hypothesis; here the p-value of the slope is 0.8244, so we cannot reject it and we find no evidence of a linear relationship between age and the importance given to having children.
The Residual Standard Error (3.074) is the typical distance of the observed scores from the fitted regression line. The R-squared statistic measures the share of the variability of the response explained by the model; the closer it is to 1 the better. An acceptable level of R-squared depends on the case, but here it is 0.0003408, so age explains about 0.03% of the variability, and the adjusted R-squared is even negative (-0.006553): the goodness of fit is very poor.
The F-statistic (0.04944 on 1 and 145 degrees of freedom, p-value 0.8244) tests whether the model explains more than a model with the intercept only. It is far too small to be significant, which confirms the conclusion drawn from the slope.
> plot(m1)
[Plot: Residuals vs Fitted, lm(children_wanting ~ age); x-axis: Fitted values; y-axis: Residuals; labelled observations: 10, 81, 91]
#The plot doesn’t follow a horizontal line, so it looks like the linearity
assumption is not verified.
[Plot: Normal Q-Q, lm(children_wanting ~ age); x-axis: Theoretical Quantiles; y-axis: Standardized residuals; labelled observations: 10, 81, 91]
#The normal probability plot of residuals approximately follows a straight line, so we can assume Normality.
[Plot: Scale-Location, lm(children_wanting ~ age); x-axis: Fitted values; labelled observations: 10, 81, 91]
#It can be seen that the variability (variances) of the residual points
doesn’t increase with the value of the fitted outcome variable, suggesting
constant variances in the residuals errors (or homoscedasticity).
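The graphical checks above can be complemented with simple numerical checks. The sketch below is one possible approach and not the procedure used in this report: it runs a Shapiro-Wilk test on the residuals (normality), an auxiliary regression of the squared residuals on the predictor in the spirit of the Breusch-Pagan test (homoscedasticity), and a comparison with a model that adds a quadratic term (linearity). The data are simulated stand-ins, so the p-values say nothing about the survey.

```r
set.seed(42)
# Simulated stand-ins for age and the 1-10 importance score (illustrative only)
age_sim   <- sample(36:49, 150, replace = TRUE)
score_sim <- sample(1:10, 150, replace = TRUE)
fit <- lm(score_sim ~ age_sim)

# Normality of residuals: Shapiro-Wilk test (H0: residuals are normal)
sw <- shapiro.test(residuals(fit))

# Homoscedasticity: regress squared residuals on the predictor.
# A significant slope would suggest that the error variance changes with age.
aux   <- lm(residuals(fit)^2 ~ age_sim)
aux_p <- summary(aux)$coefficients["age_sim", "Pr(>|t|)"]

# Linearity: add a quadratic term and compare the two nested models
fit_quad <- lm(score_sim ~ age_sim + I(age_sim^2))
lin_p    <- anova(fit, fit_quad)[2, "Pr(>F)"]

print(c(shapiro_p = sw$p.value, variance_p = aux_p, curvature_p = lin_p))
```

With the real data, `fit` would be replaced by `m1`, and small p-values would point to a violation of the corresponding assumption.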
> #Exercise 2
> m2 <- lm(formula = children_wanting ~ apt_size, data = Assignment_Data_2022)
> summary(m2)
Call:
lm(formula = children_wanting ~ apt_size, data = Assignment_Data_2022)
Residuals:
Min 1Q Median 3Q Max
-4.7791 -2.4263 -0.1017 2.2673 4.8885
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 4.672100 0.742584 6.292 3.66e-09 ***
apt_size 0.008451 0.005766 1.466 0.145
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Multiple R-squared: 0.0149, Adjusted R-squared: 0.007964
#The residuals range from -4.78 to 4.89 with a median of -0.10 and quartiles of -2.43 and 2.27, so their distribution appears quite symmetrical around zero.
The intercept (4.6721) is the expected importance score for an apartment of size 0. No apartment in the sample is that small (the minimum is 50), so it has no practical interpretation on its own.
The slope (0.008451) is the estimated change in the “children importance score” for each additional square foot of apartment size; its standard error is 0.005766. The t-value measures how many standard errors the estimate lies from 0: the further from zero, the stronger the evidence against the null hypothesis of no relationship between children importance and apartment size. The intercept is clearly different from zero (t = 6.292, p-value 3.66e-09), but the slope is not (t = 1.466). The p-value is the probability of observing a t-value at least as extreme, in absolute value, if the null hypothesis were true. A small p-value (usually < 0.05) would allow us to reject the null hypothesis; here the p-value of the slope is 0.145, so we cannot reject it and we find no significant linear relationship between apartment size and the importance given to having children.
The Residual Standard Error (2.895) is the typical distance of the observed scores from the fitted regression line. The R-squared statistic measures the share of the variability of the response explained by the model; the closer it is to 1 the better. An acceptable level of R-squared depends on the case, but here it is 0.0149 (adjusted 0.007964), so apartment size explains about 1.5% of the variability: the goodness of fit is poor.
The F-statistic (2.148) tests whether the model explains more than a model with the intercept only. In a simple regression it is the square of the slope's t-value and has the same p-value (0.145), so it confirms that the model is not significant at the usual levels.
plot(m2)
[Plot: Residuals vs Fitted, lm(children_wanting ~ apt_size); x-axis: Fitted values; y-axis: Residuals; labelled observations: 2, 98, 113]
#The plot stays around 0 but doesn’t really follow a straight line. Linearity
may not be satisfied.
[Plot: Normal Q-Q, lm(children_wanting ~ apt_size); labelled observations: 2, 98, 113]
[Plot: Scale-Location, lm(children_wanting ~ apt_size); x-axis: Fitted values; labelled observations: 2, 98, 113]
#It can be seen that the variability (variances) of the residual points
increases with the value of the fitted outcome variable, suggesting non-
constant variances in the residuals errors (or heteroscedasticity). A possible
solution to reduce the heteroscedasticity problem is to use a log or square
root transformation of the outcome variable (y).
> #Exercise 3
> m3 <-lm(formula= children_wanting ~ age + apt_size, data = Assignment_Data_2022)
> summary(m3)
Call:
lm(formula = children_wanting ~ age + apt_size, data = Assignment_Data_2022)
Residuals:
Min 1Q Median 3Q Max
-5.0579 -2.4534 0.0841 2.3927 5.3139
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.972912 2.521399 0.782 0.435
age 0.063309 0.055714 1.136 0.258
apt_size 0.008190 0.005827 1.406 0.162
Multiple R-squared: 0.02255, Adjusted R-squared: 0.00849
#The residuals range from -5.06 to 5.31 with a median of 0.08 and quartiles of -2.45 and 2.39, so their distribution appears quite symmetrical around zero.
The intercept (1.972912) is the expected importance score for a man aged 0 living in an apartment of size 0. Such values lie far outside the sample, so it has no practical interpretation on its own.
Each slope is the estimated change in the “children importance score” for a one-unit increase in that predictor, holding the other predictor constant: 0.063309 for each additional year of age and 0.008190 for each additional square foot. Their standard errors are 0.055714 and 0.005827. The t-value measures how many standard errors an estimate lies from 0: the further from zero, the stronger the evidence against the null hypothesis of no relationship between children importance and that predictor. Here the t-values are all positive but small (0.782, 1.136 and 1.406). The p-value is the probability of observing a t-value at least as extreme, in absolute value, if the null hypothesis were true. A small p-value (usually < 0.05) would allow us to reject the null hypothesis; this is not the case, since the p-values are 0.258 for age and 0.162 for apartment size, so neither predictor is significant.
The Residual Standard Error (2.908) is the typical distance of the observed scores from the fitted regression surface. The R-squared statistic measures the share of the variability of the response explained by the model; the closer it is to 1 the better. An acceptable level of R-squared depends on the case, but here it is 0.02255, about 2.3% of the variability, and the adjusted R-squared (0.00849), which penalises the additional predictor, is barely higher than in the model with apartment size only (0.007964).
The F-statistic (1.604) tests whether the two predictors together explain more than a model with the intercept only. It is small, in line with the non-significant slopes, so the goodness of fit of the model remains poor.
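The link between R-squared, adjusted R-squared and the overall F-test can be made explicit. The sketch below uses simulated data (sample size, means and standard deviations are arbitrary assumptions, not the survey data) in which the response is unrelated to both predictors, and recomputes the adjusted R-squared as 1 - (1 - R²)(n - 1)/(n - k - 1) together with the p-value of the F-statistic.

```r
set.seed(7)
n  <- 140                          # arbitrary sample size for the illustration
x1 <- rnorm(n, mean = 43, sd = 4.4)
x2 <- rnorm(n, mean = 121, sd = 42)
y  <- 5 + rnorm(n, sd = 3)         # response unrelated to the predictors

fit <- lm(y ~ x1 + x2)
s   <- summary(fit)
k   <- 2                           # number of predictors

# Adjusted R-squared penalises R-squared for the number of predictors
adj_manual <- 1 - (1 - s$r.squared) * (n - 1) / (n - k - 1)

# p-value of the overall F-test, from the statistic stored in the summary
f   <- s$fstatistic
f_p <- pf(f[["value"]], f[["numdf"]], f[["dendf"]], lower.tail = FALSE)

print(c(r2 = s$r.squared, adj_r2 = adj_manual, f_p_value = f_p))
```

Applied to `summary(m3)`, the same two lines would give the p-value of the F-statistic, which is what decides whether the model as a whole is significant.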
> plot(m3)
[Plot: Residuals vs Fitted, lm(children_wanting ~ age + apt_size); x-axis: Fitted values; y-axis: Residuals; labelled observations: 2, 90, 113]
#even if the plot varies a bit with the fitted values, it looks stable and around 0. Linearity seems
to be satisfied.
[Plot: Normal Q-Q, lm(children_wanting ~ age + apt_size); labelled observations: 2, 90, 113]
[Plot: Scale-Location, lm(children_wanting ~ age + apt_size); x-axis: Fitted values; labelled observations: 2, 90, 113]
#It can be seen that the variability (variances) of the residual points doesn’t increase with the value of the
fitted outcome variable, suggesting constant variances in the residuals errors (or homoscedasticity).
E. Prediction
E.1. By using the model in D.3, predict the value of v7_9 for a person aged 40 and living in a house sized
100 squared feet. Comment on it.
E.2. By using the model in D.3, predict the mean value of v7_9 for a person aged 40 and living in a house
sized 100 squared feet and also determine the 95% confidence interval for it. Comment on it.
E1:
> predict(m3, newdata = data.frame(age = 40, apt_size = 100))
1
5.071264
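#The predicted score for a man aged 40 living in a 100 square feet house is 5.07 on the 1-10 scale. This is close to the midpoint of the scale and to the sample mean of v7_9 (5.255), so the model predicts a roughly neutral attitude towards having children.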
E2:
> predict(m3, newdata = data.frame(age = 40, apt_size = 100), interval = "confidence", level = 0.95)
fit lwr upr
1 5.071264 4.400601 5.741926
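#The estimated mean score for men aged 40 living in a 100 square feet house is 5.07, with a 95% confidence interval from 4.400601 to 5.741926. The interval refers to the mean of v7_9 for men with these characteristics, not to the score of a single man, which would be much more variable. The mean attitude is therefore still close to the midpoint of the scale.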
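The difference between E.1 and E.2 is the difference between predicting one man's score and estimating the mean score of all men with the same characteristics. `predict()` covers both through its `interval` argument. The sketch below uses simulated data (the coefficients used to generate it are assumptions for illustration only) to show that the two intervals share the same point estimate while the prediction interval is wider.

```r
set.seed(3)
n <- 140
# Simulated stand-ins for the survey variables (illustrative only)
sim <- data.frame(age      = sample(36:49, n, replace = TRUE),
                  apt_size = round(runif(n, 50, 198)))
sim$score <- 2 + 0.06 * sim$age + 0.008 * sim$apt_size + rnorm(n, sd = 3)

fit     <- lm(score ~ age + apt_size, data = sim)
new_man <- data.frame(age = 40, apt_size = 100)

# Interval for the MEAN score of men aged 40 in a 100 sq ft house
conf_int <- predict(fit, newdata = new_man, interval = "confidence", level = 0.95)
# Interval for the score of ONE such man (adds the residual variability)
pred_int <- predict(fit, newdata = new_man, interval = "prediction", level = 0.95)

print(rbind(confidence = conf_int[1, ], prediction = pred_int[1, ]))
```

With the fitted model of D.3, replacing `fit` by `m3` in the second call would give the range in which the score of a single 40-year-old man is expected to fall.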
