
JTMS-03 Applied Statistics with R

Spring Semester 2023

Lab 10 – Simple linear regression – Solution


April 18, 2023

You work at the Federal Ministry of Labor and Social Affairs, and have to devise evidence-based policy
recommendations for improving social cohesion in Germany. Therefore, you work with the data of the 2017
Social Cohesion Radar. The study developed an index of social cohesion for the so-called planning regions
of Germany (Raumordnungsregionen [1]) and related it to various regional characteristics. In particular, you
investigate to what extent the degree of social cohesion in the regions is related to their economic affluence
in terms of GDP. Test the hypothesis that the more affluent regions are more cohesive.

Data: SCR2017_ROR.sav
Source: Bertelsmann Stiftung (2017) [2]
Variables (only the relevant ones):
region    Planning region
abscoh    Degree of social cohesion in region (0 = very weak to 100 = very strong)
gdpEUR    Gross domestic product per capita of region (in thousand EUR)

Reading the data in R

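setwd("Type/your/directory/here")
library(foreign)  # provides read.spss()
# Read the SPSS file as a data frame; keep numeric codes and apply user-defined missings
data.lab10 <- read.spss("SCR2017_ROR.sav", to.data.frame = TRUE,
                        use.value.labels = FALSE, use.missings = TRUE)
attach(data.lab10)  # makes gdpEUR, abscoh and region accessible by name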

Tasks
1. Descriptive statistics (see Lab 9 for descriptive information on social cohesion).
a) Describe the regions’ GDP in terms of: central tendency (median, mean) and dispersion (minimum,
maximum, range, interquartile range, standard deviation, skewness, kurtosis).

summary(gdpEUR)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 21.32 27.38 31.76 32.99 35.59 68.33

library(psych)  # provides describe()
describe(gdpEUR, IQR = TRUE)
## vars n mean sd median min max range skew kurtosis se IQR
## X1 1 79 32.99 8.61 31.76 21.32 68.33 47.01 1.59 3.22 0.97 8.21

The average per capita GDP of the German regions is about 32,990 EUR. Half of the regions have a per capita GDP
below the median of 31,760 EUR; the other half lie above it.

[1] Germany has 96 planning regions. However, in order to keep the total sample size within feasible limits (over 5,000
respondents), the study merged some neighboring planning regions with similar socio-demographic characteristics,
thereby arriving at 79 ‘homogenized’ regions.
[2] https://www.bertelsmann-stiftung.de/en/publications/publication/did/sozialer-zusammenhalt-in-deutschland-2017

The GDP of the least affluent region is about 21,320 EUR, whereas that of the most affluent region is about
68,330 EUR. The range between the least and the most affluent German region is thus about 47,010 EUR. In
contrast to this wide range, the interquartile range is only about 8,210 EUR: the middle half of the regions
have a GDP between about 27,380 EUR and 35,590 EUR. The standard deviation of regional GDP is about
8,610 EUR. In addition, the distribution is characterized by a strong positive skew (g1 = 1.59) and even more
pronounced leptokurtosis (g2 = 3.22), both of which indicate major deviations from a normal distribution.

b) Visualize the distribution of GDP using a box plot.

The boxplot in Figure 1.1 visualizes the distribution of regional affluence. It shows, for example, that the
wide range and the positive skew of regional GDP are attributable to six comparatively very affluent regions.
The latter stand out as outliers relative to the other regions.

boxplot(gdpEUR, horizontal = TRUE, xlab = "GDP pc (T EUR)")

Figure 1.1 Boxplot for the distribution of regional affluence (GDP pc, TEUR)
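
The outlier classification in the boxplot can be checked by hand. The sketch below assumes the default boxplot() rule (points more than 1.5 box lengths beyond the box are drawn as outliers) and uses the rounded quartiles from the summary() output as a stand-in for the hinges that boxplot() computes internally, so the fences are approximate. The variable names are illustrative.

```r
# Quartiles of gdpEUR as reported by summary() above (thousand EUR)
q1 <- 27.38
q3 <- 35.59

iqr <- q3 - q1                   # interquartile range (box length)
upper_fence <- q3 + 1.5 * iqr    # values above this are drawn as outliers
lower_fence <- q1 - 1.5 * iqr    # values below this would be drawn as outliers

round(c(iqr = iqr, lower = lower_fence, upper = upper_fence), 2)
```

The upper fence lies at roughly 47.9 thousand EUR, well below the maximum of 68.33, while the minimum of 21.32 lies above the lower fence, so outliers are expected only at the upper end. With the data attached, `gdpEUR[gdpEUR > upper_fence]` would list the values beyond the upper fence.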

2. Association between social cohesion and affluence.


a) What regression method can be used to test the hypothesis, and why? Which variable is the
independent and which the dependent variable in the model?

The hypothesis involves the relationship between two variables measured on a continuous scale. In more
precise terms, the measurement of social cohesion is of interval quality, whereas that of GDP is at the ratio
level. The appropriate regression method for addressing the hypothesis is simple linear regression with
social cohesion (the dependent variable) regressed on GDP (the predictor). (The same hypothesis can also
be tested outside the regression framework with a Pearson’s correlation.)

b) Specify the regression model. Report and interpret the evidence in terms of: amount of explained
variation, overall test of model fit, unstandardized and standardized coefficients. Assess the
significance of the estimates at the 5% level.

reg.m1 <- lm(abscoh ~ gdpEUR)

The model with GDP as the only predictor explains about 13.4% (R² = 0.134) of the variation in the level
of social cohesion across the regions. According to the overall F-test (ANOVA) of model fit, this amount of
explained variation is highly significant and clears the 5% level by a wide margin: F(1, 77) = 11.91, p < 0.001.
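
For a model with a single predictor, the F statistic follows directly from R² and the sample size. The following is a minimal sketch using the rounded R² quoted above and n = 79 regions; the variable names are illustrative and the result is subject to rounding.

```r
r2 <- 0.134    # share of explained variation (rounded)
n  <- 79       # number of regions
k  <- 1        # number of predictors

# F = (R^2 / k) / ((1 - R^2) / (n - k - 1))
f_stat <- (r2 / k) / ((1 - r2) / (n - k - 1))

# Upper-tail probability of the F distribution with (k, n - k - 1) df
p_val <- pf(f_stat, df1 = k, df2 = n - k - 1, lower.tail = FALSE)

round(f_stat, 2)
p_val < 0.001
```

With the fitted model available, `summary(reg.m1)$r.squared` and `summary(reg.m1)$fstatistic` return the unrounded quantities.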

The sign of the unstandardized regression coefficient for regional GDP (b = 0.10) shows that the relationship
between the two characteristics is positive: as hypothesized, the more affluent regions tend to be more
cohesive (see Figure 1.2). The size of the coefficient (the slope of the predictor) indicates that a region
whose per capita GDP is 1,000 EUR higher than that of another region (a one-unit increase on the predictor)
is predicted to have a level of social cohesion that is 0.1 points higher. The relationship between GDP and
social cohesion is significant at the 5% level, and indeed highly significant (t = 3.45, p < 0.001).

library(lm.beta)
summary(lm.beta(reg.m1))
##
## Call:
## lm(formula = abscoh ~ gdpEUR)
##
## Residuals:
## Min 1Q Median 3Q Max
## -5.9147 -1.4389 -0.2015 1.5241 4.0478
##
## Coefficients:
## Estimate Standardized Std. Error t value Pr(>|t|)
## (Intercept) 58.08713 0.00000 0.99241 58.531 < 2e-16 ***
## gdpEUR 0.10048 0.36595 0.02912 3.451 0.000911 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.213 on 77 degrees of freedom
## Multiple R-squared: 0.1339, Adjusted R-squared: 0.1227
## F-statistic: 11.91 on 1 and 77 DF, p-value: 0.0009112
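
The significance of the slope can also be expressed as a 95% confidence interval. The sketch below rebuilds the t statistic and the interval from the estimate, standard error and residual degrees of freedom printed above; the variable names are illustrative, and the limits are accurate only up to the rounding of the printed values.

```r
b  <- 0.10048    # slope estimate for gdpEUR
se <- 0.02912    # standard error of the slope
df <- 77         # residual degrees of freedom

t_val  <- b / se                       # t statistic for H0: slope = 0
t_crit <- qt(0.975, df)                # two-sided 95% critical value
ci     <- b + c(-1, 1) * t_crit * se   # lower and upper confidence limits

round(t_val, 3)
round(ci, 3)
```

The interval excludes zero, which agrees with the rejection of the null hypothesis at the 5% level. With the fitted model available, `confint(reg.m1)` should give the same limits up to rounding.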

The standardized regression coefficient for the effect of regional affluence on social cohesion (β = 0.366)
indicates that the association between the two characteristics is moderate. A region whose GDP is one
standard deviation higher is predicted to have a degree of social cohesion that is 0.366 standard deviations higher.
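
To make the standardized coefficient more concrete, the sketch below converts one standard deviation of GDP into cohesion points, using the relation β = b · sd(x) / sd(y). It relies on the slope and the standard deviation of gdpEUR reported above. The standard deviation of abscoh is not reported in this lab, so the value computed here is only what the two coefficients imply, not a figure taken from the data.

```r
b      <- 0.10048    # unstandardized slope
beta   <- 0.36595    # standardized slope from lm.beta()
sd_gdp <- 8.61       # standard deviation of gdpEUR from describe()

effect_1sd     <- b * sd_gdp          # cohesion points per 1 SD of GDP
sd_coh_implied <- effect_1sd / beta   # implied by beta = b * sd_x / sd_y

round(effect_1sd, 2)
round(sd_coh_implied, 2)
```

Under these assumptions, a GDP difference of one standard deviation (about 8,610 EUR) corresponds to a predicted difference of a little under 0.9 points on the 0 to 100 cohesion scale.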

c) Visualize the relationship between social cohesion and affluence.

plot(gdpEUR, abscoh, main= "Relationship between GDP and cohesion", cex= 0.5,
cex.lab= 1.3, pch= 20, xlab= "GDP pc (in TEUR)", ylab= "Social cohesion")
abline(reg.m1, lwd= 2, col= "red")
text(gdpEUR, abscoh+0.2, labels= region, cex= 0.8)

Figure 1.2 Scatter plot for the effect of regional affluence (GDP) on social cohesion

d) What conclusion can be reached regarding the hypothesis?

The degree of social cohesion was found to be significantly higher in more affluent regions (at the 5% level).
Hence, the data provide empirical evidence in support of the hypothesis.

e) Compute the Pearson correlation between social cohesion and affluence. What do you notice?

cor.test(gdpEUR, abscoh)
##
## Pearson's product-moment correlation
##
## data: gdpEUR and abscoh
## t = 3.4505, df = 77, p-value = 0.0009112
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## 0.1575884 0.5431122
## sample estimates:
## cor
## 0.3659476

The Pearson correlation coefficient for the association between regional affluence and the regional level of
social cohesion (r = 0.366) is identical to the standardized regression coefficient for the effect of affluence
on cohesion (β = 0.366), as is always the case in simple linear regression. In addition, the t statistic
(t = 3.45, df = 77) and the p-value of both estimates are identical (p = 0.0009112).
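
The link between the correlation test and the regression output can be verified from the reported correlation alone. The sketch below, with illustrative variable names, squares r to obtain R² and rebuilds the t statistic and the two-sided p-value for n = 79 regions.

```r
r <- 0.3659476    # Pearson correlation from cor.test()
n <- 79           # number of regions

r_squared <- r^2                               # equals R^2 in simple regression
t_val <- r * sqrt(n - 2) / sqrt(1 - r^2)       # t statistic for H0: rho = 0
p_val <- 2 * pt(abs(t_val), df = n - 2, lower.tail = FALSE)   # two-sided p-value

round(r_squared, 4)
round(t_val, 4)
signif(p_val, 3)
```

The squared correlation reproduces the Multiple R-squared of the regression (0.1339), and the t statistic matches the one reported for the slope of gdpEUR.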

f) Write down the regression equation for the prediction of cohesion on the basis of affluence.

Predicted degree of social cohesion = 58.087 + 0.100 * GDP (GDP per capita in thousand EUR)

g) Assuming that the GDP in the region of Hamburg (HH1) is 58.5 thousand Euro and that in the
region around Dortmund (NW8) is 29.5 thousand Euro, calculate the difference in the predicted
levels of social cohesion of the two regions.

Hamburg (HH1): Predicted degree of social cohesion = 58.087 + 0.100 * 58.5 = 63.937
Dortmund (NW8): Predicted degree of social cohesion = 58.087 + 0.100 * 29.5 = 61.037

Thus, the difference in the predicted degrees of social cohesion of the two regions is 2.9 points.
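
The hand calculation above can be reproduced with a small helper function. `predict_cohesion` is a hypothetical name introduced for illustration, and its defaults are the rounded coefficients from the regression equation. The last step repeats the difference with the unrounded slope from the regression output to show the effect of rounding.

```r
# Prediction equation with the rounded coefficients from task f)
predict_cohesion <- function(gdp, intercept = 58.087, slope = 0.100) {
  intercept + slope * gdp
}

gdp  <- c(HH1 = 58.5, NW8 = 29.5)    # GDP per capita in thousand EUR
pred <- predict_cohesion(gdp)
diff_rounded <- unname(pred["HH1"] - pred["NW8"])

# Same difference with the unrounded slope; the intercept cancels out
diff_unrounded <- 0.10048 * (58.5 - 29.5)

round(pred, 3)
round(c(diff_rounded, diff_unrounded), 3)
```

Because the intercept cancels, the difference depends only on the slope and the GDP gap of 29 thousand EUR. With the fitted model available, `predict(reg.m1, newdata = data.frame(gdpEUR = c(58.5, 29.5)))` would return the predictions based on the unrounded coefficients.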
