Professional Documents
Culture Documents
Sheet V - Solutions
Descriptive Statistics - Linear Regression
1. For the X, Y data:
xi 2 6 3 4 5
yi 3 7 4 7 6
draw a scatterplot and compute
(a) covariance
(b) coefficient of correlation
(c) regression line: criterion variable Y and predictor variable X
(d) regression line: criterion variable X and predictor variable Y
i x _ i y _ i x _ i * x _ i y _ i * y _ i x _ i * y _ i
1 2 3 4 9 6
2 6 7 3 6 4 9 4 2
3 3 4 9 1 6 1 2
4 4 7 1 6 4 9 2 8
5 5 6 2 5 3 6 3 0
S u m 2 0 2 7 9 0 1 5 9 1 1 8
X Y S c a tte r p lo t
m e a n 4 5 ,4
v a r ia n c e 2 ,5 3 ,3
9
c o v a r ia n c e 2 ,5 8
c o e ff. o f c o rr. 0 ,8 7 0 3 8 8 2 8
7
Y = a + b X : a = 1 ,4 6
b = 1
5 L in e a r ( y _ i)
y _ i
X = a + b Y : a = -0 ,0 9 0 9 0 9 0 9 4 L in e a r ( x _ i)
b = 0 ,7 5 7 5 7 5 7 6
3
2
1
x y 0
2 ,1 8 1 8 1 8 1 8 2 3 1 2 3 4 5 6 7
5 ,2 1 2 1 2 1 2 1 2 7
x _ i
Answer:
#####################################################
# D e s c r i p t i v e S t a t i s t i c s : S i m p l e example l i n e a r r e g r .
#
# F i l e : d e s s t a t e x l i n r e g .R
#
#####################################################
# data
x <− c ( 2 , 6 , 3 , 4 , 5 )
# scatterplot
p l o t ( x , y , main=” s c a t t e r p l o t ” , x l i m=c ( 0 , 7 ) , ylim = c ( 0 , 8 ) )
# covariance
cov ( x , y ) # 2 . 5
# c o e f f i c i e n t of correlation
cor (x , y) # 0.8703883
# r e g r e s s i o n l i n e : y=a+bx
# c r i t e r i o n v a r i a b l e Y and p r e d i c t o r variable X
lm ( y ˜ x )
# a = 1.4 , b = 1.0
# r e g r e s s i o n l i n e : x = alpha + beta ∗ y
# c r i t e r i o n v a r i a b l e X and p r e d i c t o r v a r i a b l e Y
lm ( x ˜ y )
# alpha = −0.0909 , beta = 0 . 7 5 7 6
a l p h a <− lm ( x ˜ y ) $ c o e f f i c i e n t s [ 1 ]
b e t a <− lm ( x ˜ y ) $ c o e f f i c i e n t s [ 2 ]
# t r a n s f o r m t h e r e g r e s s i o n x = a l p h a + b e t a ∗ x t o y= a ’ + b ’ x
a s t r i c h <− −a l p h a / b e t a
b s t r i c h <− 1/ b e t a
# eps− f i l e
dev . c o p y 2 e p s ( f i l e = ” . . / p i c t u r e s / e x b i . e p s ” )
2. For a certain class, the relationship between the amount of time spent
in exercises (X) and the test score (Y) was examined.
(a) Draw a scatterplot of the data.
4
2
0
5 10 15 20
time spent
(a)
(b) The scatter plot indicates a positive association.
(c) covariance: sxy = 3.25, coefficient of correlation: r = 0.861269
(d) regression line: Y = −0.9693878 + 0.6466837 · X
0 2 4 6 8 12
time spent
0 5 10 15 20
time spent
blue=regr. without (20,0),red=regr. with (20,0)
The new value of coefficient of determination is B̂ = 0.01258843.
We see that linear regression is quite sensitive to outliers.
#####################################################
# D e s c r i p t i v e S t a t i s t i c s : t i m e s p e n t i n e x e r c i s e and
# s c o r e i n exam
#
# F i l e : d e s s t a t e x e r c s c o r e .R
#
#####################################################
# load package
library ( tidyverse )
# For a c e r t a i n c l a s s , t h e r e l a t i o n s h i p between t h e
# amount o f t i m e s p e n t i n e x e r c i s e s (X) and t h e t e s t
# s c o r e (Y) was examined .
r e s u l t s <−
tibble (
time = c ( 1 0 , 9 , 9 , 1 1 , 1 0 , 1 0 , 6 , 1 0 , 8 , 1 2 ,9 ,4 ,12) ,
score = c (5 , 5 ,4 ,6 , 7 , 5 , 3 ,4 , 5 ,7 , 4 ,2 ,8)
)
# a ) Draw a s c a t t e r p l o t o f t h e d a t a .
plot (x = results$time ,
y = results$score ,
main=”t i m e s p e n t i n e x e r c i s e s and s c o r e ” ,
x l a b =”t i m e s p e n t ” , y l a b =” s c o r e ” )
# Compute t h e c o v a r i a n c e and t h e c o e f f i c i e n t o f c o r r e l a t i o n .
cov ( r e s u l t s $ t i m e , r e s u l t s $ s c o r e ) # 3 . 2 5
cor ( results$time , r e s u l t s $ s c o r e ) # 0.861269
# or
r e s u l t s %>%
summarise ( ” c o v a r i a n c e ” = cov ( time , s c o r e ) ,
” c o e f f . o f c o r r e l a t i o n ” = c o r ( time , s c o r e ) )
# Find t h e p r e d i c t e d s c o r e f o r someone w i t h 8 u n i t s of
# time spent i n e x e r c i s e s .
p r e d t e s t s c o r e <− a+b ∗8
p r e d t e s t s c o r e # 4.204082
plot (x = results$time ,
y = results$score ,
xlim = c ( −1 ,13) ,
ylim = c ( −1 ,10) ,
main=”t i m e s p e n t i n e x e r c i s e s and s c o r e ” ,
x l a b =”t i m e s p e n t ” , y l a b =” s c o r e ” )
# add t h e r e g r e s s i o n l i n e
a b l i n e ( r e g 1 , c o l =” b l u e ” )
# add l i n e s e g m e n t s
segments ( 8 , p r e d t e s t s c o r e , 0 , p r e d t e s t s c o r e )
segments ( 8 , 0 , 8 , p r e d t e s t s c o r e )
# add a x i s
a b l i n e ( h=0)
a b l i n e ( v=0)
# eps− f i l e
dev . c o p y 2 e p s ( f i l e = ” . . / p i c t u r e s / s c o r e r e g 2 . e p s ” )
# add t h e p o i n t ( 2 0 , 0 )
r e s u l t s <−
r e s u l t s %>%
add row ( t i m e =20 , s c o r e =0)
# c a l c u l a t e again the r e g r e s s i o n l i n e
r e g 2 <− lm ( r e s u l t s $ s c o r e ˜ r e s u l t s $ t i m e )
# b l o t both r e g r e s s i o n l i n e s
plot (x = results$time ,
y = results$score ,
xlim = c ( −1 ,21) ,
ylim = c ( −1 ,10) ,
main=”t i m e s p e n t i n e x e r c i s e s and s c o r e ” ,
sub=” b l u e=r e g r . w i t h o u t ( 2 0 , 0 ) , r e d=r e g r . w i t h (20 ,0)” ,
x l a b =”t i m e s p e n t ” , y l a b =” s c o r e ” )
# add r e g r e s s i o n l i n e s
a b l i n e ( r e g 2 , c o l =”r e d ” )
a b l i n e ( r e g 1 , c o l =” b l u e ” )
# add a x i s
a b l i n e ( h=0)
a b l i n e ( v=0)
# eps− f i l e
dev . c o p y 2 e p s ( f i l e = ” . . / p i c t u r e s / s c o r e r e g 3 . e p s ” )
# c o e f f i e c i e n t of determination
B new <− c o r ( r e s u l t s $ t i m e , r e s u l t s $ s c o r e ) ˆ 2
B new
Answer:
(a) Calculating the ranks of the rates we get the following table
cafe x y Rx Ry
1 3 6 1 2
2 8 7 4 3
3 7 10 3 5
4 9 8 5 4
5 5 4 2 1
We get the Spearman’s rank correlation coefficient R = 0.6 by
calculating the correlation coefficient of the variables R x and R y.
Thus a high rating of X correlates with a high rating of Y.
(b) Now we use a decreasing order of rating, i.e. highest values get
the lowest rank.
cafe x y R x R y Rdesc x Rdesc y
1 3 6 1 2 5 4
2 8 7 4 3 2 3
3 7 10 3 5 3 1
4 9 8 5 4 1 2
5 5 4 2 1 4 5
The odds ratio, i.e the ratio of the chances of rating a bad coffee
and rating a good coffee, is
Q|J
f1|1
Q|J
f1|2 2
Q|J
= ≈ 2.666
f2|1 0.75
Q|J
f2|2
# load packages
library ( tidyverse )
library ( xtable ) # necessary to g e n e r a t e t e x−t a b l e s
# data
r a t e <−
tibble (
cafe = c (1 ,2 ,3 ,4 ,5) ,
x = c (3 ,8 ,7 ,9 ,5) ,
y = c (6 ,7 ,10 ,8 ,4)
)
# add t h e r a n k s
r a t e <−
r a t e %>%
mutate ( R x = r a n k ( x ) ) %>%
mutate ( R y = r a n k ( y ) )
# d i s p l a y data
rate
# tex table
t e x t a b <− x t a b l e ( r a t e )
p r i n t ( t e x t a b , i n c l u d e . rownames = FALSE , f l o a t i n g = FALSE)
# Spearman ’ s r a n k c o r r e l a t i o n c o e f f i c i e n t
R <− c o r ( x = r a t e $ R x , y = r a t e $ R y )
R
# or d i r e c t l y using the cor ( ) f u n c t i o n
c o r ( x = r a t e $ x , y = r a t e $ y , method = ” spearman ” )
# tex table
t e x t a b <− x t a b l e ( r a t e )
p r i n t ( t e x t a b , i n c l u d e . rownames = FALSE , f l o a t i n g = FALSE)
# Spearman ’ s r a n k c o r r e l a t i o n c o e f f i c i e n t
Rdesc <− c o r ( x = r a t e $ R d e s c x , y = r a t e $ R d e s c y )
Rdesc
# tex table
t e x t a b <− x t a b l e ( r a t e 2 )
p r i n t ( t e x t a b , i n c l u d e . rownames = FALSE , f l o a t i n g = FALSE)
# odds r a t i o
OR <− ( ( 2 / 5 ) / ( 3 / 5 ) ) / ( ( 1 / 5 ) / ( 4 / 5 ) )
OR
Answer:
Course Result
Attendance Pass Fail Totals
Over 70% 40 (35) 10 (15) 50
(a)
30%-70% 20 (21) 10 (9) 30
Under 30% 10(14) 10 (6) 20
Totals 70 30 100
(b) χ2 = 6.349206, C = 0.2443389, Ccorr = 0.3455474
(a) Determine the contingency table for the variables “travel class”
and “rescue status”. This can be done by the following steps
• Create a tibble raw data with the columns class (possible val-
ues: first, second, third, staff) and state (possible value: res-
cued and not.recued) and fill the rows with the corresponding
number of pairs given by the survival data from above.
• Count the number of rescued and non rescued in the classes
and the crew by applying table() to the two columns of raw data.
(b) Use the contingency table to summarize the conditional relative
frequency distribution of rescue status given travel class. Could
there be an association of the two variables?
(c) What would be the contingency table table from a) look like under
the independence assumption? Calculate χ2 , C and Ccorr . Is there
any association between rescue status and travel class?
(d) Combine the categorie “first class” and “second class” as well as
“third class” and “staff”. Create a contingency table based on
these new categories. Determine and interpret χ2 , C and Ccorr .
Answer:
(a) Contingency table:
not rescued rescued Sum
first 122 203 325
second 167 118 285
third 528 178 706
staff 673 212 885
Sum 1490 711 2201
Thus the chance of being rescued, i.e. the ratio rescued/not res-
cued, was was 3.42 times higher for the first and second class
compared to the third class and staff.
This is plausible, as passengers in the better classes were ac-
commodated on higher decks and thus had better access to the
lifeboats, while 3rd class passengers and staff were further down
the ship where the water penetrated first.
The hypothesis of independence of both variables can be checked
with a statistical test (χ2 test). If this is carried out, this hypoth-
esis must be rejected.
#####################################################
# Descriptive S t a t i s t i c s : Exercise 4.4 ,
# Heumann , Schomaker , page 91
#
# F i l e : d e s s t a t t i t a n i c .R
#
######################################################
# load packages
library ( tidyverse )
library ( xtable ) # necessary to generate tex tables
library ( titanic )
# contingency table
c o n t t a b <−
r a w d a t a %>%
table ()
# display
cont tab
# add m a r g i n s
addmargins ( c o n t t a b )
# tex table
t e x t a b <− x t a b l e ( a d d m a r g i n s ( c o n t t a b ) )
p r i n t ( t e x t a b , i n c l u d e . rownames = TRUE, f l o a t i n g = FALSE)
# tex table
t e x t a b <− x t a b l e ( c o n d f r e q )
p r i n t ( t e x t a b , i n c l u d e . rownames = TRUE, f l o a t i n g = FALSE)
# indifference table
i n d t a b <− a s . m a t r i x ( rowSums ( c o n t t a b ) ) %∗%
t ( a s . m a t r i x ( colSums ( c o n t t a b ) ) ) / sum ( c o n t t a b )
# display
ind tab
# add m a r g i n s
addmargins ( ( i n d t a b ) )
# tex table
t e x t a b <− x t a b l e ( i n d t a b )
p r i n t ( t e x t a b , i n c l u d e . rownames = TRUE, f l o a t i n g = FALSE)
# c h i −s q u a r e
c h i 2 <− sum ( ( c o n t t a b −i n d t a b ) ˆ 2 / i n d t a b )
chi2
# Pear son ’ s c o n t i n g e n c y c o e f f i c i e n t
C <− ( c h i 2 / ( c h i 2+nobs ) ) ∗ ∗ 0 . 5
C
# c o r r e c t e d Pearson ’ s c o n t i n g e n c y c o e f f i c i e n t
C c o r <− ( 2 / 1 ∗ c h i 2 / ( c h i 2+nobs ) ) ∗ ∗ 0 . 5
C cor
###############################################################
# g r o u p f i r s t and s e c o n d c l a s s a s w e l l a s t h i r d c l a s s and s t a f f
###############################################################
c o n t t a b n e w <−
r a w d a t a %>%
mutate ( ” c l a s s ” = i f e l s e ( c l a s s == ” f i r s t ” | c l a s s == ” s e c o n d ” ,
” f i r s t +s e c o n d ” , ” t h i r d+ s t a f f ” ) ) %>%
table ()
# display
cont tab new
# add m a r g i n s
addmargins ( cont tab new )
# indifference table
i n d t a b n e w <−
a s . m a t r i x ( rowSums ( c o n t t a b n e w ) ) %∗%
t ( a s . m a t r i x ( colSums ( c o n t t a b n e w ) ) ) / sum ( c o n t t a b n e w )
# display
ind tab new
# add m a r g i n s
addmargins ( i nd tab new )
# tex table
t e x t a b <− x t a b l e ( i n d t a b n e w )
p r i n t ( t e x t a b , i n c l u d e . rownames = TRUE, f l o a t i n g = FALSE)
# c h i −s q u a r e
c h i 2 n e w <− sum ( ( c o n t t a b n e w −i n d t a b n e w ) ˆ 2 / i n d t a b n e w )
chi2 new
# Pear son ’ s c o n t i n g e n c y c o e f f i c i e n t s
C new <− ( c h i 2 n e w / ( c h i 2 n e w+nobs ) ) ∗ ∗ 0 . 5
C new
C c o r n e w <− ( 2 / 1 ∗ c h i 2 n e w / ( c h i 2 n e w+nobs ) ) ∗ ∗ 0 . 5
C cor new
# c o n d i t i o n a l f r e q u e n c i e s : r e s c u e s t a t u s g i v e n new c l a s s e s
c o n d f r e q n e w <− c o n t t a b n e w / rowSums ( c o n t t a b n e w )
cond freq new
c o n d f r e q n e w [ , 2 ] # p r o p o r t i o n o f t h e r e s c u e d p e r s o n s d e p e n d i n g on t h e class
# relative risks
r e l r i s k s <− c o n d f r e q n e w [ 1 , ] / c o n d f r e q n e w [ 2 , ]
r e l r i s k s [ 1 ] # p r o p o r t i o n o f n o t r e s c u e d p e r s o n s among 1 . / 2 . c l a s s when compared w i t h 3 . c l a s s and s t a f f
r e l r i s k s [ 2 ] # p r o p o r t i o n o f r e s c u e d p e r s o n s among 1 . / 2 . c l a s s when compared w i t h 3 . c l a s s and s t a f f
# odds r a t i o
rel risks [2] / r e l r i s k s [ 1 ] # c h a n c e t o be r e s c u e d for 1./2. c l a s s when compared w i t h 3 . c l a s s and staff