You are on page 1of 13

1

Course of Study Exercises Statistics


Bachelor Computer Science WS 2020/21

Sheet V - Solutions
Descriptive Statistics - Linear Regression
1. For the X, Y data:
xi 2 6 3 4 5
yi 3 7 4 7 6
draw a scatterplot and compute

(a) covariance
(b) coefficient of correlation
(c) regression line: criterion variable Y and predictor variable X
(d) regression line: criterion variable X and predictor variable Y
i x _ i y _ i x _ i * x _ i y _ i * y _ i x _ i * y _ i
1 2 3 4 9 6
2 6 7 3 6 4 9 4 2
3 3 4 9 1 6 1 2
4 4 7 1 6 4 9 2 8
5 5 6 2 5 3 6 3 0
S u m 2 0 2 7 9 0 1 5 9 1 1 8

X Y S c a tte r p lo t
m e a n 4 5 ,4
v a r ia n c e 2 ,5 3 ,3
9
c o v a r ia n c e 2 ,5 8
c o e ff. o f c o rr. 0 ,8 7 0 3 8 8 2 8
7
Y = a + b X : a = 1 ,4 6
b = 1
5 L in e a r ( y _ i)
y _ i

X = a + b Y : a = -0 ,0 9 0 9 0 9 0 9 4 L in e a r ( x _ i)
b = 0 ,7 5 7 5 7 5 7 6
3
2
1

x y 0
2 ,1 8 1 8 1 8 1 8 2 3 1 2 3 4 5 6 7
5 ,2 1 2 1 2 1 2 1 2 7
x _ i

Answer:

#####################################################
# D e s c r i p t i v e S t a t i s t i c s : S i m p l e example l i n e a r r e g r .
#
# F i l e : d e s s t a t e x l i n r e g .R
#
#####################################################

# data
x <− c ( 2 , 6 , 3 , 4 , 5 )

Prof. Dr. Falkenberg, Faculty 2 Statistics WS 20/21


2
y <− c ( 3 , 7 , 4 , 7 , 6 )

# scatterplot
p l o t ( x , y , main=” s c a t t e r p l o t ” , x l i m=c ( 0 , 7 ) , ylim = c ( 0 , 8 ) )

# covariance
cov ( x , y ) # 2 . 5

# c o e f f i c i e n t of correlation
cor (x , y) # 0.8703883

# r e g r e s s i o n l i n e : y=a+bx
# c r i t e r i o n v a r i a b l e Y and p r e d i c t o r variable X
lm ( y ˜ x )
# a = 1.4 , b = 1.0

# r e g r e s s i o n l i n e : x = alpha + beta ∗ y
# c r i t e r i o n v a r i a b l e X and p r e d i c t o r v a r i a b l e Y
lm ( x ˜ y )
# alpha = −0.0909 , beta = 0 . 7 5 7 6
a l p h a <− lm ( x ˜ y ) $ c o e f f i c i e n t s [ 1 ]
b e t a <− lm ( x ˜ y ) $ c o e f f i c i e n t s [ 2 ]

# t r a n s f o r m t h e r e g r e s s i o n x = a l p h a + b e t a ∗ x t o y= a ’ + b ’ x
a s t r i c h <− −a l p h a / b e t a
b s t r i c h <− 1/ b e t a

# gem einsam es Diagramm


p l o t ( x , y , main=” s c a t t e r p l o t ” ,
sub=” r e g r e s s i o n : y ˜ x ( b l u e ) , x ˜ y ( r e d ) ” ,
x l i m=c ( 1 , 7 ) , y l i m = c ( 0 , 8 ) )
a b l i n e ( lm ( y ˜ x ) , c o l =” b l u e ” )
a b l i n e ( a=a s t r i c h , b=b s t r i c h , c o l =”r e d ” )

# eps− f i l e
dev . c o p y 2 e p s ( f i l e = ” . . / p i c t u r e s / e x b i . e p s ” )

2. For a certain class, the relationship between the amount of time spent
in exercises (X) and the test score (Y) was examined.
(a) Draw a scatterplot of the data.

(b) Is there a positive or a negative association be-


xi yi tween X and Y?
10 5
(c) Compute the covariance and the coefficient of cor-
9 5
relation.
9 4
11 6 (d) Compute the regression line Y = a + bX.
10 7
10 5 (e) Find the predicted test score for someone with 8
6 3 units of time spent in exercises.
10 4
8 5 (f) Interpret the values of the parameters of the re-
12 7 gression line.
9 4 (g) Compute the proportion of variation explained by
4 2 the simple linear regression.
12 8
(h) Add a the point (20,0) to the data. Inspect how
this additional point influences the linear regres-
sion.

Prof. Dr. Falkenberg, Faculty 2 Statistics WS 20/21


3
Answer:

time spent in exercises and score


8
6
score

4
2
0

5 10 15 20

time spent
(a)
(b) The scatter plot indicates a positive association.
(c) covariance: sxy = 3.25, coefficient of correlation: r = 0.861269
(d) regression line: Y = −0.9693878 + 0.6466837 · X

time spent in exercises and score


0 2 4 6 8
score

0 2 4 6 8 12

time spent

(e) predicted score = −0.9693878 + 0.6466837 · 8 = 4.204082


(f) The coefficient a is the intercept, and b is the slope of the regres-

Prof. Dr. Falkenberg, Faculty 2 Statistics WS 20/21


4
sion line. b can be interpreted as the additional score a student
will get if he spent 1 time unit more in exercises. A possible inter-
pretation of a is the score a student will get if he does not spent
any time in exercises. Since a is negative this interpretation is not
meaningfull.
(g) B = r2 = 0.7417842 proportion of variation explained by the
simple linear regression
(h) If we add the point (20,0) to the data additionally we will get the
red regression line instead of the former blue regression line.

time spent in exercises and score


0 2 4 6 8
score

0 5 10 15 20

time spent
blue=regr. without (20,0),red=regr. with (20,0)
The new value of coefficient of determination is B̂ = 0.01258843.
We see that linear regression is quite sensitive to outliers.
#####################################################
# D e s c r i p t i v e S t a t i s t i c s : t i m e s p e n t i n e x e r c i s e and
# s c o r e i n exam
#
# F i l e : d e s s t a t e x e r c s c o r e .R
#
#####################################################

# load package
library ( tidyverse )

# For a c e r t a i n c l a s s , t h e r e l a t i o n s h i p between t h e
# amount o f t i m e s p e n t i n e x e r c i s e s (X) and t h e t e s t
# s c o r e (Y) was examined .
r e s u l t s <−
tibble (
time = c ( 1 0 , 9 , 9 , 1 1 , 1 0 , 1 0 , 6 , 1 0 , 8 , 1 2 ,9 ,4 ,12) ,
score = c (5 , 5 ,4 ,6 , 7 , 5 , 3 ,4 , 5 ,7 , 4 ,2 ,8)
)

# a ) Draw a s c a t t e r p l o t o f t h e d a t a .
plot (x = results$time ,
y = results$score ,
main=”t i m e s p e n t i n e x e r c i s e s and s c o r e ” ,
x l a b =”t i m e s p e n t ” , y l a b =” s c o r e ” )

Prof. Dr. Falkenberg, Faculty 2 Statistics WS 20/21


5
# eps− f i l e
dev . c o p y 2 e p s ( f i l e = ” . . / p i c t u r e s / s c o r e r e g 1 . e p s ” )

# Compute t h e c o v a r i a n c e and t h e c o e f f i c i e n t o f c o r r e l a t i o n .
cov ( r e s u l t s $ t i m e , r e s u l t s $ s c o r e ) # 3 . 2 5
cor ( results$time , r e s u l t s $ s c o r e ) # 0.861269
# or
r e s u l t s %>%
summarise ( ” c o v a r i a n c e ” = cov ( time , s c o r e ) ,
” c o e f f . o f c o r r e l a t i o n ” = c o r ( time , s c o r e ) )

# Compute t h e r e g r e s s i o n l i n e Y=a+bX . ( X = time , Y = s c o r e )


r e g 1 <− lm ( r e s u l t s $ s c o r e ˜ r e s u l t s $ t i m e )
# c o e f f i c i e n t s of the l i n e
a <− r e g 1 $ c o e f f i c i e n t s [ 1 ]
b <− r e g 1 $ c o e f f i c i e n t s [ 2 ]
a;b

# add t h e r e s i d u a l s and t h e f i t t e d v a l u e s t o t h e tibble results


r e s u l t s <− r e s u l t s %>%
mutate ( r e s . s c o r e s = r e g 1 $ r e s i d u a l s ,
pred . s c o r e s = r e g 1 $ f i t t e d . v a l u e s )
results

# Find t h e p r e d i c t e d s c o r e f o r someone w i t h 8 u n i t s of
# time spent i n e x e r c i s e s .
p r e d t e s t s c o r e <− a+b ∗8
p r e d t e s t s c o r e # 4.204082

plot (x = results$time ,
y = results$score ,
xlim = c ( −1 ,13) ,
ylim = c ( −1 ,10) ,
main=”t i m e s p e n t i n e x e r c i s e s and s c o r e ” ,
x l a b =”t i m e s p e n t ” , y l a b =” s c o r e ” )
# add t h e r e g r e s s i o n l i n e
a b l i n e ( r e g 1 , c o l =” b l u e ” )
# add l i n e s e g m e n t s
segments ( 8 , p r e d t e s t s c o r e , 0 , p r e d t e s t s c o r e )
segments ( 8 , 0 , 8 , p r e d t e s t s c o r e )
# add a x i s
a b l i n e ( h=0)
a b l i n e ( v=0)
# eps− f i l e
dev . c o p y 2 e p s ( f i l e = ” . . / p i c t u r e s / s c o r e r e g 2 . e p s ” )

# Compute t h e p r o p o r t i o n o f v a r i a t i o n e x p l a i n e d by t h e linear reg .


B <− c o r ( r e s u l t s $ t i m e , r e s u l t s $ s c o r e ) ˆ 2
B # 0.7417842

# add t h e p o i n t ( 2 0 , 0 )
r e s u l t s <−
r e s u l t s %>%
add row ( t i m e =20 , s c o r e =0)

# c a l c u l a t e again the r e g r e s s i o n l i n e
r e g 2 <− lm ( r e s u l t s $ s c o r e ˜ r e s u l t s $ t i m e )
# b l o t both r e g r e s s i o n l i n e s
plot (x = results$time ,
y = results$score ,
xlim = c ( −1 ,21) ,
ylim = c ( −1 ,10) ,
main=”t i m e s p e n t i n e x e r c i s e s and s c o r e ” ,
sub=” b l u e=r e g r . w i t h o u t ( 2 0 , 0 ) , r e d=r e g r . w i t h (20 ,0)” ,
x l a b =”t i m e s p e n t ” , y l a b =” s c o r e ” )
# add r e g r e s s i o n l i n e s
a b l i n e ( r e g 2 , c o l =”r e d ” )
a b l i n e ( r e g 1 , c o l =” b l u e ” )
# add a x i s
a b l i n e ( h=0)
a b l i n e ( v=0)

# eps− f i l e
dev . c o p y 2 e p s ( f i l e = ” . . / p i c t u r e s / s c o r e r e g 3 . e p s ” )

# c o e f f i e c i e n t of determination
B new <− c o r ( r e s u l t s $ t i m e , r e s u l t s $ s c o r e ) ˆ 2
B new

Prof. Dr. Falkenberg, Faculty 2 Statistics WS 20/21


6
Descriptive Statistics - Contingency Tables
1. Exercise 4.1 from Heumann, Schomaker, p. 90
A newspaper asks two of its staff to review the coffee quality at different
trendy cafes. The coffee can be rated on a scale from 1 (miserable) to
10 (excellent). The results of the two journalists X and Y are:
Cafe X Y
1 3 6
2 8 7
3 7 10
4 9 8
5 5 4

(a) Calculate Spearman’s rank correlation coefficient.


(b) Does the coefficient differ depending on whether the ranks are
assigned in a decreasing or increasing order?
(c) Suppose that the coffee can only be rated as eiter good (> 5)
or bad (≤ 5). Do the chances of good rating differ between the
journalists?

Answer:
(a) Calculating the ranks of the rates we get the following table
cafe x y Rx Ry
1 3 6 1 2
2 8 7 4 3
3 7 10 3 5
4 9 8 5 4
5 5 4 2 1
We get the Spearman’s rank correlation coefficient R = 0.6 by
calculating the correlation coefficient of the variables R x and R y.
Thus a high rating of X correlates with a high rating of Y.
(b) Now we use a decreasing order of rating, i.e. highest values get
the lowest rank.
cafe x y R x R y Rdesc x Rdesc y
1 3 6 1 2 5 4
2 8 7 4 3 2 3
3 7 10 3 5 3 1
4 9 8 5 4 1 2
5 5 4 2 1 4 5

Prof. Dr. Falkenberg, Faculty 2 Statistics WS 20/21


7
Using the values of Rdes x, Rdes y we get the same coefficient.
(c) The ratings are now:
quality x y
bad 2 1
good 3 4
The ratio of the relative risks that X rates a bad coffee resp. a
good coffee compared to Y are (Q = quality, J = journalist):
Q|J Q|J
f1|1 2/5 f2|1 3/5
Q|J
= =2 Q|J
= = 0.75
f1|2 1/5 f2|2 4/5

The odds ratio, i.e the ratio of the chances of rating a bad coffee
and rating a good coffee, is
Q|J
f1|1
Q|J
f1|2 2
Q|J
= ≈ 2.666
f2|1 0.75
Q|J
f2|2

The chance of rating a coffee as bad is 2.666 times higher for X


compared to Y.
#####################################################
# Descriptive S t a t i s t i c s : Exercise 4.1 ,
# Heumann , Schomaker , page 90
#
# F i l e : d e s s t a t c o f f e e .R
#
######################################################

# load packages
library ( tidyverse )
library ( xtable ) # necessary to g e n e r a t e t e x−t a b l e s

# data
r a t e <−
tibble (
cafe = c (1 ,2 ,3 ,4 ,5) ,
x = c (3 ,8 ,7 ,9 ,5) ,
y = c (6 ,7 ,10 ,8 ,4)
)
# add t h e r a n k s
r a t e <−
r a t e %>%
mutate ( R x = r a n k ( x ) ) %>%
mutate ( R y = r a n k ( y ) )
# d i s p l a y data
rate

# tex table
t e x t a b <− x t a b l e ( r a t e )
p r i n t ( t e x t a b , i n c l u d e . rownames = FALSE , f l o a t i n g = FALSE)

# Spearman ’ s r a n k c o r r e l a t i o n c o e f f i c i e n t
R <− c o r ( x = r a t e $ R x , y = r a t e $ R y )
R
# or d i r e c t l y using the cor ( ) f u n c t i o n
c o r ( x = r a t e $ x , y = r a t e $ y , method = ” spearman ” )

Prof. Dr. Falkenberg, Faculty 2 Statistics WS 20/21


8
# change the the o r d e r o f r a t i n g
r a t e <−
r a t e %>%
mutate ( R d e s c x = r a n k (−x ) ) %>%
mutate ( R d e s c y = r a n k (−y ) )
# display
rate

# tex table
t e x t a b <− x t a b l e ( r a t e )
p r i n t ( t e x t a b , i n c l u d e . rownames = FALSE , f l o a t i n g = FALSE)

# Spearman ’ s r a n k c o r r e l a t i o n c o e f f i c i e n t
Rdesc <− c o r ( x = r a t e $ R d e s c x , y = r a t e $ R d e s c y )
Rdesc

# o n l y t h e r a t i n g s good and bad


r a t e 2 <−
tibble (
q u a l i t y = c ( ” bad ” , ” good ” ) ,
x = c (2 ,3) ,
y = c (1 ,4)
)
# display
rate2

# tex table
t e x t a b <− x t a b l e ( r a t e 2 )
p r i n t ( t e x t a b , i n c l u d e . rownames = FALSE , f l o a t i n g = FALSE)

# odds r a t i o
OR <− ( ( 2 / 5 ) / ( 3 / 5 ) ) / ( ( 1 / 5 ) / ( 4 / 5 ) )
OR

2. The following 3x2 contingency table categorizes students according to


whether or not they pass an introductory statistics course and their
level of attendence:
Course Result
Attendance Pass Fail Totals
Over 70% 40 10 50
30%-70% 20 10 30
Under 30% 10 10 20
Totals 70 30 100

(a) Calculate the expected values in case of no association between


attendance and course result.
(b) Calculate χ2 , C and Ccorr .

Answer:
Course Result
Attendance Pass Fail Totals
Over 70% 40 (35) 10 (15) 50
(a)
30%-70% 20 (21) 10 (9) 30
Under 30% 10(14) 10 (6) 20
Totals 70 30 100
(b) χ2 = 6.349206, C = 0.2443389, Ccorr = 0.3455474

Prof. Dr. Falkenberg, Faculty 2 Statistics WS 20/21


9
3. Exercise 4.4 from Introduction to Statistics and Data Analysis from
Heumann, Schomaker, p. 91
The famous passenger liner Titanic hit an iceberg in 1912 and sank.
The data set Titanic of the package titanic contains the survival data
of the passengers:
• A total of 325 passengers travelled in first class, 285 in second
class, and 706 in third class. In addition, there were 885 staff
members on board.
• Not all passengers could be rescued. The following were not res-
cued: 122 from the first class, 167 from the second class, 528 from
the third class and 673 staff.

(a) Determine the contingency table for the variables “travel class”
and “rescue status”. This can be done by the following steps
• Create a tibble raw data with the columns class (possible val-
ues: first, second, third, staff) and state (possible value: res-
cued and not.recued) and fill the rows with the corresponding
number of pairs given by the survival data from above.
• Count the number of rescued and non rescued in the classes
and the crew by applying table() to the two columns of raw data.
(b) Use the contingency table to summarize the conditional relative
frequency distribution of rescue status given travel class. Could
there be an association of the two variables?
(c) What would be the contingency table table from a) look like under
the independence assumption? Calculate χ2 , C and Ccorr . Is there
any association between rescue status and travel class?
(d) Combine the categorie “first class” and “second class” as well as
“third class” and “staff”. Create a contingency table based on
these new categories. Determine and interpret χ2 , C and Ccorr .

Answer:
(a) Contingency table:
not rescued rescued Sum
first 122 203 325
second 167 118 285
third 528 178 706
staff 673 212 885
Sum 1490 711 2201

Prof. Dr. Falkenberg, Faculty 2 Statistics WS 20/21


10
202
(b) Conditional frequencies: for example frescued | first class = 337

not rescued rescued


first 0.38 0.62
second 0.59 0.41
staff 0.76 0.24
third 0.75 0.25
Sum 0.68 0.32
It is obvious that the proportion of passengers being rescued differs
by class. May there is an association between the two variables
that there are better chance of being rescued among passenger
from higher classes.
fi· ·f·j
(c) The elements of the indifference table I are given by Iij = f··

not rescued rescued Sum


first 220.01 104.99 325
second 192.94 92.06 285
staff 599.11 285.89 885
third 477.94 228.06 706
Sum 1490 731 2201
The value of the coefficients measuring the association are:

χ2 = 190.4011, C = 0.27, Ccorr = 0.4

These values are indicating a moderate association which contra-


dicts the hypothesis derived from the values of the conditional
freuencies.
(d) Combining the first and second class as well as the the third class
and staff we get:
contingency table
not rescued rescued Sum
first+second 289 321 610
third+staff 1201 390 1591
Sum 1490 711 2201
indifference table
not rescued rescued Sum
first+second 412.95 197.05 610
third+staff 1077.05 513.95 1591
Sum 1490 711 2201

Prof. Dr. Falkenberg, Faculty 2 Statistics WS 20/21


11
The value of the coefficients measuring the association are:

χ2 = 159.3265, C = 0.26, Ccorr = 0.37

These coefficients indicate a moderate association.


If we consider the relative risks
289 321
f1|1 610 f2|1 610
= 1201 ≈ 0.63 = 390 ≈ 2.15
f1|2 1591
f2|2 1591

we see that the proportion of passengers who were rescued was


2.15 times higher in first and second class compared to third class
and staff. Furthermore the proportion of passengers who were not
rescued was 0.62 times lower in the first and second class compared
to the third class and staff.
The odds ratio
f2|1
f2|2
f1|1
≈≈ 3.42,
f1|2

Thus the chance of being rescued, i.e. the ratio rescued/not res-
cued, was was 3.42 times higher for the first and second class
compared to the third class and staff.
This is plausible, as passengers in the better classes were ac-
commodated on higher decks and thus had better access to the
lifeboats, while 3rd class passengers and staff were further down
the ship where the water penetrated first.
The hypothesis of independence of both variables can be checked
with a statistical test (χ2 test). If this is carried out, this hypoth-
esis must be rejected.
#####################################################
# Descriptive S t a t i s t i c s : Exercise 4.4 ,
# Heumann , Schomaker , page 91
#
# F i l e : d e s s t a t t i t a n i c .R
#
######################################################

# load packages
library ( tidyverse )
library ( xtable ) # necessary to generate tex tables
library ( titanic )

# g e n e r a t e t h e d a t a from t h e t i t a n i c data set


t d a t a <−
a s . t i b b l e ( T i t a n i c ) %>%
s p r e a d ( key=S u r v i v e d , v a l u e=n ) %>%
s e l e c t (−Age , −Sex ) %>%
g r o u p b y ( C l a s s ) %>%
summarise ( n o t . r e s c u e d = sum ( No ) ,
r e s c u e d = sum ( Yes ) ) %>%
mutate (Sum = n o t . r e s c u e d+r e s c u e d )
tdata

Prof. Dr. Falkenberg, Faculty 2 Statistics WS 20/21


12
# use of the given values ! ! ! ! ! ! !
# g e n e r a t e raw d a t a
r a w d a t a <−
tibble (
” c l a s s ” = c ( rep (” f i r s t ” ,325) ,
rep (” second ” ,285) ,
rep (” t h i r d ” ,706) ,
rep (” s t a f f ” ,885)) ,
” s t a t e ” = c ( rep (” not r e s c u e d ” , 1 2 2 ) , rep (” rescued ” ,203) ,
rep (” not r e s c u e d ” , 1 6 7 ) , rep (” rescued ” ,118) ,
rep (” not r e s c u e d ” , 5 2 8 ) , rep (” rescued ” ,178) ,
rep (” not r e s c u e d ” , 6 7 3 ) , rep (” rescued ” ,212))
)
# number o f o b s e r v a t i o n s
nobs <− l e n g t h ( r a w d a t a $ c l a s s )

# contingency table
c o n t t a b <−
r a w d a t a %>%
table ()
# display
cont tab
# add m a r g i n s
addmargins ( c o n t t a b )

# tex table
t e x t a b <− x t a b l e ( a d d m a r g i n s ( c o n t t a b ) )
p r i n t ( t e x t a b , i n c l u d e . rownames = TRUE, f l o a t i n g = FALSE)

# conditional f r e q u e n c i e s rescue status given c l a s s


c o n d f r e q <− c o n t t a b / rowSums ( c o n t t a b )
c o n d f r e q [ , 2 ] # p r o p o r t i o n o f t h e r e s c u e d p e r s o n s d e p e n d i n g on t h e class

# tex table
t e x t a b <− x t a b l e ( c o n d f r e q )
p r i n t ( t e x t a b , i n c l u d e . rownames = TRUE, f l o a t i n g = FALSE)

# indifference table
i n d t a b <− a s . m a t r i x ( rowSums ( c o n t t a b ) ) %∗%
t ( a s . m a t r i x ( colSums ( c o n t t a b ) ) ) / sum ( c o n t t a b )
# display
ind tab
# add m a r g i n s
addmargins ( ( i n d t a b ) )

# tex table
t e x t a b <− x t a b l e ( i n d t a b )
p r i n t ( t e x t a b , i n c l u d e . rownames = TRUE, f l o a t i n g = FALSE)

# c h i −s q u a r e
c h i 2 <− sum ( ( c o n t t a b −i n d t a b ) ˆ 2 / i n d t a b )
chi2

# A simple s o l u t i o n to evaluate the contingency table , the i n d i f f e r e n c e t a b l e


# and Chi ˆ2 i s a p p l y i n g c h i s q . t e s t ( ) t o t h e r a w d a t a
t e s t . r e s <− c h i s q . t e s t ( x = a s . f a c t o r ( r a w d a t a $ c l a s s ) , y= a s . f a c t o r ( r a w d a t a $ s t a t e ) )
t e s t . r e s $ s t a t i s t i c # c h i ˆ2
t e s t . res$observed # contingency table
test . res$expected # indifferennce table

# Pear son ’ s c o n t i n g e n c y c o e f f i c i e n t
C <− ( c h i 2 / ( c h i 2+nobs ) ) ∗ ∗ 0 . 5
C

# c o r r e c t e d Pearson ’ s c o n t i n g e n c y c o e f f i c i e n t
C c o r <− ( 2 / 1 ∗ c h i 2 / ( c h i 2+nobs ) ) ∗ ∗ 0 . 5
C cor

###############################################################
# g r o u p f i r s t and s e c o n d c l a s s a s w e l l a s t h i r d c l a s s and s t a f f
###############################################################
c o n t t a b n e w <−
r a w d a t a %>%
mutate ( ” c l a s s ” = i f e l s e ( c l a s s == ” f i r s t ” | c l a s s == ” s e c o n d ” ,
” f i r s t +s e c o n d ” , ” t h i r d+ s t a f f ” ) ) %>%
table ()
# display
cont tab new
# add m a r g i n s
addmargins ( cont tab new )

Prof. Dr. Falkenberg, Faculty 2 Statistics WS 20/21


13
# tex table
t e x t a b <− x t a b l e ( c o n t t a b n e w )
p r i n t ( t e x t a b , i n c l u d e . rownames = TRUE, f l o a t i n g = FALSE)

# indifference table
i n d t a b n e w <−
a s . m a t r i x ( rowSums ( c o n t t a b n e w ) ) %∗%
t ( a s . m a t r i x ( colSums ( c o n t t a b n e w ) ) ) / sum ( c o n t t a b n e w )
# display
ind tab new
# add m a r g i n s
addmargins ( i nd tab new )

# tex table
t e x t a b <− x t a b l e ( i n d t a b n e w )
p r i n t ( t e x t a b , i n c l u d e . rownames = TRUE, f l o a t i n g = FALSE)

# c h i −s q u a r e
c h i 2 n e w <− sum ( ( c o n t t a b n e w −i n d t a b n e w ) ˆ 2 / i n d t a b n e w )
chi2 new

# Pear son ’ s c o n t i n g e n c y c o e f f i c i e n t s
C new <− ( c h i 2 n e w / ( c h i 2 n e w+nobs ) ) ∗ ∗ 0 . 5
C new
C c o r n e w <− ( 2 / 1 ∗ c h i 2 n e w / ( c h i 2 n e w+nobs ) ) ∗ ∗ 0 . 5
C cor new

# Chi ˆ2 Test , p−v a l u e < 2 . 2 e −16


chisq . t e s t ( cont tab new )

# c o n d i t i o n a l f r e q u e n c i e s : r e s c u e s t a t u s g i v e n new c l a s s e s
c o n d f r e q n e w <− c o n t t a b n e w / rowSums ( c o n t t a b n e w )
cond freq new
c o n d f r e q n e w [ , 2 ] # p r o p o r t i o n o f t h e r e s c u e d p e r s o n s d e p e n d i n g on t h e class

# relative risks
r e l r i s k s <− c o n d f r e q n e w [ 1 , ] / c o n d f r e q n e w [ 2 , ]
r e l r i s k s [ 1 ] # p r o p o r t i o n o f n o t r e s c u e d p e r s o n s among 1 . / 2 . c l a s s when compared w i t h 3 . c l a s s and s t a f f
r e l r i s k s [ 2 ] # p r o p o r t i o n o f r e s c u e d p e r s o n s among 1 . / 2 . c l a s s when compared w i t h 3 . c l a s s and s t a f f

# odds r a t i o
rel risks [2] / r e l r i s k s [ 1 ] # c h a n c e t o be r e s c u e d for 1./2. c l a s s when compared w i t h 3 . c l a s s and staff

Prof. Dr. Falkenberg, Faculty 2 Statistics WS 20/21

You might also like