Professional Documents
Culture Documents
Valentin Wimmer
contact: Valentin.Wimmer@wzw.tum.de
December 2010
Contents
3 / 122
Statistics and Data Analysis Using R
important notes
useful hints
description of datasets
4 / 122
Statistics and Data Analysis Using R
Introduction
What is R?
5 / 122
Statistics and Data Analysis Using R
Introduction
Advantages of R
É Rapid developement
É Open Source and thus no ”Black Box”’. All algorithms and functions
are visible in the source code.
É No fees
É Many systems: Macintosh, Windows, Linux, ...
É Quite fast
É Many new statistical methods available as R-packages
6 / 122
Statistics and Data Analysis Using R
Introduction
Disadvantages of R
7 / 122
Statistics and Data Analysis Using R
Introduction
Installing R
Installing R
8 / 122
Statistics and Data Analysis Using R
Introduction
Working with R
Working with R
Specifying a path in R
9 / 122
Statistics and Data Analysis Using R
Introduction
Working with R
Working with R
10 / 122
Statistics and Data Analysis Using R
Introduction
First steps
First steps
É Use R as a calculator
R> print(sqrt(5) * 2)
[1] 4.472136
É Assign a value to an object, return value of an object
R> a <- 2
R> a + 3
[1] 5
R> b <- a
R> b
[1] 2
É Load an add-on R package
R> library(MASS)
11 / 122
Statistics and Data Analysis Using R
Introduction
First steps
First steps
Note that R is case sensitiv. Names of objects have to start with a letter.
12 / 122
Statistics and Data Analysis Using R
Introduction
First steps
13 / 122
Statistics and Data Analysis Using R
Introduction
R help
R help
É The help-function
R> help(mean)
R> `?`(mean)
É Overview over an add-on package
R> help(package = "MASS")
É R manuals
É An Introduction to R
É R Data Import/Export
É R Installation and Administration
É Writing R Extensions
É www.rseek.org
14 / 122
Statistics and Data Analysis Using R
Introduction
R help
Exercises
15 / 122
Statistics and Data Analysis Using R
Introduction
R help
Exercises
É Calculate the following with R
p 3
( 5 − 2) 2 − 6 + 5 · e2
É Look at the help file of floor() and ceiling().
É What is the result?
R> a <- (4 <= floor(4.1))
R> b <- (5 > ceiling(4.1))
R> a != b
É What is the result (give a solution)?
R> (5 < 3) | (3 < 5)
R> (5 < 3) || (3 < 5)
R> (5 < 3) || c((3 < 5), c(5 < 3))
R> (5 < 3) | c((3 < 5), c(5 < 3))
R> (5 < 3) & c((3 < 5), c(5 < 3))
R> (5 < 3) && c((3 < 5), c(5 < 7))
R> (1 < 3) && c((3 < 5), c(8 < 7))
16 / 122
Statistics and Data Analysis Using R
Data management
iris dataset
Throughout this course we will use several datasets comming with R. The
most famous one is Fisher’s iris data. To load it, type
R> data(iris)
R> head(iris)
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1 5.1 3.5 1.4 0.2 setosa
2 4.9 3.0 1.4 0.2 setosa
3 4.7 3.2 1.3 0.2 setosa
4 4.6 3.1 1.5 0.2 setosa
5 5.0 3.6 1.4 0.2 setosa
6 5.4 3.9 1.7 0.4 setosa
For more information, type
R> help(iris)
17 / 122
Statistics and Data Analysis Using R
Data management
iris data
This data set contains the measurements in centimeters of the variables
sepal length and width and petal length and width, respectively, for 50
flowers from each of 3 species of iris. The species are Iris setosa,
versicolor, and virginica.
É Sepal.Length [metric]
É Sepal.Width [metric]
É Petal.Length [metric]
É Petal.Width [metric]
É Species [nominal]
18 / 122
Statistics and Data Analysis Using R
Data management
Data structure
Data structure
R is an object-oriented programming language, so every object is instance
of a class. The name of the class can be determined by
R> class(iris)
[1] "data.frame"
The most important classes are
É character
É numeric
É logical
É data.frame
É matrix
É list
É factor
19 / 122
Statistics and Data Analysis Using R
Data management
Data structure
Data structure
20 / 122
Statistics and Data Analysis Using R
Data management
Data structure
22 / 122
Statistics and Data Analysis Using R
Data management
Data structure
23 / 122
Statistics and Data Analysis Using R
Data management
Data structure
Class of a S3 object
É To check, if an obect has an special class, use
R> obj <- 3
R> is.numeric(obj)
[1] TRUE
R> is.character(obj)
[1] FALSE
É The class of an object can simple be changed (no formal check of
attributes)
R> obj2 <- obj
R> (obj2 <- as.character(obj2))
[1] "3"
R> is.character(obj2)
[1] TRUE
R> is.numeric(obj2)
[1] FALSE
24 / 122
Statistics and Data Analysis Using R
Data management
Data structure
Handling factor-objects
25 / 122
Statistics and Data Analysis Using R
Data management
Data structure
Exercises
Basic
É What is the difference between
R> v <- c("An", "Introduction", "to", "R")
R> nchar(v)
R> length(v)
É What are the first three values of Sepal.Length?
É Define the following matrix in R:
4 12 20 0.8
X= 5 3 7 1
9 11 10 8
26 / 122
Statistics and Data Analysis Using R
Data management
Data structure
Exercises
Advanced
É What are the last three values of Sepal.Length?
É Look at the help file of matrix. Try to find out the use of the
byrow argument.
É The variable Sepal.Length should be categorisised. We want to
have two categories, one up 6 cm and one above 6cm. Create a
factor and give labels for the categories.
É How many observations are in each of the two groups?
27 / 122
Statistics and Data Analysis Using R
Data management
Read in data
Read in data
28 / 122
Statistics and Data Analysis Using R
Data management
Read in data
É Entries of a data.frame
R> iris[2, 5]
[1] setosa
Levels: setosa versicolor virginica
R> iris[3:5, "Species"]
[1] setosa setosa setosa
Levels: setosa versicolor virginica
R> head(iris$Petal.Width)
[1] 0.2 0.2 0.2 0.2 0.2 0.4
É Entries of a list
R> mod <- lm(Petal.Width ~ Species, data = iris)
R> mod[["coefficients"]]
(Intercept) Speciesversicolor Speciesvirginica
0.246 1.080 1.780
R> mod[[1]]
(Intercept) Speciesversicolor Speciesvirginica
0.246 1.080 1.780
31 / 122
Statistics and Data Analysis Using R
Data management
Read in data
attaching a data.frame
The attach() command can be used for datasets. Afterwards the
variable of the dataset are attached to the R search path.
32 / 122
Statistics and Data Analysis Using R
Data management
Read in data
Reading xls-files
To read in Excel files, you need the RODBC-libary and the following
commands:
R> library(RODBC)
R> trac <- odbcConnectExcel("Table.xls")
R> Tnames <- sqlTables(trac)$TABLE_NAME
R> df1 <- sqlFetch(trac, Tnames[1], as.is = TRUE)
R> odbcClose(trac)
33 / 122
Statistics and Data Analysis Using R
Data management
Read in data
Exercises
É Load the file data2read from the website. Have a look at the file.
É Read in data using the read.table function and save the data as
an object with name mydata. What is the class of object mydata?
É Asses the data structure of mydata with the methods str, head,
dim, colnames and rownames.
É What are the values of the third row of mydata?
É Attach mydata and give the rows of mydata where x≥8.
34 / 122
Statistics and Data Analysis Using R
Data management
Data manipulation
35 / 122
Statistics and Data Analysis Using R
Data management
Data manipulation
Data manipulation
É Replacement in objects
R> (v <- 1:9)
[1] 1 2 3 4 5 6 7 8 9
R> (v[3] <- 5)
[1] 5
É A data.frame can also be manipulated with the fix() command
É Subset of a data frame. With the following command only the
individuals of species ”setosa” are selected from the iris dataset
R> iris.Setosa <- subset(iris, iris$Species == "setosa")
É Order the elements of a numeric vector
R> order.edu <- order(swiss$Education, decreasing = TRUE)
É Give the the 3 districts with the highest Education
R> swiss[order.edu[1:3], 1:5]
Fertility Agriculture Examination Education Catholic
V. De Geneve 35.0 1.2 37 53 42.34
Neuchatel 64.4 17.6 35 32 16.92
Rive Droite 44.7 46.6 16 29 50.43
36 / 122
Statistics and Data Analysis Using R
Data management
Data export
Data export
É Data of R can be writen into a file with the write.table()
command
É The R workspace can be saved with save() command and can be
reloaded with load()
É Function xtable() to create LATEX-table output
R> library(xtable)
R> xtable(summary(iris), caption = "Summary statistic of iris data")
Excercises
38 / 122
Statistics and Data Analysis Using R
Descriptive Statistics
39 / 122
Statistics and Data Analysis Using R
Descriptive Statistics
Cross Tables
É The table command gives the frequencies of a categorial variable,
typically it is used with one or more factor-variables
R> table(CO2$Type)
Quebec Mississippi
42 42
É A cross-table of two factors is obtained by
R> table(CO2$Type, CO2$Treatment)
nonchilled chilled
Quebec 21 21
Mississippi 21 21
É We have a balanced experimental design
40 / 122
Statistics and Data Analysis Using R
Descriptive Statistics
Descriptive Statistics
41 / 122
Statistics and Data Analysis Using R
Descriptive Statistics
Excercises
Basic
É Compute the 5%, the 10% and the 75% quantile of the education
rate in the swiss data.
É Is the distribution of the education rate in the swiss data
symmetric, left skewed or right skewed?
É Compute the covariance and the correlation between the ambient
carbon dioxide concentrations and the carbon dioxide uptake rates in
the CO2 data. How do you interpret the result?
É Compute the correlation coefficient for each Treatment group
separatly.
Advanced
É Compute the correlation between the following two variables:
v1=[11,5,6,3,*,35,4,4], v2=[4.6,2,1.6,*,*,7,2,3], where ”*” indicates
a missing value.
É Give ranks to uptake and conc and compute spearmans correlation
coefficient.
43 / 122
Statistics and Data Analysis Using R
Graphics
Graphics with R
44 / 122
Statistics and Data Analysis Using R
Graphics
Classic graphics
●
50
40
Education
●
30
● ● ●
20
● ●
●
●
● ● ● ● ●●●
10
● ●●
●●●● ● ● ● ●●
● ●● ● ● ●●● ● ●
●
● ●● ●● ●
●●
0
0 20 40 60 80
Agriculture
45 / 122
Statistics and Data Analysis Using R
Graphics
Classic graphics
●
0.5
46 / 122
Statistics and Data Analysis Using R
Graphics
Classic graphics
Histogram of swiss$Education
10 15 20 25 30
Frequency
5
0
0 10 20 30 40 50 60
swiss$Education
47 / 122
Statistics and Data Analysis Using R
Graphics
Classic graphics
48 / 122
Statistics and Data Analysis Using R
Graphics
Classic graphics
Black
Brown
Blond
Red
49 / 122
Statistics and Data Analysis Using R
Graphics
Classic graphics
>4
2:4
Brown
0:2
−2:0
Eye
−4:−2
Blue
<−4
Standardized
Residuals:
Hazel
Green
Hair
50 / 122
Statistics and Data Analysis Using R
Graphics
Controlling the appearance of a plot
51 / 122
Statistics and Data Analysis Using R
Graphics
Controlling the appearance of a plot
52 / 122
Statistics and Data Analysis Using R
Graphics
Controlling the appearance of a plot
53 / 122
Statistics and Data Analysis Using R
Graphics
Controlling the appearance of a plot
mar = c(5.1,4.1,4.1,2.1)
mar[NORTH<−3]
mar[WEST<−2]
mar[EAST<−4]
oma[WEST<−2]
oma[EAST<−4]
0 2 4 6 8
Plot Area
Y
0 2 4 6 8 10
mar[SOUTH<−1]
X Figure
Outer
oma[SOUTH<−1] Margin Area
54 / 122
Statistics and Data Analysis Using R
Graphics
Controlling the appearance of a plot
0 2 4 6 8
0 2 4 6 8
1 3
Y
Y
0 2 4 6 8 10 0 2 4 6 8 10
X Figure X Figure
0 2 4 6 8
0 2 4 6 8
2 4
Y
Y
0 2 4 6 8 10 0 2 4 6 8 10
X Figure X Figure
Outer Margin Area
55 / 122
Statistics and Data Analysis Using R
Graphics
Controlling the appearance of a plot
56 / 122
Statistics and Data Analysis Using R
Graphics
Controlling the appearance of a plot
●
● ●
●
4
● ●● ● ● ●
● ●●●●● ●
●
● ●
● ● ● ●●●● ● ● ●● ●●● ●●●● ●
● ● ●
● ● ●● ● ● ●●●●
● ● ●●
●● ●● ● ● ●●●●●● ●● ●●● ● ●●
3
● ● ● ●●
● ●●
●● ● ●● ●
● ●
●●● ●●● ● ● ●● ● ●
●
●●● ●
● ●
● ● ● ●●
●
2
1
4 5 6 7 8
57 / 122
Statistics and Data Analysis Using R
Graphics
Controlling the appearance of a plot
Adding a legend
R> legend("topright", legend = c("setosa", "versicolor", "virginica"),
+ pch = c(21, 21, 21), col = c(14, "lightgreen", "blue"))
6
● setosa
● versicolor
5
● virginica
Sepal.Width
●
● ●
●
4
● ● ● ● ●
● ●● ● ●● ●
● ● ● ●
●●●●
● ● ●● ● ●●
● ● ●●● ●●●● ●
●● ●●●
●●●●
●
● ● ●
● ●●●● ● ●●
●●●● ●● ●●●●● ●●
3
● ●●
● ●●
● ●●
●● ● ●● ●
● ● ●●● ● ●● ● ● ●
●
●●● ●
●●● ● ●
● ● ● ●●
●
2
1
4 5 6 7 8
Sepal.Length
58 / 122
Statistics and Data Analysis Using R
Graphics
Controlling the appearance of a plot
Saving graphics
59 / 122
Statistics and Data Analysis Using R
Graphics
Controlling the appearance of a plot
Exercises
Basic
É Make a histogram of each of Sepal.Length, Petal.Length,
Sepal.Width and Petal.Width in one plot for the iris data. Give
proper axis labels, a title for each subplot and a title for the whole
plot.
É Now we use the swiss data. Plot the Education rate against the
Agriculture rate. Use different symbols for the provinces with
more catholics and those with more protestant inhabitants. Choose
axis limits from 0 to 100. Add a legend to your plot.
É Save this plot as a pdf-file.
Advanced
É Add two vertiacal lines and two horizontal lines to the plot which
represent the means of the Agriculture and the Education for
both subgroups (Catholic vs Protestant).
É Look at the layout function. This is an alternative for par(mfrow).
60 / 122
Statistics and Data Analysis Using R
Probability Distributions
R as a set of statistical tables
Probability Distributions
61 / 122
Statistics and Data Analysis Using R
Probability Distributions
R as a set of statistical tables
Probability Distributions
62 / 122
Statistics and Data Analysis Using R
Probability Distributions
R as a set of statistical tables
Probability Distributions
Following (and others) distributions are available in R
63 / 122
Statistics and Data Analysis Using R
Probability Distributions
R as a set of statistical tables
N(4,0.5) distribution
0.8
dnorm(x, mean = 4, sd = 0.5)
0.6
0.4
0.2
0.0
2 3 4 5 6 7 8
64 / 122
Statistics and Data Analysis Using R
Probability Distributions
R as a set of statistical tables
Computing quantiles
To compute the 95% quantile of the N(0,2)-distribution, use qnorm()
R> qnorm(0.95, mean = 0, sd = 2)
[1] 3.289707
Quantiles of N(0,2)−distribution
4
qnorm(x,mean=0,sd=2)
3.28
2
0
−2
−4
0.95
65 / 122
Statistics and Data Analysis Using R
Probability Distributions
Examining the distribution of a set of data
66 / 122
Statistics and Data Analysis Using R
Probability Distributions
Examining the distribution of a set of data
Density estimation
67 / 122
Statistics and Data Analysis Using R
Probability Distributions
Examining the distribution of a set of data
Density estimation
R> data(faithful)
R> plot(density(faithful$eruptions))
R> rug(faithful$eruptions)
density.default(x = faithful$eruptions)
0.5
0.4
0.3
Density
0.2
0.1
0.0
1 2 3 4 5 6
68 / 122
Statistics and Data Analysis Using R
Probability Distributions
Examining the distribution of a set of data
16 | 070355555588
18 | 000022233333335577777777888822335777888
20 | 00002223378800035778
22 | 0002335578023578
24 | 00228
26 | 23
28 | 080
30 | 7
32 | 2337
34 | 250077
36 | 0000823577
38 | 2333335582225577
40 | 0000003357788888002233555577778
42 | 03335555778800233333555577778
44 | 02222335557780000000023333357778888
46 | 0000233357700000023578 69 / 122
Statistics and Data Analysis Using R
Probability Distributions
Examining the distribution of a set of data
ecdf(faithful$eruptions)
1.0
0.8
0.6
Fn(x)
0.4
0.2
0.0
2 3 4 5
70 / 122
Statistics and Data Analysis Using R
Probability Distributions
Examining the distribution of a set of data
● ● ●
●●●●
●●
●
●●
●●
●●
●●
●●●●●●
●
●●
●●
●
●●
●
●●
●●
●
4.5
●
●●
●
●●
●
●
●●
●
●
●●
●
●●
●●
●
●●
●●
●
Sample Quantiles
●●
●
●●
●●
●
●●
●
●●
●
●
●
●●
●
●●
●
●
●●
●
●●
●
●
●
●
●●
●
●●
●
●
●●
3.5
●●
●
●
●
●●
●
●●
2.5
●
●●
●
●
●●
●
●
●●
●●
●●
●
●●
●
●
●●
●
●●
●
●●
●
●●
●
●●
●
●●
●
●
●●
●
●●
●
●
●●●
●●
●●
●●
●●
●
●●
●●
● ● ●●●●●●●
1.5
−3 −2 −1 0 1 2 3
Theoretical Quantiles
71 / 122
Statistics and Data Analysis Using R
Probability Distributions
Examining the distribution of a set of data
Exercises
Basic
É You randomly select a sample of 10 persons in a class. There are 20
women and 30 men in that class. What is the probabilty to have
exactly 3 woman in that sample? Suppose you are sampling with
replacement.
É What is the probability if you are sampling without replacement?
É What is P(X > 3) if X ∼N(0,2)?
É Would you say, that the eruptions of Old Faithful are normal
distributed?
É Plot a histogramm together with a density estimation for the
eruptions data. Try different values for the bandwidth.
72 / 122
Statistics and Data Analysis Using R
Probability Distributions
Examining the distribution of a set of data
Exercises
Advanced
É Continous distributions can be visualized with curve. How can
categorial distributions be visualized?
É The distribution of the eruption data can be seen as a mixture of
two normal distributions. Try to find the mean and variance of these
normals manually and plot a histogramm together with the
distribution of the mixture of normals.
73 / 122
Statistics and Data Analysis Using R
Hypothesis Testing (Inference)
HairEyeColor dataset
74 / 122
Statistics and Data Analysis Using R
Hypothesis Testing (Inference)
Statistical Tests
75 / 122
Statistics and Data Analysis Using R
Hypothesis Testing (Inference)
Test
Non-significant Significant
H0 True Negative Type-1 Error
Truth
H1 Type-2 Error True Positive
Table: Type-1 and Type-2 errors in hypothesis testing
76 / 122
Statistics and Data Analysis Using R
Hypothesis Testing (Inference)
Statistical Tests
77 / 122
Statistics and Data Analysis Using R
Hypothesis Testing (Inference)
Two-sample t-test for equality of means
Student’s t-Test
Used to test the null hypothesis that the means of two independent
populations of size n1 respectively n2 are the same
H0 : μ1 = μ2 .
ȳ1 − ȳ2
t= p ,
s 1/ n1 + 1/ n2
78 / 122
Statistics and Data Analysis Using R
Hypothesis Testing (Inference)
Two-sample t-test for equality of means
20
10
nonchilled chilled
● ●
● ●●
40
● ●
●●●●
●● ●●●
40
●●●
●
Sample Quantiles
Sample Quantiles
●●
●
● ●●
● ●
●●
30
●
●
●●
●●
●
●
30
● ●●
●
●●
●●
● ●
●
●
●
●
●
20
●● ●
●●
●●
●
●●
20
●●
●●
●●
● ●●
●●●
●●●
10
●
●
●●
10
● ●
−2 −1 0 1 2 −2 −1 0 1 2
80 / 122
Statistics and Data Analysis Using R
Hypothesis Testing (Inference)
Two-sample t-test for equality of means
81 / 122
Statistics and Data Analysis Using R
Hypothesis Testing (Inference)
Two-sample t-test for equality of means
82 / 122
Statistics and Data Analysis Using R
Hypothesis Testing (Inference)
Wilcoxon test
This test is more conservative then a t-Test with the same data
83 / 122
Statistics and Data Analysis Using R
Hypothesis Testing (Inference)
Chi-squared-test
Chi-squared-test
A sample of n observations in two nominal (categorial) variables is
arranged in a Contingency Table
y
1 ... c
1 n11 ··· n1c n1.
2 n21 ··· n2c n2.
.. .. .. ..
. . ··· . .
r nr1 ··· nrc nr.
n.1 ··· n.c n
Under H0 the row variable and the column variable y are independent,
thus the expected values Ejk for cell (j, k) can be computed as
Ejk = nj. n.k / n.
84 / 122
Statistics and Data Analysis Using R
Hypothesis Testing (Inference)
Chi-squared-test
Chi-squared-test
85 / 122
Statistics and Data Analysis Using R
Hypothesis Testing (Inference)
Chi-squared-test
χ2 -distribution 0.5
df = 2
df = 3
df = 4
0.4
df = 5
df = 6
0.3
dchisq(x)
0.2
0.1
0.0
0 1 2 3 4 5 6 7
86 / 122
Statistics and Data Analysis Using R
Hypothesis Testing (Inference)
Chi-squared-test
χ2 -distribution 10
df = 2
df = 3
df = 4
8
df = 5
df = 6
6
qchisq(x)
4
2
0
87 / 122
Statistics and Data Analysis Using R
Hypothesis Testing (Inference)
Chi-squared-test
Chi-squared-test with R
Contingeny Tabel of Hair and Eye Color of male statistic students
R> HairEyeColor[, , "Male"]
Eye
Hair Brown Blue Hazel Green
Black 32 11 10 3
Brown 53 50 25 15
Red 10 10 7 7
Blond 3 30 5 8
The chi-squared-test is performed with the function chisq.test. The
argument is simply a contingeny table (obtained with the function table)
R> chisq.test(HairEyeColor[, , "Male"])
Pearson's Chi-squared test
88 / 122
Statistics and Data Analysis Using R
Hypothesis Testing (Inference)
Chi-squared-test
Chi-squared-test with R
89 / 122
Statistics and Data Analysis Using R
Hypothesis Testing (Inference)
Kolmogorov-Smirnov-Test
Kolmogorov-Smirnov-Test
90 / 122
Statistics and Data Analysis Using R
Hypothesis Testing (Inference)
Kolmogorov-Smirnov-Test
Kolmogorov-Smirnov-Test with R
The Fertility-rate in the swiss data should be compared with a normal
distribution. Simply use the ks.test() for vector of the 47 measures
R> ks.test(swiss$Fertility, "pnorm")
One-sample Kolmogorov-Smirnov test
data: swiss$Fertility
D = 1, p-value < 2.2e-16
alternative hypothesis: two-sided
É The null hypothesis ”distribution is normal” is rejected with α=0.05
because data is not scaled
É The KS-Test with scaled data results in
R> ks.test(scale(swiss$Fertility), "pnorm")
One-sample Kolmogorov-Smirnov test
data: scale(swiss$Fertility)
D = 0.1015, p-value = 0.7179
alternative hypothesis: two-sided
É Now H0 is not rejected 91 / 122
Statistics and Data Analysis Using R
Hypothesis Testing (Inference)
Kolmogorov-Smirnov-Test
Exercises
Basic
É The means of Sepal.Length should be compared for the species
”setosa” and ”virginica” in the iris data. Think about the
assumptions (normal distributions, equal variances) and choose the
right test. Is there a significant difference in mean?
É Compute the p-value with help of the t-statistic on slide 82 on your
own. What value would we have with a one-sided hypotesis?
É Why do we have df=9 in the output of the χ2 -test on slide 88?
Advanced
É It was necessary to scale the data before using the ks.test. How
can we get the right result without using the scale method? Have a
look at the help for the function and think about the ...-argument.
É Have a look at slide 88 again and compute the value of the
χ2 -statistic on our own.
92 / 122
Statistics and Data Analysis Using R
Hypothesis Testing (Inference)
Kolmogorov-Smirnov-Test
Exercises
Advanced
É You have given the following data. Are the two factors m1 [levels
”A” and ”B”] and m2 [levels ”A”, ”B” and ”C”] independent?
m1 m2 m1 m2
1 A B 9 B C
2 A A 10 B C
3 A A 11 B A
4 A B 12 B B
5 B B 13 A A
6 B C 14 B C
7 A C 15 A B
8 B B 16 B B
93 / 122
Statistics and Data Analysis Using R
ANOVA and linear Regression
94 / 122
Statistics and Data Analysis Using R
ANOVA and linear Regression
ANOVA
ANOVA
ANOVA is used in so called factorial designs. Following distinction can be
done
É balanced designs: same number of observations in each cell
É unbalanced designs: different number of observations in each cell
A
low high Total
low (3.5,5,4.5) (5.5,6,6) 30.5
B
high (2,4,1.5) (8,11,9) 35.5
20.5 45.5 66
Table: balanced design with two treatments A and B with two levels 1 (low)
and 2 (high) with 3 measurements for each cell
95 / 122
Statistics and Data Analysis Using R
ANOVA and linear Regression
ANOVA
ANOVA
with
É yjk the kth measurement made in cell (, j)
É μ the overall mean
É γ main effect of the first factor
É βj the main effect of the second factor
É (γβ)j the interaction effect of the two factors
É εjk the residual error term with εjk ∼ N(0, σ 2 )
96 / 122
Statistics and Data Analysis Using R
ANOVA and linear Regression
ANOVA
ANOVA formula in R
y ∼ A + B + A:B,
where A is the first and B the second factor. The interaction term is
denoted by A:B. An equivalent model formula is
y ∼ A*B.
97 / 122
Statistics and Data Analysis Using R
ANOVA and linear Regression
ANOVA
ANOVA
98 / 122
Statistics and Data Analysis Using R
ANOVA and linear Regression
ANOVA
F-distribution 1.4
df1 = 1, df2 = 1
df1 = 1, df2 = 2
1.2
0.6
0.4
0.2
0.0
0 1 2 3 4 5
99 / 122
Statistics and Data Analysis Using R
ANOVA and linear Regression
ANOVA
ANOVA with R
É Now we want to perform ANOVA with two main effects and
interaction for the example from above
É Let us first have a look at the data
R> plot.design(y ~ A + B + A:B)
high
7
mean of y
high
low
5
4
low
A B
Factors
100 / 122
Statistics and Data Analysis Using R
ANOVA and linear Regression
ANOVA
ANOVA with R
É The mean plot suggested that there are differences in the means,
especially for factor A
É To apply ANOVA to the data we use the function aov
É The results is shown in a nice way with the summary method
R> summary(aov(y ~ A + B + A:B))
Df Sum Sq Mean Sq F value Pr(>F)
A 1 52.083 52.083 43.8596 0.0001654 ***
B 1 2.083 2.083 1.7544 0.2219123
A:B 1 21.333 21.333 17.9649 0.0028433 **
Residuals 8 9.500 1.187
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
101 / 122
Statistics and Data Analysis Using R
ANOVA and linear Regression
ANOVA
ANOVA with R
102 / 122
Statistics and Data Analysis Using R
ANOVA and linear Regression
ANOVA
ANOVA with R
To understand the interaction effect, the function interaction.plot()
is very useful
R> interaction.plot(A, B, y)
B
9
high
8
low
mean of y
7
6
5
4
3
low high
103 / 122
Statistics and Data Analysis Using R
ANOVA and linear Regression
ANOVA
Exercises
Basis
É Compute an ANOVA for the CO2 dataset. The effect of Treatment
and Type on the carbon dioxide uptake rates should be examined.
Give an interpretation for the result. Would you use a model with or
without interaction?
É Think about the model assumptions. A useful method is to use the
plot method with the result of an ANOVA.
Advanced
É In a second step, we want to look if the ambient carbon dioxide
concentration conc has an effect on uptake too. Check this with an
ANOVA. The variable should be dichotomized at the median before
using it as a factor.
104 / 122
Statistics and Data Analysis Using R
ANOVA and linear Regression
Linear Regression
Linear Regression
Linear Regression model:
y = α + β + ε , = 1, ..., n (2)
with
y response measure for individual
É
R2 = ry
2
∈ [0, 1]
It holds that
SSreg
SSerr + SSreg = SStot and R2 = .
SStot
107 / 122
Statistics and Data Analysis Using R
ANOVA and linear Regression
Linear Regression
É The sum of squares of the residuals is a measure for the fit of the
modell.
É Residuals are the vertical distances of each point from the regression
line.
É The regression is unbiased, so E(ε) = 0.
É There should be no structure in the distribution of the residuals →
look at the residuals.
108 / 122
Statistics and Data Analysis Using R
ANOVA and linear Regression
Linear Regression
109 / 122
Statistics and Data Analysis Using R
ANOVA and linear Regression
Linear Regression
110 / 122
Statistics and Data Analysis Using R
ANOVA and linear Regression
Linear Regression
●
150
●
● ●
●
● ●
100
Ozone
●●
●
●● ● ●
●●●
● ●
● ● ● ● ●
● ●●● ●
● ●
50
●●●
● ●● ●
● ● ● ●●
● ●
● ● ● ● ● ●
●
●
●● ●●
● ● ● ● ●
● ● ●●
● ● ● ●●
●●● ● ● ●
● ●● ● ●
● ● ●● ● ●●
●●● ●
● ●● ● ●●●● ●●
● ●
0
60 70 80 90
Temp
●
150
●
● ●
●
● ●
100
Ozone
●●
●
●● ● ●
●●●
● ●
● ● ● ● ●
● ●●● ●
● ●
50
● ●
●●●
●
●● ● ●
● ● ●
● ● ● ● ● ● ●
●● ● ●
● ● ● ●● ● ● ●
● ● ● ●●● ● ●
●●
● ●
●● ●● ● ●●
● ● ●● ● ●●●●● ●
● ●● ● ●●●● ●●
● ●
0
60 70 80 90
Temp
113 / 122
Statistics and Data Analysis Using R
ANOVA and linear Regression
Linear Regression
114 / 122
Statistics and Data Analysis Using R
ANOVA and linear Regression
Linear Regression
method use
coef returns a vector of length 2 containing the regressions coefficients
fitted.values returns a vector of length n containing the predicted values
residuals returns a vector of length n containing the residuals
predict a method for predicting values with the model
summary a model summary
plot a plot for the model diagnostics
115 / 122
Statistics and Data Analysis Using R
ANOVA and linear Regression
Linear Regression
Model diagnostics
R> layout(mat = matrix(1:4, 2, 2))
R> plot(myMod)
Standardized residuals
Residuals vs Fitted Scale−Location
● 117
2.0
● 117
62
● 30 ●
Residuals
62
● 30 ● ● ●
50
●
● ●●● ● ●
● ●
1.0
● ● ● ● ●
●●● ●● ● ●●
● ● ● ●● ● ●●
●
●●●
● ● ●●
● ● ●●●● ●● ● ● ●●●●●●●● ● ● ●● ●● ● ●●●●
●● ●● ●
●●●●●● ● ● ●●●●
●
● ● ●●●●●●● ● ● ● ●● ●
●●
● ● ●●
● ●● ● ●●● ● ●
●●●● ●● ● ●●
● ● ●● ●
● ● ●
●●●●●●●●●● ● ● ●● ● ●
● ● ● ●● ● ●●
●●●
● ● ●●●●
●●
●●● ●● ● ●●● ● ● ● ●● ●
● ● ●
● ●● ●●
−50
●●●
0.0
0 20 40 60 80 0 20 40 60 80
Standardized residuals
62 ● ●
62
●●
30
●● ● 99 ●
2
2
●● ● ●●
●●
● ● ●●●●●● ● ● ● ●
●
●●
●● ●●●● ●●●●● ●● ● ●
●●
●●
●
●●
●
●
●●
●
●●
●
●
●●
●●
●
● ●
●● ●●●● ●●
●
●
●● ● ● ●● ●● ●● ●
● ●●
●●
●
●●
●●
● ●● ●
−2 0
●
●●
●
●● ●
●●● ●●
●
●●●●●● ●
●
●●
●
● ●
●●●
●
●●
●●
●●
●
●●
●●
●
●●
●
●
●●
●
●●
●
●
●●
●
● ●●● ●● ●
●●
Cook's distance
●●●●
−2
●
● ● ●●
117 / 122
Statistics and Data Analysis Using R
ANOVA and linear Regression
Linear Regression
●
150
●
●
Ozone
● ●
●
100
●
●●●
●● ● ●
●●● ●
●●● ● ●
● ●
● ● ●●●
50
●● ●
● ●●●
● ● ● ●
●●● ●●●
● ●●● ● ●●● ●
●
●● ● ●
●● ● ●● ●
●● ●●●● ●●● ● ●
●
●● ●●●●
●●● ●●●
●●●●● ● ● ● ●
0
40 60 80 100 120
Temp
118 / 122
Statistics and Data Analysis Using R
ANOVA and linear Regression
Linear Regression
Coefficients:
(Intercept) qTemp
-57.07404 0.01612
119 / 122
Statistics and Data Analysis Using R
ANOVA and linear Regression
Linear Regression
R> summary(myMod2)
Call:
lm(formula = Ozone ~ qTemp, data = airquality)
Residuals:
Min 1Q Median 3Q Max
-39.707 -16.741 -0.953 10.544 119.293
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -57.074043 9.386333 -6.081 1.63e-08 ***
qTemp 0.016123 0.001485 10.859 < 2e-16 ***
---
Signif. codes: 0 *** 0.001 ** 0.01 * 0.05 . 0.1 1
120 / 122
Statistics and Data Analysis Using R
ANOVA and linear Regression
Linear Regression
●
150
●
●
Ozone
● ●
●
100
●
●●
●● ●● ●
●●●
● ● ● ●● ●
● ●
● ● ●●●
50
●● ● ●● ●●
● ●
●●● ● ●
● ●●●● ●● ● ● ●
●●●
●● ●●●●
● ● ● ●●●
●●
●● ●●
●●●
●● ●●●●●● ●
●●●●● ● ● ● ●
●
●
●●
0
40 60 80 100 120
Temp
121 / 122
Statistics and Data Analysis Using R
ANOVA and linear Regression
Linear Regression
Excercises
Basic
É Which of the models myMod or myMod2 fits better to the data? Give
reasons for your choice.
É What could be a problem of the second model?
É In second analysis we want to analyse the effect of the temperature
on the wind. Use the airquality data, make descriptive analyses
and fit a model. Give interpretations for the results.
É Look at the model diagnostics. Are there outliers in the data?
Highlight the outliers.
Advanced
É Look at the airquality data. As you can see, there are 37 missing
values for Ozone. Predict this values with the predict method and
with our model myMod and plot them into the scatterplot.
É We want to estimate the effect of the agriculture rate on the
education rate in the swiss data. What could be a problem here?
122 / 122