
R Preparation Course (Winter term 2022)

An Introduction to R

Tianjiao Zhu∗

2022-11-04

1 Getting started
1.1 Running and terminating R
During this course, we will work with R through the graphical user interface RStudio, a powerful and easy-to-use open-source program that is available for all platforms (Linux/Mac OS X/Windows). To start RStudio, double-click the corresponding symbol on your desktop; the RStudio window should then appear.

The interactive R session has begun. There are 4 panels on the screen.
• Bottom left: console window (command window).
• Top left: editor window (script window).
• Top right: workspace / history window.
• Bottom right: files / plots / packages /help window.
In the bottom left panel, the program welcomes you with a greeting message. It then waits at the prompt >
for subsequent commands.
∗ Department of Quantitative Finance, University of Freiburg

Let’s start with our first line of code:
print("Hello, World!")

## [1] "Hello, World!"


Caution: R is case sensitive!
You can exit the program with the command q() (quit). Since q() is an R function, the parentheses ( and ) must be added to the function call. Alternatively, you can click on “File” in the top menu bar and then choose the item “Quit Session...” (or use the shortcut [Ctrl]+[Q]).
q()

Since up to now your workspace is empty (you have not defined anything yet), R/RStudio will be terminated
immediately. However, if your workspace already contains some newly created objects, then the following
message will be shown:
Save workspace image to /.RData? [y/n]:
Here you are asked if your workspace (these are all the objects, functions, and other things you generated
during the present session) should be stored permanently within a (binary) file called .RData in your current
working directory. This is done by entering y (yes). If you type n (no), R is terminated without saving your
data, meaning that everything you created during the session will be lost! If you terminate the R session via the menu items “File” and “Quit Session...”, the corresponding choices in the pop-up window are “Save”, “Don’t Save”, and “Cancel”.
Technical remark: All the objects you generate during an R session are only stored temporarily within the main memory of your computer. They are written permanently to the hard disk only when you quit the program and answer the displayed question with y (yes). In this case, your data is stored in a compressed file called .RData within the present working directory of R. To find out your current working directory, you can use the following command:
getwd()

## [1] "C:/Users/Standard/Dropbox/R_prep_course"
This command tells you its full path. If you start R in the same directory again (which can also be done by double-clicking on the .RData file symbol within the Windows file manager), R will read in this file during the start-up procedure, making everything you stored before available again right from the beginning. If you want to change your working directory, you can use the command setwd(), for example:
setwd("C:/Users/Standard/Documents/R/")
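A minimal sketch of a temporary change of the working directory. It assumes that R's session directory returned by tempdir() may be used as a target; the object name old.dir is purely illustrative.

```r
old.dir <- getwd()   # remember the current working directory
setwd(tempdir())     # switch to a directory that certainly exists
getwd()              # now shows the temporary directory
setwd(old.dir)       # switch back to where we started
```

Storing the old path first makes it easy to return without typing the full path again.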

If you want to be on the safe side and make a backup of all your data while R is still running, use the
command
save.image()

The present content of your workspace is shown in the upper right corner of the RStudio window (tab “Environment”). R also stores your command history in a hidden file called .Rhistory. You can easily move and search among formerly executed commands using the up-arrow [↑] and down-arrow [↓] keys. The command history can also be reviewed using the tab “History” in the upper right corner.

1.2 Using the online documentation of R and R scripts


R provides detailed descriptions and instructions for all built-in functions. You can find these in the bottom right corner of the RStudio window in the tab “Help”. For beginners, probably the most important link there is Search Engine & Keywords, which allows you to search for certain R commands. You can always go back from a specific help page to the start page of the online help by clicking the small home button. If you already know the exact(!) name of the R command for which you need further information, you can also jump directly to the corresponding help page by putting a ? in front of the command name. For example:

?save.image

This command will display the corresponding help page. Alternatively, you can press the key [F1] directly
after entering the command name or use the R function help():
help(save.image)

Observe that the R command names in this case should not be followed by parentheses ()! As already said, these methods require that you know the exact command name in advance, so they may be of limited use for R beginners, for whom the aforementioned search engine may be preferable. A short help text and auto-completion are available directly on the command line by pressing the [TAB] key after entering the first characters of an R command.
If you are working on longer and more complicated R code, it is often advisable to write it as an R script in a separate subwindow and to load and execute it in R only once it is complete. A new subwindow for writing an R script can be opened in RStudio via the menus “File” → “New File” → “R Script” or by clicking the button below “File” and choosing “R Script”. After editing the code, it can be loaded and executed in R by clicking on “Run” or “Source” at the upper margin of the subwindow. “Run” will only load and execute the current line of code (where the text cursor is) or the previously marked lines of code, whereas “Source” will always load and execute the complete content of the script subwindow.
Another useful file format in R is the R Markdown file. R Markdown is a simple formatting syntax for authoring web pages or PDF files. It shows not only the code but also its results on the page. A new R Markdown file can be opened via the menus “File” → “New File” → “R Markdown”.
If you click the “Knit HTML” button, a web page is generated that includes both the content and the output of any embedded R code chunks within the document.

2 Arithmetic operations and numerical functions


In R, one can perform the elementary arithmetic operations, where the usual order of evaluation (multiplication/division before addition/subtraction) applies. The operator %% computes the remainder of a division; it is evaluated before multiplication and division, so 2*8%%3 means 2*(8%%3).
1+2*2

## [1] 5
2*8%%3

## [1] 4
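To make the grouping in expressions with %% explicit, parentheses can be added. The following sketch also uses the integer-division operator %/%, which is not introduced in the text above; the object names are illustrative.

```r
a <- 2 * 8 %% 3     # %% is evaluated first: 8 %% 3 is 2, then 2 * 2
b <- (2 * 8) %% 3   # parentheses force the product first: 16 %% 3
c(a, b)

q <- 17 %/% 5       # integer quotient of 17 divided by 5
r <- 17 %% 5        # remainder of the same division
5 * q + r           # quotient and remainder reconstruct 17
```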
Of course, the most popular mathematical functions are also available: log(), exp(), sin(), cos(), tan(), acos(), asin(), atan(), sqrt(), gamma() have their usual meaning. log() here is the natural logarithm with base e = 2.718282... (Euler's number). abs() returns the absolute value of a number, and arbitrary powers can be obtained with the operator ^. Factorials can be calculated with factorial(), and the binomial coefficients $\binom{n}{k}$ are obtained with choose(n,k). The usual order of evaluation can be changed by putting parentheses appropriately.
Examples:
$4 \arctan(1)$
4*atan(1)

## [1] 3.141593
$e^1$
exp(1)

## [1] 2.718282

$8^{1/3}$
8^(1/3)

## [1] 2
4!
factorial(4)

## [1] 24
$\binom{6}{3}$
choose(6,3)

## [1] 20
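As a cross-check, the binomial coefficient can also be computed from factorials via $\binom{n}{k} = \frac{n!}{k!\,(n-k)!}$. A small sketch with illustrative object names:

```r
n <- 6
k <- 3
by.formula <- factorial(n) / (factorial(k) * factorial(n - k))  # n! / (k! (n-k)!)
by.choose  <- choose(n, k)                                      # built-in function
c(by.formula, by.choose)                                        # both give 20
```

For large n the factorials become huge, so choose() is the safer choice in practice.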

3 Vectors
You can concatenate several numbers into one vector with the command c(). With the help of the operator <-, you can assign the result of an operation or function to an R object. The following example shows the creation of a vector with length 5 and name x.
Example:
x <- c(1,2,4,9,16)

Remark: The assignment operator can analogously be used “from the left to the right”:
c(1,2,4,9,16) -> x

will produce exactly the same result. A C-like assignment via = is also possible; the command below is
equivalent to both of the above:
x=c(1,2,4,9,16)

Observe that in this case the variable (here x) to which a value is assigned must be on the left-hand side of the equality sign, and the value must be on the right-hand side. In most cases, <-, -> and = are equivalent; their subtleties and technical differences are ignored here. Within this course, we follow the convention of using <- as the standard assignment operator; = will only be used to declare variables within the head of a function definition.
Observe that the vector x is displayed as an object of your workspace in the upper right sub-window of RStudio immediately after its creation. There you mainly see the type of each generated R object, but usually not (all of) its contents. The contents are displayed if you just enter the name of an object.
x

## [1] 1 2 4 9 16
If you enter a function name without the parentheses ( and ), you are shown its program code:
exp

## function (x) .Primitive("exp")


For built-in R functions like exp(), this is not very instructive, but it is useful for user-written functions, which we will learn about later in the course.
If you apply a function presented in the previous section to a vector, the evaluation is done componentwise.
The same holds for arithmetic operations.
Examples:

sqrt(x)

## [1] 1.000000 1.414214 2.000000 3.000000 4.000000


2*x

## [1] 2 4 8 18 32
x*x

## [1] 1 4 16 81 256
2^x

## [1] 2 4 16 512 65536


Equidistant number sequences can be generated via the command seq(from,to,by) (sequence). Here from and to determine the initial and final value, respectively, and by defines the size of the increments.
Examples:
seq(0,10,.5)

## [1] 0.0 0.5 1.0 1.5 2.0 2.5 3.0 3.5 4.0 4.5 5.0 5.5 6.0 6.5 7.0
## [16] 7.5 8.0 8.5 9.0 9.5 10.0
seq(2,7,2)

## [1] 2 4 6
If one specifies the desired length of the sequence via length.out instead, the size of the increments is adapted accordingly:
seq(0,10,length.out=10)

## [1] 0.000000 1.111111 2.222222 3.333333 4.444444 5.555556 6.666667


## [8] 7.777778 8.888889 10.000000
Another possibility to generate sequences with increments ±1 is provided by the “:” -operator.
1:10

## [1] 1 2 3 4 5 6 7 8 9 10
1.5:5.5

## [1] 1.5 2.5 3.5 4.5 5.5


10:1

## [1] 10 9 8 7 6 5 4 3 2 1
Remark: The bracketed numbers at the beginning of every line of a result are printed by R for better clarity. They just show the number of the component of the output the corresponding line starts with. See for example the two lines below the command seq(0, 10, .5): The [1] preceding the upper line states that 0.0 is the first element of the output, the [16] below indicates that 7.5 is the sixteenth component of the result.
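The increment that seq() chooses when only the length is given can be reproduced by hand: n points on [from, to] imply a step of (to − from)/(n − 1). A short sketch, with illustrative object names:

```r
n <- 10
step <- (10 - 0) / (n - 1)             # 10/9, about 1.111111
s.len <- seq(0, 10, length.out = n)    # length given, increment chosen by R
s.manual <- 0 + (0:(n - 1)) * step     # the same points built with the : operator
all.equal(s.len, s.manual)             # compares the two up to rounding error
```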

4 Some special functions designed for vectors


Up to now you have only got to know R functions that are actually defined for single numbers and operate
element-wise when applied to a vector-valued argument. Now we present some useful functions which can be
applied to vectors, but do not operate element-wise.
With cumsum(), cumprod(), cummax(), and cummin() you obtain the cumulative sums, products, and
extremes of the argument.

x <- c(3,1,2,5,6,4)
x

## [1] 3 1 2 5 6 4
cumsum(x)

## [1] 3 4 6 11 17 21
cumprod(x)

## [1] 3 3 6 30 180 720


cummax(x)

## [1] 3 3 3 5 6 6
The function diff() calculates successive differences.
diff(x)

## [1] -2 1 3 1 -2
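cumsum() and diff() undo each other in the following sense: differencing the cumulative sums, with a leading 0 prepended, returns the original vector. A small sketch with illustrative names:

```r
x <- c(3, 1, 2, 5, 6, 4)
s <- cumsum(x)               # 3 4 6 11 17 21
restored <- diff(c(0, s))    # successive differences of 0, 3, 4, 6, 11, 17, 21
restored                     # the original x again
```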
There are, of course, also R functions that return a single number when applied to a vector. With sum(), prod(), max(), and min() you obtain the sum, product, maximum, and minimum of all components of the vector. mean() returns the arithmetic mean and median() the median of all elements. var() and sd() return the sample variance and the sample standard deviation, respectively.
Remark: Because var() and sd() are the sample versions, the denominator is n − 1 instead of n.
Example:
x <- 1:6
x

## [1] 1 2 3 4 5 6
sum(x)

## [1] 21
prod(x)

## [1] 720
c(sum(x),prod(x))

## [1] 21 720
c(min(x),max(x))

## [1] 1 6
c(mean(x), median(x))

## [1] 3.5 3.5


var(x)

## [1] 3.5
sd(x)

## [1] 1.870829

The function length() determines the number of components of the argument.
length(x)

## [1] 6
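The role of the denominator n − 1 can be checked by computing the variance by hand from the functions introduced so far. A sketch with illustrative object names; var.pop is shown only for comparison and is not what var() returns.

```r
x <- 1:6
n <- length(x)
dev2 <- (x - mean(x))^2              # squared deviations from the mean
var.sample <- sum(dev2) / (n - 1)    # denominator n - 1, as used by var()
var.pop    <- sum(dev2) / n          # denominator n, for comparison
c(var.sample, var(x), var.pop)
sqrt(var.sample)                     # agrees with sd(x)
```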
Exercise 1: Define the vectors x = (3, 1, 4) and y = (2, 1, 2) in R and use some appropriate functions introduced above to determine their lengths (Euclidean norms), their scalar product, and the size of the angle they form.

5 Undefined and missing values


In R, there exists a special value to characterize undefined numerical quantities: the constant NaN (not a
number). This is returned as result in case of invalid numerical operations and has to be carefully distinguished
from the values Inf and -Inf (±∞):
Examples:
c(1/0,0/0,-1/0)

## [1] Inf NaN -Inf


log(c(-2,-1,0,1,2))

## Warning in log(c(-2, -1, 0, 1, 2)): NaNs produced


## [1] NaN NaN -Inf 0.0000000 0.6931472
Almost all numerical functions can handle these three values. If the argument is NaN, usually the same value is returned. The treatment of -Inf and Inf, however, depends on the function into which these arguments are inserted.
Examples:
exp(c(-Inf,0/0,Inf))

## [1] 0 NaN Inf


sin(c(-Inf,0/0,Inf))

## Warning in sin(c(-Inf, 0/0, Inf)): NaNs produced


## [1] NaN NaN NaN
f <- function(x){7}
f(NaN)

## [1] 7
Apart from the numerical constant NaN, there further exists the logical value NA (not available) which
characterizes missing values (see also the next section). It can also be inserted as an argument into numerical
functions and then typically is returned as result as well.
Example:
sqrt(c(1,2,NA,4))

## [1] 1.000000 1.414214 NA 2.000000
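Functions that reduce a vector to a single number also return NA as soon as one component is missing. Many of them accept the argument na.rm = TRUE, which is not covered in the text above, to drop missing values first. A brief sketch with illustrative names:

```r
v <- c(1, 2, NA, 4)
s.all <- sum(v)                  # NA, because one summand is unknown
s.ok  <- sum(v, na.rm = TRUE)    # 1 + 2 + 4
m.ok  <- mean(v, na.rm = TRUE)   # mean of the three known values
c(s.all, s.ok, m.ok)
```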

6 Logical quantities and relational operators in R


In R, logical quantities can have three (!) possible values: TRUE, FALSE, and NA. Missing values are, for
example, returned as result of an undecidable expression:

1 < 0/0

## [1] NA
Logical quantities can, as seen above, be generated by relational operations. R provides the following operators
for this purpose:

R-operator   Meaning
<            strictly smaller
<=           smaller or equal
>            strictly greater
>=           greater or equal
==           (exactly) equal
!=           unequal

Examples:
1>2

## [1] FALSE
1!=2

## [1] TRUE
x <- c(NA,5,2,7,3,NA,-Inf,Inf)
x>3

## [1] NA TRUE FALSE TRUE FALSE NA FALSE TRUE


Observe that relational operators also operate component-wise. Therefore, the result in the last example is a logical vector with elements TRUE and FALSE, corresponding to the components of x that do or do not fulfill the condition x > 3, and elements NA at positions where it is impossible to obtain a well-defined result.
Logical quantities can be inserted into arithmetic expressions; in this case TRUE is converted to 1 and FALSE to 0. This makes it easy, for example, to count how many components of a vector fulfill a certain relation.
(1:4)>2

## [1] FALSE FALSE TRUE TRUE


sum((1:4)>2)

## [1] 2
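Because TRUE counts as 1 and FALSE as 0, mean() applied to a logical vector gives the proportion of components that fulfill a condition. A short sketch with illustrative names:

```r
x <- c(3, 1, 2, 5, 6, 4)
n.big <- sum(x > 2)        # number of components greater than 2
share.big <- mean(x > 2)   # proportion of such components
c(n.big, share.big)
```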
Logical expressions can be combined with "and" or "or" and can also be negated; R provides the following operators for this purpose:

R-operator   Meaning
& or &&      and
| or ||      or
!            negation

To combine all components of a logical vector with "and" or "or", one can use the functions all() and any(), respectively. When entering logical values directly, one can use the shorthand notations T and F instead of TRUE and FALSE.
Examples:

!(1+1>3)

## [1] TRUE
(2>3)|(3<4)

## [1] TRUE
NA | T

## [1] TRUE
(2>3)&(3<4)

## [1] FALSE
T & NA

## [1] NA
all(c(T,F,F))

## [1] FALSE
any(c(T,F,F))

## [1] TRUE
Remark: T and F are global variables in R with the values TRUE and FALSE, respectively. One should never overwrite them by using them as names for self-made objects or variables!
There is a small but subtle distinction between the single (&, |) and double operators (&&, ||): the double operators do not operate element-wise. They are meant for single logical values, are evaluated from left to right, and stop as soon as the result is determined (for &&, when FALSE appears for the first time, and for ||, when the first TRUE emerges). Thus the double operators always return a single logical value. If an operand that is actually evaluated has more than one element, older versions of R used only its first component; R 4.2 additionally issues a warning (as in the last example below), and from R 4.3.0 onward this is an error.
Examples:
length(x)==1 & x>3

## [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE


length(x)==1 && x>3

## [1] FALSE
length(x)==1 | x>3

## [1] NA TRUE FALSE TRUE FALSE NA FALSE TRUE


length(x)==1 || x>3

## Warning in length(x) == 1 || x > 3: 'length(x) = 8 > 1' in coercion to


## 'logical(1)'
## [1] NA
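The short-circuit behaviour of && is typically used as a guard: a cheap check on the left protects the expression on the right from being evaluated at all. A minimal sketch; the function name is.single.large is hypothetical and only serves as an illustration.

```r
is.single.large <- function(v) {
  # The length check comes first; if it is FALSE, v > 3 is never evaluated
  length(v) == 1 && v > 3
}
r1 <- is.single.large(5)                 # single value greater than 3
r2 <- is.single.large(2)                 # single value, but not greater than 3
r3 <- is.single.large(c(NA, 5, 2, 7))    # not a single value: stops after the length check
c(r1, r2, r3)
```

Since the right-hand side is skipped whenever the length check fails, the function never passes a longer vector to &&.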
Instead of converting logical values into numeric ones, one can also proceed the other way round. In the
latter case, 0 is turned into FALSE, NaN into NA and everything else into TRUE:
Example:
c(-Inf,Inf,NaN,-2,-1,0,5) & T

## [1] TRUE TRUE NA TRUE TRUE FALSE TRUE

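The functions is.na() and is.nan() check element-wise whether the components of x are missing (NA) or not a number (NaN) and return TRUE or FALSE accordingly. Note that is.na() also returns TRUE for NaN, whereas is.nan() returns FALSE for NA.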
x <- c(1,2,NA,NaN)
is.na(x)

## [1] FALSE FALSE TRUE TRUE


is.nan(x)

## [1] FALSE FALSE FALSE TRUE
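Combining both functions with the logical operators from above separates genuine NA values from NaN. A short sketch with illustrative names:

```r
x <- c(1, 2, NA, NaN)
only.na <- is.na(x) & !is.nan(x)   # TRUE only where the value is NA but not NaN
n.missing <- sum(is.na(x))         # counts NA and NaN together
only.na
n.missing
```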

7 Data types in R
Apart from the numerical and logical quantities we have seen so far, there exist some further data types in R, which are introduced by example below. The function mode() determines the data type of an R object.
a <- NULL
b <- TRUE
d <- 1
e <- 3+4i
f <- "abcdef"
g <- NA

mode(a)

## [1] "NULL"
mode(b)

## [1] "logical"
mode(d)

## [1] "numeric"
mode(e)

## [1] "complex"
mode(f)

## [1] "character"
mode(g)

## [1] "logical"
With the help of the functions is.character(), is.numeric(), is.complex() and so on you can check whether an object is of some specific data type. With as.character(), as.numeric(), as.complex() and so on one can convert data from one type into another, but this may sometimes be accompanied by a loss of information.
is.numeric(d)

## [1] TRUE
is.complex(f)

## [1] FALSE
as.numeric(b)

## [1] 1

e <- as.numeric(e)

## Warning: imaginary parts discarded in coercion


e

## [1] 3
e <- as.complex(e)
e

## [1] 3+0i
Remark: The quantities NA and NULL have to be carefully distinguished: While NA is a value of type “logical”, NULL represents the empty object and has no specific data type (or the type “NULL”, respectively). Moreover, NA can occur within a vector, whereas NULL cannot.
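The difference becomes visible when both are placed inside c(): NA occupies a position, whereas NULL simply disappears. A small sketch with illustrative names; is.null() is not introduced in the text above.

```r
with.na   <- c(1, NA, 3)
with.null <- c(1, NULL, 3)
length(with.na)     # NA counts as a component
length(with.null)   # NULL contributes nothing
length(NULL)        # the empty object has length 0
is.null(NULL)       # test for NULL, analogous to is.na() for NA
```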

8 Indexing of vectors
We now present some techniques R provides to extract single components from a vector.

8.1 Numerical indexes


With x[i] you can extract the ith component of a vector x. It is also possible to insert vectors as indexes to
choose several elements of x at the same time. Negative indexes suppress the output of the corresponding
components.
x <- c(2,4,6)
x[2:3]

## [1] 4 6
x[-2]

## [1] 2 6
x[4] <- 8
length(x)

## [1] 4
x

## [1] 2 4 6 8
x[5]

## [1] NA
x[6] <- 12
x[1:10]

## [1] 2 4 6 8 NA 12 NA NA NA NA
As you can see, R in some sense regards a vector x with length(x) = n as nothing but a sequence $(x_1, \ldots, x_n, \mathrm{NA}, \mathrm{NA}, \ldots)$.
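A few further index patterns that follow from the rules above, sketched on a fresh vector (object names are illustrative):

```r
x <- c(2, 4, 6, 8)
first.third <- x[c(1, 3)]     # first and third component
drop.two    <- x[-(1:2)]      # everything except the first two components
last        <- x[length(x)]   # last component
repeated    <- x[c(4, 1, 1)]  # order and repetition follow the index vector
```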

8.2 Indexing with logical vectors


Apart from numerical indexes, R also allows logical vectors as indexes. In this case, all elements of a vector x are extracted which correspond to the value TRUE within the index vector. The logical index vector should normally have the same length as x. If it is shorter, it is repeated cyclically until it has the same length as x; if it is longer, every surplus TRUE refers to a component that does not exist and yields an NA in the output. If some elements of the logical index vector have the value NA, then R cannot decide whether the corresponding element of the indexed vector should be extracted or not, which is expressed by an NA at all these places in the output. The indexed vector itself, however, remains unchanged, as the second example shows.
x[x<6]

## [1] 2 4 NA
x[c(NA,T)]

## [1] NA 4 NA 8 NA 12
x

## [1] 2 4 6 8 NA 12
x[c(NA,T)] <- 1
x

## [1] 2 1 6 1 NA 1
y <- c(3,2,4,7,1,6)
y[y>=3]

## [1] 3 4 7 6
Using the function is.na(), you can easily obtain a vector without the unwanted NA components.
x.ok <- !is.na(x)
x.ok

## [1] TRUE TRUE TRUE TRUE FALSE TRUE


x[x.ok]

## [1] 2 1 6 1 1
length(x)

## [1] 6
length(x[x.ok])

## [1] 5
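How R treats logical index vectors whose length differs from that of the indexed vector can be tried out directly. A sketch with illustrative names:

```r
x <- c(2, 4, 6, 8, 10, 12)
odd.pos <- x[c(TRUE, FALSE)]              # index is recycled to T F T F T F
short <- c(2, 4)
too.long <- short[c(TRUE, TRUE, TRUE)]    # the third TRUE points past the end of short
odd.pos
too.long
```

The shorter index picks every second component, while the longer index produces an NA for the position that does not exist.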

8.3 Indexing with names


Another possibility is to label the components of a vector via the function names() and then to pick out
some of them by calling their names.
vector.names <- c("a","b","c","d","e","f")
names(x) <- vector.names
names(x)

## [1] "a" "b" "c" "d" "e" "f"


x

## a b c d e f
## 2 1 6 1 NA 1
x["d"]

## d

## 1
Observe that the quotes " in the last command line are necessary to mark the d in between as a character string, as it was defined within the vector vector.names.
3*x

## a b c d e f
## 6 3 18 3 NA 3
as.vector(3*x)

## [1] 6 3 18 3 NA 3
Vectors are characterized by the fact that all of their elements are of the same data type. If an element of a different type is appended to a vector, a conversion into the data type of higher order is performed: if the new element has the lower order, the new element itself is converted; otherwise all other components of the vector are converted (only NAs remain unchanged in either case). The order of the data types is “NULL” < “logical” < “numeric” < “complex” < “character”.
y

## [1] 3 2 4 7 1 6
y[7] <- TRUE
y

## [1] 3 2 4 7 1 6 1
y[8] <- "a"
y

## [1] "3" "2" "4" "7" "1" "6" "1" "a"


y^2

## Error in y^2: non-numeric argument to binary operator
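The same ordering of data types applies when values of different types are combined directly with c(). A brief sketch with illustrative names:

```r
v1 <- c(TRUE, 2)         # logical and numeric: result is numeric
v2 <- c(1, 3+4i)         # numeric and complex: result is complex
v3 <- c(1, "a", TRUE)    # anything with character: result is character
c(mode(v1), mode(v2), mode(v3))
v3                       # every component is now a character string
```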

9 Lists
Lists make it possible to combine objects of different data types. In simple terms, a list can be regarded as a vector with components of different types.
first.names <- c("Maude","Harold","Bonny","Clyde","Anna","Christina","Jack","Paul") # This is a vector
first.names

## [1] "Maude" "Harold" "Bonny" "Clyde" "Anna" "Christina"


## [7] "Jack" "Paul"
list.example <- list(100:107,first.names,c(T,F,T,T,F,F,F,T))
list.example

## [[1]]
## [1] 100 101 102 103 104 105 106 107
##
## [[2]]
## [1] "Maude" "Harold" "Bonny" "Clyde" "Anna" "Christina"
## [7] "Jack" "Paul"
##
## [[3]]
## [1] TRUE FALSE TRUE TRUE FALSE FALSE FALSE TRUE

Components of a list can be selected analogously to those of vectors: With list.example[i] or list.example[[i]] one obtains the ith item of the list. With [], R returns the chosen list components in the form of a (sub)list, whereas with [[]] the chosen list element itself (a vector, matrix, number or the like) is returned. The double square brackets [[ and ]] can therefore only pick out a single item; they do not accept an index vector that selects several items.
list.example[1]

## [[1]]
## [1] 100 101 102 103 104 105 106 107
list.example[1:2]

## [[1]]
## [1] 100 101 102 103 104 105 106 107
##
## [[2]]
## [1] "Maude" "Harold" "Bonny" "Clyde" "Anna" "Christina"
## [7] "Jack" "Paul"
list.example[[1]]

## [1] 100 101 102 103 104 105 106 107
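The difference between the two kinds of brackets matters as soon as one wants to compute with the result. A sketch on a shortened version of the list (the shortened contents and the object names sub and item are illustrative):

```r
list.example <- list(100:107, c("Maude", "Harold"), c(TRUE, FALSE))
sub  <- list.example[1]      # a list of length 1
item <- list.example[[1]]    # the integer vector itself
c(is.list(sub), is.list(item))
item + 1                     # arithmetic works on the vector
# sub + 1 would fail, because a list is not a numeric argument
```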


When working with lists, it is advantageous to label all items so that they can be selected via their names:
members <- list(number=100:107,names=first.names,paid=c(T,F,T,T,F,F,F,T))
members$number

## [1] 100 101 102 103 104 105 106 107


members$names

## [1] "Maude" "Harold" "Bonny" "Clyde" "Anna" "Christina"


## [7] "Jack" "Paul"
members$paid

## [1] TRUE FALSE TRUE TRUE FALSE FALSE FALSE TRUE


Doing this makes it easy to combine and extract information from the data. For example, if you want to know the names of all members having a membership number greater than 103, or the names of all members who have already paid their annual fee, you have to enter the following:
members$names[members$number>103]

## [1] "Anna" "Christina" "Jack" "Paul"


members$names[members$paid]

## [1] "Maude" "Bonny" "Clyde" "Paul"


Observe that the index vector here is different from the one that is indexed. The two vectors just have to fit
together, that is, they must have the same length.
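Conditions on several list components can be combined with the logical operators from Section 6. The following sketch rebuilds the list from above so that it runs on its own; the names unpaid, late.and.paid and n.paid are illustrative.

```r
first.names <- c("Maude", "Harold", "Bonny", "Clyde", "Anna", "Christina", "Jack", "Paul")
members <- list(number = 100:107, names = first.names,
                paid = c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE))
unpaid <- members$names[!members$paid]                               # members who have not paid
late.and.paid <- members$names[members$number > 103 & members$paid]  # number above 103 and paid
n.paid <- sum(members$paid)                                          # how many have paid
unpaid
late.and.paid
n.paid
```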

10 Plots
Now we consider some elementary graphical commands in R, starting with plot(). To get detailed information about this graphical command, enter

help(plot)

As you can see, plot(x,y) plots the points with x-coordinates (abscissas) given by the vector x and y-coordinates (ordinates) given by y. Here x and y must, of course, be vectors having the same number of elements. Note that RStudio displays all plots generated with plot() in the lower right sub-window (tab “Plots”), so there is usually no need to open an additional viewer; we therefore omit the R commands for this purpose here. The optional parameter type within the plot() command selects among different plotting types: With type="l" (lines), line segments are drawn between successive points, with type="s" (stair steps) one obtains a step function, and with type="p" (points) the single points are drawn. The latter is the default if type is not set explicitly. With type="n" (nothing), plotting is suppressed; only the bounding box, the axes (with labels), and the title (if specified) are drawn. The symbol used to mark the different points can be chosen with the help of the graphical parameter pch (plotting character). For an overview of all the graphical parameters and their meanings, use
help(par)

Examples:
plot(1,1)
[Figure: plot of the single point (1, 1)]

plot(c(1,2,3),c(3,1,2),type="l")

[Figure: line plot; x-axis: c(1, 2, 3), y-axis: c(3, 1, 2)]

plot(c(1,2,3,4),c(1,4,9,16),pch=1)
[Figure: point plot with pch=1; x-axis: c(1, 2, 3, 4), y-axis: c(1, 4, 9, 16)]

plot(c(1,2,3,4),c(1,4,9,16),pch=c(2,5))

[Figure: point plot with pch=c(2,5); x-axis: c(1, 2, 3, 4), y-axis: c(1, 4, 9, 16)]

If you resize the graphics window with the mouse, the plot will automatically be reshaped accordingly.
To plot a function $f : [a, b] \to \mathbb{R}$ exactly, one would theoretically have to determine the function values $f(x)$ for all possible arguments $x \in [a, b]$. This is, of course, not feasible in practice using a computer, therefore one uses the following procedure: One divides the interval $[a, b]$ into (typically equidistant) pieces $a = x_0 < x_1 < \ldots < x_n = b$, calculates the function values $f(x_k)$ for all $0 \le k \le n$ and connects them with line segments. If the distances between the $x_k$ are sufficiently small, one should obtain a fairly good approximation of the graph of $f$ (what counts as “sufficiently small” depends, of course, on $f$ as well as on the eye of the beholder). To generate equidistant sequences $(x_k)_{0 \le k \le n}$, one usually uses the seq() command introduced before.
To plot the density of the standard normal distribution N (0, 1), we first generate a vector of x-coordinates
with help of seq(). The corresponding y-coordinates are then obtained by inserting this vector into the
density function dnorm().
x <- seq(-4,4,0.1)
plot(x,dnorm(x),type="l")

[Figure: density of N(0, 1) on the interval from −4 to 4; y-axis: dnorm(x)]
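The influence of the grid width on the quality of the plot can be seen by drawing the same density on a coarse and on a fine grid. A sketch; the step sizes 2 and 0.1 and the object names are chosen only for illustration.

```r
x.coarse <- seq(-4, 4, 2)     # only 5 grid points
x.fine   <- seq(-4, 4, 0.1)   # 81 grid points
plot(x.coarse, dnorm(x.coarse), type = "l")   # visibly angular polygon
plot(x.fine, dnorm(x.fine), type = "l")       # looks like a smooth curve
c(length(x.coarse), length(x.fine))
```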

If we now want to compare several normal density functions with different parameters µ and σ, we cannot use the function plot() again because it erases the existing plot and draws a new one. To add a curve to an already existing plot, one has to use the R commands lines() or points(). The function title() adds a title.
plot(x,dnorm(x),type="l")
points(x,dnorm(x,mean=1,sd=2),pch=5)
points(x,dnorm(x,mean=-1,sd=1.5),pch=1)
title("Normal densities")

[Figure: “Normal densities”; y-axis: dnorm(x)]

Remark: A title for a plot can alternatively be created using the parameter main of the function plot(). In the above example, the title could also be set by extending the plot() command as follows:

plot(x,dnorm(x),type="l",main="Normal densities")

[Figure: “Normal densities”; y-axis: dnorm(x)]

The plot()-function automatically determines the appropriate scales of both axes, but these are not adapted
anymore if one adds further lines or points.
plot(x,dnorm(x),type="l",main="Normal densities")
lines(x,dnorm(x,0,0.25))

[Figure: “Normal densities”; y-axis: dnorm(x), ranging from 0.0 to 0.4]

You can also determine the range of the x- and the y-axis yourself using xlim and ylim. Here xlim and ylim must be vectors of length 2 which contain the desired axis limits. The parameter col changes the color of the lines or points to be drawn. Eight different colors can be chosen by assigning one of the numbers 1,2,...,8 to col. To see which color corresponds to which number, enter

palette()

## [1] "black" "#DF536B" "#61D04F" "#2297E6" "#28E2E5" "#CD0BBC" "#F5C710"


## [8] "gray62"
Alternatively, one can also assign a text string to col which describes the desired color. At first glance, this may seem a little more laborious, but it lets you choose from a much bigger assortment of colors. You can list all available colors on your computer by entering
colors()

If the parameter col is not set explicitly, R chooses the color “black” (1) by default.
plot(x,dnorm(x),type="l",xlim=c(-6,6),ylim=c(0,1))
lines(x,dnorm(x,sd=0.5),col=2)
lines(x,dnorm(x,mean=1,sd=1.5),col="gold")
[Figure: three normal densities; x-axis from −6 to 6, y-axis: dnorm(x), from 0.0 to 1.0]
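Instead of guessing suitable limits, ylim can be computed from the data so that every curve fits into the plot. A sketch using range(), which is not introduced in the text above and returns the smallest and the largest value of its argument; the object names are illustrative.

```r
x <- seq(-4, 4, 0.1)
y1 <- dnorm(x)                  # standard normal density
y2 <- dnorm(x, sd = 0.25)       # much narrower and higher density
y.range <- range(c(y1, y2))     # smallest and largest value over both curves
plot(x, y1, type = "l", ylim = y.range)
lines(x, y2, col = 2)           # now fits completely into the plot
```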

If you are not satisfied with the axis labels R prints on the graph, you can adjust them via the parameters xlab (x-axis label) and ylab (y-axis label). Setting xlab="" or ylab="" suppresses the respective axis label.
plot(x,dnorm(x,sd=1.2),type="l",xlab="x",ylab="density of N(0,1.44)",col="blue")

[Figure: density curve; x-axis: x, y-axis: density of N(0,1.44)]

The drawing of the axes themselves can be suppressed by assigning the value "n" to the parameters xaxt (x-axis type) or yaxt (y-axis type).
plot(x,dnorm(x),type="l",xlab="x",ylab="",xaxt="n",col="blue")
[Figure: density curve drawn without x-axis; y-axis from 0.0 to 0.4]

Remark: The R graphics window is solely intended for viewing. By default, the plots created during an R session are not saved permanently, even if one uses the command save.image() in between or answers the question “Save workspace image?” with yes when quitting R! If one wants to print a plot or to store it permanently, one has to convert the plot into a PDF or image file. This can be done with special R commands like dev.print(), png() or jpeg(), but it is much easier and more convenient to use the corresponding menu “Export” at the upper margin of the plot window in RStudio.
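For completeness, a minimal sketch of saving a plot from the command line with pdf(), which works analogously to png() and jpeg(). The file name normal_density.pdf and its location in R's temporary directory are assumptions made only for this illustration.

```r
out.file <- file.path(tempdir(), "normal_density.pdf")   # hypothetical file name
x <- seq(-4, 4, 0.1)
pdf(out.file, width = 6, height = 4)    # open a PDF graphics device
plot(x, dnorm(x), type = "l", main = "Normal densities")
dev.off()                               # close the device; the file is complete now
file.exists(out.file)
```

Everything plotted between pdf() and dev.off() goes into the file instead of the plot window.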
Exercise 2: (From “Portfolio Management Tutorial 1 Exercise 1”)

Assume that you are considering selecting assets from among the four candidates below (assume that there is no relationship between the amount of rainfall and the condition of the stock market).

Table 3: Asset 1

Market Condition   Return   Probability
Good               16       1/4
Average            12       1/2
Poor               8        1/4

Table 4: Asset 2

Market Condition   Return   Probability
Good               4        1/4
Average            6        1/2
Poor               8        1/4

Table 5: Asset 3

Market Condition   Return   Probability
Good               20       1/4
Average            14       1/2
Poor               8        1/4

Table 6: Asset 4

Rainfall           Return   Probability
Good               16       1/3
Average            12       1/3
Poor               8        1/3

Using R,
a) Solve for the expected return and the standard deviation of return for each separate investment.
b) Solve for the correlation coefficient and the covariance between each pair of investments.
c) Solve for the expected return and variance of each of the portfolios whose portions invested in each asset are shown below, and plot the original assets and each of the portfolios in expected return/standard deviation space.

Table 7: Portions Invested in Each Asset

Portfolio   Asset 1   Asset 2   Asset 3   Asset 4
A           1/2       1/2       0         0
B           1/2       0         1/2       0
C           1/3       1/3       1/3       0

Exercise 3: (From “Portfolio Management Tutorial 1 Exercise 2”)
The table below contains actual monthly returns data for three companies for 6 months. Use R to
a) compute the average rate of return for each company.
b) compute the standard deviation of the rate of return for each company.
c) compute the correlation coefficient between all possible pairs of securities.
d) compute the average return and standard deviation for the following portfolios: $\frac{1}{2}A + \frac{1}{2}B$ and $\frac{1}{3}A + \frac{1}{3}B + \frac{1}{3}C$.

Table 8: Monthly return of the company

Security 1 2 3 4 5 6
A 3.7% 0.4% −6.5% 1.4% 6.2% 2.1%
B 10.5% 0.5% 3.7% 1.0% 3.4% −1.4%
C 1.4% 14.9% −1.4% 10.8% 4.9% 16.9%

Exercise 4: (From “Portfolio Management Tutorial 1 Exercise 3”)


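Assume that the average variance of return for an individual security is 50 and that the average covariance is
10. The expected variance of an equally weighted portfolio of N securities is given by
$$\sigma_P^2 = \frac{1}{N}\left(\bar{\sigma}_j^2 - \bar{\sigma}_{kj}\right) + \bar{\sigma}_{kj}$$
where $\bar{\sigma}_j^2$ is the average variance and $\bar{\sigma}_{kj}$ the average covariance.
Plot the portfolio variance as a function of the number of securities N.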

11 Handling of generated R objects


As mentioned before, the objects generated during an R session are only saved permanently if either the
question that appears when R is terminated is answered with y resp. “Save”, or a backup is made with
the command save.image(). Since both methods store all R objects currently residing in the
main memory of the computer, one should first delete everything that is no longer needed.
To list all objects we have generated ourselves, we can either check the “Workspace”-tab in the
upper right sub-window of RStudio or use the equivalent commands ls() (list) or objects(). To erase one or
several of them, we can use rm() or remove().
ls()

## [1] "a" "b" "d" "e" "f"


## [6] "first.names" "g" "list.example" "members" "vector.names"
## [11] "x" "x.ok" "y"
objects()

## [1] "a" "b" "d" "e" "f"


## [6] "first.names" "g" "list.example" "members" "vector.names"
## [11] "x" "x.ok" "y"
rm(x)
ls()

## [1] "a" "b" "d" "e" "f"


## [6] "first.names" "g" "list.example" "members" "vector.names"
## [11] "x.ok" "y"

If you want to remove everything in the working environment, use
rm(list = ls())
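
As a small illustration of the commands above, the following sketch creates a few throw-away objects (the names demo.a, demo.b and demo.c are made up for this example), removes two of them in one call and checks the result with exists(). The last line assumes that no other objects in your workspace have names starting with "demo.".

```r
# Hypothetical objects, created only for this illustration
demo.a <- 1
demo.b <- "text"
demo.c <- c(TRUE, FALSE)

# rm() accepts several object names at once
rm(demo.a, demo.b)

# exists() reports whether an object with the given name can be found
still.there <- c(exists("demo.a"), exists("demo.b"), exists("demo.c"))
still.there

# Remove only the objects whose names match a pattern
# (here: all names starting with "demo.")
rm(list = ls(pattern = "^demo\\."))
```

Removing by pattern is a gentler alternative to rm(list = ls()) when only a group of temporary objects should disappear.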

12 Matrices
R of course also knows matrices, which you can generate with the function matrix(). The general syntax is as
follows:
matrix(data,nrow=1,ncol=1,byrow=FALSE,dimnames=NULL)

Here data is a (usually numerical) vector which contains the elements of the matrix to be built. These will
be inserted column by column if byrow=FALSE (default), and row by row if byrow=TRUE. With nrow
one defines the number of rows the matrix should have, and with ncol analogously the number of columns.
Usually it suffices to specify just one of these two parameters; the other is then determined automatically
by R from the length of data (if possible, that is, if the length is a multiple of the desired number of rows
resp. columns). If the vector data has fewer than nrow × ncol elements, it will be repeated cyclically to
obtain enough components to fill the matrix. The argument dimnames allows one to name the rows
and columns, similar to the function names() for vectors (compare 7.3).
m <- matrix(1:12,nrow=3)
m

## [,1] [,2] [,3] [,4]


## [1,] 1 4 7 10
## [2,] 2 5 8 11
## [3,] 3 6 9 12
Acquaint yourself with the functioning of the different parameters by entering
matrix(1:-10,ncol=4,byrow=T)

## [,1] [,2] [,3] [,4]


## [1,] 1 0 -1 -2
## [2,] -3 -4 -5 -6
## [3,] -7 -8 -9 -10
matrix(1:3,nrow=3,ncol=4)

## [,1] [,2] [,3] [,4]


## [1,] 1 1 1 1
## [2,] 2 2 2 2
## [3,] 3 3 3 3
matrix(1:13,ncol=4,byrow=T)

## Warning in matrix(1:13, ncol = 4, byrow = T): data length [13] is not a sub-
## multiple or multiple of the number of rows [4]
## [,1] [,2] [,3] [,4]
## [1,] 1 2 3 4
## [2,] 5 6 7 8
## [3,] 9 10 11 12
## [4,] 13 1 2 3
The function dim() determines the dimensions of a matrix.
dim(m)

## [1] 3 4

In contrast to this, length() returns the total number of elements a matrix contains:
length(m)

## [1] 12
The indexing of matrices using numerical indexes is completely analogous to that of vectors. If either the row
numbers or the column numbers are not specified explicitly, all rows resp. columns will be chosen.
m[1,2]

## [1] 4
m[,1:2]

## [,1] [,2]
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
m[c(1,3),c(2,4)]

## [,1] [,2]
## [1,] 4 10
## [2,] 6 12
If one just enters a single index or index vector instead of a comma-separated pair of row and column numbers,
the matrix will be treated by R like a vector (the same happens internally when using the length()-command)
which is obtained by concatenating all columns, and the components of this vector which correspond to the
submitted index vector will be returned.
m[9]

## [1] 9
m[m>5]

## [1] 6 7 8 9 10 11 12
The functions row() and col() return integer-valued matrices which contain the row resp. column numbers
of all elements.
row(m)

## [,1] [,2] [,3] [,4]


## [1,] 1 1 1 1
## [2,] 2 2 2 2
## [3,] 3 3 3 3
col(m)

## [,1] [,2] [,3] [,4]


## [1,] 1 2 3 4
## [2,] 1 2 3 4
## [3,] 1 2 3 4
This is particularly useful if one wants to assign values to secondary diagonals or non-quadratic parts of a
matrix, as the following example shows:
m[row(m)>=col(m)] <- 0
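
To see what such a logical index selects, the sketch below applies the same idea to a fresh 4 × 4 matrix (called s here, a name chosen for this example so that m stays untouched) and sets only the two secondary diagonals next to the main diagonal to 0.

```r
s <- matrix(1:16, 4, 4)

# row(s) - col(s) is 0 on the main diagonal, 1 on the first diagonal
# below it and -1 on the first diagonal above it
s[abs(row(s) - col(s)) == 1] <- 0
s
```

The main diagonal and all elements further away from it keep their original values.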

Matrices also require all of their elements to be of the same data type. If some elements are assigned values of
different data types, a conversion following the same rules as for vectors (see 7.3) takes place.

One can also name the rows and columns of a matrix, either via specifying the parameter dimnames within
the matrix()-command or subsequently using the function dimnames(). The names must be submitted in
a list having exactly two components which are vectors containing the row and column names. The lengths
of these vectors must, of course, agree with the dimensions of the matrix.
dimnames(m) <- list(c("I","II","III"),c("a","b","c","d"))
m

## a b c d
## I 0 4 7 10
## II 0 0 8 11
## III 0 0 0 12
m[,"d"]

## I II III
## 10 11 12
dimnames(m)

## [[1]]
## [1] "I" "II" "III"
##
## [[2]]
## [1] "a" "b" "c" "d"
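
The following sketch shows the other route mentioned above, passing dimnames directly to matrix(), together with the base functions rownames() and colnames(), which each access one component of dimnames(). The matrix r and its labels are chosen for illustration only; the numbers are the returns from Tables 3 and 4.

```r
# Names supplied directly when the matrix is created
r <- matrix(c(16, 12, 8, 4, 6, 8), nrow = 3,
            dimnames = list(c("Good", "Average", "Poor"),
                            c("Asset1", "Asset2")))

# Elements can then be addressed by name instead of by number
r["Poor", "Asset2"]

# rownames() and colnames() read or change one set of names
colnames(r)
rownames(r)[2] <- "Avg"
r
```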

13 Matrices and arithmetic operations


Analogously to vectors, arithmetic operations on matrices are performed element-wise (if the dimensions of
the matrices coincide).
M <- matrix(1:9,3,3)
M1 <- matrix(-9:-1,3,3)

Check this by entering


M1+M

## [,1] [,2] [,3]


## [1,] -8 -2 4
## [2,] -6 0 6
## [3,] -4 2 8
M1*M

## [,1] [,2] [,3]


## [1,] -9 -24 -21
## [2,] -16 -25 -16
## [3,] -21 -24 -9
(-1)*M1

## [,1] [,2] [,3]


## [1,] 9 6 3
## [2,] 8 5 2
## [3,] 7 4 1
v <- c(9,8,7)
v*M1

## [,1] [,2] [,3]

## [1,] -81 -54 -27
## [2,] -64 -40 -16
## [3,] -49 -28 -7
One can also add a scalar or a vector to a matrix. A scalar is simply added to each element of the matrix; the
addition of a vector (as well as the multiplication) is done column by column. In the latter case one should
make sure that the length of the vector agrees with the number of rows of the matrix, otherwise one may obtain
somewhat surprising results, as the last example shows.
M1+9

## [,1] [,2] [,3]


## [1,] 0 3 6
## [2,] 1 4 7
## [3,] 2 5 8
M1+v

## [,1] [,2] [,3]


## [1,] 0 3 6
## [2,] 0 3 6
## [3,] 0 3 6
M1

## [,1] [,2] [,3]


## [1,] -9 -6 -3
## [2,] -8 -5 -2
## [3,] -7 -4 -1
M1+c(1,2,3,4)

## Warning in M1 + c(1, 2, 3, 4): longer object length is not a multiple of shorter


## object length
## [,1] [,2] [,3]
## [1,] -8 -2 0
## [2,] -6 -4 2
## [3,] -4 -2 0
Since in the last case the length of the vector is greater than the number of rows of the matrix, the last
component of the vector is added to the first element in the second column of the matrix. Then the first
component of the vector is added to the second element in the second column of the matrix and so on.
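
If the jth component of the vector should instead be added to every element of the jth column, the column-by-column recycling has to be worked around. The following sketch shows two possible ways, a double transpose and the base function sweep(); the objects P and w are made up for this example.

```r
P <- matrix(-9:-1, 3, 3)
w <- c(9, 8, 7)

# Default recycling: w[i] is added to every element of row i
P + w

# w[j] is added to every element of column j
by.transpose <- t(t(P) + w)
by.sweep     <- sweep(P, 2, w, "+")   # MARGIN = 2 matches w to the columns
by.sweep
```

Both variants give the same matrix; sweep() states the intention more explicitly.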
The command t() transposes a matrix.
M

## [,1] [,2] [,3]


## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
t(M)

## [,1] [,2] [,3]


## [1,] 1 2 3
## [2,] 4 5 6
## [3,] 7 8 9
Observe that vectors generated with help of the function c() are neither column nor row vectors, as the
following example shows:

v

## [1] 9 8 7
dim(v)

## NULL
t(v)

## [,1] [,2] [,3]


## [1,] 9 8 7
as.matrix(v)

## [,1]
## [1,] 9
## [2,] 8
## [3,] 7
When applying the command t(), the vector v first is converted internally into a 3 × 1-matrix; t(v) thus
corresponds to t(as.matrix(v)). Similar R-internal conversions also take place when vectors are added to or
multiplied with matrices.

14 Matrix products
The usual matrix product (multiplying “row by column”) is obtained by using the operator %*%. If the
dimensions of the matrices do not fit together, R will issue an error message.
Examples:
M1 <- matrix(1:9,3,3)
M2 <- matrix(c(1,1,1,0,1,1,0,0,1),3,3,byrow=T)
M2 %*% M1

## [,1] [,2] [,3]


## [1,] 6 15 24
## [2,] 5 11 17
## [3,] 3 6 9
M2 %*% v

## [,1]
## [1,] 24
## [2,] 15
## [3,] 7
M2 %*% t(v)

## Error in M2 %*% t(v): non-conformable arguments


v %*% M2

## [,1] [,2] [,3]


## [1,] 9 17 24
t(v) %*% M2

## [,1] [,2] [,3]


## [1,] 9 17 24

M3 <- matrix(1:8,2,4)
M4 <- matrix(c(1,1,1,0,0,0),3,2)
M4 %*% M3

## [,1] [,2] [,3] [,4]


## [1,] 1 3 5 7
## [2,] 1 3 5 7
## [3,] 1 3 5 7
M3 %*% M1

## Error in M3 %*% M1: non-conformable arguments


With crossprod(x,y), one can calculate t(x) %*% y. If x and y are vectors of the same length, this is
nothing but the scalar product of both vectors. Observe that crossprod() always returns a matrix.
crossprod(M4,M)

## [,1] [,2] [,3]


## [1,] 6 15 24
## [2,] 0 0 0
crossprod(1:3,2:4)

## [,1]
## [1,] 20
as.vector(crossprod(1:3,2:4))

## [1] 20
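
Matrix products are also what portfolio calculations boil down to: for a weight vector w and a covariance matrix Sigma, the portfolio variance is the quadratic form t(w) %*% Sigma %*% w. The sketch below illustrates this; the numbers in Sigma and w are invented for this example and are not taken from the exercise tables.

```r
# Hypothetical covariance matrix of two assets and equal weights
Sigma <- matrix(c(4, 1,
                  1, 9), 2, 2, byrow = TRUE)
w <- c(0.5, 0.5)

# t(w) %*% Sigma %*% w returns a 1 x 1 matrix ...
port.var <- t(w) %*% Sigma %*% w
port.var

# ... and as.vector() turns it into a plain number
as.vector(port.var)
```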

15 Block and diagonal matrices


With cbind() (c here means “column”) you can create a new matrix by merging together vectors of the
same length or matrices having the same number of rows. The row-wise counterpart to cbind() is (surprise,
surprise) rbind().
Examples:
cbind(c(1,2,3), c(6,5,4), matrix(1,3,2))

## [,1] [,2] [,3] [,4]


## [1,] 1 6 1 1
## [2,] 2 5 1 1
## [3,] 3 4 1 1
rbind(matrix(1,2,3),matrix(2,2,3))

## [,1] [,2] [,3]


## [1,] 1 1 1
## [2,] 1 1 1
## [3,] 2 2 2
## [4,] 2 2 2
The R-command diag() either generates diagonal matrices or changes resp. extracts the diagonal
elements of a matrix. If a vector of length n is supplied as argument to diag(), it will create an n×n-diagonal
matrix whose diagonal elements are just those of the given vector. If the argument of diag() is a matrix
instead, the function will return a vector containing the diagonal elements of that matrix.
Example:

diag(1:5)

## [,1] [,2] [,3] [,4] [,5]


## [1,] 1 0 0 0 0
## [2,] 0 2 0 0 0
## [3,] 0 0 3 0 0
## [4,] 0 0 0 4 0
## [5,] 0 0 0 0 5
M <- matrix(1:16,4,4)
M

## [,1] [,2] [,3] [,4]


## [1,] 1 5 9 13
## [2,] 2 6 10 14
## [3,] 3 7 11 15
## [4,] 4 8 12 16
diag(M)

## [1] 1 6 11 16
diag(diag(M))

## [,1] [,2] [,3] [,4]


## [1,] 1 0 0 0
## [2,] 0 6 0 0
## [3,] 0 0 11 0
## [4,] 0 0 0 16
M-diag(diag(M))

## [,1] [,2] [,3] [,4]


## [1,] 0 5 9 13
## [2,] 2 0 10 14
## [3,] 3 7 0 15
## [4,] 4 8 12 0
diag(M) <- c(1,1,1,1)
M

## [,1] [,2] [,3] [,4]


## [1,] 1 5 9 13
## [2,] 2 1 10 14
## [3,] 3 7 1 15
## [4,] 4 8 12 1
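
One further behaviour of diag() that is not shown above: if a single number n is supplied, diag(n) returns the n × n identity matrix. The sketch below shows this and how blocks can be glued together with cbind() and rbind(); the object names are chosen for this example only.

```r
# A single number n gives the n x n identity matrix
I3 <- diag(3)
I3

# Blocks are glued together with cbind() (side by side) and rbind() (stacked)
top    <- cbind(diag(2), matrix(0, 2, 2))
bottom <- cbind(matrix(0, 2, 2), matrix(5, 2, 2))
block  <- rbind(top, bottom)
block
dim(block)
```

The blocks that are combined must fit: cbind() needs equal numbers of rows, rbind() equal numbers of columns.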
Exercise 5: Use rbind(), cbind() and diag() to generate the following matrix:

1 1 1 1 0 0
1 1 1 0 2 0
1 1 1 0 0 3
1 1 1 4 4 4
2 2 2 4 4 4
2 2 2 4 4 4
2 2 2 4 4 4
2 2 2 4 4 4

16 Linear systems of equations
The function solve(A,b) solves linear systems of equations of the form A %*% x = b.
A <- matrix(c(1,1,0,0,1,1,1,0,1),3,3)
A

## [,1] [,2] [,3]


## [1,] 1 0 1
## [2,] 1 1 0
## [3,] 0 1 1
solve(A,c(1,0,0))

## [1] 0.5 -0.5 0.5


solve(A,c(0,0,1))

## [1] -0.5 0.5 0.5


solve(A) returns the inverse of the matrix A (if it exists).
Examples:
A <- matrix(c(2,1,7,5,6,8,9,3,4),3)
A

## [,1] [,2] [,3]


## [1,] 2 5 9
## [2,] 1 6 3
## [3,] 7 8 4
solve(A)

## [,1] [,2] [,3]


## [1,] 0.00000000 -0.23529412 0.17647059
## [2,] -0.07692308 0.24886878 -0.01357466
## [3,] 0.15384615 -0.08597285 -0.03167421
A %*% solve(A)

## [,1] [,2] [,3]


## [1,] 1 0.000000e+00 5.551115e-17
## [2,] 0 1.000000e+00 2.775558e-17
## [3,] 0 1.110223e-16 1.000000e+00
B <- matrix(c(1,1,0,1,0,1,0,1,-1),3,3)
B

## [,1] [,2] [,3]


## [1,] 1 1 0
## [2,] 1 0 1
## [3,] 0 1 -1
solve(B)

## Error in solve.default(B): Lapack routine dgesv: system is exactly singular: U[3,3] = 0


solve(B,c(2,1,1))

## Error in solve.default(B, c(2, 1, 1)): Lapack routine dgesv: system is exactly singular: U[3,3] = 0
The determinant of a square matrix can be obtained from the command det().

det(A)

## [1] -221
det(B)

## [1] 0
det(matrix(1:8,2))

## Error in determinant.matrix(x, logarithm = TRUE, ...): 'x' must be a square matrix
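
A computed solution can be checked by plugging it back into the system. Because of rounding errors (compare the entries of size 1e-16 in A %*% solve(A) above), an exact comparison with == may fail, so the sketch below uses all.equal(), which compares up to a small numerical tolerance. The names A.demo, b.demo and x.demo are chosen for this example.

```r
A.demo <- matrix(c(2, 1, 7, 5, 6, 8, 9, 3, 4), 3)
b.demo <- c(1, 2, 3)

x.demo <- solve(A.demo, b.demo)

# Does A.demo %*% x.demo reproduce b.demo (up to rounding)?
all.equal(as.vector(A.demo %*% x.demo), b.demo)

# Is A.demo times its inverse the identity matrix (up to rounding)?
all.equal(A.demo %*% solve(A.demo), diag(3))
```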

17 Eigenvalues and eigenvectors


The eigenvalues and -vectors of a (square!) matrix M can be obtained via
eigen(M,only.values=FALSE)

The function returns a list with the eigenvalues and the corresponding eigenvectors; the latter are normed to
unit length and combined within a matrix. The jth column of that matrix is the eigenvector corresponding
to the jth element of the vector containing the eigenvalues. If only the eigenvalues are of interest, one may
speed up the calculation by suppressing the computation of the eigenvectors with the setting only.values=T.
M <- matrix(c(1,2,2,1),2,byrow=T)
M

## [,1] [,2]
## [1,] 1 2
## [2,] 2 1
eigen(M)

## eigen() decomposition
## $values
## [1] 3 -1
##
## $vectors
## [,1] [,2]
## [1,] 0.7071068 -0.7071068
## [2,] 0.7071068 0.7071068
If one is, for example, just interested in the eigenvalues or in the second eigenvector, one can obtain these
with the following commands:
eigen(M)$values

## [1] 3 -1
eigen(M)$vectors[,2]

## [1] -0.7071068 0.7071068


The matrix M does not necessarily have to be symmetric, but if it is not, one may obtain complex eigenvalues
and -vectors.
M <- matrix(c(1,-3,1,1),2,byrow=T)
eigen(M)

## eigen() decomposition
## $values
## [1] 1+1.732051i 1-1.732051i

##
## $vectors
## [,1] [,2]
## [1,] 0.8660254+0.0i 0.8660254+0.0i
## [2,] 0.0000000-0.5i 0.0000000+0.5i
eigen(M,only.values=T)

## $values
## [1] 1+1.732051i 1-1.732051i
##
## $vectors
## NULL
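
The defining property of eigenvalues and eigenvectors, S v = λ v, can be verified for all pairs at once with a matrix product. The sketch below does this for the symmetric matrix from the first example (stored as S here, a name chosen for this illustration) and also checks that the eigenvectors have unit length.

```r
S <- matrix(c(1, 2, 2, 1), 2, byrow = TRUE)
e <- eigen(S)

# Column j of lhs is S %*% v_j, column j of rhs is lambda_j * v_j
lhs <- S %*% e$vectors
rhs <- e$vectors %*% diag(e$values)
all.equal(lhs, rhs)

# Each eigenvector (column) has Euclidean length 1
sqrt(colSums(e$vectors^2))
```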

18 Some useful programming tools


In this section, we will introduce some programming statements. With these tools, you can build your
programs more efficiently.

18.1 Defining new functions in R


As a programming environment, R of course also allows you to define new functions based on the existing ones.
The general syntax of a function definition is as follows:
function_name <- function(argument1, argument2, ...){R commands}

The function is then stored as an object with the name function_name. You should avoid choosing an already
existing name for the new function, otherwise the older object having the same name is overwritten. In
particular, you should not use names of built-in R functions! To check if, for example, an object with the name
“test” already exists, simply enter this name and see what happens.
test

## Error in eval(expr, envir, enclos): object 'test' not found


If a function definition consists of several commands, they must either be separated by semicolons or by
beginning new lines via the [Enter]-key on the keyboard. Assignments within a function definition are only
valid “locally” and will be annulled when the function is terminated. Functions may return values via the
eponymous command return(). If this is missing, the function will return the value of the last evaluated
expression.
Example: f (x) = x × x − 3 × x
func.1 <- function(x){y <- x*x-3*x; return(y)}

func.1(3)

## [1] 0
y

## Error in eval(expr, envir, enclos): object 'y' not found


func.1(1:5)

## [1] -2 -2 0 4 10
func.1

## function(x){y <- x*x-3*x; return(y)}


## <bytecode: 0x000001c94e93db70>

Example: f1(x) = e^x, f2(x) = x × x + 1
func.2 <- function(x){c(exp(x), x*x+1)}

func.2(1)

## [1] 2.718282 2.000000


Remark: Every function is terminated immediately after a return()-command is executed, even if there are
some subsequent expressions or commands in the function definition!
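
A minimal sketch of this remark (the function early() is made up for this example): the two lines after return() are never executed.

```r
early <- function(x){
  return(x + 1)
  x <- x * 100   # never reached
  x              # never reached
}
early(1)
```

The call returns 2, not 100.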
For longer and more complicated function definitions, it is advisable to use the built-in
editor of RStudio. It can be opened by clicking on the “new file”-button and then choosing “R Script” in
the appearing list. (Alternatively, one may click on “File” in the menu bar and then choose “New File” →
“R Script”.) If a function has several arguments, there are two ways of assigning the corresponding values
correctly: assignment by order or assignment by names.

Example: f(x, y) = x√y
func.3 <- function(x,y){x*sqrt(y)}

func.3(4,9)

## [1] 12
func.3(9,4)

## [1] 18
func.3(y=9,x=4)

## [1] 12
It is also possible to set default values for some arguments. Then these do not necessarily have to be assigned
when the function is called.
func.3(9)

## Error in func.3(9): argument "y" is missing, with no default


Example: f(x, y = 1) = √x √y
func.4 <- function(x,y=1){sqrt(x)*sqrt(y)}

func.4(9)

## [1] 3
func.4(9,4)

## [1] 6
Remark: Default values can only be set with =; using <- here results in a syntax error, so the function
is not defined at all:
func.4 <- function(x,y<-1){sqrt(x)*sqrt(y)}

## Error: <text>:1:23: unexpected assignment


## 1: func.4 <- function(x,y<-
## ^
The reason for this behavior is that = has different roles in R. Within the argument list
of a function (in a definition as well as in a call), it is a special syntactic character that binds a value to an
argument name, not an assignment operator, and therefore cannot be replaced by <- here! When used as a
stand-alone statement, it is interpreted as an assignment operator by R and is then equivalent to <-.
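
The difference also matters when a function is called. In the following sketch (root.prod() and y.demo are names invented for this example), = merely matches the argument y, whereas <- inside the call creates a new object in the workspace and passes its value on by position.

```r
root.prod <- function(x, y = 1){sqrt(x) * sqrt(y)}

# "=" inside the call only matches the argument y; no object is created
root.prod(9, y = 4)

# "<-" inside the call creates y.demo in the workspace and then
# passes its value on as the second argument
root.prod(9, y.demo <- 4)
y.demo
```

Both calls return 6, but only the second one leaves an object behind, which is rarely what one wants.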

If the number of the arguments and/or their type cannot be specified in advance for some reason, one can use
the special argument ... within the function definition. It serves as a wildcard for which arbitrary arguments
can be inserted when the function is called. This allows for an easy transfer of arguments to an interior
function without having to declare each of them in the exterior function. We demonstrate its application
within the function myplot() below, which draws the graph of the function x sin(x).
Example:
myplot <- function(x,...){plot(x,x*sin(x),...)}

x <- seq(0,5,0.01)
myplot(x)
[Figure: x * sin(x) plotted against x for x from 0 to 5 (default point plot)]

myplot(x,type="l",col="red")

[Figure: x * sin(x) plotted against x for x from 0 to 5, drawn as a red line]
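
To inspect what actually arrives through the ... argument, it can be collected with list(...). The helper show.dots() below is hypothetical and only serves to illustrate this; it does not draw anything.

```r
# Hypothetical helper: reports how many extra arguments were passed
# and which names they carry
show.dots <- function(x, ...){
  extra <- list(...)
  list(n.extra = length(extra), names = names(extra))
}

res <- show.dots(1:3, type = "l", col = "red")
res$n.extra
res$names
```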

18.2 If-statement
Logical quantities allow for conditional directives within programs which cause some subsequent commands
to be executed only if a certain condition is fulfilled. Such directives can be implemented via the command
if(condition){R commands}

If the logical expression condition has the value TRUE, the subsequent R commands will be executed,
otherwise (value FALSE) not. If the value of condition is NA, R will stop
with an error message. One should be aware that the expression condition must have a single logical
value! If the result is a logical vector with several components instead, R versions before 4.2.0 only evaluate its
first component and issue a warning message, whereas R 4.2.0 and later stop with an error. If some alternative
commands should be executed in case that condition has the value FALSE, the function if() should be accompanied by
else{R commands}

immediately afterwards.
Example:
x <- 2
if(x>3){
y <- 1}else{
y <- 0
}
y

## [1] 0
We can also use the if statement when defining a function. For example, let’s check the value of x and return
the value of y.
y_value_func <- function(x){
if(x>3){
y <- 1}else{
y <- 0
}

return(y)
}
y_value_func(x=2)

## [1] 0
y_value_func(x=5)

## [1] 1
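
More than two cases can be distinguished by chaining conditions with else if. A minimal sketch (the function sign.label() is made up for this example) that relies on the rule that a function returns the value of its last evaluated expression:

```r
sign.label <- function(x){
  if(x > 0){
    "positive"
  }else if(x < 0){
    "negative"
  }else{
    "zero"
  }
}
sign.label(-2)
sign.label(0)
```

Like the examples above, this version expects a single number and does not handle NA.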
We demonstrate another application with the function is.pos(), which returns 1 if a number is not negative
(x ≥ 0) and 0 otherwise. Since here if and else are each followed by only one command, we can omit the braces {}.
is.pos <- function(x){if(is.na(x)|is.nan(x)) return(x) else return(1-(x<0))}

The functions is.na() and is.nan() check if x has the value NA resp. NaN and return TRUE or FALSE
accordingly. Thus the actual indicator function following else is only evaluated if the expression x<0 does
not have the value NA.
x <- 4
is.na(x)

## [1] FALSE
is.nan(x)

## [1] FALSE
is.pos(x)

## [1] 1
The above version of is.pos() cannot handle properly arguments x that are vectors of length ≥ 2. This
deficiency can be remedied. (Exercise 6)
Exercise 6: Try to improve the function is.pos() such that it can also handle vector-valued arguments x.
(Hint: Recall 8.2, it is possible to use TRUE or FALSE in indexing. “Ifelse”-command is not necessary
here.)

18.3 For-loop
When you want to repeat a block of code several times, a for-loop is a good option. The syntax of a
for-loop is
for(var in seq){
R commands
}

Here var is a variable and seq is a vector. The R commands are executed once for each item in seq, with var
taking the value of that item, until the last item in seq has been reached. The flowchart of the for-loop is shown below.
Example: Count how many even numbers the vector x = c(1,3,2,6,4,7) contains.
x <- c(1,3,2,6,4,7)
counts <- 0
iter_times <- 0
for(var in x){
iter_times <- iter_times +1
if(var %% 2 == 0) counts <- counts+1
}
counts

## [1] 3
iter_times

## [1] 6

Figure 1: Flowchart of for loop
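
Two variations on this example, given as a sketch with names chosen for illustration: looping over the positions of a vector with seq_along() (useful when the index itself is needed), and a vectorised one-liner that gives the same count without a loop.

```r
v.loop <- c(1, 3, 2, 6, 4, 7)

# Loop over positions instead of values; seq_along(v.loop) is 1, 2, ..., 6
even.pos <- c()
for(i in seq_along(v.loop)){
  if(v.loop[i] %% 2 == 0) even.pos <- c(even.pos, i)
}
even.pos

# Vectorised alternative to the counting loop: TRUE counts as 1
sum(v.loop %% 2 == 0)
```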

18.4 The function apply()


Using apply() you can apply functions to the rows or columns of a matrix. The general syntax is
apply(X,MARGIN,FUN,...)

Here X is the matrix to which the function FUN shall be applied. FUN can be any R-function (also
a self-written one), which has to be specified by just giving its name (without the following parentheses).
Additional arguments or parameters for FUN can be submitted in place of the placeholder "...". MARGIN defines
whether the function FUN is applied to the rows (MARGIN=1) or to the columns (MARGIN=2) of the matrix X.
Examples:
M <- matrix(1:9,3,3)
M

## [,1] [,2] [,3]


## [1,] 1 4 7
## [2,] 2 5 8
## [3,] 3 6 9
apply(M,1,sum)

## [1] 12 15 18
apply(M,2,sum)

## [1] 6 15 24
apply(M,2,crossprod,y=c(2,4,6))

## [1] 28 64 100

A <- matrix(c(1,8,2,6,5,4,7,3,9,0,-1,11),3,4)
A

## [,1] [,2] [,3] [,4]


## [1,] 1 6 7 0
## [2,] 8 5 3 -1
## [3,] 2 4 9 11
apply(A,1,diff)

## [,1] [,2] [,3]


## [1,] 5 -3 2
## [2,] 1 -2 5
## [3,] -7 -4 2
t(apply(A,1,diff))

## [,1] [,2] [,3]


## [1,] 5 1 -7
## [2,] -3 -2 -4
## [3,] 2 5 2
apply(A,2,diff)

## [,1] [,2] [,3] [,4]


## [1,] 7 -1 -4 -1
## [2,] -6 -1 6 12
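
Note that apply() stores the result of each call as one column of its output, which is why apply(A,1,diff) above has to be transposed to line up with the rows of A. The sketch below adds two further points: FUN may be an anonymous function written directly in the call, and for plain sums the base functions rowSums() and colSums() give the same results. The names M.demo and shares are chosen for this example.

```r
M.demo <- matrix(1:9, 3, 3)

# An anonymous function can be written directly in the call;
# here every column is divided by its own sum
shares <- apply(M.demo, 2, function(col) col / sum(col))
colSums(shares)

# For plain sums, rowSums() gives the same result as apply(..., 1, sum)
all(apply(M.demo, 1, sum) == rowSums(M.demo))
```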
Exercise 7: Write a function that returns the difference between the maximum and the minimum of the
components of a numeric vector.
Addition: Take into account possible problems with components NA and/or NaN.
Exercise 8: Write a function that returns the components of a vector in reverse order.
Exercise 9: Write a function that returns the number of components of a numeric vector that are greater
than 0.
Exercise 10: (Numerical Integration)
a) Write a function that calculates for a given sequence $(x_k)_{1 \le k \le n}$ the Riemann sum of the sine-function:
$$\sum_{k=2}^{n} \sin(x_k)\,(x_k - x_{k-1})$$

b) Extend the function such that it can calculate the Riemann sum of an arbitrary function f for a given
sequence $(x_k)_{1 \le k \le n}$:
$$\sum_{k=2}^{n} f(x_k)\,(x_k - x_{k-1})$$

c) Improve the accuracy of the approximation by considering trapezoidal areas instead of the rectangles
underlying the Riemann sum.

Exercise 11: Write a function that deletes the ith row and the jth column of an m × n-matrix and returns
the resulting (m − 1) × (n − 1)-matrix.
Exercise 12: Write a function that checks if the matrix supplied as argument is a symmetric square matrix.
If this is the case, the function should return the logical value TRUE, and FALSE otherwise.
Exercise 13: Write a function that assigns the value 0 to all elements of the secondary diagonals of a square
matrix (check before if the argument really is a square matrix).
Exercise 14: Write a function that checks if three submitted vectors x, y, z ∈ ℝ³ are linearly independent
and returns TRUE in this case, and FALSE otherwise.
Exercise 15: Write a function that returns a vector containing the differences between the maxima of the
ith row and the minima of the ith column of a square matrix.

19 Libraries
The libraries/packages of R make it a powerful tool for statistical and data analysis. A standard
installation already includes the most common packages. You can view all installed packages
in the bottom right panel under the tab “Packages” or with the command library(). If the box in front of
a package name is ticked, the package is loaded (activated) and can be used. Many more packages are
available on the R website. To use one of them, you first have to install it on your computer and then
load it.
To install a package, you can click “install packages” in the “Packages” window, type the name of the
package (for example “tidyverse”, which is a very useful package for plotting nice graphs) and click “Install”,
or use the command
install.packages("tidyverse")

To load a package, you can tick the box in front of the package or use the command
library("tidyverse")

## -- Attaching packages --------------------------------------- tidyverse 1.3.2 --


## v ggplot2 3.3.6 v purrr 0.3.4
## v tibble 3.1.8 v dplyr 1.0.10
## v tidyr 1.2.1 v stringr 1.4.1
## v readr 2.1.3 v forcats 0.5.2
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
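
The conflict messages above mean that, once the tidyverse is loaded, the plain name filter refers to dplyr::filter() rather than to stats::filter(). With the package::function notation a specific version can always be reached. The sketch below uses only the stats package, which ships with R, so it runs without any additional installation; requireNamespace() is a base function that reports whether a package is installed without loading it.

```r
# TRUE if the package is installed, FALSE otherwise (nothing is attached)
has.stats <- requireNamespace("stats", quietly = TRUE)
has.stats

# Explicitly call the stats version of filter():
# a moving average over three neighbouring values
ma <- stats::filter(1:5, rep(1/3, 3))
ma
```

The first and last element of ma are NA because the moving average has no complete window there.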

20 Data Analysis
In this section, we will introduce data analysis with the package “tidyverse”. Actually, the tidyverse
is a collection of R packages designed for data science. It contains “dplyr”, “ggplot2”, etc. You can find
information about these packages on https://www.tidyverse.org
This whole chapter is one large example showing you methods for dealing with a dataset.
Since we have just installed and loaded “tidyverse”, let’s get started.

20.1 Data import


We will work with one data set called “diamonds”, which comes together with the package “ggplot2”. When
you load the “ggplot2” package, you can use it directly.
diamonds

## # A tibble: 53,940 x 10
## carat cut color clarity depth table price x y z
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
## 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
## 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
## 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
## 10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
## # ... with 53,930 more rows
Since it is a built-in data set, we can look up the explanations of the variables on the help page.
help(diamonds)

From the help page, we learn the following about this data set: It is a data frame with 53940 rows
and 10 variables:

Table 9: Description of data set “diamonds”

Variables   Descriptions                                          Value
price       price in US dollars                                   $326–$18,823
carat       weight of the diamond                                 0.2–5.01
cut         quality of the cut                                    Fair, Good, Very Good, Premium, Ideal
color       diamond colour                                        from D (best) to J (worst)
clarity     a measurement of how clear the diamond is             I1 (worst), SI2, SI1, VS2, VS1, VVS2, VVS1, IF (best)
x           length in mm                                          0–10.74
y           width in mm                                           0–58.9
z           depth in mm                                           0–31.8
depth       total depth percentage = z/mean(x, y) = 2z/(x + y)    43–79
table       width of top of diamond relative to widest point      43–95

But we normally use our own data sets, so I prepared a “.csv” file of this data set on ILIAS. Let’s learn how
to read a data set.
Diamond <- read_csv("Diamond.csv") # adjust the path to where you saved the file

## Rows: 53940 Columns: 10


## -- Column specification --------------------------------------------------------
## Delimiter: ","
## chr (3): cut, color, clarity
## dbl (7): carat, depth, table, price, x, y, z
##
## i Use `spec()` to retrieve the full column specification for this data.
## i Specify the column types or set `show_col_types = FALSE` to quiet this message.
After running the function read_csv(), R prints out a column specification that gives the name and type of each
column.
Use the command View() to view the data set in a separate window.
View(Diamond)
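
If reading a file fails, the path is the most common cause. The sketch below shows how to check it; so that it runs without the ILIAS file and without additional packages, it writes a miniature data set (the first two rows shown above, reduced to three columns) to a temporary file and reads it back with the base R function read.csv(), which is used here as a stand-in for read_csv(). All object names are chosen for this example.

```r
# Miniature data set, written to a temporary file
tmp.file <- tempfile(fileext = ".csv")
mini <- data.frame(carat = c(0.23, 0.21),
                   cut   = c("Ideal", "Premium"),
                   price = c(326, 326))
write.csv(mini, tmp.file, row.names = FALSE)

# getwd() shows the current working directory, relative to which
# file names without a full path are interpreted
getwd()

# Check that the file can be found before reading it
file.exists(tmp.file)
mini.read <- read.csv(tmp.file)
str(mini.read)
```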

Use the command summary() to see a quick statistical summary of the data set.
summary(Diamond)

## carat cut color clarity


## Min. :0.2000 Length:53940 Length:53940 Length:53940
## 1st Qu.:0.4000 Class :character Class :character Class :character
## Median :0.7000 Mode :character Mode :character Mode :character
## Mean :0.7979
## 3rd Qu.:1.0400
## Max. :5.0100
## depth table price x
## Min. :43.00 Min. :43.00 Min. : 326 Min. : 0.000
## 1st Qu.:61.00 1st Qu.:56.00 1st Qu.: 950 1st Qu.: 4.710
## Median :61.80 Median :57.00 Median : 2401 Median : 5.700
## Mean :61.75 Mean :57.46 Mean : 3933 Mean : 5.731
## 3rd Qu.:62.50 3rd Qu.:59.00 3rd Qu.: 5324 3rd Qu.: 6.540
## Max. :79.00 Max. :95.00 Max. :18823 Max. :10.740
## y z
## Min. : 0.000 Min. : 0.000
## 1st Qu.: 4.720 1st Qu.: 2.910
## Median : 5.710 Median : 3.530
## Mean : 5.735 Mean : 3.539
## 3rd Qu.: 6.540 3rd Qu.: 4.040
## Max. :58.900 Max. :31.800
summary(diamonds)

## carat cut color clarity depth


## Min. :0.2000 Fair : 1610 D: 6775 SI1 :13065 Min. :43.00
## 1st Qu.:0.4000 Good : 4906 E: 9797 VS2 :12258 1st Qu.:61.00
## Median :0.7000 Very Good:12082 F: 9542 SI2 : 9194 Median :61.80
## Mean :0.7979 Premium :13791 G:11292 VS1 : 8171 Mean :61.75
## 3rd Qu.:1.0400 Ideal :21551 H: 8304 VVS2 : 5066 3rd Qu.:62.50
## Max. :5.0100 I: 5422 VVS1 : 3655 Max. :79.00
## J: 2808 (Other): 2531
## table price x y
## Min. :43.00 Min. : 326 Min. : 0.000 Min. : 0.000
## 1st Qu.:56.00 1st Qu.: 950 1st Qu.: 4.710 1st Qu.: 4.720
## Median :57.00 Median : 2401 Median : 5.700 Median : 5.710
## Mean :57.46 Mean : 3933 Mean : 5.731 Mean : 5.735
## 3rd Qu.:59.00 3rd Qu.: 5324 3rd Qu.: 6.540 3rd Qu.: 6.540
## Max. :95.00 Max. :18823 Max. :10.740 Max. :58.900
##
## z
## Min. : 0.000
## 1st Qu.: 2.910
## Median : 3.530
## Mean : 3.539
## 3rd Qu.: 4.040
## Max. :31.800

##
Diamond
diamonds

There are some differences between “Diamond” (the dataset we imported) and “diamonds” (the dataset that
ships with “ggplot2”). The variables cut, color and clarity are ordered factors in “diamonds”, but in “Diamond” they
are characters. For character columns, summary() only reports the length and the class instead of the counts per
category. To convert the variables into ordered factors, we can use the following commands.
Diamond$cut <- factor(Diamond$cut, ordered = TRUE,
levels = c("Fair", "Good", "Very Good", "Premium", "Ideal"))
Diamond$color <- factor(Diamond$color, ordered = TRUE,
levels = c("D","E","F","G","H", "I","J"))
Diamond$clarity <- factor(Diamond$clarity, ordered = TRUE,
levels = c("I1", "SI2", "SI1", "VS2", "VS1", "VVS2", "VVS1", "IF"))

summary(Diamond)

## carat cut color clarity depth


## Min. :0.2000 Fair : 1610 D: 6775 SI1 :13065 Min. :43.00
## 1st Qu.:0.4000 Good : 4906 E: 9797 VS2 :12258 1st Qu.:61.00
## Median :0.7000 Very Good:12082 F: 9542 SI2 : 9194 Median :61.80
## Mean :0.7979 Premium :13791 G:11292 VS1 : 8171 Mean :61.75
## 3rd Qu.:1.0400 Ideal :21551 H: 8304 VVS2 : 5066 3rd Qu.:62.50
## Max. :5.0100 I: 5422 VVS1 : 3655 Max. :79.00
## J: 2808 (Other): 2531
## table price x y
## Min. :43.00 Min. : 326 Min. : 0.000 Min. : 0.000
## 1st Qu.:56.00 1st Qu.: 950 1st Qu.: 4.710 1st Qu.: 4.720
## Median :57.00 Median : 2401 Median : 5.700 Median : 5.710
## Mean :57.46 Mean : 3933 Mean : 5.731 Mean : 5.735
## 3rd Qu.:59.00 3rd Qu.: 5324 3rd Qu.: 6.540 3rd Qu.: 6.540
## Max. :95.00 Max. :18823 Max. :10.740 Max. :58.900
##
## z
## Min. : 0.000
## 1st Qu.: 2.910
## Median : 3.530
## Mean : 3.539
## 3rd Qu.: 4.040
## Max. :31.800
##
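
What an ordered factor adds compared to a character vector can be seen on a small made-up vector (cut.chr and cut.ord are names chosen for this sketch): the levels can be compared with each other, and levels that do not occur are still counted.

```r
cut.chr <- c("Good", "Ideal", "Fair", "Ideal")
cut.ord <- factor(cut.chr, ordered = TRUE,
                  levels = c("Fair", "Good", "Very Good", "Premium", "Ideal"))

# Ordered factors can be compared with a level name
cut.ord >= "Good"

# table() counts per level, including levels that do not occur
table(cut.ord)

# A value that is not among the levels becomes NA without a warning
factor("Excellent", levels = c("Fair", "Good"))
```

The last line is a reminder to spell the levels exactly as they appear in the data.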

20.2 Data management


(The dataset is large, so some of the results are not shown here in the pdf file. Please run the code on your own
and check the results.)

20.2.1 Add new columns to the dataset

You can add new variables to the data set using a $ sign.
Example 1: Add a column called “One” where all the values in the column are 1.
Diamond$One <- 1
Diamond

## # A tibble: 53,940 x 11

## carat cut color clarity depth table price x y z One
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 1
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 1
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 1
## 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63 1
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 1
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 1
## 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47 1
## 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 1
## 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 1
## 10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39 1
## # ... with 53,930 more rows
Example 2: Add a column called “pricecarat” whose values equal the ratio of price to carat.
Diamond$pricecarat <- Diamond$price/Diamond$carat

Diamond

Or, you can add new variables to the dataset using the mutate() function.
Example 3: Add a column called “Two” where all the values in the column are 2.
mutate(Diamond, Two = 2)

## # A tibble: 53,940 x 13
## carat cut color clarity depth table price x y z One price~1
## <dbl> <ord> <ord> <ord> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 1 1417.
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 1 1552.
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 1 1422.
## 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63 1 1152.
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 1 1081.
## 6 0.24 Very G~ J VVS2 62.8 57 336 3.94 3.96 2.48 1 1400
## 7 0.24 Very G~ I VVS1 62.3 57 336 3.95 3.98 2.47 1 1400
## 8 0.26 Very G~ H SI1 61.9 55 337 4.07 4.11 2.53 1 1296.
## 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 1 1532.
## 10 0.23 Very G~ H VS1 59.4 61 338 4 4.05 2.39 1 1470.
## # ... with 53,930 more rows, 1 more variable: Two <dbl>, and abbreviated
## # variable name 1: pricecarat
Diamond

This command does not change the “Diamond” dataset: mutate() returns a new data frame, so the column “Two”
does not appear in “Diamond” itself. To keep the new column, assign the result back to “Diamond”, or assign
it to a new dataset, for example “DiamondwithTwo”.
DiamondwithTwo <- mutate(Diamond, Two = 2)
DiamondwithTwo
Diamond

The mutate() command also makes it easy to create several columns at once: separate the column definitions
with a “,”.
Example 4: Add a column called “color_D” that is TRUE for diamonds with color “D” and FALSE otherwise, a
column called “price0.8” that contains the price after a 20% discount, and a column called “xyz” whose values
equal x*y*z. Also add a column called “m” containing the mean price and a column called “med” containing
the median price.

diamond_new <- mutate(Diamond,
color_D = color=="D",
price0.8 = price*0.8,
xyz = x*y*z,
m = mean(price),
med = median(price))

diamond_new

Because R calculates the mean and median over the entire price column, the values in the last two columns
(“m” and “med”) are the same in every row.
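
The behaviour described in this section can be sketched on a small made-up tibble. The name toy and its values are assumptions for illustration only and are not part of the course data. The sketch shows that mutate() returns a new data frame and that a summary such as mean() is repeated in every row.

```r
library(dplyr)

# Hypothetical toy data (not the course's Diamond data set)
toy <- tibble(price = c(100, 200, 300),
              carat = c(0.5, 1, 2))

# mutate() returns a new data frame; `toy` itself is left unchanged
toy_new <- mutate(toy,
                  pricecarat = price / carat,  # computed row by row
                  m = mean(price))             # a single value, repeated in every row

names(toy)      # still only the two original columns
toy_new
```

If the new columns should be kept under the old name, the result would have to be assigned back, for example toy <- mutate(toy, ...).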

20.2.2 Select a subset from a dataset You can use the filter() function to select a subset from a
dataframe, retaining all rows that meet the specified conditions.
Import the data again before we start this section.
Diamond <- read_csv("Diamond.csv") # type in your own directory

Example 5: Select the rows of “Diamond” for which clarity is “IF”, cut is “Ideal” and carat >= 1.
filter(Diamond, clarity == "IF" & cut == "Ideal" & carat >= 1)

## # A tibble: 106 x 10
## carat cut color clarity depth table price x y z
## <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1.16 Ideal J IF 60.6 56 5519 6.79 6.86 4.14
## 2 1.19 Ideal I IF 61.7 56 7207 6.78 6.81 4.19
## 3 1.05 Ideal H IF 61.2 55 7329 6.62 6.59 4.04
## 4 1.2 Ideal I IF 61.5 56 7367 6.79 6.84 4.19
## 5 1.21 Ideal H IF 61 57 7789 6.79 6.89 4.17
## 6 1.02 Ideal G IF 62.5 57 8162 6.37 6.44 4
## 7 1.17 Ideal H IF 61 53 8266 6.89 6.85 4.19
## 8 1.02 Ideal G IF 62.5 57 8311 6.44 6.37 4
## 9 1.23 Ideal H IF 61.3 57 8431 6.87 6.92 4.23
## 10 1.23 Ideal H IF 61.3 57 8585 6.92 6.87 4.23
## # ... with 96 more rows
Remark: In the filter() function, conditions separated by a “,” are combined with a logical AND, so the “,” can replace the “&” operator.
filter(Diamond, clarity == "IF", cut == "Ideal", carat >= 1)

## # A tibble: 106 x 10
## carat cut color clarity depth table price x y z
## <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1.16 Ideal J IF 60.6 56 5519 6.79 6.86 4.14
## 2 1.19 Ideal I IF 61.7 56 7207 6.78 6.81 4.19
## 3 1.05 Ideal H IF 61.2 55 7329 6.62 6.59 4.04
## 4 1.2 Ideal I IF 61.5 56 7367 6.79 6.84 4.19
## 5 1.21 Ideal H IF 61 57 7789 6.79 6.89 4.17
## 6 1.02 Ideal G IF 62.5 57 8162 6.37 6.44 4
## 7 1.17 Ideal H IF 61 53 8266 6.89 6.85 4.19
## 8 1.02 Ideal G IF 62.5 57 8311 6.44 6.37 4
## 9 1.23 Ideal H IF 61.3 57 8431 6.87 6.92 4.23
## 10 1.23 Ideal H IF 61.3 57 8585 6.92 6.87 4.23
## # ... with 96 more rows
Example 6: Select the rows of “Diamond” for which clarity is “IF” or “VVS1” and price <= 500.
filter(Diamond, clarity == "IF"|clarity == "VVS1", price <= 500)

## # A tibble: 103 x 10
## carat cut color clarity depth table price x y z
## <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
## 2 0.23 Ideal I VVS1 63 56 414 3.94 3.9 2.47
## 3 0.23 Premium I VVS1 60.5 61 414 3.98 3.95 2.4
## 4 0.23 Very Good G VVS1 61.3 59 414 3.94 3.96 2.42
## 5 0.23 Very Good G VVS1 60.2 59 415 4 4.01 2.41
## 6 0.25 Ideal I VVS1 61.9 55 421 4.07 4.1 2.53
## 7 0.23 Good D VVS1 60.3 66 425 3.91 3.95 2.37
## 8 0.24 Ideal I VVS1 62.3 57 432 3.98 3.95 2.47
## 9 0.24 Premium H VVS1 61.2 58 432 3.96 4.01 2.44
## 10 0.24 Premium H VVS1 60.8 59 432 4 4.02 2.44
## # ... with 93 more rows
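
As a hedged illustration of how “,”, “&” and “|” combine inside filter(), the following sketch uses a made-up tibble (toy and its values are assumptions, not course data). It also shows %in% as a shorter way to write an “or” over several values of the same column.

```r
library(dplyr)

# Hypothetical toy data for illustration only
toy <- tibble(clarity = c("IF", "VVS1", "SI1", "IF"),
              price   = c(450, 600, 300, 500))

# "," between conditions acts like "&"; "|" means "or"
a <- filter(toy, clarity == "IF" | clarity == "VVS1", price <= 500)

# Equivalent formulation with %in% and an explicit "&"
b <- filter(toy, clarity %in% c("IF", "VVS1") & price <= 500)

a
b
```

Both calls keep the same rows, namely the two “IF” diamonds whose price does not exceed 500.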
To select certain columns, you can choose the column directly in “[]” or use the function select().
Example 7: Retain columns “carat” and “price”.
Diamond[c("carat", "price")] # Method using []
select(Diamond, carat, price) # Method with select()

Example 8: Retain the fourth to eighth columns.


Diamond[4:8] # Method using []
select(Diamond, 4:8) # Method with select()

Example 9: Retain all the columns except the second and the fourth.
Diamond[c(-2,-4)] # Method using []
select(Diamond, c(-2,-4)) # Method with select()

The two methods give equivalent results, but select() has an advantage: you can exclude columns by name.
Example 9 (continued): Retain all the columns except “cut”, “color” and “clarity” and calculate the correlation matrix.
Diamond_cor <- select(Diamond, -cut, -color, -clarity)
cor(Diamond_cor)

## carat depth table price x y


## carat 1.00000000 0.02822431 0.1816175 0.9215913 0.97509423 0.95172220
## depth 0.02822431 1.00000000 -0.2957785 -0.0106474 -0.02528925 -0.02934067
## table 0.18161755 -0.29577852 1.0000000 0.1271339 0.19534428 0.18376015
## price 0.92159130 -0.01064740 0.1271339 1.0000000 0.88443516 0.86542090
## x 0.97509423 -0.02528925 0.1953443 0.8844352 1.00000000 0.97470148
## y 0.95172220 -0.02934067 0.1837601 0.8654209 0.97470148 1.00000000
## z 0.95338738 0.09492388 0.1509287 0.8612494 0.97077180 0.95200572
## z
## carat 0.95338738
## depth 0.09492388
## table 0.15092869
## price 0.86124944
## x 0.97077180
## y 0.95200572
## z 1.00000000
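
cor() only accepts numeric columns, which is why the three text columns were dropped first. A possible alternative, sketched here on made-up data (toy is a hypothetical name), is to let select() pick the numeric columns with the where() helper instead of listing the columns to exclude.

```r
library(dplyr)

# Hypothetical toy data for illustration only
toy <- tibble(cut   = c("Ideal", "Good", "Fair", "Ideal"),
              carat = c(0.3, 0.5, 1.0, 1.2),
              price = c(400, 900, 3000, 5000))

# Keep only the numeric columns, then compute the correlation matrix
toy_num <- select(toy, where(is.numeric))
cor(toy_num)
```

This variant assumes a dplyr version that provides where() (1.0.0 or later).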

You can also rearrange the sequence of the columns with select().
Example 10: Retain all the columns, but move “depth” and “price” to the first two positions.
select(Diamond, depth, price, everything())

## # A tibble: 53,940 x 10
## depth price carat cut color clarity table x y z
## <dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 61.5 326 0.23 Ideal E SI2 55 3.95 3.98 2.43
## 2 59.8 326 0.21 Premium E SI1 61 3.89 3.84 2.31
## 3 56.9 327 0.23 Good E VS1 65 4.05 4.07 2.31
## 4 62.4 334 0.29 Premium I VS2 58 4.2 4.23 2.63
## 5 63.3 335 0.31 Good J SI2 58 4.34 4.35 2.75
## 6 62.8 336 0.24 Very Good J VVS2 57 3.94 3.96 2.48
## 7 62.3 336 0.24 Very Good I VVS1 57 3.95 3.98 2.47
## 8 61.9 337 0.26 Very Good H SI1 55 4.07 4.11 2.53
## 9 65.1 337 0.22 Fair E VS2 61 3.87 3.78 2.49
## 10 59.4 338 0.23 Very Good H VS1 61 4 4.05 2.39
## # ... with 53,930 more rows
# You can also rearrange it with [], but it would be tedious.
Diamond[c("depth","price","carat","cut","color","clarity","table","x","y","z")]

## # A tibble: 53,940 x 10
## depth price carat cut color clarity table x y z
## <dbl> <dbl> <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 61.5 326 0.23 Ideal E SI2 55 3.95 3.98 2.43
## 2 59.8 326 0.21 Premium E SI1 61 3.89 3.84 2.31
## 3 56.9 327 0.23 Good E VS1 65 4.05 4.07 2.31
## 4 62.4 334 0.29 Premium I VS2 58 4.2 4.23 2.63
## 5 63.3 335 0.31 Good J SI2 58 4.34 4.35 2.75
## 6 62.8 336 0.24 Very Good J VVS2 57 3.94 3.96 2.48
## 7 62.3 336 0.24 Very Good I VVS1 57 3.95 3.98 2.47
## 8 61.9 337 0.26 Very Good H SI1 55 4.07 4.11 2.53
## 9 65.1 337 0.22 Fair E VS2 61 3.87 3.78 2.49
## 10 59.4 338 0.23 Very Good H VS1 61 4 4.05 2.39
## # ... with 53,930 more rows

20.2.3 Summarise the data set The summarise() command creates a new data frame that summarises the input.
Example 11: Using the data set “Diamond”, create a data frame containing the mean of the price, the
standard deviation of the price and the number of observations.
summarise(Diamond,
mean_price = mean(price),
sd_price = sd(price),
n = n())

## # A tibble: 1 x 3
## mean_price sd_price n
## <dbl> <dbl> <int>
## 1 3933. 3989. 53940

20.2.4 %>% operator The pipe operator %>% is a convenient way to organize tidyverse code. It passes the result
on its left as the first argument to the function on its right, so it can be used repeatedly to chain functions together.

Example 12: Calculate the average price, average carat and the average price to carat ratio for the diamonds
with at least 1 carat.
Diamond %>%
# First, add a column called "pricecarat" whose values equal the ratio of price to carat.
mutate(pricecarat=price/carat) %>%
# Second, we select the diamonds with at least 1 carat.
filter(carat >= 1) %>%
# Third, we calculate the average price, carat and price/carat.
summarise(m_price = mean(price),
m_carat = mean(carat),
m_pricecarat = mean(pricecarat))

## # A tibble: 1 x 3
## m_price m_carat m_pricecarat
## <dbl> <dbl> <dbl>
## 1 8142. 1.32 5989.
With the %>% operator, we can combine these commands into a single statement.
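
To make the role of the pipe more concrete, the following sketch compares a nested call with the piped version on made-up data (toy and its values are assumptions for illustration). Both forms should give the same result; the piped one reads in the order in which the steps are carried out.

```r
library(dplyr)

# Hypothetical toy data for illustration only
toy <- tibble(price = c(100, 2000, 9000),
              carat = c(0.3, 1.0, 1.5))

# Nested calls: read from the inside out
nested <- summarise(filter(toy, carat >= 1), m_price = mean(price))

# Piped calls: read from top to bottom
piped <- toy %>%
  filter(carat >= 1) %>%
  summarise(m_price = mean(price))

nested
piped
```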

20.2.5 Group the variables together The command group_by() converts a data set into a
grouped data set, and ungroup() removes the grouping.
Example 13: Calculate the average price and standard deviation of price for each group of cut.
Diamond %>%
group_by(cut) %>%
summarise(m_price = mean(price),
sd_price = sd(price)) %>%
ungroup()

## # A tibble: 5 x 3
## cut m_price sd_price
## <chr> <dbl> <dbl>
## 1 Fair 4359. 3560.
## 2 Good 3929. 3682.
## 3 Ideal 3458. 3808.
## 4 Premium 4584. 4349.
## 5 Very Good 3982. 3936.
Example 14: Calculate the average price and standard deviation of price for each group of cut and color.
Diamond %>%
group_by(cut, color) %>%
summarise(m_price = mean(price),
sd_price = sd(price)) %>%
ungroup()

## `summarise()` has grouped output by 'cut'. You can override using the `.groups`
## argument.
## # A tibble: 35 x 4
## cut color m_price sd_price
## <chr> <chr> <dbl> <dbl>
## 1 Fair D 4291. 3286.
## 2 Fair E 3682. 2977.
## 3 Fair F 3827. 3223.
## 4 Fair G 4239. 3610.

## 5 Fair H 5136. 3886.
## 6 Fair I 4685. 3730.
## 7 Fair J 4976. 4050.
## 8 Good D 3405. 3175.
## 9 Good E 3424. 3331.
## 10 Good F 3496. 3202.
## # ... with 25 more rows
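
The message printed in Example 14 refers to the `.groups` argument of summarise(). As a sketch on made-up data (toy is a hypothetical name), setting .groups = "drop" returns an ungrouped result directly, so the trailing ungroup() is not needed and the message is not shown.

```r
library(dplyr)

# Hypothetical toy data for illustration only
toy <- tibble(cut   = c("Fair", "Fair", "Good", "Good"),
              color = c("D", "E", "D", "D"),
              price = c(100, 200, 300, 500))

res <- toy %>%
  group_by(cut, color) %>%
  summarise(m_price = mean(price),
            .groups = "drop")   # drop all grouping in the result

res
```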

20.2.6 Arrange the dataset The arrange() function sorts the rows of a dataset by the values of a variable, in ascending
or descending order.
Example 15: Arrange “price” from lowest to highest.
Diamond %>%
arrange(price)

## # A tibble: 53,940 x 10
## carat cut color clarity depth table price x y z
## <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
## 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
## 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
## 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
## 10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
## # ... with 53,930 more rows
Example 16: Arrange “cut” in alphabetical order.
Diamond %>% arrange(cut)

## # A tibble: 53,940 x 10
## carat cut color clarity depth table price x y z
## <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
## 2 0.86 Fair E SI2 55.1 69 2757 6.45 6.33 3.52
## 3 0.96 Fair F SI2 66.3 62 2759 6.27 5.95 4.07
## 4 0.7 Fair F VS2 64.5 57 2762 5.57 5.53 3.58
## 5 0.7 Fair F VS2 65.3 55 2762 5.63 5.58 3.66
## 6 0.91 Fair H SI2 64.4 57 2763 6.11 6.09 3.93
## 7 0.91 Fair H SI2 65.7 60 2763 6.03 5.99 3.95
## 8 0.98 Fair H SI2 67.9 60 2777 6.05 5.97 4.08
## 9 0.84 Fair G SI1 55.1 67 2782 6.39 6.2 3.47
## 10 1.01 Fair E I1 64.5 58 2788 6.29 6.21 4.03
## # ... with 53,930 more rows
Example 17: Arrange “price” from highest to lowest.
Diamond %>%
arrange(desc(price))

## # A tibble: 53,940 x 10
## carat cut color clarity depth table price x y z

## <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2.29 Premium I VS2 60.8 60 18823 8.5 8.47 5.16
## 2 2 Very Good G SI1 63.5 56 18818 7.9 7.97 5.04
## 3 1.51 Ideal G IF 61.7 55 18806 7.37 7.41 4.56
## 4 2.07 Ideal G SI2 62.5 55 18804 8.2 8.13 5.11
## 5 2 Very Good H SI1 62.8 57 18803 7.95 8 5.01
## 6 2.29 Premium I SI1 61.8 59 18797 8.52 8.45 5.24
## 7 2.04 Premium H SI1 58.1 60 18795 8.37 8.28 4.84
## 8 2 Premium I VS1 60.8 59 18795 8.13 8.02 4.91
## 9 1.71 Premium F VS2 62.3 59 18791 7.57 7.53 4.7
## 10 2.15 Ideal G SI2 62.6 54 18791 8.29 8.35 5.21
## # ... with 53,930 more rows
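
arrange() also accepts several sorting keys; later keys are used to break ties in earlier ones. A minimal sketch on made-up data (toy is a hypothetical name) that sorts by cut alphabetically and, within each cut, by price from highest to lowest:

```r
library(dplyr)

# Hypothetical toy data for illustration only
toy <- tibble(cut   = c("Good", "Fair", "Good", "Fair"),
              price = c(300, 100, 500, 200))

# First key: cut (ascending); second key: price (descending)
sorted <- arrange(toy, cut, desc(price))
sorted
```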

20.2.7 Count the observations Count the number of observations using the command count().
Example 18: Count the number of values for each color.
Diamond %>% count(color)

## # A tibble: 7 x 2
## color n
## <chr> <int>
## 1 D 6775
## 2 E 9797
## 3 F 9542
## 4 G 11292
## 5 H 8304
## 6 I 5422
## 7 J 2808
Or with the command
Diamond %>%
group_by(color) %>%
count() %>%
ungroup()

## # A tibble: 7 x 2
## color n
## <chr> <int>
## 1 D 6775
## 2 E 9797
## 3 F 9542
## 4 G 11292
## 5 H 8304
## 6 I 5422
## 7 J 2808
Or with the command
Diamond %>%
group_by(color) %>%
summarise(n = n())

## # A tibble: 7 x 2
## color n
## <chr> <int>
## 1 D 6775

## 2 E 9797
## 3 F 9542
## 4 G 11292
## 5 H 8304
## 6 I 5422
## 7 J 2808

20.2.8 ifelse() If you want to test a condition and return one value where it is TRUE and another where it is
FALSE, you can use ifelse(). The syntax is
ifelse(test, yes, no)
Example 19: Create a new column called “color_DE” whose value is “DE” if the color is D or E and “FGHIJ”
otherwise.
Diamond %>%
mutate(color_DE = ifelse(color %in% c("D", "E"),
"DE",
"FGHIJ"))

## # A tibble: 53,940 x 11
## carat cut color clarity depth table price x y z color_DE
## <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 DE
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 DE
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 DE
## 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63 FGHIJ
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 FGHIJ
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 FGHIJ
## 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47 FGHIJ
## 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 FGHIJ
## 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 DE
## 10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39 FGHIJ
## # ... with 53,930 more rows
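
ifelse() works element by element on a whole vector, and calls can be nested when more than two outcomes are needed. The following base R sketch uses made-up carat values and made-up thresholds (0.5 and 2) purely for illustration.

```r
# Hypothetical carat values for illustration only
carat <- c(0.3, 0.8, 1.5, 2.4)

# Nested ifelse(): the second call handles the cases where the first test is FALSE
size <- ifelse(carat < 0.5, "small",
        ifelse(carat < 2, "medium", "large"))

size
```

For many categories, nested ifelse() calls become hard to read; that is a limitation of this sketch rather than a recommendation.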

20.2.9 Rename a column. You can use the command rename() to rename a column.
Example 20: Rename “carat” to “CARAT” and “price” to “PRICE”.
Diamond %>%
rename(CARAT = carat,
PRICE = price)

## # A tibble: 53,940 x 10
## CARAT cut color clarity depth table PRICE x y z
## <dbl> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
## 2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
## 3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
## 4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
## 5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
## 6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
## 7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
## 8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
## 9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
## 10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
## # ... with 53,930 more rows
Exercise 16:

1. Import the data set “Diamond.csv” and convert clarity into an ordered factor with levels “I1”, “SI2”,
“SI1”, “VS2”, “VS1”, “VVS2”, “VVS1”, “IF”.
2. Add a column called “size” whose value is “Large” if x + y + z > 12 and “Small” otherwise.
3. Rename “price” to “priceEuro” and convert the price into euros (1 dollar = 0.99 euro).
4. Retain the diamonds with the five best clarity grades, which are “VS2”, “VS1”, “VVS2”, “VVS1” and “IF”.
Assign this data set to a new dataframe called “Diamond_clear”.
5. Calculate the average price in Euro and count the number of diamonds for each “size” (Large or Small)
in the dataframe “Diamond_clear”.

20.3 Graphing
Visualizing the data is one of the most important steps in data analysis, because it helps you understand
the data better. In chapter 18 we learned some basic graphing commands using plot(), which is part of base R.
Now we are going to learn a more user-friendly and more powerful graphing tool, the package “ggplot2”, which
is included in the “tidyverse” package.
To create a plot, we use the ggplot() function. Generally, the syntax is ggplot(dataset, aes(x = ..., y = ...)),
to which layers are added with “+”, for example + geom_point(). Let’s see some examples.
(We use dataset “diamonds” from tidyverse in this section)
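
The layered syntax can be sketched step by step on a tiny made-up data frame (toy and its values are assumptions for illustration). Each “+” adds one component to the plot object; nothing is drawn until the object is printed.

```r
library(ggplot2)

# Hypothetical toy data for illustration only
toy <- data.frame(x = c(1, 2, 3, 4),
                  y = c(2, 4, 3, 6))

p <- ggplot(toy, aes(x = x, y = y))   # data and aesthetic mapping only
p <- p + geom_point()                 # add a layer of points
p <- p + ggtitle("Toy example")       # add a title

# Typing p at the console (or calling print(p)) draws the plot
```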

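20.3.1 Histogram and bar chart Histograms and bar charts provide an approximate representation of
the distribution of the data: a histogram bins a continuous variable, while a bar chart counts the observations in each category.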
Example 21: Show the distribution of the price.
ggplot(data = diamonds, aes(x = price)) +
geom_histogram(binwidth = 500) +
ggtitle("Price")

[Figure: histogram titled “Price”; x-axis: price, y-axis: count]

# Plot on a log scale


ggplot(data = diamonds, aes(x = price)) +
geom_histogram(bins=100) +
scale_x_log10() +
ggtitle("Price (log10)")

[Figure: histogram titled “Price (log10)”; x-axis: price (log10 scale), y-axis: count]

Example 22: Show the distribution of the price for each cut.
ggplot(data = diamonds, aes(x = price)) +
geom_histogram(bins=100) +
facet_wrap(~cut, scales = "free") +
ggtitle("Price distribution for each cut")

[Figure: “Price distribution for each cut”; one histogram panel per cut (Fair, Good, Very Good, Premium, Ideal); x-axis: price, y-axis: count]

Example 23: Display the number of diamonds for each clarity.


ggplot(data = diamonds, aes(x = clarity)) +
geom_bar()

[Figure: bar chart; x-axis: clarity (I1, SI2, SI1, VS2, VS1, VVS2, VVS1, IF), y-axis: count]

Example 23 (continued): Display the number of diamonds for each clarity, and show the number of diamonds by cut within
each bar of clarity.
ggplot(diamonds, aes(x = clarity, fill = cut)) +
geom_bar()

[Figure: stacked bar chart; x-axis: clarity, y-axis: count, fill: cut (Fair, Good, Very Good, Premium, Ideal)]

# To get the exact count for each category, use the command
count(diamonds, clarity, cut)

## # A tibble: 40 x 3
## clarity cut n
## <ord> <ord> <int>
## 1 I1 Fair 210
## 2 I1 Good 96
## 3 I1 Very Good 84
## 4 I1 Premium 205
## 5 I1 Ideal 146
## 6 SI2 Fair 466
## 7 SI2 Good 1081
## 8 SI2 Very Good 2100
## 9 SI2 Premium 2949
## 10 SI2 Ideal 2598
## # ... with 30 more rows
Example 24: Display the distribution of carat. Only plot the main body of carat weights (0 to 3 carat).
ggplot(data = diamonds, aes(x = carat)) +
geom_bar() +
scale_x_continuous(limits = c(0,3), breaks = c(0.2,0.3,0.4,0.5,1,2,3))

## Warning: Removed 32 rows containing non-finite values (stat_count).


## Warning: Removed 1 rows containing missing values (geom_bar).

[Figure: bar chart; x-axis: carat (0 to 3), y-axis: count]

Exercise 17: Create a graph showing the number of diamonds for each cut within each clarity, and display
the number of diamonds by color in each bar of cut. Your graph should look like the one below.

[Figure: stacked bar charts, one panel per clarity (I1, SI2, SI1, VS2, VS1, VVS2, VVS1, IF); x-axis: cut, y-axis: count, fill: color (D to J)]

Exercise 18: Display the price distribution of the best quality diamonds. Here we define best quality as: cut is Ideal, color is
D or E and clarity is IF or VVS1. Your graph should look like the one below (bins=30).

[Figure: histogram; x-axis: price, y-axis: count]

20.3.2 Scatter Plots A scatter plot uses points to represent the values of two different variables, one on the
x-axis and one on the y-axis.
Example 25: Build a scatterplot to analyse the relationship between carat and price (Here we use price on
y-axis and carat on x-axis).
ggplot(diamonds, aes(x = carat, y = price)) +
geom_point(alpha = 0.1, size = 0.1)

[Figure: scatter plot; x-axis: carat, y-axis: price]

From this scatterplot, you can see that the variance in price increases as carat size increases, and that the
relation seems to be curved rather than linear: price rises more than proportionally with carat.
To check this hypothesis, you can run a linear regression in R using the lm() function.
lm_price_carat <- lm(price~carat, diamonds)
summary(lm_price_carat)

##
## Call:
## lm(formula = price ~ carat, data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18585.3 -804.8 -18.9 537.4 12731.7
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2256.36 13.06 -172.8 <2e-16 ***
## carat 7756.43 14.07 551.4 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1549 on 53938 degrees of freedom
## Multiple R-squared: 0.8493, Adjusted R-squared: 0.8493
## F-statistic: 3.041e+05 on 1 and 53938 DF, p-value: < 2.2e-16
The R² of about 0.85 means that this linear model explains about 85% of the variance in price.
Now we apply a log transformation to both price and carat.
lm_price_carat_log <- lm(log(price)~log(carat), diamonds)
summary(lm_price_carat_log)

##
## Call:
## lm(formula = log(price) ~ log(carat), data = diamonds)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.50833 -0.16951 -0.00591 0.16637 1.33793
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 8.448661 0.001365 6190.9 <2e-16 ***
## log(carat) 1.675817 0.001934 866.6 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2627 on 53938 degrees of freedom
## Multiple R-squared: 0.933, Adjusted R-squared: 0.933
## F-statistic: 7.51e+05 on 1 and 53938 DF, p-value: < 2.2e-16
The R² of about 0.93 means that this model explains about 93% of the variance in log(price). Let’s check it on a graph. You can add the
regression line with the command geom_smooth().
ggplot(diamonds, aes(x = log(carat), y = log(price))) +
geom_point() +
geom_smooth(method = "lm", formula = y~x)

[Figure: scatter plot with fitted regression line; x-axis: log(carat), y-axis: log(price)]
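
A regression of log(price) on log(carat) with intercept a and slope b corresponds to the power law price = exp(a) * carat^b on the original scale; with the estimates above this would be roughly price = exp(8.449) * carat^1.676. The following base R sketch checks this reading on synthetic data that follow an exact power law. The constant 5000 and the exponent 1.7 are made-up values for illustration, not estimates from the diamonds data.

```r
# Synthetic data following an exact power law (assumed values, for illustration)
carat <- c(0.25, 0.5, 1, 2, 4)
price <- 5000 * carat^1.7

# Fit the log-log model
fit <- lm(log(price) ~ log(carat))

a <- unname(coef(fit)[1])   # intercept on the log scale
b <- unname(coef(fit)[2])   # slope, i.e. the exponent of the power law

exp(a)   # recovers the multiplicative constant (about 5000)
b        # recovers the exponent (about 1.7)
```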

Example 25 (continued): Plot clarity as color.


ggplot(diamonds, aes(x = carat, y = price, color = clarity)) +
geom_point()

[Figure: scatter plot colored by clarity; x-axis: carat, y-axis: price]

ggplot(diamonds, aes(x = log(carat), y = log(price), color = clarity)) +
geom_point()

[Figure: scatter plot colored by clarity; x-axis: log(carat), y-axis: log(price)]

Thus, we can see a clear relation between clarity and price when we control for carat size.

20.3.3 Box Plots A box plot displays the distribution of data based on the five-number summary (minimum,
25% quantile, median, 75% quantile and maximum).
Example 26: Display the carat distribution for each cut with box plots.
ggplot(diamonds, aes(x = cut, y = carat)) +
geom_boxplot()

[Figure: box plots; x-axis: cut, y-axis: carat]

To get the precise numbers, you can use the command
diamonds %>%
group_by(cut) %>%
summarise(min = min(carat),
q1 = quantile(carat, 0.25),
median = median(carat),
q3 = quantile(carat, 0.75),
max = max(carat))

## # A tibble: 5 x 6
## cut min q1 median q3 max
## <ord> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Fair 0.22 0.7 1 1.2 5.01
## 2 Good 0.23 0.5 0.82 1.01 3.01
## 3 Very Good 0.2 0.41 0.71 1.02 4
## 4 Premium 0.2 0.41 0.86 1.2 4.01
## 5 Ideal 0.2 0.35 0.54 1.01 3.5
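
One caveat when comparing these numbers with the plot: by default, geom_boxplot() draws each whisker only as far as the most extreme observation within 1.5 times the interquartile range of the box, and plots observations beyond that as individual points. The whisker ends therefore need not coincide with the minimum and maximum. The following base R sketch reproduces this rule on made-up values (x is a hypothetical vector, not course data).

```r
# Hypothetical toy values with one extreme observation
x <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 100)

q   <- quantile(x, c(0.25, 0.5, 0.75))   # lower quartile, median, upper quartile
iqr <- unname(q[3] - q[1])               # interquartile range

lower_fence <- unname(q[1]) - 1.5 * iqr
upper_fence <- unname(q[3]) + 1.5 * iqr

# Observations outside the fences are drawn as separate points
outliers <- x[x < lower_fence | x > upper_fence]

# The whiskers end at the most extreme observations inside the fences
whiskers <- range(x[x >= lower_fence & x <= upper_fence])

outliers
whiskers
```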
If you want to see the distribution for each color within each cut quality, you can use the command
ggplot(diamonds, aes(x = cut, y = carat, fill = color)) +
geom_boxplot()

[Figure: box plots grouped by color (D to J); x-axis: cut, y-axis: carat]

20.3.4 Line Graphs A line graph shows how one variable changes as another variable changes continuously; it is
normally used for time series.
Example 27: Display the average carat size for each clarity with a line.
diamonds %>%
group_by(clarity) %>%

summarise(avg.carat = mean(carat)) %>%
ggplot(aes(x = clarity, y = avg.carat, group = 1)) +
geom_line(linetype = 2, size=2)

[Figure: dashed line graph; x-axis: clarity, y-axis: avg.carat]

# Or
ggplot(diamonds, aes(x = clarity, y = carat, group = 1)) +
geom_line(stat= "summary", fun = "mean", linetype = 2, size = 2)

# Or
ggplot(diamonds, aes(x = clarity, y = carat, group = 1)) +
stat_summary(fun = "mean", geom = "line", linetype = 2, size = 2)

You can also add cut as color to display more information.


ggplot(diamonds, aes(x = clarity, y = carat, group = cut, color = cut)) +
stat_summary(fun = "mean", geom = "line", linetype = 1, size = 1)

[Figure: line graph with one line per cut (Fair, Good, Very Good, Premium, Ideal); x-axis: clarity, y-axis: carat]

Exercise 19: We haven’t used the variable “table” yet. Can you find any relationship between “table” and
any other variable? Show it on a graph. (Try a histogram, scatter plot, or box plot.)

21 Extra tips
• R cheat sheets make it easier to use the packages. You can find some useful cheat
sheets at the following link: https://www.rstudio.com/resources/cheatsheets
• Use Google! You will find that most programming questions you have were already asked by others online.
You can find most of the answers on this website: https://stackoverflow.com
