
Chapter I: Introduction to R

Tepmony Sim

Institute of Technology of Cambodia

November 2018

Tepmony Sim Statistics With R ITC 1 / 54


Outline

1 Introduction to R

2 Data and Programming

3 Saving and Loading Files

4 Programming Basics



Getting Started with R

R is both a programming language and a software environment for statistical
computing, and it is free and open source. To get started, you will need to
install two pieces of software:
R, the actual programming language.
- Choose your operating system and select the most recent version.
RStudio, an excellent IDE (integrated development environment) for
working with R.
- Note that you must have R installed to use RStudio. RStudio is simply an
interface used to interact with R.
To download R and RStudio, please go to the following link and
follow the instructions given there:
https://courses.edx.org/courses/UTAustinX/UT.7.01x/3T2014/56c5437b88fa43cf828bff5371c6a924/



Getting Started with R

RStudio has a large number of useful keyboard shortcuts. A list of these
can be found using a keyboard shortcut, the keyboard shortcut to rule
them all:
On Windows: Alt + Shift + K
On Mac: Option + Shift + K



Basic Calculations
Addition, Subtraction, Multiplication, Division:
Math   R   Example (Math)   Example (R)
+      +   3 + 4            3 + 4
−      -   3 − 4            3 - 4
×      *   3 × 4            3 * 4
÷      /   3 ÷ 4            3 / 4
Exponents:
Math           R            Result
$4^2$          4^2          16
$10^{-2}$      10^(-2)      0.01
$100^{1/2}$    100^(1/2)    10
$\sqrt{100}$   sqrt(100)    10
Mathematical Constants:
Math   R        Result
π      pi       3.1415927
e      exp(1)   2.7182818
Basic Calculations

Logarithms (in R, log() refers to the natural logarithm, ln):

Math              R                   Result
$\ln(e)$          log(exp(1))         1
$\log_{10}(10)$   log10(10)           1
$\log_2(32)$      log2(32)            5
$\log_4(16)$      log(16, base = 4)   2
Trigonometry:
Math         R            Result
sin(π/2)     sin(pi/2)    1
cos(0)       cos(0)       1
tan(π/4)     tan(pi/4)    1
tan(π/4)     tanpi(1/4)   1
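Because R computes in floating-point arithmetic, some of the results in these tables equal the printed value only up to rounding. The sketch below checks a few of them with all.equal(), which compares numbers up to a small tolerance. The variable names are illustrative and not part of the slides.

```r
# Results from the tables, stored under illustrative names
root_a <- 100^(1/2)          # square root via an exponent
root_b <- sqrt(100)          # square root via sqrt()
log_4  <- log(16, base = 4)  # logarithm with an arbitrary base
tan_a  <- tan(pi / 4)        # may differ from 1 in the last digits
tan_b  <- tanpi(1 / 4)       # tanpi(x) computes tan(pi * x)

# all.equal() compares up to a small numerical tolerance
isTRUE(all.equal(root_a, root_b))
isTRUE(all.equal(log_4, 2))
isTRUE(all.equal(tan_a, tan_b))

# sin(pi) is a tiny number close to 0, not exactly 0
abs(sin(pi)) < 1e-10
```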



Getting Help

In using R as a calculator, we have seen a number of functions: sqrt(),
exp(), log() and sin(). To get documentation about a function in R,
simply put a question mark in front of the function name and RStudio
will display the documentation, for example:

?sqrt
?paste
?sum
This can also be done by using the function help(functionname):

help(sqrt)
help(paste)
help(sum)

Many questions have already been asked and answered on the internet. So,
if you pose your problem properly, a search will often lead you to the
answer you need. See Stack Exchange.
Installing Packages

R comes with a number of built-in functions and datasets, but one of its
main strengths, as an open-source project, is its package system.
Packages add additional functions and data. Frequently, if you want to do
something in R that is not available by default, there is a good chance
that a package will fulfill your needs. To install a package,
use the install.packages() function. For example, to install a package
called "ggplot2", type in your R or RStudio console:

install.packages("ggplot2")

Once a package is installed, it must be loaded into your current R session
before being used. To load an installed package, use the library()
function, e.g.,

library("ggplot2")
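A common pattern, sketched below, is to check whether a package is available before loading it. requireNamespace() is a base R function. The helper name pkg_available is an assumption made for this illustration.

```r
# Hypothetical helper: TRUE if the package can be loaded, FALSE otherwise
pkg_available <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

if (pkg_available("ggplot2")) {
  library("ggplot2")   # attach the package for this session
} else {
  message("ggplot2 is not installed; run install.packages(\"ggplot2\") first")
}
```

The check avoids an error from library() when the package has not been installed yet.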



Data Types
R has a number of data types:
Numeric (or double)
The default type when dealing with numbers. E.g. 1, -1.0, 4.25, etc.

Integer
Example: 2L, -1L, 4L, etc.

Complex
Example: 2-3i, 1+0i, sqrt(3)+1i, etc.

Logical
Two values: TRUE (or T) and FALSE (or F). The missing-value marker NA is
also of logical type by default.

Character
Example: "A", "Sovann", "2", "Gender", etc.
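The type of any value can be inspected with typeof(). The short sketch below applies it to one example of each data type listed above.

```r
# typeof() reports how R stores a value internally
typeof(4.25)       # "double"
typeof(2L)         # "integer"
typeof(2 - 3i)     # "complex"
typeof(TRUE)       # "logical"
typeof(NA)         # "logical"
typeof("Sovann")   # "character"

# A number written without the L suffix is a double, even if it is whole
is.integer(1)      # FALSE
is.integer(1L)     # TRUE
```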
Data Structures

R also has a number of basic data structures. A data structure is either
homogeneous (all elements are of the same data type) or heterogeneous
(elements can be of more than one data type).

Dimension   Homogeneous   Heterogeneous
1           Vector        List
2           Matrix        Data frame
>2          Array



Vectors
There are several ways to generate vectors in R. Frequently, we use the
command c() (c = combine). E.g. to create the vector (1, 2, 3), type in R:
c(1, 2, 3)
(note that the elements are separated by commas) and the result will be
displayed as
[1] 1 2 3
In R, we can also assign a name to the object, e.g.,
x <- c(1, 2, 3)
x
Both <- and = can be used for assignment. Try also:
x = c(1, 2, 3)
x
Vectors
Note that a vector is a homogeneous data structure, i.e. all components are
of the same type. If, in R, you type:

c(3, "Statistics", TRUE)

what do you expect to see? If you type c(3, TRUE), what will the result
be?
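One way to check your answers is to inspect the type of each result. In the sketch below (the variable names are illustrative), c() coerces all components to the most flexible type present: a logical becomes a number, and everything becomes a character string as soon as one string is involved.

```r
mixed <- c(3, "Statistics", TRUE)   # a character vector
nums  <- c(3, TRUE)                 # TRUE is coerced to 1

mixed           # "3" "Statistics" "TRUE"
typeof(mixed)   # "character"
nums            # 3 1
typeof(nums)    # "double"
```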
The Operator ":" for Sequences
(y <- 1:100)
z <- 1:100
t <- c(1:100)

If you enter y, z and t, what will you see? Note that R has no scalars:
a single number is simply a vector of length one.

2
[1] 2
Vectors
Other forms of vectors in R:
seq(from = 1.5, to = 4.2, by = 0.1)
seq(1.5, 4.2, 0.1)
rep("A", times = 10)
rep(0, 10)

seq = sequence and rep = replicate or repeat


Summary
To generate a vector, we can use:
c()
:
seq()
rep()
Now, try:
c(x, rep(seq(1, 9, 2), 3), c(1, 2, 3), 32, 2:5)
Vectors
Length of Vectors: length(vectorial argument)
length(seq(1.5, 4.2, 0.1))
length(c(1, 2, 3))
length(c(rep(1,5),2:8))

Subsetting
x[argument] extracts some components (a subset) of the vector x.
x <- c(1, 3, 5, 7, 8)
x
x[3]
x[2:4]
x[-2]
x[c(1, 3, 4)]
z<-c(TRUE, TRUE, FALSE, TRUE, FALSE)
z
x[z]
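The subsetting rules can be summarized with the expected result of each call. The sketch below repeats the example with the results as comments. The name keep is illustrative and replaces z.

```r
x <- c(1, 3, 5, 7, 8)

x[3]            # 5        (third component)
x[2:4]          # 3 5 7    (components 2 to 4)
x[-2]           # 1 5 7 8  (everything except component 2)
x[c(1, 3, 4)]   # 1 5 7    (components 1, 3 and 4)

# A logical index keeps the positions marked TRUE
keep <- c(TRUE, TRUE, FALSE, TRUE, FALSE)
x[keep]         # 1 3 7
```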
Vectorization

Remark: R is good at vectorization.


Vectorization and Operations
In R, when we apply a function to a vector of length n, we generally
obtain a new vector of length n whose components are the values of the
function at the corresponding components of the original vector. E.g.
x <- 1:10
x + 1     # try also 1 + x
3 * x     # try also x * 3
sqrt(x)   # try also x^(1/2)
x^2       # try also x * x
2^x
log(x)
?as.vector



Logical Operators

In R, logical operators are vectorized.



Logical Operators

Examples
x <- c(1, 3, 5, 7, 8)
x
x > 3
x < 3
x == 3
x != 3
x == 3 & x != 3
x == 3 | x != 3

Subsetting and Counting


x[x > 3]
x[x != 3]
x[x == 3]
length(x[x == 3])



Logical Operators

Coercion
sum(x > 3)
as.numeric(x > 3)
which(x > 3)
x[which(x > 3)]
max(x)
which(x == max(x))   # position(s) at which x attains its largest value
which.max(x)
min(x)
which(x == min(x))
which.min(x)
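These calls work because R coerces TRUE to 1 and FALSE to 0 whenever a logical vector is used in arithmetic. A minimal sketch with the same x follows. The name above and the use of mean() are additions for illustration.

```r
x <- c(1, 3, 5, 7, 8)
above <- x > 3        # FALSE FALSE TRUE TRUE TRUE

as.numeric(above)     # 0 0 1 1 1
sum(above)            # 3: number of components larger than 3
mean(above)           # 0.6: proportion of components larger than 3
which(above)          # 3 4 5: positions of the TRUE values
which.max(x)          # 5: position of the largest value
which.min(x)          # 1: position of the smallest value
```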



More Vectorization

x <- c(1, 3, 5, 7, 8, 9)
y <- 1:100
x + 2
x + rep(2, 6)
x > 3
x > rep(3, 6)
x + y
length(x)
length(y)
length(y)/length(x)
(x + y) - y
y <- 1:60
x + y
length(y)/length(x)



More Vectorization

rep(x, 10) + y
all(x + y == rep(x, 10) + y)
identical(x + y, rep(x, 10) + y)   # are the two objects exactly the same?
any(x + y != rep(x, 10) + y)
?all.equal
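The comparisons above illustrate recycling: when two vectors of different lengths are combined, R repeats the shorter one until it matches the longer one. The sketch below makes this explicit. The names recycled and explicit are illustrative.

```r
x <- c(1, 3, 5, 7, 8, 9)   # length 6
y <- 1:60                  # length 60, a multiple of 6

recycled <- x + y              # x is reused 10 times
explicit <- rep(x, 10) + y     # the same computation written out

length(recycled)               # 60
all(recycled == explicit)      # TRUE
identical(recycled, explicit)  # TRUE
```

When the longer length is not a multiple of the shorter one (as with y <- 1:100), R still recycles but issues a warning.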



Matrices
A matrix is an arrangement of one or more vectors in a rectangular array,
comprising rows and columns, with all entries of the same data type. In a
matrix, the order of the rows and columns is important (this is not true
for a data frame, which we will see later).
matrix(data = NA, nrow = 1, ncol = 1, byrow = FALSE,
       dimnames = NULL)
x <- 1:9
x
X <- matrix(x, nrow = 3, ncol = 3)
X   # observe that R is case sensitive: X and x are different objects
Y <- matrix(x, nrow = 3, ncol = 3, byrow = TRUE,
            dimnames = list(c("A", "B", "C"), c("E", "F", "G")))
Y
is.matrix(X)
is.matrix(Y)
?as.matrix
Matrices
When all entries of a matrix are equal, filling it by row or by column
produces the same matrix.
Examples
Z <-matrix(0, 2, 4)
U <-matrix(1, 3, 3)
Z
U
As in the vector case, we can extract a sub-matrix from the original
matrix by using [ ].
Subsetting
X[1,2]
X[ ,2]
X[1, ]
X[-1,-2]
X[2,c(1, 3)]
Matrices
Creating matrices by cbind (combine column) and rbind (combine row)
Examples: cbind and rbind
x <- 1:9
y <- rev(x)   # rev = reverse
z <- rep(1, 9)
C <- cbind(x, y, z)
R <- rbind(x, y, z)
C
R
When using cbind and rbind we can specify “argument” names that will
be used as column and row names.
cbind(col1 = x, col2 = y, col3 = z)
colnames(cbind(col1 = x, col2 = y, col3 = z))
rbind(row1 = x, row2 = y, row3 = z)
rownames(rbind(row1 = x, row2 = y, row3 = z))
Matrices
Matrix Operations:
x <- 1:9
y <-9:1
X <- matrix(x, nrow = 3, ncol = 3)
Y <- matrix(y, nrow = 3, ncol = 3)
X
Y

Addition
X + Y

Subtraction
X - Y

Scalar Multiplication
(-3) * X
Matrices
Entrywise Multiplication
X * Y

Entrywise Division
X / Y

Entrywise Exponent and Exponential


X^Y
X^3
3^X

Matrix Multiplication
X %*% Y

Matrix Transpose
t(X)
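The difference between the entrywise product * and the matrix product %*% is easiest to see on a small case. The sketch below uses two 2 x 2 matrices, A and B, chosen for this illustration only.

```r
A <- matrix(1:4, nrow = 2)             # rows (1, 3) and (2, 4)
B <- matrix(c(0, 1, 1, 0), nrow = 2)   # rows (0, 1) and (1, 0)

A * B     # entrywise product: rows (0, 3) and (2, 0)
A %*% B   # matrix product:    rows (3, 1) and (4, 2)
t(A)      # transpose:         rows (1, 2) and (3, 4)
```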
Matrices

Inverse of Matrix: solve()


v <- c(9, 2, -3, 2, 4, -2, -3, -2, 16)
Z <- matrix(v, 3, byrow = TRUE)
Z
invZ <- solve(Z)
invZ
Check:
invZ %*% Z
diag(3)
all.equal(invZ %*% Z, diag(3)) # up to some computational error
identical(invZ %*% Z, diag(3)) # exactly equal
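The two checks can disagree because floating-point results carry small rounding errors: identical() demands exact equality, while all.equal() allows a small tolerance. The sketch below shows the same effect on a simpler example and then on the inverse computed above. The names a and product are illustrative.

```r
# A classic floating-point example
a <- 0.1 + 0.2
a == 0.3                    # FALSE: the two values differ by rounding error
isTRUE(all.equal(a, 0.3))   # TRUE: equal up to a small tolerance

# The same idea for the matrix inverse
v <- c(9, 2, -3, 2, 4, -2, -3, -2, 16)
Z <- matrix(v, 3, byrow = TRUE)
product <- solve(Z) %*% Z
isTRUE(all.equal(product, diag(3)))   # TRUE
max(abs(product - diag(3)))           # size of the rounding error, if any
```

Wrapping all.equal() in isTRUE() is the usual idiom, since all.equal() returns a description of the difference, not FALSE, when the objects differ.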



Matrices
Row and Column Summaries; Diagonal: diag()
v <- c(9, 2, -3, 2, 4, -2, -3, -2, 16)
Z <- matrix(v, 3, byrow = TRUE)
Z
rowSums(Z)
colSums(Z)
rowMeans(Z)
colMeans(Z)
diag(Z)

Remark: When the argument is a matrix, diag() returns the vector of
diagonal elements of the matrix. Yet, when the argument is a vector or a
single number, it returns a diagonal matrix (a single number n gives the
n x n identity matrix).
diag(3)
diag(2,3)
diag(c(2,3))
Matrices
Calculations with Vectors and Matrices
a_vec <- c(1, 2, 3); b_vec <- c(1, 2, 3)
c(is.vector(a_vec), is.vector(b_vec))
c(is.matrix(a_vec), is.matrix(b_vec))
a_vec %*% b_vec   # inner or scalar product
a_vec %o% b_vec   # outer product
as.matrix(a_vec)
as.matrix(a_vec) %*% b_vec
as.matrix(a_vec) %*% as.matrix(b_vec)   # error: non-conformable (3x1 times 3x1)
crossprod(a_vec, b_vec)    # inner product, t(a) %*% b
tcrossprod(a_vec, b_vec)   # outer product, a %*% t(b)
C_mat <- matrix(1:6, 2, 3); D_mat <- matrix(rep(2, 6), 2, 3)
crossprod(C_mat, D_mat)    # t(C) %*% D
t(C_mat) %*% D_mat
all.equal(crossprod(C_mat, D_mat), t(C_mat) %*% D_mat)
crossprod(C_mat, C_mat)    # t(C) %*% C. Then check all.equal(...)
Lists

A list is a one-dimensional heterogeneous data structure. So it is indexed
like a vector with a single integer value, but each element can hold an
object of any type.
Examples: list(...)
list(42, "Hello", TRUE)   # creation
ex_list <- list(
  a = 1:4,
  b = TRUE,
  c = "Hello!",
  d = function(arg = 42) {print("Hello World!")},
  e = diag(5)
)

Notice that all the elements of the list are of different types.



Lists

To obtain a subset of a list, we use the $ operator and [ ]. The $
operator returns the named element of a list. The [ ] syntax returns a
list, while [[ ]] returns an element of a list.
Examples: subsetting a list
ex_list[1]
ex_list[[1]]
ex_list$e
ex_list[1:2]
ex_list[c("a", "e")]
ex_list["e"]
ex_list[["e"]]
ex_list$d
ex_list$d(arg = 1)
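The distinction between [ ] and [[ ]] matters as soon as the result is used in a computation. The sketch below uses a shortened version of the list. The names sub_list and element are illustrative.

```r
ex_list <- list(a = 1:4, b = TRUE, c = "Hello!")

sub_list <- ex_list["a"]     # single brackets: a list of length 1
element  <- ex_list[["a"]]   # double brackets: the integer vector itself

is.list(sub_list)               # TRUE
is.list(element)                # FALSE
identical(element, ex_list$a)   # TRUE: $ and [[ ]] return the same element
sum(element)                    # 10; sum(sub_list) would be an error
```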



Data Frames
The most common way that we store and interact with data in R, and in
this course, is the data frame.
data.frame(...)
example_data = data.frame(x = rep(seq(1, 9, 2), 2),
                          y = c(rep("Hello", 9), "Goodbye"),
                          z = rep(c(TRUE, FALSE), 5))
example_data

Unlike a matrix, a data frame is not required to have the same data
type for every element.
It is a list of vectors.
Each of its vectors must contain a single data type, but different
vectors can store different data types.
Unlike a list, the elements of a data frame must all be vectors and
have the same length.
Data Frames

Example: Using data.frame(...) to Create a Data Frame


example_data$x
# all.equal() compares only two objects, so the lengths are compared pairwise
length(example_data$x) == length(example_data$y) &&
  length(example_data$y) == length(example_data$z)
str(example_data)    # returns the structure of the data frame
nrow(example_data)   # returns the number of rows of the data frame
ncol(example_data)   # returns the number of columns of the data frame
dim(example_data)    # returns the dimensions of the data frame



Data Frames

We can also import data from various file types into R, as well as use data
stored in packages.
Example: Importing Data from Other Sources
The example data above can also be found as a .csv file. To read this
data into R, we use the read_csv() function from the readr package. Note
that R has a built-in function read.csv() that operates very similarly. The
function read_csv() has a number of advantages over its counterpart
read.csv(). For large datasets, read_csv() reads much faster than
read.csv(). In addition, it uses the tibble package to read the data
as a tibble.

library(readr)
example_data_from_csv <- read_csv("./Dropbox/Applied Statistics/Data/example-data.csv")



Data Frames

A tibble is simply a data frame that prints sensibly. Notice
in the output above that we are given additional information such as
dimension and variable type.
The as_tibble() function (called as.tibble() in older versions of the
tibble package) can be used to coerce a regular data frame to a tibble.
library(tibble)
example_data <- as_tibble(example_data)
example_data

Alternatively, we could use the "Import Dataset" feature in RStudio, which
can be found in the Environment window (by default, the top-right pane
of RStudio). Once completed, this process will automatically generate the
code to import a file. The resulting code will be shown in the console
window. In recent versions of RStudio, read_csv() is used by default, thus
reading in a tibble.



Data Frames
Earlier we looked at installing packages, in particular the ggplot2 package.
(A package for visualization).

library("ggplot2")

Inside the ggplot2 package is a dataset called mpg. By loading the
package using the library() function, we can now access mpg.
When using data from inside a package, there are three things we would
generally like to do:
Look at the raw data.
Understand the data. (Where did it come from? What are the
variables? Etc.)
Visualize the data.
To look at the data, we have two useful commands: head() and str().

head(mpg, n = 10)
Data Frames

The function head() will display the first n observations of the data frame.
The head() function was more useful before tibbles. Notice that mpg is a
tibble already, so the output from head() indicates there are only 10
observations. Note that this applies to head(mpg, n = 10) and not mpg
itself. Also note that tibbles print a limited number of rows and columns
by default. The last line of the printed output indicates which rows and
columns were omitted.
mpg

The function str() will display the "structure" of the data frame. It will
display the number of observations and variables, list the variables, give
the type of each variable, and show some elements of each variable. This
information can also be found in the "Environment" window in RStudio.
str(mpg)



Data Frames
It is important to note that while matrices have rows and columns, data
frames (tibbles) instead have observations and variables. When displayed
in the console or viewer, each row is an observation and each column is a
variable. However, in general, their order does not matter; it is simply a
side effect of how the data was entered or stored.
In this dataset, an observation is a particular model-year of a car, and
the variables describe attributes of the car, for example its highway fuel
efficiency. To understand more about the dataset:
?mpg

To obtain a vector of the variable names, we use the names() function.

names(mpg)

To access one of the variables as a vector, we use the $ operator.

mpg$year
Data Frames
We can use the dim(), nrow() and ncol() functions to obtain information
about the dimension of the data frame.
dim(mpg)
nrow(mpg)   # returns the number of observations (the sample size)
ncol(mpg)   # returns the number of variables

Subsetting data frames can work much like subsetting matrices using
square brackets, [,]. Here, we find fuel efficient vehicles earning over 35
miles per gallon and only display manufacturer, model and year.

mpg[mpg$hwy > 35, c("manufacturer", "model", "year")]

An alternative would be to use the subset() function, which has a much
more readable syntax.

subset(mpg, subset = hwy > 35,
       select = c("manufacturer", "model", "year"))
Data Frames

Lastly, the same result can be obtained by using the filter() and select()
functions from the dplyr package, which also provides the %>% operator
from the magrittr package.

library("dplyr")
mpg %>% filter(hwy > 35) %>% select(manufacturer, model, year)

When subsetting a data frame, be aware of what is being returned, as
sometimes it may be a vector instead of a data frame. Also note that
there are differences between subsetting a data frame and a tibble. A data
frame operates more like a matrix, where it is possible to reduce the subset
to a vector. A tibble operates more like a list, where it always subsets to
another tibble.
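For a regular data frame, the reduction to a vector can be switched off with the drop argument of [ ]. The sketch below uses base R only, on a small made-up data frame df. All names are illustrative.

```r
df <- data.frame(x = 1:3, y = c("a", "b", "c"))

as_vector <- df[, "x"]                 # one column: simplified to a vector
as_frame  <- df[, "x", drop = FALSE]   # drop = FALSE keeps a data frame
two_cols  <- df[, c("x", "y")]         # several columns: still a data frame

is.data.frame(as_vector)   # FALSE
is.data.frame(as_frame)    # TRUE
is.data.frame(two_cols)    # TRUE
```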



Saving One Object to a File
It is possible to use the function saveRDS() to write a single R object to
a specified file (in rds file format). The object can be restored back using
the function readRDS(). We use the following syntax:
# Save an object to a file
saveRDS(object, file = "my_data.rds")
# Restore the object
readRDS(file = "my_data.rds")

Example:
women
# save a single object to file
saveRDS(women, "women.rds")
# restore it under a different name
women2 <- readRDS("women.rds")
women2
identical(women, women2)
Save Multiple Objects to a File
The function save() can be used to save one or more R objects to a
specified file (in .RData or .rda file formats). The objects can be read
back from the file using the function load(). We use the following syntax:
# Save one object in RData format
save(data1, file = "data.RData")
# Save multiple objects
save(data1, data2, file = "data.RData")
# To load the data again
load("data.RData")
Note: if you save your data with save(), it cannot be restored under a
different name. The original object names are automatically used.
Example:
data1 <- c(1, 2, 3)
data2 <- c(2, 3)
save(data1, data2, file = "data.RData")
load("data.RData")
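The difference between the two approaches can be seen in a short round trip. The sketch below writes to temporary files created with tempfile(), so that nothing is left in the working directory. The object and file names are illustrative.

```r
rds_file   <- tempfile(fileext = ".rds")
rdata_file <- tempfile(fileext = ".RData")
scores <- c(1, 2, 3)

# saveRDS()/readRDS(): the object can be restored under any name
saveRDS(scores, rds_file)
scores_copy <- readRDS(rds_file)

# save()/load(): the object comes back under its original name
save(scores, file = rdata_file)
rm(scores)
restored_names <- load(rdata_file)   # load() returns the restored names

restored_names                   # "scores"
identical(scores, scores_copy)   # TRUE
```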
Saving the Entire Workspace

It is a good idea to save your workspace image when your work sessions
are long. This can be done at any time using the function save.image().
We use the following syntax:

save.image()

That stores your workspace to a file named .RData by default. This will
ensure you do not lose all your work in the event of system reboot, for
instance.
When you close R/RStudio, it asks if you want to save your workspace. If
you say yes, the next time you start R that workspace will be loaded. That
saved file will be named .RData as well.
It is also possible to specify the file name for saving your workspace:

save.image(file = "my_work_space.RData")
load("my_work_space.RData")   # to restore your workspace



Control Flow
In R, the if/else syntax is:

if (...) {
some R code
} else {
more R code
}

For example:
x <- 1
y <- 3
if (x > y) {
  z = x * y
  print("x is larger than y")
} else {
  z = x + 5 * y
  print("x is less than or equal to y")
}
z
Control Flow

R also has a special function ifelse() which is very useful. It returns one of
two specified values based on a conditional statement.

ifelse(4 > 3, 1, 0)

The real power of ifelse() comes from its ability to be applied to vectors.

fib <- c(1, 1, 2, 3, 5, 8, 13, 21)


ifelse(fib > 6, "Foo", "Bar")

Now a for loop example,

x = 11:15
for (i in 1:5) {
x[i] = x[i] * 2
}
x



Control Flow

Note that this for loop is typical of many programming languages,
but not of R. In R, we would prefer not to use a loop; instead, we would
simply use a vectorized operation.

x <- 11:15
x <- x * 2
x
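The two approaches give the same result, which the sketch below checks directly. It uses seq_along() where the slides use 1:5, so that the loop adapts to the length of the vector. The names looped and vectorized are illustrative.

```r
x <- 11:15

# Loop version: update one component at a time
looped <- x
for (i in seq_along(looped)) {
  looped[i] <- looped[i] * 2
}

# Vectorized version: one operation on the whole vector
vectorized <- x * 2

vectorized                  # 22 24 26 28 30
all(looped == vectorized)   # TRUE
```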



Functions
So far we have been using functions, but have not actually discussed some
of their details.
function_name(arg1 = 10, arg2 = 20)

To use a function, we simply type its name, followed by an open
parenthesis, then specify values of its arguments, then finish with a closing
parenthesis.
An argument is a variable which is used in the body of the function.
Specifying the values of the arguments is essentially providing the inputs
to the function.
We can also write our own functions in R. For example, we often like to
"standardize" variables, that is, subtract the sample mean and divide
by the sample standard deviation:
$$\frac{x - \bar{x}}{s}.$$



Functions

In R, we would write a function to do this. When writing a function, there
are three things we must do:
Give the function a name. Preferably something that is short, but
descriptive.
Specify the arguments using function().
Write the body of the function within curly braces, {}.

standardize <- function(x) {
  m <- mean(x)
  std <- sd(x)
  result <- (x - m) / std
  result
}



Functions
Here, the name of the function is standardize, and the function has a
single argument x, which is used in the body of the function. Note that the
output of the final line of the body is what is returned by the function. In
this case, the function returns the vector stored in the variable result. To
test our function, we will take a random sample of size n = 10 from a
normal distribution with a mean of 2 and a standard deviation of 5.
(test_sample <- rnorm(n = 10, mean = 2, sd = 5))
standardize(x = test_sample)

Now consider a much more succinct way of writing the same function:

standardize <- function(x) { (x - mean(x)) / sd(x) }
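As a quick check of either version of standardize(), a standardized sample should have mean 0 and standard deviation 1, up to rounding. The sketch below fixes a seed (42, chosen arbitrarily) so that the sample is reproducible. The names sample_x and z_scores are illustrative.

```r
standardize <- function(x) { (x - mean(x)) / sd(x) }

set.seed(42)   # arbitrary seed, only for reproducibility
sample_x <- rnorm(n = 10, mean = 2, sd = 5)
z_scores <- standardize(sample_x)

isTRUE(all.equal(mean(z_scores), 0))   # mean is 0 up to rounding
isTRUE(all.equal(sd(z_scores), 1))     # standard deviation is 1 up to rounding
```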

When specifying arguments, we can provide default arguments.

power_of_num <- function(num, power = 2) {
  num ^ power
}
Functions

Now, look at a number of ways that we could run this function to perform
the operation $10^2$, resulting in 100:

power_of_num(10)
power_of_num(10, 2)
power_of_num(num = 10, power = 2)
power_of_num(power = 2, num = 10)

Now try this:

power_of_num(2, 10)

What about this?

power_of_num(power = 5)
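The last two calls show how R matches arguments: unnamed values are matched by position, and an argument without a default must be supplied. The sketch below captures the error with tryCatch() so that the script keeps running. The name outcome and the error label are illustrative.

```r
power_of_num <- function(num, power = 2) {
  num ^ power
}

power_of_num(10)                    # 100: power defaults to 2
power_of_num(2, 10)                 # 1024: by position, num = 2 and power = 10
power_of_num(power = 2, num = 10)   # 100: named arguments can be in any order

# num has no default, so omitting it is an error once the body needs it
outcome <- tryCatch(
  power_of_num(power = 5),
  error = function(e) "error: num is missing"
)
outcome
```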



Functions
To further illustrate a function with a default argument, we will write a
function that calculates sample variance two ways. By default, it will
calculate the unbiased estimate of $\sigma^2$, which we will call $s^2$:
$$s^2 = \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \bar{x})^2.$$

It will also have the ability to return the biased estimate (based on
maximum likelihood), which we will call $\hat{\sigma}^2$:
$$\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} (x_i - \bar{x})^2.$$

get_var <- function(x, biased = FALSE) {
  n = length(x) - 1 * !biased
  (1 / n) * sum((x - mean(x)) ^ 2)
}
Functions

get_var(test_sample)
get_var(test_sample, biased = FALSE)
var(test_sample)

We see the function is working as expected, and when returning the
unbiased estimate it matches R's built-in function var(). Finally, let's
examine the biased estimate of $\sigma^2$:
get_var(test_sample, biased = TRUE)
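The two estimates differ only in the divisor, so the biased estimate equals the unbiased one multiplied by (n - 1)/n. The sketch below checks this relation and the agreement with var() on a reproducible sample. The seed and the names sample_x, unbiased_est and biased_est are illustrative.

```r
get_var <- function(x, biased = FALSE) {
  n <- length(x) - 1 * !biased   # n - 1 by default, n when biased = TRUE
  (1 / n) * sum((x - mean(x)) ^ 2)
}

set.seed(1)   # arbitrary seed, only for reproducibility
sample_x <- rnorm(n = 10, mean = 2, sd = 5)

unbiased_est <- get_var(sample_x)
biased_est   <- get_var(sample_x, biased = TRUE)

isTRUE(all.equal(unbiased_est, var(sample_x)))          # matches var()
isTRUE(all.equal(biased_est, unbiased_est * 9 / 10))    # factor (n - 1)/n
```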

Reference: David Dalpiaz (2017), Applied Statistics with R.
