R framework
An open-source language is one whose source code anyone can share or modify.
BIA B452F – R programming for BIA B452F
R objects can have attributes, which are like metadata for the object. These
metadata can be very useful in that they help to describe the object. For
example, column names on a data frame help to tell us what data are
contained in each of the columns.
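As a small illustration of attributes, the sketch below builds a toy data frame and inspects its metadata. The object name scores and its columns are made up for this example.

```r
# Toy data frame; the object and column names are hypothetical
scores <- data.frame(student = c("A", "B", "C"),
                     exam    = c(72, 58, 90))

attributes(scores)   # lists the names, class and row.names attributes
names(scores)        # the column names only
class(scores)        # the class attribute
```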
[Figure: RStudio window showing the Code editor, Console, Environment/History pane and Plot/Management pane]
Code editor – The code window is where you write your own R code as scripts.
Console – The Console displays messages about the R session and any programs submitted. R uses
the following colour-coded system for the different types of message in the log:
o Blue – R code executed from the code window. Results from the R program are also
displayed in blue
o Red – errors that cause R to abort running the program. Warnings for the R program are
also printed in red
Environment/History pane – The “Environment” pane shows which R objects are stored in
your current session. An R object can be model output, a function, a value and much more.
The “History” pane shows which R code has been executed from the Console.
Plot/Management pane – The “Plot” pane displays all plots generated by R code, such as
scatterplots and histograms.
1. Create a new R script by clicking “File” -> “New File” -> “R Script”
(Note: Comments in R begin with “#”. You may skip the comments.)
3. Use the mouse to highlight all the code and then click “Run” or press <CTRL> + <Enter>.
Check the output of the program in the Console and Plot pane.
4. Create a working folder “B452F” on your local C drive and then change to this directory by typing:
setwd("C:/B452F")
7. Get the current working directory and check the files in the
working directory using getwd() and dir().
getwd()
dir()
9. Click “First_R.r” to restart RStudio and reload the saved R script.
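The working-directory steps above can be tried safely with the following sketch. It assumes a temporary folder (from tempdir()) as a stand-in for "C:/B452F", so nothing on the C drive is needed; old_dir is a made-up name.

```r
# Switch to a temporary folder (a stand-in for "C:/B452F"), then switch back
old_dir <- setwd(tempdir())   # setwd() returns the previous directory invisibly
getwd()                       # now the temporary folder
dir()                         # files in the current working directory
setwd(old_dir)                # restore the original working directory
```

Saving the value returned by setwd() makes it easy to return to where you started.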
(Note: R is CASE SENSITIVE. The object “Grade” is stored as a value, while “grade” is
stored as a data frame, so they are two different things. NA is a reserved keyword for
missing values.)
a) Copy and paste the program to the R script editor and then run it.
b) Check the newly created dataset (as a data frame) in the ‘Environment’ pane.
Alternatively, you can click the icon in the “Environment” pane to clear the current R
environment.
Manipulate dataset
1. Computing new variable using arithmetic
A new variable can easily be added to an existing dataset (a data frame). Suppose that the
‘course_score’ for the ‘grade’ dataset in Task 1-4 is derived by the following formula:
course_score = 0.4 * (TMA + Quiz) / 2 + 0.6 * Exam
First, check whether the ‘grade’ dataset is still in the current R working environment. If not, run ‘Task1-4.R’
to create the dataset again. Then, run the following R program to create a new variable
“course_score” and add it to the ‘grade’ dataset:
grade$course_score <- 0.4 * (grade$TMA + grade$Quiz)/2 + 0.6 * grade$Exam
grade
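The same calculation can be checked by hand on a tiny data frame. The sketch below uses a made-up data frame toy with the same three score columns; it is not the ‘grade’ dataset itself. The third row shows that a missing Exam value makes the computed score missing as well.

```r
# Hypothetical three-student data frame with the same score columns
toy <- data.frame(TMA  = c(80, 60, 75),
                  Quiz = c(70, 50, 65),
                  Exam = c(90, 40, NA))

# Same formula as above: 40% average of TMA and Quiz, 60% Exam
toy$course_score <- 0.4 * (toy$TMA + toy$Quiz) / 2 + 0.6 * toy$Exam
toy   # row 1: 0.4 * 75 + 0.6 * 90 = 84; row 3 is NA because Exam is NA
```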
Type the following code to change the NA in “gender” to "F" (female) and the NA in
“Exam” to 60.
grade$gender[is.na(grade$gender)] <- "F"
grade$Exam[is.na(grade$Exam)] <- 60
grade
Type the following code to change all missing values (here, in TMA and Exam) to zero. After that,
recalculate the course score and result based on the updated dataset. How many students failed?
grade[is.na(grade)] <- 0
grade
(Note: The function call is.na(grade) flags every element of the dataset “grade” that
is NA. The assignment then replaces each flagged NA with zero.)
(b) You may also use the complete.cases() or na.omit() function to apply listwise deletion (the
complete-case method), which deletes observations with missing data (i.e., an entire record is
excluded from analysis if any single value is missing).
grade <- read.csv("grade.csv") # reload the grade.csv
grade1 <- grade[complete.cases(grade),]
Or
grade2 <- na.omit(grade)
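The three approaches to missing values can be compared side by side on a small example. The data frame toy below is made up for illustration and does not come from grade.csv.

```r
# Hypothetical data with one missing TMA and one missing Exam
toy <- data.frame(name = c("Ann", "Bob", "Cat"),
                  TMA  = c(80, NA, 70),
                  Exam = c(55, 62, NA))

is.na(toy)                        # logical matrix: TRUE where a value is missing

zero_filled <- toy
zero_filled[is.na(zero_filled)] <- 0       # replace every NA with zero

complete_rows <- toy[complete.cases(toy), ]  # keep rows with no NA
omitted       <- na.omit(toy)                # same rows, via na.omit()

nrow(complete_rows)   # only Ann has complete data
nrow(omitted)
```

Zero-filling keeps all students but treats a missing mark as a score of 0, whereas listwise deletion keeps the marks untouched but drops students; which one is appropriate depends on the analysis.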
3. Update the course score by running Step 1 again and then check the revised scores.
Suppose a course requires students to get at least 40 marks in both exam and course_score in order
to get a pass grade. The final result can be obtained as follows:
grade$result <- ifelse(grade$Exam >= 40 & grade$course_score >= 40,
"pass", "fail")
grade
Run the following R code to compute the mean and standard deviation of the numerical variables
(TMA, Quiz, Exam and course_score).
apply(grade[,c(2:4, 6)], 2, mean)
apply(grade[,c(2:4, 6)], 2, sd)
6. Subsetting data
In general, the elements of a dataset can be obtained using the notation dataframe[row indices,
column indices].
The following code extracts the 6th and 7th variables in grade (i.e., course score and result):
result <- grade[,c(6:7)]
result
The following code extracts the “pass” records from the dataset
pass <- grade[which(grade$result=="pass"),]
pass
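One reason to wrap the condition in which() is that it ignores comparisons that evaluate to NA. The sketch below illustrates this on a made-up data frame toy in which one result is missing.

```r
# Hypothetical data: the third student's result is missing
toy <- data.frame(id     = 1:4,
                  result = c("pass", "fail", NA, "pass"))

by_logical <- toy[toy$result == "pass", ]         # NA comparison yields an all-NA row
by_which   <- toy[which(toy$result == "pass"), ]  # which() keeps only the TRUE positions

nrow(by_logical)   # 3 rows, one of them filled with NA
nrow(by_which)     # 2 rows
```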
7. Dropping variable
You can drop variables from a dataset by assigning NULL to them. The following R code removes the
result variable from the dataset:
pass$result <- NULL
pass
We will use a dataset that contains the air pollution and temperature data for the city of Chicago. This
dataset is named “Chicago.csv”. Read this dataset into R using the read.csv() function. After loading
the data, type str(chicago) to explore the structure of this dataset.
1. Use select() to extract columns
First, load the R package dplyr if you haven’t done so. Then, run the following R program
names(chicago)[1:3]
subset1 <- select(chicago, city:dptp)
head(subset1)
To drop a column, add ‘-’ before the column name, e.g. select(chicago, -date) will drop the date
column. The following code gives the same subset as subset1.
subset2 <- select(chicago, -c(date:no2tmean2))
What will be the result if the c() function is not used, i.e., select(chicago, -date:no2tmean2)?
In subset1 you extracted only the first 3 columns: using the dplyr package, you selected the
variables from “city” to “dptp” in the Chicago dataset and stored the result as subset1. The code
head(subset1) shows you the first few rows of the data.
> head(subset1);
  city tmpd   dptp
1 chic 31.5 31.500
2 chic 33.0 29.875
3 chic 33.0 27.375
4 chic 29.0 28.625
5 chic 32.0 28.875
6 chic 40.0 35.125
The select() function also allows you to extract variables based on specific patterns in the
variable names. For example, if you want to keep all variables that start with the letter “d”, you can
run the following R program. If you print the structure of subset3 and chicago,
you can see the difference between them.
subset3 <- select(chicago, starts_with("d"))
str(subset3)
str(chicago)
All rows with PM2.5 levels > 30 were extracted and stored in the object chic.f. Now,
summary(chic.f$pm25tmean2) gives you the descriptive statistics of this variable. The minimum
value is 30.05, which is greater than 30.
You can apply multiple conditions inside of filter(). For instance, if you want to extract the rows
where PM2.5 > 30 and temperature > 80 degrees Fahrenheit, you can get this done in one line.
chic.f <- filter(chicago, pm25tmean2 > 30 & tmpd > 80)
select(chic.f, date, tmpd, pm25tmean2)
The result is stored in the R object chic.f. If you want to confirm your result, use select() to
extract the date, temperature and PM2.5 level. Does the result fulfil the conditions listed above?
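To see how multiple conditions behave, the sketch below applies the same filter to a four-row stand-in for the Chicago data. The values are made up; only the column names tmpd and pm25tmean2 come from the dataset. It assumes the dplyr package is installed.

```r
library(dplyr)

# Hypothetical stand-in for the Chicago data
toy <- data.frame(date       = c("2005-07-01", "2005-07-02", "2005-12-01", "2005-07-04"),
                  tmpd       = c(85, 78, 30, 90),
                  pm25tmean2 = c(35, 40, 45, NA))

# Both conditions must hold; rows where a condition is NA are dropped
hot_polluted  <- filter(toy, pm25tmean2 > 30 & tmpd > 80)

# Separating conditions with a comma is equivalent to combining them with &
hot_polluted2 <- filter(toy, pm25tmean2 > 30, tmpd > 80)

hot_polluted
```

Only the first row is kept: the second fails the temperature condition, the third is too cold, and the fourth has a missing PM2.5 value.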
3. Use arrange() to reorder rows of data
Reordering the rows of a data frame in R can be difficult. Fortunately, the package dplyr has a nice
function called arrange(), which allows you to rearrange rows by a certain variable (column).
Now, suppose you want to rearrange the Chicago data by date. You can run the following R
program.
chicago_new <- arrange(chicago, date)
By default, arrange() sorts data in ascending order. Now, you may want to check that the
data are indeed sorted by ascending date. Use head() and tail() to check your result, as below:
head(chicago_new)
tail(chicago_new)
If you want to sort the data in descending order, wrap the sorting variable in the function desc().
chicago_new <- arrange(chicago, desc(date))
Again, use head() and tail() to check your result. Are the data sorted by descending date?
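The effect of arrange() and desc() is easy to verify on a few rows. The sketch below uses a made-up three-row data frame whose date column is stored as a Date; it assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical data with dates deliberately out of order
toy <- data.frame(date = as.Date(c("2005-03-01", "2005-01-15", "2005-02-10")),
                  tmpd = c(40, 20, 30))

asc <- arrange(toy, date)         # earliest date first
dsc <- arrange(toy, desc(date))   # latest date first

asc
dsc
```

If a date column is stored as text such as "1987/1/10", arrange() sorts it alphabetically, so "1987/1/10" comes before "1987/1/2". Converting the column with as.Date() first gives chronological order.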
4. Use rename() to update variable names
In your future work, you may want to rename variables in an R dataset so that they are easier to understand.
The rename() function makes this much easier. Now, type head(chicago) to read the
variable names of the Chicago dataset.
> head(chicago)
  city tmpd   dptp     date pm25tmean2 pm10tmean2 o3tmean2 no2tmean2
1 chic 31.5 31.500 1987/1/1         NA   34.00000 4.250000  19.98810
2 chic 33.0 29.875 1987/1/2         NA         NA 3.304348  23.19099
3 chic 33.0 27.375 1987/1/3         NA   34.16667 3.333333  23.81548
…
The variable dptp is the dew point temperature, and the variable pm25tmean2 refers to the
PM2.5 values. For better interpretation, you should rename these two variables. The following R
program is all you need.
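A minimal sketch of such a rename() call is shown below on a two-row stand-in for the Chicago data. The toy values and the object names toy and toy_new are made up; only the column names dptp and pm25tmean2 and the new names dewpoint and pm25 come from the text. It assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical two-row stand-in for the Chicago data
toy <- data.frame(city       = "chic",
                  dptp       = c(31.5, 29.875),
                  pm25tmean2 = c(NA, 12.3))

# new_name = old_name, one pair per variable to rename
toy_new <- rename(toy, dewpoint = dptp, pm25 = pm25tmean2)
names(toy_new)
```

On the full dataset, the equivalent call would presumably be chicago_new <- rename(chicago, dewpoint = dptp, pm25 = pm25tmean2).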
Note that the updated variable names are put before the “=” sign: here, “dewpoint” and “pm25”
are the updated variable names. Also, rename() can take multiple arguments to change several
variable names at once.
After making the changes, type head(chicago_new) again to see how the variable names have changed.
> head(chicago_new)
  city tmpd dewpoint     date pm25 pm10tmean2 o3tmean2 no2tmean2
1 chic 31.5   31.500 1987/1/1   NA   34.00000 4.250000  19.98810
2 chic 33.0   29.875 1987/1/2   NA         NA 3.304348  23.19099
3 chic 33.0   27.375 1987/1/3   NA   34.16667 3.333333  23.81548
…
Here is an exercise for you. The variable pm10tmean2 refers to PM10, the variable o3tmean2 refers
to ozone, and the variable no2tmean2 refers to NO2 (nitrogen dioxide). Use the rename() function
to change their names to “pm10”, “ozone” and “NO2”, respectively. (Hint: you just need to place
three more arguments after pm25 = pm25tmean2).
5. Use mutate() to compute variable transformations
Sometimes, you may want to perform variable computations or transformations, and
mutate() provides a clean interface for doing that.
For example, with air pollution data, you may want to look at whether a given day’s air pollution level is
higher or lower than average. To do this, you subtract the average from each day’s air pollution level
(detrending): you can create a pm25detrend variable that subtracts the mean from the
pm25 variable. Here is the R program that you need.
chicago_new2 <- mutate(chicago_new,
pm25detrend = pm25 - mean(pm25, na.rm = TRUE))
head(chicago_new2)
Plot a histogram of pm25detrend using the hist() function with the option breaks = 30 to set the
number of cells in the histogram. Are most of the PM2.5 values above or below the average?
6. Use group_by() and summarize() to compute summary statistics
The group_by() function allows you to compute summary statistics that are grouped by a certain variable (stratum).
Suppose you want to compute annual air pollution levels from the Chicago dataset. Here, the
stratum is year, which can be derived from the variable date. You will also use the summarize()
function to compute summary statistics.
You need three steps to complete this task. Each step is followed by the R program associated with
it.
Step 1: use as.POSIXlt() to create a year variable
chicago_new <- mutate(chicago_new, year = as.POSIXlt(date)$year + 1900)
Step 2: create a new data frame that splits the Chicago dataset by year
years <- group_by(chicago_new, year)
Step 3: compute summary statistics for each year in the new data frame with the
summarize() function. Here, you obtain the average PM2.5, the maximum ozone value, and the
median nitrogen dioxide level per year.
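A minimal sketch of Step 3 is shown below on a four-row stand-in for the grouped data. The toy values and the object names toy and annual are made up; the column names pm25, ozone and NO2 follow the renaming exercise above. It assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical stand-in: two days in each of two years
toy <- data.frame(year  = c(2000, 2000, 2001, 2001),
                  pm25  = c(10, NA, 20, 30),
                  ozone = c(5, 9, 7, 3),
                  NO2   = c(20, 24, 30, 40))

years <- group_by(toy, year)   # same role as the grouped data frame in Step 2

# One output row per year; na.rm = TRUE ignores missing values
annual <- summarize(years,
                    pm25  = mean(pm25, na.rm = TRUE),
                    ozone = max(ozone, na.rm = TRUE),
                    NO2   = median(NO2, na.rm = TRUE))
annual
```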
You can apply this technique to the example from above (part 6). Here it is.
result <- mutate(chicago_new, year = as.POSIXlt(date)$year + 1900) %>%
  group_by(year) %>%
  summarize(pm25 = mean(pm25, na.rm = TRUE),
            ozone = max(ozone, na.rm = TRUE),
            NO2 = median(NO2, na.rm = TRUE))
[1] 0
2. for loop
The structure of a for loop is:
for (variable in sequence) {
  R statements
}
Here is an example of a for loop. A series of numbers is printed as the value of k increases.
> for(k in 1:10){
+ print(k + 1)
+ }
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
[1] 11
Now, it is your turn to write a simple for loop. For j from 1 to 10, print out the values of j^2.
3. Nested loop
Both if-else statements and for loops can be nested, i.e. one is executed inside another. Here
is an R program that demonstrates nested for loops.
for (j in 1:5){
  for (k in 1:3){
    print(j*k)
  }
}
If you have large nested loops, you may consider replacing them with functions in R.
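As one possible illustration of replacing a nested loop with a function, the base R function outer() computes all the products j*k from the example above in a single call. The object names products and loop_result are made up for this sketch.

```r
# All products j * k for j in 1:5 and k in 1:3, without an explicit loop
products <- outer(1:5, 1:3)   # products[j, k] equals j * k
products

# The same values filled in by the nested for loop, for comparison
loop_result <- matrix(0, nrow = 5, ncol = 3)
for (j in 1:5) {
  for (k in 1:3) {
    loop_result[j, k] <- j * k
  }
}

all.equal(products, loop_result)   # TRUE when both approaches agree
```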
Type f(). What do you see in the R console? If you type class(f), you will see that f is indeed a
function. Functions have their own class in R.
2. Function arguments
Suppose you want to print the statement “Hello, world!” 10 times. How do you achieve it? You
could cut and paste the code cat("Hello, world!\n") 10 times. However, not only is this
inefficient, but when your colleagues run your program, they need to cut and paste the code again,
depending on their own needs.
Instead, you should write a function to complete this task. The following R program allows you to
print the statement as many times as you like.
f <- function(num) {
  for(i in seq_len(num)) {
    cat("Hello, world!\n")
  }
}
This function takes one argument, num. This argument can be any positive integer and controls
how many times the statement is printed. Inside the function there is a for loop, which
prints the statement “Hello, world!” num times. Typing
f(5) gives you 5 “Hello, world!” statements, for instance.
Note that you must assign a value to num. Otherwise, the function returns an error. Try f()
and f(3) for comparison.
3. Functions with returning values
In the last example, your R function only prints the statement and does not return a meaningful
value. What if you want to count the number of characters in the “Hello, world!” statement? You can
revise the last example and let R count them for you (for your reference, the string "Hello, world!\n"
has 14 characters, counting the space, the punctuation and the newline character \n).
f <- function(num) {
  hello <- "Hello, world!\n"
  for(i in seq_len(num)) {
    cat(hello)
  }
  chars <- nchar(hello) * num
  chars
}
The R program above prints the statement “Hello, world!” N times, where N is the value of num. It
also returns the total number of characters printed. In R, a function returns the value of the
last expression evaluated inside it. Since chars is the last expression evaluated in
this function, chars becomes the return value of the function. Now, execute the function with any
positive number for num and see the results. What will happen if num is negative?
Recall that if you forget to assign a value to num, i.e. you run f(), R gives an error message. To avoid
this, you can provide a default value for num. In the following example, the default value for num is
1.
f <- function(num = 1) {
  hello <- "Hello, world!\n"
  for(i in seq_len(num)) {
    cat(hello)
  }
  chars <- nchar(hello) * num
  chars
}
Now, when you type f(), R uses the default value for num (which is 1) and returns the
following output.
> f()
Hello, world!
[1] 14
Now, it is your turn to design your own function. Write a function that converts temperature from
Celsius to Fahrenheit. First, your function takes one argument, tempC. Next, you can convert Celsius
to Fahrenheit using the following expression:
Fahrenheit = Celsius * 1.8 + 32
Finally, your function should return one value, tempF. Name your function “temp”. After designing
your function, verify that 40 degrees Celsius is equal to 104 degrees Fahrenheit.
4. “Lazy” evaluation
Arguments to a function are evaluated lazily, that is, only when they are actually used inside the
function. If you design a function with three (3) arguments but provide only one input value, R can
still evaluate the function as long as the missing arguments are never used. Here is
an example.
f <- function(a,b,c){
  print(a)
  # print(a+b)
  # print(a*b*c)
}
f(12)
Since only the first argument is used, R does not return any error. However, if you remove the #
signs from the 2nd and 3rd lines inside the function, and then run it again…
> f(12)
[1] 12
Error in print(a + b) : argument "b" is missing, with no default
The error shows up after the 1st statement, print(a), is executed. When R executes the 2nd
statement, print(a+b), it needs values for both a and b. Since only one value was
provided, R finds that b is missing and cannot execute this statement.
The same applies to the 3rd statement, print(a*b*c).
Now, type f(1,2,3), or provide any 3 positive integers as arguments of the function. Do you see
any error message?
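Two further consequences of lazy evaluation are sketched below. The functions g and h are made up for illustration; missing() and stop() are base R functions. g() checks whether b was supplied before using it, and h() shows that an argument which is never used is never evaluated at all.

```r
# g() only touches b when the caller actually supplied it
g <- function(a, b) {
  if (missing(b)) {
    return(a)     # b is never evaluated on this path
  }
  a + b
}
g(12)       # works although b is missing
g(12, 3)    # uses both arguments

# h() never uses b, so the expression passed as b is never run
h <- function(a, b) a * 2
h(5, stop("this is never evaluated"))   # returns 10 without raising the error
```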
5. Summary of functions
You can write a function in R using the function() directive. Functions are stored as R
objects, just like any other R object.
Functions can be defined with named arguments, and these arguments can have default
values.
You should define an appropriate number of arguments for what the function needs to evaluate.
Functions always return the last expression evaluated in the function body.
When your R code is repetitive (a lot of cutting and pasting), this is a good sign that you
may want to write R functions to make your work easier.
Self-study tasks
x <- as.Date("2012-12-21")
x
Now, you have a new object x. If you type class(x), what do you obtain?
2. Current date
In the R Console, simply type Sys.Date(). This gives you the current date. Note that the output has
the same format as that of as.Date(): both use “yyyy-mm-dd”.
3. Operations on date
The most common operations on dates and times are sums and differences. For example,
suppose you want to compute the number of days in a leap year. Common sense tells us a leap year has
366 days. However, can we verify this in R? Let’s try the following example.
x1 <- as.Date("2012-01-01"); y1 <- as.Date("2013-01-01")
x2 <- as.Date("2013-01-01"); y2 <- as.Date("2014-01-01")
y1 - x1
y2 - x2
Run the R program above. What is the value of y1 - x1? What about y2 - x2? Note that 2012 is a leap
year, but 2013 is NOT. Therefore, the results of the two subtractions are not the same.
4. Create a time object
A time object can be created with the command as.POSIXlt(). Now, run the following R program:
z <- as.POSIXlt("2001-09-11 08:46:00")
z
If you type class(z), what do you obtain? It shows that z is a POSIXlt object.
5. Current time
In R Console, simply type Sys.time(). This gives you three things: current date in “yyyy-mm-dd”
format, current time in “HH:MM:SS” format, and the time zone.
6. Operations on time
Here, we introduce the R command difftime(). This command computes the difference between two
date or time objects. The command has the following structure:
difftime(time1, time2, units = c("auto", "secs", "mins", "hours",
"days", "weeks"))
time1 and time2 are date objects or time objects. You can pass a date object as time1 and a time
object as time2, or vice versa. By default (units = "auto"), difftime() chooses a suitable unit
for the difference automatically. However, by setting units, you can ask R to report the difference
in, for example, hours or weeks. Run the following R program:
difftime(Sys.time(), z, units = "days")
How many days has it been since the September 11 terrorist attack?
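Because Sys.time() changes on every run, the effect of the units argument is easier to see with two fixed time points. In the sketch below, the objects start and end are made up and the time zone is set to "UTC" as an assumption, so that the result does not depend on the local clock.

```r
# Two fixed time points, exactly 36 hours apart
start <- as.POSIXlt("2001-09-11 08:46:00", tz = "UTC")
end   <- as.POSIXlt("2001-09-12 20:46:00", tz = "UTC")

difftime(end, start)                     # units = "auto": R picks the unit itself
difftime(end, start, units = "hours")    # the same difference expressed in hours
difftime(end, start, units = "mins")     # and in minutes

# as.numeric() strips the unit label when only the number is needed
as.numeric(difftime(end, start, units = "days"))
```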
7. Date/Time operations using “lubridate”
As in other programming languages, date-time data can be frustrating to work with in R. R
commands for date-times are generally unintuitive and change depending on the type of date-time
object being used. The lubridate package makes it easier to do the things R does with date-times and
possible to do the things R does not.
Download and install “lubridate” and then run the following code to see how to use the functions
of the package to process date and time variables.
library(lubridate)
# parsing of date-times
a <- ymd("2020/12/21")
a
b <- mdy("01122019")   # quoted so that the leading zero is kept
b
c <- dmy_hms("31-12-2020 07:46:00")
c
1. Use lapply
The function lapply() loops over a list object, iterating over each element in that list. During the
loop, a function is applied to each element of the list. Finally, lapply() returns the results as a
list. This function has the following structure:
lapply(X, FUN,...)
Where X is a list object, FUN refers to the function to apply (often a summary statistic such as mean
or median), and ... refers to additional arguments, if any.
Here, you will use the dataset “iris” to demonstrate how lapply() works. This dataset contains
the measurements, in centimeters, of the variables sepal length, sepal width, petal length and petal
width, respectively, for 50 flowers from each of 3 species of iris. The species are Iris setosa,
versicolor, and virginica. Run the following R program.
dat <- iris
dat2 <- dat[,1:4]
list1 <- lapply(dat2, mean)
str(list1)
The returned values are stored as list1. Now, type list1 to see the returned values. Next, type
str(list1) to examine the structure of this object.
> str(list1)
List of 4
$ Sepal.Length: num 5.84
$ Sepal.Width : num 3.06
$ Petal.Length: num 3.76
$ Petal.Width : num 1.2
> str(v2)
 Named num [1:4] 5.84 3.06 3.76 1.2
 - attr(*, "names")= chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
The object is no longer a list. Because the result of sapply() was a list where each element had
length 1, sapply() collapsed the output into a numeric vector.
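The difference between the two return types can be seen on a small list. In the sketch below, the list marks and its values are made up for illustration.

```r
# Hypothetical list with two numeric vectors of different lengths
marks <- list(tma = c(60, 70, 80), exam = c(50, 80))

as_list   <- lapply(marks, mean)   # always a list, one element per input element
as_vector <- sapply(marks, mean)   # simplified to a named numeric vector

str(as_list)
str(as_vector)

# When every result has length 2, sapply() simplifies to a matrix instead
ranges <- sapply(marks, range)
ranges
```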
3. Use apply()
The function apply() evaluates a function over the margins of an array. In other words, you can
use apply() to apply a function to the rows or columns of a matrix (which is just a 2-dimensional
array). For example, you can compute the row mean or column mean with apply(). The structure
of apply() is:
apply(X, MARGIN, FUN, ...)
Where X is an array or matrix (a data frame is converted to a matrix first). MARGIN refers to the
direction in which the function is applied (MARGIN=1 for rows; MARGIN=2 for columns). FUN refers to a
function (mean, median, etc), and ... refers to additional arguments, if any.
Here is a quick example to show how apply() works. Run the following R program. It uses two different
functions to compute the mean of each column in the iris dataset.
apply(dat2, 2, mean)
sapply(dat2, mean)
Compare the result of apply() against that of sapply(). Indeed, when you apply a function to the
columns of a data frame, apply() and sapply() give you the same results. However, apply() can work
across rows or columns, whereas sapply() on a data frame works across columns ONLY.
Now, run the following R program. The results may surprise you.
colMeans(dat2)
apply(dat2, 2, mean)
rowMeans(dat2)
apply(dat2, 1, mean)
Do you notice that the output from the first two lines is the same? The results of the
3rd and 4th lines are also the same.
This is not a coincidence. In fact, R provides shortcuts for some common uses of apply().
These shortcuts have two advantages. First, they are optimized, so they run faster than
apply() in R. Second, their names are self-explanatory: it is easier to understand
rowMeans(dat2) than apply(dat2, 1, mean).
You may find the following four shortcuts helpful in your future programming. Note that x must
be an object that contains only numerical values.
rowSums(x) = apply(x, 1, sum)
rowMeans(x) = apply(x, 1, mean)
colSums(x) = apply(x, 2, sum)
colMeans(x) = apply(x, 2, mean)
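The equivalence between these shortcuts and apply() can be checked on a small numeric matrix. rowSums(), rowMeans(), colSums() and colMeans() are base R functions; the matrix m is made up for the example.

```r
# Small numeric matrix: rows are (1, 3, 5) and (2, 4, 6)
m <- matrix(c(1, 2, 3, 4, 5, 6), nrow = 2)

all.equal(rowSums(m),  apply(m, 1, sum))    # row totals
all.equal(rowMeans(m), apply(m, 1, mean))   # row averages
all.equal(colSums(m),  apply(m, 2, sum))    # column totals
all.equal(colMeans(m), apply(m, 2, mean))   # column averages
```

Each comparison should print TRUE, confirming that the shortcut and the apply() call return the same numbers.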
4. Use tapply()
tapply() is a function that helps you compute descriptive statistics in a dataset by a factor. The
structure of tapply() is:
tapply(X, INDEX, FUN, ...)
Where X is a vector (or a variable from an R dataset). INDEX is a factor (or a list of factors) variable
in your R dataset. FUN refers to a function (mean, median, etc), and ... refers to additional
arguments, if any.
Unlike the previous 3 loop functions (lapply, sapply, and apply), tapply() can only execute the
function on one variable at once. Here is an example.
tapply(dat[, 1:4], dat$Species, mean)
Run the R program, and you will receive the following error message.
> tapply(dat[, 1:4], dat$Species, mean) ## ## this gives an error
Error in tapply(dat[, 1:4], dat$Species, mean) :
arguments must have same length
That is because R expects the first argument to be a vector, but you provided a data frame instead. To
make this function work, replace the first argument with a single variable, such as:
tapply(dat$Sepal.Length, dat$Species, mean)
Run the R program above. Now, you will obtain the following output.
> tapply(dat$Sepal.Length, dat$Species, mean) ## ## this works
setosa versicolor virginica
5.006 5.936 6.588
If you want to compute the group mean for multiple variables, you can either run tapply()
multiple times or use another function called aggregate(). Run the following R program,
which gives you the mean of each variable, grouped by species.
aggregate(dat[, 1:4], by = list(dat$Species), mean)
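The same pattern can be checked by hand on a small example. In the sketch below, the data frame toy and its columns group, height and weight are made up; it also shows the formula interface of aggregate() as an alternative way to write the grouped call.

```r
# Hypothetical data: two groups with two observations each
toy <- data.frame(group  = factor(c("a", "a", "b", "b")),
                  height = c(1, 3, 10, 20),
                  weight = c(2, 4, 6, 8))

# One variable at a time with tapply()
by_group <- tapply(toy$height, toy$group, mean)
by_group

# Several variables at once with aggregate()
agg <- aggregate(toy[, c("height", "weight")],
                 by = list(group = toy$group), FUN = mean)
agg

# Formula interface: cbind() on the left lists the variables to summarise
agg_formula <- aggregate(cbind(height, weight) ~ group, data = toy, FUN = mean)
agg_formula
```

Naming the element inside by = list(...) gives the grouping column a readable name in the output instead of the default Group.1.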