R Programming For BIA B452F

What is the R system?

R is an open-source programming language for statistical computing and graphics. It provides a wide
variety of statistical techniques (linear and nonlinear modelling, classical statistical tests, time-series analysis,
classification, clustering, etc.) and graphical techniques.
More importantly, R is free software: there is no licence fee to pay.
R framework
[Figure: Steps in a typical statistical data analysis]

The R system is divided into two conceptual parts:
• The ‘base’ R system – required to run R; it contains the most fundamental functions.
• Packages – “programmed” solutions shared by others in the R community; a package bundles together code, data, documentation and tests. As of January 2017, there were over 10,000 packages available on the Comprehensive R Archive Network (CRAN).
RStudio is a free integrated development environment (IDE) that provides a relatively user-friendly environment for using R.
Note: An open-source language means anyone can share or modify the content of this language.
R has five basic or “atomic” classes of objects:

• Character
• Numeric (real numbers)
• Integer
• Complex
• Logical (TRUE/FALSE)
R has a wide variety of data structures for holding data:
• Vector – a one-dimensional array that can only contain objects of the same class
• Matrix – an extension of a vector to two dimensions
• Array – an extension of a vector to more than two dimensions
• Data frame – a tabular (rectangular) data structure for storing data tables or datasets; each column may be of a different type, but every column must have the same length
• List – similar to a vector, but can contain objects of different classes: vector, matrix, array, data frame and list
• Factor – a variable that takes on a limited number of different values (a categorical variable)
R objects can have attributes, which are like metadata for the object. These
metadata can be very useful in that they help to describe the object. For
example, column names on a data frame help to tell us what data are
contained in each of the columns.
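The classes, structures and attributes described above can be inspected directly. The following is a minimal sketch in base R; all object names (x_chr, mixed, toy_df and so on) are made up for illustration.

```r
# One value of each atomic class named in the text
x_chr <- "a"; x_num <- 1.5; x_int <- 2L; x_cpl <- 1+2i; x_lgl <- TRUE
classes <- sapply(list(x_chr, x_num, x_int, x_cpl, x_lgl), class)
classes

# A vector holds one class only, so mixed input is coerced (here to character)
mixed <- c(1, "a", TRUE)
class(mixed)

# A list may hold objects of different classes
mixed_list <- list(1, "a", TRUE)
sapply(mixed_list, class)

# A data frame: columns of different types, all of the same length
toy_df <- data.frame(id = 1:3, grp = factor(c("a", "b", "a")))

# Attributes are metadata; for a data frame they include the column names
names(attributes(toy_df))
attributes(toy_df)$names
```

Running class() on your own objects is a quick way to check what R has actually stored.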
Task 1-1 Quick Review of R programming

Write and run a simple R program script
Start R: from the start menu or double-click the ‘RStudio’ icon on the desktop.
[Figure: RStudio window, showing the Code Window, the button that executes the program, the Environment/History pane, the Plot/Management pane and the Console]
• Code editor – The Code Window is for writing your own R code as scripts.
• Console – The Console displays messages about the R session and any programs submitted. R uses the following colour-coded system for different types of message in the log:
o Blue – R code executed from the code window. Results from the R program are also displayed in blue
o Red – errors that cause R to abort running the program. Warnings for the R program are also printed in red
• Environment/History pane – The “Environment” pane tells users what R objects are stored in the current session. An R object can be model output, a function, a value and many more. The “History” pane tells users what R code has been executed from the Console.
• Plot/Management pane – The “Plot” pane displays all plots generated from R code, such as scatterplots and histograms.
1. Create a new R script by clicking “File” -> “New File” -> “R Script”
2. Copy and paste the following code into the code window

price <- c(84, 77, 75, 85, 79, 70) ## use c() function to create vectors of objects
size <- c(2, 1.7, 1.4, 1.8, 1.9, 1.2) ## <- symbol assigns values to variables
price; size ## use semi-colon to separate statements
plot(price, size)
(Note: Comments in R begin with “#”. You may skip the comments.)
3. Use the mouse to highlight all the code and then click “Run” or press <CTRL> + <Enter>.
Check the outputs of the program in the console and plot pane.
4. Create a working folder “B452F” on your local C drive, and then change to this directory by typing:
setwd("C:/B452F")
Caution! You MUST use the forward slash ‘/’ in the path of your new directory. You will receive an error message if you use a single backslash ‘\’.
Alternatively, you can click ‘Session’ -> ‘Set Working Directory’ -> ‘Choose Directory’ and then choose the directory through the pop-up dialog box.
6. Save the R script as “First_R.r” in the document folder by clicking the save icon or pressing <CTRL> + S.
7. Get the current working directory and check the files in the working directory using getwd() and dir().
getwd()
dir()
8. Save the R script again and then quit RStudio.
9. Double-click “First_R.r” to restart RStudio and reload the saved R script.
10. Run the R script to plot the graph again.
Create and save R dataset

1. The following program creates a grade book for a course with 9 students and the 8th student was
absent in the quiz.
gender <- c("M", NA, "F", "M", "F", "M", "M", "M", "F")
TMA <- c(76.5, 78.5, NA, NA, NA, 83.5, 87.5, 87, 76)
Quiz <- c(55, 75, 25, 40, 75, 60, 50, NA, 60)
Exam <- c(50, 75.5, NA, 33.5, 80, 55, 85.5, 40, 71.5)
Grade <- c("B-", "A-", "C", "F", "B+", "B", "A", "C", "A-")
grade <- data.frame(gender, TMA, Quiz, Exam, Grade)
grade
(Note: R is CASE SENSITIVE. The object “Grade” is stored as a vector of values, while “grade” is stored as a data frame; they are two different objects. NA is a reserved keyword for missing values.)
a) Copy and paste the program to the R script editor and then run it.
b) Check the newly created dataset (as a data frame) in the ‘Environment’ pane.
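2. Save the grade dataset as a CSV file using write.csv().

write.csv(grade, "grade.csv", row.names = FALSE)
(Note: R will save the file in the current working directory. Setting row.names = FALSE stops R from writing the row numbers as an extra first column, so the column positions stay the same when the file is read back.)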
3. Clear the R working environment using rm().

rm(list = ls())
Alternatively, you can click the broom icon in the “Environment” pane to clear the current R
environment.
4. Re-create the grade dataset by loading the csv file.
grade <- read.csv("grade.csv")
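The save-and-reload round trip can be tried on a small example first. The sketch below uses a made-up data frame (toy) and a temporary file rather than the course files, and compares writing with and without row names.

```r
# A tiny stand-in for the grade book (hypothetical data)
toy <- data.frame(id = c("a", "b"), score = c(1.5, NA))
tmp <- tempfile(fileext = ".csv")

# Without row names: the columns come back exactly as they were written
write.csv(toy, tmp, row.names = FALSE)
back <- read.csv(tmp)
names(back)

# With the default row.names = TRUE, an extra first column "X" appears on reload
write.csv(toy, tmp)
back_rn <- read.csv(tmp)
names(back_rn)

# Missing values survive the round trip as NA
is.na(back$score)
```

The extra column matters whenever later code selects columns by position, as in grade[, c(2:4, 6)].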
Manipulate dataset
1. Computing a new variable using arithmetic
A new variable can easily be added to an existing dataset (a data frame). Suppose that the
‘course_score’ for the ‘grade’ dataset in Task 1-4 is derived by the following formula:
course_score = 0.4 * (TMA + Quiz) / 2 + 0.6 * Exam
First, check whether the ‘grade’ dataset is still in the current R working environment. If not, run ‘Task1-4.R’
to create the dataset again. Then, run the following R program to create a new variable
“course_score” and add it to the ‘grade’ dataset:
grade$course_score <- 0.4 * (grade$TMA + grade$Quiz)/2 + 0.6 * grade$Exam
grade
2. Handling of missing values

(a) In the grade dataset, there are several NAs, which represent missing values. R cannot determine the course score and result for a student if any of the values involved is missing. So, let’s manually assign values to the NAs.
Type the following code to change the NA in “gender” to "F" (female), and the NA in “Exam” to 60.
grade$gender[is.na(grade$gender)] <- "F"
grade$Exam[is.na(grade$Exam)] <- 60
grade
Type the following code to change all remaining missing values (those in TMA and Quiz, and the course scores derived from them) to zero. After that, recalculate the course score and result based on the updated dataset. How many students failed?
grade[is.na(grade)] <- 0
grade
(Note: The is.na() function returns TRUE for every element of the dataset “grade” that is NA. The assignment then replaces each of those elements with zero.)
(b) You may also use the complete.cases() or na.omit() functions to apply listwise deletion (or
using complete cases) method to delete observations with missing data (i.e., an entire record is
excluded from analysis if any single value is missing).
grade <- read.csv("grade.csv") # reload the grade.csv
grade1 <- grade[complete.cases(grade),]
Or
grade2 <- na.omit(grade)
How many observations remain after applying listwise deletion?
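To see what listwise deletion does before applying it to the grade data, here is a small sketch on a made-up data frame (toy). It suggests that complete.cases() and na.omit() keep the same rows.

```r
# Hypothetical data: rows 2 and 3 each contain one missing value
toy <- data.frame(a = c(1, NA, 3, 4), b = c("x", "y", NA, "z"))

# complete.cases() returns TRUE for rows without any NA
keep <- complete.cases(toy)
keep

toy1 <- toy[keep, ]   # logical row index
toy2 <- na.omit(toy)  # same rows; also records which rows were dropped

nrow(toy1)
nrow(toy2)
```

Note that a single NA anywhere in a row is enough to drop the whole row.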
3. Update the course score by running Step 1 again, and then check the revised scores.
4. Computing a new variable using “ifelse” statement

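Conditional logic can be used to set the value of a new variable. The basic form of the statement is:
ifelse(test, yes, no)
where test is a logical condition, yes is the value used where the condition is TRUE, and no is the value used where it is FALSE.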
Suppose a course requires students to get at least 40 marks in both exam and course_score in order
to get a pass grade. The final result can be obtained as follows:
grade$result <- ifelse(grade$Exam >= 40 & grade$course_score >= 40,
"pass", "fail")
grade
Run the code and then check the result.
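ifelse() works element by element, and a missing value in the condition gives a missing result. The sketch below uses made-up vectors (exam, score) rather than the grade dataset to show this.

```r
# Hypothetical marks for four students; the third exam mark is missing
exam  <- c(50, 35, NA, 80)
score <- c(60, 45, 70, 38)

# A pass needs at least 40 in both; & combines the two conditions element-wise
result <- ifelse(exam >= 40 & score >= 40, "pass", "fail")
result
```

The third student gets NA rather than "pass" or "fail", which is one reason to deal with missing values before computing results.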

5. Using apply() to compute a statistical measure by each variable
apply() is an R function that applies another function quickly over the rows or columns of a matrix or array. It is called as apply(X, MARGIN, FUN) (more details will be provided in Task 1-6).
Run the following R code to compute the mean and standard deviation of the numerical variables
(TMA, Quiz, Exam and course_score).
apply(grade[,c(2:4, 6)], 2, mean)
apply(grade[,c(2:4, 6)], 2, sd)
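If a column still contains NA, mean() and sd() return NA for that column. A minimal sketch with a made-up data frame (marks) shows how the extra argument na.rm = TRUE can be passed through apply():

```r
# Hypothetical marks; TMA has one missing value
marks <- data.frame(TMA = c(80, NA, 70), Quiz = c(60, 50, 40))

with_na    <- apply(marks, 2, mean)                # NA for TMA
without_na <- apply(marks, 2, mean, na.rm = TRUE)  # NA ignored

with_na
without_na
```

Any argument placed after the function name is handed on to that function for every column.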
6. Subsetting data
In general, the elements of a dataset can be obtained using the notation dataframe[row indices,
column indices].
The following code extracts the 6th and 7th variables in the grade dataset (i.e., course score and result)
result <- grade[,c(6:7)]
result
The following code extracts the “pass” records from the dataset
pass <- grade[which(grade$result=="pass"),]
pass
Run the program and check the result.
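The which() call in the row index matters when the column contains missing values. The following sketch, on a made-up data frame (toy), compares it with a plain logical index.

```r
# Hypothetical results; the third one is missing
toy <- data.frame(id = 1:4, result = c("pass", "fail", NA, "pass"))

# which() keeps only the positions where the condition is TRUE
by_which <- toy[which(toy$result == "pass"), ]

# A bare logical index also returns a row of NAs for the missing result
by_logical <- toy[toy$result == "pass", ]

nrow(by_which)
nrow(by_logical)
```

With no missing values the two forms give the same rows.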
7. Dropping variable
You can drop variables from a dataset by assigning NULL to them. The following R code removes the
result variable from the dataset:
pass$result <- NULL
pass
Run the program and check the result.
Task 1-2 Managing data frames with the package “dplyr”

The R package dplyr is a tool for working with data frames. This package was written by Hadley Wickham
of RStudio. Commands in dplyr have a common structure, as shown below:
• The first argument is ALWAYS the name of a data frame.
• The second argument describes what manipulation is done on the data frame. Note that the columns can be referred to directly, without using the $ sign.
• The result is a new data frame, which you can store in a new object.
To fully utilize the functions in dplyr, your dataset should be formatted so that each row contains only
one observation, and each column describes a characteristic of that observation. In this tutorial, we will
cover seven tools that come from the package dplyr: select, filter, arrange, rename,
mutate, summarize and the pipe operator %>%.
We will use a dataset that contains air pollution and temperature data for the city of Chicago. This
dataset is named “Chicago.csv”. Read this dataset into R using the read.csv() function and store it in an object named chicago. After loading
the data, type str(chicago) to explore the structure of this dataset.
1. Use select() to extract columns
First, load the R package dplyr with library(dplyr) if you haven’t done so. Suppose you only want to extract the first 3 columns. Using the dplyr package, you extract the
variables from “city” to “dptp” in the chicago dataset and store the result as subset1. The code
head(subset1) shows you the first few rows of the data. Run the following R program.
names(chicago)[1:3]
subset1 <- select(chicago, city:dptp)
head(subset1)
To drop a column, add ‘-’ before the column name, e.g. select(chicago, -date) will drop the date
column. The following code gives the same subset as subset1.
subset2 <- select(chicago, -c(date:no2tmean2))
What will be the result if the c() function is not used, i.e., select(chicago, -date:no2tmean2)?
> head(subset1)
city tmpd dptp
1 chic 31.5 31.500
2 chic 33.0 29.875
3 chic 33.0 27.375
4 chic 29.0 28.625
5 chic 32.0 28.875
6 chic 40.0 35.125
The select() function also allows you to extract variables based on specific patterns on the
variable names. For example, if you want to keep all variables that start with the letter “d”, you can
run the following R program to get the result. If you print the structure of subset3 and chicago,
you can tell the difference between them.
subset3 <- select(chicago, starts_with("d"))
str(subset3)
str(chicago)
You can refer to ?select for more details.
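If “Chicago.csv” is not at hand, the three ways of choosing columns can be tried on a small stand-in. The sketch below assumes dplyr is installed; the data frame toy and its values are made up, with column names borrowed from the Chicago data.

```r
library(dplyr)

# Hypothetical two-row stand-in for the chicago data frame
toy <- data.frame(city = c("chic", "chic"),
                  tmpd = c(31.5, 33.0),
                  dptp = c(31.500, 29.875),
                  date = c("1987/1/1", "1987/1/2"))

first_three <- select(toy, city:dptp)        # a range of columns
no_date     <- select(toy, -date)            # drop one column
starts_d    <- select(toy, starts_with("d")) # match on the column name

names(first_three)
names(no_date)
names(starts_d)
```

In each case the result is a new data frame; toy itself is left unchanged.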
2. Use filter() to extract rows

While select() extracts specific columns, filter() extracts specific rows. Suppose you want to
extract the rows of the chicago data frame where the level of PM2.5 is greater than 30. You can run the
following R program.
chic.f <- filter(chicago, pm25tmean2 > 30)
summary(chic.f$pm25tmean2)
All rows with PM2.5 levels above 30 were extracted and stored in the object chic.f. Now,
summary(chic.f$pm25tmean2) gives you the descriptive statistics of this variable. The minimum
value is 30.05, which is greater than 30.
You can apply multiple conditions inside filter(). For instance, if you want to extract the rows
where PM2.5 > 30 and temperature > 80 degrees Fahrenheit, you can get this done in one line.
chic.f <- filter(chicago, pm25tmean2 > 30 & tmpd > 80)
select(chic.f, date, tmpd, pm25tmean2)
The result is stored in the R object chic.f. If you want to confirm your result, use select() to
extract the date, temperature and PM2.5 level. Does the result fulfil the conditions listed above?
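One detail worth knowing is that filter() drops rows where the condition evaluates to NA. The sketch below assumes dplyr is installed and uses a made-up data frame (toy) with hypothetical column names pm and temp.

```r
library(dplyr)

# Hypothetical readings; the second PM value is missing
toy <- data.frame(pm = c(35, NA, 20, 40), temp = c(85, 90, 95, 70))

high_pm     <- filter(toy, pm > 30)              # the NA row is not kept
high_both   <- filter(toy, pm > 30 & temp > 80)  # both conditions must hold
high_both_2 <- filter(toy, pm > 30, temp > 80)   # a comma also means "and"

nrow(high_pm)
nrow(high_both)
```

This is why the summary of chic.f shows no missing values even though the original PM2.5 column contains NAs.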
3. Use arrange() to reorder rows of data
Reordering rows of a data frame in R could be difficult. Fortunately, the package dplyr has a nice
function called arrange(), which allows you to rearrange rows by a certain variable (column) in R.
Now, suppose you want to rearrange the Chicago data by date. You can run the following R
program.
chicago_new <- arrange(chicago, date)
By default, arrange() rearranges data in ascending order. Now, you may want to examine if the
data are indeed sorted by ascending date. Use head and tail to check your result, as below:
head(chicago_new)
tail(chicago_new)
If you want to sort the data by descending order, use the function desc() for the sorting variable.
chicago_new <- arrange(chicago, desc(date))
Again, use head and tail to check your result. Are the data sorted by descending date?
4. Use rename() to update variable names
In your future work, you may want to rename variables from an R dataset for better understanding.
The rename()function makes your work much easier. Now, type head(chicago) to read the
variable names of the Chicago dataset.
> head(chicago)
city tmpd dptp date pm25tmean2 pm10tmean2 o3tmean2 no2tmean2
1 chic 31.5 31.500 1987/1/1 NA 34.00000 4.250000 19.98810
2 chic 33.0 29.875 1987/1/2 NA NA 3.304348 23.19099
3 chic 33.0 27.375 1987/1/3 NA 34.16667 3.333333 23.81548
…
The variable dptp is the dew point temperature, and the variable pm25tmean2 refers to the
PM2.5 values. For easier interpretation, you should rename these two variables. The following R
program is all you need.
chicago_new <- rename(chicago_new, dewpoint = dptp, pm25 = pm25tmean2)
Note that the new variable names are placed before the “=” sign; here, “dewpoint” and “pm25” are the new names. Also, rename() can take
multiple arguments, one for each variable name change.
After making the changes, type head(chicago_new) again to see how the variable names have changed.
> head(chicago_new)
city tmpd dewpoint date pm25 pm10tmean2 o3tmean2 no2tmean2
1 chic 31.5 31.500 1987/1/1 NA 34.00000 4.250000 19.98810
2 chic 33.0 29.875 1987/1/2 NA NA 3.304348 23.19099
3 chic 33.0 27.375 1987/1/3 NA 34.16667 3.333333 23.81548
…
Here is an exercise for you. The variable pm10tmean2 refers to PM10, the variable o3tmean2 refers
to ozone, and the variable no2tmean2 refers to NO2 (nitrogen dioxide). Use the rename() function
to change their names to “pm10”, “ozone” and “NO2”, respectively. (Hint: you just need to place
three more arguments after pm25 = pm25tmean2.)
5. Use mutate() to compute variable transformations
Sometimes, you may want to perform variable computations or transformations, and
mutate()provides a clean interface for doing that.
For example, with air pollution data, you want to look at whether a given day’s air pollution level is
higher than or less than average. You need to subtract each day’s air pollution from its average
(detrending). To do this, you can create a pm25detrend variable that subtracts the mean from the
pm25 variable. Here is the R program that you need.
chicago_new2 <- mutate(chicago_new,
pm25detrend = pm25 - mean(pm25, na.rm = TRUE))
head(chicago_new2)
Plot a histogram of pm25detrend using the hist() function, with the option breaks=30 to set the
number of bins in the histogram. Are most of the PM2.5 values above or below the average?
6. Use group_by() and summarize() to compute summary statistics
The group_by() function allows you to compute summary statistics that are grouped by a certain variable (stratum).
Suppose you want to compute the annual air pollution rate from the Chicago dataset. Here, the
stratum is year, which can be derived from the variable date. You will also use the summarize()
function to compute the summary statistics.
You need three steps to complete this task. Each step is followed by the R program associated with
it.
Step 1: use as.POSIXlt() to create a year variable
chicago_new <- mutate(chicago_new, year = as.POSIXlt(date)$year + 1900)
Step 2: create a new data frame that groups the Chicago dataset by year
years <- group_by(chicago_new, year)
Step 3: compute summary statistics for each year in the new data frame with the
summarize()function. Here, you obtain the average PM2.5, the maximum value of ozone, and the
median value of nitrogen dioxide level per year.
summarize(years, pm25 = mean(pm25, na.rm = TRUE),
ozone = max(ozone, na.rm = TRUE),
NO2 = median(NO2, na.rm = TRUE))
7. Putting things together - use the pipe operator %>%

In part 6, you took three steps to compute the annual air pollution rate from the Chicago
dataset. You may wonder whether these three steps can be done all at once. The answer is YES. By using the
pipe operator %>%, you can chain as many dplyr functions as you want in one
command.
Here is the general structure of the pipe operator:
1st command %>%
2nd command %>%
3rd command %>%...
You can apply this technique to the example from above (part 6). Here it is.
result <- mutate(chicago_new, year = as.POSIXlt(date)$year + 1900) %>%
group_by(year) %>%
summarize(pm25 = mean(pm25, na.rm = TRUE),
ozone = max(ozone, na.rm = TRUE),
NO2 = median(NO2, na.rm = TRUE))
Now, the output is stored as R object result.

In summary, the R package dplyr helps you speed up and simplify your data management process. As
you can see from the above examples, most of the manipulations can be done in as little as one line in
the R console.
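The chained pattern from part 7 can be checked on a toy data frame where the answer is easy to work out by hand. The sketch below assumes dplyr is installed; the data frame toy and its values are made up.

```r
library(dplyr)

# Hypothetical daily readings across two years
toy <- data.frame(date = c("1987/1/1", "1987/6/1", "1988/1/1"),
                  pm25 = c(10, 20, 30))

# Each step passes its result on as the first argument of the next function
res <- toy %>%
  mutate(year = as.POSIXlt(date)$year + 1900) %>%  # derive the year
  group_by(year) %>%                               # one group per year
  summarize(pm25 = mean(pm25, na.rm = TRUE))       # one row per group

res
```

The result has one row per year: the mean of 10 and 20 for 1987, and 30 for 1988.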
Task 1-3 Control structures: if-else statement and for loop

During data manipulation in R, you may come across situations where you want to add or drop
variables based on some criteria. This is where you need the control structures in R. As in many
other programming languages, the most common control structures in R include if-else, for, while,
etc.
1. if-else statement
The structure of an R if-else statement is:
if(<condition>) {
## do something
} else {
## do something else
}
(Note: keep “else” on the same line as the closing brace “}”. At the top level of a script or the console, an “else” that starts a new line causes a syntax error.)
Here is a valid if-else statement.

> ## Generate a uniform random number
> x <- runif(1, 0, 10)
> if(x > 5) {
+ y <- 10
+ } else {
+ y <- 0
+ }
> y
[1] 0
Now, run the code. Because x is a random number, your value of y may be 10 instead of 0.
2. for loop
The structure of a for loop is:
for(<variable> in <sequence>) {
## R statements
}
Here is an example of a for loop. A series of numbers is printed as the value of k increases.
> for(k in 1:10){
+ print(k + 1)
+ }
[1] 2
[1] 3
[1] 4
[1] 5
[1] 6
[1] 7
[1] 8
[1] 9
[1] 10
[1] 11
Now, it is your turn to write a simple for loop. For j from 1 to 10, print out the values of j^2.
3. Nested loop
Both if-else statements and for loops can be nested, i.e. one is executed inside another. Here
is an R program that demonstrates nested for loops.
for (j in 1:5){
for (k in 1:3){
print(j*k)
}
}
If you have large nested loops, you may consider replacing them with functions in R.
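As one illustration of that idea, the products computed by the nested loop above can also be obtained with the base R function outer(). This is only a sketch of one possible replacement; outer() is a swapped-in technique, not something the task itself prescribes.

```r
# outer() multiplies every element of 1:5 with every element of 1:3
products <- outer(1:5, 1:3)
products

# Row j, column k holds j * k, the value print(j*k) shows in the nested loop
products[4, 2]
```

The result is a 5-by-3 matrix, so all fifteen products are available at once instead of being printed one by one.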
Task 1-4 Functions in R

Functions in R programming are extremely useful in two ways. First, they can be applied to a task that
requires running similar computations multiple times. For example, when you compute the mean values
of multiple variables in a data frame in R, you can use the function apply() to compute all mean values in
one single line (more details will be discussed in Task 1-6).
Second, functions are “transferable”. If you need to share your R program with your colleagues at
work, you can design your task with functions, save them as an R script, and then share the script with
your colleagues. They can run your functions by simply changing the parameters of the
functions.
Writing a function requires you to create an interface, which is explicitly specified with a set of
parameters. A good metaphor is to compare an R function to a vending machine. When you put coins
into the machine (pass arguments to a function), the machine releases a can of soft drink for
you (the function returns its result after it is executed).
Functions in R have two important properties:
• Functions can be passed as arguments to other functions.
• Functions can be nested, so that you can define a function inside another function.
1. Write your first function.

Run the following R program. This function simply prints a statement
f <- function() {
cat("Hello, world!\n")
}
Type f(), what do you see in the R console? If you type class(f), you will see f is indeed a
function. Function has its own class in R.
2. Function arguments
Suppose you want to print the statement “Hello, world!” 10 times. How do you achieve it? You could
cut and paste the code cat("Hello, world!\n") 10 times. However, not only is this
inefficient, but when your colleagues run your program, they need to cut and paste the code again,
depending on their own needs.
Instead, you should write a function to complete this task. The following R program allows you to
print the statement as many times as you like.
f <- function(num) {
for(i in seq_len(num)) {
cat("Hello, world!\n")
}
}
This function takes an argument, num. This argument can be any positive integer, and it controls
how many times the statement is printed. Inside the function, there is a for loop. Starting from 1,
the for loop prints the statement “Hello, world!” as many times as the value of num. Typing
f(5) gives you 5 “Hello, world!” statements, for instance.
Note that you must assign a value to num. Otherwise, the function returns an error. Try f()
and f(3) for comparison.
3. Functions with returning values
In the last example, your R function only prints text; it does not return a useful value. What if you
want to count the number of characters in the “Hello, world!” statement? You can revise the last
example and let R count it for you (for your reference, the string "Hello, world!\n" has 14
characters, counting the space, the punctuation and the newline at the end).
f <- function(num) {
hello <- "Hello, world!\n"
for(i in seq_len(num)) {
cat(hello)
}
chars <- nchar(hello) * num
chars
}
The R program above prints the statement “Hello, world!” N times, where N = the value of num. It
also prints the number of characters in the printout. In R, you always place the return value as the
last expression inside the function. Since variable chars is the last expression that is evaluated in
this function, chars becomes the return value of the function. Now, execute the function with any
positive number for num and see the results. What will happen if num is negative?
Recall that if you forget to assign a value to num, i.e. running f(), R gives an error message. To avoid
this, you can provide a default value for num. In the following example, the default value for num is
equal to 1.
f <- function(num = 1) {
hello <- "Hello, world!\n"
for(i in seq_len(num)) {
cat(hello)
}
chars <- nchar(hello) * num
chars
}
Now, when you type f(), R would take the default value for num (which is 1), and return the
following output.
> f()
Hello, world!
[1] 14
Now, it is your turn to design your own function. Write a function that converts temperature from
Celsius to Fahrenheit. First, your function takes one argument, tempC. Next, you can convert Celsius
to Fahrenheit using the following expression:
Fahrenheit = Celsius * 1.8 + 32
Finally, your function should return one value, tempF. Name your function “temp”. After designing
your function, verify that 40 degrees Celsius is equal to 104 degrees Fahrenheit.
4. “Lazy” evaluation
Arguments inside functions are evaluated according to users’ input. If you design a function with
three (3) arguments, but you only provide one input value, R can still evaluate the function. Here is
an example.
f <- function(a,b,c){
print(a)
# print(a+b)
# print(a*b*c)
}
f(12)
Since only the first argument is used, R does not return any error. However, if you remove the #
signs for the 2nd and 3rd line inside the function, and then run again…
> f(12)
[1] 12
Error in print(a + b) : argument "b" is missing, with no default
The error shows up after the 1st statement, print(a), is executed. When R executes the 2nd
statement, print(a+b), it needs values for both a and b. Since only one value was
provided, b is missing, and R cannot execute this statement.
The same applies to the 3rd statement, print(a*b*c).
Now, type f(1,2,3), or provide any 3 positive integers as arguments of the function. Do you see
any error message?
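Lazy evaluation can also be seen in a slightly different way. In the sketch below, the function names f2 and g are made up for illustration, and missing() is the base R function that reports whether an argument was supplied.

```r
# f2 checks whether b was supplied before it tries to use it
f2 <- function(a, b) {
  if (missing(b)) a else a + b
}
r1 <- f2(12)     # b is never needed, so no error
r2 <- f2(12, 3)  # b is supplied and used

# g never uses b, so the expression passed as b is never evaluated
g <- function(a, b) a * 2
r3 <- g(5, stop("this is never evaluated"))

c(r1, r2, r3)
```

The call to stop() would normally raise an error, but because g() never touches b, R never evaluates it.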
5. Summary of functions
• You can write a function in R using the function() directive. Functions are assigned to R
objects, just like any other R object.
• Functions can be defined with named arguments, and these arguments can have default
values.
• You should supply an appropriate number of arguments for what the function needs to evaluate.
• Functions return the last expression evaluated in the function body.
• When your R code is repetitive (a lot of cutting and pasting), this is a good sign that you
may want to write R functions to make your work easier.
Self-study tasks
Task 1-5 Date/ Time operations

In R, dates and times are represented by separate classes: the class Date represents dates, and the classes POSIXct and POSIXlt represent times.
Dates are stored internally as the number of days since 1970-01-01. Times (class POSIXct) are stored as the number of seconds
since the beginning of 1970-01-01.
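The internal numbers can be revealed with as.numeric(). The following is a small sketch in base R; the time zone is fixed to UTC so that the count of seconds does not depend on your computer’s settings.

```r
# Ten days into 1970: nine days after the origin
d <- as.Date("1970-01-10")
days_since_origin <- as.numeric(d)
days_since_origin

# One minute past midnight on the origin date, in UTC
tm <- as.POSIXct("1970-01-01 00:01:00", tz = "UTC")
secs_since_origin <- as.numeric(tm)
secs_since_origin
```

This numeric storage is what makes date arithmetic, such as subtracting one date from another, possible.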
1. Create a date object
A date object can be created with command as.Date(). Now, run the following R program
x <- as.Date("2012-12-21")
x
Now, you have a new object x. If you type class(x), what do you obtain?
2. Current date
In the R Console, simply type Sys.Date(). This gives you the current date. Note that the output has the
same format as that of as.Date(), namely “yyyy-mm-dd”.
3. Operations on date
The most common operations for date and time, are the sum or difference of them. For example,
you want to compute the day difference within a leap year. Common sense tells us a leap year has
366 days. However, can we verify this in R? Let’s try the following example.
x1 <- as.Date("2012-01-01"); y1 <- as.Date("2013-01-01")
x2 <- as.Date("2013-01-01"); y2 <- as.Date("2014-01-01")
y1 - x1
y2 - x2
Run the R program above. What is the answer of y1-x1? What about y2-x2? Note that 2012 is a leap
year, but 2013 is NOT. Therefore, the answers between two subtractions are not the same.
4. Create a time object
A time object can be created with command as.POSIXlt(). Now, run the following R program
z <- as.POSIXlt("2001-09-11 08:46:00")
z
If you type class(z), what do you obtain? It shows that z is a POSIXlt object.
5. Current time
In R Console, simply type Sys.time(). This gives you three things: current date in “yyyy-mm-dd”
format, current time in “HH:MM:SS” format, and the time zone.
6. Operations on time
Here, we introduce an R command difftime. This command computes the difference between two
date or time objects. The command has the following structure:
difftime(time1, time2, units = c("auto", "secs", "mins", "hours",
"days", "weeks"))
time1 and time2 are date objects or time objects. You can supply a date object for time1 and a time object for time2,
or vice versa. By default (units = "auto"), difftime picks a suitable unit for you, which is days when the difference is longer than one day.
However, by using units, you can ask R to print the difference in hours or weeks instead. Run the
following R program
difftime(Sys.time(), z, units = "days")
How many days has it been since the September 11 terrorist attack?
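The effect of the units argument is easier to see with two fixed times whose difference is known. In this sketch the objects start and end are made up, and both are set in UTC so that the result does not depend on the local time zone.

```r
# Two hypothetical times exactly 7.5 days apart
start <- as.POSIXct("2020-01-01 00:00:00", tz = "UTC")
end   <- as.POSIXct("2020-01-08 12:00:00", tz = "UTC")

d_auto  <- difftime(end, start)                   # unit chosen automatically
d_hours <- difftime(end, start, units = "hours")
d_weeks <- difftime(end, start, units = "weeks")

units(d_auto)
as.numeric(d_auto)
as.numeric(d_hours)
as.numeric(d_weeks)
```

as.numeric() strips the unit label, which is convenient when the difference is needed in further calculations.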
7. Date/Time operations using “lubridate”
As in other programming languages, date-time data can be frustrating to work with in R. R
commands for date-times are generally unintuitive and change depending on the type of date-time
object being used. Lubridate makes it easier to do the things R does with date-times and possible to
do the things R does not.
Download and install “lubridate”, and then run the following code to see how to use the functions
of the package to process date and time variables.
library(lubridate)
# parsing of date-times
a <- ymd("2020/12/21")
a
b <- mdy(01122019)
b
c <- dmy_hms("31-12-2020 07:46:00")
c
# get components of a date-time

year(a)
month(a)
hour(c)
minute(c)
w <- wday(c, label=TRUE)
w
class(w)
# change of time zone

c
with_tz(c, "Asia/Harbin")
with_tz(c, "Asia/Singapore")
with_tz(c, "Asia/Bangkok")
with_tz(c, "Asia/Hong_Kong")
(Note: The list of time zones can be found at:
https://en.wikipedia.org/wiki/List_of_tz_database_time_zones)
What is the data type of w?
Task 1-6 Loop functions in R
In Tasks 1-3 and 1-4, you learnt how to write R loops and functions. R also has some predefined
functions that implement looping implicitly. You will find these functions very helpful for summarizing
data. Here, you will learn the following functions:
a. lapply(): Loop over a list and evaluate a function on each element
b. sapply(): Same as lapply but try to simplify the result
c. apply(): Apply a function over the margins of an array
d. tapply(): Apply a function over subsets of a vector
e. mapply(): Multivariate version of lapply
1. Use lapply
The function lapply() loops over a list object, iterating over each element in that list. During the
loops, a function is applied to each element of the list. Finally, lapply()returns the results as a
list. This function has the following structure:
lapply(X, FUN,...)
Where X is a list object, FUN refers to a function (mostly a descriptive statistics term, such as mean,
median, etc), and ... refers to additional arguments, if any.
Here, you will utilize the dataset “iris” to demonstrate how lapply()works. This dataset contains
the measurements of the variables sepal length, sepal width, petal length and petal width,
respectively, for 50 flowers from each of 3 species of iris in centimeters. The species are Iris setosa,
versicolor, and virginica. Run the following R program.
dat <- iris
dat2 <- dat[,1:4]
list1 <- lapply(dat2, mean)
str(list1)
lapply() is used on the 3rd line. Together with the line before it, the program does the following:

a. Extract the 1st to 4th columns of the iris dataset (a data frame is treated as a list of columns)
b. Compute the mean of every extracted column
c. Return the mean values as a list
The returned values are stored as list1. Now, type list1 to see the return values. Next, type
str(list1)to examine the structure of this object.
> str(list1)
List of 4
$ Sepal.Length: num 5.84
$ Sepal.Width : num 3.06
$ Petal.Length: num 3.76
$ Petal.Width : num 1.2
See! This object is indeed a list!
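The “...” part of lapply(X, FUN, ...) is handy when the function needs an extra argument. The sketch below uses a made-up list (scores) and passes na.rm = TRUE on to mean().

```r
# Hypothetical list with one missing value in element a
scores <- list(a = c(1, 2, NA), b = c(10, 20, 30))

# na.rm = TRUE is handed to mean() for every element of the list
means <- lapply(scores, mean, na.rm = TRUE)

str(means)
means$a
```

The result keeps the names of the input list, so each mean can be picked out with the $ sign.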

2. Use sapply()
The sapply() function actually does the same operation as lapply(). However, it returns
different R objects, depending on the dimension of the input. Here is how sapply()works:
a. If the result is a list where every element is length 1, then sapply()returns a vector
b. If the result is a list where every element is a vector of the same length (> 1), then
sapply()returns a matrix
c. If it can’t figure things out, then sapply()returns a list
Let’s illustrate this by repeating the example from lapply(). Recall that lapply() returns a list.
Now, run the following R program.
v2 <- sapply(dat2, mean)
v2
str(v2)
> str(v2)
Named num [1:4] 5.84 3.06 3.76 1.2
- attr(*, "names")= chr [1:4] "Sepal.Length" "Sepal.Width" "Petal.Length" "Petal.Width"
The object is no longer a list. Because the result of sapply() was a list where each element had
length 1, sapply() collapsed the output into a numeric vector.
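The three cases listed above can each be produced with a one-line example. This is a minimal sketch; the anonymous functions are made up purely to give results of different shapes.

```r
# a. every result has length 1: sapply() returns a vector
as_vector <- sapply(1:3, function(i) i^2)

# b. every result has the same length (2): sapply() returns a matrix
as_matrix <- sapply(1:3, function(i) c(i, i^2))

# c. results have different lengths: sapply() falls back to a list
as_list <- sapply(1:3, function(i) seq_len(i))

as_vector
dim(as_matrix)
class(as_list)
```

In case b, each result becomes one column of the matrix.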
3. Use apply()
The function apply() evaluates a function over the margins of an array. In other words, you can
use apply() to apply a function to the rows or columns of a matrix (which is just a 2-dimensional
array). For example, you can compute the row means or column means with apply(). The structure
of apply() is:
apply(X, MARGIN, FUN, ...)
Where X is an array or a matrix (a data frame is converted to a matrix first). MARGIN is the dimension
over which the function is applied (MARGIN=1 for rows; MARGIN=2 for columns). FUN refers to a function
(mean, median, etc), and ... refers to additional arguments, if any.
Here is a quick example to show how apply() works. Run the following R program. It uses two different
functions to compute the mean of each column in the iris dataset.
apply(dat2, 2, mean)
sapply(dat2, mean)
> apply(dat2, 2, mean)

Sepal.Length Sepal.Width Petal.Length Petal.Width
5.843333 3.057333 3.758000 1.199333
Compare the result against sapply(). When you apply a function to each column, apply() and
sapply() give you the same results. However, apply() can work across rows or across columns,
whereas sapply() on a data frame works across columns ONLY, because a data frame is a list of columns.
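The effect of the MARGIN argument can be sketched as follows. The object names row_max, col_max and col_q are hypothetical, and max() and quantile() are chosen only for illustration.

```r
# Minimal sketch: the same function applied over rows and over columns
dat2 <- iris[, 1:4]

# MARGIN = 1: one value per row (150 flowers)
row_max <- apply(dat2, 1, max)
length(row_max)

# MARGIN = 2: one value per column (4 measurements)
col_max <- apply(dat2, 2, max)
col_max

# Extra arguments after FUN are passed on to it, as with lapply()
col_q <- apply(dat2, 2, quantile, probs = c(0.25, 0.75))
dim(col_q)   # 2 quantiles by 4 columns
```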
Now, run the following R program. The results may surprise you.
apply(dat2, 2, mean)
colMeans(dat2)
apply(dat2, 1, mean)
rowMeans(dat2)
Did you notice that the output from the first two lines is the same? The results of the 3rd and
4th lines are also the same.
This is not a coincidence. R provides shortcuts for some common uses of apply().
These shortcuts have two advantages. First, they are optimized, so they run faster than the
equivalent apply() call. Second, their names are self-explanatory: it is easier to understand
rowMeans(dat2) than apply(dat2, 1, mean).
You may find the following four shortcuts helpful in your future programming. Note that x must
be an object that contains only numerical values.
rowSums(x) = apply(x, 1, sum)
rowMeans(x) = apply(x, 1, mean)
colSums(x) = apply(x, 2, sum)
colMeans(x) = apply(x, 2, mean)
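You can check these four equivalences yourself. The sketch below is an illustration only: same_values() is a hypothetical helper, not a base R function, and it compares the numbers while ignoring any names attached to the results.

```r
# Minimal sketch: confirm that each shortcut matches its apply() counterpart
dat2 <- iris[, 1:4]

# Hypothetical helper: TRUE when two results hold the same values
same_values <- function(a, b) isTRUE(all.equal(unname(a), unname(b)))

same_values(rowSums(dat2),  apply(dat2, 1, sum))
same_values(rowMeans(dat2), apply(dat2, 1, mean))
same_values(colSums(dat2),  apply(dat2, 2, sum))
same_values(colMeans(dat2), apply(dat2, 2, mean))
```

all.equal() is used instead of == so that tiny floating-point differences between the two computations do not count as a mismatch.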
4. Use tapply()
tapply() is a function that helps you compute descriptive statistics in a dataset by a factor. The
structure of tapply() is:
tapply(X, INDEX, FUN, ...)
Where X is a vector (or a variable from an R dataset). INDEX is a factor (or a list of factors) of the
same length as X. FUN refers to a function (mean, median, etc), and ... refers to additional
arguments, if any.
Unlike the previous 3 loop functions (lapply, sapply, and apply), tapply() can only apply the
function to one variable at a time. Here is an example.
tapply(dat[, 1:4], dat$Species, mean)
Run the R program, and you will receive the following error message.
> tapply(dat[, 1:4], dat$Species, mean) ## this gives an error
Error in tapply(dat[, 1:4], dat$Species, mean) :
arguments must have same length
That is because R expects the first argument to be a vector, but you provided a data frame instead. To
make this function work, replace the first argument with a single variable, such as:
tapply(dat$Sepal.Length, dat$Species, mean)
Run the R program above. Now, you will obtain the following output.
> tapply(dat$Sepal.Length, dat$Species, mean) ## this works
setosa versicolor virginica
5.006 5.936 6.588
If you want to compute the group mean for multiple variables, you can either run tapply()
multiple times or use another function called aggregate(). Run the following R program,
which gives you the mean of each variable, grouped by species.
aggregate(dat[, 1:4], by = list(dat$Species), mean)
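Both options mentioned above can be sketched side by side. This is an illustration under stated assumptions: the object names agg and grp are hypothetical, the first call uses the formula interface of aggregate() rather than the by = form shown above, and the second call runs tapply() once per column by wrapping it in sapply().

```r
# Minimal sketch: two ways to get group means for several variables
dat <- iris

# Formula interface: "." means all other columns, grouped by Species
agg <- aggregate(. ~ Species, data = dat, FUN = mean)
agg

# Run tapply() on each numeric column; sapply() binds the results into a matrix
grp <- sapply(dat[, 1:4], function(col) tapply(col, dat$Species, mean))
grp
```

aggregate() returns a data frame with the grouping variable as a column, while the sapply() version returns a numeric matrix with one row per species.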