
EM622 Data Analysis and Visualization

Techniques for Decision-Making

Introduction to R and Data Manipulation

Getting Started
RStudio console

[Figure: RStudio window with labelled panes: Options (Import dataset), File Viewer (Data & Code), Console (for typing commands), Plots]

Your first graph
Copy and paste:
data(iris)
plot(Sepal.Width ~ Sepal.Length, data = iris,
     col = c("red", "orange", "blue")[iris$Species], pch = 16,
     xlab = "Sepal Length", ylab = "Sepal Width")
legend("topright", legend = levels(iris$Species),
       col = c("red", "orange", "blue"), bty = "n", pch = 16)

Agenda

1. Basic operations
2. Data structures
3. Data Manipulation
4. Your First Graph

Basic Operation - Import data
1. Import data from the drop-down menu in RStudio:

2. Import data from SAS/SPSS, etc: http://www.statmethods.net/input/importingdata.html
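
The same import can also be typed in the console. A minimal sketch, assuming a comma-separated file; the file and its contents are made up here purely for illustration:

```r
# Create a small CSV file to stand in for a real data set (illustrative data only)
csv_path <- tempfile(fileext = ".csv")
writeLines(c("team,first,second", "A,92,70", "B,89,73"), csv_path)

# read.csv() returns a data frame; header = TRUE treats the first line as column names
scores <- read.csv(csv_path, header = TRUE, stringsAsFactors = FALSE)
str(scores)
```

With a real file, replace csv_path with the path to that file.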

Intermediate - Import data

## install.packages(c("tseries","lubridate"))
library(tseries)
library(lubridate)
amazon <- as.data.frame(get.hist.quote("amzn",
  start = "2013-1-1", end = "2018-9-15", quote = c("Cl")))

## time series starts 2013-01-02
## time series ends 2018-09-14

amazon$Date<-ymd(row.names(amazon))
tail(amazon)

## Close Date
## 2018-09-07 1952.07 2018-09-07
## 2018-09-10 1939.01 2018-09-10
## 2018-09-11 1987.15 2018-09-11
## 2018-09-12 1990.00 2018-09-12
## 2018-09-13 1989.87 2018-09-13
## 2018-09-14 1970.19 2018-09-14

Advanced - Import data
# list of addresses for raw data.
addressList <- list(
  drives_address = "http://stats.nba.com/js/data/sportvu/drivesData.js",
  defense_address = "http://stats.nba.com/js/data/sportvu/defenseData.js",
  catchshoot_address = "http://stats.nba.com/js/data/sportvu/catchShootData.js")

# function that grabs the data from the website and converts to R data frame
readIt <- function(address) {
  web_page <- readLines(address)

  ## regex to strip javascript bits and convert raw to csv format
  x1 <- gsub("[\\{\\}\\]]", "", web_page, perl = TRUE)
  x2 <- gsub("[\\[]", "\n", x1, perl = TRUE)
  x3 <- gsub("\"rowSet\":\n", "", x2, perl = TRUE)
  x4 <- gsub(";", ",", x3, perl = TRUE)

  # read the resulting csv with read.table()
  nba <- read.table(textConnection(x4), header = TRUE,
                    sep = ",", skip = 2, stringsAsFactors = FALSE)
  return(nba)
}
# download the data
df_list <- lapply(addressList, readIt)

Advanced (Cont.) - Import data

# check the data
catchshoot <- df_list$catchshoot_address
# str(catchshoot) # Get information about structure
head(catchshoot)

## PLAYER_ID PLAYER FIRST_NAME LAST_NAME TEAM_ABBREVIATION GP MIN
## 1 202691 Klay Thompson Klay Thompson GSW 78 34.0
## 2 1717 Dirk Nowitzki Dirk Nowitzki DAL 53 26.3
## 3 2594 Kyle Korver Kyle Korver CLE 35 24.6
## 4 201586 Serge Ibaka Serge Ibaka TOR 23 30.9
## 5 201567 Kevin Love Kevin Love CLE 60 31.4
## 6 202331 Paul George Paul George IND 74 35.8
## PTS FGM FGA FG_PCT FG3M FG3A FG3_PCT EFG_PCT PTS_TOT X
## 1 11.5 4.2 9.3 0.454 3.1 7.1 0.438 0.621 899 NA
## 2 8.1 3.4 7.5 0.446 1.3 3.5 0.388 0.535 427 NA
## 3 7.6 2.7 5.7 0.470 2.2 4.7 0.470 0.662 265 NA
## 4 7.5 2.9 6.9 0.424 1.7 4.3 0.394 0.547 173 NA
## 5 7.5 2.6 6.6 0.388 2.3 5.8 0.395 0.561 448 NA
## 6 7.4 2.7 6.1 0.437 2.0 4.8 0.420 0.603 546 NA

Advanced: scraping the web using R

#install.packages("rvest")
library(rvest)
# Read and parse the web page
lego_movie <- read_html("http://www.imdb.com/title/tt1490017/")
# Scrape the website for the movie rating
rating <- lego_movie %>%
  html_nodes("strong span") %>%
  html_text() %>%
  as.numeric()
# rating
# Scrape the website for the cast
cast <- lego_movie %>%
  html_nodes("#titleCast .itemprop span") %>%
  html_text()
# cast

https://stat4701.github.io/edav/2015/04/02/rvest_tutorial/

Advanced (Cont.): scraping the web using R

# Scrape the website for the movie rating
rating

## [1] 7.8

# Scrape the website for the cast
cast

## character(0)


Basic Operation - Export data

- To export a data frame to a spreadsheet, the easiest way is to use
  write.csv().
- By default, write.csv() includes row names, but these are usually
  unnecessary and may cause confusion.
- The exported file is stored in the working directory.
# export 'mydf' as a .csv file:
write.csv(mydf,"test.csv")

- How do you find your working directory?

# returns the absolute path of the current working directory
getwd()
## [1] "/Users/annieyu/Dropbox/622 visualization/lectures/Lecture 3_intro_t

- Write data to other file formats:
  http://www.cookbook-r.com/Data_input_and_output/Writing_data_to_a_file/
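
To make the point about row names concrete, here is a small sketch that writes the same data frame twice and reads both files back. The data frame demo_df and the file names are illustrative assumptions, and the files go to a temporary directory rather than the working directory:

```r
# Small illustrative data frame
demo_df <- data.frame(team = c("A", "B"), first = c(92, 89), second = c(70, 73))

# Default behaviour: row names are written as an extra first column
with_rn_path <- file.path(tempdir(), "with_row_names.csv")
write.csv(demo_df, with_rn_path)

# row.names = FALSE writes only the data columns
without_rn_path <- file.path(tempdir(), "without_row_names.csv")
write.csv(demo_df, without_rn_path, row.names = FALSE)

# Read both files back and compare the number of columns
with_rn <- read.csv(with_rn_path)
without_rn <- read.csv(without_rn_path)
ncol(with_rn)
ncol(without_rn)
```

The file written with the default settings comes back with one more column than the original data frame.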

Basic Operation - Install packages
Two ways to install a package:
1. From the drop-down menu in RStudio:

2. Using a command:
# Download and install packages from CRAN-like repositories or from local files
install.packages(c("ggplot2","tidyr","dplyr"))
# Always load a package before calling its functions:
library(ggplot2)
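
Since a package only needs to be installed once, a script can first check what is missing. This is one possible sketch, not a required pattern; the variable names are illustrative, and the install call is left commented out so that running the snippet changes nothing:

```r
# Packages used in these slides
needed <- c("ggplot2", "tidyr", "dplyr")

# requireNamespace() returns TRUE when a package is installed, without attaching it
installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
to_install <- needed[!installed]

# Install only what is missing (commented out so the sketch has no side effects)
# if (length(to_install) > 0) install.packages(to_install)
to_install
```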

Basic Operation - Update packages
1. To update all your installed packages to the latest versions available:

update.packages()

2. To store your R code, always create an R script:

3. Export your images to pdf/png format:
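
Plots can also be exported from code by opening a graphics device, drawing, and closing the device. A minimal sketch using the built-in pdf() device; the file name and size are arbitrary choices, and png() works the same way on systems where that device is available:

```r
# Write the plot to a temporary PDF file (width and height are in inches)
out_file <- file.path(tempdir(), "iris_scatter.pdf")
pdf(out_file, width = 6, height = 4)

# Everything drawn now goes to the file instead of the Plots pane
plot(Sepal.Width ~ Sepal.Length, data = iris)

# Close the device so the file is completed
dev.off()
file.exists(out_file)
```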

Getting Started
R programming style

- R is case sensitive: a and A are two different objects.
- The assignment symbol is <-. Alternatively, the classical = symbol
  can be used.
- The symbol # starts a comment that runs to the end of the line:

# This is a comment
# Assigning value 1 to object a.
# The two following statements are equivalent:
a <- 1
a = 1

Data Structure
1. Vector
2. Matrix
3. Array
4. Data Frame
5. List

http://venus.ifca.unican.es/Rintro/dataStruct.html

Data Structure - Variable
Like most other languages, R lets you assign values to variables and refer
to them by name:
x <- 1
# x gets 1
y <- 2
# c(...): a generic function which combines values into a vector
z <- c(x,y)
# evaluate z to see what's stored as z
z

## [1] 1 2

Notice that the substitution is done at the time that the value is assigned
to z, not the time that z is evaluated:
y <- 5
z

## [1] 1 2

Data Structure - Vector
Fetch element(s) by location in a vector:

a <- c(1,2,3,4,5,6,7,8)
a

## [1] 1 2 3 4 5 6 7 8

# fetch the 5th item in vector a:
a[5]

## [1] 5

# fetch items 1 through 6:
a[1:6]

## [1] 1 2 3 4 5 6

# fetch items 1, 3 and 7:
a[c(1,3,7)]

## [1] 1 3 7
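
Beyond positive positions, vectors can be indexed with negative numbers and with logical conditions. A short sketch on the same vector:

```r
a <- c(1, 2, 3, 4, 5, 6, 7, 8)

# Negative indices drop elements: everything except the 1st item
all_but_first <- a[-1]
all_but_first

# A logical condition keeps the elements for which it is TRUE
large <- a[a > 5]
large

# Conditions can be combined with other functions, here counting the even values
n_even <- length(a[a %% 2 == 0])
n_even
```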

Data Structure - Array
- In R, you can construct more complicated data structures than just
  vectors.
- An array object is just a vector that’s associated with a dimension
  attribute.

# Define an array
a <- array(c(1, 2, 3, 4, 5, 6, 7, 8), dim=c(2, 4))
a

## [,1] [,2] [,3] [,4]
## [1,] 1 3 5 7
## [2,] 2 4 6 8

# fetch one cell in array a:
a[2,3]

## [1] 6

# fetch the 1st row only:
a[1,]

## [1] 1 3 5 7
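
The statement that an array is a vector plus a dimension attribute can be checked directly. A small sketch; the object names v and b are illustrative:

```r
# Start from a plain vector ...
v <- 1:8
# ... and attach a dimension attribute: it now behaves as a 2 x 4 array
dim(v) <- c(2, 4)
cell <- v[2, 3]   # row 2, column 3
cell

# A third dimension gives a stack of matrices (2 rows x 2 columns x 2 layers)
b <- array(1:8, dim = c(2, 2, 2))
layer_cell <- b[1, 2, 2]   # row 1, column 2, layer 2
layer_cell
```

Values fill the array column by column, which is why v[2, 3] is 6 and b[1, 2, 2] is 7.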

Data Structure - Data frame
- A data frame is a list that contains multiple named vectors of the
  same length.
- Like a spreadsheet or a database table, it is particularly good for
  representing experimental data.
# data.frame() creates data frames
team <- c("A","B","C","D","E")
first <- c(92, 89, 94, 72, 59)
second <- c(70, 73, 77, 90, 102)
mydf <- data.frame(team, first, second)
mydf

## team first second
## 1 A 92 70
## 2 B 89 73
## 3 C 94 77
## 4 D 72 90
## 5 E 59 102

# refer to the components of a data frame by name:
mydf$team

## [1] A B C D E
## Levels: A B C D E
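
The "Levels:" line in the output above appears because team was stored as a factor. Since R 4.0.0, data.frame() defaults to stringsAsFactors = FALSE, so a current installation would print a plain character vector instead. The sketch below makes the choice explicit; the names df_factor and df_char are illustrative:

```r
team  <- c("A", "B", "C", "D", "E")
first <- c(92, 89, 94, 72, 59)

# Explicitly request factors: this matches the "Levels:" output shown above
df_factor <- data.frame(team, first, stringsAsFactors = TRUE)
class(df_factor$team)

# Explicitly keep strings as character vectors (the default since R 4.0.0)
df_char <- data.frame(team, first, stringsAsFactors = FALSE)
class(df_char$team)

# Rows can be selected with a logical condition on a column
high <- df_char[df_char$first > 90, ]
high
```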

Data Structure - List
- R has a built-in data type for mixing objects of different types, called
  lists.

# The list() function constructs R lists.
# Example: a list containing two character vectors and a data frame
e <- list(thing=c("hat","shoes"), size=c("8.25","5"), myData=mydf)
e

## $thing
## [1] "hat" "shoes"
##
## $size
## [1] "8.25" "5"
##
## $myData
## team first second
## 1 A 92 70
## 2 B 89 73
## 3 C 94 77
## 4 D 72 90
## 5 E 59 102

Data Structure - List (Cont.)

# fetch the 1st item in the list:
e$thing

## [1] "hat" "shoes"

e[1]

## $thing
## [1] "hat" "shoes"

# fetch the 1st row in the data frame,
# which is the third component in the list:
e$myData[1,]

## team first second
## 1 A 92 70
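
The two ways of fetching the first item shown above are not quite the same: single brackets return a smaller list, while $ and double brackets return the component itself. A short sketch with a cut-down version of the list:

```r
e <- list(thing = c("hat", "shoes"), size = c("8.25", "5"))

# Single brackets return a list of length 1 that still wraps the component
single <- e[1]
class(single)

# Double brackets (or $) return the component itself, here a character vector
double <- e[[1]]
class(double)

# Element-level access therefore goes through [[ ]] or $ first
second_thing <- e[["thing"]][2]
second_thing
```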

Data Structure - Get Info about structure
# Some sample variables:
n <- 1:4
let <- LETTERS[1:4]
let

## [1] "A" "B" "C" "D"

df <- data.frame(n, let)
df

## n let
## 1 1 A
## 2 2 B
## 3 3 C
## 4 4 D

# Get information about structure
str(df)

## 'data.frame': 4 obs. of 2 variables:
## $ n : int 1 2 3 4
## $ let: Factor w/ 4 levels "A","B","C","D": 1 2 3 4

Data Structure - Get Info about structure

# Get the length of a vector
length(n)

## [1] 4

# Number of rows
nrow(df)

## [1] 4

# Number of columns
ncol(df)

## [1] 2

# Get the number of rows and columns
dim(df)

## [1] 4 2

Data Exploration
“Happy families are all alike; every unhappy family is unhappy in its own
way.” Leo Tolstoy

“Tidy datasets are all alike, but every messy dataset is messy in its own
way.” Hadley Wickham

Source: Hadley Wickham. http://r4ds.had.co.nz/tidy-data.html


Working with NA and NaN
There are some special values in R:
- NA: Not Available (i.e., missing values)
- NaN: Not a Number
- Inf: Infinity
- -Inf: Minus Infinity

# For instance:
0/0

## [1] NaN

1/0

## [1] Inf

# Here's how to test whether a variable has one of these values:
y <- NA
# Is y NA?
is.na(y)

## [1] TRUE
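
The test functions treat these special values differently, which matters when cleaning data. A small sketch on an illustrative vector:

```r
vals <- c(1, NA, NaN, Inf, -Inf)

# is.na() is TRUE for both NA and NaN
na_flags <- is.na(vals)
na_flags

# is.nan() is TRUE only for NaN
nan_flags <- is.nan(vals)
nan_flags

# is.finite() is FALSE for NA, NaN, Inf and -Inf
finite_flags <- is.finite(vals)
finite_flags
```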

Working with NA and NaN
Ignoring "bad" values in vector summary functions:
- If you run functions like mean() or sum() on a vector or data frame
  containing NA or NaN, they return NA or NaN (a "bad" value).
- Many of these functions take the argument na.rm, which tells them to
  ignore these values:
df1 <- c(1, 2, 3, NA, 5)
mean(df1)

## [1] NA

mean(df1, na.rm=TRUE)

## [1] 2.75

df2 <- c(1, 2, 3, NaN, 5)


sum(df2)

## [1] NaN

sum(df2, na.rm=TRUE)

## [1] 11
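
For a data frame, the same na.rm argument is available in the column-wise helpers. A minimal sketch with colMeans(); the data frame and its values are made up for illustration:

```r
scores <- data.frame(hw1 = c(90, NA, 70), hw2 = c(80, 85, NaN))

# Without na.rm the missing values propagate into each column mean
raw_means <- colMeans(scores)
raw_means

# With na.rm = TRUE each column is averaged over its available values only
clean_means <- colMeans(scores, na.rm = TRUE)
clean_means
```

Here hw1 is averaged over 90 and 70, and hw2 over 80 and 85.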

Example: Import Data
library(readr)
HW <- read_csv("dataSets/Student_List_HW.csv")
HW<-as.data.frame(HW)
summary(HW)

## Last_Name First_Name Status
## Length:20 Length:20 Length:20
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
##
## Home Homework_1 Homework_2 Homework_3
## Length:20 Min. :58.00 Min. :77.00 Min. : 80.00
## Class :character 1st Qu.:70.50 1st Qu.:80.00 1st Qu.: 85.50
## Mode :character Median :74.50 Median :88.00 Median : 90.50
## Mean :77.39 Mean :87.35 Mean : 90.90
## 3rd Qu.:84.25 3rd Qu.:93.00 3rd Qu.: 98.25
## Max. :99.00 Max. :99.00 Max. :100.00
## NA's :2

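Example: Replace Missing Values
# Replace missing Homework_1 scores with 0
HW$Homework_1[is.na(HW$Homework_1)] <- 0
# Set the home state for the student named Garcia
HW$Home[which(HW$Last_Name == "Garcia")] <- "NJ"
# Label any remaining missing home values as "Unknown"
HW$Home[is.na(HW$Home)] <- "Unknown"
# Keep only the rows that have no remaining NA
HW <- HW[complete.cases(HW), ]
summary(HW)

## Last_Name First_Name Status
## Length:18 Length:18 Length:18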
## Class :character Class :character Class :character
## Mode :character Mode :character Mode :character
##
##
##
## Home Homework_1 Homework_2 Homework_3
## Length:18 Min. : 0.00 Min. :77.00 Min. : 80.00
## Class :character 1st Qu.:66.75 1st Qu.:80.00 1st Qu.: 86.25
## Mode :character Median :74.50 Median :86.00 Median : 90.50
## Mean :70.28 Mean :86.39 Mean : 91.33
## 3rd Qu.:84.25 3rd Qu.:91.75 3rd Qu.: 98.75
## Max. :99.00 Max. :98.00 Max. :100.00
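
The same three steps (count, replace, drop) can be tried on a small stand-alone table. The records below are hypothetical and are not the course data set:

```r
# Hypothetical records, not the course data set
toy <- data.frame(
  name  = c("Ann", "Bob", "Cal", "Dee"),
  home  = c("NJ", NA, "PA", "NY"),
  score = c(88, 75, NA, 91),
  stringsAsFactors = FALSE
)

# How many values are missing in each column?
missing_counts <- colSums(is.na(toy))
missing_counts

# Replace a missing label with an explicit category
toy$home[is.na(toy$home)] <- "Unknown"

# Drop any row that still contains an NA (here: the row with the missing score)
toy_complete <- toy[complete.cases(toy), ]
nrow(toy_complete)
```

Counting missing values before replacing or dropping them helps to see how many rows each decision affects.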

Subset Observations (Rows)

Source: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf

Subset Observations (Rows) (Cont.)

# load dplyr
library(dplyr)
Subset_HW_1 <- filter(HW, Status == "Master")
head(Subset_HW_1)

## Last_Name First_Name Status Home Homework_1 Homework_2 Homework_3
## 1 Brown Susan Master NJ 74 88 98
## 2 Wilson Karen Master NJ 0 93 84
## 3 Moore Nancy Master PA 74 91 89
## 4 Taylor Betty Master GA 93 92 88
## 5 Anderson Anthony Master CA 96 98 100
## 6 Thomas Donald Master NJ 82 77 96

Subset Variables (Columns)

There are many ways to choose columns.
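
A few of the column-selection options in dplyr, sketched on a one-row table. The table is hypothetical and only mimics the layout of the homework data:

```r
library(dplyr)

# Hypothetical columns that mimic the layout of the homework data
toy <- data.frame(Last_Name = "Smith", First_Name = "Pat", Status = "PhD",
                  Homework_1 = 82, Homework_2 = 97)

select(toy, Last_Name, Status)         # by name
select(toy, starts_with("Homework"))   # by prefix
select(toy, ends_with("Name"))         # by suffix
select(toy, Homework_1:Homework_2)     # a range of adjacent columns
select(toy, -Status)                   # everything except one column
```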

Subset Variables (Columns) Cont.

Subset_HW_2 <- select(HW, contains("Name"), contains("Homework"))
head(Subset_HW_2)

## Last_Name First_Name Homework_1 Homework_2 Homework_3
## 1 Smith Patricia 82 97 82
## 2 Johnson Jennifer 0 77 99
## 3 Williams Robert 99 80 80
## 4 Jones Michael 75 82 86
## 5 Brown Susan 74 88 98
## 7 Miller Richard 85 78 82

Subset Observations (Rows) and Variables (Columns)

Subset_HW_3 <- subset(HW, Status == "Master",
                      select = c("Last_Name", "First_Name",
                                 "Homework_1", "Homework_2", "Homework_3"))
head(Subset_HW_3)

## Last_Name First_Name Homework_1 Homework_2 Homework_3
## 5 Brown Susan 74 88 98
## 8 Wilson Karen 0 93 84
## 9 Moore Nancy 74 91 89
## 10 Taylor Betty 93 92 88
## 11 Anderson Anthony 96 98 100
## 12 Thomas Donald 82 77 96

Pipe Operator

Piping makes code more readable and allows us to chain several actions,
such as sorting, filtering, or creating a variable, in a single statement.
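
To see what the pipe changes, compare the same three steps written as nested calls and as a pipeline. The table and object names are hypothetical:

```r
library(dplyr)

# Hypothetical scores
toy <- data.frame(name = c("Ann", "Bob", "Cal"), score = c(70, 95, 85))

# Nested calls are read from the inside out
nested <- head(arrange(filter(toy, score > 80), desc(score)), 1)

# The piped version lists the same steps in the order they happen
piped <- toy %>%
  filter(score > 80) %>%
  arrange(desc(score)) %>%
  head(1)

piped
```

Both versions return the same single row; the pipe only changes how the steps are written.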

Pipe Operator Cont.

HW %>%
  filter(Status == "Master") %>%
  select(contains("Name"), contains("Homework")) %>%
  arrange(desc(Homework_1)) %>%
  head()

## Last_Name First_Name Homework_1 Homework_2 Homework_3
## 1 Anderson Anthony 96 98 100
## 2 Taylor Betty 93 92 88
## 3 Garcia Linda 93 91 100
## 4 Thomas Donald 82 77 96
## 5 Brown Susan 74 88 98
## 6 Moore Nancy 74 91 89

Create New Columns and Re-order
The mutate() function will add new columns to the data frame.
Arrange or re-order rows using arrange().

HW_update <- HW %>%
  filter(Status != "Unknown") %>%
  mutate(Homework_Average = 0.2*Homework_1 + 0.3*Homework_2 + 0.5*Homework_3) %>%
  arrange(desc(Homework_Average))
head(HW_update)

## Last_Name First_Name Status Home Homework_1 Homework_2
## 1 Anderson Anthony Master CA 96 98
## 2 Garcia Linda Master NJ 93 91
## 3 Wang Thomas PhD CHINA 72 98
## 4 Martin Morgan Undergraduate NJ 72 88
## 5 Brown Susan Master NJ 74 88
## 6 Taylor Betty Master GA 93 92
## Homework_3 Homework_Average
## 1 100 98.6
## 2 100 95.9
## 3 95 91.3
## 4 99 90.3
## 5 98 90.2
## 6 88 90.2

Split-Apply-Combine
Idea: split up a big problem into manageable pieces, apply a function to
each piece, and then combine all the pieces together.

Example: average Y within each group of X.

Input data:        X = A, A, B, B, C, C;  Y = 2, 4, 0, 5, 5, 10
Split (by X):      A: 2, 4      B: 0, 5      C: 5, 10
Apply (average):   A: 3         B: 2.5       C: 7.5
Combine:           X  Y
                   A  3
                   B  2.5
                   C  7.5
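
The three stages can be written out one by one in base R using the toy data from the diagram. This is a sketch for illustration; base R's split() and sapply() are used here to make each stage visible, and the object names are arbitrary:

```r
# The toy data from the diagram
d <- data.frame(X = c("A", "A", "B", "B", "C", "C"),
                Y = c(2, 4, 0, 5, 5, 10))

# Split: one vector of Y values per level of X
pieces <- split(d$Y, d$X)

# Apply: average each piece
averages <- sapply(pieces, mean)

# Combine: collect the results in a single data frame
result <- data.frame(X = names(averages), Y = unname(averages))
result
```

The result has one row per group: A 3, B 2.5 and C 7.5.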

Group Data
Implement group operations in the “split-apply-combine” concept:

Group Data

Group_Summarise_HW <- HW %>%
  filter(Status != "Unknown") %>%
  mutate(Homework_Average = 0.2*Homework_1 + 0.3*Homework_2 + 0.5*Homework_3) %>%
  group_by(Status) %>%
  summarise(Homework_Average = mean(Homework_Average),
            Number_of_Student = length(Status)) %>%
  arrange(desc(Homework_Average))
head(Group_Summarise_HW)

## # A tibble: 3 x 3
## Status Homework_Average Number_of_Student
## <chr> <dbl> <int>
## 1 Master 87.4 8
## 2 PhD 86.4 2
## 3 Undergraduate 83.7 8

Reshape Data
Let's change the layout of a data set using tools from the tidyr library.

Source: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf

Reshape Data (Cont.)

- gather() makes "wide" data longer
- unite() combines two variables into one variable

# load tidyr
library(tidyr)
tidyr_HW <- HW %>%
  unite(Name, First_Name, Last_Name, sep = " ") %>%
  select(-c(Status, Home)) %>%
  gather(Homework, Score, Homework_1:Homework_3)
head(tidyr_HW)

## Name Homework Score
## 1 Patricia Smith Homework_1 82
## 2 Jennifer Johnson Homework_1 0
## 3 Robert Williams Homework_1 99
## 4 Michael Jones Homework_1 75
## 5 Susan Brown Homework_1 74
## 6 Richard Miller Homework_1 85
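
In tidyr 1.0.0 and later, gather() still works but is superseded by pivot_longer(). A minimal sketch of the newer function on a hypothetical wide table:

```r
library(tidyr)

# Hypothetical wide table: one row per student, one column per homework
wide <- data.frame(Name = c("Ann Lee", "Bob Ray"),
                   Homework_1 = c(82, 75),
                   Homework_2 = c(97, 88))

# pivot_longer() stacks the homework columns into name/value pairs
long <- pivot_longer(wide, cols = Homework_1:Homework_2,
                     names_to = "Homework", values_to = "Score")
long
```

Note that pivot_longer() keeps the rows of each student together, whereas gather() stacks the output one original column at a time.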

Merge Data
Exam <- read_csv("dataSets/Student_List_Exam.csv")
Exam <- as.data.frame(Exam)
head(Exam, 3)

## Last_Name First_Name Exam Project
## 1 Smith Patricia 77 65
## 2 Johnson Jennifer 100 96
## 3 Williams Robert 92 53

HW_update <- mutate(HW, Homework_Average =
                      0.2*Homework_1 + 0.3*Homework_2 + 0.5*Homework_3)
Merged_df <- inner_join(HW_update, Exam, by = c("Last_Name", "First_Name"))
head(Merged_df, 3)

## Last_Name First_Name Status Home Homework_1 Homework_2 Homework_3
## 1 Smith Patricia Undergraduate MD 82 97 82
## 2 Johnson Jennifer Undergraduate NY 0 77 99
## 3 Williams Robert Undergraduate NY 99 80 80
## Homework_Average Exam Project
## 1 86.5 77 65
## 2 72.6 100 96
## 3 83.8 92 53
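
An inner join keeps only the rows that have a match in both tables. The sketch below shows the difference from a left join, using base R's merge() on hypothetical tables rather than the dplyr functions and course data used above:

```r
# Hypothetical tables; "Cal" has homework but no exam record
hw   <- data.frame(name = c("Ann", "Bob", "Cal"), homework = c(90, 80, 70))
exam <- data.frame(name = c("Ann", "Bob"), exam = c(85, 95))

# Inner join: only names present in both tables
inner <- merge(hw, exam, by = "name")

# Left join: keep every row of 'hw' and fill missing exam scores with NA
left <- merge(hw, exam, by = "name", all.x = TRUE)

nrow(inner)
left
```

Comparing the row counts before and after a join is a quick way to notice records that were dropped.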

ggplot2

- ggplot2 is an R package designed for creating high-quality plots.
- ggplot2 is based on the layered grammar of graphics, which means
  that plots can be constructed layer by layer.

# You need to install the package only once
install.packages('ggplot2')
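
"Layer by layer" can be read quite literally: a plot object is built up with + and can be stored at every step. A small sketch using the built-in iris data; the labels, title and output file name are illustrative choices:

```r
library(ggplot2)

# Step 1: the data and the aesthetic mapping shared by later layers
p <- ggplot(data = iris, aes(x = Sepal.Length, y = Sepal.Width))

# Step 2: add a layer that draws the observations as points, coloured by species
p <- p + geom_point(aes(colour = Species))

# Step 3: add axis labels and a title
p <- p + labs(x = "Sepal Length", y = "Sepal Width", title = "Iris sepals")

# Save the finished plot to a temporary PDF file
out_file <- file.path(tempdir(), "iris_layers.pdf")
ggsave(out_file, plot = p, width = 6, height = 4)
```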

Composition of plots in ggplot2
Plots have two main components: 1) the data to use and 2) the type of plot.

ggplot(data=economics) + geom_point(aes(x=date, y=unemploy))

- ggplot(): basic function for plotting
- data=economics: specifies the dataset (data to use)
- geom_point(): we want points (type of plot)
- aes(): aesthetics
- x=date: specifies what goes on the X axis
- y=unemploy: specifies what goes on the Y axis

Our first official graph
library(ggplot2)
ggplot(data = iris) +
  geom_point(aes(x = Sepal.Width, y = Sepal.Length, colour = Species))

[Figure: scatter plot of Sepal.Length (y axis) against Sepal.Width (x axis), with points coloured by Species: setosa, versicolor, virginica]

Resources

1. Rob Kabacoff, “R in Action”:
   https://www.amazon.com/Action-Data-Analysis-Graphics/dp/1617291382/ref=pd_sbs_14_t_0?_encoding=UTF8&psc=1&refRID=EEBN1DRHWQ6J09Z6TTBY
2. Michael J Crawley, “The R Book”:
   http://users.humboldt.edu/ygkim/CrawleyMJ_TheRBook.pdf
3. Joseph Adler, “R in a Nutshell”:
   http://www.amazon.com/R-Nutshell-Joseph-Adler/dp/144931208X
4. Quick-R tutorial: http://www.statmethods.net/input/datatypes.html
5. Cookbook for R, Data input and output:
   http://www.cookbook-r.com/Data_input_and_output/Writing_data_to_a_file/

What have we learned?

1. Define data structures such as vector, array, list and data frame
2. Basic operations such as installing packages and importing/exporting datasets
3. Common data manipulation operations such as filtering for rows,
   selecting specific columns, re-ordering rows, adding new columns,
   summarizing data, and performing the "split-apply-combine" task
4. Draw a graph
