1 / 47
Getting Started
RStudio console
3 / 47
Agenda
1. Basic operations
2. Data structures
3. Data Manipulation
4. Your First Graph
4 / 47
Basic Operation - Import data
1. Import data from the drop-down menu in RStudio:
5 / 47
Intermediate - Import data
## install.packages(c("tseries","lubridate"))
library(tseries)
library(lubridate)
amazon <- as.data.frame(get.hist.quote("amzn",
start="2013-1-1", end="2018-9-15", quote=c("Cl")))
amazon$Date<-ymd(row.names(amazon))
tail(amazon)
## Close Date
## 2018-09-07 1952.07 2018-09-07
## 2018-09-10 1939.01 2018-09-10
## 2018-09-11 1987.15 2018-09-11
## 2018-09-12 1990.00 2018-09-12
## 2018-09-13 1989.87 2018-09-13
## 2018-09-14 1970.19 2018-09-14
6 / 47
Advanced - Import data
# list of addresses for raw data.
addressList <- list(
drives_address = "http://stats.nba.com/js/data/sportvu/drivesData.js",
defense_address = "http://stats.nba.com/js/data/sportvu/defenseData.js",
catchshoot_address = "http://stats.nba.com/js/data/sportvu/catchShootData.js")
# function that grabs the data from the website and converts to R data frame
readIt <- function(address) {
web_page <- readLines(address)
7 / 47
Advanced (Cont.) - Import data
8 / 47
Advanced: scraping the web using R
#install.packages("rvest")
library(rvest)
# Store web url
lego_movie <- read_html("http://www.imdb.com/title/tt1490017/")
#Scrape the website for the movie rating
rating <- lego_movie %>%
html_nodes("strong span") %>%
html_text() %>%
as.numeric()
#rating
# Scrape the website for the cast
cast <- lego_movie %>%
html_nodes("#titleCast .itemprop span") %>%
html_text()
#cast
https://stat4701.github.io/edav/2015/04/02/rvest_tutorial/
9 / 47
Advanced (Cont.): scraping the web using R
## [1] 7.8
## character(0)
10 / 47
Basic Operation - Export data
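Exporting can also be done from code. A minimal sketch, assuming base R's write.csv() and a temporary file path (the data frame scores and the path are hypothetical, not from the slides):

```r
# Hypothetical data frame to export (not from the slides)
scores <- data.frame(team = c("A", "B", "C"), first = c(92, 89, 94))

# Write to a temporary CSV file; row.names = FALSE omits the row-number column
out_file <- tempfile(fileext = ".csv")
write.csv(scores, out_file, row.names = FALSE)

# Read the file back to confirm the round trip
scores_back <- read.csv(out_file)
scores_back
```

Replacing tempfile() with a real path such as "scores.csv" would write the file into the current working directory.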
11 / 47
Basic Operation - Install packages
Two ways to install a package:
1. From the drop-down menu in RStudio:
2. Using a command:
# Download and install packages from CRAN-like repositories or from local files
install.packages(c("ggplot2","tidyr","dplyr"))
# Always load a package before calling its functions:
library(ggplot2)
12 / 47
Basic Operation - Update packages
1. To update all your installed packages to the latest versions available:
update.packages()
13 / 47
Getting Started
R programming style
# This is a comment
# Assigning value 1 to object a.
# The two following statements are equivalent:
a <- 1
a = 1
14 / 47
Data Structure
1. Vector
2. Matrix
3. Array
4. Data Frame
5. List
http://venus.ifca.unican.es/Rintro/dataStruct.html
15 / 47
Data Structure - Variable
Like most other languages, R lets you assign values to variables and refer
to them by name:
x <- 1
# x gets 1
y <- 2
# c(...): a generic function which combines values into a vector
z <- c(x,y)
# evaluate z to see what's stored as z
z
## [1] 1 2
Notice that the substitution is done at the time that the value is assigned
to z, not the time that z is evaluated:
y <- 5
z
## [1] 1 2
16 / 47
Data Structure - Vector
Fetch element(s) by location in a vector:
a <- c(1,2,3,4,5,6,7,8)
a
## [1] 1 2 3 4 5 6 7 8
# fetch item 5:
a[5]
## [1] 5
# fetch items 1 through 6:
a[1:6]
## [1] 1 2 3 4 5 6
# fetch items 1, 3, 7:
a[c(1,3,7)]
## [1] 1 3 7
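Besides positions, a vector can be indexed with negative numbers (to drop elements) or with a logical condition. A short sketch using the same vector a; the particular conditions are chosen only for illustration:

```r
a <- c(1, 2, 3, 4, 5, 6, 7, 8)

# Negative indices drop the given positions
a[-1]           # everything except the first item
a[-c(1, 2)]     # everything except the first two items

# A logical condition keeps the items where it is TRUE
a[a > 5]        # items greater than 5
a[a %% 2 == 0]  # even items
```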
17 / 47
Data Structure - Array
- In R, you can construct more complicated data structures than just
vectors.
- An array object is just a vector that’s associated with a dimension
attribute.
# Define an array
a <- array(c(1, 2, 3, 4, 5, 6, 7, 8), dim=c(2, 4))
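a
# assumed calls (omitted on the slide) that reproduce the results below:
a[2, 3]  # row 2, column 3
## [1] 6
a[1, ]   # first row
## [1] 1 3 5 7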
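To illustrate the point that an array is a vector plus a dimension attribute, the sketch below builds the same 2 x 4 array by setting dim() on a plain vector. The name v is illustrative and not from the slides:

```r
v <- c(1, 2, 3, 4, 5, 6, 7, 8)
dim(v) <- c(2, 4)   # v is now a 2 x 4 array, filled column by column

a <- array(c(1, 2, 3, 4, 5, 6, 7, 8), dim = c(2, 4))

identical(v, a)     # compare the two constructions
attributes(a)       # the dim attribute is what makes a an array
as.vector(a)        # dropping the attribute recovers the plain vector
```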
18 / 47
Data Structure - Data frame
- A data frame is a list that contains multiple named vectors that are
the same length.
- Like a spreadsheet or a database table, particularly good for
representing experimental data.
# data.frame() is a function that creates data frames
team <-c("A","B","C","D","E")
first <- c(92, 89, 94, 72, 59)
second <- c(70, 73, 77, 90, 102)
mydf <- data.frame(team, first, second)
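mydf
# assumed call (omitted on the slide); team prints as a factor only when
# strings are converted to factors (the default before R 4.0):
mydf$team
## [1] A B C D E
## Levels: A B C D E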
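Because a data frame is a list of equal-length columns, list-style and matrix-style indexing both work on it. A short sketch using the mydf columns defined above:

```r
team <- c("A", "B", "C", "D", "E")
first <- c(92, 89, 94, 72, 59)
second <- c(70, 73, 77, 90, 102)
mydf <- data.frame(team, first, second)

is.list(mydf)      # a data frame is a list under the hood
names(mydf)        # column names
mydf$first         # extract a column by name
mydf[["second"]]   # same idea with double brackets
mydf[2, ]          # rows are selected with [row, column] indexing
```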
19 / 47
Data Structure - List
- R has a built-in data type for mixing objects of different types, called
lists.
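The code that builds the list is missing from this slide. A minimal reconstruction, assuming the list is called e (as on the next slide) and that its components are the ones shown in the printed output that follows; mydf is the data frame from the previous slide:

```r
# Data frame from the previous slide
team <- c("A", "B", "C", "D", "E")
first <- c(92, 89, 94, 72, 59)
second <- c(70, 73, 77, 90, 102)
mydf <- data.frame(team, first, second)

# Assumed reconstruction: a list mixing character vectors and a data frame
e <- list(thing = c("hat", "shoes"), size = c("8.25", "5"), myData = mydf)
e
```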
## $thing
## [1] "hat" "shoes"
##
## $size
## [1] "8.25" "5"
##
## $myData
## team first second
## 1 A 92 70
## 2 B 89 73
## 3 C 94 77
## 4 D 72 90
## 5 E 59 102
20 / 47
Data Structure - List Cont
e[1]
## $thing
## [1] "hat" "shoes"
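Single brackets return a shorter list, while double brackets (or $) return the element itself. A brief sketch, assuming a list e with the same first two components as in the output above:

```r
# Assumed definition of e, limited to the components needed here
e <- list(thing = c("hat", "shoes"), size = c("8.25", "5"))

e[1]                 # a list of length 1 that keeps the name "thing"
e[["thing"]]         # the character vector stored under "thing"
e$thing              # shorthand for e[["thing"]]

class(e[1])
class(e[["thing"]])
```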
21 / 47
Data Structure - Get Info about structure
# Here are some sample variables for example:
n <- 1:4
let <- LETTERS[1:4]
# combine them into a data frame
df <- data.frame(n, let)
df
## n let
## 1 1 A
## 2 2 B
## 3 3 C
## 4 4 D
22 / 47
Data Structure - Get Info about structure
## [1] 4
# Number of rows
nrow(df)
## [1] 4
# Number of columns
ncol(df)
## [1] 2
# Dimensions (rows, columns)
dim(df)
## [1] 4 2
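A few more base R functions give a quick overview of an object's structure. A sketch on a data frame built from n and let, assuming it is constructed with data.frame(n, let):

```r
n <- 1:4
let <- LETTERS[1:4]
df <- data.frame(n, let)

str(df)            # compact overview: dimensions, column names, types
names(df)          # column names
class(df)          # class of the object
sapply(df, class)  # class of each column
```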
23 / 47
Data Exploration
“Happy families are all alike; every unhappy family is unhappy in its own
way.” Leo Tolstoy
“Tidy datasets are all alike, but every messy dataset is messy in its own
way.” Hadley Wickham
- Inf: Infinity
# For instance:
0/0
## [1] NaN
1/0
## [1] Inf
## [1] TRUE
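Base R provides predicates for testing these special values. A short sketch; the vector x is illustrative:

```r
x <- c(1, NA, NaN, Inf, -Inf)

is.na(x)        # TRUE for both NA and NaN
is.nan(x)       # TRUE only for NaN
is.finite(x)    # TRUE only for ordinary numbers
is.infinite(x)  # TRUE only for Inf and -Inf
```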
25 / 47
Working with NA and NaN
Ignoring "bad" values in vector summary functions:
- If you run functions like mean() or sum() on a vector or data frame
containing NA or NaN, they return NA or NaN (a "bad" value).
- Many of these functions take the flag na.rm, which tells them to
ignore these values:
df1 <- c(1, 2, 3, NA, 5)
mean(df1)
## [1] NA
mean(df1, na.rm=TRUE)
## [1] 2.75
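# df2 is not defined on the slide; assumed here to mirror df1 with NaN:
df2 <- c(1, 2, 3, NaN, 5)
sum(df2)
## [1] NaN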
sum(df2, na.rm=TRUE)
## [1] 11
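Before dropping missing values it often helps to count them. A sketch with a hypothetical data frame (scores and its values are made up for illustration):

```r
# Hypothetical scores with one NA and one NaN
scores <- data.frame(hw1 = c(90, NA, 75, 80),
                     hw2 = c(85, 70, NaN, 95))

missing_per_col <- colSums(is.na(scores))     # missing values per column
means_raw <- colMeans(scores)                 # NA/NaN propagate into the result
means_clean <- colMeans(scores, na.rm = TRUE) # missing values are ignored

missing_per_col
means_raw
means_clean
```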
26 / 47
Example: Import Data
library(readr)
HW <- read_csv("dataSets/Student_List_HW.csv")
HW<-as.data.frame(HW)
summary(HW)
27 / 47
Example: Replace Missing Values
HW$Homework_1[is.na(HW$Homework_1)]<-0
HW$Home[which(HW$Last_Name=="Garcia")]<-"NJ"
HW$Home[is.na(HW$Home)]<-"Unknown"
HW<-HW[complete.cases(HW),]
summary(HW)
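Because Student_List_HW.csv is not included here, the same pattern can be tried on a small stand-in data frame. The column names below mimic HW, and the values are made up for illustration:

```r
# Stand-in for HW with made-up values
hw_demo <- data.frame(
  Last_Name  = c("Garcia", "Lee", "Smith"),
  Home       = c(NA, "NY", NA),
  Homework_1 = c(90, NA, 85),
  Homework_2 = c(88, 92, NA),
  stringsAsFactors = FALSE
)

# Replace missing Homework_1 scores with 0
hw_demo$Homework_1[is.na(hw_demo$Homework_1)] <- 0
# Fill in a known value for one student, then label the rest "Unknown"
hw_demo$Home[hw_demo$Last_Name == "Garcia"] <- "NJ"
hw_demo$Home[is.na(hw_demo$Home)] <- "Unknown"
# Keep only rows with no remaining missing values (Smith's Homework_2 is NA)
hw_demo <- hw_demo[complete.cases(hw_demo), ]
hw_demo
```

Note that complete.cases() removes whole rows, so any column that still contains NA at that point costs an observation.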
28 / 47
Subset Observations (Rows) [2]
[2] https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
29 / 47
Subset Observations (Rows) Cont.
#load dplyr
library(dplyr)
Subset_HW_1 <- filter(HW,Status == "Master")
head(Subset_HW_1)
30 / 47
Subset Variables (Columns)
31 / 47
Subset Variables (Columns) Cont.
32 / 47
Subset Observations (Rows) and Variables (Columns)
33 / 47
Pipe Operator
Piping makes code more readable and allows us to chain several actions,
such as sorting, filtering, or creating a variable, in a single statement.
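To see the difference, the sketch below writes the same three steps once as nested calls and once with the pipe. It assumes dplyr is installed and uses the built-in iris data instead of HW:

```r
library(dplyr)

# Nested calls read inside-out
nested <- head(arrange(filter(iris, Species == "setosa"),
                       desc(Sepal.Length)), 3)

# The same steps with the pipe read top to bottom
piped <- iris %>%
  filter(Species == "setosa") %>%
  arrange(desc(Sepal.Length)) %>%
  head(3)

identical(nested, piped)
```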
34 / 47
Pipe Operator Cont.
HW %>%
filter(Status == "Master") %>%
select(contains("Name"),contains("Homework"))%>%
arrange(desc(Homework_1))%>%
head()
35 / 47
Create New Columns and Re-order
The mutate() function will add new columns to the data frame.
Arrange or re-order rows using arrange().
HW_update<-HW %>%
filter(Status != "Unknown") %>%
mutate(Homework_Average = 0.2*Homework_1+0.3*Homework_2+0.5*Homework_3)%>%
arrange(desc(Homework_Average))
head(HW_update)
36 / 47
Split-Apply-Combine
Idea: split up a big problem into manageable pieces, apply a function to
each piece and then combine all the pieces together.
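The three steps can be sketched in base R on the built-in iris data. The variable names pieces and means are illustrative:

```r
# Split: break Sepal.Length into one vector per species
pieces <- split(iris$Sepal.Length, iris$Species)

# Apply: compute the mean of each piece
# Combine: sapply() collects the results into one named vector
means <- sapply(pieces, mean)
means

# tapply() performs all three steps in one call
tapply(iris$Sepal.Length, iris$Species, mean)
```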
37 / 47
Group Data
Implement group operations in the “split-apply-combine” concept:
38 / 47
Group Data
Group_Summarise_HW<- HW %>%
filter(Status != "Unknown") %>%
mutate(Homework_Average = 0.2*Homework_1+0.3*Homework_2+0.5*Homework_3)%>%
group_by(Status) %>%
summarise(Homework_Average=mean(Homework_Average),
Number_of_Student=length(Status))%>%
arrange(desc(Homework_Average))
head(Group_Summarise_HW)
## # A tibble: 3 x 3
## Status Homework_Average Number_of_Student
## <chr> <dbl> <int>
## 1 Master 87.4 8
## 2 PhD 86.4 2
## 3 Undergraduate 83.7 8
39 / 47
Reshape Data [3]
Let's change the layout of a data set. The tools we use from the tidyr library are:
[3] https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
40 / 47
Reshape Data Cont.
#load tidyr
library(tidyr)
tidyr_HW<- HW %>% unite(Name, First_Name, Last_Name, sep = " ")%>%
select(-c(Status,Home)) %>%
gather(Homework, Score, Homework_1:Homework_3)
head(tidyr_HW)
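In tidyr 1.0.0 and later, gather() is superseded by pivot_longer(). A sketch of the equivalent wide-to-long step on a made-up data frame whose column names mimic HW:

```r
library(tidyr)

# Stand-in data with made-up values
hw_demo <- data.frame(
  Name       = c("Ana Garcia", "Bo Lee"),
  Homework_1 = c(90, 80),
  Homework_2 = c(85, 95)
)

# Wide to long: one row per student and homework
long_hw <- pivot_longer(hw_demo,
                        cols = Homework_1:Homework_2,
                        names_to = "Homework",
                        values_to = "Score")
long_hw
```

One visible difference is row order: pivot_longer() keeps each student's rows together, whereas gather() stacks the output one homework column at a time.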
41 / 47
Merge Data
Exam<- read_csv("dataSets/Student_List_Exam.csv")
Exam<-as.data.frame(Exam)
head(Exam,3)
HW_update<-mutate(HW,Homework_Average =
0.2*Homework_1+0.3*Homework_2+0.5*Homework_3)
Merged_df<-inner_join(HW_update, Exam,by=c("Last_Name","First_Name"))
head(Merged_df,3)
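inner_join() keeps only rows whose keys appear in both tables. The sketch below contrasts it with left_join() on two made-up tables (hw_demo and exam_demo are stand-ins, not the course data):

```r
library(dplyr)

hw_demo <- data.frame(Last_Name = c("Garcia", "Lee", "Smith"),
                      Homework_Average = c(88, 91, 79))
exam_demo <- data.frame(Last_Name = c("Garcia", "Lee", "Chen"),
                        Exam = c(93, 85, 90))

# inner_join keeps only students present in both tables
inner <- inner_join(hw_demo, exam_demo, by = "Last_Name")
inner

# left_join keeps every row of the first table; unmatched exams become NA
left <- left_join(hw_demo, exam_demo, by = "Last_Name")
left
```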
42 / 47
ggplot2
43 / 47
Composition of plots in ggplot2
Plots have two main components: 1) data to use and 2) type of plot.
[Figure: annotated ggplot() call labeling the basic plotting function, the dataset, the aesthetics (what goes on the X axis and on the Y axis), and the type of plot (points).]
44 / 47
Our first official graph
library(ggplot2)
ggplot(data=iris)+
geom_point(aes(x=Sepal.Width,y=Sepal.Length,colour=Species))
[Figure: scatter plot of Sepal.Length against Sepal.Width, colored by Species (setosa, versicolor, virginica).]
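Further components can be added to the same plot with +. A sketch extending the call above; the title, axis labels and theme are illustrative additions, not part of the original slide:

```r
library(ggplot2)

p <- ggplot(data = iris) +
  geom_point(aes(x = Sepal.Width, y = Sepal.Length, colour = Species)) +
  labs(title = "Iris sepal dimensions",
       x = "Sepal width (cm)",
       y = "Sepal length (cm)") +
  theme_minimal()

p  # printing the object draws the plot
```

Storing the plot in an object such as p makes it easy to add layers later or to save it with ggsave().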
45 / 47
Resources
1617291382/ref=pd_sbs_14_t_0?_encoding=UTF8&psc=1&refRID=EEBN1DRHWQ6J09Z6TTBY
46 / 47
What have we learned?
47 / 47