
Lecture 7:

Merges and functions


Ben Fanson
Simeon Lisovski
Lecture Outline
1) concatenating data.frames
2) merges/joins
3) functions



Quick review of last week
1) if-then...
trt <- 'a'   # example value so the snippet runs
if (trt == 'a') {
  print('yes')
} else {
  print('no')
}

2) for loops...
for (trt in c('a','b','c')) {
  print(trt)
}

Appends
first dataset:
Bird_id  treatment  growth_rate
1        t1         12.3
2        t2         10.3
3        t3         14.5

second dataset:
Bird_id  treatment  growth_rate
4        t1         14.3
5        t2         9.3
6        t3         15.6

appended result:
Bird_id  treatment  growth_rate
1        t1         12.3
2        t2         10.3
3        t3         14.5
4        t1         14.3
5        t2         9.3
6        t3         15.6

Merges (aka joins)
unique identifier: Bird_id

first dataset:
Bird_id  lifespan
1        45
2        34
3        40

second dataset:
Bird_id  growth_rate
1        14.3
2        9.3
3        15.6

merged result:
Bird_id  lifespan  growth_rate
1        45        14.3
2        34        9.3
3        40        15.6

append: rbind()
ds1:
treatment  growth_rate
t1         12.3
t1         14.3

ds2:
treatment  growth_rate
t2         10.3
t2         9.3

ds3:
treatment  growth_rate
t3         14.5
t3         15.6

rbind(ds1, ds2, ds3)

result:
treatment  growth_rate
t1         12.3
t1         14.3
t2         10.3
t2         9.3
t3         14.5
t3         15.6

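The rbind() slide can be reproduced with a few lines of R. This is a minimal sketch that rebuilds ds1, ds2 and ds3 from the values on the slide; the name ds_all is only an illustrative choice.

```r
# Three small datasets with identical column names (values from the slide)
ds1 <- data.frame(treatment = c("t1", "t1"), growth_rate = c(12.3, 14.3))
ds2 <- data.frame(treatment = c("t2", "t2"), growth_rate = c(10.3, 9.3))
ds3 <- data.frame(treatment = c("t3", "t3"), growth_rate = c(14.5, 15.6))

# rbind() stacks the rows; the inputs must share the same column names
ds_all <- rbind(ds1, ds2, ds3)   # ds_all is an illustrative name
print(ds_all)
nrow(ds_all)                     # 2 + 2 + 2 rows
```

If the column names differ between the inputs, rbind() stops with an error rather than guessing.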
rbind.fill()
ds1:
treatment  growth_rate
t1         12.3
t1         14.3

ds2:
treatment  growth_rate  comments
t2         10.3         good
t2         9.3          delete

rbind.fill(ds1, ds2)

result:
treatment  growth_rate  comments
t1         12.3         NA
t1         14.3         NA
t2         10.3         good
t2         9.3          delete

rbind.fill() is in the plyr package [the dplyr equivalent is bind_rows()]

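To see what filling with NA means without installing an extra package, the same result can be sketched in base R. This is an illustration of the idea only, not how rbind.fill() is implemented; with plyr installed, plyr::rbind.fill(ds1, ds2) does it in one call.

```r
# Datasets from the slide; ds1 has no 'comments' column
ds1 <- data.frame(treatment = c("t1", "t1"), growth_rate = c(12.3, 14.3),
                  stringsAsFactors = FALSE)
ds2 <- data.frame(treatment = c("t2", "t2"), growth_rate = c(10.3, 9.3),
                  comments = c("good", "delete"), stringsAsFactors = FALSE)

# Base-R sketch: add every column a dataset is missing, filled with NA
all_cols <- union(names(ds1), names(ds2))
for (col in setdiff(all_cols, names(ds1))) ds1[[col]] <- NA
for (col in setdiff(all_cols, names(ds2))) ds2[[col]] <- NA

# Now the column names match, so a plain rbind() works
ds_all <- rbind(ds1[all_cols], ds2[all_cols])
print(ds_all)
```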
combine columns: cbind()
ds1:
treatment
t1
t2
t3
t1
t2
t3

ds2:
growth_rate
12.3
10.3
14.5
14.3
9.3
15.6

cbind(ds1, ds2)

result:
treatment  growth_rate
t1         12.3
t2         10.3
t3         14.5
t1         14.3
t2         9.3
t3         15.6

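A point worth seeing in code is that cbind() pairs rows purely by position, not by any identifier. The sketch below assumes ds1 holds the treatment column and ds2 the growth_rate column; ds_both and ds_wrong are illustrative names.

```r
ds1 <- data.frame(treatment = c("t1", "t2", "t3", "t1", "t2", "t3"))
ds2 <- data.frame(growth_rate = c(12.3, 10.3, 14.5, 14.3, 9.3, 15.6))

# Rows are glued together in the order they appear
ds_both <- cbind(ds1, ds2)
print(ds_both)

# If one dataset is sorted differently, cbind() still runs without
# complaint, but the rows no longer belong together
ds2_sorted <- ds2[order(ds2$growth_rate), , drop = FALSE]
ds_wrong <- cbind(ds1, ds2_sorted)
print(ds_wrong)
```

This is why a merge on a unique identifier is usually the safer way to combine columns.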
Types of Common Merges (aka joins)
Join type: Inner Join, Left Outer Join, or Full Outer Join
Method: One-to-One, One-to-Many, or Many-to-Many
[figure: ds1 and ds2, one diagram per join type]

jargon: left and right datasets
Left (called 'x' in R):
id  var2  var3
1   a     b
2   a     b
3   a     b

Right (called 'y' in R):
id  var4  var5
1   c     d
2   c     d
3   c     d

Inner Joins
left:
Bird_id  lifespan
1        45
2        34
3        40
4        50

right:
Bird_id  growth_rate
1        14.3
2        9.3
5        12.3

merge( left, right, by='Bird_id' )

result:
Bird_id  lifespan  growth_rate
1        45        14.3
2        34        9.3

Inner Joins
left:
Bird_id  trt  lifespan
1        A    45
2        A    34
3        B    40
4        B    50

right:
Bird_id  trt  growth_rate
1        A    14.3
2        A    9.3
5        B    12.3

merge( left, right, by=c('Bird_id','trt') )

result:
Bird_id  trt  lifespan  growth_rate
1        A    45        14.3
2        A    34        9.3

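The two-key inner join can be run directly. The sketch below rebuilds left and right from the slide and, for comparison, shows what happens when only Bird_id is given in by=; both_keys and id_only are illustrative names.

```r
left  <- data.frame(Bird_id = c(1, 2, 3, 4), trt = c("A", "A", "B", "B"),
                    lifespan = c(45, 34, 40, 50))
right <- data.frame(Bird_id = c(1, 2, 5), trt = c("A", "A", "B"),
                    growth_rate = c(14.3, 9.3, 12.3))

# Rows must agree on BOTH Bird_id and trt to be kept
both_keys <- merge(left, right, by = c("Bird_id", "trt"))
print(both_keys)

# Merging on Bird_id alone: trt exists in both inputs, so merge()
# keeps two copies and renames them trt.x (left) and trt.y (right)
id_only <- merge(left, right, by = "Bird_id")
names(id_only)
```

Unexpected .x and .y columns in the output are a common hint that a key column was left out of by=.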
left outer join
left:
Bird_id  lifespan
1        45
2        34
3        40
4        50

right:
Bird_id  growth_rate
1        14.3
2        9.3
5        12.3

merge(left, right, by='Bird_id', all.x=TRUE)

result:
Bird_id  lifespan  growth_rate
1        45        14.3
2        34        9.3
3        40        NA
4        50        NA

full outer join
left:
Bird_id  lifespan
1        45
2        34
3        40
4        50

right:
Bird_id  growth_rate
1        14.3
2        9.3
5        12.3

merge(left, right, by='Bird_id', all=TRUE)

result:
Bird_id  lifespan  growth_rate
1        45        14.3
2        34        9.3
3        40        NA
4        50        NA
5        NA        12.3

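Running the three join types side by side on the same inputs makes the difference easy to see. This sketch uses the left and right tables from the slides; the result names are illustrative.

```r
left  <- data.frame(Bird_id = c(1, 2, 3, 4), lifespan = c(45, 34, 40, 50))
right <- data.frame(Bird_id = c(1, 2, 5), growth_rate = c(14.3, 9.3, 12.3))

inner_join <- merge(left, right, by = "Bird_id")                # ids in both
left_join  <- merge(left, right, by = "Bird_id", all.x = TRUE)  # every left id
full_join  <- merge(left, right, by = "Bird_id", all = TRUE)    # every id

# Row counts: 2 (inner), 4 (left outer), 5 (full outer)
c(inner = nrow(inner_join), left = nrow(left_join), full = nrow(full_join))
print(full_join)
```

merge() also accepts all.y = TRUE, which keeps every row of the right dataset instead.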
One-to-One Merge
left:
id  var2  var3
1   a     b
2   a     b
3   a     b

right:
id  var4  var5
1   c     d
2   c     d
3   c     d

One-to-Many Merge
left:
id  trt  value
1   t1   123
1   t2   32
2   t1   35
3   t1   34
3   t2   12
3   t3   10

right:
id  age
1   11
2   9
3   4

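For the one-to-many case, a short sketch with the slide's values shows that each row of right is reused for every matching row of left. The name merged is illustrative.

```r
left  <- data.frame(id = c(1, 1, 2, 3, 3, 3),
                    trt = c("t1", "t2", "t1", "t1", "t2", "t3"),
                    value = c(123, 32, 35, 34, 12, 10))
right <- data.frame(id = c(1, 2, 3), age = c(11, 9, 4))

# Each age is copied onto every left row with the same id
merged <- merge(left, right, by = "id")
print(merged)
nrow(merged) == nrow(left)   # TRUE: the 'many' side sets the row count
```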
Many-to-Many Merge
left:
id  trt  value
1   t1   123
1   t2   32
2   t1   35
2   t2   23

right:
id  age
1   9
1   11
2   4
2   5

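In a many-to-many merge every left row is paired with every right row that shares its id, so the output can be larger than either input. A minimal sketch with the slide's values (merged is an illustrative name):

```r
left  <- data.frame(id = c(1, 1, 2, 2), trt = c("t1", "t2", "t1", "t2"),
                    value = c(123, 32, 35, 23))
right <- data.frame(id = c(1, 1, 2, 2), age = c(9, 11, 4, 5))

# 2 left rows x 2 right rows for each id, and there are 2 ids
merged <- merge(left, right, by = "id")
print(merged)
nrow(merged)   # 8, although each input has only 4 rows
```

A result with more rows than expected is often the first visible sign of an unintended many-to-many merge.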
Why does it matter to think about one-to-one, one-to-many, ...?
1) Merges can indicate that something is not quite right in your datasets

2) For instance, duplicates in a dataset:

ds1:
Bird_id  lifespan
1        45
2        34
3        40

ds2 (Bird_id 1 appears twice):
Bird_id  growth_rate
1        14.3
1        14.3
2        9.3
3        15.6

merge(ds1, ds2, by='Bird_id')

ds3:
Bird_id  lifespan  growth_rate
1        45        14.3
1        45        14.3
2        34        9.3
3        40        15.6

rule for inner join one-to-one: nrow(ds3) <= min( nrow(ds1), nrow(ds2) )
here nrow(ds1)=3 and nrow(ds2)=4, but nrow(ds3)=4, so the rule is broken

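The row-count rule is easy to check in code. The sketch below rebuilds ds1 and ds2 from the slide, tests the rule, finds the duplicate and merges again after removing it; ds2_clean and ds3_clean are illustrative names.

```r
ds1 <- data.frame(Bird_id = c(1, 2, 3), lifespan = c(45, 34, 40))
ds2 <- data.frame(Bird_id = c(1, 1, 2, 3),
                  growth_rate = c(14.3, 14.3, 9.3, 15.6))

ds3 <- merge(ds1, ds2, by = "Bird_id")

# One-to-one inner join rule: the output cannot exceed the smaller input
rule_ok <- nrow(ds3) <= min(nrow(ds1), nrow(ds2))
rule_ok                              # FALSE: 4 rows > 3 rows

# Locate the offending rows in the identifier
dup_rows <- which(duplicated(ds2$Bird_id))
dup_rows                             # row 2 repeats Bird_id 1

# Drop exact duplicate rows, then merge again
ds2_clean <- ds2[!duplicated(ds2), ]
ds3_clean <- merge(ds1, ds2_clean, by = "Bird_id")
nrow(ds3_clean)                      # 3
```

Dropping rows is only appropriate when the duplicates are exact copies; duplicated ids with different values need a closer look at the data.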
Common Merge Mistakes
1) not using a 'by=' option [best practice is to always set it; otherwise R guesses by merging on every column name the two datasets share]

2) duplicates in the 'unique' identifier, leading to a many-to-many merge when expecting a one-to-many
check with e.g. which(duplicated(ds$id))

3) unique identifiers are not exactly the same
e.g. 'Burt' vs. 'burt' [make sure your dataset is clean]

4) failing to check nrow(output_ds) to see if the merge is doing what you think

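Mistake 3 can be sketched as follows. The datasets are hypothetical (only 'Burt' and 'burt' come from the slide; the second name, the trailing space and all values are invented for illustration), and lower-casing plus trimming is just one possible cleaning step.

```r
# Hypothetical data: identifiers differ in case and trailing whitespace
left  <- data.frame(name = c("Burt", "Ernie"), lifespan = c(45, 34),
                    stringsAsFactors = FALSE)
right <- data.frame(name = c("burt", "Ernie "), growth_rate = c(14.3, 9.3),
                    stringsAsFactors = FALSE)

# No identifier matches exactly, so the inner join is empty
raw_merge <- merge(left, right, by = "name")
nrow(raw_merge)      # 0

# Clean both identifiers the same way before merging
left$name  <- tolower(trimws(left$name))
right$name <- tolower(trimws(right$name))

clean_merge <- merge(left, right, by = "name")
nrow(clean_merge)    # 2
```

Checking nrow() before and after, as in mistake 4, is what reveals the problem here: merge() itself gives no warning.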
Writing functions
making user-defined functions is an R strength
so far, we have seen lots of pre-defined functions
e.g. mean(), sum(), select(), summarise(), rnorm()

writing your own
ownFunction <- function(x){ print(x) }

ownFunction: function name
x: argument(s)
print(x): tells the function what to do

multiple arguments [ function(argument1, argument2) ]
use {} for multiple lines of code
printResult <- function(part1, part2){
  paste('Result =', part1, part2)
}

printResult(part1=3.01, part2='mm')

default arguments
printResult <- function(part1=3.01, part2='mm'){
  paste('Result =', part1, part2)
}

printResult()   # uses default values for both part1 and part2

default arguments
printResult <- function(part1=3.01, part2='mm'){
  paste('Result =', part1, part2)
}

printResult(part1=c(1.33,2.34,4.35) )   # uses default values for just part2

'...' argument [generic argument]
printResult <- function(part1='3.01', part2='mm', ...){
  paste('Result =', part1, part2, ...)
}

printResult(part1=c(1.33,2.34,4.35), sep='#' )

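To make the effect of defaults and '...' concrete, the calls can be run and their results stored. This sketch uses the printResult() definition from the slides with a numeric default for part1; the r_* names are illustrative.

```r
printResult <- function(part1 = 3.01, part2 = "mm", ...) {
  # Anything passed through '...' is handed on to paste()
  paste("Result =", part1, part2, ...)
}

r_default <- printResult()                             # both defaults
r_vector  <- printResult(part1 = c(1.33, 2.34, 4.35))  # default part2 only
r_sep     <- printResult(part1 = 1.33, sep = "#")      # sep goes to paste()

r_default   # "Result = 3.01 mm"
r_vector    # one string per element of part1
r_sep       # "Result =#1.33#mm"
```

Because sep is not an argument of printResult() itself, it only works through '...'; without the '...' the call would fail with an unused-argument error.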
Global vs. Local variables
- any object created outside a function is global

- any object created within a function is local and will be deleted after the function is run

Local variables
addAmounts <- function(x1, x2){
  total_amount <- x1 + x2
  print(total_amount)
}
addAmounts(x1=10,x2=10)
print(total_amount)   # error: total_amount only existed inside the function

Global vs. Local variables
total_amount <- 10 # global variable
addAmounts <- function(x1, x2){
  total_amount <- x1 + x2 # local variable
  print(total_amount)
}
addAmounts(x1=10,x2=10)
total_amount   # still 10: the function only changed its local copy

use return() to pass a local variable back out of the function
addAmounts <- function(x1, x2){
  total_amount <- x1 + x2 # local variable
  print(total_amount)
  return(total_amount) # returns the local variable
}
new_amount <- addAmounts(x1=10,x2=10)
new_amount   # 20

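Functions and merges combine naturally: a function can run the merge checks from earlier in the lecture and return() the merged dataset. The helper below, mergeOneToOne(), is hypothetical and not part of the lecture material; it is a sketch for a single key column only.

```r
# Hypothetical helper: merge on one key and refuse duplicated identifiers
mergeOneToOne <- function(x, y, by) {
  if (any(duplicated(x[[by]])) || any(duplicated(y[[by]]))) {
    stop("duplicated values in '", by, "': not a one-to-one merge")
  }
  merged <- merge(x, y, by = by)   # local variable
  return(merged)                   # hand the local variable back
}

ds1 <- data.frame(Bird_id = c(1, 2, 3), lifespan = c(45, 34, 40))
ds2 <- data.frame(Bird_id = c(1, 2, 3), growth_rate = c(14.3, 9.3, 15.6))

ds3 <- mergeOneToOne(ds1, ds2, by = "Bird_id")
print(ds3)

# With a duplicated Bird_id the function stops instead of merging
ds2_dup <- rbind(ds2, ds2[1, ])
attempt <- try(mergeOneToOne(ds1, ds2_dup, by = "Bird_id"), silent = TRUE)
inherits(attempt, "try-error")   # TRUE
```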
Modularization
script 1
script 2
script 3
funcPlotting
funcStats
funcGeneric
.Rprofile
source('funcPlotting')
source('funcStats')
source('funcGeneric')
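A small sketch of how source() makes functions from another file available. The temporary file stands in for a function file such as funcStats; the .R extension and the stdError() function are assumptions made for this illustration.

```r
# Create a stand-in function file in a temporary directory
func_file <- file.path(tempdir(), "funcStats.R")
writeLines("stdError <- function(x) sd(x) / sqrt(length(x))", func_file)

# source() runs the file, which defines stdError() for the current session
source(func_file)

se <- stdError(c(12.3, 10.3, 14.5))
se
```

Keeping functions in their own files means each analysis script only needs a source() line, and a fix to a function is made in one place.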



Next Week
R plotting
1) overview of plotting in R
2) introduction to ggplot [aka grammar of graphics]
3) Weeks 9 and 10 will be an introduction to base plot (by Simeon)

Lecture 7: Hands on Section
1) get Lecture7.R from GitHub

2) get all data files in data/lecture7/

3) open up Lecture7.R in Rcourse_proj.Rproj

4) start working through the example and then try the exercise
Lecture 7 files