
Lecture 1: Overview of Workflow and Why R
Ben Fanson
Simeon Lisovski
Workshop Structure
1) 10 weeks meeting weekly
a) Tuesday, 10am, green room

2) Sessions are broken up into 2 parts
a) Lecture part (20 - 45min)

b) Hands on section (20 - 45min)

3) Informal setting (ask questions...)
Our background
Ben Fanson
• Masters in Statistics
• converted to R (2 years ago) from SAS

Simeon Lisovski
• R programmer for years
• has his own R package
Workshop Goals
1) Development of an analysis workflow


2) Become proficient in R fundamentals for data
analysis

Disclaimer
1) This is NOT a statistics workshop

2) There are many, many ways to do the same thing in R. Our goal is for you to know at least one of them.

3) We do not focus on programming efficiency

4) This workshop is a work in progress...
Lecture 1 Goals
1) convince you that you should care

2) overview of data analysis workflow

3) Why R?

4) R editors (Rstudio)

5) Getting started with R
Workflow: Reasons to Care
1) Research more reproducible

2) Efficiency

3) Trend towards submitting code

Efficiency
- ~80% of my data analysis is data cleaning, restructuring, subsetting,
and visualizing.

- Re-running analyses after finding errors, new transformation,
adding more data

- reviewers requesting a specific analysis

- easier to go back to old code and figure out your process

R script and journals
Lendvai et al. 2013 Proc Roy Soc
R code
R graphs
Earn et al. 2014 Proc Roy Soc Population-level fever
...(another 43 pages)
Analysis Workflow
1) Acquire/store Data
2) Data Preparation: Data Cleaning, Reformatting, Queries/merges
3) Analysis: Statistical Methods, Assess output, Visualize output, Tabular Summaries
4) Write-up: Tables, Figures, Reports, Datasets
Attributes of a good workflow
Transparency
- organized, logical, well documented
- e.g. commented code, structured programs, organized folders

Modularity
- Keep scripts simple (not too many tasks per script), have re-usable functions in one location
- e.g. file with project functions

Portability
- Make it easy to share scripts (collaborators, reviewers)
- e.g. relative pathnames, nested folders

http://stats.stackexchange.com/questions/2910/how-to-efficiently-manage-a-statistical-analysis-project
http://blog.revolutionanalytics.com/2010/10/a-workflow-for-r.html
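To make the modularity and portability points concrete, here is a minimal sketch. The helper `standard_error()`, the file name `project_functions.R`, and the data file name are hypothetical, and the functions file is written to a temporary folder only so that the example runs anywhere.

```r
# Hypothetical project helper; the name and behaviour are illustrative only.
functions_file <- file.path(tempdir(), "project_functions.R")
writeLines(
  "standard_error <- function(x) sd(x) / sqrt(length(x))",
  functions_file
)

# An analysis script can then load every shared helper with one line.
source(functions_file)

# file.path() builds a relative path with '/' separators on any platform,
# so the same script runs on a collaborator's machine.
data_path <- file.path("data", "measurements.csv")

standard_error(c(2, 4, 6, 8))
data_path
```

In a real project the functions file would probably live inside the project folder and be sourced through a relative path.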
Data management
Lots of systems available
Text files

Spreadsheets (e.g. Excel)

Relational databases

Good practices
Data management
1) Raw data should be read-only
a) Data should match lab notebook exactly (e.g. do not remove
outliers/suspected errors)

2) Separate tables for different response components
(e.g. morphology, behavior, physiology)
a) Keep tables simple (helps prevent data-entry errors)


3) Every data row should have unique identifier
a) Minimize repeating of information (keeps table simpler and less
likely to misspell/mis-enter info)

4) Keep track of all data files associated with a project
and have a short description of what data they
contain

5) Decide on naming conventions before collecting
data
a) e.g. 'male' vs. 'm' vs. 'Male'; 'Bactrocera tryoni' vs. 'BT' vs. 'B. tryoni' [can fix with string functions in R, but consistent names make life easier]
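Two of the practices above (unique identifiers and consistent naming) can be checked or repaired after the fact in R. The sketch below uses a made-up data frame; the column names, the values, and the rule "anything starting with m is male" are assumptions for illustration.

```r
# Made-up records; column names and values are assumptions for illustration.
flies <- data.frame(
  id  = c("F01", "F02", "F03", "F03"),
  sex = c("male", "M", " Male", "f"),
  stringsAsFactors = FALSE
)

# Practice 3: every row should have a unique identifier.
duplicated_ids <- flies$id[duplicated(flies$id)]

# Practice 5: collapse inconsistent labels to one convention.
sex_clean <- tolower(trimws(flies$sex))   # " Male" becomes "male"
sex_clean <- substr(sex_clean, 1, 1)      # keep the first letter only
flies$sex <- ifelse(sex_clean == "m", "male", "female")

duplicated_ids
flies
```

Deciding the convention before data entry avoids needing this kind of cleanup at all.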


Text files
Data management
Pros:
- Free
- Small file size
- No formatting
- Flexibility in output
Cons:
- Very difficult for entering data
- Too much flexibility
- No quality control
Mainly for computer output
Spreadsheets
Data management
Pros:
- Easy to get to software
- Medium file size
- Tabular structure
- Easy for entering data
Cons:
- Columns are not linked across a row
- Columns can contain any data type
- Information associated with formatting cannot be transferred
- Encourages non-reproducible restructuring
- Large datasets can be a mess
- No quality controls
Databases
Data management
Pros:
- Tables are defined rigorously
- Columns can have only one data type
- A row is a single record and cannot be broken up
- Tables are linked, enforcing data integrity
- Data entry forms can be created
- Quality controls
Cons:
- Steep learning curve
- Large file size
- Software can be more difficult to get
- Data entry can be more difficult than in a spreadsheet if entering directly into a table (however, creating a form is a great way to enter data)
Rcourse_project.accdb
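A database links tables through shared keys. Once tables are in R, a rough analogue of a query across linked tables is `merge()`. The sketch below uses two made-up tables; the column names and values are assumptions, and unlike a database R will not enforce the link, so unmatched records have to be checked by hand.

```r
# Two made-up tables sharing an 'id' key (names and values are illustrative).
animals <- data.frame(
  id      = c("A1", "A2", "A3"),
  species = c("Bactrocera tryoni", "Bactrocera tryoni", "Bactrocera tryoni"),
  stringsAsFactors = FALSE
)
morphology <- data.frame(
  id     = c("A1", "A2", "A4"),
  mass_g = c(10.2, 11.5, 9.8),
  stringsAsFactors = FALSE
)

# merge() keeps only the ids present in both tables by default.
merged <- merge(animals, morphology, by = "id")

# Measurements whose id has no matching animal record.
orphans <- setdiff(morphology$id, animals$id)

merged
orphans
```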
Why R and not other software
1) Complete analysis software
2) Free and open source
3) Developed (and used) by statisticians
4) Popular (lots of contributions)
(Note - the downside is that most packages are written by non-programmers, so R code is often inefficient and inconsistent, and error messages are less than informative)
5) R is a functional programming language
6) Lots of data types (e.g. image, spatial/GIS, genetics)





R vs. SAS vs. SPSS vs. MATLAB
(For a comparison, see http://stanfordphd.com/Statistical_Software.html)

[Figure: Number of Packages Over Time (>4500 packages)]
[Figure: Ranking of Statistics Software by use]
GUIs/Editors
1) R console [ default editor ] (not recommended)
2) Rstudio [we will use this one]
3) TinnR
4) Vim-R
5) Emacs
6) Others…
http://stackoverflow.com/questions/1173463/recommendations-for-windows-text-editor-for-r
Rstudio overview
Complete R interface
a) Editor with colour coding and tab autofill/help pages
b) Integrated plots
c) R console
d) History
e) Version control (e.g. git or subversion)
f) Variable list


Rstudio example…
Getting Started with R
Rprofile.site, .Rprofile
• These files are sourced when R starts up
• Rprofile.site is sourced first; it lives in R's etc folder, e.g. C:\Program Files\R\R-3.1.0\etc\
• The .Rprofile in the local (project) directory is then sourced
Getting Started with R
what to put in Rprofile.site, .Rprofile
• Packages you commonly use [ e.g. library(ggplot2) ]

• I source my file of R functions (stored in My Documents, which is included in my data backups)





- For some more tips, see http://stackoverflow.com/questions/1189759/expert-r-users-whats-in-your-rprofile
- http://www.r-bloggers.com/customize-your-rprofile-and-keep-your-workspace-clean/
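As a rough illustration of the two suggestions above, a profile file might look like the sketch below. The functions file name `Rfunctions.R` and its location in the home folder are assumptions; both steps are guarded so that startup does not fail when the package or the file is missing.

```r
# Sketch of possible .Rprofile contents; file and folder names are assumptions.

# Load a commonly used package only if it is installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  suppressPackageStartupMessages(library(ggplot2))
}

# Source personal helper functions if the (hypothetical) file exists.
my_functions <- file.path(path.expand("~"), "Rfunctions.R")
if (file.exists(my_functions)) {
  source(my_functions)
}

# Confirm on startup that the profile ran.
message("Profile loaded from ", getwd())
```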
R scripts
Setting up your Script
1. Header describing file

2. Global variables
a) e.g. db_dir <- 'C:/'
db_file <- 'database.accdb' # database location and file name, used by my personal Access function

3. Directories [best to use relative pathname]
a) setwd() # set the working directory
b) getwd() # get current working directory
c) list.files(), dir() # get a list of files in a directory

4. Source file with functions for project [if you did not use .Rprofile]


[Screenshot: example script with its Header, Global settings, and Section 1 labeled]
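The four-part layout above could look something like the following skeleton. All names and values (`data_dir`, `project_functions.r`, the header text) are placeholders, not the workshop's actual script.

```r
# ------------------------------------------------------------
# File:    lecture1.r
# Purpose: skeleton showing the four-part script layout
# ------------------------------------------------------------

# 2. Global variables (placeholder values)
data_dir <- "data"
functions_file <- file.path("R programs", "project_functions.r")

# 3. Directories: report the working directory and list input files
cat("Working directory:", getwd(), "\n")
input_files <- list.files(data_dir)   # character(0) if nothing is found

# 4. Source project functions, if the file exists
if (file.exists(functions_file)) source(functions_file)
```

Keeping the globals at the top means a collaborator only has to edit one place to run the script on another machine.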
R Resources
Online
• Google
• Stack Overflow (general programming - stackoverflow.com)
• Cross Validated (statistics - stats.stackexchange.com/)
• Quick-R (http://www.statmethods.net/)

Books (plenty out there to choose from; lots of redundancy)
• R Cookbook
• The R Book (the big book)
• Data Analysis and Graphics Using R
• A Beginner's Guide to R (by Zuur)

(see http://www.r-bloggers.com/r-programming-books-updated/ for more suggestions)
Next Week
R Core Concepts
R as matrix language
R objects and classes
Importing data
Lecture 1: Hands on Section
Setup directories for Rcourse_proj
1) create folder '/Rcourse_proj' [pick wherever you want]


2) create folder '/Rcourse_proj/R programs'


3) create folder '/Rcourse_proj/data'

Create Rstudio project
1) Open Rstudio


2) Create new project associated with directory /Rcourse_proj

Initialize .Rprofile
Create .Rprofile
- In Windows, this is slightly harder than it should be. Open Rcourse_proj and paste the following code (make sure the working directory is Rcourse_proj)…
sink('.Rprofile') # open a new file called .Rprofile in the working directory
sink() # close it again, leaving an empty file

Add the following text to .Rprofile (open the file in a text editor):
printResult <- function(){print('Yes, this worked') }
printResult()
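An alternative sketch, not the workshop's method, is to write the two lines straight into the file with `writeLines()`. The example writes to a temporary folder so it cannot overwrite an existing profile; for real use the path would be '.Rprofile' in the project folder.

```r
# Lines to place in the profile (taken from the text above).
profile_lines <- c(
  "printResult <- function() { print('Yes, this worked') }",
  "printResult()"
)

# Temporary location used only so the example is safe to run anywhere.
profile_path <- file.path(tempdir(), ".Rprofile")
writeLines(profile_lines, profile_path)

# Check what was written, then run it the way R would at startup.
readLines(profile_path)
source(profile_path)
```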
Create an R script
Make a new R script

Save as 'R programs/lecture1.r'

Load a library/package
First install the package from CRAN (R website)...
install.packages('ggplot2') # can use Tools>Install packages... in Rstudio


Next load the package...
library(ggplot2) # or require(ggplot2)
example(ggplot) # check that it works
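The practical difference between `library()` and `require()` shows up when a package is missing. The sketch below uses a made-up package name, assumed not to be installed, so it runs without touching CRAN.

```r
# A made-up package name that is assumed not to be installed.
fake_pkg <- "notARealPackage123"

# require() returns FALSE (with a warning) when the package is missing.
loaded <- suppressWarnings(
  require(fake_pkg, character.only = TRUE, quietly = TRUE)
)

# library() stops with an error instead; tryCatch() captures it here.
result <- tryCatch(
  {
    library(fake_pkg, character.only = TRUE)
    "loaded"
  },
  error = function(e) "library() raised an error"
)

loaded
result

# A common pattern built on this idea (left commented out; it needs internet):
# if (!requireNamespace("ggplot2", quietly = TRUE)) install.packages("ggplot2")
```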
Basic R aspects



1) Comments are indicated by #
# this is a comment and will not be run
2) <- (or =) assigns an object
x <- 10 # or x = 10
y <- 'test'
z <- c(1,2,3,4) # c() concatenates all the items – makes a vector
3) Operators [ +, - , *, / , ^]
x + x
(total <- x * x ) # parentheses print out the results
x^2
4) Attributes of an object [will talk more about this next week]
class(x)
str(x) # difference will become apparent when we work with lists
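A short sketch that extends points 2 to 4 to a vector: operators act on every element, and `class()` reports what kind of object you have. The values are arbitrary.

```r
z <- c(1, 2, 3, 4)

z * 2            # arithmetic is applied element by element
z + z            # vectors of equal length are added pairwise
length(z)        # number of elements

class(z)         # "numeric"
class("test")    # "character"
class(z > 2)     # "logical"

total <- sum(z^2)   # square each element, then add them up
total
```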


Basic R aspects
5) Logical
x <- 5; y <- 'test' # make two variables
x == 5 # returns TRUE
x < 1 # returns FALSE
x == 5 & y == 'test' # returns TRUE
x == 15 | y == 'test' # returns TRUE

6) Functions are followed by parentheses
mean(1:5) # get the mean
plot(x=1:10,y=1:10) # scatterplot
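The same logical operators work on whole vectors, which is how subsetting is usually done. A minimal sketch with arbitrary numbers:

```r
x <- c(3, 8, 5, 12)

big <- x > 4                 # one TRUE/FALSE per element
x[big]                       # keep only the elements where big is TRUE
n_big <- sum(big)            # TRUE counts as 1, so this counts matches

in_range <- x > 4 & x < 10   # both conditions must hold
positions <- which(in_range) # positions of the TRUE values
avg <- mean(x = x[in_range]) # function call with a named argument

n_big
positions
avg
```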

Common R mistakes
1) '\' vs '/' in file paths – Windows uses '\', R uses '/'

2) R is case sensitive: Mean() ≠ mean() # use tab autocomplete in Rstudio to help with this

3) '=' vs. '==' # latter is comparing two items logically
x = 5 # sets x to 5
x == 10 # compares x to 10
4) Continuing on the next line – the line must end with ',' or an operator to show it is not complete
x<- c(1,1,1,1,
0,0,0,0) # this works
x<- 1 +
2 # this works
x <- 1
+ 2 # does not work as x<-1 is complete
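Mistakes 1 and 4 are easy to verify directly. In this sketch the file path is a made-up example.

```r
# Mistake 4: the first statement is complete after 'x <- 1',
# so '+ 2' is evaluated on its own and x stays at 1.
x <- 1
+ 2

y <- 1 +
  2          # the trailing '+' tells R the statement continues; y is 3

# Mistake 1: inside an R string a backslash must be doubled,
# which is why forward slashes are easier.
win_path <- "C:\\data\\file.csv"   # hypothetical Windows-style path
r_path   <- "C:/data/file.csv"
converted <- gsub("\\", "/", win_path, fixed = TRUE)

x
y
converted == r_path
```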
Global attributes
options() # get a list of current options
options( stringsAsFactors=TRUE ) # will explain this one in next lecture

directories
setwd('R programs') # set working directory to R programs
getwd() # see what is your working directory
dir() # list of contents of directory
setwd('..') # go up a level
dir.create('query') # create a new folder called query in the current folder
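When experimenting with these commands it helps to remember where you started. The sketch below works in a scratch folder under `tempdir()` (the folder name `Rcourse_demo` is hypothetical) and returns to the original directory at the end.

```r
start_dir <- getwd()                              # remember the starting point

sandbox <- file.path(tempdir(), "Rcourse_demo")   # hypothetical scratch folder
dir.create(sandbox, showWarnings = FALSE)

setwd(sandbox)                                    # move into the scratch folder
dir.create("query", showWarnings = FALSE)         # create a subfolder
contents <- dir()                                 # list what is there now

setwd(start_dir)                                  # go back to where we began
contents
```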

Some useful functions/objects
ls(), rm()
ls() # get a list of objects
rm() # delete one or more object
rep(), seq()
rep(x=5, times=10) # repeat 5 ten times
seq(from=0, to=1, by=0.1) # make a sequence
LETTERS, letters
LETTERS # capitalized alphabet
letters # lowercase alphabet

Some useful functions/objects
month.abb, month.name
month.abb # abbreviated month names
month.name # list of month names
pi
pi # 3.141593
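These helpers are often combined to build labels and design columns. A small sketch, with made-up variable names:

```r
# Combine letters and numbers into labels.
plot_ids <- paste0(LETTERS[1:3], 1:3)

# Repeat each treatment label twice.
treatment <- rep(c("control", "treated"), each = 2)

# Pick every third month, starting from January.
quarter_starts <- month.abb[seq(from = 1, to = 12, by = 3)]

# Area of a circle with radius 2, using the built-in constant.
circle_area <- pi * 2^2

plot_ids
treatment
quarter_starts
circle_area
```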
Functions Learned
getwd()               c()
setwd()               ls()
library()             class(), str()
*, /, +, -, ^, &, |   dir(), list.files()
options()             dir.create()
source()              rep()
rm()                  seq()